<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/1_Basics/14_List_Comprehensions.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# List Comprehensions

## Notes

* A way to create a new list (with shorter syntax) based on the values of an existing list.

The same syntax is not limited to lists: 
- `set` comprehension
- `dictionary` comprehension
- generator expression (Python has no `tuple` comprehension; wrap a generator expression in `tuple()` to build a tuple)
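
A minimal sketch of these other forms, assuming a small made-up `skills` list (it is not part of the course dataset). Braces give a set or a dictionary, and passing the same syntax to `tuple()` builds a tuple.

```python
# Hypothetical sample data, used only for illustration
skills = ["python", "sql", "python", "excel"]

# Set comprehension: braces, duplicates are dropped
unique_skills = {skill.upper() for skill in skills}

# Dictionary comprehension: braces with key: value pairs
skill_lengths = {skill: len(skill) for skill in skills}

# Parentheses alone would create a generator expression, not a tuple,
# so the generator expression is passed to tuple() here
skills_tuple = tuple(skill.upper() for skill in skills)

print(unique_skills)
print(skill_lengths)
print(skills_tuple)
```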

## Importance

List comprehensions provide a concise way to create lists. They are useful for data manipulation and filtering, for example on lists extracted from a pandas DataFrame.

In [2]:
# Creating a list of numbers from 0 to 9
numbers = [x for x in range(10)]
numbers

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
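
The part before `for` can be any expression, not just the loop variable. As a small sketch (the names `squares` and `same_numbers` are only illustrative), the comprehension below transforms each value, and `list(range(10))` shows a shorter way to get a plain copy when no transformation is needed.

```python
numbers = [x for x in range(10)]

# Apply an expression to every item: here, square each number
squares = [x ** 2 for x in numbers]

# When nothing is transformed or filtered, list() does the same job
same_numbers = list(range(10))

print(squares)
print(same_numbers == numbers)
```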

## Example # 1

We're going to modify the example we used for our `for` loop. Instead of printing the whole statement "Position requires X years of experience", we are just going to print the experience required. This is a simplified version of our earlier code.

In [3]:
# Minimum experience required for job positions
position_experience_requirements = [1, 2, 3]

# Iterate over each experience requirement in the list of job positions
for x in position_experience_requirements:
    print(x)

1
2
3


- The code above defines `position_experience_requirements` as a list of integers representing the minimum years of experience required for various job positions.
- The `for` loop goes through each item (`x`) in `position_experience_requirements` and prints it.

Now let's use list comprehension to shorten this.

In [4]:
# Create a new list from the experience requirements
experience = [x for x in position_experience_requirements]

# The result is a copy of the list of experience requirements
experience

[1, 2, 3]

This is pretty basic. So let's make it a bit more useful. I'm going to add in a variable that stores the user's years of experience.

In [5]:
user_experience = 2
user_experience

2

Now, we are adding an if condition to our list comprehension. This condition checks if the user's experience (`user_experience`) is greater than or equal to each item (`x`) in the `position_experience_requirements` list.

```python
if user_experience >= x
```

It keeps only the requirements that are less than or equal to the user's experience, that is, the positions the user qualifies for.

In [7]:
# Create a list of job positions for which the user is qualified
qualified_positions = [x for x in position_experience_requirements if user_experience >= x]

qualified_positions

[1, 2]
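
The result above contains the requirements themselves, not position names. As a hedged sketch, assuming a hypothetical `position_titles` list that lines up with the requirements (these titles are invented for illustration), `zip` can pair the two lists so the comprehension returns titles instead.

```python
# Hypothetical titles, one per requirement (not from the course data)
position_titles = ["Junior Analyst", "Analyst", "Senior Analyst"]
position_experience_requirements = [1, 2, 3]
user_experience = 2

# Pair each title with its requirement, keep titles the user qualifies for
qualified_titles = [
    title
    for title, required in zip(position_titles, position_experience_requirements)
    if user_experience >= required
]

print(qualified_titles)
```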

## Example # 2

This first code block extracts the data we need for this exercise; we'll dive into this later in the course.

For now, just understand that I'm extracting a list of job titles (`job_list`) from our dataset.

In [10]:
%pip install datasets

In [11]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Create a list of job titles from the dataset
job_list = df['job_title'].tolist()

# Remove any non-string values from the list
job_list = [job for job in job_list if isinstance(job, str)]

Generating train split: 100%|██████████| 785741/785741 [00:07<00:00, 111831.43 examples/s]


Let's convert our previous `for` loop into a list comprehension!

In [12]:
# previous for loop
analyst_list = []

for job in job_list:
    if "Data Analyst" in job:
        analyst_list.append(job)

# show first 10 values
analyst_list[:10]

['Data Analyst',
 'Stagiaire Data Analyst (H/F) - Lyon (69006)',
 'Data Analyst',
 'Senior Officer, Data Analyst, GTO',
 'Stage - Data Analyst F/H',
 'Data Analyst als Marketing Manager Automation (W/D/M)',
 'Data Analyst',
 'Senior Data Analyst',
 'Data Analyst (Bangkok Based, relocation provided)',
 'Senior Data Analyst']

However, that was 4 lines of code! 

With a list comprehension we can do it in only 1.

In [13]:
analyst_list = [job for job in job_list if "Data Analyst" in job]

# show first 10 values
analyst_list[:10]

['Data Analyst',
 'Stagiaire Data Analyst (H/F) - Lyon (69006)',
 'Data Analyst',
 'Senior Officer, Data Analyst, GTO',
 'Stage - Data Analyst F/H',
 'Data Analyst als Marketing Manager Automation (W/D/M)',
 'Data Analyst',
 'Senior Data Analyst',
 'Data Analyst (Bangkok Based, relocation provided)',
 'Senior Data Analyst']
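
To check that the two versions really are equivalent, here is a small sketch on a made-up `sample_jobs` list (hypothetical titles, not taken from the dataset). It also shows that `in` is case-sensitive, and one possible way to match regardless of case with `str.lower()`.

```python
# Hypothetical sample titles, used only for illustration
sample_jobs = [
    "Data Analyst",
    "Senior data analyst",
    "Data Engineer",
    "Lead Data Analyst",
]

# Loop version
loop_result = []
for job in sample_jobs:
    if "Data Analyst" in job:
        loop_result.append(job)

# List comprehension version
comp_result = [job for job in sample_jobs if "Data Analyst" in job]

# `in` is case-sensitive, so "Senior data analyst" is skipped above;
# lowercasing both sides is one way to match any casing
any_case_result = [job for job in sample_jobs if "data analyst" in job.lower()]

print(loop_result == comp_result)
print(any_case_result)
```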

In [14]:
print("Job list is:     ", len(job_list), "jobs")
print("Analyst list is: ", len(analyst_list), "jobs")

Job list is:      785740 jobs
Analyst list is:  162708 jobs
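
If only the count is needed, the filtered list does not have to be built at all. The sketch below uses a made-up `sample_jobs` list (hypothetical titles) and sums a generator expression, then works out the share of analyst roles from the two counts printed above.

```python
# Hypothetical sample titles, used only for illustration
sample_jobs = ["Data Analyst", "Data Engineer", "Senior Data Analyst", "Data Scientist"]

# Each membership test is True (1) or False (0), so sum() counts the matches
analyst_count_sample = sum("Data Analyst" in job for job in sample_jobs)
print(analyst_count_sample)

# Counts copied from the output above
job_count = 785740
analyst_count = 162708

# Share of titles containing "Data Analyst"
analyst_share = analyst_count / job_count
print(f"{analyst_share:.1%}")
```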
