# List Comprehension Byte Size Session

### **Prerequisite**
1. Completion of Python for Everybody Specialization Week 1 and 2 through Coursera
2. Completion of Week 1 of Introduction to Data Science in Python Course through Coursera
3. Basic understanding of Pandas

### **Background**

You may have seen the term 'list comprehension' during your Coursera training and may have a basic understanding of it. If you haven't, I suggest that you enrol in the [Introduction to Data Science in Python](https://www.coursera.org/learn/python-data-analysis) course offered by the University of Michigan on Coursera. List comprehension is covered in Week 1 of the course.

As a recap, a list comprehension is a compact syntax for building a list that would otherwise need a 'for' loop. A traditional 'for' loop takes several lines, whereas a list comprehension typically takes one or two. The usual syntax of a list comprehension is as follows:

>**new_list = [expression for member in iterable (if condition)]** - Variation 1  
>**new_list = [expression (if condition) for member in iterable]** - Variation 2
>
>1. The 'expression' can be the member itself, a method call or any other expression that returns a value, e.g. a function call.  
>2. The 'member' is the object or value in the iterable.  
>3. The 'iterable' is any object that can return its elements one at a time, such as a list, a Pandas Series or an iterator.  
>4. The 'if condition' is optional conditional logic that can be added to filter out certain values in the iterable (variation 1) or to change the value of the result based on a condition (variation 2, where an 'else' is also required).
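
As a minimal sketch of the two variations, the snippet below uses a handful of hand-typed sepal lengths rather than the full dataset. The names `lengths`, `halved_if_long` and `halved_or_zero` are made up for illustration only.

```python
# Hand-typed sample values (assumption: stand-ins for a dataframe column)
lengths = [5.1, 4.9, 4.7, 4.6, 5.0]

# Variation 1: the trailing 'if' filters members, so the result can be shorter
halved_if_long = [x / 2 for x in lengths if x > 4.8]

# Variation 2: 'if ... else ...' picks a value for every member, so the length is unchanged
halved_or_zero = [x / 2 if x > 4.8 else 0 for x in lengths]

print(len(halved_if_long), len(halved_or_zero))  # 3 5
```

The length of each result is the quickest way to tell the two variations apart.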

We will start off with a recap of creating a basic list comprehension and then move on to various ways of extending it. We will use list comprehension in the context of Pandas. First off, let's import Pandas and load the iris plant dataset that ships with scikit-learn.

In [3]:
''' Importing pandas and datasets from sklearn '''

import pandas as pd
from sklearn import datasets

''' Assign the iris dataset to 'download' and extract data into a dataframe. To know more about this dataset you can go to
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris '''

download = datasets.load_iris()
df = pd.DataFrame(download.data)
df.columns = download.feature_names
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


Let's do a simple loop and list comprehension of an arithmetic operation for comparison. Let's use values from the sepal length (cm) column of the dataframe.

In [4]:
''' First initialise an empty list '''
empty = []

''' Then iterate over the column 'sepal length (cm)' and for each value, divide by 2 and append to empty list '''
for i in df['sepal length (cm)']:
    empty.append(i / 2)
empty[0:5]

[2.55, 2.45, 2.35, 2.3, 2.5]

In the above, you have to create an empty list first, set up a loop to iterate through each value of the sepal length (cm) column and append each calculated result back into the empty list. Now, let's create a list comprehension to match.

In [5]:
empty = [ (i/2) for i in df['sepal length (cm)']]
empty[0:5]

[2.55, 2.45, 2.35, 2.3, 2.5]

The list comprehension above achieves the same result as the earlier 'for' loop. That is a recap of the very basics of list comprehension.

## List Comprehension - Variation 1
### [expression for member in iterable (if condition)]
Now let's try adding some conditional logic. First, for comparison, let's filter the dataframe to show only rows where sepal length (cm) is above 5.0.

In [6]:
'''Using boolean masking, filter out the dataframe. Boolean masking is a method to evaluate the dataframe for True or False
of the conditions applied. The mask is then applied to the dataframe to show rows where the value is True '''

df[df['sepal length (cm)'] > 5.0].head()

|    | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0  | 5.1 | 3.5 | 1.4 | 0.2 |
| 5  | 5.4 | 3.9 | 1.7 | 0.4 |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 |
| 14 | 5.8 | 4.0 | 1.2 | 0.2 |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 |


Now let's build a list comprehension that performs an arithmetic operation while the conditional logic filters the iterable to values above a threshold (variation 1).

In [7]:
test_1 = [(i/2) for i in df['sepal length (cm)'] if i>5.0 ]
test_1[0:5]

[2.55, 2.7, 2.7, 2.9, 2.85]

With simple arithmetic, you can see that each value in test_1 is half the corresponding value in the 'sepal length (cm)' column of the filtered dataframe above. Let's compare the size of the filtered dataframe with that of the list created by the comprehension.

In [8]:
len(df[df['sepal length (cm)'] > 5.0]) == len(test_1)

True

A result of True means that both outputs have the same number of elements. This confirms how variation 1 of the list comprehension executes: the condition (i>5.0) is evaluated on each member of the iterable first, and only members that pass are handed to the expression (i/2). What would the above look like in a normal 'for' loop?

In [9]:
test_1 = []
for i in df['sepal length (cm)']:
    if i>5.0:
        test_1.append(i/2)
test_1[0:5]

[2.55, 2.7, 2.7, 2.9, 2.85]

You could achieve the same result with Pandas alone by applying a boolean mask as the filter and then dividing the whole column, which returns a Pandas Series rather than a list. However, the nested brackets make it harder to read.

In [10]:
((df[df['sepal length (cm)']>5.0]['sepal length (cm)'])/2).head()

0     2.55
5     2.70
10    2.70
14    2.90
15    2.85
Name: sepal length (cm), dtype: float64
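
If the nested brackets are the main obstacle, one possible way to make the Pandas version easier to read is to name the mask and select with `.loc`. This is only a sketch on a hand-typed stand-in dataframe; `sample`, `col`, `is_long` and `halved` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hand-typed stand-in for the iris dataframe (assumption: six rows are enough to illustrate)
sample = pd.DataFrame({"sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]})

col = "sepal length (cm)"
is_long = sample[col] > 5.0            # boolean mask, one True/False per row
halved = sample.loc[is_long, col] / 2  # select rows and column in one step, then divide

print(halved.index.tolist())  # [0, 5]
```

The result is still a Series, so it keeps the original row labels, which a plain list does not.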

## List Comprehension - Variation 2
### [expression (if condition) for member in iterable]
For variation 2, the condition does not filter the iterable. Instead it decides, for each member, which value the expression produces. Note that with this variation an 'else' clause is mandatory. Without it, Python raises a SyntaxError.
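
To see the error without breaking a notebook cell, the sketch below asks Python to compile each form from a string and reports whether it parses. The helper `parses` is a hypothetical name introduced here. `compile()` only checks syntax, so the undefined name `values` inside the strings is never evaluated.

```python
def parses(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

without_else = "[x / 2 if x > 5.0 for x in values]"
with_else = "[x / 2 if x > 5.0 else 0 for x in values]"

print(parses(without_else))  # False
print(parses(with_else))     # True
```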

In [30]:
test_2 = [(i/2) if i>5.0 else 0 for i in df['sepal length (cm)']]
test_2[0:5]

[2.55, 0, 0, 0, 0]

What would the above look like in a normal 'for' loop?

In [31]:
test_2 = []
for i in df['sepal length (cm)']:
    if i > 5.0:
        test_2.append(i/2)
    else: test_2.append(0)
test_2[0:5]

[2.55, 0, 0, 0, 0]

You can combine several conditions as shown below.

In [32]:
test_3 = [ (i/2) if (i>4.5) and (i<5.0) else 0 for i in df['sepal length (cm)']]
test_3[0:5]

[0, 2.45, 2.35, 2.3, 0]

Or alternatively

In [33]:
test_3 = [ (i/2) if (4.5<i<5.0) else 0 for i in df['sepal length (cm)']]
test_3[0:5]

[0, 2.45, 2.35, 2.3, 0]

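In summary, the difference between the two variations is that variation 1 filters the iterable before passing each member to the expression, so the result can be shorter than the input, whereas variation 2 keeps every member and only changes the value produced.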
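
The two variations can also be combined in a single comprehension. The sketch below, on hand-typed values with illustrative names, first filters with a trailing 'if' and then labels each surviving member with an 'if ... else ...' expression.

```python
# Hand-typed sample values (assumption: stand-ins for a dataframe column)
lengths = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]

# Trailing 'if' drops 4.6; the leading 'if ... else ...' labels what is left
labels = ["long" if x > 5.0 else "short" for x in lengths if x > 4.6]

print(labels)  # ['long', 'short', 'short', 'short', 'long']
```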

## Using defined function in list comprehension
You can also use your own defined function within the list comprehension

In [34]:
import numpy as np

''' Establish an area function with an argument. The argument will be passed into the body of the function for evaluation
before returning a result'''

def area(value):
    area_cm2 = np.pi*value**2
    return area_cm2

In [35]:
test_4 = [ area(i) if (4.5<i<5.0) else 0 for i in df['sepal length (cm)']]
test_4[0:5]

[0, 75.42963961269095, 69.39778171779854, 66.47610054996001, 0]

## Aggregating list comprehension
As with any list, you can aggregate the result of a list comprehension to quickly get an answer. Let's use mean() from the statistics library to compute an average.

In [17]:
from statistics import mean
mean([ (i/2) if (4.5<i<5.0) else 0 for i in df['sepal length (cm)']][0:5])

1.42

The above result includes the 0 values at either end of the slice. We can use nan instead of 0 (via math.nan) and call np.nanmean() to compute the mean while ignoring the nan values.

In [19]:
from math import nan
import numpy as np
np.nanmean([ (i/2) if (4.5<i<5.0) else nan for i in df['sepal length (cm)']][0:5])

2.3666666666666667

There are many aggregations you can use in this way, such as mean, max and standard deviation.
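
When only the aggregate is needed, another option is to filter with a trailing 'if' (variation 1) and pass the comprehension straight to the aggregation as a generator expression, i.e. the same syntax without the square brackets. This is a sketch on hand-typed values; `in_range_mean` and `in_range_max` are illustrative names.

```python
from statistics import mean

# Hand-typed sample values (assumption: the first five sepal lengths)
lengths = [5.1, 4.9, 4.7, 4.6, 5.0]

# Filtering drops unwanted values instead of replacing them with 0 or nan,
# so an ordinary mean() or max() gives the answer directly
in_range_mean = mean(x / 2 for x in lengths if 4.5 < x < 5.0)
in_range_max = max(x / 2 for x in lengths if 4.5 < x < 5.0)

print(round(in_range_mean, 4), in_range_max)  # 2.3667 2.45
```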

## List comprehension using multiple columns
We can extend list comprehension across several columns in a dataframe. Let's review the dataframe again.

In [20]:
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


Let's multiply sepal length and width together and round to 2 decimals using list comprehension. To use two columns, the built-in zip() function is used. zip() pairs up the elements of equal-length iterables, e.g. selected columns, and yields them one tuple at a time.
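
To make the behaviour of zip() concrete, here is a small sketch on two hand-typed lists (illustrative names, not the dataframe columns). It also shows why the inputs should have equal length: zip() stops at the shortest one without raising an error.

```python
lengths = [5.1, 4.9, 4.7]
widths = [3.5, 3.0, 3.2]

# zip() yields one tuple per position; list() collects them so we can look at them
pairs = list(zip(lengths, widths))
print(pairs)  # [(5.1, 3.5), (4.9, 3.0), (4.7, 3.2)]

# With unequal lengths, the extra values are silently dropped
truncated = list(zip(lengths, widths[:2]))
print(len(truncated))  # 2
```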

In [21]:
test_5 = [ round(i*j,2) for i,j in zip(df['sepal length (cm)'],df['sepal width (cm)'])]
test_5[0:5]

[17.85, 14.7, 15.04, 14.26, 18.0]

You can use tuples in a list comprehension to identify the index of each value.

In [22]:
test_6 = [ (i,j) for i,j in zip(df.index,df['sepal width (cm)'])]
test_6[0:5]

[(0, 3.5), (1, 3.0), (2, 3.2), (3, 3.1), (4, 3.6)]

And as before you can add conditions to the list comprehension as well. Let's identify row index and the results when either of the column values is more than a certain value.

In [23]:
test_7 = [ (m,round(i*j,2)) for m,i,j in zip(df.index,df['sepal length (cm)'],df['sepal width (cm)']) if (i>5.0) or (j>5.0)]
test_7[0:5]

[(0, 17.85), (5, 21.06), (10, 19.98), (14, 23.2), (15, 25.08)]

The list above omits the rows where the condition is not met. To keep those rows, with nan as a placeholder, you would use variation 2 as follows.

In [24]:
test_8 = [ (round(i*j,2)) if ((i>5.0) or (j>5.0)) else nan 
          for i,j in zip(df['sepal length (cm)'],df['sepal width (cm)'])]
test_8[0:11]

[17.85, nan, nan, nan, nan, 21.06, nan, nan, nan, nan, 19.98]

## Enriching dataframe using list comprehension
Since the previous result (test_8) has the same length as the dataframe, we can add it back into the dataframe as a new column. Let's create a copy of df first.

In [25]:
df2 = df.copy()

''' Create a new column and assign test_8 result to it'''
df2["i*j"] = test_8
df2.head(10)

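|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | i*j |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 17.85 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | NaN |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | NaN |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | NaN |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | NaN |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 21.06 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | NaN |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | NaN |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | NaN |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | NaN |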


What is shown above is a quick way of using list comprehension to create a new column within a dataframe. Although shown in separate cells, you can combine the steps into a single statement as follows:

    df2["i*j"] = [ (round(i*j,2)) if ((i>5.0) or (j>5.0)) else nan 
                    for i,j in zip(df['sepal length (cm)'],df['sepal width (cm)'])]

          
Let's try it out

In [36]:
''' Using the list comprehension from before and assigning it directly to a new dataframe column '''

df2["i*j(v2)"] = [ (round(i*j,2)) if ((i>5.0) or (j>5.0)) else nan 
              for i,j in zip(df['sepal length (cm)'],df['sepal width (cm)'])]
df2.head(10)

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | i*j | i*j(v2) |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 17.85 | 17.85 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | NaN | NaN |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | NaN | NaN |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | NaN | NaN |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | NaN | NaN |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 21.06 | 21.06 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | NaN | NaN |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | NaN | NaN |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | NaN | NaN |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | NaN | NaN |


How would you do the above without a list comprehension, using a 'for' loop and the Pandas .at accessor? Let's use the original df.

In [37]:
for m,i,j in zip(df.index,df['sepal length (cm)'],df['sepal width (cm)']):
    if (i>5.0) or (j>5.0):
        df.at[m,'i*j'] = round(i*j,2)
    else:
        df.at[m,'i*j'] = nan
df.head(10)

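|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | i*j |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 17.85 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | NaN |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | NaN |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | NaN |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | NaN |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 21.06 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | NaN |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | NaN |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | NaN |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | NaN |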
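
For comparison, a fully vectorised Pandas version is also possible. The sketch below is one way to do it, shown on a hand-typed stand-in dataframe called `sample` (an assumption, so that the snippet runs on its own): multiply the two columns as a whole, then keep the product only where the condition holds.

```python
import pandas as pd

# Hand-typed stand-in for the first rows of the iris dataframe (assumption)
sample = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
})

length = sample["sepal length (cm)"]
width = sample["sepal width (cm)"]

# Multiply whole columns at once and round to 2 decimals
product = (length * width).round(2)

# Series.where() keeps values where the condition is True and fills the rest with NaN
sample["i*j"] = product.where((length > 5.0) | (width > 5.0))

print(sample["i*j"].isna().tolist())  # [False, True, True, True, True, False]
```

Note that Pandas combines masks with `|` and `&` rather than the `or` and `and` keywords used inside a list comprehension.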


How about replacing values within an existing column using list comprehension? Let's change the values in petal length (cm) so that anything from 1.0 to 1.4 becomes 1.25 and anything from 1.5 up to (but not including) 2.0 becomes 1.75. Let's do it on df2.

In [39]:
''' Using list comprehension and assigning it to an existing column in the dataframe '''

df2['petal length (cm)'] = [ 1.25 if (1.0<=i<=1.4) else 1.75 if (1.5<=i<2.0) else i
                            for i in df['petal length (cm)']]
df2.head(10)

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | i*j | i*j(v2) |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.25 | 0.2 | 17.85 | 17.85 |
| 1 | 4.9 | 3.0 | 1.25 | 0.2 | NaN | NaN |
| 2 | 4.7 | 3.2 | 1.25 | 0.2 | NaN | NaN |
| 3 | 4.6 | 3.1 | 1.75 | 0.2 | NaN | NaN |
| 4 | 5.0 | 3.6 | 1.25 | 0.2 | NaN | NaN |
| 5 | 5.4 | 3.9 | 1.75 | 0.4 | 21.06 | 21.06 |
| 6 | 4.6 | 3.4 | 1.25 | 0.3 | NaN | NaN |
| 7 | 5.0 | 3.4 | 1.75 | 0.2 | NaN | NaN |
| 8 | 4.4 | 2.9 | 1.25 | 0.2 | NaN | NaN |
| 9 | 4.9 | 3.1 | 1.75 | 0.1 | NaN | NaN |


In the above, you will notice that the list comprehension chains a second conditional after the first 'else', which acts like an 'else if'. The syntax is as follows:

**new_list = [expression (if condition) else expression (if condition) else expression for member in iterable]** - Variation 2 with additional 'else if'.
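
Chained conditionals become hard to read quickly. As a sketch of one alternative, the same logic can be moved into a small helper function (here a hypothetical `bucket`) and the comprehension then only calls it. The values are hand-typed for illustration.

```python
def bucket(x):
    """Hypothetical helper that mirrors the chained conditional below."""
    if 1.0 <= x <= 1.4:
        return 1.25
    if 1.5 <= x < 2.0:
        return 1.75
    return x  # anything outside both ranges is passed through unchanged

# Hand-typed sample petal lengths (assumption)
petal_lengths = [1.4, 1.3, 1.5, 1.7, 4.7]

chained = [1.25 if 1.0 <= x <= 1.4 else 1.75 if 1.5 <= x < 2.0 else x for x in petal_lengths]
with_helper = [bucket(x) for x in petal_lengths]

print(chained == with_helper)  # True
print(chained)                 # [1.25, 1.25, 1.75, 1.75, 4.7]
```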

Now, what if we would like to do the same for two columns? First, we can define a function that returns a tuple. Then we call that function in the list comprehension and unpack the results into separate dataframe columns. zip() combines two columns of equal length, while zip(\*) unpacks the list of tuples back into one sequence per column.
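
The unpacking step is easier to follow on a tiny example. The sketch below uses a hand-typed list of tuples (illustrative values and names) to show that zip(\*) turns a list of pairs into one tuple per position.

```python
# A list of (length, width) tuples, as a function returning tuples would produce
pairs = [(5.1, 3.0), (5.0, 3.0), (4.0, 2.9)]

# The * operator passes each tuple as a separate argument to zip(),
# which then regroups the first items together and the second items together
lengths, widths = zip(*pairs)

print(lengths)  # (5.1, 5.0, 4.0)
print(widths)   # (3.0, 3.0, 2.9)
```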

In [29]:
def conver_num(a,b):
    if (4.0<a<=4.5): a=4.0
    if (4.6<=a<=5.0): a=5.0
    if (3.0<b<=3.5): b=3.0
    if (3.6<=b<=4.0): b=4.0
    return (a,b)

df2['sepal length (cm)2'], df2['sepal width (cm)2'] = zip(*[conver_num(i,j) 
                                                            for i,j in zip(df['sepal length (cm)'],df['sepal width (cm)'])])
df2.head(10)

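|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | i*j | i*j(v2) | sepal length (cm)2 | sepal width (cm)2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.25 | 0.2 | 17.85 | 17.85 | 5.1 | 3.0 |
| 1 | 4.9 | 3.0 | 1.25 | 0.2 | NaN | NaN | 5.0 | 3.0 |
| 2 | 4.7 | 3.2 | 1.25 | 0.2 | NaN | NaN | 5.0 | 3.0 |
| 3 | 4.6 | 3.1 | 1.75 | 0.2 | NaN | NaN | 5.0 | 3.0 |
| 4 | 5.0 | 3.6 | 1.25 | 0.2 | NaN | NaN | 5.0 | 4.0 |
| 5 | 5.4 | 3.9 | 1.75 | 0.4 | 21.06 | 21.06 | 5.4 | 4.0 |
| 6 | 4.6 | 3.4 | 1.25 | 0.3 | NaN | NaN | 5.0 | 3.0 |
| 7 | 5.0 | 3.4 | 1.75 | 0.2 | NaN | NaN | 5.0 | 3.0 |
| 8 | 4.4 | 2.9 | 1.25 | 0.2 | NaN | NaN | 4.0 | 2.9 |
| 9 | 4.9 | 3.1 | 1.75 | 0.1 | NaN | NaN | 5.0 | 3.0 |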


The above may be useful if, say, you have various currencies in one column and values in another. You can define a function that will do the appropriate currency conversion and pass that to a list comprehension to create a new column.
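
A minimal sketch of that currency idea follows. Everything in it is hypothetical: the `RATES_TO_USD` table uses made-up rates chosen for easy arithmetic, and `to_usd`, `amounts` and `currencies` are illustrative names standing in for dataframe columns.

```python
# Made-up conversion rates (assumption: NOT real exchange rates)
RATES_TO_USD = {"USD": 1.0, "EUR": 2.0, "GBP": 4.0}

def to_usd(amount, currency):
    """Convert an amount to USD using the hypothetical rate table above."""
    return amount * RATES_TO_USD[currency]

# Stand-ins for two dataframe columns of equal length
amounts = [10.0, 10.0, 5.0]
currencies = ["USD", "EUR", "GBP"]

# One converted value per row, ready to be assigned to a new column
amounts_usd = [to_usd(a, c) for a, c in zip(amounts, currencies)]

print(amounts_usd)  # [10.0, 20.0, 20.0]
```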

That is a recap of list comprehension and how you can use it in Pandas. There is an exercise below that you can attempt. If you have any further questions or comments on list comprehension or this training, please provide feedback to the Foundational Data Science CoI.

Note: Although the above covers list comprehension, the same concept applies to set comprehension and dictionary comprehension.

**new_set = {expression for member in iterable (if condition)}** - Variation 1  
**new_set = {expression (if condition) else expression (if condition) else expression for member in iterable}** - Variation 2

**new_dictionary = {label: item for member in iterable (if condition)}** - Variation 1  
**new_dictionary = {label: (item if condition else other_item) for member in iterable}** - Variation 2  

In a dictionary comprehension, the conditional expression applies to the label or to the item individually, not to the label:item pair as a whole.
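
As a brief sketch of both, the snippet below builds a set and two dictionaries from hand-typed values. The names are illustrative, and `enumerate()` is used here only to supply a position as the dictionary key.

```python
# Hand-typed sample values (assumption: stand-ins for a dataframe column)
lengths = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]

# Set comprehension: duplicates collapse, so only distinct whole centimetres remain
whole_cm = {int(x) for x in lengths}
print(whole_cm == {4, 5})  # True

# Dictionary comprehension, variation 1: keep only positions whose value passes the filter
long_only = {pos: x for pos, x in enumerate(lengths) if x > 5.0}
print(long_only)  # {0: 5.1, 5: 5.4}

# Dictionary comprehension, variation 2: the conditional expression sits on the value
labels = {pos: ("long" if x > 5.0 else "short") for pos, x in enumerate(lengths)}
print(labels[0], labels[1])  # long short
```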

#### Exercise:
Create a function that converts a value in centimetres to inches (1 inch = 2.54 cm). Then use the function in list comprehensions to add a new column to df for every column with (cm) in its header.


*This notebook has been created by CS Chun. It is used for training purposes and can be freely distributed, unaltered. If altered, please document your alteration before distributing.*