# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Session 03: Control Flow + Functions. Defensive programming. Pandas: I/O operations + `apply`, `filter`, `groupby`, `agg`

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com). 

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### 0. What do we want to do today?

Our goal in **Session03** is to learn

- about the basics of Control Flow in Python (i.e. how we tell the computer what to do with data in this language);
- a bit more about functions in Python;
- a bit about defensive programming in Python;
- several new things that can be done with `pd.DataFrame`, such as
   - Pandas I/O operations
   - `apply` a function to a `pd.DataFrame` column
   - use `filter`, `groupby`, and `agg` to filter out and produce data aggregates from `pd.DataFrame`

### 1. Where am I?

You are (or you should be...) in the `session03` directory, where we find 
- this notebook, 
- its HTML version, 
- another directory: `_data`,
- and a `.csv` file named `BostonHousingData.csv` in it.

**NOTE.** `.csv` files play a prominent role in Data Science. If you do not know what a `.csv` (Comma Separated Values) file is, read through this document: [What Is a CSV File, and How Do I Open It? - by Chris Hoffman, Editor-in-Chief, How-To Geek](https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/).

The Boston Housing Data Set is available from GitHub [here](https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv).

In [1]:
import os
work_dir = os.getcwd()
print(work_dir)
print(os.listdir(work_dir))
data_dir = os.path.join(work_dir, "_data")
print(os.listdir(data_dir))

/home/ikacikac/workspace/dss03python2023/session03
['dss03_py_session03.html', 'tasklist03_solutions.html', 'dss03_py_session03.ipynb', 'tasklist03_solutions.ipynb', 'tasklist03.ipynb', '.ipynb_checkpoints', '_data']
['world_indicators.csv', 'BostonHousingData.csv', 'MovieRatings.csv']


`BostonHousingData.csv` is a data set, in a CSV format, that we will next load into a Pandas DataFrame in order to study it. It is also one of the data sets that were frequently used to benchmark Machine Learning algorithms in the past. 

The data was originally published by Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air, *J. Environ. Economics & Management*, vol.5, 81-102, 1978.


### 2. Data: The Boston Housing Data Set

Import Pandas:

In [2]:
import pandas as pd

Load `BostonHousingData.csv` into a Pandas DataFrame. 

In [3]:
filename = 'BostonHousingData.csv'
data_set = pd.read_csv(os.path.join(data_dir, filename))
data_set.head(5)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


List the column data types of `data_set`. This is how Pandas parsed our data. Pay attention: the inferred data types are not always the most appropriate ones.

In [4]:
data_set.dtypes

crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
b          float64
lstat      float64
medv       float64
dtype: object
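Since the inferred types are not always the ones we want, a column can be converted after loading. The snippet below is only a minimal sketch on a tiny hand-made frame (the name `toy` and its values are assumptions made for illustration, not the real file): it treats `chas`, a 0/1 dummy variable, as a categorical column.

```python
import pandas as pd

# A tiny illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame({'chas': [0, 1, 0], 'medv': [24.0, 21.6, 34.7]})
print(toy.dtypes)

# 'chas' is a 0/1 dummy variable, so a categorical type may describe it better.
toy['chas'] = toy['chas'].astype('category')
print(toy.dtypes)
```

Whether such a conversion is worth doing depends on what we plan to do with the column later.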

Let's get more comprehensive information about our data set:

In [5]:
data_set.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


Let's now invest some effort to **understand** the data set at hand before we proceed with the `pd.DataFrame` class:

- **crim**: per capita crime rate by town.

- **zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

- **indus**: proportion of non-retail business acres per town.

- **chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- **nox**: nitrogen oxides concentration (parts per 10 million).

- **rm**: average number of rooms per dwelling.

- **age**: proportion of owner-occupied units built prior to 1940.

- **dis**: weighted mean of distances to five Boston employment centres.

- **rad**: index of accessibility to radial highways.

- **tax**: full-value property-tax rate per \$10,000.

- **ptratio**: pupil-teacher ratio by town.

- **b** (named **black** in some other versions of the data set): 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

- **lstat**: lower status of the population (percent).

- **medv**: median value of owner-occupied homes in \$1000s.

### 3. Control Flow A: Iterating in Python

We will first grab some values from `data_set` and turn them into a list.

In [6]:
my_data = data_set['medv'][0:20]
my_data = list(my_data)
print(my_data)

[24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2]


Task: check if the rounded values in `my_data` are even or not; print the result for each member of `my_data`.

In [7]:
for number in my_data:
    if round(number) % 2 == 0:
        print('Rounded ' + str(number) + ' is even.')
    else:
        print('Rounded ' + str(number) + ' is odd.')

Rounded 24.0 is even.
Rounded 21.6 is even.
Rounded 34.7 is odd.
Rounded 33.4 is odd.
Rounded 36.2 is even.
Rounded 28.7 is odd.
Rounded 22.9 is odd.
Rounded 27.1 is odd.
Rounded 16.5 is even.
Rounded 18.9 is odd.
Rounded 15.0 is odd.
Rounded 18.9 is odd.
Rounded 21.7 is even.
Rounded 20.4 is even.
Rounded 18.2 is even.
Rounded 19.9 is even.
Rounded 23.1 is odd.
Rounded 17.5 is even.
Rounded 20.2 is even.
Rounded 18.2 is even.
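Note that in the output above `16.5` was rounded to `16` and `17.5` to `18`, so both count as even. Python 3's built-in `round` rounds exact halves to the nearest even integer (sometimes called "banker's rounding"). A small sketch with illustrative values:

```python
# round() in Python 3 rounds exact halves to the nearest even integer.
halves = [14.5, 15.5, 16.5, 17.5]
rounded = [round(h) for h in halves]
print(rounded)  # [14, 16, 16, 18]
```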


**Iterables** in Python: an iterable is an object that can return its elements one by one.

**Sequences** in Python are iterables: **lists**, **strings**, and **tuples**.

In [8]:
my_data = (1, 2, '3', 4, 5, '7', 8, 9, '10')
for d in my_data:
    print(type(d))

<class 'int'>
<class 'int'>
<class 'str'>
<class 'int'>
<class 'int'>
<class 'str'>
<class 'int'>
<class 'int'>
<class 'str'>


In [9]:
my_data = "Belgrader"
for letter in my_data:
    print(letter)

B
e
l
g
r
a
d
e
r


In [10]:
my_data = (1, 2, 9, 3, 7, 6, 4, 10, 8, 5)
for t in my_data:
    print(str(t))

1
2
9
3
7
6
4
10
8
5
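What "one by one" means can be seen with the built-in functions `iter` and `next`, which is roughly what a `for` loop relies on behind the scenes. A minimal sketch (the variable names are chosen for illustration only):

```python
letters = ['a', 'b', 'c']

# iter() asks the iterable for an iterator; next() fetches one element at a time.
it = iter(letters)
first = next(it)
second = next(it)
third = next(it)
print(first, second, third)

# When nothing is left, next() raises StopIteration; a default value avoids that.
leftover = next(it, None)
print(leftover)
```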


**Dictionaries** are **iterables** but **not sequences**.

In [11]:
my_data = {'a':1,
           'b':2,
           'c':3,
           'd':4,
           'e':5,
           'f':6}
for item in my_data:
    print(item)

a
b
c
d
e
f


**Keys only?** Yes. What about:

In [12]:
for item in my_data.values():
    print(item)

1
2
3
4
5
6


Because:

In [13]:
my_data.values()

dict_values([1, 2, 3, 4, 5, 6])

And we can also do something like this:

In [14]:
for key in my_data:
    print('When ' + key + ' then ' + str(my_data[key]))

When a then 1
When b then 2
When c then 3
When d then 4
When e then 5
When f then 6


Or:

In [15]:
for kv in my_data.items():
    key, value = kv
    print('When ' + key + ' then ' + str(value))

When a then 1
When b then 2
When c then 3
When d then 4
When e then 5
When f then 6


Because:

In [16]:
my_data.items()

dict_items([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6)])

Also, unpacking each `(key, value)` pair directly in the `for` statement:

In [17]:
for key, value in my_data.items():
    print('When ' + key + ' then ' + str(value))

When a then 1
When b then 2
When c then 3
When d then 4
When e then 5
When f then 6


Let's apply a 20% discount to all prices in `data_set['medv']`!

In [18]:
print("Original prices:")
print(list(data_set['medv'][0:20]))

medv_discount = list(data_set['medv'])

for i in range(len(medv_discount)):
    medv_discount[i] = round(medv_discount[i] - .2*medv_discount[i], 2)

print("Discounted prices (20%):")
print(medv_discount[0:20])

Original prices:
[24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2]
Discounted prices (20%):
[19.2, 17.28, 27.76, 26.72, 28.96, 22.96, 18.32, 21.68, 13.2, 15.12, 12.0, 15.12, 17.36, 16.32, 14.56, 15.92, 18.48, 14.0, 16.16, 14.56]


What is this: `range(len(medv_discount))`?

In [19]:
range(len(medv_discount))

range(0, 506)

In [20]:
list(range(5, 15))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
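`range(start, stop)` counts from `start` up to, but not including, `stop`; an optional third argument sets the step. When both the position and the value are needed, the built-in `enumerate` is a common alternative to `range(len(...))`. A short sketch with a few illustrative prices (the names `prices` and `discounted` are assumptions for this example):

```python
print(list(range(0, 10, 2)))   # every second number from 0 up to (not including) 10

prices = [24.0, 21.6, 34.7]    # illustrative values
discounted = list(prices)      # work on a copy

# enumerate yields (index, value) pairs, so the value need not be looked up by index.
for i, price in enumerate(prices):
    discounted[i] = round(price - .2 * price, 2)
print(discounted)
```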

In [21]:
print(list(range(len(medv_discount))))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

### 3. Control Flow B: list comprehension

Now, **this is interesting:**

In [22]:
medv = list(data_set['medv'])

# - list comprehension:
medv_discount = [round(x - .2*x, 2) for x in medv]

print("Original prices:")
print(list(data_set['medv'][0:20]))

print("Discounted prices (20%):")
print(medv_discount[0:20])

Original prices:
[24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2]
Discounted prices (20%):
[19.2, 17.28, 27.76, 26.72, 28.96, 22.96, 18.32, 21.68, 13.2, 15.12, 12.0, 15.12, 17.36, 16.32, 14.56, 15.92, 18.48, 14.0, 16.16, 14.56]


This: `[round(x - .2*x, 2) for x in medv]` 

is called **a list comprehension**. It is a very powerful way of expressing iterations in Python!

In [23]:
my_list = ['Belgrade', 'New York', 'Moscow', 'London', 'New Delhi', 'Tokyo']
[x[0] for x in my_list]

['B', 'N', 'M', 'L', 'N', 'T']

In [24]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9 , 10]
[x**2 for x in my_list]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [25]:
l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
[el1 + ':' + el2 for el1 in l_1 for el2 in l_2]

['A:X', 'A:Y', 'A:Z', 'B:X', 'B:Y', 'B:Z', 'C:X', 'C:Y', 'C:Z']

Similarly, we could do:

In [26]:
result = list()
l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
for el1 in l_1:
    for el2 in l_2:
        result.append(el1 + ':' + el2)
print(result)

['A:X', 'A:Y', 'A:Z', 'B:X', 'B:Y', 'B:Z', 'C:X', 'C:Y', 'C:Z']


Create a list of tuples from list comprehension:

In [27]:
l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
[(el1, el2) for el1 in l_1 for el2 in l_2]

[('A', 'X'),
 ('A', 'Y'),
 ('A', 'Z'),
 ('B', 'X'),
 ('B', 'Y'),
 ('B', 'Z'),
 ('C', 'X'),
 ('C', 'Y'),
 ('C', 'Z')]

And now for a bit more complicated expression...

In [28]:
l_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[str(x) + ' is even.' if x % 2 == 0 else str(x) + ' is odd.' for x in l_1]

['1 is odd.',
 '2 is even.',
 '3 is odd.',
 '4 is even.',
 '5 is odd.',
 '6 is even.',
 '7 is odd.',
 '8 is even.',
 '9 is odd.',
 '10 is even.']
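In the expression above, `if ... else` comes before `for` and chooses a value for every element. An `if` placed after the `for` clause does something different: it filters, keeping only the elements that pass the test. A minimal sketch (the names `evens` and `labels` are illustrative):

```python
l_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Filtering: the trailing `if` drops the elements that fail the test.
evens = [x for x in l_1 if x % 2 == 0]
print(evens)

# Both forms can be combined: filter first, then choose a value.
labels = ['big' if x > 6 else 'small' for x in l_1 if x % 2 == 0]
print(labels)
```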

### 3. Control Flow C: `while`, `continue`, and `break`

What did we forget? Ah, the `while` loop!

In [29]:
l = list()
x = 1
while x < 100:
    if x % 2 == 0:
        l.append(x)
    x += 1
print(l)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]


`break` and `continue` in Python loops:

In [30]:
l_1 = [1, 2, 3, 4, 5, '6', 7, 8, 9]
for i in range(len(l_1)):
    if isinstance(l_1[i], str):
        break
    else:
        print(l_1[i])

1
2
3
4
5


In [31]:
isinstance(10, int)

True

In [32]:
isinstance(10, float)

False

In [33]:
l_1 = [1, 2, 3, 4, 5, '6', 7, 8, 9]
i = 0
while i < len(l_1):
    if isinstance(l_1[i], str):
        break
    else:
        print(l_1[i])
    i += 1

1
2
3
4
5


In [34]:
l_1 = [1, 2, 3, 4, 5, '6', 7, 8, 9]
for i in range(len(l_1)):
    if isinstance(l_1[i], str):
        break
    else:
        print(l_1[i])

1
2
3
4
5


`continue` skips an iteration:

In [35]:
l_1 = [1, 2, '3', 4, 5, '6', 7, 8, '9']
i = 0
while i < len(l_1):
    if isinstance(l_1[i], str):
        i += 1
        continue
    print(l_1[i])
    i += 1

1
2
4
5
7
8


In [36]:
l_1 = [1, 2, '3', 4, 5, '6', 7, 8, '9']
for i in range(len(l_1)):
    if isinstance(l_1[i], str):
        continue
    print(l_1[i])

1
2
4
5
7
8


### 3. Control Flow D: dictionary comprehension

In [37]:
squares = {num: num*num for num in range(1, 11)}
print(squares)

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100}


In [38]:
squares[1]

1

Apply a 20% discount to the first 20 elements in `data_set['medv']`, represented by a dictionary.

Step 1. Represent `data_set['medv'][0:20]` by a dictionary, introducing property names:

In [39]:
medv = data_set['medv'][0:20]
properties = ['p_' + str(i) for i in range(0, 20)]
medv_dict = dict(zip(properties, medv))
medv_dict

{'p_0': 24.0,
 'p_1': 21.6,
 'p_2': 34.7,
 'p_3': 33.4,
 'p_4': 36.2,
 'p_5': 28.7,
 'p_6': 22.9,
 'p_7': 27.1,
 'p_8': 16.5,
 'p_9': 18.9,
 'p_10': 15.0,
 'p_11': 18.9,
 'p_12': 21.7,
 'p_13': 20.4,
 'p_14': 18.2,
 'p_15': 19.9,
 'p_16': 23.1,
 'p_17': 17.5,
 'p_18': 20.2,
 'p_19': 18.2}

Step 2. Dictionary comprehension:

In [40]:
medv_dict_discount = {key: round(value - .2*value, 2) for (key, value) in medv_dict.items()}
medv_dict_discount

{'p_0': 19.2,
 'p_1': 17.28,
 'p_2': 27.76,
 'p_3': 26.72,
 'p_4': 28.96,
 'p_5': 22.96,
 'p_6': 18.32,
 'p_7': 21.68,
 'p_8': 13.2,
 'p_9': 15.12,
 'p_10': 12.0,
 'p_11': 15.12,
 'p_12': 17.36,
 'p_13': 16.32,
 'p_14': 14.56,
 'p_15': 15.92,
 'p_16': 18.48,
 'p_17': 14.0,
 'p_18': 16.16,
 'p_19': 14.56}

Change keys by a dictionary comprehension:

In [41]:
medv_dict_discount = {key + '_changed': value for (key, value) in medv_dict.items()}
medv_dict_discount

{'p_0_changed': 24.0,
 'p_1_changed': 21.6,
 'p_2_changed': 34.7,
 'p_3_changed': 33.4,
 'p_4_changed': 36.2,
 'p_5_changed': 28.7,
 'p_6_changed': 22.9,
 'p_7_changed': 27.1,
 'p_8_changed': 16.5,
 'p_9_changed': 18.9,
 'p_10_changed': 15.0,
 'p_11_changed': 18.9,
 'p_12_changed': 21.7,
 'p_13_changed': 20.4,
 'p_14_changed': 18.2,
 'p_15_changed': 19.9,
 'p_16_changed': 23.1,
 'p_17_changed': 17.5,
 'p_18_changed': 20.2,
 'p_19_changed': 18.2}
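A dictionary comprehension can also filter its input with a trailing `if`, just like a list comprehension. The sketch below uses a small hand-made dictionary (`sample_dict` is a hypothetical name; its values are copied from the first entries shown above):

```python
# Illustrative values, copied from the first entries of medv_dict.
sample_dict = {'p_0': 24.0, 'p_1': 21.6, 'p_2': 34.7, 'p_3': 33.4}

# Keep only the properties whose value exceeds 30.
expensive = {key: value for key, value in sample_dict.items() if value > 30}
print(expensive)
```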

### 3. Control Flow E: Decisions in Python

Oh... we are using `if` and `else` again without telling you about them. It is really simple:

In [42]:
x = 10
if x**2 == 100:
    print("x is definitely 10.")
else:
    print("It is definitely not 10.")

x is definitely 10.


In [43]:
def is_something_ten(x):
    if x**2 == 100:
        return True
    else:
        return False
l_1 = [1, 10, 20, 4, 10]
[is_something_ten(x) for x in l_1]

[False, True, False, False, True]

Nicer:

In [44]:
def is_something_ten(x):
    if x**2 == 100:
        return True
    else:
        return False
l_1 = [1, 10, 20, 4, 10]
[str(x) + ' is 10!' if is_something_ten(x) else str(x) + ' is not 10!' for x in l_1]

['1 is not 10!', '10 is 10!', '20 is not 10!', '4 is not 10!', '10 is 10!']

We can also branch our `if` statements with `elif`:

In [45]:
x = 10
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('This is strange!')

Ok it is less than 20, now... 


In [46]:
x = 25
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

It's between 20 and 30!


In [47]:
x = 31
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

Ok it is larger than 30.


`if` statements can be nested of course:

In [48]:
x = 19
if x < 20:
    print('Ok it is less than 20, now... ')
    if x < 18:
        print('And it is less than 18 too... ')
    else:
        print('But not less than 18... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

Ok it is less than 20, now... 
But not less than 18... 
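The same kind of branching can be wrapped in a function so that it is easy to try on several values. This is only a sketch: the function name `describe` and its messages are made up for illustration, and the middle branch uses Python's chained comparison `20 <= x <= 30`.

```python
def describe(x):
    # Branches are tested top to bottom; the first true condition wins.
    if x < 20:
        return 'less than 20'
    elif 20 <= x <= 30:      # chained comparison: 20 <= x and x <= 30
        return 'between 20 and 30'
    else:
        return 'larger than 30'

results = [describe(x) for x in (10, 25, 31)]
print(results)
```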


### 4. Pandas I/O operations

The data set that we will next use is a subset of

[2019/W10: World Development Indicators - Health and Equality](https://data.world/makeovermonday/2019w10), from [data.world](https://data.world/)

It is provided as a CSV file: `world_indicators.csv`.

In [49]:
os.listdir(data_dir)

['world_indicators.csv', 'BostonHousingData.csv', 'MovieRatings.csv']

Read CSV and inspect:

In [50]:
filename = os.path.join(data_dir, 'world_indicators.csv')
data_set = pd.read_csv(filename)
data_set.head(20)

Unnamed: 0,country,year,births_attended_by_health_staff,current_health_expenditure_of_gdp,hospital_beds_per_1000
0,Afghanistan,1960,,,0.170627
1,Afghanistan,1961,,,
2,Afghanistan,1962,,,
3,Afghanistan,1963,,,
4,Afghanistan,1964,,,
5,Afghanistan,1965,,,
6,Afghanistan,1966,,,
7,Afghanistan,1967,,,
8,Afghanistan,1968,,,
9,Afghanistan,1969,,,


`NaN` is, by a convention, the way to represent missing data in `pd.DataFrame`. 

In [51]:
data_set.isna().head(20)

Unnamed: 0,country,year,births_attended_by_health_staff,current_health_expenditure_of_gdp,hospital_beds_per_1000
0,False,False,True,True,False
1,False,False,True,True,True
2,False,False,True,True,True
3,False,False,True,True,True
4,False,False,True,True,True
5,False,False,True,True,True
6,False,False,True,True,True
7,False,False,True,True,True
8,False,False,True,True,True
9,False,False,True,True,True


In [52]:
data_set.isna().sum()

country                                 0
year                                    0
births_attended_by_health_staff      9366
current_health_expenditure_of_gdp    8816
hospital_beds_per_1000               8361
dtype: int64
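Because `isna()` returns `True`/`False` values, the same idea gives the share of missing values per column, and `dropna()` removes incomplete rows. A minimal sketch on a tiny frame with made-up numbers (`toy` and its columns are assumptions for illustration, not the real file):

```python
import numpy as np
import pandas as pd

# A tiny illustrative frame with missing values (made-up numbers).
toy = pd.DataFrame({'year': [1960, 1961, 1962, 1963],
                    'beds': [0.17, np.nan, np.nan, 0.20]})

# The mean of a True/False column is the share of True values.
share_missing = toy.isna().mean()
print(share_missing)

# dropna() returns a new frame without the rows that contain any NaN.
complete_rows = toy.dropna()
print(complete_rows)
```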

Change row index:

In [53]:
filename = os.path.join(data_dir, 'world_indicators.csv')
data_set = pd.read_csv(filename, index_col=0)
data_set.head(20)

country,year,births_attended_by_health_staff,current_health_expenditure_of_gdp,hospital_beds_per_1000
Afghanistan,1960,,,0.170627
Afghanistan,1961,,,
Afghanistan,1962,,,
Afghanistan,1963,,,
Afghanistan,1964,,,
Afghanistan,1965,,,
Afghanistan,1966,,,
Afghanistan,1967,,,
Afghanistan,1968,,,
Afghanistan,1969,,,


In [54]:
data_set.loc['Afghanistan', 'hospital_beds_per_1000']

country
Afghanistan    0.170627
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.199000
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.275600
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.309100
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.249800
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.300000
Afghanis

In [55]:
data_set.iloc[0:25, 3]

country
Afghanistan    0.170627
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.199000
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Afghanistan    0.275600
Afghanistan         NaN
Afghanistan         NaN
Afghanistan         NaN
Name: hospital_beds_per_1000, dtype: float64
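The two cells above select the same kind of data in two different ways: `loc` works with labels, `iloc` with integer positions. A small sketch on a hand-made frame (the labels `'A'` and `'B'` and all values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'year': [1960, 1961, 1960],
                    'beds': [0.17, 0.18, 3.10]},
                   index=['A', 'A', 'B'])

# loc selects by label: all rows labelled 'A', column 'beds'.
by_label = toy.loc['A', 'beds']

# iloc selects by integer position: first two rows, second column.
by_position = toy.iloc[0:2, 1]

print(by_label.equals(by_position))
```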

In [56]:
my_data = {'a':[1, 2, 3], 
           'b':[4, 5, 6],
           'c':[7, 8 , 9]}
my_data = pd.DataFrame(my_data)
display(my_data)

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


Write `my_data` as a `.csv` file to `data_dir`:

In [57]:
filename = os.path.join(data_dir, 'my_data.csv')
my_data.to_csv(filename)
# - check in your local directory
os.listdir(data_dir)

['world_indicators.csv',
 'BostonHousingData.csv',
 'MovieRatings.csv',
 'my_data.csv']

Read `my_data.csv` back to Pandas:

In [58]:
read_my_data = pd.read_csv(filename)
read_my_data

Unnamed: 0.1,Unnamed: 0,a,b,c
0,0,1,4,7
1,1,2,5,8
2,2,3,6,9


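Ooops: `to_csv` also wrote the row index to the file, and `read_csv` read it back as an ordinary, unnamed column. Telling `read_csv` to use the first column as the index fixes this: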

In [59]:
read_my_data = pd.read_csv(filename, index_col=0)
read_my_data

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [60]:
read_my_data == my_data

Unnamed: 0,a,b,c
0,True,True,True
1,True,True,True
2,True,True,True
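Another option is not to write the index at all, using the `index=False` argument of `to_csv`. The sketch below writes to an in-memory buffer (`io.StringIO`) instead of a file so that it leaves nothing behind on disk; the names `frame`, `buffer`, and `read_back` are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Write to an in-memory text buffer instead of a file, without the index.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# No index column was written, so no index_col is needed when reading.
read_back = pd.read_csv(buffer)
print(read_back.equals(frame))
```

Dropping the index this way is reasonable only when the index is the default 0, 1, 2, ... numbering; a meaningful index (such as country names) would be lost.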


Remove `my_data.csv` from `data_dir`:

In [61]:
print(filename)
os.remove(filename)
os.listdir(data_dir)

/home/ikacikac/workspace/dss03python2023/session03/_data/my_data.csv


['world_indicators.csv', 'BostonHousingData.csv', 'MovieRatings.csv']

### 5. Pandas transformations and aggregations: `apply`, `filter`, `groupby`, `agg`

Let's again create the sample data frame.

In [62]:
my_data = {'a':[1, 2, 3, 6, 2], 
           'b':[4, 5, 6, 2, 3],
           'c':[7, 8 , 9, 1, 1],
           'd':[3, 4, 1, 4, 2]}
my_data = pd.DataFrame(my_data)
display(my_data)

Unnamed: 0,a,b,c,d
0,1,4,7,3
1,2,5,8,4
2,3,6,9,1
3,6,2,1,4
4,2,3,1,2


Remember how we defined a function that tests whether a number is equal to 10? Let's define another function that tests whether a number is even instead!

In [63]:
def is_even(x):
    if x % 2 == 0:
        return True
    else:
        return False

Also remember how we iterated through a list and called a function on each of its items? Let's define a function for that:

In [64]:
def is_even_list(lst):
    return [is_even(x) for x in lst]

A Pandas DataFrame provides the `apply` method, which calls a function along one of the axes. Remember: with `axis=0` (the default) the function receives each column in turn, and with `axis=1` it receives each row. Here is how we do it:

In [65]:
my_data.apply(is_even_list , axis=0)

Unnamed: 0,a,b,c,d
0,False,True,False,False
1,True,False,True,True
2,False,True,False,False
3,True,True,False,True
4,True,False,False,True


Here it does not make much difference whether we apply the function to rows or to columns, since it treats each cell's value separately, disregarding the other values in the row or column.

But when does it make a difference? Let's define a function that uses all of the values in a row or column in its calculation:

In [66]:
def distance_from_the_sum(lst):
    s = sum(lst)
    return [x-s for x in lst]

In [67]:
my_data.apply(distance_from_the_sum , axis=0)

Unnamed: 0,a,b,c,d
0,-13,-16,-19,-11
1,-12,-15,-18,-10
2,-11,-14,-17,-13
3,-8,-18,-25,-10
4,-12,-17,-25,-12
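To see the difference between the axes, the same calculation can be run along rows. The sketch below is a variant, not the function defined above: `distance_from_the_sum_series` is a hypothetical name, and it returns a `pd.Series` instead of a list so that `apply` gives back a data frame for both axes.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3, 6, 2],
                      'b': [4, 5, 6, 2, 3],
                      'c': [7, 8, 9, 1, 1],
                      'd': [3, 4, 1, 4, 2]})

def distance_from_the_sum_series(values):
    # `values` is a pd.Series: one column (axis=0) or one row (axis=1).
    return values - values.sum()

by_column = frame.apply(distance_from_the_sum_series, axis=0)
by_row = frame.apply(distance_from_the_sum_series, axis=1)
print(by_column)
print(by_row)
```

With `axis=0` each value is compared with the sum of its column; with `axis=1` it is compared with the sum of its row.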


Think about the practical use of this method. Where and when does it make sense to use it?

Let's move on to the `filter` method. It is used for selecting columns or rows based on their labels.

In [68]:
my_data.filter(['a', 'b'])

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6
3,6,2
4,2,3


And what happens when we set `axis=1`?

In [69]:
my_data.filter(['a', 'b'], axis=1)

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6
3,6,2
4,2,3


And what happens with `axis=0`?

In [70]:
my_data.filter(['a', 'b'], axis=0)

Unnamed: 0,a,b,c,d


When we filter by rows we are essentially filtering by index values. Look at the data frame now. Index values are numbers. Let's do it correctly now:

In [71]:
my_data.filter([0, 2], axis=0)

Unnamed: 0,a,b,c,d
0,1,4,7,3
2,3,6,9,1
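Besides a list of labels, `filter` also accepts the `like` and `regex` arguments for matching labels by pattern. Keep in mind that `filter` never looks at the values; selecting rows by their values is done with boolean indexing. A short sketch on a hand-made frame (the column names and numbers are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'income_2021': [100, 150],
                      'income_2022': [110, 160],
                      'age': [20, 34]})

# `like` keeps the labels that contain the given substring.
income_cols = frame.filter(like='income')

# `regex` keeps the labels that match a regular expression.
cols_2022 = frame.filter(regex='2022$')

# Selecting rows by their values is a different operation: boolean indexing.
older = frame[frame['age'] > 30]
print(list(income_cols.columns), list(cols_2022.columns), len(older))
```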


For the sake of our next example, let's redefine the data set.

In [72]:
my_data = {'age':[20, 34, 30, 25, 20, 34], 
           'town':['Chicago', 'LA', 'SF', 'Chicago', 'SF', 'WA'],
           'name':['Jake', 'Fin', 'Maria', 'Timmy', 'Eric', 'Sarah'],
           'income_in_k':[100, 150, 300, 50, 60, 300]}
my_data = pd.DataFrame(my_data)
display(my_data)

Unnamed: 0,age,town,name,income_in_k
0,20,Chicago,Jake,100
1,34,LA,Fin,150
2,30,SF,Maria,300
3,25,Chicago,Timmy,50
4,20,SF,Eric,60
5,34,WA,Sarah,300


This is all good, but we want to have the sum of incomes. There is always more than one way to do it. Something like this:

In [73]:
my_data['income_in_k'].sum()

960

But what about the mean income?

In [74]:
my_data['income_in_k'].mean()

160.0

You must be wondering if there is a method to get both results at the same time? Well, pay close attention to the next `agg` method:

In [75]:
my_data['income_in_k'].agg(['mean', 'sum'])

mean    160.0
sum     960.0
Name: income_in_k, dtype: float64

We now have everything in one place.

Let's approach our data set with a different example. Say we need the sum of incomes per age. How can we do it?

Remember, we used `loc` for filtering rows based on some condition. Something like this:

In [76]:
my_data.loc[my_data['age'] == 20, 'income_in_k'].sum()

160

But this is just for one value of age. Should we go and do it for all ages? NO! There is a much better and faster way: the `groupby` data frame method.

In [77]:
my_data.groupby('age')['income_in_k'].sum()

age
20    160
25     50
30    300
34    450
Name: income_in_k, dtype: int64
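To make clear what `groupby` saves us from, the sketch below computes the same sums in both ways on a reduced copy of the data (`people` is a hypothetical name; only the `age` and `income_in_k` columns are kept): once with one `loc` filter per distinct age, and once with `groupby`.

```python
import pandas as pd

people = pd.DataFrame({'age': [20, 34, 30, 25, 20, 34],
                       'income_in_k': [100, 150, 300, 50, 60, 300]})

# The manual way: one filtered sum per distinct age.
manual = {age: people.loc[people['age'] == age, 'income_in_k'].sum()
          for age in sorted(people['age'].unique())}

# The groupby way: split by age, sum each group, combine the results.
grouped = people.groupby('age')['income_in_k'].sum()

print(manual)
print(grouped.to_dict())
```

Both give the same numbers; `groupby` simply does the splitting and combining for us.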

Nice! But what if we want to have mean value of income?

In [78]:
my_data.groupby('age')['income_in_k'].mean()

age
20     80.0
25     50.0
30    300.0
34    225.0
Name: income_in_k, dtype: float64

Nice! But now we have two separate results that we want to have in one place. Well, the `agg` method (short for aggregation) works on grouped data too.

In [79]:
my_data.groupby('age')['income_in_k'].agg(['mean','sum'])

age,mean,sum
20,80.0,160
25,50.0,50
30,300.0,300
34,225.0,450


We can have all sorts of aggregations, some of which are built in:

In [80]:
my_data.groupby('age')['income_in_k'].agg(['mean','sum', 'min', 'max'])

age,mean,sum,min,max
20,80.0,160,60,100
25,50.0,50,50,50
30,300.0,300,300,300
34,225.0,450,150,300
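The output columns can also be given explicit names, and an aggregation can be a function of our own. The sketch below uses what the pandas documentation calls named aggregation, where each keyword argument is a `(column, function)` pair; the output names (`mean_income`, `total_income`, `income_range`) and the frame name `people` are our own choices for illustration.

```python
import pandas as pd

people = pd.DataFrame({'age': [20, 34, 30, 25, 20, 34],
                       'income_in_k': [100, 150, 300, 50, 60, 300]})

summary = people.groupby('age').agg(
    mean_income=('income_in_k', 'mean'),
    total_income=('income_in_k', 'sum'),
    # A custom function receives the values of one group as a pd.Series.
    income_range=('income_in_k', lambda s: s.max() - s.min()),
)
print(summary)
```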


There are more ways to aggregate data when grouping, but we will leave it at this for now. We will tackle this subject more in the next sessions.

### Readings and Videos
- [Bill Lubanovic, Introducing Python, 1st Edition](https://www.oreilly.com/library/view/introducing-python-2nd/9781492051374/), Chapters 1 - 3.
- [freeCodeCamp.org Intermediate Python Programming Course](https://www.youtube.com/watch?v=HGOBQPFzWKo), Sections 1 - 4 (Lists, Tuples, Dictionaries, Sets)

### A highly recommended To Do
- Watch [Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)](https://www.youtube.com/watch?v=vmEHCJofslg)
- Watch [Python NumPy Tutorial for Beginners](https://www.youtube.com/watch?v=QUT1VHiLmmI)
- Read chapter [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) from [Python Data Science Handbook, Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)

<hr>

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>