# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Session 03: Control Flow + Functions. Defensive programming. Pandas: I/O operations + `apply`, `filter`, `groupby`, `agg`

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### 0. What do we want to do today?

Our goal in **Session03** is to learn

- about the basics of Control Flow in Python (e.g. how do we tell the computer what to do with data in this language);
- a bit more about functions in Python;
- a bit about defensive programming in Python;
- several new things that can be done with `pd.DataFrame`, such as
   - Pandas I/O operations
   - `apply` a function to a `pd.DataFrame` column
   - use `filter`, `groupby`, and `agg` to filter out and produce data aggregates from `pd.DataFrame`

#### 1. Where am I?

You are (or you should be...) in the `session03` directory, where we find
- this notebook, 
- its HTML version, 
- another directory: `_data`
- and a `.csv` file named `BostonHousingData.csv` in it.

**NOTE.** `.csv` files play a prominent role in Data Science. If you do not know what a `.csv` (Comma Separated Values) file is, read through this document: [What Is a CSV File, and How Do I Open It? - by Chris Hoffman, Editor-in-Chief, How-To Geek](https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/).

The Boston Housing Data Set is available from GitHub [here](https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv).

In [1]:
import os
work_dir = os.getcwd()
print(work_dir)
print(os.listdir(work_dir))
data_dir = os.path.join(work_dir, "_data")
print(os.listdir(data_dir))

/Users/goransm/Work/___DataKolektiv/_EDU/DSS_Vol00_PythonDS_2023/dss03python2023/session03
['dss03_py_session03.ipynb', '.DS_Store', 'dss03_py_session03.html', '.ipynb_checkpoints', '_data']
['BostonHousingData.csv', '.DS_Store', 'MovieRatings.csv']


`BostonHousingData.csv` is a data set, in a CSV format, that we will next load into a Pandas DataFrame in order to study it. It is also one of the data sets that were frequently used to benchmark Machine Learning algorithms in the past. 

The data was originally published by Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air, *J. Environ. Economics & Management*, vol.5, 81-102, 1978.


### 2. Data: The Boston Housing Data Set

Import Pandas:

In [2]:
import pandas as pd

Load `BostonHousingData.csv` into a Pandas DataFrame. 

In [3]:
filename = 'BostonHousingData.csv'
data_set = pd.read_csv(os.path.join(data_dir, filename))
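
Since defensive programming is one of the topics of this session, here is a minimal sketch of checking that the file is where we expect it before reading it. It assumes the same `_data` directory layout as above; `file_path` and `file_found` are names introduced here for illustration only.

```python
import os

# Assumed layout: a `_data` directory inside the current working directory
data_dir = os.path.join(os.getcwd(), "_data")
file_path = os.path.join(data_dir, "BostonHousingData.csv")

# Check that the file exists before trying to read it
file_found = os.path.isfile(file_path)
if file_found:
    print("Found: " + file_path)
else:
    print("Missing: " + file_path)
```

If the file is missing, `pd.read_csv` would raise an error, so a check like this lets us print a clearer message first.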

Describe `data_set`.

In [4]:
data_set.dtypes

crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
b          float64
lstat      float64
medv       float64
dtype: object

Let's now invest some effort to **understand** the data set at hand before we proceed with the `pd.DataFrame` class:

- **crim**: per capita crime rate by town.

- **zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

- **indus**: proportion of non-retail business acres per town.

- **chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

- **nox**: nitrogen oxides concentration (parts per 10 million).

- **rm**: average number of rooms per dwelling.

- **age**: proportion of owner-occupied units built prior to 1940.

- **dis**: weighted mean of distances to five Boston employment centres.

- **rad**: index of accessibility to radial highways.

- **tax**: full-value property-tax rate per \$10,000.

- **ptratio**: pupil-teacher ratio by town.

- **b** (named `black` in the original documentation): 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

- **lstat**: lower status of the population (percent).

- **medv**: median value of owner-occupied homes in \$1000s.

### 3. Control Flow A: Iterating in Python

We will first grab some values from `data_set` and turn them into a list.

In [5]:
my_data = data_set['medv'][0:20]
my_data = list(my_data)
print(my_data)

[24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2]


Task: check if the rounded values in `my_data` are even or not; print the result for each member of `my_data`.

In [6]:
for number in my_data:
    if round(number) % 2 == 0:
        print('Rounded ' + str(number) + ' is even.')
    else:
        print('Rounded ' + str(number) + ' is odd.')

Rounded 24.0 is even.
Rounded 21.6 is even.
Rounded 34.7 is odd.
Rounded 33.4 is odd.
Rounded 36.2 is even.
Rounded 28.7 is odd.
Rounded 22.9 is odd.
Rounded 27.1 is odd.
Rounded 16.5 is even.
Rounded 18.9 is odd.
Rounded 15.0 is odd.
Rounded 18.9 is odd.
Rounded 21.7 is even.
Rounded 20.4 is even.
Rounded 18.2 is even.
Rounded 19.9 is even.
Rounded 23.1 is odd.
Rounded 17.5 is even.
Rounded 20.2 is even.
Rounded 18.2 is even.
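
Two lines in the output above may look surprising: 16.5 and 17.5 are both reported as even. This is because Python 3's built-in `round` sends exact `.5` ties to the nearest even integer. A small sketch (the list `ties` is made up for illustration):

```python
# Python 3 rounds exact .5 ties to the nearest even integer
ties = [0.5, 1.5, 2.5, 16.5, 17.5]
rounded_ties = [round(t) for t in ties]
print(rounded_ties)   # [0, 2, 2, 16, 18]

# Values that are not exact ties round to the nearest integer as usual
print(round(21.6))    # 22
print(round(34.7))    # 35
```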


**Iterables** in Python: an iterable is an object that can return its elements one by one.

**Sequences** in Python are iterables: **lists**, **strings**, and **tuples**.
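
To see what "return its elements one by one" means, here is a minimal sketch using the built-in `iter` and `next` functions, which is roughly what a `for` loop does behind the scenes. The list `cities` is a hypothetical example.

```python
cities = ['Belgrade', 'London', 'Tokyo']  # hypothetical example list

it = iter(cities)     # ask the iterable for an iterator
first = next(it)      # 'Belgrade'
second = next(it)     # 'London'
third = next(it)      # 'Tokyo'

# One more bare next(it) would raise StopIteration;
# passing a default value returns that default instead
exhausted = next(it, None)
print(first, second, third, exhausted)
```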

In [7]:
my_data = (1, 2, '3', 4, 5, '7', 8, 9, '10')
for d in my_data:
    print(type(d))

<class 'int'>
<class 'int'>
<class 'str'>
<class 'int'>
<class 'int'>
<class 'str'>
<class 'int'>
<class 'int'>
<class 'str'>


In [8]:
my_data = "Belgrader"
for letter in my_data:
    print(letter)

B
e
l
g
r
a
d
e
r


**Dictionaries** are **iterables** but **not sequences**.

In [9]:
my_data = {'a':1,
           'b':2,
           'c':3,
           'd':4,
           'e':5,
           'f':6}
for item in my_data:
    print(item)

a
b
c
d
e
f


Iterating over a dictionary directly yields its **keys only**. What about the values?

In [10]:
for item in my_data.values():
    print(item)

1
2
3
4
5
6


And we can also do something like this:

In [11]:
for key in my_data:
    print('When ' + key + ' then ' + str(my_data[key]))

When a then 1
When b then 2
When c then 3
When d then 4
When e then 5
When f then 6


Or:

In [12]:
for kv in my_data.items():
    key, value = kv
    print('When ' + key + ' then ' + str(value))

When a then 1
When b then 2
When c then 3
When d then 4
When e then 5
When f then 6


But also, unpacking each `(key, value)` pair from `my_data.items()` directly in the `for` statement:

In [13]:
for key, value in my_data.items():
    print('When ' + key + ' then ' + str(value))

When a then 1
When b then 2
When c then 3
When d then 4
When e then 5
When f then 6
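
To make the "iterable but not a sequence" distinction concrete, here is a small sketch: a list supports access by position, while a dictionary treats whatever is in the square brackets as a key. The names `letters` and `codes` are made up for this example.

```python
letters = ['a', 'b', 'c']           # a list is a sequence
codes = {'a': 1, 'b': 2, 'c': 3}    # a dict is an iterable, but not a sequence

print(letters[0])                   # positional indexing works: 'a'

try:
    codes[0]                        # 0 is looked up as a key, not as a position
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)                # True

# Both can still be iterated over
keys_seen = [k for k in codes]
print(keys_seen)                    # ['a', 'b', 'c']
```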


Let's apply a 20% discount to all prices in `data_set['medv']`!

In [14]:
print("Original prices:")
print(list(data_set['medv'][0:20]))
medv_discount = list(data_set['medv'])
# - i is an index (a position in the list), not a price:
for i in range(len(medv_discount)):
    medv_discount[i] = round(medv_discount[i] - .2*medv_discount[i], 2)
print("Discounted prices (20%):")
print(medv_discount[0:20])

Original prices:
[24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2]
Discounted prices (20%):
[19.2, 17.28, 27.76, 26.72, 28.96, 22.96, 18.32, 21.68, 13.2, 15.12, 12.0, 15.12, 17.36, 16.32, 14.56, 15.92, 18.48, 14.0, 16.16, 14.56]


What is this: `range(len(medv_discount))`?

In [15]:
range(len(medv_discount))

range(0, 506)

In [16]:
list(range(5, 15))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
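
A short sketch of two related points: `range` excludes its stop value and accepts an optional step, and `enumerate` gives the index and the value together, which is one alternative to `range(len(...))`. The list `prices` holds just the first three `medv` values shown above.

```python
prices = [24.0, 21.6, 34.7]  # first three `medv` values shown above

# range(start, stop, step): `stop` is excluded, `step` is optional
print(list(range(0, 10, 2)))      # [0, 2, 4, 6, 8]
print(list(range(len(prices))))   # [0, 1, 2]

# enumerate yields (index, value) pairs, so no manual indexing is needed
discounted = list(prices)
for i, price in enumerate(prices):
    discounted[i] = round(price - .2 * price, 2)
print(discounted)                 # [19.2, 17.28, 27.76]
```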

In [17]:
print(list(range(len(medv_discount))))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

Now, **this is interesting:**

In [18]:
medv = list(data_set['medv'])

# - list comprehension:
medv_discount = [round(x-.2*x,2) for x in medv]

print("Original prices:")
print(list(data_set['medv'][0:20]))
print("Discounted prices (20%):")
print(medv_discount[0:20])

Original prices:
[24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2]
Discounted prices (20%):
[19.2, 17.28, 27.76, 26.72, 28.96, 22.96, 18.32, 21.68, 13.2, 15.12, 12.0, 15.12, 17.36, 16.32, 14.56, 15.92, 18.48, 14.0, 16.16, 14.56]


This ^^ is called **a list comprehension**. It is a very powerful way of expressing iterations in Python!

In [19]:
my_list = ['Belgrade', 'New York', 'Moscow', 'London', 'New Delhi', 'Tokyo']
[x[0] for x in my_list]

['B', 'N', 'M', 'L', 'N', 'T']

In [20]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9 , 10]
[x**2 for x in my_list]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [21]:
l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
[el1 + ':' + el2 for el1 in l_1 for el2 in l_2]

['A:X', 'A:Y', 'A:Z', 'B:X', 'B:Y', 'B:Z', 'C:X', 'C:Y', 'C:Z']

In [22]:
l_1 = ['A', 'B', 'C']
l_2 = ['X', 'Y', 'Z']
[(el1, el2) for el1 in l_1 for el2 in l_2]

[('A', 'X'),
 ('A', 'Y'),
 ('A', 'Z'),
 ('B', 'X'),
 ('B', 'Y'),
 ('B', 'Z'),
 ('C', 'X'),
 ('C', 'Y'),
 ('C', 'Z')]

And now for a bit more complicated expression...

In [23]:
l_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[str(x) + ' is even.' if x % 2 == 0 else str(x) + ' is odd.' for x in l_1]

['1 is odd.',
 '2 is even.',
 '3 is odd.',
 '4 is even.',
 '5 is odd.',
 '6 is even.',
 '7 is odd.',
 '8 is even.',
 '9 is odd.',
 '10 is even.']
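
Note that an `if` can appear in two different places in a list comprehension. In the expression above, `if ... else` comes before `for` and chooses what to output for every element. A trailing `if` after the `for` part instead filters elements out. A minimal sketch (the labels `'small'` and `'large'` are arbitrary choices for illustration):

```python
l_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Trailing `if`: keep only the elements that pass the test
evens = [x for x in l_1 if x % 2 == 0]
print(evens)    # [2, 4, 6, 8, 10]

# Both forms combined: filter first, then choose the output per element
labels = ['small' if x < 5 else 'large' for x in l_1 if x % 2 == 0]
print(labels)   # ['small', 'small', 'large', 'large', 'large']
```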

What did I forget? Ah, the `while` loop!

In [35]:
l = list()
x = 1
while x < 100:
    if x % 2 == 0:
        l.append(x)
    x += 1
print(l)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]


`break` statements in loops:
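
A minimal sketch of `break`, assuming we want the first price above 30 and do not care about the rest. The list `prices` holds the first five `medv` values from above; `first_expensive` and `checked` are names introduced here for illustration.

```python
prices = [24.0, 21.6, 34.7, 33.4, 36.2]  # first five `medv` values from above

first_expensive = None
checked = 0
for price in prices:
    checked += 1
    if price > 30:
        first_expensive = price
        break    # leave the loop: the remaining prices are never examined
print(first_expensive, checked)   # 34.7 3
```

The loop stops at the third element, so `33.4` and `36.2` are never visited.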

### Control Flow C: Decisions in Python

Oh... we are using `if` and `else` again without telling you about them. It is really simple:

In [24]:
x = 10
if x**2 == 100:
    print("x is definitely 10.")
else:
    print("It is definitely not 10.")

x is definitely 10.


In [25]:
def is_something_ten(x):
    if x**2 == 100:
        return(True)
    else:
        return(False)
l_1 = [1, 10, 20, 4, 10]
[is_something_ten(x) for x in l_1]

[False, True, False, False, True]

Nicer:

In [26]:
def is_something_ten(x):
    if x**2 == 100:
        return(True)
    else:
        return(False)
l_1 = [1, 10, 20, 4, 10]
[str(x) + ' is 10!' if is_something_ten(x) else str(x) + ' is not 10!' for x in l_1]

['1 is not 10!', '10 is 10!', '20 is not 10!', '4 is not 10!', '10 is 10!']
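
One caveat about `is_something_ten`: the test `x**2 == 100` is also true for `-10`. The sketch below shows this, together with a hypothetical alternative `is_ten` (not part of the original notebook) that compares directly and returns the result of the comparison, which is already `True` or `False`.

```python
def is_something_ten(x):
    if x**2 == 100:
        return(True)
    else:
        return(False)

print(is_something_ten(-10))   # True, although -10 is not 10

# Hypothetical alternative: compare directly and return the boolean result
def is_ten(x):
    return x == 10

print([is_ten(x) for x in [1, 10, -10]])   # [False, True, False]
```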

We can also branch our `if` statements with `elif`:

In [27]:
x = 10
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('This is strange!')

Ok it is less than 20, now... 


In [29]:
x = 25
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

It's between 20 and 30!


In [30]:
x = 31
if x < 20:
    print('Ok it is less than 20, now... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

Ok it is larger than 30.


`if` statements can be nested of course:

In [31]:
x = 19
if x < 20:
    print('Ok it is less than 20, now... ')
    if x < 18:
        print('And it is less than 18 too... ')
    else:
        print('But not less than 18... ')
elif x > 30:
    print('Ok it is larger than 30.')
else:
    print('It\'s between 20 and 30!')

Ok it is less than 20, now... 
But not less than 18... 
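
Nesting is not the only option. As a sketch, the same kind of decision can often be written as a flat `if`/`elif` chain using a chained comparison such as `18 <= x < 20`. The function `describe` and its messages are hypothetical, introduced here only for illustration.

```python
def describe(x):
    # hypothetical helper, not part of the original notebook
    if x < 18:
        return 'less than 18'
    elif 18 <= x < 20:      # chained comparison: 18 <= x and x < 20
        return 'between 18 and 20'
    elif x <= 30:
        return 'between 20 and 30'
    else:
        return 'larger than 30'

descriptions = [describe(x) for x in (10, 19, 25, 31)]
print(descriptions)
```

Whether the nested or the flat version reads better depends on the case; the flat version keeps every condition at the same indentation level.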


Did I fo

### Readings and Videos
- [Bill Lubanovic, Introducing Python, 2nd Edition](https://www.oreilly.com/library/view/introducing-python-2nd/9781492051374/), Chapters 1 - 3.
- [freeCodeCamp.org Intermediate Python Programming Course](https://www.youtube.com/watch?v=HGOBQPFzWKo), Sections 1 - 4 (Lists, Tuples, Dictionaries, Sets)

### A highly recommended To Do
- Watch [Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)](https://www.youtube.com/watch?v=vmEHCJofslg)
- Watch [Python NumPy Tutorial for Beginners](https://www.youtube.com/watch?v=QUT1VHiLmmI)
- Read chapter [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) from [Python Data Science Handbook, Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)

<hr>

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>