# The notebook

A notebook is a collection of `cells` that can be executed in different ways

## Code

In [None]:
# This is a code cell: its content is executed as Python
print('Hello!')

## Markdown

You can make text **bold** or *italic*, and many other things...

* list
* with
* items

[Link to interesting resources](https://www.youtube.com/watch?v=7-t6UDgPu9Y) or images: ![images](https://www.am-team.com/site/layout/img/logo@2x.png)

> Blockquotes if you like them
> This line is part of the same blockquote.

Mathematical formulas can also be incorporated (LaTeX)
$$\frac{dX}{dt}=X_{in} - k_1 .X$$
$$\frac{dO}{dt}=k_2 .(O_{sat}-O) - k_1 .O$$

### HTML

You can also use HTML commands, just check this cell:
<h3> html-adapted title with &#60;h3&#62; </h3> <p></p>
<b> Bold text &#60;b&#62; </b> or <i>italic &#60;i&#62; </i>

## Essential tips for using notebooks

<big><center>To run a cell: push the start triangle in the menu or type **SHIFT + ENTER/RETURN**
![](./img/Hit-Enter.jpg)

* The **TAB** key is essential: it lists all **possible actions** available on an object or on a library you have loaded *AND* it provides **autocompletion**:

In [None]:
superlongvariablename = 3

In [None]:
superlongvariablename

sup + TAB

* The  **SHIFT-TAB** combination is ultra essential to get information/help about the current operation 

In [None]:
round(2.23445634,3)

In [None]:
# alternatively
round?

<b>What is the difference with the double "??"?</b>

In [None]:
import glob
glob.glob??
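The `?` and `??` helpers only exist inside IPython/Jupyter. As a rough sketch of what they show, the standard library gives access to similar information: the docstring (what `?` displays) and, for functions written in Python, the source code (what `??` adds). The variable names `doc` and `source` are only chosen for this illustration.

```python
import glob
import inspect

# `round?` shows the signature and the docstring;
# the docstring is also available as a plain attribute
doc = round.__doc__
print(doc)

# `glob.glob??` additionally shows the source code;
# inspect.getsource retrieves it for functions written in Python
source = inspect.getsource(glob.glob)
print(source[:200])
```

Built-in functions such as `round` are implemented in C, so for those `??` cannot show more than `?` does.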

* NOTE: cells have an edit mode and a command mode

* **A** creates a new cell **A**bove

* **B** creates a new cell **B**elow

<div class="alert alert-success">
    <b>Now try CTRL+SHIFT+P</b> 
</div>

## Having troubles executing?

<div class="alert alert-danger">
    <b>NOTE</b>: When, for any reason, a cell seems to enter an endless loop: 
    <ul>
    <li> first try **Kernel** > **Interrupt** -> your cell should stop running
    <li> if that does not succeed -> **Kernel** > **Restart** -> restart your notebook
    </ul>
</div>

## Want to see some magic?

In [None]:
# %%timeit

mylist = range(1000)
for i in mylist:
    i = i**2

In [None]:
import numpy as np

In [None]:
%%timeit
np.arange(1000)**2

In [None]:
%lsmagic

# Load packages and execute

In [None]:
from IPython.display import Image
Image(url='http://python.org/images/python-logo.gif')

Importing packages is the first thing you do in Python: you import the functionality you need.

Different importing options:

* <span style="color:green">import <i>package-name</i></span>  importing all functionalities as such
* <span style="color:green">from <i>package-name</i> import <i>specific function</i></span> importing a specific function or subset of the package
* <span style="color:green">from <i>package-name</i> import *  </span>   importing all definitions and actions of the package into the current namespace (generally discouraged, since it hides where a name comes from)
* <span style="color:green">import <i>package-name</i> as <i>short-package-name</i></span>    Very good way to keep a good insight in where you use what package
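A minimal sketch of these import styles, using the standard library module `math` as a stand-in for any package (the choice of `math` and of the alias `m` is just for illustration):

```python
import math                # option 1: names are accessed as math.sqrt
from math import sqrt      # option 2: sqrt is available directly
import math as m           # option 4: names are accessed as m.sqrt

results = [math.sqrt(16), sqrt(16), m.sqrt(16)]
print(results)

# option 3 (from math import *) would also make sqrt available directly,
# but the code no longer shows which package the name comes from
```

All three calls reach the same function; the styles differ only in how visible the origin of a name remains in your code.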

In [None]:
import numpy as np

In [None]:
np.arange(1,30,0.5)

# Datatypes

Python supports the following numerical, scalar types:
* integer
* floats
* complex
* boolean

In [None]:
theinteger = 3
print(type(theinteger))

In [None]:
type(theinteger)

In [None]:
theinteger

In [None]:
# type casting: converting the integer to a float type
float(theinteger)

In [None]:
thefloat = 0.2
type(thefloat)

In [None]:
thecomplex = 1.5 + 0.5j
# get the real or imaginary part of the complex number by using the attributes
# real and imag of the variable
print(type(thecomplex), thecomplex.real, thecomplex.imag)

In [None]:
type(thecomplex)

In [None]:
thecomplex.imag

In [None]:
3>4

In [None]:
theboolean = (3 > 4)
theboolean

You can do basic calculations

In [None]:
print(7 * 3.)
print(2**10)
print(8 % 3)

**But careful!**

In [None]:
print(3/2)     # true division: 1.5 in Python 3 (this was 1 in Python 2)
print(3/2.)
print(3.//2.)  # floor division: the result is rounded down, here to 1.0
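The different division operators are easy to mix up. The following sketch, assuming Python 3, puts them side by side; the variable names are only illustrative.

```python
true_division = 7 / 2        # always returns a float
floor_division = 7 // 2      # rounds down to the nearest integer
negative_floor = -7 // 2     # rounding down means towards minus infinity
remainder = 7 % 2            # what is left after floor division
both = divmod(7, 2)          # floor division and remainder in one call
truncated = int(-7 / 2)      # int() cuts off the decimals instead of rounding down

print(true_division, floor_division, negative_floor, remainder, both, truncated)
```

Note the difference between `-7 // 2` and `int(-7 / 2)`: floor division rounds down, whereas `int()` truncates towards zero.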

## Grouping datatypes

### List

A list is an ordered collection of objects, that may have different types. The list container supports slicing, appending, sorting ...

Indexing starts at 0 (as in C, C++ or Java), not at 1 (as in Fortran or Matlab)!

In [None]:
thelist = [2.,'aa', 0.2]

In [None]:
# accessing individual object in the list
thelist[1]

In [None]:
# negative indices are used to count from the back
thelist[-1]

In [None]:
thelist = ['first', 'second', 'third', 'fourth', 'fifth']
print(thelist[3:])
print(thelist[:2])
print(thelist[::2])

In [None]:
thelist[3] = 'newFourth'
print(thelist)
thelist[1:3] = ['newSecond', 'newThird']
print(thelist)

In [None]:
# Appending
thelist.append('pink')
thelist

In [None]:
# Removes the last element
thelist.pop()
thelist

In [None]:
# Extends the list in-place
thelist.extend(['pink', 'purple'])
thelist

In [None]:
# Reverse the list
thelist.reverse()
thelist

In [None]:
# Remove the first occurrence of an element
thelist.remove('purple')
thelist

In [None]:
# Sort list
thelist.sort()
thelist

### String

You can slice strings just like lists

In [None]:
a_string = "hello"
print(a_string[0])
print(a_string[1:5])
print(a_string[-4:-1:2])

You can mix data types in a printout

In [None]:
print('An integer: %i; a float: %f; another string: %s' % (4, 0.0001, 'strong'))

In [None]:
n_dataset_number = [1, 55, 32]
for i in n_dataset_number:
    sFilename = 'processing_of_dataset_%d.txt' % i
    print(sFilename)
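The `%` formatting used above still works, but since Python 3.6 the same result can be written with f-strings, which place the variables directly inside the text. A small sketch (the variable names are chosen for illustration only):

```python
number, value, text = 4, 0.0001, 'strong'

# the %-style formatting used above
old_style = 'An integer: %i; a float: %f; another string: %s' % (number, value, text)

# the same output as an f-string; the part after the colon is the format code
new_style = f'An integer: {number:d}; a float: {value:f}; another string: {text}'

print(old_style)
print(new_style)

# f-strings are also handy to build file names in a loop
filenames = [f'processing_of_dataset_{i}.txt' for i in [1, 55, 32]]
print(filenames)
```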

### Dictionaries

In [None]:
# Always key : value combinations, datatypes can be mixed
hourly_wage = {'Jos':10, 'Frida': 9, 'Gaspard': '13', 23 : 3}
hourly_wage

In [None]:
hourly_wage.keys()

In [None]:
hourly_wage.values()

In [None]:
# or get the (key, value) pairs
hourly_wage.items()

### Tuple

This is basically a list which cannot be modified

In [None]:
a_tuple = (2, 3, 'aa', [1, 2])
a_tuple
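What "cannot be modified" means in practice is sketched below: assigning to an element of a tuple raises a `TypeError`. The tuple only fixes *which* objects it holds, so a mutable element, such as the list inside `a_tuple`, can still change.

```python
a_tuple = (2, 3, 'aa', [1, 2])

try:
    a_tuple[0] = 10              # item assignment is not supported by tuples
except TypeError as error:
    message = str(error)
    print('Not allowed:', message)

# the list stored inside the tuple is still a normal, mutable list
a_tuple[3].append(3)
print(a_tuple)
```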

## Loops and controls

In [None]:
an_int = 20
if an_int == 1:
    print(1)
elif an_int == 2:
    print(2)
else:
    print('A lot')

In [None]:
adjectives = ('cool', 'powerful', 'readable', 'a snake?')

for word in adjectives:
    # remember string formatting?
    print('Python is {}!'.format(word))

In [None]:
aList = [1, 0, 2, 4]
print('printing the inverse of the integers, excluding division by zero')
for element in aList:
    if element == 0:
        continue
    print(1. / element)

In [None]:
word = 'supercalifragilistichespiralidoso'

<div class="alert alert-success">
    <b>NOW count the 'i' in this word</b> 
</div>

In [None]:
a = 0
for i in word:
    if i == 'i':
        a = a+1
print(a)

In [None]:
mydictionary = {'unipi': 3, 'unifi': 2, 'polimi': 4, 'ugent': 5, 'unipa': 1}

In [None]:
mydictionary

<div class="alert alert-success">
    <b>EXERCISE</b>: Return the key of an item in the dictionary `mydictionary` if the value is provided (assume that the user is always providing a value that is part of the dictionary and all values only occur once).
</div>

In [None]:
mydictionary.keys()

<div class="alert alert-success">
    <b>EXERCISE</b>: Given the dictionary `mydictionary`, check if a key is already existing in the dictionary and print a message `already in dictionary`. Use 'polimi' as string to check for.
</div>

# Numpy

In [None]:
%matplotlib inline

NumPy is the fundamental package for scientific computing with Python. It contains:

* a powerful N-dimensional array/vector/matrix object
* sophisticated (broadcasting) functions
* function implementation in C/Fortran assuring good performance if vectorized
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

Also known as *array oriented computing*. The recommended convention to import numpy is:

In [None]:
import numpy as np

In the `numpy` package the terminology used for vectors, matrices and higher-dimensional data sets is *array*. Let's already load some other modules too.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

Do you want to know the distribution of the sum of two dice?

In [None]:
plt.plot()

In [None]:
def mydices(throws):
    """
    Function to create the distribution of the sum of two dice.
    
    Parameters
    ----------
    throws : int
        Number of throws with the dice
    """
    # randint excludes the upper bound, so 7 is needed to get the faces 1..6
    stone1 = np.random.randint(1, 7, throws)
    stone2 = np.random.randint(1, 7, throws)
    total = stone1 + stone2
    # one bin per possible sum (2..12), centred on the integer values
    return plt.hist(total, bins=np.arange(1.5, 13.5)) # We use matplotlib to show the histogram

In [None]:
mydices(100000)

In [None]:
# random numbers (X, Y in 2 columns)
Z = np.random.random((10,2))
X, Y = Z[:,0], Z[:,1]

# distance
R = np.sqrt(X**2 + Y**2)
# angle
T = np.arctan2(Y, X) # Array of angles in radians
Tdegree = T*180/(np.pi) # If you like degrees more

# NEXT PART (now for illustration)
#plot the cartesian coordinates
plt.figure(figsize=(14, 6))
ax1 = plt.subplot(121)
ax1.plot(Z[:,0], Z[:,1], 'o')
ax1.set_title("Cartesian")
#plot the polar coordinates
ax2 = plt.subplot(122, polar=True)
ax2.plot(T, R, 'o')
ax2.set_title("Polar")

## `numpy` arrays

In [None]:
# a vector: the argument to the array function is a Python list
V = np.array([1, 2, 3, 4])
V

In [None]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M

The `V` and `M` objects are both of the type `ndarray` that the `numpy` module provides.

In [None]:
type(V), type(M)

In [None]:
V.shape

In [None]:
V.size

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has (always fixed for each array, cfr. Matlab):

In [None]:
V.dtype

In [None]:
# this fails with a ValueError: a string cannot be stored in an integer array
V[0] = 'a string'

In [None]:
f = np.array(['Bonjour', 'Hello', 'Hallo',])
f
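The fixed data type also matters when an assignment does *not* fail. As a small sketch (the array names `ints` and `floats` are only illustrative): a float assigned to an integer array is silently truncated, and `astype` is one way to obtain a converted copy first.

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
ints[0] = 9.7                  # no error: the value is truncated to fit the integer dtype
print(ints, ints.dtype)

floats = ints.astype(float)    # astype returns a converted copy of the array
floats[0] = 9.7
print(floats, floats.dtype)
```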

If needed, you can also define explicitly which data type to use

In [None]:
M = np.array([[1, 2], [3, 4]], dtype=complex)  # other options: float, int, np.float64, np.int64

print(M, '\n', M.dtype)

You can build your N-dimensional array

In [None]:
C = np.array([[[1], [2]], [[3], [4]]])
print(C.shape)
C

In [None]:
C.ndim # number of dimensions

### Generate arrays

In [None]:
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step
x

In [None]:
# using linspace, both end points ARE included
np.linspace(0, 10, 25)
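The difference between `arange` (step size, stop value excluded) and `linspace` (number of points, stop value included) is a common source of off-by-one errors. A short sketch with illustrative variable names:

```python
import numpy as np

stepped = np.arange(0, 10, 2.5)                    # step of 2.5, the stop value 10 is excluded
spaced = np.linspace(0, 10, 5)                     # 5 points, the stop value 10 is included
open_end = np.linspace(0, 10, 4, endpoint=False)   # 4 points, stop value excluded

print(stepped)
print(spaced)
print(open_end)
```

With `endpoint=False`, `linspace` produces the same points as the `arange` call.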

In [None]:
np.logspace(0, 10, 10, base=np.e)

In [None]:
plt.plot(np.logspace(0, 10, 10, base=np.e), np.random.random(10), 'x')
plt.xscale('log')

In [None]:
# uniform random numbers in [0, 1): 0 is included, 1 is not
np.random.rand(5,5)

In [None]:
# standard normal distributed random numbers
np.random.randn(5,5)

<div class="alert alert-success">
    <b>EXERCISE:</b> Create a vector with values ranging from 10 to 49 with steps of 1
</div>

In [None]:
np.arange(10, 50, 1)

<div class="alert alert-success">
    <b>EXERCISE:</b> Create a 3x3 identity matrix (look into docs!)
</div>

In [None]:
np.identity(3)

<div class="alert alert-success">
    <b>EXERCISE</b>: Create a 3x3x3 array with random values
</div>

## File handling

In [None]:
a = np.random.random(40).reshape((20, 2))

In [None]:
a

In [None]:
np.savetxt("random-matrix.csv", a, delimiter=",")

In [None]:
a2 = np.genfromtxt("random-matrix.csv", delimiter=',')
a2

## Indexing and slicing arrays

With vectors

In [None]:
V

In [None]:
# V is a vector, and has only one dimension, taking one index
V[0]

In [None]:
V[-1:]

In [None]:
a

In [None]:
a.shape

In [None]:
# a is a matrix, or a 2 dimensional array, taking two indices 
# the first dimension corresponds to rows, the second to columns.
a[1, 1]

In [None]:
a[1, :] # row 1

In [None]:
a[:, 1] # column 1

In [None]:
a[1:5, 1] # slicing!

In [None]:
a[::2]

In [None]:
a[-3:] # the last three rows

<div class="alert alert-success">
    <b>EXERCISE</b>: Create a null vector of size 10 but the fifth value must be <b>1</b>
</div>

An array can also be passed as index (**Fancy indexing!**)

In [None]:
row_indices = [1, 2, 3]
a[row_indices]

In [None]:
B = np.array([n for n in range(5)])  #range is pure python
B

<div class="alert alert-success">
    <b>EXERCISE</b>: Make the above shorter with numpy
</div>

Often a mask is also something useful for smart indexing

In [None]:
row_mask = np.array([True, False, True, False, False, True, False, True, False, False, 
                    True, False, True, False, False, True, False, True, False, False])
a[row_mask]

In [None]:
a[a > 0.3]
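Masks can be combined as well. The sketch below uses a small hand-made array (`values`, chosen for illustration) so that the result is easy to check; note that the element-wise operators `&` and `|` are used instead of `and` and `or`, and that each condition needs its own parentheses.

```python
import numpy as np

values = np.array([0.1, 0.5, 0.9, 0.3, 0.7])

# element-wise AND of two conditions
mask = (values > 0.2) & (values < 0.8)
selected = values[mask]
positions = np.where(mask)[0]     # the indices where the mask is True

print(mask)
print(selected)
print(positions)
```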

Swapping rows

In [None]:
#SWAP
a[[0, 1]] = a[[1, 0]]
a

<div class="alert alert-success">
    <b>EXERCISE</b>: Change all even rows into zero-values
</div>

## Array operations

In [None]:
v1 = np.arange(0, 5)

In [None]:
v1

In [None]:
v1 * 2

In [None]:
v1 + 2

In [None]:
A = np.arange(25).reshape(5,5)
A * 2

In [None]:
A * A # element-wise multiplication

In [None]:
A * v1

Remember the speed difference with pure python?

In [None]:
a = np.arange(10000)
%timeit a + 1  

l = range(10000)
%timeit [i+1 for i in l] 

In [None]:
x, y = np.arange(5), np.arange(5).reshape((5, 1)) # a row and a column array

In [None]:
distance = np.sqrt(x ** 2 + y ** 2)
distance

In [None]:
#let's put this in a figure:
plt.pcolor(distance)    
plt.colorbar()  
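The `distance` calculation relies on *broadcasting*: when a row array of shape `(5,)` is combined with a column array of shape `(5, 1)`, NumPy stretches both to a common shape `(5, 5)`. A minimal sketch comparing broadcasting with an explicit loop (the names `row`, `column`, `total` and `explicit` are only illustrative):

```python
import numpy as np

row = np.arange(5)                       # shape (5,)
column = np.arange(5).reshape((5, 1))    # shape (5, 1)

total = row + column                     # both are broadcast to shape (5, 5)
print(row.shape, column.shape, total.shape)

# the same result written out with explicit Python loops
explicit = np.array([[r + c for r in row] for c in column[:, 0]])
print(np.array_equal(total, explicit))
```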

## Calculations

In [None]:
a = np.random.random(40)

In [None]:
a

In [None]:
print('Mean value is', np.mean(a))
print('Median value is',  np.median(a))
print('Std is', np.std(a))
print('Variance is', np.var(a))
print('Min is', a.min())
print('Index of minimum value is', a.argmin())
print('Max is', a.max())
print('Sum is', np.sum(a))
print('Prod', np.prod(a))
print('Cumsum is', np.cumsum(a)[-1])
print('CumProd of 5 first elements is', np.cumprod(a)[4])
print('Unique values in this array are:', np.unique(np.random.randint(1, 7, 10)))
print('85% Percentile value is: ', np.percentile(a, 85))

In [None]:
b = np.random.uniform(3, 40, (5,5))

In [None]:
b

In [None]:
b.max()

In [None]:
b.max(axis=1)

<div class="alert alert-success">
    <b>EXERCISE</b>: Rescale the 5x5 matrix to values between 0 and 1:
</div>

There are many more functions, e.g. reshaping, resizing, repeating, concatenating, ...
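A few of these functions in a short sketch, applied to a small array so the results are easy to follow (all variable names are illustrative):

```python
import numpy as np

base = np.arange(6)                                      # [0 1 2 3 4 5]

reshaped = base.reshape((2, 3))                          # same data as 2 rows, 3 columns
repeated = np.repeat(base[:2], 3)                        # every element repeated 3 times
stacked = np.concatenate([reshaped, reshaped], axis=0)   # glued below each other
side = np.concatenate([reshaped, reshaped], axis=1)      # glued next to each other

print(reshaped)
print(repeated)
print(stacked.shape, side.shape)
```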

## View and copy

In [None]:
A = np.array([[1, 2], [3, 4]])
A

In [None]:
# now B is referring to the same array data as A 
B = A 

This makes Python very efficient in memory usage, but it is important to remember what we're doing.

In [None]:
# changing B affects A
B[0,0] = 10

B

To avoid this, we can make a copy

In [None]:
B = np.copy(A)

In [None]:
# now, if we modify B, A is not affected
B[0,0] = -5

B
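The same applies to slices: basic slicing of a NumPy array returns a *view* on the original data, not a copy. A small sketch, using `np.shares_memory` to check whether two arrays use the same data (`first_row` and `independent` are illustrative names):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

first_row = A[0, :]              # basic slicing returns a view
first_row[0] = 10                # ... so this changes A as well
print(A)

independent = A[0, :].copy()     # an explicit copy has its own data
independent[0] = -5              # A is not affected this time
print(A)

print(np.shares_memory(A, first_row), np.shares_memory(A, independent))
```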

## Fitting functions

In [None]:
b_data = np.genfromtxt("../data/bogota_part_dataset.csv", skip_header=3, delimiter=',')
plt.scatter(b_data[:,2], b_data[:,3])

In [None]:
x, y = b_data[:,1], b_data[:,3] 
t = np.polyfit(x, y, 2) # fit a 2nd degree polynomial to the data; t holds the coefficients [a, b, c] of a*x**2 + b*x + c
t

In [None]:
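# sort x and y together, so that each y value stays paired with its own x value
order = np.argsort(x)
x, y = x[order], y[order]
plt.plot(x, y, 'o')
plt.plot(x, t[0]*x**2 + t[1]*x + t[2], '-')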
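To see what `np.polyfit` returns, it helps to fit data for which the answer is known. The sketch below uses synthetic data (not the Bogota dataset) generated from `x**2 + 2x + 3`; the names `x_demo`, `y_demo`, `coefficients` and `fitted` are only illustrative. `np.polyval` evaluates the fitted polynomial without writing out every term by hand.

```python
import numpy as np

# synthetic data following y = 1*x**2 + 2*x + 3 exactly
x_demo = np.linspace(-2, 2, 21)
y_demo = 1.0 * x_demo**2 + 2.0 * x_demo + 3.0

# the coefficients are returned from the highest power to the lowest
coefficients = np.polyfit(x_demo, y_demo, 2)
print(coefficients)

# evaluate the fitted polynomial in the same x values
fitted = np.polyval(coefficients, x_demo)
print(np.allclose(fitted, y_demo))
```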

<div class="alert alert-success">
    <b>EXERCISE</b>: Make a fourth order fit between the fourth and fifth column.
</div>

In [None]:
b_data[:, 4]

In [None]:
x = b_data[:, 3]
y = b_data[:, 4]

t = np.polyfit(x, y, 4) # fit a 4th degree polynomial to the data; t holds the five coefficients, highest power first
t

In [None]:
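# sort x and y together, so that each y value stays paired with its own x value
order = np.argsort(x)
x, y = x[order], y[order]
plt.plot(x, y, 'o')
plt.plot(x, t[0]*x**4 +t[1]*x**3 +t[2]*x**2 + t[3]*x + t[4], '-')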

## Advanced DA functions

In [None]:
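def moving_average(a, n=3):
    """Moving average of `a` over a window of `n` values, using a cumulative sum."""
    ret = np.cumsum(a, dtype=float)
    # subtracting the cumulative sum n positions back leaves the sum of each window
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n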

In [None]:
moving_average(b_data[:,3], 5)

More efficient and already existing functions of this type can be used with **pandas**
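As a sketch of what this looks like, the moving average above can be compared with the `rolling` functionality of a pandas `Series`. The input `values` is made up for the illustration; the function is repeated here so the example runs on its own.

```python
import numpy as np
import pandas as pd

def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# rolling mean over a window of 3 values
rolled = pd.Series(values).rolling(window=3).mean()
print(rolled)

# pandas keeps the original length and marks incomplete windows as NaN
print(np.allclose(rolled.dropna().to_numpy(), moving_average(values, 3)))
```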

**Interesting reading**
* http://numpy.scipy.org
* http://scipy.org/Tentative_NumPy_Tutorial
* http://scipy.org/NumPy_for_Matlab_Users - A Numpy guide for MATLAB users.
* http://wiki.scipy.org/Numpy_Example_List
* http://wiki.scipy.org/Cookbook

**Sources**
* http://scipy-lectures.github.io/intro/numpy/index.html
* http://www.labri.fr/perso/nrougier/teaching/numpy.100/index.html
* https://github.com/stijnvanhoey
* https://github.com/jorisvandenbossche

# Pandas

For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.

* Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
* Pandas can also be thought of as `R`'s `data.frame` in Python.
* Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

In [None]:
import pandas as pd

In [None]:
my_series = pd.Series([0.1, 0.2, 0.3, 0.4])
my_series

The series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*

In [None]:
my_series.index

In [None]:
my_series.values

But the index can be anything

In [None]:
my_series2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
my_series2

In [None]:
my_series2.values

In [None]:
pop_dict = {'California': 38.3,
            'Texas': 26.4,
            'New York': 19.7,
            'Florida': 19.6,
            'Illinois': 12.9}
populations = pd.Series(pop_dict)
populations

In [None]:
populations['California']

In [None]:
populations['Texas':]

In [None]:
populations[populations > 20]

In [None]:
populations.mean()
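The index is more than a label: when two Series are combined, pandas aligns the values on their index instead of on their position. A small sketch with made-up Series (`first` and `second` are illustrative names):

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
second = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# values with the same label are added; labels present in only one Series give NaN
total = first + second
print(total)
```

Here `b` and `c` occur in both Series and are summed, whereas `a` and `d` have no partner and result in a missing value.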

## Dataframes

In [None]:
data = {'state': ['California', 'Texas', 'New York', 'Florida', 'Illinois'],
        'population': [38.3, 26.4, 19.7, 19.6, 12.9],
        'area':[424, 696, 141, 170, 150]}
states = pd.DataFrame(data)
states

In [None]:
states.index

In [None]:
states.columns

In [None]:
states.dtypes

In [None]:
states.info()

Check the difference between selecting a column as a Series, ...

In [None]:
states['area']

... or as a DataFrame with a single column

In [None]:
states[['area']]

In [None]:
states[['area', 'population']]

Any column can be set as the index

In [None]:
states = states.set_index('state')
states

In [None]:
states['area']

Columns can be added easily

In [None]:
states['density'] = states['population'] / states['area']
states

And masking can be used as well

In [None]:
states[states['density'] > 0.1]

Sorting

In [None]:
states.sort_values(by='area', ascending=True)

The main statistics of a DataFrame can easily be summarized

In [None]:
states.describe()

In [None]:
states.plot()

In [None]:
states.plot(subplots=True);

In [None]:
states['population'].plot(kind='bar')

In [None]:
states.plot(kind='scatter', x='population', y='area')

In [None]:
states.plot(kind='box')

<div class="alert alert-success">
    <b>EXERCISE</b>: select the area and the population column of those states where the density is larger than 0.1
</div>

<div class="alert alert-success">
    <b>EXERCISE</b>: add a column 'density_ratio' with the ratio of the density to the mean density
</div>

## Data import/export

A wide range of input/output formats are natively supported by pandas:

* CSV, text
* SQL database
* Excel
* HDF5
* json
* html
* pickle
* ...
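As a minimal sketch of the CSV case: a DataFrame is written with `to_csv` and read back with `read_csv`. To keep the example self-contained, the CSV is kept in memory as text instead of in a file; `states_demo` is a made-up table for this illustration.

```python
import io
import pandas as pd

states_demo = pd.DataFrame({'state': ['California', 'Texas'],
                            'area': [424, 696]})

# without a file name, to_csv returns the CSV content as a string
csv_text = states_demo.to_csv(index=False)
print(csv_text)

# read_csv accepts a file name or any file-like object
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

In practice you would pass a file name to both functions, e.g. `states_demo.to_csv('states.csv', index=False)`.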

## Other features

* Working with missing data (`.dropna()`, `pd.isnull()`)
* Merging and joining (`concat`, `join`)
* Grouping: `groupby` functionality
* Reshaping (`stack`, `pivot`)
* Time series manipulation (resampling, timezones, ..)
* Easy plotting
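A short sketch of three of these features on a made-up table of measurements (the DataFrame `measurements` and its column names are assumptions for this illustration only):

```python
import numpy as np
import pandas as pd

measurements = pd.DataFrame({'site': ['A', 'A', 'B', 'B'],
                             'value': [1.0, np.nan, 3.0, 5.0]})

# missing data: drop the rows containing a NaN
complete = measurements.dropna()

# grouping: mean value per site (missing values are skipped)
site_means = measurements.groupby('site')['value'].mean()

# merging: put two tables below each other
doubled = pd.concat([measurements, measurements], ignore_index=True)

print(complete)
print(site_means)
print(len(doubled))
```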