# FACT ML Workshop Dortmund 2017 – Python Introduction

In [12]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

# Contents
<div id="toc"></div>

# Python 

## Why Python?

<img alt="stackoverflow" src="https://zgab33vy595fw5zq-zippykid.netdna-ssl.com/wp-content/uploads/2017/09/growth_major_languages-1-1024x878.png" width="600px" />


## Basics

Covering the basics would take too much time today. Good starting points:

* [PeP et al. Toolbox Workshop](https://toolbox.pep-dortmund.org)
* [The scientific Python lectures](https://github.com/jrjohansson/scientific-python-lectures)
* [A Byte Of Python](https://python.swaroopch.com/)

Very thorough, data analysis with Python:
* [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)



# numpy

Numerical computations in Python

* Fast thanks to vectorization and compiled C++/C/Fortran code  
   ⇒ No Python loops over numpy arrays
* Many functions for data analysis, random numbers, numerics, linear algebra, etc.
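
As a minimal sketch of what "no Python loops" means in practice, the following compares an explicit loop with the equivalent array expression. The function names `squares_loop` and `squares_vectorized` are illustrative and not part of the workshop material.

```python
import numpy as np


def squares_loop(values):
    """Square every element with an explicit Python loop."""
    result = []
    for v in values:
        result.append(v ** 2)
    return result


def squares_vectorized(values):
    """Square every element with one array expression; the loop runs in compiled code."""
    return np.asarray(values) ** 2


data = np.arange(5)
print(squares_loop(data))
print(squares_vectorized(data))   # [ 0  1  4  9 16]
```

Both functions compute the same values; the vectorized version avoids the per-element overhead of the Python interpreter.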


In [2]:
import numpy as np

In [3]:
# convert list to array
x = np.array([1, 2, 3, 4, 5])

In [4]:
2 * x

array([ 2,  4,  6,  8, 10])

In [5]:
x**2

array([ 1,  4,  9, 16, 25])

In [6]:
x**x

array([   1,    4,   27,  256, 3125])

In [19]:
np.cos(x)

array([ 0.54030231, -0.41614684, -0.9899925 , -0.65364362,  0.28366219])

In [13]:
# two-dimensional array
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

y + y

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

## Numpy Indexing

Numpy makes it very convenient to select specific elements from an array.

In [None]:
x = np.arange(0, 10)

# like lists:
x[4]

In [None]:
# all elements with indices ≥1 and <4:
x[1:4]

In [None]:
# negative indices count from the end
x[-1], x[-2]

In [None]:
# combination:
x[3:-2]

In [None]:
# step size
x[::2]

In [None]:
# trick for reversal: negative step
x[::-1]

![Indexing1D](Indexing1D.svg)

In [20]:
# use the five-element array from the beginning again
x = np.array([1, 2, 3, 4, 5])
y = np.array([x, x + 10, x + 20, x + 30])
y

array([[ 1,  2,  3,  4,  5],
       [11, 12, 13, 14, 15],
       [21, 22, 23, 24, 25],
       [31, 32, 33, 34, 35]])

In [21]:
# comma between indices
y[3, 2:-1]

array([33, 34])

In [22]:
# only one index ⇒ one-dimensional array
y[2]

array([21, 22, 23, 24, 25])

In [23]:
# other axis: (: alone means the whole axis)
y[:, 3]

array([ 4, 14, 24, 34])

In [None]:
# inspecting the number of elements per axis:
y.shape

![Indexing2D](Indexing2D.svg)
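
The following sketch summarizes how indexing changes the shape of the result. It rebuilds a 4×5 array like the one in the output above; the last slice is an extra illustration that does not appear in the original cells.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([x, x + 10, x + 20, x + 30])

print(y.shape)         # (4, 5): 4 rows, 5 columns
print(y[2].shape)      # (5,): a single index drops the first axis
print(y[:, 3].shape)   # (4,): one column comes back as a 1-D array

# Slices can be combined per axis: rows 1 and 2, every second column.
sub = y[1:3, ::2]
print(sub)
```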

### Array creation helpers

In [14]:
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [15]:
np.ones((5, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [16]:
np.linspace(0, 1, 11)

array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])

In [17]:
# like range() for arrays:
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [18]:
np.logspace(-4, 5, 10)

array([  1.00000000e-04,   1.00000000e-03,   1.00000000e-02,
         1.00000000e-01,   1.00000000e+00,   1.00000000e+01,
         1.00000000e+02,   1.00000000e+03,   1.00000000e+04,
         1.00000000e+05])

## Mask

Using arrays of booleans to access specific elements

In [25]:
a = np.random.normal(0, 1, 10)
a

array([-0.08575379, -1.67296643, -0.38066952,  1.3888896 , -0.72379286,
        0.35470826, -1.84232262, -1.03459401,  2.2945754 , -0.34056718])

In [26]:
a > 0

array([False, False, False,  True, False,  True, False, False,  True, False], dtype=bool)

In [27]:
a[a > 0]

array([ 1.3888896 ,  0.35470826,  2.2945754 ])

In [29]:
# Parentheses are important because of operator precedence
# | = or
# & = and
# ~ = not

a[(a > -1) & (a < 1)]

array([-0.08575379, -0.38066952, -0.72379286,  0.35470826, -0.34056718])

In [8]:
np.logical_and(a > -1, a < 1)

array([ True, False,  True, False,  True,  True, False, False, False,  True], dtype=bool)
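
A short sketch of the remaining mask operators, using made-up values rather than the random numbers above. It also shows two common uses that the cells above do not cover: counting matches with `sum()` and assigning through a mask.

```python
import numpy as np

a = np.array([-0.5, 1.5, 0.2, -2.0, 0.9])   # made-up values for illustration

inside = (a > -1) & (a < 1)    # element-wise "and"
outside = ~inside              # element-wise "not"
extreme = (a < -1) | (a > 1)   # element-wise "or"

print(inside.sum())            # True counts as 1, so this is the number of matches: 3
print(a[outside])              # the two values outside the interval: 1.5 and -2.0

a[a < 0] = 0                   # masks also work on the left-hand side of an assignment
print(a)                       # negative entries are now 0
```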

## Aggregations

In [31]:
x = np.random.normal(size=100)

In [32]:
np.sum(x)

-14.56472552253666

In [34]:
np.prod(x)

4.9309456983586592e-28

In [37]:
np.mean(x)

-0.14564725522536659

In [38]:
np.std(x)

0.8633160474640954

### The axis keyword

Evaluation of aggregations along a certain axis

In [42]:
X = np.arange(12).reshape(4, 3)
X

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [43]:
np.sum(X)

66

In [44]:
np.sum(X, axis=0)

array([18, 22, 26])
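
For comparison, a minimal sketch of the same aggregation along the other axis. `axis=0` collapses the rows (one result per column), while `axis=1` collapses the columns (one result per row).

```python
import numpy as np

X = np.arange(12).reshape(4, 3)

col_sums = np.sum(X, axis=0)    # shape (3,): one sum per column
row_sums = np.sum(X, axis=1)    # shape (4,): one sum per row
print(col_sums)                 # [18 22 26]
print(row_sums)                 # [ 3 12 21 30]

# Other aggregations accept the same keyword.
col_means = np.mean(X, axis=0)
print(col_means)
```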

### Broadcasting

Arrays of different shapes can be used together, thanks to broadcasting

In [45]:
a = np.arange(12).reshape(4, 3)
b = 5
c = np.arange(3)
d = np.arange(4)

In [46]:
a - b

array([[-5, -4, -3],
       [-2, -1,  0],
       [ 1,  2,  3],
       [ 4,  5,  6]])

In [47]:
a - c

array([[0, 0, 0],
       [3, 3, 3],
       [6, 6, 6],
       [9, 9, 9]])

In [48]:
# a - d -> error
(a.T - d).T

array([[0, 1, 2],
       [2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])
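
As an alternative to the double transpose, `d` can be given an explicit second axis with `np.newaxis`. This is a sketch of that variant; it produces the same result as the cell above.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)
d = np.arange(4)

# d has shape (4,), which does not match the last axis of a (length 3).
# With a trailing axis it has shape (4, 1) and broadcasts along the columns.
result = a - d[:, np.newaxis]

print(result.shape)                           # (4, 3)
print(np.array_equal(result, (a.T - d).T))    # True
```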

### Timing Example: Python loops vs. Numpy

Find the closest point in points

In [50]:
point = (0, 1)
points = [
    (0, 0),
    (0.5, -0.5),
    (1, -1),
    (0, 2),
    (0, 1.1),
    (-2, 3),
    (5, 1),
    (10, 4),
    (-4, 2),
    (-3, 0),
] * 100

Pure Python using loops:

In [51]:
def find_closest(points, point):
    min_distance = float('inf')
    for i, other in enumerate(points):
        distance = ((other[0] - point[0])**2 + (other[1] - point[1])**2)**0.5
        if distance < min_distance:
            min_distance = distance
            min_idx = i
    
    return min_idx

idx = find_closest(points, point)
print(idx, points[idx])

4 (0, 1.1)


In [52]:
%%timeit 
find_closest(points, point)

651 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [53]:
points = np.array(points)
point = np.array(point)

In [54]:
def find_closest_numpy(points, point):
    distances = np.linalg.norm(points - point, axis=1)
    idx = np.argmin(distances)
    return idx

idx = find_closest_numpy(points, point)
print(idx, points[idx])

4 [ 0.   1.1]


In [55]:
%%timeit 
find_closest_numpy(points, point)

37.8 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
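
`%%timeit` only works inside IPython or a notebook. Outside of it, a comparable measurement can be sketched with the standard library's `timeit` module. The variable names (`points_list`, `points_arr`, `n_calls`) are chosen for this example, and the measured times depend on the machine.

```python
import timeit

import numpy as np

points_list = [(0, 0), (0.5, -0.5), (1, -1), (0, 2), (0, 1.1),
               (-2, 3), (5, 1), (10, 4), (-4, 2), (-3, 0)] * 100
point_tuple = (0, 1)
points_arr = np.array(points_list)
point_arr = np.array(point_tuple)


def find_closest(points, point):
    """Pure Python version: loop over all points and remember the closest one."""
    min_distance = float('inf')
    min_idx = None
    for i, other in enumerate(points):
        distance = ((other[0] - point[0])**2 + (other[1] - point[1])**2)**0.5
        if distance < min_distance:
            min_distance = distance
            min_idx = i
    return min_idx


def find_closest_numpy(points, point):
    """Numpy version: compute all distances at once, then take the argmin."""
    return np.argmin(np.linalg.norm(points - point, axis=1))


n_calls = 100
t_loop = timeit.timeit(lambda: find_closest(points_list, point_tuple), number=n_calls)
t_numpy = timeit.timeit(lambda: find_closest_numpy(points_arr, point_arr), number=n_calls)

print(f"loop:  {t_loop / n_calls * 1e6:.1f} us per call")
print(f"numpy: {t_numpy / n_calls * 1e6:.1f} us per call")
```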


### Random numbers

In [22]:
uniform = np.random.uniform(-5, 5, 1000)
gaussian = np.random.normal(0, 1, 1000)
poisson = np.random.poisson(3, 1000)

mean = [2, 1]
cov = [[2, 1],
       [1, 4]]
gauss_2d = np.random.multivariate_normal(mean, cov, 1000)

Setting the seed enables reproducibility

In [56]:
np.random.seed(42)

np.random.normal()

0.4967141530112327
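
A small sketch of what reproducibility means here: resetting the seed replays the same sequence. The second half uses `np.random.default_rng`, which is not part of the original notebook and assumes NumPy 1.17 or newer.

```python
import numpy as np

# Global seed: the same seed gives the same first draw.
np.random.seed(42)
first = np.random.normal()
np.random.seed(42)
second = np.random.normal()
print(first == second)                       # True

# Generator objects (assumes NumPy >= 1.17) keep the random state local
# instead of relying on one global seed.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
sample_a = rng_a.normal(0, 1, 5)
sample_b = rng_b.normal(0, 1, 5)
print(np.array_equal(sample_a, sample_b))    # True
```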

# pandas

Documentation: [here](https://pandas.pydata.org/pandas-docs/stable/)

Library for data analysis. Central concept: `pd.DataFrame` → a 2D table of data

### The titanic dataset

In [83]:
import pandas as pd

df = pd.read_csv('titanic.csv')

In [84]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


How many valid (non-missing) values are there in each column?

In [85]:
df.count()

pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64
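
The complementary view, the number of missing values per column, can be sketched with `isna()`. The table `toy` below is made up for illustration and only stands in for the real data set.

```python
import numpy as np
import pandas as pd

# Small made-up table with some missing entries.
toy = pd.DataFrame({
    'age':   [29.0, np.nan, 2.0, 30.0],
    'cabin': ['B5', None, None, None],
    'fare':  [211.3, 151.6, 151.6, 151.6],
})

valid = toy.count()             # non-missing values per column
missing = toy.isna().sum()      # missing values per column
fraction = toy.isna().mean()    # fraction of missing values per column

print(valid)
print(missing)
print(fraction)
```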

Drop the columns that have too many missing values:

In [86]:
# axis=1: drop columns
# inplace=True: modify df directly
df.drop(['cabin', 'boat', 'body', 'home.dest'], axis=1, inplace=True) 

df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,S


What was the gender distribution on the Titanic?

In [87]:
df.sex.value_counts()

male      843
female    466
Name: sex, dtype: int64

A powerful operation: GroupBy → Aggregate

Split the data set into several groups and summarize each group.

Here: the fraction of survivors, broken down by sex.

In [88]:
df.groupby('sex')['survived'].agg('mean')

sex
female    0.727468
male      0.190985
Name: survived, dtype: float64
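
The same pattern extends to several aggregations or several grouping columns. The sketch below uses a small made-up table `toy` instead of the Titanic data, so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'pclass':   [1, 3, 1, 3, 3, 1],
    'survived': [1, 0, 1, 0, 0, 1],
})

# Several summaries of 'survived' per group at once.
summary = toy.groupby('sex')['survived'].agg(['mean', 'sum', 'count'])
print(summary)

# Grouping by two columns gives one value per (sex, pclass) combination.
by_class = toy.groupby(['sex', 'pclass'])['survived'].mean()
print(by_class)
```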

DataFrames support masks as well:

In [89]:
df['child'] = df.age < 9

df[df.child].survived.mean(), df[~df.child].survived.mean()

(0.6388888888888888, 0.3670169765561843)

Alternatively:

In [90]:
df.groupby('child').survived.mean()

child
False    0.367017
True     0.638889
Name: survived, dtype: float64
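
One caveat, sketched on a made-up table: comparisons with a missing value evaluate to `False`, so rows with an unknown age land in the `False` group. Since `age` has fewer valid entries than the other columns (see `df.count()` above), it may be worth restricting the comparison to rows with a known age. The names `toy`, `known` and `rates` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows; two of them have no age.
toy = pd.DataFrame({
    'age':      [2.0, 30.0, np.nan, 8.0, np.nan],
    'survived': [1, 0, 1, 1, 0],
})

# NaN < 9 is False, so unknown ages are counted as "not a child".
toy['child'] = toy.age < 9
print(toy.child.tolist())   # [True, False, False, True, False]

# Keep only rows with a known age before grouping.
known = toy[toy.age.notna()]
rates = known.groupby('child').survived.mean()
print(rates)
```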