# Python for Data Science
Before starting the workshop, please make sure you have Anaconda installed. We will use Python 3 throughout.

The material for the course is available at 
https://github.com/alexanderbuchholz/cudss_workshops/blob/master/python_for_data_science_workshop/python_for_data_science_workshop.ipynb


## Main tools in python for data science
The main libraries you need to learn for data science are the following (this is of course a bit subjective!):
1. NumPy (numerical Python). This library handles matrices, vectors and matrix-vector calculations. The underlying code is written in C.
2. Pandas. A data-handling library that lets you load, inspect, visualize and preprocess your data. Essential for getting the first insights!
3. Matplotlib and seaborn. Two libraries for making nice plots of your data.
4. Scikit-learn (imported as sklearn). To run all your fancy models. All models are set up in the same way: you create your model, fit (train) it and make predictions on unseen data.
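
The create, fit, predict pattern from the last item can be sketched as follows. This is a minimal illustration on a tiny hand-written dataset (the arrays `X_toy`, `y_toy` and `X_new` are assumptions made for this example, not workshop data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hand-written dataset: one feature, two classes (illustration only).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()        # 1. create the model
model.fit(X_toy, y_toy)             # 2. fit (train) it
X_new = np.array([[0.5], [4.5]])    # observations the model has not seen
y_new_pred = model.predict(X_new)   # 3. predict
print(y_new_pred)
```

Other scikit-learn estimators follow the same three steps, so swapping in a different model usually only changes the line that creates it.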

## IPython (Jupyter) notebooks
Notebooks are a good way to play around with your data and try different things. However, be careful: cells can be executed in any order, so the state of your variables may not match the order in which the cells appear.
You can start with a notebook and easily turn it into something presentable by switching cells to the Markdown cell type.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn
# this line lets us have the plot shown without calling "plt.show()"
%matplotlib inline


The pattern "import xxx as x" is standard in Python. Stick to the conventional aliases (np, pd, plt, sns), as this makes your code more readable.

Let us now read a dataset, more precisely the Pima diabetes dataset. This dataset is a classic benchmark for testing machine learning algorithms (nowadays it is considered too simple). It contains information on a group of Pima, a Native American tribe. The dataset has 768 female individuals, some of whom suffer from diabetes. Our aim is to predict whether an individual suffers from diabetes given other indicators (BMI, pregnancy record...).
For more information see here: 
https://www.kaggle.com/uciml/pima-indians-diabetes-database

The variables in the data set are: 

Pregnancies - Number of times pregnant

Glucose - Plasma glucose concentration after 2 hours (oral glucose tolerance test)

BloodPressure - Diastolic blood pressure (mm Hg)

SkinThickness - Triceps skin fold thickness (mm)

Insulin - 2-Hour serum insulin (mu U/ml)

BMI - Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction - Diabetes pedigree function

Age - Age (years)

Outcome - Class variable (0 or 1); 268 of 768 are 1, the others are 0

In [None]:
pima_all = pd.read_csv("pima-indians-diabetes.csv", header=None)

What does this command do? Try to find out more by using "help(pd.read_csv)".

If the previous command does not work, pass the full path to the file instead of just its name.
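
To see what "header=None" changes, here is a small sketch that reads two illustrative rows from an in-memory string instead of a file (the string `csv_text` and the name `toy` are assumptions for this example). The rows have the same nine-column layout as the workshop file.

```python
import io
import pandas as pd

# Two illustrative rows in the nine-column layout of the workshop file.
# There is no header row, just like in the real CSV.
csv_text = "6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n"

# header=None tells pandas that the first line is data, not column names,
# so the columns are labelled 0, 1, 2, ... until we rename them.
toy = pd.read_csv(io.StringIO(csv_text), header=None)

print(toy.shape)          # (2, 9)
print(list(toy.columns))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Without "header=None", pandas would treat the first data row as column names and the dataset would silently lose one observation.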

In [None]:
pima_all.head()

In [None]:
pima_all.columns = ['num_pregnant', 'glucose', 'pressure', 'skin', 'insulin', 'bmi', 'pedigree' , 'age', 'diab_class']

In [None]:
pima_all.head()

What problems do you see here? 

In [None]:
pima_all.skin.plot.hist(bins=20)

In [None]:
pima_all.insulin.plot.hist(bins=20)

### Exercise: 
Look at the other variables and see if you can find anything suspicious.

In [None]:
pima_all.head()

In [None]:
pima_all.describe()

Let's take a more systematic approach: replace zero values with missing values (NaN) in the columns where zero is not a plausible measurement.

In [None]:
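zero_cols = ['glucose', 'pressure', 'skin', 'insulin', 'bmi'] # the columns at positions 1 to 5
pima_all[zero_cols] = pima_all[zero_cols].replace(0, np.nan) # use np.nan: the alias np.NaN was removed in NumPy 2.0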

What does this command do? 
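
To see the effect on a small scale, here is a sketch on a three-row toy table (the data frame `toy` and its values are made up for this example). Only the columns where zero is implausible are converted.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "num_pregnant": [0, 2, 5],
    "glucose": [148, 0, 120],
    "bmi": [33.6, 26.6, 0.0],
})

# Zero pregnancies is plausible, a glucose level or BMI of zero is not,
# so only those two columns are converted.
cols = ["glucose", "bmi"]
toy[cols] = toy[cols].replace(0, np.nan)

print(toy.isna().sum())       # number of missing values per column
print(toy["glucose"].mean())  # NaN is skipped: (148 + 120) / 2 = 134.0
```

Once the zeros are marked as NaN, summary statistics such as the mean ignore them instead of being pulled towards zero.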

In [None]:
pima_all.describe()

Let's try to understand what drives diabetes.
We will use a different library that yields some nice plots, called seaborn. 

In [None]:
sns.violinplot(x="diab_class", y="glucose", data=pima_all)
plt.xlabel('Diabetes status') 

In [None]:
pima_all.diab_class.mean() # what does this number tell you?

### Exercise
Do the same thing for the other variables.

In [None]:
pima_all.groupby('diab_class').mean()

In [None]:
pima_all[['glucose', 'pressure', 'diab_class']].boxplot(by='diab_class')

What can you say about the factors that drive diabetes based on the first two plots?

In [None]:
sns.lmplot(x='glucose', y='insulin', data=pima_all, hue='diab_class')

In [None]:
sns.lmplot(x='bmi', y='insulin', data=pima_all)

In [None]:
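# sns.distplot is deprecated in recent seaborn versions; histplot with a density curve is the replacement
sns.histplot(pima_all[pima_all['diab_class']==0].pedigree, kde=True, stat='density', color='C0')
sns.histplot(pima_all[pima_all['diab_class']==1].pedigree, kde=True, stat='density', color='C1')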
#plt.savefig('two_histograms.pdf')

How do we handle missing values? 
One way is to drop all rows with missing values.


In [None]:
pima_all.dropna().describe()

What is the problem here? 

A better approach: impute the missing values, using for example the mean or the most frequent value of each column. We will use the fillna method provided by pandas.

In [None]:
pima_all.fillna(pima_all.mean(), inplace=True) # can you explain what this command does?
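
Mean imputation can be checked by hand on a toy table. The sketch below uses a made-up data frame `toy` and returns a new frame rather than modifying it in place, which is a choice made for this example only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "glucose": [100.0, np.nan, 140.0],
    "bmi": [20.0, 30.0, np.nan],
})

column_means = toy.mean()              # one mean per column, NaN ignored
toy_filled = toy.fillna(column_means)  # each NaN gets the mean of its own column

print(toy_filled)
```

The missing glucose value becomes (100 + 140) / 2 = 120 and the missing BMI becomes (20 + 30) / 2 = 25.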

In [None]:
pima_all.describe()

### Training a model
First step: split the data into a train and a test dataset.
sklearn is a library that contains a lot of machine learning algorithms.

In [None]:
from sklearn.model_selection import train_test_split

We first have to convert the data frame into NumPy arrays: X holds the eight features and y the outcome.

In [None]:
X = pima_all.iloc[:,0:8].values
y = pima_all.iloc[:,8].values

Now we split the data into a train and a test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
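
To get a feeling for what the split returns, here is a sketch on random data with the same shape as the workshop data (768 rows, 8 features). The arrays `X_demo` and `y_demo` are synthetic and only serve as an illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))     # synthetic features
y_demo = rng.integers(0, 2, size=768)  # synthetic 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(X_tr.shape, X_te.shape)  # (514, 8) (254, 8)

# The same random_state gives exactly the same split again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```

Fixing "random_state" makes results reproducible between runs, which helps when comparing models.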

### Exercise: 
What is wrong with this approach? Hint: look at what we did before, how did we preprocess the data?

Now we will fit a first model using only the first three variables: num_pregnant, glucose, pressure.

In [None]:
from sklearn.linear_model import LogisticRegression

# Logistic regression:
### What is a logistic regression?
A logistic regression is a model that assigns to each individual a probability between 0 and 1 of having the outcome (diabetes or not).
The idea is that every individual we observe can be represented as the outcome of a coin flip (either 0 or 1). However, the coin is different for each individual: everyone has their own unique coin. The properties of this coin are determined by the observed covariates (the BMI, for instance). We assume that the covariates influence the coin in the same way for everyone. This shared structure allows us to learn the parameters that govern the model:

$$
\mathbb{P}(Y_i = 1 \mid X_i) = \operatorname{logit}^{-1}\Big(\sum_{j=1}^p x_{i,j} \beta_j\Big), \qquad \operatorname{logit}^{-1}(z) = \frac{1}{1 + e^{-z}},
$$
where $\beta_j$ is the same across individuals.

Thus, a logistic regression allows us to model the individual probability of having diabetes. 
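
In a logistic regression, the weighted sum of the covariates is turned into a probability by the logistic function (the inverse of the logit). The sketch below computes this by hand for one individual. The coefficients `beta` and covariates `x_i` are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def inverse_logit(z):
    """Map a real number (or array) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and covariates, chosen only for illustration.
beta = np.array([0.5, -0.25])
x_i = np.array([2.0, 4.0])

linear_predictor = x_i @ beta          # 2.0 * 0.5 + 4.0 * (-0.25) = 0.0
p_i = inverse_logit(linear_predictor)  # probability that Y_i = 1
print(p_i)  # 0.5
```

A linear predictor of zero corresponds to a fair coin; positive values push the probability above 0.5 and negative values below it.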

For more details see here 
https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102


In [None]:
logisticmodel = LogisticRegression() # we instantiate a model by calling the class

In [None]:
logisticmodel.fit(X_train[:,0:3], y_train)

If you want to learn more about how the model is trained, look at maximum likelihood estimation. 

In [None]:
y_train_pred = logisticmodel.predict(X_train[:,0:3])

In [None]:
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

### Exercise: 
Try to understand what these different evaluation metrics do. 
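
As a starting point, the three metrics can be computed by hand on a few made-up labels (the lists `y_true` and `y_pred` below are assumptions for this example, not model output).

```python
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up true labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]  # made-up predictions

# Rows are the true classes (0, 1), columns the predicted classes (0, 1):
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]

acc = accuracy_score(y_true, y_pred)  # correct / total = 6 / 8
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(acc, f1)
```

Here precision and recall are both 3/4, so the F1 score coincides with the accuracy of 0.75. On imbalanced data the two can differ considerably.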

In [None]:
confusion_matrix(y_train, y_train_pred)

In [None]:
y_test_pred = logisticmodel.predict(X_test[:,0:3])
confusion_matrix(y_test, y_test_pred)

In [None]:
f1_score(y_test, y_test_pred), f1_score(y_train, y_train_pred)

In [None]:
accuracy_score(y_test, y_test_pred), accuracy_score(y_train, y_train_pred)

### Exercise: 
Also use the other variables. What accuracy do you obtain? 

### Exercise
Use another classification model, the random forest classifier (look up how to use it). What is the best score that you get?
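
As a hint, a random forest follows the same create, fit, predict pattern as the logistic regression. The sketch below runs on synthetic data from "make_classification" rather than the Pima data, and the settings (100 trees, fixed seeds) are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 features (not the Pima dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
y_te_pred = forest.predict(X_te)

test_acc = accuracy_score(y_te, y_te_pred)
print(test_acc)
```

To solve the exercise, replace the synthetic arrays with X_train, X_test, y_train and y_test from above.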

### Exercise
Can you think of a way to fix the imputation problem?
Look at imputation and pipelines in sklearn. Together they give you a cleaner way to impute missing values.
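
One possible direction, sketched on synthetic data: put a "SimpleImputer" and the model into a pipeline, so that the imputation means are computed from the training rows only. The data, the 10% missingness rate and the variable names below are assumptions for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))                      # synthetic features
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)    # synthetic labels
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan          # remove about 10% of the entries

# Split first, while the missing values are still missing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
pipe.fit(X_tr, y_tr)             # the column means come from the training rows only
score = pipe.score(X_te, y_te)   # test rows are filled with the training means
print(score)
```

Because the split happens before any imputation, no information from the test set leaks into the preprocessing.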

# How to go further: 

## Learning using MOOCs
https://www.coursera.org/learn/python-data-analysis
or other resources on Coursera

## Learning using kaggle
Kaggle has a lot of material that can get you started.
For example, you might want to look at
https://www.kaggle.com/learn/overview
