<img src="py-logo.png" width="100pt"/>


# PYTHON FOR DATA SCIENCE 
# COURSE RECAP
*Lasse Ruokolainen*

*Seasoned Data Master, BILOT Consulting Oy* 

***

## (1) Data structures & types

Two lists have been created for the persons' names and ages. Make a third list indicating each person's gender.

In [None]:
# Character list:
name = ['Bill','Erica','Kristina','John','Monica','Anthony','Judy','Mark']

# Numeric list:
age = [32,25,47,17,39,62,28,21]

# Character list of gender:
gender = ['M','F','F','M','F','M','F','M']

Make a dictionary called `person_data` out of the three lists above and print out the dictionary keys:

In [None]:
person_data = {'name':name,'age':age,'gender':gender}
print(person_data.keys())

Next, make a dataframe out of the `person_data` dictionary and add two new variables to the dataframe, giving the height and weight for each person. Just come up with some numbers. If you're up to it, try using appropriate random number generators. 

In [None]:
# import pandas
import pandas as pd

# make dataframe:
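One possible way to do this, sketched with the three lists from above, is to pass the dictionary straight to `pd.DataFrame`. Each key then becomes a column. This is only one of several valid solutions.

```python
import pandas as pd

# lists copied from the cells above
name = ['Bill', 'Erica', 'Kristina', 'John', 'Monica', 'Anthony', 'Judy', 'Mark']
age = [32, 25, 47, 17, 39, 62, 28, 21]
gender = ['M', 'F', 'F', 'M', 'F', 'M', 'F', 'M']

person_data = {'name': name, 'age': age, 'gender': gender}

# each dictionary key becomes a column, each list that column's values
df = pd.DataFrame(person_data)
print(df.shape)
```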


In [None]:
from numpy.random import uniform

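# add new variables: uniform(low, high, size) draws one value per row
# (the ranges below are arbitrary choices)
df['height'] = uniform(150, 200, size=len(df))  # cm
df['weight'] = uniform(50, 100, size=len(df))   # kg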

# print out the dataframe header
df.head()

Add yet another variable to the dataframe, indicating whether a person is old (over 30 years) or young. With data this small you could do it manually, but that quickly becomes infeasible as the data grows. Here one can utilise, e.g., the power of list comprehensions.

In [None]:
df['senior'] = 
df.head()
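A minimal sketch of the list-comprehension approach follows. The labels `'old'` and `'young'` are an assumption; a boolean column would work equally well.

```python
import pandas as pd

# ages copied from the cell in section (1)
df = pd.DataFrame({'age': [32, 25, 47, 17, 39, 62, 28, 21]})

# list comprehension: one label per row, 'old' if the age is over 30
df['senior'] = ['old' if a > 30 else 'young' for a in df['age']]
print(df)
```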

## (2) Data processing

Read the `tips.csv` file into Python and print the first rows of the dataframe:

In [None]:
tips = 

Calculate descriptive statistics for the dataframe:

Calculate the average spending per person in the restaurant for each day and serving time:

In [None]:
tips['avg'] = 
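The sketch below shows one way to read this task. It assumes that `tips.csv` has the columns `total_bill`, `size`, `day` and `time`, and uses a few made-up rows instead of the real file, which would be loaded with `pd.read_csv`.

```python
import pandas as pd

# made-up rows that mimic the assumed layout of tips.csv
tips = pd.DataFrame({
    'total_bill': [20.0, 30.0, 12.0, 40.0],
    'size':       [2, 3, 1, 4],
    'day':        ['Sat', 'Sat', 'Sun', 'Sun'],
    'time':       ['Dinner', 'Dinner', 'Lunch', 'Lunch'],
})

# spending per person on each bill; tips['size'] is used because
# tips.size would return the dataframe's element count instead
tips['avg'] = tips['total_bill'] / tips['size']

# mean per-person spending for each day and serving time
avg_by_group = tips.groupby(['day', 'time'])['avg'].mean()
print(avg_by_group)
```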

Joining dataframes:

In [None]:
df2 = df.loc[0:6,['name','weight','height']]
df2.rename(columns = {'name':'person'},inplace=True)
df3 = df.drop(['weight','height'],axis='columns')
print(df2.head())
df3.head()

In [None]:
pd.merge(df3,df2,left_on='name',right_on='person',how='inner')\
  .drop(['senior'],axis=1)
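To see what `how='inner'` does, here is a small sketch with two toy tables (the weights are made up). An inner join keeps only the keys found in both tables, while a left join keeps every row of the left table.

```python
import pandas as pd

left = pd.DataFrame({'name': ['Bill', 'Erica', 'Mark'], 'age': [32, 25, 21]})
right = pd.DataFrame({'person': ['Bill', 'Erica'], 'weight': [80.0, 60.0]})

# inner join: only names present in both tables survive
inner = pd.merge(left, right, left_on='name', right_on='person', how='inner')

# left join: every row of `left` is kept, missing weights become NaN
left_join = pd.merge(left, right, left_on='name', right_on='person', how='left')

print(len(inner), len(left_join))
```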

## (3) Visualisation

Make a barplot of the average per-person spending calculated above.

In [None]:
import matplotlib.pyplot as plt
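A barplot could be sketched as follows. The group means are made-up placeholders for the averages from section (2), and the group labels are hypothetical. Call `plt.show()` afterwards to display the figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up group means standing in for the real per-person averages
avg_by_group = pd.Series(
    [7.5, 8.1, 9.0],
    index=['Thur-Lunch', 'Sat-Dinner', 'Sun-Dinner'],
)

# one bar per day/time group
fig, ax = plt.subplots()
ax.bar(avg_by_group.index, avg_by_group.values, color='steelblue')
ax.set_xlabel('Day and serving time')
ax.set_ylabel('Average spending per person')
```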


Next, make a boxplot / violin plot of the age of each gender in the dataframe `df`.

In [None]:
import seaborn as sns

plt.show()
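One possible solution uses `sns.boxplot`, as sketched below with the age and gender lists from section (1). `sns.violinplot` accepts the same arguments.

```python
import pandas as pd
import seaborn as sns

# age and gender lists copied from section (1)
df = pd.DataFrame({
    'age': [32, 25, 47, 17, 39, 62, 28, 21],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'F', 'M'],
})

# one box per gender, showing the distribution of age
ax = sns.boxplot(x='gender', y='age', data=df)
```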

Finally, visualise the regression between the total bill and the tip amount, and how it depends on whether the customer is a smoker.

In [None]:
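# one regression line per smoker group (assumes a `smoker` column in tips)
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=tips)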
plt.show()

This brings us conveniently to the next topic of statistics.

## (4) Statistics

Calculate descriptive statistics of the `tip` amount for each day and gender. This can be achieved, e.g., by using `.groupby` or `.pivot_table`.

In [None]:
import numpy as np
from scipy import stats
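Both routes mentioned above can be sketched like this. The rows are made up; with the real data, `tips` would come from `tips.csv`, and the `day` column name is an assumption.

```python
import pandas as pd

# made-up rows; the real data comes from tips.csv
tips = pd.DataFrame({
    'tip': [1.0, 3.0, 2.0, 4.0, 5.0, 3.0],
    'day': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'sex': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
})

# option 1: groupby + describe gives count, mean, std and quartiles per group
by_group = tips.groupby(['day', 'sex'])['tip'].describe()

# option 2: pivot_table with days as rows and gender as columns
mean_table = tips.pivot_table(values='tip', index='day', columns='sex', aggfunc='mean')

print(by_group)
print(mean_table)
```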

Calculate the correlation between `total_bill` and `tip`, separately for each gender.

In [None]:
from scipy import stats
for s in tips.sex.unique():
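One way to fill in the loop body is `stats.pearsonr`, which returns the correlation coefficient and its p-value. The sketch below uses made-up rows in place of `tips.csv`.

```python
import pandas as pd
from scipy import stats

# made-up rows; the real data comes from tips.csv
tips = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
    'tip':        [1.0, 2.5, 3.0, 3.0, 2.0, 4.0],
    'sex':        ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
})

corr = {}
for s in tips.sex.unique():
    sub = tips[tips.sex == s]                       # rows for one gender
    r, p = stats.pearsonr(sub.total_bill, sub.tip)  # coefficient and p-value
    corr[s] = r
    print('%s: r = %.2f (p = %.3f)' % (s, r, p))
```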


Finally, fit a logistic regression model to predict gender based on the tip amount.

In [None]:
from sklearn.linear_model import LogisticRegression

# prepare data for analysis:
x = tips.tip.values.reshape(-1, 1)
y = tips.sex.values
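Fitting the model could then look like the sketch below. The arrays are made-up stand-ins for `tips.tip` and `tips.sex`, and the default `LogisticRegression` settings are used without any tuning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up tips and genders standing in for tips.tip and tips.sex
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]).reshape(-1, 1)
y = np.array(['Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Male'])

model = LogisticRegression()
model.fit(x, y)

# mean accuracy on the data the model was fitted to
print('Accuracy: %.2f' % model.score(x, y))
# class probabilities for a tip of 3.0, ordered as in model.classes_
print(model.classes_, model.predict_proba([[3.0]]))
```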



## (5) Predictive modeling

In this final exercise, we'll try to predict the risk of type 2 diabetes in Pima Indians (a classic machine learning dataset).

In [None]:
# read in the data:
diabetes = pd.read_csv('Datasets/diabetes.csv')
print(diabetes.shape)
diabetes.head(10)

Note that `SkinThickness` and `Insulin` contain many zero values. Are these real zeros or missing values?

In [None]:
print('Proportion zeros in ST: %.2f' % \
      (sum(diabetes.SkinThickness==0)/diabetes.shape[0]))
print('Proportion zeros in Insulin: %.2f' % \
      (sum(diabetes.Insulin==0)/diabetes.shape[0]))

In [None]:
print('Proportion zeros in BMI: %.2f' % \
      (sum(diabetes.BMI==0)/diabetes.shape[0]))
print('Proportion zeros in BP: %.2f' % \
      (sum(diabetes.BloodPressure==0)/diabetes.shape[0]))
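A BMI or blood pressure of zero is not physiologically possible, so zeros in such columns are most plausibly placeholders for missing measurements. A compact way to count them, sketched here on made-up rows, is to recode zeros as `NaN` and take the mean of `isna()`.

```python
import numpy as np
import pandas as pd

# made-up rows using column names from the cells above
diabetes = pd.DataFrame({
    'SkinThickness': [35, 0, 0, 23],
    'Insulin':       [0, 0, 94, 168],
    'BMI':           [33.6, 26.6, 0.0, 28.1],
})

# treat zeros as missing, then compute the share of missing values per column
as_missing = diabetes.replace(0, np.nan)
missing_share = as_missing.isna().mean()
print(missing_share)
```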

So, it looks like we need to do something about the data before analysis.

In [None]:
# drop attributes with many missing values:
X = diabetes.drop(['SkinThickness','Insulin'],axis='columns')

# filter rows with missing values:
X = X.loc[np.logical_and(X.BMI>0,X.BloodPressure > 0),:]

print(X.shape)

What proportion of the data did we lose?

In [None]:
print(1-len(X)/len(diabetes))

Next we need to make the data usable for `sklearn`, i.e. separate the features from the labels:

In [None]:
x = 
y = 
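A sketch of this step is given below. It assumes the label column is called `Outcome` and uses a few made-up rows; check `diabetes.columns` for the actual names.

```python
import pandas as pd

# made-up rows; 'Glucose' and 'Outcome' are assumed column names
X = pd.DataFrame({
    'Glucose': [148, 85, 183, 89],
    'BMI':     [33.6, 26.6, 23.3, 28.1],
    'Outcome': [1, 0, 1, 0],
})

# features: every column except the label, as a 2-D NumPy array
x = X.drop('Outcome', axis='columns').values
# target: the label column as a 1-D array
y = X['Outcome'].values

print(x.shape, y.shape)
```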

And then we need to split the data:

In [None]:
# import a function for doing the splitting:
from sklearn.model_selection import train_test_split
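The split itself might be done as in the sketch below, which runs on synthetic data. The 30 % test share and the `stratify=y` argument are assumptions rather than requirements of the exercise; stratifying keeps the class balance similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-ins for the feature matrix and labels
rng = np.random.RandomState(0)
x = rng.normal(size=(100, 3))
y = np.array([0] * 65 + [1] * 35)

# hold out 30 % for testing; stratify keeps the class balance in both parts
x_tr, x_ts, y_tr, y_ts = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42)

print(x_tr.shape, x_ts.shape)
```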



It is good to check that the split is representative:

In [None]:
print('Proportion positives in total: %.2f' %(sum(y)/len(y)))
print('Proportion positives in training: %.2f' %(sum(y_tr)/len(y_tr)))
print('Proportion positives in testing: %.2f' %(sum(y_ts)/len(y_ts)))

Looks good. Now we are ready for model training:

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# train model:

# model performance (accuracy on the training data):
print('Accuracy: %.2f' %(model.score(x_tr,y_tr)))
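As a sketch of the training step, the block below fits a decision tree on synthetic data and compares training and test accuracy. The `max_depth=3` setting is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data: the label depends on the first feature only
rng = np.random.RandomState(0)
x = rng.normal(size=(200, 3))
y = (x[:, 0] > 0).astype(int)
x_tr, x_ts, y_tr, y_ts = train_test_split(x, y, test_size=0.3, random_state=42)

# a shallow tree; max_depth=3 is an arbitrary choice for this sketch
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(x_tr, y_tr)

print('Training accuracy: %.2f' % model.score(x_tr, y_tr))
print('Test accuracy: %.2f' % model.score(x_ts, y_ts))
```

Accuracy on the training data tends to be optimistic, so the test score is the more honest estimate.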

In [None]:
model.score?

Now calculate model accuracy from the confusion matrix:

In [None]:
from sklearn.metrics import 

conf = 
print(conf)
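One function that fits here is `confusion_matrix` from `sklearn.metrics`. The sketch uses made-up labels; with the real model you would pass `y_ts` and `model.predict(x_ts)`.

```python
from sklearn.metrics import confusion_matrix

# made-up true labels and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# rows are true classes, columns are predicted classes
conf = confusion_matrix(y_true, y_pred)
print(conf)

# accuracy: correct predictions (the diagonal) over all predictions
accuracy = conf.diagonal().sum() / conf.sum()
print(accuracy)
```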

In [None]:
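# accuracy: correct predictions / all predictions
print(conf.diagonal().sum() / conf.sum())
# recall (sensitivity) of the positive class: true positives / actual positives
print(conf[1,1]/conf[1,:].sum())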

Finally, let's plot the ROC curve and calculate the area under the curve (AUC):

In [None]:
from sklearn.metrics import roc_curve, auc

pred = model.predict_proba(x_ts)
fpr, tpr, thresholds = roc_curve(y_ts, pred[:,1], pos_label=1)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, color='firebrick',
         lw=2, label='ROC curve (AUC = %0.2f)' %roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()