# Machine Learning 5: Real-life example

Our dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically **predict whether or not a patient has diabetes**, based on certain diagnostic measurements included in the dataset.

The dataset consists of several medical predictor variables and one target variable, **Outcome**. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Can you build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes?

The data is relatively clean, so we won't discuss pre-processing in detail here beyond the very basics.

First, let's **import the libraries** we will need. Then, we need to **read the csv file**.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

diabetes = pd.read_csv('diabetes.csv')

Let's **print the columns and the first few rows** of the dataset, to see what we're working with.

In [None]:
diabetes.____ 

In [None]:
diabetes.____()

Let's take a look at the Outcome variable that we want to predict. **How many people from our dataset have diabetes?**

In [None]:
diabetes.groupby('Outcome').size()

Do we have **missing values** in any columns?

In [None]:
diabetes.____().sum()

Let's look at the **distribution** of each feature to see if we find anything weird.

In [None]:
diabetes.hist(figsize=(9, 9))

From the histograms we can see `0` values for *Blood Pressure, BMI, Skin Fold Thickness, Insulin,* and *Blood Glucose*. Those values aren't realistic, and are most likely hidden **missing values**.
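One way to put numbers on what the histograms suggest is to count the zeros in each suspicious column. The snippet below is a minimal sketch on a small made-up frame: the values are invented for illustration, and `toy` and `suspect_cols` are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Invented toy values, used only to illustrate the technique.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 66],
    "Insulin": [0, 0, 0, 94],
})

# Hypothetical list of columns where a 0 cannot be a real measurement.
suspect_cols = ["Glucose", "BloodPressure", "Insulin"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```

On the real data, the same expression could be applied to `diabetes`, using the column names it actually contains.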

For *Blood Pressure, BMI,* and *Glucose*, only a **few values** are missing. We can safely **drop** those rows.

In [None]:
diabetes_mod = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]

It seems like there are a lot of `0` values in the *Insulin* column. Let's check how many.

In [None]:
diabetes[diabetes.Insulin == 0].shape[0]

We would **lose a lot of information** if we removed all rows where the Insulin value is zero. At the same time, Insulin seems too valuable a variable to drop completely. Let's keep it for now and see how it goes.

Now that the data is ready we can **train our model**. Let's try a model called Logistic Regression. 

First, we will create an `X` variable with the **features**, and a `y` variable with the **outcome**. We will need to split the data into **train and test sets**. There is a function `train_test_split` for this. Look it up in the scikit-learn documentation, which includes a usage example. We will use **accuracy** to evaluate the model. It is a reasonable choice here because the two outcome classes are not severely imbalanced.
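To show the general shape of a `train_test_split` call, here is a minimal sketch on invented toy data. The names `toy`, `feat_a`, `feat_b`, and `label` are hypothetical, and the 25% test fraction and the seed are arbitrary choices, not values prescribed by this exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 20 rows, two features, a binary label.
toy = pd.DataFrame({
    "feat_a": np.arange(20),
    "feat_b": np.arange(20) * 2.0,
    "label": [0, 1] * 10,
})

X_toy = toy[["feat_a", "feat_b"]]
y_toy = toy["label"]

# Hold out 25% of the rows for testing; random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)
```

The function returns four objects in a fixed order: training features, test features, training labels, test labels.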

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

X = diabetes_mod[feature_names]
y = diabetes_mod.____

X_train, X_test, y_train, y_test = ____

Now the fun part! 
To get our predictions we need to:
1. **Initialize** the model
2. **Fit** the model (using the input feature vectors `X` and the outcome vector `y` of the **training set**)
3. Use the trained model to **predict** outcome values for the **training set**, and then for the **test set** (here, we only provide the input features `X`; the model gives us the predicted `y` values)
4. **Compare** the values the model predicted with the actual outcome values. We will use the `accuracy_score` function. Look up its usage in the scikit-learn documentation.
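The four steps above can be sketched end to end on synthetic stand-in data. This is an illustration under assumptions, not the solution to the cells below: the data comes from `make_classification` rather than the diabetes file, the names `clf`, `X_syn`, and `y_syn` are hypothetical, and `max_iter=1000` is an arbitrary precaution against convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the diabetes dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# 1. Initialize the model.
clf = LogisticRegression(max_iter=1000)

# 2. Fit on the training features and training labels.
clf.fit(X_tr, y_tr)

# 3. Predict for both sets; only features are passed in.
pred_tr = clf.predict(X_tr)
pred_te = clf.predict(X_te)

# 4. Compare predictions with the true labels (true labels go first).
acc_tr = accuracy_score(y_tr, pred_tr)
acc_te = accuracy_score(y_te, pred_te)
print(acc_tr, acc_te)
```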

In [None]:
model = LogisticRegression()
model.fit(____, ____)

In [None]:
y_pred_train = model.predict(X_train)
accuracy_score(y_train, y_pred_train)

In [None]:
y_pred_test = model.____(____)
accuracy_score(____, y_pred_test)

## Congratulations on your first model! 
The accuracy on the test set is almost as good as the accuracy on the training set. That suggests the model doesn't **overfit** to the training set.
Let's try another model, one that is known for overfitting if not tuned well.

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.____(____, ____)

In [None]:
y_pred_train = tree.____(____)
accuracy_score(____, y_pred_train)

In [None]:
y_pred_test = tree.____(____)
accuracy_score(____, y_pred_test)

We can see that the model **fits the training data perfectly** (which is already suspicious), and performs much worse on the test set. One way to prevent overfitting is to force the model to be "**simpler**". For the Decision Tree algorithm, one indicator of complexity is the tree's "depth". Let's try limiting it to `3`.
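To see what the depth limit does, the sketch below fits an unrestricted tree and a depth-3 tree on synthetic stand-in data and records the depth each one actually reaches. The data, the `flip_y=0.1` label noise, and the names `results` and `clf` are assumptions made for illustration; the numbers it prints say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the diabetes dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

results = {}
for depth in (None, 3):  # None means the tree grows until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    results[depth] = {
        "depth": clf.get_depth(),  # depth actually reached after fitting
        "train_acc": accuracy_score(y_tr, clf.predict(X_tr)),
        "test_acc": accuracy_score(y_te, clf.predict(X_te)),
    }

for depth, row in results.items():
    print(depth, row)
```

The unrestricted tree keeps splitting until it has memorized the training labels, noise included, while the depth-3 tree is limited to at most three decisions per prediction.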

In [None]:
tree = DecisionTreeClassifier(max_depth=3)
tree.____(____, ____)

In [None]:
y_pred_train = tree.____(____)
accuracy_score(____, ____)

In [None]:
y_pred_test = tree.____(____)
accuracy_score(____, ____)

Now we can see that the accuracy **decreased** for the training set, but **increased** for the test set. This means our model **generalizes** better, and is likely to perform better on unseen cases.