<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/9%20-%20Classification/exercises/exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification

## Logistic Regression 

In this lab we will explore logistic regression, a well-known method for classification problems. We will work with a heart disease dataset and try to predict whether a patient has heart disease.


![Heart](https://img.webmd.com/dtmcms/live/webmd/consumer_assets/site_images/articles/health_tools/how_heart_disease_affects_your_body_slideshow/493ss_thinkstock_rf_heart_anatomy_illustration.jpg)


In [11]:
# Imports 
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import collections  as mc
%load_ext autoreload
%autoreload 2
import pandas as pd 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
sns.set_style("white")



In [12]:
# set random seed 
np.random.seed(72)

### Load data

We will start with the heart disease dataset. Here is a description of its attributes:

1. `age`
2. `sex`
3. `cp`: chest pain type (4 values)
4. `trestbps`: resting blood pressure
5. `chol`: serum cholesterol in mg/dl
6. `fbs`: fasting blood sugar > 120 mg/dl
7. `restecg`: resting electrocardiographic results (values 0,1,2)
8. `thalach`: maximum heart rate achieved
9. `exang`: exercise induced angina
10. `target`: presence of heart disease (1), absence of heart disease (0)

![ECG](https://media.eurekalert.org/multimedia_prod/pub/web/230705_web.jpg)






In [13]:
#Load data
# data-set: heart.csv
url = "https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/heart.csv"
df = pd.read_csv(url)
df.head()

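| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | M | D | 145 | 233 | yes | A | 150 | no | 1 |
| 1 | 37 | M | C | 130 | 250 | no | B | 187 | no | 1 |
| 2 | 41 | F | B | 130 | 204 | no | A | 172 | no | 1 |
| 3 | 56 | M | B | 120 | 236 | no | B | 178 | no | 1 |
| 4 | 57 | F | A | 120 | 354 | no | B | 163 | yes | 1 |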


In [None]:
# sklearn imports 
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

### Simple Logistic Regression

Let's start with only two features: age and maximum heart rate achieved (`thalach`). Define your features and target variable.

In [None]:
X = # TODO: select features
y = # TODO: select target

Split your data set into train and test subsets. 
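
As a minimal sketch of how `train_test_split` behaves, the example below uses a small made-up array rather than the heart data. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 10 rows, 2 features, binary labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.3 holds out 30% of the rows; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=72
)

print(X_tr.shape, X_te.shape)
```

The function returns four objects in a fixed order: train features, test features, train labels, test labels.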

In [None]:
from sklearn.model_selection import train_test_split   
"Your code here" = train_test_split(X, y, test_size=0.3, random_state=72)

#### Standardizing

When you standardize (or otherwise transform) the training data, you have to apply exactly the same transformation to the test data. Otherwise your test accuracy would be meaningless.

Here, applying the same standardization means that we rescale the test data with the mean and standard deviation computed on the training data, not with statistics computed on the test data itself.

Use `StandardScaler()` for standardization.
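
The idea can be sketched on a few made-up numbers (not the heart data). The variable names ending in `_demo` and `_std` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration only (not the heart data).
X_train_demo = np.array([[40.0, 150.0], [50.0, 160.0], [60.0, 170.0], [70.0, 180.0]])
X_test_demo = np.array([[55.0, 165.0], [80.0, 190.0]])

demo_scaler = StandardScaler()
demo_scaler.fit(X_train_demo)  # learn mean and std from the training set only

X_train_std = demo_scaler.transform(X_train_demo)
X_test_std = demo_scaler.transform(X_test_demo)  # reuse the training mean and std

print(demo_scaler.mean_)          # per-feature training means
print(X_train_std.mean(axis=0))   # approximately 0 for each feature
print(X_test_std)                 # scaled with the training statistics
```

The standardized test set will generally not have zero mean or unit variance, which is expected: it is expressed in the units learned from the training set.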

In [None]:
scaler = StandardScaler()
# TODO : fit normalizer 

Apply the standardization to your train and test set. 

In [None]:
X_train = # TODO: transform train set
X_test = # TODO: transform test set

Define your model. Use a logistic regression with 10-fold cross-validation (`cv=10`).

In [None]:
# logistic regression with 10 fold cross validation
LR_cv = LogisticRegressionCV(solver='lbfgs', cv="Your code here", max_iter=100)

Fit your model now using the train set. 

In [None]:
LR_cv.fit("Your code here", "Your code here")

Compare your train and test accuracy for your model.

In [None]:
# train accuracy with CV
LR_cv.score("Your code here", "Your code here")

In [None]:
# test accuracy with CV
LR_cv.score("Your code here", "Your code here")
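
The model definition, fitting and scoring steps above can be sketched end to end on synthetic data. Everything below (the random data, the labelling rule, and names such as `X_syn`, `clf_cv`, `train_acc`) is an assumption for illustration and does not reproduce the heart dataset results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (an assumption for illustration, not the heart data).
rng = np.random.default_rng(72)
X_syn = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_syn = (X_syn[:, 0] + 0.5 * X_syn[:, 1] + noise > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=72)

# Fit the scaler on the training set only, then transform both sets.
syn_scaler = StandardScaler().fit(X_tr)
X_tr_std = syn_scaler.transform(X_tr)
X_te_std = syn_scaler.transform(X_te)

# cv=10 selects the regularization strength C by 10-fold cross-validation
# on the training set.
clf_cv = LogisticRegressionCV(solver="lbfgs", cv=10, max_iter=100)
clf_cv.fit(X_tr_std, y_tr)

train_acc = clf_cv.score(X_tr_std, y_tr)
test_acc = clf_cv.score(X_te_std, y_te)
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Note that `score` takes both the features and the true labels, and returns the mean accuracy on that set.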

Have a look at the class distribution.

In [None]:
import warnings
warnings.filterwarnings('ignore')
#Your code here

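Compute the base rate, i.e. the accuracy obtained by always predicting the most frequent class.

$$\text{Base rate} = \frac{\text{Number of observations in the most frequent class}}{\text{Total number of observations}}$$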
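
As a small worked example on hypothetical labels (6 positives and 4 negatives, not the heart data), the base rate is 6 / 10 = 0.6. The names `labels`, `class_counts` and `base_rate` are illustrative.

```python
import pandas as pd

# Hypothetical labels for illustration: 6 positives and 4 negatives.
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 0, 1, 0], name="target")

class_counts = labels.value_counts()  # number of observations per class
base_rate = class_counts.max() / class_counts.sum()

print(class_counts.to_dict())
print("Base rate =", base_rate)
```

A classifier is only useful if its test accuracy is clearly above this value.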

In [None]:
#Compute the base rate

nbr_heart_disease = df.loc[df["target"] == 1].shape[0]
print("#Heart disease = ", "Your code here")

nbr_no_heart_disease = df.loc[df["target"] == 0].shape[0]
print("#No heart disease = ", "Your code here")

print("Baserate = ", max("Your code here", "Your code here")/("Your code here" + "Your code here"))

Use the `confusion_matrix` function to compute the confusion matrix.

In [None]:
#Confusion matrix
from sklearn.metrics import confusion_matrix
cf = confusion_matrix("Your code here", LR_cv.predict("Your code here"))
print(cf)
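
To read the output, it helps to try `confusion_matrix` on a tiny hand-made example first. The labels below are hypothetical. In scikit-learn, rows correspond to true classes and columns to predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print("accuracy =", (tn + tp) / cm_demo.sum())
```

The diagonal holds the correctly classified observations, so accuracy is the diagonal sum divided by the total.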

Plot the confusion matrix.

In [None]:
ax = "Your code here"
r = sns.heatmap("Your code here")

"Your code here"('Predicted labels')
"Your code here"('True labels')
"Your code here"('Confusion Matrix')
"Your code here"
"Your code here"


### Decision boundary

Because we used only two features for classification, we can draw the linear decision boundary learned by the logistic regression in a 2D plot. The plot also shows which training points are misclassified. Let's fit a plain logistic regression on the same two features and plot its decision boundary.

In [None]:
#Decision boundaries

model = LogisticRegression()
model.fit("Your code here", "Your code here")


# X_train is a NumPy array after scaling, so positional slicing works here
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.Paired)
ax = plt.gca()
x_vals = np.array(ax.get_xlim())
y_vals = (-x_vals * model.coef_[0][0] - model.intercept_[0])/model.coef_[0][1]
plt.plot(x_vals, y_vals, '--', c="red")

plt.xlabel("Your code here")
plt.ylabel("Your code here")


plt.xticks()
plt.yticks()

plt.show()
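
The formula used for `y_vals` comes from setting the linear score to zero: the boundary is the set of points where `coef_[0][0] * x1 + coef_[0][1] * x2 + intercept_[0] = 0`, which is where the predicted probability equals 0.5. The sketch below checks this on synthetic data. The data and the names `clf`, `w1`, `w2`, `b` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (an assumption for illustration, not the heart data).
rng = np.random.default_rng(72)
X_syn = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_syn = (X_syn[:, 0] - X_syn[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X_syn, y_syn)
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

# Points on the boundary satisfy w1*x1 + w2*x2 + b = 0; solve for x2.
x1_vals = np.array([-2.0, 0.0, 2.0])
x2_vals = -(w1 * x1_vals + b) / w2
boundary_points = np.column_stack([x1_vals, x2_vals])

# On the boundary the model is undecided: P(class 1) is 0.5.
proba_on_boundary = clf.predict_proba(boundary_points)[:, 1]
print(proba_on_boundary)
```

Points on one side of the line get a probability above 0.5 and are assigned to class 1; points on the other side are assigned to class 0.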

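Predict the class and the class probabilities for two patients: Age = 50, Thalach = 130 and Age = 70, Thalach = 160. If the model was fitted on standardized features, transform these values with the same scaler before predicting.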

In [None]:
print(model.predict([["Your code here"],["Your code here"]]))
print(model.predict_proba([["Your code here"],["Your code here"]]))
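
A minimal sketch of the prediction step, assuming the model was trained on standardized features. The training data here is a synthetic stand-in for (age, maximum heart rate) with a made-up labelling rule, so the printed values say nothing about real patients. `raw_scaler`, `clf` and `new_points` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (age, max heart rate); values and labels are made up.
rng = np.random.default_rng(72)
X_raw = np.column_stack([rng.uniform(30, 80, 200), rng.uniform(90, 200, 200)])
noise = rng.normal(scale=20, size=200)
y_syn = (X_raw[:, 1] - X_raw[:, 0] + noise > 90).astype(int)

raw_scaler = StandardScaler().fit(X_raw)
clf = LogisticRegression().fit(raw_scaler.transform(X_raw), y_syn)

# New observations must go through the same scaler before prediction.
new_points = np.array([[50, 130], [70, 160]])
new_points_std = raw_scaler.transform(new_points)

pred_classes = clf.predict(new_points_std)
pred_proba = clf.predict_proba(new_points_std)  # columns follow clf.classes_

print(pred_classes)
print(pred_proba)
```

Each row of `predict_proba` sums to 1, and `predict` returns the class with the larger probability.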

Now let's try more numerical features and see whether the accuracy improves.
We will now use `age`, `thalach`, `trestbps` and `chol`.
Define your features and your target variable.

In [None]:
X = df[["Your code here"]]
y = df["Your code here"]

Split your data set into train and test subsets.

In [None]:
"Your code here" = train_test_split(X, y, test_size=0.2, random_state=72)

Standardize your data.

In [None]:
scaler = StandardScaler()
#TODO: fit normalizer 

In [None]:
X_train = # TODO: transform train set
X_test = # TODO: transform test set

Fit your model using the train data. Let's use the logistic regression with cross validation here. 

In [None]:
LR_cv.fit("Your code here", "Your code here")

Compare your train and test accuracy.

In [None]:
# train accuracy
LR_cv.score("Your code here", "Your code here")

In [None]:
# test accuracy
LR_cv.score("Your code here", "Your code here")

Finally, show the confusion matrix.

In [None]:
#Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix("Your code here", LR_cv.predict("Your code here"))