# Module 7: Exercise A

In this exercise, you will practice tree-based methods for classification.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.metrics import classification_report, accuracy_score, f1_score

## Data Preprocessing

We will analyze a stroke data set, which has a binary target (stroke: yes or no) and consists of patient information.

In [2]:
stroke = pd.read_csv('stroke_data.csv')

>__Task 1__
>
>Count the number of NAs for each column

In [None]:
...
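
As a hint, one common way to count missing values per column is to chain `isna()` and `sum()`. The sketch below runs on a small hypothetical frame named `toy` (not the stroke data), so the column values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the stroke data) with a few missing entries
toy = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 67.0],
    "bmi": [22.1, np.nan, np.nan, 30.4],
    "stroke": [0, 0, 1, 1],
})

# isna() marks missing cells as True; sum() then counts them column by column
na_counts = toy.isna().sum()
print(na_counts)
```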

>__Task 2__
>
>Visualize the distribution of __stroke__ with and without missing values of __bmi__ using a count plot

In [None]:
...

This is a highly imbalanced data set. Ideally, it should be augmented, or more data points with `stroke=1` should be collected. For this assignment, let's move on and observe the results with the imbalanced data. To handle the missing values, we can either drop the affected rows or fill __bmi__ with its mean.

>__Task 3__
>
>Fill the missing values of __bmi__ with the mean

In [None]:
...
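
A minimal sketch of mean imputation, assuming a hypothetical single-column frame `toy` rather than the stroke data. `Series.mean()` skips missing values by default, so the mean is computed from the observed entries only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (not the stroke data)
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, np.nan]})

# mean() ignores NaN by default, so only 20.0 and 30.0 contribute
bmi_mean = toy["bmi"].mean()

# Replace every missing entry with that mean
toy["bmi"] = toy["bmi"].fillna(bmi_mean)
print(toy)
```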

To apply the supervised learning models, we need the categorical data represented with numeric labels.

>__Task 4__
>
>Convert columns to numerical values: __gender__, __ever_married__, __work_type__, __Residence_type__, __smoking_status__
>
>- Check unique values in each column
>- For __gender__, __ever_married__, and __Residence_type__
>     - Use the dictionaries `{'Female':1, 'Male':0}`, `{'Yes':1, 'No':0}`, and `{'Urban':1, 'Rural':0}`, respectively, to encode the values and replace the original columns
>- For __work_type__ and __smoking_status__
>     - Apply one-hot encoding (dummy variables)
>     - Remember to add prefix, otherwise it will be hard to link the new column names back to the original column. 
>     - Drop the original ones, so you don't mix up these columns. We only need the transformed ones for modelling.

In [None]:
...
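
The two encoding styles above can be sketched as follows. This assumes a hypothetical two-column frame `toy` with made-up rows; it shows one binary column mapped through a dictionary and one multi-class column expanded with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame (not the stroke data)
toy = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "Govt_job", "Private"],
})

# Inspect the categories before encoding
print(toy["work_type"].unique())

# Binary column: map each label to 0/1 and overwrite the original column
toy["ever_married"] = toy["ever_married"].map({"Yes": 1, "No": 0})

# Multi-class column: one-hot encode with a prefix so the new columns
# can be traced back to their source, then drop the original column
dummies = pd.get_dummies(toy["work_type"], prefix="work_type", dtype=int)
toy = pd.concat([toy.drop(columns="work_type"), dummies], axis=1)
print(toy)
```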

You can check the result yourself. If you have done everything properly, you should get a cleaned data set like the following.

In [9]:
stroke.head()

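| | id | gender | age | hypertension | heart_disease | ever_married | Residence_type | avg_glucose_level | bmi | stroke | work_type_Govt_job | work_type_Never_worked | work_type_Private | work_type_Self-employed | work_type_children | smoke_Unknown | smoke_formerly smoked | smoke_never smoked | smoke_smokes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59437 | 1 | 57.0 | 0 | 0 | 1 | 1 | 221.89 | 37.3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 71719 | 0 | 66.0 | 0 | 0 | 1 | 0 | 57.17 | 25.5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 26723 | 1 | 57.0 | 0 | 0 | 1 | 1 | 83.14 | 31.9 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 42899 | 0 | 78.0 | 0 | 0 | 1 | 1 | 133.19 | 23.6 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 12674 | 0 | 44.0 | 0 | 0 | 1 | 0 | 74.15 | 34.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |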


>__Task 5__
>
>Visualize __age__ distribution of two groups (stroke vs non-stroke) in one single plot 
>
>What can you learn from the plot?

In [None]:
...

### Train/Test Split

>__Task 6__
>
>- Assign the __stroke__ to `y`, and the rest (except __id__) to `X`
>- Split with an 80:20 (train:test) ratio and set `random_state=89`
>- Remember to use a stratified split, as most records in the data set are non-stroke
>- Make sure your function returns `X_train`, `X_test`, `y_train`, `y_test`

In [None]:
...
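
A minimal sketch of a stratified 80:20 split with scikit-learn's `train_test_split`. The frame `toy`, its columns, and the 10% positive rate are assumptions for illustration only. Passing `stratify=y` keeps the class ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 10 positives out of 100 rows
toy = pd.DataFrame({
    "x1": range(100),
    "x2": range(100, 200),
    "target": [1] * 10 + [0] * 90,
})

X = toy.drop(columns="target")
y = toy["target"]

# stratify=y preserves the 10% positive rate in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)
print(y_train.mean(), y_test.mean())
```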

## Decision Tree Classifier

>__Task 7__
>
>- Instantiate a decision tree classifier with `max_depth=2` and `random_state=89`
>- Print a text report showing the rules of the tree
>- Plot the tree

In [None]:
...
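
As a rough illustration of the text report, the sketch below fits a shallow tree on synthetic data from `make_classification` and prints its rules with `export_text`. The data and the feature names `f0` to `f3` are hypothetical. For a graphical version, `sklearn.tree.plot_tree` draws the same fitted tree with matplotlib.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (not the stroke data)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

# A depth-2 tree has at most three split nodes, so its rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=89).fit(X, y)

# export_text returns the decision rules as an indented string
rules = export_text(tree, feature_names=feature_names)
print(rules)
```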

### Feature Importance

>__Task 8__
>
>- Map feature names to their importance scores
>- Print and plot feature importance

In [None]:
...
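
One possible way to pair feature names with importance scores is a pandas `Series` indexed by name. This sketch assumes synthetic data and hypothetical feature names. The sorted series can then be passed to a bar plot, for example with `importances.plot.barh()`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the stroke data)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

tree = DecisionTreeClassifier(max_depth=2, random_state=89).fit(X, y)

# feature_importances_ follows the column order of X, so the names line up
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```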

### Performance Evaluation

>__Task 9__
>
>- Predict the class (0 or 1) on the test set
>- Predict the class probabilities
>- Calculate the accuracy of the model

In [None]:
...
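
The difference between class predictions and class probabilities can be sketched like this, again on synthetic data rather than the stroke data. `predict` returns one label per row, while `predict_proba` returns one column per class, with each row summing to 1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the stroke data)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)
clf = DecisionTreeClassifier(max_depth=2, random_state=89).fit(X_train, y_train)

y_pred = clf.predict(X_test)         # hard labels, shape (n_test,)
y_proba = clf.predict_proba(X_test)  # class probabilities, shape (n_test, 2)

# Accuracy is the share of test rows whose predicted label is correct
acc = accuracy_score(y_test, y_pred)
print(acc)
```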

### The Best Performing Depth

>__Task 10__
>
>Find the depth that maximizes accuracy
>
>- Fill in a for loop that iterates over the `k` argument
>- Instantiate the model with `max_depth=k`
>- Fit the model on the train set
>- Predict on both the train and test sets
>- Calculate accuracies by comparing the predictions with `y_train` and `y_test`, respectively
>- Plot the results
>- Print the best performing depth and its accuracy

In [None]:
...
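
The loop described above might look like the following sketch. It assumes synthetic data and a hypothetical depth range of 1 to 10; the exercise does not prescribe the range. Train accuracy usually keeps rising with depth while test accuracy levels off, which is why both are recorded.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the stroke data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)

depths = list(range(1, 11))  # assumed search range
train_acc, test_acc = [], []
for k in depths:
    model = DecisionTreeClassifier(max_depth=k, random_state=89)
    model.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, model.predict(X_train)))
    test_acc.append(accuracy_score(y_test, model.predict(X_test)))

# index() returns the first maximum, so ties go to the shallowest tree
best_idx = test_acc.index(max(test_acc))
best_depth = depths[best_idx]
print(f"best depth: {best_depth}, test accuracy: {test_acc[best_idx]:.3f}")
```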

---

## Ensemble Methods

### Bagging

>__Task 11__
>
>Build a bagging model for a classification task
>
>- Set 100 base estimators and `random_state=89`
>- Calculate accuracy on the test set

In [None]:
...
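
A minimal bagging sketch on synthetic data, assuming scikit-learn's `BaggingClassifier` with its default base estimator (a decision tree). Each of the 100 trees is trained on a bootstrap sample of the train set and the predictions are aggregated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the stroke data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)

# Default base estimator is a decision tree; 100 bootstrap replicas
bag = BaggingClassifier(n_estimators=100, random_state=89)
bag.fit(X_train, y_train)

bag_acc = accuracy_score(y_test, bag.predict(X_test))
print(bag_acc)
```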

>__Task 12__
>
>Fit different bagging models by changing `n_estimators` parameter in a loop
>
>- Set a range between 50 and 160 with a step of 10
>- Instantiate a bagging model with `n_estimators=n` and `random_state=89`
>- Fit the model on the train set
>- Predict on both the train and test sets
>- Add train accuracy to `accuracy['train_acc']` and test accuracy to `accuracy['test_acc']`
>- Print the number of estimators and its accuracy

In [None]:
...

### Random Forest

>__Task 13__
>
>Build a random forest model for a classification task
>
>- Set 100 base estimators, 10 features per split, and `random_state=89`
>- Calculate accuracy and F1 score on the test set

In [None]:
...
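
A random forest sketch under the same assumptions as before: synthetic data, here with 12 features so that `max_features=10` is a valid setting. Unlike plain bagging, each split considers only a random subset of the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the stroke data) with 12 features
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)

# max_features=10: each split draws 10 candidate features at random
rf = RandomForestClassifier(n_estimators=100, max_features=10, random_state=89)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred)
rf_f1 = f1_score(y_test, y_pred)  # F1 of the positive class (label 1)
print(rf_acc, rf_f1)
```

On imbalanced data, the F1 score is often more informative than accuracy, because a model that always predicts the majority class can still reach high accuracy.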

>__Task 14__
>
>Check feature importance of the random forest model and plot the results

In [None]:
...

>__Task 15__
>
>Tune the hyperparameters of the random forest model
>
>- Define the classifier with `random_state=89`
>- Define the parameter grid with:
>     - `max_depth` range `(5,30,5)` 
>     - `n_estimators` range `(50,210,50)`
>- Define the grid search with the parameter grid and set:
>     - `accuracy` as the evaluation score
>     - `n_jobs=-1`
>     - 5-fold cross-validation
>     - `verbose=1`
>     - `return_train_score=True`
>- Fit the grid search to train set
>- Print the best resulting parameters

In [None]:
...
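
The grid search steps above can be sketched with `GridSearchCV`. To keep the run short, this example assumes synthetic data and a deliberately smaller, hypothetical grid than the one in the task, and it leaves `n_jobs` at its default; passing `n_jobs=-1` would use all available cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the stroke data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)

clf = RandomForestClassifier(random_state=89)

# Hypothetical reduced grid; the task uses wider ranges
param_grid = {
    "max_depth": [2, 4],
    "n_estimators": [10, 20],
}

search = GridSearchCV(
    clf,
    param_grid,
    scoring="accuracy",
    cv=5,                     # 5-fold cross-validation on the train set
    verbose=1,
    return_train_score=True,  # also store train scores in cv_results_
)
search.fit(X_train, y_train)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a model refit on the whole train set with the best parameters, so it can be used directly for prediction.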

>__Task 16__
>
>Predict on test set using `.best_estimator_` and print the accuracy and F1 score of the tuned model

In [None]:
...

### Gradient Boosting

>__Task 17__
>
>Build a gradient boosting model for a classification task
>
>- Set a learning rate of 0.01, 30 base estimators, 5 features per split, a maximum depth of 5, and `random_state=89`
>- Calculate accuracy and F1 score on the test set

In [None]:
...
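
A gradient boosting sketch on synthetic data, assuming scikit-learn's `GradientBoostingClassifier`. Trees are added one after another, and each new tree's contribution is scaled by the learning rate, so a small rate with few trees tends to give a cautious model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the stroke data) with 8 features
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=89
)

gb = GradientBoostingClassifier(
    learning_rate=0.01,  # shrinks the contribution of each tree
    n_estimators=30,     # number of boosting stages
    max_features=5,      # features considered at each split
    max_depth=5,         # depth of each individual tree
    random_state=89,
)
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
gb_acc = accuracy_score(y_test, y_pred)
gb_f1 = f1_score(y_test, y_pred)
print(gb_acc, gb_f1)
```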

>__Task 18__
>
>Find the best performing learning rate
>
>- Define the classifier with `random_state=89`
>- Define the parameter grid with:
>     - `max_depth` range `(5,30,5)` 
>     - `n_estimators` range `(50,210,50)`
>     - `learning_rate` range `(0.01,0.31,0.1)`
>- Define the grid search with the parameter grid and set:
>     - `accuracy` as the evaluation score
>     - `n_jobs=-1`
>     - 5-fold cross-validation
>     - `verbose=1`
>     - `return_train_score=True`
>- Fit the grid search to train set
>- Print the best resulting parameters

In [None]:
...

>__Task 19__
>
>Predict on test set using `.best_estimator_` and print the accuracy and F1 score of the tuned model

In [None]:
...