# <span style='color:Blue'> Classification  </span>

In this notebook, we will be working with the [Fetal Health Dataset](https://www.kaggle.com/andrewmvd/fetal-health-classification).
This dataset contains 2126 records of features extracted from cardiotocogram exams, which were then classified by three expert obstetricians into 3 classes:

* `Normal`

* `Suspect`

* `Pathological`

### 1. Import Libraries
To develop our classification model, we need to import the necessary Python libraries:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix  # classification metrics used below
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
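try:
    # Newer IPython versions ship this helper in matplotlib_inline
    from matplotlib_inline.backend_inline import set_matplotlib_formats
except ImportError:
    from IPython.display import set_matplotlib_formats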

%matplotlib inline
set_matplotlib_formats('svg')
sns.set_style("darkgrid")

### 2. Load Data

Load and show the dataset

In [None]:
data = pd.read_csv('fetal_health.csv')
data.head()

Split the dataset into the source variables (independent variables) and the target variable (dependent variable):

In [None]:
X = data.iloc[:,:-1]
Y = data.iloc[:,-1]

Before we build the model, we need to split the data into a training set and a test set. We use the training set to fit the classification model and the test set to check how well the model performs on unseen data. We will use 67% of the data for training and the rest for testing. We also use a stratified train-test split, which preserves the class proportions of the original dataset in both the training and the test set.

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, stratify=Y, random_state=10)
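
To make the effect of stratification concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the helper names `stratified_split` and `proportions` and the label counts are made up for illustration. It splits each class separately so that the class proportions carry over to both parts.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.33, seed=10):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    # Shuffle and cut every class on its own, then merge the pieces
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up imbalanced labels (not the fetal health data): 70% / 20% / 10%
labels = [1] * 700 + [2] * 200 + [3] * 100
train_idx, test_idx = stratified_split(labels)

def proportions(idx):
    """Share of each class among the given row indices."""
    counts = Counter(labels[i] for i in idx)
    return {k: round(v / len(idx), 3) for k, v in sorted(counts.items())}

print(proportions(range(len(labels))))
print(proportions(train_idx))
print(proportions(test_idx))
```

All three printed dictionaries show the same class shares, which is the property that `stratify=Y` asks `train_test_split` to preserve.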

In [None]:
f, ax = plt.subplots()
sns.histplot(data.iloc[:,-1],kde=False,label='All', ax=ax)
sns.histplot(y_train+.05, kde=False, label='train', color='green', ax=ax)
sns.histplot(y_test+.1, kde=False, label='test', color='orange', ax=ax)
plt.xlabel('fetal_health')
plt.ylabel('Frequency')
plt.xticks([1, 2, 3],['Normal','Suspect','Pathological'])
plt.title("Distribution of Classes")
plt.legend()

In [None]:
#from imblearn.over_sampling import RandomOverSampler

#ros = RandomOverSampler(random_state=0)
#X_data, y_data = ros.fit_resample(data.iloc[:,:-1], data.iloc[:,-1])
#X_train, y_train = ros.fit_resample(X_train, y_train)

In [None]:
#f, ax = plt.subplots()
#sns.histplot(y_data,kde=False,label='All', ax=ax)
#sns.histplot(y_train+.05, kde=False, label='train', color='green', ax=ax)
#sns.histplot(y_test+.1, kde=False, label='test', color='orange', ax=ax)
#plt.xlabel('fetal_health')
#plt.ylabel('Frequency')
#plt.xticks([1, 2, 3],['Normal','Suspect','Pathological'])
#plt.title("Distribution of Classes")
#plt.legend()

---
### 3. Random Forest Classification

![Random Forest Image](randomforest.png)

A random forest consists of multiple decision trees, each of which predicts the label from the input features. The forest then combines the predictions of its trees, either by majority vote or by averaging the predicted class probabilities. Because the forest is an "ensemble" of a large number of trees, this prediction scheme helps to avoid overfitting. Each tree is given a subset of the samples and/or features in order to create an "expert tree" that is fit to a portion of the dataset.
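
The two ingredients described above, random subsets of samples and combining the trees' outputs, can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helpers `bootstrap_indices` and `majority_vote` and the votes are hypothetical, and scikit-learn's random forest averages class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(votes):
    """Return the most common label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)
# Sampling with replacement: some rows can repeat while others are left out
print(sorted(sample), "unique rows:", len(set(sample)))

# Hypothetical votes of five trees for a single exam
votes = ["Normal", "Suspect", "Normal", "Normal", "Pathological"]
print(majority_vote(votes))  # Normal
```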

We now define a random forest with a pre-defined number of trees ("n_estimators") and a maximum depth per tree ("max_depth"). The maximum number of features considered at each split ("max_features") is left at its default. The [scikit-learn documentation of `RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) explains what these parameters mean.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
#from imblearn.ensemble import BalancedRandomForestClassifier

# Fitting Random Forest Classification to the Training set
rfc = RandomForestClassifier(n_estimators = 100, criterion = 'gini', random_state = 42, max_depth=2)
rfc.fit(X_train, y_train)

y_pred_train = rfc.predict(X_train)
y_pred_test = rfc.predict(X_test)

acc_train = accuracy_score(y_train, y_pred_train)

print(f'Mean accuracy train score: {acc_train:.3}')

acc_test = accuracy_score(y_test, y_pred_test)

print(f'Mean accuracy test score: {acc_test:.3}')


from sklearn.metrics import confusion_matrix

conf_train = confusion_matrix(y_train,y_pred_train)
conf_test = confusion_matrix(y_test,y_pred_test)

fg, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))
sns.heatmap(conf_train, annot=True, fmt="d", ax=ax1)
ax1.set(xlabel="predicted label")
ax1.set_xticklabels(['Normal','Suspect','Pathological'])
ax1.set_yticklabels(['Normal','Suspect','Pathological'])
ax1.set(ylabel="actual label")
ax1.set(title="Confusion Matrix for training set")
sns.heatmap(conf_test, annot=True, fmt="d", ax=ax2)
ax2.set(xlabel="predicted label")
ax2.set(ylabel="actual label")
ax2.set_xticklabels(['Normal','Suspect','Pathological'])
ax2.set_yticklabels(['Normal','Suspect','Pathological'])
ax2.set(title="Confusion Matrix for test set");

We can access each individual tree in our forest as shown below. You can see that the trees are not pruned (`ccp_alpha=0`).

In [None]:
rfc.estimators_[98:100]

**Cocalc Task**: Perform a grid search to come up with better parameters for our dataset than the pre-defined ones. 

Bonus: Are there other parameters of the trees that you can change to improve the results on our dataset? Which parameters have the most / least influence on the performance?

In [None]:
gsN = [50, 100, 500, 1000]
gsK = [2, "sqrt", "log2"]  # "sqrt" replaces "auto", which newer scikit-learn versions no longer accept


### Validation
cval = KFold(n_splits=3, random_state=42, shuffle=True)

model = RandomForestClassifier(n_jobs=-1)

param_grid = {'n_estimators': gsN,
              'max_features': gsK}

search = GridSearchCV(model, param_grid, n_jobs=-1,cv=cval,return_train_score=True)
search.fit(X_train, y_train)
print("Best cross-validation accuracy = %0.3f" % search.best_score_)
print(search.best_params_)

# Check the result for our test set
best_estimator = search.best_estimator_
y_pred_train = best_estimator.predict(X_train)
y_pred_test = best_estimator.predict(X_test)

acc_train = accuracy_score(y_train, y_pred_train)

print(f'Mean accuracy train score: {acc_train:.3}')

acc_test = accuracy_score(y_test, y_pred_test)

print(f'Mean accuracy test score: {acc_test:.3}')
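
To get a feeling for the cost of such a search, the sketch below enumerates the candidates of a grid of the same shape in plain Python. The grid values are assumptions for illustration, and the variable names are not part of scikit-learn. Every candidate is fitted once per fold, and with the default `refit=True` the best candidate is fitted one more time on the full training set.

```python
from itertools import product

# Grid with the same shape as the one searched above (values are illustrative)
param_grid = {
    "n_estimators": [50, 100, 500, 1000],
    "max_features": [2, "sqrt", "log2"],
}
n_splits = 3  # cross-validation folds per candidate

# The Cartesian product of all value lists gives every candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 12 candidates
print(len(candidates) * n_splits)  # 36 cross-validation fits
print(candidates[0])
```

Adding one more parameter with three values would triple the number of fits, which is why large grids with 1000-tree forests become slow quickly.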

In [None]:
conf_train = confusion_matrix(y_train,y_pred_train)
conf_test = confusion_matrix(y_test,y_pred_test)

fg, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))
sns.heatmap(conf_train, annot=True, fmt="d", ax=ax1)
ax1.set(xlabel="predicted label")
ax1.set_xticklabels(['Normal','Suspect','Pathological'])
ax1.set_yticklabels(['Normal','Suspect','Pathological'])
ax1.set(ylabel="actual label")
ax1.set(title="Confusion Matrix for training set")
sns.heatmap(conf_test, annot=True, fmt="d", ax=ax2)
ax2.set(xlabel="predicted label")
ax2.set(ylabel="actual label")
ax2.set_xticklabels(['Normal','Suspect','Pathological'])
ax2.set_yticklabels(['Normal','Suspect','Pathological'])
ax2.set(title="Confusion Matrix for test set");

---

### 4. Feature Importance

An interesting "free candy" of trees is that you can rank the importance of each feature. This is given by the `feature_importances_` attribute of the estimator in sklearn, which gives you an impurity-based ranking of the features. If you would like more statistics on the importances, you can retrieve the importances of all trees in the forest and then calculate their standard deviation.
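
As a rough illustration of what "impurity-based" means, the sketch below computes the Gini impurity decrease of a single split. The class counts are made up and the helper names `gini` and `impurity_decrease` are hypothetical. In a fitted tree, decreases like this one are accumulated per feature over all splits and then normalized to give the importances.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n, n_left, n_right = sum(parent), sum(left), sum(right)
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)

# Made-up class counts [Normal, Suspect, Pathological] at one node
parent = [60, 30, 10]
left, right = [55, 5, 0], [5, 25, 10]

print(round(gini(parent), 3))                            # 0.54
print(round(impurity_decrease(parent, left, right), 3))  # 0.236
```

A feature that is used for many splits with large decreases ends up with a high importance; a feature that is never chosen for a split gets an importance of zero.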

In [None]:
# Fit the forest

rfc = search.best_estimator_

# Visualize the feature importances
importances = rfc.feature_importances_

std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, data.drop('fetal_health',axis=1).columns[indices[f]], importances[indices[f]]))

# Plot the impurity-based feature importances of the forest
plt.figure()
plt.title("Feature importances for Random Forest Classification")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), data.drop('fetal_health',axis=1).columns[indices],rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()

In [None]:
from sklearn.tree import export_graphviz

# Extract single tree
estimator = rfc.estimators_[2]

export_graphviz(estimator, out_file='tree.dot', 
                feature_names = list(X.columns),
                class_names = ['Normal', 'Suspect', 'Pathological'],  # one name per class, in ascending label order
                rounded = True, proportion = False, 
                precision = 2, filled = True)

from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])


# Display in jupyter notebook
#from IPython.display import Image
#Image(filename = 'tree.png')


from sklearn.tree import plot_tree
plt.figure()
plot_tree(rfc.estimators_[2], filled=True, feature_names=list(X_train.columns))
plt.show()

### Analysis of the stability of the features
As always, our fitted model depends on the samples that we used for training. Ideally, the ranking of the features is not affected by this, but it is always advisable to check. In real-world datasets it can happen that your features are not stable and change their ranking with varying training data.

**Cocalc Task:** How stable is the feature importance when we use different train/test splits? 

In [None]:
n_estimators = 1000
max_features = 3

kf = KFold(n_splits=3, shuffle=True)
f, axs = plt.subplots(1,3,figsize=(12,3))

for num, [train_index, test_index] in enumerate(kf.split(X)):
    
    # split into training and testing data
    X_train = X.iloc[train_index, :]
    y_train = Y[train_index]
    
    # not needed but we still do it for completeness
    X_test = X.iloc[test_index, :]
    y_test = Y[test_index]
    
    print("For fold: %s" %(num))
    
    # Fit the forest
    rfc = RandomForestClassifier(n_estimators=n_estimators, max_features=max_features, n_jobs=-1)
    rfc.fit(X_train, y_train)
    
    # Visualize the feature importances
    importances = rfc.feature_importances_
    
    std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]

    # Print the feature ranking
    print("Feature ranking:")

    #for f in range(X.shape[1]):
    #    print("%d. %s (%f)" % (f + 1, data.drop('fetal_health',axis=1).columns[indices[f]], importances[indices[f]]))

    # Plot the impurity-based feature importances of the forest
    #plt.figure()
    
    axs[num].set_title("Feature importances for Random\nForest Classification - Fold %s " %(num))
    axs[num].bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    axs[num].set_xticks(range(X.shape[1]))
    axs[num].set_xticklabels( data.drop('fetal_health',axis=1).columns[indices],rotation=90)
    axs[num].set_xlim([-1, X.shape[1]])
plt.show()
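
Comparing bar plots by eye works for a few folds. One possible way to put a number on the stability is the Spearman rank correlation between the importance rankings of two folds. The sketch below is a minimal pure-Python version under the assumption that there are no tied importances; the helper names and the importance values are made up and are not results from this dataset.

```python
def ranks(importances):
    """Rank features from 1 (most important) downwards; assumes no ties."""
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    rank = [0] * len(importances)
    for position, feature in enumerate(order, start=1):
        rank[feature] = position
    return rank

def spearman(a, b):
    """Spearman rank correlation of two score lists without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Made-up importances of four features in two folds (not real results)
fold_0 = [0.40, 0.30, 0.20, 0.10]
fold_1 = [0.35, 0.15, 0.30, 0.20]

print(ranks(fold_0), ranks(fold_1))
print(spearman(fold_0, fold_1))
```

A value close to 1 means that both folds rank the features almost identically, while values near 0 indicate an unstable ranking.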

---
### 5. Adaptive Boosting (AdaBoost)

In contrast to bagging, the initial formulation of the boosting algorithm uses random subsets of training examples drawn from the training dataset without replacement. In contrast to the original boosting procedure, AdaBoost uses the complete training dataset to train the weak learners, where the training examples are reweighted in each iteration to build a strong classifier that learns from the mistakes of the previous weak learners in the ensemble.
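
The reweighting step can be illustrated with one round of the textbook binary AdaBoost update. This is a simplified sketch, not what scikit-learn runs internally: its `AdaBoostClassifier` uses a multi-class variant and additionally scales each step by `learning_rate`. The function name `adaboost_round` and the example data are hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One reweighting round of binary AdaBoost.

    weights: current sample weights (they sum to 1)
    correct: True where the weak learner classified the sample correctly
    """
    # Weighted error of the weak learner
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Say of this weak learner in the final vote
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct samples, grow those of mistakes
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Ten samples with uniform weights; the stump gets the last three wrong
weights = [0.1] * 10
correct = [True] * 7 + [False] * 3

alpha, new_weights = adaboost_round(weights, correct)
print(round(alpha, 3))
print([round(w, 3) for w in new_weights])
```

After the update the three misclassified samples together carry half of the total weight, so the next weak learner is pushed to focus on them.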

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, stratify=Y, random_state=10)

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.1,
    'random_state': 1
}

tree = DecisionTreeClassifier(criterion='entropy',
                              random_state=1,
                              max_depth=1)

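# Pass the weak learner positionally: the keyword `base_estimator`
# was renamed to `estimator` in newer scikit-learn versions
ada = AdaBoostClassifier(tree,
                         **ada_params)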

tree = tree.fit(X_train, y_train)

y_pred_train = tree.predict(X_train)
y_pred_test = tree.predict(X_test)

acc_train = accuracy_score(y_train, y_pred_train)

print(f'Mean accuracy train score: {acc_train:.3}')

acc_test = accuracy_score(y_test, y_pred_test)

print(f'Mean accuracy test score: {acc_test:.3}')

In [None]:
ada = ada.fit(X_train, y_train)

y_pred_train = ada.predict(X_train)
y_pred_test = ada.predict(X_test)

acc_train = accuracy_score(y_train, y_pred_train)

print(f'Mean accuracy train score: {acc_train:.3}')

acc_test = accuracy_score(y_test, y_pred_test)

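print(f'Mean accuracy test score: {acc_test:.3}')

# Recompute the confusion matrices for the AdaBoost predictions;
# otherwise the heatmaps would still show the random forest results
conf_train = confusion_matrix(y_train, y_pred_train)
conf_test = confusion_matrix(y_test, y_pred_test)

fg, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))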
sns.heatmap(conf_train, annot=True, fmt="d", ax=ax1)
ax1.set(xlabel="predicted label")
ax1.set_xticklabels(['Normal','Suspect','Pathological'])
ax1.set_yticklabels(['Normal','Suspect','Pathological'])
ax1.set(ylabel="actual label")
ax1.set(title="Confusion Matrix for training set")
sns.heatmap(conf_test, annot=True, fmt="d", ax=ax2)
ax2.set(xlabel="predicted label")
ax2.set(ylabel="actual label")
ax2.set_xticklabels(['Normal','Suspect','Pathological'])
ax2.set_yticklabels(['Normal','Suspect','Pathological'])
ax2.set(title="Confusion Matrix for test set");

Here, you can see that the AdaBoost model predicts all class labels of the training dataset correctly and also shows a slightly improved test dataset performance compared to the decision tree stump. However, you can also see that we introduced additional variance by our attempt to reduce the model bias—a greater gap between training and test performance.

In [None]:
# Visualize the feature importances
importances = ada.feature_importances_

std = np.std([tree.feature_importances_ for tree in ada.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, data.drop('fetal_health',axis=1).columns[indices[f]], importances[indices[f]]))

# Plot the impurity-based feature importances of the AdaBoost ensemble
plt.figure()
plt.title("Feature importances for AdaBoost Classifier")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), data.drop('fetal_health',axis=1).columns[indices],rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()

In [None]:
for num, [train_index, test_index] in enumerate(kf.split(X)):
    
    # split into training and testing data
    X_train = X.iloc[train_index,:]
    y_train = Y[train_index]
    
    # not needed but we still do it for completeness
    X_test = X.iloc[test_index,:]
    y_test = Y[test_index]
    
    print("For fold: %s" %(num))
    
    # Fit the AdaBoost ensemble
    ada = AdaBoostClassifier(n_estimators=n_estimators)
    ada.fit(X_train, y_train)
    
    # Visualize the feature importances
    importances = ada.feature_importances_
    
    std = np.std([tree.feature_importances_ for tree in ada.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]

    # Print the feature ranking
    print("Feature ranking:")

    #for f in range(X.shape[1]):
    #    print("%d. %s (%f)" % (f + 1, data.drop('fetal_health',axis=1).columns[indices[f]], importances[indices[f]]))

    # Plot the impurity-based feature importances of the forest
    plt.figure()
    plt.title("Feature importances for AdaBoost Classifier - Fold %s " %(num))
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), data.drop('fetal_health',axis=1).columns[indices],rotation=90)
    plt.xlim([-1, X.shape[1]])
    plt.show()

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.model_selection import cross_val_score

clf_labels = ['Random Forest', 'Decision Tree', 'AdaBoost']

all_clf = [rfc, tree, ada]


for clf, la in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv = cval,
                             scoring='roc_auc_ovo')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
    % (scores.mean(), scores.std(), la))
