<img src="Logo.png" width="100" align="left"/> 

# <center> Unit 3 Project </center>
#  <center> Third section : supervised task </center>

In this notebook you will be building and training a supervised learning model to classify your data.

For this task we will use another classification model: the random forest.

Steps for this task:
1. Load the already clustered dataset.
2. Note that in this task we will not use the previously added "cluster" column.
3. Split your data.
4. Build your model using the scikit-learn RandomForestClassifier class.
5. Classify your data and test the performance of your model.
6. Evaluate the model (accepted models should reach an accuracy of at least 86%). Experiment with the hyperparameters and report on what you tried.
7. Provide evidence of the quality of your model (good metrics, no overfitting).
8. Create a new test dataset that contains the test set plus an additional column called "predicted_class" holding the class predicted by your random forest classifier for each data point of the test set.

## 1. Load and split the data:

In [1]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd 
from sklearn.model_selection import train_test_split

In [2]:
# To-Do:  load the data 
df = pd.read_csv("clustered_HepatitisCdata.csv")
df.head()

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster
0,1,0,32,0,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0,3
1,2,0,32,0,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5,3
2,3,0,32,0,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3,3
3,4,0,32,0,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7,3
4,5,0,32,0,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7,3


In [3]:
# To-Do : keep only the columns to be used : all features except ID, cluster 
# The target here is the Category column 
# Do not forget to split your data (this is a classification task)
# test set size should be 20% of the data 
data = df.drop(["ID", "cluster"], axis=1, inplace=False)

In [4]:
data

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,0,32,0,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,0,32,0,38.5,70.3,18.0,24.7,3.9,11.17,4.80,74.0,15.6,76.5
2,0,32,0,46.9,74.7,36.2,52.6,6.1,8.84,5.20,86.0,33.2,79.3
3,0,32,0,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,0,32,0,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
610,4,62,1,32.0,416.6,5.9,110.3,50.0,5.57,6.30,55.7,650.9,68.5
611,4,64,1,24.0,102.8,2.9,44.4,20.0,1.54,3.02,63.0,35.9,71.3
612,4,64,1,29.0,87.3,3.5,99.0,48.0,1.66,3.63,66.7,64.2,82.0
613,4,46,1,33.0,66.2,39.0,62.0,20.0,3.56,4.20,52.0,50.0,71.0


In [5]:
X = data.iloc[:,1:]
Y = data.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
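
The split above is random and unseeded, so every run produces a different test set, and rare classes may end up under-represented in it. A minimal sketch of a reproducible, stratified alternative follows. The toy data (`X_demo`, `y_demo`) and the seed value are assumptions made for illustration only; `stratify` and `random_state` are standard `train_test_split` arguments.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real features and target:
# 100 rows with an imbalanced target (80 / 10 / 10).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0] * 80 + [1] * 10 + [2] * 10, name="Category")

# stratify=y_demo keeps the class proportions the same in both subsets;
# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class counts in the 20-row test set
print(y_te.value_counts().sort_index().to_dict())
```

Stratification requires at least two samples per class, which is worth checking on the real `Category` column before relying on it.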

## 2. Building and training the model and evaluating its performance:

In [None]:
# To-do build the model and train it 
# note that you will be providing explanation about the hyper parameter tuning 
# So you will be iterating a number of times before getting the desired performance 

### I. Choosing the hyperparameters for the RandomForestClassifier class.

In [6]:
# Finding all available parameters of the RandomForestClassifier() constructor
import inspect
a_signature = inspect.signature(RandomForestClassifier)
parameters = a_signature.parameters
parameter_list = list(parameters)
parameter_list

['n_estimators',
 'criterion',
 'max_depth',
 'min_samples_split',
 'min_samples_leaf',
 'min_weight_fraction_leaf',
 'max_features',
 'max_leaf_nodes',
 'min_impurity_decrease',
 'bootstrap',
 'oob_score',
 'n_jobs',
 'random_state',
 'verbose',
 'warm_start',
 'class_weight',
 'ccp_alpha',
 'max_samples']
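
As a possible alternative to `inspect`, every scikit-learn estimator exposes `get_params()`, which returns the parameter names together with their current values. The sketch below prints the defaults for the five hyperparameters tuned later; the exact defaults depend on the installed scikit-learn version, so no output is shown.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns a dict mapping each constructor argument to its current value.
default_params = RandomForestClassifier().get_params()

# Show only the hyperparameters that are tuned in this notebook.
tuned_names = ["n_estimators", "criterion", "max_features", "bootstrap", "max_depth"]
for name in tuned_names:
    print(f"{name}: {default_params[name]!r}")
```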

That is quite a long list. According to the scikit-learn documentation on random forests, the most important settings are the number of trees in the forest **(n_estimators)** and the number of features considered when splitting a node **(max_features)**. In practice, though, it is often simpler to try out a wide range of values and see what works. <br><br>I will try adjusting the following set of hyperparameters:
<br>- **n_estimators**: number of trees in the forest
<br>- **criterion**: function used to measure the quality of a split
<br>- **max_features**: maximum number of features considered when splitting a node
<br>- **bootstrap**: method for sampling data points (with or without replacement)
<br>- **max_depth**: maximum number of levels in each decision tree

### II. Using grid search to find the best combination of hyper parameters.

In [7]:
# Defining the hyperparameter grid that GridSearchCV() will search
forest_params = [{
    'n_estimators': [100, 150, 200, 300, 400], 
    'criterion': ["gini", "entropy"],
    'max_features': ["sqrt","log2"],
    'bootstrap': [True, False],
    'max_depth': [10, 15, 20, 30, None]
}]

Here the search covers 5 x 2 x 2 x 2 x 5 = 200 possible combinations of hyperparameters. With the default 5-fold cross-validation of GridSearchCV, this amounts to 1,000 model fits.
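
The count can be double-checked with a few lines of standard-library Python. This is only a sketch: `forest_params_demo` is a hypothetical copy of the grid defined above, and the fold count of 5 assumes GridSearchCV is left at its default `cv` setting.

```python
from itertools import product

# Same grid as in the cell above, repeated so this snippet runs on its own.
forest_params_demo = {
    "n_estimators": [100, 150, 200, 300, 400],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
    "max_depth": [10, 15, 20, 30, None],
}

# Every combination is one element of the Cartesian product of the value lists.
combinations = list(product(*forest_params_demo.values()))
n_combinations = len(combinations)

cv_folds = 5  # assumed: GridSearchCV's default number of folds
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 200 1000
```

By default GridSearchCV also refits the best combination once on the full training set, so the real run performs one more fit than `n_fits`.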

In [8]:
from sklearn.model_selection import GridSearchCV

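def grid_search(X_train, y_train, tuned_params):
    """Run an exhaustive grid search and print the best hyperparameter combination."""
    print("# Tuning hyper_parameters")
    # With no extra arguments, GridSearchCV uses 5-fold cross-validation
    # and the classifier's default score (accuracy)
    clf = GridSearchCV(RandomForestClassifier(), tuned_params)
    clf.fit(X_train, y_train)
    print("Best parameters found: ")
    print(clf.best_params_)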

In [9]:
import warnings
warnings.filterwarnings("ignore")

grid_search(X_train, y_train, forest_params)

# Tuning hyper_parameters
Best parameters found: 
{'bootstrap': False, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 150}
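
Only the winning combination is printed above, whereas the task asks for a report on the tuning. One possible way to get such a table is to read the `cv_results_` attribute of the fitted search object. The sketch below uses synthetic data and a deliberately tiny grid (`X_demo`, `y_demo`, `small_grid` are all hypothetical) so that it runs quickly on its own; it is not the search that produced the result above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train so the snippet is self-contained.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# A deliberately tiny grid (2 x 2 = 4 combinations) to keep the run short.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    small_grid,
    cv=3,
    return_train_score=True,  # also record the score on the training folds
)
search.fit(X_demo, y_demo)

# One row per combination, best-ranked first.
columns = [
    "param_n_estimators",
    "param_max_depth",
    "mean_train_score",
    "mean_test_score",
    "rank_test_score",
]
report = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(report.to_string(index=False))
print("Best parameters:", search.best_params_)
```

Comparing `mean_train_score` with `mean_test_score` row by row gives a first indication of how much each combination overfits.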


In [10]:
model = RandomForestClassifier(n_estimators=150, criterion='entropy', max_features="sqrt", max_depth=10, bootstrap=False)
model.fit(X_train,y_train)

RandomForestClassifier(bootstrap=False, criterion='entropy', max_depth=10,
                       max_features='sqrt', n_estimators=150)

In [11]:
y_hat = model.predict(X_test)

### III. Evaluating the model

In [14]:
# To-do : evaluate the model in terms of accuracy and precision 
# Provide evidence that your model is not overfitting 
from sklearn.metrics import precision_score, accuracy_score

y_hat_train = model.predict(X_train)
print("Accuracy on Training set : ", accuracy_score(y_train,y_hat_train))
print("Precision on Training set :", precision_score(y_train,y_hat_train, average="weighted"))

print("Accuracy on Test set : ", accuracy_score(y_test,y_hat))
print("Precision on Test set : ", precision_score(y_test,y_hat, average="weighted"))

Accuracy on Training set :  1.0
Precision on Training set : 1.0
Accuracy on Test set :  0.9349593495934959
Precision on Test set :  0.9286279588459602


In [15]:
from sklearn.metrics import classification_report

print(f'Classification report on test set : \n{classification_report(y_test,y_hat)}')

Classification report on test set : 
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       108
           1       1.00      1.00      1.00         1
           2       0.50      0.50      0.50         4
           3       0.50      0.25      0.33         4
           4       1.00      0.50      0.67         6

    accuracy                           0.93       123
   macro avg       0.79      0.65      0.70       123
weighted avg       0.93      0.93      0.93       123



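Overfitting is a scenario where the model performs **well on training data** but **poorly on data not seen during training**. 
<br>In our case, the model performs **well** on **both** the training data (accuracy of 1.00) and the test data (accuracy of 0.93), so there is no sign of severe overfitting. Two caveats apply. First, the forest fits the training set perfectly, so the training score by itself says nothing about generalization. Second, recall on the minority classes 2, 3 and 4 is only 0.25 to 0.50 on very small supports (4 to 6 samples), so the overall accuracy is driven mostly by class 0.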
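
To see how much the majority class shapes the headline figures, the averages in the classification report can be recomputed by hand. The sketch below copies the per-class recall and support values from the report above (rounded as printed) and rebuilds the macro and weighted averages.

```python
# Per-class recall and support copied from the classification report above.
recall = {0: 1.00, 1: 1.00, 2: 0.50, 3: 0.25, 4: 0.50}
support = {0: 108, 1: 1, 2: 4, 3: 4, 4: 6}

total = sum(support.values())

# Macro average: every class counts equally, whatever its size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")     # 0.65
print(f"weighted recall: {weighted_recall:.2f}")  # 0.93
```

The weighted recall equals the overall accuracy (115 correct predictions out of 123), while the macro recall of 0.65 shows how the smaller classes fare when class 0 is not allowed to dominate.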

> Hint: A perfect accuracy on the training set suggests that the model may be overfitted, so the student should provide a detailed table of the hyperparameter tuning runs, with a conclusion showing that the model reaches at least 86% accuracy on the test set without signs of overfitting.

## 3. Create the summary test set with the additional predicted class column: 
In this part you need to add the predicted class as a column to your test dataframe and save the result.

In [16]:
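# To-Do : create the complete test dataframe : it should contain all the feature columns + the actual target and the ID as well 
# Work on a copy so that X_test itself is not modified
test_df = X_test.copy()
# Series assignment aligns on the index, so each row gets its own ID, cluster and Category
test_df['ID'] = df['ID']
test_df['cluster'] = df['cluster']
test_df['Category'] = df['Category']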
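
The column assignments above work because pandas aligns a Series on index labels rather than on position. The following toy sketch illustrates that behaviour; `df_demo`, `X_test_demo` and the prediction values are hypothetical stand-ins for the real objects.

```python
import pandas as pd

# Hypothetical miniature versions of df and X_test.
df_demo = pd.DataFrame(
    {"ID": [1, 2, 3, 4, 5], "Category": [0, 0, 1, 2, 0], "Age": [32, 40, 51, 60, 45]}
)
X_test_demo = df_demo.loc[[3, 1], ["Age"]]  # shuffled subset, like a test split

test_demo = X_test_demo.copy()  # copy so X_test_demo is left untouched

# Assigning a Series aligns on the index labels, not on position,
# so each test row receives the ID and Category of its own original row.
test_demo["ID"] = df_demo["ID"]
test_demo["Category"] = df_demo["Category"]

# Predictions come back as a plain array in test-row order, so they are
# assigned positionally (hypothetical values here).
test_demo["Predicted_class"] = [2, 0]

print(test_demo)
```

Rows 3 and 1 pick up IDs 4 and 2 respectively, even though the test rows are in shuffled order.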

In [17]:
test_df.head()

Unnamed: 0,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,ID,cluster,Category
205,50,0,42.2,145.0,27.5,37.9,4.5,13.71,8.8,103.0,239.0,73.1,206,4,0
434,48,1,42.5,62.2,12.1,20.1,23.1,4.01,5.58,67.0,13.0,74.2,435,3,0
482,54,1,39.9,30.7,17.0,19.3,6.3,6.99,4.95,68.0,13.3,70.7,483,3,0
414,46,1,41.1,47.5,21.0,17.7,7.1,7.55,4.42,62.0,11.9,69.8,415,3,0
504,57,1,38.7,62.8,21.8,29.2,9.2,6.55,7.08,68.0,13.0,70.7,505,3,0


In [18]:
# To-Do : Add the predicted_class column 
test_df["Predicted_class"] = y_hat

In [23]:
test_df.head()

Unnamed: 0,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,ID,cluster,Category,Predicted_class
205,50,0,42.2,145.0,27.5,37.9,4.5,13.71,8.8,103.0,239.0,73.1,206,4,0,0
434,48,1,42.5,62.2,12.1,20.1,23.1,4.01,5.58,67.0,13.0,74.2,435,3,0,0
482,54,1,39.9,30.7,17.0,19.3,6.3,6.99,4.95,68.0,13.3,70.7,483,3,0,0
414,46,1,41.1,47.5,21.0,17.7,7.1,7.55,4.42,62.0,11.9,69.8,415,3,0,0
504,57,1,38.7,62.8,21.8,29.2,9.2,6.55,7.08,68.0,13.0,70.7,505,3,0,0


In [26]:
test_df.shape

(123, 16)

> Make sure you have 16 columns in this test set  

In [28]:
# Save the test set 
test_df.to_csv("test_summary.csv", index=False)