<img src="Logo.png" width="100" align="left"/> 

# <center> Unit 3 Project </center>
# <center> Third section: supervised task </center>

In this notebook you will build and train a supervised learning model to classify your data.

For this task we will use another classification model: the random forest.

Steps for this task:
1. Load the already clustered dataset.
2. Note that this task does not use the previously added "cluster" column.
3. Split your data.
4. Build your model using the scikit-learn `RandomForestClassifier` class.
5. Classify your data and test the performance of your model.
6. Evaluate the model (accepted models should have an accuracy of at least 86%). Experiment with the hyperparameters and report your findings.
7. Provide evidence of the quality of your model (not overfitted, good metrics).
8. Create a new test dataset that contains the test set plus an additional column called "predicted_class" stating the class predicted by your random forest classifier for each data point of the test set.

## 1. Load and split the data:

In [1]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd 
from sklearn.model_selection import train_test_split

In [2]:
# To-Do:  load the data 
df = pd.read_csv('HepatitisCdataCluster.csv')
df.head()

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster
0,1,0,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0,3
1,2,0,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5,3
2,3,0,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3,3
3,4,0,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7,3
4,5,0,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7,3


In [3]:
# To-Do : keep only the columns to be used : all features except ID, cluster 
# The target here is the Category column 
# Do not forget to split your data (this is a classification task)
# test set size should be 20% of the data 

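# NOTE: dropping only 'ID' and 'cluster' leaves the target 'Category' inside X
# (hence the 13 columns reported below), so the label is also available as a feature.
# Add 'Category' to the drop list to avoid this leakage; the reported scores will then change.
X = df.drop(columns=['ID', 'cluster'])
y = df['Category']

# Split the data into train and test sets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)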

print("Train Shape",x_train.shape)
print("Test Shape",x_test.shape)

Train Shape (492, 13)
Test Shape (123, 13)
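
The feature selection and split above can be sanity-checked with a small sketch. It assumes a hypothetical stand-in frame (`demo_df`, with made-up columns `f1` and `f2`) rather than the real dataset, drops the target along with `ID` and `cluster`, and adds `stratify=` so that an imbalanced target keeps roughly the same class proportions in both subsets. Stratification is an optional choice here, not something the task requires.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows with an imbalanced 3-class target
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "ID": np.arange(1, n + 1),
    "Category": [0] * 160 + [1] * 30 + [2] * 10,
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "cluster": rng.integers(0, 3, size=n),
})

# Keep only real features: the identifier, the cluster label and the target are all dropped
demo_X = demo_df.drop(columns=["ID", "cluster", "Category"])
demo_y = demo_df["Category"]
assert "Category" not in demo_X.columns  # guard against target leakage

# stratify=demo_y preserves the class proportions in the train and test subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=0, stratify=demo_y
)

print(ys_train.value_counts(normalize=True).sort_index())
print(ys_test.value_counts(normalize=True).sort_index())
```

Both printed distributions should be close to 80% / 15% / 5%, which is the point of stratifying when some classes are rare.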


## 2. Build and train the model, then evaluate its performance:

In [51]:
# To-Do: build the model and train it
# Note that you will have to explain your hyperparameter tuning,
# so expect to iterate several times before reaching the desired performance
# The model must be a random forest

from sklearn.ensemble import RandomForestClassifier

# Define and train the RandomForestClassifier model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(x_train, y_train)

In [52]:
y_hat_train = rf_model.predict(x_train)
y_hat_test = rf_model.predict(x_test)


In [53]:
# To-do : evaluate the model in terms of accuracy and precision 
# Provide evidence that your model is not overfitting 
from sklearn.metrics import precision_score, accuracy_score
# Evaluate the model's accuracy and precision
accuracy_train = accuracy_score(y_train, y_hat_train)
precision_train = precision_score(y_train, y_hat_train, average='macro')

accuracy_test = accuracy_score(y_test, y_hat_test)
precision_test = precision_score(y_test, y_hat_test, average='macro')

# Print out the performance metrics
print("Training Set Performance:")
print(f"Accuracy on train set: {accuracy_train:.2f}")
print(f"Precision on train set: {precision_train:.2f}")

print("\nTest Set Performance:")
print(f"Accuracy on test set: {accuracy_test:.2f}")
print(f"Precision on test set: {precision_test:.2f}")

Training Set Performance:
Accuracy on train set: 1.00
Precision on train set: 1.00

Test Set Performance:
Accuracy on test set: 0.98
Precision on test set: 0.91
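
A gap between accuracy and macro precision usually means that some classes are predicted worse than others. A minimal sketch of a per-class breakdown follows, using small hand-made label lists (`demo_true`, `demo_pred`) as an assumption in place of `y_test` and `y_hat_test`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: class 0 dominates, one class-1 sample is misclassified as 0
demo_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
demo_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(demo_true, demo_pred)
acc = accuracy_score(demo_true, demo_pred)

print(cm)
print(f"accuracy: {acc:.2f}")
# Per-class precision, recall and F1, plus macro and weighted averages
print(classification_report(demo_true, demo_pred, zero_division=0))
```

In the notebook, the same three calls could be applied to `y_test` and `y_hat_test` to see which categories account for the lower macro precision.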


> Hint: Perfect accuracy on the training set suggests that the model is overfitted. The student should therefore provide a detailed table of the hyperparameter tuning, with a clear conclusion showing that the model reaches an accuracy of at least 86% on the test set without signs of overfitting.

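The perfect (1.00) accuracy and precision on the training set, compared with 0.98 accuracy and 0.91 precision on the test set, suggest that the default model is overfitting.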
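
One way to back up a claim about overfitting is to compare training and validation scores under cross-validation. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the training set, and the two `max_depth` values are arbitrary choices for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Hypothetical synthetic data standing in for x_train / y_train
demo_X, demo_y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

gaps = {}
for depth in (None, 3):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    # return_train_score=True also reports accuracy on the folds used for fitting
    scores = cross_validate(
        model, demo_X, demo_y, cv=5, scoring="accuracy", return_train_score=True
    )
    train_mean = scores["train_score"].mean()
    val_mean = scores["test_score"].mean()
    gaps[depth] = train_mean - val_mean
    print(f"max_depth={depth}: train={train_mean:.3f} "
          f"validation={val_mean:.3f} gap={gaps[depth]:.3f}")
```

A large train-minus-validation gap points to overfitting. A small gap with a validation score above the required threshold is the kind of evidence the hint asks for.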


In [60]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters grid for Grid Search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                           cv=5, n_jobs=-1, scoring='accuracy', verbose=2)

# Fit the model to the training data
grid_search.fit(x_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:", grid_search.best_params_)

# Use the best model from grid search
best_rf_model = grid_search.best_estimator_

# Predict using the best model
y_hat_test_best = best_rf_model.predict(x_test)

# Re-evaluate the model with the best hyperparameters
accuracy_test_best = accuracy_score(y_test, y_hat_test_best)
precision_test_best = precision_score(y_test, y_hat_test_best, average='macro')

print("\nBest Model Performance:")
print(f"Accuracy on test set (best model): {accuracy_test_best:.2f}")
print(f"Precision on test set (best model): {precision_test_best:.2f}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=50; total time=   0.2s
[CV] END max_dep... (output truncated)
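
The requested hyperparameter report can be built from the `cv_results_` attribute of a fitted `GridSearchCV`. The following is a minimal sketch under stated assumptions: synthetic data, a deliberately tiny grid (`small_grid`), and `return_train_score=True`, which the search above does not set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for x_train / y_train
demo_X, demo_y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# A small illustrative grid: 2 x 2 = 4 candidates
small_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,  # needed to get the mean_train_score column
)
search.fit(demo_X, demo_y)

# One row per candidate; mean_test_score is the mean accuracy on the validation folds
report = (
    pd.DataFrame(search.cv_results_)[
        ["param_n_estimators", "param_max_depth",
         "mean_train_score", "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
report["train_minus_val"] = report["mean_train_score"] - report["mean_test_score"]
print(report)
```

Sorting by `rank_test_score` puts the best candidate first, and the `train_minus_val` column makes it easy to see which settings narrow the gap between training and validation accuracy.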

## 3. Create the summary test set with the additional predicted class column: 
In this part, add the predicted class as a column to your test dataframe and save the result.

In [71]:
# To-Do: create the complete test dataframe: it should contain all the feature columns, the actual target, and the ID

test_df = x_test.copy()
test_df['ID'] = df['ID'].loc[x_test.index]  # Add the 'ID' column from the original DataFrame
test_df['Category'] = y_test  # Add the actual target values
test_df['cluster'] = df.loc[x_test.index, 'cluster']

# Reorder columns so that 'ID' is first
test_df = test_df[['ID'] + [col for col in test_df.columns if col != 'ID']]



In [72]:
test_df.head()

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster
49,50,0,36,1,47.8,89.0,48.5,38.4,8.6,8.26,5.62,96.0,21.9,76.2,3
496,497,0,56,0,45.1,79.1,39.0,30.5,5.2,6.47,5.1,64.0,145.3,66.7,4
211,212,0,51,1,45.9,66.7,31.8,28.1,9.0,10.08,5.61,85.0,36.2,73.0,3
249,250,0,55,1,44.7,71.6,22.9,22.1,5.5,6.82,4.61,105.0,59.2,72.7,4
142,143,0,45,1,43.2,68.2,27.8,42.3,6.6,10.93,6.61,105.0,27.2,74.5,3


In [73]:
# To-Do: add the predicted_class column
test_df['predicted_class'] = y_hat_test

In [74]:
test_df.head()

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster,predicted_class
49,50,0,36,1,47.8,89.0,48.5,38.4,8.6,8.26,5.62,96.0,21.9,76.2,3,0
496,497,0,56,0,45.1,79.1,39.0,30.5,5.2,6.47,5.1,64.0,145.3,66.7,4,0
211,212,0,51,1,45.9,66.7,31.8,28.1,9.0,10.08,5.61,85.0,36.2,73.0,3,0
249,250,0,55,1,44.7,71.6,22.9,22.1,5.5,6.82,4.61,105.0,59.2,72.7,4,0
142,143,0,45,1,43.2,68.2,27.8,42.3,6.6,10.93,6.61,105.0,27.2,74.5,3,0


> Make sure you have 16 columns in this test set

In [76]:
# Save the test set (index=False keeps the row index out of the file, so it has exactly 16 columns)
test_df.to_csv("test_summary.csv", index=False)
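
After saving, it can be worth reloading the file to confirm its columns. The sketch below uses a hypothetical three-column frame (`demo`) and a temporary file instead of `test_summary.csv`, and assumes the file is written with `index=False`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature summary frame with a non-default row index
demo = pd.DataFrame(
    {"ID": [50, 497], "Category": [0, 0], "predicted_class": [0, 0]},
    index=[49, 496],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_summary.csv")
    # index=False prevents the row index from being written as an extra unnamed column
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())

# Share of rows where the prediction matches the actual target
agreement = (reloaded["Category"] == reloaded["predicted_class"]).mean()
print(f"agreement: {agreement:.2f}")
```

Applied to the real file, the same check should list 16 column names, and the agreement value should match the test accuracy of whichever model produced the predictions.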