<a href="https://colab.research.google.com/github/ac-26/CSI-25/blob/main/week6_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Week 6 Assignment: Train multiple machine learning models and evaluate their performance using metrics such as accuracy, precision, recall, and F1-score. Implement hyperparameter tuning techniques like GridSearchCV and RandomizedSearchCV to optimize model parameters. Analyze the results to select the best-performing model.**


## **By: Arnav Chopra**

In [58]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [59]:
wine = datasets.load_wine()
X = wine.data
y = wine.target

In [60]:
df = pd.DataFrame(X, columns=wine.feature_names)
df['target'] = y

In [61]:
df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [62]:
df.shape

(178, 14)

### **Train-test split**

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
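The split above passes `stratify=y`, which keeps the class proportions roughly equal in the training and test sets. A minimal sketch to check this, reloading the data so it runs on its own (the `*_share` variable names are only illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of each class in the full data, the training split, and the test split.
full_share = np.bincount(y) / len(y)
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)

print('train size:', len(y_train), 'test size:', len(y_test))
print('full :', full_share.round(3))
print('train:', train_share.round(3))
print('test :', test_share.round(3))
```

With 178 rows and `test_size=0.2`, the test set holds 36 samples, so each test prediction is worth about 0.028 accuracy.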

### **Scaling the features to a common scale**

In [64]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
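The scaler is fitted on the training data only and then reused on the test data, so no test-set statistics leak into training. The sketch below (self-contained, same split as above) checks what that implies: the scaled training features have mean 0 and standard deviation 1, while the scaled test features are only close to those values.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

# Training columns are standardized exactly (up to floating-point error).
print('train mean:', X_train_scaled.mean(axis=0).round(3))
print('train std :', X_train_scaled.std(axis=0).round(3))

# Test columns are scaled with training statistics, so they only approximate 0 and 1.
print('test mean :', X_test_scaled.mean(axis=0).round(3))
```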

### **We first train baseline models with default hyperparameters, then apply hyperparameter tuning and measure the improvement**

In [65]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'KNN': KNeighborsClassifier()
}

In [66]:
baseline_results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    baseline_results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='weighted'),
        'recall': recall_score(y_test, y_pred, average='weighted'),
        'f1': f1_score(y_test, y_pred, average='weighted')
    }

In [67]:
baseline_df = pd.DataFrame(baseline_results).T
baseline_df = baseline_df.round(4)

In [68]:
baseline_df

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| Logistic Regression | 0.9722 | 0.9741 | 0.9722 | 0.972 |
| Decision Tree | 0.9444 | 0.9514 | 0.9444 | 0.945 |
| Random Forest | 1.0 | 1.0 | 1.0 | 1.0 |
| SVM | 0.9722 | 0.9741 | 0.9722 | 0.972 |
| KNN | 0.9722 | 0.9747 | 0.9722 | 0.9724 |
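The metrics above use `average='weighted'`, which averages the per-class scores weighted by each class's support (its number of true samples). The sketch below reproduces that by hand on a small made-up label vector (the labels are illustrative, not taken from the wine data). It also shows why the recall column matches the accuracy column: weighted recall is always equal to accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels, chosen only for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-class precision and the number of true samples in each class.
per_class_precision = precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)

# Weighted average computed by hand versus scikit-learn's own result.
manual_weighted = np.sum(per_class_precision * support) / support.sum()
sklearn_weighted = precision_score(y_true, y_pred, average='weighted')

weighted_recall = recall_score(y_true, y_pred, average='weighted')
accuracy = accuracy_score(y_true, y_pred)

print('per-class precision:', per_class_precision.round(4))
print('manual weighted    :', round(manual_weighted, 4))
print('sklearn weighted   :', round(sklearn_weighted, 4))
print('weighted recall == accuracy:', np.isclose(weighted_recall, accuracy))
```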


### **The models may or may not be overfitting. Cross-validation could check for that, but it is not our aim today**
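For reference, the cross-validation check mentioned above could look roughly like this. It is only a sketch, not part of the assignment: it compares training and validation accuracy of the default Decision Tree across 5 folds using `cross_validate`. A large gap between the two suggests overfitting. The raw features are used because tree splits do not depend on feature scale.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5-fold cross-validation on the training split only; the test set stays untouched.
result = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X_train,
    y_train,
    cv=5,
    scoring='accuracy',
    return_train_score=True,
)

train_mean = result['train_score'].mean()
val_mean = result['test_score'].mean()  # 'test_score' here means the validation folds

print(f'mean train accuracy     : {train_mean:.4f}')
print(f'mean validation accuracy: {val_mean:.4f}')
print(f'gap                     : {train_mean - val_mean:.4f}')
```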

### **The Decision Tree has the lowest baseline scores (0.9444 accuracy), so it has the most room for improvement; we tune it along with Logistic Regression and SVM**

In [69]:
parameters = {
    'Decision Tree': {
        'max_depth': [3, 4, 5, 6, 7],
        'min_samples_split': [2, 5, 10, 15],
        'min_samples_leaf': [1, 2, 4, 6]
    },

    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l2'],
    },

    'SVM': {
        'C': [0.1, 1, 10, 100],
        'kernel': ['rbf', 'poly'],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1]
    }
}
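Before running an exhaustive search, it helps to know how many candidates each grid contains. The sketch below counts them with `ParameterGrid` and multiplies by the number of folds (5 is assumed here, matching the `cv=5` used in this notebook) to get the number of model fits.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cell above.
parameters = {
    'Decision Tree': {
        'max_depth': [3, 4, 5, 6, 7],
        'min_samples_split': [2, 5, 10, 15],
        'min_samples_leaf': [1, 2, 4, 6],
    },
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l2'],
    },
    'SVM': {
        'C': [0.1, 1, 10, 100],
        'kernel': ['rbf', 'poly'],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
    },
}

n_folds = 5  # assumed to match cv=5

# Number of parameter combinations per model, and fits needed for a full grid search.
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in parameters.items()}
total_fits = {name: size * n_folds for name, size in grid_sizes.items()}

for name in parameters:
    print(f'{name}: {grid_sizes[name]} combinations, {total_fits[name]} fits')
```

Note that the Logistic Regression grid has only 5 combinations, fewer than the `n_iter=10` used later for the randomized search.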

### **Trying GridSearchCV first: an exhaustive search is affordable on a small dataset like this one**

In [70]:
models_tuned = ['Decision Tree', 'Logistic Regression', 'SVM']

tuned_results = {}

for name in models_tuned:
    grid_search = GridSearchCV(
        estimator=models[name],
        param_grid=parameters[name],
        cv=5,
        scoring='f1_weighted',
        n_jobs=-1,
        verbose=0
    )

    grid_search.fit(X_train_scaled, y_train)

    y_pred = grid_search.best_estimator_.predict(X_test_scaled)

    tuned_results[name] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'test_accuracy': accuracy_score(y_test, y_pred),
        'test_precision': precision_score(y_test, y_pred, average='weighted'),
        'test_recall': recall_score(y_test, y_pred, average='weighted'),
        'test_f1': f1_score(y_test, y_pred, average='weighted')
    }
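One caveat with the search above: the data was scaled before cross-validation, so each validation fold was scaled with statistics that include that fold. A common alternative, sketched here for the SVM only, is to put the scaler inside a `Pipeline` so it is refitted on the training part of every fold. The step names `scaler` and `clf` are arbitrary choices for this sketch. The `clf__` prefix routes each parameter to the classifier step.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling happens inside the pipeline, so it is refitted within every CV fold.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(random_state=42)),
])

# Same SVM grid as above, with the step-name prefix added.
param_grid = {
    'clf__C': [0.1, 1, 10, 100],
    'clf__kernel': ['rbf', 'poly'],
    'clf__gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_weighted')
search.fit(X_train, y_train)  # raw features: the pipeline does the scaling

test_f1 = search.score(X_test, y_test)  # uses the same 'f1_weighted' scorer
print('best params  :', search.best_params_)
print('best CV score:', round(search.best_score_, 4))
print('test F1      :', round(test_f1, 4))
```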

In [71]:
tuned_df = pd.DataFrame(tuned_results).T

In [72]:
tuned_df

| | best_params | best_cv_score | test_accuracy | test_precision | test_recall | test_f1 |
|---|---|---|---|---|---|---|
| Decision Tree | {'max_depth': 3, 'min_samples_leaf': 2, 'min_s... | 0.930318 | 1.0 | 1.0 | 1.0 | 1.0 |
| Logistic Regression | {'C': 1, 'penalty': 'l2'} | 0.986081 | 0.972222 | 0.974074 | 0.972222 | 0.97197 |
| SVM | {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'} | 0.986255 | 0.972222 | 0.974074 | 0.972222 | 0.97197 |


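### **On the test set, the Decision Tree improves from 0.9444 to 1.0 accuracy, while Logistic Regression and SVM stay at 0.9722. With 36 test samples, that gain is two more correct predictions**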

### **Now we try RandomizedSearchCV**

In [73]:
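# RandomizedSearchCV is already imported in the first cell.
random_results = {}

for name in models_tuned:
    # No `scoring` is passed, so candidates are ranked by the estimator's
    # default score (accuracy), unlike the GridSearchCV run above, which
    # used 'f1_weighted'.
    random_search = RandomizedSearchCV(
        models[name],
        parameters[name],
        n_iter=10,  # samples up to 10 combinations from the lists in `parameters`
        cv=5,
        random_state=42
    )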

    random_search.fit(X_train_scaled, y_train)

    y_pred = random_search.predict(X_test_scaled)

    random_results[name] = {
        'best_params': random_search.best_params_,
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='weighted')
    }

random_df = pd.DataFrame(random_results).T
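The randomized search above samples from the same fixed lists as the grid search. `RandomizedSearchCV` can also sample from continuous distributions, which lets it try values that are not on any grid. The sketch below does this for the SVM using `scipy.stats.loguniform`. The ranges and the choice of `'f1_weighted'` scoring are assumptions made for illustration, not the notebook's settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Continuous, log-scaled ranges instead of fixed lists (illustrative bounds).
param_distributions = {
    'C': loguniform(1e-2, 1e2),
    'gamma': loguniform(1e-3, 1e-1),
    'kernel': ['rbf'],
}

search = RandomizedSearchCV(
    SVC(random_state=42),
    param_distributions,
    n_iter=10,              # number of sampled candidates
    cv=5,
    scoring='f1_weighted',  # assumed here, to match the grid search's metric
    random_state=42,
)
search.fit(X_train_scaled, y_train)

print('best params  :', search.best_params_)
print('best CV score:', round(search.best_score_, 4))
print('test F1      :', round(search.score(X_test_scaled, y_test), 4))
```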



In [74]:
random_df

| | best_params | accuracy | f1 |
|---|---|---|---|
| Decision Tree | {'min_samples_split': 2, 'min_samples_leaf': 2... | 1.0 | 1.0 |
| Logistic Regression | {'penalty': 'l2', 'C': 0.01} | 1.0 | 1.0 |
| SVM | {'kernel': 'rbf', 'gamma': 'auto', 'C': 1} | 0.972222 | 0.97197 |


### **RandomizedSearchCV also works: the Decision Tree and Logistic Regression reach 1.0 test accuracy, and SVM stays at 0.9722**

### **Analysing the results and finding out the best model**

In [75]:
print("GridSearchCV Results:")
for model in models_tuned:
    print(f"{model}:{tuned_results[model]['test_accuracy']:.4f}")

print("\nRandomizedSearchCV Results:")
for model in models_tuned:
    print(f"{model}:{random_results[model]['accuracy']:.4f}")

final = {
    'Logistic Regression': baseline_results['Logistic Regression']['accuracy'],
    'Decision Tree (Original)': baseline_results['Decision Tree']['accuracy'],
    'Decision Tree (GridSearch)': tuned_results['Decision Tree']['test_accuracy'],
    'Decision Tree (RandomSearch)': random_results['Decision Tree']['accuracy'],
    'Random Forest': baseline_results['Random Forest']['accuracy'],
    'SVM': baseline_results['SVM']['accuracy'],
    'KNN': baseline_results['KNN']['accuracy']
}

best_model = max(final, key=final.get)
best_score = final[best_model]

GridSearchCV Results:
Decision Tree:1.0000
Logistic Regression:0.9722
SVM:0.9722

RandomizedSearchCV Results:
Decision Tree:1.0000
Logistic Regression:1.0000
SVM:0.9722


In [76]:
print("The best model is:", best_model)
print("The best score is:", best_score)

The best model is: Decision Tree (GridSearch)
The best score is: 1.0
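Several entries in `final` share the top score, and `max` only returns the first one it meets. A small sketch that lists every tied entry, with the accuracies copied from the result tables above (rounded to four decimals):

```python
# Test accuracies as reported in the tables above.
final = {
    'Logistic Regression': 0.9722,
    'Decision Tree (Original)': 0.9444,
    'Decision Tree (GridSearch)': 1.0,
    'Decision Tree (RandomSearch)': 1.0,
    'Random Forest': 1.0,
    'SVM': 0.9722,
    'KNN': 0.9722,
}

best_score = max(final.values())

# Every model that reaches the best score, in dictionary order.
tied = [name for name, score in final.items() if score == best_score]

# max() with a key returns only the first of the tied entries.
first_best = max(final, key=final.get)

print('best score:', best_score)
print('tied models:', tied)
print('max() picks:', first_best)
```

With a 36-sample test set, a tie at 1.0 does not separate these models; cross-validated scores would be one way to break it.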


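## **Hence we select the Decision Tree Classifier tuned with GridSearchCV as our best model (1.0 test accuracy). Note that it ties with the RandomizedSearchCV-tuned Decision Tree and the baseline Random Forest; `max` returns the first of the tied entries**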