# **Objective: Model Evaluation and Hyperparameter Tuning**



# Breast Cancer Classification using Machine Learning

This notebook trains and evaluates multiple machine learning models on the [Breast Cancer Wisconsin dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset) to classify tumors as **malignant or benign**. It includes:

* Logistic Regression
* Random Forest
* Support Vector Machine (SVM)

Logistic Regression and SVM are tuned with `GridSearchCV` and Random Forest with `RandomizedSearchCV`, all using 5-fold cross-validation scored on F1. Each tuned model is then evaluated on a held-out test set using **accuracy**, **precision**, **recall**, and **F1-score**.
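
All four evaluation metrics can be derived from the counts in a binary confusion matrix. The sketch below shows the arithmetic; the helper `metrics_from_counts` and the counts passed to it are made up for illustration and are not results from this notebook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the four metrics from binary confusion-matrix counts.

    Hypothetical helper for illustration only.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (assumed, not taken from the notebook output).
scores = metrics_from_counts(tp=8, fp=2, fn=1, tn=9)
for metric, value in scores.items():
    print(f"{metric}: {value:.4f}")
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why it is a common choice on imbalanced data.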


# 1. Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report


# 2. Load dataset

In [2]:

data = load_breast_cancer()
X, y = data.data, data.target
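
Before modelling, it may help to check the shape of the data and how the labels are encoded. A small sketch follows; the variable `counts` is introduced here for illustration. In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign, so the default "positive" class for precision, recall, and F1 is benign.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# 569 samples with 30 numeric features each.
print("Feature matrix shape:", X.shape)

# target_names[i] is the class name for label i.
print("Label encoding:", {i: str(name) for i, name in enumerate(data.target_names)})

# Count samples per class name to see the class balance.
counts = Counter(str(data.target_names[label]) for label in y)
print("Class counts:", dict(counts))
```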

# 3. Train/test split

In [3]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
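
The `stratify=y` argument keeps the class proportions roughly equal in the training and test sets. One way to confirm this, as a sketch (the `frac_*` names are introduced here for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of label 1 (benign) in each subset; with stratification these
# should agree closely.
frac_full = np.mean(y)
frac_train = np.mean(y_train)
frac_test = np.mean(y_test)

print(f"Benign fraction: full={frac_full:.4f}, train={frac_train:.4f}, test={frac_test:.4f}")
print(f"Train size: {len(y_train)}, test size: {len(y_test)}")
```

The test set of 114 samples with 72 benign cases matches the `support` column in the classification reports further down.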


# 4. Preprocessing + Pipelines

In [4]:
pipe_lr = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression(max_iter=10000))])
pipe_rf = Pipeline([('scaler', StandardScaler()), ('clf', RandomForestClassifier(random_state=42))])
pipe_svc = Pipeline([('scaler', StandardScaler()), ('clf', SVC(random_state=42))])
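
Wrapping the scaler and the classifier in a `Pipeline` means the scaler is fitted only on the data passed to `fit`, so during cross-validation no statistics leak from the validation fold. The step names also define the `clf__` prefix used for the hyperparameter names in the next section. A minimal sketch of both points, using a single pipeline rather than the three above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=10000))])

# "<step name>__<parameter>" addresses a parameter of one step.
pipe.set_params(clf__C=0.1)
pipe.fit(X_train, y_train)

# The scaler's learned means come from the training rows only.
scaler_mean = pipe.named_steps["scaler"].mean_
print("Scaler fitted on training data only:",
      np.allclose(scaler_mean, X_train.mean(axis=0)))
print("Classifier C:", pipe.named_steps["clf"].C)
```

Tree-based models such as Random Forest do not depend on feature scale, so the scaler in `pipe_rf` is harmless but not required.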


# 5. Define hyperparameter grids

In [5]:

param_grid_lr = {'clf__C': [0.01, 0.1, 1, 10, 100], 'clf__penalty': ['l2']}
param_dist_rf = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 5, 10],
    'clf__max_features': ['sqrt', 'log2'],
}
param_grid_svc = {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf'], 'clf__gamma': ['scale', 'auto']}
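
The size of each grid determines how many models are trained. The sketch below counts the combinations and the resulting number of cross-validation fits, assuming 5 folds and `n_iter=10` for the randomized search and excluding the final refit. It also shows a hypothetical alternative, `param_grid_svc_split`, which is not used in this notebook: because `gamma` has no effect on the linear kernel, splitting the SVM grid by kernel avoids duplicate candidates.

```python
from math import prod

param_grid_lr = {'clf__C': [0.01, 0.1, 1, 10, 100], 'clf__penalty': ['l2']}
param_dist_rf = {'clf__n_estimators': [50, 100],
                 'clf__max_depth': [None, 5, 10],
                 'clf__max_features': ['sqrt', 'log2']}
param_grid_svc = {'clf__C': [0.1, 1, 10],
                  'clf__kernel': ['linear', 'rbf'],
                  'clf__gamma': ['scale', 'auto']}


def n_combinations(grid):
    """Number of distinct settings in a dict-style grid (illustrative helper)."""
    return prod(len(values) for values in grid.values())


cv_folds = 5    # assumed to match cv=5 in the searches
n_iter_rf = 10  # assumed to match n_iter=10 in the randomized search

fits = {
    'Logistic Regression': n_combinations(param_grid_lr) * cv_folds,
    # A randomized search samples at most n_iter settings from the grid.
    'Random Forest': min(n_iter_rf, n_combinations(param_dist_rf)) * cv_folds,
    'SVM': n_combinations(param_grid_svc) * cv_folds,
}
print(fits)

# Hypothetical alternative: a list of grids, one per kernel, so that gamma
# is only varied where it matters.
param_grid_svc_split = [
    {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear']},
    {'clf__C': [0.1, 1, 10], 'clf__kernel': ['rbf'], 'clf__gamma': ['scale', 'auto']},
]
n_split = sum(n_combinations(grid) for grid in param_grid_svc_split)
print("SVM candidates:", n_combinations(param_grid_svc), "->", n_split)
```

`GridSearchCV` accepts a list of dicts like this for `param_grid`, searching each sub-grid in turn.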


# 6. GridSearchCV & RandomizedSearchCV setups

In [6]:

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='f1', n_jobs=-1)
grid_rf = RandomizedSearchCV(pipe_rf, param_distributions=param_dist_rf, cv=5,
                             scoring='f1', n_iter=10, random_state=42, n_jobs=-1)
grid_svc = GridSearchCV(pipe_svc, param_grid_svc, cv=5, scoring='f1', n_jobs=-1)
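
Once a search object has been fitted, its `cv_results_` attribute holds the cross-validated score of every candidate, which shows how sensitive the model is to each hyperparameter. A sketch for the Logistic Regression search follows; it leaves `penalty` at its default and omits `n_jobs` for simplicity, and `cv_table` is a name introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe_lr = Pipeline([('scaler', StandardScaler()),
                    ('clf', LogisticRegression(max_iter=10000))])
grid_lr = GridSearchCV(pipe_lr, {'clf__C': [0.01, 0.1, 1, 10, 100]},
                       cv=5, scoring='f1')
grid_lr.fit(X_train, y_train)

# One row per candidate: mean and spread of the F1 score across the 5 folds.
cv_table = pd.DataFrame(grid_lr.cv_results_)[
    ['param_clf__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table.sort_values('rank_test_score'))
print("Best cross-validated F1:", round(grid_lr.best_score_, 4))
```

If the mean scores of several candidates differ by less than their standard deviations, the choice among them is unlikely to matter much.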


# 7. Train models

In [7]:

for name, model in [('Logistic Regression', grid_lr),
                    ('Random Forest', grid_rf),
                    ('SVM', grid_svc)]:
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    print(f"  Best params: {model.best_params_}")
    y_pred = model.predict(X_test)
    print("  Metrics on test set:")
    print(classification_report(y_test, y_pred, digits=4))


Training Logistic Regression...
  Best params: {'clf__C': 0.1, 'clf__penalty': 'l2'}
  Metrics on test set:
              precision    recall  f1-score   support

           0     0.9756    0.9524    0.9639        42
           1     0.9726    0.9861    0.9793        72

    accuracy                         0.9737       114
   macro avg     0.9741    0.9692    0.9716       114
weighted avg     0.9737    0.9737    0.9736       114


Training Random Forest...
  Best params: {'clf__n_estimators': 100, 'clf__max_features': 'log2', 'clf__max_depth': 10}
  Metrics on test set:
              precision    recall  f1-score   support

           0     0.9512    0.9286    0.9398        42
           1     0.9589    0.9722    0.9655        72

    accuracy                         0.9561       114
   macro avg     0.9551    0.9504    0.9526       114
weighted avg     0.9561    0.9561    0.9560       114


Training SVM...
  Best params: {'clf__C': 0.1, 'clf__gamma': 'scale', 'clf__kernel': 'linear'}

# 8. Compare results

In [8]:
results = {}
for name, model in [('Logistic Regression', grid_lr),
                    ('Random Forest', grid_rf),
                    ('SVM', grid_svc)]:
    y_pred = model.predict(X_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred)
    }

df_results = pd.DataFrame(results).T
print("\nSummary of performance metrics:")
print(df_results.sort_values(by='f1_score', ascending=False))



Summary of performance metrics:
                     accuracy  precision    recall  f1_score
SVM                  0.982456   0.986111  0.986111  0.986111
Logistic Regression  0.973684   0.972603  0.986111  0.979310
Random Forest        0.956140   0.958904  0.972222  0.965517
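
The precision, recall, and F1 values in this summary use scikit-learn's default `pos_label=1`, which is the benign class in this dataset. For a screening task, the malignant class (label 0) is often the one of interest. The sketch below rebuilds label vectors from the counts implied by the Logistic Regression report above (40 of 42 malignant and 71 of 72 benign samples classified correctly; these counts are inferred from the printed recalls and supports, not read from the model) and computes the metrics for both choices of positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Reconstructed from the report: 42 malignant (0) and 72 benign (1) samples.
y_true = [0] * 42 + [1] * 72
# 40 malignant predicted correctly, 2 missed; 71 benign correct, 1 missed.
y_pred = [0] * 40 + [1] * 2 + [1] * 71 + [0] * 1

for pos_label, class_name in [(1, "benign"), (0, "malignant")]:
    precision = precision_score(y_true, y_pred, pos_label=pos_label)
    recall = recall_score(y_true, y_pred, pos_label=pos_label)
    f1 = f1_score(y_true, y_pred, pos_label=pos_label)
    print(f"{class_name:>9}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

malignant_recall = recall_score(y_true, y_pred, pos_label=0)
```

The malignant recall is lower than the benign recall, a difference that a summary based only on `pos_label=1` does not show.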


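# Best Performing Model: SVM
SVM has the highest F1-score (0.9861), the metric that balances precision and recall.

It also has the highest accuracy (0.9825) and precision (0.9861); its recall (0.9861) ties with Logistic Regression.

Test-set performance is strong for all three models, but they differ by at most three of the 114 test samples, so the ranking should be read with caution. Ruling out overfitting would also require comparing these test scores with the cross-validated scores from the training set.

Note that precision, recall, and F1 are reported for label 1, which is the benign class in this dataset.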
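
One way to check for overfitting is to compare the training score, the cross-validated score from the search, and the test score for the selected model. A large gap between training and held-out scores would be a warning sign. The sketch below repeats the SVM search from this notebook without `n_jobs`; the names `train_f1`, `cv_f1`, and `test_f1` are introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe_svc = Pipeline([('scaler', StandardScaler()), ('clf', SVC(random_state=42))])
param_grid_svc = {'clf__C': [0.1, 1, 10],
                  'clf__kernel': ['linear', 'rbf'],
                  'clf__gamma': ['scale', 'auto']}
grid_svc = GridSearchCV(pipe_svc, param_grid_svc, cv=5, scoring='f1')
grid_svc.fit(X_train, y_train)

# Score of the refitted best model on the data it was trained on.
train_f1 = f1_score(y_train, grid_svc.predict(X_train))
# Mean F1 of the best candidate over the 5 validation folds.
cv_f1 = grid_svc.best_score_
# Score on the held-out test set, never seen during the search.
test_f1 = f1_score(y_test, grid_svc.predict(X_test))

print(f"train F1={train_f1:.4f}  cv F1={cv_f1:.4f}  test F1={test_f1:.4f}")
```

With only 114 test samples, a single misclassification moves the test F1 by close to one percentage point, so small gaps in either direction are within noise.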