Adel Movahedian 400102074 <br>
Assignment 7


## 1) Data Loading and Preprocessing

We load the Pima Indians Diabetes dataset (available from [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) or any other source). We then run a brief exploratory data analysis (EDA), check for missing values, separate the features from the target, and make a stratified 80/20 train-test split. Finally, we standardize the features, which matters for scale-sensitive models such as SVM and KNN.


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

# For splitting data and scaling features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# For evaluation
from sklearn.metrics import confusion_matrix, classification_report, f1_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Load the dataset (ensure the file "diabetes.csv" is in the same directory)
df = pd.read_csv("diabetes.csv")

# Quick look at the dataset
print("First 5 rows of the dataset:")
print(df.head(), "\n")

print("Dataset shape:", df.shape, "\n")
print("Dataset info:")
print(df.info(), "\n")

# Check for missing values
print("Missing values in the dataset:")
print(df.isnull().sum(), "\n")

# Basic statistics of the dataset
print("Statistical description:")
print(df.describe(), "\n")

# Define features (X) and target (y)
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# Split the data: 80% training, 20% testing (using stratification to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Scale the features using StandardScaler (mean=0, std=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


First 5 rows of the dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
0            6      148             72             35        0  33.6  \
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1   

Dataset shape: (768, 9) 

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies  
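
The `isnull()` check in the cell above only detects `NaN` entries. The preview of the first five rows shows zeros in `SkinThickness` and `Insulin`, which are unlikely to be real measurements, so zeros may be standing in for missing values. The sketch below counts them and converts them to `NaN`. It assumes that a zero is a placeholder in the five listed columns; the helper names `count_zero_placeholders` and `zeros_to_nan` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# The five rows printed by df.head() above, rebuilt inline so the sketch runs on its own.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zero_placeholders(frame, columns):
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()


def zeros_to_nan(frame, columns):
    """Return a copy of the frame in which zeros in the given columns become NaN."""
    out = frame.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out


zero_counts = count_zero_placeholders(sample, ZERO_AS_MISSING)
cleaned = zeros_to_nan(sample, ZERO_AS_MISSING)

print("Zero entries per column:")
print(zero_counts)
print("\nMissing values after conversion:")
print(cleaned.isnull().sum())
```

If the zeros are treated as missing, any imputation statistic (for example a median) would need to be computed on the training split only, for the same reason the scaler is fitted on the training split only.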

## 2) Logistic Regression for Classification

In this section, we train a Logistic Regression model, predict on the test set, and measure the F1-score. The assignment requires a score above 0.75; the run below reaches about 0.56 for the positive class.


In [2]:
from sklearn.linear_model import LogisticRegression

# Instantiate and fit the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predict on the test set and compute F1-score
y_pred_lr = log_reg.predict(X_test_scaled)
f1_lr = f1_score(y_test, y_pred_lr)

print("Logistic Regression - Test F1 Score:", f1_lr)
print("\nClassification Report for Logistic Regression:")
print(classification_report(y_test, y_pred_lr))


Logistic Regression - Test F1 Score: 0.5599999999999999

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.76      0.82      0.79       100
           1       0.61      0.52      0.56        54

    accuracy                           0.71       154
   macro avg       0.68      0.67      0.67       154
weighted avg       0.71      0.71      0.71       154
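
To make the reported numbers easier to read, the sketch below recomputes the F1-scores by hand. The four confusion-matrix counts are an assumption: they were inferred from the rounded precision and recall values and the supports (100 and 54) in the report above, not read from an actual confusion matrix. It also shows that `f1_score` with its default settings reports the positive class only, while the report's "macro avg" row averages both classes.

```python
# Counts inferred from the classification report above (assumption, see lead-in).
tn, fp, fn, tp = 82, 18, 26, 28

# Positive class (Outcome = 1): this is what f1_score returns by default.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_positive = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class (Outcome = 0): the roles of the two classes are swapped.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_negative = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

# Macro average: unweighted mean of the per-class F1-scores.
macro_f1 = (f1_positive + f1_negative) / 2
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("F1 (class 1):", round(f1_positive, 2))
print("F1 (class 0):", round(f1_negative, 2))
print("Macro F1:", round(macro_f1, 2))
print("Accuracy:", round(accuracy, 2))
```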



## 3) Support Vector Machine (Linear SVM)

Here, we train a linear SVM using scikit-learn's `SVC` with `kernel='linear'`. The target is an F1-score above 0.80 on the test set; the run below reaches about 0.57.


In [3]:
from sklearn.svm import SVC

# Instantiate and fit a Linear SVM model
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_scaled, y_train)

# Predict on the test set and compute F1-score
y_pred_svm = svm_linear.predict(X_test_scaled)
f1_svm = f1_score(y_test, y_pred_svm)

print("Linear SVM - Test F1 Score:", f1_svm)
print("\nClassification Report for Linear SVM:")
print(classification_report(y_test, y_pred_svm))


Linear SVM - Test F1 Score: 0.5656565656565656

Classification Report for Linear SVM:
              precision    recall  f1-score   support

           0       0.76      0.83      0.79       100
           1       0.62      0.52      0.57        54

    accuracy                           0.72       154
   macro avg       0.69      0.67      0.68       154
weighted avg       0.71      0.72      0.71       154



## 4) Kernel SVM (RBF Kernel)

We now train an SVM with a non-linear RBF kernel, which may capture more complex patterns. A grid search over the hyperparameters `C` and `gamma` selects the combination with the highest cross-validated F1-score; the target on the test set is above 0.80.


In [4]:
from sklearn.model_selection import GridSearchCV

# Define a parameter grid for the RBF Kernel SVM
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}

# Set up the SVM with RBF kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
grid_svm_rbf = GridSearchCV(svm_rbf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_svm_rbf.fit(X_train_scaled, y_train)

# Choose the best estimator and evaluate on test set
best_svm_rbf = grid_svm_rbf.best_estimator_
y_pred_svm_rbf = best_svm_rbf.predict(X_test_scaled)
f1_svm_rbf = f1_score(y_test, y_pred_svm_rbf)

print("Best RBF-SVM parameters:", grid_svm_rbf.best_params_)
print("Kernel SVM (RBF) - Test F1 Score:", f1_svm_rbf)
print("\nClassification Report for Kernel SVM (RBF):")
print(classification_report(y_test, y_pred_svm_rbf))


Best RBF-SVM parameters: {'C': 10, 'gamma': 0.01}
Kernel SVM (RBF) - Test F1 Score: 0.6041666666666667

Classification Report for Kernel SVM (RBF):
              precision    recall  f1-score   support

           0       0.78      0.87      0.82       100
           1       0.69      0.54      0.60        54

    accuracy                           0.75       154
   macro avg       0.73      0.70      0.71       154
weighted avg       0.75      0.75      0.74       154
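
In the cell above, the scaler is fitted on the whole training set before the grid search splits it into folds, so each validation fold has already influenced the scaling. A common alternative is to put the scaler and the classifier in one `Pipeline`, so that scaling is refitted inside every fold. The sketch below illustrates this on synthetic stand-in data (an assumption; it does not use the diabetes dataset), and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a roughly 65/35 class split (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling and the classifier form one estimator; step names prefix the grid keys.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
pipe_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated F1:", search.best_score_)
print("Test F1:", search.score(X_te, y_te))  # uses the same 'f1' scorer
```

Comparing `best_score_` with the test score is a quick way to see whether the cross-validated estimate was optimistic.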



## 5) K-Nearest Neighbors (KNN)

### 5.1) Tuning for the Best Number of Neighbors (k)

KNN is sensitive to the choice of the parameter `k` (number of neighbors). We perform a grid search over various `k` values and weight methods to maximize the F1-score, which should exceed 0.80.


In [5]:
from sklearn.neighbors import KNeighborsClassifier

# Define parameter grid for KNN
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance']
}

knn = KNeighborsClassifier()
grid_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='f1', n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)

# Choose the best estimator and evaluate on test set
best_knn = grid_knn.best_estimator_
y_pred_knn = best_knn.predict(X_test_scaled)
f1_knn = f1_score(y_test, y_pred_knn)

print("Best KNN parameters:", grid_knn.best_params_)
print("KNN - Test F1 Score:", f1_knn)
print("\nClassification Report for KNN:")
print(classification_report(y_test, y_pred_knn))


Best KNN parameters: {'n_neighbors': 7, 'weights': 'uniform'}
KNN - Test F1 Score: 0.613861386138614

Classification Report for KNN:
              precision    recall  f1-score   support

           0       0.79      0.84      0.81       100
           1       0.66      0.57      0.61        54

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.71       154
weighted avg       0.74      0.75      0.74       154



## 6) Decision Trees

### 6.1) Tuning for the Best Maximum Depth

Decision Trees are prone to overfitting. Therefore, we tune key hyperparameters such as `max_depth`, `min_samples_split`, and `min_samples_leaf` to regularize the model. The F1-score must be above 0.80.


In [6]:
from sklearn.tree import DecisionTreeClassifier

# Define parameter grid for Decision Trees
param_grid_dt = {
    'max_depth': [2, 3, 4, 5, 6, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

dt = DecisionTreeClassifier(random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=5, scoring='f1', n_jobs=-1)
grid_dt.fit(X_train_scaled, y_train)

# Choose the best estimator and evaluate on test set
best_dt = grid_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test_scaled)
f1_dt = f1_score(y_test, y_pred_dt)

print("Best Decision Tree parameters:", grid_dt.best_params_)
print("Decision Tree - Test F1 Score:", f1_dt)
print("\nClassification Report for Decision Tree:")
print(classification_report(y_test, y_pred_dt))


Best Decision Tree parameters: {'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2}
Decision Tree - Test F1 Score: 0.6930693069306931

Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       100
           1       0.74      0.65      0.69        54

    accuracy                           0.80       154
   macro avg       0.78      0.76      0.77       154
weighted avg       0.80      0.80      0.80       154



### 6.2) Three Techniques to Regularize Decision Trees

Regularization in decision trees helps prevent overfitting. Here are three common methods:

1. **Limit the Maximum Depth (`max_depth`):**  
   Restricting the depth of a tree limits its complexity, reducing the chance of overfitting by preventing the tree from modeling noise.

2. **Minimum Samples for Splitting and at Leaf Nodes (`min_samples_split` and `min_samples_leaf`):**  
   These parameters ensure that a node must have a minimum number of samples to be split or to form a leaf, thus avoiding splits on very few samples.

3. **Cost Complexity Pruning (`ccp_alpha`):**  
   This method prunes the tree post-training by penalizing more complex trees. A higher `ccp_alpha` leads to a simpler tree.
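
The three techniques can be compared side by side. The following is a minimal sketch on synthetic data (an assumption; the diabetes dataset is not used here), with hypothetical variable names. It fits an unrestricted tree and then one tree per technique, and the pruning strength is taken from the middle of the path returned by `cost_complexity_pruning_path` purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, flip_y=0.1, random_state=42
)

# Baseline: an unrestricted tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# 1) Limit the maximum depth.
shallow_tree = DecisionTreeClassifier(random_state=42, max_depth=3).fit(X_demo, y_demo)

# 2) Require a minimum number of samples in every leaf.
leafy_tree = DecisionTreeClassifier(random_state=42, min_samples_leaf=10).fit(X_demo, y_demo)

# 3) Cost complexity pruning: pick an alpha from the pruning path.
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)
mid_alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)  # guard against tiny negatives
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=mid_alpha).fit(X_demo, y_demo)

for name, tree in [
    ("unrestricted", full_tree),
    ("max_depth=3", shallow_tree),
    ("min_samples_leaf=10", leafy_tree),
    ("ccp_alpha (mid path)", pruned_tree),
]:
    print(f"{name:<22} depth={tree.get_depth():<3} leaves={tree.get_n_leaves()}")
```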


## 7) Random Forest

Random Forest is an ensemble method that combines the predictions of many decision trees. We run a grid search over `n_estimators`, `max_depth`, and `min_samples_leaf`, aiming for an F1-score above 0.85.


In [7]:
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=42)
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='f1', n_jobs=-1)
grid_rf.fit(X_train_scaled, y_train)

# Choose the best estimator and evaluate on test set
best_rf = grid_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test_scaled)
f1_rf = f1_score(y_test, y_pred_rf)

print("Best Random Forest parameters:", grid_rf.best_params_)
print("Random Forest - Test F1 Score:", f1_rf)
print("\nClassification Report for Random Forest:")
print(classification_report(y_test, y_pred_rf))


Best Random Forest parameters: {'max_depth': None, 'min_samples_leaf': 2, 'n_estimators': 100}
Random Forest - Test F1 Score: 0.6

Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.78      0.84      0.81       100
           1       0.65      0.56      0.60        54

    accuracy                           0.74       154
   macro avg       0.71      0.70      0.70       154
weighted avg       0.73      0.74      0.73       154



## 8) Bonus: Achieve F1-Score Above 0.90

For the bonus, we attempt to reach an F1-score above 0.90 using XGBoost. XGBoost is an advanced ensemble algorithm that can often improve performance when tuned properly. Additionally, if class imbalance is an issue, techniques like oversampling may be applied.


In [9]:
!pip install xgboost



In [10]:
import xgboost as xgb

# Define parameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

# `use_label_encoder` is omitted: it is deprecated and not needed in recent XGBoost releases
xgb_clf = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
grid_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1', n_jobs=-1)
grid_xgb.fit(X_train_scaled, y_train)

# Choose the best estimator and evaluate on test set
best_xgb = grid_xgb.best_estimator_
y_pred_xgb = best_xgb.predict(X_test_scaled)
f1_xgb = f1_score(y_test, y_pred_xgb)

print("Best XGBoost parameters:", grid_xgb.best_params_)
print("XGBoost - Test F1 Score:", f1_xgb)
print("\nClassification Report for XGBoost:")
print(classification_report(y_test, y_pred_xgb))


Best XGBoost parameters: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 50, 'subsample': 0.8}
XGBoost - Test F1 Score: 0.594059405940594

Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.78      0.83      0.80       100
           1       0.64      0.56      0.59        54

    accuracy                           0.73       154
   macro avg       0.71      0.69      0.70       154
weighted avg       0.73      0.73      0.73       154



### Optional: Using Oversampling to Improve Performance

If the F1-score is still below 0.90, we can try oversampling the minority class (here with `RandomOverSampler` from imbalanced-learn) to balance the training set and see whether performance improves.


In [11]:
# Import oversampling technique from imblearn
from imblearn.over_sampling import RandomOverSampler

# Create oversampled training data
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train_scaled, y_train)

# Rerun the XGBoost grid search on the oversampled data (this overwrites the earlier grid_xgb results)
grid_xgb.fit(X_train_ros, y_train_ros)
best_xgb_ros = grid_xgb.best_estimator_
y_pred_xgb_ros = best_xgb_ros.predict(X_test_scaled)
f1_xgb_ros = f1_score(y_test, y_pred_xgb_ros)

print("XGBoost with Random Oversampling - Test F1 Score:", f1_xgb_ros)
print("\nClassification Report for XGBoost with Oversampling:")
print(classification_report(y_test, y_pred_xgb_ros))


XGBoost with Random Oversampling - Test F1 Score: 0.6194690265486725

Classification Report for XGBoost with Oversampling:
              precision    recall  f1-score   support

           0       0.80      0.76      0.78       100
           1       0.59      0.65      0.62        54

    accuracy                           0.72       154
   macro avg       0.70      0.70      0.70       154
weighted avg       0.73      0.72      0.72       154
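
In the cell above, the training set is oversampled before the grid search runs its internal cross-validation, so copies of the same row can end up in both the training and validation parts of a fold, which may inflate the cross-validated score. One way to avoid this is to oversample only the training part of each fold. The sketch below does this by hand on synthetic data, with Logistic Regression standing in for XGBoost; the data, the stand-in model, and the helper `oversample_minority` are all assumptions for illustration. The `Pipeline` class in `imblearn.pipeline` serves the same purpose when used inside `GridSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random minority-class rows until both classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Synthetic stand-in data with a roughly 65/35 class split (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=42
)

rng = np.random.default_rng(42)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in splitter.split(X_demo, y_demo):
    # Balance the training part only; the validation part keeps its original rows.
    X_bal, y_bal = oversample_minority(X_demo[train_idx], y_demo[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    fold_scores.append(f1_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

print("Per-fold F1:", np.round(fold_scores, 3))
print("Mean F1:", np.mean(fold_scores))
```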



## 9) Summary of All Models' F1-Scores

Below we print the F1-scores of all the models to check if they meet the required thresholds:
- Logistic Regression: F1-score should be > 0.75  
- Linear SVM: F1-score should be > 0.80  
- Kernel SVM: F1-score should be > 0.80  
- KNN: F1-score should be > 0.80  
- Decision Tree: F1-score should be > 0.80  
- Random Forest: F1-score should be > 0.85  
- Bonus: XGBoost (with or without oversampling): F1-score > 0.90 (if achieved)

None of the models reached its target: the best test F1-score is about 0.69 (Decision Tree). We attribute the gap to the limitations of the dataset itself.


In [12]:
print("Logistic Regression (F1):", f1_lr)
print("Linear SVM (F1):", f1_svm)
print("Kernel SVM (F1):", f1_svm_rbf)
print("KNN (F1):", f1_knn)
print("Decision Tree (F1):", f1_dt)
print("Random Forest (F1):", f1_rf)
print("XGBoost (F1):", f1_xgb, "(Bonus attempt)")
print("XGBoost with Oversampling (F1):", f1_xgb_ros, "(Optional bonus attempt)")


Logistic Regression (F1): 0.5599999999999999
Linear SVM (F1): 0.5656565656565656
Kernel SVM (F1): 0.6041666666666667
KNN (F1): 0.613861386138614
Decision Tree (F1): 0.6930693069306931
Random Forest (F1): 0.6
XGBoost (F1): 0.594059405940594 (Bonus attempt)
XGBoost with Oversampling (F1): 0.6194690265486725 (Optional bonus attempt)
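
The comparison against the thresholds can also be done programmatically. In the sketch below, the scores are copied (rounded to three decimals) from the output above and the thresholds from the list at the start of this section; the `results` dictionary and the layout of the printed table are illustrative choices.

```python
# (test F1-score from the output above, required threshold from the list above)
results = {
    "Logistic Regression": (0.560, 0.75),
    "Linear SVM": (0.566, 0.80),
    "Kernel SVM": (0.604, 0.80),
    "KNN": (0.614, 0.80),
    "Decision Tree": (0.693, 0.80),
    "Random Forest": (0.600, 0.85),
    "XGBoost": (0.594, 0.90),
    "XGBoost with Oversampling": (0.619, 0.90),
}

for name, (score, threshold) in results.items():
    status = "met" if score > threshold else "not met"
    print(f"{name:<28} F1={score:.3f}  target>{threshold:.2f}  {status}")

# Model with the highest test F1-score.
best_model = max(results, key=lambda name: results[name][0])
print("\nBest model:", best_model)
```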


## Conclusion

In this notebook, we demonstrated a complete binary classification pipeline on the Pima Indians Diabetes Dataset:
- **Data loading, EDA, and preprocessing** with scaling.
- Implementation and tuning of **Logistic Regression, Linear SVM, Kernel SVM, KNN, Decision Trees, and Random Forests**.
- Discussion of **regularization techniques for Decision Trees**.
- A **bonus attempt** with XGBoost (with and without oversampling) aimed at achieving an F1-score above 0.90 on the test set.

Each cell includes comments and explanations that detail why each step is performed and what is learned from it.
