## Homework 2

### Task 1

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from tpot import TPOTClassifier

In [2]:
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data"
data = pd.read_csv(url, header=None)
print(data)

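# Split the data: column 0 is a row ID, columns 1-9 are the features
# (refractive index and oxide contents), and column 10 is the glass type.
X = data.iloc[:, 1:-1]  # Features (ID column dropped)
y = data.iloc[:, -1]    # Target (glass type)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)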

      0        1      2     3     4      5     6     7     8    9   10
0      1  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.00  0.0   1
1      2  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.00  0.0   1
2      3  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.00  0.0   1
3      4  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.00  0.0   1
4      5  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.00  0.0   1
..   ...      ...    ...   ...   ...    ...   ...   ...   ...  ...  ..
209  210  1.51623  14.14  0.00  2.88  72.61  0.08  9.18  1.06  0.0   7
210  211  1.51685  14.92  0.00  1.99  73.06  0.00  8.40  1.59  0.0   7
211  212  1.52065  14.36  0.00  2.02  73.42  0.00  8.44  1.64  0.0   7
212  213  1.51651  14.38  0.00  1.94  73.61  0.00  8.48  1.57  0.0   7
213  214  1.51711  14.23  0.00  2.08  73.36  0.00  8.62  1.67  0.0   7

[214 rows x 11 columns]


In [3]:
# Model exploration: build two pipelines with default hyperparameters and evaluate them with cross-validation.
# Model 1: Random Forest
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])

# Model 2: Support Vector Machine (SVM)
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC())
])

# Evaluate models with 5-fold cross-validation
models = [
    ("Random Forest", rf_pipeline),
    ("SVM", svm_pipeline)
]

for name, model in models:
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name} - Cross-Validation Scores: {cv_scores}")
    print(f"{name} - Mean Accuracy: {cv_scores.mean()}")

Random Forest - Cross-Validation Scores: [0.68571429 0.67647059 0.82352941 0.82352941 0.70588235]
Random Forest - Mean Accuracy: 0.7430252100840337
SVM - Cross-Validation Scores: [0.65714286 0.64705882 0.61764706 0.73529412 0.67647059]
SVM - Mean Accuracy: 0.6667226890756304


### Comments and comparison of the results:
Random Forest:
Fold accuracy ranges from approximately 67.65% to 82.35%, so performance varies noticeably between folds.
The mean accuracy of around 74.30% indicates decent overall performance.

SVM:
Fold accuracy ranges from approximately 61.76% to 73.53%.
The mean accuracy of around 66.67% suggests moderate performance.

Random Forest outperforms SVM in mean accuracy (74.30% versus 66.67%) and scores higher than SVM in every fold. It is not more consistent, however: its fold scores span about 14.7 percentage points, compared with about 11.8 for SVM.
Both models use default hyperparameters, and the Random Forest is not seeded, so its scores change slightly from run to run. Overall, Random Forest appears to be the better-performing model in this evaluation.
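
To put a number on consistency across folds, the fold scores can be summarised by their mean, standard deviation, and range; a smaller standard deviation or range means more consistent folds. The sketch below hard-codes the scores printed in the output above. The helper name `summarise` is illustrative and not part of the notebook.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above.
rf_scores = [0.68571429, 0.67647059, 0.82352941, 0.82352941, 0.70588235]
svm_scores = [0.65714286, 0.64705882, 0.61764706, 0.73529412, 0.67647059]


def summarise(scores):
    """Return mean, population standard deviation, and range of fold scores."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        "range": max(scores) - min(scores),
    }


rf_summary = summarise(rf_scores)
svm_summary = summarise(svm_scores)

for name, summary in [("Random Forest", rf_summary), ("SVM", svm_summary)]:
    print(
        f"{name}: mean={summary['mean']:.4f}, "
        f"std={summary['std']:.4f}, range={summary['range']:.4f}"
    )
```

With only five folds these spread figures are rough, so they are best read alongside the means rather than on their own.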


In [4]:
# Define different cross-validation techniques
cv_techniques = [KFold(n_splits=5, shuffle=True, random_state=42),
                 StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                 LeaveOneOut()]

# Iterate through the models and cross-validation techniques
for name, model in models:
    for cv_technique in cv_techniques:
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv_technique, scoring='accuracy')
        print(f"{name} - {cv_technique.__class__.__name__} CV Scores: {cv_scores}")
        print(f"{name} - {cv_technique.__class__.__name__} Mean Accuracy: {cv_scores.mean()}")


Random Forest - KFold CV Scores: [0.85714286 0.76470588 0.76470588 0.61764706 0.76470588]
Random Forest - KFold Mean Accuracy: 0.753781512605042
Random Forest - StratifiedKFold CV Scores: [0.65714286 0.70588235 0.73529412 0.73529412 0.76470588]
Random Forest - StratifiedKFold Mean Accuracy: 0.7196638655462185
Random Forest - LeaveOneOut CV Scores: [1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1.
 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1.
 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0.
 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1.
 1. 1. 0.]
Random Forest - LeaveOneOut Mean Accuracy: 0.7426900584795322
SVM - KFold CV Scores: [0.8        0.55882353 0.76470588 0.5882352

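### Comments:
For Random Forest, the three techniques give similar estimates: mean accuracy is 75.38% with KFold, 71.97% with StratifiedKFold, and 74.27% with Leave-One-Out. Because the Random Forest is not seeded, part of this difference is run-to-run noise.
Leave-One-Out does not produce the highest accuracy here. Each of its folds holds a single sample, so every fold score is either 0 or 1 and only the mean is informative. It is also the most expensive technique, since it fits the model once per training sample.
The SVM output above is cut off, so only some of its KFold scores are visible. Those range from about 55.88% to 80.00%, a spread similar to that of the Random Forest KFold scores (61.76% to 85.71%).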
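
The fold scores above are coarse because each fold is small. A minimal sketch, assuming the training set holds 171 samples (214 rows minus the 43 held out for testing) and using the fold-size rule documented for `KFold`, where the first `n_samples % n_splits` folds get one extra sample. The helper `kfold_sizes` is hypothetical and only mirrors that rule.

```python
def kfold_sizes(n_samples, n_splits):
    """Fold sizes for k-fold CV: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_train = 171  # assumption: 214 rows minus the 43 held out for testing

five_fold = kfold_sizes(n_train, 5)
loo = kfold_sizes(n_train, n_train)  # Leave-One-Out is k-fold with k = n

print("5-fold sizes:", five_fold)
print("Accuracy step per sample in a 34-sample fold:", round(1 / 34, 4))
print("Leave-One-Out folds:", len(loo), "with sizes", set(loo))

# With one sample per fold, each Leave-One-Out score is 0 or 1, so the mean
# is simply the fraction of training samples predicted correctly.
loo_correct = round(0.7426900584795322 * n_train)
print("Leave-One-Out correct predictions:", loo_correct, "of", n_train)
```

In a fold of 34 samples, a single prediction moves the fold accuracy by roughly 3 percentage points, which is worth keeping in mind when comparing individual fold scores.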

In [5]:
# Run AutoML on the dataset using TPOT.
# Note: 'TPOT sparse' limits the search to operators that support sparse matrices;
# the glass features are dense, so the default configuration would also work.
tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2, config_dict='TPOT sparse')
tpot.fit(X_train, y_train)
# print(tpot.score(X_test, y_test))

# Compare AutoML with the manual models
tpot_accuracy = accuracy_score(y_test, tpot.predict(X_test))
print("TPOT - Test Accuracy: ", tpot_accuracy)

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7021848739495798

Generation 2 - Current best internal CV score: 0.7368067226890757

Generation 3 - Current best internal CV score: 0.7368067226890757

Generation 4 - Current best internal CV score: 0.7660504201680672

Generation 5 - Current best internal CV score: 0.7660504201680672

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.1, min_samples_leaf=2, min_samples_split=12, n_estimators=100)
TPOT - Test Accuracy:  0.7906976744186046


### Comparison:
The pipeline selected by TPOT, a Random Forest, achieved a test accuracy of 0.791 and an internal cross-validation score of 0.766. The manually built Random Forest, which used default hyperparameters, had a mean cross-validation accuracy of 0.743. The like-for-like comparison is between the two cross-validation scores (0.766 versus 0.743), because the manual models were not evaluated on the test set.

TPOT searched 120 pipelines (5 generations with a population of 20) and found a configuration that scored higher than the defaults used in the manual exploration.

The SVM model from manual exploration had a mean cross-validation accuracy of 0.667, which is lower than both TPOT scores.

In summary, TPOT found a better-scoring configuration than the manually explored models for this dataset, with less manual intervention. The margin is modest and the test set holds only 43 samples, so the result is suggestive rather than conclusive.
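
Because the test set is small, the test accuracy carries sizeable uncertainty. The sketch below computes a 95% Wilson score interval, assuming the reported 0.7907 corresponds to 34 correct predictions out of 43 test samples (34/43 ≈ 0.7907). The function `wilson_interval` is an illustrative helper and not part of the notebook.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 34 of the 43 test samples were classified correctly.
low, high = wilson_interval(34, 43)
print(f"TPOT test accuracy: {34 / 43:.3f}, 95% interval: ({low:.3f}, {high:.3f})")

# Cross-validated means of the manual models, taken from the output above.
for name, cv_mean in [("Random Forest", 0.743), ("SVM", 0.667)]:
    inside = low <= cv_mean <= high
    print(f"{name} CV mean {cv_mean:.3f} inside interval: {inside}")
```

When a cross-validated mean falls inside this interval, the gap between TPOT and that manual model cannot be separated from sampling noise on this test set alone. Scoring the manual pipelines on the same held-out samples would give a more direct comparison.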





