# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 11

**See PrairieLearn for _due date_ and _submission_**

## Submission instructions <a name="si"></a>
<hr>

_points: 4_

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **You may work on this assignment in a group (group size <= 4) and submit your assignment as a group.** 
- Below are some instructions on working as a group.  
    - The maximum group size is 4.
    - You can choose your own group members. 
    - Use group work as an opportunity to collaborate and learn new things from each other. 
    - Be respectful to each other and make sure you understand all the concepts in the assignment well. 
    - It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- Be sure to follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024s/blob/main/docs/homework_instructions.md).

## Imports

In [1]:
import os
import re
import string
import sys
from hashlib import sha1

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# import tests_hw5
from sklearn import datasets
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it in functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours". Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>

_points: 3_

In this mini project, you will be working on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

<div class="alert alert-warning">
    
Solution_1
    
</div>

The dataset describes each client's demographics, credit limit, payment history, and billing amounts. Key features are LIMIT_BAL (credit limit), SEX (gender), EDUCATION (education level), MARRIAGE (marital status), AGE (age), PAY_0 and PAY_2 to PAY_6 (repayment status for the past 6 months; the data has no PAY_1 column), BILL_AMT1 to BILL_AMT6 (bill statement amounts for the past 6 months), and PAY_AMT1 to PAY_AMT6 (amounts paid in the past 6 months). LIMIT_BAL, the repayment status columns, and BILL_AMT1 to BILL_AMT6 are directly related to the client's financial status and payment history, so they are likely to be strong predictors of default risk. Demographic features like SEX, EDUCATION, MARRIAGE, and AGE might also play a role. For example, clients of different ages or marital statuses might default at different rates.



In [2]:
credit_df = pd.read_csv("data/UCI_Credit_Card.csv")

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting <a name="2"></a>
<hr>

_points: 2_

**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=76`.

> If your computer cannot handle training on 70% training data, make the test split bigger.  

<div class="alert alert-warning">
    
Solution_2
    
</div>

In [3]:
X = credit_df.drop(columns=['default.payment.next.month'])
y = credit_df['default.payment.next.month']

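# Note: test_size=0.5 gives a 50/50 split rather than the 70/30 split named in the task
# (the task permits a larger test split).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=76)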
print("Training set size:", X_train.shape, y_train.shape)
print("Test set size:", X_test.shape, y_test.shape)

Training set size: (15000, 24) (15000,)
Test set size: (15000, 24) (15000,)
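
Since defaults are the minority class, it can help to keep the class ratio identical in both portions of the split. The following is a minimal sketch on synthetic labels (not the credit data) of how the `stratify` argument of `train_test_split` does this; `toy_X`, `toy_y`, and the 22% positive rate are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows with 22% positive labels (not the credit data).
rng = np.random.default_rng(76)
toy_X = rng.normal(size=(1000, 3))
toy_y = np.array([1] * 220 + [0] * 780)

# stratify=toy_y asks for the same positive rate in both portions of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=76, stratify=toy_y
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Without `stratify`, the two rates would differ by random chance, which matters more as the minority class gets smaller.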


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA <a name="3"></a>
<hr>

_points: 10_

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

<div class="alert alert-warning">
    
Solution_3
    
</div>

In [4]:
train_stats = X_train.describe()
train_stats

| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | ... | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 |
| mean | 15071.417533 | 167528.4 | 1.601333 | 1.848533 | 1.5552 | 35.416533 | -0.0116 | -0.135467 | -0.1656 | -0.218067 | ... | 47178.9 | 43310.1996 | 40314.051533 | 38716.8962 | 5592.1916 | 6189.92 | 5211.853867 | 4852.437133 | 4804.635133 | 5212.562867 |
| std | 8654.862414 | 129681.73895 | 0.48964 | 0.792822 | 0.520293 | 9.15351 | 1.130819 | 1.199727 | 1.196051 | 1.166849 | ... | 70054.81 | 64018.027468 | 60415.235032 | 58915.266417 | 16033.019708 | 26197.04 | 17369.32599 | 16567.517828 | 15784.966256 | 17575.70759 |
| min | 4.0 | 10000.0 | 1.0 | 0.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -61506.0 | -170000.0 | -81334.0 | -209051.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7559.5 | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 2638.75 | 2349.25 | 1774.25 | 1256.0 | 1000.0 | 827.0 | 390.0 | 289.5 | 244.0 | 167.75 |
| 50% | 15149.0 | 140000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 20159.0 | 19084.0 | 18124.0 | 17207.5 | 2152.5 | 2010.0 | 1809.5 | 1500.0 | 1500.0 | 1500.0 |
| 75% | 22556.75 | 240000.0 | 2.0 | 2.0 | 2.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 60843.25 | 55420.25 | 50526.25 | 49267.25 | 5025.0 | 5000.0 | 4600.0 | 4027.75 | 4093.25 | 4001.25 |
| max | 29999.0 | 800000.0 | 2.0 | 6.0 | 3.0 | 79.0 | 8.0 | 8.0 | 8.0 | 8.0 | ... | 1664089.0 | 706864.0 | 823540.0 | 527566.0 | 873552.0 | 1684259.0 | 889043.0 | 621000.0 | 426529.0 | 528666.0 |


SEX:
- Values: 1 (male), 2 (female)
- Mean: 1.60 (skewed towards females)
- Standard Deviation: 0.49
- Insights: The dataset has more female clients than male clients.

PAY_0 to PAY_6 (Repayment Status):
- Mean values around -0.01 to -0.22, with standard deviations around 1.12 to 1.20.
- Range: -2 to 8
- Insights: Most clients do not have significant delays in their repayments. However, there are outliers with repayment statuses up to 8 months late.


Observations:
- The dataset includes clients with a wide range of financial behaviors, from those who consistently pay their bills on time to those who have significant delays.
- Demographic features such as age, education, and marital status vary, but the clients are mostly young to middle-aged (median age 34, with half of them between 28 and 41).
- The financial features (credit limit, bill amounts, payment amounts) have high variability, with standard deviations larger than their means for the bill and payment amounts, suggesting a diverse client base in terms of financial health and behavior.

Metrics for Assessment:
- Accuracy: To measure the overall correctness of the model. Because defaults are the minority class, accuracy alone can look good for a model that rarely predicts a default, so it is read together with the metrics below.
- Precision and Recall: To understand the model's performance on predicting defaults versus non-defaults.
- F1-Score: To balance precision and recall.
- ROC-AUC: To evaluate the model's ability to discriminate between default and non-default cases.
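
To make the metrics listed above concrete, here is a small sketch that computes them with `sklearn.metrics` on hand-made labels. The arrays `y_true`, `y_hat`, and `y_score` are hypothetical and are not taken from the credit data.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.9, 0.8, 0.45, 0.35])

toy_accuracy = accuracy_score(y_true, y_hat)    # fraction of correct predictions
toy_precision = precision_score(y_true, y_hat)  # TP / (TP + FP) for class 1
toy_recall = recall_score(y_true, y_hat)        # TP / (TP + FN) for class 1
toy_f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
toy_auc = roc_auc_score(y_true, y_score)        # uses scores, not hard predictions

print(toy_accuracy, toy_precision, toy_recall, toy_f1, toy_auc)
```

Note that ROC-AUC needs scores or probabilities (for example from `predict_proba`), while the other four metrics are computed from hard class predictions.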


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Preprocessing and transformations <a name="5"></a>
<hr>

_points: 10_

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

<div class="alert alert-warning">
    
Solution_4
    
</div>

- **Categorical features:** SEX, EDUCATION, MARRIAGE. Transformation: apply one-hot encoding to convert categorical values into binary columns.
- **Numerical features:** LIMIT_BAL, AGE, PAY_0 and PAY_2 to PAY_6, BILL_AMT1 to BILL_AMT6, PAY_AMT1 to PAY_AMT6. Transformation: standardize these features so that they are on a similar scale.
- **Drop:** ID. It is only a row identifier and does not provide predictive value.

In [5]:
from sklearn.compose import ColumnTransformer
categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
numerical_features = ['LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                      'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                      'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

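# ID is not listed in either group, so ColumnTransformer drops it
# (remainder="drop" is the default).
preprocessor = ColumnTransformer(
    transformers=[
        # handle_unknown='ignore' avoids an error if a category appears only in the test set
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', StandardScaler(), numerical_features)
    ])
# Fit on the training set only, then apply the same transformation to the test set
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)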
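
Fitting the preprocessor once on the whole training set means that, during later cross-validation, each validation fold has already influenced the scaling statistics. One common way to avoid this is to place the column transformer and the model in a single pipeline so the preprocessing is refit inside every fold. The following is a minimal sketch on a small synthetic frame, not the credit data; `toy_df`, `toy_target`, and `toy_preprocessor` are hypothetical names, and only four columns are mimicked.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with randomly generated values.
rng = np.random.default_rng(76)
n = 200
toy_df = pd.DataFrame(
    {
        "SEX": rng.integers(1, 3, size=n),
        "MARRIAGE": rng.integers(0, 4, size=n),
        "LIMIT_BAL": rng.uniform(10_000, 500_000, size=n),
        "AGE": rng.integers(21, 80, size=n),
    }
)
toy_target = rng.integers(0, 2, size=n)

toy_preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["SEX", "MARRIAGE"]),
    (StandardScaler(), ["LIMIT_BAL", "AGE"]),
)

# The pipeline refits the encoder and scaler on the training part of each fold.
pipe = make_pipeline(toy_preprocessor, LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, toy_df, toy_target, cv=5)

print(f"Mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because the toy target is random, the score itself is meaningless; the point is only the structure of passing the raw frame and the pipeline to the cross-validation function.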

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Baseline model <a name="6"></a>
<hr>

_points: 2_

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

<div class="alert alert-warning">
    
Solution_5
    
</div>

In [6]:
baseline = DummyClassifier(strategy="most_frequent")

baseline.fit(X_train_transformed, y_train)

y_pred = baseline.predict(X_test_transformed)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.783
Classification Report:
              precision    recall  f1-score   support

           0       0.78      1.00      0.88     11745
           1       0.00      0.00      0.00      3255

    accuracy                           0.78     15000
   macro avg       0.39      0.50      0.44     15000
weighted avg       0.61      0.78      0.69     15000

<!-- END QUESTION -->



<br><br>

<!-- BEGIN QUESTION -->

## 6. Linear models <a name="7"></a>
<hr>

_points: 10_

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.

<div class="alert alert-warning">
    
Solution_6
    
</div>

In [7]:
linear = LogisticRegression(solver='liblinear', random_state=76)

linear.fit(X_train_transformed, y_train)

y_pred = linear.predict(X_test_transformed)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Initial Accuracy: {accuracy}")
print("Initial Classification Report:")
print(report)

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(linear, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train_transformed, y_train)

best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")


best_log_reg = LogisticRegression(solver='liblinear', C=best_params['C'], random_state=76)
cv_scores = cross_val_score(best_log_reg, X_train_transformed, y_train, cv=5, scoring='accuracy')

print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Score: {np.mean(cv_scores)}")
print(f"Standard Deviation of CV Scores: {np.std(cv_scores)}")

best_log_reg.fit(X_train_transformed, y_train)

y_pred_best = best_log_reg.predict(X_test_transformed)
best_accuracy = accuracy_score(y_test, y_pred_best)
best_report = classification_report(y_test, y_pred_best)

print(f"Final Accuracy: {best_accuracy}")
print("Final Classification Report:")
print(best_report)


Initial Accuracy: 0.8134666666666667
Initial Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89     11745
           1       0.71      0.24      0.36      3255

    accuracy                           0.81     15000
   macro avg       0.77      0.61      0.62     15000
weighted avg       0.80      0.81      0.78     15000

Best Parameters: {'C': 100}
Cross-Validation Scores: [0.81566667 0.80266667 0.81566667 0.802      0.79966667]
Mean CV Score: 0.8071333333333334
Standard Deviation of CV Scores: 0.007038307877450212
Final Accuracy: 0.8135333333333333
Final Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89     11745
           1       0.71      0.24      0.36      3255

    accuracy                           0.81     15000
   macro avg       0.76      0.61      0.62     15000
weighted avg       0.80      0.81      0.78     15000
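
`GridSearchCV` records the mean and standard deviation of the validation score for every candidate in its `cv_results_` attribute, so the cross-validation summary for each value of `C` can be read from one table. This is a minimal sketch, assuming synthetic data from `make_classification` rather than the credit data; `X_toy`, `y_toy`, `search`, and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, used only to keep the sketch self-contained.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=76)

search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=76),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # also keep training scores to compare with validation
)
search.fit(X_toy, y_toy)

# One row per value of C, with mean and standard deviation across the 5 folds.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_train_score", "mean_test_score", "std_test_score"]
]
print(results)
print("Best C:", search.best_params_["C"])
```

A large gap between `mean_train_score` and `mean_test_score` for a given `C` would point to overfitting at that setting.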



<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 7. Different models <a name="8"></a>
<hr>

_points: 12_

**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

<div class="alert alert-warning">
    
Solution_7
    
</div>

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

tree_model = DecisionTreeClassifier(random_state=76)

tree_model.fit(X_train_transformed, y_train)

y_pred_tree = tree_model.predict(X_test_transformed)

tree_accuracy = accuracy_score(y_test, y_pred_tree)
tree_report = classification_report(y_test, y_pred_tree)

print(f"Decision Tree Accuracy: {tree_accuracy}")
print("Decision Tree Classification Report:")
print(tree_report)


forest_model = RandomForestClassifier(random_state=76)

forest_model.fit(X_train_transformed, y_train)

y_pred_forest = forest_model.predict(X_test_transformed)

forest_accuracy = accuracy_score(y_test, y_pred_forest)
forest_report = classification_report(y_test, y_pred_forest)

print(f"Random Forest Accuracy: {forest_accuracy}")
print("Random Forest Classification Report:")
print(forest_report)


svm_model = SVC(kernel='linear', random_state=76)

svm_model.fit(X_train_transformed, y_train)

y_pred_svm = svm_model.predict(X_test_transformed)

svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_report = classification_report(y_test, y_pred_svm)

print(f"SVM Accuracy: {svm_accuracy}")
print("SVM Classification Report:")
print(svm_report)

Decision Tree Accuracy: 0.7236
Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.81      0.82     11745
           1       0.37      0.41      0.39      3255

    accuracy                           0.72     15000
   macro avg       0.60      0.61      0.61     15000
weighted avg       0.73      0.72      0.73     15000

Random Forest Accuracy: 0.8162666666666667
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.94      0.89     11745
           1       0.63      0.37      0.47      3255

    accuracy                           0.82     15000
   macro avg       0.74      0.66      0.68     15000
weighted avg       0.80      0.82      0.80     15000

SVM Accuracy: 0.813
SVM Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89     11745
           1       0.70      0.24      0.3

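Among the models tested, the Random Forest Classifier performs best overall on the test split, with the highest accuracy (0.816) and the highest F1-score on the default class (0.47). The linear model (Logistic Regression) is close in accuracy (0.814) but recalls only 24% of the defaults. The Decision Tree has the lowest accuracy (0.724), which is consistent with an unpruned tree overfitting the training data, although it has the highest recall on the default class (0.41). The linear-kernel SVM behaves much like Logistic Regression (accuracy 0.813, recall 0.24 on the default class), which suggests some underfitting of the minority class. Training scores and fit and score times were not recorded, so these overfitting and underfitting readings are tentative.

Therefore, the Random Forest Classifier is recommended as the best model for this task, narrowly beating the linear model and the other tested models in terms of overall performance.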
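
Overfitting, underfitting, and fit and score times can all be read from a single call to `cross_validate` with `return_train_score=True`. The following is a rough sketch on synthetic data from `make_classification`, not the credit data; the `models` dictionary and the `comparison` table are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only to keep the sketch self-contained.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=76)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=76),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=76),
}

rows = {}
for name, model in models.items():
    cv = cross_validate(model, X_toy, y_toy, cv=5, return_train_score=True)
    # Average fit_time, score_time, test_score, and train_score over the folds.
    rows[name] = pd.DataFrame(cv).mean()

comparison = pd.DataFrame(rows).T
print(comparison)
```

A `train_score` far above `test_score` (as is typical for an unpruned decision tree) indicates overfitting, while low values for both indicate underfitting.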

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Hyperparameter optimization <a name="10"></a>
<hr>

_points: 10_

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 

<div class="alert alert-warning">
    
Solution_8
    
</div>

In [9]:
# Hyperparameter optimization for the random forest with randomized search

param_dist_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False]
}

random_search_rf = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=76),
                                      param_distributions=param_dist_rf,
                                      n_iter=10, 
                                      cv=5,
                                      n_jobs=-1,
                                      random_state=76)

random_search_rf.fit(X_train_transformed, y_train)

best_params_rf = random_search_rf.best_params_
best_score_rf = random_search_rf.best_score_

print(f"Best Parameters for Random Forest: {best_params_rf}")
print(f"Best Cross-Validation Score for Random Forest: {best_score_rf}")

best_rf_model = random_search_rf.best_estimator_
y_pred_best_rf = best_rf_model.predict(X_test_transformed)

best_rf_accuracy = accuracy_score(y_test, y_pred_best_rf)
best_rf_report = classification_report(y_test, y_pred_best_rf)

print(f"Final Accuracy of Best Random Forest: {best_rf_accuracy}")
print("Final Classification Report of Best Random Forest:")
print(best_rf_report)

# Hyperparameter optimization for the SVM with randomized search
param_dist_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Create the RandomizedSearchCV object
random_search_svm = RandomizedSearchCV(estimator=SVC(random_state=76),
                                       param_distributions=param_dist_svm,
                                       n_iter=10,
                                       cv=5,
                                       n_jobs=-1,
                                       random_state=76)

random_search_svm.fit(X_train_transformed, y_train)

best_params_svm = random_search_svm.best_params_
best_score_svm = random_search_svm.best_score_

print(f"Best Parameters for SVM: {best_params_svm}")
print(f"Best Cross-Validation Score for SVM: {best_score_svm}")

best_svm_model = random_search_svm.best_estimator_
y_pred_best_svm = best_svm_model.predict(X_test_transformed)

best_svm_accuracy = accuracy_score(y_test, y_pred_best_svm)
best_svm_report = classification_report(y_test, y_pred_best_svm)

print(f"Final Accuracy of Best SVM: {best_svm_accuracy}")
print("Final Classification Report of Best SVM:")
print(best_svm_report)

Best Parameters for Random Forest: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10, 'bootstrap': True}
Best Cross-Validation Score for Random Forest: 0.8164666666666667
Final Accuracy of Best Random Forest: 0.8224666666666667
Final Classification Report of Best Random Forest:
              precision    recall  f1-score   support

           0       0.84      0.95      0.89     11745
           1       0.67      0.36      0.47      3255

    accuracy                           0.82     15000
   macro avg       0.75      0.66      0.68     15000
weighted avg       0.81      0.82      0.80     15000

Best Parameters for SVM: {'kernel': 'rbf', 'gamma': 'scale', 'C': 1}
Best Cross-Validation Score for SVM: 0.8150000000000001
Final Accuracy of Best SVM: 0.822
Final Classification Report of Best SVM:
              precision    recall  f1-score   support

           0       0.84      0.95      0.89     11745
           1       0.68      0.35      0.46      3
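
`RandomizedSearchCV` also accepts distributions instead of fixed lists, which lets the same number of iterations explore a wider range of values. This is a minimal sketch, assuming synthetic data and arbitrarily chosen ranges; `X_toy`, `y_toy`, `param_dist_toy`, and `toy_search` are hypothetical names, and the ranges are not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, used only to keep the sketch self-contained.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=76)

# randint(low, high) samples integers from low (inclusive) to high (exclusive).
param_dist_toy = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 5),
}

toy_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=76),
    param_distributions=param_dist_toy,
    n_iter=5,
    cv=3,
    random_state=76,
)
toy_search.fit(X_toy, y_toy)

print("Best parameters:", toy_search.best_params_)
print("Best CV accuracy:", toy_search.best_score_)
```

For continuous hyperparameters such as the SVM's `C`, a log-uniform distribution (for example `scipy.stats.loguniform`) is a common choice in place of `randint`.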

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 9. Results on the test set <a name="12"></a>
<hr>

_points: 10_

**Your tasks:**

1. Try your best performing model on the test data: report and explain test scores.
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias?

<div class="alert alert-warning">
    
Solution_9
    
</div>

1. The best-performing model, the tuned Random Forest Classifier, achieved an accuracy of 0.8225 on the test set. The classification report shows a precision of 0.84 and recall of 0.95 for class 0 (non-default), and a precision of 0.67 and recall of 0.36 for class 1 (default). In other words, the model is right about two thirds of the time when it predicts a default, but it still misses roughly two thirds of the actual defaults.

2. The test score of 0.8225 is very close to the best cross-validation score of 0.8165, indicating good generalization. This close agreement suggests that the model is reliable and that there is minimal optimization bias from the hyperparameter search. One caveat is that the same test set was also used to compare models in the earlier sections, so the reported test score may be slightly optimistic.


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Summary of results <a name="13"></a>
<hr>

_points: 12_

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table (printed `DataFrame`) summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability. 
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).

<div class="alert alert-warning">
    
Solution_10
    
</div>

In [11]:
summary_data = {
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM'],
    'Best Params': [
        "{'C': 100}", 
        "Default", 
        "{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10, 'bootstrap': True}",
        "{'kernel': 'rbf', 'gamma': 'scale', 'C': 1}"
    ],
    # Values copied from the outputs in sections 6 to 8.
    # The decision tree was not cross-validated, so its validation accuracy is left as NaN.
    'Validation Accuracy': [0.8071, np.nan, 0.8165, 0.8150],
    'Test Accuracy': [0.8135, 0.7236, 0.8225, 0.8220]
}

summary_df = pd.DataFrame(summary_data)
print(summary_df)

                 Model                                        Best Params  \
0  Logistic Regression                                         {'C': 100}   
1        Decision Tree                                            Default   
2        Random Forest  {'n_estimators': 200, 'min_samples_split': 2, ...   
3                  SVM        {'kernel': 'rbf', 'gamma': 'scale', 'C': 1}   

   Validation Accuracy  Test Accuracy  
0               0.8071         0.8135  
1                  NaN         0.7236  
2               0.8165         0.8225  
3               0.8150         0.8220  


2. Concluding remarks

The Random Forest Classifier achieved the highest accuracy on the test set, demonstrating robust performance and good generalization. The close alignment between validation and test scores indicates the model's reliability and minimal optimization bias.

3. Possible improvements

- Feature engineering: Exploring additional features or transforming existing features could improve model performance.
- Ensemble methods: Combining several different models, for example by averaging or stacking their predictions, might enhance predictive power.
- Advanced hyperparameter tuning: Using Bayesian optimization for more efficient hyperparameter tuning could yield better results.

4. Final score

The final test accuracy for the best-performing model, the Random Forest Classifier, is 0.822. Accuracy was the metric used to evaluate the model's performance on unseen data.


<!-- END QUESTION -->

<br><br>

## Submission instructions 

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using PrairieLearn.
4. Make sure that the plots and output are rendered properly in your submitted file.

This was a tricky one but you did it!