# Assignment 1

## <span style="color: #81e64b;">Students: <br></span>
## <span style="color: #81e64b;">James Burnett (U00150685) <br> </span>
## <span style="color: #81e64b;">Julio Figueroa (U06228812) </span> 

In this assignment, we will focus on education. This dataset contains data about high school students. Each row represents a single student. The school administrators want to predict a student's cumulative GPA at the time of graduation so that they can make interventions for struggling students. The goal is to predict the CGPA of a student. 

## Description of Variables

The descriptions of the variables are provided in "High School - Data Dictionary.docx".

## Goal

Use the **high_school.csv** data set and build a model to predict **CGPA**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [142]:
# Import everything in the file

import pandas as pd # Data frame manipulation and management
import numpy as np # Primarily used for random seed

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

np.random.seed(42)

# Get the data (Setup)

In [143]:
# Load the data
data = pd.read_csv('high_school.csv')
print(data.shape)
data.head()

(2369, 15)


Unnamed: 0,Gender,ParentEdu,ParentMaritalStatus,ExtraCurricular,IsFirstChild,Siblings,Transportation,AvgReadingScore,AvgWritingScore,traveltime,studytime,internet,freetime,absences,CGPA
0,female,bachelor's degree,married,regularly,yes,3.0,school_bus,71,74,2,2,no,3,6,C
1,female,some college,married,sometimes,yes,0.0,,90,88,1,2,yes,3,4,D
2,female,master's degree,single,sometimes,yes,4.0,school_bus,93,91,1,2,yes,3,10,B
3,male,associate's degree,married,never,no,1.0,,56,42,1,3,yes,2,2,F
4,male,some college,married,sometimes,yes,0.0,school_bus,78,75,1,2,no,3,4,C


In [144]:
# Find NaN values (sum)
data.isna().sum()

Gender                   0
ParentEdu              133
ParentMaritalStatus    103
ExtraCurricular         47
IsFirstChild            84
Siblings               118
Transportation         266
AvgReadingScore          0
AvgWritingScore          0
traveltime               0
studytime                0
internet                 0
freetime                 0
absences                 0
CGPA                     0
dtype: int64

In [145]:
# Data cleaning and handling missing values (use appropriate method as needed)
# data = data.dropna()  # You can also try imputation if needed

# One-hot encoding of categorical variables
# data = pd.get_dummies(data, drop_first=True)

In [146]:
data.describe()

Unnamed: 0,Siblings,AvgReadingScore,AvgWritingScore,traveltime,studytime,freetime,absences
count,2251.0,2369.0,2369.0,2369.0,2369.0,2369.0,2369.0
mean,2.118614,68.694386,67.660616,1.44829,2.03588,3.235965,5.70916
std,1.478638,14.725655,15.568467,0.696855,0.838261,0.997695,7.99632
min,0.0,17.0,10.0,1.0,1.0,1.0,0.0
25%,1.0,59.0,57.0,1.0,1.0,3.0,0.0
50%,2.0,69.0,68.0,1.0,2.0,3.0,4.0
75%,3.0,79.0,78.0,2.0,2.0,4.0,8.0
max,7.0,100.0,100.0,4.0,4.0,5.0,75.0


In [147]:
# Encode gender as binary (male=0, female=1), then drop the original 'Gender' column; all other columns are retained
data['gender_binary'] = data['Gender'].map({'male': 0, 'female': 1})
data = data.drop(['Gender'], axis=1)

# Map letter grades to ordinal codes (F=0 ... A=4), then drop the original 'CGPA' column
CGPA_dictionary = {'F': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 4}
data['CGPA_num'] = data['CGPA'].map(CGPA_dictionary)
data = data.drop(['CGPA'], axis=1)

In [148]:
numeric_columns = ['Siblings', 'AvgReadingScore', 'AvgWritingScore', 'traveltime', 'studytime', 'freetime', 'absences']

binary_columns = ['gender_binary'] # 'CGPA' is also categorical, but it is the y-variable

categorical_columns = ['ParentEdu', 'ParentMaritalStatus', 'ExtraCurricular', 'IsFirstChild', 'Transportation','internet' ]

data.head()

Unnamed: 0,ParentEdu,ParentMaritalStatus,ExtraCurricular,IsFirstChild,Siblings,Transportation,AvgReadingScore,AvgWritingScore,traveltime,studytime,internet,freetime,absences,gender_binary,CGPA_num
0,bachelor's degree,married,regularly,yes,3.0,school_bus,71,74,2,2,no,3,6,1,2
1,some college,married,sometimes,yes,0.0,,90,88,1,2,yes,3,4,1,1
2,master's degree,single,sometimes,yes,4.0,school_bus,93,91,1,2,yes,3,10,1,3
3,associate's degree,married,never,no,1.0,,56,42,1,3,yes,2,2,0,0
4,some college,married,sometimes,yes,0.0,school_bus,78,75,1,2,no,3,4,0,2


# Pipeline

In [149]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [150]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [151]:
binary_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent'))])

In [152]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),# 'num' = Step 1, apply numeric_transformer to headers identified as 'numeric_columns'
        ('cat', categorical_transformer, categorical_columns), # 'cat' = Step 2, apply categorical_transformer to headers identified as 'categorical_columns'
        ('binary', binary_transformer, binary_columns)], # 'binary' = Step 3, apply binary_transformer to headers identified as 'binary_columns'
        remainder='drop')

## Split the data (Train and Test)

In [153]:
inputs = data.drop(['CGPA_num'], axis=1)
target = data['CGPA_num']

target.head()

0    2
1    1
2    3
3    0
4    2
Name: CGPA_num, dtype: int64

In [154]:
x_train, x_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3, random_state=42)

## Transform: fit_transform (train dataset)

In [155]:
x_train = preprocessor.fit_transform(x_train)

print("Train data shape after preprocessing:", x_train.shape)
x_train

Train data shape after preprocessing: (1658, 32)


array([[-0.07311637, -1.35881531, -1.65738705, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.62359018, -0.26725329, -0.04930921, ...,  0.        ,
         1.        ,  1.        ],
       [-0.76982292, -0.54014379, -0.56389412, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.76982292,  0.41497297,  0.46527569, ...,  0.        ,
         1.        ,  1.        ],
       [-1.46652948, -1.97281894, -1.40009459, ...,  0.        ,
         1.        ,  0.        ],
       [-0.76982292,  0.41497297,  0.27230635, ...,  0.        ,
         1.        ,  1.        ]])

## Transform: transform (test)

In [156]:
x_test = preprocessor.transform(x_test)

print("Test data shape after preprocessing:", x_test.shape)
x_test

Test data shape after preprocessing: (711, 32)


array([[-0.07311637,  1.43831236,  0.91553748, ...,  0.        ,
         1.        ,  0.        ],
       [-0.07311637,  0.55141822,  0.07933701, ...,  0.        ,
         1.        ,  0.        ],
       [-0.07311637, -0.33547592,  0.0150139 , ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.62359018, -2.17748682, -2.17197195, ...,  0.        ,
         1.        ,  0.        ],
       [-0.76982292, -1.56348318, -1.65738705, ...,  0.        ,
         1.        ,  1.        ],
       [-0.76982292, -1.42703793, -1.20712525, ...,  0.        ,
         1.        ,  0.        ]])
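The reason for calling `fit_transform` on the training set and only `transform` on the test set can be illustrated with a small sketch. The numbers below are hypothetical; the point is that the scaler's mean and standard deviation come from the training rows alone, so nothing about the test rows leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0], [5.0]])

# Statistics are learned from the training data only
scaler = StandardScaler().fit(toy_train)

# The test data is scaled with the training mean and standard deviation
toy_test_scaled = scaler.transform(toy_test)

print("Mean learned from train:", scaler.mean_[0])
print("Scaled test values:", toy_test_scaled.ravel())
```

Because the test values lie above the training mean, their scaled values are all positive; they are not re-centered around zero, which is the expected behavior.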

## Find the Baseline (0.5 point)

In [157]:
# Create and fit the dummy classifier
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(x_train, y_train)

In [158]:
# Calculate Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(x_train)
baseline_train_acc = accuracy_score(y_train, dummy_train_pred)
print('Baseline Train Accuracy: {:.2f}'.format(baseline_train_acc))

# Calculate Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(x_test)
baseline_test_acc = accuracy_score(y_test, dummy_test_pred)
print('Baseline Test Accuracy: {:.2f}'.format(baseline_test_acc))

Baseline Train Accuracy: 0.33
Baseline Test Accuracy: 0.33
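As a sanity check on what the `most_frequent` baseline measures, the sketch below uses hypothetical labels (not the assignment data) to show that its accuracy equals the share of the most common class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: class 2 appears in 3 of 6 rows
toy_y = np.array([2, 2, 2, 1, 3, 0])
toy_X = np.zeros((len(toy_y), 1))  # features are ignored by the dummy model

toy_dummy = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
toy_baseline_acc = accuracy_score(toy_y, toy_dummy.predict(toy_X))

# Share of the most common class, computed directly from the labels
toy_majority_share = np.bincount(toy_y).max() / len(toy_y)

print(toy_baseline_acc, toy_majority_share)
```

Read this way, a baseline accuracy of 0.33 means that about a third of the students fall in the most common grade.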


# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)



## SVM Model 1:

In [159]:
from sklearn.svm import SVC
from sklearn.metrics import mean_squared_error, accuracy_score

In [160]:
# Model 1: Linear Kernel
svc_linear = SVC(kernel='linear', C=1.0)  # C=1.0 is the regularization parameter
svc_linear.fit(x_train, y_train)

# Predictions
y_pred_train_linear_svm_mod_1 = svc_linear.predict(x_train)
y_pred_test_linear_svm_mod_1 = svc_linear.predict(x_test)

# Performance Evaluation
train_mse_linear_svm_mod_1 = mean_squared_error(y_train, y_pred_train_linear_svm_mod_1)
test_mse_linear_svm_mod_1 = mean_squared_error(y_test, y_pred_test_linear_svm_mod_1)

print(f"Linear Kernel SVM Module 1- Train MSE: {train_mse_linear_svm_mod_1}, Test MSE: {test_mse_linear_svm_mod_1}")

# Performance Evaluation 2
train_acc_linear_svm_mod_1 = accuracy_score(y_train, y_pred_train_linear_svm_mod_1)
test_acc_linear_svm_mod_1 = accuracy_score(y_test, y_pred_test_linear_svm_mod_1)

print(f"Linear Kernel SVM Module 1- Train Acc: {train_acc_linear_svm_mod_1}, Test Acc: {test_acc_linear_svm_mod_1}")

Linear Kernel SVM Module 1- Train MSE: 0.38721351025331724, Test MSE: 0.44022503516174405
Linear Kernel SVM Module 1- Train Acc: 0.6821471652593486, Test Acc: 0.630098452883263


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">While this model does not provide high predictive reliability, the train and test accuracies (0.68 and 0.63) are close, so we do not consider it overfitting. The linear kernel has the highest test accuracy of the SVM models.</span> 

## SVM Model 2:

In [161]:
# Model 2: RBF Kernel
svc_rbf = SVC(kernel='rbf', C=1.0)  # C=1.0 is the regularization parameter
svc_rbf.fit(x_train, y_train)

# Predictions
y_pred_train_rbf_svm_mod_2 = svc_rbf.predict(x_train)
y_pred_test_rbf_svm_mod_2 = svc_rbf.predict(x_test)

# Performance Evaluation
train_mse_rbf_svm_mod_2 = mean_squared_error(y_train, y_pred_train_rbf_svm_mod_2)
test_mse_rbf_svm_mod_2 = mean_squared_error(y_test, y_pred_test_rbf_svm_mod_2)

print(f"RBF Kernel SVM Module 2- Train MSE: {train_mse_rbf_svm_mod_2}, Test MSE: {test_mse_rbf_svm_mod_2}")

# Performance Evaluation 2
train_acc_rbf_svm_mod_2 = accuracy_score(y_train, y_pred_train_rbf_svm_mod_2)
test_acc_rbf_svm_mod_2 = accuracy_score(y_test, y_pred_test_rbf_svm_mod_2)

print(f"RBF Kernel SVM Module 2- Train Accuracy: {train_acc_rbf_svm_mod_2}, Test Accuracy: {test_acc_rbf_svm_mod_2}")

RBF Kernel SVM Module 2- Train MSE: 0.2973462002412545, Test MSE: 0.5372714486638537
RBF Kernel SVM Module 2- Train Accuracy: 0.7738238841978287, Test Accuracy: 0.5794655414908579


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

 
<span style="color: #81e64b;">We do not treat this model as overfitting, with a training accuracy of 0.77 and a test accuracy of 0.58, although the gap is wider than for the linear kernel. Given the relatively low test accuracy, the model may not generalize well to new, unseen data.</span> 

## SVM Model 3:

In [162]:
# Model 3: Polynomial Kernel
svc_poly_c100 = SVC(kernel='poly', C=100)  # C=100 is the regularization parameter
svc_poly_c100.fit(x_train, y_train)

# Predictions
y_pred_train_poly_svm_mod_3 = svc_poly_c100.predict(x_train)
y_pred_test_poly_svm_mod_3 = svc_poly_c100.predict(x_test)

# Performance Evaluation
train_mse_poly_svm_mod_3 = mean_squared_error(y_train, y_pred_train_poly_svm_mod_3)
test_mse_poly_svm_mod_3 = mean_squared_error(y_test, y_pred_test_poly_svm_mod_3)

print(f"Polynomial Kernel SVM Module 3- Train MSE: {train_mse_poly_svm_mod_3}, Test MSE: {test_mse_poly_svm_mod_3}")

# Performance Evaluation 2
train_acc_poly_svm_mod_3 = accuracy_score(y_train, y_pred_train_poly_svm_mod_3)
test_acc_poly_svm_mod_3 = accuracy_score(y_test, y_pred_test_poly_svm_mod_3)

print(f"Polynomial Kernel SVM Module 3- Train Accuracy: {train_acc_poly_svm_mod_3}, Test Accuracy: {test_acc_poly_svm_mod_3}")

Polynomial Kernel SVM Module 3- Train MSE: 0.0012062726176115801, Test MSE: 0.7566807313642757
Polynomial Kernel SVM Module 3- Train Accuracy: 0.9987937273823885, Test Accuracy: 0.5161744022503516


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">The model shows signs of overfitting, since the train (0.999) and test (0.516) accuracies are far apart. To address this, we rerun the model below with a smaller C-value, which strengthens regularization and mitigates the overfitting. See Model 3.1 below for this correction.</span> 

## SVM Model 3.1:

In [163]:
# Model 3.1: Polynomial Kernel
svc_poly_c1 = SVC(kernel='poly', C=1.0)  # C=1.0 is the regularization parameter
svc_poly_c1.fit(x_train, y_train)

# Predictions
y_pred_train_poly_svm_mod_3_1 = svc_poly_c1.predict(x_train)
y_pred_test_poly_svm_mod_3_1 = svc_poly_c1.predict(x_test)

# Performance Evaluation
train_mse_poly_svm_mod_3_1 = mean_squared_error(y_train, y_pred_train_poly_svm_mod_3_1)
test_mse_poly_svm_mod_3_1 = mean_squared_error(y_test, y_pred_test_poly_svm_mod_3_1)

print(f"Polynomial Kernel SVM Module 3- Train MSE: {train_mse_poly_svm_mod_3_1}, Test MSE: {test_mse_poly_svm_mod_3_1}")

# Performance Evaluation 2
train_acc_poly_svm_mod_3_1 = accuracy_score(y_train, y_pred_train_poly_svm_mod_3_1)
test_acc_poly_svm_mod_3_1 = accuracy_score(y_test, y_pred_test_poly_svm_mod_3_1)

print(f"Polynomial Kernel SVM Module 3- Train Accuracy: {train_acc_poly_svm_mod_3_1}, Test Accuracy: {test_acc_poly_svm_mod_3_1}")

Polynomial Kernel SVM Module 3- Train MSE: 0.24849215922798554, Test MSE: 0.5836849507735584
Polynomial Kernel SVM Module 3- Train Accuracy: 0.8124246079613993, Test Accuracy: 0.5780590717299579


<span style="color: #81e64b;">Lowering the C-value from 100 to 1.0 reduced the overfitting: training accuracy dropped from 0.999 to 0.81, while test accuracy rose from 0.52 to 0.58. A gap between training and test accuracy remains, but the model is more balanced than with C=100.</span> 
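The C-value above was adjusted by hand. One common alternative is to let cross-validation on the training data pick C, so the test set is never used for tuning. The sketch below is an illustration only: it runs on synthetic data from `make_classification` rather than the assignment data, and the grid of C-values and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a preprocessed training set (not the assignment data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42)

# Candidate regularization strengths (smaller C = stronger regularization)
c_grid = [0.1, 1.0, 10.0, 100.0]

# 5-fold cross-validation over the grid, scored by accuracy
search = GridSearchCV(SVC(kernel="poly"), param_grid={"C": c_grid},
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_c = search.best_params_["C"]
best_cv_accuracy = search.best_score_
print(f"Best C: {best_c}, mean CV accuracy: {best_cv_accuracy:.2f}")
```

The selected C depends on the data, so the value found on this synthetic set says nothing about which C would be chosen for the student data.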

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [164]:
from sklearn.linear_model import SGDClassifier

In [165]:
# Create an SGDClassifier object
sgd_clf = SGDClassifier(random_state=42, penalty='l2')  # You can specify other hyperparameters here if needed

# Train the classifier on the training data
sgd_clf.fit(x_train, y_train)

# Make predictions on the test data
y_pred_test_sgd_mod_1 = sgd_clf.predict(x_test)
y_pred_train_sgd_mod_1 = sgd_clf.predict(x_train)


# Evaluate the classifier's MSE
mse_test_sgd_mod_1 = mean_squared_error(y_test, y_pred_test_sgd_mod_1)
mse_train_sgd_mod_1 = mean_squared_error(y_train, y_pred_train_sgd_mod_1)
print(f"Test MSE SGD Module 1: {mse_test_sgd_mod_1:.2f}")
print(f"Train MSE SGD Module 1: {mse_train_sgd_mod_1:.2f}")

# Evaluate the classifier's accuracy
accuracy_test_sgd_mod_1 = accuracy_score(y_test, y_pred_test_sgd_mod_1)
accuracy_train_sgd_mod_1 = accuracy_score(y_train, y_pred_train_sgd_mod_1)
print(f"Test accuracy SGD Module 1: {accuracy_test_sgd_mod_1:.2f}")
print(f"Train accuracy SGD Module 1: {accuracy_train_sgd_mod_1:.2f}")

Test MSE SGD Module 1: 0.83
Train MSE SGD Module 1: 0.79
Test accuracy SGD Module 1: 0.53
Train accuracy SGD Module 1: 0.55


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">In this case the model is not overfitting, as the train accuracy of 0.55 and the test accuracy of 0.53 are reasonably close together.</span> 

## SGD Model 2:

In [166]:
# Create an SGDClassifier object
sgd_clf = SGDClassifier(random_state=42, penalty='l1')  # You can specify other hyperparameters here if needed

# Train the classifier on the training data
sgd_clf.fit(x_train, y_train)

# Make predictions on the test data
y_pred_test_sgd_mod_2 = sgd_clf.predict(x_test)
y_pred_train_sgd_mod_2 = sgd_clf.predict(x_train)


# Evaluate the classifier's MSE
mse_test_sgd_mod_2 = mean_squared_error(y_test, y_pred_test_sgd_mod_2)
mse_train_sgd_mod_2 = mean_squared_error(y_train, y_pred_train_sgd_mod_2)
print(f"Test MSE SGD Module 2: {mse_test_sgd_mod_2:.2f}")
print(f"Train MSE SGD Module 2: {mse_train_sgd_mod_2:.2f}")

# Evaluate the classifier's accuracy
accuracy_test_sgd_mod_2 = accuracy_score(y_test, y_pred_test_sgd_mod_2)
accuracy_train_sgd_mod_2 = accuracy_score(y_train, y_pred_train_sgd_mod_2)
print(f"Test accuracy SGD Module 2: {accuracy_test_sgd_mod_2:.2f}")
print(f"Train accuracy SGD Module 2: {accuracy_train_sgd_mod_2:.2f}")

Test MSE SGD Module 2: 0.79
Train MSE SGD Module 2: 0.67
Test accuracy SGD Module 2: 0.54
Train accuracy SGD Module 2: 0.57


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">No, the model is not overfitting. The small difference between training (0.57) and test (0.54) accuracy indicates that the model performs consistently on both sets, although its accuracy is modest on both. Therefore, no additional steps are required to correct overfitting in this case.</span> 

## LogisticRegression Model:

In [167]:
from sklearn.linear_model import LogisticRegression
# Create a LogisticRegression object with a penalty (e.g., L2 regularization)
logreg = LogisticRegression(penalty='l2', C=1.0, random_state=42)  

# Train the classifier on the training data
logreg.fit(x_train, y_train)

# Make predictions on the training data
y_train_pred_lr = logreg.predict(x_train)

# Make predictions on the test data
y_test_pred_lr = logreg.predict(x_test)

# Evaluate the classifier's MSE on the training data
train_mse_lr = mean_squared_error(y_train, y_train_pred_lr)
print(f"Training MSE LR: {train_mse_lr:.2f}")

# Evaluate the classifier's MSE on the test data
test_mse_lr = mean_squared_error(y_test, y_test_pred_lr)
print(f"Test MSE LR: {test_mse_lr:.2f}")

# Evaluate the classifier's accuracy on the training data
train_accuracy_lr = accuracy_score(y_train, y_train_pred_lr)
print(f"Training Accuracy LR: {train_accuracy_lr:.2f}")

# Evaluate the classifier's accuracy on the test data
test_accuracy_lr = accuracy_score(y_test, y_test_pred_lr)
print(f"Test Accuracy LR: {test_accuracy_lr:.2f}")

Training MSE LR: 0.38
Test MSE LR: 0.44
Training Accuracy LR: 0.67
Test Accuracy LR: 0.63


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

<span style="color: #81e64b;">No, the model is not overfitting significantly. The difference between training and test accuracy is minor (0.67 vs. 0.63), meaning the model generalizes well and is not memorizing the training data</span> 

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

**If the train/test values listed here do not match the outputs of models, you will lose points.**

## Support Vector Machine Models

In [168]:
# SVM Model 1: Linear Kernel
print(f"Linear Kernel SVM Module 1- Train MSE: {train_mse_linear_svm_mod_1}, \nTest MSE: {test_mse_linear_svm_mod_1}")

print(f"Linear Kernel SVM Module 1- Train Acc: {train_acc_linear_svm_mod_1}, \nTest Acc: {test_acc_linear_svm_mod_1}")
print("---------------------")
# SVM Model 2: RBF Kernel
print(f"RBF Kernel SVM Module 2- Train MSE: {train_mse_rbf_svm_mod_2}, \nTest MSE: {test_mse_rbf_svm_mod_2}")

print(f"RBF Kernel SVM Module 2- Train Accuracy: {train_acc_rbf_svm_mod_2}, \nTest Accuracy: {test_acc_rbf_svm_mod_2}")
print("---------------------")
# SVM Model 3: Polynomial Kernel (C=100)
print(f"Polynomial Kernel SVM Module 3- Train MSE: {train_mse_poly_svm_mod_3}, \nTest MSE: {test_mse_poly_svm_mod_3}")

print(f"Polynomial Kernel SVM Module 3- Train Accuracy: {train_acc_poly_svm_mod_3}, \nTest Accuracy: {test_acc_poly_svm_mod_3}")
print("---------------------")
# SVM Model 3.1: Polynomial Kernel (C=1)
print(f"Polynomial Kernel SVM Module 3- Train MSE: {train_mse_poly_svm_mod_3_1}, \nTest MSE: {test_mse_poly_svm_mod_3_1}")

print(f"Polynomial Kernel SVM Module 3- Train Accuracy: {train_acc_poly_svm_mod_3_1}, \nTest Accuracy: {test_acc_poly_svm_mod_3_1}")

Linear Kernel SVM Module 1- Train MSE: 0.38721351025331724, 
Test MSE: 0.44022503516174405
Linear Kernel SVM Module 1- Train Acc: 0.6821471652593486, 
Test Acc: 0.630098452883263
---------------------
RBF Kernel SVM Module 2- Train MSE: 0.2973462002412545, 
Test MSE: 0.5372714486638537
RBF Kernel SVM Module 2- Train Accuracy: 0.7738238841978287, 
Test Accuracy: 0.5794655414908579
---------------------
Polynomial Kernel SVM Module 3- Train MSE: 0.0012062726176115801, 
Test MSE: 0.7566807313642757
Polynomial Kernel SVM Module 3- Train Accuracy: 0.9987937273823885, 
Test Accuracy: 0.5161744022503516
---------------------
Polynomial Kernel SVM Module 3- Train MSE: 0.24849215922798554, 
Test MSE: 0.5836849507735584
Polynomial Kernel SVM Module 3- Train Accuracy: 0.8124246079613993, 
Test Accuracy: 0.5780590717299579


## Stochastic Gradient Descent Models

In [169]:
# SGD Model 1: L2 (MSE)
print(f"Test MSE SGD Module 1: {mse_test_sgd_mod_1:.2f}")
print(f"Train MSE SGD Module 1: {mse_train_sgd_mod_1:.2f}")

# SGD Model 1: L2 (Accuracy)
print(f"Test accuracy SGD Module 1: {accuracy_test_sgd_mod_1:.2f}")
print(f"Train accuracy SGD Module 1: {accuracy_train_sgd_mod_1:.2f}")
print("---------------------")
# SGD Model 2: L1 (MSE)
print(f"Test MSE SGD Module 2: {mse_test_sgd_mod_2:.2f}")
print(f"Train MSE SGD Module 2: {mse_train_sgd_mod_2:.2f}")

# SGD Model 2: L1 (Accuracy)
print(f"Test accuracy SGD Module 2: {accuracy_test_sgd_mod_2:.2f}")
print(f"Train accuracy SGD Module 2: {accuracy_train_sgd_mod_2:.2f}")

Test MSE SGD Module 1: 0.83
Train MSE SGD Module 1: 0.79
Test accuracy SGD Module 1: 0.53
Train accuracy SGD Module 1: 0.55
---------------------
Test MSE SGD Module 2: 0.79
Train MSE SGD Module 2: 0.67
Test accuracy SGD Module 2: 0.54
Train accuracy SGD Module 2: 0.57


## Logistic Regression

In [170]:
# Logistic Regression: L2 (MSE)
print(f"Training MSE LR: {train_mse_lr:.2f}")

print(f"Test MSE LR: {test_mse_lr:.2f}")

# Logistic Regression: L2 (Accuracy)
print(f"Training Accuracy LR: {train_accuracy_lr:.2f}")

print(f"Test Accuracy LR: {test_accuracy_lr:.2f}")

Training MSE LR: 0.38
Test MSE LR: 0.44
Training Accuracy LR: 0.67
Test Accuracy LR: 0.63


## Which model performs the best and why? (1 point) 

Hint: The best model is the one that has the best TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

<span style="color: #81e64b;">The best-performing model is SVM Model 1 with a linear kernel (SVM-1). At a C-value of 1.0, SVM-1 has a test accuracy of 0.63, the highest among the SVM and SGD models; logistic regression reaches the same value at two decimal places (0.63). SVM-1 also shows little overfitting, as its training accuracy (0.68) is reasonably close to its test accuracy.</span> 

## How does your best model compare to the baseline? (1 point)

<span style="color: #81e64b;">SVM Model 1 Test Accuracy is 0.63, which is significantly higher than the baseline test accuracy of 0.33. This shows that SVM-1 outperforms the baseline by a substantial margin, suggesting that the model is making more accurate predictions than the dummy model, which does not learn from the data.</span> 
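The comparison in this discussion can also be laid out as a small table. The sketch below hard-codes the test accuracies printed earlier in this notebook, rounded to two decimals, together with the baseline of 0.33; the model labels and variable names are chosen here for illustration.

```python
import pandas as pd

# Test accuracies reported above, rounded to two decimals
baseline_test_accuracy = 0.33
test_accuracy = {
    "SVM linear (C=1)": 0.63,
    "SVM RBF (C=1)": 0.58,
    "SVM poly (C=100)": 0.52,
    "SVM poly (C=1)": 0.58,
    "SGD (L2)": 0.53,
    "SGD (L1)": 0.54,
    "Logistic regression (L2)": 0.63,
}

summary = (
    pd.Series(test_accuracy, name="test_accuracy")
    .to_frame()
    # Improvement of each model over the most-frequent-class baseline
    .assign(gain_over_baseline=lambda df: (df["test_accuracy"] - baseline_test_accuracy).round(2))
    .sort_values("test_accuracy", ascending=False)
)

# Models sharing the highest rounded test accuracy
top_models = summary.index[summary["test_accuracy"] == summary["test_accuracy"].max()].tolist()

print(summary)
print("Top models at two decimals:", top_models)
```

At two decimals, the linear SVM and logistic regression share the top test accuracy, each about 0.30 above the baseline.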