# Homework 4: SGD and KNN

Summary: The purpose of this task is to gain experience and familiarity with using KNN and SGD regression and classification models. From there, we can put theory into practice by fully analyzing the results and data. Full descruptions are provided below. 

1. Provide a detailed description of the outlier analysis. - *this answer is isolated in a markdown cell below*

*Each of the following three will be handled in each of the 4 subsection markdown summaries: SGD Regression, SGD Classification, KNN Regression, KNN Classifier*

2. Provide a detailed summary of your findings comparing regression and classification models.
3. What recommendations would you make to enhance the findings from the analysis?
4. What anomalies or issues did you notice with the data? - *These are covered in the outliers and final section*

## Importing importing and implementing ML pipeline


In [46]:
# -----------------------------------
# Importing the necessary libraries
# -----------------------------------

import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import StandardScaler

# Libraries related to outlier detection
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
import seaborn as sns
import warnings
import sklearn.metrics

import seaborn as sns
import warnings
from datetime import datetime
warnings.filterwarnings('ignore') 
sns.set(rc={'figure.figsize':(11,8)})
# pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.options.display.float_format = '{:.2f}'.format



In [47]:
%cd '/Users/chandlersmith/Desktop/CS6140/HW4'
os.listdir()

# The first column is index: skipping that column
cars = pd.read_csv("cars_data_2023.csv")
print(cars.shape)

cars.head()

/Users/chandlersmith/Desktop/CS6140/HW4
(3377, 44)


Unnamed: 0,Model Year,Represented Test Veh Make,Represented Test Veh Model,Test Vehicle ID,Test Veh Configuration #,Test Veh Displacement (L),Vehicle Type,Rated Horsepower,# of Cylinders and Rotors,Tested Transmission Type Code,...,DT-Energy Economy Rating,Target Coef A (lbf),Target Coef B (lbf/mph),Target Coef C (lbf/mph**2),Set Coef A (lbf),Set Coef B (lbf/mph),Set Coef C (lbf/mph**2),Aftertreatment Device Cd,Aftertreatment Device Desc,Police - Emergency Vehicle?
0,2023,Aston Martin,DB11 V8,562TT5348,0,4.0,Car,503,8.0,SA,...,-7.71,40.94,0.02,0.03,11.26,0.09,0.03,TWC,Three-way catalyst,N
1,2023,Aston Martin,DB11 V8,562TT5348,0,4.0,Car,503,8.0,SA,...,-0.96,40.94,0.02,0.03,11.26,0.09,0.03,TWC,Three-way catalyst,N
2,2023,Aston Martin,DBS,7002PT7056,0,5.2,Car,715,12.0,SA,...,-0.58,40.94,0.02,0.03,6.81,0.08,0.02,TWC,Three-way catalyst,N
3,2023,Aston Martin,DBS,7002PT7056,0,5.2,Car,715,12.0,SA,...,-0.08,40.94,0.02,0.03,6.81,0.08,0.02,TWC,Three-way catalyst,N
4,2023,Aston Martin,DBX,8001PT8342,1,4.0,Both,550,8.0,A,...,-2.11,60.68,-0.33,0.04,-4.88,-0.53,0.04,TWC,Three-way catalyst,N


### Note befor advancing
Before we advance the basic ML pipeline: let's identidy and remove outliers in our dataset! 

## Outlier detection using Local Outlier Factor (LOF) method
- This method uses KNN

In [48]:
# -----------------------------------------------------------------------------
# Step 1
# Select a few important numerical features for outlier detection
# Make sure to avoid using Response variable (if one already exists)
# -----------------------------------------------------------------------------

num_cols = ['Rated Horsepower', 'Equivalent Test Weight (lbs.)','CO2 (g/mi)','RND_ADJ_FE']

# -----------------------------------------------------------------------------
# Step 2
# At this stage, either drop NAs or impute them with a value
# I have shown filling NAs with 0, as it seems approproate in this example            
# -----------------------------------------------------------------------------

X = cars[num_cols].fillna(0).values

# -----------------------------------------------------------------------------
# Step 3a
# fit the Local Outlier Factor model (based on KNN)
# Notice the contamination parameter to identify a certain proportion of outliers
# -----------------------------------------------------------------------------

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# predict the labels for each data point (as Outlier or inlier)
y_pred_lof = lof.fit_predict(X)

# -----------------------------------------------------------------------------
# Step 3b
# fit the Isolation Forest outlier detection (based on decision trees)
# -----------------------------------------------------------------------------
iforest = IsolationForest(n_estimators=100,  contamination=0.05)
# predict the labels for each data point (as Outlier or inlier)
y_pred_if = iforest.fit_predict(X)


# -----------------------------------------------------------------------------
# Step 3c
# fit the robust covariance model (based on Mahalanobis distance)
# -----------------------------------------------------------------------------
rob_cov = EllipticEnvelope(contamination=0.05)
rob_cov.fit(X)

# predict the labels for each data point (as Outlier or inlier)
y_pred_rob = rob_cov.predict(X)

# -----------------------------------------------------------------------------
# Adding the newly created columns to the carsrotion table             
# -----------------------------------------------------------------------------
cars["y_pred_lof"] = y_pred_lof
cars["y_pred_if"] = y_pred_if
cars["y_pred_rob"] = y_pred_rob

# -----------------------------------------------------------------------------
# Converting them to a binary -1, 0. 
# Where -1 denotes outlier
# The purpose is to then add these columns and find out which rows were identified as outliers from multiple methods
# -----------------------------------------------------------------------------
cars["y_pred_lof_2"] = np.where(cars["y_pred_lof"]<0, -1, 0)
cars["y_pred_if_2"] = np.where(cars["y_pred_if"]<0, -1, 0)
cars["y_pred_rob_2"] = np.where(cars["y_pred_rob"]<0, -1, 0)




## Summing the outlier status 

In [49]:
cars.iloc[:,-3:]
pd.crosstab(cars["y_pred_if"], cars["y_pred_rob"] )

cars["all_out"] = cars.loc[:,["y_pred_if_2","y_pred_rob_2","y_pred_lof_2"]].sum(axis = 1)
cars["all_out"].value_counts()

# -----------------------------------------------------------------------------
# List of cars identified as outliers based by at least two methods
# -----------------------------------------------------------------------------
cars[cars["all_out"]<-1]
# Double check the data in csv
#outliers = cars[cars["all_out"]<-1]
#outliers.to_csv( "outliers.csv", index=False)

Unnamed: 0,Model Year,Represented Test Veh Make,Represented Test Veh Model,Test Vehicle ID,Test Veh Configuration #,Test Veh Displacement (L),Vehicle Type,Rated Horsepower,# of Cylinders and Rotors,Tested Transmission Type Code,...,Aftertreatment Device Cd,Aftertreatment Device Desc,Police - Emergency Vehicle?,y_pred_lof,y_pred_if,y_pred_rob,y_pred_lof_2,y_pred_if_2,y_pred_rob_2,all_out
103,2023,BMW,"i4 eDrive 40 Gran Coupe (18"" Wheels)",FK96502,0,0.00,Car,335,,A,...,,,N,1,-1,-1,0,-1,-1,-2
210,2023,BMW,MINI COOPER SE HARDTOP 2 DOOR,2N13701,0,0.00,Car,181,,A,...,,,N,-1,-1,-1,-1,-1,-1,-3
211,2023,BMW,MINI COOPER SE HARDTOP 2 DOOR,2N13701,0,0.00,Car,181,,A,...,,,N,-1,-1,-1,-1,-1,-1,-3
267,2023,BMW,X5 xDrive45e,LE48294,2,3.00,Both,282,6.00,SA,...,TWC,Three-way catalyst,N,-1,1,-1,-1,0,-1,-2
269,2023,BMW,X5 xDrive45e,LE48294,2,3.00,Both,282,6.00,SA,...,TWC,Three-way catalyst,N,-1,1,-1,-1,0,-1,-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3299,2023,Volkswagen,ID.4 Pro S,VW316630184,0,0.00,Truck,201,,A,...,,,N,1,-1,-1,0,-1,-1,-2
3301,2023,Volkswagen,ID.4 Pro S,VW316630184,0,0.00,Truck,201,,A,...,,,N,1,-1,-1,0,-1,-1,-2
3303,2023,Volkswagen,ID.4 S,VW316630217,0,0.00,Truck,201,,A,...,,,N,1,-1,-1,0,-1,-1,-2
3305,2023,Volkswagen,ID.4 S,VW316630217,0,0.00,Truck,201,,A,...,,,N,1,-1,-1,0,-1,-1,-2


Outlier Analysis: So let's take a step back, what did we just do. Essentially, we used Mahalanobis distance and KNN to identify which cars were outliers! The result was a list of 113 rows that we should remove prior to conducting the rest of the analysis. Now lets continue with our ML Pipeline. It is a bit easier to review the data in the csv file so feel free to uncomment the code and check yourself. It is very clear that many of the horsepower outliers were extremely low, down at 37. The high end of horsepower was 1500, far above the expected mean. Additionally, test weight was as high as 7000 and as low as 2375. I imagine most outliers were high though. 3rd Metric was C02, and, as expected, were hybrid or electric cars. And finally, RND_ADJ_FE had drastic outliers from electric and highbrid vehicals, similar to C02 emissions. Removing the electric cards from the dataset seemed a wise next step to take.

In [52]:
from sklearn.preprocessing import LabelEncoder

# Drop columns with lots of gaps to preserve columns we really care about. 
# cars = cars.drop(["DT-Inertia Work Ratio Rating", "DT-Absolute Speed Change Ratg", "DT-Energy Economy Rating", "N2O (g/mi)", "CH4 (g/mi)",
  #         "PM (g/mi)", "NOx (g/mi)", "THC (g/mi)"], inplace=True)

# remove outliers from df
#cars = cars.loc(cars["all_out"]<-1)
cars.drop('CO2 (g/mi)', axis=1)
cars = cars.dropna()
cars.head()


#standardize all numerical (except for the response)
for i in cars:
    if cars[i].dtypes == 'number' and i != 'RND_ADJ_FE':
        cars[i] = StandardScaler.fit_transform(cars[i])


#Encode all categorical
for i in cars:
    if cars[i].dtype == "object":
        encode = LabelEncoder()
        cars[i] = encode.fit_transform(cars[i])

cars.to_csv("test.csv", index=False)        



Now that we have our dataset, let's move into SGDRegressor and SGDClassifier

## SGD Regressor

In [51]:
# --------------------------------------------------------------------------------
# # Separating the features and target
# Notice no attention is paid to Test-Train separation. Consider that step as an 
# intergal part of ML pipeline, and should not be skipped
# --------------------------------------------------------------------------------

X = cars.iloc[:, cars.columns != 'RND_ADJ_FE']
y = cars["RND_ADJ_FE"] 

# # ----------------------------------------------------------------------
# # # # Feature Standardization
# # This is a required step, as recommended by sklearn
# # ----------------------------------------------------------------------
scaler = StandardScaler()
X = scaler.fit_transform(X)

# --------------------------
# # # Training the model
# --------------------------
model = SGDRegressor(max_iter=100, 
                     tol = 0.0001,
                     early_stopping=False, warm_start=False,
                     n_iter_no_change = 5)
model.fit(X,y)
    
# -------
# Predict    
# -------
y_pred = model.predict(X)

# ----------------------
# Evaluating using MSE
# Using sklearn's built in function
# ----------------------

from sklearn.metrics import mean_squared_error
print(f"MSE is {mean_squared_error(y, y_pred)}")

# --------------------------------------------
# Alternatively writing the formula
# --------------------------------------------

mse = np.mean((y - y_pred)**2)




print(f"Number of adjustments in weights = {model.t_}, Coefficients are {model.coef_}, Number of iterations=  {model.n_iter_} and the R2 score is {model.score(X,y)}")
print("\nNote that the coefficients are for the standardized data and not on the original scale\n")
    
    
# --------------------------------------------------------------------------------
# Repeating the above step 20 times
# The purpose is to illustrate how the results are similar but not exactly the same 
# This is due to the stochastic nature of the process
# --------------------------------------------------------------------------------
    
for i in range(1,20):
    # # # Training the model
    model1 = SGDRegressor(max_iter=100, #tol=None, 
                         tol = 0.0001,
                         early_stopping=False, warm_start=False,
                         n_iter_no_change = 5)
    model1.fit(X,y)
    
 
    # # # Making predictions
    y_pred = model1.predict(X)

    # # # Evaluating the model
    mse = np.mean((y - y_pred)**2)
    print("-----------")
    print(model1.t_,model1.coef_, "{:.3f}".format(model1.score(X,y)), model1.n_iter_)


MSE is 5.954389662060781
Number of adjustments in weights = 12551.0, Coefficients are [ 0.          0.52458371 -0.17676355 -0.32459014  0.11304282  0.02936833
 -0.28302942  0.08778093  0.65614135 -0.63235245  0.41942968 -0.03363442
 -0.31447654  0.25204878  0.23350941 -0.16131028 -0.11088276  0.34730694
  0.10979521  1.45453287 -0.79338486 -0.03012788 -0.45721701  0.4505512
 -0.47436524 -7.42951727 -0.24829191 -0.00893577 -0.28225529  0.45628276
  0.          0.81435265  0.24704535 -0.73866477 -0.41630734 -0.61731467
 -0.73494624 -0.18884155 -0.20088104  0.2499468  -0.14203917 -0.20305498
  0.         -0.13290387 -0.25943412 -0.32240282 -0.13290387 -0.25943412
 -0.32240282 -0.27097836], Number of iterations=  25 and the R2 score is 0.9076000142889846

Note that the coefficients are for the standardized data and not on the original scale

-----------
28615.0 [ 0.          0.47661487 -0.17822586 -0.32965921  0.12328941 -0.32013217
 -0.27172056  0.30367057  1.05781437 -0.87422632  0.60749

Results from SGD Regressor: In SGD regression analysis, the algorithm randomly selects a batch from the training data for each iteration, computes the gradient of the loss function, and updates the model. The results will change and vary because we are working with training data, which naturally will return slightly different results. The overall number of iterations was 25, and the R2 score is 0.9076. This is a very high R2 value even after removing the c02 predictor but makes sense because we retiained a high number of features. If there were more data, I would like want to reduce the number of features and remove the lease influential ones. The regressor model, with a R2 score of 0.9076, could therefore be expected to explain around 90% of the miles per gallon (target) variable provided the input features. This is a high performant model and there isn't much we can do to improve the model other than increase data or put back in the C02 feature. More to come on the comparison for the classifier. 



## SGD Classifier

In [53]:
# ----------------------------------------
#importing necessary libraries
# ----------------------------------------

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import sklearn.metrics

X = cars.iloc[:, cars.columns != 'RND_ADJ_FE']
y = cars["RND_ADJ_FE"] 

# High when MPG is 30, else low
y_cat = np.where(y>29, 0,1)

# # # ----------------------------------------------------------------------
# # # # # Feature Standardization
# # # This is a required step, as recommended by sklearn
# # # ----------------------------------------------------------------------
scaler = StandardScaler()
X = scaler.fit_transform(X)


#splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_cat, test_size = 0.2, random_state=42)

#creating and fitting the SGDClassifier
clf = SGDClassifier(max_iter=1000, tol=1e-3, loss = 'log')
clf.fit(X_train, y_train)

#predicting the test set results
y_pred = clf.predict(X_test)

#calculating the accuracy of the model
accuracy = clf.score(X_test, y_test)

print("Accuracy: {:.2f}".format(accuracy))

Accuracy: 0.93


In [22]:
# help(sklearn.metrics)
# ------------------------------
# Various classification metrics
# ------------------------------

print(f"Overall accuracy is {sklearn.metrics.accuracy_score(y_pred, y_test)}")
print(f"Precision or TP/(TP + FP) is {sklearn.metrics.precision_score(y_pred, y_test)}")
print(f"Recall or TP / (TP + FN) is {sklearn.metrics.recall_score(y_pred, y_test)}")
print(f"F1-score or 2*Precision*Recall / (Precision + Recall)  is {sklearn.metrics.f1_score(y_pred, y_test)}")
print(f"Confusion Matrix \n{sklearn.metrics.confusion_matrix(y_pred, y_test)}")


Overall accuracy is 0.8811881188118812
Precision or TP/(TP + FP) is 0.9259259259259259
Recall or TP / (TP + FN) is 0.8620689655172413
F1-score or 2*Precision*Recall / (Precision + Recall)  is 0.8928571428571429
Confusion Matrix 
[[39  4]
 [ 8 50]]


In [23]:
#predicting the test set results
y_pred = clf.predict(X_test)

# # predict the decision scores for the data
decision_scores = clf.decision_function(X_test)

# # change the decision threshold to 0.6
y_pred_new = (decision_scores > 0.6).astype(int)

# # print the accuracy of the new predictions
print("Accuracy with threshold of 0.6:", sum(y_pred_new == y_test) / len(y_test))

Accuracy with threshold of 0.6: 0.8811881188118812


Results from SGD Classifier: This is test which basically says, can this model predict the right result, yes or no. It turns out, this model is pretty good, however, not quite as accurate as the regression. Accuracy with threshold of 0.6: 0.8811881188118812 means that when using a threshold of 0.6, the model was able to classify the row correctly about 88% of the time. This is within a couple of percentage points for the regressor models. One action we could take to improve the model is potentially up the decision threshold, which might improve the model. Additionally, adding the C02 feature back in would help. Both the vlassifier and the regression model were highly performant and there is not much to split the two in terms of accuracy. They do tell us slightly different things, hovever, and it is probably good to use both to verify one another. 

## KNN Regression

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score


X = cars.iloc[:, cars.columns != 'RND_ADJ_FE']
y = cars["RND_ADJ_FE"] 
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [56]:
# ---------------------------------------------
# Split data into training and testing sets
# ---------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ---------------------------------------------
# Train a KNN regressor with 5 neighbors
# Choice of neighbors is hyperparameter
# ---------------------------------------------
knn_reg = KNeighborsRegressor(n_neighbors=5, leaf_size=10)
knn_reg.fit(X_train, y_train)

# ---------------------------------------------
# Make predictions on the testing set
# ---------------------------------------------
y_pred = knn_reg.predict(X_test)

# ---------------------------------------------
# Evaluate the MSE and R-squared values 
# ---------------------------------------------
mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse}")
print(f"R-sq value is {r2_score(y_test, y_pred)}")

Mean squared error: 13.961734437086095
R-sq value is 0.7696460730767708


KNN Regressor results: The KNN regressor is designed to predict a target based on the nearest neighbors. The MSE, in this case 13.42, explains the distance between the predicted values and the actual target values. This is a pretty high different given the context of the model and that is verified by the model's R-sq value sitting at 0.77, far lower than the SGD regression R-sq value calculated above. Some things we could do to improve the model are to tune the neighbor count and the leaf size. Perhaps the number of neighbors or leaf size could be impacting the accuracy of the model, and this is going to specific to the data set. As a check I made the neighbors 10, and that lowered the models accuracy slightly and then tune the leaf size to 5 which drastically lowered the R-sq value. There is some wiggle room here and it would be worth trying this a few times. 

## KNN Classifier

In [37]:
# ----------------------------------------
#Importing data and creating a binary response
# ----------------------------------------

X = cars.iloc[:, cars.columns != 'RND_ADJ_FE']
y = cars["RND_ADJ_FE"] 

# MPH > 30, else Low
y_cat = np.where(y<29, 0,1)

# Feature Standardization
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_cat, test_size=0.2)

# KNN classifier with 5 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = knn_clf.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = knn_clf.score(X_test, y_test)
print("Accuracy:", accuracy)


Accuracy: 0.8514851485148515


Summary of KNN Regressor Results: A KNN classifier accuracy score is essentially the % with which the model can correctly identify the classified result. In this case, it was fuel consumption high or low. Here, we note that the KNN classifier performed lower when compared to the SGD model, however, it still performed pretty well. Some actions that can be taken to improve the model are to adjust the n_neighbors variable and rerun the model. Additionally, we could play around with the MPG line, 30 might not be the best high low measure here. 

Sumamry of all 4
In conclusion, both SGD models were able to predict with a higher level of confidence the target variable when compared to the KNN models. The regression models predict the target provided the features while the classification models' accuracy score shares the accuracy of the model predicting high low on miles per gallon. For SGD, playing with the batch size, training data, and rerunning the model can be strong hyperparameters to tune. For KNN, the number of neighbors, leaf size, and training data are tunable hyperparameters. Overall, SGD regression and classigication outperformed KNN models, however, both performaed well. In terms of data anomolies, electric and hybrid vehicles really through the models through a loop. In the future, I would split the models into three catagories: Hybrid, Electric, and Petrol vehicals and this would likely yield more accurate petrol KNN and SGD models. We probably don't have enough Electric or Hybrid vehicals on the market to run a full model, but in a few years, that will likely change. 
