## CSE 532 Assignment 4 (Due 4/27/24)

**Note: As with the previous assignment you should submit a separate document (.pdf or .doc(x)) with your responses to the analysis portion of the problems.** 

**1. (Machine Learning (Classification))** <br>a. Choose one of the [toy classification datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) bundled with sklearn **other than the digits dataset**. <br> b. Train **three** distinct sklearn classification estimators for the chosen dataset and compare the results to see which one performs the best when using **2-fold cross-validation**.  Note that you should use three distinct classification models here (not just tweak underlying parameters).  A relatively complete listing of the available estimators can be found here (https://scikit-learn.org/stable/supervised_learning.html) -- but make sure you only use classifiers!  Unless you have an inclination to do otherwise, I recommend using the model default parameters when available.   <br> c. Repeat b. for **20-fold cross-validation**. Explain in a paragraph the difference in your results when using 20-fold vs 2-fold cross-validation (if any). <br>d. Construct a **confusion matrix** for your _most accurate_ model between the three estimators and two cross-fold options. <br> e. Which class in your dataset is most accurately predicted to have the correct label by the best classifier, and which is most likely to be confused among one or more of the wrong classes? _(You can use a cell in a jupyter notebook file for this or a separate text/document file)._

In [19]:
# Part A and Part B
from sklearn import datasets
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

X, y = iris.data, iris.target

classifier1 = KNeighborsClassifier()
classifier2 = LogisticRegression(max_iter=10000)
classifier3 = DecisionTreeClassifier()

kf = KFold(n_splits=2, shuffle=True, random_state=1)

scores1 = cross_val_score(classifier1, X, y, cv=kf)
scores2 = cross_val_score(classifier2, X, y, cv=kf)
scores3 = cross_val_score(classifier3, X, y, cv=kf)

# cross_val_score uses the estimator's default scorer, which is accuracy for classifiers
print("(2-Fold) Mean accuracy for KNeighborsClassifier:", np.mean(scores1))
print("(2-Fold) Mean accuracy for LogisticRegression:", np.mean(scores2))
print("(2-Fold) Mean accuracy for DecisionTreeClassifier:", np.mean(scores3))

(2-Fold) Mean accuracy for KNeighborsClassifier: 0.94
(2-Fold) Mean accuracy for LogisticRegression: 0.9733333333333334
(2-Fold) Mean accuracy for DecisionTreeClassifier: 0.9266666666666667


The Logistic Regression Classifier (linear model) seems to have the best average cross validation score, then the KNN classifier, and lastly the Decision Tree Classifier.

In [20]:
# Part C

kf = KFold(n_splits=20, shuffle=True, random_state=1)

scores1 = cross_val_score(classifier1, X, y, cv=kf)
scores2 = cross_val_score(classifier2, X, y, cv=kf)
scores3 = cross_val_score(classifier3, X, y, cv=kf)

# cross_val_score uses the estimator's default scorer, which is accuracy for classifiers
print("(20-Fold) Mean accuracy for KNeighborsClassifier:", np.mean(scores1))
print("(20-Fold) Mean accuracy for LogisticRegression:", np.mean(scores2))
print("(20-Fold) Mean accuracy for DecisionTreeClassifier:", np.mean(scores3))

(20-Fold) Mean accuracy for KNeighborsClassifier: 0.9651785714285716
(20-Fold) Mean accuracy for LogisticRegression: 0.9660714285714287
(20-Fold) Mean accuracy for DecisionTreeClassifier: 0.9598214285714286


**Difference in your results when using 20-fold vs 2-fold cross-validation:**

Both fold counts produced the same ranking of the classifiers by mean accuracy (LogisticRegression > KNN > DecisionTree). However, the gap between the models narrowed with more folds: the three means span roughly 0.93 to 0.97 under 2-fold cross-validation but only about 0.96 to 0.97 under 20-fold. With 2 folds, each model is trained on only half of the data and the mean is taken over just two scores, so the estimate depends heavily on how that one split happens to fall. With 20 folds, each model is trained on 95% of the data and the mean is taken over 20 different training and validation splits, which gives a better idea of how the model would actually perform on "new" and different testing data and leaves less variation in the mean accuracy across consecutive executions. In practice, the final model is often trained on the entire training dataset, and the model architect uses cross-validation beforehand to find the parameters that give the best validation accuracy.
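One way to make the comparison above more concrete is to look at the spread of the per-fold scores as well as their mean. The following is a minimal sketch, assuming the same iris data, `LogisticRegression(max_iter=10000)`, and `random_state=1` used in the cells above; the names `fold_summary` and `splitter` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=10000)

fold_summary = {}
for k in (2, 20):
    # Same shuffling seed as the notebook cells above
    splitter = KFold(n_splits=k, shuffle=True, random_state=1)
    fold_scores = cross_val_score(model, X_iris, y_iris, cv=splitter)
    fold_summary[k] = (fold_scores.mean(), fold_scores.std(), len(fold_scores))
    print(f"{k}-fold: mean accuracy = {fold_scores.mean():.3f}, "
          f"std across folds = {fold_scores.std():.3f}, "
          f"about {len(y_iris) // k} test samples per fold")
```

Note that with 20 folds each validation fold holds only 7 or 8 samples, so an individual fold score moves in coarse steps; it is the mean over all folds that is informative, not any single fold.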

In [21]:
# Part D
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

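# kf is still the 20-fold splitter defined in Part C
predicted = cross_val_predict(classifier2, X, y, cv=kf)
# rows are true classes, columns are predicted classes
matrix = confusion_matrix(y, predicted)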
print(matrix)


[[50  0  0]
 [ 0 47  3]
 [ 0  2 48]]
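The matrix above was computed with the `kf` object as it stood when the cell ran, which is the 20-fold splitter from Part C. The highest mean score reported in this notebook belongs to LogisticRegression under 2-fold cross-validation, so if that is the configuration the question intends, the matrix could be rebuilt along these lines. This is a sketch under that assumption; `kf_2fold` and `matrix_2fold` are names introduced here, and no output is shown because it was not part of the original run.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

X_iris, y_iris = load_iris(return_X_y=True)

# Same 2-fold settings as Part A/B (assumed to be the intended configuration)
kf_2fold = KFold(n_splits=2, shuffle=True, random_state=1)
predicted_2fold = cross_val_predict(
    LogisticRegression(max_iter=10000), X_iris, y_iris, cv=kf_2fold
)

# Rows are true classes, columns are predicted classes
matrix_2fold = confusion_matrix(y_iris, predicted_2fold)
print(matrix_2fold)
```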


**Part E: Which class in your dataset is most accurately predicted to have the correct label by the best classifier, and which is most likely to be confused among one or more of the wrong classes?**

Class 0 (setosa) is the most accurately predicted class: my trained logistic regression classifier labeled all 50 of its samples correctly, and no sample from another class was mistaken for it. This suggests that class 0 has vastly different attributes from the other two iris flower types. Class 1 (versicolor) is the class most likely to be misclassified: 3 of its 50 samples were predicted as class 2 (virginica) when they were actually class 1 samples. However, 2 of the 50 class 2 samples were predicted as class 1, and a difference of one sample is too small to argue confidently which of the two is misclassified more often. It is also worth pointing out that class 1 and class 2 samples must have similar attributes to each other, because every misclassification is a confusion between these two classes.
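The per-class reading of the confusion matrix can be checked directly by dividing the diagonal by the row sums (recall: of the true members of a class, how many were labeled correctly) and by the column sums (precision: of the samples given a label, how many deserved it). The sketch below hard-codes the matrix printed in Part D; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from Part D (rows = true class, columns = predicted class)
cm = np.array([[50, 0, 0],
               [0, 47, 3],
               [0, 2, 48]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as the class

for label, (r, p) in enumerate(zip(recall, precision)):
    print(f"class {label}: recall = {r:.3f}, precision = {p:.3f}")
# class 0: recall = 1.000, precision = 1.000
# class 1: recall = 0.940, precision = 0.959
# class 2: recall = 0.960, precision = 0.941
```

Class 1 has the lowest recall and class 2 the lowest precision, which is another way of seeing that the errors flow between these two classes only.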


**2 (Option I). (Machine Learning (Regression))** <br>a. Locate a non-proprietary, small-scale dataset _suitable for regression_ online.  There are countless sources and repositories that you can use in this task, but if you have trouble finding one, I recommend starting via Kaggle (https://www.kaggle.com/code/rtatman/datasets-for-regression-analysis/notebook).  Explain briefly what the dataset represents, what target variable you will be using, and what other features are present.  _You may want or need to apply preprocessing to your data to ensure it can be used properly with the regression models_ (e.g. making every feature numeric through transformation or by dropping some)  <br> b. Train **three** distinct sklearn regression estimators for the chosen dataset and compare the results to see which one performs the best when using **10-fold cross-validation**, utilizing the R-Squared score to gauge performance.  Note that you should use two distinct regression models here (not just tweak underlying parameters).  A relatively complete listing of the available estimators can be found here (https://scikit-learn.org/stable/supervised_learning.html) -- but make sure you only use regression models!  Unless you have an inclination to do otherwise, I recommend using the model default parameters when available.<br>  c. Repeat part b utilizing the Mean Square Error to gauge performance.  _Briefly_ research the difference between the two metrics (MSE and R2), and explain in a paragraph or two i. the difference between them ii. when each one is the preferable metric to use. _(You can use a cell in a jupyter notebook file for this or a separate text/document file)._

**Chosen Dataset Summary**

I chose a CSV dataset in which each sample has six features: transaction date (as a decimal year), house age, distance to the nearest Mass Rapid Transit (MRT) station, number of convenience stores, latitude, and longitude. These features are used to predict the house price per unit area (the target variable). No preprocessing is needed for this dataset, since the features are already numeric.

In [22]:
import pandas as pd

df = pd.read_csv('real_estate.csv')
# set X to data properties from columns X1-X6 
X = df.iloc[:, 1:-1].values
# set y to target variable (last column, house price of unit area)
y = df.iloc[:, -1].values
display(df)

| Unnamed: 0 | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.500 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 409 | 410 | 2013.000 | 13.7 | 4082.01500 | 0 | 24.94155 | 121.50381 | 15.4 |
| 410 | 411 | 2012.667 | 5.6 | 90.45606 | 9 | 24.97433 | 121.54310 | 50.0 |
| 411 | 412 | 2013.250 | 18.8 | 390.96960 | 7 | 24.97923 | 121.53986 | 40.6 |
| 412 | 413 | 2013.000 | 8.1 | 104.81010 | 5 | 24.96674 | 121.54067 | 52.5 |


In [23]:
#Part B
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, r2_score

regressor1 = KNeighborsRegressor()
regressor2 = LinearRegression()
regressor3 = RandomForestRegressor()

kf = KFold(n_splits=10, shuffle=True, random_state=1)

r2_scorer = make_scorer(r2_score)

scores1 = cross_val_score(regressor1, X, y, cv=kf, scoring=r2_scorer)
scores2 = cross_val_score(regressor2, X, y, cv=kf, scoring=r2_scorer)
scores3 = cross_val_score(regressor3, X, y, cv=kf, scoring=r2_scorer)

print("Mean R-squared score for KNeighborsRegressor:", np.mean(scores1))
print("Mean R-squared score for LinearRegression:", np.mean(scores2))
print("Mean R-squared score for RandomForestRegressor:", np.mean(scores3))


Mean R-squared score for KNeighborsRegressor: 0.5918522491710381
Mean R-squared score for LinearRegression: 0.5704017534496113
Mean R-squared score for RandomForestRegressor: 0.7015104791655438


The RandomForestRegressor model has the best average cross-validation R-squared score, followed by the KNeighborsRegressor model, and lastly the LinearRegression model.

In [24]:
# Part C
from sklearn.metrics import mean_squared_error

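# cross_val_predict returns one out-of-fold prediction per sample, so each
# MSE below is computed once over all pooled predictions rather than per fold
predict1 = cross_val_predict(regressor1, X, y, cv=kf)
mse_scores1 = mean_squared_error(y, predict1)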

predict2 = cross_val_predict(regressor2, X, y, cv=kf)
mse_scores2 = mean_squared_error(y, predict2)

predict3 = cross_val_predict(regressor3, X, y, cv=kf)
mse_scores3 = mean_squared_error(y, predict3)


print("Average Mean Square Error for KNeighborsRegressor:", mse_scores1)
print("Average Mean Square Error for LinearRegression:", mse_scores2)
print("Average Mean Square Error for RandomForestRegressor:", mse_scores3)

Average Mean Square Error for KNeighborsRegressor: 72.08382028985507
Average Mean Square Error for LinearRegression: 80.3266248434177
Average Mean Square Error for RandomForestRegressor: 56.472820973662806


The lower the MSE, the better the model is predicting. The RandomForestRegressor model has the lowest cross-validated MSE, followed by the KNeighborsRegressor model, and the LinearRegression model has the highest. This is the same ordering as with the R-squared values.
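Part C computes each MSE once over the pooled out-of-fold predictions. An alternative that mirrors the Part B code more closely is to ask `cross_val_score` for a per-fold MSE using the built-in `"neg_mean_squared_error"` scoring string (sklearn negates the error so that larger is always better). The sketch below compares the two approaches; because `real_estate.csv` is not available here, it assumes a synthetic stand-in from `make_regression`, and the names `X_demo`, `y_demo`, `fold_mse`, and `pooled_mse` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic stand-in data (an assumption; not the real estate dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
splitter = KFold(n_splits=10, shuffle=True, random_state=1)
reg = LinearRegression()

# Per-fold MSE: flip the sign of sklearn's negated score
neg_mse = cross_val_score(reg, X_demo, y_demo, cv=splitter, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

# Pooled MSE: one error over all out-of-fold predictions, as in Part C
pooled_mse = mean_squared_error(y_demo, cross_val_predict(reg, X_demo, y_demo, cv=splitter))

print("Mean of per-fold MSE:", fold_mse.mean())
print("Pooled MSE:          ", pooled_mse)
```

With 200 samples and 10 folds every fold has the same size, so the two numbers agree; when the fold sizes are unequal they can differ slightly.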

**Briefly research the difference between the two metrics (MSE and R2):** 

i. The difference between MSE and R-squared:

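The R-squared value measures the proportion of the variance in the target variable that the model's predictions explain, in other words how well the independent features can be used to predict the target. A value of 1 means perfect predictions and a value of 0 means the model does no better than always predicting the mean of the target; on held-out data it can even be negative if the model does worse than that. Mean squared error, on the other hand, is the average of the squared differences between the model's predicted outputs and the samples' actual target values, so it is expressed in the squared units of the target. The smaller the MSE, the better.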
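The two metrics are tied together by $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$, where $\mathrm{Var}(y)$ is the population variance of the true targets. The sketch below checks this numerically with sklearn's `r2_score` and `mean_squared_error`; the true values are the first five target prices from the table above, while the predictions are hypothetical numbers made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([37.9, 42.2, 47.3, 54.8, 43.1])  # first five targets from the table
y_pred = np.array([40.0, 41.0, 45.0, 50.0, 44.0])  # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
r2_from_mse = 1.0 - mse / np.var(y_true)  # np.var uses the population variance by default

print("MSE:", mse)
print("R-squared from sklearn:", r2)
print("R-squared from 1 - MSE/Var(y):", r2_from_mse)

# A predictor that is far worse than predicting the mean gives a negative R-squared
y_bad = np.full_like(y_true, 10.0)
r2_bad = r2_score(y_true, y_bad)
print("R-squared of a poor constant predictor:", r2_bad)
```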

ii. When each one is the preferable metric to use:

When evaluating how well a model fits the dataset fed to it, especially when you want a unitless score that can be compared across targets with different scales, you should utilize the goodness-of-fit-based R-squared metric. If your main goal is to assess the size of a model's prediction errors in terms of the target's own units, for example when comparing several models on the same dataset, then MSE would be the better metric to utilize. Both metrics are built from squared errors, so both emphasize the impact of outliers; R-squared does not reduce that impact, it only rescales the error by the variance of the target. Overall, it is often beneficial to consider both metrics when analyzing a model.
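The scale dependence of MSE can be seen with a small numeric check: multiplying the target (and the predictions) by a constant, as a change of units would, multiplies the MSE by the square of that constant while leaving R-squared unchanged. This is a sketch with hand-written helper functions `mse` and `r2` and made-up predictions, not the notebook's models.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

price = np.array([37.9, 42.2, 47.3, 54.8, 43.1])  # targets from the table above
guess = np.array([39.0, 43.0, 46.0, 52.0, 42.0])  # hypothetical predictions
scale = 10.0                                      # e.g. a change of units

print("MSE original:", mse(price, guess), " rescaled:", mse(price * scale, guess * scale))
print("R2  original:", r2(price, guess), " rescaled:", r2(price * scale, guess * scale))
```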

