# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [13]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [14]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X, y = load_spam()
# TO DO: Print size and type of X and y
print("X size:", X.size)
print("X type", type(X))
print("y size:", y.size)
print("y type", type(y))

X size: 262200
X type <class 'pandas.core.frame.DataFrame'>
y size: 4600
y type <class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [15]:
# TO DO: Check if there are any missing values and fill them in if necessary
X.isnull().sum().sum()

0

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [16]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split
X_large, X_small, y_large, y_small = train_test_split(X, y, test_size=0.05, stratify=y, random_state = 0)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [18]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

results = pd.DataFrame(columns=['Data Size', 'Training Accuracy', 'Validation Accuracy'])

# The three datasets to compare: all of X, the first two columns of X, and the 5% sample
dataX = [X, X.iloc[:, :2], X_small]
datay = [y, y, y_small]
for X_i, y_i in zip(dataX, datay):
    # With no test_size given, train_test_split holds out the default 25% for validation
    X_train, X_val, y_train, y_val = train_test_split(X_i, y_i, random_state=0)
    logreg = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    train_score = accuracy_score(y_train, logreg.predict(X_train))
    val_score = accuracy_score(y_val, logreg.predict(X_val))
    # Append one row per dataset
    results.loc[len(results)] = [X_i.shape, train_score, val_score]
print(results)

    Data Size  Training Accuracy  Validation Accuracy
0  (4600, 57)           0.928116             0.938261
1   (4600, 2)           0.608406             0.613043
2   (230, 57)           0.936047             0.862069
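
The accuracies above are computed on very different numbers of rows. A rough sketch of the split sizes follows, assuming `train_test_split` uses its default 25% validation share when no `test_size` is passed and rounds the validation count up. The helper name `split_sizes` is made up for illustration.

```python
import math

def split_sizes(n_rows, test_fraction=0.25):
    """Return (n_train, n_val) for a split with the given validation fraction."""
    # Assumption: the validation count is rounded up, the rest goes to training
    n_val = math.ceil(n_rows * test_fraction)
    return n_rows - n_val, n_val

# Row counts taken from the "Data Size" column of the results table
for label, n_rows in [("X", 4600), ("first two columns", 4600), ("X_small", 230)]:
    n_train, n_val = split_sizes(n_rows)
    print(f"{label}: {n_train} training rows, {n_val} validation rows")
```

Under these assumptions the `X_small` validation accuracy rests on only 58 rows, so a single misclassified email shifts it by about 1.7 percentage points.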


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

Validation accuracy improves as more of the data is used. With the full dataset (4600 rows, 57 columns) the training accuracy is 0.928 and the validation accuracy is 0.938. With only the first two columns of `X` (4600 rows, 2 columns) both drop sharply, to 0.608 and 0.613, so two features are clearly not sufficient to build an accurate model, even with every row available. With `X_small` (230 rows, 57 columns) the training accuracy stays high at 0.936, but the validation accuracy falls to 0.862. That gap suggests the model overfits when it only sees 5% of the rows, although the result is still far better than the two-column model.
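
One way to compare the three runs is the difference between training and validation accuracy. The sketch below only does arithmetic on the values printed in the results table; the dictionary keys are labels chosen for illustration.

```python
# (training accuracy, validation accuracy) copied from the results table
scores = {
    "X (4600 x 57)": (0.928116, 0.938261),
    "first two columns (4600 x 2)": (0.608406, 0.613043),
    "X_small (230 x 57)": (0.936047, 0.862069),
}

# A large positive gap means the model does much better on data it has already seen
gaps = {name: round(train - val, 3) for name, (train, val) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train - validation = {gap:+.3f}")
```

Only the `X_small` run shows a clearly positive gap, which is the usual sign of overfitting on a small sample.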

In this case, a false positive is a legitimate email that is marked as spam, and a false negative is a spam email that is marked as legitimate. The false positive is worse, because the user might miss an important email, whereas a false negative is only a minor inconvenience.
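
The two error types can be counted directly from the true and predicted labels. The following is a minimal sketch, assuming spam is encoded as `1` and legitimate email as `0`; the labels are made up for illustration and `confusion_counts` is a hypothetical helper, not part of the assignment code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the spam label."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # spam correctly flagged
        elif predicted == positive:
            fp += 1  # legitimate email flagged as spam
        elif actual == positive:
            fn += 1  # spam that reached the inbox
        else:
            tn += 1  # legitimate email left alone
    return tp, fp, fn, tn

# Made-up labels: 1 = spam, 0 = legitimate email
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of spam emails that were flagged
print(tp, fp, fn, tn, precision, recall)
```

If false positives are the costlier error, precision is the metric to watch, since it falls whenever a legitimate email is flagged.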

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code? 
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. I sourced it from the notes, online searches, and ChatGPT.
1. I completed all the steps in order.
1. I asked ChatGPT: "how to select the first two columns of a pandas dataframe in python". I did need to alter the code slightly, as it gave me a full example when I only wanted the `iloc` syntax, and it used the default DataFrame name `df`.
1. I was initially confused by the 5% `train_test_split`, since I wasn't sure whether we were supposed to use it again to build our model. Reading through the notes again helped. Also, on my first attempt the dataset would not load and I had to reset my conda environment.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [19]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
X, y = load_concrete()
# TO DO: Print size and type of X and y
print("X size:", X.size)
print("X type", type(X))
print("y size:", y.size)
print("y type", type(y))

X size: 8240
X type <class 'pandas.core.frame.DataFrame'>
y size: 1030
y type <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [20]:
# TO DO: Check if there are any missing values and fill them in if necessary
X.isnull().sum()

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with `X` and `y`

In [21]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [22]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
train_mse = mean_squared_error(y_train, lr.predict(X_train))
train_r2 = r2_score(y_train, lr.predict(X_train))
val_mse = mean_squared_error(y_val, lr.predict(X_val))
val_r2 = r2_score(y_val, lr.predict(X_val))
print("Training mse score: {:.2f}".format(train_mse))
print("Validation mse score: {:.2f}".format(val_mse))
print("Training r2 score: {:.2f}".format(train_r2))
print("Validation r2 score: {:.2f}".format(val_r2))

Training mse score: 111.36
Validation mse score: 95.90
Training r2 score: 0.61
Validation r2 score: 0.62
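
For reference, both metrics can be written out by hand. This is a minimal sketch of the standard definitions using made-up numbers; the function names `mse` and `r2` are chosen for illustration and are not the sklearn functions used above.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.5]

print(mse(y_true, y_pred))
print(r2(y_true, y_pred))
```

An R2 of 0 corresponds to always predicting the mean of `y_true` and 1 to perfect predictions, which is the scale on which the scores above should be read.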


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [23]:
# TO DO: ADD YOUR CODE HERE

results = pd.DataFrame(columns=['Training Accuracy', 'Validation Accuracy'])
results.loc['MSE'] = [train_mse, val_mse]
results.loc['R2'] = [train_r2, val_r2]
print(results)

     Training Accuracy  Validation Accuracy
MSE         111.358439            95.904136
R2            0.610823             0.623414


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

No, it did not. The R2 scores are low (0.61 for training and 0.62 for validation) and the MSE values are high (111.36 and 95.90). The relationship between the features and the compressive strength is likely not linear, so a more complex model is required to fit the data properly.
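
MSE is expressed in squared units of the target, which makes "high" hard to judge on its own. Taking the square root (RMSE) puts the error back on the scale of the compressive strength values. The sketch below only does arithmetic on the MSE values reported in Step 4.

```python
import math

# MSE values copied from the Step 4 output
train_mse = 111.358439
val_mse = 95.904136

# RMSE is in the same units as the target, so it reads as a typical prediction error
train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(f"Training RMSE: {train_rmse:.2f}")
print(f"Validation RMSE: {val_rmse:.2f}")
```

Whether an error of this size is acceptable depends on the range of strengths in `y`, which could be checked with `y.describe()`.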

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. I sourced my code from the lectures and class examples, as well as some Google searches for the MSE and R2 functions.
1. I completed the steps in the order provided.
1. I did not use AI for this part of the assignment.
1. I was confused in Step 3, as it asks to import a linear model but then to instantiate a logistic regression; I assume that was just a typo. Once I got past that it was all good.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


Based on the results, the linear model is not sufficient to predict the outcome. The model is likely underfitting the data: the training and validation R2 scores are almost equal (0.61 and 0.62) and both are low. There are not many features in this dataset (8), so the linear model struggles to predict the values. It is possible that some regularization could help in this scenario. An R2 score of about 0.6 indicates that the model does not fit the data well, as the score should be closer to 1.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


I liked implementing a different kind of model, as it is important to see how different models predict on different datasets. I did find it difficult to use the R2 and MSE functions at first, as we had not yet seen an example in class. Once I went on the scikit-learn website and saw the import statements, it went well from there.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
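
A logarithmic grid means one value per power of ten. The following is a minimal sketch of how such a grid could be generated rather than typed by hand; the variable name `alpha_values` is an assumption.

```python
# Six alphas from 10^-3 to 10^2, one per power of ten
alpha_values = [10.0 ** exponent for exponent in range(-3, 3)]
print(alpha_values)

# With NumPy available, np.logspace(-3, 2, num=6) builds the same grid as an array
```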

In [24]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Alpha values from 0.001 to 100 on a logarithmic scale
alpha_values = [0.001, 0.01, 0.1, 1, 10, 100]
part5_results = pd.DataFrame(columns=['alpha', 'Training Accuracy MSE', 'Validation Accuracy MSE', 'Training Accuracy R^2', 'Validation Accuracy R^2'])

# Fit each model type at each alpha; rows are named ridge1..ridge6 and lasso1..lasso6
for name, Model in [('ridge', Ridge), ('lasso', Lasso)]:
    for i, a in enumerate(alpha_values, start=1):
        model = Model(alpha=a).fit(X_train, y_train)
        train_pred = model.predict(X_train)
        val_pred = model.predict(X_val)
        part5_results.loc[f'{name}{i}'] = [
            a,
            mean_squared_error(y_train, train_pred),
            mean_squared_error(y_val, val_pred),
            r2_score(y_train, train_pred),
            r2_score(y_val, val_pred),
        ]

part5_results

          alpha  Training Accuracy MSE  Validation Accuracy MSE  Training Accuracy R^2  Validation Accuracy R^2
ridge1    0.001             111.358439                95.904136               0.610823                 0.623414
ridge2     0.01             111.358439                95.904135               0.610823                 0.623414
ridge3      0.1             111.358439                95.904126               0.610823                 0.623415
ridge4      1.0             111.358439                95.904035               0.610823                 0.623415
ridge5     10.0             111.358440                95.903131               0.610823                 0.623418
ridge6    100.0             111.358548                95.894268               0.610823                 0.623453
lasso1    0.001             111.358439                95.903755               0.610823                 0.623416
lasso2     0.01             111.358445                95.900332               0.610823                 0.623429
lasso3      0.1             111.359051                95.866646               0.610821                 0.623562
lasso4      1.0             111.419648                95.584678               0.610609                 0.624669


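Among the rows shown, the best validation R^2 score came from Lasso with alpha = 1.0 (0.624669); the best Ridge result was at alpha = 100 (0.623453). The training R^2 score was about 0.61 for every model. These differences are tiny, and the score is still not good enough, as it should be much closer to 1 to show a good fit of the model.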
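
Selecting the best model by validation R^2, rather than training R^2, can be sketched as follows. The values are copied from the rows visible in the table above, and the variable names are chosen for illustration.

```python
# (row name, alpha, validation R^2) copied from the rows visible in the table above
rows = [
    ("ridge1", 0.001, 0.623414),
    ("ridge2", 0.01, 0.623414),
    ("ridge3", 0.1, 0.623415),
    ("ridge4", 1.0, 0.623415),
    ("ridge5", 10.0, 0.623418),
    ("ridge6", 100.0, 0.623453),
    ("lasso1", 0.001, 0.623416),
    ("lasso2", 0.01, 0.623429),
    ("lasso3", 0.1, 0.623562),
    ("lasso4", 1.0, 0.624669),
]

# Pick the row with the highest validation R^2
best_name, best_alpha, best_r2 = max(rows, key=lambda row: row[2])
print(best_name, best_alpha, best_r2)

# Improvement over the smallest-alpha ridge fit
print(round(best_r2 - rows[0][2], 6))
```

With the full results in a DataFrame, `part5_results['Validation Accuracy R^2'].idxmax()` would return the same row label.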