# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [56]:
import numpy as np
import pandas as pd
import mglearn

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [57]:
# TO DO: Import spam dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_spam
X, y = load_spam()
print(X.shape, y.shape, type(X), type(y))

(4600, 57) (4600,) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [58]:
# TO DO: Check if there are any missing values and fill them in if necessary
X.isna().values.any()
y.isna().values.any()

False

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [59]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

size = int(len(X) * 0.05)
X_small = X.sample(n=size)
y_small = y.sample(n=size)


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [60]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y)
lr_full = LogisticRegression(max_iter=2000)
lr_full.fit(X_train, y_train)

X_first_two_train, X_first_two_test, y_first_two_train, y_first_two_test = train_test_split(X.iloc[:,:2], y)
lr_first_two = LogisticRegression(max_iter=2000)
lr_first_two.fit(X_first_two_train, y_first_two_train)

X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(X_small, y_small)
lr_small = LogisticRegression(max_iter=2000)
lr_small.fit(X_small_train, y_small_train)

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

In [61]:
print("Full set training set score: {:.3f}".format(lr_full.score(X_train, y_train)))
print("Full set validation set score: {:.3f}\n".format(lr_full.score(X_test, y_test)))

print("First two rows training set score: {:.3f}".format(lr_first_two.score(X_first_two_train, y_first_two_train)))
print("First two rows validation set score: {:.3f}\n".format(lr_first_two.score(X_first_two_test, y_first_two_test)))

print("Five percent set training set score: {:.3f}".format(lr_small.score(X_small_train, y_small_train)))
print("Five percent set validation set score: {:.3f}".format(lr_small.score(X_small_test, y_small_test)))

Full set training set score: 0.930
Full set validation set score: 0.935

First two rows training set score: 0.606
First two rows validation set score: 0.619

Five percent set training set score: 0.773
Five percent set validation set score: 0.569


### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [62]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE TH E DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
results = pd.DataFrame(columns=['data size', 'training accuracy', 'validation accuracy'])
results.loc[len(results)] = [X.shape, lr_full.score(X_train, y_train), lr_full.score(X_test, y_test)]
results.loc[len(results)] = [X.iloc[:,:2].shape, lr_first_two.score(X_first_two_train, y_first_two_train), lr_first_two.score(X_first_two_test, y_first_two_test)]
results.loc[len(results)] = [X_small.shape, lr_small.score(X_small_train, y_small_train), lr_small.score(X_small_test, y_small_test)]
print(results)

    data size  training accuracy  validation accuracy
0  (4600, 57)           0.929565             0.934783
1   (4600, 2)           0.606087             0.619130
2   (230, 57)           0.773256             0.568966


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. Training and validation accuracy increase with more data used. Using five percent of the original dataset resulted in a reduction in training accuracy from 0.93 to 0.76, and a reduction in validation accuracy from 0.92 to 0.53. Reducing the amount of columns from 57 to 2 resulted in a reduction in training accuracy from 0.93 to 0.61, and a reduction in validation accuracy from 0.92 to 0.64.

2. A false positive would be a non-spam email marked as spam. A false negative would be a spam email marked as not spam. A false positive would be worse (marking non-spam email as spam could result in email being missed).

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. Code was sourced from the course examples (Linear Regression-filled.ipynb, Linnear Classification-filled.ipynb), and from the following websites:
    https://www.scikit-yb.org/en/latest/api/datasets/spam.html
    https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

2. The steps were completed in order.

3. Generative AI was not used. I find I learn better researching my questions rather than asking a generative AI program. 

4. No major challenges. Just remembering some syntax, functions to use etc. Going over the course examples helped a lot. 

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [63]:
# TO DO: Import spam dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_concrete
X, y = load_concrete()
print(X.shape, y.shape, type(X), type(y))

(1030, 8) (1030,) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [64]:
# TO DO: Check if there are any missing values and fill them in if necessary
X.isna().values.any()
y.isna().values.any()

False

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [65]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = lr.fit(X_train, y_train)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [66]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error
R2_training = model.score(X_train, y_train)
mse_training = mean_squared_error(y_train, model.predict(X_train))
R2_validation = model.score(X_test, y_test)
mse_validation = mean_squared_error(y_test, model.predict(X_test))



### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [67]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame(columns=['Training accuracy', 'Validation accuracy'], index=['MSE', 'R2 score'])

results.loc['MSE', 'Training accuracy'] = mse_training
results.loc['MSE', 'Validation accuracy'] = mse_validation
results.loc['R2 score', 'Training accuracy'] = R2_training
results.loc['R2 score', 'Validation accuracy'] = R2_validation
print(results)


         Training accuracy Validation accuracy
MSE             110.660379           98.297955
R2 score          0.597044            0.661224


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not? 
    
    The linear model produced inaccurate results for the dataset. This is because the data does not represent a very linear trend. 

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. Code was sourced from the course examples (Linear Regression-filled.ipynb, Linnear Classification-filled.ipynb), and from the following websites:<br>
    https://www.scikit-yb.org/en/latest/api/datasets/spam.html<br>
    https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

2. The steps were completed in order.

3. Generative AI was not used. I find I learn better researching my questions rather than asking a generative AI program. 

4. No major challenges. Just remembering some syntax, functions to use etc. Going over the course examples helped a lot. 

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*<br>
The main pattern / finding I noticed is the importance of using the correct models for your dataset - for example using linear regression was not effective with the concrete dataset, as it does not follow a very linear trend. This resulted training and validation scores of 110.7 and 98.3 for MSE, and 0.60 and 0.66 for R2, respectively. 

The analysis done in Part 1 of this assignment with the spam dataset showed the importance of having sufficient data, as we saw that we could only produce good results with sufficient columns and amount of data. (Using five percent of the original dataset resulted in a reduction in training accuracy from 0.93 to 0.76, and a reduction in validation accuracy from 0.92 to 0.53. Reducing the amount of columns from 57 to 2 resulted in a reduction in training accuracy from 0.93 to 0.61, and a reduction in validation accuracy from 0.92 to 0.64).

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*<br>
I liked how the assignment broken down into steps, as it helps walk us through the assigment. I liked being able to try out different methods on different models to see how they work. 

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logorithmic scale.

In [68]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

max_ridge_training = 0
max_ridge_validation = 0
max_alpha_ridge_training = 0
max_alpha_ridge_validation = 0

max_lasso_training = 0
max_lasso_validation = 0
max_alpha_lasso_training = 0
max_alpha_lasso_validation = 0

for i in np.arange(0.001, 0.05, 100):
    ridge = Ridge(alpha=i).fit(X_train, y_train)
    lasso = Lasso(alpha = i).fit(X_train, y_train)
    
    if ridge.score(X_train, y_train) > max_ridge_training:
        max_ridge_training = ridge.score(X_train, y_train)
        max_alpha_ridge_training = i
    if ridge.score(X_test, y_test) > max_ridge_validation:
        max_ridge_validation = ridge.score(X_test, y_test)
        max_alpha_ridge_validation = i

    if lasso.score(X_train, y_train) > max_lasso_training:
        max_lasso_training = lasso.score(X_train, y_train)
        max_alpha_lasso_training = i
    if lasso.score(X_test, y_test) > max_lasso_validation:
        max_lasso_validation = lasso.score(X_test, y_test)
        max_alpha_lasso_validation = i

print(max_ridge_training)
print(max_ridge_validation)
print(max_alpha_ridge_training)
print(max_alpha_ridge_validation)

print(max_lasso_training)
print(max_lasso_validation)
print(max_alpha_lasso_training)
print(max_alpha_lasso_validation)

0.5970437074602829
0.6612241937986676
0.001
0.001
0.5970437072219401
0.661221890676573
0.001
0.001


*ANSWER HERE*<br>
The lowest alpha values gave the highest R2 score. Ridge regression gave the highest training and validation set scores. The validation scores, which show how well the model generalizes are low for all tested alpha values, meaning the score is not 'good enough'.