# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [48]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [49]:
from yellowbrick.datasets import load_spam

X, y = load_spam()

print("Size of X:", X.shape)
print("Type of X:", type(X))
print("Size of y:", y.shape)
print("Type of y:", type(y))

Size of X: (4600, 57)
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (4600,)
Type of y: <class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [50]:
# X and y were loaded in Step 1; count the missing values in each column
missing_values = X.isnull().sum()
print(missing_values)

# If any column has gaps, fill them with the mean of that column
if missing_values.any():
    X = X.fillna(X.mean())

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre
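The spam data contains no missing values, so the fill step never runs. As a minimal sketch of what column-wise mean imputation would do if gaps were present, the example below uses a small made-up DataFrame (`X_toy` and its column names are hypothetical, not part of the spam dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature matrix with one missing entry per column.
X_toy = pd.DataFrame({
    "word_freq_a": [0.0, 1.0, np.nan, 3.0],
    "word_freq_b": [10.0, np.nan, 30.0, 50.0],
})

# Count the missing values in each column.
missing_per_column = X_toy.isnull().sum()
print(missing_per_column)

# Fill each gap with the mean of its own column (mean() skips NaN by default).
X_filled = X_toy.fillna(X_toy.mean())
print(X_filled)
```

Filling per column keeps each feature on its own scale, whereas a single mean computed over the whole matrix would mix unrelated features.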

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [51]:
from sklearn.model_selection import train_test_split

X_train, X_small, y_train, y_small = train_test_split(X, y, test_size=0.05, random_state=42)

print("X_small - Shape:", X_small.shape)
print("y_small - Shape:", y_small.shape)


X_small - Shape: (230, 57)
y_small - Shape: (230,)
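A 5% subset drawn at random may not have the same share of spam as the full data. The sketch below shows one possible variation, assuming random stand-in data (`X_demo`, `y_demo` are hypothetical): passing `stratify` to `train_test_split` keeps the class balance of the subset close to that of the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 4600 rows, 57 features, roughly 40% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4600, 57))
y_demo = (rng.random(4600) < 0.4).astype(int)

# The "test" side of the split is the 5% subset; the other 95% is discarded here.
_, X_small_demo, _, y_small_demo = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=42, stratify=y_demo
)

print(X_small_demo.shape)
# Share of positives in the full data and in the subset.
print(y_demo.mean(), y_small_demo.mean())
```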


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3.

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_spam

# Reload the spam data and convert X to a NumPy array so columns can be sliced by position
X, y = load_spam()
X = np.array(X)

# The three datasets to compare (X_small and y_small come from Step 2)
datasets = [
    ("Full Dataset", X, y),
    ("Partial Dataset", X[:, :2], y),
    ("Small Dataset", X_small, y_small)
]

rows = []
for dataset_name, X_data, y_data in datasets:
    # Hold out 20% of each dataset for validation
    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=2000)
    model.fit(X_train, y_train)

    # Accuracy on the data the model was fitted on
    accuracy_train = accuracy_score(y_train, model.predict(X_train))

    # Accuracy on the held-out data
    accuracy_val = accuracy_score(y_val, model.predict(X_val))

    rows.append([dataset_name, accuracy_train, accuracy_val])

results = pd.DataFrame(rows, columns=["Data size", "Training accuracy", "Validation accuracy"])
print(results)


         Data size  Training accuracy  Validation accuracy
0     Full Dataset           0.927174             0.938043
1  Partial Dataset           0.614946             0.593478
2    Small Dataset           0.940217             0.847826
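The `Data size` column above holds a label rather than a size. A possible variation, sketched here on synthetic data from `make_classification` (the helper `evaluate` and the `_demo` names are hypothetical), records the actual shape of each dataset next to its accuracies.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the spam data (synthetic, not the real dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

def evaluate(name, X_data, y_data):
    """Fit on an 80/20 split and return one row for the results table."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_data, y_data, test_size=0.2, random_state=0
    )
    model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return {
        "Dataset": name,
        "Data size": X_data.shape,  # (rows, columns) actually used
        "Training accuracy": model.score(X_tr, y_tr),
        "Validation accuracy": model.score(X_va, y_va),
    }

results_demo = pd.DataFrame([
    evaluate("Full", X_demo, y_demo),
    evaluate("First two columns", X_demo[:, :2], y_demo),
    evaluate("5% subset", X_demo[:50], y_demo[:50]),
])
print(results_demo)
```

Recording the shape makes it visible that the second test reduces the number of columns while the third reduces the number of rows.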


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.

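Training Accuracy - As you reduce the number of samples, training accuracy tends to increase, because a model can fit a small sample very closely, including its noise (overfitting). Here it rose from 0.927174 on the full dataset to 0.940217 on the 5% subset.

Validation Accuracy - As you reduce the number of samples, validation accuracy tends to decrease, because a model fitted on less training data generalizes less well to new data. Here it fell from 0.938043 on the full dataset to 0.847826 on the 5% subset. The validation set is also much smaller (46 of the 230 samples), so the estimate itself is less reliable.

The values for the three tests are:

Original Dataset:
Training Accuracy: 0.927174
Validation Accuracy: 0.938043

First Two Columns of the Dataset:
Training Accuracy: 0.614946
Validation Accuracy: 0.593478

Smaller Dataset:
Training Accuracy: 0.940217
Validation Accuracy: 0.847826

Using only the first two columns reduces the number of features rather than the number of samples. Both accuracies drop to about 0.6, which suggests that two features do not carry enough information to separate spam from legitimate email (underfitting).

As you can see, when fewer samples are used the training accuracy increases slightly whereas the validation accuracy decreases, and the gap between them grows from about -0.01 to about 0.09.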
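One way to examine the effect of training-set size more systematically is a learning curve. The sketch below is only an illustration on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins), using scikit-learn's `learning_curve` to report cross-validated training and validation accuracy at three training sizes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Hypothetical stand-in for the spam data (synthetic, not the real dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

# Fit on 5%, 25% and 100% of the available training folds, with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X_demo,
    y_demo,
    train_sizes=[0.05, 0.25, 1.0],
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Average the scores over the five folds for each training size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_train={int(n):5d}  train acc={tr:.3f}  validation acc={va:.3f}")
```

Averaging over several folds reduces the influence of a single lucky or unlucky split, which matters most for the smallest training size.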

2. In this case, what do a false positive and a false negative represent? Which one is worse?

False Positive - In this case, a legitimate email is classified as spam and may end up in the spam folder.

False Negative - In this case, a spam email is classified as legitimate and may end up in the inbox.

In this case, false positives are worse because an important email may be overlooked after being classified as spam. This is more severe than the alternative, a spam email being classified as legitimate. Although the latter could be annoying, it is likely to have less severe consequences.
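The two error types can be counted with a confusion matrix. This is a minimal sketch on made-up labels, assuming spam is encoded as 1 and legitimate email as 0 (the lists `y_true` and `y_pred` are hypothetical, not model output from this notebook).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = spam, 0 = legitimate email.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# With labels=[0, 1], the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"False positives (legitimate email sent to spam): {fp}")
print(f"False negatives (spam delivered to the inbox):   {fn}")
```

Accuracy alone treats both errors the same, so counting them separately shows which kind of mistake the model makes more often.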




### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?

From the examples and lectures provided in class, as well as online tools and libraries.

2. In what order did you complete the steps?

In the order that they were listed.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

I used generative AI to better understand the questions and expectations for the solutions. The prompts were more or less the questions. I also used generative AI when I was experiencing errors, especially for steps 3-5. It helped me figure out what was going wrong.

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I did have challenges remembering how to perform certain functions. I had to dig through some of the old examples or search online.



## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [61]:
from yellowbrick.datasets import load_concrete

X, y = load_concrete()

print("Size and type of X:")
print(X.shape, type(X))

print("Size and type of y:")
print(y.shape, type(y))

Size and type of X:
(1030, 8) <class 'pandas.core.frame.DataFrame'>
Size and type of y:
(1030,) <class 'pandas.core.series.Series'>


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [63]:
# Count missing values in the features and in the target
missing_values_X = X.isnull().sum().sum()
missing_values_y = y.isnull().sum()

if missing_values_X == 0 and missing_values_y == 0:
    print("No missing values in the dataset.")
else:
    print(f"Missing values in X: {missing_values_X}")
    print(f"Missing values in y: {missing_values_y}")
    # Fill gaps in the features with the mean of the corresponding column
    X = X.fillna(X.mean())
    # Confirm that nothing is missing after filling
    if X.isnull().sum().sum() == 0 and y.isnull().sum() == 0:
        print("Missing values have been handled.")


No missing values in the dataset.


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [65]:
# Import necessary libraries
from sklearn.linear_model import LinearRegression

# Instantiate the Linear Regression model
model = LinearRegression()

# Fit the model with the data
model.fit(X, y)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [74]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the training and validation data
y_train_pred = model.predict(X_train)
y_valid_pred = model.predict(X_valid)

# Calculate mean squared error for training and validation
mse_train = mean_squared_error(y_train, y_train_pred)
mse_valid = mean_squared_error(y_valid, y_valid_pred)

# Calculate R2 score for training and validation
r2_train = r2_score(y_train, y_train_pred)
r2_valid = r2_score(y_valid, y_valid_pred)

print(f"Training Mean Squared Error: {mse_train}")
print(f"Validation Mean Squared Error: {mse_valid}")
print(f"Training R2 Score: {r2_train}")
print(f"Validation R2 Score: {r2_valid}")


Training Mean Squared Error: 110.66177124473455
Validation Mean Squared Error: 95.97548435336684
Training R2 Score: 0.6104593527939581
Validation R2 Score: 0.6275416055429188
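To make the two metrics concrete, the sketch below computes MSE and R2 by hand on a few made-up numbers and compares them with the scikit-learn functions used above. The arrays `y_true` and `y_pred` are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, not taken from the concrete dataset.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# MSE: the average squared residual.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Note that MSE is an error (lower is better) while R2 is a score (closer to 1 is better), so neither is an accuracy in the classification sense.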


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [75]:
import pandas as pd

# Calculate mean squared error for training and validation
mse_train = mean_squared_error(y_train, y_train_pred)
mse_valid = mean_squared_error(y_valid, y_valid_pred)

# Calculate R2 score for training and validation
r2_train = r2_score(y_train, y_train_pred)
r2_valid = r2_score(y_valid, y_valid_pred)

# Create a DataFrame for the results
results = pd.DataFrame(columns=['Training accuracy', 'Validation accuracy'], index=['MSE', 'R2 score'])
results.at['MSE', 'Training accuracy'] = mse_train
results.at['MSE', 'Validation accuracy'] = mse_valid
results.at['R2 score', 'Training accuracy'] = r2_train
results.at['R2 score', 'Validation accuracy'] = r2_valid

# Print the results DataFrame
print(results)


         Training accuracy Validation accuracy
MSE             110.661771           95.975484
R2 score          0.610459            0.627542


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

I believe using a linear model produced decent, but not strong, results. The MSE values are fairly high (110.661771 for training and 95.975484 for validation), which indicates that a considerable part of the variation in compressive strength is not captured by the linear model. However, the validation MSE is slightly lower than the training MSE, which suggests that the model is not overfitting and generalizes reasonably well to unseen data.

The R2 scores of 0.610459 (training) and 0.627542 (validation) are not close to 1, but they indicate that the linear model explains roughly 61-63% of the variance in the data.

Therefore, the results indicate a moderate fit to the data. If higher accuracy is desired, a model that can capture non-linear relationships may be better suited.
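An MSE is hard to judge on its own because it is in squared units. Taking the square root gives the RMSE, which is in the same units as the target. The sketch below applies this to the two MSE values reported in the results table above.

```python
import math

# MSE values reported in the results table above.
mse_train = 110.66177124473455
mse_valid = 95.97548435336684

# RMSE is the square root of MSE, in the same units as compressive strength.
rmse_train = math.sqrt(mse_train)
rmse_valid = math.sqrt(mse_valid)

print(f"Training RMSE:   {rmse_train:.2f}")  # about 10.52
print(f"Validation RMSE: {rmse_valid:.2f}")  # about 9.80
```

A typical prediction error of roughly 10 strength units can then be compared with the spread of `y` to decide whether the fit is good enough for the intended use.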

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?

From the examples and lectures provided in class, as well as online tools and libraries.

2. In what order did you complete the steps?

In the order that they were listed.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

I did not use generative AI for this section. I used the fundamentals I learnt in part 1.

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I didn't have as many challenges as I did in part 1. I think I found good resources in part 1 that allowed me to be more successful in part 2 as I had them readily available. These include online resources and the lecture notes + examples.



## Part 3: Observations/Interpretation (3 marks)


The linear regression model, with a training mean squared error of 110.66 and a validation mean squared error of 95.98, captures a reasonable share of the patterns in the data.

It generalizes well: the validation MSE is slightly lower than the training MSE, and the training R2 score of 0.610 is close to the validation R2 score of 0.628. This reflects the trade-off between simplicity and accuracy that we discussed in class. The linear regression model is simple and does not overfit, but it provides only a fair fit for this dataset.



## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,

I liked that we could relate the results to the real-life scenario of the dataset. It makes it more engaging to complete assignments when you can see the direct effect of the results.


- what you found interesting, confusing, challenging, or motivating while working on this assignment.

I found it a little challenging to figure out where the errors were occurring when the code wasn't producing results that matched the patterns we learnt in class. It took quite a bit of research to figure out.



## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [77]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

alphas = [0.001, 0.01, 0.1, 1, 10, 100]

ridge_model = Ridge()
lasso_model = Lasso()

param_grid = {'alpha': alphas}

ridge_grid = GridSearchCV(ridge_model, param_grid, cv=5, scoring='r2')
ridge_grid.fit(X, y)

lasso_grid = GridSearchCV(lasso_model, param_grid, cv=5, scoring='r2')
lasso_grid.fit(X, y)

best_alpha_ridge = ridge_grid.best_params_['alpha']
best_r2_score_ridge = ridge_grid.best_score_

best_alpha_lasso = lasso_grid.best_params_['alpha']
best_r2_score_lasso = lasso_grid.best_score_

if best_r2_score_ridge > best_r2_score_lasso:
    best_method = "Ridge"
    best_alpha = best_alpha_ridge
    best_r2_score = best_r2_score_ridge
else:
    best_method = "Lasso"
    best_alpha = best_alpha_lasso
    best_r2_score = best_r2_score_lasso

print(f"Best method: {best_method}")
print(f"Best alpha: {best_alpha}")
print(f"Best R2 score: {best_r2_score}")


Best method: Lasso
Best alpha: 10
Best R2 score: 0.47935157993290034


*ANSWER HERE*