# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: 

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [92]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [145]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets.loaders import load_spam

# TO DO: Print size and type of X and y
X, y = load_spam()
print(f"Size of X: {X.shape}, Type of X: {type(X)}")
print(f"Size of y: {y.shape}, Type of y: {type(y)}")

X.head()
y.head()

Size of X: (4600, 57), Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (4600,), Type of y: <class 'pandas.core.series.Series'>


0    1
1    1
2    1
3    1
4    1
Name: is_spam, dtype: int64

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [153]:
# TO DO: Check if there are any missing values and fill them in if necessary

#check if any missing values in X
X_nulls = X.isnull().sum().sort_values(ascending=False)
print(X_nulls)

# Fill missing values in each numeric column of X with that column's mean
for column in X.columns:
    if X[column].dtype in ('int64', 'float64'):
        X[column] = X[column].fillna(X[column].mean())

# Check if there are any missing values in y
y_nulls = y.isnull().sum()
print(y_nulls)

# Fill missing values in y with 0
if y_nulls > 0:
    y = y.fillna(0)

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
0
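
As a minimal sketch of what mean imputation does, the plain-Python function below replaces each missing entry with the mean of the observed values in the same column. The function name `fill_missing_with_mean` and the toy data are hypothetical, and `None` stands in for the `NaN` values that pandas would report; this is an illustration of the idea, not the pandas implementation.

```python
from statistics import mean


def fill_missing_with_mean(columns):
    """Return a copy of `columns` with each None replaced by that column's mean.

    `columns` maps a column name to a list of numbers, where None marks a
    missing value (a stand-in for NaN in a pandas DataFrame).
    """
    filled = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        # A column with no observed values has no mean, so copy it unchanged.
        if not observed:
            filled[name] = list(values)
            continue
        column_mean = mean(observed)
        filled[name] = [column_mean if v is None else v for v in values]
    return filled


# Hypothetical toy data, not taken from the spam or concrete datasets.
toy = {"a": [1.0, None, 3.0], "b": [2, 4, 6]}
filled_toy = fill_missing_with_mean(toy)
print(filled_toy)
```

Each column is filled from its own mean, so a missing value in one feature is never replaced by a statistic taken from a different feature.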


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [110]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split
X_train, X_small, y_train, y_small = train_test_split(X, y, test_size=0.05, random_state=0)

X_small
y_small

991     1
2824    0
1906    0
1471    1
1813    0
       ..
1012    1
1855    0
2717    0
1300    1
1082    1
Name: is_spam, Length: 230, dtype: int64
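
To show why a 5% hold-out of 4600 rows gives the 230 rows seen above, here is a rough plain-Python sketch of a shuffled split. The helper `hold_out_indices` is hypothetical and uses simple rounding for the subset size, which is an assumption; it will not reproduce the exact rows that `train_test_split` selects.

```python
import random


def hold_out_indices(n_rows, fraction, seed=0):
    """Return (kept, held_out) row indices, with about `fraction` held out."""
    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Assumption: the subset size is obtained by simple rounding.
    n_held_out = round(n_rows * fraction)
    return indices[n_held_out:], indices[:n_held_out]


kept, small = hold_out_indices(4600, 0.05)
print(len(kept), len(small))
```

The two index lists do not overlap, so the small subset is a random sample of the rows and not just the first 230 of them.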

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [144]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

#Pandas DataFrame results 
results = pd.DataFrame({"Data size": [], "Training Accuracy (All Data)": [], "Training Accuracy (Cross-Validation)": [], "Validation Accuracy (Cross-Validation)": []})

# Define the datasets and corresponding labels
datasets = [(X, y, "X and y"), (X.iloc[:, :2], y, "First two columns"), (X_small, y_small, "X_small and y_small")]

# Initialize and evaluate models for each dataset, i.e. the regular X and y, first two columns, and X_small/y_small
for data, target, data_label in datasets:
    
    # Instantiate model 
    model = LogisticRegression(max_iter=2000)
    
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=0)
    
    model.fit(X_train, y_train)
    training_accuracy_all_data = model.score(X_train, y_train)
    
    # calculate training and validation accuracy 
    scores = cross_validate(model, X_train, y_train, cv=5, scoring="accuracy", return_train_score=True)
    training_accuracy_cv = scores["train_score"].mean()
    validation_accuracy_cv = scores["test_score"].mean()
    
    data_shape_str = str(data.shape)
    
    # Add the results to the DataFrame
    results.loc[len(results)] = [data_shape_str, training_accuracy_all_data, training_accuracy_cv, validation_accuracy_cv]


# Print the results DataFrame
print(results)

    Data size  Training Accuracy (All Data)  \
0  (4600, 57)                      0.927174   
1   (4600, 2)                      0.614946   
2   (230, 57)                      0.956522   

   Training Accuracy (Cross-Validation)  \
0                              0.927785   
1                              0.615693   
2                              0.956527   

   Validation Accuracy (Cross-Validation)  
0                                0.922283  
1                                0.614402  
2                                0.902102  
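
The cross-validation scores above come from splitting the training data into 5 folds. As a sketch of the idea, the plain-Python code below generates unshuffled k-fold index splits and computes accuracy as the fraction of correct predictions. The names `kfold_indices` and `accuracy` are hypothetical, and for classifiers scikit-learn uses a stratified variant of k-fold, so this is an approximation of what `cross_validate` does.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# 3680 is 80% of the 4600 rows, i.e. the training split used for the full dataset.
fold_sizes = [(len(train), len(val)) for train, val in kfold_indices(3680, 5)]
print(fold_sizes)
```

Every row is used for validation exactly once, so the mean of the five validation scores uses all of the training data without ever scoring a row that the model was fitted on in that fold.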


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
Validation accuracy improves as more data is used. With the full dataset (4600 rows, 57 columns) the validation accuracy is the highest at 0.922, compared with 0.902 for `X_small` (230 rows, 57 columns) and 0.614 for the first two columns only (4600 rows, 2 columns), so the model trained on the full dataset generalizes best to unseen data. The two-column case has the same number of rows as the full dataset but far fewer features, and both its training accuracy (about 0.615) and validation accuracy (0.614) are low, which suggests the model is underfitting. One interesting result is that the training accuracy on the 5% dataset (0.957) is higher than on the full dataset (0.927). This can be attributed to the fact that a model can fit a small number of samples more closely; the gap between training and validation accuracy is larger for `X_small` (0.957 vs. 0.902) than for the full dataset (0.927 vs. 0.922), which suggests some overfitting.

2. In this case, what do a false positive and a false negative represent? Which one is worse?
Since 1 in the target column `y` represents spam, a false positive is a legitimate email that is classified as spam. A false negative is a spam email that is classified as a normal email, i.e. the spam is missed and goes through to the inbox. Both can have consequences. A false positive can cause an important email, or an important deadline, to be missed. A false negative lets spam through to the inbox, and if that email leads to a security leak or a cyberattack the effects may be significant. In my opinion, the false negative is worse for this reason.

int: {1 for spam, 0 for not spam}
*YOUR ANSWERS HERE*
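
To make the two error types concrete, the sketch below counts them on a handful of made-up labels, using the dataset's convention that 1 means spam. The function `confusion_counts` and the label lists are hypothetical and are not results from the model above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted spam: correct if it really is spam, otherwise a false positive.
            counts["tp" if truth == positive else "fp"] += 1
        else:
            # Predicted not spam: a false negative if it really was spam.
            counts["fn" if truth == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 = spam, 0 = not spam.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```

Here the one false positive is a normal email sent to the spam folder, and the one false negative is a spam email that reached the inbox.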

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
I sourced my code with reference to the ENSF 611 class notes available on D2L: https://colab.research.google.com/drive/1VGEj1carkVz06Deza4m7JoK_SW9LkseE?usp=sharing
These lab activity notes helped me understand the process to follow for model initialization, validation and visualization. I also used the code available from D2L: https://d2l.ucalgary.ca/d2l/le/content/543310/viewContent/6091032/View
Some of the accuracy code was derived from this worksheet in order to find the training and validation accuracy. It also contained information on using LogisticRegression that I applied to this code.

2. In what order did you complete the steps?
I completed the steps for this question in the order provided. I first created one model, using X and y, and obtained its accuracy results without a for loop. Then I created the for loop to fit the other two models and obtain their respective accuracies.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
I did not use generative AI to write or modify the code. I did, however, use it to try to understand the main differences between the training accuracy using cross-validation and using all data. (https://chat.openai.com/)

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
I had some challenges with the DataFrame object and with passing the accuracy values into it from the for loop, but was able to figure it out. There were also some initial challenges where I could not understand how to use train_test_split to obtain X_small and y_small, but I was able to test it by printing the output and checking the difference in size.

*DESCRIBE YOUR PROCESS HERE*

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [205]:
# TO DO: Import concrete dataset from yellowbrick library

from yellowbrick.datasets.loaders import load_concrete

# TO DO: Print size and type of X and y
X, y = load_concrete()
print(f"Size of X: {X.shape}, Type of X: {type(X)}")
print(f"Size of y: {y.shape}, Type of y: {type(y)}")

X.head()

Size of X: (1030, 8), Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (1030,), Type of y: <class 'pandas.core.series.Series'>


Unnamed: 0,cement,slag,ash,water,splast,coarse,fine,age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [206]:
## TO DO: Check if there are any missing values and fill them in if necessary

#check if any missing values in X
X_nulls = X.isnull().sum().sort_values(ascending=False)
print(X_nulls)

# Fill missing values in each numeric column of X with that column's mean
for column in X.columns:
    if X[column].dtype in ('int64', 'float64'):
        X[column] = X[column].fillna(X[column].mean())

# Check if there are any missing values in y
y_nulls = y.isnull().sum()
print(y_nulls)

# Fill missing values in y with 0
if y_nulls > 0:
    y = y.fillna(0)
    
    


cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [207]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear_model.fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [214]:
# TO DO: ADD YOUR CODE HERE

from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Make predictions on test data
test_predictions = linear_model.predict(X_test)

# Make predictions on training data
train_predictions = linear_model.predict(X_train)

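# Evaluate the model on training and test data.
# Note: np.sqrt turns the mean squared error into the root mean squared error (RMSE),
# so train_mse and test_mse hold RMSE values, in the same units as y.
test_mse = np.sqrt(mean_squared_error(y_test, test_predictions))
train_mse = np.sqrt(mean_squared_error(y_train, train_predictions))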

# Calculate R2 score for training and test data 
train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)
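
As a worked sketch of the metrics used in this step, the plain-Python code below computes MSE, RMSE and the R2 score from their definitions on a small made-up example. The helper names `mse` and `r2` and the toy values are hypothetical; they are meant to show the formulas, not to replace `mean_squared_error` and `r2_score`.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R2 score: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, not taken from the concrete dataset.
y_true_toy = [10.0, 20.0, 30.0, 40.0]
y_pred_toy = [12.0, 18.0, 33.0, 37.0]

toy_mse = mse(y_true_toy, y_pred_toy)   # squared units of y
toy_rmse = math.sqrt(toy_mse)           # same units as y
toy_r2 = r2(y_true_toy, y_pred_toy)     # unitless, 1.0 is a perfect fit
print(toy_mse, toy_rmse, toy_r2)
```

MSE and RMSE are on different scales, so a value should only be compared with another value of the same kind.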


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [215]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame(
    {
        "Training accuracy": [train_mse, train_r2],
        "Validation accuracy": [test_mse, test_r2],
    },
    # train_mse and test_mse were computed with np.sqrt, so they are RMSE values
    index=["RMSE", "R2 score"]
)

# Print the results
print(results)

          Training accuracy  Validation accuracy
RMSE              10.504547             9.779332
R2 score           0.609071             0.636898


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

The error values are relatively high. The code applies `np.sqrt` to the mean squared error, so the reported 10.50 (training) and 9.78 (validation) are root mean squared errors, in the same units as the target. This suggests that the predictions of the linear model are not very accurate.

The R2 score of 0.636898 for the validation set indicates that the model explains approximately 63.69% of the variance in the validation data, and the training R2 score of 0.609071 indicates approximately 60.91% for the training data. Compared to a perfect 100%, around 36% of the validation variance is still not accounted for, suggesting that the linear model is not very accurate for this dataset.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. Where did you source your code?
I sourced some of my code from the lab 2 example provided below, as well as sites linked:  

https://d2l.ucalgary.ca/d2l/le/content/543310/viewContent/6084981/View

https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

2. In what order did you complete the steps?

I completed these steps by first reusing similar code from the previous part. Then, to validate the model, I looked up the link above and imported the error metrics, then predicted y for the test and training data to calculate the errors. Once completed, the errors were visualized using a DataFrame.

3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
I did not use generative AI for this portion because the code and understanding it was covered in the lecture notes, as well as sources provided in D2L. 

4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?
I didn't run into many challenges in this part, as there was a lot of available content to help me understand and determine the results. I think what helped me to be successful was following the previous examples and understanding which datasets to compare and which results we needed.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


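Both the training and validation error values are relatively close (10.50 for training vs. 9.78 for validation), which suggests that the model is not overfitting, since the training error is not significantly lower than the validation error. Another pattern is that the validation error and R2 score (0.637) are slightly better than the training values (R2 score of 0.609). This is not because the validation set is larger: with `test_size=0.2` it holds only 20% of the data, while the training set holds 80%. The difference is small and is more likely due to the particular random split. With an R2 score of only about 0.61 to 0.64 on both sets, the pattern is more consistent with underfitting, i.e. a linear model that is too simple for this dataset.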

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

*ADD YOUR THOUGHTS HERE*

I liked how we were applying the code and knowledge that we learnt in class for this assignment, and that it was explained in detail. We had enough resources to understand what the assignment was asking for and to apply the knowledge learnt from class. Something that I found a little confusing was that the directions were not clear for some of the parts, such as training and validation accuracy. There are multiple accuracies we could use for LogisticRegression, so this was confusing to understand. Something motivating was using the machine learning models to understand the results and interpreting them. It was cool seeing how coding could be used to interpret results in a way that can apply to real world situations.



## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
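
As a sketch of what "along the logarithmic scale" means, the plain-Python helper below spaces the exponents evenly between -3 and 2 and raises 10 to each of them, which is the same idea as `np.logspace(-3, 2, num=...)`. The function name `log_spaced` and the choice of 6 points are assumptions made for illustration.

```python
def log_spaced(low_exp, high_exp, num):
    """Return `num` values from 10**low_exp to 10**high_exp, evenly spaced in the exponent."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (high_exp - low_exp) / (num - 1)
    return [10 ** (low_exp + i * step) for i in range(num)]


# Six points give one value per power of ten between 0.001 and 100.
alphas_sketch = log_spaced(-3, 2, 6)
print(alphas_sketch)
```

On this scale each value is a constant multiple of the previous one, so small and large alphas are covered equally well, which a linear grid from 0.001 to 100 would not do.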

In [204]:
# TO DO: ADD YOUR CODE HERE

from sklearn.linear_model import Ridge, Lasso

alphas = np.logspace(-3, 2, num=100)  # Values from 0.001 to 100

# RIDGE
ridge_results = []

for alpha in alphas:
    # Fit a ridge model for this alpha on the training data
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)

    # Evaluate on the test data
    test_predictions_ridge = ridge_model.predict(X_test)
    test_mse_ridge = mean_squared_error(y_test, test_predictions_ridge)
    test_r2_ridge = r2_score(y_test, test_predictions_ridge)

    ridge_results.append((alpha, test_mse_ridge, test_r2_ridge))

# Pick the alpha with the lowest test MSE (on the same test set this is also the highest R2)
best_alpha_ridge, best_mse_ridge, best_r2_ridge = min(ridge_results, key=lambda x: x[1])

print("Ridge Regression:")
print(f"Best alpha: {best_alpha_ridge}")
print(f"Test MSE: {best_mse_ridge}")
print(f"Test R2 Score: {best_r2_ridge}")
print()

# LASSO
lasso_results = []

for alpha in alphas:
    # Fit a lasso model for this alpha on the training data
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)

    # Evaluate on the test data
    test_predictions_lasso = lasso_model.predict(X_test)
    test_mse_lasso = mean_squared_error(y_test, test_predictions_lasso)
    test_r2_lasso = r2_score(y_test, test_predictions_lasso)

    lasso_results.append((alpha, test_mse_lasso, test_r2_lasso))

# Pick the alpha with the lowest test MSE (on the same test set this is also the highest R2)
best_alpha_lasso, best_mse_lasso, best_r2_lasso = min(lasso_results, key=lambda x: x[1])

print("Lasso Regression:")
print(f"Best alpha: {best_alpha_lasso}")
print(f"Test MSE: {best_mse_lasso}")
print(f"Test R2 Score: {best_r2_lasso}")

Ridge Regression:
Best alpha: 100.0
Test MSE: 95.62517337012183
Test R2 Score: 0.6369366906855762

Lasso Regression:
Best alpha: 9.770099572992246
Test MSE: 95.11511718456465
Test R2 Score: 0.638873238146232


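*ANSWER HERE*

The method which gave the best R2 score was Lasso regression, with a score of 0.63887 at an alpha of about 9.77. Ridge regression gave 0.63694 at an alpha of 100, the upper end of the tested range. The test MSE of 95.12 looks almost ten times higher than the value reported for the linear model in Part 2, but that value (9.78) was computed with `np.sqrt` and is therefore a root mean squared error; the square root of 95.12 is about 9.75, so the errors are nearly the same. There was no significant difference in the R2 score either (0.6389 vs. 0.6369 for the linear model), so this R2 score is not good enough: the model still explains only about 63.9% of the variance in the observed data.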