# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Mustayeen Abedin

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [231]:
import numpy as np
import pandas as pd
from yellowbrick import datasets

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [232]:
# TO DO: Import spam dataset from yellowbrick library
X, y = ds.loaders.load_spam()
# TO DO: Print size and type of X and y
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))

(4600, 57)
(4600,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [233]:
# TO DO: Check if there are any missing values and fill them in if necessary
if X.isnull().sum().sum() > 0:
    X.fillna(X.mean(), inplace=True)
X.shape

(4600, 57)

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [234]:
# TO DO: Create X_small and y_small 

from sklearn.model_selection import train_test_split

X_small, _, y_small, _ = train_test_split(X, y, test_size=0.95, random_state=42)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [235]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 3. Implement model

model = LogisticRegression(max_iter=2000)

datasets = [
    ("X and y", X,y),
    ("First 2 columns", X.iloc[:, :2], y),
    ("Small", X_small, y_small)
]

# 4. Validate model
results = pd.DataFrame(columns=["Data Size", "Training Acc", "Validation Acc"])

for name, X_set, y_set in datasets:
    X_train, X_test, y_train, y_test = train_test_split(X_set, y_set, test_size=0.2, random_state=0)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_test, model.predict(X_test))
    new_row = {"Data Size":name, "Training Acc": train_acc, "Validation Acc": val_acc}
    results.loc[len(results)] = new_row

# Print results
print(results)

         Data Size  Training Acc  Validation Acc
0          X and y      0.927446        0.935870
1  First 2 columns      0.614946        0.593478
2            Small      0.961957        0.891304


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1) The best performing model with a validation accuracy of 93.5% was when we which used the largest amount of data. The worst performing dataset was when we only used two features of X with a validation accuracy of 58%, suggesting that using more features of a dataset is recommended. Surprisingly the small test set performed relatively well considering that it only contained roughly 5% of the first data set. The small dataset had the highest training accuracy of 96% but a validation accuracy of 89%, indicating that when you have a smaller dataset, the model will tend to overfit. 

2) A false positive represents an email that is marked as spam but is actually a legitimate email. A false negative represents a email that is marked as a legitimate email but is actually spam. In this case, I would consider a false positive to be worse, because you could have a important email in the spam folder.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. 
1. I followed the steps in order. The pipeline of the steps in this notebook closely follow the recommended ML pipeline so it makes sense to do it in order. 
1. I used ChatGPT the following : "how do i write a for loop that inputs into a Dataframe  with three columns and adds the row to the dataframe before printing it". I did have to modify the code that it suggested because it used a outdated method of pd.Dataframe.append. I had to go on stackoverflow and found out that df.loc[last_index] = data is a better method.
1. Besides when the generative AI told me to use an outdated function that resulted in an AttributeError, I did not have any struggles with this part. I used the class notebooks, particularly the Linear-Classification notebook to help me with this problem.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [236]:
# TO DO: Import spam dataset from yellowbrick library
X, y = ds.loaders.load_concrete()
# TO DO: Print size and type of X and y
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))
print(max(y))
print(min(y))

(1030, 8)
(1030,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
82.5992248
2.331807832


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [237]:
# TO DO: Check if there are any missing values and fill them in if necessary
if X.isnull().sum().sum() > 0:
    X.fillna(X.mean(), inplace=True)

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [238]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()

X_train, X_test, y_train, y_test = train_test_split(X_set, y_set, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [239]:
# TO DO: ADD YOUR CODE HERE

train_mse = mean_squared_error(y_train, model.predict(X_train), squared=False)
val_mse = mean_squared_error(y_test, model.predict(X_test), squared=False)

train_r2 = r2_score(y_train, model.predict(X_train))
val_r2 = r2_score(y_test, model.predict(X_test))

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [240]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame({"Training Acc": [train_mse, train_r2], 
                        "Validation Acc": [val_mse, val_r2]}, index=["MSE", "R2_Score"])
print(results)

          Training Acc  Validation Acc
MSE           0.249222        0.295232
R2_Score      0.742771        0.570082


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

I think it did an okay job with the dataset. However, it greatly overfitted based on the training accuracy of 74% compared to a validation accuracy of 57%. R2 is a indicator of variance so that means about 57% of the test values fall on the regression line. The root mean squared error of this model on the test set is 0.3. This means that there is a average distance of 0.3 away from the regression line. This is quite good because the range of the target feature is from 2.3 - 82.5.  Because there is no way to control the model complexity, this is about the best the model will score. 

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

I used ChatGPT to help me understand what mean squared error and r2 score indicate in this context. I gave it the following prompt: "what does mean squared error indicate vs r2 score". I referred to the Linear Regression notebook from the class examples for the machine learning pipeline. The only challenge I had in this problem was understanding the difference between the two most common metrics for Linear Regression. 

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*
In both the problems, the training score was hire than the validation score. This means that we are usually overfitting the model based on the training set.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked the assignment because it helped me better understand the machine learning pipeline.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logorithmic scale.

In [241]:
# TO DO: ADD YOUR CODE HERE

from sklearn.linear_model import Ridge

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
train_r2 = []
val_r2 = []

for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    train_r2.append(r2_score(y_train, model.predict(X_train)))
    val_r2.append(r2_score(y_test, model.predict(X_test)))
    
    

results = pd.DataFrame({"Alpha": alphas, "Training Acc": train_r2, "Validation Acc": val_r2})

print("Ridge")
print(results)

Ridge
     Alpha  Training Acc  Validation Acc
0    0.001      0.742771        0.570593
1    0.010      0.742763        0.574965
2    0.100      0.742236        0.604561
3    1.000      0.733867        0.671713
4   10.000      0.694794        0.649813
5  100.000      0.561963        0.543863


In [242]:
from sklearn.linear_model import Lasso

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
train_r2 = []
val_r2 = []

for alpha in alphas:
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    train_r2.append(r2_score(y_train, model.predict(X_train)))
    val_r2.append(r2_score(y_test, model.predict(X_test)))
    

results = pd.DataFrame({"Alpha": alphas, "Training Acc": train_r2, "Validation Acc": val_r2})

print("Lasso")
print(results)

Lasso
     Alpha  Training Acc  Validation Acc
0    0.001      0.733754        0.678817
1    0.010      0.654619        0.637438
2    0.100      0.333539        0.352487
3    1.000      0.167320        0.170313
4   10.000      0.144324        0.153257
5  100.000      0.000000       -0.077069


*ANSWER HERE*

The best alpha for Ridge Regression for this dataset is 1 and the best for Lasso Regression is 0.001. Using the alphas for the models, respectively, we get a test accuracy of 67%. Both the best models for Ridge and Lasso scored a better r2 score than the standard Linear Regression model. I believe that the score is good enough for this dataset because it is saying that almost 7/10s of the test value are landing on the regression line from this model. To fully know whether a linear model is suitable, we should first test what the accuracy is against a baseline line (say a horizontal line) and we should also test the roo mean squared error of each model to know how far each value is from our predicted lines. 