# Assignment 3: Non-Linear Models and Validation Metrics (37 total marks)

Name: Christian Valdez 

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# dataset
import yellowbrick
from yellowbrick.datasets import load_spam, load_concrete

In [3]:
# sklearn
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import make_scorer, mean_squared_error

## Part 1: Regression (14.5 marks)

### Step 1: Data Input (0.5 marks)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

In [4]:
# import concrete dataset
X, y = load_concrete()  # already imported above; defaults return (X, y)

In [5]:
# print size of X
rows, cols = X.shape
data_size = X.size
print(f"There are {rows} samples and {cols} features.\nThe size of the feature matrix is {data_size}.")

There are 1030 samples and 8 features.
The size of the feature matrix is 8240.


In [6]:
# print size of y
rows, = y.shape
print(f"The feature matrix comes with the corresponding {rows} labels.")

The feature matrix comes with the corresponding 1030 labels.


### Step 3: Implement Machine Learning Model

In [7]:
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [8]:
# decision tree, max depth = 5
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X_train, y_train)

In [9]:
# random forest, max depth = 5
rf = RandomForestRegressor(max_depth=5, random_state=0)
rf.fit(X_train, y_train)

In [10]:
# gradient boosting, max depth = 5
gbm = GradientBoostingRegressor(max_depth=5, random_state=0)
gbm.fit(X_train, y_train)

### Step 4: Validate Model

Calculate the average training and validation accuracy using mean squared error with cross-validation. To do this, set `scoring='neg_mean_squared_error'` in your `cross_validate` call and negate the results (multiply by -1).

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: DT, RF and GB
2. Add the accuracy results to the `results` DataFrame
3. Print `results`
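
One way Steps 4 and 5 could be sketched is shown below. It is an illustration, not the graded solution: the data is a synthetic stand-in from `make_regression` with the same shape as the concrete set (an assumption made so the block runs on its own), and the names `X_demo`, `y_demo`, `X_tr` and `models` are hypothetical. Because the scorer is `neg_mean_squared_error`, the "accuracy" columns hold mean squared errors after negation, so lower is better.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (assumption): 1030 samples x 8 features, like the concrete set.
X_demo, y_demo = make_regression(
    n_samples=1030, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The three tree-based regressors from Step 3, keyed by the index labels.
models = {
    "DT": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RF": RandomForestRegressor(max_depth=5, random_state=0),
    "GB": GradientBoostingRegressor(max_depth=5, random_state=0),
}

results = pd.DataFrame(
    index=list(models),
    columns=["Training accuracy", "Validation accuracy"],
    dtype=float,
)

for name, model in models.items():
    cv = cross_validate(
        model, X_tr, y_tr,
        cv=5,
        scoring="neg_mean_squared_error",
        return_train_score=True,  # needed to get the training scores as well
    )
    # Negate so the table reports positive mean squared errors.
    results.loc[name] = [-cv["train_score"].mean(), -cv["test_score"].mean()]

print(results)
```

Swapping `X_tr, y_tr` for the notebook's `X_train, y_train` should give the table for the concrete data.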

In [11]:
# Set up cross-validation on the training set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    gbm, X_train, y_train,
    cv=kfold,
    # built-in scorer; equivalent to negating mean_squared_error by hand
    scoring="neg_mean_squared_error",
)

In [12]:
# Train the model on the entire training set
gbm.fit(X_train, y_train)

# Evaluate on the test set
y_pred = gbm.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)

In [13]:
# Print out the results
print(f"Mean Cross-Validation Negative Mean Squared Error on Training Set: {scores.mean():.2f}")
print(f"Standard Deviation: {scores.std():.2f}")
print(f"Mean Squared Error on Test Set: {test_mse:.2f}")

Mean Cross-Validation Negative Mean Squared Error on Training Set: -25.10
Standard Deviation: 3.66
Mean Squared Error on Test Set: 19.87


Repeat the step above to print the R2 score instead of the mean squared error. For this case, you can use `scoring='r2'`.

In [14]:
# TO DO: ADD YOUR CODE HERE
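
A minimal sketch of the R2 variant follows, mirroring the cross-validation and test-set cells above. It assumes synthetic stand-in data from `make_regression` so that it runs on its own; every name ending in `_demo`, along with `X_tr`, `X_te`, `y_tr` and `y_te`, is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Stand-in data (assumption): same shape as the concrete set.
X_demo, y_demo = make_regression(
    n_samples=1030, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

gbm_demo = GradientBoostingRegressor(max_depth=5, random_state=0)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# R2 is already "higher is better", so no negation is needed.
r2_scores = cross_val_score(gbm_demo, X_tr, y_tr, cv=kfold_demo, scoring="r2")

# Refit on the full training split and score the held-out test split.
gbm_demo.fit(X_tr, y_tr)
test_r2 = r2_score(y_te, gbm_demo.predict(X_te))

print(f"Mean Cross-Validation R2 on Training Set: {r2_scores.mean():.2f}")
print(f"Standard Deviation: {r2_scores.std():.2f}")
print(f"R2 on Test Set: {test_r2:.2f}")
```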

### Questions (6 marks)
1. How do these results compare to the results using a linear model in the previous assignment? Use values.
1. Out of the models you tested, which model would you select for this dataset and why?
1. If you wanted to increase the accuracy of the tree-based models, what would you do? Provide two suggestions.

*ANSWER HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Part 2: Classification (17.5 marks)

### Step 1: Data Input (2 marks)

The data used for this task can be downloaded from UCI: https://archive.ics.uci.edu/dataset/109/wine

Use the pandas library to load the dataset. You must define the column headers if they are not included in the dataset.

You will need to split the dataset into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the size and type of `X` and `y`.

In [15]:
# import wine dataset
# NOTE: requires the ucimlrepo package (install with `pip install ucimlrepo`)
from ucimlrepo import fetch_ucirepo
  
# fetch dataset 
wine = fetch_ucirepo(id=109) 
  
# data (as pandas dataframes) 
X = wine.data.features 
y = wine.data.targets 
  
# metadata 
print(wine.metadata) 
  
# variable information 
print(wine.variables) 

ModuleNotFoundError: No module named 'ucimlrepo'
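
The cell above stopped because the `ucimlrepo` package is not installed in this environment. Installing it is one fix. As an offline alternative, the sketch below uses `sklearn.datasets.load_wine`, which ships a copy of the same UCI wine data with scikit-learn. This is a swapped-in loader rather than the pandas route the step asks for, and it labels the classes 0, 1, 2. The names `X_wine` and `y_wine` are hypothetical.

```python
from sklearn.datasets import load_wine

# as_frame=True returns a pandas DataFrame (features) and Series (target).
X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)

# Size and type of the feature matrix and the target vector.
print(f"X: shape {X_wine.shape}, size {X_wine.size}, type {type(X_wine)}")
print(f"y: shape {y_wine.shape}, size {y_wine.size}, type {type(y_wine)}")
```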

### Step 2: Data Processing (1.5 marks)

Print the first five rows of the dataset to inspect:

In [None]:
# TO DO: ADD YOUR CODE HERE

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.

In [None]:
# TO DO: ADD YOUR CODE HERE

How many samples do we have of each type of wine?

In [None]:
# TO DO: ADD YOUR CODE HERE
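
The three inspection tasks in this step could be sketched as follows. The block assumes the scikit-learn copy of the wine data (`load_wine`) as a stand-in for the UCI download, so the target column is called `target` here; `wine_df`, `missing_per_column` and `class_counts` are hypothetical names.

```python
from sklearn.datasets import load_wine

# .frame holds the 13 features plus a "target" column in one DataFrame.
wine_df = load_wine(as_frame=True).frame

# First five rows, for a quick visual check.
print(wine_df.head())

# Number of missing values per column.
missing_per_column = wine_df.isnull().sum()
print(missing_per_column)

# Number of samples for each type of wine.
class_counts = wine_df["target"].value_counts().sort_index()
print(class_counts)
```

If `missing_per_column` showed any non-zero entries, a simple option would be `wine_df.fillna(wine_df.median())`, though whether median imputation is appropriate depends on the feature.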

### Step 3: Implement Machine Learning Model

1. Import `SVC` and `DecisionTreeClassifier` from sklearn
2. Instantiate models as `SVC()` and `DecisionTreeClassifier(max_depth = 3)`
3. Implement the machine learning model with `X` and `y`

### Step 4: Validate Model 

Calculate the average training and validation accuracy using `cross_validate` for the two different models listed in Step 3. For this case, use `scoring='accuracy'`.

### Step 5: Visualize Results (4 marks)

#### Step 5.1: Compare Models
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy
2. Add the data size and the training and validation accuracy for each model to the `results` DataFrame
3. Print `results`

In [None]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
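
A possible shape for Steps 3 to 5.1, following the loop hint, is sketched below. It assumes the scikit-learn copy of the wine data as a stand-in and 5-fold cross-validation (the fold count is not specified in the text). The names `classifiers` and `clf_results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)

# The two models named in Step 3.
classifiers = {
    "SVC": SVC(),
    "DT": DecisionTreeClassifier(max_depth=3, random_state=0),
}

rows = []
for name, clf in classifiers.items():
    cv = cross_validate(
        clf, X_wine, y_wine,
        cv=5,                     # assumption: fold count chosen for the sketch
        scoring="accuracy",
        return_train_score=True,  # needed to get the training accuracy
    )
    rows.append({
        "Model": name,
        "Data size": X_wine.size,
        "Training accuracy": cv["train_score"].mean(),
        "Validation accuracy": cv["test_score"].mean(),
    })

clf_results = pd.DataFrame(rows).set_index("Model")
print(clf_results)
```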

#### Step 5.2: Visualize Classification Errors
Which method gave the highest accuracy? Use this method to print the confusion matrix and classification report:

In [None]:
# TO DO: Implement best model

In [None]:
# TO DO: Print confusion matrix using a heatmap

In [None]:
# TO DO: Print classification report
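
The three cells above could be filled in along these lines. The sketch assumes the decision tree is the better model (the questions below suggest as much, but check your own `results` table), a stratified train/test split, and the scikit-learn copy of the wine data. `best_model`, `y_hat`, `cm` and `n_misclassified` are hypothetical names.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, stratify=y_wine, random_state=0
)

# Assumed best model, for illustration only.
best_model = DecisionTreeClassifier(max_depth=3, random_state=0)
best_model.fit(X_tr, y_tr)
y_hat = best_model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()

print(classification_report(y_te, y_hat))

# Off-diagonal entries are the misclassified samples.
n_misclassified = int(cm.sum() - cm.trace())
print(f"Misclassified samples: {n_misclassified}")
```

The off-diagonal count is one way to answer the question about incorrectly classified samples.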

### Questions (6 marks)
1. How do the training and validation accuracy change depending on the method used? Explain with values.
1. What are two reasons why the support vector machine model did not work as well as the tree-based model?
1. How many samples were incorrectly classified in step 5.2? 
1. In this case, is maximizing precision or recall more important? Why?

*YOUR ANSWERS HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

## Part 5: Bonus Question (3 marks)

Repeat Part 2 and compare the support vector machine model used there with `LinearSVC(max_iter=5000)`. Does using `LinearSVC` improve the results? Why or why not?

Is `LinearSVC` a good fit for this dataset? Why or why not?

In [None]:
# TO DO: ADD YOUR CODE HERE
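
A hedged starting point for the comparison is sketched below. It assumes the scikit-learn copy of the wine data, unscaled features and 5-fold cross-validation; `svm_scores` is a hypothetical name. `LinearSVC` may emit a convergence warning on unscaled data, which is itself relevant to the question of fit.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC, LinearSVC

X_wine, y_wine = load_wine(return_X_y=True, as_frame=True)

candidates = {
    "SVC": SVC(),
    "LinearSVC": LinearSVC(max_iter=5000, random_state=0),
}

svm_scores = {}
for name, clf in candidates.items():
    cv = cross_validate(
        clf, X_wine, y_wine,
        cv=5,                     # assumption: fold count chosen for the sketch
        scoring="accuracy",
        return_train_score=True,
    )
    # Store (mean training accuracy, mean validation accuracy) per model.
    svm_scores[name] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"{name}: train={svm_scores[name][0]:.3f}, "
          f"validation={svm_scores[name][1]:.3f}")
```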

*ANSWER HERE*