### Name: Dillon Pullano

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [30]:
# Note: some of the imported packages below are unused because a lot of experimentation was done:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, accuracy_score, f1_score, mean_absolute_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.svm import SVR

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)
df = pd.read_csv('spotify_songs.csv')


In [3]:
# Explore shape:
print("The shape of the dataset is: " + str(df.shape))


The shape of the dataset is: (32833, 23)


In [4]:
# Explore data types:
print("\nThe data types of each attribute in the dataset are as follows:")
df.dtypes



The data types of each attribute in the dataset are as follows:


track_id                     object
track_name                   object
track_artist                 object
track_popularity              int64
track_album_id               object
track_album_name             object
track_album_release_date     object
playlist_name                object
playlist_id                  object
playlist_genre               object
playlist_subgenre            object
danceability                float64
energy                      float64
key                           int64
loudness                    float64
mode                          int64
speechiness                 float64
acousticness                float64
instrumentalness            float64
liveness                    float64
valence                     float64
tempo                       float64
duration_ms                   int64
dtype: object

In [5]:
# Explore head of data:
print("\nThe head of the dataset is as follows:")
df.head()


The head of the dataset is as follows:


Unnamed: 0,track_id,track_name,track_artist,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_genre,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,6f807x0ima9a1j3VPbc7VN,I Don't Care (with Justin Bieber) - Loud Luxur...,Ed Sheeran,66,2oCs0DGTsRO98Gh5ZSl2Cx,I Don't Care (with Justin Bieber) [Loud Luxury...,2019-06-14,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,6,-2.634,1,0.0583,0.102,0.0,0.0653,0.518,122.036,194754
1,0r7CVbZTWZgbTCYdfa2P31,Memories - Dillon Francis Remix,Maroon 5,67,63rPSO264uRjW1X5E6cWv6,Memories (Dillon Francis Remix),2019-12-13,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,11,-4.969,1,0.0373,0.0724,0.00421,0.357,0.693,99.972,162600
2,1z1Hg7Vb0AhHDiEmnDE79l,All the Time - Don Diablo Remix,Zara Larsson,70,1HoSmj2eLcsrR0vE9gThr4,All the Time (Don Diablo Remix),2019-07-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-3.432,0,0.0742,0.0794,2.3e-05,0.11,0.613,124.008,176616
3,75FpbthrwQmzHlBJLuGdC7,Call You Mine - Keanu Silva Remix,The Chainsmokers,60,1nqYsOef1yKKuGOVchbsk6,Call You Mine - The Remixes,2019-07-19,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,7,-3.778,1,0.102,0.0287,9e-06,0.204,0.277,121.956,169093
4,1e8PAfcKUYoKkxPhrHqw4x,Someone You Loved - Future Humans Remix,Lewis Capaldi,69,7m7vv9wlQ4i0LFuJiE2zsQ,Someone You Loved (Future Humans Remix),2019-03-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-4.672,1,0.0359,0.0803,0.0,0.0833,0.725,123.976,189052


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
2. (1 mark) Why did you pick this particular dataset?
3. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

### Answers:

1. My dataset was sourced from Kaggle and can be downloaded from the following page:
    https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs/
    
2. I picked this dataset because it was a good size having 32,833 samples with 23 attributes.
    The parameters are intuitive to understand and I love listening to music so the topic was interesting to me.
    I especially like how there are different attributes that I could train a model to classify (genre, release year, etc.).
    
3. I found that it was hard to decide on a dataset when I started looking for one on Kaggle. Most of the first ones I was 
    looking at were mainly set up to train a model that would classify a single attribute (diabetes y/n for example).
    Other than this, I found that acquiring the data and loading it in was straightforward.
    

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [6]:
# Clean data (if needed)

# Check to see if there are any missing values:
print("Initial null values for each parameter are as follows:")
df.isnull().sum()

Initial null values for each parameter are as follows:


track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64

In [7]:
# Drop rows with null values since there are only 5:
df = df.dropna()

# Show that there are now no missing values:
print("Final null values for each parameter are as follows:")
print(df.isnull().sum())


Final null values for each parameter are as follows:
track_id                    0
track_name                  0
track_artist                0
track_popularity            0
track_album_id              0
track_album_name            0
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64


In [8]:
# Remove duplicate rows:
df = df.drop_duplicates()

# Drop columns that are unnecessary, such as IDs and other non-relevant parameters:
columns_to_drop = ['track_id', 'track_name', 'track_artist', 'track_album_id', 'playlist_genre', 'playlist_subgenre', 'track_album_name', 'playlist_name', 'playlist_id', 'mode', 'instrumentalness']
df = df.drop(columns = columns_to_drop)

print("\nThe final shape of the dataset is: " + str(df.shape))



The final shape of the dataset is: (32828, 12)


In [9]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

# Convert Date into only year (preprocessing method):
df['track_album_release_date'] = df['track_album_release_date'].str[:4].astype(int)

# Check data types of remaining parameters:
df.dtypes


track_popularity              int64
track_album_release_date      int32
danceability                float64
energy                      float64
key                           int64
loudness                    float64
speechiness                 float64
acousticness                float64
liveness                    float64
valence                     float64
tempo                       float64
duration_ms                   int64
dtype: object

In [10]:
# Split dataset into X and y components
X = df.copy()  # copy so that popping the target does not also modify df
y = X.pop('track_album_release_date')

# Create training and testing datasets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

y.nunique()


63

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

### Answers:

1. Initially, there were 5 null values in each of the 'track_name', 'track_artist', and 'track_album_name' columns. If any of these columns had been numerical, I would have considered replacing the missing values with the column mean. Since they are text-based and only 5 rows were affected, I decided to remove the rows containing null values. If a column had a significant share of null values, I would consider dropping that column completely.

2. The data that I am working with is either float or int. All data was either transformed to one of these formats or dropped. The target parameter 'track_album_release_date' was transformed from a string in the YYYY-MM-DD format into an integer of the year YYYY which resulted in 63 unique possibilities.
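
The two answers above can be illustrated with a small sketch. The toy frame below is hypothetical and not taken from the Spotify dataset. It only shows mean imputation for a numeric column, row removal for a text column, and the year extraction used in this notebook. The partial dates ("2012", "1998-03") are assumed inputs, included to show that string slicing still works when the day or month is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the Spotify dataset
toy = pd.DataFrame({
    "track_name": ["a", None, "c", "d"],
    "tempo": [120.0, 100.0, np.nan, 140.0],
    "track_album_release_date": ["2019-06-14", "2012", "1998-03", "2005-11-01"],
})

# Numeric gap: fill with the column mean (the mean of 120, 100 and 140 is 120)
toy["tempo"] = toy["tempo"].fillna(toy["tempo"].mean())

# Keep only the year; slicing the first four characters also copes with
# partial dates such as "2012"
toy["release_year"] = toy["track_album_release_date"].str[:4].astype(int)

# Text gap: there is no sensible value to impute, so drop the affected row
toy = toy.dropna(subset=["track_name"])

print(toy[["tempo", "release_year"]])
```

In the notebook itself only the row-removal path was needed, since all of the null values were in text columns.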

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [11]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

#-------------------------------------------------------------------
# Linear Model #1: Lasso Regression
#-------------------------------------------------------------------
# Lasso regression pipeline:
las_pipeline = Pipeline(steps = [
    ('scaler', StandardScaler()),
    ('classifier', Lasso())
])

# Lasso regression parameter grid definition:
las_param_grid = {
    'classifier' : [Lasso()],
    'classifier__alpha' : [0.1, 1, 10, 100]
}

#-------------------------------------------------------------------
# Non-Linear Model #1: Random Forest Regression
#-------------------------------------------------------------------
# Random forest regressor pipeline
rf_pipeline = Pipeline(steps = [
    ('scaler', StandardScaler()),
    ('classifier', RandomForestRegressor())
])

# Random forest regressor parameter grid definition:
rf_param_grid = {
    'classifier' : [RandomForestRegressor(random_state = 0)],
    'classifier__max_depth' : [25, 50, 100]
}

#-------------------------------------------------------------------
# Non-Linear Model #2: Gradient Boosted (GBM) Regression
#-------------------------------------------------------------------
# Gradient boosted regression pipeline:
gbm_pipeline = Pipeline(steps = [
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingRegressor())
])

# Define a parameter grid for gradient boosted regression pipeline:
gbm_param_grid = {
    'classifier' : [GradientBoostingRegressor()],
    'classifier__learning_rate' : [0.2, 0.4, 0.6]
}
     
# Create GridSearchCV instances (no scoring is given, so each regressor's default R2 score is used):
las_grid_result = GridSearchCV(las_pipeline, las_param_grid, cv = 4)
rf_grid_result = GridSearchCV(rf_pipeline, rf_param_grid, cv = 4)
gbm_grid_result = GridSearchCV(gbm_pipeline, gbm_param_grid, cv = 4)


In [12]:
# Run grid search on the lasso models:
las_grid_result.fit(X_train, y_train)


In [13]:
# Run grid search on the random forest models:
rf_grid_result.fit(X_train, y_train)


In [14]:
# Run grid search on the gradient boosted models:
gbm_grid_result.fit(X_train, y_train)


In [15]:
las_grid_result.best_estimator_

In [16]:
rf_grid_result.best_estimator_

In [17]:
gbm_grid_result.best_estimator_

In [18]:
las_grid_result.best_score_

0.2931473066258655

In [19]:
rf_grid_result.best_score_

0.5210904630851307

In [20]:
gbm_grid_result.best_score_

0.43803502699194685

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
2. (2 marks) Which models did you select for testing and why?
3. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

### Answers:

1. The target 'track_album_release_date' takes one of 63 distinct year values in the dataset. Because years are ordered, a prediction that is off by a few years is still more useful than one that is off by decades, so I am using regression models rather than treating each year as a separate class. Even identifying the correct decade could have value.

2. The linear model that I selected was Lasso. My reasoning was that there are 11 features feeding the model, and Lasso can shrink the weights of unhelpful features all the way to zero, which could simplify the model. The non-linear models that I selected were random forests and GBM (gradient boosted machines). I selected random forests because averaging many trees helps prevent overfitting to the 11 features. Since initial testing of the random forest went well, I thought that GBM, another tree-based ensemble, would also work well on the dataset.

3. The model that worked the best here was the Random Forest model with 'max_depth = 50'. It had the best mean cross-validated score (R2 value) of 0.52, compared with 0.44 for GBM and 0.29 for Lasso. It does not surprise me that the Random Forest worked best on a dataset with a larger number of features, and the gap to Lasso suggests that the relationship between the audio features and the release year is not linear. The R2 value is lower than I would have hoped, but a more meaningful MAE (mean absolute error) metric will be considered in the next section with the testing data.
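
As a rough sketch of the ideas in these answers, the block below runs the same kind of Lasso grid search on synthetic data and prints the best parameters and score. The data, sizes, and the fixed `alpha=0.5` used at the end are assumptions chosen for illustration and do not come from the Spotify dataset. Only the first two of five features influence the synthetic target, so the uninformative coefficients should be driven to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): only the first two
# of the five features influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", Lasso()),  # step name mirrors the notebook, although it holds a regressor
])
grid = GridSearchCV(pipe, {"classifier__alpha": [0.1, 1, 10, 100]}, cv=4)
grid.fit(X_demo, y_demo)

# With no `scoring` argument, GridSearchCV uses the estimator's own score
# method, which is R2 for scikit-learn regressors
print("Best parameters:", grid.best_params_)
print("Best mean CV R2:", grid.best_score_)

# A strong enough penalty sets the uninformative coefficients to exactly zero
sparse_coef = Lasso(alpha=0.5).fit(X_demo, y_demo).coef_
print("Coefficients at alpha=0.5:", sparse_coef)
```

Printing `best_params_` alongside `best_score_` makes it explicit which hyperparameter value produced the reported score.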

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [28]:
# Calculate testing accuracy (1 mark)

# Instantiate the best random forest regressor pipeline:
best_result_pipeline = Pipeline(steps = [
    ('scaler', StandardScaler()),
    ('classifier', RandomForestRegressor(max_depth=50, random_state=0))
])

# Fit the model using training data and make predictions using testing data
best_result_pipeline.fit(X_train, y_train)
y_pred = best_result_pipeline.predict(X_test)


In [29]:
# Calculate and display accuracy metrics:
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("R2 score: ", r2)
print("Mean absolute error: ", mae)


R2 score:  0.5566446895085475
Mean absolute error:  4.7932278857171875
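
As an aside, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the fitted search object can also be used to predict on the test set directly instead of re-creating the pipeline by hand. The sketch below shows this on synthetic stand-in data. The dataset, the reduced `n_estimators`, and the grid values are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; sizes and grid values are illustrative assumptions
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestRegressor(n_estimators=20, random_state=0)),
])
search = GridSearchCV(pipe, {"classifier__max_depth": [3, None]}, cv=4)
search.fit(X_tr, y_tr)

# refit=True (the default) means `search` now wraps the best pipeline,
# already retrained on all of X_tr
y_hat = search.predict(X_te)
print("Mean CV R2:", search.best_score_)
print("Test R2:   ", r2_score(y_te, y_hat))
print("Test MAE:  ", mean_absolute_error(y_te, y_hat))
```

Either approach should give the same model as long as the hyperparameters and `random_state` match.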



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
2. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
3. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

### Answers:

1. The accuracy metric that I chose was the MAE (Mean Absolute Error). I chose this as I am looking for a metric that will describe how close we could expect this model to reasonably predict what year a song came out. I also included an R2 metric so that I could compare it with results in part 3.

2. Compared with the grid search results in part 3, the testing data produced a slightly higher R2 score of 0.56, against a best cross-validated score of 0.52. This may partly reflect that the final model was fit on the full training set rather than on three of the four folds. Since the testing score is no worse than the cross-validated score, the model appears to generalize well, even though the overall accuracy is modest.

3. Whether the model performs well enough depends on how precise the year estimate needs to be for the use case. If a prediction only counts as correct when it matches the exact year, then this model is not performing well enough. However, the mean absolute error was less than 5 years (4.79), so on average a prediction lands within about 5 years of the true release year. For a task such as gathering songs released within roughly 10 years of one another to build an era-based playlist, the model would likely perform well enough. A suggestion to improve the analysis would be to group the years into decades (i.e. a wider target) and see how the models perform.
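
The metrics and the decade suggestion discussed in these answers can be sketched as follows. The true and predicted years below are hypothetical values made up for illustration, not output from the model. The 5-year tolerance and the decade comparison are two possible ways to express "close enough", not metrics that were computed in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted release years (not model output)
y_true = np.array([1969, 1985, 2000, 2019])
y_hat = np.array([1972.4, 1979.0, 2003.9, 2011.0])

mae = mean_absolute_error(y_true, y_hat)  # average miss, in years
r2 = r2_score(y_true, y_hat)              # share of variance explained

# Fraction of predictions that land within 5 years of the true year
within_5 = np.mean(np.abs(y_true - y_hat) <= 5)

# Decade agreement: floor both values to the decade and compare
true_decade = (y_true // 10) * 10
pred_decade = (np.floor(y_hat / 10) * 10).astype(int)
decade_match = np.mean(true_decade == pred_decade)

print("MAE:", mae)
print("R2:", r2)
print("Within 5 years:", within_5)
print("Same decade:", decade_match)
```

A tolerance-based rate like `within_5` maps more directly onto the era-based playlist use case than R2 does.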

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
2. In what order did you complete the steps?
3. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. All code was written by me. Code was sourced by looking at work that was completed during older lab sessions. I also referenced the 'Lab 6 - Scaler + Pipelines + Gridsearch - solutions' document on D2L. Official Sklearn documentation was used in conjunction with ChatGPT to clarify how to use functions that were unknown to me.

2. The steps were completed as follows:
        1. Source dataset from Kaggle and download
        2. Load data in and explore it
        3. Clean data (remove nulls, duplicates, and drop unwanted columns)
        4. Convert Date values from YYYY-MM-DD to YYYY
        5. Create training and testing datasets
        6. Decide on models, create pipelines, and define hyperparameter grids
        7. Create GridSearchCV instances and fit the data to the models
        8. Print out the best model parameters and results
        9. Calculate the testing accuracy (R2 and MAE) for the best model
        10. Answer all questions and write a reflection

3. I found it challenging to set up the pipelines and parameter grids. Once I got the format of one right, it started to make more sense. The 'Lab 6 - Scaler + Pipelines + Gridsearch - solutions' document posted on D2L helped me to understand how these are used. Another challenge was the length of time it took to run the grid searches on a dataset of this size, which made debugging slower.

## Reflection (2 marks)
Include a sentence or two about:
1. what you liked or disliked,
2. found interesting, confusing, challenging, motivating while working on this assignment.

## Reflection Answers:

1. I liked how using pipelines allows you to check multiple possible solutions to a problem with a smaller amount of code. I disliked how long I needed to spend debugging my code, but I know that this will get better over time.

2. I found it motivating that learning how to use these parameter grids and pipelines will be very useful moving forward. It combines many of the concepts that we have learned throughout the course and allows you to play around with finding the optimal solution to the problem at hand.