## Ensemble Learning
#### Gavin Daves, Rice University
#### INDE 577, Dr. Randy Davila

In this notebook, we apply ensemble methods in Python by fitting a Random Forest regressor to the Spotify dataset, predicting each track's valence from its audio features and popularity.

In [1]:
# Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sys
from sklearn.model_selection import train_test_split

sns.set_theme()

In [2]:
# Loading the data

# Add the top-level directory to the system path
sys.path.append('../../')

# Load the data
import clean_data as sd

df = sd.get_df()

df.columns

Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

In [15]:
# Build the Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


# Use a sample of the data for quicker experimentation
sample_df = df.sample(frac=0.1, random_state=42)

X = sample_df[['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
               'instrumentalness', 'liveness', 'popularity']]
y = sample_df['valence']

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

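# Train the model with default hyperparameters (100 trees, no depth limit).
# For faster experimentation, a smaller forest such as
# RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)
# could be fitted instead; the scores below come from the default model.
clf = RandomForestRegressor(random_state=42)
clf.fit(X_train, y_train)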

# Predict the test set

y_pred = clf.predict(X_test)

print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred)}')
print(f'R2 Score: {r2_score(y_test, y_pred)}')

Mean Squared Error: 0.039782269558645535
R2 Score: 0.4246333682874872
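
A random forest is an ensemble because its prediction for a regression task is the average of the predictions of its individual decision trees. The following is a minimal sketch of that idea on synthetic data, not the Spotify dataset; the names `X_demo`, `y_demo`, `forest`, and `tree_preds` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption for illustration, not the Spotify data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A small forest so the example runs quickly
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is exposed through forest.estimators_
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the per-tree predictions should reproduce the forest's prediction
manual_mean = tree_preds.mean(axis=0)
matches = np.allclose(manual_mean, forest.predict(X_demo))
print(f"Forest prediction equals mean of tree predictions: {matches}")
```

Averaging many trees, each trained on a different bootstrap sample, tends to reduce variance compared with a single deep tree, which is the main motivation for this kind of ensemble.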


### Error Analysis

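The MSE of about 0.040 looks close to zero, but MSE depends on the scale of the target. Valence ranges from 0 to 1, so this MSE corresponds to an RMSE of about 0.20, a typical error of roughly one fifth of the full range. The R-squared score of about 0.42 is the more informative number here: the model explains about 42% of the variance in valence on the test set. It does better than always predicting the mean valence, but it leaves most of the variance unexplained, so it is only a moderate predictor.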
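
The two scores are linked by the identity R² = 1 − MSE / Var(y_test), so they can be compared on a common footing using only the printed values. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative, and the implied variance is derived from the two reported scores and is not measured from the data.

```python
import math

# Scores printed by the evaluation cell above
mse = 0.039782269558645535
r2 = 0.4246333682874872

# RMSE is expressed in the same units as valence
rmse = math.sqrt(mse)

# R2 = 1 - MSE / Var(y_test), so Var(y_test) = MSE / (1 - R2)
implied_var = mse / (1 - r2)
implied_std = math.sqrt(implied_var)

# Ratio of the model's RMSE to the RMSE of always predicting the test mean
error_ratio = rmse / implied_std

print(f"RMSE: {rmse:.3f}")
print(f"Implied std of test-set valence: {implied_std:.3f}")
print(f"RMSE relative to mean-only baseline: {error_ratio:.3f}")
```

The ratio equals the square root of 1 − R², roughly 0.76 here, so the forest's typical error is only about a quarter smaller than that of a mean-only baseline. This is why a small MSE on its own does not indicate a strong model.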