<a href="https://colab.research.google.com/github/alortiz05/DDDS-Cohort-16-Projects/blob/main/Spotify_Description_for_StudentsALO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 4: Music Popularity Prediction


This project will take data features collected for songs that have been on the Top 200 Weekly (Global) charts of Spotify in 2020 & 2021. The popularity of the song will be predicted using a tree-based regression model trained on these features.



The goals for the project are:

- Minimize the cross-validated ***root mean squared error ( RMSE )*** when predicting the popularity of a new song.

- Determine the importance of the features in driving the regression result, **using the tree methods discussed**.

The project will be done using tree-based regression techniques as covered in class. The hyperparameters of the trees should be carefully selected to avoid over-fitting.

There are three main challenges for this project:

1. Determining the outcome ( i.e. target ).  There is a "popularity" column.  But other columns may or may not be more appropriate indicators of popularity.
  - How does Spotify determine popularity?
  - Popularity seems to be determined not just by the number of plays but also by recent plays, user engagement, skips, playlists, and total plays.

1. Choosing appropriate predictors ( i.e. features ). When building a machine learning model, we want to make sure that we consider how the model will be ultimately used. For this project, we are predicting the popularity of a new song. Therefore, we should only include the predictors we would have for a new song. It might help to imagine that the **song will not be released for several weeks**.
  - Some of these features are not usable. You will be given a song and asked to determine how popular it will be. Some features are not known until after a song is released, so they should be removed: if you could "see into the future", there would be nothing to predict.

1. Data cleaning and feature engineering. Some creative cleaning and/or feature engineering may be needed to extract useful information for prediction.
  - In other words, this is not a clean dataset.
  - Check for nulls, and don't trust a result of zero nulls.
  - How do you find nulls that do not show up in `.isnull()`?
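
One way to surface placeholder values that `isnull()` misses is to treat whitespace-only strings and common "missing" tokens as nulls before counting. The sketch below uses a small made-up frame; the values and the `placeholders` list are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy frame: values are invented for illustration only.
toy = pd.DataFrame({
    "Song ID": ["abc123", " ", "def456"],
    "Tempo": ["120.5", " ", "NA"],
})

# Assumed list of placeholder strings; extend it after inspecting the data.
placeholders = ["", "NA", "N/A", "null", "None", "-"]

# Strip whitespace so " " becomes "", then blank out every placeholder.
stripped = toy.apply(lambda col: col.str.strip())
cleaned = stripped.mask(stripped.isin(placeholders))

# Nulls that only appear after the placeholders are mapped to NaN.
hidden_nulls = cleaned.isnull().sum() - toy.isnull().sum()
print(hidden_nulls)
```

Comparing the null counts before and after the mapping shows how many values were hiding behind placeholders in each column.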


Once again, be sure to go through the whole data science process and document as such in your Jupyter notebook.

The data is available on AWS at https://ddc-datascience.s3.amazonaws.com/Projects/Project.4-Spotify/Data/Spotify.csv .



# Problem Definition

We will try to predict the popularity of a song that has not yet been released.
This is a supervised learning problem because our data is labeled and we are predicting a known target.

# Data Collection/Sources


In [2]:
url = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.4-Spotify/Data/Spotify.csv"
!curl -s -I {url}

HTTP/1.1 200 OK
x-amz-id-2: fhG1vqKu6ZDW7djmlJDZtkrU8FrWs4FHno7Y7Ng3O+LHmGkEMSpnN0/TE97GctbfmsqKAHECLax5lKgrTk+XviWAbhI9IaPzVHrxRTekZ30=
x-amz-request-id: 6HAC67XQZDW222WS
Date: Fri, 25 Apr 2025 19:19:04 GMT
Last-Modified: Wed, 04 Oct 2023 17:23:56 GMT
ETag: "65b9875b11e0d7ea03ee2af024f45e99"
x-amz-server-side-encryption: AES256
Accept-Ranges: bytes
Content-Type: text/csv
Content-Length: 738124
Server: AmazonS3



In [3]:
!curl -s -O {url}

In [4]:
ls -la

total 18492
drwxr-xr-x 1 root root     4096 Apr 25 17:05 ./
drwxr-xr-x 1 root root     4096 Apr 25 14:33 ../
drwxr-xr-x 4 root root     4096 Apr 23 13:39 .config/
-rw-r--r-- 1 root root 18176282 Apr 25 18:32 rfModel.p
drwxr-xr-x 1 root root     4096 Apr 23 13:39 sample_data/
-rw-r--r-- 1 root root   738124 Apr 25 19:19 Spotify.csv


In [5]:
!head -1 Spotify.csv | tr , '\n' | cat -n

     1	Index
     2	Highest Charting Position
     3	Number of Times Charted
     4	Week of Highest Charting
     5	Song Name
     6	Streams
     7	Artist
     8	Artist Followers
     9	Song ID
    10	Genre
    11	Release Date
    12	Weeks Charted
    13	Popularity
    14	Danceability
    15	Energy
    16	Loudness
    17	Speechiness
    18	Acousticness
    19	Liveness
    20	Tempo
    21	Duration (ms)
    22	Valence
    23	Chord


In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn import metrics
import pickle
from sklearn.preprocessing import LabelEncoder
import graphviz
from IPython.display import display
from sklearn import tree


In [7]:
spotify = pd.read_csv( url, index_col = 0 )

In [8]:
memory_bytes = spotify.memory_usage(deep=True).sum()
print(f"{memory_bytes} bytes")

2499083 bytes


# Exploratory Data Analysis


In [9]:
spotify.head()

Index,Highest Charting Position,Number of Times Charted,Week of Highest Charting,Song Name,Streams,Artist,Artist Followers,Song ID,Genre,Release Date,...,Danceability,Energy,Loudness,Speechiness,Acousticness,Liveness,Tempo,Duration (ms),Valence,Chord
1,1,8,2021-07-23--2021-07-30,Beggin',48633449,Måneskin,3377762,3Wrjm47oTz2sjIgck11l5e,"['indie rock italiano', 'italian pop']",2017-12-08,...,0.714,0.8,-4.808,0.0504,0.127,0.359,134.002,211560,0.589,B
2,2,3,2021-07-23--2021-07-30,STAY (with Justin Bieber),47248719,The Kid LAROI,2230022,5HCyWlXZPP0y6Gqq8TgA20,['australian hip hop'],2021-07-09,...,0.591,0.764,-5.484,0.0483,0.0383,0.103,169.928,141806,0.478,C#/Db
3,1,11,2021-06-25--2021-07-02,good 4 u,40162559,Olivia Rodrigo,6266514,4ZtFanR9U6ndgddUvNcjcG,['pop'],2021-05-21,...,0.563,0.664,-5.044,0.154,0.335,0.0849,166.928,178147,0.688,A
4,3,5,2021-07-02--2021-07-09,Bad Habits,37799456,Ed Sheeran,83293380,6PQ88X9TkUIAUIZJHW2upE,"['pop', 'uk pop']",2021-06-25,...,0.808,0.897,-3.712,0.0348,0.0469,0.364,126.026,231041,0.591,B
5,5,1,2021-07-23--2021-07-30,INDUSTRY BABY (feat. Jack Harlow),33948454,Lil Nas X,5473565,27NovPIUIRrOZoCHxABJwK,"['lgbtq+ hip hop', 'pop rap']",2021-07-23,...,0.736,0.704,-7.409,0.0615,0.0203,0.0501,149.995,212000,0.894,D#/Eb


In [10]:
spotify.shape

(1556, 22)

In [11]:
spotify.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1556 entries, 1 to 1556
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Highest Charting Position  1556 non-null   int64 
 1   Number of Times Charted    1556 non-null   int64 
 2   Week of Highest Charting   1556 non-null   object
 3   Song Name                  1556 non-null   object
 4   Streams                    1556 non-null   object
 5   Artist                     1556 non-null   object
 6   Artist Followers           1556 non-null   object
 7   Song ID                    1556 non-null   object
 8   Genre                      1556 non-null   object
 9   Release Date               1556 non-null   object
 10  Weeks Charted              1556 non-null   object
 11  Popularity                 1556 non-null   object
 12  Danceability               1556 non-null   object
 13  Energy                     1556 non-null   object
 14  Loudness     

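Columns 2, 9, and 10 (Week of Highest Charting, Release Date, Weeks Charted) hold dates, so we probably want to convert them to a datetime format. Column 4 (Streams) holds numbers stored as text rather than dates, so it should become numeric instead.

Column 19 (Duration (ms)) can be converted to an integer.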
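
A minimal sketch of these conversions, using three rows copied from the `head()` output above. The new column names `Week Start` and `Week End` are assumptions introduced for illustration.

```python
import pandas as pd

# Three rows copied from the head() output; all values start out as strings.
sample = pd.DataFrame({
    "Week of Highest Charting": [
        "2021-07-23--2021-07-30",
        "2021-06-25--2021-07-02",
        "2021-07-02--2021-07-09",
    ],
    "Release Date": ["2017-12-08", "2021-05-21", "2021-06-25"],
    "Duration (ms)": ["211560", "178147", "231041"],
})

# Split the "start--end" range on the double hyphen into two columns.
week = sample["Week of Highest Charting"].str.split("--", expand=True)
sample["Week Start"] = pd.to_datetime(week[0], format="%Y-%m-%d")
sample["Week End"] = pd.to_datetime(week[1], format="%Y-%m-%d")

# errors="coerce" turns unparseable entries (such as blanks) into NaT.
sample["Release Date"] = pd.to_datetime(sample["Release Date"], errors="coerce")

# Durations are whole milliseconds, so an integer dtype is enough.
sample["Duration (ms)"] = sample["Duration (ms)"].astype(int)
print(sample.dtypes)
```

On the full dataset the blank rows would need to be handled first, since `astype(int)` fails on a string that is not a number.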

In [12]:
spotify.isnull().sum()

Unnamed: 0,0
Highest Charting Position,0
Number of Times Charted,0
Week of Highest Charting,0
Song Name,0
Streams,0
Artist,0
Artist Followers,0
Song ID,0
Genre,0
Release Date,0


In [13]:
#Had to ask help from chatgpt on this one
fake_nulls = [" "] #"NA", "N/A", "na", "null", "NULL", "None", "none", "-", "--"]

# Check for those values in each column
fake_null=spotify.apply(lambda col: col.isin(fake_nulls).sum() if col.dtypes == "object" else 0)
fake_null

Unnamed: 0,0
Highest Charting Position,0
Number of Times Charted,0
Week of Highest Charting,0
Song Name,0
Streams,0
Artist,0
Artist Followers,11
Song ID,11
Genre,11
Release Date,11


Several columns each contain 11 blank values (a single space). These values would obviously not contribute to the prediction model.

Thinking about how Spotify determines popularity, some important columns would be Streams and Artist Followers. If the song has not been released yet, then Highest Charting Position, Number of Times Charted, and Week of Highest Charting would not be known. Streams would probably not be known either, so we should probably remove it.
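
The idea of keeping only pre-release information can be sketched as a small helper. The column names come from the CSV header, but which ones count as "known only after release" is a judgment call, and the `post_release` list and `keep_pre_release` function are hypothetical names used only for this example.

```python
import pandas as pd

# Assumed list of columns that are only known after a song has charted.
post_release = [
    "Highest Charting Position",
    "Number of Times Charted",
    "Week of Highest Charting",
    "Weeks Charted",
    "Streams",
]

def keep_pre_release(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that leak post-release information."""
    # errors="ignore" skips any listed column that is not present in df.
    return df.drop(columns=post_release, errors="ignore")

# Empty demo frame with a few of the real column names.
demo = pd.DataFrame(columns=["Streams", "Artist Followers", "Tempo", "Popularity"])
print(list(keep_pre_release(demo).columns))
```

Keeping the list in one place makes it easy to revisit the decision for borderline columns such as Streams.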

In [14]:
spotify.describe(include='all')


Unnamed: 0,Highest Charting Position,Number of Times Charted,Week of Highest Charting,Song Name,Streams,Artist,Artist Followers,Song ID,Genre,Release Date,...,Danceability,Energy,Loudness,Speechiness,Acousticness,Liveness,Tempo,Duration (ms),Valence,Chord
count,1556.0,1556.0,1556,1556,1556.0,1556,1556.0,1556.0,1556,1556,...,1556.0,1556.0,1556.0,1556.0,1556.0,1556.0,1556.0,1556.0,1556.0,1556
unique,,,83,1556,1556.0,716,600.0,1517.0,395,478,...,530.0,575.0,1394.0,772.0,965.0,606.0,1461.0,1486.0,732.0,13
top,,,2019-12-27--2020-01-03,Lover (Remix) [feat. Shawn Mendes],4595450.0,Taylor Swift,42227614.0,,[],2020-01-17,...,,,,0.102,,0.103,,,,C#/Db
freq,,,89,1,1.0,52,52.0,11.0,75,34,...,11.0,11.0,11.0,15.0,11.0,23.0,11.0,11.0,11.0,214
mean,87.744216,10.66838,,,,,,,,,...,,,,,,,,,,
std,58.147225,16.360546,,,,,,,,,...,,,,,,,,,,
min,1.0,1.0,,,,,,,,,...,,,,,,,,,,
25%,37.0,1.0,,,,,,,,,...,,,,,,,,,,
50%,80.0,4.0,,,,,,,,,...,,,,,,,,,,
75%,137.0,12.0,,,,,,,,,...,,,,,,,,,,


In [15]:
spotify1 = spotify[~spotify.isin([" ",""])]
spotify1 = spotify1.dropna()
spotify1.head()

Index,Highest Charting Position,Number of Times Charted,Week of Highest Charting,Song Name,Streams,Artist,Artist Followers,Song ID,Genre,Release Date,...,Danceability,Energy,Loudness,Speechiness,Acousticness,Liveness,Tempo,Duration (ms),Valence,Chord
1,1,8,2021-07-23--2021-07-30,Beggin',48633449,Måneskin,3377762,3Wrjm47oTz2sjIgck11l5e,"['indie rock italiano', 'italian pop']",2017-12-08,...,0.714,0.8,-4.808,0.0504,0.127,0.359,134.002,211560,0.589,B
2,2,3,2021-07-23--2021-07-30,STAY (with Justin Bieber),47248719,The Kid LAROI,2230022,5HCyWlXZPP0y6Gqq8TgA20,['australian hip hop'],2021-07-09,...,0.591,0.764,-5.484,0.0483,0.0383,0.103,169.928,141806,0.478,C#/Db
3,1,11,2021-06-25--2021-07-02,good 4 u,40162559,Olivia Rodrigo,6266514,4ZtFanR9U6ndgddUvNcjcG,['pop'],2021-05-21,...,0.563,0.664,-5.044,0.154,0.335,0.0849,166.928,178147,0.688,A
4,3,5,2021-07-02--2021-07-09,Bad Habits,37799456,Ed Sheeran,83293380,6PQ88X9TkUIAUIZJHW2upE,"['pop', 'uk pop']",2021-06-25,...,0.808,0.897,-3.712,0.0348,0.0469,0.364,126.026,231041,0.591,B
5,5,1,2021-07-23--2021-07-30,INDUSTRY BABY (feat. Jack Harlow),33948454,Lil Nas X,5473565,27NovPIUIRrOZoCHxABJwK,"['lgbtq+ hip hop', 'pop rap']",2021-07-23,...,0.736,0.704,-7.409,0.0615,0.0203,0.0501,149.995,212000,0.894,D#/Eb


In [16]:
spotify1['Artist Followers'] = spotify1['Artist Followers'].astype(float)  # or int, if no decimals

In [17]:
spotify1['Streams'] = spotify1['Streams'].str.replace(',', '')  # remove commas

spotify1['Streams'] = spotify1['Streams'].astype(float)  # or int, if no decimals

In [18]:
spotify1.drop(columns=['Highest Charting Position', 'Number of Times Charted','Week of Highest Charting','Release Date','Weeks Charted'], inplace=True)
# These columns are removed on the assumption that the song is unreleased.


In [19]:
spotify1.drop(columns=['Speechiness'], inplace=True)
#I feel Speechiness is not very descriptive and really not important.

In [20]:
spotify1.drop(columns=['Artist'], inplace=True)

In [21]:
spotify1.drop(columns=['Chord'], inplace=True)

In [22]:
spotify1.drop(columns=['Song Name'], inplace=True)

In [23]:
spotify1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1545 entries, 1 to 1556
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Streams           1545 non-null   float64
 1   Artist Followers  1545 non-null   float64
 2   Song ID           1545 non-null   object 
 3   Genre             1545 non-null   object 
 4   Popularity        1545 non-null   object 
 5   Danceability      1545 non-null   object 
 6   Energy            1545 non-null   object 
 7   Loudness          1545 non-null   object 
 8   Acousticness      1545 non-null   object 
 9   Liveness          1545 non-null   object 
 10  Tempo             1545 non-null   object 
 11  Duration (ms)     1545 non-null   object 
 12  Valence           1545 non-null   object 
dtypes: float64(2), object(11)
memory usage: 169.0+ KB


In [24]:
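# Convert columns 6 onward (Energy through Valence) from strings to floats.
# Popularity (4) and Danceability (5) fall outside this slice and stay as object dtype.
for col in spotify1.columns[6:15]:
    spotify1[col] = spotify1[col].str.replace(',', '')  # remove commas
    spotify1[col] = spotify1[col].astype(float)  # or int, if no decimals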


In [25]:
spotify1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1545 entries, 1 to 1556
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Streams           1545 non-null   float64
 1   Artist Followers  1545 non-null   float64
 2   Song ID           1545 non-null   object 
 3   Genre             1545 non-null   object 
 4   Popularity        1545 non-null   object 
 5   Danceability      1545 non-null   object 
 6   Energy            1545 non-null   float64
 7   Loudness          1545 non-null   float64
 8   Acousticness      1545 non-null   float64
 9   Liveness          1545 non-null   float64
 10  Tempo             1545 non-null   float64
 11  Duration (ms)     1545 non-null   float64
 12  Valence           1545 non-null   float64
dtypes: float64(9), object(4)
memory usage: 169.0+ KB
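
The `info()` output shows that Popularity and Danceability are still `object` columns. A minimal sketch of converting them with `pd.to_numeric` follows; the two-row `demo` frame is a stand-in built from values shown earlier, not the real `spotify1` frame.

```python
import pandas as pd

# Values from the first two rows of the table, stored as strings.
demo = pd.DataFrame({
    "Popularity": ["100", "99"],
    "Danceability": ["0.714", "0.591"],
})

for col in ["Popularity", "Danceability"]:
    # errors="coerce" would turn any leftover non-numeric text into NaN.
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

print(demo.dtypes)
```

With a numeric target, summary statistics such as the mean and standard deviation of Popularity become available in `describe()`.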


I have learned that the valence of a song measures how sad (0) or happy (1) it sounds. This seems like it could contribute to predicting popularity. That said, Danceability and Energy may provide similar information to valence.

I don't think the chord is particularly important to popularity, so I would likely remove it.

In [26]:
spotify1

Index,Streams,Artist Followers,Song ID,Genre,Popularity,Danceability,Energy,Loudness,Acousticness,Liveness,Tempo,Duration (ms),Valence
1,48633449.0,3377762.0,3Wrjm47oTz2sjIgck11l5e,"['indie rock italiano', 'italian pop']",100,0.714,0.800,-4.808,0.12700,0.3590,134.002,211560.0,0.589
2,47248719.0,2230022.0,5HCyWlXZPP0y6Gqq8TgA20,['australian hip hop'],99,0.591,0.764,-5.484,0.03830,0.1030,169.928,141806.0,0.478
3,40162559.0,6266514.0,4ZtFanR9U6ndgddUvNcjcG,['pop'],99,0.563,0.664,-5.044,0.33500,0.0849,166.928,178147.0,0.688
4,37799456.0,83293380.0,6PQ88X9TkUIAUIZJHW2upE,"['pop', 'uk pop']",98,0.808,0.897,-3.712,0.04690,0.3640,126.026,231041.0,0.591
5,33948454.0,5473565.0,27NovPIUIRrOZoCHxABJwK,"['lgbtq+ hip hop', 'pop rap']",96,0.736,0.704,-7.409,0.02030,0.0501,149.995,212000.0,0.894
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1552,4630675.0,27167675.0,2ekn2ttSfGqwhhate0LSR0,"['dance pop', 'pop', 'uk pop']",79,0.762,0.700,-6.021,0.00261,0.1530,116.073,209320.0,0.608
1553,4623030.0,15019109.0,2PWjKmjyTZeDpmOUa3a5da,"['sertanejo', 'sertanejo universitario']",66,0.528,0.870,-3.123,0.24000,0.3330,152.370,181930.0,0.714
1554,4620876.0,22698747.0,1rfofaqEpACxVEHIZBJe6W,"['dance pop', 'electropop', 'pop', 'post-teen ...",81,0.765,0.523,-4.333,0.18400,0.1320,104.988,217307.0,0.394
1555,4607385.0,208630.0,5F8ffc8KWKNawllr5WsW0r,"['brega funk', 'funk carioca']",60,0.832,0.550,-7.026,0.24900,0.1820,154.064,152784.0,0.881


In [27]:
spotify1.describe(include='all')

Unnamed: 0,Streams,Artist Followers,Song ID,Genre,Popularity,Danceability,Energy,Loudness,Acousticness,Liveness,Tempo,Duration (ms),Valence
count,1545.0,1545.0,1545,1545,1545.0,1545.0,1545.0,1545.0,1545.0,1545.0,1545.0,1545.0,1545.0
unique,,,1516,394,69.0,529.0,,,,,,,
top,,,5uEYRdEIh9Bo4fpjDd4Na9,[],75.0,0.664,,,,,,,
freq,,,3,75,67.0,10.0,,,,,,,
mean,6337136.0,14716900.0,,,,,0.633495,-6.348474,0.248695,0.181202,122.811023,197940.816828,0.514704
std,3375402.0,16675790.0,,,,,0.161577,2.509281,0.250326,0.144071,29.591088,47148.93042,0.227326
min,4176083.0,4883.0,,,,,0.054,-25.166,2.5e-05,0.0197,46.718,30133.0,0.032
25%,4915080.0,2123734.0,,,,,0.532,-7.491,0.0485,0.0966,97.96,169266.0,0.343
50%,5269163.0,6852509.0,,,,,0.642,-5.99,0.161,0.124,122.012,193591.0,0.512
75%,6452492.0,22698750.0,,,,,0.752,-4.711,0.388,0.217,143.86,218902.0,0.691


For the string columns we will need to encode them. The problem is that I'm not sure a binary (one-hot) encoding is the best choice.

I found a way to use `LabelEncoder` so that I won't have to add a lot of columns with binary data.

In [28]:
# Create and apply LabelEncoder
encoder = LabelEncoder()

In [29]:
spotify1['Genre'] = encoder.fit_transform(spotify1['Genre'])

In [30]:
spotify1['Song ID'] = encoder.fit_transform(spotify1['Song ID'])
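
`LabelEncoder` assigns an arbitrary integer to each distinct string, and scikit-learn documents it as a tool for target values rather than input features. As a hedged alternative, the sketch below parses each Genre cell into a list of tags and builds 0/1 indicator columns for only the most frequent tags, which keeps the column count small. The `top_n` value and the `genre_` column prefix are arbitrary choices for illustration.

```python
import ast
from collections import Counter

import pandas as pd

# Genre strings copied from rows shown earlier in the notebook.
genre = pd.Series([
    "['indie rock italiano', 'italian pop']",
    "['australian hip hop']",
    "['pop']",
    "['pop', 'uk pop']",
    "['lgbtq+ hip hop', 'pop rap']",
    "[]",
], name="Genre")

# Each cell is the text of a Python list, so literal_eval recovers the tags.
tags = genre.apply(ast.literal_eval)

# Keep only the most frequent tags (top_n is an arbitrary choice here).
top_n = 2
counts = Counter(tag for row in tags for tag in row)
top_tags = [tag for tag, _ in counts.most_common(top_n)]

# One 0/1 indicator column per frequent tag.
indicators = pd.DataFrame(
    {f"genre_{tag}": tags.apply(lambda row, t=tag: int(t in row)) for tag in top_tags}
)
print(indicators)
```

Songs with an empty genre list simply get zeros in every indicator column.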

In [31]:
#make a copy and run RF for feature importance
spotify2 = spotify1.copy()

In [32]:
X = spotify2.drop('Popularity', axis = 1)
y = spotify2['Popularity']

# Testing Feature Importance

In [None]:

# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)

numLoops = 50
numtrees= [10,20,40,80,200]
mean_error = np.zeros(numLoops)
RMSE_results = np.zeros(len(numtrees))
std_results = np.zeros(len(numtrees))
for n, trees in enumerate(numtrees):
# np.random.seed(42)
  for i in range(numLoops):
      X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
      model = RandomForestRegressor( n_estimators = trees ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
      model.fit( X_train, y_train )
      y_pred = model.predict( X_test )
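      mean_error[i] = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE for this split

  print(numLoops,' loop finished.')
  # mean_error already holds RMSE values, so the np.sqrt calls below take a
  # second square root: the two numbers reported are for sqrt(RMSE), not RMSE.
  RMSE_results[n]= np.sqrt(mean_error).mean()
  std_results[n] = np.sqrt(mean_error).std()

  print(RMSE_results[n])
  print(std_results[n])

  np.sqrt(mean_error)[:50]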

50  loop finished.
3.1689868832629515
0.10663067377660645
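
Because `mean_error` in the cell above already stores an RMSE per split, summarising it with another `np.sqrt` reports the square root of the RMSE. A compact way to get a cross-validated RMSE without that pitfall is scikit-learn's `cross_val_score` with the `neg_root_mean_squared_error` scorer. This is a sketch on synthetic data (`X_demo`, `y_demo` are stand-ins, not the Spotify features), and `RepeatedKFold` is swapped in for the repeated random splits used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for X and y so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
model_demo = RandomForestRegressor(n_estimators=20, random_state=0)

# The scorer returns negative RMSE (higher is better), so flip the sign.
neg_rmse = cross_val_score(
    model_demo, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse = -neg_rmse

# rmse already holds RMSE values: take the mean and std directly, no extra sqrt.
print(f"RMSE: {rmse.mean():.3f} +/- {rmse.std():.3f}")
```

Fixing `random_state` makes the comparison between different `n_estimators` values repeatable.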


In [None]:
pickle.dump(model, open('rfModel.p','wb'))


In [None]:
plt.plot(numtrees, RMSE_results)
plt.xlabel('Tree No.')
plt.ylabel('RMSE')
plt.grid()


In [None]:
# plt.errorbar(num_trees, rmse_results, yerr=(std_results*2,std_results*2))
plt.errorbar(numtrees, RMSE_results, yerr=std_results)
plt.xlabel('Tree No.')
plt.ylabel('RMSE')
plt.ylim(0,5)
plt.xlim(0,100)
plt.grid()

In [None]:
display(
  graphviz.Source(
    tree.export_graphviz(
      model.estimators_[0],
      feature_names = X.columns,
    )
  )
)

In [None]:
from sklearn.tree import _tree

# Pick the first tree from the forest
estimator = model.estimators_[0]
tree_ = estimator.tree_

print("Number of nodes:", tree_.node_count)
print("Tree depth:", tree_.max_depth)

# Features used (-2 means leaf)
print("Feature indices used in splits:", tree_.feature)
print("Thresholds used in splits:", tree_.threshold)

In [None]:
import numpy as np

feature_names = X.columns
used_features = set(tree_.feature[tree_.feature != _tree.TREE_UNDEFINED])
print("Features used:", [feature_names[i] for i in used_features])

In [None]:
is_leaf = tree_.feature == _tree.TREE_UNDEFINED
print("Leaf nodes:", np.sum(is_leaf))
print("Split nodes:", np.sum(~is_leaf))

In [None]:
importances = model.feature_importances_
forest_importances = pd.Series( importances, index = X.columns )

plt.figure()
# forest_importances.plot.bar()
forest_importances.sort_values( ascending = False ).plot.bar()
plt.title("Feature importances")
plt.ylabel('Feature Importance Score') ;
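
Impurity-based importances (`feature_importances_`) can favour features with many unique values, such as a label-encoded Song ID. As a cross-check, the sketch below compares them with permutation importance on held-out data. This swaps in a different technique from the one used above, and the synthetic features (`signal`, `row_id`, `noise`) are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=n)
row_id = np.arange(n).astype(float)  # unique per row, like an encoded Song ID
noise = rng.normal(size=n)
X_demo = np.column_stack([signal, row_id, noise])
y_demo = 3 * signal + rng.normal(scale=0.5, size=n)
names = ["signal", "row_id", "noise"]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
for name, imp_mdi, imp_perm in zip(names, forest.feature_importances_, result.importances_mean):
    print(f"{name:8s} impurity={imp_mdi:.3f} permutation={imp_perm:.3f}")
```

A feature that scores well on impurity but near zero on permutation importance is a candidate for removal.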


In [None]:
spotify1.head()

# Processing: Random Forest


In [None]:
X = spotify1.drop('Popularity', axis = 1)
y = spotify1['Popularity']

A few reminders:
- test_size = 0.2 means the data is split 20% test / 80% train.
- We did not set a seed, so each split and each run starts from a different random state.
- numLoops is the number of times we repeat the train/test split evaluation. The MSE for each run is stored in "mean_error", and its square root gives the RMSE.
  - Across the runs we get the average performance (mean) and the stability (standard deviation).
- n_estimators is the number of decision trees in the forest.
  - Each tree is trained on a different bootstrap sample (random sampling with replacement).
  - Predictions are made by averaging the trees (regression) or by voting (classification).
  - The result is more stable predictions, lower variance, and a reduced risk of overfitting compared with a single tree.
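
The averaging step can be checked directly: for a `RandomForestRegressor`, the forest's prediction is the mean of the predictions of its individual trees. The sketch below uses synthetic data (`X_demo`, `y_demo` are stand-ins) to show this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_.
per_tree = np.array([t.predict(X_demo[:5]) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own prediction.
print(np.allclose(manual_average, forest.predict(X_demo[:5])))
```

The spread of `per_tree` along its first axis also gives a rough sense of how much the individual trees disagree on each sample.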
  

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
numLoops = 250

mean_error = np.zeros(numLoops)

# np.random.seed(42)
for idx in range(0,numLoops):
  X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
  model = RandomForestRegressor( n_estimators = 30 ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
  model.fit(X_train, y_train)
  y_pred = model.predict( X_test )
  mean_error[idx] = mean_squared_error( y_test, y_pred )

print(f'RMSE: {np.sqrt(mean_error).mean()}')
print(f'RMSE_std: {np.sqrt(mean_error).std()}')
np.sqrt(mean_error)[:50]


## How many loops should we run?

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
for nloops in [50,100,250,500]:
  numLoops = nloops

  mean_error = np.zeros(numLoops)

  # np.random.seed(42)
  for i in range(0,numLoops):
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
    model = RandomForestRegressor( n_estimators = 30 ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
    model.fit( X_train, y_train )
    y_pred = model.predict( X_test )
    mean_error[i] = mean_squared_error(y_test, y_pred)  # store MSE; the square root is taken once, below

  print(numLoops,' loop finished.')
  print(f'RMSE: {np.sqrt(mean_error).mean()}')
  print(f'RMSE_std: {np.sqrt(mean_error).std()}')

#print(f'RMSE: {np.sqrt(mean_error).mean()*1000}')
#print(f'RMSE_std: {np.sqrt(mean_error).std()*1000}')
#np.sqrt(mean_error)[:50]


In [None]:
#I need to creat lists for each value and then graph them
#plt.plot(numLoops, mean_error)
#plt.xlabel('Tree No.')
#plt.ylabel('RMSE')
#plt.grid()


- The numLoops sweep took about 10 minutes to complete.
- Based on these runs, 50 loops looks like the best choice for the number of loops.
- 50 had both the lowest RMSE and the lowest RMSE_std.
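
One caveat when reading these numbers: the number of loops does not change the model, only how precisely the mean RMSE is estimated. The standard error of the mean shrinks roughly with the square root of the number of loops. The sketch below illustrates this with simulated per-split RMSE values; the mean of 10.0 and spread of 0.7 are invented for illustration and are not results from this notebook.

```python
import numpy as np

def standard_error(rmse_per_split: np.ndarray) -> float:
    """Standard error of the mean RMSE across repeated splits."""
    return rmse_per_split.std(ddof=1) / np.sqrt(len(rmse_per_split))

# Simulated per-split RMSE values; the mean and spread are made up.
rng = np.random.default_rng(0)
for n_loops in [50, 100, 250, 500]:
    simulated = rng.normal(loc=10.0, scale=0.7, size=n_loops)
    print(n_loops, round(standard_error(simulated), 3))
```

Small differences in mean RMSE between loop counts that fall within one or two standard errors are likely to be noise.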


## How many trees should we have?

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
numLoops = 50

mean_error = np.zeros(numLoops)
for n, trees in enumerate([10,20,40,80,200]):
# np.random.seed(42)
  for idx in range(0,numLoops):
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
    model = RandomForestRegressor( n_estimators = trees ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
    model.fit( X_train, y_train )
    y_pred = model.predict( X_test )
    mean_error[idx] = mean_squared_error( y_test, y_pred )

  print(trees,' trees finished.')
  print(f'RMSE: {np.sqrt(mean_error).mean()}')
  print(f'RMSE_std: {np.sqrt(mean_error).std()}')
  np.sqrt(mean_error)[:50]


- The n_estimators loop takes about 6.5 minutes to complete.
- These results are a little trickier:
  - the lowest RMSE is for 80 trees
  - the lowest RMSE_std is for 20 trees.
- I would likely choose 80 trees because it has the lowest overall RMSE, although its std is about 7% of the RMSE, compared with about 5.8% for 20 trees.

- The next best thing might be to look at the values between 20 and 80.

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
numLoops = 50

mean_error = np.zeros(numLoops)
for n, trees in enumerate([20,30,40,50,60,70,80]):
# np.random.seed(42)
  for idx in range(0,numLoops):
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
    model = RandomForestRegressor( n_estimators = trees ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
    model.fit( X_train, y_train )
    y_pred = model.predict( X_test )
    mean_error[idx] = mean_squared_error( y_test, y_pred )

  print(trees,' trees finished.')
  print(f'RMSE: {np.sqrt(mean_error).mean()}')
  print(f'RMSE_std: {np.sqrt(mean_error).std()}')
  np.sqrt(mean_error)[:50]

- Looking more closely at the number of trees between 20 and 80 gives a similarly tricky result.
- I would assess this the same way:
  - the lowest RMSE is: 50 trees
  - the lowest RMSE_std is: 40 trees
- Expressing the std as a percentage of each RMSE gives 6.3% for 50 trees and 6.1% for 40 trees.
  - This is not a huge difference, but I think the size of the std relative to the RMSE is important, so I would go with 40 trees.
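
The "std as a percentage of RMSE" comparison can be wrapped in a small helper. This is a sketch: the `summarize` function is a hypothetical name, and the MSE values passed to it are made up only to show the calculation.

```python
import numpy as np

def summarize(mse_per_split):
    """Return mean RMSE, its std, and the std as a percentage of the mean."""
    rmse = np.sqrt(np.asarray(mse_per_split, dtype=float))
    mean, std = rmse.mean(), rmse.std()
    return mean, std, 100 * std / mean

# Made-up MSE values, only to show the calculation.
mean, std, pct = summarize([100.0, 121.0, 81.0, 100.0])
print(f"RMSE: {mean:.2f}, RMSE_std: {std:.2f} ({pct:.1f}% of the mean)")
```

Calling the helper on the `mean_error` array after each tree count would give the three numbers compared in the bullets above.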

## Best parameters chosen

- 50 is the number of loops I chose
- 40 is the number of trees I chose

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
numLoops = 50

mean_error = np.zeros(numLoops)

# np.random.seed(42)
for idx in range(0,numLoops):
  X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
  model = RandomForestRegressor( n_estimators = 40 ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
  model.fit( X_train, y_train )
  y_pred = model.predict( X_test )
  mean_error[idx] = mean_squared_error( y_test, y_pred )

print(f'RMSE: {np.sqrt(mean_error).mean()}')
print(f'RMSE_std: {np.sqrt(mean_error).std()}')
np.sqrt(mean_error)[:50]

# Processing: Random Forest (genre)

In [None]:
X = spotify1.drop('Genre', axis = 1)
y = spotify1['Genre']

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
numLoops = 50

mean_error = np.zeros(numLoops)

# np.random.seed(42)
for idx in range(0,numLoops):
  X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
  model = RandomForestRegressor( n_estimators = 40 ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
  model.fit( X_train, y_train )
  y_pred = model.predict( X_test )
  mean_error[idx] = mean_squared_error( y_test, y_pred )

print(f'RMSE: {np.sqrt(mean_error).mean()}')
print(f'RMSE_std: {np.sqrt(mean_error).std()}')
np.sqrt(mean_error)[:50]

# Processing: Random Forest (Artist Followers)

In [None]:
X = spotify1.drop('Artist Followers', axis = 1)
y = spotify1['Artist Followers']

In [None]:
# two parameters - n_estimators (number of trees), max_depth (maximum depth of each tree)
numLoops = 50

mean_error = np.zeros(numLoops)

# np.random.seed(42)
for idx in range(0,numLoops):
  X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2 )
  model = RandomForestRegressor( n_estimators = 40 ) #n_estimators is number of trees in forest. Note: you can also choose max_depth for RFs
  model.fit( X_train, y_train )
  y_pred = model.predict( X_test )
  mean_error[idx] = mean_squared_error( y_test, y_pred )

print(f'RMSE: {np.sqrt(mean_error).mean()}')
print(f'RMSE_std: {np.sqrt(mean_error).std()}')
np.sqrt(mean_error)[:50]

In [None]:
# Decode the encoded values back to the original labels
#decoded = encoder.inverse_transform(encoded)
#print(decoded)
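
Because the same `encoder` object was fitted first on Genre and then on Song ID, it only remembers the Song ID classes, so decoding Genre with it would not work. A minimal sketch of one way around this is to keep a separate fitted encoder per column. The `encoders` dictionary and the three-row `demo` frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in frame with values taken from rows shown earlier.
demo = pd.DataFrame({
    "Genre": ["['pop']", "['pop', 'uk pop']", "['pop']"],
    "Song ID": [
        "4ZtFanR9U6ndgddUvNcjcG",
        "6PQ88X9TkUIAUIZJHW2upE",
        "3Wrjm47oTz2sjIgck11l5e",
    ],
})

# One fitted encoder per column, so each column can be decoded later.
encoders = {}
encoded = demo.copy()
for col in ["Genre", "Song ID"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(demo[col])

# Map the integer codes back to the original strings.
decoded_genre = encoders["Genre"].inverse_transform(encoded["Genre"])
print(list(decoded_genre))
```

The same dictionary can be pickled alongside the model so that predictions can be mapped back to readable labels.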