# Machine Learning with Spotify

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Silhouette_of_bare_tree_branches_under_twilight_sky.jpg/640px-Silhouette_of_bare_tree_branches_under_twilight_sky.jpg' width = 600>


Spotify has data on millions of songs. It assigns certain attributes to each song to describe the music. In machine learning, we can use these attributes (known as **features**) to train a model. Here are a few:

- **Mode**: 1=major, 0=minor
- **Tempo**: beats per minute
- **Duration**: length of the song (in milliseconds)
- **Time signature**: 4=4/4, 3=3/4

- **Acousticness**: confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- **Danceability**: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- **Energy**: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- **Instrumentalness**: predicts whether a track contains no vocals.  Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **Liveness**: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
- **Speechiness**: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
- **Valence**: measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

    

**Import Pandas**

In [1]:
import pandas as pd

**Read in the Spotify dataset: `../data/spotify_all_genres_tracks.csv`**

In [2]:
data = pd.read_csv('../data/spotify_all_genres_tracks.csv')
data.head()

Unnamed: 0,track_id,playlist_url,playlist_name,track_name,track_popularity,artist_name,album,album_cover,artist_genres,artist_popularity,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,genre
0,4Gia17DzXBhYFbYiJj6SyW,https://open.spotify.com/playlist/7qACZGMjyo64...,The Sound of Blues,Working Man,51,Otis Rush,Mourning In The Morning,https://i.scdn.co/image/ab67616d0000b273fea221...,"['blues', 'blues rock', 'chicago blues', 'elec...",41,...,1,0.0436,0.492,0.000418,0.204,0.841,103.355,147800,4,blues
1,1BjYNhg7JhVfQdxqEThBwn,https://open.spotify.com/playlist/7qACZGMjyo64...,The Sound of Blues,Long Way Home,38,"Clarence ""Gatemouth"" Brown",Long Way Home,https://i.scdn.co/image/ab67616d0000b2730e1f13...,"['blues', 'blues rock', 'memphis blues', 'mode...",33,...,0,0.038,0.91,0.048,0.12,0.425,78.033,338333,4,blues
2,2Cg3GUkhjX96nO4p2WRlIa,https://open.spotify.com/playlist/7qACZGMjyo64...,The Sound of Blues,She's A Sweet One,49,Junior Wells,"Calling All Blues - The Chief, Profile & USA R...",https://i.scdn.co/image/ab67616d0000b27399b18c...,"['blues', 'blues rock', 'chicago blues', 'elec...",41,...,1,0.0542,0.15,0.0265,0.202,0.713,122.863,181786,4,blues
3,5bC6ONDsL88snGN6QasjZH,https://open.spotify.com/playlist/7qACZGMjyo64...,The Sound of Blues,Help Me,59,Sonny Boy Williamson II,More Real Folk Blues,https://i.scdn.co/image/ab67616d0000b273b48c81...,"['acoustic blues', 'blues', 'blues rock', 'chi...",46,...,0,0.043,0.597,0.0213,0.61,0.771,114.216,188200,4,blues
4,2TKykeHeVKsBqZC8M3SKcN,https://open.spotify.com/playlist/7qACZGMjyo64...,The Sound of Blues,Take Out Some Insurance,51,Jimmy Reed,Rockin' With Reed,https://i.scdn.co/image/ab67616d0000b2739b7573...,"['blues', 'blues rock', 'chicago blues', 'elec...",42,...,1,0.0513,0.663,0.0,0.122,0.566,111.33,143332,4,blues


**Get a summary of the numerical data.**

In [12]:
data.describe()

Unnamed: 0,track_popularity,artist_popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0,9198.0
mean,48.891716,51.161883,0.595363,0.535862,5.359209,-10.706267,0.596434,0.084681,0.362225,0.227429,0.170279,0.502993,116.589396,253282.5,3.896934
std,17.501544,16.080915,0.192927,0.273231,3.558963,6.573201,0.490639,0.081614,0.368888,0.350817,0.139707,0.26594,29.270118,101973.2,0.402003
min,0.0,0.0,0.0,0.000885,0.0,-47.001,0.0,0.0,2e-06,0.0,0.0145,0.0,0.0,30333.0,0.0
25%,38.0,42.0,0.474,0.334,2.0,-13.25575,0.0,0.0382,0.030425,2e-06,0.0901,0.282,93.983,189700.8,4.0
50%,49.0,52.0,0.621,0.5785,6.0,-8.7615,1.0,0.049,0.191,0.001675,0.116,0.519,117.9225,228267.0,4.0
75%,61.0,62.0,0.744,0.757,8.0,-6.12825,1.0,0.0875,0.731,0.461,0.198,0.724,132.69725,286456.8,4.0
max,95.0,100.0,0.984,0.999,11.0,1.342,1.0,0.827,0.996,0.983,0.979,0.986,216.09,1430840.0,5.0


**List the column names.**

In [13]:
data.columns

Index(['track_id', 'playlist_url', 'playlist_name', 'track_name',
       'track_popularity', 'artist_name', 'album', 'album_cover',
       'artist_genres', 'artist_popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature',
       'genre'],
      dtype='object')

We are interested in predicting the genre of each song using the other features.

**Identify the genres that are used in this dataset.**

In [14]:
data.drop_duplicates('genre')

Unnamed: 0,track_id,playlist_url,playlist_name,track_name,track_popularity,artist_name,album,album_cover,artist_genres,artist_popularity,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,genre
0,4Gia17DzXBhYFbYiJj6SyW,https://open.spotify.com/playlist/7qACZGMjyo64...,The Sound of Blues,Working Man,51,Otis Rush,Mourning In The Morning,https://i.scdn.co/image/ab67616d0000b273fea221...,"['blues', 'blues rock', 'chicago blues', 'elec...",41,...,1,0.0436,0.492,0.000418,0.204,0.841,103.355,147800,4,blues
1034,5bu9A6uphPWg39RC3ZKeku,https://open.spotify.com/playlist/3HYK6ri0GkvR...,The Sound of Classical,"Goldberg Variations, BWV 988: Aria",64,Johann Sebastian Bach,"Bach: The Goldberg Variations, BWV 988 (1981 G...",https://i.scdn.co/image/ab67616d0000b273c7ed97...,"['baroque', 'classical', 'early music', 'germa...",75,...,0,0.0514,0.995,0.943,0.0736,0.244,130.253,184853,4,classical
2024,0IffIW3eyCx9aZ36IqOu5o,https://open.spotify.com/playlist/5EyFMotmvSfD...,The Sound of Jazz,Infant Eyes - Remastered1998/Rudy Van Gelder E...,49,Wayne Shorter,Speak No Evil,https://i.scdn.co/image/ab67616d0000b273bdd696...,"['contemporary jazz', 'contemporary post-bop',...",46,...,1,0.042,0.985,0.761,0.0945,0.119,138.689,414240,3,jazz
3026,1SyQ6t9RdRBK0QUCS6a797,https://open.spotify.com/playlist/6MXkE0uYF4Xw...,The Sound of Hip Hop,Hip Hop Hooray,65,Naughty By Nature,19 Naughty III,https://i.scdn.co/image/ab67616d0000b273afbd83...,"['east coast hip hop', 'gangster rap', 'hardco...",57,...,0,0.101,0.102,0.0,0.272,0.765,99.2,267267,4,hiphop
4050,2DB4DdfCFMw1iaR6JaR03a,https://open.spotify.com/playlist/6gS3HhOiI17Q...,The Sound of Pop,Bam Bam (feat. Ed Sheeran),83,Camila Cabello,Familia,https://i.scdn.co/image/ab67616d0000b273370ed6...,"['dance pop', 'pop', 'post-teen pop', 'uk pop']",82,...,1,0.0401,0.182,0.0,0.333,0.956,94.996,206071,4,pop
5301,0hebjXwdDFHS1kHDQ82HZr,https://open.spotify.com/playlist/0TcXdt4sbITb...,The Sound of Reggae,Jah Give Us Life,55,Wailing Souls,The Very Best Of The Wailing Souls,https://i.scdn.co/image/ab67616d0000b273a3fef7...,"['dub', 'lovers rock', 'reggae', 'roots reggae...",48,...,0,0.0578,0.0521,0.000932,0.0666,0.962,144.678,232440,4,reggae
6339,3qiyyUfYe7CRYLucrPmulD,https://open.spotify.com/playlist/7dowgSWOmvdp...,The Sound of Rock,Baba O'Riley,76,The Who,Who's Next (Deluxe Edition),https://i.scdn.co/image/ab67616d0000b273fe24dc...,"['album rock', 'art rock', 'blues rock', 'brit...",68,...,1,0.0352,0.313,0.185,0.287,0.15,117.292,300400,4,rock
7473,7nionv2ijjqUlg9m5iWPTc,https://open.spotify.com/playlist/6AzCASXpbvX5...,The Sound of House,Feel My Needs,60,WEISS,Feel My Needs,https://i.scdn.co/image/ab67616d0000b27357c3c6...,"['deep groove house', 'deep house', 'disco hou...",55,...,0,0.0475,0.001,0.861,0.0505,0.774,121.998,208525,4,electronic


In machine learning, you are interested in predicting something. In this case, it is the genre of the song. By convention, we set this "target" equal to `y`.

In [3]:
y = data.genre

We can choose which features we want to use to train the model.

**Choose several features and place them in a list, stored in a variable called `features`.**

In [4]:
features = ['energy', 'key', 'valence', 'tempo', 'acousticness']

**By convention, the feature data (the columns of `data` named in `features`) is stored in a variable called `X`.**

In [5]:
X = data[features]
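
To make the difference between the list of feature names and the feature data concrete, here is a minimal sketch on a made-up DataFrame (`toy`, `toy_features`, `toy_X`, and `toy_y` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical three-song dataset
toy = pd.DataFrame({
    "energy": [0.6, 0.1, 0.9],
    "tempo": [103.0, 78.0, 128.0],
    "genre": ["blues", "classical", "electronic"],
})

toy_features = ["energy", "tempo"]  # a plain list of column names
toy_X = toy[toy_features]           # a DataFrame holding the values of those columns
toy_y = toy["genre"]                # a Series holding the target labels

print(toy_X.shape, toy_y.shape)
```

Indexing with a list returns a DataFrame with one row per song and one column per feature, which is the shape scikit-learn expects for `X`.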

In [31]:
from sklearn.preprocessing import StandardScaler
# Normalize the feature values.
# Pass the DataFrame `X`, not `features`, which is only the list of column names.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(X)
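
`StandardScaler.fit_transform` expects the numeric feature values (such as `data[features]`), not a list of column names; passing the list of strings is what produces a `could not convert string to float` error. A minimal sketch on made-up numbers (`toy_X` is illustrative) shows what the scaler does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
toy_X = np.array([[0.2, 90.0],
                  [0.5, 120.0],
                  [0.8, 150.0]])

scaler = StandardScaler()
# Each column is shifted to mean 0 and rescaled to standard deviation 1
toy_scaled = scaler.fit_transform(toy_X)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

Decision trees split on one feature at a time using thresholds, so they do not generally need scaled inputs; scaling matters more for distance-based or gradient-based models.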

**View and summarize the feature set.**

In [6]:
X.describe()

| | energy | key | valence | tempo | acousticness |
|---|---|---|---|---|---|
| count | 9198.0 | 9198.0 | 9198.0 | 9198.0 | 9198.0 |
| mean | 0.535862 | 5.359209 | 0.502993 | 116.589396 | 0.362225 |
| std | 0.273231 | 3.558963 | 0.26594 | 29.270118 | 0.368888 |
| min | 0.000885 | 0.0 | 0.0 | 0.0 | 2e-06 |
| 25% | 0.334 | 2.0 | 0.282 | 93.983 | 0.030425 |
| 50% | 0.5785 | 6.0 | 0.519 | 117.9225 | 0.191 |
| 75% | 0.757 | 8.0 | 0.724 | 132.69725 | 0.731 |
| max | 0.999 | 11.0 | 0.986 | 216.09 | 0.996 |


In [7]:
X.head()

| | energy | key | valence | tempo | acousticness |
|---|---|---|---|---|---|
| 0 | 0.625 | 0 | 0.841 | 103.355 | 0.492 |
| 1 | 0.054 | 11 | 0.425 | 78.033 | 0.91 |
| 2 | 0.483 | 1 | 0.713 | 122.863 | 0.15 |
| 3 | 0.436 | 5 | 0.771 | 114.216 | 0.597 |
| 4 | 0.288 | 9 | 0.566 | 111.33 | 0.663 |


We will use the machine learning library **Scikit-learn** (`sklearn`), specifically its decision tree classifier, to train the model.

**Import `DecisionTreeClassifier` from the sklearn library.**

In [8]:
from sklearn.tree import DecisionTreeClassifier

**Assign the model to an object called `spotify_model`. Specify a number for `random_state` to ensure the same results on each run.**

In [9]:
# Define the model
spotify_model = DecisionTreeClassifier(random_state=1)

**Train the model using the features and target data.**

In [10]:
spotify_model.fit(X, y)

DecisionTreeClassifier(random_state=1)
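
After fitting, it can help to look at how large the tree grew. The sketch below uses synthetic data from `make_classification` as a stand-in (the `demo_*` names and the data are assumptions, not the Spotify dataset); `get_depth()` and `get_n_leaves()` can be called the same way on `spotify_model`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Spotify dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)

# No size limit is set, so the tree keeps splitting until its leaves are pure
demo_tree = DecisionTreeClassifier(random_state=1)
demo_tree.fit(demo_X, demo_y)

depth = demo_tree.get_depth()
n_leaves = demo_tree.get_n_leaves()
train_accuracy = demo_tree.score(demo_X, demo_y)
print(f"depth={depth}, leaves={n_leaves}, training accuracy={train_accuracy}")
```

An unrestricted tree can memorize its training rows, which is why accuracy measured on those same rows tends to look very high.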

**Make predictions for the first 5 songs of the dataset.**

In [11]:
print("Making predictions for the following 5 songs:")
print(X.head())
print("The predictions are")
print(spotify_model.predict(X.head()))
predicted_genres = spotify_model.predict(X)
print(predicted_genres)

Making predictions for the following 5 songs:
   energy  key  valence    tempo  acousticness
0   0.625    0    0.841  103.355         0.492
1   0.054   11    0.425   78.033         0.910
2   0.483    1    0.713  122.863         0.150
3   0.436    5    0.771  114.216         0.597
4   0.288    9    0.566  111.330         0.663
The predictions are
['blues' 'blues' 'blues' 'blues' 'blues']
['blues' 'blues' 'blues' ... 'electronic' 'electronic' 'electronic']


**Check the accuracy of the model's predictions.**

In [12]:
data['genre'].head()

0    blues
1    blues
2    blues
3    blues
4    blues
Name: genre, dtype: object

Since we are training and testing the model on thousands of songs, we need an efficient way of measuring the model's accuracy.

**Import the `accuracy_score` function from Scikit-learn.**

In [13]:
from sklearn.metrics import accuracy_score

**Run the model on all of the songs in the dataset.**

In [14]:
predicted_genres = spotify_model.predict(X)

**Measure the accuracy of the model.**

In [15]:
accuracy = accuracy_score(y, predicted_genres)
print(accuracy)

0.9921722113502935
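
Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (`true_genres` and `guessed_genres` are illustrative) compares a hand calculation with `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five songs
true_genres = ["blues", "jazz", "rock", "pop", "jazz"]
guessed_genres = ["blues", "rock", "rock", "pop", "blues"]

# Count the positions where prediction and truth agree
matches = sum(t == g for t, g in zip(true_genres, guessed_genres))
manual_accuracy = matches / len(true_genres)

library_accuracy = accuracy_score(true_genres, guessed_genres)
print(manual_accuracy, library_accuracy)
```

Keep in mind that the score above was computed on the same songs the model was trained on, so it says little about how the model handles songs it has not seen.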


## Training and Validating Your Model

Properly training your model requires both training and validating it with two different subsets of the data.

**Import the `train_test_split` function from Scikit-learn.**

In [16]:
from sklearn.model_selection import train_test_split

**Split data into training and validation data, for both the features (`X`) and the target (`y`).**

In [17]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
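
By default `train_test_split` shuffles the rows and holds out 25% of them for validation. The sketch below shows this on a tiny made-up dataset; the `stratify` argument is an optional addition not used in this notebook, which keeps the genre proportions the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 songs, 2 features each, 2 evenly represented genres
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.array(["blues", "jazz"] * 4)

# Default test_size is 0.25, so 2 of the 8 rows are held out
tr_X, va_X, tr_y, va_y = train_test_split(
    toy_X, toy_y, random_state=0, stratify=toy_y
)

print(tr_X.shape, va_X.shape)
print(sorted(va_y.tolist()))
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in each subset on every run.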

**View the subsets.**

In [18]:
train_X.head()

| | energy | key | valence | tempo | acousticness |
|---|---|---|---|---|---|
| 978 | 0.566 | 7 | 0.827 | 112.428 | 0.563 |
| 2530 | 0.424 | 1 | 0.641 | 156.048 | 0.412 |
| 4100 | 0.625 | 7 | 0.423 | 120.038 | 0.0774 |
| 7470 | 0.797 | 2 | 0.683 | 152.87 | 0.168 |
| 2101 | 0.277 | 0 | 0.322 | 90.38 | 0.814 |


**Assign the model to an object called `spotify_model`. Choose a maximum number of leaf nodes and specify a number for `random_state` to ensure the same results on each run.**

In [19]:
# Define model
spotify_model = DecisionTreeClassifier(max_leaf_nodes=5000, random_state=0)

**Train the model on the training subset.**

In [20]:
# Fit model
spotify_model.fit(train_X, train_y)

DecisionTreeClassifier(max_leaf_nodes=5000, random_state=0)

**Use the new model to predict the genres of the songs in the validation subset.**

In [21]:
val_predictions = spotify_model.predict(val_X)

**Score the accuracy of the model's predictions.**

In [22]:
accuracy = accuracy_score(val_y, val_predictions)

print(accuracy)

0.3804347826086957


## Improving the model

### Tree depth

We can decide how complex we want the decision tree to be by specifying the number of leaf nodes. Limiting the number of leaf nodes can help control the complexity of the tree, prevent overfitting, and improve generalization on unseen data.
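
The effect of the leaf limit can be sketched by comparing training and validation accuracy for a few settings. The example uses deliberately noisy synthetic data (`flip_y=0.2` randomly reassigns about a fifth of the labels); the data and the `syn_*` / `s_*` names are assumptions for illustration, not the Spotify dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data
syn_X, syn_y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, flip_y=0.2, random_state=0
)
s_train_X, s_val_X, s_train_y, s_val_y = train_test_split(
    syn_X, syn_y, random_state=0
)

results = {}
for leaves in [2, 20, None]:  # None lets the tree grow without a leaf limit
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    tree.fit(s_train_X, s_train_y)
    # Store (training accuracy, validation accuracy) for this setting
    results[leaves] = (
        tree.score(s_train_X, s_train_y),
        tree.score(s_val_X, s_val_y),
    )
    print(leaves, results[leaves])
```

The unrestricted tree fits its training rows perfectly yet scores lower on the held-out rows, which is the signature of overfitting; a very small tree scores lower on the training rows too, which points toward underfitting.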

<center><img src = '../imgs/charlie.jpeg' width = 600></center>


Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Overfitting_svg.svg/640px-Overfitting_svg.svg.png' width = 400>

**Assign a value to the variable `max_leaf_nodes`.**

In [23]:
max_leaf_nodes = 20

**Retrain the model with the maximum number of leaf nodes and measure its accuracy on the validation subset.**

In [24]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Create and fit the DecisionTreeClassifier (random_state fixed for reproducible results)
spotify_model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
spotify_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = spotify_model.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.46


In [25]:
model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
model.fit(train_X, train_y)
y_pred = model.predict(val_X)
accuracy = accuracy_score(val_y, y_pred)
print(accuracy)


0.4573913043478261


**Define a function called `get_acc` that accepts `max_leaf_nodes`, `train_X`, `val_X`, `train_y`, and `val_y` as arguments and returns the accuracy of the model.**

In [26]:
def get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    y_pred = model.predict(val_X)
    accuracy = accuracy_score(val_y, y_pred)
    return accuracy

**Test the function `get_acc`.**

In [27]:
get_acc(1000, train_X, val_X, train_y, val_y)

0.4252173913043478

**Create a `for` loop that runs the `get_acc` function for 50, 100, and 200 `max_leaf_nodes` and prints the accuracy for each model run.**

In [28]:
for max_leaf_nodes in [50, 100, 200]:
    my_acc = get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Accuracy:  %f" %(max_leaf_nodes, my_acc))

Max leaf nodes: 50  		 Accuracy:  0.471739
Max leaf nodes: 100  		 Accuracy:  0.480870
Max leaf nodes: 200  		 Accuracy:  0.468696


**Modify the `for` loop to fine-tune `max_leaf_nodes` and find its optimal value.**

In [29]:
for max_leaf_nodes in range(50,100,1):
    my_acc = get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Accuracy:  %f" %(max_leaf_nodes, my_acc))

Max leaf nodes: 50  		 Accuracy:  0.471739
Max leaf nodes: 51  		 Accuracy:  0.471739
Max leaf nodes: 52  		 Accuracy:  0.473913
Max leaf nodes: 53  		 Accuracy:  0.475217
Max leaf nodes: 54  		 Accuracy:  0.475652
Max leaf nodes: 55  		 Accuracy:  0.476957
Max leaf nodes: 56  		 Accuracy:  0.480000
Max leaf nodes: 57  		 Accuracy:  0.480000
Max leaf nodes: 58  		 Accuracy:  0.479565
Max leaf nodes: 59  		 Accuracy:  0.478696
Max leaf nodes: 60  		 Accuracy:  0.478696
Max leaf nodes: 61  		 Accuracy:  0.480000
Max leaf nodes: 62  		 Accuracy:  0.480000
Max leaf nodes: 63  		 Accuracy:  0.480870
Max leaf nodes: 64  		 Accuracy:  0.479565
Max leaf nodes: 65  		 Accuracy:  0.481304
Max leaf nodes: 66  		 Accuracy:  0.488696
Max leaf nodes: 67  		 Accuracy:  0.487826
Max leaf nodes: 68  		 Accuracy:  0.486522
Max leaf nodes: 69  		 Accuracy:  0.486522
Max leaf nodes: 70  		 Accuracy:  0.485217
Max leaf nodes: 71  		 Accuracy:  0.485652
Max leaf nodes: 72  		 Accuracy:  0.486087
Max leaf no
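
Rather than reading the best value off a long printout, the scores can be collected in a dictionary and the maximum picked programmatically. This is a minimal sketch on synthetic stand-in data (the `demo_*` / `d_*` names and the range 2 to 30 are assumptions); `get_acc` is repeated so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Train a tree with the given leaf limit and return validation accuracy."""
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return accuracy_score(val_y, model.predict(val_X))

# Synthetic stand-in data (not the Spotify dataset)
demo_X, demo_y = make_classification(
    n_samples=600, n_features=5, n_informative=3, random_state=0
)
d_train_X, d_val_X, d_train_y, d_val_y = train_test_split(
    demo_X, demo_y, random_state=0
)

# Map each candidate leaf limit to its validation accuracy
scores = {n: get_acc(n, d_train_X, d_val_X, d_train_y, d_val_y) for n in range(2, 31)}

# Pick the leaf limit with the highest validation accuracy
best_leaves = max(scores, key=scores.get)
print(best_leaves, scores[best_leaves])
```

Because the winner is chosen using the validation subset, its score is a slightly optimistic estimate; a separate test subset would give a cleaner final number.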

### Improve Your Model

Continue to adjust the parameters of your model to see if you can improve its accuracy. Consider changing the features that are used to train the model and finding a balance between underfitting and overfitting the model.

## Random Forest

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/D%C3%BClmen%2C_Kirchspiel%2C_Bauerschaft_B%C3%B6rnste_--_2017_--_6919.jpg/640px-D%C3%BClmen%2C_Kirchspiel%2C_Bauerschaft_B%C3%B6rnste_--_2017_--_6919.jpg' width = 600>

A Random Forest Classifier is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model for classification tasks. It is one of the most popular machine learning algorithms due to its simplicity, versatility, and effectiveness.

In [30]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(train_X, train_y)
y_preds = forest_model.predict(val_X)
print(accuracy_score(val_y, y_preds))


0.49956521739130433
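
Two things worth exploring in a random forest are the number of trees (`n_estimators`) and which features the trees relied on (`feature_importances_`). The sketch below uses synthetic stand-in data; the `rf_*` and `demo_forest` names and the choice of 50 trees are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Spotify dataset)
rf_X, rf_y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)
rf_train_X, rf_val_X, rf_train_y, rf_val_y = train_test_split(
    rf_X, rf_y, random_state=0
)

# Build a forest of 50 trees, each trained on a bootstrap sample of the rows
demo_forest = RandomForestClassifier(n_estimators=50, random_state=1)
demo_forest.fit(rf_train_X, rf_train_y)

n_trees = len(demo_forest.estimators_)
importances = demo_forest.feature_importances_  # one value per feature, summing to 1
forest_val_acc = demo_forest.score(rf_val_X, rf_val_y)

print(n_trees, importances.round(3), forest_val_acc)
```

On the Spotify data, `forest_model.feature_importances_` paired with the `features` list would suggest which attributes contribute most to the genre predictions.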
