# **Unsupervised Model Example**
#### **Description**: In this example, we use a dataset of more than 1 million songs extracted from Spotify. Each row contains metrics about a song; some are produced by Spotify's Audio Analysis engine, others are extracted from the track itself. Our goal is to cluster the songs to identify common genres.
##### **NOTE:** K-Means measures similarity with Euclidean distance, which is not meaningful for categorical features, so we'll remove those features before fitting.

### **Step 1. Import Required Libraries**

In [34]:
# data wrangling libraries
import pandas as pd
# model fit libraries
from sklearn.cluster import KMeans

### **Step 2. Preview Training Data and Schema**
##### Whenever you're unfamiliar with a dataset, it's wise to review its schema and understand how the features relate to each other.

In [35]:
training_data = pd.read_parquet('clean_songs_dataset.parquet')
training_data.reset_index(drop=True, inplace=True)
training_data

| | track_name | artist1 | artist2 | artist3 | artist4 | artist5 | album_name | release_date | danceability | energy | track_popularity | acousticness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sk8er Boi | Avril Lavigne | | | | | Let Go | 2002-06-04 | 0.487 | 0.900 | 73.0 | 0.000068 | 0.484 | 149.937 |
| 1 | Paparazzi | Lady Gaga | | | | | The Fame | 2008-01-01 | 0.762 | 0.692 | 70.0 | 0.113000 | 0.397 | 114.906 |
| 2 | Sorry | Justin Bieber | | | | | Purpose (Deluxe) | 2015-11-13 | 0.654 | 0.760 | 78.0 | 0.079700 | 0.410 | 99.945 |
| 3 | S&M | Rihanna | | | | | Loud | 2010-11-16 | 0.767 | 0.682 | 70.0 | 0.011300 | 0.833 | 127.975 |
| 4 | Shake It Off | Taylor Swift | | | | | 1989 (Deluxe) | 2014-01-01 | 0.647 | 0.800 | 78.0 | 0.064700 | 0.942 | 160.078 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1157476 | Wrong Move | Olivia Holt | R3HAB | THRDL!FE | | | The Wave | 2018-08-25 | 0.722 | 0.706 | 0.0 | 0.161000 | 0.517 | 124.013 |
| 1157477 | Howl At The Moon - Radio Edit | Stadiumx | Taylr Renee | | | | Nicky Romero presents Miami 2014 | 2014-03-17 | 0.514 | 0.934 | 28.0 | 0.052500 | 0.182 | 127.953 |
| 1157478 | Let The Bass Kick In Miami Girl - Radio Edit | Chuckie | LMFAO | | | | Let The Bass Kick In Miami Girl | 2009-12-06 | 0.762 | 0.937 | 0.0 | 0.050400 | 0.546 | 128.021 |
| 1157479 | You Make Me | Avicii | | | | | True (Bonus Edition) | 2013-09-16 | 0.586 | 0.727 | 51.0 | 0.002470 | 0.496 | 124.989 |
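
Beyond eyeballing the preview, a few pandas calls can summarize the schema programmatically. The sketch below is illustrative only: it rebuilds a tiny `sample` DataFrame by hand from the first three rows shown above (with a subset of the columns) so it runs without the parquet file. The names `sample`, `missing_per_column`, and `numeric_columns` are our own and do not appear in the original notebook.

```python
import pandas as pd

# Hand-copied subset of the preview above (assumption: three rows are enough
# to illustrate the calls; the real notebook would use training_data instead).
sample = pd.DataFrame(
    {
        "track_name": ["Sk8er Boi", "Paparazzi", "Sorry"],
        "artist1": ["Avril Lavigne", "Lady Gaga", "Justin Bieber"],
        "artist2": [None, None, None],
        "release_date": ["2002-06-04", "2008-01-01", "2015-11-13"],
        "danceability": [0.487, 0.762, 0.654],
        "energy": [0.900, 0.692, 0.760],
        "track_popularity": [73.0, 70.0, 78.0],
        "tempo": [149.937, 114.906, 99.945],
    }
)

# Column dtypes: text columns versus numeric metrics.
print(sample.dtypes)

# Missing values per column (artist2 is empty for solo tracks).
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Numeric columns only; these are the candidates for K-Means.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)

# Value ranges differ widely between features (0-1 metrics vs. tempo in BPM).
print(sample[numeric_columns].describe().loc[["min", "max"]])
```

On the full dataset, the same calls applied to `training_data` would show which columns are text and how different the numeric ranges are.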


In [36]:
with open("clean_songs_dataset_schema.txt", "r") as file:
    for line in file:
        # each line already ends with a newline, so suppress print's own
        print(line, end="")

TRAINING DATA SCHEMA

track_name - name of the song.
artist1 - first artist featured on the song.
artist2 - second artist featured on the song.
artist3 - third artist featured on the song.
artist4 - fourth artist featured on the song.
artist5 - fifth artist featured on the song.
album_name - name of the album the song is on.
release_date - date the song was released on Spotify.
danceability - metric measuring how groovy a song is (1.0 means 70s Disco-level groovy, 0.0 means Beethoven-level groovy).
energy - metric measuring how epic the song is (1.0 means 2011 Skrillex, 0.0 means Frank Ocean).
track_popularity - metric measuring how popular the song is based off number of streams, song downloads, and other factors (100.0 means Shape of You by Ed Sheeran, 0.0 means Late 1960s Queen).
acousticness - probability of a given song being acoustic (1.0 means it is acoustic, 0.0 means it isn't).
valence - metric measuring how positive or negative the vibes are (1.0 means Pharrell

### **Step 3. Perform Feature Engineering**
##### As noted above, K-Means relies on Euclidean distance, which is not meaningful for categorical features, so we remove the text columns and the release date and keep only the numeric metrics.

In [37]:
training_data.drop(columns=['track_name', 'artist1', 'artist2', 'artist3', 'artist4', 'artist5', 'album_name', 'release_date'], inplace=True)
training_data

| | danceability | energy | track_popularity | acousticness | valence | tempo |
|---|---|---|---|---|---|---|
| 0 | 0.487 | 0.900 | 73.0 | 0.000068 | 0.484 | 149.937 |
| 1 | 0.762 | 0.692 | 70.0 | 0.113000 | 0.397 | 114.906 |
| 2 | 0.654 | 0.760 | 78.0 | 0.079700 | 0.410 | 99.945 |
| 3 | 0.767 | 0.682 | 70.0 | 0.011300 | 0.833 | 127.975 |
| 4 | 0.647 | 0.800 | 78.0 | 0.064700 | 0.942 | 160.078 |
| ... | ... | ... | ... | ... | ... | ... |
| 1157476 | 0.722 | 0.706 | 0.0 | 0.161000 | 0.517 | 124.013 |
| 1157477 | 0.514 | 0.934 | 28.0 | 0.052500 | 0.182 | 127.953 |
| 1157478 | 0.762 | 0.937 | 0.0 | 0.050400 | 0.546 | 128.021 |
| 1157479 | 0.586 | 0.727 | 51.0 | 0.002470 | 0.496 | 124.989 |
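
One optional step that this notebook does not perform is feature scaling. Because K-Means uses Euclidean distance, features on large scales (`tempo` in BPM, `track_popularity` from 0 to 100) can outweigh features that live between 0 and 1. The sketch below shows one common way to address that with scikit-learn's `StandardScaler`, using the first five rows from the table above. The names `features` and `scaled` are ours, and whether scaling actually improves the genre clusters is an assumption that would need to be checked on the real data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of three features from the table above (hand-copied sample).
features = pd.DataFrame(
    {
        "danceability": [0.487, 0.762, 0.654, 0.767, 0.647],
        "track_popularity": [73.0, 70.0, 78.0, 70.0, 78.0],
        "tempo": [149.937, 114.906, 99.945, 127.975, 160.078],
    }
)

# Rescale every column to zero mean and unit variance so that no single
# feature dominates the distance computation.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

print(scaled.round(3))
```

If a model is fit on scaled data, its cluster centers are also expressed in scaled units; `scaler.inverse_transform` maps them back to the original units for interpretation.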


### **Step 4. Fit Training Data to Model**
##### When using the K-Means algorithm, you need to specify the number of clusters (the "K" in K-Means). This is typically chosen using an elbow chart or a silhouette chart. In this case, we'll arbitrarily choose 10 clusters.
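
For reference, here is a minimal sketch of how the elbow and silhouette approaches could be computed. It runs on synthetic data from `make_blobs` with four well-separated groups rather than on the songs dataset, and the candidate range of K, the `random_state`, and all variable names are illustrative assumptions, not part of this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated 2-D blobs (assumption).
X, _ = make_blobs(
    n_samples=400,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertias = {}     # elbow chart: within-cluster sum of squares per K
silhouettes = {}  # silhouette chart: mean silhouette score per K
for k in range(2, 9):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = candidate.inertia_
    silhouettes[k] = silhouette_score(X, candidate.labels_)

# Elbow: look for the K after which inertia stops dropping sharply.
for k, inertia in inertias.items():
    print(k, round(inertia, 1), round(silhouettes[k], 3))

# Silhouette: pick the K with the highest mean score.
best_k = max(silhouettes, key=silhouettes.get)
print("best K by silhouette:", best_k)
```

On a dataset with over a million rows, the silhouette score is expensive to compute in full; `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subset instead.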

In [38]:
# fit K-Means with 10 clusters on the numeric features
model = KMeans(n_clusters=10).fit(training_data)
# one row per cluster center, labeled with the feature names used for fitting
pd.DataFrame(model.cluster_centers_, columns=training_data.columns)

| | danceability | energy | track_popularity | acousticness | valence | tempo |
|---|---|---|---|---|---|---|
| 0 | 0.580598 | 0.639906 | 10.369624 | 0.285435 | 0.502879 | 147.857206 |
| 1 | 0.652142 | 0.628147 | 49.416058 | 0.288513 | 0.494944 | 126.205565 |
| 2 | 0.493723 | 0.632415 | 17.195867 | 0.317653 | 0.526726 | 180.004867 |
| 3 | 0.636806 | 0.596579 | 27.620017 | 0.352649 | 0.517629 | 104.227272 |
| 4 | 0.502312 | 0.428527 | 27.311075 | 0.543599 | 0.383032 | 77.229635 |
| 5 | 0.607107 | 0.569559 | 48.547577 | 0.363743 | 0.48427 | 91.327281 |
| 6 | 0.632768 | 0.630118 | 28.646338 | 0.291148 | 0.480497 | 130.120818 |
| 7 | 0.655571 | 0.631645 | 6.318527 | 0.290648 | 0.510002 | 122.123736 |
| 8 | 0.548603 | 0.664789 | 41.584333 | 0.263371 | 0.510874 | 161.753434 |
| 9 | 0.590108 | 0.544861 | 5.857219 | 0.411923 | 0.493597 | 89.962063 |
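
The cluster centers describe each group, but to work toward the goal of identifying genres we also need to know which cluster each song belongs to. The sketch below shows one way to do that with `labels_` and `predict`. It uses six rows hand-copied from the preview in Step 2 and only 2 clusters, since six rows cannot support 10; `songs`, `sample_model`, and `new_song` are illustrative names that do not appear in the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

feature_columns = ["danceability", "energy", "track_popularity",
                   "acousticness", "valence", "tempo"]

# First five rows of the preview in Step 2 (hand-copied sample).
songs = pd.DataFrame(
    {
        "track_name": ["Sk8er Boi", "Paparazzi", "Sorry", "S&M", "Shake It Off"],
        "danceability": [0.487, 0.762, 0.654, 0.767, 0.647],
        "energy": [0.900, 0.692, 0.760, 0.682, 0.800],
        "track_popularity": [73.0, 70.0, 78.0, 70.0, 78.0],
        "acousticness": [0.000068, 0.113, 0.0797, 0.0113, 0.0647],
        "valence": [0.484, 0.397, 0.410, 0.833, 0.942],
        "tempo": [149.937, 114.906, 99.945, 127.975, 160.078],
    }
)

# Fit a small model (K=2 is an assumption chosen for this tiny sample).
sample_model = KMeans(n_clusters=2, n_init=10, random_state=0)
sample_model.fit(songs[feature_columns])

# labels_ holds the cluster index assigned to each training row.
songs["cluster"] = sample_model.labels_
print(songs[["track_name", "cluster"]])

# Per-cluster feature means, which should match the fitted cluster centers.
print(songs.groupby("cluster")[feature_columns].mean())

# predict assigns an unseen song ("You Make Me", last row of the preview).
new_song = pd.DataFrame(
    [[0.586, 0.727, 51.0, 0.00247, 0.496, 124.989]], columns=feature_columns
)
predicted = sample_model.predict(new_song)
print(predicted)
```

With the full model from Step 4, the same pattern would be `model.labels_` for the training rows and `model.predict(...)` for new songs, keeping the track names in a separate DataFrame so they can be joined back to the labels.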
