# Data Science Fundamentals - Assignment 3
##### By Alexandra de Carvalho, nmec 93346  

This work explores dimensionality reduction techniques, namely Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP), as tools for exploring and visualizing real-world data. Dimensionality reduction combines many features into a smaller number of derived features, which helps interpretability and visualization and also reduces space and time complexity. Both methods are unsupervised: the target value (in this case, the genre) is not used to build the combined features. In PCA, original features that account for more of the variance in the data carry more weight in the leading components.

The tasks at hand are to use PCA and UMAP to perform, visualize and discuss clustering of the data. The features used are audio features of songs, provided by Echonest (Spotify), complemented by metadata associated with the tracks, such as their genre.

In [117]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import umap

import plotly.express as px

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

%matplotlib inline

### Exploratory Data Analysis

The dataset comes in several files. The echonest.csv file contains audio features for 13129 songs. The tracks.csv file contains metadata about each track, such as title, artist, genre (which we are interested in), tags and play counts. We load both into pandas DataFrames. Each file has several header rows, so the columns of each DataFrame form a multi-level index whose levels are given by the header parameter of the read_csv() method. The tracks file is processed as suggested in the documentation of the dataset.
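As a minimal sketch of how the header parameter produces multi-level columns, the snippet below reads a tiny in-memory CSV laid out like echonest.csv. The text is a hypothetical three-column excerpt (its values are copied from tracks 2 and 10 of the dataset), not the real file, and the name `demo` is only illustrative.

```python
import io
import pandas as pd

# Hypothetical miniature file: three header rows, one index-name row, two data rows.
csv_text = (
    ",echonest,echonest,echonest\n"
    ",audio_features,audio_features,social_features\n"
    ",acousticness,tempo,song_hotttnesss\n"
    "track_id,,,\n"
    "2,0.416675,165.922,0.0\n"
    "10,0.951670,111.562,0.354516\n"
)

# header=[0, 1, 2] turns the first three rows into a three-level column index;
# index_col=0 uses the first column (the track ids) as the row index.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1, 2])

# Selecting the first two levels returns the sub-table for that group of features.
audio = demo['echonest', 'audio_features']
print(audio)
```

Selecting with a partial key such as `demo['echonest', 'audio_features']` is the same pattern used on the full DataFrame in the cells that follow.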

In [4]:
echonest = pd.read_csv('data/fma_metadata/echonest.csv', index_col=0, header=[0, 1, 2])
echonest

,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest,echonest
,audio_features,audio_features,audio_features,audio_features,audio_features,audio_features,audio_features,audio_features,metadata,metadata,...,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features,temporal_features
track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,album_date,album_name,...,214,215,216,217,218,219,220,221,222,223
2,0.416675,0.675894,0.634476,0.010628,0.177647,0.159310,165.922,0.576661,,,...,-1.992303,6.805694,0.233070,0.192880,0.027455,0.06408,3.676960,3.61288,13.316690,262.929749
3,0.374408,0.528643,0.817461,0.001851,0.105880,0.461818,126.957,0.269240,,,...,-1.582331,8.889308,0.258464,0.220905,0.081368,0.06413,6.082770,6.01864,16.673548,325.581085
5,0.043567,0.745566,0.701470,0.000697,0.373143,0.124595,100.260,0.621661,,,...,-2.288358,11.527109,0.256821,0.237820,0.060122,0.06014,5.926490,5.86635,16.013849,356.755737
10,0.951670,0.658179,0.924525,0.965427,0.115474,0.032985,111.562,0.963590,2008-03-11,Constant Hitmaker,...,-3.662988,21.508228,0.283352,0.267070,0.125704,0.08082,8.414010,8.33319,21.317064,483.403809
134,0.452217,0.513238,0.560410,0.019443,0.096567,0.525519,114.290,0.894072,,,...,-1.452696,2.356398,0.234686,0.199550,0.149332,0.06440,11.267070,11.20267,26.454180,751.147705
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124857,0.007592,0.790364,0.719288,0.853114,0.720715,0.082550,141.332,0.890461,,,...,-1.721207,4.686078,0.213789,0.208800,0.007911,0.06395,2.040730,1.97678,8.144532,147.040405
124862,0.041498,0.843077,0.536496,0.865151,0.547949,0.074001,101.975,0.476845,,,...,-0.647897,1.282306,0.214586,0.181860,0.011247,0.06240,0.922360,0.85996,1.794739,6.321268
124863,0.000124,0.609686,0.895136,0.846624,0.632903,0.051517,129.996,0.496667,,,...,-0.771613,1.623510,0.180471,0.128185,0.010103,0.06222,2.251160,2.18894,5.578341,89.180328
124864,0.327576,0.574426,0.548327,0.452867,0.075928,0.033388,142.009,0.569274,,,...,-2.054143,7.927149,0.250178,0.219205,0.014851,0.06390,1.487440,1.42354,2.173092,12.503966


In [33]:
echonest['echonest', 'metadata'].head()

track_id,album_date,album_name,artist_latitude,artist_location,artist_longitude,artist_name,release
2,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life
3,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life
5,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life
10,2008-03-11,Constant Hitmaker,39.9523,"Philadelphia, PA, US",-75.1624,Kurt Vile,Constant Hitmaker
134,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life


In [17]:
pd.DataFrame({'Missing Data': 100* echonest['echonest', 'metadata'].isna().sum() / len(echonest['echonest', 'metadata']), 'Type': echonest['echonest', 'metadata'].dtypes})

Column,Missing Data (%),Type
album_date,79.305355,object
album_name,78.124762,object
artist_latitude,25.584584,float64
artist_location,25.584584,object
artist_longitude,25.584584,float64
artist_name,0.0,object
release,0.0,object


In [34]:
echonest['echonest', 'audio_features'].head()

track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence
2,0.416675,0.675894,0.634476,0.010628,0.177647,0.15931,165.922,0.576661
3,0.374408,0.528643,0.817461,0.001851,0.10588,0.461818,126.957,0.26924
5,0.043567,0.745566,0.70147,0.000697,0.373143,0.124595,100.26,0.621661
10,0.95167,0.658179,0.924525,0.965427,0.115474,0.032985,111.562,0.96359
134,0.452217,0.513238,0.56041,0.019443,0.096567,0.525519,114.29,0.894072


In [18]:
pd.DataFrame({'Missing Data': 100* echonest['echonest', 'audio_features'].isna().sum() / len(echonest['echonest', 'audio_features']), 'Type': echonest['echonest', 'audio_features'].dtypes})

Column,Missing Data (%),Type
acousticness,0.0,float64
danceability,0.0,float64
energy,0.0,float64
instrumentalness,0.0,float64
liveness,0.0,float64
speechiness,0.0,float64
tempo,0.0,float64
valence,0.0,float64


In [35]:
echonest['echonest', 'social_features'].head()

track_id,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss
2,0.38899,0.38674,0.40637,0.0,0.0
3,0.38899,0.38674,0.40637,0.0,0.0
5,0.38899,0.38674,0.40637,0.0,0.0
10,0.557339,0.614272,0.798387,0.005158,0.354516
134,0.38899,0.38674,0.40637,0.0,0.0


In [19]:
pd.DataFrame({'Missing Data': 100* echonest['echonest', 'social_features'].isna().sum() / len(echonest['echonest', 'social_features']), 'Type': echonest['echonest', 'social_features'].dtypes})

Column,Missing Data (%),Type
artist_discovery,0.0,float64
artist_familiarity,0.0,float64
artist_hotttnesss,0.0,float64
song_currency,0.0,float64
song_hotttnesss,0.0,float64


In [36]:
echonest['echonest', 'ranks'].head()

track_id,artist_discovery_rank,artist_familiarity_rank,artist_hotttnesss_rank,song_currency_rank,song_hotttnesss_rank
2,,,,,
3,,,,,
5,,,,,
10,2635.0,2544.0,397.0,115691.0,67609.0
134,,,,,


In [20]:
pd.DataFrame({'Missing Data': 100* echonest['echonest', 'ranks'].isna().sum() / len(echonest['echonest', 'ranks']), 'Type': echonest['echonest', 'ranks'].dtypes})

Column,Missing Data (%),Type
artist_discovery_rank,78.482748,float64
artist_familiarity_rank,78.482748,float64
artist_hotttnesss_rank,78.490365,float64
song_currency_rank,84.515195,float64
song_hotttnesss_rank,83.197502,float64


In [22]:
tracks = pd.read_csv('data/fma_metadata/tracks.csv', index_col=0, header=[0, 1])

COLUMNS = [('track', 'tags'), ('album', 'tags'), ('artist', 'tags'), ('track', 'genres'), ('track', 'genres_all')]
for column in COLUMNS:
    tracks[column] = tracks[column].map(ast.literal_eval)

COLUMNS = [('track', 'date_created'), ('track', 'date_recorded'), ('album', 'date_created'), ('album', 'date_released'), ('artist', 'date_created'), ('artist', 'active_year_begin'), ('artist', 'active_year_end')]
for column in COLUMNS:
    tracks[column] = pd.to_datetime(tracks[column])

SUBSETS = ('small', 'medium', 'large')
tracks['set', 'subset'] = tracks['set', 'subset'].astype(pd.CategoricalDtype(categories=SUBSETS, ordered=True))

COLUMNS = [('track', 'genre_top'), ('track', 'license'), ('album', 'type'), ('album', 'information'), ('artist', 'bio')]
for column in COLUMNS:
    tracks[column] = tracks[column].astype('category')

tracks

,album,album,album,album,album,album,album,album,album,album,...,track,track,track,track,track,track,track,track,track,track
track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,...,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],...,,4656,en,Attribution-NonCommercial-ShareAlike 3.0 International,1293,,3,,[],Food
3,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],...,,1470,en,Attribution-NonCommercial-ShareAlike 3.0 International,514,,4,,[],Electric Ave
5,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],...,,1933,en,Attribution-NonCommercial-ShareAlike 3.0 International,1151,,6,,[],This World
10,0,2008-11-26 01:45:08,2008-02-06,,4,6,,47632,,[],...,,54881,en,Attribution-NonCommercial-NoDerivatives (aka Music Sharing) 3.0 International,50135,,1,,[],Freeway
20,0,2008-11-26 01:45:05,2009-01-06,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],...,,978,en,Attribution-NonCommercial-NoDerivatives (aka Music Sharing) 3.0 International,361,,3,,[],Spiritual Level
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155316,0,2017-03-30 15:20:35,2017-02-17,,0,22940,"<p>A live performance at Monty Hall on Feb 17, 2017 on a bill with Screaming Females. For the full video of their performance, please visit https://vimeo.com/209737167</p>",1506,Monty Hall,[],...,,122,,Creative Commons Attribution-NonCommercial-NoDerivatives 4.0,102,,3,,[],The Auger
155317,0,2017-03-30 15:20:35,2017-02-17,,0,22940,"<p>A live performance at Monty Hall on Feb 17, 2017 on a bill with Screaming Females. For the full video of their performance, please visit https://vimeo.com/209737167</p>",1506,Monty Hall,[],...,,194,,Creative Commons Attribution-NonCommercial-NoDerivatives 4.0,165,,4,,[],Let's Skin Ruby
155318,0,2017-03-30 15:20:35,2017-02-17,,0,22940,"<p>A live performance at Monty Hall on Feb 17, 2017 on a bill with Screaming Females. For the full video of their performance, please visit https://vimeo.com/209737167</p>",1506,Monty Hall,[],...,,214,,Creative Commons Attribution-NonCommercial-NoDerivatives 4.0,168,,6,,[],My House Smells Like Kim Deal/Pulp
155319,0,2017-03-30 15:20:35,2017-02-17,,0,22940,"<p>A live performance at Monty Hall on Feb 17, 2017 on a bill with Screaming Females. For the full video of their performance, please visit https://vimeo.com/209737167</p>",1506,Monty Hall,[],...,,336,,Creative Commons Attribution-NonCommercial-NoDerivatives 4.0,294,,5,,[],The Man With Two Mouths


In [23]:
tracks['track'].head()

track_id,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,256000,0,,2008-11-26 01:48:12,2008-11-26,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 International,1293,,3,,[],Food
3,256000,0,,2008-11-26 01:48:14,2008-11-26,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 International,514,,4,,[],Electric Ave
5,256000,0,,2008-11-26 01:48:20,2008-11-26,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 International,1151,,6,,[],This World
10,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka Music Sharing) 3.0 International,50135,,1,,[],Freeway
20,256000,0,,2008-11-26 01:48:56,2008-01-01,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka Music Sharing) 3.0 International,361,,3,,[],Spiritual Level


In [24]:
pd.DataFrame({'Missing Data': 100* tracks['track'].isna().sum() / len(tracks['track']), 'Type': tracks['track'].dtypes})

Column,Missing Data (%),Type
bit_rate,0.0,int64
comments,0.0,int64
composer,96.556383,object
date_created,0.0,datetime64[ns]
date_recorded,94.220917,datetime64[ns]
duration,0.0,int64
favorites,0.0,int64
genre_top,53.461445,category
genres,0.0,object
genres_all,0.0,object


In [38]:
tracks['album'].head()

track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type
2,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album
3,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album
5,0,2008-11-26 01:44:45,2009-01-05,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album
10,0,2008-11-26 01:45:08,2008-02-06,,4,6,,47632,,[],Constant Hitmaker,2,Album
20,0,2008-11-26 01:45:05,2009-01-06,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album


In [25]:
pd.DataFrame({'Missing Data': 100* tracks['album'].isna().sum() / len(tracks['album']), 'Type': tracks['album'].dtypes})

Column,Missing Data (%),Type
comments,0.0,int64
date_created,3.311314,datetime64[ns]
date_released,34.042074,datetime64[ns]
engineer,85.64847,object
favorites,0.0,int64
id,0.0,int64
information,21.980033,category
listens,0.0,int64
producer,83.054028,object
tags,0.0,object


In [26]:
tracks['artist'].head()

track_id,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page
2,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from NJ...................</p>",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of Records,Zooberelli the Don,F.A.H,MadSicka,Damien Omenicci..and a van load more...",AWOL,The list of past projects is 2 long but every1 and every style from Tabby Bonet 2 M.O.P..Azillion Records Flagship trackmaster DJ BrownBum is a beat Wizard.....A-2-Z..illion....(right now working with JerseyBlock Ent),[awol],http://www.AzillionRecords.blogspot.com,
3,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from NJ...................</p>",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of Records,Zooberelli the Don,F.A.H,MadSicka,Damien Omenicci..and a van load more...",AWOL,The list of past projects is 2 long but every1 and every style from Tabby Bonet 2 M.O.P..Azillion Records Flagship trackmaster DJ BrownBum is a beat Wizard.....A-2-Z..illion....(right now working with JerseyBlock Ent),[awol],http://www.AzillionRecords.blogspot.com,
5,2006-01-01,NaT,,"<p>A Way Of Life, A Collective of Hip-Hop from NJ...................</p>",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of Records,Zooberelli the Don,F.A.H,MadSicka,Damien Omenicci..and a van load more...",AWOL,The list of past projects is 2 long but every1 and every style from Tabby Bonet 2 M.O.P..Azillion Records Flagship trackmaster DJ BrownBum is a beat Wizard.....A-2-Z..illion....(right now working with JerseyBlock Ent),[awol],http://www.AzillionRecords.blogspot.com,
10,NaT,NaT,"Mexican Summer, Richie Records, Woodsist, Skulltones","<p><span style=""font-family:Verdana, Geneva, Arial;font-size:xx-small;"">""Philly's Constant Hitmaker"" Kurt Vile plays psychedelic experimental pop infused with a classic, folksy directness. 2008's Gulcher Records release <i>Constant Hitmaker</i> more than lives up to the name. Vile's a consistent live performer who's shared the stage with a range of sonic adventurers and avant pop-stars including Sonic Youth, Big Star, Blues Control, Magik Markers, and Ariel Pink's Haunted Graffiti.</span></p><p><span style=""font-family:Verdana, Geneva, Arial;font-size:xx-small;"">After a slew of limited releases on labels like Mexican Summer, Richie Records, Woodsist, and Skulltones,</span><span style=""font-family:Verdana, Geneva, Arial;font-size:xx-small;""> Kurt Vile signed to Matador Records who relea...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"[philly, kurt vile]",http://kurtvile.com,
20,1990-01-01,2011-01-01,,"<p>Songs written by: Nicky Cook</p>\n<p>VOCALS, CAJON AND TABLA+ UKELELE</p>",2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"[instrumentals, experimental pop, post punk, experimental rock, soundtracks, lo fi, nicky, impro, cook, nicky andrews, ukelele, nicky cook, singer songwriter, tabla]",,


In [27]:
pd.DataFrame({'Missing Data': 100* tracks['artist'].isna().sum() / len(tracks['artist']), 'Type': tracks['artist'].dtypes})

Column,Missing Data (%),Type
active_year_begin,78.689924,datetime64[ns]
active_year_end,94.956556,datetime64[ns]
associated_labels,86.609304,object
bio,33.233246,category
comments,0.0,int64
date_created,0.803198,datetime64[ns]
favorites,0.0,int64
id,0.0,int64
latitude,58.203689,float64
location,34.120893,object


In [40]:
tracks['set'].head()

track_id,split,subset
2,training,small
3,training,medium
5,training,small
10,training,small
20,training,large


The tracks['track'] subset has a genres column, which holds lists of numeric genre ids. To interpret those ids, we need to load the genres data.

In [29]:
genres = pd.read_csv('data/fma_metadata/genres.csv', index_col=0)
genres

genre_id,#tracks,parent,title,top_level
1,8693,38,Avant-Garde,38
2,5271,0,International,2
3,1752,0,Blues,3
4,4126,0,Jazz,4
5,4106,0,Classical,5
...,...,...,...,...
1032,60,102,Turkish,2
1060,30,46,Tango,2
1156,26,130,Fado,2
1193,72,763,Christmas,38


In [30]:
pd.DataFrame({'Missing Data': 100* genres.isna().sum() / len(genres), 'Type': genres.dtypes})

Column,Missing Data (%),Type
#tracks,0.0,int64
parent,0.0,int64
title,0.0,object
top_level,0.0,int64
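
To illustrate how the genre ids stored in the genres column can be translated into readable titles, here is a small sketch. It uses a toy copy of the first five rows of the genres table shown above; the track ids 101 to 103 and the names `genres_demo`, `track_genres` and `genre_titles` are hypothetical.

```python
import pandas as pd

# Toy excerpt of the genres table (first five rows displayed above).
genres_demo = pd.DataFrame(
    {"title": ["Avant-Garde", "International", "Blues", "Jazz", "Classical"]},
    index=pd.Index([1, 2, 3, 4, 5], name="genre_id"),
)

# Hypothetical tracks whose `genres` entries are lists of genre ids.
track_genres = pd.Series({101: [3, 4], 102: [5], 103: []}, name="genres")

# Replace every id in each list by its title, using a label-based lookup.
genre_titles = track_genres.map(
    lambda ids: [genres_demo.loc[genre_id, "title"] for genre_id in ids]
)
print(genre_titles)
```

The same lookup should work on the real `genres` DataFrame, since it is also indexed by genre_id.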


### Principal Component Analysis

Now we need to define our (numerical) features and our (categorical, or string) labels. StandardScaler is used to standardise the features. Standardisation rescales each feature so that it has mean 0 and standard deviation 1. This prevents a feature with a broader range of values (such as tempo or bit rate, compared with audio features like energy whose values lie between 0 and 1 in the rows shown above) from overpowering the others and biasing the algorithm.

In [110]:
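# Numerical features: audio and social features plus bit rate and duration.
# join() aligns on the track_id index and preserves it.
scaler = StandardScaler()
features = (echonest['echonest', 'audio_features']
            .join(echonest['echonest', 'social_features'])
            .join(tracks['track'][['bit_rate', 'duration']], how='inner'))
X = scaler.fit_transform(features)

# Labels (genre_top) for the same tracks, in the same row order as X. The index is reset
# because X is positional, so pd.concat(..., axis=1) pairs each row with its own genre.
y = tracks['track']['genre_top'].loc[features.index].reset_index(drop=True)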
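
As a quick check of what the standardisation described above does, the sketch below applies StandardScaler to two features on very different scales and compares the result with the formula (x - mean) / std computed by hand. The four rows are the energy and tempo values of the first tracks displayed earlier; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Energy (values between 0 and 1) and tempo (values in the hundreds) for four tracks.
demo = np.array([
    [0.634476, 165.922],
    [0.817461, 126.957],
    [0.701470, 100.260],
    [0.924525, 111.562],
])

scaled = StandardScaler().fit_transform(demo)

# StandardScaler uses the population standard deviation (ddof=0), as does ndarray.std().
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 (up to floating-point error), so neither dominates distance or variance computations because of its units.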

As our data has many variables, it is useful to find the ones that summarise the relevant patterns. This gives insight into the characteristics of the songs that matter most for the variability and makes further analysis and visualization easier. Principal Component Analysis (PCA) is a widely used transformation technique for feature extraction and dimensionality reduction. It produces linear combinations of the original variables, the principal components, that are mutually uncorrelated and ordered so that each one captures as much of the remaining variance as possible, which keeps the information loss low for a given number of components. Since our aim here is to visualize the data, it is convenient to keep only the first three principal components.

In [128]:
# Project the scaled features onto the first three principal components
pca = PCA(n_components=3)
principalComponents = pd.DataFrame(data = pca.fit_transform(X), columns = ['Principal Component 1', 'Principal Component 2', 'Principal Component 3'])
finalDf = pd.concat([principalComponents, y], axis = 1).dropna()

fig = px.scatter_3d(finalDf, x = 'Principal Component 1', y = 'Principal Component 2', z = 'Principal Component 3', color = 'genre_top')
fig.show()
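
The two properties of PCA mentioned above, uncorrelated components and decreasing explained variance, can be verified on a small example. This is a sketch on synthetic data (two strongly correlated columns plus one independent column), not on the song features; all names in it are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: columns 0 and 1 share a latent factor, column 2 is independent.
latent = rng.normal(size=(300, 1))
X_demo = np.hstack([
    latent + 0.1 * rng.normal(size=(300, 1)),
    2 * latent + 0.1 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardise

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo)

# Each component is a linear combination of the (centred) original columns.
reconstructed_scores = X_demo @ pca_demo.components_.T

# The correlation matrix of the component scores is the identity matrix.
score_corr = np.corrcoef(scores, rowvar=False)

print(pca_demo.explained_variance_ratio_)
print(np.round(score_corr, 6))
```

The rows of `components_` hold the weights of each original feature in each component, which is one way to see which features drive the leading components.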

While I don't see any clustering patterns straight away, what does stand out is how heavily the points concentrate around the origin. One possible reason for the lack of visible structure is that three principal components are not enough to explain most of the variability, so let's check that. A common rule of thumb is to retain at least 70% of the explained variance.

In [138]:
np.cumsum((pca.explained_variance_ratio_))[2] # 42.72% is way less than 70%

0.42716092979379017
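
To find how many components a variance threshold such as 70% would require, one option is to fit PCA with all components and look at the cumulative explained variance. The sketch below does this on synthetic stand-in data (500 rows, 15 independent columns), so the resulting number says nothing about the song features; on the real data `X_demo` would be replaced by the scaled matrix `X`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the scaled feature matrix.
X_demo = StandardScaler().fit_transform(rng.normal(size=(500, 15)))

# Fit with all components to obtain the full variance spectrum.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 70%.
threshold = 0.70
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(n_needed, cumulative[n_needed - 1])
```

If more than three components are needed, a 3D scatter plot of the first three can only show part of the structure, which is consistent with the result obtained above.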

### Uniform Manifold Approximation and Projection

UMAP is a general-purpose manifold learning and dimensionality reduction algorithm. Like t-SNE, it is a non-linear method that is commonly used to transform and visualise data.

In [145]:
reducer = umap.UMAP(n_components = 3)

principalComponents = pd.DataFrame(data = reducer.fit_transform(X), columns = ['Component 1', 'Component 2', 'Component 3'])
finalDf = pd.concat([principalComponents, y], axis = 1).dropna()

fig = px.scatter_3d(finalDf, x = 'Component 1', y = 'Component 2', z = 'Component 3', color = 'genre_top')
fig.show()

### PCA + UMAP

In [144]:
pca = PCA()  # n_components is not set, so all components are kept

reducer = umap.UMAP(n_components = 3)

principalComponents = pd.DataFrame(data = reducer.fit_transform(pca.fit_transform(X)), columns = ['Component 1', 'Component 2', 'Component 3'])
finalDf = pd.concat([principalComponents, y], axis = 1).dropna()

fig = px.scatter_3d(finalDf, x = 'Component 1', y = 'Component 2', z = 'Component 3', color = 'genre_top')
fig.show()