# CAPSTONE PROJECT: GENRE CLASSIFICATION 

### Genre Consolidation 
Stephanie Barrett | September 5, 2023

This project aims to use machine learning to classify songs into the correct genre based on particular attributes. 

**Introduction:** This is notebook 2 of 4 that was used in the genre classification project.
This notebook contains further exploration of the genres in our Spotify dataset. In this notebook, we remove a total of 28 genres, reducing the number of unique genres from 113 to 85. The new DataFrame with 85 genres is saved to a CSV file that will be used in the next notebook: modeling. 

***
**TABLE OF CONTENTS**

[Introduction](#Genre-Consolidation)

[Sleep](#Sleep)

[Study](#Study)

[Ambient](#Ambient)

[Sad & Happy](#Sad-&-Happy)

[Languages](#Languages)

[Next Steps](#NEXT-STEPS)

***

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

In [2]:
# Reading in our cleaned dataset 
spotify_cleaned = pd.read_csv('~/Desktop/CapstoneProject/data/spotify_cleaned_final.csv', index_col=0)

In [3]:
# Making a copy - we will eventually save this DataFrame to a csv file that we can use for modeling. 
consolidated_spotify = spotify_cleaned.copy()

In [4]:
# Transforming the 'explicit' column from bool to binary
consolidated_spotify['explicit'] = consolidated_spotify['explicit'].astype(int)

In [5]:
# Sanity check 
consolidated_spotify['explicit'].unique()

array([0, 1])

In [6]:
# Looking at our unique genres 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'pow

In [7]:
# Total number of genres 
consolidated_spotify['track_genre'].nunique()

113

We are going to start by looking at genres that are probably a mix of noise (nature sounds, white noise, etc.) and "soothing" or "relaxing" songs that overlap with other genres in our dataset. The candidates for removal are: 
- ambient 
- study 
- sleep 
- sad 
- happy

Let's check them out just to be sure. 

## Sleep 

In [8]:
# All songs from the 'sleep' genre
sleep = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'sleep']
sleep

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
101000,07UDTaRYJAsIhUZTyZSUzM,Huma,Ons,Ons,71,193239,0,0.1110,0.018500,0,-32.335,1,0.0424,0.957,0.940,0.087,0.04590,76.153,3,sleep
101001,0CMYUXTTTmI6Lwc0opH2XG,Sohn Aelia,Herinneringen,Herinneringen,74,172357,0,0.0685,0.003420,0,-37.156,1,0.0418,0.990,0.910,0.106,0.05810,73.644,1,sleep
101002,5LEHlRPjGcZ5RdagAwXpHS,Little Circuits,Beautiful Imagination,Beautiful Imagination,71,143052,0,0.1020,0.033300,3,-34.177,1,0.0476,0.797,0.927,0.111,0.05910,89.605,4,sleep
101003,58xXb7ASilQ9WYtninubrY,White Noise for Babies,Thinned Phaser,Aqua,0,157024,0,0.2400,0.011900,6,-33.515,0,0.0760,0.995,0.828,0.103,0.03580,138.009,4,sleep
101004,6kOJ0Ylenwr6Up0g4lDORg,White Noise for Babies,Rising Sun,Rising Sun,0,180000,0,0.1350,0.000804,4,-34.772,1,0.0507,0.968,0.728,0.108,0.03460,141.930,5,sleep
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101994,5BVK02c2QI9ODFm8liVSdw,Rain Sounds;Spa & Spa;Nature Sounds Nature Music,18 Relaxing Rain Sounds - No Fade,10 Minute Loopable Heavy Rain,32,591807,0,0.1080,0.774000,10,-17.200,0,0.0725,0.435,0.992,0.716,0.01820,93.952,4,sleep
101996,3OwZ8e7lb5Assv14ZRW1rU,Rain Sounds,Soft Rain Sleep,Forrest Rain,32,102649,0,0.4680,0.993000,10,-24.274,0,0.0386,0.250,0.945,0.932,0.00964,49.296,4,sleep
101997,3QZrgBLusdGcYJtST0ucXX,Rain Sounds,White Noise Rain,White Noise Rain,32,173608,0,0.1740,0.895000,2,-24.459,1,0.0468,0.172,0.919,0.875,0.00001,86.752,5,sleep
101998,6hD4qRIXuuD2cM7vX6eV23,Meeresrauschen,"Meeresrauschen zur Entspannung, für Meditation...",Relax,32,120353,0,0.1880,0.999000,10,-18.900,1,0.0510,0.474,0.988,0.929,0.00001,60.837,3,sleep


There are 945 tracks in total. Just from glancing at the top and bottom rows, we can see artist and album names like "White Noise for Babies" and "Rain Sounds". Let's check out a random sample of rows. 

In [9]:
# 100 random rows from sleep genre 
sleep.sample(100)

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
101934,5dYGrOlxDiibQ2qiVe8aNI,Baby Sleep Lullaby Academy,Sleep Music to Help You Relax All Night: New A...,Meditation for the Dreams,32,209142,0,0.1450,0.0395,7,-28.457,1,0.0403,0.924000,0.98200,0.103,0.03440,75.150,4,sleep
101180,42nq6tG2mUQQhkBIM0kPuu,Traditional;London Philharmonic Orchestra;The ...,"Classical Christmas, Vol. 2",The Trumpet Shall Sound (Remastered 2014),0,262663,0,0.3730,0.2890,2,-11.743,1,0.0384,0.935000,0.59400,0.547,0.26200,131.174,3,sleep
101626,6cONGJWwrp7KmgmLMjSDVX,Pink Noise,Pink Noise Rain Sounds,Calm Rain Water (No Fade),35,66000,0,0.1290,0.5670,1,-34.024,1,0.0523,0.081200,0.92300,0.363,0.03380,200.864,4,sleep
101166,3InJfiPB0ZOw9xg0Z9BV7G,Traditional;Royal Philharmonic Orchestra,"Classical Christmas, Vol. 1",O Holy Night (Remastered 2014),0,163698,0,0.1630,0.3420,4,-11.015,1,0.0371,0.980000,0.00124,0.110,0.08150,90.709,3,sleep
101905,0IsEyOQ5ObJcX1tnuXeXqX,Rnwy Lites,Breathe into the Still,Breathe into the Still,59,192000,0,0.0667,0.0895,0,-30.760,1,0.0417,0.849000,0.63300,0.111,0.10200,72.956,3,sleep
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101171,2H03nIdKechb09iRBKaZS6,Traditional;Royal Philharmonic Orchestra,"Classical Christmas, Vol. 1",Away in a Manger (Remastered 2014),0,216314,0,0.1120,0.2150,9,-17.766,1,0.0387,0.952000,0.00902,0.237,0.10400,80.987,5,sleep
101087,1umCMBSrqMBqmwNl1VfYus,Nature Sounds,Relaxing Rain Sounds,Serenity Rain,0,246334,0,0.4970,0.9880,1,-26.516,1,0.0413,0.275000,0.93500,0.931,0.00809,106.261,4,sleep
101009,6VwRN2gV7xF8pPxwry5Nvz,White Noise for Babies,Twilight Noise,Soft Light,0,186707,0,0.3340,0.2650,2,-13.579,1,0.0343,0.855000,0.83300,0.125,0.03190,30.200,4,sleep
101379,043f7kIHqTGBaOtN2tmZqB,Nature Sounds Nature Music,Tropical Storm for Deep Sleep - Thunderstorm S...,Tropical Thunder healing Sounds of Mother Nature,39,360000,0,0.1640,0.5470,10,-20.848,0,0.0759,0.000081,0.96900,0.357,0.02330,114.325,4,sleep


This confirms my suspicion. In addition to white noise and nature sounds, we also see some classical music mixed in. 
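To put a rough number on this impression, one option is to flag tracks whose artist or album name contains a noise-related keyword. The sketch below runs on a small toy DataFrame, not the real `sleep` data, and the keyword list in `noise_pattern` is an assumption that would need tuning after inspecting the actual rows.

```python
import pandas as pd

# Toy stand-in for the `sleep` DataFrame; rows are illustrative only.
sleep_demo = pd.DataFrame({
    "artists": ["White Noise for Babies", "Rain Sounds", "Huma", "Nature Sounds"],
    "album_name": ["Rising Sun", "Soft Rain Sleep", "Ons", "Relaxing Rain Sounds"],
})

# Hypothetical keyword list (an assumption, not part of the original analysis).
noise_pattern = "noise|rain|nature|thunder"

# Flag a track if the pattern appears in either the artist or the album name.
is_noise = (
    sleep_demo["artists"].str.contains(noise_pattern, case=False, na=False)
    | sleep_demo["album_name"].str.contains(noise_pattern, case=False, na=False)
)

noise_share = is_noise.mean()
print(f"{is_noise.sum()} of {len(sleep_demo)} tracks match a noise keyword ({noise_share:.0%})")
```

A keyword match is only a heuristic: it can miss noise tracks with neutral titles and can flag real songs that happen to mention rain.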

In [10]:
# Songs of classical genre
classical = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'classical']

In [11]:
# Shape of this genre 
classical.shape

(759, 20)

We already have a decent number of tracks in our classical genre, so I see no need to move any of the `sleep` songs to more "appropriate" genres. Let's go ahead and remove the `sleep` genre from our dataset. 

In [12]:
sleep_index = list(sleep.index)

In [13]:
# Removing any song with 'sleep' as a genre from the DataFrame
consolidated_spotify.drop(sleep_index, inplace=True)

In [14]:
# Sanity check 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'pow

We no longer have 'sleep' as a genre. Let's move on to study music. 
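Scanning the printed array works, but the same sanity check can be written as explicit assertions that fail loudly if something goes wrong. This is a minimal sketch on a toy DataFrame; the index values are illustrative and `demo` is a hypothetical stand-in for `consolidated_spotify`.

```python
import pandas as pd

# Toy frame standing in for consolidated_spotify; index values are illustrative.
demo = pd.DataFrame(
    {"track_genre": ["sleep", "classical", "sleep", "study"]},
    index=[101000, 16000, 101001, 105000],
)

rows_before = len(demo)
sleep_rows = demo.loc[demo["track_genre"] == "sleep"]

# Same index-based drop as in the cells above.
demo.drop(list(sleep_rows.index), inplace=True)

# Explicit checks instead of reading the printed genre array by eye.
assert not demo["track_genre"].eq("sleep").any()
assert len(demo) == rows_before - len(sleep_rows)
print(f"Removed {len(sleep_rows)} rows; {len(demo)} remain")
```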

## Study 


In [15]:
# All songs from the genre 'study'
study = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'study']

In [16]:
study

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
105000,6qbe2xJVwt7wwpeocsrVvc,Talented Mr Tipsy,Ritual,Ritual,53,173333,0,0.657,0.717,10,-6.007,0,0.1990,0.7510,0.878,0.1740,0.401,90.076,4,study
105001,5ZOwkvgMAeI0G2USq0cIFQ,O k O,neomatter,neomatter,49,104433,0,0.670,0.371,4,-13.830,0,0.1410,0.5700,0.891,0.1270,0.100,83.030,3,study
105002,2L2aSrua8ZBRMogFhF6I1Q,4to28,Elastic,Elastic,50,121600,0,0.736,0.410,5,-14.587,0,0.0528,0.0361,0.854,0.3800,0.587,112.509,3,study
105003,1tNwm7phOIMzGC1VO9Dggr,Joi Casette,One Two Tree,One Two Tree,47,121875,0,0.797,0.263,6,-13.089,0,0.1060,0.5760,0.786,0.9180,0.379,80.014,4,study
105004,4wkQzdybc3qK4nNaxHLzvO,hope mona,humify,humify,47,161032,0,0.496,0.198,3,-18.740,1,0.3230,0.9470,0.885,0.0933,0.341,77.168,4,study
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105995,0p9HloqJbkgokaktOhiwkn,"Sarah, the Illstrumentalist",FLOWERS,Chamomile,6,140199,0,0.640,0.296,3,-9.142,0,0.0314,0.0238,0.896,0.1530,0.337,112.018,5,study
105996,2RpHErrNFoo8A7MKAQS7tD,DGHTR,HoriZon Butterfly,Red Maple,2,119111,0,0.814,0.226,0,-13.278,1,0.1010,0.4930,0.904,0.4460,0.200,135.046,4,study
105997,2nHUre6dQezPvxBkeTAba5,"Sarah, the Illstrumentalist",Constellations,Capricornus,6,142521,0,0.769,0.716,11,-8.844,0,0.0514,0.0161,0.946,0.1520,0.598,137.995,4,study
105998,63QCKDyQB7OoiWOfwWkysb,"Sarah, the Illstrumentalist",Pocket Full of Crystals: Vol 2,Orange Jasper,6,145000,0,0.675,0.694,9,-7.689,0,0.0838,0.4450,0.928,0.1090,0.625,179.960,4,study


We have 996 tracks in this genre. It's difficult to get a feel for the songs from the top and bottom rows alone, so let's again look at a sample of 100 rows. 

In [17]:
# Raising the display limit so that a full 100-row sample can be shown
pd.set_option('display.max_rows', 299)

In [18]:
# 100 random rows from study genre
study.sample(100)

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
105453,57bY38pZVvHLS6LCBpzMKQ,Oscar Hollis,All About Mine,All About Mine,21,148000,0,0.742,0.18,1,-17.959,0,0.0689,0.882,0.94,0.101,0.0664,130.002,4,study
105861,1O9XqbawlAACCBNT1tg6YM,Matt Large,Waking up Next to You,Waking up Next to You,25,220675,0,0.717,0.34,11,-8.782,0,0.0397,0.299,0.782,0.124,0.516,75.009,4,study
105322,3AA7qG6AgLxmTDAlZVRKaX,Sless Praismo,"Experimentals, Pt. 3",Last Call,14,169014,0,0.331,0.61,1,-8.919,1,0.234,0.267,0.848,0.426,0.176,69.826,4,study
105629,0PkaVMW00sCAeiXd5mKIVb,Odd Shapes,Thousand Islands,Thousand Islands,28,203906,0,0.8,0.342,9,-18.103,1,0.0519,0.433,0.69,0.112,0.221,128.115,4,study
105592,5nMWNjCkesUlHXjjfBNsAL,Nathan Kawanishi,Freshman Tape,Lost,10,111000,0,0.835,0.667,7,-9.057,1,0.134,0.616,0.843,0.0988,0.707,160.026,4,study
105092,72wDz8RDKFvffQJhHucHDy,Dusty Decks,Slipping Mats,Slipping Mats,47,149866,0,0.746,0.477,11,-7.486,1,0.0463,0.124,0.881,0.133,0.714,91.999,4,study
105413,4sT5NZD0cHvpB3fcDkrGpD,Bullseye Release,belief tree,belief tree,36,180000,0,0.757,0.139,0,-6.647,0,0.0468,0.576,0.695,0.244,0.533,80.009,4,study
105859,2vXP0h0jE1vOaPM0zS5Ora,Timothy Infinite,Trumpet Man,Trumpet Man,24,128071,0,0.704,0.562,6,-6.811,1,0.39,0.758,0.879,0.246,0.28,84.156,4,study
105894,74f0VaipJke6lIWI5BEfTW,"Sarah, the Illstrumentalist",Good Rising,Good Rising,7,145724,0,0.724,0.869,2,-5.457,0,0.0461,0.0744,0.737,0.213,0.917,121.92,4,study
105885,3ZBJMy7ICSqGfwQivSjqZo,wavcrush,come pat a bowl,brake down,6,99307,0,0.761,0.428,8,-6.378,0,0.0578,0.21,0.193,0.115,0.187,90.014,4,study


I am not familiar with the artists, albums, or tracks in this genre, but we can look at the numeric features to see if any trends stand out. 

In [19]:
# Summary stats 
study.describe()

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0,996.0
mean,26.15261,141592.863454,0.0,0.685067,0.411024,5.427711,-10.846727,0.501004,0.100187,0.530086,0.789227,0.161438,0.402181,111.776734,3.942771
std,14.121255,33766.767066,0.0,0.099322,0.168141,3.629,3.586343,0.50025,0.093228,0.311927,0.212534,0.126922,0.21366,35.636203,0.332098
min,0.0,68500.0,0.0,0.331,0.0282,0.0,-25.15,0.0,0.0256,0.000386,1.6e-05,0.0354,0.0351,30.322,1.0
25%,11.0,121452.25,0.0,0.629,0.287,2.0,-13.24875,0.0,0.0444,0.256,0.785,0.102,0.23375,82.99975,4.0
50%,28.0,138666.0,0.0,0.695,0.397,6.0,-10.3195,1.0,0.0657,0.57,0.8665,0.114,0.3845,92.04,4.0
75%,37.0,157710.75,0.0,0.752,0.5195,9.0,-8.012,1.0,0.11225,0.81125,0.90925,0.157,0.5485,141.9955,4.0
max,55.0,307246.0,0.0,0.937,0.959,11.0,-3.343,1.0,0.743,0.99,0.975,0.918,0.97,208.06,5.0


With the exception of loudness and instrumentalness, the features tend to span a wide range of values. Since these songs come from different genres and are grouped only because they are deemed appropriate for studying, we can go ahead and remove them to help the performance of our model. 
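One way to make "wide range" measurable is to divide each feature's standard deviation within a genre by its standard deviation over the whole dataset. A ratio near or above 1 suggests the genre is about as varied as the full data on that feature. The sketch below uses made-up values and a hypothetical comparison genre (`opera`), so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy values for illustration only; not taken from the dataset.
demo = pd.DataFrame({
    "track_genre": ["study", "study", "study", "opera", "opera", "opera"],
    "energy":      [0.10, 0.50, 0.90, 0.20, 0.25, 0.30],
    "valence":     [0.05, 0.40, 0.95, 0.10, 0.15, 0.20],
})
features = ["energy", "valence"]

# Standard deviation of each feature within each genre ...
genre_std = demo.groupby("track_genre")[features].std()
# ... divided by the standard deviation over the whole table.
spread_ratio = genre_std / demo[features].std()
print(spread_ratio.round(2))
```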

In [20]:
study_index = list(study.index)

In [21]:
# Removing 'study' genre
consolidated_spotify.drop(study_index, inplace=True)

In [22]:
# Sanity check 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'pow

Let's move on to 'ambient'. 

## Ambient 

In [23]:
# All the ambient music
ambient = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'ambient']

In [24]:
# Random 150 rows from this genre
ambient.sample(150)

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
4694,6p9Frj8pPOCWol6HMFihSs,Bethel Music;Jonathan David Helser;Melissa Helser,Homecoming (Live),I Believe - Live,52,456000,0,0.44,0.702,10,-6.759,1,0.0325,0.0197,3.8e-05,0.109,0.295,143.99,4,ambient
4716,5KIOfWtUEKrxDNi4OnIpg2,Murcof;Vanessa Wagner,Avril 14th (Aphex Twin) [Vanessa Wagner Version],Avril 14th (Aphex Twin) - Vanessa Wagner Version,58,165020,0,0.395,0.0826,5,-18.982,0,0.0432,0.995,0.93,0.126,0.119,93.288,4,ambient
4172,4Ks0OfgQpb58nHvQrcaiPi,Fabrizio Paterlini,Transitions II,Distractions,60,75775,0,0.123,0.0358,4,-38.528,1,0.0483,0.99,0.945,0.098,0.0399,114.425,5,ambient
4806,1qGI8KYIXtJK9WTMC6nIQM,Olivia Belli,Island II,Island II,57,143960,0,0.4,0.00478,5,-32.946,0,0.0382,0.993,0.92,0.103,0.122,74.012,3,ambient
4659,4FEcEOyyGbPR1cnYMBTRfO,Helios,Domicile,Our Distance,42,323573,0,0.302,0.0288,0,-26.078,1,0.0491,0.956,0.805,0.101,0.069,119.592,4,ambient
4096,6npUdmcsfp9yRa65tF0PUH,Philip Glass;Daniel Hope;Chie Peters;Deutsches...,Classical Running,Echorus,0,351466,0,0.285,0.0158,0,-24.312,1,0.0483,0.947,0.466,0.126,0.0369,119.852,4,ambient
4217,4GEzdXpbeGB97W1C3gH8Hh,aswekeepsearching,Khwaab,The Moment,29,110000,0,0.277,0.243,8,-18.456,0,0.0348,0.966,0.889,0.105,0.0379,150.074,3,ambient
4898,1q6ek8IwZsfPYFrbiuSEOz,Alexis Ffrench,Truth - The Solo Piano Collection,Evermore - Solo Piano Version,25,169317,0,0.55,0.0502,2,-26.362,1,0.0462,0.995,0.886,0.0891,0.385,134.892,3,ambient
4181,2NZEJxIUnsP18o2aNzeuZW,Sleeping At Last,"Covers, Vol. 2",Make You Feel My Love,54,208339,0,0.451,0.0767,7,-12.221,1,0.0402,0.979,0.0,0.129,0.333,139.586,4,ambient
4390,6T10XPeC9X5xEaD6tMcK6M,Air;Beth Hirsch,Moon Safari,All I Need (feat. Beth Hirsch),63,268306,0,0.327,0.425,7,-11.577,0,0.027,0.878,0.000873,0.18,0.165,93.946,4,ambient


In [25]:
# Summary stats 
ambient.describe()

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0,937.0
mean,44.817503,236133.7,0.005336,0.367037,0.235512,4.955176,-18.679177,0.616862,0.041793,0.77671,0.676701,0.128756,0.166526,111.290487,3.648879
std,17.409528,112879.8,0.072893,0.164229,0.210397,3.444945,7.957414,0.486411,0.030389,0.301903,0.361738,0.096965,0.14345,32.057238,0.775893
min,0.0,41904.0,0.0,0.0,0.00144,0.0,-41.808,0.0,0.0,2e-06,0.0,0.0345,0.0,0.0,0.0
25%,40.0,160000.0,0.0,0.23,0.0709,2.0,-24.141,0.0,0.0335,0.722,0.505,0.0916,0.0569,84.909,3.0
50%,50.0,218200.0,0.0,0.361,0.176,5.0,-18.199,1.0,0.0376,0.921,0.877,0.105,0.125,111.995,4.0
75%,55.0,289548.0,0.0,0.487,0.348,8.0,-12.016,1.0,0.0446,0.979,0.924,0.119,0.23,132.614,4.0
max,84.0,1041520.0,1.0,0.853,0.974,11.0,-4.431,1.0,0.761,0.996,0.986,0.975,0.962,213.848,5.0


Wikipedia describes ambient music as a genre that emphasizes tone and atmosphere over traditional musical structure or rhythm. It is a loosely defined genre that incorporates elements of a number of different styles, including jazz, electronic music, new age, modern classical music, and even noise. 

Just by checking a few artists, we can confirm that definition. Many of these artists span several genres, including but not limited to classical, indie, alternative, and electronica. Ambient music is a large, legitimate genre in its own right, but the mix of styles could make it harder for our model to classify. With our fairly basic attributes, the model may struggle with this genre. For example, instrumentalness varies quite a bit. However, energy, acousticness, and liveness have values that lie within a fairly small range, which suggests possible distinctive features of the genre. 
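The width of the middle 50% of tracks (the interquartile range) gives a quick way to compare how tightly each feature is clustered. The sketch below copies the 25th and 75th percentiles for four features from the `ambient.describe()` output above; only the `quartiles` name and the choice of features are introduced here.

```python
import pandas as pd

# 25th and 75th percentiles copied from the ambient.describe() output above.
quartiles = pd.DataFrame(
    {
        "25%": [0.0709, 0.722, 0.0916, 0.505],
        "75%": [0.348, 0.979, 0.119, 0.924],
    },
    index=["energy", "acousticness", "liveness", "instrumentalness"],
)

# Interquartile range: the width of the middle 50% of tracks.
quartiles["iqr"] = quartiles["75%"] - quartiles["25%"]
print(quartiles.sort_values("iqr"))
```

Liveness comes out as the most tightly clustered of the four and instrumentalness as the most spread out, with energy and acousticness in between.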

Let's leave this genre for now, particularly for our Principal Component Analysis. It's possible we will be able to combine it with another genre in our data. We can also try modeling with and without this genre to see whether it should be treated as more of a sub-genre in this context. We can come back to it later. 

Let's move on to `sad`.

## Sad & Happy

In [26]:
# All songs under 'sad' genre 
sad = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'sad']

In [27]:
# 150 random rows from the 'sad' genre
sad.sample(150)

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
94974,60QvDCrjIu6DmewWRxKKRa,Yxngxr1;Morgan Reese,Teenage Motel,NICE GUY,54,138151,1,0.785,0.5,8,-8.498,1,0.261,0.635,0.0,0.159,0.564,140.061,4,sad
94019,5bryfrnYaxlofC6MAG6vde,Snøw,Sunset/Sunrise,Sunset/Sunrise,63,87353,0,0.753,0.24,10,-12.095,1,0.0458,0.903,0.000451,0.109,0.183,77.993,4,sad
94435,4BUEgtfQSZkHbjjztrKtlv,BrxkenBxy,Valentina,Valentina,53,91346,1,0.781,0.657,8,-6.91,1,0.0867,0.0201,0.000369,0.132,0.622,99.959,4,sad
94220,0CL5CGhcKaJLakMZNSRLu9,Thekidszn;Bangers Only,Hills,Hills,50,125106,1,0.71,0.78,2,-5.211,1,0.0493,0.0248,0.0,0.239,0.342,94.047,4,sad
94400,1PJsUeqqU2H2iVCZ7ZBGnU,Connor Price;Nic D,Selfish,Selfish,55,113056,0,0.814,0.52,8,-7.409,0,0.45,0.0502,0.0,0.114,0.57,179.955,4,sad
94678,2pLPUx9oQfg8L5sAEPDGBp,Lil Loski,Life Without Heartbreak,Fiji (Bonus Track),50,162448,1,0.66,0.491,9,-8.845,1,0.111,0.515,0.0,0.0932,0.338,168.959,4,sad
94100,7xcTzEFNkkfO5a60T93tdk,ZOID LAND,midnight sun (ramzoid x hal walker),midnight sun (ramzoid x hal walker),63,58053,0,0.908,0.591,10,-5.13,0,0.226,0.0464,0.966,0.0972,0.565,113.083,4,sad
94410,3ydusC725etWQ7QlAjmIi4,Dro Kenji,THEY DON'T KNOW,THEY DON'T KNOW,52,103783,1,0.722,0.719,7,-6.548,1,0.0422,0.119,0.0,0.13,0.147,92.586,4,sad
94034,56H3K8a3gbWeX0cg4GjSzC,Adam Oh,Trapped in My Mind,Trapped in My Mind,52,216052,1,0.678,0.678,5,-7.518,1,0.0998,0.176,0.0,0.104,0.465,159.984,4,sad
94647,50HJE4g0GR62ID38kzRMGo,Lxst,Over Me,Exhausted Pt. 2,46,92536,1,0.584,0.416,8,-12.617,1,0.182,0.797,2.5e-05,0.111,0.149,165.996,4,sad


In [28]:
# summary stats to see if we see any patterns 
sad.describe()

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0,536.0
mean,51.807836,152524.983209,0.501866,0.703657,0.478351,5.402985,-9.775991,0.617537,0.131899,0.435505,0.070964,0.163116,0.437461,119.812062,3.983209
std,10.831005,36723.805015,0.500464,0.126381,0.164225,3.667222,3.758678,0.486443,0.119375,0.305249,0.211245,0.113619,0.20781,31.396476,0.283004
min,0.0,57877.0,0.0,0.328,0.0337,0.0,-23.236,0.0,0.0281,0.000496,0.0,0.0445,0.0387,58.968,1.0
25%,47.0,127741.5,0.0,0.624,0.35475,2.0,-11.8205,0.0,0.0465,0.13925,0.0,0.102,0.27375,92.99675,4.0
50%,53.0,150000.0,1.0,0.719,0.481,6.0,-9.036,1.0,0.0851,0.4095,0.0,0.1205,0.434,118.711,4.0
75%,58.0,175533.0,1.0,0.798,0.602,8.0,-6.9975,1.0,0.176,0.708,0.000407,0.175,0.59075,142.289,4.0
max,83.0,314500.0,1.0,0.98,0.906,11.0,-2.325,1.0,0.891,0.991,0.973,0.851,0.97,220.039,5.0


We do see that 75% of the tracks have a valence score of 0.59 or below. 

Based on the tracks' content and metrics like danceability and energy, we can assume that the genre's name refers to the lyrics or sentiment of the songs rather than any musical style. This in itself is a good reason to drop this genre, since we do not have a metric that quantifies the meaning of lyrics. 
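Placing the valence quartiles of `sad` next to those of the two genres summarized earlier helps put that 0.59 figure in context. The values below are copied from the three `describe()` outputs in this notebook; the `valence` table itself is just a convenience for comparison.

```python
import pandas as pd

# Valence quartiles copied from the describe() outputs shown in this notebook.
valence = pd.DataFrame(
    {
        "25%": [0.27375, 0.23375, 0.0569],
        "50%": [0.434, 0.3845, 0.125],
        "75%": [0.59075, 0.5485, 0.23],
    },
    index=["sad", "study", "ambient"],
)

# Rank the genres by median valence, lowest first.
print(valence.sort_values("50%"))
```

Among these three genres, `sad` has the highest median valence, which is consistent with the label describing lyrics or sentiment rather than something the audio features capture.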

In [29]:
# All indexes of 'sad' genre songs 
sad_index = list(sad.index)

In [30]:
# Using 'sad' indexes to drop them from the DataFrame
consolidated_spotify.drop(sad_index, inplace=True)

In [31]:
# Sanity check 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'pow

By the same reasoning as for `sad`, and because the valence metric already captures how positive a track sounds, we can also go ahead and drop `happy`.

In [32]:
# Grabbing all songs with 'happy' genre
happy = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'happy']

In [33]:
# All of their indexes 
happy_index = happy.index

In [34]:
# Using their indexes to drop them from the DataFrame
consolidated_spotify.drop(happy_index, inplace=True)

In [35]:
# Sanity check 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'power-pop', 

Now that we've covered loosely defined genres, let's move on to the languages. 

## Languages 

We have several genres named after a language or country. Given our limited attributes, let's look through these genres to make sure that we don't have overlapping genres simply sung in different languages (e.g. French hip-hop, French indie, French pop). 

Our languages/countries are: 
- brazil
- british
- french
- german
- iranian
- latin
- latino 
- spanish 
- swedish
- turkish 

And in addition to the above let's check out `world music`. 

### Brazil

This could be either popular Brazilian styles like Samba, Forro, Bossa Nova, etc., or Brazilian "versions" of genres already present in the dataset. 

In [36]:
# All the songs that have 'brazil' as their genre
brazil = consolidated_spotify.loc[consolidated_spotify['track_genre'] == 'brazil']

In [37]:
# A random sample of 150 rows to assess content
brazil.sample(150)

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
9880,2igDtZlgWm3ZxhV7CHv6pI,Colo de Deus,Casa de Maria (Ao Vivo),Colo de Mãe - Ao Vivo,43,540000,0,0.46,0.642,9,-6.07,1,0.0781,0.718,0.0,0.395,0.497,136.036,4,brazil
9881,7x4bh40tbYvpN5Gp7fMyts,Ultraje a Rigor,Acústico Mtv,Rebelde Sem Causa - Acústico,42,175280,0,0.457,0.911,0,-6.373,1,0.0782,0.14,0.000228,0.711,0.669,173.135,4,brazil
9526,5mblm2qeUbMbsDNsIE5OQi,Cássia Eller,Acústico (Ao Vivo),1º De Julho - Ao Vivo,45,292306,0,0.32,0.878,7,-5.771,1,0.0909,0.538,0.0,0.665,0.509,170.064,4,brazil
9437,6M1mIjeq2kNvT8F2dHNugk,Marcos Antônio,Eterno Dependente,Pai,47,330804,0,0.655,0.574,6,-7.261,0,0.031,0.327,0.0,0.366,0.577,117.048,4,brazil
9767,1muDihGGBhRaAfIDUA93z9,Major RD;Rock Danger;Sain;Chris MC;El Lif Beat...,Troféu,Deixa a Luz Baixa,44,210783,1,0.745,0.638,1,-7.012,1,0.284,0.44,0.000109,0.229,0.488,130.022,4,brazil
9796,6WmlPRkJH68P7tmkFuQe3z,Ministério Vineyard,Vem Esta é a Hora,Senhor Te Quero,44,277893,0,0.507,0.676,7,-8.192,1,0.0326,0.677,1.3e-05,0.677,0.513,130.834,4,brazil
9616,6ijPGzgs3jqHokPhkhPpMh,Projota;Nando Reis,A Saída Está Dentro,Homens De Bem,44,250116,0,0.678,0.586,7,-6.526,1,0.0448,0.155,0.0,0.0839,0.676,78.949,4,brazil
9858,3yz4ypggDnJ4N3z3YG6pn2,O Rappa,O Rappa - Acústico Oficina Francisco Brennand ...,Hóstia - Ao vivo,42,206718,0,0.637,0.898,7,-5.322,0,0.106,0.186,0.000673,0.975,0.81,96.878,4,brazil
9019,6OSs0dmcVqVdeWOmsaA5g1,Jota Quest,Acústico Jota Quest,Fácil - Acústico,54,229540,0,0.556,0.537,7,-10.761,1,0.0282,0.377,0.0,0.757,0.313,95.914,4,brazil
9021,2OmZjKpXjbDWXEsw1JEy1x,Isaias Saad;LUDI,Seu Amor / Diante da Cruz,Seu Amor / Diante da Cruz,55,497600,0,0.333,0.24,4,-12.006,1,0.0312,0.836,4.7e-05,0.0963,0.251,71.374,4,brazil


Listening to the tracks confirmed my assumption that these are songs of many genres sung in Portuguese. Once again, since we don't have a metric for lyrics or language, it's safe to remove this genre. 

In [38]:
# Grabbing indexes of brazilian songs and removing them from the DataFrame 
brazil_index = brazil.index 
consolidated_spotify.drop(brazil_index, inplace=True)

It's safe to assume that we can also remove the remaining language genres (British, French, German, Spanish, Swedish, Indian, Turkish, etc.). 
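Removing several genres at once can also be expressed as a single boolean mask wrapped in a small helper. The sketch below is one possible way to do it: `drop_genres` is a hypothetical function name, and the toy DataFrame only illustrates the behavior.

```python
import pandas as pd

def drop_genres(df: pd.DataFrame, genres: list) -> pd.DataFrame:
    """Return a copy of df without the rows whose track_genre is in genres."""
    mask = df["track_genre"].isin(genres)
    n_rows = int(mask.sum())
    n_genres = df.loc[mask, "track_genre"].nunique()
    print(f"Removing {n_rows} rows across {n_genres} genres")
    # ~mask keeps every row that is NOT in the removal list.
    return df.loc[~mask].copy()

# Toy data for illustration only.
demo = pd.DataFrame({"track_genre": ["british", "jazz", "french", "jazz", "turkish"]})
demo_trimmed = drop_genres(demo, ["british", "french", "german", "turkish"])
```

Returning a new DataFrame instead of dropping in place leaves the input untouched, which makes it easier to re-run a cell without side effects.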

In [39]:
# A list of the language genres we wish to remove
language_columns = ['british', 'french', 'german', 'spanish', 'turkish', 'indian', 'swedish', 'latin', 'latino', 'iranian']

In [40]:
# Getting all the songs belonging to the genres above 
language_genres = consolidated_spotify[consolidated_spotify['track_genre'].isin(language_columns)]

In [41]:
# Sanity check 
language_genres

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
11000,0DuWDLjriRPjDRoPgaCslY,Adele,25,Love In The Dark,78,285935,0,0.331,0.341,9,-6.057,0,0.0309,0.52800,0.000000,0.1090,0.152,109.821,4,british
11001,3di5hcvxxciiqwMH1jarhY,Adele,21,Set Fire to the Rain,80,242973,0,0.603,0.670,2,-3.882,0,0.0249,0.00408,0.000002,0.1120,0.446,107.993,4,british
11002,6DEMMeWXfmFAXgDUMMzeg6,Adele,25,All I Ask,74,271800,0,0.591,0.280,4,-5.494,1,0.0283,0.88900,0.000000,0.1240,0.348,141.916,4,british
11003,0gplL1WMoJ6iYaPgMCL0gX,Adele,Easy On Me,Easy On Me,85,224694,0,0.604,0.366,5,-7.519,1,0.0282,0.57800,0.000000,0.1330,0.130,141.981,4,british
11004,3bNv3VuUOKgrf5hu3YcuRo,Adele,21,Someone Like You,81,285240,0,0.556,0.319,9,-8.251,1,0.0281,0.89300,0.000000,0.0996,0.294,135.187,4,british
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112995,7fo4ScxpiBnESWu8NRMyzx,Ebru Yaşar,Haddinden Fazla,Öldüm Sahiden,37,249996,0,0.277,0.674,7,-5.128,0,0.0458,0.22400,0.000000,0.0943,0.356,85.839,4,turkish
112996,2iSS96WasflS5EbkXytwDM,Can Bonomo;Ceza,Kâinat Sustu,Terslik Var,37,200776,0,0.676,0.799,7,-6.933,0,0.0412,0.08970,0.000000,0.1330,0.568,98.018,4,turkish
112997,5MaKcM9OtEnbu8YAZbPtfS,Melike Şahin,Merhem: İlk Konserler (Live),Uykumun Boynunu Bükme (Live @ Bostancı Gösteri...,37,280469,0,0.488,0.672,9,-11.690,0,0.0365,0.02190,0.000035,0.7120,0.463,105.943,3,turkish
112998,4yiW7dLPFKyMlalsB8UMSc,Hayko Cepkin,Aşkın Izdırabını,Geç Kaldım - Akustik,37,206680,0,0.425,0.405,7,-12.428,0,0.0299,0.33300,0.000025,0.1360,0.382,160.058,3,turkish


In [42]:
# Turning their indexes into a list 
language_index = list(language_genres.index)

In [43]:
# Dropping genres from DataFrame 
consolidated_spotify.drop(language_index, inplace=True)

In [44]:
# Sanity check 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'breakbeat',
       'cantopop', 'chicago-house', 'children', 'chill', 'classical',
       'club', 'comedy', 'country', 'dance', 'dancehall', 'death-metal',
       'deep-house', 'detroit-techno', 'disco', 'disney', 'drum-and-bass',
       'dub', 'dubstep', 'edm', 'electro', 'electronic', 'emo', 'folk',
       'forro', 'funk', 'garage', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm',
       'indie-pop', 'indie', 'industrial', 'j-dance', 'j-idol', 'j-pop',
       'j-rock', 'jazz', 'k-pop', 'kids', 'malay', 'mandopop', 'metal',
       'metalcore', 'minimal-techno', 'mpb', 'new-age', 'opera', 'pagode',
       'party', 'piano', 'pop-film', 'pop', 'power-pop',
       'progressive-house', 'psych-rock', 'punk-rock', 'punk', 'r-n-b',
       'reggae', 'regga

Let's see where we are now in terms of total songs in our dataset. 

In [45]:
# Checking the size of our DataFrame
consolidated_spotify.shape

(69844, 20)

### Other Remaining Genres to Remove

We now have 69,844 songs. That's a significant loss. However, after removing the following genres, we can start to combine some of the remaining genres based on their umbrella genre: 
- children
- chill
- comedy 
- disney
- kids
- romance
- party
- piano 

These genres are either grouped based on their lyrics or contain tracks that overlap with other genres. 
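
Before dropping a list of genres, it can help to check how many tracks each one contributes. The following is a rough sketch on toy data; `toy` and `candidates` are hypothetical names, and the counts are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for consolidated_spotify (hypothetical data)
toy = pd.DataFrame({
    'track_genre': ['kids', 'kids', 'piano', 'rock', 'rock', 'rock', 'jazz'],
})

# Hypothetical removal list
candidates = ['kids', 'piano', 'romance']

# Count tracks per candidate genre; genres absent from the data show up as 0
counts = (
    toy['track_genre']
    .value_counts()
    .reindex(candidates, fill_value=0)
)

rows_lost = int(counts.sum())
share_lost = rows_lost / len(toy)

print(counts)
print(f'{rows_lost} of {len(toy)} rows ({share_lost:.1%}) would be removed.')
```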

In [46]:
# list of genre names we want to remove
remaining_genre_cols = ['children', 'chill', 'comedy', 'disney', 'kids', 'romance', 'party', 'piano']

In [47]:
# Finding all the songs that are in the above genres
remaining_genres = consolidated_spotify[consolidated_spotify['track_genre'].isin(remaining_genre_cols)]

In [48]:
# Grabbing their index
remaining_indexes = list(remaining_genres.index)

In [49]:
# Removing them from the DataFrame
consolidated_spotify.drop(remaining_indexes, inplace=True)

In [50]:
# Sanity check 
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'breakbeat',
       'cantopop', 'chicago-house', 'classical', 'club', 'country',
       'dance', 'dancehall', 'death-metal', 'deep-house',
       'detroit-techno', 'disco', 'drum-and-bass', 'dub', 'dubstep',
       'edm', 'electro', 'electronic', 'emo', 'folk', 'forro', 'funk',
       'garage', 'gospel', 'goth', 'grindcore', 'groove', 'grunge',
       'guitar', 'hard-rock', 'hardcore', 'hardstyle', 'heavy-metal',
       'hip-hop', 'honky-tonk', 'house', 'idm', 'indie-pop', 'indie',
       'industrial', 'j-dance', 'j-idol', 'j-pop', 'j-rock', 'jazz',
       'k-pop', 'malay', 'mandopop', 'metal', 'metalcore',
       'minimal-techno', 'mpb', 'new-age', 'opera', 'pagode', 'pop-film',
       'pop', 'power-pop', 'progressive-house', 'psych-rock', 'punk-rock',
       'punk', 'r-n-b', 'reggae', 'reggaeton', 'rock-n-roll', 'rock',
       'rockabilly', 'salsa', 'samba', 'sert

In [51]:
consolidated_spotify['track_genre'].nunique()

90

In [52]:
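# Reporting the genre and song counts directly from the DataFrame rather than hardcoding them
print(f"We now have {consolidated_spotify['track_genre'].nunique()} genres and {consolidated_spotify.shape[0]} songs in our dataset.")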

We now have 90 genres and 62962 songs in our dataset.


After more research and observation, the following genres will also be removed: 
- malay
- groove
- guitar 
- pop-film 

Malay is traditional Malaysian music. This genre is similar to our language genres above, where we saw a possible mix of traditional music with other genres already in the dataset. We can safely remove this one. 

Groove, guitar, and pop-film are all too general for our purposes. 

Let's remove these final genres. 

In [53]:
# Creating a list of the genre names we want to remove
remaining_genre_cols = ['malay', 'groove', 'guitar', 'pop-film']

In [54]:
# Finding all the songs that are in the above genres
remaining_genres = consolidated_spotify[consolidated_spotify['track_genre'].isin(remaining_genre_cols)]

In [55]:
# Grabbing their index
remaining_indexes = list(remaining_genres.index)

In [56]:
# Sanity check 
remaining_indexes

[43004,
 43008,
 43029,
 43040,
 43062,
 43064,
 43065,
 43067,
 43071,
 43087,
 43100,
 43107,
 43113,
 43114,
 43116,
 43119,
 43120,
 43121,
 43123,
 43124,
 43126,
 43130,
 43132,
 43134,
 43137,
 43138,
 43139,
 43140,
 43141,
 43143,
 43144,
 43156,
 43157,
 43160,
 43165,
 43166,
 43167,
 43168,
 43169,
 43170,
 43171,
 43172,
 43173,
 43174,
 43175,
 43176,
 43177,
 43180,
 43181,
 43182,
 43184,
 43185,
 43186,
 43187,
 43188,
 43189,
 43190,
 43191,
 43192,
 43193,
 43194,
 43195,
 43197,
 43198,
 43199,
 43204,
 43205,
 43206,
 43210,
 43211,
 43212,
 43213,
 43214,
 43217,
 43218,
 43219,
 43220,
 43223,
 43225,
 43226,
 43227,
 43228,
 43229,
 43230,
 43231,
 43232,
 43233,
 43234,
 43235,
 43236,
 43237,
 43238,
 43239,
 43240,
 43241,
 43243,
 43244,
 43245,
 43246,
 43247,
 43248,
 43249,
 43261,
 43265,
 43266,
 43268,
 43270,
 43271,
 43272,
 43274,
 43276,
 43277,
 43278,
 43279,
 43281,
 43282,
 43283,
 43284,
 43285,
 43286,
 43287,
 43289,
 43290,
 43291,
 43292,


In [57]:
# Removing them from the DataFrame
consolidated_spotify.drop(remaining_indexes, inplace=True)

In [58]:
# Sanity check - our remaining genres
consolidated_spotify['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'breakbeat',
       'cantopop', 'chicago-house', 'classical', 'club', 'country',
       'dance', 'dancehall', 'death-metal', 'deep-house',
       'detroit-techno', 'disco', 'drum-and-bass', 'dub', 'dubstep',
       'edm', 'electro', 'electronic', 'emo', 'folk', 'forro', 'funk',
       'garage', 'gospel', 'goth', 'grindcore', 'grunge', 'hard-rock',
       'hardcore', 'hardstyle', 'heavy-metal', 'hip-hop', 'honky-tonk',
       'house', 'idm', 'indie-pop', 'indie', 'industrial', 'j-dance',
       'j-idol', 'j-pop', 'j-rock', 'jazz', 'k-pop', 'mandopop', 'metal',
       'metalcore', 'minimal-techno', 'mpb', 'new-age', 'opera', 'pagode',
       'pop', 'power-pop', 'progressive-house', 'psych-rock', 'punk-rock',
       'punk', 'r-n-b', 'reggae', 'reggaeton', 'rock-n-roll', 'rock',
       'rockabilly', 'salsa', 'samba', 'sertanejo', 'show-tunes',
       'singer-songwriter'

In [59]:
# Number of remaining genres
consolidated_spotify['track_genre'].nunique()

86

In [60]:
consolidated_spotify.shape

(59721, 20)

In [106]:
# Saving the consolidated DataFrame for use in the modeling notebook
consolidated_spotify.to_csv('~/Desktop/CapstoneProject/data/consolidated_spotify_updated.csv')
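
Because `to_csv` writes the index by default, the next notebook needs to read the file back with `index_col=0`, as was done at the top of this notebook. The following is a small sketch of that round trip, using an in-memory buffer and a toy frame (both hypothetical) instead of the real file.

```python
import io
import pandas as pd

# Toy frame with non-contiguous index labels, as left behind by drop()
toy = pd.DataFrame({'track_genre': ['rock', 'jazz']}, index=[3, 7])

# to_csv writes the index as the first, unnamed column by default
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without index_col turns that column into 'Unnamed: 0'
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with index_col=0 restores the original index labels
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(plain.columns))
print(list(restored.index))
```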

### NEXT STEPS

By removing all of these genres, we have lost a considerable amount of data, but the ambiguity of these genres would only have hindered model performance. There are ways to compensate for the smaller dataset. For example, we have the option of adding songs of the same genres from our other dataset. 

We still have a lot of genres that overlap with one another. For example, we could reduce everything down to the larger parent genres within the dataset, such as rock, pop, electronic, and country. Before we do any further consolidation of genres, let's run a few baseline models to see where we stand. 
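
If we do consolidate into parent genres later, one possible approach is a dictionary lookup on `track_genre`. The following is only a sketch: the `parent_genre` mapping and the `toy` frame are hypothetical and do not represent a final grouping for this project.

```python
import pandas as pd

# Hypothetical sub-genre -> parent-genre mapping (illustrative only)
parent_genre = {
    'alt-rock': 'rock',
    'hard-rock': 'rock',
    'punk-rock': 'rock',
    'indie-pop': 'pop',
    'power-pop': 'pop',
    'deep-house': 'electronic',
    'dubstep': 'electronic',
}

# Toy stand-in for consolidated_spotify
toy = pd.DataFrame({'track_genre': ['alt-rock', 'indie-pop', 'dubstep', 'jazz']})

# Map each sub-genre to its parent; genres without a mapping keep their own name
toy['parent_genre'] = toy['track_genre'].map(parent_genre).fillna(toy['track_genre'])

print(toy['parent_genre'].tolist())
```

Keeping the original `track_genre` column alongside the new one means the baseline models can still be run on the unconsolidated labels for comparison.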

