#Cleaning Spotify Track Data
##In this notebook, I'll be cleaning up the data that we scraped in part 1 (see Spotify_Generate_Tracks.ipynb for how this was originally scraped). Most of the cleaning here involved standardizing genres and calcualting the most popular genres for further analysis. 

##Package Setup and Data Import

In [13]:
import pandas as pd
import numpy as np
from google.colab import files
uploaded = files.upload()

Saving full_trackset.csv to full_trackset (1).csv


In [37]:
# Read in tracks
track_table = pd.read_csv("full_trackset.csv")
track_table.drop_duplicates(inplace=True)
track_table = track_table[['track_pop','track_name', 'track_id','track_year', 'track_spotify_genre',
                           'art_name','art_id', 'alb_name','alb_id', 'art_genre',
                           'duration_ms','time_signature', 'key', 'loudness','energy', 'speechiness',
                           'acousticness','mode','tempo', 'valence', 'danceability', 'instrumentalness',
                           'liveness']]
len(track_table)

80000

##Standardizing and grouping genre
One early issue I ran into was how specific the genres of each song were. Each song could conclude many HIGHLY specific genres like below. To try to use these as features, or to attempt to predict genre with this many potential classifications was nearly impossible.

In [15]:
for track in track_table.art_genre[0:20]:
  print(track)

['classical', 'late romantic era']
['korean instrumental', 'neo-classical', 'new age piano']
['british classical piano', 'classical piano']
['classical cello', 'classical performance']
['bow pop', 'compositional ambient', 'icelandic classical', 'neo-classical', 'neoclassical darkwave']
['classical', 'late romantic era']
['korean instrumental', 'neo-classical', 'new age piano']
['classical performance']
['compositional ambient', 'neo-classical', 'soundtrack']
['classical', 'late romantic era']
['bow pop', 'compositional ambient', 'neo-classical']
['classical', 'early romantic era', 'polish classical']
['classical', 'late romantic era']
['classical', 'german romanticism', 'late romantic era']
['compositional ambient', 'neo-classical', 'new age piano']
['baroque', 'classical', 'early music', 'german baroque']
['bow pop', 'compositional ambient', 'neo-classical']
['bow pop', 'compositional ambient', 'icelandic classical', 'neo-classical', 'neoclassical darkwave']
[]
['classical', 'impressi

I tried to take care of a few low-hanging fruit like converting those listed as hip hop and hip-hop to hiphop, and replacing most dashes with spaces to separate the genres.

In [38]:
# Tranform hip hop into hip-hop, hip pop into hip-pop, early music into early-music
track_table['art_genre'] = track_table['art_genre'].str.replace('hip hop','hiphop')
track_table['art_genre'] = track_table['art_genre'].str.replace('hip-hop','hippop')
track_table['art_genre'] = track_table['art_genre'].str.replace('-','')
track_table['art_genre'] = track_table['art_genre'].str.replace('bow pop','bowpop')
track_table['art_genre'] = track_table['art_genre'].str.replace('art pop','bowpop')

At this point, I had to rethink my strategy a bit. I decided to convert the array of genres per song, like `['classical', 'late romantic era']`, into a dictionary of word counts like `{'classical': 1, 'era': 1, 'late': 1, 'romantic': 1}`. 

While doing this, I also added each word to a master list of word counts to evaluate what genres were most common throughout the entire dataset.

In [39]:
# ARTIST GENRE ANALYSIS

# create dictionary of word counts for each track (track_table['genre_words'])
# AND dictionary of word counts across all tracks (all_words)

import ast
genre_words = []
all_words = dict()

for genre_list in track_table['art_genre']: # for every artist
  artist_count = dict()
  genre_list = ast.literal_eval(genre_list)
  for genre_string in genre_list: # for each genre in artist_genre list
    for word in genre_string.split(): # for each word in each genre
      #Add to master list
      if word in all_words:
        all_words[word] += 1
      else:
        all_words[word] = 1
      #Add to artist-specific list
      if word in artist_count:
        artist_count[word] += 1
      else:
        artist_count[word] = 1
  genre_words.append(artist_count)
track_table['genre_words'] = genre_words

From there, I created a counter to determine the most common genres across all 80,000 songs. I took the 8 most popular genres. I chose 8 because after this number, most of the genres were sub-genres or vague descriptions like modern, indie, or dance, which are often just overpowered by more common genres like Pop or Rock.

I chose to drop *rap* because it's incredibly similar to *hip-hop* sonically. Most songs that are classified as *rap* were also classified as *hip-hop*, and vice versa. 

In [41]:
# Generate set of 7 most popular genres (popular_genres)
from collections import Counter 
all_words_count = Counter(all_words) 
popular_genres = set([word[0] for word in all_words_count.most_common(8)])
print("Most Popular Genres: ", popular_genres)
popular_genres.remove('rap')

Most Popular Genres:  {'country', 'rock', 'r&b', 'pop', 'house', 'rap', 'indie', 'hiphop'}


That left me with 7 genres to use for my analysis.

In [42]:
print(all_words_count.most_common(8))
print("Most Popular Genres: ", popular_genres)

[('pop', 75043), ('rock', 40113), ('country', 32302), ('indie', 29079), ('rap', 28852), ('hiphop', 25594), ('house', 21779), ('r&b', 19538)]
Most Popular Genres:  {'country', 'rock', 'r&b', 'pop', 'house', 'indie', 'hiphop'}


From there, I added a column *master_popular_genre* that would represent which primary genre that the track fell under. Whichever of the top 8 popuplar genres occured most in the genre list for that artist was used as the *master_popular_genre* for that track.

In [43]:
# Add column (track_table["master_popular_genre"]) to each track, representing which of the primary genres the track falls into
popular_word_count = []
for words in track_table['genre_words']: # for every artist
  popular_words = words.copy()
  for word in words:
    if word not in popular_genres:
      del popular_words[word]
  popular_word_count.append(popular_words)

most_popular_master_genre = [Counter(song).most_common(1) for song in popular_word_count]
for item in range(0,len(most_popular_master_genre)):
  if len(most_popular_master_genre[item]) == 0:
    most_popular_master_genre[item] = np.nan
  else:
    most_popular_master_genre[item] = most_popular_master_genre[item][0][0]
track_table["master_popular_genre"] = most_popular_master_genre

In [44]:
print("Number of tracks that do not fall under top 10 genres:", track_table[track_table["master_popular_genre"].isna()].count().track_id,"\r\n")
print("Number of tracks that fall under top 10 genres: ",track_table.track_id.count() - track_table[track_table["master_popular_genre"].isna()].count().track_id)
print(display(track_table.groupby("master_popular_genre").count().track_id))

Number of tracks that do not fall under top 10 genres: 13316 

Number of tracks that fall under top 10 genres:  66684


master_popular_genre
country    10005
hiphop     10728
house       6529
indie       7673
pop        17277
r&b         5505
rock        8967
Name: track_id, dtype: int64

None


By summarizing all genres into just 8 primary genres, we lost about 13,000 out of 80,000 songs. This still leaves us with a sample of 67,000 tracks, which should be plenty for our analysis.

In [45]:
track_table.dropna(inplace=True)

The final track data output looks like this. Notice that in the below example, since one of the art_genres was *instrumental rock*, the master genre for the track was labeled as *rock*.

In [47]:
print(track_table.iloc[100], "\r\n")
print("art_genre:",track_table.iloc[100,9])
print("master_popular_genre:",track_table.iloc[100,24])

track_pop                                                              30
track_name                                            Tragedy and Harmony
track_id                                           1MLDzI1sCN0cr9UmTwyDd0
track_year                                                           2012
track_spotify_genre                                             classical
art_name                                                      Jeff Loomis
art_id                                             1bqlyIXRTCilCmvAr1newg
alb_name                                               Plains Of Oblivion
alb_id                                             337UZEcZWgax9G9xYIDN7Z
art_genre               ['djent', 'instrumental rock', 'jazz metal', '...
duration_ms                                                        301013
time_signature                                                          4
key                                                                     1
loudness                              

In [48]:
track_table.to_csv('cleaned_tracks_both.csv', index=False)
files.download('cleaned_tracks_both.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>