# Notebook 4: Dimensionality Reduction of Genre

### Introduction

I need to clean up the genre information for each artist because I plan to validate the PCA results by checking how the albums cluster: albums with similar genres should land near each other. Each artist currently has multiple genres assigned, so I will use CorEx to reduce the genre tags to a small set of topics and assign each album to a single genre.

In [1]:
import pandas as pd
import numpy as np
import pickle
import sys

sys.setrecursionlimit(1000000) #to allow pickling

from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
with open('../data/full.pickle', 'rb') as read_file:
    all = pickle.load(read_file)

In [3]:
all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12163 entries, 0 to 12162
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   album_artist      12163 non-null  object 
 1   album_id          12163 non-null  object 
 2   genre             12163 non-null  object 
 3   track_id          12163 non-null  object 
 4   danceability      12163 non-null  float64
 5   energy            12163 non-null  float64
 6   key               12163 non-null  int64  
 7   loudness          12163 non-null  float64
 8   mode              12163 non-null  int64  
 9   speechiness       12163 non-null  float64
 10  acousticness      12163 non-null  float64
 11  instrumentalness  12163 non-null  float64
 12  liveness          12163 non-null  float64
 13  valence           12163 non-null  float64
 14  tempo             12163 non-null  float64
 15  type              12163 non-null  object 
 16  id                12163 non-null  object

### Process Genre Data To Prepare for Count Vectorization

In [4]:
genre_data = all.loc[:, ['album_artist', 'genre']]
genre_data['genre_flat'] = genre_data.genre.map(lambda x: ' '.join([i.replace(' ', '_').replace('-', '_') for i in x]))
genre_data['genre_flat2'] = genre_data.genre.map(lambda x: ' '.join([i.replace(' ', '_').replace('-', '_').replace('+', '').replace("'", '').replace('&', '') for i in x]))
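
To make the flattening step concrete, here is a minimal sketch applied to a made-up genre list. The `flatten_genres` helper and the sample values are illustrative only and are not part of the notebook. Each multi-word genre becomes one underscore-joined token, so the count vectorizer later treats it as a single word:

```python
def flatten_genres(genres):
    """Join a list of genre strings into one space-separated string of tokens."""
    tokens = []
    for g in genres:
        # Same replacements as the genre_flat2 column above
        g = g.replace(' ', '_').replace('-', '_')
        g = g.replace('+', '').replace("'", '').replace('&', '')
        tokens.append(g)
    return ' '.join(tokens)

sample = ['art pop', 'lo-fi', 'alternative r&b']  # hypothetical input
flat = flatten_genres(sample)
print(flat)  # art_pop lo_fi alternative_rb
```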

In [5]:
genre_album = genre_data.loc[:, ['album_artist', 'genre_flat2']].drop_duplicates().reset_index(drop = True)

### Use Count Vectorizer to Get All Possible Genres

In [6]:
all_genres = CountVectorizer()
album_genres = all_genres.fit_transform(genre_album['genre_flat2'])

Note that this fitted count vectorizer may not work well on other kinds of music: its vocabulary is built only from the genres present in this dataset, so the set of available genres would change dramatically if new kinds of music were included.
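
The limitation can be illustrated with a small sketch on a toy corpus (the documents below are invented for the example). A fitted `CountVectorizer` keeps only the tokens it saw during `fit`, so genre tokens from other kinds of music are dropped when new data is transformed:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training corpus: one string of genre tokens per album
train_docs = ['art_pop dance_pop', 'indie_rock art_pop']
vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# The learned vocabulary contains only the tokens seen during fit
vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# Unseen genres (bebop, hard_bop) are silently ignored at transform time
new_doc = ['bebop hard_bop art_pop']
counts = vectorizer.transform(new_doc).toarray()
print(counts)
```

Only the `art_pop` column is non-zero for the new document, so an album described mostly by unseen genres would carry almost no signal into the next step.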

In [7]:
#List of all genres detected
#(scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
all_genres.get_feature_names()

['abstract_hip_hop',
 'abstract_idm',
 'acoustic_pop',
 'action_rock',
 'adult_standards',
 'african_electronic',
 'afro_funk',
 'afro_psych',
 'afrobeat',
 'afrofuturism',
 'afropop',
 'afroswing',
 'alabama_indie',
 'album_rock',
 'albuquerque_indie',
 'alternative_americana',
 'alternative_country',
 'alternative_dance',
 'alternative_emo',
 'alternative_hip_hop',
 'alternative_metal',
 'alternative_pop',
 'alternative_rb',
 'alternative_rock',
 'alternative_roots_rock',
 'ambient',
 'ambient_black_metal',
 'ambient_folk',
 'ambient_idm',
 'ambient_pop',
 'ambient_techno',
 'american_folk_revival',
 'american_metalcore',
 'american_post_rock',
 'american_primitive',
 'american_shoegaze',
 'anime_score',
 'anthem_emo',
 'anti_folk',
 'apostolic_worship',
 'appalachian_folk',
 'argentine_indie',
 'arkansas_country',
 'art_pop',
 'art_punk',
 'art_rock',
 'athens_indie',
 'atl_hip_hop',
 'atl_trap',
 'atlanta_indie',
 'atmospheric_black_metal',
 'atmospheric_doom',
 'atmospheric_post_m

In [8]:
#Dataframe of all albums and which genres they are associated with
test = pd.DataFrame(album_genres.toarray(), columns = all_genres.get_feature_names())

In [9]:
genre_album.iloc[159, :]

album_artist                album:Arular artist:M.I.A.
genre_flat2     art_pop dance_pop escape_room new_rave
Name: 159, dtype: object

### Use CorEx for Dimensionality Reduction

In [10]:
#n_hidden is the number of topics (latent genres) to learn; the original setting is 10
genre_category = ct.Corex(n_hidden = 10, words = all_genres.get_feature_names(), seed = 1)
genre_category.fit(album_genres, words = all_genres.get_feature_names(), docs = genre_album['genre_flat2'])

<corextopic.corextopic.Corex at 0x10255c4dd0>

In [11]:
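#Print the top words for each topic
topics = genre_category.get_topics()
for n, topic in enumerate(topics):
    #The first element of each entry is the genre word; the rest are scores
    genre_words = [entry[0] for entry in topic]
    print('{}: '.format(n) + ','.join(genre_words))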

0: folk,roots_rock,singer_songwriter,traditional_folk,country_rock,folk_rock,alternative_country,new_americana,mellow_gold,lilith
1: hip_hop,rap,alternative_hip_hop,southern_hip_hop,neo_soul,soul,conscious_hip_hop,underground_hip_hop,pop_rap,funk
2: intelligent_dance_music,electronica,microhouse,fluxwork,ambient,compositional_ambient,fourth_world,electra,drone,minimal_techno
3: classic_rock,post_metal,drone_metal,post_doom_metal,alternative_metal,sludge_metal,album_rock,doom_metal,stoner_metal,progressive_sludge
4: indie_rock,indie_pop,dance_punk,alternative_dance,new_rave,art_pop,noise_pop,indie_punk,modern_power_pop,brooklyn_indie
5: freak_folk,indie_folk,chamber_pop,stomp_and_holler,experimental_pop,lo_fi,garage_psych,chamber_psych,gbvfi,experimental
6: alternative_rb,indie_soul,escape_room,contemporary_jazz,indie_rb,afrofuturism,indie_jazz,abstract_hip_hop,contemporary_post_bop,trap_queen
7: alternative_rock,rock,modern_rock,permanent_wave,dance_rock,britpop,new_wave,garage_rock,pu

### Assign Genres to Each Album By Probability, With an "Other" Category

In [12]:
album_assigned_genre = pd.concat([genre_album.iloc[:, 0], 
           pd.DataFrame(genre_category.predict(album_genres), 
                        columns = ['genre'+str(i) for i in range(10)])],  axis = 1)

In [13]:
probabilities = pd.DataFrame(genre_category.p_y_given_x, columns = ['genre'+str(i) for i in range(10)])

In [14]:
genres_assigned = album_assigned_genre.set_index('album_artist')
genres_assigned['any_genre'] = genres_assigned.apply(lambda x: 1 if x.sum() >= 1 else 0, axis = 1)
genres_assigned['genre10'] = genres_assigned.any_genre.map(lambda x: True if x == 0 else False) #genre10 is the catch-all ("Other")
genres_assigned = genres_assigned.reset_index()
#In instances where the album fits in multiple genres, assign the one with the highest probability
genres_assigned['genre_num'] = probabilities.apply(lambda x: np.argmax(x), axis = 1)
#If album does not fit into any genre (probability < 50%), put it in "Other" category
genres_assigned['final_genre'] = genres_assigned.apply(lambda x: 10 if x.genre10 else x.genre_num, axis = 1)
#sum(sum_first) #775/993
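
The assignment rule above can be sketched on a toy probability table. The numbers are invented, and the `> 0.5` comparison is a stand-in for the boolean labels returned by `predict`, following the 50% threshold mentioned in the comment above:

```python
import numpy as np
import pandas as pd

# Hypothetical topic probabilities for three albums and three topics
probs = pd.DataFrame([[0.9, 0.6, 0.1],
                      [0.2, 0.3, 0.4],
                      [0.1, 0.7, 0.2]],
                     columns=['genre0', 'genre1', 'genre2'])

OTHER = probs.shape[1]                        # index reserved for the catch-all category
labels = probs > 0.5                          # assumed stand-in for predict() output
has_genre = labels.any(axis=1).to_numpy()     # True if the album fits at least one topic
best = probs.to_numpy().argmax(axis=1)        # most probable topic per album
final = np.where(has_genre, best, OTHER)      # fall back to "Other" when nothing fits
print(final.tolist())  # [0, 3, 1]
```

The first album fits two topics and gets the more probable one, the second fits none and falls into the catch-all, and the third fits exactly one.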

### Name Genres, Finalize Genre Data

In [15]:
album_assigned = genres_assigned[['album_artist', 'final_genre']]

In [16]:
genre_mapping = pd.DataFrame({0: 'Folk-SingerSongwriter', 
                              1: 'HipHop-Rap',
                              2: 'Dance-Electronica',
                              3: 'Metal-ClassicRock', 
                              4: 'Indie',
                              5: 'Experimental',
                              6: 'AlternativeRnB-IndieSoul',
                              7: 'Rock-NewWave',
                              8: 'Electropop-Pop',
                              9: 'ArtRock-SlowCore',
                              10: 'Other'}, index = [1]).T.reset_index()

In [17]:
album_genre_final = album_assigned.merge(genre_mapping, left_on = 'final_genre', right_on = 'index')[['album_artist', 1]]
album_genre_final.columns = ['album_artist', 'genre']
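
As an alternative sketch (not what the notebook does), the same lookup can be done with a plain dictionary and `Series.map`, which avoids building and transposing a one-row DataFrame. The album names and the shortened dictionary below are hypothetical:

```python
import pandas as pd

# Hypothetical subset of the genre names and album assignments
genre_names = {0: 'Folk-SingerSongwriter', 1: 'HipHop-Rap', 10: 'Other'}
albums = pd.DataFrame({
    'album_artist': ['album:A artist:X', 'album:B artist:Y', 'album:C artist:Z'],
    'final_genre': [1, 10, 0],
})

# map() looks up each code in the dictionary and keeps the original row order
albums['genre'] = albums['final_genre'].map(genre_names)
result = albums[['album_artist', 'genre']]
print(result)
```

With `map`, any code missing from the dictionary becomes `NaN` instead of the row disappearing, which makes an incomplete mapping easier to notice.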

In [18]:
album_genre_final.genre.value_counts()

Other                       180
Indie                       125
Folk-SingerSongwriter       122
HipHop-Rap                  119
Experimental                110
Dance-Electronica           107
Metal-ClassicRock            68
AlternativeRnB-IndieSoul     54
Rock-NewWave                 43
Electropop-Pop               42
ArtRock-SlowCore             22
Name: genre, dtype: int64

In [225]:
# with open('../data/album_genre.pickle', 'wb') as to_write:
#     pickle.dump(album_genre_final, to_write)