This notebook calculates the following:
### <p><center>For each genre, what is the gender split among all artists that are labeled with that genre?</center></p> 

- Part 1 compiles a dataframe ('genre_stats') with this information along with some other stats on each genre label from the training set.
- Part 2 further analyzes this information.

Unless something needs to be recalculated, start at Part 2, which imports a saved copy of the dataframe 'genre_stats'; recomputing it in Part 1 takes about 5 minutes.

In [44]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
#%matplotlib notebook


import re

from functools import partial

import plotly.graph_objects as go

# Part 1

Import the cleaned data:

In [2]:
#%ls -lt ../../data/genre_lists/data_ready_for_model/

In [3]:
%store -r now
now
#now = '2020-05-11-14-35'

'2020-05-18-10-06'

In [4]:
X_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now), index_col = ['artist'])
y_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now), index_col = ['artist'])

### Genre Labels -- as a list

Each value of the 'genrelist' column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

In [5]:
"""This function takes in a string of the form
appearing in the genrelist of the dataframe.
It strips the square brackets and extra quotes and
returns a list of strings where each string is a genre label."""
def genrelist(string):
    string = string.strip("[").strip("]").replace("'","")
    # replace inner spaces with underscores and trim stray underscores left by the split
    labels = [s.replace(" ", "_").strip("_") for s in string.split(",")]
    # drop empty labels (e.g. from an empty list "[]")
    return [s for s in labels if s]
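
As a quick check of what the parser does, here is a minimal sketch. The function is restated compactly so the snippet runs on its own, and the `sample` string is a made-up value in the format described above, not a row from the data.

```python
def genrelist(string):
    """Parse a string like "['hip hop', 'pop']" into ['hip_hop', 'pop']."""
    string = string.strip("[").strip("]").replace("'", "")
    labels = [s.replace(" ", "_").strip("_") for s in string.split(",")]
    return [s for s in labels if s]

# hypothetical raw value, as it might appear in the genrelist column
sample = "['pop', 'hip hop', 'r and b']"
parsed = genrelist(sample)
print(parsed)             # ['pop', 'hip_hop', 'r_and_b']
print(genrelist("[]"))    # [] (an empty list string yields no labels)
```

Multi-word labels end up joined by underscores, which matches label names such as `hip_hop` and `r_and_b` in the stats tables of Part 2.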

Now we
- apply it to the whole column, replacing the strings in 'genrelist' with lists
- assemble X, y into a single DataFrame
- reset the index and name it 'artist_id'

In [6]:
X_train['genrelist']= X_train['genrelist'].apply(genrelist)

data = X_train.join(y_train, how = 'inner', on = 'artist')

data.reset_index(inplace = True)
data.index.name = 'artist_id'
data_set_size = data.shape[0]

- Full genre_list (not just that for the training set)
- Vocab Dict and Size
- max length of lists

In [7]:
genre_list = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))
genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)
genre_list['genre_id'] = list(range(1,genre_list.shape[0]+1))

#Size of the vocab:
vocab_size = genre_list.shape[0]

#Create a dictionary {genre_label: genre_id}
label_id_dict = genre_list.set_index(['genre_list'])['genre_id'].to_dict()
id_label_dict = genre_list.set_index(['genre_id'])['genre_list'].to_dict()



#Find max length of genre lists:
max_list_length = data.genrelist_length.max()
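
To illustrate the two lookup dictionaries, here is a small sketch on a made-up three-label vocabulary. The frame `toy_genres` and its contents are assumptions for illustration only; the real vocabulary comes from the genre_list CSV loaded above.

```python
import pandas as pd

# hypothetical vocabulary standing in for the genre_list CSV
toy_genres = pd.DataFrame({'genre_list': ['pop', 'rock', 'hip_hop']})
toy_genres['genre_id'] = range(1, len(toy_genres) + 1)  # ids start at 1, as in the cell above

# forward and reverse lookups, built the same way as label_id_dict / id_label_dict
toy_label_id = toy_genres.set_index('genre_list')['genre_id'].to_dict()
toy_id_label = toy_genres.set_index('genre_id')['genre_list'].to_dict()

encoded = [toy_label_id[g] for g in ['rock', 'pop']]
decoded = [toy_id_label[i] for i in encoded]
print(encoded, decoded)
```

Encoding followed by decoding returns the original labels, which is the consistency that the commented-out check in the encoding cell is meant to confirm.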

### Count the frequency of each label and prepare other columns

(This deals only with the training data, not the test data.)

In [8]:
genre_list_1 = data.genrelist.values.tolist()
genre_list_1 = [x for y in genre_list_1 for x in y]
genre_counts = pd.Series(genre_list_1, name = 'frequency')
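# rename explicitly: pandas >= 2.0 names the value_counts result 'count' instead of reusing the Series name
genre_stats = genre_counts.value_counts().rename('frequency').to_frame()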
genre_stats.index.name = 'label'
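
The same count can be sketched without the manual flattening step by using `Series.explode`. The toy frame below is made up; only the column name 'genrelist' is taken from the notebook.

```python
import pandas as pd

# hypothetical miniature of the genrelist column
toy = pd.DataFrame({'genrelist': [['pop', 'rock'], ['pop'], ['rock', 'pop', 'soul']]})

# explode gives one row per label occurrence; value_counts then tallies them
toy_counts = (
    toy['genrelist']
    .explode()
    .value_counts()
    .rename('frequency')   # fix the column name regardless of pandas version
    .to_frame()
)
toy_counts.index.name = 'label'
print(toy_counts)
```

Note that this counts label occurrences, so a label repeated inside a single artist's list would be counted more than once, just as in the flattened-list version above.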

In [9]:
# encode labels as ints within the list
def encode_list(row):
    return [label_id_dict[item] for item in row.genrelist]

data['genres_encoded_as_list'] = data.apply(encode_list, axis = 1)

#Check that the encoding is consistent: 
# n = np.random.randint(data.shape[0])
# [label_id_dict[item] for item in data.genrelist.iloc[n]], data.genres_encoded_as_list.iloc[n]

# Encode targets as integers: 1 for 'female', 0 otherwise (i.e. male).
data['gender'] = data.gender.apply(lambda x: 1 if x == 'female' else 0)

In [10]:
max_num_male = 73
max_num_female = 11
max_num = max(max_num_female, max_num_male)

### Stats by genre label
- [ ] do this using sparse dataframe with one-hot encoding for speed up?

In [11]:
# function used with apply on data to flag artists with a given label;
# it relies on the global variable `label`, which is set by the loop below
def indicate(row):
    if label in row.genrelist:
        return 1
    else:
        return 0

In [12]:
data.columns

Index(['artist', 'genrelist', 'genrelist_length', 'gender',
       'genres_encoded_as_list'],
      dtype='object')

In [13]:
# produce stats for each label: female and male counts; max, min, and mean number of labels per artist
for label in genre_stats.index: # use labels ordered by their frequency of appearance
    data['indicator'] = data.apply(indicate, axis = 1)
    label_artists = data[data.indicator == 1]
    genre_stats.loc[label,'female'] = int(label_artists.gender.sum())
    genre_stats.loc[label,'male'] = label_artists.shape[0]-label_artists.gender.sum()
    genre_stats.loc[label,'max_list_length'] = label_artists.genrelist_length.max()
    genre_stats.loc[label,'min_list_length'] = label_artists.genrelist_length.min()
    genre_stats.loc[label,'mean_list_length'] = label_artists.genrelist_length.mean()
    #data.drop(['indicator'], inplace = True)

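# calculated columns
# note: 'frequency' counts label occurrences while 'female' and 'male' count artists,
# so the two shares can sum to slightly less than 1 (presumably when a label is repeated within one list)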
genre_stats['female%'] = genre_stats['female']/genre_stats['frequency']
genre_stats['male%'] = genre_stats['male']/genre_stats['frequency']
# reorder columns
genre_stats = genre_stats[['frequency','female','male','female%','male%','max_list_length','min_list_length','mean_list_length']]
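
Regarding the speed-up item in the to-do above: one possible alternative to the per-label loop is to explode the lists once and aggregate with `groupby`. This is only a sketch on made-up data, not the method used to produce the saved CSV. The frame `toy` and the column name `artists` are assumptions; also note that it divides by the number of distinct artists per label rather than by 'frequency', so results can differ slightly from the loop whenever a label is repeated within one list.

```python
import pandas as pd

# hypothetical data with the same columns as `data` (gender: 1 = female, 0 = male)
toy = pd.DataFrame({
    'artist': ['a', 'b', 'c', 'd'],
    'genrelist': [['pop', 'rock'], ['pop'], ['rock'], ['pop', 'soul', 'rock']],
    'gender': [1, 0, 0, 1],
})
toy['genrelist_length'] = toy['genrelist'].apply(len)

# one row per (artist, label) pair; drop repeats of a label within one artist
long = toy.explode('genrelist').rename(columns={'genrelist': 'label'})
long = long.drop_duplicates(['artist', 'label'])

# aggregate all per-label stats in a single pass
toy_stats = long.groupby('label').agg(
    artists=('gender', 'size'),
    female=('gender', 'sum'),
    max_list_length=('genrelist_length', 'max'),
    min_list_length=('genrelist_length', 'min'),
    mean_list_length=('genrelist_length', 'mean'),
)
toy_stats['male'] = toy_stats['artists'] - toy_stats['female']
toy_stats['female%'] = toy_stats['female'] / toy_stats['artists']
toy_stats['male%'] = toy_stats['male'] / toy_stats['artists']
print(toy_stats)
```

Because the data is scanned once instead of once per label, this should scale much better than calling `apply` inside the loop, though the actual gain on the full training set has not been measured here.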

Write to csv:

In [15]:
genre_stats.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_stats.csv')

# Part 2

In [45]:
genre_stats = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_stats.csv', index_col = 'label')

In [46]:
genre_stats.head()

label,frequency,female,male,female%,male%,max_list_length,min_list_length,mean_list_length
pop,2617,1321.0,1294.0,0.504776,0.494459,73.0,1.0,3.065392
rock,1765,356.0,1409.0,0.2017,0.7983,13.0,1.0,3.373371
r_and_b,1647,760.0,887.0,0.461445,0.538555,13.0,1.0,3.585914
country,1613,504.0,1108.0,0.312461,0.686919,12.0,1.0,2.358561
hip_hop,1114,187.0,927.0,0.167864,0.832136,73.0,1.0,2.5386


Overall share of female and male artists in the training set (this cell uses the dataframe 'data' built in Part 1):

In [47]:
f_ratio = data.gender.sum()/data.shape[0]
m_ratio = (1-data.gender).sum()/data.shape[0]
f_ratio, m_ratio

(0.3108435681965094, 0.6891564318034906)

In [48]:
genre_stats['f-skew'] = genre_stats['female%']/f_ratio
genre_stats['m-skew'] = genre_stats['male%']/m_ratio
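
To make the skew columns concrete: a skew is a genre's gender share divided by the overall share of that gender, so a value above 1 means the gender is over-represented in that genre relative to the training set as a whole. The sketch below redoes the arithmetic by hand for the hardcore_punk row of genre_stats (5 female, 130 male, frequency 135), using the overall ratios printed above.

```python
f_ratio = 0.3108435681965094   # overall female share (output of the ratio cell)
m_ratio = 0.6891564318034906   # overall male share

# hardcore_punk: 5 female and 130 male artists, frequency 135
female_share = 5 / 135
male_share = 130 / 135

f_skew = female_share / f_ratio
m_skew = male_share / m_ratio
print(round(f_skew, 5), round(m_skew, 6))   # 0.11915 1.397307
```

So hardcore_punk has roughly 0.12 times the female share and 1.4 times the male share of the training set overall.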

Rank genres (with frequency >= 100) by male-skew descending:

In [49]:
genre_stats[(genre_stats.frequency >= 100) & (genre_stats['m-skew'] > 1)].sort_values(['m-skew'], ascending = False)


label,frequency,female,male,female%,male%,max_list_length,min_list_length,mean_list_length,f-skew,m-skew
hardcore_punk,135,5.0,130.0,0.037037,0.962963,11.0,1.0,4.459259,0.11915,1.397307
progressive_rock,227,9.0,218.0,0.039648,0.960352,73.0,1.0,4.453744,0.127548,1.393519
hard_rock,616,42.0,574.0,0.068182,0.931818,73.0,1.0,4.150974,0.219344,1.352114
heavy_metal,453,34.0,419.0,0.075055,0.924945,15.0,1.0,4.090508,0.241456,1.342141
rockabilly,129,10.0,119.0,0.077519,0.922481,14.0,1.0,4.062016,0.249384,1.338565
jazz_fusion,122,11.0,111.0,0.090164,0.909836,45.0,2.0,4.737705,0.290062,1.320217
alternative_metal,148,14.0,134.0,0.094595,0.905405,12.0,1.0,4.959459,0.304316,1.313788
blues_rock,335,31.0,303.0,0.092537,0.904478,12.0,1.0,4.227545,0.297697,1.312442
electric_blues,156,15.0,141.0,0.096154,0.903846,9.0,1.0,2.532051,0.309332,1.311525
rock_and_roll,299,29.0,270.0,0.09699,0.90301,12.0,1.0,3.698997,0.312022,1.310312


Rank genres (with frequency >= 100) by female-skew descending:

In [50]:
genre_stats[(genre_stats.frequency >= 100) & (genre_stats['f-skew'] > 1)].sort_values(['f-skew'], ascending = False)


label,frequency,female,male,female%,male%,max_list_length,min_list_length,mean_list_length,f-skew,m-skew
dance_pop,183,133.0,50.0,0.726776,0.273224,9.0,1.0,3.754098,2.338076,0.396462
electro_pop,123,83.0,40.0,0.674797,0.325203,10.0,1.0,3.910569,2.170856,0.471886
dance,307,204.0,103.0,0.664495,0.335505,73.0,1.0,4.205212,2.137716,0.486834
house,143,94.0,49.0,0.657343,0.342657,14.0,1.0,4.195804,2.114706,0.497213
disco,147,88.0,59.0,0.598639,0.401361,73.0,1.0,5.061224,1.925854,0.582394
indie_pop,238,141.0,97.0,0.592437,0.407563,73.0,1.0,3.651261,1.905901,0.591394
soul,1023,532.0,490.0,0.520039,0.478983,13.0,1.0,3.894325,1.672993,0.695029
pop,2617,1321.0,1294.0,0.504776,0.494459,73.0,1.0,3.065392,1.623892,0.717485
alternative,120,58.0,62.0,0.483333,0.516667,11.0,1.0,3.575,1.554909,0.749709
gospel,346,162.0,184.0,0.468208,0.531792,12.0,1.0,3.936416,1.50625,0.771656


Rank male-skewed genres (with frequency >= 50) by share of male artists ('male%'), descending:


In [51]:
genre_stats[(genre_stats.frequency >= 50) & (genre_stats['m-skew'] > 1)].sort_values(['male%'], ascending = False)


label,frequency,female,male,female%,male%,max_list_length,min_list_length,mean_list_length,f-skew,m-skew
post_hardcore,87,1.0,86.0,0.011494,0.988506,10.0,1.0,4.83908,0.036978,1.434371
death_metal,71,1.0,70.0,0.014085,0.985915,12.0,1.0,4.056338,0.045311,1.430612
progressive_metal,87,2.0,85.0,0.022989,0.977011,9.0,1.0,4.264368,0.073955,1.417692
gangsta_rap,68,2.0,66.0,0.029412,0.970588,7.0,1.0,3.102941,0.094619,1.408371
thrash_metal,98,3.0,95.0,0.030612,0.969388,12.0,1.0,4.377551,0.098481,1.406629
emo,62,2.0,60.0,0.032258,0.967742,10.0,2.0,5.258065,0.103776,1.404241
hardcore_punk,135,5.0,130.0,0.037037,0.962963,11.0,1.0,4.459259,0.11915,1.397307
progressive_rock,227,9.0,218.0,0.039648,0.960352,73.0,1.0,4.453744,0.127548,1.393519
power_metal,63,3.0,60.0,0.047619,0.952381,9.0,1.0,4.190476,0.153193,1.381952
southern_rock,90,5.0,85.0,0.055556,0.944444,10.0,1.0,3.944444,0.178725,1.370436


Rank female-skewed genres (with frequency >= 50) by share of female artists ('female%'), descending:

In [52]:
genre_stats[(genre_stats.frequency >= 50) & (genre_stats['f-skew'] > 1)].sort_values(['female%'], ascending = False)



label,frequency,female,male,female%,male%,max_list_length,min_list_length,mean_list_length,f-skew,m-skew
vocal_jazz,69,56.0,13.0,0.811594,0.188406,9.0,1.0,2.927536,2.610941,0.273386
dance_pop,183,133.0,50.0,0.726776,0.273224,9.0,1.0,3.754098,2.338076,0.396462
dream_pop,60,41.0,19.0,0.683333,0.316667,10.0,1.0,4.1,2.198319,0.459499
electro_pop,123,83.0,40.0,0.674797,0.325203,10.0,1.0,3.910569,2.170856,0.471886
dance,307,204.0,103.0,0.664495,0.335505,73.0,1.0,4.205212,2.137716,0.486834
house,143,94.0,49.0,0.657343,0.342657,14.0,1.0,4.195804,2.114706,0.497213
trip_hop,70,45.0,25.0,0.642857,0.357143,13.0,2.0,4.3,2.068105,0.518232
celtic,62,39.0,23.0,0.629032,0.370968,11.0,1.0,3.33871,2.02363,0.538293
disco,147,88.0,59.0,0.598639,0.401361,73.0,1.0,5.061224,1.925854,0.582394
indie_pop,238,141.0,97.0,0.592437,0.407563,73.0,1.0,3.651261,1.905901,0.591394
