This notebook is used to clean and prepare the genre label data for analysis.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

In [2]:
%ls -l /Users/Daniel/Code/Genre/data/

total 0
drwxr-xr-x  1142 Daniel  staff  36544 Apr 17 10:27 artist_network_graphs/
drwxr-xr-x    14 Daniel  staff    448 Apr 22 15:14 genre_lists/


### Data Sets

The file singers_gender.csv, from Kaggle, lists music artists and their gender; it is our starting point. It is augmented using the lists of women artists. Genre and network information will be generated by scraping databases. For now we are focusing on Wikipedia.

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
# kaggle_data.shape

### Load the data to be cleaned:

Current: wiki-kaggle_genres_rough.csv

- This will be replaced by the fully scraped set
- The full set needs to be cleaned

Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

- I renamed the first column of the CSV file to 'index'
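
A minimal sketch of such a converter. It assumes each genre cell holds either the placeholder 'none' or a Python-style list literal such as "['Funk music', 'Disco music', 'Dance music']", in which case parsing the literal with ast.literal_eval is more robust than splitting on commas. The helper name parse_genres, the inline sample CSV, and the row labelled "Example Artist" are illustrative assumptions, not part of the project files:

```python
import ast
import io

import pandas as pd


def parse_genres(cell):
    """Turn a cell like "['Rock music', 'Indie pop']" into a list of strings.

    Cells that do not look like a list literal (e.g. the placeholder
    'none') become an empty list.
    """
    cell = cell.strip()
    if not cell.startswith('['):
        return []
    return [str(g).strip() for g in ast.literal_eval(cell)]


# Hypothetical sample standing in for wiki-kaggle_genres_rough.csv.
sample_csv = io.StringIO("""index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,"['Country music']"
99,Example Artist,none,"['Funk music', 'Disco music', 'Dance music']"
""")

# pandas hands each raw genre string to the converter while reading.
sample = pd.read_csv(sample_csv, header=0, index_col='index',
                     converters={'genre': parse_genres})
print(sample.genre.tolist())
```

The same converters argument could be passed to the read_csv call in the next cell, so that the genre column arrives as lists rather than strings.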

In [28]:
data = pd.read_csv('../../data/genre_lists/wiki-kaggle_genres_rough.csv', header = 0, index_col = 'index')

In [29]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [20]:
data.shape

(8770, 4)

We want to remove the artists whose 'retrieved' value does not match 'artist'. Since 'retrieved' is a Wikipedia URL (or 'none') and 'artist' is a plain name, we first need to put both values in the same format.

Take a glance at artist and retrieved values:

In [33]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Helen Martin        retrieved: none
artist: Jessica Jacobs        retrieved: https://en.wikipedia.org/wiki/Jessica_Jacobs
artist: Sam Riley        retrieved: none
artist: Clay Davidson        retrieved: https://en.wikipedia.org/wiki/Clay_Davidson
artist: Florence LaRue        retrieved: https://en.wikipedia.org/wiki/Florence_LaRue
artist: Gary Stewart        retrieved: https://en.wikipedia.org/wiki/Gary_Stewart_(singer)
artist: Warren Storm        retrieved: https://en.wikipedia.org/wiki/Warren_Storm
artist: Olaf Thörsen        retrieved: https://en.wikipedia.org/wiki/Olaf_Thörsen
artist: Harry Wayne Casey        retrieved: https://en.wikipedia.org/wiki/Harry_Wayne_Casey
artist: Marcel Gagnon        retrieved: none
artist: Somi        retrieved: none
artist: George Lynch        retrieved: https://en.wikipedia.org/wiki/George_Lynch_(musician)
artist: Dan Hicks        retrieved: https://en.wikipedia.org/wiki/Dan_Hicks_(musician)
artist: Maria Azevedo        retrieved: https://en.

Retrieved:

- Every retrieved URL appears to start with the same incipit (see below), which needs to be removed
- Underscores are used to separate the parts of the name
- Periods ('.') are allowed in names
- '(singer)' and '(musician)' are sometimes appended and need to be stripped
- Double quotes are allowed in names

In [None]:
incipit = 'https://en.wikipedia.org/wiki/'
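
The observations above could be turned into a normalizing function along the following lines. This is only a sketch: the name name_from_url is hypothetical, and it assumes that any trailing parenthetical (not only '(singer)' and '(musician)') is a disambiguation tag that can be dropped.

```python
import re
from urllib.parse import unquote

INCIPIT = 'https://en.wikipedia.org/wiki/'


def name_from_url(url):
    """Convert a retrieved Wikipedia URL into a plain artist name.

    Returns None when the value does not start with the incipit
    (for example the placeholder 'none').
    """
    if not url.startswith(INCIPIT):
        return None
    title = unquote(url[len(INCIPIT):])          # decode any %-escapes
    title = title.replace('_', ' ')              # underscores separate name parts
    title = re.sub(r'\s*\([^)]*\)$', '', title)  # drop a trailing "(singer)" etc.
    return title.strip()


print(name_from_url('https://en.wikipedia.org/wiki/Gary_Stewart_(singer)'))
print(name_from_url('https://en.wikipedia.org/wiki/Olaf_Thörsen'))
print(name_from_url('none'))
```

With such a helper, rows could be kept only where data.retrieved.apply(name_from_url) equals data.artist. Whether that comparison should ignore case or punctuation is left open here.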

In [19]:
data.genre.iloc[21]

"['Funk music', 'Disco music', 'Dance music']"

Remove extraneous last column and category:

In [8]:
data.columns

Index(['artist', 'gender', 'category', 'genre', 'Unnamed: 4'], dtype='object')

In [9]:
data.drop(['category','Unnamed: 4'], axis = 1, inplace = True)

In [10]:
data.head()

Unnamed: 0,artist,gender,genre
0,12 Gauge,male,miami bass
1,1987,male,retro electro
2,2 Chainz,male,"atl hip hop, gangster rap, hip hop, pop rap, r..."
3,2 Pistols,male,"dirty south rap, pop rap, southern hip hop, trap"
4,21 Savage,male,"atl hip hop, rap, trap"


### Remove all artists with null values for genre:

In [11]:
data = data[data['genre'].notnull()]

In [12]:
data.isnull().sum(axis = 0)

artist    0
gender    0
genre     0
dtype: int64

In [13]:
data.shape

(9734, 3)

### Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

First we carry out the split on an example:

In [14]:
x = data.genre.iloc[2]

In [15]:
x

'atl hip hop, gangster rap, hip hop, pop rap, rap, southern hip hop, trap'

In [16]:
[s.strip() for s in x.split(',')]

['atl hip hop',
 'gangster rap',
 'hip hop',
 'pop rap',
 'rap',
 'southern hip hop',
 'trap']

Now we make a function to apply to the genre column:

In [35]:
def genrelist(string):
    return [s.strip() for s in string.split(',')]

Now we apply it to the whole column and put the lists in a new column:

In [37]:
data['genrelist'] = data['genre'].apply(genrelist)

In [38]:
data.head()

Unnamed: 0,artist,gender,genre,genrelist
0,12 Gauge,male,miami bass,[miami bass]
1,1987,male,retro electro,[retro electro]
2,2 Chainz,male,"atl hip hop, gangster rap, hip hop, pop rap, r...","[atl hip hop, gangster rap, hip hop, pop rap, ..."
3,2 Pistols,male,"dirty south rap, pop rap, southern hip hop, trap","[dirty south rap, pop rap, southern hip hop, t..."
4,21 Savage,male,"atl hip hop, rap, trap","[atl hip hop, rap, trap]"


Let's count the number of labels for each artist (length of the list in genrelist column):

In [39]:
data['num_genres'] = data['genrelist'].apply(len)

In [40]:
data.head()

Unnamed: 0,artist,gender,genre,genrelist,num_genres
0,12 Gauge,male,miami bass,[miami bass],1
1,1987,male,retro electro,[retro electro],1
2,2 Chainz,male,"atl hip hop, gangster rap, hip hop, pop rap, r...","[atl hip hop, gangster rap, hip hop, pop rap, ...",7
3,2 Pistols,male,"dirty south rap, pop rap, southern hip hop, trap","[dirty south rap, pop rap, southern hip hop, t...",4
4,21 Savage,male,"atl hip hop, rap, trap","[atl hip hop, rap, trap]",3


Compute the mean, standard deviation, and max of the number of genres for female and male artists:

In [54]:
data_female = data[data.gender == 'female']
n = data_female.shape[0]
a,b,c = data_female.num_genres.mean(), data_female.num_genres.std(), data_female.num_genres.max()
print('Female:')
print(f'{n} Artists.')
print(f'Mean number of genre labels: {round(a,2)}.')
print(f'STD of the number of genre labels: {round(b,2)}.')
print(f'Max number of genre labels: {c}.')

Female:
3180 Artists.
Mean number of genre labels: 2.5.
STD of the number of genre labels: 2.21.
Max number of genre labels: 22.


In [55]:
# plt.hist(data_female.num_genres, bins = 25, density = True)
# plt.show()

In [56]:
data_male = data[data.gender == 'male']
m = data_male.shape[0]
a,b,c = data_male.num_genres.mean(), data_male.num_genres.std(), data_male.num_genres.max()
print('Male:')
print(f'{m} Artists.')
print(f'Mean number of genre labels: {round(a,2)}.')
print(f'STD of the number of genre labels: {round(b,2)}.')
print(f'Max number of genre labels: {c}.')

Male:
6554 Artists.
Mean number of genre labels: 2.75.
STD of the number of genre labels: 2.41.
Max number of genre labels: 20.


In [57]:
# plt.hist(data_male.num_genres, bins = 25, density = True)
# plt.show()
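
The two per-gender cells above could also be collapsed into a single groupby. The sketch below uses a small hypothetical frame (toy) with the same column names as data, so the numbers it prints are not the notebook's results:

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as `data`.
toy = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female'],
    'num_genres': [1, 7, 2, 4],
})

# One row per gender: artist count, mean, sample standard deviation, max.
summary = toy.groupby('gender')['num_genres'].agg(['count', 'mean', 'std', 'max'])
print(summary.round(2))
```

Applied to data instead of toy, this should reproduce the figures printed above in one table, since Series.std and the groupby 'std' aggregation both use the sample standard deviation by default.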

### Extracting the unique genre labels:

In [61]:
genre_list0 = data.genrelist.values.tolist()

In [62]:
genre_list0[:5]

[['miami bass'],
 ['retro electro'],
 ['atl hip hop',
  'gangster rap',
  'hip hop',
  'pop rap',
  'rap',
  'southern hip hop',
  'trap'],
 ['dirty south rap', 'pop rap', 'southern hip hop', 'trap'],
 ['atl hip hop', 'rap', 'trap']]

In [63]:
genre_list1 = [x for y in genre_list0 for x in y]
len(genre_list1)

25998
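
The flattened list still contains repeated labels. One possible way to get the unique labels, together with how often each occurs, is collections.Counter. The sketch below runs on three of the lists shown above rather than on the full genre_list0, and the variable names are illustrative:

```python
from collections import Counter

# Three of the lists displayed above, standing in for the full genre_list0.
genre_lists = [
    ['miami bass'],
    ['atl hip hop', 'gangster rap', 'hip hop', 'pop rap', 'rap',
     'southern hip hop', 'trap'],
    ['atl hip hop', 'rap', 'trap'],
]

# Flatten, then count occurrences of each label.
flat = [label for labels in genre_lists for label in labels]
counts = Counter(flat)
unique_labels = sorted(counts)  # distinct labels in alphabetical order

print(len(flat), len(unique_labels))
print(counts.most_common(3))
```

A plain set(flat) would also give the unique labels. The Counter additionally keeps the frequencies, which may be useful when deciding which rare labels to merge or drop.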