This notebook is used to clean and prepare the genre label data for analysis.

## Outline of Cleaning:

- [x] remove artists for which 'retrieved' value is 'none'
- [x] remove the url prefix from the retrieved artist names 
- [x] replace ' ' in the artist column with '_'
- [x] remove the '(singer)', '(rapper)', '(musician)' designation from the 'retrieved' column
- [x] remove the artists for which the retrieved-artist != searched-artist. 
    - inspect mismatches to look for typos and different versions
- [x] convert genre column values into lists of strings
- [x] remove old columns
- [x] extract unique genres as a list
- [x] select 1% sample to verify gender

In [1]:
import numpy as np
np.random.seed(23)
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

from datetime import datetime

In [2]:
%ls -lt /Users/Daniel/Code/Genre/data/genre_lists/data_to_be_cleaned/

total 24032
-rw-r--r--  1 Daniel  staff  1893962 Apr 30 11:50 wiki-kaggle_genres_rescrape.csv
-rw-r--r--  1 Daniel  staff   184182 Apr 24 16:44 wiki-women_wiki_lists.csv
-rw-r--r--@ 1 Daniel  staff  1901830 Apr 23 13:41 wiki-kaggle_genres_gender.csv
-rw-r--r--@ 1 Daniel  staff  2010856 Apr 23 13:41 wiki-kaggle_genres_rough_gender.csv
-rw-r--r--@ 1 Daniel  staff   708936 Apr 22 15:30 wiki-kaggle_genres_rough.csv
-rw-r--r--  1 Daniel  staff   895886 Apr 17 10:27 kaggle_genres-reduced.csv
-rwxr-xr-x@ 1 Daniel  staff    22397 Apr 15 13:24 wiki_bands_women.csv*
-rwxr-xr-x@ 1 Daniel  staff    37575 Apr 15 13:24 wiki_country_women.csv*
-rwxr-xr-x@ 1 Daniel  staff    79119 Apr 15 13:24 wiki_rock_women.csv*
-rw-r--r--@ 1 Daniel  staff  1949982 Apr 15 11:34 kaggle_genres_rough.csv
-rwxr-xr-x@ 1 Daniel  staff    15444 Apr 11 12:41 wiki_bands_women-cleaned.csv*
-rwxr-xr-x@ 1 Daniel  staff    24278 Apr 11 12:41 wiki_country_women-cleaned.

### Data Sets

The file singers_gender.csv is from Kaggle and lists music artists and their gender. This is our starting point. It is augmented using the lists of women artists. Genre and network info will be generated by scraping databases. For now we are focusing on Wikipedia.

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
# kaggle_data.shape

### Load the data to be cleaned:

Current: wiki-kaggle_genres_rescrape.csv (the file loaded in the cell below)

- This will be replaced by the fully scraped set
- The full set needs to be cleaned

Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas
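
The converter idea noted above could look roughly like the following. This is a minimal sketch, not the approach used in the rest of this notebook: the helper name `parse_genres` and the two in-memory rows are made up for illustration, and it assumes every non-'none' cell is a well-formed Python list literal (`ast.literal_eval` raises on anything else).

```python
import ast
import io

import pandas as pd


def parse_genres(cell):
    """Turn a stringified list such as "['pop', 'rock']" into a list of strings."""
    if cell == 'none':
        return []
    return [g.strip() for g in ast.literal_eval(cell)]


# Tiny in-memory stand-in for the scraped CSV (hypothetical rows).
csv_text = io.StringIO('''artist,gender,retrieved,genre
A,male,none,none
B,female,https://en.wikipedia.org/wiki/B,"['pop', 'indie rock']"
''')

# read_csv passes each raw 'genre' cell through the converter.
demo = pd.read_csv(csv_text, converters={'genre': parse_genres})
print(demo.genre.tolist())
```

The converter only parses the list; the symbol cleanup done later by `genrelist` would still be needed.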

In [5]:
data = pd.read_csv('../../data/genre_lists/data_to_be_cleaned/wiki-kaggle_genres_rescrape.csv', header = 0)
data.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [6]:
data.head(50)

Unnamed: 0,artist,gender,retrieved,genre
0,Jimmy Boyd,male,none,none
1,Christopher Willits,male,https://en.wikipedia.org/wiki/Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
2,Henry Frayne,male,none,none
3,Shawn Hook,male,https://en.wikipedia.org/wiki/Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,https://en.wikipedia.org/wiki/Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"
5,Dave Schramm,male,none,none
6,Marvin Isley,male,https://en.wikipedia.org/wiki/Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']"
7,Povel Ramel,male,https://en.wikipedia.org/wiki/Povel_Ramel,['vaudeville']
8,Lisa McHugh,female,none,none
9,Sharon Van Etten,female,https://en.wikipedia.org/wiki/Sharon_Van_Etten,"['indie rock', 'indie folk']"


In [7]:
data.shape

(23177, 4)

In [8]:
data.isnull().sum()

artist       0
gender       0
retrieved    0
genre        0
dtype: int64

For how many artists is the scraped genre 'none':

In [9]:
(data.genre == 'none').sum()

7677

For how many artists is the 'retrieved' value 'none':

In [10]:
(data.retrieved == 'none').sum()

7677

Take a glance at artist and retrieved values to determine necessary cleaning:

In [11]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Jeanne Cherhal        retrieved: https://en.wikipedia.org/wiki/Jeanne_Cherhal
artist: Jessica Cauffiel        retrieved: none
artist: Marco Marinangeli        retrieved: https://en.wikipedia.org/wiki/Marco_Marinangeli
artist: Rosemary Clooney        retrieved: https://en.wikipedia.org/wiki/Rosemary_Clooney
artist: Jon Rennard        retrieved: none
artist: Colm Wilkinson        retrieved: https://en.wikipedia.org/wiki/Colm_Wilkinson
artist: Dwight Pinkney        retrieved: https://en.wikipedia.org/wiki/Dwight_Pinkney
artist: Nivea        retrieved: https://en.wikipedia.org/wiki/Nivea_(singer)
artist: Emilie Hammarskjöld        retrieved: none
artist: J. D. Souther        retrieved: https://en.wikipedia.org/wiki/J._D._Souther
artist: Øyvind Nypan        retrieved: none
artist: Brownstone        retrieved: https://en.wikipedia.org/wiki/Brownstone_(band)
artist: Frankie Armstrong        retrieved: https://en.wikipedia.org/wiki/Frankie_Armstrong
artist: Chuck Berghofer        retri

Notes on Retrieved:

- Underscore is used to separate parts of the name
- '.' are allowed in names 
- '(singer)' and '(musician)' are sometimes included and need to be stripped (probably to distinguish from other people in Wikipedia)
- double quotes are allowed in names
- hyphens appear

### Remove artists for which 'retrieved' value is 'none'

Convert none to null:

In [12]:
data['retrieved'] = data['retrieved'].replace('none', np.nan)

In [13]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
0,Jimmy Boyd,male,,none
1,Christopher Willits,male,https://en.wikipedia.org/wiki/Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
2,Henry Frayne,male,,none
3,Shawn Hook,male,https://en.wikipedia.org/wiki/Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,https://en.wikipedia.org/wiki/Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"


In [14]:
data.isnull().sum()

artist          0
gender          0
retrieved    7677
genre           0
dtype: int64

Drop rows with nulls:

In [15]:
data.dropna(axis = 0, inplace = True)
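
As a side note, the drop can be restricted to the 'retrieved' column so that nulls in other columns do not remove rows by accident. The sketch below uses a small made-up frame (`demo`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same 'none' placeholder.
demo = pd.DataFrame({
    'artist': ['A', 'B', 'C'],
    'retrieved': ['none', 'https://en.wikipedia.org/wiki/B', 'none'],
    'genre': ['none', "['pop']", 'none'],
})

# Convert the placeholder to a real null, then drop on that column only.
demo['retrieved'] = demo['retrieved'].replace('none', np.nan)
cleaned = demo.dropna(subset=['retrieved'])
print(cleaned)
```

In this notebook the two approaches give the same result, since 'retrieved' is the only column containing nulls at this point.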

In [16]:
data.shape

(15500, 4)

In [17]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
1,Christopher Willits,male,https://en.wikipedia.org/wiki/Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
3,Shawn Hook,male,https://en.wikipedia.org/wiki/Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,https://en.wikipedia.org/wiki/Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"
6,Marvin Isley,male,https://en.wikipedia.org/wiki/Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']"
7,Povel Ramel,male,https://en.wikipedia.org/wiki/Povel_Ramel,['vaudeville']


## Remove the prefix from the 'retrieved' values

In [18]:
"""This function extracts artist name from the url.
Apply it to the 'retrieved' values."""
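def retrieved_artist(text):
    p = re.compile(r'https://en\.wikipedia\.org/wiki/(.*)')
    try:
        result = p.match(text)
        return result.group(1)
    except (TypeError, AttributeError):
        # AttributeError: the URL pattern did not match (result is None)
        # TypeError: the value is not a string (e.g. NaN)
        if text == 'none':
            return 'none'
        return 'None'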

In [19]:
data['retrieved'] = data.retrieved.apply(retrieved_artist)
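
Since the prefix is a fixed string, the same step could also be done without a regular expression. A minimal sketch on a made-up Series (`urls`), assuming the prefix occurs only at the start of each value:

```python
import pandas as pd

PREFIX = 'https://en.wikipedia.org/wiki/'

# Hypothetical sample of 'retrieved' values.
urls = pd.Series([
    'https://en.wikipedia.org/wiki/Shawn_Hook',
    'https://en.wikipedia.org/wiki/Nivea_(singer)',
])

# regex=False treats PREFIX as a literal string, so '.' is not a wildcard.
names = urls.str.replace(PREFIX, '', regex=False)
print(names.tolist())
```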

In [20]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
1,Christopher Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
3,Shawn Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"
6,Marvin Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']"
7,Povel Ramel,male,Povel_Ramel,['vaudeville']


Take a glance at artist and retrieved values after cleaning retrieved to determine further cleaning:

In [21]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Dominique Eade        retrieved: Dominique_Eade
artist: John C. J. Taylor        retrieved: John_C._J._Taylor
artist: Phillip Mitchell        retrieved: Phillip_Mitchell
artist: Meja        retrieved: Meja
artist: Simon Underwood        retrieved: Simon_Underwood
artist: Del Reeves        retrieved: Del_Reeves
artist: Syleena Johnson        retrieved: Syleena_Johnson
artist: Jimi Jamison        retrieved: Jimi_Jamison
artist: John Spiker        retrieved: John_Spiker
artist: Melissa Sgambelluri        retrieved: Melissa_Sgambelluri
artist: Justin Hayford        retrieved: Justin_Hayford
artist: Markus Fagervall        retrieved: Markus_Fagervall
artist: John Schlitt        retrieved: John_Schlitt
artist: Jack Hues        retrieved: Jack_Hues
artist: Paris Bennett        retrieved: Paris_Bennett


## Replace spaces with _ in the artist column:

In [22]:
"""This function replaces white space in the values of
the column artist with an underscore."""
def underscore(text):
    try:
        # equivalent to '_'.join(text.split(' '))
        return text.replace(' ', '_')
    except AttributeError:
        # non-string value (e.g. NaN)
        return 'error'

In [23]:
data['artist'] = data.artist.apply(underscore)

In [24]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']"
7,Povel_Ramel,male,Povel_Ramel,['vaudeville']


## Remove the \_(singer) type designation from retrieved

In [25]:
"""This function uses re. to remove any parenthetical designations
from the retrieved artist name"""
def remove_designation(text):
    designations = [r'_\(singer\)', r'_\(musician\)', r'_\(rapper\)', r'_\(band\)', r'_\(composer\)', r'_\(music_producer\)', r'_\(singer-songwriter\)' ]
    x = text
    for des in designations:
        # substitute on x (not text) so that earlier removals are kept
        x = re.sub(des, '', x)
    return x

Apply the function:

In [26]:
data['retrieved_clean'] = data.retrieved.apply(remove_designation)
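
The list of designations has to be extended whenever a new one shows up. One possible alternative, sketched here under the assumption that any trailing `_(...)` suffix is a disambiguation tag and not part of the artist's name, is a single pattern. The function name `strip_trailing_parenthetical` is hypothetical:

```python
import re


def strip_trailing_parenthetical(name):
    """Remove one trailing '_(...)' suffix, whatever its content."""
    return re.sub(r'_\([^()]*\)$', '', name)


for raw in ['Nivea_(singer)', 'Brownstone_(band)', 'Shawn_Hook']:
    print(raw, '->', strip_trailing_parenthetical(raw))
```

This is broader than the explicit list, so it would need a manual check against names that legitimately end in parentheses.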

Take a glance at artist and retrieved_clean values:

In [27]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

data[['artist','retrieved_clean']].iloc[rints]

Unnamed: 0,artist,retrieved_clean
3281,Julian_Velard,Julian_Velard
9987,Chingy,Chingy
12032,Deesha,Deesha
9981,Jamie_Oldaker,Jamie_Oldaker
6919,Pearl_Future,Pearl_Future
10485,Vince_Taylor,Vince_Taylor
15091,Tarsame_Singh_Saini,Tarsame_Singh_Saini
16312,Bradford_Cox,Bradford_Cox
3036,Chubby_Checker,Chubby_Checker
12532,Charlie_Dominici,Charlie_Dominici


### Mark the rows for which retrieved_clean is different from artist

In [28]:
"""This function takes a pair of strings and checks
if they are equivalent (case insensitive)

.casefold is used to be case insensitive; 
still might have problems on some characters"""

def verify_artist(x,y):
    if x.casefold() == y.casefold(): 
        return 1
    else:
        return 0
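
One known source of the character problems mentioned in the docstring is that the same accented letter can be stored either as one code point or as a base letter plus a combining mark. A rough sketch of a comparison that normalizes first; `normalized_key` and `same_artist` are hypothetical helpers, and NFKC is just one reasonable choice of normalization form:

```python
import unicodedata


def normalized_key(name):
    """Normalize to NFKC, then casefold, before comparing."""
    return unicodedata.normalize('NFKC', name).casefold()


def same_artist(x, y):
    return normalized_key(x) == normalized_key(y)


composed = 'Emilie_Hammarskj\u00f6ld'      # 'ö' as a single code point
decomposed = 'EMILIE_HAMMARSKJO\u0308LD'   # 'O' followed by a combining diaeresis
print(composed.casefold() == decomposed.casefold())  # plain casefold comparison
print(same_artist(composed, decomposed))             # comparison after normalization
```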

Introduce a mismatch just to make sure we can properly remove these:

In [29]:
# use an iloc index larger than the size of the original dataframe
#data.iloc[data.shape[0]+1] = ['test','test_wrong','universal','test_wrong']

Apply the comparison inline. Note that the 'match' column is 1 where the names differ and 0 where they agree:

In [30]:
data['match'] = (data.artist.apply(lambda x: x.casefold()) != data.retrieved_clean.apply(lambda x: x.casefold())).astype('int64')

In [31]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac...",Christopher_Willits,0
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']",Shawn_Hook,0
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']",Steve_Poltz,0
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']",Marvin_Isley,0
7,Povel_Ramel,male,Povel_Ramel,['vaudeville'],Povel_Ramel,0


In [32]:
data.match.sum()

0

Now remove artists where retrieved_clean doesn't match artist:

In [33]:
data = data[data.match == 0]

In [34]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac...",Christopher_Willits,0
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']",Shawn_Hook,0
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']",Steve_Poltz,0
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']",Marvin_Isley,0
7,Povel_Ramel,male,Povel_Ramel,['vaudeville'],Povel_Ramel,0


Now the remaining artists are verified and have a non-null genre label.

In [35]:
data = data.copy( deep = True)

### Genre Labels

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

The following function will systematically reformat the genre lists to a great extent. After applying it, there are still a few cases that are dealt with individually and some that are dropped.

In [36]:
"""This function takes in a string of the form
appearing in the genrelist of the dataframe.
It strips the square brackets and extra quotes and
returns a list of strings where each string is a genre label.
It also removes strings in parentheses and removes any isolated '(' or ')'.
It replaces 'singer/songwriter' with 'singer-songwriter' and replaces forward slashes with commas."""
def genrelist(string):
    # strip brackets and quotes; turn '/', literal '\n' and ';' into commas
    string = string.strip("[").strip("]").replace("'","").replace('"',"") \
    .replace("singer/songwriter", "singer-songwriter").replace("/",",") \
    .replace(r"\n",",").replace(";",",")
    L_new = []
    for x in string.split(','):
        x = re.sub(r"\(.*?\)", "", x) # drop text in parentheses
        x = re.sub(r"[()]", "", x) # drop isolated parentheses
        x = x.replace(" ","_").strip("_")
        if x: # skip empty labels
            L_new.append(x)
    return L_new

In [37]:
data['genrelist']= data['genre'].apply(genrelist)

We can still use 'genre' column for str.contains searches

Issues to check:

The ";" -- this seems to have been dealt with

In [38]:
data.genre[data.genre.str.contains(r';')]

254                ['k-pop', ';', 'hip hop', ';', 'dance']
2416     ['.mw-parser-output div.columns-2 div.column{f...
5378     ['singer-songwriter', ';', 'jazz', ';', 'music...
7150       ['rock', ';', 'country rock', ';', 'folk rock']
7861     ['.mw-parser-output div.columns-2 div.column{f...
8201          ['classical crossover', '; opera; romantic']
9496     ['classical crossover', '; pop music; rock mus...
11170    ['aor', 'pop rock', '; earlier:', 'pop', 'disc...
13707    ['traditional pop music', ';', 'show tunes', '...
14041                ['american folk; struggle & protest']
21388    ['singer-songwriter', '; mainstream', 'jazz', ...
Name: genre, dtype: object

In [39]:
data.genrelist[data.genre[data.genre.str.contains(r';')].index]

254                                [k-pop, hip_hop, dance]
2416     [.mw-parser-output_div.columns-2_div.column{fl...
5378     [singer-songwriter, jazz, musical_theatre, com...
7150                       [rock, country_rock, folk_rock]
7861     [.mw-parser-output_div.columns-2_div.column{fl...
8201                [classical_crossover, opera, romantic]
9496     [classical_crossover, pop_music, rock_music, r...
11170          [aor, pop_rock, earlier:, pop, disco, soul]
13707           [traditional_pop_music, show_tunes, opera]
14041                  [american_folk, struggle_&_protest]
21388    [singer-songwriter, mainstream, jazz, and_melo...
Name: genrelist, dtype: object

Remove 'earlier:' from 11170

In [40]:
data.genrelist.loc[11170]

['aor', 'pop_rock', 'earlier:', 'pop', 'disco', 'soul']

In [41]:
data.at[11170,'genrelist'] = ['aor', 'pop_rock', 'pop', 'disco', 'soul']

Two entries with .mw-parser... -- Fix by hand.

In [42]:
data.genrelist[data.genre[data.genre.str.contains(r'\.mw-.*')].index]

2416    [.mw-parser-output_div.columns-2_div.column{fl...
7861    [.mw-parser-output_div.columns-2_div.column{fl...
Name: genrelist, dtype: object

In [43]:
data.genrelist.loc[2416]

['.mw-parser-output_div.columns-2_div.column{float:left',
 'width:50%',
 'min-width:300px}.mw-parser-output_div.columns-3_div.column{float:left',
 'width:33.3%',
 'min-width:200px}.mw-parser-output_div.columns-4_div.column{float:left',
 'width:25%',
 'min-width:150px}.mw-parser-output_div.columns-5_div.column{float:left',
 'width:20%',
 'min-width:120px}',
 'disco',
 'funk',
 'electric',
 'latin_soul']

In [44]:
data.at[2416,'genrelist'] = ['disco', 'funk', 'electric', 'latin_soul']

In [45]:
data.genrelist.loc[7861]

['.mw-parser-output_div.columns-2_div.column{float:left',
 'width:50%',
 'min-width:300px}.mw-parser-output_div.columns-3_div.column{float:left',
 'width:33.3%',
 'min-width:200px}.mw-parser-output_div.columns-4_div.column{float:left',
 'width:25%',
 'min-width:150px}.mw-parser-output_div.columns-5_div.column{float:left',
 'width:20%',
 'min-width:120px}',
 'torch_song']

In [46]:
data.at[7861,'genrelist'] = ['torch_song']
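
With only two affected rows, fixing by hand is fine. If the full scrape turns up more of them, a filter along these lines might help. It is a sketch that assumes no genuine genre label contains '{', '}' or ':'; the helper name `drop_markup_fragments` is made up, and the sample list is a shortened copy of row 2416:

```python
def drop_markup_fragments(labels):
    """Keep only labels without CSS-like characters ('{', '}', ':')."""
    return [g for g in labels if not any(ch in g for ch in '{}:')]


row = ['.mw-parser-output_div.columns-2_div.column{float:left',
       'width:50%',
       'min-width:120px}',
       'disco', 'funk', 'electric', 'latin_soul']
print(drop_markup_fragments(row))
```

The same filter would also drop the 'earlier:' entry handled above, because it contains a colon.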

The ) or ( seem to have been dealt with.

In [47]:
data.genre[data.genre.str.contains(r'[\(\)]')]

349      ['metalcore', 'electronic', 'electropop', 'har...
917      ['pop rock', '(2009–2010)', 'hard rock', 'aren...
1383               ['country', '(', 'neotraditional', ')']
1605     ['metalcore', 'alternative metal', 'alternativ...
1990          ['alternative rock', 'punk rock', '(early)']
                               ...                        
22743    ['electropop', 'indie pop', 'pop rock', '(earl...
22754                ['jazz', 'folk', 'americana (music)']
22770    ['heavy metal', 'progressive metal', 'hard roc...
22984    ['groove metal', 'heavy metal', 'southern meta...
23169    ['glam rock', 'rock and roll', 'psychedelic fo...
Name: genre, Length: 73, dtype: object

In [48]:
parentrouble = data.genrelist[data.genre[data.genre.str.contains(r'[\(\)]')].index]

In [49]:
n = np.random.randint(len(parentrouble)) # valid positions run from 0 to len - 1
parentrouble.iloc[n]

['groove_metal', 'thrash_metal', 'hardcore_punk', 'death_metal']

The / seems to have been dealt with.

In [50]:
data.genre[data.genre.str.contains(r'[/]')]

331                          ['indie rock/pop/electronic']
503                                   ['roots/blues/soul']
646      ['christian & gospel', 'independent\nsinger/so...
830                                           ['pop/rock']
1642     ['gospel', 'contemporary christian', 'inspirat...
                               ...                        
22119    ['heavy metal', 'hard rock', 'rock', 'blues', ...
22198    ['pop', 'french pop', 'pop/rock', 'adult conte...
22520    ['progressive rock', 'industrial rock', 'pop/r...
22534            ['pop', 'persian / middle eastern music']
23090    ['singer/songwriter', 'folk', 'americana', 'pop']
Name: genre, Length: 73, dtype: object

In [51]:
data.genrelist[data.genre[data.genre.str.contains(r'[/]')].index]

331                          [indie_rock, pop, electronic]
503                                   [roots, blues, soul]
646      [christian_&_gospel, independent, singer-songw...
830                                            [pop, rock]
1642     [gospel, contemporary_christian, inspirational...
                               ...                        
22119    [heavy_metal, hard_rock, rock, blues, r&b, sou...
22198    [pop, french_pop, pop, rock, adult_contemporar...
22520       [progressive_rock, industrial_rock, pop, rock]
22534                 [pop, persian, middle_eastern_music]
23090            [singer-songwriter, folk, americana, pop]
Name: genrelist, Length: 73, dtype: object

Let's look at the list of genre labels from the above rows that we have just dealt with in detail:

In [52]:
trouble_index = data.genre[data.genre.str.contains(r'[;)(/]')].index

In [53]:
trouble = data.loc[trouble_index]

In [54]:
trouble.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match,genrelist
254,Eddy_Oh,male,Eddy_Oh,"['k-pop', ';', 'hip hop', ';', 'dance']",Eddy_Oh,0,"[k-pop, hip_hop, dance]"
331,Jillian_Wheeler,female,Jillian_Wheeler,['indie rock/pop/electronic'],Jillian_Wheeler,0,"[indie_rock, pop, electronic]"
349,Caleb_Shomo,male,Caleb_Shomo,"['metalcore', 'electronic', 'electropop', 'har...",Caleb_Shomo,0,"[metalcore, electronic, electropop, hardcore_p..."
503,Amy_Black,female,Amy_Black_(singer),['roots/blues/soul'],Amy_Black,0,"[roots, blues, soul]"
646,Gabriel_Wilson,male,Gabriel_Wilson,"['christian & gospel', 'independent\nsinger/so...",Gabriel_Wilson,0,"[christian_&_gospel, independent, singer-songw..."


In [55]:
trouble_genre_list = trouble.genrelist.values.tolist()
trouble_genre_list = [x for y in trouble_genre_list for x in y]
trouble_genre_list = list(set(trouble_genre_list))

In [56]:
len(trouble_genre_list)

218

In [57]:
n = 8
trouble_genre_list[25*n:25*(n+1)]

['teen_pop',
 'folk',
 'metal',
 'r&b',
 'religious',
 'rebetiko',
 'metalcore',
 'post-hardcore',
 'rap_metal',
 'composer',
 'diy',
 'gh_brothers',
 'rockabilly',
 'inspirational',
 'tribute',
 'indie',
 'pop',
 'rock_music']

All of the problematic symbols are removed.

In [58]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match,genrelist
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac...",Christopher_Willits,0,"[electronic, glitch, ambient, electroacoustic,..."
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']",Shawn_Hook,0,"[pop, electronic, rock]"
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']",Steve_Poltz,0,"[pop_rock, indie-rock, folk_rock]"
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']",Marvin_Isley,0,"[r&b, funk, soul, funk_rock]"
7,Povel_Ramel,male,Povel_Ramel,['vaudeville'],Povel_Ramel,0,[vaudeville]


In [59]:
data.shape

(15500, 7)

### Remove all artists with null values for genre:

In [60]:
data.isnull().sum(axis = 0)

artist             0
gender             0
retrieved          0
genre              0
retrieved_clean    0
match              0
genrelist          0
dtype: int64

In [61]:
data = data[data['genrelist'].notnull()]

In [62]:
data.isnull().sum(axis = 0)

artist             0
gender             0
retrieved          0
genre              0
retrieved_clean    0
match              0
genrelist          0
dtype: int64

In [63]:
data.shape

(15500, 7)

### Create column with length of genre lists:

In [64]:
data['genrelist_length'] = data.genrelist.apply(lambda x: len(x))

In [65]:
data.shape

(15500, 8)

### Remove artists for which the genre list is empty:

In [66]:
data[data.genrelist_length == 0]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match,genrelist,genrelist_length
4024,Cupid,male,Cupid_(singer),[],Cupid,0,[],0
15255,Mary_Zilba,female,Mary_Zilba,[],Mary_Zilba,0,[],0
17487,Betty_Clooney,female,Betty_Clooney,[],Betty_Clooney,0,[],0


Note: Cupid, Mary Zilba, and Betty Clooney have made it to this point but actually have no genres listed in Wikipedia.

In [67]:
data = data[data.genrelist_length > 0]

In [68]:
data.shape

(15497, 8)

Remove old columns:

In [69]:
data.columns

Index(['artist', 'gender', 'retrieved', 'genre', 'retrieved_clean', 'match',
       'genrelist', 'genrelist_length'],
      dtype='object')

In [70]:
data.drop(['retrieved','genre','retrieved_clean', 'match'], axis = 1, inplace = True)

In [71]:
data.head()

Unnamed: 0,artist,gender,genrelist,genrelist_length
1,Christopher_Willits,male,"[electronic, glitch, ambient, electroacoustic,...",5
3,Shawn_Hook,male,"[pop, electronic, rock]",3
4,Steve_Poltz,male,"[pop_rock, indie-rock, folk_rock]",3
6,Marvin_Isley,male,"[r&b, funk, soul, funk_rock]",4
7,Povel_Ramel,male,[vaudeville],1


In [72]:
data.shape

(15497, 4)

## Gender

Remove any artists without gender value

In [73]:
data.gender.isnull().sum()

0

In [74]:
data.gender.unique()

array(['male', 'female'], dtype=object)

### Extracting the unique genre labels:

First make a list of the genrelists:

In [75]:
genre_list = data.genrelist.values.tolist()
genre_list = [x for y in genre_list for x in y]
genre_list = list(set(genre_list))

In [76]:
genre_list[:5]

['western_films', 'world', 'country_blues', 'r’n’b', 'flamenco_fusion']

In [77]:
len(genre_list)

1864
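
The flatten-and-deduplicate step above can also be written with pandas' `explode`, which gives one row per label. A small sketch on a made-up frame (`demo`), sorted here only to make the output order predictable:

```python
import pandas as pd

# Hypothetical frame with the same 'genrelist' layout.
demo = pd.DataFrame({
    'artist': ['A', 'B', 'C'],
    'genrelist': [['pop', 'rock'], ['rock'], ['folk', 'pop']],
})

# explode() yields one row per label; unique() removes repeats.
unique_genres = sorted(demo['genrelist'].explode().unique())
print(unique_genres)
```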

In [78]:
print('There are {} artists with genre and binary-gender labels.'.format(data.shape[0]))
print('There are {} unique genre labels.'.format(len(genre_list)))

There are 15497 artists with genre and binary-gender labels.
There are 1864 unique genre labels.


Check to see if there are artists with len(genrelist) = 0

In [79]:
data[data.genrelist.apply(lambda x: True if len(x) == 0 else False)]

Unnamed: 0,artist,gender,genrelist,genrelist_length


## Select 1% of artists to verify gender manually.

In [80]:
data.shape

(15497, 4)

In [81]:
data_male = data[data.gender == 'male']
data_female = data[data.gender == 'female']

In [82]:
tot = data.shape[0]
m = data_male.shape[0]
f = data_female.shape[0]
print('{} total artists'.format(tot))
print('{} female artists, or {:0.0f}%'.format(f, 100*f/(f+m)))
print('{} male artists, or {:0.0f}%'.format(m, 100*m/(f+m)))

15497 total artists
4869 female artists, or 31%
10628 male artists, or 69%


We will verify the same number of female as male artists. Let 
- $p$ be the fraction of artists to verify,
- $p_f$ be the fraction of female artists to verify
- $p_m$ be the fraction of male artists to verify

We want two conditions satisfied:

$$ p_f \cdot f = p_m \cdot m $$
$$ p_f \cdot f + p_m \cdot m = p \cdot (f+m)$$

These encode the requirements that
- we verify equal numbers of male and female artists
- the fraction of artists we verify is $p$

Solving for $p_f$ and $p_m$ leads to 

$$ p_f = \frac{p}{2f}(f+m)$$
$$ p_m = \frac{p}{2m}(f+m)$$

Define the functions for $p_f$ and $p_m$:

In [83]:
def pf(p,f,m):
    return p*(m+f)/(2*f)

def pm(p,f,m):
    return p*(m+f)/(2*m)

Substituting in $p=.01$, $f = 4869$, and $m = 10628$ (the counts printed above) leads to 

In [84]:
p = .01
p_f, p_m = pf(p,f,m), pm(p,f,m)
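
As a quick sanity check, the two conditions can be verified numerically with the counts printed above. The names `n_f` and `n_m` are introduced here only for illustration:

```python
def pf(p, f, m):
    return p * (m + f) / (2 * f)


def pm(p, f, m):
    return p * (m + f) / (2 * m)


p, f, m = 0.01, 4869, 10628
p_f, p_m = pf(p, f, m), pm(p, f, m)

n_f = p_f * f  # expected number of female artists to verify
n_m = p_m * m  # expected number of male artists to verify

# Condition 1: equal numbers of female and male artists.
assert abs(n_f - n_m) < 1e-9
# Condition 2: together they make up a fraction p of all artists.
assert abs((n_f + n_m) - p * (f + m)) < 1e-9
print(p_f, p_m, n_f, n_m)
```

Both expected counts come out to about 77.5 before rounding to whole rows.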

We select samples from the data:

In [85]:
sample_female = data_female.sample(frac = p_f)
sample_male = data_male.sample(frac = p_m)

In [86]:
sample_male.shape[0], sample_female.shape[0]

(77, 77)

In [87]:
sample_fm = pd.concat([sample_female,sample_male])

In [88]:
sample_size = sample_fm.shape[0]

Take a sample of the full size in order to shuffle:

In [89]:
sample_fm = sample_fm.sample(sample_size)
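
An alternative to computing per-group fractions is to ask for a fixed number of rows per group. The sketch below is not what this notebook does; `balanced_sample`, its `seed` argument and the demo frame are all hypothetical. It draws the same count from each gender and shuffles the result in one function:

```python
import pandas as pd


def balanced_sample(df, p, column='gender', seed=23):
    """Sample about p * len(df) rows, split evenly across the groups in `column`."""
    n_per_group = round(p * len(df) / df[column].nunique())
    parts = [group.sample(n=n_per_group, random_state=seed)
             for _, group in df.groupby(column)]
    # sample(frac=1) returns every row in random order, i.e. a shuffle.
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Hypothetical unbalanced frame: 300 male and 100 female artists.
demo = pd.DataFrame({
    'artist': ['artist_{}'.format(i) for i in range(400)],
    'gender': ['male'] * 300 + ['female'] * 100,
})
balanced = balanced_sample(demo, p=0.1)
print(balanced.gender.value_counts())
```

Passing `random_state` explicitly makes the draw repeatable without relying on the global NumPy seed.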

In [90]:
sample_fm.head()

Unnamed: 0,artist,gender,genrelist,genrelist_length
11137,Billy_Daniels,male,"[vocal_jazz, cabaret]",2
14246,Peter_Gifford,male,[rock],1
7259,Blanca,female,[contemporary_christian_music],1
16063,Whitfield_Crane,male,"[rock, hard_rock, heavy_metal]",3
19907,Sabrina_Johnston,female,"[dance, hip_hop]",2


In [91]:
tom_sample = sample_fm[:77]
dan_sample = sample_fm[77:]
tom_sample.shape, dan_sample.shape

((77, 4), (77, 4))

In [92]:
dan_sample.head(20)

Unnamed: 0,artist,gender,genrelist,genrelist_length
18163,Kim_Yarbrough,female,[r&b],1
2635,Lalaine,female,"[pop, alternative_rock, indie_rock]",3
18811,Brett_Gurewitz,male,"[punk_rock, hardcore_punk, melodic_hardcore, d...",4
8392,Dave_Flett,male,"[rock, progressive_rock]",2
3403,Julian_Taylor,male,[rock],1
1250,Francine_Reed,female,"[blues, jazz]",2
7106,Jewel,female,"[folk, pop, pop_rock, country]",4
19816,Kathy_Troccoli,female,"[contemporary_christian, inspirational, jazz]",3
6909,Barbara_Pittman,female,"[country, rockabilly]",2
9974,Matt_White,male,"[rock, pop]",2


Export data for further use:

In [93]:
today = datetime.today()
now = today.strftime('%Y-%m-%d-%H-%M')
#now = 'temp'

In [94]:
data.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_genres_gender_cleaned_{}.csv'.format(now))

In [95]:
tom_sample.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/tom_sample_to_verify_{}.csv'.format(now))
dan_sample.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/dan_sample_to_verify_{}.csv'.format(now))