# Resume at START HERE

This notebook is used to clean and prepare the genre label data for analysis.

In [1]:
import numpy as np
np.random.seed(23)
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

In [2]:
%ls -l /Users/Daniel/Code/Genre/data/

total 0
drwxr-xr-x  1142 Daniel  staff  36544 Apr 17 10:27 artist_network_graphs/
drwxr-xr-x    14 Daniel  staff    448 Apr 22 15:14 genre_lists/


### Data Sets

The file singers_gender.csv is from Kaggle and lists music artists and their gender. This is our starting point. It is augmented using the lists of women artists. Genre and network info will be generated by scraping databases. For now we are focusing on Wikipedia.

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
# kaggle_data.shape

### Load the data to be cleaned:

Current: wiki-kaggle_genres_rough.csv

- This will be replaced by the fully scraped set
- The full set needs to be cleaned

Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

- I renamed the first column of the csv file to be 'index'
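
The genre values previewed in this notebook look like string representations of Python lists rather than plain comma-separated text, so a converter built on `ast.literal_eval` may fit better than a comma split. The following is a minimal sketch under that assumption: the inline CSV stands in for the real file, and `parse_genres` is a hypothetical helper name, not part of the notebook.

```python
import ast
import io

import pandas as pd

# Small inline sample standing in for wiki-kaggle_genres_rough.csv (assumed layout).
sample_csv = io.StringIO(
    'index,artist,retrieved,genre\n'
    '0,Rosemary Vandenbroucke,none,none\n'
    '3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,"[\'Country music\']"\n'
    '8,Wendy Rene,https://en.wikipedia.org/wiki/Wendy_Rene,"[\'Soul music\', \'Rhythm and blues\']"\n'
)


def parse_genres(cell):
    """Hypothetical converter: turn "['a', 'b']" into ['a', 'b']; 'none' becomes []."""
    if cell == 'none':
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval
    return ast.literal_eval(cell)


# converters maps a column name to a function applied to each raw cell string
sample = pd.read_csv(sample_csv, index_col='index', converters={'genre': parse_genres})
print(sample.genre.tolist())
```

With a converter in place the genre column holds real lists as soon as the file is loaded, so no separate parsing pass is needed later.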

In [5]:
data = pd.read_csv('../../data/genre_lists/wiki-kaggle_genres_rough.csv', header = 0, index_col = 'index')

In [6]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [7]:
data.shape

(8770, 3)

In [8]:
data.isnull().sum()

artist       0
retrieved    0
genre        0
dtype: int64

For how many artists is the scraped genre 'none':

In [9]:
(data.genre == 'none').sum()

2924

For how many artists is the 'retrieved' value 'none':

In [10]:
(data.retrieved == 'none').sum()

2924

Take a glance at artist and retrieved values to determine necessary cleaning:

In [11]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Hound Dog Taylor        retrieved: https://en.wikipedia.org/wiki/Hound_Dog_Taylor
artist: Kenya Bell        retrieved: https://en.wikipedia.org/wiki/Kenya_Bell
artist: Betty Compton        retrieved: none
artist: Christopher Hall        retrieved: https://en.wikipedia.org/wiki/Christopher_Hall_(musician)
artist: Emily Whitehurst        retrieved: https://en.wikipedia.org/wiki/Emily_Whitehurst
artist: Chris Kahl        retrieved: none
artist: Ed Dowie        retrieved: none
artist: Freddy Moore        retrieved: https://en.wikipedia.org/wiki/Freddy_Moore
artist: Sara Storer        retrieved: https://en.wikipedia.org/wiki/Sara_Storer
artist: Liza Manili        retrieved: none
artist: April Lawton        retrieved: https://en.wikipedia.org/wiki/April_Lawton
artist: Bev Pegg        retrieved: none
artist: McLean        retrieved: https://en.wikipedia.org/wiki/McLean_(singer)
artist: Ryn Weaver        retrieved: https://en.wikipedia.org/wiki/Ryn_Weaver
artist: Stan Wilson        ret

Notes on Retrieved:

- Underscore is used to separate parts of the name
- '.' are allowed in names 
- '(singer)' and '(musician)' are sometimes included and need to be stripped (they are probably there to distinguish the artist from other people on Wikipedia)
- double quotes are allowed in names
- hyphens appear

## Outline of Cleaning:

- [x] remove artists for which 'retrieved' value is 'none'
- [x] remove the url prefix from the retrieved artist names 
- [x] replace ' ' in the artist column with '_'
- [x] remove the '(singer)', '(rapper)', '(musician)' designation from the 'retrieved' column
- [ ] remove the artists for which the retrieved-artist != searched-artist. 
    - inspect mismatches to look for typos and different versions
- [ ] convert genre column values into lists of strings
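
The first five steps of the outline can be sketched with vectorized pandas string methods. This is an illustrative sketch on toy rows modelled on the previews in this notebook, not the notebook's own implementation; the `toy` frame and the `mismatch` column name are assumptions made for the example.

```python
import pandas as pd

# Toy rows modelled on the previews in this notebook.
toy = pd.DataFrame({
    'artist': ['Betty Compton', 'Anthony David', 'Quincy'],
    'retrieved': [
        'none',
        'https://en.wikipedia.org/wiki/Anthony_David_(singer)',
        'https://en.wikipedia.org/wiki/Quincy_(band)',
    ],
})

prefix = 'https://en.wikipedia.org/wiki/'

# 1. Drop artists that were not retrieved.
toy = toy[toy.retrieved != 'none'].copy()
# 2. Strip the URL prefix (plain substring replacement, no regex).
toy['retrieved'] = toy.retrieved.str.replace(prefix, '', regex=False)
# 3. Use underscores in the searched name.
toy['artist'] = toy.artist.str.replace(' ', '_', regex=False)
# 4. Remove the three known designations when they end the title.
toy['retrieved_clean'] = toy.retrieved.str.replace(
    r'_\((?:singer|rapper|musician)\)$', '', regex=True)
# 5. Flag rows where the searched and retrieved names still differ.
toy['mismatch'] = toy.artist.str.lower() != toy.retrieved_clean.str.lower()
print(toy)
```

Rows flagged in the last step still need manual inspection, as the outline notes, because a differing name may be a typo, a variant spelling, or a different entity.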

### Remove artists for which 'retrieved' value is 'none'

Convert none to null:

In [12]:
data['retrieved'] = data['retrieved'].replace('none', np.nan)

In [13]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,,none
1,Studebaker John,,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [14]:
data.isnull().sum()

artist          0
retrieved    2924
genre           0
dtype: int64

Drop rows with nulls:

In [15]:
data.dropna(axis = 0, inplace = True)

In [16]:
data.shape

(5846, 3)
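
The replace-then-drop steps above could also be folded into loading. The sketch below assumes the same column layout and uses an inline sample in place of the real file. `na_values` and `dropna(subset=...)` are standard pandas options; the sample rows and variable names are illustrative.

```python
import io

import pandas as pd

sample_csv = io.StringIO(
    'index,artist,retrieved,genre\n'
    '0,Rosemary Vandenbroucke,none,none\n'
    '1,Studebaker John,none,none\n'
    '4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,Alternative rock\n'
)

# na_values tells read_csv to treat the literal string 'none' as missing.
# Passed as a list it applies to every column, so genre 'none' becomes NaN too.
loaded = pd.read_csv(sample_csv, index_col='index', na_values=['none'])

# Drop only the rows whose 'retrieved' value is missing.
kept = loaded.dropna(subset=['retrieved'])
print(kept.shape)
```

Restricting `dropna` with `subset` keeps the intent explicit: a row is removed because nothing was retrieved, not because some other column happens to be empty.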

In [17]:
data.head()

index,artist,retrieved,genre
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']
7,Jerry Penrod,https://en.wikipedia.org/wiki/Jerry_Penrod,['Rock music']
8,Wendy Rene,https://en.wikipedia.org/wiki/Wendy_Rene,"['Soul music', 'Rhythm and blues']"


## Remove the prefix from the 'retrieved' values

In [18]:
def retrieved_artist(text):
    """Extract the artist name from a Wikipedia URL.
    Apply it to the 'retrieved' values."""
    pattern = re.compile(r'https://en\.wikipedia\.org/wiki/(.*)')
    try:
        result = pattern.match(text)
    except TypeError:  # non-string input such as NaN
        result = None
    if result is not None:
        return result.group(1)
    # No URL prefix found: keep the 'none' placeholder, flag anything else
    return 'none' if text == 'none' else 'None'

In [19]:
data['retrieved'] = data.retrieved.apply(retrieved_artist)

In [20]:
data.head()

index,artist,retrieved,genre
2,Storm Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,Larry_Jon_Wilson,['Country music']
4,Leah Randi,Leah_Randi,['Alternative rock']
7,Jerry Penrod,Jerry_Penrod,['Rock music']
8,Wendy Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']"
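
As an alternative to a regular expression, the page title can be taken from the URL path with the standard library. This is a sketch only: `page_title` is a hypothetical helper, and the percent-encoded URL in the second call is an assumed input form, since the scraped URLs in this notebook appear to be stored already decoded.

```python
from urllib.parse import unquote, urlparse


def page_title(url):
    """Hypothetical helper: return the Wikipedia page title from a URL, percent-decoded."""
    path = urlparse(url).path            # e.g. '/wiki/Christopher_Hall_(musician)'
    prefix = '/wiki/'
    # Keep everything after '/wiki/' so titles containing '/' stay intact
    title = path[len(prefix):] if path.startswith(prefix) else path
    return unquote(title)


print(page_title('https://en.wikipedia.org/wiki/Christopher_Hall_(musician)'))
print(page_title('https://en.wikipedia.org/wiki/H%C3%A9l%C3%A8ne_Martin'))
```

Decoding with `unquote` matters only if some URLs were stored in percent-encoded form; for already decoded URLs it leaves the title unchanged.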


## Replace spaces with _ in the artist column:

In [21]:
def underscore(text):
    """Replace each space in an artist name with an underscore.
    Apply it to the 'artist' values."""
    try:
        return text.replace(' ', '_')
    except AttributeError:  # non-string input such as NaN
        return 'error'

In [22]:
data['artist'] = data.artist.apply(underscore)

In [23]:
data.head()

index,artist,retrieved,genre
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music']
4,Leah_Randi,Leah_Randi,['Alternative rock']
7,Jerry_Penrod,Jerry_Penrod,['Rock music']
8,Wendy_Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']"


## Remove the \_(singer) type designation from retrieved

In [24]:
def remove_designation(text):
    """Remove the designation from the retrieved name.
    Apply it to the 'retrieved' values."""
    # One pattern covers the three designations handled here
    pattern = re.compile(r'(.*)_\((?:musician|singer|rapper)\)')
    result = pattern.match(text)
    if result is not None:
        # group(1) is everything before the designation
        return result.group(1)
    return text

In [25]:
data['retrieved_clean'] = data.retrieved.apply(remove_designation)

Take a glance at retrieved and retrieved_clean values:

In [26]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('retrieved: {}        retrieved_clean: {}'.format(data.retrieved.iloc[n], data.retrieved_clean.iloc[n]))

retrieved: Louis_Bertignac        retrieved_clean: Louis_Bertignac
retrieved: Clara_Smith        retrieved_clean: Clara_Smith
retrieved: Dave_Moody        retrieved_clean: Dave_Moody
retrieved: Karrin_Allyson        retrieved_clean: Karrin_Allyson
retrieved: Kit_Hain        retrieved_clean: Kit_Hain
retrieved: Anthony_David_(singer)        retrieved_clean: Anthony_David
retrieved: Robert_Lockwood_Jr.        retrieved_clean: Robert_Lockwood_Jr.
retrieved: Hélène_Martin        retrieved_clean: Hélène_Martin
retrieved: Louis_Cennamo        retrieved_clean: Louis_Cennamo
retrieved: Patti_Smith        retrieved_clean: Patti_Smith
retrieved: Kristin_Hersh        retrieved_clean: Kristin_Hersh
retrieved: Korey_Cooper        retrieved_clean: Korey_Cooper
retrieved: Kerli        retrieved_clean: Kerli
retrieved: Janet_Pressley        retrieved_clean: Janet_Pressley
retrieved: Ivan_Neville        retrieved_clean: Ivan_Neville


Take a glance at artist and retrieved_clean values:

In [27]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

data[['artist','retrieved_clean']].iloc[rints]

index,artist,retrieved_clean
1994,Baby_Lloyd_Stallworth,Baby_Lloyd_Stallworth
2383,Shorty_Long,Shorty_Long
5622,Wendy_Waldman,Wendy_Waldman
3916,V.I.C.,V.I.C.
1320,Robert_Lucas,Robert_Lucas
1601,Carson_Robison,Carson_Robison
1415,Irma_Schultz_Keller,Irma_Schultz_Keller
803,Algis_Kizys,Algis_Kizys
3170,Darryl_Jenifer,Darryl_Jenifer
8216,Les_Paul,Les_Paul


### Mark the rows for which retrieved_clean is different from artist

In [36]:
def verify_artist(x, y):
    """Take a pair of strings and check whether they are
    equivalent (case insensitive).

    .casefold is used to be case insensitive;
    it might still have problems on some characters."""
    # Return 1 when the names agree and 0 when they differ
    return int(x.casefold() == y.casefold())
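
One case where casefolding alone falls short is accented names stored in different Unicode forms (precomposed versus base letter plus combining accent). The sketch below is a possible variant that normalizes before comparing. `same_name` is a hypothetical name, and NFKC is one reasonable choice of normalization form rather than the only one.

```python
import unicodedata


def same_name(x, y):
    """Hypothetical variant of verify_artist: compare after NFKC normalization and casefolding."""
    def canon(s):
        # Normalize first so equivalent character sequences compare equal
        return unicodedata.normalize('NFKC', s).casefold()
    return canon(x) == canon(y)


composed = 'H\u00e9l\u00e8ne_Martin'        # precomposed accented letters
decomposed = 'He\u0301le\u0300ne_Martin'    # base letters plus combining accents

print(composed.casefold() == decomposed.casefold())  # plain casefold comparison
print(same_name(composed, decomposed))               # normalized comparison
```

The two strings render identically but compare unequal without normalization, which would show up as a spurious mismatch between the searched and retrieved names.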

In [42]:
# Despite its name, 'match' is 1 where the names differ (a mismatch) and 0 where they agree
data['match'] = (data.artist.str.casefold() != data.retrieved_clean.str.casefold()).astype('int64')

In [43]:
data.head()

index,artist,retrieved,genre,retrieved_clean,match
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ...",Storm_Calysta,0
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music'],Larry_Jon_Wilson,0
4,Leah_Randi,Leah_Randi,['Alternative rock'],Leah_Randi,0
7,Jerry_Penrod,Jerry_Penrod,['Rock music'],Jerry_Penrod,0
8,Wendy_Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']",Wendy_Rene,0


In [44]:
data.match.sum()

15

In [45]:
data[data.match == 1]

index,artist,retrieved,genre,retrieved_clean,match
137,Quincy,Quincy_(band),"['New wave music', 'Power pop', 'Punk rock']",Quincy_(band),1
960,Ours,Ours_(band),"['Alternative rock', 'Post-grunge', 'Progressi...",Ours_(band),1
1011,Millionaires,Millionaires_(band),"['Electropop', 'Hip hop music', 'Crunkcore']",Millionaires_(band),1
1483,Jawbone,Jawbone_(band),"['Christian hardcore', 'Hardcore punk']",Jawbone_(band),1
1809,John_Barry,John_Barry_(composer),['Film score'],John_Barry_(composer),1
2299,Angel,Angel_(band),"['Glam rock', 'Progressive rock', 'Hard rock']",Angel_(band),1
2489,Northcote,Northcote_(band),"['Folk rock', 'Punk rock', 'Post-hardcore']",Northcote_(band),1
2653,Beef,Beef_(band),"['Reggae', 'Ska', 'Funk', 'Rock music']",Beef_(band),1
2811,The_Teardrops,The_Teardrops_(band),"['Punk rock', 'Post-punk', 'New wave music']",The_Teardrops_(band),1
3572,Troja,Troja_(band),"['Heavy metal music', 'Thrash metal', 'Hard ro...",Troja_(band),1
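
The mismatches shown above all carry a disambiguation suffix other than the three handled by remove_designation, such as (band) or (composer). If the goal is only to compare names, a more general pattern could strip any trailing parenthetical. This is a sketch with a hypothetical helper name. Note that it discards the information that a page is about a band rather than a singer, so it may not suit the filtering step.

```python
import re

# Matches one trailing '_(...)' group at the very end of the title.
TRAILING_DESIGNATION = re.compile(r'_\([^()]*\)$')


def strip_designation(title):
    """Hypothetical helper: drop a trailing '_(anything)' disambiguation suffix."""
    return TRAILING_DESIGNATION.sub('', title)


for title in ['Quincy_(band)', 'John_Barry_(composer)', 'Anthony_David_(singer)', 'V.I.C.']:
    print(title, '->', strip_designation(title))
```

Keeping the stripped suffix in a separate column would be one way to retain the band or composer distinction for later filtering.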


# START HERE

### Genre Labels

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

First we carry out the split on an example:

In [14]:
x = data.genre.iloc[2]

In [15]:
x

'atl hip hop, gangster rap, hip hop, pop rap, rap, southern hip hop, trap'

In [16]:
[s.strip() for s in x.split(',')]

['atl hip hop',
 'gangster rap',
 'hip hop',
 'pop rap',
 'rap',
 'southern hip hop',
 'trap']

Now we make a function to apply to the genre column:

In [35]:
def genrelist(string):
    return [s.strip() for s in string.split(',')]

Now we apply it to the whole column and put the lists in a new column:

In [37]:
data['genrelist']= data['genre'].apply(genrelist)

In [38]:
data.head()

Unnamed: 0,artist,gender,genre,genrelist
0,12 Gauge,male,miami bass,[miami bass]
1,1987,male,retro electro,[retro electro]
2,2 Chainz,male,"atl hip hop, gangster rap, hip hop, pop rap, r...","[atl hip hop, gangster rap, hip hop, pop rap, ..."
3,2 Pistols,male,"dirty south rap, pop rap, southern hip hop, trap","[dirty south rap, pop rap, southern hip hop, t..."
4,21 Savage,male,"atl hip hop, rap, trap","[atl hip hop, rap, trap]"
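
The split assumes every genre value is a string. A null-tolerant variant is sketched below for the case where missing genres have not been filtered out yet. `genrelist_safe` is a hypothetical name, and returning an empty list for missing values is an assumption about the desired behavior.

```python
def genrelist_safe(value):
    """Hypothetical null-tolerant variant of genrelist."""
    if not isinstance(value, str):
        # A missing value from pandas is a float NaN, so non-strings map to an empty list
        return []
    # Split on commas, trim whitespace, and skip empty fragments
    return [s.strip() for s in value.split(',') if s.strip()]


print(genrelist_safe('atl hip hop, rap, trap'))
print(genrelist_safe(float('nan')))
```

It can be applied the same way as the original, for example with `data['genre'].apply(genrelist_safe)`.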


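### Remove all artists with null values for genre:

Judging by the cell numbers, this filter was run before the split above. That order matters because genrelist calls .split on each value and would fail on a null genre.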

In [11]:
data = data[data['genre'].notnull()]

In [12]:
data.isnull().sum(axis = 0)

artist    0
gender    0
genre     0
dtype: int64

In [13]:
data.shape

(9734, 3)

### Extracting the unique genre labels:

In [61]:
genre_list0 = data.genrelist.values.tolist()

In [62]:
genre_list0[:5]

[['miami bass'],
 ['retro electro'],
 ['atl hip hop',
  'gangster rap',
  'hip hop',
  'pop rap',
  'rap',
  'southern hip hop',
  'trap'],
 ['dirty south rap', 'pop rap', 'southern hip hop', 'trap'],
 ['atl hip hop', 'rap', 'trap']]

In [63]:
genre_list1 = [x for y in genre_list0 for x in y]
len(genre_list1)

25998
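
The flattened list still contains repeats. One way to get the unique labels, and how often each occurs, is sketched below using only the standard library. The `nested` list reuses the first five rows shown above as a stand-in for genre_list0, and the variable names are illustrative.

```python
from collections import Counter

# Stand-in for genre_list0: the first rows shown above.
nested = [
    ['miami bass'],
    ['retro electro'],
    ['atl hip hop', 'gangster rap', 'hip hop', 'pop rap', 'rap', 'southern hip hop', 'trap'],
    ['dirty south rap', 'pop rap', 'southern hip hop', 'trap'],
    ['atl hip hop', 'rap', 'trap'],
]

# Flatten the list of lists, as in the cell above.
flat = [label for labels in nested for label in labels]

unique_labels = sorted(set(flat))   # each label once, in alphabetical order
label_counts = Counter(flat)        # number of artists carrying each label

print(len(flat), len(unique_labels))
print(label_counts.most_common(3))
```

The counts are useful beyond deduplication: labels that occur only once or twice are candidates for merging or dropping before analysis.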