# Resume at START HERE

This notebook is used to clean and prepare the genre label data for analysis.

In [81]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

In [82]:
%ls -l /Users/Daniel/Code/Genre/data/

total 0
drwxr-xr-x  1142 Daniel  staff  36544 Apr 17 10:27 artist_network_graphs/
drwxr-xr-x    14 Daniel  staff    448 Apr 22 15:14 genre_lists/


### Data Sets

The file singers_gender.csv is from Kaggle and lists music artists and their gender. This is our starting point. It is augmented using the lists of women artists. Genre and network info will be generated by scraping databases. For now we are focusing on Wikipedia.

In [83]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [84]:
# kaggle_data.shape

### Load the data to be cleaned:

Current: wiki-kaggle_genres_rough.csv

- This will be replaced by the fully scraped set
- The full set needs to be cleaned

Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

- I renamed the first column of the csv file to be 'index'
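
The converter idea noted above could be sketched as follows. This is only a minimal illustration, assuming the genre cells hold either the string 'none' or a Python-style list literal as in the preview further down; the helper name parse_genres and the second sample row are made up for the example.

```python
import ast
import io

import pandas as pd


def parse_genres(cell):
    """Convert a cell such as "['Pop', 'Rock']" to a list; 'none' becomes []."""
    cell = cell.strip()
    if cell in ('', 'none'):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python list literal: fall back to a plain comma split
        return [s.strip() for s in cell.split(',')]
    if isinstance(parsed, (list, tuple)):
        return [str(p) for p in parsed]
    return [str(parsed)]


# Tiny in-memory stand-in for the CSV; the second row is hypothetical
sample_csv = io.StringIO('''index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
1,Example Artist,https://en.wikipedia.org/wiki/Example_Artist,"['Indie pop', 'Rock music']"
''')

sample = pd.read_csv(sample_csv, index_col='index',
                     converters={'genre': parse_genres})
print(sample['genre'].tolist())
```

With a converter in place, the genre column would arrive as lists rather than strings, so no separate splitting step would be needed later.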

In [85]:
data = pd.read_csv('../../data/genre_lists/wiki-kaggle_genres_rough.csv', header = 0, index_col = 'index')

In [86]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [87]:
data.shape

(8770, 3)

We want to remove the artists for which 'retrieved' != 'artist'. To do this, we need to put both values in the same format.

Take a glance at artist and retrieved values:

In [88]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Rebecca Hollweg        retrieved: https://en.wikipedia.org/wiki/Rebecca_Hollweg
artist: Makana        retrieved: https://en.wikipedia.org/wiki/Makana_(musician)
artist: Cheikha Rimitti        retrieved: https://en.wikipedia.org/wiki/Cheikha_Rimitti
artist: Bryn McAuley        retrieved: none
artist: Angelo Starr        retrieved: none
artist: Houston Stackhouse        retrieved: https://en.wikipedia.org/wiki/Houston_Stackhouse
artist: Danielle Bradbery        retrieved: https://en.wikipedia.org/wiki/Danielle_Bradbery
artist: Hound Dog Taylor        retrieved: https://en.wikipedia.org/wiki/Hound_Dog_Taylor
artist: Haley Bennett        retrieved: none
artist: Thierry Gotti        retrieved: none
artist: Sally Jaye        retrieved: https://en.wikipedia.org/wiki/Sally_Jaye
artist: Angel Clivillés        retrieved: none
artist: Jim Garstang        retrieved: https://en.wikipedia.org/wiki/Jim_Garstang
artist: Doni Tamblyn        retrieved: none
artist: Ella Edmondson        retrieve

Retrieved:

- [x] It appears that there is always the incipit (see below) that needs to be removed 
- Underscore is used to separate parts of the name
- '.' are allowed in names 
- '(singer)' and '(musician)' are sometimes included and need to be stripped (probably to distinguish the artist from other people on Wikipedia)
- double quotes are allowed in names
- hyphens appear
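
These observations can be checked programmatically rather than by eye. The sketch below is one possible way to do it, run on a few values copied from the printed sample; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Values copied from the sample printed above
retrieved = pd.Series([
    'https://en.wikipedia.org/wiki/Rebecca_Hollweg',
    'https://en.wikipedia.org/wiki/Makana_(musician)',
    'none',
    'https://en.wikipedia.org/wiki/Hound_Dog_Taylor',
])
prefix = 'https://en.wikipedia.org/wiki/'

has_page = retrieved != 'none'
# True if every retrieved page carries the same URL prefix
all_prefixed = retrieved[has_page].str.startswith(prefix).all()

# Count trailing parenthetical designations such as (musician)
designations = (retrieved.str.extract(r'_\(([^)]*)\)$', expand=False)
                .value_counts())
print(all_prefixed)
print(designations)
```

Run on the full column, the designation counts would show whether labels other than '(singer)' and '(musician)' also need stripping.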

In [89]:
incipit = 'https://en.wikipedia.org/wiki/'

For filtering out artists that were not matched:

- separate the 'none'
- inspect mismatches to look for typos and different versions
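
A minimal sketch of this split, assuming the URL prefix has already been removed and the artist names already use underscores (both steps follow below). The frame demo is a hypothetical stand-in built from values in the printed sample.

```python
import pandas as pd

# Hypothetical frame; names are normalized, URL prefix already stripped
demo = pd.DataFrame({
    'artist': ['Rebecca_Hollweg', 'Bryn_McAuley', 'Makana'],
    'retrieved': ['Rebecca_Hollweg', 'none', 'Makana_(musician)'],
})

is_none = demo['retrieved'] == 'none'
is_match = demo['artist'] == demo['retrieved']

unmatched = demo[is_none]                # nothing was retrieved
matched = demo[is_match]                 # artist and page agree: keep
mismatched = demo[~is_none & ~is_match]  # inspect by hand for typos and variants
print(mismatched)
```

Here 'Makana' lands in the mismatched group only because of the '(musician)' designation, which is why that suffix is stripped before the final comparison.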

## Remove the incipit from the 'retrieved' values

In [90]:
def retrieved_artist(text):
    """Extract the artist name from a Wikipedia URL.

    Apply it to the 'retrieved' values."""
    try:
        # Capture everything after the URL prefix
        result = re.match(r'https://en\.wikipedia\.org/wiki/(.*)', text)
        return result.group(1)
    except (TypeError, AttributeError):
        # No match (AttributeError) or a non-string value (TypeError)
        if text == 'none':
            return 'none'
        else:
            return 'None'

In [91]:
data['retrieved'] = data.retrieved.apply(retrieved_artist)

In [92]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,Larry_Jon_Wilson,['Country music']
4,Leah Randi,Leah_Randi,['Alternative rock']


## Replace spaces with _ in the artist column:

In [93]:
def underscore(text):
    """Replace each space in an artist name with an underscore."""
    try:
        return text.replace(' ', '_')
    except AttributeError:
        # Non-string value (for example NaN)
        return 'error'

In [94]:
data['artist'] = data.artist.apply(underscore)

In [95]:
data.head()

index,artist,retrieved,genre
0,Rosemary_Vandenbroucke,none,none
1,Studebaker_John,none,none
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music']
4,Leah_Randi,Leah_Randi,['Alternative rock']


## Remove the \_(singer), \_(musician) and \_(rapper) groups from retrieved

In [96]:
def remove_designation(text):
    """Remove a '_(musician)', '_(singer)' or '_(rapper)' designation.

    Apply it to the 'retrieved' values."""
    # Keep everything before the designation; leave other names unchanged
    result = re.match(r'(.*)_\((?:musician|singer|rapper)\)', text)
    if result is not None:
        return result.group(1)
    return text

In [97]:
data['retrieved_clean'] = data.retrieved.apply(remove_designation)
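
For comparison, the three cleaning steps could also be written with pandas string methods instead of apply. This is a sketch on a small hypothetical frame, not the notebook's method; unlike the functions above it leaves missing values as NaN rather than returning 'None' or 'error'.

```python
import pandas as pd

# Hypothetical frame built from values seen in the printed samples
demo = pd.DataFrame({
    'artist': ['Larry Jon Wilson', 'Makana', 'Bryn McAuley'],
    'retrieved': ['https://en.wikipedia.org/wiki/Larry_Jon_Wilson',
                  'https://en.wikipedia.org/wiki/Makana_(musician)',
                  'none'],
})

prefix = 'https://en.wikipedia.org/wiki/'

# Spaces to underscores in the artist names
demo['artist'] = demo['artist'].str.replace(' ', '_', regex=False)
# Drop the literal URL prefix
demo['retrieved'] = demo['retrieved'].str.replace(prefix, '', regex=False)
# Strip a trailing designation
demo['retrieved_clean'] = demo['retrieved'].str.replace(
    r'_\((?:musician|singer|rapper)\)$', '', regex=True)
print(demo)
```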

Take a glance at retrieved and retrieved_clean values:

In [109]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('retrieved: {}        retrieved_clean: {}'.format(data.retrieved.iloc[n], data.retrieved_clean.iloc[n]))

retrieved: none        retrieved_clean: none
retrieved: none        retrieved_clean: none
retrieved: Tish_Hyman        retrieved_clean: Tish_Hyman
retrieved: Dave_Stryker        retrieved_clean: Dave_Stryker
retrieved: Betty_Clooney        retrieved_clean: Betty_Clooney
retrieved: none        retrieved_clean: none
retrieved: none        retrieved_clean: none
retrieved: Jennifer_Warnes        retrieved_clean: Jennifer_Warnes
retrieved: none        retrieved_clean: none
retrieved: Amir_Derakh        retrieved_clean: Amir_Derakh
retrieved: none        retrieved_clean: none
retrieved: J._D._Crowe        retrieved_clean: J._D._Crowe
retrieved: Mims_(rapper)        retrieved_clean: Mims
retrieved: none        retrieved_clean: none
retrieved: Nick_Drake        retrieved_clean: Nick_Drake


# START HERE

### Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

First we carry out the split on an example:

In [14]:
x = data.genre.iloc[2]

In [15]:
x

'atl hip hop, gangster rap, hip hop, pop rap, rap, southern hip hop, trap'

In [16]:
[s.strip() for s in x.split(',')]

['atl hip hop',
 'gangster rap',
 'hip hop',
 'pop rap',
 'rap',
 'southern hip hop',
 'trap']

Now we make a function to apply to the genre column:

In [35]:
def genrelist(string):
    return [s.strip() for s in string.split(',')]
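
The function above assumes every value is a string. A defensive variant is sketched below for the case where missing values are still present; the name genrelist_safe is hypothetical and the sample values are taken from the rows shown in this notebook.

```python
import pandas as pd


def genrelist_safe(value):
    """Split a comma-separated genre string; non-strings give an empty list."""
    if not isinstance(value, str):
        return []
    # Skip empty pieces left by stray or trailing commas
    return [s.strip() for s in value.split(',') if s.strip()]


genres = pd.Series(['atl hip hop, rap, trap', None, 'miami bass'])
result = genres.apply(genrelist_safe).tolist()
print(result)
```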

Now we apply it to the whole column and put the lists in a new column:

In [37]:
data['genrelist'] = data['genre'].apply(genrelist)

In [38]:
data.head()

Unnamed: 0,artist,gender,genre,genrelist
0,12 Gauge,male,miami bass,[miami bass]
1,1987,male,retro electro,[retro electro]
2,2 Chainz,male,"atl hip hop, gangster rap, hip hop, pop rap, r...","[atl hip hop, gangster rap, hip hop, pop rap, ..."
3,2 Pistols,male,"dirty south rap, pop rap, southern hip hop, trap","[dirty south rap, pop rap, southern hip hop, t..."
4,21 Savage,male,"atl hip hop, rap, trap","[atl hip hop, rap, trap]"


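### Remove all artists with null values for genre:

The execution counts (In [11] to In [13]) show that these cells ran before the split above. The order matters: genrelist raises an AttributeError on a null value.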

In [11]:
data = data[data['genre'].notnull()]

In [12]:
data.isnull().sum(axis = 0)

artist    0
gender    0
genre     0
dtype: int64

In [13]:
data.shape

(9734, 3)

### Extracting the unique genre labels:

In [61]:
genre_list0 = data.genrelist.values.tolist()

In [62]:
genre_list0[:5]

[['miami bass'],
 ['retro electro'],
 ['atl hip hop',
  'gangster rap',
  'hip hop',
  'pop rap',
  'rap',
  'southern hip hop',
  'trap'],
 ['dirty south rap', 'pop rap', 'southern hip hop', 'trap'],
 ['atl hip hop', 'rap', 'trap']]

In [63]:
genre_list1 = [x for y in genre_list0 for x in y]
len(genre_list1)

25998
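
From the flattened list, the unique labels and their frequencies can be obtained with a set and a Counter. The sketch below is one possible next step, run only on the five lists printed above rather than on the full data.

```python
from collections import Counter

# The first five genre lists, copied from the output above
genre_lists = [
    ['miami bass'],
    ['retro electro'],
    ['atl hip hop', 'gangster rap', 'hip hop', 'pop rap', 'rap',
     'southern hip hop', 'trap'],
    ['dirty south rap', 'pop rap', 'southern hip hop', 'trap'],
    ['atl hip hop', 'rap', 'trap'],
]

# Flatten, then deduplicate and count
flat = [label for labels in genre_lists for label in labels]
unique_labels = sorted(set(flat))
label_counts = Counter(flat)

print(len(flat), len(unique_labels))
print(label_counts.most_common(1))
```

On this small sample there are 16 labels in total, 10 of them distinct, and 'trap' is the most frequent with 3 occurrences.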