# Resume at START HERE

This notebook is used to clean and prepare the genre label data for analysis.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

In [2]:
%ls -l /Users/Daniel/Code/Genre/data/

total 0
drwxr-xr-x  1142 Daniel  staff  36544 Apr 17 10:27 [34martist_network_graphs[m[m/
drwxr-xr-x    14 Daniel  staff    448 Apr 22 15:14 [34mgenre_lists[m[m/


### Data Sets

The file singers_gender.csv is from Kaggle and lists music artists and their gender. This is our starting point. It is augmented using the lists of women artists. Genre and network info will be generated by scraping databases. For now we are focusing in Wikipedia.

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
# kaggle_data.shape

### Load the data to be cleaned:

Current: wiki-kaggle_genres_rough.csv

- This will be replaced by the fully scraped set
- The full set needs to be cleaned

Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

- I renamed the first column of the csv file to be 'index'

In [5]:
data = pd.read_csv('../../data/genre_lists/wiki-kaggle_genres_rough.csv', header = 0, index_col = 'index')

In [6]:
data.head()

Unnamed: 0_level_0,artist,retrieved,genre
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [7]:
data.shape

(8770, 3)

We want to remove the artists for which 'retrieved' != 'artist'. To do this, we need to put both values in the same format.

Take a glance at artist and retrieved values:

In [8]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Jeff Carpenter        retrieved: https://en.wikipedia.org/wiki/Jeff_Carpenter
artist: Meechie        retrieved: none
artist: Jesse Hughes        retrieved: https://en.wikipedia.org/wiki/Jesse_Hughes_(musician)
artist: Willy William        retrieved: https://en.wikipedia.org/wiki/Willy_William
artist: Ari Joshua Zucker        retrieved: https://en.wikipedia.org/wiki/Ari_Joshua_Zucker
artist: Cuco Sánchez        retrieved: none
artist: Edward Johnson        retrieved: none
artist: Anthony Rossomando        retrieved: https://en.wikipedia.org/wiki/Anthony_Rossomando
artist: David T. Walker        retrieved: https://en.wikipedia.org/wiki/David_T._Walker
artist: Terry McBride        retrieved: https://en.wikipedia.org/wiki/Terry_McBride_(musician)
artist: Max Ochs        retrieved: none
artist: Vincent Delerm        retrieved: https://en.wikipedia.org/wiki/Vincent_Delerm
artist: The Mississippi Moaner        retrieved: none
artist: Kate Nash        retrieved: https://en.wikipedia.or

Retrieved:

- [x] It appears that there is always the incipit (see below) that needs to be removed 
- Underscore is used to separate parts of the name
- '.' are allowed in names 
- '(singer)' and '(musician)' are sometimes included and need to be stripped (probably to distinguish from othe people in wikipedia)
- double quotes are allowed in names
- hyphens appear

In [9]:
incipit = 'https://en.wikipedia.org/wiki/'

For filtering out artists that were not matched:

- separate the 'none'
- inspect mismatches to look for typos and different versions

## Remove the incipit from the 'retrieved' values

In [10]:
"""This function extracts artist name from the url.
Apply it to the 'retrieved' values."""
def retrieved_artist(text):
    try:
        retrieved = text
        p = re.compile(r'(https://en.wikipedia.org/wiki/)(.*)')
        result = re.match(p, retrieved)
        return result.group(2)
    except:
        if text == 'none':
            return 'none'
    else:
        return 'None'

In [11]:
data['retrieved'] = data.retrieved.apply(retrieved_artist)

In [12]:
data.head()

Unnamed: 0_level_0,artist,retrieved,genre
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,Larry_Jon_Wilson,['Country music']
4,Leah Randi,Leah_Randi,['Alternative rock']


## Replace spaces with _ in the artist column:

In [13]:
"""This function replaces white space in the values of
the column artist with an underscore."""
def underscore(text):
    try:
        split_name = text.split(' ')
        return '_'.join(split_name)  
    except:
        return 'error'

In [14]:
data['artist'] = data.artist.apply(underscore)

In [15]:
data.head()

Unnamed: 0_level_0,artist,retrieved,genre
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Rosemary_Vandenbroucke,none,none
1,Studebaker_John,none,none
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music']
4,Leah_Randi,Leah_Randi,['Alternative rock']


# START HERE

Take a glance at artist and retrieved values:

In [41]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Kathy_Kosins        retrieved: Kathy_Kosins
artist: Shy_Glizzy        retrieved: Shy_Glizzy
artist: Ycare        retrieved: none
artist: Paul_Burch        retrieved: none
artist: Nick_Carter        retrieved: Nick_Carter_(musician)
artist: Julia_Neigel        retrieved: none
artist: Sahlene        retrieved: Sahlene
artist: Chadwick_Stokes_Urmston        retrieved: Chadwick_Stokes_Urmston
artist: Herriot_Row        retrieved: Herriot_Row
artist: Steinar_Aadnekvam        retrieved: none
artist: David_T._Walker        retrieved: David_T._Walker
artist: Darin_Gray        retrieved: none
artist: Tina_St._Claire        retrieved: none
artist: Julio_Sosa        retrieved: Julio_Sosa
artist: Cello_Dias        retrieved: Cello_Dias


## Remove the \_(singer) and \_(musician) groups from retrieved

In [71]:
"""This function removes the designation from the retrieved name.
Apply it to the 'retrieved' values."""
def remove_designation(text):
    retrieved = text
    p = re.compile(r'(.*)(_\(musician\))')
    q = re.compile(r'(.*)(_\(singer\))')
    if 
    try:
        
        result = re.match(p, retrieved)
        
    except:
        p = re.compile(r'(.*)(_\(musician\))')
        result = re.match(p, retrieved)
        return result.group(1)
    except:
        
        result = re.match(p, retrieved)
        return result.group(1)
    else:
        return text

SyntaxError: invalid syntax (<ipython-input-71-7c00cf2a19f2>, line 8)

In [72]:
p = re.compile(r'(.*)(_\(musician\))')
result = re.match(p, 'none')
result.group(2)

AttributeError: 'NoneType' object has no attribute 'group'

In [70]:
remove_designation('none')

AttributeError: 'NoneType' object has no attribute 'group'

In [69]:
data['retrieved'] == data.retrieved.apply(remove_designation)

AttributeError: 'NoneType' object has no attribute 'group'

In [12]:
data.head()

Unnamed: 0_level_0,artist,retrieved,genre
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,Larry_Jon_Wilson,['Country music']
4,Leah Randi,Leah_Randi,['Alternative rock']


In [32]:
data[data.artist == 'Johnny_Kidd']

Unnamed: 0_level_0,artist,retrieved,genre
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4048,Johnny_Kidd,Johnny_Kidd_(singer),"['Rock and roll', 'Beat music']"


### Each value of the genre column is a _string_ of comma separated genre labels using the spotify abbreviations. We want to convert it to a _list_ of strings.

First we carry out the split on an example:

In [14]:
x = data.genre.iloc[2]

In [15]:
x

'atl hip hop, gangster rap, hip hop, pop rap, rap, southern hip hop, trap'

In [16]:
[s.strip() for s in x.split(',')]

['atl hip hop',
 'gangster rap',
 'hip hop',
 'pop rap',
 'rap',
 'southern hip hop',
 'trap']

Now we make a function to apply to the genre column:

In [35]:
def genrelist(string):
    return [s.strip() for s in string.split(',')]

Now we apply it to the whole column and put the lists in a new column:

In [37]:
data['genrelist']= data['genre'].apply(genrelist)

In [38]:
data.head()

Unnamed: 0,artist,gender,genre,genrelist
0,12 Gauge,male,miami bass,[miami bass]
1,1987,male,retro electro,[retro electro]
2,2 Chainz,male,"atl hip hop, gangster rap, hip hop, pop rap, r...","[atl hip hop, gangster rap, hip hop, pop rap, ..."
3,2 Pistols,male,"dirty south rap, pop rap, southern hip hop, trap","[dirty south rap, pop rap, southern hip hop, t..."
4,21 Savage,male,"atl hip hop, rap, trap","[atl hip hop, rap, trap]"


### Remove all artists with null values for genre:

In [11]:
data = data[data['genre'].notnull()]

In [12]:
data.isnull().sum(axis = 0)

artist    0
gender    0
genre     0
dtype: int64

In [13]:
data.shape

(9734, 3)

### Extracting the unique genre labels:

In [61]:
genre_list0 = data.genrelist.values.tolist()

In [62]:
genre_list0[:5]

[['miami bass'],
 ['retro electro'],
 ['atl hip hop',
  'gangster rap',
  'hip hop',
  'pop rap',
  'rap',
  'southern hip hop',
  'trap'],
 ['dirty south rap', 'pop rap', 'southern hip hop', 'trap'],
 ['atl hip hop', 'rap', 'trap']]

In [63]:
genre_list1 = [x for y in genre_list0 for x in y]
len(genre_list1)

25998