# Resume at START HERE

This notebook is used to clean and prepare the genre label data for analysis.

## Outline of Cleaning:

- [x] remove artists for which 'retrieved' value is 'none'
- [x] remove the url prefix from the retrieved artist names 
- [x] replace ' ' in the artist column with '_'
- [x] remove the '(singer)', '(rapper)', '(musician)' designation from the 'retrieved' column
- [x] remove the artists for which the retrieved-artist != searched-artist. 
    - inspect mismatches to look for typos and different versions
- [x] convert genre column values into lists of strings
- [x] remove old columns
- [x] extract unique genres as a list

In [1]:
import numpy as np
np.random.seed(23)
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

In [2]:
%ls -l /Users/Daniel/Code/Genre/data/

total 0
drwxr-xr-x  1142 Daniel  staff  36544 Apr 17 10:27 artist_network_graphs/
drwxr-xr-x    14 Daniel  staff    448 Apr 22 15:14 genre_lists/


### Data Sets

The file singers_gender.csv is from Kaggle and lists music artists and their gender. This is our starting point, and it is augmented using the lists of women artists. Genre and network info will be generated by scraping databases; for now we are focusing on Wikipedia.

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
# kaggle_data.shape

### Load the data to be cleaned:

Current: wiki-kaggle_genres_rough.csv

- This will be replaced by the fully scraped set
- The full set needs to be cleaned

Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

- I renamed the first column of the csv file to be 'index'
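
The converter mentioned above could look like the following minimal sketch. It reads a small in-memory CSV instead of the real file, and the helper name `split_genres` and the sample rows are assumptions made for illustration only.

```python
import io
import pandas as pd

def split_genres(cell):
    """Hypothetical converter: turn "['A', 'B']" into ['A', 'B']; 'none' becomes []."""
    if cell == 'none':
        return []
    inner = cell.strip('[]')
    # split on commas, then drop surrounding whitespace and quotes from each label
    return [label.strip().strip("'\"") for label in inner.split(',') if label.strip()]

# Made-up rows in the same layout as the csv file (assumption)
sample_csv = io.StringIO('''index,artist,retrieved,genre
0,Artist A,none,none
1,Artist B,https://en.wikipedia.org/wiki/Artist_B,"['Soul music', 'Rhythm and blues']"
''')

sample = pd.read_csv(sample_csv, index_col='index', converters={'genre': split_genres})
print(sample['genre'].tolist())
```

A converter runs on the raw cell text while the file is read, so the genre column would arrive as lists and the later string-to-list step would not be needed. Splitting on commas breaks any label that itself contains a comma.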

In [5]:
data = pd.read_csv('../../data/genre_lists/wiki-kaggle_genres_rough.csv', header = 0, index_col = 'index')

In [6]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,none,none
1,Studebaker John,none,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [7]:
data.shape

(8770, 3)

In [8]:
data.isnull().sum()

artist       0
retrieved    0
genre        0
dtype: int64

For how many artists is the scraped genre 'none':

In [9]:
(data.genre == 'none').sum()

2924

For how many artists is the 'retrieved' value 'none':

In [10]:
(data.retrieved == 'none').sum()

2924
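
Both counts are 2924, which suggests, but does not by itself show, that the same rows are missing a page and a genre. A minimal sketch of a row-by-row check, run on a made-up frame (the `toy` data is an assumption, not the real set):

```python
import pandas as pd

# Made-up rows with the same columns as the notebook's data (assumption)
toy = pd.DataFrame({
    'artist': ['A', 'B', 'C'],
    'retrieved': ['none', 'https://en.wikipedia.org/wiki/B', 'https://en.wikipedia.org/wiki/C'],
    'genre': ['none', "['Rock music']", "['Soul music']"],
})

no_page = toy['retrieved'] == 'none'
no_genre = toy['genre'] == 'none'

# True only if the two masks agree on every row
same_rows = bool((no_page == no_genre).all())
# Number of rows that have a page but no genre
page_without_genre = int((~no_page & no_genre).sum())
print(same_rows, page_without_genre)
```

If `same_rows` came out False on the real data, dropping on 'retrieved' alone would leave some rows whose genre is still the string 'none'.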

Take a glance at artist and retrieved values to determine necessary cleaning:

In [11]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Hound Dog Taylor        retrieved: https://en.wikipedia.org/wiki/Hound_Dog_Taylor
artist: Kenya Bell        retrieved: https://en.wikipedia.org/wiki/Kenya_Bell
artist: Betty Compton        retrieved: none
artist: Christopher Hall        retrieved: https://en.wikipedia.org/wiki/Christopher_Hall_(musician)
artist: Emily Whitehurst        retrieved: https://en.wikipedia.org/wiki/Emily_Whitehurst
artist: Chris Kahl        retrieved: none
artist: Ed Dowie        retrieved: none
artist: Freddy Moore        retrieved: https://en.wikipedia.org/wiki/Freddy_Moore
artist: Sara Storer        retrieved: https://en.wikipedia.org/wiki/Sara_Storer
artist: Liza Manili        retrieved: none
artist: April Lawton        retrieved: https://en.wikipedia.org/wiki/April_Lawton
artist: Bev Pegg        retrieved: none
artist: McLean        retrieved: https://en.wikipedia.org/wiki/McLean_(singer)
artist: Ryn Weaver        retrieved: https://en.wikipedia.org/wiki/Ryn_Weaver
artist: Stan Wilson        ret

Notes on Retrieved:

- Underscore is used to separate parts of the name
- '.' are allowed in names 
- '(singer)' and '(musician)' are sometimes included and need to be stripped (probably to distinguish from other people in Wikipedia)
- double quotes are allowed in names
- hyphens appear
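
Rather than spotting designations by eye in a random sample, they could be tallied directly. The sketch below uses three urls from the output above plus a 'none' value; the regular expression is an assumption about how the designations are formatted (a trailing `_(...)`).

```python
import pandas as pd

urls = pd.Series([
    'https://en.wikipedia.org/wiki/Christopher_Hall_(musician)',
    'https://en.wikipedia.org/wiki/McLean_(singer)',
    'https://en.wikipedia.org/wiki/Ryn_Weaver',
    'none',
])

# Capture the text inside a trailing "_(...)"; rows without one become NaN
designation = urls.str.extract(r'_\(([^)]*)\)$', expand=False)

# value_counts ignores NaN, so only rows with a designation are counted
counts = designation.value_counts()
print(counts)
```

Run on the full 'retrieved' column, a tally like this would list every designation that occurs, which helps decide what the cleaning function has to handle.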

### Remove artists for which 'retrieved' value is 'none'

Convert none to null:

In [12]:
data['retrieved'] = data['retrieved'].replace('none', np.nan)

In [13]:
data.head()

index,artist,retrieved,genre
0,Rosemary Vandenbroucke,,none
1,Studebaker John,,none
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']


In [14]:
data.isnull().sum()

artist          0
retrieved    2924
genre           0
dtype: int64

Drop rows with nulls:

In [15]:
data.dropna(axis = 0, inplace = True)

In [16]:
data.shape

(5846, 3)
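
The row count is consistent: 8770 - 2924 = 5846. Note that `dropna` without arguments drops a row if any column is null. A sketch that restricts the drop to the 'retrieved' column and avoids `inplace`, shown on a made-up frame (an assumption, not the real data):

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the notebook's data (assumption)
toy = pd.DataFrame({
    'artist': ['A', 'B', 'C'],
    'retrieved': ['none', 'https://en.wikipedia.org/wiki/B', 'none'],
    'genre': ['none', "['Rock music']", 'none'],
})

cleaned = (
    toy.assign(retrieved=toy['retrieved'].replace('none', np.nan))
       .dropna(subset=['retrieved'])   # only nulls in 'retrieved' trigger a drop
)
print(cleaned.shape)
```

With `subset`, a null that later appears in some other column would not silently remove rows at this step.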

In [17]:
data.head()

index,artist,retrieved,genre
2,Storm Calysta,https://en.wikipedia.org/wiki/Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,https://en.wikipedia.org/wiki/Larry_Jon_Wilson,['Country music']
4,Leah Randi,https://en.wikipedia.org/wiki/Leah_Randi,['Alternative rock']
7,Jerry Penrod,https://en.wikipedia.org/wiki/Jerry_Penrod,['Rock music']
8,Wendy Rene,https://en.wikipedia.org/wiki/Wendy_Rene,"['Soul music', 'Rhythm and blues']"


## Remove the prefix from the 'retrieved' values

In [18]:
def retrieved_artist(text):
    """Extract the artist name from a Wikipedia url.
    Apply it to the 'retrieved' values."""
    # the dots are escaped so that they only match a literal '.'
    p = re.compile(r'(https://en\.wikipedia\.org/wiki/)(.*)')
    try:
        result = re.match(p, text)
        return result.group(2)
    except (AttributeError, TypeError):
        # no match (AttributeError) or a non-string value such as NaN (TypeError)
        if text == 'none':
            return 'none'
        return 'None'

In [19]:
data['retrieved'] = data.retrieved.apply(retrieved_artist)

In [20]:
data.head()

index,artist,retrieved,genre
2,Storm Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry Jon Wilson,Larry_Jon_Wilson,['Country music']
4,Leah Randi,Leah_Randi,['Alternative rock']
7,Jerry Penrod,Jerry_Penrod,['Rock music']
8,Wendy Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']"


## Replace spaces with _ in the artist column:

In [21]:
def underscore(text):
    """Replace spaces in the values of the
    artist column with underscores."""
    try:
        return text.replace(' ', '_')
    except AttributeError:
        # non-string value, e.g. NaN
        return 'error'

In [22]:
data['artist'] = data.artist.apply(underscore)

In [23]:
data.head()

index,artist,retrieved,genre
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ..."
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music']
4,Leah_Randi,Leah_Randi,['Alternative rock']
7,Jerry_Penrod,Jerry_Penrod,['Rock music']
8,Wendy_Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']"


## Remove the \_(singer) type designation from retrieved

In [24]:
def remove_designation(text):
    """Use re to remove the parenthetical designations listed below
    from the retrieved artist name."""
    designations = [r'_\(singer\)', r'_\(musician\)', r'_\(rapper\)', r'_\(band\)', r'_\(composer\)', r'_\(music_producer\)']
    x = text
    for des in designations:
        # substitute on x, not text, so that earlier removals are kept
        x = re.sub(des, '', x)
    return x

Apply the function:

In [25]:
data['retrieved_clean'] = data.retrieved.apply(remove_designation)
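
The explicit list in `remove_designation` only covers designations that were noticed in advance. A more general variant, sketched below, drops any single trailing `_(...)` group; the helper name `strip_trailing_parenthetical` is hypothetical, and the examples come from the output earlier in the notebook.

```python
import re

# One trailing "_( ... )" group with no nested parentheses
TRAILING_PAREN = re.compile(r'_\([^()]*\)$')

def strip_trailing_parenthetical(name):
    """Hypothetical helper: drop one trailing "_(...)" group, whatever it contains."""
    return TRAILING_PAREN.sub('', name)

examples = ['McLean_(singer)', 'Christopher_Hall_(musician)', 'Ryn_Weaver']
print([strip_trailing_parenthetical(n) for n in examples])
```

The trade-off is that this would also strip a parenthetical that is genuinely part of an artist's name, so the explicit list is the more conservative choice.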

Take a glance at artist and retrieved_clean values:

In [26]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

data[['artist','retrieved_clean']].iloc[rints]

index,artist,retrieved_clean
6995,Louis_Bertignac,Louis_Bertignac
2894,Clara_Smith,Clara_Smith
4116,Dave_Moody,Dave_Moody
3146,Karrin_Allyson,Karrin_Allyson
339,Kit_Hain,Kit_Hain
1752,Anthony_David,Anthony_David
5683,Robert_Lockwood_Jr.,Robert_Lockwood_Jr.
3107,Hélène_Martin,Hélène_Martin
1065,Louis_Cennamo,Louis_Cennamo
5553,Patti_Smith,Patti_Smith


### Mark the rows for which retrieved_clean is different from artist

In [27]:
def verify_artist(x, y):
    """Take a pair of strings and return 1 if they are
    equal ignoring case, else 0.

    str.casefold makes the comparison case insensitive;
    some characters might still cause problems."""
    return int(x.casefold() == y.casefold())
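
One kind of character problem that `casefold` does not solve is Unicode composition: an accented letter can be stored as one code point or as a base letter plus a combining accent. A sketch of normalizing before comparing, using a name from the sample above; the helper `normalized` is hypothetical, and whether the scraped data actually contains decomposed forms is not known.

```python
import unicodedata

def normalized(name):
    """Hypothetical helper: NFKC-normalize, then casefold, before comparing."""
    return unicodedata.normalize('NFKC', name).casefold()

composed = 'H\u00e9l\u00e8ne_Martin'        # accented letters as single code points
decomposed = 'He\u0301le\u0300ne_Martin'    # plain e followed by combining accents

# The two strings look identical when printed but differ code point by code point
print(composed.casefold() == decomposed.casefold())
print(normalized(composed) == normalized(decomposed))
```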

Introduce a mismatch just to make sure we can properly remove these:

In [28]:
# use an iloc index larger than the size of the original dataframe
#data.iloc[data.shape[0]+1] = ['test','test_wrong','universal','test_wrong']
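
The commented-out line would not work as written, because `.iloc` cannot enlarge a DataFrame and raises an IndexError for a position beyond the last row. Assigning through `.loc` with a label that is not yet in the index does append a row. A minimal sketch on a one-row frame (the `toy` data and the test row are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {'artist': ['Patti_Smith'], 'retrieved_clean': ['Patti_Smith']},
    index=pd.Index([5553], name='index'),
)

# A label that is not in the index yet, so .loc appends instead of overwriting
new_label = toy.index.max() + 1
toy.loc[new_label] = ['test', 'test_wrong']

# True where the casefolded names differ
mismatch = toy['artist'].str.casefold() != toy['retrieved_clean'].str.casefold()
print(toy[mismatch])   # inspect the mismatches before dropping them

kept = toy[~mismatch]
```

Printing `toy[mismatch]` before filtering is also a convenient place to look for typos and alternative spellings among the rejected rows.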

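Flag the mismatches. The comparison is done inline rather than with verify_artist, and the column is 1 where the names differ, so despite its name, match == 0 marks a matching pair: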

In [29]:
data['match'] = (data.artist.apply(lambda x: x.casefold()) != data.retrieved_clean.apply(lambda x: x.casefold())).astype('int64')

In [30]:
data.head()

index,artist,retrieved,genre,retrieved_clean,match
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ...",Storm_Calysta,0
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music'],Larry_Jon_Wilson,0
4,Leah_Randi,Leah_Randi,['Alternative rock'],Leah_Randi,0
7,Jerry_Penrod,Jerry_Penrod,['Rock music'],Jerry_Penrod,0
8,Wendy_Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']",Wendy_Rene,0


In [31]:
data.match.sum()

0

Now remove artists where retrieved_clean doesn't match artist:

In [32]:
data = data[data.match == 0]

In [33]:
data.head()

index,artist,retrieved,genre,retrieved_clean,match
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ...",Storm_Calysta,0
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music'],Larry_Jon_Wilson,0
4,Leah_Randi,Leah_Randi,['Alternative rock'],Leah_Randi,0
7,Jerry_Penrod,Jerry_Penrod,['Rock music'],Jerry_Penrod,0
8,Wendy_Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']",Wendy_Rene,0


The remaining artists are now verified and each has a non-null genre label.

### Genre Labels

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

In [34]:
def genrelist(string):
    """Take a string of the form appearing in the genre column
    of the dataframe, strip the square brackets and quotes, and
    return a list of strings where each string is a genre label."""
    string = string.strip("[").strip("]").replace("'","")
    # the space after each comma is kept, so every label after the
    # first one carries a leading space
    return string.split(',')

Now we apply it to the whole column and put the lists in a new column:

In [35]:
data['genrelist']= data['genre'].apply(genrelist)
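
Because each cell looks like the printed form of a Python list, it could also be parsed with `ast.literal_eval` instead of manual stripping and splitting. This is a sketch under that assumption; the helper `parse_genres` and the sample cell (including the label with apostrophes) are made up for illustration.

```python
import ast

def parse_genres(cell):
    """Hypothetical alternative to genrelist: parse the cell as a Python list literal."""
    labels = ast.literal_eval(cell)
    # strip stray whitespace around each label
    return [label.strip() for label in labels]

# Made-up cell; the second label contains apostrophes (assumption)
cell = '''['Soul music', "Rock 'n' roll", 'Rhythm and blues']'''
print(parse_genres(cell))
```

This keeps apostrophes and commas inside a label intact, whereas removing every `'` and splitting on `,` would alter such labels. A cell that is not a valid literal, such as the string 'none', makes `ast.literal_eval` raise a ValueError, so those rows would have to be removed first.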

In [36]:
data.head()

index,artist,retrieved,genre,retrieved_clean,match,genrelist
2,Storm_Calysta,Storm_Calysta,"['Indie-Pop', 'Rock music', 'Indie pop', 'Pop ...",Storm_Calysta,0,"[Indie-Pop, Rock music, Indie pop, Pop musi..."
3,Larry_Jon_Wilson,Larry_Jon_Wilson,['Country music'],Larry_Jon_Wilson,0,[Country music]
4,Leah_Randi,Leah_Randi,['Alternative rock'],Leah_Randi,0,[Alternative rock]
7,Jerry_Penrod,Jerry_Penrod,['Rock music'],Jerry_Penrod,0,[Rock music]
8,Wendy_Rene,Wendy_Rene,"['Soul music', 'Rhythm and blues']",Wendy_Rene,0,"[Soul music, Rhythm and blues]"


### Remove all artists with null values for genre:

In [37]:
data = data[data['genrelist'].notnull()]

In [38]:
data.isnull().sum(axis = 0)

artist             0
retrieved          0
genre              0
retrieved_clean    0
match              0
genrelist          0
dtype: int64

In [39]:
data.shape

(5846, 6)

Remove old columns:

In [40]:
data.columns

Index(['artist', 'retrieved', 'genre', 'retrieved_clean', 'match',
       'genrelist'],
      dtype='object')

In [42]:
data.drop(['retrieved','genre','retrieved_clean', 'match'], axis = 1, inplace = True)

In [43]:
data.head()

index,artist,genrelist
2,Storm_Calysta,"[Indie-Pop, Rock music, Indie pop, Pop musi..."
3,Larry_Jon_Wilson,[Country music]
4,Leah_Randi,[Alternative rock]
7,Jerry_Penrod,[Rock music]
8,Wendy_Rene,"[Soul music, Rhythm and blues]"


In [44]:
data.shape

(5846, 2)

### Extracting the unique genre labels:

First make a list of the genrelists:

In [51]:
genre_list = data.genrelist.values.tolist()

In [52]:
genre_list[:5]

[['Indie-Pop', ' Rock music', ' Indie pop', ' Pop music', ' Psychedelic Rock'],
 ['Country music'],
 ['Alternative rock'],
 ['Rock music'],
 ['Soul music', ' Rhythm and blues']]

Flatten:

In [53]:
genre_list = [x for y in genre_list for x in y]
len(genre_list)

15284

Only keep unique values:

In [54]:
genre_list = list(set(genre_list))

In [55]:
len(genre_list)

1614
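
The first five lists printed above show labels with a leading space (for example ' Rock music' next to 'Rock music'), so the count of 1614 may include labels that differ only by whitespace. A sketch of how to measure that with `explode` and `str.strip`, on a made-up two-row frame (an assumption, not the real data):

```python
import pandas as pd

# Made-up rows; the leading space mimics the lists printed above (assumption)
toy = pd.DataFrame({
    'artist': ['Storm_Calysta', 'Jerry_Penrod'],
    'genrelist': [['Indie-Pop', ' Rock music'], ['Rock music']],
})

# One row per (artist, label) pair
labels = toy['genrelist'].explode()

raw_unique = labels.nunique()                  # counts ' Rock music' and 'Rock music' separately
clean_unique = labels.str.strip().nunique()    # whitespace removed before counting
print(raw_unique, clean_unique)
```

Comparing the two numbers on the real column would show how many of the unique labels are whitespace duplicates. Labels that differ only by case or hyphenation, such as 'Indie-Pop' and 'Indie pop', would still be counted separately.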