## This notebook is used to clean and prepare the genre label data for analysis.

- [ ] switch CSV import to use dtype list
- [ ] move individual case respells to a script
- [ ] reformat respell function to take a dictionary of all the cases

## Outline of Cleaning:

1. - [x] remove artists for which 'retrieved' value is 'none'
2. - [x] remove the url prefix from the retrieved artist names 
3. - [x] replace ' ' in the artist column with '_'
4. - [x] remove the '(singer)', '(rapper)', '(musician)' designation from the 'retrieved' column
5. - [x] remove the artists for which the retrieved-artist != searched-artist. 
    - inspect mismatches to look for typos and different versions
6. - [x] Deal with genre label problems
7. - [x] Normalize genre label spelling: e.g. r&b vs rhythm and blues vs rhythm & blues
    - [x] 'rock n roll' = 'rock & roll' etc, but 'rock' separate.
    - [x] remove _music
    - [x] hip hop and hip--hop -> hip-hop
    - [ ] separate lists: look at why they aren't separated
8. - [x] convert genre column values into lists of strings
9. - [x] remove bands (as opposed to individuals)
10. - [x] remove old columns
11. - [x] extract unique genres as a list

In [1]:
import numpy as np
np.random.seed(23)
import pandas as pd
import re
from datetime import datetime

In [2]:
# %ls -lt /Users/Daniel/Code/Genre/data/genre_lists/data_to_be_cleaned/

### The data set with the genders of music artists

The file singers_gender.csv is from Kaggle and lists music artists and their gender. 

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

### Adding genre labels

Genre labels were scraped from Wikipedia info boxes for the artists in the Kaggle dataset. The dataset that joins that info to the gender of the singers is in: 'wiki-kaggle_genres_rescrape.csv'

- [ ] Add in a converter that splits the genre list on commas:
https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas

In [4]:
data = pd.read_csv('../../data/genre_lists/data_to_be_cleaned/wiki-kaggle_genres_rescrape.csv', header = 0)
data.drop(['Unnamed: 0'], axis = 1, inplace = True)
print(f'Shape of the data: {data.shape}')
print('Null values in each column')
print(data.isnull().sum())

Shape of the data: (23177, 4)
Null values in each column
artist       0
gender       0
retrieved    0
genre        0
dtype: int64


In [5]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
0,Jimmy Boyd,male,none,none
1,Christopher Willits,male,https://en.wikipedia.org/wiki/Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
2,Henry Frayne,male,none,none
3,Shawn Hook,male,https://en.wikipedia.org/wiki/Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,https://en.wikipedia.org/wiki/Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"


### Remove artists for which 'retrieved' value is 'none'

For how many artists is the scraped genre 'none'?

In [6]:
(data.genre == 'none').sum()

7677

For how many artists is the 'retrieved' value 'none':

In [7]:
(data.retrieved == 'none').sum()

7677

Convert none to null:

In [8]:
data['retrieved'] = data['retrieved'].replace('none', np.nan)

In [9]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
0,Jimmy Boyd,male,,none
1,Christopher Willits,male,https://en.wikipedia.org/wiki/Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
2,Henry Frayne,male,,none
3,Shawn Hook,male,https://en.wikipedia.org/wiki/Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,https://en.wikipedia.org/wiki/Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"


In [10]:
data.isnull().sum()

artist          0
gender          0
retrieved    7677
genre           0
dtype: int64

Drop rows with nulls:

In [11]:
data.dropna(axis = 0, inplace = True)

In [12]:
data.shape

(15500, 4)

### Take a glance at random artist and retrieved values to determine necessary cleaning:

In [13]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Berner        retrieved: https://en.wikipedia.org/wiki/Berner_(rapper)
artist: Chad Kroeger        retrieved: https://en.wikipedia.org/wiki/Chad_Kroeger
artist: Zoogz Rift        retrieved: https://en.wikipedia.org/wiki/Zoogz_Rift
artist: Gábor Szabó        retrieved: https://en.wikipedia.org/wiki/Gábor_Szabó
artist: Lloyd        retrieved: https://en.wikipedia.org/wiki/Lloyd_(singer)
artist: Dale McBride        retrieved: https://en.wikipedia.org/wiki/Dale_McBride
artist: Nomy Lamm        retrieved: https://en.wikipedia.org/wiki/Nomy_Lamm
artist: Ray Stevens        retrieved: https://en.wikipedia.org/wiki/Ray_Stevens
artist: Betty Clooney        retrieved: https://en.wikipedia.org/wiki/Betty_Clooney
artist: Martin Bramah        retrieved: https://en.wikipedia.org/wiki/Martin_Bramah
artist: Merry Clayton        retrieved: https://en.wikipedia.org/wiki/Merry_Clayton
artist: Francisco Céspedes        retrieved: https://en.wikipedia.org/wiki/Francisco_Céspedes
artist: Jim Messina 

Notes on Retrieved:

- Underscore is used to separate parts of the name
- '.' are allowed in names 
- '(singer)' and '(musician)' are sometimes included and need to be stripped (probably to distinguish from othe people in wikipedia)
- double quotes are allowed in names
- hyphens appear

## Remove the URL prefix from the 'retrieved' values

In [14]:
"""This function extracts artist name from the url.
Apply it to the 'retrieved' values."""
def retrieved_artist(text):
    try:
        retrieved = text
        p = re.compile(r'(https://en.wikipedia.org/wiki/)(.*)')
        result = re.match(p, retrieved)
        return result.group(2)
    except:
        if text == 'none':
            return 'none'
    else:
        return 'None'

In [15]:
data['retrieved'] = data.retrieved.apply(retrieved_artist)

In [16]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
1,Christopher Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
3,Shawn Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"
6,Marvin Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']"
7,Povel Ramel,male,Povel_Ramel,['vaudeville']


Take a glance at artist and retrieved values after cleaning retrieved to determine further cleaning:

In [17]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

for n in rints:
    print('artist: {}        retrieved: {}'.format(data.artist.iloc[n], data.retrieved.iloc[n]))

artist: Lisa Brokop        retrieved: Lisa_Brokop
artist: Shani Rigsbee        retrieved: Shani_Rigsbee
artist: Hamish Stuart        retrieved: Hamish_Stuart
artist: Dominique Eade        retrieved: Dominique_Eade
artist: John C. J. Taylor        retrieved: John_C._J._Taylor
artist: Phillip Mitchell        retrieved: Phillip_Mitchell
artist: Meja        retrieved: Meja
artist: Simon Underwood        retrieved: Simon_Underwood
artist: Del Reeves        retrieved: Del_Reeves
artist: Syleena Johnson        retrieved: Syleena_Johnson
artist: Jimi Jamison        retrieved: Jimi_Jamison
artist: John Spiker        retrieved: John_Spiker
artist: Melissa Sgambelluri        retrieved: Melissa_Sgambelluri
artist: Justin Hayford        retrieved: Justin_Hayford
artist: Markus Fagervall        retrieved: Markus_Fagervall


## Replace spaces with _ in the artist column:

In [18]:
"""This function replaces white space in the values of
the column artist with an underscore."""
def underscore(text):
    try:
        split_name = text.split(' ')
        return '_'.join(split_name)  
    except:
        return 'error'

In [19]:
data['artist'] = data.artist.apply(underscore)

In [20]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac..."
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']"
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']"
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']"
7,Povel_Ramel,male,Povel_Ramel,['vaudeville']


## Remove the \_(singer) type designation from retrieved

In [21]:
"""This function uses re. to remove any parenthetical designations
form the retrieved artist name"""
def remove_designation(text):
    designations = [r'_\(singer\)', r'_\(musician\)', r'_\(rapper\)', r'_\(band\)', r'_\(composer\)', r'_\(music_producer\)', r'_\(singer-songwriter\)' ]
    x = text
    for des in designations:
        if re.search(des, x):
            x = re.sub(r'{}'.format(des),'',text)
    return x

Apply the function:

In [22]:
data['retrieved_clean'] = data.retrieved.apply(remove_designation)

Take a glance at artist and retrieved_clean values:

In [23]:
rints = np.random.randint(0,data.shape[0],15) # generate 15 random numbers from 0 to k-1, with k = # of rows

data[['artist','retrieved_clean']].iloc[rints]

Unnamed: 0,artist,retrieved_clean
4639,John_Schlitt,John_Schlitt
6657,Jack_Hues,Jack_Hues
8250,Paris_Bennett,Paris_Bennett
3281,Julian_Velard,Julian_Velard
9987,Chingy,Chingy
12032,Deesha,Deesha
9981,Jamie_Oldaker,Jamie_Oldaker
6919,Pearl_Future,Pearl_Future
10485,Vince_Taylor,Vince_Taylor
15091,Tarsame_Singh_Saini,Tarsame_Singh_Saini


### Mark the rows for which retrieved_clean is different from artist

In [24]:
"""This function takes a pair of strings and checks
if they are equivalent (case insensitive)

.casefold is used to be case insensitive; 
still might have problems on some characters"""

def verify_artist(x,y):
    if x.casefold() == y.casefold(): 
        return 1
    else:
        return 0

Introduce a mismatch just to make sure we can properly remove these:

In [25]:
# use an iloc index larger than the size of the original dataframe
#data.iloc[data.shape[0]+1] = ['test','test_wrong','universal','test_wrong']

Apply the function:

In [26]:
data['match'] = (data.artist.apply(lambda x: x.casefold()) != data.retrieved_clean.apply(lambda x: x.casefold())).astype('int64')

In [27]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac...",Christopher_Willits,0
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']",Shawn_Hook,0
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']",Steve_Poltz,0
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']",Marvin_Isley,0
7,Povel_Ramel,male,Povel_Ramel,['vaudeville'],Povel_Ramel,0


In [28]:
data.match.sum()

0

Now remove artists where retrieved_clean doesn't match artist:

In [29]:
data = data[data.match == 0].copy(deep = True)

In [30]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac...",Christopher_Willits,0
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']",Shawn_Hook,0
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']",Steve_Poltz,0
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']",Marvin_Isley,0
7,Povel_Ramel,male,Povel_Ramel,['vaudeville'],Povel_Ramel,0


Now the remaining artists are verified and have non-null genre label. 

In [31]:
data = data.copy( deep = True)

# Genre Labels

Each value of the genre column is a _string_ of comma separated genre labels using the spotify abbreviations. We want to convert it to a _list_ of strings.

### Problems! 

Strings to search for to possibly correct:

- [ ] 'and' -- too many!
- [x] '\'
- [x] 'r&g' -- OK - rhythm and grime
- [x] '*'
- [x] '_·_' -- added genrelist function; replaced with comma
- [ ] search for all non-letters
- [x] 'descriptors'- '|'
- [x] 'rock_folk_-rock_rock_-electronic_ballad'
- [x] 'hillbilly_rockabilly_r&b'
- [x] '-' at the beggining of a string
- [x] '\xa0' (no break space)

More examples:

- [x] 'hardcore\\\\xa0punk'
- [x] 'country_·_americana_·_folk_·_singer_songwriter'
- [x] 'blues_soul_r_&_b_gospel_funk_folk'
- [x] 'college\\\\xa0rock'
- [ ] 'gospel_and_gospel_blues'

Clearly some of these need to be parsed into common genres. 

In [32]:
#data.artist[data.genre.str.contains("christian")]

In [33]:
data.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
1,Christopher_Willits,male,Christopher_Willits,"['electronic', 'glitch', 'ambient', 'electroac...",Christopher_Willits,0
3,Shawn_Hook,male,Shawn_Hook,"['pop', 'electronic', 'rock']",Shawn_Hook,0
4,Steve_Poltz,male,Steve_Poltz,"['pop rock', 'indie-rock', 'folk rock']",Steve_Poltz,0
6,Marvin_Isley,male,Marvin_Isley,"['r&b', 'funk', 'soul', 'funk rock']",Marvin_Isley,0
7,Povel_Ramel,male,Povel_Ramel,['vaudeville'],Povel_Ramel,0


In [34]:
data.genre.isnull().sum()

0

# Move this section into a script that is run here

## Deal with unusual cases by hand:

Remove any '*' (moved to general cleaning function below)

In [42]:
data[data.genre.str.contains(r'descriptors\\')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
6650,RY_X,male,RY_X,"['ambient pop', 'folktronica', 'indietronica',...",RY_X,0


In [43]:
data.genre.at[6650] = "['ambient pop', 'folktronica', 'indietronica', 'indie folk']"

In [44]:
data[data.genre.str.contains(r'\\u2060')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
17600,Ben_Moody,male,Ben_Moody,"['alternative rock', 'alternative metal', '\u2...",Ben_Moody,0


In [45]:
data.genre.loc[17600]

"['alternative rock', 'alternative metal', '\\u2060', 'nu metal', 'gothic metal']"

In [46]:
data.genre.at[17600] = "['alternative rock', 'alternative metal', 'nu metal', 'gothic metal']"

In [47]:
data[data.genre.str.contains(r'\\')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
646,Gabriel_Wilson,male,Gabriel_Wilson,"['christian & gospel', 'independent\nsinger/so...",Gabriel_Wilson,0
1306,Eric_Gaffney,male,Eric_Gaffney,"['alternative rock', 'indie rock', 'lo-fi', 'h...",Eric_Gaffney,0
3015,Michael_Stipe,male,Michael_Stipe,"['alternative\xa0rock', 'folk rock', 'college\...",Michael_Stipe,0
8147,John_Wesley_Harding,male,John_Wesley_Harding_(singer),"['singer-songwriter rock\npop', 'folk']",John_Wesley_Harding,0
9899,Marcus_Singletary,male,Marcus_Singletary,"['rock', 'progressive\xa0rock', 'jazz', 'jazz ...",Marcus_Singletary,0
16275,Patrick_Dennis,male,Patrick_Dennis_(musician),"['alternative rock', 'indie rock', 'alternativ...",Patrick_Dennis,0
16726,Jamal_Millner,male,Jamal_Millner,['black american music \nwestern european art ...,Jamal_Millner,0
19018,Lance_King,male,Lance_King,"['heavy metal', 'power metal', 'progressive me...",Lance_King,0


In [48]:
data[data.genre.str.contains(r' -')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
11961,Sarana_VerLin,female,Sarana_VerLin,"['celticana - a blend of rock', 'pop', 'folk',...",Sarana_VerLin,0
13278,Jane_Castro,female,Jane_Castro,"['pop', ' house music - latin american']",Jane_Castro,0
15232,Declan_Galbraith,male,Declan_Galbraith,"['pop', 'rock folk -rock rock -electronic ball...",Declan_Galbraith,0


In [49]:
data.genre.loc[11961]

"['celticana - a blend of rock', 'pop', 'folk', 'celtic and americana']"

In [50]:
data.genre.at[11961] = "['celticana']"

In [51]:
data.genre.loc[13278]

"['pop', ' house music - latin american']"

In [52]:
data.genre.at[13278] = "['pop', 'house music']"

In [53]:
data.genre.loc[15232]

"['pop', 'rock folk -rock rock -electronic ballad']"

In [54]:
data.genre.at[15232] = "['pop', 'rock', 'folk-rock', rock-electronic, ballad']"

In [55]:
data[data.genre.str.contains(r'hillbilly')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
6897,Webb_Pierce,male,Webb_Pierce,"['country', 'honky-tonk', 'western swing', 'co...",Webb_Pierce,0
7743,Boyden_Carpenter,male,Boyden_Carpenter,"['bluegrass', 'bluegrass gospel', 'hillbilly']",Boyden_Carpenter,0
8854,Harvie_June_Van,female,Harvie_June_Van,"['country', 'hillbilly']",Harvie_June_Van,0
15556,Bill_Holford,male,Bill_Holford,"['cajun', 'country', ' hillbilly rockabilly r&b']",Bill_Holford,0


In [56]:
data.genre.loc[15556]

"['cajun', 'country', ' hillbilly rockabilly r&b']"

In [57]:
data.genre.at[15556] = "['cajun', 'country', ' hillbilly, rockabilly, r&b']"

In [58]:
data[data.genre.str.contains(r'r & b')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match
2599,Larissa_Lam,female,Larissa_Lam,"['dance-pop', 'r & b', 'jazz']",Larissa_Lam,0
6853,Nick_Clemons,male,Nick_Clemons,"['rock', 'funk', 'r & b']",Nick_Clemons,0
14153,A._J._Ghent,male,A._J._Ghent,"['funk', 'soul', 'jam', 'gospel', 'sacred stee...",A._J._Ghent,0
14654,Thomasina_Winslow,female,Thomasina_Winslow,"['blues soul r & b gospel funk folk', 'african...",Thomasina_Winslow,0
16319,Jeff_Pehrson,male,Jeff_Pehrson,"['rock', 'psychedelic rock', 'r & b', 'folk ro...",Jeff_Pehrson,0


In [59]:
data.genre[data.genre.str.contains('active')]

11935    ['rock', " active rock country jazz standards ...
Name: genre, dtype: object

In [60]:
data.genre.loc[11935]

'[\'rock\', " active rock country jazz standards children\'s folk"]'

In [61]:
data.genre.at[11935] = '[rock, active rock, country, jazz standards, childrens folk]'

In [62]:
data.genre[data.genre.str.contains('pitbash')]

13009    ['punk rock', 'art rock', 'jewish rock', 'gara...
Name: genre, dtype: object

In [63]:
data.genre.loc[13009]

'[\'punk rock\', \'art rock\', \'jewish rock\', \'garage punk\', \'obscuro\', \'metal\', \'marching band\', \'jewish music\', \'jazz\', \' pitbash jewish punk "thrash opera"\']'

In [64]:
data.at[13009, 'genre'] = "['punk rock', 'art rock', 'jewish rock', 'garage punk', 'obscuro', 'metal','marching band', 'jewish music', 'jazz',  'pitbash', 'jewish punk', 'thrash', 'opera']"

Remove 'earlier:'

In [65]:
data.genre.loc[11170]

"['aor', 'pop rock', '; earlier:', 'pop', 'disco', 'soul']"

In [66]:
data.at[11170,'genre'] = "['aor', 'pop_rock', 'pop', 'disco', 'soul']"

In [67]:
data.genre.loc[3535]

"['praise &', 'worship']"

In [68]:
data.at[3535,'genre'] = "['praise&worship']"

In [69]:
pd.set_option('display.max_rows', None)
pd.options.display.max_rows
#data.genre[data.genre.str.contains(r"\balt\b")]

In [70]:
data.loc[22536]

artist                                          Billy_Connolly
gender                                                    male
retrieved                                       Billy_Connolly
genre              ['observational', 'blue', 'musical comedy']
retrieved_clean                                 Billy_Connolly
match                                                        0
Name: 22536, dtype: object

In [71]:
data.at[22536, 'genre'] = "['observational', 'blue comedy', 'musical_comedy']"

In [72]:
data.loc[22536]

artist                                                Billy_Connolly
gender                                                          male
retrieved                                             Billy_Connolly
genre              ['observational', 'blue comedy', 'musical_come...
retrieved_clean                                       Billy_Connolly
match                                                              0
Name: 22536, dtype: object

In [73]:
data.genre.loc[12183]

"['jazz', ' blue jump blues rock']"

In [74]:
data.at[12183, 'genre'] = "['jazz', 'blues', 'jump_blues', 'rock']"

In [75]:
data.genre.loc[12183]

"['jazz', 'blues', 'jump_blues', 'rock']"

In [76]:
data.genre.loc[9679]

"['singer-songwriter', 'world beat', 'alternative pop', 'lounge', 'electronic', 'world music', 'indie pop', 'j-synth', 'cool', 'fusion', 'electro-pop']"

In [77]:
data.at[9679, 'genre'] = "['singer-songwriter', 'world beat', 'alternative pop', 'lounge', 'electronic', 'world music', 'indie pop', 'j-synth', 'cool jazz', 'fusion', 'electro-pop']"

In [78]:
data.genre.loc[9679]

"['singer-songwriter', 'world beat', 'alternative pop', 'lounge', 'electronic', 'world music', 'indie pop', 'j-synth', 'cool jazz', 'fusion', 'electro-pop']"

In [79]:
data.genre.loc[2374]

"['blues', 'ballads and rock & roll']"

In [80]:
data.at[2374, 'genre'] = "['blues', 'ballads', 'rock & roll']"

In [81]:
data.genre.loc[2374]

"['blues', 'ballads', 'rock & roll']"

In [82]:
data.genre.loc[7946]

"['bhangramuffin', 'reggae fusion', 'eurodance']"

In [83]:
data.at[7946, 'genre'] = "['dancehall', 'reggae fusion', 'eurodance']"

In [84]:
data.genre.loc[7946]

"['dancehall', 'reggae fusion', 'eurodance']"

In [85]:
data.genre.loc[12500]

"['classical and folk']"

In [86]:
data.at[12500, 'genre'] = "['classical', 'folk']"

In [87]:
data.genre.loc[12500]

"['classical', 'folk']"

In [88]:
data.genre.loc[9732]

"['pop', 'r&b', ' electropop alternative pop']"

In [89]:
data.at[9732, 'genre'] = "['pop', 'r&b', ' electropop', 'alternative pop']"

In [90]:
data.genre.loc[19082]

"['pop', 'r&b', ' electropop alternative pop']"

In [91]:
data.at[19082, 'genre'] = "['pop', 'r&b', ' electropop', 'alternative pop']"

In [92]:
data.genre.loc[3843]

"['gospel and gospel blues']"

In [93]:
data.at[3843, 'genre'] = "['gospel','gospel blues']"

In [94]:
data.genre.loc[3843]

"['gospel','gospel blues']"

In [95]:
data.genre.loc[17494]

"['lo-fi-diy-indie-pop-core', 'beat poetry']"

In [96]:
data.at[17494, 'genre'] = "['lo-fi','diy','indie','popcore', 'beat poetry']"

In [97]:
data.genre.loc[17494]

"['lo-fi','diy','indie','popcore', 'beat poetry']"

In [98]:
data.genre.loc[10414]

"['cabaret', ' pop contemporary opera']"

In [99]:
data.at[10414, 'genre'] = "['cabaret', 'pop', 'contemporary opera']"

In [100]:
data.genre.loc[10414]

"['cabaret', 'pop', 'contemporary opera']"

In [101]:
data.genre.loc[14600]

"['rock', 'pop', 'jazz', 'r&b', 'country', ' streets—blues roots']"

In [102]:
data.at[14600, 'genre'] = "['rock', 'pop', 'jazz', 'r&b', 'country', ' blues', 'roots']"

In [103]:
data.artist.loc[14600]

'Doug_Pettibone'

In [104]:
data.genre.loc[10022]

"['indie-alternative rock', 'pop punk']"

In [105]:
data.at[10022, 'genre'] = "['indie','alternative rock', 'pop punk']"

In [106]:
data.genre.loc[10022]

"['indie','alternative rock', 'pop punk']"

In [107]:
data.genre.loc[13433]

"['blues', 'roots', ' rock and roll americana rhythm and blues alternative']"

In [108]:
data.at[13433, 'genre'] = "['blues', 'roots', ' rock and roll', 'americana', 'rhythm and blues', 'alternative']"

In [109]:
data.genre.loc[13433]

"['blues', 'roots', ' rock and roll', 'americana', 'rhythm and blues', 'alternative']"

In [110]:
data.genre.loc[2295]

"['r&b', ' jazz funk rock']"

In [111]:
data.at[2295, 'genre'] = "['r&b', ' jazz', 'funk', 'rock']"

In [112]:
data.genre.loc[2295]

"['r&b', ' jazz', 'funk', 'rock']"

In [113]:
data.genre.loc[3017]

"['jazz fusion', ' jazz funk bluegrass pop']"

In [114]:
data.at[3017, 'genre'] = "['jazz fusion', 'jazz funk', 'bluegrass pop']"

In [115]:
data.genre.loc[3017]

"['jazz fusion', 'jazz funk', 'bluegrass pop']"

In [116]:
data.genre.loc[18667]

"['jazz', 'latin-jazz pop']"

In [117]:
data.at[18667, 'genre'] = "['jazz', 'latin-jazz', 'pop']"

In [118]:
data.genre.loc[18667]

"['jazz', 'latin-jazz', 'pop']"

## Spelling normalizations:

Make genre label spelling replacements: (and, &, n, 'n, 'n')

- rock and roll
- rhythm and blues
- rhythm and grime
- country and western
- RnB

- [x] Tom will supply further normalizations

Now we use the list of issues flagged by Tom to augment the respelling.

In [36]:
data.loc[18410]

artist                          Fanny_J
gender                           female
retrieved                       Fanny_J
genre              ['zouk', '*', 'r&b']
retrieved_clean                 Fanny_J
match                                 0
Name: 18410, dtype: object

# Create a dictionary of {regex: respell} and iterate through here instead of explicit list

In [37]:
"""This function takes normalizes genre spellings."""
def genrespelling(string):
    string = re.sub(r'r & b','r&b', string) 
    string = re.sub(r'rhythm\s{0,1}(and|&)\s{0,1}blues','r&b', string) 
    string = re.sub(r'rhythm and grime','r&g', string) 
    string = re.sub("electronic dance music", "edm", string)
    string = re.sub(r'country\s{0,1}(and|&)\s{0,1}western','country&western', string) 
    string = re.sub(r"rock[\w. &''-]{0,5}roll",'rock&roll', string) 
    string = re.sub(r"r.{0,1}n.{0,1}b","r&b", string)
    string = re.sub(r"hip.{0,1}hop","hip-hop", string)
    string = re.sub(r"hip.{0,1}house","hip-house", string)
    string = re.sub(r"adult","", string)
    string = re.sub(r"afrobeats","afrobeat", string)
    string = re.sub(r"boleros","bolero", string)
    string = re.sub(r"musicals","musical", string)
    string = re.sub(r"neo_souls","neo_soul", string)
    string = re.sub(r"protest_songs","protest_song", string)
    string = re.sub(r"spirituals","spiritual", string)
    string = re.sub(r"television_scores","television_score", string)
    string = re.sub(r"show tune","show_tunes", string)
    string = re.sub(r"showtunes","show_tunes", string)
    string = re.sub(r"showtunes adult contemporary", "show_tunes, adult_contemporary", string)
    string = re.sub(r"ballad\b","ballads", string)
    string = re.sub(r"soundtracks","soundtrack", string)
    string = re.sub(r"afropop","afro-pop", string)
    
    string = re.sub(r"alt/rock","alternative-rock", string)
    string = re.sub(r"alt.\s{0,1}country","alternative-country", string)
    string = re.sub(r"\balt-","alternative-", string)
    string = re.sub(r"\balt\b","alternative", string)
    string = re.sub(r"alternative ","alternative-", string)
    
    string = re.sub(r"antifolk","anti-folk", string)
    
    string = re.sub(r"avant(-|\s)pop","avant-garde_pop", string)
    string = re.sub(r"avant-rock","avant-garde_rock", string)
    string = re.sub(r"avant-prog","avant-garde_prog", string)
    string = re.sub(r"avant\s{0,1}garde","avant-garde", string)
    string = re.sub(r"\bavant\s\b","avant-garde", string)
    string = re.sub(r"\bavant[^-]","avant-garde", string)
    
    string = re.sub(r"avantgarde","avant-garde", string)
    
    string = re.sub(r"balladeer","ballads", string)
    string = re.sub(r"bossanova","bossa_nova", string)
    string = re.sub(r"brazilian {0,1}music","brazilian", string)
    string = re.sub(r"broadway musicals{0,1}","broadway", string)
    string = re.sub(r"broadway music","broadway", string)
    string = re.sub(r"broadway theatre","broadway", string)
    string = re.sub(r"broadway theatre","broadway", string)
    
    string = re.sub(r"breton singing","breton", string)
    string = re.sub(r"canterbury scene","canterbury_sound", string)
    string = re.sub(r"chansonnier","chanson", string)
    string = re.sub(r"children.{0,2} songs","childrens", string)
    string = re.sub(r"chill-out","chillout", string)
    string = re.sub(r"christian and gospel","christian, gospel", string)
    string = re.sub(r"christian & gospel","christian, gospel", string)
    
    string = re.sub(r"citation needed","", string)
    string = re.sub(r"clarification needed","", string)
    
    string = re.sub(r"concerts","concert", string)
    string = re.sub(r"cpop","c-pop", string)
    string = re.sub(r"","", string)
    string = re.sub(r"crooning","crooner", string)
    string = re.sub(r"darkwave","dark_wave", string)
    string = re.sub(r"downtempo","down_tempo", string)
    string = re.sub(r"dreampop","dream_pop", string)
    string = re.sub(r"drum\s{0,1}(and|&)\s{0,1}bass","drum&bass", string)
    string = re.sub(r"electroacoustic","electro-acoustic", string)
    string = re.sub(r"electropop","electro pop", string)
    string = re.sub(r"electro\s{0,1}pop\s{0,1}alternative\spop","electro_pop, alternative_pop", string)
    string = re.sub(r"electro-pop dance-rock","electro_pop, dance_rock", string)
    string = re.sub(r"electropunk","electro_punk", string)
    string = re.sub(r"experimental & brazilian jazz","experimental, brazilian_jazz", string)
    string = re.sub(r"expressionist","expressionism", string)
    
    string = re.sub(r"film scores","film", string)
    string = re.sub(r"film score","film", string)
    string = re.sub(r"film soundtrack","film", string)
    
    string = re.sub(r"fingerstyle_and_classical_guitar","fingerstyle, classical_guitar", string)
    string = re.sub(r"folk and country","folk, country", string)
    string = re.sub(r"folk rock folk pop","folk_rock, folk_pop", string)
    
    string = re.sub(r"free improv\b","free_improvisation", string)
    string = re.sub(r"freestyling","freestyle", string)
    string = re.sub(r"french variÃ©tÃ©","french variety", string)
    string = re.sub(r"french variété","french variety", string)
    string = re.sub(r"french varieties","french variety", string)
    
    string = re.sub(r"futurepop","future_pop", string)
    string = re.sub(r"gospel_and_gospel_blues","gospel, gospel_blues", string)
    string = re.sub(r"hard core","hardcore", string)
    string = re.sub(r"hawaii","hawaiian", string)
    string = re.sub(r"hymnal","hymns", string)

    string = re.sub(r"hip-hop_soulhip-hop, soul","hip-hop, soul", string)
    string = re.sub(r"indipop","indie_pop", string)
    string = re.sub(r"indiepop","indie_pop", string)
    string = re.sub(r"lo fi","lo-fi", string)
    string = re.sub(r"lofi","lo-fi", string)
    string = re.sub(r"mellow_&_acoustic_rock","mellow, acoustic_rock", string)
    string = re.sub(r"minimalist","minimalism", string)
    string = re.sub(r"mor","middle_of_the_road", string)
    string = re.sub(r"motown sound","motown", string)
    
    string = re.sub(r"music pop rock","pop_rock", string)
    string = re.sub(r"musical theater","musical", string)
    string = re.sub(r"musical theatre","musical", string)
    string = re.sub(r"musical theatre pop","musical_pop", string)
    string = re.sub(r"musicals","musical", string)
    string = re.sub(r"music-jewish liturgy","jewish_liturgy", string)
    string = re.sub(r"musique concrÃ¨te","musique_concrete", string)
    string = re.sub(r"musique concrÃ©te","musique_concrete", string)
    string = re.sub(r"musique concrète","musique_concrete", string)
    string = re.sub(r"musique concréte","musique_concrete", string)
    
    string = re.sub(r"neo souls", "neo_soul", string)
    string = re.sub(r"neo-cla", "neocla", string)
    
    string = re.sub(r"neofolk", "neo_folk", string)
    string = re.sub(r"neo-prog\b", "neo-progressive_rock", string)
    string = re.sub(r"prog.{0,1}rock", "progressive rock", string)
    string = re.sub(r"neotraditionalist country", "neotraditional country", string)
    string = re.sub(r"pacific northwest hip-hop", "pacific_northwest_hip-hop", string)
    string = re.sub(r"\snorthwest hip-hop", "pacific_northwest_hip-hop", string)
    string = re.sub(r"hip-hop music in the pacific northwest","pacific_northwest_hip-hop", string)
    string = re.sub(r"old-school", "old school", string)
    string = re.sub(r"old school rap", "old_school_hip-hop", string)
    
    string = re.sub(r"oldtime", "old-time", string)
    string = re.sub(r"old-timey", "old-time", string)
    string = re.sub(r"opera\s(&|and)\smusical theatre", "opera, musical", string)
    string = re.sub(r"opera and comic opera", "opera, comic_opera", string)
    string = re.sub(r"opera arias", "opera", string)
    string = re.sub(r"operatic", "opera", string)

    string = re.sub(r"pitbash jewish punk thrash opera", "jewish_punk, thrash, opera", string)
    string = re.sub(r"pbr&b", "alternative r&b", string)
    string = re.sub(r"pop  dance", "pop_dance", string)
    string = re.sub(r"pop dance rock jazz", "pop, dance, rock, jazz", string)
    string = re.sub(r"pop edm hip-hop", "pop, edm, hip-hop", string)
    string = re.sub(r"pop folksinger songwriter", "pop_folk, singer-songwriter", string)
    string = re.sub(r"pop rock dance", "pop, rock, dance", string)
    string = re.sub(r"pop rock soul", "pop, rock, soul", string)
    string = re.sub(r"pop traditional pop", "pop, traditional_pop", string)
    
    string = re.sub(r"poprock", "pop_rock", string)
    string = re.sub(r"backing", "", string)
    string = re.sub(r"post ", "post-", string)
    string = re.sub(r"powerpop", "power_pop", string)
    string = re.sub(r"praise & worship", "praise&worship", string)
    string = re.sub(r"protest songs", "protest song", string)
    string = re.sub(r"proto punk", "proto-punk", string)
    string = re.sub(r"protopunk", "proto-punk", string)
    string = re.sub(r"singer[ -/]{0,1}songwriter", "singer-songwriter", string)
    
    string = re.sub(r"psych ", "psychedelic ", string)
    string = re.sub(r"psychedelia", "psychedelic", string)
    
    string = re.sub(r"punk-{0,1}rock", "punk_rock", string)
    string = re.sub(r"r&b soul dance", "r&b, soul, dance", string)
    string = re.sub(r"reggae cultural influence", "reggae", string)
    string = re.sub(r"revival punk psycho blues", "revival_punk, psycho_blues", string)
    string = re.sub(r"rock&roll_americana_rhythm_and_blues_alternative", "rock&roll, americana, rhythm_and_blues, alternative", string)
    string = re.sub(r"rock&roll blues", "rock&roll, blues", string)
    string = re.sub(r"sea shanty", "sea_shanties", string)
    string = re.sub(r"shoegazing", "shoe gaze", string)
    string = re.sub(r"singer-songwriter rock", "singer-songwriter, rock", string)

    string = re.sub(r"soundtracks", "soundtrack", string)
    string = re.sub(r"spirituals", "spiritual", string)
    string = re.sub(r"surreal humour", "surreal_humor", string)
    string = re.sub(r"synthpop", "synth_pop", string)
    string = re.sub(r"synthpunk", "synth_punk", string)
    string = re.sub(r"television scores", "television_score", string)
    string = re.sub(r"the motown sound", "the_motown_sound", string)
    string = re.sub(r"theatre", "theater", string)
    string = re.sub(r"theatre performer", "theater", string)
    
    string = re.sub(r"torch singer", "torch", string)
    string = re.sub(r"torch songs{0,1}", "torch", string)
    string = re.sub(r"trad\b", "traditional", string)
    string = re.sub(r"traditional irish early", "traditional_irish", string)
    string = re.sub(r"trance-blues r&b", "trance-blues, r&b", string)
    string = re.sub(r"various styles", "various", string)
    string = re.sub(r"vaudevillian", "vaudeville", string)
    string = re.sub(r"western movies", "western films", string)
    string = re.sub(r"with", "", string)
    string = re.sub(r"with electronics", "electronics", string)
    string = re.sub(r"world music deep-house quiet storm", "world, deep-house, quiet_storm", string)
    string = re.sub(r"world music folk world jazz", "world, folk, worl_jazz", string)
    string = re.sub(r"worldbeat", "world_beat", string)
    string = re.sub(r"yéyé", "yé-yé", string)
    string = re.sub(r"yodelling", "yodeling", string)
    
    string = re.sub(r"alternative.{0,1}rock.{0,1}garage.{0,1}rock", "alternative_rock,garage_rock", string)
    string = re.sub(r"americana folk alternative country garage rock", "americana,folk, alternative_country,garage_rock", string)
    string = re.sub(r"acoustic rock folk rock", "acoustic_rock,folk_rock", string)
    string = re.sub(r"americana.{0,1}folk.{0,1}alternative.{0,1}country.{0,1}garage.{0,1}rock", "americana,folk,alternative_country,garage_rock", string)
    string = re.sub(r"'blues soul r & b gospel funk folk', 'african american music'", "blues,soul,r&b,gospel,funk,folk,african_american", string)
    string = re.sub(r"jazz funk bluegrass pop", "jazz,funk,bluegrass,pop", string)
    
    string = re.sub(r" music\b", "", string)
    string = re.sub(" songs", "", string)
    string = re.sub(r"\bsinger[^- ]", "", string)
    return string

In [38]:
data['genre_respell']= data['genre'].apply(genrespelling)

In [121]:
data.genre[data.genre.str.contains(r"northwest")]

827                       ['hip hop', 'northwest hip hop']
3613                      ['hip hop', 'northwest hip hop']
19450    ['hip hop', 'west coast hip hop', 'hip hop mus...
23139    ['hip hop', 'alternative hip hop', 'pacific no...
Name: genre, dtype: object

In [122]:
data.genre_respell[data.genre_respell.str.contains(r"\bhip[^-]")]

8471    ['pop', 'hip', 'hop', 'r&b']
Name: genre_respell, dtype: object

In [123]:
data.genre.loc[8471]

"['pop', 'hip', 'hop', 'r&b']"

In [124]:
data.genre_respell.loc[14654]

"['blues soul r&b gospel funk folk', 'african american']"

## General String Cleaning

The following function will systematically reformat the genre lists to a great extent. After applying it, there are still a few cases that are dealt with individually and some that are dropped.

In [42]:
"""This function takes in a string of the form
appearing in the genrelist of the dataframe.
It strips the square brackets and extra quotes and
returns a list of strings where each string is a genre label.
It also removes strings in parentheses and removes \( or \) that are isolated.
It replaces 'singer/songwriter' with 'singer-songwriter' and replaces forward slashes with commas."""

def genrelist(string):
    string = string.strip("[").strip("]").replace("'","").replace('"',"") \
    .replace("/",",").replace("·",",") \
    .replace(r";",",").replace(r"|",",").replace(u"\xa0",u" ")\
    .replace(u"\\xa0",u" ")\
    .replace(r"\n",",")
    L = [s for s in string.split(',')]
    L_new = []
    for x in L:
        x = re.sub(r"\(.*?\)", "", x) 
        x = re.sub(r"\(", "", x) 
        x = re.sub(r"\)", "", x) 
        x = re.sub(r":", "", x)
        x = re.sub(r"\.", "", x)
        x = re.sub(r"\]", "", x)
        x = re.sub(r"\[", "", x)
        x = x.replace(" ","_").lstrip("_").rstrip("_").lstrip("-").rstrip("-")
        x = re.sub(r"\band_{0,1}", "", x)
        x = re.sub(r"_music\b", "", x)
        x = re.sub(r"_musician\b", "", x)
        x = re.sub(r"_with\b", "", x)
        x = re.sub(r"-", "_", x)
        x = re.sub(r"\*","", x)
        L_new.append(x)
    while (str("") in L_new):
        L_new.remove("")
    return L_new

In [43]:
data['genrelist']= data['genre_respell'].apply(genrelist)

In [44]:
data.loc[18410]

artist                          Fanny_J
gender                           female
retrieved                       Fanny_J
genre              ['zouk', '*', 'r&b']
retrieved_clean                 Fanny_J
match                                 0
genre_respell      ['zouk', '*', 'r&b']
genrelist                   [zouk, r&b]
Name: 18410, dtype: object

We can still use 'genre' column for str.contains searches. For example:

In [127]:
data.genre[data.genre.str.contains("reggae\.")]

12100    ['funk', 'blues', 'swamp', 'soul', 'reggae.']
Name: genre, dtype: object

In [128]:
data.genrelist[data.genre.str.contains("reggae\.")]

12100    [funk, blues, swamp, soul, reggae]
Name: genrelist, dtype: object

### There are a few more issues to fix by hand:

In [129]:
data.genre.loc[14654]

"['blues soul r & b gospel funk folk', 'african american music']"

In [130]:
data.at[14654, 'genrelist'] = ['blues','soul','r&b','gospel','funk','folk','african_american']

In [131]:
data.genrelist.loc[14654]

['blues', 'soul', 'r&b', 'gospel', 'funk', 'folk', 'african_american']

In [132]:
data.genre.loc[8471]

"['pop', 'hip', 'hop', 'r&b']"

In [133]:
data.at[8471, 'genrelist'] = ['pop','hip_hop','r&b']

In [134]:
data.genre.loc[8471]

"['pop', 'hip', 'hop', 'r&b']"

Two entries with .mw-parser... -- Fix by hand.

In [135]:
data.genrelist[data.genre[data.genre.str.contains(r'\.mw-.*')].index]

2416    [mw_parser_output_divcolumns_2_divcolumn{float...
7861    [mw_parser_output_divcolumns_2_divcolumn{float...
Name: genrelist, dtype: object

In [136]:
data.genrelist.loc[2416]

['mw_parser_output_divcolumns_2_divcolumn{floatleft',
 'width50%',
 'min_width300px}mw_parser_output_divcolumns_3_divcolumn{floatleft',
 'width333%',
 'min_width200px}mw_parser_output_divcolumns_4_divcolumn{floatleft',
 'width25%',
 'min_width150px}mw_parser_output_divcolumns_5_divcolumn{floatleft',
 'width20%',
 'min_width120px}',
 'disco',
 'funk',
 'electric',
 'latin_soul']

In [137]:
data.at[2416,'genrelist'] = ['disco', 'funk', 'electric', 'latin_soul']

In [138]:
data.genrelist.loc[7861]

['mw_parser_output_divcolumns_2_divcolumn{floatleft',
 'width50%',
 'min_width300px}mw_parser_output_divcolumns_3_divcolumn{floatleft',
 'width333%',
 'min_width200px}mw_parser_output_divcolumns_4_divcolumn{floatleft',
 'width25%',
 'min_width150px}mw_parser_output_divcolumns_5_divcolumn{floatleft',
 'width20%',
 'min_width120px}',
 'torch']

In [139]:
data.at[7861,'genrelist'] = ['torch_song']

Let's look at the list of genre labels that we have just dealt with in detail:

In [140]:
trouble_index = data.genre[data.genre.str.contains(r'[;)(/\\]')].index

In [141]:
trouble = data.loc[trouble_index]

In [142]:
trouble.head()

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match,genre_respell,genrelist
254,Eddy_Oh,male,Eddy_Oh,"['k-pop', ';', 'hip hop', ';', 'dance']",Eddy_Oh,0,"['k-pop', ';', 'hip-hop', ';', 'dance']","[k_pop, hip_hop, dance]"
331,Jillian_Wheeler,female,Jillian_Wheeler,['indie rock/pop/electronic'],Jillian_Wheeler,0,['indie rock/pop/electronic'],"[indie_rock, pop, electronic]"
349,Caleb_Shomo,male,Caleb_Shomo,"['metalcore', 'electronic', 'electropop', 'har...",Caleb_Shomo,0,"['metalcore', 'electronic', 'electro pop', 'ha...","[metalcore, electronic, electro_pop, hardcore_..."
503,Amy_Black,female,Amy_Black_(singer),['roots/blues/soul'],Amy_Black,0,['roots/blues/soul'],"[roots, blues, soul]"
646,Gabriel_Wilson,male,Gabriel_Wilson,"['christian & gospel', 'independent\nsinger/so...",Gabriel_Wilson,0,"['christian, gospel', 'independent\nsinger-son...","[christian, gospel, independent, singer_songwr..."


In [143]:
trouble_genre_list = trouble.genrelist.values.tolist()
trouble_genre_list = [x for y in trouble_genre_list for x in y]
trouble_genre_list = list(set(trouble_genre_list))

In [144]:
len(trouble_genre_list)

209

We find that 'era' appears and explore further. Find that one instance exists and we remove it.

In [145]:
era = trouble.genre[trouble.genre.str.contains(r'\bera')]

In [146]:
era

7908    ['rock', 'alternative rock', 'experimental', '...
Name: genre, dtype: object

In [147]:
data.genrelist.loc[7908]

['rock',
 'alternative_rock',
 'experimental',
 'mpb',
 'progressive_rock',
 'post_punk',
 'new_wave',
 'samba_rock',
 'cuidado!',
 'era']

In [148]:
data.at[7908,'genrelist'] = ['rock',
 'alternative_rock',
 'experimental',
 'mpb',
 'progressive_rock',
 'post-punk',
 'new_wave',
 'samba_rock',
 ]

In [149]:
data.genrelist.loc[7908]

['rock',
 'alternative_rock',
 'experimental',
 'mpb',
 'progressive_rock',
 'post-punk',
 'new_wave',
 'samba_rock']

In [150]:
data.loc[22536]

artist                                                Billy_Connolly
gender                                                          male
retrieved                                             Billy_Connolly
genre              ['observational', 'blue comedy', 'musical_come...
retrieved_clean                                       Billy_Connolly
match                                                              0
genre_respell      ['observational', 'blue comedy', 'musical_come...
genrelist               [observational, blue_comedy, musical_comedy]
Name: 22536, dtype: object

We look through the list of genres that were causing trouble due to special characters:

In [151]:
n = 8
trouble_genre_list[25*n:25*(n+1)]

['aor',
 'inspirational',
 'nu_metal',
 'teen_pop',
 'rap_metal',
 'christian_metal',
 'alternative',
 'art_rock',
 'heavy_metal']

- [ ] Check that all of the problematic symbols are removed.

- [ ] Search for labels with two or more "_" as these might be multiple labels stuck together

In [152]:
data.shape

(15500, 8)

### Remove all artists with null values for genre :

In [153]:
data.isnull().sum(axis = 0)

artist             0
gender             0
retrieved          0
genre              0
retrieved_clean    0
match              0
genre_respell      0
genrelist          0
dtype: int64

In [154]:
data = data[data['genrelist'].notnull()].copy(deep = True)

In [155]:
data.isnull().sum(axis = 0)

artist             0
gender             0
retrieved          0
genre              0
retrieved_clean    0
match              0
genre_respell      0
genrelist          0
dtype: int64

In [156]:
data.shape

(15500, 8)

### Create column with length of genre lists:

In [157]:
data['genrelist_length'] = data.genrelist.apply(lambda x: len(x))

In [158]:
data.shape

(15500, 9)

### Remove artists for which the genre list is empty:

In [159]:
data[data.genrelist_length == 0]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match,genre_respell,genrelist,genrelist_length
610,Erick_Baker,male,Erick_Baker,"['clarification needed', ']']",Erick_Baker,0,"['', ']']",[],0
4024,Cupid,male,Cupid_(singer),[],Cupid,0,[],[],0
15255,Mary_Zilba,female,Mary_Zilba,[],Mary_Zilba,0,[],[],0
17487,Betty_Clooney,female,Betty_Clooney,[],Betty_Clooney,0,[],[],0


Note: Cupid, Mary Zilba, Betty Clooney have made it to this point but actually have no genres listed in wikipedia.

In [160]:
data = data[data.genrelist_length > 0].copy(deep = True)

In [161]:
data.shape

(15496, 9)

### Remove bands

In [162]:
data[data.retrieved.str.contains('band')]

Unnamed: 0,artist,gender,retrieved,genre,retrieved_clean,match,genre_respell,genrelist,genrelist_length
250,Midori,female,Midori_(band),"['punk jazz', 'noise rock', 'jazz fusion', 'ex...",Midori,0,"['punk jazz', 'noise rock', 'jazz fusion', 'ex...","[punk_jazz, noise_rock, jazz_fusion, experimen...",5
537,Sissy_Spacek,female,Sissy_Spacek_(band),['noisecore'],Sissy_Spacek,0,['noisecore'],[noisecore],1
895,Dr._Know,male,Dr._Know_(band),"['nardcore', 'crossover thrash', 'punk']",Dr._Know,0,"['nardcore', 'crossover thrash', 'punk']","[nardcore, crossover_thrash, punk]",3
2243,Atlanta,female,Atlanta_(band),['country'],Atlanta,0,['country'],[country],1
4238,Newton,male,Newton_(band),['mákina'],Newton,0,['mákina'],[mákina],1
4691,Northcote,male,Northcote_(band),"['folk rock', 'punk rock', 'post-hardcore']",Northcote,0,"['folk rock', 'punk rock', 'post-hardcore']","[folk_rock, punk_rock, post_hardcore]",3
5667,Stone,female,Stone_(band),"['thrash metal', 'progressive metal', 'speed m...",Stone,0,"['thrash metal', 'progressive metal', 'speed m...","[thrash_metal, progressive_metal, speed_metal]",3
6043,Envy,female,Envy_(band),"['post-hardcore', 'screamo', 'post-rock']",Envy,0,"['post-hardcore', 'screamo', 'post-rock']","[post_hardcore, screamo, post_rock]",3
6084,Muna,female,Muna_(band),"['dark pop', 'pop rock']",Muna,0,"['dark pop', 'pop rock']","[dark_pop, pop_rock]",2
7017,Ó,female,Ó_(band),"['indie pop', 'indie rock', 'folk', ' bedroom ...",Ó,0,"['indie pop', 'indie rock', 'folk', ' bedroom ...","[indie_pop, indie_rock, folk, bedroom_pop]",4


In [163]:
data.shape, data[~data.retrieved.str.contains('band')].shape, data[data.retrieved.str.contains('band')].shape

((15496, 9), (15470, 9), (26, 9))

In [164]:
data = data[~data.retrieved.str.contains('band')].copy(deep = True)

In [165]:
data.shape

(15470, 9)

Version with all columns:

In [166]:
data_full = data.copy(deep = True)

Remove old columns:

In [167]:
data.columns

Index(['artist', 'gender', 'retrieved', 'genre', 'retrieved_clean', 'match',
       'genre_respell', 'genrelist', 'genrelist_length'],
      dtype='object')

In [168]:
data.drop(['retrieved','genre','retrieved_clean', 'match', 'genre_respell'], axis = 1, inplace = True)

In [169]:
data.head()

Unnamed: 0,artist,gender,genrelist,genrelist_length
1,Christopher_Willits,male,"[electronic, glitch, ambient, electro_acoustic...",5
3,Shawn_Hook,male,"[pop, electronic, rock]",3
4,Steve_Poltz,male,"[pop_rock, indie_rock, folk_rock]",3
6,Marvin_Isley,male,"[r&b, funk, soul, funk_rock]",4
7,Povel_Ramel,male,[vaudeville],1


In [170]:
data.shape

(15470, 4)

## Gender

Remove any artists without gender value

In [171]:
data.gender.isnull().sum()

0

In [172]:
data.gender.unique()

array(['male', 'female'], dtype=object)

### Extracting the unique genre labels:

In [173]:
genre_list = data.genrelist.values.tolist()
genre_list = [x for y in genre_list for x in y]
genre_list = list(set(genre_list))

In [174]:
genre_list[:5]

['twee_pop',
 'chanson_française',
 'venezuelan_folk',
 'east_coast_blues',
 'power_pop']

In [175]:
len(genre_list)

1494

In [176]:
print('There are {} artists with genre and binary-gender labels.'.format(data.shape[0]))
print('There are {} unique genre labels.'.format(len(genre_list)))

There are 15470 artists with genre and binary-gender labels.
There are 1494 unique genre labels.


Check to see if there are artists with len(genrelist) = 0

In [177]:
data[data.genrelist.apply(lambda x: True if len(x) == 0 else False)]

Unnamed: 0,artist,gender,genrelist,genrelist_length


In [178]:
data.head()

Unnamed: 0,artist,gender,genrelist,genrelist_length
1,Christopher_Willits,male,"[electronic, glitch, ambient, electro_acoustic...",5
3,Shawn_Hook,male,"[pop, electronic, rock]",3
4,Steve_Poltz,male,"[pop_rock, indie_rock, folk_rock]",3
6,Marvin_Isley,male,"[r&b, funk, soul, funk_rock]",4
7,Povel_Ramel,male,[vaudeville],1


### Export full data set for further use:

In [179]:
today = datetime.today()
now = today.strftime('%Y-%m-%d-%H-%M')
#now = 'temp'

In [180]:
data.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_genres_gender_cleaned_{}.csv'.format(now))

### Export the list of genres:

In [181]:
genre_list_df = pd.DataFrame({'genre_list':genre_list})

In [182]:
genre_list_df.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))

In [183]:
%ls -lt ../../data/genre_lists/data_ready_for_model/

total 30960
-rw-r--r--  1 Daniel  staff    24591 May 11 14:34 genre_list_2020-05-11-14-34.csv
-rw-r--r--  1 Daniel  staff   951681 May 11 14:34 wiki-kaggle_genres_gender_cleaned_2020-05-11-14-34.csv
-rw-r--r--  1 Daniel  staff    25112 May  7 15:49 genre_list_2020-05-07-15-49.csv
-rw-r--r--  1 Daniel  staff  1501714 May  7 15:49 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-49.csv
-rw-r--r--  1 Daniel  staff    25150 May  7 15:47 genre_list_2020-05-07-15-47.csv
-rw-r--r--  1 Daniel  staff  1501728 May  7 15:47 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-47.csv
-rw-r--r--  1 Daniel  staff    25179 May  7 15:45 genre_list_2020-05-07-15-45.csv
-rw-r--r--  1 Daniel  staff  1501719 May  7 15:45 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-45.csv
-rw-r--r--  1 Daniel  staff    25197 May  7 15:32 genre_list_2020-05-07-15-32.csv
-rw-r--r--  1 Daniel  staff  1501722 May  7 15:32 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-32.csv
-rw-r--r--  1 Daniel  staff    25286 May  

## Viewing the genre list:

In [184]:
glist = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))

In [185]:
glist.drop(['Unnamed: 0'], axis =1, inplace = True)

In [186]:
glist = glist.sort_values('genre_list')

In [187]:
pd.set_option('display.max_rows', None)
pd.options.display.max_rows

Display the list of unique genre labels:

In [8]:
#glist

issues --all are dealt with


In [189]:
data_full.genre_respell[data_full.genre.str.contains(r"medieval.{0,1}folk")]

692    ['medieval folk rock', 'folk rock', 'hard rock...
Name: genre_respell, dtype: object

In [190]:
data.artist.loc[22047]

'Shirley_Murdock'

In [191]:
data_full.genrelist.loc[22047]

['r&b', 'soul', 'jazz_funk', 'gospel', 'smooth_soul']

In [192]:
data_full.genrelist.loc[19082]

['pop', 'r&b', 'electro_pop', 'alternative_pop']

In [193]:
genre_list_df.genre_list[genre_list_df.genre_list.str.contains(r".*_.*_.*")]

3                           east_coast_blues
14                          avant_garde_jazz
17                          bossa_nova_samba
23                          new_york_hip_hop
24                        funeral_doom_metal
63                           new_orleans_r&b
69                     childrens_book_author
73                         post_punk_revival
87                       neue_deutsche_härte
88                    alternative_guitar_pop
89                        industrial_hip_hop
101                         new_orleans_jazz
102                     new_age_instrumental
103                        christian_new_age
114                          french_pop_rock
136                          rock_en_español
179                           new_jack_swing
186                         west_coast_blues
192                   neoclassical_dark_wave
204                   mellow_&_acoustic_rock
213                              pop_hip_hop
217                        british_folk_rock
220       