- [ ] The genre list produced by the script has 1493 entries; the previous list had 1494. Find the missing label.

## This notebook is used to clean and prepare the genre label data for analysis.

- [ ] switch CSV import to use dtype list
- [ ] move individual case respells to a script
- [ ] reformat respell function to take a dictionary of all the cases
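
The last to-do item could look roughly like the following. This is a minimal sketch, not the code in the genre_cleaning script: the function name `respell_genres` and the contents of `RESPELL_CASES` are hypothetical, chosen only to illustrate a respell function driven by one dictionary of cases.

```python
import re

# Hypothetical mapping from regex pattern to canonical spelling.
# The real cases live in the genre_cleaning script and may differ.
RESPELL_CASES = {
    r"hip[\s\-]+hop": "hip-hop",
    r"rock\s*(?:n|'n'|and|&)\s*roll": "rock & roll",
}


def respell_genres(label, cases=RESPELL_CASES):
    """Apply every respelling rule in `cases` to a single genre label."""
    label = label.lower().strip()
    for pattern, replacement in cases.items():
        label = re.sub(pattern, replacement, label)
    return label
```

Keeping all cases in one dictionary means a new respelling is a one-line change to the data rather than a new branch in the function.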

## Outline of Cleaning:

1. - [x] remove artists for which 'retrieved' value is 'none'
2. - [x] remove the url prefix from the retrieved artist names 
3. - [x] replace ' ' in the artist column with '_'
4. - [x] remove the '(singer)', '(rapper)', '(musician)' designation from the 'retrieved' column
5. - [x] remove the artists for which the retrieved-artist != searched-artist. 
    - inspect mismatches to look for typos and different versions
6. - [x] Deal with genre label problems
7. - [x] Normalize genre label spelling: e.g. r&b vs rhythm and blues vs rhythm & blues
    - [x] 'rock n roll' = 'rock & roll' etc, but 'rock' separate.
    - [x] remove _music
    - [x] hip hop and hip--hop -> hip-hop
    - [ ] separate lists: look at why they aren't separated
8. - [x] convert genre column values into lists of strings
9. - [x] remove bands (as opposed to individuals)
10. - [x] remove old columns
11. - [x] extract unique genres as a list
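
Steps 1 to 5 of the outline can be sketched on a toy frame as follows. The column names `artist` and `retrieved` follow the outline, but the URL prefix and the sample rows are assumptions made for illustration; the actual logic is in `clean_genre_data` and may differ.

```python
import pandas as pd

# Toy frame; the URL prefix and the rows are assumptions.
URL_PREFIX = "https://en.wikipedia.org/wiki/"
df = pd.DataFrame({
    "artist": ["Shawn Hook", "Steve Poltz", "No One"],
    "retrieved": [
        URL_PREFIX + "Shawn_Hook_(singer)",
        URL_PREFIX + "Steve_Poltz",
        "none",
    ],
})

# Step 1: drop rows where nothing was retrieved.
df = df[df["retrieved"] != "none"].copy()

# Step 2: strip the URL prefix (regex=False treats it as a literal string).
df["retrieved"] = df["retrieved"].str.replace(URL_PREFIX, "", regex=False)

# Step 3: use underscores instead of spaces in the searched artist name.
df["artist"] = df["artist"].str.replace(" ", "_", regex=False)

# Step 4: remove a trailing "(singer)", "(rapper)" or "(musician)" tag.
df["retrieved"] = df["retrieved"].str.replace(
    r"_?\((?:singer|rapper|musician)\)$", "", regex=True
)

# Step 5: keep only rows where the retrieved name matches the searched name.
matched = df[df["retrieved"] == df["artist"]]
```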

In [1]:
import numpy as np
np.random.seed(23)
import pandas as pd
import re
from datetime import datetime

In [2]:
# %ls -lt /Users/Daniel/Code/Genre/data/genre_lists/data_to_be_cleaned/

### The data set with the genders of music artists

The file singers_gender.csv is from Kaggle and lists music artists and their gender. 

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
from genre_scripts.genre_cleaning import clean_genre_data

# Caution: 

1. In respell_dict in the genre_cleaning script, 'rhythm and blues' is mapped to 'r&b', which I think is a mistake. Check this against the rules from Tom.
2. Switching from & to \_and\_ sometimes introduced a double underscore ("\_\_"), e.g. 'mellow__and__acoustic_rock'. Fix this.
3. Get the artist names (as the index) from the train/test split and use them to remake the split with the re-cleaned data.
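
One possible fix for the second caution is to collapse repeated underscores right after the substitution. The helper name `ampersand_to_and` is hypothetical, and the sketch assumes no label legitimately contains a double underscore.

```python
import re


def ampersand_to_and(label):
    """Replace '&' with '_and_' and collapse any repeated underscores."""
    label = label.replace("&", "_and_")
    # Without this step 'mellow_&_acoustic_rock' becomes
    # 'mellow__and__acoustic_rock'.
    return re.sub(r"_{2,}", "_", label)
```

With this, 'r&b' and 'mellow_&_acoustic_rock' both end up with single underscores around `and`.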

In [5]:
data = clean_genre_data()

In [6]:
data.head()

| Unnamed: 0 | artist | gender | genrelist | genrelist_length |
|---|---|---|---|---|
| 1 | Christopher_Willits | male | [electronic, glitch, ambient, electro_acoustic... | 5 |
| 3 | Shawn_Hook | male | [pop, electronic, rock] | 3 |
| 4 | Steve_Poltz | male | [pop_rock, indie_rock, folk_rock] | 3 |
| 6 | Marvin_Isley | male | [r&b, funk, soul, funk_rock] | 4 |
| 7 | Povel_Ramel | male | [vaudeville] | 1 |


### Extracting the unique genre labels:

In [7]:
# Flatten the per-artist genre lists into one list of labels.
genre_list = [genre for genres in data.genrelist for genre in genres]
# Keep each label once, in alphabetical order.
genre_list = sorted(set(genre_list))

In [8]:
genre_list[:5]

['1960s', '2_step', '2_step_garage', '2_tone', 'a_cappella']

In [9]:
len(genre_list)

1492

The previous list of genres had 1494 entries. Let's import it and compare.

In [10]:
genre_list_old = pd.read_csv('../../data/genre_lists/data_ready_for_model/genre_list_2020-05-18-10-06.csv')
genre_list_old = genre_list_old.genre_list.tolist()
genre_list_old = sorted(genre_list_old)

In [11]:
genre_set = set(genre_list)
genre_set_old = set(genre_list_old)
diff = genre_set.symmetric_difference(genre_set_old)

In [12]:
len(diff)

44

In [13]:
diff

{'alternative_r&b',
 'alternative_r_and_b',
 'british_rock&roll',
 'british_rock_and_roll',
 'c&w',
 'c_and_w',
 'christian_r&b',
 'christian_r_and_b',
 'contemporary_r&b',
 'contemporary_r_and_b',
 'country&western',
 'country_and_western',
 'cuidado!',
 'drum&bass',
 'drum_and_bass',
 'electronic_r&b',
 'electronic_r_and_b',
 'era',
 'mellow_&_acoustic_rock',
 'mellow__and__acoustic_rock',
 'new_orleans_r&b',
 'new_orleans_r_and_b',
 'opera_&_musical',
 'opera__and__musical',
 'orgasmic_r&b',
 'orgasmic_r_and_b',
 'praise&worship',
 'praise_and_worship',
 'r&b',
 'r&bang',
 'r&g',
 'r_and_b',
 'r_and_bang',
 'r_and_g',
 'rhythm_&_grime',
 'rhythm__and__grime',
 'rock&roll',
 'rock&roll_revival',
 'rock_and_roll',
 'rock_and_roll_revival',
 'screwed_&_chopped',
 'screwed__and__chopped',
 'struggle_&_protest',
 'struggle__and__protest'}

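Issues:

1. & vs. and: 42 of the 44 entries are 21 labels that each appear twice, once spelled with '&' and once with 'and' (for example 'r&b' and 'r_and_b').
2. Old - New (in the old list but not in the new one): 'era' and 'cuidado!'

The two labels in item 2 are not genres. See https://en.wikipedia.org/wiki/Lob%C3%A3o for details.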
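
To separate spelling differences from labels that are really missing, the two sets in the diff above can be normalized to one spelling before comparing, and the difference taken in each direction. This is a sketch under assumptions: `normalize` and `compare_label_sets` are hypothetical helpers, and the toy data is a small excerpt of the diff, not the full lists.

```python
def normalize(label):
    """Spell '&' as '_and_' and collapse repeated underscores."""
    label = label.replace("&", "_and_")
    while "__" in label:
        label = label.replace("__", "_")
    return label


def compare_label_sets(new_labels, old_labels):
    """Return (only_in_old, only_in_new) after normalizing both sides."""
    new_norm = {normalize(g) for g in new_labels}
    old_norm = {normalize(g) for g in old_labels}
    return sorted(old_norm - new_norm), sorted(new_norm - old_norm)


# Toy data taken from the diff above, not the full lists.
new = ["r&b", "drum&bass", "opera_&_musical"]
old = ["r_and_b", "drum_and_bass", "opera__and__musical", "era", "cuidado!"]
only_old, only_new = compare_label_sets(new, old)
print(only_old)  # ['cuidado!', 'era']
print(only_new)  # []
```

Unlike `symmetric_difference`, the two one-sided results show which list each leftover label came from.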

In [19]:
print('There are {} artists with genre and binary-gender labels.'.format(data.shape[0]))
print('There are {} unique genre labels.'.format(len(genre_list)))

There are 15470 artists with genre and binary-gender labels.
There are 1493 unique genre labels.


### Export full data set for further use:

In [179]:
today = datetime.today()
now = today.strftime('%Y-%m-%d-%H-%M')
#now = 'temp'

In [180]:
data.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_genres_gender_cleaned_{}.csv'.format(now))
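
When this file is read back, the `genrelist` column arrives as strings rather than lists, and the saved index shows up as an `Unnamed: 0` column. A minimal sketch of a round trip that avoids both, using an in-memory buffer and a toy frame in place of the real file and data:

```python
import ast
import io

import pandas as pd

# Toy frame standing in for the cleaned data.
df = pd.DataFrame(
    {"artist": ["Shawn_Hook"], "genrelist": [["pop", "electronic", "rock"]]},
    index=[3],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the original index instead of creating 'Unnamed: 0';
# the converter parses "['pop', ...]" back into a Python list.
restored = pd.read_csv(
    buffer, index_col=0, converters={"genrelist": ast.literal_eval}
)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing the stored lists.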

### Export the list of genres:

In [181]:
genre_list_df = pd.DataFrame({'genre_list':genre_list})

In [182]:
genre_list_df.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))

In [183]:
%ls -lt ../../data/genre_lists/data_ready_for_model/

total 30960
-rw-r--r--  1 Daniel  staff    24591 May 11 14:34 genre_list_2020-05-11-14-34.csv
-rw-r--r--  1 Daniel  staff   951681 May 11 14:34 wiki-kaggle_genres_gender_cleaned_2020-05-11-14-34.csv
-rw-r--r--  1 Daniel  staff    25112 May  7 15:49 genre_list_2020-05-07-15-49.csv
-rw-r--r--  1 Daniel  staff  1501714 May  7 15:49 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-49.csv
-rw-r--r--  1 Daniel  staff    25150 May  7 15:47 genre_list_2020-05-07-15-47.csv
-rw-r--r--  1 Daniel  staff  1501728 May  7 15:47 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-47.csv
-rw-r--r--  1 Daniel  staff    25179 May  7 15:45 genre_list_2020-05-07-15-45.csv
-rw-r--r--  1 Daniel  staff  1501719 May  7 15:45 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-45.csv
-rw-r--r--  1 Daniel  staff    25197 May  7 15:32 genre_list_2020-05-07-15-32.csv
-rw-r--r--  1 Daniel  staff  1501722 May  7 15:32 wiki-kaggle_genres_gender_cleaned_2020-05-07-15-32.csv
-rw-r--r--  1 Daniel  staff    25286 May  

## Viewing the genre list:

In [184]:
glist = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))

In [185]:
glist.drop(['Unnamed: 0'], axis=1, inplace=True)

In [186]:
glist = glist.sort_values('genre_list')

In [187]:
pd.set_option('display.max_rows', None)
pd.options.display.max_rows

Display the list of unique genre labels:

In [8]:
#glist

Issues: all have been dealt with.


In [189]:
data_full.genre_respell[data_full.genre.str.contains(r"medieval.{0,1}folk")]

692    ['medieval folk rock', 'folk rock', 'hard rock...
Name: genre_respell, dtype: object

In [190]:
data.artist.loc[22047]

'Shirley_Murdock'

In [191]:
data_full.genrelist.loc[22047]

['r&b', 'soul', 'jazz_funk', 'gospel', 'smooth_soul']

In [192]:
data_full.genrelist.loc[19082]

['pop', 'r&b', 'electro_pop', 'alternative_pop']

In [193]:
genre_list_df.genre_list[genre_list_df.genre_list.str.contains(r".*_.*_.*")]

3                           east_coast_blues
14                          avant_garde_jazz
17                          bossa_nova_samba
23                          new_york_hip_hop
24                        funeral_doom_metal
63                           new_orleans_r&b
69                     childrens_book_author
73                         post_punk_revival
87                       neue_deutsche_härte
88                    alternative_guitar_pop
89                        industrial_hip_hop
101                         new_orleans_jazz
102                     new_age_instrumental
103                        christian_new_age
114                          french_pop_rock
136                          rock_en_español
179                           new_jack_swing
186                         west_coast_blues
192                   neoclassical_dark_wave
204                   mellow_&_acoustic_rock
213                              pop_hip_hop
217                        british_folk_rock
220       