- [ ] The list of genres produced by the script has length 1493; before it was 1494. Find the missing labels.

## This notebook is used to clean and prepare the genre label data for analysis.

- [ ] switch CSV import to use dtype list
- [ ] move individual case respells to a script
- [ ] reformat respell function to take a dictionary of all the cases
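
A dictionary-driven respell function, as proposed in the last to-do item, might look roughly like the sketch below. The mapping entries and the names `RESPELL`, `respell`, and `respell_all` are hypothetical illustrations, not the contents of the real `respell_dict` in the `genre_cleaning` script.

```python
import re

# Hypothetical mapping from variant spellings to one canonical label.
# Illustrative entries only; the real respell_dict is not shown in this notebook.
RESPELL = {
    "rhythm_and_blues": "r_and_b",
    "rhythm_&_blues": "r_and_b",
    "r&b": "r_and_b",
    "rock_n_roll": "rock_&_roll",
    "hip_hop": "hip-hop",
    "hip--hop": "hip-hop",
}


def respell(label, mapping=RESPELL):
    """Return the canonical spelling of one genre label."""
    label = label.strip().lower().replace(" ", "_")
    # Drop a trailing "_music" suffix, e.g. "folk_music" -> "folk".
    label = re.sub(r"_music$", "", label)
    # Exact-match lookup, so plain "rock" is never rewritten to "rock_&_roll".
    return mapping.get(label, label)


def respell_all(labels, mapping=RESPELL):
    """Respell a list of labels, dropping duplicates while keeping order."""
    seen = []
    for label in labels:
        canonical = respell(label, mapping)
        if canonical not in seen:
            seen.append(canonical)
    return seen
```

Keeping every case in one dictionary means a new spelling variant becomes a one-line data change rather than another branch in the function.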

## Outline of Cleaning:

1. - [x] remove artists for which 'retrieved' value is 'none'
2. - [x] remove the url prefix from the retrieved artist names 
3. - [x] replace ' ' in the artist column with '_'
4. - [x] remove the '(singer)', '(rapper)', '(musician)' designation from the 'retrieved' column
5. - [x] remove the artists for which the retrieved-artist != searched-artist. 
    - inspect mismatches to look for typos and different versions
6. - [x] Deal with genre label problems
7. - [x] Normalize genre label spelling: e.g. r&b vs rhythm and blues vs rhythm & blues
    - [x] 'rock n roll' = 'rock & roll' etc, but 'rock' separate.
    - [x] remove _music
    - [x] hip hop and hip--hop -> hip-hop
    - [ ] separate lists: look at why they aren't separated
8. - [x] convert genre column values into lists of strings
9. - [x] remove bands (as opposed to individuals)
10. - [x] remove old columns
11. - [x] extract unique genres as a list
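
Steps 2 through 5 of the outline can be sketched as follows. This is only an illustration under stated assumptions: the URL prefix is assumed to be the English Wikipedia one, and the helper names `clean_retrieved`, `clean_searched`, and `names_match` are hypothetical; the actual logic lives in `clean_genre_data` and may differ.

```python
import re

# Assumed URL prefix; the real prefix in the scraped data is not shown here.
URL_PREFIX = "https://en.wikipedia.org/wiki/"
# Trailing role tags named in step 4, with an optional leading underscore.
ROLE_TAG = re.compile(r"_?\((singer|rapper|musician)\)$")


def clean_retrieved(name):
    """Steps 2 and 4: strip the URL prefix and any trailing role tag."""
    if name.startswith(URL_PREFIX):
        name = name[len(URL_PREFIX):]
    return ROLE_TAG.sub("", name)


def clean_searched(name):
    """Step 3: replace spaces in the searched artist name with underscores."""
    return name.strip().replace(" ", "_")


def names_match(searched, retrieved):
    """Step 5: a row is kept only when the cleaned names agree."""
    return clean_searched(searched) == clean_retrieved(retrieved)
```

Rows for which `names_match` returns `False` are the ones worth inspecting by hand for typos and alternate versions of a name.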

In [1]:
import numpy as np
np.random.seed(23)
import pandas as pd
import re
from datetime import datetime

In [2]:
# %ls -lt /Users/Daniel/Code/Genre/data/genre_lists/data_to_be_cleaned/

### The data set with the genders of music artists

The file singers_gender.csv is from Kaggle and lists music artists and their gender. 

In [3]:
# kaggle_data = pd.read_csv('singers_gender.csv', encoding = 'latin-1')

In [4]:
from genre_scripts.genre_cleaning import clean_genre_data

# Fixed issues 

1. FIXED [respell_dict in the genre_cleaning script mapped rhythm and blues --> r&b, which was a mistake. See the rules from Tom.]
2. NOT ISSUE [old list has "\__", but new does not]
4. FIXED [remove "mellow" from "mellow_&_acoustic_rock"]
5. FIXED [deal with rhythm and grime underscores]
6. FIXED ['era' 'cuidado!' These are not genres. See https://en.wikipedia.org/wiki/Lob%C3%A3o for details.]

In [5]:
data = clean_genre_data()

In [6]:
data.head()

| | artist | gender | genrelist | genrelist_length |
|---|---|---|---|---|
| 1 | Christopher_Willits | male | [electronic, glitch, ambient, electro_acoustic... | 5 |
| 3 | Shawn_Hook | male | [pop, electronic, rock] | 3 |
| 4 | Steve_Poltz | male | [pop_rock, indie_rock, folk_rock] | 3 |
| 6 | Marvin_Isley | male | [r_and_b, funk, soul, funk_rock] | 4 |
| 7 | Povel_Ramel | male | [vaudeville] | 1 |


### Extracting the unique genre labels:

In [7]:
# Flatten the per-artist genre lists, drop duplicates, and sort alphabetically.
genre_list = data.genrelist.values.tolist()
genre_list = sorted({genre for genres in genre_list for genre in genres})

In [8]:
genre_list[:5]

['1960s', '2_step', '2_step_garage', '2_tone', 'a_cappella']

In [9]:
len(genre_list)

1491

The previous list of genres had 1494 labels. Import it and compare it with the new list.

In [10]:
genre_list_old = pd.read_csv('../../data/genre_lists/data_ready_for_model/genre_list_2020-05-18-10-06.csv')
genre_list_old = genre_list_old.genre_list.tolist()
genre_list_old = sorted(genre_list_old)

In [11]:
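genre_set = set(genre_list)
genre_set_old = set(genre_list_old)
# nmo ("new minus old"): labels that appear only in the new list
nmo = genre_set.difference(genre_set_old)
# omn ("old minus new"): labels that appear only in the old list
omn = genre_set_old.difference(genre_set)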

In [12]:
len(nmo), len(omn)

(4, 7)
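
These counts are consistent with the list lengths: 1494 + 4 - 7 = 1491. To see which labels account for the difference, the two sets can be listed side by side. The helper below is a small sketch run on toy data; `compare_label_sets` is a hypothetical name and the toy labels are not the real differences.

```python
def compare_label_sets(new_labels, old_labels):
    """Return the labels unique to each side, sorted for easy reading."""
    new_set, old_set = set(new_labels), set(old_labels)
    return {
        "only_in_new": sorted(new_set - old_set),
        "only_in_old": sorted(old_set - new_set),
    }


# Toy data standing in for genre_list and genre_list_old.
diff = compare_label_sets(
    ["pop", "r_and_b", "rock"],
    ["pop", "r&b", "rock", "era"],
)
for side, labels in diff.items():
    print(side, labels)
```

In the notebook itself, the same call would be `compare_label_sets(genre_list, genre_list_old)`.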

In [13]:
print('There are {} artists with genre and binary-gender labels.'.format(data.shape[0]))
print('There are {} unique genre labels.'.format(len(genre_list)))

There are 15470 artists with genre and binary-gender labels.
There are 1491 unique genre labels.


### Export full data set for further use:

In [14]:
today = datetime.today()
now = today.strftime('%Y-%m-%d-%H-%M')
#now = 'temp'

In [15]:
data.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_genres_gender_cleaned_{}.csv'.format(now))
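
Note that `to_csv` stores each `genrelist` value as the string form of a Python list, and writes the index as an unnamed first column. The sketch below shows one way to read such a file back with real lists, using the `converters` argument of `pd.read_csv` together with `ast.literal_eval`. It uses a toy frame and an in-memory buffer rather than the real file.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({
    "artist": ["Shawn_Hook", "Povel_Ramel"],
    "genrelist": [["pop", "electronic", "rock"], ["vaudeville"]],
})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    index_col=0,  # reuse the saved index instead of getting 'Unnamed: 0'
    converters={"genrelist": ast.literal_eval},  # parse "['pop', ...]" into a list
)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing the stored lists.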

### Export the list of genres:

In [16]:
genre_list_df = pd.DataFrame({'genre_list':genre_list})

In [17]:
genre_list_df.to_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))

In [19]:
# %ls -lt ../../data/genre_lists/data_ready_for_model/