This notebook builds a function such that:
- input: a genre occurring in our data set
- output: list of all artists with that genre

Run all the cells leading up to the function, then pass a genre to the function and run that cell.


This function will be turned into a web app using Streamlit for public exploration of the data set.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
#%matplotlib notebook

import re

from functools import partial

Import the cleaned data:

In [2]:
%ls -lt ../../data/genre_lists/data_ready_for_model/

total 39824
-rw-r--r--  1 Daniel  staff    21724 Jun  9 11:47 genre_label_counts_TOTAL_2020-05-18-10-06.csv
-rw-r--r--@ 1 Daniel  staff   287510 Jun  4 13:42 genre_stats.html
-rw-r--r--@ 1 Daniel  staff   911587 Jun  4 13:39 genre_set_counts.html
-rw-r--r--@ 1 Daniel  staff     1845 Jun  4 13:11 female_bias_freq500.html
-rw-r--r--@ 1 Daniel  staff     1459 Jun  4 13:11 male_bias_freq500.html
-rw-r--r--  1 Daniel  staff    73746 May 29 10:19 genre_stats.csv
-rw-r--r--  1 Daniel  staff    66235 May 21 11:00 promiscuity_table.csv
-rw-r--r--  1 Daniel  staff    57474 May 20 12:47 corpus.mm.index
-rw-r--r--  1 Daniel  staff   382436 May 20 12:47 corpus.mm
-rw-r--r--  1 Daniel  staff    49966 May 20 12:47 genre_dictionary.dict
drwxr-xr-x  5 Daniel  staff      160 May 20 10:59 logistic_model_data/
-rw-r--r--  1 Daniel  staff    10926 May 18 11:10 genre_label_non-lonely_TRAINING_2020-05-18-10-06.csv
-rw-r--r--  1 Daniel  staff     8664 May 18 11:09 genre_label_lonely_TR

In [3]:
%store -r now
now
#now = '2020-05-11-14-35'

'2020-05-18-10-06'

In [4]:
X_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now), index_col = ['artist'])
y_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now), index_col = ['artist'])
X_test = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_test_{}.csv'.format(now), index_col = ['artist'])
y_test = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_test_{}.csv'.format(now), index_col = ['artist'])

In [5]:
X_tot = pd.concat([X_train,X_test])
y_tot = pd.concat([y_train,y_test])

In [6]:
X_tot.shape, y_tot.shape

((15470, 2), (15470, 1))

In [7]:
data = y_tot.join([X_tot], how = 'outer')

In [8]:
data.head()

artist,gender,genrelist,genrelist_length
Pablo_Holman,male,"['pop', 'rock', 'emo_pop']",3
Bobby_Edwards,male,['country'],1
La_Palabra,male,"['afro_cuban_jazz', 'son_montuno', 'guaracha',...",4
Sherrick,male,"['r_and_b', 'soul']",2
Allen_Collins,male,['southern_rock'],1


In [9]:
data.shape, data.isnull().sum()

((15470, 3),
 gender              0
 genrelist           0
 genrelist_length    0
 dtype: int64)

### Genre Labels

Each value of the genrelist column is a _string_ of comma-separated genre labels. We want to convert it to a _list_ of strings.

In [10]:
def genrelist(string):
    """Convert a string of the form appearing in the genrelist
    column of the dataframe into a list of genre labels.

    Strips the square brackets and extra quotes and returns a
    list of strings where each string is a genre label.
    """
    string = string.strip("[").strip("]").replace("'", "")
    # replace spaces with underscores and trim stray underscores from each label
    labels = [s.replace(" ", "_").strip("_") for s in string.split(",")]
    # drop any empty strings left over from the split
    return [label for label in labels if label]
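
Because the stored strings look like Python list literals, they could also be parsed with the standard library. The following is only a sketch of that alternative, not the approach this notebook uses: the name parse_genrelist is hypothetical, and it assumes every value in the column is a valid list literal, which has not been checked against the full data set.

```python
import ast

def parse_genrelist(string):
    """Parse a string such as "['r_and_b', 'soul']" into a list of genre labels.

    Hypothetical alternative to genrelist; assumes the input is a
    valid Python list literal.
    """
    labels = ast.literal_eval(string)
    # Normalise like genrelist does: spaces become underscores,
    # stray underscores are trimmed, and empty labels are dropped.
    cleaned = [label.strip().replace(" ", "_").strip("_") for label in labels]
    return [label for label in cleaned if label]

example = parse_genrelist("['pop', 'rock', 'emo_pop']")
print(example)
```

Unlike the string-stripping version, this raises an error on malformed input instead of silently returning partial labels, which may or may not be desirable here.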

Now we apply it to the whole column, overwriting the strings with the resulting lists:

In [11]:
data['genrelist'] = data['genrelist'].apply(genrelist)

In [12]:
data.head()

artist,gender,genrelist,genrelist_length
Pablo_Holman,male,"[pop, rock, emo_pop]",3
Bobby_Edwards,male,[country],1
La_Palabra,male,"[afro_cuban_jazz, son_montuno, guaracha, salsa...",4
Sherrick,male,"[r_and_b, soul]",2
Allen_Collins,male,[southern_rock],1


### Import the genre labels from the whole data set:

In [13]:
genrelist_df = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now), index_col = 'Unnamed: 0')

In [14]:
genrelist_df.shape

(1494, 1)

## Co-Occurrence

The following function is applied row by row to the dataframe and returns True if the row's genre list contains a given label:

In [33]:
def artists_with_label(row, label = 'jazz'):
    # True if the artist in this row has the given genre label
    return label in row.genrelist

In [34]:
artists_with = partial(artists_with_label,label = 'pop')

In [39]:
artists_with(data.iloc[0])

True

In [54]:
label = 'rap'
artists_with = partial(artists_with_label,label = label) # create the partial function for the selected genre
data[label] = data.apply(artists_with, axis = 1) # add a boolean column marking the artists with the selected genre
data[data[label]].index.sort_values() # produce alphabetical list of artists with the selected genre

Index(['Ali_Brustofski', 'Angela_Hunte', 'Bekuh_BOOM', 'Blacko', 'Bobby_J',
       'Bradley_McIntosh', 'CL_Smooth', 'Canardo', 'Cara_Braia',
       'Cheryl_James', 'Chris_Landry', 'Chrome', 'Dante_Spinetta', 'Def_Jef',
       'Divine', 'El_Da_Sensei', 'Elin_Bergman', 'Feloni', 'JAE_E', 'Jipsta',
       'Johnny_Richter', 'Kenza_Farah', 'Koncept', 'Kristoff_Krane',
       'Liam_Cacatian_Thomassen', 'Little_Bruce', 'Loose_Logic', 'Luke_Ski',
       'M-Doc', 'Maestro_Harrell', 'Merlin_Bronques', 'Mr._Mack', 'Ménélik',
       'Omega_Crosby', 'Phil_Barney', 'Playalitical', 'Produkt', 'Rampage',
       'Rob_Pilatus', 'Rockwell_Knuckles', 'Vaï', 'Visto', 'Yarah_Bravo',
       'Young_Scrap'],
      dtype='object', name='artist')

In [55]:
def genre_artists(data, label = 'soul'):
    """Return an alphabetical index of the artists whose genrelist contains label."""
    artists_with = partial(artists_with_label,label = label) # create the partial function for the selected genre
    data[label] = data.apply(artists_with, axis = 1) # add a boolean column marking the artists with the selected genre
    return data[data[label]].index.sort_values() # produce alphabetical list of artists with the selected genre
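
Note that genre_artists stores a boolean column named after the genre on data each time it is called, so the frame gains one column per genre queried. Below is a minimal sketch of a variant that leaves the frame untouched, assuming that side effect is not needed elsewhere. The name genre_artists_mask and the toy frame are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical rows).
toy = pd.DataFrame(
    {"genrelist": [["pop", "rock"], ["country"], ["rock", "punk"]]},
    index=pd.Index(["Artist_B", "Artist_C", "Artist_A"], name="artist"),
)

def genre_artists_mask(df, label):
    """Return the sorted artist index for rows whose genrelist contains label."""
    # Build the boolean mask as a temporary Series instead of a new column.
    mask = df["genrelist"].apply(lambda genres: label in genres)
    return df[mask].index.sort_values()

rock_artists = list(genre_artists_mask(toy, "rock"))
print(rock_artists)
```

Keeping the mask local may matter for the planned web app, where many different genres could be queried against the same frame.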

In [56]:
genre_artists(data, 'punk')

Index(['Alexander_Rocciasana', 'Alfunction', 'Barry_Donegan', 'Baz_Warne',
       'Becky_Stark', 'Billy_Karren', 'Cass_McCombs', 'Chris_Bailey',
       'Chris_Clavin', 'Chris_Eskola', 'Craig_Else', 'Danny_Barnes',
       'Dave_King', 'Dave_Mello', 'Dave_Zegarac', 'David_Barbe', 'Ed_Kuepper',
       'Efrem_Schulz', 'Exene_Cervenka', 'Fred_Negro', 'Freddie_Wadling',
       'Jacquie_O'Sullivan', 'Jane_Wiedlin', 'Jay_Kalk', 'Jeff_Suffering',
       'Jim_Neversink', 'Jimmy_Rip', 'John_Lydon', 'John_Otway', 'Karen_O',
       'Kevin_Mooney', 'Kirk_Brandon', 'Lenny_Burns', 'Lew_Nottke',
       'Matt_Fishel', 'Mya_Byrne', 'Nicholas_Bullen', 'Nick_Falcon',
       'Nina_Hagen', 'Paul_Cunniffe', 'Paul_Hyde', 'Paul_Roberts',
       'Paula_Frazer', 'Preston', 'RM_Hubbert', 'Regina_Zernay_Roberts',
       'Richie_Birkenhead', 'Robb_Johnson', 'Rose_Mazzola', 'Sa'ra_Charismata',
       'Scott_H._Biram', 'Simon_Gallup', 'Taylor_Hollingsworth',
       'Terje_Winterstø_Røthing', 'Theo_Kogan', 'Tim_Steward