# Calculate the upper bound for accuracy of any model trained on our training data.

- Warning: I seem to have arrived a contradiction.
    - for the DNN classifier the 10-fold CV accuracy has a mean of 76% with std 1%. This is better than my supposed upperbound.
    - for the DNN classifier the training accuracy can be close to 80%. How? Is it memorizing the order? No: shuffle is True

The data is of the form $(X,y)$ with $X_i \in \left\{ 0, 1 \right\}^{\times p}$ ($p=1494$), and $y \in  \left\{ 0, 1 \right\}$. There are 12376 training samples. Let $\left\{\bar{X}_a \right\}_{1 \leq a \leq  6230}$ be unique representatives of the inputs in the training set; That is, for all $i$ there exists $a$ such that $X_i = \bar{X_a}$. For each $\bar{X}_a$ the number of female artists ($\text{fem}\left( \bar{X}_a \right)$) and male artists $\left( \text{mal}(\bar{X}_a) \right)$ with $X_i = \bar{X}_a$ are calculated. Define a classifier on the set of training data $f_0: \left\{ X_i \right\}_{i=1}^{12376} \to \left\{ 0, 1 \right\}$ as 
$$ $$
$$ f(X_i) = \text{argmax}_{\left\{ \text{male},\text{female}\right\}} \left\{ \text{mal}(\bar{X}_a), \text{fem}(\bar{X}_a)\right\} \; \text{if} \; X_i = \bar{X}_a$$
$$ $$
Then extend $f_0$ to $f: \left\{ 0, 1 \right\}^{\times p} \to \left\{ 0, 1 \right\}$. When $f$ is only used on the training data, the extension from $f_0$ to $f$ is irrevelant, and $f_0$ gives rise to an optimal classifier. However, to generalize to data which includes points in $\left\{ 0, 1 \right\}^{\times p}$ that were not in the training set, a rule is needed to make the extension.

This notebook shows that even on the training data $f_0$ has an expected error of 26.8%, or an accuracy of 73.2%.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

Import the cleaned data:

In [2]:
#%ls -lt ../../data/genre_lists/data_ready_for_model/

In [3]:
%store -r now
now
#now = '2020-05-11-14-35'

'2020-05-18-10-06'

In [4]:
X_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now), index_col = ['artist'])
y_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now), index_col = ['artist'])

### Genre Labels -- as a set

Each value of the genre column is a _string_ of comma separated genre labels using the spotify abbreviations. We want to convert it to a _set_ of strings.

In [5]:
"""This function takes in a string of the form
appearing in the genrelist of the dataframe.
It strips the square brackets and extra quotes and
returns a list of strings where each string is a genre label."""
def genrelist(string):
    string = string.strip("[").strip("]").replace("'","")
    L = [s for s in string.split(',')]
    L_new = []
    for x in L:
        L_new.append(x.replace(" ","_").lstrip("_").rstrip("_"))
    while (str("") in L_new):
        L_new.remove("")
    return set(L_new)

Now we apply
- it to the whole column and put the lists in a new column
- assemble X,y into DF
-reset index to 'artist_id'

In [6]:
X_train['genre_set']= X_train['genrelist'].apply(genrelist)

data = X_train.join(y_train, how = 'inner', on = 'artist')
data.drop(['genrelist'], axis = 1, inplace = True)
data.reset_index(inplace = True)
data.index.name = 'artist_id'
data_set_size = data.shape[0]

In [7]:
data.head()

Unnamed: 0_level_0,artist,genrelist_length,genre_set,gender
artist_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Pablo_Holman,3,"{pop, emo_pop, rock}",male
1,Bobby_Edwards,1,{country},male
2,La_Palabra,4,"{son_montuno, salsa_romántica, guaracha, afro_...",male
3,Sherrick,2,"{soul, r_and_b}",male
4,Allen_Collins,1,{southern_rock},male


- Set IDs for Full genre_list as index (not just that for the training set)
- Vocab Dict and Size-- {label: id}
- max length of lists

In [8]:
genre_list = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))
genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)
genre_list['genre_id'] = list(range(1,genre_list.shape[0]+1))

#Size of the vocab:
vocab_size = genre_list.shape[0]

#Create a dictionary {genre_label: genre_id}
temp = genre_list.set_index(['genre_list'])
label_id_dict = temp['genre_id'].to_dict()

# set genre_id to index
genre_list.set_index(['genre_id'], inplace = True)

#Find max length of genre lists:
max_list_length = data.genrelist_length.max()

In [9]:
genre_list

Unnamed: 0_level_0,genre_list
genre_id,Unnamed: 1_level_1
1,chilean
2,zamba
3,afro_punk_blues
4,crunk
5,spanish_guitar
...,...
1490,british_rock
1491,funeral_doom_metal
1492,blues_soul
1493,mainstream


In [10]:
# encode labels as ints within the list
def encode_list(row):
    return {label_id_dict[item] for item in row.genre_set}

data['genre_set_encoded'] = data.apply(encode_list, axis = 1)

#Check that the encoding is consistent: 
# n = np.random.randint(data.shape[0])
# [label_id_dict[item] for item in data.genrelist.iloc[n]], data.genres_encoded_as_list.iloc[n]

# Encode targets. The categories still appear as strings. To see the encoding use df.column.cat.codes.
data['gender'] = data.gender.apply(lambda x: 1 if x == 'female' else 0)

In [11]:
data.head()

Unnamed: 0_level_0,artist,genrelist_length,genre_set,gender,genre_set_encoded
artist_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,Pablo_Holman,3,"{pop, emo_pop, rock}",0,"{1431, 794, 1007}"
1,Bobby_Edwards,1,{country},0,{465}
2,La_Palabra,4,"{son_montuno, salsa_romántica, guaracha, afro_...",0,"{809, 1442, 1004, 1357}"
3,Sherrick,2,"{soul, r_and_b}",0,"{1426, 359}"
4,Allen_Collins,1,{southern_rock},0,{1186}


Gender priors:

In [12]:
p_fem = data.gender.sum()/data.gender.shape[0]
p_male = 1-p_fem
p_fem, p_male

(0.3108435681965094, 0.6891564318034906)

A function that maps from a genreset (string format) to its frequency

In [42]:
# not being used
# def set_to_count(genreset):
#     """fnc from genreset to its frequency; 
#     uses set_counts dict which still needs to be constructed"""
#     return set_counts[genre_sets.index(genreset)]

### Count frequency of a genre_set

In [14]:
# Initialize list of genre sets and counts:
genre_sets = [] # a list of the genre sets
set_counts = {} # a dictionary of items {int id for genre set : frequency of that genre set}

In [15]:
def set_counting(row):
    if row.genre_set_encoded in genre_sets:
        set_counts[genre_sets.index(row.genre_set_encoded)] += 1
    else: 
        genre_sets.append(row.genre_set_encoded)
        set_counts[len(set_counts)] = 1

In [17]:
data.apply(set_counting, axis = 1);

In [18]:
# example:
#set_to_count({1426,359}), data[data.genre_set_encoded == {1426, 359}].shape[0]

Create dataframe with the sets and their counts, using encoded and string versions of sets

In [19]:
set_counts_df = pd.DataFrame.from_dict(set_counts, orient = 'index')
set_counts_df.index.name = 'list_index'
set_counts_df.columns = ['count']
set_counts_df['genre_set'] = set_counts_df.apply(lambda x: genre_sets[x.name], axis = 1)

def set_strings(row):
    stringset = set([])
    for id in row.genre_set:
        stringset.add(genre_list.loc[id][0])
    return stringset

set_counts_df['genre_set_strings'] = set_counts_df.apply(set_strings, axis = 1)
set_counts_df['set_size'] = set_counts_df.genre_set_strings.apply(lambda x: len(x))
set_counts_df.sort_values(['count'], ascending = False, inplace = True)
#playing
#set_counts_df['attention'] = set_counts_df['count']*set_counts_df['set_size']

In [20]:
# attention_grabbers = set_counts_df.sort_values(['attention'], ascending = False)
# attention_grabbers = attention_grabbers[attention_grabbers.attention > attention_grabbers['count']]
# attention_grabbers[attention_grabbers.attention >70]

In [21]:
set_counts_df.head(20)

Unnamed: 0_level_0,count,genre_set,genre_set_strings,set_size
list_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,720,{465},{country},1
27,443,{128},{hip_hop},1
16,418,{1007},{pop},1
102,212,{1431},{rock},1
69,176,{507},{jazz},1
17,163,"{1431, 1007}","{pop, rock}",2
21,118,{650},{folk},1
145,116,{1427},{blues},1
3,101,"{1426, 359}","{r_and_b, soul}",2
425,93,"{1426, 1007}","{r_and_b, pop}",2


### Count frequencies of genre sets by gender

### Set counts for female artists:

In [22]:
data_fem = data[data.gender == 1].copy()

In [23]:
data_fem.head()

Unnamed: 0_level_0,artist,genrelist_length,genre_set,gender,genre_set_encoded
artist_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
11,Christina_Milian,4,"{dance, r_and_b, hip_hop, pop}",1,"{128, 1426, 1007, 1063}"
14,Lee_Soo-jung,4,"{r_and_b, folk, soul, k_pop}",1,"{1426, 359, 138, 650}"
15,Gwen_Sebastian,2,"{rock, country}",1,"{465, 1431}"
18,Anna_Margaret_Collins,1,{pop},1,{1007}
20,Lawnie_Wallace,1,{country},1,{465}


In [24]:
set_counts_fem = dict.fromkeys( set_counts.keys(), 0 ) # a dictionary of items {int id for genre set : frequency of that genre set}
def set_counting_fem(row):
    if row.genre_set_encoded in genre_sets:
        set_counts_fem[genre_sets.index(row.genre_set_encoded)] += 1

In [25]:
data_fem.apply(set_counting_fem, axis = 1);

In [26]:
data_fem[data_fem.genre_set_encoded == {465}]

Unnamed: 0_level_0,artist,genrelist_length,genre_set,gender,genre_set_encoded
artist_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20,Lawnie_Wallace,1,{country},1,{465}
73,Amanda_Wilkinson,1,{country},1,{465}
94,Victoria_Shaw,1,{country},1,{465}
154,Tania_Kernaghan,1,{country},1,{465}
326,Jenny_Simpson,1,{country},1,{465}
...,...,...,...,...,...
11941,Naomi_Judd,1,{country},1,{465}
12151,Kaitlyn_Baker,1,{country},1,{465}
12224,Nikki_Nelson,1,{country},1,{465}
12248,Karen_Tobin,1,{country},1,{465}


Create dataframe with the sets and their counts, using encoded and string versions of sets

In [27]:
set_counts_fem_df = pd.DataFrame.from_dict(set_counts_fem, orient = 'index')
set_counts_fem_df.index.name = 'list_index'
set_counts_fem_df.columns = ['count_female']
set_counts_fem_df['genre_set'] = set_counts_fem_df.apply(lambda x: genre_sets[x.name], axis = 1)

def set_strings(row):
    stringset = set([])
    for id in row.genre_set:
        stringset.add(genre_list.loc[id][0])
    return stringset

set_counts_fem_df['genre_set_strings'] = set_counts_fem_df.apply(set_strings, axis = 1)
set_counts_fem_df['set_size'] = set_counts_fem_df.genre_set_strings.apply(lambda x: len(x))
set_counts_fem_df.sort_values(['count_female'], ascending = False, inplace = True)
#playing
#set_counts_fem_df['attention'] = set_counts_fem_df['count']*set_counts_fem_df['set_size']

In [28]:
set_counts_fem_df

Unnamed: 0_level_0,count_female,genre_set,genre_set_strings,set_size
list_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,218,{465},{country},1
16,215,{1007},{pop},1
69,59,{507},{jazz},1
425,56,"{1426, 1007}","{r_and_b, pop}",2
17,51,"{1431, 1007}","{pop, rock}",2
...,...,...,...,...
2619,0,"{1081, 194, 449, 1426}","{hardcore_punk, alternative_rock, r_and_b, pun...",4
2618,0,"{1064, 870}","{pop_rock, teen_pop}",2
2615,0,"{561, 1427, 470}","{reggae, blues_rock, blues}",3
2613,0,"{472, 465, 650, 572}","{cajun, bluegrass, country, folk}",4


### Set counts for male artists:

In [29]:
data_male = data[data.gender == 0].copy()

In [30]:
set_counts_male = dict.fromkeys( set_counts.keys(), 0 ) # a dictionary of items {int id for genre set : frequency of that genre set}
def set_counting_male(row):
    if row.genre_set_encoded in genre_sets:
        set_counts_male[genre_sets.index(row.genre_set_encoded)] += 1

In [31]:
data_male.apply(set_counting_male, axis = 1);

Create dataframe with the sets and their counts, using encoded and string versions of sets

In [32]:
set_counts_male_df = pd.DataFrame.from_dict(set_counts_male, orient = 'index')
set_counts_male_df.index.name = 'list_index'
set_counts_male_df.columns = ['count_male']
set_counts_male_df['genre_set'] = set_counts_male_df.apply(lambda x: genre_sets[x.name], axis = 1)

def set_strings(row):
    stringset = set([])
    for id in row.genre_set:
        stringset.add(genre_list.loc[id][0])
    return stringset

set_counts_male_df['genre_set_strings'] = set_counts_male_df.apply(set_strings, axis = 1)
set_counts_male_df['set_size'] = set_counts_male_df.genre_set_strings.apply(lambda x: len(x))
set_counts_male_df.sort_values(['count_male'], ascending = False, inplace = True)
#playing
#set_counts_male_df['attention'] = set_counts_male_df['count']*set_counts_male_df['set_size']

In [33]:
set_counts_male_df.head()

Unnamed: 0_level_0,count_male,genre_set,genre_set_strings,set_size
list_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,502,{465},{country},1
27,434,{128},{hip_hop},1
16,203,{1007},{pop},1
102,187,{1431},{rock},1
69,117,{507},{jazz},1


### Join female and male set counts to total set counts:

In [34]:
set_counts_df = set_counts_df.join([set_counts_fem_df[['count_female']], set_counts_male_df[['count_male']]], how = 'left', lsuffix = '_tot', rsuffix = '_gender')

Calculate a column that classifies by majority vote for each genre set

In [35]:
def classify(row):
    if row.count_female < row.count_male:
        return 0 # male = 0
    else:
        return 1 # female = 1

In [36]:
set_counts_df['classifier'] = set_counts_df.apply(classify, axis = 1)

In [37]:
set_counts_df.head()

Unnamed: 0_level_0,count,genre_set,genre_set_strings,set_size,count_female,count_male,classifier
list_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,720,{465},{country},1,218,502,0
27,443,{128},{hip_hop},1,9,434,0
16,418,{1007},{pop},1,215,203,1
102,212,{1431},{rock},1,25,187,0
69,176,{507},{jazz},1,59,117,0


Create a column with the error of the classifier for that genre_set

In [38]:
set_counts_df['error_bound'] = set_counts_df.apply(
    lambda x: x.count_female if x.classifier == 0 else x.count_male, axis = 1)

In [39]:
set_counts_df.iloc[500:540]

Unnamed: 0_level_0,count,genre_set,genre_set_strings,set_size,count_female,count_male,classifier,error_bound
list_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3123,2,"{465, 650, 572}","{bluegrass, country, folk}",3,2,0,1,0
3890,2,"{1433, 1063, 1007}","{pop, disco, dance}",3,2,0,1,0
3891,2,"{1074, 1070, 1007}","{folk_pop, country_pop, pop}",3,2,0,1,0
2286,2,"{452, 1431}","{rock, roots}",2,1,1,1,1
1335,2,"{673, 1481, 405}","{alternative_metal, nu_metal, heavy_metal}",3,0,2,0,0
1432,2,"{264, 667, 870, 1431}","{new_wave, pop_rock, rock, progressive_rock}",4,0,2,0,0
1416,2,{1338},{opera_pop},1,0,2,0,0
5282,2,"{600, 507}","{classic_female_blues, jazz}",2,2,0,1,0
1418,2,"{264, 1481, 366}","{progressive_metal, heavy_metal, progressive_r...",3,0,2,0,0
1419,2,"{1431, 1429, 1007}","{pop, dance_pop, rock}",3,2,0,1,0


Calculate the total error of the model:

In [40]:
upper_error_bound = round(set_counts_df.error_bound.sum()/set_counts_df.shape[0],3)

In [41]:
print(f'The error of any classifier trained on this data is at least {upper_error_bound}.')
print(f'The accuracy of any classifier trained on this data is less than {1-upper_error_bound}.')

The error of any classifier trained on this data is at least 0.268.
The accuracy of any classifier trained on this data is less than 0.732.


## Now turn this model into a class that can be trained on any data frame

In [12]:
def UpperBound(df):
    # calculate ratios of (fe)male to total
    p_fem = data.gender.sum()/data.gender.shape[0]
    p_male = 1-p_fem
    
    # Initialize list of genre sets and counts:
    genre_sets = [] # a list of the genre sets
    genre_sets_fem = [] # a list of the genre sets of female artists
    genre_sets_mal = [] # a list of the genre sets of male artists
    set_counts = {} # a dictionary of items {int id for genre set : frequency of that genre set}
    set_counts_fem = {} # a dictionary of items {int id for genre set fem : frequency of that genre set for female artists}
    set_counts_mal = {} # a dictionary of items {int id for genre set mal : frequency of that genre set for male artists}
    
    def set_counting(row):
        if row.gender == 1: # case of female
            if row.genre_set_encoded in genre_sets_fem:
                set_counts_fem[genre_sets_fem.index(row.genre_set_encoded)] += 1
                set_counts[genre_sets.index(row.genre_set_encoded)] += 1
            else:
                genre_sets_fem.append(row.genre_set_encoded)
                set_counts_fem[len(set_counts_fem)] = 1
                genre_sets.append(row.genre_set_encoded)
                set_counts[len(set_counts)] = 1
        else: # case of row.gender == 0; male 
            if row.genre_set_encoded in genre_sets_mal:
                set_counts_mal[genre_sets_mal.index(row.genre_set_encoded)] += 1
                set_counts[genre_sets.index(row.genre_set_encoded)] += 1
            else:
                genre_sets_mal.append(row.genre_set_encoded)
                set_counts_mal[len(set_counts_mal)] = 1
                genre_sets.append(row.genre_set_encoded)
                set_counts[len(set_counts)] = 1
    
    def set_strings(row):
        stringset = set([])
        for id in row.genre_set:
            stringset.add(genre_list.loc[id][0])
        return stringset
    
    df.apply(set_counting, axis = 1);
    
    set_counts_df = pd.DataFrame.from_dict(set_counts, orient = 'index')
    set_counts_df_fem = pd.DataFrame.from_dict(set_counts_fem, orient = 'index')
    set_counts_df_mal = pd.DataFrame.from_dict(set_counts_mal, orient = 'index')
    
    set_counts_df.index.name = 'list_index'
    set_counts_df.columns = ['count']
    set_counts_df_fem.index.name = 'list_index'
    set_counts_df_fem.columns = ['count_fem']
#     set_counts_df_mal.index.name = 'list_index'
#     set_counts_df_mal.columns = ['count_mal']
    set_counts_df['genre_set'] = set_counts_df.apply(lambda x: genre_sets[x.name], axis = 1)

    set_counts_df['genre_set_strings'] = set_counts_df.apply(set_strings, axis = 1)
    set_counts_df['set_size'] = set_counts_df.genre_set_strings.apply(lambda x: len(x))
    set_counts_df.sort_values(['count'], ascending = False, inplace = True)
    
    set_counts_df = set_counts_df.join([set_counts_df_fem], how = 'outer')
    
    return set_counts_df, set_counts_df_fem

In [13]:
sc, scf = UpperBound(data)

In [14]:
sc.sort_values(['count_fem'])

Unnamed: 0_level_0,count,genre_set,genre_set_strings,set_size,count_fem
list_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1044,1,"{302, 1219, 1238}","{instrumental_rock, world_fusion, jazz_fusion}",3,1.0
1392,1,"{576, 449, 397, 1427, 858}","{psychobilly, rock_and_roll, blues, punk_rock,...",5,1.0
1391,1,"{1464, 1427, 1317, 507}","{blues, stride, jazz, boogie_woogie}",4,1.0
1390,1,"{1383, 493, 21, 504, 858, 859, 478}","{rock_and_roll, comedy_rock, avant_garde, trad...",7,1.0
1389,1,"{1442, 1450}","{son, afro_cuban_jazz}",2,1.0
...,...,...,...,...,...
6657,1,"{1081, 778, 1427, 650}","{blues, alternative_rock, experimental, folk}",4,
6658,1,"{1200, 385, 332}","{texas_blues, country_blues, electric_blues}",3,
6659,1,"{128, 465, 1108, 405, 1431}","{rap_rock, hip_hop, country, rock, nu_metal}",5,
6660,1,"{1186, 1427}","{blues, southern_rock}",2,


In [126]:
scf.head(30)

Unnamed: 0_level_0,count_fem
list_index,Unnamed: 1_level_1
0,6
1,1
2,4
3,215
4,218
5,50
6,25
7,2
8,1
9,1
