# Calculate the upper bound for accuracy of any model trained on our training data.

- Warning: I seem to have arrived at a contradiction.
    - For the DNN classifier, the 10-fold CV accuracy has a mean of 76% with a standard deviation of 1%. This is better than my supposed upper bound.
    - For the DNN classifier, the training accuracy can be close to 80%. How? Is it memorizing the order of the samples? No: `shuffle` is True.

The data is of the form $(X,y)$ with $X_i \in \left\{ 0, 1 \right\}^{\times p}$ ($p=1494$) and $y_i \in \left\{ 0, 1 \right\}$ (male $= 0$, female $= 1$). There are 12376 training samples. Let $\left\{\bar{X}_a \right\}_{1 \leq a \leq 6230}$ be the unique representatives of the inputs in the training set; that is, for every $i$ there exists exactly one $a$ such that $X_i = \bar{X}_a$. For each $\bar{X}_a$, count the number of female artists, $\text{fem}\left( \bar{X}_a \right)$, and the number of male artists, $\text{mal}\left( \bar{X}_a \right)$, with $X_i = \bar{X}_a$. Define a classifier on the training data, $f_0: \left\{ X_i \right\}_{i=1}^{12376} \to \left\{ 0, 1 \right\}$, by majority vote:
$$ f_0(X_i) = \begin{cases} 1 & \text{if} \; \text{fem}(\bar{X}_a) \geq \text{mal}(\bar{X}_a) \\ 0 & \text{otherwise} \end{cases} \quad \text{where} \; X_i = \bar{X}_a $$
Then extend $f_0$ to $f: \left\{ 0, 1 \right\}^{\times p} \to \left\{ 0, 1 \right\}$. When $f$ is only used on the training data, the extension from $f_0$ to $f$ is irrelevant, and $f_0$ is optimal: any classifier must assign the same label to all samples that share an input, so on each $\bar{X}_a$ it misclassifies at least the minority class. However, to generalize to data that includes points in $\left\{ 0, 1 \right\}^{\times p}$ that were not in the training set, a rule is needed to make the extension.

This notebook computes an error of 26.8% for $f_0$ on the training data, i.e. an accuracy of 73.2%.
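
The majority-vote construction can be illustrated on a toy dataset. The following is a minimal sketch using only the standard library; the function name `majority_vote_error` and the toy samples are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter, defaultdict

def majority_vote_error(samples):
    """samples: iterable of (features, label) pairs with hashable features.
    Returns the fraction of samples misclassified by the majority-vote rule."""
    counts = defaultdict(Counter)
    for features, label in samples:
        counts[features][label] += 1
    # For each distinct input, every sample outside the majority class is an error
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    total = sum(sum(c.values()) for c in counts.values())
    return errors / total

# Toy data: three samples share the input (1, 0), one has the input (0, 1)
toy = [((1, 0), 'male'), ((1, 0), 'male'), ((1, 0), 'female'), ((0, 1), 'female')]
toy_error = majority_vote_error(toy)
print(toy_error)  # 0.25: one of the four samples is in the minority for its input
```

No function of the inputs alone can do better on this toy data, because the three samples with input `(1, 0)` must all receive the same label.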

In [3]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

Import the cleaned data:

In [4]:
#%ls -lt ../../data/genre_lists/data_ready_for_model/

In [5]:
%store -r now
now
#now = '2020-05-11-14-35'

'2020-05-18-10-06'

In [6]:
X_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now), index_col = ['artist'])
y_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now), index_col = ['artist'])

### Genre Labels -- as a set

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _set_ of strings.

In [7]:
def genrelist(string):
    """This function takes in a string of the form
    appearing in the genrelist column of the dataframe.
    It strips the square brackets and extra quotes and
    returns a set of strings where each string is a genre label."""
    string = string.strip("[").strip("]").replace("'","")
    # Replace spaces with underscores and trim the underscores left by padding
    labels = [s.replace(" ","_").strip("_") for s in string.split(',')]
    # Drop empty labels (e.g. from an empty list "[]")
    return {label for label in labels if label}
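
As a quick check of the parsing steps, here is a compact, self-contained variant applied to sample strings. The name `parse_genres` and the sample inputs are hypothetical; they are meant to mirror the string format described above, not to reproduce actual rows of the data.

```python
def parse_genres(raw):
    """Hypothetical compact variant of genrelist: string of labels -> set of labels."""
    # Remove the surrounding brackets and the quotes around each label
    cleaned = raw.strip("[").strip("]").replace("'", "")
    # Spaces become underscores; underscores produced by padding are trimmed
    labels = (part.replace(" ", "_").strip("_") for part in cleaned.split(","))
    # Empty labels (e.g. from "[]") are dropped
    return {label for label in labels if label}

example = parse_genres("['emo pop', 'rock', 'pop']")
print(example)             # a set containing 'emo_pop', 'rock' and 'pop'
print(parse_genres("[]"))  # set()
```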

Now we
- apply it to the whole column and put the sets in a new column,
- assemble X, y into a single DataFrame,
- reset the index and name it 'artist_id'.

In [8]:
X_train['genre_set']= X_train['genrelist'].apply(genrelist)

data = X_train.join(y_train, how = 'inner', on = 'artist')
data.drop(['genrelist'], axis = 1, inplace = True)
data.reset_index(inplace = True)
data.index.name = 'artist_id'
data_set_size = data.shape[0]

In [9]:
data.head()

artist_id,artist,genrelist_length,genre_set,gender
0,Pablo_Holman,3,"{emo_pop, rock, pop}",male
1,Bobby_Edwards,1,{country},male
2,La_Palabra,4,"{guaracha, salsa_romántica, afro_cuban_jazz, s...",male
3,Sherrick,2,"{r_and_b, soul}",male
4,Allen_Collins,1,{southern_rock},male


- Set IDs for the full genre_list as index (not just for the genres in the training set)
- Vocab dict and size: {label: id}
- Max length of the genre lists

In [10]:
genre_list = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_{}.csv'.format(now))
genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)
genre_list['genre_id'] = list(range(1,genre_list.shape[0]+1))

#Size of the vocab:
vocab_size = genre_list.shape[0]

#Create a dictionary {genre_label: genre_id}
temp = genre_list.set_index(['genre_list'])
label_id_dict = temp['genre_id'].to_dict()

# set genre_id to index
genre_list.set_index(['genre_id'], inplace = True)

#Find max length of genre lists:
max_list_length = data.genrelist_length.max()

In [12]:
genre_list.head()

genre_id,genre_list
1,chilean
2,zamba
3,afro_punk_blues
4,crunk
5,spanish_guitar


In [13]:
# encode labels as ints within the list
def encode_list(row):
    return {label_id_dict[item] for item in row.genre_set}

data['genre_set_encoded'] = data.apply(encode_list, axis = 1)

#Check that the encoding is consistent: 
# n = np.random.randint(data.shape[0])
# {label_id_dict[item] for item in data.genre_set.iloc[n]} == data.genre_set_encoded.iloc[n]

# Encode targets. The categories still appear as strings. To see the encoding use df.column.cat.codes.
#data['gender'] = data.gender.apply(lambda x: 1 if x == 'female' else 0)
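
The vocabulary and the encoding step can be illustrated without the data files. This sketch assumes a toy vocabulary made of the five labels shown in `genre_list.head()` above, with ids starting at 1; the names `toy_label_id`, `encode`, and `id_to_label` are illustrative.

```python
# Toy vocabulary: ids start at 1, as for genre_id above
labels = ["chilean", "zamba", "afro_punk_blues", "crunk", "spanish_guitar"]
toy_label_id = {label: i for i, label in enumerate(labels, start=1)}

def encode(genre_set, label_to_id):
    """Map a set of genre labels to the set of their integer ids."""
    return {label_to_id[label] for label in genre_set}

encoded = encode({"zamba", "crunk"}, toy_label_id)
print(encoded)  # the set {2, 4}

# Round-trip check: decoding the ids recovers the original labels
id_to_label = {i: label for label, i in toy_label_id.items()}
decoded = {id_to_label[i] for i in encoded}
print(decoded == {"zamba", "crunk"})  # True
```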

In [14]:
data.head()

artist_id,artist,genrelist_length,genre_set,gender,genre_set_encoded
0,Pablo_Holman,3,"{emo_pop, rock, pop}",male,"{794, 1007, 1431}"
1,Bobby_Edwards,1,{country},male,{465}
2,La_Palabra,4,"{guaracha, salsa_romántica, afro_cuban_jazz, s...",male,"{809, 1442, 1004, 1357}"
3,Sherrick,2,"{r_and_b, soul}",male,"{1426, 359}"
4,Allen_Collins,1,{southern_rock},male,{1186}


Create an id for each genre_set

In [17]:
# Initialize list of genre sets and counts:
genre_sets = [] # a list of the genre sets

def set_id(row):
    if row.genre_set_encoded not in genre_sets:
        # add to list of all genre sets
        genre_sets.append(row.genre_set_encoded)
    # the id is the position of the set in the list
    return genre_sets.index(row.genre_set_encoded)

data['set_id'] = data.apply(set_id, axis = 1)
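
The list-based lookup scans `genre_sets` once per row. A possible alternative, sketched here under the assumption that ids should still be assigned in order of first appearance, keys a dict by `frozenset` so that each lookup takes constant time. The function name `assign_set_ids` and the toy sets are illustrative.

```python
def assign_set_ids(encoded_sets):
    """Assign consecutive ids to distinct sets, in order of first appearance."""
    ids_by_set = {}
    result = []
    for s in encoded_sets:
        key = frozenset(s)  # sets are unhashable; frozensets can be dict keys
        if key not in ids_by_set:
            ids_by_set[key] = len(ids_by_set)
        result.append(ids_by_set[key])
    return result, ids_by_set

toy_sets = [{794, 1007, 1431}, {465}, {1431, 794, 1007}, {1426, 359}]
set_ids, ids_by_set = assign_set_ids(toy_sets)
print(set_ids)  # [0, 1, 0, 2]
```

The first and third toy sets contain the same elements, so they share an id regardless of the order in which the elements were written.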

In [18]:
data.head()

artist_id,artist,genrelist_length,genre_set,gender,genre_set_encoded,set_id
0,Pablo_Holman,3,"{emo_pop, rock, pop}",male,"{794, 1007, 1431}",0
1,Bobby_Edwards,1,{country},male,{465},1
2,La_Palabra,4,"{guaracha, salsa_romántica, afro_cuban_jazz, s...",male,"{809, 1442, 1004, 1357}",2
3,Sherrick,2,"{r_and_b, soul}",male,"{1426, 359}",3
4,Allen_Collins,1,{southern_rock},male,{1186},4


## Function that calculates the optimal classifier and its error

In [19]:
def UpperBound(df):
    """Compute the majority-vote classifier and its error.
    Input is a dataframe with the same columns as 'data' above;
    its 'set_id' and 'genre_set' columns are overwritten in place.
    It returns (DataFrame, float):
    DataFrame: one row per unique genre set, with the counts for female/male,
    a column classifying by majority vote,
    and the number of misclassified samples for that genre set;
    float: the error of the classifier, which is the smallest
    error of any classifier on this data"""
    
    # Initialize list of genre sets and counts:
    genre_sets = [] # a list of the genre sets

    def set_id(row):
        if row.genre_set_encoded not in genre_sets:
            # add to list of all genre sets
            genre_sets.append(row.genre_set_encoded)
        # the id is the position of the set in the list
        return genre_sets.index(row.genre_set_encoded)

    df['set_id'] = df.apply(set_id, axis = 1)

    set_counts = pd.pivot_table(df, index = 'set_id', columns = 'gender', values = 'artist', aggfunc = 'count', fill_value = 0)
    set_counts['genre_set_encoded'] = set_counts.apply(lambda x: genre_sets[int(x.name)], axis = 1)
    set_counts['total'] = set_counts.female +set_counts.male
    set_counts = set_counts[['total','female','male','genre_set_encoded']]
    
    def set_strings(row):
        # Map each genre id back to its string label
        return {genre_list.loc[genre_id, 'genre_list'] for genre_id in row.genre_set_encoded}
    
    df['genre_set'] = df.apply(set_strings, axis = 1)

    # Calculate a column that classifies by majority vote for each genre set
    def classify(row):
        if row.female < row.male:
            return 0 # male = 0
        else:
            return 1 # female = 1
    
    # indicate class
    set_counts['classifier'] = set_counts.apply(classify, axis = 1)
    
    # Create a column with the error of the classifier for that genre_set
    set_counts['error_bound'] = set_counts.apply(
        lambda x: x.female if x.classifier == 0 else x.male, axis = 1)
    
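    # Calculate the total error of the model.
    # Note: set_counts.shape[0] is the number of unique genre sets, so this ratio is
    # misclassified samples per unique genre set; dividing by set_counts.total.sum()
    # instead would give the fraction of misclassified samples.
    error = round(set_counts.error_bound.sum()/set_counts.shape[0],3)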
    
    return set_counts, error

In [20]:
set_counts, error= UpperBound(data)

In [23]:
set_counts.head()

set_id,total,female,male,genre_set_encoded,classifier,error_bound
0,1,0,1,"{794, 1007, 1431}",0,0
1,720,218,502,{465},0,218
2,1,0,1,"{809, 1442, 1004, 1357}",0,0
3,101,46,55,"{1426, 359}",0,46
4,9,2,7,{1186},0,2


In [24]:
print(f'The error of any classifier trained on this data is at least {error}.')
print(f'The accuracy of any classifier trained on this data is less than {1-error}.')

The error of any classifier trained on this data is at least 0.268.
The accuracy of any classifier trained on this data is less than 0.732.
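
One thing worth checking against the warning at the top: the value printed above is `set_counts.error_bound.sum() / set_counts.shape[0]`, which normalizes the misclassified samples by the number of unique genre sets rather than by the number of samples. The sketch below rescales the printed value to a per-sample rate. It assumes that `set_counts` has 6230 rows (the number of unique inputs quoted in the introduction) and starts from the rounded value 0.268, so the result is approximate; the variable names are illustrative.

```python
n_unique_sets = 6230    # assumed number of rows in set_counts (unique inputs, from the introduction)
n_samples = 12376       # number of training samples
printed_error = 0.268   # rounded value printed above

# Recover the approximate number of samples outside the majority class of their genre set
misclassified = printed_error * n_unique_sets

# Fraction of training samples that the majority-vote classifier gets wrong
per_sample_error = misclassified / n_samples
print(round(per_sample_error, 3), round(1 - per_sample_error, 3))  # 0.135 0.865
```

Under these assumptions the per-sample error would be about 13.5%, i.e. an accuracy bound of roughly 86.5%, which would be compatible with the DNN accuracies quoted in the warning. Recomputing with `set_counts.total.sum()` as the denominator would confirm or refute this.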


How does the error vary when the classifier is used on sub-samples?

In [34]:
data.shape[0], data.shape[0] // 10, data.shape[0] % 10

(12376, 1237, 6)

In [35]:
sample_sizes = np.arange(1237,12376,1237) # sizes of nested subsamples

In [47]:
# Build nested subsamples from the largest down: subsamples[9] is drawn from data,
# and each subsamples[i] is drawn from subsamples[i+1].
subsamples = {}
subsamples[9] = data.sample(sample_sizes[9])
for i in range(8, -1, -1):
    subsamples[i] = subsamples[i+1].sample(sample_sizes[i])
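
The nesting intended for the subsamples can be checked independently of pandas. This is a minimal sketch on row indices only, using the standard library; the function `nested_subsamples` and the fixed seed are assumptions made for illustration.

```python
import random

def nested_subsamples(n_rows, sizes, seed=0):
    """Draw index subsamples so that each smaller one is contained in the next larger one."""
    rng = random.Random(seed)
    sizes = sorted(sizes)
    nested = {}
    current = list(range(n_rows))
    # Start from the largest size and repeatedly subsample the previous draw
    for k in range(len(sizes) - 1, -1, -1):
        current = rng.sample(current, sizes[k])
        nested[k] = current
    return nested

sizes = list(range(1237, 12376, 1237))  # same sizes as sample_sizes above
nested = nested_subsamples(12376, sizes)
print(len(sizes), len(nested[0]), len(nested[9]))  # 10 1237 12370
print(all(set(nested[k]) <= set(nested[k + 1]) for k in range(9)))  # True
```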

In [45]:
subsamples[9]

artist_id,artist,genrelist_length,genre_set,gender,genre_set_encoded,set_id
9607,Emily_Piriz,1,{pop_rock},female,{870},125
10351,Marty_Grebb,1,{rock},male,{1431},102
381,Christophe_Godin,5,"{heavy_metal, blues, jazz_fusion, classical, i...",male,"{1481, 1005, 302, 1427, 1238}",288
466,Ashley_Parker_Angel,2,"{pop_rock, pop}",male,"{870, 1007}",187
11954,Roni_Lee,2,"{punk_rock, new_wave}",female,"{449, 667}",2993
...,...,...,...,...,...,...
9833,Attrell_Cordes,2,"{hip_hop, r_and_b}",male,"{128, 1426}",118
7155,Marty_Casey,4,"{hard_rock, indie_rock, post_grunge, alternati...",male,"{497, 354, 1270, 1081}",3909
549,Martin_Armiger,2,"{pop, rock_and_roll}",male,"{858, 1007}",68
6139,Katie_McMahon,1,{celtic},female,{1401},3411
