# Upper bound on the accuracy (equivalently, lower bound on the error) of any model trained on our training data


- For the DNN classifier, the 10-fold CV accuracy has a mean of 76% with a standard deviation of 1%. This is better than my supposed upper bound, because the error decreases on random subsets of the data. See below.
- For the DNN classifier, the training accuracy can be close to 80%. How? I suspect this again comes from using subsets, where the mean error on the data is smaller.

The data has the form $(X,y)$ with $X_i \in \left\{ 0, 1 \right\}^{\times p}$ ($p=1494$) and $y_i \in \left\{ 0, 1 \right\}$ (male $= 0$, female $= 1$). There are 12376 training samples. Let $\left\{\bar{X}_a \right\}_{1 \leq a \leq 6230}$ be unique representatives of the inputs in the training set; that is, for every $i$ there exists a unique $a$ such that $X_i = \bar{X}_a$. For each $\bar{X}_a$, count the female artists $\text{fem}\left( \bar{X}_a \right)$ and the male artists $\text{mal}\left( \bar{X}_a \right)$ with $X_i = \bar{X}_a$. Define a classifier on the set of training data, $f_0: \left\{ X_i \right\}_{i=1}^{12376} \to \left\{ 0, 1 \right\}$, by majority vote:

$$ f_0(X_i) = \text{argmax}_{\left\{ \text{male},\text{female}\right\}} \left\{ \text{mal}(\bar{X}_a), \text{fem}(\bar{X}_a)\right\} \; \text{if} \; X_i = \bar{X}_a $$

Then extend $f_0$ to $f: \left\{ 0, 1 \right\}^{\times p} \to \left\{ 0, 1 \right\}$. When $f$ is only used on the training data, the extension from $f_0$ to $f$ is irrelevant, and $f_0$ is optimal: no classifier that depends only on $X$ can make fewer errors on the training data. However, to generalize to points in $\left\{ 0, 1 \right\}^{\times p}$ that were not in the training set, a rule is needed to make the extension.
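
As a minimal sketch of the construction above, the following builds the majority-vote table $f_0$ from toy data and extends it to unseen inputs. The names `fit_majority_vote` and `predict` are hypothetical, and falling back to the overall majority label is only one possible extension rule, assumed here for illustration. Ties go to label 1 (female), as in the `classify` helper further down.

```python
from collections import Counter, defaultdict

def fit_majority_vote(X, y):
    """Return (table, default): table maps each distinct input to its majority label."""
    counts = defaultdict(Counter)
    for x, label in zip(X, y):
        counts[tuple(x)][label] += 1
    # f_0: majority label per distinct input; ties go to 1 (female)
    table = {x: (1 if c[1] >= c[0] else 0) for x, c in counts.items()}
    # Assumed extension rule: unseen inputs get the overall majority label
    overall = Counter(y)
    default = 1 if overall[1] >= overall[0] else 0
    return table, default

def predict(table, default, x):
    """Evaluate f: use f_0 on seen inputs, the fallback label otherwise."""
    return table.get(tuple(x), default)

# Toy data: the input (1, 0) appears with both labels, (0, 1) with one label only
X = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1)]
y = [0, 0, 1, 1, 1]
table, default = fit_majority_vote(X, y)
print(predict(table, default, (1, 0)))  # seen input, majority label
print(predict(table, default, (1, 1)))  # unseen input, fallback label
```

Any other extension rule (nearest neighbour in Hamming distance, a trained model, and so on) leaves the training error of $f_0$ unchanged, since it only affects inputs outside the training set.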

This notebook shows that even on the training data $f_0$ has an expected error of 26.8%, or an accuracy of 73.2%.
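
To make the bound explicit, here is the error of $f_0$ written out in the notation above, with $N = 12376$ the number of training samples. Any classifier that depends only on $X$ must give the same label to every sample with $X_i = \bar{X}_a$, so it is wrong on at least the minority count for that $\bar{X}_a$:

$$ \text{err}(f_0) = \frac{1}{N} \sum_{a=1}^{6230} \min \left\{ \text{fem}(\bar{X}_a), \text{mal}(\bar{X}_a) \right\}, \qquad N = \sum_{a=1}^{6230} \left( \text{fem}(\bar{X}_a) + \text{mal}(\bar{X}_a) \right) $$

For example, an input shared by 3 male artists and 1 female artist contributes 1 to the sum: the majority vote labels all four as male and is wrong once.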

It further shows that on subsets of the data, as the size of the subset decreases so does the error.
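
The effect can be illustrated with a small simulation on hypothetical toy data (not the genre data). In a small random subset, repeated inputs with conflicting labels are less likely to appear together, so the minimal achievable error on the subset drops. The helper names `min_error` and `mean_subset_error` are assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict

def min_error(samples):
    """Smallest error any function of x can reach on a list of (x, label) pairs."""
    counts = defaultdict(Counter)
    for x, label in samples:
        counts[x][label] += 1
    # For each distinct x, the minority label count is unavoidably misclassified
    wrong = sum(min(c[0], c[1]) for c in counts.values())
    return wrong / len(samples)

# Toy data: three distinct inputs, each seen with both labels
samples = (
    [("a", 0)] * 3 + [("a", 1)]
    + [("b", 0)] * 2 + [("b", 1)] * 2
    + [("c", 1)] * 3 + [("c", 0)]
)
full = min_error(samples)  # (1 + 2 + 1) / 12

rng = random.Random(0)

def mean_subset_error(k, trials=2000):
    """Average minimal error over random subsets of size k (without replacement)."""
    return sum(min_error(rng.sample(samples, k)) for _ in range(trials)) / trials

for k in (1, 2, 4, 8, 12):
    print(k, round(mean_subset_error(k), 3))
```

Two cases are exact: a subset of size 1 always has error 0, and a subset of size 12 is the full data with error 4/12. The averages in between depend on the random seed.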

In [1]:
import genre_data_loader, genre_upperbound

In [2]:
# get current date for latest version of data set
%store -r now

# directory holding the train/test CSV files
data_dir = '/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model'

X_path_train = f'{data_dir}/wiki-kaggle_X_train_{now}.csv'
y_path_train = f'{data_dir}/wiki-kaggle_y_train_{now}.csv'
X_path_test = f'{data_dir}/wiki-kaggle_X_test_{now}.csv'
y_path_test = f'{data_dir}/wiki-kaggle_y_test_{now}.csv'

In [3]:
# call data loader script
genre_data = genre_data_loader.LoadGenreData(now, X_path_train, y_path_train, X_path_test, y_path_test)

In [4]:
# load data with genre sets
data = genre_data.as_sets()
# create list of all genres
list_of_genres = genre_data.get_list_of_genres()

In [5]:
data.head()

artist,genrelist_length,gender,genre_set
Pablo_Holman,3,male,"{rock, pop, emo_pop}"
Bobby_Edwards,1,male,{country}
La_Palabra,4,male,"{salsa_romántica, son_montuno, afro_cuban_jazz..."
Sherrick,2,male,"{soul, r_and_b}"
Allen_Collins,1,male,{southern_rock}


### Calculate lower bound on error (implies upper bound on accuracy)

This uses the function `UpperBound` in genre_upperbound.py.

In [6]:
set_counts, error = genre_upperbound.UpperBound(data)

In [7]:
print(f'The error of any classifier trained on this data is at least {error}.')
print(f'The accuracy of any classifier trained on this data is at most {1-error}.')

The error of any classifier trained on this data is at least 0.287216.
The accuracy of any classifier trained on this data is at most 0.712784.


## The function below has been spun off into the script genre_upperbound.py

In [9]:
# import numpy as np
# import pandas as pd

# def UpperBound(df_input):
#     """Compute the majority-vote error bound for a DataFrame
#     of the same type as 'data' above. It returns (DataFrame, float):
#     DataFrame: one row per distinct genre set, with the counts
#     for female/male, a column classifying by majority vote,
#     and the number of artists misclassified for that genre set;
#     float: the error of the majority-vote classifier, which is
#     the smallest error of any classifier on this data"""
    
#     # Initialize list of genre sets and counts:
#     genre_sets = [] # a list of the genre sets

#     df = df_input.copy(deep = True)

#     def set_id(row):
#         if row.genre_set in genre_sets:
#             row_id = genre_sets.index(row.genre_set)
#         else:
#             # add to list of all genre sets
#             genre_sets.append(row.genre_set)
#             row_id = genre_sets.index(row.genre_set)
#         return row_id


#     df['set_id'] = df.apply(set_id, axis = 1)
    
#     df.reset_index(inplace = True)

#     set_counts = pd.pivot_table(df, index = 'set_id', columns = 'gender', values = 'artist', aggfunc = 'count', fill_value = 0)
#     set_counts['genre_set_encoded'] = set_counts.apply(lambda x: genre_sets[int(x.name)], axis = 1)
#     set_counts['total'] = set_counts.female + set_counts.male
#     set_counts = set_counts[['total','female','male','genre_set_encoded']]

#     # Calculate a column that classifies by majority vote for each genre set
#     def classify(row):
#         if row.female < row.male:
#             return 0 # male = 0
#         else:
#             return 1 # female = 1
    
#     # indicate class
#     set_counts['classifier'] = set_counts.apply(classify, axis = 1)
    
#     # Create a column with the number of artists the classifier
#     # gets wrong for that genre_set (the minority count)
#     set_counts['error_bound'] = set_counts.apply(
#         lambda x: x.female if x.classifier == 0 else x.male, axis = 1)
    
#     # Calculate the total error of the model: misclassified artists
#     # divided by the total number of artists (not the number of genre sets)
#     error = round(set_counts.error_bound.sum()/set_counts.total.sum(),6)
    

#     return set_counts, error

In [10]:
# set_counts, error= UpperBound(data)