## Switching to sparse matrix representations of the genres

Train a logistic regression model to classify the gender of an artist based on the artist's list of genre labels. This approach uses sparse matrices to one-hot encode the genre labels of each artist. The model is trained on the features both with and without normalization; the results do not differ significantly.

- [ ] supervised PCA?
- [ ] use dimension reduction through word embeddings - USE Scikit-learn LSA, etc
    - [ ] LSI
    - [ ] LDA
    - [ ] HDP
- [ ] introduce length of the genre list as a new feature? or save for NN model

In [1]:
import numpy as np
import pandas as pd

from scipy import sparse

from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

import genre_data_loader, genre_upperbound

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

import os

import csv

seed = 23

In [2]:
%store -r now
X_path_train = '/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now)
y_path_train = '/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now)
X_path_test = '/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_test_{}.csv'.format(now)
y_path_test = '/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_test_{}.csv'.format(now)

In [3]:
genre_data = genre_data_loader.LoadGenreData(now, X_path_train = X_path_train,  y_path_train = y_path_train)

Format genre labels as a string

In [4]:
data = genre_data.as_strings()

In [5]:
data.head()

artist,genrelist_length,gender,genre_string
Pablo_Holman,3,male,rock emo_pop pop
Bobby_Edwards,1,male,country
La_Palabra,4,male,afro_cuban_jazz guaracha son_montuno salsa_rom...
Sherrick,2,male,r_and_b soul
Allen_Collins,1,male,southern_rock


In [6]:
data.reset_index(inplace = True)
data.index.name = 'artist_id'

In [7]:
data.head()

artist_id,artist,genrelist_length,gender,genre_string
0,Pablo_Holman,3,male,rock emo_pop pop
1,Bobby_Edwards,1,male,country
2,La_Palabra,4,male,afro_cuban_jazz guaracha son_montuno salsa_rom...
3,Sherrick,2,male,r_and_b soul
4,Allen_Collins,1,male,southern_rock


Use data_loader to get full list of genres:

In [8]:
# genre_data_full = genre_data_loader.LoadGenreData(now, X_path_train = X_path_train,  y_path_train = y_path_train, 
#                                              X_path_test = X_path_test,  y_path_test = y_path_test)

In [9]:
# list_of_genres = genre_data_full.get_list_of_genres()

In [10]:
# genre_list_full_df = pd.DataFrame(list_of_genres)
# genre_list_full_df.head()

Sparse data structure:

In [11]:
X_sparse = genre_data.get_sparse_X_vector()

In [12]:
X_sparse

<12376x1353 sparse matrix of type '<class 'numpy.int64'>'
	with 33463 stored elements in Compressed Sparse Row format>

In [13]:
X_sparse.nnz #number of stored values

33463
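
The implementation of `get_sparse_X_vector` is not shown in this notebook. As a rough sketch of how a matrix like `X_sparse` could be built, assuming one column per genre and a 1 wherever an artist carries that genre, `CountVectorizer` with `binary=True` produces a sparse matrix directly from space-separated genre strings. The toy strings are copied from the `data.head()` output above; `X_toy` and `vectorizer` are illustrative names, not part of `genre_data_loader`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy genre strings in the same space-separated format as `genre_string`.
genre_strings = ["rock emo_pop pop", "country", "r_and_b soul", "southern_rock"]

# binary=True stores 1 for "genre present" instead of a count;
# token_pattern=r"\S+" treats every whitespace-separated label as one token.
vectorizer = CountVectorizer(binary=True, token_pattern=r"\S+", lowercase=False)
X_toy = vectorizer.fit_transform(genre_strings)

print(sparse.issparse(X_toy), X_toy.shape, X_toy.nnz)
print(sorted(vectorizer.vocabulary_))  # one column per distinct genre
```

Only the nonzero entries are stored, which is why a 12376x1353 matrix with 33463 stored elements stays small in memory.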

Encode labels:

In [14]:
le = preprocessing.LabelEncoder()
le.fit(['male', 'female'])
le.classes_

array(['female', 'male'], dtype='<U6')

In [15]:
# le.transform(['female'])

In [16]:
# le.inverse_transform([1,0,1])

In [17]:
y = le.transform(data.gender.values)
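
A small standalone check of the encoding may be useful here. `LabelEncoder` sorts its classes, so `female` maps to 0 and `male` maps to 1, as the `le.classes_` output above shows. The name `le_demo` is only for illustration.

```python
from sklearn import preprocessing

le_demo = preprocessing.LabelEncoder()
le_demo.fit(["male", "female"])

# Classes are stored in sorted order: index 0 is 'female', index 1 is 'male'.
codes = le_demo.transform(["female", "male", "male"])
labels = le_demo.inverse_transform([1, 0, 1])

print(list(le_demo.classes_))
print(codes.tolist())
print(labels.tolist())
```

Because `male` is encoded as 1, `y.sum()` counts the male artists, which is what the baseline computation later in the notebook relies on.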

### Normalization
Fit a scaler on the sparse training matrix so that the same normalization of the feature values can be applied to both the training and the test data.

### Currently not needed; all features are 1/0

In [18]:
#scaler = preprocessing.StandardScaler(with_mean = False).fit(X_sparse) # need with_mean = False for sparse data
transformer = preprocessing.MaxAbsScaler(copy = False).fit(X_sparse)

In [19]:
scale, maxabs = transformer.scale_, transformer.max_abs_

In [20]:
np.argmax(scale), np.max(scale), np.argmax(maxabs), np.max(maxabs)

(0, 1.0, 0, 1.0)

Apply the scaler to the training data:

In [21]:
X_scaled = transformer.transform(X_sparse)

In [22]:
X_scaled.shape

(12376, 1353)
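
To see why scaling is not needed for 0/1 features, here is a minimal sketch on a toy binary matrix (the matrix and variable names are made up for illustration). `MaxAbsScaler` divides each column by its maximum absolute value, and for a binary column containing at least one 1 that value is 1, so the data come back unchanged.

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

# Toy binary feature matrix: every column contains at least one 1.
X_toy = sparse.csr_matrix(
    np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]], dtype=np.float64)
)

scaler = preprocessing.MaxAbsScaler().fit(X_toy)
X_toy_scaled = scaler.transform(X_toy)

# Each column's max absolute value is 1, so scaling is the identity here.
unchanged = np.array_equal(X_toy_scaled.toarray(), X_toy.toarray())
print(scaler.max_abs_, unchanged)
```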

### Create the Model

Define the class of models.

`max_iter` had to be increased for the solver to converge. A high value of `C` (>1) helps the score a little. Recall that `C` controls the amount of regularization: low `C` implies strong regularization, high `C` implies weak regularization. Setting `l1_ratio` between 0 and 1 gives a combination of l1 and l2 penalties; `l1_ratio = 1` is pure l1.

In [23]:
model = LogisticRegression(
    l1_ratio = .5,
    penalty =  'elasticnet', # 'l1',
    solver = 'saga', 
    random_state = 0, 
    max_iter = 10000, 
    C = 5 
    )
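
The effect of `C` can be illustrated on synthetic data. This is a sketch on a made-up dataset from `make_classification`, not on the genre data; the helper `count_nonzero_coefs` is hypothetical. With an elastic-net penalty, the l1 part drives some coefficients to exactly zero, and a smaller `C` (stronger regularization) should leave fewer nonzero coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, only 4 of them informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=4, random_state=0
)

def count_nonzero_coefs(C):
    """Fit an elastic-net logistic regression and count nonzero weights."""
    clf = LogisticRegression(
        penalty="elasticnet", l1_ratio=0.5, solver="saga",
        C=C, max_iter=10000, random_state=0,
    )
    clf.fit(X_syn, y_syn)
    return int(np.count_nonzero(clf.coef_))

nnz_strong = count_nonzero_coefs(0.01)  # low C: strong regularization
nnz_weak = count_nonzero_coefs(5)       # high C: weak regularization
print(f"nonzero coefficients: C=0.01 -> {nnz_strong}, C=5 -> {nnz_weak}")
```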

Train the model on the training data:

In [24]:
#model.fit(X_sparse,y)
model.fit(X_scaled,y)

# Can look at predictions:
#model.predict(X_sparse[:3,:])
#model.predict(X_scaled[:3,:])

#model.score(X_sparse,y)
print(f'The accuracy is {model.score(X_scaled,y)}.')

The accuracy is 0.7711700064641241.


In [25]:
print('The percentage of male artists in the training set is {}%.'.format(round(100*y.sum()/y.shape[0])))

The percentage of male artists in the training set is 69.0%.


The training accuracy of about 0.77 is a little better than the baseline of 0.69 obtained by always predicting that the artist is male.
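
The majority-class baseline can also be computed explicitly. The sketch below uses a toy label vector with 69 ones out of 100, chosen to mirror the 69% figure above, together with scikit-learn's `DummyClassifier`; the feature matrix is a placeholder because this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 69 "male" (1) and 31 "female" (0), mirroring the 69% above.
y_toy = np.array([1] * 69 + [0] * 31)
X_placeholder = np.zeros((y_toy.shape[0], 1))  # features are ignored

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_toy)
baseline_accuracy = baseline.score(X_placeholder, y_toy)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```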

### Cross-validation

The `train_validate` function defined below can be run on both the normalized and the unnormalized data:

In [32]:
data = genre_data.as_sets()
data

artist,genrelist_length,gender,genre_string,genre_set
Pablo_Holman,3,male,rock emo_pop pop,"{rock, emo_pop, pop}"
Bobby_Edwards,1,male,country,{country}
La_Palabra,4,male,afro_cuban_jazz guaracha son_montuno salsa_rom...,"{afro_cuban_jazz, guaracha, son_montuno, salsa..."
Sherrick,2,male,r_and_b soul,"{r_and_b, soul}"
Allen_Collins,1,male,southern_rock,{southern_rock}
...,...,...,...,...
Steve_Gaines,2,male,blues southern_rock,"{blues, southern_rock}"
Dan_Hoerner,3,male,emo indie_rock alternative_rock,"{emo, indie_rock, alternative_rock}"
Detail,2,male,r_and_b hip_hop,"{r_and_b, hip_hop}"
Billy_Woods,1,male,hip_hop,{hip_hop}


In [34]:
genre_upperbound.UpperBound(data.iloc[:1000])[1]

0.127193

In [43]:
def train_validate(x_data, y_data):
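    """Run 2-fold cross-validation of the global `model` on (x_data, y_data).

    Prints the mean and standard deviation of the validation accuracy and
    returns (list of cv scores, list of error lower bounds). Each lower
    bound is the value returned by genre_upperbound.UpperBound for the rows
    of the global `data` frame in that validation fold, i.e. a lower bound
    on the error of any model on that fold.
    """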

    kf = KFold(n_splits = 2, shuffle = True, random_state = seed)

    cvscores = []
    #cms = []
    error_lower_bounds = []
    for train, val in kf.split(x_data,y_data):

        # TODO: the code below still needs to be adapted to this notebook
        X_train = x_data[train]
        y_train = y_data[train]

        model.fit(X_train, y_train)

        X_val = x_data[val]
        y_val = y_data[val]

        score = model.score(X_val, y_val)
        cvscores.append(score)

        # compute confusion matrices and store them in a list
        y_pred = model.predict(X_val)
        #cms.append(confusion_matrix(y_val, y_pred))
        
        err = genre_upperbound.UpperBound(data.iloc[val])[1]
        error_lower_bounds.append(err)

    print(f'Mean accuracy is {100*np.mean(cvscores):.2f}% and 100*STD is {100*np.std(cvscores):.2f}%')
    # 69 is the accuracy (in percent) of always predicting "male"
    print(f'This is a {100*(100*np.mean(cvscores)-69)/69:.2f}% improvement over a random guess.')
    return cvscores, error_lower_bounds
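
As a possible alternative to the manual loop, scikit-learn's `cross_val_score` clones the estimator for every fold, so no global model is refit in place. The sketch below runs on synthetic sparse binary data rather than the genre matrix and uses `StratifiedKFold`, which is an assumption on my part: it keeps the class ratio roughly equal across folds, which may matter for an imbalanced target like this one. It does not compute the error lower bounds.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(23)

# Synthetic stand-in: 200 samples, 30 sparse binary features, ~69% positives.
X_syn = sparse.csr_matrix((rng.random((200, 30)) < 0.1).astype(np.float64))
y_syn = (rng.random(200) < 0.69).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
clf = LogisticRegression(max_iter=1000)

# cross_val_score clones `clf` for each fold and returns one accuracy per fold.
scores = cross_val_score(clf, X_syn, y_syn, cv=skf)
print(f"Mean accuracy {100 * scores.mean():.2f}%, STD {100 * scores.std():.2f}%")
```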

Without normalization:

In [44]:
cvscores, error_bounds = train_validate(X_sparse, y)

Mean accuracy is 72.70% and 100*STD is 0.82%
This is a 5.36% improvement over a random guess.


In [46]:
cvscores

[0.7188106011635423, 0.7351325145442793]

In [45]:
error_bounds

[0.217041, 0.222157]
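
The `genre_upperbound` module is not shown in this notebook, so the following is only one plausible reading of what such a bound measures, not its actual implementation. If several artists share exactly the same genre set but differ in gender, a classifier that sees only the genres must give them all the same prediction, so at best it gets the majority label right. The fraction of minority-label rows within such groups is then a lower bound on the error. The helper name and toy data are hypothetical.

```python
import pandas as pd

# Toy data; identical strings stand in for identical genre sets
# (this assumes the genres are written in a canonical order).
toy = pd.DataFrame({
    "genre_string": ["rock pop", "rock pop", "rock pop",
                     "country", "country", "soul"],
    "gender": ["male", "male", "female", "female", "male", "male"],
})

def duplicate_error_lower_bound(df):
    """Fraction of rows that must be misclassified by any classifier
    that maps identical genre strings to the same prediction."""
    counts = df.groupby(["genre_string", "gender"]).size()
    group_sizes = counts.groupby(level="genre_string").sum()
    majority = counts.groupby(level="genre_string").max()
    # Rows outside the majority label of their group are unavoidable errors.
    return float((group_sizes - majority).sum() / len(df))

bound = duplicate_error_lower_bound(toy)
print(f"Error lower bound on the toy data: {bound:.4f}")
```

In the toy data, one of the three `rock pop` rows and one of the two `country` rows cannot be predicted correctly, giving 2 unavoidable errors out of 6 rows.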