## Switching to sparse matrix representations of the genres

Train a logistic regression model to classify the gender of an artist from the artist's list of genre labels.

- [ ] shuffle the data: is the model currently training in order, on all male artists first?
- [ ] use cross-validation
- [ ] use dimensionality reduction, possibly through word embeddings such as word2vec

In [1]:
import numpy as np
import pandas as pd

from scipy import sparse

from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

Import the cleaned data:

In [2]:
%store -r now
now

'2020-05-11-14-35'

In [3]:
X_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now), index_col = ['artist'])
y_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now), index_col = ['artist'])

In [4]:
X_train.shape, y_train.shape

((12376, 2), (12376, 1))

In [5]:
X_train.head()

artist,genrelist,genrelist_length
Pablo_Holman,"['pop', 'rock', 'emo_pop']",3
Bobby_Edwards,['country'],1
La_Palabra,"['afro_cuban_jazz', 'son_montuno', 'guaracha',...",4
Sherrick,"['r&b', 'soul']",2
Allen_Collins,['southern_rock'],1


In [6]:
y_train.head()

artist,gender
Pablo_Holman,male
Bobby_Edwards,male
La_Palabra,male
Sherrick,male
Allen_Collins,male


# Shuffle the data!

### Genre Labels

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

In [7]:
def genrelist(string):
    """Take a string of the form appearing in the genrelist column
    of the dataframe, strip the square brackets and extra quotes, and
    return a list of strings where each string is a genre label."""
    string = string.strip("[]").replace("'", "")
    # Replace spaces with underscores and trim stray underscores from each label.
    labels = [s.replace(" ", "_").strip("_") for s in string.split(",")]
    # Drop empty labels, e.g. those produced by an empty list "[]".
    return [label for label in labels if label]
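As an alternative to stripping characters by hand, the same conversion can be sketched with the standard library's `ast.literal_eval`. This is only a sketch: the name `parse_genrelist` is hypothetical, and it assumes every stored string is a valid Python list literal, as in the rows shown above.

```python
import ast

def parse_genrelist(raw):
    """Parse a string such as "['pop', 'emo pop']" into a list of genre labels."""
    # literal_eval safely evaluates the list literal without running arbitrary code.
    labels = ast.literal_eval(raw)
    # Normalise each label: trim whitespace, use underscores, trim stray underscores.
    cleaned = [label.strip().replace(" ", "_").strip("_") for label in labels]
    # Drop any labels that end up empty.
    return [label for label in cleaned if label]

example = parse_genrelist("['pop', 'rock', 'emo pop']")
print(example)  # ['pop', 'rock', 'emo_pop']
```

Unlike splitting on commas, this keeps a label intact even if it contains a comma, but it raises an error on strings that are not well-formed list literals.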

Now we apply it to the whole column, replacing the strings with lists:

In [8]:
X_train['genrelist']= X_train['genrelist'].apply(genrelist)

In [9]:
X_train.head()

artist,genrelist,genrelist_length
Pablo_Holman,"[pop, rock, emo_pop]",3
Bobby_Edwards,[country],1
La_Palabra,"[afro_cuban_jazz, son_montuno, guaracha, salsa...",4
Sherrick,"[r&b, soul]",2
Allen_Collins,[southern_rock],1


In [10]:
data_train = X_train.join(y_train, how = 'inner', on = 'artist')

In [11]:
data_train.head()

artist,genrelist,genrelist_length,gender
Pablo_Holman,"[pop, rock, emo_pop]",3,male
Bobby_Edwards,[country],1,male
La_Palabra,"[afro_cuban_jazz, son_montuno, guaracha, salsa...",4,male
Sherrick,"[r&b, soul]",2,male
Allen_Collins,[southern_rock],1,male


In [12]:
data_train = data_train.sample(frac = 1, random_state = 13)

In [13]:
data_train.head()

artist,genrelist,genrelist_length,gender
Julia_Kedhammar,"[pop, dance_pop, synth_pop]",3,female
Debby_Ryan,[indie_pop],1,female
Gepe,"[pop, folk, electro_pop, indie_folk]",4,male
Nash_the_Slash,"[progressive_rock, electronic]",2,male
Dhani_Harrison,"[alternative_rock, rock]",2,male


In [14]:
data_train.reset_index(inplace = True)
data_train.index.name = 'artist_id'

In [15]:
data_train.head()

artist_id,artist,genrelist,genrelist_length,gender
0,Julia_Kedhammar,"[pop, dance_pop, synth_pop]",3,female
1,Debby_Ryan,[indie_pop],1,female
2,Gepe,"[pop, folk, electro_pop, indie_folk]",4,male
3,Nash_the_Slash,"[progressive_rock, electronic]",2,male
4,Dhani_Harrison,"[alternative_rock, rock]",2,male


In [16]:
genre_list = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_training_{}.csv'.format(now))
genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [17]:
genre_list.head()

Unnamed: 0,genre_list
0,country
1,afro_cuban_jazz
2,aaa
3,mainstream_jazz
4,chicano_rock


In [18]:
genre_label_counts = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_label_counts_TRAINING_{}.csv'.format(now))
#genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)
#genre_label_counts.set_index(['Unnamed: 0'], inplace = True)
genre_label_counts.index.name = 'genre_id'
genre_label_counts.columns = ['genre','frequency']

In [19]:
genre_label_counts

genre_id,genre,frequency
0,pop,2617
1,rock,1765
2,r&b,1647
3,country,1613
4,hip_hop,1114
...,...,...
1348,tapping,1
1349,street_artist,1
1350,euthadisco,1
1351,afro,1


## Start here: these vectors don't need to be defined explicitly; begin using sparse matrices

- [ ] create a column in data_train with a sparse vector for each artist

In [15]:
def vec_position(row):
    i = row.name
    v = np.zeros(1353)
    v[i] = 1
    return v

In [16]:
# genre_label_counts['vectors'] = genre_label_counts.apply(vec_position, axis = 1)

In [17]:
# genre_label_counts.head()

Create a sparse matrix $A$, where $A_{ij}$ is the $j$th entry of the sparse vector representation of the $i$th genre list:

In [19]:
genrelist_vectors = sparse.dok_matrix((X_train.shape[0], genre_label_counts.shape[0]), dtype = 'int64') 
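One way the matrix could be filled is sketched below. It is not the notebook's final implementation: the toy `genres` and `genre_lists` stand in for `genre_label_counts.genre` and the `genrelist` column, and the matrix is built in CSR format from coordinate lists rather than by assigning into a DOK matrix.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins (assumptions) for genre_label_counts.genre and the genrelist column.
genres = ["pop", "rock", "r&b", "country", "soul"]
genre_lists = [["pop", "rock"], ["country"], ["r&b", "soul"]]

# Map each genre label to its column index (the genre_id).
genre_index = {genre: j for j, genre in enumerate(genres)}

# Collect the (row, column) coordinates of the nonzero entries.
rows, cols = [], []
for i, labels in enumerate(genre_lists):
    for label in labels:
        rows.append(i)
        cols.append(genre_index[label])

# Each coordinate pair contributes a 1; duplicate pairs are summed.
data = np.ones(len(rows), dtype="int64")
A = sparse.csr_matrix((data, (rows, cols)), shape=(len(genre_lists), len(genres)))

print(A.toarray())
# [[1 1 0 0 0]
#  [0 0 0 1 0]
#  [0 0 1 0 1]]
```

Only the nonzero entries are stored, so memory grows with the total number of genre labels rather than with the number of artists times the number of genres.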

In [17]:
keys = genre_label_counts.genre.values.tolist()
# Note: the 'vectors' column exists only if the commented-out cell above has been run.
values = genre_label_counts.vectors.tolist()
genre_dict = dict(zip(keys, values))

In [18]:
genre_dict['r&b']

array([0., 0., 1., ..., 0., 0., 0.])

Now apply a function to the data that sums the vectors of the genres in each artist's list:

In [19]:
def genre_list_vector(x):
    v = np.zeros(1353)
    for genre in x:
        v += genre_dict[genre]
    return v

In [20]:
X_train['vector'] = X_train.genrelist.apply(genre_list_vector)

In [21]:
X_train.head()

artist,genrelist,genrelist_length,vector
Pablo_Holman,"[pop, rock, emo_pop]",3,"[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
Bobby_Edwards,[country],1,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
La_Palabra,"[afro_cuban_jazz, son_montuno, guaracha, salsa...",4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
Sherrick,"[r&b, soul]",2,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ..."
Allen_Collins,[southern_rock],1,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


Now encode male/female as 0/1 for targets:

In [22]:
y_train['one_zero'] = (y_train.gender == 'female').astype('int64')

In [23]:
y_train.iloc[8455]

gender      female
one_zero         1
Name: Cinder_Block, dtype: object

Convert the inputs to a NumPy array, then fit a scaler that can be applied to both the training and test data.

In [78]:
X = X_train.vector.values.tolist()
X = np.stack(X, axis = 0)
scaler = preprocessing.StandardScaler().fit(X)

In [79]:
scaler.mean_, scaler.scale_

(array([2.11457660e-01, 1.42614738e-01, 1.33080155e-01, ...,
        8.08015514e-05, 8.08015514e-05, 8.08015514e-05]),
 array([0.40873772, 0.34967953, 0.33966134, ..., 0.00898861, 0.00898861,
        0.00898861]))

Apply the scaler to the training data:

In [80]:
X_scaled = scaler.transform(X)

In [81]:
X_scaled.shape

(12376, 1353)
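Once the inputs are stored as a sparse matrix, centering would destroy the sparsity, and `StandardScaler` raises an error if asked to center sparse input. A minimal sketch of scaling without centering follows; the toy matrices are assumptions standing in for the real training and test data.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy sparse inputs (assumptions) standing in for the genre indicator matrices.
X_train_toy = sparse.csr_matrix(
    np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]], dtype="float64")
)
X_test_toy = sparse.csr_matrix(np.array([[0, 1, 1]], dtype="float64"))

# with_mean=False scales each column by its standard deviation without
# subtracting the mean, so zero entries stay zero.
scaler = StandardScaler(with_mean=False).fit(X_train_toy)

# The scaler fitted on the training data is reused unchanged on the test data.
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.scale_)
print(X_test_scaled.toarray())
```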

In [82]:
y = y_train.one_zero.values

Define the class of models.

max_iter had to be increased for the solver to converge. A high value of C (>1) improves the score slightly. Recall that C is the inverse of the regularization strength: low C implies strong regularization, high C implies weak regularization.

In [112]:
model = LogisticRegression(random_state = 0, max_iter = 1000, C = 10 )
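The effect of C can be checked by scoring several values on held-out data. The sketch below uses synthetic data from `make_classification` as a stand-in for the genre vectors, so the particular scores say nothing about the artist data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the artist data).
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Fit one model per value of C and record its accuracy on the held-out split.
scores = {}
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, max_iter=1000, random_state=0)
    clf.fit(X_fit, y_fit)
    scores[C] = clf.score(X_val, y_val)

for C, score in scores.items():
    print(f"C={C}: validation accuracy {score:.3f}")
```

Comparing on a held-out split rather than on the training data matters here, because weaker regularization tends to raise the training score regardless of how well the model generalizes.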

Train the model on the training data:

In [113]:
model.fit(X,y)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We can look at the predictions for the first three artists:

In [114]:
model.predict(X[:3,:])

array([0, 0, 0])

In [115]:
y.sum()/y.shape[0]

0.3108435681965094

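Score the model on the training data. The accuracy of about 0.77 is somewhat better than the baseline of 0.69 obtained by always predicting that the artist is male. Note that the model was fit on the unscaled X rather than on X_scaled.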

In [116]:
model.score(X,y)

0.7716548157724629
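The majority-class baseline can also be computed with scikit-learn's `DummyClassifier`, which makes the comparison explicit. In this sketch the toy labels, with 31 positives out of 100, are an assumption chosen to mirror the roughly 31% of female artists in the training set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (an assumption): 31 female (1) and 69 male (0) artists.
y_toy = np.array([1] * 31 + [0] * 69)
# DummyClassifier ignores the features, so a placeholder column suffices.
X_placeholder = np.zeros((len(y_toy), 1))

# Always predict the most frequent class, here 0 (male).
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_toy)
baseline_accuracy = baseline.score(X_placeholder, y_toy)

print(baseline_accuracy)
```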

Run cross validation.
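A minimal sketch of this step with `cross_val_score` follows. The random sparse indicator matrix and the labels derived from it are assumptions standing in for the genre matrix and the gender targets; `StratifiedKFold` with `shuffle=True` also addresses the concern about the data being ordered by gender.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data (assumptions): a random 0/1 indicator matrix stored in sparse form,
# with labels that depend on the first five columns.
rng = np.random.default_rng(0)
dense = (rng.random((200, 30)) < 0.1).astype("float64")
X_toy = sparse.csr_matrix(dense)
y_toy = (dense[:, :5].sum(axis=1) > 0).astype("int64")

# Shuffled, stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
model_cv = LogisticRegression(max_iter=1000, C=10)

# One accuracy score per fold, each computed on data the model was not fit on.
cv_scores = cross_val_score(model_cv, X_toy, y_toy, cv=cv)

print(cv_scores)
print(cv_scores.mean())
```

The mean of the fold scores gives a less optimistic estimate than `model.score(X, y)` on the training data.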