## Currently switching to using sparse matrix representations of the genres

Train a logistic model to classify the gender of an artist based on the list of genre labels.

- [x] shuffle the data -- is it currently training in order, on all males first...
- [x] use sparse matrices - merge to main branch
- [ ] use cross validation
- [ ] use dimension reduction, possibly through word embeddings like word2vec

In [2]:
import numpy as np
import pandas as pd

from scipy import sparse

from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import KFold

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import re

seed = 23

Import the cleaned data:

In [2]:
%store -r now
now

'2020-05-11-14-35'

In [3]:
X_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_X_train_{}.csv'.format(now), index_col = ['artist'])
y_train = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/wiki-kaggle_y_train_{}.csv'.format(now), index_col = ['artist'])

In [4]:
X_train.shape, y_train.shape

((12376, 2), (12376, 1))

In [5]:
X_train.head()

artist,genrelist,genrelist_length
Pablo_Holman,"['pop', 'rock', 'emo_pop']",3
Bobby_Edwards,['country'],1
La_Palabra,"['afro_cuban_jazz', 'son_montuno', 'guaracha',...",4
Sherrick,"['r&b', 'soul']",2
Allen_Collins,['southern_rock'],1


In [6]:
y_train.head()

artist,gender
Pablo_Holman,male
Bobby_Edwards,male
La_Palabra,male
Sherrick,male
Allen_Collins,male


### Genre Labels -- as a list -- NOT USED CURRENTLY

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. We want to convert it to a _list_ of strings.

In [7]:
def genrelist(string):
    """Take a string of the form appearing in the genrelist column
    of the dataframe, strip the square brackets and extra quotes, and
    return a list of strings where each string is a genre label."""
    string = string.strip("[").strip("]").replace("'", "")
    # Replace inner spaces by underscores and trim stray underscores at the ends
    labels = [s.replace(" ", "_").strip("_") for s in string.split(",")]
    # Drop empty labels (e.g. from an empty list or a trailing comma)
    return [label for label in labels if label]
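
An alternative way to get a list is sketched below, assuming every entry of the column is a valid Python list literal such as `"['pop', 'rock', 'emo_pop']"`. The helper name `parse_genrelist` is hypothetical and not part of this notebook; `ast.literal_eval` is from the standard library.

```python
import ast

def parse_genrelist(string):
    """Hypothetical helper: parse a string like "['pop', 'rock']" into a list of labels."""
    labels = ast.literal_eval(string)  # safely evaluates the list literal
    # Normalize inner spaces to underscores and drop empty labels
    return [label.strip().replace(" ", "_") for label in labels if label.strip()]

parsed = parse_genrelist("['pop', 'rock', 'emo_pop']")
print(parsed)
```

Unlike the manual stripping above, this would raise an error on malformed input instead of silently returning a partial list, which may or may not be what is wanted here.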

### Genre Labels -- as a string

Each value of the genre column is a _string_ of comma-separated genre labels using the Spotify abbreviations. This function strips the brackets, commas, and quotes, but leaves the value as a single space-separated string.

In [17]:
def genrelist(string):
    """Take a string of the form appearing in the genrelist column
    of the dataframe and strip the square brackets, commas, and extra quotes."""
    string = string.strip("[").strip("]").replace("'", "").replace(",", "")
    return string

Now we apply it to the whole column, overwriting `genrelist` with the cleaned strings:

In [18]:
X_train['genrelist']= X_train['genrelist'].apply(genrelist)

In [19]:
X_train.head()

artist,genrelist,genrelist_length
Pablo_Holman,pop rock emo_pop,3
Bobby_Edwards,country,1
La_Palabra,afro_cuban_jazz son_montuno guaracha salsa_rom...,4
Sherrick,r&b soul,2
Allen_Collins,southern_rock,1


In [22]:
X_train.genrelist.iloc[0]

'pop rock emo_pop'

In [23]:
data_train = X_train.join(y_train, how = 'inner', on = 'artist')

In [24]:
data_train.head()

artist,genrelist,genrelist_length,gender
Pablo_Holman,pop rock emo_pop,3,male
Bobby_Edwards,country,1,male
La_Palabra,afro_cuban_jazz son_montuno guaracha salsa_rom...,4,male
Sherrick,r&b soul,2,male
Allen_Collins,southern_rock,1,male


In [27]:
data_train = data_train.sample(frac = 1, random_state = 13)

In [28]:
data_train.head()

artist,genrelist,genrelist_length,gender
A._J._Ghent,funk soul jam gospel sacred_steel blues r&b,7,male
James_Carr,r&b soul,2,male
Cassidy,hip_hop gangsta_rap east_coast_hip_hop,3,female
Juliet_Roberts,jazz rock soul house,4,female
Bill_Barwick,western,1,male


In [29]:
data_train.reset_index(inplace = True)
data_train.index.name = 'artist_id'

In [30]:
data_train.head()

artist_id,artist,genrelist,genrelist_length,gender
0,A._J._Ghent,funk soul jam gospel sacred_steel blues r&b,7,male
1,James_Carr,r&b soul,2,male
2,Cassidy,hip_hop gangsta_rap east_coast_hip_hop,3,female
3,Juliet_Roberts,jazz rock soul house,4,female
4,Bill_Barwick,western,1,male


In [31]:
genre_list = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_list_training_{}.csv'.format(now))
genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [32]:
genre_list.head()

,genre_list
0,country
1,afro_cuban_jazz
2,aaa
3,mainstream_jazz
4,chicano_rock


Import the genres and their frequencies into a DataFrame:

In [33]:
genre_label_counts = pd.read_csv('/Users/Daniel/Code/Genre/data/genre_lists/data_ready_for_model/genre_label_counts_TRAINING_{}.csv'.format(now))
#genre_list.drop(['Unnamed: 0'], axis = 1, inplace = True)
#genre_label_counts.set_index(['Unnamed: 0'], inplace = True)
genre_label_counts.index.name = 'genre_id'
genre_label_counts.columns = ['genre','freqency']

In [34]:
genre_label_counts.head(12)

genre_id,genre,freqency
0,pop,2617
1,rock,1765
2,r&b,1647
3,country,1613
4,hip_hop,1114
5,folk,1046
6,soul,1023
7,jazz,962
8,alternative_rock,937
9,blues,859


Create a dictionary of {genre : genre_id}

In [35]:
genre_label_id_dict = dict(zip(genre_label_counts.genre.values.tolist(), genre_label_counts.index.tolist()))

In [36]:
genre_label_id_dict['hard_rock']

11

Now create a sparse data structure encoding of the genre labels:

In [61]:
CountVectorizer?

In [62]:
vec = CountVectorizer(vocabulary = genre_label_id_dict) # this implementation uses scipy.sparse.csr_matrix representation

In [63]:
X_sparse = vec.fit_transform(data_train.genrelist)

In [64]:
X_sparse

<12376x1353 sparse matrix of type '<class 'numpy.int64'>'
	with 31722 stored elements in Compressed Sparse Row format>

In [65]:
X_sparse.nnz #number of stored values

31722
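
One caveat worth checking: with its default `token_pattern`, `CountVectorizer` only keeps tokens of two or more word characters, so a label containing punctuation, such as `r&b`, would likely never be counted even though it is in the vocabulary. The sketch below uses a toy vocabulary (not this notebook's data) to compare the default settings with `token_pattern=r"\S+"`, which treats every whitespace-delimited label as one token. Whether this affects the matrix above has not been verified here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary and documents, for illustration only
vocab = {"pop": 0, "r&b": 1, "soul": 2}
docs = ["r&b soul", "pop r&b"]

# Default tokenizer: "r&b" splits into "r" and "b", both too short to be kept
default_vec = CountVectorizer(vocabulary=vocab)

# Whitespace tokenizer: each space-separated label is a single token;
# lowercase=False leaves the labels untouched
whole_vec = CountVectorizer(vocabulary=vocab, token_pattern=r"\S+", lowercase=False)

default_counts = default_vec.transform(docs).toarray()
whole_counts = whole_vec.transform(docs).toarray()

print("default r&b column:   ", default_counts[:, 1])
print("whitespace r&b column:", whole_counts[:, 1])
```

If the `r&b` column of `X_sparse` turns out to be empty, switching to a whitespace token pattern is one possible fix; the stored-element count would then change.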

Let's look at the nonzero entries of a row and make sure the encoding worked properly:

In [66]:
X_sparse[677].nonzero()

(array([0, 0, 0, 0], dtype=int32), array([ 4,  7, 18, 20], dtype=int32))

Convert the column ids of a row (here row 11000) to genre labels using the genre_label_counts DataFrame:

In [81]:
[genre_label_counts.loc[genre_id] for genre_id in X_sparse[11000].nonzero()[1]]

[genre       contemporary_jewish_religious
 freqency                                2
 Name: 740, dtype: object]
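
The same check can be written as a small round trip: encode a few label strings, then decode a row back to labels through an inverse `{genre_id : genre}` dictionary. This is a minimal sketch on a toy vocabulary; `id_to_genre` and `decode_row` are hypothetical names, not objects defined in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary (hypothetical labels and ids)
vocab = {"pop": 0, "rock": 1, "country": 2, "hip_hop": 3}
id_to_genre = {i: g for g, i in vocab.items()}  # inverse mapping

vec = CountVectorizer(vocabulary=vocab)
X = vec.transform(["pop rock", "hip_hop country pop"])

def decode_row(X, row):
    """Return the sorted genre labels stored in one row of a sparse count matrix."""
    cols = X[row].nonzero()[1]  # column indices of the nonzero entries
    return sorted(id_to_genre[c] for c in cols)

decoded = decode_row(X, 1)
print(decoded)
```

Comparing the decoded labels with the original string for a sample of rows would give a broader check than inspecting single rows by hand.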

Compare to the data_train entry:

In [82]:
data_train.loc[11000]

artist                              Avraham_Fried
genrelist           contemporary_jewish_religious
genrelist_length                                1
gender                                       male
Name: 11000, dtype: object

Checks out on some examples.

Encode labels:

In [90]:
le = preprocessing.LabelEncoder()
le.fit(['male', 'female'])
le.classes_

array(['female', 'male'], dtype='<U6')

In [98]:
le.transform(['female'])

array([0])

In [94]:
le.inverse_transform([1,0,1])

array(['male', 'female', 'male'], dtype='<U6')

In [106]:
y = le.transform(data_train.gender.values)

### Normalization (skipped currently)
Convert inputs to a numpy array and then create a scaler class to normalize the feature values that can be applied to training and test data.

In [78]:
# X = X_train.vector.values.tolist()
# X = np.stack(X, axis = 0)
# scaler = preprocessing.StandardScaler().fit(X)

In [79]:
# scaler.mean_, scaler.scale_

(array([2.11457660e-01, 1.42614738e-01, 1.33080155e-01, ...,
        8.08015514e-05, 8.08015514e-05, 8.08015514e-05]),
 array([0.40873772, 0.34967953, 0.33966134, ..., 0.00898861, 0.00898861,
        0.00898861]))

Apply the scaler to the training data:

In [80]:
# X_scaled = scaler.transform(X)

In [81]:
# X_scaled.shape

(12376, 1353)

In [82]:
# y = y_train.one_zero.values

### Create the Model

Define the class of models.

We needed to increase max_iter for the solver to converge. A high value of C (>1) helps the score a little. Recall that C is the inverse of the regularization strength: low C implies strong regularization, high C implies weak regularization.

In [107]:
model = LogisticRegression(random_state = 0, max_iter = 1000, C = 1 )
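
To illustrate the role of C, the sketch below fits the same model for three values of C on synthetic data (generated with `make_classification`, not the genre data) and reports the norm of the coefficient vector. With the L2 penalty, a smaller C should shrink the coefficients more.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(C=C, max_iter=1000, random_state=0).fit(X_toy, y_toy)
    # Euclidean norm of the fitted coefficients (intercept excluded)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C = {C:>6}: ||coef|| = {coef_norms[C]:.3f}")
```

The norms grow with C, which is the weak-regularization regime; whether a larger C actually generalizes better on the genre data is a question for cross validation rather than the training score.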

Train the model on the training data:

In [108]:
model.fit(X_sparse,y)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We can look at the predictions for the first few artists:

In [110]:
model.predict(X_sparse[:3,:])

array([1, 1, 1])

In [117]:
print('The percentage of male artists in the training set is {}%.'.format(round(100*y.sum()/y.shape[0])))

The percentage of male artists in the training set is 69.0%.


Score the model:

In [118]:
model.score(X_sparse,y)

0.7626050420168067

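This is a little better than the baseline of 0.69 obtained by always predicting that the artist is male. Note that the score is computed on the training data, so it likely overestimates the performance on unseen artists.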
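
The majority-class baseline can also be computed explicitly with scikit-learn's `DummyClassifier`. The sketch below uses a toy label vector with 69 ones out of 100 to mirror the class balance reported above; the arrays `X_toy` and `y_toy` are illustrative, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 69 "male" (1) and 31 "female" (0); features are ignored by the dummy model
y_toy = np.array([1] * 69 + [0] * 31)
X_toy = np.zeros((100, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_score = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {baseline_score:.2f}")
```

Scoring the dummy model on the same folds as the logistic model would make the comparison with the baseline more direct.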

Run cross validation.

In [None]:
kf = KFold(n_splits = 10, shuffle = True, random_state = seed)

cvscores_acc = []

for train, test in kf.split(X_sparse, y):

    # Fit a fresh model on the training folds
    model_cv = LogisticRegression(random_state = 0, max_iter = 1000, C = 1)
    model_cv.fit(X_sparse[train], y[train])

    # Accuracy on the held-out fold
    cvscores_acc.append(model_cv.score(X_sparse[test], y[test]))

print(f'Mean accuracy is {np.mean(cvscores_acc):.4f} and STD is {np.std(cvscores_acc):.4f}')
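
A more compact way to cross-validate a logistic model on sparse input is `cross_val_score`. The sketch below runs it on synthetic data converted to a CSR matrix, and uses `StratifiedKFold` so that each fold keeps roughly the same class balance; the stratification is a suggestion here, not something this notebook currently does.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem stored as a sparse matrix (illustration only)
X_dense, y_toy = make_classification(n_samples=400, n_features=30,
                                     weights=[0.3, 0.7], random_state=23)
X_toy = sparse.csr_matrix(X_dense)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=cv, scoring="accuracy")

print(f"Mean accuracy is {scores.mean():.4f} and STD is {scores.std():.4f}")
```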