# Train genderListener and pickle it

Purpose of this notebook:
1. Load the dataset containing the metadata and audio features
2. Create the model pipeline
3. Train the model
4. Pickle the trained model


## Load Libraries

In [1]:
# Import common python library
from collections import OrderedDict, Counter

# Import numpy library
import numpy as np

# Import matplotlib library
import matplotlib.pyplot as plt

from matplotlib.pyplot import *
from matplotlib import colors

# Import pandas library
import pandas as pd

# Import scikit-learn library
import joblib  # standalone package; recent scikit-learn releases no longer ship sklearn.externals.joblib

from sklearn.dummy import DummyClassifier

from sklearn.linear_model import LogisticRegression

# from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.ensemble import RandomForestClassifier

# from sklearn.neighbors import KNeighborsClassifier

# from sklearn.svm import SVC

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import (StandardScaler,
                                   LabelEncoder, 
                                   OneHotEncoder)

from sklearn.metrics import precision_recall_curve

# Import imbalance-learn library
from imblearn.pipeline import make_pipeline

from imblearn.over_sampling import RandomOverSampler

import pickle 

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Import user created library
#from code_cc import utilities
import aux_code.functions as mfc
from aux_code.utilities import *  # These are functions created by Emmanuel Contreras-Campana, Ph.D.

# random seed
seed = 3

%matplotlib inline

## Load Data


In [2]:
# Load metadata
df = pd.read_csv('data/meta_audio.csv', index_col=0)
df.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'link', 'annualTED', 'film_year', 'published_year',
       'num_speaker_talks', 'technology', 'science', 'global issues',
       'culture', 'design', 'business', 'entertainment', 'health',
       'innovation', 'society', 'Fascinating', 'Courageous', 'Longwinded',
       'Obnoxious', 'Jaw-dropping', 'Inspiring', 'OK', 'Beautiful', 'Funny',
       'Unconvincing', 'Ingenious', 'Informative', 'Confusing', 'Persuasive',
       'wpm', 'words_per_min', 'first_name', 'gender_name',
       'gender_name_class', 'fileName', 'ZCR', 'Energy', 'EnergyEntropy',
       'SpectralCentroid', 'SpectralSpread', 'SpectralEntropy', 'SpectralFlux',
       'SpectralRollof', 'mfcc1', 'mfcc2', 'mfcc3', 'mfcc4', 'mfccC5', 'mfcc6',
       'mfcc7', 'mfcc8', 'mfcc9', 'mfcc10', '

In [3]:
# The model uses all 34 pyAudioAnalysis features
audioCols = ['ZCR', 'Energy', 'EnergyEntropy',
       'SpectralCentroid', 'SpectralSpread', 'SpectralEntropy', 'SpectralFlux',
       'SpectralRollof', 'mfcc1', 'mfcc2', 'mfcc3', 'mfcc4', 'mfccC5', 'mfcc6',
       'mfcc7', 'mfcc8', 'mfcc9', 'mfcc10', 'mfcc11', 'mfcc12', 'mfcc13',
       'Chroma1', 'Chroma2', 'Chroma3', 'Chroma4', 'Chroma5', 'Chroma6',
       'Chroma7', 'Chroma8', 'Chroma9', 'Chroma10', 'Chroma11', 'Chroma12',
       'Chroma_std']

# The dataset from which our training and validation sets are drawn should only include
# the talks with 1 speaker and a gender_name_class of 0 or 1.
# Excluding talks with very low words_per_min (below the 25th percentile) lets us exclude some musical performances
df_model = df.loc[(df['num_speaker'] == 1)
                  & (df['gender_name_class'].isin([0.0, 1.0]))
                  & (df['words_per_min'] >= 131),
                  audioCols + ['gender_name_class']]
X = df_model[audioCols]
print(X.shape)
print('male/female speakers', Counter(df.loc[df['gender_name_class'].isin([0.0, 1.0]), 'gender_name_class']))
X.head(3)

(922, 34)
male/female speakers Counter({0.0: 849, 1.0: 280})


Unnamed: 0,ZCR,Energy,EnergyEntropy,SpectralCentroid,SpectralSpread,SpectralEntropy,SpectralFlux,SpectralRollof,mfcc1,mfcc2,...,Chroma4,Chroma5,Chroma6,Chroma7,Chroma8,Chroma9,Chroma10,Chroma11,Chroma12,Chroma_std
1,0.157549,0.012028,3.26116,0.236284,0.219008,1.599764,0.0,0.24669,-11.097283,1.327426,...,1.599878e-09,2.665708e-09,1.376139e-09,7.381325e-09,1.894452e-09,2.329639e-09,4.507106e-09,1.057406e-08,1.167405e-08,3.479431e-09
3,0.11167,0.014973,3.286377,0.182854,0.206212,0.828991,0.0,0.12383,-11.37276,2.26059,...,7.222366e-09,7.437066e-09,2.284307e-09,8.678086e-08,3.631717e-09,8.220817e-09,5.630347e-09,5.76066e-09,9.587197e-09,2.225379e-08
4,0.126741,0.011402,3.319345,0.229218,0.235725,1.074359,0.0,0.179995,-11.518566,1.970373,...,4.655073e-09,1.070902e-08,7.810629e-09,1.046956e-08,6.75475e-09,3.192602e-09,5.348694e-09,8.367562e-09,6.860652e-09,3.03029e-09


In [4]:
y = df_model['gender_name_class']

# Model pipeline

## Standardizing 
I transform feature values to their z-scores, meaning every feature will have a mean of zero and a standard deviation of 1. Z-scoring does not strictly require normally distributed features, but it is most meaningful when they are approximately normal, which is true in this case.

Standardizing makes it possible to:
- compare the effect of different predictor variables
- interpret the model parameters in 'standard deviation' units of the predictor variables (and optionally of the predicted variable) ([source](https://think-lab.github.io/d/205/)).
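As a minimal sketch of what the z-score transform does, the snippet below standardizes a single feature by hand. The values in `zcr` are made up for illustration and are not taken from the dataset; the population standard deviation is used, which is also what StandardScaler uses.

```python
import statistics


def z_scores(values):
    """Return the z-scores of a list of numbers (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical zero-crossing-rate values for five talks (illustration only)
zcr = [0.10, 0.12, 0.16, 0.11, 0.13]
zcr_scaled = z_scores(zcr)

# After the transform the feature has mean 0 and standard deviation 1
print([round(z, 3) for z in zcr_scaled])
```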

In [5]:
# StandardScaler rescales each feature to its z-scores.
# After the transform, each feature is centred on 0 with a standard deviation of 1.

scaler = StandardScaler(copy=True, with_mean=True, with_std=True)

## Models to produce gender labels

There are about 3x as many talks by male speakers as by female speakers; that is, the male/female target classes are highly imbalanced. This affects model performance, and some measures must be taken to mitigate the impact.

In this case, I opt for oversampling the minority class to match the sample size of the majority class, using a method closely related to bootstrap sampling.
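The idea behind random oversampling can be sketched in plain Python: minority-class rows are drawn with replacement until every class matches the majority count. This is an illustration of the principle only, not imbalanced-learn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=3):
    """Duplicate randomly chosen rows of smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sampling with replacement, as in bootstrap sampling; k=0 for the majority class
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data with a 3:1 imbalance (9 samples of class 0, 3 of class 1)
labels = [0] * 9 + [1] * 3
rows = [[float(i)] for i in range(len(labels))]

rows_res, labels_res = random_oversample(rows, labels)
print(Counter(labels_res))
```

Because the extra rows are copies of existing ones, oversampling balances the class counts without adding new information about the minority class.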

The hyper-parameters will be optimized and cross-validated using the Logarithmic Loss function (i.e. log loss). Log loss was chosen because it heavily penalizes any strongly mis-classified predictions. 
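To illustrate why log loss penalizes confident mistakes so heavily, here is a small hand-written version of the binary log loss. The function name and the probabilities are made up for the example; in practice scikit-learn's own scorer would be used.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels given predicted P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 1, 0, 0]

# The last sample (true class 0) is misclassified in both cases,
# first with a mild probability of 0.6, then with a confident 0.99.
mild = binary_log_loss(y_true, [0.6, 0.6, 0.4, 0.6])
confident = binary_log_loss(y_true, [0.6, 0.6, 0.4, 0.99])

print(round(mild, 3), round(confident, 3))
```

The only difference between the two calls is how confident the single wrong prediction is, yet the loss more than doubles.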

Precision and Recall are used for the model selection and evaluation. To make sure that precision and recall are robust, I cross-validate them.

I employ stratified k-fold cross-validation instead of regular (non-stratified) k-fold cross-validation. In stratified k-fold cross-validation, the folds are built so that each one keeps approximately the same class proportions as the full dataset, which ensures that every fold contains samples from both target classes. Using non-stratified k-fold cross-validation on a dataset with highly imbalanced target classes could result in some folds containing only samples from the majority class.
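A short sketch of this behaviour, using a toy label vector with the same 3:1 imbalance (the arrays `X_demo` and `y_demo` are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 36 majority-class samples and 12 minority-class samples
y_demo = np.array([0] * 36 + [1] * 12)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=3)

# Share of minority-class samples in each test fold
fold_minority_share = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_minority_share.append(float(y_demo[test_idx].mean()))

print(fold_minority_share)
```

Every test fold keeps the overall minority share, so no fold ends up without minority-class samples.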

To reduce computation time, I use a small number of folds: 4-fold cross-validation for the outer loop and 3-fold cross-validation for the inner loop.

In [6]:
## Stratified K-Fold cross-validation
k_fold = 4

outer_kfold_cv = StratifiedKFold(n_splits=k_fold, shuffle=True, random_state=seed)
inner_kfold_cv = StratifiedKFold(n_splits=k_fold-1, shuffle=True, random_state=seed)

## Random Over Sampling of minority class
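# Newer imbalanced-learn releases name this argument sampling_strategy (formerly ratio)
ros = RandomOverSampler(sampling_strategy='all', random_state=seed)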

In [7]:
# Dictionary to store all the model performance scores
scores = {}

### Logistic Regression

Logistic Regression is a linear model with good interpretability. Linear models are great at indicating the relative importance (predictive power) of different features. I use ridge (L2) regularization and try several regularization strengths to find the one that most improves the model performance.

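- ridge regression
    - uses the L2 regularization technique
    - mitigates the effects of multicollinearity by shrinking the model coefficients
    - reduces the model complexity by preventing extremely large coefficients
    - has a hyperparameter alpha, which controls the penalty term. Higher values of alpha reduce the magnitude of the coefficients. In scikit-learn's LogisticRegression the penalty is set through C, the inverse of the regularization strength, so smaller values of C mean a stronger penalty.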
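The effect of the penalty on the coefficients can be checked with a small experiment. This is a sketch on synthetic data generated with make_classification, not on the TED audio features; the default penalty of LogisticRegression is L2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=3)

# Fit the same L2-penalized model with a strong, medium and weak penalty
coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

# Smaller C (stronger penalty) gives a smaller coefficient norm
print(coef_norms)
```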

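Use the pipeline constructor make_pipeline [doc](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html). The cells in this notebook import it from imbalanced-learn (imblearn.pipeline), which mirrors the scikit-learn version and also accepts samplers such as RandomOverSampler.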

In [8]:
## Fitting a logistic regression model with ridge regression

# Create a parameter grid for hyper-parameter optimization using grid_search
name = 'LogisticRegression'.lower()
param_grid = {name+'__C': [0.1, 1, 10]}

In [9]:
# Logistic regression model
# using the multi-class version because later I will have more than just the "Male" and "Female" classes
# sag: the Stochastic Average Gradient descent solver is fast for large datasets, compatible with multi-class problems, and compatible with ridge regression
# penalty: the norm used in the penalization. The 'sag' solver supports only l2 penalties.
# C: Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.
log = LogisticRegression(penalty='l2', C=0.1, 
                         solver='sag', #'liblinear', #'lbfgs', #'sag',
                         max_iter=1000, n_jobs=-1,
                         tol= 1e-3,
                         class_weight=None,
                         multi_class='multinomial')

#Construct a Pipeline from the given estimators.
# ros: Random Over Sampling of minority class
# scaler: Scale features to their z-scores
# log: logistic regression model

pipe = make_pipeline(ros, scaler, log)

## Hyper-parameter optimization and model evaluation using nested cross-validation

A hyper-parameter grid search consists of:

- an estimator (regressor or classifier such as LogisticRegression());
- a parameter space;
- a method for searching or sampling candidates;
- a cross-validation scheme such as k-fold cross validation; and
- a score function such as neg_log_loss. [scikit-learn](http://scikit-learn.org/stable/modules/grid_search.html)
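The components above can be combined into a nested cross-validation roughly as follows. This is a sketch built only from standard scikit-learn pieces on synthetic data; it is not the grid_search helper from utilities.py (whose internals are not shown here), and it leaves out the oversampling step to stay within scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.75, 0.25], random_state=3)

outer_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=3)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)

# Estimator and parameter space
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid_demo = {'logisticregression__C': [0.1, 1, 10]}

# Inner loop: choose C by the lowest log loss
search = GridSearchCV(pipe_demo, param_grid_demo,
                      scoring='neg_log_loss', cv=inner_cv)

# Outer loop: estimate the performance of the whole tuning procedure
outer_scores = cross_val_score(search, X_demo, y_demo,
                               scoring='neg_log_loss', cv=outer_cv)

print('mean log loss:', -outer_scores.mean())
```

The inner loop never sees the outer test fold, so the outer scores estimate how well the tuned model generalizes rather than how well it fits the data used for tuning.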

In [None]:
# Carry out a grid search to find the parameter values that give 
# the lowest log-loss using stratified k-fold cross-validation.
# grid_search is a user-created function in utilities.py
log = grid_search(pipe, X, y,
                  outer_kfold_cv, inner_kfold_cv,
                  param_grid, scoring='neg_log_loss', 
                  debug=True)

# Pickle the trained model for future use

In [11]:
# Persist model : After training a scikit-learn model, it is desirable to 
# have a way to persist the model for future use without having to retrain.
# joblib.dump: joblib’s replacement of pickle, which is more 
# efficient on objects that carry large numpy arrays internally as is often 
# the case for fitted scikit-learn estimators
joblib.dump(log, 'models/genderListener.pkl');
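A persisted model can later be restored with joblib.load. The round trip below is a sketch on a small stand-in model written to a temporary directory; the file name demo_model.pkl and the synthetic data are placeholders, not the notebook's model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model (illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=3)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(model, path)       # write the fitted estimator to disk
    restored = joblib.load(path)   # read it back without retraining

# The restored model reproduces the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A pickled estimator should be loaded with the same library versions that were used to save it, since compatibility across scikit-learn versions is not guaranteed.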

The final model evaluation score (i.e. the log loss) is 0.34.

The cross-validated precision scores for the "Male" and "Female" classes are 96% and 78%, respectively. The cross-validated recall scores are 90% and 89%. F-score is the harmonic mean of precision and recall.
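As a worked example of the harmonic mean, the F-scores implied by these figures can be computed directly. This assumes the recall values are listed in the same "Male", "Female" order as the precision values; the helper name is made up.

```python
def f1_from(precision, recall):
    """F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above (assumed order: Male, Female)
f1_male = f1_from(0.96, 0.90)
f1_female = f1_from(0.78, 0.89)

print(round(f1_male, 3), round(f1_female, 3))
```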

It was expected that the performance for the minority class ("Female") would be lower because there are only about a third as many unique "Female" samples. While random oversampling helped mitigate the class imbalance problem, the performance difference between classifying "Male" and "Female" is still noticeable.