# Add gender_sound labels to data

Now that the metadata and audio features are in a single table, we can run the data through the already-trained genderListener classifier.

The pickled genderListener model is a binary logistic regression classifier trained on 1,096 TED talks. Its male/female training labels come from gender_guesser, which assigns a label based on the main speaker's first name.
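As a rough illustration of name-based labels, the sketch below collapses gender_guesser's categories into a binary class. The mapping and the helper `name_label_to_class` are hypothetical; the rule actually used to build `gender_name_class` is not shown in this notebook. The 0 = male, 1 = female encoding follows the label decoding used later in this notebook.

```python
# Hypothetical mapping from gender_guesser categories to a binary class.
# Assumption: 0 = male and 1 = female, matching the decoding used later on.
NAME_LABEL_TO_CLASS = {
    "male": 0,
    "mostly_male": 0,
    "female": 1,
    "mostly_female": 1,
}

def name_label_to_class(name_label):
    """Return 0 (male), 1 (female), or None for ambiguous or unknown names."""
    return NAME_LABEL_TO_CLASS.get(name_label)

examples = ["male", "mostly_male", "female", "unknown", "andy"]
classes = [name_label_to_class(label) for label in examples]
print(classes)  # [0, 0, 1, None, None]
```

Names that gender_guesser cannot resolve ("unknown") or treats as androgynous ("andy") get no class here, so such talks would simply be left out of a training set built this way.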

Note that some talks in this dataset have multiple speakers, and some are musical performances rather than pure speech. genderListener assigns a label based only on the audio features in the data, which were computed from 240-second audio segments (each value is the average over 10 subsamples).
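The averaging step can be pictured as follows. This is a minimal sketch with made-up numbers and only three features; how the subsamples were actually drawn is covered in notebooks d1 and d2, not here.

```python
import numpy as np

# Made-up values: 10 subsamples x 3 features (stand-ins for ZCR, Energy, EnergyEntropy).
rng = np.random.default_rng(0)
subsample_features = rng.normal(size=(10, 3))

# One feature vector per talk: the mean of each feature over the 10 subsamples.
talk_features = subsample_features.mean(axis=0)
print(talk_features.shape)  # (3,)
```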

You can use pyAudioAnalysis and the code in notebooks d1 and d2 to extract different samples from the TED audio or from any other audio files.

In [36]:
# Standard library
import math
from collections import OrderedDict

# Numerical and plotting libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors

# Data handling
import pandas as pd

# Model persistence. sklearn.externals.joblib was removed in scikit-learn 0.23;
# on older versions use `from sklearn.externals import joblib` instead.
import joblib

%matplotlib inline

from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)  # show long URLs in full in HTML output (use -1 on pandas < 1.0)


In [26]:
# Load the combined metadata and audio features
df = pd.read_csv('data/meta_audio.csv', index_col=0)
print(df.shape)
df.columns

(1331, 86)


Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'link', 'annualTED', 'film_year', 'published_year',
       'num_speaker_talks', 'technology', 'science', 'global issues',
       'culture', 'design', 'business', 'entertainment', 'health',
       'innovation', 'society', 'Fascinating', 'Courageous', 'Longwinded',
       'Obnoxious', 'Jaw-dropping', 'Inspiring', 'OK', 'Beautiful', 'Funny',
       'Unconvincing', 'Ingenious', 'Informative', 'Confusing', 'Persuasive',
       'wpm', 'words_per_min', 'first_name', 'gender_name',
       'gender_name_class', 'fileName', 'ZCR', 'Energy', 'EnergyEntropy',
       'SpectralCentroid', 'SpectralSpread', 'SpectralEntropy', 'SpectralFlux',
       'SpectralRollof', 'mfcc1', 'mfcc2', 'mfcc3', 'mfcc4', 'mfccC5', 'mfcc6',
       'mfcc7', 'mfcc8', 'mfcc9', 'mfcc10', '

In [27]:
audioCols = ['ZCR', 'Energy', 'EnergyEntropy',
       'SpectralCentroid', 'SpectralSpread', 'SpectralEntropy', 'SpectralFlux',
       'SpectralRollof', 'mfcc1', 'mfcc2', 'mfcc3', 'mfcc4', 'mfccC5', 'mfcc6',
       'mfcc7', 'mfcc8', 'mfcc9', 'mfcc10', 'mfcc11', 'mfcc12', 'mfcc13',
       'Chroma1', 'Chroma2', 'Chroma3', 'Chroma4', 'Chroma5', 'Chroma6',
       'Chroma7', 'Chroma8', 'Chroma9', 'Chroma10', 'Chroma11', 'Chroma12',
       'Chroma_std']

## Load the already-trained genderListener

In [28]:
log = joblib.load('models/genderListener.pkl')

In [29]:
# Generate labels for all the talks that have audio features
# Note that these labels are only as valid as the audio features
# For example, if the audio features were taken from 4 minutes of audio containing multiple speakers,
# then the gender labels will not be valid.

X_allTalks = df[audioCols]
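
Before predicting, it can be worth checking the feature matrix for missing or infinite values, since scikit-learn's logistic regression raises an error on such input. The sketch below uses a toy frame with made-up values as a stand-in for `df[audioCols]`; the names `X_toy`, `bad_rows`, and `X_clean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[audioCols]; the values are made up.
X_toy = pd.DataFrame({
    "ZCR": [0.05, 0.07, np.nan],
    "Energy": [0.02, 0.03, 0.01],
})

# Flag rows with at least one missing or infinite feature value.
bad_rows = ~np.isfinite(X_toy.to_numpy(dtype=float)).all(axis=1)
print(int(bad_rows.sum()))  # 1

# Keep only the rows that can be scored.
X_clean = X_toy.loc[~bad_rows]
```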

In [30]:
genderListenerLabels = log.predict(X_allTalks)
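
For intuition, a binary logistic regression turns each feature row into a label by taking a weighted sum, passing it through a sigmoid, and thresholding at 0.5. The sketch below uses made-up weights for a 3-feature model, not the real genderListener coefficients.

```python
import numpy as np

# Made-up weights and intercept (assumption: not the real genderListener values).
weights = np.array([1.5, -2.0, 0.5])
intercept = -0.25

X = np.array([
    [1.0, 0.0, 0.0],   # linear score =  1.25
    [0.0, 1.0, 0.0],   # linear score = -2.25
])

scores = X @ weights + intercept              # linear decision function
prob_class1 = 1.0 / (1.0 + np.exp(-scores))   # sigmoid gives P(class 1)
predicted = (prob_class1 >= 0.5).astype(int)  # threshold at 0.5
print(predicted)  # [1 0]
```

In this notebook class 0 is decoded as 'male' and any other value as 'female'.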

In [31]:
print(len(genderListenerLabels))

1331


## Add genderListener labels to the rest of the data

In [32]:
labels = ['male' if item == 0 else 'female' for item in genderListenerLabels]
col_index = df.columns.get_loc('gender_name')+1
col_index

50

In [33]:
df.insert(col_index, 'gender_sound', labels)
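
Note that `DataFrame.insert` raises a `ValueError` if the column already exists, so re-running the cell above fails. One way to make the step repeatable is sketched below on a toy frame; `insert_after` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy frame; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({"gender_name": ["male", "female"],
                    "fileName": ["a.wav", "b.wav"]})
toy_labels = ["male", "female"]

def insert_after(frame, anchor_col, new_col, values):
    """Insert new_col right after anchor_col, replacing any earlier copy."""
    if new_col in frame.columns:
        frame = frame.drop(columns=new_col)
    position = frame.columns.get_loc(anchor_col) + 1
    frame.insert(position, new_col, values)
    return frame

toy = insert_after(toy, "gender_name", "gender_sound", toy_labels)
toy = insert_after(toy, "gender_name", "gender_sound", toy_labels)  # safe to repeat
print(list(toy.columns))  # ['gender_name', 'gender_sound', 'fileName']
```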

In [37]:
# Inspect the first 10 rows, comparing name-based and audio-based labels
dftemp = df[['first_name', 'gender_name', 'gender_sound', 'link']][0:10]
display(HTML(dftemp.to_html(escape=False)))  # Note: relies on the 'display.max_colwidth' option set in the imports cell

| | first_name | gender_name | gender_sound | link |
|---|---|---|---|---|
| 0 | Al | male | male | link |
| 1 | David | male | male | link |
| 2 | Majora | unknown | female | link |
| 3 | Hans | male | male | link |
| 4 | Tony | male | male | link |
| 5 | Julia | female | female | link |
| 6 | Dan | male | male | link |
| 7 | Rick | male | male | link |
| 8 | Cameron | mostly_male | male | link |
| 9 | Jehane | unknown | female | link |
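
A quick way to compare the two label sources is a cross-tabulation. The sketch below rebuilds only the ten rows displayed above, so the counts say nothing about the full dataset; the names `preview` and `agreement` are illustrative.

```python
import pandas as pd

# The ten rows displayed above (name-based vs. audio-based labels).
preview = pd.DataFrame({
    "gender_name": ["male", "male", "unknown", "male", "male",
                    "female", "male", "male", "mostly_male", "unknown"],
    "gender_sound": ["male", "male", "female", "male", "male",
                     "female", "male", "male", "male", "female"],
})

# Rows: name-based label; columns: audio-based label; cells: number of talks.
agreement = pd.crosstab(preview["gender_name"], preview["gender_sound"])
print(agreement)
```

In this preview the two names that gender_guesser could not resolve (Majora, Jehane) both receive a 'female' audio label, which shows how gender_sound can fill gaps left by the name-based approach.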


In [41]:
df.columns[:53]

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'link', 'annualTED', 'film_year', 'published_year',
       'num_speaker_talks', 'technology', 'science', 'global issues',
       'culture', 'design', 'business', 'entertainment', 'health',
       'innovation', 'society', 'Fascinating', 'Courageous', 'Longwinded',
       'Obnoxious', 'Jaw-dropping', 'Inspiring', 'OK', 'Beautiful', 'Funny',
       'Unconvincing', 'Ingenious', 'Informative', 'Confusing', 'Persuasive',
       'wpm', 'words_per_min', 'first_name', 'gender_name', 'gender_sound',
       'gender_name_class', 'fileName'],
      dtype='object')

In [42]:
df.columns[53:]

Index(['ZCR', 'Energy', 'EnergyEntropy', 'SpectralCentroid', 'SpectralSpread',
       'SpectralEntropy', 'SpectralFlux', 'SpectralRollof', 'mfcc1', 'mfcc2',
       'mfcc3', 'mfcc4', 'mfccC5', 'mfcc6', 'mfcc7', 'mfcc8', 'mfcc9',
       'mfcc10', 'mfcc11', 'mfcc12', 'mfcc13', 'Chroma1', 'Chroma2', 'Chroma3',
       'Chroma4', 'Chroma5', 'Chroma6', 'Chroma7', 'Chroma8', 'Chroma9',
       'Chroma10', 'Chroma11', 'Chroma12', 'Chroma_std'],
      dtype='object')

## Write to file

In [43]:
df.to_csv('data/ted_plus.csv')
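
Because `to_csv` writes the index as the first column by default, the file should be read back with `index_col=0`, as was done for `meta_audio.csv` at the start of this notebook. A minimal round-trip sketch on a toy frame, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

# Toy frame with made-up rows; the column names mirror the notebook.
toy = pd.DataFrame({"first_name": ["Al", "Julia"],
                    "gender_sound": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ted_plus_toy.csv")  # hypothetical file name
    toy.to_csv(path)                                  # index becomes the first column
    restored = pd.read_csv(path, index_col=0)         # read the index back in

print(restored.shape)  # (2, 2)
```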