#### Can we recognize the gender by the character's speech?

In [1]:
import os
import os.path
import shutil
import numpy as np
import pandas as pd
import random
import pickle
import string

from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

import matplotlib.pyplot as plt
%matplotlib inline

Prepare labels (male / female)  
[Female characters are taken from this source](http://www.shakespeareswords.com/Special-Features-Female-Characters)

In [2]:
female_characters_df = pd.read_csv("female_characters.csv")
female_characters_df.character = female_characters_df.character.str.lower()

In [3]:
female_characters_df.sample(n= 10)

Unnamed: 0,character,play_name,gender,replics
85,lavinia,Titus Andronicus,Female,59
114,alice,Henry V,Female,28
106,octavia,Antony and Cleopatra,Female,36
105,cassandra,Troilus and Cressida,Female,37
63,bawd,"Pericles, Prince of Tyre",Female,104
132,lady faulconbridge,King John,Female,15
151,lady northumberland,Henry IV,Female,5
20,celia,As You Like It,Female,282
171,first lady,Cymbeline,Female,1
154,phrynia and timandra,Timon of Athens,Female,4


In [4]:
with open('shakespeare_plays.pickle', 'rb') as handle:
    speeches = pickle.load(handle)

In [5]:
speeches_df = pd.DataFrame(speeches)
speeches_df.speaker = speeches_df.speaker.str.lower()

Every record is a single speech from one of the plays.

In [6]:
speeches_df.sample(n = 10)

Unnamed: 0,act,genre,play_name,scene,scene_name,speaker,speech_number,speech_text
26924,4,Tragedy,Titus Andronicus,3,The same. A public place.,publius,3,"Therefore, my lord, it highly us concerns\nBy ..."
7274,5,Comedy,A Midsummer Night's Dream,1,Athens. The palace of THESEUS.,theseus,55,If we imagine no worse of them than they of\nt...
9055,1,Comedy,Taming of the Shrew,1,Padua. A public place.,lucentio,41,"Tranio, I saw her coral lips to move\nAnd with..."
25902,1,Tragedy,Timon of Athens,1,Athens. A hall in Timon's house.,second lord,146,"Away, unpeaceable dog, or I'll spurn thee hence!"
9435,4,Comedy,Taming of the Shrew,1,PETRUCHIO'S country house.,curtis,18,How?
239,2,Comedy,All's Well That Ends Well,3,Paris. The KING's palace.,parolles,5,So I say.
19277,2,Tragedy,Coriolanus,1,Rome. A public place.,sicinius,79,He cannot temperately transport his honours\nF...
23162,1,Tragedy,Macbeth,3,A heath near Forres.,banquo,50,Very gladly.
15884,3,History,King John,4,The same. KING PHILIP'S tent.,cardinal pandulph,31,How green you are and fresh in this old world!...
20084,1,Tragedy,Hamlet,1,Elsinore. A platform before the castle.,marcellus,18,Holla! Bernardo!


In [7]:
our_names = set(speeches_df.play_name)
print(our_names)

{"Love's Labours Lost", 'Henry VIII', 'Hamlet', 'Coriolanus', 'Julius Caesar', 'Cymbeline', 'King John', 'Antony and Cleopatra', "A Midsummer Night's Dream", 'Twelfth Night', 'Richard III', 'Macbeth', 'Henry V', 'The Merchant of Venice', 'Troilus and Cressida', 'As You Like It', 'Measure for Measure', "All's Well That Ends Well", 'TheMerry Wives of Windsor', 'Titus Andronicus', 'Two Gentlemen of Verona', 'Pericles, Prince of Tyre', 'The Comedy of Errors', 'The Tempest', 'Romeo and Juliet', 'Othello', 'King Lear', 'Much Ado About Nothing', 'Richard II', "Winter's Tale", 'Taming of the Shrew', 'Timon of Athens'}


In [8]:
their_names = set(female_characters_df.play_name.unique())

In [9]:
print(their_names)

{"Love's Labours Lost", 'Henry VIII', 'Hamlet', 'Coriolanus', 'Julius Caesar', 'Cymbeline', 'King John', 'Henry IV', 'Antony and Cleopatra', "A Midsummer Night's Dream", 'Twelfth Night', 'Richard III', 'Macbeth', 'Henry V', 'The Merchant of Venice', 'Troilus and Cressida', 'As You Like It', 'Measure for Measure', "All's Well That Ends Well", 'TheMerry Wives of Windsor', 'Titus Andronicus', 'King Edward III', 'Two Gentlemen of Verona', 'The Comedy of Errors', 'Pericles, Prince of Tyre', 'The Tempest', 'Romeo and Juliet', 'Othello', 'Henry VI', 'King Lear', 'Much Ado About Nothing', 'Richard II', "Winter's Tale", 'Taming of the Shrew', 'Timon of Athens', 'The Two Noble Kinsmen'}


Check the differences

In [10]:
print(our_names - their_names)

set()


In [11]:
print(their_names - our_names)

{'The Two Noble Kinsmen', 'Henry VI', 'King Edward III', 'Henry IV'}


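Every play in our data appears in the female-character list. The four extra titles in their list (the Henry IV and Henry VI plays, King Edward III and The Two Noble Kinsmen) are simply missing from our speech dataset, so this looks OK.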

In [12]:
our_speakers = set(speeches_df.speaker)
print('All characters \n')
print(our_speakers)

All characters 

{'katharina', 'third gentleman', 'domitius enobarbus', 'sailor', 'orlando', 'katharine', 'bourbon', 'philostrate', 'parolles', 'lord willoughby', 'olivia', 'aemilius', 'helicanus', 'longaville', 'young lucius', 'escalus', 'marcus andronicus', 'duke vincentio', 'king henry v', 'herbert', 'scribe', 'third servant', 'francisca', 'horatio', 'player queen', 'maria', 'lady montague', 'pandar', 'fourth lord', 'balthazar', 'governor', 'rambures', 'anne', 'herald', 'king claudius', 'benvolio', 'casca', 'arragon', 'hortensia', 'eglamour', 'don pedro', 'ceres', 'gurney', 'dumain', 'groom', 'moonshine', 'rosaline', 'iris', 'calpurnia', 'constance', 'bassianus', 'mother', 'doctor caius', 'patience', 'grandpre', 'lavinia', 'sicilius leonatus', 'scroop', 'queen elizabeth', 'lafeu', 'cardinal wolsey', 'king henry viii', 'helena', 'petruchio', 'verges', 'lord marshal', 'grumio', 'young marcius', 'second servingman', 'frenchman', 'old lady', 'henry bolingbroke', 'cassius', 'philemon', '

In [13]:
their_speakers = set(female_characters_df.character)
print('Female characters \n')
print(their_speakers)

Female characters 

{'emilia', 'helenus', 'mistress quickly', 'mariana', 'queen eleanor', 'julia', 'queen katherine', 'first lady', 'blanche', 'katharine', 'cleopatra', 'hecat', 'rosalind', 'olivia', 'lady capulet', 'all the ladies', 'innogen', 'countess', 'dorcas', 'perdita', 'queen', 'francisca', 'viola', 'jessica', 'lady montague', 'maria', 'regan', 'anne', 'lucetta', 'audrey', 'mistress ford and mistress page', 'ursula', 'second lady', 'lady', 'ceres', 'gonerill', 'cassandra', 'lady faulconbridge', 'daughter', 'widow and mariana', 'adriana', 'nurse', 'mistress page', 'rosaline', 'duchess of gloucester', 'constance', 'mistress quickly as queen of fairies', 'iris', 'timandra', 'mother', 'patience', 'lavinia', 'mistress overdone', 'all witches', 'virgilia and valeria', 'queen elizabeth', 'duchess of york', 'dionyza', 'all queens', 'helena', 'lady grey', 'jourdain', 'portia', 'old lady', 'courtesan', 'phrynia', "queen's lady", 'mopsa', 'widow', 'luce', 'cordelia', 'mistress ford', 'nel

Add 'female' column to the dataset

In [14]:
speeches_df['female'] = speeches_df.speaker.isin(their_speakers)
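
The lookup above matches on the character name alone, so a name that recurs in another play would inherit the label no matter which play the speech comes from. A minimal sketch of a stricter match on (play, character) pairs, in plain Python; the rows and the helper `label_female` are made up for illustration and are not part of the dataset:

```python
# Hypothetical rows standing in for female_characters_df and speeches_df.
female_rows = [
    {"play_name": "Play A", "character": "maria"},
]
speech_rows = [
    {"play_name": "Play A", "speaker": "maria", "speech_text": "..."},
    {"play_name": "Play B", "speaker": "maria", "speech_text": "..."},
]

def label_female(speeches, females):
    """Return one boolean per speech: True only if the (play, name) pair is listed."""
    female_pairs = {(f["play_name"], f["character"].lower()) for f in females}
    return [(s["play_name"], s["speaker"].lower()) in female_pairs for s in speeches]

# Matching on the pair versus matching on the name only.
labels_by_pair = label_female(speech_rows, female_rows)
female_names = {f["character"] for f in female_rows}
labels_by_name = [s["speaker"] in female_names for s in speech_rows]
print(labels_by_pair)
print(labels_by_name)
```

This only works if play names are spelled the same way in both sources, which the comparison of play names earlier suggests is the case for the plays we have.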

Some speeches are by a group of people

In [15]:
speeches_df[speeches_df.speaker.str.startswith('all')]

Unnamed: 0,act,genre,play_name,scene,scene_name,speaker,speech_number,speech_text,female
268,2,Comedy,All's Well That Ends Well,3,Paris. The KING's palace.,all,34,"We understand it, and thank heaven for you.",False
587,4,Comedy,All's Well That Ends Well,1,Without the Florentine camp.,all,26,"Cargo, cargo, cargo, villiando par corbo, cargo.",False
3062,5,Comedy,Cymbeline,4,A British prison.,all,19,"Thanks, Jupiter!",False
5692,3,Comedy,TheMerry Wives of Windsor,2,A street.,all,33,Have with you to see this monster.\nExeunt,False
6504,3,Comedy,The Merchant of Venice,2,Belmont. A room in PORTIA'S house.,all,10,"Ding, dong, bell.",False
6904,1,Comedy,A Midsummer Night's Dream,2,Athens. QUINCE'S house.,all,32,"That would hang us, every mother's son.",False
7030,3,Comedy,A Midsummer Night's Dream,1,The wood. TITANIA lying asleep.,all,55,Where shall we go?,False
8374,1,Comedy,"Pericles, Prince of Tyre",4,Tarsus. A room in the Governor's house.,all,19,The gods of Greece protect you!\nAnd we'll pra...,False
8508,2,Comedy,"Pericles, Prince of Tyre",4,Tyre. A room in the Governor's house.,all,14,"Live, noble Helicane!",False
8993,0,Comedy,Taming of the Shrew,2,A bedchamber in the Lord's house.,all,24,Amen.,False


We remove them from the data

In [16]:
speeches_df.drop(speeches_df[speeches_df.speaker.str.startswith('all')].index, inplace = True)

In [17]:
speeches_df.sample(n = 10)

Unnamed: 0,act,genre,play_name,scene,scene_name,speaker,speech_number,speech_text,female
4426,2,Comedy,Measure for Measure,1,A hall In ANGELO's house.,escalus,23,Dost thou detest her therefore?,False
6325,2,Comedy,The Merchant of Venice,2,Venice. A street.,gobbo,24,"Her name is Margery, indeed: I'll be sworn, if...",False
5747,3,Comedy,TheMerry Wives of Windsor,3,A room in FORD'S house.,mistress ford,55,"Why, what have you to do whither they bear it?...",True
11767,1,Comedy,Twelfth Night,5,OLIVIA'S house.,olivia,47,"By mine honour, half drunk. What is he at the ...",True
7403,1,Comedy,Much Ado About Nothing,1,Before LEONATO'S house.,claudio,80,"If this were so, so were it uttered.",False
19139,1,Tragedy,Coriolanus,6,Near the camp of Cominius.,cominius,3,"Though thou speak'st truth,\nMethinks thou spe...",False
24324,3,Tragedy,Othello,4,Before the castle.,desdemona,49,"I say, it is not lost.",True
21645,3,Tragedy,Julius Caesar,2,The Forum.,third citizen,60,O woful day!,False
11655,1,Comedy,Twelfth Night,3,OLIVIA'S house.,sir toby belch,15,"By this hand, they are scoundrels and subtract...",False
5275,1,Comedy,TheMerry Wives of Windsor,1,Windsor. Before PAGE's house.,shallow,111,Here comes fair Mistress Anne.\nRe-enter ANNE ...,False


The data is imbalanced, so we will have to take that into account when fitting the model.

In [18]:
speeches_df.groupby(['female']).size()

female
False    22275
True      4728
dtype: int64
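
One common way to account for the imbalance is to weight each class inversely to its frequency. To our knowledge this is what scikit-learn's `class_weight='balanced'` option does, using `n_samples / (n_classes * class_count)`. A small sketch with the counts above (the helper name `balanced_weights` is ours):

```python
def balanced_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * n) for c, n in class_counts.items()}

# Speech counts per class, taken from the groupby output above.
counts = {"male": 22275, "female": 4728}
weights = balanced_weights(counts)
for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

With these weights a misclassified female speech costs several times as much as a misclassified male one, which pushes the model away from always predicting the majority class.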

In [19]:
# 0 - male, 1 - female
labels = [ 1 if f else 0 for f in speeches_df.female.values ]

In [20]:
features = speeches_df['speech_text'].values

Put 10% of data aside for testing

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, stratify = labels, test_size = 0.10, random_state = 100
)

In [22]:
print('Train data shape', X_train.shape)
print('Test data shape', X_test.shape)

Train data shape (24302,)
Test data shape (2701,)
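
A quick arithmetic check of the split, assuming the test share is rounded up to a whole number of rows (which is consistent with the shapes printed above):

```python
import math

n_total = 22275 + 4728          # all speeches left after dropping group speeches
n_female = 4728

n_test = math.ceil(0.10 * n_total)   # assumption: the test share is rounded up
n_train = n_total - n_test
print(n_train, n_test)

# With stratification the female share should be nearly identical in both parts.
female_share = n_female / n_total
expected_female_test = round(female_share * n_test)
print(f"female share: {female_share:.3f}, expected female test speeches: {expected_female_test}")
```

The expected number of female speeches in the test set can be compared with the `support` column of the classification reports further down.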


We train a classifier to recognize the gender based on a single speech. This is hard!  
The general idea is to use Latent Semantic Analysis. We create features using TF-IDF and apply SVD to extract the latent relationships.  
We use an SGD classifier with a 'hinge' loss; that is equivalent to using a linear SVM but works faster on large datasets.
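
To make the link with a linear SVM concrete, here is a minimal sketch of the hinge loss and of a single stochastic subgradient step on it. It is plain Python with labels coded as -1 / +1 and an L2 penalty `alpha`; the function names and the toy numbers are ours, and it leaves out details of the real `SGDClassifier` such as the learning-rate schedule and the intercept.

```python
def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * <w, x>) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def sgd_step(w, x, y, lr=0.1, alpha=0.0001):
    """One subgradient step on hinge loss plus an L2 penalty (alpha / 2) * ||w||^2."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    new_w = []
    for wi, xi in zip(w, x):
        grad = alpha * wi                 # gradient of the penalty term
        if margin < 1.0:                  # sample is inside the margin or misclassified
            grad -= y * xi                # subgradient of the hinge term
        new_w.append(wi - lr * grad)
    return new_w

w = [0.0, 0.0]
x, y = [1.0, 2.0], 1                      # toy feature vector with a positive label
before = hinge_loss(w, x, y)
w = sgd_step(w, x, y)
after = hinge_loss(w, x, y)
print(before, after)
```

Minimizing this loss with an L2 penalty is the linear SVM objective; SGD just optimizes it one sample at a time instead of solving it in one batch.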

In [23]:
shutil.rmtree('pipeline', ignore_errors = True)
os.makedirs('pipeline')

pipe = Pipeline(
    memory = 'pipeline',
    steps=[
        # Create the feature space
        ('tfidf', TfidfVectorizer(stop_words='english', lowercase = True)),
        # Perform LSA on the features
        ('svd', TruncatedSVD(random_state = 100)),
        # faster than SVC but default loss is still 'hinge'
        ('clf', SGDClassifier(class_weight='balanced', verbose = 0, n_jobs = -1, max_iter = 1000))
    ]
)

param_grid = {
    'tfidf__norm': ('l1', 'l2'),
    'tfidf__use_idf': (True, False),
    'svd__n_components': (1000, 1100, 1200),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),   
}

model = GridSearchCV(
    pipe,
    param_grid = param_grid,
    cv = StratifiedKFold(n_splits = 3),  # unshuffled folds are deterministic; random_state is only valid with shuffle = True
    scoring = 'roc_auc',
    verbose = 1,
    n_jobs = -1)
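
The size of the search follows directly from the grid: every combination of parameter values is fitted once per fold. A small sketch that counts them, assuming 3 folds (the number reported in the fitting log); the dictionary mirrors `param_grid` above:

```python
from itertools import product

grid = {
    "tfidf__norm": ("l1", "l2"),
    "tfidf__use_idf": (True, False),
    "svd__n_components": (1000, 1100, 1200),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}
n_folds = 3  # assumption, taken from the log output

# One candidate per combination of values, one fit per candidate and fold.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

Since the number of fits is the product of all the option counts, adding even one more value to a parameter noticeably lengthens the run.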

In [24]:
model = model.fit(X_train, y_train)
print("The best parameters are %s with a score of %0.2f" % (model.best_params_, model.best_score_))

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed: 47.4min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed: 389.2min finished


The best parameters are {'svd__n_components': 1200, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'clf__alpha': 1e-06, 'clf__penalty': 'l2'} with a score of 0.65
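
The score reported here is the area under the ROC curve. One way to read it: it is the probability that a randomly chosen female speech receives a higher decision score than a randomly chosen male speech, with ties counted as one half. A brute-force sketch of that definition on made-up scores (the function `pairwise_auc` and the numbers are illustrative only, not model output):

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.1]
toy_labels = [1, 1, 0, 0, 0, 0]
toy_auc = pairwise_auc(toy_scores, toy_labels)
print(toy_auc)
```

A value of 0.5 corresponds to a random ranking, so a cross-validated score of 0.65 is better than chance but far from a reliable separation.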


In [25]:
y_hat = model.predict(X_test)

In [27]:
print('Classification report for the SGD classifier')
print(classification_report(y_test, y_hat))

Classification report for the SGD classifier
             precision    recall  f1-score   support

          0       0.88      0.57      0.69      2228
          1       0.23      0.62      0.34       473

avg / total       0.76      0.58      0.63      2701



In [28]:
print('Confusion matrix for the SGD classifier')
print(confusion_matrix(y_test, y_hat))

Confusion matrix for the SGD classifier
[[1260  968]
 [ 179  294]]
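
The numbers in the classification report can be recomputed from this confusion matrix, where rows are the true classes and columns the predicted ones. A short check in plain Python (the helper `precision_recall` is ours):

```python
# Confusion matrix of the SGD classifier: rows = true class, columns = predicted class.
cm = [[1260, 968],
      [179, 294]]

def precision_recall(cm, cls):
    """Precision and recall for class index cls of a square confusion matrix."""
    true_pos = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column sum: everything predicted as cls
    actual = sum(cm[cls])                     # row sum: everything that truly is cls
    return true_pos / predicted, true_pos / actual

for cls, name in enumerate(["male", "female"]):
    p, r = precision_recall(cm, cls)
    f1 = 2 * p * r / (p + r)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Read this way, the classifier finds most of the female speeches, but the large number of male speeches it labels as female keeps the precision for the female class low.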


What if we just use the training set class distribution?

In [29]:
dummy = Pipeline(
    steps=[
        ('vect', CountVectorizer()), 
        ('clf', DummyClassifier(strategy='stratified', random_state=100))
    ]
)
dummy = dummy.fit(X_train, y_train)
y_dummy = dummy.predict(X_test)

In [30]:
print('Classification report for the dummy classifier')
print(classification_report(y_test, y_dummy))

Classification report for the dummy classifier
             precision    recall  f1-score   support

          0       0.83      0.84      0.83      2228
          1       0.18      0.16      0.17       473

avg / total       0.71      0.72      0.72      2701



In [31]:
print('Confusion matrix for the dummy classifier')
print(confusion_matrix(y_test, y_dummy))

Confusion matrix for the dummy classifier
[[1872  356]
 [ 396   77]]
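
Because the classes are imbalanced, overall accuracy favours the dummy classifier. A metric that weights both classes equally makes the comparison fairer. The sketch below computes balanced accuracy (the mean of the per-class recalls) from the two confusion matrices printed in this notebook; this metric is our addition and is not part of the evaluation above.

```python
def balanced_accuracy(cm):
    """Mean of per-class recall; each row of cm holds the counts for one true class."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    return sum(recalls) / len(recalls)

sgd_cm = [[1260, 968], [179, 294]]     # SGD classifier
dummy_cm = [[1872, 356], [396, 77]]    # stratified dummy classifier

sgd_score = balanced_accuracy(sgd_cm)
dummy_score = balanced_accuracy(dummy_cm)
print(f"SGD: {sgd_score:.2f}, dummy: {dummy_score:.2f}")
```

On this measure the dummy classifier sits at chance level, while the SGD model is somewhat above it.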


Our classifier recognizes female characters better than the dummy classifier (recall 0.62 against 0.16), but it pays for this by mislabeling many male speeches (male recall 0.57 against 0.84), and its precision for the female class is only 0.23. Overall we think we've failed to achieve a significant improvement over the dummy classifier. Well, it was a hard job!  
We have to conclude that a single speech is not enough to recognize a character's gender reliably with this approach. We think better results could be achieved if we used the combination of all speeches per character as the observations.
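
The per-character idea could start from something like the sketch below, which concatenates all speeches of a character within a play into one document. It is plain Python on made-up rows; the helper `group_by_character` is hypothetical, and with the real data the same grouping would be done on `speeches_df`. The resulting documents would be far fewer than the individual speeches, so the evaluation would need rethinking as well.

```python
from collections import defaultdict

# Hypothetical rows in the shape of speeches_df records.
rows = [
    {"play_name": "Play A", "speaker": "first", "speech_text": "Good morrow.", "female": True},
    {"play_name": "Play A", "speaker": "second", "speech_text": "How now?", "female": False},
    {"play_name": "Play A", "speaker": "first", "speech_text": "Farewell.", "female": True},
]

def group_by_character(rows):
    """Join the speeches of each (play, speaker) pair into a single text."""
    texts = defaultdict(list)
    labels = {}
    for r in rows:
        key = (r["play_name"], r["speaker"])
        texts[key].append(r["speech_text"])
        labels[key] = 1 if r["female"] else 0
    keys = sorted(texts)
    features = [" ".join(texts[k]) for k in keys]
    return keys, features, [labels[k] for k in keys]

char_keys, char_features, char_labels = group_by_character(rows)
print(char_keys)
print(char_features)
print(char_labels)
```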