In [1]:
import re

import pandas as pd
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

In [2]:
lines = pd.read_csv('shakespeare-plays/Shakespeare_data.csv')

The `Dataline` column is just a row id. It carries no useful information, so we discard it from the beginning.

In [3]:
del lines['Dataline']

Some of the rows in this dataset are stage directions rather than spoken lines; these are the rows with no `ActSceneLine` value.  Since they are not spoken by a player, we exclude them.  One other row is also mysteriously missing a player, which breaks model training, so it is dropped as well.

In [4]:
# Drop stage directions (no ActSceneLine) and the one row with no Player
lines = lines.dropna(subset=['ActSceneLine', 'Player'])

# Text Feature Engineering
This first section focuses on feature engineering from the lines spoken by players.

Stripping out non-alphanumeric characters helps improve accuracy, since commas, apostrophes, and the like are mostly noise that will confuse models.

In [5]:
lines = lines.assign(
    # Keep only letters, digits, and whitespace (A-Z, not A-z, so that [ \ ] ^ _ ` are removed too)
    PlayerLine=lines['PlayerLine'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
)
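One subtlety in character classes of this kind: the range `A-z` spans every ASCII code point between `A` and `z`, which includes the punctuation characters `[`, `\`, `]`, `^`, `_`, and the backtick, whereas `A-Z` covers only the uppercase letters. The sketch below illustrates the difference on a made-up line; the sample text and the pattern names are hypothetical and not taken from the dataset.

```python
import re

# Range A-z covers ASCII 65..122, which also lets [ \ ] ^ _ ` through.
LOOSE = re.compile(r'[^a-zA-z0-9\s]')
# Range A-Z covers uppercase letters only.
STRICT = re.compile(r'[^a-zA-Z0-9\s]')

# Hypothetical sample line, for illustration only.
sample = "[Aside] A little more than kin, and less than kind."

loose_clean = LOOSE.sub('', sample)
strict_clean = STRICT.sub('', sample)

print(loose_clean)   # brackets survive
print(strict_clean)  # brackets are stripped
```

Both patterns remove the comma and the full stop; only the stricter one also removes the square brackets.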

This is the most important stage in text feature engineering.  The CountVectorizer converts each text input into a vector that records how many times each word occurs, a technique known as 'bag of words'.  The resulting matrix has one row per input and one column per distinct word that appears in the inputs.  This matrix can then be fed to classic learning models, such as logistic regression and support vector machines.  Deep learning techniques, such as neural networks, can also be trained on the data, but these models are typically much slower to train, and the output of the CountVectorizer is large.

The CountVectorizer also comes with a built-in list of stop words.  Stop words are words such as 'a', 'an', 'and', and 'the' that have been identified as having little semantic meaning.  Removing these words, as the CountVectorizer does here, can help improve accuracy, since they are a source of noise.  Enabling stop-word removal improved accuracy by about 2%.  Note that this list is intended for modern English, and thus may be only partially applicable to Shakespearean English.  A hand-tailored set of stop words would likely yield additional improvements.

In [6]:
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(lines['PlayerLine'])
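To make the bag-of-words idea concrete, here is a minimal sketch on two made-up lines (not drawn from the dataset). It fits one vectorizer without stop-word removal and one with `stop_words='english'`, then compares the vocabularies. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical lines, for illustration only.
toy_lines = [
    "to be or not to be",
    "the lady doth protest too much",
]

# Plain bag of words: one column per distinct word.
plain = CountVectorizer()
plain_counts = plain.fit_transform(toy_lines)

# Same corpus with the built-in modern-English stop list.
filtered = CountVectorizer(stop_words='english')
filtered_counts = filtered.fit_transform(toy_lines)

# vocabulary_ maps each word to its column index.
to_column = plain.vocabulary_['to']
print(plain_counts.shape, plain_counts[0, to_column])  # 'to' occurs twice in line 0
print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Note that an archaic function word such as "doth" survives the modern-English stop list, which is the kind of gap a hand-tailored list could close.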

# Other Feature Engineering
Play names are one-hot encoded, so that each play gets its own 0/1 column.  Models can consume this format directly, unlike the text name of the play.

Act-scene-line values are label-encoded into integer codes, since this text format doesn't easily convert to numbers on its own.

In [7]:
le = LabelEncoder()
lines['ActSceneLine'] = le.fit_transform(lines['ActSceneLine'])

play_onehot = pd.get_dummies(lines['Play'], prefix='Play', sparse=True)
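As a small illustration of what these two encodings produce, the sketch below runs them on a made-up three-row frame. The dotted act.scene.line strings and the play names are assumptions chosen for the example. One thing it shows is that `LabelEncoder` assigns codes in sorted string order, so the codes do not necessarily follow numeric line order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "Play": ["Hamlet", "Macbeth", "Hamlet"],
    "ActSceneLine": ["1.1.2", "1.1.10", "1.1.1"],
})

# LabelEncoder sorts the distinct strings and numbers them 0..n-1.
encoder = LabelEncoder()
codes = encoder.fit_transform(toy["ActSceneLine"])
classes = list(encoder.classes_)

# One column per play, named with the given prefix.
onehot = pd.get_dummies(toy["Play"], prefix="Play")

print(classes)        # sorted as strings: '1.1.10' comes before '1.1.2'
print(list(codes))
print(list(onehot.columns))
```

Because the ordering is lexicographic, the integer codes are best read as arbitrary identifiers rather than as positions within a play.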

The engineered features are stacked together horizontally into a single matrix, then split into a training set (70%) and a test set (30%).

In [8]:
final_data = hstack((lines[['PlayerLinenumber', 'ActSceneLine']], play_onehot, cv_matrix))

X_train, X_test, y_train, y_test = train_test_split(final_data, lines['Player'], test_size=.3, random_state=0)
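The stacking step can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the notebook's exact pipeline: the text, the numeric columns, and the labels are all made up, and the dense block is converted to a sparse matrix explicitly before stacking.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical inputs, for illustration only.
toy_text = [
    "to be or not to be",
    "the lady doth protest too much",
    "get thee to a nunnery",
    "brevity is the soul of wit",
]
toy_numeric = np.array([[1, 0], [2, 1], [3, 2], [4, 3]])
toy_labels = ["HAMLET", "GERTRUDE", "HAMLET", "POLONIUS"]

# Sparse word counts: one row per line of text.
text_block = CountVectorizer().fit_transform(toy_text)

# Place the numeric columns to the left of the word-count columns.
stacked = hstack((csr_matrix(toy_numeric), text_block), format='csr')

# Hold out 25% of the rows (one row here) for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    stacked, toy_labels, test_size=0.25, random_state=0
)

print(stacked.shape, X_tr.shape, X_te.shape)
```

Keeping everything sparse matters here because the word-count block has one column per vocabulary word, and almost all of its entries are zero.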

Decision trees are a nice choice for this dataset, since they train quickly (and we have a *lot* of engineered data) and can get surprisingly good results for such short training times.  Our final accuracy on the held-out test set is about 72%.

In [9]:
t = tree.DecisionTreeClassifier()
t.fit(X_train, y_train)
print(t.score(X_test, y_test))

0.7230710708172193
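An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses made-up data rather than the Shakespeare set. It compares a decision tree with scikit-learn's `DummyClassifier`, which always predicts the most frequent training label. The character names and feature values are hypothetical, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: a single feature that perfectly separates two speakers.
X_train = np.array([[0]] * 6 + [[1]] * 4)
y_train = np.array(["HAMLET"] * 6 + ["OPHELIA"] * 4)
X_test = np.array([[0]] * 3 + [[1]] * 2)
y_test = np.array(["HAMLET"] * 3 + ["OPHELIA"] * 2)

# Baseline: always predict the most common training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Model under test.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"tree accuracy:     {model_acc:.2f}")
```

Running the same comparison on `X_train`, `X_test`, `y_train`, and `y_test` from the notebook would show how much of the tree's score comes from the features rather than from class imbalance.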
