<a href="https://colab.research.google.com/github/emilymikeska1/EECS731_Project2/blob/master/Project2_EM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.tree import export_graphviz
import graphviz
import matplotlib.pyplot as plt
from dtreeviz.trees import *
from sklearn.model_selection import cross_validate

In [2]:
df=pd.read_csv('/content/drive/My Drive/Colab Notebooks/project2/Shakespeare_data.csv')

Data preprocessing includes dropping rows with missing values, removing excess punctuation, converting the dtypes of some features, and label-encoding others.

In [3]:
df=df.dropna().reset_index(drop=True)
# Series.replace only matches whole cell values, so use str.replace to strip punctuation inside each line
df['PlayerLine']=df['PlayerLine'].str.replace(r"[,.!?:;']", '', regex=True)
df.shape

(105152, 6)
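
A note on the punctuation step: `Series.replace` with a plain string matches whole cell values, while `Series.str.replace` works on substrings inside each cell. A minimal sketch of the difference, using a toy Series in place of `df['PlayerLine']` (the sample strings are illustrative only):

```python
import pandas as pd

# Toy stand-in for df['PlayerLine']; the values are illustrative only.
lines = pd.Series(["So shaken as we are, so wan with care,", ",", "What, ho!"])

# Series.replace compares against the whole cell, so only the cell that is exactly ',' changes.
whole_cell = lines.replace(',', '')

# Series.str.replace operates on substrings; the character class removes each listed mark.
stripped = lines.str.replace(r"[,.!?:;']", '', regex=True)

print(whole_cell.tolist())
print(stripped.tolist())
```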

In [4]:
df.dtypes

Dataline              int64
Play                 object
PlayerLinenumber    float64
ActSceneLine         object
Player               object
PlayerLine           object
dtype: object

In [5]:
le=preprocessing.LabelEncoder()
df['Play']=le.fit_transform(df['Play'])
df['Player']=df['Player'].astype('category')
df['ActSceneLine']=df['ActSceneLine'].astype('str')
df[['Act','Scene','Line']]=df['ActSceneLine'].str.split('.', expand=True).astype('int')
df.drop(['ActSceneLine'], axis=1, inplace=True)

In [6]:
df.columns

Index(['Dataline', 'Play', 'PlayerLinenumber', 'Player', 'PlayerLine', 'Act',
       'Scene', 'Line'],
      dtype='object')

I wanted to incorporate the Player Lines into the classification model. To do this, I first vectorized the lines and then tried to add the vectors to the dataframe as a feature. This did not work well for two reasons. First, after some research I realized that I should have both vectorized and embedded the words. Second, the vectors did not translate directly into a feature because I vectorized the corpus in one bulk piece (I should have split the lines by character, then vectorized them by character).

In [7]:
#corpus=[]
#for a in df['PlayerLine']:
#  corpus.append(a)
#vectorizer=CountVectorizer()
#vects=vectorizer.fit_transform(corpus)
#vects=vects.toarray()
#print(vectorizer.get_feature_names())
#print(vects[0:10])
#df['Vectors']=vects
#print(df[0:10])
#print(vects[0:50])
#print(vects.shape)
#df['Vectors']=pd.Series([vects])
#df['Vectors']=df['Vectors'].astype('str')
#df['Vectors'].head
#df[['A','B','C']]=df['Vectors'].str.split
#df.drop(['PlayerLine'], axis=1, inplace=True)

After striking out with word vectorization, I decided to move on without the Player Lines as a feature.
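
For reference, one possible way to use bag-of-words counts as features is to keep them as a sparse matrix and stack them next to the numeric columns, rather than storing them in a single dataframe column. This is only a sketch of that idea, not what the notebook did; the three sample lines and the numeric values below are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample lines and numeric features (Act, Scene, Line); not taken from the dataset.
lines = ["to be or not to be", "the lady doth protest too much", "all the world is a stage"]
numeric = np.array([[3, 1, 56], [3, 2, 230], [2, 7, 139]])

# One row of word counts per line, kept sparse to save memory.
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(lines)

# Stack the numeric columns and the word counts side by side into one feature matrix.
features = hstack([csr_matrix(numeric), bag_of_words]).tocsr()
print(features.shape)
```

scikit-learn's tree-based classifiers accept sparse input, so a matrix like `features` could be passed to `fit` without calling `toarray()`.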

In [7]:
df.drop(['Dataline'], axis=1, inplace=True)

In [8]:
df.drop(['PlayerLine'], axis=1, inplace=True)

In [9]:
df.dtypes

Play                   int64
PlayerLinenumber     float64
Player              category
Act                    int64
Scene                  int64
Line                   int64
dtype: object

The complete dataset is too large for the RAM available to me, so I decided to randomize the rows in the dataframe and then use the first 10,000 rows of randomized data.

In [10]:
df=df.sample(frac=1)

In [11]:
sub_frame=df.head(10000)
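
The shuffle-then-head step can also be written as a single draw with `DataFrame.sample(n=...)`, and passing `random_state` makes the subset reproducible across runs. A small sketch on a made-up frame (the column values, sample size, and seed are placeholders, not the notebook's data):

```python
import pandas as pd

# Made-up frame standing in for the preprocessed dataframe.
df_demo = pd.DataFrame({"Play": range(100), "Line": range(100, 200)})

# Draw 10 distinct rows at random; the fixed seed makes the draw repeatable.
sub_demo = df_demo.sample(n=10, random_state=0)
same_draw = df_demo.sample(n=10, random_state=0)

print(sub_demo.shape)
```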

Define X and y and split the data into an 80-20 train-test ratio.

In [12]:
X=sub_frame.drop('Player', axis=1)
y=sub_frame['Player']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
X.columns

Index(['Play', 'PlayerLinenumber', 'Act', 'Scene', 'Line'], dtype='object')

Build Random Forest and Decision Tree classifier models!

In [14]:
model=RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [34]:
model1=DecisionTreeClassifier()
model1.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Evaluate each model on the held-out test set and run 10-fold cross-validation on the full 10,000-row sample.

In [15]:
y_pred=model.predict(X_test)
cv=cross_validate(model, X, y, cv=10)
print(y_pred)
print('Accuracy:',metrics.accuracy_score(y_test, y_pred))
print(cv['test_score'])
print(cv['test_score'].mean())



['DON PEDRO' 'ESCALUS' 'SIR TOBY BELCH' ... 'SEBASTIAN' 'BELARIUS'
 'ULYSSES']
Accuracy: 0.554
[0.591 0.552 0.577 0.599 0.576 0.553 0.561 0.586 0.581 0.55 ]
0.5726


In [35]:
y_pred1=model1.predict(X_test)
cv1=cross_validate(model1, X, y, cv=10)
print(y_pred1)
print('Accuracy:',metrics.accuracy_score(y_test, y_pred1))
print(cv1['test_score'])
print(cv1['test_score'].mean())



['HELENA' 'HAMLET' 'Clown' ... 'First Clown' 'GRUMIO' 'MENENIUS']
Accuracy: 0.5155
[0.524 0.522 0.53  0.562 0.529 0.54  0.502 0.541 0.519 0.538]
0.5307000000000001
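
The cells above cross-validate on all of `X` and `y`, which includes the rows held out in `X_test`. A common variant is to cross-validate on the training split only, so the test split stays unseen until the final score. Below is a minimal sketch of that variant on synthetic data; the dataset, fold count, and seeds are arbitrary assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the Shakespeare features; sizes and seeds are arbitrary.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# cross_validate fits clones of the estimator on each fold, so `forest` itself is not fitted here.
cv_demo = cross_validate(forest, X_tr, y_tr, cv=5)
print(cv_demo['test_score'].mean())

# Fit once on the whole training split, then score on the untouched test split.
forest.fit(X_tr, y_tr)
test_accuracy = forest.score(X_te, y_te)
print(test_accuracy)
```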


Attempted to view the tree from the Decision Tree Classifier, but was unsuccessful...

In [None]:
# dtreeviz expects plain lists of names here, not a Series or a DataFrame
viz=dtreeviz(model1,
             X_train,
             y_train,
             target_name='Player',
             class_names=list(model1.classes_),
             feature_names=list(X_train.columns))
viz.svg()

In [None]:
from IPython.display import display, HTML
display(HTML(viz.svg()))

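Attempted to view a random tree from the Random Forest Classifier, but was unsuccessful again (sadly). Note that `graph.render` in the cell below is written without call parentheses, so the cell only returns the bound method instead of rendering the tree.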

In [16]:
random_tree=model.estimators_[34]
features=list(X.columns)
data=export_graphviz(random_tree, feature_names=features, max_depth=5, filled=True, rounded=True)
graph=graphviz.Source(data)
graph.render

<bound method File.render of <graphviz.files.Source object at 0x7f1e404b1518>>
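
A minimal sketch of how one tree from a forest could be exported and rendered, assuming the `graphviz` Python package and the Graphviz binaries are installed. It uses the iris dataset as a stand-in; the tree index, depth limit, and the helper name `render_tree` are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Stand-in data and model; the notebook's own X and model are not used here.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(iris.data, iris.target)
one_tree = forest.estimators_[3]

# With out_file=None, export_graphviz returns the DOT description as a string.
dot_source = export_graphviz(
    one_tree,
    out_file=None,
    feature_names=iris.feature_names,
    max_depth=3,
    filled=True,
    rounded=True,
)

def render_tree(dot, filename="random_tree"):
    """Hypothetical helper: write the tree to <filename>.png and return the file path."""
    import graphviz  # imported here because it needs the Graphviz binaries at render time
    # render must be called with parentheses; it returns the path of the rendered file.
    return graphviz.Source(dot).render(filename, format="png", cleanup=True)

print(dot_source[:60])
```

In a notebook, leaving a `graphviz.Source` object as the last expression of a cell should also display the tree inline.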