# Determining the Players for each Shakespeare Play

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

shakespeareData = pd.read_csv("../data/Shakespeare_data.csv") 
shakespeareData

FileNotFoundError: File b'./data/Shakespeare_data.csv' does not exist

Before any deeper exploratory data analysis, a few things about the dataset above are worth noting. The PlayerLinenumber, ActSceneLine, and Player columns all contain missing values. We can therefore start cleaning the dataset by dropping every row in which one of these values is missing.

In [None]:
# Dropping all rows with NaN values.
shakespeareData = shakespeareData.dropna()
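
Before dropping anything, it can help to see how many values are actually missing in each column. The sketch below uses a small made-up DataFrame with the same column names; `toy` and its values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the dataset (values are invented).
toy = pd.DataFrame({
    "Play": ["Hamlet", "Hamlet", "Hamlet", "Hamlet"],
    "PlayerLinenumber": [np.nan, 1.0, 1.0, 2.0],
    "ActSceneLine": [None, "1.1.1", "1.1.2", None],
    "Player": [None, "BERNARDO", "BERNARDO", "FRANCISCO"],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value, as in the cell above.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Comparing the row counts before and after shows how much data the cleaning step discards.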

## Using Feature Engineering

Next, I will sketch a few ideas for how feature engineering can add value to the dataset above.

### Idea 1 (Selection)

To build a classification model that predicts a player, we need to use the columns as features. Some columns are more promising than others, so I will drop two of them. The first is the Dataline column, as it adds no value for predicting players. The second and last is the PlayerLine column. Although PlayerLine could be an extremely valuable feature for predicting a player, I feel it is too complex for the purpose of this exercise, and I am not sure how to transform it since it is free text rather than a category.

In [None]:
del shakespeareData['Dataline']
del shakespeareData['PlayerLine']

### Idea 2 (Extraction)

To keep the feature types consistent, I want every feature to be an integer. Play and Player will become integers through label encoding, which leaves PlayerLinenumber and ActSceneLine. Converting PlayerLinenumber is straightforward; ActSceneLine is harder because its values contain dots. My plan is simply to remove the dots and concatenate the digits into one larger integer, so 1.1.1 becomes 111 and 5.3.169 becomes 53169.

In [None]:
shakespeareData['PlayerLinenumber'] = shakespeareData['PlayerLinenumber'].astype(int)
# regex=False treats '.' as a literal dot rather than a regex wildcard.
shakespeareData['ActSceneLine'] = shakespeareData['ActSceneLine'].str.replace('.', '', regex=False).astype(int)
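
One caveat of plain concatenation is that different act, scene, and line combinations can collapse to the same integer: 1.11.1 and 11.1.1 both become 1111. A possible alternative, sketched below, pads the scene and line to a fixed width before joining. The function name `encode_act_scene_line` and the widths (two digits for the scene, four for the line) are assumptions for illustration, not part of the original notebook.

```python
def concat_digits(value: str) -> int:
    """Encoding used above: drop the dots and read the rest as one integer."""
    return int(value.replace(".", ""))


def encode_act_scene_line(value: str, scene_width: int = 2, line_width: int = 4) -> int:
    """Hypothetical alternative: zero-pad scene and line so codes stay unique.

    Assumes scene < 10**scene_width and line < 10**line_width.
    """
    act, scene, line = (int(part) for part in value.split("."))
    return (act * 10**scene_width + scene) * 10**line_width + line


# Plain concatenation maps two different positions to the same number.
print(concat_digits("1.11.1"), concat_digits("11.1.1"))

# The padded encoding keeps them apart.
print(encode_act_scene_line("1.11.1"), encode_act_scene_line("11.1.1"))
```

With pandas, this could be applied as `shakespeareData['ActSceneLine'].map(encode_act_scene_line)` while the column still holds the dotted strings.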

### Idea 3 (Transformation)

Many algorithms cannot work directly with categorical text data such as the values in the Play and Player columns, so we need a way to turn them into numbers. One way is label encoding, which converts each distinct value in a column to an integer.

In [None]:
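# Use one encoder per column so that each mapping can be inverted later.
playEncoder = LabelEncoder()
playerEncoder = LabelEncoder()
shakespeareData['Play'] = playEncoder.fit_transform(shakespeareData['Play'])
shakespeareData['Player'] = playerEncoder.fit_transform(shakespeareData['Player'])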
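
As a small illustration of what label encoding does, the sketch below encodes a handful of made-up play titles and then maps the integers back to text. `LabelEncoder` assigns codes in sorted order of the distinct values; the sample list is invented for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample values standing in for the Play column.
plays = ["Macbeth", "Hamlet", "Othello", "Hamlet"]

encoder = LabelEncoder()
codes = encoder.fit_transform(plays)

print(list(encoder.classes_))  # distinct values, sorted
print(list(codes))             # integer code for each original value

# inverse_transform recovers the original text from the codes.
print(list(encoder.inverse_transform(codes)))
```

Because the codes are arbitrary, a model may read an ordering into them that the plays or players do not really have.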

Now our dataset is ready for the classification model!

In [None]:
shakespeareData

## Building the Classification Model 

Finally, it is time to build our classification model, using Play, PlayerLinenumber, and ActSceneLine as features to determine the player. I will use 80% of the data to train the model and 20% to test it.

The first classification model we will use is Gaussian Naive Bayes, as it is easy to implement and to reason about.

### Data pre-processing 

In [None]:
x = shakespeareData[['PlayerLinenumber','Play', 'ActSceneLine']]
y = shakespeareData['Player']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
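
Because `train_test_split` shuffles randomly, each run of the cell above produces a different split and slightly different accuracy figures. Passing `random_state` makes the split repeatable; the seed value 42 and the toy arrays below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels (invented), 10 rows.
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

# Same seed, same split: 80% train, 20% test.
first = train_test_split(features, targets, test_size=0.20, random_state=42)
second = train_test_split(features, targets, test_size=0.20, random_state=42)

x_tr, x_te, y_tr, y_te = first
print(len(x_tr), len(x_te))
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```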

### Training the Gaussian Naive Bayes Model

In [None]:
modelG = GaussianNB()
modelG.fit(x_train, y_train)

### Prediction and Gaussian Naive Bayes Model Evaluation

In [None]:
y_pred = modelG.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)*100
accuracy

Oh no! With an accuracy rate of approximately 16.5%, this classifier is not accurate at all.
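
To put a number like this in context, it helps to compare it with a trivial baseline that always predicts the most frequent player in the training set. The helper below is a hypothetical sketch (the name `majority_baseline_accuracy` and the sample labels are made up); it works on any iterable of labels, including `y_train` and `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels) -> float:
    """Accuracy (in percent) of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    test_labels = list(test_labels)
    hits = sum(1 for label in test_labels if label == most_common_label)
    return 100.0 * hits / len(test_labels)


# Invented labels: player 3 is the most frequent in the training part.
toy_train = [3, 3, 3, 7, 1, 3, 7]
toy_test = [3, 7, 3, 1]
print(majority_baseline_accuracy(toy_train, toy_test))
```

A classifier is only informative to the extent that it beats this baseline.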

Since the Gaussian Naive Bayes model did not determine the players accurately, let's try another classification model. Next I will try a decision tree, as it is also easy to understand and extremely fast. Let's aim for >70% this time.

### Training the Decision Tree Model

In [None]:
modelDT = DecisionTreeClassifier()
modelDT.fit(x_train, y_train)

### Prediction and Decision Tree Model Evaluation

In [None]:
y_pred = modelDT.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)*100
accuracy

Nice! Over 78% accuracy rate.

## Conclusion

In this project, we trained two classification models and compared their accuracy at determining the players in each Shakespeare play. The first model was Gaussian Naive Bayes, which reached an accuracy of approximately 16.5%. The second was a decision tree, which reached approximately 78%. A possible reason why Gaussian Naive Bayes scores so much lower than the decision tree is that the features do not match its assumptions: it models each feature as normally distributed within each class, whereas Play is an arbitrary label-encoded code and ActSceneLine is a concatenation of digits, neither of which is likely to follow a normal distribution.
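
The gap between the two models can be reproduced on synthetic data. In the sketch below (an invented example, not the Shakespeare dataset), the class depends on an integer code in a way that has no bell-shaped structure: even codes belong to one class and odd codes to the other. Both models are scored on the same data they were fitted on, purely to show what each one is able to represent.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one label-encoded feature with codes 0..9, each seen 50 times.
# The class alternates with the code, so neighbouring codes are unrelated.
codes = np.tile(np.arange(10), 50).reshape(-1, 1)
labels = codes.ravel() % 2

nb_model = GaussianNB().fit(codes, labels)
tree_model = DecisionTreeClassifier(random_state=0).fit(codes, labels)

nb_acc = accuracy_score(labels, nb_model.predict(codes))
tree_acc = accuracy_score(labels, tree_model.predict(codes))

print(f"GaussianNB: {nb_acc:.2f}, decision tree: {tree_acc:.2f}")
```

The tree can split on each code separately, while Gaussian Naive Bayes can only compare two overlapping bell curves fitted to the codes of each class.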