# Introduction 
This notebook builds 4 models that predict whether a randomly chosen sentence comes from Shakespeare's Julius Caesar or from Hamlet.  Two methods are used to define the features of the models.  The first method vectorizes the text with tf-idf to find the most relevant words (features) in each corpus.  The second method tokenizes the words and keeps the 1000 most common in each corpus.  The words (features) from both methods are then combined into a unique set that forms the final columns of the dataset.  

Additional features (sentence length and part-of-speech counts) are then added to help distinguish the sentences of one corpus from those of the other.

The first step is to import the necessary modules and import the text data.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
from sklearn.model_selection import train_test_split


import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Fred\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# import the nlp library, spacy
import spacy

In [3]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Fred\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [4]:
# this takes a long time

### need to run as administrator from the Anaconda3 prompt

!python -m spacy download en


    Linking successful
    C:\Users\Fred\Anaconda3\lib\site-packages\en_core_web_sm -->
    C:\Users\Fred\Anaconda3\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')



In [5]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


# Clean and explore the data  

In [6]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove any text enclosed in square brackets (raw string avoids
    # invalid-escape warnings).
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
caesar = gutenberg.raw('shakespeare-caesar.txt')
hamlet = gutenberg.raw('shakespeare-hamlet.txt')

# The Chapter indicator is idiosyncratic
caesar = re.sub(r'Chapter \d+', '', caesar)
hamlet = re.sub(r'CHAPTER .*', '', hamlet)
    
caesar = text_cleaner(caesar[:int(len(caesar)/7)])
hamlet = text_cleaner(hamlet[:int(len(hamlet)/10)])
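
To see what the cleaning steps do, here is a minimal sketch that restates `text_cleaner` so it runs on its own and applies it to two made-up strings (the sample text is an assumption, not taken from the corpus). The bracket pattern `r'\[.*?\]'` is equivalent to the one used in the cell above.

```python
import re

def text_cleaner(text):
    # Same three steps as the notebook's cleaner, restated so this runs alone.
    text = re.sub(r'--', ' ', text)        # double dash -> single space
    text = re.sub(r'\[.*?\]', '', text)    # drop bracketed spans within one line
    return ' '.join(text.split())          # collapse runs of whitespace

# Made-up sample strings for illustration only.
cleaned = text_cleaner("Enter Caesar [Aside] and Brutus--with\n   others.")
print(cleaned)

# '.' does not match a newline by default, so a bracketed span that is
# broken across two lines is left in place.
multiline = text_cleaner("Exit [to the\nCapitol] now")
print(multiline)
```

If bracketed text can span line breaks in the raw files, passing `flags=re.DOTALL` to the second `re.sub` would be one way to catch it.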

In [7]:
print(len(caesar))
print(len(hamlet))

15656
15867


In [8]:
# Parse the cleaned plays. This can take a bit.
# (On this machine it had to be run from an admin terminal.)
nlp = spacy.load('en')
caesar_doc = nlp(caesar)
hamlet_doc = nlp(hamlet)

In [9]:
# Group into sentences.
caesar_sents = [[sent, "Shake_C"] for sent in caesar_doc.sents]
hamlet_sents = [[sent, "Shake_H"] for sent in hamlet_doc.sents]

# Combine the sentences from the two plays into one data frame.
sentences = pd.DataFrame(caesar_sents + hamlet_sents)

pd.set_option('max_colwidth', 60)

sentences.head()

| | 0 | 1 |
|---|---|---|
| 0 | (Actus, Primus, .) | Shake_C |
| 1 | (Scoena, Prima, .) | Shake_C |
| 2 | (Enter, Flauius, ,, Murellus, ,, and, certaine, Commoner... | Shake_C |
| 3 | (Flauius, .) | Shake_C |
| 4 | (Hence, :, home, you, idle, Creatures, ,, get, you, home... | Shake_C |


#### Using tf-idf to establish a list of "common" words to use as features

In [10]:
#reading in the data, this time in the form of paragraphs
caesar = gutenberg.paras('shakespeare-caesar.txt')

#processing
caesar_paras = []
for paragraph in caesar:
#    print(paragraph)

    para = paragraph[0]
    
    #removing the double-dash from all words
    para = [re.sub(r'--', '', word) for word in para]
    
    #Forming each paragraph into a string and adding it to the list of strings.
    caesar_paras.append(' '.join(para))

Fit a tf-idf vectorizer on the Julius Caesar paragraphs to generate a list of its most relevant words (features).

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test = train_test_split(caesar_paras, test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear in at least 2 paragraphs
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case so capitalized and lowercase forms share one feature
                             use_idf=True, # weight terms by inverse document frequency
                             norm=u'l2', # scale each paragraph vector to unit length so longer and shorter paragraphs get treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )


#Applying the vectorizer
caesar_paras_tfidf = vectorizer.fit_transform(caesar_paras)
print("Number of features: %d" % caesar_paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(caesar_paras_tfidf, test_size=0.4, random_state=0)

#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features
terms_c = vectorizer.get_feature_names()
print(terms_c)

Number of features: 138
['actus', 'alarum', 'alarums', 'ant', 'antony', 'army', 'art', 'beare', 'bid', 'body', 'bru', 'brut', 'brutus', 'caes', 'caesar', 'cai', 'calp', 'calphurnia', 'cas', 'cask', 'caska', 'cass', 'cassi', 'cassius', 'cato', 'certaine', 'cic', 'cicero', 'cin', 'cinna', 'clau', 'clit', 'cly', 'come', 'cymber', 'cynna', 'dangerous', 'dard', 'day', 'dec', 'deci', 'decius', 'downe', 'drum', 'enter', 'exeunt', 'exit', 'feare', 'fellow', 'fla', 'flauius', 'flourish', 'forth', 'giues', 'gods', 'good', 'gowne', 'ha', 'haue', 'heare', 'heart', 'heere', 'home', 'house', 'knocke', 'knockes', 'know', 'lep', 'lepidus', 'let', 'letter', 'ligarius', 'lightning', 'looke', 'low', 'luc', 'lucil', 'lucius', 'man', 'manet', 'march', 'mark', 'marke', 'men', 'mess', 'messa', 'messala', 'met', 'metel', 'metellus', 'mettle', 'mou', 'mur', 'murellus', 'nights', 'noble', 'octa', 'octauius', 'pin', 'ple', 'plebeians', 'poet', 'por', 'portia', 'publius', 'push', 'rome', 'say', 'selfe', 'ser', 's
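
The weighting that `TfidfVectorizer` applies with `use_idf=True`, `smooth_idf=True` and `norm='l2'` can be sketched in plain Python. This is a simplified illustration on three made-up documents: it splits on whitespace and skips stop-word removal and the `min_df`/`max_df` pruning, so it is not a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter

# Made-up toy documents; whitespace tokenization is a simplification.
docs = [
    "caesar brutus rome",
    "brutus cassius rome rome",
    "hamlet ghost king",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

def tfidf_row(doc):
    counts = Counter(doc)                                  # raw term frequency
    weights = {t: counts[t] * idf[t] for t in counts}      # tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values())) # L2 norm of the row
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

Terms that appear in fewer documents get a larger idf, so within the first document `caesar` outweighs `rome`, which also occurs in the second document.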

In [12]:
#reading in the data, this time in the form of paragraphs
hamlet = gutenberg.paras('shakespeare-hamlet.txt')

#processing
hamlet_paras = []
for paragraph in hamlet:
#    print(paragraph)

    para = paragraph[0]
    
    #removing the double-dash from all words
    para = [re.sub(r'--', '', word) for word in para]
    
    #Forming each paragraph into a string and adding it to the list of strings.
    hamlet_paras.append(' '.join(para))

Repeat the tf-idf vectorization on the Hamlet paragraphs to generate a list of its most relevant words (features).

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test = train_test_split(hamlet_paras, test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear in at least 2 paragraphs
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case so capitalized and lowercase forms share one feature
                             use_idf=True, # weight terms by inverse document frequency
                             norm=u'l2', # scale each paragraph vector to unit length so longer and shorter paragraphs get treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )


#Applying the vectorizer
hamlet_paras_tfidf = vectorizer.fit_transform(hamlet_paras)
print("Number of features: %d" % hamlet_paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(hamlet_paras_tfidf, test_size=0.4, random_state=0)

#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features
terms_h = vectorizer.get_feature_names()
print(terms_h)

Number of features: 70
['actus', 'againe', 'attendant', 'bap', 'bar', 'barn', 'barnardo', 'clo', 'clown', 'clowne', 'come', 'comes', 'dyes', 'enter', 'exeunt', 'exit', 'fran', 'friends', 'gertrude', 'gho', 'ghost', 'giue', 'guil', 'guild', 'guildenstern', 'guildensterne', 'ham', 'hamlet', 'haue', 'hor', 'hora', 'horatio', 'kin', 'king', 'know', 'laer', 'laertes', 'leaue', 'letter', 'lord', 'lords', 'manet', 'mar', 'messenger', 'noise', 'ophe', 'ophel', 'ophelia', 'osr', 'play', 'player', 'players', 'pol', 'polon', 'polonius', 'qu', 'queen', 'queene', 'reynol', 'rosin', 'rosincrance', 'say', 'scena', 'secunda', 'sings', 'thou', 'time', 'tragedie', 'vpon', 'vs']
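
Both vectorizer cells prune the vocabulary with `min_df=2` and `max_df=0.5`. As a rough sketch of that pruning rule (on made-up paragraphs, and ignoring stop words), an integer `min_df` is an absolute document count while a float `max_df` is a proportion of the documents:

```python
from collections import Counter

# Made-up, already-tokenized paragraphs for illustration only.
paragraphs = [
    ["enter", "caesar", "brutus"],
    ["enter", "brutus", "cassius"],
    ["enter", "caesar", "antony"],
    ["enter", "exeunt"],
]
n = len(paragraphs)

# Number of paragraphs each term appears in (repeats within a paragraph count once).
doc_freq = Counter(t for p in paragraphs for t in set(p))

min_df = 2     # keep terms found in at least 2 paragraphs
max_df = 0.5   # drop terms found in more than half of the paragraphs

kept = sorted(t for t, c in doc_freq.items()
              if c >= min_df and c <= max_df * n)
print(kept)
```

Here `enter` is dropped because it occurs in every paragraph, and the terms that occur in only one paragraph are dropped by `min_df`, which leaves `brutus` and `caesar`.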


Create a list of the most common words in each corpus.

In [14]:
# Utility function to create a list of the 1000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(1000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
caesarwords = bag_of_words(caesar_doc)
hamletwords = bag_of_words(hamlet_doc)

print(caesarwords)

# Combine bags to create a set of unique words. (just add 'terms' here to get the final list)
common_words = set(caesarwords + hamletwords)

# print the number of common words found in caesar and hamlet
print(len(common_words))

['-PRON-', 'caesar', 'and', 'haue', 'man', 'brutus', 'what', 'thou', 'cassi', 'that', 'cassius', 'caes', 'be', 'bru', 'vpon', 'sir', 'but', 'cask', 'selfe', 'hee', 'know', 'why', 'tell', 'good', 'rome', 'to', 'the', 'fall', 'great', 'as', 'caska', 'feare', 'time', 'heare', 'let', 'doth', 'eye', 'loue', 'for', 'crowne', 'day', 'trade', 'art', 'thing', 'shout', 'come', 'if', 'vs', 'when', 'brut', 'then', 'mur', 'Florida', 'go', 'god', 'do', 'downe', 'antony', 'calphurnia', 'againe', 'looke', 'himselfe', 'hath', "offer'd", 'thy', 'cob', 'a', 'bad', 'matter', 'heart', 'tyber', 'way', 'exeunt', 'so', 'will', 'who', 'antonio', 'ant', 'sooth', 'cry', 'hand', 'noble', 'dangerous', 'marke', 'like', 'say', 'enter', 'flauius', 'ouer', 'home', 'walke', 'of', 'speake', 'vse', 'liue', 'worke', 'wherefore', 'street', 'hard', 'passe', 'see', 'sound', 'pray', 'till', 'tongue', 'finde', 'caesars', 'course', 'shake', 'set', 'leaue', 'euery', 'beware', 'ides', 'march', 'look', 'spirit', 'beare', 'tis', 'h
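
The two steps in the cell above (pick the most common words, then count them per sentence) can be sketched without spaCy or pandas. The sentences below are made up and assumed to be already lemmatized and filtered; `top_n` stands in for the 1000 used in the notebook.

```python
from collections import Counter

# Hypothetical pre-tokenized, already-lemmatized sentences.
sentences = [
    ["caesar", "speak", "brutus"],
    ["brutus", "know", "caesar", "caesar"],
    ["ghost", "speak"],
]

# Step 1: the top_n most common words across the corpus form the vocabulary.
top_n = 3
all_words = [w for s in sentences for w in s]
vocabulary = [w for w, _ in Counter(all_words).most_common(top_n)]

# Step 2: one row of counts per sentence, in vocabulary order.
rows = []
for s in sentences:
    counts = Counter(w for w in s if w in vocabulary)
    rows.append([counts[w] for w in vocabulary])

for s, row in zip(sentences, rows):
    print(dict(zip(vocabulary, row)), "<-", s)
```

Collecting the rows in a list and building the frame once, for example with `pd.DataFrame(rows, columns=vocabulary)`, is likely to be faster than incrementing cells one at a time with `df.loc[i, word] += 1`.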

Combine the words (features) found by the 2 methods, 'bag_of_words' and 'tf-idf', into a single set.

In [15]:
common_words_with_tf = set(caesarwords + hamletwords + terms_c + terms_h)
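
As a small illustration of why the combined set is smaller than the sum of its parts, the sketch below merges four made-up word lists (all four lists are hypothetical stand-ins for `caesarwords`, `hamletwords`, `terms_c` and `terms_h`). Sorting the result is an optional extra that gives a repeatable column order, since a plain `set` has no guaranteed order.

```python
# Hypothetical stand-ins for the four word lists built above.
bow_caesar = ["caesar", "brutus", "haue", "know"]
bow_hamlet = ["hamlet", "king", "haue", "know"]
tfidf_caesar = ["brutus", "rome", "enter"]
tfidf_hamlet = ["ghost", "king", "enter"]

all_terms = bow_caesar + bow_hamlet + tfidf_caesar + tfidf_hamlet
combined = set(all_terms)      # duplicates collapse into one entry each
columns = sorted(combined)     # stable, repeatable column order

print(len(all_terms), "terms in total,", len(combined), "unique")
print(columns)
```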

In [16]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words_with_tf)
print(word_counts.shape)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500
Processing row 550
(551, 1456)


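| | sterrile | twill | tyr | start | finger | pollax | work | clau | cheeke | know | ... | when | deckt | fault | gowne | vnderling | fantasie | niobe | caesars | text_sentence | text_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Actus, Primus, .) | Shake_C |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Scoena, Prima, .) | Shake_C |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Enter, Flauius, ,, Murellus, ,, and, certaine, Commoner... | Shake_C |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Flauius, .) | Shake_C |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hence, :, home, you, idle, Creatures, ,, get, you, home... | Shake_C |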


In [17]:
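# Token count per sentence. len() of a spaCy span counts punctuation tokens too,
# so "(Actus, Primus, .)" has a length of 3.
word_counts['num_words'] = word_counts['text_sentence'].apply(len)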
word_counts.head()

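| | sterrile | twill | tyr | start | finger | pollax | work | clau | cheeke | know | ... | deckt | fault | gowne | vnderling | fantasie | niobe | caesars | text_sentence | text_source | num_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Actus, Primus, .) | Shake_C | 3 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Scoena, Prima, .) | Shake_C | 3 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Enter, Flauius, ,, Murellus, ,, and, certaine, Commoner... | Shake_C | 12 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Flauius, .) | Shake_C | 2 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hence, :, home, you, idle, Creatures, ,, get, you, home... | Shake_C | 11 |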


The following code counts the adverbs, interjections and pronouns in each sentence and adds the counts as columns to the word_counts dataframe, in the hope that these additional features will improve the accuracy of the 4 models:  Random Forest, Logistic Regression, Gradient Boosting, and Support Vector Machines.

In [18]:
# Count the tokens in a sentence that carry a given Penn Treebank
# part-of-speech tag, summed over every sentence NLTK finds in the text.
def count_pos(txt, pos_tag):
    count = 0
    for sentence in nltk.sent_tokenize(str(txt)):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        count += sum(1 for _, tag in tagged if tag == pos_tag)
    return count

# Add one column per tag: RB = adverb, UH = interjection, PRP = personal pronoun.
for column, pos_tag in [('count_adverbs', 'RB'),
                        ('count_inter', 'UH'),
                        ('count_pronoun', 'PRP')]:
    word_counts[column] = word_counts['text_sentence'].apply(
        lambda x, t=pos_tag: count_pos(x, t))

# ------------------------------------------------------------------------------------------------------
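
The three part-of-speech counts can also be read off a single tally of tags. The sketch below uses hand-assigned `(token, tag)` pairs in the shape that `nltk.pos_tag` returns; the sentence and its tags are made up for illustration and are not real tagger output.

```python
from collections import Counter

# Hypothetical (token, tag) pairs, hand-tagged with Penn Treebank tags.
tagged = [
    ("Alas", "UH"), (",", ","), ("he", "PRP"), ("quickly", "RB"),
    ("told", "VBD"), ("them", "PRP"), ("so", "RB"),
]

# One pass over the tags gives every count at once.
tag_counts = Counter(tag for _, tag in tagged)

features = {
    "count_adverbs": tag_counts["RB"],   # RB  = adverb
    "count_inter": tag_counts["UH"],     # UH  = interjection
    "count_pronoun": tag_counts["PRP"],  # PRP = personal pronoun
}
print(features)
```

An exact match on `"RB"` or `"PRP"` leaves out related tags such as `RBR`, `RBS` (comparative and superlative adverbs) and `PRP$` (possessive pronouns); whether to include them is a modelling choice.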


Print the first 5 lines of the dataframe to make sure the new features are captured.

In [19]:
word_counts.head()

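| | sterrile | twill | tyr | start | finger | pollax | work | clau | cheeke | know | ... | vnderling | fantasie | niobe | caesars | text_sentence | text_source | num_words | count_adverbs | count_inter | count_pronoun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | (Actus, Primus, .) | Shake_C | 3 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | (Scoena, Prima, .) | Shake_C | 3 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | (Enter, Flauius, ,, Murellus, ,, and, certaine, Commoner... | Shake_C | 12 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | (Flauius, .) | Shake_C | 2 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | (Hence, :, home, you, idle, Creatures, ,, get, you, home... | Shake_C | 11 | 0 | 0 | 2 |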


In [20]:
print(word_counts.shape)

(551, 1460)


# Build the models  
The "text_source" is what the models are trying to predict.

#### Random Forest

In [21]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
# normalize the training data
X_train = sklearn.preprocessing.normalize(X_train)

# normalize the test data
X_test = sklearn.preprocessing.normalize(X_test)

train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))



Training set score: 0.9787878787878788

Test set score: 0.7239819004524887
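
One thing worth keeping in mind about the cell above: with its default settings, `sklearn.preprocessing.normalize` rescales each row (sentence) to unit L2 length, not each column (feature). A plain-Python sketch of that row-wise scaling, on made-up rows whose last column plays the role of `num_words`:

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit L2 length; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm else list(row))
    return out

# Made-up feature rows: [count of word A, count of word B, num_words]
X = [
    [0, 0, 3],    # short sentence with no vocabulary words
    [0, 0, 30],   # long sentence with no vocabulary words
    [1, 2, 12],
]
normalized = l2_normalize_rows(X)
for row in normalized:
    print([round(v, 3) for v in row])
```

The first two rows become identical after scaling, so the absolute sentence length is no longer visible to the model for those rows. Column-wise scaling (for example with a scaler fitted on the training set only) would be an alternative to consider.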


#### Logistic Regression

In [22]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))



(330, 1458) (330,)
Training set score: 0.6515151515151515

Test set score: 0.5520361990950227


#### Gradient Boosting

In [23]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9393939393939394

Test set score: 0.7782805429864253


#### Support Vector Machines

In [24]:
from sklearn.svm import SVC

sv_c = SVC(gamma='auto')
train = sv_c.fit(X_train, y_train)

print('Training set score:', sv_c.score(X_train, y_train))
print('\nTest set score:', sv_c.score(X_test, y_test))

Training set score: 0.5393939393939394

Test set score: 0.5067873303167421


# Evaluation and Conclusion  

The initial accuracy scores (on the test set) for the 4 models are as follows:  

- Random Forest:  0.73  
- Logistic Regression:  0.79  
- Gradient Boosting:  0.79  
- Support Vector Machines:  0.51  

With the additional feature of **'number of words in each sentence'**:

- Random Forest:  0.77  
- Logistic Regression:  0.55  
- Gradient Boosting:  0.77  
- Support Vector Machines:  0.51  

With the additional features of 'number of words' and **parts of speech**:  

- Random Forest:  0.72  
- Logistic Regression:  0.55  
- Gradient Boosting:  0.78  
- Support Vector Machines:  0.51  
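
To put the last set of scores in concrete terms, the test-set accuracies printed by the four model cells can be converted back into counts of correctly classified sentences. The sizes come from the outputs above: 551 sentences in total and 330 in the training set, which leaves 221 in the test set.

```python
total_rows = 551          # from word_counts.shape
train_rows = 330          # from X_train.shape
test_rows = total_rows - train_rows

# Test-set scores as printed by the four model cells.
scores = {
    "Random Forest": 0.7239819004524887,
    "Logistic Regression": 0.5520361990950227,
    "Gradient Boosting": 0.7782805429864253,
    "Support Vector Machines": 0.5067873303167421,
}

# accuracy = correct / test_rows, so correct = accuracy * test_rows
correct = {name: round(acc * test_rows) for name, acc in scores.items()}
for name, n_correct in correct.items():
    print(f"{name}: {n_correct} of {test_rows} sentences correct")
```

With a test set of this size, a difference of 0.01 in accuracy corresponds to only about two sentences, which is worth remembering when comparing the rounds of scores.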


Adding a 'number of words per sentence' feature looks like it mostly added noise to the data.  Logistic Regression fell sharply (0.79 to 0.55), Gradient Boosting slipped slightly (0.79 to 0.77), Support Vector Machines did not change (0.51), and only Random Forest improved (0.73 to 0.77).  This is probably to be expected because the author of each corpus is the same (Shakespeare).  His writing is probably similar enough that the number of words per sentence would be similar in each corpus.  

Adding the parts of speech as features did not help either: Random Forest fell back to 0.72, Gradient Boosting rose only to 0.78, and the other two models did not change.  Again, since the author is the same in this case, there would not be a significant difference in the use of **pronouns, adverbs or interjections**.  If the works considered here had been written many years apart, these added features might improve the accuracy.  However, Julius Caesar and Hamlet both date from around 1599 (Hamlet is usually dated between 1599 and 1601).

