# Introduction 
This notebook imports the popular stories 'Alice in Wonderland' and 'Emma' and tries to predict whether any given sentence is from "Alice" or "Emma".  In these models, 10% of Emma was used instead of the approximately 2% used in the course curriculum.  As a result, the accuracy scores are significantly better.  

The first step is to import the necessary modules and import the text data.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Fred\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# import the nlp library, spacy
import spacy

In [3]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Fred\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [4]:
# this takes a long time

### need to run as administrator from Anaconda3 prompt

!python -m spacy download en


    Linking successful
    C:\Users\Fred\Anaconda3\lib\site-packages\en_core_web_sm -->
    C:\Users\Fred\Anaconda3\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')



In [5]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


# Clean and explore the data  

In [6]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove anything in square brackets (raw string avoids invalid-escape warnings).
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
emma = gutenberg.raw('austen-emma.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
emma = re.sub(r'Chapter \d+', '', emma)
alice = re.sub(r'CHAPTER .*', '', alice)
    
emma = text_cleaner(emma[:int(len(emma)/10)])
alice = text_cleaner(alice[:int(len(alice)/10)])
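
To make the cleaning step concrete, here is a small sketch that applies the same three substitutions to a made-up string. The sample text is hypothetical and not taken from either novel; the function body is a compact restatement of `text_cleaner` for illustration only.

```python
import re

def text_cleaner(text):
    # Replace the double dash with a space.
    text = re.sub(r'--', ' ', text)
    # Drop anything in square brackets, e.g. a bracketed title line.
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse all runs of whitespace (including newlines) to single spaces.
    return ' '.join(text.split())

# Hypothetical input, invented for this example.
sample = "[Sample Title 1900]\n\nShe paused--then   spoke\nagain."
cleaned = text_cleaner(sample)
print(cleaned)
```

The bracketed header disappears, the double dash becomes a word boundary, and the line breaks are flattened, which is the form spaCy is given in the parsing step.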

In [7]:
print(len(emma))
print(len(alice))

87962
14139


In [8]:
# Parse the cleaned novels. This can take a bit.
# had to do this in an admin terminal.......
nlp = spacy.load('en')
alice_doc = nlp(alice)
emma_doc = nlp(emma)

In [9]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + emma_sents)

pd.set_option('max_colwidth', 60)

sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, of, sittin...",Carroll
1,"(So, she, was, considering, in, her, own, mind, (, as, w...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in, that, ;,...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


In [10]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
emmawords = bag_of_words(emma_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + emmawords)
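
As a rough illustration of the idea behind `bag_of_words` and `bow_features`, the sketch below uses plain token lists in place of spaCy documents. The toy sentences and the helper names `most_common_words` and `count_row` are assumptions made for this example, not part of the notebook.

```python
from collections import Counter

# Toy, already-lemmatized sentences standing in for spaCy output.
toy_sentences = [
    ["alice", "rabbit", "late"],
    ["emma", "handsome", "clever", "rich"],
    ["rabbit", "rabbit", "hole"],
]

def most_common_words(tokens, top_n=2000):
    # Keep the top_n most frequent tokens, as bag_of_words does.
    return [word for word, _ in Counter(tokens).most_common(top_n)]

def count_row(sentence, vocab):
    # Count every token once, then look up each vocabulary word.
    counts = Counter(sentence)
    return [counts[word] for word in vocab]

all_tokens = [tok for sent in toy_sentences for tok in sent]
# Sorting gives a stable column order; a bare set has no defined order.
vocab = sorted(set(most_common_words(all_tokens)))
rows = [count_row(sent, vocab) for sent in toy_sentences]

print(vocab)
for row in rows:
    print(row)
```

Counting each sentence once with `Counter` avoids updating the data frame one cell at a time, which is likely the slow part of the row-by-row loop.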

In [11]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500
Processing row 550
Processing row 600
Processing row 650
Processing row 700
Processing row 750
Processing row 800
Processing row 850
Processing row 900
Processing row 950


Unnamed: 0,repent,fortnight,flavour,repeat,key,luxury,inch,competence,irksomeness,distance,...,push,basin,apply,earnestly,illiterate,8th,personage,screw,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, of, sittin...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind, (, as, w...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in, that, ;,...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll


In [12]:
word_counts['num_words'] = word_counts['text_sentence'].apply(lambda x: len(x))
word_counts.head()

Unnamed: 0,repent,fortnight,flavour,repeat,key,luxury,inch,competence,irksomeness,distance,...,basin,apply,earnestly,illiterate,8th,personage,screw,text_sentence,text_source,num_words
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, of, sittin...",Carroll,67
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind, (, as, w...",Carroll,63
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in, that, ;,...",Carroll,33
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll,3
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll,6


The following code counts the number of adverbs, interjections, and pronouns in each sentence and adds them as columns to the word_counts dataframe, in the hope that these additional features will improve the accuracy of the four models:  Random Forest, Logistic Regression, Gradient Boosting, and Support Vector Machines.

In [13]:
# Count how many tokens in a sentence carry a given Penn Treebank POS tag.
def count_pos_tag(txt, pos_tag):
    count = 0
    # str(txt) turns the spaCy span into plain text. NLTK may split that text
    # into more than one sentence, so the counts are summed over all of them.
    for sentence in nltk.sent_tokenize(str(txt)):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        count += sum(1 for _, tag in tagged if tag == pos_tag)
    return count

# Count the adverbs (RB) and add the column to the word_counts dataframe
word_counts['count_adverbs'] = word_counts['text_sentence'].apply(lambda x: count_pos_tag(x, 'RB'))

# Count the interjections (UH) and add the column to the dataframe
word_counts['count_inter'] = word_counts['text_sentence'].apply(lambda x: count_pos_tag(x, 'UH'))

# Count the personal pronouns (PRP) and add the column to the dataframe
word_counts['count_pronoun'] = word_counts['text_sentence'].apply(lambda x: count_pos_tag(x, 'PRP'))
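
The tag codes used above come from the Penn Treebank tag set: RB marks adverbs, UH interjections, and PRP personal pronouns. The sketch below shows the counting step on its own, using hand-written (word, tag) pairs in the shape that `nltk.pos_tag` returns, so it runs without any tagger data. The tags were assigned by hand for illustration and are an assumption, not real tagger output; `count_tags` is a hypothetical helper.

```python
from collections import Counter

# Hand-tagged example in the (word, tag) format produced by nltk.pos_tag.
tagged = [
    ("Oh", "UH"), (",", ","), ("I", "PRP"), ("shall", "MD"),
    ("be", "VB"), ("very", "RB"), ("late", "JJ"), (",", ","),
    ("she", "PRP"), ("said", "VBD"), ("quietly", "RB"), (".", "."),
]

def count_tags(tagged_tokens, wanted=("RB", "UH", "PRP")):
    # Tally every tag once, then read off the tags of interest.
    tally = Counter(tag for _, tag in tagged_tokens)
    return {tag: tally[tag] for tag in wanted}

counts = count_tags(tagged)
print(counts)
```

Tallying all tags in one pass means a sentence only has to be tagged once to produce all three features, rather than once per feature.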


Print the first 5 rows of the dataframe to make sure the new features are captured.

In [14]:
word_counts.head()

Unnamed: 0,repent,fortnight,flavour,repeat,key,luxury,inch,competence,irksomeness,distance,...,illiterate,8th,personage,screw,text_sentence,text_source,num_words,count_adverbs,count_inter,count_pronoun
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, of, sittin...",Carroll,67,2,0,3
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,"(So, she, was, considering, in, her, own, mind, (, as, w...",Carroll,63,5,0,3
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in, that, ;,...",Carroll,33,4,0,2
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,"(Oh, dear, !)",Carroll,3,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,"(I, shall, be, late, !, ')",Carroll,6,1,0,1


# Build the models  
The "text_source" is what the models are trying to predict.

#### Random Forest

In [15]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
# Drop the non-numeric columns; axis must be passed by keyword in recent pandas.
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
# normalize the training data
X_train = sklearn.preprocessing.normalize(X_train)

# normalize the test data
X_test = sklearn.preprocessing.normalize(X_test)

train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))
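
By default, `sklearn.preprocessing.normalize` rescales each row (here, each sentence) to unit L2 norm independently of every other row, which is why it can be applied to the training and test sets separately without fitting anything. The pure-Python sketch below mirrors that default behaviour on a toy matrix; the helper name `l2_normalize_rows` is hypothetical.

```python
import math

def l2_normalize_rows(rows):
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row))
        if norm == 0:
            # Leave all-zero rows unchanged to avoid dividing by zero.
            normalized.append(list(row))
        else:
            normalized.append([value / norm for value in row])
    return normalized

toy = [[3, 4, 0], [0, 0, 0], [1, 0, 0]]
result = l2_normalize_rows(toy)
print(result)
```

Note that this is per-sentence scaling, not per-feature standardization: every value in a row is divided by that row's own length.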



Training set score: 0.9774305555555556

Test set score: 0.8880208333333334


#### Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))



(576, 1975) (576,)
Training set score: 0.8611111111111112

Test set score: 0.8671875


#### Gradient Boosting

In [17]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9809027777777778

Test set score: 0.9192708333333334


#### Support Vector Machines

In [18]:
from sklearn.svm import SVC

sv_c = SVC(gamma='auto')
train = sv_c.fit(X_train, y_train)

print('Training set score:', sv_c.score(X_train, y_train))
print('\nTest set score:', sv_c.score(X_test, y_test))

Training set score: 0.8611111111111112

Test set score: 0.8671875


# Evaluation and Conclusion  

The final accuracy scores on the test set for the four models are as follows:  

Random Forest:  0.89  
Logistic Regression:  0.87  
Gradient Boosting:  0.92  
Support Vector Machines:  0.87  
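
Because the Emma excerpt is several times longer than the Alice excerpt, the two classes are unlikely to be balanced, so it can help to read these scores against a majority-class baseline. The sketch below computes that baseline for a made-up label list; the labels are hypothetical and do not reproduce the notebook's actual class counts, and `majority_baseline` is an illustrative helper.

```python
from collections import Counter

def majority_baseline(labels):
    # Accuracy obtained by always predicting the most frequent label.
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical test labels: 8 Austen sentences and 2 Carroll sentences.
y_example = ["Austen"] * 8 + ["Carroll"] * 2
label, baseline = majority_baseline(y_example)
print(label, baseline)
```

With the notebook's own split, the same check would be `majority_baseline(list(y_test))`; a model is only informative to the extent that its test score exceeds that number.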

Including 10% of 'Emma' (instead of the roughly 2% used before) increased the accuracy significantly.  The previous model, which used 10% of Alice and only about 2% of Emma, produced accuracy scores in the range of 55% to 67%, far below the scores listed above.  To reiterate, the new features implemented in this model are as follows:

- the number of words in each sentence  
- the number of adverbs
- the number of interjections  
- the number of pronouns  