In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
from sklearn.metrics import confusion_matrix

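def my_confusion_matrix(array_Expected, array_Predicted, colName):
    # Rows are the expected labels and columns the predicted labels, in sorted
    # label order: index 0 is 'Austen' and index 1 is 'Carroll'.
    a = np.array(confusion_matrix(array_Expected, array_Predicted))
    totalExpectedFalse = a[0,0] + a[0,1]   # all sentences expected to be Austen
    totalExpectedTrue = a[1,0] + a[1,1]    # all sentences expected to be Carroll
    correctFalse = a[0,0]                  # Austen predicted as Austen
    correctTrue = a[1,1]                   # Carroll predicted as Carroll
    # Per-author share of correctly predicted sentences (recall).
    correctTruePct = np.round(correctTrue / totalExpectedTrue, 3)
    correctFalsePct = np.round(correctFalse / totalExpectedFalse, 3)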
    print('Regarding ' + colName + '...')
    print('The model correctly predicted {} Austens out of {} expected Austens: {}'.format(
        correctFalse,totalExpectedFalse,correctFalsePct))
    print('The model correctly predicted {} Carrols out of {} expected Carrols: {}'.format(
        correctTrue,totalExpectedTrue,correctTruePct))    
    print(a)


Supervised NLP requires a pre-labelled dataset for training and testing, and is generally concerned with categorizing text. Here we will try to predict whether a sentence was written by Lewis Carroll (_Alice in Wonderland_) or by Jane Austen (_Persuasion_ and _Emma_). Any of the supervised models we've covered previously will work, as long as it allows categorical outcomes. In this notebook we'll try a random forest, logistic regression, and gradient boosting.

Our feature-generation approach will be something called _BoW_, or _Bag of Words_. BoW is quite simple: For each sentence, we count how many times each word appears. We will then use those counts as features.  
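
To make the counting step concrete, here is a minimal sketch that uses only the standard library. The `vocabulary` list and the two sample sentences are made up for illustration, and the naive whitespace tokenizer is an assumption that stands in for the spaCy lemmatization used later in this notebook.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    # Naive tokenizer (an assumption for this sketch): split on whitespace,
    # strip surrounding punctuation, and lowercase.
    tokens = [word.strip('.,;:!?\'"()').lower() for word in sentence.split()]
    counts = Counter(tokens)
    # One feature per vocabulary word, always in the same order.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and sentences, for illustration only.
vocabulary = ['rabbit', 'say', 'dear', 'late']
sample_sentences = [
    "Oh dear! Oh dear! I shall be late!",
    "The Rabbit did not say a word.",
]
features = [bow_vector(s, vocabulary) for s in sample_sentences]
print(features)  # [[0, 0, 2, 1], [1, 1, 0, 0]]
```

Each sentence becomes one row of counts, and the column order follows the vocabulary, which is the same shape of table the feature-building code below produces.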

In [53]:
%%time

# Utility function for standard text cleaning.
def text_cleaner(text):
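    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # Remove any text enclosed in square brackets (raw string avoids
    # invalid-escape warnings).
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse all runs of whitespace, including newlines, into single spaces.
    text = ' '.join(text.split())
    return text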
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
emma = gutenberg.raw('austen-emma.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
emma = re.sub(r'CHAPTER .*', '', emma)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)
emma = text_cleaner(emma)

CPU times: user 31.5 ms, sys: 10.5 ms, total: 42 ms
Wall time: 96.4 ms
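
As a quick sanity check, the sketch below mirrors the three steps of `text_cleaner` on a made-up string. The function name `clean` and the sample text are hypothetical and exist only for this illustration.

```python
import re

def clean(text):
    # Same three steps as text_cleaner, applied in the same order.
    text = re.sub(r'--', ' ', text)        # double dash -> space
    text = re.sub(r'\[.*?\]', '', text)    # drop bracketed text
    return ' '.join(text.split())          # collapse runs of whitespace

# Made-up sample, not taken from the novels.
sample = "[Illustration]  Alice thought--or  hoped--it was a dream.\n"
print(clean(sample))  # Alice thought or hoped it was a dream.
```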


In [54]:
%%time

# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)
emma_doc = nlp(emma)

print(type(alice_doc))

<class 'spacy.tokens.doc.Doc'>
CPU times: user 59.4 s, sys: 22.3 s, total: 1min 21s
Wall time: 54.9 s


In [55]:
# Group into sentences.
alice_sents = [[sent, "Carroll", "Alice in Wonderland"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen", 'Persuasion'] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen", 'Emma'] for sent in emma_doc.sents]

# Combine the sentences from the 3 novels into one data frame.
sentences = pd.DataFrame(alice_sents[0:1000] + persuasion_sents[0:1000] + emma_sents[0:1000])
sentences.shape

(3000, 3)

In [731]:
print(alice_doc[0:1000])

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and the

In [89]:
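for sentence in sentences.loc[1:1, 0]:
    print(sentence)
    # Second-to-last token of the sentence.
    print(sentence[-2])

    # Print each token with its fine-grained tag (token.pos_ holds the coarse tag).
    for token in sentence:
        print('{}  :  {}'.format(token, token.tag_))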

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
her
So  :  RB
she  :  PRP
was  :  VBD
considering  :  VBG
in  :  IN
her  :  PRP$
own  :  JJ
mind  :  NN
(  :  -LRB-
as  :  RB
well  :  RB
as  :  IN
she  :  PRP
could  :  MD
,  :  ,
for  :  IN
the  :  DT
hot  :  JJ
day  :  NN
made  :  VBD
her  :  PRP
feel  :  VB
very  :  RB
sleepy  :  JJ
and  :  CC
stupid  :  JJ
)  :  -RRB-
,  :  ,
whether  :  IN
the  :  DT
pleasure  :  NN
of  :  IN
making  :  VBG
a  :  DT
daisy  :  NN
-  :  HYPH
chain  :  NN
would  :  MD
be  :  VB
worth  :  JJ
the  :  DT
trouble  :  NN
of  :  IN
getting  :  VBG
up  :  RP
and  :  CC
picking  :  VBG
the  :  DT
daisies  :  NNS
,  :  ,
when  :  WRB
suddenly  :  RB
a  :  DT
White  :  NNP
Rabbit  :  NNP
with  :  IN
pink  :  JJ
eyes  :  NNS


In [56]:
# Utility function to create a list of the most common words in a given document.
def bag_of_words(text, n, minlen=0, maxlen=99):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct    #disregard punctuation
                and not token.is_stop    #disregard stop words
                and not token.pos_ == 'PROPN'   #disregard proper nouns
                and len(token) >= minlen   #only get those with minimum length
                and len(token) < maxlen]   
    
    # Return the n most common lemmas as (word, count) tuples.
    return Counter(allwords).most_common(n)
    

# Set up a bag of words of the most common small words in the 3 novels
alicewords = bag_of_words(alice_doc, 200, 0, 7)
persuasionwords = bag_of_words(persuasion_doc, 200, 0, 7)
emmawords = bag_of_words(emma_doc, 200, 0, 7)

# Combine bags to create a set of unique small words.
all_words = (alicewords + persuasionwords + emmawords)
df = pd.DataFrame(all_words)
df = df.loc[~df[0].isin(['-PRON-', "'s"])]
df = df.groupby([0]).sum().sort_values(by=1, ascending=False)
frequent_small_words = list(df[0:50].index)


# Set up a bag of words of the most common big words in the 3 novels
alicewords = bag_of_words(alice_doc, n=200, minlen=7)
persuasionwords = bag_of_words(persuasion_doc, n=200, minlen=7)
emmawords = bag_of_words(emma_doc, n=200, minlen=7)

# Combine bags to create a set of unique big words.
all_words = (alicewords + persuasionwords + emmawords)
df = pd.DataFrame(all_words)
df = df.loc[~df[0].isin(['-PRON-', "'s"])]
df = df.groupby([0]).sum().sort_values(by=1, ascending=False)
frequent_big_words = list(df[0:50].index)


frequent_words = frequent_small_words + frequent_big_words
len(frequent_words)

100

In [90]:
# Creates a data frame with features for each word in our word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(sentences):
        
        if i % 100 == 0:
            print('Processing ' + str(i))
        
        #features: populate sentence length
        df.loc[i,'numwords'] = len(sentence)
        #df.loc[i,'punct'] = sentence[-1]
        df.loc[i, common_words] = 0
        
        # Per-sentence counters, filled in by a single pass over the tokens below.
        num_punct = 0
        num_propn = 0
        num_words = 0
        num_char = 0
        bool_upper = 0
        num_repeat = 0
        num_ing = 0
        prior_token = ''
        
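        for token in sentence:
            # Each token counts as punctuation, a proper noun, or a regular word.
            if token.is_punct:
                num_punct += 1
            elif token.pos_ == 'PROPN':
                num_propn += 1
            else:
                num_words += 1
                num_char += len(token)

            # Flag sentences containing at least one all-uppercase token.
            if token.is_upper:
                bool_upper = 1

            # Bag-of-words count for lemmas in the master word list.
            if token.lemma_ in common_words:
                df.loc[i, token.lemma_] += 1

            # Count tokens that repeat the immediately preceding token.
            if token.norm_ == prior_token:
                num_repeat += 1

            # Count tokens whose last three characters are 'ing'.
            if token.suffix_ == 'ing':
                num_ing += 1

            prior_token = token.norm_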
            
        #feature: Avg Word Size (characters per token, excluding punctuation and proper nouns)
        if num_words > 0:
            avgwordsize = num_char / num_words
        else:
            avgwordsize = 0
        df.loc[i,'avgwordsize'] = avgwordsize
        
        #feature: Num Punct
        df.loc[i,'numpunct'] = num_punct  #  len(punctuation)
        
        #feature: Num Proper Nouns
        df.loc[i,'numpropernoun'] = num_propn   #len(propernouns)

        #feature: Num Uppercase words
        df.loc[i,'upperword'] = bool_upper   #np.where(len(uppers) > 0, 1, 0)
        
        #feature: repeats
        df.loc[i,'numrepeat'] = num_repeat
        
        #feature: ending in ing
        df.loc[i,'numing'] = num_ing
        
        #feature: first word part of speech
        df.loc[i,'first_pos'] = sentence[0].pos_
        
        #feature: part of speech of the second-to-last token (the last word when the sentence ends in a single punctuation mark)
        df.loc[i,'last_pos'] = sentence[-2].pos_
        
    print('Filling NAs with 0')
    df.fillna(0,inplace=True)
    print('Done')
    print(df.shape)
    return df


df = bow_features(sentences.loc[:,0], frequent_words)
df[['author','title']] = sentences.loc[:,1:2] 

df_first_pos = pd.get_dummies(df.first_pos, prefix='first')
df_last_pos = pd.get_dummies(df.last_pos, prefix='last')
df = pd.concat([df, df_first_pos, df_last_pos], axis=1)
df.drop(columns=['first_pos','last_pos'], inplace=True)


#df[['begin','sit','have','maxrepeats']]

Processing 0
Processing 100
Processing 200
Processing 300
Processing 400
Processing 500
Processing 600
Processing 700
Processing 800
Processing 900
Processing 1000
Processing 1100
Processing 1200
Processing 1300
Processing 1400
Processing 1500
Processing 1600
Processing 1700
Processing 1800
Processing 1900
Processing 2000
Processing 2100
Processing 2200
Processing 2300
Processing 2400
Processing 2500
Processing 2600
Processing 2700
Processing 2800
Processing 2900
Filling NAs with 0
Done
(3000, 109)


## Trying out BoW

Now let's give the bag of words features a whirl by trying a random forest.

In [91]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| say | 3000.0 | 0.152333 | 0.372166 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| good | 3000.0 | 0.076000 | 0.294821 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| know | 3000.0 | 0.072667 | 0.272172 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| come | 3000.0 | 0.056000 | 0.241281 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| the | 3000.0 | 0.835000 | 1.311112 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 |
| little | 3000.0 | 0.072667 | 0.275823 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| think | 3000.0 | 0.093333 | 0.302191 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| thing | 3000.0 | 0.047000 | 0.220924 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| time | 3000.0 | 0.048667 | 0.228726 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| go | 3000.0 | 0.073333 | 0.278055 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |


In [61]:
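# Note: `first_pos` is dropped from df after one-hot encoding, so this crosstab
# only works if it runs before that drop; otherwise it raises the error below.
print(pd.crosstab(df.first_pos, df.author, margins=True, normalize='columns'))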

AttributeError: 'DataFrame' object has no attribute 'first_pos'

In [707]:
df.pivot_table(index=['author','title'], values=['numwords','maxrepeats','avgwordsize','numpunct',
                                                 'numpropernoun','numuppers'], 
                                               aggfunc=np.mean, margins=True)

| author | title | avgwordsize | maxrepeats | numpropernoun | numpunct | numuppers | numwords |
|---|---|---|---|---|---|---|---|
| Austen | Emma | 4.261418 | 0.854667 | 0.940667 | 3.249333 | 0.349333 | 21.573333 |
| Austen | Persuasion | 4.502711 | 0.908667 | 1.499333 | 4.403333 | 0.301333 | 30.900667 |
| Carroll | Alice in Wonderland | 3.871423 | 0.936667 | 0.818 | 4.242667 | 0.504 | 20.544 |
| All | | 4.211851 | 0.9 | 1.086 | 3.965111 | 0.384889 | 24.339333 |


In [92]:
#Split features and labels by book (3 books)
#- Alice in Wonderland by Carroll
#- Persuasion and Emma by Austen

X_Alice = df.loc[df.title.isin(['Alice in Wonderland'])]
Y_Alice = X_Alice.author
X_Alice = X_Alice.drop(columns=['author','title'])

X_Persuasion = df.loc[df.title.isin(['Persuasion'])]
Y_Persuasion = X_Persuasion.author
X_Persuasion = X_Persuasion.drop(columns=['author','title'])

X_Emma = df.loc[df.title.isin(['Emma'])]
Y_Emma = X_Emma.author
X_Emma = X_Emma.drop(columns=['author','title'])


#Train on the first 800 sentences from each of the 3 books
X_Train = pd.concat([X_Alice[0:800], X_Persuasion[0:800], X_Emma[0:800]], axis=0)
Y_Train = pd.concat([Y_Alice[0:800], Y_Persuasion[0:800], Y_Emma[0:800]], axis=0)

#Test1 on the last 200 sentences from Alice and Persuasion
X_Test1 = pd.concat([X_Alice[800:1000], X_Persuasion[800:1000]], axis=0)
Y_Test1 = pd.concat([Y_Alice[800:1000], Y_Persuasion[800:1000]], axis=0)

#Test2 on the last 200 sentences from Alice and Emma
X_Test2 = pd.concat([X_Alice[800:1000], X_Emma[800:1000]], axis=0)
Y_Test2 = pd.concat([Y_Alice[800:1000], Y_Emma[800:1000]], axis=0)

#Test3 combines the last 200 sentences from all 3 books
X_Test3 = pd.concat([X_Alice[800:1000], X_Emma[800:1000], X_Persuasion[800:1000]], axis=0)
Y_Test3 = pd.concat([Y_Alice[800:1000], Y_Emma[800:1000], Y_Persuasion[800:1000]], axis=0)


#Full Set for Cross Validation
X_Full = pd.concat([X_Alice, X_Persuasion, X_Emma], axis=0)
Y_Full = pd.concat([Y_Alice, Y_Persuasion, Y_Emma], axis=0)

print(X_Train.shape)
print(Y_Train.shape)
print(X_Test1.shape)
print(Y_Test1.shape)
print(X_Test2.shape)
print(Y_Test2.shape)
print(X_Test3.shape)
print(Y_Test3.shape)
print(X_Full.shape)
print(Y_Full.shape)

(2400, 133)
(2400,)
(400, 133)
(400,)
(400, 133)
(400,)
(600, 133)
(600,)
(3000, 133)
(3000,)


In [98]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier(n_estimators=60, max_depth=30)

train = rfc.fit(X_Train, Y_Train)

print('RF: Training set score (Alice/Persuasion):', rfc.score(X_Train, Y_Train))
print('\nRF: Test set score (Alice/Persuasion):', rfc.score(X_Test1, Y_Test1))
print('\nRF: Test set score (Alice/Emma):', rfc.score(X_Test2, Y_Test2))

print('\nRF: Test set score (Alice/Emma/Persuasion):', rfc.score(X_Test3, Y_Test3))
print('')
Y_Pred = rfc.predict(X_Test3)
my_confusion_matrix(Y_Test3, Y_Pred, 'Austen')

score = cross_val_score(rfc, X_Full, Y_Full, cv=5)
print("\nRF: Cross Validation (All Records) Accuracy %i folds: %.2f (+/- %.2f)" % (5, score.mean(), (score.std() * 2)))



feature_importance = rfc.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
fi = pd.DataFrame(feature_importance, index=X_Train.columns, columns=['Importance'])
fi.sort_values(by='Importance', ascending=False).head(10)

RF: Training set score (Alice/Persuasion): 0.9945833333333334

RF: Test set score (Alice/Persuasion): 0.8175

RF: Test set score (Alice/Emma): 0.82

RF: Test set score (Alice/Emma/Persuasion): 0.8416666666666667

Regarding Austen...
The model correctly predicted 355 Austens out of 400 expected Austens: 0.888
The model correctly predicted 150 Carrols out of 200 expected Carrols: 0.75
[[355  45]
 [ 50 150]]

RF: Cross Validation (All Records) Accuracy 5 folds: 0.80 (+/- 0.04)


| | Importance |
|---|---|
| avgwordsize | 100.0 |
| numwords | 68.387872 |
| last_PUNCT | 64.180153 |
| numpropernoun | 52.092163 |
| numpunct | 44.672972 |
| say | 34.960283 |
| be | 31.299548 |
| the | 29.318123 |
| have | 27.832392 |
| numrepeat | 20.517775 |


In [101]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=100)
train = lr.fit(X_Train, Y_Train)

print('LR: Training set score (Alice/Persuasion):', lr.score(X_Train, Y_Train))
print('\nLR: Test set score (Alice/Persuasion):', lr.score(X_Test1, Y_Test1))
print('\nLR: Test set score (Alice/Emma):', lr.score(X_Test2, Y_Test2))

print('\nLR: Test set score (Alice/Emma/Persuasion):', lr.score(X_Test3, Y_Test3))
print('')
Y_Pred = lr.predict(X_Test3)
my_confusion_matrix(Y_Test3, Y_Pred, 'Austen')


score = cross_val_score(lr, X_Full, Y_Full, cv=5)
print("\nLR: Cross Validation (All Books) Accuracy %i folds: %.2f (+/- %.2f)" % (5, score.mean(), (score.std() * 2)))


LR: Training set score (Alice/Persuasion): 0.84625

LR: Test set score (Alice/Persuasion): 0.805

LR: Test set score (Alice/Emma): 0.79

LR: Test set score (Alice/Emma/Persuasion): 0.815

Regarding Austen...
The model correctly predicted 340 Austens out of 400 expected Austens: 0.85
The model correctly predicted 149 Carrols out of 200 expected Carrols: 0.745
[[340  60]
 [ 51 149]]

LR: Cross Validation (All Books) Accuracy 5 folds: 0.81 (+/- 0.04)


In [102]:
clf = ensemble.GradientBoostingClassifier(learning_rate=.1, n_estimators=200)
train = clf.fit(X_Train, Y_Train)

print('RFGB: Training set score (Alice/Persuasion):', clf.score(X_Train, Y_Train))
print('\nRFGB: Test set score (Alice/Persuasion):', clf.score(X_Test1, Y_Test1))
print('\nRFGB: Test set score (Alice/Emma):', clf.score(X_Test2, Y_Test2))

print('\nRFGB: Test set score (Alice/Emma/Persuasion):', clf.score(X_Test3, Y_Test3))
print('')
Y_Pred = clf.predict(X_Test3)
my_confusion_matrix(Y_Test3, Y_Pred, 'Austen')


score = cross_val_score(clf, X_Full, Y_Full, cv=5)
print("\nRFGBCross Validation (All Books) Accuracy %i folds: %.2f (+/- %.2f)" % (5, score.mean(), (score.std() * 2)))


feature_importance = clf.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
fi = pd.DataFrame(feature_importance, index=X_Train.columns, columns=['Importance'])
fi.sort_values(by='Importance', ascending=False).head(10)

RFGB: Training set score (Alice/Persuasion): 0.8995833333333333

RFGB: Test set score (Alice/Persuasion): 0.845

RFGB: Test set score (Alice/Emma): 0.8325

RFGB: Test set score (Alice/Emma/Persuasion): 0.8533333333333334

Regarding Austen...
The model correctly predicted 353 Austens out of 400 expected Austens: 0.882
The model correctly predicted 159 Carrols out of 200 expected Carrols: 0.795
[[353  47]
 [ 41 159]]

RFGBCross Validation (All Books) Accuracy 5 folds: 0.81 (+/- 0.05)


| | Importance |
|---|---|
| avgwordsize | 100.0 |
| numwords | 65.472022 |
| numpropernoun | 52.357778 |
| last_PUNCT | 41.757733 |
| numpunct | 41.191577 |
| the | 30.493957 |
| be | 27.274533 |
| upperword | 22.872968 |
| have | 21.249588 |
| say | 20.418504 |


Well, look at that! NLP approaches are generally effective on the same type of material as they were trained on. On the combined test set, all three models score between roughly 0.81 and 0.85, so they are able to differentiate multiple works by Austen from _Alice in Wonderland_. Now the question is whether the models are very good at identifying Austen, very good at identifying _Alice in Wonderland_, or both...
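
One way to start answering that question is to read per-author precision and recall off a confusion matrix. The sketch below is a plain-Python illustration, not part of the original workflow: `per_class_report` is a hypothetical helper, and the matrix is the gradient boosting result printed earlier in this notebook, with rows as expected authors and columns as predicted authors in the order Austen, Carroll.

```python
def per_class_report(matrix, labels):
    """Return {label: (precision, recall)} from a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    report = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                     # row sum: sentences truly in class k
        predicted = sum(row[k] for row in matrix)   # column sum: sentences predicted as k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[label] = (precision, recall)
    return report

# Gradient boosting confusion matrix from the combined test set above.
gb_matrix = [[353, 47],
             [41, 159]]
report = per_class_report(gb_matrix, ['Austen', 'Carroll'])
for label, (precision, recall) in report.items():
    print('{}: precision={:.3f}, recall={:.3f}'.format(label, precision, recall))
```

Recall says how many of an author's sentences were found, while precision says how many of the sentences attributed to that author really were theirs. Comparing the two per author gives a clearer picture than overall accuracy, especially since the test set holds twice as many Austen sentences as Carroll sentences.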

# Challenge 0:

Recall that the logistic regression model's best performance on the test set was 93%. See what you can do to improve performance. Suggested avenues of investigation include: other modeling techniques (SVM?); more features that take advantage of the spaCy information (grammar, phrases, POS, etc.); sentence-level features (number of words, amount of punctuation); contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc.); and anything else your heart desires. Make sure to design your models on the training set, or use cross-validation with multiple folds, rather than tuning against the test set, and see if you can get accuracy above 90%.
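
As one possible starting point for the contextual features, the sketch below computes the length of the previous and next sentences and the number of words shared with the previous sentence. It is a simplified illustration: `context_features` is a hypothetical helper, and it works on plain lists of token strings rather than spaCy spans.

```python
def context_features(tokenized_sentences):
    """Build one feature dict per sentence from its neighbours."""
    rows = []
    last = len(tokenized_sentences) - 1
    for i, tokens in enumerate(tokenized_sentences):
        prev_tokens = tokenized_sentences[i - 1] if i > 0 else []
        next_tokens = tokenized_sentences[i + 1] if i < last else []
        current = {t.lower() for t in tokens}
        previous = {t.lower() for t in prev_tokens}
        rows.append({
            'prev_len': len(prev_tokens),                # tokens in the previous sentence
            'next_len': len(next_tokens),                # tokens in the next sentence
            'shared_with_prev': len(current & previous), # distinct words repeated from it
        })
    return rows

# Made-up sentences, for illustration only.
toy = [
    ['Alice', 'was', 'tired'],
    ['She', 'was', 'very', 'tired'],
    ['Oh', 'dear'],
]
rows = context_features(toy)
for row in rows:
    print(row)
```

If you adapt this, compute the features one book at a time so that the last sentence of one novel does not borrow context from the first sentence of the next.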

# Challenge 1:
Find out whether your new model is good at identifying _Alice in Wonderland_ vs any other work, _Persuasion_ vs any other work, or Austen vs any other work. This will involve pulling a new book from the Project Gutenberg corpus (run `print(gutenberg.fileids())` for a list) and processing it.
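
Each of these comparisons is a one-vs-rest problem, so the same features can be reused with a different binary target. The sketch below shows the relabeling step in plain Python; `one_vs_rest` is a hypothetical helper and the short `titles` and `authors` lists are made up for illustration.

```python
def one_vs_rest(values, positive):
    """Map each value to 1 if it equals `positive`, otherwise 0."""
    return [1 if value == positive else 0 for value in values]

# Made-up labels standing in for the title and author columns.
titles = ['Alice in Wonderland', 'Persuasion', 'Emma', 'Persuasion']
authors = ['Carroll', 'Austen', 'Austen', 'Austen']

y_alice = one_vs_rest(titles, 'Alice in Wonderland')   # Alice vs any other work
y_persuasion = one_vs_rest(titles, 'Persuasion')       # Persuasion vs any other work
y_austen = one_vs_rest(authors, 'Austen')              # Austen vs any other author
print(y_alice, y_persuasion, y_austen)  # [1, 0, 0, 0] [0, 1, 0, 1] [0, 1, 1, 1]
```

With a pandas data frame, a comparison such as `(df.title == 'Persuasion').astype(int)` should give the same kind of target column.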

Record your work for each challenge in a notebook and submit it below.