
# Natural Language Processing Challenges
## Part 1:
Recall that the logistic regression model's best performance on the test set was 93%.
See what you can do to improve performance.

Suggested avenues of investigation include:
- Other modeling techniques (SVM?)
- making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc)
- making sentence-level features (number of words, amount of punctuation)
- including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc)

Make sure to design your models on the training set, or use cross-validation with multiple folds, and see if you can get accuracy above 90%.

In [0]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import gutenberg, stopwords
from collections import Counter
#nltk.download('gutenberg')
#nltk.download('stopwords')
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer

Supervised NLP requires a pre-labelled dataset for training and testing.
In this case, we are going to try to predict whether a sentence comes from _Alice in Wonderland_ by Lewis Carroll or _Persuasion_ by Jane Austen.
We can use any supervised model, as long as it allows categorical outcomes.
In this case, we'll use logistic regression (with lasso and ridge penalties) and a support vector classifier.

We'll generate features with _BoW_, or _Bag of Words_.
For each sentence, we count how many times each word appears.
We will then use those counts as features.
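
As a minimal illustration of the idea, the sketch below counts words for a single sentence. It assumes plain whitespace tokenization rather than the spaCy lemmas used later in this notebook, and `count_words` and the tiny vocabulary are hypothetical names chosen only for this example.

```python
from collections import Counter

def count_words(sentence, vocabulary):
    """Return one count per vocabulary word for a single sentence.

    Hypothetical helper for illustration; it lowercases the sentence and
    splits on whitespace only (no lemmatization, no stop word removal).
    """
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for words that never appear in the sentence.
    return [counts[word] for word in vocabulary]

vocabulary = ["rabbit", "alice", "anne"]
features = count_words("Alice saw the rabbit and the rabbit saw Alice", vocabulary)
print(features)
```

Each sentence becomes one row of counts, with one column per vocabulary word.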

**Note**: Since processing all the text takes around 5-10 minutes, the cell below uses only the first tenth of each text. If you want to experiment, feel free to change the following code in the next cell:

```python
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])
```
to 

```python
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)
```

In [0]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove bracketed material such as '[Alice's Adventures ... 1865]'.
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r'Chapter \d+', '', text)
    text = re.sub(r'VOLUME \w+', '', text)
    text = re.sub(r'CHAPTER \w+', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')
emma = gutenberg.raw('austen-emma.txt')

# The Chapter indicator is idiosyncratic
#persuasion = re.sub(r'Chapter \d+', '', persuasion)
#alice = re.sub(r'CHAPTER .*', '', alice)
#emma = re.sub(r'VOLUME \w+', '', emma)
#emma = re.sub(r'CHAPTER \w+', '', emma)
    
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])
# in order to get comparable length texts, take the first sixtieth of Emma
emma = text_cleaner(emma[:int(len(emma)/60)])

In [0]:
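# Parse the cleaned novels. This can take a bit.
# Note: the 'en' shortcut works in spaCy 2.x; spaCy 3+ requires the full
# model name, e.g. spacy.load('en_core_web_sm').
nlp = spacy.load('en')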
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)
emma_doc = nlp(emma)

In [0]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]
# view number of sentences
print(len(alice_sents))
print(len(persuasion_sents))
print(len(emma_sents))

129
315
170


In [0]:
# For computational purposes, cap each set of sentences at 500.
# (With these excerpts the cap has no effect: every set is already shorter.)
alice_sents = alice_sents[0:500]
persuasion_sents = persuasion_sents[0:500]
emma_sents = emma_sents[0:500]
# view number of sentences
print(len(alice_sents))
print(len(persuasion_sents))
print(len(emma_sents))

129
315
170


In [0]:
# Combine the sentences from the three novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents + emma_sents)
display(sentences.head())
# view number of authors
print(sentences.iloc[:, 1].unique())

Unnamed: 0,0,1
0,(.),Carroll
1,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",Carroll
2,"(So, she, was, considering, in, her, own, mind...",Carroll
3,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
4,"(Oh, dear, !)",Carroll


['Carroll' 'Austen']


Time to bag some words!  Since spaCy has already tokenized and labelled our data, we can move directly to recording how often various words occur.  We will exclude stopwords and punctuation.  In addition, in an attempt to keep our feature space from exploding, we will work with lemmas (root words) rather than the raw text terms, and we'll only use the 2000 most common words for each text.

## Feature Engineering
- **BoW**
- **Sentence-level features**: sentence length, amount of punctuation, length of previous and next sentences, words repeated from one sentence to the next.

In [0]:
def bag_of_words(text):
    '''
    Utility function to create a list of the 2000 most common words.
    '''
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

In [0]:
def bow_features(sentences, common_words):
    '''
    Creates a data frame with features for each word in our common word set.
    Each value is the count of the times the word appears in each sentence.
    '''
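    # Create a column for each common word, plus columns for the sentence
    # text and its source (author). Word counts are initialized to 0.
    # A list is used for column selection because sets are unordered.
    word_columns = list(common_words)
    df = pd.DataFrame(columns=word_columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, word_columns] = 0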
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

In [0]:
# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)
emmawords = bag_of_words(emma_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords + emmawords)

In [0]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500
Processing row 550
Processing row 600


Unnamed: 0,equity,remnant,1760,1806,1789,encourage,requisition,suitable,favourite,nervous,...,personableness,confide,catch,insult,aware,flatter,presume,tire,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,(.),Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll


### Sentence-Level Features

In [0]:
# sentence length feature
for i, sentence in enumerate(word_counts['text_sentence']):
    word_counts.loc[i, 'sentenceLength'] = len(sentence)

In [0]:
# amount of punctuation feature

In [0]:
# length of previous and next sentences features

In [0]:
# repeated words feature
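
The three cells above are left as stubs. One possible way to fill them in is sketched below, under the assumption that each sentence has first been converted to a list of token strings (for example `[token.text for token in sent]`). The helper names `is_punct` and `sentence_level_features` and the feature names are hypothetical, and none of these features were used in the results reported in this notebook.

```python
import string

def is_punct(token):
    """True if the token is non-empty and made up only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def sentence_level_features(sentences):
    """Compute simple per-sentence features from lists of token strings.

    Hypothetical helper: `sentences` is a list of sentences, each of which
    is a list of token strings in document order.
    """
    rows = []
    for i, tokens in enumerate(sentences):
        # Neighbouring sentences; empty at the start and end of the text.
        prev_tokens = sentences[i - 1] if i > 0 else []
        next_tokens = sentences[i + 1] if i + 1 < len(sentences) else []
        # Distinct lowercase words, ignoring punctuation tokens.
        words = {t.lower() for t in tokens if not is_punct(t)}
        prev_words = {t.lower() for t in prev_tokens if not is_punct(t)}
        rows.append({
            "punct_count": sum(is_punct(t) for t in tokens),
            "prev_length": len(prev_tokens),
            "next_length": len(next_tokens),
            "repeated_words": len(words & prev_words),
        })
    return rows

example = [
    ["Oh", "dear", "!"],
    ["Oh", "dear", ",", "I", "shall", "be", "late", "!"],
]
rows = sentence_level_features(example)
print(rows)
```

Note that the rows must stay in document order and within a single text for the previous/next features to be meaningful, so they would need to be computed per book before the sentences are combined. With spaCy tokens, `token.is_punct` would be a more natural punctuation test than the ASCII check used here.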

In [0]:
word_counts.head()

Unnamed: 0,equity,remnant,1760,1806,1789,encourage,requisition,suitable,favourite,nervous,...,confide,catch,insult,aware,flatter,presume,tire,text_sentence,text_source,sentenceLength
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,(.),Carroll,1.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",Carroll,72.0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll,63.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll,30.0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll,3.0


## Model Selection - BoW

In [0]:
# define data, target
target = word_counts['text_source']
Data = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))
# split training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    Data, target, test_size=0.2, random_state=42)

### Lasso Logistic Regression

In [0]:
# set penalty to 'l1'
# set solver to 'liblinear'
lasso = LogisticRegression(penalty='l1', solver='liblinear')
lasso.fit(X_train, y_train)
print('Training set score:', lasso.score(X_train, y_train))
print('\nTest set score:', lasso.score(X_test, y_test))
cv_scores = cross_val_score(lasso, Data, target, cv=5)
print('\nCross validation:', cv_scores)
print('\nVariance:', np.var(cv_scores))

Training set score: 0.9124236252545825

Test set score: 0.8699186991869918

Cross validation: [0.87804878 0.83739837 0.85365854 0.8699187  0.8442623 ]

Variance: 0.00023339902121696036


### Ridge Logistic Regression

In [0]:
# set penalty to 'l2'
# set solver to 'lbfgs'
ridge = LogisticRegression(penalty='l2', solver='lbfgs')
ridge.fit(X_train, y_train)
print('Training set score:', ridge.score(X_train, y_train))
print('\nTest set score:', ridge.score(X_test, y_test))
cv_scores = cross_val_score(ridge, Data, target, cv=5)
print('\nCross validation:', cv_scores)
print('\nVariance:', np.var(cv_scores))

Training set score: 0.9531568228105907

Test set score: 0.8780487804878049

Cross validation: [0.87804878 0.83739837 0.90243902 0.8699187  0.87704918]

Variance: 0.00043710162654028554


### Support Vector Classifier

In [0]:
# set gamma to 'scale'
svc = SVC(gamma='scale')
svc.fit(X_train, y_train)
print('Training set score:', svc.score(X_train, y_train))
print('\nTest set score:', svc.score(X_test, y_test))
cv_scores = cross_val_score(svc, Data, target, cv=5)
print('\nCross validation:', cv_scores)
print('\nVariance:', np.var(cv_scores))

Training set score: 0.8004073319755601

Test set score: 0.7804878048780488

Cross validation: [0.78861789 0.78861789 0.78861789 0.78861789 0.79508197]

Variance: 6.6854949787822925e-06


# Part 1: Conclusion
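We selected the Lasso Logistic Regression as our best model.
The Ridge Logistic Regression scored slightly higher on the test set (87.8% vs. 87.0%) and in cross-validation, but it never converged and showed a larger gap between training accuracy (95.3%) and test accuracy.
Our SVC underperformed: its scores of about 78-79% are roughly what a model would get by labeling nearly every sentence as Austen.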
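
To compare the three models on more than a single test split, the cross-validation scores printed in the cells above can be summarized by their mean and standard deviation and set against a majority-class baseline. The sketch below only re-uses numbers already shown in this notebook; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Cross-validation scores copied from the cell outputs above.
cv_scores = {
    "lasso": [0.87804878, 0.83739837, 0.85365854, 0.8699187, 0.8442623],
    "ridge": [0.87804878, 0.83739837, 0.90243902, 0.8699187, 0.87704918],
    "svc": [0.78861789, 0.78861789, 0.78861789, 0.78861789, 0.79508197],
}

# Accuracy of always guessing "Austen": 315 + 170 Austen sentences
# out of 129 + 315 + 170 sentences in total.
majority_baseline = (315 + 170) / (129 + 315 + 170)

# pstdev is the population standard deviation, matching np.var's default.
summary = {name: (mean(s), pstdev(s)) for name, s in cv_scores.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f}, std={spread:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

Because the classes are imbalanced, a model is only informative to the extent that it beats this baseline.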

In [0]:
pd.crosstab(y_test, lasso.predict(X_test))

actual \ predicted,Austen,Carroll
Austen,96,0
Carroll,15,12


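The confusion matrix shows where the lasso model struggles: it classifies all 96 Austen test sentences correctly but labels 15 of the 27 Carroll sentences as Austen.
With the new feature set we saw a marginal increase in our model performance, along with a slight increase in variance across folds.
Our model was not able to improve on the original model's score of 93%.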
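
Overall accuracy hides how unevenly the errors are spread across the two authors. As a rough sketch, per-class recall can be computed directly from the crosstab counts above; `recall_per_class` is a hypothetical helper, and the dictionary simply restates the table with rows as true authors and columns as predictions.

```python
# Counts copied from the crosstab above:
# outer keys are the true authors, inner keys are the predicted authors.
confusion = {
    "Austen": {"Austen": 96, "Carroll": 0},
    "Carroll": {"Austen": 15, "Carroll": 12},
}

def recall_per_class(matrix):
    """Fraction of each true class that the model labeled correctly."""
    return {label: row[label] / sum(row.values()) for label, row in matrix.items()}

recalls = recall_per_class(confusion)
for label, value in recalls.items():
    print(f"{label}: recall={value:.3f}")
```

A large gap between the two recall values is a sign that the minority class (Carroll) is under-predicted, which a single accuracy figure does not reveal.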

# Part 2:
Find out whether your new model is good at identifying _Alice in Wonderland_ vs any other work, _Persuasion_ vs any other work, or Austen vs any other work.
This will involve pulling a new book from the Project Gutenberg corpus (`print(gutenberg.fileids())` for a list) and processing it.

In [0]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


For our new model we'll use Shakespeare's _Julius Caesar_ text.
The classification power of our model should be much stronger as the style of language used by Shakespeare is very different from other authors.

In [0]:
#specify path for new text
PATH = 'shakespeare-caesar.txt'
# Load and clean the data.
alice = gutenberg.raw('carroll-alice.txt')
alice = text_cleaner(alice[:int(len(alice)/10)])
new_text = gutenberg.raw(PATH)
new_text = text_cleaner(new_text[:int(len(alice))])
print(new_text[:100])

Actus Primus. Scoena Prima. Enter Flauius, Murellus, and certaine Commoners ouer the Stage. Flauius.


In [0]:
# Parse the cleaned novels. This can take a bit.
#nlp = spacy.load('en')
#alice_doc = nlp(alice)
new_text_doc = nlp(new_text)

In [0]:
Author = 'Shakespeare'
# Group into sentences.
#alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
new_text_sents = [[sent, Author] for sent in new_text_doc.sents]
# view number of sentences
print(len(alice_sents))
print(len(new_text_sents))

129
271


In [0]:
# For computational purposes, cap each set of sentences at 500
#alice_sents = alice_sents[0:500]
new_text_sents = new_text_sents[0:500]
# view number of sentences
print(len(alice_sents))
print(len(new_text_sents))

129
271


In [0]:
# Combine the sentences from the two texts into one data frame.
sentences = pd.DataFrame(alice_sents + new_text_sents)
display(sentences.head())
# view number of authors
print(sentences.iloc[:, 1].unique())

Unnamed: 0,0,1
0,(.),Carroll
1,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",Carroll
2,"(So, she, was, considering, in, her, own, mind...",Carroll
3,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
4,"(Oh, dear, !)",Carroll


['Carroll' 'Shakespeare']


In [0]:
# Set up the bags.
#alicewords = bag_of_words(alice_doc)
new_text_words = bag_of_words(new_text_doc)
# Combine bags to create a set of unique words.
common_words = set(alicewords + new_text_words)

In [0]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350


Unnamed: 0,whereof,white,desire,paper,hot,vnder,Brutus,Loue,dip,shoulder,...,flourish,catch,high,proper,garden,brighten,lovely,poor,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,(.),Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",Carroll
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll


In [0]:
# sentence length feature
for i, sentence in enumerate(word_counts['text_sentence']):
    word_counts.loc[i, 'sentenceLength'] = len(sentence)

In [0]:
# define data, target
target = word_counts['text_source']
Data = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))
# split training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    Data, target, test_size=0.2, random_state=42)

In [0]:
# set penalty to 'l1'
# set solver to 'liblinear'
lasso = LogisticRegression(penalty='l1', solver='liblinear')
lasso.fit(X_train, y_train)
print('Training set score:', lasso.score(X_train, y_train))
print('\nTest set score:', lasso.score(X_test, y_test))
cv_scores = cross_val_score(lasso, Data, target, cv=5)
print('\nCross validation:', cv_scores)
print('\nVariance:', np.var(cv_scores))

Training set score: 0.915625

Test set score: 0.7875

Cross validation: [0.87654321 0.8125     0.9        0.8625     0.78481013]

Variance: 0.0017959030603215118


In [0]:
pd.crosstab(y_test, lasso.predict(X_test))

actual \ predicted,Carroll,Shakespeare
Carroll,17,16
Shakespeare,1,46


# Part 2: Conclusion
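Our model did not perform as well as the Carroll-Austen model: test accuracy fell to 78.8% (from 87.0%), and the gap to the 91.6% training accuracy points to overfitting.
Most of the errors are Carroll sentences: 16 of the 33 in the test set were labeled as Shakespeare.
This is surprising, as the style of language is so very different.
It could have something to do with the structure of Shakespeare's play compared to Carroll's novel.
Our cleaning function was also less effective on Shakespeare: it strips headings like 'Chapter 1' but not 'Actus Primus. Scoena Prima.'.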
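
One way to close that gap would be an extra cleaning step for the play's act and scene headings. The sketch below is an assumption-laden example: `strip_play_headings` is a hypothetical helper, and its patterns are based only on the single heading visible in the sample output above, so other heading variants in the text may slip through.

```python
import re

def strip_play_headings(text):
    """Remove Latin act/scene headings such as 'Actus Primus.' and 'Scoena Prima.'.

    Hypothetical helper; the patterns cover only the heading style seen in
    the first 100 characters of the cleaned text.
    """
    text = re.sub(r'Actus \w+\.', ' ', text)
    text = re.sub(r'Scoena \w+\.', ' ', text)
    # Collapse the whitespace left behind, as text_cleaner does.
    return ' '.join(text.split())

sample = ("Actus Primus. Scoena Prima. Enter Flauius, Murellus, and certaine "
          "Commoners ouer the Stage. Flauius.")
cleaned = strip_play_headings(sample)
print(cleaned)
```

Stage directions and speaker labels (such as the trailing "Flauius.") would still remain, so this addresses only part of the structural difference between the play and the novel.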