# Supervised NLP: Identify Book

10.13.19

Parse and explore data. Use different language modeling techniques to predict whether a sentence comes from "Alice's Adventures in Wonderland" by Lewis Carroll or "Persuasion" by Jane Austen.

Topics:
- Text Preprocessing and Exploration (spaCy, NLTK):
    - data cleaning, 
    - tokens, 
    - stop words, 
    - lemmas, 
    - sentences,
    - named entities;
- Feature Engineering/Language Modeling:
    - Bag of Words (CountVectorizer), 
    - N-grams,
    - tf-idf (TfidfVectorizer),
    - word2vec (Gensim);
- Modeling and Evaluation

In [1]:
import pandas as pd
import numpy as np

import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

import spacy
from nltk.corpus import gutenberg, stopwords
import gensim 

from collections import Counter

import warnings
warnings.filterwarnings('ignore')

## Text Preprocessing and Exploration

In [2]:
# nltk.download('gutenberg')
# !python -m spacy download en_core_web_sm

In [3]:
# Load data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# Print the first 100 characters of Alice.
print('\nRaw:\n', alice[0:100])


Raw:
 [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


### Data Cleaning

Some ways to clean text:
- correcting typos and misspelled words,
- dealing with abbreviations,
- lowercasing/uppercasing,
- removing emojis,
- removing stopwords,
- normalizing the words (lemmatization, stemming).

In [4]:
# Remove title.
# Match all text between square brackets.
pattern = "[\[].*?[\]]"
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)

print(alice[0:100])



CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on


In [5]:
# Remove chapter headings.
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

print(alice[0:100])





Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothin


In [6]:
# Remove "new line" characters and other types of extra whitespaces.
# Split and rejoin sentences.
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

print(alice[0:100])

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to


### Tokens
I'll use spaCy to parse the novels into tokens. When one calls spaCy on text, it automatically parses the text, tokenizing the string by breaking it into words and punctuation.

In [7]:
# Parse the cleaned novels.
nlp = spacy.load('en_core_web_sm')

alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [8]:
# Explore the objects.
print('The alice_doc object is a {} object.'.format(type(alice_doc)))
print('It is {} tokens long'.format(len(alice_doc)))
print('The first three tokens are "{}"'.format(alice_doc[:3]))
print('The type of each token is {}'.format(type(alice_doc[0])))

The alice_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 34408 tokens long
The first three tokens are "Alice was beginning"
The type of each token is <class 'spacy.tokens.token.Token'>


In [9]:
# Review the frequency of all tokens including stop words.
def word_frequencies(text, include_stop=True):
    
    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)
    
alice_freq = word_frequencies(alice_doc).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

Alice: [('the', 1524), ('and', 796), ('to', 724), ('a', 611), ('I', 533), ('it', 524), ('she', 508), ('of', 499), ('said', 453), ('Alice', 394)]
Persuasion: [('the', 3120), ('to', 2775), ('and', 2738), ('of', 2563), ('a', 1529), ('in', 1346), ('was', 1329), ('had', 1177), ('her', 1159), ('I', 1118)]


### Stopwords

In [10]:
# Here is a list of the stopwords identified by NLTK.
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
# Review the frequency of tokens without stopwords.
alice_freq = word_frequencies(alice_doc, include_stop=False).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc, include_stop=False).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

Alice: [('said', 453), ('Alice', 394), ('little', 124), ('like', 84), ('went', 83), ('know', 83), ('thought', 74), ('Queen', 73), ('time', 68), ('King', 61)]
Persuasion: [('Anne', 496), ('Captain', 297), ('Mrs', 291), ('Elliot', 288), ('Mr', 254), ('Wentworth', 217), ('Lady', 191), ('good', 181), ('little', 175), ('Charles', 166)]


### Lemmas

In [12]:
# Review the frequency of lemmas without stop words.
def lemma_frequencies(text, include_stop=False):
    
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    return Counter(lemmas)

alice_lemma_freq = lemma_frequencies(alice_doc).most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_doc).most_common(10)
print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)


Alice: [('say', 477), ('Alice', 394), ('think', 131), ('go', 130), ('little', 126), ('look', 106), ('know', 103), ('come', 96), ('like', 92), ('begin', 91)]
Persuasion: [('Anne', 496), ('Captain', 297), ('Mrs', 291), ('Elliot', 288), ('think', 257), ('Mr', 254), ('know', 252), ('good', 225), ('Wentworth', 217), ('Lady', 191)]


In [13]:
# Identify lemmas common to one text but not the other.
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))

Unique to Alice: {'begin', 'go', 'look', 'Alice', 'like', 'little', 'come', 'say'}
Unique to Persuasion: {'Elliot', 'Captain', 'Wentworth', 'Mrs', 'Lady', 'Anne', 'Mr', 'good'}


### Sentence-level Information

In [14]:
# Initial exploration of sentences.
sentences = list(alice_doc.sents)
print('Alice in Wonderland has {} sentences.'.format(len(sentences)))

example_sentence = sentences[2]
print('Here is an example: \n{}\n'.format(example_sentence))

Alice in Wonderland has 1860 sentences.
Here is an example: 
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!



In [15]:
# Look at some metrics around this sentence.
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(('There are {} words in this sentence, and {} of them are'
       ' unique.').format(len(example_words), len(unique_words)))

There are 29 words in this sentence, and 25 of them are unique.


### Named Entities

In [16]:
# Extract the first ten entities.
entities = list(alice_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

PERSON Alice
PERSON Alice
ORG White Rabbit
ORG VERY
PERSON Alice
GPE Rabbit
GPE Rabbit
PERSON Alice
PERSON Alice
PERSON Alice


In [17]:
# All of the unique entities spaCy thinks are people.
people = [entity.text for entity in list(alice_doc.ents) if entity.label_ == "PERSON"]
print(set(people))

{'Bill', 'the Knave of Hearts', 'King', 'Ma', 'Latitude', 'Curiouser', 'Dodo', 'Hatter', 'Mary Ann', 'Serpent', 'Miss', 'began:--', 'Brandy', 'Somebody', 'sadly:--', 'Game', 'Soo', 'the Queen of Hearts', 'The Knave of Hearts', 'Chorus', 'Said', 'Boots', 'Shark', 'Edwin', 'Knave', 'Fifteenth', 'Soles', 'Beau', 'Fury', 'Herald', "W. RABBIT'", 'Jack', 'WILLIAM', 'Stupid', 'Stuff', 'Dinah', 'Queen', 'William the Conqueror', 'Duck', 'Queens', 'Duchess', 'Morcar', 'Mercia', 'ALICE', 'Hjckrrh', 'Cheshire Puss', 'Hare', 'Kings', 'Gryphon', 'Longitude', 'Pat', 'Turtle', 'yer honour', 'words:--', 'Edgar Atheling', 'Dinn', 'Tut', 'Idiot', 'indeed:--', 'William', 'Lobster', 'Lizard', 'Tillie', 'Beautiful', 'Magpie', 'Footman', 'Down', 'Ou est ma chatte', 'Mabel', 'Tortoise', 'Shakespeare', 'follows:--', 'Alice', 'Lobster Quadrille', 'ye', 'Ada'}


In [18]:
# Extract the first ten entities.
entities = list(persuasion_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

PERSON Walter Elliot
ORG Kellynch Hall
GPE Somersetshire
TIME an idle hour
DATE the last century
WORK_OF_ART ELLIOT OF KELLYNCH HALL
PERSON Walter Elliot
DATE March 1 , 1760
DATE July 15 , 1784
PERSON Elizabeth


In [19]:
# All of the unique entities spaCy thinks are people.
people = [entity.text for entity in list(persuasion_doc.ents) if entity.label_ == "PERSON"]
print(set(people))

{'Cousin Charles', 'the Mr Wentworth of Monkford', 'Hayters', "Mrs Musgrove's", 'Captain Benwick', 'Master Harry', 'Belmont', 'Louise', 'Hayter', 'Carteret', 'God', 'an Anne Elliot', 'Shabby', 'a Louisa Musgrove', 'Mrs Clay', 'Louisa', 'Gibraltar', 'Basil Morley', 'William Walter Elliot', 'Mr--', 'Benwick', 'Byron', 'Walter Elliot', 'Sophia', 'Musgroves', "Walter Elliot's", 'Charles Smith', 'Elliot', 'Mr Wentworth', 'Lady Alicia', 'Bath', "Lady Russell's", "Captain Wentworth's", "Louisa Musgrove's", 'Henrietta', 'Lady Mary Maclean', "Mrs Charles's", 'Captain Harville', 'Thornberry', 'Mrs Speed', 'Mr Shepherd', 'Brigden', 'Sophy', 'Anne Elliot', "Mrs Croft's", 'Hamilton', 'Heir', 'Edward', 'Michaelmas', 'John Shepherd', 'Tunbridge Wells', "Lady Dalrymple's", "Elizabeth Elliot's", 'Dick', 'a Mrs Wallis', 'Henry', 'Winthrop', 'Mrs Croft', 'Archibald Drew', 'Mark', 'Mrs Musgrove', 'Louisa Musgrove', 'Cobb', 'F. W.', "Frederick Wentworth's", 'Kellynch Lodge', 'Mr Shepherd one morning', 'Wil

### Create Final Dataframe

In [20]:
# Group into sentences.
alice_sents = [[sent, 'Carroll'] for sent in alice_doc.sents]
persuasion_sents = [[sent, 'Austen'] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

Unnamed: 0,text,author
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


In [21]:
# Get rid of stop words and punctuation, lemmatize the tokens.
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])

In [22]:
sentences.head()

Unnamed: 0,text,author
0,Alice begin tired sit sister bank have twice p...,Carroll
1,consider mind hot day feel sleepy stupid pleas...,Carroll
2,remarkable Alice think way hear rabbit oh dear,Carroll
3,oh dear,Carroll
4,shall late,Carroll


In [23]:
sentences.shape

(5709, 2)

## Feature Engineering and Model Evaluation

# Bag of Words (BoW)
For each observation, count the occurance of each word.

In [24]:
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(sentences["text"])

bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_bow = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)

print(sentences_bow.shape)
sentences_bow.head()

(5709, 4953)


Unnamed: 0,15,16,1760,1784,1785,1787,1789,1791,1800,1803,...,younker,youth,youthful,zeal,zealand,zealous,zealously,zigzag,text,author
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,remarkable Alice think way hear rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,shall late,Carroll


In [25]:
def model_and_evaluation(data):
    Y = data['author']
    X = np.array(data.drop(['text','author'], 1))

    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)
    
    lr = LogisticRegression()
    rfc = RandomForestClassifier()
    gbc = GradientBoostingClassifier()

    lr.fit(X_train, y_train)
    rfc.fit(X_train, y_train)
    gbc.fit(X_train, y_train)

    print('Logistic Regression Scores:')
    print('Training set score:', lr.score(X_train, y_train))
    print('Test set score:', lr.score(X_test, y_test))

    print('\nRandom Forest Scores:')
    print('Training set score:', rfc.score(X_train, y_train))
    print('Test set score:', rfc.score(X_test, y_test))

    print('\nGradient Boosting Scores:')
    print('Training set score:', gbc.score(X_train, y_train))
    print('Test set score:', gbc.score(X_test, y_test))

In [26]:
# Scores on data with BoW features. 
model_and_evaluation(sentences_bow)

Logistic Regression Scores:
Training set score: 0.9378102189781022
Test set score: 0.8800350262697023

Random Forest Scores:
Training set score: 0.9652554744525548
Test set score: 0.8524518388791593

Gradient Boosting Scores:
Training set score: 0.8402919708029197
Test set score: 0.8288091068301225


# N-grams
Words in context.

**Use 2-grams (bigrams):**
For each observation, count the occurance of each word couple.

In [27]:
# Use 2-grams: ngram_range parameter = (2,2).
vectorizer = CountVectorizer(analyzer='word', ngram_range=(2,2))
X = vectorizer.fit_transform(sentences['text'])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_bigram = pd.concat([bow_df, sentences[['text', 'author']]], axis=1)
print(sentences_bigram.shape)
sentences_bigram.head()

(5709, 30608)


Unnamed: 0,15 1784,16 1810,1760 married,1784 elizabeth,1785 anne,1787 bear,1789 mary,1803 dear,1806 have,1810 charles,...,zeal dwell,zeal sport,zeal think,zealand australia,zealous officer,zealous subject,zealously discharge,zigzag go,text,author
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,remarkable Alice think way hear rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,shall late,Carroll


In [29]:
model_and_evaluation(sentences_bigram)

Logistic Regression Scores:
Training set score: 0.9059854014598541
Test set score: 0.782399299474606

Random Forest Scores:
Training set score: 0.9445255474452555
Test set score: 0.7968476357267951

Gradient Boosting Scores:
Training set score: 0.7652554744525547
Test set score: 0.7578809106830122


**Use both 1-gram and 2-gram features:**

In [30]:
# Use both 1-gram and 2-gram together: ngram_range=(1,2)
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2))
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_both = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)
print(sentences_both.shape)
sentences_both.head()

(5709, 35559)


Unnamed: 0,15,15 1784,16,16 1810,1760,1760 married,1784,1784 elizabeth,1785,1785 anne,...,zealand australia,zealous,zealous officer,zealous subject,zealously,zealously discharge,zigzag,zigzag go,text,author
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,remarkable Alice think way hear rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,shall late,Carroll


In [31]:
model_and_evaluation(sentences_both)

Logistic Regression Scores:
Training set score: 0.9556204379562043
Test set score: 0.8791593695271454

Random Forest Scores:
Training set score: 0.9652554744525548
Test set score: 0.8445709281961471

Gradient Boosting Scores:
Training set score: 0.8402919708029197
Test set score: 0.8279334500875657


# TF-IDF
Take into account the meanings of the words as well as their number of occurrences. 

Parameters:
- max_df=0.5: This drops words that occur in more than half the documents.
- min_df=2: This makes the vectorizer only use words that appear at least twice.
- use_idf=True: This makes the vectorizer use inverse document frequencies in weighting.
- norm=u'l2': This applies a correction factor so that longer and shorter documents get treated equally.
- smooth_idf=True: This adds 1 to all document frequencies, as if an extra document existed that used every word once. This prevents divide-by-zero errors.

In [32]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)

X = vectorizer.fit_transform(sentences["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_tfidf = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# the log base 2 of 1 is 0,
# so a tf-idf score of 0 indicates that the word was present once in that sentence.
sentences_tfidf.head()

Unnamed: 0,abide,ability,able,abominate,abroad,absence,absent,absolute,absolutely,absurd,...,yes,yesterday,yield,you,young,youth,zeal,zealous,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll


In [33]:
# Scores on data with tf-idf features. 
model_and_evaluation(sentences_tfidf)

Logistic Regression Scores:
Training set score: 0.9048175182481751
Test set score: 0.867338003502627

Random Forest Scores:
Training set score: 0.9643795620437956
Test set score: 0.8598949211908932

Gradient Boosting Scores:
Training set score: 0.8464233576642336
Test set score: 0.8235551663747811


**TF-IDF with Bigrams:**

In [34]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True, ngram_range=(2,2))

X = vectorizer.fit_transform(sentences["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_tf_bigram = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

sentences_tf_bigram.head()

Unnamed: 0,able bear,able persuade,absence home,absolute necessity,absolutely hopeless,accident lyme,accidentally hearing,accommodation man,account louisa,account small,...,young friend,young lady,young man,young people,young person,young sister,young woman,youth say,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll


In [17]:
# Scores on data with tf-idf and bigram features. 
model_and_evaluation(sentences_tf_bigram)

Logistic Regression Scores:
Training set score: 0.8160583941605839
Test set score: 0.7683887915936952

Random Forest Scores:
Training set score: 0.8712408759124087
Test set score: 0.8021015761821366

Gradient Boosting Scores:
Training set score: 0.7640875912408759
Test set score: 0.7548161120840631


**TF-IDF with both 1 and 2-grams:**

In [35]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True, ngram_range=(1,2))

X = vectorizer.fit_transform(sentences["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_tf_both = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

sentences_tf_both.head()

Unnamed: 0,abide,ability,able,able bear,able persuade,abominate,abroad,absence,absence home,absent,...,young people,young person,young sister,young woman,youth,youth say,zeal,zealous,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll


In [36]:
# Scores on data with tf-idf and bigram features. 
model_and_evaluation(sentences_tf_both)

Logistic Regression Scores:
Training set score: 0.9094890510948905
Test set score: 0.8629597197898424

Random Forest Scores:
Training set score: 0.9635036496350365
Test set score: 0.8633975481611208

Gradient Boosting Scores:
Training set score: 0.8443795620437956
Test set score: 0.824430823117338


# word2vec

It trains a shallow neural network model in an unsupervised manner for converting words to vectors. With a large enough corpus, this will eventually result in words that often appear together having vectors that are near one another.

Parameters:
- workers=4: We set the number of threads to run in parallel to 4 (make sense if your computer has available computing units).
- min_count=1: We set the minimum word count threshold to 1.
- window=6: We set the number of words around target word to consider to 6.
- sg=0: We use CBOW because our corpus is small.
- sample=1e-3: We penalize frequent words.
- size=100: We set the word vector length to 100.
- hs=1: We use hierarchical softmax.

In [41]:
del sentences

In [42]:
# Modify text preparation (to have sentences as a list of words in the final dataframe).

# Clean text.
# Remove title.
pattern = "[\[].*?[\]]"
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)

# Remove chapter headings.
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# Remove "new line" characters and other types of extra whitespaces.
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

# Parse the cleaned novels.
nlp = spacy.load('en_core_web_sm')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

# Group into sentences.
alice_sents = [[sent, 'Carroll'] for sent in alice_doc.sents]
persuasion_sents = [[sent, 'Austen'] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])

# Get rid of stop words and punctuation, lemmatize the tokens.
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop]
    
sentences.head()

Unnamed: 0,text,author
0,"[Alice, begin, tired, sit, sister, bank, have,...",Carroll
1,"[consider, mind, hot, day, feel, sleepy, stupi...",Carroll
2,"[remarkable, Alice, think, way, hear, rabbit, ...",Carroll
3,"[oh, dear]",Carroll
4,"[shall, late]",Carroll


In [43]:
# Train word2vec on the sentences.
model = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=6,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

In [45]:
# Explore word2vec representation.
print('The first five words that are closer to lady:')
print(model.most_similar(positive=['lady', 'man'], negative=['woman'], topn=5))
print('The word that doesn\'t fit in list: dad dinner mom aunt uncle:')
print(model.doesnt_match("dad dinner mom aunt uncle".split()))
print('The similarity score of woman and man:')
print(model.similarity('woman', 'man'))
print('The similarity score of horse and cat:')
print(model.similarity('horse', 'cat'))

The first five words that are closer to lady:
[('recommend', 0.9984135031700134), ('hand', 0.9983631372451782), ('constant', 0.9980081915855408), ('run', 0.9979991912841797), ('mind', 0.997941792011261)]
The word that doesn't fit in list: dad dinner mom aunt uncle:
dinner
The similarity score of woman and man:
0.99914813
The similarity score of horse and cat:
0.930524


Create numerical features using word2vec representations of the words:
- get the word2vec vectors of each word in a sentence and take the average of all the vectors in the high dimensional space (in this case it's 100). So, as a result, there is a vector of 100 dimensions as the feature for a sentence.
- then use each dimension as a separate feature which means that in our the data set there will be 100 numerical features.

In [46]:
word2vec_arr = np.zeros((sentences.shape[0],100))

for i, sentence in enumerate(sentences["text"]):
    word2vec_arr[i,:] = np.mean([model[lemma] for lemma in sentence], axis=0)

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]],word2vec_arr], axis=1)
sentences.dropna(inplace=True)

sentences.head()

Unnamed: 0,author,text,0,1,2,3,4,5,6,7,...,90,91,92,93,94,95,96,97,98,99
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",0.071385,-0.291025,-0.049948,-0.108804,-0.231856,0.036387,-0.130457,-0.221261,...,0.229503,0.101283,-0.117701,0.545567,-0.000572,-0.150729,-0.10344,0.122948,0.02655,-0.126771
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",0.063728,-0.23992,-0.02109,-0.077444,-0.168184,0.027953,-0.096978,-0.187447,...,0.182897,0.092881,-0.08639,0.430058,-0.007557,-0.136318,-0.072336,0.104083,0.030854,-0.100596
2,Carroll,"[remarkable, Alice, think, way, hear, rabbit, ...",0.077163,-0.314964,-0.050008,-0.108498,-0.247042,0.050609,-0.133592,-0.243197,...,0.240807,0.116528,-0.120837,0.588547,0.004665,-0.179862,-0.098956,0.131845,0.042789,-0.14232
3,Carroll,"[oh, dear]",0.077471,-0.248518,-0.007798,-0.060287,-0.187497,0.004959,-0.127766,-0.226183,...,0.260143,0.125705,-0.103706,0.593541,-0.009451,-0.180068,-0.11348,0.108455,0.085983,-0.10596
4,Carroll,"[shall, late]",0.053746,-0.234428,-0.054857,-0.096483,-0.204009,0.040737,-0.102644,-0.176613,...,0.176987,0.078345,-0.090648,0.418138,0.006417,-0.111865,-0.076904,0.114342,0.007152,-0.109687


In [47]:
model_and_evaluation(sentences)

Logistic Regression Scores:
Training set score: 0.7847946045370938
Test set score: 0.8010110294117647

Random Forest Scores:
Training set score: 0.9831391784181484
Test set score: 0.8051470588235294

Gradient Boosting Scores:
Training set score: 0.8991416309012875
Test set score: 0.8350183823529411


**Use pre-trained vectors released by Google:**

In [48]:
# Load Google's pre-trained Word2Vec model.
model_pretrained = gensim.models.KeyedVectors.load_word2vec_format(
    'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [49]:
word2vec_arr = np.zeros((sentences.shape[0],300))

for i, sentence in enumerate(sentences["text"]):
  try:
    word2vec_arr[i,:] = np.mean([model_pretrained[lemma] for lemma in sentence], axis=0)
  except KeyError:
    word2vec_arr[i,:] = np.full((1,300), np.nan)
    continue

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]],word2vec_arr], axis=1)
sentences.dropna(inplace=True)

print("Shape of the dataset: {}".format(sentences.shape))
sentences.head()

Shape of the dataset: (4500, 302)


Unnamed: 0,author,text,0,1,2,3,4,5,6,7,...,290,291,292,293,294,295,296,297,298,299
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",0.046265,0.016199,-0.036288,0.08241,-0.010284,0.015515,0.005437,-0.035947,...,-0.066516,0.029852,-0.042609,-0.044208,-0.056998,-0.063269,0.000244,-0.085071,-0.00034,-0.064371
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",0.046331,0.020463,-0.002012,0.101565,-0.066478,-0.035698,0.045293,-0.068695,...,0.05594,0.085838,-0.067052,-0.013628,-0.027802,-0.033665,-0.023586,0.00962,0.030316,0.000908
2,Carroll,"[remarkable, Alice, think, way, hear, rabbit, ...",0.072189,0.034546,-0.009544,0.122665,-0.053543,-0.038696,0.058594,-0.055786,...,-0.008102,0.050652,-0.086411,0.005266,-0.085938,-0.137391,0.004723,0.020203,0.000519,0.061066
3,Carroll,"[oh, dear]",0.073975,0.134277,0.141357,0.256348,-0.147949,0.09967,0.077148,-0.093628,...,0.058228,0.000854,-0.094971,-0.052668,-0.091919,-0.142456,-0.053711,-0.112671,-0.148193,0.186798
4,Carroll,"[shall, late]",0.095215,0.084473,0.206787,0.211182,0.043579,-0.155762,0.088379,-0.038574,...,-0.021667,-0.103516,-0.038578,-0.007385,0.020264,0.134155,-0.177246,-0.254639,-0.212158,0.087646


In [50]:
model_and_evaluation(sentences)

Logistic Regression Scores:
Training set score: 0.8892592592592593
Test set score: 0.8594444444444445

Random Forest Scores:
Training set score: 0.9837037037037037
Test set score: 0.7672222222222222

Gradient Boosting Scores:
Training set score: 0.9562962962962963
Test set score: 0.8538888888888889


### Summary

In this assignment, I parsed text and explored data, reviewed most frequent tokens and lemmas, analyzed sentences and named entities. I then tried different language modeling methods, reviewed features and assessed them by training classification models (with default parameters) and evaluating scores and how well the models generalize on unseen data. Model performance varied based on input number and type of features.