# Natural Language Processing

Natural Language Processing (NLP) is concerned with the interaction between computers and natural human languages. This notebook demonstrates several NLP methods in Python.

In [1]:
import requests
import nltk
import pprint

from nltk import ne_chunk

import string
import numpy as np
import pandas as pd

from nltk.tokenize import WordPunctTokenizer
from sklearn.datasets import load_files

from scipy.sparse import csr_matrix
from sklearn.utils import check_random_state
from sklearn.preprocessing import normalize
from sklearn.metrics import classification_report

from helper import *

## Basic Concepts

Let's explore some basic concepts of part of speech (POS) tagging.

We use *Alice's Adventures in Wonderland* by Lewis Carroll, freely available from *Project Gutenberg*.

In [2]:
resp = requests.get('http://www.gutenberg.org/cache/epub/11/pg11.txt')
text = resp.text
print(text[:1000])

﻿Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice's Adventures in Wonderland

Author: Lewis Carroll

Posting Date: June 25, 2008 [EBook #11]
Release Date: March, 1994
[Last updated: December 20, 2011]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***










ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the u

### Tokenize

- Function `tokenize()` calls `word_tokenize()` to tokenize the text by words.

In [3]:
word_tokens = tokenize(text)
print('{0} tokens in text'.format(len(word_tokens)))
print(40*'-')
print(word_tokens[:13])

36719 tokens in text
----------------------------------------
['\ufeffProject', 'Gutenberg', "'s", 'Alice', "'s", 'Adventures', 'in', 'Wonderland', ',', 'by', 'Lewis', 'Carroll', 'This']
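
The output above keeps `'s` as a single token, which is typical of Treebank-style tokenization; `WordPunctTokenizer` would instead split on every punctuation character. As a small illustration, the two styles can be compared on a sample sentence. The sentence is chosen here for demonstration, `TreebankWordTokenizer` is only a stand-in for whatever `tokenize()` does internally, and neither tokenizer needs extra NLTK data.

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

sample = "Alice's Adventures in Wonderland, by Lewis Carroll"

# Regex-based: splits text into runs of word characters and runs of punctuation.
wp_tokens = WordPunctTokenizer().tokenize(sample)

# Treebank-style rules: clitics such as 's stay together as one token.
tb_tokens = TreebankWordTokenizer().tokenize(sample)

print(wp_tokens)
print(tb_tokens)
```

The first list separates the apostrophe from the `s`, while the second keeps `'s` together, matching the token list shown above.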


### Collocations

- Function `find_best_bigrams()` builds bigram collocations by using the pointwise mutual information (PMI).
- The 10 best bigrams are returned.

In [4]:
top_bigrams = find_best_bigrams(word_tokens)

print('Best {0} bi-grams in text (WP Tokenizer)'.format(10))
print(50*'-')

ppf = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=False)
ppf.pprint(top_bigrams)

Best 10 bi-grams in text (WP Tokenizer)
--------------------------------------------------
[ ('#', '11'),
  ("'Cheshire", 'Puss'),
  ("'IT", 'DOES'),
  ("'ORANGE", 'MARMALADE'),
  ("'Ou", 'est'),
  ("'Rule", 'Forty-two'),
  ("'Seven", 'jogged'),
  ("'With", 'extras'),
  ("'any", 'shrimp'),
  ("'than", 'waste')]
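
PMI rewards word pairs that rarely occur apart, so pairs seen only once tend to dominate the ranking, as in the output above. The sketch below shows how such a ranking can be built with NLTK's `BigramCollocationFinder`, and how a frequency filter changes it. The toy token list and the threshold of 2 are assumptions made for illustration; whether `find_best_bigrams()` applies any filter is not stated here.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list standing in for word_tokens.
tokens = "the cat sat on the mat the cat ate the fish".split()

finder = BigramCollocationFinder.from_words(tokens)

# Rank all bigrams by PMI and keep the best three.
best = finder.nbest(BigramAssocMeasures.pmi, 3)

# Drop bigrams that occur fewer than 2 times, then rank again.
finder.apply_freq_filter(2)
filtered = finder.nbest(BigramAssocMeasures.pmi, 3)

print(best)
print(filtered)
```

After filtering, only the bigram that occurs twice in the toy text remains.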


### DefaultTagger

- Function `tag_words()` uses `DefaultTagger` to associate a tag of our choosing (the `tag` parameter) with words.

In [5]:
tags = tag_words(word_tokens, 'INFO')
print('Tagged text (WP Tokenizer)')
print(50*'-')
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)
pp.pprint(tags[:15])

Tagged text (WP Tokenizer)
--------------------------------------------------
[ ('\ufeffProject', 'INFO'), ('Gutenberg', 'INFO'), ("'s", 'INFO'),
  ('Alice', 'INFO'), ("'s", 'INFO'), ('Adventures', 'INFO'), ('in', 'INFO'),
  ('Wonderland', 'INFO'), (',', 'INFO'), ('by', 'INFO'), ('Lewis', 'INFO'),
  ('Carroll', 'INFO'), ('This', 'INFO'), ('eBook', 'INFO'), ('is', 'INFO')]


### Part of Speech Tagging

- Function `tag_pos()` uses a PerceptronTagger to create Part of Speech (PoS) tags.

In [6]:
pos_tags = tag_pos(word_tokens)

print('PoS tagged text (WP Tokenizer/Universal Tagger)')
print(60*'-')

ppf.pprint(pos_tags[:15])

PoS tagged text (WP Tokenizer/Universal Tagger)
------------------------------------------------------------
[ ('\ufeffProject', 'JJ'),
  ('Gutenberg', 'NNP'),
  ("'s", 'POS'),
  ('Alice', 'NNP'),
  ("'s", 'POS'),
  ('Adventures', 'NNS'),
  ('in', 'IN'),
  ('Wonderland', 'NNP'),
  (',', ','),
  ('by', 'IN'),
  ('Lewis', 'NNP'),
  ('Carroll', 'NNP'),
  ('This', 'DT'),
  ('eBook', 'NN'),
  ('is', 'VBZ')]


### Penn Treebank

- Function `tag_penn()` tags the tokens by using a `UnigramTagger` trained on tagged sentences from the Penn Treebank corpus. Tokens that the tagger does not recognize are tagged `None`.

In [7]:
b_tags = tag_penn(word_tokens)

print('Penn Treebank tagged text (WP Tokenizer)')
print(60*'-')

ppf.pprint(b_tags[:15])

Penn Treebank tagged text (WP Tokenizer)
------------------------------------------------------------
[ ('\ufeffProject', None),
  ('Gutenberg', None),
  ("'s", 'POS'),
  ('Alice', None),
  ("'s", 'POS'),
  ('Adventures', None),
  ('in', 'IN'),
  ('Wonderland', None),
  (',', ','),
  ('by', 'IN'),
  ('Lewis', 'NNP'),
  ('Carroll', None),
  ('This', 'DT'),
  ('eBook', None),
  ('is', 'VBZ')]


### Linking Taggers

- Function `tag_linked()` links the Penn Treebank Corpus tagger with our earlier Default tagger, so that tokens the first tagger cannot tag receive the default tag instead of `None`.

In [8]:
linked_tags = tag_linked(word_tokens)
print('Penn Treebank tagged text (WP Tokenizer/Linked Tagger)')
print(60*'-')

ppf.pprint(linked_tags[:15])

Penn Treebank tagged text (WP Tokenizer/Linked Tagger)
------------------------------------------------------------
[ ('\ufeffProject', 'INFO'),
  ('Gutenberg', 'INFO'),
  ("'s", 'POS'),
  ('Alice', 'INFO'),
  ("'s", 'POS'),
  ('Adventures', 'INFO'),
  ('in', 'IN'),
  ('Wonderland', 'INFO'),
  (',', ','),
  ('by', 'IN'),
  ('Lewis', 'NNP'),
  ('Carroll', 'INFO'),
  ('This', 'DT'),
  ('eBook', 'INFO'),
  ('is', 'VBZ')]
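
The difference between the last two outputs can be reproduced on a small scale. The sketch below trains a `UnigramTagger` on a tiny hand-made tagged sentence (an assumption standing in for the Penn Treebank sample) and links it to a `DefaultTagger` through the `backoff` parameter. It is only meant to illustrate the mechanism, not to mirror the helper functions exactly.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-made training set standing in for the Penn Treebank sample.
train_sents = [[('This', 'DT'), ('book', 'NN'), ('is', 'VBZ'), ('in', 'IN')]]
tokens = ['This', 'eBook', 'is']

# Without a backoff tagger, unseen words are tagged None.
plain = UnigramTagger(train_sents)

# With a backoff tagger, unseen words fall back to the default tag.
linked = UnigramTagger(train_sents, backoff=DefaultTagger('INFO'))

print(plain.tag(tokens))
print(linked.tag(tokens))
```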


### Tagged Text Extraction

- We can use regular expressions to restrict tokens in the text to Nouns, Verbs, Adjectives, and Adverbs.
- Function `extract_tags()` returns a tuple of PoS tags and the extracted terms.

In [9]:
pos_tags, terms = extract_tags(word_tokens)

print('POS tagged text (WP Tokenizer)')
print(60*'-')
pp.pprint(pos_tags[:15])
print(60*'-')
print('POS tagged text (WP Tokenizer/RegEx applied)')
print(60*'-')
pp.pprint(terms[:15])

POS tagged text (WP Tokenizer)
------------------------------------------------------------
[ ('\ufeffProject', 'JJ'), ('Gutenberg', 'NNP'), ("'s", 'POS'),
  ('Alice', 'NNP'), ("'s", 'POS'), ('Adventures', 'NNS'), ('in', 'IN'),
  ('Wonderland', 'NNP'), (',', ','), ('by', 'IN'), ('Lewis', 'NNP'),
  ('Carroll', 'NNP'), ('This', 'DT'), ('eBook', 'NN'), ('is', 'VBZ')]
------------------------------------------------------------
POS tagged text (WP Tokenizer/RegEx applied)
------------------------------------------------------------
[ '\ufeffProject', 'Gutenberg', 'Alice', 'Adventures', 'Wonderland', 'Lewis',
  'Carroll', 'eBook', 'is', 'use', 'anyone', 'anywhere', 'cost', 'almost',
  'restrictions']
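
The filtering step can be sketched with a regular expression over Penn Treebank tags. The pattern below is a hypothetical choice (tags starting with `NN`, `VB`, `JJ`, or `RB`), and the sample list reuses a few tagged tokens from the output above; the pattern inside `extract_tags()` may differ.

```python
import re

# A few tagged tokens copied from the output above.
sample_tags = [('Gutenberg', 'NNP'), ("'s", 'POS'), ('Alice', 'NNP'),
               ('in', 'IN'), (',', ','), ('This', 'DT'),
               ('eBook', 'NN'), ('is', 'VBZ')]

# Hypothetical pattern: keep nouns, verbs, adjectives, and adverbs.
pattern = re.compile(r'(NN|VB|JJ|RB)')

# re.match anchors at the start of the tag, so 'POS' and 'IN' are rejected.
sample_terms = [word for word, tag in sample_tags if pattern.match(tag)]
print(sample_terms)
```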


## Topic Modeling

Let's move on to topic modeling. We use the twenty newsgroups data set.

In [10]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(
    data_home='~/textdm', 
    subset='train',
    shuffle=True,
    random_state=check_random_state(0),
    remove=('headers', 'footers', 'quotes')
    )

test = fetch_20newsgroups(
    data_home='~/textdm', 
    subset='test',
    shuffle=True,
    random_state=check_random_state(0),
    remove=('headers', 'footers', 'quotes')
    )

### Document term matrix

- Function `get_document_term_matrix()` uses `TfidfVectorizer` to create a document term matrix for both `train['data']` and `test['data']`.
- It uses English stop words.
- It uses unigrams and bigrams.
- It ignores terms that have a document frequency strictly lower than 2.
- It builds a vocabulary that only considers the top 20,000 features ordered by term frequency across the corpus.
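
The settings listed above map directly onto `TfidfVectorizer` parameters. The following is a minimal sketch on a toy corpus (the documents and variable names are assumptions for illustration); the vectorizer is fitted on the training documents only and then reused for the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for train['data'] and test['data'].
train_docs = ["the cat sat on the mat",
              "the cat chased the dog",
              "the dog sat on the log"]
test_docs = ["a cat and a dog"]

vectorizer = TfidfVectorizer(
    stop_words='english',   # drop English stop words
    ngram_range=(1, 2),     # unigrams and bigrams
    min_df=2,               # ignore terms found in fewer than 2 documents
    max_features=20000,     # keep at most the 20,000 most frequent terms
)

train_matrix = vectorizer.fit_transform(train_docs)  # learn vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuse the same vocabulary

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

In this toy corpus only `cat`, `dog`, and `sat` appear in at least two documents, so the vocabulary has three terms and no bigrams survive the `min_df` cutoff.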

In [11]:
cv, train_data, test_data = get_document_term_matrix(train['data'], test['data'])

### Non-negative matrix factorization

- Function `apply_nmf()` applies non-negative matrix factorization (NMF) to compute topics in `train_data`.
- It uses 60 topics.
- It normalizes the transformed data so that the topic weights of each document sum to one.
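
As a rough sketch of these steps, the code below factorizes a small random non-negative matrix and normalizes each row with the L1 norm. The toy matrix, the choice of 3 components, and the use of `norm='l1'` are assumptions for illustration; `apply_nmf()` uses 60 topics on the real document term matrix.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X = rng.rand(6, 8)  # toy non-negative "document term" matrix

# Factorize X into document-topic weights (W) and topic-term weights.
nmf_toy = NMF(n_components=3, random_state=0, max_iter=500)
W = nmf_toy.fit_transform(X)

# L1 normalization: the topic weights of each document sum to one.
W_norm = normalize(W, norm='l1')

print(W_norm.shape)
print(W_norm.sum(axis=1))
```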

In [12]:
nmf, td_norm = apply_nmf(train_data, random_state=check_random_state(0))

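# We use a DataFrame to simplify collecting the data for display.
df = pd.DataFrame(td_norm)
df.fillna(value=0, inplace=True)
# Note: train['target_names'] holds one name per newsgroup (20 in total), so
# this assignment labels only the first 20 rows; the other rows get NaN and
# are left out by groupby(). Mapping train['target'] to these names would
# label every document instead.
df['label'] = pd.Series(train['target_names'], dtype="category")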

df.groupby('label').mean()

| label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alt.atheism | 0.0 | 0.004216 | 0.0 | 0.120597 | 0.001416 | 0.010014 | 0.000491 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.001455 | 0.0 | 0.0 | 0.0 | 0.434272 | 0.0 | 0.003139 | 0.0 |
| comp.graphics | 0.037898 | 0.0 | 0.0 | 0.209269 | 0.0 | 0.0 | 0.0 | 0.00367 | 0.0 | 0.012231 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.034581 | 0.0 | 0.005427 | 0.003008 | 0.033312 | 0.019869 |
| comp.os.ms-windows.misc | 0.012951 | 0.207267 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002893 | 0.057188 | 0.007003 | ... | 0.0 | 0.015903 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.059286 | 0.0 | 0.0 |
| comp.sys.ibm.pc.hardware | 0.253397 | 0.015348 | 0.003114 | 0.0 | 0.241721 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.109778 | 0.0 | 0.0 | 0.03412 | 0.0 | 0.0 | 0.017412 | 0.0 |
| comp.sys.mac.hardware | 0.0 | 0.0 | 0.0 | 0.641944 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| comp.windows.x | 0.002888 | 0.00413 | 0.0 | 0.0 | 0.048298 | 0.0 | 0.0 | 0.0 | 0.006685 | 0.0 | ... | 0.003866 | 0.0 | 0.0 | 0.002701 | 0.0 | 0.013505 | 0.0 | 0.014806 | 0.367371 | 0.019263 |
| misc.forsale | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.086419 | 0.0 | 0.007352 | 0.0 | ... | 0.00567 | 0.04885 | 0.579381 | 0.0 | 0.0 | 0.007527 | 0.00942 | 0.0 | 0.083633 | 0.001053 |
| rec.autos | 0.044051 | 0.0 | 0.0 | 0.0 | 0.149592 | 0.0 | 0.0 | 0.03026 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| rec.motorcycles | 0.00088 | 0.0 | 0.004938 | 0.002977 | 0.001808 | 0.0 | 0.00366 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.067102 | 0.0 | 0.0 | 0.0 | 0.002896 | 0.0 | 0.104846 | 0.0 |
| rec.sport.baseball | 0.0 | 0.0 | 0.0 | 0.004005 | 0.0 | 0.0 | 0.009396 | 0.0 | 0.042552 | 0.012623 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.675968 | 0.0 | 0.0 | 0.0 | 0.001574 |


### Topic-based Classification

- Let's train a Random Forest classifier on the topics in the training sample of the twenty newsgroup data set.
- Function `classify_topics()` uses the previously created NMF model to compute the topics for the test data, and then predicts classifications from these topics.

The resulting classification report and confusion matrix are shown to demonstrate the quality of this classification method.
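
The same idea can be sketched on toy data before looking at the real run. Everything in this example is an assumption made for illustration: the synthetic matrices, the 2 components, and the default `RandomForestClassifier` settings. The point is the order of operations: the NMF model is fitted on the training data, and the fitted model then transforms the test data.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
left = [1, 1, 1, 0.1, 0.1, 0.1]    # class 0 loads on the first three columns
right = [0.1, 0.1, 0.1, 1, 1, 1]   # class 1 loads on the last three columns

X_train = np.vstack([rng.rand(20, 6) * left, rng.rand(20, 6) * right])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.vstack([rng.rand(5, 6) * left, rng.rand(5, 6) * right])

nmf_toy = NMF(n_components=2, random_state=0, max_iter=500)
train_topics = nmf_toy.fit_transform(X_train)  # fit topics on training data only
test_topics = nmf_toy.transform(X_test)        # reuse the fitted model

forest = RandomForestClassifier(random_state=0)
forest.fit(train_topics, y_train)
toy_preds = forest.predict(test_topics)
print(toy_preds)
```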

In [13]:
clf, ts_preds = classify_topics(
    nmf, nmf.transform(train_data), train['target'], test_data, check_random_state(0)
    )
print(classification_report(test['target'], ts_preds, target_names=test['target_names']))

                          precision    recall  f1-score   support

             alt.atheism       0.26      0.34      0.29       319
           comp.graphics       0.38      0.52      0.44       389
 comp.os.ms-windows.misc       0.45      0.47      0.46       394
comp.sys.ibm.pc.hardware       0.47      0.49      0.48       392
   comp.sys.mac.hardware       0.52      0.51      0.52       385
          comp.windows.x       0.62      0.57      0.59       395
            misc.forsale       0.73      0.69      0.71       390
               rec.autos       0.42      0.66      0.52       396
         rec.motorcycles       0.66      0.62      0.64       398
      rec.sport.baseball       0.52      0.52      0.52       397
        rec.sport.hockey       0.65      0.61      0.63       399
               sci.crypt       0.66      0.59      0.62       396
         sci.electronics       0.38      0.31      0.34       393
                 sci.med       0.61      0.56      0.58       396
         

### Topic Modeling with Gensim

- We can also use the gensim library to perform topic modeling of the twenty newsgroup data. First we transform the sparse document term matrix into a gensim corpus and construct a vocabulary dictionary. Then we create a Latent Dirichlet Allocation (LDA) model with 20 topics for the newsgroup text and return the 5 most significant words for each topic.
- We specify three parameters in `LdaModel()`: `corpus`, `id2word`, and `num_topics`.

In [14]:
topics = get_topics(cv, train_data)

for idx, (lst, val) in enumerate(topics):
    print('Topic {0}'.format(idx))
    print(35*('-'))
    for i, z in lst:
        print('    {0:20s}: {1:5.4f}'.format(z, i))
    print(35*('-'))

Topic 0
-----------------------------------
    pitt                : 0.0054
    banks               : 0.0049
    pitt edu            : 0.0046
    shameful            : 0.0044
    intellect           : 0.0043
-----------------------------------
Topic 1
-----------------------------------
    people              : 0.0029
    just                : 0.0029
    don                 : 0.0029
    like                : 0.0028
    know                : 0.0027
-----------------------------------
Topic 2
-----------------------------------
    god                 : 0.0076
    people              : 0.0045
    don                 : 0.0036
    think               : 0.0035
    believe             : 0.0034
-----------------------------------
Topic 3
-----------------------------------
    zip                 : 0.0027
    help                : 0.0022
    thanks              : 0.0020
    like                : 0.0019
    use                 : 0.0018
-----------------------------------
Topic 4
------------

## Semantic Analysis

Let's move on to semantic analysis.

### WordNet

We use the WordNet synonym rings (synsets).

- Function `find_number_of_entries_in_synonym_ring()` finds how many synsets WordNet contains for a word.

In [15]:
the_word = 'ship'
n_entries = find_number_of_entries_in_synonym_ring(the_word)
print('{0} total entries in synonym ring for {1}. '.format(n_entries, the_word))

6 total entries in synonym ring for ship. 


In [16]:
the_word = 'throw'
n_entries = find_number_of_entries_in_synonym_ring(the_word)
print('{0} total entries in synonym ring for {1}. '.format(n_entries, the_word))

20 total entries in synonym ring for throw. 


### Word Similarities

- Four helper functions compute word similarities.
- Each one computes the path similarity for one pair of words drawn from _cat_, _dog_, _boy_, and _girl_.

In [17]:
# Now we print similarity measures.
fmt_str = '{1} to {2}: {0:4.3f}'

print('Path Similarity:')
print(40*'-')
print(fmt_str.format(get_path_similarity_between_boy_and_girl(), 'boy', 'girl'))
print(fmt_str.format(get_path_similarity_between_boy_and_cat(), 'boy', 'cat'))
print(fmt_str.format(get_path_similarity_between_boy_and_dog(), 'boy', 'dog'))
print(fmt_str.format(get_path_similarity_between_girl_and_girl(), 'girl', 'girl'))

Path Similarity:
----------------------------------------
boy to girl: 0.167
boy to cat: 0.083
boy to dog: 0.143
girl to girl: 1.000
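
WordNet's path similarity is 1 / (1 + d), where d is the number of edges on the shortest path between two synsets in the hypernym hierarchy. A score of 0.167 (about 1/6) therefore corresponds to a shortest path of five edges. The sketch below applies the same formula to an invented toy hierarchy; it is NOT WordNet's actual structure, so the numbers differ from the output above.

```python
from collections import deque

# Invented toy hierarchy (child -> parent); not WordNet's real structure.
hypernym = {
    'boy': 'male', 'male': 'person',
    'girl': 'female', 'female': 'person',
    'person': 'organism',
    'cat': 'feline', 'feline': 'animal', 'animal': 'organism',
}

# Build an undirected graph so paths can go up and down the hierarchy.
neighbors = {}
for child, parent in hypernym.items():
    neighbors.setdefault(child, set()).add(parent)
    neighbors.setdefault(parent, set()).add(child)

def shortest_path(a, b):
    """Number of edges on the shortest path between a and b (breadth-first search)."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

def path_similarity(a, b):
    return 1.0 / (1 + shortest_path(a, b))

print(path_similarity('boy', 'girl'))   # 4 edges apart in the toy hierarchy
print(path_similarity('boy', 'cat'))    # 6 edges apart in the toy hierarchy
print(path_similarity('girl', 'girl'))  # identical nodes
```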


### Word2Vec

Let's use the NLTK Brown corpus to build a word2vec model.

In [18]:
from nltk.corpus import brown
sentences = brown.sents()

#### Word2Vec model

- Function `get_model()` builds the Word2Vec model.
- It builds the model from the Brown corpus sentences (`sentences`).
- The maximum distance between the current and predicted word within a sentence is set to 10.
- It ignores all words with total frequency lower than 2.

The following code cell takes a while to complete.

In [19]:
model = get_model(sentences)

### Cosine Similarity

- Function `get_cosine_similarity()` computes Cosine Similarities.
- Cosine Similarity measures the similarity between two vectors by computing the cosine of the angle between them.

In [20]:
# Now we print similarity measures.
fmt_str = '{1} to {2}: {0:4.3f}'

print('Cosine Similarity:')
print(40*'-')
print(fmt_str.format(get_cosine_similarity(model, 'boy', 'girl'), 'boy', 'girl'))
print(fmt_str.format(get_cosine_similarity(model, 'boy', 'cat'), 'boy', 'cat'))
print(fmt_str.format(get_cosine_similarity(model, 'boy', 'dog'), 'boy', 'dog'))
print(fmt_str.format(get_cosine_similarity(model, 'girl', 'girl'), 'girl', 'girl'))

Cosine Similarity:
----------------------------------------
boy to girl: 0.947
boy to cat: 0.576
boy to dog: 0.716
girl to girl: 1.000
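
For reference, the cosine of the angle between two vectors u and v is their dot product divided by the product of their lengths. The sketch below computes it with NumPy on toy vectors; the function name `cosine_similarity` and the vectors are assumptions for illustration, not the helper used above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine_similarity([1, 2, 3], [1, 2, 3])   # identical direction
orthogonal = cosine_similarity([1, 0], [0, 1])   # perpendicular vectors
diagonal = cosine_similarity([1, 0], [1, 1])     # 45 degrees apart

print(same, orthogonal, diagonal)
```

A vector compared with itself always scores 1, which is why `girl` to `girl` is 1.000 in the output above.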


### Most similar words

- Function `find_most_similar_words()` finds the top 3 most similar words, where "girl" and "cat" contribute positively towards the similarity, and "boy" and "dog" contribute negatively.

In [21]:
print('{0:14s}: {1}'.format('Word', 'Cosine Similarity'))
print(40*'-')
for val in find_most_similar_words(model):
    print('{0:14s}: {1:6.3f}'.format(val[0], val[1]))

Word          : Cosine Similarity
----------------------------------------
Joan's        :  0.661
burglary      :  0.618
correspondingly:  0.617
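
One common way to implement such a query is to add the unit vectors of the positive words, subtract the unit vectors of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below does this with NumPy on invented two-dimensional vectors; both the vectors and the function `most_similar` are assumptions for illustration and do not reproduce the model trained above.

```python
import numpy as np

# Invented toy word vectors.
vectors = {
    'girl': [1.0, 0.9], 'boy': [1.0, -0.9],
    'cat': [-1.0, 0.8], 'dog': [-1.0, -0.8],
    'queen': [0.1, 1.0], 'truck': [0.2, -1.0], 'tree': [1.0, 0.0],
}

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def most_similar(positive, negative, topn=3):
    # Combine the query words: positive words add, negative words subtract.
    query = unit(sum(unit(vectors[w]) for w in positive)
                 - sum(unit(vectors[w]) for w in negative))
    exclude = set(positive) | set(negative)
    # Cosine similarity of the query with every other word.
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

toy_result = most_similar(positive=['girl', 'cat'], negative=['boy', 'dog'])
for word, score in toy_result:
    print('{0:14s}: {1:6.3f}'.format(word, score))
```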
