# Python's Natural Language Toolkit - NLTK
 ##### Keziah Sheldon, Eesha Das Gupta, Rachel Buttry

## What is Natural Language Processing?

"Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages." -[Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing)

Natural Language Processing (NLP) is how computers process spoken and written human language for analysis.

Link: [Wisdom of Chopra](http://wisdomofchopra.com/)

## [Natural Language ToolKit](http://www.nltk.org/)

* Python module for NLP
* Provides interfaces to over 50 corpora and lexical resources
* Tools for tokenization, stemming, tagging, parsing, and semantic reasoning
* Wrappers for industrial-strength NLP libraries

## Example NLP Pipeline for NLTK

![title](./pipeline1.png)

[Image source](http://www.nltk.org/book/ch03.html)

## Tokenizing

* Tokenization is usually the first step of NLP (Natural Language Processing)
* Splits strings into substrings (tokens)
* Sentence tokenization separates text into sentences by identifying where they start and stop

### For example, the Punkt tokenizer: (from docstring)
This tokenizer divides a text into a list of sentences,
by using an unsupervised algorithm to build a model for abbreviation
words, collocations, and words that start sentences.  It must be
trained on a large collection of plaintext in the target language
before it can be used.

## Tokenize by sentence
* Doesn’t depend on punctuation alone, since punctuation is often ambiguous (emoticons, abbreviations, etc.)
* Depends on the chosen tokenizer (often pre-trained), e.g., Punkt, Stanford, Regex, etc.
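
To see why punctuation alone is not enough, here is a minimal sketch of a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace. The function name `naive_sentence_split` and the sample text are made up for illustration; this is not how Punkt works, and it is written for Python 3.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (punctuation only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical sample text containing an abbreviation.
sample = "Dr. Smith teaches Big Data. The class is fun!"
print(naive_sentence_split(sample))
# ['Dr.', 'Smith teaches Big Data.', 'The class is fun!']
```

The abbreviation "Dr." is wrongly treated as a sentence of its own, which is the kind of mistake a trained tokenizer is meant to avoid.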

In [6]:
import nltk

text = "This is PHYST480 Big Data class. Natural language processing with the Natural Language Tool Kit module. Much machine learning. So wow."

sent_tokenize_list = nltk.tokenize.sent_tokenize(text)

print(sent_tokenize_list)


['This is PHYST480 Big Data class.', 'Natural language processing with the Natural Language Tool Kit module.', 'Much machine learning.', 'So wow.']


## Tokenize by word
* Splits string by word
* `word_tokenize` first splits the text into sentences with Punkt, then splits each sentence into word and punctuation tokens

In [9]:
print(nltk.tokenize.word_tokenize('Hello World.'))


['Hello', 'World', '.']
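
For intuition, a rough sketch of word tokenization with a single regular expression is given here. This is a simplification in the style of a regex-based tokenizer, not the algorithm behind `word_tokenize`; the function name `regex_word_tokenize` is an assumption for illustration (Python 3).

```python
import re

def regex_word_tokenize(text):
    # Match runs of word characters, or any single character that is
    # neither a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

print(regex_word_tokenize("Hello World."))
# ['Hello', 'World', '.']
```

A pattern like this splits contractions such as "Don't" into three tokens, which is one reason dedicated tokenizers use more careful rules.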


## Part of Speech (POS) Tagging


* POS tagging is grammatical tagging
* Labels each token as a noun, adjective, verb, etc.
* Can use pre-trained taggers or train your own

Examples of POS tags (from the Penn Treebank tagset):
* JJ: adjective or numeral, ordinal
* NNP: noun, proper, singular
* IN: preposition or conjunction, subordinating


In [11]:

text = nltk.word_tokenize('The cake is a lie.')

print(nltk.pos_tag(text))


[('The', 'DT'), ('cake', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('lie', 'NN'), ('.', '.')]
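
The tagger returns a plain list of `(word, tag)` pairs, so ordinary Python is enough to work with it. The sketch below copies the output shown above into a literal (so it runs without NLTK) and pulls out the nouns; the variable names are illustrative only (Python 3).

```python
from collections import Counter

# Tagged output copied from the cell above.
tagged = [('The', 'DT'), ('cake', 'NN'), ('is', 'VBZ'),
          ('a', 'DT'), ('lie', 'NN'), ('.', '.')]

# Noun tags (NN, NNS, NNP, NNPS) all start with 'NN'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
# ['cake', 'lie']

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts['DT'])
# 2
```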


## Stemming

Stemming is the process of reducing a word to its stem through basic suffix stripping. For instance, the stem of the word “lively” is “live”, so running the statement “She acted lively while helping me today.” through a stemmer (after tokenizing) would return the stems ‘act’, ‘live’, and ‘help’ for their respective words, while the remaining words would be returned unchanged because they are already at their stems.


Some common stemmers:

* [Snowball](http://www.nltk.org/_modules/nltk/stem/snowball.html)

* [Porter](http://www.nltk.org/_modules/nltk/stem/porter.html)

* [Lancaster](http://www.nltk.org/_modules/nltk/stem/lancaster.html)

In [4]:
import nltk
porter = nltk.stem.PorterStemmer()
words = ['dogs', 'creationism', 'ate']

for word in words:
    print("The stem of '%s' is '%s'." % (word, porter.stem(word)))


The stem of 'dogs' is 'dog'.
The stem of 'creationism' is 'creation'.
The stem of 'ate' is 'ate'.


## [Porter Algorithm](http://snowball.tartarus.org/algorithms/porter/stemmer.html)

Porter's stemming algorithm consists of 5 phases of word reduction applied sequentially.

For instance:

$$SSES \rightarrow SS \\ caresses \rightarrow caress$$

<br />
$$S \rightarrow \\ cats \rightarrow cat$$
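
Suffix rules of this kind can be sketched in a few lines. The example below covers only the plural-handling rules from the first phase of Porter's published algorithm (SSES → SS, IES → I, SS → SS, S → nothing), trying the longest suffix first. It is an illustration, not NLTK's implementation, and the function name `porter_step_1a` is an assumption (Python 3).

```python
def porter_step_1a(word):
    """Apply the first matching plural rule, longest suffix first."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word  # no rule matched

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The SS → SS rule looks redundant, but it stops the shorter S rule from stripping the final "s" of words like "caress".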



## Lemmatization
A problem that arises is that, in many cases, stemmers do not return actual words. For instance, the stems of the words “excitement” and “excited” are both “excit”. Stemming is sufficient for many text-interpretation tasks, but lemmatization better preserves meaning.

Lemmatization, by contrast, reduces a word to a common base word, known as the lemma. For instance, the word “excited” reduces to the lemma “excite”, and the words “are”, “am”, and “is” all reduce to the lemma “be”.

In [5]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
tagged_words = [('dogs', 'n'), ('creationism', 'n'), ('ate', 'v')]  # (word, WordNet POS) pairs

for word, pos in tagged_words:
    print("The lemma of '%s' is '%s'." % (word, wordnet_lemmatizer.lemmatize(word, pos)))


The lemma of 'dogs' is 'dog'.
The lemma of 'creationism' is 'creationism'.
The lemma of 'ate' is 'eat'.


## Text Generator w/ Markov Chains

Based on the [markovify](https://github.com/jsvine/markovify) Python module
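
As background, here is a minimal sketch of the idea behind a word-level Markov chain text generator: record which word follows each run of `state_size` words, then walk those transitions at random. This is a simplified illustration, not markovify's code; the function names and the toy corpus are assumptions (Python 3).

```python
import random
from collections import defaultdict

def build_model(words, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words that follow it."""
    model = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        model[state].append(words[i + state_size])
    return model

def generate(model, start, max_words=20, rng=random):
    """Walk the chain from `start`, stopping at a dead end or after `max_words` words."""
    state = tuple(start)
    output = list(state)
    for _ in range(max_words - len(output)):
        followers = model.get(state)
        if not followers:
            break  # no known continuation for this state
        next_word = rng.choice(followers)
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)

# Hypothetical toy corpus, for illustration only.
corpus = ("you can access wisdom and you can access knowledge "
          "and you can create anything").split()
model = build_model(corpus, state_size=2)
print(generate(model, start=("you", "can"), rng=random.Random(0)))
```

A larger `state_size` makes the output follow the source text more closely, while a smaller one gives more varied but less coherent sentences.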


In [6]:
# %load ./mymarkovify/splitters.py
'''
import re

ascii_lowercase = "abcdefghijklmnopqrstuvwxyz"
ascii_uppercase = ascii_lowercase.upper()

# States w/ with thanks to https://github.com/unitedstates/python-us
# Titles w/ thanks to https://github.com/nytimes/emphasis and @donohoe
abbr_capped = "|".join([
    "ala|ariz|ark|calif|colo|conn|del|fla|ga|ill|ind|kan|ky|la|md|mass|mich|minn|miss|mo|mont|neb|nev|okla|ore|pa|tenn|vt|va|wash|wis|wyo", # States
    "u.s",
    "mr|ms|mrs|msr|dr|gov|pres|sen|sens|rep|reps|prof|gen|messrs|col|sr|jf|sgt|mgr|fr|rev|jr|snr|atty|supt", # Titles
    "ave|blvd|st|rd|hwy", # Streets
    "jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec", # Months
    "|".join(ascii_lowercase) # Initials
]).split("|")

abbr_lowercase = "etc|v|vs|viz|al|pct"

exceptions = "U.S.|U.N.|E.U.|F.B.I.|C.I.A.".split("|")

def is_abbreviation(dotted_word):
    clipped = dotted_word[:-1]
    if clipped[0] in ascii_uppercase:
        if clipped.lower() in abbr_capped: return True
        else: return False
    else:
        if clipped in abbr_lowercase: return True
        else: return False

def is_sentence_ender(word):
    if word in exceptions: return False
    if word[-1] in [ "?", "!" ]:
        return True
    if len(re.sub(r"[^A-Z]", "", word)) > 1:
        return True
    if word[-1] == "." and (not is_abbreviation(word)):
        return True
    return False

def split_into_sentences(text):
    potential_end_pat = re.compile(r"".join([
        r"([\w\.'’&\]\)]+[\.\?!])", # A word that ends with punctuation
        r"([‘’“”'\"\)\]]*)", # Followed by optional quote/parens/etc
        r"(\s+(?![a-z\-–—]))", # Followed by whitespace + non-(lowercase or dash)
        ]), re.U)
    dot_iter = re.finditer(potential_end_pat, text)
    end_indices = [ (x.start() + len(x.group(1)) + len(x.group(2)))
        for x in dot_iter
        if is_sentence_ender(x.group(1)) ]
    spans = zip([None] + end_indices, end_indices + [None])
    sentences = [ text[start:end].strip() for start, end in spans ]
    return sentences
'''

from nltk import tokenize
sent_tokenizer = tokenize.PunktSentenceTokenizer()

def split_into_sentences(text):
    return sent_tokenizer.tokenize(text)



In [7]:
import extract
import nltk
import numpy as np

#Extract text as a string
cleantext = extract.read(directory = "./ChopraEdited")

#Looking at the size of our data set
stoken = nltk.tokenize.PunktSentenceTokenizer()
wtoken = nltk.tokenize.WordPunctTokenizer()
s = stoken.tokenize(cleantext)
w = wtoken.tokenize(cleantext)
print("%d words" % np.size(w))
print("%d sentences" % np.size(s))


Reading files...
./ChopraEdited/DeepakChopra_Interview_Clean.pdf
./ChopraEdited/Chopra-Deepak-the_seven_spiritual_laws_of_yoga_Clean.pdf
./ChopraEdited/Chopra-Deepak-How-To-Know-God_Clean.pdf
./ChopraEdited/Chopra-Deepak-Book-Of-Secrets_Clean.pdf
./ChopraEdited/DeepakChopra_Quotes_Clean.pdf
./ChopraEdited/Chopra-Deepak-superbrain_Clean.pdf
./ChopraEdited/Chopra-Deepak-the-7-laws-of-success_Clean.pdf
Done Reading.
258477 words
13051 sentences


In [8]:
import mymarkovify

#Train our model on the string
text_model = mymarkovify.Text(cleantext, state_size=3)

In [9]:

#print n generated sentences
n = 10 #number of sentences to be generated
for i in range(n):
    print("%s. %s\n" % (i+1, text_model.make_sentence(tries=100)))

1. In a state of being in the movie, outside the move, and the movie itself.

2. As it happens, I didn't have any awareness of it.

3. Nothing gets past us, no matter how obviously they deserve it, remind yourself that self-acceptance is the source of all that exists is not an option.

4. You can create anything because you are in stage five can make every wish come true, the ones that should come true matter more.

5. Although this cycle expresses itself in many ways, but this one says something about our stage one God.

6. If you perform asanas regularly, you will feel more guilty, adding to the burden of failure.

7. Next you will likely experience the return of worries, whatever is hanging over your head and leg to theoor while exhaling.

8. Recognizing this, Shankara named the physical body and mind come together.

9. We have outgrown the need for creative expression and renewal.

10. By going inside yourself, you can access wisdom and knowledge about anything in creation.



## Text Classification

Text classification assigns text to categories; NLTK provides several classifiers for this.


NLTK Classifiers:


* Naive Bayes Classifier


* Maximum Entropy Classifier


* Decision Tree


* Scikit Learn Wrapper



## Naive Bayes Classifier

Bayes' Theorem, for a category $C$ and a feature vector $x$:

$$ P(C|x) = \frac{P(x|C) P(C)}{P(x)} $$

Naive Bayes Classification:

Based on the assumption that all features $\{x_1, x_2, \ldots, x_n\}$ are conditionally independent, given the category $C$,

or, $$ P(x_i|x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, C) = P(x_i|C) $$

Then $P(C|x_1, x_2, \ldots, x_n)$ can be computed using Bayes' Theorem:

$$ P(C|x_1, x_2, \ldots, x_n) = \frac{P(C) \prod_{i=1}^{n} P(x_i|C)}{P(x_1, x_2, \ldots, x_n)} $$
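
A minimal sketch of a Naive Bayes text classifier in plain Python may make this concrete. It scores each category by the log of the prior plus the sum of the logs of the per-word likelihoods, and picks the highest score. Add-one smoothing, the function names, and the toy training data are all assumptions for illustration; this is not NLTK's `NaiveBayesClassifier` (Python 3).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, category in labeled_docs:
        class_counts[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, class_counts, word_counts, vocab):
    """Return the category with the highest log-posterior score."""
    total_docs = sum(class_counts.values())
    best_category, best_score = None, float("-inf")
    for category in class_counts:
        score = math.log(class_counts[category] / total_docs)  # log P(C)
        total_words = sum(word_counts[category].values())
        for word in words:
            # log P(x_i | C), with add-one smoothing (an assumption of this sketch)
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy training data.
training = [
    ("free prize money".split(), "spam"),
    ("win money now".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
params = train_naive_bayes(training)
print(classify("free money".split(), *params))     # spam
print(classify("lunch meeting".split(), *params))  # ham
```

The denominator of Bayes' Theorem is the same for every category, so it can be dropped when only the best category is needed.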


## Maximum Entropy Model

The model considers the probability distributions that are consistent with the empirical data and picks the one with the highest entropy.

### What does entropy mean?

Entropy is a measure of uncertainty or 'surprise' in the data. The assumption is that the data is likely to have random, unknown elements.

### What will have the highest entropy?

With no constraints beyond a fixed set of outcomes, the uniform distribution has the highest entropy.
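
A short sketch can check this numerically using Shannon entropy, $H(p) = -\sum_i p_i \log_2 p_i$, measured in bits. The function name and the three example distributions are assumptions for illustration (Python 3).

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # every outcome equally likely
skewed = [0.7, 0.1, 0.1, 0.1]       # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]      # no uncertainty at all

print(shannon_entropy(uniform))  # 2.0
print(shannon_entropy(skewed) < shannon_entropy(uniform))  # True
print(shannon_entropy(certain))  # 0.0
```

Over four outcomes, the uniform distribution reaches the maximum of 2 bits, and any distribution that favors some outcomes scores lower.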

## Applications of Text Classification


* Analyzing and classifying emails, texts and chats for security purposes


* Data acquisition from social media posts for advertising


* Analysis of speech patterns and transcripts for academic research (grammar development, language research, psychology, etc.)


* Integrating NLP techniques with Artificial Intelligence


# Thank you!
## Questions?

![title](./Cubs.png)

[Image source](http://webarebears.wikia.com/wiki/The_Bears)