# Python's Natural Language Toolkit - NLTK
 ##### Keziah Sheldon, Eesha Das Gupta, Rachel Buttry

## What is Natural Language Processing?

"Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages." -[Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing)

Link: [Wisdom of Chopra](http://wisdomofchopra.com/)

## NLTK
"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum." 

[Read More](http://www.nltk.org/)

## Tokenizing

* Tokenization is the first step of NLP (Natural Language Processing)
* It splits strings into substrings (tokens)
* Sentence tokenization separates text into sentences by identifying where they start and stop


Tokenize by sentence:
A sentence tokenizer cannot depend on punctuation alone, since punctuation usage has become much more ambiguous in recent decades (emoticons, abbreviations, etc.).


Tokenize by word: simpler than sentence tokenization.
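
To make the two kinds of tokenizing concrete, here is a minimal sketch using only the standard library. It is not NLTK's implementation: the abbreviation list is a tiny hypothetical one chosen for illustration, and the word pattern is, to our understanding, the same regular expression that NLTK's `WordPunctTokenizer` is built on.

```python
import re

# Word tokens: runs of word characters, or runs of punctuation.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")

# Hypothetical, deliberately tiny abbreviation list (assumption for this sketch).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "etc"}


def tokenize_words(text):
    """Split text into word tokens and punctuation tokens."""
    return WORD_PATTERN.findall(text)


def tokenize_sentences(text):
    """Naive splitter: end a sentence after . ! or ? unless the word is a known abbreviation."""
    sentences = []
    current = []
    for chunk in text.split():
        current.append(chunk)
        ends_with_stop = chunk[-1] in ".!?"
        is_abbreviation = chunk[:-1].lower() in ABBREVIATIONS
        if ends_with_stop and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that had no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


sample = "Dr. Chopra spoke today. Was it good? Yes!"
sentences = tokenize_sentences(sample)
words = tokenize_words("Was it good?")
print(sentences)
print(words)
```

The period after "Dr" does not end a sentence here only because "dr" is in the abbreviation list. This is why sentence tokenizing is the harder of the two problems.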

## Part of Speech (POS) Tagging


* POS tagging labels each word with its grammatical category
* It separates words into nouns, adjectives, verbs, etc.
* You can use a pre-trained tagger or train your own

Examples of POS tags (Penn Treebank tagset):
* JJ: adjective or numeral, ordinal
* NNP: noun, proper, singular
* IN: preposition or conjunction, subordinating
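
As a rough illustration of what training your own tagger can mean, the sketch below builds a unigram tagger in plain Python: each word gets the tag it carried most often in the training data. The training sentences are hypothetical and hand-tagged for this example, and the fallback tag `NN` for unseen words is an assumption, not a rule taken from NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences (Penn Treebank style tags).
TRAINING = [
    [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("runs", "VBZ")],
    [("Deepak", "NNP"), ("runs", "VBZ"), ("in", "IN"), ("the", "DT"), ("park", "NN")],
    [("the", "DT"), ("runs", "NNS"), ("in", "IN"), ("May", "NNP")],
    [("she", "PRP"), ("runs", "VBZ")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word, falling back to a default tag for unseen words."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train_unigram_tagger(TRAINING)
tagged = tag(["Deepak", "runs", "in", "the", "garden"], model)
print(tagged)
```

"runs" is tagged `VBZ` because it was seen three times as a verb and once as a plural noun. A unigram tagger ignores context, so it cannot tell those two uses apart.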


## Stemming

Stemming is the process of reducing a word to its stem through basic suffix stripping. For instance, the stem of the word “lively” is “live”, so running the statement “She acted lively while helping me today.” through a stemmer (after tokenizing) would return the stems ‘act’, ‘live’, and ‘help’ for their respective words, while the rest of the words would just be returned as they were input (they are already at their stem).


Some common stemmers:

* Snowball : http://www.nltk.org/_modules/nltk/stem/snowball.html

* Porter : http://www.nltk.org/_modules/nltk/stem/porter.html

* Lancaster : http://www.nltk.org/_modules/nltk/stem/lancaster.html

In [2]:
import nltk
snowball = nltk.stem.snowball.EnglishStemmer()
word = 'expectations'
print("The stem of '%s' is '%s'." % (word, snowball.stem(word)))

The stem of 'expectations' is 'expect'.


## Porter Algorithm

Porter's stemming algorithm consists of 5 phases of word reduction applied sequentially.

For instance:

$$SSES \rightarrow SS \\ caresses \rightarrow caress$$

<br />
$$S \rightarrow \\ cats \rightarrow cat$$

[Read More](http://snowball.tartarus.org/algorithms/porter/stemmer.html)
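
The two rules shown above belong to the first step of Porter's algorithm (step 1a), which has four suffix rules in total. A minimal sketch of that one step, written from the published rule list rather than copied from NLTK's source, could look like this:

```python
def porter_step_1a(word):
    """Apply Porter step 1a: the first matching suffix rule wins."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I
    if word.endswith("ss"):
        return word        # SS   -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S    -> (nothing)
    return word


examples = {w: porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]}
for original, stemmed in examples.items():
    print("%s -> %s" % (original, stemmed))
```

The rules are tried longest suffix first, so "caress" is protected by the `SS` rule before the plain `S` rule can strip it.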


A problem that arises is that, in many cases, stemmers do not return actual words. For instance, the words “excitement” and “excited” both have the stem “excit”. Stemming is sufficient in many cases for interpreting the meaning of texts, but a better option in terms of meaning is lemmatization.

## Lemmatization
Lemmatization, by contrast, reduces a word to its dictionary base form, known as the lemma. For instance, the word “excited” reduces to the lemma “excite”, and the words “are”, “am” and “is” all reduce to the lemma “be”.

## Text Generator w/ Markov Chains

Based on the [markovify Python module](https://github.com/jsvine/markovify)

(pip install markovify)
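
Before looking at the module code, it may help to see the idea in miniature. The sketch below is not markovify's implementation. It is a simplified word-level Markov chain in which each state is the last `state_size` words and the next word is drawn from the words that followed that state in the corpus. The three-sentence corpus and the `__BEGIN__`/`__END__` marker names are assumptions made for illustration.

```python
import random
from collections import defaultdict

BEGIN = "__BEGIN__"  # hypothetical padding marker for sentence starts
END = "__END__"      # hypothetical marker for sentence ends


def build_model(sentences, state_size=2):
    """Map each tuple of `state_size` words to the list of words that followed it."""
    model = defaultdict(list)
    for sentence in sentences:
        padded = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(padded) - state_size):
            state = tuple(padded[i:i + state_size])
            model[state].append(padded[i + state_size])
    return model


def make_sentence(model, state_size=2, rng=random):
    """Walk the chain from the start state until the end marker is drawn."""
    state = (BEGIN,) * state_size
    words = []
    while True:
        next_word = rng.choice(model[state])
        if next_word == END:
            return " ".join(words)
        words.append(next_word)
        state = state[1:] + (next_word,)


corpus = [
    "life is a miracle",
    "life is a choice",
    "karma is a choice we make",
]
model = build_model(corpus, state_size=2)
sentence = make_sentence(model, state_size=2, rng=random.Random(0))
print(sentence)
```

A larger `state_size` makes the output more faithful to the source text but less varied, since fewer states have more than one possible continuation.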


In [None]:
# %load ./mymarkovify/splitters.py
import re
from nltk import tokenize

'''
ascii_lowercase = "abcdefghijklmnopqrstuvwxyz"
ascii_uppercase = ascii_lowercase.upper()
sent_tokenizer = tokenize.PunktSentenceTokenizer()

# States w/ thanks to https://github.com/unitedstates/python-us
# Titles w/ thanks to https://github.com/nytimes/emphasis and @donohoe
abbr_capped = "|".join([
    "ala|ariz|ark|calif|colo|conn|del|fla|ga|ill|ind|kan|ky|la|md|mass|mich|minn|miss|mo|mont|neb|nev|okla|ore|pa|tenn|vt|va|wash|wis|wyo", # States
    "u.s",
    "mr|ms|mrs|msr|dr|gov|pres|sen|sens|rep|reps|prof|gen|messrs|col|sr|jf|sgt|mgr|fr|rev|jr|snr|atty|supt", # Titles
    "ave|blvd|st|rd|hwy", # Streets
    "jan|feb|mar|apr|jun|jul|aug|sep|sept|oct|nov|dec", # Months
    "|".join(ascii_lowercase) # Initials
]).split("|")

abbr_lowercase = "etc|v|vs|viz|al|pct".split("|")

exceptions = "U.S.|U.N.|E.U.|F.B.I.|C.I.A.".split("|")

def is_abbreviation(dotted_word):
    clipped = dotted_word[:-1]
    if clipped[0] in ascii_uppercase:
        if clipped.lower() in abbr_capped: return True
        else: return False
    else:
        if clipped in abbr_lowercase: return True
        else: return False

def is_sentence_ender(word):
    if word in exceptions: return False
    if word[-1] in [ "?", "!" ]:
        return True
    if len(re.sub(r"[^A-Z]", "", word)) > 1:
        return True
    if word[-1] == "." and (not is_abbreviation(word)):
        return True
    return False

def split_into_sentences(text):
    potential_end_pat = re.compile(r"".join([
        r"([\w\.'’&\]\)]+[\.\?!])", # A word that ends with punctuation
        r"([‘’“”'\"\)\]]*)", # Followed by optional quote/parens/etc
        r"(\s+(?![a-z\-–—]))", # Followed by whitespace + non-(lowercase or dash)
        ]), re.U)
    dot_iter = re.finditer(potential_end_pat, text)
    end_indices = [ (x.start() + len(x.group(1)) + len(x.group(2)))
        for x in dot_iter
        if is_sentence_ender(x.group(1)) ]
    spans = zip([None] + end_indices, end_indices + [None])
    sentences = [ text[start:end].strip() for start, end in spans ]
    return sentences
'''

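# The tokenizer must be defined outside the disabled block above,
# otherwise split_into_sentences raises a NameError when called.
sent_tokenizer = tokenize.PunktSentenceTokenizer()

def split_into_sentences(text):
    return sent_tokenizer.tokenize(text)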






In [1]:
import extract
import nltk
import numpy as np

cleantext = extract.read(directory = "./ChopraEdited")

stoken = nltk.tokenize.PunktSentenceTokenizer()
wtoken = nltk.tokenize.WordPunctTokenizer()
s = stoken.tokenize(cleantext)
w = wtoken.tokenize(cleantext)
print(np.size(w))
print(np.size(s))


Reading files...
./ChopraEdited/DeepakChopra_Interview_Clean.pdf
./ChopraEdited/Chopra-Deepak-the_seven_spiritual_laws_of_yoga_Clean.pdf
./ChopraEdited/Chopra-Deepak-How-To-Know-God_Clean.pdf
./ChopraEdited/Chopra-Deepak-Book-Of-Secrets_Clean.pdf
./ChopraEdited/DeepakChopra_Quotes_Clean.pdf
./ChopraEdited/Chopra-Deepak-superbrain_Clean.pdf
./ChopraEdited/Chopra-Deepak-the-7-laws-of-success_Clean.pdf
Done Reading.
258477 13051


In [3]:
import mymarkovify

#Train our model on the string
text_model = mymarkovify.Text(cleantext, state_size=3)

In [6]:

#print n generated sentences
n = 10 #number of sentences to be generated
for i in range(n):
    print("%s. %s\n" % (i+1, text_model.make_sentence(tries=100)))

1. Can we truly satisfy the demands of our egos, who want to grab the love and attention that used to be seen by the eyes as photons, it doesn't suddenly jump into material existence.

2. Sometimes the connections are faultI might have come up with the fewest delays, obstacles, and backslidinadopting the right belief is much more elusive and even mystical.

3. Now slowly lower both legs to the floor.

4. It is inevitable that you will escape from evil.

5. Many of India's saints strike me with less than wonder, and I have a separate stake in the world, you are aligned with your creative juices, the expressions that emerge arise effortlessly.

6. To a co-creator, life has a tendency to show up in the word It.

7. To reach this state of innocence would be impossible if he didn't want to be known.

8. A stranger makes you feel like a choice anymore.

9. Karma is just experiences that make for a healing relationship is that life is a miracle; God works through me, my greatest joy is servic

## Text Classification

Text classification assigns pieces of text to predefined categories. NLTK provides several classifiers for this.


NLTK Classifiers:

* Naive Bayes Classifier
* Maximum Entropy Classifier
* Decision Tree
* Scikit Learn Wrapper



## Naive Bayes Classifier

Bayes Theorem : 

$$ P(C|x) = \frac{P(x|C) P(C)}{P(x)} $$

Naive Bayes Classification:

Based on the assumption that all features $x_1, x_2, \ldots, x_n$ are conditionally independent, given the category $C$,

or, $$ P(x_i|x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, C) = P(x_i|C) $$

Then, $P(C|x_1, x_2, \ldots, x_n)$ can be computed using Bayes Theorem:

$$ P(C|x_1, \ldots, x_n) = \frac{P(C) \prod_{i=1}^{n} P(x_i|C)}{P(x_1, \ldots, x_n)} $$
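
The sketch below applies this to word features in plain Python. It is not NLTK's `NaiveBayesClassifier`. The four training documents are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that a word unseen in one category does not force that category's probability to zero. Since $P(x)$ is the same for every category, it is dropped and only log scores are compared.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per category and word occurrences per category."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        for word in words:
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(words, class_counts, word_counts, vocabulary):
    """Return the category C maximizing log P(C) + sum of log P(x_i | C)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing (assumption of this sketch).
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
docs = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting schedule today".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
class_counts, word_counts, vocabulary = train(docs)
label_a = classify("free money".split(), class_counts, word_counts, vocabulary)
label_b = classify("meeting notes".split(), class_counts, word_counts, vocabulary)
print(label_a, label_b)
```

Working in log space turns the product of per-feature probabilities into a sum, which avoids numerical underflow on longer documents.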


## Maximum Entropy Model

Among the probability distributions that are consistent with the empirical data, the model picks the one with the highest entropy.

### What does entropy mean?

Entropy is a measure of uncertainty or 'surprise' in the data. The assumption is that the data is likely to have random, unknown elements.

### What will have the highest entropy?

When nothing else is known about the data, the uniform distribution has the highest entropy.
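
To see this numerically, the short sketch below computes Shannon entropy, $H(p) = -\sum_i p_i \log_2 p_i$, for three hypothetical distributions over four outcomes. It only illustrates the entropy measure itself. It is not a maximum entropy classifier.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # every outcome equally likely
skewed = [0.7, 0.1, 0.1, 0.1]        # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]       # no uncertainty at all

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
h_certain = entropy(certain)
print(h_uniform, h_skewed, h_certain)
```

The uniform distribution reaches the maximum of 2 bits for four outcomes, the skewed one falls below it, and the certain one has zero entropy.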

## Applications of Text Classification

* Analyzing and classifying emails, texts and chats for security purposes
* Data acquisition from social media posts for advertising
* Analysis of speech patterns and transcripts for academic research (grammar development, language research, psychology, etc.)
* Integrating NLP techniques with Artificial Intelligence
