# Python's Natural Language Toolkit - NLTK
##### Keziah Sheldon, Eesha Das Gupta, Rachel Buttry

## What is Natural Language Processing?

"Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages." -[Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing)

Link: [Wisdom of Chopra](http://wisdomofchopra.com/)

## NLTK
"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum." 

[Read More](http://www.nltk.org/)

## Tokenizing

* Tokenization is the first step of NLP (Natural Language Processing)
* It splits strings into substrings (tokens)
* Sentence tokenization separates text into sentences by identifying where they start and stop


Tokenize by sentence:
A sentence tokenizer cannot depend on punctuation alone, since punctuation usage has become much more ambiguous in recent decades (emoticons, abbreviations, etc.).


Tokenize by word: this is simpler than sentence tokenization.
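
As a minimal sketch of both kinds of tokenization, the snippet below uses the same two NLTK classes that appear later in this notebook, `PunktSentenceTokenizer` and `WordPunctTokenizer`. The sample text is made up for illustration. The Punkt tokenizer is used untrained here, so it knows no abbreviations and may split differently from a pre-trained model.

```python
from nltk.tokenize import PunktSentenceTokenizer, WordPunctTokenizer

# Made-up sample text, for illustration only.
text = "NLTK is fun. It splits text into tokens!"

# Untrained Punkt tokenizer: decides where sentences start and stop.
sentences = PunktSentenceTokenizer().tokenize(text)

# Regex-based tokenizer: separates runs of word characters from punctuation.
words = WordPunctTokenizer().tokenize(sentences[0])

print(sentences)
print(words)
```

Note that the word tokenizer keeps the final period as its own token.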

## Part of Speech (POS) Tagging


* POS tagging is grammatical tagging: each token is labeled with its part of speech
* It separates words into nouns, adjectives, verbs, etc.
* You can use a pre-trained tagger or train your own

Examples of POS tags:
* JJ: adjective or numeral, ordinal
* NNP: noun, proper, singular
* IN: preposition or conjunction, subordinating
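
To illustrate the option of training a tagger yourself, here is a minimal sketch using NLTK's `UnigramTagger`, with a `DefaultTagger` as a fallback for unseen words. The single training sentence is invented for illustration; a real tagger would be trained on a large tagged corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Invented, hand-tagged training data (far too small for real use).
train_sents = [[
    ("Alice", "NNP"), ("lives", "VBZ"), ("in", "IN"),
    ("sunny", "JJ"), ("Boston", "NNP"),
]]

# Words never seen in training fall back to the tag 'NN'.
fallback = DefaultTagger("NN")
tagger = UnigramTagger(train_sents, backoff=fallback)

tagged = tagger.tag(["Boston", "in", "spring"])
print(tagged)
```

The unigram tagger simply remembers the most frequent tag for each word it saw, which is why "spring" receives the fallback tag.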


## Stemming

Stemming is the process of reducing a word to its stem through basic suffix stripping. For instance, the stem of the word “lively” is “live”, so running the statement “She acted lively while helping me today.” through a stemmer (after tokenizing) would return the stems ‘act’, ‘live’, and ‘help’ for their respective words, while the rest of the words would be returned unchanged (they are already at their stem).


Some common stemmers:

* Snowball : http://www.nltk.org/_modules/nltk/stem/snowball.html

* Porter : http://www.nltk.org/_modules/nltk/stem/porter.html

* Lancaster : http://www.nltk.org/_modules/nltk/stem/lancaster.html

In [2]:
import nltk
snowball = nltk.stem.snowball.EnglishStemmer()
word = 'expectations'
print("The stem of '%s' is '%s'." % (word, snowball.stem(word)))

The stem of 'expectations' is 'expect'.


## Porter Algorithm

Porter's stemming algorithm consists of 5 phases of word reduction applied sequentially.

For instance:

$$SSES \rightarrow SS \\ caresses \rightarrow caress$$

<br />
$$S \rightarrow \\ cats \rightarrow cat$$

[Read More](http://snowball.tartarus.org/algorithms/porter/stemmer.html)
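
Suffix rules of this kind can be sketched in a few lines of plain Python. The function below is a hypothetical helper (not NLTK's implementation) covering only the plural rules of the first phase as described on the linked Porter page: SSES → SS, IES → I, SS → SS, and S → (nothing). The first matching rule wins, which is why the longer suffixes are checked first.

```python
def porter_step_1a(word):
    """Apply only the plural-stripping rules of Porter's first phase."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S -> (removed)
    return word

examples = ["caresses", "ponies", "caress", "cats"]
results = [porter_step_1a(w) for w in examples]
print(results)
```

The full algorithm applies further phases after this one, so these intermediate results are not always the final stems.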


A problem that arises is that, in many cases, stemmers will not return actual words. For instance, the stems of the words “excitement” and “excited” are both “excit”. Stemming is sufficient in many cases of interpreting the meaning of texts, but a better option in terms of meaning is lemmatization.
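
A short sketch of this behaviour, running the three stemmers listed earlier on the two example words. It assumes the classes `PorterStemmer`, `SnowballStemmer`, and `LancasterStemmer` from `nltk.stem`; the Lancaster output is printed without comment, since it is generally more aggressive and may differ.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

words = ["excitement", "excited"]

# Map each stemmer name to the stems it produces for the example words.
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, result in stems.items():
    print(name, result)
```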

## Lemmatization
Lemmatization, on the other hand, reduces a word to a common base word that is an actual dictionary entry, known as the lemma. For instance, the word “excited” reduces to the lemma “excite”, and the words “are”, “am” and “is” all reduce to the lemma “be”.

## Text Generator w/ Markov Chains



In [3]:
#%load ./Chopra.class.php

In [6]:
import extract
import nltk
import numpy as np

cleantext = extract.read(directory = "./ChopraEdited")

stoken = nltk.tokenize.PunktSentenceTokenizer()
wtoken = nltk.tokenize.WordPunctTokenizer()
s = stoken.tokenize(cleantext)
w = wtoken.tokenize(cleantext)
print(np.shape(w), np.shape(s))


Reading files...
./ChopraEdited/Chopra-Deepak-Book-Of-Secrets_Clean.pdf
./ChopraEdited/DeepakChopra_Interview_Clean.pdf
./ChopraEdited/Chopra-Deepak-the_seven_spiritual_laws_of_yoga_Clean.pdf
./ChopraEdited/Chopra-Deepak-superbrain_Clean.pdf
./ChopraEdited/DeepakChopra_Quotes_Clean.pdf
./ChopraEdited/Chopra-Deepak-the-7-laws-of-success_Clean.pdf
./ChopraEdited/Chopra-Deepak-How-To-Know-God_Clean.pdf
Done Reading.
(258456,) (13050,)


In [7]:
import mymarkovify

#Train our model on the string
text_model = mymarkovify.Text(cleantext, state_size=3)

n = 10 #number of sentences to be generated
for i in range(n):
    print("%s. %s\n" % (i+1, text_model.make_sentence(tries=100)))

1. My knowledge now is partial; then it will pull me back toward wealth again.

2. How far have I gotten to my soul?

3. Stage 1: Fight-or-Flight Response: Fear of loss, abandonment Stage 2: Reactive Response Good is clarity, seeing the truth.

4. But for these ancient formulations to have any confidence in them.

5. The secret cause of suffering hast been examined.

6. We don't yet know what that world feels like until you give your allegiance finally to the inner world isn't a mystery.

7. Each level of commitment reflects the understanding you are willing to pose the question to yourself, What do I get out of this When you reframe the question asWhat willwe get out of stage one, who should have protected his children.

8. The thought I had the strange sensation that it was pervasive.

9. If you see yourself as the old melodramas used to promise.

10. They make karmic connections and are able to distinguish your observations from your interpretations.
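
The idea behind such a generator can be sketched in plain Python. This is not the `mymarkovify` module used above, only a minimal illustration under simple assumptions: the model maps each run of `state_size` consecutive words to the words that followed it in the text, and generation repeatedly picks one of those followers at random. The tiny corpus and the function names are made up.

```python
import random
from collections import defaultdict

def build_model(words, state_size=2):
    """Map each tuple of `state_size` words to the words that follow it."""
    model = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        model[state].append(words[i + state_size])
    return model

def generate(model, start, max_words=20, rng=random):
    """Walk the chain from `start` until a dead end or `max_words`."""
    state = tuple(start)
    output = list(state)
    while len(output) < max_words and state in model:
        output.append(rng.choice(model[state]))
        state = tuple(output[-len(state):])
    return output

# Made-up toy corpus; a real model would be trained on much more text.
corpus = "the mind is the source of the mind and the source is the mind".split()
model = build_model(corpus, state_size=2)
sentence = generate(model, start=corpus[:2], max_words=12, rng=random.Random(0))
print(" ".join(sentence))
```

A larger `state_size` makes the output more faithful to the source text but less varied, which is the trade-off behind the `state_size=3` setting above.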



## Text Classification

Text classification assigns a piece of text to one of a set of predefined categories. NLTK provides several classifiers for this task.


NLTK Classifiers:


* Naive Bayes Classifier


* Maximum Entropy Classifier


* Decision Tree


* Scikit Learn Wrapper
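
As a minimal sketch of how one of these classifiers is used, the snippet below trains NLTK's `NaiveBayesClassifier` on four invented messages. The `features` function, the two feature names, and the spam/ham labels are assumptions made for illustration; NLTK expects each training example as a `(feature_dict, label)` pair.

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    """Hypothetical feature extractor: two word-presence flags."""
    words = set(text.lower().split())
    return {
        "contains_free": "free" in words,
        "contains_meeting": "meeting" in words,
    }

# Invented training data, far too small for real use.
train = [
    (features("free prize inside"), "spam"),
    (features("claim your free gift"), "spam"),
    (features("meeting moved to noon"), "ham"),
    (features("notes from the meeting"), "ham"),
]

classifier = NaiveBayesClassifier.train(train)
label = classifier.classify(features("free tickets for you"))
print(label)
```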



## Naive Bayes Classifier

Bayes' theorem:

$$ P(C|x) = \frac{P(x|C) P(C)}{P(x)} $$

Naive Bayes classification:

It is based on the assumption that all features $\{x_1, x_2, \ldots, x_n\}$ are conditionally independent given the category $C$,

or, $$ P(x_i|x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, C) = P(x_i|C) $$

Then $P(C|x_1, x_2, \ldots, x_n)$ can be computed using Bayes' theorem:

$$ P(C|x_1, \ldots, x_n) = \frac{P(C) \prod_{i=1}^{n} P(x_i|C)}{P(x_1, \ldots, x_n)} $$
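
A small worked example may help. Under the independence assumption, the posterior for each category is proportional to its prior times the product of the per-feature likelihoods, and dividing by the sum over categories normalizes the result. All probabilities below are made-up numbers chosen for illustration, not estimates from real data.

```python
# Made-up priors P(C) for two categories.
priors = {"spam": 0.4, "ham": 0.6}

# Made-up likelihoods P(x_i | C) for two observed features.
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.05, "meeting": 0.3},
}

observed = ["free", "meeting"]

# Unnormalized score: P(C) * product of P(x_i | C).
scores = {}
for category, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[category][feature]
    scores[category] = score

# Normalize so the posteriors sum to 1 (this plays the role of P(x)).
total = sum(scores.values())
posteriors = {category: score / total for category, score in scores.items()}
print(posteriors)
```

With these numbers the unnormalized scores are 0.02 for spam and 0.009 for ham, so spam is the more probable category.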


## Maximum Entropy Model

The model considers the probability distributions that are consistent with the empirical (training) data and picks the one with the highest entropy.

### What does entropy mean?

Entropy is a measure of uncertainty or 'surprise' in the data. The underlying assumption is that the data is likely to have random, unknown elements.

### What will have the highest entropy?

When nothing else is known about the data, the uniform distribution has the highest entropy.
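
This can be checked numerically. The sketch below assumes the standard Shannon definition of entropy, $H = -\sum_i p_i \log_2 p_i$, which the text above does not spell out, and compares a uniform distribution over four outcomes with a made-up skewed one.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]   # made-up example distribution

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
print(h_uniform, h_skewed)
```

The uniform distribution reaches the maximum of 2 bits for four outcomes, while the skewed one is more predictable and therefore has lower entropy.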

## Applications of Text Classification


* Analyzing and classifying emails, texts and chats for security purposes


* Data acquisition from social media posts for advertising


* Analysis of speech patterns and transcripts for academic research (grammar development, language research, psychology, etc.)


* Integrating NLP techniques with Artificial Intelligence
