## Natural Language Processing
* This tutorial provides only a very basic guide to NLP with Python.
* If you want to learn more about NLP, you could take a course, for example on Coursera. :)

## NLTK Library
* The most famous Python natural language processing toolkit.
* Comes with lots of corpora and lexical resources!
* O’Reilly published the book *Natural Language Processing with Python*, which uses NLTK as its toolkit.

## Install NLTK
`sudo pip3 install nltk` <br>
`sudo pip3 install numpy` (optional, to speed things up)

## Tokenization, Stemming, Lemmatization

In [None]:
## To use nltk.tokenize.word_tokenize() you first need to download its data with nltk.download()
from nltk import download
download()  ## opens a GUI downloader
## download the punkt package (or call download('punkt') to skip the GUI)
## (newer NLTK releases may ask for 'punkt_tab' instead)

In [1]:
from nltk.tokenize import word_tokenize, sent_tokenize
s = "Hey! I don't like this. "
sents = sent_tokenize(s)
print(sents)
for x in sents:
    print(word_tokenize(x))
words = word_tokenize(s)
print(words)

['Hey!', "I don't like this. "]
['Hey', '!']
['I', 'do', "n't", 'like', 'this', '.']
['Hey', '!', 'I', 'do', "n't", 'like', 'this', '.']
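
For comparison, here is a small sketch using `wordpunct_tokenize`, a regular-expression tokenizer that also ships with NLTK and needs no downloaded data. It is not what the cell above uses; it is only meant to show how a simpler rule treats the contraction differently.

```python
from nltk.tokenize import wordpunct_tokenize

s = "Hey! I don't like this. "
# wordpunct_tokenize splits on the pattern \w+|[^\w\s]+,
# so "don't" is broken at the apostrophe instead of into "do" + "n't"
tokens = wordpunct_tokenize(s)
print(tokens)
# ['Hey', '!', 'I', 'don', "'", 't', 'like', 'this', '.']
```

`word_tokenize` keeps `n't` together as one token, which is usually easier to work with in later steps.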


## Stemming
* Reduce terms to their stems, a common step in information retrieval
* Most famous one: Porter's algorithm

In [10]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('cats'))
print(stemmer.stem('crying'))
print(stemmer.stem('string'))
print(stemmer.stem('relational'))

cat
cri
string
relat
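
To see why stemming helps retrieval, the sketch below stems several variants of one word and checks that they collapse to a single stem. The word list is just an illustrative choice.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
variants = ['connect', 'connected', 'connecting', 'connection', 'connections']

# Each variant should be reduced to the same stem
stems = [stemmer.stem(w) for w in variants]
print(stems)
print(set(stems))  # {'connect'}
```

A query for any one of these forms can then match documents containing the others.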


## Lemmatization
* Reduce inflections or variant forms to base form
* NLTK's lemmatizer is based on [WordNet](http://wordnetweb.princeton.edu/perl/webwn), a large lexical database of English. <br>

In [21]:
from nltk.stem import WordNetLemmatizer   ## remember to download the wordnet corpus first
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cats'))
print(lemmatizer.lemmatize('is', pos='v'))
print(lemmatizer.lemmatize('drove', pos='v'))
print(lemmatizer.lemmatize('media'))

cat
be
drive
medium


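## Word Similarity
* Use WordNet synsets to calculate it.
* A synset is named by a “[word].[pos].[number]” string, and a lemma by a “[word].[pos].[number].[lemma]” string, where: <br>
[word] is the morphological stem identifying the synset. <br>
[pos] is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB (written as `a`, `s`, `r`, `n` or `v` in the name). <br>
[number] is the sense number, written as two digits starting from 01 in names such as `dog.n.01`. <br>
[lemma] is the morphological form of interest.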
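
As a quick illustration of the naming scheme, the sketch below splits a three-part synset name such as `dog.n.01` into its pieces. `split_synset_name` is a hypothetical helper written for this tutorial, not part of NLTK.

```python
def split_synset_name(name):
    """Split a name such as 'dog.n.01' into (word, pos, sense_number).

    Hypothetical helper for illustration only; NLTK does its own parsing.
    """
    # Split from the right so a word that itself contains dots stays intact
    word, pos, number = name.rsplit('.', 2)
    return word, pos, int(number)

print(split_synset_name('dog.n.01'))   # ('dog', 'n', 1)
print(split_synset_name('move.v.01'))  # ('move', 'v', 1)
```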

In [62]:
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(cat.path_similarity(cat))  ## based on the shortest path between the two senses; ranges from 0 to 1
print(dog.path_similarity(cat))

1.0
0.2


In [63]:
move = wn.synset('move.v.01')
stop = wn.synset('stop.v.01')
print(wn.path_similarity(move, stop))

0.3333333333333333
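
The scores above follow from 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets in the is-a hierarchy. The sketch below reproduces that idea on a tiny hand-written taxonomy. The `HYPERNYM` table and the function names are assumptions made for illustration; nothing here is read from WordNet.

```python
from collections import deque

# Hypothetical, hand-written fragment of an is-a hierarchy (child -> parent)
HYPERNYM = {
    'dog': 'canine',
    'canine': 'carnivore',
    'cat': 'feline',
    'feline': 'carnivore',
}

def neighbours(node):
    """Nodes one is-a edge away from node, in either direction."""
    result = []
    if node in HYPERNYM:
        result.append(HYPERNYM[node])
    result.extend(child for child, parent in HYPERNYM.items() if parent == node)
    return result

def shortest_distance(a, b):
    """Breadth-first search for the number of edges between a and b."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected

def toy_path_similarity(a, b):
    """Score 1 / (d + 1), or None when no path exists."""
    d = shortest_distance(a, b)
    return None if d is None else 1 / (d + 1)

print(toy_path_similarity('cat', 'cat'))  # 1.0
print(toy_path_similarity('dog', 'cat'))  # 0.2
```

In this toy graph dog and cat are four edges apart (dog, canine, carnivore, feline, cat), which gives 1 / 5 = 0.2.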


## Practice
* Write a simple typing suggestion function.
* Use `nltk.corpus.brown` as training data.
* You can use `bigrams()` and `ConditionalFreqDist()` to handle data.
* Use `input()` to get user input.
* The user can choose their next word from your suggestions. Give them 5 suggestions for each word.

In [None]:
from nltk.corpus import brown
from nltk.probability import ConditionalFreqDist
from nltk.util import bigrams
words = brown.words()
cfdist = ConditionalFreqDist(bigrams(words))

In [None]:
print(cfdist['I'].most_common(3))
print(cfdist['United'].most_common(3))
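
One possible starting point for the suggestion function is sketched below. It uses a tiny made-up word list instead of Brown so that it runs without any download; `toy_words`, `toy_cfdist` and `suggest` are names invented for this sketch.

```python
from nltk.probability import ConditionalFreqDist
from nltk.util import bigrams

# Tiny stand-in corpus (an assumption) so the sketch runs without Brown
toy_words = "i like tea . i like coffee . i drink tea . you like tea .".split()
toy_cfdist = ConditionalFreqDist(bigrams(toy_words))

def suggest(cfdist, word, n=5):
    """Return up to n words that most often follow `word` in the training data."""
    return [w for w, _ in cfdist[word].most_common(n)]

print(suggest(toy_cfdist, 'i'))     # ['like', 'drink']
print(suggest(toy_cfdist, 'like'))  # ['tea', 'coffee']
```

To finish the exercise, pass the `cfdist` built from `brown.words()` instead of `toy_cfdist`, and call `suggest` in a loop that reads each word with `input()`.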

## Reference
* [NLTK Online Book](http://www.nltk.org/book/ch01.html)
* [NLTK Documentation](http://www.nltk.org/index.html)

