<a href="https://colab.research.google.com/github/todnewman/coe_training/blob/master/Basic_NLP_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Natural Language Processing
**Author**: W. Tod Newman

**Updates**: New release

## Learning Objectives


*   Learn the basics of the Python Natural Language Toolkit
*   Explore concepts of language processing: parts of speech, corpora, stemming, lemmatizing, etc.
*   Get an overview of simple neural network classification


# About Python's Natural Language Toolkit (NLTK)

NLTK is the most widely used NLP module for Python.  It comes with the Anaconda distribution, so it's very easy to start working once Anaconda is in place.  From the NLTK site:

*NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.*

NLTK has a very large set of open data that can be used to train its models.  This NLTK Data includes many corpora, grammars, and trained models. Without NLTK Data, NLTK is not very useful. You can find the complete NLTK Data list here: http://nltk.org/nltk_data/

The simplest way to install NLTK Data is to run the Python interpreter and type the commands:

    >>> import nltk
    >>> nltk.download()

This should open the NLTK Downloader window and you can select which modules to download.  The Brown University corpus is one of the most cited artifacts in the field of corpus linguistics.  We'll start by exploring how we can make use of it in our own text classification tasks.
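
When a notebook is re-run often, it can be convenient to download a package only if it is missing. The sketch below is one way to do that with `nltk.data.find`, which raises `LookupError` when a resource is not installed. The helper name `ensure_nltk_data` and its `download` parameter are hypothetical, introduced here only for illustration, and the resource paths in the commented usage are assumptions about where the packages are stored.

```python
import nltk

def ensure_nltk_data(resource_path, package_name, download=nltk.download):
    """Fetch `package_name` only when `resource_path` is not installed yet.

    Returns True if a download was attempted, False if the data was found.
    """
    try:
        nltk.data.find(resource_path)   # raises LookupError when missing
        return False
    except LookupError:
        download(package_name)
        return True

# Example usage (commented out because it may need network access):
# ensure_nltk_data('corpora/brown', 'brown')
# ensure_nltk_data('tokenizers/punkt', 'punkt')
```

If the lookup misses data that is in fact present, the only cost is a redundant call to the downloader, which reports the package as already up-to-date.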


In [0]:
# use natural language toolkit
import nltk

#
# Use the nltk downloader to download corpora, tools, and dictionaries
#
nltk.download('brown')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('names')
nltk.download('tagsets')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

## Corpora

### Brown University corpus

The Brown Corpus was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University in Providence, Rhode Island, as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.

For more information: https://en.wikipedia.org/wiki/Brown_Corpus

### What will we do here?

We will load the corpus (which we downloaded with the nltk downloader above) and print the first 10 words along with their part-of-speech (POS) tags.


In [0]:
# Import the Brown University Corpus and print the first ten words
from nltk.corpus import brown
print("\nPrinting the first 10 words in the Brown University Corpus:\n")
print(brown.words()[0:10])
print("\nNow printing the POS tags for the first 10 words:\n")
print(brown.tagged_words()[0:10])
print("\nNote that each tagged entry is a (word, tag) pair")
#print(len(brown.words()))


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.

Printing the first 10 words in the Brown University Corpus:

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']

Now printing the POS tags for the first 10 words:

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]

Note that each tagged entry is a (word, tag) pair
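
To get a feel for what can be done with tagged words, here is a minimal sketch that counts tags over the ten (word, tag) pairs printed above. It uses only the standard library. The helper `base_tag` is a hypothetical name; it strips the `-TL` suffix, which the Brown tagset uses for words that appear in titles, so that `NN-TL` and `NN` are counted together.

```python
from collections import Counter

# The ten (word, tag) pairs from the output above.
tagged = [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
          ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
          ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'),
          ('of', 'IN')]

def base_tag(tag):
    """Drop the '-TL' (title) suffix so related tags are counted together."""
    return tag[:-3] if tag.endswith('-TL') else tag

tag_counts = Counter(base_tag(tag) for _, tag in tagged)
print(tag_counts.most_common(2))   # [('NN', 3), ('AT', 2)]
```

The same counting works on the full `brown.tagged_words()` list, although it takes noticeably longer on a million words.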


## Overview of Sentence, Word, and Part of Speech Processing

### What will we do here?

We will bring in a large block of text (from the Wikipedia entry on Signal Processing) and process it in several steps:

*  Tokenize the text into sentences
*  Tokenize the sentences into words
*  Tag the words with part of speech and demonstrate use cases for POS tags
*  Show how to print out the "key" for NLTK POS tags

In [0]:
from nltk import sent_tokenize, word_tokenize, pos_tag

text = """
    Signal processing related to human hearing: The human ear interprets signals that are nearly periodic signals to be exactly periodic. This is just like the case where an autocorrelated signal exhibits slightly different maxima-values at regular intervals of time.
    Vocal processing: Correlation can help to determine the tempo or pitch associated with musical signals. The reason is the fact that the autocorrelation can effectively be used to identify repetitive patterns in any given signal.
    Determining synchronization pulses: The synchronization pulses in a received signal, which in turn facilitates the process of data retrieval at the receiver's end. This is because the correlation of the known synchronization pulses with the incoming signal exhibits peaks when the sync pulses are received in it. This point can then be used by the receiver as a point of reference, which makes the system understand that the part of the signal following from then on (until another peak is obtained in the correlated signal indicating the presence of sync pulse) contains data.
    Radar engineering: Correlation can help determine the presence of a target and its range from the radar unit. When a target is present, the signal sent by the radar is scattered by it and bounced back to the transmitter antenna after being highly attenuated and corrupted by noise. If there is no target, then the signal received will be just noise. Now, if we correlate the arriving signal with the signal sent, and if we obtain a peak at a certain point, then we can conclude that a target is present. Moreover, by knowing the time-delay (indicated by the time-instant at which the correlated signal exhibits a peak) between the sent and received signals, we can even determine the distance between the target and the radar.
    Interpreting digital communications through noise: As demonstrated above, correlation can aid in digital communications by retrieving the bits when a received signal is corrupted heavily by noise. Here, the receiver correlates the received signal with two standard signals which indicate the level of '0' and '1', respectively. Now, if the signal highly correlates with the standard signal which indicates the level of '1' more than with the one which represents '0', then it means that the received bit is '1' (or vice versa).
    Impulse response identification: As demonstrated above, cross-correlation of a system's output with its input results in its impulse response, provided the input is zero mean unit variance white Gaussian noise.
    Image processing: Correlation can help eliminate the effects of varying lighting which results in brightness variation of an image. Usually this is achieved by cross-correlating the image with a definite template wherein the considered image is searched for the matching portions when compared to a template (template matching). This is further found to aid the processes like facial recognition, medical imaging, navigation of mobile robots, etc.
    Linear prediction algorithms: In prediction algorithms, correlation can help guess the next sample arriving in order to facilitate the compression of signals.
    Machine learning: Correlation is used in branches of machine learning, such as in pattern recognition based on correlation clustering algorithms. Here, data points are grouped into clusters based on their similarity, which can be obtained by their correlation.
    SONAR: Correlation can be used in applications such as water traffic monitoring. This is based on the fact that the correlation of the signals received by various shells will have different time-delays and thus their distance from the point of reference can be found more easily.
"""
sents = sent_tokenize(text)  # Break the text into sentences.

print("The # of Sentences in the last example is %s" % len(sents))

tokens = word_tokenize(text)  # Break the text into word and punctuation tokens.

print("\nPrinting the tokens (words) out of the sentences\n")
print(tokens)

tagged_tokens = pos_tag(tokens)  # List of (token, POS tag) tuples

print("\nPrinting the POS TAGGED tokens (words) out of the sentences\n")
print(tagged_tokens)

# Let's walk through the tuples and do some grouping.
print("\nNOW we'll be printing only the tokens (words) that are Nouns\n")

# The loop variable is named `tag` so that it does not overwrite the
# imported pos_tag() function.
for token, tag in tagged_tokens:
    if tag == 'NN':  # 'NN' marks a singular or mass noun
        print(token)
print()
print('Here\'s how we can figure out what these Part of Speech Tags mean!')
print('__________________________________________________________________')

# Look up each distinct tag once instead of once per token.
for tag in sorted(set(tag for _, tag in tagged_tokens)):
    nltk.help.upenn_tagset(tag)



The # of Sentences in the last example is 24

Printing the tokens (words) out of the sentences

['Signal', 'processing', 'related', 'to', 'human', 'hearing', ':', 'The', 'human', 'ear', 'interprets', 'signals', 'that', 'are', 'nearly', 'periodic', 'signals', 'to', 'be', 'exactly', 'periodic', '.', 'This', 'is', 'just', 'like', 'the', 'case', 'where', 'an', 'autocorrelated', 'signal', 'exhibits', 'slightly', 'different', 'maxima-values', 'at', 'regular', 'intervals', 'of', 'time', '.', 'Vocal', 'processing', ':', 'Correlation', 'can', 'help', 'to', 'determine', 'the', 'tempo', 'or', 'pitch', 'associated', 'with', 'musical', 'signals', '.', 'The', 'reason', 'is', 'the', 'fact', 'that', 'the', 'autocorrelation', 'can', 'effectively', 'be', 'used', 'to', 'identify', 'repetitive', 'patterns', 'in', 'any', 'given', 'signal', '.', 'Determining', 'synchronization', 'pulses', ':', 'The', 'synchronization', 'pulses', 'in', 'a', 'received', 'signal', ',', 'which', 'in', 'turn', 'facilitates', 'th
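
One common use of POS tags is grouping tokens by word class. Testing for `'NN'` alone finds only singular nouns, because the Penn Treebank tagset uses separate tags for plural nouns (`NNS`) and proper nouns (`NNP`, `NNPS`). The sketch below groups tokens by the first two characters of the tag instead. The sample pairs are hand-written for illustration, not real tagger output, and `group_by_tag_prefix` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs in the same shape that pos_tag returns.
tagged_sample = [
    ("The", "DT"), ("human", "JJ"), ("ear", "NN"),
    ("interprets", "VBZ"), ("signals", "NNS"), (".", "."),
]

def group_by_tag_prefix(tagged, prefix_len=2):
    """Group tokens by the leading characters of their POS tag."""
    groups = defaultdict(list)
    for token, tag in tagged:
        groups[tag[:prefix_len]].append(token)
    return dict(groups)

groups = group_by_tag_prefix(tagged_sample)
print(groups["NN"])   # ['ear', 'signals']
print(groups["VB"])   # ['interprets']
```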

# Stemming and Lemmatization (what???)

Stemming and lemmatization are basic text processing methods for English text. The goal of both is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Here are the definitions of stemming and lemmatization from Wikipedia:

*In linguistic morphology (i.e., the structure of words) and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.*

*Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.*

For English, which has a fairly simple morphology, this task is generally simple.  For morphologically rich languages (Turkish is a good example) it is absolutely necessary.
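
The difference between the two ideas can be shown with a deliberately tiny sketch in plain Python. This is not how the NLTK stemmers or the WordNet lemmatizer work internally: `toy_stem` simply strips a few suffixes by rule, and `toy_lemmatize` looks words up in a three-entry dictionary. Both names, the suffix list, and the dictionary are made up for illustration.

```python
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word):
    """Rule-based: chop off the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A lemmatizer relies on vocabulary knowledge rather than string rules.
LEMMAS = {"wolves": "wolf", "dogs": "dog", "does": "do"}

def toy_lemmatize(word):
    """Dictionary-based: return the known base form, or the word unchanged."""
    return LEMMAS.get(word, word)

for word in ("quickly", "challenging", "wolves", "does"):
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

The stemmer turns "wolves" into "wolv", which is not a word, while the dictionary lookup returns "wolf". That is the usual trade-off: stemming is cheap and needs no vocabulary, and lemmatization returns real words but depends on a lexical resource.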

### What will we do here?
We'll instantiate a Lancaster Stemmer and demonstrate what a stemmer does.  Then we will instantiate a Lemmatizer and demonstrate what a lemmatizer does.

In [0]:
from nltk.stem.lancaster import LancasterStemmer
# word stemmer
stemmer = LancasterStemmer()
print (stemmer.stem('quickly'))
print (stemmer.stem('challenging'))
print (stemmer.stem('challenges'))

quick
challeng
challeng


In [0]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print (wordnet_lemmatizer.lemmatize('dogs'))
print (wordnet_lemmatizer.lemmatize('wolves'))
# Note that the default POS for lemmatize is noun.  Let's see how it handles verbs.
print (wordnet_lemmatizer.lemmatize('does', pos='v'))

dog
wolf
do
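
Because `lemmatize` assumes a noun unless told otherwise, it is often combined with the POS tagger. The tagger returns Penn Treebank tags such as `VBZ`, while the `pos` argument expects single-letter codes such as `'v'`, so a small mapping is needed. The helper below is a sketch of one common convention; the name `penn_to_wordnet_pos` is hypothetical, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to the single-letter POS codes used by WordNet."""
    if penn_tag.startswith('JJ'):
        return 'a'   # adjective
    if penn_tag.startswith('VB'):
        return 'v'   # verb
    if penn_tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # noun, also used as the fallback

for tag in ('VBZ', 'NNS', 'JJ', 'RB'):
    print(tag, '->', penn_to_wordnet_pos(tag))
```

With the lemmatizer from the cell above, a tagged token could then be reduced with `wordnet_lemmatizer.lemmatize(token, pos=penn_to_wordnet_pos(tag))`.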


# What can we do with this??

## Toy Example: Gender-based name classifier

We use the NLTK Names corpus to train a gender identification classifier.  The classifier estimates the likelihood that a name belongs to the 'male name' section of the corpus or the 'female name' section.  This is a lightweight form of supervised machine learning.

This approach is the basis for more complex classifiers that I have developed.

### What will we do here?

We're going to take the male and female names from the NLTK names corpus, label each name with its gender, shuffle the labeled names, and then display the first 17 of them.

In [0]:
# Grab names out of the nltk name corpus.

from nltk.corpus import names
import random

# Label every name with the corpus file it came from; these labels are the training targets.
classified_names = ([(name, 'male') for name in names.words('male.txt')]
         + [(name, 'female') for name in names.words('female.txt')])

random.shuffle(classified_names)

print("\nLet's output a sample of our labeled (name, gender) pairs:")
classified_names[0:17]


Let's output a sample of our labeled (name, gender) pairs:


[('Dianna', 'female'),
 ('Sylvia', 'female'),
 ('Rorie', 'female'),
 ('Claudio', 'male'),
 ('Laraine', 'female'),
 ('Wilber', 'male'),
 ('Melantha', 'female'),
 ('Tristan', 'male'),
 ('Glad', 'female'),
 ('Idette', 'female'),
 ('Avis', 'female'),
 ('Veronike', 'female'),
 ('Wilfred', 'male'),
 ('Chrystal', 'female'),
 ('Jenda', 'female'),
 ('Shaw', 'male'),
 ('Bart', 'male')]
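
Labeled pairs like these become training data once each name is turned into features. The sketch below trains NLTK's own `NaiveBayesClassifier` on just the 17 names shown above, using a single feature: the last letter of the name. The feature extractor `gender_features` is a hypothetical choice made for illustration, and 17 names are far too few for a useful model; the point is only to show the shape of the data the classifier expects.

```python
import nltk

# The 17 labeled names from the output above.
labeled_names = [
    ('Dianna', 'female'), ('Sylvia', 'female'), ('Rorie', 'female'),
    ('Claudio', 'male'), ('Laraine', 'female'), ('Wilber', 'male'),
    ('Melantha', 'female'), ('Tristan', 'male'), ('Glad', 'female'),
    ('Idette', 'female'), ('Avis', 'female'), ('Veronike', 'female'),
    ('Wilfred', 'male'), ('Chrystal', 'female'), ('Jenda', 'female'),
    ('Shaw', 'male'), ('Bart', 'male'),
]

def gender_features(name):
    """Illustrative feature extractor: use only the final letter of the name."""
    return {'last_letter': name[-1].lower()}

# NLTK classifiers train on (feature dict, label) pairs.
train_set = [(gender_features(name), label) for name, label in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

for name in ('Anna', 'Robert'):
    print(name, '->', classifier.classify(gender_features(name)))
```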

## Improve the Name Classifier and Return Scores

Using some built-in utilities from NLTK, we will train three classifiers from Scikit-learn (another great Python module): Naive Bayes, logistic regression, and a random forest.  Each one is scored on names that were held out from the training set.
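
Holding out names means the classifier is scored on examples it never saw during training. The sketch below shows the idea in plain Python. It is not the code that `names_demo` runs; `holdout_split`, its parameters, and the six sample names are assumptions made for illustration.

```python
import random

def holdout_split(items, n_test, seed=0):
    """Shuffle a copy of `items` and reserve the last `n_test` of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

labeled = [('Dianna', 'female'), ('Claudio', 'male'), ('Wilber', 'male'),
           ('Sylvia', 'female'), ('Tristan', 'male'), ('Avis', 'female')]

train, test = holdout_split(labeled, n_test=2)
print(len(train), "training names,", len(test), "held-out names")
```

Fixing the seed keeps the split the same from run to run, so accuracy figures from different classifiers are compared on the same held-out names.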

In [0]:
from nltk.classify.scikitlearn import SklearnClassifier
import numpy as np
from nltk.classify.util import names_demo, binary_names_demo_features
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier

# Classify names using nltk built in names demo

print("\nClassify names using scikit-learn Naive Bayes:\n")
names_demo(SklearnClassifier(BernoulliNB(binarize=False), dtype=bool).train,
               features=binary_names_demo_features)

print("\nClassify names using scikit-learn logistic regression:\n")
names_demo(SklearnClassifier(LogisticRegression(), dtype=np.float64).train,
               features=binary_names_demo_features)

print("\nClassify names using scikit-learn Random Forest Classifier:\n")
names_demo(SklearnClassifier(RandomForestClassifier(), dtype=np.float64).train,
               features=binary_names_demo_features)




Classify names using scikit-learn Naive Bayes:

Training classifier...
Testing classifier...
Accuracy: 0.7840
Avg. log likelihood: -0.7900

Unseen Names      P(Male)  P(Female)
----------------------------------------
  Kelli            0.0061  *0.9939
  Er              *0.9578   0.0422
  Ally             0.0403  *0.9597
  Stephan         *0.9092   0.0908
  Chriss           0.9389  *0.0611

Classify names using scikit-learn logistic regression:

Training classifier...




Testing classifier...
Accuracy: 0.8020
Avg. log likelihood: -0.5964

Unseen Names      P(Male)  P(Female)
----------------------------------------
  Kelli            0.0397  *0.9603
  Er              *0.8069   0.1931
  Ally             0.2650  *0.7350
  Stephan         *0.7996   0.2004
  Chriss           0.5604  *0.4396

Classify names using scikit-learn Random Forest Classifier:

Training classifier...




Testing classifier...
Accuracy: 0.7920
Avg. log likelihood: -20000000000000003838251908499931647720456555653226733344429960421600659889350400133116159677367157678064678556950834919411457571319033081115243179092432341764775083585521439417826930514883025443724397504007672630600114472688297578978521880251628015186145688930135042504346886115963764281869217038336.0000

Unseen Names      P(Male)  P(Female)
----------------------------------------
  Kelli            0.0500  *0.9500
  Er              *0.8000   0.2000
  Ally             0.5000  *0.5000
  Stephan         *0.8333   0.1667
  Chriss           0.5667  *0.4333


<SklearnClassifier(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))>
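
The enormous negative average log likelihood reported for the Random Forest is what one would expect if some held-out names received a probability of exactly 0 for their true label: log(0) is undefined, so it has to be replaced by a very large negative placeholder, and a single such value dominates the average. The sketch below illustrates that effect in plain Python. The `floor` value, the base-2 logarithm, and the sample probabilities are assumptions for illustration, not confirmed details of `names_demo`.

```python
import math

def average_log_likelihood(gold_probs, floor=-1e300):
    """Mean log2 probability assigned to the true labels.

    `gold_probs` holds, for each test item, the probability the classifier
    gave to the correct label. A probability of 0 is replaced by `floor`.
    """
    logs = [math.log2(p) if p > 0 else floor for p in gold_probs]
    return sum(logs) / len(logs)

# Every true label gets some probability: a modest negative score.
print(average_log_likelihood([0.5, 0.25]))   # -1.5

# One true label gets probability 0: the average collapses.
print(average_log_likelihood([0.5, 0.0]))
```

This is why accuracy and average log likelihood can tell different stories: a classifier can be right most of the time and still be penalized heavily for the few cases where it was confidently wrong.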