# An introduction to the NLTK

This file introduces the reader to the Python Natural Language Toolkit (NLTK).

The Python NLTK is a set of modules and corpora enabling the reader to do natural language processing against corpora of one or more texts. It goes beyond text mining and provides tools to do machine learning, but this Notebook barely scratches that surface.

This is my first Python Jupyter Notebook. As such, I'm sure there will be errors in implementation, style, and functionality. For example, the Notebook may fail because the value of FILE is too operating-system dependent, or because the given file does not exist. Other failures may stem from missing modules. In these cases, simply read the error messages and follow the instructions. "Your mileage may vary."

That said, through the use of this Notebook, the reader ought to be able to get a flavor for what the Toolkit can do without the need to completely understand the Python language.

--  
Eric Lease Morgan <emorgan@nd.edu>  
April 12, 2018

In [None]:
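# configure; define the location of a plain text file for analysis -- a bare file name like this one is resolved against the Notebook's working directory, so use an absolute path if in doubt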
FILE = 'walden.txt'

In [None]:
# import / require the use of the Toolkit
from nltk import FreqDist, Text, ne_chunk, ngrams, pos_tag, sent_tokenize, word_tokenize

In [None]:
# slurp up the given file; display the result
with open( FILE, 'r' ) as handle :
    data = handle.read()
print( data )
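
The introduction warns that the Notebook may fail when the file named by FILE does not exist. Here is a minimal sketch of a more defensive read. The helper name `read_text` and the UTF-8 default are my assumptions, not part of the original Notebook.

```python
from pathlib import Path


def read_text(path, encoding='utf-8'):
    """Return the contents of a plain text file, or raise a readable error."""
    path = Path(path)
    if not path.is_file():
        # report the fully resolved location, which helps with relative paths
        raise FileNotFoundError(f'No such file: {path.resolve()}')
    # Path.read_text() opens, reads, and closes the file in one step
    return path.read_text(encoding=encoding)
```

With this in place, `data = read_text( FILE )` would stand in for the open/read pair above.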

In [None]:
# tokenize the data into features (words); display them
features = word_tokenize( data )
print( features )

In [None]:
# normalize the features to lower case and exclude punctuation
features = [ feature for feature in features if feature.isalpha() ]
features = [ feature.lower() for feature in features ]
print( features )
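
To see what the normalization step does without loading a whole book, here is a small sketch on a made-up sentence. The regular expression is only a rough stand-in for `word_tokenize` (which needs the Toolkit's tokenizer data), and the names `sample`, `sample_tokens`, and `sample_features` are hypothetical.

```python
import re

# a made-up sentence, not a quotation from the text under analysis
sample = 'In 1845, Thoreau moved to Walden Pond; he stayed two years.'

# rough stand-in for word_tokenize: runs of word characters, or runs of punctuation
sample_tokens = re.findall(r'\w+|[^\w\s]+', sample)

# keep purely alphabetic tokens (this drops punctuation and numbers), then lower-case
sample_features = [token.lower() for token in sample_tokens if token.isalpha()]
print(sample_features)
```

Notice that `isalpha()` removes "1845" along with the punctuation, so numbers disappear from the features too.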

In [None]:
# create a list of (English) stopwords, and then remove them from the features
from nltk.corpus import stopwords
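# keep the list under its own name (so the stopwords module is not overwritten and the cell can be re-run), and make it a set for fast lookups
stops    = set( stopwords.words( 'english' ) )
features = [ feature for feature in features if feature not in stops ]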
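
The filtering itself is plain Python, so it can be tried without the stopword corpus. This sketch uses a tiny hand-picked stopword set, which is purely an assumption for illustration; the Toolkit's English list is much longer.

```python
# a tiny, hand-picked stopword set; a stand-in for the Toolkit's English list
sample_stopwords = {'the', 'to', 'in', 'he', 'a', 'of', 'and'}

sample_features = ['in', 'thoreau', 'moved', 'to', 'walden', 'pond',
                   'he', 'stayed', 'two', 'years']

# keep only the features that are not stopwords
kept = [word for word in sample_features if word not in sample_stopwords]
print(kept)
```

A set is used because membership tests against a set do not slow down as the stopword list grows, which adds up over the tokens of an entire book.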

In [None]:
# count & tabulate the features, and then plot the results -- season to taste
frequencies = FreqDist( features )
plot = frequencies.plot( 10 )

In [None]:
# create a list of words occurring only once (hapaxes); display them
hapaxes = frequencies.hapaxes()
print( hapaxes )
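
Counting is easier to follow on a handful of words. The sketch below builds a `FreqDist` from a made-up list (the `sample_` names are hypothetical) and shows the most common items, the hapaxes, and the total token count.

```python
from nltk import FreqDist

# a made-up list of features
sample_features = ['pond', 'woods', 'pond', 'life', 'woods', 'pond', 'beans']
sample_frequencies = FreqDist(sample_features)

print(sample_frequencies.most_common(2))  # the two most frequent words, with counts
print(sample_frequencies.hapaxes())       # words that occur exactly once
print(sample_frequencies.N())             # total number of tokens counted
```

`FreqDist` builds on Python's `collections.Counter`, so dictionary-style lookups such as `sample_frequencies[ 'pond' ]` work as well.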

In [None]:
# count & tabulate ngrams from the features -- season to taste; display some
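# store the result under a new name so the ngrams function is not overwritten and the cell can be re-run
bigram_list = list( ngrams( features, 2 ) )
frequencies = FreqDist( bigram_list )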
frequencies.most_common( 10 )
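
Here is the same idea on a short, made-up list of features, as a sketch only. It also illustrates one detail worth knowing: `ngrams` returns a generator, so the result is turned into a list before it is reused. The `sample_` names are hypothetical.

```python
from nltk import FreqDist
from nltk.util import ngrams

# a made-up list of features
sample_features = ['walden', 'pond', 'is', 'deep', 'walden', 'pond', 'is', 'clear']

# ngrams() returns a generator; materialize it once so it can be counted and inspected
sample_bigrams = list(ngrams(sample_features, 2))
bigram_frequencies = FreqDist(sample_bigrams)

print(sample_bigrams[:3])
print(bigram_frequencies.most_common(2))
```

Changing the second argument from 2 to 3 would count trigrams instead.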

In [None]:
# create a list of each token's length, and plot the result; How many "long" words are there?
lengths = [ len( feature ) for feature in features ]
plot    = FreqDist( lengths ).plot( 10 )

In [None]:
# initialize a stemmer, stem the features, count & tabulate, and output
from nltk.stem import PorterStemmer
stemmer     = PorterStemmer()
stems       = [ stemmer.stem( feature ) for feature in features ]
frequencies = FreqDist( stems )
frequencies.most_common( 10 )
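
To get a feel for what stemming does to individual words, this sketch runs the same Porter stemmer over a few hand-picked words. The word list and the `sample_` names are hypothetical.

```python
from nltk.stem import PorterStemmer

sample_stemmer = PorterStemmer()

# different inflections of a word should collapse to a shared stem
for word in ['running', 'runs', 'walked', 'houses']:
    print(word, '->', sample_stemmer.stem(word))
```

Stems are not always dictionary words; they only need to be the same for related forms, which is what makes the counts above group together.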

In [None]:
# re-create the features and create an NLTK Text object, so other cool things can be done
features = word_tokenize( data )
text     = Text( features )

In [None]:
# count & tabulate, again; list a given word -- season to taste
frequencies = FreqDist( text )
print( frequencies[ 'love' ] )

In [None]:
# do keyword-in-context searching against the text (concordancing)
# concordance() prints its results itself and returns None, so no print() is needed
text.concordance( 'love' )
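
If the matches are wanted as data rather than as printed output, recent versions of the Toolkit also offer `concordance_list`. The following is a sketch on a made-up sentence, assuming an NLTK release that includes that method; the `sample_` names are hypothetical.

```python
from nltk.text import Text

# a made-up sentence, split on whitespace so no tokenizer data is needed
sample_tokens = 'I love the pond and I love the woods but not the town'.split()
sample_text = Text(sample_tokens)

# concordance_list() returns one record per match instead of printing
hits = sample_text.concordance_list('love')
for hit in hits:
    # offset is the position of the matched token; line is the formatted context
    print(hit.offset, hit.line)
```

Each record can then be filtered, counted, or saved, which is awkward to do with printed output.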

In [None]:
# create a dispersion plot of given words
plot = text.dispersion_plot( [ 'love', 'war', 'man', 'god' ] )

In [None]:
# output the "most significant" bigrams, considering surrounding words (size of window) -- season to taste
text.collocations( num=10, window_size=4 )

In [None]:
# given a set of words, list the contexts (the surrounding words) they share
text.common_contexts( [ 'love', 'war', 'man', 'god' ] )

In [None]:
# list the words (features) appearing in the same contexts as the given word
text.similar( 'love' )

In [None]:
# create a list of sentences, and display one -- season to taste
sentences = sent_tokenize( data )
sentence  = sentences[ 14 ]
print( sentence )

In [None]:
# tokenize the sentence and parse it into parts-of-speech, all in one go
# the result gets its own name so the original sentence is kept and the cell can be re-run
tagged = pos_tag( word_tokenize( sentence ) )
print( tagged )

In [None]:
# extract named entities from the tagged sentence, and print the results
entities = ne_chunk( tagged )
print( entities )
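
The value returned by `ne_chunk` is a tree: ordinary tokens are (word, tag) pairs, and each named entity is a labeled subtree. The sketch below walks such a tree to pull out the entities. The tree is built by hand from a made-up sentence so the example runs without the tagger and chunker models; the `sample_` names are hypothetical.

```python
from nltk.tree import Tree

# a hand-built tree in the shape ne_chunk returns: tagged tokens, with entities as subtrees
sample_entities = Tree('S', [
    Tree('PERSON', [('Henry', 'NNP'), ('Thoreau', 'NNP')]),
    ('lived', 'VBD'),
    ('near', 'IN'),
    Tree('GPE', [('Concord', 'NNP')]),
    ('.', '.'),
])

# collect (label, text) pairs from every entity subtree, skipping the root 'S'
named = [
    (subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
    for subtree in sample_entities.subtrees(lambda t: t.label() != 'S')
]
print(named)
```

The same comprehension should work on the `entities` variable above, since it has the same structure.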

In [None]:
# output the entities graphically
entities

This is the end of the Notebook. I hope you found it useful.