## NLTK Fundamentals

NLTK (Natural Language ToolKit) is probably the best-known Python Natural
Language Processing library. You can find the official website here: http://www.nltk.org/.
There’s a lot of debate about whether it really is a production-ready library.
I believe that the pre-trained models that come with it are not the best out there,
but NLTK is definitely a great library for:

- Learning NLP - the API is simple & well-designed, full of good didactic examples
- Prototyping - before building performant systems, you should build a prototype
so you can prove how cool your new app is.

In [1]:
import nltk
#nltk.download('all')

## Splitting Text
Until now, you probably haven’t put a lot of thought into how difficult the task of
splitting text into sentences and/or words can be. Let’s identify the most common issues
using a sample text from NLTK and see how the NLTK sentence splitter performs in
various situations.

In [2]:
from nltk.corpus import reuters
print(reuters.raw('test/21131')[:1000], '...')

AMPLE SUPPLIES LIMIT U.S. STRIKE'S OIL PRICE IMPACT
  Ample supplies of OPEC crude weighing on
  world markets helped limit and then reverse oil price gains
  that followed the U.S. Strike on an Iranian oil platform in the
  Gulf earlier on Monday, analysts said.
      December loading rose to 19.65 dlrs, up 45 cents before
  falling to around 19.05/15 later, unchanged from last Friday.
      "Fundamentals are awful," said Philip Lambert, analyst with
  stockbrokers Kleinwort Grieveson, adding that total OPEC
  production in the first week of October could be above 18.5 mln
  bpd, little changed from September levels.
      Peter Nicol, analyst at Chase Manhattan Bank, said OPEC
  production could be about 18.5-19.0 mln in October. Reuter and
  International Energy Agency (IEA) estimates put OPEC September
  production at 18.5 mln bpd.
      The U.S. Attack was in retaliation of last Friday's hit of
  a Kuwaiti oil products tanker flying the U.S. Flag, the Sea
  Isle City. It was struc

The main difficulties are:

- not all punctuation marks indicate the end of a sentence (from the example
above: “U.S.”, “19.65”, “19.05/15”, etc.)
- not all sentences end with a punctuation mark (e.g. text from social platforms)
- not all sentences start with a capitalized letter (e.g. some sentences start with
a quotation mark or with numbers)
- not all capitalized letters mark the start of a sentence (from the example above:
“The U.S. Attack”, “U.S. Flag”)
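
To make the first issue concrete, here is a minimal sketch of a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. The `naive_split` helper and the shortened sample text are illustrative assumptions, not part of NLTK.

```python
import re

# Shortened sample, adapted from the Reuters text above
text = ("The U.S. Attack was in retaliation of last Friday's hit. "
        "December loading rose to 19.65 dlrs.")

def naive_split(text):
    # Split wherever sentence-final punctuation is followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

naive_sentences = naive_split(text)
for sentence in naive_sentences:
    print(repr(sentence))
```

On this sample the naive rule cuts “The U.S.” off from the rest of its sentence, while “19.65” survives only because no whitespace follows the dot.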

In [3]:
import nltk
from nltk.corpus import reuters
sentences = nltk.sent_tokenize(reuters.raw('test/21131')[:1000])
print("#sentences={0}\n\n".format(len(sentences)))
for sent in sentences:
    print(sent, '\n')

#sentences=9


AMPLE SUPPLIES LIMIT U.S. STRIKE'S OIL PRICE IMPACT
  Ample supplies of OPEC crude weighing on
  world markets helped limit and then reverse oil price gains
  that followed the U.S. Strike on an Iranian oil platform in the
  Gulf earlier on Monday, analysts said. 

December loading rose to 19.65 dlrs, up 45 cents before
  falling to around 19.05/15 later, unchanged from last Friday. 

"Fundamentals are awful," said Philip Lambert, analyst with
  stockbrokers Kleinwort Grieveson, adding that total OPEC
  production in the first week of October could be above 18.5 mln
  bpd, little changed from September levels. 

Peter Nicol, analyst at Chase Manhattan Bank, said OPEC
  production could be about 18.5-19.0 mln in October. 

Reuter and
  International Energy Agency (IEA) estimates put OPEC September
  production at 18.5 mln bpd. 

The U.S. 

Attack was in retaliation of last Friday's hit of
  a Kuwaiti oil products tanker flying the U.S. 

Flag, the Sea
  Isle City. 

It wa

As you can see, the results are not perfect: three of the fragments near the end
(“The U.S.”, “Attack was in retaliation ... the U.S.” and “Flag, the Sea Isle City.”)
are actually a single sentence. The NLTK sentence splitter was fooled by the full stop
followed by a capitalized letter.
The NLTK splitter (the pre-trained Punkt tokenizer) relies on lists of abbreviations, words that
usually go together and words that appear at the start of a sentence, all learned from
a corpus. Let me show you an example where the NLTK splitter doesn’t fail:

In [4]:
#Introducing nltk.sent_tokenize, take 2
import nltk
print(nltk.sent_tokenize("The U.S. Army is a good example."))
# ['The U.S. Army is a good example.'] - only one sentence, no false splits

['The U.S. Army is a good example.']


Let’s talk about splitting sentences into words. The process is almost the same, but
the solution for word tokenization is easier and more straightforward:

In [5]:
#Introducing nltk.word_tokenize
import nltk
print(nltk.word_tokenize('The U.S. Army is a good example.'))
# ['The', 'U.S.', 'Army', 'is', 'a', 'good', 'example', '.']

['The', 'U.S.', 'Army', 'is', 'a', 'good', 'example', '.']


The `word_tokenize` function uses the `nltk.tokenize.treebank.TreebankWordTokenizer` class.

The `TreebankWordTokenizer` is a rule-based system that makes heavy use of regular
expressions to split sentences into words according to the conventions of the Penn
Treebank corpus.

Here are the rules followed by the word tokenizer, extracted from the NLTK
documentation:
- split standard contractions, e.g. “don’t” → “do n’t” and “they’ll” → “they ’ll”
- treat most punctuation characters as separate tokens
- split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of line



## Building a vocabulary
You will often want to compute some word statistics. NLTK has some helpful classes
to quickly compute the metrics you’re after, like `nltk.FreqDist`. Here’s an example of
how you can use it:

In [6]:
# Build a vocabulary
import nltk
fdist = nltk.FreqDist(nltk.corpus.reuters.words())

In [7]:
type(fdist) # dictionary

nltk.probability.FreqDist

In [8]:
len(fdist)

41600

In [9]:
print(fdist)

<FreqDist with 41600 samples and 1720901 outcomes>


In [10]:
fdist

FreqDist({'.': 94687, ',': 72360, 'the': 58251, 'of': 35979, 'to': 34035, 'in': 26478, 'said': 25224, 'and': 25043, 'a': 23492, 'mln': 18037, ...})

In [11]:
fdist['.']

94687

In [12]:
# top 10 most frequent words
print(fdist.most_common(n=10))

[('.', 94687), (',', 72360), ('the', 58251), ('of', 35979), ('to', 34035), ('in', 26478), ('said', 25224), ('and', 25043), ('a', 23492), ('mln', 18037)]


In [13]:
# get the count of the word `stock`
print(fdist['stock']) # 2346

2346


In [14]:
# get the count of the word `stork`
print(fdist['stork']) # 0 :(

0


In [15]:
# get the frequency of the word `the`
print(fdist.freq('the')) # 0.033849129031826936

0.033849129031826936


In [17]:
# get the words that only appear once (these words are called hapaxes)
fdist.hapaxes()

['RIFT',
 'Mounting',
 'inflict',
 'Move',
 'Unofficial',
 'Sheen',
 'Safe',
 'avowed',
 'VERMIN',
 'EAT',
 'vermin',
 'kilolitres',
 'kl',
 'Janunary',
 'pineapples',
 'Hasrul',
 'Paian',
 'sawn',
 'Goodall',
 'Bundey',
 'desposits',
 'unrecoverable',
 'Koh',
 'scratch',
 'forbidden',
 'lame',
 'duck',
 'Regulations',
 'Steagall',
 'Meiko',
 'devote',
 'digesting',
 'juggling',
 'plentiful',
 'BAIL',
 'ATLC',
 'bail',
 'Cebu',
 'Feedgrain',
 'ketchup',
 'scapegoat',
 'AMATIL',
 'AMAA',
 'Merchanting',
 'Tissue',
 'plannned',
 'FINNS',
 'locating',
 'enhancer',
 'tetra',
 'ethyl',
 'Limit',
 'FORREST',
 'WHIM',
 'Austwhim',
 'Croesus',
 'STAGNATING',
 '491p',
 '468p',
 '481p',
 'LOSES',
 'susbidiaries',
 'Viviez',
 'Padaeng',
 'PREDICT',
 'ANHEUSER',
 'BUSCH',
 'BOOMING',
 'trillions',
 'sloshing',
 'attraction',
 'unobtainable',
 'edgy',
 'pouring',
 'ordinaries',
 'Crucial',
 'sapped',
 'edict',
 'petrodollar',
 'portrayed',
 'overstressed',
 'REDLAND',
 'Isuzu',
 'refute',
 'briefca

In [18]:
# Hapaxes usually are `mispeled` or weirdly `cApiTALIZED` words.
# Total number of distinct words
print(len(fdist.keys())) # 41600

41600


In [19]:
# Total number of samples
print(fdist.N()) # 1720901

1720901
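
`FreqDist` is a subclass of Python's `collections.Counter`, so the statistics above can be reproduced on a toy corpus with the standard library alone. The sketch below uses a made-up token list purely for illustration.

```python
from collections import Counter

# Toy corpus, invented for this example
tokens = "the cat sat on the mat . the dog sat .".split()
counts = Counter(tokens)

total = sum(counts.values())        # plays the role of fdist.N()
vocab_size = len(counts)            # plays the role of len(fdist)
the_freq = counts["the"] / total    # plays the role of fdist.freq('the')
# words seen exactly once, like fdist.hapaxes()
hapaxes = [word for word, count in counts.items() if count == 1]

print(total, vocab_size, the_freq, hapaxes)
print(counts["stork"])  # missing words count as 0, no KeyError
```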


## Fun with Bigrams and Trigrams
You’ll hear about bigrams and trigrams a lot in NLP. **They are nothing but pairs or
triplets of adjacent words.** If we generalize the term to arbitrary lengths, we get
ngrams.

Ngrams are used to build approximate language models, but they are also used in
text classification tasks or as features for various other statistical natural language
models. Bigrams and trigrams are especially popular because going beyond that
size usually doesn’t bring any significant performance boost, only a more complex
model. Here are some shortcuts to work with:

In [20]:
#Extracting bigrams & trigrams
from nltk import bigrams, trigrams, word_tokenize
text = "John works at Intel."
tokens = word_tokenize(text)

In [21]:
print(list(bigrams(tokens))) # the `bigrams` function returns a generator, so we must unwind it

[('John', 'works'), ('works', 'at'), ('at', 'Intel'), ('Intel', '.')]


In [22]:
print(list(trigrams(tokens))) # the `trigrams` function returns a generator, so we must unwind it

[('John', 'works', 'at'), ('works', 'at', 'Intel'), ('at', 'Intel', '.')]
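
Generalizing to any length takes only a few lines. The `make_ngrams` helper below is a hypothetical name for a plain-Python sketch; NLTK also ships its own `nltk.ngrams(sequence, n)` helper for the same purpose.

```python
def make_ngrams(tokens, n):
    # Zip n copies of the token list, each shifted by one more position
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["John", "works", "at", "Intel", "."]
two_grams = make_ngrams(tokens, 2)
four_grams = make_ngrams(tokens, 4)
print(two_grams)
print(four_grams)
```

A sequence of `k` tokens yields `k - n + 1` ngrams, so the five tokens here produce four bigrams and two 4-grams.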


A particular subset of a text’s bigrams/trigrams are the collocations. To better
understand what collocations are, you can think of them as expressions
of multiple words which commonly co-occur. Collocations can tell us a lot about
the text they are extracted from and can be used as important features in different
tasks.
Here’s how to compute them using some handy NLTK functions:

**Extracting collocations**

In [23]:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

In [24]:
# Compute length-2 collocations
finder = BigramCollocationFinder.from_words(nltk.corpus.reuters.words())

# only bigrams that appear 5+ times
finder.apply_freq_filter(5)

In [25]:
# return the 50 bigrams with the highest PMI (Pointwise Mutual Information)
finder.nbest(bigram_measures.pmi, 50)

[('DU', 'PONT'),
 ('Keng', 'Yaik'),
 ('Kwik', 'Save'),
 ('Nihon', 'Keizai'),
 ('corenes', 'pora'),
 ('fluidized', 'bed'),
 ('Akbar', 'Hashemi'),
 ('Constructions', 'Telephoniques'),
 ('Elevator', 'Mij'),
 ('Entre', 'Rios'),
 ('Graan', 'Elevator'),
 ('JIM', 'WALTER'),
 ('Taikoo', 'Shing'),
 ('der', 'Vorm'),
 ('di', 'Clemente'),
 ('Borrowing', 'Requirement'),
 ('FOOTE', 'MINERAL'),
 ('Hawker', 'Siddeley'),
 ('JARDINE', 'MATHESON'),
 ('PRORATION', 'FACTOR'),
 ('Wildlife', 'Refuge'),
 ('Kohlberg', 'Kravis'),
 ('Almir', 'Pazzionotto'),
 ('Bankhaus', 'Centrale'),
 ('Corpus', 'Christi'),
 ('Kuala', 'Lumpur'),
 ('Maple', 'Leaf'),
 ('Stats', 'Oljeselskap'),
 ('Zoete', 'Wedd'),
 ('Tadashi', 'Kuranari'),
 ('Drawing', 'Rights'),
 ('EASTMAN', 'KODAK'),
 ('Martinez', 'Cuenca'),
 ('Mathematical', 'Applications'),
 ('Neutral', 'Zone'),
 ('Townsend', 'Thoresen'),
 ('Sector', 'Borrowing'),
 ('Hashemi', 'Rafsanjani'),
 ('Hossein', 'Mousavi'),
 ('Kitty', 'Hawk'),
 ('Task', 'Force'),
 ('Tender', 'Loving'),

In [26]:
# among the collocations we can find stuff like: (u'Corpus', u'Christi') ...
# Compute length-3 collocations
finder = nltk.collocations.TrigramCollocationFinder.from_words(nltk.corpus.reuters.words())
# only trigrams that appear 5+ times
finder.apply_freq_filter(5)
# return the 50 trigrams with the highest PMI

In [27]:
finder.nbest(trigram_measures.pmi, 50)

[('Graan', 'Elevator', 'Mij'),
 ('Sector', 'Borrowing', 'Requirement'),
 ('Akbar', 'Hashemi', 'Rafsanjani'),
 ('Lim', 'Keng', 'Yaik'),
 ('Alejandro', 'Martinez', 'Cuenca'),
 ('Den', 'Norske', 'Stats'),
 ('Norske', 'Stats', 'Oljeselskap'),
 ('Kokusai', 'Denshin', 'Denwa'),
 ('Special', 'Drawing', 'Rights'),
 ('Dar', 'es', 'Salaam'),
 ('FOLLOWING', 'RAINFALL', 'WAS'),
 ('Duffour', 'et', 'Igon'),
 ('Tender', 'Loving', 'Care'),
 ('CATTLE', 'SLAUGHTER', 'GUESSTIMATES'),
 ('CAMPBELL', 'RED', 'LAKE'),
 ('Victor', 'Paz', 'Estenssoro'),
 ('Carter', 'Hawley', 'Hale'),
 ('Punta', 'del', 'Este'),
 ('ELEVATOR', 'LOADING', 'WAITING'),
 ('TIME', 'JOBLESS', 'CLAIMS'),
 ('Francaise', 'des', 'Petroles'),
 ('Public', 'Sector', 'Borrowing'),
 ('Arturo', 'Hernandez', 'Grisanti'),
 ('Speaker', 'Jim', 'Wright'),
 ('carrier', 'Kitty', 'Hawk'),
 ('Archer', 'Daniels', 'Midland'),
 ('Corning', 'Glass', 'Works'),
 ('refined', 'bleached', 'deodorised'),
 ('Grown', 'Cereals', 'Authority'),
 ('Commissioner', 'Frans'

### Pointwise Mutual Information
PMI is a measure that indicates the association between two variables
- *how likely is it that these two values appear together?*

$$
\mathrm{pmi}(x,y) = \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)
$$
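
As a worked example of the formula, the sketch below estimates the probabilities from raw counts on a tiny made-up corpus: `p(x,y)` as the bigram count over the number of bigrams, `p(x)` and `p(y)` as unigram counts over the number of tokens. The base-2 logarithm and the `pmi` helper name are assumptions made for this illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for this example
tokens = "new york is big . new york is busy . the city is big .".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = n_tokens - 1

def pmi(x, y):
    joint = bigram_counts[(x, y)]
    if joint == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = joint / n_bigrams
    p_x = unigram_counts[x] / n_tokens
    p_y = unigram_counts[y] / n_tokens
    return math.log2(p_xy / (p_x * p_y))

print(pmi("new", "york"), pmi("is", "big"))
```

The pair (“new”, “york”) scores higher than (“is”, “big”) because “new” and “york” never appear apart, whereas “is” also combines with other words. Very rare pairs can get inflated scores this way, which is one reason to apply a frequency filter before ranking.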

## Part Of Speech Tagging
Part of Speech Tagging (or POS Tagging, for short) is probably the most popular challenge
in the history of NLP. POS Tagging means assigning a grammatical
label to every word in a sequence (usually a sentence). When I say grammatical
label, I mean: Noun, Verb, Preposition, Pronoun, etc.


In NLP, a collection of such labels is called a tag set. The most widespread one is the
Penn Treebank Tag Set. Below is the alphabetical list of part-of-speech tags used in
the Penn Treebank Project:


![](https://i.imgur.com/JvvGtXf.png)
![](https://i.imgur.com/VssomxH.png)



Your instinct now might be to run to your high school grammar book but don’t
worry, you don’t really need to know what all those POS tags mean. In fact, not even
all corpora implement this exact tag set, but rather a subset of it. For example, I’ve
never encountered the LS (list item marker) or the PDT (predeterminer) anywhere.

Part-Of-Speech tagging also serves as a base of deeper NLP analyses. There are just
a few cases when you’ll work directly with the tagged sentence. A scenario that
comes to mind is keyword extraction, when usually you only want to extract the
adjectives and nouns. In this case, you use the tags to filter out those words that
can’t be a keyword.
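
The filtering step can be sketched in a couple of lines. The tagged sentence below is hand-tagged for illustration (it is not tagger output), and the `keyword_candidates` helper is a hypothetical name.

```python
# Hand-tagged example sentence using Penn Treebank tags
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

def keyword_candidates(tagged_tokens):
    # Keep adjectives (JJ, JJR, JJS) and nouns (NN, NNS, NNP, NNPS)
    return [word for word, tag in tagged_tokens if tag.startswith(("JJ", "NN"))]

candidates = keyword_candidates(tagged)
print(candidates)
```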

If you’re not familiar with the task: nowadays, POS tagging is done with machine
learning models. But a while ago, POS taggers were rule-based. Using regular
expressions and various heuristics, the POS tagger would determine an appropriate tag.
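
A toy version of such a tagger might look like the sketch below. The rule list and the `rule_based_tag` helper are invented for illustration and are far cruder than any real rule-based tagger.

```python
import re

# Ordered (pattern, tag) rules: the first pattern that matches wins
RULES = [
    (r"[0-9]+", "CD"),   # cardinal numbers
    (r".*ing", "VBG"),   # gerunds
    (r".*ed", "VBD"),    # past tense verbs
    (r".*s", "NNS"),     # plural nouns
    (r".*", "NN"),       # fallback: everything else is a noun
]

def rule_based_tag(tokens):
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if re.fullmatch(pattern, token):
                tagged.append((token, tag))
                break
    return tagged

toy_tagged = rule_based_tag(["Things", "started", "blogging", "1974", "wish"])
print(toy_tagged)
```

Even on this short input the fallback rule mislabels the verb “wish” as a noun, and fixing it would mean adding yet another rule.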




It may seem like a rather massive oversimplification, but this is the general idea.
The challenge was to come up with rules that best described the phenomenon that is
human language. But, as you go deeper and deeper, the rules become more and
more complex, and it gets harder to keep track of which rules did what, whether there
were rules contradicting one another, etc. Moreover, humans don’t usually
excel at noticing correlations between a great number of variables. In this case,
the variables could have been:


- Previous words
- Following words
- Prefixes, Suffixes
- Previous POS tags
- Word capitalization



Mathematical algorithms are better at estimating the optimal rules
for correctly tagging words. Thus, the field turned to machine learning and the
approach became something like this:

---

1. Get some humans to annotate some texts with POS tags (we’ll call this the gold
standard)
2. Get other humans to build some mathematical models to predict tags using a
large part of the gold standard corpus (this is called training the model)
3. Using the remaining part of the gold standard corpus, assess how well the
model is performing on data the model hasn’t seen yet (this is called testing
the model)
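
The three steps can be sketched end to end with a deliberately simple model: a baseline that assigns each word the tag it received most often in training. The tiny gold standard below is made up for illustration, and the names `predict` and `accuracy` are assumptions of this sketch.

```python
from collections import Counter, defaultdict

# Step 1: a (tiny, invented) gold standard of hand-tagged sentences
gold = [
    [("I", "PRP"), ("run", "VBP"), (".", ".")],
    [("I", "PRP"), ("sleep", "VBP"), (".", ".")],
    [("Dogs", "NNS"), ("run", "VBP"), (".", ".")],
    [("I", "PRP"), ("run", "VBP"), (".", ".")],
]
train, test = gold[:3], gold[3:]

# Step 2: "train" by counting how often each word carries each tag
tag_counts = defaultdict(Counter)
for sentence in train:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def predict(word):
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NN"  # unseen words default to noun

# Step 3: evaluate on the held-out sentences
correct = total = 0
for sentence in test:
    for word, tag in sentence:
        total += 1
        if predict(word) == tag:
            correct += 1
accuracy = correct / total
print(accuracy)
```

Real taggers replace the counting step with a statistical model that also looks at context (the variables listed earlier), but the train/test protocol stays the same.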


This may seem complicated, and it usually is presented as such, but throughout this
book, we’ll be demystifying all of these algorithms. To start doing POS tagging we
don’t need much because NLTK comes with some pre-trained POS tagger models.
It’s super easy to get started and here’s how:


**Introducing nltk.pos_tag**

In [28]:
import nltk
sentence = "Things I wish I knew before I started blogging."
tokens = nltk.word_tokenize(sentence)

In [29]:
print("Tokens: ", tokens)

Tokens:  ['Things', 'I', 'wish', 'I', 'knew', 'before', 'I', 'started', 'blogging', '.']


In [30]:
tagged_tokens = nltk.pos_tag(tokens)
"Tagged Tokens: ", tagged_tokens

('Tagged Tokens: ',
 [('Things', 'NNS'),
  ('I', 'PRP'),
  ('wish', 'VBP'),
  ('I', 'PRP'),
  ('knew', 'VBD'),
  ('before', 'IN'),
  ('I', 'PRP'),
  ('started', 'VBD'),
  ('blogging', 'VBG'),
  ('.', '.')])

It is as straightforward and easy as this, but keep in mind that the NLTK trained
model is not the best. It’s a bit slow and not the most precise. It is, though, well suited
for toy projects or prototyping.

## Named Entity Recognition

Named Entity Recognition (NER for short) is almost as well-known and studied as
POS tagging. NER means extracting named entities and their classes from a given
text. The usual named entities we’re dealing with stand for: *People, Organizations,
Locations, Events, etc*. Sometimes, things like currencies, numbers, percentages, dates
and time expressions are considered named entities even though they technically
aren’t. The entities are used in information extraction tasks and can usually
be attributed to a real-life object or concept. To make things clearer, let
me give you some examples:

- if we extract a name of a person from a text, we can associate it with a Facebook
profile, email address or even a unique identification number
- if we extract a date/time, we can associate the string with an actual slot in a
calendar.
- if we extract a location, we can associate it with some exact coordinates in
Google Maps.

**Using the nltk.ne_chunk function**

In [31]:
import nltk
# Adapted from Wikipedia:
# https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Feynman!
sentence = """The closing chapter, is adapted from the address that
Feynman gave during the 1974 commencement exercises
at the California Institute Of Technology. """

# tokenize and pos tag
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
ner_annotated_tree = nltk.ne_chunk(tagged_tokens)

In [32]:
print(sentence)

The closing chapter, is adapted from the address that
Feynman gave during the 1974 commencement exercises
at the California Institute Of Technology. 


In [33]:
print(ner_annotated_tree)

(S
  The/DT
  closing/NN
  chapter/NN
  ,/,
  is/VBZ
  adapted/VBN
  from/IN
  the/DT
  address/NN
  that/IN
  (PERSON Feynman/NNP)
  gave/VBD
  during/IN
  the/DT
  1974/CD
  commencement/NN
  exercises/NNS
  at/IN
  the/DT
  (ORGANIZATION California/NNP Institute/NNP Of/IN Technology/NNP)
  ./.)


Notice that the result has one or more tokens bundled up into entities:
1. Feynman = PERSON
2. California Institute Of Technology = ORGANIZATION

Also notice that we needed to POS tag the sentence before feeding it to the
`ne_chunk` function. This is because the function uses the POS tags as features that
contribute decisively to predicting whether something is an entity or not. The most
accessible examples are the tags `NNP` and `NNPS`. What do you think is the reason these
tags help the NE extractor find entities?

Let’s take a closer look at the `ne_chunk` function. Chunking means taking the tokens
of a sentence and grouping them together, in chunks. In this case, we grouped the
tokens belonging to the same named entity into a single chunk. The data structure
that facilitates this is `nltk.Tree`. The tokens of a chunk are all children of the same
node. All the other non-entity nodes and the chunk nodes are children of the same
root node: `S`.
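
To show how such a structure is typically walked, the sketch below uses a plain-Python stand-in for the tree: a chunk is a `(label, [children])` pair and a leaf is a `(word, tag)` pair. This representation and the helper names are assumptions made for illustration. With a real `nltk.Tree`, the equivalent checks would be `isinstance(child, nltk.Tree)`, `child.label()` and `child.leaves()`.

```python
# Abridged stand-in for the tree printed above
tree = ("S", [
    ("the", "DT"),
    ("address", "NN"),
    ("that", "IN"),
    ("PERSON", [("Feynman", "NNP")]),
    ("gave", "VBD"),
    ("at", "IN"),
    ("the", "DT"),
    ("ORGANIZATION", [("California", "NNP"), ("Institute", "NNP"),
                      ("Of", "IN"), ("Technology", "NNP")]),
    (".", "."),
])

def is_chunk(node):
    # Chunks carry a list of children; leaves carry a POS tag string
    return isinstance(node[1], list)

def extract_entities(root):
    _, children = root
    entities = []
    for child in children:
        if is_chunk(child):
            label, leaves = child
            entities.append((label, " ".join(word for word, _ in leaves)))
    return entities

entities = extract_entities(tree)
print(entities)
```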