## Exploring Natural Language Toolkit (NLTK)

### Based on
- https://realpython.com/nltk-nlp-python/
- https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
- https://www.guru99.com/nltk-tutorial.html


In [3]:
from nltk import download

download('popular')
download('tagsets')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /home/lay/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/lay/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /home/lay/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/lay/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/lay/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/lay/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /home/lay/nltk_data...
[nltk_data]    |   Unzip

True

## POS-Tags

Abbreviation | Meaning
--- | ---:
CC | coordinating conjunction
CD | cardinal number
DT | determiner
EX | existential there
FW | foreign word
IN | preposition/subordinating conjunction
JJ | adjective (large)
JJR | adjective, comparative (larger)
JJS | adjective, superlative (largest)
LS | list item marker
MD | modal (could, will)
NN | noun, singular (cat, tree)
NNS | noun plural (desks)
NNP | proper noun, singular (Sarah)
NNPS | proper noun, plural (Indians, Americans)
PDT | predeterminer (all, both, half)
POS | possessive ending (parent's)
PRP | personal pronoun (hers, herself, him, himself)
PRP$ | possessive pronoun (her, his, mine, my, our)
RB | adverb (occasionally, swiftly)
RBR | adverb, comparative (greater)
RBS | adverb, superlative (biggest)
RP | particle (about)
TO | infinitive marker (to)
UH | interjection (goodbye)
VB | verb (ask)
VBG | verb gerund (judging)
VBD | verb past tense (pleaded)
VBN | verb past participle (reunified)
VBP | verb, present tense, not 3rd person singular (wrap)
VBZ | verb, present tense, 3rd person singular (bases)
WDT | wh-determiner (that, what)
WP | wh-pronoun (who)
WRB | wh-adverb (how)

In [14]:
from nltk.help import upenn_tagset

upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### Tokenization

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize


text = "Hello, world! It's important to always start with it."

print(sent_tokenize(text))
print(word_tokenize(text))

['Hello, world!', "It's important to always start with it."]
['Hello', ',', 'world', '!', 'It', "'s", 'important', 'to', 'always', 'start', 'with', 'it', '.']


In [2]:
example_string = (
    "Muad'Dib learned rapidly because his first training was in how to learn. "
    'And the first lesson of all was the basic trust that he could learn. '
    "It's shocking to find how many people do not believe they can learn, "
    'and how many more believe learning to be difficult.'
)

sent_tokenize(example_string)

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."]

In [4]:
print(word_tokenize(example_string))

["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']
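As the outputs above show, `word_tokenize` keeps punctuation as separate tokens and splits the contraction "It's" into `'It'` and `"'s"`. When that is not wanted, a regular-expression tokenizer is one possible alternative. The sketch below uses NLTK's `RegexpTokenizer`; the pattern `[\w']+` is an assumption chosen for illustration and does not come from the tutorials listed above.

```python
from nltk.tokenize import RegexpTokenizer

text = "Hello, world! It's important to always start with it."

# Keep runs of word characters and apostrophes; everything else is dropped.
# The pattern is an illustrative choice, not a recommended default.
tokenizer = RegexpTokenizer(r"[\w']+")
regexp_tokens = tokenizer.tokenize(text)
print(regexp_tokens)
# ['Hello', 'world', "It's", 'important', 'to', 'always', 'start', 'with', 'it']
```

Unlike `word_tokenize`, this tokenizer is purely pattern based and needs no downloaded models, but it also knows nothing about sentence boundaries or abbreviations.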


### Stopwords filtering

In [5]:
from nltk.corpus import stopwords

worf_quote = 'Sir, I protest. I am not a merry man!'
words_in_quote = word_tokenize(worf_quote)
print(words_in_quote)

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']


In [6]:
stop_words = set(stopwords.words("english"))

In [7]:
filtered_list = [
    word for word in words_in_quote if word.casefold() not in stop_words
]
filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']
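The filtered list still contains punctuation tokens, because punctuation marks are not stop words. One way to drop them as well is to keep only alphabetic tokens with `str.isalpha()`. In the sketch below, `demo_stop_words` is a small hard-coded stand-in (an assumption, so the cell runs without the corpus); in this notebook you would use `stop_words` from the cell above instead.

```python
# Tokens as produced by word_tokenize(worf_quote) above.
words_in_quote = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

# Stand-in for set(stopwords.words("english")): only the entries this quote needs.
demo_stop_words = {'i', 'am', 'not', 'a'}

# Keep a token only if it is purely alphabetic and not a stop word.
content_words = [
    word for word in words_in_quote
    if word.isalpha() and word.casefold() not in demo_stop_words
]
print(content_words)
# ['Sir', 'protest', 'merry', 'man']
```

Note that `str.isalpha()` also discards numbers and tokens such as `"'s"`, which may or may not be what a given analysis needs.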


### Stemming

Stemming reduces a word to its root form (its stem). The stem is not necessarily a dictionary word, as the output below shows.

In [8]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

string_for_stemming = (
    'The crew of the USS Discovery discovered many discoveries. '
    'Discovering is what explorers do.'
)

In [9]:
words = word_tokenize(string_for_stemming)
print(words)

['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']


In [10]:
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
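The output shows that the Porter stemmer maps the four "discover" words to two different stems, `'discoveri'` and `'discov'`, so related words do not always end up together. A small sketch that groups the words by stem makes this easier to see; the variable names `discovery_words` and `words_by_stem` are introduced here only for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
discovery_words = ['Discovery', 'discovered', 'discoveries', 'Discovering']

# Map each stem to the original words that produced it.
words_by_stem = defaultdict(list)
for word in discovery_words:
    words_by_stem[stemmer.stem(word)].append(word)

for stem, originals in words_by_stem.items():
    print(stem, originals)
# discoveri ['Discovery', 'discoveries']
# discov ['discovered', 'Discovering']
```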


### Parts of Speech

In [11]:
from nltk import pos_tag

sagan_quote = (
    'If you wish to make an apple pie from scratch, '
    'you must first invent the universe.'
)

words_in_sagan_quote = word_tokenize(sagan_quote)

In [13]:
print(pos_tag(words_in_sagan_quote))

[('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'), ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'), ('universe', 'NN'), ('.', '.')]
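Because `pos_tag` returns `(word, tag)` tuples, the result can be filtered by tag prefix: all noun tags in the table above start with `NN` and all verb tags with `VB`. The sketch below copies the tagged output from the previous cell into a literal so that it runs without the tagger model; the names `tagged_sagan_quote`, `nouns`, and `verbs` are introduced here for illustration.

```python
# Output of pos_tag(words_in_sagan_quote), copied from the cell above.
tagged_sagan_quote = [
    ('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'),
    ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'),
    ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'),
    ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'),
    ('universe', 'NN'), ('.', '.'),
]

# NN, NNS, NNP, NNPS all start with "NN"; VB, VBD, VBG, ... start with "VB".
nouns = [word for word, tag in tagged_sagan_quote if tag.startswith('NN')]
verbs = [word for word, tag in tagged_sagan_quote if tag.startswith('VB')]

print(nouns)  # ['apple', 'pie', 'scratch', 'universe']
print(verbs)  # ['wish', 'make', 'first', 'invent']
```

The verb list includes `'first'`, which the tagger labelled `VB` even though it is used as an adverb here, so filtered results are only as good as the tags they are based on.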


In [15]:
jabberwocky_excerpt = (
    "'Twas brillig, and the slithy toves did gyre and gimble in the wabe: "
    'all mimsy were the borogoves, and the mome raths outgrabe.'
)
words_in_excerpt = word_tokenize(jabberwocky_excerpt)

In [16]:
print(pos_tag(words_in_excerpt))

[("'T", 'NN'), ('was', 'VBD'), ('brillig', 'VBN'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('slithy', 'JJ'), ('toves', 'NNS'), ('did', 'VBD'), ('gyre', 'NN'), ('and', 'CC'), ('gimble', 'JJ'), ('in', 'IN'), ('the', 'DT'), ('wabe', 'NN'), (':', ':'), ('all', 'DT'), ('mimsy', 'NNS'), ('were', 'VBD'), ('the', 'DT'), ('borogoves', 'NNS'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('mome', 'JJ'), ('raths', 'NNS'), ('outgrabe', 'RB'), ('.', '.')]
