# Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](https://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active [discussion forum.](https://groups.google.com/group/nltk-users)

Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

[Natural Language Processing with Python](https://www.nltk.org/book/) provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been updated for Python 3 and NLTK 3. (The original Python 2 version is still available at https://www.nltk.org/book_1ed.)

**credit:** 
* https://realpython.com/nltk-nlp-python/
* https://pythonspot.com/

## Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what both types of tokenization bring to the table:

#### Tokenizing by word: 
Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you’d need to look deeper to know more.
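
As a rough illustration of the job-ad idea above, word counts can be gathered once a text has been split into words. This sketch uses a made-up `job_ads` string and a plain regular expression as a stand-in for a real word tokenizer, so the splitting rule here is an assumption and not NLTK's behavior.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration only.
job_ads = (
    "Senior Python developer wanted. "
    "Data analyst with Python and SQL experience. "
    "Java engineer, some Python a plus."
)

# Simple stand-in tokenizer: runs of letters, lowercased so that
# "Python" and "python" would be counted together.
words = re.findall(r"[a-z]+", job_ads.lower())

word_counts = Counter(words)
print(word_counts.most_common(1))  # [('python', 3)]
```

A high count only tells you that the word is frequent; it says nothing yet about the context in which it appears.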

#### Tokenizing by sentence: 
When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?

In [1]:
# import the nltk library
import nltk

#### NLTK Tokenizer Package
**Tokenizers** divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [3]:
# string to tokenize: one multi-line string containing three sentences
# (the leading "..." on each line are literal characters and will appear as tokens)
str_2_tokenize = """
... Muad'Dib learned rapidly because his first training was in how to learn.
... And the first lesson of all was the basic trust that he could learn.
... It's shocking to find how many people do not believe they can learn, 
... and how many more believe learning to be difficult."""

## nltk.tokenize.punkt module
#### Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.
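
To see sentence splitting in isolation, one option is to instantiate `PunktSentenceTokenizer` directly. The sketch below uses an untrained instance, which is an assumption made here to avoid downloading anything: it has learned no abbreviations or collocations, so it is likely a much weaker splitter than the pre-trained English model that `sent_tokenize` loads.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = (
    "Muad'Dib learned rapidly because his first training was in how to learn. "
    "And the first lesson of all was the basic trust that he could learn. "
    "It's shocking to find how many people do not believe they can learn, "
    "and how many more believe learning to be difficult."
)

# Untrained tokenizer: no abbreviation or collocation statistics yet.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

# Print each detected sentence with its position.
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

On text with abbreviations such as "Dr." or "e.g.", an untrained instance would probably split too eagerly, which is the problem the training step is meant to address.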

In [4]:
nltk.download('punkt')
word_tokenize(str_2_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['...',
 "Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 '...',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 '...',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 '...',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult',
 '.']

***See how "It's" was split at the apostrophe to give you 'It' and "'s"***
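
Different tokenizers handle contractions differently. As a small comparison, this sketch runs two rule-based tokenizers from `nltk.tokenize` that need no downloaded data. `word_tokenize` itself is not called here; `TreebankWordTokenizer` is used only as an approximation of its Treebank-style rules.

```python
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

phrase = "It's shocking"

# Treebank-style rules keep the clitic "'s" together as one token.
treebank_tokens = TreebankWordTokenizer().tokenize(phrase)
print(treebank_tokens)   # ['It', "'s", 'shocking']

# wordpunct_tokenize splits off every run of punctuation instead.
wordpunct_tokens = wordpunct_tokenize(phrase)
print(wordpunct_tokens)  # ['It', "'", 's', 'shocking']
```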

## Filtering Stop Words
**Stop words** are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [5]:
# download stopwords from the NLTK library.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

NLTK starts you off with a list of words that it considers to be stop words. You can access the list via the NLTK corpus module:

In [6]:
from nltk.corpus import stopwords

The stop words that NLTK offers for English can be used like this:

In [7]:
stop_words = set(stopwords.words("english"))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [8]:
worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)
words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

In [9]:
# create an empty list to hold the words that make it past the filter
filtered_list = []

In [10]:
# hold all the words in words_in_quote that aren’t stop words
# for word in words_in_quote:
#     if word.casefold() not in stop_words:
#         filtered_list.append(word)
# filtered_list

In [11]:
# a list comprehension builds the filtered list in a single expression
filtered_list = [word for word in words_in_quote if word.casefold() not in stop_words]
filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']
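
One side effect visible in the output above is that 'not' was filtered out, which flips the meaning of the quote. If negation matters for your analysis, you could remove such words from the stop word set before filtering. The sketch below hard-codes the tokens from the earlier output and uses a tiny hand-written `sample_stop_words` set as a stand-in for NLTK's English list (an assumption, so that it runs without any downloads).

```python
# Tokens copied from the word_tokenize output shown above.
words_in_quote = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

# Tiny hand-written stand-in for NLTK's English stop word list (assumption).
sample_stop_words = {"i", "am", "not", "a"}

# Keep negation by removing "not" from the set before filtering.
custom_stop_words = sample_stop_words - {"not"}

kept_words = [
    word for word in words_in_quote
    if word.casefold() not in custom_stop_words
]
print(kept_words)
# ['Sir', ',', 'protest', '.', 'not', 'merry', 'man', '!']
```

The same set difference should work on the real `stop_words` set, since it is an ordinary Python `set`.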

## Stemming
**Stemming** is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has [more than one stemmer](https://www.nltk.org/howto/stem.html), but you’ll be using the Porter stemmer.

#### PorterStemmer

In [12]:
from nltk.stem import PorterStemmer
p_stremmer = PorterStemmer()

In [13]:
str_4_stemming = """
... The crew of the USS Discovery discovered many discoveries.
... Discovering is what explorers do."""

Before you can stem the words in that string, you need to separate all the words in it.

In [14]:
t_words = word_tokenize(str_4_stemming)
print(t_words) # print as a Python list
print(' '.join(t_words)) # print in a single line

['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']
The crew of the USS Discovery discovered many discoveries . Discovering is what explorers do .


In [15]:
# equivalent explicit loop:
# pstem_words = []
# for word in t_words:
#     pstem_words.append(p_stremmer.stem(word))

pstem_words = [p_stremmer.stem(word) for word in t_words]
print(' '.join(pstem_words))

the crew of the uss discoveri discov mani discoveri . discov is what explor do .


#### Understemming and overstemming are two ways stemming can go wrong:

**Understemming** happens when two related words should be reduced to the same stem but aren’t. This is a [false negative.](https://en.wikipedia.org/wiki/False_positives_and_false_negatives#False_negative_error)<br>
**Overstemming** happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a [false positive.](https://en.wikipedia.org/wiki/False_positives_and_false_negatives#False_positive_error)
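
Both failure modes can be probed with a few words. In this sketch the first pair comes from the sentence stemmed above, while the second group is a commonly cited example chosen here for illustration; whether a given result counts as an error depends on what you want to treat as "related".

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Possible understemming: related words that end up with different stems.
related = ["discovery", "discovered"]
related_stems = [stemmer.stem(word) for word in related]
print(related_stems)    # ['discoveri', 'discov']

# Possible overstemming: words with distinct meanings that share one stem.
distinct = ["universe", "university", "universal"]
distinct_stems = [stemmer.stem(word) for word in distinct]
print(distinct_stems)
```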

#### SnowballStemmer

In [16]:
from nltk.stem import SnowballStemmer
sb_stemmer = SnowballStemmer('english', ignore_stopwords=True)

In [17]:
sbstem_words = [sb_stemmer.stem(word) for word in t_words]
print(' '.join(sbstem_words))

the crew of the uss discoveri discov mani discoveri . discov is what explor do .
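
The two stemmers agree on this sentence, but they implement different algorithms and can diverge on other words. A small sketch, using a word chosen here for illustration (it does not appear in the text above) and leaving `ignore_stopwords` at its default so that no stop word data is needed:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

word = "generously"
porter_stem = porter.stem(word)
snowball_stem = snowball.stem(word)

# The Snowball English stemmer tends to be less aggressive here.
print(porter_stem, snowball_stem)
```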


## Tagging Parts of Speech
Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. **Tagging parts of speech, or POS tagging**, is the task of labeling the words in your text according to their part of speech.

In English, there are eight parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection.

## Averaged Perceptron Tagger
The `averaged_perceptron_tagger` package contains the pre-trained English [part-of-speech (POS)](https://en.wikipedia.org/wiki/Part_of_speech) tagger used by NLTK.

In [18]:
sagan_quote = """
... If you wish to make an apple pie from scratch,
... you must first invent the universe."""
txt_4_pos = word_tokenize(sagan_quote)

nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(txt_4_pos)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]
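
Because `nltk.pos_tag` returns plain `(word, tag)` tuples, ordinary Python is enough to work with the result. The sketch below copies the pairs from the output above (so it needs no tagger model) and groups or filters them by tag; the variable names are chosen here for illustration.

```python
from collections import defaultdict

# (word, tag) pairs copied from the tagger output above.
tagged_words = [
    ('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'),
    ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'),
    ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'),
    ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'),
    ('universe', 'NN'), ('.', '.'),
]

# Group the words by their tag.
words_by_tag = defaultdict(list)
for word, tag in tagged_words:
    words_by_tag[tag].append(word)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged_words if tag.startswith("NN")]
print(nouns)  # ['apple', 'pie', 'scratch', 'universe']
```

Grouping also makes questionable tags easy to spot: `words_by_tag['VB']` contains 'first', which is not acting as a verb in this sentence.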

## English part-of-speech tagsets
A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense etc.) of each token in a text corpus.
When creating user corpora, the recommended tagset is always preselected. Using a different tagset is recommended only for advanced users. Tagsets cannot normally be changed for preloaded corpora.

In [19]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [20]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [23]:
jabberwocky_excerpt = """
... 'T was brillig, and the slithy toves did gyre and gimble in the wabe:
... all mimsy were the borogoves, and the mome raths outgrabe."""

In [24]:
txt_4_pos_eng = word_tokenize(jabberwocky_excerpt)
nltk.pos_tag(txt_4_pos_eng)

[("'T", 'NN'),
 ('was', 'VBD'),
 ('brillig', 'VBN'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('slithy', 'JJ'),
 ('toves', 'NNS'),
 ('did', 'VBD'),
 ('gyre', 'NN'),
 ('and', 'CC'),
 ('gimble', 'JJ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('wabe', 'NN'),
 (':', ':'),
 ('all', 'DT'),
 ('mimsy', 'NNS'),
 ('were', 'VBD'),
 ('the', 'DT'),
 ('borogoves', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('mome', 'JJ'),
 ('raths', 'NNS'),
 ('outgrabe', 'RB'),
 ('.', '.')]

## Lemmatizing
Now that you’re up to speed on parts of speech, you can circle back to lemmatizing. Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

In [36]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

In [40]:
str_4_lemmatizing = "The friends of DeSoto love scarves. Fungi, foci, cacti, tomatoes"

In [42]:
tokenize_word_4_lemma = word_tokenize(str_4_lemmatizing)
lemma_word = [lemmatizer.lemmatize(word) for word in tokenize_word_4_lemma]
print(lemma_word)

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.', 'Fungi', ',', 'focus', ',', 'cactus', ',', 'tomato']


In [44]:
lemmatizer.lemmatize('wives')

'wife'
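
By default, `lemmatize` treats every word as a noun. It also accepts a `pos` argument ('n', 'v', 'a', or 'r'), which is where POS tagging ties in. The helper below is a hypothetical sketch, not part of NLTK: it maps Penn Treebank tags to those single-letter codes in plain Python, so it runs without the WordNet data.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a single-letter WordNet-style POS code."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Tagged pairs copied from the Sagan quote output earlier in this document.
tagged = [("wish", "VBP"), ("apple", "NN"), ("invent", "VB")]
pos_codes = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_codes)  # [('wish', 'v'), ('apple', 'n'), ('invent', 'v')]

# With the WordNet data installed, the code could then be passed along as:
#     lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))
```

Falling back to 'n' mirrors the lemmatizer's own default, so words with unmapped tags are handled as they would be without the helper.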