
# Natural Language Toolkit (NLTK)

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

A lot of the data that we could be analyzing is **unstructured data** and contains human-readable text. Before we can analyze that data programmatically, we first need to preprocess it. In this tutorial, we’ll take our first look at the kinds of **text preprocessing** tasks we can do with NLTK so that we’ll be ready to apply them in future projects. We’ll also see how to do some basic text analysis and create visualizations.

By the end of this tutorial, we’ll know how to:

1. Find text to analyze
2. Preprocess our text for analysis
3. Analyze our text
4. Create visualizations based on our analysis

### Tokenization

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
string = """Muad'Dib learned rapidly because his first training was in how to learn.
            And the first lesson of all was the basic trust that he could learn.
            It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."""

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
sent_tokenize(string)

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."]

In [None]:
tokens = word_tokenize(string)
tokens

["Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult',
 '.']

Notice that "It's" was split at the apostrophe into 'It' and "'s", while "Muad'Dib" was left whole. The tokenizer treats "'s" (here a contraction of “is”) as a separate token because "It's" follows a recognized English contraction pattern, so 'It' and "'s" are counted as two words. "Muad'Dib" matches no such pattern, so it wasn’t read as two separate words and was left intact.
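The splitting behaviour can be illustrated without NLTK. The following is a minimal sketch, not NLTK's actual implementation: `CLITIC` and `split_clitics` are hypothetical names, and the short list of endings is a simplifying assumption.

```python
import re

# Simplified assumption: a handful of English clitics, matched only when
# they end the word. This is an illustration, not NLTK's real rule set.
CLITIC = re.compile(r"(?<=\w)('s|'m|'d|'ll|'re|'ve)$", re.IGNORECASE)

def split_clitics(word):
    """Split a trailing clitic such as 's off a single word."""
    return CLITIC.sub(r" \1", word).split()

print(split_clitics("It's"))      # ['It', "'s"]
print(split_clitics("Muad'Dib"))  # ["Muad'Dib"]
```

In "Muad'Dib" the apostrophe is followed by more letters, so no ending in the list matches at the end of the word and the token is returned unchanged.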

### Text Preprocessing

In [None]:
doc = "Sir, I protest. I am not a merry man!"

Removing punctuation:

In [None]:
tokens = [token for token in word_tokenize(doc.lower())
          if token.isalpha()]

In [None]:
print(tokens)

['sir', 'i', 'protest', 'i', 'am', 'not', 'a', 'merry', 'man']
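Filtering with `isalpha()` removes more than punctuation. The sketch below uses a hand-written token list (an assumption for illustration, not real tokenizer output) to show which kinds of tokens are dropped.

```python
# Hand-written tokens (an assumption for illustration, not tokenizer output)
sample_tokens = ["it", "'s", "by-the-bye", "1894", "mars", "."]

# str.isalpha() is True only when every character in the token is a letter
kept = [token for token in sample_tokens if token.isalpha()]
dropped = [token for token in sample_tokens if not token.isalpha()]

print(kept)     # ['it', 'mars']
print(dropped)  # ["'s", 'by-the-bye', '1894', '.']
```

Contraction pieces, hyphenated words, and numbers disappear along with the punctuation, which may or may not be what a given analysis needs.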


Removing stopwords:

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

In [None]:
words = [token for token in tokens if token not in stop_words]
print(words)

['sir', 'protest', 'merry', 'man']


Words like 'I' and 'not' may seem too important to filter out, and depending on what kind of analysis we want to conduct, they can be. Here is why:

* 'I' is a pronoun, which is a context word rather than a content word:

  * **Content words** give us information about the topics covered in the text or the sentiment that the author has about those topics.

  * **Context words** give us information about the writing style. We can observe patterns in how authors use context words in order to quantify their writing style. Once we have quantified their writing style, we can analyze a text written by an unknown author to see how closely it follows a particular writing style so we can try to identify who the author is.

* 'not' is technically an adverb but has still been included in NLTK’s list of stop words for English. If we want to exclude 'not' from the list or make other changes, we can edit our own copy: `stop_words` is an ordinary Python set, so entries can be added or removed before filtering.

So, 'I' and 'not' can be important parts of a sentence/text, but it depends on what we are trying to learn from that sentence.
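Keeping 'not' could look like the sketch below. To stay runnable without the 'stopwords' download, it uses a tiny stand-in set (`demo_stop_words`, an assumption) rather than NLTK's full English list; the same set arithmetic applies to the real `stop_words` set.

```python
# Stand-in stop word set (assumption): a tiny subset so the snippet runs
# without downloading NLTK's 'stopwords' corpus.
demo_stop_words = {"i", "am", "not", "a"}

# Tokens copied from the punctuation-removal step
demo_tokens = ['sir', 'i', 'protest', 'i', 'am', 'not', 'a', 'merry', 'man']

# Build a new set without 'not' so the original set stays unchanged
custom_stop_words = demo_stop_words - {"not"}

kept_words = [token for token in demo_tokens if token not in custom_stop_words]
print(kept_words)  # ['sir', 'protest', 'not', 'merry', 'man']
```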

### Stemming

Stemming is a text processing task in which we reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.”

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
string_for_stemming = """The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do."""

Before we can stem the words in the string, we need to split it into individual words with word_tokenize:

In [None]:
stemmed_words = [stemmer.stem(word) for word in word_tokenize(string_for_stemming)]
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']
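The output above contains stems such as 'discoveri' that are not dictionary words, and closely related forms end up on two different stems. A small sketch that pairs each word with its stem makes this easier to see. It only uses `PorterStemmer`, which needs no extra data download; the `discovery_words` and `stems` names are illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sentence above, listed by hand so no tokenizer is needed
discovery_words = ["Discovery", "discovered", "discoveries", "Discovering"]

# Map each word to the stem the Porter stemmer produces for it
stems = {word: stemmer.stem(word) for word in discovery_words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

"Discovery" and "discoveries" share the stem 'discoveri', while "discovered" and "Discovering" share 'discov', matching the list output above.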

### Tagging Parts of Speech


In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [None]:
# here is a quote from Carl Sagan
quote = """If you wish to make an apple pie from scratch, you must first invent the universe."""

# now let's tag the parts of speech in this text
nltk.pos_tag(word_tokenize(quote), tagset='universal')

[('If', 'ADP'),
 ('you', 'PRON'),
 ('wish', 'VERB'),
 ('to', 'PRT'),
 ('make', 'VERB'),
 ('an', 'DET'),
 ('apple', 'NOUN'),
 ('pie', 'NOUN'),
 ('from', 'ADP'),
 ('scratch', 'NOUN'),
 (',', '.'),
 ('you', 'PRON'),
 ('must', 'VERB'),
 ('first', 'VERB'),
 ('invent', 'VERB'),
 ('the', 'DET'),
 ('universe', 'NOUN'),
 ('.', '.')]
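Once a text is tagged, the pairs can be grouped by tag for a quick overview. This sketch copies the tagged pairs from the output above into a list (named `tagged_quote` here for illustration), so it runs without the tagger.

```python
from collections import defaultdict

# Tagged pairs copied from the output above
tagged_quote = [('If', 'ADP'), ('you', 'PRON'), ('wish', 'VERB'), ('to', 'PRT'),
                ('make', 'VERB'), ('an', 'DET'), ('apple', 'NOUN'), ('pie', 'NOUN'),
                ('from', 'ADP'), ('scratch', 'NOUN'), (',', '.'), ('you', 'PRON'),
                ('must', 'VERB'), ('first', 'VERB'), ('invent', 'VERB'),
                ('the', 'DET'), ('universe', 'NOUN'), ('.', '.')]

# Collect the words that received each tag, in order of appearance
words_by_tag = defaultdict(list)
for word, tag in tagged_quote:
    words_by_tag[tag].append(word)

print(words_by_tag['NOUN'])  # ['apple', 'pie', 'scratch', 'universe']
print(words_by_tag['VERB'])  # ['wish', 'make', 'must', 'first', 'invent']
```

Note that 'first' shows up among the verbs even though it acts as an adverb in this sentence, so automatic tags are worth spot-checking.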

### Lemmatization
Like stemming, lemmatization reduces words to a core form, but unlike stemming it gives us a complete English word that makes sense on its own.

> Note: A lemma is a word that represents a whole group of words, and that group of words is called a lexeme. For example, if we were to look up the word “blending” in a dictionary, then we would need to look at the entry for “blend” and we would find “blending” listed in that entry. In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when we lemmatize a word, we are reducing it to its lemma.

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()

Let's start with lemmatizing a plural noun:

In [None]:
# first install the 'wordnet' corpus
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
lemmatizer.lemmatize("keeps")

'keep'

In [None]:
[lemmatizer.lemmatize(token) for token in word_tokenize("The friends of DeSoto love scarves.")]

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']

That looks right. The plurals 'friends' and 'scarves' became the singulars 'friend' and 'scarf'.

But what would happen if you lemmatized a word that looked very different from its lemma? Try lemmatizing "worst":

In [None]:
lemmatizer.lemmatize("worst")

'worst'

Here we got the result 'worst' because lemmatizer.lemmatize() assumed that "worst" was a noun. We can make it clear that we want "worst" to be an adjective:

In [None]:
lemmatizer.lemmatize("worst", pos="a")

'bad'

The default value of pos is 'n' for noun, but we made sure that "worst" was treated as an adjective by passing the argument **pos="a"**. As a result, we got 'bad', which looks very different from the original word and is nothing like what stemming would produce. This is because "worst" is the superlative form of the adjective 'bad', and lemmatizing reduces superlatives as well as comparatives to their lemmas.

Now that we know how to use NLTK to tag parts of speech, we can try tagging words before lemmatizing them to avoid mixing up **homographs**, or words that are spelled the same but have different meanings and can be different parts of speech.
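One way to connect the two steps is a small helper that converts a Penn Treebank tag (the default tag style returned by `nltk.pos_tag`) into the single-letter `pos` value that `lemmatize()` accepts. This is a sketch under assumptions: `penn_to_wordnet` is a hypothetical helper, not part of NLTK, and the tags in `tagged_words` are written by hand rather than produced by the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a pos code accepted by WordNetLemmatizer."""
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else falls back to noun, the lemmatizer's default

# Hand-written (word, tag) pairs: an assumption, not real tagger output
tagged_words = [("worst", "JJS"), ("keeps", "VBZ"), ("scarves", "NNS")]

pos_pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged_words]
print(pos_pairs)  # [('worst', 'a'), ('keeps', 'v'), ('scarves', 'n')]
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=pos)`, so that "worst" is looked up as an adjective instead of a noun.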

### Chunking and Chinking

"Chunking involves grouping words into meaningful chunks based on their part-of-speech tags."

> Purpose: It helps in identifying and extracting phrases like noun phrases, verb phrases, prepositional phrases, etc., which provide more context and meaning to the text.
>
> Technique: Chunking is typically performed using regular expressions or rule-based approaches to define patterns for identifying and extracting chunks.
>
> Example: Identifying noun phrases like "the quick brown fox" or verb phrases like "jumps over" in a sentence.

Chunking works on part-of-speech tags, so the text must be tagged before it can be chunked. The code cell later in this section tags a quote for that purpose.

"Chinking is the process of excluding certain parts from a chunk that match specific patterns while chunking."

> Purpose: It allows for refining the chunking process by excluding certain words or patterns that should not be included in a chunk.
>
> Technique: Chinking involves specifying patterns that should be excluded from a chunk while defining the chunking rules.
>
> Example: Excluding determiners or adjectives from a noun phrase chunk to create more precise chunks.

**Example:**

Consider the sentence: "The quick brown fox jumps over the lazy dog."
> Chunking Output:
* Noun Phrases: "The quick brown fox", "the lazy dog"
* Verb Phrases: "jumps over"

> Chinking Output:
* Noun Phrases: "quick brown fox", "lazy dog"
* Verb Phrases: "jumps"

By combining chunking and chinking techniques, NLP systems can extract structured information from text data, enabling more advanced analysis such as information extraction, named entity recognition, and relationship extraction. These techniques play a crucial role in syntactic analysis and text processing tasks in NLP.

In [None]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."
tags = nltk.pos_tag(word_tokenize(lotr_quote))
tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]
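With the tags in hand, chunking and chinking rules can be tried out with `nltk.RegexpParser`. The sketch below is one possible pair of grammars, not the only way to define them. The tags are copied from the output above, so no tagger data is needed, and `chunk_texts` is a hypothetical helper for reading the results.

```python
import nltk

# POS-tagged tokens copied from the output above
lotr_tags = [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'),
             ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','),
             ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
             ('.', '.')]

# Chunking: an optional determiner, any number of adjectives, then a noun
chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
chunk_tree = chunk_parser.parse(lotr_tags)

# Chinking: chunk everything with {...}, then carve adjectives out with }...{
chink_grammar = """
Chunk: {<.*>+}
       }<JJ>{
"""
chink_tree = nltk.RegexpParser(chink_grammar).parse(lotr_tags)

def chunk_texts(tree, label):
    """Return the words of every subtree that carries the given label."""
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == label)]

print(chunk_texts(chunk_tree, "NP"))
print(chunk_texts(chink_tree, "Chunk"))
```

The first grammar picks out noun phrases such as "a dangerous business". The second starts from one all-inclusive chunk and splits it wherever an adjective occurs, leaving 'dangerous' outside both resulting chunks.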

### Named Entity Recognition (NER)

Named entities are noun phrases that refer to specific locations, people, organizations, and so on. With named entity recognition, we can find the named entities in our texts and also determine what kind of named entity they are.

In [None]:
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [None]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [None]:
!pip install svgling

Collecting svgling
  Downloading svgling-0.4.0-py3-none-any.whl (23 kB)
Collecting svgwrite (from svgling)
  Downloading svgwrite-1.4.3-py3-none-any.whl (67 kB)
Installing collected packages: svgwrite, svgling
Successfully installed svgling-0.4.0 svgwrite-1.4.3


**Example 01:**

In [None]:
# example text
text = 'In New York, I like to ride the Metro to visit MOMA'

# tokenizing and pos_tagging
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# NER
named_entities = nltk.ne_chunk(tagged)

# a defaultdict(int) starts every new key at 0, so labels can be counted directly
from collections import defaultdict
ne_categories = defaultdict(int)

# count chunk types
for chunk in named_entities:
  if hasattr(chunk, 'label'):
    ne_categories[chunk.label()] += 1

print(ne_categories)

defaultdict(<class 'int'>, {'GPE': 1, 'ORGANIZATION': 2})
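The `hasattr(chunk, 'label')` check works because `ne_chunk` returns a tree whose children are either plain `(word, tag)` tuples or labeled subtrees. The sketch below builds a small tree by hand (an assumption for illustration, not real `ne_chunk` output) to show the difference; it needs no model download.

```python
from collections import Counter
from nltk import Tree

# Hand-built tree (an assumption for illustration, not real ne_chunk output)
demo_tree = Tree('S', [
    ('In', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    (',', ','),
    ('I', 'PRP'),
    ('like', 'VBP'),
])

label_counts = Counter()
entities = []
for node in demo_tree:
    # Entity chunks are Tree objects with a label; other tokens are plain tuples
    if isinstance(node, Tree):
        label_counts[node.label()] += 1
        entities.append(" ".join(word for word, tag in node))

print(label_counts)  # Counter({'GPE': 1})
print(entities)      # ['New York']
```

Checking `isinstance(node, Tree)` is equivalent in spirit to the `hasattr` test used above, and joining the words of each subtree recovers the entity text as well as its label.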


**Example 02:**

In [None]:
quote = """Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that for countless centuries Mars has been the star of war—but failed to interpret the fluctuating appearances of the markings they mapped so well. All that time the Martians must have been getting ready. During the opposition of 1894 a great light was seen on the illuminated part of the disk, first at the Lick Observatory, then by Perrotin of Nice, and then by other observers. English readers heard of it first in the issue of Nature dated August 2."""

In [None]:
def extract_ne(quote):
  words = word_tokenize(quote)
  tags = nltk.pos_tag(words)
  # binary=True labels every entity "NE" instead of a specific type
  tree = nltk.ne_chunk(tags, binary=True)
  return set(
    " ".join(word for word, tag in ne)
    for ne in tree
    if hasattr(ne, "label") and ne.label() == "NE"
  )

In [None]:
extract_ne(quote)

{'Lick Observatory', 'Mars', 'Nature', 'Perrotin', 'Schiaparelli'}

**Example 03:**

In [None]:
# Define the text to be analyzed
text = "European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market"

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Apply part-of-speech tagging to the tokens
tagged = nltk.pos_tag(tokens)

# Apply named entity recognition to the tagged words
ne_tree = nltk.ne_chunk(tagged, binary=True)

# Print the named entities
for chunk in ne_tree:
    if hasattr(chunk, 'label'):
        print(chunk.label(), '-->', ' '.join(word for word, tag in chunk))

NE --> European
