# Natural Language Processing (NLP)

- **Basic steps of an NLP pipeline, using the Natural Language Toolkit (NLTK):**

> 1. Tokenization.
> 2. Stop Word Exclusion.
> 3. Stemming / Lemmatization.
> 4. POS Tagging (Part-of-Speech)
> 5. Chunking (using Regular Expressions & `RegexpParser`)
> 6. NER (Named Entity Recognition): grouping tokens into named entities with `ne_chunk()`




---

## NLTK Installation:

`pip install nltk`

---

## Importing Necessary Dependencies:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

# Natural Language Toolkit
import nltk
# nltk.download('package_name') downloads NLTK data packages (corpora, models)

plt.style.use("ggplot")

---

## 1. Tokenization:

In [2]:
# Note: recent NLTK releases may instead require nltk.download('punkt_tab')
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
testText_01 = 'I bought these for my husband and he said they are the best energy shots out there. He takes one in the mornings and works hard all day. Good stuff!'


from nltk.tokenize import word_tokenize
tokens = word_tokenize(testText_01)
print(tokens)

['I', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energy', 'shots', 'out', 'there', '.', 'He', 'takes', 'one', 'in', 'the', 'mornings', 'and', 'works', 'hard', 'all', 'day', '.', 'Good', 'stuff', '!']
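
To get a feel for what tokenization does, here is a minimal sketch using only Python's `re` module. It is a simplification and not how `word_tokenize` works internally; the function name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "He takes one in the mornings and works hard all day. Good stuff!"
simple_tokens = simple_tokenize(sample)
print(simple_tokens)
```

On plain sentences like this one the result lines up with `word_tokenize`, but the two diverge on contractions: the regex splits `don't` into `don`, `'`, `t`, whereas `word_tokenize` yields `do` and `n't`.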


---

## 2. Stop Word Exclusion: 
- Stop words are words that add little meaning to a sentence, e.g. 'a', 'an', 'the', 'and'.
- NLTK's `stopwords` corpus provides a ready-made list for this step.

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# print(stop_words)

filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print("Previous: \n", tokens)
print("\nFiltered: \n", filtered_tokens)

Previous: 
 ['I', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energy', 'shots', 'out', 'there', '.', 'He', 'takes', 'one', 'in', 'the', 'mornings', 'and', 'works', 'hard', 'all', 'day', '.', 'Good', 'stuff', '!']

Filtered: 
 ['bought', 'husband', 'said', 'best', 'energy', 'shots', '.', 'takes', 'one', 'mornings', 'works', 'hard', 'day', '.', 'Good', 'stuff', '!']
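
The filtered list above still contains punctuation tokens such as `.` and `!`. If those are not needed downstream, one option is to keep only alphabetic tokens. A minimal sketch, assuming a small hand-written stop word set (`demo_stop_words` is a hypothetical stand-in for the NLTK list, so the snippet runs without any downloads):

```python
demo_tokens = ['He', 'takes', 'one', 'in', 'the', 'mornings', 'and', 'works',
               'hard', 'all', 'day', '.', 'Good', 'stuff', '!']

# Hypothetical mini stop word set; in the notebook this would be
# set(stopwords.words('english'))
demo_stop_words = {'he', 'in', 'the', 'and', 'all'}

clean_tokens = [
    token for token in demo_tokens
    # isalpha() drops punctuation; lower() makes the stop word check case-insensitive
    if token.isalpha() and token.lower() not in demo_stop_words
]
print(clean_tokens)
```

Note that `str.isalpha()` also discards numbers and tokens containing apostrophes, so whether this filter is appropriate depends on the task.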


---

## 3. Stemming / Lemmatization:
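> **Stemming:** Strips suffixes using heuristic rules. It is fast, which suits large datasets, but the result need not be a real word (e.g. `energy -> energi` in the output below), and crude suffix stripping can even change the meaning (e.g. `Caring -> Car`).

> **Lemmatization:** Converts a word to its meaningful dictionary base form (lemma) by consulting a vocabulary such as WordNet, e.g. `Caring -> Care`. It is usually more accurate than stemming but computationally more expensive.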

In [6]:
# Package download
nltk.download('wordnet')
# nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

from nltk.stem import WordNetLemmatizer
print(tokens)

['I', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energy', 'shots', 'out', 'there', '.', 'He', 'takes', 'one', 'in', 'the', 'mornings', 'and', 'works', 'hard', 'all', 'day', '.', 'Good', 'stuff', '!']


In [8]:
# Stemming:

porter_stemmer = PorterStemmer()
stemmed_tokens_porter = [porter_stemmer.stem(token) for token in tokens]
print('Porter Stemming:\n \t',stemmed_tokens_porter)

snowball_stemmer = SnowballStemmer('english')
stemmed_tokens_snowball = [snowball_stemmer.stem(token) for token in tokens]
print('\n\nSnowball Stemming:\n \t', stemmed_tokens_snowball)

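# Lemmatization (left commented out, so no lemmatized output appears below;
# uncomment to run, requires the 'wordnet' data):
# lemmatizer = WordNetLemmatizer()
# lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
# print('\n\nLemmatization:\n \t', lemmatized_tokens)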


Porter Stemming:
 	 ['i', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energi', 'shot', 'out', 'there', '.', 'he', 'take', 'one', 'in', 'the', 'morn', 'and', 'work', 'hard', 'all', 'day', '.', 'good', 'stuff', '!']


Snowball Stemming:
 	 ['i', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energi', 'shot', 'out', 'there', '.', 'he', 'take', 'one', 'in', 'the', 'morn', 'and', 'work', 'hard', 'all', 'day', '.', 'good', 'stuff', '!']
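
The difference between the two approaches can be illustrated with a toy sketch. Neither function below reflects how `PorterStemmer` or `WordNetLemmatizer` is implemented; `naive_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` table are hypothetical names invented for this example.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer: chop the first matching suffix, with no linguistic checks."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma table (hypothetical); a real lemmatizer consults a resource such as WordNet
TOY_LEMMAS = {"caring": "care", "takes": "take", "bought": "buy", "mornings": "morning"}

def toy_lemmatize(word):
    """Toy lemmatizer: look the word up, falling back to the lowercased word."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

for w in ["Caring", "takes", "bought", "mornings"]:
    print(f"{w:10} stem={naive_stem(w):8} lemma={toy_lemmatize(w)}")
```

Rule-based stripping turns `Caring` into `car` and cannot handle the irregular form `bought`, while the lookup returns `care` and `buy` at the cost of needing a dictionary.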


---

## 4. POS Tagging (Part-Of-Speech): 

In [9]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [10]:
taggings = nltk.pos_tag(tokens)
print(taggings)
taggings[:5]

[('I', 'PRP'), ('bought', 'VBD'), ('these', 'DT'), ('for', 'IN'), ('my', 'PRP$'), ('husband', 'NN'), ('and', 'CC'), ('he', 'PRP'), ('said', 'VBD'), ('they', 'PRP'), ('are', 'VBP'), ('the', 'DT'), ('best', 'JJS'), ('energy', 'NN'), ('shots', 'NNS'), ('out', 'RP'), ('there', 'RB'), ('.', '.'), ('He', 'PRP'), ('takes', 'VBZ'), ('one', 'CD'), ('in', 'IN'), ('the', 'DT'), ('mornings', 'NNS'), ('and', 'CC'), ('works', 'VBZ'), ('hard', 'JJ'), ('all', 'DT'), ('day', 'NN'), ('.', '.'), ('Good', 'JJ'), ('stuff', 'NN'), ('!', '.')]


[('I', 'PRP'),
 ('bought', 'VBD'),
 ('these', 'DT'),
 ('for', 'IN'),
 ('my', 'PRP$')]
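
The tags follow the Penn Treebank tag set (for example `NN` for a singular noun, `VBD` for a past-tense verb). As a small sketch of how the `(word, tag)` pairs can be used, the snippet below pulls out the nouns and counts tag frequencies; `demo_taggings` is a hand-copied slice of the output above, so no tagger is needed to run it.

```python
from collections import Counter

demo_taggings = [('He', 'PRP'), ('takes', 'VBZ'), ('one', 'CD'), ('in', 'IN'),
                 ('the', 'DT'), ('mornings', 'NNS'), ('and', 'CC'), ('works', 'VBZ'),
                 ('hard', 'JJ'), ('all', 'DT'), ('day', 'NN'), ('.', '.')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in demo_taggings if tag.startswith('NN')]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in demo_taggings)

print(nouns)
print(tag_counts.most_common(2))
```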

---

## 5. Chunking:
- Chunking groups words into `chunks` (phrases) based on their part-of-speech tags.
- It can be performed with regular expressions over the tags, using the `RegexpParser` class.

In [11]:
# The 'words' corpus is needed by ne_chunk() in section 6; RegexpParser itself needs no extra data
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [12]:
taggings[:5]

[('I', 'PRP'),
 ('bought', 'VBD'),
 ('these', 'DT'),
 ('for', 'IN'),
 ('my', 'PRP$')]

In [13]:
chunk_parser = nltk.RegexpParser(r"""
    NP: {<DT>?<JJ>*<NN>} # chunk determiner/adj+noun
    PP: {<IN><NP>} # chunk preposition+NP
    VP: {<VB.*><NP|PP|CLAUSE>+$} # chunk verbs and their arguments
    CLAUSE: {<NP><VP>} # chunk NP, VP
""")

# groups:
chunks = chunk_parser.parse(taggings)
#print(chunks)
chunks.pprint()

(S
  I/PRP
  bought/VBD
  these/DT
  for/IN
  my/PRP$
  (NP husband/NN)
  and/CC
  he/PRP
  said/VBD
  they/PRP
  are/VBP
  the/DT
  best/JJS
  (NP energy/NN)
  shots/NNS
  out/RP
  there/RB
  ./.
  He/PRP
  takes/VBZ
  one/CD
  in/IN
  the/DT
  mornings/NNS
  and/CC
  works/VBZ
  hard/JJ
  (NP all/DT day/NN)
  ./.
  (NP Good/JJ stuff/NN)
  !/.)
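
In the tree above, `energy/NN` is chunked on its own while `the/DT`, `best/JJS`, and `shots/NNS` stay outside: the `NP` rule accepts only `<JJ>` adjectives and must end in `<NN>`, so superlatives and plural nouns do not match. The sketch below imitates the idea of matching a regular expression over the tag sequence using plain `re`. It is a simplified illustration, not NLTK's implementation, and `regex_chunks` is a hypothetical helper.

```python
import re

def regex_chunks(tagged, pattern):
    """Return word groups whose tag sequence matches `pattern`."""
    # Encode the tag sequence as one string, e.g. "<DT><JJS><NN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(pattern, tag_string):
        start = tag_string.count("<", 0, match.start())  # index of first token
        length = match.group().count("<")                # number of tokens matched
        chunks.append([word for word, _ in tagged[start:start + length]])
    return chunks

demo_tagged = [('the', 'DT'), ('best', 'JJS'), ('energy', 'NN'), ('shots', 'NNS'),
               ('.', '.'), ('works', 'VBZ'), ('hard', 'JJ'), ('all', 'DT'),
               ('day', 'NN'), ('.', '.'), ('Good', 'JJ'), ('stuff', 'NN')]

# Same shape as the notebook's NP rule: optional DT, any number of JJ, one NN
strict_np = regex_chunks(demo_tagged, r"(?:<DT>)?(?:<JJ>)*<NN>")

# Looser variant (an assumption, not from the notebook): any JJ* tag, one or more NN* tags
loose_np = regex_chunks(demo_tagged, r"(?:<DT>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)+")

print(strict_np)
print(loose_np)
```

With the looser pattern, `the best energy shots` comes out as a single noun phrase. The rough equivalent in an NLTK grammar would be a rule such as `NP: {<DT>?<JJ.*>*<NN.*>+}`.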


---

## 6. NER (Named Entity Recognition):
> NER classifies named entities in text into pre-defined categories such as the names of persons, organizations, and locations, time expressions, quantities, monetary values, percentages, etc.

> **[🔗NER](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)** is used in many areas of Natural Language Processing (NLP) and can help answer many real-world questions, such as:
>- Which companies were mentioned in the news article?
>- Were specified products mentioned in complaints or reviews?
>- Does the tweet contain the name of a person? Does the tweet contain this person’s location?

In [14]:
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [15]:
entities = nltk.ne_chunk(taggings)
entities.pprint()

(S
  I/PRP
  bought/VBD
  these/DT
  for/IN
  my/PRP$
  husband/NN
  and/CC
  he/PRP
  said/VBD
  they/PRP
  are/VBP
  the/DT
  best/JJS
  energy/NN
  shots/NNS
  out/RP
  there/RB
  ./.
  He/PRP
  takes/VBZ
  one/CD
  in/IN
  the/DT
  mornings/NNS
  and/CC
  works/VBZ
  hard/JJ
  all/DT
  day/NN
  ./.
  Good/JJ
  stuff/NN
  !/.)

> Note: the tree contains no entity subtrees because this review mentions no people, organizations, or places. For text that does, `ne_chunk()` wraps the matching tokens in labelled subtrees such as `PERSON`, `ORGANIZATION`, or `GPE`.


---
### These are the basic steps to consider for NLP ✨
---

## 7. Sentiment Analysis:

> Here we use VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment intensity analyzer included with NLTK.

In [16]:
# VADER - Package Installation
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [17]:
# Importing the analyzer
from nltk.sentiment import SentimentIntensityAnalyzer
print('Test Text:\n\t', testText_01)

# Creating the analyzer (lexicon-based, so no training step is needed)
analyzer = SentimentIntensityAnalyzer()

# Measuring scores
scores = analyzer.polarity_scores(testText_01)

print("\nScores:\n",scores)

Test Text:
	 I bought these for my husband and he said they are the best energy shots out there. He takes one in the mornings and works hard all day. Good stuff!

Scores:
 {'neg': 0.039, 'neu': 0.697, 'pos': 0.264, 'compound': 0.8439}
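
In the scores above, `neg`, `neu`, and `pos` are the proportions of the text falling into each category, and `compound` is a normalized overall score between -1 and 1. A common convention is to treat a compound score of at least 0.05 as positive and at most -0.05 as negative. The sketch below applies that rule; the threshold is an assumption to tune for your data, and `label_from_compound` is a hypothetical helper, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the notebook output above
demo_scores = {'neg': 0.039, 'neu': 0.697, 'pos': 0.264, 'compound': 0.8439}

sentiment_label = label_from_compound(demo_scores['compound'])
print(sentiment_label)
```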


---

### For more in-depth details about NLP, refer to: [NLP Documentation 🔗](https://realpython.com/nltk-nlp-python/)