# Natural Language Processing (NLP)

- **Basic steps in an NLP pipeline using the Natural Language Toolkit (NLTK):**

> 1. Tokenization.
> 2. Stop Word Exclusion.
> 3. Stemming / Lemmatization.
> 4. POS (Part-of-Speech) Tagging.
> 5. Chunking (using regular expressions and `RegexpParser`).
> 6. NER (Named Entity Recognition): grouping tokens into named entities of the same kind (for example people, organizations, places) with `ne_chunk()`.




---

## NLTK Installation:

`pip install nltk`

---

## Importing Necessary Dependencies:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

# Natural Language Toolkit
import nltk
# nltk.download('package_name') downloads NLTK data packages (corpora, models)

plt.style.use("ggplot")

---

## 1. Tokenization:

In [4]:
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
testText_01 = 'I bought these for my husband and he said they are the best energy shots out there. He takes one in the mornings and works hard all day. Good stuff!'


from nltk.tokenize import word_tokenize
tokens = word_tokenize(testText_01)
print(tokens)

['I', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energy', 'shots', 'out', 'there', '.', 'He', 'takes', 'one', 'in', 'the', 'mornings', 'and', 'works', 'hard', 'all', 'day', '.', 'Good', 'stuff', '!']
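
To make the tokenization step concrete, here is a minimal standard-library sketch that approximates it with a regular expression. It is not how `word_tokenize` works internally (NLTK's tokenizer also handles contractions, abbreviations, and other cases this pattern ignores), and `simple_tokenize` is a hypothetical helper name used only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

sample = "He takes one in the mornings and works hard all day. Good stuff!"
print(simple_tokenize(sample))
```

For simple sentences like the sample review, this pattern separates words and punctuation much as the cell above does, but it should be treated as a teaching aid rather than a replacement for NLTK's tokenizer.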


---

## 2. Stop Word Exclusion: 
- Stop words are common words that add little meaning to a sentence, e.g. 'a', 'an', 'the', 'and'.
- NLTK's `stopwords` corpus provides the stop-word list used here.

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91800\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [22]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# print(stop_words)

filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print("Previous: \n", tokens)
print("\nFiltered: \n", filtered_tokens)

Previous: 
 ['I', 'bought', 'these', 'for', 'my', 'husband', 'and', 'he', 'said', 'they', 'are', 'the', 'best', 'energy', 'shots', 'out', 'there', '.', 'He', 'takes', 'one', 'in', 'the', 'mornings', 'and', 'works', 'hard', 'all', 'day', '.', 'Good', 'stuff', '!']

Filtered: 
 ['bought', 'husband', 'said', 'best', 'energy', 'shots', '.', 'takes', 'one', 'mornings', 'works', 'hard', 'day', '.', 'Good', 'stuff', '!']
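
The filtered list above still contains punctuation tokens such as '.' and '!'. If those are not needed downstream, one possible extra step is to keep only alphabetic tokens. The sketch below is self-contained, so it assumes a small hand-written stop-word set (`demo_stop_words`, an illustrative subset rather than NLTK's full English list) and a hypothetical helper `remove_stop_words`.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words (case-insensitive) and non-alphabetic tokens."""
    return [
        token for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]

# Illustrative subset only; in the notebook this would be set(stopwords.words('english'))
demo_stop_words = {"he", "in", "the", "and", "all"}
demo_tokens = ["He", "takes", "one", "in", "the", "mornings", "and",
               "works", "hard", "all", "day", ".", "Good", "stuff", "!"]

print(remove_stop_words(demo_tokens, demo_stop_words))
```

Note that `str.isalpha()` also discards tokens containing digits or hyphens, so this filter may be too aggressive for some datasets.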


---

## 3. Stemming / Lemmatization:
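> **Stemming:** Strips suffixes using fast heuristic rules, which makes it a common choice for large datasets. The result may not be a real word and can change the meaning, e.g. an aggressive stemmer can reduce `Caring -> Car`.

> **Lemmatization:** Converts a word to its meaningful dictionary base form (lemma), e.g. `Caring -> Care`. It is more accurate but computationally more expensive.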
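
The difference between the two approaches can be sketched with toy implementations. Neither function below is NLTK's algorithm: `naive_stem` is a hypothetical blind suffix stripper and `naive_lemmatize` is a hypothetical lookup in a tiny hand-written dictionary (`toy_lemmas`), assumed here only to show why stemming can distort a word while lemmatization returns a real base form.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer: strip the first matching suffix, with no linguistic checks."""
    lowered = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

# Toy lemma dictionary; a real lemmatizer consults a full lexicon such as WordNet
toy_lemmas = {"caring": "care", "mornings": "morning", "works": "work"}

def naive_lemmatize(word):
    """Toy lemmatizer: dictionary lookup, falling back to the lowercased word."""
    lowered = word.lower()
    return toy_lemmas.get(lowered, lowered)

for word in ["Caring", "mornings", "works"]:
    print(word, "->", naive_stem(word), "|", naive_lemmatize(word))
```

The stemmer needs no vocabulary, which is why it is cheap, while the lemmatizer is only as good as the dictionary behind it, which is where the extra cost comes from.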