# NLP Project
> Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language.

### Stages of NLP:
0. Importing important libraries and pre-setup
1. Segmentation
2. Tokenization
3. Removal of Stop words
4. Stemming and Lemmatization
5. Part of Speech Tagging 
6. Named Entity Recognition (NER)

We will be using the `NLTK` library in Python for NLP tasks.
> For documentation, refer to: https://www.nltk.org/

We will be working with this text:
Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry. The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066. Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace. Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded.

> Source: https://www.bbc.com/news/uk-65342840

### Stage 0. Importing important Libraries and pre-setup

For this project, we need two libraries:
1. NLTK
2. re

> 1. `NLTK` will be used for the NLP tasks.
> 2. `re` (Python's built-in regular expression module) will be used for text cleanup.

In [1]:
import nltk
import re

We will also set up the `text` variable here.

In [2]:
text = "Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry. The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066. Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace. Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded."
text

"Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry. The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066. Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace. Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded."

### Stage 1. Segmentation

We will split the full text into individual sentences.

In [3]:
# download the Punkt tokenizer data (only needed once)
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt_tab to C:\Users\Kasarla
[nltk_data]     Vishwaja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


In [4]:
# splitting text into sentences
sentences = sent_tokenize(text)
sentences

['Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry.',
 'The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066.',
 'Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace.',
 "Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded."]
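Segmentation can look like simply splitting on full stops, but that breaks on abbreviations. The sketch below uses a hypothetical helper, `naive_sentence_split`, and a made-up sample sentence to show the problem; it is not how `sent_tokenize` works internally, only an illustration of why a dedicated sentence tokenizer is worth using.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Made-up sample: the abbreviation "Dr." is wrongly treated as a sentence end.
sample = "Dr. Smith attended the ceremony. It lasted two hours."
pieces = naive_sentence_split(sample)
print(pieces)
# ['Dr.', 'Smith attended the ceremony.', 'It lasted two hours.']
```

The naive rule yields three pieces for a two-sentence input, which is the kind of error a trained tokenizer such as Punkt is designed to reduce.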

In [5]:
sentences[0]

'Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry.'

In [7]:
# remove punctuation from the first sentence (note: this overwrites `text`)
text = re.sub(r'[^a-zA-Z0-9]', ' ', sentences[0])
text

'Millions of people across the UK and beyond have celebrated the coronation of King Charles III   a symbolic ceremony combining a religious service and pageantry '
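The output above still contains a run of three spaces where ` - ` used to be, plus a trailing space, and only the first sentence was cleaned. A minimal sketch of a reusable version follows; `clean_sentence` is a hypothetical helper name, and the second sample sentence is shortened for brevity.

```python
import re

def clean_sentence(sentence):
    """Replace non-alphanumeric characters with spaces, then collapse whitespace."""
    no_punct = re.sub(r"[^a-zA-Z0-9]", " ", sentence)
    return re.sub(r"\s+", " ", no_punct).strip()

sample_sentences = [
    "Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace.",
    "Here's how the day unfolded.",
]
cleaned = [clean_sentence(s) for s in sample_sentences]
print(cleaned)
# ['Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace',
#  'Here s how the day unfolded']
```

Note that this pattern also splits contractions (`Here's` becomes `Here s`) and would turn `1,000` into `1 000`, so it is a blunt tool that suits simple word-level analysis best.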

### Stage 2. Tokenization

Tokenization breaks a sentence into tokens (here, words).

In [9]:
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
print(words, end=" ")

['Millions', 'of', 'people', 'across', 'the', 'UK', 'and', 'beyond', 'have', 'celebrated', 'the', 'coronation', 'of', 'King', 'Charles', 'III', 'a', 'symbolic', 'ceremony', 'combining', 'a', 'religious', 'service', 'and', 'pageantry'] 
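Because punctuation had already been stripped, tokenization here looks the same as splitting on whitespace. On raw text the two differ: a tokenizer has to separate punctuation from the words it is attached to. The sketch below approximates that with a regular expression; `simple_tokenize` is a hypothetical helper and is not the algorithm `word_tokenize` uses.

```python
import re

def simple_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "The ceremony was held at Westminster Abbey, with the King."

print(sample.split())
# ['The', 'ceremony', 'was', 'held', 'at', 'Westminster', 'Abbey,', 'with', 'the', 'King.']

print(simple_tokenize(sample))
# ['The', 'ceremony', 'was', 'held', 'at', 'Westminster', 'Abbey', ',', 'with', 'the', 'King', '.']
```

With `split()`, `Abbey,` and `King.` keep their punctuation and would not match `Abbey` or `King` in later steps.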

### Stage 3. Removal of Stop Words

Stop words are words that appear very frequently in a language but carry little meaning on their own for many NLP tasks.

In [10]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\Users\Kasarla
[nltk_data]     Vishwaja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [11]:
# list of stopwords in English language in our NLTK
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [13]:
# removing stopwords
# build the set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words, end=" ")

['Millions', 'people', 'across', 'UK', 'beyond', 'celebrated', 'coronation', 'King', 'Charles', 'III', 'symbolic', 'ceremony', 'combining', 'religious', 'service', 'pageantry'] 
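One caveat: the NLTK stop word list is all lowercase, so a case-sensitive membership test lets capitalized stop words such as `The` slip through. The first sentence happens not to contain any, but the second one starts with `The`. A minimal sketch of the difference, using a small hand-picked stand-in set (`STOP_WORDS` is an assumption so the example runs without NLTK data):

```python
# Small hand-picked subset, used only so the example runs without downloading NLTK data.
STOP_WORDS = {"the", "was", "at", "with", "to", "be", "there"}

tokens = ["The", "ceremony", "was", "held", "at", "Westminster", "Abbey"]

# Case-sensitive check: "The" is kept because only "the" is in the set.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercase each token for the comparison, but keep the original spelling in the result.
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)    # ['The', 'ceremony', 'held', 'Westminster', 'Abbey']
print(case_insensitive)  # ['ceremony', 'held', 'Westminster', 'Abbey']
```

Comparing on `t.lower()` rather than lowercasing the whole text keeps capitalization intact, which later stages such as POS tagging and NER can make use of.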

### Stage 4. Stemming and Lemmatization

Stemming reduces a word to its stem by stripping suffixes; the result need not be a real word (for example, `peopl`).
Lemmatization reduces a word to its dictionary base form (lemma).

In [14]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to C:\Users\Kasarla
[nltk_data]     Vishwaja\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to C:\Users\Kasarla
[nltk_data]     Vishwaja\AppData\Roaming\nltk_data...


True

In [None]:
# stemming
from nltk.stem.porter import PorterStemmer

# reduce words to their stem form (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed, end=" ")

['million', 'peopl', 'across', 'uk', 'beyond', 'celebr', 'coron', 'king', 'charl', 'iii', 'symbol', 'ceremoni', 'combin', 'religi', 'servic', 'pageantri'] 

In [16]:
# lemmatization
from nltk.stem.wordnet import WordNetLemmatizer

# reduce words to their base form (each word is treated as a noun unless `pos` is given)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in words]
print(lemmatized, end=" ")

['Millions', 'people', 'across', 'UK', 'beyond', 'celebrated', 'coronation', 'King', 'Charles', 'III', 'symbolic', 'ceremony', 'combining', 'religious', 'service', 'pageantry'] 

### Another Example

In [17]:
words2 = ['wait', 'waiting', 'studies', 'studying', 'fairly', 'fairness']

# stemming
stemmed2 = [PorterStemmer().stem(w) for w in words2]
print(stemmed2, end=" ")

# lemmatization (words are treated as nouns by default, so verb forms such as 'waiting' stay unchanged)
lemmatized2 = [WordNetLemmatizer().lemmatize(w) for w in words2]
print(lemmatized2, end=" ")

['wait', 'wait', 'studi', 'studi', 'fairli', 'fair'] ['wait', 'waiting', 'study', 'studying', 'fairly', 'fairness'] 

### Stage 5. Part of Speech Tagging

Part-of-speech (POS) tagging is the process of identifying the part of speech (such as noun, verb, adjective, or adverb) of each word in a sentence.

In [22]:
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Kasarla Vishwaja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker to C:\Users\Kasarla
[nltk_data]     Vishwaja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.


True

### NLTK POS Tag List
NLTK's default English tagger uses the Penn Treebank tag set. The most common tags are listed below; the tagger assigns one tag to each word in its output.

- CC: Coordinating conjunction
- CD: Cardinal number
- DT: Determiner
- EX: Existential "there"
- FW: Foreign word
- IN: Preposition or subordinating conjunction
- JJ: Adjective
- JJR, JJS: Comparative and superlative adjective
- LS: List item marker
- MD: Modal
- NN: Singular or mass noun
- NNS, NNP, NNPS: Plural noun, singular proper noun, plural proper noun
- PDT: Predeterminer
- WRB: Wh-adverb
- WP$: Possessive wh-pronoun
- WP: Wh-pronoun
- WDT: Wh-determiner
- VBZ: Verb, third-person singular present
- VBP, VBN, VBG, VBD, VB: Verb forms: non-third-person singular present, past participle, gerund or present participle, past tense, base form
- UH: Interjection
- TO: The word "to"
- RP: Particle
- RB, RBR, RBS: Adverb, comparative adverb, superlative adverb
- PRP, PRP$: Personal pronoun and possessive pronoun

In [None]:
from nltk import pos_tag

# tag each word with its part of speech
pos_tag(words)

[('Millions', 'NNS'),
 ('people', 'NNS'),
 ('across', 'IN'),
 ('UK', 'NNP'),
 ('beyond', 'IN'),
 ('celebrated', 'VBN'),
 ('coronation', 'NN'),
 ('King', 'NNP'),
 ('Charles', 'NNP'),
 ('III', 'NNP'),
 ('symbolic', 'JJ'),
 ('ceremony', 'NN'),
 ('combining', 'VBG'),
 ('religious', 'JJ'),
 ('service', 'NN'),
 ('pageantry', 'NN')]
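One practical use of these tags is to tell the lemmatizer which part of speech each word is, since `WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to `"n"` (noun). The sketch below maps Penn Treebank tags to WordNet's single-letter codes; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun

# A few (word, tag) pairs taken from the pos_tag output above.
tagged = [("celebrated", "VBN"), ("symbolic", "JJ"), ("ceremony", "NN"), ("combining", "VBG")]

wordnet_ready = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_ready)
# [('celebrated', 'v'), ('symbolic', 'a'), ('ceremony', 'n'), ('combining', 'v')]
```

Each pair could then be passed on as `WordNetLemmatizer().lemmatize(word, pos=wn_pos)` so that verb forms are looked up as verbs rather than as nouns.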

### Stage 6. Named Entity Recognition (NER)

Named Entity Recognition (NER) identifies named entities in unstructured text and classifies them into predefined categories such as people, organizations, and locations.

In [24]:
from nltk import ne_chunk
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package words to C:\Users\Kasarla
[nltk_data]     Vishwaja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\Kasarla Vishwaja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker_tab.zip.


True

In [26]:
ner_tree = ne_chunk(pos_tag(word_tokenize(sentences[0])))  # tokenize the first sentence, POS-tag it, then chunk named entities
print(ner_tree) 

(S
  Millions/NNS
  of/IN
  people/NNS
  across/IN
  the/DT
  (ORGANIZATION UK/NNP)
  and/CC
  beyond/IN
  have/VBP
  celebrated/VBN
  the/DT
  coronation/NN
  of/IN
  King/NNP
  (PERSON Charles/NNP III/NNP)
  -/:
  a/DT
  symbolic/JJ
  ceremony/NN
  combining/VBG
  a/DT
  religious/JJ
  service/NN
  and/CC
  pageantry/NN
  ./.)
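The printed tree is readable, but later code usually needs the entities as plain data. In the tree returned by `ne_chunk`, named entities are subtrees with a label, while ordinary words are `(word, tag)` tuples. A minimal sketch of pulling the entities out follows; `extract_entities` is a hypothetical helper, and `sample_tree` is a hand-built fragment mirroring part of the output above so that no model download is needed.

```python
from nltk import Tree

# Hand-built fragment that mirrors part of the tree printed above.
sample_tree = Tree("S", [
    ("the", "DT"),
    Tree("ORGANIZATION", [("UK", "NNP")]),
    ("of", "IN"),
    ("King", "NNP"),
    Tree("PERSON", [("Charles", "NNP"), ("III", "NNP")]),
])

def extract_entities(tree):
    """Return (label, text) pairs for each labelled chunk directly under the root."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):  # plain words are tuples, entities are subtrees
            text = " ".join(word for word, _tag in child.leaves())
            entities.append((child.label(), text))
    return entities

print(extract_entities(sample_tree))
# [('ORGANIZATION', 'UK'), ('PERSON', 'Charles III')]
```

The same function should work on `ner_tree` directly. The labels are only as good as the chunker: in the output above `UK` is tagged `ORGANIZATION` and `King` is left outside the `PERSON` chunk.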
