# NLP Assignment 2
1. Mention and define the NLTK steps.
2. What different types of segmentation are there?
3. Explain the method of recognising named entities.
4. Make a list of the elements of NLP.
5. Give an example of pragmatic analysis.
6. Explain the morphological and lexical analysis procedure.

## 1. Mention and define the NLTK steps.

Below are the Text Processing steps in NLTK:
1. Tokenization
2. Lower case conversion
3. Stop Words removal
4. Stemming
5. Lemmatization
6. Parse tree or Syntax Tree generation
7. POS Tagging



#### 1. Tokenization
Tokenization is the process of breaking text down into smaller units called tokens. Given a sentence, the idea is to separate each word and build a vocabulary so that every word can be represented uniquely in a list. Words, numbers, and punctuation marks all count as tokens.



In [20]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural language processing is an exciting area."

print(sent_tokenize(text))

['Natural language processing is an exciting area.']


In [21]:
print(word_tokenize(text))

['Natural', 'language', 'processing', 'is', 'an', 'exciting', 'area', '.']
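
The vocabulary idea described in this step can be sketched in plain Python. This is a minimal illustration only: `simple_tokenize` is a regex stand-in rather than NLTK's `word_tokenize`, and `build_vocabulary` is a hypothetical helper name. It maps each distinct token to an integer id.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens (simplified stand-in, not NLTK's tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocabulary(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = simple_tokenize("Natural language processing is an exciting area.")
vocab = build_vocabulary(tokens)
print(tokens)
print(vocab)
```

Each token keeps the id it received on first appearance, so repeated words in a longer text would not add new entries.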


#### 2. Lower case conversion
- We do not want the model to treat the same word as two different tokens just because one occurrence starts with a capital letter and the other does not. So we convert all words to lower case to avoid redundancy in the token list.

In [22]:
import re
# Lowercase the text and replace every non-alphanumeric character with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
words = text.split()
words

['natural', 'language', 'processing', 'is', 'an', 'exciting', 'area']

#### 3. Stop Words removal
When we use features from a text for modelling, we encounter a lot of noise. Stop words such as the, he, her, etc. carry little information and should be removed before processing so that the model receives cleaner input. With NLTK we can list all the stop words available for the English language.

In [23]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [25]:
# Remove stop words from the text
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
words_2 = []
for i in words:
    if i not in stop_words:
        words_2.append(i)
        
words_2

['natural', 'language', 'processing', 'exciting', 'area']

#### 4. Stemming
In a text we may find many words such as playing, played, playfully, etc. that share the root word play and convey a related meaning. So we can extract the root and drop the rest. The root produced this way is called a ‘stem’, and it does not need to be an existing word with a meaning of its own. Stems are generated simply by stripping affixes (in practice mostly suffixes) from the word.

NLTK provides the PorterStemmer, LancasterStemmer, and SnowballStemmer classes.

In [27]:
from nltk.stem.porter import PorterStemmer
# Reduce words to their stems, reusing a single stemmer instance
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words_2]
print(stemmed)

['natur', 'languag', 'process', 'excit', 'area']
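
To make the suffix-stripping idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm: the suffix list and the `naive_stem` helper are hypothetical and chosen only for illustration, while real stemmers apply many more rules and conditions.

```python
# Hypothetical suffix list for illustration only; real stemmers such as
# Porter use many more rules and conditions.
SUFFIXES = ["fully", "ing", "ed", "es", "s"]

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "playfully", "exciting", "area"]:
    print(w, "->", naive_stem(w))
```

The three forms of play collapse to the same stem, while exciting becomes excit, which shows that a stem does not have to be a dictionary word.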


#### 5. Lemmatization
Here we want to extract the base form of the word. The extracted word is called the lemma, and it is a valid dictionary word. The lemma is taken from the WordNet corpus: NLTK provides the WordNetLemmatizer, which uses the WordNet database to look up the lemmas of words.

In [28]:
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their lemmas (each word is treated as a noun unless a pos argument is given)
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmed)

['natural', 'language', 'processing', 'is', 'an', 'exciting', 'area']
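
In the output above, 'is' and 'exciting' come back unchanged because `lemmatize` treats every word as a noun unless a part of speech is passed in. The sketch below illustrates why the part of speech matters for a dictionary lookup. It uses a tiny hand-written table as a stand-in for WordNet; `LEMMA_TABLE` and `lookup_lemma` are hypothetical names, not part of NLTK.

```python
# Tiny hand-written lemma table, keyed by (word, part of speech).
# It stands in for the WordNet database and is purely illustrative.
LEMMA_TABLE = {
    ("is", "v"): "be",
    ("exciting", "v"): "excite",
    ("feet", "n"): "foot",
    ("areas", "n"): "area",
}

def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if no entry exists."""
    return LEMMA_TABLE.get((word, pos), word)

print(lookup_lemma("is"))           # looked up as a noun: no entry, so unchanged
print(lookup_lemma("is", pos="v"))  # looked up as a verb
print(lookup_lemma("feet"))
```

With the real lemmatizer, the corresponding step would be to pass a `pos` argument, for example one derived from a POS tagger.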


Stemming is much faster than lemmatization because it does not need to look words up in a dictionary; it simply follows a rule-based algorithm to generate the root words.

#### 6. Parse tree or Syntax Tree generation
We can define a grammar and then use NLTK's RegexpParser to group the POS-tagged words of a sentence into phrases, and use the resulting tree's draw function to visualize it.

In [None]:
# Import required libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser
# Example text
sample_text = "The quick brown fox jumps over the lazy dog"
# Tag every word in the sentence with its part of speech
tagged = pos_tag(word_tokenize(sample_text))
# Chunk grammar: each rule groups a sequence of POS tags into a phrase
chunker = RegexpParser("""
NP: {<DT>?<JJ>*<NN>}  #To extract Noun Phrases
P: {<IN>}             #To extract Prepositions
V: {<V.*>}            #To extract Verbs
PP: {<P> <NP>}        #To extract Prepositional Phrases
VP: {<V> <NP|PP>*}    #To extract Verb Phrases
""")
# Parse the tagged sentence into a chunk tree and print it
output = chunker.parse(tagged)
print("After Extracting\n", output)

In [None]:
# Draw the syntax tree (opens a separate window)
output.draw()

Source: https://www.geeksforgeeks.org/syntax-tree-natural-language-processing/
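
For readers who want to see what a noun-phrase chunk rule does without installing the tagger, the following plain-Python sketch applies one pattern (an optional determiner, any number of adjectives, then a noun) to a tagged sentence. The POS tags are written by hand here and are an assumption about what a tagger would return; `chunk_noun_phrases` is a hypothetical helper, not an NLTK function.

```python
import re

def chunk_noun_phrases(tagged):
    """Group tokens matching 'optional DT, any number of JJ, then NN' into phrases."""
    # Encode the tag sequence as a string such as "<DT><JJ><NN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    phrases = []
    for match in pattern.finditer(tag_string):
        # Count the tags before the match to recover the token index
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged[start:start + length]))
    return phrases

# Hand-written tags (assumed, not produced by running a tagger)
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_noun_phrases(tagged))
```

The sketch handles a single rule only; a full chunk parser applies several rules in sequence and returns a tree rather than a flat list.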

#### 7. POS Tagging
Part-of-speech (POS) tagging is used in text processing to disambiguate words that are spelled the same but have different meanings. Based on its definition and context, each word is given a particular tag before further processing. Two steps are used here:

1. Tokenize the text (word_tokenize).
2. Apply pos_tag from NLTK to the resulting tokens.

In [35]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
txt = "Natural language processing is an exciting area. Huge budget have been allocated for this."
# sent_tokenize uses a pretrained instance of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
  # The word tokenizer splits the sentence into
  # words and punctuation
  wordsList = nltk.word_tokenize(i)
  # Remove stop words from wordsList
  wordsList = [w for w in wordsList if w not in stop_words]
  # Apply the part-of-speech (POS) tagger
  tagged = nltk.pos_tag(wordsList)
  print(tagged)

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('exciting', 'JJ'), ('area', 'NN'), ('.', '.')]
[('Huge', 'NNP'), ('budget', 'NN'), ('allocated', 'VBD'), ('.', '.')]




## 2. What different types of segmentation are there?

Text segmentation is the task of dividing a document of text into coherent and semantically meaningful segments which are contiguous. This task is important for other Natural Language Processing (NLP) applications like summarization, context understanding, and question-answering.

-  Word segmentation.
-  Intent segmentation.
-  Sentence segmentation.
-  Topic segmentation.
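
As a rough illustration of the two simplest types, sentence and word segmentation, here is a regex-based sketch in plain Python. It is a simplification under stated assumptions: `split_sentences` and `split_words` are hypothetical helpers, and the sentence rule would break on abbreviations such as "Inc.", which trained tokenizers handle better.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Naive word segmentation: keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Natural language processing is an exciting area. Huge budget have been allocated for this."
for sentence in split_sentences(text):
    print(sentence, "->", split_words(sentence))
```

Topic and intent segmentation need more than surface patterns, since they depend on the meaning of the text rather than on punctuation.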

## 3. Explain the method of recognising named entities.

- When we read a text, we naturally recognize named entities such as people, values, locations, and so on.

For example, consider the following sentence:

        Sentence: "Sundar Pichai, the CEO of Google Inc. is walking in the streets of California."

- From the above sentence, we can identify three types of named entities:

        (“person”: “Sundar Pichai”),
        (“org”: “Google Inc.”),
        (“location”: “California”).
- For a computer to do the same, it must first learn to recognize entities so that it can then categorize them. Machine learning and Natural Language Processing (NLP) make this possible.

- Each plays a distinct role when implementing NER:
    - NLP: It studies the structure and rules of language and builds intelligent systems that are capable of deriving meaning from text and speech.
    - Machine Learning: It helps machines learn and improve over time.

- To learn what an entity is, an NER model needs to be able to detect a word or string of words that forms an entity (e.g. California) and decide which entity category it belongs to.

- In short, the heart of any NER model is a two-step process:
    - Detect a named entity
    - Categorize the entity
- So first, we need to create entity categories, like Name, Location, Event, Organization, etc., and feed an NER model relevant training data.

- Then, by tagging some samples of words and phrases with their corresponding entities, we’ll eventually teach our NER model to detect the entities and categorize them.
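
The two steps, detect and then categorize, can be sketched with a simple dictionary lookup. This is not a trained NER model: the `GAZETTEER` table and the `find_entities` helper are hypothetical, and a real system would learn to recognize entities it has never seen from tagged training data.

```python
# Hypothetical gazetteer mapping known entity strings to categories.
GAZETTEER = {
    "Sundar Pichai": "person",
    "Google Inc.": "org",
    "California": "location",
}

def find_entities(sentence, gazetteer):
    """Step 1: detect known entity strings. Step 2: attach each one's category."""
    found = []
    for entity, category in gazetteer.items():
        position = sentence.find(entity)  # first occurrence only
        if position != -1:
            found.append((position, entity, category))
    # Report entities in the order they appear in the sentence
    return [(entity, category) for _, entity, category in sorted(found)]

sentence = "Sundar Pichai, the CEO of Google Inc. is walking in the streets of California."
print(find_entities(sentence, GAZETTEER))
```

A lookup like this only finds names already in the table, which is the limitation that training a model on tagged samples is meant to overcome.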

## 4. Make a list of the elements of NLP.

1. Lexical and Morphological Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

## 5. Give an example of pragmatic analysis.

Pragmatic analysis is the fifth and last phase of NLP. It helps you discover the intended effect of an utterance by applying a set of rules that characterize cooperative dialogues.

For example: "Open the door" is interpreted as a request instead of an order.

## 6. Explain the morphological and lexical analysis procedure.

#### Morphological Analysis:
- In this analysis, we try to understand distinct words according to their morphemes, which are defined as the smallest units of meaning.

For example, consider the word “unhappiness”.

- It can be broken down into three morphemes (prefix, stem, and suffix), each conveying some form of meaning:

    - The prefix un- refers to “not being”,

    - The suffix -ness refers to “a state of being”.

    - The stem happy is considered a free morpheme since it is a “word” on its own.

- Prefixes and suffixes are bound morphemes: they require a free morpheme to attach to and therefore cannot appear as a “word” on their own.
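
The breakdown of “unhappiness” can be reproduced with a small sketch. The affix lists and the `split_morphemes` helper are hypothetical and cover only a handful of cases; real morphological analyzers rely on much larger lexicons and spelling rules, and this version would wrongly treat the "re" in a word like "reading" as a prefix.

```python
# Hypothetical affix lists for illustration; real morphological analyzers
# rely on much larger lexicons and spelling rules.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ment", "ly"]

def split_morphemes(word):
    """Split a word into (prefix, stem, suffix); missing parts are empty strings."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[: len(rest) - len(suffix)]
    # Undo the y -> i spelling change before a suffix (happy + ness -> happiness)
    if suffix and stem.endswith("i"):
        stem = stem[:-1] + "y"
    return prefix, stem, suffix

print(split_morphemes("unhappiness"))
```

The stem returned for “unhappiness” is the free morpheme happy, while the prefix and suffix are the bound morphemes attached to it.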

#### Lexical Analysis:
- This analysis involves identifying and analyzing the structure of words.

- The lexicon of a language is its collection of words and phrases.

- In Lexical analysis, we divide the whole chunk of text data into paragraphs, sentences, and words.

- Lexical analysis usually requires lexicon normalization. The most common lexicon normalization practices are stemming and lemmatization, both of which are covered under Question 1 above.

### References:
- https://www.analyticsvidhya.com/blog/2021/07/nltk-a-beginners-hands-on-guide-to-natural-language-processing/
- https://www.analyticsvidhya.com/blog/2021/06/part-10-step-by-step-guide-to-master-nlp-named-entity-recognition/
- https://www.analyticsvidhya.com/blog/2021/06/part-2-step-by-step-guide-to-master-natural-language-processing-nlp-in-python/