---Practice questions---

https://www.w3resource.com/python-exercises/nltk/corpus-index.php

## Stemming
Stemming is a natural language processing technique that reduces inflected words to their root form (the stem). It is a common text-normalization step when preprocessing words and documents.
### Why is Stemming Important?
English has many inflected variants of a single word. When these variants appear in a text corpus, they create redundant features for NLP or machine learning models, which can make the models less effective.

To build a robust model, it is essential to normalize text by removing repetition and transforming words to their base form through stemming.

### Application of Stemming
Stemming is used in information retrieval, text mining, SEO, web search, indexing, tagging systems, and word analysis. For instance, Google searches for "prediction" and "predicted" return comparable results.
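
To make the search example concrete, here is a minimal sketch of an index keyed by stems. The `crude_stem` helper and the sample documents are hypothetical stand-ins for illustration only; a real system would use a proper stemming algorithm rather than this toy suffix stripper.

```python
from collections import defaultdict

def crude_stem(word):
    """Toy suffix stripper (hypothetical, not a real stemming algorithm)."""
    word = word.lower()
    for suffix in ("ion", "ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# Made-up documents for the example.
documents = {
    1: "the model predicted rain",
    2: "a prediction of rain",
    3: "sunny weather today",
}

# Inverted index keyed by stem: stem -> set of document ids.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[crude_stem(token)].add(doc_id)

def search(query):
    """Return the ids of documents containing a word with the same stem."""
    return sorted(index.get(crude_stem(query), set()))

print(search("prediction"))
print(search("predicted"))
```

Both queries reduce to the same stem, so they retrieve the same documents.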

## Types of Stemmers (Important)
https://www.analyticsvidhya.com/blog/2021/11/an-introduction-to-stemming-in-natural-language-processing/

In [1]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [2]:
word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))

Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer                          
friend              friend              friend              friend                        friend                                  
friendship          friendship          friendship          friend                        friendship                              
friends             friend              friend              friend                        friend                                  
friendships         friendship          friendship          friend                        friendship                              
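
The Regexp Stemmer column is the easiest to reason about by hand. The sketch below mimics what `RegexpStemmer('ing$|s$|e$|able$', min=4)` appears to do, assuming it simply deletes any match of the pattern from words that are at least `min` characters long. `regexp_stem` is a hypothetical helper written for illustration, not NLTK's own code.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Delete a matching suffix; words shorter than min_length are left untouched."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

for word in ["friend", "friendship", "friends", "friendships"]:
    print(f"{word:15}{regexp_stem(word)}")
```

None of the patterns covers `-ship`, which is why this stemmer leaves `friendship` intact while the Lancaster stemmer reduces it to `friend`.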


## How is Lemmatization different from Stemming
In stemming, part of the word is simply chopped off at the tail end to arrive at the stem. Different algorithms decide how many characters to chop off, but none of them know the meaning of the word in the language it belongs to. Lemmatization algorithms, on the other hand, do have this knowledge: they refer to a dictionary to understand the word before reducing it to its root word, or lemma.

### Advantages and Disadvantages of Lemmatization
The obvious advantage of lemmatization is that it is more accurate. If you are dealing with an NLP application such as a chatbot or a virtual assistant, where understanding the meaning of the dialogue is crucial, lemmatization is useful. But this accuracy comes at a cost.

Because lemmatization looks words up in something like a dictionary, it carries extra computational overhead, so most lemmatization algorithms are slower than their stemming counterparts.


Examples of lemmatization:

-> rocks : rock

-> corpora : corpus

-> better : good


One major difference from stemming is that `lemmatize` takes a part-of-speech parameter, `pos`. If it is not supplied, the default is noun.

https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6


In [1]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
  
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
  
# "a" denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

rocks : rock
corpora : corpus
better : good
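
The role of `pos` can be illustrated without WordNet. The following is a toy sketch, assuming a hand-written lookup table (`TOY_LEMMAS` and `toy_lemmatize` are hypothetical names). It only imitates the idea that a word is looked up under a part of speech, with noun as the default, and is not how `WordNetLemmatizer` is implemented.

```python
# Hypothetical miniature dictionary: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("rocks"))            # rock
print(toy_lemmatize("better"))           # better (looked up as a noun, no entry)
print(toy_lemmatize("better", pos="a"))  # good
```

With the default part of speech, "better" has no entry in the table and comes back unchanged; only the adjective lookup maps it to "good".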


In [9]:
# Write a Python NLTK program to list the languages for which stop word lists are available.
from nltk.corpus import stopwords
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [7]:
# Write a Python NLTK program to check the list of stopwords in various languages.
from nltk.corpus import stopwords

languages = ['english', 'arabic', 'azerbaijani', 'danish', 'dutch', 'finnish', 'french',
             'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali',
             'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']

for i, language in enumerate(languages):
    # Print a blank line before every heading except the first.
    prefix = "" if i == 0 else "\n"
    print(f"{prefix}List of stopwords in {language.capitalize()}:")
    print(set(stopwords.words(language)))

List of stopwords in English:
{'didn', 'had', 'yours', 'or', 'no', 'should', "wouldn't", 'o', 'has', 'all', "couldn't", 'shan', "shouldn't", 'once', 'ourselves', 'for', 'd', 'most', 'him', 'itself', 'as', 'off', 'too', "hasn't", 'on', 'do', 'his', 'being', 'having', 'about', 'have', 'be', 'own', 'themselves', 'of', 'll', 'not', 'who', 'those', 't', 'some', 'ma', 'an', 'hadn', 'how', 'hasn', 'mightn', 'at', 'm', "wasn't", 'nor', 'whom', "don't", 'was', 'above', 'himself', 'yourself', 'because', 'y', 'its', 'through', 'are', 'me', "hadn't", 'your', 'while', 'does', 'again', 'myself', "weren't", 'a', 'during', 'is', 'and', 'doing', 'more', 'between', 'needn', 'so', 'than', 'won', "doesn't", "it's", 'below', 'out', 'isn', "mustn't", 'few', 'yourselves', 've', "she's", 'you', 'hers', 'such', 'mustn', 'from', 'did', 'doesn', 'both', "aren't", 'here', 'i', 'there', "needn't", 'now', 'her', 'if', 'aren', 'were', 'very', "you'd", 'down', "isn't", 'any', "you're", 'am', 'by', "won't", 'theirs', 

### Write a Python NLTK program to remove stop words from a given text.

In [11]:
# In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though 
# "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by 
# all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing 
# these stop words to support phrase search.
# Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most 
# common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching
# for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines
# remove some of the most common words (including lexical words, such as "want") from a query in order to improve performance.

from nltk.corpus import stopwords
stoplist = stopwords.words('english')
text = '''
In computing, stop words are words which are filtered out before or after 
processing of natural language data (text). Though "stop words" usually 
refers to the most common words in a language, there is no single universal 
list of stop words used by all natural language processing tools, and 
indeed not all tools even use such a list. Some tools specifically avoid 
removing these stop words to support phrase search.
'''
print("\nOriginal string:")
print(text)
clean_word_list = [word for word in text.split() if word not in stoplist]
print("\nAfter removing stop words from the said text:")
print(clean_word_list)


Original string:

In computing, stop words are words which are filtered out before or after 
processing of natural language data (text). Though "stop words" usually 
refers to the most common words in a language, there is no single universal 
list of stop words used by all natural language processing tools, and 
indeed not all tools even use such a list. Some tools specifically avoid 
removing these stop words to support phrase search.


After removing stop words from the said text:
['In', 'computing,', 'stop', 'words', 'words', 'filtered', 'processing', 'natural', 'language', 'data', '(text).', 'Though', '"stop', 'words"', 'usually', 'refers', 'common', 'words', 'language,', 'single', 'universal', 'list', 'stop', 'words', 'used', 'natural', 'language', 'processing', 'tools,', 'indeed', 'tools', 'even', 'use', 'list.', 'Some', 'tools', 'specifically', 'avoid', 'removing', 'stop', 'words', 'support', 'phrase', 'search.']
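
The output above still contains `In` and `Some`, and tokens such as `computing,` keep their punctuation, because the comparison is case-sensitive and `split()` only breaks on whitespace. A possible refinement is sketched below. `SAMPLE_STOPWORDS` is a small hand-picked stand-in for `stopwords.words('english')` so that the snippet runs without the NLTK corpus, and `remove_stop_words` is a hypothetical helper.

```python
import string

# Hand-picked subset used for illustration only (assumption); in practice
# this would be set(stopwords.words('english')).
SAMPLE_STOPWORDS = {"in", "are", "which", "out", "before", "or", "after",
                    "of", "some", "to", "these"}

def remove_stop_words(text, stop_words):
    """Lowercase each token and strip surrounding punctuation before filtering."""
    kept = []
    for token in text.split():
        cleaned = token.strip(string.punctuation).lower()
        if cleaned and cleaned not in stop_words:
            kept.append(cleaned)
    return kept

sample = "In computing, stop words are words which are filtered out."
print(remove_stop_words(sample, SAMPLE_STOPWORDS))
```

Whether to lowercase and strip punctuation depends on the task; for phrase search, as the comment in the cell above notes, some tools avoid removing stop words at all.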


### Write a Python NLTK program to omit some given stop words from the stopwords list.

In [12]:
from nltk.corpus import stopwords
result = set(stopwords.words('english'))
print("List of stopwords in English:")
print(result)
print("\nOmit - 'again', 'once' and 'from':")
# Set difference removes the chosen words from the stop word set.
stop_words = result - {'again', 'once', 'from'}
print("\nList of fresh stopwords in English:")
print(stop_words)

List of stopwords in English:
{'didn', 'had', 'yours', 'or', 'no', 'should', "wouldn't", 'o', 'has', 'all', "couldn't", 'shan', "shouldn't", 'once', 'ourselves', 'for', 'd', 'most', 'him', 'itself', 'as', 'off', 'too', "hasn't", 'on', 'do', 'his', 'being', 'having', 'about', 'have', 'be', 'own', 'themselves', 'of', 'll', 'not', 'who', 'those', 't', 'some', 'ma', 'an', 'hadn', 'how', 'hasn', 'mightn', 'at', 'm', "wasn't", 'nor', 'whom', "don't", 'was', 'above', 'himself', 'yourself', 'because', 'y', 'its', 'through', 'are', 'me', "hadn't", 'your', 'while', 'does', 'again', 'myself', "weren't", 'a', 'during', 'is', 'and', 'doing', 'more', 'between', 'needn', 'so', 'than', 'won', "doesn't", "it's", 'below', 'out', 'isn', "mustn't", 'few', 'yourselves', 've', "she's", 'you', 'hers', 'such', 'mustn', 'from', 'did', 'doesn', 'both', "aren't", 'here', 'i', 'there', "needn't", 'now', 'her', 'if', 'aren', 'were', 'very', "you'd", 'down', "isn't", 'any', "you're", 'am', 'by', "won't", 'theirs', 

In [23]:
# C:\Users\djoshi\Anaconda3\Lib\site-packages\nltk
import nltk
print(nltk.__file__)

C:\Users\djoshi\Anaconda3\lib\site-packages\nltk\__init__.py



### Write a Python NLTK program to find the definition and examples of a given word using WordNet.
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.

In [18]:
from nltk.corpus import wordnet 
syns = wordnet.synsets("fight")
print(syns)
print("\nDefinition of the said word:")
print(syns[0].definition())
print("\nExamples of the word in use:")
print(syns[0].examples())

[Synset('battle.n.01'), Synset('fight.n.02'), Synset('competitiveness.n.01'), Synset('fight.n.04'), Synset('fight.n.05'), Synset('contend.v.06'), Synset('fight.v.02'), Synset('fight.v.03'), Synset('crusade.v.01')]

Definition of the said word:
a hostile meeting of opposing military forces in the course of a war

Examples of the word in use:
['Grant won a decisive victory in the battle of Chickamauga', 'he lost his romantic ideas about war when he got into a real engagement']


### Write a Python NLTK program to find the sets of synonyms and antonyms of a given word
https://stackoverflow.com/questions/59355529/is-there-any-order-in-wordnets-synsets

In [20]:
wordnet.synsets("end")

[Synset('end.n.01'),
 Synset('end.n.02'),
 Synset('end.n.03'),
 Synset('goal.n.01'),
 Synset('end.n.05'),
 Synset('end.n.06'),
 Synset('end.n.07'),
 Synset('end.n.08'),
 Synset('end.n.09'),
 Synset('end.n.10'),
 Synset('end.n.11'),
 Synset('conclusion.n.08'),
 Synset('end.n.13'),
 Synset('end.n.14'),
 Synset('end.v.01'),
 Synset('end.v.02'),
 Synset('end.v.03'),
 Synset('end.v.04')]

In [22]:
from nltk.corpus import wordnet
synonyms = []
antonyms = []

for syn in wordnet.synsets("end"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        # Record the first antonym when the lemma has one.
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print("\nSet of synonyms of the said word:")
print(set(synonyms))
print("\nSet of antonyms of the said word:")
print(set(antonyms))


Set of synonyms of the said word:
{'closing', 'remnant', 'final_stage', 'stop', 'finish', 'cease', 'death', 'destruction', 'oddment', 'terminate', 'remainder', 'end', 'conclusion', 'ending', 'close', 'goal', 'last', 'terminal'}

Set of antonyms of the said word:
{'beginning', 'begin'}


In [26]:
from nltk.corpus import wordnet
print("\nComparing ship and boat:")
n1 = wordnet.synset('ship.n.01')
n2 = wordnet.synset('boat.n.01')
print(n1.wup_similarity(n2))

print("\nComparing bus and boat:")
n1 = wordnet.synset('bus.n.01')
n2 = wordnet.synset('boat.n.01')
print(n1.wup_similarity(n2))

print("\nComparing red and green:")
n1 = wordnet.synset('red.n.01')
n2 = wordnet.synset('green.n.01')
print(n1.wup_similarity(n2))


Comparing ship and boat:
0.9090909090909091

Comparing bus and boat:
0.7

Comparing red and green:
0.875
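
`wup_similarity` is the Wu-Palmer measure, which scores two concepts by how deep their most specific shared ancestor (the least common subsumer, `lcs`) sits in the hypernym hierarchy: 2 * depth(lcs) / (depth(a) + depth(b)). The sketch below applies that formula to a tiny hand-made taxonomy. The `PARENT` table and the helper names are hypothetical, and the values will not match WordNet's, whose hierarchy is far deeper.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "vessel": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "bus": "vehicle",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("ship", "boat"))
print(wup("bus", "boat"))
```

As in the WordNet results above, the pair that shares the more specific ancestor (ship and boat) scores higher than bus and boat.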


## Construct a parse tree for the sentence using CFG rules
![image-2.png](attachment:image-2.png)

Ans:
https://www.youtube.com/watch?v=ovenwMD4U9Y

(Ignore the hand in the image.)
![image.png](attachment:image.png)

## Construct top down and bottom up parse tree for the sentence using CFG rules
![image.png](attachment:image.png)

Ans:

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
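
The grammar and sentence for these exercises live in the attached images, so they are not reproduced here. As a rough illustration of the top-down strategy only, the sketch below runs a backtracking recursive-descent parser over a hypothetical toy grammar (`GRAMMAR`, `parse`, `expand` and `parse_sentence` are all made up for this example). A bottom-up parser would instead start from the words and combine them into larger constituents until it reaches `S`.

```python
# Hypothetical toy CFG: non-terminal -> list of productions.
# Any symbol that is not a key of GRAMMAR is treated as a terminal (a word).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"]],
    "V": [["chased"], ["slept"]],
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can derive tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the current word exactly.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    # Non-terminal: try each production in turn (top-down expansion).
    for production in GRAMMAR[symbol]:
        yield from expand(symbol, production, tokens, pos)

def expand(symbol, production, tokens, pos):
    """Match the symbols of one production left to right, keeping all alternatives."""
    partial = [([], pos)]
    for child in production:
        partial = [
            (children + [subtree], nxt)
            for children, p in partial
            for subtree, nxt in parse(child, tokens, p)
        ]
    for children, end in partial:
        yield (symbol, *children), end

def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    tokens = sentence.lower().split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None

tree = parse_sentence("the dog chased a cat")
print(tree)
```

The tree is returned as nested tuples of the form `(label, child, ...)`. The toy grammar has no left-recursive rules, which this simple strategy would not handle.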