# **Assignment 1 on Natural Language Processing**

### Date : 4th Sept, 2020

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

# NLTK Library

The [NLTK](https://www.nltk.org/) Python framework is generally used as an education and research tool. This tutorial discusses some of the tasks it supports, such as tokenization, stemming, lemmatization, punctuation handling, character counts, and word counts.

**Installing NLTK** <br>
NLTK can be installed using the pip or conda package managers. For detailed installation instructions, follow this [link](https://www.nltk.org/install.html).

To ensure we are all on the same page, the coding environment will be **python3**. We suggest downloading Anaconda3 and creating a separate environment for this assignment. 
The Anaconda3 installer for Windows and Linux is available at https://docs.anaconda.com/anaconda/install/. 
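NLTK can be installed, and its interactive data downloader launched, as follows: 
```bash
# Install NLTK into the active Python 3 environment
pip3 install nltk
# Open the NLTK data downloader (this is Python code, so it is run through python3)
python3 -c "import nltk; nltk.download()"
```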

**Note for questions and answers:**

Keep your answers to the point and write them in the text box labelled **Answer here**.

# Tokenizing Words and Sentences using NLTK

**Tokenization** is the process by which a large quantity of text is divided into smaller parts called tokens. <br>Understanding the patterns in a text is crucial for performing various NLP tasks, and these tokens are very useful for finding such patterns.<br>

The Natural Language Toolkit has an important module, `tokenize`, which provides, among others, the following functions:

1. `word_tokenize`: splits text into word tokens
2. `sent_tokenize`: splits text into sentences
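
As a rough illustration of what tokenization produces, the sketch below splits a short string into sentences and word tokens using only Python's `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are a deliberate simplification: they are not how NLTK's `sent_tokenize` and `word_tokenize` work.

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters (optionally with an
    # internal apostrophe part) or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

text = "NLTK is a toolkit. It splits text into tokens!"
print(simple_sent_tokenize(text))
# ['NLTK is a toolkit.', 'It splits text into tokens!']
print(simple_word_tokenize(text))
# ['NLTK', 'is', 'a', 'toolkit', '.', 'It', 'splits', 'text', 'into', 'tokens', '!']
```

The naive sentence rule would wrongly split after an abbreviation such as "Dr.", which is one reason to prefer NLTK's tokenizers in practice.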

In [2]:
# Importing modules
import nltk
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
nltk.download('stopwords')
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Sample corpus.
from nltk.corpus import inaugural,stopwords
corpus = inaugural.raw('1789-Washington.txt')
#print(corpus)

### **TASK**:

For the given corpus, 
1. Print the number of sentences and tokens. 
2. Print the average number of tokens per sentence.
3. Print the number of unique tokens.
4. Print the number of tokens after stopword removal using the stopwords from nltk.


In [4]:
# Tokenize once and reuse the results
sentences = sent_tokenize(corpus)
tokens = word_tokenize(corpus)
x = len(sentences)
y = len(tokens)
z = len(set(tokens))
print('1. Number of sentences is : '+str(x)+' and number of tokens is : '+str(y) )
print('2. Avg number tokens per sentences is : '+str(y/x))
print('3. Number of unique tokens is : '+ str(z))
stop_words = set(stopwords.words('english'))
# The comparison is case-sensitive, so capitalized stopwords such as "The" are kept
filtered_sentence = [w for w in tokens if w not in stop_words]
print('4. Number of tokens after stopword removal is : '+str(len(filtered_sentence)))


1. Number of sentences is : 23 and number of tokens is : 1537
2. Avg number tokens per sentences is : 66.82608695652173
3. Number of unique tokens is : 626
4. Number of tokens after stopword removal is : 800
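
One detail worth noting about count 4: the stopword filter compares tokens case-sensitively, and NLTK's English stopword list is lowercase, so a capitalized token such as "The" is not removed. The sketch below shows the effect on a tiny hand-made example; the mini stopword set and the token list are made up for illustration and are not taken from the corpus.

```python
# Hypothetical mini stopword set (NLTK's English list is larger, but also lowercase)
stop_words = {"the", "of", "and", "is"}
# Hypothetical token list, as a word tokenizer might return it
tokens = ["The", "voice", "of", "the", "people", "is", "heard", "."]

# Case-sensitive filter: "The" survives because only "the" is in the set
case_sensitive = [t for t in tokens if t not in stop_words]
# Case-insensitive filter: lowercase each token before the membership test
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'voice', 'people', 'heard', '.']
print(case_insensitive)  # ['voice', 'people', 'heard', '.']
```

Whether to lowercase first, and whether punctuation should count as a token, depends on what the count is meant to measure.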


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a kind of normalization for words. Normalization is a technique in which the different surface forms of a word are mapped to one common form, so that they are treated as the same item during lookup. Words that share a meaning but vary in form according to the context or sentence are normalized to that common form.<br>
Hence, stemming is a way to find the root form of a word from any of its variations.
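
To get a feel for how a stemmer reduces words by stripping endings, here is a toy suffix-stripping sketch. The suffix list and the `toy_stem` helper are assumptions made for this illustration; real stemmers such as Porter's apply a much larger, ordered set of rules.

```python
# Hypothetical suffix rules, tried in this order
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word):
    # Strip the first matching suffix, but keep at least 3 characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["grows", "cats", "running", "fairly", "leaves"]:
    print(w, ":", toy_stem(w))
# grows : grow
# cats : cat
# running : runn
# fairly : fair
# leaves : leav
```

Note that "runn" and "leav" are not dictionary words: a stem only needs to be shared by related word forms.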

NLTK provides several stemmers, such as **PorterStemmer**, **SnowballStemmer**, and **LancasterStemmer**.<br>

We will try PorterStemmer and SnowballStemmer and compare their outputs.

In [5]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]

# TODO
# Create an instance of both the stemmers and perform stemming on the above words
ps = PorterStemmer()
ss = SnowballStemmer('english')

# TODO
# Complete the function which takes a list of words and prints the stemmed version of each word.
def stemSentence(sentence=None):
    print('Stemming using PorterStemmer')
    for w in sentence: 
        print(w, " : ", ps.stem(w))
    print('\n\nStemming using SnowballStemmer')
    for w in sentence: 
        print(w, " : ", ss.stem(w))

stemSentence(words)


Stemming using PorterStemmer
grows  :  grow
leaves  :  leav
fairly  :  fairli
cats  :  cat
trouble  :  troubl
misunderstanding  :  misunderstand
friendships  :  friendship
easily  :  easili
rational  :  ration
relational  :  relat


Stemming using SnowballStemmer
grows  :  grow
leaves  :  leav
fairly  :  fair
cats  :  cat
trouble  :  troubl
misunderstanding  :  misunderstand
friendships  :  friendship
easily  :  easili
rational  :  ration
relational  :  relat


**What is Lemmatization?** <br>
Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It returns the base or dictionary form of a word, which is known as the lemma.<br>

*The NLTK lemmatization method is based on WordNet's built-in morphy function.*
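
The role of the dictionary and of the part of speech can be sketched with a toy lookup table. The `LEXICON` entries and the `toy_lemmatize` helper below are hypothetical; WordNet's actual lexicon and rules are far richer. The sketch only illustrates that the same surface form can map to different lemmas depending on the part of speech.

```python
# Hypothetical mini lexicon: (surface form, part of speech) -> lemma
LEXICON = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # Look up the (word, pos) pair; unknown pairs are returned unchanged
    return LEXICON.get((word, pos), word)

print(toy_lemmatize("leaves", "n"))  # leaf
print(toy_lemmatize("leaves", "v"))  # leave
print(toy_lemmatize("was", "v"))     # be
print(toy_lemmatize("fairly"))       # fairly (not in the toy lexicon)
```

The default of `pos="n"` here mirrors the noun default of NLTK's `WordNetLemmatizer.lemmatize`.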

In [6]:
#imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # Since the lemmatization method is based on WordNet's built-in morphy function.

words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

#TODO
# Create an instance of the Lemmatizer and perform Lemmatization on above words
# You can also give Parts-of-speech(pos) to the Lemmatizer for example "v" (verb). Check the differences in the outputs.
lm = WordNetLemmatizer()

#TODO
# Complete the function which takes a sentence/corpus and gets its lemmatized version.
def lemmatizeSentence(sentence=None):
    print('Lemmatize words in context of POS = n')
    for w in sentence: 
        print(w, " : ", lm.lemmatize(w,pos="n"))
    print('\n\nLemmatize words in context of POS = v')
    for w in sentence: 
        print(w, " : ", lm.lemmatize(w,pos="v"))
lemmatizeSentence(words)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatize words in context of POS = n
grows  :  grows
leaves  :  leaf
fairly  :  fairly
cats  :  cat
trouble  :  trouble
running  :  running
friendships  :  friendship
easily  :  easily
was  :  wa
relational  :  relational
has  :  ha


Lemmatize words in context of POS = v
grows  :  grow
leaves  :  leave
fairly  :  fairly
cats  :  cat
trouble  :  trouble
running  :  run
friendships  :  friendships
easily  :  easily
was  :  be
relational  :  relational
has  :  have


**Question:** Give an example of two words which have the same stem but different lemmas. Show the stem and lemma of both words in the code below.



**Answer here:**
Lemmatization takes the context of a word, such as its part of speech, into consideration while normalizing it. Stemming operates without knowledge of context and usually refers to a crude heuristic process that chops off the ends of words.\
Eg: "meeting" and "meet"\
As a noun, the lemma of "meeting" is "meeting" itself, while its stem is "meet", which is obtained by removing -ing.
As a noun, the lemma of "meet" is "meet" itself, and its stem is also "meet".

In [7]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.stem import SnowballStemmer
w1 = "meeting"
w2 = "meet"
ss = SnowballStemmer('english')
print("Stem of",w1, ": ", ss.stem(w1))
print("Stem of",w2, ": ", ss.stem(w2))
print("Lemma of",w1, ": ", lm.lemmatize(w1))
print("Lemma of",w2, ": ", lm.lemmatize(w2))


Stem of meeting :  meet
Stem of meet :  meet
Lemma of meeting :  meeting
Lemma of meet :  meet


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Question:** Write a comparison between stemming and lemmatization.

**Answer here:**
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

The two differ in approach. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Another difference is that stemming operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech, unlike lemmatization, which can take the part of speech into account. 

For stemming, it is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root, whereas lemmatization returns the dictionary form of a word, which must be a valid word.


However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

Eg: The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

