# **Aim :**

 Write a program to identify the stop words, stems and lemmas of English and Hindi text.

## **Theory** : 

# **Stop Words**

The process of converting data into a form a computer can work with is referred to as pre-processing. One of the major forms of pre-processing is filtering out data that carries little information. In natural language processing, such low-information words are referred to as stop words.

What are Stop words?

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We do not want these words to take up space in our database or to consume valuable processing time. We can remove them easily by storing a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships stop word lists for 16 different languages (the exact number depends on the version of the NLTK data). You can find them in the nltk_data directory; here the directory address is home/pratima/nltk_data/corpora/stopwords (do not forget to change the home directory name to your own).

The following program tokenizes a piece of text, then identifies and removes its stop words.

In [None]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

In [None]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize 

txt = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."
# English stop word list shipped with NLTK, stored as a set for fast lookups
stop_words = set(stopwords.words('english'))
# Split the text into word and punctuation tokens
word_tokens = word_tokenize(txt)

print("Sentence is:", txt, "\n")
print("Tokens in the above sentence:", word_tokens, "\n")

# Stop words that occur in the text; tokens are lowercased so "The" matches "the"
lowered_tokens = {w.lower() for w in word_tokens}
stop = sorted(stop_words & lowered_tokens)
print("StopWords recognized in the given sentence:", stop, "\n")

# Keep only the tokens that are not stop words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print("After removing the recognized stopwords, the Tokens of sentence is:", filtered_sentence)

Sentence is: Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. 

Tokens in the above sentence: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amou
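
The aim also mentions Hindi text, which the cell above does not cover. The same filtering idea can be sketched for Hindi in plain Python. The stop word list below is a small hand-picked sample and the whitespace tokenizer is a simplification; both are assumptions made for illustration, not NLTK resources.

```python
# Hypothetical, hand-picked sample of common Hindi function words.
# This is an illustrative assumption, not a list shipped with NLTK.
HINDI_STOP_WORDS = {"है", "हैं", "और", "का", "की", "के", "में", "से", "को", "पर"}

# Punctuation to strip from token edges, including the Hindi danda "।".
PUNCTUATION = "।,.!?"


def remove_stop_words(text, stop_words):
    """Split on whitespace, strip edge punctuation, and drop stop words."""
    tokens = [tok.strip(PUNCTUATION) for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in stop_words]


sentence = "राम और श्याम स्कूल में पढ़ते हैं।"
filtered = remove_stop_words(sentence, HINDI_STOP_WORDS)
print(filtered)
```

A realistic list would be far longer, and a proper Hindi tokenizer would handle punctuation and compound forms more carefully than `str.split`.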

# **Stemming**

Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. For example, a stemming algorithm aims to reduce the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”.

Some more examples of words that stem to the root word "like":

 -> "likes"

 -> "liked"

 -> "likely"

 -> "liking"

Errors in Stemming: There are mainly two kinds of error in stemming, overstemming and understemming. Overstemming occurs when two words that have different stems are reduced to the same root. Understemming occurs when two words that share a stem are not reduced to the same root.
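
Both kinds of error can be reproduced with a deliberately crude stemmer. The sketch below is a toy suffix stripper written for illustration; `naive_stem` and its suffix list are assumptions, not the Porter algorithm or any NLTK function.

```python
# Toy suffix-stripping stemmer (an illustrative assumption, much cruder
# than the Porter stemmer): remove the first matching suffix, provided
# at least three characters remain.
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Overstemming: two unrelated words collapse to the same stem.
print(naive_stem("early"), naive_stem("ear"))

# Understemming: forms of one word end up with different stems.
print([naive_stem(w) for w in ("run", "running", "ran")])
```

Real stemmers reduce these errors with more careful rules, but they cannot remove them entirely because they work on spelling alone.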

Applications of stemming are:

Stemming is used in information retrieval systems such as search engines, and to determine domain vocabularies in domain analysis. It is desirable because it can reduce redundancy: most of the time, a word stem and its inflected or derived forms mean the same thing.

Below is an implementation of word stemming using NLTK:

In [None]:
# import these modules 
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer() 
  
# choose some words to be stemmed 
words = ["walk", "walked", "walker", "walking", "walkman"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

walk  :  walk
walked  :  walk
walker  :  walker
walking  :  walk
walkman  :  walkman


In [None]:
# importing modules 
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 

ps = PorterStemmer() 

sentence = "We have Natural Language Processing in our semester VIII of Computer Engineering"
words = word_tokenize(sentence) 

for w in words: 
    print(w, " : ", ps.stem(w))

We  :  We
have  :  have
Natural  :  natur
Language  :  languag
Processing  :  process
in  :  in
our  :  our
semester  :  semest
VIII  :  viii
of  :  of
Computer  :  comput
Engineering  :  engin
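
The Porter stemmer targets English. For Hindi text, a comparable idea is to strip common inflectional suffixes. The sketch below uses a short, hand-picked suffix list; the list and the `hindi_stem` helper are assumptions for illustration, not a complete or standard Hindi stemmer.

```python
# Hypothetical, hand-picked Hindi inflectional suffixes, longest first.
HINDI_SUFFIXES = ("ियों", "ों", "ें", "ा", "े", "ी")


def hindi_stem(word, min_stem_len=2):
    """Strip the first matching suffix if enough of the word remains.

    Lengths are counted in Unicode code points, not visible glyphs.
    """
    for suffix in HINDI_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["कमरा", "कमरे", "कमरों"]  # "room" in three inflected forms
stems = [hindi_stem(w) for w in words]
for w, s in zip(words, stems):
    print(w, " : ", s)
```

All three forms reduce to one shared stem here, which is the behaviour a stemmer aims for; a short suffix list like this one will overstem or understem many other words.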


# **Lemmatization**

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words, so it links words with the same underlying meaning to one dictionary form, the lemma.

Text preprocessing includes both stemming and lemmatization. The two terms are often confused, and some people treat them as the same. Lemmatization is often preferred over stemming because it performs a morphological analysis of the words.

**Applications of lemmatization are:**

 -> Used in comprehensive retrieval systems like search engines.

 -> Used in compact indexing.

Examples of lemmatization:

```
-> rocks : rock
-> corpora : corpus
-> better : good
```

One major difference from stemming is that lemmatize takes a part-of-speech parameter, “pos”. If it is not supplied, the default is “noun”.
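
To see why the part of speech matters, consider a toy lemmatizer that looks words up in a table keyed by (word, pos). The table and the `toy_lemmatize` helper are hand-written assumptions for illustration; NLTK's WordNetLemmatizer consults WordNet instead.

```python
# Toy lemma table keyed by (word, part of speech). The entries are
# hand-written for illustration only.
LEMMA_TABLE = {
    ("better", "a"): "good",   # adjective
    ("are", "v"): "be",        # verb
    ("meeting", "v"): "meet",  # verb
    ("rocks", "n"): "rock",    # noun
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word, pos), word)


print("better (noun)      :", toy_lemmatize("better"))
print("better (adjective) :", toy_lemmatize("better", pos="a"))
print("meeting (noun)     :", toy_lemmatize("meeting"))
print("meeting (verb)     :", toy_lemmatize("meeting", pos="v"))
```

The same surface word maps to different lemmas depending on the tag, and falls back to itself when the (word, pos) pair is unknown, which is why passing the wrong tag leaves a word unchanged.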

**Below is an implementation of word lemmatization using NLTK:**

In [None]:
# import these modules 
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer() 

print("rocks :", lemmatizer.lemmatize("rocks")) 
print("corpora :", lemmatizer.lemmatize("corpora")) 

# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))
# r denotes adverb in "pos"; "are" is a verb, so it is returned unchanged
print("are :", lemmatizer.lemmatize("are", pos="r"))


rocks : rock
corpora : corpus
better : good
are : are


# **Conclusion** : 

Hence we implemented a program to identify stop words, stems and lemmas.