### Removing stop words with NLTK in Python

<img src="https://user-images.githubusercontent.com/32620288/166104650-bca608ed-afc3-4c56-8bf2-eebf0b52b054.png" width="400" height="1">

----

Pre-processing is the process of converting data into a form a computer can work with. One of its major forms is filtering out data that carries little useful information. In natural language processing, such low-information words are referred to as stop words.

#### What are stop words?

Stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We do not want these words to take up space in our database or valuable processing time. We can remove them easily by keeping a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships stop word lists for 16 different languages (the exact number depends on the NLTK version). You can find them in the nltk_data directory, for example at home/pratima/nltk_data/corpora/stopwords (remember to replace the home directory name with your own).
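
The languages available in a given installation can also be listed from Python instead of browsing the directory. The sketch below assumes NLTK is installed and the stopwords corpus has been downloaded (for instance with `nltk.download('stopwords')`). The helper name `available_stopword_languages` is hypothetical, and it simply returns an empty list when NLTK or the corpus is missing.

```python
def available_stopword_languages():
    """Return the languages that have a stop word list in the local NLTK data."""
    try:
        from nltk.corpus import stopwords
        # Each file in corpora/stopwords is named after its language.
        return sorted(stopwords.fileids())
    except (ImportError, LookupError):
        # NLTK is not installed, or the stopwords corpus has not been downloaded.
        return []


languages = available_stopword_languages()
print(len(languages), "languages:", languages)
```

Any name printed here can be passed to `stopwords.words(...)` in place of `'english'`.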

<img src="https://user-images.githubusercontent.com/32620288/166501569-26e7d120-c55b-49b8-9327-0f18a898080b.png" width="400" height="1">

#### Removing stop words with NLTK

The following program removes stop words from a piece of text: 

In [1]:
# Print the list of English stop words.
# If the corpus is missing, run nltk.download('stopwords') once first.
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [3]:
example_sent = """In a key development, German Chancellor Olaf Scholz on Tuesday stated that his country will support the admission of Finland and Sweden to the North Atlantic Treaty Organization (NATO) if they apply for it. He said this while addressing a joint press conference with Finland's Prime Minister Sanna Marin as well as Swedish Prime Minister Sweden Magdalena.  On Tuesday, the Prime Ministers of Finland and Sweden attended a meeting of the German Cabinet in Meseberg. "It is clear that if these two countries decide to be part of the NATO alliance, they can count on our support," Scholz added, as per the DPA news agency."""
example_sent

'In a key development, German Chancellor Olaf Scholz on Tuesday stated that his country will support the admission of Finland and Sweden to the North Atlantic Treaty Organization (NATO) if they apply for it. He said this while addressing a joint press conference with Finland\'s Prime Minister Sanna Marin as well as Swedish Prime Minister Sweden Magdalena.  On Tuesday, the Prime Ministers of Finland and Sweden attended a meeting of the German Cabinet in Meseberg. "It is clear that if these two countries decide to be part of the NATO alliance, they can count on our support," Scholz added, as per the DPA news agency.'

In [4]:
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

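# Option 1: list comprehension, case-insensitive (each token is lowercased before the lookup).
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

# Option 2: explicit loop, case-sensitive. It overwrites the result of option 1,
# so capitalized stop words such as 'In' remain in the output printed below.
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)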

print(word_tokens)
print(filtered_sentence)


['In', 'a', 'key', 'development', ',', 'German', 'Chancellor', 'Olaf', 'Scholz', 'on', 'Tuesday', 'stated', 'that', 'his', 'country', 'will', 'support', 'the', 'admission', 'of', 'Finland', 'and', 'Sweden', 'to', 'the', 'North', 'Atlantic', 'Treaty', 'Organization', '(', 'NATO', ')', 'if', 'they', 'apply', 'for', 'it', '.', 'He', 'said', 'this', 'while', 'addressing', 'a', 'joint', 'press', 'conference', 'with', 'Finland', "'s", 'Prime', 'Minister', 'Sanna', 'Marin', 'as', 'well', 'as', 'Swedish', 'Prime', 'Minister', 'Sweden', 'Magdalena', '.', 'On', 'Tuesday', ',', 'the', 'Prime', 'Ministers', 'of', 'Finland', 'and', 'Sweden', 'attended', 'a', 'meeting', 'of', 'the', 'German', 'Cabinet', 'in', 'Meseberg', '.', '``', 'It', 'is', 'clear', 'that', 'if', 'these', 'two', 'countries', 'decide', 'to', 'be', 'part', 'of', 'the', 'NATO', 'alliance', ',', 'they', 'can', 'count', 'on', 'our', 'support', ',', "''", 'Scholz', 'added', ',', 'as', 'per', 'the', 'DPA', 'news', 'agency', '.']
['In', 
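
Whether a capitalized token such as 'In' is dropped depends on whether it is lowercased before the lookup, because the NLTK English list is all lowercase. The following is a minimal sketch of both behaviours. `remove_stop_words` and its `case_sensitive` parameter are hypothetical names, and the small stop word set is an assumption that stands in for `set(stopwords.words('english'))` so the example runs without NLTK.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not in stop_words, preserving their order."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    return [t for t in tokens if t.lower() not in stop_words]


# Small stand-in for set(stopwords.words('english')); an assumption for illustration.
demo_stop_words = {"in", "a", "on", "that", "his", "will", "the"}
demo_tokens = ["In", "a", "key", "development", ",", "German", "Chancellor"]

insensitive = remove_stop_words(demo_tokens, demo_stop_words)
sensitive = remove_stop_words(demo_tokens, demo_stop_words, case_sensitive=True)

print(insensitive)  # 'In' and 'a' are removed
print(sensitive)    # only 'a' is removed; 'In' does not match 'in'
```

Punctuation tokens such as ',' survive in both cases because they are not part of the stop word list.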

#### Removing stop words from a file

In the code below, newtext.txt is the input file from which stop words are to be removed, and filteredtext.txt is the output file:

In [38]:
import os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [39]:
# Check working directory
os.getcwd()

'C:\\Users\\divak'

In [43]:
# word_tokenize accepts
# a string as an input, not a file.
stop_words = set(stopwords.words('english'))
file1 = open('newtext.txt','r')

In [45]:
text = file1.read()
text

"Prime Minister Narendra Modi's three-day Europe tour garnered global attention after having struck consecutive business-based meetings with German and Danish leadership and successful interactions with the Indian diaspora in Berlin and Copenhagen. In the latest development pertaining to the premiere's Europe itinerary, PM Modi, as per the Ministry of External Affairs, will make a brief stopover in Paris, given that France currently holds the EU presidency and French President Emmanuel Macron was recently been re-elected for a second consecutive term. Notably, PM Modi will be the first international leader to meet Macron after the latter was re-inducted as the French President."

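In [41]:
# Write every word that is not a stop word to the output file.
# Reuse `text` from the cell above: a second file1.read() would return an
# empty string because the file has already been read to the end.
file1.close()
words = text.split()
with open('filteredtext.txt', 'a') as append_file:
    for r in words:
        if r not in stop_words:
            append_file.write(" " + r)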
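
The same steps can be wrapped in a reusable function. The following is a sketch under stated assumptions rather than the method used above: `filter_file` is a hypothetical helper, it splits on whitespace as the cell above does, it lowercases each word before the lookup, and it opens the output in write mode so that re-running it does not append duplicate text. The stop word set is passed in, so `set(stopwords.words('english'))` can be supplied when NLTK is available.

```python
import tempfile
from pathlib import Path


def filter_file(src, dst, stop_words):
    """Copy src to dst, dropping whitespace-separated words found in stop_words."""
    words = Path(src).read_text(encoding="utf-8").split()
    kept = [w for w in words if w.lower() not in stop_words]
    Path(dst).write_text(" ".join(kept), encoding="utf-8")
    return len(words) - len(kept)  # number of words removed


# Demonstration on a temporary copy so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "newtext.txt"
    dst = Path(tmp) / "filteredtext.txt"
    src.write_text("The tour will make a brief stopover in Paris", encoding="utf-8")
    removed = filter_file(src, dst, {"the", "will", "a", "in"})
    result = dst.read_text(encoding="utf-8")

print(removed, result)
```

Because the split is on whitespace only, punctuation stays attached to words (for example "Paris," would not match "paris"); tokenizing with `word_tokenize` first avoids that.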

---------------------------------------------------------------------------------------------