
### Problem:
Suppose you have a text document about some topic, but you don’t know what that topic is. The challenge is to identify the topic without reading the text.

### Importing Data

In [22]:
# Reading the file with a context manager, which closes it automatically
with open('/content/switzerland.txt', mode='r', encoding='utf-8') as file:
    # Getting text data as a string
    text = file.read()

In [23]:
print(text)

Switzerland, officially the Swiss Confederation, is a country situated in the confluence of Western, Central, and Southern Europe. It is a federal republic composed of 26 cantons, with federal authorities based in Bern. Switzerland is a landlocked country bordered by Italy to the south, France to the west, Germany to the north, and Austria and Liechtenstein to the east. It is geographically divided among the Swiss Plateau, the Alps, and the Jura, spanning a total area of 41,285 km2 (15,940 sq mi), and land area of 39,997 km2 (15,443 sq mi). While the Alps occupy the greater part of the territory, the Swiss population of approximately 8.5 million is concentrated mostly on the plateau, where the largest cities and economic centres are located, among them Zürich, Geneva and Basel, where multiple international organisations are domiciled (such as FIFA, the UN's second-largest Office, and the Bank for International Settlements) and where the main international airports of Switzerland are.



### Cleaning Text

In [46]:
import re

# Function for cleaning text
def clean_text(text):

    # Lowercasing the text
    text=text.lower()

    # Removing commas, periods, parentheses and newline characters
    text=re.sub(r'[,.\n()]','',text)

    # Replacing hyphens with a blank space
    text=re.sub('-',' ',text)

    return text

In [47]:
# Cleaning Text
cleaned_text=clean_text(text)

In [48]:
print(len(cleaned_text))

963
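
To see what `clean_text` does on a small input, the sketch below repeats the function and applies it to two fragments copied from the passage. One side effect worth knowing about: because every period is stripped, decimal points disappear too, so `8.5` becomes `85`.

```python
import re

def clean_text(text):
    # Same steps as the function above: lowercase, strip , . ( ) and newlines,
    # then turn hyphens into spaces
    text = text.lower()
    text = re.sub(r'[,.\n()]', '', text)
    text = re.sub('-', ' ', text)
    return text

# Fragment with punctuation, parentheses and a hyphen
fragment = "(such as FIFA, the UN's second-largest Office)"
print(clean_text(fragment))   # such as fifa the un's second largest office

# Fragment with a decimal number: the period is removed as well
number = "approximately 8.5 million"
print(clean_text(number))     # approximately 85 million
```

For a topic-finding exercise this loss is harmless, but a task that relies on numeric values would probably need a more careful pattern.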


### Finding most frequent words

In [27]:
import spacy

In [28]:
# Loading spacy model
nlp=spacy.load('en_core_web_sm')

In [50]:
# creating doc object
doc=nlp(cleaned_text)

In [51]:
words_dict={}

# Add word-count pair to the dictionary
for token in doc:
    # Check if the word is already in dictionary
    if token.text in words_dict:
        # Increment count of word by 1
        words_dict[token.text]=words_dict[token.text]+1
    else:
        # Add the word to dictionary with count 1
        words_dict[token.text]=1
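
The counting loop above can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative, not the notebook's method, and it uses a short stand-in token list (an assumption for illustration) rather than the spaCy `doc`.

```python
from collections import Counter

# Stand-in token list; in the notebook this would be
# [token.text for token in doc]
tokens = "the swiss plateau the alps and the jura".split()

# Counter builds the word -> count mapping in one step
word_counts = Counter(tokens)

# most_common(n) returns the n highest counts, already sorted
top_words = word_counts.most_common(1)
print(top_words)  # [('the', 3)]
```

Because `most_common` returns the pairs sorted by count, it covers both the counting and the sorting steps when a DataFrame is not needed.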

In [52]:
import pandas as pd

In [53]:
# Creating a dataframe from dictionary
df = pd.DataFrame({'word':list(words_dict.keys()), 'count':list(words_dict.values())})

In [54]:
# Sorting dataframe in descending order
df.sort_values(by='count',ascending=False,inplace=True,ignore_index=True)

In [55]:
print('Shape=>',df.shape)
df.head(5)

Shape=> (96, 2)


| | word | count |
|---|---|---|
| 0 | the | 18 |
| 1 | and | 9 |
| 2 | of | 7 |
| 3 | is | 5 |
| 4 | to | 4 |


# What are Stopwords?
Stopwords are the most common words in a language, such as `a`, `an`, `the`, `for`, `where`, `when` and `at` in English. They make sentences grammatical and easier for humans to read, but they add little to the meaning of a document, so they are usually removed during the text pre-processing phase.

Consider a sample sentence:
##### String: "There is a pen on the table."
##### Stopwords: \["There", "is", "a", "on", "the" \]
##### Meaningful words: \["pen", "table"\]
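
The sample sentence can be filtered in a few lines of plain Python. This is a minimal sketch: `STOPWORDS` and `remove_stopwords` are hypothetical names, and the stopword set is hand-picked for this one sentence rather than taken from any library.

```python
import re

# Hypothetical, hand-picked stopword set for this one sentence
STOPWORDS = {"there", "is", "a", "on", "the"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    # Extract alphabetic words, dropping punctuation such as the final period
    words = re.findall(r"[A-Za-z']+", sentence)
    # Compare in lowercase so that "There" matches "there"
    return [w for w in words if w.lower() not in stopwords]

meaningful = remove_stopwords("There is a pen on the table.")
print(meaningful)  # ['pen', 'table']
```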

In [36]:
print(nlp.Defaults.stop_words)

{'amongst', 'during', "n't", 'no', 'nine', 'all', 'should', 'became', 'never', 'were', 'therein', 'amount', 'please', 'seeming', 'sometime', 'two', 'could', 'its', 'we', 'when', 'any', 'otherwise', 'not', 'together', 'by', 'else', 'as', 'neither', 'the', 'whatever', 'upon', 'twenty', '‘ll', 'nowhere', 'then', 'sometimes', 'thru', "'ll", 'almost', 'be', 'very', 'fifteen', 'side', '‘d', 'have', "'m", 'nothing', 'down', 'indeed', 'these', 'against', 'he', 'really', 'every', 'yours', 'their', 'see', 're', 'us', 'forty', 'until', 'say', 'of', 'besides', 'at', 'them', 'why', 'most', 'someone', 'also', 'becoming', '’m', 'those', '’s', 'go', 'such', 'throughout', 'is', 'again', 'for', 'must', 'than', 'anyway', 'do', 'front', 'fifty', 'before', 'i', 'a', 'latterly', 'further', 'elsewhere', 'may', 'give', 'it', 'your', 'him', 'take', 'am', 'our', 'n‘t', 'along', 'much', 'either', 'next', 'if', '‘ve', 'sixty', 'anywhere', 'but', 'mine', 'some', 'more', 'to', 'via', 'hers', 'nobody', 'seems', 'wil

In [37]:
len(nlp.Defaults.stop_words)

326

In [56]:
# Getting words that are not stopwords
new_tokens=[token.text for token in doc if not token.is_stop]

In [57]:
print(len(new_tokens))

89


In [58]:
new_words_dict={}

# Add word-count pair to the dictionary
for token in new_tokens:
    # Check if the word is already in dictionary
    if token in new_words_dict:
        # Increment count of word by 1
        new_words_dict[token] = new_words_dict[token]+1
    else:
        # Add the word to dictionary with count 1
        new_words_dict[token]=1

In [59]:
# Creating a dataframe from dictionary
new_df = pd.DataFrame({'word':list(new_words_dict.keys()), 'count':list(new_words_dict.values())})

In [60]:
# Sorting dataframe in descending order
new_df.sort_values(by='count',ascending=False,inplace=True,ignore_index=True)

In [61]:
print('Shape=>',new_df.shape)
new_df.head(5)

Shape=> (74, 2)


| | word | count |
|---|---|---|
| 0 | switzerland | 3 |
| 1 | swiss | 3 |
| 2 | international | 3 |
| 3 | federal | 2 |
| 4 | largest | 2 |


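Stopwords take up storage and processing time without contributing much meaning. Removing them shrinks the corpus, which makes analysis and model building faster. In this example the number of distinct tokens drops from 96 to 74, and the most frequent words now point to the topic of the document: Switzerland.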
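
The size reduction is easy to measure on a single sentence. The sketch below uses one cleaned sentence from the passage and a small hypothetical stopword subset (not spaCy's full list), so the exact percentage is illustrative only.

```python
# Hypothetical stopword subset, chosen only for this illustration
STOPWORDS = {"is", "a", "by", "to", "the", "and"}

# One sentence from the passage, already lowercased and stripped of punctuation
sentence = ("switzerland is a landlocked country bordered by italy to the south "
            "france to the west germany to the north and austria and "
            "liechtenstein to the east")

tokens = sentence.split()
kept = [t for t in tokens if t not in STOPWORDS]

total = len(tokens)
remaining = len(kept)
reduction = 1 - remaining / total

print(total, remaining, f"{reduction:.0%}")  # 26 13 50%
```

Half of the tokens in this sentence are stopwords, while the words that remain are the ones that carry the geography.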

## When to Remove Stopwords:
- Text Classification
- Caption Generation
- Auto-Tag Generation

## When Not to Remove Stopwords:
- Machine Translation
- Language Modeling
- Text Summarization
- Question-Answering Problems
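
One reason the tasks in the second list keep their stopwords is that words such as `not`, `no` and `never` appear in the spaCy stop word list printed earlier. The sketch below uses a small subset of that list and a hypothetical helper, `strip_stopwords`, to show two sentences with opposite meanings collapsing into the same tokens.

```python
# Small subset of the spaCy stop word list printed earlier in this notebook
STOPWORDS = {"the", "is", "not", "a", "no", "never"}

def strip_stopwords(sentence):
    # Lowercase, split on whitespace, and drop anything in the stopword set
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

positive = strip_stopwords("The food is good")
negative = strip_stopwords("The food is not good")

print(positive)              # ['food', 'good']
print(negative)              # ['food', 'good']
print(positive == negative)  # True
```

After removal the two sentences are indistinguishable, which is acceptable when only the topic matters but not when the output must preserve the original meaning.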