# Stop Word Removal in NLP
* Notebook by Adam Lang
* Date: 3/14/2024

## Problem
* Let's say you have a text document which contains information about a topic but you don't know what the topic is.
* Your challenge is to find out the topic without reading the text.

## Solutions
* Find the most frequently occurring word(s).

### Importing Data

In [1]:
# Read the file with a context manager so it is closed automatically
with open('/content/drive/MyDrive/Colab Notebooks/Classical NLP/moon.txt', mode='r', encoding='utf-8') as file:
    # getting text data as string
    text = file.read()

In [2]:
print(text)

The moon is the satellite of the earth. It moves round the earth. It shines at night by light reflected from the Sun. It looks beautiful. The bright Moonlight is very soothing. The earthly objects shine like silver in the moonlight. We are fascinated by the enchanting beauty of the Moon. The moon is not as beautiful as it looks. It seems to be lovely when it shines in the sky at night. As a matter of fact it is devoid of plants and animals. The moon is not a suitable place for plants and animals. Therefore, no form of life can be found on the moon. Unlike the earth, the moon has got no atmosphere. Therefore, the lunar days are very hot and the lunar nights are intensely cold. The moon looks beautiful from the earth but in fact it has up forbidding appearance. It is full of rocks and craters. When we look at the moon at night we see some dark spots on it. These dark spots are dangerous rocks and craters. The gravitational pull of the moon is less than that of the earth, so it is difficu

### Cleaning Text

In [3]:
import re

# Function for cleaning text
def clean_text(text):

    # Lowercasing the text - normalize the case
    text = text.lower()

    # Removing comma(,), period(.) and newline characters(\n)
    text = re.sub('[,.\n]','',text)

    # Replacing hyphen with blank space
    text = re.sub('-',' ', text)

    return text
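
The function above removes only commas, periods, and newlines, which is enough for this text. As a rough sketch of how it might be generalized, the variant below maps every ASCII punctuation character to a space and then collapses whitespace. The name `clean_text_general` is hypothetical and not part of the original notebook.

```python
import string

# Hypothetical helper (not in the original notebook): maps every ASCII
# punctuation character and the newline character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation + "\n"})

def clean_text_general(text):
    # Normalize the case
    text = text.lower()
    # Replace punctuation and newlines with spaces so adjacent words never merge
    text = text.translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace into single spaces and trim the ends
    return " ".join(text.split())

print(clean_text_general("Therefore, no form of life\ncan be found on the moon."))
# therefore no form of life can be found on the moon
```

One trade-off of this approach: apostrophes are also replaced, so a contraction such as "don't" would be split into two tokens.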

In [4]:
# Cleaning the text with function
cleaned_text=clean_text(text)

In [5]:
print(cleaned_text)

the moon is the satellite of the earth it moves round the earth it shines at night by light reflected from the sun it looks beautiful the bright moonlight is very soothing the earthly objects shine like silver in the moonlight we are fascinated by the enchanting beauty of the moon the moon is not as beautiful as it looks it seems to be lovely when it shines in the sky at night as a matter of fact it is devoid of plants and animals the moon is not a suitable place for plants and animals therefore no form of life can be found on the moon unlike the earth the moon has got no atmosphere therefore the lunar days are very hot and the lunar nights are intensely cold the moon looks beautiful from the earth but in fact it has up forbidding appearance it is full of rocks and craters when we look at the moon at night we see some dark spots on it these dark spots are dangerous rocks and craters the gravitational pull of the moon is less than that of the earth so it is difficult to walk on the surf

### Finding most frequent words

In [6]:
import spacy

In [7]:
# loading spacy model
nlp = spacy.load('en_core_web_sm')

In [8]:
# creating doc object
doc = nlp(cleaned_text)

In [9]:
# find word frequencies
words_dict = {}

# Add word-count pair to dict
for token in doc:
  # Check if word/token is already in dictionary
  if token.text in words_dict:
    # Increment count of word by 1
    words_dict[token.text]=words_dict[token.text]+1
  else:
    # Add word to dict with count 1
    words_dict[token.text]=1
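
The manual loop above is equivalent to what `collections.Counter` from the standard library does. A minimal sketch, assuming a plain whitespace split as a stand-in for the spaCy tokenizer and using a short excerpt of the cleaned text:

```python
from collections import Counter

# Assumption: a whitespace split replaces spaCy's tokenizer for this illustration.
sample = "the moon is the satellite of the earth it moves round the earth"
word_counts = Counter(sample.split())

# most_common(n) returns the n highest counts in descending order
print(word_counts.most_common(2))  # [('the', 4), ('earth', 2)]
```

With the notebook's `doc` object, the same idea would be `Counter(token.text for token in doc)`.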

In [10]:
# convert into dataframe for better handling
import pandas as pd

In [11]:
# create dataframe from dict
df = pd.DataFrame({'word':list(words_dict.keys()), 'count':list(words_dict.values())})

In [12]:
# sort dataframe in descending order
df.sort_values(by='count', ascending=False, inplace=True, ignore_index=True)

In [13]:
print('Shape=>', df.shape)
df.head(5)

Shape=> (151, 2)


| | word | count |
|---|------|-------|
| 0 | the | 47 |
| 1 | moon | 21 |
| 2 | it | 15 |
| 3 | of | 13 |
| 4 | to | 11 |


Summary:
* The dataframe has 151 rows, one per unique token (this is the number of distinct tokens, not the total token count).
* The 5 most frequent tokens are shown in the `df.head(5)` output above.
* Apart from "moon", the top tokens ("the", "it", "of", "to") are stop words that tell us nothing about the topic.

# What are Stop Words?

Stop words are the **most common** words in a language, mostly function words that hold a sentence together grammatically but carry little topical meaning on their own. In English these include, among others, `a, an, the, for, where, when, at`. They are often removed during text pre-processing because they add little to the meaning of a document.

Consider a sample sentence:
* String: "There is a pen on the table."
* Stopwords: ["There", "is", "a", "on", "the"]
* Meaningful words: ["pen", "table"]
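
The sample sentence can be filtered in a few lines of plain Python. This is only a sketch: the `stop_words` set below is hand-picked for this one sentence, whereas real stop word lists are much longer.

```python
# Hand-picked stop words for this one sentence (an assumption for illustration)
stop_words = {"there", "is", "a", "on", "the"}

sentence = "There is a pen on the table."
# Strip the trailing period and lowercase before comparing against the set
tokens = sentence.rstrip(".").lower().split()
meaningful = [t for t in tokens if t not in stop_words]

print(meaningful)  # ['pen', 'table']
```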

In [14]:
print(nlp.Defaults.stop_words)

{'somewhere', 'wherever', 'too', 'nor', 'those', 'latter', 'except', 'not', 'whereupon', 'each', 'less', 'her', 'every', 'a', "'ll", 'becomes', 'sometime', 'along', 'top', 'this', 'name', 'cannot', 'meanwhile', 'will', 'an', 'hers', 'seems', 'last', 'onto', 'since', 'you', 'n’t', 'us', 'can', 'toward', 'fifty', 'was', 'are', '’s', 'above', 'off', 'per', 'they', 'thereupon', 'became', 'already', 'without', 'though', 'only', 'amongst', 'serious', 'does', 'with', 'out', 'would', 'might', 'say', 'after', 'namely', 'often', 'such', 'among', 'who', 'whence', 'seemed', 'thereafter', 'part', 'n‘t', 'nothing', 'why', 'if', 'everywhere', 'two', "'s", 'unless', 'hence', 'give', 'myself', 'where', 'the', 'five', 'am', 'amount', 'he', 'always', 'everyone', 'mostly', 'one', 'whereby', 'once', 'when', 'take', 'itself', 'made', 'now', 'quite', 'via', 'call', 'none', 'really', 'therein', 'whither', 'elsewhere', 'ourselves', '‘ll', 'upon', 'than', 'using', 'herself', '‘s', 'we', 'behind', 'put', 'she', 

In [15]:
len(nlp.Defaults.stop_words)

326

Summary:
* spaCy provides 326 English stop words by default. This list may not suit every corpus.
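
Because the default list printed above is a Python set, adapting it to a corpus comes down to ordinary set operations. The sketch below uses a tiny stand-in set rather than the real spaCy list, and the corpus-specific addition is a hypothetical example.

```python
# A tiny stand-in for a library's default stop word set (assumption: the real
# default list printed above is far larger).
default_stop_words = {"the", "is", "of", "it", "not", "a"}

# Hypothetical corpus-specific addition: in a collection where every document
# is about the moon, "moon" no longer distinguishes one text from another.
corpus_specific = {"moon"}

# Set union keeps the defaults and adds the corpus-specific words
custom_stop_words = default_stop_words | corpus_specific

tokens = "the moon is the satellite of the earth".split()
print([t for t in tokens if t not in custom_stop_words])  # ['satellite', 'earth']
```

Set difference (`default_stop_words - {"not"}`) works the same way for words that should be kept.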

In [16]:
# getting words that are NOT stopwords
new_tokens = [token.text for token in doc if not token.is_stop]

In [17]:
print(new_tokens)

['moon', 'satellite', 'earth', 'moves', 'round', 'earth', 'shines', 'night', 'light', 'reflected', 'sun', 'looks', 'beautiful', 'bright', 'moonlight', 'soothing', 'earthly', 'objects', 'shine', 'like', 'silver', 'moonlight', 'fascinated', 'enchanting', 'beauty', 'moon', 'moon', 'beautiful', 'looks', 'lovely', 'shines', 'sky', 'night', 'matter', 'fact', 'devoid', 'plants', 'animals', 'moon', 'suitable', 'place', 'plants', 'animals', 'form', 'life', 'found', 'moon', 'unlike', 'earth', 'moon', 'got', 'atmosphere', 'lunar', 'days', 'hot', 'lunar', 'nights', 'intensely', 'cold', 'moon', 'looks', 'beautiful', 'earth', 'fact', 'forbidding', 'appearance', 'rocks', 'craters', 'look', 'moon', 'night', 'dark', 'spots', 'dark', 'spots', 'dangerous', 'rocks', 'craters', 'gravitational', 'pull', 'moon', 'earth', 'difficult', 'walk', 'surface', 'moon', 'moon', 'fascinated', 'man', 'beginning', 'life', 'earth', 'looked', 'wonder', 'poets', 'composed', 'beautiful', 'poems', 'moon', 'scientists', 'tried

In [18]:
new_words_dict = {}

# Add word-count pair to dict
for token in new_tokens:
  # Check if word is already in dict
  if token in new_words_dict:
    # Increment count of word by 1
    new_words_dict[token] = new_words_dict[token]+1
  else:
    # Add the word to dict with count 1
    new_words_dict[token] = 1

In [19]:
# create a dataframe from dict
new_df = pd.DataFrame({'word':list(new_words_dict.keys()), 'count':list(new_words_dict.values())})

In [20]:
# Sorting dataframe in descending order
new_df.sort_values(by='count', ascending=False, inplace=True, ignore_index=True)

In [21]:
print('Shape=>', new_df.shape)
new_df.head(5)

Shape=> (97, 2)


| | word | count |
|---|------|-------|
| 0 | moon | 21 |
| 1 | earth | 9 |
| 2 | life | 4 |
| 3 | beautiful | 4 |
| 4 | looks | 3 |


Summary:
* After removing stop words, the dataframe has 97 rows (unique words), down from 151.
* None of the top 5 most frequent words is a stop word now.
* From these words we can predict that the text is about the beauty of the moon as seen from the earth.
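
The whole procedure (clean, tokenize, drop stop words, count) can be condensed into one small function. This is a standard-library sketch, not the notebook's spaCy pipeline: `top_words` and `STOP_WORDS` are hypothetical names, and the stop word set is a small illustrative subset rather than the 326-word spaCy list.

```python
import re
from collections import Counter

# Small illustrative stop word set (assumption; far shorter than spaCy's list)
STOP_WORDS = {"the", "is", "of", "it", "at", "by", "from", "in", "a", "and", "are", "very"}

def top_words(text, n=3, stop_words=STOP_WORDS):
    # Lowercase and keep only alphabetic runs as tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words, then count what remains
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

excerpt = ("The moon is the satellite of the earth. It moves round the earth. "
           "The moon is not a suitable place for plants and animals.")
print(top_words(excerpt, n=2))
```

On this excerpt the two most frequent remaining words are "moon" and "earth", each occurring twice.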

# When to remove Stopwords?
Removal tends to help when the task depends mainly on which content words appear:
* Text Classification
* Caption Generation
* Auto-Tag Generation

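# Don't Remove Stopwords!!!!
Keep stop words when the task needs complete, grammatical sentences or depends on small words such as "not":
* Machine Translation
* Language Modeling
* Text Summarization
* Question-Answering Problems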
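
Negation is one reason stop words matter for the tasks listed above. The sketch below uses a small illustrative stop word set (an assumption); like the spaCy list printed earlier, it contains "not". After removal, two sentences with opposite meanings become identical.

```python
# Illustrative subset of stop words (assumption); it deliberately includes "not"
stop_words = {"the", "is", "not", "as", "it"}

def remove_stop_words(sentence):
    # Lowercase, split on whitespace, and keep only non-stop words
    return " ".join(w for w in sentence.lower().split() if w not in stop_words)

print(remove_stop_words("The moon is beautiful"))      # moon beautiful
print(remove_stop_words("The moon is not beautiful"))  # moon beautiful
```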