--> Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.

--> What are Stop Words? Stop words are the most frequently used words in a natural language; in English, examples include in, on, a, an, and the. When analyzing text, these words often add little to the meaning, so they can be filtered out to focus on the more informative words.
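
To make the "most frequently used" claim concrete, here is a minimal sketch that counts word frequencies using only the Python standard library. The sample sentence and the naive regex tokenizer are assumptions made for illustration; they are not part of the original notebook and do not use spaCy.

```python
import re
from collections import Counter

# Illustrative sample text (an assumption, not taken from any dataset used here)
sample = (
    "The model is trained on the data, and the data is cleaned "
    "before the model is evaluated on the test set."
)

# Naive tokenizer: lowercase the text and keep alphabetic runs only
words = re.findall(r"[a-z]+", sample.lower())

# Count occurrences and look at the two most frequent words
counts = Counter(words)
top_two = counts.most_common(2)
print(top_two)
```

In this sample the two most frequent words are "the" and "is", which say little about the topic, while the informative words ("model", "data") occur less often.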

Q. When should I not remove stop words?
- This is a good movie ---> remove stop words ---> good movie
- This is not a good movie ---> remove stop words ---> good movie

Both sentences reduce to the same text, so the negation (and with it the sentiment) is lost.

- In language translation, if stop words are removed in the pre-processing stage, what remains is a sentence without stop words that may make no sense.

- Chat bot, Q&A System
- Language Translation
- Any case where valuable information is lost
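
One common way to limit this kind of information loss is to protect a few meaning-bearing words from removal. The sketch below shows the idea in plain Python under stated assumptions: the tiny `STOP` set, the `KEEP` set, and the helper `remove_stop_words` are all hypothetical names chosen for illustration, and the whitespace split stands in for a real tokenizer.

```python
# Small illustrative stop list (an assumption; real stop lists are much larger)
STOP = {"this", "is", "a", "an", "the", "not"}

# Words we choose to protect because they carry sentiment (hypothetical choice)
KEEP = {"not", "no", "never"}


def remove_stop_words(text, stop_words=STOP, keep=KEEP):
    """Drop stop words from whitespace-split text, except protected words."""
    kept = [
        word
        for word in text.split()
        if word.lower() not in stop_words or word.lower() in keep
    ]
    return " ".join(kept)


# Without protection the negation disappears; with protection it survives
plain = remove_stop_words("this is not a good movie", keep=set())
protected = remove_stop_words("this is not a good movie")
print(plain)
print(protected)
```

With an empty `keep` set the sentence collapses to "good movie"; with the protected set it becomes "not good movie", so the two reviews stay distinguishable.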

In [1]:
import spacy

from spacy.lang.en.stop_words import STOP_WORDS


In [2]:
len(STOP_WORDS)

326

In [3]:
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [4]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [5]:
def preprocess(text):
    doc = nlp(text)
    
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)   

In [6]:
preprocess("Musk wants time to prepare for a trial over his")

'Musk wants time prepare trial'

In [7]:
preprocess("The other is not other but your divine brother")

'divine brother'

Removing stop words from a text column of a pandas DataFrame

In [14]:
import pandas as pd

df = pd.read_json("doj_press.json",lines=True)


In [15]:
df.tail()

Unnamed: 0,id,title,contents,date,topics,components
13082,16-735,Yuengling to Upgrade Environmental Measures to...,The Department of Justice and the U.S. Environ...,2016-06-23T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
13083,10-473,Zarein Ahmedzay Pleads Guilty to Terror Violat...,The Justice Department announced that Zarein...,2010-04-23T00:00:00-04:00,[],[Office of the Attorney General]
13084,17-045,Zimmer Biomet Holdings Inc. Agrees to Pay $17....,Subsidiary Agrees to Plead Guilty to Violating...,2017-01-12T00:00:00-05:00,[Foreign Corruption],"[Criminal Division, Criminal - Criminal Fraud ..."
13085,17-252,ZTE Corporation Agrees to Plead Guilty and Pay...,ZTE Corporation has agreed to enter a guilty p...,2017-03-07T00:00:00-05:00,"[Asset Forfeiture, Counterintelligence and Exp...","[National Security Division (NSD), USAO - Texa..."
13086,17-304,ZTE Corporation Pleads Guilty for Violating U...,ZTE Corporation pleaded guilty today to conspi...,2017-03-22T00:00:00-04:00,[Counterintelligence and Export Control],"[National Security Division (NSD), USAO - Texa..."


In [16]:
df.head()

Unnamed: 0,id,title,contents,date,topics,components
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,[],[National Security Division (NSD)]
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,[],[Environment and Natural Resources Division]
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,[],[Environment and Natural Resources Division]
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,[],[Environment and Natural Resources Division]
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]


In [17]:
df.shape

(13087, 6)

In [25]:
#type(df.topics[0])

In [18]:
# Filtering out those rows that do not have any topics associated with the case

df = df[df["topics"].str.len() != 0]
df.head()

Unnamed: 0,id,title,contents,date,topics,components
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division]
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U..."
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division]
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]"


In [19]:
df.shape

(4688, 6)

In [20]:
df = df.head(100).copy()   # .copy() avoids pandas' SettingWithCopyWarning when a column is added below
df.shape

(100, 6)

In [21]:
df["contents_new"] = df.contents.apply(preprocess)

In [22]:
df

Unnamed: 0,id,title,contents,date,topics,components,contents_new
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division],"U.S. Department Justice , U.S. Environmental P..."
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division],131 - count criminal indictment unsealed today...
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U...",United States Attorney Office Middle District ...
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division],"21st Century Oncology LLC , agreed pay $ 19.75..."
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]",21st Century Oncology Inc. certain subsidiarie...
...,...,...,...,...,...,...,...
316,15-1359,Alaska Plastic Surgeon Convicted of Wire Fraud...,Doctor Hid Millions in Secret Accounts in Pana...,2015-11-04T00:00:00-05:00,[Tax],[Tax Division],Doctor Hid Millions Secret Accounts Panama Ala...
318,16-396,Alaska Plastic Surgeon Sentenced to Prison for...,Defendant Concealed Bank Accounts in Panama an...,2016-04-04T00:00:00-04:00,[Tax],[Tax Division],Defendant Concealed Bank Accounts Panama Costa...
321,17-736,Alaskan Commercial Fishing Couple Charged with...,An Alaskan couple was charged in federal court...,2017-07-26T00:00:00-04:00,[Tax],"[Tax Division, USAO - Alaska]","Alaskan couple charged federal court Juneau , ..."
322,18-717,Alaskan Husband And Wife Plead Guilty To Willf...,A husband and wife pleaded guilty yesterday to...,2018-06-01T00:00:00-04:00,[Tax],[Tax Division],husband wife pleaded guilty yesterday counts w...


In [26]:
len(df.contents[4])

6286

In [27]:
len(df.contents_new[4])

4810
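
As a quick worked calculation, the two lengths reported above can be turned into a relative reduction. This is only arithmetic on the numbers already shown; the variable names are chosen for illustration.

```python
original_length = 6286   # len(df.contents[4]) as reported above
cleaned_length = 4810    # len(df.contents_new[4]) as reported above

# Absolute and relative reduction in character count
removed = original_length - cleaned_length
reduction_pct = removed / original_length * 100
print(f"{removed} characters removed ({reduction_pct:.1f}% shorter)")
```

Because `preprocess` joins tokens with single spaces, punctuation ends up surrounded by spaces in the cleaned text (for example `Justice ,`), so the character count slightly understates how many words were dropped.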

In [28]:
df.contents[4][:300]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin conta'

In [29]:
df.contents_new[4][:300]

'U.S. Department Justice , U.S. Environmental Protection Agency ( EPA ) , Rhode Island Department Environmental Management ( RIDEM ) announced today subsidiaries Stanley Black & Decker Inc.—Emhart Industries Inc. Black & Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor'


Examples where removing stop words can create a problem

# (1) Sentiment detection: depending on your dataset, removing stop words can in some cases change the sentiment of a sentence


In [30]:
preprocess("this is a good movie")

'good movie'

In [31]:
preprocess("this is not a good movie")


'good movie'

# (2) Language translation: Say you want to translate the following sentence from English to Telugu. If you remove stop words first and then translate, the result will be very poor, because most of the sentence is gone

In [32]:
preprocess("how are you doing dhaval?")

'dhaval ?'

# (3) Chat bot or any Q&A system

In [33]:
preprocess("I don't find yoga mat on your website. Can you help?")

'find yoga mat website . help ?'

# Assignment

In [34]:
import spacy
nlp = spacy.load("en_core_web_sm")



Exercise1:

    From a given text, count the number of stop words in it.
    Print the percentage of stop word tokens compared to all tokens in the given text.



In [35]:
text = '''
Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and 
distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).
The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,
Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),
Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.
'''

#step1: Create the object 'doc' for the given text using nlp()
doc = nlp(text)


#step2: define the variables to keep track of stopwords count and total words count
stop_words_count = 0
total_words_count = 0


#step3: iterate through all the tokens in the document (punctuation and whitespace tokens are counted too)
for token in doc:
  if token.is_stop:         #increment the stop word counter when the token is a stop word
    stop_words_count += 1
  total_words_count +=  1   #every token counts towards the total


#step4: print the count of stop words
print(f"Total stop words present in the given text: {stop_words_count}")


#step5: print the percentage of stop words compared to total words in the text
percentage_stop_words = (stop_words_count / total_words_count) * 100
print(f"Percentage of stop words present in the given text: {percentage_stop_words} %")

Total stop words present in the given text: 40
Percentage of stop words present in the given text: 25.0 %
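
The exercise counts every token, punctuation included, and divides without checking for an empty text. The sketch below shows how much the percentage depends on that choice. It is plain Python under stated assumptions: the small `STOP` set and the regex tokenizer are stand-ins for spaCy, and `stop_word_share` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop list (an assumption; not spaCy's STOP_WORDS)
STOP = {"it", "is", "the", "to", "and", "in"}


def stop_word_share(text, stop_words=STOP, count_punctuation=True):
    """Return (stop_count, total_count, percentage) for a text."""
    # Words and individual punctuation marks become separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if not count_punctuation:
        # Keep only word-like tokens
        tokens = [t for t in tokens if re.fullmatch(r"\w+", t)]
    stop_count = sum(t in stop_words for t in tokens)
    total = len(tokens)
    # Guard against division by zero for empty input
    pct = stop_count / total * 100 if total else 0.0
    return stop_count, total, pct


sentence = "It is the sequel to Thor: Ragnarok."
with_punct = stop_word_share(sentence)
without_punct = stop_word_share(sentence, count_punctuation=False)
print(with_punct)
print(without_punct)
```

For this sentence the same four stop words make up 4 of 9 tokens when punctuation is counted and 4 of 7 when it is not, so it is worth stating which denominator a reported percentage uses.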




Exercise2:

    spaCy's default stop word list includes "not". In some scenarios, however, removing "not" completely changes the meaning of the text. For example, consider these two statements:

    - this is a good movie       ----> Positive Statement
    - this is not a good movie   ----> Negative Statement

    After stop word removal, both texts become "good movie", so the polarity/sentiment of the original text is lost.

    Your task is to stop spaCy from treating "not" as a stop word, so that the two texts can be distinguished.

    Hint: GOOGLE IT! Google is your friend.



In [36]:
#use this pre-processing function to remove all the stop words from a text and return the cleaned string
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)       


#step1: unmark 'not' as a stop word in spaCy's vocabulary
nlp.vocab['not'].is_stop = False


#step2: send the two texts given above into the pre-process function and store the transformed texts
positive_text = preprocess('this is a good movie')
negative_text = preprocess('this is not a good movie')


#step3: finally print those 2 transformed texts
print(f"Text1: {positive_text}")
print(f"Text2: {negative_text}")


Text1: good movie
Text2: not good movie




Exercise3:

    From a given text, output the most frequently used token after removing all stop word tokens and punctuation.



In [37]:
text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.
It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,
One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the 
first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be
granted test cricket status.
'''


#step1: Create the object 'doc' for the given text using nlp()
doc = nlp(text)


#step2: remove all the stop words and punctuation and store the remaining tokens in a new list
remaining_tokens = []
for token in doc:
  if token.is_stop or token.is_punct:    #skip the token if it is a stop word or punctuation
    continue
  remaining_tokens.append(token.text)


#step3: create a new dictionary and count how often each stored token occurs
frequency_tokens = {}
for token in remaining_tokens:
  if token != '\n' and token != ' ':      #spaCy treats new lines and extra spaces as separate tokens, so ignore them
    if token not in frequency_tokens:     #if a particular token occurs for the first time, initialise its count to 1
      frequency_tokens[token] = 1
    else:
      frequency_tokens[token] += 1        #if a particular token is already present, increment its existing count by 1


#step4: get the maximum frequency word
max_freq_word = max(frequency_tokens.keys(), key=(lambda key: frequency_tokens[key]))


#step5: finally print the result
print(f"Maximum frequency word: {max_freq_word}") 

Maximum frequency word: India
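
Steps 3 and 4 can also be written more compactly with `collections.Counter` from the standard library. The sketch below is an alternative, not the notebook's own solution, and the hand-written `remaining_tokens` list is an assumed stand-in for the tokens that spaCy would leave after removing stop words and punctuation.

```python
from collections import Counter

# Assumed sample of leftover tokens, including whitespace tokens to be ignored
remaining_tokens = ["India", "cricket", "team", "\n", "India", "Cricket", " ", "India", "team"]

# Skip whitespace-only tokens, then count the rest
counts = Counter(tok for tok in remaining_tokens if tok.strip())

# most_common(1) returns a list with the single (token, count) pair of highest count
word, freq = counts.most_common(1)[0]
print(f"Maximum frequency word: {word} ({freq} occurrences)")
```

As in the original loop, "cricket" and "Cricket" are counted separately here; lowercasing the tokens before counting would merge them and could change which word comes out on top.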
