# Stop Words Tutorial

Stop words are extremely common words, like "the," "a," "is," and "in," that are often filtered out during text processing because they are grammatically necessary but provide little semantic meaning.

In [1]:
import spacy

from spacy.lang.en.stop_words import STOP_WORDS

len(STOP_WORDS)

326
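`STOP_WORDS` is an ordinary Python `set` of lowercase strings, so membership can be checked directly without loading a model. The sketch below is only an illustration: the sample words are picked arbitrarily, and the size of the set may differ between spaCy versions.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain set, so `in` is a cheap membership lookup.
samples = ["the", "is", "not", "movie"]  # arbitrary words chosen for illustration
flags = {word: word in STOP_WORDS for word in samples}

for word, is_stop in flags.items():
    print(f"{word!r}: {is_stop}")
```

Note that "not" is part of the set, which matters for the sentiment example later in this tutorial.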

In [2]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
  if token.is_stop:
    print(token)

We
just
our
the
part
is


### Removing stop words

In [3]:
def preprocess(text):
  doc = nlp(text)

  no_stop_words = [token.text for token in doc if not token.is_stop]

  return " ".join(no_stop_words)

In [5]:
preprocess("Hamilton wants time to prepare for a trial over his")

'Hamilton wants time prepare trial'

In [6]:
preprocess("The other is not other but your divine brother")

'divine brother'

### Remove stop words from a pandas DataFrame text column

The dataset comes from Kaggle (https://www.kaggle.com/datasets/jbencina/department-of-justice-20092018-press-releases). It contains press releases about court cases from the Department of Justice (DOJ). The releases cover topics such as outcomes of criminal cases, notable actions taken against felons, and other updates about the current administration.

In [14]:
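import json
import pandas as pd

# The file is in JSON Lines format: one JSON object per line.
data = []
with open("combined.json", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            data.append(json.loads(line))
        except json.JSONDecodeError as e:
            # A malformed (for example, truncated) line is reported and skipped.
            print(f"❌ Skipping line {i} due to JSON decoding error: {e}")

print(f"Successfully loaded {len(data)} lines into the DataFrame.")
df = pd.DataFrame(data)
print(df.shape)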

❌ Skipping line 2632 due to JSON decoding error: Unterminated string starting at: line 1 column 197 (char 196)
Successfully loaded 2631 lines into the DataFrame.
(2631, 6)
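The skip-on-error loading pattern above can be tried without the real file. The following is a minimal sketch using only the standard library; `load_json_lines` and the three-line sample are hypothetical stand-ins, not part of the dataset or of any library.

```python
import io
import json


def load_json_lines(stream):
    """Parse one JSON object per line, skipping lines that fail to decode."""
    records, skipped = [], []
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped.append(line_number)  # remember which line was dropped
    return records, skipped


# Hypothetical sample: the last line is cut off in the middle of a string.
sample = io.StringIO(
    '{"id": "1", "topics": []}\n'
    '{"id": "2", "topics": ["Environment"]}\n'
    '{"id": "3", "title": "Unterminated\n'
)
records, skipped = load_json_lines(sample)
print(len(records), skipped)
```

Returning the skipped line numbers instead of only printing them makes it easy to check afterwards how much of the file was lost.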


In [16]:
df.head(5)

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |


Filter out the rows that have no topics associated with the case.

In [17]:
df = df[df["topics"].str.len() != 0]
df.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] |


In [18]:
df.shape

(917, 6)

In [19]:
df["contents_new"] = df.contents.apply(preprocess)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["contents_new"] = df.contents.apply(preprocess)
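The message above is pandas' `SettingWithCopyWarning`: `df` was produced by filtering another DataFrame, so pandas cannot tell whether the new column is meant for the filtered result or for the original. A common way to avoid it is to call `.copy()` on the filtered frame. The sketch below uses a hypothetical three-row frame, and `str.upper` stands in for `preprocess` so that it runs without spaCy; whether the warning appears at all depends on the pandas version.

```python
import pandas as pd

# Hypothetical miniature of the press-release frame.
df_all = pd.DataFrame(
    {
        "contents": ["The first release", "The second release", "A third release"],
        "topics": [[], ["Environment"], ["Consumer Protection"]],
    }
)

# .copy() makes the filtered frame independent of df_all,
# so adding a column to it is unambiguous.
df_topics = df_all[df_all["topics"].str.len() != 0].copy()

# Stand-in for df.contents.apply(preprocess).
df_topics["contents_new"] = df_topics["contents"].str.upper()

print(df_topics.shape)
```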


In [20]:
df

| | id | title | contents | date | topics | components | contents_new |
|---|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] | U.S. Department Justice , U.S. Environmental P... |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] | 131 - count criminal indictment unsealed today... |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... | United States Attorney Office Middle District ... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] | 21st Century Oncology LLC , agreed pay $ 19.75... |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] | 21st Century Oncology Inc. certain subsidiarie... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2626 | 18-389 | District Court Enters Permanent Injunction Aga... | A federal court permanently enjoined a Walton,... | 2018-04-02T00:00:00-04:00 | [Consumer Protection, Health Care Fraud] | [Civil Division, USAO - New York, Northern] | federal court permanently enjoined Walton , Ne... |
| 2627 | 13-139 | District Court Enters Permanent Injunction Aga... | U.S. District Court Judge Lesley Wells entered... | 2013-01-31T00:00:00-05:00 | [Consumer Protection] | [Civil Division] | U.S. District Court Judge Lesley Wells entered... |
| 2628 | 15-149 | District Court Enters Permanent Injunction Aga... | The U.S. District Court for the District of Or... | 2015-02-06T00:00:00-05:00 | [Consumer Protection] | [Civil Division] | U.S. District Court District Oregon entered pe... |
| 2629 | 13-1365 | District Court Enters Permanent Injunction Aga... | \nU.S. District Court Judge Kim R. Gibson of t... | 2013-12-26T00:00:00-05:00 | [Consumer Protection] | [Civil Division] | \n U.S. District Court Judge Kim R. Gibson Wes... |


In [21]:
len(df.contents[4])

6286

In [22]:
len(df.contents_new[4])

4810
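The two lengths above are character counts, not word counts. As a quick worked calculation, they can be turned into a relative reduction; the variable names here are made up for illustration.

```python
original_length = 6286  # len(df.contents[4]) from the output above
filtered_length = 4810  # len(df.contents_new[4]) from the output above

removed = original_length - filtered_length
reduction_pct = 100 * removed / original_length

print(f"{removed} characters removed ({reduction_pct:.1f}% shorter)")
```

Because `preprocess` joins tokens with spaces, punctuation tokens gain extra surrounding spaces, so the drop in word count is likely somewhat larger than this character-based figure suggests.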

In [23]:
df.contents[4][:300]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin conta'

In [24]:
df.contents_new[4][:300]

'U.S. Department Justice , U.S. Environmental Protection Agency ( EPA ) , Rhode Island Department Environmental Management ( RIDEM ) announced today subsidiaries Stanley Black & Decker Inc.—Emhart Industries Inc. Black & Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor'
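The filtered text above still contains punctuation as separate, space-padded tokens (for example `Justice ,`). If punctuation is not needed downstream, it can be dropped in the same pass using the token attributes `is_punct` and `is_space`. This is a sketch under assumptions: `preprocess_no_punct` is a hypothetical variant of `preprocess`, the sample sentence is made up, and a blank English pipeline is used so the example runs without downloading `en_core_web_sm`.

```python
import spacy

# A blank English pipeline provides the tokenizer and stop-word flags only;
# it is assumed here so the sketch runs without a downloaded model.
nlp_blank = spacy.blank("en")


def preprocess_no_punct(text):
    """Drop stop words, punctuation, and whitespace tokens."""
    doc = nlp_blank(text)
    kept = [
        token.text
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    return " ".join(kept)


cleaned = preprocess_no_punct(
    "The Department of Justice, the EPA, and RIDEM announced today."
)
print(cleaned)
```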

### Examples where removing stop words can create a problem

1. Sentiment detection: Removing stop words can, depending on your dataset, change the sentiment of a sentence. In the example below, dropping "not" makes a negative review identical to a positive one.

In [25]:
preprocess("This was a good movie")

'good movie'

In [26]:
preprocess("This was not a good movie")

'good movie'
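One possible way around this is to protect negation words while still removing the other stop words. The sketch below is not the tutorial's own method: `NEGATIONS` and `preprocess_keep_negation` are hypothetical names, the word list is a small hand-picked assumption, and a blank English pipeline is used so that no model download is needed.

```python
import spacy

# Blank pipeline assumed so the sketch runs without a downloaded model.
nlp_blank = spacy.blank("en")

# Hypothetical, hand-picked list of negation words to protect.
NEGATIONS = {"not", "no", "never", "n't"}


def preprocess_keep_negation(text):
    """Remove stop words but keep the negations listed in NEGATIONS."""
    doc = nlp_blank(text)
    kept = [
        token.text
        for token in doc
        if not token.is_stop or token.lower_ in NEGATIONS
    ]
    return " ".join(kept)


positive = preprocess_keep_negation("This was a good movie")
negative = preprocess_keep_negation("This was not a good movie")
print(positive)
print(negative)
```

With the negation kept, the two reviews no longer collapse to the same string.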

2. Language translation: Suppose you want to translate the following sentence from English to Portuguese. If you remove stop words before translating, almost nothing of the sentence is left and the translation will be poor.

In [28]:
preprocess("How are you doing branco?")

'branco ?'

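3. Chatbot or any Q&A system: Removing stop words drops the negation in "don't", so the user's complaint that an item is missing reads like an ordinary search query.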

In [29]:
preprocess("I don't find yoga mat on your website. Can you help?")

'find yoga mat website . help ?'
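For cases like this, an alternative to filtering in Python is to clear the stop-word flag on selected vocabulary entries, so that `token.is_stop` itself returns `False` for them. The following is a sketch under assumptions: the list of words is hand-picked, and a blank English pipeline stands in for `en_core_web_sm`. The flag is stored per exact spelling, so capitalised variants are listed separately.

```python
import spacy

# Blank pipeline assumed so the sketch runs without a downloaded model.
nlp_blank = spacy.blank("en")

# Clear the stop-word flag on negation forms. The flag lives on each
# vocabulary entry, so "Not" and "not" have to be handled separately.
for word in ("n't", "not", "Not", "no", "No"):
    nlp_blank.vocab[word].is_stop = False

doc = nlp_blank("I don't find yoga mat on your website. Can you help?")
kept = [token.text for token in doc if not token.is_stop]
print(" ".join(kept))
```

This changes the behavior of that pipeline object for every text processed afterwards, so it is best done once, right after loading the pipeline. Whether stop words should be removed at all for a chatbot is a separate decision that depends on the task.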