__Stopwords__

In Natural Language Processing, stopwords refer to common words that are often removed from text data during preprocessing to reduce the noise and size of the data. These are typically high-frequency words such as articles (e.g., 'the', 'a', 'an'), conjunctions (e.g., 'and', 'or', 'but'), prepositions (e.g., 'in', 'on', 'at'), pronouns (e.g., 'he', 'she', 'it'), and other words that do not carry significant meaning on their own.

Stopwords are often removed because they occur in almost every document and therefore do little to distinguish one text from another. In downstream NLP tasks such as text classification, sentiment analysis, and information retrieval, they can dominate word counts, which makes it harder for models to identify the most informative words in a sentence and can lead to noisy or irrelevant results.
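
As a rough illustration of how high-frequency function words crowd out content words in raw counts, the sketch below counts tokens before and after filtering. The `STOPWORDS` set and the sample sentence are hand-picked assumptions for this example, not a list taken from any library.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not a library's list).
STOPWORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "it", "is", "of"}

text = "the cat sat on the mat and the dog sat in the sun"
tokens = text.split()

# Most frequent tokens before filtering.
before = Counter(tokens).most_common(3)

# Drop stopwords, then count again.
content_tokens = [t for t in tokens if t not in STOPWORDS]
after = Counter(content_tokens).most_common(3)

print(before)
print(after)
```

Before filtering, 'the' is the most frequent token; after filtering, only content words remain in the counts.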

Stopwords can be language-specific, and different languages may have different sets of stopwords. Some NLP libraries like NLTK and spaCy provide pre-defined sets of stopwords for different languages, which can be used to remove these words from text data during preprocessing.
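
To make the language dependence concrete, here is a minimal sketch with two miniature, hand-picked lists. `STOPWORDS_BY_LANG` and `remove_stopwords` are hypothetical names introduced for illustration; the lists shipped with NLTK and spaCy are much longer.

```python
# Hand-picked miniature lists for illustration only (assumption);
# real library lists are much longer.
STOPWORDS_BY_LANG = {
    "en": {"the", "a", "an", "and", "or", "in", "on", "at"},
    "de": {"der", "die", "das", "und", "oder", "in", "an", "auf"},
}

def remove_stopwords(tokens, lang):
    """Return the tokens that are not in the stopword list for `lang`."""
    stop_words = STOPWORDS_BY_LANG[lang]
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["die", "another", "day"]
print(remove_stopwords(tokens, "en"))  # "die" is kept: a content word in English
print(remove_stopwords(tokens, "de"))  # "die" is dropped: an article in German
```

The same token can be a stopword in one language and a content word in another, so the list should match the language of the text.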

In [None]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
import nltk

In [None]:
# Print the set of spaCy's default stop words
print(nlp.Defaults.stop_words)

In [None]:
len(nlp.Defaults.stop_words)

In [None]:
# To see if a word is a stop word

In [None]:
nlp.vocab['myself'].is_stop

In [None]:
nlp.vocab['mystery'].is_stop

In [None]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('mystery')

In [None]:
# Set the stop_word tag on the lexeme
nlp.vocab['mystery'].is_stop = True

In [None]:
len(nlp.Defaults.stop_words)

In [None]:
nlp.vocab['mystery'].is_stop

In [None]:
# Remove a stop word
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [None]:
len(nlp.Defaults.stop_words)

In [None]:
nlp.vocab['beyond'].is_stop
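
The cells above update the stop word set and the lexeme flag in two separate steps. One possible way to keep them in sync is a small helper like the sketch below. `set_stop_word` is a hypothetical name, not part of spaCy; it only uses the `nlp.Defaults.stop_words` set and the `is_stop` lexeme attribute shown above, and it assumes `nlp` is a loaded spaCy pipeline.

```python
def set_stop_word(nlp, word, is_stop=True):
    """Add or remove `word` from the stop word set and update its lexeme flag.

    `nlp` is assumed to be a loaded spaCy pipeline. The word is lowercased
    first, following the "use lowercase" note above.
    """
    word = word.lower()
    if is_stop:
        nlp.Defaults.stop_words.add(word)
    else:
        # discard() does not raise if the word is absent, unlike remove()
        nlp.Defaults.stop_words.discard(word)
    nlp.vocab[word].is_stop = is_stop
    return nlp.vocab[word].is_stop
```

With the pipeline loaded earlier, `set_stop_word(nlp, 'mystery')` would mark the word as a stop word and `set_stop_word(nlp, 'beyond', False)` would unmark it.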

In [None]:
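import string
import re
import nltk
nltk.download('punkt')      # tokenizer data used by word_tokenize
                            # (newer NLTK releases may need 'punkt_tab' instead)
nltk.download('stopwords')  # stop word lists used by nltk.corpus.stopwords
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords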

# Load data
text = 'The Quick brown fox jump over the lazy dog!'

In [None]:
# Split into words
tokens = word_tokenize(text)
print(tokens)

In [None]:
# Convert to lower case
tokens = [w.lower() for w in tokens]
print(tokens)

In [None]:
# Prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
print(re_punc)

In [None]:
# Remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]
print(stripped)

In [None]:
# Remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
print(words)

In [None]:
# Filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words)
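
The steps above (tokenize, lowercase, strip punctuation, keep alphabetic tokens, drop stop words) can be collected into one function. This is a minimal sketch: `clean_tokens` and `demo_stop_words` are hypothetical names, and the default tokenizer is plain whitespace splitting, used here as a stand-in so the example runs without NLTK data.

```python
import re
import string

# Same character class as above: matches any single ASCII punctuation character.
_PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

def clean_tokens(text, stop_words, tokenize=str.split):
    """Tokenize, lowercase, strip punctuation, and drop stop words."""
    tokens = tokenize(text)
    tokens = [w.lower() for w in tokens]
    stripped = [_PUNCT_RE.sub('', w) for w in tokens]
    words = [w for w in stripped if w.isalpha()]
    return [w for w in words if w not in stop_words]

# Small stand-in stop word set for the demo (assumption, not NLTK's list).
demo_stop_words = {"the", "over"}
print(clean_tokens('The Quick brown fox jump over the lazy dog!', demo_stop_words))
```

In the notebook, the same function could be called with `tokenize=word_tokenize` and `stop_words=set(stopwords.words('english'))`.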

In [None]:
nlp.vocab['quick'].is_stop

In [None]:
nlp.vocab['brown'].is_stop

In [None]:
nlp.vocab['jump'].is_stop
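
Instead of checking words one at a time, the same flag can be read from each token of a processed text. The sketch below assumes `doc` is a spaCy `Doc` (or any iterable of tokens exposing `text`, `is_stop`, and `is_punct`); `content_words` is a hypothetical helper name.

```python
def content_words(doc):
    """Return the text of tokens that are neither stop words nor punctuation."""
    return [token.text for token in doc if not token.is_stop and not token.is_punct]
```

With the pipeline loaded earlier, this could be called as `content_words(nlp('The Quick brown fox jump over the lazy dog!'))`. The result depends on the current stop word set, including any words added or removed above.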