# Removing Stopwords

Stopwords are very common words, such as "the," "is," "in," and "and," that are often filtered out of text during preprocessing because they carry little meaning on their own.

## Why Remove Stopwords?

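Removing stopwords shrinks the vocabulary, which reduces the dimensionality of features such as bag-of-words vectors and can make many NLP tasks faster and more effective by focusing on more informative words. It is not always a win: words like "not" appear in some stopword lists but carry meaning in tasks such as sentiment analysis, so the list should be checked against the task.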
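
To make the dimensionality point concrete, here is a minimal sketch that counts tokens and distinct words before and after filtering. The tiny `STOP_WORDS` set and the sample sentence are assumptions made up for illustration; they are not any library's list.

```python
# Hypothetical mini stopword list, for illustration only
STOP_WORDS = {"the", "is", "in", "and", "a", "of", "on"}

# Made-up sample sentence
text = "the cat is in the garden and the dog is on the roof of a house"

# Whitespace tokenization is enough for this lowercase, punctuation-free text
tokens = text.split()
filtered = [t for t in tokens if t not in STOP_WORDS]

print("Tokens before:", len(tokens))
print("Tokens after:", len(filtered))
print("Vocabulary before:", len(set(tokens)))
print("Vocabulary after:", len(set(filtered)))
```

In this toy case 16 tokens shrink to 5, and the vocabulary drops from 12 distinct words to 5. Real corpora will show a different ratio.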

## Libraries for Stopword Removal

Several libraries provide built-in stopword lists:
- **NLTK**: Natural Language Toolkit
- **spaCy**: Industrial-strength NLP
- **Scikit-learn**: Machine learning library with a built-in English stopword list
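
Whichever library supplies the list, it can be treated as a collection of strings and adapted with ordinary set operations. The sketch below is an assumption-laden illustration: `base_stop_words`, `keep`, and `extra` are hypothetical names and contents, standing in for a library-provided list and a project's own adjustments.

```python
# Hypothetical base list standing in for a library-provided one
base_stop_words = {"the", "is", "in", "and", "not", "no"}

# Words to keep even though the base list contains them (e.g. negations)
keep = {"not", "no"}

# Extra filler words to drop (made up for this example)
extra = {"rt", "via"}

# Remove the protected words, then add the extras
custom_stop_words = (base_stop_words - keep) | extra

tokens = ["the", "movie", "is", "not", "good", "via", "twitter"]
filtered = [t for t in tokens if t not in custom_stop_words]

print(filtered)
```

Here "not" survives the filter while "via" is removed, which is the kind of adjustment a task-specific list usually needs.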

### Removing Stopwords with NLTK

In [1]:
# Import the necessary libraries from NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopwords corpus and the Punkt tokenizer data if not already installed
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

# Define a sample text
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP and data science."

# Tokenize the text into words
words = word_tokenize(text)

# Load the list of stopwords in English
stop_words = set(stopwords.words('english'))

# Remove stopwords (the NLTK list is lowercase, so compare lowercased tokens)
filtered_words = [word for word in words if word.lower() not in stop_words]

# Print the filtered list of words
print("Filtered Words:", filtered_words)

Filtered Words: ['Hello', ',', 'Farzad', 'Asgari', ',', 'welcome', 'NLPy', 'course', '.', 'learn', 'lot', 'NLP', 'data', 'science', '.']
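
The filtered list above still contains punctuation tokens such as ',' and '.', because they are not in the stopword list. One common follow-up, sketched here as an assumption rather than as part of the original walkthrough, is to also drop tokens made up entirely of punctuation characters. The list is copied from the output above so the snippet runs without NLTK; `words_only` is a name chosen for this example.

```python
import string

# Tokens copied from the NLTK output above
filtered_words = ['Hello', ',', 'Farzad', 'Asgari', ',', 'welcome', 'NLPy',
                  'course', '.', 'learn', 'lot', 'NLP', 'data', 'science', '.']

# Keep a token unless every character in it is ASCII punctuation
punctuation = set(string.punctuation)
words_only = [w for w in filtered_words
              if not all(ch in punctuation for ch in w)]

print("Words only:", words_only)
```

Checking every character, rather than testing `w in punctuation`, also removes multi-character tokens such as '...' while leaving words with internal punctuation intact.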




### Removing Stopwords with spaCy

In [2]:
# Import spaCy
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Define a sample text
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP and data science."

# Process the text with spaCy
doc = nlp(text)

# Remove stopwords from the tokens
filtered_words = [token.text for token in doc if not token.is_stop]

# Print the filtered list of words
print("Filtered Words:", filtered_words)

Filtered Words: ['Hello', ',', 'Farzad', 'Asgari', ',', 'welcome', 'NLPy', 'course', '.', 'learn', 'lot', 'NLP', 'data', 'science', '.']


### Removing Stopwords with Scikit-learn

In [3]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Define a sample text
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP and data science."

# Split the text on whitespace (punctuation stays attached, e.g. 'Hello,')
words = text.split()

# Remove stopwords using scikit-learn's ENGLISH_STOP_WORDS
filtered_words = [word for word in words if word.lower() not in ENGLISH_STOP_WORDS]

# Print the filtered list of words
print("Filtered Words:", filtered_words)

Filtered Words: ['Hello,', 'Farzad', 'Asgari,', 'welcome', 'NLPy', 'course.', 'learn', 'lot', 'NLP', 'data', 'science.']
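
Because `text.split()` only splits on whitespace, punctuation stays attached in the output above ('Hello,', 'course.'), and a stopword followed directly by punctuation (for example 'the,') would not match the list. A minimal sketch of one workaround strips leading and trailing punctuation before the lookup. To keep the snippet runnable without scikit-learn it uses a small stand-in set, `SAMPLE_STOP_WORDS`, which is an assumption for illustration; in practice `ENGLISH_STOP_WORDS` would take its place.

```python
import string

# Stand-in for ENGLISH_STOP_WORDS (hypothetical subset, for illustration)
SAMPLE_STOP_WORDS = {"i", "am", "and", "to", "the", "we", "will", "a", "about"}

text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP and data science."

# Strip surrounding punctuation from each whitespace-separated chunk
words = [chunk.strip(string.punctuation) for chunk in text.split()]

# Drop empty strings and stopwords (compare in lowercase)
filtered_words = [w for w in words if w and w.lower() not in SAMPLE_STOP_WORDS]

print("Filtered Words:", filtered_words)
```

This only trims punctuation at the edges of each chunk, so contractions and hyphenated words are left as they are. A proper tokenizer, as in the NLTK and spaCy examples, handles more cases.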


## Conclusion
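Removing stopwords is a common text preprocessing step that helps downstream models focus on the most informative words. NLTK, spaCy, and scikit-learn each ship an English stopword list; in the examples here the results differ mainly in how the text is tokenized before filtering, as the punctuation attached to words in the scikit-learn output shows.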