## Text Preprocessing: Stopwords

### 1. Context
Text preprocessing is a key step in Natural Language Processing (NLP), and **stopwords removal** is one of its most common operations. **Stopwords** are high-frequency words such as "the," "and," "is," and "in" that carry little meaning on their own. They are often removed during preprocessing to shrink the vocabulary and, for many bag-of-words style models, to reduce noise.

In this notebook, we will focus on **stopwords removal** using the **NLTK (Natural Language Toolkit)** library, which provides a powerful set of tools for text analysis and preprocessing tasks.

---

### 2. Install NLTK
Before starting, make sure that NLTK is installed on your system.

```bash
!pip install nltk
```

### 3. Download NLTK Stopwords
To use NLTK’s stopwords, we need to download the corresponding corpus.

```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
```

```text
[nltk_data] Downloading package stopwords to C:\Users\IT
[nltk_data]     Support\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
```


### 4. Example of Stopwords in Text
First, let's inspect the English stopword list. Then we will remove the stopwords from an example sentence.

```python
# Get the number of English stopwords and display the first 5 stopwords
len(stopwords.words('english')), stopwords.words('english')[:5]
```

```text
(179, ['i', 'me', 'my', 'myself', 'we'])
```

The exact count depends on the installed NLTK version.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Example text
text = "This is a simple sentence, and it contains stopwords."

# Tokenize the sentence
words = word_tokenize(text.lower())

# Get the set of English stopwords
stop_words = set(stopwords.words("english"))

# Filter out stopwords from the tokenized words
filtered_words = [word for word in words if word not in stop_words]

print("Original Text: ", text)
print("Filtered Text: ", " ".join(filtered_words))
```

```text
Original Text:  This is a simple sentence, and it contains stopwords.
Filtered Text:  simple sentence , contains stopwords .
```


### 5. Explanation of the Code
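* **Stopword list**: NLTK's *stopwords* corpus provides the list of common English stopwords. It is converted to a `set` so that membership checks are fast.
* **Tokenization**: The text is lowercased (the NLTK stopword list is all lowercase), and the *word_tokenize* function splits it into word and punctuation tokens. *word_tokenize* relies on NLTK's Punkt tokenizer models, which are a separate download (`nltk.download('punkt')`, or `'punkt_tab'` on recent NLTK releases).
* **Stopwords Removal**: A list comprehension keeps only the tokens that are not in the stopwords set. Punctuation tokens such as "," and "." are not stopwords, so they remain in the output.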
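
To see why the text is lowercased before filtering, the minimal sketch below uses a small hand-written stand-in set (`demo_stop_words`, a hypothetical name, not the full NLTK list) and plain `str.split` instead of `word_tokenize`, so it runs without any downloads.

```python
# Hypothetical stand-in for set(stopwords.words("english")); lowercase only
demo_stop_words = {"this", "is", "a", "and", "it"}

tokens = "This is a simple sentence".split()

# Without lowercasing, "This" does not match the lowercase entry "this"
kept_case_sensitive = [t for t in tokens if t not in demo_stop_words]

# Lowercasing each token first makes the comparison case-insensitive
kept_lowercased = [t.lower() for t in tokens if t.lower() not in demo_stop_words]

print(kept_case_sensitive)  # ['This', 'simple', 'sentence']
print(kept_lowercased)      # ['simple', 'sentence']
```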

### 6. Custom Stopwords List
Sometimes we need our own list of stopwords to remove words specific to a particular domain or task. The example below filters with the custom set only, so the built-in stopwords remain in the output.

```python
custom_stopwords = {"sentence", "contains"}

# Filter out custom stopwords
custom_filtered_words = [word for word in words if word not in custom_stopwords]

print("Original Text: ", text)
print("Custom Filtered Text: ", " ".join(custom_filtered_words))
```

```text
Original Text:  This is a simple sentence, and it contains stopwords.
Custom Filtered Text:  this is a simple , and it stopwords .
```
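
In practice, domain-specific words are often added to the built-in list rather than used on their own. A minimal sketch, assuming a small stand-in set `base_stop_words` (hypothetical) in place of `set(stopwords.words("english"))` and a hand-written token list so that it runs without NLTK data:

```python
# Hypothetical stand-in for set(stopwords.words("english"))
base_stop_words = {"this", "is", "a", "and", "it"}

# Domain-specific words to drop in addition to the general list
custom_stopwords = {"sentence", "contains"}

# Set union combines both lists without modifying either one
all_stop_words = base_stop_words | custom_stopwords

# Hand-written tokens matching the lowercased example sentence
tokens = ["this", "is", "a", "simple", "sentence", ",", "and", "it", "contains", "stopwords", "."]

combined_filtered = [t for t in tokens if t not in all_stop_words]
print(" ".join(combined_filtered))  # simple , stopwords .
```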


### 7. Handling Punctuation and Special Characters
It's common to remove punctuation and special characters along with stopwords. Here's how to do that:

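```python
import string

# Build a set of punctuation characters; testing against the raw string
# would be a substring check rather than a membership check
punctuation = set(string.punctuation)

# Remove punctuation tokens from the filtered words
words_no_punct = [word for word in filtered_words if word not in punctuation]

print("Original Text: ", text)
print("Text without Punctuation: ", " ".join(words_no_punct))
```

```text
Original Text:  This is a simple sentence, and it contains stopwords.
Text without Punctuation:  simple sentence contains stopwords
```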
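
Tokenizers can also emit multi-character punctuation tokens such as `...` or `--`, which the check above does not remove. One possible way to handle them, sketched here with a hypothetical helper `is_punctuation` and a hand-written token list, is to test whether every character of a token is punctuation:

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Hand-written tokens for illustration (not produced by a real tokenizer)
tokens = ["wait", "...", "really", "--", "state-of-the-art", "!"]

cleaned = [t for t in tokens if not is_punctuation(t)]
print(cleaned)  # ['wait', 'really', 'state-of-the-art']
```

Hyphenated words survive because they contain at least one non-punctuation character.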


### 8. Stopwords Removal for Larger Texts
For longer texts or larger datasets, wrap the steps in a function that can be applied to each document:

```python
# Build the lookup sets once so they are not recreated on every call
STOP_WORDS = set(stopwords.words("english"))
PUNCTUATION = set(string.punctuation)

def remove_stopwords_from_text(text):
    # Lowercase and tokenize the text
    words = word_tokenize(text.lower())

    # Remove stopwords and punctuation tokens
    filtered_words = [word for word in words if word not in STOP_WORDS and word not in PUNCTUATION]

    return " ".join(filtered_words)

# Example large text
large_text = "This is an example of a much larger text with multiple sentences. We can easily remove stopwords from it."

processed_text = remove_stopwords_from_text(large_text)
print("Processed Large Text: ", processed_text)
```

```text
Processed Large Text:  example much larger text multiple sentences easily remove stopwords
```
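
For a whole dataset, the same idea can be applied document by document. The sketch below is an illustration under stated assumptions rather than the notebook's method: `remove_stopwords_from_corpus` and `simple_tokenize` are hypothetical helpers, the regex tokenizer is a dependency-free stand-in for `word_tokenize`, and `demo_stop_words` is a small stand-in for the NLTK list.

```python
import re

# Hypothetical stand-in for set(stopwords.words("english"))
demo_stop_words = {"this", "is", "an", "of", "a", "with", "we", "can", "from", "it"}

def simple_tokenize(text):
    """Dependency-free stand-in for word_tokenize: lowercase alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords_from_corpus(documents, stop_words, tokenize=simple_tokenize):
    """Yield each document with stopwords removed, one at a time."""
    for doc in documents:
        yield " ".join(tok for tok in tokenize(doc) if tok not in stop_words)

documents = [
    "This is an example of a much larger text with multiple sentences.",
    "We can easily remove stopwords from it.",
]

processed = list(remove_stopwords_from_corpus(documents, demo_stop_words))
print(processed)
# ['example much larger text multiple sentences', 'easily remove stopwords']
```

Because the function is a generator, documents are processed lazily, which keeps memory use low for large collections.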


### 9. Conclusion

Stopwords removal is a common part of text preprocessing that helps reduce noise in textual data. The NLTK library provides an easy-to-use set of tools for it, which can improve the efficiency of many NLP models.

#### Key Takeaways:
- **Stopwords** are common words that don't contribute much to the meaning of the text.
- **NLTK** provides a ready-made list of stopwords for various languages.
- **Stopwords removal** helps in reducing the dimensionality of the data, leading to more efficient models.
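
The dimensionality point can be made concrete by counting distinct tokens before and after filtering. A minimal sketch, assuming whitespace tokenization and a small stand-in stopword set (`demo_stop_words`, hypothetical) rather than the NLTK list:

```python
from collections import Counter

# Hypothetical stand-in for set(stopwords.words("english"))
demo_stop_words = {"the", "is", "on", "a", "and", "in"}

corpus = [
    "the cat is on the mat",
    "a dog and a cat in the garden",
]

# Bag-of-words counts over the whole corpus
all_tokens = [tok for doc in corpus for tok in doc.split()]
counts_before = Counter(all_tokens)
counts_after = Counter(tok for tok in all_tokens if tok not in demo_stop_words)

# Distinct tokens (vocabulary size) and total tokens, before and after
print(len(counts_before), sum(counts_before.values()))  # 10 14
print(len(counts_after), sum(counts_after.values()))    # 4 5
```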

---

### 10. Further Enhancements

Removing stopwords is useful, but it is not appropriate for every task. For example, a phrase such as "to be or not to be" consists entirely of words on the NLTK English list. Here are a few further steps you can explore:

- **Domain-specific stopwords**: Build a custom stopwords list specific to your domain.
- **Lemmatization or Stemming**: Combine stopwords removal with lemmatization or stemming for more effective text normalization.
- **Handling Different Languages**: Use NLTK's support for multiple languages to remove stopwords in languages other than English.
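
One task-specific adjustment is to keep negation words, which can flip the meaning of a sentence in tasks such as sentiment analysis. A minimal sketch, assuming a small stand-in set `base_stop_words` (hypothetical) in place of `set(stopwords.words("english"))` and whitespace tokenization:

```python
# Hypothetical stand-in for set(stopwords.words("english"))
base_stop_words = {"the", "is", "was", "not", "no", "nor", "a", "this"}

# Words to keep even though the general list treats them as stopwords
negations = {"not", "no", "nor"}

# Set difference removes the negations from the stopword set
sentiment_stop_words = base_stop_words - negations

tokens = "this movie was not good".split()

default_filtered = [t for t in tokens if t not in base_stop_words]
negation_aware = [t for t in tokens if t not in sentiment_stop_words]

print(default_filtered)  # ['movie', 'good']
print(negation_aware)    # ['movie', 'not', 'good']
```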