<a href="https://colab.research.google.com/github/faisu6339-glitch/Natural-Language-Processing-NLP-/blob/main/Removing_stopwords_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Removing stop words with NLTK in Python


Natural language processing tasks often involve filtering out commonly occurring words that add little or no semantic value to text analysis. These words, known as stopwords, include articles, conjunctions, prepositions, pronouns and auxiliary verbs such as "the", "and", "is" and "in". Although they seem insignificant, how stopwords are handled can noticeably affect the performance and accuracy of NLP applications.

Consider the sentence: "The quick brown fox jumps over the lazy dog"

Stopwords: "the" and "over"
Content words: "quick", "brown", "fox", "jumps", "lazy", "dog"

Stopword removal becomes particularly important when dealing with large text corpora, where computational efficiency matters. Processing every word, including high-frequency stopwords, consumes unnecessary resources and can skew analysis results.
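
The split of the example sentence into stopwords and content words can be reproduced in a few lines of plain Python. This is a minimal sketch that assumes a tiny hand-written stopword set (`TINY_STOPWORDS` is an illustrative name, not NLTK's list) and simple whitespace tokenization:

```python
# Illustrative stopword set (an assumption); a real list such as NLTK's is much larger.
TINY_STOPWORDS = {"the", "over", "a", "an", "and", "is", "in"}

sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.lower().split()

# Partition the tokens into stopwords and content words, preserving order.
stopword_tokens = [t for t in tokens if t in TINY_STOPWORDS]
content_words = [t for t in tokens if t not in TINY_STOPWORDS]

print("Stopwords:", stopword_tokens)
print("Content words:", content_words)
print(f"Tokens kept: {len(content_words)} of {len(tokens)}")
```

Even in this short sentence a third of the tokens are stopwords, which hints at how much work filtering can save on a large corpus.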

#When to Remove Stopwords
The decision to remove stopwords depends heavily on the specific NLP task at hand:

Tasks that benefit from stopword removal:
Text classification and sentiment analysis
Information retrieval and search engines
Topic modeling and clustering
Keyword extraction

Tasks that require preserving stopwords:
Machine translation (maintains grammatical structure)
Text summarization (preserves sentence coherence)
Question-answering systems (syntactic relationships matter)
Grammar checking and parsing

Language modeling presents an interesting middle ground where the decision depends on the specific application requirements and available computational resources.
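
One way to act on this distinction is to make stopword removal an explicit switch in the preprocessing step, so that each task can opt in or out. The sketch below is only an illustration: `preprocess`, its `remove_stopwords` flag and the small `STOPWORDS` set are hypothetical names, not part of NLTK.

```python
# Hypothetical mini stopword set, used only for illustration.
STOPWORDS = {"the", "a", "an", "is", "on", "of", "to", "and"}

def preprocess(text, remove_stopwords):
    """Lowercase and whitespace-tokenize, optionally dropping stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

text = "The cat is on the mat"

# Keyword extraction: function words add noise, so drop them.
keywords = preprocess(text, remove_stopwords=True)

# Machine translation: grammatical structure matters, so keep every token.
translation_input = preprocess(text, remove_stopwords=False)

print(keywords)
print(translation_input)
```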

#Categories of Stopwords
Understanding different types of stopwords helps in making informed decisions:

Standard Stopwords: Common function words like articles ("a", "the"), conjunctions ("and", "but") and prepositions ("in", "on")
Domain-Specific Stopwords: Context-dependent terms that appear frequently in specific fields like "patient" in medical texts
Contextual Stopwords: Words with extremely high frequency in particular datasets
Numerical Stopwords: Digits, punctuation marks and single characters
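
The four categories can be combined into a single filter. The following is a rough sketch rather than a canonical recipe: the standard and domain lists are tiny hand-picked examples, the documents are invented, and treating any word that occurs in at least 80% of documents as a contextual stopword (`DOC_FREQ_THRESHOLD`) is an arbitrary assumption.

```python
import string
from collections import Counter

# Invented toy corpus, already whitespace-separated for simplicity.
docs = [
    "patient reports mild fever and cough",
    "patient has a fever of 39 degrees",
    "patient denies cough , fever resolved",
    "patient discharged in stable condition",
    "patient fever returned after 2 days",
]
tokenized = [d.split() for d in docs]

STANDARD = {"a", "and", "of", "in", "has", "after"}  # common function words
DOMAIN = {"patient"}                                 # frequent in medical text
DOC_FREQ_THRESHOLD = 0.8                             # assumed cut-off

# Contextual stopwords: words that appear in most documents of this dataset.
doc_freq = Counter(word for doc in tokenized for word in set(doc))
contextual = {w for w, n in doc_freq.items()
              if n / len(docs) >= DOC_FREQ_THRESHOLD}

def is_numerical(token):
    """Digits, single characters, or tokens made only of punctuation."""
    return (token.isdigit() or len(token) == 1
            or all(ch in string.punctuation for ch in token))

all_stopwords = STANDARD | DOMAIN | contextual
cleaned = [[t for t in doc if t not in all_stopwords and not is_numerical(t)]
           for doc in tokenized]

print("Contextual stopwords:", sorted(contextual))
for doc in cleaned:
    print(doc)
```

Here "fever" is not in any predefined list, but it is filtered anyway because it shows up in four of the five documents.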

#Implementation with NLTK
NLTK ships stopword lists for many languages; the exact set depends on the installed corpus version, and `stopwords.fileids()` returns the ones available. The implementation involves tokenization followed by filtering:

Setup: Import NLTK modules and download required resources like stopwords and tokenizer data.
Text preprocessing: Convert the sample sentence to lowercase and tokenize it into words.
Stopword removal: Load English stopwords and filter them out from the token list.
Output: Print both the original and cleaned tokens for comparison.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

# Sample text
text = "This is a sample sentence showing stopword removal."

# Get English stopwords and tokenize
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text.lower())

# Remove stopwords
filtered_tokens = [word for word in tokens if word not in stop_words]

print("Original:", tokens)
print("Filtered:", filtered_tokens)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Original: ['this', 'is', 'a', 'sample', 'sentence', 'showing', 'stopword', 'removal', '.']
Filtered: ['sample', 'sentence', 'showing', 'stopword', 'removal', '.']
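
Note that the period survives in the filtered output, because punctuation is not part of NLTK's English stopword list. If punctuation should be removed too, one option is an extra check on each token. The sketch below starts from the token list printed above and, to stay self-contained, uses a stand-in `stop_words` set holding only the three stopwords that occur in this sentence; with NLTK loaded you would use the full `stop_words` set from the cell instead.

```python
import string

# Tokens as printed by the cell above.
tokens = ['this', 'is', 'a', 'sample', 'sentence', 'showing',
          'stopword', 'removal', '.']

# Stand-in for set(stopwords.words('english')): only the entries that
# occur in this sentence (an assumption to keep the sketch self-contained).
stop_words = {'this', 'is', 'a'}

# Drop stopwords and tokens made up entirely of punctuation characters.
filtered_tokens = [
    tok for tok in tokens
    if tok not in stop_words
    and not all(ch in string.punctuation for ch in tok)
]
print("Filtered:", filtered_tokens)
```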


#Implementation with Scikit Learn


In [3]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import word_tokenize

# Another sample text
new_text = "The quick brown fox jumps over the lazy dog."

# Tokenize the new text using NLTK
new_words = word_tokenize(new_text)

# Remove stopwords using scikit-learn's built-in English list
new_filtered_words = [
    word for word in new_words if word.lower() not in ENGLISH_STOP_WORDS]

# Join the filtered words to form a clean text
new_clean_text = ' '.join(new_filtered_words)

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_clean_text)

Original Text: The quick brown fox jumps over the lazy dog.
Text after Stopword Removal: quick brown fox jumps lazy dog .


In [4]:
# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'

# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and rejoin the remaining words with single spaces
no_stpwords_string = ' '.join(
    word for word in lst_string if word not in stop_words)
print(no_stpwords_string)

{'both', "needn't", 'any', 'doing', 'nor', "should've", "couldn't", "we're", 'such', "i'll", 'd', 'more', "isn't", "i'm", 'am', 'do', 'those', 'as', 'between', "he's", 'had', "it's", "shan't", 'been', 'themselves', "wasn't", 'yours', 'himself', "she's", 'it', 'does', 'm', 'weren', "we'll", 'shan', 'yourselves', 'who', 'are', 'your', "weren't", "you've", "you'd", 'didn', 'few', 'if', 'just', 'you', 't', 'up', "we'd", 'its', 's', "that'll", 'is', 'ours', "he'd", 'a', 'above', 'while', 'for', 'o', 'they', "she'll", 'where', "they'd", 'these', 'we', 'once', "didn't", 'own', 'than', "he'll", 'can', "doesn't", "hadn't", 'her', 'that', "i'd", 'were', "we've", 'wouldn', 'off', 'being', 'because', 'but', 'what', 'whom', 'after', 're', 'couldn', "haven't", 'shouldn', 'and', "you'll", 'should', 'below', 'same', 'wasn', 'ourselves', 'ain', 'most', 'herself', 'with', 'from', "they'll", "she'd", 'this', 'into', 'all', 'when', 'doesn', "they've", 'too', 'no', 'to', 'very', 'in', 'of', 'did', 'during'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
