<a href="https://colab.research.google.com/github/drpetros11111/transformers-with-python/blob/main/00_stopwords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stopwords


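Stopwords are very common words, such as *a*, *the*, *is*, and *in*, that often add little meaning to a text. In many NLP tasks (text classification or topic modeling, for example) it helps to remove them so that the remaining, more informative words stand out. In this section we remove stopwords from a sample text using NLTK.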
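
As a first illustration of the idea, here is a minimal sketch in plain Python. The stopword set and the sentence are hand-written assumptions made up for this example; they are not NLTK's list, which we load further down.

```python
# A tiny hand-written stopword set, for illustration only (not NLTK's list).
sample_stop_words = {"a", "the", "is", "in", "for", "be", "do"}

sentence = "the cat is in the garden"

# Keep only the tokens that are not in the stopword set.
content_words = [word for word in sentence.split() if word not in sample_stop_words]

print(content_words)  # ['cat', 'garden']
```

Only the words that carry the meaning of the sentence survive the filter.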

We will be using [this tweet](https://twitter.com/ivan_bezdomny/status/1367160747537682438) (don't worry, we will get to train some models):

In [None]:
tweet = """I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :("""

We will be using the **NLTK** library to remove stopwords. NLTK comes with stopword lists for several languages; we will use the English one. It contains common English words like *a*, *the*, *be*, *for*, *do*, and so on.

In [None]:
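from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')  # fetch the stopword data; needed once per environment

stop_words = stopwords.words('english')

print(stop_words[:5])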

['i', 'me', 'my', 'myself', 'we']

# 1. `from nltk.corpus import stopwords`

## What it does:
This line imports the `stopwords` corpus reader from the `nltk.corpus` package (the corpus collection of the Natural Language Toolkit).

## Why it's used:
NLTK is a library for working with human language data (text).

It gives access to many corpora (collections of text data) that are useful for natural language processing (NLP) tasks. One of these corpora is a collection of stopwords.

## In simple terms:
This is like saying, "I want to use the list of common words that NLTK provides."

---------------------------
# 2. `import nltk`

## What it does:
This line imports the main `nltk` package, which is needed to download NLTK's data resources.

## Why it's used:
`nltk.download()` is a function of this package, and the next line calls it.

## In simple terms:
This is like saying, "I want to use the nltk library in the following code."

----------------------
# 3. `nltk.download('stopwords')`

## What it does:
This line downloads the "stopwords" resource from NLTK's data repository.

## Why it's used:
The list of stopwords is not automatically included in the NLTK library itself.

It is stored as a separate data file. This line ensures that you have the data needed to use `stopwords.words('english')` in the next step.

You only need to run this once per Colab session, or once on a new machine.

## In simple terms:

This is like going to a library and getting the specific book you need (the "stopwords" list) before you can read it.

--------------------------
# 4. `stop_words = stopwords.words('english')`

## What it does:

This is the core of the operation. It calls `stopwords.words()` to get a list of stopwords for the English language.

The `'english'` argument specifies that you want the English stopword list. The result (the list of stopwords) is then assigned to the variable `stop_words`.

## Why it's used:
Stopwords (like "the", "a", "is", "in", etc.) are very common words that often don't add much meaning to text.

In many NLP tasks (like text classification or topic modeling), it's helpful to remove these stopwords so that you can focus on the more important words.

## In simple terms:
This line is saying, "Give me the list of English stopwords, and I'll call that list `stop_words` from now on."

---------------------
# 5. `print(stop_words[:5])`

## What it does:
This line prints the first five elements of the `stop_words` list. `[:5]` is a Python slice that extracts the elements from the beginning of the list up to (but not including) the element at index 5.

## Why it's used:

This is a way to quickly see a sample of what's in the `stop_words` list to confirm that the code is working and that you have the list you expect.

## In simple terms:

This line is saying, "Show me the first five stopwords in the list so I can see what's there." To see the output, run the code.
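
The slicing behaviour can be checked in isolation. The sketch below uses a short hand-written list as a stand-in for `stop_words` (the variable name `words` and its last two entries are assumptions for this example).

```python
# Hand-written stand-in for the stopword list, for illustration only.
words = ["i", "me", "my", "myself", "we", "our", "ours"]

first_five = words[:5]  # elements at indices 0, 1, 2, 3 and 4

print(first_five)   # ['i', 'me', 'my', 'myself', 'we']
print(words[5])     # our  (index 5 is the first element NOT in the slice)
print(len(words))   # 7    (slicing returns a copy; the original list is unchanged)
```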

------------------------
## Overall Goal

The goal of this code is to obtain a list of common English stopwords that you can use in subsequent text processing steps to remove these words from your text data.

Why is this relevant to Google Colab?

## Data Science & NLP:

Google Colab is a popular environment for data science and machine learning. Text processing is a very common task in these fields.

## NLTK in Colab:

NLTK is easy to install and use in Colab, making it a convenient choice for NLP work in this environment.

## Text Preprocessing:

Removing stopwords is a common part of text preprocessing, which is the first step in many NLP pipelines.

## Packages:

Google Colab comes with many packages already installed, but sometimes a package's data has to be downloaded separately, as shown with `nltk.download('stopwords')`.

Now we have a list of stopwords. When we process our text data we will iterate through each word and remove it if it is present in `stop_words`. To speed up the stopword lookup we can convert `stop_words` to a `set` object.

In [None]:
stop_words = set(stop_words)

# Converting the `stop_words` variable, which was originally a list, into a set

-------------------------------------
Breakdown:

## `set()`:

This is a built-in Python function that creates a set object.

## `stop_words` (inside the `set()`):

This refers to the existing list of stopwords that we obtained from NLTK in the previous steps (e.g., `['i', 'me', 'my', 'myself', 'we', ...]`).

## `stop_words = ...`:

This assignment takes the newly created set and binds it to the variable `stop_words`, replacing the original list.
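
A small sketch of what the conversion changes, using a hand-written list with duplicates (the names `words` and `unique_words` are assumptions for this example):

```python
# Hand-written list with duplicates, for illustration only.
words = ["i", "me", "my", "me", "i"]

unique_words = set(words)

print(type(words).__name__)           # list
print(type(unique_words).__name__)    # set
print(len(words), len(unique_words))  # 5 3  (duplicates are dropped)
print("me" in unique_words)           # True

# Sets have no positions, so indexing and slicing no longer work.
try:
    unique_words[:2]
except TypeError as error:
    print("Sets cannot be sliced:", error)
```

One practical consequence: after `stop_words = set(stop_words)`, an expression like `stop_words[:5]` raises a `TypeError`, so inspect the list before converting it.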


Why it's used (the reasoning):

The key reason for converting the list to a set is to speed up the stopword lookup. Here's why:

## Lists vs. Sets:

## Lists:

Lists are ordered collections of items. When you check if an item is in a list (e.g., `if word in my_list`), Python may have to go through the items one by one until it finds the word or reaches the end. This is called a linear search, and it gets slower as the list grows.

## Sets:
Sets are unordered collections of unique items. They are implemented using a data structure called a hash table, which allows for very fast membership testing.

When you check if an item is in a set (e.g., `if word in my_set`), Python computes the item's hash and looks in the matching slot, so the check takes roughly the same short time regardless of the set's size.

This is called constant-time lookup (on average).

----------------------------
## Stopword Removal:

In NLP, you often need to check a large number of words against your stopword list.

If the stopwords are stored in a list, every check scans the list. If they are stored in a set, every check is a single hash lookup.

## Efficiency:
When you're dealing with large amounts of text data, the efficiency difference between using a list and a set for stopword lookup can be significant.

Because the check runs once per word, the savings add up with the size of the text.
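
To see the difference on your own machine, the following sketch times both lookups with the standard-library `timeit` module. The vocabulary is synthetic (an assumption for this example, not a real stopword list), and the absolute numbers will vary between machines.

```python
import timeit

# Synthetic vocabulary standing in for a long word list (illustrative only).
words = [f"word{i}" for i in range(10_000)]
word_list = words
word_set = set(words)

target = "word9999"  # last element: the worst case for the list's linear search

# Run each membership test 1,000 times and measure the total time.
list_time = timeit.timeit(lambda: target in word_list, number=1_000)
set_time = timeit.timeit(lambda: target in word_set, number=1_000)

print(f"list lookup: {list_time:.4f} s")
print(f"set lookup:  {set_time:.4f} s")
```

Both containers give the same answer; only the time needed to reach it differs.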

In simple terms:

Imagine you have a phone book.

List: If the phone book was like a list, to find a number, you'd have to start at the first page and flip through each page one by one until you found the name you were looking for.

Set: If the phone book was like a set, it would come with an index that tells you where each name is, so you could jump straight to the right entry without flipping through the pages.

Converting the `stop_words` list to a set is like indexing your phone book to make lookups much faster.

In the context of Google Colab:

Google Colab is often used for data science and machine learning, which frequently involves processing large amounts of text data.

This makes the efficiency boost from using sets for stopword removal especially relevant.

-----------------------------------------
# Example:

Let's say you have a very simple case:

    my_list = ["apple", "banana", "orange"]
    my_set = set(my_list)

    # Check whether "banana" is present:
    print("banana" in my_list)  # linear search; can be slower on long lists
    print("banana" in my_set)   # hash lookup; fast regardless of size

The difference is negligible for such a small example. With thousands of words in the list and thousands of lookups, however, it becomes clear why sets are preferred for this task.

In summary, `stop_words = set(stop_words)` is a simple optimization that speeds up stopword removal by using the fast membership test of Python sets.

First we need to lowercase our text (because all of our stopwords are lowercased). Then we split our input text into a list of tokens (each token is a word separated by whitespace).

In [None]:
tweet = tweet.lower().split()

tweet

['i’m',
 'amazed',
 'how',
 'often',
 'in',
 'practice,',
 'not',
 'only',
 'does',
 'a',
 '@huggingface',
 'nlp',
 'model',
 'solve',
 'your',
 'problem,',
 'but',
 'one',
 'of',
 'their',
 'public',
 'finetuned',
 'checkpoints,',
 'is',
 'good',
 'enough',
 'for',
 'the',
 'job.',
 'both',
 'impressed,',
 'and',
 'a',
 'little',
 'disappointed',
 'how',
 'rarely',
 'i',
 'get',
 'to',
 'actually',
 'train',
 'a',
 'model',
 'that',
 'matters',
 ':(']
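
To make the behaviour of `.lower().split()` concrete, here is a small standalone sketch on a made-up string (the string is an assumption for this example, not part of the tweet). Called without arguments, `split()` breaks on any run of whitespace, including newlines, and leaves punctuation attached to the words.

```python
text = "Hello,  World!\nIt's  a TEST."

# Lowercase first, then split on any run of whitespace.
tokens = text.lower().split()

print(tokens)  # ['hello,', 'world!', "it's", 'a', 'test.']
```

This is why tokens such as `'practice,'` and `'job.'` in the output above still carry their punctuation. A stopword followed directly by punctuation would not match its entry in `stop_words` for the same reason.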

Now we can iterate through the list and check whether each word exists in `stop_words`. If it does, we discard it.

In [None]:
tweet_no_stopwords = [word for word in tweet if word not in stop_words]

print("With stopwords:", ' '.join(tweet))
print("Without:", ' '.join(tweet_no_stopwords))

With stopwords: i’m amazed how often in practice, not only does a @huggingface nlp model solve your problem, but one of their public finetuned checkpoints, is good enough for the job. both impressed, and a little disappointed how rarely i get to actually train a model that matters :(
Without: i’m amazed often practice, @huggingface nlp model solve problem, one public finetuned checkpoints, good enough job. impressed, little disappointed rarely get actually train model matters :(
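
The three steps (lowercase, split, filter) can be wrapped in one reusable function. This is a minimal sketch: `remove_stopwords` is a hypothetical helper name, and the sample stopword list is hand-written for illustration. In the notebook you would pass the NLTK-derived `stop_words` instead.

```python
from typing import Iterable


def remove_stopwords(text: str, stop_words: Iterable[str]) -> str:
    """Lowercase `text`, split it on whitespace, and drop tokens found in `stop_words`."""
    stop_set = set(stop_words)  # a set gives fast membership tests
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in stop_set)


# Hand-written sample list, for illustration only (not NLTK's full list).
sample_stop_words = ["how", "in", "not", "only", "does", "a", "your"]

result = remove_stopwords("Amazed how often in practice a model works", sample_stop_words)
print(result)  # amazed often practice model works
```

Accepting any iterable and converting it inside the function means callers can pass either the original list or the set.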


It's that easy! We'll move on to more preprocessing methods in the following sections.