***
Welcome!

In this notebook, we will explore the impact of Stop Words on a specific piece of text. Stop Words are common words like "and," "the," and "is" that often don't contribute much meaning to a sentence. By understanding how these words can be removed or filtered out, we can improve the clarity and focus of our text.

Additionally, we will dive into the concept of N-Grams. N-Grams are sequences of *n* consecutive words, and by analyzing these sequences we can uncover patterns and relationships within our text. This helps us see how words are used together and how they contribute to the overall meaning.
<br>
***

### Index:

[1.1 - Stop Words](#1.1---Stop-Words)
<br>
[1.2 - N-Grams](#1.2---N-Grams)

In [1]:
import nltk

Let's start by defining a longer piece of text. This time we will use an excerpt from J.R.R. Tolkien's book 'The Lord of the Rings':

In [2]:
lotr_quote = '''
‘I cannot read the fiery letters,’ said Frodo in a quavering
voice.
‘No,’ said Gandalf, ‘but I can. The letters are Elvish, of an
ancient mode, but the language is that of Mordor, which I
will not utter here. But this in the Common Tongue is what
is said, close enough:
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
It is only two lines of a verse long known in Elven-lore:
Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them
In the Land of Mordor where the Shadows lie.’
He paused, and then said slowly in a deep voice: ‘This is
the Master-ring, the One Ring to rule them all. This is the
One Ring that he lost many ages ago, to the great weakening
the shadow of the past 67
of his power. He greatly desires it – but he must not get it.’
Frodo sat silent and motionless. Fear seemed to stretch out
a vast hand, like a dark cloud rising in the East and looming
up to engulf him. ‘This ring!’ he stammered. ‘How, how on
earth did it come to me?’

'''

### 1.1 - Stop Words

***
Many words in a sentence carry little meaning on their own. They are filler words that connect the rest of the sentence, and in some (not all!) Natural Language Processing applications we can remove them.
<br>
These types of words are commonly called *stop words*.
<br>
***
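To make the idea concrete before bringing in NLTK, here is a minimal sketch in plain Python. The stop word set `toy_stop_words` and the example sentence are made up for illustration and are far smaller than any real stop word list; tokens are split on whitespace rather than with a proper tokenizer.

```python
# A tiny, hand-picked stop word set used only for illustration
# (hypothetical; a real English list is much longer).
toy_stop_words = {"the", "is", "a", "of", "and", "to", "in"}

sentence = "The ring is a gift to the king of the hills"

# Lowercase each token before the lookup so that "The" matches "the".
tokens = sentence.lower().split()
content_words = [token for token in tokens if token not in toy_stop_words]

print(content_words)
```

Only the words that carry the topic of the sentence should remain. Using a `set` for the stop words keeps each membership check fast, which matters on longer texts.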

**NLTK lets us access the English stop words really quickly:**

In [3]:
import nltk
from nltk.corpus import stopwords

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gauravkandel/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
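# stopwords.words returns the list of common stop words for a language.
# The corpus file id is lowercase: 'English' only works on case-insensitive file systems.
stopwords_eng = stopwords.words('english')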

In [6]:
stopwords_eng

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [7]:
# Let us tokenize our quote from the Lord of the Rings
# and use FreqDist, which we have already covered
word_frequency = nltk.FreqDist(nltk.tokenize.word_tokenize(lotr_quote))

In [8]:
word_frequency.most_common(10)

[('the', 20),
 (',', 19),
 ('to', 12),
 ('.', 10),
 ('One', 9),
 ('them', 9),
 ('in', 8),
 ('of', 8),
 ('Ring', 8),
 ('‘', 6)]

Our top words don't mean much: *the*, *to*, *them*, *in* and *of* are all very common English words, and punctuation marks such as `,` and `.` rank highly too. You will encounter this pattern in many different texts. For example, let's pass a very different text through the same process and check its top words:

In [9]:
# Let's use a quote from Apple's Annual Report from 2020:
apple_annual_report = '''
The Company’s stock price has experienced substantial price volatility in the past and may continue to do so in the future.
Additionally, the Company, the technology industry and the stock market as a whole have experienced extreme stock price and
volume fluctuations that have affected stock prices in ways that may have been unrelated to these companies’ operating
performance. Price volatility over a given period may cause the average price at which the Company repurchases its stock to
exceed the stock’s price at a given point in time. The Company believes its stock price should reflect expectations of future
growth and profitability. The Company also believes its stock price should reflect expectations that its cash dividend will continue
at current levels or grow, and that its current share repurchase program will be fully consummated. Future dividends are subject
to declaration by the Company’s Board of Directors, and the Company’s share repurchase program does not obligate it to
acquire any specific number of shares. If the Company fails to meet expectations related to future growth, profitability, dividends,
share repurchases or other market expectations, its stock price may decline significantly, which could have a material adverse
impact on investor confidence and employee retention.
'''

In [10]:
word_frequency_apple = nltk.FreqDist(nltk.tokenize.word_tokenize(apple_annual_report))

In [11]:
word_frequency_apple.most_common(10)

[('the', 11),
 ('stock', 9),
 (',', 9),
 ('Company', 8),
 ('price', 8),
 ('and', 7),
 ('to', 7),
 ('.', 7),
 ('its', 6),
 ('’', 5)]

The top word in both texts is exactly the same: the word **the**.
<br>
<br>
Notice that other words like **to**, as well as punctuation, are also shared between the texts. Even for pieces of text as different as a Lord of the Rings quote and a financial report, these are the tokens that stand out.
<br>
<br>
Let us now build a proper text pipeline that removes the punctuation and stop words. You will see something completely different from the previous `FreqDist` analysis:

In [12]:
# A cool tip we haven't explored yet - str.translate removes punctuation really fast
# (note that string.punctuation only contains ASCII punctuation characters)
import string
lotr_quote = lotr_quote.translate(str.maketrans('', '', string.punctuation))

In [13]:
lotr_quote

'\n‘I cannot read the fiery letters’ said Frodo in a quavering\nvoice\n‘No’ said Gandalf ‘but I can The letters are Elvish of an\nancient mode but the language is that of Mordor which I\nwill not utter here But this in the Common Tongue is what\nis said close enough\nOne Ring to rule them all One Ring to find them\nOne Ring to bring them all and in the darkness bind them\nIt is only two lines of a verse long known in Elvenlore\nThree Rings for the Elvenkings under the sky\nSeven for the Dwarflords in their halls of stone\nNine for Mortal Men doomed to die\nOne for the Dark Lord on his dark throne\nIn the Land of Mordor where the Shadows lie\nOne Ring to rule them all One Ring to find them\nOne Ring to bring them all and in the darkness bind them\nIn the Land of Mordor where the Shadows lie’\nHe paused and then said slowly in a deep voice ‘This is\nthe Masterring the One Ring to rule them all This is the\nOne Ring that he lost many ages ago to the great weakening\nthe shadow of the past

In [14]:
# The curly quotes ‘ and ’ are not in string.punctuation - let's remove them manually
lotr_quote = lotr_quote.replace('’','').replace('‘','')
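As an alternative to listing the leftover characters by hand, the sketch below uses the standard library's `unicodedata` module to drop every character that Unicode classifies as punctuation. The helper name `strip_punctuation` and the sample string are illustrative and not part of the notebook. One caveat: characters such as `$` or `+` are classified as symbols rather than punctuation, so this approach keeps them, unlike `string.punctuation`.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )

# Curly quotes and the en dash are removed along with the ASCII marks.
sample = "‘No,’ said Gandalf – slowly."
print(strip_punctuation(sample).split())
```

The surrounding spaces are left untouched, so a removed dash leaves a double space behind; splitting on whitespace afterwards takes care of that.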

In [15]:
# Let's now remove stop words
lotr_cleaned = [
    word.lower() for word in nltk.word_tokenize(lotr_quote) if word.lower() not in stopwords_eng
]

In [16]:
lotr_frequency_cleaned = nltk.FreqDist(lotr_cleaned)

In [17]:
lotr_frequency_cleaned.most_common(10)

[('one', 9),
 ('ring', 9),
 ('said', 4),
 ('mordor', 3),
 ('rule', 3),
 ('dark', 3),
 ('letters', 2),
 ('frodo', 2),
 ('voice', 2),
 ('find', 2)]

Just like magic, we were able to extract what is most relevant to this text, and we can easily find words that are related to The Lord of the Rings saga:
- Mordor;
- Ring;
- Frodo;

Now let us do the same for the Apple Annual Report Quote:

In [18]:
apple_annual_report = apple_annual_report.translate(str.maketrans('', '', string.punctuation))

# Let's now remove stop words
apple_cleaned = [
    word.lower() for word in nltk.word_tokenize(apple_annual_report) if word.lower() not in stopwords_eng
]

In [19]:
apple_frequency_cleaned = nltk.FreqDist(apple_cleaned)

In [20]:
apple_frequency_cleaned.most_common(10)

[('stock', 9),
 ('price', 9),
 ('company', 8),
 ('’', 5),
 ('may', 4),
 ('future', 4),
 ('expectations', 4),
 ('share', 3),
 ('experienced', 2),
 ('volatility', 2)]

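As you can see, the top 10 words for the Apple Annual Report quote are now much more related to the content of that text. Note that the `’` token is still in the list: `string.punctuation` only covers ASCII characters, and this time we did not remove the curly apostrophe manually as we did for the Lord of the Rings quote.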
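Since the same cleaning steps were applied to both texts, one option is to wrap them in a single function. The following is only a sketch under a few assumptions: the function name `top_words` is made up, it splits on whitespace instead of calling `nltk.word_tokenize`, it uses `collections.Counter` in place of `FreqDist`, and the stop word set is a hypothetical miniature list.

```python
import string
from collections import Counter

def top_words(text, stop_words, n=10):
    """Return the n most frequent tokens in text that are not stop words."""
    # Remove ASCII punctuation plus the curly quotes that string.punctuation misses.
    table = str.maketrans("", "", string.punctuation + "‘’")
    tokens = text.translate(table).lower().split()
    return Counter(t for t in tokens if t not in stop_words).most_common(n)

# Hypothetical mini stop word list, just for this example.
toy_stop_words = {"the", "its", "to", "a", "of", "and"}
sample = "The Company’s stock price, and the stock’s price, may decline."
print(top_words(sample, toy_stop_words, n=1))
```

Because the apostrophe is deleted rather than replaced by a space, "Company’s" becomes the single token "companys", in the same way that "Elven-lore" became "Elvenlore" earlier in the notebook.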

### 1.2 - N-Grams

You will often hear people talk about `n-grams`, where `n` is the number of consecutive tokens in each group we split a sentence into. This is exactly what we spoke about in the `"Train our own POS Tagger"` lectures. Here, we are just going to consolidate this knowledge:

In [21]:
from nltk import bigrams, trigrams, everygrams
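Before using these helpers, it may help to see what an n-gram is in plain Python. The function below is a hand-rolled sketch (the name `ngrams_sketch` is made up and it is not NLTK's implementation): it slides a window of `n` tokens over the list, one position at a time.

```python
def ngrams_sketch(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["One", "Ring", "to", "rule", "them", "all"]
print(ngrams_sketch(tokens, 2))  # pairs of neighbouring tokens
print(ngrams_sketch(tokens, 3))  # triples of neighbouring tokens
```

A list of 6 tokens gives 5 bigrams and 4 trigrams: each extra token in the window removes one possible starting position.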

Let's continue with our Lord of The Rings quote:

In [22]:
lotr_quote

'\nI cannot read the fiery letters said Frodo in a quavering\nvoice\nNo said Gandalf but I can The letters are Elvish of an\nancient mode but the language is that of Mordor which I\nwill not utter here But this in the Common Tongue is what\nis said close enough\nOne Ring to rule them all One Ring to find them\nOne Ring to bring them all and in the darkness bind them\nIt is only two lines of a verse long known in Elvenlore\nThree Rings for the Elvenkings under the sky\nSeven for the Dwarflords in their halls of stone\nNine for Mortal Men doomed to die\nOne for the Dark Lord on his dark throne\nIn the Land of Mordor where the Shadows lie\nOne Ring to rule them all One Ring to find them\nOne Ring to bring them all and in the darkness bind them\nIn the Land of Mordor where the Shadows lie\nHe paused and then said slowly in a deep voice This is\nthe Masterring the One Ring to rule them all This is the\nOne Ring that he lost many ages ago to the great weakening\nthe shadow of the past 67\nof

In [23]:
# As usual, we have to tokenize our sentence
lotr_quote_tokenized = nltk.word_tokenize(lotr_quote)

In [24]:
# Let's now check our bigrams - every pair of
# consecutive words in a sentence
list(bigrams(lotr_quote_tokenized))

[('I', 'can'),
 ('can', 'not'),
 ('not', 'read'),
 ('read', 'the'),
 ('the', 'fiery'),
 ('fiery', 'letters'),
 ('letters', 'said'),
 ('said', 'Frodo'),
 ('Frodo', 'in'),
 ('in', 'a'),
 ('a', 'quavering'),
 ('quavering', 'voice'),
 ('voice', 'No'),
 ('No', 'said'),
 ('said', 'Gandalf'),
 ('Gandalf', 'but'),
 ('but', 'I'),
 ('I', 'can'),
 ('can', 'The'),
 ('The', 'letters'),
 ('letters', 'are'),
 ('are', 'Elvish'),
 ('Elvish', 'of'),
 ('of', 'an'),
 ('an', 'ancient'),
 ('ancient', 'mode'),
 ('mode', 'but'),
 ('but', 'the'),
 ('the', 'language'),
 ('language', 'is'),
 ('is', 'that'),
 ('that', 'of'),
 ('of', 'Mordor'),
 ('Mordor', 'which'),
 ('which', 'I'),
 ('I', 'will'),
 ('will', 'not'),
 ('not', 'utter'),
 ('utter', 'here'),
 ('here', 'But'),
 ('But', 'this'),
 ('this', 'in'),
 ('in', 'the'),
 ('the', 'Common'),
 ('Common', 'Tongue'),
 ('Tongue', 'is'),
 ('is', 'what'),
 ('what', 'is'),
 ('is', 'said'),
 ('said', 'close'),
 ('close', 'enough'),
 ('enough', 'One'),
 ('One', 'Ring'),
 (

In [25]:
# And trigrams
list(trigrams(lotr_quote_tokenized))

[('I', 'can', 'not'),
 ('can', 'not', 'read'),
 ('not', 'read', 'the'),
 ('read', 'the', 'fiery'),
 ('the', 'fiery', 'letters'),
 ('fiery', 'letters', 'said'),
 ('letters', 'said', 'Frodo'),
 ('said', 'Frodo', 'in'),
 ('Frodo', 'in', 'a'),
 ('in', 'a', 'quavering'),
 ('a', 'quavering', 'voice'),
 ('quavering', 'voice', 'No'),
 ('voice', 'No', 'said'),
 ('No', 'said', 'Gandalf'),
 ('said', 'Gandalf', 'but'),
 ('Gandalf', 'but', 'I'),
 ('but', 'I', 'can'),
 ('I', 'can', 'The'),
 ('can', 'The', 'letters'),
 ('The', 'letters', 'are'),
 ('letters', 'are', 'Elvish'),
 ('are', 'Elvish', 'of'),
 ('Elvish', 'of', 'an'),
 ('of', 'an', 'ancient'),
 ('an', 'ancient', 'mode'),
 ('ancient', 'mode', 'but'),
 ('mode', 'but', 'the'),
 ('but', 'the', 'language'),
 ('the', 'language', 'is'),
 ('language', 'is', 'that'),
 ('is', 'that', 'of'),
 ('that', 'of', 'Mordor'),
 ('of', 'Mordor', 'which'),
 ('Mordor', 'which', 'I'),
 ('which', 'I', 'will'),
 ('I', 'will', 'not'),
 ('will', 'not', 'utter'),
 ('not'

In [26]:
# We can also check everygrams - for example, every n-gram from 2 to 6 tokens long!
list(everygrams(lotr_quote_tokenized, 2, 6))

[('I', 'can'),
 ('I', 'can', 'not'),
 ('I', 'can', 'not', 'read'),
 ('I', 'can', 'not', 'read', 'the'),
 ('I', 'can', 'not', 'read', 'the', 'fiery'),
 ('can', 'not'),
 ('can', 'not', 'read'),
 ('can', 'not', 'read', 'the'),
 ('can', 'not', 'read', 'the', 'fiery'),
 ('can', 'not', 'read', 'the', 'fiery', 'letters'),
 ('not', 'read'),
 ('not', 'read', 'the'),
 ('not', 'read', 'the', 'fiery'),
 ('not', 'read', 'the', 'fiery', 'letters'),
 ('not', 'read', 'the', 'fiery', 'letters', 'said'),
 ('read', 'the'),
 ('read', 'the', 'fiery'),
 ('read', 'the', 'fiery', 'letters'),
 ('read', 'the', 'fiery', 'letters', 'said'),
 ('read', 'the', 'fiery', 'letters', 'said', 'Frodo'),
 ('the', 'fiery'),
 ('the', 'fiery', 'letters'),
 ('the', 'fiery', 'letters', 'said'),
 ('the', 'fiery', 'letters', 'said', 'Frodo'),
 ('the', 'fiery', 'letters', 'said', 'Frodo', 'in'),
 ('fiery', 'letters'),
 ('fiery', 'letters', 'said'),
 ('fiery', 'letters', 'said', 'Frodo'),
 ('fiery', 'letters', 'said', 'Frodo', 'in'

`everygrams` produces every n-gram for each length between the first number and the last. Be careful: on large texts, it can generate a very large object!
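To get a feel for how quickly the output grows, the sketch below counts the n-grams without building them. It assumes no padding, so a text with `T` tokens has `T - n + 1` n-grams of length `n`; the helper name `count_everygrams` is made up for this example.

```python
def count_everygrams(num_tokens, min_len, max_len):
    """Count the n-grams of every length from min_len to max_len (no padding)."""
    return sum(max(num_tokens - n + 1, 0) for n in range(min_len, max_len + 1))

print(count_everygrams(6, 2, 6))        # 5 + 4 + 3 + 2 + 1 = 15
print(count_everygrams(100_000, 2, 6))  # 499985
```

For lengths 2 to 6, the result is close to five times the number of tokens, and each n-gram is a tuple of up to six strings, so memory use adds up fast.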

The advantage of using n-grams is that sometimes we need to look at a word in its context, and bigrams or trigrams keep that context. Negation is the best example: in 'not happy', we can only tell that the word "happy" is being negated if we also consider the word that precedes it.
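The negation case can be checked with a short sketch in plain Python. The example sentence is made up, and the pairs are built with `zip` rather than NLTK's `bigrams`.

```python
tokens = "i am not happy with this ring".split()

# Unigrams: "happy" appears on its own, so the negation is invisible.
print("happy" in tokens)

# Bigrams: pairing each token with the next one keeps "not" attached to "happy".
pairs = list(zip(tokens, tokens[1:]))
print(("not", "happy") in pairs)
```

A model that only counts single words sees "happy" and may treat the sentence as positive, while one that looks at pairs can tell `('not', 'happy')` apart from `('very', 'happy')`.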

***
That's it for NLTK! NLTK is an awesome library that we will use throughout the rest of the sections. During this section we also got familiar with important concepts in Natural Language Processing such as:
- Tokenization;
- Part-of-Speech Tagging;
- Stemming and Lemmatization;
- N-Grams;
- Stop Words;
<br>
<br>
Give it your best on the exercises!
***