# **ICE-3: Text Preprocessing Beyond Tokenization**
**Instructions**:
1. Wherever you are asked to code, insert a text block below your code block and explain what you have coded as per your own understanding.
2. If the code is provided by us, execute it, and add below a text block and provide your explanation in your own words.
3. Submit both the .ipynb and pdf files to canvas.
4. **The similarity score should be less than 15%.**

This notebook focuses on preprocessing English text.

In [1]:
import re

# for using NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # added because WordNetLemmatizer raised an error without it
from nltk.corpus import stopwords

# for using SpaCy 
import spacy

# for HuggingFace
!pip install transformers
# !pip install ftfy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1


All the required modules and libraries are downloaded or imported here: re, nltk and spacy are imported, the NLTK resources are downloaded, and transformers is installed.
I added nltk.download('omw-1.4') because WordNetLemmatizer raised an error without it.

In [2]:
# trick to wrap text to the viewing window for this notebook
# Ref: https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

The user-defined function set_css is defined and registered as a pre_run_cell hook, so the line-wrapping style is applied before every cell runs.

## **(Tutorial) Tokenizing text using Spacy**

Following is a sample of text to demonstrate tokenization in SpaCy. 

In [3]:
dummy_text1 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text1)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



dummy_text1 is being displayed.

In [4]:
# loads a trained English pipeline with specific preprocessing components
nlp = spacy.load('en_core_web_sm')

# using SpaCy's tokenizer...
doc = nlp(dummy_text1)      # applies the processing pipeline on the text
for token in doc:
  print(token.text)

Here
is
the
First
Paragraph
and
this
is
the
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
first
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Now
,
it
is
the
Second
Paragraph
and
its
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
second
paragraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Finally
,
this
is
the
Third
Paragraph
and
is
the
First
Sentence
of
this
paragraph
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
third
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


4th
paragraph
just
has
one
sentence
in
it
.




Using spaCy's tokenizer, dummy_text1 is tokenized and each token is printed on its own line. Punctuation marks and the line breaks between paragraphs come out as separate tokens.

### **Task 1. Revisiting Tokenization**

Whitespace-based tokenization is a naive approach to tokenizing text, where the idea is to extract words that are separated by whitespace characters on either side.


#### **Question 1a. Implement the naive approach of tokenizing words (whitespace-based) for the text given in the code block below.(5 points)** 

**Important Note:** 
1. DO NOT use any of the existing implementations for tokenization distributed as part of open-source NLP libraries.
2. **If your solution uses readily available implementations of tokenizers, you will receive zero credit for this question.**
3. Your tokenizer implementation need not be the most optimized one. It should just be able to get the job done. You can also ignore punctuation.

In [5]:
sample_text="""Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,
remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum."""

# add your code below this comment and execute it once you have written the code
re_pattern2 = r'[A-Z][a-z]+|[a-z]+|[0-9]+[a-z]*'
print(re.findall(re_pattern2, sample_text))


['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry', 'Lorem', 'Ipsum', 'has', 'been', 'the', 'industry', 's', 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book', 'It', 'has', 'survived', 'not', 'only', 'five', 'centuries', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting', 'remaining', 'essentially', 'unchanged', 'It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing', 'Lorem', 'Ipsum', 'passages', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'Page', 'Maker', 'including', 'versions', 'of', 'Lorem', 'Ipsum']


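I have used a regular expression for tokenization. It matches words that start with an uppercase or a lowercase letter, and also numbers followed by zero or more letters. Because it matches character patterns instead of splitting on whitespace, it drops punctuation and also breaks some words apart: 'PageMaker' becomes 'Page' and 'Maker', and "industry's" becomes 'industry' and 's'.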
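For comparison, a strictly whitespace-based tokenizer can be sketched without regular expressions or any library tokenizer. The function name `whitespace_tokenize` and the short sample string are made up for illustration; this is one possible approach, not the expected solution.

```python
def whitespace_tokenize(text):
    """Split text into tokens wherever one or more whitespace characters occur."""
    tokens = []
    current = []                      # characters of the token being built
    for ch in text:
        if ch.isspace():              # space, tab, newline, ...
            if current:               # close the token; repeated whitespace is skipped
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                       # flush the final token
        tokens.append("".join(current))
    return tokens


# Made-up sample built from words of the assignment text
sample = "Lorem Ipsum has been the industry's standard,\nlike Aldus PageMaker."
print(whitespace_tokenize(sample))
```

Unlike the regular expression above, this keeps 'PageMaker' and "industry's" intact, but punctuation stays attached to the word next to it ('standard,').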

#### **Question 1b. For the same text in Q1., apply the tokenizers listed below. Analyze how the words are being tokenized by each of the tokenizers. Compare and contrast the outputs of the tokenization schemes.(10 points)**
1. **SpaCy's tokenizer**
2. **NLTK's tokenizer**

**Note:** You are already familiar with using NLTK's tokenization which was demonstrated in the previous labs. If you do not remember, just revisit them to refresh your memory.

In [6]:
sample_text="""Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,
remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum."""

# add your code below this comment and execute it once you have written the code.
# you can add additional code cells if need be. make sure to use the text cell provided to answer the question.
#1.Spacy
spa = spacy.load("en_core_web_sm")

## tokenization
doc = spa(sample_text)
for token in doc:
    print(token.text)


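#2. Intended as the NLTK step, but str.split() is plain whitespace splitting,
#   not an NLTK tokenizer such as nltk.word_tokenize
tokens = sample_text.split()
print(tokens)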

Lorem
Ipsum
is
simply
dummy
text
of
the
printing
and
typesetting
industry
.
Lorem
Ipsum
has
been
the
industry
's
standard
dummy
text
ever
since
the
1500s
,


when
an
unknown
printer
took
a
galley
of
type
and
scrambled
it
to
make
a
type
specimen
book
.
It
has
survived
not
only
five
centuries
,
but
also
the
leap
into
electronic
typesetting
,


remaining
essentially
unchanged
.
It
was
popularised
in
the
1960s
with
the
release
of
Letraset
sheets
containing
Lorem
Ipsum
passages
,
and
more
recently
with
desktop
publishing
software


like
Aldus
PageMaker
including
versions
of
Lorem
Ipsum
.
['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry.', 'Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book.', 'It', 'has', 'survived', 'not', 'o

**Answer for Q1b.** Type in your answer here!

The spaCy tokenizer emits every punctuation mark as its own token, splits the 's off "industry's", and also returns the line breaks in the text as separate whitespace tokens.
The code in the NLTK step calls sample_text.split(), which is plain whitespace splitting rather than one of NLTK's tokenizers. It only breaks the text at whitespace, so punctuation stays attached to the neighbouring word: in 'industry.' and '1500s,' the '.' and ',' are part of the token because no whitespace separates them from the word. It does not return whitespace as tokens, which is a good thing. NLTK's own word tokenizer would have to be run on the same text to compare it properly with spaCy.
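The difference between the two outputs can be illustrated with the standard library alone. The regular expression below is a made-up approximation of a punctuation-aware tokenizer; it is not the rule set used by spaCy or NLTK, and it only handles the 's clitic.

```python
import re

# A fragment of the sample text from the question
text = "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,"

# Whitespace splitting: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Rough punctuation-aware splitting: the clitic 's, or a run of word
# characters, or any single symbol that is neither a word character nor a space.
punct_aware_tokens = re.findall(r"'s\b|\w+|[^\w\s]", text)

print(whitespace_tokens)
print(punct_aware_tokens)
```

On this fragment the second list separates 'industry' from "'s" and '1500s' from ',', which is the behaviour seen in the spaCy output above.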


---


## **(Tutorial) Stemming and Lemmatization using NLTK**

Let's see how we can perform stemming and lemmatization using NLTK library...

In [7]:
# importing PorterStemmer class from nltk.stem module
from nltk.stem import PorterStemmer
porter = PorterStemmer()    # instantiating an object of the PorterStemmer class

stem = porter.stem('cats')    # calling the stemmer algorithm on the desired word
print(f"'cats' after stemming: {stem}")

'cats' after stemming: cat


cats was stemmed using PorterStemmer.

**Try executing the Porter stemmer on your own examples (2 points)**

In [8]:
#Enter your code here
stem = porter.stem('driving')    # calling the stemmer algorithm on the desired word
print(f"'driving' after stemming: {stem}")

stem = porter.stem('flying')    # calling the stemmer algorithm on the desired word
print(f"'flying' after stemming: {stem}")

stem = porter.stem('animals')    # calling the stemmer algorithm on the desired word
print(f"'animals' after stemming: {stem}")

stem = porter.stem('dogs')    # calling the stemmer algorithm on the desired word
print(f"'dogs' after stemming: {stem}")

'driving' after stemming: drive
'flying' after stemming: fli
'animals' after stemming: anim
'dogs' after stemming: dog


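driving, flying, animals and dogs are stemmed using the Porter stemmer. 'dogs' and 'driving' are reduced to the dictionary words 'dog' and 'drive', while 'flying' and 'animals' become 'fli' and 'anim'. These two are not dictionary words, but a stem does not have to be one; it only has to be the same for related word forms.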

In [9]:
# importing WordNet-based lemmatizer class from nltk.stem module
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()    # instantiating an object of the WordNetLemmatizer class

lemma = lemmatizer.lemmatize('cats')    # calling the lemmatization algorithm on the desired word
print(f"'cats' after lemmatization: {lemma}")

'cats' after lemmatization: cat


**Try executing the WordNet-based lemmatizer on your own examples (3 points)**

In [10]:
#Enter your code here
lemma = lemmatizer.lemmatize('driving')    # calling the lemmatization algorithm on the desired word
print(f"'driving' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('flying')    # calling the lemmatization algorithm on the desired word
print(f"'flying' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('animals')    # calling the lemmatization algorithm on the desired word
print(f"'animals' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('dogs')    # calling the lemmatization algorithm on the desired word
print(f"'dogs' after lemmatization: {lemma}")

'driving' after lemmatization: driving
'flying' after lemmatization: flying
'animals' after lemmatization: animal
'dogs' after lemmatization: dog


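driving, flying, animals and dogs are lemmatized using the WordNet-based lemmatizer. 'dogs' and 'animals' are reduced to 'dog' and 'animal', but 'driving' and 'flying' come back unchanged. This happens because lemmatize() treats every word as a noun unless a part of speech is given; calling it with pos='v' returns 'drive' and 'fly'.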

### **Task 2: Lemmatization or Stemming?**




Following is the text that you will be using for this task (Task 2 only):

In [11]:
# This is the text on which you have to perform stemming; taken from Internet.
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."
print("Given text:")
print(text)

Given text:
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.


text is displayed

Performing some preprocessing that we have learnt in previous ICEs...

In [12]:
en_stopwords = set(stopwords.words('english'))
def remove_punc(text_string):
  return re.sub('[^a-zA-Z0-9 ]', '', text_string.lower())

def remove_stopwords(text_string):
  return [ token for token in text_string.split(' ') if token not in en_stopwords ]

Functions are defined to remove punctuation and stopwords.

#### **Question 2. Remove punctuation and stopwords from the text using the functions provided above. Then perform stemming on the cleaned text using the Porter Stemmer from NLTK. (10 points)**

In [13]:
# apply Porter Stemmer on the cleaned text (after punctuation and stopwords are removed) below this comment
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."

def remove_punc(text_string):
  return re.sub('[^a-zA-Z0-9 ]', '', text_string.lower())

def remove_stopwords(text_string):
  return [ token for token in text_string.split(' ') if token not in en_stopwords ]
text = remove_punc(text)
clean_text = remove_stopwords(text)
#print(clean_text) #I had displayed the clean_text but realized later that it is not required
from nltk.stem import PorterStemmer
porter = PorterStemmer()    
for term in clean_text:
  stem = porter.stem(term)
  print(stem) 

linguist
morpholog
inform
retriev
stem
process
reduc
inflect
sometim
deriv
word
word
stem
base
root
form
gener
written
word
form
stem
need
ident
morpholog
root
word
usual
suffici
relat
word
map
stem
even
stem
valid
root


The text is cleaned by removing punctuation and stopwords. Each token of the cleaned text is then stemmed using the Porter Stemmer from NLTK.

#### **Question 3. Perform lemmatization on the same cleaned text above using NLTK's lemmatizer. (10 points)**

In [14]:
# apply NLTK's lemmatizer on the cleaned text (after punctuation and stopwords are removed) below this comment

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for word in clean_text:
  lemma = lemmatizer.lemmatize(word)
  print(lemma) 

linguistic
morphology
information
retrieval
stemming
process
reducing
inflected
sometimes
derived
word
word
stem
base
root
form
generally
written
word
form
stem
need
identical
morphological
root
word
usually
sufficient
related
word
map
stem
even
stem
valid
root


The same clean_text from Question 2 is used. Each token is lemmatized using the WordNet-based lemmatizer from NLTK. Most tokens come back unchanged; the visible change is the plural 'words' becoming 'word'.

#### **Question 4. How good is Lemmatization when compared to Stemming? Also write down your observations on performing lemmatization and stemming on text before and after cleaning (removing punctuation and stopwords) (10 points)**

**IMPORTANT NOTE: Your observations should not be based on just Q2 and Q3. Your observations should characterize spacy's and nltk's segmentation as a whole. Bring out the differences also if there are any**

**Answer for Q4.:** Type your answer here!

Neither lemmatization nor stemming is better in every case; each does better than the other on some words. For example, the Porter stemmer reduced 'driving' to 'drive' while the WordNet-based lemmatizer left it unchanged, and the lemmatizer reduced 'animals' to 'animal' while the stemmer produced 'anim'. Both handled 'dogs', and neither produced 'fly' for 'flying'. Part of this gap comes from how the lemmatizer was called: it treats every word as a noun unless a part of speech is passed, so verb forms such as 'driving' and 'flying' are returned as they are. So there is no clear winner between the two; the choice depends mainly on the data and the application. Stemming strips affixes by rule, so its output may not be a real word ('morpholog', 'retriev'), while lemmatization maps a word to its dictionary base form, so its output is a valid word. I think stemming performs similarly before and after cleaning the data, while the lemmatizer performs better after cleaning the data.
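The contrast between rule-based suffix stripping and dictionary lookup can be shown with a toy sketch. Neither function reproduces the Porter algorithm or WordNet; the suffix list, the lookup table and the function names are all made up for illustration.

```python
# Toy illustration only: not the Porter stemmer and not WordNet.
SUFFIXES = ("ing", "s")            # hypothetical, tiny suffix list

def crude_stem(word):
    """Strip the first matching suffix, with no dictionary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("driving", "v"): "drive",
    ("flying", "v"): "fly",
    ("animals", "n"): "animal",
    ("dogs", "n"): "dog",
}

def toy_lemmatize(word, pos="n"):
    """Return the base form if the (word, pos) pair is known, else the word itself."""
    return LEMMA_TABLE.get((word, pos), word)

for word in ["driving", "flying", "animals", "dogs"]:
    print(word, crude_stem(word), toy_lemmatize(word), toy_lemmatize(word, pos="v"))
```

The stripping rule produces 'driv', which is not a word, while the lookup only finds 'drive' when it is told that 'driving' is a verb. That mirrors why the lemmatizer examples above left the verb forms unchanged.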


## **(Tutorial) Sentence Segmentation using Spacy**

Following is a dummy paragraph of text to demonstrate how to use SpaCy to segment text into sentences.

In [15]:
dummy_text3 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text3)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



dummy_text3 is displayed

**Code for sentence segmentation using Spacy**

In [16]:
nlp = spacy.load('en_core_web_sm')

# performing sentence splitting...
doc = nlp(dummy_text3)
for sentence in doc.sents:
  print(sentence)

Here is the First Paragraph and this is the First Sentence.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the first paragaraph.
This paragraph is ending now with a Fifth Sentence.

Now, it is the Second Paragraph and its First Sentence.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the second paragraph.
This paragraph is ending now with a Fifth Sentence.

Finally, this is the Third Paragraph and is the First Sentence of this paragraph.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the third paragaraph.
This paragraph is ending now with a Fifth Sentence.

4th paragraph just has one sentence in it.



dummy_text3 is segmented using spacy.

**Code for sentence segmentation using NLTK library**

In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

nltk is imported and package punkt is downloaded/updated

In [18]:
text="This is a very bad situation. Also I am looking good"
sentences=nltk.sent_tokenize(text)
for sentence in sentences:
  print(sentence)
  print()

This is a very bad situation.

Also I am looking good



text is segmented using NLTK


### **Task 3. Segmenting Sentences**

In [19]:
inau_text="""There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which
don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the
Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin
words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition,
injected humour, or non-characteristic words etc."""

print(inau_text)

There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which
don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the
Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin
words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition,
injected humour, or non-characteristic words etc.


inau_text is displayed

#### **Question 5a. Implement a custom Python script that performs a simple way of segmenting sentences in the text above by using the period (.) character as the sentence boundary. Analyze the generated output and provide your observations.(15 points)**

**Note:** You do not need to remove any stopwords, punctuation or apply any kind of other preprocessing techniques. Only perform what's asked to minimize your effort needed to answer this question. 

**Hint**: Use print( ) to help you understand how the sentences are being split when analyzing your output to note down your observations.

In [20]:
# write your code below this comment

for w in inau_text.split(". "):
    print(w+".")

There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which
don't look even slightly believable.
If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text.
All the
Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet.
It uses a dictionary of over 200 Latin
words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable.
The generated Lorem Ipsum is therefore always free from repetition,
injected humour, or non-characteristic words etc..


The text is split into segments using the period (.) character as the sentence boundary. I used the split function with '. ' (a period followed by a space) as the separator and added the period back when printing each segment. As stated, I have not removed any stopwords or punctuation, or applied any other preprocessing technique.
The output shows each sentence after segmentation. The text is split wherever '. ' is present, and the line breaks inside the sentences are kept. The last segment already ends with a period, so adding another one prints 'etc..'.
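A small variation avoids the doubled period on the last segment. This is only a sketch: the function name `split_on_period` and the demo strings are made up, and the period is still the only boundary it knows about.

```python
def split_on_period(text):
    """Split text at '. ' and restore the period that the split removed."""
    sentences = []
    for piece in text.split(". "):
        piece = piece.strip()
        if not piece:                 # skip empty pieces
            continue
        # The last piece keeps its own period, so only add one when it is missing.
        sentences.append(piece if piece.endswith(".") else piece + ".")
    return sentences


# Made-up demo strings
demo = "It uses a dictionary of over 200 Latin words. The output avoids odd words etc."
for sentence in split_on_period(demo):
    print(sentence)

# An abbreviation inside a sentence shows the weakness of this approach.
print(split_on_period("Dr. Smith arrived. He left."))
```

The second call splits after 'Dr.', which is why a period alone is not a reliable sentence boundary.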

#### **Question 5b. Using SpaCy, perform sentence segmentation on the same text (that was used in Q5a.). Analyze the generated output and provide your observations.<br> Now implement segmentation using NLTK, provide your observations.(15 points)**

**Hint**: For implementing NLTK's sentence segmentation, you can refer to the code block above.

In [21]:
# write your code below this comment

nlp = spacy.load('en_core_web_sm')

# performing sentence splitting...
doc = nlp(inau_text)
for sentence in doc.sents:
  print(sentence)
print()
print()
sentences=nltk.sent_tokenize(inau_text)
for sentence in sentences:
  print(sentence)

There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which
don't look even slightly believable.
If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text.
All the
Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet.
It uses a dictionary of over 200 Latin
words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable.
The generated Lorem Ipsum is therefore always free from repetition,
injected humour, or non-characteristic words etc.


There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which
don't look even slightly believable.
If you are going to use a passage of Lorem I

The output of spaCy's segmentation is the same as the output of my segmentation in Q5a, except that the last sentence ends with a single period. It might perform better when a sentence ends with '?', because we never told it to use '.' as the sentence boundary and it still segments the text.

I have printed two blank lines in between, so that the output is clear.

The output of NLTK's segmentation is the same as the output of my segmentation in Q5a and of spaCy's segmentation. It might perform better than my segmenter when a sentence ends with '?', for the same reason: we never told it to use '.' as the sentence boundary and it still segments the text.
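To check the point about '?', the custom splitter can be extended to other end-of-sentence marks with a regular expression. This is a rough sketch and not what spaCy or NLTK do internally; the function name `split_sentences` and the demo string are made up.

```python
import re

def split_sentences(text):
    """Split after '.', '?' or '!' when the mark is followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Made-up demo string with three different sentence endings
demo = "Is this Lorem Ipsum? It looks reasonable! It uses over 200 Latin words."
for sentence in split_sentences(demo):
    print(sentence)
```

It still breaks on abbreviations such as 'etc.' in the middle of a sentence, so it is only a small step beyond the period-based version.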



---



## **(Tutorial) Subword Tokenization using HuggingFace**

### **Task 4: Subword Tokenization**

NLP models are not as intelligent as humans: when they see a word that is not in the corpus yet, they cannot break it into sub-words on their own and work out what it means. This is where subword tokenization comes into the picture.

Subword tokenization is a recent strategy from machine translation that helps us solve these problems by breaking unknown words into “subword units” - strings of characters like ing or eau - that still allow the downstream model to make intelligent decisions on words it doesn't recognize.
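The idea of learning subword units can be sketched with a toy version of byte pair encoding: repeatedly merge the most frequent pair of adjacent symbols. The corpus, the number of merges and the function names below are made up for illustration, and this is not the byte-level implementation used by GPT-2.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def segment(word, merges):
    """Apply the learned merges, in order, to a single word."""
    state = {tuple(word): 1}
    for pair in merges:
        state = merge_pair(pair, state)
    return list(next(iter(state)))

# Hypothetical toy corpus: word -> frequency, each word starts as characters.
word_freqs = {"walking": 4, "talking": 3, "walked": 2, "talk": 5}
vocab = {tuple(word): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(4):                       # learn four merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)     # most frequent pair (first one on ties)
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)
print(segment("talked", merges))         # a word that is not in the toy corpus
```

The word 'talked' never appears in the toy corpus, yet it is still broken into units the model has seen ('talk', 'e', 'd') instead of being treated as unknown.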

**Below is the implementation of the Subword Tokenization:**
<br>We will see the Byte Pair Encoding (BPE) algorithm here:

In [22]:
!pip install tokenizers

!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
--2022-09-25 20:51:36--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.42.38
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.42.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘gpt2-medium-vocab.json’


2022-09-25 20:51:36 (7.27 MB/s) - ‘gpt2-medium-vocab.json’ saved [1042301/1042301]

--2022-09-25 20:51:36--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.42.38
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.42.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘gpt2-merges.txt’


2022-09-25 20:51:36 (3.98 MB/s) - ‘gpt2-merges.txt’ saved [456318/456318]



The tokenizers package is installed, and the GPT-2 vocabulary and merges files are downloaded.

In [23]:
from tokenizers import ByteLevelBPETokenizer
gpt2vocab = "gpt2-medium-vocab.json"
gpt2merges = "gpt2-merges.txt"

bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
bpe_encoding = bpe.encode("The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789.")
print(bpe_encoding.ids)
print(bpe_encoding.tokens)

[464, 2183, 286, 13630, 281, 2209, 319, 554, 7493, 3924, 3596, 2067, 351, 262, 845, 717, 554, 7493, 3924, 960, 20191, 2669, 447, 247, 82, 960, 261, 3035, 1542, 11, 1596, 4531, 13]
['The', 'Ġcustom', 'Ġof', 'Ġdelivering', 'Ġan', 'Ġaddress', 'Ġon', 'ĠIn', 'aug', 'uration', 'ĠDay', 'Ġstarted', 'Ġwith', 'Ġthe', 'Ġvery', 'Ġfirst', 'ĠIn', 'aug', 'uration', 'âĢĶ', 'George', 'ĠWashington', 'âĢ', 'Ļ', 's', 'âĢĶ', 'on', 'ĠApril', 'Ġ30', ',', 'Ġ17', '89', '.']


The sentence is encoded using the Byte-Pair Encoding Tokenizer.

**Question 6:** Consider the following two sentences:

* The movie was not good at all, the climax was good though.
* That's an example, don't ignore it!. Or else, you might miss key information.

Encode these sentences using the Byte-Pair Encoding tokenizer (created during the tutorial). Retrieve the tokens from the encodings of the two sentences. Is/Are there any interesting observations when you compare the tokens between the two encodings? What do you think is causing what you observe as part of your comparison? **(20points)**


In [24]:
# use the bpe tokenizer that was created during the tutorial to encode the sentences
# write your code below this comment and execute
# type in your answer to the question asked above in the following cell (see below)
from tokenizers import ByteLevelBPETokenizer
gpt2vocab = "gpt2-medium-vocab.json"
gpt2merges = "gpt2-merges.txt"

bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
bpe_encoding = bpe.encode("The movie was not good at all, the climax was good though.")
print(bpe_encoding.ids)
print(bpe_encoding.tokens)


bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
bpe_encoding = bpe.encode("That's an example, don't ignore it!. Or else, you might miss key information")
print(bpe_encoding.ids)
print(bpe_encoding.tokens)

[464, 3807, 373, 407, 922, 379, 477, 11, 262, 30032, 373, 922, 996, 13]
['The', 'Ġmovie', 'Ġwas', 'Ġnot', 'Ġgood', 'Ġat', 'Ġall', ',', 'Ġthe', 'Ġclimax', 'Ġwas', 'Ġgood', 'Ġthough', '.']
[2504, 338, 281, 1672, 11, 836, 470, 8856, 340, 43179, 1471, 2073, 11, 345, 1244, 2051, 1994, 1321]
['That', "'s", 'Ġan', 'Ġexample', ',', 'Ġdon', "'t", 'Ġignore', 'Ġit', '!.', 'ĠOr', 'Ġelse', ',', 'Ġyou', 'Ġmight', 'Ġmiss', 'Ġkey', 'Ġinformation']


**Answer to Question 6:**
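The 'Ġ' character is added to every token that is preceded by a space in the sentence, including capitalized words such as 'ĠOr'. It is how the byte-level tokenizer displays the space, so it marks the beginning of a new word. Tokens without it are the first word of the sentence and tokens that attach directly to the previous one, such as punctuation, "'s" and "'t".
The "'s" in the second sentence is encoded as a single token, and "don't" is split into 'Ġdon' and "'t". 'Ġit' is a separate token, while '!' and '.' are merged into the single token '!.'.
In the first sentence 'was' and 'good' each appear twice, and their ids (373 and 922) are the same both times. The comma has the same id (11) in both sentences. 'The' (464) and 'Ġthe' (262) get different ids because of the capital letter and the leading space.
I think the cause of these observations is that the tokenizer does not split on words. It applies the merge rules from the pre-trained GPT-2 vocabulary and merges files, so frequent character sequences, together with their leading space, become single tokens.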
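The 'Ġ' marker can be reproduced with the byte-to-character table that GPT-2 style byte-level BPE uses to display raw bytes. The sketch below rebuilds that table as I understand it; the helper name `show_bytes` is made up, and the code only shows how bytes are displayed, not how merges are chosen.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))   # visible ASCII
        + list(range(0xA1, 0xAC + 1))         # Latin-1 symbols
        + list(range(0xAE, 0xFF + 1))         # Latin-1 letters and symbols
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control bytes, ...) are moved past 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()

def show_bytes(text):
    """Render the UTF-8 bytes of text using the byte-to-character table."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(show_bytes(" movie"))     # a word with its leading space
print(show_bytes("\u2014"))     # the long dash from the tutorial sentence
```

The space byte is displayed as 'Ġ', and the three UTF-8 bytes of the long dash are displayed as the three characters seen in the tutorial output.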

## **References**
* https://spacy.io/usage/spacy-101
* https://spacy.io/models/en
* https://www.nltk.org/howto/wordnet.html
* https://www.nltk.org/_modules/nltk/stem/wordnet.html
* https://colab.research.google.com/drive/10gwzRY55JqzgeEQOX6nwFs6bQ84-mB9f?usp=sharing#scrollTo=DP1xuStV0fDl
* https://towardsdatascience.com/a-comprehensive-guide-to-subword-tokenisers-4bbd3bad9a7c
* https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/