# **In-Class Assignment: NLP Pipeline**
## *DATA 5420/6420*
## Name: Dallin Moore

In this in-class assignment we're going to run through the entire NLP pipeline and apply some common text cleaning and normalization steps. We'll start with a text that needs extensive processing so we can run it through the full battery of steps, then we'll do the same on a much simpler text that requires less effort.

Which steps you need will depend on the text and the task at hand!

### Basic Outline of Steps:
1. Import text
2. Remove HTML (if applicable)
3. Case conversion
4. Contractions
5. Removing Stopwords
6. Stemming/Lemmatization
7. Tokenize text
8. Text Output

It's important to note that this list is NOT exhaustive, does NOT need to be done in this order, and which steps you choose WILL depend on the task at hand. The point of this exercise is to show you one procedure for cleaning/processing a text and show two options of output. This will vary based on a given text and what you want to do with it after!

Here, we're going to be using lots of familiar libraries and packages, but we'll also introduce some new ones including the popular and useful `spacy` library! We'll also need `nltk`, `re`, `pprint`, `BeautifulSoup`, `contractions`, `pandas`, and `numpy`.

In [1]:
import nltk, re, pprint

from urllib import request
from bs4 import BeautifulSoup  # needed for parsing HTML

!pip install contractions
import contractions  # contractions dictionary
from string import punctuation

import spacy  # used for lemmatization
!python -m spacy download en_core_web_sm  # OR, in Jupyter, run `python -m spacy download en_core_web_sm` in a terminal

from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords  # stopword removal
tokenizer = ToktokTokenizer()
from nltk import word_tokenize

import pandas as pd
import numpy as np  # general packages for data manipulation

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24

#### **1) Import Text - UTF-8 Encoded**

For this example we'll run the text of `Helps and Hints for Hallowe'en` through the NLP pipeline. Why this text? It's pretty messy, which makes it a good opportunity to demonstrate different processing functions. Plus, I love Halloween.

In [4]:
url = "https://www.gutenberg.org/cache/epub/68984/pg68984-images.html"
response = request.urlopen(url)

raw = response.read().decode('utf-8-sig')
raw

'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n<meta charset="utf-8"><style>\r\n#pg-header div, #pg-footer div {\r\n    all: initial;\r\n    display: block;\r\n    margin-top: 1em;\r\n    margin-bottom: 1em;\r\n    margin-left: 2em;\r\n}\r\n#pg-footer div.agate {\r\n    font-size: 90%;\r\n    margin-top: 0;\r\n    margin-bottom: 0;\r\n    text-align: center;\r\n}\r\n#pg-footer li {\r\n    all: initial;\r\n    display: block;\r\n    margin-top: 1em;\r\n    margin-bottom: 1em;\r\n    text-indent: -0.6em;\r\n}\r\n#pg-footer div.secthead {\r\n    font-size: 110%;\r\n    font-weight: bold;\r\n}\r\n#pg-footer #project-gutenberg-license {\r\n    font-size: 110%;\r\n    margin-top: 0;\r\n    margin-bottom: 0;\r\n    text-align: center;\r\n}\r\n#pg-header-heading {\r\n    all: inherit;\r\n    text-align: center;\r\n    font-size: 120%;\r\n    font-weight:bold;\r\n}\r\n#pg-footer-heading {\r\n    all: inherit;\r\n    text-align: center;\r\n    font-size: 120%;\r\n    font-weight: normal;\r\n 

**It's clear that we want to remove the HTML tags, and we can use `BeautifulSoup` with Python's built-in `html.parser` to do that. But that's not going to get rid of all unwanted characters. Let's remove the HTML and then figure out what else needs to be removed...**

#### **2) Remove HTML Tags + Unwanted Characters & Trim Text**

Let's start by defining a function to remove unwanted HTML tags, and then we'll build it out based on the other characters we want to remove:

In [17]:
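def text_cleaner(text):
    soup = BeautifulSoup(text, 'html.parser')
    for s in soup(['iframe', 'script']):  # drop embedded frames and scripts
        s.extract()
    stripped_text = soup.get_text()  # keep only the text content
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)  # collapse runs of line breaks (a '|' inside [] would be matched literally)
    stripped_text = re.sub('’',"'",stripped_text)  # straighten curly apostrophes
    stripped_text = re.sub(r"[^'\w\s\.]+", '', stripped_text)  # delete everything except apostrophes, word characters, whitespace, and periods
    stripped_text = re.sub(r'\d+\.|\d+', '', stripped_text)  # delete numbers, plus a period directly after a number
    stripped_text = re.sub(r"HALLOWE'EN|[hH]allowe'en",'halloween', stripped_text)  # normalize spelling variants
    # iteratively add cleaning steps here
    return stripped_text

clean_text = text_cleaner(raw)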
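
One side effect of this cleaner is worth knowing about: unwanted characters are deleted rather than replaced, so words separated only by punctuation fuse together (the cleaned text contains `ArchiveAmerican` and `httpswww.pgdp.net`, for example). The sketch below reproduces the effect on a made-up sample string and shows one possible alternative, substituting a space. The alternative is an assumption for illustration, not what `text_cleaner` does.

```python
import re

# Made-up sample; \u2014 is a long dash with no surrounding spaces.
sample = "The Internet Archive/American Libraries met fairies\u2014so they say."

# Same pattern as in text_cleaner: delete anything that is not an
# apostrophe, word character, whitespace, or period.
deleted = re.sub(r"[^'\w\s\.]+", '', sample)

# Alternative: substitute a space, then collapse runs of spaces.
spaced = re.sub(r"[^'\w\s\.]+", ' ', sample)
spaced = re.sub(r' {2,}', ' ', spaced)

print(deleted)
print(spaced)
```

Substituting a space keeps word boundaries intact, at the cost of splitting hyphenated words such as `re-use` into two tokens.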

In [18]:
clean_text[0:5000]

"\n      The Project Gutenberg eBook of Helps and Hints for halloween by Laura Rountree Smith.\n    \nThe Project Gutenberg eBook of Helps and hints for halloween\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it give it away or reuse it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\nTitle Helps and hints for halloween\nAuthor Laura Rountree Smith\nRelease date September   eBook \nLanguage English\nOriginal publication United States March Brothers \nCredits Charlene Taylor and the Online Distributed Proofreading Team at httpswww.pgdp.net This file was produced from images generously made available by The Internet ArchiveAmerican Libraries.\n START OF THE PROJECT GUT

**Now let's find the beginning and end of the text and trim it:**

In [19]:
print("[", clean_text.find("START OF THE PROJECT GUTENBERG"), ":", clean_text.rfind("END OF THE PROJECT"), "]")

[ 958 : 67070 ]


In [20]:
clean_text = clean_text[958:67070] # trim the text
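
The slice bounds above are copied by hand from one run, so they go stale as soon as the cleaning function changes. A small sketch of computing them at run time instead; `trim_between` is a hypothetical helper name and the sample string is made up:

```python
def trim_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to the last end_marker.

    str.find and str.rfind return -1 when a marker is missing, so fall
    back to the start or end of the text in that case.
    """
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1:
        start = 0
    if end == -1:
        end = len(text)
    return text[start:end]


sample = "license... START OF THE PROJECT body text END OF THE PROJECT footer"
trimmed = trim_between(sample, "START OF THE PROJECT", "END OF THE PROJECT")
print(repr(trimmed))
```

Like the hard-coded slice, this keeps the start marker itself in the trimmed text.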

#### **3) Lowercase**

**Next in the pipeline is setting all characters to lowercase. Why do we care about doing this?**

To standardize the text and reduce the number of distinct tokens that we are working with: `Halloween`, `HALLOWEEN`, and `halloween` all become the same token.
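
A quick way to see the effect is to count distinct tokens before and after lowercasing. This is a minimal sketch on a made-up sentence, using plain whitespace splitting rather than a real tokenizer:

```python
sample = "Halloween is here. HALLOWEEN stunts and halloween games. The games begin."

# Crude tokenization for illustration: drop periods, split on whitespace.
tokens = sample.replace('.', '').split()

unique_before = set(tokens)
unique_after = {token.lower() for token in tokens}

print(len(unique_before), len(unique_after))
```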

In [22]:
def lowercase(text):
  sents_lower = text.lower() # convert every character to lowercase
  return sents_lower

lower_text = lowercase(clean_text) # apply to the trimmed clean_text
lower_text

"start of the project gutenberg ebook helps and hints for halloween \n\nhelps and hints\nfor\nhalloween\nby\nlaura rountree smith\nmarch brothers publishers\n   wright ave. lebanon ohio\n\ncopyright  by\nmarch brothers\n\ncontents\npage\nintroduction\n\nparty suggestions\nnutcrack night\n\nhalloween stunts\na shadow play\n\nthe black cat stunt\n\na pumpkin climbing game\n\nexercises\nhalloween acrostic\n\ntake care tables are turned\n\ndrills\nclown drill and song\n\nautumn leaf drill\n\ncattail drill\n\nmuff drill\n\ndialogs and plays\nthe halloween ghosts\n\non halloween night\n\njack frost's surprise\n\nan historical halloween\n\nthe witch's dream\n\na halloween carnival and waxwork show\n\nthe play of pomona\n\nhalloween puppet play\n\n\nnote\nsend for our complete\ncatalog in which will be\nfound all the accessories\nneeded in carrying out the\nideas given in this book.\nmarch brothers publishers\n   wright ave. lebanon ohio\n\nintroduction\nhist be still 'tis halloween\nwhen fair

#### **4) Contractions**

Contractions are an interesting thing to deal with: we usually treat them as one entity, but for NLP purposes we often want to separate them into their constituent words. The `contractions` library contains a dictionary of predefined contractions and their expansions. We will use it here inside an `expand_contractions` function that we define.

In [23]:
contractions.contractions_dict # view dictionary of contractions

{"I'm": 'I am',
 "I'm'a": 'I am about to',
 "I'm'o": 'I am going to',
 "I've": 'I have',
 "I'll": 'I will',
 "I'll've": 'I will have',
 "I'd": 'I would',
 "I'd've": 'I would have',
 'Whatcha': 'What are you',
 "amn't": 'am not',
 "ain't": 'are not',
 "aren't": 'are not',
 "'cause": 'because',
 "can't": 'cannot',
 "can't've": 'cannot have',
 "could've": 'could have',
 "couldn't": 'could not',
 "couldn't've": 'could not have',
 "daren't": 'dare not',
 "daresn't": 'dare not',
 "dasn't": 'dare not',
 "didn't": 'did not',
 'didn’t': 'did not',
 "don't": 'do not',
 'don’t': 'do not',
 "doesn't": 'does not',
 "e'er": 'ever',
 "everyone's": 'everyone is',
 'finna': 'fixing to',
 'gimme': 'give me',
 "gon't": 'go not',
 'gonna': 'going to',
 'gotta': 'got to',
 "hadn't": 'had not',
 "hadn't've": 'had not have',
 "hasn't": 'has not',
 "haven't": 'have not',
 "he've": 'he have',
 "he's": 'he is',
 "he'll": 'he will',
 "he'll've": 'he will have',
 "he'd": 'he would',
 "he'd've": 'he would have',
 

In [24]:
text_1 = "I didn't even know it's a big deal."

# Expand contractions one word at a time
def expand_contractions(text):
    expanded_words = [] # create empty list
    for word in text.split(): # split text on whitespace into individual words
        expanded_words.append(contractions.fix(word)) # replace contractions with their expansions from the dictionary
    expanded_text = ' '.join(expanded_words) # rejoin once, after the loop (this also turns newlines into single spaces)
    return expanded_text

expand_contractions(text_1)

'I did not even know it is a big deal.'
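
Conceptually, the expansion is a dictionary lookup. The sketch below reimplements the idea with a tiny hand-written mapping and the standard library only. It is not how the `contractions` package works internally, and `SMALL_CONTRACTIONS`, `PATTERN`, and `expand_with_dict` are illustrative names.

```python
import re

# A tiny, hand-picked mapping for illustration only.
SMALL_CONTRACTIONS = {
    "didn't": "did not",
    "it's": "it is",
    "can't": "cannot",
    "i'm": "i am",
}

# One alternation of all keys, longest first so longer keys win.
_keys = sorted(SMALL_CONTRACTIONS, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in _keys) + r")\b",
    flags=re.IGNORECASE,
)


def expand_with_dict(text):
    """Replace every known contraction with its (lowercase) expansion."""
    return PATTERN.sub(lambda m: SMALL_CONTRACTIONS[m.group(0).lower()], text)


print(expand_with_dict("i didn't even know it's a big deal."))
```

A lookup like this cannot tell "it is" from "it has" for `it's`; any fixed dictionary has to pick one expansion.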

In [25]:
expanded_text = expand_contractions(lower_text) # apply to lower_text
expanded_text

"start of the project gutenberg ebook helps and hints for halloween helps and hints for halloween by laura rountree smith march brothers publishers wright ave. lebanon ohio copyright by march brothers contents page introduction party suggestions nutcrack night halloween stunts a shadow play the black cat stunt a pumpkin climbing game exercises halloween acrostic take care tables are turned drills clown drill and song autumn leaf drill cattail drill muff drill dialogs and plays the halloween ghosts on halloween night jack frost's surprise an historical halloween the witch's dream a halloween carnival and waxwork show the play of pomona halloween puppet play note send for our complete catalog in which will be found all the accessories needed in carrying out the ideas given in this book. march brothers publishers wright ave. lebanon ohio introduction hist be still it is halloween when fairies troop across the green on halloween when elves and witches are abroad we find it the custom over 

#### **5) Removing Stopwords**

Next, we'll define a function to filter out stopwords based on a stopword list from `nltk`. This process involves first tokenizing the text, removing extra whitespace, removing tokens that appear in the stopword list, and finally rejoining the remaining words into a continuous string of text.

**Removal of stopwords isn't required, but it is common. Why do you think this is the case?**

Stopwords are not content words. They occur very frequently and would only distract from the meaningful words in the text.
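
To make that concrete, the sketch below counts word frequencies in a made-up sentence and then filters it. `MINI_STOPWORDS` is a hypothetical four-word list used only for this illustration; the `nltk` English list is much longer.

```python
from collections import Counter

# Hypothetical mini stopword list for illustration only.
MINI_STOPWORDS = {"the", "and", "on", "of"}

sample = "the witch and the cat and the ghost met on the night of the party"
tokens = sample.split()

# The most frequent tokens are function words, not content words.
top_before = Counter(tokens).most_common(2)

# After filtering, only the content words remain.
filtered = [token for token in tokens if token not in MINI_STOPWORDS]

print(top_before)
print(filtered)
```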

### **Let's add some comments to see what we're doing here...**

In [26]:
nltk.download('stopwords')
tokenizer = ToktokTokenizer()
stopword_list = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = [token.strip().lower() for token in tokenizer.tokenize(text)] # tokenize words, remove extra whitespace, lowercase
    filtered_tokens = [token for token in tokens if token not in stopword_list] # keep only tokens that are not stopwords
    return ' '.join(filtered_tokens) # rejoin the remaining tokens into one string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [27]:
stopword_list

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [29]:
stopped_text = remove_stopwords(expanded_text) # apply to expanded_text
stopped_text

"start project gutenberg ebook helps hints halloween helps hints halloween laura rountree smith march brothers publishers wright ave. lebanon ohio copyright march brothers contents page introduction party suggestions nutcrack night halloween stunts shadow play black cat stunt pumpkin climbing game exercises halloween acrostic take care tables turned drills clown drill song autumn leaf drill cattail drill muff drill dialogs plays halloween ghosts halloween night jack frost ' surprise historical halloween witch ' dream halloween carnival waxwork show play pomona halloween puppet play note send complete catalog found accessories needed carrying ideas given book. march brothers publishers wright ave. lebanon ohio introduction hist still halloween fairies troop across green halloween elves witches abroad find custom world build bonfires keep evil spirits night nights entertain friends stunts similar performed two hundred years ago. night fortunes told games played happens birthday falls nig

#### **6) Lemmatization**

Lemmatization is another processing step that isn't required but is often implemented. Remember that lemmatization differs from stemming: it attempts to reduce each word to its dictionary form (its lemma), whereas stemming simply strips prefixes and suffixes.

Here we will use a pretrained pipeline from `spaCy`.

**Why might we be interested in applying lemmatization?**

To reduce the number of distinct tokens and the complexity of the text. This makes analyzing the text easier because there are fewer distinct tokens to track.
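
The difference between the two approaches can be seen in a toy example. Everything below is deliberately simplified and hypothetical: `toy_stem` chops suffixes with no dictionary check, and `toy_lemma` looks words up in a four-entry table. Real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only, not a real stemmer or lemmatizer.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

# Hypothetical mini lookup table mapping inflected forms to lemmas.
LEMMA_TABLE = {
    "fairies": "fairy",
    "told": "tell",
    "games": "game",
    "played": "play",
}


def toy_stem(word):
    """Chop off the first matching suffix, with no dictionary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up; fall back to the word itself if unknown."""
    return LEMMA_TABLE.get(word, word)


for word in ["fairies", "told", "games", "played"]:
    print(word, toy_stem(word), toy_lemma(word))
```

The stemmer produces non-words such as `gam` and misses the irregular form `told`, while the lookup returns dictionary forms but only for words it knows.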

In [30]:
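nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer") # handle to the spaCy lemmatizer component (not called directly below)

def lemmatize_text(text):
  text = nlp(text) # run the spaCy pipeline, which assigns a lemma to every token
  # spaCy v2 used '-PRON-' as the lemma of every pronoun, so this check kept the original pronoun text.
  # spaCy v3 lemmatizes pronouns directly (e.g. 'them' -> 'they'), so there the check has no effect.
  text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
  return text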

lemmas = lemmatize_text(stopped_text) # apply to stopped_text
lemmas

"start project gutenberg ebook help hint halloween help hint halloween laura rountree smith march brothers publisher wright ave . lebanon ohio copyright march brother content page introduction party suggestion nutcrack night halloween stunt shadow play black cat stunt pumpkin climb game exercise halloween acrostic take care table turn drill clown drill song autumn leaf drill cattail drill muff drill dialog play halloween ghost halloween night jack frost ' surprise historical halloween witch ' dream halloween carnival waxwork show play pomona halloween puppet play note send complete catalog find accessory need carry idea give book . march brothers publisher wright ave . lebanon ohio introduction hist still halloween fairy troop across green halloween elve witch abroad find custom world build bonfire keep evil spirit night night entertain friend stunt similar perform two hundred year ago . night fortunes tell game play happen birthday fall night may even able hold converse fairiesso go a

#### **7) Sentence Tokenize Text**

Though we've applied word tokenization at other steps in the NLP pipeline and then rejoined our text, we are now ready to tokenize the text into sentences, so that we can put it into a structured format like a dataframe or list.

We will use the `PunktSentenceTokenizer` from `nltk` to perform this step:

In [31]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()

sents = punkt_st.tokenize(lemmas) # apply to lemmas
sents[3:15] # view some sentences

['lebanon ohio introduction hist still halloween fairy troop across green halloween elve witch abroad find custom world build bonfire keep evil spirit night night entertain friend stunt similar perform two hundred year ago .',
 'night fortunes tell game play happen birthday fall night may even able hold converse fairiesso go ancient superstition careful halloween whenever come careful halloween witch halloween origin old druid festival .',
 'druid keep fire burn year honor sungod .',
 'last night october meet altar fire burn put much pomp ceremony relighte they .',
 'take ember new fire return home kindle fire hearth .',
 'superstition home one fire burn constantly throughout year protect evil .',
 'later fire keep evil spirit away .',
 'country still witch fairy ghost agree night october st great time celebration .',
 'little book find useful school church home planning celebration halloween .',
 'air full magic let we write invitation hearty halloween night nutcrack party .',
 'party
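
Sentence splitting is harder than it looks because a period does not always end a sentence. The sketch below uses a deliberately naive rule (split after any period followed by whitespace) on a made-up string to show the kind of false split an abbreviation causes. It is a standard-library illustration, not a description of how Punkt works.

```python
import re

sample = "march brothers publishers wright ave. lebanon ohio. night fortunes are told."

# Naive rule: a sentence ends at a period followed by whitespace.
naive_sents = re.split(r'(?<=\.)\s+', sample)

for sent in naive_sents:
    print(repr(sent))
```

The abbreviation `ave.` is treated as a sentence end, which is the same kind of fragment visible in the tokenized output (`march brothers publisher wright ave .`).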

#### **8) Deciding clean text output**

Finally, we need to decide how to structure our cleaned text. This is going to depend on what we want to do with it next (which we'll cover in Topic 4). For now, let's store our sentence tokens in a dataframe, and then we'll store our vocab in a list.

**Output is a dataframe of sentences:**

In [32]:
df = pd.DataFrame(sents, columns = ['Sentence'])
df

| | Sentence |
|---|---|
| 0 | start project gutenberg ebook help hint hallow... |
| 1 | lebanon ohio copyright march brother content p... |
| 2 | march brothers publisher wright ave . |
| 3 | lebanon ohio introduction hist still halloween... |
| 4 | night fortunes tell game play happen birthday ... |
| ... | ... |
| 731 | punch judy punch judy merry time year often se... |
| 732 | call appear . |
| 733 | direction make puppet manipulation find puppet... |
| 734 | cent . |
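
Some rows, such as `call appear .` and `cent .`, are fragments rather than full sentences. Depending on the downstream task, it may be worth dropping very short rows. A minimal sketch with plain lists; `MIN_TOKENS` is an arbitrary threshold and `count_words` a hypothetical helper, both chosen for this illustration:

```python
MIN_TOKENS = 3  # arbitrary threshold chosen for this illustration

sample_sents = [
    "druid keep fire burn year honor sungod .",
    "call appear .",
    "cent .",
]


def count_words(sentence):
    """Count alphabetic tokens, ignoring stand-alone punctuation."""
    return sum(token.isalpha() for token in sentence.split())


kept = [sent for sent in sample_sents if count_words(sent) >= MIN_TOKENS]
print(kept)
```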


**Output is a list of unique words:**

In [33]:
words = nltk.wordpunct_tokenize(stopped_text)
text = nltk.Text(words)

In [35]:
vocab = sorted(set(text))
len(vocab)

1897
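
Note that `sorted(set(text))` counts every distinct token, and `wordpunct_tokenize` emits punctuation marks as tokens of their own. If only words are wanted, one option is to keep alphabetic tokens and track frequencies at the same time. A minimal standard-library sketch on a made-up string; `vocab_sketch` and `word_counts` are illustrative names:

```python
from collections import Counter

sample = "halloween night . witch ' dream . halloween carnival"
tokens = sample.split()

# Keep alphabetic tokens only, so '.' and "'" do not count as words.
word_counts = Counter(token for token in tokens if token.isalpha())
vocab_sketch = sorted(word_counts)

print(vocab_sketch)
print(word_counts.most_common(1))
```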

## **Basic NLP Pipeline**

We can also take a more basic approach and throw everything into one function, which can be helpful for less complicated texts.
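
One way to keep a single-function pipeline flexible is to write each step as a small function and apply them in sequence, so steps can be reordered or dropped per task. The sketch below uses only the standard library; all function names are hypothetical, and the steps are simplified stand-ins for the ones used in this notebook.

```python
import re
from functools import reduce


def strip_non_letters(text):
    """Replace anything that is not a letter, whitespace, or period with a space."""
    return re.sub(r'[^a-zA-Z\s.]', ' ', text)


def to_lower(text):
    return text.lower()


def collapse_whitespace(text):
    return ' '.join(text.split())


def run_pipeline(text, steps):
    """Apply each step to the output of the previous one."""
    return reduce(lambda current, step: step(current), steps, text)


steps = [strip_non_letters, to_lower, collapse_whitespace]
result = run_pipeline("A ROGUE'S  Tragedy, 1906.", steps)
print(result)
```

Changing the pipeline then means editing the `steps` list rather than rewriting the function body.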

In [45]:
url = "https://gutenberg.org/files/68667/68667-h/68667-h.htm"

html = request.urlopen(url).read()

In [46]:
raw = BeautifulSoup(html).get_text() # no parser is named here, so bs4 picks the best one installed and issues a warning
print(raw)





  The Project Gutenberg eBook of A Rogue’s Tragedy, by Bernard Capes
 




The Project Gutenberg eBook of A rogue’s tragedy, by Bernard Edward Joseph Capes

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online
at www.gutenberg.org. If you
are not located in the United States, you will have to check the laws of the
country where you are located before using this eBook.

Title: A rogue’s tragedy
Author: Bernard Edward Joseph Capes
Release Date: August 2, 2022 [eBook #68667]
Language: English
Produced by: an anonymous Project Gutenberg volunteer
*** START OF THE PROJECT GUTENBERG EBOOK A ROGUE’S TRAGEDY ***


A ROGUE’S TRAGEDY

BY
BERNARD CAPES


METHUEN & CO.
36 ESSEX STREET W.C.
LONDON





First Published in 1906


CONTENTS



Part I
Pa

In [47]:
print("[", raw.find("A LOVERS’ PROLOGUE"), ":", raw.rfind("CHAPTER III"), "]")

[ 1435 : 363426 ]


In [48]:
raw = raw[1435:363426]
print(raw)

A LOVERS’ PROLOGUE


Matter is but the eternal dressing of the imagination; the world the
unconscious self-delusion of a Spirit. Everything springs from Love,
and Love is the dreaming God.


Two figments of that endless sweet obsession stood alone—high on a
slope of Alp this time. Born of a dream to flesh, they thought they
owed themselves to flesh—a sacred debt. Truth seemed as plain to them
as pebbles in a brook, which lie round and firm for all their apparent
shaking under ripples. There, actual to their eyes, were the white
mountains, the hoary glaciers, the pine woods and foamy freshets of
eighteenth century Le Prieuré. Here, actual in the ears of each, was
the whisper of the deathless confidence which for ever and ever helps
on love’s succession. They loved, and therefore they lived.


Man has been for ten thousand ages at the pains to prove love a
delusion, and still he greets a baby, and a kitten, and the nesting
song of birds, and a hawthorn bush in flower, as

In [49]:
nltk.download('punkt')
def basic_text_cleaner(text):
    # Replace characters that are not letters, whitespace, or periods with a space,
    # so that words joined only by punctuation (such as a dash) are not fused together
    text = re.sub(r'[^a-zA-Z\s\.]', ' ', text)
    # Tokenize, lowercase, and remove stopwords
    tokens = word_tokenize(text)
    tokens = [token.lower() for token in tokens if token.lower() not in stopword_list]

    # Join tokens with single spaces and trim surrounding whitespace
    cleaned_text = ' '.join(tokens).strip()

    return cleaned_text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [50]:
cleaned_text = basic_text_cleaner(raw)
cleaned_text

