# **In-Class Assignment: NLP Pipeline (with Normalization)**
## *IS 5150*
## Name: KEY

In this in-class assignment we're going to run through the entire NLP pipeline and apply some common cleaning and text-normalization steps. We'll start with a text that needs extensive processing and run it through the full battery of steps, then we'll do the same with a much simpler text that requires less effort.

Which steps you need to do will depend on the text and the task at hand!

### Basic Outline of Steps:
1. Import text
2. Remove HTML (if applicable)
3. Case conversion
4. Contractions
5. Removing Stopwords
6. Stemming/Lemmatization
7. Tokenize text
8. Text Output

It's important to note that this list is NOT exhaustive, does NOT need to be done in this order, and which steps you choose WILL depend on the task at hand. The point of this exercise is to show you one procedure for cleaning/processing a text and show two options of output. This will vary based on a given text and what you want to do with it after!
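
Since the steps can be reordered or dropped, one way to think about the pipeline is as a list of small functions applied one after another. The sketch below is only an illustration using the standard library: `strip_tags`, `squeeze_whitespace`, and `run_pipeline` are hypothetical helpers (not part of this assignment's code), and the regex tag-stripper is a crude stand-in for a real HTML parser.

```python
import re
from functools import reduce

def strip_tags(text):
    # Crude stand-in for an HTML parser: replace anything that looks like a tag.
    return re.sub(r"<[^>]+>", " ", text)

def lowercase(text):
    return text.lower()

def squeeze_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def run_pipeline(text, steps):
    # Feed the output of each step into the next one, left to right.
    return reduce(lambda acc, step: step(acc), steps, text)

sample = "<p>Hist!  Be still,\n 'tis <b>Hallowe'en</b></p>"
result = run_pipeline(sample, [strip_tags, lowercase, squeeze_whitespace])
print(result)
```

Swapping, adding, or removing a step is then just a change to the list passed to `run_pipeline`.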

Here, we're going to be using lots of familiar libraries and packages, but we'll also introduce some new ones including the popular and useful `spacy` library! We'll also need `nltk`, `re`, `pprint`, `BeautifulSoup`, `contractions`, `pandas`, and `numpy`.

In [1]:
import nltk, re, pprint

from urllib import request
from bs4 import BeautifulSoup #needed for parsing HTML

#!pip install contractions
import contractions #contractions dictionary
from string import punctuation

import spacy #used for lemmatization
#if the model is missing, run in a terminal: python -m spacy download en_core_web_sm

from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer() #word tokenizer used in the stopword removal step
from nltk import word_tokenize

import pandas as pd
import numpy as np #general packages for data manipulation

#### **1) Import Text - UTF-8 Encoded**

For this example we'll run `Helps and Hints for Hallowe'en` through the NLP pipeline. Why this text? It's pretty messy, so it gives us a good opportunity to demonstrate different processing functions. Plus, I love Halloween.

In [2]:
url = "https://www.gutenberg.org/files/68984/68984-h/68984-h.htm"
response = request.urlopen(url)

raw = response.read().decode('utf-8-sig')
raw

'<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n  <head>\r\n    <meta charset="UTF-8" />\r\n    <title>\r\n      The Project Gutenberg eBook of Helps and Hints for Hallowe’en, by Laura Rountree Smith.\r\n    </title>\r\n\r\n    <link rel="icon" href="images/cover.jpg" type="image/x-cover" />\r\n\r\n    <style> /* <![CDATA[ */\r\n\r\na {\r\n    text-decoration: none;\r\n}\r\n\r\nbody {\r\n    margin: auto;\r\n    max-width: 40em;\r\n}\r\n\r\nh1,h2,h3,h4 {\r\n    text-align: center;\r\n    clear: both;\r\n}\r\n\r\nh2,h3 {\r\n    margin-top: 2em;\r\n}\r\n\r\nh2.nobreak, h3.nobreak {\r\n    page-break-before: avoid;\r\n}\r\n\r\nhr.chap {\r\n    margin-top: 2em;\r\n    margin-bottom: 2em;\r\n    clear: both;\r\n    width: 65%;\r\n    margin-left: 17.5%;\r\n    margin-right: 17.5%;\r\n}\r\n\r\nimg.w100 {\r\n    width: 100%;\r\n}\r\n\r\ndiv.chapter {\r\n    page-break-before: always;\r\n}\r\n\r\nul {\r\n    list-style-type: none;\r\n}\r\n\r\nli {\r\n

**It's clear that we want to remove the HTML tags, and we can use `BeautifulSoup` with `html.parser` to do that. But that's not going to get rid of all unwanted characters. Let's remove the HTML and then figure out what else needs to be removed...**

#### **2) Remove HTML Tags + Unwanted Characters & Trim Text**

Let's start by defining a function to remove unwanted HTML tags, and then we'll build it out based on the other characters we want to remove:

In [3]:
def text_cleaner(text):
    soup = BeautifulSoup(text, 'html.parser')
    for s in soup(['iframe', 'script']): # drop embedded frames and scripts before extracting text
        s.extract()
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text) # collapse runs of line breaks into one newline
    stripped_text = re.sub(r'’', "'", stripped_text) # curly apostrophe -> straight apostrophe
    stripped_text = re.sub(r"[^'\w\s\.]+", '', stripped_text) # remove punctuation other than apostrophes and periods
    stripped_text = re.sub(r"(\s*'\s*s)", 's', stripped_text) # possessive 's (note: this also turns "it's" into "its")
    stripped_text = re.sub(r'\d+\.|\d+', '', stripped_text) # remove digits with or without a following period
    stripped_text = re.sub(r'[A-Z]\.', '', stripped_text) # remove an uppercase letter plus period (also clips e.g. "BOOK." to "BOO")
    stripped_text = re.sub(r"HALLOWE'EN|[Hh]allowe'en", 'halloween', stripped_text) # that's not gonna be in our contractions list
    stripped_text = re.sub(r'\s+', ' ', stripped_text) # collapse all remaining whitespace into single spaces
    return stripped_text

clean_text = text_cleaner(raw)

In [4]:
clean_text

" The Project Gutenberg eBook of Helps and Hints for halloween by Laura Rountree Smith. The Project Gutenberg eBook of Helps and hints for halloween by Laura Rountree Smith This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it give it away or reuse it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States you will have to check the laws of the country where you are located before using this eBook. Title Helps and hints for halloween Author Laura Rountree Smith Release Date September eBook Language English Produced by Charlene Taylor and the Online Distributed Proofreading Team at httpswww.pgdp.net This file was produced from images generously made available by The Internet ArchiveAmerican Libraries. START OF THE PROJECT GUTENBERG EBOOK HELPS AND HINTS FOR halloween Helps and Hi

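To see what the individual substitutions do, it can help to run them one at a time on a short string. This is a minimal sketch using a made-up `sample` sentence and only a subset of the patterns from `text_cleaner` (the uppercase-letter and `halloween` replacements are left out); it prints the text after every step.

```python
import re

sample = "Hallowe’en   Stunts!\r\n\r\n1. The witch's hat (see p. 12)."

steps = [
    (r"[\r\n]+", "\n"),      # collapse runs of line breaks
    (r"’", "'"),             # curly apostrophe -> straight apostrophe
    (r"[^'\w\s\.]+", ""),    # drop punctuation except ' and .
    (r"\s*'\s*s", "s"),      # fold possessive 's into the word
    (r"\d+\.|\d+", ""),      # drop digits, with or without a trailing period
    (r"\s+", " "),           # collapse remaining whitespace
]

text = sample
for pattern, replacement in steps:
    text = re.sub(pattern, replacement, text)
    print(f"{pattern!r:>20} -> {text!r}")
```

Notice that the digit pattern removes `12.` together with the period that ended the sentence, so aggressive cleaning can quietly delete sentence boundaries that a later sentence tokenizer would have relied on.
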
**Now let's find the beginning and end of the text and trim it:**

In [5]:
print("[", clean_text.find("START OF THE PROJECT"), ":", clean_text.rfind("END OF THE PROJECT"), "]")

[ 920 : 66725 ]


In [6]:
clean_text = clean_text[920 : 66443]
clean_text

"START OF THE PROJECT GUTENBERG EBOOK HELPS AND HINTS FOR halloween Helps and Hints for halloween By Laura Rountree Smith MARCH BROTHERS Publishers Wright Ave. Lebanon Ohio COPYRIGHT By MARCH BROTHERS Contents PAGE Introduction Party Suggestions NutCrack Night halloween Stunts A Shadow Play The Black Cat Stunt A Pumpkin Climbing Game Exercises halloween Acrostic Take Care Tables are Turned Drills Clown Drill and Song Autumn Leaf Drill CatTail Drill Muff Drill Dialogs and Plays The halloween Ghosts On halloween Night Jack Frosts Surprise An Historical halloween The Witchs Dream A halloween Carnival and WaxWork Show The Play of Pomona halloween Puppet Play NOTE SEND FOR OUR COMPLETE CATALOG IN WHICH WILL BE FOUND ALL THE ACCESSORIES NEEDED IN CARRYING OUT THE IDEAS GIVEN IN THIS BOO March Brothers Publishers Wright Ave. Lebanon Ohio Introduction Hist be still 'tis halloween When fairies troop across the green On halloween when elves and witches are abroad we find it the custom over all t

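Hard-coded slice indices stop working as soon as the source page or the cleaning steps change. A small helper can look the markers up and slice in one go. This is a sketch under the assumption that both markers exist in the text; `trim_between` is a hypothetical name, and it cuts exactly at the end marker, whereas the cell above uses a manually chosen end index.

```python
def trim_between(text, start_marker, end_marker):
    """Return the slice from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)   # -1 if the marker is missing
    end = text.rfind(end_marker)      # search from the right for the end marker
    if start == -1 or end == -1 or end <= start:
        raise ValueError("markers not found in the expected order")
    return text[start:end]

demo = "license text START OF THE PROJECT body of the book END OF THE PROJECT more license"
trimmed = trim_between(demo, "START OF THE PROJECT", "END OF THE PROJECT")
print(trimmed)
```

Raising an error when a marker is missing avoids silently slicing with `-1`, which would otherwise return the wrong span without complaint.
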
#### **3) Lowercase**

**Next in the pipeline is setting all characters to lowercase. Why do we care about doing this?**

We don't want to treat lowercase and uppercase variants of a word as separate tokens, so we standardize everything to lowercase.

In [7]:
def lowercase(text):
    sents_lower = text.lower()
    return sents_lower

lower_text = lowercase(clean_text)
lower_text

"start of the project gutenberg ebook helps and hints for halloween helps and hints for halloween by laura rountree smith march brothers publishers wright ave. lebanon ohio copyright by march brothers contents page introduction party suggestions nutcrack night halloween stunts a shadow play the black cat stunt a pumpkin climbing game exercises halloween acrostic take care tables are turned drills clown drill and song autumn leaf drill cattail drill muff drill dialogs and plays the halloween ghosts on halloween night jack frosts surprise an historical halloween the witchs dream a halloween carnival and waxwork show the play of pomona halloween puppet play note send for our complete catalog in which will be found all the accessories needed in carrying out the ideas given in this boo march brothers publishers wright ave. lebanon ohio introduction hist be still 'tis halloween when fairies troop across the green on halloween when elves and witches are abroad we find it the custom over all t

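A quick way to see why case matters is to count tokens before and after lowercasing. The token list below is made up for illustration.

```python
from collections import Counter

tokens = "Halloween HALLOWEEN halloween Night night".split()

counts_raw = Counter(tokens)                        # every case variant is its own key
counts_lower = Counter(t.lower() for t in tokens)   # variants collapse into one key

print(counts_raw)
print(counts_lower)
```

Five distinct "words" shrink to two once case is normalized, which is usually what we want for counting and vocabulary building.
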
#### **4) Contractions**

Contractions are kind of an interesting thing to deal with; we usually read them as one entity, but for NLP purposes we often want to separate them into their two constituents. The `contractions` library contains a list of predefined contractions and their expansions. We will use it here inside an `expand_contractions` function that we define ourselves.

In [8]:
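text_1 = "I didn't even know it's a big deal."

def expand_contractions(text):
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word)) # expand each whitespace-separated word
    expanded_text = ' '.join(expanded_words) # join once, after the loop (also safe for empty input)
    return expanded_text

expand_contractions(text_1) # quick check; the result is discarded here, wrap it in print() to inspect it
expanded_text = expand_contractions(lower_text)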

In [9]:
expanded_text

"start of the project gutenberg ebook helps and hints for halloween helps and hints for halloween by laura rountree smith march brothers publishers wright ave. lebanon ohio copyright by march brothers contents page introduction party suggestions nutcrack night halloween stunts a shadow play the black cat stunt a pumpkin climbing game exercises halloween acrostic take care tables are turned drills clown drill and song autumn leaf drill cattail drill muff drill dialogs and plays the halloween ghosts on halloween night jack frosts surprise an historical halloween the witchs dream a halloween carnival and waxwork show the play of pomona halloween puppet play note send for our complete catalog in which will be found all the accessories needed in carrying out the ideas given in this boo march brothers publishers wright ave. lebanon ohio introduction hist be still it is halloween when fairies troop across the green on halloween when elves and witches are abroad we find it the custom over all 

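Conceptually, contraction expansion is a dictionary lookup. The sketch below uses a tiny hand-written mapping, `TOY_CONTRACTIONS`, purely for illustration; it is an assumption of this example and not how the `contractions` library is implemented, which ships a much larger list.

```python
# Hypothetical mini-dictionary for illustration only.
TOY_CONTRACTIONS = {
    "didn't": "did not",
    "it's": "it is",
    "'tis": "it is",
    "we'll": "we will",
}

def expand_toy(text):
    # Look up each whitespace-separated word; leave unknown words untouched.
    return " ".join(TOY_CONTRACTIONS.get(word.lower(), word) for word in text.split())

print(expand_toy("I didn't even know it's a big deal."))
print(expand_toy("hist be still 'tis halloween"))
```

Two limits of a lookup like this are worth keeping in mind: a word with punctuation attached (such as `didn't,`) will not match, and some contractions are ambiguous (`it's` can mean "it is" or "it has").
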
#### **5) Removing Stopwords**

Next, we'll define a function to filter out stop words based on a stopwords list from `nltk`. This process involves first tokenizing the text, stripping extra whitespace from each token, removing the tokens that appear in the stopword list, and then finally rejoining the remaining words into a continuous string of text.

**Removal of stopwords isn't required, but it is common. Why do you think this is the case?**

In [10]:
#nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english')

In [11]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [12]:
stopped_text = remove_stopwords(expanded_text, is_lower_case = True)
stopped_text

"start project gutenberg ebook helps hints halloween helps hints halloween laura rountree smith march brothers publishers wright ave. lebanon ohio copyright march brothers contents page introduction party suggestions nutcrack night halloween stunts shadow play black cat stunt pumpkin climbing game exercises halloween acrostic take care tables turned drills clown drill song autumn leaf drill cattail drill muff drill dialogs plays halloween ghosts halloween night jack frosts surprise historical halloween witchs dream halloween carnival waxwork show play pomona halloween puppet play note send complete catalog found accessories needed carrying ideas given boo march brothers publishers wright ave. lebanon ohio introduction hist still halloween fairies troop across green halloween elves witches abroad find custom world build bonfires keep evil spirits night nights entertain friends stunts similar performed two hundred years ago. night fortunes told games played happens birthday falls night m

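To get a feel for how much text stopwords take up, here is a small stand-alone sketch. `TOY_STOPWORDS` is a hand-picked set for this example only (the `nltk` list is much longer), and the text is split on whitespace rather than with a proper tokenizer.

```python
# Hypothetical, hand-picked stopword set for illustration.
TOY_STOPWORDS = {"the", "of", "and", "on", "when", "are", "we", "it", "all", "is", "be", "for"}

def drop_stopwords(text, stopwords=TOY_STOPWORDS):
    # Keep only the words whose lowercase form is not in the stopword set.
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

before = "on halloween when elves and witches are abroad we find it the custom"
after = drop_stopwords(before)
print(after)
print(len(before.split()), "->", len(after.split()))
```

More than half of the words disappear, and the ones that remain carry most of the topical content, which is one reason stopword removal is so common before counting or vectorizing.
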
#### **6) Lemmatization**

Lemmatization is another processing step that isn't required but is often implemented. Remember that lemmatization differs from stemming: a lemmatizer maps each word to its dictionary form (its lemma), whereas a stemmer simply strips prefixes and suffixes by rule, which can leave behind fragments that aren't real words.

Here we will use a pretrained lemmatizer from `spaCy`.

**Why might we be interested in applying lemmatization?**

In [13]:
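nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer") # handle to the pipeline's lemmatizer component (not used directly below)

def lemmatize_text(text):
    doc = nlp(text) # run the full spaCy pipeline on the text
    # '-PRON-' was the pronoun placeholder lemma in spaCy v2; the check is harmless on later versions
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
    return text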

lemmas = lemmatize_text(stopped_text)
lemmas

"start project gutenberg ebook help hint halloween help hint halloween laura rountree smith march brothers publisher wright ave . lebanon ohio copyright march brothers content page introduction party suggestion nutcrack night halloween stunt shadow play black cat stunt pumpkin climbing game exercise halloween acrostic take care table turn drill clown drill song autumn leaf drill cattail drill muff drill dialog play halloween ghost halloween night jack frost surprise historical halloween witch dream halloween carnival waxwork show play pomona halloween puppet play note send complete catalog find accessory need carry idea give boo march brothers publisher wright ave . lebanon ohio introduction hist still halloween fairy troop across green halloween elf witch abroad find custom world build bonfire keep evil spirit night night entertain friend stunt similar perform two hundred year ago . night fortune tell game play happen birthday fall night may even able hold converse fairiesso go ancien

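The difference between the two approaches shows up clearly on irregular words. The sketch below is a toy: `toy_stem` is a naive suffix stripper and `TOY_LEMMAS` is a hand-written lookup table standing in for a trained lemmatizer. Neither reflects how `spaCy` or any real stemmer works internally.

```python
def toy_stem(word):
    # Rule-based: strip the first matching suffix, with no dictionary lookup.
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a trained lemmatizer.
TOY_LEMMAS = {"fairies": "fairy", "elves": "elf", "told": "tell", "witches": "witch"}

def toy_lemma(word):
    return TOY_LEMMAS.get(word, word)

for w in ["fairies", "elves", "told", "witches"]:
    print(f"{w:>8}  stem={toy_stem(w):<6}  lemma={toy_lemma(w)}")
```

The stemmer produces fragments like `fair` and `elv` and cannot touch `told`, while the lemma lookup returns real dictionary forms, matching what we see in the lemmatized output above (`fairy`, `elf`, `tell`).
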
#### **7) Tokenize Text**

Though we've applied word tokenization at other steps in the NLP pipeline and then rejoined our text, we are now ready to tokenize the text into sentences, so that we can put it into a structured format like a dataframe or list.

We will use the `PunktSentenceTokenizer` from `nltk` to perform this step:

In [14]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sents = punkt_st.tokenize(lemmas)
sents[3:15]

['night fortune tell game play happen birthday fall night may even able hold converse fairiesso go ancient superstition careful halloween whenever come careful halloween witch halloween origin old druid festival .',
 'druid keep fire burn year honor sungod .',
 'last night october meet altar fire burning put much pomp ceremony relighte they .',
 'take ember new fire return home kindle fire hearth .',
 'superstition home one fire burn constantly throughout year protect evil .',
 'later fire keep evil spirit away .',
 'country still witch fairy ghost agree night october st great time celebration .',
 'little book find useful school church home planning celebration halloween .',
 'air full magic let we write invitation hearty halloween night nutcrack part party suggestion nutcrack night northern part england halloween still call nutcrack night .',
 "nutcrack night party write invitation pumpkinshape booklet cut double face jacko ' lantern paint outside inside write nutcrack night meet fat

#### **8) Deciding clean text output**

Finally, we need to decide how to structure our cleaned text. This is going to depend on what we want to do with it next (which we'll cover in Topic 4). For now, let's store our sentence tokens in a dataframe, and then we'll store our vocab in a list.

**Output is a dataframe of sentences:**

In [15]:
df = pd.DataFrame(sents, columns = ['sentence'])
df

| | sentence |
|---|---|
| 0 | start project gutenberg ebook help hint hallow... |
| 1 | lebanon ohio copyright march brothers content ... |
| 2 | lebanon ohio introduction hist still halloween... |
| 3 | night fortune tell game play happen birthday f... |
| 4 | druid keep fire burn year honor sungod . |
| ... | ... |
| 723 | brownie trip lightly green night halloween . |
| 724 | knowledge halloween come pleasant weather fun ... |
| 725 | exit punch judy return . |
| 726 | punch little puppet know halloween judy witch ... |


In [16]:
df['sentence'].str.len().mean()

60.4739010989011
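
The figure above is the mean sentence length in characters. Length in words is often easier to interpret, and the two can be compared on a few sentences with plain Python. The three sentences below are copied from the output above; with `pandas`, something like `df['sentence'].str.split().str.len().mean()` should give the word-based figure for the whole dataframe.

```python
from statistics import mean

sample_sents = [
    "druid keep fire burn year honor sungod .",
    "later fire keep evil spirit away .",
    "exit punch judy return .",
]

chars_per_sentence = mean(len(s) for s in sample_sents)          # characters, spaces included
words_per_sentence = mean(len(s.split()) for s in sample_sents)  # whitespace-separated tokens

print(round(chars_per_sentence, 2), round(words_per_sentence, 2))
```

Note that the trailing `.` is counted as a token here because the lemmatized text separates it from the last word with a space.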

**Output is a list of unique words:**

In [17]:
words = nltk.wordpunct_tokenize(stopped_text)
text = nltk.Text(words)
text

<Text: start project gutenberg ebook helps hints halloween helps...>

In [18]:
vocab = sorted(set(words))
vocab

["'",
 '.',
 'able',
 'about',
 'above',
 'abroad',
 'accents',
 'accessories',
 'according',
 'across',
 'acrostic',
 'act',
 'acting',
 'add',
 'advance',
 'advances',
 'afford',
 'afraid',
 'aft',
 'after',
 'ages',
 'ago',
 'agree',
 'ahunting',
 'air',
 'alarm',
 'alarms',
 'all',
 'allow',
 'almost',
 'alone',
 'along',
 'alphabet',
 'already',
 'also',
 'altar',
 'altars',
 'always',
 'america',
 'american',
 'ancient',
 'ancients',
 'animal',
 'animalcracker',
 'animals',
 'announces',
 'another',
 'answer',
 'anthems',
 'anybody',
 'anyone',
 'anything',
 'anywhere',
 'aplenty',
 'apparent',
 'appeals',
 'appear',
 'appearance',
 'appears',
 'apple',
 'apples',
 'applesfour',
 'appropriate',
 'apron',
 'arapping',
 'arise',
 'arm',
 'arms',
 'around',
 'arrange',
 'arranged',
 'artificial',
 'asailing',
 'ask',
 'asleep',
 'assembled',
 'astrologer',
 'attached',
 'attendants',
 'attired',
 'attitude',
 'attracted',
 'audience',
 'autumn',
 'ave',
 'away',
 'awaywhere',
 'awkw

## **Basic NLP Pipeline**

What if you don't need all the fancy stuff and just want to extract some web text, strip the HTML, trim it, tokenize it, and convert it to lowercase?

Well, that's pretty simple:

In [19]:
url = "https://gutenberg.org/files/68667/68667-h/68667-h.htm"
 
html = request.urlopen(url).read() # read in requested html

In [20]:
raw = BeautifulSoup(html, 'html.parser').get_text() # get the text (naming the parser avoids the guessed-parser warning)

In [21]:
print("[", raw.find("A LOVERS’ PROLOGUE"), ":", raw.rfind("CHAPTER III"), "]") # trim it

[ 1435 : 363426 ]


In [22]:
raw = raw[1435:363426]
print(raw)

A LOVERS’ PROLOGUE


Matter is but the eternal dressing of the imagination; the world the
unconscious self-delusion of a Spirit. Everything springs from Love,
and Love is the dreaming God.


Two figments of that endless sweet obsession stood alone—high on a
slope of Alp this time. Born of a dream to flesh, they thought they
owed themselves to flesh—a sacred debt. Truth seemed as plain to them
as pebbles in a brook, which lie round and firm for all their apparent
shaking under ripples. There, actual to their eyes, were the white
mountains, the hoary glaciers, the pine woods and foamy freshets of
eighteenth century Le Prieuré. Here, actual in the ears of each, was
the whisper of the deathless confidence which for ever and ever helps
on love’s succession. They loved, and therefore they lived.


Man has been for ten thousand ages at the pains to prove love a
delusion, and still he greets a baby, and a kitten, and the nesting
song of birds, and a hawthorn bush in flower, as

In [23]:
tokens = nltk.wordpunct_tokenize(raw) # tokenize the words
tokens

['A',
 'LOVERS',
 '’',
 'PROLOGUE',
 'Matter',
 'is',
 'but',
 'the',
 'eternal',
 'dressing',
 'of',
 'the',
 'imagination',
 ';',
 'the',
 'world',
 'the',
 'unconscious',
 'self',
 '-',
 'delusion',
 'of',
 'a',
 'Spirit',
 '.',
 'Everything',
 'springs',
 'from',
 'Love',
 ',',
 'and',
 'Love',
 'is',
 'the',
 'dreaming',
 'God',
 '.',
 'Two',
 'figments',
 'of',
 'that',
 'endless',
 'sweet',
 'obsession',
 'stood',
 'alone',
 '—',
 'high',
 'on',
 'a',
 'slope',
 'of',
 'Alp',
 'this',
 'time',
 '.',
 'Born',
 'of',
 'a',
 'dream',
 'to',
 'flesh',
 ',',
 'they',
 'thought',
 'they',
 'owed',
 'themselves',
 'to',
 'flesh',
 '—',
 'a',
 'sacred',
 'debt',
 '.',
 'Truth',
 'seemed',
 'as',
 'plain',
 'to',
 'them',
 'as',
 'pebbles',
 'in',
 'a',
 'brook',
 ',',
 'which',
 'lie',
 'round',
 'and',
 'firm',
 'for',
 'all',
 'their',
 'apparent',
 'shaking',
 'under',
 'ripples',
 '.',
 'There',
 ',',
 'actual',
 'to',
 'their',
 'eyes',
 ',',
 'were',
 'the',
 'white',
 'mountains'

In [24]:
words = [w.lower() for w in tokens] # set them to lower
vocab = sorted(set(words)) # sort your unique word tokens set
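
Because `wordpunct_tokenize` emits punctuation and numbers as their own tokens, a vocabulary built this way will include them. If only words are wanted, one option is to keep just the alphabetic tokens. This is a sketch on a short made-up token list (`tokens_demo`), not a change to the cells in this notebook.

```python
tokens_demo = ["A", "LOVERS", "’", "PROLOGUE", "self", "-", "delusion", ".", "1782", "Love", ",", "love"]

# str.isalpha() is True only when every character in the token is a letter.
words_demo = [t.lower() for t in tokens_demo if t.isalpha()]
vocab_demo = sorted(set(words_demo))
print(vocab_demo)
```

Whether to drop numbers and punctuation depends on the task: years like `1782` may matter for some analyses, so treat this filter as one option rather than a default.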

In [25]:
vocab # list of vocab words

['!',
 '!)',
 '!—',
 '!—”',
 '!’',
 '!”',
 '(',
 ')',
 '),',
 ').',
 ');',
 '*',
 ',',
 ',—',
 ',’',
 ',”',
 '-',
 '.',
 '.)',
 '.,',
 '...',
 '.’',
 '.’”',
 '.”',
 '11',
 '1782',
 '1783',
 '1786',
 '61',
 '81',
 '9',
 ':',
 ':—',
 ';',
 ';”',
 '?',
 '?—',
 '?”',
 '[',
 ']',
 'a',
 'abandoned',
 'abandonment',
 'abased',
 'abdicated',
 'abduct',
 'abduction',
 'abhorred',
 'abhorrent',
 'abigails',
 'abject',
 'abjectly',
 'ablaze',
 'able',
 'abnormal',
 'abnormity',
 'abode',
 'abolition',
 'abomination',
 'about',
 'above',
 'abroad',
 'absence',
 'absolute',
 'absolution',
 'absorbed',
 'absorbing',
 'absorption',
 'abstinence',
 'abstract',
 'abstractedly',
 'absurd',
 'abuse',
 'abused',
 'abuses',
 'abusing',
 'abyss',
 'abysses',
 'academies',
 'accent',
 'accept',
 'acceptance',
 'acceptances',
 'accepted',
 'accepting',
 'access',
 'accessory',
 'accident',
 'accommodate',
 'accommodating',
 'accompanied',
 'accompany',
 'accomplish',
 'accomplished',
 'accomplishment',
 'acc