# **In-Class Demonstration: NLP Pipeline for EBooks, Static HTML, Wikipedia**

## *IS 5150*

In this in-class assignment we will run several sources of text through the first few steps of the NLP pipeline, including:



1.   Reading in raw text
2.   Decoding text to readable format/removal of HTML tags
3.   Tokenizing
4.   Trimming unwanted tokens

We will start with one of the cleaner sources of text, e-books, and then delve into webpages, which present the additional issue of extracting meaningful text from a sea of HTML.



### **Let's start with an E-Book and run through these steps...**

We first need to import our required libraries and packages. We'll need `nltk`, `re`, and `pprint`. We will also want to import `word_tokenize` from `nltk`, and then `request` from `urllib`.

In [None]:
import nltk, re, pprint
from nltk import word_tokenize

from urllib import request

#### **1) Reading in raw ebook text**
#### **2) Decode text to readable format**

The first step in the pipeline is to actually read in the text we want to process. This can be done using the `request.urlopen()` function from `urllib`; calling `read()` on the response gives us the page as raw bytes. We then turn those bytes into a Python string with the `decode` method, using the `utf-8-sig` codec so that a leading byte-order mark, if present, is dropped.

In [None]:
url = "https://www.gutenberg.org/files/2554/2554-0.txt"                                                           # provide url of ebook
response = request.urlopen(url)                                                                                   # open url

raw = response.read().decode('utf-8-sig')                                                                         # decode the bytes as UTF-8, dropping any leading byte-order mark

print("data type:", type(raw), "Number of characters raw:", len(raw))                                             # print data type and number of characters

data type: <class 'str'> Number of characters raw: 1176811


In [None]:
raw[:75]                                                                                                          # extract the title and author of the text

'The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
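
To see why the cell above decodes with `utf-8-sig` rather than plain `utf-8`, here is a minimal offline sketch. It assumes a made-up byte string (`sample_bytes`, not the real Gutenberg download) that begins with a UTF-8 byte-order mark, as some text files do.

```python
import codecs

# Hypothetical sample: a UTF-8 byte-order mark followed by encoded text.
sample_bytes = codecs.BOM_UTF8 + "Crime and Punishment".encode("utf-8")

plain = sample_bytes.decode("utf-8")         # keeps the BOM as an invisible first character
stripped = sample_bytes.decode("utf-8-sig")  # drops the BOM if one is present

print(repr(plain[0]))             # the BOM character, U+FEFF
print(len(plain), len(stripped))  # the two lengths differ by one character
```

If the bytes have no byte-order mark, both codecs give the same string, so `utf-8-sig` is a safe default for files of unknown origin.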

#### **3) Tokenization**

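Now that we've got our text decoded and in a readable format, let's tokenize it so that we can break down continuous strings of characters into words and punctuation marks. Luckily, `nltk` has a built-in tokenizer called `word_tokenize()`. Let's apply it to our raw text. (If it raises a `LookupError`, the tokenizer models have not been downloaded yet; the error message names the resource to fetch with `nltk.download()`.)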

In [None]:
tokens = word_tokenize(raw)                                                                                      # tokenize raw and assign the result to an object called 'tokens'
print("data type:", type(tokens), "Number of tokens:", len(tokens))                                              # print datatype and number of tokens

data type: <class 'list'> Number of tokens: 257058


In [None]:
tokens[:10]

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']
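
Notice that the comma comes out as its own token. To get a feel for what a tokenizer adds over simple whitespace splitting, here is a rough sketch using only `re`. The regex pattern is a simplified stand-in chosen for illustration, not the algorithm `word_tokenize` actually uses, and `sample` is just the title line from above.

```python
import re

sample = "Crime and Punishment, by Fyodor Dostoevsky"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# Simplified stand-in tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

The first list contains `'Punishment,'` as one token, while the second separates `'Punishment'` and `','`, which matches the output we saw from `word_tokenize`.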

In [None]:
text = nltk.Text(tokens)                                                                                        # assign text tokens as an nltk Text to use nltk functions
text.collocations()                                                                                             # pulls up common bigrams
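
The idea behind collocations starts with counting adjacent word pairs. The sketch below does only that raw counting on a made-up token list (`toy_tokens` is an assumption for illustration); NLTK's own method applies additional filtering and scoring, so its results on real text will differ.

```python
from collections import Counter

# Hypothetical toy tokens, standing in for the output of word_tokenize.
toy_tokens = "the old woman saw the old man and the old woman".split()

# Pair each token with its right-hand neighbour to form bigrams.
bigrams = list(zip(toy_tokens, toy_tokens[1:]))
bigram_counts = Counter(bigrams)

# Show the two most frequent adjacent pairs.
print(bigram_counts.most_common(2))
```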

#### **4) Trimming unwanted text**

If we examine the actual webpage of this Project Gutenberg text, it's clear there's additional text that's not a part of the story. We can locate the part of the text we want with the string methods `find` and `rfind`, which search for a literal substring (not a regex pattern) and return its character index. We then overwrite our `raw` object to contain just that part.

In [None]:
print("[",raw.find("PART I"), ":", raw.rfind("END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT"), "]")   # locate the title and end of the book strings

[ 5574 : 1158052 ]


In [None]:
raw = raw[5574:1158052]                                                                                         # keep only the text between the title and the end marker
raw
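
Hard-coded indices like the ones above stop being correct if the source file changes. As a sketch of a more robust variant, the hypothetical helper `trim_between` below (not part of `nltk` or the original notebook) looks the markers up at slice time and fails loudly when one is missing, since `find` and `rfind` return `-1` in that case. The sample string is invented for the demonstration.

```python
def trim_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)   # first occurrence, or -1
    end = text.rfind(end_marker)      # last occurrence, or -1
    if start == -1 or end == -1:
        raise ValueError("marker not found in text")
    return text[start:end]

# Hypothetical miniature "ebook" with a header and footer around the story.
sample = "License header\nPART I\nStory text.\nEND OF THE EBOOK\nLicense footer"
body = trim_between(sample, "PART I", "END OF THE EBOOK")
print(repr(body))
```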

### **Next, let's try out a static HTML page**

Non-dynamic (or static) HTML pages are a bit easier to deal with from a text-extraction standpoint because they don't have interactive components or JavaScript. We see less and less static HTML in the wild, but it's still useful to understand the process and the associated functions. `BeautifulSoup` is a popular library used for web scraping, and we will implement it here.

Here we will read in and then parse an example news article using `BeautifulSoup`:

In [None]:
from bs4 import BeautifulSoup                                                                                   # import BeautifulSoup

#### **1) Read in raw text from url**
#### **2) Decode text to readable format**

In [None]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf-8-sig')                                                         # read the page bytes and decode them as UTF-8
print(html)

**Oh man, that's ugly. Let's implement the second part of step 2, which is necessary here: removal of HTML tags. This can be done by telling `BeautifulSoup` to use Python's built-in `html.parser` and then calling `get_text()`.**

In [None]:
raw = BeautifulSoup(html, 'html.parser').get_text()                                                           # extract text, parse out html
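
The `'html.parser'` argument refers to Python's built-in `html.parser` module. To show the basic idea of tag removal without any third-party library, here is a minimal sketch built directly on that module. The class name `TextExtractor` and the HTML snippet are assumptions made up for illustration, and this stand-in is far less capable than `BeautifulSoup`.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text found between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of a tag.
        self.chunks.append(data)

# Hypothetical HTML fragment, not the real BBC page.
sample_html = "<html><body><h1>Blondes</h1><p>The gene is <b>recessive</b>.</p></body></html>"

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()

plain_text = "".join(extractor.chunks)
print(plain_text)
```

The heading and paragraph run together in the result because no whitespace separated them in the markup. This is the same reason the text we extract from real pages is full of stray `\r\n` sequences and still needs trimming.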

#### **Now let's complete steps 3) and 4), tokenizing and trimming:**

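This time we'll trim first and tokenize second. The order is flexible, but note that the character offsets returned by `find` only apply to the raw string, so trimming after tokenizing would require token indices instead.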

In [None]:
print("[",raw.find("The last natural blondes will die out within 200 years, scientists believe"), ":", raw.find("The frequency of blondes may drop but they won't disappear."), "]")

[ 943 : 2452 ]


In [None]:
raw_1 = raw[943:2512]                                                                                          # trim text; the end index runs past 2452 so the closing sentence is included
raw_1

'The last natural blondes will die out within 200 years, scientists believe. \r\n\r\nA study by experts in Germany suggests people with blonde hair are an endangered species and will become extinct by 2202.\r\n\r\nResearchers predict the last truly natural blonde will be born in Finland - the country with the highest proportion of blondes. \r\n\n\n\n\n\r\n\r\n\r\n\r\n\r\n\r\n\tThe frequency of blondes may drop but they won\'t disappear\r\n\r\n\r\n\r\n\r\n\r\n\t\n\n\r\n\r\n\r\n\r\n\r\n\r\n\tProf Jonathan Rees, University of Edinburgh\r\n\r\n\r\n\r\n\r\n\r\n\t\n\r\n\r\n\r\n\r\n\r\n\t\r\nBut they say too few people now carry the gene for blondes to last beyond the next two centuries. \r\n\r\nThe problem is that blonde hair is caused by a recessive gene. \r\n\r\nIn order for a child to have blonde hair, it must have the gene on both sides of the family in the grandparents\' generation. \r\nDyed rivals\n\r\n\r\nThe researchers also believe that so-called bottle blondes may be to blame for t
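
Note that `find` returns the index where a match *starts*, so slicing up to the end marker's index would cut that sentence off. The cell above uses 2512 rather than 2452 as the end index, presumably for that reason. A sketch of computing the same kind of offset instead of typing it by hand, using an invented sample string:

```python
# Hypothetical page text with navigation clutter before and after the article.
sample = "Menu | The last natural blondes will die out. Middle text. They won't disappear. Related links"

start_marker = "The last natural blondes"
end_marker = "They won't disappear."

start = sample.find(start_marker)
end_start = sample.find(end_marker, start)   # search only after the start marker
if start == -1 or end_start == -1:
    raise ValueError("marker not found in text")

# Add the marker's length so the slice includes the end marker itself.
end = end_start + len(end_marker)
article = sample[start:end]
print(article)
```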

In [None]:
tokens = word_tokenize(raw_1)                                                                                   # tokenize the raw text to words
text = nltk.Text(tokens)                                                                                        # set as our nltk text
tokens

In [None]:
text.concordance('gene')                                                                                        # search for context that the word 'gene' appears in

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
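
A concordance is a "keyword in context" view. The hypothetical helper `kwic` below sketches the idea with a window measured in tokens; NLTK's `concordance` formats its output by character width instead, so this is a simplification rather than a reimplementation. The sample tokens are taken from one of the matches shown above.

```python
def kwic(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample_tokens = "blonde hair is caused by a recessive gene . In order for a child".split()

for line in kwic(sample_tokens, "gene"):
    print(line)
```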


### **Finally, we will scrape some text from Wikipedia, export it, and then read it in as a local file**

We can access the Wikipedia API through the handy-dandy `wikipedia` library in Python! Let's find some Wikipedia pages about penguins that we could extract some text from...

In [None]:
#!pip install wikipedia
import wikipedia

In [None]:
print(wikipedia.search("Penguins"))                                                         # search for pages about penguins

['Penguin', 'Pittsburgh Penguins', 'Adélie penguin', 'African penguin', 'Penguins of Madagascar', 'Little penguin', 'Penguin Books', 'The Penguins', 'Emperor penguin', 'King penguin']


In [None]:
print(wikipedia.suggest('Pengui Books'))                                                    # suggest() proposes a correction for a misspelled query

penguin books


In [None]:
wiki = wikipedia.page('penguin books')                                                      # we choose the penguin books wikipage and assign it to an object called 'wiki'
wiki

<WikipediaPage 'Penguin Books'>

In [None]:
print("[",wiki.content.find("Origin"), ":", wiki.content.rfind("War years"), "]")           # find text we want

[ 1450 : 5204 ]


In [None]:
text = wiki.content[1450:5204]                                                              # trim text
print(text)

Origins ==

The first Penguin paperbacks were published in 1935, but at first only as an imprint of The Bodley Head (of Vigo Street, London) with the books originally distributed from the crypt of Holy Trinity Church Marylebone.
Anecdotally, Lane recounted how it was his experience with the poor quality of reading material on offer at Exeter train station that inspired him to create cheap, well designed quality books for the mass market. However the question of how publishers could reach a larger public had been the subject of a conference at Rippon Hall, Oxford in 1934 which Lane had attended. Though the publication of literature in paperback was then associated mainly with poor quality lurid fiction, the Penguin brand owed something to the short-lived Albatross imprint of British and American reprints that briefly traded in 1932.Inexpensive paperbacks did not initially appear viable to Bodley Head, since the deliberately low price of 6d. made profitability seem unlikely. This helped 

#### **Now let's export our wiki page text as a text file:**

In [None]:
with open('C:/users/carly/Documents/Text Mining Course/Penguins.txt', 'w', encoding='utf-8') as text_file:   # explicit encoding; the file is closed automatically
    text_file.write(text)

#### **And then read it back in again as a local file...**

In [None]:
with open('C:/users/carly/Documents/Text Mining Course/Penguins.txt', 'r', encoding='utf-8') as f:           # read with the same encoding used for writing
    for line in f:
        print(line.strip())

Origins ==

The first Penguin paperbacks were published in 1935, but at first only as an imprint of The Bodley Head (of Vigo Street, London) with the books originally distributed from the crypt of Holy Trinity Church Marylebone.
Anecdotally, Lane recounted how it was his experience with the poor quality of reading material on offer at Exeter train station that inspired him to create cheap, well designed quality books for the mass market. However the question of how publishers could reach a larger public had been the subject of a conference at Rippon Hall, Oxford in 1934 which Lane had attended. Though the publication of literature in paperback was then associated mainly with poor quality lurid fiction, the Penguin brand owed something to the short-lived Albatross imprint of British and American reprints that briefly traded in 1932.Inexpensive paperbacks did not initially appear viable to Bodley Head, since the deliberately low price of 6d. made profitability seem unlikely. This helped 