<h1>3 Processing Raw Text</h1>

<h3>Imports</h3>

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

<h1>3.1 Accessing Text from the Web and from Disk</h1>

<h1>Electronic Books</h1>

In [2]:
from urllib import request

# Fetch the Project Gutenberg plain-text edition of Crime and Punishment
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
# Decode the response bytes into a string. The "utf-8-sig" codec strips
# the leading byte order mark (BOM), which plain "utf-8" would leave in
# the string as "\ufeff".
raw = response.read().decode("utf-8-sig")
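
The effect of the codec choice can be checked offline. The following is a minimal sketch, assuming a hand-built byte string (`payload` is a hypothetical stand-in for the downloaded response, not the actual Gutenberg file) that begins with a UTF-8 byte order mark:

```python
import codecs

# Hypothetical stand-in for the downloaded bytes: a UTF-8 BOM followed by text
payload = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start
plain = payload.decode("utf-8")

# "utf-8-sig" removes the BOM when one is present
clean = payload.decode("utf-8-sig")

print(repr(plain[:4]))
print(repr(clean[:4]))
```

The "utf-8-sig" codec also decodes input that has no BOM, so it is a reasonable default when the source of a file is unknown.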


In [3]:
print("Characters in this text: ", len(raw))
print(raw[:75])


Characters in this text:  1176964
The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky



<h3>Tokenize</h3>

In [4]:
tokens = word_tokenize(raw)

In [5]:
print(type(tokens))
print(len(tokens))
print(tokens[:10])

<class 'list'>
257726
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']


<h3>Create a Text Object for the Raw Text</h3>

In [6]:
text = nltk.Text(tokens)
print(type(text))
print(text[1024:1062])
# collocations() prints its results itself and returns None,
# so it is called directly rather than wrapped in print()
text.collocations()

<class 'nltk.text.Text'>
['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.', 'He', 'had', 'successfully']
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


<h3>Find Indices In String Where Keywords/Phrases Occur</h3>

In [7]:
start = raw.find("PART I")
end = raw.rfind("End of Project Gutenberg’s")
print(start)
print(end)


5335
1157809


In [8]:
n_raw = raw[start:end]
# print(n_raw)
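
Both `str.find` and `str.rfind` return -1 when the marker is missing, and slicing with -1 silently produces the wrong text instead of raising an error. The following is a minimal sketch of a guard around the slice; `slice_between` and the `sample` string are hypothetical names introduced for illustration and are not part of the notebook:

```python
def slice_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    end = text.rfind(end_marker)
    if end == -1 or end < start:
        raise ValueError(f"end marker not found after start: {end_marker!r}")
    return text[start:end]


# Hypothetical miniature of a Gutenberg file: header, body, footer
sample = (
    "License header\n"
    "PART I\n"
    "On an exceptionally hot evening early in July\n"
    "End of Project Gutenberg sample\n"
    "License footer\n"
)

body = slice_between(sample, "PART I", "End of Project Gutenberg")
print(body)
```

Applied to the notebook's data, the equivalent call would pass `raw` and the two marker strings used in cell 7.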

<h1>Dealing with HTML</h1>

In [9]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode("utf-8")
print(html[:60])

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN


In [10]:
from bs4 import BeautifulSoup
# Include "lxml" in the arguments to explicitly
# specify a parser to be used
raw = BeautifulSoup(html, "lxml").get_text()
tokens = word_tokenize(raw)
print(tokens)

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'", 'NEWS', 'SPORT', 'WEATHER', 'WORLD', 'SERVICE', 'A-Z', 'INDEX', 'SEARCH', 'You', 'are', 'in', ':', 'Health', 'News', 'Front', 'Page', 'Africa', 'Americas', 'Asia-Pacific', 'Europe', 'Middle', 'East', 'South', 'Asia', 'UK', 'Business', 'Entertainment', 'Science/Nature', 'Technology', 'Health', 'Medical', 'notes', '--', '--', '--', '--', '--', '--', '-', 'Talking', 'Point', '--', '--', '--', '--', '--', '--', '-', 'Country', 'Profiles', 'In', 'Depth', '--', '--', '--', '--', '--', '--', '-', 'Programmes', '--', '--', '--', '--', '--', '--', '-', 'SERVICES', 'Daily', 'E-mail', 'News', 'Ticker', 'Mobile/PDAs', '--', '--', '--', '--', '--', '--', '-', 'Text', 'Only', 'Feedback', 'Help', 'EDITIONS', 'Change', 'to', 'UK', 'Friday', ',', '27', 'September', ',', '2002', ',', '11:51', 'GMT', '12:51', 'UK', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'", 'Scientists', 'believe', 'the', 'last', 'blond
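
The token list above shows that text extraction keeps navigation menus along with the article. To illustrate what extraction involves, here is a minimal sketch built on the standard library's `html.parser` rather than BeautifulSoup. It is a swapped-in technique for illustration, not what the notebook uses; `TextExtractor`, `html_to_text`, and `sample_html` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical miniature page, not the BBC article itself
sample_html = (
    "<html><head><title>BBC NEWS</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Blondes 'to die out'</p>"
    "<script>var x = 1;</script></body></html>"
)

page_text = html_to_text(sample_html)
print(page_text)
```

The resulting string can be passed to `word_tokenize` in the same way as the output of `get_text()`.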

In [11]:
# Keep only the tokens of the article body (slice bounds chosen by inspection)
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance("gene")

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


<h1>Processing RSS Feeds</h1>

In [12]:
import feedparser

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")


In [16]:
# Title of the feed
print(llog["feed"]["title"])

# How many entries
print(len(llog.entries))

# Grab the first post
post0 = llog.entries[0]
print(post0.title)

# Grab the HTML of the first post
post0_content = post0.content[0].value
print(post0_content[:100])

# Extract the text from the HTML
raw = BeautifulSoup(post0_content, "lxml").get_text()
tokens = word_tokenize(raw)
print(tokens[:20])

Language Log
13
Seven double-plus-ungood words and phrases
<p>Lena H. Sun and Juliet Eilperin, "<a href="https://www.washingtonpost.com/national/health-science
['Lena', 'H.', 'Sun', 'and', 'Juliet', 'Eilperin', ',', '``', 'CDC', 'gets', 'list', 'of', 'forbidden', 'words', ':', 'fetus', ',', 'transgender', ',', 'diversity']


<h1>Reading Local Files</h1>

In [78]:
# Read the whole file into a string; the with block closes the file afterwards
love_song_path = "../My-Texts/the-love-song-of-j-alfred-prufrock.txt"
with open(love_song_path, 'r', encoding="utf-8") as love_song:
    love_song_raw = love_song.read()

# Tokenize the text
love_song_tokens = word_tokenize(love_song_raw)

# Normalize: lowercase every token and keep only purely alphanumeric ones,
# which drops punctuation tokens
love_song_tokens = [w.lower() for w in love_song_tokens if w.isalnum()]
# print(love_song_tokens[:100])

# Build the sorted vocabulary of unique word types
love_song_vocab = sorted(set(love_song_tokens))

# Show the first ten entries and the vocabulary size
print(love_song_vocab[:10])
print("\nUnique Vocab: ", len(love_song_vocab))

['a', 'about', 'across', 'advise', 'afraid', 'after', 'afternoon', 'afternoons', 'against', 'al']

Unique Vocab:  435
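
The `isalnum()` filter removes more than punctuation marks: any token containing an apostrophe or hyphen is dropped as well. The following is a minimal sketch of that effect, assuming a small hand-written token list (`sample_tokens` is hypothetical and is not the output of `word_tokenize` on the poem):

```python
# Hypothetical, hand-written tokens chosen to show the edge cases
sample_tokens = ["Let", "us", "go", "then", ",", "you", "and", "I", ",",
                 "Do", "n't", "half-deserted", "streets", "Let"]

# Same normalization as in the cell above: lowercase, alphanumeric only
kept = [w.lower() for w in sample_tokens if w.isalnum()]

# Tokens the filter discards, including contractions and hyphenated words
dropped = [w for w in sample_tokens if not w.isalnum()]

# Sorted set of unique word types
vocab = sorted(set(kept))

print(vocab)
print(dropped)
```

If contractions or hyphenated words matter for the analysis, a looser test such as checking for at least one alphabetic character may be preferable.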


<h3>The NLP Pipeline</h3>

<img src="../Images/pipeline1.png">