# Exploring the NLTK Book (Chapter 3)
[NLTK Book](https://www.nltk.org/book/)

Resources:
* [urllib](https://docs.python.org/3/library/urllib.html) <br/>Python package for working with URLs.
* [Regular Expression module](https://docs.python.org/3/library/re.html) <br/>Allows us to [use regular expressions in Python](https://docs.python.org/3/howto/regex.html#regex-howto) on strings.
* [Data pretty printer](https://docs.python.org/3/library/pprint.html) <br/>Prints data structures in a readable format.
* [Project Gutenberg catalog](http://www.gutenberg.org/catalog/)<br/>Find electronic texts from Project Gutenberg's collection that are not included in NLTK.
* [textfiles.com](http://www.textfiles.com/directory.html) <br/>A useful source for finding plain text files.
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) <br/>A Python library that helps us work with HTML and XML

In [None]:
import nltk, re, pprint
from nltk import word_tokenize
from urllib import request

%matplotlib inline

## Getting the text
Find a text from the Project Gutenberg collection or from textfiles.com and retrieve it using urllib. You should browse the website to get the URL you need.

In [None]:
url = 'http://www.gutenberg.org/cache/epub/7178/pg7178.txt'
response = request.urlopen(url)
raw_text = response.read().decode('utf8')

We just retrieved the text of Marcel Proust's 'Swann's Way' from the Project Gutenberg catalog and decoded it into plain text (i.e. a string).
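Some Project Gutenberg plain-text files begin with a byte-order mark (BOM), which shows up as `'\ufeff'` at the start of the decoded string. The following is a minimal sketch of one way to handle that, assuming the file really is UTF-8: the `utf-8-sig` codec drops a leading BOM if there is one and otherwise behaves like `utf8`. The helper name `decode_text` and the sample bytes are made up for illustration.

```python
def decode_text(raw_bytes):
    """Decode UTF-8 bytes, dropping a leading byte-order mark if there is one."""
    return raw_bytes.decode('utf-8-sig')

# Hypothetical sample bytes standing in for response.read()
with_bom = b'\xef\xbb\xbfThe Project Gutenberg EBook'
without_bom = b'The Project Gutenberg EBook'

# A plain utf8 decode keeps the BOM as the first character
print(repr(with_bom.decode('utf8')[:5]))
# utf-8-sig removes it
print(repr(decode_text(with_bom)[:5]))
# Both inputs now decode to the same string
print(decode_text(with_bom) == decode_text(without_bom))
```

If the text you downloaded does not start with a BOM, nothing changes, so this is a safe default for UTF-8 sources.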


In [None]:
type(raw_text)

In [None]:
# This tells us how many characters (not words) long the text is.
# To get a word count we need to process the text further.

len(raw_text)

In [None]:
raw_text[:100]

## Tokenization
Turn the text into words using NLTK's `word_tokenize` function.

In [None]:
words_text = word_tokenize(raw_text)

In [None]:
# now we have a list and not a string.
# The list contains the words of the text as identified by word_tokenize
type(words_text)

In [None]:
# now we can get a better approximation of the word count
len(words_text)

In [None]:
words_text[:100]

Now we can take our tokenized text (the list of strings), turn it into an NLTK Text, and carry out all of the processing we saw earlier (e.g. collocations, similar, etc.).

In [None]:
# we turn the list into an nltk text.
nltk_text = nltk.Text(words_text)
type(nltk_text)

In [None]:
nltk_text[100:200]

In [None]:
nltk_text.concordance('blue',lines=40)

In [None]:
nltk_text.similar('blue')

In [None]:
nltk_text.concordance('red')


In [None]:
colors = ['red','blue','green','black','white']
nltk_text.dispersion_plot(colors)


In [None]:
nltk.Text(nltk_text[100000:200000]).concordance('blue',width=100)


In [None]:
from nltk import FreqDist
freqDist = FreqDist(nltk_text)

In [None]:
freqDist.most_common(50)

In [None]:
freqDist['blue']

In [None]:
freqDist.plot(50, cumulative=True)

In [None]:
freqDist.hapaxes()[:50]
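The counts above are taken over the raw tokens, so punctuation is counted and 'The' and 'the' are treated as different words. A minimal sketch of filtering before counting is shown below. It uses `collections.Counter` from the standard library so that it runs without NLTK; `FreqDist` is a subclass of `Counter`, so the same cleaned list can be passed to `FreqDist` instead. The token list is a made-up stand-in for `words_text`.

```python
from collections import Counter

# Hypothetical token list standing in for words_text
tokens = ['The', 'sky', 'was', 'blue', ',', 'the', 'sea', 'was', 'blue', '.']

# Keep alphabetic tokens only and fold case before counting
clean_tokens = [t.lower() for t in tokens if t.isalpha()]
counts = Counter(clean_tokens)

print(counts.most_common(3))
print(counts['the'])
```

Whether to fold case or drop punctuation depends on the question being asked, so treat this as one possible cleaning step rather than a required one.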

## 'Grooming' or 'munging' the text file

We can see that bigrams like "Project Gutenberg", "Archive Foundation", and "electronic works" appear frequently in the text. Let's take a closer look and see if we want to remove them.

In [None]:
nltk_text.collocations()

We will go back to the raw text data so that we can create a clean NLTK text.

In [None]:
raw_text[:1000]
# let's look for the start of the book (hint: Project Gutenberg gives us some clues)

It looks like the actual text of the book starts with the text "OVERTURE".

In [None]:
#we can find where the string starts
raw_text.find('OVERTURE')

In [None]:
raw_text[842]

So where does it end? <br/>
Warning: Project Gutenberg has a long 'footer', but Project Gutenberg is good about helping out.

In [None]:
raw_text[-20000:]

In [None]:
raw_text.rfind('\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nEnd of the Project Gutenberg')

In [None]:
raw_text[1102563]

In [None]:
string="hello"
string = string[1:3]
string

In [None]:
raw_text = raw_text[842:1102537]

In [None]:
raw_text[:200]

In [None]:
raw_text[-100:]
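The slice above uses index numbers copied by hand from earlier cells, which will be wrong for any other book. A minimal sketch of computing them from the marker strings instead is given below. The function name `trim_between` and the miniature sample text are made up for illustration; the markers would need to be adjusted for each text.

```python
def trim_between(text, start_marker, end_marker):
    """Return the part of text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    # find and rfind return -1 when the marker is missing
    if start == -1 or end == -1 or end < start:
        raise ValueError('marker not found')
    return text[start:end]

# Hypothetical miniature "book" with a header and a footer
sample = ('License header\n\nOVERTURE\n\nFor a long time...\n\n'
          'End of the Project Gutenberg EBook')
body = trim_between(sample, 'OVERTURE', 'End of the Project Gutenberg')
print(repr(body))
```

Raising an error when a marker is missing avoids silently slicing with -1, which would otherwise drop the last character or return an empty string.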

Now we can easily rebuild our NLTK Text

In [None]:
nltk_text = nltk.Text(word_tokenize(raw_text))

In [None]:
nltk_text.collocations()

## Working with HTML

Much of the text and language we encounter today is online and not necessarily presented in a plain text file. Getting documents from the web will likely require working with HTML documents. The initial process is the same: find the URL of a web page that you would like to take a closer look at. The easiest place to start is a news article.

In [None]:
# Note: this is the second time we have done this so we might want to think about turning this into a function.

def get_text(url):
    response = request.urlopen(url)
    text = response.read().decode('utf8')
    return text
html = get_text('https://www.bbc.com/news/entertainment-arts-45818204')
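The function above assumes every page is UTF-8. A hedged variant is sketched below: it reads the character set from the response headers when the server declares one and falls back to a default otherwise. The name `fetch_text`, the `default_encoding` parameter, and the ten-second timeout are assumptions made for this sketch, not part of the original notebook.

```python
from urllib import request

def fetch_text(url, default_encoding='utf8', timeout=10):
    """Download url and decode it with the charset the server declares, if any."""
    with request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns None when no charset is declared
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset)
```

Using `with` closes the connection once the body has been read, and the timeout keeps a slow server from stalling the notebook indefinitely.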

In [None]:
type(html)

In [None]:
#we have a string of text that has html markup
html[:150]

This is where BeautifulSoup comes in

In [None]:
from bs4 import BeautifulSoup as bs

In [None]:
# Passing the parser name explicitly ('lxml' here) avoids the warning BeautifulSoup gives when no parser is specified.
# Without it, BeautifulSoup picks the best parser that is installed, which is lxml if it is available.
soup = bs(html,'lxml')

Now let's trim down the HTML to just the bit we want. If we use the `get_text()` function we will also get a lot of text we don't want (like the JavaScript inside the `<script>` tags).

In [None]:
soup.get_text()
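One common way to keep script and style content out of the result is to remove those elements from the tree before calling `get_text()`. The sketch below assumes BeautifulSoup 4 is installed and uses Python's built-in `html.parser` so that it does not depend on lxml. The function name `visible_text` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Return the text of an HTML document without <script> or <style> content."""
    soup = BeautifulSoup(html, 'html.parser')
    # calling the soup like a function is shorthand for find_all
    for tag in soup(['script', 'style']):
        tag.decompose()   # remove the element and its contents from the tree
    return soup.get_text(separator=' ', strip=True)

# Hypothetical miniature page
sample = ('<html><head><script>var x = 1;</script></head>'
          '<body><h1>Title</h1><p>Body text.</p></body></html>')
print(visible_text(sample))
```

This still returns navigation menus, captions, and other page furniture, so narrowing down to the element that holds the article remains worthwhile.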

So let's take a closer look

In [None]:
# look at what is in the <body> element.
soup.body

In [None]:
# If this is well written HTML there should be one <h1> (and we will be lucky)
soup.find_all('h1')

In [None]:
soup.h1

In [None]:
# semantic HTML (Yeah!) will often use the '<article>' tag to hold main content.
print(soup.find('article'))

In [None]:
for el in soup.h1.children:
    print(el)

In [None]:
soup.h1.parent

In [None]:
body = soup.h1.parent

In [None]:
article = body.find_all('p')

In [None]:
for tx in article:
    print(tx.text)

### not bad but...
This is close but if we really want the article we can do a little better.

Looking at the HTML, it looks like the main article is in a `<div>` tag with a class of "story-body__inner".

In [None]:
soup.find_all('div',{"class":"story-body__inner"})

In [None]:
print(soup.find('div',{"class":"story-body__inner"}).text)

This is getting better. We have the right section, so now let's get just the `<p>` elements.

In [None]:
div = soup.find('div',{"class":"story-body__inner"})

In [None]:
for p in div.find_all('p'):
    print(p.text)

So it looks like this is the one. To save it as a string of text we will need to create a function.

In [None]:
def get_article(ls):
    words = ""
    for el in ls:
        words = words + ' ' + el.text # if we want to keep the paragraph structure we can add `+'\n'`
    return words


In [None]:
article = get_article(div.find_all('p'))

In [None]:
print(article)
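The steps above can be gathered into one function. The sketch below assumes BeautifulSoup 4 is installed, joins the paragraphs with newlines so the paragraph structure is kept, and returns an empty string when the `<div>` is not found (for example, if the site changes its class names). The function name `extract_paragraphs` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

def extract_paragraphs(html, div_class):
    """Return the text of the <p> elements inside the first <div> with div_class."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', {'class': div_class})
    if container is None:
        # find returns None when nothing matches
        return ''
    return '\n'.join(p.get_text(strip=True) for p in container.find_all('p'))

# Hypothetical miniature page with one paragraph outside the article body
sample = ('<div class="story-body__inner"><p>First.</p><p>Second.</p></div>'
          '<p>Footer</p>')
print(extract_paragraphs(sample, 'story-body__inner'))
```

Using `str.join` also avoids the leading space that repeated string concatenation leaves at the start of the result.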

<hr style="height:2px"/>
Now we can tokenize and process the article as we did the previous text.

## Working with local files 
In some cases we might already have files available that we want to examine. We can use Python to load and read local documents.

In [None]:
# we can see what files are in our current directory
# (on Windows these commands are different but the process is the same.)
!ls

In [None]:
!ls cat_corpus

We can also use Python's `os` module to examine the files in a directory before using Python's `open` function and the file's `read` method.

In [None]:
import os
os.listdir('.')

In [None]:
os.listdir('cat_corpus')

In [None]:
# 'with' closes the file for us once the block ends
with open('cat_corpus/00.txt') as file:
    text = file.read()

In [None]:
text

In [None]:
# we can see we have a string ready to process
type(text)

We can see that for every new line there is a "\n" which is the notation for a new line in a text file. We can filter this out if we do not need to preserve that information.

In [None]:
new_text = ""
with open('cat_corpus/00.txt') as file:
    for line in file:
        # remove the trailing newline; note that the lines are joined with nothing between them
        new_text = new_text + line.strip('\n')

new_text
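The same pattern extends to a whole directory of files. The sketch below reads every `.txt` file in a directory into a dictionary and joins the lines of each file with a space, so words on adjacent lines are not glued together. The function name `load_corpus`, the UTF-8 encoding, and the throwaway demo directory are assumptions made for illustration.

```python
import os
import tempfile

def load_corpus(directory):
    """Read every .txt file in directory into a dict of {filename: text}."""
    corpus = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith('.txt'):
            path = os.path.join(directory, name)
            with open(path, encoding='utf8') as f:
                # strip each line and join with a space instead of a newline
                corpus[name] = ' '.join(line.strip() for line in f)
    return corpus

# Build a throwaway directory so the sketch runs anywhere
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, '00.txt'), 'w', encoding='utf8') as f:
    f.write('The cat sat\non the mat.\n')

corpus = load_corpus(demo_dir)
print(corpus)
```

With the notebook's own data, the call would be `load_corpus('cat_corpus')`.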

## Some helpful string methods

|Method|Functionality|
|------|-------------|
|`s.find(t)`|index of first instance of string t inside s (-1 if not found)|
|`s.rfind(t)`|index of last instance of string t inside s (-1 if not found)|
|`s.index(t)`|like s.find(t) except it raises ValueError if not found|
|`s.rindex(t)`|like s.rfind(t) except it raises ValueError if not found|
|`s.join(text)`|combine the words of the text into a string using s as the glue|
|`s.split(t)`|split s into a list wherever a t is found (whitespace by default)|
|`s.splitlines()`|split s into a list of strings, one per line|
|`s.lower()`|a lowercased version of the string s|
|`s.upper()`|an uppercased version of the string s|
|`s.title()`|a titlecased version of the string s|
|`s.strip()`|a copy of s without leading or trailing whitespace|
|`s.replace(t, u)`|replace instances of t with u inside s|


In [None]:
string = "WE went to the Supermarlet on Wednesday"

In [None]:
string.lower()

In [None]:
string.replace('marlet','market')
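A note on `join`: it iterates over whatever it is given. Passing a string joins its individual characters, while passing a list of words joins the words. A short sketch of the difference, reusing the example sentence from this section:

```python
string = "WE went to the Supermarlet on Wednesday"

# a string is iterated character by character, spaces included
by_character = "-".join(string)

# split() first to get a list of words, then join the words
by_word = "-".join(string.split())

print(by_character)
print(by_word)
```

To glue words together, call `split()` (or tokenize) first and pass the resulting list to `join`.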

In [None]:
"-".join(string)

## Some helpful regular expression patterns
<table border="1" class="docutils" id="tab-regexp-meta-characters1">
<colgroup>
<col width="15%">
<col width="85%">
</colgroup>
<thead valign="bottom">
<tr><th class="head">Operator</th>
<th class="head">Behavior</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><tt class="doctest"><span class="pre">.</span></tt></td>
<td>Wildcard, matches any character</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">^abc</span></tt></td>
<td>Matches some pattern <span class="math">abc</span> at the start of a string</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">abc$</span></tt></td>
<td>Matches some pattern <span class="math">abc</span> at the end of a string</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">[abc]</span></tt></td>
<td>Matches one of a set of characters</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">[A-Z0-9]</span></tt></td>
<td>Matches one of a range of characters</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">ed|ing|s</span></tt></td>
<td>Matches one of the specified strings (disjunction)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">*</span></tt></td>
<td>Zero or more of previous item, e.g. <tt class="doctest"><span class="pre">a*</span></tt>, <tt class="doctest"><span class="pre">[a-z]*</span></tt> (also known as <em>Kleene Closure</em>)</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">+</span></tt></td>
<td>One or more of previous item, e.g. <tt class="doctest"><span class="pre">a+</span></tt>, <tt class="doctest"><span class="pre">[a-z]+</span></tt></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">?</span></tt></td>
<td>Zero or one of the previous item (i.e. optional), e.g. <tt class="doctest"><span class="pre">a?</span></tt>, <tt class="doctest"><span class="pre">[a-z]?</span></tt></td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{n}</span></tt></td>
<td>Exactly <span class="math">n</span> repeats where n is a non-negative integer</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{n,}</span></tt></td>
<td>At least <span class="math">n</span> repeats</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{,n}</span></tt></td>
<td>No more than <span class="math">n</span> repeats</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">{m,n}</span></tt></td>
<td>At least <span class="math">m</span> and no more than <span class="math">n</span> repeats</td>
</tr>
<tr><td><tt class="doctest"><span class="pre">a(b|c)+</span></tt></td>
<td>Parentheses that indicate the scope of the operators</td>
</tr>
</tbody>


</table>

In [None]:
words = nltk.corpus.words.words('en')

In [None]:
len(words)

In [None]:
# find words by the ending
[w for w in words if re.search('ing$', w)][:50]

In [None]:
# the wildcard is a well-known search symbol: '.'
[w for w in words if re.search('....eding', w)]

In [None]:
# You can combine the regular expressions to find more complex patterns
[w for w in words if re.search('ing.$', w)]

In [None]:
# Use the '^' symbol to match the start of a word
[w for w in words if re.search('^pre..ing', w)]

In [None]:
# and the '[]' can be used to denote ranges
[w for w in words if re.search('^[t-x][uy]', w)]

In [None]:
# And the plus symbol '+' matches one or more repetitions of the preceding item.
[w for w in words if re.search('s+$', w)]

In [None]:
# While the star '*' means zero or more instances of the preceding character or set.
# All the words ending in 'ing' (the same result as 'ing$', because re.search looks anywhere in the word)
[w for w in words if re.search('.*ing$', w)]
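The table also lists braces and parentheses, which the cells above do not exercise. The sketch below tries them on a small word list so the results are easy to check by eye. The list `sample_words` and the helper `matching` are made up for illustration and stand in for the full `words` corpus.

```python
import re

# Hypothetical word list standing in for nltk.corpus.words.words('en')
sample_words = ['sing', 'singing', 'sings', 'proceeding', 'press',
                'preening', 'walked', 'tut', 'banana']

def matching(pattern):
    """Return the sample words in which pattern is found (re.search semantics)."""
    return [w for w in sample_words if re.search(pattern, w)]

print(matching(r'ing$'))           # ends in 'ing'
print(matching(r'^pre..ing'))      # 'pre', any two characters, then 'ing'
print(matching(r'^[a-z]{3,4}$'))   # whole word is three or four lowercase letters
print(matching(r'(an){2}'))        # the group 'an' repeated twice in a row
print(matching(r'(ed|ing|s)$'))    # ends in any one of the listed suffixes
```

The patterns are written as raw strings (`r'...'`), which keeps backslashes from being interpreted by Python before the regular expression engine sees them.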

## Normalizing Text
Let's look at normalizing text by tokenizing, stemming, and lemmatizing. For the purposes of demonstration let's take a small bit of text so we can observe the changes we make, but these operations can be applied equally to large texts.

In [None]:
raw_text = 'NARRATOR : Sir Launcelot had saved Sir Galahad from almost certain temptation, but they were still no nearer the Grail. Meanwhile, King Arthur and Sir Bedevere, not more than a swallow\'s flight away, had discovered something. Oh, that\'s an unladen swallow\'s flight, obviously. I mean, they were more than two laden swallows\' flights away -- four, really, if they had a coconut on a line between them.'

In [None]:
raw_text

To begin let's tokenize the text so we can work with the individual words.

In [None]:
token_text = word_tokenize(raw_text)
token_text

NLTK provides stemmers, so we do not need to write a stemming algorithm of our own. Two stemmers available in NLTK are the Porter and Lancaster stemmers. They handle stemming a little differently and will give different results.

In [None]:
# we can normalize text by making everything lower case.
token_text_lower = [w.lower() for w in token_text]
print(token_text_lower)

In [None]:
# These stemmers lowercase their input as part of stemming, so we can pass in the original tokens
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [None]:
port = [porter.stem(t) for t in token_text]
print(port)

In [None]:
lan = [lancaster.stem(t) for t in token_text]
print(lan)

In [None]:
print(sorted(set(port).difference(set(lan))))

In [None]:
print(sorted(set(lan).difference(set(port))))
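The set differences above show which stems appear in only one output, but not which word produced them. A small side-by-side sketch makes the comparison easier to read. It assumes NLTK is installed (the stemmers need no extra data downloads), and the word list is made up for illustration.

```python
import nltk

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

# Hypothetical sample words chosen for illustration
sample_words = ['running', 'flights', 'discovered', 'temptation', 'swallows']

# One row per word: the original, the Porter stem, the Lancaster stem
print('{:<12} {:<12} {:<12}'.format('word', 'porter', 'lancaster'))
for word in sample_words:
    print('{:<12} {:<12} {:<12}'.format(word, porter.stem(word), lancaster.stem(word)))
```

Replacing `sample_words` with `sorted(set(token_text))` gives the same comparison for the passage used in this section.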

The following code is beyond the scope of our workshop but it is here to demonstrate some of the utility of the stemmers.

In [None]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [None]:
from nltk import book

In [None]:
text = book.text1
porter = nltk.PorterStemmer()
indexed_text = IndexedText(porter, text)
indexed_text.concordance('die')

### Lemmatization

NLTK provides the WordNet lemmatizer, which only changes words that are in its dictionary. <br/>
**Note:** The WordNet lemmatizer will convert 'women' to 'woman'

In [None]:
wnlemma = nltk.WordNetLemmatizer()

In [None]:
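# 'women' and 'children' are returned as 'woman' and 'child'. 'running' is left unchanged because
# the lemmatizer treats every word as a noun unless told otherwise (passing pos='v' gives 'run').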
[wnlemma.lemmatize(w) for w in ['women','running','children']]

In [None]:
text = book.text1

In [None]:
lemma_text = [wnlemma.lemmatize(w) for w in text]
print(lemma_text)

In [None]:
freqDist = FreqDist(lemma_text)

In [None]:
freqDist.most_common(50)

In [None]:
freqDist.plot(50, cumulative=True)