# Natural Language Processing (NLP) prep

We've poked around some textual archival data, so now let's look at preparing that text for natural language processing.

First, we're going to need to import a bunch of libraries to get ready:

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords

And now we need to use one of those libraries we imported to download some content that the rest of these libraries will need to function:

In [2]:
nltk.download('punkt')
nltk.download('tagsets')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/nms-
[nltk_data]     workshop/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package tagsets to /Users/nms-
[nltk_data]     workshop/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/nms-
[nltk_data]     workshop/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/nms-
[nltk_data]     workshop/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nms-workshop/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

And finally, we need some text to work on! Let's use one of the `.ocr` files in the `sample-documents` directory of the no-more-silence-sample dataset we're working with in this workshop. These are plain text files (if you open one in a text editor you'll be able to read it), so they'll open easily in our program and give us text to start working with.

`ucsf_mss95-04_001_033.ocr` looks like a good one. Let's open it up and read it into a variable with the following code:

In [3]:
with open('no-more-silence-sample-main/sample-documents/ucsf_mss95-04_001_033.ocr', 'r') as file:
    sample_text = file.read()

Now we have the text from this file stored in a variable, `sample_text`, that we'll prepare for some NLP analysis using `nltk`.
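If the path doesn't resolve from the notebook's working directory, `open()` raises a `FileNotFoundError`. Here's a minimal sketch of a more forgiving loader. The helper name `read_ocr_text` is made up for this example, and UTF-8 is an assumption about how the `.ocr` files are encoded.

```python
from pathlib import Path

def read_ocr_text(path):
    """Return the text of an OCR file, or None if the file can't be found."""
    path = Path(path)
    if not path.is_file():
        # Report where Python is looking, which usually explains the problem
        print(f"Could not find {path} (working directory: {Path.cwd()})")
        return None
    # Assumption: the file is UTF-8. errors='replace' swaps any undecodable
    # bytes for a placeholder character instead of raising an error.
    return path.read_text(encoding='utf-8', errors='replace')
```

Calling `read_ocr_text('no-more-silence-sample-main/sample-documents/ucsf_mss95-04_001_033.ocr')` would return the same text as the cell above when the file is where we expect it, and `None` with a hint about the working directory when it isn't.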

In [None]:
print(sample_text)

Right from the start we can see this is pretty messy. To do NLP we have to make things more uniform, so let's do that work.

First, let's turn this into a series of words we can work on individually. This is called "tokenizing".

In [None]:
tokenized_text = word_tokenize(sample_text)

In [None]:
print(tokenized_text)

We can see we've got a lot of wild stuff going on in here. Let's start by removing non-alphanumeric characters to see if that gets rid of some of the wildness at the front (the poorly OCR'd handwritten text). We should also note how much important information is removed from this text when we get rid of this part.

We'll do this using regular expressions and the `re` library, which handles them in Python.


In [None]:
import re

This code goes through our `tokenized_text` list and replaces every run of non-word characters with nothing, thereby removing them. The `r'\W+'` expression matches anything other than letters, digits, and underscores.

In [None]:
tokenized_text_no_punct = []
for w in tokenized_text:
    tokenized_text_no_punct.append(re.sub(r'\W+', '', w))

In [None]:
print(tokenized_text_no_punct)
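To see what that substitution does on a small scale, here's a quick sketch using a few hand-made tokens (these are invented for illustration, not taken from the sample document):

```python
import re

# Hand-made tokens for illustration only
example_tokens = ['Dr.', "don't", '--', 'file_name', '1985', '***']

# Same substitution as above: delete every run of non-word characters
cleaned = [re.sub(r'\W+', '', t) for t in example_tokens]
print(cleaned)
# ['Dr', 'dont', '', 'file_name', '1985', '']
```

Two things to notice: tokens made up entirely of punctuation become empty strings rather than disappearing, and underscores survive because `\W` counts them as word characters.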

We might also want to remove 'stopwords', those words that help form a sentence but that have "little meaning" (lol) in and of themselves. These are words like 'the', 'a', 'an', 'by', etc. `nltk` contains a pre-set list of stopwords that we can use for this process; it's one of the things we downloaded earlier in this workbook.

In [None]:
stop = stopwords.words('english')

In [None]:
print(stop)

Now let's check each of the words in our list against the stopwords list, and leave them out if they're stopwords.

In [None]:
tokenized_text_no_stop = []
for word in tokenized_text_no_punct:
    if word not in stop:
        tokenized_text_no_stop.append(word)

In [None]:
print(tokenized_text_no_stop)

This is better, but there's still a lot of stuff in there we probably can't work with. For one thing, there are all the totally empty tokens (`''`). Let's rewrite this so that it removes those too, and turns everything to lowercase while we're at it.

In [None]:
tokenized_text_no_stop = []
for word in tokenized_text_no_punct:
    if (word.lower() not in stop) and (word != ''):
        tokenized_text_no_stop.append(word.lower())

In [None]:
print(tokenized_text_no_stop)
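Why lowercase before checking? The `nltk` English stopword list is all lowercase, so a capitalized 'The' slips past a case-sensitive check. Here's a small sketch of the difference, using a tiny hypothetical stopword set in place of `stopwords.words('english')` and some invented tokens:

```python
# A tiny hypothetical stopword set, standing in for stopwords.words('english')
example_stop = {'the', 'a', 'an', 'by'}
example_tokens = ['The', 'letter', '', 'by', 'A', 'Nurse']

# Case-sensitive check: 'The' and 'A' survive, and so does the empty token
case_sensitive = [w for w in example_tokens if w not in example_stop]
print(case_sensitive)  # ['The', 'letter', '', 'A', 'Nurse']

# Lowercase before checking, and skip empty strings
normalized = [w.lower() for w in example_tokens
              if w and w.lower() not in example_stop]
print(normalized)  # ['letter', 'nurse']
```

The example uses a `set` rather than a list because membership checks on a set are faster, which starts to matter on longer documents. Wrapping our real list as `set(stop)` would work the same way.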

This is looking a lot more uniform now. In a bit we'll see what else we can do to prepare it further. 

Now, however, we can also start to look at some interesting functionality provided by `nltk`. For example, let's get it to label the parts of speech for us:

In [None]:
tokens_pos = pos_tag(tokenized_text_no_stop)

In [None]:
print(tokens_pos)

Each word is now labelled with the part of speech it represents. These labels can be a bit tough to understand, so `nltk` contains a built-in key:

In [None]:
nltk.help.upenn_tagset()

It's a good thing that we took a look at this help, because we can see that our order was a little off here. We should have labeled the parts of speech prior to making everything lowercase, because now proper nouns are not recognized.
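One way to fix the order is to tag first and clean afterwards, carrying each tag along with its word. The sketch below is one possible arrangement, not the only one. The function name `tag_then_clean` is made up for this example, and it takes the tagger as an argument so that the ordering is easy to see.

```python
import re

def tag_then_clean(tokens, tagger, stop_words):
    """Tag tokens while capitalization is intact, then clean each word.

    tagger is any function that takes a list of tokens and returns a list of
    (word, tag) pairs, which is what nltk's pos_tag does.
    """
    tagged = tagger(tokens)  # tagging happens BEFORE lowercasing
    cleaned = []
    for word, tag in tagged:
        word = re.sub(r'\W+', '', word)  # strip non-word characters
        if word and word.lower() not in stop_words:
            cleaned.append((word.lower(), tag))  # lowercase, but keep the tag
    return cleaned
```

In this notebook that would look like `tokens_pos = tag_then_clean(tokenized_text, pos_tag, stop)`, starting from the original tokens rather than the lowercased ones.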

## Stemming and Lemmatizing

Stemming is the process of reducing words to their root form, whether or not the root form is a word itself. Again, `nltk` contains several built-in functions to do this work, each using a slightly different approach.

We'll use the function `PorterStemmer()` for this notebook.

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
porter = PorterStemmer()  # use the class we imported in the cell above
porter_stemmed = [porter.stem(t) for t in tokenized_text_no_stop]

In [None]:
print(porter_stemmed)

We can also take the approach of lemmatizing, which is the process of reducing words to their root form only if the root form is also a word in the English language. For this, `nltk` comes with the `WordNetLemmatizer()`. (Note that we don't need a separate import here: `WordNetLemmatizer` is available directly from the `nltk` package, and we already downloaded the `wordnet` data it relies on at the very beginning of this workbook.)

In [None]:
lemmatizer = nltk.WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t) for t in tokenized_text_no_stop]

In [None]:
print(lemmatized)
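To make the difference between the two approaches concrete, here's a deliberately tiny toy version of each. To be clear, this is not the Porter algorithm and not WordNet: the suffix rules and the lookup table are invented for illustration. It only shows the idea that a stemmer chops by rule while a lemmatizer looks words up.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ('ies', 'ing', 'ed', 's')                      # hypothetical suffix rules
LEMMA_TABLE = {'studies': 'study', 'letters': 'letter'}   # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if one is known, else leave the word alone."""
    return LEMMA_TABLE.get(word, word)

for w in ['studies', 'letters', 'nursing']:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies stud study
# letters letter letter
# nursing nurs nursing
```

The rule-based version happily produces non-words like 'stud' and 'nurs', while the lookup version only ever returns real words and leaves anything it doesn't know untouched.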

Now we've got some more-or-less cleaned text, and we can begin to do some NLP in the next half of this workshop.