# Natural language processing part 2:
# Parsing text

## Lecture objectives
* Learn how to do simple word counts
* Understand how to tokenize (split) text into words and sentences
* Understand how and when to lemmatize

In the previous lecture, we parsed the PDF and obtained a clean string. Now, we can do something with that string that requires an understanding of the language - i.e., natural language processing.

We'll focus on English-language parsing. Many of the principles are applicable to other languages too.

First, let's load in the text file that we created in the previous lecture. It's the same `open` function, but we'll use `r` (read) rather than `w` (write).

In [None]:
with open('eirtext.txt','r') as f:
    eirtext = f.read()
print(len(eirtext))

## Bags of words
One of the simplest ways to analyze a text is to look at the raw count of words.

Let's write a function to take a piece of text, and return a DataFrame with word counts. 

What do we do? First, create an empty dictionary to hold our counts. The keys will be the words, and the values the counts.

In [None]:
counts = {}

Then, let's split our text into a list of words. `split()` splits a string on a separator that you specify. By default, it splits on any run of whitespace (spaces, tabs, and newlines).

For example:

In [None]:
'This is a sentence'.split()

Or pass a separator to split on something else, such as hyphens.

In [None]:
'This-is-a-sequence of words'.split('-')
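
A quick sketch of how the separator argument changes the result, using a made-up string. With no argument, `split()` treats any run of whitespace as one separator, while `split(' ')` splits on every single space and can return empty strings.

```python
# Made-up text containing a double space, a tab and a newline
sample = "air  quality\timpacts\nsummary"

default_split = sample.split()    # any run of whitespace counts as one separator
space_split = sample.split(' ')   # splits on each individual space only

print(default_split)
print(space_split)
```

For word counts, the no-argument form is usually what you want, because real text is full of line breaks and repeated spaces.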

So now let's do that with our EIR text.

In [None]:
wordlist = eirtext.split()

We have a list of words - `wordlist`. Let's loop over them.

If that word already exists in our dictionary, we add 1 to the value. Otherwise, we create a new key and a value of 1.

In [None]:
for word in wordlist:
    lword = word.lower() # convert to lowercase
    if lword in counts:
        counts[lword] +=1
    else:
        # doesn't exist in the dictionary
        counts[lword] = 1
print(counts)
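
As an aside, Python's standard library has a shortcut for this pattern. Here is a minimal sketch using `collections.Counter` on a made-up sentence (the sample text and variable names are just for illustration); it does the same bookkeeping as the loop above.

```python
from collections import Counter

# Made-up sentence standing in for the EIR text
sample_words = "The project would increase emissions and the emissions would exceed the threshold".split()

# Counter builds the word -> count dictionary for us
word_counter = Counter(word.lower() for word in sample_words)

# most_common(n) returns the n most frequent (word, count) pairs
print(word_counter.most_common(3))
```

Writing the loop by hand is still worth doing once, since it shows exactly what is being counted.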

Let's convert this dictionary to a DataFrame, sort it by the word count column, and add a name to the index.

In [None]:
import pandas as pd
df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
df.sort_values('word_count', ascending=False, inplace=True)
df.index.name = 'word'
df.head(10)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Turn the preceding steps into a function. The argument should be a list of words, and the function should return a DataFrame.
</div>

Here, we just take all the code above, indent it, and give the function a name (`countWords`), an argument (`wordlist`), and a `return` statement.

In [None]:
def countWords(wordlist):
    counts = {} 
    
    for word in wordlist:
        lword = word.lower()
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    
    return df

wordlist = eirtext.split()
df = countWords(wordlist)
df.head(10)
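
pandas can also do the counting itself. The sketch below is one possible alternative, not a replacement for the exercise: `count_words_pandas` is a hypothetical name, and the word list is made up.

```python
import pandas as pd

sample_words = ["The", "project", "would", "increase", "the", "emissions"]  # made-up input

def count_words_pandas(wordlist):
    # Lowercase the words, count each distinct word (sorted by descending count),
    # then name the index and the column the same way countWords does
    counts = pd.Series(wordlist, dtype="object").str.lower().value_counts()
    return counts.rename_axis("word").to_frame("word_count")

df_sample = count_words_pandas(sample_words)
print(df_sample.head())
```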

## Tokenizing and stopwords
I suppose it's good that an EIR section on air quality mentions emissions. But the other words aren't particularly informative. This type of analysis might be useful in some applications, but here, we really need to push further.

Let's use the `nltk` library to get rid of the little words like "the," "for," etc. These are called *stop words* in natural language processing jargon. 

`nltk` is a mammoth library, and has lots of submodules. For now, we'll use the tokenizing functions (more on this in a moment) and the `stopwords` corpus.

The first time we use them, we have to download the data they rely on: the stopwords corpus and the `punkt` tokenizer models. If you don't do this, you'll get a helpful error message telling you what to download. See http://www.nltk.org/nltk_data/ for all the corpora that you can download.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

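# We only need to do this once 
# If a newer nltk release asks for a different resource (e.g. 'punkt_tab') in its
# error message, download that one in the same way
nltk.download('stopwords')
nltk.download('punkt')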

Let's take a look at the stopwords. `stopwords.words` just gives us a list of words.

In [None]:
stopwords.words('english')

Stopword lists are available for several other languages, too. Let's look at the first 10 words of two of them.

In [None]:
print(stopwords.words('spanish')[:10])
print(stopwords.words('arabic')[:10])

Tokenizing functions are essentially splitting functions. For example, `sent_tokenize` splits a text into a list of sentences.

In [None]:
sent_tokenize(eirtext)
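
Why not just split on full stops? A rough sketch with a made-up example: a naive regular-expression splitter (a stand-in for illustration, not how `sent_tokenize` works) trips over abbreviations, which is the kind of case a dedicated sentence tokenizer is designed to handle.

```python
import re

sample = "The EIR covers air quality. Dr. Smith reviewed the analysis."  # made-up text

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

# The abbreviation 'Dr.' is wrongly treated as a sentence of its own
print(naive_sentences)
```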

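For our purposes, we want to split into words. We can use `word_tokenize`. This should give us similar results to the `split()` function earlier, but it's more robust: for example, it separates punctuation from the adjacent word, so `emissions.` and `emissions` are not counted as different words.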
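
To see what goes wrong with plain `split()`, here is a small sketch on a made-up sentence. The regular expression is only a simple stand-in for a tokenizer and is an assumption for illustration; `word_tokenize` behaves differently in detail (for instance, it keeps punctuation marks as separate tokens).

```python
import re

sample = "Emissions, including NOx, would exceed the District's thresholds."  # made-up text

# Whitespace splitting leaves punctuation glued to the words
split_tokens = sample.split()

# Stand-in tokenizer: runs of letters, optionally followed by an apostrophe suffix
regex_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sample)

print(split_tokens)
print(regex_tokens)
```

With `split()`, `Emissions,` and `thresholds.` keep their punctuation, so they would be counted separately from `emissions` and `thresholds`.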

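Before we count the words, let's also use a regular expression (Python's `re` module) to drop the digits, punctuation, and other non-letter characters.

In [None]:
import re
# Keep only ASCII letters and whitespace. Note that the range A-z would also keep [ \ ] ^ _ `
wordlist = word_tokenize(re.sub(r"[^A-Za-z\s]", "", eirtext))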
df = countWords(wordlist)
df.head()
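
A note on character classes for letters: the range `A-z` follows ASCII order, so it also covers the six characters that sit between `Z` and `a`. A small sketch on a made-up string shows the difference from `A-Za-z`.

```python
import re

sample = "PM_2.5 [ppm] rose^2"  # made-up string with digits and punctuation

loose = re.sub(r"[^A-z\s]", "", sample)      # A-z also matches [ \ ] ^ _ `
strict = re.sub(r"[^A-Za-z\s]", "", sample)  # letters and whitespace only

print(repr(loose))   # 'PM_ [ppm] rose^'
print(repr(strict))  # 'PM ppm rose'
```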

Now let's drop the stopwords from our counts. 

Remember that `stopwords.words` gives a list of words. So let's use the pandas `drop()` method to drop all of those words from the index. 

We add the `errors='ignore'` argument because not all of our stopwords will be in our DataFrame.

In [None]:
df.drop(index=stopwords.words('english'), errors='ignore', inplace=True)
df.head()
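
An alternative is to filter the stopwords out of the word list before counting. Here is a minimal sketch, assuming a tiny hand-written stopword set (`toy_stopwords` is hypothetical; NLTK's English list is much longer) and a made-up token list.

```python
from collections import Counter

# Hypothetical mini stopword list, for illustration only
toy_stopwords = {"the", "of", "and", "would", "a", "to"}

tokens = ["the", "project", "would", "increase", "emissions", "of",
          "ozone", "and", "emissions", "of", "dust"]  # made-up, already lowercased

# Keep only the tokens that are not stopwords, then count what is left
content_words = [t for t in tokens if t not in toy_stopwords]
content_counts = Counter(content_words)

print(content_counts.most_common(2))
```

Storing the stopwords in a set keeps each membership check fast, which matters more as the text gets longer.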

## Lemmatizing
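Finally, we might want to *lemmatize* the words. We saw that process used in the [Brinkley & Stahmer](https://journals.sagepub.com/doi/abs/10.1177/0739456X21995890) paper. Lemmatization and its simpler cousin, *stemming*, both group related word forms, such as `highway` and `highways`, by reducing them to a common base. Lemmatization maps each word to its dictionary form (the *lemma*). Stemming is a cruder, rule-based approach that strips suffixes, so it also merges `constructing` and `construction` into the stem `construct`, and the stem is not always a real word.

`nltk` provides both. Here we use a stemmer, `PorterStemmer`, which needs no extra downloads; `nltk.stem.WordNetLemmatizer` is the lemmatizing counterpart.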

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem('construction'))
print(ps.stem('highways'))

The stemmer applies suffix rules rather than looking words up in a dictionary, so it still takes a decent guess at a (made-up) word that it has never seen.

In [None]:
print(ps.stem('housingelementifcation'))  
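
To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny toy stemmer. It is NOT the Porter algorithm, which has many more rules and conditions; the suffix list, the `min_stem` cutoff and the name `toy_stem` are all assumptions made for illustration.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first
SUFFIXES = ("ations", "ation", "ions", "ion", "ing", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word  # no rule applied

for w in ["construction", "constructing", "highways", "bus", "housingelementifcation"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at word endings, the made-up word is handled just like a real one.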

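Let's add this to our function, and call our new function `countStems`. Stemming is just one extra line. We also move the stopword removal inside the loop, before stemming, because stems such as `thi` (from `this`) would no longer match the stopword list.

In [None]:
def countStems(wordlist):
    counts = {} 
    # Build the stopword set once, so each lookup in the loop is fast
    stops = set(stopwords.words('english'))
    
    for word in wordlist:
        lword = word.lower()
        # Skip stopwords before stemming, while they still match the list
        if lword in stops:
            continue
        # This is the extra line
        lword = ps.stem(lword)
        
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'

    return df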

df = countStems(wordlist)
df.head(10)


Whether the stems are more useful than the original words is obviously a matter for your specific task.

So now we've got the tools to turn raw text into a useful form. In the next lectures, we'll interpret the text using topic modeling and sentiment analysis.

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Before analyzing a text, you will probably need to do clean-up such as removing stopwords, converting to lower case, and possibly stemming or lemmatizing the words.</li>
</ul>
</div>