# Tokenizing

## Lecture objectives
1. Introduce stopwords and how to remove them
2. Demonstrate how to tokenize (split) text into words and sentences
3. Explore how to reduce words to a common root (stemming and lemmatizing)

Let's start by loading in the text file that we used in the previous lecture.

In [None]:
with open('../scratch/eirtext.txt','r') as f:
    eirtext = f.read()

## Stopwords
Recall our word counts from the previous lecture. Many of the most common words (`and`, `of`, etc.) are not particularly informative. Raw counts like these might be useful in some applications, but here, we really need to push further.

Let's use the `nltk` library to get rid of these common words that don't have a substantive meaning. They are called *stop words* in natural language processing jargon. 

`nltk` is a mammoth library, and has lots of submodules. We'll use the tokenize functions (more on this in a moment) and `stopwords` submodules for now.

The first time we use them, we have to download the data they rely on: the stopwords "corpus" and the tokenizer models. If you don't do this, you'll get a helpful error message reminding you. See http://www.nltk.org/nltk_data/ for all the corpora that you can download.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# We only need to do this once 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

Let's take a look at the stopwords. `stopwords.words` just gives us a list of words.

In [None]:
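# Only the last expression in a cell is displayed automatically, so print the type explicitly
print(type(stopwords.words('english')))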
print(stopwords.words('english')[:20])

Stopword lists are available for several other languages, too.

In [None]:
print(stopwords.words('spanish')[:10])
print(stopwords.words('arabic')[:10])
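
Because `stopwords.words` returns a plain list, filtering tokens comes down to a membership check. The sketch below uses a tiny hand-picked set (`demo_stopwords`, a made-up stand-in for the full `nltk` list) so that it runs without any downloads. Using a `set` is a convenience for fast lookups, not something `nltk` requires.

```python
# A tiny, hand-picked stand-in for stopwords.words('english') (hypothetical sample)
demo_stopwords = {"the", "of", "and", "a", "in", "to"}

# Made-up tokens for illustration
tokens = ["The", "impacts", "of", "the", "project", "and", "alternatives"]

# Keep only tokens whose lower-case form is not a stopword
kept = [t for t in tokens if t.lower() not in demo_stopwords]
print(kept)
```

With the real list, you could build the set with `set(stopwords.words('english'))` and filter in the same way.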

## Tokenizing 
Tokenizing functions are essentially splitting functions. (Often, you might be able to use the `split()` function as we did in the previous lecture, but the tokenizing functions are more robust.)

For example, `sent_tokenize` splits into a list of sentences.

In [None]:
sentences = sent_tokenize(eirtext)
print(sentences[2])

For our purposes, we want to split into words. We can use `word_tokenize`. This should give us similar results to the `split()` function earlier, but it's a bit more robust.
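
To make the difference from `split()` concrete, here is a minimal sketch on a made-up sentence. The regular expression is a simplified stand-in for a tokenizer (it is not what `word_tokenize` does internally), but it shows the basic idea: punctuation becomes its own token instead of sticking to the neighboring word.

```python
import re

sample = "The project, as proposed, would widen the highway."  # made-up sentence

# split() only breaks on whitespace, so punctuation stays attached to words
split_tokens = sample.split()
print(split_tokens)

# Simplified tokenizer: runs of word characters, or single punctuation marks
simple_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(simple_tokens)
```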

Before we count the words, let's also use a regular expression (Python's `re` module) to drop the digits, punctuation, and other non-words.

In [None]:
import pandas as pd
import re

# this is exactly the same function from the previous lecture
def countWords(wordlist):
    counts = {} 
    
    for word in wordlist:
        lword = word.lower()
        if lword in counts:
            counts[lword] +=1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    
    return df

# Keep only letters and whitespace (the range A-z would also keep [ \ ] ^ _ `)
wordlist = word_tokenize(re.sub(r"[^A-Za-z\s]", "", eirtext))
df = countWords(wordlist)
df.head()
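
One detail worth checking in a clean-up pattern like this is the letter range in the character class. In ASCII, the characters `[`, `\`, `]`, `^`, `_` and `` ` `` sit between `Z` and `a`, so a range written `A-z` keeps them, while `A-Za-z` keeps letters only. A small sketch on a made-up string:

```python
import re

sample = "Table_3 [draft] cost: $4,500 ^note"  # made-up string

# 'A-z' spans every ASCII code from 'A' to 'z', including [ \ ] ^ _ `
wide = re.sub(r"[^A-z\s]", "", sample)

# 'A-Za-z' covers the two letter ranges only
letters_only = re.sub(r"[^A-Za-z\s]", "", sample)

print(wide)
print(letters_only)
```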

Now let's drop the stopwords from our counts. 

Remember that `stopwords.words` gives a list of words. So let's use the pandas `drop()` method to drop all of those words from the index. 

We add the `errors='ignore'` argument because not all of our stopwords will be in our DataFrame.

In [None]:
df.drop(index=stopwords.words('english'), errors='ignore', inplace=True)
df.head()
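
To see what `errors='ignore'` changes, here is a small sketch with a made-up DataFrame (`demo` and `to_drop` are illustrative names). Without the argument, `drop()` raises a `KeyError` when a label is missing from the index; with it, missing labels are skipped.

```python
import pandas as pd

# Made-up word counts, indexed by word
demo = pd.DataFrame({"word_count": [12, 7, 3]}, index=["the", "project", "noise"])
to_drop = ["the", "of"]  # 'of' is not in the index

# Default behavior: a missing label raises KeyError
try:
    demo.drop(index=to_drop)
except KeyError as err:
    print("KeyError:", err)

# With errors='ignore', labels that are absent are simply skipped
cleaned = demo.drop(index=to_drop, errors="ignore")
print(cleaned)
```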

## Lemmatizing
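Finally, we might want to *lemmatize* the words. We saw that process used in the [Brinkley & Stahmer](https://journals.sagepub.com/doi/abs/10.1177/0739456X21995890) paper. Lemmatization groups related word forms, e.g. `highway` and `highways`, or `constructing` and `construction`, by reducing them to a common *root*.

`nltk` has a built-in class that does something very similar: `PorterStemmer`. Strictly speaking, it is a *stemmer* rather than a lemmatizer: it strips suffixes using fixed rules, so the result (e.g. `construct`) is a stem that may not always be a dictionary word.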

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem('construction'))
print(ps.stem('highways'))

Because the stemmer works from suffix rules rather than a dictionary, it takes a decent guess even at a made-up word.

In [None]:
print(ps.stem('housingelementifcation'))  
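
To see which words end up sharing a stem, we can group a short, made-up word list by the output of `ps.stem`. This is a minimal sketch; the word list and the `groups` dictionary are illustrative only. `PorterStemmer` does not need any extra downloads.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["highway", "highways", "constructing", "construction", "constructed"]

# Map each stem to the list of original words that reduce to it
groups = defaultdict(list)
for w in words:
    groups[ps.stem(w)].append(w)

for stem, members in groups.items():
    print(stem, "<-", members)
```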

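Let's add this to our function, and call our new function `countStems`. The stemming itself is just one extra line. We also drop the stopwords inside the function, and we do that before stemming, because the stemmer alters some stopwords (e.g. `this` becomes `thi`) so that they no longer match the list.

In [None]:
def countStems(wordlist):
    counts = {} 
    # Build the stopword set once, for fast lookups
    stops = set(stopwords.words('english'))
    
    for word in wordlist:
        lword = word.lower()
        # Skip stopwords before stemming, while they are still in their original form
        if lword in stops:
            continue
        # This is the extra line
        lword = ps.stem(lword)
        
        if lword in counts:
            counts[lword] += 1
        else:
            counts[lword] = 1

    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'

    return df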

df = countStems(wordlist)
df.head(10)
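
One thing to watch when combining stemming with stopword removal is the order of the two steps. The Porter stemmer also shortens some stopwords (for example, it strips the final `s` from `this`), and the shortened form no longer matches the stopword list. A minimal sketch, using a tiny hand-picked stopword set (`demo_stopwords`, a made-up stand-in for the full `nltk` list):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
demo_stopwords = {"this", "was", "the"}  # hypothetical sample, not the full list
tokens = ["this", "highway", "was", "the", "alternative"]

# Stem first, filter second: stemmed stopwords can slip through
stem_then_filter = [s for s in (ps.stem(t) for t in tokens) if s not in demo_stopwords]

# Filter first, stem second: stopwords are removed while still in their original form
filter_then_stem = [ps.stem(t) for t in tokens if t not in demo_stopwords]

print(stem_then_filter)
print(filter_then_stem)
```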


Whether the stems are more useful than the original words depends on your specific task.

So now we've got the tools to get some text into a useful form. In the next module, we'll interpret the text using topic modeling and sentiment analysis.

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Before analyzing a text, you will probably need to do clean-up such as removing stopwords, converting to lower case, and possibly stemming or lemmatizing the words.</li>
</ul>
</div>