# Intro to Natural Language Processing with Python

## Info
- Scott Bailey (CIDR), *scottbailey@stanford.edu*
- Javier de la Rosa (CIDR), *versae@stanford.edu*
- Ashley Jester (CIDR/SSDS), *ajester@stanford.edu*

## What are we covering today?
- What is NLP?
- Options for NLP in Python
- Tokenization
- Part of Speech Tagging
- Word transformations (lemmatization, pluralization)
- Sentiment Analysis
- Readability indices

## Goals

By the end of the workshop, we hope you'll have a basic understanding of natural language processing, and enough familiarity with one NLP package, Textblob, to perform basic NLP tasks like tokenization and part of speech tagging. Through analyzing presidential speeches, we also hope you'll understand how these basic tasks open up a number of possibilities for textual analysis, such as readability indices. 

## What is NLP

NLP stands for Natural Language Processing, and it involves a huge variety of tasks, such as:
- Automatic summarization.
- Coreference resolution.
- Discourse analysis.
- Machine translation.
- Morphological segmentation.
- Named entity recognition.
- Natural language understanding.
- Part-of-speech tagging.
- Parsing.
- Question answering.
- Relationship extraction.
- Sentiment analysis.
- Speech recognition.
- Topic segmentation.
- Word segmentation.
- Word sense disambiguation.
- Information retrieval.
- Information extraction.
- Speech processing.

One of the key ideas is to be able to process text without reading it.

## NLP in Python

Python ships with a mature regular expression library, `re`, which is a basic building block for natural language processing. More advanced tasks, however, need dedicated libraries. Until recently, the Natural Language Toolkit, abbreviated as `NLTK`, was the only practical choice in the Python ecosystem. Unfortunately, the library has not aged well: although it has been updated to work with newer versions of Python, it does not provide the speed we might need to process large corpora.

A more recent alternative is `spaCy`, which is much faster because it is written in Cython, a Python-like language that compiles to C and is optimized for speed.

Both of these libraries are complex, so wrappers exist to simplify their APIs. The two most popular are `Textblob`, for NLTK and the CLiPS Pattern library, and `textacy`, for spaCy. In this workshop we will use Textblob, since it is more established and mature and provides all we need to start learning some basic NLP tasks.

In [None]:
from textblob import TextBlob

In [None]:
# Helper functions
import requests
from urllib.request import urlopen

def get_text(url):
    try:
        return requests.get(url).text
    except requests.RequestException:
        # Fall back to the standard library if the requests call fails
        return urlopen(url).read().decode("utf8")
        
def get_speech(url):
    page = get_text(url)
    full_text = page.split('\n')
    return " ".join(full_text[2:])

In [None]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
clinton_speech = get_speech(clinton_url)
clinton_speech

In [None]:
clinton_blob = TextBlob(clinton_speech[:446])
clinton_blob.string == clinton_speech[:446]

## Tokenization

In NLP, the act of splitting text is called tokenization, and each of the individual chunks is called a token. Therefore, we can talk about word tokenization or sentence tokenization depending on what it is that we need to divide the text into.
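
Before turning to Textblob, the idea can be sketched with nothing but the standard `re` module. This is a minimal illustration, not how Textblob tokenizes: the function names `word_tokens` and `sentence_tokens` and the regular expressions are assumptions made for this example, and the sentence rule is deliberately naive.

```python
import re

def word_tokens(text):
    # A run of letters, digits, or apostrophes counts as one word token;
    # punctuation and whitespace are dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

def sentence_tokens(text):
    # Split on whitespace that follows ., !, or ?. This naive rule would
    # wrongly split after abbreviations such as "Mr.".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Well, I am not 30 years old. Are you?"
print(word_tokens(sample))
print(sentence_tokens(sample))
```

Real tokenizers handle abbreviations, contractions, and non-ASCII text far more carefully, which is one reason to reach for a library.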

In [None]:
clinton_blob.words

In [None]:
clinton_blob.sentences

In [None]:
clinton_blob.noun_phrases

A special way of dividing text into tuples of consecutive words or letters is usually referred to as n-grams.

In [None]:
clinton_blob.ngrams(n=3)

In [None]:
clinton_blob.ngrams(n=5)
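
The sliding-window idea behind n-grams can also be sketched in plain Python. The helper name `ngrams` below is an assumption for this example and is independent of Textblob's own `ngrams` method; it only shows the windowing logic.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the sequence, one position at a time.
    # If the sequence is shorter than n, the result is an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "we are fortunate to be alive".split()
print(ngrams(words, 3))

# Character n-grams work the same way on a string treated as a sequence.
print(ngrams(list("speech"), 2))
```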

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `count_chars(text)` that receives `text` and returns the total number of characters ignoring spaces and punctuation marks. For example, `count_chars("Well, I am not 30 years old.")` should return `20`.
<br/>
* **Hint**: You could count the characters in the words.*
</p>
</div>

In [None]:
def count_chars(text):
    ...

count_chars("Well, I am not 30 years old.")

## Part of Speech Tagging

Textblob also allows you to perform part-of-speech tagging, a way of labeling each word with its grammatical category, out of the box. By default it uses the Penn Treebank tagset, but other taggers can be plugged in using NLTK classes.

In [None]:
clinton_blob.tags

In [None]:
for word, pos in clinton_blob.tags:
    print(word, pos)

For what these tags mean, you might check out http://www.clips.ua.ac.be/pages/mbsp-tags

In [None]:
clinton_blob.parse()

In [None]:
clinton_blob.sentences[0]

![Sentence tree](https://github.com/sul-cidr/python_workshops/blob/master/data/tree.svg?raw=1)


In [None]:
clinton_blob.sentences[0].parse()

## Word transformations

In [None]:
from textblob import Word
w = Word("octopi")
w.lemmatize()

In [None]:
w.lemma

In [None]:
v = Word("is")
v.lemmatize("v")

In [None]:
for word in clinton_blob.words:
    print(word, word.lemmatize())

In [None]:
for word in clinton_blob.words:
    print(word, word.lemmatize("v"))

In [None]:
for word, pos in clinton_blob.tags:
    if pos == "VBP":
        print(word, word.lemmatize("v"))

In [None]:
for word in clinton_blob.words:
    print(word, word.pluralize())

## Counting

In [None]:
clinton_blob.word_counts

In [None]:
clinton_blob.word_counts['congress']

In [None]:
clinton_blob.words.count('Mr', case_sensitive=True)

In [None]:
clinton_blob.noun_phrases.count('internal crisis')

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Let's define the lexicon of a person as the set of different words she uses to speak. Write a function `get_lexicon(text, n)` that receives `text` and `n` and returns the lemmas of nouns, verbs, and adjectives that are used at least `n` times. For example, `get_lexicon(clinton_speech, 10)` should return

```
{'A',
 'America',
 'New',
 'So',
 'Thank',
 'Tonight',
 'ask',
 'be',
 'child',
 'do',
 'have',
 'help',
 'make',
 'more',
 'new',
 'people',
 'thank',
 'tonight',
 'want',
 'work',
 'year'}
```
<br/>
* **Hint**: In Textblob, the tags for nouns, verbs, and adjectives start with `N`, `V`, and `J`, respectively.*
</p>
</div>

In [None]:
def get_lexicon(text, n):
    blob = TextBlob(text)
    ...
    
get_lexicon(clinton_speech, 25)

## Sentiment analysis

Sentiment analysis is a basic form of text classification that assigns each sentence to a category, commonly one of two: positive or negative.

In [None]:
clinton_blob.sentiment

In [None]:
for sentence in clinton_blob.sentences:
    print(sentence, sentence.sentiment.polarity)

In [None]:
sad_sent = "Life is sad."
sad_blob = TextBlob(sad_sent)
sad_blob.sentiment.polarity
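
To make the notion of a polarity score more concrete, here is a toy, lexicon-based scorer. Everything in it is an assumption for illustration: the word list, the scores, and the function names `toy_polarity` and `toy_label` are made up, and this is not how Textblob's analyzers are implemented.

```python
import re

# Hypothetical mini-lexicon mapping words to a polarity in [-1.0, 1.0].
TOY_LEXICON = {"good": 0.7, "love": 0.5, "sad": -0.5, "nasty": -1.0}

def toy_polarity(text):
    # Lowercase the text, pull out the words, and average the scores of
    # the words that appear in the lexicon.
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    # Text with no known words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0

def toy_label(text):
    # Collapse the score into two categories.
    return "pos" if toy_polarity(text) >= 0 else "neg"

print(toy_polarity("Life is sad."), toy_label("Life is sad."))
```

A scorer like this ignores negation, intensity, and context, which is part of why real analyzers are considerably more involved.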

Textblob includes an alternate sentiment analyzer that you can use out of the box. 

In [None]:
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob(clinton_speech[:446], analyzer=NaiveBayesAnalyzer())
for sentence in blob.sentences:
    print(sentence, sentence.sentiment)

In [None]:
para = "Life is good. Life sucks. John hates soda. John hates nasty soda. John likes good soda. John loves soda. John loves sweet soda."
sent_blob = TextBlob(para)
for sent in sent_blob.sentences:
    print(sent, sent.sentiment.polarity)

In [None]:
sent_blob_nb = TextBlob(para, analyzer=NaiveBayesAnalyzer())
for sent in sent_blob_nb.sentences:
    print(sent, sent.sentiment)

These examples used the built-in analyzers, but you can also build your own analyzer around a classifier object. Classifier objects come with their own methods, some of which are very useful for model selection. The Textblob docs give an example of how to build a basic sentiment classifier if you're interested.

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Rather than just get the sentiment of individual sentences, we could try to calculate the average sentiment of a text by averaging the sentiment of its sentences. Write a function `avg_sentiment(text)` that receives `text` and returns the average positive sentiment, defined as the sum of the positive probabilities of all its sentences divided by the number of sentences. For example, `avg_sentiment(para)` should return ~`0.3284`.
<br/>
* **Hint**: Remember to use the `NaiveBayesAnalyzer` analyzer.*
</p>
</div>

In [None]:
def avg_sentiment(text):
    ...

para = "Life is good. Life sucks. John hates soda. John hates nasty soda. John likes good soda. John loves soda. John loves sweet soda."
avg_sentiment(para)

Textblob also lets you simply get the sentiment of a whole text, but you'll notice that this and the average calculated from sentence sentiment are not the same.

In [None]:
sent_blob_nb.sentiment

## Readability indices

Readability indices are ways of assessing how easy or complex it is to read a particular text based on the words and sentences it has. They usually output scores that correlate with grade levels.

A couple of indices that are relatively easy to calculate are the Automated Readability Index (ARI) and the Coleman-Liau Index:

$$
ARI = 4.71\frac{chars}{words} + 0.5\frac{words}{sentences} - 21.43
$$
$$
CL = 0.0588\left(100 \cdot \frac{letters}{words}\right) - 0.296\left(100 \cdot \frac{sentences}{words}\right) - 15.8
$$

In [None]:
def coleman_liau_index(blob):
    chars = count_chars(blob.words)
    words = len(blob.words)
    sentences = len(blob.sentences)
    return (0.0588 * letters_per_100(chars, words)) - (0.296 * sentences_per_100(sentences, words)) - 15.8

def letters_per_100(chars, words):
    return (chars / words) * 100
    
def sentences_per_100(sentences, words):
    return (sentences / words) * 100

def count_chars(words):
    return sum(len(w) for w in words)

In [None]:
coleman_liau_index(sent_blob)
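
As a cross-check on the two formulas, the arithmetic can be worked through directly from raw counts. The function names `ari_from_counts` and `cl_from_counts` are assumptions for this sketch, and the counts are tallied by hand for the seven-sentence `para` text (98 characters in 23 words), so they assume a tokenizer that drops punctuation.

```python
def ari_from_counts(chars, words, sentences):
    # ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

def cl_from_counts(letters, words, sentences):
    # Coleman-Liau works on averages per 100 words.
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# Hand counts for `para`: 98 characters, 23 words, 7 sentences.
print(ari_from_counts(98, 23, 7))
print(cl_from_counts(98, 23, 7))
```

Both scores come out close to zero, which is expected for very short sentences made of short words.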

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `auto_readability_index(blob)` that receives a Textblob `blob` and returns the Automated Readability Index (ARI) score as defined above. For example, `auto_readability_index(sent_blob)` should return ~`0.2815`.
<br/>
* **Hint**: Remember to use the `count_chars()` function we defined before.*
</p>
</div>

In [None]:
def auto_readability_index(blob):
    chars = ...
    words = ...
    sentences = ...
    ...

auto_readability_index(sent_blob)

## Corpus
  
We will work with State of the Union speeches by Barack Obama, George W. Bush, and Bill Clinton, each from the last year of their presidency, and with the recent address to Congress by Donald Trump.

In [None]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
bush_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/bush2008.txt"
obama_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/obama2016.txt"
trump_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/trump.txt"
clinton_speech = get_speech(clinton_url)
bush_speech = get_speech(bush_url)
obama_speech = get_speech(obama_url)
trump_speech = get_speech(trump_url)

In [None]:
speeches = {
    "clinton": TextBlob(clinton_speech, analyzer=NaiveBayesAnalyzer()),
    "bush": TextBlob(bush_speech, analyzer=NaiveBayesAnalyzer()),
    "obama": TextBlob(obama_speech, analyzer=NaiveBayesAnalyzer()),
    "trump": TextBlob(trump_speech, analyzer=NaiveBayesAnalyzer()),
}

Let's get some basic data about the speeches.

In [None]:
print("Name", "Chars", "Words", "Unique", "Sentences", sep="\t")
for speaker, speech in speeches.items():
    print(speaker, count_chars(speech.words), len(speech.words), len(set(speech.words)), len(speech.sentences), sep="\t")

We can calculate the average number of words per sentence for each speech.

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `avg_sentence_length(blob)` that receives a Textblob `blob` and returns the average sentence length in words, that is, the total number of words divided by the total number of sentences. For example, `avg_sentence_length(sent_blob)` should return ~`3.2857`.
</p>
</div>

In [None]:
def avg_sentence_length(blob):
    ...

avg_sentence_length(sent_blob)

In [None]:
for speaker, speech in speeches.items():
    # speech = speech.replace("Applause.", "")
    print(speaker, avg_sentence_length(speech))

We can also get the most used words. We are going to filter out some common stopwords first.

In [None]:
stopwords_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/english_stopwords.txt"
stopwords = get_text(stopwords_url).split("\n")
stopwords[:10]

In [None]:
def most_used_words(blob, n):
    word_counts = sorted(blob.word_counts.items(), key=lambda p: p[1], reverse=True)
    return list(filter(lambda p: p[0].lower() not in stopwords, word_counts))[:n]

for speaker, speech in speeches.items():
    print(speaker, most_used_words(speech, 10), "\n")

This sort of exploratory work is often the first step in figuring out how to clean a text for text analysis. 

Let's assess the lexical richness, defined as the ratio of the number of unique words to the total number of words.

In [None]:
def lexical_richness(words):
    return len(set(words)) / len(words)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, lexical_richness(speech.words))

What about sentiment?

In [None]:
for speaker, speech in speeches.items():
    print(speaker, avg_sentiment(speech.string))

Readability scores

For the Automated Readability Index, you can get the appropriate grade level here: https://en.wikipedia.org/wiki/Automated_readability_index

In [None]:
for speaker, speech in speeches.items():
    print(speaker, "ARI:", auto_readability_index(speech), "CL:", coleman_liau_index(speech))

In [None]:
for speaker, speech in speeches.items():
    speech = speech.replace("Applause.", "")
    print(speaker, "ARI:", auto_readability_index(speech), "CL:", coleman_liau_index(speech))

To get some comparison, let's also look at some stats calculated through Textacy. You'll note several different scores here, such as the Flesch-Kincaid Grade Level and Reading Ease, the SMOG Index, and the Gunning-Fog Index. Each of these is a measure of readability, and each involves either the overall number of syllables or the number of polysyllabic words. We also see the ARI and CL scores, which use the same formulas we used. However, you might notice that the scores are different. To understand why, you have to dig into the source code for Textacy, where you'll find that it filters out punctuation in creating the word list, which affects the number of characters. It also lowercases the punctuation-filtered words before creating the set of unique words, decreasing that number as well compared to how we calculated it here. These changes affect both the ARI and CL scores.

In [None]:
{'obama': {'FK_level': 7.076928361323411, 'FK_ease': 73.1515946068819, 'CL': 8.082574134674179, 'GF': 10.361327601233576, 'ARI': 7.258372293175114}, 
 'bush': {'FK_level': 9.015548495431595, 'FK_ease': 63.533602854678094, 'CL': 10.373284975782742, 'GF': 12.31341855540742, 'ARI': 10.105743095660415}, 
 'trump': {'FK_level': 8.74771792073162, 'FK_ease': 65.47855524889772, 'CL': 9.922284358447495, 'GF': 11.973927886256654, 'ARI': 9.750467143001387}, 
 'clinton': {'FK_level': 8.3507883263192, 'FK_ease': 68.20265979605051, 'CL': 9.236949903852384, 'GF': 11.56165222833711, 'ARI': 9.172024279702171}}
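
The effect of those preprocessing choices on the counts can be sketched on a tiny example. The token list and the helper `count_stats` are hypothetical and written only for this illustration; this is not Textacy's code, just the two steps described above (dropping punctuation tokens and lowercasing).

```python
import string

def count_stats(tokens, normalize=False):
    # With normalize=True, drop tokens made up only of punctuation and
    # lowercase the rest before counting.
    if normalize:
        tokens = [t.lower() for t in tokens
                  if not all(ch in string.punctuation for ch in t)]
    chars = sum(len(t) for t in tokens)
    # Return the character count and the number of unique tokens.
    return chars, len(set(tokens))

# Hypothetical output of a tokenizer that keeps punctuation as tokens.
tokens = ["Thank", "you", ".", "Thank", "you", ",", "thank", "you", "!"]
print(count_stats(tokens))
print(count_stats(tokens, normalize=True))
```

Both the character count and the unique-word count shrink after normalization, so any score built on them shifts as well.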

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `stats(url)` that receives a `url` for a plain text version of a book in Project Gutenberg and returns a dictionary with statistics (Automated Readability Index, Coleman-Liau Index, lexical richness, average sentence length in words, average sentiment, number of characters, number of words, number of unique words, number of sentences, and 10 most used words) of the text contained in the URL. For example, `stats("http://www.gutenberg.org/cache/epub/345/pg345.txt")` should return `{'ari': 7.051237118685233,
 'average_sentiment': 0.6216963558545169,
 'characters': 883114,
 'cl': 6.151579188686984,
 'lexical_richness': 15.130625285257873,
 'sentence_length': 19.343680709534368,
 'sentences': 8569,
 'top_words': ['said',
  'could',
  'one',
  'us',
  'must',
  'would',
  'may',
  'shall',
  'see',
  'know'],
 'unique_words': 10955,
 'words': 165756}`.
<br/>
* **Hint**: Remember to use the `get_text()` function. Be careful with what parameters to pass in to each function.*
</p>
</div>

In [None]:
def stats(url):
    text = get_text(url)
    blob = TextBlob(text)
    return {
        "ari": ...,
        "cl": ...,
        "lexical_richness": ...,
        "sentence_length": ...,
        "average_sentiment": ...,
        "characters": ...,
        "words": ...,
        "unique_words": ...,
        "sentences": ...,
        "top_words": ...,
    }

stats("http://www.gutenberg.org/cache/epub/345/pg345.txt")  # Dracula