In [None]:
!python -m spacy download en

# [SpaCy](https://spacy.io/): Industrial-Strength NLP

The traditional NLP library has always been [NLTK](http://www.nltk.org/). While `NLTK` is still very useful for linguistic analysis and exploration, `spacy` has become a nice option for easy and fast implementation of the NLP pipeline. What's the NLP pipeline? It's a number of common steps computational linguists perform to help them (and the computer) better understand textual data. Digital Humanists are often fond of the pipeline because it gives us more things to count! Let's see what `spacy` can give us that we can count.

In [None]:
from datascience import *
import spacy

Let's start out with a short string from our reading and see what happens.

In [None]:
my_string = '''
"What are you going to do with yourself this evening, Alfred?" said Mr. Royal to his companion, as they issued from his counting-house in New Orleans. "Perhaps I ought to apologize for not calling you Mr. King, considering the shortness of our acquaintance; but your father and I were like brothers in our youth, and you resemble him so much, I can hardly realize that you are not he himself, and I still a young man. It used to be a joke with us that we must be cousins, since he was a King and I was of the Royal family. So excuse me if I say to you, as I used to say to him. What are you going to do with yourself, Cousin Alfred?"
'''

We've downloaded the English model, and now we just have to load it. This model will do ***everything*** for us, but we'll only get a little taste today.

In [None]:
nlp = spacy.load('en')

To parse an entire text we just call the model on a string.

In [None]:
parsed_text = nlp(my_string)
parsed_text

That was quick! So what happened? We've talked a lot about tokenizing, either in words or sentences.

What about sentences?

In [None]:
sents_tab = Table()
sents_tab.append_column(label="Sentence", values=[sentence.text for sentence in parsed_text.sents])  # store plain strings, not Span objects
sents_tab.show()

Words?

In [None]:
toks_tab = Table()
toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])  # store plain strings, not Token objects
toks_tab.show()

What about parts of speech?

In [None]:
toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
toks_tab.show()

Lemmata?

In [None]:
toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
toks_tab.show()

What else? Let's just make a function `tablefy` that will make a table of all this information for us:

In [None]:
def tablefy(parsed_text):
    toks_tab = Table()
    toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
    toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
    toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
    toks_tab.append_column(label="Stop Word", values=[word.is_stop for word in parsed_text])
    toks_tab.append_column(label="Punctuation", values=[word.is_punct for word in parsed_text])
    toks_tab.append_column(label="Space", values=[word.is_space for word in parsed_text])
    toks_tab.append_column(label="Number", values=[word.like_num for word in parsed_text])
    toks_tab.append_column(label="OOV", values=[word.is_oov for word in parsed_text])
    toks_tab.append_column(label="Dependency", values=[word.dep_ for word in parsed_text])
    return toks_tab

In [None]:
tablefy(parsed_text).show()

## Challenge

What's the most common adjective? Noun? What if you only include lemmata?

## Dependency Parsing

Let's look at our text again:

In [None]:
parsed_text

Dependency parsing is one of the most useful and interesting NLP tools. A dependency parser will draw a tree of relationships between words. This is how you can find out specifically what adjectives are attributed to a specific person, what verbs are associated with a specific subject, etc.

`spacy` provides an online visualizer named "displaCy" to visualize dependencies. Let's look at the [first sentence](https://demos.explosion.ai/displacy/?text=%22What%20are%20you%20going%20to%20do%20with%20yourself%20this%20evening%2C%20Alfred%3F%22%20said%20Mr.%20Royal%20to%20his%20companion%2C%20as%20they%20issued%20from%20his%20counting-house%20in%20New%20Orleans.&model=en&cpu=1&cph=1).

![alt text](img/dep_parse.png)

We can find subject-verb pairs by looping over the tokens and keeping each one whose dependency label is `nsubj` and whose `head` carries the `VERB` part-of-speech tag:

In [None]:
from spacy.symbols import nsubj, VERB

SV = []
for possible_subject in parsed_text:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        SV.append((possible_subject.text, possible_subject.head))

In [None]:
SV

You can imagine looking over a large corpus this way to analyze first person, second person, and third person characterizations. Dependency parsers are also important for understanding and processing natural language, for example in a question answering system. These models help the computer understand *what* question is being asked.
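
As a rough sketch of that kind of analysis, the pairs collected in `SV` could be grouped by the grammatical person of the subject pronoun. The `example_pairs` list, the `PERSON` mapping, and the `verbs_by_person` helper below are all made up for illustration; the pairs are written by hand and are not real parser output.

```python
from collections import defaultdict

# Hypothetical (subject, verb) pairs, shaped like the entries of SV above
# but written by hand for illustration; they are not real parser output.
example_pairs = [("I", "say"), ("you", "resemble"), ("they", "issued"),
                 ("we", "were"), ("You", "going"), ("he", "was")]

# Map lowercase subject pronouns to grammatical person
PERSON = {"i": "first", "we": "first",
          "you": "second",
          "he": "third", "she": "third", "it": "third", "they": "third"}

def verbs_by_person(pairs):
    """Group verbs under the grammatical person of their pronoun subject."""
    grouped = defaultdict(list)
    for subject, verb in pairs:
        person = PERSON.get(str(subject).lower())
        if person is not None:  # skip subjects that are not pronouns in PERSON
            grouped[person].append(str(verb))
    return dict(grouped)

by_person = verbs_by_person(example_pairs)
print(by_person)
```

Because the helper calls `str()` on each verb, it should also accept the `spacy` tokens stored in `SV`, so `verbs_by_person(SV)` is a reasonable next thing to try.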

## Limitations

How accurate are the models? What happens if we change the style of English we're working with?

In [None]:
shakespeare = '''
Tush! Never tell me; I take it much unkindly
That thou, Iago, who hast had my purse
As if the strings were thine, shouldst know of this.
'''

shake_parsed = nlp(shakespeare.strip())
tablefy(shake_parsed).show()

In [None]:
huck_finn_jim = '''
“Who dah?” “Say, who is you?  Whar is you?  Dog my cats ef I didn’ hear sumf’n.
Well, I know what I’s gwyne to do:  I’s gwyne to set down here and listen tell I hears it agin.”
'''

hf_parsed = nlp(huck_finn_jim.strip())
tablefy(hf_parsed).show()

In [None]:
text_speech = '''
LOL where r u rn? omg that's sooo funnnnnny. c u in a sec.
'''

ts_parsed = nlp(text_speech.strip())
tablefy(ts_parsed).show()

In [None]:
old_english = '''
þæt wearð underne      eorðbuendum, 
þæt meotod hæfde      miht and strengðo 
ða he gefestnade      foldan sceatas. 
'''
oe_parsed = nlp(old_english.strip())
tablefy(oe_parsed).show()
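
One way to put a number on "how accurate" is to label a handful of tokens by hand and compare them with the tags the model assigned. This is a minimal sketch of that comparison; `tag_accuracy` is a hypothetical helper, and both tag lists are invented for illustration rather than taken from any model run.

```python
def tag_accuracy(predicted, gold):
    """Share of positions where the predicted tag equals the hand-assigned tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag lists must be the same length")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Both lists are invented for illustration; neither is real model output.
predicted_tags = ["NOUN", "VERB", "PRON", "ADJ"]
gold_tags = ["INTJ", "VERB", "PRON", "NOUN"]
print(tag_accuracy(predicted_tags, gold_tags))
```

To try it on the samples above, the predicted list could come from `[word.pos_ for word in shake_parsed]`, with the gold list typed in by hand.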

## NER and Civil War-Era Novels

Wilkens uses a technique called "NER", or "Named Entity Recognition" to let the computer identify all of the geographic place names. Wilkens writes:

> Text strings representing named locations in the corpus were identified using
the named entity recognizer of the Stanford CoreNLP package with supplied training
data. To reduce errors and to narrow the results for human review, only those
named-location strings that occurred at least five times in the corpus and were used
by at least two different authors were accepted. The remaining unique strings were
reviewed by hand against their context in each source volume. [883]

While we don't have the time for a human review right now, `spacy` does allow us to annotate place names (among other things!):

In [None]:
ner_tab = Table()
ner_tab.append_column(label="NER Label", values=[ent.label_ for ent in parsed_text.ents])
ner_tab.append_column(label="NER Text", values=[ent.text for ent in parsed_text.ents])
ner_tab.show()

Cool! It's identified a few types of things for us. We can check what these mean [here](https://spacy.io/docs/usage/entity-recognition#entity-types). `GPE` covers countries, cities, and states. That seems to be the kind of entity Wilkens was after.

Since we don't have his corpus of 1000 novels, let's just take our reading, *A Romance of the Republic*, as an example. We can use the `requests` library to fetch the plain-text file from Project Gutenberg, and its `.text` property gives us the contents as one string.

In [None]:
import requests

text = requests.get("http://www.gutenberg.org/files/10549/10549.txt").text
print(text[:500])

We'll leave the header in for now, since it shouldn't affect the results much. Now we need to parse this with that `nlp` function:

In [None]:
parsed = nlp(text)
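
If the header does turn out to matter, it can be cut off before parsing. The sketch below assumes the file carries the `*** START OF ... PROJECT GUTENBERG EBOOK` and `*** END OF ...` marker lines that Project Gutenberg plain-text files usually include. The exact wording varies between files, so treat the patterns, and the hypothetical helper `strip_gutenberg`, as a starting point to check against your own text.

```python
import re

# Marker lines Project Gutenberg plain-text files usually carry; the wording
# varies between files, so these patterns are an assumption to verify.
START = re.compile(r"\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*")
END = re.compile(r"\*\*\* ?END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*")

def strip_gutenberg(raw):
    """Return the text between the markers; a missing marker falls back to that end of raw."""
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()

# Synthetic sample, not a real Gutenberg file
sample = ("Header and license notes\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK A SAMPLE ***\n"
          "The actual novel text.\n"
          "*** END OF THIS PROJECT GUTENBERG EBOOK A SAMPLE ***\n"
          "More license text\n")
body = strip_gutenberg(sample)
print(body)
```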

Now we loop through each entity, and if it is labeled as `GPE` we'll add it to our `places` list. We'll then make a `Counter` object out of that to get the frequency of each place name.

In [None]:
from collections import Counter

places = []

for ent in parsed.ents:
    if ent.label_ == "GPE":
        places.append(ent.text.strip())

places = Counter(places)
places

That looks OK, but it's pretty rough! Keep this in mind when using trained models: they aren't 100% accurate. That's why Wilkens reviewed the results by hand afterwards.
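
Before that hand review, Wilkens narrowed the candidates to strings that occurred at least five times in the corpus and were used by at least two different authors. That filter might be sketched as follows; `filter_places` is a hypothetical helper, and the mentions are made-up data, not drawn from any real corpus.

```python
from collections import Counter, defaultdict

def filter_places(mentions, min_count=5, min_authors=2):
    """Keep place strings seen at least min_count times and used by at least min_authors authors.

    mentions is an iterable of (author, place) pairs, one pair per occurrence.
    """
    counts = Counter()
    authors = defaultdict(set)
    for author, place in mentions:
        counts[place] += 1
        authors[place].add(author)
    return {place: n for place, n in counts.items()
            if n >= min_count and len(authors[place]) >= min_authors}

# Made-up mentions for illustration only; not drawn from any real corpus
toy_mentions = ([("Author A", "New Orleans")] * 4 + [("Author B", "New Orleans")] * 2
                + [("Author A", "Boston")] * 6  # frequent, but only one author
                + [("Author A", "Nassau"), ("Author B", "Nassau")])  # two authors, too rare
kept = filter_places(toy_mentions)
print(kept)
```

With a single novel there is only one author, so the two-author condition can't be met; the filter becomes useful once several novels are combined.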

If you thought NER was cool, wait for this. Now that we have a list of "places", we can send that to an online database to get back latitude and longitude coordinates, along with the US state. But to find the state we need a text file of all states. So let's load that:

In [None]:
with open('data/us_states.txt', 'r') as f:
    states = f.read().split('\n')
    states = [x.strip() for x in states]

states

OK, now we're ready. The `Nominatim` geocoder from the `geopy` library returns a location object that has the properties we want. We'll append a new row to our table for each mention. Importantly, we loop over the `keys` of the `places` counter so that we don't ask the database for "New Orleans" 10 times to get the same location. Once we have the information, we just add as many rows as the counter says there are mentions.

In [None]:
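from geopy.geocoders import Nominatim
from datascience import *
import time

# Recent geopy releases require a user_agent; this name is only a placeholder, so pick your own
geolocator = Nominatim(user_agent="civil-war-novels-notebook")

geo_tab = Table(["latitude", "longitude", "name", "state"])

for name in places.keys():
    print("Getting information for " + name + "...")

    try:
        # find the lat and lon of this place name
        location = geolocator.geocode(name)
        lat = float(location.raw["lat"])
        lon = float(location.raw["lon"])

        # reset for every place so a state never carries over from the previous one
        state = None
        for p in location.address.split(","):
            if p.strip() in states:
                state = p.strip()
                break

        # add one row per mention; places that match no US state are skipped
        if state is not None:
            for i in range(places[name]):
                geo_tab.append(Table.from_records([{"name": name,
                                                    "latitude": lat,
                                                    "longitude": lon,
                                                    "state": state}]).row(0))
    except Exception:
        # geocode() returns None for unknown names and requests can fail; skip those
        pass

    # pause between requests so we don't overload the service
    time.sleep(1)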

In [None]:
geo_tab.show()
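
The state lookup inside the loop above can be pulled out into a small helper, which makes it easy to check on its own. This is a sketch under the assumption that the address comes back as one comma-separated string; `find_state`, `toy_states`, and the sample addresses are hypothetical and written by hand.

```python
def find_state(address, state_names):
    """Return the first comma-separated part of address that is a known state, else None."""
    for part in address.split(","):
        part = part.strip()
        if part in state_names:
            return part
    return None

# Hypothetical inputs: a short state list and hand-written address strings in the
# comma-separated style that `location.address` is assumed to use.
toy_states = ["Louisiana", "Massachusetts", "New York"]
print(find_state("New Orleans, Orleans Parish, Louisiana, United States", toy_states))
print(find_state("Paris, Ile-de-France, France", toy_states))
```

Returning `None` when nothing matches makes places outside the US easy to spot and skip.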

Now we can plot a nice [choropleth](https://en.wikipedia.org/wiki/Choropleth_map).

In [None]:
%matplotlib inline

from scripts.choropleth import us_choropleth
us_choropleth(geo_tab)

---

# Homework:

Find the text of three different Civil War-Era (1851-1875) novels on [Project Gutenberg](https://www.gutenberg.org/) (maybe mentioned in our reading?!). Make sure you click through to the `.txt` files, and use a `GET` request from the `requests` library to get the text. Then combine the NER location frequencies and plot a choropleth. Look closely at the words plotted. How did the NER model do? How does your choropleth look compared to Wilkens'?