<br>
<br>
<br>
<center><h2>Preparing Text as Data</h2><h4>by Jon Atwell, PhD</h4></center>
<br>
<div style="width:70%;margin-left:12%;font-size:16px">
<br>
<br>

<h4>Getting started</h4><br>

To begin, we're going to make sure we all have the right packages available. This will probably involve downloading a few and could take several minutes. By running the code below, your computer will download and install the packages. <br><br>We start here so that it can run while we're going over the introduction. Once it has finished, you won't ever need to run it again. To run the code, click into the cell and either press `shift + enter/return` or press the `>|Run` button above.
</div>

In [None]:
%%capture
import sys, re
from collections import defaultdict
import collections

!{sys.executable} -m pip install nltk
import nltk
nltk.download('genesis')
!{sys.executable} -m pip install chardet
!{sys.executable} -m pip install langdetect
!{sys.executable} -m pip install bs4

import chardet, langdetect
import matplotlib.pyplot as plt
import seaborn as sb
from bs4 import BeautifulSoup

In [None]:
!{sys.executable} -m pip install spacy
!{sys.executable} -m pip install https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_md-3.0.0/en_coref_md-3.0.0.tar.gz
import spacy

<div style="width:70%;margin-left:12%;font-size:16px">
<br>
<br>
    <center><h2>The Text Prep Pipeline</h2></center>
<br>
This workshop introduces the steps of a prototypical text cleaning pipeline. The goal is to help you get from the raw text a researcher has already collected to a position where you can analyze that text using quantitative methods. We'll be using the `spaCy` package for much of the work. `spaCy` is a pretty incredible tool that is quickly replacing the `Natural Language Toolkit` package (aka `nltk`), which has been the workhorse of natural language processing in Python for ages. `spaCy` has replaced `nltk`'s more rules-based approach with deep learning approaches. The results are very impressive, and you won't need to compare it to `nltk` to appreciate how powerful it is. We'll use it in stages 2 and 3.
<br>
<br>
<h3> Stage 1</h3>

In the first stage, we focus on making sure we have the type of text we believe we have. This includes encoding checks, foreign-language detection, and non-language entity checks. The pipeline should start here because if we find ourselves weeding a lot of documents out at this stage, we might need to rethink whether the corpus is in fact amenable to the research goals.

<div style="margin-left:1em"><h4>Stage 1A: Encoding-check</h4>
    Confirming the text is in an encoding that will permit our operating system to work with and display everything. Often this step is not at all a problem, but we should take a look because there is a chance we'll be losing some information as we move and analyze the data. 
</div> 

<div style="margin-left:1em"><h4>Stage 1B: Foreign language detection</h4>
    It is not uncommon for a collection of texts to contain documents in more than one language. This may or may not be a problem for the project, but either way we need to know the language of each document: the algorithms of quantitative approaches will run on mixed-language corpora without complaint, and we might not realize the results are rubbish for a long time.
</div> 

<div style="margin-left:1em"><h4>Stage 1C: Non-text entity detection</h4>
    Text produced in the information age can include strings like URLs, metadata, or markup that in most cases shouldn't be considered a part of the semantically rich language of the domain. These elements need to be detected and handled according to a well-structured plan. 
</div> 

<br>

### Stage 2 ###
After the initial cleaning of stage 1, we start with the core task of breaking down large chunks of text into the units we actually want to analyze. This stage includes the three steps of tokenization, coreference resolution and lemmatization, all of which `spaCy` can do pretty seamlessly.

<div style="margin-left:1em"><h4> Stage 2A: Tokenization</h4>
Tokenization means identifying the word units that make up a text. The whitespaces in the text that allow human readers to identify basic units like words, sentences and paragraphs are just more characters in one long string to a computer. Tokenization fixes that by identifying and removing whitespace and punctuation to create tokens, i.e. individual words.
</div>

<div style="margin-left:1em"><h4>Stage 2B: Coreference resolution</h4>
*Coreference resolution* replaces pronouns with the noun they are actually a reference to. Human readers are so good at substituting the correct noun back in that we often forget how much useful information those words contain. If we leave pronouns in place of the referred-to nouns, however, we'll throw out all that valuable information when we later discard the most common words, such as he, she, it, they, a, the, etc.</div>

<div style="margin-left:1em"><h4> Stage 2C: Lemmatization</h4>
Once we have a collection of *resolved* words, we want to find the *lemmas*, the stems of words that allow us to link variants like *computer*, *computers* and *computing* to the same basic concept. There might be times when you have a good argument not to do this--maybe you're doing a sentiment or stylistic analysis in which the author's exact choice of words matters a great deal--but if you're doing probabilistic modeling, skipping this step forgoes valuable information because each variant of the lemma will be treated as a unique word.
<br>
One could apply the same logic to argue that synonyms should be treated as the same word. I think most would agree with this argument in principle, but in practice language is much too flexible for us to be able to do this systematically. Thankfully, if words are being used as synonyms across the corpus, approaches like LDA and vectorization are going to uncover that and tend to group them together. If and only if those groupings exist, you could consider creating a custom thesaurus that maps words used synonymously to a single word. Doing that is outside of the scope of this workshop, but get in touch with me if you'd like to talk about doing something like that.
</div>

<br>

### Stage 3 ###

By the end of the second stage, we'll have a large and clean set of tokens, but that full set is likely not going to be the most informative set for quantitative approaches. For one, the most common words are semantically impoverished, either because they are supremely flexible, e.g. the variants of the verb *to be*, or because they are largely functional, e.g. *a*, *the*, and *to*. It is common practice to remove words deemed unimportant for the analytical task. Below we'll cover three common approaches to doing this: stoplisting, part-of-speech tagging and word categorization. In most cases, a project should implement a single one. 

<div style="margin-left:1em"><h4>Stage 3, option A: Stoplisting</h4>
    Stoplisting is the practice of identifying a list of words to be removed from all documents. Several different lists of the most common words exist, but sometimes the research setting will have its own highly common (and therefore analytically unhelpful) words. For example, a corpus of restaurant reviews will have the word *food* in it at a much higher frequency than other sources. The word is likely to be uninformative in this context and should probably be stoplisted out.
</div>
<div style="margin-left:1em"><h4>Stage 3, option B: Parts of speech tagging</h4> Another approach is to winnow the document down to the set of words that covers specific parts of speech. Nouns and verbs, or even just nouns, are likely to be the most informative classes of words for approaches like topic modeling. Adjectives and adverbs are more important for things like sentiment analysis. Some researchers choose to pare the corpus down to a certain class of words, depending on their research goals.
</div>    
<div style="margin-left:1em"><h4>Stage 3, option C: Word category labeling</h4>
Another approach is to categorize words in ways unrelated to parts of speech. Psychometric approaches, like the one implemented in the Linguistic Inquiry and Word Count (LIWC) software, categorize words according to process types like social (e.g. family, friend), cognitive (e.g. think, effect), biological (e.g. hands, eat), etc. Such labels can be used for more targeted approaches. Doing this is unfortunately outside of the scope of this workshop because it requires access to proprietary software.
</div>
<br>
<br>
<br>
<br>

</div>
___

<div style="width:70%;font-size:16px">
<br>

<h3> Stage 1a: Encoding-check </h3>
<br>
From the perspective of a computer, text is a long string of characters from a pre-identified set. When early designers created a system for mapping character *code-points*--their representation in machine code--to their *glyphs*--the figures we actually visually process--they focused on the characters of the *Latin* script used in the writing of the English language and created the *ASCII encoding* scheme, which could accommodate only 128 different characters. This number is determined by the number of bits used to store each character code (7 for ASCII), so a smaller set required less memory/disk space. Computationally this was thrifty, but the design was at best shortsighted and at worst imperialistic. The number of characters needed to represent the languages of the world numbers in the hundreds of thousands, so as computers became general purpose tools across the world, new encoding schemes were needed. The second standard turned out to not be enough and a third was developed. Now that we're using emoji all the time, even that encoding might not be enough long term.
<br>
<br>
If there were no cost to having more characters, we would all use the same 32-bit encoding scheme. But the fact of the matter is that doing so would eat up disk space, so more efficient encodings are often used. That means there is a possibility that you'll get encoding errors or lose information as you move text around and analyze it.
When working primarily in English it shouldn't happen often, but we'll start here because it introduces important concepts, and unsupported characters can be a pain later if you haven't caught them.
<br>
<br>
<br>
</div>
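
A quick way to see the space trade-off described above is to encode the same short string several ways and count the bytes. This is only a sketch: the sample string and the `encoded_sizes` helper are made up for illustration and aren't part of the workshop materials.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return the number of bytes `text` occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}

# Hypothetical sample: "naive cafe" with accents plus the raised-eyebrow emoji,
# written with escapes so the code points are unambiguous.
sample = "na\u00efve caf\u00e9 \U0001f928"

for enc, size in encoded_sizes(sample).items():
    print("{:10s} {} bytes".format(enc, size))

# ASCII has no code for the accented letters or the emoji, so encoding fails.
try:
    sample.encode("ascii")
except UnicodeEncodeError as err:
    print("ascii      cannot encode:", err.reason)
```

The same 12 characters take 17 bytes in UTF-8, 26 in UTF-16 and 48 in UTF-32, which is why the fixed-width 32-bit scheme is rarely used for storage.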
<div style="font-size:16px">
Consider the "raised eyebrow" emoji: 🤨
<br>
<br>
It is not uncommon for it to show up in text found on the internet, but see what happens when we handle it with Python.
<br>
<br>
</div>




In [None]:
print("🤨")

<div style="font-size:16px">Great, no problem. You can also print it by using its 32-bit Unicode escape sequence, as on the following line.</div>

In [None]:
print('\U0001f928')

<div style="font-size:16px">But look what happens in the following code:

In [None]:
string_one = 'What kind of funny business is this? \U0001f928'
string_two = 'What kind of funny business is this? \U0001f928'.encode("utf8")
print(string_one)
print(string_two)

<div style="font-size:16px">
We've lost the visual presentation of the emoji when we encoded it in UTF-8 (the Unicode encoding built from 8-bit units). Encoding turns the string into a `bytes` object, and Python prints any byte it can't show as ASCII as an escape sequence; here the emoji has become the four bytes `\xf0\x9f\xa4\xa8`. The information is all still there, but it will stay in that form until we do something about it, as you can see when we try to print it below:
<br>
<br>
</div>

In [None]:
print(b'\xf0\x9f\xa4\xa8')

<div style="font-size:16px">So we didn't get any errors, but it didn't print the actual emoji. We have to use the `decode` method to get it back.</div>

In [None]:
print(b'\xf0\x9f\xa4\xa8'.decode("utf8"))

<div style="font-size:16px">
Now you should be beginning to see how problems might arise in moving text around. If somewhere along the line the text was pushed into an encoding that doesn't support the *glyph*, you'll probably want to know that. To be more concrete, Twitter's API uses the default encoding of UTF-8, which stores each emoji as a sequence of several bytes rather than as a single visible character. Here is the text of a recent LeBron James tweet.
<br><br>
🗣🗣 #MoreThanAnAthlete 🙏🏾
<br><br>
The following line encodes it in UTF-8. The printed line is what Twitter would send through the API.
<br>
</div>

In [None]:
lebron_tweet = '🗣🗣 #MoreThanAnAthlete 🙏🏾'.encode('utf8')

print(lebron_tweet)
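
To make the risk of silently losing characters concrete, here is a small sketch that lists the characters a target encoding cannot represent and shows what a lossy conversion leaves behind. The helper name `unencodable_chars` is hypothetical, and the tweet text is rebuilt from escapes, so it may differ slightly from the original if that contained extra invisible code points.

```python
def unencodable_chars(text, encoding):
    """Return the characters in `text` that `encoding` has no code for."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

# Two "speaking head" emoji, the hashtag, then "folded hands" plus a skin-tone modifier
tweet = "\U0001f5e3\U0001f5e3 #MoreThanAnAthlete \U0001f64f\U0001f3fe"

print(unencodable_chars(tweet, "ascii"))        # the four emoji code points
print(unencodable_chars(tweet, "utf-8"))        # empty: UTF-8 covers all of them
print(tweet.encode("ascii", errors="ignore"))   # the emoji are dropped without any error
```

The last line is the dangerous one: with `errors="ignore"` nothing fails, and the emoji are simply gone.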

<div style="font-size:16px">
If you aren't paying attention to the encoding, you might do something like accidentally throw this out. Alternatively, you might want to throw it out because it's not clear how we should interpret emoji in quantitative approaches. Either way you should know if the text has characters that are outside of the current encoding! Thankfully someone created a package, `chardet`, to help us do that. We imported it at the top of the notebook and use it in the following cells.
<br><br>
</div>

In [None]:
#
another_string = "What kind of funny business is this? 🤨"
print(another_string)


In [None]:
encoded_one = another_string.encode()
print(encoded_one)
print(chardet.detect(encoded_one))

In [None]:
encoded_two = another_string.encode("utf-32")
print(encoded_two)
print(chardet.detect(encoded_two))
print("\n")
print(encoded_two.decode("utf-32"))

<div style="font-size:16px">
As you can see, the package isn't sure what to make of the emoji because it is in fact an escape sequence in UTF-8. It seems to think the language is Turkish. When we force the string to be encoded in UTF-32, `chardet` gets it right. (The byte string also gets huge...see why we don't want to pass everything around in UTF-32!) We can get back the string by decoding it.
    
<br><br><br>
In general you could use `chardet` to detect the encoding, but there are lots of use cases where you can just ask!
Below, you can see `chardet` get this Chinese language site encoding right (UTF-8), but we can also just look at the header on the site, as `raw.info()` does. In general, you should first see if you can establish the encoding through formal means before moving to `chardet`.
</div>

In [None]:
from urllib.request import urlopen
raw = urlopen("http://weixin.qq.com/")
html = raw.read()

print(chardet.detect(html))
print("\n")
print(raw.info())
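
If you only want the declared encoding rather than the whole header dump, the standard library can parse it out of a `Content-Type` value. The sketch below works on a bare header string so that it runs without a network connection; `charset_from_content_type` is a made-up helper name. The response returned by `urlopen` stores its headers in an object from the same family, so `raw.info().get_content_charset()` should give the same kind of answer, but treat that as something to verify on your own data.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # None when no charset is declared

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header declares no charset, that is the point at which falling back to `chardet` makes sense.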

<div style="font-size:16px">
One final thing before moving on to the next topic. By default, Python opens text files using the platform's preferred encoding, which is UTF-8 on most systems. If the file isn't in ASCII or UTF-8, you'll typically get an error. (This error might be familiar to those of you who attended my APIs workshop 😞)<br><br>
Below we save a file in UTF-16 and try to open it up again. We can't because the UTF-16 bytes on disk aren't valid UTF-8.
</div>

In [None]:
with open("test.txt", "w", encoding="utf16") as f:
    f.write('🗣🗣 #MoreThanAnAthlete 🙏🏾')
    
with open("test.txt","r") as f:
    print(f.readline())

<div style="font-size:16px">
You can fix this by specifying the encoding, as we do in the next cell.
</div>

In [None]:
with open("test.txt","r",encoding="utf-16") as f:
    print(f.readline())
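
When you don't know ahead of time which encoding a file uses, one common pattern is to try a short list of candidates in order. The following is a minimal sketch of that idea; the helper name and the candidate list are assumptions, and a successful decode is not proof that the guess was right, since some byte sequences are valid in more than one encoding.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "utf-16")):
    """Try each encoding in turn; return (text, encoding) for the first that works."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("none of the candidate encodings could decode the bytes")

text = "\U0001f5e3 #MoreThanAnAthlete"
print(decode_with_fallback(text.encode("utf-16")))  # falls through to utf-16
print(decode_with_fallback(text.encode("utf-8")))   # succeeds on the first try
```

For a file, you would read it in binary mode (`open(path, "rb")`) and pass the bytes to the helper.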

<br><br><br><br>

<div style="width:70%;font-size:16px">


### Stage 1B: Foreign language detection ###

Most human languages can be encoded in UTF-8 (with occasional escape sequences for special characters), but we should know what language the text is actually written in. It is possible to throw a bunch of words written in different languages at the same LDA model and not know any better. The model treats *cat* and *perro* as equivalent types of objects in spite of them referring to different things in different languages.<br><br>

From a research perspective it might be fine to have multiple languages in the same source, but we need to handle each individually or throw off-target language documents out (and record how many were thrown out). To do this we'll use Google's `langdetect` package. Given the state of Google's machine translation and my own experience with it, I think we can put a lot of credence in the package's labeling.<br>


Because this tutorial is in English we'll treat English as our target language, but to the best of my knowledge there is good support for other common languages in Python.

<br>
<br>
Let's import `langdetect` and try it out.


In [None]:

five_sentences = ["Colorless green ideas sleep furiously",                      # English
                  "اجتنب مصاحبة الكذاب فإن اضطررت إليه فلا تُصَدِّقْهُ",               # Arabic
                  "Hab' keine Angst, mein Kleiner, es sind nur deine Kleider",  # German
                  "读万卷书不如行万里路",                                          # Chinese
                  "Из огня́ да в полымя́"]                                        # Russian 

for sentence in five_sentences:
    print(langdetect.detect(sentence))

<div style="font-size:16px">
`langdetect` returns the [_ISO 639_ code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the language. As you can see, it got every one right! 
    
<br><br><br>
In the next cell we open the `NLTK` *genesis* corpus, which contains translations of the Book of Genesis. We iterate over the documents (you can ignore the details of getting the documents out) to sort them by language. Again, English is the target language, but we should keep track of what else is in the corpus because if there are lots of off-target documents, you might not be measuring what you think you're measuring (e.g. public discourse).
</div>

In [None]:
english = []
not_english = defaultdict(list)

for document_id in nltk.corpus.genesis.fileids():
    document = nltk.corpus.genesis.raw(document_id) # getting an individual document
    language = langdetect.detect(document)          # detecting the language 
    if language == 'en':
        english.append(document)
    else:
        not_english[language].append(document)      # If the doc isn't in English, we save it in a dictionary with
                                                    # the key for the language. 
        

In [None]:
print(len(english))
print(len(not_english))
print(not_english.keys())
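
Since the advice above is to record how many documents were off-target, a small summary function can make that bookkeeping explicit. This is a sketch that works on any list of language codes such as the ones `langdetect.detect` returns; the function name and the example labels are invented for illustration.

```python
from collections import Counter

def language_report(labels, target="en"):
    """Summarize how many documents are in the target language and how many are not."""
    counts = Counter(labels)
    total = sum(counts.values())
    off_target = total - counts[target]
    return {
        "total": total,
        "off_target": off_target,
        "off_target_share": off_target / total if total else 0.0,
        "by_language": dict(counts),
    }

# Hypothetical labels for a four-document corpus
report = language_report(["en", "en", "fr", "de"])
print(report)
```

Writing this report to a log alongside the cleaned corpus gives you a number to cite when describing the data.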

<div style="font-size:16px">
If we pretend for a moment that we didn't know what this corpus was, we should be concerned about the extremely high percentage of documents not in the target language. For exactly this reason, you should do language detection very early in the cleaning pipeline!
</div>
<br>
<br>
<br>

<div style="width:70%;font-size:16px">
    <h3> Stage 1C: Non-language entity detection</h3>
<br><br>
The final thing to do before moving to the core tasks of the pipeline is to remove non-language entities. This includes URLs, any metadata, or any markup (i.e. HTML or XML tags). Even if URLs are substantively interesting for a project, you'll want to identify and remove them before doing any further parsing of the text.
<br><br>
To the best of my knowledge, there is no systematic way of doing this. In the past, I've relied on a combination of *regular expressions* and  *HTML/XML parsing*, but you should really familiarize yourself with the data to make sure that will be sufficient. If, for example, there was any copying and pasting involved in the collection of the data (please don't do this if at all avoidable!), there might be all kinds of nasty stuff in there. If you find yourself in such a position, be careful and seek help!

<br><br><br>
The first class of non-language entities we'll consider is URLs. If you're dealing with any sort of comments or blogs/microblogs, there is a very good chance the users will include some right alongside the text. If the data are scraped HTML pages, there is a good chance the users will have used hyperlink markup to embed them--we'll get to that case in a minute--but that is not always the case. People often copy and paste URLs into the text directly. To get those out of the way, we'll use *regular expressions*.<br><br>

Regular expressions are a notation for describing patterns to find in text strings. It is a topic worthy of its own workshop, so we'll just get an idea of how to use them. Below is a fake document we want to analyse quantitatively, so we need to remove the URL in it.<br>
</div>

In [None]:
dm = """
<directmessage>
    <title>Share a bottle?</title>
        <from>LeBron</from>
        <to>Jon</to>
            <body>
                J, wondering if you want share a bottle of red
                next time you're in LA.
                
                I've been wanting to try this garnacha:
                https://winelibrary.com/wines/grenache-garnacha/2014-clos-erasmus-priorat-98013
                <link xlink:type="simple" xlink:href="https://winelibrary.com/wines/grenache-garnacha/2014-clos-erasmus-priorat-98013">
                
                Hit me up when you have a minute.
                
                -bron
            </body>
</directmessage>
      """

<div style="width:70%;font-size:16px">
In the next cell we use Python's regular expressions package `re` (imported at the top of the notebook) and its `findall` function to look for strings with the structure of URLs. The *regex* pattern for that is `"([\S]+\.\D{2,3}/\S*|[\S]+\.\D{2,3}\b)"`. The key thing to take away from that garble is `\.\D{2,3}`, which says find a substring that starts with a period and is followed by two or three non-digit characters (in practice usually letters, e.g. .com, .org, .ly, .io). If we find that, we grab everything in front and in back of it until we see whitespace.<br><br>
Please note that this is a pretty robust URL finding *regex* pattern, but it is not foolproof. For example, it doesn't match URLs with UTF-8 characters in the name. Use it with caution.
    </div>

In [None]:
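# The pattern is a raw string (r"...") so that \b means "word boundary" rather than a backspace character.
matches = re.findall(r"([\S]+\.\D{2,3}/\S*|[\S]+\.\D{2,3}\b)", dm)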
print(matches)
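
To get a feel for what the pattern does and does not catch, it helps to run it on a few short strings. The sentences below are invented for illustration. The pattern is the same one used above, written as a raw string so that `\b` is read as a word boundary.

```python
import re

URL_PATTERN = re.compile(r"([\S]+\.\D{2,3}/\S*|[\S]+\.\D{2,3}\b)")

samples = [
    "see example.com for details",                 # bare domain
    "docs at https://example.org/a/b?x=1 today",   # full URL with a path
    "version 3.14 is out",                         # a period followed by digits
    "The end. It is over",                         # an ordinary sentence boundary
]

for sample in samples:
    print(repr(sample), "->", URL_PATTERN.findall(sample))
```

The first two strings yield `example.com` and the full `https://` URL, and the version number is correctly skipped. The last string is a false positive: because `\D` accepts any non-digit, including a space, the pattern returns `end. It`. This is one reason to inspect the matches before deleting anything.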

<div style="width:70%;font-size:16px">
   Assuming we find a URL in the text, we probably want to remove it and log either the actual URL or the fact that we found it. The code below does that.</div>

In [None]:
document_urls = []
count = 0

for match in matches:
    count += 1
    document_urls.append(match)
    dm = dm.replace(match, "") # we replace the URL with an empty string "", meaning nothing. It's just gone.
    
print("Found {} URLs: {}".format(count,document_urls))
print(dm)
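
The find-then-replace loop above can also be written as a single pass with `re.sub` and a callback that logs each match as it is removed. The sketch below uses a deliberately simpler pattern, `https?://\S+`, which is an assumption made for readability and only catches URLs that spell out the scheme; `strip_urls` is a hypothetical helper name.

```python
import re

SIMPLE_URL = re.compile(r"https?://\S+")  # simplified pattern, not the one used above

def strip_urls(text):
    """Remove URLs from `text` and return (cleaned_text, list_of_removed_urls)."""
    found = []

    def log_and_drop(match):
        found.append(match.group(0))  # remember what we removed
        return ""                     # replace the URL with nothing

    cleaned = SIMPLE_URL.sub(log_and_drop, text)
    return cleaned, found

cleaned, urls = strip_urls("try https://example.com/a now and http://example.org")
print(cleaned)
print(urls)
```

Because each match is removed where it was found, this avoids the side effect of `str.replace`, which deletes every copy of the matched string anywhere in the document.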

<div style="width:70%;font-size:16px">
    
You might be annoyed by the fact that `<link xlink:type="simple" xlink:href="">` is still in there in spite of the URL now being gone. That tag and all the other tags (e.g. directmessage, body, etc.) stem from the fact that the document is in the XML format. XML is like HTML but with an eye more towards moving data around instead of visualizing it in browsers. If the documents are in XML or HTML, we can just use a parser to get the raw text out. We'll use the most popular HTML parser for Python, `BeautifulSoup`. Below is an HTML document for comparison.</div>
<br><br>

In [None]:
html_doc = """
<html>
    <title>Preparing Text as Data workshop</title>
    <body>
    For the workshop we'll be using a Jupyter notebook.
    You can download the notebook <a href="github.com/atwel/preparing_text.ipynb">here</a>
    </body>
</html>
"""

<div style="width:70%;font-size:16px">
    Run the code in the next three cells to see what happens with the parsing.</div>

In [None]:
parsed_xml = BeautifulSoup(dm, "lxml")
print(parsed_xml.text) # the field text returns the raw text of the document

In [None]:
parsed_html = BeautifulSoup(html_doc,"html.parser")
print(parsed_html.text)

In [None]:
parsed_guess1 = BeautifulSoup(dm)
print(parsed_guess1.text)

parsed_guess2 = BeautifulSoup(html_doc)
print(parsed_guess2.text)

<br><div style="width:70%;font-size:16px">
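The first thing to notice is that in the first cell we asked for the `lxml` parser, in the second for Python's built-in HTML parser (`html.parser`), and in the third we didn't specify a parser at all. (Strictly speaking, in BeautifulSoup the name `lxml` selects that library's HTML parser; its XML parser is requested with `xml`.) In the third cell BeautifulSoup picked a parser on its own, warned us that it had done so, and parsed the text. That's pretty spiffy and useful!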
<br><br>
The second thing to notice is that the parsers pulled out the tagged URLs. That's good from a cleaning perspective, but we didn't get to count or store them. Thankfully, BeautifulSoup can find those easily. The code in the next cell does that.</div><br>

In [None]:

# HTML link search
html_urls = []
for url in parsed_guess2.find_all("a"):
    html_urls.append(url['href'])
print(html_urls)


# XML link search
xml_urls = []
for url in parsed_guess1.find_all("link"):
    xml_urls.append(url["xlink:href"])
print(xml_urls)
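
If you are curious what a parser is doing under the hood, or can't install `BeautifulSoup`, Python's standard library ships a bare-bones event-based parser that can do a rough version of the same job. The sketch below is not how `BeautifulSoup` works internally; it is a simplified stand-in, and the class name, helper name, and sample markup are all made up for illustration.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect the text between tags and any href-style attribute values."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "xlink:href") and value:
                self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

def extract_text_and_links(markup):
    parser = TextAndLinks()
    parser.feed(markup)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())  # collapse runs of whitespace
    return text, parser.links

sample_markup = (
    '<html><body>Download it <a href="github.com/x">here</a> '
    '<link xlink:type="simple" xlink:href="https://example.com/w"></body></html>'
)
print(extract_text_and_links(sample_markup))
```

One pass gives both the cleaned text and the list of linked URLs, which covers the counting and storing step discussed above.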

<br><div style="width:70%;font-size:16px">
    There are two loops because the tags for URLs are different for HTML and XML. If you don't know whether documents are in HTML or XML, you'll need to either identify that ahead of time or try both types of searches.<br><br><br>
   Two final comments before we move on to Stage 2. First, you may have noticed that the plain-text link in the XML document is still there. BeautifulSoup correctly parsed that as text, so to get it out you'll need to nest the regex cleaning within the XML/HTML parsing. Second, we haven't discussed documents stored in the very common JSON format. That's because JSON is usually used in settings with lots of metadata and should be handled with Python's `json` package. Doing that is outside of the scope of this workshop.<br><br></div>

<br><div style="width:70%;font-size:16px">
    <h3> Stage 2: Breaking documents down</h3>
    <br><br>
    Now that we have some reasonably clean text we can start pulling it apart. That used to be pretty laborious but `spaCy` makes it oh-so easy. It does all the hard work at the very first step and from there we just do some repackaging.<br><br>


<h3>Stage 2A: Tokenizing</h3>
<br>
The first thing `spaCy` does is to tokenize the document. From our computers' perspective a document is a series of lines. There is no functional significance to punctuation marks or *whitespace* characters like tabs, spaces or newlines. Those are just more characters in the line. But from the human perspective, words are the fundamental units of meaning. Clearly higher-order structures like sentences, paragraphs, sections, and chapters matter, but everything starts with the words. So we need to tell the computer to identify the words in those character strings/lines.<br><br>
That used to be actually pretty challenging because you couldn't just break whenever you encountered whitespace, newline characters or punctuation marks. But `spaCy` can do it effortlessly so let's let it do its thing.<br><br>

Below we take a chunk of text from a Rick Bass story. Take a look at it and then feed it into `spaCy` in the following cell.<br><br></div>
    
    

In [None]:
bass_doc = """
Ann knew the dogs would stay there forever,
 or until she released them, and it troubled 
her to think that if she drowned, they too
 would die-that they would stand there 
motionless, as she had commanded them, for 
as long as they could.
"""


In [None]:
bass_doc = bass_doc.replace("\n","") # getting rid of the newline character that made it easy to read above.
print("The document is of the type {} and is {} characters long\n\n\n".format(type(bass_doc),len(bass_doc)))
print(bass_doc)
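
Before handing the text to `spaCy`, it is worth seeing why the obvious shortcuts for tokenizing fall short. The sketch below compares splitting on whitespace with a simple regular expression on an invented sentence; neither is what `spaCy` does, they are just the naive baselines.

```python
import re

sentence = "Ann's dogs wouldn't move, she knew."  # made-up example sentence

# Shortcut 1: split on whitespace. Punctuation stays glued to the words.
naive = sentence.split()
print(naive)

# Shortcut 2: runs of word characters, or single punctuation marks.
# Now contractions and possessives are chopped into fragments.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```

The first approach leaves `move,` and `knew.` as tokens, while the second breaks `wouldn't` into `wouldn`, `'` and `t`. Handling such cases sensibly takes language-specific rules, which is the work a proper tokenizer does for us.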

In [None]:
# running this cell will take a bit of time. On fast machines it might be 10-15 seconds. 
# On slower ones, it should still complete in a minute or so. 


nlp = spacy.load('en_coref_md') # loading the trained model of the English language


In [None]:
doc = nlp(bass_doc) # performing neural net magics 

<br><div style="width:70%;font-size:16px">
Now let's see what the `doc` object looks like.
    </div><br>

In [None]:
doc

<br><div style="width:70%;font-size:16px">
That result belies what has actually happened. To see the tokens, run the following code. It lists every token in the text.
    </div>

In [None]:
tokenized = [token for token in doc]
print("\nThe tokenized object is of type {}, and its elements are of type {}\n\n".format(type(tokenized),type(tokenized[0])))
print(tokenized)

<br><div style="width:70%;font-size:16px">
That's it, we've tokenized the string! Note that this is a Python list with `Token` objects in it. Tokens are the words, punctuation, and spaces.
<br><br><br>
    <h3> Stage 2B: Coreference resolution</h3><br><br>
Coreference resolution addresses the fact that humans are very good with pronouns and can identify the original reference with ease. But when we approach text computationally we'll lose the references in a huge pile of she/he/it/that/her/him/etc. We shouldn't do that if we want to measure things! To make the point clearer, consider the following sentence:<br><br>
*My brother lives in a basement apartment. He really loves it.*<br><br>
The second sentence has two unresolved references, *he* and *it*. Successful coreference resolution will replace 
*he* with *my brother* and *it* with *apartment*. Let's let `spaCy` try.
<br>

In [None]:
test_doc = nlp(u"My brother lives in a basement apartment. He really loves it.")

print("The coreferences are {}".format(test_doc._.coref_clusters))
print("The new sentence is \"{}\"".format(test_doc._.coref_resolved))

<br><div style="width:70%;font-size:16px">
As you can see, `spaCy` got only the [my Brother, he] coreference pair. But that was actually a pretty hard pair of sentences to resolve. Let's try it on the Bass text.<br>

In [None]:
print("Original:")
print(doc)
print("\n\n\nResolved:")
print(doc._.coref_resolved)

<br><div style="width:70%;font-size:16px">
That's a pretty incredible result for almost no effort on our part. The mileage you get will vary from text to text. `spaCy` has a much larger model that might do better right now, but expect improvements in the near future.<br><br><br>
    <h3>Stage 2C: Lemmatization</h3>
    <br><br>
Just like with coreference resolution, we might lose a lot of valuable information if we didn't pay attention to *lemmas*. A *lemma* is the core stem of a concept and/or word. If we analyse the tokens themselves instead of the lemmas, quantitative approaches might miss important relationships because the same concept is split across multiple token types. If the analysis is focused on linguistic style, you probably shouldn't lemmatize the text because the choice of specific variants of the same lemma is an important part of style. 
<br><br>    
The practice of stemming is a quick and dirty heuristic for getting *lemmas*. It would stem *doggy*, *dogs*, and *dogginess* back to *dog*, but would not treat *are* and *is* as forms of the verb *to be*. Lemmatization is the more principled approach and `spaCy` is really good at it! <br><br>
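
To see why stemming is called quick and dirty, here is a toy suffix-stripping function. It is a deliberately crude sketch, not a real stemming algorithm such as Porter's, and the suffix list is an arbitrary assumption chosen for the example.

```python
SUFFIXES = ("ing", "ers", "er", "s")  # checked in this order, longest first

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, as long as at least `min_stem` letters remain."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

for word in ["computer", "computers", "computing", "dogs", "are", "is"]:
    print(word, "->", crude_stem(word))
```

The three *comput-* variants collapse to the same stem and *dogs* becomes *dog*, but *are* and *is* come back unchanged because no amount of suffix chopping links them to *be*. A lemmatizer closes that gap by drawing on knowledge of the language rather than the spelling alone.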



In [None]:
for token in doc:
    print(token.text, token.lemma_, token.lemma, token.is_stop, token.is_alpha)

In [None]:
for token in doc:
    print("Token:{}\t\t. lemma:{}".format(token, token.lemma_))

<br><div style="width:70%;font-size:16px">
Many words are their own lemma, or are pronouns that rightly shouldn't be lemmatized. But you can see above some roots identified.<br><br>
Now that we know how to resolve coreferences and lemmatize, let's create the properly *resolved* text. We do this by substituting in the coreferences and then finding the lemmas on that new text. We then substitute in the lemmas to get some very awkward, but maximally meaningful text.<br><br>

In [None]:

coref_doc = nlp(doc._.coref_resolved) # rerunning spaCy's parsing process with the coreferences inserted

resolved_doc = [token.lemma_  if token.lemma_ != "-PRON-" else token.text for token in coref_doc]
# putting in the lemmas, except for pronoun because they don't really have lemmas

print(resolved_doc)

<br><div style="width:70%;font-size:16px">
<h3>Stage 3: Pruning the words</h3>
    <br><br>
Now that we've resolved the tokens of the document to their full references and lemmas, we want to take out the words that aren't going to be particularly informative for quantitative approaches, such as *the*, *or*, *if*, *that*, etc. Those words are important syntactically and essential for the coreference resolution process, but many quantitative approaches don't want them or punctuation in the model. In many cases, you'll also remove the most common words. The justification for doing so is outside of the scope of this tutorial.
<br>
<br>
    <h3>Stage 3, option A: Stoplisting</h3>
<br>Let's take a look at the counts of the words in our document in the next cell. We'll use a much larger document from the Genesis corpus we used above. The code will take a minute or two to run.<br></div>

In [None]:
genesis_doc_raw = nltk.corpus.genesis.raw('english-kjv.txt')
genesis_doc = nlp(genesis_doc_raw)
gen_coref_doc = nlp(genesis_doc._.coref_resolved) # rerunning spaCy's parsing process with the coreferences inserted

gen_resolved_doc = [token.lemma_  if token.lemma_ != "-PRON-" else token.text for token in gen_coref_doc]
# putting in the lemmas, except for pronoun because they don't really have lemmas


# count how often each resolved word appears
counts = collections.Counter(gen_resolved_doc)
    
print(counts)

In [None]:
f, ax = plt.subplots(figsize=(16, 8))
sb.distplot(list(counts.values()),hist=True)

<br><div style="width:70%;font-size:16px">
As the raw counts and the histogram make clear, there are a lot of words that show up once or twice and a few words that show up thousands of times. This pattern was observed by George Zipf in 1935, and this type of distribution became known as Zipf's law. It shows up in a lot of places and is more generally called a scale-free or power-law distribution.
<br><br>
The regularity has led many to truncate the distribution at both ends so that the documents of a corpus are reduced to the words in between the extremes. The words at the extremes just aren't informative because they're either everywhere or nowhere, practically speaking. The high-frequency end of the distribution tends to be *a*, *the*, *at*, *for*, and punctuation. We remove these using a *stoplist*. A *stoplist* is just a list of words that should be removed from texts. Most people start with a premade one and add things to it as necessary.<br><br>
Let's look at `spaCy`'s built-in stoplist.<br>
    </div>


In [None]:
nlp.Defaults.stop_words

<br><div style="width:70%;font-size:16px">
It has lots of common words and prepositions. It doesn't have any punctuation, so let's add some.<br></div>

In [None]:
nlp.Defaults.stop_words |= {".","!","?",";",":",",","/","&",'-',"--","\n"}
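
<br><div style="width:70%;font-size:16px">
The `|=` operator is Python's in-place set union. Here is a minimal sketch of the same idea on a plain Python set, with a tiny made-up stoplist standing in for `spaCy`'s. Lowercasing each word before the lookup is one way (an assumption on my part, not something the stoplist does for you) to catch capitalized forms such as *The*.<br></div>

```python
# A tiny stand-in stoplist, invented for illustration; spaCy's is much longer.
toy_stoplist = {"the", "a", "of", "and"}

# in-place set union: adds the punctuation marks to the existing set
toy_stoplist |= {".", ",", "!"}

toy_words = ["The", "serpent", "beguiled", "me", ",", "and", "I", "did", "eat", "."]

# keep a word only if its lowercased form is not on the stoplist
kept = [w for w in toy_words if w.lower() not in toy_stoplist]
print(kept)
```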

<br><div style="width:70%;font-size:16px">
    Now we can loop over the tokens again and remove the stoplisted words. To do that, we'll actually create a new document based on the work we did above and then re-parse it. You can avoid this by stoplisting before lemmatizing, but I believe the sequential logic is best expounded in this order.
 <br></div>

In [None]:
recreated_doc = " ".join(gen_resolved_doc)

new_doc = nlp(recreated_doc)

# the stoplist entries are lowercase, so compare against the lowercased token text
stoplisted = [token for token in new_doc if token.text.lower() not in nlp.Defaults.stop_words]
print(stoplisted)

<br><div style="width:70%;font-size:16px">
    And just like that, the most common words are gone. We can do it again to remove the words that appear only once or twice.<br></div>


In [None]:
# words that appear only once or twice form a second stoplist (a set, for fast lookups)
new_stoplist = {word for word, count in counts.items() if count < 3}

fully_stoplisted = [token for token in stoplisted if token.text not in new_stoplist]
print(fully_stoplisted)
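
<br><div style="width:70%;font-size:16px">
Trimming both ends of the distribution can also be done by count thresholds alone. The helper below is a hypothetical sketch (`prune_by_count` and its parameter names are mine, not from any library), and the toy tokens are invented. It keeps a token only when its total count falls inside the chosen range.<br></div>

```python
import collections


def prune_by_count(tokens, min_count=1, max_count=None):
    """Keep tokens whose total count lies between min_count and max_count."""
    counts = collections.Counter(tokens)
    return [
        t for t in tokens
        if counts[t] >= min_count and (max_count is None or counts[t] <= max_count)
    ]


# Toy lemmatized tokens, invented for illustration.
toy_lemmas = "god say let there be light and there be light and god see light".split()

# drop words seen only once (too rare) and words seen more than twice (too common)
pruned = prune_by_count(toy_lemmas, min_count=2, max_count=2)
print(pruned)
```

The right thresholds depend on the corpus, so it is worth trying a few values and inspecting what gets removed.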

<br><div style="width:70%;font-size:16px">
    <h3>Stage 3, option B: Parts of Speech tagging</h3>
    <br><br>
    The final thing we'll cover is part-of-speech tagging. Part-of-speech tagging is not exactly an obvious thing to do for quantitative analyses because the tags themselves usually don't make it into the analysis. Instead, parts of speech can become another means of pruning tokens. Some researchers focus on nouns and verbs, or even just nouns. spaCy again makes it easy to identify those, and we can use its results to filter the documents yet again.<br><br>
spaCy provides a part-of-speech (`pos_`) field for each token. We look at it in the next cell and check whether it's equal to NOUN or VERB. If and only if it is, we keep the token.<br></div>

In [None]:
noun_verbs = [token.text for token in fully_stoplisted if token.pos_ in ("NOUN", "VERB")]
print(noun_verbs)
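
<br><div style="width:70%;font-size:16px">
If you want to keep a different mix of word classes, one option is to hold the wanted tags in a set. The sketch below uses hand-tagged (word, tag) pairs rather than real `spaCy` output, so the tags are my own labels for illustration, and `KEEP_TAGS` is a hypothetical name.<br></div>

```python
# Tags to keep; add e.g. "ADJ" or "PROPN" to widen the filter.
KEEP_TAGS = {"NOUN", "VERB"}

# Hand-tagged pairs standing in for (token.text, token.pos_); not real model output.
tagged = [
    ("serpent", "NOUN"),
    ("beguile", "VERB"),
    ("subtle", "ADJ"),
    ("garden", "NOUN"),
    ("very", "ADV"),
    ("eat", "VERB"),
]

kept_words = [word for word, tag in tagged if tag in KEEP_TAGS]
print(kept_words)
```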

<br><div style="width:70%;font-size:16px">
Now we have some very clean text ready for LDA or vectorization. (FYI, spaCy has vectors for words too, but I'm a huge advocate for narrowly scoped vector spaces.) It may seem strange to have such unreadable and sterile text as the basis of analysis, but this is where most people start.
    <br><br>
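As a rough picture of what vectorization means here, the sketch below turns two tiny cleaned documents into bag-of-words count vectors using only the standard library. The documents are invented and `to_vector` is a hypothetical helper; in practice most people would reach for a library vectorizer instead.<br><br>

```python
import collections

# Two toy "cleaned" documents, invented for illustration.
docs = [
    ["serpent", "garden", "eat"],
    ["garden", "tree", "eat", "eat"],
]

# the vocabulary fixes one column per distinct word, in alphabetical order
vocab = sorted({word for doc in docs for word in doc})


def to_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = collections.Counter(doc)
    return [counts[word] for word in vocab]


vectors = [to_vector(doc) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```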