# Python for Linguists
I'm going to try something novel: giving this talk from a Jupyter notebook so I can run code on the fly.

## Who is this guy?
* Henry Anderson ([henry.anderson@uta.edu](mailto:henry.anderson@uta.edu))
* Data Scientist in the University Analytics department
* Specialist in unstructured data (i.e., text), machine learning, and Natural Language Processing
* First year masters student, with interests in computational social science, digital language use, and the language of online communities and networks.

## In this talk...
* Who stands to gain the most
* Why consider _programming,_ generally?
* Why consider _Python,_ specifically?
* Some demos:
  * Custom concordance code, with massive flexibility
  * Some very cool NLP tools:
    * Topic models
    * Word vectors
    * Word vector embeddings with t-SNE and plotting with Matplotlib
    * Automatic dependency parsing, POS-tagging, lemmatization, tokenization, etc. (i.e., text preprocessing)

## Who this talk is for
* Anyone who deals with _data:_ people interested in corpus work, sociolinguistics, natural language processing, digital/online language, etc.
* Anyone interested in _computational social science_ (CSS): i.e. general social science approaches leveraging large datasets and computational horsepower.
  * CSS is currently exploding, and is a hugely important avenue for applied social science research.
  * CSS is also massively interdisciplinary: programming, statistics, machine learning, AI, network analysis, linguistics, sociology, psychology, etc all combine to make CSS happen.
* If you deal mostly with theory, or are primarily an experimentalist, you probably stand to gain less from this talk.

## What does programming offer?
* (Quite literally) infinite control over your data processing: you're not limited by the features someone else decided to code into their program--you can change your code up to do anything you want.
* Scalability and automation of your data work
  * Work with literally millions of documents and billions of words with relative ease.
  * Automate steps from data collection through final analysis.
* Marketable skills: even a little bit of Python, Java, or any other language can open doors in the job market.
* You'll feel like a badass.

## What does _Python_ offer?
* Free (as in speech, not beer.  But also as in beer), open-source, royalty-free.  No licenses to sign, no royalties to pay, and _essentially no restrictions_ on what you can and can't do with it.  (the [Python Software Foundation license](https://docs.python.org/3/license.html) is an extremely permissive BSD-type license)

* Huge userbase that's big into Open Source and Free Software--so it's easy to find help or sample code.

* Rapidly becoming _the_ language for data science, displacing even R in most applications.  (R is still dominant for raw statistics, though Python has plenty of packages that implement common statistical tests).
  * Though, keep an eye on a different language--Julia--over the next few years.  It is truly a worthy contender, but has yet to hit version 1.0 as of this talk.

* Easy-to-learn language.
  * Great documentation and stupid amounts of free, high-quality learning resources.
  * Among its core ideas:
    * Code is read far more than it is written, so the language should be _human-readable._
    * "There should be one, and preferably only one, obvious way to do it."  I.e., the most straightforward approach is _usually_ the best.  (This results in a lot of people writing straightforward, fairly easy-to-follow code).
  * Commonly taught as a first programming language, so there are LOTS of materials for eveyone from beginning programmers to seasoned professionals; the Python community is also very welcoming of newcomers.

* General purpose language: can do (almost) everything you want to make it to.
  * Compare to R, which is great for statistics, and a pain for a lot of other stuff.
  * Or Matlab, which is great for being a broken, slow, difficult software environment, and isn't so good at being, well, good.
    * (this has been your mandatory "Matlab is bad" comment)

* **For linguists**: a _huge_ array of language processing functionality and libraries.
  * [spaCy](https://spacy.io), basically a Python version of Stanford's CoreNLP toolkit (lemmatization, tokenization, dependency parsing, POS tagging, and more).
  * [Gensim](https://radimrehurek.com/gensim/), full of topic models and pretty bleeding-edge NLP tools.
  * [Natural Language Toolkit (NLTK)](http://www.nltk.org/), a _massive_ library that's designed to teach a lot of NLP concepts (but can be used for some serious production work too).
  * [Tensorflow](https://www.tensorflow.org/)+[Keras](https://keras.io/), for quickly and easily building neural networks.
  * [PyTorch](http://pytorch.org/), an up-and-coming (but extremely exciting) neural network library.
  * [Pandas](http://pandas.pydata.org/) for R-like dataframes, statistics, and general tabular data management.
  * [Matplotlib](https://matplotlib.org/) (and others like [Seaborn](https://seaborn.pydata.org/), [PyGal](http://www.pygal.org/en/stable/), [Bokeh](https://bokeh.pydata.org/en/latest/), ...) for high-quality, powerful data visualization.
  * [scikit-learn](http://scikit-learn.org/stable/index.html) for non-neural machine learning (support vector machines, random forests, and a few text features like basic preprocessing)
    * Side note, the scikit-learn [User Guides](http://scikit-learn.org/stable/user_guide.html) are an _excellent_ technical crash course in machine learning, even if you're not too interested in Python.
  * [Networkx](https://networkx.github.io/) for performing network analysis.
  * And dozens more.

# Some basic demos

## Concordances

Concordances can be done with regular expressions and a teeny tiny bit of legwork.  (By the way: if you're working with text, you have no excuse to not learn regular expressions.  But that would be another talk all unto itself).  We'll work with the text of William Hope Hodgeson's novel _[The Boats of the "Glen Carrig"](https://en.wikipedia.org/wiki/The_Boats_of_the_%22Glen_Carrig%22)_, a 1907 horror novel.  The text was taken [from Project Gutenberg)](http://www.gutenberg.org/ebooks/10542), with the site's boilerplate text removed from the front and back.

In [None]:
%%time

import re

def concordance(text, token, window=50):
    pattern = re.compile(r"\b{}\b".format(token.strip()), re.IGNORECASE)
    # convert all whitespaces to single space characters
    text = re.sub(r"\s+", " ", text)
    for i in pattern.finditer(text):
        print(
            "...",
            text[i.start() - 50:i.start()].rjust(50, " "),
            text[i.start():i.start() + 50].ljust(50, " "),
            "...",
            sep=""
        )

glen_carrig = open("glen carrig.txt", "r", encoding="utf8").read()
concordance(glen_carrig, "think")

And if we want to get _really_ clever, we can have our concordance function search by stemmed forms.  We'll revisit stemming in a bit more detail shortly; for now, just know that stemming is the process of determining an uninflected form of words, but it's based purely on character patterns--so each word is treated completely in isolation, with no information about parts of speech.

We need to stem the original text, then search for concordances of any tokens that get stemmed to the same value as our input.  Then we run the previous concordance function on those tokens:

In [None]:
%%time

import re
from gensim.parsing.preprocessing import stem_text

def stem_concordance(text, token, window=50):
    text = re.sub(r"\s+", " ", text)
    # get a unique list of all word-like tokens using a basic regex
    word_finder = re.compile(r"[A-z0-9]+", re.MULTILINE)
    vocab_to_stem = {
        i.lower():stem_text(i)
        for i in set(word_finder.findall(text))
    }
    if token.lower().strip() not in vocab_to_stem:
        print("Token is not in the vocabulary.  Please try again.")
        return
    # now flip the dict to {stemmed_form:{set of unstemmed form}}
    stem_to_vocab = {i:set() for i in vocab_to_stem.values()}
    for i in vocab_to_stem:
        stem_to_vocab[vocab_to_stem[i]].add(i.lower())
    # look up other tokens that have same stem as input token
    stemmed_token = vocab_to_stem[token]
    possible_forms = stem_to_vocab[stemmed_token]
    # and now run the previous concordance function.
    for i in possible_forms:
        concordance(text, i, window=window)

glen_carrig = open("glen carrig.txt", "r", encoding="utf8").read()
stem_concordance(glen_carrig, "think")

That only added about 0.2 seconds to the total runtime.  Nice.

Now let's do it again, but with spaCy instead of Gensim.  spaCy has built-in tokenization, lemmatization, and more that are all based on large, pre-trained machine learning models.  This will give us much better accuracy--both with tokenizing and lemmatizing--but at the cost of _significantly_ higher runtime.  spaCy also has multiple models to choose from--for English, there's small, medium, and large.  The bigger the model, the better its accuracy, but also the slower it runs.  But since the interface is exactly the same, we'll use the small model for speed.

We'll also revisit lemmatization in a bit more detail shortly.  The short version: it's like stemming, but it returns a valid, real word corresponding to the uninflected form of a token.  (unlike stemming, which might map "today" to the root form "todai"--lemmatization would correctly map this to "today").

In [None]:
%%time

import re
import spacy

def lemma_concordance(text, token, window=50):
    print("Loading spaCy model...")
    # change to "en_core_web_md" for medium model,
    # or "en_core_web_lg" for large--no other code here needs changing
    nlp = spacy.load("en_core_web_sm")
    print("Loaded.")
    
    # directly get the mapping of raw form to lemma from spaCy's 
    # tokenization/stemming
    word_finder = re.compile(r"[A-z0-9]+", re.MULTILINE)
    vocab_to_stem = {
        i.lower_:i.lemma_
        for i in nlp(text)
    }
    if token.lower().strip() not in vocab_to_stem:
        print("Token is not in the vocabulary.  Please try again.")
        return
    # now flip the dict to {stemmed_form:{set of unstemmed form}}
    stem_to_vocab = {i:set() for i in vocab_to_stem.values()}
    for i in vocab_to_stem:
        stem_to_vocab[vocab_to_stem[i]].add(i.lower())
    
    # Now get the stemmed form of the input token and look up
    # the list of possible unstemmed forms--this approximates
    # finding other inflected forms of the same word.
    stemmed_token = vocab_to_stem[token]
    possible_forms = stem_to_vocab[stemmed_token]
    
    # and now run the previous concordance function.
    for i in possible_forms:
        concordance(text, i, window=window)

glen_carrig = open("glen carrig.txt", "r", encoding="utf8").read()
lemma_concordance(glen_carrig, "think")

## N-gram frequencies

Python has a number of ways we could find n-grams.  The first is using a pre-built tool, like NLTK's ngrams() function, or a Phrases()/Phraser() combination from Gensim, which are actually used to find multi-word phrases.  Or, we could hack it together ourselves with a few lines of code.

First, let's hack it together ourselves.  Then we'll print out the top most common N-grams and plot the frequencies by rank (on a logarithmic scale, naturally.  We're not monsters, after all).  We'll use spaCy's tokenization for maximum accuracy.

In [None]:
%matplotlib inline

from collections import Counter
from pprint import pprint
import re

import matplotlib as mpl
import matplotlib.pyplot as plt
import spacy

# Change this number manually to change the N in the ngrams
NGRAM_N = 2

# change the default matplotlib figure size for Jupyter's sake
mpl.rcParams["figure.figsize"] = (10,10)

nlp = spacy.load("en_core_web_sm")
doc = open("glen carrig.txt", "r", encoding="utf8").read()
# clean up line breaks and other whitespace issues
doc = re.sub(r"\s+", " ", doc)
doc = nlp(doc)
# replace tokens with lowercase form
doc = [i.lower_ for i in doc]
# find n-grams--represent them as plain old strings
ngrams = [
    "_".join(doc[i:i+NGRAM_N]) 
    for i in range(0, len(doc) - NGRAM_N)
]
ngrams = Counter(ngrams)
# do some prettier formatting than the default printing
print(f"{'NGRAM':<30s}\tCOUNT")
for i in ngrams.most_common(20):
    print(f"{i[0]:<30s}\t{i[1]}")
    
# Now, let's just plot the counts by rank.
counts = sorted(ngrams.values(), reverse=True)
plt.plot(counts)
plt.yscale("log")
plt.xscale("log")
plt.title("Look at this beautiful Zipf distribution!")
plt.xlabel("Rank (log)")
plt.ylabel("Count (log)")
plt.show()

NLTK's ngrams() function will find ngrams for us, like we just did by hand:

In [None]:
%matplotlib inline

from collections import Counter
from pprint import pprint
import re

import matplotlib as mpl
import matplotlib.pyplot as plt
from nltk import ngrams
import spacy

# Change this number manually to change the N in the ngrams
NGRAM_N = 2

# change the default matplotlib figure size for Jupyter's sake
mpl.rcParams["figure.figsize"] = (10,10)

nlp = spacy.load("en_core_web_sm")
doc = open("glen carrig.txt", "r", encoding="utf8").read()
# clean up line breaks and other whitespace issues
doc = re.sub(r"\s+", " ", doc)
doc = nlp(doc)
# replace tokens with lowercase form
doc = [i.lower_ for i in doc]
# This ie the only line that changed with NLTK imported
ngrams = ngrams(doc, NGRAM_N)
ngrams = Counter(ngrams)
# do some prettier formatting than the default printing
print(f"{'NGRAM':<30s}\tCOUNT")
for i in ngrams.most_common(20):
    # okay, so the format string changes a little too
    print(f"{'_'.join(i[0]):<30s}\t{i[1]}")
    
# Now, let's just plot the counts by rank.
counts = sorted(ngrams.values(), reverse=True)
plt.plot(counts)
plt.yscale("log")
plt.xscale("log")
plt.title("Look at this beautiful Zipf distribution!")
plt.xlabel("Rank (log)")
plt.ylabel("Count (log)")
plt.show()

## Phrase-finding/collocation analysis

Ngrams are fun and all, but what if you want to find multi-word phrases that appear more often than they should by random chance alone (i.e., collocates)?  Well, as before, we could hack a bit of code together, or we could use a pre-built tool from the amazing Gensin library: the [Phrasing tools!](https://radimrehurek.com/gensim/models/phrases.html)  These tools let you scan a(n already processed) corpus of texts, and finds bigrams that are collocated more than the raw prior distributions would indicate.  Then, these tools let you transform your original corpus, replacing these bigrams with a single token.  You can repeat this process all you want to find arbitrarily long phrases.

Before doing this, we should run our text through a basic preprocessing pipeline in Gensim.  We'll revisit this in a bit more detail later to talk about what it does; for now, just know that it automates a lot of the basic preprocessing steps for us, like lowercasing, removing stopwords, stemming, etc.

We'll use all the default values for our phrasing models except for the threshold (to guarnatee we find at least _some_ phrases for this demo), but they'll be provided explicitly to show how much customization there is.  Note that the phraser expectes a _list of sentences_, i.e. a _list of lists of words._  We don't strictly need to make them the actual sentences; the only reason Gensim says to use sentences is to avoid collocation across sentence boundaries.

In [None]:
import re

from gensim.models.phrases import Phrases
from gensim.parsing.preprocessing import preprocess_string

doc = open("glen carrig.txt", "r", encoding="utf8").read()
# clean up line breaks and other whitespace issues
doc = re.sub(r"\s+", " ", doc)
doc = preprocess_string(doc)
phrasing = Phrases(
    [doc],
    min_count=5,
    threshold=10,
    max_vocab_size=40000000,
    delimiter=b"_", # this has to be a byte string--just a quirk of this model
    progress_per=10000,
    scoring="default",
)

Now, we can look at some of the phrases that our phrase model discovered:

In [None]:
from pprint import pprint

found_phrases = list(phrasing.export_phrases([doc]))
pprint(found_phrases[:15])

And, we can transform our original document(s), replacing all of the discovered bigrams with a single token (e.g., `["the", boat"]` --> `"the_boat"`).  Gensim likes to use the indexing syntax to do transformations--it's a bit weird but you get used to it.

Note that we'll get a warning from Gensim (warnings are not errors--they're more of a "heads up, something looks weird here" sort of notice).  Gensim also has Phraser() objects, which are initialized from a Phrases() object, and are much faster at transforming a corpus.  This only really matters when you're dealing with _massive_ corpora and datasets; for our single book, we don't really need to bother, but I'll show how it would be done anyways.

In [None]:
from gensim.models.phrases import Phraser

# transform the text with the original phrases object...
phrased_doc = phrasing[doc]
print(phrased_doc[500:550])

# ...or by creating a new Phraser() from it first.
phraser = Phraser(phrasing)
phrased_doc = phraser[doc]
print('\n\n', phrased_doc[500:550])

(as you can see, the results of Phrases() and Phraser() are the same--Phraser() will just be _much_ faster, and use much less memory, for very large phrasing passes).

# Some more fun demos: Natural Language Processing 101

## Getting our data

The previous demos were very simple (even simplistic) and don't really leverage all the cool stuff Python--or programming in general--can do for you.  Let's work with a non-trivial dataset now and do some more NLP-like work.  We'll use the [Blog Authorship Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm).  Download it to the same folder as this notebook, then unzip it into a folder names "blogs" (case-sensitive!).

Each author's blog posts are stored as a .xml file, named in the following format: 

`[ID number].[gender].[age].[industry of employment].[astrological sign].xml`

And they contain blog data that looks like this:

```<Blog>

<date>14,May,2004</date>
<post>

 
      why  i feel really empty and disappointed today.. i hate the teachers. i hate the stress i have to accept. i hate i'am so weak... i hate no one understand.
     

    
</post>
</Blog>```

First thing's first: we need to deal with the XML formatting.  Fortunately, Python has some excellent tools for that, e.g. `xml.etree.ElementTree`.  We'll also want to use a better data structure to represent this text.  There's an absolutely indispensible library called Pandas, which gives you R-style dataframes to work with (and Pandas is probably the single biggest reason that Python has taken over the data science world, demoting R to second-place).

In [54]:
import os

files = [i.path for i in os.scandir("blogs")]
files[:5]

['blogs\\1000331.female.37.indUnk.Leo.xml',
 'blogs\\1000866.female.17.Student.Libra.xml',
 'blogs\\1004904.male.23.Arts.Capricorn.xml',
 'blogs\\1005076.female.25.Arts.Cancer.xml',
 'blogs\\1005545.male.25.Engineering.Sagittarius.xml']

Let's do a first pass where we just try to parse each file, to make sure there are no problems.  (Data validations is a step you _absolutely do not skip_ if you're doing any real data work, after all!)

In [59]:
from xml.etree import ElementTree

for i in files:
    try:
        ElementTree.parse(i)
    except Exception as e:
        print("\nEXCEPTION ON FILE:", i)
        print("EXCEPTION:", e)
        break


EXCEPTION ON FILE: blogs\1000866.female.17.Student.Libra.xml
EXCEPTION: not well-formed (invalid token): line 103, column 225


Uh-oh.  Let's take a look at the file that broke and go see where the problem is.  Fortunately, the error above gave us a line _and_ a column number, so we can go to the exact location where an issue was encountered.

```
<Blog>
[...]

<date>14,January,2003</date>
<post>


      Hehe, just finished dinner! Yum! I'm so happy right now. I don't even know why, I just...am! I'm talking to a bunch of my friends while writing this, which is always fun. Plus I'm doing homework, PLUS I'm watching Law & Order...how massively talented am I? Well, my day was pretty kick butt. Umm, no band OR music theory...very cool, so Kristen and I sat together and did homework and discussed Winter. Next period I hung out with Chris and Kelly for a bit (Alex too for a bit, since he was in Gym) and I left eventually. They started working on music, and...I just always end up feeling left out when they do, so, I try to stay away from those two together in general.Oh well, I went and sat alone in a practice room. Darn...no good stories for the newspaper to write about me. That was a great story, no matter how angry people are about the band comment. I wish people would have read the story actually, instead of reaching paragraph two and deciding it was terrible. Well, anyway, umm...EPVM was interesting. I'm getting nervous about my final. My brother said it was difficult...my brother with the perfect ACT AND SAT scores...arg. I just...wow, I'm so afraid of that test. I'm completely going to fail. Gym...oh jeez...two words  Commando Crawl  OWWWWWWWWWWWWWW  Good lord...that was one of the most painful...oh jeez...just thinking about it. Seriously, if you ever by some chance do that for high ropes...WEAR PANTS! Well, yes...wear pants normally, but don't wear shorts, make sure they are pants, because...it's quite painful if you don't. I made it through though, so it's all good! It just hurt...a lot.   Math, boring, shock shock.  Intervening thought: Why do I always write these while talking to Alex??  Lunch...was interesting. Just hung out with Kelly and Emily some, then Chris and Alex. It was cool...not much to it. Comparative Religion was boring...just, meh, presenting projects...and finally chem...boring, shock shock, and then, I came home, and did stuff, and that is the end of my day...therefore, I shall leave, and go kill alex...I mean...wait...you know NOTHING. You have no evidence...:)

</post>
```

Column numbers aren't displayed in this notebook, but the error is at the ampersand in `Law & Order`.  As it turns out, ampersands are special characters in XML code, and need to be escaped specially.  We could do some manual replacement of characters with the appropriate XML escapes, but that sounds like a lot of work and a lot of room for error.  So instead, let's just use an HTML parser, which I've already tested and can verify does in fact work.

Fortunately, our document structure is so simple that we can just hack this together with regular expressions.  This will be fast, but **is not** how we should in general deal with problematic XML or other markup files--we'd want to use some of Python's more rudimentary tools, like the base HTML or XML parsers, and overwrite their functionality (e.g. by subclassing).

We'll create a list of dictionaries (think JSONs), which we can easily pass into Pandas to make a big, beautiful, glorious Dataframe object (which we'll save to a .csv so we can re-open it directly later).  We'll work from this Dataframe for the rest of this demo.

In [89]:
%%time

import os
import re

from tqdm import tqdm_notebook as tqdm
import pandas as pd

# pre-compile patterns we'll use a lot--for speed
date_finder = re.compile(r"^<date>(.*?)</date>$", re.MULTILINE)
post_finder = re.compile(r"^<post>$(.*?)^</post>$", re.MULTILINE|re.DOTALL)
whitespace_cleaner = re.compile(r"\s+", re.MULTILINE)

def process_file(infile):
    # get the user metadata from the filename
    metadata = os.path.split(infile)[-1].split(".")
    text = open(infile, "r", encoding="ISO-8859-1").read()
    dates = date_finder.findall(text)
    posts = post_finder.findall(text)
    # assert check will crash our program if it fails--use to make sure our approach works!
    assert len(dates) == len(posts)
    blog_data = [
        {
             "Author ID" : metadata[0],
             "Gender"    : metadata[1],
             "Age"       : int(metadata[2]),
             "Industry"  : metadata[3],
             "Sign"      : metadata[4],
             "Date"      : dates[i],
             # all whitespaces to a single space.
             "Post"      : whitespace_cleaner.sub(" ", posts[i])
        }
        for i in range(len(dates))
    ]
    
    return blog_data

blog_dataframe = pd.concat(
    pd.DataFrame(process_file(i))
    for i in tqdm(files, desc="Generating Dataframe from blog posts...")
)
print("Casting Date column to datetime format...")
print("Saving blog dataframe to Blog Data.csv...")
blog_dataframe.to_csv("Blog Data.csv", index=False)
blog_dataframe

Casting Date column to datetime format...
Saving blog dataframe to Blog Data.csv...
Wall time: 2min 24s


There are a lot of entries in this dataframe.  And the dataframe itself is 765MB on my machine.

In [90]:
num_words = sum(len(i.split()) for i in blog_dataframe["Post"])
num_authors = blog_dataframe['Author ID'].nunique()
print(f"Number of posts:             {blog_dataframe.shape[0]:,}")
print(f"Number of authors:           {num_authors:,}")
print(f"Approximate number of words: {num_words:,}")

Number of posts:             681,288
Number of authors:           19,320
Approximate number of words: 136,854,709


Now, let's look at some of the ways we can process this text with various libraries.  We'll use the very first blog post as an working example to show what some of these processes do.

In [91]:
demo_post = list(blog_dataframe["Post"])[0]
print(demo_post)

 Well, everyone got up and going this morning. It's still raining, but that's okay with me. Sort of suits my mood. I could easily have stayed home in bed with my book and the cats. This has been a lot of rain though! People have wet basements, there are lakes where there should be golf courses and fields, everything is green, green, green. But, it is supposed to be 26 degrees by Friday, so we'll be dealing with mosquitos next week. I heard Winnipeg described as an "Old Testament" city on urlLink CBC Radio One last week and it sort of rings true. Floods, infestations, etc., etc.. 


For any NLP or data mining tasks with language, we want to reduce the complexity and dimensionality of the text.  This in turn reduces the _sparsity_ of our data when we get to whatever modeling work we want to do with it, which means our models will be less likely to overfit to our data.

Most text mining is primarily centered around the _content_ of the text, moreso than it _structure_ or _form_.  As such, most preprocessing pipelines will focus on preserving the semantic content over other elements of the text.  Some common preprocessing steps:
* Convert to lowercase
* (Sometimes, but not always) Remove accented characters
* Tokenize, i.e. split a doucment into a list of tokens (words, punctuation, etc)
* Remove tokens:
  * _Stopwords_ (generally, _function_ words), e.g. "the", "a", "to", etc.
  * Words with very low frequencies (contain very little information, and are not useful)
  * Words with extremely high _document-wise_ frequencies (these don't help us discriminate between documents in our corpus)
* Stem, or lemmatize, the text (not always--depends on the analysis!)
  * Stemming is significantly faster, but lemmatization can be more accurate.
  * But for downstream tasks and analysis, both stemming and lemmatization tend to perform about the same.
* Generate a _vectorized_ representation of the text--e.g. Bag-of-Words, Word2Vec, Doc2Vec, GloVe, etc.

For other analyses, though, we might be interested in analyzing the structure elements of the text:
* Identifying (named) entities
* Identifying noun chunks (more or less NPs/DPs)
* Syntactic parsing (dependency and constituent parsing are most common)
* Part-of-speech tagging

As for the actual analyses we might do, they are many:
* Topic modeling
* Sentiment analysis
* Document scoring/classification (e.g., author identification)

The rest of this notebook is a whirlwind tour through some of these capabilities in Python.

## Preprocessing

We can use the two excellent libraries we've already seen: spaCy and Gensim.

Gensim is a much _faster_ library for preprocessing, since it only operates based on raw string patterns and is designed to be fast and scale to massive datasets.  There is the assumption that any error we induce through the very simplistic approaches will be balanced out by the amount of data we work with--generally a good, and correct, assumption.  Gensim does no syntactic parsing, POS tagging, or other such structure-related analysis, but it's not designed for that.

spaCy is the much _more accurate_ library, since it parses text based on large, powerful pre-trained models (think Stanford's CoreNLP toolkit--it's very much analogous to that).  While still very fast, spaCy is painfully slow compared to Gensim.  But, it has a far more robust tokenizer, it can do part-of-speech tagging, lemmatization, dependency parsing, entity and noun chunk identification, and it even has pre-trained word vectors (and can easily compute vectors for strings of words or entire documents).

Let's first look at Gensim's preprocessing.  There's one function--`gensim.parsing.preprocessing.preprocess_string()`--which encompasses all the basic functions we need: lowercasing, de-accenting, tokenizing, stemming (with the Porter stemmer), stopword removal, removal of numbers, and removal of very short words (which are generally noise to use).

### Gensim

In [92]:
from gensim.parsing.preprocessing import preprocess_string

print("Original post:")
print(demo_post)

print("\nAfter Gensim preprocessing:")
print(" ".join(preprocess_string(demo_post)))

Original post:
 Well, everyone got up and going this morning. It's still raining, but that's okay with me. Sort of suits my mood. I could easily have stayed home in bed with my book and the cats. This has been a lot of rain though! People have wet basements, there are lakes where there should be golf courses and fields, everything is green, green, green. But, it is supposed to be 26 degrees by Friday, so we'll be dealing with mosquitos next week. I heard Winnipeg described as an "Old Testament" city on urlLink CBC Radio One last week and it sort of rings true. Floods, infestations, etc., etc.. 

After Gensim preprocessing:
got go morn rain okai sort suit mood easili stai home bed book cat lot rain peopl wet basement lake golf cours field green green green suppos degre fridai deal mosquito week heard winnipeg describ old testament citi urllink cbc radio week sort ring true flood infest


Notice how the Porter Stemmer finds the uninflected forms of words.  It bases its processing _purely_ on the _letters of a word_.  This makes it fast, but it doesn't always give real words or the most correct forms, e.g. `"morning"` --> `"morn"`, and `"okay"` --> `"okai"`.  But, for most text mining or NLP tasks, this is actually not that big of an issue, and later processing steps would filter out any weird edge cases.

### spaCy

spaCy requires us to load models in, as we saw earlier when doing stem-based concordancing.  As before, we'll use their small English model, but we would just change the model name in `spacy.load()` if we wanted a different model. The small model will generally be a bit faster, which is all we need for this demo's purposes.

In [93]:
%%time

import spacy

nlp = spacy.load("en_core_web_sm")
processed_post = " ".join(
    i.lemma_
    for i in nlp(demo_post)
    if i.is_stop == False
    and i.is_punct == False
)

print("Original post:")
print(demo_post)

print("\nAfter spaCy preprocessing:")
print(processed_post)

Original post:
 Well, everyone got up and going this morning. It's still raining, but that's okay with me. Sort of suits my mood. I could easily have stayed home in bed with my book and the cats. This has been a lot of rain though! People have wet basements, there are lakes where there should be golf courses and fields, everything is green, green, green. But, it is supposed to be 26 degrees by Friday, so we'll be dealing with mosquitos next week. I heard Winnipeg described as an "Old Testament" city on urlLink CBC Radio One last week and it sort of rings true. Floods, infestations, etc., etc.. 

After spaCy preprocessing:
  well get go morning -PRON- be rain be okay sort suit mood -PRON- easily stay home bed book cat this lot rain people wet basement lake golf course field green green green but suppose 26 degree friday will deal mosquito week -PRON- hear winnipeg describe old testament city urllink cbc radio one week sort ring true flood infestation etc etc
Wall time: 796 ms


Notice how the processed text is much more _human-readable_ with this approach (this is due to the use of a lemmatizer, rather than a stemmer).  While nice for reporting and inspecting results, the extra overhead in runtime (not evident in this small example) might make this an unreasonable proposition for large datasets if time is an issue.  And, as mentioned earlier, if you're doing an _automated analysis_ of your text later, there isn't always a big difference, if any, in how well stemming versus lemmatization performs.

## Linguistic Analysis with spaCy

spaCy's language models have a LOT of functionality.  Let's look at just some of the most easily accessible ones.

First, we've already seen spaCy's ability to do lemmatization, stopword tagging, and punctuation tagging.

In [116]:
import spacy

nlp = spacy.load("en_core_web_sm")

print(f"{'Token':<15s}\t{'Lemma':<15s}\t{'Is stopword?':<15s}\t Is punctuation?")
for i in nlp(demo_post):
    token = i.text
    lemma = i.lemma_
    is_stop = str(i.is_stop)
    is_punct = str(i.is_punct)
    print(f"{token:<15s}\t{lemma:<15s}\t{is_stop:<15s}\t{is_punct}")

Token          	Lemma          	Is stopword?   	 Is punctuation?
               	               	False          	False
Well           	well           	False          	False
,              	,              	False          	True
everyone       	everyone       	True           	False
got            	get            	False          	False
up             	up             	True           	False
and            	and            	True           	False
going          	go             	False          	False
this           	this           	True           	False
morning        	morning        	False          	False
.              	.              	False          	True
It             	-PRON-         	False          	False
's             	be             	False          	False
still          	still          	True           	False
raining        	rain           	False          	False
,              	,              	False          	True
but            	but            	True           	False
that           	that

And, we can also do part-of-speech tagging.

In [107]:
print(f"{'TOKEN':<15s}\t{'COARSE POS':<17s}\t{'FINE POS'}")
for i in nlp(demo_post):
    token = i.text
    coarse = i.pos_
    fine = i.tag_
    print(f"{token:<15s}\t{coarse:<17s}\t{fine}")

Token          	Coarse-grained POS	Fine-grained POS
               	SPACE            	
Well           	INTJ             	UH
,              	PUNCT            	,
everyone       	NOUN             	NN
got            	VERB             	VBD
up             	PART             	RP
and            	CCONJ            	CC
going          	VERB             	VBG
this           	DET              	DT
morning        	NOUN             	NN
.              	PUNCT            	.
It             	PRON             	PRP
's             	VERB             	VBZ
still          	ADV              	RB
raining        	VERB             	VBG
,              	PUNCT            	,
but            	CCONJ            	CC
that           	DET              	DT
's             	VERB             	VBZ
okay           	ADJ              	JJ
with           	ADP              	IN
me             	PRON             	PRP
.              	PUNCT            	.
Sort           	ADV              	RB
of             	ADV              	RB
suits          	NOUN  

Entity recognition...

In [109]:
from pprint import pprint
ents = nlp(demo_post).ents
pprint(list(ents))

[this morning,
 26 degrees,
 Friday,
 next week,
 Winnipeg,
 an "Old Testament",
 CBC Radio,
 One last week,
 Floods]


Noun chunk identification...

In [110]:
from pprint import pprint
noun_chunks = nlp(demo_post).noun_chunks
pprint(list(noun_chunks))

[everyone,
 It,
 me,
 Sort of suits,
 my mood,
 I,
 bed,
 my book,
 the cats,
 a lot,
 rain,
 People,
 wet basements,
 lakes,
 golf courses,
 fields,
 everything,
 it,
 26 degrees,
 Friday,
 we,
 mosquitos,
 I,
 Winnipeg,
 an "Old Testament" city,
 it,
 Floods,
 infestations]


And, as we might suspect from the above information, spaCy also does dependency parsing.

In [115]:
print(f"{'TOKEN':<15s}\t{'HEAD':<15s}\t{'DEPENDENCY RELATION'}")
for i in nlp(demo_post):
    token = i.text
    head = i.head.text
    dep = i.dep_
    print(f"{token:<15s}\t{head:<15s}\t{dep:<15s}")

TOKEN          	HEAD           	DEPENDENCY RELATION
               	Well           	               
Well           	got            	intj           
,              	got            	punct          
everyone       	got            	nsubj          
got            	got            	ROOT           
up             	got            	prt            
and            	got            	cc             
going          	got            	conj           
this           	morning        	det            
morning        	going          	npadvmod       
.              	got            	punct          
It             	raining        	nsubj          
's             	raining        	aux            
still          	raining        	advmod         
raining        	raining        	ROOT           
,              	raining        	punct          
but            	raining        	cc             
that           	's             	nsubj          
's             	raining        	conj           
okay           	's             	acom

We can use the built-in disiplaCy tool to generate a visualization of the dependency parse (though only of the first sentence, for space's sake):

In [117]:
from spacy import displacy
displacy.render(
    nlp("Well, everyone got up and going this morning."), 
    style="dep",
    jupyter=True # to make this render correctly in the Jupyter notebook
)

There's more that spaCy can do, and there are other models and libraries available for doing this sort of automated parsing and annotation of text (e.g., there are interfaces to Stanford's CoreNLP suite), but spaCy is always a good bet since it's fast (for the amount of work it does), pretty accurate, easy to use, and flexible.

## Topic Modeling

Topic Modeling refers to a wide variety of algorithms that are used to explore and discover "topics" within a corpus.  "Topic" is being used with a very specific meaning here--a topic is a _statistical distribution of words_.  You might already be familiar with Latent Semantic Analysis (LSA; sometimes called Latent Semantic Indexing, or LSI), which is an older model for this sort of analysis.

Most modern algorithms are based on [Latent Dirichlet Allocation (LDA)](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), which uses word co-occurrences within documents to determine the topics.  LDA has given rise to a number of subsequent topic models: 
* Author-Topic models, which are LDA with metadata (usually, but not always, the author of a piece)
* Dynamic topic models, which include time metadata in the modeling process
* Hierarchical Dirichlet Process, an extension of LDA that is _nonparametric_ with regards to the number of topics (but can give less clear results in some cases).

All of these models are implemented in Genim.  Due to the size of our corpus and the fact that this is a live demo, we'll use Gensim's speedier preprocessing tools to work with our data and prepate it for topic modeling.

Gensim requires that the corpus be in a _bag of words_ format for topic modeling, so we'll need to quickly do the conversion.

In [118]:
%%time

from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import preprocess_string
from tqdm import tqdm_notebook as tqdm

# preprocess our corpus
corpus = [
    preprocess_string(i) 
    for i in tqdm(blog_dataframe["Post"], desc="Preprocessing")
]
id2word = Dictionary(tqdm(corpus, desc="Creating Gensim dictionary"))
# remove tokens with extremely high or low frequencies
vocabsize = len(id2word)
id2word.filter_extremes(
    no_above=.5, # remove tokens in > 50% of the documents
    no_below=10  # remove tokens in < 10 documents
)
# Reset index spacings for better efficiency
id2word.compactify()
print(f"Removed {vocabsize - len(id2word)} tokens based on frequency criteria.")
corpus = [
    id2word.doc2bow(i) 
    for i in tqdm(corpus, desc="Creating BoW")
]


Removed 547479 tokens based on frequency criteria.



Wall time: 14min 25s


We absolutely won't run this during this demo, but just for comparison, here's a spaCy preprocessing pipeline that does the same thing.  The only thing that changes is the first pass through the corpus--the bag-of-words steps are as before.

When I ran this next cell on my computer it processed about 20 documents per second, compared to 900-1000 documents per second with the Gensim pipeline above.

In [95]:
%time

from gensim.corpora import Dictionary
import spacy

nlp = spacy.load("en_core_web_sm")

corpus_spacy = [
    [
        i.lemma_
        for i in nlp(j)
        if i.is_stop == False
        and i.is_punct == False
    ]
    for j in tqdm(blog_dataframe["Post"], desc="Preprocessing")
]
corpus_spacy = [
    [
        i.lemma_
        for i in j
        if i.is_stop == False
        and i.is_punct == False
    ]
    for j in tqdm(nlp.pipe(blog_dataframe["Post"], n_threads=3), desc="Preprocessing")
]

id2word_spacy = Dictionary(tqdm(corpus_spacy, desc="Creating Gensim dictionary"))
# remove tokens with extremely high or low frequencies
vocabsize = len(id2word)
id2word_spacy.filter_extremes(
    no_above=.5, # remove tokens in > 50% of the documents
    no_below=10  # remove tokens in < 10 documents
)
# Reset index spacings for better efficiency
id2word_spacy.compactify()
print(f"Removed {vocabsize - len(id2word)} tokens based on frequency criteria.")
corpus_spacy = [
    id2word_spacy.doc2bow(i) 
    for i in tqdm(corpus_spacy, desc="Creating BoW")
]

Wall time: 0 ns


KeyboardInterrupt: 

Now, let's run some of these topic models.  We won't bother tweaking any of the default settings (except for chunksize, which should give us a bit more speed), meaning each one will search for 100 topics.  This is the most important parameter in the models, by far--and sadly, the only really good way to find a good value is to run it at a range of different topic numbers and see what gives you useful results.  You _can_ look at the _coherence_ of each topic (calculated by Gensum automatically) and use that to evaluate your model, but the be-all-end-all is the human interpretability and the usefulness of your topics.

In [None]:
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.atmodel import AuthorTopicModel
from gensim.models.hdpmodel import HdpModel

# we need a mapping of document IDs to a list of authors who wrote them,
# for the author-topic model.  This is pretty easy.
doc2author = dict(
    (i, [j])
    for i,j in enumerate(blog_dataframe["Author ID"])
)

print("Running multicore LDA...")
lda = LdaMulticore(corpus, workers=3, id2word=id2word, chunksize=50000)
print("Saving LDA model...")
lda.save("LDA.model")
print("Done.")
# print("Running author-topic model...")
# atmodel = AuthorTopicModel(corpus, id2word=id2word, doc2author=doc2author, chunksize=50000)
# print("Saving Author-Topic model...")
# atmodel.save("AuthorTopic.model")
# print("Done.")
# # HDP is nonparametric--don't need to specify topic number
# print("Running HDP model...")
# hdp = HdpModel(corpus, chunksize=50000)
# print("Saving HDP model...")
# hdp.save("HDP.model")
# print("Done".)

Running multicore LDA...


Let's look at some of the LDA outputs and see if we can interpret them.  We'll look at only the top ten highest-likelihood topics, sorted by decreasing likelihood.

In [None]:
print(lda.top_topics(corpus, topn=10))

We could do the same with the author-topic model, but what's more interesting is to pick a user or two and see what topics _they_ tend to write about.  Then we can print just those topics out.

In [None]:
from pprint import pprint

user_id = blog_data["Author ID"][0]
for i in atmodel[user_id]:
    topic_num = i[0]
    topic_prob = i[1]
    print(f"Topic #{topic_num} with weight {topic_prob} for author {user_id}")
    pprint(atmodel.show_topic(topic_num, topn=10))

And now, let's look at HDP's output, again sorted by the most likely topics.

In [None]:
print(hdp.top_topics(corpus, topn=10))