# Exploratory Text Analysis in Python using spaCy and textacy

Workshop for Open Data Science Conference East 2021

[Workshop materials](https://github.com/csbailey5t/ODSC_text_analysis)

Scott Bailey <br/>
Digital Research and Scholarship Librarian <br/>
Copyright and Digital Scholarship Center <br/>
NC State University Libraries

## Outline
1. Intro and overview of NLP libraries
2. Document-level analysis <br/>
    a. Tokenization <br/>
    b. Cleaning text data <br />
    c. Part-of-speech tagging <br/>
    d. Named entity recognition <br/>
    e. Similarity vectors <br/>
    f. Rule-based matching <br />
3. Scaling up to corpus-level analysis
4. Further resources for spaCy

## Learning goal:

Through the course of the workshop, you'll practice using the core NLP features of spaCy and textacy, and connect those features to exploratory questions. 

## What do we mean by "exploratory text analysis?"
- How clean are the data?
- What methods do the data support?
- Project scoping 
- Research question refinement
- Iterative research 

## A quick(!) overview of NLP-related libraries in Python
- [nltk](https://www.nltk.org/)
- [gensim](https://radimrehurek.com/gensim/)
- [scikit-learn](https://scikit-learn.org/stable/)
- [stanza/corenlp](https://stanfordnlp.github.io/stanza/)
- [spaCy](https://spacy.io/)
- [huggingface transformers - pytorch and tensorflow](https://github.com/huggingface/transformers)

### Why spaCy and textacy?

SpaCy is an opinionated, performant NLP library that does a lot of the work for you while revealing where you might need to do more custom refinement or model building. Textacy builds smoothly on spaCy to add corpus analysis and common information retrieval methods.

## Questions during the workshop

During the workshop, please do ask questions by way of the chat. I'll be keeping an eye on that, and will answer questions as we go if I can. I'll also give some time during and after the workshop when folks can unmute and ask questions. 

## Jupyter Notebooks, Colab, and Binder

If you haven't worked with [Jupyter](https://jupyter.org/) notebooks before, they are a widely used literate programming tool that lets you write and execute cells of code. 

[Google Colab](https://colab.research.google.com) is a hosted notebook environment from Google, which provides free access to limited GPU resources.

[Binder](https://mybinder.org/) is a great project that builds reproducible environments to execute Jupyter notebooks. 


In [None]:
# Run this cell if working in Colab
!pip install textacy

In [None]:
# If running locally, you can also run this in your terminal with an active virtual environment
!python -m spacy download en_core_web_md

In [None]:
from collections import Counter
import glob
import spacy
from spacy import displacy
import textacy

In [None]:
import en_core_web_md

In [None]:
nlp = en_core_web_md.load()

In [None]:
# from https://en.wikipedia.org/wiki/Data_science
sample_text = """Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge."""

In [None]:
doc = nlp(sample_text)

## Tokenization

In [None]:
for word in doc[:20]:
    print(word)

In [None]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

In [None]:
for sent in doc.sents:
    print(sent)

## Cleaning text data

In [None]:
# One of the common things we do in text analysis is to remove punctuation
no_punct = [token for token in doc if not token.is_punct]
for token in no_punct[50:100]:
    print(token.text, token.is_punct)

In [None]:
# This has worked, but left in newline characters and spaces
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[50:100]:
    print(token.text)

In [None]:
# Let's say we also want to remove numbers, and lowercase everything
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]

One other common bit of preprocessing is to remove stopwords, that is, the common words in a language that don't convey the information that we are looking for in our analysis. For example, if we looked for the most common words in a text, we would want to remove stopwords so that we don't only get words such as 'a,' 'the,' and 'and.'

In [None]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]

For this piece, we've used spaCy's built-in stopword list, which is used to set the `is_stop` property on each token. There's a good chance you'll want to create custom stopword lists, though, especially if you're working with historical or highly domain-specific text. 

In [None]:
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["data", "algorithms"]

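# Filter the already-cleaned tokens so the result stays free of punctuation, whitespace, numbers, and default stopwords
custom_clean = [token for token in clean if token not in custom_stopwords]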
custom_clean

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, white-space, numbers, or stopwords. Depending on our analysis, we may or may not want to do this much cleaning. But, it is good to understand how much we can do just with spaCy. 

### Since we can break apart the document and filter it now, it's a good time to start counting things

In [None]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

In [None]:
# number of sentences
len(list(doc.sents))

In [None]:
# Count all lower-cased tokens
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

In [None]:
# Count cleaned tokens
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

### Activity

In the cell below, write code to find the five most common noun chunks in the original doc. 

In [None]:
# Write code here

## Part-of-speech tagging

In [None]:
# Coarse grained UPOS: https://universaldependencies.org/docs/u/pos/
for token in doc[:20]:
    print(token.text, token.pos_)

In [None]:
# Fine-grained POS, Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:20]:
    print(token.text, token.tag_)

In [None]:
# Not sure what those tags are? Try spaCy's explain function
spacy.explain("DT")

In [None]:
# Collect tokens by part of speech
verbs = [token for token in doc if token.pos_ == "VERB"]
verbs

In [None]:
# Collect plural nouns
nouns_pl = [token for token in doc if token.tag_ in ("NNS", "NNPS")]
nouns_pl

### Dependency tree visualization

In [None]:
single_sentence = list(doc.sents)[0]
single_sentence

In [None]:
# spaCy determines the dependency tree for its doc. As with POS, we can see the dependency tag of each token. 
for token in single_sentence:
    print(token.text, token.dep_)

In [None]:
spacy.explain("dobj")

In [None]:
displacy.render(single_sentence, style="dep", jupyter=True)

## Named entity recognition

[List of entity types in this spaCy model](https://spacy.io/models/en#en_core_web_md)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)    

### Activity

Add or modify a sentence in the original `sample_text` so that spaCy will detect a GPE. Then, in the cell below, write code to return a list of all entities that are either PERSON or GPE.

**hint**: make sure to reprocess the `sample_text` with the `nlp` model. 

In [None]:
# Write code here

### Visualizing named entities

In [None]:
single_sentence = list(doc.sents)[-1]
displacy.render(single_sentence, style="ent", jupyter=True)

## Word, sentence, and document vectors

SpaCy's medium (`md`) and large (`lg`) models include GloVe word vectors trained on the [Common Crawl](https://commoncrawl.org/). 

You could train your own vectors with `gensim` and `word2vec`, use a large language model, or turn to many other libraries and algorithms. But if your text is fairly recent, and especially if it comes from the web, the Common Crawl vectors might be enough, especially for exploratory work. 

`Token`s have vectors. `Doc`s and `Span`s have vectors that are the average of their token vectors. 
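
To make the averaging concrete, here is a minimal sketch in plain Python. The two three-dimensional "token vectors" are made-up values for illustration (the vectors in the model are much longer), and `mean_vector` is a hypothetical helper, not part of spaCy.

```python
# Toy 3-dimensional "token vectors"; the values are invented for illustration.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]


def mean_vector(vectors):
    """Return the element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]


# A span or doc vector is the average over its token vectors.
span_vector = mean_vector(token_vectors)
print(span_vector)  # [2.0, 2.0, 1.0]
```

One consequence of averaging is that word order is lost: two spans with the same tokens in a different order end up with the same vector.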

In [None]:
# token vectors
for token in doc[:5]:
    print(token.vector)

In [None]:
# doc vector
doc.vector

In [None]:
# sentence/span vector
list(doc.sents)[0].vector

This is fine, but for exploratory work, we might just be interested in some similarity measures between tokens, sentences, or documents. SpaCy uses the common cosine similarity measure.
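
As a reference point, cosine similarity can be sketched in a few lines of plain Python. This is the textbook definition rather than spaCy's own implementation, and `cosine_similarity` is a hypothetical helper; returning `0.0` for an all-zero vector is an assumption made here to avoid dividing by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The measure depends only on the direction of the vectors, not their length, so scaling a vector leaves its similarity to other vectors unchanged.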

In [None]:
for token1 in doc[:10]:
    for token2 in doc[:10]:
        print(token1.text, token2.text, token1.similarity(token2))

**Question**: Looking at the results, can you explain the scale of the similarity score?

In [None]:
for sent1 in doc.sents:
    for sent2 in doc.sents:
        print(sent1.text, "\n", sent2.text, "\n", sent1.similarity(sent2))
        print("----------------------------------------------")

## Rule based matcher

Rule-based matching is an incredibly powerful complement to spaCy's statistical models. It's also a bit complex, though, so it's worth looking at the docs [here](https://spacy.io/usage/rule-based-matching).

In [None]:
for sent in doc.sents:
    print(sent)

In [None]:
from spacy.matcher import Matcher

In [None]:
matcher = Matcher(nlp.vocab)

[Available token attributes for the `Matcher` pattern](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)

In [None]:
# We'll define a pattern as a list of dictionaries, where each dictionary describes a token
pattern = [{'LOWER': 'data'},
           {'POS': 'NOUN'}]
# The Matcher expects a list of patterns
matcher.add("data+noun", [pattern])

In [None]:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

One of the easiest ways to build up these `Matcher` patterns is to use Explosion's online [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher). 

## Working with multiple documents (a corpus)

For a small corpus, you can build a list or dictionary of processed spaCy docs. Once you have that list or dictionary, approach it in terms of using the type of code we've written above, but applied over the larger data structure. 
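
As a rough sketch of the dictionary approach, the helper below maps each file name to its processed text. `build_doc_lookup`, the file names, and the sentences are all hypothetical, and `str.split` stands in for the spaCy pipeline so that the example runs without a model; in the notebook you would pass `nlp` instead.

```python
from pathlib import Path


def build_doc_lookup(texts_by_path, process):
    """Map each file's stem to the processed version of its text."""
    return {Path(path).stem: process(text) for path, text in texts_by_path.items()}


# Hypothetical file names and contents, for illustration only.
raw = {
    "sotu/example_a.txt": "The state of the union is strong.",
    "sotu/example_b.txt": "We face new challenges.",
}

# str.split is a stand-in for the spaCy pipeline (an assumption for this sketch).
docs = build_doc_lookup(raw, process=str.split)
print(docs["example_a"][:3])  # ['The', 'state', 'of']
```

Keying by file name makes it easy to get back to a specific document later, which a plain list does not give you.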

For larger corpora, though, you might need to think about streaming data or distributed processing. 
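
One simple form of streaming is a generator that reads files one at a time instead of loading everything into a list first. The sketch below uses only the standard library; `iter_texts` is a hypothetical helper, and the temporary directory and file contents exist only to make the example runnable.

```python
import tempfile
from pathlib import Path


def iter_texts(directory, pattern="*.txt"):
    """Yield the contents of matching files one at a time, in sorted order."""
    for path in sorted(Path(directory).glob(pattern)):
        yield path.read_text(encoding="utf-8")


# Demonstrate on a throwaway directory with two small files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first speech", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second speech", encoding="utf-8")
    lengths = [len(text) for text in iter_texts(tmp)]

print(lengths)  # [12, 13]
```

Only one text is held in memory at a time, and a generator like this can be handed to any consumer that accepts an iterable of strings.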

We're going to turn to textacy to work with a corpus of documents shortly, but it is useful to think about how you can use standard data structures in combination with spaCy as needed, and when it is enough for your task.

If you've cloned the repo, or are working in Binder, the data are there in the `sotu` directory. If in Colab, you'll need to run the cell below to download the zip archive, then unzip the data.

In [None]:
# Run this if within Colab to download and unzip data
!wget https://github.com/csbailey5t/ODSC_text_analysis/raw/master/archive.zip
from zipfile import ZipFile
with ZipFile("/content/archive.zip", 'r') as zobj:
  zobj.extractall(path="sotu")

In [None]:
# Gather the paths for all .txt files in our data directory
fns = glob.glob("sotu/*.txt")
len(fns)

In [None]:
texts = []
for fn in fns:
    with open(fn, 'r') as f:
        texts.append(f.read())

In [None]:
%time corpus = [nlp(text) for text in texts[:5]]

In [None]:
for doc in corpus[:2]:
    for ent in doc.ents:
        print(ent.text, ent.label_)

In [None]:
# Collect all geo-political entities from whole corpus
gpes = [(ent.text, ent.label_) for doc in corpus for ent in doc.ents if ent.label_ == "GPE"]
len(gpes)

In [None]:
gpes[:20]

In [None]:
# get the set of unique GPEs
set(gpes)

### Activity

Choose a method from the single document analysis portion of the workshop, and apply it to this small corpus. For example, you could find the most common words, create a cleaned corpus, or aggregate parts of speech. 

In [None]:
# Write code here

spaCy also provides a `pipe` method on the language model that will process texts in a stream. This can be useful for larger collections of texts, especially when combined with disabling parts of the pipeline you aren't using. 

https://spacy.io/api/language#pipe

Below are timed examples for building the corpus with a standard list comprehension vs the `pipe` method with batching and multiple processes.

In [None]:
# %time docs = [nlp(text) for text in texts]

In [None]:
# %time docs = list(nlp.pipe(texts, batch_size=10, n_process=2))

Let's take a look at how to build a corpus with textacy now.

Textacy corpora can be built directly from a list of texts, or from texts plus metadata, which lets you filter the corpus on that metadata. For now, we'll stick with just the texts. 

The full docs for textacy are [here](https://textacy.readthedocs.io/en/stable/), with details on the `Corpus` class [here](https://textacy.readthedocs.io/en/stable/api_reference/lang_doc_corpus.html#module-textacy.corpus). The `Corpus` class does provide convenience functions for saving and loading processed corpora. 

Before we run the next cells, if you're on Binder, you'll need to do one thing to deal with Binder's memory limitations. In the "Kernel" menu, hit "Restart". You'll then need to rerun the first four code cells of the notebook, to reimport libraries and initialize the nlp model. After that, skip back down to this section.

In [None]:
# In Binder, you'll need to rerun this line, but not in Colab
fns = glob.glob("sotu/*.txt")

In [None]:
# In Colab, we'll stick with 20 texts
# In Binder, I recommend dropping to 5
texts = []
for fn in fns[:20]:
    with open(fn, 'r') as f:
        texts.append(f.read())

In [None]:
corpus = textacy.Corpus(nlp, data=texts)

In [None]:
corpus

In [None]:
print("number of documents: ", corpus.n_docs)
print("number of sentences: ", corpus.n_sents)
print("number of tokens: ", corpus.n_tokens)



In [None]:
# We'll pass as_strings so that the results we look at will give us strings rather than unique ids.
counts = corpus.word_counts(as_strings=True)

Notice that, by default, the word_counts function is doing a certain amount of cleaning for you: https://chartbeat-labs.github.io/textacy/api_reference/lang_doc_corpus.html#textacy.corpus.Corpus.word_counts

In [None]:
sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20]

For an explanation of -PRON-, see https://spacy.io/api/annotation#lemmatization. Basically it's spaCy's way of lemmatizing pronouns.

In [None]:
word_doc_counts = corpus.word_doc_counts(weighting="freq", smooth_idf=True, filter_stops=True, as_strings=True)

In [None]:
sorted(word_doc_counts.items(), key=lambda x:x[1], reverse=True)[:30]

We should note that these are not tf-idf values, which are term frequencies for individual docs weighted by the inverse document frequency. As far as the textacy docs describe it, with `weighting="freq"` each value is the number of docs a word appears in divided by the total number of docs, that is, the share of the corpus that contains the word. We're still getting a sense of which words seem to matter most across the corpus, if document frequency is a proxy for importance.
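
To see the difference between document frequency and tf-idf, here is a small worked sketch on three made-up, pre-tokenized documents. It uses one common textbook form of idf, `log(N / document count)`; textacy's exact formula and smoothing may differ, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Three tiny, invented "documents", already tokenized.
docs = [
    ["nation", "war", "nation"],
    ["nation", "peace"],
    ["economy", "peace"],
]
n_docs = len(docs)

# Document count: in how many documents does each word appear at least once?
doc_counts = Counter(word for doc in docs for word in set(doc))

# Document frequency: the share of documents that contain the word.
doc_freq = {word: count / n_docs for word, count in doc_counts.items()}

# One common unsmoothed idf variant (an assumption for this sketch).
idf = {word: math.log(n_docs / count) for word, count in doc_counts.items()}

# tf-idf is computed per document: term count in that document times idf.
tf_first = Counter(docs[0])
tfidf_first = {word: tf * idf[word] for word, tf in tf_first.items()}

print(doc_counts["nation"], doc_counts["war"])  # 2 1
print(tfidf_first)
```

Document frequency gives one number per word for the whole corpus, while tf-idf gives one number per word per document. In the first document, "war" ends up with a higher tf-idf than "nation" even though "nation" occurs twice, because "war" appears in fewer documents.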

Textacy provides access to different algorithms that can be run on docs, such as TextRank for keyword extraction. We'll start by working on a single doc, and then look at how we might scale up to thinking about the corpus.


In [None]:
import textacy.ke

In [None]:
key_terms_textrank = textacy.ke.textrank(corpus[4])
key_terms_textrank

For comparison, we'll take a look at another algorithm, Yake.

In [None]:
key_terms_yake = textacy.ke.yake(corpus[4])
key_terms_yake



Let's think about aggregating keywords over part of the corpus.


In [None]:
key_terms_yake_corpus = [textacy.ke.yake(doc) for doc in corpus[:20]]

In [None]:
key_terms_yake_corpus[:2]

In [None]:
from itertools import chain

In [None]:
flat_terms_tuples = list(chain(*key_terms_yake_corpus))
flat_terms_tuples[:10]

In [None]:
# we now have a flat list of tuples, but let's shift to a flat list of just the keys in order to 
# count the most common keys
flat_terms = [k for k,v in flat_terms_tuples]
flat_terms[:20]

In [None]:
keyword_counter = Counter(flat_terms)
keyword_counter.most_common(20)

## Keyword in Context

One thing that researchers often find helpful in working with text is simply seeing keywords in context. Maybe you already know terms of interest in your data, but if not, the keyword extraction above might help surface interesting words. 
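
The idea behind keyword in context is simple enough to sketch with the standard library: find each occurrence of a keyword and keep a fixed window of characters on either side. The `kwic` function and the sample sentence below are hypothetical and are not textacy's implementation.

```python
import re


def kwic(text, keyword, window=20, ignore_case=True):
    """Yield (left context, match, right context) for each occurrence of keyword."""
    flags = re.IGNORECASE if ignore_case else 0
    for match in re.finditer(re.escape(keyword), text, flags=flags):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        yield left, match.group(), right


# Invented sample text, for illustration only.
sample = "Our Nation is strong. The nation endures."
for left, hit, right in kwic(sample, "nation", window=10):
    print(f"{left:>10} | {hit} | {right:<10}")
```

Right-aligning the left context and left-aligning the right context lines the keyword up in a column, which makes patterns of usage easier to scan.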

In [None]:
kwic_gens = [textacy.text_utils.KWIC(doc.text, "Nation") for doc in corpus[:20]]

In [None]:
for kwic_gen in kwic_gens:
  for entry in kwic_gen:
    print(entry)

Textacy includes a lot of great information extraction and analysis features, including built-in [corpus vectorization](https://textacy.readthedocs.io/en/stable/api_reference/vsm_and_tm.html#module-textacy.vsm.vectorizers) and [topic modeling](https://textacy.readthedocs.io/en/stable/api_reference/vsm_and_tm.html#textacy.tm.topic_model.TopicModel) by way of [scikit-learn](https://scikit-learn.org/stable/). It also has [text pre-processing](https://textacy.readthedocs.io/en/stable/api_reference/text_processing.html) utilities with sensible defaults. 

In [information extraction](https://textacy.readthedocs.io/en/stable/api_reference/information_extraction.html) there are great tools to extract common structures, such as subject-verb-object triples and direct quotations.

While the current version of textacy doesn't support spaCy v3, the main developer, Burton DeWilde, is actively working on updating textacy for compatibility.

## Resources for spaCy

- [spaCy 101](https://spacy.io/usage/spacy-101) - spaCy's own intro documentation
- [Advanced NLP with spaCy](https://course.spacy.io/) - spaCy's own interactive learning course; you don't need to be "ready" for "advanced" work to benefit from going through this course
- [textacy](https://github.com/chartbeat-labs/textacy) - a Python library built on top of spaCy and scikit-learn that facilitates working with a corpus and provides extra functionality
- [spaCy universe](https://spacy.io/universe) - extensive collection of packages built on top of or with spaCy for various NLP and text analysis tasks
- [spaCy YouTube videos](https://www.youtube.com/c/ExplosionAI/videos) - Explosion has a lot of great videos on YouTube, and a number of other folks have created great walkthroughs of different parts of spaCy.