# Exploratory Text Analysis with Python

Workshop for the Southeast Data Librarian Symposium 2020

Scott Bailey <br/>
Digital Research and Scholarship Librarian <br/>
Copyright and Digital Scholarship Center <br/>
NC State University Libraries

## Outline
1. Intro and overview of NLP libraries
2. Document-level analysis <br/>
    a. Tokenization <br/>
    b. Cleaning text data <br />
    c. Part-of-speech tagging <br/>
    d. Named entity recognition <br/>
    e. Similarity vectors <br/>
    f. Rule-based matching <br />
3. Scaling up to corpus-level analysis
4. Further resources for spaCy

## What do we mean by "exploratory text analysis?"
- How clean are the data?
- What methods do the data support?
- Project scoping 
- Research question refinement
- Iterative research 

## A quick(!) overview of NLP-related libraries in Python
- [nltk](https://www.nltk.org/)
- [gensim](https://radimrehurek.com/gensim/)
- [scikit-learn](https://scikit-learn.org/stable/)
- [stanza/corenlp](https://stanfordnlp.github.io/stanza/)
- [spaCy](https://spacy.io/)
- [huggingface transformers - pytorch and tensorflow](https://github.com/huggingface/transformers)

### Why spaCy?

An opinionated, performant NLP library that does a lot of the work for you while revealing where you might need to do more custom refinement or model building. 

## Questions during the workshop

During the workshop, please do ask questions by way of the Zoom chat. I'll be keeping an eye on that, and will answer questions as we go. I'll also give some time during and after the workshop when folks can unmute and ask questions. 

## Jupyter Notebooks, Google Colab, and Binder


In [None]:
# Run this cell if working in Google Colab or Binder
# If working locally, add spaCy to your environment in the preferred way
# and in a shell with that environment, run the model download
!pip install spacy
!python -m spacy download en_core_web_md

In [None]:
from collections import Counter
import glob
import spacy
from spacy import displacy

In [None]:
import en_core_web_md

In [None]:
nlp = en_core_web_md.load()

In [None]:
# from https://se-datalibrarian.github.io/2020/about/
# I've added the final, untrue sentence, though, to make sure we have entities for when we hit named entity recognition.
sample_text = """The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.

In addition to learning about new work in the field, attendees will have the opportunity to network and build partnerships with regional colleagues. It is open to all who wish to attend, including students, data managers, and data scientists.

The Symposium has previously taken place in Athens, Georgia, and has been sponsored by Google for $10 million."""

In [None]:
doc = nlp(sample_text)

## Tokenization

In [None]:
for word in doc[:20]:
    print(word)

In [None]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

In [None]:
for sent in doc.sents:
    print(sent)

## Cleaning text data

In [None]:
# One of the common things we do in text analysis is to remove punctuation
no_punct = [token for token in doc if not token.is_punct]
for token in no_punct[:50]:
    print(token.text, token.is_punct)

In [None]:
# This has worked, but left in newline characters and spaces
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[:30]:
    print(token.text)

In [None]:
# Let's say we also want to remove numbers, and lowercase everything
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]

One other common bit of preprocessing is to remove stopwords, that is, the common words in a language that don't convey the information that we are looking for in our analysis. For example, if we looked for the most common words in a text, we would want to remove stopwords so that we don't only get words such as 'a,' 'the,' and 'and.'

In [None]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]

For this piece, we've used spaCy's built-in stopword list, which is used to create the property `is_stop` for each token. There's a good chance you would want to create custom stopword lists, though, especially if you're working with historical text or really domain-specific text. 

In [None]:
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["developments", "management"]

# Filter the already-cleaned tokens so the custom stopwords are removed on top of the earlier cleaning
custom_clean = [token for token in clean if token not in custom_stopwords]
custom_clean

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, whitespace, numbers, or stopwords. Depending on our analysis, we may or may not want to do this much cleaning. But it is good to understand how much we can do just with spaCy. 

### Since we can break apart the document and filter it now, it's a good time to start counting things

In [None]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

In [None]:
# number of sentences
len(list(doc.sents))

In [None]:
# Count all lower-cased tokens
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

In [None]:
# Count cleaned tokens
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

**Question:** Why do we have to use a list comprehension for the non-clean doc while we can just pass a variable directly for the cleaned set of tokens?

### Activity

In the cell below, write code to find the five most common noun chunks in the original doc. 

In [None]:
# Write code here

## Part-of-speech tagging

In [None]:
# Coarse grained UPOS: https://universaldependencies.org/docs/u/pos/
for token in doc[:20]:
    print(token.text, token.pos_)

In [None]:
# Fine-grained POS, Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:20]:
    print(token.text, token.tag_)

In [None]:
# Not sure what those tags are? Try spaCy's explain function
spacy.explain("DT")

In [None]:
# Collect tokens by part of speech
verbs = [token for token in doc if token.pos_ == "VERB"]
verbs

In [None]:
# Collect plural nouns
nouns_pl = [token for token in doc if token.tag_ == "NNS" or token.tag_ == "NNPS"]
nouns_pl

### Dependency tree visualization

In [None]:
single_sentence = list(doc.sents)[0]
single_sentence

In [None]:
# spaCy determines the dependency tree for its doc. Like POS, we can see the dependency tags of each token. 
for token in single_sentence:
    print(token.text, token.dep_)

In [None]:
spacy.explain("dobj")

In [None]:
displacy.render(single_sentence, style="dep")

## Named entity recognition

[List of entity types in spaCy](https://spacy.io/api/annotation#named-entities)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)    

### Activity

Add or modify a sentence in the original `sample_text` so that spaCy will detect a PERSON. Then, in the cell below, write code to return a list of all entities that are either PERSON or GPE.

**hint**: make sure to reprocess the `sample_text` with the `nlp` model. 

In [None]:
# Write code here

### Visualizing named entities

In [None]:
single_sentence = list(doc.sents)[-1]
displacy.render(single_sentence, style="ent")

## Word, sentence, and document vectors

SpaCy's medium (`md`) and large (`lg`) models include GloVe word vectors trained on the [Common Crawl](https://commoncrawl.org/). 

You could train your own vectors with `gensim` and `word2vec`, use a large language model, or turn to many other libraries and algorithms. But if your text is fairly recent and especially from the web, the Common Crawl vectors might be enough, especially for exploratory work. 

`Token`s have vectors. `Doc`s and `Span`s have vectors that are the average of their token vectors. 
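
To make the averaging concrete, here is a minimal sketch in plain Python. The `mean_vector` helper and the tiny three-dimensional vectors are made up for illustration; they are not spaCy's implementation, and the vectors printed in the cells below are much longer.

```python
def mean_vector(vectors):
    """Return the element-wise average of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    # zip(*vectors) groups the first components together, then the second, and so on
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical toy "token vectors" (assumed values, not taken from a model)
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

# The toy "doc vector" is the average of its token vectors
toy_doc_vector = mean_vector(toy_token_vectors)
print(toy_doc_vector)
```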

In [None]:
# token vectors
for token in doc[:5]:
    print(token.vector)

In [None]:
# doc vector
doc.vector

In [None]:
# sentence/span vector
list(doc.sents)[0].vector

This is fine, but for exploratory work, we might just be interested in some similarity measures between tokens, sentences, or documents. SpaCy uses the common cosine similarity measure.
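
If cosine similarity is new to you, the following is a rough sketch of the calculation in plain Python. The `cosine_similarity` function and the toy vectors are illustrative assumptions, not spaCy's own code; returning `0.0` for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: treat an all-zero vector as having no similarity
        return 0.0
    return dot / (norm_a * norm_b)

# Toy two-dimensional vectors (assumed values for illustration)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Note that the measure depends only on the direction of the vectors, not their length, which is why `[1.0, 2.0]` and `[2.0, 4.0]` compare as they do.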

In [None]:
for token1 in doc[:10]:
    for token2 in doc[:10]:
        print(token1.text, token2.text, token1.similarity(token2))

**Question**: Looking at the results, can you explain the scale of the similarity score?

In [None]:
for sent1 in doc.sents:
    for sent2 in doc.sents:
        print(sent1.text, sent2.text, "\n", sent1.similarity(sent2))
        print("----------------------------------------------")

## Rule based matcher

Rule-based matching is an incredibly powerful complement to the statistical models of spaCy. It's also a bit complex though, and it's worth looking at the docs [here](https://spacy.io/usage/rule-based-matching).

In [None]:
for sent in doc.sents:
    print(sent)

In [None]:
from spacy.matcher import Matcher

In [None]:
matcher = Matcher(nlp.vocab)

[Available token attributes for the `Matcher` pattern](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)

In [None]:
pattern = [{'LOWER': 'symposium'},
           {'DEP': 'aux'}]
# spaCy v3 expects a list of patterns as the second argument
matcher.add("sympo+aux", [pattern])

In [None]:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

One of the easiest ways to build up these `Matcher` patterns is to use their online [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher). 

## Working with multiple documents (a corpus)

For a small corpus, you can build a list or dictionary of processed spaCy docs. Once you have that list or dictionary, you can apply the same kind of code we've written above across the larger data structure. 

For larger corpora, though, you might need to think about streaming data or distributed processing. 
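
As one possible starting point for streaming, a Python generator can hand over one text at a time instead of holding every file in memory. This is only a sketch: `stream_texts` is a hypothetical helper, and the demo writes two throwaway files so it runs without the workshop corpus.

```python
import glob
import os
import tempfile

def stream_texts(pattern):
    """Yield (filename, text) pairs lazily, one file at a time."""
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            yield os.path.basename(path), f.read()

# Demo on temporary files (assumed contents, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "first text"), ("b.txt", "second text")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    # list() is used here only to show the results; in practice you would loop over the generator
    streamed = list(stream_texts(os.path.join(tmp, "*.txt")))

print(streamed)
```

In a real workflow, each yielded text could be processed and reduced to the counts or entities you need before moving on to the next file.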

In [None]:
# Use the raw URL; the "blob" URL returns an HTML page rather than the zip file
!wget https://github.com/csbailey5t/sedls/raw/master/aspca-texts.zip
!unzip aspca-texts.zip

In [None]:
fns = glob.glob("texts/*.txt")
len(fns)

In [None]:
texts = []
for fn in fns:
    with open(fn, 'r') as f:
        texts.append(f.read())

In [None]:
%time corpus = [nlp(text) for text in texts[:5]]

In [None]:
for doc in corpus:
    for ent in doc.ents:
        print(ent.text, ent.label_)

In [None]:
# Collect all geo-political entities from whole corpus
# The outer loop (over docs) must come first in the comprehension
gpes = [(ent.text, ent.label_) for doc in corpus for ent in doc.ents if ent.label_ == "GPE"]
len(gpes)

In [None]:
gpes

In [None]:
# get the set of unique GPEs
set(gpes)

### Activity

Choose a method from the single document analysis portion of the workshop, and apply it to this small corpus. For example, you could find the most common words, create a cleaned corpus, or aggregate parts of speech. 

In [None]:
# Write code here

spaCy also provides a `pipe` method on the language model that processes documents in batches. This can be useful for larger collections of texts. We'll only see a small advantage in our small corpus, but it becomes more significant as you use larger batch sizes and more processes. 

https://spacy.io/api/language#pipe
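
To get a feel for what batching means, here is a small plain-Python sketch that splits a stream of texts into fixed-size groups. The `batched` helper is hypothetical and is not how spaCy implements `pipe` internally; it only illustrates the idea of handing over several texts at once.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Seven toy texts split into batches of three (assumed data for illustration)
toy_texts = [f"text {i}" for i in range(7)]
batches = list(batched(toy_texts, batch_size=3))
batch_sizes = [len(batch) for batch in batches]

print(batch_sizes)
```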

In [None]:
%time docs = [nlp(text) for text in texts]

In [None]:
%time docs = list(nlp.pipe(texts, batch_size=10, n_process=1))

## Resources for spaCy

- [spaCy 101](https://spacy.io/usage/spacy-101) - spaCy's own intro documentation
- [Advanced NLP with spaCy](https://course.spacy.io/) - spaCy's own interactive learning course; you don't need to be "ready" for "advanced" work to benefit from going through this course
- [textacy](https://github.com/chartbeat-labs/textacy) - a Python library built on top of spaCy and scikit-learn to facilitate working with a corpus and provide extra functionality
- [spaCy universe](https://spacy.io/universe) - extensive collection of packages built on top of or with spaCy for various NLP and text analysis tasks

## Activity?

I'm happy to stay on for a while and answer questions or help if anyone would like to work with one of their own texts in spaCy to try out some of these techniques/approaches.