# Exploratory Text Analysis with Python

Workshop for the Southeast Data Librarian Symposium 2020

Scott Bailey <br/>
Digital Research and Scholarship Librarian <br/>
Copyright and Digital Scholarship Center <br/>
NC State University Libraries

## What do we mean by "exploratory text analysis?"

## A quick (!) overview of NLP libraries in Python
### Why spaCy?

## Jupyter Notebooks, Google Colab, and Binder



In [None]:
# Run this cell if working in Google Colab or Binder
!pip install spacy
!python -m spacy download en_core_web_md

In [None]:
import glob
import spacy
from spacy import displacy

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
# from https://se-datalibrarian.github.io/2020/about/
# I've added the final, untrue sentence, though, to make sure we have entities for when we hit named entity recognition.
sample_text = """The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.

In addition to learning about new work in the field, attendees will have the opportunity to network and build partnerships with regional colleagues. It is open to all who wish to attend, including students, data managers, and data scientists.

The Symposium has previously taken place in Athens, Georgia, and has been sponsored by Google for $10 million."""

In [None]:
doc = nlp(sample_text)

## Tokenization

In [None]:
for word in doc[:20]:
    print(word)

In [None]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

In [None]:
for sent in doc.sents:
    print(sent)

## Part-of-speech tagging

In [None]:
# Coarse-grained UPOS: https://universaldependencies.org/docs/u/pos/
for token in doc[:20]:
    print(token.text, token.pos_)

In [None]:
# Fine-grained POS, Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:20]:
    print(token.text, token.tag_)

In [None]:
# Collect tokens by part of speech
verbs = [token for token in doc if token.pos_ == "VERB"]
verbs

In [None]:
# Collect plural nouns
nouns_pl = [token for token in doc if token.tag_ in ("NNS", "NNPS")]
nouns_pl

### Visualizing the dependency tree

In [None]:
single_sentence = list(doc.sents)[0]
single_sentence

In [None]:
displacy.render(single_sentence, style="dep")

## Named entity recognition

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

### Visualizing named entities

In [None]:
single_sentence = list(doc.sents)[-1]
displacy.render(single_sentence, style="ent", jupyter=True)

## Word, sentence, and document vectors

In [None]:
# similarity
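
By default, spaCy's `similarity` methods on tokens, spans, and docs return the cosine similarity of the underlying vectors, and a doc's vector is the average of its token vectors. The sketch below shows that calculation with plain Python lists. The function name `cosine_similarity` and the three toy vectors are invented for illustration; they are not spaCy output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction to compare,
        # e.g. a token the model has no vector for.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, made up for this example.
# The vectors in en_core_web_md have 300 dimensions.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.2]

print(cosine_similarity(cat, dog))  # vectors pointing in similar directions
print(cosine_similarity(cat, car))  # vectors pointing in different directions
```

In the notebook itself, the equivalent calls would be along the lines of `doc[0].similarity(doc[1])` for two tokens, or `doc.similarity(other_doc)` for two processed texts.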

## Cleaning text data

- stopwords
- lemmas
- punctuation
- alphanum
- lowercase
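
Each item in the list above maps onto a token attribute in spaCy: `is_stop`, `lemma_`, `is_punct`, and `is_alpha`, plus ordinary string lowercasing. A minimal sketch of one way to combine them follows. The helper name `clean_tokens` and its `use_lemmas` flag are hypothetical, and it reads "alphanum" as "keep alphabetic tokens only", which is an assumption; adjust the filter if you want to keep numbers.

```python
def clean_tokens(doc, use_lemmas=True):
    """Return lowercased words from a spaCy doc, dropping stopwords,
    punctuation, and tokens that are not purely alphabetic."""
    cleaned = []
    for token in doc:
        # Skip stopwords, punctuation, and non-alphabetic tokens (numbers, symbols).
        if token.is_stop or token.is_punct or not token.is_alpha:
            continue
        # Use the lemma (dictionary form) or the surface text, then lowercase it.
        text = token.lemma_ if use_lemmas else token.text
        cleaned.append(text.lower())
    return cleaned


# Usage in this notebook, assuming `doc` was created above:
# clean_tokens(doc)[:20]
```

Which of these steps to apply depends on the question you are asking: removing stopwords helps when counting content words, for example, but would hurt if you later want to look at phrases.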

## Working with multiple documents (a corpus)

For a small corpus, you can build a list of processed spaCy docs. 

In [None]:
# Use the /raw/ URL: the /blob/ URL returns GitHub's HTML page, not the zip file
!wget https://github.com/csbailey5t/sedls/raw/master/aspca-texts.zip
!unzip aspca-texts.zip

In [None]:
fns = glob.glob("texts/*.txt")
len(fns)

In [None]:
texts = []
for fn in fns:
    with open(fn, 'r', encoding='utf-8') as f:
        texts.append(f.read())

In [None]:
%time corpus = [nlp(text) for text in texts[:5]]

In [None]:
for doc in corpus:
    for ent in doc.ents:
        print(ent.text, ent.label_)

In [None]:
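# Collect all geo-political entities from the processed corpus
# The outer loop (docs) must come first in the comprehension, then the inner loop (entities)
gpes = [(ent.text, ent.label_) for doc in corpus for ent in doc.ents if ent.label_ == "GPE"]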
len(gpes)

In [None]:
gpes

In [None]:
# get the set of unique GPEs
set(gpes)
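
A set shows which places appear but not how often. As a small sketch, the standard library's `Counter` can tally entity mentions across docs. The helper name `count_entities` is hypothetical; it relies only on each doc having `.ents` and each entity having `.text` and `.label_`, as spaCy docs do.

```python
from collections import Counter


def count_entities(docs, label="GPE"):
    """Count how often each entity text with the given label occurs across docs."""
    return Counter(
        ent.text
        for doc in docs
        for ent in doc.ents
        if ent.label_ == label
    )


# Usage in this notebook, assuming `corpus` was built above:
# count_entities(corpus).most_common(10)
```

Passing a different label, such as `"ORG"` or `"PERSON"`, gives the same tally for other entity types.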

spaCy also provides a `pipe` method on the language object (`nlp`) that processes texts in batches rather than one at a time. This can be useful for larger collections of texts. We'll see only a small advantage on our small corpus, but the gain grows with larger batch sizes and more processes (`n_process`).

https://spacy.io/api/language#pipe

In [None]:
%time docs = [nlp(text) for text in texts]

In [None]:
%time docs = list(nlp.pipe(texts, batch_size=10, n_process=1))
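
Because `nlp.pipe` accepts any iterable of strings, the texts do not have to be read into a list first. The sketch below streams file contents with a generator; the name `iter_texts` and the UTF-8 default are assumptions for illustration.

```python
def iter_texts(paths, encoding="utf-8"):
    """Yield the contents of each file, one at a time."""
    for path in paths:
        with open(path, "r", encoding=encoding) as f:
            yield f.read()


# Usage in this notebook, assuming `fns` and `nlp` exist as above:
# docs = list(nlp.pipe(iter_texts(fns), batch_size=10))
```

Wrapping the result in `list(...)` still keeps every processed doc in memory. For a very large corpus, you could instead loop over `nlp.pipe(...)` and keep only what you need from each doc, such as its entities.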

## Resources for spaCy

- [spaCy 101](https://spacy.io/usage/spacy-101)
- [Advanced NLP with spaCy](https://course.spacy.io/)
- [textacy](https://github.com/chartbeat-labs/textacy) - a Python library built on top of spaCy and scikit-learn that facilitates working with a corpus and provides extra functionality