# Exploratory Text Analysis with Python

Workshop for the Southeast Data Librarian Symposium 2020

Scott Bailey <br/>
Digital Research and Scholarship Librarian <br/>
Copyright and Digital Scholarship Center <br/>
NC State University Libraries

## Outline
1. Intro and overview of NLP libraries
2. Document-level analysis <br/>
    a. Tokenization <br/>
    b. Cleaning text data <br />
    c. Part-of-speech tagging <br/>
    d. Named entity recognition <br/>
    e. Similarity vectors <br/>
    f. Rule-based matching <br />
3. Scaling up to corpus-level analysis
4. Further resources for spaCy

## What do we mean by "exploratory text analysis?"
- How clean are the data?
- What methods do the data support?
- Project scoping 
- Research question refinement
- Iterative research 

## A quick(!) overview of NLP-related libraries in Python
- [nltk](https://www.nltk.org/)
- [gensim](https://radimrehurek.com/gensim/)
- [scikit-learn](https://scikit-learn.org/stable/)
- [stanza/corenlp](https://stanfordnlp.github.io/stanza/)
- [spaCy](https://spacy.io/)
- [huggingface transformers - pytorch and tensorflow](https://github.com/huggingface/transformers)

### Why spaCy?

An opinionated, performant NLP that does a lot of the work for you while revealing where you might need to do more custom refinement or model building. 

## Questions during the workshop

During the workshop, please do ask questions by way of the Zoom chat. I'll be keeping an eye on that, and will answer questions as we go. I'll also give some time during and after the workshop when folks can unmute and ask questions. 

## Jupyter Notebooks, Google Colab, and Binder


In [None]:
# Run this cell if working in Google Colab or Binder
# If working locally, add spaCy to your environment in the preferred way
# and in a shell with that environment, run the model download
!pip install spacy
!python -m spacy download en_core_web_md

In [31]:
from collections import Counter
import glob
import spacy
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_md")

In [33]:
# from https://se-datalibrarian.github.io/2020/about/
# I've added the final, untrue sentence, though, to make sure we have entities for when we hit named entity recognition.
sample_text = """The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.

In addition to learning about new work in the field, attendees will have the opportunity to network and build partnerships with regional colleagues. It is open to all who wish to attend, including students, data managers, and data scientists.

The Symposium has previously taken place in Athens, Georgia, and has been sponsored by Google for $10 million."""

In [34]:
doc = nlp(sample_text)

## Tokenization

In [35]:
for word in doc[:20]:
    print(word)

The
Southeast
Data
Librarian
Symposium
is
intended
to
provide
an
opportunity
for
librarians
and
other
research
data
specialists
to
explore


In [36]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

The Southeast Data Librarian Symposium
an opportunity
librarians
other research data specialists
developments
the field
data librarianship
the management
sharing
research data
addition
new work
the field
attendees
the opportunity
partnerships
regional colleagues
It
who
students
data managers
data scientists
The Symposium
place
Athens
Georgia
Google


In [37]:
for sent in doc.sents:
    print(sent)

The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.


In addition to learning about new work in the field, attendees will have the opportunity to network and build partnerships with regional colleagues.
It is open to all who wish to attend, including students, data managers, and data scientists.


The Symposium has previously taken place in Athens, Georgia, and has been sponsored by Google for $10 million.


## Cleaning text data

In [40]:
# One of the common things we do in text analysis is to remove punctuation
no_punct = [token for token in doc if token.is_punct == False]
for token in no_punct[:50]:
  print(token.text, token.is_punct)

The False
Southeast False
Data False
Librarian False
Symposium False
is False
intended False
to False
provide False
an False
opportunity False
for False
librarians False
and False
other False
research False
data False
specialists False
to False
explore False
developments False
in False
the False
field False
of False
data False
librarianship False
including False
the False
management False
and False
sharing False
of False
research False
data False


 False
In False
addition False
to False
learning False
about False
new False
work False
in False
the False
field False
attendees False
will False
have False
the False


In [41]:
# This has worked, but left in new line characters and spaces
no_punct_or_space = [token for token in doc if token.is_punct == False and token.is_space == False]
for token in no_punct_or_space[:30]:
  print(token.text)

The
Southeast
Data
Librarian
Symposium
is
intended
to
provide
an
opportunity
for
librarians
and
other
research
data
specialists
to
explore
developments
in
the
field
of
data
librarianship
including
the
management


In [42]:
# Let's say we also want to remove numbers, and lowercase everything
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha == True]
lower_alpha[:30]

[&#39;the&#39;,
 &#39;southeast&#39;,
 &#39;data&#39;,
 &#39;librarian&#39;,
 &#39;symposium&#39;,
 &#39;is&#39;,
 &#39;intended&#39;,
 &#39;to&#39;,
 &#39;provide&#39;,
 &#39;an&#39;,
 &#39;opportunity&#39;,
 &#39;for&#39;,
 &#39;librarians&#39;,
 &#39;and&#39;,
 &#39;other&#39;,
 &#39;research&#39;,
 &#39;data&#39;,
 &#39;specialists&#39;,
 &#39;to&#39;,
 &#39;explore&#39;,
 &#39;developments&#39;,
 &#39;in&#39;,
 &#39;the&#39;,
 &#39;field&#39;,
 &#39;of&#39;,
 &#39;data&#39;,
 &#39;librarianship&#39;,
 &#39;including&#39;,
 &#39;the&#39;,
 &#39;management&#39;]

One other common bit of preprocessing is to remove stopwords, that is, the common words in a language that don't convey the information that we are looking for in our analysis. For example, if we looked for the most common words in a text, we would want to remove stopwords so that we don't only get words such as 'a,' 'the,' and 'and.'

In [43]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha == True and token.is_stop == False]
clean[:30]

[&#39;southeast&#39;,
 &#39;data&#39;,
 &#39;librarian&#39;,
 &#39;symposium&#39;,
 &#39;intended&#39;,
 &#39;provide&#39;,
 &#39;opportunity&#39;,
 &#39;librarians&#39;,
 &#39;research&#39;,
 &#39;data&#39;,
 &#39;specialists&#39;,
 &#39;explore&#39;,
 &#39;developments&#39;,
 &#39;field&#39;,
 &#39;data&#39;,
 &#39;librarianship&#39;,
 &#39;including&#39;,
 &#39;management&#39;,
 &#39;sharing&#39;,
 &#39;research&#39;,
 &#39;data&#39;,
 &#39;addition&#39;,
 &#39;learning&#39;,
 &#39;new&#39;,
 &#39;work&#39;,
 &#39;field&#39;,
 &#39;attendees&#39;,
 &#39;opportunity&#39;,
 &#39;network&#39;,
 &#39;build&#39;]

For this piece, we've used spaCy's built in stopword list, which is used to create the property `is_stop` for each token. There's a good chance you would want to create custom stopwords lists though, especially if you're working with historical text or really domain-specific text. 

In [44]:
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["developments", "management"]

custom_clean = [token.lower_ for token in doc if token.lower_ not in custom_stopwords]
custom_clean

[&#39;the&#39;,
 &#39;southeast&#39;,
 &#39;data&#39;,
 &#39;librarian&#39;,
 &#39;symposium&#39;,
 &#39;is&#39;,
 &#39;intended&#39;,
 &#39;to&#39;,
 &#39;provide&#39;,
 &#39;an&#39;,
 &#39;opportunity&#39;,
 &#39;for&#39;,
 &#39;librarians&#39;,
 &#39;and&#39;,
 &#39;other&#39;,
 &#39;research&#39;,
 &#39;data&#39;,
 &#39;specialists&#39;,
 &#39;to&#39;,
 &#39;explore&#39;,
 &#39;in&#39;,
 &#39;the&#39;,
 &#39;field&#39;,
 &#39;of&#39;,
 &#39;data&#39;,
 &#39;librarianship&#39;,
 &#39;,&#39;,
 &#39;including&#39;,
 &#39;the&#39;,
 &#39;and&#39;,
 &#39;sharing&#39;,
 &#39;of&#39;,
 &#39;research&#39;,
 &#39;data&#39;,
 &#39;.&#39;,
 &#39;\n\n&#39;,
 &#39;in&#39;,
 &#39;addition&#39;,
 &#39;to&#39;,
 &#39;learning&#39;,
 &#39;about&#39;,
 &#39;new&#39;,
 &#39;work&#39;,
 &#39;in&#39;,
 &#39;the&#39;,
 &#39;field&#39;,
 &#39;,&#39;,
 &#39;attendees&#39;,
 &#39;will&#39;,
 &#39;have&#39;,
 &#39;the&#39;,
 &#39;opportunity&#39;,
 &#39;to&#39;,
 &#39;network&#39;,
 &#39;and&#39;,
 &#39;bui

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, white-space, numbers, or stopwords. Depending on our analysis, we may or may not want to do this much cleaning. But, it is good to understand how much we can do just with spaCy. 

### Since we can break apart the document and filter it now, it's a good time to start counting things

In [45]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

Number of tokens in document:  106
Number of tokens in cleaned document:  51
Number of unique tokens in cleaned document:  41


In [46]:
# number of sentences
len(list(doc.sents))

4

In [47]:
# Count all lower-cased tokens
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

[(&#39;,&#39;, 7),
 (&#39;the&#39;, 6),
 (&#39;data&#39;, 6),
 (&#39;to&#39;, 6),
 (&#39;and&#39;, 5),
 (&#39;in&#39;, 4),
 (&#39;.&#39;, 4),
 (&#39;symposium&#39;, 2),
 (&#39;is&#39;, 2),
 (&#39;opportunity&#39;, 2),
 (&#39;for&#39;, 2),
 (&#39;research&#39;, 2),
 (&#39;field&#39;, 2),
 (&#39;of&#39;, 2),
 (&#39;including&#39;, 2),
 (&#39;\n\n&#39;, 2),
 (&#39;has&#39;, 2),
 (&#39;southeast&#39;, 1),
 (&#39;librarian&#39;, 1),
 (&#39;intended&#39;, 1)]

In [48]:
# Count cleaned tokens
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

[(&#39;data&#39;, 6),
 (&#39;symposium&#39;, 2),
 (&#39;opportunity&#39;, 2),
 (&#39;research&#39;, 2),
 (&#39;field&#39;, 2),
 (&#39;including&#39;, 2),
 (&#39;southeast&#39;, 1),
 (&#39;librarian&#39;, 1),
 (&#39;intended&#39;, 1),
 (&#39;provide&#39;, 1),
 (&#39;librarians&#39;, 1),
 (&#39;specialists&#39;, 1),
 (&#39;explore&#39;, 1),
 (&#39;developments&#39;, 1),
 (&#39;librarianship&#39;, 1),
 (&#39;management&#39;, 1),
 (&#39;sharing&#39;, 1),
 (&#39;addition&#39;, 1),
 (&#39;learning&#39;, 1),
 (&#39;new&#39;, 1)]

**Question:** Why do we have to use a list comprehension for the non-clean doc while we can just pass a variable directly for the cleaned set of tokens?

### Activity

In the cell below, write code to find the five most common noun chunks in the original doc. 

In [50]:
# Write code here

## Part-of-speech tagging

In [8]:
# Coarse grained UPOS: https://universaldependencies.org/docs/u/pos/
for token in doc[:20]:
    print(token.text, token.pos_)

The DET
Southeast PROPN
Data PROPN
Librarian PROPN
Symposium PROPN
is AUX
intended VERB
to PART
provide VERB
an DET
opportunity NOUN
for ADP
librarians NOUN
and CCONJ
other ADJ
research NOUN
data NOUN
specialists NOUN
to PART
explore VERB


In [9]:
# Fine-grained POS, Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:20]:
    print(token.text, token.tag_)

The DT
Southeast NNP
Data NNP
Librarian NNP
Symposium NNP
is VBZ
intended VBN
to TO
provide VB
an DT
opportunity NN
for IN
librarians NNS
and CC
other JJ
research NN
data NNS
specialists NNS
to TO
explore VB


In [52]:
# Not sure what those tags are? Try spaCy's explain function
spacy.explain("DT")

&#39;determiner&#39;

In [10]:
# Collect tokens by part of speech
verbs = [token for token in doc if token.pos_ == "VERB"]
verbs

[intended,
 provide,
 explore,
 including,
 learning,
 will,
 network,
 build,
 wish,
 attend,
 including,
 taken,
 sponsored]

In [11]:
# Collect plural nouns
nouns_pl = [token for token in doc if token.tag_ == "NNS" or token.tag_ == "NNPS"]
nouns_pl

[librarians,
 data,
 specialists,
 developments,
 data,
 data,
 attendees,
 partnerships,
 colleagues,
 students,
 managers,
 data,
 scientists]

### Dependency tree visualization

In [12]:
single_sentence = list(doc.sents)[0]
single_sentence

The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.


In [51]:
# spaCy determines the dependency tree for it's doc. Like POS, we can see the dependency tags of each token. 
for token in single_sentence:
    print(token.text, token.dep_)

The det
Symposium nsubj
has aux
previously advmod
taken ROOT
place dobj
in prep
Athens pobj
, punct
Georgia appos
, punct
and cc
has aux
been auxpass
sponsored conj
by agent
Google pobj
for prep
$ quantmod
10 compound
million pobj
. punct


In [54]:
spacy.explain("dobj")

&#39;direct object&#39;

In [13]:
displacy.render(single_sentence, style="dep")

## Named entity recognition

[List of entity types in spaCy](https://spacy.io/api/annotation#named-entities)

In [14]:
for ent in doc.ents:
    print(ent.text, ent.label_)    

Symposium ORG
Athens GPE
Georgia GPE
Google ORG
$10 million MONEY


### Activity

Add or modify a sentence in the original `sample_text` so that spaCy will detect a PERSON. Then, in the cell below, write code to return a list of all entities that are either PERSON or GPE.

**hint**: make sure to reprocess the `sample_text` with the `nlp` model. 

In [56]:
# Write code here

### Visualizing named entities

In [55]:
single_sentence = list(doc.sents)[-1]
displacy.render(single_sentence, style="ent")

## Word, sentence, and document vectors

SpaCy's medium (`md`) and large (`lg`) models include GloVe word vectors trained on the [Common Crawl](https://commoncrawl.org/). 

You could train your own vectors with `gensim` and `word2vec`, use a large language model, or many other libraries and algorithms. But, if you're text is fairly recent and especially from the web, the common crawl vectors might be enough, especially for exploratory work. 

`Token`s have vectors. `Doc`s and `Span`s have vectors that are the average of their token vectors. 

In [57]:
# token vectors
for token in doc[:5]:
    print(token.vector)

[ 2.7204e-01 -6.2030e-02 -1.8840e-01  2.3225e-02 -1.8158e-02  6.7192e-03
 -1.3877e-01  1.7708e-01  1.7709e-01  2.5882e+00 -3.5179e-01 -1.7312e-01
  4.3285e-01 -1.0708e-01  1.5006e-01 -1.9982e-01 -1.9093e-01  1.1871e+00
 -1.6207e-01 -2.3538e-01  3.6640e-03 -1.9156e-01 -8.5662e-02  3.9199e-02
 -6.6449e-02 -4.2090e-02 -1.9122e-01  1.1679e-02 -3.7138e-01  2.1886e-01
  1.1423e-03  4.3190e-01 -1.4205e-01  3.8059e-01  3.0654e-01  2.0167e-02
 -1.8316e-01 -6.5186e-03 -8.0549e-03 -1.2063e-01  2.7507e-02  2.9839e-01
 -2.2896e-01 -2.2882e-01  1.4671e-01 -7.6301e-02 -1.2680e-01 -6.6651e-03
 -5.2795e-02  1.4258e-01  1.5610e-01  5.5510e-02 -1.6149e-01  9.6290e-02
 -7.6533e-02 -4.9971e-02 -1.0195e-02 -4.7641e-02 -1.6679e-01 -2.3940e-01
  5.0141e-03 -4.9175e-02  1.3338e-02  4.1923e-01 -1.0104e-01  1.5111e-02
 -7.7706e-02 -1.3471e-01  1.1900e-01  1.0802e-01  2.1061e-01 -5.1904e-02
  1.8527e-01  1.7856e-01  4.1293e-02 -1.4385e-02 -8.2567e-02 -3.5483e-02
 -7.6173e-02 -4.5367e-02  8.9281e-02  3.3672e-01 -2

In [58]:
# doc vector
doc.vector

array([ 1.53805362e-02,  9.89923999e-02, -6.09041788e-02, -2.92326603e-02,
        5.46372421e-02,  4.40402627e-02, -2.79662535e-02, -2.95366850e-02,
        4.45587561e-02,  2.31542850e+00, -3.21038872e-01,  5.43637760e-02,
        8.54945108e-02, -4.63223867e-02, -9.17983875e-02, -4.54782806e-02,
       -1.86549556e-02,  1.29680729e+00, -2.47928023e-01, -9.78325754e-02,
        2.06786469e-02,  4.06484790e-02, -9.59417745e-02, -2.08438411e-02,
        1.07212409e-01,  7.01129287e-02, -2.15646476e-02,  3.46690156e-02,
        4.41260561e-02,  4.27467749e-03,  1.24599645e-02,  2.47369166e-02,
        5.78243099e-02,  9.69910920e-02,  4.92216498e-02, -1.50299355e-01,
        3.92799452e-03, -1.92086287e-02, -6.43787906e-02, -2.49829479e-02,
        7.11825094e-04,  8.52283090e-02,  8.23565014e-03, -7.96478465e-02,
       -7.71624744e-02,  9.76254344e-02, -5.98791167e-02,  2.87836324e-02,
        2.46993154e-02, -7.24440217e-02, -3.51522490e-02, -4.19717357e-02,
       -3.80723774e-02, -

In [59]:
# sentence/span vector
list(doc.sents)[0].vector

array([ 1.85621288e-02,  3.02547310e-02, -5.02663441e-02,  2.21122354e-02,
        5.58689535e-02,  8.78722295e-02, -4.20720875e-02, -3.67098721e-04,
        1.50770638e-02,  2.30668187e+00, -3.48410964e-01,  3.92627604e-02,
        1.12249628e-01, -2.02283431e-02, -4.74841110e-02, -4.37880866e-02,
        2.32680459e-02,  1.45328808e+00, -2.73608267e-01, -1.67115375e-01,
        7.36747608e-02,  8.68071243e-02, -1.54967710e-01, -4.44735438e-02,
        1.48329899e-01,  1.06674433e-01,  7.03678504e-02,  7.79173970e-02,
        2.13212259e-02,  4.59656641e-02,  1.91984288e-02,  6.01940528e-02,
        1.11045085e-01,  1.03413790e-01,  5.46271913e-02, -1.74142405e-01,
       -3.82109955e-02, -3.49233598e-02, -1.15441985e-01, -6.03949092e-02,
        3.07112243e-02,  1.53818384e-01, -9.27654803e-02, -5.75195812e-02,
       -1.00384265e-01,  8.87980312e-02, -4.40394990e-02,  6.51326254e-02,
        3.85178179e-02, -6.49373755e-02, -2.22407817e-03, -6.20102175e-02,
       -2.61375662e-02,  

This is fine, but for exploratory work, we might just be interested in some similarity measures between tokens, sentences, or documents. SpaCy uses the common cosine similarity measure.

In [62]:
for token1 in doc[:10]:
    for token2 in doc[:10]:
        print(token1.text, token2.text, token1.similarity(token2))

The The 1.0
The Southeast 0.33651814
The Data 0.326684
The Librarian 0.25095838
The Symposium 0.3484234
The is 0.5536109
The intended 0.44844761
The to 0.5877379
The provide 0.43726653
The an 0.5969309
Southeast The 0.33651814
Southeast Southeast 1.0
Southeast Data 0.06996884
Southeast Librarian 0.20938757
Southeast Symposium 0.21938327
Southeast is 0.26847538
Southeast intended 0.129104
Southeast to 0.2922006
Southeast provide 0.15326436
Southeast an 0.20995641
Data The 0.326684
Data Southeast 0.06996884
Data Data 1.0
Data Librarian 0.105767384
Data Symposium 0.21679558
Data is 0.28532642
Data intended 0.3011669
Data to 0.3160074
Data provide 0.44369358
Data an 0.2718177
Librarian The 0.25095838
Librarian Southeast 0.20938757
Librarian Data 0.105767384
Librarian Librarian 1.0
Librarian Symposium 0.25176087
Librarian is 0.30456054
Librarian intended 0.24139608
Librarian to 0.28961083
Librarian provide 0.21883203
Librarian an 0.2783703
Symposium The 0.3484234
Symposium Southeast 0.21938

**Question**: Looking at the results, can you explain the scale of the similarity score?

In [67]:
for sent1 in doc.sents:
    for sent2 in doc.sents:
        print(sent1.text, sent2.text, "\n", sent1.similarity(sent2))
        print("----------------------------------------------")

The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.

 The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.

 
 1.0
----------------------------------------------
The Southeast Data Librarian Symposium is intended to provide an opportunity for librarians and other research data specialists to explore developments in the field of data librarianship, including the management and sharing of research data.

 In addition to learning about new work in the field, attendees will have the opportunity to network and build partnerships with regional colleagues. 
 0.9366893
----------------------------------------------
The 

## Rule based matcher

## Working with multiple documents (a corpus)

For a small corpus, you can build a list of processed spaCy docs. 

In [17]:
!wget https://github.com/csbailey5t/sedls/blob/master/aspca-texts.zip
!unzip aspca-texts.zip

--2020-10-06 08:35:44--  https://github.com/csbailey5t/sedls/blob/master/aspca-texts.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aspca-texts.zip.1’

aspca-texts.zip.1       [ &lt;=&gt;                ]  79.63K  --.-KB/s    in 0.03s   

2020-10-06 08:35:46 (2.35 MB/s) - ‘aspca-texts.zip.1’ saved [81542]

Archive:  aspca-texts.zip
replace texts/aspca_20200309_3412.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [18]:
fns = glob.glob("texts/*.txt")
len(fns)

108

In [19]:
texts = []
for fn in fns:
    with open(fn, 'r') as f:
        texts.append(f.read())

In [20]:
%time corpus = [nlp(text) for text in texts[:5]]

CPU times: user 5.44 s, sys: 611 ms, total: 6.05 s
Wall time: 6.07 s


In [21]:
for doc in corpus:
    for ent in doc.ents:
        print(ent.text, ent.label_)

GPE
Higgins PERSON
C. R. Davies PERSON
Cheshire PERSON
Altrincham GPE
9 CARDINAL
E. Parker PERSON
hon ORG
107 MONEY
7 CARDINAL
Denbighshire PERSON
K. Ashcroft PERSON
Denbigh
 ORG
hon ORG
Scala PERSON
£7 65 MONEY
7d CARDINAL
DeVon PERSON
Witney Jones PERSON
Mrs Robert Jones PERSON
Lynton GPE
Lynmouth PERSON
the Town Hall FAC
Hillier PERSON
Countess Fortescue ORG
Devon PERSON
Jennifer Pearce PERSON
Mrs Ambrose PERSON
Inspectbr Ambrose PERSON
M. Mackay PERSON
hon ORG
Dorset GPE
North GPE
annual DATE
Mrs Baker PERSON
hon ORG
West Moors GPE
67 MONEY
Durham GPE
G. McIntyre PERSON
Beverley Studio ORG
Barnes ORG
jamieson PERSON
Barnes Auxiliary ORG
Hon ORG
Caledonian NORP
7 MONEY
the Town Hall FAC
Durham GPE
two CARDINAL
Mrs H. H. Rushford PERSON
Mrs Rushford PERSON
hon ORG
Miss G. S. Wilkinson PERSON
hon ORG
A. B. Peacock. PERSON
Essex GPE
Barﬁeld man,&#39;Mr ORG
William Sutton PERSON
135-feet QUANTITY
RSPCA GPE
Bronze
Medal WORK_OF_ART
N. Playne PERSON
hon ORG
Lady Courtauld PERSON
hon ORG
G

In [22]:
# Collect all geo-political entities from whole corpus
gpes = [(ent.text, ent.label_) for ent in doc.ents for doc in corpus if ent.label_ == "GPE"]
len(gpes)

1285

In [23]:
gpes

, &#39;GPE&#39;),
 (&#39;Lynton&#39;, &#39;GPE&#39;),
 (&#39;Lynton&#39;, &#39;GPE&#39;),
 (&#39;Lynton&#39;, &#39;GPE&#39;),
 (&#39;Dorset&#39;, &#39;GPE&#39;),
 (&#39;Dorset&#39;, &#39;GPE&#39;),
 (&#39;Dorset&#39;, &#39;GPE&#39;),
 (&#39;Dorset&#39;, &#39;GPE&#39;),
 (&#39;Dorset&#39;, &#39;GPE&#39;),
 (&#39;North&#39;, &#39;GPE&#39;),
 (&#39;North&#39;, &#39;GPE&#39;),
 (&#39;North&#39;, &#39;GPE&#39;),
 (&#39;North&#39;, &#39;GPE&#39;),
 (&#39;North&#39;, &#39;GPE&#39;),
 (&#39;West Moors&#39;, &#39;GPE&#39;),
 (&#39;West Moors&#39;, &#39;GPE&#39;),
 (&#39;West Moors&#39;, &#39;GPE&#39;),
 (&#39;West Moors&#39;, &#39;GPE&#39;),
 (&#39;West Moors&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#39;),
 (&#39;Durham&#39;, &#39;GPE&#3

In [24]:
# get the set of unique GPEs
set(gpes)

{(&#39;Abattoirs&#39;, &#39;GPE&#39;),
 (&#39;Aberdare&#39;, &#39;GPE&#39;),
 (&#39;Afghanistan&#39;, &#39;GPE&#39;),
 (&#39;Albania&#39;, &#39;GPE&#39;),
 (&#39;Altrincham&#39;, &#39;GPE&#39;),
 (&#39;Ashton&#39;, &#39;GPE&#39;),
 (&#39;Austria&#39;, &#39;GPE&#39;),
 (&#39;BIRMINGHAM&#39;, &#39;GPE&#39;),
 (&#39;BRITAIN&#39;, &#39;GPE&#39;),
 (&#39;BROMLEY&#39;, &#39;GPE&#39;),
 (&#39;BROMLEYvDAVENPORT&#39;, &#39;GPE&#39;),
 (&#39;Bark&#39;, &#39;GPE&#39;),
 (&#39;Birmingham&#39;, &#39;GPE&#39;),
 (&#39;Blackpool&#39;, &#39;GPE&#39;),
 (&#39;Bournemouth&#39;, &#39;GPE&#39;),
 (&#39;Brighton&#39;, &#39;GPE&#39;),
 (&#39;Bristol&#39;, &#39;GPE&#39;),
 (&#39;Bromley&#39;, &#39;GPE&#39;),
 (&#39;Bulgaria&#39;, &#39;GPE&#39;),
 (&#39;Burma&#39;, &#39;GPE&#39;),
 (&#39;CONNAUGHT&#39;, &#39;GPE&#39;),
 (&#39;Cambridge&#39;, &#39;GPE&#39;),
 (&#39;Canterbury&#39;, &#39;GPE&#39;),
 (&#39;Carmarthen&#39;, &#39;GPE&#39;),
 (&#39;Chatham&#39;, &#39;GPE&#39;),
 (&#39;China&#39;, &#39;GPE&#39;),
 (

spaCy also provides a `pipe` method on the language model that should batch your document processing. This can be useful for larger collections of texts. We'll only see a small advantage in our small corpus, but it gets more significant as you batch in larger sizes with more processes. 

https://spacy.io/api/language#pipe

In [None]:
%time docs = [nlp(text) for text in texts]

In [None]:
%time docs = list(nlp.pipe(texts, batch_size=10, n_process=1))

## Resources for spaCy

- [spaCy 101](https://spacy.io/usage/spacy-101) - spaCy's own intro documentation
- [Advanced NLP with spaCy](https://course.spacy.io/) - spaCy's own interactive learning course; you don't need to be "ready" for "advanced" work to benefit from going through this course
- [textacy](https://github.com/chartbeat-labs/textacy) - a Python library built on top of spaCy and scikit-learn to faciliate working with a corpus and providing extra functionality
- [spaCy universe](https://spacy.io/universe) - extensive collection of packages built on top of or with spaCy for various NLP and text analysis tasks

## Activity?

I'm happy to stay on for a while and answer questions or help if anyone would like to work with one of their own texts in spaCy to try out some of these techniques/approaches.