# Exploratory Text Analysis in Python using spaCy and textacy

Workshop for Open Data Science Conference East 2021 <br />
[Workshop materials](https://github.com/csbailey5t/ODSC_text_analysis)

Scott Bailey <br/>
Digital Research and Scholarship Librarian <br/>
Copyright and Digital Scholarship Center <br/>
NC State University Libraries

## Outline
1. Intro and overview of NLP libraries
2. Document-level analysis <br/>
    a. Tokenization <br/>
    b. Cleaning text data <br />
    c. Part-of-speech tagging <br/>
    d. Named entity recognition <br/>
    e. Similarity vectors <br/>
    f. Rule-based matching <br />
3. Scaling up to corpus-level analysis
4. Further resources for spaCy

## What do we mean by "exploratory text analysis?"
- How clean are the data?
- What methods do the data support?
- Project scoping 
- Research question refinement
- Iterative research 

## A quick(!) overview of NLP-related libraries in Python
- [nltk](https://www.nltk.org/)
- [gensim](https://radimrehurek.com/gensim/)
- [scikit-learn](https://scikit-learn.org/stable/)
- [stanza/corenlp](https://stanfordnlp.github.io/stanza/)
- [spaCy](https://spacy.io/)
- [huggingface transformers - pytorch and tensorflow](https://github.com/huggingface/transformers)

### Why spaCy and textacy?

SpaCy is an opinionated, performant NLP library that does a lot of the work for you while revealing where you might need to do more custom refinement or model building. Textacy builds smoothly on spaCy to add corpus analysis and common information retrieval methods.

## Questions during the workshop

During the workshop, please do ask questions by way of the chat. I'll be keeping an eye on that, and will answer questions as we go if I can. I'll also give some time during and after the workshop when folks can unmute and ask questions. 

## Jupyter Notebooks, Google Colab, and Binder


In [None]:
# Run this cell if working in Google Colab or Binder
# If working locally, install dependencies per requirements.txt
# !pip install textacy
# !python -m spacy download en_core_web_md

In [1]:
from collections import Counter
import glob
import spacy
from spacy import displacy
import textacy

In [2]:
import en_core_web_md

In [3]:
# Alternate ways to load the model, with the first working consistently in Colab
# nlp = en_core_web_md.load()
nlp = spacy.load("en_core_web_md")

In [4]:
# from https://en.wikipedia.org/wiki/Data_science
sample_text = """Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge."""

In [5]:
doc = nlp(sample_text)

## Tokenization

In [6]:
for word in doc[:20]:
    print(word)

Data
science
is
an
inter
-
disciplinary
field
that
uses
scientific
methods
,
processes
,
algorithms
and
systems
to
extract


In [7]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

Data science
an inter-disciplinary field
scientific methods
processes
algorithms
systems
knowledge
insights
structured and unstructured data
knowledge
actionable insights
data
a broad range
application domains
Data science
data mining
machine learning
big data
Data science
a "concept
unify statistics
data analysis
informatics
their related methods
order
actual phenomena
data
It
techniques
theories
many fields
the context
mathematics
statistics
computer science
information science
domain knowledge
Turing award winner
Jim Gray
data science
a "fourth paradigm
science
everything
science
the impact
information technology
the data deluge


In [8]:
for sent in doc.sents:
    print(sent)

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.
Data science is related to data mining, machine learning and big data.


Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.
It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.
Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.


## Cleaning text data

In [11]:
# One of the common things we do in text analysis is to remove punctuation
no_punct = [token for token in doc if token.is_punct == False]
for token in no_punct[50:100]:
  print(token.text, token.is_punct)

big False
data False


 False
Data False
science False
is False
a False
concept False
to False
unify False
statistics False
data False
analysis False
informatics False
and False
their False
related False
methods False
in False
order False
to False
understand False
and False
analyze False
actual False
phenomena False
with False
data False
It False
uses False
techniques False
and False
theories False
drawn False
from False
many False
fields False
within False
the False
context False
of False
mathematics False
statistics False
computer False
science False
information False
science False
and False
domain False
knowledge False


In [12]:
# This has worked, but left in new line characters and spaces
no_punct_or_space = [token for token in doc if token.is_punct == False and token.is_space == False]
for token in no_punct_or_space[50:100]:
  print(token.text)

big
data
Data
science
is
a
concept
to
unify
statistics
data
analysis
informatics
and
their
related
methods
in
order
to
understand
and
analyze
actual
phenomena
with
data
It
uses
techniques
and
theories
drawn
from
many
fields
within
the
context
of
mathematics
statistics
computer
science
information
science
and
domain
knowledge
Turing


In [13]:
# Let's say we also want to remove numbers, and lowercase everything
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha == True]
lower_alpha[:30]

['data',
 'science',
 'is',
 'an',
 'inter',
 'disciplinary',
 'field',
 'that',
 'uses',
 'scientific',
 'methods',
 'processes',
 'algorithms',
 'and',
 'systems',
 'to',
 'extract',
 'knowledge',
 'and',
 'insights',
 'from',
 'structured',
 'and',
 'unstructured',
 'data',
 'and',
 'apply',
 'knowledge',
 'and',
 'actionable']

One other common bit of preprocessing is to remove stopwords, that is, the common words in a language that don't convey the information that we are looking for in our analysis. For example, if we looked for the most common words in a text, we would want to remove stopwords so that we don't only get words such as 'a,' 'the,' and 'and.'

In [14]:
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha == True and token.is_stop == False]
clean[:30]

['data',
 'science',
 'inter',
 'disciplinary',
 'field',
 'uses',
 'scientific',
 'methods',
 'processes',
 'algorithms',
 'systems',
 'extract',
 'knowledge',
 'insights',
 'structured',
 'unstructured',
 'data',
 'apply',
 'knowledge',
 'actionable',
 'insights',
 'data',
 'broad',
 'range',
 'application',
 'domains',
 'data',
 'science',
 'related',
 'data']

For this piece, we've used spaCy's built in stopword list, which is used to create the property `is_stop` for each token. There's a good chance you would want to create custom stopwords lists though, especially if you're working with historical text or really domain-specific text. 

In [15]:
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["data", "algorithms"]

custom_clean = [token.lower_ for token in doc if token.lower_ not in custom_stopwords]
custom_clean

['science',
 'is',
 'an',
 'inter',
 '-',
 'disciplinary',
 'field',
 'that',
 'uses',
 'scientific',
 'methods',
 ',',
 'processes',
 ',',
 'and',
 'systems',
 'to',
 'extract',
 'knowledge',
 'and',
 'insights',
 'from',
 'structured',
 'and',
 'unstructured',
 ',',
 'and',
 'apply',
 'knowledge',
 'and',
 'actionable',
 'insights',
 'from',
 'across',
 'a',
 'broad',
 'range',
 'of',
 'application',
 'domains',
 '.',
 'science',
 'is',
 'related',
 'to',
 'mining',
 ',',
 'machine',
 'learning',
 'and',
 'big',
 '.',
 '\n\n',
 'science',
 'is',
 'a',
 '"',
 'concept',
 'to',
 'unify',
 'statistics',
 ',',
 'analysis',
 ',',
 'informatics',
 ',',
 'and',
 'their',
 'related',
 'methods',
 '"',
 'in',
 'order',
 'to',
 '"',
 'understand',
 'and',
 'analyze',
 'actual',
 'phenomena',
 '"',
 'with',
 '.',
 'it',
 'uses',
 'techniques',
 'and',
 'theories',
 'drawn',
 'from',
 'many',
 'fields',
 'within',
 'the',
 'context',
 'of',
 'mathematics',
 ',',
 'statistics',
 ',',
 'computer',

At this point, we have a list of lower-cased tokens that doesn't contain punctuation, white-space, numbers, or stopwords. Depending on our analysis, we may or may not want to do this much cleaning. But, it is good to understand how much we can do just with spaCy. 

### Since we can break apart the document and filter it now, it's a good time to start counting things

In [16]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

Number of tokens in document:  170
Number of tokens in cleaned document:  89
Number of unique tokens in cleaned document:  63


In [17]:
# number of sentences
len(list(doc.sents))

5

In [18]:
# Count all lower-cased tokens
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)

[(',', 13),
 ('and', 13),
 ('data', 12),
 ('science', 8),
 ('"', 8),
 ('of', 5),
 ('.', 5),
 ('is', 4),
 ('to', 4),
 ('knowledge', 3),
 ('from', 3),
 ('a', 3),
 ('the', 3),
 ('-', 2),
 ('that', 2),
 ('uses', 2),
 ('methods', 2),
 ('insights', 2),
 ('related', 2),
 ('statistics', 2)]

In [19]:
# Count cleaned tokens
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)

[('data', 12),
 ('science', 8),
 ('knowledge', 3),
 ('uses', 2),
 ('methods', 2),
 ('insights', 2),
 ('related', 2),
 ('statistics', 2),
 ('information', 2),
 ('inter', 1),
 ('disciplinary', 1),
 ('field', 1),
 ('scientific', 1),
 ('processes', 1),
 ('algorithms', 1),
 ('systems', 1),
 ('extract', 1),
 ('structured', 1),
 ('unstructured', 1),
 ('apply', 1)]

### Activity

In the cell below, write code to find the five most common noun chunks in the original doc. 

In [None]:
# Write code here

## Part-of-speech tagging

In [20]:
# Coarse grained UPOS: https://universaldependencies.org/docs/u/pos/
for token in doc[:20]:
    print(token.text, token.pos_)

Data NOUN
science NOUN
is AUX
an DET
inter ADJ
- ADJ
disciplinary ADJ
field NOUN
that DET
uses VERB
scientific ADJ
methods NOUN
, PUNCT
processes NOUN
, PUNCT
algorithms NOUN
and CCONJ
systems NOUN
to PART
extract VERB


In [21]:
# Fine-grained POS, Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:20]:
    print(token.text, token.tag_)

Data NN
science NN
is VBZ
an DT
inter JJ
- JJ
disciplinary JJ
field NN
that WDT
uses VBZ
scientific JJ
methods NNS
, ,
processes NNS
, ,
algorithms NNS
and CC
systems NNS
to TO
extract VB


In [22]:
# Not sure what those tags are? Try spaCy's explain function
spacy.explain("DT")

'determiner'

In [23]:
# Collect tokens by part of speech
verbs = [token for token in doc if token.pos_ == "VERB"]
verbs

[uses,
 extract,
 apply,
 related,
 understand,
 analyze,
 uses,
 drawn,
 imagined,
 driven,
 asserted,
 changing]

In [24]:
# Collect plural nouns
nouns_pl = [token for token in doc if token.tag_ == "NNS" or token.tag_ == "NNPS"]
nouns_pl

[methods,
 processes,
 algorithms,
 systems,
 insights,
 data,
 insights,
 data,
 domains,
 data,
 data,
 statistics,
 data,
 informatics,
 methods,
 phenomena,
 data,
 techniques,
 theories,
 fields,
 mathematics,
 statistics,
 data]

### Dependency tree visualization

In [25]:
single_sentence = list(doc.sents)[0]
single_sentence

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.

In [26]:
# spaCy determines the dependency tree for it's doc. Like POS, we can see the dependency tags of each token. 
for token in single_sentence:
    print(token.text, token.dep_)

Data compound
science nsubj
is ROOT
an det
inter dep
- dep
disciplinary amod
field attr
that nsubj
uses relcl
scientific amod
methods dobj
, punct
processes conj
, punct
algorithms conj
and cc
systems conj
to aux
extract xcomp
knowledge dobj
and cc
insights conj
from prep
structured amod
and cc
unstructured conj
data pobj
, punct
and cc
apply conj
knowledge dobj
and cc
actionable amod
insights conj
from prep
data pobj
across prep
a det
broad amod
range pobj
of prep
application compound
domains pobj
. punct


In [27]:
spacy.explain("dobj")

'direct object'

In [28]:
displacy.render(single_sentence, style="dep", jupyter=True)

## Named entity recognition

[List of entity types in this spaCy model](https://spacy.io/models/en#en_core_web_md)

In [30]:
for ent in doc.ents:
    print(ent.text, ent.label_)    

Turing PERSON
Jim Gray PERSON
fourth ORDINAL


### Activity

Add or modify a sentence in the original `sample_text` so that spaCy will detect a GPE. Then, in the cell below, write code to return a list of all entities that are either PERSON or GPE.

**hint**: make sure to reprocess the `sample_text` with the `nlp` model. 

In [None]:
# Write code here

### Visualizing named entities

In [31]:
single_sentence = list(doc.sents)[-1]
displacy.render(single_sentence, style="ent", jupyter=True)

## Word, sentence, and document vectors

SpaCy's medium (`md`) and large (`lg`) models include GloVe word vectors trained on the [Common Crawl](https://commoncrawl.org/). 

You could train your own vectors with `gensim` and `word2vec`, use a large language model, or many other libraries and algorithms. But, if you're text is fairly recent and especially from the web, the common crawl vectors might be enough, especially for exploratory work. 

`Token`s have vectors. `Doc`s and `Span`s have vectors that are the average of their token vectors. 

In [32]:
# token vectors
for token in doc[:5]:
    print(token.vector)

[-1.7969e-01 -2.5516e-01 -2.1751e-01  1.8151e-01 -4.0652e-01  8.5208e-01
  7.4484e-02  1.3682e-02  8.7480e-02  1.5056e+00 -5.3100e-01  2.8123e-02
  5.7363e-02  9.2619e-02 -5.2687e-01  1.6689e-01 -1.9017e-01  3.1937e+00
 -2.1972e-01 -3.8853e-01  1.6916e-01  2.6669e-01 -3.5948e-01 -1.4874e-01
  2.9541e-01  3.8212e-01  1.5826e-01 -9.2368e-02  3.4473e-01  1.0793e-01
 -2.2861e-01 -2.2966e-01  9.0178e-01 -4.6848e-02 -3.6522e-01 -2.9999e-02
 -3.2167e-01 -1.1985e-01 -3.0740e-01 -3.1308e-01 -1.8787e-01  4.7730e-01
 -1.3486e-01  2.3576e-01 -5.4592e-01 -2.6415e-02  1.2399e-01  1.2621e-01
  2.8233e-01 -1.4344e-01 -1.3727e-01 -3.3906e-01 -1.0746e-01 -2.6406e-02
 -1.5055e-01  8.4884e-02 -3.0304e-01 -1.7760e-01  8.0063e-02 -5.1963e-01
  4.8408e-01  7.4119e-01 -1.0525e-01  4.5329e-01  2.1668e-01  3.2206e-01
 -2.8967e-02  4.1205e-01  5.1266e-01  3.4068e-01  1.5061e-01  6.6381e-01
  9.0662e-01 -3.7938e-01 -1.4235e-01  8.8749e-02  2.2335e-01 -4.2028e-01
  1.4483e-02  2.1708e-01  2.1882e-01  2.3531e-01  4

In [33]:
# doc vector
doc.vector

array([-1.12466343e-01,  1.59790739e-01, -1.12997115e-01, -4.50736731e-02,
        2.63751261e-02,  1.57385424e-01, -2.99619976e-02,  3.67342643e-02,
       -3.36680375e-02,  1.97251296e+00, -2.44648248e-01,  9.13238153e-02,
        7.33316541e-02,  1.35665610e-02, -1.04197107e-01,  9.55005549e-03,
       -3.44923213e-02,  1.45000446e+00, -2.52511472e-01, -1.59972414e-01,
        7.37948045e-02, -1.34542231e-02, -1.16475768e-01, -2.10609008e-02,
        1.27918169e-01,  4.55922149e-02,  9.04883593e-02,  7.26802796e-02,
        5.18373325e-02, -3.09815146e-02, -6.85465857e-02,  6.20514154e-02,
        1.03448085e-01,  1.80202927e-02,  9.48346183e-02, -1.20624557e-01,
       -4.04222123e-02,  2.51795407e-02, -9.20449644e-02, -7.20806345e-02,
        5.79685532e-03,  2.01701690e-02, -5.31277433e-02,  3.71526629e-02,
       -4.70825359e-02,  7.30216131e-03, -8.69988874e-02,  7.25259911e-03,
        3.56209576e-02, -4.34230939e-02,  3.70069854e-02,  6.34394810e-02,
       -8.38835612e-02,  

In [34]:
# sentence/span vector
list(doc.sents)[0].vector

array([-1.41003445e-01,  1.21739812e-01, -1.44238576e-01, -1.61007531e-02,
       -7.10073160e-03,  2.14003816e-01, -1.12373112e-02,  6.85715303e-02,
        5.83196618e-03,  1.91060233e+00, -2.13344872e-01,  1.18561260e-01,
        1.79695562e-02, -1.80001184e-02, -6.21123314e-02,  2.77977157e-02,
       -3.61825936e-02,  1.64743721e+00, -2.42472798e-01, -1.71403274e-01,
        8.64852220e-02,  2.87005603e-02, -1.58356890e-01, -5.06506860e-02,
        1.23856753e-01,  8.57934132e-02,  1.38494328e-01,  6.12009577e-02,
        5.48274778e-02, -7.91425779e-02, -6.47969842e-02,  8.37085396e-02,
        1.58478186e-01, -3.10242288e-02,  2.82637123e-02, -1.50990441e-01,
       -1.29540525e-02, -1.76589098e-02, -6.41159415e-02, -1.09551206e-01,
        6.33410551e-03,  2.59006233e-03, -3.98420878e-02,  4.04196084e-02,
       -6.28071949e-02,  4.23986278e-02, -8.43424052e-02,  3.92410383e-02,
        1.95257366e-02, -1.95598211e-02,  8.04203600e-02,  2.54307482e-02,
       -8.72328877e-02,  

This is fine, but for exploratory work, we might just be interested in some similarity measures between tokens, sentences, or documents. SpaCy uses the common cosine similarity measure.

In [35]:
for token1 in doc[:10]:
    for token2 in doc[:10]:
        print(token1.text, token2.text, token1.similarity(token2))

Data Data 1.0
Data science 0.2967368
Data is 0.28532642
Data an 0.2718177
Data inter 0.19982958
Data - 0.031793762
Data disciplinary 0.18646398
Data field 0.40077925
Data that 0.38139
Data uses 0.40483338
science Data 0.2967368
science science 1.0
science is 0.30085263
science an 0.2522719
science inter 0.19687814
science - 0.065442264
science disciplinary 0.47277686
science field 0.45549664
science that 0.3774315
science uses 0.31416386
is Data 0.28532642
is science 0.30085263
is is 1.0
is an 0.63675654
is inter 0.094093345
is - 0.17651182
is disciplinary 0.29116082
is field 0.33907786
is that 0.6942723
is uses 0.4684848
an Data 0.2718177
an science 0.2522719
an is 0.63675654
an an 1.0
an inter 0.1947275
an - 0.07939471
an disciplinary 0.22468801
an field 0.37208116
an that 0.57462156
an uses 0.38908035
inter Data 0.19982958
inter science 0.19687814
inter is 0.094093345
inter an 0.1947275
inter inter 1.0
inter - 0.0022585576
inter disciplinary 0.21024904
inter field 0.16123219
inter t

**Question**: Looking at the results, can you explain the scale of the similarity score?

In [37]:
for sent1 in doc.sents:
    for sent2 in doc.sents:
        print(sent1.text, "\n", sent2.text, "\n", sent1.similarity(sent2))
        print("----------------------------------------------")

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. 
 Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. 
 1.0
----------------------------------------------
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. 
 Data science is related to data mining, machine learning and big data.

 
 0.92242813
-----------------

## Rule based matcher

Rule-based matching is an incredibly powerful complement to the statistic models of spaCy. It's also a bit complex though, and it's worth looking at the docs [here](https://spacy.io/usage/rule-based-matching).

In [38]:
for sent in doc.sents:
    print(sent)

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.
Data science is related to data mining, machine learning and big data.


Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.
It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.
Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.


In [39]:
from spacy.matcher import Matcher

In [40]:
matcher = Matcher(nlp.vocab)

[Available token attributes for the `Matcher` pattern](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)

In [42]:
# We'll define a pattern as a list of dictionaries, where each dictionary describes a token
pattern = [{'LOWER': 'data'},
           {'POS': 'NOUN'}]
# The Matcher expects a list of patterns
matcher.add("data+noun", [pattern])

In [43]:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

12112088598235855885 data+noun 0 2 Data science
12112088598235855885 data+noun 45 47 Data science
12112088598235855885 data+noun 50 52 data mining
12112088598235855885 data+noun 60 62 Data science
12112088598235855885 data+noun 70 72 data analysis
12112088598235855885 data+noun 126 128 data science
12112088598235855885 data+noun 167 169 data deluge


One of the easiest ways to build up these `Matcher` patterns is to use Explosion's online [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher). 

## Working with multiple documents (a corpus)

For a small corpus, you can build a list or dictionary of processed spaCy docs. Once you have that list or dictionary, approach it in terms of using the type of code we've written above, but applied over the larger data structure. 

For larger corpora, though, you might need to think about streaming data or distributed processing. 

In [None]:
# Run this if in Colab
# Don't run this if in Binder
!wget https://github.com/csbailey5t/ODSC_text_analysis/blob/master/archive.zip

In [None]:
# Run this if in Colab
# normally we would just use `unzip`, but Colab seems to be having issues with zip files and utilities, so 7-zip it is. 
!7z x archive.zip

In [None]:
# Run this if in Binder
!unzip archive.zip

In [44]:
fns = glob.glob("sotu/*.txt")
len(fns)

228

In [45]:
texts = []
for fn in fns:
    with open(fn, 'r') as f:
        texts.append(f.read())

In [46]:
%time corpus = [nlp(text) for text in texts[:5]]

CPU times: user 5.52 s, sys: 744 ms, total: 6.26 s
Wall time: 6.53 s


In [48]:
for doc in corpus[:2]:
    for ent in doc.ents:
        print(ent.text, ent.label_)

Speaker PERSON
Congress ORG
Today marks MONEY
first ORDINAL
Washington GPE
1790 DATE
the
Nation ORG
American NORP
George
 PERSON
Washington GPE
Winston Churchill PERSON
Franklin Delano Roosevelt PERSON
a day DATE
Douglas MacArthur
 PERSON
Dwight Eisenhower PERSON
John F. Kennedy PERSON
Chamber ORG
last year DATE
Washington GPE
Congress ORG
Washington GPE
State ORG
America GPE
tonight TIME
American NORP
America GPE
Detroit GPE
Steubenville GPE
Newark GPE
Chicago GPE
millions CARDINAL
Americans NORP
last
year DATE
The last decade DATE
1970 DATE
1974 DATE
the spring of 1980 DATE
the last 6 months of 1980 DATE
annual DATE
17 percent PERCENT
21.5
percent PERCENT
8 million CARDINAL
1981 DATE
first ORDINAL
3-year DATE
15 3/4 percent PERCENT
12.4
percent PERCENT
8.9 CARDINAL
the month of December DATE
5.2 percent PERCENT
Americans NORP
today DATE
A year ago DATE
Americans NORP
Six CARDINAL
10 CARDINAL
Americans NORP
about
 CARDINAL
Congress ORG
American NORP
Congress ORG
Congress ORG
the begin

In [49]:
# Collect all geo-political entities from whole corpus
gpes = [(ent.text, ent.label_) for ent in doc.ents for doc in corpus if ent.label_ == "GPE"]
len(gpes)

295

In [50]:
gpes[:20]

[('Washington', 'GPE'),
 ('Washington', 'GPE'),
 ('Washington', 'GPE'),
 ('Washington', 'GPE'),
 ('Washington', 'GPE'),
 ('States', 'GPE'),
 ('States', 'GPE'),
 ('States', 'GPE'),
 ('States', 'GPE'),
 ('States', 'GPE'),
 ('Cuba', 'GPE'),
 ('Cuba', 'GPE'),
 ('Cuba', 'GPE'),
 ('Cuba', 'GPE'),
 ('Cuba', 'GPE'),
 ('the United States', 'GPE'),
 ('the United States', 'GPE'),
 ('the United States', 'GPE'),
 ('the United States', 'GPE'),
 ('the United States', 'GPE')]

In [51]:
# get the set of unique GPEs
set(gpes)

{('Alaska', 'GPE'),
 ('America', 'GPE'),
 ('Annapolis', 'GPE'),
 ('California', 'GPE'),
 ('China', 'GPE'),
 ('Colombia', 'GPE'),
 ('Cuba', 'GPE'),
 ('Great Britain', 'GPE'),
 ('Hawaii', 'GPE'),
 ('Hongkong', 'GPE'),
 ('Honolulu', 'GPE'),
 ('Jefferson', 'GPE'),
 ('Louisiana', 'GPE'),
 ('Manila', 'GPE'),
 ('Mexico', 'GPE'),
 ('Mount Vernon', 'GPE'),
 ('Newfoundland', 'GPE'),
 ('Panama', 'GPE'),
 ('Philippines', 'GPE'),
 ('Porto', 'GPE'),
 ('St. Pierre', 'GPE'),
 ('States', 'GPE'),
 ('Texas', 'GPE'),
 ('The District of Columbia', 'GPE'),
 ('The Hague', 'GPE'),
 ('United', 'GPE'),
 ('Washington', 'GPE'),
 ('buffalo', 'GPE'),
 ('the\nRepublic', 'GPE'),
 ('the Chinese Empire', 'GPE'),
 ('the District of Columbia', 'GPE'),
 ('the Philippine Islands', 'GPE'),
 ('the United States', 'GPE')}

### Activity

Choose a method from the single document analysis portion of the workshop, and apply it to this small corpus. For example, you could find the most common words, create a cleaned corpus, or aggregate parts of speech. 

In [None]:
# Write code here

spaCy also provides a `pipe` method on the language model that will process texts in a stream. This can be useful for larger collections of texts, especially along with disabling parts of the pipeline you aren't using. 

https://spacy.io/api/language#pipe

In [52]:
%time docs = [nlp(text) for text in texts]

CPU times: user 2min 56s, sys: 8.38 s, total: 3min 5s
Wall time: 3min 6s


In [53]:
%time docs = list(nlp.pipe(texts, batch_size=10, n_process=2))

CPU times: user 40.4 s, sys: 7.18 s, total: 47.6 s
Wall time: 2min 5s


## Resources for spaCy

- [spaCy 101](https://spacy.io/usage/spacy-101) - spaCy's own intro documentation
- [Advanced NLP with spaCy](https://course.spacy.io/) - spaCy's own interactive learning course; you don't need to be "ready" for "advanced" work to benefit from going through this course
- [textacy](https://github.com/chartbeat-labs/textacy) - a Python library built on top of spaCy and scikit-learn to faciliate working with a corpus and providing extra functionality
- [spaCy universe](https://spacy.io/universe) - extensive collection of packages built on top of or with spaCy for various NLP and text analysis tasks
- [spaCy youtube videos](https://www.youtube.com/c/ExplosionAI/videos) - Explosion has a lot of great videos on Youtube, and there are a number of other folks who have created great walkthroughs of using different parts of spaCy.