# Language Processing with spaCy

## Installation and Basic Use

## Contents

1. Installation
2. Basic uses
3. Project: Creating a vocabulary list from words

## Installation

1. Installing spaCy
2. Installing models
3. Loading spaCy

**spaCy**

You should have Python 3 installed. Dependencies for spaCy may take a while to install.

pip install -U spacy

**Models**

spaCy relies on installed language models. Before we do anything, we have to install at least one language model. We will install the medium English model (91MB) and the small German model (10MB).

python -m spacy download en_core_web_md
python -m spacy download de_core_news_sm

**Loading spaCy**

spaCy is now ready to load in Python. Let's load the English model.

In [5]:
import spacy

nlp = spacy.load('en_core_web_md')

## Basic Uses

1. Loading a text
2. Lemma, parts of speech and stopwords
3. Assessing similarity
4. Working with entities

**Loading a text**

Feeling festive, I chose two short sections of *A Christmas Carol* by Charles Dickens to analyze with spaCy. We will open them and process them with spaCy.

In [6]:
bookLoc1 = 'C:\\Users\\seth\\OneDrive\\xmascarol1.txt'
bookLoc2 = 'C:\\Users\\seth\\OneDrive\\xmascarol2.txt'

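# 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark if the file has one
with open(bookLoc1, 'r', encoding='utf-8-sig') as f:
    text1 = f.read()

with open(bookLoc2, 'r', encoding='utf-8-sig') as f:
    text2 = f.read()

# Run the spaCy pipeline on each text to get document objects
doc1 = nlp(text1)
doc2 = nlp(text2)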
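
Text files saved on Windows sometimes begin with a UTF-8 byte-order mark (BOM). If such a file is opened with a different encoding, the BOM shows up as stray characters glued to the first word. Here is a minimal sketch of the effect, assuming a made-up sample file and assuming the mismatched encoding is the Cyrillic code page `cp1251` (any other single-byte code page would produce different stray characters):

```python
import os
import tempfile

# Simulate a text file saved as "UTF-8 with BOM" (hypothetical content)
raw = b"\xef\xbb\xbfMarley was dead"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Decoding with a single-byte code page turns the BOM into three stray characters
    with open(path, "r", encoding="cp1251") as f:
        wrong = f.read()

    # 'utf-8-sig' decodes UTF-8 and silently drops a leading BOM
    with open(path, "r", encoding="utf-8-sig") as f:
        right = f.read()

print(repr(wrong))
print(repr(right))
```

Passing an explicit `encoding` to `open` also keeps the result the same from one machine to the next, since the default depends on the system's locale.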

**Identifying Lemma, parts of speech and stopwords**

You can iterate through a spaCy document object to get its tokens. Each token object exposes attributes such as the word's lemma, its part of speech, its shape and whether it is a common word (a stopword). Let's take the first part of the text and iterate through the tokens, printing each one's:

1. lemma (the base or dictionary form of the word)
2. part of speech (e.g., noun)
3. syntactic dependency (how it relates to other words in the sentence)
4. shape (the pattern of capitalization, digits and punctuation)
5. stopword (whether it is a common word)

In [8]:
for token in doc1:
    print(token.text,
         token.lemma_,
         token.pos_,
         token.dep_,
         token.shape_,
         token.is_stop)

Marley marley NOUN nsubj Xxxxx False
was be VERB ROOT xxx True
dead dead ADJ acomp xxxx False
, , PUNCT punct , False
to to PART aux xx True
begin begin VERB advcl xxxx False
with with ADP prep xxxx True
. . PUNCT punct . False
There there ADV expl Xxxxx True
is be VERB ROOT xx True
no no DET neg xx True
doubt doubt NOUN attr xxxx False
whatever whatever DET attr xxxx True
about about ADP prep xxxx True
that that DET pobj xxxx True
. . PUNCT punct . False
The the DET det Xxx True
register register NOUN nsubjpass xxxx False
of of ADP prep xx True
his -PRON- DET poss xxx True
burial burial NOUN pobj xxxx False
was be VERB auxpass xxx True
signed sign VERB ROOT xxxx False
by by ADP agent xx True
the the DET det xxx True
clergyman clergyman NOUN pobj xxxx False
, , PUNCT punct , False
the the DET det xxx True
clerk clerk NOUN appos xxxx False
, , PUNCT punct , False
the the DET det xxx True
undertaker undertaker NOUN conj xxxx False
, , PUNCT punct , False
and and CCONJ cc xxx True
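
The shape column is easier to read once you see how such a pattern can be built. Below is a rough sketch, not spaCy's own code: the function name `word_shape` is made up, and the rule that a run of the same symbol is cut off after four is an assumption drawn from the output above (for example, "clergyman" prints as `xxxx`).

```python
def word_shape(text):
    """Approximate an orthographic shape: X/x for letters, d for digits."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and other characters are kept as they are
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        # Keep at most four of the same symbol in a row (assumed rule)
        if run < 4:
            shape.append(symbol)
    return "".join(shape)

for word in ["Marley", "was", "clergyman", ","]:
    print(word, word_shape(word))
```

Shapes like these are handy for spotting capitalized words or numbers without looking at the words themselves.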

**Linguistic Similarity**

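What if we want to see if texts are similar to one another? Out of the box, spaCy can compare texts using the word vectors that ship with the medium and large models. By default, `similarity` returns the cosine similarity of the two documents' averaged word vectors. Let's take the two parts of *A Christmas Carol*.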

In [4]:
doc1.similarity(doc2)

0.987069344903874
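
To get a feel for what a score like this means, here is a small sketch of cosine similarity over averaged vectors in plain Python. Everything in it is an assumption for illustration: the three-dimensional `toy_vectors` are invented (real models use hundreds of dimensions), and the helper names are hypothetical. It compares two "content" words directly, then compares two short documents that differ only in that one word.

```python
import math

def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional "word vectors" (not taken from any real model)
toy_vectors = {
    "the": [0.9, 0.1, 0.0],
    "was": [0.8, 0.2, 0.1],
    "ghost": [0.1, 0.9, 0.2],
    "dunk": [0.1, 0.1, 0.9],
}

def doc_vector(words):
    """Represent a document as the average of its word vectors."""
    return average([toy_vectors[w] for w in words])

ghost_story = ["the", "ghost", "was", "the", "was", "the"]
game_report = ["the", "dunk", "was", "the", "was", "the"]

content_only = cosine(toy_vectors["ghost"], toy_vectors["dunk"])
whole_docs = cosine(doc_vector(ghost_story), doc_vector(game_report))
print(round(content_only, 3), round(whole_docs, 3))
```

In this toy setup the two documents score much closer to 1 than the two content words do, because the shared common words dominate the average. That is one plausible reason why long English texts tend to get high scores against each other.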

Not surprisingly, these two short texts are pretty similar. Let's load part of a series of stories that Dickens published under the pseudonym Boz, later compiled as *Sketches by Boz, Illustrative of Every-Day Life and Every-Day People*, and test for similarity.

In [15]:
bozLoc = 'C:\\Users\\seth\\OneDrive\\sketchesboz.txt'
    
with open(bozLoc,'r',encoding='utf8') as f:
    bozText = f.read()

bozDoc = nlp(bozText)

doc1.similarity(bozDoc)

0.9867880349002466

Only slightly less similar than the two parts of *A Christmas Carol*. Let's try Dickens against parts of Joyce's *Ulysses*.

In [18]:
ulyssesLoc = 'C:\\Users\\seth\\OneDrive\\ulysses.txt'

with open(ulyssesLoc, 'r',encoding='utf8') as f:
    ulyssesText = f.read()

ulyssesDoc = nlp(ulyssesText)

bozDoc.similarity(ulyssesDoc)

0.9843544359594985

Less similar, although still very similar. What if we try against something very different, [an article about the NBA finals from fivethirtyeight.com](https://fivethirtyeight.com/features/warriors-raptors-game-2-nba-finals/):

In [19]:
basketballLoc = 'C:\\Users\\seth\\OneDrive\\basketball.txt'

with open(basketballLoc, 'r',encoding='utf8') as f:
    basketballText = f.read()

basketballDoc = nlp(basketballText)

bozDoc.similarity(basketballDoc)

0.9732606573565973

Less similar still, although still pretty similar. A different model might give more differentiating results.

**Identifying Entities**

A useful feature in spaCy is its ability to recognize named entities. Entities are real-world things with proper names, such as people, organizations and places; the English models also label dates, times and numbers. Out of the box, spaCy does a pretty good job of finding names, institutions and locations. We can iterate through the entities in the text just like tokens:

In [22]:
for ent in basketballDoc.ents:
    print(ent.text,ent.label_)

two CARDINAL
Kevin Durant PERSON
Game 2 EVENT
the NBA Finals EVENT
Sunday DATE
Durant PERSON
Stephen Curry’s PERSON
Kevon Looney PERSON
two CARDINAL
the first half DATE
MVP Andre Iguodala PERSON
Marc Gasol PERSON
the second quarter DATE
almost a minute TIME
Klay Thompson — PERSON
first ORDINAL
half CARDINAL
eight minutes TIME
one CARDINAL
Raptors ORG
the final five minutes TIME
2 CARDINAL
one CARDINAL
Oakland GPE
Games 3 and 4 DATE
59 CARDINAL
half CARDINAL
Golden State GPE
third ORDINAL
Mario Star PERSON
Warriors ORG
18 CARDINAL
the quarter DATE
Toronto GPE
Thompson PERSON
Durant PERSON
Warriors ORG
22 CARDINAL
Golden State GPE
the entire second half DATE
second ORDINAL
NBA Finals ORG
half CARDINAL
the Elias Sports EVENT
Raptors ORG
Golden State ORG
Raptors ORG
as much as 13 CARDINAL
the closing minutes TIME
Thompson PERSON
Raptors ORG
Raptors ORG
Nick Nurse PERSON
Curry DATE
five CARDINAL
Toronto GPE
12 CARDINAL
2 CARDINAL
Shaun Livingston PERSON
10 seconds TIME
one CARDINAL
Kawhi Le

There are some problems with this index, of course. "Six dimes" is slang for six assists, for example. If you were interested in making a gazetteer (a place index), it would be easy to do by printing only those entities that have geographical qualities. These are labeled "geographical or political entities" (GPE) or "locations" (LOC).

In [23]:
for ent in basketballDoc.ents:
    if ent.label_ in ['GPE','LOC']:
        print(ent.text,ent.label_)

Oakland GPE
Golden State GPE
Toronto GPE
Golden State GPE
Toronto GPE
Houston GPE
Toronto GPE
Houston GPE
Houston GPE
Toronto GPE
Houston GPE
Toronto,3 GPE
Toronto GPE
Toronto GPE
Golden State GPE
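
The list above repeats each place every time it is mentioned. A gazetteer usually wants each place once, perhaps with a count. Here is a minimal sketch using `collections.Counter`; the function name `gazetteer` and the short `sample` list (a few pairs copied from the entity output above) are stand-ins so that the example runs without spaCy.

```python
from collections import Counter

def gazetteer(entities, labels=("GPE", "LOC")):
    """Count each distinct place name among (text, label) pairs."""
    return Counter(text for text, label in entities if label in labels)

# A few (text, label) pairs copied by hand from the entity output
sample = [
    ("Kevin Durant", "PERSON"),
    ("Oakland", "GPE"),
    ("Golden State", "GPE"),
    ("Toronto", "GPE"),
    ("Raptors", "ORG"),
    ("Golden State", "GPE"),
    ("Toronto", "GPE"),
]

# Print the places alphabetically with the number of mentions
for place, count in sorted(gazetteer(sample).items()):
    print(place, count)
```

With a real document, the pairs could come from `((ent.text, ent.label_) for ent in basketballDoc.ents)`.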


## Project: Creating a List of Words for Language Study

1. Eliminating unwanted tokens
2. Creating an updating set

Let's say you are teaching an ESL class for beginners. There are many texts you could assign, but how can you make a vocabulary list? And once you have a list for one text, how can you make sure the words do not repeat?

Let's make two sets: one for the words we need to learn and one for the words we have already learned:

In [24]:
words2learn = set()
learnedWords = set()

**Eliminating unwanted tokens**

We need to iterate through the list of tokens from our document and make sure that easy words or non-words do not end up in the list. The first thing to do is to eliminate tokens that should not be in the list. These include punctuation (PUNCT), proper nouns (PROPN), spaces (SPACE), numbers (NUM) and symbols (SYM). Simply test to see if the token's part of speech is one of these. We should also test to see if the token is a stopword so that overly simple words do not end up on the list.

Here we iterate through the first part of *A Christmas Carol* and add each lemma and its part of speech to our learning list. Storing the pair means that a string that can be more than one part of speech (e.g., "run" as a noun and as a verb) gets a separate entry for each.

In [45]:
for token in doc1:
    if token.pos_ not in ['PUNCT','PROPN','SPACE','NUM','SYM'] and not token.is_stop:
        words2learn.add((token.lemma_,token.pos_))

for word in words2learn:
    print(word[0],word[1])

simile NOUN
ironmongery NOUN
sign VERB
shall VERB
chief ADJ
begin VERB
dead ADJ
repeat VERB
change NOUN
good ADJ
mean VERB
wisdom NOUN
disturb VERB
marley NOUN
incline VERB
register NOUN
nail NOUN
mourner NOUN
piece NOUN
door NOUN
clerk NOUN
undertaker NOUN
regard VERB
coffin NOUN
know VERB
trade NOUN
unhallowed ADJ
emphatically ADV
burial NOUN
clergyman NOUN
permit VERB
mind INTJ
knowledge NOUN
particularly ADV
doubt NOUN
ancestor NOUN
choose VERB
hand NOUN
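
Because a set has no fixed order, the words may print in a different order from run to run. For a handout, it can be nicer to group them by part of speech and sort them. A small sketch, assuming a hypothetical helper `group_by_pos` and a handful of pairs taken from the list above:

```python
from collections import defaultdict

def group_by_pos(pairs):
    """Group (lemma, pos) pairs into {pos: sorted list of lemmas}."""
    groups = defaultdict(list)
    for lemma, pos in pairs:
        groups[pos].append(lemma)
    # Sort the parts of speech and the lemmas inside each group
    return {pos: sorted(lemmas) for pos, lemmas in sorted(groups.items())}

# A few pairs from the vocabulary list, typed in by hand
sample = {
    ("sign", "VERB"),
    ("coffin", "NOUN"),
    ("dead", "ADJ"),
    ("begin", "VERB"),
    ("burial", "NOUN"),
}

for pos, lemmas in group_by_pos(sample).items():
    print(pos + ":", ", ".join(lemmas))
```

The same function would accept the full `words2learn` set, since it only needs an iterable of pairs.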


Then it might make sense to write up this part of the code as a function so we can use it again. 

In [18]:
def usableWords(doc):
    w2l = set()
    for token in doc:
        if token.pos_ not in ['PUNCT','PROPN','SPACE','NUM','SYM'] and not token.is_stop:
            w2l.add((token.lemma_,token.pos_))
    return w2l

Let's assume the student learned the words from the first part of *A Christmas Carol* and wants to learn new words from the second part. We can put the learned words in that set and figure out what words need to be learned.

In [9]:
learnedWords = learnedWords|words2learn

newWords = usableWords(doc2)

words2learn.clear()

print(newWords-learnedWords)


{('burial', 'NOUN'), ('mean', 'VERB'), ('know', 'VERB'), ('coffin', 'NOUN'), ('knowledge', 'NOUN'), ('door', 'NOUN'), ('sign', 'VERB'), ('clergyman', 'NOUN'), ('nail', 'NOUN'), ('doubt', 'NOUN'), ('good', 'ADJ'), ('clerk', 'NOUN'), ('begin', 'VERB'), ('ancestor', 'NOUN'), ('repeat', 'VERB'), ('chief', 'ADJ'), ('hand', 'NOUN'), ('mourner', 'NOUN'), ('marley', 'NOUN'), ('undertaker', 'NOUN'), ('simile', 'NOUN'), ('piece', 'NOUN'), ('permit', 'VERB'), ('change', 'NOUN'), ('emphatically', 'ADV'), ('shall', 'VERB'), ('regard', 'VERB'), ('choose', 'VERB'), ('register', 'NOUN'), ('particularly', 'ADV'), ('dead', 'ADJ'), ('trade', 'NOUN'), ('ironmongery', 'NOUN'), ('wisdom', 'NOUN'), ('mind', 'INTJ'), ('incline', 'VERB'), ('unhallowed', 'ADJ'), ('disturb', 'VERB')}
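
The set operations in this cell are easy to check on a tiny example. The sketch below uses invented word pairs and snake_case stand-ins for the notebook's variables. It shows that `|` builds a new set, so clearing the to-learn set afterwards does not empty the learned set, and that `-` keeps only the words not yet learned.

```python
# Invented sample data standing in for words2learn and learnedWords
words_to_learn = {("door", "NOUN"), ("know", "VERB"), ("good", "ADJ")}
learned_words = set()

# `|` returns a brand-new set, so the two names do not share storage
learned_words = learned_words | words_to_learn
words_to_learn.clear()
print(len(words_to_learn), len(learned_words))

# Candidate words from a second text (also invented)
candidates = {("door", "NOUN"), ("ghost", "NOUN"), ("know", "VERB"), ("cold", "ADJ")}

# `-` keeps the candidates that are not in the learned set
new_words = candidates - learned_words
print(sorted(new_words))
```

Words that appear in both texts drop out of the difference, so only the unfamiliar ones are left for the next lesson.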

**Creating a set that updates**

It might be useful to work with a corpus of texts. There is a site called ESL Fast that has short, easy texts. Using the urllib module and the HTML parsing module BeautifulSoup, we can download these texts and make lists from them. If you need to download BeautifulSoup, do so in your terminal:

pip install beautifulsoup4

Let's import these modules and download the first fourteen "supereasy" texts on the site. If you don't know how to use BeautifulSoup, you might look at [The Programming Historian's lesson](https://programminghistorian.org/en/lessons/intro-to-beautiful-soup).

In [16]:
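import urllib.request  # import the submodule explicitly so urllib.request is available
from bs4 import BeautifulSoup

texts = []

for story in range(1,15):
    # Story pages are numbered with three digits: supereasy001.htm to supereasy014.htm
    url = 'https://www.eslfast.com/supereasy/se/supereasy' + str(story).zfill(3) + '.htm'
    page = urllib.request.urlopen(url).read()
    # Take the text of the first element with the class "MsoNormal"
    text = BeautifulSoup(page, "html.parser").find(class_='MsoNormal').text
    texts.append(text)
    
print(texts[0])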



Billy always listens to his mother. He always does what she says. If his mother says, "Brush your teeth," Billy brushes his teeth. If his mother says, "Go to bed," Billy goes to bed. Billy is a very good boy. A good boy listens to his mother. His mother doesn't have to ask him again. She asks him to do something one time, and she doesn't ask again. Billy is a good boy. He does what his mother asks the first time. She doesn't have to ask again. She tells Billy, "You are my best child." Of course Billy is her best child. Billy is her only child.
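
Downloads can fail, and one bad page should not stop the whole loop. Here is a hedged sketch of the download step split into two small functions. The names `story_urls` and `fetch_pages` and the `opener` parameter are hypothetical; the URL pattern is the one used in the cell above. Nothing is downloaded until `fetch_pages` is called.

```python
import urllib.request
from urllib.error import URLError

# URL pattern from the download cell; {:03d} pads the story number to three digits
BASE = "https://www.eslfast.com/supereasy/se/supereasy{:03d}.htm"

def story_urls(first=1, last=14):
    """Build the list of story URLs from first to last, inclusive."""
    return [BASE.format(number) for number in range(first, last + 1)]

def fetch_pages(urls, opener=urllib.request.urlopen):
    """Return {url: raw bytes}, skipping any URL that cannot be opened."""
    pages = {}
    for url in urls:
        try:
            with opener(url) as response:
                pages[url] = response.read()
        except URLError as err:
            print("skipping", url, "->", err)
    return pages

print(story_urls(1, 2))
```

The `opener` argument is there so the function can be tried with a fake opener before pointing it at the real site.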




Let's make two functions. The first takes our learned words plus a new text, prints the words we still need to learn and returns them. The second adds newly learned words to the learned set:

In [19]:
def tolearn(oldWords, newDoc):
    words = usableWords(newDoc)
    newWords = words-oldWords
    for word in newWords:
        print(word[0],'('+word[1]+')')
    return newWords

def learned(oldWords, newWords):
    return oldWords|newWords
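
The bookkeeping these two functions do can be tried without spaCy. In the sketch below, hand-made sets of (lemma, part of speech) pairs stand in for what `usableWords` would return, and the generator name `lesson_plan` is made up for illustration.

```python
def lesson_plan(lessons):
    """Yield (lesson number, new words) for each lesson's word set, in order."""
    seen = set()
    for number, words in enumerate(lessons, start=1):
        new_words = words - seen   # words not met in any earlier lesson
        seen |= new_words          # remember them for the lessons that follow
        yield number, new_words

# Invented word sets, one per lesson
lessons = [
    {("mother", "NOUN"), ("listen", "VERB"), ("good", "ADJ")},
    {("father", "NOUN"), ("good", "ADJ"), ("hug", "NOUN")},
    {("dog", "NOUN"), ("listen", "VERB"), ("bark", "VERB")},
]

for number, new_words in lesson_plan(lessons):
    print(number, sorted(new_words))
```

Each word shows up only in the first lesson where it occurs, which is the behavior we want from the vocabulary lists.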

Then we can iterate through our texts and get the words to learn for each lesson:

In [22]:
learnedWords = set()

# enumerate numbers the lessons from 1, even if two texts happen to be identical
for lesson, text in enumerate(texts, start=1):
    print('\n***LESSON '+str(lesson)+'***')
    textDoc = nlp(text)
    words2learn = tolearn(learnedWords,textDoc)
    learnedWords = learned(learnedWords,words2learn)


***LESSON 1***
say (VERB)
good (ADJ)
time (NOUN)
listen (VERB)
bed (NOUN)
course (ADV)
tooth (NOUN)
ask (VERB)
go (VERB)
child (NOUN)
tell (VERB)
mother (NOUN)
brush (VERB)
boy (NOUN)

***LESSON 2***
want (VERB)
live (VERB)
hug (NOUN)
honey (NOUN)
give (VERB)
father (NOUN)
smile (VERB)
long (ADJ)
year (NOUN)
daddy (NOUN)
old (ADJ)
grow (VERB)
big (ADJ)
okay (ADJ)
yes (INTJ)

***LESSON 3***
bark (VERB)
wake (VERB)
dog (NOUN)
loud (ADJ)
look (VERB)
hear (VERB)
outside (ADP)
barking (NOUN)
brown (ADJ)
see (VERB)
loud (ADV)
get (VERB)
window (NOUN)
walk (VERB)
barking (ADJ)
stop (VERB)
open (VERB)

***LESSON 4***
envelope (NOUN)
think (VERB)
pencil (NOUN)
know (VERB)
help (VERB)
yellow (ADJ)
dear (ADJ)
toaster (NOUN)
sister (NOUN)
note (NOUN)
thank (VERB)
kitchen (NOUN)
write (VERB)
find (VERB)
lunch (NOUN)
counter (NOUN)
lose (VERB)
teacher (NOUN)

***LESSON 5***
door (NOUN)
happy (ADJ)
grab (VERB)
sit (VERB)
pink (ADJ)
excited (ADJ)
sneaker (NOUN)
horse (NOUN)
bedroom (NOUN)
minute (NOUN)