
## Introduction to 
<img src="https://miro.medium.com/max/1200/1*HTtQseukwrBiREJf8MSVcA.jpeg" alt="Spacy Logo" style="width: 600px;"/>



- [Main Documentation Page](https://spacy.io/)  
- [How to install spaCy](https://spacy.io/usage)

## Working with text
H.G. Wells, *The Invisible Man*
<img src="https://www.slashfilm.com/wp/wp-content/images/invisible-man-cast-new.jpg" alt="The Invisible Man" style="width: 600px;"/>



In [None]:
import requests 
book = requests.get('http://www.gutenberg.org/cache/epub/5230/pg5230.txt')

### Without spaCy, Python is able to process text as a sequence of characters (called a string).  We can slice a string, concatenate strings, replace sections of a string and perform many other tasks.  See [w3schools string functions](https://www.w3schools.com/python/python_ref_string.asp).

Common examples for working with strings:

In [1]:
#Slicing  [begin : end ]
wilde = 'Be yourself; everyone else is already taken.'
wilde[4:-13]

'ourself; everyone else is a'

In [2]:
#Find and replace
wilde = 'Be yourself; everyone else is already taken.'
wilde.replace('yourself', 'a fish').replace('everyone', 'everything')

'Be a fish; everything else is already taken.'

In [3]:
#Split
wilde = 'Be yourself; everyone else is already taken.'
wilde.split() #also try wilde.split(';')

['Be', 'yourself;', 'everyone', 'else', 'is', 'already', 'taken.']

### spaCy takes this a step further to give the machine an understanding of text, not just as a sequence of characters, but as natural language.

[A full list of current languages](https://github.com/explosion/spaCy/tree/master/spacy/lang)




In [4]:
from spacy.lang.de import German

nlp = German()
doc = nlp('Sei du selbst! Alle anderen sind bereits vergeben.')


from spacy.lang.en import English 

nlp = English()
doc = nlp('Be yourself; everyone else is already taken.')

### The document object
Once we have created a language object (`nlp`) and called it on a text, spaCy returns what is called a document (doc) object.  
The doc object typically contains:


| [Attribute](https://spacy.io/api/doc#attributes) | Example |
|---|---|
| tokens | doc[:5] |
| text | doc.text |
| sentences | doc.sents |
| entities | doc.ents |


Full documentation can be found [here](https://spacy.io/api/doc#_title).


In [6]:
#Note the difference between working with a slice in a doc object versus a Python string 
from IPython.display import display, Markdown, Latex

display(Markdown('**Note the difference between working with a slice of a doc object versus a Python string**'))

print(wilde[:3])
print(doc[:3])

display(Markdown('**Also note how spaCy tokenization differs from Python split()**'))
print('[*] Python:')
for token in wilde.split():
    print(token)
    
print('------')    
print('[*] spaCy:')
for token in doc:
    print(token)

**Note the difference between working with a slice of a doc object versus a Python string**

Be 
Be yourself;


**Also note how spaCy tokenization differs from Python split()**

[*] Python:
Be
yourself;
everyone
else
is
already
taken.
------
[*] spaCy:
Be
yourself
;
everyone
else
is
already
taken
.


In [7]:
# The to_json() method is a useful way to look at all the information contained in the doc 
doc.to_json()

{'text': 'Be yourself; everyone else is already taken.',
 'tokens': [{'id': 0, 'start': 0, 'end': 2},
  {'id': 1, 'start': 3, 'end': 11},
  {'id': 2, 'start': 11, 'end': 12},
  {'id': 3, 'start': 13, 'end': 21},
  {'id': 4, 'start': 22, 'end': 26},
  {'id': 5, 'start': 27, 'end': 29},
  {'id': 6, 'start': 30, 'end': 37},
  {'id': 7, 'start': 38, 'end': 43},
  {'id': 8, 'start': 43, 'end': 44}]}
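The `start` and `end` values above are character offsets into the original string, so each token can be recovered with an ordinary Python slice. The sketch below copies the output shown above into a plain dictionary (no spaCy needed); `token_texts` is a hypothetical helper name used only for illustration.

```python
# Token offsets copied from the to_json() output shown above
doc_json = {
    'text': 'Be yourself; everyone else is already taken.',
    'tokens': [
        {'id': 0, 'start': 0, 'end': 2},
        {'id': 1, 'start': 3, 'end': 11},
        {'id': 2, 'start': 11, 'end': 12},
        {'id': 3, 'start': 13, 'end': 21},
        {'id': 4, 'start': 22, 'end': 26},
        {'id': 5, 'start': 27, 'end': 29},
        {'id': 6, 'start': 30, 'end': 37},
        {'id': 7, 'start': 38, 'end': 43},
        {'id': 8, 'start': 43, 'end': 44},
    ],
}


def token_texts(doc_json):
    """Return the text of each token by slicing the original string."""
    text = doc_json['text']
    return [text[tok['start']:tok['end']] for tok in doc_json['tokens']]


tokens = token_texts(doc_json)
print(tokens)
```

Note that the gaps between one token's `end` and the next token's `start` are the spaces that the tokenizer set aside.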

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" alt="Spacy Logo" style="width: 80px;"/>  
##  Tokens
As you can see above, the doc contains a split of the text into tokens.  Each token object has 65 attributes that can be used during analysis.  Common tasks include:
- removing all punctuation from the text
- counting root forms of the words (lemmata)
- removing stopwords from the doc


| [Attribute](https://spacy.io/api/token#attributes) | Example |
|---|---|
| root form (lemma) | token.lemma_ |
| named entity type | token.ent_type_ |
| token is punctuation | token.is_punct |
| part of speech | token.pos_ |
| in stop words | token.is_stop |


Full documentation can be found [here](https://spacy.io/api/token#_title).


In [8]:
for token in doc:
    print(token.text,
         token.lemma_,
         token.pos_,
         token.dep_,
         token.shape_,
         token.is_stop)

Be Be   Xx True
yourself yourself   xxxx True
; ;   ; False
everyone everyone   xxxx True
else else   xxxx True
is be   xx True
already already   xxxx True
taken take   xxxx False
. .   . False
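Two of the common tasks listed above, removing punctuation and removing stopwords, can be sketched with the `is_punct` and `is_stop` flags. This is a minimal example rather than a full preprocessing pipeline; it assumes spaCy is installed and uses `spacy.blank("en")`, which builds a tokenizer-only English pipeline much like `English()`.

```python
import spacy

# A tokenizer-only English pipeline (no trained model required)
nlp = spacy.blank("en")
doc = nlp('Be yourself; everyone else is already taken.')

# Keep only tokens that are neither punctuation nor stopwords
content_words = [token.text for token in doc
                 if not token.is_punct and not token.is_stop]

# Collect the punctuation separately, just to see what was removed
punctuation = [token.text for token in doc if token.is_punct]

print(content_words)
print(punctuation)
```

Because most of the words in this sentence are on the English stopword list, very little survives the filter. That is typical for short sentences.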


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" alt="Spacy Logo" style="width: 80px;"/>  
##  Spans
When studying text, we are often interested in features that involve more than one token.  To capture these, we can create a span, which is a slice of the doc.  For example, "New York City" is one place name made up of three tokens.

Span [attributes](https://spacy.io/api/span#attributes)

Full documentation can be found [here](https://spacy.io/api/span#_title). 

In [9]:
text = 'I just got back from New York City.'
nlp = English()
doc = nlp(text)
nyc = doc[5:8] #or doc[-4:-1]

print(
    '[*] spaCy',
    nyc.start,
    nyc.end,
    doc[nyc.start:nyc.end],
)
print(  
    '[*] string',
    nyc.start_char,
    nyc.end_char,
    text[nyc.start_char:nyc.end_char]
)

[*] spaCy 5 8 New York City
[*] string 21 34 New York City
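The conversion also works in the other direction. If you already know the character offsets (from a regular expression match on the string, say), `doc.char_span` returns the matching span, or `None` when the offsets do not line up with token boundaries. A short sketch, assuming spaCy is installed:

```python
import spacy

nlp = spacy.blank("en")
text = 'I just got back from New York City.'
doc = nlp(text)

# Characters 21-34 line up with token boundaries, so a Span comes back
nyc = doc.char_span(21, 34)
print(nyc.text, nyc.start, nyc.end)

# Character 22 falls inside the token "New", so no Span can be built
misaligned = doc.char_span(22, 34)
print(misaligned)
```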


# Exercise: create individualized vocabulary lists 
At Haverford, we have an application called [the Bridge](https://bridge.haverford.edu/) that generates custom vocabulary lists for learning Latin and ancient Greek.  To do this, we create a list of words from texts that the student has already read and understood.  We then use the lemma of each word to compare the list of known words against words in a new text.  We can then identify which words will be new to the reader.

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/J%C3%B3kai_M%C3%B3r_litogr%C3%A1fia.jpg/220px-J%C3%B3kai_M%C3%B3r_litogr%C3%A1fia.jpg'>

For current purposes, let's use two texts in Hungarian. Let's say that I'm learning Hungarian and reading Mór Jókai's *The novel of the next century* (1872).  I have just finished book one and want to know what new words I will encounter when reading book two.   

>*Note* I am using Python sets to find the difference between the two books. I could also find the union, the intersection and other set operations.  For more on this topic, there is an excellent tutorial from [Real Python](https://realpython.com/python-sets/). 
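As a quick refresher on the set operations mentioned in the note, here is a toy example. The word lists are made up for illustration and have nothing to do with the Hungarian texts.

```python
# Made-up vocabularies: words already known and words in a new text
known = {'ghost', 'door', 'night'}
new_text = {'door', 'night', 'chain', 'bell'}

to_learn = new_text - known        # difference: in the new text only
already_known = new_text & known   # intersection: in both
everything = new_text | known      # union: in either
only_in_one = new_text ^ known     # symmetric difference: in exactly one

print(sorted(to_learn))
print(sorted(already_known))
print(sorted(everything))
print(sorted(only_in_one))
```

`new_text - known` is shorthand for `new_text.difference(known)`, which is the form used in the cells that follow.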

In [10]:
# Here I use the requests library to get the texts from Project Gutenberg
import requests 
vol_1 = requests.get('http://www.gutenberg.org/files/55911/55911-0.txt')
vol_2 = requests.get('http://www.gutenberg.org/files/55912/55912-0.txt')    

In [11]:
from spacy.lang.hu import Hungarian 
nlp = Hungarian()
nlp.max_length = 1070000 # This is needed given the length of the text 

vol_1_doc = nlp(vol_1.text)
vol_1_words = set([token.lemma_ for token in vol_1_doc if token.is_stop is False and token.is_punct is False])

vol_2_doc = nlp(vol_2.text)
vol_2_words = set([token.lemma_ for token in vol_2_doc if token.is_stop is False and token.is_punct is False])

new_words = vol_2_words.difference(vol_1_words)
len(new_words)


16812

## Ouch, that's far too many words to learn!  Let's only count the 100 most frequent words and then create our list.

In [None]:
from spacy.tokens import Token
from collections import Counter

# Add an extension to our tokens called "count"
Token.set_extension("count", default=False, force=True)


# Calculate the number of times that a lemma appears in the text
counts = Counter([token.lemma_ for token in vol_1_doc if not token.is_punct and not token.is_stop]).most_common(100)
counts = dict(counts)

# Add the count to each token.
# vol_1_doc was already parsed above, so there is no need to run nlp on the text again.
for token in vol_1_doc:
    if token.lemma_ in counts:
        token._.count = counts[token.lemma_]

# Repeat for the second text
counts = Counter([token.lemma_ for token in vol_2_doc if not token.is_punct and not token.is_stop]).most_common(100)
counts = dict(counts)

# I don't speak Hungarian, but these are clearly not words, let's get rid of them.
# pop() with a default avoids a KeyError if one of them is not in the top 100.
for junk in ['\r\n', '\r\n\r\n', '-e']:
    counts.pop(junk, None)

for token in vol_2_doc:
    if token.lemma_ in counts:
        token._.count = counts[token.lemma_]

# Now we find the difference between the most common words in the two texts.
# Compare on the lemma alone: a (lemma, count) pair would look new whenever the counts differ.
set_vol1 = set([(token.lemma_, token._.count) for token in vol_1_doc if token._.count])
set_vol2 = set([(token.lemma_, token._.count) for token in vol_2_doc if token._.count])
known_lemmas = {lemma for lemma, count in set_vol1}
difference = {(lemma, count) for lemma, count in set_vol2 if lemma not in known_lemmas}
difference
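One subtlety is worth seeing in isolation. `Counter.most_common(n)` returns `(lemma, count)` pairs, and two pairs are only equal when both the lemma and the count match. The toy lemma lists below are invented for illustration; they show how a word that appears in both texts can still look new if its counts differ.

```python
from collections import Counter

# Invented lemma lists standing in for two parsed texts
vol_1_lemmas = ['king', 'king', 'king', 'ship', 'ship', 'sea']
vol_2_lemmas = ['king', 'king', 'moon', 'moon', 'moon', 'sea']

# The two most frequent lemmas of each text, as {lemma: count}
top_1 = dict(Counter(vol_1_lemmas).most_common(2))
top_2 = dict(Counter(vol_2_lemmas).most_common(2))

# Comparing (lemma, count) pairs: 'king' is reported because its count changed
pair_difference = set(top_2.items()) - set(top_1.items())

# Comparing lemmas only: just the lemma that is really new
new_lemmas = set(top_2) - set(top_1)

print(sorted(pair_difference))
print(sorted(new_lemmas))
```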

In [None]:
## Bonus cell, let's look up the definition of a word in our list

from IPython.display import display, HTML

# PyWiktionary https://pypi.org/project/pywiktionary/
from pywiktionary.wiktionary_parser_factory import WiktionaryParserFactory

parser_factory = WiktionaryParserFactory(default_language='hu')
parser_factory_result = parser_factory.get_page('tesz')
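# Page ids differ from word to word, so take the first page returned instead of hard-coding the id
pages = parser_factory_result['response']['query']['pages']
display(HTML(next(iter(pages.values()))['extract']))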



# Models 

What if we wanted to create a list of the 100 most frequent verbs or nouns in the text?  With the base Hungarian model, `token.pos_` returns an empty string. Also take a look at our lemmas. Are those really lemmas?  That model simply does not know parts of speech or Hungarian lemmata.  We need one that does. 

Here is a listing of the officially supported spaCy models: https://spacy.io/models  
At the time of writing, there are models for:
- English
- German
- French
- Spanish
- Portuguese
- Italian
- Dutch
- Greek
- Multi-language

The spaCy documentation lists the features and capabilities of each model.  Keep in mind that there can be several models for a language.  Larger models are often slower and require more memory.   If you're not using the more advanced features of a large model, then you would probably be better off using something small.  As a general rule, it's best to start small and then move up when a larger model is needed. 


To add a spaCy supported model, simply type:  
`python -m spacy download <name of model>` (`python -m spacy download en_core_web_sm`, for example)


In [None]:
import spacy

#rather than
#nlp = English()
#
nlp = spacy.load('en_core_web_sm')


doc = nlp('Be yourself; everyone else is already taken.')
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

There is a growing community of spaCy users.  There are dozens of spaCy-based projects in the [Universe](https://spacy.io/universe) as well as user-created language models.  If you visit [awesome-hungarian-nlp](https://github.com/oroszgy/awesome-hungarian-nlp), for example, you'll find a link to a spaCy Hungarian model [here](https://github.com/oroszgy/spacy-hungarian-models).

This is a full-featured model with
- Word vectors
- Brown clusters
- Token frequencies 
- Sentencizer
- PoS Tagger
- Lemmatizer
- Dependency parser

> If you are working locally, you'll need to install the model:  
> `pip install https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.2.0/hu_core_ud_lg-0.2.0-py3-none-any.whl`

In [None]:
import hu_core_ud_lg

nlp = hu_core_ud_lg.load()
doc = nlp('A jövő század regénye.')

In [None]:
print('lemma: ',[token.lemma_ for token in doc])
print('pos  : ',[token.pos_ for token in doc])

Now we can create a list of the most common new verbs we'll encounter in book 2 by adding `token.pos_=='VERB'`. Note that the large model requires 6GB of memory, so you may get a MemoryError. It also takes longer to process the text than our simpler model.

In [None]:
from spacy.tokens import Token
from collections import Counter

nlp = hu_core_ud_lg.load()
nlp.max_length = 1070000 

# Add an extension to our tokens called "count"
Token.set_extension("count", default=False, force=True)


# Parse both texts with the new model first, so that lemmas and parts of speech come from it
vol_1_doc = nlp(vol_1.text)
vol_2_doc = nlp(vol_2.text)

# Calculate the number of times that a lemma appears in the first text
top100 = Counter([token.lemma_ for token in vol_1_doc if not token.is_punct and not token.is_stop]).most_common(100)
top100 = dict(top100)

# Add the count to each token. 
for token in vol_1_doc:
    if token.lemma_ in top100:
        token._.count = top100[token.lemma_]

# Repeat for the second text
counts = Counter([token.lemma_ for token in vol_2_doc if not token.is_punct and not token.is_stop]).most_common(100)
counts = dict(counts)

for token in vol_2_doc:
    if token.lemma_ in counts:
        token._.count = counts[token.lemma_]

# Now we find the difference between the most common verbs in the two texts.
# Compare on the lemma alone: a (lemma, count) pair would look new whenever the counts differ.
set_vol1 = set([(token.lemma_, token._.count) for token in vol_1_doc if token._.count and token.pos_=='VERB'])
set_vol2 = set([(token.lemma_, token._.count) for token in vol_2_doc if token._.count and token.pos_=='VERB'])
known_lemmas = {lemma for lemma, count in set_vol1}
difference = {(lemma, count) for lemma, count in set_vol2 if lemma not in known_lemmas}
difference
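The cell above keeps the verbs that happen to fall among the 100 most frequent words of any kind. A variation, not the approach used above, is to filter by part of speech before counting, which gives the most frequent verbs directly. The sketch below uses hand-written `(lemma, pos)` pairs in place of real tokens so that it runs without the large model; `top_lemmas` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written (lemma, part of speech) pairs standing in for parsed tokens
tagged = [
    ('go', 'VERB'), ('house', 'NOUN'), ('house', 'NOUN'), ('house', 'NOUN'),
    ('see', 'VERB'), ('go', 'VERB'), ('person', 'NOUN'),
]


def top_lemmas(tagged, pos, n):
    """Count only the lemmas with the given part of speech, then keep the top n."""
    return Counter(lemma for lemma, tag in tagged if tag == pos).most_common(n)


top_verbs = top_lemmas(tagged, 'VERB', 2)
top_nouns = top_lemmas(tagged, 'NOUN', 1)
print(top_verbs)
print(top_nouns)
```

With a real doc, the list of pairs would come from something like `[(token.lemma_, token.pos_) for token in vol_2_doc]`.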

![](https://spacy.io/architecture-bcdfffe5c0b9f221a2f6607f96ca0e4a.svg)

In [None]:
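# The texts below are in English, so switch back to an English model
# (nlp was last set to the Hungarian model above). This assumes the
# en_core_web_sm model downloaded earlier; any English model will do.
nlp = spacy.load('en_core_web_sm')

bookLoc1 = 'xmascarol1.txt'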
bookLoc2 = 'xmascarol2.txt'

with open(bookLoc1, 'r') as f:
    text1 = f.read()
    
with open(bookLoc2, 'r') as f:
    text2 = f.read()
    
doc1 = nlp(text1)
doc2 = nlp(text2)

#### Identifying lemma, parts of speech and stopwords

You can iterate through a spaCy document object for its tokens. The token objects allow identification of things like the word lemma, its part of speech, its shape and whether it is a common word (a stopword). Let's take the first part of the text and iterate through the tokens for:

1. lemma (dictionary form of the word)
2. part of speech (e.g., noun)
3. syntactic dependency (how it relates to other words in the sentence)
4. shape (pattern of capitalization, digits and punctuation)
5. stopword (whether it is a common word)

In [None]:
for token in doc1:
    print(token.text,
         token.lemma_,
         token.pos_,
         token.dep_,
         token.shape_,
         token.is_stop)

You can create a span object from part of the document. Just take a slice of the entire doc object.

In [None]:
span = doc1[3:7]
print(span.text)

#### Linguistic similarity

What if we want to see if texts are similar to one another? Out of the box, spaCy can compare texts using word vectors: by default, `similarity()` returns the cosine similarity of the two documents' averaged word vectors. Let's take the two parts of *A Christmas Carol*.

In [None]:
doc1.similarity(doc2)
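To make the number returned by `similarity()` less mysterious: by default it is a cosine similarity between averaged word vectors. The sketch below reproduces that arithmetic in plain Python with tiny made-up 3-dimensional vectors (real models use hundreds of dimensions). The helper names `average` and `cosine` are illustrative and are not spaCy functions.

```python
import math


def average(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up word vectors: each inner list is one word of a "document"
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b = [[1.0, 1.0, 0.0]]
doc_c = [[0.0, 0.0, 1.0]]

sim_ab = cosine(average(doc_a), average(doc_b))  # vectors point the same way
sim_ac = cosine(average(doc_a), average(doc_c))  # vectors are perpendicular
print(round(sim_ab, 3), round(sim_ac, 3))
```

Averaging explains why long texts in the same language often score close to one another: the more words go into the average, the more the individual differences wash out.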

Not surprisingly, these two short texts are pretty similar. Let's load part of a series of stories that Dickens published under the pseudonym Boz, compiled as *Sketches by Boz, Illustrative of Every-Day Life and Every-Day People*, and test for similarity.

In [None]:
bozLoc = 'sketchesboz.txt'
    
with open(bozLoc,'r',encoding='utf8') as f:
    bozText = f.read()

bozDoc = nlp(bozText)

doc1.similarity(bozDoc)

Only slightly less similar than the two parts of *A Christmas Carol*. Let's try Dickens against parts of Joyce's *Ulysses*.

In [None]:
ulyssesLoc = 'ulysses.txt'

with open(ulyssesLoc, 'r',encoding='utf8') as f:
    ulyssesText = f.read()

ulyssesDoc = nlp(ulyssesText)

bozDoc.similarity(ulyssesDoc)

Less similar although also very similar. What if we try against something very different, [an article about the NBA finals from fivethirtyeight.com](https://fivethirtyeight.com/features/warriors-raptors-game-2-nba-finals/):

In [None]:
basketballLoc = 'basketball.txt'

with open(basketballLoc, 'r',encoding='utf8') as f:
    basketballText = f.read()

basketballDoc = nlp(basketballText)

bozDoc.similarity(basketballDoc)

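Less similar still, although still pretty similar. Averaging word vectors over a long text blurs many differences, so a different model (for example, a larger one that ships with word vectors) or a trained model might give more differentiating results.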

#### Identifying entities

A useful feature in spaCy is its ability to recognize named entities. Entities are things with proper names, such as people, places and institutions. With a trained model, spaCy does a pretty good job of finding names, institutions and locations. We can iterate through the entities in the text just like tokens:

In [None]:
for ent in basketballDoc.ents:
    print(ent.text,ent.label_)

There are some problems with this index, of course. "Six dimes" is slang for six assists, for example. If you were interested in making a gazetteer (a place index), it would be easy to do by only printing those entities that have geographical qualities. These are "geographical or political entities" (GPE) or "locations" (LOC).

In [None]:
for ent in basketballDoc.ents:
    if ent.label_ in ['GPE','LOC']:
        print(ent.text,ent.label_)
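For an actual gazetteer, you would probably want each place listed once, with a count, rather than printed every time it occurs. A minimal sketch of that step follows. The `(text, label)` pairs are invented stand-ins for `(ent.text, ent.label_)` and are not real output from the article, and `gazetteer` is a hypothetical helper name.

```python
from collections import Counter

# Invented (text, label) pairs standing in for (ent.text, ent.label_)
entities = [
    ('Toronto', 'GPE'), ('Oakland', 'GPE'), ('Toronto', 'GPE'),
    ('Kawhi Leonard', 'PERSON'), ('Bay Area', 'LOC'), ('Sunday', 'DATE'),
]


def gazetteer(entities, labels=('GPE', 'LOC')):
    """Count each place name once per mention and return an alphabetical index."""
    places = Counter(text for text, label in entities if label in labels)
    return sorted(places.items())


index = gazetteer(entities)
for place, mentions in index:
    print(place, mentions)
```

With a parsed doc, the input could be built as `[(ent.text, ent.label_) for ent in basketballDoc.ents]`.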

### Project: Creating a List of Words for Language Study

1. Eliminating unwanted tokens
2. Creating an updating set

Let's say you are teaching an ESL class for beginners. There are many texts you could assign, but how can you make a vocabulary list? And once you have a list for one text, how can you make sure the words do not repeat?

Let's make a set of the words we need to learn and that we have already learned:

In [None]:
words2learn = set()
learnedWords = set()

#### Eliminating unwanted tokens

We need to iterate through the list of tokens from our document and make sure that easy words or non-words do not end up in the list. The first thing to do is to eliminate tokens that should not be in the list. These include punctuation (PUNCT), proper nouns (PROPN), spaces (SPACE), numbers (NUM) and symbols (SYM). Simply test to see if the token's part of speech is one of these. We should also test to see if the token is a stopword so that overly simple words do not end up on the list.

Here we iterate through the first part of *A Christmas Carol* and add each lemma and its part of speech to our learning list. Doing so means that a string that can represent different parts of speech (e.g., "run" as a noun and as a verb) will end up in the list once for each part of speech.

In [None]:
for token in doc1:
    if token.pos_ not in ['PUNCT','PROPN','SPACE','NUM','SYM'] and not token.is_stop:
        words2learn.add((token.lemma_,token.pos_))

for word in words2learn:
    print(word[0],word[1])

Then it might make sense to write up this part of the code as a function so we can use it again. 

In [None]:
def usableWords(doc):
    w2l = set()
    for token in doc:
        if token.pos_ not in ['PUNCT','PROPN','SPACE','NUM','SYM'] and not token.is_stop:
            w2l.add((token.lemma_,token.pos_))
    return w2l

Let's assume the student learned the words from the first part of *A Christmas Carol* and wants to learn new words from the second part. We can put the learned words in that set and figure out what words need to be learned.

In [None]:
learnedWords = learnedWords|words2learn

newWords = usableWords(doc2)

words2learn.clear()

print(newWords-learnedWords)


#### Creating a set that updates

It might be useful to work with a corpus of texts. There is a site called ESL Fast that has short, easy texts. Using the urllib module and the HTML parsing module BeautifulSoup, we can download these texts and make lists from them. If you need to install BeautifulSoup, do so in your terminal:

`pip install beautifulsoup4`

Let's import these modules and download the first fourteen "supereasy" texts on the site. If you don't know how to use BeautifulSoup, you might look at [The Programming Historian's lesson](https://programminghistorian.org/en/lessons/intro-to-beautiful-soup).

In [None]:
import bs4




In [None]:
import urllib.request  # importing urllib alone does not guarantee that urllib.request is available
from bs4 import BeautifulSoup

texts = []

for story in range(1,15):
    url = 'https://www.eslfast.com/supereasy/se/supereasy' + ('00'+str(story))[-3:] + '.htm'
    page = urllib.request.urlopen(url).read()
    text = BeautifulSoup(page, "html.parser").find(class_='MsoNormal').text
    texts.append(text)
    
print(texts[0])
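The expression `('00'+str(story))[-3:]` in the loop above pads the story number to three digits. The same URLs can be built with `str.zfill`, which some readers find easier to follow. A small sketch that only builds the strings and does not touch the network; `story_url` is a hypothetical helper name.

```python
def story_url(story):
    """Build the URL for a story number, zero-padded to three digits."""
    return ('https://www.eslfast.com/supereasy/se/supereasy'
            + str(story).zfill(3) + '.htm')


urls = [story_url(story) for story in range(1, 15)]
print(urls[0])
print(urls[-1])
```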

Let's make two functions: one that takes our learned words plus a new text, prints the words we need to learn and returns them, and one that adds the new words to the learned words:

In [None]:
def tolearn(oldWords, newDoc):
    words = usableWords(newDoc)
    newWords = words-oldWords
    for word in newWords:
        print(word[0],'('+word[1]+')')
    return newWords

def learned(oldWords, newWords):
    return oldWords|newWords

Then we can iterate through our texts and get the words to learn for each lesson:

In [None]:
learnedWords = set()

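# enumerate() numbers the lessons; texts.index(text) would give the wrong number for a repeated text
for lesson, text in enumerate(texts, start=1):
    print('\n***LESSON '+str(lesson)+'***')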
    textDoc = nlp(text)
    words2learn = tolearn(learnedWords,textDoc)
    learnedWords = learned(learnedWords,words2learn)
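To check the updating logic without spaCy or a network connection, the same loop can be run on hand-made sets. In this sketch each lesson is an invented set of `(lemma, part of speech)` pairs, the kind of value `usableWords` returns, and the names `lessons` and `new_per_lesson` are illustrative only.

```python
# Invented lessons: each is a set of (lemma, part of speech) pairs
lessons = [
    {('cat', 'NOUN'), ('sleep', 'VERB')},
    {('cat', 'NOUN'), ('dog', 'NOUN')},
    {('dog', 'NOUN'), ('sleep', 'VERB')},
]

learned_words = set()
new_per_lesson = []

for number, lesson_words in enumerate(lessons, start=1):
    # Words in this lesson that have not been seen in any earlier lesson
    new_words = lesson_words - learned_words
    new_per_lesson.append(new_words)
    # Everything in this lesson now counts as learned
    learned_words |= new_words
    print('LESSON', number, sorted(new_words))
```

The third lesson introduces nothing new, which is exactly the situation the running set is meant to detect.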

### Try it with your favorite books from [Gutenberg.org](https://www.gutenberg.org), using the id number in the URL.

In [None]:
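def loadText(textid, limit=9000):
    # read() returns bytes; decode them rather than calling str(),
    # which would keep the b'...' wrapper and the escaped line breaks
    url = 'http://www.gutenberg.org/files/'+str(textid)+'/'+str(textid)+'-0.txt'
    text = urllib.request.urlopen(url).read().decode('utf-8')
    return text[:limit] #making the sample smaller will make it quicker to process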

def printEntities(doc):
    labels = ['GPE','LOC', 'DATE']
    for ent in doc.ents:
        if ent.label_ in labels:
            print(ent.text, ent.label_)

textid1 = 59859 #Candide
textid2 = 1342 #Pride and Prejudice by Jane Austen

#doc1 = nlp(loadText(textid1))
#doc2 = nlp(loadText(textid2))

#doc1.similarity(doc2)

#printEntities(doc1)

whitehouse = nlp('We\'re going to protest at the White House')

for ent in whitehouse.ents:
    print(ent.text, ent.label_)
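A note on downloaded text: `urlopen(...).read()` returns bytes, not a string. The sketch below uses a small hand-made bytes value in place of a real download to show why the bytes should be decoded before they are passed to `nlp`. It assumes the text is UTF-8 encoded.

```python
# A hand-made bytes value standing in for the result of urlopen(...).read()
raw = 'Jókai\r\n'.encode('utf-8')

# str() gives the printable representation of the bytes object,
# including the b'...' wrapper and escaped bytes
as_repr = str(raw)

# decode() turns the bytes back into the original text
as_text = raw.decode('utf-8')

print(as_repr)
print(repr(as_text))
```

Feeding the `str()` version to spaCy would leave tokens such as backslash escapes in the doc, which then show up in vocabulary lists and entity output.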