
## Introduction to 
<img width=500 src="https://miro.medium.com/max/1200/1*HTtQseukwrBiREJf8MSVcA.jpeg" alt="Spacy Logo"/>


- [Main Documentation Page](https://spacy.io/)  
- [How to install spaCy](https://spacy.io/usage)
- [spaCy 101, The most important concepts, explained in simple terms](https://spacy.io/usage/spacy-101)
- [Free course: Advanced NLP with spaCy](https://course.spacy.io/)

- Please go ahead and form groups of 2-4 people. Say hello and give a brief introduction. I encourage you to talk and ask questions during the workshop. Listen to me with one ear while you work on code, tinker and make sense of things in ways that are relevant to you. Please share your ideas and interrupt me. Move around. The main goal is that you come away with a basic understanding of spaCy and how it might be useful in your projects.


# Strings 

### Without spaCy, Python is able to process text as a sequence of characters (called a string).  We can slice a string, concatenate strings, replace sections of a string and perform many other tasks.  

See [w3schools string functions](https://www.w3schools.com/python/python_ref_string.asp)

Common examples for working with strings:

In [0]:
# Selecting a slice: select part of the string with [begin:end]
wilde = 'Be yourself; everyone else is already taken.'
print('The string "{}" has {} characters. Note that the index begins at 0.'.format(wilde, len(wilde)))

# Print each character next to its index
for i, character in enumerate(wilde):
    print(i, character)

In [0]:
print('wilde[4:12] will start at position 4 and stop just before position 12 ->', wilde[4:12])
print('or read backwards from the end [-40: -32] ->', wilde[-40:-32])
print('you can even mix forward and backward! wilde[-40:12]', wilde[-40:12])

In [0]:
#Find and replace
wilde = 'Be yourself; everyone else is already taken.'
wilde.replace('yourself', 'a fish').replace('everyone', 'everything')

In [0]:
#Split 
wilde = 'Be yourself; everyone else is already taken.'
print(wilde.split()) # Split on empty spaces
print(wilde.split(';'))
print(wilde.split('y'))  # Note that the character or space used to split the string is removed

In [0]:
# We can join a list of strings
print(' '.join(['Be', 'yourself;', 'everyone', 'else', 'is', 'already', 'taken.']))

import random

animals = ['fish', 'turtle', 'panther', 'parrot']
adjective = ['scary', 'green', 'overweight', 'fluffy']
print('Be a ' + ' '.join([random.choice(adjective), random.choice(animals)]) + ' everything else is taken.')


# Base Language object
### spaCy gives the machine an understanding of text, not just as a sequence of characters, but as natural language

[A full list of base languages](https://github.com/explosion/spaCy/tree/master/spacy/lang)

- stop words 
- lemmatization 
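
As a rough illustration of the first bullet, the sketch below looks words up in the stop word list that ships with spaCy's English language data. It assumes spaCy is installed (no trained model is needed), and the `words` list is just our example sentence typed out by hand.

```python
# Assumes spaCy is installed; no trained model is needed for this lookup.
from spacy.lang.en.stop_words import STOP_WORDS

words = ['Be', 'yourself', 'everyone', 'else', 'is', 'already', 'taken']

# The stop list is lowercase, so we lowercase each word before the lookup.
stop_flags = {word: word.lower() in STOP_WORDS for word in words}

for word, is_stop in stop_flags.items():
    print(word, is_stop)
```

Every language in the list linked above has its own `stop_words` module, so the same check should work for `spacy.lang.de` and others.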




In [0]:
from spacy.lang.de import German

nlp = German()
doc = nlp('Sei du selbst! Alle anderen sind bereits vergeben.')


from spacy.lang.en import English 

nlp = English()
doc = nlp('Be yourself; everyone else is already taken.')

### The document object
Once we have loaded a base language class or language model and passed it a text, spaCy creates what is called a document (doc) object.  
The doc object typically contains:


|   [attributes](https://spacy.io/api/doc#attributes) |   | 
|---|---|
| tokens (individual parts of the text)  | doc[5]  |
|  the text  | doc.text |
| the text split into sentences  | doc.sents |
| entities detected in the text | doc.ents |


Full documentation can be found [here](https://spacy.io/api/doc).
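
Here is a minimal sketch of the attributes in the table, assuming a blank English pipeline created with `spacy.blank('en')`. A blank pipeline has no entity recognizer, so `doc.ents` is empty, and `doc.sents` is left out because it needs a component that sets sentence boundaries.

```python
import spacy

# spacy.blank('en') builds a base English pipeline without any trained model
nlp = spacy.blank('en')
doc = nlp('Be yourself; everyone else is already taken.')

print(len(doc))        # number of tokens, not characters
print(doc[5])          # a single Token
print(doc[0:2])        # a Span made of the first two tokens
print(doc.text)        # the original string
print(list(doc.ents))  # empty here: a blank pipeline has no entity recognizer
```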


In [0]:
print('**Note the difference between working with a slice of a doc object versus a Python string**')

print(wilde[:3])
print(doc[:3])

print('**Also note how spaCy tokenization differs from Python split()**')
print('[*] Python:')
for token in wilde.split():
    print(token)
    
print('------')    
print('[*] spaCy:')
for token in doc:
    print(token)

In [0]:
# The to_json() method is a useful way to view the information contained in the doc object

doc = nlp('Be yourself; everyone else is already taken.')
doc.to_json()

<img height=100 src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" alt="Spacy Logo" style="width: 80px;"/>  
##  Tokens
As you can see above, the doc contains a split of the text into tokens.  Each token object has [65 attributes](https://spacy.io/api/token#attributes) that can be used during analysis.  Common tasks include:
- removing all punctuation from the text
- counting root forms of the words (lemmata)
- removing stopwords from the doc


|   [attributes](https://spacy.io/api/token#attributes) |   | 
|---|---|
| root form (lemma)  | token.lemma_  |
| Named entity type  | token.ent_type_ |
| token is punctuation  | token.is_punct |
| part of speech | token.pos_ |
| in stop words | token.is_stop |


Full documentation can be found [here](https://spacy.io/api/token#_title).
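
Two of the common tasks listed above can be sketched with the `is_punct` and `is_stop` flags alone. This is only an illustration on a blank English pipeline (an assumption made so that no model download is required); counting lemmata is not shown because the quality of `token.lemma_` depends on the pipeline you load.

```python
import spacy

nlp = spacy.blank('en')  # tokenizer and lexical flags only, no trained model
doc = nlp('Be yourself; everyone else is already taken.')

# Task 1: drop punctuation tokens
no_punct = [token.text for token in doc if not token.is_punct]

# Task 2: drop punctuation and stop words
content_words = [token.text for token in doc
                 if not token.is_punct and not token.is_stop]

print(no_punct)
print(content_words)
```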


In [0]:
for token in doc:
    print(token.text,
         token.lemma_,
         token.pos_,
         token.dep_,
         token.shape_,
         token.is_stop)

In [0]:
# Useful function to make sense of linguistic terminology and abbreviations 
import spacy 

spacy.explain("PRON")


<img height=100 src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" alt="Spacy Logo" style="width: 80px;"/>  
##  Spans
When studying text, we are often interested in features that involve more than one token.  To do this, we can create a span.  For example, "New York City" is a single place name made up of three tokens.

Span [attributes](https://spacy.io/api/span#attributes)

Full documentation can be found [here](https://spacy.io/api/span#_title). 
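
Besides slicing a doc, a span can also be built explicitly with the `Span` class, which lets us attach a label. The sketch below assumes a blank English pipeline, and the `GPE` label is assigned by hand for illustration rather than predicted by a model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')
doc = nlp('I just got back from New York City.')

# Tokens 5, 6 and 7 are "New", "York" and "City"; the end index is exclusive.
# The 'GPE' label is assigned by hand here, not predicted by a model.
nyc = Span(doc, 5, 8, label='GPE')

print(nyc.text, nyc.label_, len(nyc))
```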

In [0]:
text = 'I just got back from New York City.'
nlp = English()
doc = nlp(text)

nyc = doc[5:8] 

print(
    '[*] spaCy',
    nyc.start,
    nyc.end,
    doc[nyc.start:nyc.end],
)
print(  
    '[*] string',
    nyc.start_char,
    nyc.end_char,
    text[nyc.start_char:nyc.end_char]
)

# Exercise: create individualized vocabulary lists 
At Haverford, we have an application called [the Bridge](https://bridge.haverford.edu/) that generates custom vocabulary lists for learning Latin and ancient Greek.  To do this, we create a list of words from texts that the student has already read and understood.  We then use the lemma of each word to compare the list of known words against words in a new text.  We can then identify which words will be new to the reader.
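
To make the idea concrete, here is a toy sketch in plain Python. It is not how the Bridge is implemented: the `LEMMAS` table and the two short "texts" are invented for illustration, and a real project would take lemmas from an NLP library.

```python
# Hypothetical lemma table, invented for this illustration
LEMMAS = {'walked': 'walk', 'walks': 'walk', 'houses': 'house', 'was': 'be', 'is': 'be'}

def lemmatize(word):
    """Return the root form if we know it, else the lowercased word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

already_read = 'She walks to the houses'.split()
new_text = 'He walked to the garden'.split()

# Lemmas the reader already knows
known = {lemmatize(word) for word in already_read}

# Words in the new text whose lemma has not been seen before
to_learn = [word for word in new_text if lemmatize(word) not in known]

print(to_learn)
```

Because the comparison is made on lemmas, "walked" is not flagged as new: the reader has already met "walks".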

<img height="400" src='https://www.boxofficepro.com/wp-content/uploads/2020/02/emma-dom-E_FP_00001_rgb-scaled.jpg'>

Let's say that I'm learning English and reading Jane Austen's *Emma* (1815).  I have just finished volume one and want to know what new words I will encounter when reading volume two.   

>*Note* I am using Python sets to find the difference between the two volumes. I could also find the union, the intersection and other set operations.  For more on this topic, there is an excellent tutorial from [Real Python](https://realpython.com/python-sets/). 
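
The set operations mentioned in the note can be sketched with two tiny, made-up vocabularies (the words are placeholders, not taken from the novel):

```python
vol_1_vocab = {'emma', 'house', 'friend', 'walk'}
vol_2_vocab = {'emma', 'friend', 'piano', 'letter'}

new_in_vol_2 = vol_2_vocab - vol_1_vocab   # same as vol_2_vocab.difference(vol_1_vocab)
shared = vol_1_vocab & vol_2_vocab         # intersection: words in both volumes
everything = vol_1_vocab | vol_2_vocab     # union: words in either volume

print(sorted(new_in_vol_2))
print(sorted(shared))
print(len(everything))
```

Note that the order matters for a difference: `a - b` and `b - a` usually give different results.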

In [0]:
# Here I use the requests library to get the texts from Project Gutenberg
import requests 
emma = requests.get('http://www.gutenberg.org/files/158/158-0.txt')
split = emma.text.find('VOLUME II')
vol_1 = emma.text[:split]
vol_2 = emma.text[split:]

In [0]:
from spacy.lang.en import English
nlp = English()
nlp.max_length = 1070000 # This is needed given the length of the text 

vol_1_doc = nlp(vol_1)

# Create a set of lemmas for words that are not punctuation or stop words
vol_1_words = {token.lemma_ for token in vol_1_doc if not token.is_stop and not token.is_punct}

vol_2_doc = nlp(vol_2)
vol_2_words = {token.lemma_ for token in vol_2_doc if not token.is_stop and not token.is_punct}

new_words = vol_2_words.difference(vol_1_words)
len(new_words)


## Ouch, that's far too many words to learn!  Let's only count the 100 most frequent words and then create our list.
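
Counting frequencies is a job for `collections.Counter` from the standard library. A minimal sketch, using a short hand-written list of lemmas in place of a whole volume:

```python
from collections import Counter

lemmas = ['emma', 'say', 'think', 'emma', 'say', 'emma', 'walk']

# most_common(n) returns (item, count) pairs, highest count first
top_two = Counter(lemmas).most_common(2)
print(top_two)

# dict() turns the pairs into a lookup table from lemma to count
top_counts = dict(top_two)
print(top_counts['emma'])
```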

In [0]:
from spacy.tokens import Token
from collections import Counter

# Add an extension to our tokens called "count"
Token.set_extension("count", default=False, force=True)


# Calculate the number of times that a lemma appears in the text
counts = Counter([token.lemma_ for token in vol_1_doc if not token.is_punct and not token.is_stop]).most_common(100)
counts = dict(counts)

# Add the count to each token. 
vol_1_doc = nlp(vol_1)
for token in vol_1_doc:
    if token.lemma_ in counts.keys():
        token._.count = counts[token.lemma_]

# Repeat for the second text and find the difference 
counts = Counter([token.lemma_ for token in vol_2_doc if not token.is_punct and not token.is_stop]).most_common(100)
counts = dict(counts)

# These are clearly not words, so let's get rid of them.
# pop() with a default avoids a KeyError when an entry is not among the top 100.
for not_word in ['\r\n', '\r\n\r\n', ' ']:
    counts.pop(not_word, None)


vol_2_doc = nlp(vol_2)
for token in vol_2_doc:
    if token.lemma_ in counts.keys():
        token._.count = counts[token.lemma_]

# Now we find the difference between the most common words in the two texts        
#set_vol1 = set([(token.lemma_, token._.count) for token in vol_1_doc if token._.count])
#set_vol2 = set([(token.lemma_, token._.count) for token in vol_2_doc if token._.count])
not_words = ['\r\n','\r\n\r\n',' ']
set_vol1 = set([token.lemma_ for token in vol_1_doc if token._.count and token.lemma_ not in not_words])
set_vol2 = set([token.lemma_ for token in vol_2_doc if token._.count and token.lemma_ not in not_words])

# Frequent lemmas in volume two that are not among the frequent lemmas of volume one
difference = set_vol2.difference(set_vol1)
difference

# Models 

What if we wanted to create a list of the 100 most frequent verbs or nouns in the text?  With the base English language object, token.pos_ returns nothing. Also take a look at our lemmas. Are those really the root forms?  The base language object simply does not know parts of speech or lemmata.  We need a model that does. 

Here is a listing of the officially supported spaCy models: https://spacy.io/models

There are currently models for:
- English
- German
- French
- Spanish
- Portuguese
- Italian
- Dutch
- Greek
- Multi-language

The spaCy documentation lists the features and capabilities of each model.  Keep in mind that there can be several models for a language.  Larger models are often slower and require more memory. In exchange, the larger models are often more accurate and have more features such as word vectors, dependency parsing and other pipelines.   If you're not using the more advanced features of a large model, then you would probably be better off using something small.  As a general rule, it's best to start small and then deliberately move up as needed. 


To add a spaCy supported model, run `python -m spacy download <name of model>`, for example `python -m spacy download en_core_web_sm`.
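
If you are not sure whether a model has been downloaded yet, one possible pattern is to try loading it and fall back to a blank pipeline. This is a sketch rather than a recommended setup: the fallback keeps the notebook running, but a blank pipeline will not give you parts of speech or entities.

```python
import spacy

MODEL_NAME = 'en_core_web_sm'

try:
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # spaCy raises OSError when the model package cannot be found
    print(MODEL_NAME, 'is not installed; using a blank English pipeline instead')
    nlp = spacy.blank('en')

# pipe_names lists the processing components available in this pipeline
print(nlp.pipe_names)
```

Comparing `nlp.pipe_names` for a small and a large model is a quick way to see which components each one provides.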


In [0]:
import spacy

#English base language object
#nlp = English()

#English small language model
nlp = spacy.load('en_core_web_sm')


doc = nlp('Be yourself; everyone else is already taken.')
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

**Further Reading on Parts of Speech**  

[Jonathan Reeve, Isolating Literary Style with Raymond Queneau](https://jonreeve.com/2019/09/exercises-in-style/) ([code notebook](https://gist.github.com/JonathanReeve/cacf9d874b405b621710a7436425af49))

<img height=100 src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png" alt="Spacy Logo" style="width: 80px;"/>  
##  Named Entity Recognition
Most of the models in spaCy have an entity recognizer.  This is similar to identifying parts of speech in the text, but greatly expands what we can automatically identify.  The types of entities and categories will vary from model to model and should be in the model's documentation. For most languages, the categories are: 

|   [named entities](https://spacy.io/api/annotation#ner-wikipedia-scheme) |   | 
|---|---|
| PER  | Named person or family  |
| ORG  | Named corporate, governmental, or other organizational entity. |
| LOC  | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).  |
| MISC | Miscellaneous entities, e.g. events, nationalities, products or works of art. |

[Here is a list of the categories in the spaCy small English model](https://spacy.io/api/annotation#named-entities)

[Here is a useful web application that can be used to assess the categories available in various spaCy models](https://explosion.ai/demos/displacy-ent)


Full documentation can be found [here](https://spacy.io/usage/linguistic-features#named-entities-101).
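
Each entity in `doc.ents` is a span with a `label_`, so a quick overview of a text is to count the labels. In the sketch below the entities are set by hand on a blank pipeline so that it runs without a model download; with a trained model, `doc.ents` would already be filled in and the labels would follow that model's scheme.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')
doc = nlp('Kemp met Marvel in London')

# Hand-made entities for illustration; a trained model would fill doc.ents itself
doc.ents = [
    Span(doc, 0, 1, label='PERSON'),
    Span(doc, 2, 3, label='PERSON'),
    Span(doc, 4, 5, label='GPE'),
]

# Count how often each entity type occurs
label_counts = Counter(ent.label_ for ent in doc.ents)
print(label_counts)
```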

--- 

H.G. Wells, *The Invisible Man* (1897)
<img src="https://www.slashfilm.com/wp/wp-content/images/invisible-man-cast-new.jpg" alt="invisible man photo" style="width: 600px;"/>

In [0]:
import requests 
invisible_man = requests.get('http://www.gutenberg.org/cache/epub/5230/pg5230.txt')

In [0]:
import spacy
from spacy import displacy
import en_core_web_sm
from IPython.display import HTML, IFrame

nlp = spacy.load('en_core_web_sm')
doc = nlp(invisible_man.text[600:1500])

HTML(displacy.render(doc, style="ent"))

In [0]:
# list of people that appear in the text 
import pandas as pd
doc = nlp(invisible_man.text)
person_list = []
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text.replace('\r','').replace('\n',''))

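# A sorted list gives a stable order and avoids errors in pandas versions that reject sets
df = pd.DataFrame(sorted(set(person_list)), columns=['person'])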
df.head(10)

In [0]:
# list of places that appear in the text 
import pandas as pd
doc = nlp(invisible_man.text)
place_list = []
for ent in doc.ents:
    if ent.label_ == 'GPE':
        place_list.append(ent.text)

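# A sorted list gives a stable order and avoids errors in pandas versions that reject sets
df = pd.DataFrame(sorted(set(place_list)), columns=['place'])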
df.head(10)

In [0]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
HTML(displacy.render(next(doc.sents), style="dep"))

In [0]:
# Source https://github.com/pmbaumgartner/binder-notebooks/blob/master/holy-nlp.ipynb 

actors_and_actions = []

def token_is_subject_with_action(token):
    nsubj = token.dep_ == 'nsubj'
    head_verb = token.head.pos_ == 'VERB'
    person = token.ent_type_ == 'PERSON'
    return nsubj and head_verb and person

for token in doc:
    if token_is_subject_with_action(token):
        span = doc[token.head.left_edge.i:token.head.right_edge.i+1]
        data = dict(name=token.orth_,
                    span=span.text,
                    verb=token.head.lower_,
                    log_prob=token.head.prob,
                    )
        actors_and_actions.append(data)

print(len(actors_and_actions))

473


In [0]:
import pandas as pd

action_df = pd.DataFrame(actors_and_actions)

print('Unique Names:', action_df['name'].nunique())

most_common = (action_df
    .groupby(['name', 'verb'])
    .size()
    .groupby(level=0, group_keys=False)
    .nlargest(1)
    .rename('Count')
    .reset_index(level=0)
    .rename(columns={
        'verb': 'Most Common'
    })
)

# exclude log prob < -20, those indicate absence in the model vocabulary
most_unique = (action_df[action_df['log_prob'] > -20]
    .groupby(['name', 'verb'])['log_prob']
    .min()
    .groupby(level=0, group_keys=False)
    .nsmallest(1)
    .rename('Log Prob.')
    .reset_index(level = 0)
    .rename(columns={
        'verb': 'Most Unique'
    })
)

# SO groupby credit
# https://stackoverflow.com/questions/27842613/pandas-groupby-sort-within-groups

Unique Names: 30


In [0]:
most_common.sort_values('Count', ascending=False).head(15)


| verb | name | Count |
|---|---|---|
| said | Kemp | 68 |
| said | Marvel | 50 |
| said | Hall | 34 |
| said | Adye | 19 |
| said | Henfrey | 13 |
| said | Bunting | 11 |
| said | Cuss | 10 |
| said | Jaffers | 9 |
| said | Griffin | 4 |
| go | Lemme | 3 |


## Working with StanfordNLP models ![](https://pbs.twimg.com/profile_images/897182721272799232/0CplRl36_400x400.jpg)

[Documentation](https://stanfordnlp.github.io/stanfordnlp/installation_usage.html)

```
$ pip install stanfordnlp spacy-stanfordnlp

```

```python
import stanfordnlp
stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
```

```
Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
y

Default download directory: /home/ajanco/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: /home/ajanco/stanfordnlp_resources/en_ewt_models.zip
100%|██████████| 235M/235M [00:51<00:00, 4.92MB/s] 

Download complete.  Models saved to: /home/ajanco/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
```

In [0]:
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

doc = nlp('Be yourself; everyone else is already taken.')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Transformer models 
[spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2](https://explosion.ai/blog/spacy-transformers)

In [0]:
!pip install spacy-transformers
!python -m spacy download en_vectors_web_lg

Collecting en_vectors_web_lg==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz (661.8MB)
Building wheels for collected packages: en-vectors-web-lg
  Building wheel for en-vectors-web-lg (setup.py) ... done
  Created wheel for en-vectors-web-lg: filename=en_vectors_web_lg-2.1.0-cp36-none-any.whl size=663461747 sha256=29c0b957ba460d655f6326b68ae96f4fdb8d9597f3c631b0005fd2769663e6f5
  Stored in directory: /tmp/pip-ephem-wheel-cache-grylxdwg/wheels/ce/3e/83/59647d0b4584003cce18fb68ecda2866e7c7b2722c3ecaddaf
Successfully built en-vectors-web-lg
Installing collected packages: en-vectors-web-lg
Successfully installed en-vectors-web-lg-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_vectors_web_lg')


In [0]:
nlp = spacy.load("en_vectors_web_lg")


OSError: ignored

![](https://spacy.io/architecture-bcdfffe5c0b9f221a2f6607f96ca0e4a.svg)

# Summary

In [0]:
# Language object
from spacy.lang.es import Spanish
nlp = Spanish()

# Doc object
doc = nlp("La duda es uno de los nombres de la inteligencia.")

# Tokens
for token in doc:
    print(token)
    
# Spans
span = doc[0:2]
print(span)

In [2]:
import spacy 
# python -m spacy download es_core_news_sm

# Models
nlp = spacy.load("es_core_news_sm")
doc = nlp("La duda es uno de los nombres de la inteligencia.")

# Part of speech 
for token in doc:
    print(token.pos_)
    
# Entities 
for ent in doc.ents:
    print(ent.label_)

DET
NOUN
AUX
PRON
ADP
DET
NOUN
ADP
DET
NOUN
PUNCT
