## Basics of spaCy

Lei Lei

Shanghai International Studies University

leileicn@qq.com

November 8, 2024


NLP tasks: tokenising, pos-tagging, lemmatising, syntactic parsing, ... 

NLP packages:

1. NLTK

2. Stanford CoreNLP (Java; Stanza, Python)

3. spaCy (Python), a state-of-the-art package for NLP tasks.

   ...

### What spaCy is
- spaCy is "a library for advanced Natural Language Processing in Python and Cython." 
- It supports a number of tasks widely used in applied linguistics research, such as sentence segmentation (splitting), tokenization, part-of-speech (POS) tagging, lemmatization, syntactic parsing, named entity recognition, and word embeddings (word2vec).

### Features
- Free, open source
- Fast ("Blazing fast")
- Supporting many languages (67+ languages in spaCy 3.2, November 2022)
- Incorporating multiple tasks: "Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more."
- Integrating state-of-the-art research (word2vec; sense2vec, a library for computing word similarities based on word2vec; deep learning with pretrained transformers like BERT; ...)

### Installation

#### Installation of spaCy
- pip install -U pip setuptools wheel
- pip install -U spacy

- (or, with pip3: pip3 install ...)

#### Downloading pretrained models 
- python -m spacy download en_core_web_sm
- python -m spacy download zh_core_web_sm

- pip3 install spacy

- python -m spacy download en_core_web_sm
- python -m spacy download en_core_web_md
- python -m spacy download en_core_web_lg

- python -m spacy download zh_core_web_sm
- pip3 install jieba
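
Pretrained pipelines such as en_core_web_sm are installed as ordinary Python packages, so a quick check that everything is in place can be sketched with the standard library alone. The helper name `is_installed` is hypothetical (it is not part of spaCy), and the package names are simply those used in the commands above.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Package names taken from the install/download commands above
for name in ["spacy", "en_core_web_sm", "zh_core_web_sm", "jieba"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Anything reported as missing can be installed with the corresponding command from the list above.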


In [55]:
import pandas as pd  # pd is an alias

import spacy

## 1. Tokenization and POS tagging with spaCy

In [56]:
# Loading the pretrained model: English tokenizer, tagger, parser

nlp = spacy.load('en_core_web_sm')

# nlp = spacy.load('./en_core_web_sm')

In [57]:
# test on sentences
my_sents = "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more."

# my_sents = '''Research on incidental second language (L2) vocabulary acquisition through reading has claimed that repeated encounters with unfamiliar words and the relative elaboration of processing these words facilitate word learning. 

# However, so far both variables have been investigated in isolation. 

# To help close this research gap, the current study investigates the differential effects of the variables 'word exposure frequency' and 'elaboration of word processing' on the initial word learning and subsequent word retention of advanced learners of L2 English. Whereas results showed equal effects for both variables on initial word learning, subsequent word retention was more contingent on elaborate processing of form-meaning relationships than on word frequency. These results, together with those of the studies reviewed, suggest that processing words again after reading (input output cycles) is superior to reading-only tasks. The findings have significant implications for adaptation and development of teaching materials that enhance L2 vocabulary learning.'''

In [58]:
# Processing the sents

doc = nlp(my_sents)

In [59]:
# Sentence segmentation

# for sent in doc.sents:
#     print('->', sent)

# Store the sentences in a new variable so that my_sents still holds the original string
sentences = list(doc.sents)
sentences


[spaCy is a free open-source library for Natural Language Processing in Python.,
 It features NER, POS tagging, dependency parsing, word vectors and more.]

In [60]:
# Check out the tokens, lemmas, and part-of-speech tags

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)

spaCy spacy INTJ UH
is be AUX VBZ
a a DET DT
free free ADJ JJ
open open ADJ JJ
- - PUNCT HYPH
source source NOUN NN
library library NOUN NN
for for ADP IN
Natural Natural PROPN NNP
Language Language PROPN NNP
Processing Processing PROPN NNP
in in ADP IN
Python Python PROPN NNP
. . PUNCT .
It it PRON PRP
features feature VERB VBZ
NER NER PROPN NNP
, , PUNCT ,
POS POS PROPN NNP
tagging tagging NOUN NN
, , PUNCT ,
dependency dependency NOUN NN
parsing parsing NOUN NN
, , PUNCT ,
word word NOUN NN
vectors vector NOUN NNS
and and CCONJ CC
more more ADJ JJR
. . PUNCT .


### Storing the results in a DataFrame
#### Method 1

- store tokens, lemmas, pos_tags in lists
- obtain a DataFrame with the lists

In [61]:
tokens_list = []
lemmas_list = []
pos_list = []
tag_list = []

mydf = pd.DataFrame()

In [62]:
for token in doc:

    tokens_list.append(token.text)
    lemmas_list.append(token.lemma_)
    pos_list.append(token.pos_)
    tag_list.append(token.tag_)

In [63]:
mydf['tokens'] = tokens_list
mydf['lemmas'] = lemmas_list
mydf['pos'] = pos_list
mydf['tag'] = tag_list

mydf.head()

|   | tokens | lemmas | pos | tag |
|---|--------|--------|-----|-----|
| 0 | spaCy | spacy | INTJ | UH |
| 1 | is | be | AUX | VBZ |
| 2 | a | a | DET | DT |
| 3 | free | free | ADJ | JJ |
| 4 | open | open | ADJ | JJ |


### Storing the results in a DataFrame
#### Method 2

- use list comprehension to store the tokens, lemmas, pos_tags in a list of lists
- obtain the DataFrame with the list of lists
- More concise: a single expression replaces the four separate lists and the loop

In [64]:
# list comprehension: returns a list (here, a list of lists)
# takes the place of a for loop

# mylist = []

# for i in range(0, 10, 2):
#     mylist.append(i)

# print(mylist)

# [i for i in range(0, 10, 2)]


mylist = ['life', 'is', 'short', 'i', 'love', 'python']

# enumerate() yields (index, item) pairs
# (idx is used instead of id, which is the name of a Python built-in)

# mylist2 = []

# for idx, token in enumerate(mylist):
#     mylist2.append([idx, token])
#     # mylist2.append((idx, token))

# mylist2 

[[idx, token] for idx, token in enumerate(mylist)]

# lists and tuples are both containers

[[0, 'life'], [1, 'is'], [2, 'short'], [3, 'i'], [4, 'love'], [5, 'python']]
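
A list comprehension can also carry a condition, which replaces an `if` inside the loop. The sketch below reuses `mylist` from the cell above; the length threshold and the variable names `long_words_loop` and `long_words` are chosen only for illustration.

```python
mylist = ['life', 'is', 'short', 'i', 'love', 'python']

# for-loop version: keep (position, word) for words longer than two letters
long_words_loop = []
for idx, word in enumerate(mylist):
    if len(word) > 2:
        long_words_loop.append((idx, word))

# equivalent list comprehension with a condition at the end
long_words = [(idx, word) for idx, word in enumerate(mylist) if len(word) > 2]

print(long_words)
# [(0, 'life'), (2, 'short'), (4, 'love'), (5, 'python')]
```

Here each item is stored as a tuple rather than a list; either container works as a row when building a DataFrame.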

In [65]:
[[token.text, token.lemma_, token.pos_, token.tag_] for token in doc]

[['spaCy', 'spacy', 'INTJ', 'UH'],
 ['is', 'be', 'AUX', 'VBZ'],
 ['a', 'a', 'DET', 'DT'],
 ['free', 'free', 'ADJ', 'JJ'],
 ['open', 'open', 'ADJ', 'JJ'],
 ['-', '-', 'PUNCT', 'HYPH'],
 ['source', 'source', 'NOUN', 'NN'],
 ['library', 'library', 'NOUN', 'NN'],
 ['for', 'for', 'ADP', 'IN'],
 ['Natural', 'Natural', 'PROPN', 'NNP'],
 ['Language', 'Language', 'PROPN', 'NNP'],
 ['Processing', 'Processing', 'PROPN', 'NNP'],
 ['in', 'in', 'ADP', 'IN'],
 ['Python', 'Python', 'PROPN', 'NNP'],
 ['.', '.', 'PUNCT', '.'],
 ['It', 'it', 'PRON', 'PRP'],
 ['features', 'feature', 'VERB', 'VBZ'],
 ['NER', 'NER', 'PROPN', 'NNP'],
 [',', ',', 'PUNCT', ','],
 ['POS', 'POS', 'PROPN', 'NNP'],
 ['tagging', 'tagging', 'NOUN', 'NN'],
 [',', ',', 'PUNCT', ','],
 ['dependency', 'dependency', 'NOUN', 'NN'],
 ['parsing', 'parsing', 'NOUN', 'NN'],
 [',', ',', 'PUNCT', ','],
 ['word', 'word', 'NOUN', 'NN'],
 ['vectors', 'vector', 'NOUN', 'NNS'],
 ['and', 'and', 'CCONJ', 'CC'],
 ['more', 'more', 'ADJ', 'JJR'],
 ['.', '.', 'PUNCT', '.']]

In [66]:
list_of_lists = [[token.text, token.lemma_, token.pos_, token.tag_] for token in doc]
list_of_columns = ['tokens', 'lemmas', 'pos', 'tag']

In [67]:
mydf2 = pd.DataFrame(list_of_lists, columns=list_of_columns)

In [68]:
mydf2.head()

|   | tokens | lemmas | pos | tag |
|---|--------|--------|-----|-----|
| 0 | spaCy | spacy | INTJ | UH |
| 1 | is | be | AUX | VBZ |
| 2 | a | a | DET | DT |
| 3 | free | free | ADJ | JJ |
| 4 | open | open | ADJ | JJ |
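
Once the tagging results are stored as rows, simple frequency counts follow directly. As a minimal sketch, the block below counts coarse-grained POS tags with `collections.Counter`; the (token, pos) pairs are copied from the first sentence of the In [65] output so that the example runs without spaCy, and the names `rows` and `pos_counts` are illustrative.

```python
from collections import Counter

# (token, pos) pairs copied from the first sentence of the In [65] output
rows = [
    ['spaCy', 'INTJ'], ['is', 'AUX'], ['a', 'DET'], ['free', 'ADJ'],
    ['open', 'ADJ'], ['-', 'PUNCT'], ['source', 'NOUN'], ['library', 'NOUN'],
    ['for', 'ADP'], ['Natural', 'PROPN'], ['Language', 'PROPN'],
    ['Processing', 'PROPN'], ['in', 'ADP'], ['Python', 'PROPN'], ['.', 'PUNCT'],
]

# Count how often each POS tag occurs
pos_counts = Counter(pos for _, pos in rows)

# Print the tags from most to least frequent
for pos, n in pos_counts.most_common():
    print(pos, n)
```

With a real `doc`, the same count could be taken over `token.pos_` for each token instead of the hard-coded rows.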


## 2. Dependency parsing with spaCy

In [69]:
# Loading the pretrained model: English tokenizer, tagger, parser

# A dependency relation links a head (governor) to a dependent.
# Example sentence: "I love Python."
# Example phrase: "good boy" (adj + noun), where the adjective modifies the noun.

nlp = spacy.load("en_core_web_sm")

In [70]:
# test on sentences
my_sents = "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more."

In [71]:
# Processing the sents

doc = nlp(my_sents)

In [72]:
for token in doc:
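    # token.dep_: dependency relation
    # token.head.text: head token
    # token.head.lemma_: head lemma
    # token.head.pos_: coarse-grained POS tag of the head
    # token.head.tag_: fine-grained POS tag of the head
    # spacy.explain(token.dep_): short description of the relation label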
    
    print(token.text, token.lemma_, token.dep_, spacy.explain(token.dep_), token.head.text, token.head.lemma_)

spaCy spacy nsubj nominal subject is be
is be ROOT root is be
a a det determiner library library
free free amod adjectival modifier library library
open open amod adjectival modifier source source
- - punct punctuation source source
source source compound compound library library
library library attr attribute is be
for for prep prepositional modifier library library
Natural Natural compound compound Language Language
Language Language compound compound Processing Processing
Processing Processing pobj object of preposition for for
in in prep prepositional modifier Processing Processing
Python Python pobj object of preposition in in
. . punct punctuation is be
It it nsubj nominal subject features feature
features feature ROOT root features feature
NER NER dobj direct object features feature
, , punct punctuation NER NER
POS POS compound compound tagging tagging
tagging tagging conj conjunct NER NER
, , punct punctuation tagging tagging
dependency dependency compound compound parsing parsing

In [73]:
[[token.text, token.lemma_, token.dep_, spacy.explain(token.dep_), token.head.text, token.head.lemma_] for token in doc]

[['spaCy', 'spacy', 'nsubj', 'nominal subject', 'is', 'be'],
 ['is', 'be', 'ROOT', 'root', 'is', 'be'],
 ['a', 'a', 'det', 'determiner', 'library', 'library'],
 ['free', 'free', 'amod', 'adjectival modifier', 'library', 'library'],
 ['open', 'open', 'amod', 'adjectival modifier', 'source', 'source'],
 ['-', '-', 'punct', 'punctuation', 'source', 'source'],
 ['source', 'source', 'compound', 'compound', 'library', 'library'],
 ['library', 'library', 'attr', 'attribute', 'is', 'be'],
 ['for', 'for', 'prep', 'prepositional modifier', 'library', 'library'],
 ['Natural', 'Natural', 'compound', 'compound', 'Language', 'Language'],
 ['Language', 'Language', 'compound', 'compound', 'Processing', 'Processing'],
 ['Processing', 'Processing', 'pobj', 'object of preposition', 'for', 'for'],
 ['in', 'in', 'prep', 'prepositional modifier', 'Processing', 'Processing'],
 ['Python', 'Python', 'pobj', 'object of preposition', 'in', 'in'],
 ['.', '.', 'punct', 'punctuation', 'is', 'be'],
 ['It', 'it', 'nsubj', 'nominal subject', 'features', 'feature'],

In [74]:
for token in doc:
    print(token.text, token.head.text, token.dep_, '->', spacy.explain(token.dep_))

spaCy is nsubj -> nominal subject
is is ROOT -> root
a library det -> determiner
free library amod -> adjectival modifier
open source amod -> adjectival modifier
- source punct -> punctuation
source library compound -> compound
library is attr -> attribute
for library prep -> prepositional modifier
Natural Language compound -> compound
Language Processing compound -> compound
Processing for pobj -> object of preposition
in Processing prep -> prepositional modifier
Python in pobj -> object of preposition
. is punct -> punctuation
It features nsubj -> nominal subject
features features ROOT -> root
NER features dobj -> direct object
, NER punct -> punctuation
POS tagging compound -> compound
tagging NER conj -> conjunct
, tagging punct -> punctuation
dependency parsing compound -> compound
parsing tagging conj -> conjunct
, parsing punct -> punctuation
word vectors compound -> compound
vectors parsing conj -> conjunct
and vectors cc -> coordinating conjunction
more vectors conj -> conjunct
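
The head/dependent pairs printed above can be regrouped so that each head lists its dependents. The sketch below is one way to do this with `collections.defaultdict`; the triples are copied from the first sentence of the output above, and heads are keyed by their text, which is only safe here because no head word is repeated in that sentence (with a real `doc`, `token.head.i` would be a safer key).

```python
from collections import defaultdict

# (token, head, relation) triples copied from the first sentence of the output above
triples = [
    ('spaCy', 'is', 'nsubj'),
    ('is', 'is', 'ROOT'),
    ('a', 'library', 'det'),
    ('free', 'library', 'amod'),
    ('open', 'source', 'amod'),
    ('-', 'source', 'punct'),
    ('source', 'library', 'compound'),
    ('library', 'is', 'attr'),
    ('for', 'library', 'prep'),
    ('Natural', 'Language', 'compound'),
    ('Language', 'Processing', 'compound'),
    ('Processing', 'for', 'pobj'),
    ('in', 'Processing', 'prep'),
    ('Python', 'in', 'pobj'),
    ('.', 'is', 'punct'),
]

# Map each head to the list of (dependent, relation) pairs it governs
dependents_by_head = defaultdict(list)
for token, head, dep in triples:
    if dep != 'ROOT':  # the root is listed as its own head, so skip it
        dependents_by_head[head].append((token, dep))

print(dependents_by_head['library'])
# [('a', 'det'), ('free', 'amod'), ('source', 'compound'), ('for', 'prep')]
```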

#### Getting word position id

In [75]:
# Dependency distance is computed from word positions:
# token.i: position of the token in the doc
# token.head.i: position of its head

# for token in doc:
#     print(token.text, token.i, '->', token.dep_, '->', token.head.text, token.head.i)
        
[[token.text, token.i, token.dep_, token.head.text, token.head.i] for token in doc]

[['spaCy', 0, 'nsubj', 'is', 1],
 ['is', 1, 'ROOT', 'is', 1],
 ['a', 2, 'det', 'library', 7],
 ['free', 3, 'amod', 'library', 7],
 ['open', 4, 'amod', 'source', 6],
 ['-', 5, 'punct', 'source', 6],
 ['source', 6, 'compound', 'library', 7],
 ['library', 7, 'attr', 'is', 1],
 ['for', 8, 'prep', 'library', 7],
 ['Natural', 9, 'compound', 'Language', 10],
 ['Language', 10, 'compound', 'Processing', 11],
 ['Processing', 11, 'pobj', 'for', 8],
 ['in', 12, 'prep', 'Processing', 11],
 ['Python', 13, 'pobj', 'in', 12],
 ['.', 14, 'punct', 'is', 1],
 ['It', 15, 'nsubj', 'features', 16],
 ['features', 16, 'ROOT', 'features', 16],
 ['NER', 17, 'dobj', 'features', 16],
 [',', 18, 'punct', 'NER', 17],
 ['POS', 19, 'compound', 'tagging', 20],
 ['tagging', 20, 'conj', 'NER', 17],
 [',', 21, 'punct', 'tagging', 20],
 ['dependency', 22, 'compound', 'parsing', 23],
 ['parsing', 23, 'conj', 'tagging', 20],
 [',', 24, 'punct', 'parsing', 23],
 ['word', 25, 'compound', 'vectors', 26],
 ['vectors', 26, 'conj', 'parsing', 23],
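
With both positions available, dependency distance can be computed per relation. The sketch below assumes one common definition, the absolute difference between the position of a token and the position of its head, and excludes the root, which has no head other than itself; whether punctuation should be counted is a further choice that is left open here. The rows are copied from the first sentence of the output above, and the names `rows`, `distances`, and `mean_dd` are illustrative.

```python
# (token, token position, relation, head position) copied from the output above
rows = [
    ('spaCy', 0, 'nsubj', 1),
    ('is', 1, 'ROOT', 1),
    ('a', 2, 'det', 7),
    ('free', 3, 'amod', 7),
    ('open', 4, 'amod', 6),
    ('-', 5, 'punct', 6),
    ('source', 6, 'compound', 7),
    ('library', 7, 'attr', 1),
    ('for', 8, 'prep', 7),
    ('Natural', 9, 'compound', 10),
    ('Language', 10, 'compound', 11),
    ('Processing', 11, 'pobj', 8),
    ('in', 12, 'prep', 11),
    ('Python', 13, 'pobj', 12),
    ('.', 14, 'punct', 1),
]

# Assumed definition: |token position - head position|, root excluded
distances = [abs(i - head_i) for _, i, dep, head_i in rows if dep != 'ROOT']

# Mean dependency distance of the sentence under that definition
mean_dd = sum(distances) / len(distances)

print(distances)
print(round(mean_dd, 2))
```

With a real `doc`, the same list could be built from `token.i` and `token.head.i` instead of the hard-coded rows.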

#### Wrapping all up in a DataFrame

In [76]:
list_of_lists = [[token.text, token.i, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text, token.head.i, token.head.lemma_, token.head.pos_, token.head.tag_] for token in doc]

list_of_columns = ['token', 'token_id', 'lemma', 'pos', 'tag', 'dep_relation', 'head_token', 'head_token_id', 'head_lemma', 'head_pos', 'head_tag']

In [77]:
mydf3 = pd.DataFrame(list_of_lists, columns = list_of_columns)

In [78]:
mydf3.head()

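|   | token | token_id | lemma | pos | tag | dep_relation | head_token | head_token_id | head_lemma | head_pos | head_tag |
|---|-------|----------|-------|-----|-----|--------------|------------|---------------|------------|----------|----------|
| 0 | spaCy | 0 | spacy | INTJ | UH | nsubj | is | 1 | be | AUX | VBZ |
| 1 | is | 1 | be | AUX | VBZ | ROOT | is | 1 | be | AUX | VBZ |
| 2 | a | 2 | a | DET | DT | det | library | 7 | library | NOUN | NN |
| 3 | free | 3 | free | ADJ | JJ | amod | library | 7 | library | NOUN | NN |
| 4 | open | 4 | open | ADJ | JJ | amod | source | 6 | source | NOUN | NN |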


#### Facilitating the processing steps with a function

In [79]:
def dep_parse(mystr):
    """Parse a string with spaCy and return a DataFrame with one row per token."""
    doc = nlp(mystr)

    # One row per token: the token's own attributes, then its head's attributes
    mylist = [[t.text, t.i, t.lemma_, t.pos_, t.tag_, t.dep_, spacy.explain(t.dep_),
               t.head.text, t.head.i, t.head.lemma_, t.head.pos_, t.head.tag_] for t in doc]

    mycolumns = ['token', 'token_id', 'token_lemma', 'token_pos', 'token_tag', 'dep', 'dep_explanation',
                 'head', 'head_id', 'head_lemma', 'head_pos', 'head_tag']

    mydf = pd.DataFrame(mylist, columns=mycolumns)

    return mydf

In [80]:
my_sents = "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more."

In [81]:
dep_parse(my_sents)

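|   | token | token_id | token_lemma | token_pos | token_tag | dep | dep_explanation | head | head_id | head_lemma | head_pos | head_tag |
|---|-------|----------|-------------|-----------|-----------|-----|-----------------|------|---------|------------|----------|----------|
| 0 | spaCy | 0 | spacy | INTJ | UH | nsubj | nominal subject | is | 1 | be | AUX | VBZ |
| 1 | is | 1 | be | AUX | VBZ | ROOT | root | is | 1 | be | AUX | VBZ |
| 2 | a | 2 | a | DET | DT | det | determiner | library | 7 | library | NOUN | NN |
| 3 | free | 3 | free | ADJ | JJ | amod | adjectival modifier | library | 7 | library | NOUN | NN |
| 4 | open | 4 | open | ADJ | JJ | amod | adjectival modifier | source | 6 | source | NOUN | NN |
| 5 | - | 5 | - | PUNCT | HYPH | punct | punctuation | source | 6 | source | NOUN | NN |
| 6 | source | 6 | source | NOUN | NN | compound | compound | library | 7 | library | NOUN | NN |
| 7 | library | 7 | library | NOUN | NN | attr | attribute | is | 1 | be | AUX | VBZ |
| 8 | for | 8 | for | ADP | IN | prep | prepositional modifier | library | 7 | library | NOUN | NN |
| 9 | Natural | 9 | Natural | PROPN | NNP | compound | compound | Language | 10 | Language | PROPN | NNP |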
