# Introduction to spaCy

This notebook demonstrates how to use spaCy for various NLP tasks. Compared to NLTK, spaCy works more like a black box: many powerful text-preprocessing functions are built in and applied to a text through a single pipeline.

## Background reading
For background reading, see the documentation on the spaCy website: https://spacy.io

## Installing spaCy on your local computer
You first need to install spaCy (see https://spacy.io):

* Install spaCy through the command line or terminal: "conda install -c conda-forge spacy"

You also need to get the English NLP package (https://spacy.io/models/en) through the command line: 

* "python -m spacy download en_core_web_sm"

You only need to do this once. The next time you start a notebook, everything is available for import.
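
A quick way to confirm that both installs succeeded is to ask Python whether it can locate the packages. The sketch below uses only the standard library; the helper name `is_installed` is an illustrative assumption, not part of spaCy.

```python
from importlib.util import find_spec

def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return find_spec(package_name) is not None

# 'spacy' is the library; 'en_core_web_sm' is the English model package
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "NOT found"
    print(name, "->", status)
```

If either line reports "NOT found", repeat the corresponding install command above in the same environment that runs the notebook.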

After installing spaCy on your local machine, we import the relevant packages.

In [1]:
import spacy
# We also import some helpers for pretty-printing and visualising the output of spaCy
from pprint import pprint
from spacy import displacy
from collections import Counter

## Loading a language model
We are going to load the English language model. For other languages see: https://spacy.io/models/

In [2]:
# We import the English core module for NLP
import en_core_web_sm
nlp = en_core_web_sm.load()

## Applying the spaCy NLP pipeline to an example text

We are going to use the text on the Apple-Samsung patent law cases. Download the file from Canvas and store it on your local drive, if you have not done so already. Adapt the path in the examples below to your own location. Note that Windows uses backslashes in directory paths where Linux and Mac use forward slashes.
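
One way to sidestep the slash difference is to build paths with `pathlib` from the standard library, which inserts the separator for you. This is a minimal sketch; the folder names `Desktop` and `notebooks` are placeholders for illustration, not the actual course layout.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Placeholder folder names: adapt them to where you stored the file
parts = ("Desktop", "notebooks", "Lab1-apple-samsung-example.txt")

# Path picks the separator that matches the platform you are running on
example_path = Path(*parts)
print(example_path)

# The same parts rendered in both styles, independent of your own platform
posix_style = str(PurePosixPath(*parts))
windows_style = str(PureWindowsPath(*parts))
print(posix_style)    # Desktop/notebooks/Lab1-apple-samsung-example.txt
print(windows_style)  # Desktop\notebooks\Lab1-apple-samsung-example.txt
```

A `Path` object such as `example_path` can be passed directly to `open()`.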

In [3]:
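# Adapt this path to where you stored the file
path = '/Users/piek/Desktop/TextMiningFEW-2019/text-mining-ba-git/notebooks/Lab1-apple-samsung-example.txt'
# The with-statement closes the file automatically after reading
with open(path, 'r', encoding='utf-8') as f:
    example_text = f.read()
pprint(example_text)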

('https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html\n'
 '\n'
 'Documents filed to the San Jose federal court in California on November 23 '
 'list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" '
 'operating systems, which Apple claims infringe its patents.\n'
 'The six phones and tablets affected are the Galaxy S III, running the new '
 'Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, '
 'Galaxy Rugby Pro and Galaxy S III mini.\n'
 'Apple stated it had “acted quickly and diligently" in order to "determine '
 'that these newly released products do infringe many of the same claims '
 'already asserted by Apple."\n'
 'In August, Samsung lost a US patent case to Apple and was ordered to pay its '
 'rival $1.05bn (£0.66bn) in damages for copying features of the iPad and '
 "iPhone in its Galaxy range of devices. Samsung, which is the world's top "
 'mobile phone maker, is appeal

We apply the English nlp pipeline to a text string, which returns a spaCy Doc object that we assign to the variable 'doc'.

In [4]:
doc = nlp(example_text)

Type 'doc.' followed by TAB to see the available methods and attributes. Take your time to check out the documentation at https://spacy.io/api/ to learn what they are and what you can do with them. The better you understand the data objects and functions, the easier it will be to use them in your final assignment.

## Tokens and their offsets

In [5]:
print("Number of tokens in the document=", len(doc))
print("The text fragment starting from the 3rd token till the 10th=",doc[2:10])
# doc[-2] is the last-but-one token (the same as doc[211] here, since the document has 213 tokens)
lastbutonetoken = doc[-2]
print("The last-but-one token in the document with its Part-of-Speech=",lastbutonetoken.text, lastbutonetoken.pos_)

Number of tokens in the document= 213
The text fragment starting from the 3rd token till the 10th= Documents filed to the San Jose federal court
The last-but-one token in the document with its Part-of-Speech= devices NOUN


In [6]:
words = [x.text for x in doc]
pos  = [x.pos_ for x in doc]
print("Number of words in the document=", len(words))
print("Frequency count of the Part-of-Speech")
Counter(pos)

Number of words in the document= 213
Frequency count of the Part-of-Speech


Counter({'NOUN': 31,
         'SPACE': 5,
         'VERB': 30,
         'ADP': 17,
         'DET': 17,
         'PROPN': 41,
         'ADJ': 19,
         'NUM': 8,
         'PUNCT': 24,
         'CCONJ': 7,
         'PRON': 1,
         'ADV': 6,
         'PART': 5,
         'SYM': 2})
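
The counts above are not sorted by frequency. To rank the tags, `Counter.most_common` can be used. The sketch below hard-codes the counts from the output above so that it runs without spaCy; in the notebook you would call the same method on `Counter(pos)` directly.

```python
from collections import Counter

# Counts copied from the output above
pos_counts = Counter({'NOUN': 31, 'SPACE': 5, 'VERB': 30, 'ADP': 17, 'DET': 17,
                      'PROPN': 41, 'ADJ': 19, 'NUM': 8, 'PUNCT': 24, 'CCONJ': 7,
                      'PRON': 1, 'ADV': 6, 'PART': 5, 'SYM': 2})

total = sum(pos_counts.values())
print("Total tokens:", total)

# The three most frequent tags with their share of all tokens
for tag, count in pos_counts.most_common(3):
    print("{0}\t{1}\t{2:.1%}".format(tag, count, count / total))
```

Note that the total includes the SPACE and PUNCT tokens, which is why it equals the number of tokens rather than the number of words in the everyday sense.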

In [7]:
sentences = list(doc.sents)
print(sentences[3])

Apple stated it had “acted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple."



In [8]:
lastsentence=sentences[-1]
print("Number of tokens in the last sentence=", len(lastsentence))
print("First five words in the last sentence:", lastsentence[0:5])
for token in lastsentence:
    print('"' + token.text + '", offset:', token.idx)

Number of tokens in the last sentence= 36
First five words in the last sentence: A similar case in the
"A", offset: 953
"similar", offset: 955
"case", offset: 963
"in", offset: 968
"the", offset: 971
"UK", offset: 975
"found", offset: 978
"in", offset: 984
"Samsung", offset: 987
"'s", offset: 994
"favour", offset: 997
"and", offset: 1004
"ordered", offset: 1008
"Apple", offset: 1016
"to", offset: 1022
"publish", offset: 1025
"an", offset: 1033
"apology", offset: 1036
"making", offset: 1044
"clear", offset: 1051
"that", offset: 1057
"the", offset: 1062
"South", offset: 1066
"Korean", offset: 1072
"firm", offset: 1079
"had", offset: 1084
"not", offset: 1088
"copied", offset: 1092
"its", offset: 1099
"iPad", offset: 1103
"when", offset: 1108
"designing", offset: 1113
"its", offset: 1123
"own", offset: 1127
"devices", offset: 1131
".", offset: 1138

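The offsets printed above are character positions in the original text, so slicing the text at an offset gives back the token. The sketch below illustrates the idea without spaCy on a shortened version of the last sentence. The regular-expression tokenizer is a stand-in chosen for illustration only; spaCy's tokenizer is more sophisticated and, for instance, splits "Samsung's" into two tokens.

```python
import re

text = "A similar case in the UK found in Samsung's favour."

# Stand-in tokenizer: runs of word characters/apostrophes, or single punctuation marks
tokens = [(m.group(), m.start()) for m in re.finditer(r"[\w']+|[^\w\s]", text)]

for token_text, offset in tokens:
    # Slicing the original text at the offset recovers the token
    assert text[offset:offset + len(token_text)] == token_text
    print('"' + token_text + '", offset:', offset)
```

With a spaCy token, the corresponding check would be `doc.text[token.idx:token.idx + len(token)] == token.text`.
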

In [9]:
# Showing a range of properties for each token in table format
for token in lastsentence:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

A	953	a	False	False	X	DET	DT
similar	955	similar	False	False	xxxx	ADJ	JJ
case	963	case	False	False	xxxx	NOUN	NN
in	968	in	False	False	xx	ADP	IN
the	971	the	False	False	xxx	DET	DT
UK	975	uk	False	False	XX	PROPN	NNP
found	978	find	False	False	xxxx	VERB	VBD
in	984	in	False	False	xx	ADP	IN
Samsung	987	samsung	False	False	Xxxxx	PROPN	NNP
's	994	's	False	False	'x	PART	POS
favour	997	favour	False	False	xxxx	NOUN	NN
and	1004	and	False	False	xxx	CCONJ	CC
ordered	1008	order	False	False	xxxx	VERB	VBD
Apple	1016	apple	False	False	Xxxxx	PROPN	NNP
to	1022	to	False	False	xx	PART	TO
publish	1025	publish	False	False	xxxx	VERB	VB
an	1033	an	False	False	xx	DET	DT
apology	1036	apology	False	False	xxxx	NOUN	NN
making	1044	make	False	False	xxxx	VERB	VBG
clear	1051	clear	False	False	xxxx	ADJ	JJ
that	1057	that	False	False	xxxx	ADP	IN
the	1062	the	False	False	xxx	DET	DT
South	1066	south	False	False	Xxxxx	ADJ	JJ
Korean	1072	korean	False	False	Xxxxx	ADJ	JJ
firm	1079	firm	False	False	xxxx	NOUN	NN
had	1084	have	Fa

## Constituents (chunks) and their heads

In [10]:
print("CHUNKS")
for chunk in lastsentence.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

CHUNKS
A similar case NP case
the UK NP UK
Samsung's favour NP favour
Apple NP Apple
an apology NP apology
the South Korean firm NP firm
its iPad NP iPad
its own devices NP devices


## Dependencies

In [11]:
print("DEPENDENCIES")
for token in lastsentence:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

DEPENDENCIES
A/DT <--det-- case/NN
similar/JJ <--amod-- case/NN
case/NN <--ROOT-- case/NN
in/IN <--prep-- case/NN
the/DT <--det-- UK/NNP
UK/NNP <--pobj-- in/IN
found/VBD <--acl-- case/NN
in/IN <--prep-- found/VBD
Samsung/NNP <--poss-- favour/NN
's/POS <--case-- Samsung/NNP
favour/NN <--pobj-- in/IN
and/CC <--cc-- found/VBD
ordered/VBD <--conj-- found/VBD
Apple/NNP <--dobj-- ordered/VBD
to/TO <--aux-- publish/VB
publish/VB <--xcomp-- ordered/VBD
an/DT <--det-- apology/NN
apology/NN <--dobj-- publish/VB
making/VBG <--acl-- apology/NN
clear/JJ <--ccomp-- making/VBG
that/IN <--mark-- copied/VBN
the/DT <--det-- firm/NN
South/JJ <--amod-- Korean/JJ
Korean/JJ <--amod-- firm/NN
firm/NN <--nsubj-- copied/VBN
had/VBD <--aux-- copied/VBN
not/RB <--neg-- copied/VBN
copied/VBN <--ccomp-- making/VBG
its/PRP$ <--poss-- iPad/NNP
iPad/NNP <--dobj-- copied/VBN
when/WRB <--advmod-- designing/VBG
designing/VBG <--advcl-- copied/VBN
its/PRP$ <--poss-- devices/NNS
own/JJ <--amod-- devices/NNS
devices/NNS <-
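
Each line above links a token to its head, and the root of the sentence points to itself. Following these links upwards gives the path from any token to the root. The sketch below does this without spaCy for the first eleven tokens. The head indices are a hand-made reading of the output above, which shows only the head's text, so which of the two "in" tokens is meant is an assumption based on word order.

```python
# (index, text, dependency label, index of head), copied by hand from the
# output above; the root ("case") has itself as head
arcs = [
    (0, "A", "det", 2),
    (1, "similar", "amod", 2),
    (2, "case", "ROOT", 2),
    (3, "in", "prep", 2),
    (4, "the", "det", 5),
    (5, "UK", "pobj", 3),
    (6, "found", "acl", 2),
    (7, "in", "prep", 6),
    (8, "Samsung", "poss", 10),
    (9, "'s", "case", 8),
    (10, "favour", "pobj", 7),
]

def path_to_root(index):
    """Follow head links from a token up to the root of the sentence."""
    path = [arcs[index][1]]
    while arcs[index][3] != index:
        index = arcs[index][3]
        path.append(arcs[index][1])
    return path

print(" -> ".join(path_to_root(10)))
```

On a spaCy sentence, a comparable loop would follow `token.head` until it reaches the token whose `dep_` is `ROOT`.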

In [14]:
# Nice visualisation of the dependencies for the last sentence
displacy.render(lastsentence, jupyter=True, style='dep')

## Entities

For an overview of the entity types, see: https://spacy.io/usage/linguistic-features#entity-types


In [15]:
print("ENTITIES")
pprint([(X.text, X.label_) for X in doc.ents])

ENTITIES
[('San Jose', 'GPE'),
 ('California', 'GPE'),
 ('November 23', 'DATE'),
 ('six', 'CARDINAL'),
 ('Samsung', 'ORG'),
 ('Jelly Bean', 'PERSON'),
 ('Ice Cream Sandwich', 'WORK_OF_ART'),
 ('Apple', 'ORG'),
 ('\n', 'GPE'),
 ('six', 'CARDINAL'),
 ('the Galaxy S III', 'FAC'),
 ('Jelly Bean', 'PERSON'),
 ('Galaxy', 'GPE'),
 ('8.9 Wifi tablet', 'TIME'),
 ('the Galaxy Tab 2 10.1', 'FAC'),
 ('Galaxy Rugby Pro', 'ORG'),
 ('Galaxy S III', 'ORG'),
 ('\n', 'GPE'),
 ('Apple', 'ORG'),
 ('Apple', 'ORG'),
 ('\n', 'GPE'),
 ('August', 'DATE'),
 ('Samsung', 'ORG'),
 ('US', 'GPE'),
 ('Apple', 'ORG'),
 ('1.05bn', 'MONEY'),
 ('0.66bn', 'MONEY'),
 ('iPad', 'ORG'),
 ('iPhone', 'ORG'),
 ('Galaxy', 'GPE'),
 ('Samsung', 'ORG'),
 ('\n', 'GPE'),
 ('UK', 'GPE'),
 ('Samsung', 'ORG'),
 ('Apple', 'ORG'),
 ('South Korean', 'NORP'),
 ('iPad', 'ORG')]
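
The output above contains some clear mistakes, such as line breaks labelled as GPE. A light post-processing step can drop whitespace-only entities before counting the labels. The sketch below works on a hand-copied subset of the output so that it runs without spaCy; the variable names are illustrative.

```python
from collections import Counter

# A hand-copied subset of the (text, label) pairs from the output above
entities = [
    ('San Jose', 'GPE'), ('California', 'GPE'), ('November 23', 'DATE'),
    ('Samsung', 'ORG'), ('Jelly Bean', 'PERSON'), ('Apple', 'ORG'),
    ('\n', 'GPE'), ('Galaxy', 'GPE'), ('8.9 Wifi tablet', 'TIME'),
    ('US', 'GPE'), ('1.05bn', 'MONEY'), ('iPad', 'ORG'),
]

# Drop entities that consist of whitespace only, such as '\n'
cleaned = [(text, label) for text, label in entities if text.strip()]

# Count how often each entity label occurs in the cleaned list
label_counts = Counter(label for _, label in cleaned)
print(label_counts.most_common())
```

In the notebook, the same filter could be applied directly as `[(X.text, X.label_) for X in doc.ents if X.text.strip()]`. It removes only the whitespace entities; mislabelled ones such as 'Jelly Bean' as PERSON remain.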


In [16]:
# Highlight the entities and their type through colours in the text
displacy.render(doc, jupyter=True, style='ent')