# SPACY BASICS

In this lab we will learn to use the spaCy API to annotate text. Many of the concepts in this lab are explained in detail in the spaCy 101 guide:

https://spacy.io/usage/spacy-101 

The installation page lets you configure the spaCy setup (language, annotators, etc.) that you need:

https://spacy.io/usage 


In [1]:
# Install Spacy and learn about Token and Sentence objects

!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy
  Downloading spacy-3.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 36.6 MB/s eta 0:00:00
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.4.4
    Uninstalling spacy-3.4.4:
      Successfully uninstalled spacy-3.4.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0, but you have spacy 3.5.0 which is incompatible.
Successfully installed spacy-3.5.0


# ASSIGNMENT 1

Install the language modules of your choice. 

Read the documentation in https://spacy.io/usage and choose the language modules (according to your interests) that you would like to install.  
  + TODO: Install the language module(s).
  + TODO: Try different language module versions for one language and compare the results obtained.

In [2]:
# TODO install other language modules of your choice following https://spacy.io/usage
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 85.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.4.1
    Uninstalling en-core-web-sm-3.4.1:
      Successfully uninstalled en-core-web-sm-3.4.1
Successfully installed en-core-web-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Loading the language modules

The nlp object is a language model instance. Throughout this tutorial, nlp refers to the language model loaded from the language package or packages of your choice. In the following steps we will use spaCy to process a string and a text file.
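Loading fails with an error when the requested package is not installed, so it can help to check first. The sketch below is one possible way to do that, not part of the original lab: the name `pipeline` and the fallback to a blank English pipeline (tokenizer only, no tagger, parser or NER) are assumptions made for illustration.

```python
import spacy

MODEL_NAME = "en_core_web_sm"  # the package downloaded earlier in this lab

# spacy.util.is_package() reports whether an installed Python package has this name.
if spacy.util.is_package(MODEL_NAME):
    pipeline = spacy.load(MODEL_NAME)
else:
    # Assumption for illustration: fall back to a blank English pipeline,
    # which tokenizes text but predicts no tags, dependencies or entities.
    pipeline = spacy.blank("en")

print(pipeline.lang)        # language code of the loaded pipeline
print(pipeline.pipe_names)  # components that run on every text, in order
```

Inspecting `pipe_names` is a quick way to see which annotators a language module actually provides before relying on them.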

In [3]:
import spacy
#TODO load the installed language module
nlp = spacy.load('en_core_web_sm')

In [4]:
doc = nlp("Washington University, which is located in Missouri, is named after George Washington.")
print(doc)

Washington University, which is located in Missouri, is named after George Washington.


# ASSIGNMENT 2

When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

+ TODO: print the tokens in the Doc object. You should get something like the output below.
+ TODO: print the description of each tag (see morphology example, below)
+ TODO: print the entities recognized by iterating over the Doc object (scroll down past the morphology output to see an example).
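The relationship between Doc, Token and Span can be sketched as follows. This is a minimal illustration that assumes a blank English pipeline (named `nlp_blank` here, a hypothetical name), which only tokenizes; indexing a Doc yields a Token and slicing it yields a Span.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption for illustration: a blank pipeline that only runs the tokenizer.
nlp_blank = spacy.blank("en")
example_doc = nlp_blank(
    "Washington University, which is located in Missouri, "
    "is named after George Washington."
)

first_token = example_doc[0]   # integer index -> Token
first_span = example_doc[0:2]  # slice (end index exclusive) -> Span

print(type(example_doc).__name__, len(example_doc))
print(type(first_token).__name__, first_token.text)
print(type(first_span).__name__, first_span.text)
# A Span stores its token offsets in the parent Doc.
print(first_span.start, first_span.end)
```

Because a Span is only a view, it does not copy the text; it points back into the Doc it was sliced from.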

In [5]:
# TODO add your code here to print the tokens in the Doc object
print ([token.text for token in doc])

['Washington', 'University', ',', 'which', 'is', 'located', 'in', 'Missouri', ',', 'is', 'named', 'after', 'George', 'Washington', '.']


+ TODO: print the two entities containing "Washington"


In [20]:
# A slice of the Doc for "Washington University" (the end index is exclusive)
university_span = doc[0:2]
print(university_span.text)

# A slice of the Doc for "George Washington" (without the ".")
person_span = doc[12:14]
print(person_span.text)

Washington University
George Washington


In [21]:
# TODO obtain number of sentences
sentences = list(doc.sents)
len(sentences)



1

In [22]:
for sentence in sentences:
  print(sentence)

Washington University, which is located in Missouri, is named after George Washington.


In [9]:
# morphology and syntax
for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_lemma:<20}{token_dep:<20}")

Washington  PROPN     NNP       Washington          compound            
University  PROPN     NNP       University          nsubjpass           
,           PUNCT     ,         ,                   punct               
which       PRON      WDT       which               nsubjpass           
is          AUX       VBZ       be                  auxpass             
located     VERB      VBN       locate              relcl               
in          ADP       IN        in                  prep                
Missouri    PROPN     NNP       Missouri            pobj                
,           PUNCT     ,         ,                   punct               
is          AUX       VBZ       be                  auxpass             
named       VERB      VBN       name                ROOT                
after       ADP       IN        after               prep                
George      PROPN     NNP       George              compound            
Washington  PROPN     NNP       Washington         

In [30]:
# morphology and syntax
for token in doc:
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    # This is for formatting only
    # TODO modify the code above to print the description of each tag, like so:
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{spacy.explain(token.tag_):<50}{token_lemma:<20}{token_dep:<20}")
    

Washington  PROPN     NNP       noun, proper singular                             Washington          compound            
University  PROPN     NNP       noun, proper singular                             University          nsubjpass           
,           PUNCT     ,         punctuation mark, comma                           ,                   punct               
which       PRON      WDT       wh-determiner                                     which               nsubjpass           
is          AUX       VBZ       verb, 3rd person singular present                 be                  auxpass             
located     VERB      VBN       verb, past participle                             locate              relcl               
in          ADP       IN        conjunction, subordinating or preposition         in                  prep                
Missouri    PROPN     NNP       noun, proper singular                             Missouri            pobj                
,           PUNC

In [31]:
# TODO Iterate over the predicted entities
for entity in doc.ents:
    # Print the entity text and its label
    print(entity.text, entity.label_)


Washington University ORG
Missouri GPE
George Washington PERSON
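The same entities can also be read token by token in the IOB scheme: B marks the first token of an entity, I a token inside one, and O a token outside any entity. The sketch below is an illustration under stated assumptions rather than the lab's solution: it uses a blank pipeline and sets the entity spans by hand with `Doc.set_ents`, so it runs without a trained model; the names `nlp_blank`, `iob_doc` and `iob_tags` are hypothetical.

```python
import spacy
from spacy.tokens import Span

# Assumption for illustration: blank pipeline, entities assigned manually.
nlp_blank = spacy.blank("en")
iob_doc = nlp_blank("Washington University is named after George Washington.")

# Token indices: Washington(0) University(1) ... George(5) Washington(6) .(7)
iob_doc.set_ents([
    Span(iob_doc, 0, 2, label="ORG"),
    Span(iob_doc, 5, 7, label="PERSON"),
])

iob_tags = []
for token in iob_doc:
    if token.ent_type_:
        # Inside an entity: combine the IOB code with the entity label.
        iob_tags.append(f"{token.ent_iob_}-{token.ent_type_}")
    else:
        # Outside any entity: the bare "O" tag.
        iob_tags.append("O")

for token, tag in zip(iob_doc, iob_tags):
    print(f"{token.text:<12}{tag}")
```

Checking `token.ent_type_` before building the tag avoids having to strip a trailing hyphen afterwards.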


In [34]:
# TODO modify the code above to iterate over the predicted entities at token level, like so:
# iob2 entities
import re

for token in doc:
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    token_iob = token.ent_iob_ + "-" + token.ent_type_
    token_iob = re.sub("-$","",token_iob)
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{spacy.explain(token.tag_):<60}{token_lemma:<20}{token_dep:<20}{token_iob}")




Washington  PROPN     NNP       noun, proper singular                                       Washington          compound            B-ORG
University  PROPN     NNP       noun, proper singular                                       University          nsubjpass           I-ORG
,           PUNCT     ,         punctuation mark, comma                                     ,                   punct               O
which       PRON      WDT       wh-determiner                                               which               nsubjpass           O
is          AUX       VBZ       verb, 3rd person singular present                           be                  auxpass             O
located     VERB      VBN       verb, past participle                                       locate              relcl               O
in          ADP       IN        conjunction, subordinating or preposition                   in                  prep                O
Missouri    PROPN     NNP       noun, proper singular 

In [35]:
# easy feature extraction
for token in doc:
  print (token, token.idx, token.text_with_ws, 
         token.is_alpha, token.is_punct, token.is_space,
         token.shape_, token.is_stop)

Washington 0 Washington  True False False Xxxxx False
University 11 University True False False Xxxxx False
, 21 ,  False True False , False
which 23 which  True False False xxxx True
is 29 is  True False False xx True
located 32 located  True False False xxxx False
in 40 in  True False False xx True
Missouri 43 Missouri True False False Xxxxx False
, 51 ,  False True False , False
is 53 is  True False False xx True
named 56 named  True False False xxxx False
after 62 after  True False False xxxx True
George 68 George  True False False Xxxxx False
Washington 75 Washington True False False Xxxxx False
. 85 . False True False . False


In [36]:
# stopwords available for English
english_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(english_stopwords)
for stop_word in list(english_stopwords)[:10]:
  print(stop_word)

can
become
but
i
eight
then
three
as
few
herein


# ASSIGNMENT 3

+ TODO: Remove stopwords from doc
+ TODO: print only the 3rd person singular present verbs (tag VBZ) and the proper singular nouns (tag NNP)
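Filtering of this kind is usually written as a list comprehension over token attributes. The sketch below is one possible approach, assuming a blank English pipeline (`nlp_blank`, a hypothetical name): `is_stop` and `is_punct` are lexical attributes and need no trained model, whereas filtering on `tag_` requires a pipeline with a tagger such as the one loaded above.

```python
import spacy

# Assumption for illustration: blank pipeline; is_stop and is_punct still work
# because they depend only on the word form, not on a statistical model.
nlp_blank = spacy.blank("en")
filter_doc = nlp_blank(
    "Washington University, which is located in Missouri, "
    "is named after George Washington."
)

# Keep tokens that are neither stop words nor punctuation.
content_words = [
    token.text
    for token in filter_doc
    if not token.is_stop and not token.is_punct
]
print(content_words)
```

The same pattern applies to tags: replacing the condition with a test on `token.tag_` selects tokens by part of speech.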

In [37]:
# TODO remove stopwords
for word in doc:
  if not word.is_stop:
    print (word)


Washington
University
,
located
Missouri
,
named
George
Washington
.


In [38]:
# TODO print only verbs, 3rd person singular present and proper singular nouns
proper_nouns = []   # tag NNP: noun, proper singular
present_verbs = []  # tag VBZ: verb, 3rd person singular present
for token in doc:
  if token.tag_ == 'NNP':
    proper_nouns.append(token)
  elif token.tag_ == 'VBZ':
    present_verbs.append(token)
print(proper_nouns)
print(present_verbs)

[Washington, University, Missouri, George, Washington]
[is, is]


# ASSIGNMENT 4 (BONUS 1)

Visualizations with spaCy. Check the documentation at https://spacy.io/usage/visualizers and render the dependency parse and the NER annotations, like so:



In [39]:
from spacy import displacy
# 'distance' is an option of the dependency visualizer only, so it is omitted for entities
displacy.render(doc, style='ent', jupyter=True)
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
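Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below illustrates this under stated assumptions: it uses a blank pipeline with hand-set entities so that it runs without a trained model, and the names `nlp_blank`, `viz_doc` and `html` are hypothetical.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption for illustration: blank pipeline, entities assigned manually.
nlp_blank = spacy.blank("en")
viz_doc = nlp_blank("George Washington was born in Virginia.")

# Token indices: George(0) Washington(1) was(2) born(3) in(4) Virginia(5) .(6)
viz_doc.set_ents([
    Span(viz_doc, 0, 2, label="PERSON"),
    Span(viz_doc, 5, 6, label="GPE"),
])

# jupyter=False returns the HTML as a string; page=True wraps it in a full page.
html = displacy.render(viz_doc, style="ent", jupyter=False, page=True)
print(len(html), "characters of HTML")
```

The returned string can be written to a file and opened in a browser, which is useful when the analysis runs as a script rather than in Colab.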


# ASSIGNMENT 5 (BONUS 2)

In this task you will be annotating a movie review at document and sentence level.

1. Open the file '/content/drive/My Drive/Colab Notebooks/2023-ILTAPP/resources/movie-review.txt'
2. Predict and print the annotations seen previously (POS, NER, lemmas, etc.) for each sentence in the document, using at least two language modules for a language of your choice (for example, the most basic and the most advanced one).
3. Visualize the results.
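One way to organize step 2 is a small helper that takes any loaded pipeline and returns the annotations grouped by sentence, so the same function can be called once per language module and the results compared. The sketch below is an assumption-laden illustration: `annotate_by_sentence` is a hypothetical helper, the sample text is made up, and the demo uses a blank pipeline with the rule-based `sentencizer`, so the POS and lemma fields stay empty until a trained module is passed in instead.

```python
import spacy


def annotate_by_sentence(nlp, text):
    """Return, for each sentence, a list of (text, pos, lemma) tuples."""
    doc = nlp(text)
    return [
        [(token.text, token.pos_, token.lemma_) for token in sentence]
        for sentence in doc.sents
    ]


# Assumption for illustration: blank pipeline plus rule-based sentence splitter.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

sample = "The film starts slowly. The ending is great."  # made-up sample text
rows = annotate_by_sentence(demo_nlp, sample)

for number, sentence_rows in enumerate(rows, start=1):
    print(number, [text for text, _pos, _lemma in sentence_rows])
```

Calling the helper with two different loaded modules on the same text gives two lists of the same shape, which makes differences in tags or lemmas easy to spot.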



In [40]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [44]:
# TODO add code here
file_name = '/content/drive/MyDrive/NLP_Applications_1/DATA/2023-ILTAPP-20230203T201734Z-001/2023-ILTAPP/resources/guardian.txt'
# Read the file with an explicit encoding and close it afterwards
with open(file_name, encoding='utf-8') as f:
  guardian_text = f.read()
guardian_document = nlp(guardian_text)

for ele in guardian_document.sents:
  print(ele)

displacy.render(guardian_document, style='ent', jupyter=True)


Twelve years after the fall of the Taliban, Afghanistan is heading for a near-record opium crop as instability pushes up the amount of land planted with illegal but lucrative poppies, according to a bleak UN report.


The rapid growth of poppy farming as western troops head home reflects particularly badly on Britain, which was designated "lead nation" for counter-narcotics work over a decade ago.


"Poppy cultivation is not only expected to expand in areas where it already existed in 2012 … but also in new areas or areas where poppy cultivation was stopped," the Afghanistan Opium Winter Risk Assessment found.


The growth in opium cultivation reflects both spreading instability and concerns about the future.
Farmers are more likely to plant the deadly crop in areas of high violence or where they have not received any agricultural aid, the report said.


Opium traders are often happy to provide seeds, fertilisers and even advance payments to encourage crops, leaving farmers who do not 