# Chapter 3: Linguistic Analysis

So far, we have been analysing natural language by looking at the surface forms of words (i.e. the forms of words as they appear in the text).
To be able to write good NLP applications, we need linguistic data.
There are many tools that perform linguistic analysis of texts at different levels (NLTK, CoreNLP, Gensim, spaCy etc.).
We will focus on spaCy to do this!

With spaCy, we can build a text processing pipeline, which will involve:
- Sentence segmentation
- Tokenization
- Part-of-speech tagging
- Dependency parsing
- Named entity recognition

spaCy can be installed with:<br>
__pip install spacy__

Also, we will use an English model. On the command line: <br>
__python3 -m spacy download en_core_web_sm__

In [1]:
import spacy
print(spacy.__version__)

3.2.1


## Before getting started: Python Classes/Objects

Python is an object-oriented language (object-oriented programming, OOP).
Objects are an encapsulation of __variables__ and __functions__ into a single entity. Objects are instances of classes. In other words, a class is a template to create your objects.
The concept of OOP in Python focuses on creating reusable code.

An object has two characteristics:

- attributes
- behavior (methods)

To create a class, use the keyword "class".
Let's create a class "Person" with a property called "name":



In [4]:
class Person:
    name = "John"


Now we can use the class named Person to create objects (instances of the class).
Let's create an object named p1, and print the value of "name":


In [5]:
#create instance or object of class 'Person'
p1 = Person()
print(p1.name)

#this is similar to creating a dictionary:
#d1 = dict()
#there you create an instance of the class dict
#below are some methods that can be applied to objects of that class:

#however, with this code every instance will have the name 'John'
#see below, with __init__...

John


What are some objects we know in Python: lists, dictionaries, tuples, strings etc. 
    list.append()
    list.remove()
    list.sort()
    
    dict.keys()
    dict.values()
    dict.items()
    
    string.lower()
    string.strip()
    string.split()
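
To make the link with these built-in types concrete, here is a small sketch (an illustration added here, not part of the original exercises). It shows that lists and strings are ordinary objects created from classes, and that their methods are called with the same dot notation we use on our own objects. The variable names "numbers", "word" and "cleaned" are made up for the example.

```python
# Built-in values are instances of built-in classes
numbers = [3, 1, 2]
print(type(numbers))
print(isinstance(numbers, list))

# Methods are functions attached to the class and called on the object
numbers.append(4)
numbers.sort()
print(numbers)

# String methods return new strings, so calls can be chained
word = "  Hello "
cleaned = word.strip().lower()
print(cleaned)
```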
    
    

The "Person" class above is not really useful in real-life applications: if we create a "Person" object, its name is always "John". Ideally, we would like to create objects by assigning values to their attributes (properties).

Classes can define a special method called __\_\_init\_\_()__, which is always executed when an object is created from the class.
Let's use the __\_\_init\_\_()__ method to create a "Person" object by assigning values for its "name" and "age":

In [35]:
class Person:
    #class is a kind of template of which objects can be derived
    def __init__(self, name, age):
        self.name = name
        self.age = age
        #self refers to object
        #age and name are attributes
        #init is a method that initializes a new object of the class

# Create a new person object

p1 = Person("John", 30)
#john and 30 are values
# Print its attributes

print(p1.name)
print(p1.age)


John
30


In [17]:
#TD cheesecake ex.

class Person:
    #class is a kind of template of which objects can be derived
    def __init__(cheesecake, name, surname, age):
        cheesecake.name = name
        cheesecake.surname = surname        
        cheesecake.age = age

        #self refers to object
        #age and name are attributes
        #init is a method that initiates a class

# Create a new person object

p1 = Person("John", "Smith",30)
p2 = Person("George", "Washington",30)
#john and 30 are values
# Print its attributes

print("%s\t%s\t%d\n"%(p1.name, p1.surname, p1.age))
print("%s\t%s\t%d\n"%(p2.name, p2.surname, p2.age))

del p2

print("%s\t%s\t%d\n"%(p2.name, p2.surname, p2.age))

John	Smith	30

George	Washington	30



NameError: name 'p2' is not defined

In [13]:
class animal:
    def __init__(self,name,subcat):
        self.name=name
        self.subcat=subcat
        

examp1=animal("tiger","mammal")

print(examp1.subcat)
    

mammal


The __self__ parameter is a reference to the current instance of the class, and is used to access variables that belong to that instance.
It does not have to be named __self__, but it has to be the first parameter of any method in the class.
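
The role of the first parameter can be made visible with a short sketch. It assumes a hypothetical method "greet" (not used elsewhere in this chapter) and compares the usual call on the object with the explicit call on the class, where we pass the object ourselves.

```python
class Person:
    def __init__(self, name):
        self.name = name

    def greet(self):
        # self is the object the method was called on
        return "Hello, " + self.name

p1 = Person("John")

# The usual call: Python passes p1 as the first argument automatically
usual = p1.greet()

# The explicit equivalent: call the function on the class and pass the object
explicit = Person.greet(p1)

print(usual)
print(explicit)
print(usual == explicit)
```

Both calls run the same function; the dot notation on the object is simply the shorter way of writing it.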


### DIY1. Classes and objects

1. Modify the "self" parameter of the "Person" class to "cheesecake".
2. Add the attribute "surname" to the "Person" class.
3. Create a new "Person" object (with all the necessary values for its attributes)
4. Print the name and surname of the "Person" object in a single line



In [27]:
# Add your code here
class Person:
    def __init__(cheesecake, name, surname, age):
        cheesecake.name = name
        cheesecake.age = age
        cheesecake.surname = surname
        
# Create a new person object
p1 = Person("John", "Snow", 30)

# Print its attributes

print(p1.name + " " + p1.surname)




John Snow


You can modify __object__ attributes (properties) directly.
Let's modify the age of p1 to 35:

In [28]:
print(p1.age)

# Change the age and print it again

p1.age = 35
print(p1.age)

30
35


You can also delete objects by using the "del" keyword.

In [30]:
print(p1.age)

# Delete the person object and print its age again
del p1
print(p1.age)

35


NameError: name 'p1' is not defined
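
Strictly speaking, "del" removes the name, not necessarily the object behind it. The sketch below illustrates this under the assumption that a second name ("p2", introduced only for this example) still refers to the same object; the flag "p1_is_gone" is also just for illustration.

```python
class Person:
    def __init__(self, name):
        self.name = name

p1 = Person("John")
p2 = p1          # a second name for the same object

del p1           # removes the name p1 only

# The object is still reachable through p2
print(p2.name)

# Using the deleted name raises a NameError
p1_is_gone = False
try:
    print(p1.name)
except NameError as error:
    p1_is_gone = True
    print("NameError:", error)
```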

We can add methods to our classes, which can only be used on objects of this class.
Let's add a method "getName" to our "Person" class, which displays the name and surname of a given "Person" object:

In [2]:
class Person:
    def __init__(self, name, surname, age):
        self.name = name
        self.surname = surname
        self.age = age

    def getName(self):
        #nothing after self: this method takes no extra arguments
        print("Name: " + self.name + ", Surname: " + self.surname)

p1 = Person("John", "Snow", 30)

p1.getName()




Name: John, Surname: Snow


In [20]:
#TD try


class Person:
    def __init__(self,name,surname,age):
        self.name=name
        self.surname=surname
        self.age=age
    
    def getName(self):
        print("name: " + self.name + ", surname: " + self.surname + ", age: " + self.age)

p1 = Person("John","Smith","45")

p1.getName()        
        

name: John, surname: Smith, age: 45


In [25]:
#TD try 2

class Animal:
    def __init__(self,name,subtype):
        self.name=name
        self.subtype=subtype
        
    def getInfo(self):
        print("name: " + self.name + ", category: " + self.subtype)
    
p1 = Animal("crocodile","reptile")

p1.getInfo()

name: crocodile, category: reptile


As you can see, the __self__ parameter __has to be present and be the first parameter__ of a method defined in a class.
When calling the method on an object, the __self__ argument is not passed explicitly.
Let's create:
1. an attribute for the "Person" class called "pet"
2. a method for the "Person" class, "changePet", which replaces the value of the "pet" attribute with a given new value and prints the old and the new pets.

In this case, let's make the value of "pet" optional within the __\_\_init\_\_()__ method.



In [31]:
class Person:
    def __init__(self, name, surname, age, pet=None):
        #pet attribute will be always none by default
        self.name = name
        self.surname = surname
        self.age = age
        self.pet = pet
        

    def getName(self):
      print("Name: " + self.name + ", Surname: " + self.surname)
    
    #Define a new method changePet
    def changePet(self, newpet):
        #str() keeps this working when the old pet is None
        print("Old pet: " + str(self.pet))
        self.pet = newpet
        print("New pet: " + str(self.pet))

# Create a person object without a "pet"
p1 = Person("John", "Snow", 30)
print(p1.pet)

# Create a person object with a "pet"
p1 = Person("John", "Snow", 30, "Ghost")
print(p1.pet)

# Use changePet method to change the pet
p1.changePet("Bobby")


None
Ghost
Old pet: Ghost
New pet: Bobby
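
A caveat worth knowing with optional attributes such as "pet": when no value was given, the attribute is None, and joining None to a string with + raises a TypeError. The sketch below shows the error and one way around it, assuming we are happy to print the word None; the function name "describe_pet" is made up for this illustration.

```python
def describe_pet(pet):
    # f-strings convert any value (including None) to text
    return f"Old pet: {pet}"

# Joining a string and None with + is not allowed
try:
    message = "Old pet: " + None
except TypeError:
    message = "TypeError"

print(message)
print(describe_pet(None))
print(describe_pet("Ghost"))
```

Wrapping the value in str() has the same effect as the f-string.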


In [51]:
#TD inserted
import datetime

class Person:
    def __init__(self, name, surname, birthyear, pet=None):
        self.name=name
        self.surname=surname
        self.birthyear=birthyear
        self.pet=pet
        
    
    def getName(self):
      print("Name: " + self.name + ", Surname: " + self.surname)
    
    def changePet(self, pet):
        print("Old pet: "+str(self.pet))
        self.pet = pet
        print("New pet: "+str(self.pet))
        
    def getAge1(self, year): #year is the current year, to be passed in
        age = year - self.birthyear
        print("age :" + str(age))
    
    def getAge2(self):
        now = datetime.datetime.now()
        currentyear = now.year
        age = currentyear - self.birthyear
        print("age :" + str(age))
        return age
    
    def getSummary(self):
        print("name, " + self.name + " surname, " + self.surname + " age, " + str(self.getAge2()))

p1=Person("Thierry","Desot",1974)
p1.getAge1(2022)
p1.getAge2()
p1.getSummary()

age :48
age :48
age :48
name, Thierry surname, Desot age, 48


### DIY2. Classes and objects

1. Remove the attribute "age" and add the attribute "birthyear" to the "Person" class
2. Add a new method "getAge1" that calculates the age of a given person using an input "year" and the "birthyear" attribute and prints it.
3. Add a new method "getAge2", which makes the same calculation by using the "datetime" library to get the current date and time, without an input variable<br>
    - import datetime<br>
    - now = datetime.datetime.now()<br>
    - now.year # returns current year<br>
    - We can calculate the age by using "birthyear" and current year.

4. Add a new method getSummary, which prints the name, surname and the age in one line (using getAge2)


In [3]:
#solution
import datetime

class Person:
    # Define the __init__ method
    def __init__(self, name, surname, birthyear, pet=None):
        self.name = name
        self.surname = surname
        self.birthyear = birthyear
        self.pet = pet
        

    def getName(self):
      print("Name: " + self.name + ", Surname: " + self.surname)
    
    def changePet(self, pet):
        print("Old pet: "+str(self.pet))
        self.pet = pet
        print("New pet: "+str(self.pet))
    
    def getAge1(self, year):
        age = year - self.birthyear
        print("Age is " + str(age))
        
    def getAge2(self):
        now = datetime.datetime.now()
        year = now.year
        age = year - self.birthyear
        print("Age is " + str(age))
        return age
        
    def getSummary(self):
        print(self.name + " " +  self.surname + " " + str(self.getAge2()))
        
p1 = Person("John", "Snow", 283)
p1.getAge1(2021)
p1.getAge2()
p1.getSummary()


Age is 1738
Age is 1738
Age is 1738
John Snow 1738
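
Note that getSummary calls getAge2, which also prints, so the "Age is" line shows up one extra time in the output above. A possible variation is sketched below, under the assumption that we prefer methods that return values and leave the printing to the caller. The names "age_in" and "summary" are hypothetical and not part of the exercise.

```python
import datetime

class Person:
    def __init__(self, name, surname, birthyear):
        self.name = name
        self.surname = surname
        self.birthyear = birthyear

    def age_in(self, year=None):
        # Use the given year, or fall back to the current year
        if year is None:
            year = datetime.datetime.now().year
        return year - self.birthyear

    def summary(self, year=None):
        # Build the text and return it instead of printing it
        return f"{self.name} {self.surname} {self.age_in(year)}"

p1 = Person("John", "Snow", 283)
print(p1.age_in(2021))
print(p1.summary(2021))
```

Passing the year explicitly also makes the result reproducible, whereas a call without a year depends on the date on which the code is run.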


Some of the advantages of OOP are:

__1. Modularity for easier troubleshooting__

Something has gone wrong, and you have no idea where to look? Hope you commented your code!

When working with object-oriented programming languages, you know exactly where to look: “The Person object broke down? The problem must be in the Person class!”

Due to encapsulation, objects are self-contained, and each bit of functionality does its own thing while leaving the other bits alone. Also, this modularity allows an IT team to work on multiple objects simultaneously while minimizing the chance that one person might duplicate someone else’s functionality.

__2. Reuse of code through inheritance__

Suppose that in addition to your Car object, one colleague needs a RaceCar object, and another needs a Limousine object. Everyone builds their objects separately but then discovers commonalities between them. In fact, each object is really just a different kind of Car. This is where the inheritance technique saves time: Create one generic class (Car), and then define the subclasses (RaceCar and Limousine) that are to inherit the generic class’s traits.

Of course, Limousine and RaceCar still have their unique attributes and functions. If the RaceCar object needs a method to “fireAfterBurners” and the Limousine object requires a "Chauffeur", each class could implement separate functions just for itself. However, because both classes inherit key aspects from the Car class, for example the “drive” or “fillUpGas” methods, your inheriting classes can simply reuse existing code instead of writing these functions all over again.

What if you want to make a change to all Car objects, regardless of type? This is another advantage of the OO approach. Simply make a change to your Car class, and all car objects will simply inherit the new code.
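
The Car example can be sketched in Python as follows. This is only an illustration of the idea: the attributes "brand", "fuel" and "chauffeur" and the returned texts are assumptions made for the example.

```python
class Car:
    def __init__(self, brand):
        self.brand = brand
        self.fuel = 0

    def fillUpGas(self, litres):
        self.fuel += litres

    def drive(self):
        return self.brand + " is driving"

class RaceCar(Car):
    # Only the RaceCar-specific behaviour is written here
    def fireAfterBurners(self):
        return self.brand + " fires its afterburners"

class Limousine(Car):
    def __init__(self, brand, chauffeur):
        super().__init__(brand)      # reuse Car's initialisation
        self.chauffeur = chauffeur

racer = RaceCar("Ferrari")
limo = Limousine("Lincoln", "James")

# drive and fillUpGas are inherited from Car
racer.fillUpGas(50)
print(racer.drive())
print(racer.fireAfterBurners())
print(limo.drive())
print(limo.chauffeur)
```

Changing the "drive" method in Car would change the behaviour of both subclasses at once, since neither of them redefines it.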

__3. Flexibility through polymorphism__

Polymorphism is the process of using an operator or function in different ways for different data input. In practical terms, polymorphism means that if class B inherits from class A, it doesn't have to inherit everything about class A; it can do some of the things that class A does differently. Polymorphism is a fancy word that means that the same function can act differently for objects of different types.

Let's say we have two animal species: a dog and a cat. Both are animals (class "Animal"). The "Dog" class and the "Cat" class inherit the Animal class. All three classes have a talk() method, which gives different output for each of them. It can for example print "Animals make different sounds", when used on an "Animal" object. Or it can print "Meow" or "Woof" when applied to a Cat or a Dog object.
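
The animal example could look like this in Python. It is a minimal sketch in which talk() returns the text instead of printing it, so that the results can be collected in a list.

```python
class Animal:
    def talk(self):
        return "Animals make different sounds"

class Dog(Animal):
    # Overrides the talk method inherited from Animal
    def talk(self):
        return "Woof"

class Cat(Animal):
    def talk(self):
        return "Meow"

# The same method call behaves differently depending on the object's class
animals = [Animal(), Dog(), Cat()]
sounds = [animal.talk() for animal in animals]

for sound in sounds:
    print(sound)
```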

__4. Effective problem solving__

Object-oriented programming is often the most natural and pragmatic approach, once you get the hang of it. OOP languages allow you to break down your software into bite-sized problems that you can then solve, one object at a time.

References:
https://www.roberthalf.com/blog/salaries-and-skills/4-advantages-of-object-oriented-programming
http://zetcode.com/lang/python/oop/

## SpaCy: Getting Started

At the center of spaCy is the __object__ containing the processing pipeline. By convention, the variable holding it is called "nlp".

For example, to create an English __nlp object__, you can import the English language class from _spacy.lang.en_ and instantiate it. You can use the nlp object like a function to analyze text.

The nlp object holds the different components of the "NLP pipeline". When you call "nlp" on a text, spaCy first tokenizes the text to produce a __Doc object__. The Doc is then processed in several different steps; this is also referred to as the processing pipeline. The pipeline used by the default trained models includes a tagger, a parser and an entity recognizer (a bare English() object, as used in the next cell, only contains the tokenizer). Each pipeline component returns the processed Doc, which is then passed on to the next component. One of the nice things about spaCy is that we only need to call "nlp" once to analyze a text: the whole pipeline runs in the background and returns a single Doc object carrying all the annotations.

The nlp object also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in spacy.lang.

![title](img/pipeline.svg)

In [2]:
# Import the English language class
import spacy
from spacy.lang.en import English

# Create the nlp object: language specific processing pipeline
nlp = English()

# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")
print(doc.text)


Hello world!
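
To see what an nlp object created with English() actually contains, we can inspect its list of component names. The sketch below assumes spaCy v3, where "pipe_names" lists the components that run after the tokenizer.

```python
from spacy.lang.en import English

# A bare English() object: tokenizer only, no trained components
nlp = English()
print(nlp.pipe_names)

# Calling nlp on a text still returns a Doc, split into tokens
doc = nlp("Hello world!")
print(type(doc).__name__)
print(len(doc))
```

An empty list means that no tagger, parser or entity recognizer will run, which is why attributes such as part-of-speech tags stay empty until a trained model is loaded.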


## 1. Tokenization
Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the Doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the token text.


<img src="img/doc_token.png" alt="Drawing" style="width: 400px;"/>

In [3]:
print(doc.text)
# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
    #print(token) # td : also works? 
# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
#print(token.text)
print(token)

Hello world!
Hello
world
!
world


### DIY: Tokens ###
Fill the question marks in the following code.

In [18]:
#TD inserted 
#import the eng language class and create nlp obj
#from ??? import ???
from spacy.lang.en import English


#create nlp object for Eng
#nlp = ???
nlp = English()

#process text
#doc = ??? ('"The winner is a movie from South Korea, what the hell was that all about" Trump asked.')

doc = nlp('"The winner is a movie from South Korea, what the hell was that all about" Trump asked.')

#select the token "Trump" by providing its index
#trump_token = doc[???]
trump_token = doc[18]
# or trump_token = doc[-3]
#print the selected token's text
#???
print(trump_token.text)



Trump


In [19]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

# Create the nlp object for English
nlp = English()

# Process the text
doc = nlp('"The winner is a movie from South Korea, what the hell was that all about?" Trump asked.' )

# Select the token "Trump" by providing its index
trump_token = doc[-3]

# Print the selected token's text
print(trump_token.text)

Trump


A Token does not only have a "text" attribute. 

Here are some of the other token attributes:

|Token attribute | Description|
|---|:---|
|doc|The parent document.|
|text|The text content.|
|lemma_|Lemma of the token.|
|lower_|Lowercase form of the token text. Equivalent to Token.text.lower().|
|i|The index of the token within the parent document.|
|shape_|Transform of the token’s string, to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, “Xxxx” or “dd”.|
|is_alpha|(bool) Does the token consist of alphabetic characters?|
|is_punct|(bool) Is the token punctuation?|
|like_url|(bool) Does the token resemble a URL?|
|like_num|(bool) Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc.|
|is_stop|(bool) Is the token part of a “stop-word list”?|


Note: Many of these attributes (e.g. "lower\_", "shape\_", "is_alpha", "like_num" and "is_stop") are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context. spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute (that returns a string), we need to add an underscore \_ to its name (e.g. "lower\_" and "shape\_").


See https://spacy.io/api/token for the full list of token attributes.
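
The difference between the two variants of an attribute can be checked directly. The sketch below uses a bare English() pipeline and the "lower" attribute as an example. It relies on the string store of the vocabulary ("nlp.vocab.strings"), which maps hash values back to strings.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("It is so 90's!")
token = doc[0]

# With underscore: the readable string
print(token.lower_)

# Without underscore: the integer hash of that string
print(token.lower)

# The hash can be turned back into the string via the shared vocabulary
print(nlp.vocab.strings[token.lower])
```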

In [33]:
#generally speaking, if we want a readable string, the underscore is needed in the spaCy attribute name
#without the underscore you get a numerical (hash) value, see Lemma2 below
#the lemmas stay empty when nlp is the bare English() pipeline, which has no lemmatizer component

doc = nlp("It is so 90's!")
print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])
print('Lemma:    ', [token.lemma_ for token in doc])
print('Lemma2:    ', [token.lemma for token in doc])
print('is_alpha:', [token.is_alpha for token in doc])
print('like_num:', [token.like_num for token in doc])
print('Lowercase: ', [token.lower_ for token in doc] )
print('stopword: ', [token.is_stop for token in doc])





Index:    [0, 1, 2, 3, 4, 5]
Text:     ['It', 'is', 'so', '90', "'s", '!']
Lemma:     ['', '', '', '', '', '']
Lemma2:     [0, 0, 0, 0, 0, 0]
is_alpha: [True, True, True, False, False, False]
like_num: [False, False, False, True, False, False]
Lowercase:  ['it', 'is', 'so', '90', "'s", '!']
stopword:  [True, True, True, False, True, False]


### DIY: Token attributes ###
Display the different token attributes shown in the following image, for the sentence "Apple is looking at buying U.K. startup for $1 billion.". Display the results in the same formatting: for each token in the sentence, print the text of "token", "shape", "alpha" and "stop" in the same line.

<table> <tr>
    <td> <img src="img/token_attributes1.png" alt="Drawing" style="width: 100px;"/> </td>
    <td> <img src="img/token_attributes2.png" alt="Drawing" style="width: 300px;"/> </td>
 </tr> </table>


In [37]:
#TD ex
import spacy
from spacy.lang.en import English

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

print("TEXT\tSHAPE\tALPHA\tSTOP")
#or 
#print("TEXT", "SHAPE", "ALPHA", "STOP")
for token in doc:
    print("%s\t%s\t%s\t%s"%(token.text,token.shape_,token.is_alpha,token.is_stop))
    #or :  print(token.text, token.shape_, token.is_alpha, token.is_stop)

TEXT	SHAPE	ALPHA	STOP
Apple	Xxxxx	True	False
is	xx	True	True
looking	xxxx	True	False
at	xx	True	True
buying	xxxx	True	False
U.K.	X.X.	False	False
startup	xxxx	True	False
for	xxx	True	True
$	$	False	False
1	d	False	False
billion	xxxx	True	False
.	.	False	False


In [39]:
# Write your code here
import spacy
from spacy.lang.en import English

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

print("TEXT", "SHAPE", "ALPHA", "STOP")
for token in doc:
    print(token.text, token.shape_, token.is_alpha, token.is_stop)


TEXT SHAPE ALPHA STOP
Apple Xxxxx True False
is xx True True
looking xxxx True False
at xx True True
buying xxxx True False
U.K. X.X. False False
startup xxxx True False
for xxx True True
$ $ False False
1 d False False
billion xxxx True False
. . False False


### DIY: Token attributes ###

In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. You’ll be looking for __two consecutive tokens__: a number and a percent sign.

Use the __like_num__ token attribute to check whether a token in the doc resembles a number.
Get the token following the current token in the document. The index of the next token in the doc is __token.i+1__.
Check whether the next token’s text attribute is a percent sign ”%“.

In [55]:
#TD inserted

from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "Belgium's three official languages are Dutch, spoken by 59% of the population, French, spoken by 40%, "
    "and German, spoken by less than 1%." 
)


# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num == True:
        # Get the next token in the document
        next_token = doc[token.i+1]
        #print(next_token)
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            # Print the token
            print("Percentage found:", token.text)

Percentage found: 59
Percentage found: 40
Percentage found: 1


In [56]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "Belgium's 3 official languages are Dutch, spoken by 59% of the population, French, spoken by 40%, "
    "and German, spoken by less than 1%." 
)


# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token of the doc
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i+1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            # Print the token
            print("Percentage found:", token.text)

Percentage found: 59
Percentage found: 40
Percentage found: 1


## 1. Tokenization: Spans ##

A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a Span, you can use Python's slice notation. For example, 1:4 will create a slice starting from the token at position 1, up to – but not including! – the token at position 4.


<img src="img/doc_span.png" alt="Drawing" style="width: 400px;"/>

In [None]:
from spacy.lang.en import English
nlp = English()
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:10]
# Slicing is based on tokens, not on characters:
# this Doc has 3 tokens, so doc[0:3] yields the complete text
# No error if the end index is too large: spaCy clips the slice to the Doc

# Get the span text via the .text attribute
print(span.text)

# Try to slice 'world'
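
One possible way to slice a single token, together with a look at what a Span knows about its position, is sketched below. It uses the "start" and "end" attributes of the Span, which hold the token positions of the slice inside the Doc.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# One token: the start index is included, the end index is excluded
span = doc[1:2]
print(span.text)
print(span.start, span.end)
print(type(span).__name__)

# An end index past the last token is clipped
clipped = doc[1:10]
print(clipped.text)
```

Even a slice of one token is a Span, not a Token; indexing with a single number (doc[1]) is what returns a Token.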


### DIY: Spans ###
Fill in the question marks in the following code based on the instructions given in comments.


In [5]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp('"The winner is a movie from South Korea, what the hell was that all about?" Trump asked.' )

# Take a slice of the Doc for "a movie" and print it
a_movie = doc[4:6]
print(a_movie.text)
#print(doc.text)
#print('Text:    ', [token.text for token in doc])

# Take a slice of the Doc for "a movie from South Korea" (without the ",") and print it
a_movie_from_sk = doc[4:9]
print(a_movie_from_sk.text)

a movie
a movie from South Korea


In [75]:
# Import spaCy and load the small English model
import spacy

nlp = spacy.load('en_core_web_sm')


# Process the text
doc = nlp('"The winner is a movie from South Korea, what the hell was that all about?" Trump asked.' )

# Take a slice of the Doc for "a movie" and print it
a_movie = doc[4:6]
print(a_movie.text)

# Take a slice of the Doc for "a movie from South Korea" (without the ",") and print it
a_movie_from_sk = doc[4:9]
print(a_movie_from_sk.text)

a movie
a movie from South Korea


## 2. Statistical Models ##

Statistical models enable spaCy to make predictions __in context__. This usually includes part-of-speech tags, syntactic dependencies and named entities.

These ready-to-use models are trained on large datasets of labeled example texts.

You can download models from the command line, with different sizes (and accuracies) (https://spacy.io/models/en):<br>

Small English model (11 MB): python -m spacy download en_core_web_sm<br>
Medium English model (91 MB): python -m spacy download en_core_web_md <br>
Large English model (789 MB): python -m spacy download en_core_web_lg<br>

You can find the available models for different languages at https://spacy.io/models <br>
You can see the Dutch models at https://spacy.io/models/nl

The English language class in spacy.lang.en contains the language-specific code and rules included in the library – for example, special case rules for tokenization, stop words or functions to decide whether a word like "twenty two" resembles a number.

spacy.load("en_core_web_sm") loads the installed statistical model package en_core_web_sm. (Older spaCy versions also accepted the shortcut name en; spaCy v3 requires the full package name.) Loading a model will initialize the respective language class (in this case, English), set up the processing pipeline and load in the binary weights of the trained model that allow spaCy to make predictions (e.g. whether a word is a noun or what named entities are in the text). So the nlp object you get back after loading a model is an instance of English, but it also has a processing pipeline set up and weights loaded in.
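
The relation between a loaded model and the English class can be verified with a short check. This is a sketch that assumes en_core_web_sm may or may not be installed: if loading fails, it falls back to a blank English pipeline so that the code still runs.

```python
import spacy
from spacy.lang.en import English

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model package not installed: fall back to a blank English pipeline
    nlp = spacy.blank("en")

# In both cases the nlp object is an instance of the English class
print(isinstance(nlp, English))

# With the trained model, this lists its pipeline components;
# with the blank fallback, the list is empty
print(nlp.pipe_names)
```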


## 2. Statistical Models: POS-tagging ##

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.

spaCy maps all language-specific part-of-speech tags to a small, fixed set of word type tags following the Universal Dependencies scheme. The universal tags don’t code for any morphological features and only cover the word type.
The following tags are used for English (https://spacy.io/models/en):

<table> <tr>
    <td> <img src="img/postags1.png" alt="Drawing" style="width: 400px;"/> </td>
    <td> <img src="img/postags2.png" alt="Drawing" style="width: 400px;"/> </td>
 </tr> </table>
 
 
Most of the tags and labels look pretty abstract, and they vary between languages. __spacy.explain__ will show you a short description – for example, __spacy.explain("VBZ")__ returns “verb, 3rd person singular present”.
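
spacy.explain works without loading a model, so it is easy to try out. The sketch below looks up a few labels; the last one is a deliberately made-up label, to show that unknown labels give None instead of a description.

```python
import spacy

labels = ["VBZ", "NN", "PROPN", "not-a-real-tag"]

# Look up the short description of each label
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```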


In [4]:
import spacy

# Load the small English model (we don't use the "English()" class anymore!)
nlp = spacy.load('en_core_web_sm')
#print(nlp.pipeline) # list of tuples: name, pipeline component


# Process a text
doc = nlp('"The winner is a movie from South Korea, what the hell was that all about?" Trump asked.')

# Iterate over the tokens
for token in doc:
    # Print the text, its lemma, and the predicted part-of-speech tag
    # We access these contextual attributes with an underscore (pos_)
    print(token.text, token.lemma_, token.pos_, spacy.explain(token.pos_), token.tag_, spacy.explain(token.tag_))

" " PUNCT punctuation `` opening quotation mark
The the DET determiner DT determiner
winner winner NOUN noun NN noun, singular or mass
is be AUX auxiliary VBZ verb, 3rd person singular present
a a DET determiner DT determiner
movie movie NOUN noun NN noun, singular or mass
from from ADP adposition IN conjunction, subordinating or preposition
South South PROPN proper noun NNP noun, proper singular
Korea Korea PROPN proper noun NNP noun, proper singular
, , PUNCT punctuation , punctuation mark, comma
what what PRON pronoun WP wh-pronoun, personal
the the DET determiner DT determiner
hell hell NOUN noun NN noun, singular or mass
was be AUX auxiliary VBD verb, past tense
that that PRON pronoun DT determiner
all all ADV adverb RB adverb
about about ADP adposition IN conjunction, subordinating or preposition
? ? PUNCT punctuation . punctuation mark, sentence closer
" " PUNCT punctuation '' closing quotation mark
Trump Trump PROPN proper noun NNP noun, proper singular
asked ask VERB verb VBD verb, past tense
. . PUNCT punctuation . punctuation mark, sentence closer

### DIY: POS-tagging ###

In [77]:
#TD inserted

import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

text = "They refuse to permit us to obtain the refuse permit." 
# What should be the part of speech for "permit" in both cases?

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, lemma and part-of-speech tags (simple and detailed)
    text = token.text
    lemma = token.lemma_
    pos_simple = token.pos_
    pos_detailed = token.tag_
    # Get the description of the pos tags using .explain
    pos_simple_desc = spacy.explain(token.pos_)
    pos_detailed_desc = spacy.explain(token.tag_)
    # Print: text, lemma, pos-simple, pos-simple description, pos-detailed, pos-detailed description
    # in tab-delimited format
    print(text + "\t" + lemma + "\t" + pos_simple + "\t" +  pos_simple_desc + "\t" + pos_detailed + "\t" + pos_detailed_desc)

They	they	PRON	pronoun	PRP	pronoun, personal
refuse	refuse	VERB	verb	VBP	verb, non-3rd person singular present
to	to	PART	particle	TO	infinitival "to"
permit	permit	VERB	verb	VB	verb, base form
us	we	PRON	pronoun	PRP	pronoun, personal
to	to	PART	particle	TO	infinitival "to"
obtain	obtain	VERB	verb	VB	verb, base form
the	the	DET	determiner	DT	determiner
refuse	refuse	NOUN	noun	NN	noun, singular or mass
permit	permit	NOUN	noun	NN	noun, singular or mass
.	.	PUNCT	punctuation	.	punctuation mark, sentence closer


In [68]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

text = "They refuse to permit us to obtain the refuse permit." 
# What should be the part of speech for "permit" in both cases?

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, lemma and part-of-speech tags (simple and detailed)
    text = token.text
    lemma = token.lemma_
    pos_simple = token.pos_
    pos_detailed = token.tag_
    # Get the description of the pos tags using .explain
    pos_simple_desc = spacy.explain(token.pos_)
    pos_detailed_desc = spacy.explain(token.tag_)
    # Print: text, lemma, pos-simple, pos-simple description, pos-detailed, pos-detailed description
    # in tab-delimited format
    print(text + "\t" + lemma + "\t" + pos_simple + "\t" + pos_simple_desc + "\t" + pos_detailed + "\t" + pos_detailed_desc)

They	-PRON-	PRON	pronoun	PRP	pronoun, personal
refuse	refuse	VERB	verb	VBP	verb, non-3rd person singular present
to	to	PART	particle	TO	infinitival to
permit	permit	VERB	verb	VB	verb, base form
us	-PRON-	PRON	pronoun	PRP	pronoun, personal
to	to	PART	particle	TO	infinitival to
obtain	obtain	VERB	verb	VB	verb, base form
the	the	DET	determiner	DT	determiner
refuse	refuse	NOUN	noun	NN	noun, singular or mass
permit	permit	NOUN	noun	NN	noun, singular or mass
.	.	PUNCT	punctuation	.	punctuation mark, sentence closer


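About spaCy's custom pronoun lemma

Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it”, or maybe “he”? spaCy 2.x sidesteps the question by introducing a novel symbol, -PRON-, which is used as the lemma for all personal pronouns; this is what the second output above shows. spaCy 3.x no longer uses this placeholder and returns an ordinary lemma instead, as in the first output above (they, we).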
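
The two outputs above differ only in the pronoun lemmas (`they`/`we` versus `-PRON-`). If your code may see the `-PRON-` placeholder, one option is to normalise it yourself. The following is a minimal sketch, assuming that falling back to the lowercased token text is acceptable for your application; the helper name `normalise_lemma` is hypothetical and not part of spaCy.

```python
PRON_PLACEHOLDER = "-PRON-"

def normalise_lemma(text, lemma):
    """Return the lemma, replacing the -PRON- placeholder with the lowercased token text."""
    if lemma == PRON_PLACEHOLDER:
        return text.lower()
    return lemma

# (token text, lemma) pairs as printed in the second output above
pairs = [("They", "-PRON-"), ("refuse", "refuse"), ("us", "-PRON-")]
normalised = [normalise_lemma(text, lemma) for text, lemma in pairs]
print(normalised)  # ['they', 'refuse', 'us']
```

With a spaCy `Doc`, the equivalent call would be `normalise_lemma(token.text, token.lemma_)`. Note that this fallback gives `us`, not `we`: it only removes the placeholder and does not attempt real pronoun lemmatisation.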

## 2. Statistical Models: Dependency parsing ##

Dependency parsing is the task of extracting a dependency parse of a sentence: a representation of its grammatical structure that defines the relationships between “head” words and the words that modify those heads.

Part-of-speech tagging labels each token in a sentence with its grammatical word category, but the tags say nothing about how the tokens relate to one another. To recover these grammatical relations we use a linguistic parser; syntactic dependency parsing is one such approach. Dependency parsing produces a tree (or graph) data structure for the sentence that encodes the grammatical relations between its tokens. 

<td> <img src="img/dep.png" alt="Drawing" style="width: 500px;"/>

Dependency relationships are also represented with a universal tagset (For small English model: https://spacy.io/models/en#en_core_web_sm):


<table> <tr>
    <td> <img src="img/deptags1.png" alt="Drawing" style="width: 400px;"/> </td>
    <td> <img src="img/deptags2.png" alt="Drawing" style="width: 400px;"/> </td>
    <td> <img src="img/deptags3.png" alt="Drawing" style="width: 400px;"/> </td>
 </tr> </table>

In [4]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

text = "They refuse to permit us to obtain the refuse permit." 
# What should be the part of speech for "permit" in both cases?

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    # attributes ending in an underscore (pos_, dep_) return strings; without the underscore they return integer IDs
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # And print all three types of information in tab-delimited format
    print(token_text + "\t" + token_pos + "\t" + token_dep)

They	PRON	nsubj
refuse	VERB	ROOT
to	PART	aux
permit	VERB	xcomp
us	PRON	dobj
to	PART	aux
obtain	VERB	xcomp
the	DET	det
refuse	NOUN	compound
permit	NOUN	dobj
.	PUNCT	punct


While obtaining word-level dependency labels is easy, the labels alone provide limited information about grammatical relationships. spaCy uses the terms __head__ and __child__ to describe the words connected by a single arc in the dependency tree. 

The head is the grammatical parent of a given token.
A child is a token whose head is the given token.

We can obtain additional information from the dependency trees by using:
    - token.head (Why singular?)
    - token.children (Why plural?)
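
To make the singular/plural question concrete, here is a plain-Python sketch that does not use spaCy at all. It assumes the head indices that the small English model assigns to "This is a sentence." in the cells that follow, and stores exactly one head per token; the children are then derived by inverting that mapping, which is why one token can end up with several of them.

```python
from collections import defaultdict

# (token text, dependency label, index of the head token) for "This is a sentence."
# The ROOT token is its own head.
tokens = [
    ("This", "nsubj", 1),
    ("is", "ROOT", 1),
    ("a", "det", 3),
    ("sentence", "attr", 1),
    (".", "punct", 1),
]

# Invert the head links: head index -> list of child indices
children = defaultdict(list)
for index, (text, dep, head) in enumerate(tokens):
    if head != index:  # skip the ROOT, which points to itself
        children[head].append(index)

for index, (text, dep, head) in enumerate(tokens):
    child_texts = [tokens[c][0] for c in children[index]]
    print(f"{text}\thead={tokens[head][0]}\tchildren={child_texts}")
```

Every token has one entry in the head column, while the list of children can be empty ("This"), contain one item ("sentence") or several ("is").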
    

<td> <img src="img/dep2.png" alt="Drawing" style="width: 500px;"/>

Now let's print the head of each token, as well as the head's pos-tag and dependency label:


In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')

text = "This is a sentence."

doc=nlp(text)


for token in doc:
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    token_head = token.head.text
    token_head_pos = token.head.pos_
    token_head_dep = token.head.dep_
    print(token_text + "\t" + token_pos + "\t" + token_dep + "\t" + token_head + "\t" + token_head_pos)
    
    

This	PRON	nsubj	is	AUX
is	AUX	ROOT	is	AUX
a	DET	det	sentence	NOUN
sentence	NOUN	attr	is	AUX
.	PUNCT	punct	is	AUX


In [17]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

text = "This is a sentence." 

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    token_head = token.head.text # the text of the head token; .head is itself a Token, so we read its .text attribute
    token_head_pos = token.head.pos_ # the pos-tag of the head token
    token_head_dep = token.head.dep_ # the dep of the head token
    # Print all six values, separated by spaces
    print(token_text, token_pos, token_dep, token_head, token_head_pos, token_head_dep)



This PRON nsubj is AUX ROOT
is AUX ROOT is AUX ROOT
a DET det sentence NOUN attr
sentence NOUN attr is AUX ROOT
. PUNCT punct is AUX ROOT
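
Because every token has exactly one head, and the ROOT is its own head (see the line `is AUX ROOT is AUX ROOT` above), following head links from any token always ends at the ROOT. The sketch below shows this with plain Python lists; the head indices are read off the output above, and `path_to_root` is a hypothetical helper, not a spaCy function.

```python
def path_to_root(index, heads):
    """Follow head links from a token up to the ROOT and return the visited indices."""
    path = [index]
    while heads[index] != index:  # the ROOT is its own head
        index = heads[index]
        path.append(index)
    return path

words = ["This", "is", "a", "sentence", "."]
heads = [1, 1, 3, 1, 1]  # head index per token, read off the output above

print([words[i] for i in path_to_root(2, heads)])  # ['a', 'sentence', 'is']
```

In spaCy itself the same walk can be written with repeated `token.head` lookups; `token.ancestors` offers a similar traversal that excludes the starting token.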


As each token has a single head, it is "easy" to refer to it via ".head".
However, a token can have multiple children, and ".children" is a generator rather than a list. It means that we have to loop over ".children" to access the individual children.
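
Printing a generator directly shows only a `<generator object ...>` placeholder, and a generator can be consumed only once. The sketch below illustrates both points in plain Python; `children_of` is a hypothetical stand-in for `token.children`, working on the head indices of "This is a sentence.".

```python
def children_of(index, heads):
    """Yield the indices of tokens whose head is `index` (a stand-in for token.children)."""
    for i, head in enumerate(heads):
        if head == index and i != index:
            yield i

heads = [1, 1, 3, 1, 1]  # "This is a sentence."

gen = children_of(1, heads)
print(gen)               # <generator object children_of at 0x...>
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is now exhausted
print(first_pass)        # [0, 3, 4]
print(second_pass)       # []
```

With spaCy, `list(token.children)` gives you a list that can be printed, indexed and reused.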



In [5]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

text = "This is a sentence." 

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # print them
    print(token_text, token_pos, token_dep)
    print(">children:")
    # Get the syntactic children of the token
    token_children = token.children
    print(token_children)
    for c in token_children:
        # Print the text of each child
        #print(c)
        print(c.text)
    print()
    

    

This PRON nsubj
>children:
<generator object at 0x7fcb721c1520>

is AUX ROOT
>children:
<generator object at 0x7fcb721c1680>
This
sentence
.

a DET det
>children:
<generator object at 0x7fcb721c1520>

sentence NOUN attr
>children:
<generator object at 0x7fcb721c1680>
a

. PUNCT punct
>children:
<generator object at 0x7fcb721c1520>



OK, we now get much more information about the grammatical relationships, but it is still difficult to picture the structure.
spaCy also offers a visualization tool: displaCy! 

See a full list of all the options you can use with displacy:
https://spacy.io/api/top-level#displacy_options


In [None]:
import spacy
from spacy import displacy
#'display'...

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a second sentence.")
displacy.render(doc, style="dep")

#displacy.render(doc, style="dep", options={'compact':True, 'font':'Arial'})

## 2. Statistical Models: Named-entity recognition ##

Named entity recognition (NER) seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, time expressions, quantities, monetary values, percentages, etc. NER is used in many fields in Natural Language Processing (NLP), and it can help answer many real-world questions, such as:
- Which companies were mentioned in the news article?
- Were specified products mentioned in complaints or reviews?
- Does the tweet contain the name of a person? Does the tweet contain this person’s location?

spaCy's NER models, trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus, support the following entity types:

<td> <img src="img/ner.png" alt="Drawing" style="width: 500px;"/>

Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"


# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY
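
To answer a question such as "Which companies were mentioned?", it helps to group the entities by label. A minimal sketch in plain Python follows; the (text, label) pairs are copied from the output above, and `group_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

def group_by_label(entities):
    """Map each entity label to the list of entity texts carrying it, in order of appearance."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# With spaCy these pairs could be built as: [(ent.text, ent.label_) for ent in doc.ents]
entities = [("Apple", "ORG"), ("first", "ORDINAL"), ("U.S.", "GPE"), ("$1 trillion", "MONEY")]
by_label = group_by_label(entities)
print(by_label["ORG"])  # ['Apple']
```

Use `by_label.get("PERSON", [])` for labels that may be absent, since the returned dictionary only has keys for labels that actually occurred.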


### DIY: Named Entity Recognition ###

Write code to:
1. Open "austen-emma.txt" 
2. Store the content of the document in a single line (Why?) - Also try your code without this step.
3. For the first 5 sentences in this document, using Spacy, print: (1) the sentence number, (2) the sentence itself, (3) the words that are detected as names in the sentence and (4) their corresponding NER labels.
4. Try to match the following output. Pay attention to the formatting.
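
As a hint for step 2, the sketch below compares two ways of putting a text on a single line. The string `raw` is a short stand-in for the start of the file, not its exact contents.

```python
raw = "[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich"

replaced = raw.replace("\n", " ")   # one space per newline, so runs of spaces remain
collapsed = " ".join(raw.split())   # every run of whitespace becomes a single space

print(repr(replaced))
print(repr(collapsed))
```

`replace` keeps one space for every newline, which reproduces the double and triple spaces visible in the expected output below. The `split`/`join` variant gives cleaner text but will not match that spacing exactly.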


#### Output ####
\*\*\* SENT 1 \*\*\*<br>
[Emma by Jane Austen 1816]  VOLUME I<br>
\*\*\* ENTITIES 1 \*\*\*<br>
Jane Austen PERSON<br>
-----<br>
\*\*\* SENT 2 \*\*\*<br>
CHAPTER I   Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.<br>
\*\*\* ENTITIES 2 \*\*\*<br>
Emma Woodhouse PERSON<br>
nearly twenty-one years DATE<br>
-----<br>
\*\*\* SENT 3 \*\*\*<br>
She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.<br>
\*\*\* ENTITIES 3 \*\*\*<br>
two CARDINAL<br>
-----<br>
\*\*\* SENT 4 \*\*\*<br>
Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.<br>
\*\*\* ENTITIES 4 \*\*\*<br>
-----<br>
\*\*\* SENT 5 \*\*\*<br>
Sixteen years had Miss Taylor been in Mr. Woodhouse's family, less as a governess than a friend, very fond of both daughters, but particularly of Emma.<br>
\*\*\* ENTITIES 5 \*\*\*<br>
Sixteen years DATE<br>
Taylor PERSON<br>
Woodhouse PERSON<br>
Emma PERSON<br>

In [1]:
import spacy
import codecs

# Read the whole file into one string
textfile = codecs.open('./austen-emma.txt','r','utf-8').read()
# Store everything in one line by replacing newlines with spaces
textfile_line = textfile.replace('\n', ' ')
nlp = spacy.load('en_core_web_md')
doc=nlp(textfile_line)

In [64]:
counter = 0

# doc.sents is a generator, so convert it to a list before slicing out the first 5 sentences
for sent in list(doc.sents)[:5]:    
    counter+=1
    print("sent " + str(counter))
    print(sent.text.strip())
    for ent in sent.ents:
        print(ent.text,ent.label_)





sent 1
[Emma by Jane Austen 1816]

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
Emma PERSON
Jane Austen 1816 PERSON
Emma Woodhouse PERSON
nearly twenty-one years CARDINAL
sent 2
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.
two CARDINAL
sent 3
Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
indistinct ORG
sent 4
Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
Sixteen years DATE
Tay
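
`list(doc.sents)[:5]` builds a list of every sentence in the novel just to keep the first five. A hedged alternative is `itertools.islice`, which stops after five items, combined with `enumerate(..., start=1)` in place of the manual counter. The sketch below uses a hypothetical stand-in generator, `sentences()`, instead of `doc.sents`, so that it runs without spaCy.

```python
from itertools import islice

def sentences():
    """Stand-in generator for doc.sents; yields plain strings instead of spaCy spans."""
    for n in range(1, 1001):
        yield f"Sentence number {n}."

first_five = []
for number, sent in enumerate(islice(sentences(), 5), start=1):
    print(f"*** SENT {number} ***")
    print(sent)
    first_five.append((number, sent))
```

With spaCy the loop header would read `for number, sent in enumerate(islice(doc.sents, 5), start=1):`. The document has already been fully processed by `nlp(...)` at that point, so this only saves building the complete sentence list, not the parsing itself.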

In [47]:
import codecs
import spacy
import sys

# WITHOUT replacing newlines by spaces
textfile = codecs.open("./austen-emma.txt", "r", "utf-8").read()
# textfile_line = textfile.replace('\n', ' ')
# Because the line above is disabled, the line breaks of the file stay in the text, which can affect sentence boundaries and entity spans
nlp = spacy.load("en_core_web_sm")
doc = nlp(textfile)

In [48]:
counter = 0
for sent in list(doc.sents)[:5]:
    counter+=1
    print("*** SENTENCE "+str(counter)+" ***")
    print(sent.text.strip())
    print("*** ENTITIES "+str(counter)+" ***")
    for ent in sent.ents:
        print(ent.text, ent.label_, spacy.explain(ent.label_))
        
    print("-----")

*** SENTENCE 1 ***
[Emma by Jane Austen 1816]

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
*** ENTITIES 1 ***
Emma PERSON People, including fictional
Jane Austen 1816 PERSON People, including fictional
Emma Woodhouse PERSON People, including fictional
nearly twenty-one years CARDINAL Numerals that do not fall under another type
-----
*** SENTENCE 2 ***
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.
*** ENTITIES 2 ***
two CARDINAL Numerals that do not fall under another type
-----
*** SENTENCE 3 ***
Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as g

In [42]:
import codecs
import spacy
import sys


textfile = codecs.open("./austen-emma.txt", "r", "utf-8").read()
textfile_line = textfile.replace('\n', ' ')
# Without the line above, the line breaks of the file stay in the text, which can affect sentence boundaries and entity spans
nlp = spacy.load("en_core_web_sm")
doc = nlp(textfile_line)





In [46]:

counter = 0
for sent in list(doc.sents)[:5]:
    counter+=1
    print("*** SENTENCE "+str(counter)+" ***")
    print(sent.text.strip())
    print("*** ENTITIES "+str(counter)+" ***")
    for ent in sent.ents:
        print(ent.text, ent.label_, spacy.explain(ent.label_))
        
    print("-----")

*** SENTENCE 1 ***
[Emma by Jane Austen 1816]

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
*** ENTITIES 1 ***
Emma PERSON People, including fictional
Jane Austen 1816 PERSON People, including fictional
Emma Woodhouse PERSON People, including fictional
nearly twenty-one years CARDINAL Numerals that do not fall under another type
-----
*** SENTENCE 2 ***
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.
*** ENTITIES 2 ***
two CARDINAL Numerals that do not fall under another type
-----
*** SENTENCE 3 ***
Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as g

In [44]:
# Visualise the entities of the first five sentences
for sent in list(doc.sents)[:5]:
    spacy.displacy.render(sent, style="ent")

We can also use displaCy to visualise named entities!

In [37]:
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp("Local health officials in the Iraqi Shia city of Najaf said \
an Iranian theology student was the first positive case of the virus, Reuters reports.")
spacy.displacy.render(doc, style='ent')
# Outside a Jupyter notebook, serve the visualisation instead:
#spacy.displacy.serve(doc, style='ent')

### DIY. Named entity recognition - visualisation ###
Modify the code for the previous DIY so that instead of printing out the entities and their labels, visualise them for each sentence using displacy!

For each sentence, this code will only:
- print the sentence number
- visualize entities


In [87]:
import codecs
import spacy
import sys

# Write your code here
textfile = codecs.open("./austen-emma.txt", "r", "utf-8").read()
nlp = spacy.load("en_core_web_sm")
doc = nlp(textfile[:1000])

counter = 0
for sent in list(doc.sents)[:5]:
    counter+=1
    print("*** SENT "+str(counter)+" ***")
    spacy.displacy.render(sent, style='ent')

*** SENT 1 ***


*** SENT 2 ***


*** SENT 3 ***


*** SENT 4 ***


*** SENT 5 ***


Even though we have tables that summarize the labels we have already seen (POS labels, dependency labels), we can also look up their explanations programmatically with spacy.explain().
See the following example.


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Local health officials in the Iraqi Shia city of Najaf said \
an Iranian theology student was the first positive case of the virus, Reuters reports.")

for token in doc:
    print(token.text, token.pos_, spacy.explain(token.pos_))
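
One pitfall: `spacy.explain()` returns `None` for a label it does not know, so building the output with `+` would raise a `TypeError`. The sketch below wraps the lookup in a hypothetical helper, `describe`, and uses a small dictionary as a stand-in for `spacy.explain` so that it runs without spaCy; its two entries are copied from outputs earlier in this chapter.

```python
def describe(label, explain):
    """Return 'LABEL: description', or just the label when explain() has no entry for it."""
    description = explain(label)
    return f"{label}: {description}" if description else label

# Stand-in for spacy.explain: a lookup that returns None for unknown labels
glossary = {"PERSON": "People, including fictional", "NN": "noun, singular or mass"}

described = [describe(label, glossary.get) for label in ["PERSON", "NN", "MY_LABEL"]]
print(described)
# ['PERSON: People, including fictional', 'NN: noun, singular or mass', 'MY_LABEL']
```

In a notebook you would pass the real function instead, for example `describe(ent.label_, spacy.explain)`.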



### DIY: spacy.explain() ###

Modify the code in the previous DIY (Named entity recognition - visualisation) so that after visualizing the named entities in the sentence, it also prints each NER label and the corresponding explanation for each label. 

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Local health officials in the Iraqi Shia city of Najaf said \
an Iranian theology student was the first positive case of the virus, Reuters reports.")

# Write your code here

So far we have been accessing entities through the spaCy document (doc.ents), because an entity can span multiple tokens.<br>
We can also access the named entity label of each individual token.<br>
To do this we use a token attribute (not a document attribute): __ent\_type\___<br>
Note: ent\_type\_ returns an empty string if the token is not part of an entity.<br>

We can also see whether the token __starts an entity, continues an entity or is outside any entity__, using: __ent\_iob\___<br>
- B: begin
- I: inside
- O: outside (not a part of an entity)
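
To see how the B/I/O tags fit together, the plain-Python sketch below rebuilds multi-token entities from per-token tags. The tags are hand-written for illustration (they are not model output), and `iob_to_spans` is a hypothetical helper, not a spaCy function.

```python
def iob_to_spans(tokens):
    """Group (text, entity_type, iob) triples into (entity_text, entity_type) spans."""
    spans = []
    current_words, current_type = [], None
    for text, ent_type, iob in tokens:
        if iob == "B" or (iob == "I" and not current_words):
            # A new entity starts (an I without a preceding B is treated as a start)
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [text], ent_type
        elif iob == "I":
            current_words.append(text)
        else:
            # "O" closes any open entity
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans

# Hand-written illustrative tags
tokens = [
    ("Emma", "PERSON", "B"),
    ("Woodhouse", "PERSON", "I"),
    ("met", "", "O"),
    ("Jane", "PERSON", "B"),
    ("Austen", "PERSON", "I"),
    (".", "", "O"),
]
spans = iob_to_spans(tokens)
print(spans)  # [('Emma Woodhouse', 'PERSON'), ('Jane Austen', 'PERSON')]
```

Two adjacent B tags yield two separate entities. Joining the words with a single space is a simplification; in spaCy, `doc.ents` already gives you these spans with their original spacing.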



In [71]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Local health officials in the Iraqi Shia city of Najaf said \
an Iranian theology student was the first positive case of the virus, Reuters reports.")

for token in doc:
    print(token.text + "\t" + token.ent_type_ + "\t" + token.ent_iob_)

Local		O
health		O
officials		O
in		O
the		O
Iraqi	NORP	B
Shia	NORP	B
city		O
of		O
Najaf	GPE	B
said		O
an		O
Iranian	NORP	B
theology		O
student		O
was		O
the		O
first	ORDINAL	B
positive		O
case		O
of		O
the		O
virus		O
,		O
Reuters	ORG	B
reports		O
.		O


### DIY: token entities ###

Modify the previous code to loop over all tokens of a doc and print the token text, named entity type and named entity position (IOB), only for the tokens that are part of a named entity. 

In [97]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Local health officials in the Iraqi Shia city of Najaf said \
an Iranian theology student was the first positive case of the virus, Reuters reports.")

# Add your code here
for token in doc:
    # Skip tokens that are outside any entity
    if token.ent_iob_ != "O":
        print(token.text + "\t" + token.ent_type_ + "\t" + token.ent_iob_)

Iraqi	NORP	B
Shia	NORP	B
Najaf	GPE	B
Iranian	NORP	B
first	ORDINAL	B
Reuters	ORG	B


In [72]:
# Alternative solution: keep only the tokens whose entity type is not empty

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Local health officials in the Iraqi Shia city of Najaf said \
an Iranian theology student was the first positive case of the virus, Reuters reports.")

for token in doc:
    if len(token.ent_type_) > 0:
        print(token.text + "\t" + token.ent_type_ + "\t" + token.ent_iob_)

Iraqi	NORP	B
Shia	NORP	B
Najaf	GPE	B
Iranian	NORP	B
first	ORDINAL	B
Reuters	ORG	B
