<img src="https://www.aiforpeople.org/wp-content/uploads/2020/01/cropped-AIforPeople-logo-full-2.png" width="200">

## Natural Language Processing with Python using SpaCy

In this quick tutorial we are going to look at some basic natural language processing algorithms that the [SpaCy](https://spacy.io/) package provides. The API documentation can be found [here](https://spacy.io/api) and promises *Industrial-Strength* Natural Language Processing.

This tutorial is run in a Google Colaboratory (Colab) notebook. Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more. All of the code is run on a remote Google server. The usage is free, but limited: Google Colab gives you free RAM (13GB) and a GPU (up to 8GB) for up to 12hrs at a time. Using Colab, we can make sure that everyone is running the exact same code on the exact same system. Currently, the server has only some basic Python packages installed. We can check the current Python version by running the next cell:

In [1]:
!python --version

Python 3.6.9


We can see that the server instance is running Python 3.6.9. Next, we check which of the NLP packages we'd like to use are already installed. With the next line of code, we ask `pip`, the package installation tool, to `list` the installed packages together with their versions, and pipe that list into `grep` (**g**lobally search for a **r**egular **e**xpression and **p**rint the matching lines). The regular expression we use is `nltk\|spacy\|gensim\|corenlp\|textblob\|pattern\|polyglot\|scikit`, where `\|` means "or", so that only the relevant NLP packages are printed.

In [2]:
!pip list --version | grep 'nltk\|spacy\|gensim\|corenlp\|textblob\|pattern\|polyglot\|scikit'

gensim                   3.6.0          
nltk                     3.2.5          
scikit-image             0.16.2         
scikit-learn             0.22.2.post1   
spacy                    2.2.4          
textblob                 0.15.3         
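
In `grep`'s basic regular expressions, `\|` stands for alternation ("or"). As a rough sketch, the same filter can be written in Python with the `re` module, where alternation is a plain `|`. The helper `filter_nlp_packages` and the list of package names are made up for illustration; the list is a hand-typed sample, not the live output of `pip`.

```python
import re

# Same alternatives as the grep pattern; Python's `re` uses `|` instead of `\|`.
NLP_PATTERN = re.compile(r"nltk|spacy|gensim|corenlp|textblob|pattern|polyglot|scikit")

def filter_nlp_packages(names):
    """Return the names that contain any of the NLP package keywords."""
    return [name for name in names if NLP_PATTERN.search(name)]

# Hypothetical sample of installed package names (illustration only).
sample = ["gensim", "numpy", "scikit-learn", "spacy", "requests"]
print(filter_nlp_packages(sample))
```

Like `grep`, `search` matches anywhere in the name, which is why `scikit` also catches `scikit-learn` and `scikit-image`.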


It seems that some of the packages have already been installed. Most importantly, we want to work with **SpaCy**. Now, we need to obtain the *English* (`en`) data for SpaCy with the following line of code. It will automatically download the data we need.

In [3]:
!python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
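
SpaCy models are installed as ordinary Python packages, so one way to check from Python whether a model is available is to ask the import system for it. This is a minimal sketch using only the standard library; the helper name `is_installed` is an assumption introduced here, not part of SpaCy.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None

# Prints True only if the model package from the download step is present.
print(is_installed("en_core_web_lg"))
```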


## Some NLP Basics with SpaCy
Next, we will `import` the SpaCy package and start with some basics of text processing. Since the English models are available, we can `load` one of them. We then pass a sentence to the `nlp` instance and get back a new object from SpaCy, which we call `document` here. This `document` is made up of several `tokens`, each annotated by the SpaCy model. If we print the `text` of every token in the document, we list all the units of our original sentence, i.e. every word and punctuation mark:

In [4]:
import spacy
# 'en' is a spaCy 2.x shortcut link; spaCy 3.x needs a full package name such as 'en_core_web_lg'
nlp = spacy.load('en')

sentence = "This workshop is awesome!"
document = nlp(sentence)

for token in document:
    print('"' + token.text + '"')


"This"
"workshop"
"is"
"awesome"
"!"

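The output above shows why tokenization is more than splitting on whitespace: the `!` becomes its own token. The sketch below contrasts `str.split` with a naive regular-expression tokenizer. The function `naive_tokenize` is a simplified stand-in written for this illustration, not SpaCy's tokenizer, which applies many language-specific rules.

```python
import re

sentence = "This workshop is awesome!"

# Whitespace splitting leaves the "!" glued to the last word.
print(sentence.split())

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# The naive tokenizer separates the punctuation mark, as in the output above.
print(naive_tokenize(sentence))
```
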

We can now investigate every token within the sentence for more information that has been inferred by the SpaCy `en` language model. Among a variety of properties, we can now look at Part-of-Speech (`POS`) tags, the syntactic dependency (`Dep`), whether the token is composed of alphabetical characters (`is alpha`) and much more:
- **Text**: The original word text.
- **Index**: The character offset of the token within the document (`token.idx`).
- **Lemma**: The base form of the word.
- **POS**: The simple UPOS part-of-speech tag.
- **Tag**: The detailed part-of-speech tag.
- **Dep**: Syntactic dependency, i.e. the relation between tokens.
- **Shape**: The word shape – capitalization, punctuation, digits.
- **is alpha**: Does the token consist of alphabetical characters only?
- **is stop**: Is the token part of a stop list, i.e. the most common words of the language?

In [5]:
# Token class exposes a lot of word-level attributes

doc = nlp(u"This class is awesome.")
print("Text\t\tIndex\t\tLemma\t\tPOS\t\tTag\t\tDep\t\tShape\t\tis alpha\tis stop")
print("---------------------------------------------------------------------------------------------------------------------------------------")
for token in doc:
    print("{0}\t\t{1}\t\t{2}\t\t{3}\t\t{4}\t\t{5}\t\t{6}\t\t{7}\t\t{8}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.pos_,
        token.tag_,
        token.dep_,
        token.shape_,
        token.is_alpha,
        token.is_stop
    ))
 

Text		Index		Lemma		POS		Tag		Dep		Shape		is alpha	is stop
---------------------------------------------------------------------------------------------------------------------------------------
This		0		this		DET		DT		det		Xxxx		True		True
class		5		class		NOUN		NN		nsubj		xxxx		True		False
is		11		be		AUX		VBZ		ROOT		xx		True		True
awesome		14		awesome		ADJ		JJ		acomp		xxxx		True		False
.		21		.		PUNCT		.		punct		.		False		False
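
To make the **Index**, **Shape** and **is alpha** columns more concrete, here is a pure-Python sketch that reproduces them for the example sentence. The function `word_shape` is an approximation written for this illustration (letters become `X` or `x`, digits become `d`, and runs of the same symbol are capped at four); it is not SpaCy's own implementation, and the token list is typed in by hand.

```python
def word_shape(text, max_run=4):
    """Approximate a SpaCy-style shape string for a single token."""
    shape = []
    run_char, run_len = "", 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        if symbol == run_char:
            run_len += 1
        else:
            run_char, run_len = symbol, 1
        if run_len <= max_run:  # cap long runs, so "awesome" becomes "xxxx"
            shape.append(symbol)
    return "".join(shape)

text = "This class is awesome."
offset = 0
for word in ["This", "class", "is", "awesome", "."]:
    offset = text.index(word, offset)  # character offset, like the Index column
    print(word, offset, word_shape(word), word.isalpha())
    offset += len(word)
```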


A named entity is a “real-world object” that is assigned a name – for example, a person, a country, a product or a book title. Entities can also include quantities, e.g. time, amount, percentage. **SpaCy** can recognize various types of named entities in a document by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case. Entity recognition can be used to extract information about entities from a corpus, and those entities can then be connected with further information. The next cell will infer some entities and quantities from an example sentence:

In [6]:
doc = nlp(u"I just bought 2 pairs of shoes from Amazon at 12 p.m. because the sale with 30% off was about to expire.")
for entity in doc.ents:
    print(entity.text, entity.label_)


2 CARDINAL
Amazon ORG
12 p.m. TIME
30% PERCENT
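
Once entities are extracted, a common next step is to group them by label before connecting them with further information. The sketch below does this in plain Python. The `(text, label)` pairs are copied from the output above, and `group_by_label` is a hypothetical helper introduced for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; with a real document they
# could be collected as [(ent.text, ent.label_) for ent in doc.ents].
entities = [("2", "CARDINAL"), ("Amazon", "ORG"), ("12 p.m.", "TIME"), ("30%", "PERCENT")]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label["ORG"])
```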


We can also **visualize** the entities and quantities identified by the model. For this, we can use the built-in visualizer `displacy`, which we can simply import from the spacy package. As we want to render entities, we set the `style` to `ent`, and since we are working in a Jupyter notebook in Google Colab, we set `jupyter` to `True`.

In [7]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

Not only can we visualize the recognized entities, but we can also visualize the dependency relations inferred by the model. For this, we can take the example sentence *While hunting in Africa, I shot an elephant in my pajamas.* and create a new document (`doc`) from it. Now, we can simply change the render `style` to `dep` (dependency). Optionally, we can configure the `distance` between the words in the visualization.

In [8]:
doc = nlp(u'While hunting in Africa, I shot an elephant in my pajamas.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

## Word Similarities
An important aspect of Natural Language Processing is the investigation of similarities between texts. We can situate each word of our corpus in a conceptual space, which allows us to compare the "distance" between words within this space. Such an approach makes use of so-called "word embeddings". We are going to have a look at a simple example before we explain word embeddings in more detail in the next part of the lecture.

In order to compare our words, we have to load one of the language models that ships with word vectors. Larger models generally give a more fine-grained similarity analysis; the large English model used here provides a 300-dimensional vector for each word.

In [None]:
nlp = spacy.load('en_core_web_lg')



spaCy is able to compare two objects and make a prediction of how similar they are. Predicting similarity is useful for lots of reasons. For example, you can suggest content to a user that’s similar to what they’re currently looking at.

Each `Doc`, `Span`, `Token` and `Lexeme` comes with a `.similarity()` method that lets you compare it with another object and determine the similarity. In the next cell we compare lexemes, i.e. vocabulary entries looked up via `nlp.vocab`. Of course similarity is always subjective – whether “dog” and “cat” are similar really depends on how you’re looking at it. spaCy’s similarity model usually assumes a pretty general-purpose definition of similarity.

In [24]:
# Computing Similarity

banana = nlp.vocab[u'banana']
dog = nlp.vocab[u'dog']
fruit = nlp.vocab[u'fruit']
animal = nlp.vocab[u'animal']
shiba = nlp.vocab[u'shiba']
 
print("Dog <=> Animal: \t", dog.similarity(animal))
print("Dog <=> Fruit: \t\t",  dog.similarity(fruit))
print("Banana <=> Fruit: \t", banana.similarity(fruit))
print("Banana <=> Animal: \t",  banana.similarity(animal)) 
print("Fruit <=> Animal: \t",  fruit.similarity(animal))
print("Dog <=> Banana: \t",  dog.similarity(banana))
print()
print("Dog <=> Shiba: \t\t",  dog.similarity(shiba)) 
print("Banana <=> Shiba: \t",  banana.similarity(shiba)) 

Dog <=> Animal: 	 0.66185343
Dog <=> Fruit: 		 0.23552851
Banana <=> Fruit: 	 0.67148364
Banana <=> Animal: 	 0.24272855
Fruit <=> Animal: 	 0.33342767
Dog <=> Banana: 	 0.24327643

Dog <=> Shiba: 		 0.3531367
Banana <=> Shiba: 	 0.09621696
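
By default, SpaCy's `.similarity()` is the cosine similarity of the two objects' vectors, i.e. the cosine of the angle between them. The sketch below implements that formula in plain Python. The three-dimensional vectors are toy values invented for this illustration; they are not taken from the model, whose vectors have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration (not from the model).
dog = [0.9, 0.8, 0.1]
animal = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(dog, animal) > cosine_similarity(dog, banana))
```

Because only the angle matters, scaling a vector by a positive constant leaves its similarity to other vectors unchanged.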


![alt text](https://i.imgur.com/fUJjyhT.png)

We can now see that the similarity values seem to be meaningful! The dog is more similar to animals than to a banana! And if that was too easy to be convincing, we can see that the Shiba dog is also much closer to dog (0.35) than to banana (0.10) in the semantic space. How do these values come to be?

For this we can have a closer look at the data attached to **dog** and **banana**. Let us print the vector of both items:

In [26]:
tokens = nlp("dog banana")

for token in tokens:
    print(token.text)
    print(token.vector.shape)
    print(token.vector)
    print("\n")

dog
(300,)
[-4.0176e-01  3.7057e-01  2.1281e-02 -3.4125e-01  4.9538e-02  2.9440e-01
 -1.7376e-01 -2.7982e-01  6.7622e-02  2.1693e+00 -6.2691e-01  2.9106e-01
 -6.7270e-01  2.3319e-01 -3.4264e-01  1.8311e-01  5.0226e-01  1.0689e+00
  1.4698e-01 -4.5230e-01 -4.1827e-01 -1.5967e-01  2.6748e-01 -4.8867e-01
  3.6462e-01 -4.3403e-02 -2.4474e-01 -4.1752e-01  8.9088e-02 -2.5552e-01
 -5.5695e-01  1.2243e-01 -8.3526e-02  5.5095e-01  3.6410e-01  1.5361e-01
  5.5738e-01 -9.0702e-01 -4.9098e-02  3.8580e-01  3.8000e-01  1.4425e-01
 -2.7221e-01 -3.7016e-01 -1.2904e-01 -1.5085e-01 -3.8076e-01  4.9583e-02
  1.2755e-01 -8.2788e-02  1.4339e-01  3.2537e-01  2.7226e-01  4.3632e-01
 -3.1769e-01  7.9405e-01  2.6529e-01  1.0135e-01 -3.3279e-01  4.3117e-01
  1.6687e-01  1.0729e-01  8.9418e-02  2.8635e-01  4.0117e-01 -3.9222e-01
  4.5217e-01  1.3521e-01 -2.8878e-01 -2.2819e-02 -3.4975e-01 -2.2996e-01
  2.0224e-01 -2.1177e-01  2.7184e-01  9.1703e-02 -2.0610e-01 -6.5758e-01
  1.8949e-01 -2.6756e-01  9.2639e-02  4.

![alt text](https://thumbs.gfycat.com/QuickSevereCat-small.gif)