<a href="https://colab.research.google.com/github/datenzauberai/tmp/blob/main/Technologie_Praktikum_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Complete Guide to spaCy

Based on: https://nlpforhackers.io/complete-guide-to-spacy/

We need a language model to use spaCy.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # full model name; the 'en' shortcut was removed in spaCy v3

Tokenize some text...

In [None]:
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"')

"Hello"
"    "
"World"
"!"


Notice the index-preserving tokenization in action. Rather than keeping only the words, spaCy keeps the whitespace too. This is helpful when you need to replace words in the original text or add annotations. With NLTK's default word tokenization, it is hard to know exactly where a token sits in the original raw text. spaCy preserves this “link” between the word and its place in the raw text. Here’s how to get the exact character offset of each token:

In [None]:
for token in doc:
    print('"' + token.text + '"', token.idx)

"Hello" 0
"    " 6
"World" 10
"!" 15


The `Token` class exposes a lot of word-level attributes. Here are a few examples:

In [None]:
doc = nlp("Next week I'll   be in Madrid.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	-PRON-	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	VERB	MD
  	15	  	False	True	  	SPACE	_SP
be	17	be	False	False	xx	AUX	VB
in	20	in	False	False	xx	ADP	IN
Madrid	23	Madrid	False	False	Xxxxx	PROPN	NNP
.	29	.	True	False	.	PUNCT	.
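
The `idx` column is what makes in-place edits of the raw text possible. As a minimal sketch, the snippet below uses plain Python and the offsets from the table above (hard-coded here so that it runs without a model) to swap one word while leaving the original spacing untouched. The helper `replace_at` is a hypothetical name for this illustration, not part of spaCy.

```python
# Raw text from the example above (note the three spaces after "I'll")
raw = "Next week I'll" + " " * 3 + "be in Madrid."

# token text -> token.idx, copied from the table above
offsets = {"Next": 0, "week": 5, "I": 10, "'ll": 11,
           "be": 17, "in": 20, "Madrid": 23, ".": 29}


def replace_at(text, start, old, new):
    """Replace `old` with `new` at character offset `start` of `text`."""
    end = start + len(old)
    if text[start:end] != old:
        raise ValueError("offset does not point at the expected token")
    return text[:start] + new + text[end:]


# Swap the city name; everything before and after it stays byte-for-byte the same
print(replace_at(raw, offsets["Madrid"], "Madrid", "Lisbon"))
```

With a real `Doc`, the same idea would use `token.idx` and `len(token.text)` instead of the hard-coded dictionary.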


## The spaCy toolbox
Let’s now explore the models bundled with spaCy.

### Sentence detection
Here’s how to achieve one of the most common NLP tasks with spaCy:

In [None]:
doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


**Exercise**: *You can enter your own sentences and see how spaCy handles different punctuation marks.*

### Part Of Speech Tagging
We've already seen how this works but let's have another look:

In [None]:
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]


**Exercise**: *What do JJ, NN, PRP, etc. mean?*

### Named Entity Recognition
Doing NER with spaCy is super easy and the pretrained model performs pretty well:

In [None]:
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Next week DATE
Madrid GPE


**Exercise**: *Try to challenge spaCy with different locations, institutions, company names or holidays.*

The spaCy NER also has a healthy variety of entities. You can view the full list here: [Entity Types](https://spacy.io/usage/linguistic-features#entity-types)

In [None]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG


Let’s use `displaCy` to view a beautiful visualization of the Named Entity annotated sentence:

In [None]:
from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

### Chunking
spaCy automatically detects noun-phrases as well:

In [None]:
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

Wall Street Journal NP Journal
an interesting piece NP piece
crypto currencies NP currencies


Notice how the chunker also computes the root of each phrase, i.e. its main word.

### Dependency Parsing
This is what makes spaCy really stand out. Let’s see the dependency parser in action:

In [None]:
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/NNP <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN


If this doesn’t help you visualize the dependency tree, displaCy comes in handy:

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
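
The token/head pairs printed by the loop above fully describe the tree, so it can also be walked programmatically. The sketch below is plain Python over the triples copied from that output (no model needed). It inverts the child-to-head links and collects the subject and object phrases of the root verb. The helper `subtree` is a made-up name, and keying by word text only works here because no word occurs twice in the sentence.

```python
from collections import defaultdict

# (token, dependency label, head) triples, copied from the parser output above
arcs = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "compound", "currencies"),
    ("currencies", "pobj", "on"),
]

# The root is the token labelled ROOT (it is its own head)
root = next(token for token, dep, head in arcs if dep == "ROOT")

# Invert the child -> head links into head -> children
children = defaultdict(list)
for token, dep, head in arcs:
    if token != head:
        children[head].append((token, dep))


def subtree(word):
    """Return `word` and everything below it, in sentence order."""
    found = {word}
    stack = [word]
    while stack:
        for child, _ in children[stack.pop()]:
            found.add(child)
            stack.append(child)
    return [token for token, _, _ in arcs if token in found]


subject = next(t for t, dep in children[root] if dep == "nsubj")
obj = next(t for t, dep in children[root] if dep == "dobj")
print(" ".join(subtree(subject)), "|", root, "|", " ".join(subtree(obj)))
```

On a real `Doc`, spaCy exposes the same navigation through `token.head`, `token.children` and `token.subtree`.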

### Word Vectors
spaCy comes shipped with a Word Vector model as well. We'll need to download a larger model for that:

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
import en_core_web_lg
nlp = en_core_web_lg.load()

The vectors are attached to spaCy objects: `Token`, `Lexeme` (a sort of unattached token that is part of the vocabulary), `Span` and `Doc`. The multi-token objects (`Span` and `Doc`) average their constituent token vectors.
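
To make the averaging concrete, here is a small sketch in plain Python. The three-dimensional vectors are invented for illustration (real models use hundreds of dimensions), and `average_vector` is a hypothetical helper, not a spaCy function.

```python
def average_vector(vectors):
    """Element-wise mean of equally long vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


# Made-up token vectors for a two-token span
token_vectors = {
    "crypto": [1.0, 0.0, 2.0],
    "currencies": [3.0, 4.0, 0.0],
}

span_vector = average_vector(list(token_vectors.values()))
print(span_vector)  # [2.0, 2.0, 1.0]
```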

Explaining word vectors (also known as word embeddings) is beyond the scope of this tutorial. Here are a few properties of word vectors:

* If two words are similar, they appear in similar contexts.
* Word vectors are computed by taking into account the context (surrounding words).
* Given the two previous observations, similar words should have similar word vectors.
* Using vectors, we can derive relationships between words.

Let’s see how we can access the embedding of a word in spaCy:

In [None]:
print(nlp.vocab['banana'].vector)

There’s a really famous example of word embedding math: "man" - "woman" + "queen" = "king". It sounds too crazy to be true, so let’s test it:

In [None]:
from scipy import spatial
 
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
 
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector
 
# We now need to find the closest vector in the vocabulary to the result of "man" - "woman" + "queen"
maybe_king = man - woman + queen
computed_similarities = []
 
for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue
 
    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))
 
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])

['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'Kings', 'KINGS', 'kings']


Surprisingly, the closest word vector in the vocabulary for “man” – “woman” + “queen” is still “Queen” but “King” comes right after. Maybe behind every King is a Queen?
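
One likely reason for this result is that the offset "man" - "woman" is small compared to the "queen" vector itself, so the sum stays close to "queen". A common remedy in analogy experiments is to leave the query words out of the candidate list. The sketch below shows the effect in plain Python. The two-dimensional vectors are invented for illustration, and `closest` is a hypothetical helper.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up 2-dimensional vectors: (royalty, gender)
vectors = {
    "man": [0.5, 0.2],
    "woman": [0.5, -0.2],
    "king": [3.0, 0.5],
    "queen": [3.0, -0.5],
}


def closest(target, exclude=()):
    """Rank vocabulary words by similarity to `target`, skipping `exclude`."""
    candidates = [word for word in vectors if word not in exclude]
    return sorted(candidates, key=lambda word: -cosine_similarity(target, vectors[word]))


# "man" - "woman" + "queen", computed element by element
target = [m - w + q for m, w, q in
          zip(vectors["man"], vectors["woman"], vectors["queen"])]

print(closest(target))                                    # "queen" still ranks first
print(closest(target, exclude={"man", "woman", "queen"}))  # "king" ranks first
```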

**Exercise**: *Try this with similar examples, maybe with countries and capitals.*

### Computing Similarity
Based on the word embeddings, spaCy offers a similarity interface for all of its building blocks: `Token`, `Span`, `Doc` and `Lexeme`. Here’s how to use it:

In [None]:
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']
 
print(dog.similarity(animal), dog.similarity(fruit))        # dog is closer to animal than to fruit
print(banana.similarity(fruit), banana.similarity(animal))  # banana is closer to fruit than to animal

0.66185343 0.2355285
0.67148364 0.24272855


Let’s now use this technique on entire texts:

In [None]:
target = nlp("Cats are beautiful animals.")
 
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
 
print(target.similarity(doc1))  # cats vs. dogs
print(target.similarity(doc2))  # cats vs. felines
print(target.similarity(doc3))  # cats vs. dolphins

0.8901765218466683
0.9115828449161616
0.7822956256736615
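
When reading such scores, keep in mind that a document vector is an average of token vectors, so a similarity computed this way ignores word order. The sketch below illustrates that with invented two-dimensional word vectors in plain Python, assuming the similarity is the cosine between averaged vectors. `doc_vector` is a hypothetical helper.

```python
import math


def average_vector(vectors):
    """Element-wise mean of equally long vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up word vectors
word_vectors = {
    "dogs": [1.0, 0.0],
    "chase": [0.0, 1.0],
    "cats": [0.8, 0.6],
}


def doc_vector(text):
    """Average the vectors of all whitespace-separated words in `text`."""
    return average_vector([word_vectors[word] for word in text.lower().split()])


a = doc_vector("Dogs chase cats")
b = doc_vector("Cats chase dogs")

# Same words in a different order give (up to rounding) a similarity of 1
print(cosine_similarity(a, b))
```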


### Conclusions
spaCy is a modern, reliable NLP framework that quickly became the standard for doing NLP with Python. Its main advantages are: speed, accuracy, extensibility. It also comes shipped with useful assets like word embeddings. It can act as the central part of your production NLP pipeline.

# 🤗 Quick tour

Based on https://huggingface.co/transformers/quicktour.html

In [None]:
!pip install transformers

Let’s have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG) tasks, such as completing a prompt with new text or translating into another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.

## Getting started on a task with a pipeline
The easiest way to use a pretrained model on a given task is to use `pipeline()`. 🤗 Transformers provides the following tasks out of the box:

* Sentiment analysis: is a text positive or negative?
* Text generation (in English): provide a prompt and the model will generate what follows.
* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.)
* Question answering: provide the model with some context and a question, extract the answer from the context.
* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill the blanks.
* Summarization: generate a summary of a long text.
* Translation: translate a text into another language.
* Feature extraction: return a tensor representation of the text.

Let’s see how this works for sentiment analysis (the other tasks are all covered in the [task summary](https://huggingface.co/transformers/task_summary.html)):

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline('sentiment-analysis')

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [None]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


You can see the second sentence has been classified as negative (the model must pick either positive or negative), but its score of about 0.53 is close to 0.5, so the model is far from certain.
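
The reported score is the probability of the winning label, which explains why it cannot drop below 0.5 for a two-label classifier. The sketch below assumes the pipeline applies a softmax over two output logits, which is the usual setup for such classifiers. The logits are invented for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]

# Hypothetical logits for a confident and an undecided prediction
for logits in ([-4.2, 4.2], [0.06, -0.06]):
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    print(labels[best], round(probs[best], 4))
```

A score near 1.0 therefore signals a confident prediction, while a score near 0.5 means the two labels are almost equally likely.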

**Exercise**: *Try this with different sentences.*

By default, the model downloaded for this pipeline is called “distilbert-base-uncased-finetuned-sst-2-english”. We can look at its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) to get more information about it. It uses the [DistilBERT architecture](https://huggingface.co/transformers/model_doc/distilbert.html) and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

Let’s say we want to use another model; for instance, one that has been trained on French data. We can search through the [model hub](https://huggingface.co/models) that gathers models pretrained on a lot of data by research labs, but also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags “French” and “text-classification” gives back a suggestion “nlptown/bert-base-multilingual-uncased-sentiment”. Let’s see how we can use it.

You can directly pass the name of the model to use to `pipeline()`:

In [None]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

This classifier can now handle texts in English and French, as well as Dutch, German, Italian and Spanish! See the documentation on the model hub: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

In [None]:
# should be 1 star
classifier("Sehr langweilig. Zwei mal gespielt und seitdem in der Ecke.")

[{'label': '1 star', 'score': 0.6738775372505188}]

In [None]:
# should be 1 star
classifier("Das Spiel ist ausschließlich Glücksabhängig, genauso gut könnte man zwei Spielwürfel nehmen und entscheiden dass derjenige gewinnt der die niedrigste Augenzahl würfelt. Nach den tollen Bewertungen hab ich mir mehr erhofft und bin maßlos von diesem Spiel enttäuscht!")

[{'label': '2 stars', 'score': 0.5745571255683899}]

In [None]:
# should be 1 star
classifier("Vom ursprünglichen Spiel waren wir in seiner Raffinesse und trotzdem Einfachheit und Genialität restlos begeistert. Beim vorliegenden bestehen viele Unklarheiten, die auch mit Hilfe der Anleitung nicht geklärt werden können. Die Genialität des ursprünglichen Spieles ist verwässert und die Neuerungen erinnern sehr an andere Spiele. Ich meine, die Erfinder des Spiel haben sich mit der Neuauflage einen Bärendienst erwiesen.")

[{'label': '5 stars', 'score': 0.3628184497356415}]

In [None]:
# should be 5 stars
classifier("Super tolles Spiel, welches wir auf einem Kurzurlaub mit Freunden kennen und lieben gelernt haben. Wir spielen das Spiel seitdem in jeder freien Minute. Selbst kleinere Kinder können schon mitspielen - und wenn sie nur die Karten für Mama oder Papa aufdecken.")

[{'label': '5 stars', 'score': 0.8714972138404846}]

In [None]:
# should be 5 stars
classifier("Ich liebe spiele, die nicht zu komplizierte Regeln haben, sondern mit wenigen Worten schnell erklärt werden können. Dieses Spiel gehört definitiv dazu. Es macht wirklich Spaß zu spielen, und man kann auch die Spiellänge etwas abändern, indem man vorher eine bestimmte Anzahl von Runden ausmacht. Der dazu gehörige Spielblock ist sinnlos, da reicht ein einfaches Blatt Papier. Dann hätte die Verpackung nämlich etwas kleiner ausfallen können, das finde ich praktischer zum mitnehmen.")

[{'label': '4 stars', 'score': 0.6106915473937988}]
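
To compare predictions with expectations more systematically, the labels can be converted to integers. The sketch below is plain Python over the five outputs shown above, paired with the expected ratings from the cell comments. The helper `stars` is a hypothetical name, and five reviews are of course far too few for a meaningful evaluation.

```python
# (expected stars, pipeline output) for the five German reviews above
results = [
    (1, {"label": "1 star", "score": 0.6738775372505188}),
    (1, {"label": "2 stars", "score": 0.5745571255683899}),
    (1, {"label": "5 stars", "score": 0.3628184497356415}),
    (5, {"label": "5 stars", "score": 0.8714972138404846}),
    (5, {"label": "4 stars", "score": 0.6106915473937988}),
]


def stars(label):
    """Convert a label such as '4 stars' into the integer 4."""
    return int(label.split()[0])


# Absolute difference between expected and predicted rating, per review
errors = [abs(expected - stars(pred["label"])) for expected, pred in results]
exact = sum(1 for error in errors if error == 0)

print(f"exact matches: {exact}/{len(results)}")
print(f"mean absolute error: {sum(errors) / len(errors):.1f} stars")
```

Note that the one large miss (5 stars instead of 1) also has the lowest score, so the score can serve as a rough hint for which predictions to double-check.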

**Exercise**: *Try this with different reviews. You can come up with them yourself or copy them from a site like TripAdvisor or Amazon.*