In [None]:
import geovpylib.database as db
import spacy


# Connect to Yellow database
db.connect_yellow('switzerland_and_beyond')

# Fetch corpus
persons = db.query("select * from hls.person")
person = persons.iloc[5441]

# Chapter 1: Finding words, phrases, names and concepts

https://course.spacy.io/en/chapter1

## Introduction

At the center of the library is the `nlp` object, which is basically a pipeline (see next chapter to learn more) created by **spaCy**.

This object contains everything needed by the pipeline, such as language-specific rules. It can be called like a function to analyze a text.

Processing a text produces a `Doc` object, which structures all the information parsed by the pipeline with no loss of information (i.e. it only adds information). A `Doc` behaves like a Python sequence (e.g. it can be iterated over and indexed).

A `Doc` is made of `Token` objects. Each one represents a word, a punctuation mark, etc., and possesses various attributes (more on that later).

Several consecutive `Token` objects can be grouped into a `Span`, which is obtained by slicing the `Doc` object.

**Create a `doc` in a language**

In [None]:
nlp = spacy.blank('fr')
doc = nlp(person.notice)
print(doc.text)

**Get tokens out of a `doc`**

In [None]:
nlp = spacy.blank('fr')
doc = nlp(person.notice)
token = doc[0]
print(token.text)

**Get a slice of the doc**

In [None]:
nlp = spacy.blank('fr')
doc = nlp(person.notice)
a_slice = doc[2:10]
print(a_slice.text)

**Find dates (births and deaths) in `doc`**

In [None]:
nlp = spacy.blank('fr')
doc = nlp(person.notice)
lendoc = len(doc)

for token in doc:
    # The pattern looks two tokens ahead, so stop before running off the end
    if token.i + 2 >= lendoc:
        break
    if token.text == 'Naît' and doc[token.i + 1].text == "le" and doc[token.i + 2].like_num:
        print('Birth date found:', doc[token.i + 2])
    if token.text == 'meurt' and doc[token.i + 1].text == "le" and doc[token.i + 2].like_num:
        print('Death date found:', doc[token.i + 2])


## Trained pipelines

In short, trained pipelines let you predict context-specific attributes, e.g. whether a `Span` is a person name, whether a word is a verb, etc.

How is that done? Under the hood, **spaCy** has statistical models to make those predictions. Usually, pipelines are used to get part-of-speech (*POS*) tags, syntactic dependencies, named entities, ...

Pipelines are trained on large datasets and can be updated to fine-tune predictions.

Pretrained pipelines can be downloaded with the `spacy download` command, for example `python -m spacy download fr_core_news_sm` (see more [here](https://spacy.io/usage/processing-pipelines)). In code, they are loaded with the `spacy.load()` function, which takes the pipeline name and returns an `nlp` object.
The pipeline also contains the vocabulary and meta information about itself.

In **spaCy**, attributes that end with an underscore (such as `.pos_`) return a string; the same attribute without the underscore (`.pos`) returns only an integer ID.
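
The difference between the two attribute forms can be checked on a lexical attribute, which does not need a trained pipeline. This is a minimal sketch; the sentence is made up for illustration and is not taken from the corpus.

```python
import spacy

# A blank French pipeline is enough: lexical attributes need no trained model
nlp = spacy.blank("fr")
doc = nlp("Naît le 3 mai 1850 à Genève.")  # made-up sentence
token = doc[0]

# Without underscore: integer ID (a hash). With underscore: readable string.
print(token.lower, token.lower_)

# The integer ID can be turned back into its string through the vocabulary
lower_from_id = nlp.vocab.strings[token.lower]
print(lower_from_id)
```

The same pairing applies to predicted attributes such as `.pos` / `.pos_` once a trained pipeline is loaded.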

Other examples of what a trained pipeline can predict (apart from POS tags): the dependency label (`.dep_`, such as subject or object), the syntactic head token (`.head`, the parent token) and the named entities (`.ents`).

**Load a pipeline**

In [None]:
nlp = spacy.load("fr_core_news_sm")
text = person.notice
doc = nlp(text)
print(doc)

**Predict language annotation**

In [None]:
nlp = spacy.load("fr_core_news_sm")
text = person.notice
doc = nlp(text)

for token in doc[0:15]:
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

**All kinds of POS found, with explanation**

In [None]:
nlp = spacy.load("fr_core_news_sm")
text = person.notice
doc = nlp(text)

POSs = []
DEPs = []
for token in doc:
    if token.pos_ not in POSs: POSs.append(token.pos_)
    if token.dep_ not in DEPs: DEPs.append(token.dep_)

print('===== Part of Speech: =====')
for pos in POSs:
    print(pos, "-->", spacy.explain(pos))

print('\n===== Dependency labels: =====')
for dep in DEPs:
    print(dep, "==>", spacy.explain(dep))

**All entities found in a text (NER)**

In [None]:
nlp = spacy.load("fr_core_news_sm")
text = person.notice
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "==>", ent.label_)

## Rule Based Matching

The **spaCy** `Matcher` works like regular expressions, except that it operates on `Doc` and `Token` objects instead of plain strings. Patterns can match on the text itself, on lexical attributes, etc.

A pattern is a list of dictionaries, one per token, each describing token attributes (exact text, lowercase form, part of speech, punctuation, optional tokens, ...). Calling the matcher on a `Doc` returns a list of `(match_id, start, end)` tuples.
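
As a small illustration of case-insensitive and optional tokens, the sketch below uses `LOWER`, the `IN` set operator and `OP: "?"` on a blank pipeline. The sentence and the pattern name `LIFE_EVENT` are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("fr")
doc = nlp("Naît le 3 mai 1850 à Carouge, meurt en 1920 à Genève.")  # made-up text

matcher = Matcher(nlp.vocab)
pattern = [
    {"LOWER": {"IN": ["naît", "meurt"]}},        # case-insensitive verb
    {"LOWER": {"IN": ["le", "en"]}, "OP": "?"},  # optional small word
    {"LIKE_NUM": True},                          # a token that looks like a number
]
matcher.add("LIFE_EVENT", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    # match_id is the hash of the pattern name; the vocab turns it back into a string
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```

Because these attributes are lexical, no trained pipeline is required; patterns using `POS` or `LEMMA` do need one.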

In [None]:
nlp = spacy.load("fr_core_news_sm")
text = person.notice
doc = nlp(text)

matcher = spacy.matcher.Matcher(nlp.vocab)
pattern_birth = [{'TEXT': 'Naît'}, {'TEXT': 'le'}, {'LIKE_NUM': True}]
pattern_death = [{'TEXT': 'meurt'}, {'TEXT': 'le'}, {'LIKE_NUM': True}]
pattern_son = [{'TEXT': 'Fils'}, {'TEXT': 'de'}, {'POS': 'PROPN'}]
pattern_daughter = [{'TEXT': 'Fille'}, {'TEXT': 'de'}, {'POS': 'PROPN'}]

matcher.add("BIRTH", [pattern_birth])
matcher.add("DEATH", [pattern_death])
matcher.add("SON", [pattern_son])
matcher.add("DAUGHTER", [pattern_daughter])

matches = matcher(doc)
print("Total matches found:", len(matches))

# Use match_id rather than id to avoid shadowing the Python builtin
for match_id, start, end in matches:
    print(doc[start:end].text)

# Chapter 2: Large-scale data analysis with spaCy

## Data Structures 1

**spaCy** stores all shared, context-independent data (the text of a word, whether it is alphabetic, ...) in a vocabulary. Internally, to save memory and speed up lookups, each string is stored only once and referenced by its hash. The vocabulary can also be extended manually.

**Word hashes (in vocab)**

In [None]:
nlp = spacy.load("fr_core_news_sm")
text = person.notice
doc = nlp(text)
word = 'meurt'
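# Use word_hash rather than hash to avoid shadowing the Python builtin
word_hash = nlp.vocab.strings[word]            # string -> hash
word_from_hash = nlp.vocab.strings[word_hash]  # hash -> string

print(word_hash, word_from_hash)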
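
The shared, context-independent data of a word can also be read from its vocabulary entry, a `Lexeme`. A minimal sketch on a blank pipeline, with a made-up sentence:

```python
import spacy

nlp = spacy.blank("fr")
doc = nlp("Il meurt à Genève.")  # made-up sentence; processing it fills the vocabulary

# Looking a string up in the vocab returns its Lexeme
lexeme = nlp.vocab["meurt"]

# Text, hash of the text, and a lexical flag; none of them depend on context
print(lexeme.text, lexeme.orth, lexeme.is_alpha)
```

A `Lexeme` carries no part-of-speech tag or dependency label, since those depend on the sentence the word appears in.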

## Data Structures 2

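The central data structure is the `Doc` object, created by calling the `nlp` object on a text. A `Doc` can also be created manually from a list of words and a list of booleans telling whether each word is followed by a space.

A `Span` can also be created manually, by passing a `Doc` and the start and end token indices to the `Span` constructor.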

**Manually create a `doc`**

In [None]:
words = ['Hello', 'world', '!']
spaces = [True, False, False]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)

print(doc.text)


**Add a new entity to the existing entities of a `doc`**

In [None]:
# nlp = spacy.load('fr_core_news_sm')
nlp = spacy.blank('fr')
text = person.notice
doc = nlp(text)

span = spacy.tokens.Span(doc, 35, 36, label="PERSON")
# Append to the existing entities instead of overwriting them
doc.ents = list(doc.ents) + [span]

for ent in doc.ents:
    print(ent.text, ent.label_)

**All proper nouns followed by a verb**

In [None]:
nlp = spacy.load('fr_core_news_sm')
doc = nlp(person.notice)

for token in doc:
    # Is current word a proper noun?
    if token.pos_ == "PROPN":
        # Is there a next word, and is it a verb?
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("- ", token.text, doc[token.i + 1].text)

## Word vectors and semantic similarity

**spaCy** can compare words, spans and documents by measuring the similarity of their vector representations.

To use the similarity functions, the pipeline must include word vectors. Small pipelines (names ending in `sm`) do not ship with them, while medium (`md`) and large (`lg`) ones do. Find more about them [here](https://spacy.io/models).

A similarity score expresses how similar two objects are. In practice it usually falls between 0 (nothing in common) and 1 (identical vectors), although a cosine similarity can be negative.

By default, the score is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the two vectors representing the two objects.

Word vectors are produced by an embedding algorithm such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) (embedding is the process of transforming text into numbers).

The vector of a multi-token object (a `Doc` or a `Span`) is the average of its token vectors. That is why similarity is more meaningful on short texts with few irrelevant words.
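
The two operations described above, averaging token vectors and taking a cosine similarity, can be sketched in plain Python. The 3-dimensional vectors below are invented toy values, not real spaCy vectors, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_vector(vectors):
    """Component-wise mean, the way a multi-token vector is described above."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

# Toy vectors (made up for illustration)
token_vectors = {
    "école": [1.0, 0.0, 1.0],
    "rurale": [0.0, 1.0, 1.0],
    "instituteur": [1.0, 0.0, 0.5],
}

span_vector = average_vector([token_vectors["école"], token_vectors["rurale"]])
print(span_vector)
print(cosine_similarity(span_vector, token_vectors["instituteur"]))
```

Every token added to the average pulls the result toward its own vector, which is how irrelevant words dilute the similarity of longer texts.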


**Word Vectors**

In [None]:
nlp = spacy.load('fr_core_news_md')
doc = nlp(person.notice)

protestant_vector = doc[17].vector
print(protestant_vector)

**Similarities**

In [None]:
nlp = spacy.load("fr_core_news_md")

# Compare 2 documents
doc1 = nlp(person.notice.split(', ')[0])
doc2 = nlp(person.notice.split(', ')[1])
print(doc1.similarity(doc2))

# Compare 2 tokens
doc = nlp(person.notice)
token1 = doc[0] # Naît
token2 = doc[11] # Meurt
print(token1.similarity(token2))

# Compare 2 spans
doc = nlp(person.notice)
span1 = doc[29:41] # directeur de l'école rurale de la Pommière à Chêne-Bougeries
span2 = doc[53:59] # Instituteur à l'école de Plainpalais
print(span1.similarity(span2))

## Combining predictions and rules

Combining statistical predictions with a rule-based system is one of the most powerful tricks to have in an NLP toolbox.

Statistical predictions are powerful when the answer depends on context and examples, for instance to decide whether a span of tokens is a person name, or to find relationships between subjects and objects.
On the other hand, rule-based approaches are handy when there is a finite number of instances to find (country names, cities, ...).


**Find matches in a text**

In [None]:
from spacy.matcher import Matcher
nlp = spacy.load('fr_core_news_md')
doc = nlp(person.notice)

# Define Patterns
pattern_cons_munic = [{'LOWER':'conseil'}, {'LOWER': 'municipal'}]
pattern_cons_natio = [{'LEMMA':'conseiller'}, {'LOWER': 'national'}]

# Add the Patterns
matcher = Matcher(nlp.vocab)
matcher.add('CONSEIL_MUNIC', [pattern_cons_munic])
matcher.add('CONSEIL_NATIO', [pattern_cons_natio])

# Find matchings
matchings = matcher(doc)

# Inspect matchings
# Use match_id rather than id to avoid shadowing the Python builtin
for match_id, start, end in matchings:
    print(doc.vocab.strings[match_id], doc[start:end])



**Match exact strings**

This is much more efficient than the other techniques, but it only finds the exact token sequences listed, so it can miss variants and score lower on metrics such as recall.

In [None]:
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('fr')
doc = nlp(person.notice)

matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(['Conseil Municipal', 'Conseil municipal', 'conseiller municipal', 'Conseil National', 'conseiller national']))
# patterns = [nlp(role) for role in LIST]
matcher.add('POLITICIAN', patterns)

matchings = matcher(doc)
print([doc[start:end] for match_id, start, end in matchings])
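
Listing every capitalization variant by hand gets tedious. One possible alternative, sketched here on a made-up sentence, is to let the `PhraseMatcher` compare lowercase forms through its `attr` argument, so that a single lowercase pattern covers all casings.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("fr")
doc = nlp("Membre du Conseil municipal, puis conseiller national.")  # made-up text

# attr="LOWER" makes the matcher compare lowercase token forms
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["conseil municipal", "conseiller national"]
# make_doc only tokenizes, which is all a phrase pattern needs
matcher.add("POLITICIAN", [nlp.make_doc(term) for term in terms])

found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

The trade-off is that case no longer distinguishes anything, so a pattern would also match an unrelated lowercase use of the same words.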

**Get relationship between given entities**

In [None]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('fr_core_news_sm')
doc = nlp(person.notice)
doc.ents = [] # Reset the ones created by the pipeline

matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(['Conseil Municipal', 'Conseil municipal', 'conseiller municipal', 'Conseil National', 'conseiller national']))
matcher.add('POLITICIAN', patterns)
matchings = matcher(doc)

# Add the matches to the entities
for match_id, start, end in matchings:
    span = Span(doc, start, end, label="POLITIC_ROLE")
    doc.ents = list(doc.ents) + [span]

    # The head of the span's root token is the word the role is attached to
    span_root_head = span.root.head
    print(span_root_head, '-->', span.text)

print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "POLITIC_ROLE"])
