In [1]:
import spacy

notice = """
Naît le 31.5.1877 à Choindez (commune de Courrendlin), meurt le 31.5.1931 à Genève, protestant, de Genève en 1902.
Fils d'Henri Albert, directeur de l'école rurale de la Pommière à Chêne-Bougeries, et d'Anna Grosvernier.
Célibataire. Etudes au collège. Instituteur à l'école de Plainpalais (aujourd'hui commune de Genève) en 1908,
puis à celles de Vernier, des Pâquis (Genève), de Versoix, de Carl-Vogt, de La Roseraie et du Grütli (les trois dernières à Genève).
Membre socialiste du Conseil municipal (législatif) de Genève (1914-1931, président en 1922) et conseiller national (1922).
Ernest Joray a également présidé le comité de l'université ouvrière de Genève, de 1910 à sa mort.
"""

**spaCy** code snippets and explanations from https://course.spacy.io/en/

# Chapter 1: Finding words, phrases, names and concepts

## Introduction

At the center of the library is the `nlp` object, which is essentially a pipeline (see Chapter 3 to learn more) created by **spaCy**.

This object contains everything the pipeline needs, such as language-specific rules. It can be called like a function to analyze a text.

Processing a text produces a `Doc` object, which structures all the information parsed by the pipeline without losing any of the original text (i.e. it only adds information). A `Doc` behaves like a Python sequence (e.g. it can be iterated over and indexed).

A `Doc` is made of `Token` objects, each representing a word, a punctuation mark, etc. Each `Token` has various attributes (more on that later).

Several consecutive `Token` objects form a `Span`, which is obtained by slicing the `Doc` object.
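
As a quick illustration of these three objects, here is a minimal sketch on a one-sentence excerpt of the notice. The excerpt and the choice of attributes (`i`, `is_alpha`, `is_punct`, `like_num`) are only for illustration.

```python
import spacy

# A blank French pipeline only contains the tokenizer, so no download is needed.
nlp = spacy.blank("fr")
doc = nlp("Naît le 31.5.1877 à Choindez.")

# A Doc behaves like a sequence of Token objects.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

# Slicing a Doc returns a Span (here, tokens 1 and 2).
date_span = doc[1:3]
print(type(date_span).__name__, "->", date_span.text)
```

Lexical attributes such as `like_num` do not need a trained pipeline, which is why they already work with `spacy.blank`.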

**Create a `doc` in a language**

In [2]:
nlp = spacy.blank('fr')
doc = nlp(notice)
print(doc.text)


Naît le 31.5.1877 à Choindez (commune de Courrendlin), meurt le 31.5.1931 à Genève, protestant, de Genève en 1902.
Fils d'Henri Albert, directeur de l'école rurale de la Pommière à Chêne-Bougeries, et d'Anna Grosvernier.
Célibataire. Etudes au collège. Instituteur à l'école de Plainpalais (aujourd'hui commune de Genève) en 1908,
puis à celles de Vernier, des Pâquis (Genève), de Versoix, de Carl-Vogt, de La Roseraie et du Grütli (les trois dernières à Genève).
Membre socialiste du Conseil municipal (législatif) de Genève (1914-1931, président en 1922) et conseiller national (1922).
Ernest Joray a également présidé le comité de l'université ouvrière de Genève, de 1910 à sa mort.



**Get tokens out of a `doc`**

In [3]:
nlp = spacy.blank('fr')
doc = nlp(notice)
token = doc[0]
print(token.text)





**Get a slice of the doc**

In [4]:
nlp = spacy.blank('fr')
doc = nlp(notice)
a_slice = doc[2:10]
print(a_slice.text)

le 31.5.1877 à Choindez (commune de Courrendlin


**Find dates (births and deaths) in `doc`**

In [5]:
nlp = spacy.blank('fr')
doc = nlp(notice)

# Stop two tokens before the end so that doc[token.i + 2] always exists
for token in doc[:-2]:
    next_token, candidate = doc[token.i + 1], doc[token.i + 2]
    if token.text == 'Naît' and next_token.text == "le" and candidate.like_num:
        print('Birth date found:', candidate)
    if token.text == 'meurt' and next_token.text == "le" and candidate.like_num:
        print('Death date found:', candidate)


Birth date found: 31.5.1877
Death date found: 31.5.1931


## Trained pipelines

In short, trained pipelines let you predict context-dependent information, e.g. whether a `Span` is a person's name, whether a word is a verb, etc.

How is that done? Under the hood, **spaCy** uses statistical models to make those predictions. Pipelines are typically used to get part-of-speech (*POS*) tags, syntactic dependencies, named entities, etc.

Pipelines are trained on large datasets and can be updated to fine-tune predictions.

Pretrained pipelines can be downloaded with the `spacy download` command (`python -m spacy download <pipeline name>`, see more [here](https://spacy.io/usage/processing-pipelines)) and loaded in code with the `spacy.load('<pipeline name>')` function, which returns an `nlp` object.
The pipeline package also contains the vocabulary and metadata about the pipeline.

In **spaCy**, attributes ending with an underscore (e.g. `.pos_`) return string values; the same attributes without the underscore (e.g. `.pos`) return only an integer ID.
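
The underscore convention can be checked on any attribute. The sketch below uses `lower_` / `lower` because they are available even in a blank pipeline; the sample text is an arbitrary excerpt of the notice.

```python
import spacy

nlp = spacy.blank("fr")
doc = nlp("Instituteur à Genève")
token = doc[0]

# With the underscore: a readable string.
print(token.lower_)

# Without the underscore: the integer ID (hash) of that string.
print(token.lower)

# The shared string store maps the ID back to the string.
print(nlp.vocab.strings[token.lower])
```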

Other examples of what a trained pipeline can predict (apart from POS tags) are: dependency labels (`.dep_`, e.g. subject, object, ...), the syntactic head token (`.head`, i.e. the parent token), and named entities (`doc.ents`).

**Load a pipeline**

In [6]:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(notice)
print(doc)


Naît le 31.5.1877 à Choindez (commune de Courrendlin), meurt le 31.5.1931 à Genève, protestant, de Genève en 1902.
Fils d'Henri Albert, directeur de l'école rurale de la Pommière à Chêne-Bougeries, et d'Anna Grosvernier.
Célibataire. Etudes au collège. Instituteur à l'école de Plainpalais (aujourd'hui commune de Genève) en 1908,
puis à celles de Vernier, des Pâquis (Genève), de Versoix, de Carl-Vogt, de La Roseraie et du Grütli (les trois dernières à Genève).
Membre socialiste du Conseil municipal (législatif) de Genève (1914-1931, président en 1922) et conseiller national (1922).
Ernest Joray a également présidé le comité de l'université ouvrière de Genève, de 1910 à sa mort.



**Predict language annotation**

In [7]:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(notice)

for token in doc[0:15]:
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")


           SPACE     dep       
Naît        PROPN     ROOT      
le          DET       det       
31.5.1877   NOUN      nmod      
à           ADP       case      
Choindez    PROPN     nmod      
(           PUNCT     punct     
commune     NOUN      appos     
de          ADP       case      
Courrendlin PROPN     nmod      
)           PUNCT     punct     
,           PUNCT     punct     
meurt       ADV       advmod    
le          DET       det       
31.5.1931   NUM       nmod      


**All kinds of POS found, with explanation**

In [8]:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(notice)

POSs = []
DEPs = []
for token in doc:
    if token.pos_ not in POSs: POSs.append(token.pos_)
    if token.dep_ not in DEPs: DEPs.append(token.dep_)

print('===== Part of Speech: =====')
for pos in POSs:
    print(pos, "-->", spacy.explain(pos))

print('\n===== Dependency labels: =====')
for dep in DEPs:
    print(dep, "==>", spacy.explain(dep))

===== Part of Speech: =====
SPACE --> space
PROPN --> proper noun
DET --> determiner
NOUN --> noun
ADP --> adposition
PUNCT --> punctuation
ADV --> adverb
NUM --> numeral
VERB --> verb
ADJ --> adjective
CCONJ --> coordinating conjunction
PRON --> pronoun
AUX --> auxiliary

===== Dependency labels: =====
dep ==> unclassified dependent
ROOT ==> root
det ==> determiner
nmod ==> modifier of nominal
case ==> case marking
punct ==> punctuation
appos ==> appositional modifier
advmod ==> adverbial modifier
acl ==> clausal modifier of noun (adjectival clause)
obj ==> object
obl:mod ==> None
flat:name ==> None
amod ==> adjectival modifier
cc ==> coordinating conjunction
conj ==> conjunct
nummod ==> numeric modifier
nsubj ==> nominal subject
aux:tense ==> None




**All entities found in a text (NER)**

In [9]:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(notice)

for ent in doc.ents:
    print(ent.text, "==>", ent.label_)

Naît ==> LOC
Choindez ==> LOC
Courrendlin ==> LOC
Genève ==> LOC
Genève ==> LOC
Fils ==> PER
Henri Albert ==> PER
Pommière ==> LOC
Chêne-Bougeries ==> LOC
Anna Grosvernier ==> PER
Célibataire ==> LOC
Plainpalais ==> LOC
de Genève ==> ORG
Vernier ==> LOC
Pâquis ==> LOC
Genève ==> LOC
Versoix ==> PER
Carl-Vogt ==> PER
La Roseraie ==> MISC
Grütli ==> LOC
Genève ==> LOC
Membre ==> ORG
Conseil ==> ORG
Genève ==> LOC
Ernest Joray ==> PER
université ouvrière de Genève ==> ORG


## Rule Based Matching

**spaCy**'s rule-based matching works like regular expressions, but on `Doc` and `Token` objects instead of plain strings. Patterns can match on exact texts, on lexical attributes and, with a trained pipeline, on predicted attributes such as POS tags.

A pattern is a list of dictionaries, one per token, each describing token attributes (exact text, lowercase form, punctuation, optional tokens, ...). Calling the matcher on a `Doc` returns a list of `(match_id, start, end)` tuples.

In [10]:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(notice)

matcher = spacy.matcher.Matcher(nlp.vocab)
pattern_birth = [{'TEXT': 'Naît'}, {'TEXT': 'le'}, {'LIKE_NUM': True}]
pattern_death = [{'TEXT': 'meurt'}, {'TEXT': 'le'}, {'LIKE_NUM': True}]
pattern_son = [{'TEXT': 'Fils'}, {'TEXT': 'de'}, {'POS': 'PROPN'}]
pattern_daughter = [{'TEXT': 'Fille'}, {'TEXT': 'de'}, {'POS': 'PROPN'}]

matcher.add("BIRTH", [pattern_birth])
matcher.add("DEATH", [pattern_death])
matcher.add("SON", [pattern_son])
matcher.add("DAUGHTER", [pattern_daughter])

matches = matcher(doc)
print("Total matches found:", len(matches))

for id, start, end in matches:
    print(doc[start:end].text)

Total matches found: 2
Naît le 31.5.1877
meurt le 31.5.1931
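
In the notice, `Fils` is followed by the elided form `d'` rather than `de`, which would explain why only the two date patterns matched above. The sketch below is one possible way to accept both forms with the `IN` operator. It uses a blank pipeline, so `IS_TITLE` (capitalised token) stands in for the `PROPN` check; that substitution is an assumption made to keep the example free of model downloads.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("fr")
doc = nlp("Fils d'Henri Albert, directeur, et d'Anna Grosvernier.")

matcher = Matcher(nlp.vocab)
# Accept both "de" and its elided form "d'", then require a capitalised token.
pattern_son = [
    {"TEXT": "Fils"},
    {"LOWER": {"IN": ["de", "d'"]}},
    {"IS_TITLE": True},
]
matcher.add("SON", [pattern_son])

# Each match is a (match_id, start, end) tuple; the ID is a hash of the rule name.
results = []
for match_id, start, end in matcher(doc):
    results.append((nlp.vocab.strings[match_id], doc[start:end].text))
print(results)
```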


# Chapter 2: Large-scale data analysis with spaCy

## Data Structures 1

**spaCy** stores all shared data (whether a word is alphabetic, the text itself, ...) in a vocabulary. Internally, to improve performance and reduce memory usage, it only works with hashed versions of strings and looks the text up in a string store. The vocabulary can also be extended manually.

**Word hashes (in vocab)**

In [11]:
nlp = spacy.load("fr_core_news_sm")
doc = nlp(notice)
word = 'meurt'
hash = nlp.vocab.strings[word]
word_from_hash = nlp.vocab.strings[hash]

print(hash, word_from_hash)

6629007756632129920 meurt
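
The context-independent information about a word lives in a `Lexeme`, an entry of the vocabulary. The following sketch looks one up and also adds a string to the string store by hand; the sample sentence and the added word are arbitrary choices.

```python
import spacy

nlp = spacy.blank("fr")
nlp("Il meurt à Genève")  # processing a text adds its words to the vocabulary

# A Lexeme holds context-independent information about a word.
lexeme = nlp.vocab["meurt"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha, lexeme.like_num)

# The string store can be extended manually; add() returns the hash.
new_hash = nlp.vocab.strings.add("Grosvernier")
print(new_hash, nlp.vocab.strings[new_hash])
```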


## Data Structures 2

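The central data structure is the `Doc` object, created by calling the `nlp` object on a text. A `Doc` can also be created manually from a list of words and a list of booleans telling whether each word is followed by a space.

A `Span` can also be created manually by passing a `Doc`, a start index and an end index (plus an optional label) to the `Span` class.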

**Manually create a `doc`**

In [12]:
words = ['Hello', 'world', '!']
spaces = [True, False, False]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)

print(doc.text)


Hello world!


**Add a new entity to the existing entities of a `doc`**

In [13]:
# nlp = spacy.load('fr_core_news_sm')
nlp = spacy.blank('fr')
doc = nlp(notice)

span = spacy.tokens.Span(doc, 35, 36, label="PERSON")
doc.ents = [span]

for ent in doc.ents:
    print(ent.text, ent.label_)

rurale PERSON


**All proper nouns followed by a verb**

In [14]:
nlp = spacy.load('fr_core_news_sm')
doc = nlp(notice)

# Stop one token before the end so that doc[token.i + 1] always exists
for token in doc[:-1]:
    next_token = doc[token.i + 1]
    # Is the current word a proper noun followed by a verb?
    if token.pos_ == "PROPN" and next_token.pos_ == "VERB":
        print("- ", token.text, next_token.text)

## Word vectors and semantic similarity

**spaCy** can compare how similar two words, spans or documents are by comparing their vector representations.

This requires a pipeline that ships with word vectors (small pipelines do not include them); see the available pipelines [here](https://spacy.io/models).

A similarity score expresses how similar two objects are; it generally ranges from 0 (totally different) to 1 (same meaning).

By default, the score is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the two vectors.

Word vectors are produced by an embedding algorithm (the process of turning text into numbers) such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec).

The vector of an object made of multiple tokens (a `Doc` or a `Span`) is the average of its token vectors. This is why the comparison is more meaningful when the text contains few irrelevant words.
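
To make the two ideas above concrete (cosine similarity and averaging token vectors), here is a small pure-Python sketch. The three-dimensional vectors are made-up toy values, not real embeddings, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_vector(vectors):
    """Component-wise mean, as used for the vector of a Span or a Doc."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Toy vectors (assumed values, for illustration only).
token_vectors = {
    "école": [1.0, 0.0, 0.0],
    "rurale": [0.0, 1.0, 0.0],
    "collège": [0.9, 0.1, 0.0],
}

span_vector = average_vector([token_vectors["école"], token_vectors["rurale"]])
print(cosine_similarity(token_vectors["école"], token_vectors["collège"]))
print(cosine_similarity(span_vector, token_vectors["collège"]))
```

Mathematically, a cosine can also be negative when two vectors point in opposite directions.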


**Word Vectors**

In [15]:
nlp = spacy.load('fr_core_news_md')
doc = nlp(notice)

protestant_vector = doc[17].vector
print(protestant_vector)

[-8.8809e-01 -3.5001e-01 -9.6216e-01  2.9848e+00 -4.2264e+00  3.6049e-01
  2.4019e+00  7.3439e-01 -1.4135e-01  2.0341e+00 -2.5751e+00 -8.1169e-01
  1.9722e-01 -1.7658e+00  1.3302e+00 -9.6757e-01 -3.1117e+00 -5.2982e-01
  5.7773e-01  1.1570e+00  7.9589e-02 -1.6649e+00 -1.7397e+00 -1.9404e+00
 -1.1030e-02 -5.6723e-01 -5.8770e-01 -1.3862e-01 -6.9038e-01  2.2236e+00
 -7.0531e-02 -3.8363e-01 -1.8082e+00 -1.4682e+00 -5.5101e-01  3.7012e-01
  1.4327e+00  1.0295e+00  1.1605e+00 -6.2587e-01  1.2407e+00  2.1984e-02
  4.6103e-01  1.8348e+00 -1.7850e+00 -2.2071e+00  4.8325e-02 -2.8803e+00
 -7.7132e-01  4.1523e-01 -6.1587e-01 -1.9203e+00  2.1567e+00 -1.7404e+00
  1.4676e+00 -9.6845e-01  3.0210e+00  1.1269e+00 -8.7140e-01 -4.7627e-01
  1.0900e-01  8.8198e-01 -9.8400e-01  7.7087e-01  7.4175e-01  3.4002e+00
  2.1428e+00 -1.4911e+00 -3.0352e-02  1.0410e+00 -2.4837e+00  1.5486e+00
  1.1754e-01 -4.3461e+00  1.4178e+00  2.1990e+00 -9.5179e-01  5.5366e-01
 -2.0306e+00  1.5718e+00 -1.0287e+00  4.8324e+00 -5

**Similarities**

In [None]:
nlp = spacy.load("fr_core_news_md")

# Compare 2 documents
doc1 = nlp(notice.split(', ')[0])
doc2 = nlp(notice.split(', ')[1])
print(doc1.similarity(doc2))

# Compare 2 tokens
doc = nlp(notice)
token1 = doc[1] # Naît (doc[0] is the leading newline of the notice)
token2 = doc[12] # meurt
print(token1.similarity(token2))

# Compare 2 spans
doc = nlp(notice)
span1 = doc[29:41] # directeur de l'école rurale de la Pommière à Chêne-Bougeries
span2 = doc[53:59] # Instituteur à l'école de Plainpalais
print(span1.similarity(span2))

## Combining predictions and rules

Combining statistical predictions with rule-based systems is one of the most powerful tricks to have in an NLP toolbox.

Statistical predictions are powerful for generalizing from examples, for instance to predict whether a span of tokens is a person's name, or to find relationships between subjects and objects.
On the other hand, rule-based approaches are handy when there is a finite number of instances to find (country names, cities, ...).


**Find matchings in texts**

In [None]:
from spacy.matcher import Matcher
nlp = spacy.load('fr_core_news_md')
doc = nlp(notice)

# Define Patterns
pattern_cons_munic = [{'LOWER':'conseil'}, {'LOWER': 'municipal'}]
pattern_cons_natio = [{'LEMMA':'conseiller'}, {'LOWER': 'national'}]

# Add the Patterns
matcher = Matcher(nlp.vocab)
matcher.add('CONSEIL_MUNIC', [pattern_cons_munic])
matcher.add('CONSEIL_NATIO', [pattern_cons_natio])

# Find matchings
matchings = matcher(doc)

# Inspect matchings
for id, start, end in matchings:
    print(doc.vocab.strings[id], doc[start:end])



**Match exact strings**

`PhraseMatcher` is much more efficient than the token-based `Matcher`, but because it only finds the exact phrases it was given, it can miss variants (lower recall).

In [None]:
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('fr')
doc = nlp(notice)

matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(['Conseil Municipal', 'Conseil municipal', 'conseiller municipal', 'Conseil National', 'conseiller national']))
# patterns = [nlp(role) for role in LIST]
matcher.add('POLITICIAN', patterns)

matchings = matcher(doc)
print([doc[start:end] for match_id, start, end in matchings])
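
Instead of listing every capitalisation by hand, `PhraseMatcher` can compare another token attribute through its `attr` argument. The sketch below matches on `LOWER`; the sample sentence is a shortened version of the notice and the list of roles is an assumption.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("fr")
doc = nlp("Membre du Conseil municipal de Genève et conseiller national.")

# attr="LOWER" compares lowercase forms, so one pattern covers all casings.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
roles = ["conseil municipal", "conseiller national"]
# make_doc only tokenizes, which is all the PhraseMatcher needs here.
matcher.add("POLITICIAN", [nlp.make_doc(role) for role in roles])

found = sorted((start, doc[start:end].text) for _, start, end in matcher(doc))
print([text for _, text in found])
```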

**Get relationship between given entities**

In [None]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('fr_core_news_sm')
doc = nlp(notice)
doc.ents = [] # Reset the ones created by the pipeline

matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(['Conseil Municipal', 'Conseil municipal', 'conseiller municipal', 'Conseil National', 'conseiller national']))
matcher.add('POLITICIAN', patterns)
matchings = matcher(doc)

# Add the matches to the entities
for id, start, end in matchings:
    span = Span(doc, start, end, label="POLITIC_ROLE")
    doc.ents = list(doc.ents) + [span]

    span_root_head = span.root.head
    print(span_root_head, '-->', span.text)

print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "POLITIC_ROLE"])


# Chapter 3: Processing Pipelines

## Processing Pipelines

This section covers what can be done in a **spaCy** pipeline, and what happens behind the scenes.

When a pipeline is applied to a string, it first runs the tokenizer, followed by a series of pipeline components (parser, entity recognizer, POS tagger, ...), and returns a `Doc` object for the developer to work with.

Built-in pipeline components:
- Part-of-speech tagger: sets the `token.tag` and `token.pos` attributes
- Dependency parser: sets the `token.dep` and `token.head` attributes, as well as the sentences (`doc.sents`) and base noun phrases (`doc.noun_chunks`)
- Named entity recognizer: sets `doc.ents`, and sets token attributes (`token.ent_iob`, `token.ent_type`) that tell whether a token is part of an entity
- Text classifier: sets the `doc.cats` property (category labels that apply to the whole text)

Beware! The text classifier is not included by default in any pretrained pipeline, because categories are always very use-case specific. It can be added and trained for a new system.


A pipeline package is made of several folders, binary files, and a configuration file, `config.cfg`. The configuration defines the language and the pipeline's components, how they should be configured, applied, etc.

The list of a pipeline's components can be accessed via `nlp.pipe_names` (names only) or `nlp.pipeline` (`(name, component)` tuples).

**List pipeline components**

In [None]:
nlp = spacy.load('fr_core_news_sm')

print('Pipeline names')
print(nlp.pipe_names)

print('Pipeline labels')
print(nlp.pipe_labels)

print('Pipeline components')
print(nlp.pipeline)

## Custom pipeline components

Custom components are useful, for example, to add functionality for special needs or to update built-in attributes, like named entity spans.

In the end, a pipeline component is a callable that takes a `Doc`, modifies it, and returns it.

To create a pipeline component, add the `spacy.Language.component("component_name")` decorator to a function. The component can then be added to the pipeline with `nlp.add_pipe("component_name")`. Its position can be set with the `last`, `first`, `before` and `after` keyword arguments; by default, it is appended at the end.

**Create a new custom pipeline component**

In [None]:
@spacy.Language.component('length_component')
def length_component(doc):
    doc_length = len(doc)
    print(f'This document is {doc_length} tokens long.')
    return doc

nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('length_component', first=True)
doc = nlp(notice)
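
The positioning keywords can be tried out without a trained pipeline. In this sketch the three `demo_*` components are hypothetical pass-through functions, and the built-in `sentencizer` is only there to have an existing component to position against.

```python
import spacy
from spacy.language import Language

# Three hypothetical pass-through components, used only to show ordering.
@Language.component("demo_first")
def demo_first(doc):
    return doc

@Language.component("demo_before")
def demo_before(doc):
    return doc

@Language.component("demo_last")
def demo_last(doc):
    return doc

nlp = spacy.blank("fr")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_last")                        # default: appended at the end
nlp.add_pipe("demo_first", first=True)           # moved to the front
nlp.add_pipe("demo_before", before="demo_last")  # placed relative to another component

print(nlp.pipe_names)
```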

**More complex component**

In [None]:
nlp = spacy.load('fr_core_news_sm')
parents = ['Henri Albert', 'Anna Grosvernier']
parents_patterns = list(nlp.pipe(parents))
matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
matcher.add('PARENT', parents_patterns)

@spacy.Language.component('parent_component')
def parent_component(doc):
    matchings = matcher(doc)
    spans = [spacy.tokens.Span(doc, start, end, label="PARENT") for id, start, end in matchings]
    doc.ents = spans
    return doc

nlp.add_pipe('parent_component', after='ner')
doc = nlp(notice)
print([(ent.text, ent.label_) for ent in doc.ents])

## Extension attributes

Custom attributes on `Token`, `Doc` or `Span` objects are available through the `_` attribute (e.g. `token._.is_town`), which keeps them separate from built-in attributes.

They first have to be registered on the global class (`Token`, `Doc` or `Span`) with the `set_extension` function.

An extension can be an attribute (a variable with a default value), a property (computed by a getter, with an optional setter) or a method.

**Add an attribute extension**

In [None]:
nlp = spacy.blank('fr')
spacy.tokens.Token.set_extension("is_town", default=False)

doc = nlp(notice)
doc[5]._.is_town = True
print([(token.text, token._.is_town) for token in doc[0:10]])

**Add a property extension**

In [None]:
nlp = spacy.blank('fr')

def get_reversed(token):
    return token.text[::-1]

# Register the getter; force=True avoids an error if the cell is run more than once
spacy.tokens.Token.set_extension('reversed', getter=get_reversed, force=True)

doc = nlp(notice)
token = doc[5]
print('Token   :', token.text)
print('Reversed:', token._.reversed)

**Add a method extension**

In [None]:
nlp = spacy.blank('fr')

def to_html(span, tag):
    return f'<{tag}>{span.text}</{tag}>'

spacy.tokens.Span.set_extension('to_html', method=to_html)

doc = nlp(notice)
span = doc[4:6]
print(span._.to_html('i'))

**Get a HLS link for all persons mentioned**

In [None]:
nlp = spacy.load('fr_core_news_md')

def get_hls_findings(span):
    if span.label_ == 'PER':
        text = span.text.replace(' ', '%20')
        return f'https://hls-dhs-dss.ch/fr/search/?text={text}'

spacy.tokens.Span.set_extension('hls_url', getter=get_hls_findings)


doc = nlp(notice)
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.hls_url)

## Scaling and performance

The most efficient way to create many `Doc` objects is to use the `pipe` method, which processes the texts as a stream and in batches:

```python
docs = [nlp(text) for text in LOTS_OF_TEXTS] # BAD
docs = list(nlp.pipe(LOTS_OF_TEXTS)) # GOOD
```

To pass additional information along with each text, `nlp.pipe` accepts `(text, context)` tuples when the `as_tuples` option is `True`; it then yields `(doc, context)` tuples, so the metadata can be attached to the docs.
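
A minimal sketch of the `as_tuples` option, assuming each notice comes with an identifier. The `notice_id` extension and the sample data are hypothetical.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("fr")

# Hypothetical extension used to keep the metadata on each Doc.
Doc.set_extension("notice_id", default=None, force=True)

data = [
    ("Naît le 31.5.1877 à Choindez.", {"notice_id": 1}),
    ("Meurt le 31.5.1931 à Genève.", {"notice_id": 2}),
]

docs = []
# With as_tuples=True, nlp.pipe yields (doc, context) pairs.
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.notice_id = context["notice_id"]
    docs.append(doc)

print([(doc._.notice_id, doc[0].text) for doc in docs])
```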

Sometimes only a tokenized `Doc` is needed. Running the full pipeline would waste CPU time, so in that case it is better to call `nlp.make_doc`, which only runs the tokenizer.

Likewise, pipeline components can be temporarily enabled or disabled for specific uses:

```python
with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp(text)
```
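
The difference between `nlp.make_doc`, a full `nlp` call and `nlp.select_pipes` can be observed with a component that leaves a trace. In this sketch, the `mark_processed_demo` component and the `was_processed` extension are hypothetical names introduced for the demonstration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical flag that records whether the component ran on a Doc.
Doc.set_extension("was_processed", default=False, force=True)

@Language.component("mark_processed_demo")
def mark_processed_demo(doc):
    doc._.was_processed = True
    return doc

nlp = spacy.blank("fr")
nlp.add_pipe("mark_processed_demo")
text = "Instituteur à Genève en 1908."

tokenized_only = nlp.make_doc(text)   # tokenizer only, components are skipped
fully_processed = nlp(text)           # tokenizer + all components

# Inside the block the component is disabled; it is restored afterwards.
with nlp.select_pipes(disable=["mark_processed_demo"]):
    active_inside = list(nlp.pipe_names)
    skipped = nlp(text)

print(tokenized_only._.was_processed, fully_processed._.was_processed)
print(active_inside, nlp.pipe_names)
```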

# Chapter 4: Training a neural network model

## Training and updating model

To update an existing model, a few hundred to a few thousand examples are needed.
To train a new category, it may take a few thousand to a million examples.

To illustrate, the **spaCy** English pipelines were trained on more than 2 million words labelled with POS tags, dependencies, and named entities.

As usual in machine learning, a separate evaluation (testing) dataset is also needed, to check how well the model learns.

The training and evaluation datasets have to be `Doc` objects annotated the way the model should produce them (`Doc` objects with `ents`, `pos`, ... attributes).

To increase performance, both datasets can be stored as binary files using the `DocBin` object (the `.spacy` extension is used for those files). This is more efficient and creates smaller files than the pickle format.
More on that [here](https://spacy.io/api/docbin).

**Create training/testing data**

In [9]:
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

nlp = spacy.blank('fr')
matcher = Matcher(nlp.vocab)

pattern1 = [{'LOWER': 'henri'}, {'LOWER': 'albert'}]
pattern2 = [{'LOWER': 'anna'}, {'LOWER': 'grosvernier'}]
matcher.add('PARENT', [pattern1, pattern2])
docs = []
for doc in nlp.pipe([notice]): # Here we simulate that we have multiple notices
    matchings = matcher(doc)
    # Entities need a label for the NER component to learn from them
    spans = [Span(doc, start, end, label='PARENT') for id, start, end in matchings]
    print(spans)
    doc.ents = spans
    docs.append(doc)

doc_bin = DocBin(docs=docs)
doc_bin.to_disk('./train.spacy')

[Henri Albert, Anna Grosvernier]
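
To check what was actually written to a `.spacy` file, the `DocBin` can be read back. The following is a self-contained sketch that writes to a temporary directory rather than `./train.spacy`; the `PARENT` label and the one-sentence text are assumptions for the example.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("fr")
text = "Fils d'Henri Albert et d'Anna Grosvernier."
doc = nlp(text)

# Build labelled entity spans from character offsets.
spans = []
for name in ["Henri Albert", "Anna Grosvernier"]:
    start = text.index(name)
    spans.append(doc.char_span(start, start + len(name), label="PARENT"))
doc.ents = spans

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[doc]).to_disk(path)
    # Reading back requires a vocab to rebuild the Doc objects.
    reloaded = list(DocBin().from_disk(path).get_docs(nlp.vocab))

loaded_ents = [(ent.text, ent.label_) for ent in reloaded[0].ents]
print(loaded_ents)
```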


## Configuration & train

As mentioned earlier, the training configuration **has to** be set in a configuration file (`config.cfg`).

This configuration file is the *single source of truth* for all **spaCy** settings: how the `nlp` object is created, the list of components and their internal model configuration, and all the training parameters such as how to load data, the training hyperparameters, ...

Configuration files do not need to be written by hand: **spaCy** can generate them automatically (see [here](https://spacy.io/usage/training#quickstart) and [here](https://spacy.io/api/cli#init-config) for more info).

Once a pipeline is trained, it can be loaded like any other **spaCy** pipeline with `spacy.load(pipeline_name)`.

A pipeline can also be packaged with the [spacy package command](https://spacy.io/api/cli#package), which eases deployment and loading.
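
`spacy.load` also accepts a path to a pipeline directory, which is how a freshly trained pipeline (for instance the `model-best` folder that `spacy train` writes into its output directory) would typically be loaded. As no trained pipeline is available here, this sketch saves a blank pipeline with a `sentencizer` to a temporary directory as a stand-in.

```python
import tempfile

import spacy

# Stand-in for a trained pipeline: a blank French pipeline with one component.
nlp = spacy.blank("fr")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)            # writes config.cfg and the component data
    reloaded = spacy.load(tmp_dir)  # loading from a directory path

print(reloaded.lang, reloaded.pipe_names)

doc = reloaded("Célibataire. Etudes au collège.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```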

**Generate a configuration file**

```bash
# Generate a config file for French, with only one component: Named Entity Recognition
python -m spacy init config ./config.cfg --lang fr --pipeline ner

# Inspect the generated config
cat ./config.cfg
```

**Train using the CLI**

```bash
python -m spacy train ./config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./test.spacy
```

## Training best practices

It is normal for training not to go as expected on the first try. It is an iterative process in which things need to be tested.

- Training data should also include examples of what the model already predicts correctly, to avoid the "catastrophic forgetting" problem. In the end, the training data mixes those examples with one's own new data.
- The model makes predictions from local context, so distinctions that context alone cannot reveal are hard to learn: it will be difficult to learn to distinguish adult clothing from children's clothing, for example. It is better to have generic labels.


To create training data (i.e. annotate texts), dedicated tools should be used: [Brat](http://brat.nlplab.org/) (an open-source solution) or [Prodigy](https://prodi.gy/) (which integrates with **spaCy**).

# Conclusion

What has been covered in this notebook:
- Extract linguistic features: part-of-speech tags, dependencies, named entities
- Work with trained pipelines
- Find words and phrases using Matcher and PhraseMatcher match rules
- Best practices for working with the data structures Doc, Token, Span, Vocab, Lexeme
- Find semantic similarities using word vectors
- Write custom pipeline components with extension attributes
- Scale up your spaCy pipelines and make them fast
- Create training data for spaCy's statistical models
- Train and update spaCy's neural network models with new data