[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camillepradel/simplon-ner/blob/main/fine_tune_ner.ipynb)

## Setup

In [1]:
!pip install spacy==3.0.6

Collecting spacy==3.0.6
  Downloading https://files.pythonhosted.org/packages/1b/d8/0361bbaf7a1ff56b44dca04dace54c82d63dad7475b7d25ea1baefafafb2/spacy-3.0.6-cp37-cp37m-manylinux2014_x86_64.whl (12.8MB)
Collecting thinc<8.1.0,>=8.0.3
  Downloading https://files.pythonhosted.org/packages/61/87/decceba68a0c6ca356ddcb6aea8b2500e71d9bc187f148aae19b747b7d3c/thinc-8.0.3-cp37-cp37m-manylinux2014_x86_64.whl (1.1MB)
Collecting typer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/90/34/d138832f6945432c638f32137e6c79a3b682f06a63c488dcfaca6b166c64/typer-0.3.2-py3-none-any.whl
Collecting catalogue<2.1.0,>=2.0.3
  Downloading https://files.pythonhosted.org/packages/9c/10/dbc1203a4b1367c7b02fddf08cb2981d9aa3e688d398f587cea0ab9e3bec/catalogue-2.0.4-py3-none-any.whl
Collecting spacy-legacy<3.1.0,>=3.0.4
  Downloading https://files.pythonhosted.org/p

In [2]:
!python -m spacy download fr_core_news_sm

2021-05-01 22:35:17.072107: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Collecting fr-core-news-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.0.0/fr_core_news_sm-3.0.0-py3-none-any.whl (17.2MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [3]:
!git clone https://github.com/camillepradel/simplon-ner.git
%cd simplon-ner

Cloning into 'simplon-ner'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
/content/simplon-ner


## Training data
The file `./train.iob` contains the annotated data that we will use to fine-tune the built-in `fr_core_news_sm` model.

In [4]:
!tail ./train.iob

Le	O
Président	O
de	O
Nantes	B-ORG
Métropole	I-ORG
de	O
Nantes	O
,	O
Jean-Marc	O
AYRAULT	O
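
To make the format concrete, here is a minimal sketch of how such a file can be read back into entity spans. It assumes one token and one tag per line separated by whitespace, with blank lines between sentences; the helper names `read_iob` and `extract_entities` are hypothetical and not part of the repository or of spaCy.

```python
def read_iob(text):
    """Split token-per-line IOB text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # assumption: a blank line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Return (entity text, label) pairs found in one tagged sentence."""
    entities, tokens, label = [], [], None
    for token, tag in sentence:
        if tag.startswith("B-"):
            # a B- tag closes any open entity and starts a new one
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            # "O" (or a stray I- tag with no open entity) closes the entity
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# the ten lines printed by `tail ./train.iob` above
sample = "\n".join([
    "Le\tO", "Président\tO", "de\tO", "Nantes\tB-ORG", "Métropole\tI-ORG",
    "de\tO", "Nantes\tO", ",\tO", "Jean-Marc\tO", "AYRAULT\tO",
])
sentences = read_iob(sample)
for sentence in sentences:
    print(extract_entities(sentence))
```

On this sample the only annotated entity is `Nantes Métropole`, labelled `ORG`; the second `Nantes` and the person name are tagged `O`.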


The objective of this work is to add more annotated data to this file so that fine-tuning in the following steps yields better results.

To do this, we can either gather unannotated data and annotate it manually, which is cumbersome and time-consuming, or find smart ways to generate similar annotated data.

One approach is to parse web pages in which entities are formally identified. Some websites that can be exploited:
* http://www.trendeo.net/blog/ 
* https://www.digital113.fr/presentation-du-cluster/nos-adherents/
* https://www.occitanie-emploi.fr/categorie-poste/industrie/ 
* https://www.fusacq.com/buzz/fr 
* https://www.boursedirect.fr/fr/actualites/flux/entreprises  
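
Once entity names have been collected from such pages, they can be projected onto raw sentences to produce IOB lines. The sketch below is only an illustration under strong assumptions: the sentence is already tokenized, entity names are split on whitespace, matching is exact and case-sensitive, and every match gets the same label. The helpers `tag_known_entities` and `to_iob` are hypothetical names, and no scraping is shown.

```python
def tag_known_entities(tokens, entity_names, label="ORG"):
    """Tag every exact occurrence of a known entity name with B-/I- tags."""
    tags = ["O"] * len(tokens)
    # try longer names first so a multi-token name wins over its prefix
    names = sorted((name.split() for name in entity_names), key=len, reverse=True)
    i = 0
    while i < len(tokens):
        for name in names:
            n = len(name)
            if n and tokens[i:i + n] == name:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n  # skip past the matched entity
                break
        else:
            i += 1  # no known name starts at this token
    return list(zip(tokens, tags))


def to_iob(pairs):
    """Serialize (token, tag) pairs as tab-separated token-per-line text."""
    return "\n".join(f"{token}\t{tag}" for token, tag in pairs)


tokens = ["Le", "Président", "de", "Nantes", "Métropole",
          "de", "Nantes", ",", "Jean-Marc", "AYRAULT"]
pairs = tag_known_entities(tokens, ["Nantes Métropole"])
print(to_iob(pairs))
```

With `Nantes Métropole` as the only known name, this reproduces the tagging shown in the `tail` output above. Exact matching will miss variants and can mislabel ambiguous names, so generated data should still be spot-checked by hand.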


## Preprocessing

In [5]:
# convert IOB train and eval files to spacy binary format
!python -m spacy convert ./train.iob ./ -t spacy -n 1 -c iob
!python -m spacy convert ./eval.iob ./ -t spacy -n 1 -c iob

2021-05-01 22:35:29.033121: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (41 documents): train.spacy
2021-05-01 22:35:33.377215: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (265 documents): eval.spacy
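
The warning above refers to the `-n` option, which controls how many sentences end up in each document. As a rough illustration, and assuming the converter simply groups consecutive sentences in order, the effect on the document count can be sketched as follows (`group_sentences` is a hypothetical helper, not a spaCy function):

```python
def group_sentences(sentences, n):
    """Group consecutive sentences into documents of at most n sentences."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sentences[i:i + n] for i in range(0, len(sentences), n)]


# 41 placeholder sentences, matching the 41 documents reported for train.iob
sentences = [f"sentence {i}" for i in range(41)]
docs_n1 = group_sentences(sentences, 1)
docs_n10 = group_sentences(sentences, 10)
print(len(docs_n1), len(docs_n10))
```

With `-n 1` each sentence becomes its own document, which is why the document counts reported above equal the number of sentences in each file.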


## Fine-tuning

In [6]:
!python -m spacy train config.cfg  --paths.train ./train.spacy --paths.dev ./eval.spacy --output ./trained_model

2021-05-01 22:35:38.293236: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Created output directory: trained_model
ℹ Using CPU
[2021-05-01 22:35:48,575] [INFO] Set up nlp object from config
[2021-05-01 22:35:48,590] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-05-01 22:35:48,590] [INFO] Resuming training for: ['ner', 'tok2vec']
[2021-05-01 22:35:48,604] [INFO] Created vocabulary
[2021-05-01 22:35:48,605] [INFO] Finished initializing nlp object
[2021-05-01 22:35:48,605] [INFO] Initialized pipeline components: []
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00      6.34   37.74   31.97   46.06    0.38
 11     200          0.00     62.68   59.67   62.50 

## Evaluation

In [7]:
# evaluate our fine-tuned model
!python -m spacy evaluate ./trained_model/model-best/ ./eval.spacy

2021-05-01 22:40:08.491823: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Using CPU

TOK     -    
NER P   64.78
NER R   58.66
NER F   61.57
SPEED   10848

           P       R       F
ORG    78.01   58.66   66.97
PER     0.00    0.00    0.00
LOC     0.00    0.00    0.00
MISC    0.00    0.00    0.00



In [8]:
# compare with built-in fr_core_news_sm model
!python -m spacy evaluate fr_core_news_sm ./eval.spacy

2021-05-01 22:40:20.181739: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Using CPU

TOK      -    
TAG      0.00 
POS      -    
MORPH    -    
LEMMA    -    
UAS      -    
LAS      -    
NER P    32.05
NER R    46.06
NER F    37.80
SENT P   64.90
SENT R   83.02
SENT F   72.85
SPEED    2755 

           P       R       F
MISC    0.00    0.00    0.00
ORG    82.98   46.06   59.24
PER     0.00    0.00    0.00
LOC     0.00    0.00    0.00
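
To compare the two runs, it helps to recall that the F column is the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes it from the P and R values printed above; the `f_score` helper and the dictionary keys are illustrative names only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the two evaluation outputs above
reported = {
    "fine-tuned, overall": (64.78, 58.66, 61.57),
    "fine-tuned, ORG": (78.01, 58.66, 66.97),
    "fr_core_news_sm, overall": (32.05, 46.06, 37.80),
    "fr_core_news_sm, ORG": (82.98, 46.06, 59.24),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F = {f_score(p, r):.2f}, reported F = {f:.2f}")
```

The recomputed values agree with the reported ones to two decimals. Overall F rises from 37.80 with the built-in model to 61.57 after fine-tuning, and ORG F rises from 59.24 to 66.97.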

