#### Advanced NLP with spaCy
#### Chapter 4: Training a neural network model
##### Training and updating models

Training is typically used to get better results in a specific domain.  
It is essential for text classification.  
It is very useful for named entity recognition.  
It is less critical for part-of-speech tagging and dependency parsing.

Training steps:
1. Initialize weights randomly
1. Predict examples with current weights
1. Compare prediction with correct labels
1. Calculate how to change weights to improve predictions
1. Update weights slightly
1. Repeat for appropriate epochs
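
The steps above can be sketched with a toy model. This is not spaCy's training code: it fits a one-feature logistic regression in plain Python, and the data, learning rate and number of epochs are made-up values chosen only for illustration.

```python
import math
import random

# Toy data (hypothetical): one feature x, label 1 if x > 0 else 0
examples = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

random.seed(0)
# 1. Initialize weights randomly
w = random.uniform(-0.1, 0.1)
b = random.uniform(-0.1, 0.1)
learn_rate = 0.5  # assumed value, not a spaCy default


def predict(x, w, b):
    """Probability that x belongs to class 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def mean_loss(w, b):
    """Average cross-entropy loss over all examples."""
    total = 0.0
    for x, y in examples:
        p = predict(x, w, b)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(examples)


initial_loss = mean_loss(w, b)

# 6. Repeat for a number of epochs
for epoch in range(100):
    for x, y in examples:
        p = predict(x, w, b)      # 2. Predict with current weights
        error = p - y             # 3. Compare prediction with the correct label
        grad_w = error * x        # 4. Calculate how to change the weights
        grad_b = error
        w -= learn_rate * grad_w  # 5. Update weights slightly
        b -= learn_rate * grad_b

final_loss = mean_loss(w, b)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The loss should go down as the loop repeats, which is the same signal the LOSS columns give in spaCy's training output.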

spaCy supports both updating (fine-tuning) existing models and training new models from scratch.

Labeling entities:
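
A minimal sketch of labeling entities by hand, assuming a blank English pipeline. The example texts and the GADGET label are the ones used later in this chapter; the token indices are specific to these sentences.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # blank English pipeline: tokenizer only

# Positive example: tokens 0 and 1 ("iPhone X") form a GADGET entity
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]  # end index is exclusive

# Negative example: a text without entities is useful training data too
doc2 = nlp("I need a new phone! Any tips?")
doc2.ents = []

print([(ent.text, ent.label_) for ent in doc1.ents])
print(doc2.ents)
```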

Use the DocBin object to serialize the training and evaluation data to disk (faster than pickle):
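
A minimal sketch of a DocBin round trip. The file name train.spacy and the use of a temporary directory are assumptions made so the example cleans up after itself; in practice you would write to your project's data folder.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"  # hypothetical file name
    DocBin(docs=[doc]).to_disk(path)      # serialize to a binary .spacy file

    # Read the file back; the vocab is needed to rebuild the Doc objects
    loaded_docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print([(ent.text, ent.label_) for ent in loaded_docs[0].ents])
```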

Some common formats are CoNLL (.conll, .conllu) and IOB (.iob).  
spaCy provides a conversion command for them:
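
The command in question is `python -m spacy convert`, which takes an input file and an output directory. Below is a sketch of calling it from Python; the file name train.gold.conll and the corpus directory are placeholders, and the command only runs if the input file actually exists. In a notebook cell the same command can be run directly with a leading `!`.

```python
import subprocess
import sys
from pathlib import Path

input_file = Path("./train.gold.conll")  # hypothetical input file
output_dir = Path("./corpus")            # hypothetical output directory

# Same as running: python -m spacy convert train.gold.conll corpus
command = [sys.executable, "-m", "spacy", "convert", str(input_file), str(output_dir)]

if input_file.exists():
    output_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, check=True)
else:
    print("Nothing to convert; would run:", " ".join(command))
```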

This command also works for spaCy's old JSON training format from v2.

##### Creating training data

In [11]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

TEXTS = [
  "How to preorder the iPhone X",
  "iPhone X is coming",
  "Should I pay $1,000 for the iPhone X?",
  "The iPhone 8 reviews are here",
  "iPhone 11 vs iPhone 8: What's the difference?",
  "I need a new phone! Any tips?"
]

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

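# Match "iphone x" and "iphone <digits>", ignoring case
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    # Turn each match into a Span labeled with the match ID ("GADGET")
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    # Set the matched spans as the doc's entities
    doc.ents = spans
    docs.append(doc)

# Collect the annotated docs so they can be saved as training data
doc_bin = DocBin(docs=docs)
# doc_bin.to_disk("./train.spacy")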

[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]


##### Configuring and running the training

spaCy uses a config file, usually named config.cfg, as the single source of truth for initializing the nlp object. It defines:
- the pipeline components and their models
- the training process and hyperparameters

Example config (the @ sign marks a reference to a registered Python function):
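
A sketch of what such a config looks like and how it parses. The excerpt is cut down from the generated config shown under "Generating a config file" and is too incomplete to train with. Reading it with `Config` from thinc (installed alongside spaCy) is only a way to inspect the structure here, not a step the training workflow requires.

```python
from thinc.api import Config

# Excerpt only: a real config.cfg has many more sections and settings
CONFIG_EXCERPT = """
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
maxout_pieces = 2
"""

config = Config().from_str(CONFIG_EXCERPT)

# Dotted section names become nested dictionaries
print(config["nlp"]["pipeline"])
# The @architectures key names the registered function that builds the model
print(config["components"]["ner"]["model"]["@architectures"])
```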

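To create a config file, use the init config command (see "Generating a config file" below) or, interactively, the quickstart widget in the spaCy documentation.

To train a model from .spacy files, use the train command with the config file and the paths to the training and evaluation data (see "Using the training CLI" below, which also shows example output).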

Once a model is trained, load it the same way as the default models:
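
A minimal sketch of loading a pipeline from a directory with `spacy.load`. To keep the example self-contained, a blank pipeline saved to a temporary directory stands in for the trained model; with the training run shown later in this chapter you would instead pass the saved pipeline's path, typically the model-best folder inside the output directory.

```python
import tempfile

import spacy

with tempfile.TemporaryDirectory() as model_dir:
    # Stand-in for a trained pipeline directory (assumption for this demo)
    spacy.blank("en").to_disk(model_dir)

    # spacy.load accepts a path just as it accepts an installed package name
    nlp = spacy.load(model_dir)
    doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")

print(nlp.pipe_names)
print([token.text for token in doc[:2]])
```

With a trained NER pipeline, doc.ents would then contain the predicted entities; the blank stand-in has no components, so it only tokenizes.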

##### Generating a config file

In [2]:
!python -m spacy init config --help

                                                                               
 Usage: python -m spacy init config [OPTIONS] OUTPUT_FILE                      
                                                                               
 Generate a starter config file for training. Based on your requirements       
 specified via the CLI arguments, this command generates a config with the     
 optimal settings for your use case. This includes the choice of architecture, 
 pretrained weights and related hyperparameters.                               
 DOCS: https://spacy.io/api/cli#init-config                                    
                                                                               
+- Arguments -----------------------------------------------------------------+
| *    output_file      PATH  File to save the config to or - for stdout      |
|                             (will only output config and no additional      |
|                             logging in

In [9]:
!python -m spacy init config .\config.cfg --lang en --pipeline ner

[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [10]:
!type config.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = 

##### Using the training CLI

In [13]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True' # Fix for OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.

In [14]:
!python -m spacy train .\training\config_gadget.cfg --output .\output --paths.train .\training\train_gadget.spacy --paths.dev .\training\dev_gadget.spacy

[i] Saving to output directory: output
[i] Using CPU
[i] To switch to GPU 0, use the option: --gpu-id 0
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     20.33    1.69    1.04    4.44    0.02
  1     200         29.68    982.54   78.69   77.42   80.00    0.79
  2     400         73.66    242.14   80.00   76.00   84.44    0.80
  4     600         65.99    107.64   81.56   82.02   81.11    0.82
  6     800         66.82     75.03   80.23   81.61   78.89    0.80
  9    1000        108.99     73.20   86.49   84.21   88.89    0.86
 12    1200         39.55     26.12   87.91   86.96   88.89    0.88
 16    1400         78.52     39.19   85.56   85.56   85.56    0.86
 22    1600         63.65     17.61   82.98   79.59   86.67    0.83
 28    1800        138.04     37.72   84.32   82.1

##### Best practices for training

- When updating a model with new categories, mix in examples it previously predicted correctly, so that it does not overfit to the new data or forget what it already learned
    - An existing model that works well can be used to label data from your domain; combine those examples with the data for the new category
- Models struggle to learn a category if the prediction cannot be made from the local context
    - Keep category labels broad rather than too specific