# Train spaCy models on synthesised training data

The config files for each training run were set up following the spaCy training documentation: <https://spacy.io/usage/training#config>.

To populate our config files, we referred to the config files of the models from <https://github.com/nlpbook/nlpbook>
(obtained by downloading the ag-dataset models).

The runs vary the optimizer and, in a few cases, the learning rate. Advice on how to configure the optimizers
in these spaCy config files comes from the Thinc optimizer documentation: <https://thinc.ai/docs/api-optimizers>
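
As a rough illustration of what changes between runs, the sketch below renders the `[training.optimizer]` section of a config file for a given optimizer and learning rate. The registered names (`Adam.v1`, `RAdam.v1`, `SGD.v1`) are taken from the Thinc documentation; the helper `optimizer_block` and the `OPTIMIZERS` mapping are hypothetical and are not part of spaCy or Thinc. The actual config files used here may set further optimizer options.

```python
# Hypothetical helper: renders only the optimizer section of a spaCy config.
# The registered optimizer names come from the Thinc optimizer docs.
OPTIMIZERS = {"adam": "Adam.v1", "radam": "RAdam.v1", "sgd": "SGD.v1"}


def optimizer_block(name: str, learn_rate: float) -> str:
    """Return the text of a [training.optimizer] config section."""
    registered = OPTIMIZERS[name.lower()]
    return "\n".join(
        [
            "[training.optimizer]",
            f'@optimizers = "{registered}"',
            f"learn_rate = {learn_rate}",
        ]
    )


print(optimizer_block("radam", 0.0001))
```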

To package the completed models, we followed: <https://towardsdatascience.com/drugs-ner-using-spacy-in-python-f1f3091f8f4e>

## 1 Setup

In [35]:
from pathlib import Path

In [36]:
# Overall location
data_location = "../../data/processed/converted_train_test_data_for_ner/"

# Paths to train, dev and test data
train_data_path = Path(data_location, "train/train.spacy")
dev_data_path = Path(data_location, "dev/dev.spacy")
test_data_path = Path(data_location, "test/test.spacy")

# Path to output and results
output_path = Path(data_location, "output")
output_path.mkdir(parents=True, exist_ok=True)

results_path = Path(data_location, "results")
results_path.mkdir(parents=True, exist_ok=True)
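
Since every run below depends on the three `.spacy` files, it may be worth checking that they exist before starting a long training job. A minimal sketch, assuming a hypothetical helper `missing_inputs`; the demo uses a temporary directory so that it runs without the project data.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_inputs(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Demo with throwaway files: one file is created, the other is not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp, "train.spacy")
    present.touch()
    absent = Path(tmp, "dev.spacy")
    demo_missing = missing_inputs([present, absent])
    print([p.name for p in demo_missing])
```

In this notebook the call would be `missing_inputs([train_data_path, dev_data_path, test_data_path])`, and an empty list means all inputs are in place.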

## 2 Model training and testing: first round

### 2.1 ADAM, LR 0.0001

In [37]:
# Specific arguments for each model
# Train information
model_desc = "adam_lr_0001"
training_config_path__adam__lr0001 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.0001
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)
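
The same block of path definitions is repeated for every model in this notebook. One possible way to avoid the repetition is a small helper that derives all per-run paths from the model description. This is only a sketch: `RunPaths` and `make_run_paths` are hypothetical names, and the directory layout is assumed to match the cells in this notebook.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunPaths:
    config: Path        # training config for this run
    output_model: Path  # directory passed to --output
    best_model: Path    # model-best directory written by spacy train
    results: Path       # directory for evaluation output


def make_run_paths(data_location: str, model_desc: str, create_dirs: bool = False) -> RunPaths:
    """Derive all per-run paths from the data location and model description."""
    root = Path(data_location)
    output_model = root / "output" / model_desc
    paths = RunPaths(
        config=root / "training_config" / f"config__{model_desc}.cfg",
        output_model=output_model,
        best_model=output_model / "model-best",
        results=root / "results" / model_desc,
    )
    if create_dirs:
        paths.output_model.mkdir(parents=True, exist_ok=True)
        paths.results.mkdir(parents=True, exist_ok=True)
    return paths


run = make_run_paths("../../data/processed/converted_train_test_data_for_ner/", "adam_lr_0001")
print(run.config)
```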

#### 2.1.1 Training

In [38]:
!python -m spacy train $training_config_path__adam__lr0001 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 2000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 01:39:36,766] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 01:39:37,705] [INFO] Set up nlp object from config
[2021-12-12 01:39:37,715] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 01:39:37,716] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 01:39:37,716] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 01:39:37,721] [INFO] Created vocabulary
[2021-12-12 01:39:37,721] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 01:41:35,110] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 01:41:35,123] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 01:41:35,125] [DEBUG]
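
The training command above relies on IPython's `!` syntax and `$variable` expansion. Outside a notebook, the same command could be assembled as an argument list and passed to `subprocess.run`. The sketch below only builds and prints the list; `build_train_command` is a hypothetical helper, and the flags are the ones used in the cell above.

```python
import sys
from pathlib import Path


def build_train_command(config_path, output_dir, train_path, dev_path,
                        max_epochs, max_steps, learn_rate):
    """Assemble the `spacy train` call from the cell above as an argument list."""
    return [
        sys.executable, "-m", "spacy", "train", str(config_path),
        "--output", str(output_dir),
        "--paths.train", str(train_path),
        "--paths.dev", str(dev_path),
        "--training.max_epochs", str(max_epochs),
        "--training.max_steps", str(max_steps),
        "--training.optimizer.learn_rate", str(learn_rate),
        "--verbose",
    ]


cmd = build_train_command(
    Path("training_config/config__adam_lr_0001.cfg"),
    Path("output/adam_lr_0001"),
    Path("train/train.spacy"),
    Path("dev/dev.spacy"),
    max_epochs=10,
    max_steps=2000,
    learn_rate=0.0001,
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```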

#### 2.1.2 Testing

In [39]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   3.77
NER R   3.70
NER F   3.74
SPEED   3729

                    P      R      F
scientific       7.41   8.33   7.84
common           0.00   0.00   0.00
pharmaceutical   0.00   0.00   0.00

✔ Generated 25 parses as HTML
../../data/results/adam_lr_0001


### 2.2 SGD, LR 0.0001

In [40]:
# Specific arguments for each model
# Train information
model_desc = "sgd_lr_0001"
training_config_path__sgd__lr0001 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.001  # NOTE: differs from the 0.0001 given in the section title and model_desc
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)

#### 2.2.1 Training

In [41]:
!python -m spacy train $training_config_path__sgd__lr0001 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 2000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 03:16:12,270] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 03:16:13,179] [INFO] Set up nlp object from config
[2021-12-12 03:16:13,188] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 03:16:13,189] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 03:16:13,190] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 03:16:13,194] [INFO] Created vocabulary
[2021-12-12 03:16:13,194] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 03:18:12,611] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 03:18:12,620] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 03:18:12,621] [DEBUG]

#### 2.2.2 Testing

In [42]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   0.00
NER R   0.00
NER F   0.00
SPEED   5100

                P      R      F
scientific   0.00   0.00   0.00
common       0.00   0.00   0.00

✔ Generated 25 parses as HTML
../../data/results/sgd_lr_0001


### 2.3 RADAM, LR 0.0001

In [52]:
# Specific arguments for each model
# Train information
model_desc = "radam_lr_0001"
training_config_path__radam__lr0001 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.0001
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)

#### 2.3.1 Training

In [53]:
!python -m spacy train $training_config_path__radam__lr0001 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 2000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 11:37:43,882] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 11:37:44,908] [INFO] Set up nlp object from config
[2021-12-12 11:37:44,918] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 11:37:44,919] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 11:37:44,919] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 11:37:44,923] [INFO] Created vocabulary
[2021-12-12 11:37:44,923] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 11:39:46,792] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 11:39:46,805] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 11:39:46,806] [DEBUG]

#### 2.3.2 Testing

In [54]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   21.74
NER R   18.52
NER F   20.00
SPEED   5036

                 P       R       F
scientific   29.63   33.33   31.37
common       10.53    6.67    8.16

✔ Generated 25 parses as HTML
../../data/results/radam_lr_0001
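
The NER F value in these reports is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the rounded P and R reported above for this run; `f_score` is a hypothetical helper, and small differences from the reported value can arise because spaCy computes F from unrounded counts.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded P and R reported for radam_lr_0001 above.
print(round(f_score(21.74, 18.52), 2))
```

The zero check matters here because several runs in this notebook report P = R = 0.00.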


## 3 Model training and testing: second round

### 3.1 ADAM, LR 0.00001

In [58]:
# Specific arguments for each model
# Train information
model_desc = "adam_lr_00001"
training_config_path__adam__lr00001 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.00001
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)

#### 3.1.1 Training

In [59]:
!python -m spacy train $training_config_path__adam__lr00001 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 2000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 15:13:03,522] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 15:13:04,508] [INFO] Set up nlp object from config
[2021-12-12 15:13:04,518] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 15:13:04,519] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 15:13:04,519] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 15:13:04,522] [INFO] Created vocabulary
[2021-12-12 15:13:04,522] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 15:14:56,439] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 15:14:56,450] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 15:14:56,451] [DEBUG]

#### 3.1.2 Testing

In [60]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   0.00
NER R   0.00
NER F   0.00
SPEED   3988

                P      R      F
scientific   0.00   0.00   0.00
common       0.00   0.00   0.00

✔ Generated 25 parses as HTML
../../data/results/adam_lr_00001


### 3.2 RADAM, LR 0.00001

In [55]:
# Specific arguments for each model
# Train information
model_desc = "radam_lr_00001"
training_config_path__radam__lr00001 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.00001
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)

#### 3.2.1 Training

In [56]:
!python -m spacy train $training_config_path__radam__lr00001 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 2000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 13:22:19,772] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 13:22:20,578] [INFO] Set up nlp object from config
[2021-12-12 13:22:20,587] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 13:22:20,588] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 13:22:20,588] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 13:22:20,592] [INFO] Created vocabulary
[2021-12-12 13:22:20,592] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 13:24:04,432] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 13:24:04,443] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 13:24:04,445] [DEBUG]

#### 3.2.2 Testing

In [57]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   0.00
NER R   0.00
NER F   0.00
SPEED   4929

                P      R      F
common       0.00   0.00   0.00
scientific   0.00   0.00   0.00

✔ Generated 25 parses as HTML
../../data/results/radam_lr_00001


## 4 Model training and testing: third round

### 4.1 ADAM, LR 0.00005

Training settings for this round: patience = 2000, max_epochs = 10, max_steps = 3000, eval_frequency = 300.
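
The patience, max_steps and eval_frequency settings above interact: the model is scored every eval_frequency steps, and training stops early once patience steps have passed without a new best score, or when max_steps is reached. The sketch below is a simplified model of that rule, not spaCy's actual training loop; `stopping_step` and the example score sequence are hypothetical.

```python
from typing import Sequence


def stopping_step(scores: Sequence[float], eval_frequency: int,
                  patience: int, max_steps: int) -> int:
    """Return the step at which training stops under a simple patience rule.

    scores[i] is the dev score at the (i + 1)-th evaluation.
    """
    best_score = float("-inf")
    best_step = 0
    step = 0
    for score in scores:
        step += eval_frequency
        if step > max_steps:
            return max_steps
        if score > best_score:
            best_score, best_step = score, step
        elif step - best_step >= patience:
            # No improvement for at least `patience` steps: stop early.
            return step
    return min(step, max_steps)


# Made-up scores: the best score occurs at step 600, then no improvement.
flat_after_600 = [0.1, 0.2] + [0.0] * 8
print(stopping_step(flat_after_600, eval_frequency=300, patience=2000, max_steps=3000))
```

With these settings there are at most 3000 / 300 = 10 evaluations, so a patience of 2000 steps only triggers if the best score comes within the first few evaluations.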

In [61]:
# Specific arguments for each model
# Train information
model_desc = "adam_lr_00005"
training_config_path__adam__lr00005 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.00005
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)

#### 4.1.1 Training

In [62]:
!python -m spacy train $training_config_path__adam__lr00005 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 3000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 17:09:20,031] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 17:09:20,755] [INFO] Set up nlp object from config
[2021-12-12 17:09:20,768] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 17:09:20,769] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 17:09:20,769] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 17:09:20,772] [INFO] Created vocabulary
[2021-12-12 17:09:20,773] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 17:11:09,844] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 17:11:09,856] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 17:11:09,857] [DEBUG]

#### 4.1.2 Testing

In [63]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   14.29
NER R   14.81
NER F   14.55
SPEED   3374

                     P       R       F
scientific       20.00   25.00   22.22
common            8.33    6.67    7.41
pharmaceutical    0.00    0.00    0.00

✔ Generated 25 parses as HTML
../../data/results/adam_lr_00005


### 4.3 RADAM, LR 0.00005

In [64]:
# Specific arguments for each model
# Train information
model_desc = "radam_lr_00005"
training_config_path__radam__lr00005 = Path(data_location, f"training_config/config__{model_desc}.cfg")
output_model_path = Path(output_path, model_desc)
output_model_path.mkdir(parents=True, exist_ok=True)
num_max_epochs = 10
learning_rate = 0.00005
# gpu_id = -1 # use CPU. This is the default.

# Test information
best_model_path = Path(output_model_path, "model-best")
best_model_results_path = Path(results_path, model_desc)
best_model_results_path.mkdir(parents=True, exist_ok=True)

#### 4.3.1 Training

In [65]:
!python -m spacy train $training_config_path__radam__lr00005 \
    --output $output_model_path \
    --paths.train $train_data_path \
    --paths.dev $dev_data_path \
    --training.max_epochs $num_max_epochs \
    --training.max_steps 3000 \
    --training.optimizer.learn_rate $learning_rate \
    --verbose

[2021-12-12 18:53:01,771] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs', 'training.max_steps', 'training.optimizer.learn_rate']
ℹ Using CPU
[2021-12-12 18:53:02,814] [INFO] Set up nlp object from config
[2021-12-12 18:53:02,829] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 18:53:02,830] [DEBUG] Loading corpus from path: ../../data/train/train.spacy
[2021-12-12 18:53:02,832] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-12 18:53:02,839] [INFO] Created vocabulary
[2021-12-12 18:53:02,839] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2021-12-12 18:54:55,894] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
[2021-12-12 18:54:55,911] [DEBUG] Loading corpus from path: ../../data/dev/dev.spacy
[2021-12-12 18:54:55,913] [DEBUG]

#### 4.3.2 Testing

In [66]:
!python -m spacy evaluate $best_model_path $test_data_path -dp $best_model_results_path

ℹ Using CPU

TOK     100.00
NER P   24.44
NER R   20.37
NER F   22.22
SPEED   5063

                 P       R       F
scientific   33.33   37.50   35.29
common       11.11    6.67    8.33

✔ Generated 25 parses as HTML
../../data/results/radam_lr_00005
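
To compare the runs side by side, the overall NER scores reported in the testing cells above can be collected and ranked by F. The numbers in the sketch below are copied from those outputs; the `results` dictionary and variable names are only for illustration.

```python
# Overall (NER P, NER R, NER F) per run, copied from the evaluation outputs above.
results = {
    "adam_lr_0001": (3.77, 3.70, 3.74),
    "sgd_lr_0001": (0.00, 0.00, 0.00),
    "radam_lr_0001": (21.74, 18.52, 20.00),
    "adam_lr_00001": (0.00, 0.00, 0.00),
    "radam_lr_00001": (0.00, 0.00, 0.00),
    "adam_lr_00005": (14.29, 14.81, 14.55),
    "radam_lr_00005": (24.44, 20.37, 22.22),
}

# Rank runs by F score, highest first.
ranked = sorted(results.items(), key=lambda item: item[1][2], reverse=True)
best_run = ranked[0][0]

for name, (p, r, f) in ranked:
    print(f"{name:<16} P={p:6.2f}  R={r:6.2f}  F={f:6.2f}")
print("Best run by F:", best_run)
```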
