# Installing *biome.text*

In [2]:
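# Upgrade pip first, then install (or upgrade) biome-text
!pip install -U pip
!pip install -U biome-text
# Exit the kernel so the runtime restarts with the freshly installed packages
exit(0)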

Collecting pip
  Downloading https://files.pythonhosted.org/packages/fe/ef/60d7ba03b5c442309ef42e7d69959f73aacccd0d86008362a681c4698e83/pip-21.0.1-py3-none-any.whl (1.5MB)
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-21.0.1
Collecting biome-text
  Downloading biome_text-2.0.0-py3-none-any.whl (1.8 MB)
Collecting s3fs~=0.4.0
  Downloading s3fs-0.4.2-py3-none-any.whl (19 kB)
Collecting mlflow~=1.9.0
  Downloading mlflow-1.9.1-py3-none-any.whl (11.9 MB)
Collecting xlrd~=1.2.0
  Downloading xlrd-1.2.0-py2.py3-none-any.whl (103 kB)
Collecting gevent~=20.9.0
  Downloading gevent-20.9.0-cp36-cp36m-m

# Downloading the data

Here we download the preprocessed Cantemist data and the pretrained Spanish fastText word vectors.

In [2]:
!git clone https://github.com/recognai/cantemist-ner/

Cloning into 'cantemist-ner'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 39 (delta 12), reused 29 (delta 8), pack-reused 0
Unpacking objects: 100% (39/39), done.


In [3]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz

--2021-02-16 09:26:27--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1285580896 (1.2G) [binary/octet-stream]
Saving to: ‘cc.es.300.vec.gz’


2021-02-16 09:28:13 (11.7 MB/s) - ‘cc.es.300.vec.gz’ saved [1285580896/1285580896]
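
Before training, it can be worth checking that the download is intact. The following is a minimal sketch, assuming the file follows the usual fastText text format in which the first line holds the number of tokens and the vector dimension. The helper name `read_vec_header` is hypothetical and not part of *biome.text*.

```python
import gzip


def read_vec_header(path):
    """Return (num_tokens, dim) read from the first line of a gzipped .vec file.

    Assumes the fastText text format: a header line "<num_tokens> <dim>"
    followed by one "<token> <v1> ... <vdim>" line per token.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        num_tokens, dim = f.readline().split()
    return int(num_tokens), int(dim)


# Usage, once the download above has finished:
# num_tokens, dim = read_vec_header("cc.es.300.vec.gz")
```

Only the first line is decompressed, so the check is fast even for a file of this size.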



# Training the system

In [3]:
from biome.text import Pipeline, Dataset, VocabularyConfiguration, TrainerConfiguration

## Loading the datasets

These datasets were created using the *NER_dataprep.ipynb* and *NER_dataprep_test.ipynb* notebooks in our [cantemist-ner](https://github.com/recognai/cantemist-ner) repo.

In [7]:
train_ds = Dataset.from_json("cantemist-ner/data/NER/train_full.json")
# This test dataset was released after the competition ended:
test_ds = Dataset.from_json("cantemist-ner/data/NER/gold_test.json")

# for biome.text v2 we need to rename the "labels" column containing the NER tags
train_ds.rename_column_("labels", "tags")
test_ds.rename_column_("labels", "tags")

Using custom data configuration default



Downloading and preparing dataset json/default-4ecdc33b48e84d7c (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-4ecdc33b48e84d7c/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-4ecdc33b48e84d7c/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514. Subsequent calls will reuse this data.


Using custom data configuration default


Downloading and preparing dataset json/default-bbf39d6db3ce5216 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-bbf39d6db3ce5216/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-bbf39d6db3ce5216/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514. Subsequent calls will reuse this data.


## Defining the pipeline

The architecture and hyperparameters were found by means of a random-search hyperparameter optimization (HPO).

In [5]:
pipeline_config = {
  'name': 'candemist-ner-first-hpo',
  'features': {
    'word': {
      'embedding_dim': 300,
      'lowercase_tokens': True,
      'trainable': True,
      'weights_file': "/content/cc.es.300.vec.gz"
    },
    'char': {
      'embedding_dim': 64,
      'lowercase_characters': True,
      'encoder': {
        'bidirectional': True,
        'hidden_size': 128,
        'num_layers': 1,
        'type': 'gru'
      },
      'dropout': 0.16517050992687604
    },
  },
  'encoder': {
    'bidirectional': True,
    'hidden_size': 512,
    'input_size': 556,
    'num_layers': 1,
    'type': 'lstm'
  },  
  'head': {
    'dropout': 0.2689579604286324,
    'labels': ['MORFOLOGIA_NEOPLASIA'],
    'type': 'TokenClassification'
  },
}
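
The encoder's `input_size` of 556 is consistent with the feature dimensions in the configuration above. A small sketch of the arithmetic, assuming that the word and character features are concatenated and that the bidirectional GRU contributes its forward and backward final states (the helper `encoder_input_size` is hypothetical):

```python
def encoder_input_size(word_dim, char_hidden_size, char_bidirectional=True):
    """Width of the concatenated word + char feature vector fed to the encoder.

    Assumption: a bidirectional char encoder outputs 2 * hidden_size values
    per token, and word and char features are simply concatenated.
    """
    char_dim = char_hidden_size * (2 if char_bidirectional else 1)
    return word_dim + char_dim


# Values taken from the 'word' and 'char' feature blocks of pipeline_config
input_size = encoder_input_size(word_dim=300, char_hidden_size=128)
print(input_size)  # 556
```

If the word embedding dimension or the char encoder's `hidden_size` changes, the encoder's `input_size` has to be updated accordingly.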

In [6]:
pl = Pipeline.from_config(pipeline_config)

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


## Defining the vocabulary

Only include words that appear at least two times in the `train_ds` dataset.

In [9]:
vocab_config = VocabularyConfiguration(
    datasets=[train_ds], min_count={"word": 2}
)
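
To illustrate what the `min_count` filter does, here is a minimal, library-independent sketch. It is not the *biome.text* implementation. The helper `build_word_vocab` and the toy sentences are made up for illustration, and it assumes tokens are lowercased before counting, in line with `lowercase_tokens: True` in the pipeline configuration.

```python
from collections import Counter


def build_word_vocab(tokenized_sentences, min_count=2, lowercase=True):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(
        (token.lower() if lowercase else token)
        for sentence in tokenized_sentences
        for token in sentence
    )
    return sorted(word for word, count in counts.items() if count >= min_count)


# Toy data, purely illustrative
toy_sentences = [
    ["Carcinoma", "ductal", "infiltrante"],
    ["carcinoma", "papilar"],
    ["metástasis", "de", "carcinoma"],
]
vocab = build_word_vocab(toy_sentences, min_count=2)
print(vocab)  # ['carcinoma']
```

Words below the threshold are typically mapped to an out-of-vocabulary token at training time, which keeps the embedding matrix small and avoids fitting vectors for words seen only once.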

## Defining the trainer

The hyperparameters were also found by means of a random-search HPO.

In [7]:
trainer_dict = {
    "optimizer": {
        "type": "adamw",
        "lr": 0.0038931174186587806,
        "weight_decay": 0.01,
    },
    "learning_rate_scheduler": {
        "type": "step",
        "step_size": 2,
        "gamma": 0.1,
    },
    "batch_size": 32,
    "num_epochs": 4,
    "validation_metric": "+f1-measure-overall",
    "patience": 3,
}
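
To make the effect of the scheduler concrete, the sketch below computes the learning rate per epoch. It assumes the `"step"` scheduler follows the usual step-decay rule (as in PyTorch's `StepLR`), multiplying the learning rate by `gamma` every `step_size` epochs, with epochs counted from 0. The helper `step_lr` is hypothetical.

```python
def step_lr(base_lr, epoch, step_size=2, gamma=0.1):
    """Step decay: multiply the learning rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


base_lr = 0.0038931174186587806  # "lr" of the optimizer above
schedule = [step_lr(base_lr, epoch) for epoch in range(4)]  # num_epochs = 4

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

Under these assumptions the first two epochs run at the full learning rate and the last two at one tenth of it.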

In [8]:
trainer_config = TrainerConfiguration(**trainer_dict)

## Training the pipeline

In [10]:
pl.train(
    output="output",
    training=train_ds,
    test=test_ds,
    trainer=trainer_config,
    vocab_config=vocab_config,
)

2021-02-16 09:40:03,935 - allennlp.data.vocabulary - INFO - Fitting token dictionary from dataset.
building vocab: 0it [00:00, ?it/s]

2021-02-16 09:40:10,106 - biome.text.dataset - INFO - Caching instances to /root/.cache/huggingface/datasets/json/default-7b14c7191bb43b2a/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/76729142bb3937f9.instance_list)





building vocab: 35877it [00:16, 2162.79it/s]
2021-02-16 09:40:20,566 - allennlp.modules.token_embedders.embedding - INFO - Reading pretrained embeddings from file
2021-02-16 09:40:20,571 - allennlp.modules.token_embedders.embedding - INFO - Recognized a header line in the embedding file with number of tokens: 2000000
100%|##########| 2000000/2000000 [00:30<00:00, 66292.26it/s]
2021-02-16 09:40:50,764 - allennlp.modules.token_embedders.embedding - INFO - Initializing pre-trained embedding layer
2021-02-16 09:40:50,940 - allennlp.modules.token_embedders.embedding - INFO - Pretrained embeddings were found for 11989 out of 14126 tokens


2021-02-16 09:40:56,399 - biome.text.dataset - INFO - Caching instances to /root/.cache/huggingface/datasets/json/default-cffa4ea9ee35ebae/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/90e04dc5c0c285ee.instance_list)





2021-02-16 09:40:57,962 - allennlp.common.params - INFO - random_seed = 13370
2021-02-16 09:40:57,963 - allennlp.common.params - INFO - numpy_seed = 1337
2021-02-16 09:40:57,964 - allennlp.common.params - INFO - pytorch_seed = 133
2021-02-16 09:40:57,973 - allennlp.common.checks - INFO - Pytorch version: 1.7.0+cu101
2021-02-16 09:40:58,008 - allennlp.common.params - INFO - type = gradient_descent
2021-02-16 09:40:58,008 - allennlp.common.params - INFO - local_rank = 0
2021-02-16 09:40:58,009 - allennlp.common.params - INFO - patience = 3
2021-02-16 09:40:58,010 - allennlp.common.params - INFO - validation_metric = +f1-measure-overall
2021-02-16 09:40:58,012 - allennlp.common.params - INFO - num_epochs = 4
2021-02-16 09:40:58,013 - allennlp.common.params - INFO - cuda_device = None
2021-02-16 09:40:58,015 - allennlp.common.params - INFO - grad_norm = None
2021-02-16 09:40:58,018 - allennlp.common.params - INFO - grad_clipping = None
2021-02-16 09:40:58,020 - allennlp.common.params - INF

TrainingResults(model_path='output/model.tar.gz', metrics={'best_epoch': 3, 'peak_worker_0_memory_MB': 3741.7265625, 'peak_gpu_0_memory_MB': 301.75830078125, 'training_duration': '0:13:30.199669', 'training_start_epoch': 0, 'training_epochs': 3, 'epoch': 3, 'training_accuracy': 0.995057364799209, 'training_precision-MORFOLOGIA_NEOPLASIA': 0.9215686274509803, 'training_recall-MORFOLOGIA_NEOPLASIA': 0.9211907495489585, 'training_f1-measure-MORFOLOGIA_NEOPLASIA': 0.9213796497559276, 'training_precision-overall': 0.9215686274509803, 'training_recall-overall': 0.9211907495489585, 'training_f1-measure-overall': 0.9213796497559276, 'training_loss': 4.325106642038002, 'training_worker_0_memory_MB': 3741.7265625, 'training_gpu_0_memory_MB': 301.75830078125, 'test_accuracy': 0.9906576845947861, 'test_precision-MORFOLOGIA_NEOPLASIA': 0.8508845829823083, 'test_recall-MORFOLOGIA_NEOPLASIA': 0.8530405405405406, 'test_f1-measure-MORFOLOGIA_NEOPLASIA': 0.8519611978067827, 'test_precision-overall': 0.8
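
As a sanity check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the `test_precision-MORFOLOGIA_NEOPLASIA` and `test_recall-MORFOLOGIA_NEOPLASIA` values printed above (the helper `f1_score` is ours, not a library function):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the TrainingResults output above
test_precision = 0.8508845829823083
test_recall = 0.8530405405405406

test_f1 = f1_score(test_precision, test_recall)
print(round(test_f1, 4))  # 0.852
```

This agrees with the `test_f1-measure-MORFOLOGIA_NEOPLASIA` value in the results.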

# Appendix: BETO model with an f1 score of 0.861

In a quick follow-up experiment, we tried the pretrained "*BETO: Spanish BERT*" model, which was used by both the winner (f1: 0.87) and the runner-up (f1: 0.869) of the Cantemist NER competition, and achieved similar results.

In [4]:
pipeline_dict = {
    "name": "",
    "features": {
        "transformers": {
            "model_name": "dccuchile/bert-base-spanish-wwm-cased",
            #"mismatched": True,  # False for wordpiece tokens, True for word tokens
            "trainable": True,
            "max_length": 512
        },
    },
    "head": {
        "type": "TokenClassification",
        "labels": ["MORFOLOGIA_NEOPLASIA"]
    }
}

In [5]:
pl = Pipeline.from_config(pipeline_dict)

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
# The hyperparameters were found by means of a random search HPO
trainer_dict = {
    "optimizer": {
        "type": "adamw",
        "lr": 3e-5
    },
    "batch_size": 8,
    "patience": 2,
    "num_epochs": 8,
    "validation_metric": "+f1-measure-overall",
    "learning_rate_scheduler": {
        "type": "linear_with_warmup",
        "num_epochs": 8,
        "num_steps_per_epoch": 4485,
        "warmup_steps": 100,
    }
}
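
The value of `num_steps_per_epoch` follows from the dataset size and the batch size. The sketch below derives it from the 35877 training instances reported in the training log and then outlines the shape of a linear schedule with warmup. The schedule function is an illustrative assumption (linear ramp to the peak learning rate, then linear decay to zero), not necessarily the exact formula used by the library.

```python
import math

num_train_instances = 35877  # from the "building vocab: 35877it" log line
batch_size = 8
num_epochs = 8
warmup_steps = 100
peak_lr = 3e-5

# The last, partially filled batch still counts as a step
num_steps_per_epoch = math.ceil(num_train_instances / batch_size)
total_steps = num_epochs * num_steps_per_epoch
print(num_steps_per_epoch, total_steps)  # 4485 35880


def linear_with_warmup(step):
    """Assumed shape: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

If the batch size or the training set changes, `num_steps_per_epoch` has to be recomputed, otherwise the decay would end too early or too late.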

In [10]:
trainer = TrainerConfiguration(**trainer_dict)

In [11]:
pl.train(
    output="output_beto",
    training=train_ds,
    test=test_ds,
    trainer=trainer,
)

2021-02-16 10:45:51,602 - biome.text.dataset - INFO - Caching instances to /root/.cache/huggingface/datasets/json/default-4ecdc33b48e84d7c/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/eced6fb3a0f87050.instance_list)





2021-02-16 10:45:59,999 - biome.text.dataset - INFO - Caching instances to /root/.cache/huggingface/datasets/json/default-bbf39d6db3ce5216/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/49415b21c2db4666.instance_list)





2021-02-16 10:46:01,860 - allennlp.common.params - INFO - random_seed = 13370
2021-02-16 10:46:01,861 - allennlp.common.params - INFO - numpy_seed = 1337
2021-02-16 10:46:01,862 - allennlp.common.params - INFO - pytorch_seed = 133
2021-02-16 10:46:01,869 - allennlp.common.checks - INFO - Pytorch version: 1.7.0+cu101
2021-02-16 10:46:01,963 - allennlp.common.params - INFO - type = gradient_descent
2021-02-16 10:46:01,965 - allennlp.common.params - INFO - local_rank = 0
2021-02-16 10:46:01,966 - allennlp.common.params - INFO - patience = 2
2021-02-16 10:46:01,968 - allennlp.common.params - INFO - validation_metric = +f1-measure-overall
2021-02-16 10:46:01,972 - allennlp.common.params - INFO - num_epochs = 8
2021-02-16 10:46:01,973 - allennlp.common.params - INFO - cuda_device = None
2021-02-16 10:46:01,975 - allennlp.common.params - INFO - grad_norm = None
2021-02-16 10:46:01,977 - allennlp.common.params - INFO - grad_clipping = None
2021-02-16 10:46:01,979 - allennlp.common.params - INF

KeyboardInterrupt: ignored