## Training an ALBERT Transformer for Named Entity Recognition

This Colab notebook shows how to fine-tune an ALBERT transformer (`albert-base-v1`) on the named entity recognition (NER) downstream task using the token-classification example scripts (`run_ner.py` / `run_tf_ner.py`) from the [Transformers library](https://github.com/huggingface/transformers). The German CoNLL dataset is used as the benchmark. The steps are as follows:

- Install Transformers from source
- Download the `run_ner.py` and preprocessing scripts, then run the preprocessing
- Download the CoNLL dataset
- Run the script:
```python
!python transformers-3.4.0/examples/token-classification/run_tf_ner.py \
  --data_dir ./ \
  --labels ./labels.txt \
  --model_name_or_path $MODEL \
  --output_dir $OUTPUT_DIR \
  --max_seq_length  $MAX_LENGTH \
  --num_train_epochs $NUM_EPOCHS \
  --per_gpu_train_batch_size $BATCH_SIZE \
  --save_steps $SAVE_STEPS \
  --logging_steps $LOGGING_STEPS \
  --seed $SEED \
  --do_train \
  --do_predict \
  --overwrite_output_dir
  ```

- Specify the hyperparameters of the ALBERT model, such as the number of epochs, the batch size, and the save and logging steps.

In [None]:
!wget 'https://github.com/huggingface/transformers/archive/v3.4.0.zip'

--2021-03-07 11:57:43--  https://github.com/huggingface/transformers/archive/v3.4.0.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/huggingface/transformers/zip/v3.4.0 [following]
--2021-03-07 11:57:43--  https://codeload.github.com/huggingface/transformers/zip/v3.4.0
Resolving codeload.github.com (codeload.github.com)... 140.82.112.9
Connecting to codeload.github.com (codeload.github.com)|140.82.112.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v3.4.0.zip.1’

v3.4.0.zip.1            [    <=>             ]   7.21M  8.86MB/s    in 0.8s    

2021-03-07 11:57:44 (8.86 MB/s) - ‘v3.4.0.zip.1’ saved [7563073]



In [None]:
!unzip 'v3.4.0.zip'

In [None]:
!curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
!curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
!curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0    287      0 --:--:--  0:00:01 --:--:--   287
100 7697k    0 7697k    0     0  4736k      0 --:--:--  0:00:01 --:--:-- 4736k
tr: extra operand ‘A’
Try 'tr --help' for more information.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0    317      0 --:--:--  0:00:01 --:--:--   317
  7  706k    7 53802    0     0  40031      0  0:00:18  0:00:01  0:00:17 40031
curl: (23) Failed writing body (717 != 1271)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   388    0   388    0     0    414      0 --:--:-- --:--:-- --:--:--   414
100 1643k  100 1643k    0 
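
The shell filter used in the download cell (`grep -v "^#" | cut -f 2,3 | tr '\t' ' '`) drops comment lines, keeps the second and third tab-separated columns, and joins them with a space. A rough Python equivalent is sketched below; `filter_conll_lines` is a hypothetical helper name, and the sample rows are made up for illustration rather than taken from the dataset.

```python
def filter_conll_lines(lines):
    # Approximates: grep -v "^#" | cut -f 2,3 | tr '\t' ' '
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # grep -v "^#": drop comment lines
        if "\t" not in line:
            yield line  # cut passes lines without a delimiter through unchanged
            continue
        fields = line.split("\t")
        # cut -f 2,3 keeps columns 2 and 3; tr turns the tab between them into a space
        yield " ".join(fields[1:3])


# Made-up sample rows (index, token, outer label, inner label).
raw = [
    "# hypothetical comment line",
    "1\tBerlin\tB-LOC\tO",
    "2\tist\tO\tO",
    "",
    "1\tHamburg\tB-LOC\tO",
]
filtered = list(filter_conll_lines(raw))
print(filtered)
```

Blank lines are kept on purpose, since they mark sentence boundaries in the token-per-line format.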

In [None]:
!wc -l dev.txt.tmp

0 dev.txt.tmp
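
A line count of 0 means the dev split was not written, so it may be worth checking every split before preprocessing. The following is a minimal sketch of such a guard; `count_lines` and `find_empty_splits` are hypothetical helpers, and the demo uses temporary files rather than the real downloads.

```python
import os
import tempfile


def count_lines(path):
    # Count lines without loading the whole file into memory.
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def find_empty_splits(paths):
    # Return the paths that are missing or contain no lines.
    return [p for p in paths if not os.path.exists(p) or count_lines(p) == 0]


# Demo with temporary stand-ins for the real files.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "train.txt.tmp")
    empty = os.path.join(tmp, "dev.txt.tmp")
    with open(good, "w", encoding="utf-8") as handle:
        handle.write("Berlin B-LOC\n")
    open(empty, "w").close()
    problems = [os.path.basename(p) for p in find_empty_splits([good, empty])]

print(problems)
```

In the notebook, the same check could be run on `train.txt.tmp`, `dev.txt.tmp`, and `test.txt.tmp` before continuing.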


In [None]:
!wget 'https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py'

--2021-03-07 11:58:04--  https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 991 [text/plain]
Saving to: ‘preprocess.py.2’


2021-03-07 11:58:04 (77.8 MB/s) - ‘preprocess.py.2’ saved [991/991]



In [None]:
MAX_LENGTH=128
MODEL='albert-base-v1'

In [None]:
!python3 preprocess.py train.txt.tmp $MODEL $MAX_LENGTH > train.txt
!python3 preprocess.py dev.txt.tmp $MODEL $MAX_LENGTH > dev.txt
!python3 preprocess.py test.txt.tmp $MODEL $MAX_LENGTH > test.txt

2021-03-07 11:58:10.406288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading: 100% 684/684 [00:00<00:00, 504kB/s]
Downloading: 100% 760k/760k [00:00<00:00, 10.6MB/s]
2021-03-07 11:58:38.077608: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-03-07 11:58:42.912366: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
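
`preprocess.py` is not reproduced in this notebook. Judging from its arguments (input file, model name, maximum length), it presumably splits sentences whose subword count would exceed `MAX_LENGTH`. The sketch below illustrates that idea under this assumption only; `toy_subwords` is a made-up stand-in for the real ALBERT tokenizer so the example runs offline, and `split_long_sentences` is a hypothetical name.

```python
def toy_subwords(token, piece_len=4):
    # Stand-in tokenizer: cut a token into fixed-size character chunks.
    return [token[i:i + piece_len] for i in range(0, len(token), piece_len)]


def split_long_sentences(lines, max_subwords, tokenize=toy_subwords):
    # Insert a blank line (sentence break) whenever the running subword
    # count of the current sentence would exceed max_subwords.
    out = []
    used = 0
    for line in lines:
        line = line.rstrip()
        if not line:  # existing sentence boundary
            out.append("")
            used = 0
            continue
        token = line.split()[0]
        n = len(tokenize(token))
        if n == 0:  # token yields no subwords: skip the line
            continue
        if used + n > max_subwords:
            out.append("")  # start a new sentence
            used = 0
        out.append(line)
        used += n
    return out


sample = ["Berlin B-LOC", "ist O", "die O", "Hauptstadt O"]
result = split_long_sentences(sample, max_subwords=4)
print(result)
```

With the real tokenizer, the budget would also need to leave room for the special tokens added around each sequence.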


In [None]:
!cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$" | sort | uniq > labels.txt
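
The label list is the sorted set of values in the second space-separated column across all three splits. A Python sketch of the same idea follows; `collect_labels` is a hypothetical helper, the sample lines are illustrative, and unlike `cut` it ignores lines that have no second column.

```python
def collect_labels(lines):
    # Gather the unique, non-empty values of the second space-separated column.
    labels = set()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) >= 2 and parts[1]:
            labels.add(parts[1])
    return sorted(labels)


sample = ["Berlin B-LOC", "ist O", "", "Angela B-PER", "Merkel I-PER"]
labels = collect_labels(sample)
print(labels)
```

In the notebook this would be applied to the concatenated lines of `train.txt`, `dev.txt`, and `test.txt`, writing one label per line to `labels.txt`.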

In [None]:
!wget 'https://raw.githubusercontent.com/huggingface/transformers/v3.4.0/examples/token-classification/run_ner.py'

--2021-03-07 11:58:56--  https://raw.githubusercontent.com/huggingface/transformers/v3.4.0/examples/token-classification/run_ner.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11596 (11K) [text/plain]
Saving to: ‘run_ner.py.1’


2021-03-07 11:58:56 (94.7 MB/s) - ‘run_ner.py.1’ saved [11596/11596]



In [None]:
!wget 'https://raw.githubusercontent.com/huggingface/transformers/v3.4.0/examples/token-classification/utils_ner.py' -O 'utils_ner.py'

--2021-03-07 11:58:56--  https://raw.githubusercontent.com/huggingface/transformers/v3.4.0/examples/token-classification/utils_ner.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15629 (15K) [text/plain]
Saving to: ‘utils_ner.py.1’


2021-03-07 11:58:56 (29.2 MB/s) - ‘utils_ner.py.1’ saved [15629/15629]

--2021-03-07 11:58:56--  http://as/
Resolving as (as)... failed: No address associated with hostname.
wget: unable to resolve host address ‘as’
--2021-03-07 11:58:56--  http://utils_ner.py/
Resolving utils_ner.py (utils_ner.py)... failed: Name or service not known.
wget: unable to resolve host address ‘utils_ner.py’
FINISHED --2021-03-07 11:58:56--
Total wall clock time: 0.2s
Downloaded: 1 files, 15K in 0.001s (29.2 MB/s)


In [None]:
OUTPUT_DIR='abhilash1910/ner-model'
BATCH_SIZE=32
NUM_EPOCHS=3
SAVE_STEPS = 100 #@param {type: "integer"}
LOGGING_STEPS = 100 #@param {type: "integer"}
SEED = 42 #@param {type: "integer"}
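
To get a feel for what these values imply, the sketch below estimates the number of optimisation steps and checkpoints. It assumes one step per batch with no gradient accumulation and a rounded-up last batch, which may not match the script exactly; `training_schedule` is a hypothetical helper, and the figure of 24,000 training sentences is an assumed example, not a number taken from this notebook.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, save_steps):
    # Assumption: one optimiser step per batch, last partial batch kept.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints


# num_examples is hypothetical; the other values mirror the cell above.
schedule = training_schedule(num_examples=24000, batch_size=32,
                             num_epochs=3, save_steps=100)
print(schedule)  # (750, 2250, 22)
```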

In [None]:
!pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/3e/73/742d17d8a9a1c639132affccc9250f0743e484cbf263ede6ddcbe34ef212/datasets-1.4.1-py3-none-any.whl (186kB)

In [None]:
!pip install seqeval



In [None]:
%%capture
!git clone https://github.com/huggingface/transformers
%cd transformers
!pip install .
!pip install -r ./examples/requirements.txt
%cd ..


In [None]:
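# `!cd` runs in a subshell, so the notebook's working directory is unchanged;
# later cells keep using paths relative to the original directory.
!cd transformers/examples/token-classification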

In [None]:
import glob
f=glob.glob('transformers-3.4.0/examples/token-classification/*')
f

['transformers-3.4.0/examples/token-classification/run.sh',
 'transformers-3.4.0/examples/token-classification/utils_ner.py',
 'transformers-3.4.0/examples/token-classification/run_ner.py',
 'transformers-3.4.0/examples/token-classification/README.md',
 'transformers-3.4.0/examples/token-classification/run_pos.sh',
 'transformers-3.4.0/examples/token-classification/run_pos_pl.sh',
 'transformers-3.4.0/examples/token-classification/run_pl.sh',
 'transformers-3.4.0/examples/token-classification/run_chunk.sh',
 'transformers-3.4.0/examples/token-classification/scripts',
 'transformers-3.4.0/examples/token-classification/run_pl_ner.py',
 'transformers-3.4.0/examples/token-classification/run_tf_ner.py',
 'transformers-3.4.0/examples/token-classification/tasks.py',
 'transformers-3.4.0/examples/token-classification/__pycache__',
 'transformers-3.4.0/examples/token-classification/test_ner_examples.py']

In [None]:
!pip install conllu

Collecting conllu
  Downloading https://files.pythonhosted.org/packages/ae/be/be6959c3ff2dbfdd87de4be0ccdff577835b5d08b1d25bf7fd4aaf0d7add/conllu-4.4-py2.py3-none-any.whl
Installing collected packages: conllu
Successfully installed conllu-4.4


In [None]:
!pip install --upgrade pyarrow

Requirement already up-to-date: pyarrow in /usr/local/lib/python3.7/dist-packages (3.0.0)


In [None]:
# Read the label file and display its raw contents.
with open('./labels.txt') as label_file:
    f = label_file.read()
f

'B-LOC\nB-LOCderiv\nB-LOCpart\nB-ORG\nB-ORGderiv\nB-ORGpart\nB-OTH\nB-OTHderiv\nB-OTHpart\nB-PER\nB-PERderiv\nB-PERpart\nI-LOC\nI-LOCderiv\nI-LOCpart\nI-ORG\nI-ORGpart\nI-OTH\nI-OTHderiv\nI-OTHpart\nI-PER\nI-PERderiv\nI-PERpart\nO\n'
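
Assuming the usual BIO convention (`B-` opens an entity, `I-` continues it, `O` is outside any entity), the label file can be grouped by entity type to check that the tag set is complete. The sketch below does this for the string shown above; `group_bio_labels` is a hypothetical helper.

```python
# Contents of labels.txt as displayed above.
LABELS_TEXT = (
    "B-LOC\nB-LOCderiv\nB-LOCpart\nB-ORG\nB-ORGderiv\nB-ORGpart\n"
    "B-OTH\nB-OTHderiv\nB-OTHpart\nB-PER\nB-PERderiv\nB-PERpart\n"
    "I-LOC\nI-LOCderiv\nI-LOCpart\nI-ORG\nI-ORGpart\nI-OTH\n"
    "I-OTHderiv\nI-OTHpart\nI-PER\nI-PERderiv\nI-PERpart\nO\n"
)


def group_bio_labels(labels):
    # Split tags into the entity types seen with a B- and with an I- prefix.
    begin, inside = set(), set()
    for label in labels:
        prefix, _, entity_type = label.partition("-")
        if prefix == "B":
            begin.add(entity_type)
        elif prefix == "I":
            inside.add(entity_type)
    return begin, inside


labels = LABELS_TEXT.split()
begin, inside = group_bio_labels(labels)
missing_inside = sorted(begin - inside)
print(len(labels), len(begin), missing_inside)  # 24 12 ['ORGderiv']
```

The result shows 24 tags over 12 entity types, with no `I-ORGderiv` tag present in the data.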

In [None]:
!python transformers-3.4.0/examples/token-classification/run_tf_ner.py \
  --data_dir ./ \
  --labels ./labels.txt \
  --model_name_or_path $MODEL \
  --output_dir $OUTPUT_DIR \
  --max_seq_length  $MAX_LENGTH \
  --num_train_epochs $NUM_EPOCHS \
  --per_gpu_train_batch_size $BATCH_SIZE \
  --save_steps $SAVE_STEPS \
  --logging_steps $LOGGING_STEPS \
  --seed $SEED \
  --do_train \
  --do_predict \
  --overwrite_output_dir

2021-03-07 11:59:29.439218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-03-07 11:59:31.722390: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-07 11:59:31.723505: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-07 11:59:31.734753: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-07 11:59:31.734812: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (546f1109f418): /proc/driver/nvidia/version does not exist
2021-03-07 11:59:31.735598: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
03/07/2021 11:59:31 - INFO - __main__ -   n_replicas: 1, distributed trainin

In [None]:
!pip install transformers==3.4.0

Collecting transformers==3.4.0
  Using cached https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl
Collecting tokenizers==0.9.2
  Using cached https://files.pythonhosted.org/packages/35/e7/edf655ae34925aeaefb7b7fcc3dd0887d2a1203ee6b0df4d1170d1a19d4f/tokenizers-0.9.2-cp37-cp37m-manylinux1_x86_64.whl
Installing collected packages: tokenizers, transformers
  Found existing installation: tokenizers 0.9.3
    Uninstalling tokenizers-0.9.3:
      Successfully uninstalled tokenizers-0.9.3
  Found existing installation: transformers 3.5.0
    Uninstalling transformers-3.5.0:
      Successfully uninstalled transformers-3.5.0
Successfully installed tokenizers-0.9.2 transformers-3.4.0


In [None]:
files=glob.glob('./abhilash1910/ner-model/*')
files

['./abhilash1910/ner-model/tf_model.h5',
 './abhilash1910/ner-model/config.json',
 './abhilash1910/ner-model/special_tokens_map.json',
 './abhilash1910/ner-model/test_results.txt',
 './abhilash1910/ner-model/tokenizer_config.json',
 './abhilash1910/ner-model/spiece.model',
 './abhilash1910/ner-model/test_predictions.txt']

In [None]:

!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)

## Training is completed

Once training is done, we can evaluate the trained NER model with the Hugging Face Transformers `pipeline` API, which offers a convenient way to run inference on both CPU and GPU.

In [None]:
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the training output directory.
ner_model = pipeline('ner', model='./abhilash1910/ner-model/', tokenizer='./abhilash1910/ner-model/')

Some layers from the model checkpoint at ./abhilash1910/ner-model/ were not used when initializing TFAlbertModel: ['classifier', 'dropout_4']
- This IS expected if you are initializing TFAlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFAlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFAlbertModel were not initialized from the model checkpoint at ./abhilash1910/ner-model/ and are newly initialized: ['pooler/kernel:0', 'pooler/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some layers from the model checkpoint at ./abhilash1910/ner-model/ were not used when initializing TFAlbertForTokenCl

In [None]:
seq='Berlin ist die Hauptstadt von Deutschland'
ner_model(seq)

[{'entity': 'B-ORGpart',
  'index': 0,
  'score': 0.08934617042541504,
  'word': '[CLS]'},
 {'entity': 'B-PERderiv',
  'index': 1,
  'score': 0.09580112248659134,
  'word': '▁berlin'},
 {'entity': 'B-ORGpart',
  'index': 2,
  'score': 0.08364498615264893,
  'word': '▁is'},
 {'entity': 'B-LOCderiv',
  'index': 3,
  'score': 0.07593920826911926,
  'word': 't'},
 {'entity': 'B-PERderiv',
  'index': 4,
  'score': 0.09574996680021286,
  'word': '▁die'},
 {'entity': 'B-LOCderiv',
  'index': 5,
  'score': 0.07097965478897095,
  'word': '▁'},
 {'entity': 'B-PERderiv',
  'index': 6,
  'score': 0.07122448086738586,
  'word': 'haupt'},
 {'entity': 'B-PERderiv',
  'index': 7,
  'score': 0.12397754937410355,
  'word': 'stadt'},
 {'entity': 'I-OTHderiv',
  'index': 8,
  'score': 0.0818650871515274,
  'word': '▁von'},
 {'entity': 'I-LOCderiv',
  'index': 9,
  'score': 0.08271490037441254,
  'word': '▁'},
 {'entity': 'B-LOCderiv',
  'index': 10,
  'score': 0.08616268634796143,
  'word': 'deutschland'}

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer,TFAutoModelForTokenClassification
from transformers import pipeline

model=TFAutoModelForTokenClassification.from_pretrained('abhilash1910/albert-german-ner')
tokenizer=AutoTokenizer.from_pretrained('abhilash1910/albert-german-ner')
ner_model = pipeline('ner', model=model, tokenizer=tokenizer)
seq='Berlin ist die Hauptstadt von Deutschland'
ner_model(seq)

Some layers from the model checkpoint at abhilash1910/albert-german-ner were not used when initializing TFAlbertForTokenClassification: ['dropout_4']
- This IS expected if you are initializing TFAlbertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFAlbertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFAlbertForTokenClassification were not initialized from the model checkpoint at abhilash1910/albert-german-ner and are newly initialized: ['dropout_9']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'entity': 'B-PERderiv',
  'index': 1,
  'score': 0.09580112248659134,
  'word': '▁berlin'},
 {'entity': 'B-ORGpart',
  'index': 2,
  'score': 0.08364498615264893,
  'word': '▁is'},
 {'entity': 'B-LOCderiv',
  'index': 3,
  'score': 0.07593920826911926,
  'word': 't'},
 {'entity': 'B-PERderiv',
  'index': 4,
  'score': 0.09574996680021286,
  'word': '▁die'},
 {'entity': 'B-LOCderiv',
  'index': 5,
  'score': 0.07097965478897095,
  'word': '▁'},
 {'entity': 'B-PERderiv',
  'index': 6,
  'score': 0.07122448086738586,
  'word': 'haupt'},
 {'entity': 'B-PERderiv',
  'index': 7,
  'score': 0.12397754937410355,
  'word': 'stadt'},
 {'entity': 'I-OTHderiv',
  'index': 8,
  'score': 0.0818650871515274,
  'word': '▁von'},
 {'entity': 'I-LOCderiv',
  'index': 9,
  'score': 0.08271490037441254,
  'word': '▁'},
 {'entity': 'B-LOCderiv',
  'index': 10,
  'score': 0.08616268634796143,
  'word': 'deutschland'}]
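
The pipeline returns one prediction per SentencePiece piece, where a leading `▁` marks the start of a new word. The sketch below shows one possible way to merge the pieces back into words, keeping the label of the highest-scoring piece for each word. `merge_pieces` is a hypothetical helper and the aggregation rule is an assumption, not something the pipeline does here; the input is the output shown above.

```python
# Piece-level predictions copied from the output above.
predictions = [
    {"entity": "B-PERderiv", "index": 1, "score": 0.09580112248659134, "word": "▁berlin"},
    {"entity": "B-ORGpart", "index": 2, "score": 0.08364498615264893, "word": "▁is"},
    {"entity": "B-LOCderiv", "index": 3, "score": 0.07593920826911926, "word": "t"},
    {"entity": "B-PERderiv", "index": 4, "score": 0.09574996680021286, "word": "▁die"},
    {"entity": "B-LOCderiv", "index": 5, "score": 0.07097965478897095, "word": "▁"},
    {"entity": "B-PERderiv", "index": 6, "score": 0.07122448086738586, "word": "haupt"},
    {"entity": "B-PERderiv", "index": 7, "score": 0.12397754937410355, "word": "stadt"},
    {"entity": "I-OTHderiv", "index": 8, "score": 0.0818650871515274, "word": "▁von"},
    {"entity": "I-LOCderiv", "index": 9, "score": 0.08271490037441254, "word": "▁"},
    {"entity": "B-LOCderiv", "index": 10, "score": 0.08616268634796143, "word": "deutschland"},
]


def merge_pieces(entities):
    # Merge SentencePiece pieces into words; keep the best-scoring label per word.
    words = []
    for ent in entities:
        piece = ent["word"]
        if piece.startswith("[") and piece.endswith("]"):
            continue  # skip special tokens such as [CLS]
        text = piece.lstrip("▁")
        if piece.startswith("▁") or not words:
            words.append({"word": text, "entity": ent["entity"], "score": ent["score"]})
        else:
            current = words[-1]
            current["word"] += text
            if ent["score"] > current["score"]:
                current["entity"] = ent["entity"]
                current["score"] = ent["score"]
    return words


merged = merge_pieces(predictions)
for item in merged:
    print(item["word"], item["entity"], round(item["score"], 3))
```

Other aggregation rules, such as taking the label of the first piece of each word, are equally plausible.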