# Training with 7 coarse-grained labels

## Preprocessing

If you have not yet downloaded the [German LER Dataset](https://github.com/elenanereiss/Legal-Entity-Recognition), fetch the three splits from GitHub and save them in the `data` folder.

In [None]:
%%bash
wget https://raw.githubusercontent.com/elenanereiss/Legal-Entity-Recognition/master/data/ler_train.conll -P data
wget https://raw.githubusercontent.com/elenanereiss/Legal-Entity-Recognition/master/data/ler_dev.conll -P data
wget https://raw.githubusercontent.com/elenanereiss/Legal-Entity-Recognition/master/data/ler_test.conll -P data

We first define some variables that are needed for the remaining pre-processing steps and for training the model.
Other models suitable for training can be found on the 🤗 [Hugging Face Hub](https://huggingface.co/models?language=de&pipeline_tag=fill-mask&sort=downloads), for example:
- **BERT multilingual** (`bert-base-multilingual-cased`, `bert-base-multilingual-uncased`)
- **BERT German** (`bert-base-german-cased`, `dbmdz/bert-base-german-uncased`, ...)
- **DistilBERT** (`distilbert-base-german-cased`, `distilbert-base-multilingual-cased`)
- **XLM-RoBERTa** (`xlm-roberta-base`, `xlm-roberta-large`, `facebook/xlm-roberta-xl`, ...)
- **ELECTRA** (`stefan-it/electra-base-gc4-64k-200000-cased-generator`, ...)
- **DeBERTa** (`microsoft/mdeberta-v3-base`)
- ...

We use [bert-base-german-cased](https://huggingface.co/bert-base-german-cased) for training.

In [None]:
%%bash
export DATA_DIR=data
export MODEL_DIR=models

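export MAX_LENGTH=512
export BERT_MODEL=bert-base-german-cased
# Used in the output folder name, e.g. models/bert-base-german-cased-coarse-512
export LABEL_TYPE=coarse

# The coarse-grained splits (lerc_*.conll) are generated in the next step
export TRAIN=$DATA_DIR/lerc_train.conll
export DEV=$DATA_DIR/lerc_dev.conll
export TEST=$DATA_DIR/lerc_test.conll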
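
In a Jupyter notebook, each `%%bash` cell runs in its own shell process, so variables exported in one cell are not visible in the next. One possible workaround, sketched below, is to set the variables from a Python cell through `os.environ`, which later `%%bash` cells inherit. The value `coarse` for `LABEL_TYPE` and the use of the `lerc_*.conll` files are assumptions, inferred from the folder name `models/bert-base-german-cased-coarse-512` and from the coarse-grained files generated in the next step.

```python
import os

# Values copied from the bash cell above. LABEL_TYPE is an assumption inferred
# from the output folder name used in the Evaluation section.
settings = {
    "DATA_DIR": "data",
    "MODEL_DIR": "models",
    "MAX_LENGTH": "512",
    "BERT_MODEL": "bert-base-german-cased",
    "LABEL_TYPE": "coarse",
}
os.environ.update(settings)

# Assumption: the splits point at the coarse-grained lerc_*.conll files.
for split in ("train", "dev", "test"):
    os.environ[split.upper()] = f"{settings['DATA_DIR']}/lerc_{split}.conll"

# The folder name that the training commands build for --output_dir.
output_dir = "{MODEL_DIR}/{BERT_MODEL}-{LABEL_TYPE}-{MAX_LENGTH}".format(**settings)
print(output_dir)
```

Run this in a regular Python cell before the `%%bash` cells. Child processes started afterwards inherit the variables.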

First generate coarse-grained labels from the dataset:

In [None]:
%%bash
python3 src/generate_coarse_labels.py $DATA_DIR/ler_train.conll $DATA_DIR/lerc_train.conll
python3 src/generate_coarse_labels.py $DATA_DIR/ler_dev.conll $DATA_DIR/lerc_dev.conll
python3 src/generate_coarse_labels.py $DATA_DIR/ler_test.conll $DATA_DIR/lerc_test.conll
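
To illustrate what this conversion amounts to, here is a minimal sketch that maps the 19 fine-grained classes of the German LER dataset onto the 7 coarse-grained ones. The mapping follows the class description published with the dataset. The function names are hypothetical, and the actual `src/generate_coarse_labels.py` may be implemented differently.

```python
# Fine-grained -> coarse-grained classes, as described for the German LER dataset.
FINE_TO_COARSE = {
    "PER": "PER", "RR": "PER", "AN": "PER",
    "LD": "LOC", "ST": "LOC", "STR": "LOC", "LDS": "LOC",
    "ORG": "ORG", "UN": "ORG", "INN": "ORG", "GRT": "ORG", "MRK": "ORG",
    "GS": "NRM", "VO": "NRM", "EUN": "NRM",
    "VS": "REG", "VT": "REG",
    "RS": "RS",
    "LIT": "LIT",
}


def to_coarse(tag: str) -> str:
    """Map a BIO tag such as 'B-GS' to its coarse-grained form ('B-NRM')."""
    if tag == "O":
        return tag
    prefix, _, fine = tag.partition("-")
    return f"{prefix}-{FINE_TO_COARSE[fine]}"


def convert_line(line: str) -> str:
    """Convert one 'token label' line; blank lines (sentence breaks) stay blank."""
    line = line.rstrip("\n")
    if not line.strip():
        return ""
    token, tag = line.rsplit(" ", 1)
    return f"{token} {to_coarse(tag)}"


sample = ["Das O", "BVerfG B-GRT", "in O", "Karlsruhe B-ST", ""]
for row in sample:
    print(convert_line(row))
```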

Then prepare the splits. The `preprocess.py` script splits long sentences into shorter ones whenever the maximum subtoken length is reached. Run it on the train, dev and test splits. Note that `run_ner.py` expects the files `train.txt`, `dev.txt` and `test.txt` for training and evaluation. Finally, collect all labels that occur in the splits into `labels.txt`.
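
The splitting idea can be sketched as follows: keep a running subtoken count per sentence and start a new sentence (an empty line) as soon as the next token would exceed the limit. This is an illustration, not the code of `src/preprocess.py`. The function `split_by_subtoken_budget` and the stand-in counter `toy_count` are hypothetical. With a real tokenizer, one would presumably pass something like `lambda t: len(tokenizer.tokenize(t))` and reserve room for the special tokens.

```python
from typing import Callable, Iterable, Iterator


def split_by_subtoken_budget(
    lines: Iterable[str],
    count_subtokens: Callable[[str], int],
    max_len: int,
) -> Iterator[str]:
    """Yield CoNLL lines, inserting a blank line whenever the next token
    would push the running subtoken count of a sentence past max_len."""
    used = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Existing sentence boundary: keep it and reset the counter.
            yield ""
            used = 0
            continue
        token = line.split(" ")[0]
        n = count_subtokens(token)
        if n == 0:
            # Tokens that produce no subtokens are dropped.
            continue
        if used + n > max_len:
            # Budget exceeded: close the current sentence here.
            yield ""
            used = 0
        yield line
        used += n


def toy_count(token: str) -> int:
    """Stand-in for a real subword tokenizer: one subtoken per 4 characters."""
    return (len(token) + 3) // 4


sample = ["Bundesverfassungsgericht B-ORG", "entscheidet O", "heute O", "", "Ja O"]
for row in split_by_subtoken_budget(sample, toy_count, max_len=8):
    print(row)
```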

In [None]:
%%bash
python3 src/preprocess.py $TRAIN $BERT_MODEL $MAX_LENGTH > $DATA_DIR/train.txt
python3 src/preprocess.py $DEV $BERT_MODEL $MAX_LENGTH > $DATA_DIR/dev.txt
python3 src/preprocess.py $TEST $BERT_MODEL $MAX_LENGTH > $DATA_DIR/test.txt
cat $DATA_DIR/lerc_train.conll $DATA_DIR/lerc_dev.conll $DATA_DIR/lerc_test.conll | cut -d " " -f 2 | grep -v "^$" | sort | uniq > $DATA_DIR/labels.txt
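
For readers less familiar with shell pipelines, the label collection in the last command can be approximated in Python as shown below. This is a sketch that assumes space-separated `token label` lines. The helper name `collect_labels` is hypothetical, and the sort order may differ from `sort` depending on the shell locale.

```python
from typing import Iterable, List


def collect_labels(lines: Iterable[str]) -> List[str]:
    """Return the sorted set of labels found in the second column."""
    labels = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        # Skip blank lines (sentence breaks) and lines without a label column.
        if len(fields) >= 2 and fields[1]:
            labels.add(fields[1])
    return sorted(labels)


sample = ["Das O", "BVerfG B-ORG", "", "Art. B-NRM", "1 I-NRM", "GG I-NRM"]
print(collect_labels(sample))
```

To process the real files, the lines of the three `lerc_*.conll` splits could be chained together and the result written to `data/labels.txt`, one label per line.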

## Training with PyTorch

To see what arguments can be set, run `python3 run_ner.py --help`.

BERT is trained on the train split and evaluated on the dev and test splits.

In [None]:
%%bash
python3 run_ner.py --data_dir $DATA_DIR \
    --labels $DATA_DIR/labels.txt \
    --task_type NER \
    --model_name_or_path $BERT_MODEL \
    --output_dir $MODEL_DIR/$BERT_MODEL-$LABEL_TYPE-$MAX_LENGTH \
    --max_seq_length $MAX_LENGTH \
    --num_train_epochs 3 \
    --per_gpu_train_batch_size 12 \
    --learning_rate 1e-5 \
    --save_steps 7500 \
    --seed 1 \
    --do_train \
    --do_eval \
    --do_predict

Results on the dev set:

```bash
11/03/2022 16:30:04 - INFO - __main__ - ***** Eval results *****
11/03/2022 16:30:04 - INFO - __main__ -   eval_precision = 0.9312814556716996
11/03/2022 16:30:04 - INFO - __main__ -   eval_recall = 0.953806502775575
11/03/2022 16:30:04 - INFO - __main__ -   eval_f1 = 0.9424094025465231
```

Results on the test set:

```bash
[INFO|trainer.py:2753] 2022-11-03 16:30:05,402 >> ***** Running Prediction *****
11/03/2022 16:37:15 - INFO - __main__ -   test_precision = 0.9537241887905604
11/03/2022 16:37:15 - INFO - __main__ -   test_recall = 0.9720030063885757
11/03/2022 16:37:15 - INFO - __main__ -   test_f1 = 0.9627768471989577
```
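
As a quick sanity check, the reported F1 values can be recomputed from the logged precision and recall, since F1 is their harmonic mean. The snippet below only re-derives numbers that already appear in the logs above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the dev and test logs.
dev_f1 = f1_score(0.9312814556716996, 0.953806502775575)
test_f1 = f1_score(0.9537241887905604, 0.9720030063885757)

print(round(dev_f1, 4), round(test_f1, 4))  # → 0.9424 0.9628
```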

## Training with TensorFlow

To see which arguments can be set, run `python3 run_tf_ner.py --help`.

In [None]:
%%bash
python3 run_tf_ner.py --data_dir $DATA_DIR \
    --labels $DATA_DIR/labels.txt \
    --task_type NER \
    --model_name_or_path $BERT_MODEL \
    --output_dir $MODEL_DIR/$BERT_MODEL-$LABEL_TYPE-$MAX_LENGTH \
    --max_seq_length $MAX_LENGTH \
    --num_train_epochs 3 \
    --per_gpu_train_batch_size 12 \
    --learning_rate 1e-5 \
    --save_steps 7500 \
    --seed 1 \
    --do_train \
    --do_eval \
    --do_predict

Results on the dev set:

```
11/03/2022 16:46:16 - INFO - __main__ - ***Classification report for dev split***
              precision    recall  f1-score   support

         LIT       0.90      0.94      0.92       264
         LOC       0.92      0.91      0.91       195
         NRM       0.97      0.98      0.98      1887
         ORG       0.87      0.91      0.89       807
         PER       0.95      0.96      0.95       348
         REG       0.83      0.92      0.87       334
          RS       0.94      0.96      0.95      1209

   micro avg       0.93      0.95      0.94      5044
   macro avg       0.91      0.94      0.93      5044
weighted avg       0.93      0.95      0.94      5044
```

Results on the test set:

```
11/03/2022 16:58:31 - INFO - __main__ - ***Classification report for test split***
              precision    recall  f1-score   support

         LIT       0.93      0.97      0.95       314
         LOC       0.95      0.95      0.95       250
         NRM       0.97      0.98      0.98      2039
         ORG       0.93      0.96      0.95       796
         PER       0.97      0.95      0.96       324
         REG       0.87      0.95      0.91       354
          RS       0.96      0.98      0.97      1245

   micro avg       0.95      0.97      0.96      5322
   macro avg       0.94      0.96      0.95      5322
weighted avg       0.95      0.97      0.96      5322
```
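
The difference between the `macro avg` and `weighted avg` rows can be made concrete by recomputing both F1 averages from the per-class rows. The macro average weights every class equally, whereas the weighted average weights each class by its support. Because the inputs below are the two-decimal values from the reports, the results agree with the printed averages only to within rounding.

```python
# (f1-score, support) per class, copied from the classification reports above.
dev_report = {
    "LIT": (0.92, 264), "LOC": (0.91, 195), "NRM": (0.98, 1887),
    "ORG": (0.89, 807), "PER": (0.95, 348), "REG": (0.87, 334),
    "RS": (0.94 + 0.01, 1209),  # 0.95 in the report
}
test_report = {
    "LIT": (0.95, 314), "LOC": (0.95, 250), "NRM": (0.98, 2039),
    "ORG": (0.95, 796), "PER": (0.96, 324), "REG": (0.91, 354),
    "RS": (0.97, 1245),
}


def averages(report):
    """Return (total support, macro-averaged F1, support-weighted F1)."""
    total = sum(support for _, support in report.values())
    macro = sum(f1 for f1, _ in report.values()) / len(report)
    weighted = sum(f1 * support for f1, support in report.values()) / total
    return total, macro, weighted


for name, report in (("dev", dev_report), ("test", test_report)):
    total, macro, weighted = averages(report)
    print(f"{name}: support={total} macro_f1={macro:.3f} weighted_f1={weighted:.3f}")
```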

In [None]:
%%bash
./fine_run.sh bert-base-german-cased

## Evaluation

The model (e.g. `bert-base-german-cased`, saved in the folder `models/bert-base-german-cased-coarse-512`) can be evaluated with PyTorch or TensorFlow using the following commands:

In [None]:
%%bash
python3 run_ner.py --data_dir data \
    --labels data/labels.txt \
    --task_type NER \
    --model_name_or_path models/bert-base-german-cased-coarse-512 \
    --output_dir models/bert-base-german-cased-coarse-512 \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --do_eval \
    --do_predict

In [None]:
%%bash
python3 run_tf_ner.py --data_dir data \
    --labels data/labels.txt \
    --task_type NER \
    --model_name_or_path models/bert-base-german-cased-coarse-512 \
    --output_dir models/bert-base-german-cased-coarse-512 \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --do_eval \
    --do_predict