# How to train BiLSTM-CNN-CRF with [German LER Dataset](https://github.com/elenanereiss/Legal-Entity-Recognition)

Training uses the [BiLSTM-CNN-CRF](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf) implementation for sequence tagging. See that GitHub repository for more information:
> This repository contains a BiLSTM-CRF implementation that used for NLP Sequence Tagging (for example POS-tagging, Chunking, or Named Entity Recognition). The implementation is based on Keras 2.2.0 and can be run with Tensorflow 1.8.0 as backend. It was optimized for Python 3.5 / 3.6. It does not work with Python 2.7.


# Create an environment

The recommended setup is a dedicated conda environment with Python 3.6:

In [None]:
!conda create -y -n ler python=3.6

To activate the environment:

In [None]:
!conda activate ler
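
Because the upstream code targets Python 3.5 / 3.6, it can help to confirm which interpreter the notebook kernel is actually running before installing anything. The following is a minimal sketch; the helper name `is_supported_python` and the `SUPPORTED` set are illustrative assumptions, not part of the repository.

```python
import sys

# Versions named as supported in the upstream README (Python 3.5 / 3.6).
SUPPORTED = {(3, 5), (3, 6)}


def is_supported_python(version_info=sys.version_info):
    """Return True if the major.minor version is one the upstream code targets."""
    return tuple(version_info[:2]) in SUPPORTED


if not is_supported_python():
    # Only a warning: the check does not stop the notebook.
    print("Warning: running Python {}.{}; the upstream code targets 3.5 / 3.6".format(
        sys.version_info[0], sys.version_info[1]))
```

If the warning appears, the kernel is probably not the one from the `ler` environment.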

# Install requirements

In [None]:
!pip install -r requirements.txt

Install TensorFlow from a wheel. I used the following build for Windows and Python 3.6:

In [None]:
!wget https://raw.githubusercontent.com/fo40225/tensorflow-windows-wheel/master/1.8.0/py36/CPU/sse2/tensorflow-1.8.0-cp36-cp36m-win_amd64.whl

In [None]:
!pip install tensorflow-1.8.0-cp36-cp36m-win_amd64.whl

# Download dataset
Download the [German LER Dataset](https://github.com/elenanereiss/Legal-Entity-Recognition) splits (train, dev, test) from GitHub and save them in the `data` folder.

In [None]:
%%bash
wget https://raw.githubusercontent.com/elenanereiss/Legal-Entity-Recognition/master/data/ler_train.conll -P data
wget https://raw.githubusercontent.com/elenanereiss/Legal-Entity-Recognition/master/data/ler_dev.conll -P data
wget https://raw.githubusercontent.com/elenanereiss/Legal-Entity-Recognition/master/data/ler_test.conll -P data
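
After downloading, a quick sanity check of the files can catch truncated or misplaced downloads. The sketch below assumes each line holds a token and its label separated by whitespace, with blank lines between sentences (the same layout as the example in the Tagger section). The helpers `read_conll` and `label_counts` are illustrative and not part of the repository.

```python
from collections import Counter


def read_conll(lines):
    """Group 'token label' lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last whitespace run so the label is always the final field.
        token, label = line.rsplit(None, 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how often each label occurs across all sentences."""
    return Counter(label for sentence in sentences for _, label in sentence)


# Usage, assuming the file was saved to the data folder as described above:
# with open("data/ler_train.conll", encoding="utf-8") as f:
#     sentences = read_conll(f)
# print(len(sentences), label_counts(sentences).most_common(5))
```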

# Train

Three models are available for training:
- BiLSTM-CRF: modelName=`blstm-crf`;
- BiLSTM-CRF with character embeddings from BiLSTM: modelName=`char-blstm-crf`;
- BiLSTM-CNN-CRF with character embeddings from CNN: modelName=`blstm-cnn-crf`.

I use `char-blstm-crf` here because it gives the best results of the three.

In [None]:
from train import run_training

model_name = "char-blstm-crf"
train = "data/ler_train.conll"
dev = "data/ler_dev.conll"
test = "data/ler_test.conll"
run_training(model_name, train, dev, test)

# Evaluation

To evaluate the model stored in `models/char-blstm-crf.h5`, we first need predictions for the test split `ler_test.conll`. The predictions are written to the file `ler_test_pred.conll`. After that, we can generate an entity-level classification report that compares the gold labels with the predictions.

In [None]:
from predict import write_predictions

gold_labels = "data/ler_test.conll"
predictions = "data/ler_test_pred.conll"
model = "models/{}.h5".format(model_name)

write_predictions(model, gold_labels, predictions)

In [None]:
from evaluate import classification_report_strict

classification_report_strict(gold_labels, predictions)
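
To illustrate what an entity-level (strict) comparison means, here is a minimal sketch that counts a predicted entity as correct only if its span and type both match a gold entity exactly. It is not the code behind `classification_report_strict`; the helpers `extract_entities` and `strict_prf` are hypothetical and operate on plain lists of BIO labels.

```python
def extract_entities(labels):
    """Return a set of (start, end, type) spans from BIO labels (end is exclusive)."""
    spans = set()
    start, current = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for idx, label in enumerate(list(labels) + ["O"]):
        if label == "O":
            bio, tag = "O", None
        else:
            bio, tag = label.split("-", 1)
        # Close the open span when it ends or when a new entity begins.
        if current is not None and (bio in ("O", "B") or tag != current):
            spans.add((start, idx, current))
            start, current = None, None
        # Open a new span (an I- label without a preceding B- is treated as a start).
        if bio != "O" and current is None:
            start, current = idx, tag
    return spans


def strict_prf(gold, pred):
    """Precision, recall and F1 over exactly matching entity spans."""
    gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["O", "B-INN", "I-INN", "O", "B-GS", "I-GS"]
pred = ["O", "B-INN", "O", "O", "B-GS", "I-GS"]
print(strict_prf(gold, pred))  # → (0.5, 0.5, 0.5)
```

In this example the `INN` entity is predicted with the wrong boundary, so it counts as both a false positive and a false negative even though one of its tokens was tagged correctly.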

# Tagger

Pretty-print a tagged sentence as HTML in the notebook with the `tagger` function and IPython.

In [None]:
import IPython
import util.styles as style_config

text = '''
Ob O
die O
Europäische B-INN
Kommission I-INN
in O
Anwendung O
der O
Grundsätze O
der O
Nr. B-EUN
89 I-EUN
der I-EUN
Vertikal-Leitlinien I-EUN
eine O
andere O
Auffassung O
vertrete O
, O
sei O
unerheblich O
, O
weil O
es O
bei O
der O
Feststellung O
des O
relevanten O
Marktes O
im O
Sinne O
des O
§ B-GS
18 I-GS
Abs. I-GS
1 I-GS
GWB I-GS
um O
eine O
Frage O
des O
nationalen O
Rechts O
gehe O
. O

'''


def normalize(phrase):
    tokens={'" ': '"', '( ': '(', '[ ': '[', ' )': ')', ' .': '.', ' ,': ',', ' ;': ';', ' :': ':', ' ]': ']', ' ?': '?', ' !': '!', ' /': '/', '/ ': '/'}
    for token_with_space, token_without_space in tokens.items():
        phrase = phrase.replace(token_with_space, token_without_space)
    return phrase

def tagger(conll_text):
    tokens = []
    labels = []
    
    for line in conll_text.split("\n"):
        # Skip blank or whitespace-only lines; split on any whitespace run.
        if line.strip():
            token, label = line.split()
            tokens.append(token)
            labels.append(label)

    non_entity = ""
    entity = ""
    entity_label = ""
    sentence = ""
    
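    for idx in range(len(tokens)):
        if labels[idx] == "O":
            if entity != "":
                # An entity just ended: emit it with the label stored when it began.
                sentence += '<span class="spark-nlp-display-entity-wrapper" style="background-color: {}"><span class="spark-nlp-display-entity-name">{} </span><span class="spark-nlp-display-entity-type">{}</span></span>\n'.format(label_colors[entity_label], normalize(entity), entity_label)
                entity = ""
                entity_label = ""

            non_entity += tokens[idx] + " "
            if idx == len(tokens)-1:
                sentence += '<span class="spark-nlp-display-others" style="background-color: white">{}</span>\n'.format(normalize(non_entity))
        else:
            # Split only on the first hyphen so labels that contain "-" stay intact.
            bio, tag = labels[idx].split("-", 1)

            # Start a new entity on "B", or on an "I" that has no open entity.
            if bio == "B" or entity == "":
                if entity != "":
                    # Emit an entity that is immediately followed by another one.
                    sentence += '<span class="spark-nlp-display-entity-wrapper" style="background-color: {}"><span class="spark-nlp-display-entity-name">{} </span><span class="spark-nlp-display-entity-type">{}</span></span>\n'.format(label_colors[entity_label], normalize(entity), entity_label)
                    entity = ""
                sentence += '<span class="spark-nlp-display-others" style="background-color: white">{} </span>\n'.format(normalize(non_entity))
                non_entity = ""
                entity_label = tag

            entity += tokens[idx] + " "
            if idx == len(tokens)-1:
                sentence += '<span class="spark-nlp-display-entity-wrapper" style="background-color: {}"><span class="spark-nlp-display-entity-name">{}</span><span class="spark-nlp-display-entity-type">{}</span></span>\n'.format(label_colors[entity_label], normalize(entity), entity_label)
    return sentence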

label_colors = style_config.COLORS

html_content = tagger(text)
html_content_save = style_config.STYLE_CONFIG_ENTITIES + " " + html_content
IPython.display.HTML(html_content_save)