<a href="https://colab.research.google.com/github/fhswf/NLP_BERT/blob/master/NER_BERT_GermEval2014.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### License

Copyright 2020 Christian Gawron (gawron.christian@fh-swf.de)

Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.

Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Fine-Tuning BERT for German Named Entity Recognition

## Download and clean data for GermEval 2014 NER task

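The following commands download the GermEval 2014 training, development, and test sets and convert them to a format compatible with CoNLL 2003: comment lines starting with `#` are dropped, and only columns 2 and 3 (the token and its first-level NER tag) are kept, separated by a single space. Each download is skipped if its output file already exists.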

In [2]:
!test -d data/ner || mkdir -p data/ner
!test -e data/ner/train.txt.tmp || curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > data/ner/train.txt.tmp
!test -e data/ner/dev.txt.tmp || curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > data/ner/dev.txt.tmp
!test -e data/ner/test.txt.tmp || curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' | grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > data/ner/test.txt.tmp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   566    0   566    0     0   2388      0 --:--:-- --:--:-- --:--:--  2388
100 7697k    0 7697k    0     0  4536k      0 --:--:--  0:00:01 --:--:-- 6177k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   564    0   564    0     0   9096      0 --:--:-- --:--:-- --:--:--  9096
100  706k    0  706k    0     0  2388k      0 --:--:-- --:--:-- --:--:-- 2388k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   565    0   565    0     0  12282      0 --:--:-- --:--:-- --:--:-- 12282
100 1643k    0 1643k    0     0  4861k      0 --:--:-- --:--:-- --:--:-- 4861k
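
To make the shell pipeline above easier to follow, here is a minimal Python sketch of the same conversion (`grep -v "^#" | cut -f 2,3 | tr '\t' ' '`). The function name `germeval_to_conll` and the three sample lines are hypothetical and only serve as an illustration; they are not taken from the data set or from the original notebook.

```python
def germeval_to_conll(lines):
    """Mimic: grep -v "^#" | cut -f 2,3 | tr '\\t' ' ' (illustrative helper)."""
    converted = []
    for line in lines:
        line = line.rstrip("\n")
        # grep -v "^#": drop comment lines
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 1:
            # cut passes lines without a tab through unchanged,
            # so the blank lines that separate sentences survive
            converted.append(line)
        else:
            # cut -f 2,3 keeps token and first-level tag; tr joins them with a space
            converted.append(" ".join(fields[1:3]))
    return converted


# Hypothetical sample in the four-column GermEval layout
sample = [
    "#\thttp://example.org/some-article [2009-10-17]",
    "1\tSchartau\tB-PER\tO",
    "2\tsagte\tO\tO",
    "",
]
converted = germeval_to_conll(sample)
print("\n".join(converted))
```

The blank line at the end of a sentence is preserved, which is what the CoNLL-style readers rely on to detect sentence boundaries.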


### Data cleanup
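The GermEval 2014 data set contains some tokens that consist of control characters such as `\x96` or `\x95`. The BERT tokenizer produces no subword tokens for them, so they cannot be processed (see the [README of the original example](https://github.com/huggingface/transformers/blob/master/examples/token-classification/README.md)).

The following script by Stefan Schweter (see [GitHub](https://github.com/huggingface/transformers/blob/master/examples/token-classification/scripts/preprocess.py)) removes these tokens. It also inserts a sentence break whenever the subword tokens of a sentence would exceed the maximum sequence length. The script takes the input file, the model name, and the maximum sequence length as command-line arguments and writes the cleaned data to standard output.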

In [0]:
MODEL = 'bert-base-german-dbmdz-cased' #@param {type:"string"}

In [0]:
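import sys
from transformers import AutoTokenizer

# Command-line arguments: CoNLL-style input file, model name or path,
# and maximum sequence length. The cleaned data is written to stdout.
dataset = sys.argv[1]
model_name_or_path = sys.argv[2]
max_len = int(sys.argv[3])

# Number of subword tokens seen so far in the current sentence
subword_len_counter = 0

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# Reserve room for the special tokens the tokenizer adds (e.g. [CLS] and [SEP])
max_len -= tokenizer.num_special_tokens_to_add()

# Read explicitly as UTF-8 so the result does not depend on the platform default
with open(dataset, "rt", encoding="utf-8") as f_p:
    for line in f_p:
        line = line.rstrip()

        # An empty line marks a sentence boundary: keep it and reset the counter
        if not line:
            print(line)
            subword_len_counter = 0
            continue

        # Each line has the form "<token> <tag>"; only the token is tokenized
        token = line.split()[0]

        current_subwords_len = len(tokenizer.tokenize(token))

        # Token contains strange control characters like \x96 or \x95
        # Just filter out the complete line
        if current_subwords_len == 0:
            continue

        # The sentence would exceed max_len: start a new sentence at this token
        if (subword_len_counter + current_subwords_len) > max_len:
            print("")
            print(line)
            subword_len_counter = current_subwords_len
            continue

        subword_len_counter += current_subwords_len

        print(line)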
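
The script above reads its inputs from `sys.argv` and prints to standard output, which makes the filtering and splitting logic hard to try out in isolation. The following is a minimal sketch of the same logic as a pure function. The names `split_long_sentences` and `subword_len`, as well as the toy subword lengths, are assumptions made for illustration; in practice `subword_len` would be something like `lambda tok: len(tokenizer.tokenize(tok))`.

```python
def split_long_sentences(lines, subword_len, max_len):
    """Drop tokens without subwords and break sentences longer than max_len.

    `subword_len` is a callable returning the number of subword tokens
    for a word (hypothetical stand-in for a real tokenizer).
    """
    out = []
    counter = 0  # subword tokens in the current sentence so far
    for line in lines:
        line = line.rstrip()
        if not line:
            # Sentence boundary: keep it and reset the counter
            out.append(line)
            counter = 0
            continue
        n = subword_len(line.split()[0])
        if n == 0:
            # Token yields no subwords (e.g. a control character): drop the line
            continue
        if counter + n > max_len:
            # Sentence would become too long: insert a break before this token
            out.append("")
            out.append(line)
            counter = n
            continue
        counter += n
        out.append(line)
    return out


# Toy subword lengths instead of a real tokenizer (assumed values)
toy_lengths = {"Das": 1, "Donaudampfschiff": 4, "\x96": 0, "fährt": 1}
lines = ["Das O", "Donaudampfschiff O", "\x96 O", "fährt O", "", "Ende O"]
result = split_long_sentences(lines, lambda tok: toy_lengths.get(tok, 1), max_len=4)
print(result)
```

With `max_len=4`, the control-character line disappears, and a sentence break is inserted before `Donaudampfschiff` and before `fährt`, because adding either token would push the running subword count above the limit.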