# Medical Named Entity Recognition
## Setup:
1. Download the [**NER**](https://cloud.dfki.de/owncloud/index.php/s/WWbnqJ6N8gQQWMD) Model and place it in the **Resources** Folder.
2. Create a Python virtual environment *(e.g. with pyenv or Miniconda)*.
3. Activate the virtual environment.
4. Install the required Python libraries *(python -m pip install -r requierments_ner.txt)*.
5. You now have an isolated sandbox to experiment in.

## Prediction
Import the relevant libraries:

In [1]:
from typing import List

import spacy
from flair.data import Corpus, Sentence, Token
from flair.datasets import ColumnCorpus
from flair.embeddings import (FlairEmbeddings, PooledFlairEmbeddings,
                              StackedEmbeddings, TokenEmbeddings,
                              TransformerWordEmbeddings, WordEmbeddings)
from flair.models import SequenceTagger




Start by loading the NER Model:<br />
**model=** *PATH_TO_NER_MODEL* *(.pt file)* <br />

In [2]:
NerTagger: SequenceTagger = SequenceTagger.load(model='Resources/named_entity_recognition_mex_model(custom_flair_embeddings).pt')

2021-03-17 16:39:44,579 loading file Resources/named_entity_recognition_mex_model(custom_flair_embeddings).pt


Provide the input text, either read from a file or given directly as a string:

In [3]:
input_sentence: str = """Insgesamt gutes Befinden, keine Kraempfe, gute Diurese.
RR gut eingestellt, weiter sehr gute Nierenfunktion. Leberwerte ruecklaeufig. Keine Oedeme.
Im Sono kein Stau.
"""
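
If the text lives in a file rather than in a string literal, it can be read with the standard library. This is only a minimal sketch: the helper name `load_text` is hypothetical and not part of the original notebook, and UTF-8 encoding is an assumption about your data.

```python
from pathlib import Path


def load_text(path: str, encoding: str = "utf-8") -> str:
    """Return the full content of a text file as a single string.

    `load_text` is an illustrative helper; UTF-8 is an assumed default.
    """
    return Path(path).read_text(encoding=encoding)
```

The returned string can then be used in place of `input_sentence`.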

In [None]:
!python -m spacy download de_core_news_sm

Load the default spaCy model for the desired language.<br />
Note: install the model beforehand *(python -m spacy download **MODEL_NAME**)*.<br />
Model list: https://spacy.io/models

In [4]:
nlp = spacy.load("de_core_news_sm")

This applies tokenization and sentence splitting to the given text:

In [6]:
doc = nlp(input_sentence)

Convert the spaCy sentences into the format expected for prediction (one Flair `Sentence` per spaCy sentence):

In [7]:
# Build one Flair Sentence per spaCy sentence, reusing spaCy's tokenization
sentences: List[Sentence] = []
for sent in doc.sents:
    tmpSent: Sentence = Sentence()
    for token in sent:
        tmpSent.add_token(Token(token.text))
    sentences.append(tmpSent)

Iterate over the sentences and run the tagger on each one:

In [8]:
for sent in sentences:
    NerTagger.predict(sent)

first_sentence: Sentence = sentences[0]

### Display Results:
String Embedded option

In [9]:
print(first_sentence.to_tagged_string())

Insgesamt gutes <B-State_of_health> Befinden <I-State_of_health> , keine Kraempfe <B-Medical_condition> , gute <B-State_of_health> Diurese <B-Process> .


Annotated Spans option

In [10]:
for entity in first_sentence.get_spans('ner'):
    print(entity)

Span [2,3]: "gutes Befinden"   [− Labels: State_of_health (0.9802)]
Span [6]: "Kraempfe"   [− Labels: Medical_condition (0.9969)]
Span [8]: "gute"   [− Labels: State_of_health (0.9964)]
Span [9]: "Diurese"   [− Labels: Process (0.999)]


Dictionary Format option

In [11]:
print(first_sentence.to_dict(tag_type='ner'))

{'text': 'Insgesamt gutes Befinden , keine Kraempfe , gute Diurese .', 'labels': [], 'entities': [{'text': 'gutes Befinden', 'start_pos': None, 'end_pos': None, 'labels': [State_of_health (0.9802)]}, {'text': 'Kraempfe', 'start_pos': None, 'end_pos': None, 'labels': [Medical_condition (0.9969)]}, {'text': 'gute', 'start_pos': None, 'end_pos': None, 'labels': [State_of_health (0.9964)]}, {'text': 'Diurese', 'start_pos': None, 'end_pos': None, 'labels': [Process (0.999)]}]}


Custom Format option (token, tag and confidence score per line, shown here for the first two sentences)

In [12]:
# Print token, predicted tag and confidence for the first two sentences
for sent in sentences[:2]:
    for token in sent.tokens:
        tag = token.get_tag('ner')
        print(token.text, tag.value, tag.score)


Insgesamt O 0.6336771249771118
gutes B-State_of_health 0.9609906077384949
Befinden I-State_of_health 0.9993923902511597
, O 0.9906901121139526
keine O 0.9997430443763733
Kraempfe B-Medical_condition 0.9969049096107483
, O 0.9995402097702026
gute B-State_of_health 0.9963962435722351
Diurese B-Process 0.998982846736908
. O 0.9842742085456848
RR B-Process 0.9997145533561707
gut B-State_of_health 0.9910326600074768
eingestellt I-State_of_health 0.999576985836029
, O 0.998110294342041
weiter O 0.6509582996368408
sehr B-State_of_health 0.6795549392700195
gute I-State_of_health 0.9824131727218628
Nierenfunktion B-Body_part 0.9956809282302856
. O 0.9991376399993896
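
The per-token tags above can be grouped back into entities without going through Flair. The following is a minimal sketch in plain Python, not the notebook's own method: the function name `iob_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is an assumption about how malformed sequences should be handled.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-Process" -> ("B", "Process"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current_label == label:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity first.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            # "B-" opens a new entity; a stray "I-" is treated the same way
            # (assumption made for this sketch).
            current_tokens, current_label = [token], label

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Insgesamt", "gutes", "Befinden", ",", "keine",
          "Kraempfe", ",", "gute", "Diurese", "."]
tags = ["O", "B-State_of_health", "I-State_of_health", "O", "O",
        "B-Medical_condition", "O", "B-State_of_health", "B-Process", "O"]
print(iob_to_spans(tokens, tags))
```

For the first sentence this yields the same four entities as the annotated spans output: "gutes Befinden", "Kraempfe", "gute" and "Diurese".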


## Training
### Prepare / Read-in Data:
Data should follow a [CoNLL-U](https://universaldependencies.org/format.html)-style column format (one token per line, followed by its tag),<br />
and the entities should be labeled using the [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format.<br />
Data example:<br />
* Insgesamt O
* gutes B-State_of_health
* Befinden I-State_of_health
* , O
* keine O
* Kraempfe B-Medical_condition
* , O
* gute B-State_of_health
* Diurese B-Process
* . O <br />
**(AN EMPTY LINE MARKS THE START OF A NEW SENTENCE)** <br />
* RR B-Process
* gut B-State_of_health
* eingestellt I-State_of_health
* , O
* weiter O
* sehr B-State_of_health
* gute I-State_of_health
* Nierenfunktion B-Body_part
* . O <br />
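
The layout shown above can be produced programmatically. The sketch below is one possible way to do it in plain Python; the function name `to_column_format` is hypothetical, and the tab delimiter is an assumption chosen to match the `column_delimiter='\t'` setting used in the training code.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def to_column_format(sentences: Iterable[TaggedSentence],
                     delimiter: str = "\t") -> str:
    """Serialize tagged sentences as 'token<delimiter>tag' lines,
    with an empty line after each sentence."""
    lines: List[str] = []
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token}{delimiter}{tag}")
        lines.append("")  # empty line marks the sentence boundary
    return ("\n".join(lines) + "\n") if lines else ""


example = [
    [("Keine", "O"), ("Oedeme", "B-Medical_condition"), (".", "O")],
]
print(to_column_format(example))
```

The resulting string can be written to `train.txt`, `dev.txt` or `test.txt` with an ordinary file write.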

Split the data into train/dev/test sets.<br />
Define the columns of your data; these may include other features such as POS tags, depending on your use case.
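
Splitting should happen at sentence level so that no sentence is cut in half. A minimal sketch in plain Python follows; the function names, the 80/10/10 ratios and the fixed seed are all assumptions made for illustration, not values taken from the original notebook.

```python
import random
from typing import List, Sequence, Tuple

Sentence_ = List[str]  # one sentence = its raw "token<TAB>tag" lines


def read_sentences(text: str) -> List[Sentence_]:
    """Split column-formatted text into sentences at empty lines."""
    sentences: List[Sentence_] = []
    current: Sentence_ = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences: Sequence[Sentence_],
                    dev_ratio: float = 0.1,
                    test_ratio: float = 0.1,
                    seed: int = 42) -> Tuple[list, list, list]:
    """Shuffle the sentences and return (train, dev, test)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test
```

Each of the three lists can then be written to its own file, with the sentences again separated by empty lines.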

**Initialize embeddings:** either use the default models listed under [Flair-Models](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/FLAIR_EMBEDDINGS.md),
or train / fine-tune your own model and pass the path to the model file.

In [None]:
# define columns
columns = {0: 'text', 1: 'ner'}

# 1. init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus('PATH_TO_YOUR_DATA_FOLDER', columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt',
                              column_delimiter='\t',
                              document_separator_token='\n')

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

embedding_types: List[TokenEmbeddings] = [
        WordEmbeddings('de'),
        FlairEmbeddings("de-forward"),
        FlairEmbeddings("de-backward"),
        PooledFlairEmbeddings('german-forward'),
        PooledFlairEmbeddings('german-backward'),
]

# Stack the embeddings
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True,
                                        locked_dropout=0.3)

# initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus, use_tensorboard=False)

# Start Training
trainer.train('PATH_TO_SAVE_NEW_MODEL_UNDER',
              train_with_dev=False,
              max_epochs=25,
              mini_batch_size=65)

## Fine-Tuning

In [None]:
# define columns
columns = {0: 'text', 1: 'ner'}

# 1. init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus('PATH_TO_YOUR_DATA_FOLDER', columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt',
                              column_delimiter='\t',
                              document_separator_token='\n')

# 2. load the pre-trained sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger.load("PATH_TO_EXISTING_MODEL")

# 3. initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 4. fine-tune on the target corpus
trainer.train(
    base_path="PATH_TO_SAVE_NEW_MODEL_UNDER",
    train_with_dev=False,
    max_epochs=200,
    learning_rate=0.1,
    mini_batch_size=32
)