# Named Entity Recognition for Healthcare with SparkNLP NerDL and NerCRF - Data Preparation and Model Evaluation

## Why is Data Preparation Important?

It's simple to use a pretrained named entity recognition (NER) model, but sometimes you need to train your own model to get the best results. This tutorial shows you how to prepare your healthcare training data and train your own NER model using Python and SparkNLP. SparkNLP NerDL has cutting-edge scores on the BC2GM dataset (micro-average F1: 0.87) and other benchmark datasets. You need the licensed SparkNLP Clinical embeddings to get those scores on healthcare data, but GloVe embeddings still perform well. I'll show you how to train and evaluate your NerCRF and NerDL models on the BC5CDR-Chem dataset using GloVe embeddings.

### Preparing the Training Data

To train a NerCRF or NerDL model, you will need to put your tokens and entity labels into a space-separated format called CoNLL. A CoNLL file puts each token of a sentence on a different line, and separates each sentence with an empty line. Each line has four columns: the token, a part-of-speech tag, a chunk tag, and the entity label. In the following Python example I will annotate one sentence and save it in CoNLL format.

In [None]:
#Create some tokens
tokens=['An', 'apple', 'a', 'day', 'keeps', 'the', 'doctor', 'away', '.']

#Create part of speech labels or use a place-holder value like "NN".
pos_labels=['DT', 'NN', 'DT', 'NN', 'VBZ', 'DT', 'NN', 'RB', '.']

#Create some named entity labels. 'O' labels mean no named entity was found.
entity_labels=['B-Treatment','I-Treatment','I-Treatment','I-Treatment','O','O','O','O','O']

Please notice the entity labels above. When an entity has more than one word, the label for the first word should begin with "B-" and the label for the following words should begin with "I-". Now let's save the tokens, parts-of-speech, and entity labels in CoNLL format.

In [None]:
conll_lines=''

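for token,pos,label in zip(tokens,pos_labels,entity_labels):
    #Columns: token, part of speech, chunk tag, entity label. The part-of-speech tag is reused as a place-holder for the chunk column.
    conll_lines+="{} {} {} {}\n".format(token, pos, pos, label)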

#Add another line break at the end of the sentence in order to create an empty line.
conll_lines+='\n'

#For this example I will print the lines instead of writing a .txt file.
print(conll_lines)


Please see the printed CoNLL above. "An" is the first word in "An apple a day" so it is labelled "B-Treatment", while "apple","a", and "day" are all labelled "I-Treatment". The words that are not "Treatments" are labelled with a capital "O".
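
Before training, it can help to check that every "I-" label really continues an entity of the same type. The helper below is a minimal sketch in plain Python; `find_bio_errors` is a hypothetical name used for illustration and is not part of SparkNLP.

```python
def find_bio_errors(labels):
    """Return (index, label) pairs where an I- label does not continue an entity.

    Hypothetical helper for illustration; it is not a SparkNLP function.
    """
    errors = []
    previous = 'O'
    for i, label in enumerate(labels):
        if label.startswith('I-'):
            entity_type = label[2:]
            # An I- label is valid only right after B-<type> or I-<type>.
            if previous not in ('B-' + entity_type, 'I-' + entity_type):
                errors.append((i, label))
        previous = label
    return errors


good_labels = ['B-Treatment', 'I-Treatment', 'I-Treatment', 'I-Treatment', 'O', 'O', 'O', 'O', 'O']
bad_labels = ['I-Treatment', 'I-Treatment', 'O', 'I-Test']

print(find_bio_errors(good_labels))  # []
print(find_bio_errors(bad_labels))   # [(0, 'I-Treatment'), (3, 'I-Test')]
```

An empty list means the label sequence follows the "B-" then "I-" convention described above.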

Here's another example of a sentence annotated in CoNLL format. The entity is "blood pressure".

In [None]:
#Create some tokens
tokens=['I','checked','my','blood','pressure','this','morning','.']

#Create part-of-speech labels or use a place-holder value like 'NN'.
pos_labels=['PRP', 'VBD', 'PRP', 'NN', 'NN', 'DT', 'NN', '.']

#Create some named entity labels. 'O' labels mean no named entity was found
entity_labels=['O','O','O','B-Test','I-Test','O','O','O']

In [None]:
conll_lines=''

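for token,pos,label in zip(tokens,pos_labels,entity_labels):
    #Columns: token, part of speech, chunk tag, entity label. The part-of-speech tag is reused as a place-holder for the chunk column.
    conll_lines+="{} {} {} {}\n".format(token, pos, pos, label)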

#Add another line break at the end of the sentence in order to create an empty line.
conll_lines+='\n'

#For this example I will print the lines instead of writing a .txt file.
print(conll_lines)

As you can see above, "blood" is the first word in the entity, so it is labelled "B-Test", while "pressure" is the second word in the entity, so it is labelled "I-Test". We do this so the model can tell that "blood pressure" is one whole entity, rather than the two separate entities "blood" and "pressure".
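
To see why the "B-" and "I-" prefixes matter, here is a rough sketch that groups labelled tokens back into whole entities. `bio_to_entities` is a hypothetical helper written for this illustration, not a SparkNLP function, and it silently skips any "I-" label that does not follow an entity of the same type.

```python
def bio_to_entities(tokens, labels):
    """Group B-/I- labelled tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration; it is not a SparkNLP function.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        # An I- label of the same type extends the entity that is open.
        if label.startswith('I-') and label[2:] == current_type:
            current_tokens.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if current_tokens:
            entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # A B- label opens a new entity.
        if label.startswith('B-'):
            current_tokens, current_type = [token], label[2:]
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


tokens = ['I', 'checked', 'my', 'blood', 'pressure', 'this', 'morning', '.']

# Labelled as one entity: B-Test followed by I-Test.
print(bio_to_entities(tokens, ['O', 'O', 'O', 'B-Test', 'I-Test', 'O', 'O', 'O']))
# [('blood pressure', 'Test')]

# Labelled with two B- tags: the same words become two separate entities.
print(bio_to_entities(tokens, ['O', 'O', 'O', 'B-Test', 'B-Test', 'O', 'O', 'O']))
# [('blood', 'Test'), ('pressure', 'Test')]
```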

Now let's work with some real datasets. First we have to load the data.

In [None]:
import os
! wget -O ncbi.tsv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/ner/conll-2003/NCBIdisease.tsv
! wget -O BC5CDRtrain.txt https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/ner/conll-2003/CRFtrain_dev.txt
! wget -O BC5CDRtest.txt https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/ner/conll-2003/CRFtest.txt

### How to Convert a Pandas Dataframe to CoNLL Format

In the next example I'll read from a Pandas dataframe and write a CoNLL file for NerDL. I'll use the sentence ID (sent_id) column to determine if I need to leave an empty line before a new sentence. Here are the first 5 lines of the dataframe:

In [None]:
import pandas as pd
ncbi=pd.read_csv('ncbi.tsv',sep='\t')

In [None]:
ncbi.head()

For NerDL the part-of-speech column is not used, but a CoNLL file must still have a part-of-speech column. Add a part-of-speech column with 'NN' or some other placeholder as the only value. If you already have a part-of-speech column, you don't need to take this step.

In [None]:
ncbi['pos']='NN'

My Pandas dataframe is called 'ncbi' and I've added a part-of-speech column which I've called 'pos'. Now write a CoNLL file using the columns of the Pandas dataframe as input.

In [None]:
# Placeholder output path for the CoNLL file (an assumed name, change it to suit your project).
file_loc='ncbi_conll.txt'

conll_lines="-DOCSTART- -X- -X- -O-\n\n"
save=None

for sent, token, pos, label in zip(ncbi['sent_id'],ncbi['token'],ncbi['pos'],ncbi['entity_label']):

    # If the sentence ID has changed, we are starting a new sentence, so add an empty line.
    # The first sentence is skipped because the header already ends with an empty line.
    if save is not None and save!=sent:
        conll_lines+='\n'

    # Save the CoNLL line
    conll_lines += "{} {} {} {}\n".format(token, pos, pos, label)

    save=sent

# Now write all of the lines to a text file. The "with" block closes the file automatically.
with open(file_loc,'w') as txtfile:
    txtfile.write(conll_lines)
    

If you look at the first 25 lines of the final CoNLL file, you'll see that rows containing only line breaks signal the beginning of a new sentence.

In [None]:
# The "with" block closes the file automatically, so no explicit close() is needed.
with open(file_loc,'r') as f:
    lines=f.readlines()[0:25]
lines
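
A quick sanity check on a CoNLL file is to parse it back into sentences and count them. The reader below is a minimal sketch in plain Python: `read_conll_sentences` is a hypothetical helper, the sample text is toy data rather than the NCBI file, and it assumes space-separated columns with the token first and the entity label last.

```python
def read_conll_sentences(conll_text):
    """Split CoNLL text into sentences of (token, label) pairs.

    Hypothetical helper for illustration; it is not a SparkNLP function.
    """
    sentences, current = [], []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # An empty line (or the document header) closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Toy CoNLL text for illustration only.
sample = (
    "-DOCSTART- -X- -X- -O-\n\n"
    "I NN NN O\n"
    "checked NN NN O\n"
    "my NN NN O\n"
    "blood NN NN B-Test\n"
    "pressure NN NN I-Test\n\n"
    "Take NN NN O\n"
    "aspirin NN NN B-Treatment\n"
)

sentences = read_conll_sentences(sample)
print(len(sentences), [len(s) for s in sentences])  # 2 [5, 2]
```

If the sentence count does not match the number of distinct sentence IDs in your dataframe, the empty lines were probably written in the wrong places.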

Now let's see SparkNLP's cutting-edge results! We'll train NerCRF and NerDL models on the BC5CDR-Chem benchmark dataset.

### Training and Evaluating NerCRF

NerCRF is a named entity recognition model in the SparkNLP library which is based on Conditional Random Fields. It requires part-of-speech tags for model training. To train a model with NerCRF, first import SparkNLP and start your Spark session. Then load the CoNLL file.

In [None]:
import sparknlp
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
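from sparknlp.annotator import *
from sparknlp.base import *
# sparknlp_jsl is the licensed SparkNLP library; the code in this tutorial does not call it directly.
from sparknlp_jsl.annotator import *
import sparknlp_jsl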
import pyspark.sql.functions as F

spark = sparknlp.start()

spark

In [None]:
from sparknlp.training import CoNLL

file_loc='BC5CDRtrain.txt'
train = CoNLL().readDataset(spark, file_loc)

In [None]:
from pyspark.sql import functions as F

train.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth")).groupBy('ground_truth').count()\
        .orderBy('count', ascending=False).show(100,truncate=False)

I will add GloVe embeddings to the dataset before NER training, but if you want better results with your healthcare projects, use SparkNLP Clinical embeddings. First, set up your pipeline and fit your model to your training dataset. The fitting process could take some time.

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")

nerTagger = NerCrfApproach()\
    .setInputCols(["sentence", "token", "pos","embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(9)

ner_pipeline = Pipeline(stages=[
          word_embeddings,
          nerTagger
 ])

In [None]:
ner_model = ner_pipeline.fit(train)

Next add word embeddings to your test dataset and make your predictions.

In [None]:
from sparknlp.training import CoNLL

file_loc='BC5CDRtest.txt'
test = CoNLL().readDataset(spark, file_loc)

test_data = word_embeddings.transform(test)


In [None]:
predictions = ner_model.transform(test_data)

You can see all of your input and output columns in the final "predictions" dataframe, but I'll focus on the 'ner' column, which contains the prediction, and the 'label' column, which contains the ground truth. You can use classification_report from sklearn.metrics to score the predictions against the ground truth using these two columns.

In [None]:
from sklearn.metrics import classification_report
import pyspark.sql.functions as F

preds = predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
        .select(F.col('cols.0').alias("token"),
        F.col('cols.1').alias("label"),
        F.col('cols.2').alias("ner"))


In [None]:
preds.filter("ner!='O'").show(9)

In [None]:
#Convert the Spark dataframe to a Pandas dataframe.
import pandas as pd
preds_df=preds.toPandas()

In [None]:
print (classification_report(preds_df['label'], preds_df['ner']))
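
In NER data most tokens are usually labelled 'O', so it is worth looking at the entity labels on their own. As a rough illustration of what the per-label rows in a report measure, the sketch below counts token-level true positives, false positives and false negatives while skipping 'O'. `token_level_counts` is a hypothetical helper, and the labels are toy values, not BC5CDR-Chem output.

```python
from collections import Counter


def token_level_counts(y_true, y_pred, ignore='O'):
    """Count true positives, false positives and false negatives per label.

    Hypothetical helper for illustration; it is not part of sklearn or SparkNLP.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            if truth != ignore:
                tp[truth] += 1
            continue
        if pred != ignore:
            fp[pred] += 1   # predicted a label the token does not have
        if truth != ignore:
            fn[truth] += 1  # missed the label the token does have
    return tp, fp, fn


# Toy labels for illustration only.
y_true = ['O', 'B-Chemical', 'I-Chemical', 'O', 'B-Chemical', 'O']
y_pred = ['O', 'B-Chemical', 'O', 'O', 'B-Chemical', 'B-Chemical']

tp, fp, fn = token_level_counts(y_true, y_pred)
for label in sorted(set(y_true) - {'O'}):
    predicted = tp[label] + fp[label]
    actual = tp[label] + fn[label]
    precision = tp[label] / predicted if predicted else 0.0
    recall = tp[label] / actual if actual else 0.0
    print(label, round(precision, 2), round(recall, 2))
```

Note that this scores individual tokens, as the report above does, rather than whole multi-word entities.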


### Training and Evaluating NerDL

NerDL is a deep learning named entity recognition model in the SparkNLP library which does not require training data to contain parts-of-speech. It is a Bidirectional LSTM-CNN. For a more detailed overview of training a model using NerDL, you can check out this [post](https://medium.com/r/?url=https%3A%2F%2Ftowardsdatascience.com%2Fnamed-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77). We've already loaded the BC5CDR-Chem test and train datasets. Now I can show you how to add GloVe embeddings and save the test data as a parquet file before NerDL model training.

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")

test_data = word_embeddings.transform(test)

test_data.write.parquet('../test.parquet')


Next, set up the rest of the pipeline by adding the location of the test data parquet file and the folder where your TensorFlow graphs are located. Using ".setEvaluationLogExtended(True)" will output a more detailed model evaluation log. If you get an incompatible TF graph error when you run the training, use NerDL_Graph.ipynb located [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.1_NerDL_Graph.ipynb) to create a graph using the parameters given in the error message. If you're having trouble with this part of NerDL model training, you should read this [post](https://medium.com/r/?url=https%3A%2F%2Ftowardsdatascience.com%2Fnamed-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77).

In [None]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(15)\
  .setLr(0.001)\
  .setPo(0.005)\
  .setBatchSize(32)\
  .setRandomSeed(0)\
  .setVerbose(1)\
  .setValidationSplit(0.2)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setIncludeConfidence(True)\
  .setGraphFolder('../tfgraphs')\
  .setTestDataset('../test.parquet')
                  
ner_pipeline = Pipeline(stages=[
          word_embeddings,
          nerTagger
 ])

Even though the word_embeddings pipe is in a previous cell, it is still part of the pipeline. In the next cell I'll fit the model to the training set. This could take some time.

In [None]:
%%time

ner_model = ner_pipeline.fit(train)



You can find the final log at the top of the list here:

In [None]:
! cd ~/annotator_logs && ls -lt

For each training epoch your extended log will print 2 sets of metrics, one for the validation dataset and one for the test dataset. (The metrics for the validation data are at the top.) For each dataset there's a table showing true positives (tp), false positives (fp), false negatives (fn), precision, recall and f1 scores for each entity (except 'O'). Beneath this table you'll find the macro-average and micro-average precision, recall and f1 scores for the dataset. So if you're looking for the micro-average f1 score for the test data, you'll find it on the last line of the log for each epoch.

In [None]:
!cat ~/annotator_logs/NerDLApproach_15b6d84b808b.log
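
To make the difference between the macro-average and micro-average rows concrete, here is a small worked sketch that starts from tp, fp and fn counts like the ones in the log table. The counts are made up for illustration, and averaging conventions vary: this sketch takes the macro-average as the mean of the per-label f1 scores, which may not match exactly how the SparkNLP log computes it.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1 from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up counts for two labels, for illustration only.
counts = {
    'B-Chemical': {'tp': 90, 'fp': 10, 'fn': 10},
    'I-Chemical': {'tp': 10, 'fp': 10, 'fn': 30},
}

# Macro-average: score each label separately, then average the scores.
per_label_f1 = {label: precision_recall_f1(**c)[2] for label, c in counts.items()}
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

# Micro-average: pool the counts across labels, then score once.
total = {key: sum(c[key] for c in counts.values()) for key in ('tp', 'fp', 'fn')}
micro_f1 = precision_recall_f1(**total)[2]

print(round(macro_f1, 3), round(micro_f1, 3))
```

With these toy counts the macro-average is lower than the micro-average, because the rare, poorly predicted label weighs as much as the frequent one when each label is scored separately.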

Overall our NerDL and NerCRF models did reasonably well on the BC5CDR-Chem benchmark dataset enriched with GloVe embeddings. In the 11th epoch the NerDL model's macro-average f1 score on the test set was 0.86, and after 9 epochs the NerCRF model had a macro-average f1 score of 0.88 on the test set. However, using Clinical embeddings instead of GloVe will bring your NerDL micro-average F1 score from 0.887 up to 0.915, much closer to the best published score for this dataset.