In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
# Enumerate GPUs in PCI bus order and expose only the first one
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [2]:
import ktrain
from ktrain import text

Using TensorFlow backend.


# Sequence Tagging

Sequence tagging (or sequence labeling) involves classifying words or sequences of words as representing some category or concept of interest.  One example of sequence tagging is Named Entity Recognition (NER), where we classify words or sequences of words that identify some entity such as a person, organization, or location.  In this tutorial, we will show how to use *ktrain* to perform sequence tagging in three simple steps.

## STEP 1: Load and Preprocess Data

The `entities_from_txt` function can be used to load tagged sentences from a text file.  The text file can be in one of two formats: 1) the [CoNLL2003 format](https://www.aclweb.org/anthology/W03-0419) or 2) the [Groningen Meaning Bank (GMB) format](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). In both formats, each line holds one word and its associated tag (delimited by a space, tab, or comma), and words are ordered as they appear in the sentence.  In the CoNLL2003 format, a blank line separates sentences.  In the GMB format, there is a third column for Sentence ID that assigns a number to each row indicating the sentence to which the word belongs.  If you are building a sequence tagger for your own use case, the training data should be converted into one of these two formats.
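
To make the CoNLL2003 layout concrete, here is a minimal sketch of how blank-line-delimited rows can be grouped into sentences. The helper `read_conll` is hypothetical and is not part of *ktrain*; it assumes whitespace-delimited fields, with the word in the first column and the tag in the last.

```python
def read_conll(lines):
    """Group 'word ... tag' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the sentence collected so far (if any)
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: word is the first field, tag is the last field
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """Paul B-PER
Newman I-PER
acts O

He O
smiled O
""".splitlines()

sentences = read_conll(sample)
print(sentences)
```

Taking the first and last fields means the same sketch also tolerates files that carry extra columns between the word and the tag.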

In this notebook, we will be building a sequence tagger using the Groningen Meaning Bank NER dataset available on Kaggle [here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). The format essentially looks like this (with fields being delimited by comma):
```
      SentenceID   Word     Tag    
      1            Paul     B-PER
      1            Newman   I-PER
      1            is       O
      1            a        O
      1            great    O
      1            actor    O
      1            .        O
 ```
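
If your own annotations are held in memory as lists of (word, tag) pairs, rows in this layout could be produced along the following lines. This is only a sketch: `write_gmb`, the column names, and the sample sentences are illustrative assumptions, not something *ktrain* provides.

```python
import csv
import io


def write_gmb(sentences, handle):
    """Write sentences as comma-delimited rows: sentence ID, word, tag."""
    writer = csv.writer(handle)
    writer.writerow(["SentenceID", "Word", "Tag"])
    for sent_id, sentence in enumerate(sentences, start=1):
        for word, tag in sentence:
            writer.writerow([sent_id, word, tag])


# Hypothetical sample data for illustration only
my_sentences = [
    [("Paul", "B-PER"), ("Newman", "I-PER"), ("is", "O"), ("a", "O"),
     ("great", "O"), ("actor", "O"), (".", "O")],
    [("He", "O"), ("lived", "O"), ("in", "O"), ("Ohio", "B-GEO"), (".", "O")],
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_gmb(my_sentences, buffer)
gmb_text = buffer.getvalue()
print(gmb_text)
```

Using the `csv` module rather than joining strings by hand ensures that tokens which are themselves commas are quoted correctly.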

We will be using the file `ner_dataset.csv` (which conforms to the format above) and will load and preprocess it using the `entities_from_txt` function.  The output is similar to that of the data-loading functions used in previous tutorials and includes the processed training set, the processed validation set, and an instance of `NERPreprocessor`.

In the Kaggle dataset `ner_dataset.csv`, the three columns of interest (mentioned above) are labeled 'Sentence #', 'Word', and 'Tag'.  Thus, we specify these in the call to the function.

In [3]:
DATAFILE = '/home/amaiya/data/groningen_meaning_bank/ner_dataset.csv'
(trn, val, preproc) = text.entities_from_txt(DATAFILE,
                                             embeddings='word2vec',
                                             sentence_column='Sentence #',
                                             word_column='Word',
                                             tag_column='Tag', 
                                             data_format='gmb')

Number of sentences:  47959
Number of words in the dataset:  35178
Tags: ['B-eve', 'I-art', 'I-org', 'I-geo', 'O', 'B-org', 'B-nat', 'B-gpe', 'B-tim', 'I-eve', 'B-art', 'I-tim', 'B-geo', 'B-per', 'I-gpe', 'I-nat', 'I-per']
Number of Labels:  17
Longest sentence: 104 words


In the cell above, notice that we supplied `embeddings='word2vec'`.  This directs *ktrain* to employ pretrained word vectors trained by a Word2Vec continuous-bag-of-words (CBOW) model.  The word2vec embeddings are 1.5 GB and will be automatically downloaded and loaded in STEP 2 (the download location is `<home_directory>/ktrain_data`). To disable pretrained word embeddings, set `embeddings=None` and randomly initialized word embeddings will be employed.  Additional pretrained embeddings such as those based on [BERT](https://arxiv.org/abs/1810.04805) or [ELMo](https://arxiv.org/abs/1802.05365) are expected to be included in future versions of *ktrain*.  Use of pretrained word embeddings will typically boost final accuracy.

## STEP 2:  Define a Model

The `print_sequence_taggers` function shows that, as of this writing, *ktrain* supports a Bidirectional LSTM-CRF model for sequence tagging.

In [4]:
text.print_sequence_taggers()

bilstm-crf: Bidirectional LSTM-CRF  (https://arxiv.org/abs/1603.01360)


In [5]:
model = text.sequence_tagger('bilstm-crf', preproc)

pretrained cbow word embeddings will be used with bilstm-crf
Loading pretrained word vectors...this may take a few moments...
Done.


In [6]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val)

## STEP 3: Train and Evaluate the Model

Here, we will train for a single epoch using a learning rate of 0.001 (the default learning rate for Adam in Keras) and see how well we do.

In [7]:
learner.fit(1e-3, 1)

Epoch 1/1


<keras.callbacks.History at 0x7f251bc909e8>

Our F1-score is **83.26** after a single pass through the dataset. Not bad for a single epoch of training.

In [8]:
learner.validate(class_names=preproc.get_classes())

   F1: 83.26
           precision    recall  f1-score   support

      geo       0.86      0.88      0.87      3773
      org       0.67      0.74      0.71      1935
      tim       0.88      0.83      0.85      2058
      art       0.00      0.00      0.00        35
      per       0.79      0.80      0.80      1669
      gpe       0.98      0.92      0.95      1558
      nat       0.00      0.00      0.00        23
      eve       0.20      0.05      0.08        20

micro avg       0.83      0.83      0.83     11071
macro avg       0.83      0.83      0.83     11071



0.8326110509209101
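
The report above lists entity types (`geo`, `org`, ...) rather than individual `B-`/`I-` tags, which suggests that scores are computed per entity rather than per token. Under that assumption, the following sketch shows how an entity-level micro-averaged F1 can be derived from BIO tags. The helpers `extract_entities` and `f1_score` are hypothetical illustrations, not *ktrain*'s own implementation.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO-tagged sentence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                spans.add((kind, start, i))  # end index is exclusive
            start, kind = (i, tag[2:]) if tag != "O" else (None, None)
    return spans


def f1_score(true_seqs, pred_seqs):
    """Micro-averaged F1: an entity counts only if type and boundaries match."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_entities(true_tags)
        pred_spans = extract_entities(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the person is found, but the location is mislabeled as an organization
y_true = [["B-per", "I-per", "O", "B-geo"]]
y_pred = [["B-per", "I-per", "O", "B-org"]]
score = f1_score(y_true, y_pred)
print(score)
```

In the toy example, one of two predicted entities matches one of two true entities, so precision, recall, and F1 all equal 0.5.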

Let's invoke `view_top_losses` to see the sentence we got the most wrong. This single sentence about James Brown contains 10 words that are misclassified.  We can see here that our model has trouble with titles of songs. In addition, some of the ground truth labels for this example are sketchy and incomplete, which also makes things difficult.

In [12]:
learner.view_top_losses(n=1)

total incorrect: 10
Word            True : (Pred)
Mr.            :B-per (B-per)
Brown          :I-per (I-per)
is             :O     (O)
known          :O     (O)
by             :O     (O)
millions       :O     (O)
of             :O     (O)
fans           :O     (O)
as             :O     (O)
"              :O     (O)
The            :O     (O)
Godfather      :B-per (B-org)
of             :O     (O)
Soul           :B-per (B-per)
"              :O     (O)
thanks         :O     (O)
to             :O     (O)
such           :O     (O)
classic        :O     (O)
songs          :O     (O)
as             :O     (O)
"              :O     (O)
Please         :B-art (O)
,              :O     (O)
Please         :O     (B-geo)
,              :O     (O)
Please         :O     (O)
,              :O     (O)
"              :O     (O)
"              :O     (O)
It             :O     (O)
's             :O     (O)
a              :O     (O)
Man            :O     (O)
's             :O     (O)
World          :O   

## Making Predictions on New Sentences

Let's use our model to extract entities from new sentences. We begin by instantiating a `Predictor` object.

In [13]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [14]:
predictor.predict('As of 2019, Donald Trump is still the President of the United States.')

[('As', 'O'),
 ('of', 'O'),
 ('2019', 'B-tim'),
 (',', 'O'),
 ('Donald', 'B-per'),
 ('Trump', 'I-per'),
 ('is', 'O'),
 ('still', 'O'),
 ('the', 'O'),
 ('President', 'B-per'),
 ('of', 'O'),
 ('the', 'O'),
 ('United', 'B-geo'),
 ('States', 'I-geo'),
 ('.', 'O')]
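
The predictor returns one tag per token. For downstream use, it is often convenient to merge consecutive `B-`/`I-` tokens into whole entities. A minimal sketch of that post-processing step follows; `group_entities` is a hypothetical helper, not part of *ktrain*, and it joins tokens with a single space.

```python
def group_entities(tagged_tokens):
    """Merge (word, BIO-tag) pairs into (entity text, entity type) pairs."""
    entities, words, kind = [], [], None
    for word, tag in list(tagged_tokens) + [("", "O")]:  # sentinel flushes the last entity
        if tag.startswith("I-") and kind == tag[2:]:
            words.append(word)  # continue the current entity
            continue
        if kind is not None:
            entities.append((" ".join(words), kind))
        words, kind = ([word], tag[2:]) if tag != "O" else ([], None)
    return entities


# Token-level output copied from the prediction above
predicted = [('As', 'O'), ('of', 'O'), ('2019', 'B-tim'), (',', 'O'),
             ('Donald', 'B-per'), ('Trump', 'I-per'), ('is', 'O'),
             ('still', 'O'), ('the', 'O'), ('President', 'B-per'),
             ('of', 'O'), ('the', 'O'), ('United', 'B-geo'),
             ('States', 'I-geo'), ('.', 'O')]

entities = group_entities(predicted)
print(entities)
# [('2019', 'tim'), ('Donald Trump', 'per'), ('President', 'per'), ('United States', 'geo')]
```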

We can save the predictor for later deployment.

In [15]:
predictor.save('/tmp/mypred')

In [16]:
reloaded_predictor = ktrain.load_predictor('/tmp/mypred')

In [17]:
reloaded_predictor.predict('Paul Newman is my favorite American actor.')

[('Paul', 'B-per'),
 ('Newman', 'I-per'),
 ('is', 'O'),
 ('my', 'O'),
 ('favorite', 'O'),
 ('American', 'B-gpe'),
 ('actor', 'O'),
 ('.', 'O')]