<a href="https://colab.research.google.com/github/alexwz/ComboNER/blob/master/Combo_NER_usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The embedding vectors are already part of the saved model, but the BPEmb package is still required: it provides the subword tokenizer of the appropriate granularity and encodes words as subword IDs.

In [2]:
!pip install bpemb==0.3.3

Collecting bpemb==0.3.3
  Downloading https://files.pythonhosted.org/packages/f2/6f/9191b85109772636a8f8accb122900c34db26c091d2793218aa94954524c/bpemb-0.3.3-py3-none-any.whl
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/ac/aa/1437691b0c7c83086ebb79ce2da16e00bef024f24fec2a5161c35476f499/sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2MB)
     |████████████████████████████████| 1.2MB 6.0MB/s 
Installing collected packages: sentencepiece, bpemb
Successfully installed bpemb-0.3.3 sentencepiece-0.1.96


To deserialize the model, we use the load_model function from TensorFlow 2:

In [5]:
import tensorflow as tf

model = tf.keras.models.load_model('/content/drive/My Drive/combo/model_final/1model_final/')



In [6]:
model.compile(optimizer='Adamax')

In [7]:
from bpemb import BPEmb

params = {'embedding_size': 100, 'vocabulary_size': 50000}

bpemb_pl = BPEmb(lang="pl", dim=params['embedding_size'], vs=params['vocabulary_size'])

downloading https://nlp.h-its.org/bpemb/pl/pl.wiki.bpe.vs50000.model


100%|██████████| 1135152/1135152 [00:01<00:00, 735150.12B/s]


downloading https://nlp.h-its.org/bpemb/pl/pl.wiki.bpe.vs50000.d100.w2v.bin.tar.gz


100%|██████████| 19000997/19000997 [00:03<00:00, 5193698.12B/s]


In [8]:
tokenids = bpemb_pl.encode_ids('Ala ma kota, pięć psów i mieszka na wsi.')

In [9]:
tokenids

[5695, 123, 24167, 49903, 3868, 16843, 28, 1859, 33, 1349, 49902]

In [24]:
sentence_size = len(tokenids)

We need a function that pads the subword IDs with trailing zeros up to the maximum length of 31:

In [13]:
def pad_input(tokenids):
  return tf.keras.preprocessing.sequence.pad_sequences(tokenids, maxlen=31, padding='post')


In [11]:
pad_input([tokenids])

array([[ 5695,   123, 24167, 49903,  3868, 16843,    28,  1859,    33,
         1349, 49902,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0]], dtype=int32)
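To make the padding step concrete, here is a minimal pure-Python sketch of what post-padding does. The helper name `pad_post` is hypothetical and only for illustration; it mirrors the Keras defaults of a padding value of 0 and truncation from the front, and it assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Return copies of the sequences, each exactly `maxlen` items long.

    Hypothetical helper for illustration only; assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        # Keras truncates from the front by default (truncating='pre'),
        # so an over-long sequence keeps its last `maxlen` items.
        seq = list(seq)[-maxlen:]
        # padding='post' appends the fill value after the real IDs.
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


print(pad_post([[5695, 123, 24167]], maxlen=5))
```

Because the zeros are appended at the end, the first positions of each row still line up with the original subword IDs, which is what allows the predictions to be cut back to the sentence length later.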

The easiest way of using Combo-NER is via TensorFlow's predict API. For each input sentence, the model returns four lists of predictions:

*   part-of-speech
*   dependency heads
*   dependency relation labels
*   named entities




In [14]:
y_pos, y_heads, y_deprels, y_namedents = model.predict( pad_input([tokenids]) )

For each subword token, the model returns a probability for every class. To obtain the label ID, we have to apply argmax over the last (class) axis:

In [15]:
y_pos = tf.argmax(y_pos, -1)


Now, each subword has its own class label as an integer:

In [16]:
y_pos

<tf.Tensor: shape=(1, 31), dtype=int64, numpy=
array([[11, 15,  7, 15,  8,  7,  4,  7,  1,  7,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])>
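The argmax step can be illustrated on a toy example without TensorFlow. The sketch below assumes a nested list shaped (sentences, tokens, classes); the probabilities are made up and the helper name `argmax_last_axis` is hypothetical.

```python
def argmax_last_axis(probs):
    """For every token, return the index of its highest-scoring class."""
    return [
        [max(range(len(row)), key=row.__getitem__) for row in sentence]
        for sentence in probs
    ]


# One sentence, two tokens, three classes (made-up probabilities).
toy_probs = [[[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]]]
print(argmax_last_axis(toy_probs))
```

The class axis disappears in the result, so an input shaped (1, 31, number of classes) becomes the (1, 31) array of label IDs shown above.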

To obtain string labels, we need to load the scikit-learn label encoders, available from http://mozart.ipipan.waw.pl/~axw/Combo-NER/

In [4]:
import pickle

def load_pickle(path):
    # Use a context manager so the file handle is closed after loading.
    with open(path, 'rb') as f:
        return pickle.load(f)

le_ner = load_pickle('/content/drive/My Drive/combo/le_ner.pkl')
le_upostag = load_pickle('/content/drive/My Drive/combo/le_upostag.pkl')
le_deprel = load_pickle('/content/drive/My Drive/combo/le_deprel.pkl')

Now let's print the POS tags, remembering to truncate the predictions to the sentence size so that the labels predicted for padding positions are dropped:

In [25]:
list(le_upostag.inverse_transform(y_pos[0][:sentence_size]))

['PROPN',
 'VERB',
 'NOUN',
 'VERB',
 'NUM',
 'NOUN',
 'CCONJ',
 'NOUN',
 'ADP',
 'NOUN',
 'ADJ']
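As a rough sketch of what the decoding step does: a scikit-learn label encoder keeps its classes in sorted order, and inverse_transform looks each ID up in that list. The class list below is an assumption (the 17 Universal Dependencies POS tags, sorted alphabetically), not the contents of le_upostag itself, although it is consistent with the IDs and labels printed above. The helper name `decode_labels` is hypothetical.

```python
# Assumption: the encoder's classes are the 17 UD POS tags in sorted order.
UPOS_CLASSES = sorted([
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
    'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X',
])


def decode_labels(label_ids, classes, length):
    """Drop the padding positions, then map each label ID to its string."""
    return [classes[i] for i in label_ids[:length]]


# The 11 real predictions from the notebook, followed by 20 padding positions.
label_ids = [11, 15, 7, 15, 8, 7, 4, 7, 1, 7, 0] + [0] * 20
print(decode_labels(label_ids, UPOS_CLASSES, length=11))
```

Without the truncation, every padding position would also be decoded, here as 'ADJ', because ID 0 is simply the first class in sorted order.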

Other output types (dependency relations, named entities) can be printed in the same manner, using le_deprel and le_ner.