## Using the complex word sequence labeller

To use the complex word models, you must first download the sequence labeller files, available [here](https://github.com/marekrei/sequence-labeler). Please cite both the sequence labeller paper and the CWI sequence labelling paper if you use these models for research.

Below is example code demonstrating each method of the `Complexity_labeller` class.

In [1]:
import sys
sys.path.insert(0, './sequence-labeler-master')
import numpy as np
from complex_labeller import Complexity_labeller
model_path = './cwi_seq.model'
temp_path = './temp_file.txt'

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/daniel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
model = Complexity_labeller(model_path, temp_path)

There are two options when converting text to CoNLL-type tab-separated format:

- `convert_format_string`
- `convert_format_token`
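
To illustrate what a CoNLL-type tab-separated layout looks like, here is a minimal sketch. It is not the library's implementation: the helper `to_conll_lines` and the placeholder label `N` are assumptions made for illustration, and whitespace splitting stands in for a real tokenizer such as NLTK's `word_tokenize`.

```python
def to_conll_lines(tokens, placeholder_label="N"):
    """Hypothetical helper: write one 'token<TAB>label' line per token.

    A blank line marks the end of the sentence, as is common in
    CoNLL-style files. The placeholder label is an assumption.
    """
    lines = [f"{token}\t{placeholder_label}" for token in tokens]
    lines.append("")  # blank line closes the sentence
    return "\n".join(lines) + "\n"


# Whitespace splitting is a stand-in for a proper tokenizer.
tokens = "My dog bit my cat".split()
conll_text = to_conll_lines(tokens)
print(conll_text)
```

Each token sits on its own line with its label in the last column, which is the general shape a sequence labeller reads at prediction time.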

In [10]:
Complexity_labeller.convert_format_string(model, 'Our group fucked up the machine learning project')

In [11]:
Complexity_labeller.convert_format_token(model, ['After time', 'the end', 'of PARSEME', 'action late'])

Once the text has been converted, there are three methods to access complexity information:

- `get_dataframe`
- `get_bin_labels`
- `get_prob_labels`

In [59]:
# Convert the example sentence 'My dog bit my cat and now he is sad.'

Complexity_labeller.convert_format_string(model, 'My dog bit my cat and now he is sad.')

The `get_dataframe` method returns a dataframe containing the original tokenized sentence, the binary complexity labels and the class probabilities for each token.

If a word receives a binary label of 1, it has been classified as a complex word.

In [60]:
dataframe = Complexity_labeller.get_dataframe(model)

In [61]:
dataframe

| | index | sentences | labels | probs |
|---|---|---|---|---|
| 0 | 0 | [My, dog, bit, my, cat, and, now, he, is, sad, .] | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] | [[0.9991967, 0.0008033489], [0.578063, 0.42193... |


In [62]:
dataframe['probs'][0]

array([[9.9919671e-01, 8.0334890e-04],
       [5.7806301e-01, 4.2193702e-01],
       [9.9367577e-01, 6.3242521e-03],
       [9.9938452e-01, 6.1543082e-04],
       [3.9162654e-01, 6.0837346e-01],
       [9.9991298e-01, 8.6956745e-05],
       [9.9985981e-01, 1.4018305e-04],
       [9.9984980e-01, 1.5016280e-04],
       [9.9995184e-01, 4.8179219e-05],
       [9.7847718e-01, 2.1522792e-02],
       [9.9995804e-01, 4.1934447e-05]], dtype=float32)
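
Each row of `probs` holds two values that sum to roughly 1. The sketch below assumes that column 0 is the non-complex class and column 1 the complex class, and that the binary label is simply the more probable class. Both assumptions are consistent with the outputs in this README but are not confirmed from the library source.

```python
import numpy as np

# First five rows of the `probs` array, copied from the output above.
probs = np.array([
    [9.9919671e-01, 8.0334890e-04],  # My
    [5.7806301e-01, 4.2193702e-01],  # dog
    [9.9367577e-01, 6.3242521e-03],  # bit
    [9.9938452e-01, 6.1543082e-04],  # my
    [3.9162654e-01, 6.0837346e-01],  # cat
], dtype=np.float32)

# Assumption: column 1 is the probability of the complex class.
complex_prob = probs[:, 1]

# Assumption: the binary label is the index of the more probable class.
labels = probs.argmax(axis=1)

print(labels.tolist())  # [0, 0, 0, 0, 1]
```

Under these assumptions only `cat` crosses the 0.5 mark, which agrees with the `labels` column of the dataframe.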

The example below shows how to pair each token with its binary label using the dataframe:

In [63]:
list(zip(dataframe['sentences'].values[0],dataframe['labels'].values[0]))

[('My', 0),
 ('dog', 0),
 ('bit', 0),
 ('my', 0),
 ('cat', 1),
 ('and', 0),
 ('now', 0),
 ('he', 0),
 ('is', 0),
 ('sad', 0),
 ('.', 0)]

The `get_bin_labels` method returns the binary complexity labels for the input.

In [64]:
Complexity_labeller.get_bin_labels(model)

[array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])]

The `get_prob_labels` method returns the probability of each token belonging to the complex class.

In [65]:
Complexity_labeller.get_prob_labels(model)

[0.0008033489,
 0.42193702,
 0.006324252,
 0.0006154308,
 0.60837346,
 8.6956745e-05,
 0.00014018305,
 0.0001501628,
 4.817922e-05,
 0.021522792,
 4.1934447e-05]
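
The probabilities also make it possible to apply a cut-off other than the default decision. The following is a minimal sketch, not part of the library: `words_above_threshold` is a hypothetical helper, and the token and probability lists are copied from the outputs above.

```python
tokens = ['My', 'dog', 'bit', 'my', 'cat', 'and', 'now', 'he', 'is', 'sad', '.']
complex_probs = [0.0008033489, 0.42193702, 0.006324252, 0.0006154308,
                 0.60837346, 8.6956745e-05, 0.00014018305, 0.0001501628,
                 4.817922e-05, 0.021522792, 4.1934447e-05]


def words_above_threshold(tokens, probs, threshold=0.5):
    """Hypothetical helper: keep (token, probability) pairs at or above the threshold."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    return [(tok, p) for tok, p in zip(tokens, probs) if p >= threshold]


strict = words_above_threshold(tokens, complex_probs)            # default 0.5
lenient = words_above_threshold(tokens, complex_probs, 0.4)      # lower cut-off

print([tok for tok, _ in strict])   # ['cat']
print([tok for tok, _ in lenient])  # ['dog', 'cat']
```

Lowering the threshold flags more borderline words, such as `dog` here, at the cost of more false positives.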