# Named Entity Recognition

Notebook based on the work of Lee et al. ("LEAN-LIFE: A Label-Efficient Annotation Framework Towards Learning from Explanation," 2020).

Before running the cells, make sure the FastAPI server is up and running; see the [setup instructions](../fast_api/readme.md).

# Table of contents

1. Data prep
2. Standard
        2.1. Train Model
        2.2. Predict Labels
        2.3. Evaluate Model
3. Trigger (soft match)
        3.1. Train model
        3.2. Predict labels
        3.3. Evaluate model

In [1]:
# imports
import requests
import json
import numpy as np

In [2]:
FAST_API_URL = "http://localhost:9000"
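Before sending requests, it can help to confirm that something is listening at `FAST_API_URL`. The sketch below is only a TCP-level check built on the standard library. The helper names `host_and_port` and `server_is_reachable` are hypothetical and not part of LEAN-LIFE, and a successful connection does not prove that the API itself is healthy.

```python
import socket
from urllib.parse import urlparse


def host_and_port(url):
    """Split a URL into (host, port), falling back to the scheme's default port."""
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port


def server_is_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timeout, unknown host, ...
        return False
```

Calling `server_is_reachable(FAST_API_URL)` before the training cells gives an early, cheap signal that the server still needs to be started.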

---
# 1. Data prep

---
#### Prepare data for Trigger training

In [3]:
# load labelled data with explanations for Trigger training
with open('explanation_IDRISI-RE-flood_without_empty.json') as f:
    data = json.load(f)
data[0:3]

[{'text': 'Korean actress Lee Young-ae know as ‘ Changumi ’ made a contribution of USD 50,000 to support flood relief efforts in Sri Lanka',
  'label': 'O O O O O O O O O O O O O O O O O O O O O B-LOC I-LOC',
  'explanation': 'O O O O O O O O O O O O O O O O O O T-0 T-0 T-0 O O'},
 {'text': 'RT @ Vidiyallk : Government seeks # world aid for # FloodRelief ; 4 ministers appointed to work out plans for donor conference in # Colombo short',
  'label': 'O O O O O O O O O O O O O O O O O O O O O O O O O B-LOC O',
  'explanation': 'O O O O O O O O O O O O O O O O O O O O O O T-0 T-0 O O O'},
 {'text': 'First phase of the @ IMCD_officials # FloodRelief program is over . Please direct dry ration donations to Rathmalana Air Force Camp . # FloodSL',
  'label': 'O O O O O O O O O O O O O O O O O O B-LOC O O O O O O',
  'explanation': 'O O O O O O O O O O O O O O O O T-0 T-0 O T-1 T-2 T-2 O O O'}]
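Each record stores three space-separated sequences that line up token by token: `text`, `label` and `explanation`, where tokens sharing a marker such as `T-0` form one trigger. A minimal sketch of reading that format follows. The helper `extract_triggers` and the short example record are illustrative assumptions, not part of the dataset or of LEAN-LIFE.

```python
from collections import defaultdict


def extract_triggers(record):
    """Group the tokens of a record by their trigger marker (T-0, T-1, ...)."""
    tokens = record["text"].split()
    labels = record["label"].split()
    marks = record["explanation"].split()
    if not (len(tokens) == len(labels) == len(marks)):
        raise ValueError("text, label and explanation must have the same number of tokens")
    triggers = defaultdict(list)
    for token, mark in zip(tokens, marks):
        if mark != "O":  # "O" means the token is not part of any trigger
            triggers[mark].append(token)
    return dict(triggers)


# hypothetical record, written only to illustrate the format
example = {
    "text": "Flooding reported in Colombo today",
    "label": "O O O B-LOC O",
    "explanation": "T-0 T-0 T-1 O O",
}
print(extract_triggers(example))
```

The length check is worth keeping when loading new annotations, since a misaligned record would otherwise attach trigger markers to the wrong tokens without any error.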

---
#### Prepare data for evaluation

In [4]:
gen = np.random.default_rng(seed=1337)

data_shuffled = data.copy()
gen.shuffle(data_shuffled)
data_shuffled[0:3]

[{'text': '@ GetalongDRescue Maryland . Here is Dante a rescue from # Arkansas',
  'label': 'O O O O O O O O O O O B-LOC',
  'explanation': 'O O O O O O O O O T-0 O O'},
 {'text': 'JUST IN : Howard County police say the body of Eddison Alexander Hermond , who went missing during destructive flash flooding in Maryland after trying to help a woman rescue her cat , was found by searchers',
  'label': 'O O O B-LOC I-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O',
  'explanation': 'O O O O O T-0 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O'},
 {'text': 'NEW : Officials say Offutt Air Force Base has returned to full operation in Nebraska after its runway and dozens of buildings were inundated by floodwaters from the nearby Missouri River earlier this month .',
  'label': 'O O O O O O O O O O O O O O B-LOC O O O O O O O O O O O O O O O O O O O O',
  'explanation': 'O O O O O T-0 T-0 T-0 O O O O O T-1 O O O O O O O O O O O O O O O O O O O O O'}]

In [5]:
size = len(data_shuffled)
size

3103

In [6]:
train_split_end = 0.7
dev_split_end = 0.85
train, dev, test = np.split(data_shuffled,
                       [int(size*train_split_end), int(size*dev_split_end)]
                   )

In [7]:
train, dev, test = list(train), list(dev), list(test)
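The split points are cumulative fractions: the first 70% of the shuffled records go to training, the next 15% to development and the rest to testing. A minimal sketch of the same arithmetic with plain list slicing, where the function name `split_three_way` is an assumption introduced here:

```python
def split_three_way(items, train_end=0.7, dev_end=0.85):
    """Slice items into train/dev/test at the given cumulative fractions."""
    size = len(items)
    train_stop = int(size * train_end)  # index where the training slice ends
    dev_stop = int(size * dev_end)      # index where the development slice ends
    return items[:train_stop], items[train_stop:dev_stop], items[dev_stop:]


# stand-in data with the same length as the dataset above
train_part, dev_part, test_part = split_three_way(list(range(3103)))
print(len(train_part), len(dev_part), len(test_part))  # 2172 465 466
```

Because `int()` truncates, the leftover records end up in the last slice, which is why the test set is one record larger than the development set.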

In [8]:
def remove_triggers(dataset):
    return list(map(
        lambda x: { 'text':x['text'], 'label':x['label'] },
        dataset
    ))

train_without_triggers = remove_triggers(train)
dev_without_triggers = remove_triggers(dev)
test_without_triggers = remove_triggers(test)

In [9]:
print(len(train))
print(len(test))
print(len(dev))
print(len(train)+len(dev)+len(test))
print(train[0:3])
print(train_without_triggers[0:3])

2172
466
465
3103
[{'text': '@ GetalongDRescue Maryland . Here is Dante a rescue from # Arkansas', 'label': 'O O O O O O O O O O O B-LOC', 'explanation': 'O O O O O O O O O T-0 O O'}, {'text': 'JUST IN : Howard County police say the body of Eddison Alexander Hermond , who went missing during destructive flash flooding in Maryland after trying to help a woman rescue her cat , was found by searchers', 'label': 'O O O B-LOC I-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'explanation': 'O O O O O T-0 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O'}, {'text': 'NEW : Officials say Offutt Air Force Base has returned to full operation in Nebraska after its runway and dozens of buildings were inundated by floodwaters from the nearby Missouri River earlier this month .', 'label': 'O O O O O O O O O O O O O O B-LOC O O O O O O O O O O O O O O O O O O O O', 'explanation': 'O O O O O T-0 T-0 T-0 O O O O O T-1 O O O O O O O O O O O O O O O O O O O O O'}]
[{'te

---
#### Prepare data for prediction

In [10]:
test_without_triggers_as_list = list(map(
    lambda x: [x['text'], x['label']],
    test
))
test_only_text = list(map(
    lambda x: [x['text']],
    test
))

In [11]:
predict = list(map(lambda x: x[0], test_only_text))[:10]
predict[0:3]

['Pakistan Navy conducts relief operations in flood-battered Sri Lanka # news',
 'An important lesson from Kerala floods . Keep your insurance policy numbers and their helplines handy so you can reach out as soon as possible during such a calamity Heres what youve to do if you lose your papers during such periods # KeralaFloodRelief # KeralaFloods',
 'MPD 0261 SE # PENNSYLVANIA , N # DELAWARE INTO # MARYLAND , # DC & CNTRL/E # VIRGINIA FLASH FLOODING POSSIBLE VALID TIL 0510 AM EDT ADDITIONAL FLASH FLOODING WILL BE POSSIBLE OVER THE NEXT FEW HOURS . AN ADDITIONAL 2-4 IN WILL BE POSSIBLE ON TOP OF SATURATED SOILS']

---
---
# 2. Standard

See the [JSON schema](../fast_api/json_schema.py#L516) for the full list of parameters.

See the [defaults file](../model_training/internal_api/defaults.py) for their default values.

## 2.1. Train model

### Define training parameters

In [12]:
params_standard_training = {
    # a string name representing the model name
    "experiment_name": "idrisi_ner_standard",
    # a string name representing the dataset name
    "dataset_name": "idrisi",
    # task type - "ner" for Named Entity Recognition
    "task": "ner",
    # True when data has to be passed in the request,
    # False when re-training or when the data was processed earlier and can be retrieved
    "build_data": True,
    # number of epochs
    "num_epochs": 10,
    # training batch size
    "batch_size": 10,
    # learning rate
    "learning_rate": 0.01,
    # embedding to be used for training. usual default: "glove.6B.100d"
    "embeddings": "glove.6B.100d",
    # embedding dimension of the "embeddings" provided
    "emb_dim": 100,
    # number of hidden dimensions
    "hidden_dim": 200,
    # random seed
    "seed": 1337,
}
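The `emb_dim` value has to agree with the embedding file named in `embeddings`. GloVe file names end in the vector size (for example `100d`), so a mismatch can be caught before a long training run. The sketch below assumes that naming convention; both helper names are hypothetical.

```python
import re


def embedding_dim_from_name(name):
    """Return the dimension encoded at the end of a name like 'glove.6B.100d', or None."""
    match = re.search(r"\.(\d+)d$", name)
    return int(match.group(1)) if match else None


def embedding_dim_matches(params):
    """True when emb_dim agrees with the embedding name, or when the name has no dimension."""
    expected = embedding_dim_from_name(params["embeddings"])
    return expected is None or expected == params["emb_dim"]


print(embedding_dim_matches({"embeddings": "glove.6B.100d", "emb_dim": 100}))
```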

### Run model training

In [28]:
# depending on input size and computing environment, this might take a while.
# check the FastAPI logs for progress updates
response = requests.post(
    FAST_API_URL + '/training/standard/ner/api/',
    json={
        'params': params_standard_training,
        'labeled_data': train_without_triggers,
        'dev_data': dev_without_triggers,
        'eval_data': test_without_triggers,
    }
)
# JSON with "save_path" key is returned when successful
response.text

'{"save_path":"/home/yoriyari/LEAN-LIFE/model_api/fast_api/../model_training/trigger_ner/utilities/../../generated_data/saved_models/naive_idrisi_glove.6B.100d_1337_-1.0"}'
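Since a successful training call returns a JSON object with a `save_path` key, the response body can be parsed rather than read by eye. A minimal sketch follows; the helper `extract_save_path` and the sample strings are assumptions for illustration, and the shape of an error response is not documented here.

```python
import json


def extract_save_path(response_text):
    """Parse a training response body and return its save_path, or raise ValueError."""
    payload = json.loads(response_text)
    if not isinstance(payload, dict) or "save_path" not in payload:
        raise ValueError(f"training did not return a save_path: {payload!r}")
    return payload["save_path"]


# hypothetical response body, shortened for illustration
print(extract_save_path('{"save_path": "/tmp/saved_models/example_model"}'))
```

In the notebook this would be called as `extract_save_path(response.text)` right after the training request.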

---
## 2.2. Predict labels

### Define prediction parameters

In [13]:
params_standard_prediction = {
    # a string name representing the model name
    "experiment_name": "idrisi_ner_standard",
    # a string name representing the dataset name
    "dataset_name": "idrisi",
    # task type - "ner" for Named Entity Recognition
    "task": "ner",
    # prediction batch size
    "batch_size": 10,
    # embedding to be used for training. usual default: "glove.6B.100d"
    "embeddings": "glove.6B.100d",
    # embedding dimension of the "embeddings" provided
    "emb_dim": 100,
    # number of hidden dimensions
    "hidden_dim": 200,
}

### Fetch predictions

In [14]:
response = requests.post(
    FAST_API_URL + '/training/standard/ner/predict/',
    json={
        'params': params_standard_prediction,
        'prediction_data': predict,
    }
)
# returns a JSON object with key
# "class_preds" - list of tag sequences, one space-separated string per input sentence
# (no "class_probs" key appears in the NER output below)
response.text

'{"class_preds":["S-LOC O O O O O O B-LOC E-LOC O O","O O O O S-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O","O O O O O O O O O O O O O O O O O O S-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O","O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O","O O O O O O O O O O O O O O S-LOC","O O O O O O O O O O O O O S-LOC O O O O O O O O O O O O O O","O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O","O O O O O O O O O O S-LOC O O O S-LOC O O O O O O O O O O O O O","O O O O O O O B-LOC E-LOC O S-LOC O O O O O","O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O"]}'
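Each entry of `class_preds` is a space-separated tag string aligned with the tokens of the corresponding input sentence. The tags in this output look like a BIOES-style scheme (`S-` for single-token entities, `B-`/`E-` for the first and last token of longer ones); that reading is an assumption based on the output above, and `decode_spans` is a hypothetical helper.

```python
def decode_spans(tokens, tags):
    """Turn aligned tokens and BIOES-style tags into (entity_text, entity_type) pairs."""
    spans = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            # single-token entity
            spans.append((token, entity_type))
            current, current_type = [], None
        elif prefix == "B":
            # start of a multi-token entity
            current, current_type = [token], entity_type
        elif prefix in ("I", "E") and current and entity_type == current_type:
            current.append(token)
            if prefix == "E":
                spans.append((" ".join(current), current_type))
                current, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any unfinished entity
            current, current_type = [], None
    return spans


sentence = "Pakistan Navy conducts relief operations in flood-battered Sri Lanka # news"
predicted = "S-LOC O O O O O O B-LOC E-LOC O O"
print(decode_spans(sentence.split(), predicted.split()))
# [('Pakistan', 'LOC'), ('Sri Lanka', 'LOC')]
```

Entities that start with `B-` but never reach an `E-` tag are discarded in this sketch; a more lenient decoder could keep them instead.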

In [15]:
predict

['Pakistan Navy conducts relief operations in flood-battered Sri Lanka # news',
 'An important lesson from Kerala floods . Keep your insurance policy numbers and their helplines handy so you can reach out as soon as possible during such a calamity Heres what youve to do if you lose your papers during such periods # KeralaFloodRelief # KeralaFloods',
 'MPD 0261 SE # PENNSYLVANIA , N # DELAWARE INTO # MARYLAND , # DC & CNTRL/E # VIRGINIA FLASH FLOODING POSSIBLE VALID TIL 0510 AM EDT ADDITIONAL FLASH FLOODING WILL BE POSSIBLE OVER THE NEXT FEW HOURS . AN ADDITIONAL 2-4 IN WILL BE POSSIBLE ON TOP OF SATURATED SOILS',
 'Our @ ArtofLiving 2nd Truck , with 4 Tons of Flood Relief Material departed to Ernakulam AOL Centre.Wonderful Seva by Chennai Volunteers & # Yoga Kids supported by KLM Royal Dutch Airlines / Lufthansa Airlines Employees & Budding Minds International School. # KeralaFloodRelief',
 'Govt . seeks assistance from development partners to rebuild after floods # lka # srilanka',
 '

---
## 2.3. Evaluate model

### Define evaluation parameters

In [16]:
params_standard_eval = {
    # a string name representing the model name
    "experiment_name": "idrisi_ner_standard",
    # a string name representing the dataset name
    "dataset_name": "idrisi",
    # task type - "ner" for Named Entity Recognition
    "task": "ner",
    # evaluation batch size
    "batch_size": 10,
    # embedding to be used for training. usual default: "glove.6B.100d"
    "embeddings": "glove.6B.100d",
    # embedding dimension of the "embeddings" provided
    "emb_dim": 100,
    # number of hidden dimensions
    "hidden_dim": 200,
}

### Run model evaluation

In [17]:
response = requests.post(
    FAST_API_URL + '/training/standard/ner/eval/',
    json={
        'params': params_standard_eval,
        'eval_data': test_without_triggers_as_list,
    }
)
# returns JSON object with keys
# "precision", "recall" and "f1"
response.text

'{"precision":54.67625899280576,"recall":65.23605150214593,"f1":59.49119373776909}'
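The three metrics are reported as percentages, and the `f1` value is consistent with the harmonic mean of `precision` and `recall`. A short sketch of that relationship, using the numbers returned above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# values copied from the evaluation response above
print(round(f1_score(54.67625899280576, 65.23605150214593), 2))  # 59.49
```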

---
---
# 3. Trigger

See the [JSON schema](../fast_api/json_schema.py#L516) for the full list of parameters.

See the [defaults file](../model_training/internal_api/defaults.py) for their default values.

## 3.1. Train model

### Define training parameters

In [12]:
params_training = {
    # a string name representing the model name
    "experiment_name": "idrisi_ner_soft_match",
    # a string name representing the dataset name
    "dataset_name": "idrisi_trigger",
    # task type - "ner" for Named Entity Recognition
    "task": "ner",
    # True when data has to be passed in the request,
    # False when re-training or when the data was processed earlier and can be retrieved
    "build_data": True,
    # number of epochs
    "num_epochs": 10,
    # number of pre training epochs
    "pre_train_num_epochs": 20,
    # training batch size
    "batch_size": 10,
    # learning rate
    "learning_rate": 0.01,
    # embedding to be used for training. usual default: "glove.6B.100d"
    "embeddings": "glove.6B.100d",
    # embedding dimension of the "embeddings" provided
    "emb_dim": 100,
    # number of hidden dimensions
    "hidden_dim": 200,
    # random seed
    "seed": 1337,
}

### Run model training

In [16]:
# depending on input size and computing environment, this might take a while.
# check the FastAPI logs for progress updates
response = requests.post(
    FAST_API_URL + '/training/trigger/api/',
    json={
        'params': params_training,
        'explanation_triples': train,
        'dev_data': dev,
        'eval_data': test,
    }
)
# JSON with "save_path" key is returned when successful
response.text

'{"save_path":"/home/yoriyari/LEAN-LIFE/model_api/fast_api/../model_training/trigger_ner/utilities/../../generated_data/saved_models/trigger_idrisi_trigger_glove.6B.100d_1337_-1.0"}'

---
## 3.2. Predict labels

### Define prediction parameters

In [17]:
params_prediction = {
    # a string name representing the model name
    "experiment_name": "idrisi_ner_soft_match",
    # a string name representing the dataset name
    "dataset_name": "idrisi_trigger",
    # task type - "ner" for Named Entity Recognition
    "task": "ner",
    # prediction batch size
    "batch_size": 10,
    # embedding to be used for training. usual default: "glove.6B.100d"
    "embeddings": "glove.6B.100d",
    # embedding dimension of the "embeddings" provided
    "emb_dim": 100,
    # number of hidden dimensions
    "hidden_dim": 200,
}
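The prediction and evaluation parameters in this notebook repeat the training parameters minus the keys that only matter during training. One way to keep them in sync is to derive them, as sketched below. The set `TRAINING_ONLY_KEYS` is inferred from comparing the dictionaries in this notebook, not taken from the API schema, and `inference_params` is a hypothetical helper.

```python
# keys that appear in the training parameters but not in the prediction parameters
TRAINING_ONLY_KEYS = {
    "build_data", "num_epochs", "pre_train_num_epochs", "learning_rate", "seed",
}


def inference_params(training_params):
    """Copy the training parameters, leaving out the training-only keys."""
    return {k: v for k, v in training_params.items() if k not in TRAINING_ONLY_KEYS}


training = {
    "experiment_name": "idrisi_ner_soft_match",
    "dataset_name": "idrisi_trigger",
    "task": "ner",
    "build_data": True,
    "num_epochs": 10,
    "pre_train_num_epochs": 20,
    "batch_size": 10,
    "learning_rate": 0.01,
    "embeddings": "glove.6B.100d",
    "emb_dim": 100,
    "hidden_dim": 200,
    "seed": 1337,
}
print(inference_params(training))
```

Deriving the dictionaries this way avoids the copy-and-paste drift that separate literals invite, such as one cell carrying a different `task` value than the others.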

### Fetch predictions

In [18]:
response = requests.post(
    FAST_API_URL + '/training/trigger/predict/',
    json={
        'params': params_prediction,
        'prediction_data': predict,
    }
)
# returns a JSON object with keys including
# "class_preds" - list of tag sequences, one space-separated string per input sentence
# "trigger_preds" - list of predicted trigger phrases
response.text

'{"class_preds":["S-LOC O O O O O O B-LOC E-LOC O O","O O O O S-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O","O O O O S-LOC O O O O O O O O O O O O O S-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O","O O O O O O O O O O O O O O O S-LOC O O O O S-LOC O O O O O O O O O O O O O O O O O O O O O O","O O O O O O O O O O O O O O O","O O O O O O O O O O O O O S-LOC O O O O O O O O O O O O O O","S-LOC O O O O O O O O O O O O O O O O O O O O O O O S-LOC O S-LOC O S-LOC O O O O O O O O O O","O O O O O O O O O O S-LOC O O O S-LOC O O O O O O O O O O O O O","O O O O O O O B-LOC E-LOC O S-LOC O O O O O","O O O O O O O O O O S-LOC O O O O O O S-LOC O O O O O O O O O O O O O O"],"trigger_preds":["<UNK> @ GetalongDRescue Maryland . Here is Dante a rescue","NEW Officials","Force Base","went missing during","woman her cat was","Offutt Air Force","went missing during destructive","to help woman her","# Arkansas JUST IN : Howard","woman her ca

In [19]:
predict

['Pakistan Navy conducts relief operations in flood-battered Sri Lanka # news',
 'An important lesson from Kerala floods . Keep your insurance policy numbers and their helplines handy so you can reach out as soon as possible during such a calamity Heres what youve to do if you lose your papers during such periods # KeralaFloodRelief # KeralaFloods',
 'MPD 0261 SE # PENNSYLVANIA , N # DELAWARE INTO # MARYLAND , # DC & CNTRL/E # VIRGINIA FLASH FLOODING POSSIBLE VALID TIL 0510 AM EDT ADDITIONAL FLASH FLOODING WILL BE POSSIBLE OVER THE NEXT FEW HOURS . AN ADDITIONAL 2-4 IN WILL BE POSSIBLE ON TOP OF SATURATED SOILS',
 'Our @ ArtofLiving 2nd Truck , with 4 Tons of Flood Relief Material departed to Ernakulam AOL Centre.Wonderful Seva by Chennai Volunteers & # Yoga Kids supported by KLM Royal Dutch Airlines / Lufthansa Airlines Employees & Budding Minds International School. # KeralaFloodRelief',
 'Govt . seeks assistance from development partners to rebuild after floods # lka # srilanka',
 '

---
## 3.3. Evaluate model

### Define evaluation parameters

In [20]:
params_eval = {
    # a string name representing the model name
    "experiment_name": "idrisi_ner_soft_match",
    # a string name representing the dataset name
    "dataset_name": "idrisi_trigger",
    # task type - "ner" for Named Entity Recognition
    "task": "ner",
    # evaluation batch size
    "batch_size": 10,
    # embedding to be used for training. usual default: "glove.6B.100d"
    "embeddings": "glove.6B.100d",
    # embedding dimension of the "embeddings" provided
    "emb_dim": 100,
    # number of hidden dimensions
    "hidden_dim": 200,
}

### Run model evaluation

In [21]:
response = requests.post(
    FAST_API_URL + '/training/trigger/eval/',
    json={
        'params': params_eval,
        'eval_data': test_without_triggers_as_list,
    }
)
# returns JSON object with keys
# "precision", "recall" and "f1"
response.text

'{"precision":41.50485436893204,"recall":73.39055793991416,"f1":53.02325581395348}'
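With both evaluation responses available, the two models can be compared metric by metric. The sketch below hard-codes the two response bodies shown in this notebook; `metric_deltas` is a hypothetical helper.

```python
import json

# response bodies copied from the standard and trigger evaluation cells
standard = json.loads(
    '{"precision":54.67625899280576,"recall":65.23605150214593,"f1":59.49119373776909}'
)
trigger = json.loads(
    '{"precision":41.50485436893204,"recall":73.39055793991416,"f1":53.02325581395348}'
)


def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for precision, recall and f1."""
    return {name: candidate[name] - baseline[name] for name in ("precision", "recall", "f1")}


deltas = metric_deltas(standard, trigger)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
# precision: -13.17
# recall: +8.15
# f1: -6.47
```

In this run the trigger model gains recall and loses precision relative to the standard model, with a lower F1 overall.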