## Monolingual NoReC experiments with BERT
Targeted sentiment analysis with Simple Transformers, framed as sequence tagging. I highly recommend doing this in a dedicated Python environment. PyTorch has to work with your CUDA installation, Simple Transformers has to work with PyTorch, and the Python version has to support the whole chain. I suggest starting from a CUDA version that is listed in the PyTorch installation guide and working up from there.
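
Before running anything heavy, it can help to see which versions are actually installed in the environment. The sketch below is only an illustration: the function name `report_environment` and the default package list are my own assumptions, and it reads package metadata without importing the packages, so it also works when something is missing. Once PyTorch imports cleanly, `torch.version.cuda` shows which CUDA build it was compiled for.

```python
import sys
from importlib import metadata


def report_environment(packages=("torch", "simpletransformers", "transformers")):
    """Return a dict of installed versions; None marks a missing package."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Print one line per component so mismatches are easy to spot.
for name, version in report_environment().items():
    print(f"{name}: {version or 'not installed'}")
```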

In [1]:
from simpletransformers.ner import NERModel, NERArgs
import torch
import pandas as pd
import pickle as pk
import os
import random
import json
import time, datetime
random.seed(64)
# import nltk
import re
# from nltk.tokenize.simple import SpaceTokenizer
from helpers import *

print(f"Cuda: {torch.cuda.is_available()}")

Cuda: True


## Prepare the data
I collect the experiments to run as a list of tuples. Each experiment consists of a training set, an evaluation set, and a list of tags. The tag list is used only to check that the files contain exactly the tags we expect, a check I highly recommend. Tokens and tags must be space-separated.
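
The `tagsset` helper used for this check comes from the `helpers` module, which is not shown here. As a rough sketch of what such a check might do, the hypothetical `collect_tags` below assumes one token and its tag per line, separated by a single space with the tag last, and blank lines between sentences. The sample text is hand-written for illustration and is not taken from the NoReC files.

```python
def collect_tags(text, separator=" "):
    """Collect the distinct tags from CoNLL-style text (tag is the last column)."""
    tags = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:  # A blank line marks a sentence boundary
            continue
        tags.add(line.split(separator)[-1])
    return tags


# Hand-written sample in the assumed format, two short sentences.
sample = (
    "Disse B-targ-positive\n"
    "bilene I-targ-positive\n"
    "har O\n"
    "\n"
    "Damen B-targ-negative\n"
    "synger O\n"
)
expected = {"B-targ-positive", "I-targ-positive", "B-targ-negative", "O"}
assert collect_tags(sample) == expected
```

Comparing against a set, as the cell below does, catches both unexpected tags and expected tags that never occur in a file.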

In [2]:
datasets = [
    (
        "data/spaceseparated/norec_fine/norec_fine_train.conll",
        "data/spaceseparated/norec_fine/norec_fine_dev.conll",
        ["I-targ-negative", "I-targ-positive", "B-targ-negative", "B-targ-positive", "O"],
    )
]
'''
datasets = [
 ( "data/spaceseparated/norec_fine_nopolarity/nopol_norec_fine_train.conll"
 , "data/spaceseparated/norec_fine_nopolarity/nopol_norec_fine_dev.conll",
  ['I-targ', 'B-targ',  'O'] )] # To not repeat the first which was successful
'''
for train_path, dev_path, tags in datasets:
    for path in [dev_path, train_path]:
        with open(path) as rf:
            text = rf.read()
        # Fail early, showing the tags actually found, if a file has unexpected tags
        found = tagsset(text, separator=" ")
        assert found == set(tags), found
print("Tags checked OK")

Tags checked OK


## Perform the fine-tuning
Note that this script does not automatically create the folders needed to save the model and to record the output. I recommend running the following cell with 1 epoch first to check that everything works, before setting the number of epochs back to 8 or whatever you consider adequate. 3 should be enough, but I got a slightly better result with 8, so I kept that.
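
One way to make sure the folders exist before a long run is to create them up front. This is a minimal sketch: the helper name `prepare_output_dirs` is hypothetical, and the folder layout (model name with `/` replaced by `_`, then the dataset name, with a `running` subfolder) is assumed to mirror the paths used in the fine-tuning cell.

```python
import os


def prepare_output_dirs(base_dir, model_name, dataset_name):
    """Create <base_dir>/<model>_<dataset>/running and return both paths."""
    out_d = os.path.join(base_dir, model_name.replace("/", "_") + "_" + dataset_name)
    running = os.path.join(out_d, "running")
    # makedirs also creates out_d; exist_ok avoids an error on reruns
    os.makedirs(running, exist_ok=True)
    return out_d, running
```

Calling it as `prepare_output_dirs("outputs/simpletransformers", "NbAiLab/nb-bert-base", "norec_fine")` is safe to repeat, since existing folders are left untouched.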

Simple Transformers includes support for wandb (Weights & Biases), which is supposed to be a great reporting and logging tool, but I have not tried connecting it.

Note that if you run many epochs and save the models, you will need a lot of disk space.
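
A quick check of the free disk space before training can save a failed run. The sketch below uses only the standard library. The function names and the `gib_per_checkpoint` value are placeholders of mine, not measured figures; saving one checkpoint and looking at its size gives a real number to plug in.

```python
import shutil


def free_gib(path="."):
    """Free disk space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def enough_space(path=".", checkpoints=8, gib_per_checkpoint=1.0):
    """True if the free space covers the estimated checkpoint storage.

    gib_per_checkpoint is a placeholder assumption, not a measured size.
    """
    return free_gib(path) >= checkpoints * gib_per_checkpoint


print(f"Free space here: {free_gib('.'):.1f} GiB")
```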

In [3]:
# Fine-tune the selected BERT model with the data from the previous cell

family = "bert"
# transformersmodel = 'bert-base-multilingual-cased' 
# transformersmodel = 'ltgoslo/norbert'
transformersmodel = 'NbAiLab/nb-bert-base'
transformermodel_txt= transformersmodel.replace("/", "_")
results = []

for train_path, dev_path, tags in datasets:
    model_args = NERArgs() # New args loading fall 2020
    model_args.train_batch_size = 12
    model_args.num_train_epochs = 3
    model_args.weight_decay = 0.001
    model_args.overwrite_output_dir = True
    model_args.silent = False
    model_args.save_steps = 200000

    model = NERModel(family, transformersmodel, labels=tags, args=model_args)

    out_d = "outputs/simpletransformers/" + transformermodel_txt + "_" + train_path.split("/")[-2]
    running = os.path.join(out_d, "running")  # Logging individual results
    os.makedirs(running, exist_ok=True)  # Also creates out_d

    model.train_model(train_path, output_dir=out_d)
    print(transformersmodel, "Done training")

    result, model_outputs, predictions = model.eval_model(dev_path)

    #Record settings and results
    result["train"] = train_path
    result["dev_test"] = dev_path
    result["training_epochs"] = model_args.num_train_epochs
    result["transformer_model"] = transformersmodel
    # Use one timestamp so the result and prediction files of a run match
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M")
    json_path = os.path.join(running, "result_" + timestamp + ".json")
    os.makedirs(os.path.dirname(json_path), exist_ok=True)
    with open(json_path, "w") as wf:
        json.dump(result, wf)
    results.append(result)
    with open(os.path.join(running, "norec_fine_mono_predictions" + timestamp + ".json"), "w") as wf:
        json.dump(predictions, wf)

json_path = "summaries/results_" + datetime.datetime.now().strftime("%Y%m%d%H%M") + ".json"
os.makedirs(os.path.dirname(json_path), exist_ok=True)
with open(json_path, "w") as wf:
    json.dump(results, wf)
df = pd.DataFrame(results)  # One row per experiment



Some weights of the model checkpoint at NbAiLab/nb-bert-base were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from 

KeyboardInterrupt: 

In [None]:
# Run all above
df

In [None]:
df.to_csv(f"summaries/{transformermodel_txt}.csv")
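
The per-run JSON files can also be gathered again later, for instance to compare runs across sessions. This is a sketch under the assumption that the files follow the `running/result_*.json` layout written by the fine-tuning cell; the function name `load_results` and the extra `result_file` key are my own additions.

```python
import glob
import json
import os


def load_results(root="outputs/simpletransformers"):
    """Read every running/result_*.json under `root` into a list of dicts."""
    pattern = os.path.join(root, "*", "running", "result_*.json")
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as rf:
            row = json.load(rf)
        row["result_file"] = path  # Remember where each row came from
        rows.append(row)
    return rows
```

Passing the returned list to `pd.DataFrame` gives one row per saved run.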

## Inference
First on the model that is still in memory, then an example of how to load a saved model.

In [None]:
predictions, raw_outputs = model.predict(
    ["Mannen på scenen synger stygt", "Damen på scenen synger stygt", "Disse bilene har et fantastisk veigrep"]
)
for sentence in predictions:
    print(sentence)
    # print(json.dumps(sentence, indent=3, ensure_ascii=False))
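
To turn tagged tokens into readable targets, the BIO tags can be grouped into spans. The sketch below assumes each predicted sentence is a list of single-entry `{word: tag}` dicts, which is how I understand the output of `predict`; verify this against your installed version. The function name `extract_targets` is hypothetical, and the example input is hand-written, not actual model output.

```python
def extract_targets(tagged_sentence):
    """Group B-/I- tagged words into (target_text, polarity) tuples."""
    targets = []
    current_words, current_polarity = [], None
    for entry in tagged_sentence:
        (word, tag), = entry.items()  # Each entry holds exactly one word/tag pair
        parts = tag.split("-")  # e.g. ["B", "targ", "positive"] or ["O"]
        prefix = parts[0]
        polarity = parts[2] if len(parts) > 2 else None
        # A "B" tag, or an "I" tag with no open span, starts a new target
        starts_new = prefix == "B" or (prefix == "I" and not current_words)
        if prefix == "O" or starts_new:
            if current_words:
                targets.append((" ".join(current_words), current_polarity))
            current_words, current_polarity = [], None
        if prefix in ("B", "I"):
            if not current_words:
                current_polarity = polarity
            current_words.append(word)
    if current_words:  # Close a span that runs to the end of the sentence
        targets.append((" ".join(current_words), current_polarity))
    return targets


# Hand-written tags for illustration only.
example = [
    {"Disse": "B-targ-positive"},
    {"bilene": "I-targ-positive"},
    {"har": "O"},
    {"et": "O"},
    {"fantastisk": "O"},
    {"veigrep": "O"},
]
assert extract_targets(example) == [("Disse bilene", "positive")]
```

The polarity of a span is taken from its first tag. For the tag set without polarity (`B-targ`, `I-targ`), the polarity comes back as `None`.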


In [None]:
# Inference on saved model
# out_d: 'outputs/simpletransformers/bert-base-multilingual-cased_norec_fine'
if False:
    # Pick the saved folder with the highest trailing number
    last_epoch = sorted([f for f in os.listdir(out_d) if "-" in f], key=lambda x: int(x.split("-")[-1]))[-1]
    print(last_epoch)
    model2 = NERModel(family, os.path.join(out_d, last_epoch))
    predictions, raw_outputs = model2.predict(
        ["Mannen på scenen synger stygt", "Damen på scenen synger stygt", "Disse bilene har et fantastisk veigrep"]
    )
    for sentence in predictions:
        print(sentence)
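
The folder lookup above fails if the output directory holds any entry with a `-` in its name that does not end in a number. A slightly more defensive variant is sketched below. It keeps the same idea of picking the highest trailing number, but skips plain files and names without a numeric suffix. The function name `latest_checkpoint` is hypothetical, and folder names such as `checkpoint-200` are an assumption about what the output directory contains.

```python
import os
import re


def latest_checkpoint(out_dir):
    """Return the subdirectory of out_dir with the highest trailing number, or None."""
    best_name, best_step = None, -1
    for name in os.listdir(out_dir):
        match = re.search(r"-(\d+)$", name)
        # Only consider directories whose name ends in "-<digits>"
        if match and os.path.isdir(os.path.join(out_dir, name)):
            step = int(match.group(1))
            if step > best_step:
                best_name, best_step = name, step
    return best_name
```

The returned name can be joined with the output directory and passed to `NERModel` in the same way as `last_epoch` above. A `None` result means no matching folder was found.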