## Introduction
This notebook was created as a tutorial for the publication "Space transformers: language modeling for space systems". It takes the further pre-trained SpaceTransformer models (SpaceBERT, SpaceSciBERT, SpaceRoBERTa) and fine-tunes them on the Concept Recognition (CR) task from the paper.

---

If you use the models for your experiments please cite: 


Berquand, A., Darm, P., & Riccardi, A. (2021). Space transformers: language modeling for space systems. IEEE Access, 9, 133111-133122. https://doi.org/10.1109/ACCESS.2021.3115659


In [None]:
# @title Licensed under the MIT License

# Copyright (c) 2021 Paul Darm

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

Google Colab offers free GPUs and TPUs. Since we'll be training a large neural network it's best to take advantage of this (in this case we'll attach a GPU), otherwise training will take a very long time.

A GPU can be added by going to the menu and selecting:

`Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)`

Then run the following cell to confirm that the GPU is detected.

In [1]:
!nvidia-smi

Thu Apr 28 10:36:40 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
#@title Install necessary libraries and modules when running notebook on Google Colab
!pip install transformers

from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit 
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

import spacy
import json
import time
import pandas as pd
import warnings

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: p

## Load data

In [3]:
## Load CR dataset from GitHub
!wget https://raw.githubusercontent.com/strath-ace/smart-nlp/master/SpaceTransformers/CR/CR_ECSS_dataset.json
dataset = pd.read_json('/content/CR_ECSS_dataset.json')
dataset

--2022-04-28 10:37:16--  https://raw.githubusercontent.com/strath-ace/smart-nlp/master/SpaceTransformers/CR/CR_ECSS_dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1645094 (1.6M) [text/plain]
Saving to: ‘CR_ECSS_dataset.json’


2022-04-28 10:37:16 (207 MB/s) - ‘CR_ECSS_dataset.json’ saved [1645094/1645094]



Unnamed: 0,sentence_id,words,labels
0,0,It,O
1,0,shall,O
2,0,be,O
3,0,demonstrated,O
4,0,that,O
...,...,...,...
31438,874,energy,Space Environment
31439,874,deposition,Space Environment
31440,874,",",O
31441,874,e,O


In [4]:
## Unique values of labels (tags) in dataset
tag_vals = dataset['labels'].unique()
## Dict to map tags to integer ids, e.g. 'O' --> 0 / 'Quality control' --> 5
tag2idx = {}
for count, tag in enumerate(tag_vals):
    tag2idx[tag] = count
print(tag2idx)
## Dict to map integer ids back to their original tags, e.g. 0 --> 'O' / 15 --> 'Space Environment'
tag2name = {idx: tag for tag, idx in tag2idx.items()}

## Add an entry for -100, the label id that PyTorch's cross-entropy loss ignores by default
## --> see https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities
tag2name[-100] = "None"
print(tag2name)

{'O': 0, 'Cleanliness': 1, 'Materials / EEEs': 2, 'Nonconformity': 3, 'System engineering': 4, 'Quality control': 5, 'Measurement': 6, 'Parameter': 7, 'GN&C': 8, 'Project Scope': 9, 'OBDH': 10, 'Power': 11, 'Structure & Mechanism': 12, 'Thermal': 13, 'Telecom.': 14, 'Space Environment': 15, 'Project Organisation / Documentation': 16, 'Safety / Risk (Control)': 17, 'Propulsion': 18}
{0: 'O', 1: 'Cleanliness', 2: 'Materials / EEEs', 3: 'Nonconformity', 4: 'System engineering', 5: 'Quality control', 6: 'Measurement', 7: 'Parameter', 8: 'GN&C', 9: 'Project Scope', 10: 'OBDH', 11: 'Power', 12: 'Structure & Mechanism', 13: 'Thermal', 14: 'Telecom.', 15: 'Space Environment', 16: 'Project Organisation / Documentation', 17: 'Safety / Risk (Control)', 18: 'Propulsion', -100: 'None'}


In [5]:
tag_vals

array(['O', 'Cleanliness', 'Materials / EEEs', 'Nonconformity',
       'System engineering', 'Quality control', 'Measurement',
       'Parameter', 'GN&C', 'Project Scope', 'OBDH', 'Power',
       'Structure & Mechanism', 'Thermal', 'Telecom.',
       'Space Environment', 'Project Organisation / Documentation',
       'Safety / Risk (Control)', 'Propulsion'], dtype=object)

## Data pre-processing 

In [6]:
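def tokenize_and_align_labels(examples, labels, tokenizer):
    """
    Taken and adapted from: https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
    Aligns the list of word-level labels with the sub-word tokenisation of BERT-type models.
    With label_all_tokens = True (as set below) every piece of a word receives the word's label:
    [polyethylene] --> [poly, ##eth, ##yle, ##ne]
    [2]            --> [2, 2, 2, 2]
    With label_all_tokens = False only the first piece keeps the label:
    [2]            --> [2, -100, -100, -100]
    Special tokens and padding always receive the label -100.
    """
    tokenized_inputs = tokenizer(list(examples), padding=True, truncation=False, is_split_into_words=True)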

    word_piece_labels = []
    label_all_tokens = True
    for i, label in enumerate(labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            # --> see https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        word_piece_labels.append(label_ids)

    tokenized_inputs["labels"] = word_piece_labels
    return tokenized_inputs
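
To make the alignment concrete, the sketch below reproduces the inner loop of `tokenize_and_align_labels` without a tokenizer. The `word_ids` list is written by hand to mimic what a fast tokenizer might return for one word split into four pieces between two special tokens. The helper name `align_labels` and the example values are illustrative assumptions, not part of the original notebook.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand one label per word to one label per word piece."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens and padding have no word id.
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            # First piece of a word always keeps the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining pieces: repeat the label or mask it with -100.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned


# Hand-written word ids: special token, four pieces of word 0, special token.
word_ids = [None, 0, 0, 0, 0, None]
word_labels = [2]

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 2, 2, 2, 2, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 2, -100, -100, -100, -100]
```

With `label_all_tokens=True`, as in this notebook, every word piece contributes to the loss and to the metrics, so long words split into many pieces carry more weight than short ones.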

## Tokenisation

The individual words in the requirements need to be "tokenised" before we can use them as input for the model.

Tokenisation consists of two steps:

1.   Splitting words into tokens from the model's vocabulary (BERT models use word-piece tokenisation, e.g. polyethylene --> poly ##eth ##yle ##ne)
2.   Mapping the tokens to their corresponding IDs in the vocabulary (poly ##eth ##yle ##ne --> 26572, 11031, 12844, 2638)

Therefore, we need to download the respective tokeniser from Hugging Face (BERT for SpaceBERT, SciBERT for SpaceSciBERT, RoBERTa for SpaceRoBERTa).
We do this by instantiating our tokenizer with the `AutoTokenizer.from_pretrained` method.
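
The two steps can be illustrated with a toy example that needs no download. This is a minimal sketch of greedy longest-match-first word-piece splitting, assuming a tiny hand-made vocabulary. The function `wordpiece_tokenize`, the vocabulary `toy_vocab` and its ids are hypothetical and do not correspond to any real model; the Hugging Face tokenizers add normalisation, special tokens and other details on top of this.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary with made-up ids (not the ids of any real model).
toy_vocab = {"[UNK]": 0, "poly": 1, "##eth": 2, "##yle": 3, "##ne": 4, "##e": 5}

tokens = wordpiece_tokenize("polyethylene", toy_vocab)  # step 1: word -> tokens
ids = [toy_vocab[token] for token in tokens]            # step 2: tokens -> ids
print(tokens)  # ['poly', '##eth', '##yle', '##ne']
print(ids)     # [1, 2, 3, 4]
```

RoBERTa uses byte-level BPE rather than word-piece tokenisation, so the pieces it produces for the same word look different, as the `Ġ`-prefixed tokens in the output further down show.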


In [None]:
#=======================================
#          DATA PREPARATION 
#=======================================

## Different Tokenizers
#tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  ##SpaceBERT
#tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased') ## SpaceSciBERT
tokenizer = AutoTokenizer.from_pretrained('roberta-base',add_prefix_space=True) ## SpaceRoBERTa

label_all_tokens = True

sentences = [[word for word in dataset[dataset['sentence_id']==i]['words'].values] for i in dataset['sentence_id'].unique()]
labels = [[tag2idx[label] for label in dataset[dataset['sentence_id']==i]['labels'].values] for i in dataset['sentence_id'].unique()]

encoded_input = tokenize_and_align_labels(sentences, labels, tokenizer)

input_ids = encoded_input["input_ids"]

attention_masks = encoded_input["attention_mask"]

labels = encoded_input['labels']

for i in range(0, 5):
    print("No.%d,len:%d" % (i, len(input_ids[i])))
    print("texts:%s" % " ".join(tokenizer.tokenize(tokenizer.decode(input_ids[i]))))
    print("No.%d,len:%d" % (i, len(labels[i])))
    # Tag names aligned with the tokens above
    print("labels:%s" % " ".join(tag2name[j] for j in labels[i]))


tr_inputs, val_inputs, tr_tags, val_tags,tr_masks, val_masks = train_test_split(input_ids, labels,attention_masks, 
                                                     random_state=4, test_size=0.213)
                                                           


tr_inputs = torch.as_tensor(tr_inputs)
val_inputs = torch.as_tensor(val_inputs)
tr_tags = torch.as_tensor(tr_tags)
val_tags = torch.as_tensor(val_tags)
tr_masks = torch.as_tensor(tr_masks)
val_masks = torch.as_tensor(val_masks)

batch_num = 8

# The dataset holds input ids, attention masks and labels; no segment (token type) ids are needed
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
# drop_last=False keeps the final batch even if it is smaller than batch_num
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_num, drop_last=False)

valid_data = TensorDataset(val_inputs, val_masks, val_tags)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=batch_num)

No.0,len:497
texts:<s> ĠIt Ġshall Ġbe Ġdemonstrated Ġthat Ġno Ġadditional Ġcontamination Ġis Ġintroduced Ġduring Ġthe Ġhandling Ġprocess . Ġ Ċ Ġ Ġ Ġ Ġ ĠNOTE Ġ1 ĠCont amination Ġcan Ġbe Ġavoided Ġby Ġusing Ġtw eez ers Ġand Ġclean Ġgloves , Ġand Ġensuring Ġthat Ġgloves Ġand Ġchemicals Ġare Ġcompatible Ġ Ċ Ġ Ġ Ġ Ġ ĠNOTE Ġ2 ĠTypically Ġused Ġgloves Ġare Ġof Ġpowder Ġ- Ġfree Ġnylon , Ġnit ri le , Ġlatex , Ġl int Ġ- Ġfree Ġcotton . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <

In [None]:
tokenizer('polyethylene')

{'input_ids': [0, 11424, 30298, 2552, 2], 'attention_mask': [1, 1, 1, 1, 1]}

## Model Loading 
After data preparation we load a pre-trained SpaceTransformer to fine-tune it on CR. Since our task is about classifying single words (tokens), we use the `AutoModelForTokenClassification` class.

Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which we can get from our `tag2idx` dictionary).

When loading the model, a warning tells us that we are throwing away some weights (the `lm_head` layers) and randomly initializing others (the `classifier` layer). This is expected here: we are removing the head used to pre-train the model on a masked language modeling objective and replacing it with a new head for which we don't have pre-trained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

In [None]:
!nvidia-smi

Wed Apr 27 20:05:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
## Load pre-trained SpaceRoBERTa from Huggingface for fine-tuning (exchange for any model you want to fine-tune; the tokenizer above must match the model)

model = AutoModelForTokenClassification.from_pretrained("icelab/spaceroberta", num_labels=len(tag2idx))

## Train with GPU, if available
n_gpu = torch.cuda.device_count()
if torch.cuda.is_available():
    model.cuda()
    if n_gpu >1:
        model = torch.nn.DataParallel(model)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Downloading:   0%|          | 0.00/476M [00:00<?, ?B/s]

Some weights of the model checkpoint at icelab/spaceroberta were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at icelab/spaceroberta and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this 

In [None]:
device

device(type='cuda')

In [None]:
!nvidia-smi

Wed Apr 27 20:07:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    58W / 149W |   1091MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
param_size = 0
for param in model.parameters():
    param_size += param.nelement() * param.element_size()
buffer_size = 0
for buffer in model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

size_all_mb = (param_size + buffer_size) / 1024**2
print('model size: {:.3f}MB'.format(size_all_mb))

model size: 473.296MB
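
As a rough cross-check of this number, the size can be converted back into a count of tensor elements. The sketch below assumes that every parameter and buffer is stored as a 32-bit (4-byte) float, which is only approximately true since some buffers may use other dtypes. The helper names `size_in_mb` and `elements_for_size` are hypothetical.

```python
def size_in_mb(num_elements, bytes_per_element=4):
    """Memory footprint in MiB for a given number of tensor elements."""
    return num_elements * bytes_per_element / 1024**2


def elements_for_size(size_mb, bytes_per_element=4):
    """Invert size_in_mb: number of elements that fit into size_mb MiB."""
    return size_mb * 1024**2 / bytes_per_element


# Size reported by the cell above, assuming 4 bytes per element throughout.
approx_elements = elements_for_size(473.296)
print(f"{approx_elements / 1e6:.1f} million elements")  # 124.1 million elements
```

This only covers the weights themselves. During training, gradients, optimizer state (Adam-type optimizers keep two moment estimates per parameter) and activations need additional GPU memory, which is why the memory usage reported by `nvidia-smi` grows well beyond the model size.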


## Train script 

This training loop is heavily influenced by Chris McCormick's notebook "BERT Fine-Tuning Sentence Classification".

https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX#scrollTo=GI0iOY8zvZzL

**From the notebook:**

"> *Thank you to [Stas Bekman](https://ca.linkedin.com/in/stasbekman) for contributing the insights and code for using validation loss to detect over-fitting!*

**Training:**
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass. 
    - In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress

**Evaluation:**
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Forward pass (feed input data through the network)
- Compute loss on our validation data and track variables for monitoring progress" 
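
The order of these steps can be sketched without any deep learning framework. The example below is a deliberately tiny stand-in, assuming a model with a single weight `w` and a squared-error loss; it is not the PyTorch code used in this notebook. The class `Param` and the functions `train_epoch` and `evaluate` are hypothetical names. The gradient is stored on the parameter and accumulates, which is why it has to be cleared before each backward pass.

```python
class Param:
    """A single scalar weight whose gradient accumulates across backward calls."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def forward(w, x):
    return w.value * x


def backward(w, x, y):
    # d/dw of (w*x - y)^2 is 2*(w*x - y)*x; it is added to the stored gradient.
    w.grad += 2 * (forward(w, x) - y) * x


def train_epoch(w, data, lr=0.05):
    total_loss = 0.0
    for x, y in data:
        w.grad = 0.0                             # clear the previous gradient
        total_loss += (forward(w, x) - y) ** 2   # forward pass and loss
        backward(w, x, y)                        # backward pass
        w.value -= lr * w.grad                   # parameter update
    return total_loss / len(data)


def evaluate(w, data):
    # Forward pass only: no gradient is computed and no update is applied.
    return sum((forward(w, x) - y) ** 2 for x, y in data) / len(data)


train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2x
valid_data = [(4.0, 8.0)]

w = Param(0.0)
for epoch in range(20):
    train_loss = train_epoch(w, train_data)
valid_loss = evaluate(w, valid_data)
print(round(w.value, 3), valid_loss < 1e-3)
```

In the real loop below, `model.zero_grad()`, `loss.backward()` and `optimizer.step()` play the roles of the three commented lines in `train_epoch`, and `torch.no_grad()` corresponds to the forward-only `evaluate`.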



In [None]:
import time
import datetime
import random
import numpy as np
from transformers import set_seed
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

#
# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
set_seed(seed_val)
#=======================================
#          Train parameters
#=======================================

## determines if all layers are fine_tuned or just the last, newly initialized ones 
FULL_FINETUNING = True


if FULL_FINETUNING:
    # Fine-tune the parameters of all layers
    param_optimizer = list(model.named_parameters())
    # Parameters whose name contains an entry of no_decay go into the second group.
    # The list is empty here, so every parameter lands in the first group.
    #no_decay = ['bias', 'gamma', 'beta']
    no_decay = []
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.0, 'correct_bias': False},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0, 'correct_bias': False}
    ]
else:
    # Only fine-tune the classifier parameters
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5) ## lr = learning rate; the Hugging Face Trainer defaults to 5e-5


# Number of training epochs. The BERT authors recommend between 2 and 4. 
epochs = 4


# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []
train_loss = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` call is here (written for
        # BertForSequenceClassification; the token classification models
        # return loss and logits in the same order):
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns different numbers of outputs depending on what arguments
        # are given and what flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss, logits = outputs[:2]
                           

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    train_loss.append(avg_train_loss)

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    # total_eval_accuracy = 0
    total_eval_loss = 0
    #nb_eval_steps = 0

    y_true = []
    y_pred = []

    # Evaluate data for one epoch
    for batch in valid_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            (loss, logits) = model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)[:2]
            #loss, logits =  outputs[:2]
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Convert the logits to predicted label ids and move predictions and
        # labels to the CPU. The argmax over the label dimension is taken on
        # the raw logits; applying a (log-)softmax first would not change it.
        pred_ids = torch.argmax(logits, dim=2).detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Attention mask: 1 for real tokens, 0 for padding.
        input_mask = b_input_mask.to('cpu').numpy()

        # Map true and predicted ids back to tag names, sentence by sentence.
        for i, mask in enumerate(input_mask):
            true_tags = []
            pred_tags = []
            for j, m in enumerate(mask):
                # A mask value of 0 marks padding. Padding only occurs at the
                # end of a sequence, so we stop at the first 0.
                if not m:
                    break
                # Skip special tokens ([CLS]/[SEP] or <s>/</s>), which carry the label -100.
                if tag2name[label_ids[i][j]] != "None":
                    true_tags.append(tag2name[label_ids[i][j]])
                    pred_tags.append(tag2name[pred_ids[i][j]])

            y_true.append(true_tags)
            y_pred.append(pred_tags)

        
        
    # Flatten the per-sentence tag lists into one list of tags.
    y_true_words = [tag for sentence in y_true for tag in sentence]
    y_pred_words = [tag for sentence in y_pred for tag in sentence]

    # Evaluate only the concept labels that occur in the validation set; 'O' is excluded.
    names = [label for label in set(y_true_words) if label != 'O']
    # [2:] keeps the per-label F1 scores and supports (number of true examples).
    scores = precision_recall_fscore_support(y_true_words, y_pred_words, labels=names)[2:]
    ## result dicts
    f1_scores = {}
    examples = {}
    ## add values for each label
    for i, label in enumerate(names):
        f1_scores[label] = scores[0][i]
        examples[label] = scores[1][i]

    ## add the weighted-average F1 score and the total number of examples to the result dicts
    f1_scores['weighted'] = f1_score(y_true_words, y_pred_words, average='weighted', labels=names)
    examples['sum'] = np.sum([examples[key] for key in examples.keys()])
    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(valid_dataloader)

    ## print validation loss and F1 score for epoch 
    
    print("  F1_score: {0:.2f}".format(f1_scores['weighted']))
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))

    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'F1 score': f1_scores['weighted'],
            'examples_sum': examples['sum'],
            'Label_F1_scores':f1_scores,
            'examples'    : examples,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )
    ## save training values to file
    with open("Train_results.json", 'w+', encoding='utf-8') as file:
        pd.DataFrame(training_stats).to_json(file, orient='records', force_ascii=False)

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))






Training...

  Average training loss: 0.89
  Training epoch took: 0:02:01

Running Validation...
  F1_score: 0.60
  Validation Loss: 0.56
  Validation took: 0:00:12

Training...

  Average training loss: 0.38
  Training epoch took: 0:02:01

Running Validation...
  F1_score: 0.65
  Validation Loss: 0.49
  Validation took: 0:00:12

Training...

  Average training loss: 0.22
  Training epoch took: 0:02:02

Running Validation...
  F1_score: 0.70
  Validation Loss: 0.47
  Validation took: 0:00:12

Training...

  Average training loss: 0.14
  Training epoch took: 0:02:01

Running Validation...
  F1_score: 0.70
  Validation Loss: 0.48
  Validation took: 0:00:12

Training complete!
Total training took 0:08:52 (h:mm:ss)
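
The training loop rewrites `Train_results.json` after every epoch with `orient='records'`, i.e. as a JSON list holding one object per epoch. Below is a minimal sketch of how such a file could be read back to find the epoch with the lowest validation loss, using only the standard library. The records copy the rounded values printed in the log above (only a subset of the keys), and the temporary file is a stand-in for the real results file.

```python
import json
import os
import tempfile

# Records in the same shape as `training_stats` (subset of keys);
# the numbers are the rounded values from the training log above.
records = [
    {"epoch": 1, "Training Loss": 0.89, "Valid. Loss": 0.56, "F1 score": 0.60},
    {"epoch": 2, "Training Loss": 0.38, "Valid. Loss": 0.49, "F1 score": 0.65},
    {"epoch": 3, "Training Loss": 0.22, "Valid. Loss": 0.47, "F1 score": 0.70},
    {"epoch": 4, "Training Loss": 0.14, "Valid. Loss": 0.48, "F1 score": 0.70},
]

# Stand-in for Train_results.json so the sketch runs anywhere
results_path = os.path.join(tempfile.mkdtemp(), "Train_results.json")
with open(results_path, "w", encoding="utf-8") as file:
    json.dump(records, file)

# Read the per-epoch records back
with open(results_path, encoding="utf-8") as file:
    stats = json.load(file)

# Epoch with the lowest validation loss
best = min(stats, key=lambda row: row["Valid. Loss"])
print("Best epoch:", best["epoch"], "with validation loss", best["Valid. Loss"])
```

In the run above the validation loss is lowest in epoch 3 while the training loss keeps falling, which may indicate that further epochs mostly fit the training set.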


In [None]:
!nvidia-smi

Wed Apr 27 20:18:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    58W / 149W |   7740MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
## Print classification report with the results of the last validation run
y_true_words = [word for require in y_true for word in require]
y_pred_words = [word for require in y_pred for word in require]
## Score every label present in the ground truth except 'O'
report_labels = [label for label in set(y_true_words) if label != 'O']
report = classification_report(y_true_words, y_pred_words, digits=3, labels=report_labels)
print(report)

                                      precision    recall  f1-score   support

                   Space Environment      0.714     0.717     0.716       293
                  System engineering      0.680     0.692     0.686       120
                           Parameter      0.529     0.575     0.551       160
                     Quality control      0.762     0.749     0.755       231
                               Power      0.805     0.841     0.823       182
                                GN&C      0.653     0.646     0.649        96
                             Thermal      0.667     0.852     0.748       108
                         Measurement      0.860     0.908     0.883       217
               Structure & Mechanism      0.595     0.566     0.580        83
                          Propulsion      0.617     0.430     0.507        86
Project Organisation / Documentation      0.377     0.349     0.363        83
                       Nonconformity      0.538     0.596     0
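
The `weighted` F1 score recorded during training is the average of the per-label F1 scores, weighted by each label's support (the number of true instances), with `'O'` left out. The following is a minimal pure-Python sketch of that calculation, written only to illustrate the idea; the helper names `per_label_f1` and `weighted_f1` and the toy labels are assumptions, not part of the notebook, which uses scikit-learn for this.

```python
def per_label_f1(y_true, y_pred, labels):
    """Return {label: (f1, support)} computed from word-level labels."""
    result = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        result[label] = (f1, tp + fn)  # support = number of true instances
    return result


def weighted_f1(y_true, y_pred, labels):
    """Average the per-label F1 scores, weighted by support."""
    scores = per_label_f1(y_true, y_pred, labels)
    total = sum(support for _, support in scores.values())
    if total == 0:
        return 0.0
    return sum(f1 * support for f1, support in scores.values()) / total


# Toy labels (hypothetical, not taken from the notebook's data)
y_true = ["Power", "Power", "Thermal", "O", "Thermal", "O"]
y_pred = ["Power", "O", "Thermal", "Thermal", "Thermal", "O"]
labels = [label for label in set(y_true) if label != "O"]

scores = per_label_f1(y_true, y_pred, labels)
overall = weighted_f1(y_true, y_pred, labels)
print(scores)
print(overall)
```

Because the weights are the supports, frequent labels such as `Space Environment` influence the weighted score more than rare ones.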

## Save fine-tuned model to Google Drive / disk

#### Connect to drive

In [None]:
## connect the notebook to your own Google Drive storage 
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Save / load model 

In [None]:
## Save Fine-tuned model 
## Specify path in google drive
path = "/content/drive/My Drive/"
model_out_address = path +'models/Fine-tuned_SpaceBERT'
## Save the label dicts in the model config so the labels are available when the model is loaded again
## The -100 key (ignored positions) has to be removed, otherwise loading the saved model raises an error
model.config.id2label = {key: tag2name[key] for key in tag2name.keys() - {-100}}
model.config.label2id = tag2idx

## Save model
model.save_pretrained(model_out_address, save_config = True)
## Save tokeniser
tokenizer.save_pretrained(model_out_address)

('/content/drive/My Drive/models/Fine-tuned_SpaceBERT/tokenizer_config.json',
 '/content/drive/My Drive/models/Fine-tuned_SpaceBERT/special_tokens_map.json',
 '/content/drive/My Drive/models/Fine-tuned_SpaceBERT/vocab.txt',
 '/content/drive/My Drive/models/Fine-tuned_SpaceBERT/added_tokens.json',
 '/content/drive/My Drive/models/Fine-tuned_SpaceBERT/tokenizer.json')
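
The label mappings stored in the model config end up in a JSON file, which has two practical consequences: the placeholder entry for `-100` (the label id that PyTorch's cross-entropy loss ignores by default) has no label name and is dropped before saving, and JSON object keys are always strings, so integer ids have to be converted back after reading. The sketch below illustrates both points with the standard `json` module only; the three labels are illustrative, and the explicit `int` conversion is written out here as an assumption about what a loader has to do, not as a description of the library internals.

```python
import json

# Illustrative mappings in the shape the notebook uses (labels are examples)
tag2idx = {"O": 0, "Power": 1, "Thermal": 2}
tag2name = {idx: tag for tag, idx in tag2idx.items()}
tag2name[-100] = None  # placeholder for positions ignored by the loss

# Drop the -100 placeholder before storing the mapping
id2label = {idx: name for idx, name in tag2name.items() if idx != -100}
label2id = {name: idx for idx, name in id2label.items()}

# Round trip through JSON: integer keys come back as strings
restored = json.loads(json.dumps({"id2label": id2label, "label2id": label2id}))
restored_id2label = {int(idx): name for idx, name in restored["id2label"].items()}

print(restored["id2label"])   # keys are strings here
print(restored_id2label)      # keys converted back to int
```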

In [None]:
##Load saved model from drive 
model = AutoModelForTokenClassification.from_pretrained(model_out_address)#num_labels=len(tag2idx))
tokenizer = AutoTokenizer.from_pretrained(model_out_address)
## Move the model to the GPU, if available
n_gpu = torch.cuda.device_count()
if torch.cuda.is_available():
    model.cuda();
    if n_gpu >1:
        model = torch.nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
## Or load CR model from Huggingface hub 
model_file_address = 'icelab/spacescibert_CR'

tokenizer = AutoTokenizer.from_pretrained(model_file_address)

model = AutoModelForTokenClassification.from_pretrained(model_file_address)
n_gpu = torch.cuda.device_count()
if torch.cuda.is_available():
    model.cuda();
    if n_gpu >1:
        model = torch.nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Example labelling

In [None]:
text ="The CubeSat RF design shall either have one RF inhibit and a RF power output no greater than 1.5W at the transmitter antenna's RF input OR the CubeSat shall have a minimum of two independent RF inhibits (CDS 3.3.9) (ISO 5.5.6). "

In [None]:
import numpy as np
import spacy
nlp = spacy.load("en_core_web_sm")
model.eval()  ## inference mode: disables dropout

## Split the text into words with the spaCy tokenizer
docs = nlp.tokenizer(text)
sentences = [word.text for word in docs]

## Sub-word tokenise the pre-split words with the model tokenizer
encoded_input = tokenizer(sentences, return_tensors="pt", padding=True, is_split_into_words=True)
input_ids = encoded_input['input_ids']
attention_masks = encoded_input["attention_mask"]

## Move the tensors to the same device as the model
i_ids = input_ids.to(device)
a_masks = attention_masks.to(device)
with torch.no_grad():
  prediction = model(i_ids, token_type_ids=None, attention_mask=a_masks)[0]

## Predicted label id per sub-token (argmax over the label dimension)
logits = torch.argmax(F.log_softmax(prediction, dim=2), dim=2)
logits = logits.detach().cpu().numpy()
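
The model predicts one label per sub-token, while the concepts are reported per spaCy word, so the predictions have to be mapped back to words. The notebook does this by counting how many pieces each word was split into. As an alternative sketch, fast Hugging Face tokenizers expose `encoded_input.word_ids()`, which returns for every sub-token the index of the word it came from (`None` for special tokens). The helper below works on such a list in plain Python; the name `first_subtoken_tags` and the example values are hypothetical, and it assumes that a word yielding no sub-token simply does not appear in the list.

```python
def first_subtoken_tags(word_ids, token_tags, n_words, default="O"):
    """Keep the tag of the first sub-token of every word.

    word_ids:   word index per sub-token, None for special tokens
                (the shape returned by `encoded_input.word_ids()`)
    token_tags: predicted tag per sub-token, same length as word_ids
    n_words:    number of words passed to the tokenizer
    """
    word_tags = [default] * n_words  # words without any sub-token keep the default
    seen = set()
    for word_id, tag in zip(word_ids, token_tags):
        if word_id is None or word_id in seen:
            continue  # skip [CLS] / [SEP] and non-first sub-tokens
        seen.add(word_id)
        word_tags[word_id] = tag
    return word_tags


# Hypothetical example: "CubeSat" is split into two pieces, "\n" yields no piece
words = ["The", "CubeSat", "\n", "design"]
word_ids = [None, 0, 1, 1, 3, None]
token_tags = ["O", "O", "System engineering", "System engineering",
              "System engineering", "O"]

word_tags = first_subtoken_tags(word_ids, token_tags, len(words))
print(list(zip(words, word_tags)))
```

Taking the first sub-token's label matches the notebook's choice of adding only one prediction per split word.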

In [None]:
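## spaCy Doc from the previous cell and the list that collects the concept spans
doc = docs
spans = []
## Tags from logits without added [CLS] / [SEP] tokens
tags_s = [tag2name[t] for t in logits[0][1:-1]]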
#scores = scores[1:-1]
# Count if word was split by tokenizer to split labels of prediction
j = 0
## Tags, adjusted to wordpiece tokenisation
tags_r = []
#scores_r = []
for word_count, word in enumerate(doc):
    ## Tokenise each word of the sentence
    word_ids = tokenizer.tokenize(word.text)

    ## spaCy emits a token for "\n", but the BERT / RoBERTa tokenizers produce no sub-token for it (len(tokenizer.tokenize("\n")) == 0) --> shift the offset back by one
    if len(word_ids) == 0:
      j -= 1
      tags_r.append('O')
      #scores_r.append(0)
      pass
    ## == 1 --> Words get not split by word-piece tokeniser
    elif len(word_ids) == 1:
      # spans.append(mappings[word_count])#s[word_count])
      tags_r.append(tags_s[word_count + j])
    # scores_r.append(scores[word_count + j])
      # pass
    ## Word gets split --> Only one prediction gets added to list of tags
    else:
      tags_r.append(tags_s[word_count + j])
    # scores_r.append(scores[word_count + j])
      j += (len(word_ids) - 1)


#### Get spans in input sentence
            
## Label of previous word
temp_label = ''
## word counter, how many steps to go back for complete sequence
j = 0
for word_count, word in enumerate(doc):
    ## Tag is "0" and no Label in previous word
    if tags_r[word_count] == 'O' and j == 0:
        pass
    ## Tag is same as before
    elif temp_label == tags_r[word_count][2:]:
        j -= 1
    ## Tag is "0" indicating that concept is complete --> add to span and reset word_counter
    elif tags_r[word_count] == 'O' and j != 0:
        spans.append({"start": doc[word_count + j].idx,
                      "end": (doc[word_count - 1].idx + len(doc[word_count - 1].text)),
                      'token_start': doc[word_count + j].i,
                      'token_end': doc[word_count - 1].i, "label": tags_r[word_count - 1], 
                  #   'score': np.mean(scores_r[word_count +j:word_count])
                      })# [2:]
        j = 0  #
        temp_label = ''
    ## Tag is a Label, word counter starts
    else:
        temp_label = tags_r[word_count][2:]
        j -= 1

## A concept that runs up to the last word has no trailing 'O' to close it --> add it here
if j != 0:
    last = len(doc) - 1
    spans.append({"start": doc[last + 1 + j].idx,
                  "end": doc[last].idx + len(doc[last].text),
                  'token_start': doc[last + 1 + j].i,
                  'token_end': doc[last].i, "label": tags_r[last]})

In [None]:
def predict_spans(text, model): 
    
    """ 
    Take a text as input and return a list of span dicts for use with spaCy.
    text = str
    model = Hugging Face token classification model
    Relies on the globals `tokenizer`, `device` and `tag2name` from earlier cells.
    """

    nlp = spacy.load("en_core_web_sm")
    model.eval();
    sentences = []

    ## use spaCy to tokenize the requirement
    ## (named `doc` because the loops below iterate over `doc`)
    doc = nlp.tokenizer(text)

    word_list = [word.text for word in doc]

    ## tokenize the word with loaded tokenizer

    encoded_input = tokenizer(word_list, return_tensors="pt", padding=True, is_split_into_words=True, truncation=True, max_length=512)
    input_ids = encoded_input['input_ids']
    attention_masks = encoded_input["attention_mask"]

    ## load arrays on device (GPU)
    i_ids = input_ids.to(device)
    a_masks = attention_masks.to(device)

    with torch.no_grad():  
      prediction = model(i_ids, token_type_ids=None, attention_mask=a_masks, )[0]

    logits = torch.argmax(F.log_softmax(prediction, dim=2), dim=2)
    logits = logits.detach().cpu().numpy()

    tokens = []
    spans = []
    
    
    ## Tags from logits without added [CLS] / [SEP] tokens
    tags_s = [tag2name[t] for t in logits[0][1:-1]]
    #scores = scores[1:-1]
    # Count if word was split by tokenizer to split labels of prediction
    j = 0
    ## Tags, adjusted to wordpiece tokenisation
    tags_r = []
    #scores_r = []
    for word_count, word in enumerate(doc):
        ## Tokenise each word of the sentence
        word_ids = tokenizer.tokenize(word.text)

        ## spaCy emits a token for "\n", but the BERT / RoBERTa tokenizers produce no sub-token for it (len(tokenizer.tokenize("\n")) == 0) --> shift the offset back by one
        if len(word_ids) == 0:
          j -= 1
          tags_r.append('O')
          #scores_r.append(0)
          pass
        ## == 1 --> Words get not split by word-piece tokeniser
        elif len(word_ids) == 1:
          # spans.append(mappings[word_count])#s[word_count])
          tags_r.append(tags_s[word_count + j])
        # scores_r.append(scores[word_count + j])
          # pass
        ## Word gets split --> Only one prediction gets added to list of tags
        else:
          tags_r.append(tags_s[word_count + j])
        # scores_r.append(scores[word_count + j])
          j += (len(word_ids) - 1)


    #### Get spans in input sentence
                
    ## Label of previous word
    temp_label = ''
    ## word counter, how many steps to go back for complete sequence
    j = 0
    for word_count, word in enumerate(doc):
        ## Tag is "0" and no Label in previous word
        if tags_r[word_count] == 'O' and j == 0:
            pass
        ## Tag is same as before
        elif temp_label == tags_r[word_count][2:]:
            j -= 1
        ## Tag is "0" indicating that concept is complete --> add to span and reset word_counter
        elif tags_r[word_count] == 'O' and j != 0:
            spans.append({"start": doc[word_count + j].idx,
                          "end": (doc[word_count - 1].idx + len(doc[word_count - 1].text)),
                          'token_start': doc[word_count + j].i,
                          'token_end': doc[word_count - 1].i, "label": tags_r[word_count - 1], 
                      #   'score': np.mean(scores_r[word_count +j:word_count])
                          })# [2:]
            j = 0  #
            temp_label = ''
        ## Tag is a Label, word counter starts
        else:
            temp_label = tags_r[word_count][2:]
            j -= 1
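    ## A concept that runs up to the last word has no trailing 'O' to close it --> add it here
    if j != 0:
        last = len(doc) - 1
        spans.append({"start": doc[last + 1 + j].idx,
                      "end": doc[last].idx + len(doc[last].text),
                      'token_start': doc[last + 1 + j].i,
                      'token_end': doc[last].i, "label": tags_r[last]})
    return spans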

In [None]:
try:
    tag2name
except NameError:
    tag2name = model.config.id2label
    tag2name[-100]=None 

nlp = spacy.load('en_core_web_sm', disable=['ner'])
doc = nlp(text)
spans = predict_spans(text, model)
concepts = []
for annot in spans:
    concepts.append({'word': doc.char_span(annot['start'], annot['end'], label=annot['label']).text,
                     'label': annot['label']})
concepts


[{'label': 'System engineering', 'word': 'CubeSat RF design'},
 {'label': 'Telecom.', 'word': 'one RF inhibit'},
 {'label': 'Telecom.', 'word': 'RF power output'},
 {'label': 'Measurement', 'word': '1.5W'},
 {'label': 'Telecom.', 'word': 'transmitter antenna'},
 {'label': 'Telecom.', 'word': 'RF input'},
 {'label': 'System engineering', 'word': 'CubeSat'},
 {'label': 'Telecom.', 'word': 'two independent RF inhibits'}]
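
The span extraction in `predict_spans` amounts to grouping consecutive words that carry the same label. A compact way to sketch that idea is `itertools.groupby`, shown below on plain Python lists. The helper name `group_tags` and the example tags are hypothetical; unlike the notebook code, this version works on token indices only, joins words with single spaces instead of using character offsets, and starts a new span whenever the label changes.

```python
from itertools import groupby


def group_tags(words, tags, outside="O"):
    """Group runs of consecutive identical tags into token-level spans."""
    spans = []
    position = 0
    for tag, run in groupby(tags):
        length = len(list(run))
        if tag != outside:
            spans.append({
                "token_start": position,
                "token_end": position + length - 1,  # inclusive, as in the notebook
                "label": tag,
                "word": " ".join(words[position:position + length]),
            })
        position += length
    return spans


# Hypothetical word-level tags for a short phrase
words = ["one", "RF", "inhibit", "and", "1.5W"]
tags = ["Telecom.", "Telecom.", "Telecom.", "O", "Measurement"]

grouped = group_tags(words, tags)
for span in grouped:
    print(span)
```

Since `groupby` closes the final run automatically, a concept that ends on the last word is kept without a special case.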

#### Visualise with spaCy

In [None]:
from spacy.tokens import Span
from spacy import displacy
## https://spacy.io/usage/visualizers
for span in spans:
    doc.ents = list(doc.ents) + [Span(doc, span['token_start'], span['token_end']+1, span['label'])]

displacy.render(doc, style='ent', jupyter=True)