# Data semantics

The project involves probing the subclass-of and instance-of ontological relations in masked language models. The ontology is used as a structured gold standard to construct negative and positive examples for the model.

The goal is therefore to evaluate whether and to what extent masked language models (such as BERT) encode ontological containment relations (is-a). This is done through probing experiments based on binary classification.

The objective is to replicate the results of the following paper:
[Language Model Analysis for Ontology Subsumption Inference](https://arxiv.org/pdf/2302.06761)

## 0. Setup

In [2]:
!pip install -r requirements.txt



In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



## 1. Load & Explore dataset

### 1.1. Load dataset

In [4]:
dataset = load_dataset("krr-oxford/OntoLAMA", "go-atomic-SI")

A quick look at the dataset's splits and features:

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['v_sub_concept', 'v_super_concept', 'label', 'axiom'],
        num_rows: 772870
    })
    validation: Dataset({
        features: ['v_sub_concept', 'v_super_concept', 'label', 'axiom'],
        num_rows: 96608
    })
    test: Dataset({
        features: ['v_sub_concept', 'v_super_concept', 'label', 'axiom'],
        num_rows: 96610
    })
})

### 1.2. Dataset into pd-dataframe

In [6]:
for i in range(3):
    print(dataset['train'][i])

{'v_sub_concept': 'cytosolic lipolysis', 'v_super_concept': 'biological process', 'label': 1, 'axiom': 'SubClassOf(<http://purl.obolibrary.org/obo/GO_0061725> <http://purl.obolibrary.org/obo/GO_0008150>)'}
{'v_sub_concept': 'mitochondrial oxoglutarate dehydrogenase complex', 'v_super_concept': 'oxidoreductase complex', 'label': 1, 'axiom': 'SubClassOf(<http://purl.obolibrary.org/obo/GO_0009353> <http://purl.obolibrary.org/obo/GO_1990204>)'}
{'v_sub_concept': 'positive regulation of protein localization', 'v_super_concept': 'regulation of localization', 'label': 1, 'axiom': 'SubClassOf(<http://purl.obolibrary.org/obo/GO_1903829> <http://purl.obolibrary.org/obo/GO_0032879>)'}


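Each record has the following format:
- sub (`v_sub_concept`): the more specific concept

- sup (`v_super_concept`): the more general concept

- label:

    - 1 → valid is-a relation (positive example)

    - 0 → invalid is-a relation (negative example)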

#### 1.2.1. Split into pd-dataframe

We convert the train, test, and validation splits into pandas DataFrames, which makes them easier to analyze.

In [7]:
df_train = pd.DataFrame(dataset['train'])
df_train

Unnamed: 0,v_sub_concept,v_super_concept,label,axiom
0,cytosolic lipolysis,biological process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
1,mitochondrial oxoglutarate dehydrogenase complex,oxidoreductase complex,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
2,positive regulation of protein localization,regulation of localization,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
3,hemicellulose catabolic process,biological process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
4,positive regulation of cell morphogenesis invo...,positive regulation of biological process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
...,...,...,...,...
772865,positive regulation of mitotic recombination-d...,positive regulation of meiosis i spindle assem...,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
772866,aminolevulinate transaminase activity,aspartate-phenylpyruvate transaminase activity,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
772867,lung lobe formation,sno(s)rna metabolic process,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
772868,polysome binding,glycolate dehydrogenase activity,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...


In [8]:
df_test = pd.DataFrame(dataset['test'])
df_test

Unnamed: 0,v_sub_concept,v_super_concept,label,axiom
0,endonucleolytic cleavage of tricistronic rrna ...,cellular process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
1,"positive regulation of (z)-nonadeca-1,14-diene...",positive regulation of biosynthetic process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
2,regulation of retrograde trans-synaptic signal...,regulation of biological process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
3,regulation of glutamine transport,regulation of anion transport,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
4,platelet-derived growth factor receptor-ligand...,membrane protein complex,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
...,...,...,...,...
96605,"regulation of complement activation, alternati...",regulation of melanization defense response,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
96606,tubulin binding,fatz binding,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
96607,coenzyme a metabolic process,signaling receptor complex adaptor activity,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
96608,ecdysone 20-monooxygenase activity,methyl tertiary butyl ether 3-monooxygenase ac...,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...


In [9]:
df_val = pd.DataFrame(dataset['validation'])
df_val

Unnamed: 0,v_sub_concept,v_super_concept,label,axiom
0,tryparedoxin peroxidase activity,molecular function,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
1,branched-chain amino acid catabolic process to...,branched-chain amino acid catabolic process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
2,gerfelin catabolic process,cellular catabolic process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
3,vascular endothelial growth factor production,multicellular organismal process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
4,modification by virus of host cell cycle regul...,modulation by symbiont of host cellular process,1,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
...,...,...,...,...
96603,corticotropin hormone secreting cell development,oxazole or thiazole metabolic process,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
96604,interferon-epsilon production,quercetin 3-o-methyltransferase activity,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
96605,protein activation cascade,"mitochondrial electron transport, cytochrome c...",0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...
96606,homoserine transmembrane transporter activity,benzodiazepine receptor activity,0,SubClassOf(<http://purl.obolibrary.org/obo/GO_...


In [10]:
print("Train:", df_train.shape)
print("Validation:", df_val.shape)
print("Test:", df_test.shape)

Train: (772870, 4)
Validation: (96608, 4)
Test: (96610, 4)


Most concept names are made up of five or fewer words (separated by spaces).
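The word-count observation can be checked with a short sketch like the one below. It uses only the six concept names printed earlier as sample input, and the helper names `word_count_distribution` and `share_at_most` are hypothetical, so the numbers it produces say nothing about the full dataset.

```python
from collections import Counter


def word_count_distribution(names):
    """Map 'number of whitespace-separated words' to how many names have that length."""
    return Counter(len(name.split()) for name in names)


def share_at_most(names, max_words=5):
    """Fraction of names that contain at most `max_words` words."""
    names = list(names)
    if not names:
        return 0.0
    short = sum(1 for name in names if len(name.split()) <= max_words)
    return short / len(names)


# Sample input only: the concept names from the first three training rows.
sample_names = [
    "cytosolic lipolysis",
    "biological process",
    "mitochondrial oxoglutarate dehydrogenase complex",
    "oxidoreductase complex",
    "positive regulation of protein localization",
    "regulation of localization",
]

distribution = word_count_distribution(sample_names)
print(sorted(distribution.items()))
print(share_at_most(sample_names, max_words=5))
```

To run it on the real data, the same functions should accept a column such as `df_train["v_sub_concept"]` directly, since iterating over a pandas Series yields its values.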

## 2. Creating prompt for model

Storing the prompt directly in the training DataFrame makes the later training steps easier:

In [35]:
def make_prompt(row):
    return f"{row['v_sub_concept']} is a [MASK] of {row['v_super_concept']}."

In [36]:
df_train["prompt"] = df_train.apply(make_prompt, axis=1)
df_train[["prompt", "label"]].head()

Unnamed: 0,prompt,label
0,cytosolic lipolysis is a [MASK] of biological ...,1
1,mitochondrial oxoglutarate dehydrogenase compl...,1
2,positive regulation of protein localization is...,1
3,hemicellulose catabolic process is a [MASK] of...,1
4,positive regulation of cell morphogenesis invo...,1
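The template above hard-codes `[MASK]`, which is the mask string for BERT; other masked language models use a different one (RoBERTa uses `<mask>`). A possible variation, sketched here with a hypothetical helper `build_prompt`, takes the mask string as a parameter so that the same template can be reused with another tokenizer.

```python
def build_prompt(sub_concept, super_concept, mask_token="[MASK]"):
    """Fill the 'X is a <mask> of Y.' template used in this notebook.

    `mask_token` defaults to BERT's mask string; in the notebook it could be
    taken from `tokenizer.mask_token` instead of being hard-coded.
    """
    return f"{sub_concept} is a {mask_token} of {super_concept}."


# First training example from the dataset preview.
bert_prompt = build_prompt("cytosolic lipolysis", "biological process")
roberta_style_prompt = build_prompt(
    "cytosolic lipolysis", "biological process", mask_token="<mask>"
)

print(bert_prompt)
print(roberta_style_prompt)
```

With the DataFrames from this notebook, one way to apply it would be `df_val.apply(lambda row: build_prompt(row["v_sub_concept"], row["v_super_concept"]), axis=1)`, and likewise for `df_test`.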


## 3. Creating, training, and evaluating models

### 3.1. First model:

For our first model we use a pre-trained BERT instance:


In [37]:
model_name = "bert-base-uncased"

In [38]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

model.eval()  # Important: switch to inference mode (disables dropout)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

### 3.2. Tokenization for model

In [39]:
example = df_train.iloc[0]["prompt"]

inputs = tokenizer(
    example,
    return_tensors="pt"
)

inputs

{'input_ids': tensor([[  101, 22330, 13122, 23518,  5423,  4747, 20960,  2003,  1037,   103,
          1997,  6897,  2832,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In the tokenized output, ID 103 is BERT's [MASK] token. Next, we locate the position of the mask in the input sequence so that we can read the model's prediction at that position.

In [None]:
mask_token_id = tokenizer.mask_token_id
mask_index = (inputs["input_ids"] == mask_token_id).nonzero(as_tuple=True)[1]

mask_index
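As a sanity check, the same lookup can be reproduced in plain Python on the `input_ids` printed above. The helper name `find_mask_positions` is hypothetical; the IDs are copied from the tokenizer output for the first prompt.

```python
def find_mask_positions(input_ids, mask_token_id):
    """Return every position in `input_ids` that holds the mask token ID."""
    return [pos for pos, token_id in enumerate(input_ids) if token_id == mask_token_id]


# Token IDs of the first prompt, copied from the tokenizer output above.
input_ids = [101, 22330, 13122, 23518, 5423, 4747, 20960, 2003, 1037, 103,
             1997, 6897, 2832, 1012, 102]

positions = find_mask_positions(input_ids, mask_token_id=103)

# The prompt template contains exactly one mask, so exactly one position is expected.
assert len(positions) == 1
print(positions)  # [9]
```

The tensor version in the cell above should therefore point at position 9 for this example.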

### 3.3. Probing with model:

We ask the model: "which tokens are most likely to replace [MASK]?" The logits at the mask position contain one score per vocabulary token.

In [None]:
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
mask_logits = logits[0, mask_index, :]
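To turn the scores at the mask position into ranked candidates, a softmax followed by a top-k selection is one common approach. The sketch below shows the idea in plain Python on a made-up five-token vocabulary with made-up logits; `softmax`, `top_k_tokens`, `toy_vocab`, and `toy_logits` are illustrative names and values, not outputs of the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(logits, vocab, k=3):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values for illustration only; real logits have one entry per vocabulary token.
toy_vocab = ["type", "kind", "part", "form", "banana"]
toy_logits = [4.0, 3.5, 2.0, 1.0, -3.0]

best = top_k_tokens(toy_logits, toy_vocab, k=3)
for token, prob in best:
    print(f"{token}: {prob:.3f}")
```

On the real tensors, the same steps would presumably be `torch.softmax(mask_logits, dim=-1)` followed by `torch.topk`, with `tokenizer.convert_ids_to_tokens` mapping the selected IDs back to strings.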


-------



In [None]:
print("Accuracy:", accuracy_score(df_test_sample_2["label"], df_test_sample_2["pred"]))
print("F1:", f1_score(df_test_sample_2["label"], df_test_sample_2["pred"]))
print("Precision:", precision_score(df_test_sample_2["label"], df_test_sample_2["pred"]))
print("Recall:", recall_score(df_test_sample_2["label"], df_test_sample_2["pred"]))

Accuracy: 0.6253333333333333
F1: 0.5167669819432502
Precision: 0.7410604192355117
Recall: 0.3966996699669967
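The reported numbers are internally consistent: F1 is the harmonic mean of precision and recall, and recomputing it from the two values above gives the reported F1. The short check below uses only the printed metrics; the function name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above.
reported_precision = 0.7410604192355117
reported_recall = 0.3966996699669967
reported_f1 = 0.5167669819432502

recomputed_f1 = f1_from(reported_precision, reported_recall)
print(recomputed_f1)
print(abs(recomputed_f1 - reported_f1) < 1e-9)
```

The gap between precision (about 0.74) and recall (about 0.40) indicates that on this sample the predictions labeled positive are mostly correct, while a majority of the true is-a pairs are missed.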
