### Task 1

In [1]:
import math
import re
from random import *
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

### 1. Data 

In [3]:
from datasets import load_dataset
from itertools import islice
from numpy.random import default_rng

SEED = 1234
rand = default_rng(SEED)

ds_stream = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

# Take a small chunk first (e.g., 50k) then sample 10k from it
buffer_size = 50_000
buffer = list(islice(ds_stream, buffer_size))

# Randomly sample 10k from buffer
sample_idx = rand.choice(len(buffer), 10_000, replace=False)
dataset_sample = [buffer[i] for i in sample_idx]

# Extract texts
texts = [ex["text"] for ex in dataset_sample]
texts[:3]

  from .autonotebook import tqdm as notebook_tqdm


 'Story highlights It is the farthest north the Russian spy ship has ventured\n\nThe vessel is outfitted with a variety of high-tech spying equipment\n\nWashington (CNN) A Russian spy ship sits 30 miles off the coast of Connecticut, a US defense official told CNN, while an armed Russian warplane recently carried out a "mock attack" against a US ship.\n\nThis is the farthest north the Russian spy vessel has ever ventured, according to US defense official.\n\nCNN reported that the Leonov, which conducted similar patrols in 2014 and 2015, was off the coast of Delaware Wednesday, but typically it only travels as far as Virginia.\n\nThe ship is based with Russia\'s northern fleet on the North Sea but had stopped over in Cuba before conducting its patrol along the Atlantic Coast and is expected to return there following its latest mission.\n\nThe vessel is outfitted with a variety of high-tech spying equipment and is designed to intercept signals intelligence. The official said that the US N

## Convert OpenWebText documents into a flat list of real sentences

In [16]:
import re

def split_to_sentences(doc: str):
    doc = doc.replace("\n", " ").strip()
    # Simple sentence split based on punctuation
    sents = re.split(r"(?<=[.!?])\s+", doc)
    # Keep meaningful sentences (avoid super short noise)
    sents = [s for s in sents if len(s.split()) >= 5]
    return sents

# texts = list of OpenWebText documents you sampled (e.g., 10k docs)
all_sents = []
for doc in texts:
    if isinstance(doc, str) and len(doc.strip()) > 0:
        all_sents.extend(split_to_sentences(doc))

# OPTIONAL: cap number of sentences to keep it manageable
all_sents = all_sents[:200_000]

print("Total sentences:", len(all_sents))
print(all_sents[0])

Total sentences: 200000
As for violence, where would art and literature be without it?


In [17]:
sentences = [s.replace("\n", " ") for s in all_sents]
sentences = [s for s in sentences if len(s.split()) <= 200]  # keep shorter sentences for training stability

print("Kept sentences:", len(sentences))
print(sentences[0])

Kept sentences: 199828
As for violence, where would art and literature be without it?


In [18]:
text = [s.lower() for s in sentences]
text = [re.sub(r"[.,!?\-]", "", s) for s in text]

print(text[0])

as for violence where would art and literature be without it


4) Build vocab (word2id / id2word) with special tokens

In [19]:
from tqdm.auto import tqdm

word_list = list(set(" ".join(text).split()))

word2id = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}

for i, w in tqdm(enumerate(word_list), total=len(word_list), desc="Creating word2id"):
    word2id[w] = i + 4

id2word = {v: k for k, v in word2id.items()}
vocab_size = len(word2id)

print("Vocab size:", vocab_size)

Creating word2id: 100%|██████████| 196044/196044 [00:00<00:00, 3544075.88it/s]

Vocab size: 196048





5) Build token_list 

In [20]:
token_list = []
for sentence in tqdm(text, desc="Processing sentences to token IDs"):
    token_list.append([word2id[w] for w in sentence.split() if w in word2id])

print(token_list[0][:30])

Processing sentences to token IDs: 100%|██████████| 199828/199828 [00:00<00:00, 237581.39it/s]

[114742, 84692, 52426, 79853, 116219, 143067, 175482, 42225, 13199, 52736, 146747]





### 3. Data loader

In [21]:
from random import random, shuffle, randint, randrange

batch_size = 6
max_mask   = 20
max_len    = 256

In [23]:
def make_batch():
    batch = []
    positive = negative = 0

    while positive != batch_size / 2 or negative != batch_size / 2:

        tokens_a_index, tokens_b_index = randrange(len(sentences)), randrange(len(sentences))
        tokens_a, tokens_b = token_list[tokens_a_index], token_list[tokens_b_index]

        # 1) token embedding
        input_ids = [word2id['[CLS]']] + tokens_a + [word2id['[SEP]']] + tokens_b + [word2id['[SEP]']]

        # 2) segment embedding
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)

        # truncate if too long
        input_ids = input_ids[:max_len]
        segment_ids = segment_ids[:max_len]

        # 3) masking
        n_pred = min(max_mask, max(1, int(round(len(input_ids) * 0.15))))

        candidates_masked_pos = [
            i for i, token in enumerate(input_ids)
            if token != word2id['[CLS]'] and token != word2id['[SEP]']
        ]
        shuffle(candidates_masked_pos)

        masked_tokens, masked_pos = [], []
        for pos in candidates_masked_pos[:n_pred]:
            masked_pos.append(pos)
            masked_tokens.append(input_ids[pos])

            # ✅ correct 80/10/10 using ONE random value
            p = random()
            if p < 0.8:  # 80% -> [MASK]
                input_ids[pos] = word2id['[MASK]']
            elif p < 0.9:  # 10% -> random token (avoid specials 0-3)
                input_ids[pos] = randint(4, vocab_size - 1)
            else:
                pass  # 10% -> keep original

        # 4) pad to max_len
        n_pad = max_len - len(input_ids)
        input_ids.extend([word2id['[PAD]']] * n_pad)
        segment_ids.extend([0] * n_pad)

        # 5) pad masked tokens/pos to max_mask
        if max_mask > n_pred:
            n_pad = max_mask - n_pred
            masked_tokens.extend([0] * n_pad)
            masked_pos.extend([0] * n_pad)

        # 6) NSP label (now valid because sentences are in order)
        if tokens_a_index + 1 == tokens_b_index and positive < batch_size / 2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True])
            positive += 1
        elif tokens_a_index + 1 != tokens_b_index and negative < batch_size / 2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False])
            negative += 1

    return batch

In [25]:

batch = make_batch()
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

In [26]:
input_ids.shape, segment_ids.shape, masked_tokens.shape, masked_pos.shape, isNext

(torch.Size([6, 256]),
 torch.Size([6, 256]),
 torch.Size([6, 20]),
 torch.Size([6, 20]),
 tensor([0, 0, 0, 1, 1, 1]))

In [27]:
masked_tokens

tensor([[ 73112, 163717, 146919,  22505,  22505,   9744,  62405,  56467,  70665,
          22505,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0],
        [159046, 152056, 193488,  88897, 184702,  79067,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0],
        [107932, 151712,  23486,  90423, 160793,  22505,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0],
        [160793, 101181, 137789,  55481,  82588,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0],
        [ 97007,  34624,  46710,  13199,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0],
        [ 64604,   5909, 161217,  65454, 106745, 130672, 193

### 4. Model

In [43]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, max_len, n_segments, d_model, device):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos_embed = nn.Embedding(max_len, d_model)      # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)   # segment embedding
        self.norm = nn.LayerNorm(d_model)
        self.device = device

    def forward(self, x, seg):
        # x, seg: (batch_size, seq_len)
        seq_len = x.size(1)

        # put pos on same device as x
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device)
        pos = pos.unsqueeze(0).expand_as(x)  # (seq_len,) -> (batch_size, seq_len)

        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

In [44]:
def get_attn_pad_mask(seq_q, seq_k, device):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(zero) is PAD token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1).to(device)  # batch_size x 1 x len_k(=len_q), one is masking
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

### Testing the attention mask


In [45]:
print(get_attn_pad_mask(input_ids, input_ids, device).shape)

torch.Size([6, 256, 256])


### 4.3 Encoder

In [53]:
class EncoderLayer(nn.Module):
    def __init__(self, n_heads, d_model, d_ff, d_k, device):
        super().__init__()
        self.enc_self_attn = MultiHeadAttention(n_heads, d_model, d_k, device)
        self.pos_ffn = PoswiseFeedForwardNet(d_model, d_ff)

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask)
        enc_outputs = self.pos_ffn(enc_outputs)
        return enc_outputs, attn

In [54]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, device):
        super(ScaledDotProductAttention, self).__init__()
        self.scale = torch.sqrt(torch.FloatTensor([d_k])).to(device)

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / self.scale # scores : [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is one.
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        return context, attn 

In [55]:
n_layers = 6    # number of Encoder of Encoder Layer
n_heads  = 8    # number of heads in Multi-Head Attention
d_model  = 768  # Embedding Size
d_ff = 768 * 4  # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2

In [56]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, d_model, d_k, device):
        super(MultiHeadAttention, self).__init__()
        self.n_heads = n_heads
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_k
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, self.d_v * n_heads)
        self.device = device
    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1,2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1,2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1,2)  # v_s: [batch_size x n_heads x len_k x d_v]

        attn_mask = attn_mask.unsqueeze(1).repeat(1, self.n_heads, 1, 1) # attn_mask : [batch_size x n_heads x len_q x len_k]

        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = ScaledDotProductAttention(self.d_k, self.device)(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v) # context: [batch_size x len_q x n_heads * d_v]
        output = nn.Linear(self.n_heads * self.d_v, self.d_model, device=self.device)(context)
        return nn.LayerNorm(self.d_model, device=self.device)(output + residual), attn # output: [batch_size x len_q x d_model]

In [57]:
class PoswiseFeedForwardNet(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch_size, len_seq, d_model) -> (batch_size, len_seq, d_ff) -> (batch_size, len_seq, d_model)
        return self.fc2(F.gelu(self.fc1(x)))

### 4.4 Putting them together

In [3]:
class BERT(nn.Module):
    def __init__(self, n_layers, n_heads, d_model, d_ff, d_k, n_segments, vocab_size, max_len, device):
        super(BERT, self).__init__()
        self.params = {'n_layers': n_layers, 'n_heads': n_heads, 'd_model': d_model,
                       'd_ff': d_ff, 'd_k': d_k, 'n_segments': n_segments,
                       'vocab_size': vocab_size, 'max_len': max_len}
        self.embedding = Embedding(vocab_size, max_len, n_segments, d_model, device)
        self.layers = nn.ModuleList([EncoderLayer(n_heads, d_model, d_ff, d_k, device) for _ in range(n_layers)])
        self.fc = nn.Linear(d_model, d_model)
        self.activ = nn.Tanh()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 2)
        # decoder is shared with embedding layer
        embed_weight = self.embedding.tok_embed.weight
        n_vocab, n_dim = embed_weight.size()
        self.decoder = nn.Linear(n_dim, n_vocab, bias=False)
        self.decoder.weight = embed_weight
        self.decoder_bias = nn.Parameter(torch.zeros(n_vocab))
        self.device = device

    def forward(self, input_ids, segment_ids, masked_pos):
        output = self.embedding(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids, self.device)
        for layer in self.layers:
            output, enc_self_attn = layer(output, enc_self_attn_mask)
        # output : [batch_size, len, d_model], attn : [batch_size, n_heads, d_mode, d_model]
        
        # 1. predict next sentence
        # it will be decided by first token(CLS)
        h_pooled   = self.activ(self.fc(output[:, 0])) # [batch_size, d_model]
        logits_nsp = self.classifier(h_pooled) # [batch_size, 2]

        # 2. predict the masked token
        masked_pos = masked_pos[:, :, None].expand(-1, -1, output.size(-1)) # [batch_size, max_pred, d_model]
        h_masked = torch.gather(output, 1, masked_pos) # masking position [batch_size, max_pred, d_model]
        h_masked  = self.norm(F.gelu(self.linear(h_masked)))
        logits_lm = self.decoder(h_masked) + self.decoder_bias # [batch_size, max_pred, n_vocab]

        return logits_lm, logits_nsp
    
    def get_last_hidden_state(self, input_ids, segment_ids):
        output = self.embedding(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids, self.device)
        for layer in self.layers:
            output, enc_self_attn = layer(output, enc_self_attn_mask)

        return output

### 4. Training

In [61]:
from tqdm.auto import tqdm

n_layers = 12    # number of Encoder of Encoder Layer
n_heads  = 12    # number of heads in Multi-Head Attention
d_model  = 768  # Embedding Size
d_ff = d_model * 4  # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2

num_epoch = 700
model = BERT(n_layers, n_heads, d_model, d_ff, d_k, n_segments, vocab_size, max_len, device).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [62]:
batch = make_batch()
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

# Move inputs to GPU
input_ids = input_ids.to(device)
segment_ids = segment_ids.to(device)
masked_tokens = masked_tokens.to(device)
masked_pos = masked_pos.to(device)
isNext = isNext.to(device)

# Wrap the epoch loop with tqdm
for epoch in tqdm(range(num_epoch), desc="Training Epochs"):
    optimizer.zero_grad()
    logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)    
    #logits_lm: (bs, max_mask, vocab_size) ==> (6, 5, 34)
    #logits_nsp: (bs, yes/no) ==> (6, 2)

    #1. mlm loss
    #logits_lm.transpose: (bs, vocab_size, max_mask) vs. masked_tokens: (bs, max_mask)
    loss_lm = criterion(logits_lm.transpose(1, 2), masked_tokens) # for masked LM
    loss_lm = (loss_lm.float()).mean()
    #2. nsp loss
    #logits_nsp: (bs, 2) vs. isNext: (bs, )
    loss_nsp = criterion(logits_nsp, isNext) # for sentence classification
    
    #3. combine loss
    loss = loss_lm + loss_nsp
    if epoch % 100 == 0:
        print('Epoch:', '%02d' % (epoch), 'loss =', '{:.6f}'.format(loss))
    loss.backward()
    optimizer.step()

Training Epochs:   0%|          | 0/700 [00:00<?, ?it/s]

Epoch: 00 loss = 168.357971


Training Epochs:  14%|█▍        | 100/700 [06:04<37:23,  3.74s/it]

Epoch: 100 loss = 2.679922


Training Epochs:  29%|██▊       | 200/700 [13:16<33:20,  4.00s/it]

Epoch: 200 loss = 2.827845


Training Epochs:  43%|████▎     | 300/700 [21:26<32:25,  4.86s/it]  

Epoch: 300 loss = 2.549943


Training Epochs:  57%|█████▋    | 400/700 [30:43<26:55,  5.39s/it]

Epoch: 400 loss = 2.789822


Training Epochs:  71%|███████▏  | 500/700 [37:46<10:50,  3.25s/it]

Epoch: 500 loss = 2.321024


Training Epochs:  86%|████████▌ | 600/700 [43:58<06:02,  3.63s/it]

Epoch: 600 loss = 2.437893


Training Epochs: 100%|██████████| 700/700 [50:06<00:00,  4.29s/it]


In [64]:
# Save the model after training
torch.save([model.params, model.state_dict()], 'model/model_bert.pth')
print("Model saved to model_bert.pth")

Model saved to model_bert.pth


5. Inference

Since our dataset is very small, it won't work very well, but just for the sake of demonstration.

In [65]:
# load the model and all its hyperparameters
params, state = torch.load('model/model_bert.pth')
model_bert = BERT(**params, device=device).to(device)
model_bert.load_state_dict(state)

<All keys matched successfully>

In [66]:
# Predict mask tokens ans isNext
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(batch[1]))
print([id2word[w.item()] for w in input_ids[0] if id2word[w.item()] != '[PAD]'])
input_ids = input_ids.to(device)
segment_ids = segment_ids.to(device)
masked_tokens = masked_tokens.to(device)
masked_pos = masked_pos.to(device)
isNext = isNext.to(device)

logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)
#logits_lm:  (1, max_mask, vocab_size) ==> (1, 5, 34)
#logits_nsp: (1, yes/no) ==> (1, 2)

#predict masked tokens
#max the probability along the vocab dim (2), [1] is the indices of the max, and [0] is the first value
logits_lm = logits_lm.data.cpu().max(2)[1][0].data.numpy() 
#note that zero is padding we add to the masked_tokens
print('masked tokens (words) : ',[id2word[pos.item()] for pos in masked_tokens[0]])
print('masked tokens list : ',[pos.item() for pos in masked_tokens[0]])
print('masked tokens (words) : ',[id2word[pos.item()] for pos in logits_lm])
print('predict masked tokens list : ', [pos for pos in logits_lm])

#predict nsp
logits_nsp = logits_nsp.cpu().data.max(1)[1][0].data.numpy()
print(logits_nsp)
print('isNext : ', True if isNext else False)
print('predict isNext : ',True if logits_nsp else False)

['[CLS]', 'virtually', '[MASK]', 'major', 'police', '[MASK]', 'in', 'america—and', 'many', 'minor', 'ones—have', 'explosive', 'ordnance', 'disposal', 'robots', 'similar', 'to', 'the', '[MASK]', 'used', '[MASK]', 'dallas', '[SEP]', 'she', 'knows', 'that', 'diarrhoea', '[MASK]', 'caused', 'largely', 'by', 'people', 'ingesting', 'water', 'or', 'food', 'contaminated', 'by', 'human', 'waste', '–', 'kills', 'more', 'children', '[MASK]', 'than', 'hiv/aids', 'tuberculosis', 'and', 'malaria', 'combined', '[SEP]']
masked tokens (words) :  ['every', 'tuberculosis', 'one', 'department', 'in', '–', 'robots', 'worldwide', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
masked tokens list :  [193814, 81530, 23486, 172424, 22934, 155412, 186001, 89035, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
masked tokens (words) :  ['[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD

### Task 2

# [Sentence-BERT](https://arxiv.org/pdf/1908.10084.pdf)

[Reference Code](https://www.pinecone.io/learn/series/nlp/train-sentence-transformers-softmax/)

In [4]:
import os
import math
import re
from   random import *
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [5]:
import sys
print(sys.executable)

/Users/anushkaojha/2nd/NLP/NLP_Assignments/A4/.venv/bin/python


## 1. Data

### Train, Test, Validation 

In [6]:
import datasets
snli = datasets.load_dataset('snli')
mnli = datasets.load_dataset('glue', 'mnli')
mnli['train'].features, snli['train'].features

({'premise': Value('string'),
  'hypothesis': Value('string'),
  'label': ClassLabel(names=['entailment', 'neutral', 'contradiction']),
  'idx': Value('int32')},
 {'premise': Value('string'),
  'hypothesis': Value('string'),
  'label': ClassLabel(names=['entailment', 'neutral', 'contradiction'])})

In [7]:
# List of datasets to remove 'idx' column from
mnli.column_names.keys()

dict_keys(['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched'])

In [8]:
# Remove 'idx' column from each dataset
for column_names in mnli.column_names.keys():
    mnli[column_names] = mnli[column_names].remove_columns('idx')

In [9]:
mnli.column_names.keys()

dict_keys(['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched'])

In [10]:
import numpy as np
np.unique(mnli['train']['label']), np.unique(snli['train']['label'])
#snli also have -1

(array([0, 1, 2]), array([-1,  0,  1,  2]))

In [11]:
# there are -1 values in the label feature, these are where no class could be decided so we remove
snli = snli.filter(
    lambda x: 0 if x['label'] == -1 else 1
)

In [12]:
import numpy as np
np.unique(mnli['train']['label']), np.unique(snli['train']['label'])
#snli also have -1

(array([0, 1, 2]), array([0, 1, 2]))

In [54]:
# Assuming you have your two DatasetDict objects named snli and mnli
from datasets import DatasetDict
# Merge the two DatasetDict objects
raw_dataset = DatasetDict({
    'train': datasets.concatenate_datasets([snli['train'], mnli['train']]).shuffle(seed=55).select(list(range(5000))),
    'test': datasets.concatenate_datasets([snli['test'], mnli['test_mismatched']]).shuffle(seed=55).select(list(range(100))),
    'validation': datasets.concatenate_datasets([snli['validation'], mnli['validation_mismatched']]).shuffle(seed=55).select(list(range(100)))
})
#remove .select(list(range(1000))) in order to use full dataset
# Now, merged_dataset_dict contains the combined datasets from snli and mnli
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 100
    })
})

In [55]:
import sys, site
print("Executable:", sys.executable)
print("Version:", sys.version)
print("User site:", site.getusersitepackages())
print("Site packages:", site.getsitepackages())

Executable: /Users/anushkaojha/2nd/NLP/NLP_Assignments/A4/.venv/bin/python
Version: 3.9.6 (default, Dec  2 2025, 07:27:58) 
[Clang 17.0.0 (clang-1700.6.3.2)]
User site: /Users/anushkaojha/Library/Python/3.9/lib/python/site-packages
Site packages: ['/Users/anushkaojha/2nd/NLP/NLP_Assignments/A4/.venv/lib/python3.9/site-packages']


## 2. Preprocessing

In [56]:
import sys
print(sys.executable)

/Users/anushkaojha/2nd/NLP/NLP_Assignments/A4/.venv/bin/python


In [57]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [58]:
def preprocess_function(examples):
    max_seq_length = 128
    padding = 'max_length'
    # Tokenize the premise
    premise_result = tokenizer(
        examples['premise'], padding=padding, max_length=max_seq_length, truncation=True)
    #num_rows, max_seq_length
    # Tokenize the hypothesis
    hypothesis_result = tokenizer(
        examples['hypothesis'], padding=padding, max_length=max_seq_length, truncation=True)
    #num_rows, max_seq_length
    # Extract labels
    labels = examples["label"]
    #num_rows
    return {
        "premise_input_ids": premise_result["input_ids"],
        "premise_attention_mask": premise_result["attention_mask"],
        "hypothesis_input_ids": hypothesis_result["input_ids"],
        "hypothesis_attention_mask": hypothesis_result["attention_mask"],
        "labels" : labels
    }

tokenized_datasets = raw_dataset.map(
    preprocess_function,
    batched=True,
)

tokenized_datasets = tokenized_datasets.remove_columns(['premise','hypothesis','label'])
tokenized_datasets.set_format("torch")

Map: 100%|██████████| 5000/5000 [00:02<00:00, 2184.79 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 2194.42 examples/s]


In [59]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['premise_input_ids', 'premise_attention_mask', 'hypothesis_input_ids', 'hypothesis_attention_mask', 'labels'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['premise_input_ids', 'premise_attention_mask', 'hypothesis_input_ids', 'hypothesis_attention_mask', 'labels'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['premise_input_ids', 'premise_attention_mask', 'hypothesis_input_ids', 'hypothesis_attention_mask', 'labels'],
        num_rows: 100
    })
})

## 3. Data loader

In [60]:
from torch.utils.data import DataLoader

# initialize the dataloader
batch_size = 16
train_dataloader = DataLoader(
    tokenized_datasets['train'], 
    batch_size=batch_size, 
    shuffle=True
)
eval_dataloader = DataLoader(
    tokenized_datasets['validation'], 
    batch_size=batch_size
)
test_dataloader = DataLoader(
    tokenized_datasets['test'], 
    batch_size=batch_size
)

In [61]:
for batch in train_dataloader:
    print(batch['premise_input_ids'].shape)
    print(batch['premise_attention_mask'].shape)
    print(batch['hypothesis_input_ids'].shape)
    print(batch['hypothesis_attention_mask'].shape)
    print(batch['labels'].shape)
    break

torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16])


## 4. Model

In [62]:
# import sys

# !{sys.executable} -m pip install --upgrade pip
# !{sys.executable} -m pip install \
#     bert

In [63]:
from bert import *
from bert_class import BERT

In [64]:
#Using the model trained and saved in task1
load_path = 'model/model_bert.pth'


In [65]:
params, state = torch.load(load_path, map_location= device)
model = BERT(**params, device=device).to(device)
model.load_state_dict(state)

<All keys matched successfully>

### Pooling
SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed sized sentence embedding

In [66]:
# define mean pooling function
def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover 768-dimension embeddings
    in_mask = attention_mask.unsqueeze(-1).expand(
        token_embeds.size()
    ).float()
    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(
        in_mask.sum(1), min=1e-9
    )
    return pool

## 5. Loss Function

## Classification Objective Function 
We concatenate the sentence embeddings $u$ and $v$ with the element-wise difference  $\lvert u - v \rvert $ and multiply the result with the trainable weight  $ W_t ∈  \mathbb{R}^{3n \times k}  $:

$ o = \text{softmax}\left(W^T \cdot \left(u, v, \lvert u - v \rvert\right)\right) $

where $n$ is the dimension of the sentence embeddings and k the number of labels. We optimize cross-entropy loss. This structure is depicted in Figure 1.

## Regression Objective Function. 
The cosine similarity between the two sentence embeddings $u$ and $v$ is computed (Figure 2). We use means quared-error loss as the objective function.

(Manhatten / Euclidean distance, semantically  similar sentences can be found.)

<img src="./figures/sbert-architecture.png" >

In [67]:
def configurations(u,v):
    # build the |u-v| tensor
    uv = torch.sub(u, v)   # batch_size,hidden_dim
    uv_abs = torch.abs(uv) # batch_size,hidden_dim
    
    # concatenate u, v, |u-v|
    x = torch.cat([u, v, uv_abs], dim=-1) # batch_size, 3*hidden_dim
    return x

def cosine_similarity(u, v):
    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    similarity = dot_product / (norm_u * norm_v)
    return similarity

In [68]:
classifier_head = torch.nn.Linear(768*3, 3).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
optimizer_classifier = torch.optim.Adam(classifier_head.parameters(), lr=2e-5)

criterion = nn.CrossEntropyLoss()

In [69]:
from transformers import get_linear_schedule_with_warmup

# and setup a warmup for the first ~10% steps
total_steps = int(len(raw_dataset) / batch_size)
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
		optimizer, num_warmup_steps=warmup_steps,
  	num_training_steps=total_steps - warmup_steps
)

# then during the training loop we update the scheduler per step
scheduler.step()

scheduler_classifier = get_linear_schedule_with_warmup(
		optimizer_classifier, num_warmup_steps=warmup_steps,
  	num_training_steps=total_steps - warmup_steps
)

# then during the training loop we update the scheduler per step
scheduler_classifier.step()

## 6. Training

In [70]:
max_seq_length = 128

In [72]:
from tqdm.auto import tqdm

num_epoch = 5
# 1 epoch should be enough, increase if wanted
for epoch in range(num_epoch):
    model.train()
    classifier_head.train()
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    for step, batch in enumerate(tqdm(train_dataloader, leave=True)):
        # zero all gradients on each new step
        optimizer.zero_grad()
        optimizer_classifier.zero_grad()

        # prepare batches and more all to the active device
        inputs_ids_a = batch['premise_input_ids'].to(device)
        inputs_ids_b = batch['hypothesis_input_ids'].to(device)
        attention_a = batch['premise_attention_mask'].to(device)
        attention_b = batch['hypothesis_attention_mask'].to(device)

        bs, seq_len = inputs_ids_a.shape
        segment_ids = torch.zeros((bs, seq_len), dtype=torch.long, device=device)

        label = batch['labels'].to(device)

        # extract token embeddings from BERT at last_hidden_state
        u_last_hidden_state = model.get_last_hidden_state(inputs_ids_a, segment_ids)
        v_last_hidden_state = model.get_last_hidden_state(inputs_ids_b, segment_ids)

        # get the mean pooled vectors
        u_mean_pool = mean_pool(u_last_hidden_state, attention_a) # batch_size, hidden_dim
        v_mean_pool = mean_pool(v_last_hidden_state, attention_b) # batch_size, hidden_dim

        # build the |u-v| tensor
        uv = torch.sub(u_mean_pool, v_mean_pool)   # batch_size,hidden_dim
        uv_abs = torch.abs(uv) # batch_size,hidden_dim

        # concatenate u, v, |u-v|
        x = torch.cat([u_mean_pool, v_mean_pool, uv_abs], dim=-1) # batch_size, 3*hidden_dim

        # process concatenated tensor through classifier_head
        x = classifier_head(x) #batch_size, classifer

        # calculate the 'softmax-loss' between predicted and true label
        loss = criterion(x, label)

        # using loss, calculate gradients and then optimizerize
        loss.backward()
        optimizer.step()
        optimizer_classifier.step()

        scheduler.step() # update learning rate scheduler
        scheduler_classifier.step()

    print(f'Epoch: {epoch + 1} | loss = {loss.item():.6f}')

100%|██████████| 313/313 [35:25<00:00,  6.79s/it]


Epoch: 1 | loss = 1.208837


100%|██████████| 313/313 [26:57<00:00,  5.17s/it]


Epoch: 2 | loss = 1.081702


100%|██████████| 313/313 [23:45<00:00,  4.55s/it]


Epoch: 3 | loss = 1.397938


100%|██████████| 313/313 [21:57<00:00,  4.21s/it]


Epoch: 4 | loss = 1.181835


100%|██████████| 313/313 [21:13<00:00,  4.07s/it]

Epoch: 5 | loss = 1.121003





In [73]:
labels = []
predictions = []
probabilities = []
classes = ["entailment", "neutral", "contradiction"]

In [76]:
model.eval()
classifier_head.eval()
total_similarity = 0
with torch.no_grad():
    for step, batch in enumerate(eval_dataloader):
        # Move batches to the active device
        inputs_ids_a = batch['premise_input_ids'].to(device)
        inputs_ids_b = batch['hypothesis_input_ids'].to(device)
        attention_a = batch['premise_attention_mask'].to(device)
        attention_b = batch['hypothesis_attention_mask'].to(device)
        segment_ids = torch.zeros(inputs_ids_a.shape[0], inputs_ids_a.shape[1], dtype=torch.int32).to(device)
        label = batch['labels'].to(device)

        # Extract token embeddings from BERT
        u = model.get_last_hidden_state(inputs_ids_a, segment_ids)  # (batch_size, seq_len, hidden_dim)
        v = model.get_last_hidden_state(inputs_ids_b, segment_ids)  # (batch_size, seq_len, hidden_dim)

        # Get the mean pooled vectors (Keep them as Tensors)
        u_mean_pool = mean_pool(u, attention_a)  # (batch_size, hidden_dim)
        v_mean_pool = mean_pool(v, attention_b)  # (batch_size, hidden_dim)

        # Computing cosine similarity
        similarity_score = cosine_similarity(u_mean_pool.cpu().numpy().reshape(-1), v_mean_pool.cpu().numpy().reshape(-1))
        total_similarity += similarity_score

        # Concatenate [u, v, |u - v|]
        uv_abs = torch.abs(u_mean_pool - v_mean_pool)  # [batch_size, hidden_dim]
        x = torch.cat([u_mean_pool, v_mean_pool, uv_abs], dim=-1)  # [batch_size, 3*hidden_dim]

        # Classification
        logit_fn = classifier_head(x)  # (batch_size, num_classes)
        probs = torch.nn.functional.softmax(logit_fn, dim=-1)

        preds = torch.argmax(logit_fn, dim=-1)

        labels.extend(label.cpu().tolist())
        probabilities.extend(probs.cpu().tolist())
        predictions.extend(preds.cpu().tolist())

average_similarity = total_similarity / len(eval_dataloader)
print(f"Average Cosine Similarity: {average_similarity:.4f}")

Average Cosine Similarity: 0.9990


In [77]:
from sklearn.metrics import classification_report

print(classification_report(labels, predictions, target_names=classes))

               precision    recall  f1-score   support

   entailment       0.00      0.00      0.00        62
      neutral       0.44      0.10      0.16        72
contradiction       0.33      0.92      0.49        66

     accuracy                           0.34       200
    macro avg       0.26      0.34      0.22       200
 weighted avg       0.27      0.34      0.22       200



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [79]:
# saving the model
torch.save({
    "bert_params": model.params,
    "bert_state": model.state_dict(),
    "clf_state": classifier_head.state_dict(),
    "max_seq_length": max_seq_length
}, "model/sen_bert_full.pth")

## 7. Inference

In [80]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(model, tokenizer, sentence_a, sentence_b, device):
    # Tokenize and convert sentences to input IDs and attention masks
    inputs_a = tokenizer(sentence_a, return_tensors='pt', max_length=max_seq_length, truncation=True, padding='max_length').to(device)
    inputs_b = tokenizer(sentence_b, return_tensors='pt', max_length=max_seq_length, truncation=True, padding='max_length').to(device)

    # Move input IDs and attention masks to the active device
    inputs_ids_a = inputs_a['input_ids']
    attention_a = inputs_a['attention_mask']
    inputs_ids_b = inputs_b['input_ids']
    attention_b = inputs_b['attention_mask']
    segment_ids = torch.zeros(1, max_seq_length, dtype=torch.int32).to(device)

    # Extract token embeddings from BERT
    u = model.get_last_hidden_state(inputs_ids_a, segment_ids)  # all token embeddings A = batch_size, seq_len, hidden_dim
    v = model.get_last_hidden_state(inputs_ids_b, segment_ids)  # all token embeddings B = batch_size, seq_len, hidden_dim

    # Get the mean-pooled vectors
    u = mean_pool(u, attention_a).detach().cpu().numpy()  # shape (1, hidden_dim)
    v = mean_pool(v, attention_b).detach().cpu().numpy()  # shape (1, hidden_dim)

    similarity_score = cosine_similarity(u, v)[0, 0]      # scalar

    # Calculate cosine similarity
    return similarity_score

In [81]:
# Example usage:
sentence_a = 'Your contribution helped make it possible for us to provide our students with a quality education.'
sentence_b = "Your contributions were of no help with our students' education."
similarity = calculate_similarity(model, tokenizer, sentence_a, sentence_b, device)
print(f"Cosine Similarity: {similarity.item():.4f}")

Cosine Similarity: 0.9993


In [90]:
# Example usage:
sentence_a = ' A woman is cooking dinner in the kitchen.'
sentence_b = 'A lady is preparing a meal at home.'
similarity = calculate_similarity(model, tokenizer, sentence_a, sentence_b, device)
print(f"Cosine Similarity: {similarity.item():.4f}")

Cosine Similarity: 0.9993


### Task: 3

In [82]:
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F

In [83]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
pre_trained_model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

In [84]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

In [85]:
pos_sentence = ["The cat is sleeping on the couch.", "The feline is resting on the sofa."]
opp_sentence = ["He is very punctual and reliable.", "You can never count on him to be on time."]
encoded_input = tokenizer(pos_sentence, padding=True, truncation=True, return_tensors='pt')

In [86]:
with torch.no_grad():
    model_output = pre_trained_model(**encoded_input)


In [87]:
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sent_a_emb = sentence_embeddings[0].cpu().numpy().reshape(1, -1)
sent_b_emb = sentence_embeddings[1].cpu().numpy().reshape(1, -1)
cosine_similarity(sent_a_emb, sent_b_emb)[0][0]

np.float32(0.731993)

In [88]:
encoded_input = tokenizer(opp_sentence, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = pre_trained_model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

In [89]:
sent_a_emb = sentence_embeddings[0].cpu().numpy().reshape(1, -1)
sent_b_emb = sentence_embeddings[1].cpu().numpy().reshape(1, -1)
cosine_similarity(sent_a_emb, sent_b_emb)[0][0]

np.float32(0.48303467)

## Classification Report

The following table summarizes the performance of our Sentence-BERT model on the validation dataset. The evaluation metrics include Precision, Recall, F1-Score, and Support for each class.

| Class           | Precision | Recall | F1-Score | Support |
|----------------|-----------|--------|----------|---------|
| Entailment     | 0.00      | 0.00   | 0.00     | 62      |
| Neutral        | 0.44      | 0.10   | 0.16     | 72      |
| Contradiction  | 0.33      | 0.92   | 0.49     | 66      |
| **Accuracy**   |           |        | **0.34** | 200     |
| **Macro Avg**  | 0.26      | 0.34   | 0.22     | 200     |
| **Weighted Avg** | 0.27   | 0.34   | 0.22     | 200     |


## Explanation for Zero Scores in Entailment

The entailment class shows zero precision, recall, and F1-score because the model did not predict any samples as entailment during evaluation. When a class receives no predicted instances, precision becomes undefined and is automatically set to zero by the evaluation metric.

This indicates that the classifier predominantly predicted the contradiction class. The high recall (0.92) for contradiction suggests that the model learned to favor this class, likely because it was easier to separate from the others using the learned sentence embeddings.

Possible reasons for this behavior include:

1. **Limited Training Data** – The model was trained on a relatively small subset of the dataset, reducing its ability to generalize across all classes.
2. **Embedding Quality** – Since the base BERT model was trained from scratch on a limited corpus, the sentence embeddings may not capture nuanced semantic relationships required for entailment detection.
3. **Class Prediction Bias** – During training, the classifier may have converged toward predicting the dominant or easiest class to minimize loss.

Overall, the results indicate that while the model can strongly detect contradictions, it struggles to distinguish entailment and neutral relationships. This suggests that improved pretraining or larger training data would likely enhance performance.

## Comparison of Our Model with Pre-trained Model

To evaluate semantic understanding, we compare our trained model against a pre-trained Sentence Transformer model using cosine similarity.

| Model Type   | Cosine Similarity (Similar sentence) | Cosine Similarity (Dissimilar sentence) |
|--------------|----------------------------------------|-------------------------------------------|
| Our Model    | 0.993                                  | 0.993                                     |
| Pre-trained  | 0.731                                  | 0.483                                     |


### Interpretation

Both similar and dissimilar sentence pairs resulted in a cosine similarity score of **0.993**. This indicates that the model is not effectively distinguishing between semantically related and unrelated sentence pairs.

Ideally, similar sentences should produce a high cosine similarity score, while dissimilar or contradictory sentences should yield a significantly lower score. Since both scores are nearly identical, this suggests that the sentence embeddings are not sufficiently discriminative.

Possible reasons include:

1. The base BERT model was trained from scratch on a limited dataset, leading to weak semantic representations.
2. The classifier may have overfitted to a dominant class during training.
3. The sentence embedding space may have collapsed, producing highly similar vectors regardless of input.

These results highlight the importance of large-scale pretraining and proper fine-tuning when building robust sentence embedding models.

## Discussion

The implementation of BERT from scratch was carried out by taking reference from the professor’s provided materials. For pretraining, the Wikipedia dataset from Hugging Face was initially selected. However, due to hardware limitations, it was not feasible to train on the full dataset. As a result, the dataset was filtered down to 100,000 samples for training. After preprocessing and vocabulary construction, the BERT model class was implemented and training was initiated.

During training, significant computational challenges were encountered. Memory constraints required reducing the batch size to 3 and limiting the number of epochs. Initially, the model was tested with a very large number of epochs (1000), but the loss showed minimal improvement and eventually the system ran out of memory. Consequently, the training configuration was adjusted to 700 epochs with reduced batch size. As expected, the limited dataset size and constrained training setup negatively affected model performance during inference.

In Task 2, the SNLI and MNLI datasets were used to train a custom Sentence-BERT style model for Natural Language Inference. The implementation was again based on the professor’s reference code. After preprocessing and tokenization, the model was trained for 5 epochs. Similar to Task 1, memory limitations prevented the use of a larger batch size. The intended batch size of 32 exceeded available memory, so it was reduced to 8. While training completed successfully, the model struggled to generalize well across all classes.

During evaluation and analysis (Task 3), the model’s performance was compared with a pre-trained model from Hugging Face. The comparison clearly demonstrated the performance gap between a model trained from scratch on limited data and a large-scale pre-trained transformer. The custom model showed weak discrimination across semantic classes, whereas the pre-trained model produced more stable and meaningful similarity representations.

Overall, the main challenges encountered throughout this assignment were:

- Limited model performance due to training from scratch  
- Reduced dataset size caused by hardware constraints  
- Memory limitations affecting batch size and training stability  
- Computational resource restrictions preventing large-scale experimentation  

### Proposed Improvements

To improve the model’s performance in future work, the following strategies are suggested:

- Increase the size of the training dataset  
- Utilize more powerful hardware (GPU with larger memory)  
- Experiment with larger batch sizes and optimized learning rates  
- Increase model depth (more layers) and hidden dimensions  
- Apply transfer learning using a pre-trained base model instead of training from scratch  

## Web Application Interface Documentation

For this assignment, I developed the web interface using Dash. The entire user interface along with the necessary model integration is implemented in the app.py file. It is a simple UI consisting of two text input fields, a Predict button, basic input validations, and a result display section. The demo of the application can be found in the README.md file inside the A4 folder.

The model is integrated into the interface through a straightforward process. First, the trained model is loaded from the saved checkpoint (sen_bert_full.pth), and the stored weights are restored into the BERT model and classifier head. For tokenization, the BertTokenizer from bert-base-uncased is used. The input sentences are converted into embeddings using the get_last_hidden_state() method and mean pooling. Cosine similarity between the two sentence embeddings is computed, and the concatenated vector [u, v, |u - v|] is passed to the classifier to predict one of the three labels: Entailment, Neutral, or Contradiction. The prediction and similarity score are then displayed on the interface.

The user interaction flow is as follows:
- User enters two sentences
- Sentences may express similar, neutral, or opposite meanings
- User clicks the Predict button
- The prediction and cosine similarity score are displayed on the screen
