## Install packages

Documentation for the toolkit:
*   https://huggingface.co/transformers/
*   https://huggingface.co/docs/accelerate/index

In [65]:
!pip install transformers==4.26.1
!pip install accelerate==0.16.0



## Import packages

In [66]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(2)

In [67]:
from transformers import (
  AutoTokenizer,
  AutoModelForQuestionAnswering,
)

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-cased").to(device)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs

In [68]:
import json

!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -O train-v1.1.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -O dev-v1.1.json

# Load the datasets
with open("train-v1.1.json", "r") as f:
    train_data = json.load(f)["data"]

with open("dev-v1.1.json", "r") as f:
    dev_data = json.load(f)["data"]

--2024-03-23 00:16:36--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.110.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: ‘train-v1.1.json’


2024-03-23 00:16:37 (48.9 MB/s) - ‘train-v1.1.json’ saved [30288272/30288272]

--2024-03-23 00:16:38--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.110.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘dev-v1.1.json’


2024-03-23 00:16:38 (18.4 MB/s) - ‘dev-v1.1.json’ saved [4854279/4854279]



In [69]:
def load_and_split_squad_data(file_path, split_ratio=0.8):
    with open(file_path, 'r') as f:
        squad_data = json.load(f)['data']

    # Flatten the data
    flattened_data = []
    for article in squad_data:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
                answers = [answer['text'] for answer in qa['answers']]
                flattened_data.append({
                    'context': context,
                    'question': question,
                    'answers': answers
                })

    # Shuffle and split the data
    random.shuffle(flattened_data)
    split_index = int(len(flattened_data) * split_ratio)
    train_data = flattened_data[:split_index]
    test_data = flattened_data[split_index:]

    return train_data, test_data

# Load and split the dataset
train_data, test_data = load_and_split_squad_data('train-v1.1.json', split_ratio=0.8)
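
To make the flattening and splitting step concrete, here is a minimal sketch on a hand-made record that mimics the nesting of the SQuAD v1.1 file (articles, then `paragraphs`, then `qas`, then `answers`). The toy record and the helper names `flatten_squad` and `shuffled_split` are illustrative assumptions, not part of the notebook. Unlike the cell above, the sketch shuffles with a private `random.Random(seed)` so that the split does not depend on how often the global generator has been used.

```python
import random

# Hypothetical toy record that mimics the nesting of train-v1.1.json.
toy_squad = [
    {
        "title": "Toy article",
        "paragraphs": [
            {
                "context": "Link grunts when attacking.",
                "qas": [
                    {
                        "question": "What does Link do when attacking?",
                        "answers": [{"text": "grunts", "answer_start": 5}],
                    },
                    {
                        "question": "Who grunts?",
                        "answers": [{"text": "Link", "answer_start": 0}],
                    },
                ],
            }
        ],
    }
]


def flatten_squad(articles):
    """Turn nested articles into one dict per question."""
    rows = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                rows.append({
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": [a["text"] for a in qa["answers"]],
                })
    return rows


def shuffled_split(rows, split_ratio, seed):
    """Shuffle a copy with a private RNG so the split is repeatable."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * split_ratio)
    return rows[:cut], rows[cut:]


rows = flatten_squad(toy_squad)
train_a, test_a = shuffled_split(rows, split_ratio=0.5, seed=2)
train_b, test_b = shuffled_split(rows, split_ratio=0.5, seed=2)
print(len(rows), len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```

The notebook's own function relies on the global `random` state set by `same_seeds(2)`, so re-running only the split cell without re-seeding is likely to produce a different partition.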

In [46]:
train_data[0]

{'context': 'There is very little voice acting in the game, as is the case in most Zelda titles to date. Link remains silent in conversation, but grunts when attacking or injured and gasps when surprised. His emotions and responses are largely indicated visually by nods and facial expressions. Other characters have similar language-independent verbalizations, including laughter, surprised or fearful exclamations, and screams. The character of Midna has the most voice acting—her on-screen dialog is often accompanied by a babble of pseudo-speech, which was produced by scrambling the phonemes of English phrases[better source needed] sampled by Japanese voice actress Akiko Kōmoto.',
 'question': 'What does Link say when attacking?',
 'answers': ['grunts']}

In [77]:
class SquadDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=384):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        context = item['context']
        question = item['question']
        answers = item['answers']

        # Encode the context and question with the tokenizer
        encoded = self.tokenizer.encode_plus(
            question, context,
            max_length=self.max_length,
            truncation="only_second",  # Truncate only the context if it's too long
            padding="max_length",
            return_tensors="pt"
        )

        # Find the position of the answer in the context
        answer_start = context.find(answers[0])
        answer_end = answer_start + len(answers[0])

        # Convert the answer position to token position
        token_start = encoded.char_to_token(answer_start, sequence_index=1)
        token_end = encoded.char_to_token(answer_end - 1, sequence_index=1)

        # If the answer is out of the span (due to truncation), label it as (0, 0)
        if token_start is None or token_end is None:
            token_start = token_end = 0

        # The model expects the start and end positions of the answer in the context
        encoded.update({
            'start_positions': torch.tensor(token_start),
            'end_positions': torch.tensor(token_end)
        })

        return encoded
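
The character-to-token step can be illustrated without a real tokenizer. The sketch below uses a whitespace splitter as a stand-in for WordPiece, so the indices are not the ones BERT would produce; `whitespace_offsets`, `char_to_token`, and `char_span_to_token_span` are hypothetical helpers written for this illustration. It also shows one caveat of locating the answer with `context.find`: it returns the first occurrence of the text, which may differ from the annotated position that SQuAD stores in each answer's `answer_start` field.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token(offsets, char_index):
    """Index of the token covering char_index, or None if no token covers it."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return i
    return None


def char_span_to_token_span(offsets, answer_start, answer_text):
    """Map a character-level answer to inclusive (start, end) token indices."""
    answer_end = answer_start + len(answer_text)  # exclusive character index
    token_start = char_to_token(offsets, answer_start)
    token_end = char_to_token(offsets, answer_end - 1)
    return token_start, token_end


context = "The cat sat. The old cat slept."
offsets = whitespace_offsets(context)

# Assume the annotation points at the second "cat" (answer_start=21).
annotated = char_span_to_token_span(offsets, 21, "cat")
# str.find locates the first "cat" instead.
first_hit = char_span_to_token_span(offsets, context.find("cat"), "cat")
print(annotated, first_hit)
```

Keeping `answer_start` when flattening the data would be one way to avoid the mismatch; that is a suggestion, not something the notebook does.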

In [52]:
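# Create the training, test, and dev datasets
train_dataset = SquadDataset(train_data, tokenizer)
test_dataset = SquadDataset(test_data, tokenizer)

# dev_data loaded earlier is still nested by article, which SquadDataset cannot index.
# Flatten it with the same helper; split_ratio=1.0 keeps every example in the first list.
dev_flat, _ = load_and_split_squad_data('dev-v1.1.json', split_ratio=1.0)
dev_dataset = SquadDataset(dev_flat, tokenizer)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
dev_loader = DataLoader(dev_dataset, batch_size=8, shuffle=False)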

In [27]:
train_loader

<torch.utils.data.dataloader.DataLoader at 0x7e29967f1540>

In [79]:
sample_encoded = train_dataset[0]
print(sample_encoded["input_ids"])
print(sample_encoded["attention_mask"])
print(f"Start position: {sample_encoded['start_positions']}, End position: {sample_encoded['end_positions']}")

tensor([[  101,  1327,  1674, 11193,  1474,  1165,  7492,   136,   102,  1247,
          1110,  1304,  1376,  1490,  3176,  1107,  1103,  1342,   117,  1112,
          1110,  1103,  1692,  1107,  1211,   163, 22654,  1161,  3727,  1106,
          2236,   119, 11193,  2606,  3826,  1107,  3771,   117,  1133, 24673,
          1116,  1165,  7492,  1137,  4475,  1105, 27531,  1165,  3753,   119,
          1230,  6288,  1105, 11317,  1132,  3494,  4668, 19924,  1118, 11294,
          1105, 14078, 11792,   119,  2189,  2650,  1138,  1861,  1846,   118,
          2457, 14093, 20412,   117,  1259,  7053,   117,  3753,  1137, 22984,
          4252, 17405,  1116,   117,  1105, 12264,   119,  1109,  1959,  1104,
          9825,  1605,  1144,  1103,  1211,  1490,  3176,   783,  1123,  1113,
           118,  3251, 17693,  8032,  1110,  1510,  4977,  1118,   170,   171,
          6639,  2165,  1104, 23563,   118,  4055,   117,  1134,  1108,  1666,
          1118,   188,  1665,  4515,  6647,  1103,  

In [90]:
from transformers import AdamW, get_linear_schedule_with_warmup


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
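
As a rough illustration of what the scheduler does to the learning rate, the sketch below re-implements the linear warmup and decay multiplier in plain Python. It reflects my understanding of `get_linear_schedule_with_warmup` rather than the library source, and `linear_schedule_factor` is a hypothetical name. The step count of 8760 per epoch is taken from the progress bar in the training output.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
num_training_steps = 3 * 8760  # num_epochs * len(train_loader)

for step in (0, 8760, 17520, 26280):
    lr = base_lr * linear_schedule_factor(step, 0, num_training_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

With `num_warmup_steps=0`, as in the cell above, the rate starts at its full value and falls steadily to zero by the last step.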



In [94]:
def compute_f1(predicted_span, true_span):
    pred_tokens = set(range(predicted_span[0], predicted_span[1] + 1))
    true_tokens = set(range(true_span[0], true_span[1] + 1))

    # Token positions shared by both spans (needed for precision and recall)
    common_tokens = pred_tokens.intersection(true_tokens)

    # With no overlap, F1 is 0
    if not common_tokens:
        return 0.0

    precision = len(common_tokens) / len(pred_tokens)
    recall = len(common_tokens) / len(true_tokens)

    # Compute the F1 score
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
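
A few worked cases may help when reading the numbers this function produces. The sketch repeats the same span-overlap logic under the hypothetical name `span_f1` so it runs on its own; the example spans are made up. For a prediction of tokens 10 to 14 against a true span of 12 to 15, three positions overlap, so precision is 3/5, recall is 3/4, and F1 is 2/3.

```python
def span_f1(predicted_span, true_span):
    """F1 over token positions of two inclusive (start, end) spans."""
    pred = set(range(predicted_span[0], predicted_span[1] + 1))
    true = set(range(true_span[0], true_span[1] + 1))
    common = pred & true
    if not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(true)
    return 2 * precision * recall / (precision + recall)


true_span = (12, 15)
cases = {
    "exact match": (12, 15),
    "partial overlap": (10, 14),
    "disjoint": (0, 3),
    "end before start": (15, 12),  # empty range, so it scores 0
}
for name, predicted in cases.items():
    print(f"{name}: {span_f1(predicted, true_span):.3f}")
```

This position-based score is not the official SQuAD F1, which compares the normalized words of the predicted and reference answer strings, so the values are not directly comparable to published SQuAD results.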


In [99]:
model.train()

step = 0
for epoch in range(num_epochs):
    for batch in tqdm(train_loader):
        optimizer.zero_grad()
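        # Each item is encoded as [1, seq_len], so batches arrive as [batch, 1, seq_len].
        # Drop only that middle axis so a final batch of size 1 keeps its batch dimension.
        input_ids = batch['input_ids'].squeeze(1).to(device)
        attention_mask = batch['attention_mask'].squeeze(1).to(device)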
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        # Compute the F1 score between the predicted and true start/end positions
        start_preds = torch.argmax(outputs.start_logits, dim=1)
        end_preds = torch.argmax(outputs.end_logits, dim=1)
        f1_scores = [compute_f1((start_preds[i].item(), end_preds[i].item()), (start_positions[i].item(), end_positions[i].item())) for i in range(len(start_positions))]

        # Print the batch-average F1 score and the loss
        if (step + 1) % 1000 == 0:
            print(f"Epoch {epoch + 1} | Step {step + 1} | loss = {loss.item():.3f}, F1 = {np.mean(f1_scores):.3f}")

        step += 1
torch.save(model.state_dict(), f"bert_finetuned_squad_epoch_{epoch+1}.pth")
print(f"Epoch {epoch + 1} completed. Model saved.")


  0%|          | 0/8760 [00:00<?, ?it/s]

Epoch 1 | Step 1000 | loss = 0.262, F1 = 0.986
Epoch 1 | Step 2000 | loss = 1.591, F1 = 0.613
Epoch 1 | Step 3000 | loss = 0.498, F1 = 0.875
Epoch 1 | Step 4000 | loss = 1.174, F1 = 0.752
Epoch 1 | Step 5000 | loss = 1.023, F1 = 0.740
Epoch 1 | Step 6000 | loss = 2.546, F1 = 0.403
Epoch 1 | Step 7000 | loss = 0.269, F1 = 0.903
Epoch 1 | Step 8000 | loss = 0.600, F1 = 0.884


  0%|          | 0/8760 [00:00<?, ?it/s]

Epoch 2 | Step 9000 | loss = 0.407, F1 = 0.933
Epoch 2 | Step 10000 | loss = 0.702, F1 = 0.888
Epoch 2 | Step 11000 | loss = 0.390, F1 = 0.897
Epoch 2 | Step 12000 | loss = 0.525, F1 = 0.870
Epoch 2 | Step 13000 | loss = 0.230, F1 = 0.963
Epoch 2 | Step 14000 | loss = 0.484, F1 = 0.859
Epoch 2 | Step 15000 | loss = 0.426, F1 = 0.912
Epoch 2 | Step 16000 | loss = 0.329, F1 = 0.941
Epoch 2 | Step 17000 | loss = 0.076, F1 = 1.000


  0%|          | 0/8760 [00:00<?, ?it/s]

Epoch 3 | Step 18000 | loss = 0.036, F1 = 1.000
Epoch 3 | Step 19000 | loss = 0.185, F1 = 0.986
Epoch 3 | Step 20000 | loss = 0.093, F1 = 1.000
Epoch 3 | Step 21000 | loss = 0.091, F1 = 1.000
Epoch 3 | Step 22000 | loss = 0.107, F1 = 0.958
Epoch 3 | Step 23000 | loss = 0.112, F1 = 0.982
Epoch 3 | Step 24000 | loss = 0.396, F1 = 0.904
Epoch 3 | Step 25000 | loss = 0.163, F1 = 0.997
Epoch 3 | Step 26000 | loss = 0.133, F1 = 1.000
Epoch 3 completed. Model saved.


In [103]:
from collections import defaultdict
import torch


def evaluate(model, dataloader, device):
    model.eval()

    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in dataloader:
            # Drop only the per-item axis so a final batch of size 1 keeps its batch dimension
            input_ids = batch['input_ids'].squeeze(1).to(device)
            attention_mask = batch['attention_mask'].squeeze(1).to(device)
            start_true = batch['start_positions'].to(device)
            end_true = batch['end_positions'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)

            # Predicted start and end positions
            start_pred = torch.argmax(outputs.start_logits, dim=1)
            end_pred = torch.argmax(outputs.end_logits, dim=1)

            predictions.extend(zip(start_pred, end_pred))
            true_labels.extend(zip(start_true, end_true))

    # Compute the per-example F1 scores and average them
    f1_scores = [compute_f1(pred, true) for pred, true in zip(predictions, true_labels)]
    mean_f1 = np.mean(f1_scores)

    return mean_f1

# Run the evaluation on the held-out test split
mean_f1_score = evaluate(model, test_loader, device)
print(f"Mean F1 Score on Test Set: {mean_f1_score:.3f}")


Mean F1 Score on Test Set: 0.764


In [104]:
def predict_answer(model, tokenizer, context, question, device):
    model.eval()


    inputs = tokenizer.encode_plus(question, context, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)


    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)


    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1


    answer_tokens = input_ids[0, answer_start:answer_end]
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(answer_tokens))

    return answer
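
`predict_answer` takes the argmax of the start and end logits independently, which can place the end before the start and yield an empty answer. One common alternative, shown here as a sketch and not as what the notebook does, is to search for the highest-scoring pair with `start <= end`. The function name `best_span`, the `max_answer_len` cap, and the score lists are assumptions made for illustration; plain Python lists stand in for the logit tensors.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with start <= end that maximizes the summed score."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up scores where the independent argmax gives an inverted span.
start_scores = [0.1, 0.2, 0.3, 5.0, 0.1]
end_scores = [0.1, 4.0, 0.2, 1.0, 2.5]

independent = (
    max(range(len(start_scores)), key=start_scores.__getitem__),
    max(range(len(end_scores)), key=end_scores.__getitem__),
)
print("independent argmax:", independent)
print("best valid span:", best_span(start_scores, end_scores))
```

To try this with the model, the lists could come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. Restricting the search to context tokens would be a further refinement.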


In [106]:
context = "Recent work has shown success in incorporating pre-trained models like BERT to improve NLP systems. However, existing pre-trained models lack of causal knowledge which prevents today's NLP systems from thinking like humans. In this paper, we investigate the problem of injecting causal knowledge into pre-trained models. There are two fundamental problems: 1) how to collect various granularities of causal pairs from unstructured texts; 2) how to effectively inject causal knowledge into pre-trained models. To address these issues, we extend the idea of CausalBERT from previous studies, and conduct experiments on various datasets to evaluate its effectiveness. In addition, we adopt a regularization-based method to preserve the already learned knowledge with an extra regularization term while injecting causal knowledge. Extensive experiments on 7 datasets, including four causal pair classification tasks, two causal QA tasks and a causal inference task, demonstrate that CausalBERT captures rich causal knowledge and outperforms all pre-trained models-based state-of-the-art methods, achieving a new causal inference benchmark."
question = "What are the foundamental Problem?"
answer = predict_answer(model, tokenizer, context, question, device)
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: What are the foundamental Problem?
Answer: how to collect various granularities of causal pairs from unstructured texts ; 2 ) how to effectively inject causal knowledge into pre - trained models
