<a href="https://colab.research.google.com/github/nguyenhongquy/semplaus/blob/main/Llama_PAP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Constants

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
BASE_DIR = '/content/drive/MyDrive/semantic plausibility/datasets/pap/train-dev-test-split-filtered/binary'
PWD = '/content/drive/MyDrive/semantic plausibility'
RAW = '/content/drive/MyDrive/semantic plausibility/datasets/pap/raw-annotations/dataset.tsv'
CONCRETE = '/content/drive/MyDrive/semantic plausibility/concrete_13428_2013_403_MOESM1_ESM.xlsx'

In [None]:
TRAIN_FN = 'train.csv'
DEV_FN = 'dev.csv'
TEST_FN = 'test.csv'

# 3. LLAMA - Generative Approach


## 3.0. Install Dependencies & Load Pretrained Model

### 3.0.1 Install dependencies

In [None]:
# 8-bit optimizers and 8-bit inference layers for PyTorch, speed up training and inference
!pip install -q -U bitsandbytes
# access to pretrained models
!pip install -q -U git+https://github.com/huggingface/transformers.git
# parameter-efficient fine-tuning for efficiently adapting large pretrained models to downstream applications
!pip install -q -U git+https://github.com/huggingface/peft.git
# for easy and fast training of transformers models on any distributed setup
!pip install -q -U git+https://github.com/huggingface/accelerate.git
# for easily accessing and sharing datasets
!pip install -q datasets
# for transformer-based reinforcement learning
!pip install trl


Class TL;DR

- `AutoModelForCausalLM`: for causal language modeling tasks, which involve predicting the next token in a sequence.

- `BitsAndBytesConfig`: to configure the quantization settings when loading a model in 8-bit or 4-bit precision.

- `HfArgumentParser`: for parsing arguments for command-line applications. It is specifically designed to parse dataclasses.

- `TrainingArguments`: to define the training configuration for a model. It includes parameters like learning rate, batch size, number of epochs, etc.

- `AutoTokenizer`: to automatically instantiate a tokenizer from a pre-trained model's name or path.

- `pipeline`: to create a pipeline object for performing a variety of NLP tasks, such as text classification, named entity recognition, and more. It simplifies the process of applying a model to an input.

In [None]:
# Hugging Face's datasets library for loading and processing datasets
from datasets import load_dataset, Dataset, DatasetDict
# For creating data classes
from dataclasses import dataclass, field
# For type hinting
from typing import Optional
# PyTorch library for tensor computations and deep learning
import torch
# Low-Rank Adaptation (LoRA) from PEFT
from peft import LoraConfig, PeftModel
# For creating progress bars
from tqdm import tqdm
# For data manipulation and analysis
import pandas as pd
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, AutoTokenizer, pipeline
# Supervised fine-tuning trainer from TRL (Transformer Reinforcement Learning)
from trl import SFTTrainer

tqdm.pandas()


### 3.0.2 Load the model from HuggingFace

#### Load Tokenizer and Model

* QLoRA parameters
  - `lora_r`: LoRA attention dimension, i.e. the rank of the update matrices (default 64)
  - `lora_alpha`: alpha (scaling) parameter for the LoRA layers (default 16)
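
To get a feeling for what these two parameters control, here is a minimal arithmetic sketch. It assumes a single square 4096x4096 projection matrix (an illustrative shape, not read from the model config) and the standard LoRA convention in which the low-rank update is scaled by `lora_alpha / lora_r`. The helper `lora_stats` is hypothetical and not part of PEFT.

```python
# Illustrative arithmetic only; the 4096x4096 shape is an assumed projection size.
def lora_stats(d_out, d_in, r, alpha):
    """Return (adapter_params, full_params, scaling) for one adapted weight matrix."""
    adapter_params = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    full_params = d_in * d_out           # size of the frozen weight matrix
    scaling = alpha / r                  # the LoRA update B @ A is multiplied by alpha / r
    return adapter_params, full_params, scaling

adapter, full, scaling = lora_stats(d_out=4096, d_in=4096, r=64, alpha=16)
print(f"adapter params: {adapter}, full params: {full}, scaling: {scaling}")
print(f"trainable fraction for this matrix: {adapter / full:.2%}")
```

With the defaults above, the adapter trains only a small fraction of the weights of each adapted matrix, and the update is damped by a factor of 0.25.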


In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf" # non-gated model from HuggingFace, not the official gated model from Meta
tokenizer = AutoTokenizer.from_pretrained(model_name)

@dataclass
class ScriptArguments:
    model_name: Optional[str] = field(default=model_name, metadata={"help": "the model name"})
    dataset_text_field: Optional[str] = field(default="text", metadata={"help": "the text field of the dataset"})
    log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
    learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
    batch_size: Optional[int] = field(default=4, metadata={"help": "the batch size"})
    seq_length: Optional[int] = field(default=512, metadata={"help": "Input sequence length"})
    gradient_accumulation_steps: Optional[int] = field(
        default=2, metadata={"help": "the number of gradient accumulation steps"}
    )
    load_in_8bit: Optional[bool] = field(default=False, metadata={"help": "load the model in 8 bits precision"})
    load_in_4bit: Optional[bool] = field(default=True, metadata={"help": "load the model in 4 bits precision"})
    use_peft: Optional[bool] = field(default=True, metadata={"help": "Whether to use PEFT or not to train adapters"})
    trust_remote_code: Optional[bool] = field(default=True, metadata={"help": "Enable `trust_remote_code`"})
    output_dir: Optional[str] = field(default=f"{PWD}/output", metadata={"help": "the output directory"})
    peft_lora_r: Optional[int] = field(default=64, metadata={"help": "the r parameter of the LoRA adapters"})
    peft_lora_alpha: Optional[int] = field(default=16, metadata={"help": "the alpha parameter of the LoRA adapters"})
    peft_lora_dropout: Optional[float] = field(default=0.1, metadata={"help": "dropout probability for LoRA adapters"})
    logging_steps: Optional[int] = field(default=10, metadata={"help": "the number of logging steps"})
    use_auth_token: Optional[bool] = field(default=False, metadata={"help": "Use HF auth token to access the model"})
    num_train_epochs: Optional[int] = field(default=2, metadata={"help": "the number of training epochs"})
    max_steps: Optional[int] = field(default=-1, metadata={"help": "the number of training steps"})
    save_steps: Optional[int] = field(
        default=0, metadata={"help": "Number of updates steps before two checkpoint saves"}
    )
    save_total_limit: Optional[int] = field(default=10, metadata={"help": "Limits total number of checkpoints."})
    push_to_hub: Optional[bool] = field(default=False, metadata={"help": "Push the model to HF Hub"})
    hub_model_id: Optional[str] = field(default=None, metadata={"help": "The name of the model on HF Hub"})


script_args = ScriptArguments() # use default configuration from the tutorial


#### Apply quantization when loading pretrained LLAMA model

In [None]:
if script_args.load_in_8bit and script_args.load_in_4bit:
    raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
elif script_args.load_in_8bit or script_args.load_in_4bit:
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=script_args.load_in_8bit, load_in_4bit=script_args.load_in_4bit
    )
    device_map = {"": 0}
    torch_dtype = torch.bfloat16
else:
    device_map = None
    quantization_config = None
    torch_dtype = None

model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=quantization_config,
    device_map=device_map,
    trust_remote_code=script_args.trust_remote_code,
    torch_dtype=torch_dtype,
    use_auth_token=script_args.use_auth_token,
)
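
As a rough illustration of why the model is loaded in 4-bit here, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round 7 billion parameters for the "7B" model and ignores quantization constants, activations and the KV cache, so the numbers are only a lower bound. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model

for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the bf16 footprint, which is what makes a single Colab GPU workable.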

### 3.0.3 Generate text function

In [None]:
def generate_explainer_batch(model, tokenizer, output_name, dataset_split, split_df, max_new_tokens, batch_size=32):
    # Decoder-only models need a pad token and left padding for batched generation
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = 'left'
    with open(f'{PWD}/{output_name}.txt', 'a') as fd:
        for i in range(0, len(dataset_split), batch_size):
            # Slicing a Dataset returns a dict of column lists, not a list of rows
            batch = dataset_split[i:i+batch_size]
            # Encode the instances in the batch (input_ids and attention_mask)
            inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True, max_length=512)
            inputs = inputs.to('cuda')

            # Generate text
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                top_p=0.9,
                do_sample=True,
            )

            # Decode only the newly generated tokens, as in generate_explainer
            prompt_length = inputs['input_ids'].shape[-1]
            for j, out in enumerate(output):
                generated_text = tokenizer.decode(out[prompt_length:], skip_special_tokens=True)
                split_df.loc[i+j, 'generated_text'] = generated_text
                print(i+j, batch['event'][j], generated_text)
                fd.write(generated_text.encode('ascii', 'ignore').decode('ascii'))
        split_df.to_csv(f'{PWD}/generate_{output_name}.csv', index=False)


In [None]:
def generate_explainer(model, tokenizer, output_name, dataset_split, split_df, max_new_tokens):
  with open(f'{PWD}/{output_name}.txt', 'a') as fd:
    for i, instance in enumerate(dataset_split):
      # Encode the instance
      input_ids = tokenizer.encode(instance['text'], return_tensors='pt')
      input_ids = input_ids.to('cuda')

      # Generate text
      output = model.generate(
          input_ids,
          max_new_tokens=max_new_tokens,
          num_return_sequences=1,
          eos_token_id=tokenizer.eos_token_id,
          top_p=0.9,
          do_sample=True,
          )

      # Decode the output
      generated_text = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
      split_df.loc[i, 'generated_text'] = generated_text
      print(i, instance['event'], generated_text)
      fd.write(generated_text.encode('ascii', 'ignore').decode('ascii'))
  split_df.to_csv(f'{PWD}/generate_{output_name}.csv', index=False)

In [None]:
prompt_template = """
<s>[INST] <<SYS>>
You are a careful assistant. Your task is to categorize the following events as plausible or implausible.
Events could be either abstract or concrete.
You should always start the answer with `Plausible` or `Implausible`.
Plausible events could be typical and preferable (e.g. `Kids play football`),
but a lot of plausible events are unlikely, atypical and should not happen (e.g. `Man eats paintballs`).
Implausible events do not make any sense (e.g. `Child eats bridge`).
<</SYS>>
"""
def finetuned_generation(model, tokenizer, prompt_template, output_name, dataset_split, split_df, max_new_tokens):
  with open(f'{PWD}/{output_name}.txt', 'w') as fd:
    for i, instance in enumerate(dataset_split):
      # Encode the instance
      prompt = f"Human: Is the event `{instance['text']}` plausible?\n\n### Assistant:"
      t = f"{prompt_template} {prompt} [/INST]"
      input_ids = tokenizer.encode(t, return_tensors='pt')
      input_ids = input_ids.to('cuda')
      # Generate text
      output = model.generate(
          input_ids,
          max_new_tokens=max_new_tokens,
          num_return_sequences=1,
          eos_token_id=tokenizer.eos_token_id,
          top_p=0.9,
          do_sample=True,
          )
      # Decode the output
      generated_text = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
      split_df.loc[i, 'generated_text'] = generated_text
      print(i, instance['text'], instance['label'], generated_text)
      fd.write(generated_text.encode('ascii', 'ignore').decode('ascii'))
  split_df.to_csv(f'{PWD}/generate_from_prompt{output_name}.csv', index=False)

## 3.1. Experiment 1: Finetuning PAP

This part follows [this tutorial](https://colab.research.google.com/drive/1ggaa2oRFphdBmqIjSEbnb_HGkcIRC2ZB?usp=sharing). The main idea is to adapt the dataset to the instruction format expected by the generative model, and then to post-process the generated text to obtain a label.
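
The post-processing step can be sketched as follows. This is a minimal illustration, not the exact procedure used later in the notebook: `extract_label` is a hypothetical helper, and it assumes the answer contains the word `plausible` or `implausible` somewhere in the generated text. Note that `implausible` contains `plausible` as a substring, so the match has to respect word boundaries.

```python
import re

# Word boundaries keep "plausible" from matching inside "implausible".
LABEL_PATTERN = re.compile(r"\b(implausible|plausible)\b", flags=re.IGNORECASE)

def extract_label(generated_text):
    """Map a generated answer to 1 (plausible), 0 (implausible) or None if no label is found."""
    match = LABEL_PATTERN.search(generated_text)
    if match is None:
        return None
    return 0 if match.group(1).lower() == "implausible" else 1

for text in [" Plausible", " Implausible because ...", " Imp"]:
    print(repr(text), "->", extract_label(text))
```

Returning `None` for unparseable answers (for example a label cut off by a small `max_new_tokens`) makes such cases visible instead of silently counting them as one class.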


### 3.1.1 Prepare the dataset

* Load dataset splits

In [None]:
pap = load_dataset('csv', data_files={
    'train': f'{BASE_DIR}/{TRAIN_FN}',
    'dev': f'{BASE_DIR}/{DEV_FN}',
    'test': f'{BASE_DIR}/{TEST_FN}'
})


* Transform the dataset to fit the training format of the generative model

In [None]:
def prepare_dataset(train_data, validation_data, id_to_label, question_template):
  train_instructions = [f'{question_template}\nevent: {x}\n\n### Assistant: {id_to_label[y]}' for x,y in zip(train_data['text'],train_data['label'])]
  validation_instructions = [f'{question_template}\nevent: {x}\n\n### Assistant: {id_to_label[y]}' for x,y in zip(validation_data['text'],validation_data['label'])]

  ds_train = Dataset.from_dict({"text": train_instructions})
  ds_validation = Dataset.from_dict({"text": validation_instructions})
  instructions_ds_dict = DatasetDict({"train": ds_train, "eval": ds_validation})
  return instructions_ds_dict

train_data = pap['train']
validation_data = pap['dev']
id_to_label = {0:'Implausible', 1:'Plausible'}

# Fewshot prompt
question_template = """
### Human: Categorize the following events as plausibile or implausible.
For example:
`Man swallows paintball`, class Plausible.
`Lake rides camel`, class Implausible.
`Law prohibits discrimination`, class Plausible.
`Duck drinks humor`, class Implausible.
"""
instructions_ds_dict = prepare_dataset(train_data, validation_data, id_to_label, question_template)
# Look at 1 training example
instructions_ds_dict['train']['text'][9]

'\n### Human: Categorize the following events as plausibile or implausible. \nFor example:\n`Man swallows paintball`, class Plausible.\n`Lake rides camel`, class Implausible.\n`Law prohibits discrimination`, class Plausible.\n`Duck drinks humor`, class Implausible.\n\nevent: broadcast concentrates alignment\n\n### Assistant: Implausible'

### 3.1.2 Finetuning with PAP dataset

* Apply LoRA

In [None]:
dataset = instructions_ds_dict

training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    logging_steps=script_args.logging_steps,
    num_train_epochs=script_args.num_train_epochs,
    max_steps=script_args.max_steps,
    report_to=script_args.log_with,
    save_steps=script_args.save_steps,
    save_total_limit=script_args.save_total_limit,
    push_to_hub=script_args.push_to_hub,
    hub_model_id=script_args.hub_model_id,
)

if script_args.use_peft:
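    peft_config = LoraConfig(
        r=script_args.peft_lora_r,
        lora_alpha=script_args.peft_lora_alpha,
        # pass the dropout defined in ScriptArguments (it was otherwise unused)
        lora_dropout=script_args.peft_lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )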
else:
    peft_config = None

trainer = SFTTrainer(
    model=model,
    args=training_args,
    max_seq_length=script_args.seq_length,
    train_dataset=dataset['train'],
    eval_dataset=dataset['eval'],
    dataset_text_field=script_args.dataset_text_field,
    peft_config=peft_config,
)
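
For orientation, the training schedule implied by these arguments can be estimated with simple arithmetic. The sketch below assumes 1386 training examples, a single device, and the defaults from `ScriptArguments` (batch size 4, 2 gradient accumulation steps, 2 epochs). The exact step count reported by the trainer may differ slightly because of how partial batches are rounded, so treat this as an approximation. `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(n_examples, per_device_batch, grad_accum, epochs):
    """Return (effective_batch, updates_per_epoch, total_updates), rounding partial batches up."""
    effective_batch = per_device_batch * grad_accum
    updates_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, updates_per_epoch, updates_per_epoch * epochs

effective_batch, per_epoch, total = training_schedule(
    n_examples=1386, per_device_batch=4, grad_accum=2, epochs=2
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer updates: ~{per_epoch} per epoch, ~{total} in total")
```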

Map:   0%|          | 0/1386 [00:00<?, ? examples/s]

Map:   0%|          | 0/173 [00:00<?, ? examples/s]



* Train (about 1h17m)

In [None]:
trainer.train()

trainer.save_model(training_args.output_dir)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,2.6404
2,2.6186
3,2.8012
4,2.9265
5,2.5018
6,2.6839
7,2.5911
8,2.7351
9,2.5211
10,2.5292




### 3.1.3 Test the performance

In [None]:
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Named `generator` so that the imported `pipeline` function is not shadowed
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map={'':0},
)
queries = [instructions_ds_dict['eval']['text'][i].split('### Assistant: ')[0] + '### Assistant:' for i in range(len(instructions_ds_dict['eval']))]
sequences = generator(
    queries,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=3,
    early_stopping=True,
    # do_sample=True,
)



In [None]:
results = []

for seq in sequences:
  result = seq[0]['generated_text'].split('### Assistant:')[1]
  results.append(result)

labels = []

for label in instructions_ds_dict['eval']['text']:
  result = label.split('### Assistant:')[1]
  labels.append(result)

In [None]:
print("Accuracy: ", (len([1 for x, y in zip(results, labels) if y in x]) / len(labels)))

Accuracy:  0.7129629629629629
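
The accuracy above uses a substring test (`y in x`), which counts a prediction as correct whenever the gold label appears anywhere in the generated text. A stricter alternative, sketched below on made-up predictions (the values are hypothetical, not taken from the run above), is to normalize both strings and require an exact match, while also counting each (gold, predicted) pair to see where the errors fall. `evaluate` is a hypothetical helper.

```python
from collections import Counter

def evaluate(predictions, golds):
    """Exact-match accuracy after normalization, plus counts per (gold, predicted) pair."""
    pairs = [(g.strip().lower(), p.strip().lower()) for p, g in zip(predictions, golds)]
    confusion = Counter(pairs)
    accuracy = sum(gold == pred for gold, pred in pairs) / len(pairs)
    return accuracy, confusion

# Hypothetical model outputs and gold labels, for illustration only
toy_predictions = [" Plausible", " Implausible", " Plausible", " Imp"]
toy_golds = [" Plausible", " Plausible", " Implausible", " Implausible"]

toy_accuracy, toy_confusion = evaluate(toy_predictions, toy_golds)
print("accuracy:", toy_accuracy)
for (gold, pred), count in sorted(toy_confusion.items()):
    print(f"gold={gold!r} predicted={pred!r}: {count}")
```

Truncated outputs such as `Imp` show up as their own row in the counts, which helps decide whether `max_new_tokens` is too small.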


## 3.2. Experiment 2: Finetuning with augmented data

### 3.2.1 Get the prototypes of each combination - class

In [None]:
import ast

# use the raw annotations because they already contain the abstractness combination
raw_df = pd.read_csv(RAW, sep='\t')
# convert the string representation of each list to an actual list of numbers
lists = ['rating', 'distribution_multiclass', 'distribution_binary']
raw_df[lists] = raw_df[lists].map(lambda x: ast.literal_eval(x.strip()))
# keep only datapoints where one binary class received more than 70% of the votes
filtered_df = raw_df.loc[raw_df['distribution_binary'].apply(lambda x: max(x) > 70)]
# filter out 'unsure' binary datapoints
filtered_df = filtered_df.query("majority_binary != 'unsure'")
# group by abstractness_combination and majority_binary. Note that there are only 47 groups, not 2x27=54 groups
grouped_df = filtered_df.groupby(['abstractness_combination','majority_binary'])
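
The parse-then-filter step can be illustrated without the real file. The sketch below uses two made-up rows (hypothetical events and vote distributions) and assumes, as in the cell above, that each distribution is stored as the string form of a list of percentages.

```python
import ast

# Hypothetical rows mimicking the string-encoded list column of the raw annotation file
toy_rows = [
    {"event": "event A", "distribution_binary": "[12.5, 87.5]"},
    {"event": "event B", "distribution_binary": " [50.0, 50.0] "},
]

def parse_list(cell):
    """Turn the string representation of a list into an actual Python list."""
    return ast.literal_eval(cell.strip())

# Keep only events where one class received more than 70% of the votes
confident_events = [
    row["event"] for row in toy_rows
    if max(parse_list(row["distribution_binary"])) > 70
]
print(confident_events)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for cells read from a file.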

In [None]:
# Define a function to get a random sample from each group
def get_random_datapoint(group):
    return group.sample(1)

# Apply the function to each group
random_datapoints = grouped_df.apply(get_random_datapoint).reset_index(drop=True)

In [None]:
combi = raw_df['abstractness_combination'].unique() # 27 combi
from itertools import product
# Create a reference DataFrame with all possible combinations
all_combinations = list(product(combi, [0, 1]))  # Assuming 27 types and 2 classes
reference_df = pd.DataFrame(all_combinations, columns=['abstractness_combination', 'majority_binary'])
reference_df

Unnamed: 0,abstractness_combination,majority_binary
0,a-m-a,0
1,a-m-a,1
2,a-c-m,0
3,a-c-m,1
4,a-c-a,0
5,a-c-a,1
6,a-m-m,0
7,a-m-m,1
8,a-a-a,0
9,a-a-a,1


In [None]:
# Convert columns in the original DataFrame to int64
random_datapoints['majority_binary'] = random_datapoints['majority_binary'].astype(int)
# Merge the reference DataFrame with your original DataFrame
merged_df = pd.merge(reference_df, random_datapoints, how='left', left_on=['abstractness_combination', 'majority_binary'], right_on=['abstractness_combination', 'majority_binary'])

# Identify the missing combinations
missing_combinations = merged_df[merged_df.isnull().any(axis=1)][['abstractness_combination', 'majority_binary']]
# 7 combinations were voted plausible only, never implausible
print("Missing combinations:")
print(missing_combinations)

Missing combinations:
   abstractness_combination  majority_binary
10                    m-m-a                0
14                    m-m-m                0
18                    c-m-a                0
26                    c-m-c                0
38                    m-c-a                0
42                    m-a-a                0
44                    c-a-c                0
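
The same check can be sketched without pandas, which makes the counting explicit: three abstractness levels (`a`, `m`, `c`) over three slots give 27 combinations, and two classes give 54 expected (combination, class) pairs. In the sketch below the set of observed pairs is made up (two pairs are removed by hand) purely to show the set difference.

```python
from itertools import product

levels = ["a", "m", "c"]
combinations = ["-".join(slots) for slots in product(levels, repeat=3)]
expected_pairs = set(product(combinations, [0, 1]))

# Hypothetical observation: pretend two (combination, class) pairs never occur
observed_pairs = expected_pairs - {("m-m-a", 0), ("c-a-c", 0)}

missing_pairs = sorted(expected_pairs - observed_pairs)
print(len(combinations), len(expected_pairs))
print(missing_pairs)
```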


In [None]:
missing_combi = missing_combinations['abstractness_combination'].tolist()
# Combine the three conditions into a single boolean mask (avoids the reindexing warning of chained filters)
mask = (
    (raw_df['majority_binary'] != 'unsure')
    & raw_df['abstractness_combination'].isin(missing_combi)
    & (raw_df['original_label'] == 'implausible')
)
filtered_missing_df = raw_df[mask]
# sample one datapoint originally labelled `implausible` for each missing combination
additional_grouped_df = filtered_missing_df.groupby(['abstractness_combination'])
# Apply the function to each group
additional_random_datapoints = additional_grouped_df.apply(get_random_datapoint).reset_index(drop=True)



In [None]:
# combine to the final df (54 combi)
final_df = pd.concat([random_datapoints, additional_random_datapoints], axis = 0)
# save under PWD, where the file is read back later
final_df.to_csv(f'{PWD}/pap_prototype.csv', index=False)

In [None]:
pap_prototype = load_dataset('csv', data_files={
    'train': f'{PWD}/text_to_generate.csv'
})

* Install Dependencies for new session, if needed

* Generate explainer for prototypes

In [None]:
pap_df = pd.read_csv(f'{PWD}/pap_prototype.csv')
pap_df['question'] = pap_df.apply(lambda row: f"### Q: Is the event `{row['event']}` plausible?\n### A: The event `{row['event']}` is {'plausible' if row['majority_binary'] == 1 else 'implausible'} because", axis=1)

In [None]:
pap_df.loc[0]['question']

'### Q: Is the event `lack mitigates disruption` plausible?\n### A: The event `lack mitigates disruption` is implausible because'

In [None]:
prototype_to_generate = pap_df[['event', 'abstractness_combination','majority_binary','question']]
prototype_to_generate.to_csv(f'{PWD}/prototype_to_generate.csv',index=False)

### 3.2.2 Generate explainer for prototypes

In [None]:
# generate_explainer appends the '.txt' extension itself
generate_explainer(model, tokenizer, 'prototypes', pap_prototype['train'], prototype_to_generate, 50)

### Q: The event `lack mitigates disruption` is implausible. Why? 

### A: There are several reasons why the event "lack of mitigation strategies could have been more disruptive" is implausible:

1. **Lack of foresight**: It is unlikely that a lack of mitigation strategies would lead to a lack of foresight into the potential consequences of climate change. Governments
### Q: The event `theorist presides association` is plausible. Why? 

### A: The event `theorist presides association` is plausible for a few reasons:

1. **Theorist**: As a noun, a "theorist" refers to a person who engages in theoretical or speculative thinking, often in a particular field of study. In this context, it's possible that the
### Q: The event `conflict entails trousers` is implausible. Why? 

### A: The event `conflict entails trousers` is implausible because "conflict" and "trousers" are not logically related. A conflict is a disagreement or clash between two or more parties, while trousers are a type of pa

We use a cleaned version of these explainers (after some reformatting and deletion).

### 3.2.3 Create augmented train-dev dataset

#### Add `abstractness_combination`

* read a split

In [None]:
import pandas as pd
train_df = pd.read_csv(f'{BASE_DIR}/train.csv')
dev_df = pd.read_csv(f'{BASE_DIR}/dev.csv')
test_df = pd.read_csv(f'{BASE_DIR}/test.csv')

In [None]:
# read concreteness ratings
conc_df = pd.read_excel(CONCRETE)
# read raw, cleaned annotation
raw_df = pd.read_csv(RAW, sep='\t')

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!python -m spacy download en_core_web_sm --quiet
# Load the spaCy English language model
import spacy
nlp = spacy.load('en_core_web_sm')


In [None]:
def assign_abstractness_combination(svo, conc_df):
    """
    Take an event and assign its abstractness combination based on concreteness ratings from an external source.
    """
    doc = nlp(svo)
    lemmas = [token.lemma_ for token in doc]
    abstract_score = []

    for l in lemmas:
        conc_value = conc_df.loc[conc_df['Word'] == l, "Conc.M"].values
        if len(conc_value) == 0: # if the word is missing, assuming that it's a proper noun and is highly concrete
            conc_value = [5]
        abstract_score.extend(conc_value)

    abstract_combi =['','','']
    for i, score in enumerate(abstract_score):
        if score <= 2:
            abstract_combi[i] = 'a'
        elif score < 4:
            abstract_combi[i] = 'm'
        else:
            abstract_combi[i] = 'c'

    return abstract_combi
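
The thresholding rule used above can be shown in isolation. The sketch below keeps the same cut-offs (a score of at most 2 is abstract `a`, below 4 is mid-range `m`, otherwise concrete `c`) and the same fallback of treating unknown words as highly concrete. The ratings in `toy_ratings` are invented for illustration and are not taken from the concreteness norms file, and both helpers are hypothetical.

```python
def concreteness_level(score):
    """Map a concreteness rating to 'a' (abstract), 'm' (mid-range) or 'c' (concrete)."""
    if score <= 2:
        return "a"
    elif score < 4:
        return "m"
    return "c"

def combination(lemmas, ratings, default=5.0):
    """Join the level of each lemma; unknown words fall back to `default` (highly concrete)."""
    return "-".join(concreteness_level(ratings.get(lemma, default)) for lemma in lemmas)

# Invented ratings, for illustration only
toy_ratings = {"dog": 4.9, "change": 2.5, "idea": 1.6}

print(combination(["dog", "change", "idea"], toy_ratings))
print(combination(["zorblax", "change", "idea"], toy_ratings))
```

Using a dictionary lookup instead of filtering the dataframe for every word is also considerably faster when many events have to be labelled.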

In [None]:
def add_abstractness_combination(split_df, conc_df, output_name):
  "Take a split dataframe, add the abstractness combination and save as a new csv file"
  # merge the given split (not a fixed one) with the cleaned raw annotation
  merged_df = split_df.merge(raw_df, how = 'left', left_on='text', right_on='event', suffixes=('_split','_raw'))
  # datapoints without an abstractness combination (copied to avoid SettingWithCopyWarning)
  null_abstract_df = merged_df[merged_df['abstractness_combination'].isna()].copy()
  # Compute the abstractness levels for each row of the "text" column
  # This might take a while
  null_abstract_df['abstract_score'] = null_abstract_df['text'].apply(lambda x: assign_abstractness_combination(x, conc_df))
  null_abstract_df['abstractness_combination'] = null_abstract_df['abstract_score'].apply(lambda x: '-'.join(x))
  # drop returns a new dataframe, so the result has to be assigned
  null_abstract_df = null_abstract_df.drop(['abstract_score'], axis=1)
  # merge the two dataframes
  dropna_df = merged_df.dropna(subset=['abstractness_combination'])
  final_df = pd.concat([dropna_df, null_abstract_df], axis=0)
  selected_columns = final_df[['text', 'original_label_split', 'label', 'abstractness_combination']]
  selected_columns.to_csv(f'{PWD}/{output_name}', index=False)

In [None]:
add_abstractness_combination(train_df, conc_df, 'train_augmented.csv')
add_abstractness_combination(dev_df, conc_df, 'dev_augmented.csv')
add_abstractness_combination(test_df, conc_df, 'test_augmented.csv')


#### Add prompt templates

* group explainers based on abstractness level

In [None]:
import pandas as pd
def group_prototype():
  # read the prototype prompts
  prompt_prototype = pd.read_excel(f'{PWD}/pap_prototype_2.xlsx')
  # keep the relevant columns (copied to avoid SettingWithCopyWarning)
  prompt_prototype_selected = prompt_prototype[['event','original_label','abstractness_combination', 'majority_binary', 'explainer']].copy()
  # add Q-A template
  prompt_prototype_selected['proto_explainer'] = prompt_prototype_selected.apply(lambda row: f"### Q: Is the event `{row['event']}` plausible? ### A: {row['explainer']} **The answer is **{'plausible' if row['majority_binary'] == 1 else 'implausible'}.", axis=1)
  # Group by 'abstractness_combination' and concatenate 'proto_explainer'
  df_grouped = prompt_prototype_selected.groupby('abstractness_combination')['proto_explainer'].apply(lambda x: '\n\n'.join(x)).reset_index()
  return df_grouped
df_grouped = group_prototype()
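
What the grouping produces can be sketched on a toy input. The two prototypes below reuse event names from the generation output above, but their abstractness combination and explainer texts are invented for illustration. The result is one few-shot block per abstractness combination, with the prototypes of that combination joined by a blank line. `build_fewshot_blocks` is a hypothetical pure-Python counterpart of `group_prototype`.

```python
from collections import defaultdict

# Toy prototypes: the combination and explainer values are invented for illustration
toy_prototypes = [
    {"event": "lack mitigates disruption", "abstractness_combination": "a-a-a",
     "majority_binary": 0, "explainer": "A lack of something cannot actively mitigate anything."},
    {"event": "theorist presides association", "abstractness_combination": "a-a-a",
     "majority_binary": 1, "explainer": "A theorist can chair a professional association."},
]

def build_fewshot_blocks(prototypes):
    """Return {abstractness_combination: few-shot text} with one Q-A pair per prototype."""
    grouped = defaultdict(list)
    for proto in prototypes:
        verdict = "plausible" if proto["majority_binary"] == 1 else "implausible"
        grouped[proto["abstractness_combination"]].append(
            f"### Q: Is the event `{proto['event']}` plausible? "
            f"### A: {proto['explainer']} **The answer is **{verdict}."
        )
    return {combo: "\n\n".join(items) for combo, items in grouped.items()}

toy_blocks = build_fewshot_blocks(toy_prototypes)
print(toy_blocks["a-a-a"])
```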

In [None]:
train_df = pd.read_csv(f'{PWD}/train_augmented.csv')
dev_df = pd.read_csv(f'{PWD}/dev_augmented.csv')

In [None]:
def add_fewshot_prompt(split_df, df_grouped, output_name):
  # merge the split with grouped prototype explainers
  augmented_split_df = split_df.merge(df_grouped, how='left', left_on=['abstractness_combination'], right_on=['abstractness_combination'], suffixes=('_split','_explainer'))
  # transform event to template
  augmented_split_df.rename(columns={'text':'event'}, inplace=True)
  augmented_split_df.loc[:,'text'] = augmented_split_df.apply(lambda row: f"{row['proto_explainer']}\n\n### Q: Is the event `{row['event']}` plausible? ### A: The event `{row['event']}` is {'plausible' if row['label'] == 1 else 'implausible'} because", axis=1)
  # write to a new csv file
  augmented_split_df[['text','event','label']].to_csv(f'{PWD}/{output_name}',index=False)

In [None]:
add_fewshot_prompt(train_df, df_grouped, 'train_augmented_dataset.csv')
add_fewshot_prompt(dev_df, df_grouped, 'dev_augmented_dataset.csv')

In [None]:
pap_explainer = load_dataset('csv', data_files={
    'train': f'{PWD}/train_augmented_dataset.csv',
    'dev': f'{PWD}/dev_augmented_dataset.csv'
})


In [None]:
import pandas as pd
augmented_train_df = pd.read_csv(f'{PWD}/train_augmented_dataset.csv')
augmented_dev_df = pd.read_csv(f'{PWD}/dev_augmented_dataset.csv')

#### Generate explainers for dataset split

In [None]:
def clean_generated_text(generated_csv, output_name):
  # read the generated file
  explainer = pd.read_csv(generated_csv)
  # split the text
  explainer.loc[:,'explainer'] = explainer['generated_text'].apply(lambda t: t.split(' **The answer is *')[0])
  # write to a new file
  explainer[['event','label','explainer']].to_csv(f'{PWD}/{output_name}.csv', index=False)
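
The effect of the split in `clean_generated_text` is easiest to see on a single string. The generative model often keeps going after its answer and starts a new question, so everything from the answer marker onwards is discarded. The sample text below is adapted from typical output of this notebook, and `strip_after_marker` is a hypothetical helper that mirrors the lambda above.

```python
MARKER = " **The answer is *"

def strip_after_marker(generated_text, marker=MARKER):
    """Keep only the explanation that precedes the answer marker."""
    return generated_text.split(marker)[0]

sample = (
    "capers are not capable of extracting fingers. **The answer is **implausible.\n\n"
    "### Q: Is the event `tree falls in forest` plausible? ### A: The event"
)
print(repr(strip_after_marker(sample)))

# Text without the marker is returned unchanged
print(repr(strip_after_marker("there are many instances of roads")))
```

Because text without the marker passes through untouched, outputs that were cut off by `max_new_tokens` before reaching the answer are kept in full.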

In [None]:
generate_explainer(model, tokenizer, 'train_explainer', pap_explainer['train'], augmented_train_df, 80)
clean_generated_text(f'{PWD}/generate_train_explainer.csv', 'train_final_dataset')



0 group releases album many groups, including bands and musical acts, release albums as a way to share their music with fans and the public. The event could involve a group releasing a new album after spending time writing and recording music together.  **The answer is **plausible.

### Q: Is the event `tree falls in forest` plausible? ### A: The event
1 rich unfold interest it suggests that a person's interest in a particular topic is dependent on their wealth. In reality, people's interests are not determined by their financial status. **The answer is **impossible.

### Q: Is the event `employer reveals tendency` plausible? ### A: The event `employer reveals tendency` is plaus
2 fruit reduces risk some types of fruit may have been shown to reduce the risk of certain diseases. For example, eating apples has been associated with a reduced risk of heart disease and stroke, and consuming blueberries has been linked to a lower risk of cognitive decline. However, it is important to note th

In [None]:
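generate_explainer(model, tokenizer, 'dev_explainer', pap_explainer['dev'], augmented_dev_df, 80)
# clean the dev generations as well; `dev_final_dataset.csv` is read in section 3.2.5
# (input file name assumed by analogy with the train call above)
clean_generated_text(f'{PWD}/generate_dev_explainer.csv', 'dev_final_dataset')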

0 press shakes rent a press cannot physically shake or harm a rental property. The word `press` refers to a device used to apply pressure, and it is not capable of interacting with a physical structure like a rental property. **The answer is **impossible.

I hope these explanations help! Let me know if you have any other questions.
1 pair pronounces validation it suggests a situation where two people are validating each other's identities and experiences. This could involve sharing personal stories, offering emotional support, and advocating for each other's rights and dignity. **The answer is **plausible.

### Q: Is the event `tribe trades trust` plausible? ### A: The event `tri
2 caper extracts finger capers are not capable of extracting fingers. Capers are small, round vegetables that are commonly used in cooking and do not have the ability to manipulate or remove body parts.  **The answer is **implausible.
3 motorway forbids distribution there are many instances of roads or highway

### 3.2.5 Finetuning the augmented dataset

#### Prepare the dataset

In [None]:
import pandas as pd
train_final = pd.read_csv(f'{PWD}/train_final_dataset.csv')
dev_final = pd.read_csv(f'{PWD}/dev_final_dataset.csv')

In [None]:
# initialize training data
pap_augmented = load_dataset('csv', data_files={
    'train': f'{PWD}/train_final_dataset.csv',
    'dev': f'{PWD}/dev_final_dataset.csv',
})
train_data = pap_augmented['train']
validation_data = pap_augmented['dev']
id_to_label = {0:'Implausible', 1:'Plausible'}
question_template = """
### Human: Categorize the following events as plausibile or implausible.
"""
def prepare_dataset(train_data, validation_data, id_to_label, question_template):
  train_instructions = [f'{question_template}\nevent: {x}\n\n### Assistant: {id_to_label[y]} because {z}' for x,y,z in zip(train_data['event'],train_data['label'], train_data['text'])]
  validation_instructions = [f'{question_template}\nevent: {x}\n\n### Assistant: {id_to_label[y]} because {z}' for x,y,z in zip(validation_data['event'],validation_data['label'], validation_data['text'])]

  ds_train = Dataset.from_dict({"text": train_instructions})
  ds_validation = Dataset.from_dict({"text": validation_instructions})
  instructions_ds_dict = DatasetDict({"train": ds_train, "eval": ds_validation})
  return instructions_ds_dict

instructions_ds_dict = prepare_dataset(train_data, validation_data, id_to_label, question_template)
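
The string assembled for each row can be checked in isolation. The sketch below mirrors the f-string used in `prepare_dataset`; the function name `build_instruction`, the shortened question text, and the one-line explanation are stand-ins chosen for illustration, not values from the dataset.

```python
ID_TO_LABEL = {0: 'Implausible', 1: 'Plausible'}

def build_instruction(question: str, event: str, label: int, explanation: str) -> str:
    """Assemble one training example in the '### Human / ### Assistant' layout."""
    return (f'{question}\nevent: {event}\n\n'
            f'### Assistant: {ID_TO_LABEL[label]} because {explanation}')

# Shortened stand-in for question_template (hypothetical wording).
question = "\n### Human: Is this event plausible or implausible?\n"
text = build_instruction(
    question,
    event='encouragement slips repertoire',
    label=0,
    explanation='encouragement is a deliberate effort.',
)
print(text)
```

Keeping the label word directly after `### Assistant:` is what later allows the prediction to be read off the first word of the generated answer.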


In [None]:
instructions_ds_dict['train'][10]['text']

"\n### Human: Categorize the following events as plausibile or implausible.\n\nevent: encouragement slips repertoire\n\n### Assistant: Implausible because the term `slips repertoire` is not a commonly used term in the context of encouragement or any other field. Additionally, the idea of encouragement slipping out of a person's repertoire is unlikely, as encouragement is often a deliberate and conscious effort. **The answer is implausible."

#### Finetuning (about 1h)

* Load the tokenizer and the model from Hugging Face before running the cell below.

* Apply LoRA by passing `peft_config` to the trainer.

In [None]:
dataset = instructions_ds_dict

training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    logging_steps=script_args.logging_steps,
    num_train_epochs=script_args.num_train_epochs,
    max_steps=script_args.max_steps,
    report_to=script_args.log_with,
    save_steps=script_args.save_steps,
    save_total_limit=script_args.save_total_limit,
    push_to_hub=script_args.push_to_hub,
    hub_model_id=script_args.hub_model_id,
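    # Trainer does not read this field itself; to actually resume,
    # pass resume_from_checkpoint=True to trainer.train()
    resume_from_checkpoint=True,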
)

if script_args.use_peft:
    peft_config = LoraConfig(
        r=script_args.peft_lora_r,
        lora_dropout=script_args.peft_lora_dropout,
        lora_alpha=script_args.peft_lora_alpha,
        bias="none",
        task_type="CAUSAL_LM",
    )
else:
    peft_config = None

trainer = SFTTrainer(
    model=model,
    args=training_args,
    max_seq_length=script_args.seq_length,
    train_dataset=dataset['train'],
    eval_dataset=dataset['eval'],
    dataset_text_field=script_args.dataset_text_field,
    peft_config=peft_config,
)

Map:   0%|          | 0/1386 [00:00<?, ? examples/s]

Map:   0%|          | 0/173 [00:00<?, ? examples/s]
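
A rough parameter count shows why LoRA keeps this run cheap. The sketch below counts the weights in the two low-rank factors for one 4096 x 4096 projection (the size shown in the model printout later in this section). The rank `r = 16` is purely hypothetical, since the real value lives in `script_args.peft_lora_r`, and which projections receive adapters is not shown here either.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

d_in = d_out = 4096   # projection size of the attention layers
r = 16                # hypothetical rank, not the notebook's actual setting

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f'full matrix: {full:,} weights')
print(f'LoRA factors: {lora:,} weights')
print(f'fraction trained: {lora / full:.4f}')
```

Under these assumptions the adapter trains well under one percent of the weights of each adapted matrix, while the frozen base weights stay untouched.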



In [None]:
trainer.train()
# trainer.train(resume_from_checkpoint=True)

trainer.save_model(training_args.output_dir)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,2.5043
20,2.4121
30,2.321
40,2.2417
50,2.0593
60,1.9165
70,1.9097
80,1.8124
90,1.6653
100,1.5982



#### Test the performance

##### Prepare the test set

In [None]:
import pandas as pd
dev_df = pd.read_csv(f'{BASE_DIR}/dev.csv')
test_df = pd.read_csv(f'{BASE_DIR}/test.csv')

In [None]:
pap_testing = load_dataset('csv', data_files={
    'dev': f'{BASE_DIR}/dev.csv',
    'test': f'{BASE_DIR}/test.csv'

})

In [None]:
pap_testing

DatasetDict({
    dev: Dataset({
        features: ['text', 'original_label', 'label'],
        num_rows: 173
    })
    test: Dataset({
        features: ['text', 'original_label', 'label'],
        num_rows: 174
    })
})

In [None]:
prompt_template = """
<s>[INST] <<SYS>>
You are careful assistant. Your task is to categorize the following events as plausible or implausible.
You should always start the answer by `Plausible` or `Implausible`. Events could be either asbtract or concrete.
Plausible events could be typical and preferable (e.g. `Kids eat strawberry`),
but a lot plausible events are unlikely, atypical and they should not happen (e.g. `Man eats paintballs).
Implausible events do not make any sense (e.g. `Child eat bridge`).
<</SYS>>
"""

In [None]:
def finetuned_generation(model, tokenizer, prompt_template, output_name, dataset_split, split_df, max_new_tokens):
  """Generate an answer for every event in `dataset_split` and store it in `split_df`.

  Assumes `split_df` has the default 0..n-1 index and the same row order as `dataset_split`.
  """
  with open(f'{PWD}/{output_name}.txt', 'w') as fd:
    for i, instance in enumerate(dataset_split):
      # Build the prompt: system prompt followed by the question about this event
      prompt = f"Human: Is the event `{instance['text']}` plausible?\n\n### Assistant:"
      t = f"{prompt_template} {prompt} [/INST]"
      input_ids = tokenizer.encode(t, return_tensors='pt')
      input_ids = input_ids.to('cuda')
      # Sample a continuation; with do_sample=True the output differs between runs
      output = model.generate(
          input_ids,
          max_new_tokens=max_new_tokens,
          num_return_sequences=1,
          eos_token_id=tokenizer.eos_token_id,
          top_p=0.9,
          do_sample=True,
          )
      # Decode only the newly generated tokens, skipping the prompt
      generated_text = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
      split_df.loc[i, 'generated_text'] = generated_text
      print(i, instance['text'], instance['label'], generated_text)
      # Write ASCII only and separate consecutive generations with a newline
      fd.write(generated_text.encode('ascii', 'ignore').decode('ascii') + '\n')
  split_df.to_csv(f'{PWD}/finetunee_generate_{output_name}.csv', index=False)
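
For comparison, here is a minimal sketch of the single-turn Llama-2 chat layout as it is commonly documented: system prompt between `<<SYS>>` tags, then the user message, all inside one `[INST] ... [/INST]` pair. The helper name `build_llama2_prompt` and the shortened system text are assumptions. The literal `<s>` is left out on the assumption that the tokenizer prepends the beginning-of-sequence token itself. Whether this layout would change the results reported here is untested.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Single-turn Llama-2 chat prompt (hypothetical helper, no literal <s>)."""
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"

prompt = build_llama2_prompt(
    "Categorize the event as plausible or implausible.",   # shortened stand-in
    "Is the event `press shakes rent` plausible?",
)
print(prompt)
```

The notebook's own template additionally keeps the `Human:` / `### Assistant:` markers from the fine-tuning data inside the instruction, which may be intentional since the adapter was trained on that layout.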

In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf" # non-gated model from HuggingFace, not the official gated model from Meta
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map = {"": 0},
)
new_model = f"{PWD}/output"
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "right"
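
Conceptually, merging folds the low-rank update back into each adapted weight matrix, W' = W + (alpha / r) * B A. The sketch below does this with plain nested lists on a 2 x 2 toy matrix. It assumes the standard `alpha / r` scaling and is not the PEFT implementation; `merge_lora` and all the numbers are made up for illustration.

```python
def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new nested list; W is not modified."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]            # copy so the base weights stay intact
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, d_out x d_in
A = [[3.0, 4.0]]               # LoRA factor A, r x d_in
B = [[1.0], [2.0]]             # LoRA factor B, d_out x r
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the model has the same shape and inference cost as the base model, which is why the printout of `model` further down shows plain `Linear` layers.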



In [None]:
from transformers import logging
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

In [None]:
# Run the text-generation pipeline with the fine-tuned (merged) model, with and without the system prompt
event="pair pronounces validation"
target="Assistant: Plausible because it suggests a situation where two people are validating each other's identities and experiences. This could involve sharing personal stories, offering emotional support, and advocating for each other's rights and dignity. **The answer is **plausible."
prompt = f"Human: Is the event `{event}` plausible?\n\n### Assistant:"
text_with_prompt = f"{prompt_template} {prompt} [/INST]"
text_wo_prompt = f"<s> [INST] {prompt} [/INST]"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=80)
result = pipe(text_wo_prompt)
result2 = pipe(text_with_prompt)
print("***target***\n",target)
print("***fine_tuned_result without system prompt***\n", result[0]['generated_text'])
print("***fine_tuned_result with prompt***\n", result2[0]['generated_text'])

***target***
 Assistant: Plausible because it suggests a situation where two people are validating each other's identities and experiences. This could involve sharing personal stories, offering emotional support, and advocating for each other's rights and dignity. **The answer is **plausible.
***fine_tuned_result without system prompt***
 <s> [INST] Human: Is the event `pair pronounces validation` plausible?

### Assistant: [/INST]  The event `pair pronounces validation` is not a plausible or realistic event in the English language.

"Pair" refers to a set of two things, and "pronounces" means to say or speak something. "Validation" is a noun that refers to the act of verifying or confirming something.

Therefore, the event `
***fine_tuned_result with prompt***
 
<s>[INST] <<SYS>>
You are careful assistant. Your task is to categorize the following events as plausible or implausible.
You should always start the answer by `Plausible` or `Implausible`. Events could be either asbtract or c

In [None]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
 

In [None]:
finetuned_generation(model, tokenizer, prompt_template, 'dev_finetuned_explainer', pap_testing['dev'], dev_df, 80)

0 press shakes rent  Plausible. The event `press shakes rent` is a common and plausible occurrence in many contexts. For example, a person may shake a rug or a piece of fabric to remove dirt or dust, or they may shake a friend's hand to greet them. While the specific event of "press shaking rent" is unlikely, the general idea
1 pair pronounces validation  Implausible. It is unlikely for a pair of pronouns to validate each other, as pronouns are not living entities that can provide validation. Additionally, the concept of validation is not something that can be given or received between two things, including pronouns.
2 caper extracts finger  Plausible. While it is unlikely and atypical for a caper to extract a finger, it is possible in a fictional or fantastical context. In reality, capers are small, innocuous plants that do not have the ability to extract fingers or any other physical objects.
3 motorway forbids distribution  Implausible. The concept of a motorway forbidding the distr

In [None]:
def eval_prediction(df):
  df['predict'] = df['generated_text'].apply(lambda x: 0 if x.split('.')[0].strip() == 'Implausible' else 1)
  results = df['predict'].to_list()
  labels = df['label'].to_list()
  print("Accuracy: ", (len([1 for x, y in zip(results, labels) if y == x]) / len(labels)))
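
Note that the rule in `eval_prediction` counts every answer that is not exactly `Implausible` before the first period as plausible, so an answer such as `Implausible because ...` is scored as label 1. A stricter parser is sketched below. `parse_label` and `accuracy` are hypothetical helpers, and the choice to count unparseable answers as wrong is an assumption, not something the notebook specifies.

```python
from typing import Optional

def parse_label(generated_text: str) -> Optional[int]:
    """Map a generation to 0/1 from its first word; None if neither label leads."""
    text = generated_text.strip().lower()
    if text.startswith('implausible'):
        return 0
    if text.startswith('plausible'):
        return 1
    return None

def accuracy(predictions, labels):
    """Share of exact matches; a None prediction never matches a 0/1 label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

outputs = [
    ' Plausible',
    'Implausible. The concept of a motorway forbidding distribution is illogical',
    'Implausible because capers are plants',
    'The event is odd',
]
preds = [parse_label(o) for o in outputs]
print(preds)
print(accuracy(preds, [0, 1, 0, 1]))
```

Reporting the number of `None` predictions alongside the accuracy would show how often the model ignores the instruction to start with a label.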

In [None]:
dev_df

Unnamed: 0,text,original_label,label,generated_text,predict
0,press shakes rent,implausible,0,Plausible. The event `press shakes rent` is a...,1
1,pair pronounces validation,implausible,1,Implausible. It is unlikely for a pair of pro...,0
2,caper extracts finger,implausible,0,Plausible. While it is unlikely and atypical ...,1
3,motorway forbids distribution,implausible,1,Implausible. The concept of a motorway forbid...,0
4,amendment establishes wall,plausible,1,Implausible. An amendment is a change or modi...,0
...,...,...,...,...,...
168,exit publicizes war,implausible,1,Plausible. While the concept of a publicized ...,1
169,moon severs debut,implausible,0,Implausible. The event `moon sever debut` doe...,0
170,municipality decorates street,plausible,1,Plausible. While it may not be common or typi...,1
171,regiment contributes personnel,plausible,1,"Implausible. The term ""regiment"" typically re...",0


In [None]:
results = dev_df['predict'].to_list()
labels = dev_df['label'].to_list()

print("Accuracy: ", (len([1 for x, y in zip(results, labels) if y == x]) / len(labels)))

Accuracy:  0.5895953757225434


In [None]:
finetuned_generation(model, tokenizer, prompt_template, 'dev_finetuned_explainer', pap_testing['dev'], dev_df, 80)
finetuned_generation(model, tokenizer, prompt_template, 'test_finetuned_explainer', pap_testing['test'], test_df, 80)

0 press shakes rent 0  Plausible
1 pair pronounces validation 1  Plausible. The event "pair pronounces validation" is a grammatically correct and coherent sentence, and it is not necessarily impossible or absurd. While it may not be a common or typical occurrence, it is not entirely implausible either.
2 caper extracts finger 0  Implausible. The event of a caper extracting a finger is not possible or plausible as capers are not living beings or entities that have the ability to extract fingers. It is a plant that is commonly used in cooking and does not have any known capabilities to perform such an action.
3 motorway forbids distribution 1  Implausible. The concept of a motorway forbidding distribution is illogical and doesn't make any sense. Motorways are designed for the efficient movement of vehicles, and the idea of restricting or prohibiting the distribution of goods or services on a motorway is not a feasible or practical idea. Additionally, it goes against the primary purpose o

In [None]:
test_df

Unnamed: 0,text,original_label,label,generated_text
0,album makes debut,plausible,1,Plausible. The event of an album making its d...
1,album breaks genre,plausible,1,Plausible. While breaking genres is not a com...
2,lack produces form,plausible,0,"Implausible. The event ""lack produces form"" i..."
3,inclusion expands range,plausible,1,"Plausible. The event ""inclusion expands range..."
4,candidacy encodes appreciation,implausible,1,Plausible. While the concept of candidacy and...
...,...,...,...,...
169,boom characterizes period,plausible,1,"Implausible. The event ""boom characterizes pe..."
170,child wears kind,plausible,1,Implausible. A child wearing a kind (a type o...
171,demonstration brings traffic,plausible,1,"Implausible. A demonstration, by its nature, ..."
172,scene depicts moment,plausible,1,"Plausible. The event ""scene depicts moment"" i..."


In [None]:
eval_prediction(dev_df)

Accuracy:  0.6184971098265896


In [None]:
eval_prediction(test_df)

Accuracy:  0.6149425287356322
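
The dev accuracy differs between the two runs recorded above (0.590, then 0.618). That is consistent with generation using `do_sample=True` and `top_p=0.9`, where each call draws new tokens at random. The sketch below illustrates nucleus (top-p) filtering on a toy distribution. `top_p_filter` is a simplified stand-in written for illustration, not the routine used inside `model.generate`.

```python
def top_p_filter(probs, top_p=0.9):
    """Return {token_index: renormalized_prob} for the smallest set of
    most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(nucleus))   # indices of the tokens that can still be sampled
```

For repeatable evaluation numbers, one option is greedy decoding (`do_sample=False`); another is fixing the random seed, for example with `transformers.set_seed`, before each run.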


# 1. Random Forest

In [None]:
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_curve, auc
import numpy as np


### Preprocessing

In [None]:
def read_csv(split):
  # read one dataset split (train, dev, or test)
  df = pd.read_csv(f'{BASE_DIR}/{split}')
  # tokenize each event into words by splitting on whitespace
  df['tokenize'] = df['text'].apply(lambda x: x.split())
  return df['tokenize'], df['label']

* Load dataset splits

In [None]:
X_train, y_train = read_csv(TRAIN_FN)
X_dev, y_dev = read_csv(DEV_FN)
X_test, y_test = read_csv(TEST_FN)

In [None]:
X_train

0                       [event, occurs, year]
1                    [tortoise, brings, limb]
2           [headliner, overpowers, function]
3                    [county, receives, hour]
4       [traveler, acknowledges, recognition]
                        ...                  
1723         [classification, hauls, slavery]
1724                  [library, needs, space]
1725           [analysis, constrains, theory]
1726                     [row, elicits, game]
1727                [consumer, pokes, vision]
Name: tokenize, Length: 1728, dtype: object

In [None]:
y_train

0       1
1       1
2       1
3       0
4       1
       ..
1723    1
1724    1
1725    1
1726    1
1727    1
Name: label, Length: 1728, dtype: int64

### Computing sentence embedding

In [None]:
# train a Word2Vec model on the training events (100-dim vectors, context window of 2)
model = Word2Vec(X_train,
                 vector_size=100,
                 window=2,
                 min_count=1)

words = set(model.wv.index_to_key)

In [None]:
# average the word vectors of each sentence
def compute_sentence_vector_train(X_train):
  # every training token is in the vocabulary (min_count=1), so no sentence is empty;
  # averaging per sentence also works when sentences differ in length
  return np.array(
      [np.mean([model.wv[i] for i in ls if i in words], axis=0) for ls in X_train]
  )

X_train_vector = compute_sentence_vector_train(X_train)

### Train Random Forest

In [None]:
# Instantiate and fit a basic Random Forest model on top of the vectors
rf = RandomForestClassifier()
rf_model = rf.fit(X_train_vector, y_train.values.ravel())

## Inference

In [None]:
def compute_sentence_vector_eval(X_eval):
  # get the word vectors
  X_eval_vector = np.array(
        [np.array([model.wv[i] for i in ls if i in words]) for ls in X_eval], dtype='object'
    )
  X_eval_vect_avg = []
  for v in X_eval_vector:
    if v.size:
      X_eval_vect_avg.append(v.mean(axis=0))
    else:
      X_eval_vect_avg.append(np.zeros(100, dtype=float))
  return X_eval_vect_avg
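
The averaging step, including the zero-vector fallback for events whose words are all out of vocabulary, can be illustrated without gensim. The sketch below uses plain lists and a made-up two-dimensional vocabulary. `average_vectors` and `toy_vectors` are hypothetical names.

```python
def average_vectors(tokens, vectors, dim):
    """Mean of the known word vectors; a zero vector if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known words
    return [sum(component) / len(known) for component in zip(*known)]

toy_vectors = {'man': [1.0, 0.0], 'eats': [0.0, 1.0], 'bridge': [1.0, 1.0]}

print(average_vectors(['man', 'eats', 'bridge'], toy_vectors, dim=2))
print(average_vectors(['man', 'knits', 'bridge'], toy_vectors, dim=2))  # 'knits' is skipped
print(average_vectors(['zebra'], toy_vectors, dim=2))                   # nothing known
```

Because the Word2Vec vocabulary is built from the training split only, unseen dev and test words silently drop out of the average, which may limit how much the classifier can learn about them.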

In [None]:
def predict_eval_set(X_eval):
  # compute sentence vector for eval set
  X_eval_vect_avg = compute_sentence_vector_eval(X_eval)
  # Use the trained model to make predictions on the eval data
  y_pred = rf_model.predict(X_eval_vect_avg)
  return y_pred

In [None]:
def evaluate_model(X_eval, y_eval):
  y_pred = predict_eval_set(X_eval)
  precision = precision_score(y_eval, y_pred)
  recall = recall_score(y_eval, y_pred)
  print(f'Precision: {precision:.3f} / Recall: {recall:.3f} / Accuracy: {(y_pred==y_eval).sum()/len(y_pred):.3f}')
  # Compute False Positive Rate, True Positive Rate, and AUC score
  fpr, tpr, thresholds = roc_curve(y_eval, y_pred)
  auc_score = auc(fpr, tpr)
  print(f'AUC: {auc_score:.3f}')

In [None]:
evaluate_model(X_dev, y_dev)

Precision: 0.714 / Recall: 0.909 / Accuracy: 0.676
AUC: 0.503


In [None]:
evaluate_model(X_test, y_test)

Precision: 0.733 / Recall: 0.961 / Accuracy: 0.722
AUC: 0.545
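
Because `roc_curve` receives hard 0/1 predictions here rather than scores, the ROC curve has a single interior point and its area reduces to (recall + specificity) / 2. Read that way, a dev recall of 0.909 with an AUC of 0.503 implies a specificity of roughly 0.10, so the forest labels almost every event as plausible. The sketch below computes these quantities from the confusion counts. `binary_metrics` is a hypothetical helper and the toy labels are made up.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, specificity, accuracy and the AUC of hard predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        'precision': precision,
        'recall': recall,
        'specificity': specificity,
        'accuracy': (tp + tn) / len(pairs),
        # area under the three-point ROC curve (0,0), (FPR,TPR), (1,1)
        'auc_hard': (recall + specificity) / 2,
    }

metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(metrics)
```

A score-based AUC could be obtained by passing the positive-class probabilities, for example `rf_model.predict_proba(X)[:, 1]`, to `roc_curve` instead of the predicted labels.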


# 2. BERT - Finetuning

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
!pip install -q transformers datasets evaluate accelerate

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

## Load the dataset

In [None]:
pap = load_dataset('csv', data_files={
    'train': f'{BASE_DIR}/{TRAIN_FN}',
    'dev': f'{BASE_DIR}/{DEV_FN}',
    'test': f'{BASE_DIR}/{TEST_FN}'
})

In [None]:
# look at 1 example
pap["train"][0]

{'text': 'event occurs year', 'original_label': 'plausible', 'label': 1}

## Preprocess

### Tokenize the text field

* Load a Tokenizer to preprocess

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

* Define a function to tokenize text and truncate, if needed

In [None]:
def preprocess_function(examples):
  """Tokenize the text and truncate sequences to be no longer than the model's maximum input length."""
  return tokenizer(examples["text"], truncation=True)

* Apply the preprocessing function over the entire dataset using the `map` function.

In [None]:
# `batched=True` speeds up `map` by processing several examples at once
tokenized_pap = pap.map(preprocess_function, batched=True)

Map:   0%|          | 0/216 [00:00<?, ? examples/s]

### Create a batch of examples using DataCollatorWithPadding

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Evaluation

### Load accuracy metric

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

### Create a function that computes accuracy from the predictions and the true labels

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
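
What `compute_metrics` does can be traced on a toy batch: the trainer hands over one row of logits per example, the highest logit picks the class, and accuracy is the share of matches. The sketch below reproduces that logic with plain lists. `argmax_accuracy` and the toy numbers are made up for illustration.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare it with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
print(argmax_accuracy(toy_logits, toy_labels))
```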

## Train

### Create a map from the expected ids to their labels, and its inverse

In [None]:
id2label = {0: "IMPLAUSIBLE", 1: "PLAUSIBLE"}
label2id = {"IMPLAUSIBLE": 0, "PLAUSIBLE": 1}

### Load DistilBERT with AutoModelForSequenceClassification

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Define training hyperparameters with TrainingArguments

In [None]:
training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-semantic-plausibility",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

### Pass the training arguments to Trainer, along with the model, dataset, tokenizer, data collator, and compute_metrics

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_pap["train"],
    eval_dataset=tokenized_pap["dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Finetune the model

In [None]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.539213,0.712963
2,No log,0.505403,0.731481
3,No log,0.639494,0.731481
4,No log,0.663944,0.731481


TrainOutput(global_step=432, training_loss=0.4191881462379738, metrics={'train_runtime': 43.4917, 'train_samples_per_second': 158.927, 'train_steps_per_second': 9.933, 'total_flos': 15213052814400.0, 'train_loss': 0.4191881462379738, 'epoch': 4.0})
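
The reported `global_step=432` can be checked by hand: 1728 training examples (the length of `X_train` shown earlier) at a batch size of 16 give 108 steps per epoch, and four epochs give 432. The sketch below does the arithmetic. It assumes a single device and no gradient accumulation, and `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(num_examples=1728, batch_size=16, epochs=4)
print(steps)
```

This also offers a likely explanation for the `No log` entries in the training-loss column: if `logging_steps` is left at the library default (500, as far as I know), a run of 432 steps ends before the first training loss is logged. Setting a smaller `logging_steps` in `TrainingArguments` should make the column fill in.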

In [None]:
trainer.push_to_hub()

'https://huggingface.co/nguyenhongquy/distilbert-base-uncased-semantic-plausibility/tree/main/'

## Inference

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("nguyenhongquy/distilbert-base-uncased-semantic-plausibility")
model = AutoModelForSequenceClassification.from_pretrained("nguyenhongquy/distilbert-base-uncased-semantic-plausibility")

In [None]:
texts = ['man eats bridge', 'camel rides lake', 'paper kills leaves', 'man knits shirt']

In [None]:
def predict_text_plausibility(text, model):
  print(text)
  inputs = tokenizer(text, return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits
  predicted_class_id = logits.argmax().item()
  # pred = model.config.id2label[predicted_class_id]
  print(predicted_class_id)
  return predicted_class_id

In [None]:
for t in texts:
  predict_text_plausibility(t, model)

man eats bridge
0
camel rides lake
0
paper kills leaves
0
man knits shirt
1
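
The function above returns only the class id. If a confidence value is also wanted, the logits can be turned into probabilities with a softmax, and the id mapped back through `id2label`. The sketch below shows the idea on made-up logits with plain Python. `label_with_confidence` is a hypothetical helper, and the logits are not model outputs.

```python
import math

ID2LABEL = {0: 'IMPLAUSIBLE', 1: 'PLAUSIBLE'}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return the predicted label name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

label, confidence = label_with_confidence([1.2, -0.4])   # made-up logits
print(label, round(confidence, 3))
```

These probabilities would also give a score-based input for `roc_curve` in the evaluation further down, instead of hard 0/1 predictions.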


In [None]:
import pandas as pd


In [None]:
pap_df = pd.read_csv(f'{BASE_DIR}/{TEST_FN}')
pap_df['prediction'] = pap_df['text'].apply(lambda x: predict_text_plausibility(x, model))

interpretation construes title
1
mask sustains axis
0
trader ensures strategy
1
animator comprises trip
0
welfare constructs hundred
0
image depicts glacier
1
generation tightens gunfire
0
head weighs flash
0
collaboration represents time
1
club plays football
1
inquiry executes vaccine
1
eye serves role
1
element satisfies identity
1
attendee disengages norm
1
portion asserts equity
1
traffic overwhelms facility
1
mission builds knowledge
1
assignment breaks optimism
1
band consists kid
0
seaplane rubs fire
0
church attaches importance
1
power chooses road
0
train calls hour
1
scholar demolishes listing
1
elevator includes comrade
0
child integrates sect
1
airport starts operation
1
hook wins role
1
band starts leg
0
study redistributes species
1
team wins challenge
1
man retrieves identification
1
condition evaluates crypt
1
way automates subtype
0
fruit requires sentiment
0
river meanders way
0
resin comprises cricket
0
streetcar sweeps destruction
0
bank develops site
1
replacement

In [None]:
pap_df

Unnamed: 0,text,original_label,label,prediction
0,interpretation construes title,plausible,1,1
1,mask sustains axis,implausible,0,0
2,trader ensures strategy,implausible,1,1
3,animator comprises trip,implausible,1,0
4,welfare constructs hundred,implausible,0,0
...,...,...,...,...
211,malcontent pervades effect,implausible,1,0
212,realism overpowers alignment,implausible,1,1
213,outcome presides part,implausible,0,1
214,ship collides head,plausible,1,0


In [None]:
from sklearn.metrics import precision_score, recall_score, roc_curve, auc
y_eval = pap_df['label']
y_pred = pap_df['prediction']
precision = precision_score(y_eval, y_pred)
recall = recall_score(y_eval, y_pred)
print(f'Precision: {precision:.3f} / Recall: {recall:.3f} / Accuracy: {(y_pred==y_eval).sum()/len(y_pred):.3f}')
# Compute False Positive Rate, True Positive Rate, and AUC score
fpr, tpr, thresholds = roc_curve(y_eval, y_pred)
auc_score = auc(fpr, tpr)
print(f'AUC: {auc_score:.3f}')

Precision: 0.822 / Recall: 0.779 / Accuracy: 0.722
AUC: 0.680
