# QA finetuning
Notebook demonstrating fine-tuning bert model for qa task

## Fine-tuning on coupon data
### 1. dataset preparation

In [1]:
import json, pandas

In [14]:
USED_COLUMNS = ["Text", "View Class Name"]

with open("ds/18929485529_expected.json", "r", encoding='utf-8') as f:
    resps = json.load(f)
    
for x in resps["coupons"]:
    x.pop("discount")
    x.pop("validity")
    
frame = pandas.read_csv("ds/18929485529.csv", encoding='utf-8')
# currently hardcoded
sample_indices = [slice(2, 7), slice(7, 12), slice(49, 54), slice(78, 83), slice(92, 97), slice(97, 102)]

frame = frame[USED_COLUMNS]
contexts = [frame[ind].to_csv() for ind in sample_indices]

QUESTIONS = {
    "old_price": "What was the old, higher price of product?",
    "new_price": "What is the current price of product?",
    "product_name": "How is the product named?"
}

### Converting answers to locations in contexts
Note on dataset:
I have cleared dataset provided in coupon-extraction-demo repo:
* I have removed `FELIX Knabber Mix 12 x 85 g` and `FELIX So gut wie es aussieht in Gelee` products as I believe they are not correctly labeled
* i have changed old price of `FELIX Knabber Mix 200 g` to 2.99 and name to "FELIX Knabber Mix"

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.vocab["[Q1]"] = tokenizer.vocab["[unused128]"]
tokenizer.vocab["[Q2]"] = tokenizer.vocab["[unused129]"]
tokenizer.vocab["[Q3]"] = tokenizer.vocab["[unused130]"]

answers_converted = {k: [] for k in QUESTIONS}
answers = {k: [e[k] for e in resps['coupons']] for k in QUESTIONS}

tokenized = tokenizer(contexts, return_offsets_mapping=True, add_special_tokens=False)

for i, (ctx, tokenized_ctx, ctx_offsets) in enumerate(zip(contexts, tokenized["input_ids"], tokenized["offset_mapping"])):
    decoded_tokens = tokenizer.convert_ids_to_tokens(tokenized_ctx)
    token_offsets = []
    for token, (start, end) in zip(decoded_tokens, ctx_offsets):
        token_offsets.append({"token": token, "start": start, "end": end, "text": ctx[start:end]})

    # Print tokens alongside their positions and text
    """for t in token_offsets:
        print(f"Token: {t['token']}, Start: {t['start']}, End: {t['end']}, Text: '{t['text']}'")"""
        
    for q in answers:
        answer = answers[q][i]
        start_char = ctx.find(answer)
        end_char = start_char + len(answer)
        
        # Locate the corresponding tokens
        start_token_idx = None
        end_token_idx = None
        
        for idx, (start, end) in enumerate(ctx_offsets):
            if start <= start_char < end:
                start_token_idx = idx
            if start < end_char <= end:
                end_token_idx = idx
                break
        
        print(f"Answer: '{answer}'")
        print(f"Character-level Start: {start_char}, End: {end_char}")
        print(f"Token-level Start: {start_token_idx}, End: {end_token_idx}")
        
        answers_converted[q].append([start_token_idx, end_token_idx])

Answer: '14.99'
Character-level Start: 29, End: 34
Token-level Start: 10, End: 12
Answer: '9.99'
Character-level Start: 62, End: 66
Token-level Start: 23, End: 25
Answer: 'JOHNNIE WALKER Red Label Blended Scotch'
Character-level Start: 125, End: 164
Token-level Start: 48, End: 53
Answer: '0.99'
Character-level Start: 29, End: 33
Token-level Start: 10, End: 12
Answer: '0.75'
Character-level Start: 61, End: 65
Token-level Start: 23, End: 25
Answer: 'SAN MIGUEL Especial'
Character-level Start: 125, End: 144
Token-level Start: 48, End: 52
Answer: '2.99'
Character-level Start: 30, End: 34
Token-level Start: 10, End: 12
Answer: '2.79'
Character-level Start: 63, End: 67
Token-level Start: 23, End: 25
Answer: 'FELIX Knabber Mix'
Character-level Start: 128, End: 145
Token-level Start: 48, End: 52
Answer: '8.99'
Character-level Start: 30, End: 34
Token-level Start: 10, End: 12
Answer: '5.85'
Character-level Start: 63, End: 67
Token-level Start: 23, End: 25
Answer: 'CHANTRÉ Weinbrand'
Character-l

### Create JSON dataset and convert it to datasets library object

In [16]:
as_json = [{
   "id": ci * len(QUESTIONS) + qi,
   "title":"example_title",
   "context": ctx,
   "question": QUESTIONS[q_key],
   "answers":{
      "text":[
         answers[q_key][ci]
      ],
      "answer_start":[
         answers_converted[q_key][ci][0]
      ]
   }
} for ci, ctx in enumerate(contexts) for qi, q_key in enumerate(QUESTIONS)]
with open("ds.json", "w", encoding='utf-8') as f:
    for entry in as_json:
        json.dump(entry, f)
        f.write("\n")

from datasets import load_dataset

dataset = load_dataset("json", data_files="ds.json")
dataset

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 18
    })
})

In [17]:
def preprocess_function(examples):
    questions = examples["question"]
    contexts = examples["context"]
    answers = examples["answers"]
    
    inputs = tokenizer(
        questions,
        contexts,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )
    
    start_positions = []
    end_positions = []
    
    for i, answer in enumerate(answers):
        start_char = answer['answer_start'][0]
        end_char = start_char + len(answer['text'][0])
        
        # Map start and end character positions to token indices
        start_positions.append(inputs.char_to_token(i, start_char))
        end_positions.append(inputs.char_to_token(i, end_char - 1))
        
        # Handling edge cases where the tokenizer may not capture the exact indices
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length - 1
    
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    
    return inputs

tokenized_dataset = dataset['train'].map(preprocess_function, batched=True)


Map:   0%|          | 0/18 [00:00<?, ? examples/s]

Lets test basic bert on our problem

In [18]:
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="bert-base-uncased", tokenizer="bert-base-uncased")
result = qa_pipeline({"context": contexts[5], "question": QUESTIONS["old_price"]})
print(result)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.00034702528500929475, 'start': 96, 'end': 118, 'answer': 'gespart,android.widget'}


As we see it is performing poorly
### Fine-Tuning
You may skip this cell and just download fine-tuned model from my HuggingFace profile below

In [21]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [22]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    "bert-uncased-finetuned-csv-qa",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=50,
    weight_decay=0.02,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)

trainer.train()


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,No log,5.441133
2,No log,4.568306
3,No log,3.901905
4,No log,3.364568
5,No log,2.809478
6,No log,2.315282
7,No log,1.945634
8,No log,1.690385
9,No log,1.463345
10,No log,1.216905


TrainOutput(global_step=100, training_loss=1.0835047149658203, metrics={'train_runtime': 1391.4307, 'train_samples_per_second': 0.647, 'train_steps_per_second': 0.072, 'total_flos': 235167081062400.0, 'train_loss': 1.0835047149658203, 'epoch': 50.0})

In [23]:
trainer.push_to_hub()

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.30k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/SzymonKozl/bert-uncased-finetuned-csv-qa/commit/c76e76790667977ab28482f9ee31c21877e7c188', commit_message='End of training', commit_description='', oid='c76e76790667977ab28482f9ee31c21877e7c188', pr_url=None, repo_url=RepoUrl('https://huggingface.co/SzymonKozl/bert-uncased-finetuned-csv-qa', endpoint='https://huggingface.co', repo_type='model', repo_id='SzymonKozl/bert-uncased-finetuned-csv-qa'), pr_revision=None, pr_num=None)

In [26]:
qa_pipeline = pipeline("question-answering", model="SzymonKozl/bert-uncased-finetuned-csv-qa", tokenizer="SzymonKozl/bert-uncased-finetuned-csv-qa")
result = qa_pipeline(**{"context": contexts[0], "question": QUESTIONS["product_name"]})
print(result)

{'score': 0.0004328570794314146, 'start': 125, 'end': 143, 'answer': 'JOHNNIE WALKER Red'}


As we see the results are not perfect but there is some improvement

### Conclusions
It is hard to draw a conclusions from fine tuning on such small dataset. However here are several observations that might help in further work:
* almost identical prompts do not work. Following example resulted in model answering the same answer to each question:
```py
QUESTIONS = {
    "old_price": "What is the old price of product?",
    "new_price": "What is the new price of product?",
    "product_name": "What is the name of the discounted product?"
}
```
* fine tuning QA for CSV blocks with no dropped columns results in nan loss and prevents training at all
* At this moment evaluation results are not looking promising - but we will need to check on real datasets - purpose of this notebook was to show possibility to treat our problem as QA task