# **1. Idea and Use Case**


I'm building a domain-specific Question Answering (QA) system focused on food allergens across global cuisines. The idea is to fine-tune a pretrained language model (like Falcon-RW-1B) using a custom dataset of Q&A pairs that help users identify potential allergens in specific regional or national dishes.

The inspiration behind this project came from my mother, who has a gluten allergy. Dining out was often challenging for her, especially when faced with dishes that had unfamiliar names and gave no hint about their ingredients. I realized that instead of relying on how well a waiter understood what gluten is, a more reliable and accessible solution would be an intelligent system that could instantly provide allergen information for various cuisines.

# **2. Environment Setup**
For this project, the following tools and libraries were used for data processing, model fine-tuning, evaluation, and Hugging Face integration. While the actual installations and imports were performed across different cells as needed, below is a consolidated view of all dependencies used in the project.



## Required Libraries:

!pip install transformers datasets accelerate peft evaluate trl bitsandbytes fuzzywuzzy[speedup]

## Import Statements:

**from datasets import Dataset, load_dataset, DatasetDict**

**from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments**

**from fuzzywuzzy import fuzz**

**from peft import LoraConfig, get_peft_model, TaskType**

**from trl import SFTTrainer**

**import requests**

**import json**

**import torch**

# **3. Synthetic Data Generation**


To create a domain-specific dataset on global cuisine and allergen information, I used prompt-based generation with large language models such as ChatGPT and Claude. The goal was to produce high-quality question–answer (QA) pairs, where each question asks about the allergens present in a specific dish and the answer provides a clear allergen breakdown.

Methodology:
1. Schema:
Each entry followed a simple format:
* Question: *What allergens are present in [dish name]?*
* Answer: A concise list of common allergens such as *gluten, dairy, nuts, etc.*, based on typical ingredients.
2. Prompting: Manually crafted prompts were used with ChatGPT and Claude to maintain uniqueness and consistency.
3. Quality Control: The generated responses were reviewed to make sure that they adhered to allergen categories.
4. Data Format: The resulting dataset was saved in a **JSON** format.
5. Upload: The finalized dataset was uploaded to Hugging Face as a **public** dataset.
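
The schema above can be checked mechanically before upload. The following is a minimal sketch, assuming each record is a dict with `question` and `answer` string fields (as in the dataset preview later in this notebook). The helper name `validate_qa_records` and the sample records are hypothetical, not part of the original pipeline.

```python
import json


def validate_qa_records(records):
    """Return a list of (index, problem) tuples for records that break the schema.

    Hypothetical helper: checks only that each record is a dict with
    non-empty string "question" and "answer" fields, and that no
    question appears twice.
    """
    problems = []
    seen_questions = set()
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((i, "record is not an object"))
            continue
        for field in ("question", "answer"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{field}'"))
        question = record.get("question")
        if isinstance(question, str):
            key = question.strip().lower()
            if key in seen_questions:
                problems.append((i, "duplicate question"))
            seen_questions.add(key)
    return problems


# Illustrative records only; the real dataset lives in a JSON file.
sample = json.loads("""
[
  {"question": "What allergens are present in Margherita pizza?",
   "answer": "Contains gluten (wheat) and dairy (mozzarella cheese)."},
  {"question": "What allergens are present in Margherita pizza?",
   "answer": ""}
]
""")
print(validate_qa_records(sample))
```

An empty result means every record passed these basic checks; it does not verify that the allergen information itself is correct, which still needs manual review.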

# **4. Model Fine-tuning & Evaluation**
Note: Each model's accuracy was computed right after that model was loaded.

### Loading the dataset from Hugging Face:

In [None]:
!pip install -U datasets numpy==2.0.2



In [None]:
!pip install transformers



In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

In [None]:
import json
import requests
from datasets import Dataset

# Download the raw JSON file from the Hugging Face dataset repo
url = "https://huggingface.co/datasets/YMTEA/cuisineqa/resolve/main/dataset_allergy.json"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on a bad HTTP status
data = response.json()  # A list of {"question": ..., "answer": ...} dicts

# Convert to a Hugging Face Dataset
dataset = Dataset.from_list(data)

# Preview the first record
dataset[0]



{'question': 'What allergens are present in Margherita pizza?',
 'answer': 'Contains gluten (wheat), dairy (mozzarella cheese), and yeast; may also cause reactions in individuals sensitive to tomatoes. Cross-contact with eggs, nuts, soy, or shellfish is possible in shared kitchens.'}

### Loading the pre-trained model (without fine-tuning):

In [None]:
!pip install accelerate --quiet

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [None]:
model_name = "tiiuae/falcon-rw-1b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16
)



### Functions for checking accuracy:

In [None]:
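def get_answer(question):
    # Same "Question/Answer" template that is used for the training text
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id warning
        )
    # Decode only the newly generated tokens, so the prompt is never part of the answer
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()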

In [None]:
!pip install fuzzywuzzy[speedup]



In [None]:
from fuzzywuzzy import fuzz

def fuzzy_match(prediction, ground_truth, threshold=80):
    score = fuzz.ratio(prediction.strip().lower(), ground_truth.strip().lower())
    return score >= threshold, score
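
To make the metric concrete: `fuzz.ratio` returns an integer from 0 to 100 based on character-level sequence similarity. The sketch below approximates it with the standard library's `difflib.SequenceMatcher`, so it runs without `fuzzywuzzy`. The function name `approx_ratio` is hypothetical, and its scores can differ slightly from `fuzz.ratio`, particularly when the `[speedup]` extra is installed.

```python
from difflib import SequenceMatcher


def approx_ratio(prediction, ground_truth):
    """Approximate a 0-100 similarity score with difflib (stand-in for fuzz.ratio)."""
    a = prediction.strip().lower()
    b = ground_truth.strip().lower()
    return round(100 * SequenceMatcher(None, a, b).ratio())


truth = "Contains gluten (tortillas), dairy (cheese, sour cream), and sometimes soy (in sauces)."
reordered = "Contains dairy (cheese, sour cream), gluten (flour tortillas), and sometimes soy (in sauces)."

print(approx_ratio(truth, truth))      # identical strings score 100
print(approx_ratio(reordered, truth))  # same allergens, different order: lower score
```

Because the comparison is on characters, an answer that names the right allergens in a different order is still penalized, which is worth keeping in mind when reading the accuracy numbers.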

### Checking accuracy of the pre-trained model (before fine-tuning):

In [None]:
correct = 0
total = len(dataset)
scores = []

for i in range(total):
    question = dataset[i]["question"]
    true_answer = dataset[i]["answer"]
    pred_answer = get_answer(question)

    matched, score = fuzzy_match(pred_answer, true_answer)
    scores.append(score)
    if matched:
        correct += 1

    print(f"Q: {question}")
    print(f"True: {true_answer}")
    print(f"Pred: {pred_answer}")
    print(f"Similarity: {score} | Match: {matched}")
    print("---")

accuracy = correct / total
avg_score = sum(scores) / total
print(f"Fuzzy Accuracy (≥80% match): {accuracy:.2%}")
print(f"Average Similarity Score: {avg_score:.2f}")


### Formatting dataset:

In [None]:
!pip install transformers datasets peft trl accelerate bitsandbytes --quiet

In [None]:
from datasets import Dataset

#Formatting the dataset into training prompts
train_data = []

for item in dataset:
    q = item["question"]
    a = item["answer"]
    prompt = f"Question: {q}\nAnswer: {a}"
    train_data.append({"text": prompt})

hf_dataset = Dataset.from_list(train_data)
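
Training and inference both rely on the same `Question: ...\nAnswer:` template (here and in `get_answer`). One way to keep the two in sync is a single helper. The sketch below is an assumption about how that could look; `build_prompt` is a hypothetical name, not a function from the notebook.

```python
def build_prompt(question, answer=None):
    """Format a QA pair with the notebook's "Question/Answer" template.

    With answer=None the prompt ends at "Answer:" for generation;
    otherwise the answer is appended to form a training example.
    """
    prompt = f"Question: {question.strip()}\nAnswer:"
    if answer is not None:
        prompt += f" {answer.strip()}"
    return prompt


# Training text and inference prompt share the same prefix.
train_text = build_prompt("What allergens are present in Enchiladas?",
                          "Contains gluten (tortillas) and dairy (cheese).")
infer_prompt = build_prompt("What allergens are present in Enchiladas?")
print(train_text.startswith(infer_prompt))
```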

Tokenizing function:

In [None]:
from transformers import AutoTokenizer

model_name = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

def tokenize_function(example):
    encoding = tokenizer(
        example["text"],
        truncation=True,
        max_length=256,
        padding="max_length",
        return_tensors=None,
    )
    encoding["labels"] = encoding["input_ids"].copy()  # Set labels = input_ids
    return encoding

In [None]:
tokenizer.pad_token = tokenizer.eos_token

# Tokenize
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True)



In [None]:
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
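
Because `tokenize_function` pads every example to 256 tokens and copies `input_ids` into `labels`, the padded positions also carry labels. A common alternative is to set those positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper, and whether the trainer's data collator already handles this depends on the TRL version, so treat it as an assumption to verify.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX.

    Works on a batch: both arguments are lists of equal-length lists,
    matching what the tokenizer returns when map(..., batched=True) is used.
    """
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: token id 2 stands in for the EOS/pad token.
batch_ids = [[11, 12, 13, 2, 2], [21, 22, 2, 2, 2]]
batch_mask = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
print(mask_padding_labels(batch_ids, batch_mask))
```

Note that with `pad_token = eos_token`, masking by attention mask also hides any real end-of-sequence token that falls in the padded region, so an explicit EOS would need to be appended to the text before padding if the model should learn where answers end.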


### Fine-tuning using LoRA:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model_name = "tiiuae/falcon-rw-1b"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    lora_dropout=0.01,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
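
For intuition on why LoRA is lightweight: for a linear layer with `d_in` inputs and `d_out` outputs, LoRA with rank `r` adds two small matrices holding `r * (d_in + d_out)` trainable weights in total. The sketch below computes this for one layer. The layer dimensions in the example are assumed for illustration and should be checked against the model's config.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable weights LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Weights in the original dense layer (bias ignored)."""
    return d_in * d_out


# Assumed example dimensions, not read from the model.
d_in, d_out, r = 2048, 6144, 16
added = lora_param_count(d_in, d_out, r)
print(added, f"{added / full_param_count(d_in, d_out):.2%} of the dense layer")
```

On the actual model, calling `model.print_trainable_parameters()` after `get_peft_model` reports the real trainable and total parameter counts.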



### Training the model using the synthetically generated dataset:

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

#Defining training arguments
training_args = TrainingArguments(
    output_dir="./lora-falcon-qa",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=1,
    save_strategy="no",
    report_to="none"
)

#Defining a formatting function
def formatting_func(example):
    return example["text"]

#Creating the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    formatting_func=formatting_func
)






In [None]:
#Start training
trainer.train()
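
As a quick sanity check on the run length: with the 180 examples, `per_device_train_batch_size=2`, and `num_train_epochs=3` set above, the number of optimizer steps can be estimated as below. This assumes a single device and the default `gradient_accumulation_steps` of 1; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, epochs, grad_accum=1, num_devices=1):
    """Estimate optimizer steps for a run (a final partial batch counts as a step)."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


print(total_training_steps(num_examples=180, batch_size=2, epochs=3))
```

With `logging_steps=1`, the training log should then contain roughly one loss entry per estimated step.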

### Saving the model after fine-tuning:

In [None]:
trainer.model.save_pretrained("fine-tuned-falcon-cuisineqa")
tokenizer.save_pretrained("fine-tuned-falcon-cuisineqa")

### Loading and checking accuracy after fine-tuning:

In [None]:
model_path = "fine-tuned-falcon-cuisineqa"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()
model.to("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
correct = 0
total = len(dataset)
scores = []

for i in range(total):
    question = dataset[i]["question"]
    true_answer = dataset[i]["answer"]
    pred_answer = get_answer(question)

    matched, score = fuzzy_match(pred_answer, true_answer)
    scores.append(score)
    if matched:
        correct += 1

    print(f"Q: {question}")
    print(f"True: {true_answer}")
    print(f"Pred: {pred_answer}")
    print(f"Similarity: {score:.2f} | Match: {matched}")
    print("---")

accuracy = correct / total
avg_score = sum(scores) / total
print(f"Fuzzy Accuracy (≥80% match): {accuracy:.2%}")
print(f"Average Similarity Score: {avg_score:.2f}")

# **5. Analysis of Accuracy Checks**

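To evaluate the impact of fine-tuning on synthetic data, I compared the **base model (`tiiuae/falcon-rw-1b`)** with the **fine-tuned model** on the allergen-related QA pairs. Both evaluation loops iterate over the same `dataset` that was used for fine-tuning, so the scores measure how closely each model reproduces the reference answers rather than performance on unseen dishes.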

#### Evaluation Metrics
The following metrics were used:
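- **Fuzzy Accuracy (≥80% match):** The share of questions whose predicted answer scores at least 80 on `fuzz.ratio` against the ground-truth answer.
- **Average Similarity Score:** The mean `fuzz.ratio` score (0 to 100) across all questions. `fuzz.ratio` compares the two lowercased strings character by character, so it is sensitive to wording and ordering, not only to which allergens are named.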
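
Both metrics follow directly from the list of per-question similarity scores collected in the evaluation loops. A minimal sketch, with a hypothetical helper name `summarize_scores` and made-up scores:

```python
def summarize_scores(scores, threshold=80):
    """Return (fuzzy_accuracy, average_similarity) for a list of 0-100 scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    matches = sum(1 for s in scores if s >= threshold)
    return matches / len(scores), sum(scores) / len(scores)


# Made-up scores for illustration, not results from the notebook.
accuracy, average = summarize_scores([35, 74, 80, 91])
print(f"Fuzzy Accuracy: {accuracy:.2%} | Average Similarity: {average:.2f}")
```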

#### Results

| Model             | Fuzzy Accuracy (≥80%) | Avg. Similarity Score |
|------------------|------------------------|------------------------|
| Base Model        | 0.00%                  | 35.53                 |
| Fine-Tuned Model  | 1.67%                  | 55.58                 |

#### Analysis
- The base model lacked domain-specific knowledge, especially in allergen-related contexts.
- Fine-tuning with 180 synthetic examples moved the model's answers much closer to the reference answers: the average similarity rose from 35.53 to 55.58.
- Fuzzy accuracy improved only modestly, from 0.00% to 1.67% (3 of 180 answers). Even minimal domain-specific fine-tuning can nudge performance forward, but most answers still fall below the 80-point threshold.
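
Since both evaluation loops score the models on the same `dataset` that was used for training, a held-out split would give a clearer picture of generalization. The sketch below is one way to do that with the standard library; `split_records`, the 80/20 ratio, and the seed are assumptions, not part of the original experiment.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the 180 QA pairs.
records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(180)]
train, test = split_records(records)
print(len(train), len(test))
```

With a `datasets.Dataset`, `dataset.train_test_split(test_size=0.2, seed=42)` provides the same kind of split.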




# **6. Final Thoughts and Project Analysis**

#### Project Summary

- **Use Case Chosen:**  
  I focused on a real-world problem inspired by my mother's gluten allergy. I aimed to build a system that could answer allergen-related questions for dishes whose names do not reveal what they contain.

- **Synthetic Dataset Generation:**  
  I prompted ChatGPT and Claude multiple times to generate close to 180 unique QA pairs, each covering a dish and its potential allergens.

- **Model and Fine-Tuning:**  
  I fine-tuned the `tiiuae/falcon-rw-1b` model using LoRA-based parameter-efficient fine-tuning. The Hugging Face Transformers, Datasets, PEFT, and TRL libraries were used.

- **Evaluation Approach:**  
  I compared the base model and the fine-tuned model using two metrics:  
  - Fuzzy Accuracy (≥80% match)  
  - Average Similarity Score

#### Analysis of Results

- **Improvement Observed:**
  - **Fuzzy Accuracy** improved from **0.00% -> 1.67%**
  - **Average Similarity** increased from **35.53 -> 55.58**
  - These gains indicate that a small set of high-quality synthetic examples can positively impact model performance.

- **Example-Based Insights:**
  - The base model did not single out the allergens specific to a dish and instead gave a generic answer, such as a list of all common allergens.
  - For example,
     - Q: What allergens are present in Enchiladas?
     - True: Contains gluten (tortillas), dairy (cheese, sour cream), and sometimes soy (in sauces).
     - Model's Prediction: Enchiladas are made with a variety of ingredients. The most common allergens are:
         - Wheat
         - Soy
         - Milk
         - Eggs
         - Fish
         - Shellfish
         - Tree nuts
         - Peanuts
         - Sesame
     - Similarity: 35 | Match: False
  - The ***fine-tuned*** model, on the other hand, was more likely to mention domain-specific allergens like *gluten, nuts,* and *dairy*.
  - For example,
    - Q: What allergens are present in Enchiladas?
    - True: Contains gluten (tortillas), dairy (cheese, sour cream), and sometimes soy (in sauces).
    - Model's Prediction: Contains dairy (cheese, sour cream), gluten (flour tortillas), and sometimes soy (in sauces).
    - Similarity: 74.00 | Match: False
  - It also started producing more structured and complete responses.

- **Limitations:**
  - The dataset was relatively small (180 samples), which limited generalization.
  - Some hallucinations still occurred.

#### Ideas for Improvement

- **Dataset Enhancements:**
  - Increasing the volume by augmenting with similar dish variants.

- **Fine-Tuning Improvements:**
  - Training for more epochs, as far as the available hardware allows.

- **Resource-Based Improvements:**
  - Using larger base models if compute permits.

#### Learning Outcomes

- **Synthetic Data Insights:**
  - Synthetic data gave a measurable improvement on this domain-specific task with only 180 examples, which is encouraging for low-resource settings.

- **LLM Fine-Tuning Learnings:**
  - LoRA is an efficient way to fine-tune large models.
  - Consistent tokenization and prompt formatting between training and inference matter for the final accuracy check.
  - Some base model predictions were completely irrelevant, emphasizing the value of fine-tuning.

---

This project showed me the power of targeted synthetic data and lightweight fine-tuning to solve real-world problems, especially in under-represented domains like allergen awareness across global cuisines.


