# Fine-Tuned Q&A System Using SQuAD Dataset for Enhanced Answer Extraction

# Index:
0. **Introduction**  

1. **Library Imports**  

2. **Load Model and dataset**  

3. **Exploratory Data Analysis (EDA)**  

4. **Data Preprocessing**  

5. **Model Training**  

6. **Model Evaluation**

# 0. Introduction
### Overview
This notebook builds an extractive Question & Answer (Q&A) system. We fine-tune the base DistilBERT model locally on the SQuAD dataset and compare it with a DistilBERT checkpoint that Hugging Face has already fine-tuned on SQuAD. The aim is to see how closely the locally fine-tuned model matches the published checkpoint in answer extraction accuracy and confidence.

### What is DistilBERT Model?
DistilBERT is a smaller, faster, and more efficient version of the BERT (Bidirectional Encoder Representations from Transformers) model. It was developed by Hugging Face with the goal of retaining most of BERT’s language understanding capabilities while reducing the model’s size and computational requirements.

### What's the difference between the two models?
1.	distilBertModel: This is a base DistilBERT model that I fine-tuned locally using the SQuAD dataset. Its performance depends on how well the fine-tuning process was done.

2.	preFineTuningDistilBertModel: This is a model that has already been fine-tuned by Hugging Face using the SQuAD dataset. It is optimized for question answering tasks and ready to use without additional training.


# 1. Library Imports

In [2]:
import json

from datasets import load_dataset
from transformers import (
    DistilBertForQuestionAnswering,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
    pipeline,
)

# 2. Load Model and dataset

In [3]:
model_name = "distilbert-base-uncased"
distilBertModel = DistilBertForQuestionAnswering.from_pretrained(model_name)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

squad_dataset = load_dataset("squad")


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 3. Exploratory Data Analysis (EDA)

In [4]:
print(squad_dataset)

formatted_output = json.dumps(squad_dataset['train'][:2], indent=4)
print(formatted_output)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
{
    "id": [
        "5733be284776f41900661182",
        "5733be284776f4190066117f"
    ],
    "title": [
        "University_of_Notre_Dame",
        "University_of_Notre_Dame"
    ],
    "context": [
        "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Be

## 4. Data Preprocessing
We define a preprocessing function that prepares both the training and validation splits of SQuAD. It tokenizes each question-context pair, splits contexts longer than `max_length` tokens into overlapping windows (with `stride` tokens of overlap), and, for training, converts each answer's character span into start and end token positions. Windows that do not fully contain the answer are labeled with position 0 (the `[CLS]` token). For validation, it keeps the example ID and the context-only offset mapping so that predictions can be mapped back to the original text.
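
To make the role of `max_length` and `stride` concrete, here is a minimal sketch in plain Python of how a long context could be cut into overlapping windows. It assumes the layout `[CLS] question [SEP] context [SEP]` (three special tokens) and that consecutive windows share `stride` context tokens. The helper name `context_windows` is hypothetical, and the real tokenizer may differ in edge cases.

```python
def context_windows(n_context, n_question, max_length=384, stride=128, n_special=3):
    """Return (start, end) context-token ranges covered by each feature.

    Assumption: every feature holds the full question, `n_special` special
    tokens, and as many context tokens as still fit; consecutive windows
    overlap by `stride` context tokens.
    """
    capacity = max_length - n_question - n_special
    if capacity <= stride:
        raise ValueError("question too long for this max_length/stride")

    windows = []
    start = 0
    while True:
        end = min(start + capacity, n_context)
        windows.append((start, end))
        if end == n_context:
            break
        # Step forward while keeping `stride` tokens of overlap
        start += capacity - stride
    return windows


# Hypothetical token counts, chosen only for illustration
short = context_windows(n_context=150, n_question=12)
long = context_windows(n_context=500, n_question=12)
print(short)
print(long)
```

A context that fits in one window yields a single feature, while a longer one yields several. This is why the preprocessed datasets end up with more rows than the raw splits.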

In [10]:
max_length = 384
stride = 128

def preprocess_examples(examples, is_training=True):
    questions = [q.strip() for q in examples["question"]]
    
    # Tokenize the context and questions
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    
    # Keep the character offsets and the feature-to-example mapping
    offset_mapping = inputs["offset_mapping"]
    sample_map = inputs.pop("overflow_to_sample_mapping")
    
    if is_training:
        answers = examples["answers"]
        start_positions = []
        end_positions = []

        for i, offset in enumerate(offset_mapping):
            sample_idx = sample_map[i]
            answer = answers[sample_idx]
            start_char = answer["answer_start"][0]
            end_char = answer["answer_start"][0] + len(answer["text"][0])
            sequence_ids = inputs.sequence_ids(i)

            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:   # Skip tokens that are not in the context
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:   # Find where the context ends
                idx += 1
            context_end = idx - 1

            # If the answer is not fully inside this window's context, label it (0, 0), i.e. the [CLS] token
            if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Calculate the start and end token positions for the answer
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)

        # Add start and end positions to the inputs
        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
    else:
        # For validation, we only save the example ID and offset mapping
        example_ids = []
        for i in range(len(inputs["input_ids"])):
            sample_idx = sample_map[i]
            example_ids.append(examples["id"][sample_idx])

            sequence_ids = inputs.sequence_ids(i)
            offset = offset_mapping[i]
            inputs["offset_mapping"][i] = [
                o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
            ]

        inputs["example_id"] = example_ids

    return inputs

# Apply preprocessing to the training dataset
train_dataset = squad_dataset["train"].map(
    lambda examples: preprocess_examples(examples, is_training=True),
    batched=True,
    remove_columns=squad_dataset["train"].column_names,
)

# Apply preprocessing to the validation dataset
validation_dataset = squad_dataset["validation"].map(
    lambda examples: preprocess_examples(examples, is_training=False),
    batched=True,
    remove_columns=squad_dataset["validation"].column_names,
)

# Check the size of the resulting datasets
print(len(squad_dataset["train"]), len(train_dataset))
print(len(squad_dataset["validation"]), len(validation_dataset))


87599 88524
10570 10784
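
The label-assignment step can be checked on a toy example. The sketch below reuses the same two `while` loops as `preprocess_examples`, but on hand-written offsets instead of real tokenizer output. The helper name `char_span_to_token_span`, the toy sentence, and the whitespace "tokenization" are all assumptions made for illustration.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token).

    Returns (0, 0) when the span is not fully inside the tokens
    context_start..context_end (inclusive), mirroring the training labels.
    """
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk right until a token starts after the answer start
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk left until a token ends before the answer end
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "born in 1903 near Kitty Hawk"
# Token 0 stands in for [CLS]; tokens 1..6 are the whitespace-split words
# with their (start_char, end_char) offsets into `context`.
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (13, 17), (18, 23), (24, 28)]

answer = "Kitty Hawk"
start_char = context.index(answer)
end_char = start_char + len(answer)

kitty = char_span_to_token_span(offsets, 1, 6, start_char, end_char)
year = char_span_to_token_span(offsets, 1, 6, 8, 12)
# A window that ends at "near" does not contain the answer
outside = char_span_to_token_span(offsets, 1, 4, start_char, end_char)
print(kitty, year, outside)
```

The last call shows the case that arises with overlapping windows: the answer lies outside the window, so the feature is labeled `(0, 0)`.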


## 5. Model Training

### Explanation of the Training Parameters:

1. **`evaluation_strategy="epoch"`**:
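   - **Explanation**: This sets how often evaluation runs. Here the `Trainer` evaluates on `eval_dataset` at the end of each training epoch. (Recent `transformers` releases name this argument `eval_strategy`.)
   - **Why it's set**: Evaluating after each epoch lets you monitor how well the model generalizes after every complete pass through the training data. Note that the validation features built in section 4 carry no `start_positions`/`end_positions`, so the per-epoch evaluation entries in the training log contain only runtime statistics, not a validation loss.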

2. **`learning_rate=3e-5`**:
   - **Explanation**: The learning rate controls how much to update the model's weights in response to the gradients during training. A smaller learning rate makes smaller adjustments to the weights, while a larger one makes bigger updates.
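   - **Why it's set**: A learning rate of `3e-5` is a common choice for fine-tuning pre-trained models like BERT or DistilBERT: small enough for stable convergence, yet large enough to make meaningful progress in weight updates. The `Trainer` decays it linearly toward zero by default, which is why the logged `learning_rate` values shrink throughout training.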

3. **`per_device_train_batch_size=16`**:
   - **Explanation**: This specifies how many training samples are processed in one forward and backward pass per device (e.g., GPU or CPU). A larger batch size increases memory usage but can result in faster training due to better GPU utilization.
   - **Why it's set**: A batch size of 16 strikes a balance between memory constraints (depending on my device’s RAM/VRAM) and ensuring enough data is processed in each iteration for stable gradient updates.

4. **`per_device_eval_batch_size=16`**:
   - **Explanation**: This is the batch size used per device when running evaluation on the validation dataset.
   - **Why it's set**: Evaluation needs no gradients, so it uses less memory per sample than training. The evaluation batch size is therefore typically kept the same as, or somewhat larger than, the training batch size if memory allows. It affects evaluation speed only, not the results.

5. **`num_train_epochs=3`**:
   - **Explanation**: The number of epochs defines how many complete passes the model makes over the training dataset.
   - **Why it's set**: 3 epochs are often enough when fine-tuning a pre-trained model like DistilBERT on SQuAD. Keeping the count low limits overfitting; adjust it based on the dataset size and validation performance.

6. **`weight_decay=0.01`**:
   - **Explanation**: Weight decay is a form of regularization that helps to prevent overfitting by discouraging large weights. With the `Trainer`'s default AdamW optimizer, it shrinks the weights slightly at every update step rather than adding a penalty term to the loss.
   - **Why it's set**: A small weight decay (like `0.01`) is commonly used to improve generalization by discouraging overly large weights, which can lead to overfitting, especially in fine-tuning tasks.
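
The batch size and epoch count determine the number of optimizer steps. As a quick arithmetic check, assuming a single device and no gradient accumulation, the feature counts printed in section 4 (88,524 training and 10,784 validation features) give the totals that appear in the training and evaluation progress bars:

```python
import math

num_train_features = 88524   # rows in train_dataset after preprocessing
num_eval_features = 10784    # rows in validation_dataset after preprocessing
train_batch_size = 16
eval_batch_size = 16
num_epochs = 3

# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(num_train_features / train_batch_size)
total_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(num_eval_features / eval_batch_size)

print(steps_per_epoch, total_steps, eval_steps)  # 5533 16599 674
```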

In [13]:
training_args = TrainingArguments(
    output_dir="./results",              # Directory to save the model output
    evaluation_strategy="epoch",         # Evaluate the model at the end of each epoch
    learning_rate=3e-5,                  # Learning rate for the optimizer
    per_device_train_batch_size=16,      # Batch size for training on each device (e.g., GPU/CPU)
    per_device_eval_batch_size=16,       # Batch size for evaluation on each device (e.g., GPU/CPU)
    num_train_epochs=3,                  # Number of training epochs
    weight_decay=0.01,                   # Weight decay for regularization to prevent overfitting
)

trainer = Trainer(
    model = distilBertModel,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = validation_dataset,
)

trainer.train()



  0%|          | 0/16599 [00:00<?, ?it/s]

{'loss': 2.5681, 'grad_norm': 18.244218826293945, 'learning_rate': 2.9096331104283392e-05, 'epoch': 0.09}
{'loss': 1.6484, 'grad_norm': 15.80468463897705, 'learning_rate': 2.819266220856678e-05, 'epoch': 0.18}
{'loss': 1.5156, 'grad_norm': 13.678197860717773, 'learning_rate': 2.7288993312850172e-05, 'epoch': 0.27}
{'loss': 1.4489, 'grad_norm': 11.911229133605957, 'learning_rate': 2.6385324417133564e-05, 'epoch': 0.36}
{'loss': 1.4034, 'grad_norm': 14.737772941589355, 'learning_rate': 2.5481655521416952e-05, 'epoch': 0.45}
{'loss': 1.2843, 'grad_norm': 12.159255981445312, 'learning_rate': 2.4577986625700344e-05, 'epoch': 0.54}
{'loss': 1.2776, 'grad_norm': 21.655744552612305, 'learning_rate': 2.3674317729983735e-05, 'epoch': 0.63}
{'loss': 1.2746, 'grad_norm': 14.408231735229492, 'learning_rate': 2.2770648834267124e-05, 'epoch': 0.72}
{'loss': 1.2048, 'grad_norm': 12.982665061950684, 'learning_rate': 2.1866979938550515e-05, 'epoch': 0.81}
{'loss': 1.1996, 'grad_norm': 14.543550491333008

  0%|          | 0/674 [00:00<?, ?it/s]

{'eval_runtime': 145.327, 'eval_samples_per_second': 74.205, 'eval_steps_per_second': 4.638, 'epoch': 1.0}
{'loss': 0.9126, 'grad_norm': 9.365504264831543, 'learning_rate': 1.9155973251400687e-05, 'epoch': 1.08}
{'loss': 0.8967, 'grad_norm': 10.738567352294922, 'learning_rate': 1.825230435568408e-05, 'epoch': 1.17}
{'loss': 0.8716, 'grad_norm': 13.488021850585938, 'learning_rate': 1.7348635459967467e-05, 'epoch': 1.27}
{'loss': 0.9189, 'grad_norm': 15.60020637512207, 'learning_rate': 1.644496656425086e-05, 'epoch': 1.36}
{'loss': 0.9045, 'grad_norm': 13.920702934265137, 'learning_rate': 1.554129766853425e-05, 'epoch': 1.45}
{'loss': 0.8807, 'grad_norm': 14.941184043884277, 'learning_rate': 1.463762877281764e-05, 'epoch': 1.54}
{'loss': 0.8802, 'grad_norm': 12.228151321411133, 'learning_rate': 1.373395987710103e-05, 'epoch': 1.63}
{'loss': 0.891, 'grad_norm': 15.575723648071289, 'learning_rate': 1.283029098138442e-05, 'epoch': 1.72}
{'loss': 0.8685, 'grad_norm': 16.743860244750977, 'lea

  0%|          | 0/674 [00:00<?, ?it/s]

{'eval_runtime': 142.6511, 'eval_samples_per_second': 75.597, 'eval_steps_per_second': 4.725, 'epoch': 2.0}
{'loss': 0.6719, 'grad_norm': 12.660990715026855, 'learning_rate': 9.215615398517983e-06, 'epoch': 2.08}
{'loss': 0.6537, 'grad_norm': 14.955159187316895, 'learning_rate': 8.311946502801373e-06, 'epoch': 2.17}
{'loss': 0.6489, 'grad_norm': 16.076448440551758, 'learning_rate': 7.408277607084765e-06, 'epoch': 2.26}
{'loss': 0.652, 'grad_norm': 13.49941635131836, 'learning_rate': 6.504608711368155e-06, 'epoch': 2.35}
{'loss': 0.6389, 'grad_norm': 14.982282638549805, 'learning_rate': 5.6009398156515455e-06, 'epoch': 2.44}
{'loss': 0.636, 'grad_norm': 32.58141326904297, 'learning_rate': 4.697270919934936e-06, 'epoch': 2.53}
{'loss': 0.6277, 'grad_norm': 19.506061553955078, 'learning_rate': 3.7936020242183263e-06, 'epoch': 2.62}
{'loss': 0.6323, 'grad_norm': 8.488372802734375, 'learning_rate': 2.889933128501717e-06, 'epoch': 2.71}
{'loss': 0.6493, 'grad_norm': 20.377105712890625, 'lear

  0%|          | 0/674 [00:00<?, ?it/s]

{'eval_runtime': 145.6394, 'eval_samples_per_second': 74.046, 'eval_steps_per_second': 4.628, 'epoch': 3.0}
{'train_runtime': 11789.8511, 'train_samples_per_second': 22.525, 'train_steps_per_second': 1.408, 'train_loss': 0.9926093652080474, 'epoch': 3.0}


TrainOutput(global_step=16599, training_loss=0.9926093652080474, metrics={'train_runtime': 11789.8511, 'train_samples_per_second': 22.525, 'train_steps_per_second': 1.408, 'total_flos': 2.602335381127373e+16, 'train_loss': 0.9926093652080474, 'epoch': 3.0})
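
The `learning_rate` values in the log above are consistent with a linear decay from `3e-5` to zero over the 16,599 training steps with no warmup. The sketch below is a plain-Python re-implementation of such a schedule for checking the numbers, not the library's own code. The function name `linear_decay_lr` is hypothetical, and it assumes one log entry every 500 steps.

```python
def linear_decay_lr(step, total_steps, base_lr=3e-5, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 16599
lr_500 = linear_decay_lr(500, total_steps)      # first log entry (epoch 0.09)
lr_1000 = linear_decay_lr(1000, total_steps)    # second log entry (epoch 0.18)
lr_15000 = linear_decay_lr(15000, total_steps)  # entry at epoch 2.71

for step, lr in [(500, lr_500), (1000, lr_1000), (15000, lr_15000)]:
    print(step, lr)
```

The three values agree with the logged `2.9096e-05`, `2.8193e-05`, and `2.8899e-06` to the printed precision.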

## 6. Model Evaluation

We evaluate the model qualitatively on a series of context-question pairs drawn from different domains.

In [15]:

# Initialize the pipeline
question_answerer = pipeline("question-answering", model=distilBertModel, tokenizer=tokenizer)

# Test with multiple context-question pairs
context_question_pairs = [
    # History Question
    {
        "context": "The Wright brothers, Orville and Wilbur Wright, were two American aviation pioneers generally credited with inventing, building, and flying the world's first successful motor-operated airplane. They made their first controlled, sustained flight on December 17, 1903, near Kitty Hawk, North Carolina.",
        "question": "When did the Wright brothers make their first flight?"
    },
    # Geography Question
    {
        "context": "Mount Everest is Earth's highest mountain, located in the Himalayas. Its peak is 8,848.86 meters (29,031.7 ft) above sea level. The international border between Nepal and China runs across its summit point.",
        "question": "Where is Mount Everest located?"
    },
    # Science Question
    {
        "context": "Photosynthesis is a process used by plants and other organisms to convert light energy, usually from the sun, into chemical energy that can be later released to fuel the organisms' activities. During photosynthesis, plants capture light energy and use it to convert carbon dioxide and water into glucose and oxygen.",
        "question": "What is the main product of photosynthesis?"
    },
    # Literature Question
    {
        "context": "William Shakespeare was an English playwright, poet, and actor, widely regarded as the greatest writer in the English language and the world's greatest dramatist. He is often called England's national poet and the 'Bard of Avon.' His extant works, including collaborations, consist of some 39 plays, 154 sonnets, and two long narrative poems.",
        "question": "How many plays did William Shakespeare write?"
    },
    # Technology Question
    {
        "context": "The internet is a global network of interconnected computers that communicate through standardized protocols. It was developed in the late 1960s and has since revolutionized communication, information sharing, and commerce worldwide. The World Wide Web, which was invented by Tim Berners-Lee in 1989, is a system of interlinked hypertext documents accessed via the internet.",
        "question": "Who invented the World Wide Web?"
    },
    # Art/Culture Question
    {
        "context": "The Mona Lisa is a portrait painting by the Italian artist Leonardo da Vinci. It is considered an archetypal masterpiece of the Italian Renaissance and has been described as 'the most famous, visited, talked about, and sung about' work of art in the world. The painting is thought to depict Lisa Gherardini, the wife of a wealthy Florentine merchant, and was completed in the early 16th century.",
        "question": "Who painted the Mona Lisa?"
    },
    # Politics Question
    {
        "context": "The United Nations (UN) is an international organization founded in 1945 after the Second World War by 51 countries, committed to maintaining international peace and security, developing friendly relations among nations, promoting social progress, better living standards, and human rights.",
        "question": "When was the United Nations founded?"
    },
    # Economics Question
    {
        "context": "Inflation is the rate at which the general level of prices for goods and services rises, and subsequently, the purchasing power of currency falls. Central banks attempt to limit inflation, and avoid deflation, to keep the economy running smoothly. The primary measure of inflation is the inflation rate, the annual percentage change in the price index.",
        "question": "What is inflation?"
    }
]

# Handle every question-context pair
for pair in context_question_pairs:
    result = question_answerer(question=pair["question"], context=pair["context"])
    print(f"Question: {pair['question']}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.2f}")
    print('-' * 30)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Question: When did the Wright brothers make their first flight?
Answer: December 17, 1903
Confidence: 0.98
------------------------------
Question: Where is Mount Everest located?
Answer: the Himalayas
Confidence: 0.76
------------------------------
Question: What is the main product of photosynthesis?
Answer: chemical energy
Confidence: 0.07
------------------------------
Question: How many plays did William Shakespeare write?
Answer: 39
Confidence: 0.89
------------------------------
Question: Who invented the World Wide Web?
Answer: Tim Berners-Lee
Confidence: 1.00
------------------------------
Question: Who painted the Mona Lisa?
Answer: Leonardo da Vinci
Confidence: 0.95
------------------------------
Question: When was the United Nations founded?
Answer: 1945
Confidence: 0.93
------------------------------
Question: What is inflation?
Answer: the rate at which the general level of prices for goods and services rises
Confidence: 0.20
------------------------------
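
To help interpret the confidence values, here is a simplified sketch of how a span score can be derived from start and end logits. It assumes the score is the product of the softmax probability of the start token and that of the end token, maximized over valid spans. The function names, the toy logits, and the `max_answer_len` value are assumptions for illustration, and the real pipeline applies additional masking and post-processing.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy logits: one clearly preferred start (index 2) and end (index 3)
peaked = best_span([0.1, 0.2, 5.0, 0.3, 0.1], [0.0, 0.1, 0.4, 6.0, 0.2])
# Flat logits: probability mass is spread evenly over all positions
flat = best_span([0.0] * 5, [0.0] * 5)
print(peaked)
print(flat)
```

Under this assumption, a low score such as 0.07 means the probability mass is spread over several candidate spans rather than concentrated on one.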


## Load pre-fine-tuned model
We load the DistilBERT checkpoint that Hugging Face has already fine-tuned on the SQuAD dataset (`distilbert-base-uncased-distilled-squad`).

In [16]:

# Load the pre-fine-tuned DistilBERT model
preFineTuningDistilBertModel = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

preFineTuning_question_answerer = pipeline("question-answering", model=preFineTuningDistilBertModel, tokenizer=tokenizer)

for pair in context_question_pairs:
    result = preFineTuning_question_answerer(question=pair["question"], context=pair["context"])
    print(f"Question: {pair['question']}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.2f}")
    print('-' * 30)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Question: When did the Wright brothers make their first flight?
Answer: December 17, 1903
Confidence: 0.98
------------------------------
Question: Where is Mount Everest located?
Answer: the Himalayas
Confidence: 0.67
------------------------------
Question: What is the main product of photosynthesis?
Answer: light energy
Confidence: 0.10
------------------------------
Question: How many plays did William Shakespeare write?
Answer: 39
Confidence: 0.86
------------------------------
Question: Who invented the World Wide Web?
Answer: Tim Berners-Lee
Confidence: 1.00
------------------------------
Question: Who painted the Mona Lisa?
Answer: Leonardo da Vinci
Confidence: 0.98
------------------------------
Question: When was the United Nations founded?
Answer: 1945
Confidence: 0.96
------------------------------
Question: What is inflation?
Answer: the annual percentage change in the price index
Confidence: 0.45
------------------------------


### Comparison of Performance: Fine-Tuned Model vs Hugging Face Fine-Tuned Model

#### 1. **Overall Summary**:
Both models perform similarly well in answering most questions, but there are some differences in the confidence scores and the accuracy of specific answers. Below is a more detailed comparison of how each model handled the questions:

#### 2. **Performance on Individual Questions**:

| **Question**                                    | **My Fine-Tuned Model Answer**                  | **My Model Confidence** | **Hugging Face Model Answer**         | **Hugging Face Confidence** | **Comparison** |
|-------------------------------------------------|---------------------------------------------------|---------------------------|---------------------------------------|------------------------------|----------------|
| When did the Wright brothers make their first flight? | December 17, 1903                                  | 0.98                      | December 17, 1903                     | 0.98                         | Both models perform identically. |
| Where is Mount Everest located?                 | the Himalayas                                      | 0.76                      | the Himalayas                         | 0.67                         | Both models answered correctly; my model shows slightly higher confidence. |
| What is the main product of photosynthesis?     | chemical energy                                    | 0.07                      | light energy                          | 0.10                         | Both models give wrong answers with low confidence; neither model performed well. |
| How many plays did William Shakespeare write?   | 39                                                | 0.89                      | 39                                    | 0.86                         | Both models answer correctly, with my model having slightly higher confidence. |
| Who invented the World Wide Web?                | Tim Berners-Lee                                    | 1.00                      | Tim Berners-Lee                       | 1.00                         | Both models performed perfectly. |
| Who painted the Mona Lisa?                      | Leonardo da Vinci                                  | 0.95                      | Leonardo da Vinci                     | 0.98                         | Both models are correct, but Hugging Face’s model has higher confidence. |
| When was the United Nations founded?            | 1945                                              | 0.93                      | 1945                                  | 0.96                         | Both models answered correctly, Hugging Face’s model has slightly higher confidence. |
| What is inflation?                              | the rate at which the general level of prices for goods and services rises | 0.20 | the annual percentage change in the price index | 0.45                         | The models disagree. My model returned the context's definition of inflation; Hugging Face’s model returned the definition of the inflation rate, with higher confidence. |

#### 3. **Key Observations**:

1. **Accuracy**:
   - Both models provided correct answers for six of the eight questions: **"When did the Wright brothers make their first flight?"**, **"Where is Mount Everest located?"**, **"How many plays did William Shakespeare write?"**, **"Who invented the World Wide Web?"**, **"Who painted the Mona Lisa?"**, and **"When was the United Nations founded?"**.
   - However, for the question **"What is the main product of photosynthesis?"**, both models provided incorrect answers with low confidence (the context names glucose and oxygen as the products). This indicates that neither model is strong in answering this particular scientific question.
   - On the question **"What is inflation?"**, the models chose different spans. My model returned the context's opening definition of inflation, while the Hugging Face model returned the definition of the inflation rate, a shorter span with higher confidence. Both confidence scores were low (0.20 and 0.45).

2. **Confidence**:
   - The **confidence scores** are similar between both models in most cases, but my model shows slightly higher confidence on questions like **"Where is Mount Everest located?"** (0.76 vs 0.67) and **"How many plays did William Shakespeare write?"** (0.89 vs 0.86).
   - However, for some other questions like **"What is inflation?"** (0.45 vs 0.20) and **"Who painted the Mona Lisa?"** (0.98 vs 0.95), the Hugging Face model had higher confidence. With only eight questions, differences this small are not enough to conclude that one model was fine-tuned better than the other.

3. **Consistency**:
   - Both models performed almost identically on most factual questions, such as dates and names (e.g., Wright brothers' flight date, Tim Berners-Lee inventing the web).
   - The **confidence** variance is noticeable but not large. In most cases, both models maintained fairly high confidence for correct answers and lower confidence for more uncertain or incorrect ones.

4. **Edge Cases**:
   - For more nuanced or complex questions (e.g., "What is the main product of photosynthesis?"), both models struggled, which suggests that more specialized fine-tuning or additional data may be necessary to improve performance on such questions.

#### 4. **Which Model Performed Better?**
- In terms of **accuracy**, both models perform equally well on most questions, providing correct answers for a majority of them.
- In terms of **confidence**, my fine-tuned model had higher confidence in some cases, but the **Hugging Face fine-tuned model** exhibited slightly higher confidence in some other key questions, such as **"What is inflation?"**.
- Overall, both models are very close in performance, with **slightly higher confidence** for the Hugging Face model in certain cases.
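
Comparing confidence scores on eight questions gives only a rough picture. Accuracy on SQuAD is usually quantified with exact match (EM) and token-level F1 against reference answers. The sketch below is a minimal version in the style of the SQuAD v1.1 evaluation, not the official script, and the reference answers in it are hypothetical ones written for this illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answers, for illustration only
em_everest = exact_match("the Himalayas", "Himalayas")
f1_photo = f1_score("chemical energy", "glucose and oxygen")
f1_partial = f1_score("in the Himalayas", "the Himalayas")
print(em_everest, f1_photo, round(f1_partial, 3))
```

Running both models through `exact_match` and `f1_score` on the full validation split would give a more reliable comparison than the eight hand-picked questions.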

### Conclusion:
My fine-tuned model performed very well, comparable to the Hugging Face pre-fine-tuned model. For most questions, both models gave the same correct answers with high confidence. The differences in confidence scores are small, and neither model has a clear advantage over the other on this eight-question test. The main disagreement was **"What is inflation?"**, where the two models selected different spans and the Hugging Face model was more confident in its choice.