# HTML to JSON Conversion with Phi-3 Model

In this project, we'll fine-tune the Phi-3 model to convert HTML tables into JSON. This is a common step in web scraping and data processing pipelines, where tabular data has to be extracted from markup and passed on in a structured form.

## Setup and Model Initialization

First, we'll set up our environment and initialize the Phi-3 model:

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install triton
!pip install  "xformers<0.0.27"
!pip install jsonlines

In [2]:
from huggingface_hub import login
login(token="<token>")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [5]:


max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-mini-4k-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Model Configuration

We'll add LoRA adapters to update only a small percentage of parameters:

In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
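
The training banner later in this notebook reports 29,884,416 trainable parameters, and that figure can be reproduced by hand. The sketch below assumes the published Phi-3-mini dimensions (hidden size 3072, MLP intermediate size 8192, 32 decoder layers) and that each of the seven `target_modules` is a separate linear layer; those dimensions are not stated in this notebook.

```python
# Sketch: count the LoRA parameters added for r=16 on the seven target modules.
# Assumed (not stated in this notebook): hidden size 3072, MLP intermediate
# size 8192, and 32 decoder layers for Phi-3-mini.
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32
RANK = 16

# (in_features, out_features) of each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Against a base model of roughly 3.8 billion parameters, this is under 1% of the weights.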


## Data Preparation

We'll use a custom dataset of HTML tables and their corresponding JSON representations. The data is stored in a JSONL file:


In [7]:
from unsloth.chat_templates import get_chat_template
from datasets import Dataset
import jsonlines

# Set up the tokenizer with the appropriate chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="phi-3",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)


input_file = "./output_phi35.jsonl"

with jsonlines.open(input_file) as reader:
    data = list(reader)

# Create a Dataset object
dataset = Dataset.from_list(data)

# Split the dataset into training and validation sets
seed = 2024
splits = dataset.train_test_split(test_size=0.2, seed=seed)
train_dataset = splits['train']
valid_dataset = splits['test']
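
The evaluation code later in this notebook reads `example["conversations"][0]["value"]` as the prompt and `example["conversations"][1]["value"]` as the reference JSON, which suggests that each JSONL line holds a two-turn conversation. Below is a minimal sketch of a shape check under that assumption; the sample record and the `is_valid_record` helper are hypothetical and not taken from the actual file.

```python
import json


def is_valid_record(record):
    """Check for a two-turn human/gpt conversation (assumed layout)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or len(turns) != 2:
        return False
    expected_roles = ["human", "gpt"]
    for turn, role in zip(turns, expected_roles):
        if not isinstance(turn, dict):
            return False
        if turn.get("from") != role or not isinstance(turn.get("value"), str):
            return False
    return True


# Hypothetical record; the real dataset's prompt and JSON schema may differ.
sample_line = json.dumps({
    "conversations": [
        {"from": "human", "value": "<table><tr><td>1</td></tr></table>"},
        {"from": "gpt", "value": '{"body": {"content": [["1"]]}}'},
    ]
})

record = json.loads(sample_line)
print(is_valid_record(record))
print(is_valid_record({"conversations": []}))
```

Running a check like this over `data` before building the `Dataset` would surface malformed lines early, rather than during training.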

Let's check how many examples were loaded:

In [11]:
len(data)

30000
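
The trainer in the next section is configured with `dataset_text_field = "text"`, so each example needs a `text` column holding the fully rendered conversation. That step is not shown in this notebook. The sketch below approximates it in plain Python: the role tags follow the Phi-3 chat markup as I understand it, and both helper names are hypothetical.

```python
# Role tags assumed from the Phi-3 chat format; the tokenizer's own template
# is authoritative and may add tokens (for example a BOS token) not shown here.
ROLE_TAGS = {"human": "<|user|>", "gpt": "<|assistant|>"}


def render_conversation(turns):
    """Flatten a list of {"from", "value"} turns into one training string."""
    parts = []
    for turn in turns:
        tag = ROLE_TAGS[turn["from"]]
        parts.append(f"{tag}\n{turn['value']}<|end|>\n")
    return "".join(parts)


def add_text_column(batch):
    """Batched mapper: build a "text" column from a batch of conversations."""
    return {"text": [render_conversation(t) for t in batch["conversations"]]}


# Tiny made-up batch in the column-oriented layout used by batched mapping.
batch = {"conversations": [[
    {"from": "human", "value": "<table></table>"},
    {"from": "gpt", "value": "{}"},
]]}
print(add_text_column(batch)["text"][0])
```

With the real objects, a function of this shape could be applied with `train_dataset.map(add_text_column, batched=True)`, replacing `render_conversation` with `tokenizer.apply_chat_template(turns, tokenize=False, add_generation_prompt=False)` so that the tokenizer's template stays the single source of truth.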

## Training Configuration

We'll set up the training configuration using the SFTTrainer:

In [17]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset=valid_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 50,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/24000 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/6000 [00:00<?, ? examples/s]
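
The batch settings above determine how many optimizer steps one epoch takes. Here is a short sketch of that arithmetic, assuming a single GPU (as the training banner reports); the Trainer's exact rounding may differ when the dataset size is not a multiple of the effective batch size.

```python
import math

train_examples = 24_000   # from the Map progress output above
per_device_batch = 8      # per_device_train_batch_size
grad_accum_steps = 4      # gradient_accumulation_steps
num_gpus = 1              # assumption: single-GPU runtime
epochs = 1                # num_train_epochs

# Gradients from several micro-batches are accumulated before each update.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, total_steps)
```

With `logging_steps = 50`, a run of this length would log the training loss 15 times.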

In [1]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

## Model Training

Now, we'll train our model:

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 24,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 4
\        /    Total batch size = 32 | Total steps = 750
 "-____-"     Number of trainable parameters = 29,884,416


Step,Training Loss
50,0.4657
100,0.3384
150,0.3321
200,0.3293
250,0.3262
300,0.3261
350,0.3242
400,0.3252
450,0.325
500,0.3246


In [2]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## Inference

After training, we can use our model for HTML to JSON conversion:

In [18]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": """
You are an AI assistant specialized in converting HTML tables to JSON format.

<table id="Table269494954447Curator">
<caption>Table 26.94.94.95.44.47 Curator</caption>
<thead>
<tr>
<th> </th>
<th>    Jacob Hunter    </th>
<th>    Kendra Jackson    </th>
<th>    Jennifer Brown    </th>
<th>    Mrs. Jamie Alvarez DDS    </th>
</tr>
</thead>
<tbody>
<tr>
<td align="right" style="font-weight:bold">    Blair and Sons    </td>
<td align="left" style="font-style:italic">    1285    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1242    </td>
<td align="left" style="font-style:italic">    1265    </td>
<td align="left" style="font-weight:bold">    999    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    Wilkerson Group    </td>
<td align="left" style="font-style:italic">    1328%    </td>
<td align="left" style="font-weight:bold; font-style:italic">    758    </td>
<td align="left" style="font-style:italic">    759%    </td>
<td align="left" style="font-weight:bold">    958    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    Wilson-English    </td>
<td align="left" style="font-style:italic">    557%    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1306%    </td>
<td align="left" style="font-style:italic">    1109    </td>
<td align="left" style="font-weight:bold">    294%    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    White LLC    </td>
<td align="left" style="font-style:italic">    1309    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1171    </td>
<td align="left" style="font-style:italic">    1459    </td>
<td align="left" style="font-weight:bold">    1206    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    Thomas-Rogers    </td>
<td align="left" style="font-style:italic">    492%    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1475%    </td>
<td align="left" style="font-style:italic">    50%    </td>
<td align="left" style="font-weight:bold">    1489    </td>
</tr>
</tbody><tfoot><tr><td>Creation: 29Apr2016 Vietnam</td></tr></tfoot>
</table>
```

 """},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 500, use_cache = True)
tokenizer.batch_decode(outputs)

['<|user|> You are an AI assistant specialized in converting HTML tables to JSON format.\n\n<table id="Table269494954447Curator">\n<caption>Table 26.94.94.95.44.47 Curator</caption>\n<thead>\n<tr>\n<th> </th>\n<th>    Jacob Hunter    </th>\n<th>    Kendra Jackson    </th>\n<th>    Jennifer Brown    </th>\n<th>    Mrs. Jamie Alvarez DDS    </th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td align="right" style="font-weight:bold">    Blair and Sons    </td>\n<td align="left" style="font-style:italic">    1285    </td>\n<td align="left" style="font-weight:bold; font-style:italic">    1242    </td>\n<td align="left" style="font-style:italic">    1265    </td>\n<td align="left" style="font-weight:bold">    999    </td>\n</tr>\n<tr>\n<td align="right" style="font-weight:bold">    Wilkerson Group    </td>\n<td align="left" style="font-style:italic">    1328%    </td>\n<td align="left" style="font-weight:bold; font-style:italic">    758    </td>\n<td align="left" style="font-style:italic">    759%    <
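
The decoded output above starts with the echoed prompt because `generate` returns the prompt and completion token ids as one sequence. A common way to keep only the completion is to slice off the prompt ids before decoding. The sketch below shows the indexing with plain Python lists standing in for tensors; the token ids are made up.

```python
def completion_ids(output_ids, prompt_length):
    """Drop the echoed prompt from each generated sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Made-up ids: a 4-token prompt followed by 3 generated tokens.
prompt = [101, 7, 8, 9]
generated = [prompt + [42, 43, 44]]

print(completion_ids(generated, len(prompt)))
```

With the tensors from the cell above, the equivalent slice would be `outputs[:, inputs.shape[1]:]`, passed to `tokenizer.batch_decode`.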

### Saving and loading fine-tuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This saves ONLY the LoRA adapters, not the full model; exporting to 16-bit or GGUF is a separate step.

In [None]:
model.save_pretrained("lora_model")
model.push_to_hub("phi-3.5-mini-html2json")

To load the LoRA adapters we just saved and run inference with them, use the cell below (change `if True:` to `if False:` to skip the reload and keep the in-memory model):

In [19]:
from transformers import TextStreamer

if True:
    from unsloth import FastLanguageModel
    model, _ = FastLanguageModel.from_pretrained(
        model_name = "avi32636/phi-3.5-mini-html2json", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": """
You are an AI assistant specialized in converting HTML tables to JSON format.

<table id="Table269494954447Curator">
<caption>Table 26.94.94.95.44.47 Curator</caption>
<thead>
<tr>
<th> </th>
<th>    Jacob Hunter    </th>
<th>    Kendra Jackson    </th>
<th>    Jennifer Brown    </th>
<th>    Mrs. Jamie Alvarez DDS    </th>
</tr>
</thead>
<tbody>
<tr>
<td align="right" style="font-weight:bold">    Blair and Sons    </td>
<td align="left" style="font-style:italic">    1285    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1242    </td>
<td align="left" style="font-style:italic">    1265    </td>
<td align="left" style="font-weight:bold">    999    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    Wilkerson Group    </td>
<td align="left" style="font-style:italic">    1328%    </td>
<td align="left" style="font-weight:bold; font-style:italic">    758    </td>
<td align="left" style="font-style:italic">    759%    </td>
<td align="left" style="font-weight:bold">    958    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    Wilson-English    </td>
<td align="left" style="font-style:italic">    557%    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1306%    </td>
<td align="left" style="font-style:italic">    1109    </td>
<td align="left" style="font-weight:bold">    294%    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    White LLC    </td>
<td align="left" style="font-style:italic">    1309    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1171    </td>
<td align="left" style="font-style:italic">    1459    </td>
<td align="left" style="font-weight:bold">    1206    </td>
</tr>
<tr>
<td align="right" style="font-weight:bold">    Thomas-Rogers    </td>
<td align="left" style="font-style:italic">    492%    </td>
<td align="left" style="font-weight:bold; font-style:italic">    1475%    </td>
<td align="left" style="font-style:italic">    50%    </td>
<td align="left" style="font-weight:bold">    1489    </td>
</tr>
</tbody><tfoot><tr><td>Creation: 29Apr2016 Vietnam</td></tr></tfoot>
</table>
```

 """},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
model_output = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 400, use_cache = True)
tokenizer.batch_decode(model_output)


<|user|> You are an AI assistant specialized in converting HTML tables to JSON format.

<table id="Table269494954447Curator">
<caption>Table 26.94.94.95.44.47 Curator</caption>
<thead>
<tr>
<th> </th>
<th>    Jacob Hunter    </th>
<th>    Kendra Jackson    </th>
<th>    Jennifer Brown    </th>
<th>    Mrs. Jamie Alvarez DDS    </th>
</tr>
</thead>
<tbody>
<tr>
<td align="right" style="font-weight:bold">    Blair and Sons    </td>
<td align="left" style="font-style:italic">    1285    </td>
<td align="left" style="font-weight:bold; font-

['<|user|> You are an AI assistant specialized in converting HTML tables to JSON format.\n\n<table id="Table269494954447Curator">\n<caption>Table 26.94.94.95.44.47 Curator</caption>\n<thead>\n<tr>\n<th> </th>\n<th>    Jacob Hunter    </th>\n<th>    Kendra Jackson    </th>\n<th>    Jennifer Brown    </th>\n<th>    Mrs. Jamie Alvarez DDS    </th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td align="right" style="font-weight:bold">    Blair and Sons    </td>\n<td align="left" style="font-style:italic">    1285    </td>\n<td align="left" style="font-weight:bold; font-style:italic">    1242    </td>\n<td align="left" style="font-style:italic">    1265    </td>\n<td align="left" style="font-weight:bold">    999    </td>\n</tr>\n<tr>\n<td align="right" style="font-weight:bold">    Wilkerson Group    </td>\n<td align="left" style="font-style:italic">    1328%    </td>\n<td align="left" style="font-weight:bold; font-style:italic">    758    </td>\n<td align="left" style="font-style:italic">    759%    <

## Evaluation
To assess the performance of our fine-tuned model, we'll implement an evaluation function that processes a subset of our validation dataset:


In [20]:
import re
import json

def extract_json(text):

    # Find the content between <|assistant|> and <|end|> tags
    # (the closing '>' of the assistant tag must be part of the pattern)
    pattern = r'<\|assistant\|>(.*?)<\|end\|>'
    match = re.search(pattern, text, re.DOTALL)

    if match:
        json_string = match.group(1).strip()
        try:
            # Parse the JSON string
            json_data = json.loads(json_string)
            return json_data
        except json.JSONDecodeError as e:
            print(f"\nJSON decode error: {e}")
            return None
    else:
        # Try an alternative pattern without the tags
        alternative_pattern = r'\{.*\}'
        alt_match = re.search(alternative_pattern, text, re.DOTALL)
        if alt_match:
            json_string = alt_match.group(0)
            try:
                json_data = json.loads(json_string)
                return json_data
            except json.JSONDecodeError as e:
                print(f"\nJSON decode error: {e}")
                return None
        else:
            print("\nNo JSON-like structure found in the input text.")
            return None

In [21]:
# output_json = extract_json(generated_text)
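
The greedy `\{.*\}` fallback in `extract_json` spans from the first `{` to the last `}` in the text, so any trailing text that contains a brace makes parsing fail. One alternative, sketched here as a suggestion rather than the notebook's method, is to let `json.JSONDecoder.raw_decode` parse a single object starting at each `{` and stop at the first success. The helper name `first_json_object` is hypothetical.

```python
import json


def first_json_object(text):
    """Return the first parseable JSON object in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


sample = 'noise {"body": {"content": [["1285", "1242"]]}} trailing } text'
print(first_json_object(sample))
print(first_json_object("no json here"))
```

Output that was cut off by `max_new_tokens` before its closing brace still returns `None`, so truncated generations remain visible as failures.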

## Check on the Validation Set

In [29]:
from typing import List, Dict, Union
from tqdm import tqdm

def process_dataset(dataset: List[Dict], model, tokenizer, extract_json_func, num_examples: Union[int, None] = None) -> List[Dict]:
    """
    Process a dataset and generate predictions for each example, up to a specified number.

    :param dataset: List of dictionaries containing the dataset examples
    :param model: The model to use for generating predictions
    :param tokenizer: The tokenizer to use for encoding/decoding
    :param extract_json_func: Function to extract JSON from generated text
    :param num_examples: Maximum number of examples to process (None for all examples)
    :return: List of dictionaries containing results for each processed example
    """
    results = []
    processed_count = 0

    for i, example in tqdm(enumerate(dataset), total=num_examples, desc="Processing dataset"):
        # Check if we've reached the desired number of examples
        if num_examples is not None and i >= num_examples:
            break

        # Generate model output
        input_text = example["conversations"][0]["value"]
        messages = [
            {"from": "human", "value":input_text },
        ]
        inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
        model_output = model.generate(input_ids = inputs, max_new_tokens = 400, use_cache = True)

        # Decode the model output (prompt plus completion). skip_special_tokens=True
        # removes special tokens, which normally includes the <|assistant|> and
        # <|end|> chat tags, so extract_json_func relies on its brace-matching fallback.
        generated_text = tokenizer.decode(model_output[0], skip_special_tokens=True)

        # Extract JSON from the generated text
        output_json = extract_json_func(generated_text)

        # Create the result dictionary
        result_dict = {
            "item": i,  # Replace "item" with the actual key if different
            "true": example["conversations"][1]["value"],
            "pred": output_json
        }

        results.append(result_dict)
        processed_count += 1

    return results




In [30]:

# Or process a subset of the dataset
num_examples = 100
subset_results = process_dataset(valid_dataset, model, tokenizer, extract_json, num_examples=num_examples)

print(f"Processed {len(subset_results)} examples from the subset")

Processing dataset:  10%|█         | 10/100 [02:47<24:45, 16.50s/it]


No JSON-like structure found in the input text.


Processing dataset:  22%|██▏       | 22/100 [06:17<25:55, 19.94s/it]


JSON decode error: Expecting ',' delimiter: line 1 column 629 (char 628)


Processing dataset:  28%|██▊       | 28/100 [07:49<20:34, 17.14s/it]


No JSON-like structure found in the input text.


Processing dataset:  29%|██▉       | 29/100 [08:14<23:01, 19.46s/it]


JSON decode error: Expecting ',' delimiter: line 1 column 720 (char 719)


Processing dataset:  38%|███▊      | 38/100 [10:41<18:18, 17.71s/it]


No JSON-like structure found in the input text.


Processing dataset:  39%|███▉      | 39/100 [11:06<20:05, 19.77s/it]


No JSON-like structure found in the input text.


Processing dataset:  46%|████▌     | 46/100 [13:06<17:04, 18.97s/it]


No JSON-like structure found in the input text.


Processing dataset:  57%|█████▋    | 57/100 [15:49<11:25, 15.94s/it]


JSON decode error: Expecting ',' delimiter: line 1 column 698 (char 697)


Processing dataset:  62%|██████▏   | 62/100 [17:11<11:30, 18.17s/it]


No JSON-like structure found in the input text.


Processing dataset:  67%|██████▋   | 67/100 [18:28<09:20, 16.99s/it]


No JSON-like structure found in the input text.


Processing dataset:  70%|███████   | 70/100 [19:22<09:12, 18.42s/it]


No JSON-like structure found in the input text.


Processing dataset:  74%|███████▍  | 74/100 [20:31<08:16, 19.10s/it]


No JSON-like structure found in the input text.


Processing dataset:  75%|███████▌  | 75/100 [20:55<08:37, 20.69s/it]


JSON decode error: Expecting ',' delimiter: line 1 column 716 (char 715)


Processing dataset:  77%|███████▋  | 77/100 [21:38<08:10, 21.33s/it]


No JSON-like structure found in the input text.


Processing dataset:  79%|███████▉  | 79/100 [22:14<07:04, 20.22s/it]


No JSON-like structure found in the input text.


Processing dataset:  80%|████████  | 80/100 [22:39<07:12, 21.61s/it]


No JSON-like structure found in the input text.


Processing dataset:  82%|████████▏ | 82/100 [23:15<06:06, 20.38s/it]


No JSON-like structure found in the input text.


Processing dataset:  83%|████████▎ | 83/100 [23:40<06:07, 21.61s/it]


JSON decode error: Expecting ',' delimiter: line 1 column 667 (char 666)


Processing dataset:  88%|████████▊ | 88/100 [24:55<03:38, 18.21s/it]


No JSON-like structure found in the input text.


Processing dataset:  94%|█████████▍| 94/100 [26:40<01:45, 17.62s/it]


JSON decode error: Expecting ',' delimiter: line 1 column 752 (char 751)


Processing dataset: 100%|██████████| 100/100 [28:20<00:00, 17.00s/it]

Processed 100 examples from the subset
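
Each failed extraction leaves `pred` as `None` in its result dictionary, so the failure rate can be tallied directly from `subset_results`. Here is a minimal sketch, using a small made-up list in place of the real results; `summarize_predictions` is a hypothetical helper.

```python
def summarize_predictions(results):
    """Count how many results carry a parsed prediction versus None."""
    total = len(results)
    failed = sum(1 for r in results if r["pred"] is None)
    return {"total": total, "parsed": total - failed, "failed": failed}


# Made-up stand-in for subset_results; only the "pred" field matters here.
demo_results = [
    {"item": 0, "true": "{}", "pred": {}},
    {"item": 1, "true": "{}", "pred": None},
    {"item": 2, "true": "{}", "pred": {"body": {}}},
]
print(summarize_predictions(demo_results))
```

Calling `summarize_predictions(subset_results)` on the real results gives a single figure to track when changing `max_new_tokens` or the extraction logic.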





## Evaluation Metrics
We'll define functions to compare the predicted JSON with the ground truth:

In [31]:
import json
from difflib import SequenceMatcher
from typing import List, Dict

def compare_dicts(dict1: Dict, dict2: Dict) -> float:
    if dict1 == dict2:
        return 1.0
    similarity = sum(
        SequenceMatcher(None, str(dict1.get(k, '')), str(dict2.get(k, ''))).ratio()
        for k in set(dict1) | set(dict2)
    )
    return similarity / max(len(dict1), len(dict2))

def evaluate_json_pair(true_data: Dict, pred_data: Dict) -> Dict:
    # Check main structure
    main_parts = ['header', 'body', 'footer']
    structure_check = all(part in true_data and part in pred_data for part in main_parts)

    results = {
        'structure_valid': structure_check,
        'header_similarity': compare_dicts(true_data.get('header', {}), pred_data.get('header', {})),
        'body_content_match': true_data.get('body', {}).get('content') == pred_data.get('body', {}).get('content'),
        'body_headers_similarity': compare_dicts(
            true_data.get('body', {}).get('headers', {}),
            pred_data.get('body', {}).get('headers', {})
        ),
        'footer_similarity': compare_dicts(true_data.get('footer', {}), pred_data.get('footer', {})),
    }

    # Calculate overall match percentage
    results['match_percentage'] = (
        (results['header_similarity'] * 0.25) +
        (results['body_content_match'] * 0.25) +
        (results['body_headers_similarity'] * 0.25) +
        (results['footer_similarity'] * 0.25)
    ) * 100

    return results

def evaluate_subset_results(subset_results: List[Dict]) -> List[Dict]:
    evaluation_results = []

    for result in subset_results:
        try:
            true_data = json.loads(result['true']) if isinstance(result['true'], str) else result['true']
            pred_data = json.loads(result['pred']) if isinstance(result['pred'], str) else result['pred']

            if not isinstance(true_data, dict) or not isinstance(pred_data, dict):
                raise ValueError("Data is not a dictionary after parsing")

            evaluation = evaluate_json_pair(true_data, pred_data)
            evaluation['item'] = result['item']
            evaluation['status'] = 'success'
        except json.JSONDecodeError:
            evaluation = {'item': result['item'], 'status': 'problem', 'error': 'JSON parsing error'}
        except ValueError as e:
            evaluation = {'item': result['item'], 'status': 'problem', 'error': str(e)}
        except Exception as e:
            evaluation = {'item': result['item'], 'status': 'problem', 'error': f'Unexpected error: {str(e)}'}

        evaluation_results.append(evaluation)

    return evaluation_results

def print_evaluation_results(evaluation_results: List[Dict]):
    for result in evaluation_results:
        print("=" * 50)
        print(f"Item: {result['item']}")
        print("=" * 50)

        if result['status'] == 'problem':
            print(f"Problem: {result['error']}")
            print("\n" + "-" * 50 + "\n")
            continue

        print(f"Structure valid: {'✓' if result['structure_valid'] else '✗'}")
        print("\nHeader:")
        print(f" Similarity: {result['header_similarity']:.2f}")
        print("\nBody:")
        print(f" Content match: {'✓' if result['body_content_match'] else '✗'}")
        print(f" Headers similarity: {result['body_headers_similarity']:.2f}")
        print("\nFooter:")
        print(f" Similarity: {result['footer_similarity']:.2f}")

        match_percentage = result['match_percentage']
        print("\nOverall Match:")
        print(f" Percentage: {match_percentage:.2f}%")
        print(" Quality: ", end="")
        if match_percentage >= 90:
            print("Excellent 🌟")
        elif match_percentage >= 75:
            print("Good 👍")
        elif match_percentage >= 50:
            print("Fair 😐")
        else:
            print("Poor 😞")
        print("\n" + "-" * 50 + "\n")

# Usage
# Assuming you have already run the process_dataset function and have subset_results

evaluation_results = evaluate_subset_results(subset_results)
print_evaluation_results(evaluation_results)

# Calculate overall statistics, excluding problematic samples
valid_results = [result for result in evaluation_results if result['status'] == 'success']
if valid_results:
    average_match_percentage = sum(result['match_percentage'] for result in valid_results) / len(valid_results)
    print(f"Average match percentage across all successfully evaluated items: {average_match_percentage:.2f}%")
    print(f"Total samples: {len(evaluation_results)}, Successfully evaluated: {len(valid_results)}, Problematic: {len(evaluation_results) - len(valid_results)}")
else:
    print("No valid results to calculate statistics.")

Item: 0
Structure valid: ✓

Header:
 Similarity: 1.00

Body:
 Content match: ✓
 Headers similarity: 1.00

Footer:
 Similarity: 1.00

Overall Match:
 Percentage: 100.00%
 Quality: Excellent 🌟

--------------------------------------------------

Item: 1
Structure valid: ✓

Header:
 Similarity: 1.00

Body:
 Content match: ✓
 Headers similarity: 1.00

Footer:
 Similarity: 1.00

Overall Match:
 Percentage: 100.00%
 Quality: Excellent 🌟

--------------------------------------------------

Item: 2
Structure valid: ✓

Header:
 Similarity: 1.00

Body:
 Content match: ✓
 Headers similarity: 1.00

Footer:
 Similarity: 1.00

Overall Match:
 Percentage: 100.00%
 Quality: Excellent 🌟

--------------------------------------------------

Item: 3
Structure valid: ✓

Header:
 Similarity: 1.00

Body:
 Content match: ✓
 Headers similarity: 1.00

Footer:
 Similarity: 1.00

Overall Match:
 Percentage: 100.00%
 Quality: Excellent 🌟

--------------------------------------------------

Item: 4
Structure valid: