### Ragnarok Online Trade Buddy

This generator is used to create a dataset of ragnarok items , items like armors  being the topics of the dataset.

The generator will iteratively generate a dataset of products, with each product being a topic of the dataset.

The generator will use a combination of a prompt and a model to generate the dataset.

Parameters:
- model: The model to use to generate the dataset.
- prompt: The prompt to use to generate the dataset.
- product_type: The type of product to generate.
- num_products: The number of products to generate.
- custom_attributes: A list of custom attributes to add to the product.
- num_examples: The number of examples to generate for each thing.

Output format options:
- JSON
- CSV
- Markdown



In [2]:
import torch
import gc
import os
import json
import logging
import pandas as pd
import traceback # For detailed error logging
from typing import List, Dict, Optional, Any # Added Optional and Any
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
from save_dataset import save_dataset

In [3]:
# Model Options
MODEL_MISTRAL = 'mistralai/Mistral-7B-Instruct-v0.2'
MODEL_PHI = 'microsoft/Phi-3-mini-4k-instruct'

In [4]:
# Prompt Options
PROMPT = """
  You are a dataset generator for items in the video game ragnarok online.
  You will be given a list of optional attributes and a list of required attributes.
  Just make sure to generate the products as outputs with its value in zeny(in game currency).

  The products output should be human readable or JSON format and generate a comprehensive
  data set depending on the product type requested by the user.
"""


In [5]:
# --- Setup Logging ---
# Configure logging to output to console
# You can customize the format, level, and output (e.g., to a file)
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s')
logger = logging.getLogger(__name__) # Create a logger for this module

logger.info(f"PyTorch version: {torch.__version__}")
logger.info(f"MPS available: {torch.backends.mps.is_available()}")
logger.info(f"MPS built: {torch.backends.mps.is_built()}")

# --- Authentication
load_dotenv()
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    logger.warning("HF_TOKEN environment variable not found. Some operations might fail.")
else:
    try:
        login(hf_token, add_to_git_credential=True)
        logger.info("Successfully logged into Hugging Face Hub.")
    except Exception as e:
        logger.error(f"Failed to log into Hugging Face Hub: {e}")

2025-05-27 23:50:47,441 - INFO - __main__ - PyTorch version: 2.6.0
2025-05-27 23:50:47,452 - INFO - __main__ - MPS available: True
2025-05-27 23:50:47,453 - INFO - __main__ - MPS built: True
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
2025-05-27 23:50:48,448 - INFO - __main__ - Successfully logged into Hugging Face Hub.


In [6]:
class DatasetGenerator:
    """
    Generates datasets using a model.
    Manages model loading, prompt formatting, generation, and resource cleanup.
    """

    DEFAULT_PROMPT_TEMPLATE = """You are an item dataset generator for the video game ragnarok online.
        Your goal is to generate exactly {num_products} items of the type: {product_type}.

        IMPORTANT: Generate the output in JSON format with the following structure:
        [
            {{
                "item_number": "1",
                "item_name": "Example Item",
                "item_id": "#1234-abc123",
                "item_description": "A brief description",
                "item_price": 1000000,
                "item_type": "Weapon",
                "item_quantity: 10
                "owner_name": "Character Name",
                "owner_id": "#NPC123",
                "willing_to_trade": true
            }}
        ]

        Requirements:
        1. Each item must have all the required fields
        2. item_price must be a number (no currency symbols)
        3. item_id and owner_id must start with #
        4. willing_to_trade must be true or false
        5. No empty or null values allowed
        6. item_number must have a randomize number

        Owner Requirements:
        {owner_instruction}

        Custom attributes to consider:
        {custom_attributes_str}

        Begin generation:"""

    def __init__(
        self,
        model_name: str,
        num_products: int = 10,
        product_type: str = "Ragnarok Online Items",
        prompt_template: Optional[str] = None,
        same_owner: bool = True,
        same_owner_name_temperature: float = 0.8
    ):
        """Initialize the DatasetGenerator with validation."""
        # Validate inputs
        if not isinstance(model_name, str) or not model_name:
            raise ValueError("model_name must be a non-empty string")
        
        if not isinstance(num_products, int) or num_products <= 0:
            raise ValueError("num_products must be a positive integer")
        
        if not isinstance(product_type, str) or not product_type:
            raise ValueError("product_type must be a non-empty string")
        
        if not isinstance(same_owner, bool):
            raise ValueError("same_owner must be a boolean")
        
        if not isinstance(same_owner_name_temperature, (int, float)) or not 0 <= same_owner_name_temperature <= 1:
            raise ValueError("same_owner_name_temperature must be a number between 0 and 1")

        self.model_name = model_name
        self.num_products = num_products
        self.product_type = product_type
        self.prompt_template = prompt_template or self.DEFAULT_PROMPT_TEMPLATE
        self.same_owner = same_owner
        self.same_owner_name_temperature = same_owner_name_temperature

        self.device: str = self._get_device()
        self.model: Optional[AutoModelForCausalLM] = None
        self.tokenizer: Optional[AutoTokenizer] = None

    def _get_device(self) -> str:
        """Determines and returns the best available device (mps, cuda, or cpu)."""
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            logger.info("MPS device is available and built.")
            return "mps"
        elif torch.cuda.is_available():
            logger.info("CUDA device is available.")
            return "cuda"
        else:
            logger.info("No GPU (MPS or CUDA) available, using CPU.")
            return "cpu"

    def _load_model(self) -> bool:
        """Loads the model and tokenizer onto the selected device."""
        if self.model is not None and self.tokenizer is not None:
            logger.info("Model and tokenizer already loaded.")
            return True

        logger.info(f"Attempting to load model and tokenizer on device: {self.device}")
        new_pad_token_added_to_vocab = False

        try:
            # 1. Load Tokenizer
            logger.info(f"Loading tokenizer for {self.model_name}...")
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                trust_remote_code=True
            )
            logger.info("Tokenizer loaded.")

            # 2. Debug and Set Pad Token
            logger.info(f"Initial tokenizer pad_token: {self.tokenizer.pad_token}, pad_token_id: {self.tokenizer.pad_token_id}")
            logger.info(f"Initial tokenizer eos_token: {self.tokenizer.eos_token}, eos_token_id: {self.tokenizer.eos_token_id}")
            logger.info(f"Initial tokenizer bos_token: {self.tokenizer.bos_token}, bos_token_id: {self.tokenizer.bos_token_id}")
            logger.info(f"Initial tokenizer unk_token: {self.tokenizer.unk_token}, unk_token_id: {self.tokenizer.unk_token_id}")

            if self.tokenizer.pad_token is None:
                logger.warning("Tokenizer `pad_token` is None. Attempting to set it.")
                if self.tokenizer.eos_token is not None:
                    logger.info(f"Setting `pad_token` to `eos_token` ('{self.tokenizer.eos_token}').")
                    self.tokenizer.pad_token = self.tokenizer.eos_token
                else:
                    logger.warning("Tokenizer `eos_token` is also None. Adding a new `[PAD]` special token.")
                    original_vocab_size = len(self.tokenizer)
                    self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
                    if len(self.tokenizer) > original_vocab_size:
                        new_pad_token_added_to_vocab = True
                        logger.info(f"Added new special token '[PAD]'. Tokenizer vocab size changed from {original_vocab_size} to {len(self.tokenizer)}.")
                    else:
                        logger.info("Tried to add '[PAD]', but vocab size did not change (it might have existed or aliased).")

            logger.info(f"After attempting to set: tokenizer pad_token: {self.tokenizer.pad_token}, pad_token_id: {self.tokenizer.pad_token_id}")

            # 3. Load Model
            logger.info(f"Loading model {self.model_name}...")
            model_kwargs = {
                "trust_remote_code": True,
                "torch_dtype": torch.float16 if self.device != "cpu" else torch.float32
            }

            if self.device == "mps":
                model_kwargs["device_map"] = self.device

            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                **model_kwargs
            )
            logger.info("Model loaded from pretrained.")

            # 4. Resize Token Embeddings if needed
            if new_pad_token_added_to_vocab:
                logger.info(f"Resizing model token embeddings to match new tokenizer vocab size: {len(self.tokenizer)}")
                self.model.resize_token_embeddings(len(self.tokenizer))
                if self.model.config.pad_token_id != self.tokenizer.pad_token_id:
                    logger.info(f"Updating model's config pad_token_id from {self.model.config.pad_token_id} to {self.tokenizer.pad_token_id}")
                    self.model.config.pad_token_id = self.tokenizer.pad_token_id

            # 5. Move to device if not already handled by device_map
            if self.device != "mps":
                logger.info(f"Moving model to device: {self.device}")
                self.model.to(self.device)

            logger.info(f"Model is now on device: {self.model.device}")
            logger.info(f"Model config pad_token_id: {self.model.config.pad_token_id}")

            # Final check of tokenizer pad_token_id
            if self.tokenizer.pad_token_id is None:
                logger.error("CRITICAL: tokenizer.pad_token_id is STILL None after all attempts. This will likely cause padding errors.")
                return False

            return True

        except Exception as e:
            logger.error(f"Failed to load model or tokenizer for {self.model_name}: {e}")
            logger.error(traceback.format_exc())
            self.model = None
            self.tokenizer = None
            return False

    def _cleanup_resources(self):
        """Cleans up model and tokenizer resources and clears GPU cache if applicable."""
        logger.info("Cleaning up resources...")
        try:
            if self.model is not None:
                del self.model
                self.model = None
                logger.debug("Model deleted.")

            if self.tokenizer is not None:
                del self.tokenizer
                self.tokenizer = None
                logger.debug("Tokenizer deleted.")

            gc.collect()
            logger.debug("Garbage collection triggered.")

            if self.device == "mps":
                if hasattr(torch, 'mps') and hasattr(torch.mps, 'empty_cache'):
                    torch.mps.empty_cache()
                    logger.info("MPS cache emptied.")
            elif self.device == "cuda":
                if hasattr(torch, 'cuda') and hasattr(torch.cuda, 'empty_cache'):
                    torch.cuda.empty_cache()
                    logger.info("CUDA cache emptied.")
        except Exception as e:
            logger.warning(f"Error during resource cleanup: {e}", exc_info=True)

    def _format_prompt(self, custom_attributes: Optional[List[str]] = None) -> str:
        """Format the prompt with simplified owner instruction."""
        # Generate owner_instruction based on parameters
        if self.same_owner:
            # Calculate number of items that should have same owner
            num_same_owner = int(self.num_products * self.same_owner_name_temperature)
            instruction = f"Generate {num_same_owner} items with the same owner name and ID. The remaining items should have different owners."
        else:
            instruction = "Generate items with different owner names and IDs."

        # Format custom attributes
        if custom_attributes:
            attributes_str = "\n".join(f"    - {attr}" for attr in custom_attributes)
        else:
            attributes_str = "    - (No specific custom attributes provided)"

        # Format the prompt
        return self.prompt_template.format(
            num_products=self.num_products,
            product_type=self.product_type,
            owner_instruction=instruction,  # Changed from same_owner_instruction to owner_instruction
            custom_attributes_str=attributes_str
        )

    def generate_dataset(self, custom_attributes: Optional[List[str]] = None) -> Dict[str, Any]:
        """Generates a dataset based on the provided attributes."""
        if self.model is None or self.tokenizer is None:
            logger.error("Model or tokenizer not loaded. Call _load_model() or use as context manager.")
            raise RuntimeError("Model not loaded. Cannot generate dataset.")

        formatted_prompt = self._format_prompt(custom_attributes)
        logger.debug(f"Formatted prompt:\n{formatted_prompt}")

        try:
            # --- Determine a safe max_length for truncation ---
            # Set a reasonable default max length
            default_model_max_len = 4096  # Conservative default
            effective_model_max_len = default_model_max_len

            # Try to get max length from model config first
            if hasattr(self.model, 'config') and hasattr(self.model.config, 'max_position_embeddings'):
                model_max_len = self.model.config.max_position_embeddings
                if isinstance(model_max_len, int) and 0 < model_max_len < 200000:
                    effective_model_max_len = model_max_len
                    logger.info(f"Using model.config.max_position_embeddings: {effective_model_max_len}")
            else:
                # Fallback to tokenizer's model_max_length if it's reasonable
                if isinstance(self.tokenizer.model_max_length, int) and 0 < self.tokenizer.model_max_length < 200000:
                    effective_model_max_len = self.tokenizer.model_max_length
                    logger.info(f"Using tokenizer.model_max_length: {effective_model_max_len}")
                else:
                    logger.warning(
                        f"Could not determine a reliable model_max_length. Defaulting to {default_model_max_len}. "
                        f"Model config max_position_embeddings: {getattr(getattr(self.model, 'config', None), 'max_position_embeddings', 'N/A')}"
                    )

            # Define max_new_tokens for generation
            max_new_tokens_for_generation = 2048
            
            # Calculate truncation_length for the input prompt
            buffer_tokens = 10
            truncation_length_for_input = effective_model_max_len - max_new_tokens_for_generation - buffer_tokens

            # Ensure truncation_length_for_input is positive and reasonable
            if truncation_length_for_input <= 0:
                logger.warning(
                    f"Calculated truncation_length_for_input ({truncation_length_for_input}) is too small. "
                    f"Adjusting to a safe value."
                )
                truncation_length_for_input = min(effective_model_max_len // 2, 2048)  # Use half of max length or 2048, whichever is smaller

            logger.info(f"Derived effective_model_max_len: {effective_model_max_len}")
            logger.info(f"Target max_new_tokens_for_generation: {max_new_tokens_for_generation}")
            logger.info(f"Calculated truncation_length_for_input: {truncation_length_for_input}")

            inputs = self.tokenizer(
                formatted_prompt,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=truncation_length_for_input
            )
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            pad_token_id = self.tokenizer.pad_token_id

            generation_config_params = {
                "input_ids": inputs["input_ids"],
                "attention_mask": inputs["attention_mask"],
                "max_new_tokens": max_new_tokens_for_generation,
                "do_sample": True,
                "temperature": 0.7,
                "top_p": 0.9,
                "num_return_sequences": 1,
                "pad_token_id": pad_token_id,
                "eos_token_id": self.tokenizer.eos_token_id
            }

            logger.info("Generating text...")
            with torch.no_grad():
                outputs = self.model.generate(**generation_config_params)

            generated_ids = outputs[0].to('cpu') if outputs[0].device.type != 'cpu' else outputs[0]
            generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
            logger.debug(f"Raw generated text:\n{generated_text}")

            if generated_text.strip().startswith(formatted_prompt.strip()):
                generated_text = generated_text.strip()[len(formatted_prompt.strip()):].strip()
            elif "Begin generation:" in generated_text:
                parts = generated_text.split("Begin generation:", 1)
                if len(parts) > 1:
                    generated_text = parts[1].strip()

            return self._process_output(generated_text)

        except Exception as e:
            logger.error(f"Dataset generation failed: {e}")
            logger.error(traceback.format_exc())
            raise RuntimeError(f"Generation failed: {str(e)}") from e

    def _process_output(self, generated_text: str) -> Dict[str, Any]:
        """Processes the generated text, attempting to parse JSON first."""
        logger.debug("Processing generated output...")
        
        def clean_generated_text(text):
            # Find the first occurrence of "item_number"
            start_index = text.find('"item_number"')
            if start_index != -1:
                # Find the start of the JSON array
                array_start = text.rfind('[', 0, start_index)
                if array_start != -1:
                    logger.info("Found JSON array start, cleaning text...")
                    return text[array_start:]
            logger.warning("Could not find proper JSON start, using original text")
            return text
        
        try:
            # Regex to find JSON block, accounts for optional "json" language specifier
            import re
            match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", generated_text, re.DOTALL)
            if match:
                json_str = match.group(1).strip()
                logger.info("Found JSON block in triple backticks.")
                try:
                    return {"json_data": json.loads(json_str)}
                except json.JSONDecodeError as je:
                    logger.warning(f"Failed to parse JSON from triple backticks: {je}. Raw content: {json_str}")
                    return {"attempted_json_parse_error": str(je), "raw_text_in_json_block": json_str, "full_text": generated_text}

            # If no triple backticks, try to parse the whole text if it looks like JSON
            stripped_text = generated_text.strip()
            if (stripped_text.startswith('{') and stripped_text.endswith('}')) or \
               (stripped_text.startswith('[') and stripped_text.endswith(']')):
                logger.info("Attempting to parse entire output as JSON.")
                try:
                    return {"json_data": json.loads(stripped_text)}
                except json.JSONDecodeError as je:
                    logger.warning(f"Failed to parse entire output as JSON: {je}")

        except Exception as e:
            logger.warning(f"Error during JSON parsing attempt: {e}")

        # If JSON parsing fails or isn't applicable, check for markdown
        if "```" in generated_text and "\n" in generated_text:
            logger.info("Output identified as potential markdown.")
            return {"markdown_content": generated_text}

        # Fallback to key-value pair extraction
        items = []
        for line in generated_text.split('\n'):
            line = line.strip()
            if ':' in line:
                key, value = line.split(':', 1)
                items.append({
                    "attribute": key.strip(),
                    "value": value.strip()
                })
        if items:
            logger.info("Output processed as key-value pairs.")
            return {"structured_items": items}

        # If all else fails, return the raw text
        logger.info("No specific structure found, returning raw text.")
        return {"raw_text": generated_text}

    def __enter__(self):
        """Context manager entry: Loads the model."""
        if not self._load_model():
            raise RuntimeError("Failed to load model resources for DatasetGenerator.")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit: Cleans up resources."""
        logger.info("Exiting context and cleaning up DatasetGenerator resources...")
        self._cleanup_resources()
        if exc_type:
            logger.error(f"Exception occurred in 'with' block: {exc_type.__name__}: {exc_val}")
        return False

In [9]:
if __name__ == "__main__":
    SELECTED_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
    #SELECTED_MODEL = 'meta-llama/Llama-3.1-8B'
    #SELECTED_MODEL = 'microsoft/Phi-3-mini-4k-instruct'

    PRODUCT_TYPE = "Ragnarok Online Items"
    NUM_PRODUCTS_TO_GENERATE = 10 # Define how many products you want

    # Example custom attributes for the product type
    # These will guide the LLM in generating relevant fields.
    custom_product_attributes = [
        "item_number: Number of the item",
        "item_name: Name of the item",
        "item_id: Unique identifier, e.g., #1234-abc123",
        "item_description: A brief description of the item's appearance, lore, or use",
        "item_price: Estimated market price in Zeny (integer)",
        "item_type: Category of the item (e.g., Weapon, Armor, Consumable, Accessory, Misc)",
        "item_quantity: Number of same item",
        "owner_name: Current or notable owner of the item (can be 'None' or a character/entity name)",
        "owner_id: Identifier for the owner, if applicable (e.g., #NPC123, #GUILD001)",
        "willing_to_trade: Boolean indicating if the item is typically available for trade (true/false)"
    ]

    try:
        # Using the generator as a context manager
        with DatasetGenerator(
            model_name=SELECTED_MODEL,
            num_products=NUM_PRODUCTS_TO_GENERATE,
            product_type=PRODUCT_TYPE,
            same_owner = True,
            same_owner_name_temperature=0.4
        ) as generator:

            logger.info(f"Generating dataset for {generator.product_type} (around {generator.num_products} products)...")
            # The generator's prompt should ideally guide the LLM to use these attributes as keys in JSON objects
            result = generator.generate_dataset(custom_product_attributes)

            df_to_save = None  # Initialize variable to hold the DataFrame

            logger.info("\n--- Processing Generated Result ---")
            if "json_data" in result and result["json_data"]:
                logger.info("Output Format: JSON Data found.")
                # Log the raw JSON for inspection if needed (can be verbose for large data)
                # logger.debug(f"Raw JSON data: {json.dumps(result['json_data'], indent=2)}")
                
                # After JSON parsing
                print("Raw JSON data length:", len(result["json_data"]))
                print("First item in JSON:", result["json_data"][0])
                
                if isinstance(result["json_data"], list):
                    if result["json_data"]: # Ensure list is not empty
                        try:
                            df_to_save = pd.DataFrame(result["json_data"])
                            logger.info(f"Created DataFrame from JSON list. Shape: {df_to_save.shape}")
                            
                            # After DataFrame creation
                            print("DataFrame shape:", df_to_save.shape)
                            print("First row of DataFrame:", df_to_save.iloc[0])
                        except Exception as e_df:
                            logger.warning(f"Could not create DataFrame from JSON list: {e_df}", exc_info=True)
                    else:
                        logger.info("JSON data is an empty list.")
                elif isinstance(result["json_data"], dict):
                    if result["json_data"]: # Ensure dict is not empty
                        try:
                            # Convert single dict to a list containing that dict for DataFrame creation
                            df_to_save = pd.DataFrame([result["json_data"]])
                            logger.info(f"Created DataFrame from single JSON object. Shape: {df_to_save.shape}")
                        except Exception as e_df:
                            logger.warning(f"Could not create DataFrame from single JSON object: {e_df}", exc_info=True)
                    else:
                        logger.info("JSON data is an empty dictionary.")
                else:
                    logger.warning("JSON data is not a list or dictionary, cannot convert to DataFrame directly.")

            elif "structured_items" in result and result["structured_items"]:
                logger.info("Output Format: Structured Items (Key-Value pairs) found.")
                # This format (list of {'attribute': key, 'value': val}) will create a "long" DataFrame.
                # If you need a "wide" table (database style) from this, further transformation is needed.
                # For now, we'll create the DataFrame as is from these items.
                try:
                    df_to_save = pd.DataFrame(result["structured_items"])
                    logger.info(f"Created DataFrame from structured items. Shape: {df_to_save.shape}")
                except Exception as e_df:
                    logger.warning(f"Could not create DataFrame from structured items: {e_df}", exc_info=True)
                    logger.info(f"Raw structured items: {result['structured_items']}")
            
            elif "markdown_content" in result:
                logger.info("Output Format: Markdown")
                # logger.debug(f"Markdown content:\n{result['markdown_content']}")
                # No DataFrame is directly created from markdown in this logic.
            elif "raw_text" in result:
                logger.info("Output Format: Raw Text")
                # logger.debug(f"Raw text content:\n{result['raw_text']}")
                # No DataFrame is directly created from raw text.
            elif "attempted_json_parse_error" in result:
                logger.warning(f"JSON parsing failed: {result['attempted_json_parse_error']}")
                logger.info(f"Raw text that was in JSON block: {result.get('raw_text_in_json_block', 'N/A')}")
                logger.info(f"Full generated text (for debugging prompt): {result.get('full_text', 'N/A')}")
            else:
                logger.info("Output Format: Unknown or result is empty.")
                # logger.debug(f"Full result: {json.dumps(result, indent=2)}")


            # --- Save the DataFrame if it was created ---
            # Transform the DataFrame from long to wide format
            if df_to_save is not None and isinstance(df_to_save, pd.DataFrame):
                if not df_to_save.empty:
                    try:
                        # Debug information before transformation
                        print("DataFrame before transformation:")
                        print(df_to_save.head())
                        
                        # Clean the data first
                        # Remove any rows where attribute or value is empty/NaN
                        df_to_save = df_to_save.dropna(subset=['attribute', 'value'])
                        
                        # Remove any rows that don't look like actual item data
                        # (e.g., remove header rows or non-item rows)
                        df_to_save = df_to_save[df_to_save['attribute'].str.contains('item_|owner_|willing_to_trade', na=False)]
                        
                        # Create item groups based on item_id occurrences
                        item_id_mask = df_to_save['attribute'].str.contains('"item_id"', na=False)
                        df_to_save['item_group'] = item_id_mask.cumsum()
                        
                        # Remove any rows before the first item_id (header rows)
                        first_item_id = df_to_save[item_id_mask].index[0] if any(item_id_mask) else 0
                        df_to_save = df_to_save.iloc[first_item_id:]
                        
                        # Clean up the attribute and value columns
                        df_to_save['attribute'] = df_to_save['attribute'].str.strip('"')
                        df_to_save['value'] = df_to_save['value'].str.strip('",')
                        
                        # Pivot the DataFrame
                        df_wide = df_to_save.pivot(index='item_group', columns='attribute', values='value')
                        
                        # Reset index and clean up
                        df_wide = df_wide.reset_index(drop=True)
                        
                        # Final validation
                        print("\nFinal DataFrame validation:")
                        print("Shape:", df_wide.shape)
                        print("Columns:", df_wide.columns.tolist())
                        print("Missing values:", df_wide.isnull().sum())
                        
                        # Replace the original DataFrame with the transformed one
                        df_to_save = df_wide
                        
                        # Save the dataset
                        save_dataset(df_to_save)
                        print("\nDataFrame saved successfully")
                        
                    except Exception as e:
                        logger.error(f"Error during DataFrame transformation: {e}")
                        raise

    except RuntimeError as e:
        logger.critical(f"A runtime error occurred in the main execution: {e}", exc_info=True)
    except Exception as e:
        logger.critical(f"An unexpected error occurred in the main execution: {e}", exc_info=True)

    logger.info("Dataset generation example finished.")

2025-05-28 00:01:44,307 - INFO - __main__ - MPS device is available and built.
2025-05-28 00:01:44,308 - INFO - __main__ - Attempting to load model and tokenizer on device: mps
2025-05-28 00:01:44,308 - INFO - __main__ - Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.2...
2025-05-28 00:01:44,712 - INFO - __main__ - Tokenizer loaded.
2025-05-28 00:01:44,712 - INFO - __main__ - Initial tokenizer pad_token: None, pad_token_id: None
2025-05-28 00:01:44,712 - INFO - __main__ - Initial tokenizer eos_token: </s>, eos_token_id: 2
2025-05-28 00:01:44,713 - INFO - __main__ - Initial tokenizer bos_token: <s>, bos_token_id: 1
2025-05-28 00:01:44,713 - INFO - __main__ - Initial tokenizer unk_token: <unk>, unk_token_id: 0
2025-05-28 00:01:44,713 - INFO - __main__ - Setting `pad_token` to `eos_token` ('</s>').
2025-05-28 00:01:44,714 - INFO - __main__ - After attempting to set: tokenizer pad_token: </s>, pad_token_id: 2
2025-05-28 00:01:44,714 - INFO - __main__ - Loading model mistralai/Mistr

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

2025-05-28 00:01:51,213 - INFO - __main__ - Model loaded from pretrained.
2025-05-28 00:01:51,213 - INFO - __main__ - Model is now on device: mps:0
2025-05-28 00:01:51,214 - INFO - __main__ - Model config pad_token_id: None
2025-05-28 00:01:51,214 - INFO - __main__ - Generating dataset for Ragnarok Online Items (around 10 products)...
2025-05-28 00:01:51,214 - INFO - __main__ - Using model.config.max_position_embeddings: 32768
2025-05-28 00:01:51,215 - INFO - __main__ - Derived effective_model_max_len: 32768
2025-05-28 00:01:51,215 - INFO - __main__ - Target max_new_tokens_for_generation: 2048
2025-05-28 00:01:51,216 - INFO - __main__ - Calculated truncation_length_for_input: 30710
2025-05-28 00:01:51,220 - INFO - __main__ - Generating text...
2025-05-28 00:03:46,486 - INFO - __main__ - Output processed as key-value pairs.
2025-05-28 00:03:46,486 - INFO - __main__ - 
--- Processing Generated Result ---
2025-05-28 00:03:46,487 - INFO - __main__ - Output Format: Structured Items (Key-Val

DataFrame before transformation:
            attribute                                              value
0              Item 1                                                   
1       "item_number"                                               "1",
2         "item_name"                                   "Shadow's Edge",
3           "item_id"                                    "#2345-abcdef",
4  "item_description"  "A cursed longsword imbued with dark magic, le...

Final DataFrame validation:
Shape: (13, 10)
Columns: ['item_description', 'item_id', 'item_name', 'item_number', 'item_price', 'item_quantity', 'item_type', 'owner_id', 'owner_name', 'willing_to_trade']
Missing values: attribute
item_description    0
item_id             1
item_name           1
item_number         1
item_price          0
item_quantity       0
item_type           0
owner_id            1
owner_name          0
willing_to_trade    1
dtype: int64

DataFrame saved successfully


2025-05-28 00:03:46,886 - INFO - __main__ - MPS cache emptied.
2025-05-28 00:03:46,886 - INFO - __main__ - Dataset generation example finished.
