### Ragnarok Online Trade Buddy

This generator creates a dataset of Ragnarok Online items, such as armor and weapons, with each item being one topic of the dataset.

The generator produces the requested number of items (called products in the parameters below), with each product being a topic of the dataset.

To do this, it fills a prompt template with the parameters and passes the result to a causal language model, which generates the dataset.

Parameters:
- model: The model to use to generate the dataset.
- prompt: The prompt to use to generate the dataset.
- product_type: The type of product to generate.
- num_products: The number of products to generate.
- custom_attributes: A list of custom attributes to add to the product.
- num_examples: The number of examples to generate for each product.
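
As a rough illustration of how these parameters end up in a prompt, the sketch below fills a shortened template with `str.format`. The names `TEMPLATE` and `build_prompt` are hypothetical and only stand in for the longer template used in the class further down. Literal JSON braces are doubled (`{{` and `}}`) so that `format` leaves them alone.

```python
from typing import List

# Shortened, hypothetical template; the doubled braces survive str.format as literal { }.
TEMPLATE = """Generate exactly {num_products} items of the type: {product_type}.
Return a single JSON array shaped like:
[
    {{"item_name": "Example Item", "item_price": 1000}}
]
Custom attributes to consider:
{custom_attributes_str}"""


def build_prompt(product_type: str, num_products: int, custom_attributes: List[str]) -> str:
    """Fill the template with the generator parameters."""
    attributes = "\n".join(f"- {attr}" for attr in custom_attributes) or "- (none provided)"
    return TEMPLATE.format(
        num_products=num_products,
        product_type=product_type,
        custom_attributes_str=attributes,
    )


prompt = build_prompt("Armor", 3, ["item_name: Name of the item"])
print(prompt)
```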

Output format options:
- JSON
- CSV
- Markdown
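
The notebook delegates saving to a `save_dataset` helper whose source is not shown here. As a minimal sketch of what the three output formats could look like, the hypothetical function `render_items` below serializes a list of item dictionaries using only the standard library. It is an illustration under those assumptions, not the helper's actual implementation.

```python
import csv
import io
import json
from typing import Any, Dict, List


def render_items(items: List[Dict[str, Any]], fmt: str) -> str:
    """Serialize item dicts as 'json', 'csv', or 'markdown' text."""
    if not items:
        return ""
    # Column order follows the first item; later items are aligned to it.
    columns = list(items[0].keys())
    if fmt == "json":
        return json.dumps(items, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(items)
        return buffer.getvalue()
    if fmt == "markdown":
        header = "| " + " | ".join(columns) + " |"
        divider = "| " + " | ".join("---" for _ in columns) + " |"
        rows = [
            "| " + " | ".join(str(item.get(col, "")) for col in columns) + " |"
            for item in items
        ]
        return "\n".join([header, divider, *rows])
    raise ValueError(f"Unsupported format: {fmt}")


sample = [{"item_name": "Knife", "item_price": 50}]
print(render_items(sample, "markdown"))
```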



In [3]:
import torch
import gc
import os
import json
import logging
import pandas as pd
import traceback # For detailed error logging
from typing import Any, Dict, List, Optional
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
from save_dataset import save_dataset

In [4]:
# --- Setup Logging ---
# Configure logging to output to console
# You can customize the format, level, and output (e.g., to a file)
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s')
logger = logging.getLogger(__name__) # Create a logger for this module

logger.info(f"PyTorch version: {torch.__version__}")
logger.info(f"MPS available: {torch.backends.mps.is_available()}")
logger.info(f"MPS built: {torch.backends.mps.is_built()}")

# --- Authentication ---
load_dotenv()
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    logger.warning("HF_TOKEN environment variable not found. Some operations might fail.")
else:
    try:
        login(hf_token, add_to_git_credential=True)
        logger.info("Successfully logged into Hugging Face Hub.")
    except Exception as e:
        logger.error(f"Failed to log into Hugging Face Hub: {e}")

2025-05-26 15:57:09,443 - INFO - PyTorch version: 2.6.0
2025-05-26 15:57:09,462 - INFO - MPS available: True
2025-05-26 15:57:09,462 - INFO - MPS built: True
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
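
The tokenizers warning names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed in the first cell so that it runs before any tokenizer is used:

```python
import os

# Set the variable named in the warning before any tokenizer work starts.
# setdefault keeps a value that was already exported in the shell.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

Setting it to `"false"` turns off the tokenizer's internal parallelism, which is the same state the library falls back to after the fork, so the main effect is to silence the repeated message.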

In [6]:
class DatasetGenerator:
    """
    Generates datasets using a model.
    Manages model loading, prompt formatting, generation, and resource cleanup.
    """

    DEFAULT_PROMPT_TEMPLATE = """
        You are an item dataset generator for the video game ragnarok online.
        Your goal is to generate exactly {num_products} items of the type: {product_type}.

        IMPORTANT: You MUST generate ALL items in a single JSON array with the following structure:
        [
            {{
                "item_number": 1,  # MUST start from 1 and be sequential
                "item_name": "Example Item",
                "item_id": "#1234-abc123",
                "item_description": "A brief description",
                "item_price": 1000000,
                "item_type": "Weapon",
                "item_quantity": 10,
                "owner_name": "Character Name",  # MUST NOT be null/None
                "owner_id": "#NPC123",  # MUST NOT be null/None
                "willing_to_trade": true
            }},
            {{
                "item_number": 2,
                ...
            }},
            ...  # Continue for all {num_products} items
        ]

        CRITICAL REQUIREMENTS:
        1. You MUST generate ALL {num_products} items in a single JSON array
        2. Do not generate items one at a time
        3. Do not include any text outside the JSON array
        4. Each item must be a complete JSON object
        5. All items must be in a single JSON array
        6. No partial or incomplete items
        7. Each item must have all required fields
        8. item_price must be a number (no currency symbols)
        9. item_id and owner_id must start with #
        10. willing_to_trade must be true or false
        11. No empty or null values allowed
        12. item_number must be sequential starting from 1
        13. owner_name and owner_id must NEVER be null/None

        Owner Requirements:
        {owner_instruction}

        Custom attributes to consider:
        {custom_attributes_str}

        Begin generation:"""

    def __init__(
        self,
        model_name: str,
        num_products: int = 10,
        product_type: str = "Ragnarok Online Items",
        prompt_template: Optional[str] = None,
        same_owner: bool = True,
        same_owner_name_temperature: float = 0.5
    ):
        """Initialize the DatasetGenerator with validation."""
        # Validate inputs
        if not isinstance(model_name, str) or not model_name:
            raise ValueError("model_name must be a non-empty string")
        
        if not isinstance(num_products, int) or num_products <= 0:
            raise ValueError("num_products must be a positive integer")
        
        if not isinstance(product_type, str) or not product_type:
            raise ValueError("product_type must be a non-empty string")
        
        if not isinstance(same_owner, bool):
            raise ValueError("same_owner must be a boolean")
        
        if not isinstance(same_owner_name_temperature, (int, float)) or not 0 <= same_owner_name_temperature <= 1:
            raise ValueError("same_owner_name_temperature must be a number between 0 and 1")

        self.model_name = model_name
        self.num_products = num_products
        self.product_type = product_type
        self.prompt_template = prompt_template or self.DEFAULT_PROMPT_TEMPLATE
        self.same_owner = same_owner
        self.same_owner_name_temperature = same_owner_name_temperature

        self.device: str = self._get_device()
        self.model: Optional[AutoModelForCausalLM] = None
        self.tokenizer: Optional[AutoTokenizer] = None

    def _get_device(self) -> str:
        """Determines and returns the best available device (mps, cuda, or cpu)."""
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            logger.info("MPS device is available and built.")
            return "mps"
        elif torch.cuda.is_available():
            logger.info("CUDA device is available.")
            return "cuda"
        else:
            logger.info("No GPU (MPS or CUDA) available, using CPU.")
            return "cpu"

    def _load_model(self) -> bool:
        """Loads the model and tokenizer onto the selected device."""
        if self.model is not None and self.tokenizer is not None:
            logger.info("Model and tokenizer already loaded.")
            return True

        logger.info(f"Attempting to load model and tokenizer on device: {self.device}")

        try:
            # 1. Load Tokenizer
            logger.info(f"Loading tokenizer for {self.model_name}...")
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                trust_remote_code=True
            )
            logger.info("Tokenizer loaded.")

            # 2. Set up tokenizer properly
            if self.tokenizer.pad_token is None:
                if self.tokenizer.eos_token is not None:
                    self.tokenizer.pad_token = self.tokenizer.eos_token
                    self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
                else:
                    self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
                    self.tokenizer.pad_token = '[PAD]'
                    self.tokenizer.pad_token_id = self.tokenizer.convert_tokens_to_ids('[PAD]')

            # 3. Load Model
            logger.info(f"Loading model {self.model_name}...")
            model_kwargs = {
                "trust_remote_code": True,
                "torch_dtype": torch.float16 if self.device != "cpu" else torch.float32
            }

            if self.device == "mps":
                model_kwargs["device_map"] = self.device

            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                **model_kwargs
            )
            logger.info("Model loaded from pretrained.")

            # 4. Update model config
            if self.model.config.pad_token_id is None:
                self.model.config.pad_token_id = self.tokenizer.pad_token_id

            # 5. Move to device if not already handled by device_map
            if self.device != "mps":
                logger.info(f"Moving model to device: {self.device}")
                self.model.to(self.device)

            logger.info(f"Model is now on device: {self.model.device}")
            return True

        except Exception as e:
            logger.error(f"Failed to load model or tokenizer for {self.model_name}: {e}")
            logger.error(traceback.format_exc())
            self.model = None
            self.tokenizer = None
            return False

    def _format_prompt(self, custom_attributes: Optional[List[str]] = None) -> str:
        """Format the prompt with simplified owner instruction."""
        # Generate owner_instruction based on parameters
        if self.same_owner:
            # Calculate number of items that should have same owner
            num_same_owner = int(self.num_products * self.same_owner_name_temperature)
            instruction = f"Generate {num_same_owner} items with the same owner name and ID. The remaining items should have different owners."
        else:
            instruction = "Generate items with different owner names and IDs."

        # Format custom attributes
        if custom_attributes:
            attributes_str = "\n".join(f"    - {attr}" for attr in custom_attributes)
        else:
            attributes_str = "    - (No specific custom attributes provided)"

        # Format the prompt
        return self.prompt_template.format(
            num_products=self.num_products,
            product_type=self.product_type,
            owner_instruction=instruction,
            custom_attributes_str=attributes_str
        )

    def generate_dataset(self, custom_attributes: Optional[List[str]] = None) -> Dict[str, Any]:
        """Generates a dataset based on the provided attributes."""
        if self.model is None or self.tokenizer is None:
            logger.error("Model or tokenizer not loaded. Call _load_model() or use as context manager.")
            raise RuntimeError("Model not loaded. Cannot generate dataset.")

        formatted_prompt = self._format_prompt(custom_attributes)
        logger.debug(f"Formatted prompt:\n{formatted_prompt}")

        try:
            # Calculate truncation length
            truncation_length_for_input = self.model.config.max_position_embeddings - 2048  # Reserve space for generation
            logger.info(f"Using model.config.max_position_embeddings: {self.model.config.max_position_embeddings}")
            logger.info(f"Derived effective_model_max_len: {truncation_length_for_input}")

            # Set up generation parameters
            generation_config_params = {
                "max_new_tokens": 2048,
                "temperature": 0.7,
                "top_p": 0.9,
                "do_sample": True,
                "pad_token_id": self.tokenizer.pad_token_id,
                "eos_token_id": self.tokenizer.eos_token_id,
            }

            # Tokenize input
            inputs = self.tokenizer(
                formatted_prompt,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=truncation_length_for_input
            )
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            # Generate text
            logger.info("Generating text...")
            with torch.no_grad():
                # Pass the tokenized prompt; without it generate() has nothing to continue from
                outputs = self.model.generate(**inputs, **generation_config_params)
            generated_ids = outputs[0].to('cpu') if outputs[0].device.type != 'cpu' else outputs[0]

            generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
            logger.debug(f"Raw generated text:\n{generated_text}")

            # Add validation step
            if "[" not in generated_text or "]" not in generated_text:
                logger.warning("Generated text does not contain a JSON array. Retrying with stronger prompt...")
                # Add system message to enforce format
                system_message = "You MUST generate all items in a single JSON array. Do not generate items one at a time."
                formatted_prompt = f"{system_message}\n\n{formatted_prompt}"
                # Retry generation
                inputs = self.tokenizer(formatted_prompt, return_tensors="pt", padding=True, truncation=True, max_length=truncation_length_for_input)
                inputs = {k: v.to(self.device) for k, v in inputs.items()}
                with torch.no_grad():
                    # Pass the re-tokenized prompt along with the generation settings
                    outputs = self.model.generate(**inputs, **generation_config_params)
                generated_ids = outputs[0].to('cpu') if outputs[0].device.type != 'cpu' else outputs[0]
                generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)

            if generated_text.strip().startswith(formatted_prompt.strip()):
                generated_text = generated_text.strip()[len(formatted_prompt.strip()):].strip()
            elif "Begin generation:" in generated_text:
                parts = generated_text.split("Begin generation:", 1)
                if len(parts) > 1:
                    generated_text = parts[1].strip()

            return self._process_output(generated_text)

        except Exception as e:
            logger.error(f"Dataset generation failed: {e}")
            logger.error(traceback.format_exc())
            raise RuntimeError(f"Generation failed: {str(e)}") from e

    def _process_output(self, generated_text: str) -> Dict[str, Any]:
        """Processes the generated text, attempting to parse JSON first."""
        logger.debug("Processing generated output...")
        try:
            # First try to parse as JSON array
            try:
                if generated_text.strip().startswith('[') and generated_text.strip().endswith(']'):
                    json_data = json.loads(generated_text)
                    if isinstance(json_data, list) and len(json_data) > 0:
                        # Validate each item has required fields
                        valid_items = []
                        for item in json_data:
                            if all(key in item for key in ['item_name', 'item_description', 'item_price', 'item_type']):
                                valid_items.append(item)
                        if valid_items:
                            return {"json_data": valid_items}
            except json.JSONDecodeError:
                pass

            # If JSON parsing fails, try to parse as structured items
            lines = generated_text.split('\n')
            items = []
            current_item = {}
            
            for line in lines:
                line = line.strip()
                if not line:
                    if current_item and all(key in current_item for key in ['item_name', 'item_description', 'item_price', 'item_type']):
                        items.append(current_item)
                    current_item = {}
                    continue
                
                if ':' in line:
                    key, value = line.split(':', 1)
                    key = key.strip().strip('"')
                    value = value.strip().strip('",')
                    if key and value:
                        current_item[key] = value
            
            # Add the last item if it's valid
            if current_item and all(key in current_item for key in ['item_name', 'item_description', 'item_price', 'item_type']):
                items.append(current_item)
            
            if items:
                return {"json_data": items}
            
            return {"raw_text": generated_text}
            
        except Exception as e:
            logger.warning(f"Error during output processing: {e}")
            return {"raw_text": generated_text}

    def _cleanup_resources(self):
        """Cleans up model and tokenizer resources and clears GPU cache if applicable."""
        logger.info("Cleaning up resources...")
        try:
            if self.model is not None:
                del self.model
                self.model = None
                logger.debug("Model deleted.")

            if self.tokenizer is not None:
                del self.tokenizer
                self.tokenizer = None
                logger.debug("Tokenizer deleted.")

            gc.collect()
            logger.debug("Garbage collection triggered.")

            if self.device == "mps":
                if hasattr(torch, 'mps') and hasattr(torch.mps, 'empty_cache'):
                    torch.mps.empty_cache()
                    logger.info("MPS cache emptied.")
            elif self.device == "cuda":
                if hasattr(torch, 'cuda') and hasattr(torch.cuda, 'empty_cache'):
                    torch.cuda.empty_cache()
                    logger.info("CUDA cache emptied.")
        except Exception as e:
            logger.warning(f"Error during resource cleanup: {e}", exc_info=True)

    def __enter__(self):
        """Context manager entry: Loads the model."""
        if not self._load_model():
            raise RuntimeError("Failed to load model resources for DatasetGenerator.")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit: Cleans up resources."""
        logger.info("Exiting context and cleaning up DatasetGenerator resources...")
        self._cleanup_resources()
        if exc_type:
            logger.error(f"Exception occurred in 'with' block: {exc_type.__name__}: {exc_val}")
        return False
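
`_process_output` only attempts `json.loads` when the whole response starts with `[` and ends with `]`, so an array wrapped in extra commentary falls through to the line-based parser. A minimal sketch of a more tolerant extractor is shown below. The helper name `extract_json_array` is hypothetical and not part of the class above; it relies on the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows it.

```python
import json
from typing import Any, List, Optional


def extract_json_array(text: str) -> Optional[List[Any]]:
    """Return the first decodable JSON array found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode stops at the end of the first complete JSON value.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        # Not a valid array at this bracket; try the next one.
        start = text.find("[", start + 1)
    return None


print(extract_json_array('Sure, here you go: [{"item_name": "Knife"}] Enjoy!'))
```

Truncated output (an array cut off by the token limit) still yields `None` here, so a fallback such as the existing line-based parser remains useful.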

In [8]:
if __name__ == "__main__":
    SELECTED_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
    PRODUCT_TYPE = "Ragnarok Online Items"
    NUM_PRODUCTS_TO_GENERATE = 20

    custom_product_attributes = [
        "item_number: Number of the item",
        "item_name: Name of the item",
        "item_id: Unique identifier, e.g., #1234-abc123",
        "item_description: A brief description of the item's appearance, lore, or use",
        "item_price: Estimated market price in Zeny (integer)",
        "item_type: Category of the item (e.g., Weapon, Armor, Consumable, Accessory, Misc)",
        "item_quantity: Number of same item",
        "owner_name: Current or notable owner of the item (a character or entity name; never null/None)",
        "owner_id: Identifier for the owner, if applicable (e.g., #NPC123, #GUILD001)",
        "willing_to_trade: Boolean indicating if the item is typically available for trade (true/false)"
    ]

    try:
        with DatasetGenerator(
            model_name=SELECTED_MODEL,
            num_products=NUM_PRODUCTS_TO_GENERATE,
            product_type=PRODUCT_TYPE,
            same_owner=True,
            same_owner_name_temperature=0.4
        ) as generator:
            logger.info(f"Generating dataset for {generator.product_type} (around {generator.num_products} products)...")
            result = generator.generate_dataset(custom_product_attributes)
            
            df_to_save = None  # Initialize variable to hold the DataFrame

            logger.info("\n--- Processing Generated Result ---")
            if "json_data" in result and result["json_data"]:
                logger.info("Output Format: JSON Data found.")
                try:
                    df_to_save = pd.DataFrame(result["json_data"])
                    logger.info(f"Created DataFrame with {len(df_to_save)} items")
                    
                    # Ensure all required fields exist
                    required_fields = ['item_name', 'item_description', 'item_price', 'item_type']
                    missing_fields = [field for field in required_fields if field not in df_to_save.columns]
                    if missing_fields:
                        logger.error(f"Missing required fields: {missing_fields}")
                        logger.error("Skipping data processing due to missing required fields")
                    else:
                        # Clean up the data
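                        for col in df_to_save.columns:
                            if df_to_save[col].dtype == 'object':
                                # Strip stray quotes/commas from strings only; other values pass through unchanged
                                df_to_save[col] = df_to_save[col].map(
                                    lambda v: v.strip('",') if isinstance(v, str) else v
                                )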
                        
                        # Reset index and set item numbers
                        df_to_save = df_to_save.reset_index(drop=True)
                        df_to_save['item_number'] = range(1, len(df_to_save) + 1)
                        
                        save_dataset(df_to_save)
                        logger.info("Dataset saved successfully")
                    
                except Exception as e:
                    logger.error(f"Error processing JSON data: {e}")
                    logger.error(traceback.format_exc())
            else:
                logger.error("No valid JSON data found in the generated output.")

    except RuntimeError as e:
        logger.critical(f"A runtime error occurred in the main execution: {e}", exc_info=True)
    except Exception as e:
        logger.critical(f"An unexpected error occurred in the main execution: {e}", exc_info=True)

    logger.info("Dataset generation example finished.")

2025-05-26 16:04:06,589 - INFO - MPS device is available and built.
2025-05-26 16:04:06,590 - INFO - Attempting to load model and tokenizer on device: mps
2025-05-26 16:04:06,590 - INFO - Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.2...
2025-05-26 16:04:06,734 - INFO - Tokenizer loaded.
2025-05-26 16:04:06,734 - INFO - Loading model mistralai/Mistral-7B-Instruct-v0.2...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

2025-05-26 16:04:11,183 - INFO - Model loaded from pretrained.
2025-05-26 16:04:11,184 - INFO - Model is now on device: mps:0
2025-05-26 16:04:11,184 - INFO - Generating dataset for Ragnarok Online Items (around 20 products)...
2025-05-26 16:04:11,184 - INFO - Using model.config.max_position_embeddings: 32768
2025-05-26 16:04:11,185 - INFO - Derived effective_model_max_len: 30720
2025-05-26 16:04:11,188 - INFO - Generating text...
2025-05-26 16:04:15,987 - INFO - 
--- Processing Generated Result ---
2025-05-26 16:04:15,988 - ERROR - No valid JSON data found in the generated output.
2025-05-26 16:04:15,988 - INFO - Exiting context and cleaning up DatasetGenerator resources...
2025-05-26 16:04:15,988 - INFO - Cleaning up resources...
2025-05-26 16:04:16,230 - INFO - MPS cache emptied.
2025-05-26 16:04:16,230 - INFO - Dataset generation example finished.
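
The prompt asks the model for a numeric `item_price`, ids starting with `#`, a boolean `willing_to_trade`, and sequential `item_number` values, but the code only checks that four fields are present. A minimal sketch of a stricter check follows, assuming the parsed items are plain dictionaries. The function `find_violations` and the constant `REQUIRED_FIELDS` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List

# Field names taken from the JSON structure in the prompt template.
REQUIRED_FIELDS = (
    "item_number", "item_name", "item_id", "item_description", "item_price",
    "item_type", "item_quantity", "owner_name", "owner_id", "willing_to_trade",
)


def find_violations(items: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable rule violations; an empty list means all items pass."""
    problems: List[str] = []
    for position, item in enumerate(items, start=1):
        missing = [f for f in REQUIRED_FIELDS if item.get(f) in (None, "")]
        if missing:
            problems.append(f"item {position}: missing or empty {missing}")
            continue
        if item["item_number"] != position:
            problems.append(f"item {position}: item_number is {item['item_number']!r}")
        price = item["item_price"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            problems.append(f"item {position}: item_price is not a number")
        for id_field in ("item_id", "owner_id"):
            if not str(item[id_field]).startswith("#"):
                problems.append(f"item {position}: {id_field} does not start with '#'")
        if not isinstance(item["willing_to_trade"], bool):
            problems.append(f"item {position}: willing_to_trade is not true/false")
    return problems


good_item = {
    "item_number": 1, "item_name": "Knife", "item_id": "#1201-abc123",
    "item_description": "A small blade.", "item_price": 50, "item_type": "Weapon",
    "item_quantity": 3, "owner_name": "Tool Dealer", "owner_id": "#NPC123",
    "willing_to_trade": True,
}
bad_item = dict(good_item, item_id="1201", item_price="1,000")
print(find_violations([good_item]))
print(find_violations([bad_item]))
```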
