### Product Dataset Generator

This generator creates a dataset of products of a requested type; each product is one topic (entry) of the dataset.

It combines a prompt with a language model and generates the products iteratively.

Parameters:
- model: The model to use to generate the dataset.
- prompt: The prompt to use to generate the dataset.
- product_type: The type of product to generate.
- num_products: The number of products to generate.
- custom_attributes: A list of custom attributes to add to the product.
- num_examples: The number of examples to generate for each product.

Output format options:
- JSON
- CSV
- Markdown
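
The three output formats listed above could be produced from a list of product dictionaries along these lines. This is a minimal sketch using only the standard library; the `export_products` helper and the sample records are hypothetical and not part of the generator's code.

```python
import csv
import io
import json
from typing import Dict, List


def export_products(products: List[Dict[str, str]], fmt: str) -> str:
    """Serialize product records as JSON, CSV, or a Markdown table (hypothetical helper)."""
    if not products:
        return ""
    # Use the first record's keys as the column order.
    columns = list(products[0].keys())
    fmt = fmt.lower()
    if fmt == "json":
        return json.dumps(products, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
        writer.writeheader()
        writer.writerows(products)
        return buffer.getvalue()
    if fmt == "markdown":
        header = "| " + " | ".join(columns) + " |"
        divider = "| " + " | ".join("---" for _ in columns) + " |"
        rows = ["| " + " | ".join(str(p.get(c, "")) for c in columns) + " |" for p in products]
        return "\n".join([header, divider, *rows])
    raise ValueError(f"Unsupported format: {fmt}")


# Made-up sample records, for illustration only.
sample = [
    {"name": "Pulse Oximeter", "manufacturer": "Acme"},
    {"name": "Stethoscope", "manufacturer": "Globex"},
]
print(export_products(sample, "markdown"))
```

Taking the column order from the first record keeps the CSV and Markdown headers aligned; records with extra keys would need a different policy.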



In [14]:
import torch
import gc
import os
import json
import logging
import pandas as pd
from typing import List, Dict, Optional, Any # Added Optional and Any
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
import traceback # For detailed error logging

In [16]:
# Model Options
MODEL_MISTRAL = 'mistralai/Mistral-7B-Instruct-v0.2'
MODEL_PHI = 'microsoft/Phi-3-mini-4k-instruct'

In [12]:
# Prompt Options
PROMPT = """
  You are a products dataset generator.
  You will be given a list of optional attributes and a list of required attributes.
  Just make sure to generate the products as outputs

  The products output should be human readable or JSON format and generate a comprehensive
  data set depending on the product type requested by the user.
"""


In [18]:
# --- Setup Logging ---
# Configure logging to output to console
# You can customize the format, level, and output (e.g., to a file)
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s')
logger = logging.getLogger(__name__) # Create a logger for this module

logger.info(f"PyTorch version: {torch.__version__}")
logger.info(f"MPS available: {torch.backends.mps.is_available()}")
logger.info(f"MPS built: {torch.backends.mps.is_built()}")

# --- Authentication
load_dotenv()
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    logger.warning("HF_TOKEN environment variable not found. Some operations might fail.")
else:
    try:
        login(hf_token, add_to_git_credential=True)
        logger.info("Successfully logged into Hugging Face Hub.")
    except Exception as e:
        logger.error(f"Failed to log into Hugging Face Hub: {e}")

2025-05-20 18:47:46,395 - INFO - __main__ - PyTorch version: 2.6.0
2025-05-20 18:47:46,396 - INFO - __main__ - MPS available: True
2025-05-20 18:47:46,396 - INFO - __main__ - MPS built: True
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
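
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in the first cell of the notebook, before any tokenizer is used:

```python
import os

# Set this before the tokenizers library is first used in the process.
# "false" turns off its internal parallelism, so a later fork no longer
# triggers the deadlock warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

The same variable can also be placed in the `.env` file that `load_dotenv()` reads, provided that call happens before the tokenizer is first used.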

In [21]:
# Dataset Generator
class DatasetGenerator:
    def __init__(
        self,
        model_name: str,
        num_products: int = 5,
        product_type: str = "Random Products",
        device: str = "mps"  # For M4 Max chip
    ):
        self.model_name = model_name
        self.num_products = num_products
        self.product_type = product_type
        self.device = device
        self.model = None
        self.tokenizer = None
        self._load_model()

    def _get_device(self):
        """Get the best available device"""
        if torch.backends.mps.is_available():
            print("Using MPS device")
            return "mps"
        elif torch.cuda.is_available():
            print("Using CUDA device")
            return "cuda"
        else:
            print("Using CPU device")
            return "cpu"
    
    def _load_model(self) -> bool:
        """
        Loads the model and tokenizer onto the selected device.
        `trust_remote_code=True` is used; exercise caution with untrusted models.
        Sets a pad_token if one is not already defined.
        """
        if self.model is not None and self.tokenizer is not None:
            logger.info("Model and tokenizer already loaded.")
            return True

        logger.info(f"Attempting to load model and tokenizer on device: {self.device}")
        try:
            logger.info(f"Loading tokenizer for {self.model_name}...")
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                trust_remote_code=True  # CAUTION: Only use with trusted models
            )
            logger.info("Tokenizer loaded successfully.")

            # Fall back to a usable pad token when the tokenizer does not define one.
            if self.tokenizer.pad_token is None:
                if self.tokenizer.eos_token is not None:
                    logger.info("Tokenizer does not have a pad_token, setting it to eos_token.")
                    self.tokenizer.pad_token = self.tokenizer.eos_token
                else:
                    # This case is less common for generative models but good to handle
                    logger.warning("Tokenizer has no pad_token and no eos_token. Adding a new [PAD] token.")
                    self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
                    # A genuinely new token requires resizing the model's token
                    # embeddings once the model is loaded (see the note after loading).


            logger.info(f"Loading model {self.model_name}...")
            if self.device == "mps":
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    trust_remote_code=True,
                    torch_dtype=torch.float16,
                    device_map=self.device
                )
            else:
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    trust_remote_code=True,
                    torch_dtype=torch.float16
                )
                self.model.to(self.device)

            # If you added a new pad_token that wasn't eos_token, and it actually added a new token
            # to the vocabulary (not just aliasing an existing one), you might need to resize
            # the model's token embeddings here.
            # However, setting pad_token = eos_token usually doesn't change vocab size.
            # If you used `tokenizer.add_special_tokens({'pad_token': '[PAD]'})` and '[PAD]' was new,
            # you would do:
            # if self.tokenizer.pad_token == '[PAD]': # Or a more robust check if a new token was truly added
            #     self.model.resize_token_embeddings(len(self.tokenizer))
            #     logger.info(f"Resized model token embeddings to {len(self.tokenizer)}")


            logger.info(f"Model loaded successfully on {self.device}.")
            return True
        except Exception as e:
            logger.error(f"Failed to load model or tokenizer for {self.model_name}: {e}")
            logger.error(traceback.format_exc())
            self.model = None
            self.tokenizer = None
            return False

    def _cleanup_resources(self):
        """Clean up model resources while maintaining MPS device"""
        try:
            if hasattr(self, 'model') and self.model is not None:
                # Keep model on MPS, just delete the reference
                del self.model
                self.model = None
                
            if hasattr(self, 'tokenizer') and self.tokenizer is not None:
                del self.tokenizer
                self.tokenizer = None
                
            # Force garbage collection
            gc.collect()
            
            # Clear MPS cache
            if torch.backends.mps.is_available():
                torch.mps.empty_cache()
                
        except Exception as e:
            print(f"Warning: Error during cleanup: {str(e)}")

    def _format_prompt(self, custom_attributes: List[str] = None) -> str:
        """Format the prompt with custom attributes"""
        base_prompt = PROMPT
        if custom_attributes:
            attributes_str = "\n".join([f"    - {attr}" for attr in custom_attributes])
            base_prompt += f"\n  Custom attributes:\n{attributes_str}"
        return base_prompt

    def generate_dataset(self, custom_attributes: List[str] = None) -> Dict:
        try:
            if self.model is None:
                if not self._load_model():
                    raise Exception("Failed to load model")

            device = self._get_device()
            
            # Format prompt
            formatted_prompt = self._format_prompt(custom_attributes)
            
            # Generate
            inputs = self.tokenizer(formatted_prompt, return_tensors="pt")
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Create generation config
            generation_config = {
                "input_ids": inputs["input_ids"],
                "attention_mask": inputs["attention_mask"],
                "max_new_tokens": 512,
                "do_sample": False,
                "num_return_sequences": 1,
                "pad_token_id": self.tokenizer.pad_token_id,
                "eos_token_id": self.tokenizer.eos_token_id
            }
            
            # Generate with specific config
            outputs = self.model.generate(**generation_config)
            
            # Process output
            generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
            
            # Cleanup after generation
            self._cleanup_resources()
            
            return self._process_output(generated_text)
        
        except Exception as e:
            self._cleanup_resources()
            raise RuntimeError(f"Generation failed: {e}") from e

    def _process_output(self, generated_text: str) -> Dict:
        """Process the generated text into a structured format, accepting markdown"""
        try:
            # If the text is already in markdown format, return it as is
            if "```" in generated_text:
                return {"markdown": generated_text}
                
            # Otherwise, try to extract key-value pairs
            items = []
            for line in generated_text.split('\n'):
                if ':' in line:
                    key, value = line.split(':', 1)
                    items.append({
                        "attribute": key.strip(),
                        "value": value.strip()
                    })
            
            if items:
                return {"items": items}
            else:
                # If no structured data found, return the raw text
                return {"text": generated_text}
                
        except Exception as e:
            print(f"Error processing output: {str(e)}")
            print(f"Raw generated text: {generated_text}")
            # Return the raw text if processing fails
            return {"text": generated_text}

# Example usage
if __name__ == "__main__":
    # Create generator instance
    generator = DatasetGenerator(
        model_name=MODEL_MISTRAL,
        num_products=5,
        product_type="Medical Equipment"
    )
    
    # Example custom attributes
    custom_attrs = [
        "manufacturer: name",
        "ingredients: list",
        "dosage_form: type"
    ]
    
    # Generate dataset
    result = generator.generate_dataset(custom_attrs)
    #print(json.dumps(result, indent=2))
    print(pd.DataFrame(result)) 

2025-05-20 23:54:13,541 - INFO - __main__ - Attempting to load model and tokenizer on device: mps
2025-05-20 23:54:13,542 - INFO - __main__ - Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.2...
2025-05-20 23:54:13,923 - INFO - __main__ - Tokenizer loaded successfully.
2025-05-20 23:54:13,924 - INFO - __main__ - Tokenizer does not have a pad_token, setting it to eos_token.
2025-05-20 23:54:13,924 - INFO - __main__ - Loading model mistralai/Mistral-7B-Instruct-v0.2...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

2025-05-20 23:54:16,553 - INFO - __main__ - Model loaded successfully on mps.


Using MPS device
                                                items
0     {'attribute': 'Custom attributes', 'value': ''}
1    {'attribute': '- manufacturer', 'value': 'name'}
2     {'attribute': '- ingredients', 'value': 'list'}
3     {'attribute': '- dosage_form', 'value': 'type'}
4           {'attribute': '- shape', 'value': 'type'}
5           {'attribute': '- color', 'value': 'type'}
6            {'attribute': '- size', 'value': 'type'}
7   {'attribute': '- packaging_type', 'value': 'ty...
8   {'attribute': '- expiration_date', 'value': 'd...
9   {'attribute': '- batch_number', 'value': 'numb...
10   {'attribute': '- net_weight', 'value': 'weight'}
11  {'attribute': '- storage_instructions', 'value...
13   {'attribute': '- side_effects', 'value': 'list'}
14  {'attribute': '- contraindications', 'value': ...
15    {'attribute': '- indications', 'value': 'list'}
16  {'attribute': '- route_of_administration', 'va...
17       {'attribute': '- strength', 'value': 'type'}
18    {'att
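
Rows 0 to 3 of the table above are the prompt's own `Custom attributes` list, not generated data: a decoder-only model's `generate()` output begins with the input tokens, and the key-value parser picks them up. One common way to keep only the continuation is to slice off the prompt's token ids before decoding. The sketch below uses plain Python lists in place of tensors; `strip_prompt_ids` is a hypothetical helper, and the ids are made up.

```python
from typing import List


def strip_prompt_ids(output_ids: List[int], prompt_len: int) -> List[int]:
    """Return only the newly generated token ids (hypothetical helper)."""
    # Decoder-only models return the prompt ids followed by the new ids,
    # so everything before prompt_len is an echo of the input.
    return output_ids[prompt_len:]


prompt_ids = [1, 733, 16289]          # made-up ids standing in for the prompt
output_ids = prompt_ids + [415, 2]    # made-up ids: prompt echo + continuation
new_ids = strip_prompt_ids(output_ids, len(prompt_ids))
print(new_ids)  # [415, 2]
```

With the tensors used in `generate_dataset`, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `self.tokenizer.decode`.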

In [24]:
# --- Dataset Generator Class ---
class DatasetGenerator:
    """
    Generates datasets using a Hugging Face model.
    Manages model loading, prompt formatting, generation, and resource cleanup.
    """

    DEFAULT_PROMPT_TEMPLATE = """
      You are a products dataset generator.
      Your goal is to generate a list of {num_products} products of the type: {product_type}.
      You will be given a list of optional custom attributes to include for each product.
      Please ensure the output is a comprehensive dataset.

      The output should ideally be a JSON list of objects, where each object represents a product.
      If JSON is not possible, provide a clear, human-readable format.

      Custom attributes to consider:
      {custom_attributes_str}

      Begin generation:
      """

    def __init__(
        self,
        model_name: str,
        num_products: int = 5,
        product_type: str = "Random Products",
        prompt_template: Optional[str] = None
    ):
        """
        Initializes the DatasetGenerator.

        Args:
            model_name (str): The name of the Hugging Face model to use.
            num_products (int): The number of products to generate.
            product_type (str): The type of products to generate.
            prompt_template (Optional[str]): An optional custom prompt template.
                                            If None, uses DEFAULT_PROMPT_TEMPLATE.
        """
        self.model_name = model_name
        self.num_products = num_products
        self.product_type = product_type
        self.prompt_template = prompt_template or self.DEFAULT_PROMPT_TEMPLATE

        self.device: str = self._get_device()
        self.model: Optional[AutoModelForCausalLM] = None
        self.tokenizer: Optional[AutoTokenizer] = None
        # Model and tokenizer are loaded via __enter__ or an explicit load method

    def _get_device(self) -> str:
        """Determines and returns the best available device (mps, cuda, or cpu)."""
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            logger.info("MPS device is available and built.")
            return "mps"
        elif torch.cuda.is_available():
            logger.info("CUDA device is available.")
            return "cuda"
        else:
            logger.info("No GPU (MPS or CUDA) available, using CPU.")
            return "cpu"

    def _load_model(self) -> bool:
        """
        Loads the model and tokenizer onto the selected device.
        `trust_remote_code=True` is used; exercise caution with untrusted models.
        Aggressively sets a pad_token if one is not already defined and resizes model
        embeddings if a new token is added.
        """
        if self.model is not None and self.tokenizer is not None:
            logger.info("Model and tokenizer already loaded.")
            return True

        logger.info(f"Attempting to load model and tokenizer on device: {self.device}")
        new_pad_token_added_to_vocab = False # Flag to track if we add a genuinely new token

        try:
            # 1. Load Tokenizer
            logger.info(f"Loading tokenizer for {self.model_name}...")
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                trust_remote_code=True  # CAUTION: Only use with trusted models
            )
            logger.info("Tokenizer loaded.")

            # 2. Debug and Set Pad Token
            logger.info(f"Initial tokenizer pad_token: {self.tokenizer.pad_token}, pad_token_id: {self.tokenizer.pad_token_id}")
            logger.info(f"Initial tokenizer eos_token: {self.tokenizer.eos_token}, eos_token_id: {self.tokenizer.eos_token_id}")
            logger.info(f"Initial tokenizer bos_token: {self.tokenizer.bos_token}, bos_token_id: {self.tokenizer.bos_token_id}") # Just for more info
            logger.info(f"Initial tokenizer unk_token: {self.tokenizer.unk_token}, unk_token_id: {self.tokenizer.unk_token_id}") # Just for more info


            if self.tokenizer.pad_token is None:
                logger.warning("Tokenizer `pad_token` is None. Attempting to set it.")
                if self.tokenizer.eos_token is not None:
                    logger.info(f"Setting `pad_token` to `eos_token` ('{self.tokenizer.eos_token}').")
                    self.tokenizer.pad_token = self.tokenizer.eos_token
                else:
                    # This is a more problematic case: no pad_token and no eos_token
                    logger.warning("Tokenizer `eos_token` is also None. Adding a new `[PAD]` special token.")
                    # Check current vocab size
                    original_vocab_size = len(self.tokenizer)
                    self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
                    # Check if vocab size actually changed
                    if len(self.tokenizer) > original_vocab_size:
                        new_pad_token_added_to_vocab = True
                        logger.info(f"Added new special token '[PAD]'. Tokenizer vocab size changed from {original_vocab_size} to {len(self.tokenizer)}.")
                    else:
                        logger.info("Tried to add '[PAD]', but vocab size did not change (it might have existed or aliased).")

            logger.info(f"After attempting to set: tokenizer pad_token: {self.tokenizer.pad_token}, pad_token_id: {self.tokenizer.pad_token_id}")

            # 3. Load Model
            logger.info(f"Loading model {self.model_name}...")
            # Note: torch_dtype might need adjustment based on actual device capabilities (e.g. CPU might not like float16)
            model_kwargs = {
                "trust_remote_code": True, # CAUTION
                "torch_dtype": torch.float16 if self.device != "cpu" else torch.float32 # float16 often problematic on CPU
            }

            if self.device == "mps":
                model_kwargs["device_map"] = self.device # Let MPS handle mapping if specified
            # For other devices (cuda, cpu), we'll load and then .to(device)

            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                **model_kwargs
            )
            logger.info("Model loaded from pretrained.")

            # 4. Resize Token Embeddings if a NEW token was added to tokenizer
            if new_pad_token_added_to_vocab:
                logger.info(f"Resizing model token embeddings to match new tokenizer vocab size: {len(self.tokenizer)}")
                self.model.resize_token_embeddings(len(self.tokenizer))
                # After resizing, it's good to check if the new pad_token_id in model config matches.
                # The model's config might also need updating for this new pad_token_id if it was just added.
                if self.model.config.pad_token_id != self.tokenizer.pad_token_id:
                     logger.info(f"Updating model's config pad_token_id from {self.model.config.pad_token_id} to {self.tokenizer.pad_token_id}")
                     self.model.config.pad_token_id = self.tokenizer.pad_token_id


            # 5. Move to device if not already handled by device_map
            if self.device != "mps": # If device_map wasn't used for MPS
                logger.info(f"Moving model to device: {self.device}")
                self.model.to(self.device)

            logger.info(f"Model is now on device: {self.model.device}")
            logger.info(f"Model config pad_token_id: {self.model.config.pad_token_id}")


            # Final check of tokenizer pad_token_id for sanity
            if self.tokenizer.pad_token_id is None:
                logger.error("CRITICAL: tokenizer.pad_token_id is STILL None after all attempts. This will likely cause padding errors.")
                return False # Indicate failure

            return True

        except Exception as e:
            logger.error(f"Failed to load model or tokenizer for {self.model_name}: {e}")
            logger.error(traceback.format_exc())
            self.model = None
            self.tokenizer = None
            return False

    def _cleanup_resources(self):
        """Cleans up model and tokenizer resources and clears GPU cache if applicable."""
        logger.info("Cleaning up resources...")
        try:
            if self.model is not None:
                del self.model
                self.model = None
                logger.debug("Model deleted.")

            if self.tokenizer is not None:
                del self.tokenizer
                self.tokenizer = None
                logger.debug("Tokenizer deleted.")

            gc.collect() # Force garbage collection
            logger.debug("Garbage collection triggered.")

            if self.device == "mps":
                if hasattr(torch, 'mps') and hasattr(torch.mps, 'empty_cache'):
                    torch.mps.empty_cache()
                    logger.info("MPS cache emptied.")
            elif self.device == "cuda":
                if hasattr(torch, 'cuda') and hasattr(torch.cuda, 'empty_cache'):
                    torch.cuda.empty_cache()
                    logger.info("CUDA cache emptied.")
        except Exception as e:
            logger.warning(f"Error during resource cleanup: {e}", exc_info=True)

    def _format_prompt(self, custom_attributes: Optional[List[str]] = None) -> str:
        """
        Formats the prompt using the class's prompt template, product details,
        and custom attributes.
        """
        if custom_attributes:
            attributes_str = "\n".join([f"    - {attr}" for attr in custom_attributes])
        else:
            attributes_str = "    - (No specific custom attributes provided)"

        return self.prompt_template.format(
            num_products=self.num_products,
            product_type=self.product_type,
            custom_attributes_str=attributes_str
        )

    def generate_dataset(self, custom_attributes: Optional[List[str]] = None) -> Dict[str, Any]:
        """
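        Generates a dataset based on the provided attributes.

        Args:
            custom_attributes (Optional[List[str]]): Attributes to request for each product.

        Returns:
            Dict[str, Any]: The parsed output, keyed by the structure that was detected
                (JSON, markdown, key-value pairs, or raw text).

        Raises:
            RuntimeError: If the model is not loaded or generation fails.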
        """
        if self.model is None or self.tokenizer is None:
            logger.error("Model or tokenizer not loaded. Call _load_model() or use as context manager.")
            raise RuntimeError("Model not loaded. Cannot generate dataset.")

        formatted_prompt = self._format_prompt(custom_attributes)
        logger.debug(f"Formatted prompt:\n{formatted_prompt}")

        try:
            # --- Determine a safe max_length for truncation ---
            logger.debug(f"Tokenizer class: {self.tokenizer.__class__.__name__}")
            logger.debug(f"Tokenizer's initial model_max_length type: {type(self.tokenizer.model_max_length)}")
            logger.debug(f"Tokenizer's initial model_max_length value: {self.tokenizer.model_max_length}")

            # Default to a conservative value if things go wrong
            # For many models like Mistral 7B (32k context) or Phi-3 (4k/128k), this will be overridden.
            default_model_max_len = 4096
            effective_model_max_len = default_model_max_len

            if isinstance(self.tokenizer.model_max_length, int) and 0 < self.tokenizer.model_max_length < 200000: # Check if it's a "reasonable" int (e.g., less than 200k)
                effective_model_max_len = self.tokenizer.model_max_length
                logger.info(f"Using tokenizer.model_max_length: {effective_model_max_len}")
            else:
                logger.warning(
                    f"Tokenizer.model_max_length ({self.tokenizer.model_max_length}) is not a reasonable int or not set. "
                    "Attempting to use model.config.max_position_embeddings."
                )
                if hasattr(self.model, 'config') and hasattr(self.model.config, 'max_position_embeddings') and \
                   isinstance(self.model.config.max_position_embeddings, int) and \
                   0 < self.model.config.max_position_embeddings < 200000:
                    effective_model_max_len = self.model.config.max_position_embeddings
                    logger.info(f"Using model.config.max_position_embeddings: {effective_model_max_len}")
                else:
                    logger.warning(
                        f"Could not determine a reliable model_max_length. Defaulting to {default_model_max_len}. "
                        f"Model config max_position_embeddings: {getattr(getattr(self.model, 'config', None), 'max_position_embeddings', 'N/A')}"
                    )
                    effective_model_max_len = default_model_max_len
            
            # Define max_new_tokens for generation (you might want to make this configurable)
            max_new_tokens_for_generation = 1024  # Passed to generate() as max_new_tokens
            
            # Calculate truncation_length for the input prompt
            # We need to leave enough space for `max_new_tokens_for_generation`.
            # So, prompt length <= effective_model_max_len - max_new_tokens_for_generation
            # Add a small buffer (e.g., 5-10 tokens) just in case.
            buffer_tokens = 10 
            truncation_length_for_input = effective_model_max_len - max_new_tokens_for_generation - buffer_tokens

            # Ensure truncation_length_for_input is positive
            if truncation_length_for_input <= 0:
                logger.warning(f"Calculated truncation_length_for_input ({truncation_length_for_input}) is too small or negative. "
                               f"This might happen if effective_model_max_len ({effective_model_max_len}) is smaller than "
                               f"max_new_tokens_for_generation ({max_new_tokens_for_generation}). "
                               f"Adjusting to a small positive value (e.g., effective_model_max_len / 2).")
                truncation_length_for_input = effective_model_max_len // 2
                if truncation_length_for_input <=0: # Final fallback if even that is zero/negative
                    truncation_length_for_input = 256 # A very conservative small positive value

            logger.info(f"Derived effective_model_max_len: {effective_model_max_len}")
            logger.info(f"Target max_new_tokens_for_generation: {max_new_tokens_for_generation}")
            logger.info(f"Calculated truncation_length_for_input for tokenizer: {truncation_length_for_input}")
            # --- End of max_length determination ---

            inputs = self.tokenizer(
                formatted_prompt,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=truncation_length_for_input # Use the robustly calculated length
            )
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            pad_token_id = self.tokenizer.pad_token_id # Should be set correctly now from _load_model

            generation_config_params = {
                "input_ids": inputs["input_ids"],
                "attention_mask": inputs["attention_mask"],
                "max_new_tokens": max_new_tokens_for_generation, # Use the defined value
                "do_sample": True,
                "temperature": 0.7,
                "top_p": 0.9,
                "num_return_sequences": 1,
                "pad_token_id": pad_token_id,
                "eos_token_id": self.tokenizer.eos_token_id
            }
            # Ensure max_length for generation does not exceed model capacity when combined with prompt
            # total_max_len for generation call can be effective_model_max_len
            # This is typically handled by max_new_tokens correctly, but some models might also accept 'max_length' in generate
            # generation_config_params["max_length"] = effective_model_max_len # Optional: sometimes useful

            logger.info("Generating text...")
            with torch.no_grad():
                outputs = self.model.generate(**generation_config_params)

            generated_ids = outputs[0].to('cpu') if outputs[0].device.type != 'cpu' else outputs[0]
            generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
            logger.debug(f"Raw generated text:\n{generated_text}")

            if generated_text.strip().startswith(formatted_prompt.strip()):
                 generated_text = generated_text.strip()[len(formatted_prompt.strip()):].strip()
            elif "Begin generation:" in generated_text:
                 parts = generated_text.split("Begin generation:", 1)
                 if len(parts) > 1:
                     generated_text = parts[1].strip()

            return self._process_output(generated_text)

        except Exception as e:
            logger.error(f"Dataset generation failed: {e}")
            logger.error(traceback.format_exc())
            raise RuntimeError(f"Generation failed: {str(e)}") from e

    def _process_output(self, generated_text: str) -> Dict[str, Any]:
        """
        Processes the generated text, attempting to parse JSON first,
        then looking for markdown or key-value pairs.
        """
        logger.debug("Processing generated output...")
        # Attempt to find and parse JSON (often enclosed in ```json ... ``` or just { ... } or [ ... ])
        try:
            # Look for a block in triple backticks, with an optional "json" language specifier.
            import re
            match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", generated_text)
            if match:
                json_str = match.group(1).strip()
                logger.info("Found JSON block in triple backticks.")
                try:
                    return {"json_data": json.loads(json_str)}
                except json.JSONDecodeError as je:
                    logger.warning(f"Failed to parse JSON from triple backticks: {je}. Raw content: {json_str}")
                    # Do not fall through here: return the failed attempt together with the full text
                    return {"attempted_json_parse_error": str(je), "raw_text_in_json_block": json_str, "full_text": generated_text}


            # If no triple backticks, try to parse the whole text if it looks like JSON
            stripped_text = generated_text.strip()
            if (stripped_text.startswith('{') and stripped_text.endswith('}')) or \
               (stripped_text.startswith('[') and stripped_text.endswith(']')):
                logger.info("Attempting to parse entire output as JSON.")
                try:
                    return {"json_data": json.loads(stripped_text)}
                except json.JSONDecodeError as je:
                    logger.warning(f"Failed to parse entire output as JSON: {je}")
                    # Fall through

        except Exception as e:
            logger.warning(f"Error during JSON parsing attempt: {e}")
            # Fall through to other parsing methods

        # If JSON parsing fails or isn't applicable, check for markdown
        if "```" in generated_text and "\n" in generated_text: # Simple markdown check
            logger.info("Output identified as potential markdown.")
            return {"markdown_content": generated_text}

        # Fallback to key-value pair extraction
        items = []
        for line in generated_text.split('\n'):
            line = line.strip()
            if ':' in line:
                key, value = line.split(':', 1)
                items.append({
                    "attribute": key.strip(),
                    "value": value.strip()
                })
        if items:
            logger.info("Output processed as key-value pairs.")
            return {"structured_items": items}

        # If all else fails, return the raw text
        logger.info("No specific structure found, returning raw text.")
        return {"raw_text": generated_text}

    def __enter__(self):
        """Context manager entry: Loads the model."""
        if not self._load_model():
            raise RuntimeError("Failed to load model resources for DatasetGenerator.")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit: Cleans up resources."""
        logger.info("Exiting context and cleaning up DatasetGenerator resources...")
        self._cleanup_resources()
        if exc_type:  # An exception occurred within the 'with' block
            logger.error(f"Exception occurred in 'with' block: {exc_type.__name__}: {exc_val}")
        return False  # False re-raises the exception; True would suppress it
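
The `_process_output` method above parses JSON only when it sits inside a triple-backtick block or spans the entire output. One possible extra step, sketched here as an assumption rather than as part of the original class, is to scan for the first `{` or `[` that begins a decodable value, using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_first_json` is hypothetical.

```python
import json
from typing import Any, Optional


def extract_first_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array embedded in `text`, or None.

    Hypothetical helper, not part of the original DatasetGenerator.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # tolerates any trailing text after that value.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # Not valid JSON from this position; keep scanning
        return value
    return None


sample = 'Here are the products:\n[{"ProductID": "#MEQ-001", "Price": 1200.0}]\nDone.'
print(extract_first_json(sample))
```

A scan like this retries at every candidate bracket, so it can be slow on very long outputs with many stray braces. For outputs capped at about a thousand new tokens, that cost is unlikely to matter.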

In [31]:
if __name__ == "__main__":
    SELECTED_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
    #SELECTED_MODEL = 'microsoft/Phi-3-mini-4k-instruct' # Alternative
    #SELECTED_MODEL = 'meta-llama/Llama-3.1-8B'

    PRODUCT_TYPE = "Medical Equipment"

    # Example custom attributes for the product type
    custom_product_attributes = [
        "Brand Name: Medical Group Occidente",
        "Product ID: #1234-abc123",
        "Manufacturer: Manufacturer's name",
        "Price: in USD",
        "Key Feature 1: description",
        "Key Feature 2: description",
        "Category: category name",
        "Target Audience: e.g., professionals, students, children",
        "Eco-friendly: boolean (true/false)"
    ]

    try:
        # Using the generator as a context manager
        with DatasetGenerator(
            model_name=SELECTED_MODEL,
            num_products=5,
            product_type=PRODUCT_TYPE
        ) as generator:

            logger.info(f"Generating dataset for {generator.product_type}...")
            result = generator.generate_dataset(custom_product_attributes)

            logger.info("\n--- Generated Result ---")
            if "json_data" in result:
                logger.info("Output Format: JSON")
                # Pretty print JSON
                print(json.dumps(result["json_data"], indent=2))
                # Optionally convert to Pandas DataFrame if it's a list of records
                if isinstance(result["json_data"], list):
                    try:
                        df = pd.DataFrame(result["json_data"])
                        print("\n--- DataFrame ---")
                        print(df)
                    except Exception as e_df:
                        logger.warning(f"Could not create DataFrame from JSON data: {e_df}")
                elif isinstance(result["json_data"], dict) and len(result["json_data"]) > 0:
                    # Handle a single JSON object if it makes sense as a DataFrame row
                    try:
                        df = pd.DataFrame([result["json_data"]])
                        print("\n--- DataFrame (from single JSON object) ---")
                        print(df)
                    except Exception as e_df:
                        logger.warning(f"Could not create DataFrame from single JSON object: {e_df}")


            elif "markdown_content" in result:
                logger.info("Output Format: Markdown")
                print(result["markdown_content"])
            elif "structured_items" in result:
                logger.info("Output Format: Key-Value Items")
                try:
                    df = pd.DataFrame(result["structured_items"])
                    print("\n--- DataFrame ---")
                    print(df)
                except Exception as e_df:
                    logger.warning(f"Could not create DataFrame from structured items: {e_df}")
                    print(result["structured_items"]) # Print raw if DF fails
            elif "raw_text" in result:
                logger.info("Output Format: Raw Text")
                print(result["raw_text"])
            elif "attempted_json_parse_error" in result:
                logger.warning(f"JSON parsing failed: {result['attempted_json_parse_error']}")
                logger.info("Raw text that was in JSON block:")
                print(result.get("raw_text_in_json_block", "N/A"))
                logger.info("Full generated text (might be useful for debugging prompt):")
                print(result.get("full_text", "N/A"))
            else:
                logger.info("Output Format: Unknown (dumping result)")
                print(json.dumps(result, indent=2))

    except RuntimeError as e:
        logger.critical(f"A runtime error occurred in the main execution: {e}")
        logger.critical(traceback.format_exc())
    except Exception as e:
        logger.critical(f"An unexpected error occurred in the main execution: {e}")
        logger.critical(traceback.format_exc())

    logger.info("Dataset generation example finished.")

2025-05-21 00:44:06,480 - INFO - __main__ - MPS device is available and built.
2025-05-21 00:44:06,480 - INFO - __main__ - Attempting to load model and tokenizer on device: mps
2025-05-21 00:44:06,480 - INFO - __main__ - Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.2...
2025-05-21 00:44:06,831 - INFO - __main__ - Tokenizer loaded.
2025-05-21 00:44:06,832 - INFO - __main__ - Initial tokenizer pad_token: None, pad_token_id: None
2025-05-21 00:44:06,832 - INFO - __main__ - Initial tokenizer eos_token: </s>, eos_token_id: 2
2025-05-21 00:44:06,832 - INFO - __main__ - Initial tokenizer bos_token: <s>, bos_token_id: 1
2025-05-21 00:44:06,832 - INFO - __main__ - Initial tokenizer unk_token: <unk>, unk_token_id: 0
2025-05-21 00:44:06,833 - INFO - __main__ - Setting `pad_token` to `eos_token` ('</s>').
2025-05-21 00:44:06,833 - INFO - __main__ - After attempting to set: tokenizer pad_token: </s>, pad_token_id: 2
2025-05-21 00:44:06,833 - INFO - __main__ - Loading model mistralai/Mistr

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

2025-05-21 00:44:13,066 - INFO - __main__ - Model loaded from pretrained.
2025-05-21 00:44:13,066 - INFO - __main__ - Model is now on device: mps:0
2025-05-21 00:44:13,067 - INFO - __main__ - Model config pad_token_id: None
2025-05-21 00:44:13,067 - INFO - __main__ - Generating dataset for Medical Equipment...
2025-05-21 00:44:13,068 - INFO - __main__ - Using model.config.max_position_embeddings: 32768
2025-05-21 00:44:13,068 - INFO - __main__ - Derived effective_model_max_len: 32768
2025-05-21 00:44:13,068 - INFO - __main__ - Target max_new_tokens_for_generation: 1024
2025-05-21 00:44:13,069 - INFO - __main__ - Calculated truncation_length_for_input for tokenizer: 31734
2025-05-21 00:44:13,071 - INFO - __main__ - Generating text...
2025-05-21 00:44:44,770 - INFO - __main__ - Output processed as key-value pairs.
2025-05-21 00:44:44,770 - INFO - __main__ - 
--- Generated Result ---
2025-05-21 00:44:44,770 - INFO - __main__ - Output Format: Key-Value Items
2025-05-21 00:44:44,773 - INFO 


--- DataFrame ---
           attribute                                              value
0        "ProductID"                                        "#MEQ-001",
1        "BrandName"                         "Medical Group Occidente",
2     "Manufacturer"                   "Medical Equipment Corporation",
3            "Price"                                            1200.0,
4         "Category"                            "Diagnostic Equipment",
5      "KeyFeature1"  "High-resolution imaging capabilities with adv...
6      "KeyFeature2"  "Compact and portable design for easy transpor...
7   "TargetAudience"                           "Medical professionals",
8      "EcoFriendly"                                              false
9        "ProductID"                                        "#MEQ-002",
10       "BrandName"                         "Medical Group Occidente",
11    "Manufacturer"                    "Innovative Medical Solutions",
12           "Price"                         

2025-05-21 00:44:45,277 - INFO - __main__ - MPS cache emptied.
2025-05-21 00:44:45,278 - INFO - __main__ - Dataset generation example finished.
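
In the run above, the key-value fallback produced one row per attribute, with keys still quoted and values carrying trailing commas. One way to turn such rows into one record per product is sketched below. It assumes that a new product begins whenever an attribute name repeats, which holds for the rows shown but is not guaranteed in general. `group_items_into_records` and `_clean` are hypothetical helpers, not part of the original notebook.

```python
from typing import Dict, List


def _clean(token: str) -> str:
    """Strip a trailing comma and surrounding double quotes."""
    return token.strip().rstrip(",").strip().strip('"')


def group_items_into_records(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Group flat attribute/value rows into one dict per product.

    Assumption: a repeated attribute name marks the start of a new product.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for item in items:
        key = _clean(item["attribute"])
        value = _clean(item["value"])
        if key in current:
            # Seeing the same key again means the previous product is complete
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


# Rows copied from the DataFrame printed above
items = [
    {"attribute": '"ProductID"', "value": '"#MEQ-001",'},
    {"attribute": '"BrandName"', "value": '"Medical Group Occidente",'},
    {"attribute": '"Price"', "value": "1200.0,"},
    {"attribute": '"ProductID"', "value": '"#MEQ-002",'},
    {"attribute": '"BrandName"', "value": '"Medical Group Occidente",'},
]
records = group_items_into_records(items)
print(records)
```

Passing `records` to `pd.DataFrame` would then give one row per product instead of one row per attribute. All values stay strings here; converting `Price` to a number is left out of the sketch.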
