# Data Architect

Data Architect is a dataset generator built on a Gradio UI. Its interface streamlines dataset generation, letting users customize and generate synthetic datasets from a few inputs.


## Setup and Install Dependencies

- Start by installing all necessary libraries for dataset generation, model loading, and building the Gradio interface.
- Include any dependencies needed for data preprocessing, visualization, and export.

In [None]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate openai httpx==0.27.2 gradio

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.4/76.4 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.1/69.1 MB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.6/57.6 MB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m321.4/321.4 kB[0m [31m22.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.8/94.8 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.4/12.4 MB[0m [31m88.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m73.2/73.2 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.3/62.3 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
# imports
import os
import torch
import json
import threading
import gradio as gr
from openai import OpenAI
from google.colab import userdata, drive
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextIteratorStreamer, TextStreamer
from pathlib import Path

## Mount Google Drive to the Colab Environment

- Mount Google Drive to enable access to files stored in your Drive directly from the Colab environment.  
- This allows saving and loading models, files, and other resources persistently across sessions.


In [None]:
# mount Google Drive to the Colab environment, allowing access to files stored in the Drive.
drive.mount("/content/drive")

Mounted at /content/drive
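
If the mount step is skipped or fails, anything written under the Drive path will not persist. The sketch below shows one way to fall back to the temporary runtime directory in that case. The helper name `pick_storage_dir` is hypothetical and not part of this notebook.

```python
from pathlib import Path

def pick_storage_dir(drive_dir: str, fallback_dir: str) -> Path:
    """Return drive_dir if it exists as a directory, otherwise fallback_dir."""
    drive_path = Path(drive_dir)
    return drive_path if drive_path.is_dir() else Path(fallback_dir)

# In Colab this resolves to the Drive folder only when the mount succeeded.
print(pick_storage_dir("/content/drive/MyDrive", "/content"))
```

Models saved under the fallback directory are lost when the runtime disconnects.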


## Define Required Constants

- Set up the necessary constants that will be used throughout the application.


In [111]:
# Specifies the path or identifier for the LLaMA model to be used for generating synthetic datasets
LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# These paths are optional and can be customized based on user preference.

# DRIVE_DIR = "/content"  # Uncomment this line and comment the next one if you prefer to save the model in the temporary runtime session (non-persistent storage).
DRIVE_DIR = "/content/drive/MyDrive"  # Path to Google Drive for persistent storage.
DRIVE_MODELS_DIR = DRIVE_DIR + "/my_models"  # Directory within Google Drive to store the saved models.

# End-of-turn token that the LLaMA 3.1 chat format emits when a response is complete; it is stripped from the streamed output.
MODEL_EOS = "<|eot_id|>"

# Maximum number of tokens allowed for the model's generation to control output length and prevent exceeding limits
MAX_TOKENS = 2000

# Define the assistant's role and task instructions
SYSTEM_PROMPT = """
You are an AI assistant tasked with generating synthetic testing data for various purposes. Your goal is to create realistic and diverse datasets based on the user's specifications and provided examples. Follow these instructions carefully to produce high-quality synthetic data in JSON format.

Inputs You Will Receive:
1. Subject: This specifies the type of data to generate (e.g., "job postings", "customer reviews").
2. Number of Samples: The number of synthetic data samples to generate.
3. Three Multi-Shot Examples: These examples define the expected structure and style of the data to be generated. Each example will include three fields (field1, field2, field3) and their corresponding values. If fewer than three fields are required, unused fields will be left blank or set to `None`.

Your Steps:
1. Analyze the provided subject, number of samples, and the structure of the examples.
2. Use the examples as a guide to replicate the format, relationships, and conventions for the synthetic data.
3. For each field, generate realistic and diverse content that matches the specified data type and examples. Ensure consistency within each sample while introducing appropriate variations across the dataset.
4. Add slight randomness and imperfections, such as varying lengths, occasional typos, or minor inconsistencies, to make the data appear more realistic.
5. If a field is left blank or set to `None` in the examples, ensure this is reflected in the generated data where applicable.
6. Do not include any placeholder text like "Lorem ipsum" or other generic patterns. All generated content must be realistic and contextually appropriate.

Output Format:
- The generated data must be output as a JSON object.
- Use the following structure to format the output JSON dataset:

```json
{
  "data": [
    {
      "field1": "value1",
      "field2": "value2",
      "field3": "value3"
    },
    {
      "field1": "value1",
      "field2": "value2",
      "field3": "value3"
    },
    ...
  ]
}
```

Important Directive:
- Do not include any explanatory text, labels, or formatting (such as Markdown) in the output.
- The output must strictly consist of the JSON dataset without any additional context, headings, or commentary.

If any required input (subject, number of samples, or examples) is missing or incomplete, do not proceed with generating the data. Instead, prompt the user to provide all necessary details before continuing.

Your primary objectives are:
- Ensure the output is diverse and realistic.
- Maintain adherence to the conventions and structure defined in the examples.
- Deliver the specified number of samples in the expected JSON format.
- Output only the JSON dataset with no surrounding text, explanations, or formatting.

Always strive to produce data that appears natural, adheres to the expectations set by the examples, and avoids the use of placeholder text or generic content.

"""

# Construct the user's prompt from the subject, the number of samples, and the three example samples
def user_prompt(subject: str, num_samples: int, sample1: dict, sample2: dict, sample3: dict) -> str:
    prompt = f"""
Input Specifications:
- Subject: {subject}
- Number of Samples: {num_samples}

This will guide you to structure the generated data correctly.

Each synthetic data sample will follow this structure:

{{
  "field1": "value1",
  "field2": "value2",
  "field3": "value3"
}}

These are Three Multi-Shot Examples:

Example of Sample 1:
{sample1}

Example of Sample 2:
{sample2}

Example of Sample 3:
{sample3}

Do not include these provided examples in the generated output.

Your Task:

Using the provided subject, number of samples, fields, and examples:

1. Generate synthetic testing data that adheres to the structure defined above.
2. Maintain diversity and realism in the generated data, ensuring that it matches the conventions, formatting, and relationships specified in the examples.
3. Introduce slight randomness (e.g., variations in length, occasional typos, or varying content) to make the data more realistic.

Sample Output:

{{
  "data": [
    {{
      "Sample Field 1": "Sample Value A",
      "Sample Field 2": "Sample Value B",
      "Sample Field 3": "Sample Value C"
    }},
    {{
      "Sample Field 1": "Sample Value X",
      "Sample Field 2": "Sample Value Y",
      "Sample Field 3": "Sample Value Z"
    }},
    ...
  ]
}}
"""
    return prompt
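
The f-string above interpolates each sample dict with Python's default `repr`, which produces single-quoted keys and values rather than JSON. One option, sketched here, is to serialize each example with `json.dumps` before placing it in the prompt, so the examples match the output format the model is asked for. The helper name `format_example` is hypothetical and not part of this notebook.

```python
import json

def format_example(sample: dict) -> str:
    """Render one example as indented JSON text for use inside a prompt."""
    # ensure_ascii=False keeps non-ASCII characters readable in the prompt.
    return json.dumps(sample, indent=2, ensure_ascii=False)

sample = {"product": "chocolate cake", "customer": "John"}
print(format_example(sample))
```

The result could be substituted for `{sample1}`, `{sample2}`, and `{sample3}` in the prompt template.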


## Add Secrets to the Colab Notebook

- Add your Hugging Face Hub credentials to sign in and access models.  

In [None]:
# Sign in to HuggingFace Hub using Secrets in Colab
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
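
`userdata.get` is only available inside Colab. A minimal sketch of a lookup that also works elsewhere is shown below, assuming the token may be exported as the `HF_TOKEN` environment variable. The function `resolve_hf_token` and its parameters are hypothetical; in Colab, `userdata.get` would be passed as `colab_lookup`.

```python
import os

def resolve_hf_token(colab_lookup=None, env=os.environ):
    """Return the token from a Colab-style lookup, else from the environment."""
    if colab_lookup is not None:
        try:
            token = colab_lookup("HF_TOKEN")
            if token:
                return token
        except Exception:
            # The lookup may raise if the secret is missing or access is denied.
            pass
    return env.get("HF_TOKEN")

# Outside Colab: no lookup function, so only the environment is consulted.
token = resolve_hf_token()
print("token found" if token else "no token configured")
```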

## Define Model and Tokenizer Variables

- Initialize the `model` and `tokenizer` variables by assigning them to `None`.  
- This ensures they can be properly loaded later when needed.

In [None]:
# Initialize the model and tokenizer variables
tokenizer = None
model = None
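
Starting from `None` makes it possible to load the model only on first use. The sketch below illustrates that idea with a small cache. `get_or_load` and the stand-in loader are hypothetical names used for illustration; in this notebook the loader would be `load_model`.

```python
_cache = {}

def get_or_load(name, loader):
    """Call loader(name) on first use only, then reuse the cached result."""
    if name not in _cache:
        _cache[name] = loader(name)
    return _cache[name]

def fake_loader(name):
    # Stand-in for an expensive load; returns a (tokenizer, model) style pair.
    return ("tokenizer for " + name, "model for " + name)

first = get_or_load("demo-model", fake_loader)
second = get_or_load("demo-model", fake_loader)
print(first is second)
```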

## Load the Model and Tokenizer

- If this is the first time using the runtime, load the model from the Hugging Face Hub and save it to the drive for future use (this ensures the model persists even after the runtime disconnects).  
- Alternatively, the model can be saved in the current temporary runtime session location, but note that it will not persist after the session ends or disconnects.  
- If the model is already saved on the drive, it will be loaded directly from there to save time.


In [None]:
def load_model(model_name, local_dir=DRIVE_MODELS_DIR):
    # Convert the local_dir to a Path object for easier path handling
    local_dir = Path(local_dir)

    # Create a subdirectory for the model, replacing '/' in model_name with '_'
    model_dir = local_dir / model_name.replace("/", "_")

    if model_dir.exists():  # Check if the model is already downloaded locally
        # Load the tokenizer and model from the existing directory
        tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
        model = AutoModelForCausalLM.from_pretrained(str(model_dir))

    else:  # If the model is not available locally, download and configure it
        # Configure the quantization settings for loading the model in 4-bit precision
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,                   # Load the model in 4-bit precision to save memory
            bnb_4bit_use_double_quant=True,      # Also quantize the quantization constants to save additional memory
            bnb_4bit_compute_dtype=torch.bfloat16,  # Set computation data type to bfloat16
            bnb_4bit_quant_type="nf4"            # Use NF4 quantization type for improved accuracy
        )

        # Download the tokenizer with remote code support enabled
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

        # Set the padding token to the end-of-sequence (EOS) token for consistency
        tokenizer.pad_token = tokenizer.eos_token

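        # Download and load the model with the specified quantization configuration
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quant_config,  # Apply the 4-bit settings defined above
            device_map="auto"                  # Let accelerate place the weights on the available GPU
        )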

        # Save the downloaded model and tokenizer locally for future use
        model_dir.mkdir(parents=True, exist_ok=True)  # Create the directory if it doesn't exist
        model.save_pretrained(model_dir)  # Save the model to the specified directory
        tokenizer.save_pretrained(model_dir)  # Save the tokenizer to the specified directory

    # Return the loaded tokenizer and model for further use
    return tokenizer, model
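
`load_model` treats any existing directory as a finished download, so a save that was interrupted (for example by a runtime disconnect) would be picked up as if it were complete. The sketch below is a rough heuristic check, assuming `save_pretrained` writes a `config.json` plus weight files ending in `.safetensors` or `.bin`. The helper name `is_saved_model_dir` is hypothetical, and the check does not verify that every shard is present.

```python
from pathlib import Path

def is_saved_model_dir(model_dir: Path) -> bool:
    """Heuristic: True if model_dir holds a config and at least one weight file."""
    if not (model_dir / "config.json").is_file():
        return False
    # Accept either safetensors or legacy PyTorch weight files.
    return any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))

print(is_saved_model_dir(Path("/content/drive/MyDrive/my_models/example")))
```

A check like this could replace `model_dir.exists()` so that an incomplete directory triggers a fresh download.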



In [None]:
# call the helper function and load the model and tokenizer
tokenizer, model = load_model(LLAMA)

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now default to True since model is quantized.



## Test Dataset Generation

- Run a quick generation with sample inputs to confirm that the model and prompts work before building the UI.

In [120]:
# Quick test: stream one generation to stdout
def test_generate(prompt):
    global tokenizer, model

    # Build the conversation: system instructions followed by the user's request
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    # Extra keyword arguments such as skip_special_tokens are forwarded to tokenizer.decode()
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Run generation in a background thread so the streamer can be consumed here
    thread = threading.Thread(
        target=model.generate,
        kwargs={"inputs": inputs, "max_new_tokens": MAX_TOKENS, "streamer": streamer}
    )
    thread.start()

    # Print each chunk as it arrives, dropping any leftover end-of-turn marker
    for chunk in streamer:
        print(chunk.replace(MODEL_EOS, ""), end="")
    thread.join()
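
The function above runs `model.generate` in a background thread while the main thread iterates over the streamer. The sketch below reproduces that producer/consumer pattern with only the standard library, so it can be run without a GPU. `produce` and `stream` are hypothetical stand-ins for `model.generate` and the streamer, not the actual Transformers implementation.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def produce(chunks, out: queue.Queue) -> None:
    """Stand-in for model.generate: push text chunks, then the sentinel."""
    for chunk in chunks:
        out.put(chunk)
    out.put(_DONE)

def stream(chunks):
    """Yield chunks as the worker thread produces them."""
    out = queue.Queue()
    worker = threading.Thread(target=produce, args=(chunks, out))
    worker.start()
    while True:
        item = out.get()  # blocks until the worker supplies the next item
        if item is _DONE:
            break
        yield item
    worker.join()  # wait for the worker to finish before returning

print("".join(stream(['{"data": ', "[]", "}"])))
```

Because the consumer blocks on the queue, output appears as soon as each chunk is ready rather than after the whole response is generated.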

In [121]:
# testing inputs values
subject = "customer reviews"
num_samples = 6
sample1 = {
    "product": "chocolate cake",
    "review": "This cake is the best I've ever tasted! It's moist and flavorful. I bought it for my birthday party and everyone loved it.",
    "customer": "John",
}
sample2 = {
    "product": "comic book",
    "review": "This book is amazing! The story is so interesting and the pictures are beautiful. I read it to my kids every night.",
    "customer": "Jane",
}
sample3 = {
    "product": "toy car",
    "review": "I love this toy! It's so colorful and fun to play with. My kids enjoy it every day.",
    "customer": "Jack",
}


In [122]:
# call the function
test_generate(user_prompt(subject, num_samples, sample1, sample2, sample3))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{
    "data": [
        {
            "product": "strawberry shortcake",
            "review": "This cake is absolutely divine! It's sweet and tangy with the perfect balance of flavors. I bought it for my sister's birthday and she loved it.",
            "customer": "Emily"
        },
        {
            "product": "puzzle book",
            "review": "I'm really enjoying this puzzle book. The pictures are so colorful and the puzzles are challenging but fun. I'm solving it with my kids every evening.",
            "customer": "Mike"
        },
        {
            "product": "gourmet coffee",
            "review": "This coffee is so rich and flavorful! It's the perfect way to start my day. I'm a bit of a coffee connoisseur and I can appreciate the high-quality beans used in this blend.",
            "customer": "Sarah"
        },
        {
            "product": "board game",
            "review": "I love playing board games with my friends. This one is a classic and always a hit. T

## Create the User Interface (UI) with Gradio

- Design an intuitive and user-friendly Gradio interface to generate synthetic datasets effortlessly.
- Allow users to specify inputs, customize fields, and generate datasets in real-time.
- Ensure the UI is simple, responsive, and accessible for seamless interaction.




In [150]:
# Function to process input data
def process_user_inputs(subject, num_samples,
                        sample_field1, sample_field2, sample_field3,
                        sample1_value1, sample1_value2, sample1_value3,
                        sample2_value1, sample2_value2, sample2_value3,
                        sample3_value1, sample3_value2, sample3_value3):
    # Ensure all required inputs are provided
    if not subject:
        yield "Please provide the Subject."
        return  # terminate function here to prevent further execution

    if not num_samples:
        yield "Please provide the Number of Samples."
        return  # terminate function here to prevent further execution

    if not all([sample_field1, sample_field2, sample_field3]):
        yield "Please provide all Sample Fields (Field 1, Field 2, Field 3)."
        return  # terminate function here to prevent further execution

    if not all([sample1_value1, sample1_value2, sample1_value3,
                sample2_value1, sample2_value2, sample2_value3,
                sample3_value1, sample3_value2, sample3_value3]):
        yield "Please provide values for all fields in all samples."
        return  # terminate function here to prevent further execution

    # Construct examples from the fields and values
    sample1 = {
        sample_field1: sample1_value1,
        sample_field2: sample1_value2,
        sample_field3: sample1_value3
    }
    sample2 = {
        sample_field1: sample2_value1,
        sample_field2: sample2_value2,
        sample_field3: sample2_value3
    }
    sample3 = {
        sample_field1: sample3_value1,
        sample_field2: sample3_value2,
        sample_field3: sample3_value3
    }

    # Generate data based on validated inputs
    try:
        yield from generate_dataset(subject, num_samples, sample1, sample2, sample3)
    except Exception as e:
        yield f"Error during data generation: {e}"
        return  # Terminate the function after yielding the error message to prevent further execution
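
The checks above use `not value`, which accepts a field that contains only spaces. A minimal sketch of a stricter check that also reports which input is missing is shown below. The helper `first_missing` and the labels passed to it are hypothetical and not part of this notebook.

```python
def first_missing(required: dict):
    """Return the label of the first empty input, or None if all are filled."""
    for label, value in required.items():
        # Treat None, empty strings, and whitespace-only strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            return label
    return None

missing = first_missing({"Subject": "customer reviews", "Field 1": "   "})
if missing:
    print(f"Please provide a value for {missing}.")
```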

In [151]:
# Function to generate synthetic data
def generate_dataset(subject, num_samples, sample1, sample2, sample3):
    global tokenizer, model  # Use the globally defined tokenizer and model

    # Construct messages for the chat template: the system prompt and user-specific instructions
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},  # System-level instructions
        {"role": "user", "content": user_prompt(subject, num_samples, sample1, sample2, sample3)}  # User-defined parameters
    ]

    # Prepare the input data for the model using the tokenizer
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",  # Return as PyTorch tensors
        add_generation_prompt=True  # Include a generation prompt for the model
    ).to("cuda")  # Send data to the GPU for faster processing

    # Set up a streamer to handle the token decoding process
    streamer = TextIteratorStreamer(
        tokenizer,  # The tokenizer used for decoding
        skip_prompt=True,  # Skip the prompt text in the output
        skip_special_tokens=True  # Extra keyword arguments are forwarded to tokenizer.decode()
    )

    # Create a thread to handle text generation asynchronously
    thread = threading.Thread(
        target=model.generate,  # Use the model's generate method
        kwargs={
            "inputs": inputs,  # Provide the prepared input tensors
            "max_new_tokens": MAX_TOKENS,  # Limit the maximum number of tokens to generate
            "streamer": streamer  # Stream the generated tokens using the streamer
        }
    )

    thread.start()  # Start the generation thread

    full_response = ""  # Initialize an empty string to store the full response
    for chunk in streamer:  # Iterate over the streamed chunks of generated text
        cleaned_chunk = chunk.replace(MODEL_EOS, "")  # Remove any end-of-sequence markers
        full_response += cleaned_chunk  # Append the cleaned chunk to the full response
        yield full_response  # Yield the accumulated response in real-time
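
Generation stops at `MAX_TOKENS`, so a large request can end mid-object, and the model may wrap the JSON in extra text despite the instructions. The sketch below shows one way to check the final response before using it. The function `parse_dataset` is hypothetical; it assumes the response contains a single JSON object with a `"data"` list, as the system prompt requests.

```python
import json

def parse_dataset(response: str, expected: int):
    """Extract the JSON object from a response and check the sample count."""
    # Take the text between the first '{' and the last '}' to tolerate
    # stray prose or code fences around the JSON.
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    payload = json.loads(response[start:end + 1])  # raises ValueError if malformed
    rows = payload.get("data")
    if not isinstance(rows, list):
        raise ValueError('response has no "data" list')
    return rows, len(rows) == expected

rows, complete = parse_dataset('{"data": [{"product": "toy car"}]}', expected=6)
print(len(rows), complete)
```

A truncated response raises `ValueError`, which a caller could catch to ask for fewer samples or a higher token limit.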


In [152]:
# Gradio UI function
def gradio_ui():
    with gr.Blocks() as ui:
        gr.Markdown("""
        # Data Architect

        Welcome to the `Data Architect`! Use this tool to create realistic and diverse datasets for testing purposes. Simply provide the required information below, including sample field names and values, and let the system generate synthetic data for you.
        <br><br>
        """)

        # User inputs and output in a single column
        gr.Markdown("""
        ## Define General Information
        Provide the subject (e.g., Job Postings, Customer Reviews, Movie Listings) and the number of samples you wish to generate. This will guide the synthetic data generation process.
        """)
        with gr.Row():
            subject = gr.Textbox(
                label="Subject",
                placeholder="Enter the subject of the data",
                value=""
            )
            num_samples = gr.Slider(
                minimum=1,
                maximum=100,
                step=1,
                value=6,
                label="Number of Samples"
            )

        gr.Markdown("""
        ## Define Sample Fields
        Specify the names of the fields you want in your dataset. For example: Name, Age, or Address.
        """)
        with gr.Row():
            sample_field1 = gr.Textbox(label="Field 1", placeholder="Enter name of Field 1", value="")
            sample_field2 = gr.Textbox(label="Field 2", placeholder="Enter name of Field 2", value="")
            sample_field3 = gr.Textbox(label="Field 3", placeholder="Enter name of Field 3", value="")

        gr.Markdown("""
        ## Sample 1
        Provide example values for each field in your first sample.
        """)
        with gr.Row():
            sample1_value1 = gr.Textbox(label="Field 1 - Value", placeholder="Enter value for Field 1", value="")
            sample1_value2 = gr.Textbox(label="Field 2 - Value", placeholder="Enter value for Field 2", value="")
            sample1_value3 = gr.Textbox(label="Field 3 - Value", placeholder="Enter value for Field 3", value="")

        gr.Markdown("""
        ## Sample 2
        Provide example values for each field in your second sample.
        """)
        with gr.Row():
            sample2_value1 = gr.Textbox(label="Field 1 - Value", placeholder="Enter value for Field 1", value="")
            sample2_value2 = gr.Textbox(label="Field 2 - Value", placeholder="Enter value for Field 2", value="")
            sample2_value3 = gr.Textbox(label="Field 3 - Value", placeholder="Enter value for Field 3", value="")

        gr.Markdown("""
        ## Sample 3
        Provide example values for each field in your third sample.
        """)
        with gr.Row():
            sample3_value1 = gr.Textbox(label="Field 1 - Value", placeholder="Enter value for Field 1", value="")
            sample3_value2 = gr.Textbox(label="Field 2 - Value", placeholder="Enter value for Field 2", value="")
            sample3_value3 = gr.Textbox(label="Field 3 - Value", placeholder="Enter value for Field 3", value="")

        # Button to trigger generation
        generate_button = gr.Button("Generate Synthetic Data")

        gr.Markdown("""
        ## Generated Synthetic Data
        The generated data will appear here in JSON format. You can copy and use it directly for your testing needs.
        """)
        output = gr.Markdown(min_height=100)

        # Bind button click to processing function
        generate_button.click(
            process_user_inputs,
            [
                subject, num_samples,
                sample_field1, sample_field2, sample_field3,
                sample1_value1, sample1_value2, sample1_value3,
                sample2_value1, sample2_value2, sample2_value3,
                sample3_value1, sample3_value2, sample3_value3
            ],
            output
        )

    return ui


In [153]:
# Launch the UI
ui = gradio_ui()
ui.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://64bb9747b4fe89de09.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Contributing
Contributions are welcome! Here are some ways you can contribute to the project:
- Report bugs and issues.
- Suggest new features or improvements.
- Submit pull requests with bug fixes or enhancements.

You can contribute to this project by visiting the [GitHub repository](https://github.com/emads22/MeetingRecap).

## Author
- **Emad**  
  [<img src="https://img.shields.io/badge/GitHub-Profile-blue?logo=github" width="150">](https://github.com/emads22)

## License
This project is licensed under the MIT License, which grants permission for free use, modification, distribution, and sublicense of the code, provided that the copyright notice (attributed to [emads22](https://github.com/emads22)) and permission notice are included in all copies or substantial portions of the software. This license is permissive and allows users to utilize the code for both commercial and non-commercial purposes.

Please see the [LICENSE](https://github.com/emads22/MeetingRecap/blob/main/LICENSE) file for more details.
