# Data Generation with Mistral Large 2

In this Jupyter notebook, we will dive into the world of synthetic data generation, exploring the versatility of Mistral models in creating artificial data for specific use cases. We will showcase a full example of generating synthetic data to create a model with a distinct personality, demonstrating the potential of this approach in enhancing model capabilities and enabling new applications.

It's important to note that there is no one-size-fits-all method for synthetic data generation. Different use cases, data formats, and limitations require tailored approaches to ensure the generated data accurately captures the desired characteristics and serves its intended purpose. Throughout this notebook, we will provide insights and best practices for navigating the complexities of synthetic data generation, empowering you to tackle your unique challenges effectively.

## 1. Crafting Personality with Synthetic Data

When designing an AI assistant or application, we often want to give it a specific personality or identity. However, manually rewriting data to achieve this can be time-consuming and resource-intensive. Mistral models on Amazon Bedrock offer a more efficient approach through synthetic data generation.

In this section, we will use Mistral Large 2 (`mistral.mistral-large-2407-v1:0`) to rewrite an existing dataset, infusing it with a distinct personality of our choice. This rewritten dataset can then be used to fine-tune a smaller model, such as Mistral 7B, creating an AI assistant or application with the desired personality traits.

Instead of generating entire conversations from scratch, we will transform existing datasets into the desired style or personality, making the process more efficient and cost-effective. By harnessing the power of synthetic data generation, we can craft tailored datasets that enable the creation of AI assistants or applications that resonate with their target audience.

Next, we describe how the model should edit the dataset: the assistant's replies should take on a different personality and identity. For this example, we have chosen the Enthusiastic Life Coach!

> This notebook has been inspired by [Mistral Cookbook](https://github.com/mistralai/cookbook/blob/main/mistral/data_generation/synthetic_data_gen_and_finetune.ipynb) and the [Mistral-on-AWS repo](https://github.com/aws-samples/mistral-on-aws/blob/main/notebooks/synthetic_data_gen/bedrock_synthetic_data_gen_chat_finetuning.ipynb), which provides a collection of notebooks and resources for working with Mistral models.

First, let's import the required libraries.

In [None]:
from aiobotocore.session import get_session
import asyncio
import boto3
from botocore.config import Config
import datasets
import json
from pprint import pprint
import random
from tqdm import tqdm
from tqdm.asyncio import tqdm as atqdm

In [None]:
config = Config(read_timeout=2000)

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
    config=config
)

## Defining the Assistant's Personality

To effectively transform the assistant messages in our dataset, we need a clear and detailed description of the desired personality. This description serves as a guideline for how the assistant should communicate, ensuring consistency and alignment with our objectives.

In this notebook, we've defined a personality called the **Enthusiastic Life Coach**. This persona is characterized by:

* **Motivational and Supportive Tone**: Bringing positivity and encouragement to every interaction.
* **Energetic Language**: Using uplifting and enthusiastic expressions to engage users.
* **Consistent Style Across Topics**: Maintaining the same vibrant personality, whether discussing technical subjects or everyday topics.

By crafting this detailed personality description and storing it in the `description` variable, we provide the model with a clear blueprint for rewriting the assistant's replies. This approach allows us to:

* **Ensure Consistency**: All transformed messages adhere to the same style and tone.
* **Save Time**: Automate the process of infusing personality into the dataset without manual edits.
* **Customize for Our Use Case**: Tailor the assistant's persona to resonate with our target audience and enhance user engagement.

Having a well-defined personality description is essential for synthetic data generation, as it guides the model in producing responses that fit our specific needs.

In [None]:
description = """
Transform all Assistant messages, exclusively the Assistant's replies, to embody the vibrant personality of an Enthusiastic Life Coach—a motivational and supportive partner who brings positivity and encouragement to every interaction.

**Meet the Enthusiastic Life Coach:**

- **Warm and Uplifting Greetings:**
  - Begins interactions with an energetic welcome.
    - *"Hello there! I'm excited to assist you today!"*

- **Positive and Encouraging Language:**
  - Uses motivational phrases to inspire confidence.
    - *"Great question! Let's explore this together and make it amazing!"*

- **Expressive and Empathetic Tone:**
  - Shows genuine enthusiasm and understanding.
    - *"I understand how important this is to you, and I'm here to help every step of the way!"*

- **Action-Oriented Guidance:**
  - Provides clear, step-by-step instructions while encouraging progress.
    - *"Let's dive into the process—you're going to do fantastic!"*

- **Consistent Support Across Topics:**
  - Maintains a positive demeanor whether discussing business strategies or cooking recipes.
    - *"Cooking a delicious meal is a wonderful way to nourish both body and soul!"*

- **Inspirational Closing Statements:**
  - Ends responses with uplifting remarks.
    - *"You've got this! Can't wait to hear how it goes!"*

**Overall Vibe:**

The Enthusiastic Life Coach turns every interaction into a motivating experience, providing helpful information infused with positivity and encouragement. This personality stands out due to its consistent uplifting tone, making the transformation of the assistant messages noticeable across different subjects.
"""


## 2. Generate Data

First, let's create a function that calls the Amazon Bedrock Converse API to convert a conversation from one style to another. The goal is to instruct our model to rewrite a conversation in a specific tone, following a chosen personality, while preserving the integrity and coherence of the conversation. To achieve this, we feed it the entire list of messages and ask for output in the chat fine-tuning format expected by SageMaker JumpStart: a JSON object containing the rewritten messages.

## Dataset formatting instruction for training

### Chat fine-tuning

Imagine the next step in our pipeline: a model such as Mixtral 8x7B that can be fine-tuned on the chat dataset, provided that the data is in the expected format. The resulting chat model can then be deployed for inference. Below are the instructions for how the training data should be formatted for input to the model.

- Input: A train directory and an optional validation directory. Each directory should contain one or more JSON Lines (.jsonl) files. All training data must be in a single folder, but it can be split across multiple .jsonl files. The .jsonl file extension is mandatory.
Each line of a file is a dictionary representing a single data sample: one conversation between the user and the assistant model, stored as a list of messages under the `dialog` key. This model only supports the 'system', 'user' and 'assistant' roles, starting with 'system', then 'user', and alternating from there (u/a/u/a/u...).
- Output: A trained model that can be deployed for inference.
The best model is selected according to the validation loss, calculated at the end of each epoch. If a validation set is not given, an (adjustable) percentage of the training data is automatically split off and used for validation.

Here is an example of a line in the training file:

{"dialog": [{"content":"what is the height of the empire state building","role":"user"},{"content":"381 meters, or 1,250 feet, is the height of the Empire State Building. If you also account for the antenna, it brings up the total height to 443 meters, or 1,454 feet","role":"assistant"},{"content":"Some people need to pilot an aircraft above it and need to know.\nSo what is the answer in feet?","role":"user"},{"content":"1454 feet","role":"assistant"}]}
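
The role-ordering rule described above can be checked before training. The following is a minimal sketch and not part of the original notebook: the helper name `has_valid_role_order` is hypothetical, and it assumes that the leading 'system' message is optional and that a dialog may end on either role.

```python
from typing import List


def has_valid_role_order(dialog: List[dict]) -> bool:
    """Return True if roles follow: optional 'system', then user/assistant alternating.

    Assumption (not confirmed by the source): the 'system' message is optional
    and the conversation may end on either a 'user' or an 'assistant' turn.
    """
    roles = [message.get("role") for message in dialog]

    # Drop a single leading system message, if present.
    if roles and roles[0] == "system":
        roles = roles[1:]

    # A dialog with no user/assistant turns is not a usable sample.
    if not roles:
        return False

    # Even positions must be 'user', odd positions must be 'assistant'.
    expected = ("user", "assistant")
    return all(role == expected[i % 2] for i, role in enumerate(roles))
```

Applied to the example line above, `has_valid_role_order(sample["dialog"])` would return `True`, since its roles alternate user, assistant, user, assistant.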

In [None]:
def generate(
    bedrock_client,
    model_id,
    description,
    dialog,
    temperature=0.9,
    max_tokens=2048,
    top_p=0.95,
) -> dict:
    prompt = (
        """Your objective is to rewrite a given conversation between a User/Human and an Assistant/Robot, rewriting the conversation to follow a specific instruction.
    You must rewrite the dialog, modifying the replies with this new description, you must respect this description at all costs.
    Do not skip any turn.
    Do not add new dialogs.
    If there is a message with 'role':'system' replace it with 'role':'user'.
    I want you to rewrite the entire dialog following the description.
    Answer with the following JSON format:
    {
        "dialog":[
            {"role":"user", "content":"users message"},
            {"role":"assistant", "content":"assistants message"},
            {"role":"user", "content":"users message"},
            {"role":"assistant", "content":"assistants message"}
            ...
        ]
    }
    """
        + f"""
    Dialog:
    {dialog}
    Rewrite this dialog in the JSON format and following the Instruction/Description provided:
    ### Instruction/Description
    {description}
    ### End of Instruction/Description
    """
    )

    messages = [
        {
            "role": "user",
            "content": [{"text": prompt}]
        }
    ]

    # Base inference parameters.
    inference_config = {
        "temperature": temperature,
        "maxTokens": max_tokens,
        "topP": top_p,
    }

    # Additional inference parameters to use.
    additional_model_fields = {}

    # Send the message.
    response = bedrock_client.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )

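    # Parse the model's reply. On invalid JSON, fall back to an empty dict so the
    # return type matches the annotation and validation rejects the sample.
    try:
        r = json.loads(response["output"]["message"]["content"][0]["text"])
    except json.JSONDecodeError:
        r = {}
    return r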
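
Models may wrap a JSON reply in Markdown code fences, in which case a plain `json.loads` fails and the sample is discarded. A tolerant parser along the following lines might recover some of those replies. This is a sketch under that assumption: `parse_model_json` is a hypothetical helper, not part of the notebook or of any Bedrock API.

```python
import json
from typing import Optional

# Three backticks, built programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3


def parse_model_json(text: str) -> Optional[dict]:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence.

    Returns None when the reply is not valid JSON or is not a JSON object.
    """
    cleaned = text.strip()

    # Strip one surrounding fence, including an optional language tag such as "json".
    if len(cleaned) >= 2 * len(FENCE) and cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        body = cleaned[len(FENCE):-len(FENCE)]
        first_newline = body.find("\n")
        # Everything up to the first newline is the fence's language tag.
        cleaned = body[first_newline + 1:] if first_newline != -1 else body

    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None

    # Only a JSON object can carry the expected "dialog" key.
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` on failure keeps the "invalid reply" case easy to tell apart from a valid but empty object; the caller can skip such samples before validation.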

## 3. Dataset 

Now, let's download the dataset that we are going to transform. For this demonstration, we use [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) from Hugging Face. However, you might want to choose a dataset that is closer to your application's domain, or use your own data.

In [None]:
# split = "train_sft" # 208k rows
split = "test_sft" # 23.1k rows

dialogs_list = list(
    datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split=split)
)


random.shuffle(dialogs_list)
print(len(dialogs_list))
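
Each row loaded above is passed to the model as a whole. As an optional refinement, one could pass only the conversation turns. The sketch below assumes, based on the dataset card, that every row has `prompt`, `prompt_id`, and `messages` fields, where `messages` is a list of `role`/`content` dictionaries; `to_dialog` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def to_dialog(row: dict) -> dict:
    """Keep only the conversation turns of a dataset row, under a 'dialog' key.

    Assumption: the row has a 'messages' list whose items carry 'role' and 'content'.
    """
    return {
        "dialog": [
            {"role": message["role"], "content": message["content"]}
            for message in row["messages"]
        ]
    }
```

Dropping the other fields would shorten the prompt and would give the model an input that already has the shape of the requested output.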

## 4. Generation

Before proceeding with the synthetic data generation, it is important to note that Large Language Models (LLMs) may occasionally misinterpret conversations or produce output that doesn't adhere to the desired format for our specific use case. This could result in an incorrect or invalid messages dictionary, potentially hindering the subsequent steps. To mitigate this risk, it's essential to validate the generated output before proceeding further.

Validation can be done in several ways, for example with templates, schemas, or regular expressions. Here the expected structure is simple, so we write a small function with explicit checks on the structure and types of the generated dialog dictionary.

In [None]:
def validate_generated_dialog(dialog: dict) -> bool:
    if not isinstance(dialog, dict):
        return False
    if 'dialog' not in dialog:
        return False
    if not isinstance(dialog['dialog'], list):
        return False
    for message in dialog['dialog']:
        if not isinstance(message, dict):
            return False
        if 'role' not in message or 'content' not in message:
            return False
        if message['role'] not in ['user', 'assistant']:
            return False
        if not isinstance(message['content'], str):
            return False
    return True



In [None]:
model_id = "mistral.mistral-large-2407-v1:0"

generated = []
for dialog in tqdm(dialogs_list[:8]):
    gen = generate(bedrock_runtime, model_id, description, dialog)
    if validate_generated_dialog(gen):
        generated.append(gen)

print(len(generated))



## 5. Comparing Original and Transformed Dialogues

Now that we've generated the transformed dialogues using the **Enthusiastic Life Coach** personality, let's compare an example from the original dataset with its new version. This will help us observe how the assistant's messages have been updated to reflect the desired personality.

In [None]:
print("Original Reference:")
original = dialogs_list[0]
pprint(original)


## Transformed Dialogue with Enthusiastic Life Coach Personality

In [None]:
print("New Generated:")
gen = generated[0]
pprint(gen)


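### Observations

Because the dataset is shuffled, the dialogue you see will differ between runs. In one example run, we observed the following:
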
* **Enhanced Greetings and Enthusiasm:** The assistant now starts responses with encouraging phrases like "Absolutely!" and "Let's get your oven ready for some delicious baking!"
* **Positive and Supportive Language:** Phrases such as "You're going to do fantastic!" and "You've got this!" add motivational support.
* **Expressive Tone:** The assistant uses exclamation marks and warm language to convey enthusiasm.
* **Consistent Style:** The assistant maintains the life coach persona throughout the conversation, regardless of the topic.

## Utilizing the Transformed Dataset for Fine-Tuning

With our dataset now infused with the **Enthusiastic Life Coach** personality, we can fine-tune a Mistral model on Amazon SageMaker. This enables the creation of an AI assistant that consistently exhibits the desired personality traits, enhancing user engagement and experience.

**Next Steps:**
* **Prepare the Dataset:** Ensure it's correctly formatted for SageMaker training.
* **Set Up Training Job:** Configure the fine-tuning process using the transformed dataset.
* **Deploy the Model:** After training, deploy the model for real-world applications.
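
The "Prepare the Dataset" step can be sketched as writing the validated dialogs to a JSON Lines file, one sample per line, as described in the formatting instructions earlier. This is a minimal sketch: `write_jsonl` and the output path in the usage note are hypothetical, and uploading the file for training is not shown.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write one JSON object per line to `path`; return the number of lines written."""
    out_path = Path(path)
    # Create the target folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps without indent emits a single line; newlines inside
            # strings are escaped, so each record stays on its own line.
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```

In this notebook, a call such as `write_jsonl(generated, "train/train.jsonl")` would store every validated dialog in a file with the mandatory `.jsonl` extension.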

## Scaling Up with Asynchronous Processing

To efficiently handle larger datasets, we can employ asynchronous or batch processing. Using asynchronous functions allows us to process multiple dialogues concurrently, significantly reducing the total processing time.

**Implementation Highlights:**
* Utilize Python's `asyncio` library for concurrent execution.
* Control concurrency to balance speed and API rate limits.
* Monitor progress with tools like `tqdm`.

**Example:** With 50 concurrent requests, thousands of conversations can be processed much faster than sequentially, provided the API rate limits of your account allow it.
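
The concurrency control described above can be sketched with an `asyncio.Semaphore`. This is an illustration under stated assumptions, not the notebook's implementation: `run_concurrently` and `fake_rewrite` are hypothetical names, and `fake_rewrite` is a stand-in for a real Bedrock request so that the example runs without AWS credentials.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_concurrently(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrency: int = 50,
) -> List[Optional[R]]:
    """Apply `worker` to every item with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> Optional[R]:
        # The semaphore blocks here once `max_concurrency` workers are running.
        async with semaphore:
            try:
                return await worker(item)
            except Exception:
                # A failed request yields None so one error does not abort the batch.
                return None

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_rewrite(row: dict) -> dict:
    """Hypothetical stand-in for an asynchronous model call."""
    await asyncio.sleep(0.01)
    return {"dialog": row["messages"]}
```

In a script, `asyncio.run(run_concurrently(rows, fake_rewrite))` drives the batch; in a notebook cell, where an event loop is already running, `await run_concurrently(rows, fake_rewrite)` would be used instead. A real worker could wrap the synchronous `generate` function with `asyncio.to_thread` (Python 3.9+) or use an asynchronous client, and the results would still need to pass `validate_generated_dialog`.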

## Conclusion

In this notebook, we've shown how to transform an existing dataset to reflect a specific personality using Mistral Large 2 on Amazon Bedrock. By fine-tuning a Mistral model on SageMaker with this dataset, we can develop AI assistants that provide engaging and personalized interactions. Leveraging asynchronous processing makes this approach scalable and efficient for larger datasets.