# O1DataGen with CAMEL and Uploading Data to Hugging Face

You can also view this cookbook in Colab [here](https://colab.research.google.com/drive/1BEX7JQ7qtidy4W0glHh7cTZQksisM0C7?usp=sharing).



This notebook demonstrates how to set up and use CAMEL's **O1DataGenerator** to generate high-quality question-answer pairs in the style of o1 thinking data, and how to upload the resulting data to Hugging Face.

In this notebook, you'll explore:

- **CAMEL**: A powerful multi-agent framework that enables Retrieval-Augmented Generation and multi-agent role-playing scenarios, allowing for sophisticated AI-driven tasks.
- **O1DataGenerator**: A tool for generating o1-style thinking data.
- **Hugging Face Integration**: Uploading datasets to the Hugging Face platform for sharing.


⭐ **Star the Repo**

If you find CAMEL useful or interesting, please consider giving it a star on our [CAMEL GitHub Repo](https://github.com/camel-ai/camel)! Your stars help others find this project and motivate us to continue improving it.

### o1datagen

In [None]:
%%capture
!pip install camel-ai==0.2.15a0

In [None]:
import os
from datetime import datetime
import json
from camel.datagen.o1datagen import O1DataGenerator

### Set the OPENAI_API_KEY used to generate the data

In [None]:
from getpass import getpass

In [None]:
openai_api_key = getpass('Enter your OpenAI API key: ')
os.environ["OPENAI_API_KEY"] = openai_api_key

Enter your OpenAI API key: ··········


### Create a system message to define the agent's default role and behavior

In [None]:
sys_msg = 'You are a genius at slow-thinking data and code'

### Use ModelFactory to set up the backend model for the agent

In [None]:
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.configs import ChatGPTConfig

In [None]:
# Define the model; this example uses gpt-4o-mini
model = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI,
    model_type=ModelType.GPT_4O_MINI,
    model_config_dict=ChatGPTConfig().as_dict(), # [Optional] the config for model
)

Initialize AI model by OPENAI_COMPATIBLE_MODEL

### Set ChatAgent

In [None]:
from camel.agents import ChatAgent
chat_agent = ChatAgent(
    system_message=sys_msg,
    model=model,
    message_window_size=10,
)

### Load Q&A data from a JSON file

### Prepare the Q&A data in a JSON file like the one below:

'''
{
    "question1": "answer1",
    "question2": "answer2",
    ...
}
'''
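A minimal sketch of preparing and checking such a file before loading it. The helper `validate_qa_data` and the sample questions are hypothetical and not part of CAMEL; the sketch only assumes the flat `{question: answer}` layout shown above.

```python
import json
import os
import tempfile


def validate_qa_data(data):
    """Hypothetical helper: require a flat {question: answer} mapping of non-empty strings."""
    if not isinstance(data, dict) or not data:
        raise ValueError("Expected a non-empty JSON object mapping questions to answers")
    for question, answer in data.items():
        if not isinstance(question, str) or not question.strip():
            raise ValueError(f"Invalid question key: {question!r}")
        if not isinstance(answer, str) or not answer.strip():
            raise ValueError(f"Answer for {question!r} must be a non-empty string")
    return data


# Placeholder Q&A pairs, for illustration only
sample_qa = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
}

# Written to a temporary directory here; in the notebook the file would be 'qa_data.json'
sample_path = os.path.join(tempfile.mkdtemp(), "qa_data.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_qa, f, ensure_ascii=False, indent=2)

# Read the file back and check its shape before handing it to the generator
with open(sample_path, "r", encoding="utf-8") as f:
    loaded = validate_qa_data(json.load(f))

print(f"Loaded {len(loaded)} question-answer pairs")
```

Checking the shape up front makes a malformed file fail with a clear message instead of partway through generation.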

In [None]:
!pwd

/content


In [None]:
# Load JSON data
file_path = 'qa_data.json'

In [None]:
with open(file_path, 'r', encoding='utf-8') as f:
    qa_data = json.load(f)

### Create an instance of O1DataGenerator

In [None]:
# Create an instance of O1DataGenerator
testo1 = O1DataGenerator(chat_agent, golden_answers=qa_data)

In [None]:
# Record generated answers
generated_answers = {}

### Test Q&A

In [None]:
# Test Q&A
for question in qa_data.keys():
    print(f"Question: {question}")

    # Get AI's thought process and answer
    answer = testo1.get_answer(question)
    generated_answers[question] = answer
    print(f"AI's thought process and answer:\n{answer}")

    # Verify the answer
    is_correct = testo1.verify_answer(question, answer)
    print(f"Answer verification result: {'Correct' if is_correct else 'Incorrect'}")
    print("-" * 50)
    print()  # Add a new line at the end of each iteration

Question: What is the coefficient of $x^2y^6$ in the expansion of $\left(\frac{3}{5}x-\frac{y}{2}\right)^8$?  Express your answer as a common fraction
AI's thought process and answer:
To find the coefficient of \( x^2y^6 \) in the expansion of \( \left(\frac{3}{5}x - \frac{y}{2}\right)^8 \), we can use the Binomial Theorem. The Binomial Theorem states that:

\[
(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k
\]

In our case, we can identify \( a = \frac{3}{5}x \), \( b = -\frac{y}{2} \), and \( n = 8 \).

### Step 1: Analyze the problem requirements
We need to find the specific term in the expansion that contains \( x^2y^6 \). This means we are looking for the term where \( x \) has an exponent of 2 and \( y \) has an exponent of 6.

### Step 2: List the steps to solve the problem
1. Identify the general term in the binomial expansion.
2. Set up the equation to find the specific values of \( k \) that give us \( x^2 \) and \( y^6 \).
3. Calculate the coefficient for that term.
4. S
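The printed output above is cut off. As an independent check of the sample question (this is not the model's output), the coefficient can be computed directly from the Binomial Theorem with exact fractions. A minimal sketch using only the Python standard library:

```python
from fractions import Fraction
from math import comb

# General term of (a*x + b*y)^n: C(n, k) * a^(n-k) * b^k * x^(n-k) * y^k
n, k = 8, 6            # k = 6 selects the x^2 y^6 term
a = Fraction(3, 5)     # coefficient of x
b = Fraction(-1, 2)    # coefficient of y

coefficient = comb(n, k) * a ** (n - k) * b ** k
print(coefficient)  # 63/400
```

A golden answer of `63/400` in the Q&A file would therefore be consistent with this question.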

### Export the generated answers to a JSON file


In [None]:
simplified_output = {
    'timestamp': datetime.now().isoformat(),
    'qa_pairs': generated_answers
}
simplified_file = f'generated_answers_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
with open(simplified_file, 'w', encoding='utf-8') as f:
    json.dump(simplified_output, f, ensure_ascii=False, indent=2)
print(f"The generated answers have been exported to: {simplified_file}")

The generated answers have been exported to: generated_answers_20241224_131157.json


### Convert the o1 data into the Alpaca format used for SFT training

In [None]:
import json
from datetime import datetime


def transform_qa_format(input_file):
    # Read the input JSON file
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Transform the data
    transformed_data = []
    for question, answer in data['qa_pairs'].items():
        transformed_pair = {
            "instruction": question,
            "input": "",
            "output": answer
        }
        transformed_data.append(transformed_pair)

    # Generate output filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f'transformed_qa_{timestamp}.json'

    # Write the transformed data
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(transformed_data, f, ensure_ascii=False, indent=2)

    return output_file, transformed_data

In [None]:
output_file, transformed_data = transform_qa_format(simplified_file)
print(f"Transformation complete. Output saved to: {output_file}")

Transformation complete. Output saved to: transformed_qa_20241224_131203.json
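Before uploading, it can help to confirm that every record has the three Alpaca fields. The sketch below uses a hypothetical helper, `check_alpaca_records`, and assumes each generated answer is a plain string. The example records are placeholders.

```python
REQUIRED_KEYS = {"instruction", "input", "output"}


def check_alpaca_records(records):
    """Hypothetical helper: return the indices of records that do not match the expected shape."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or set(rec) != REQUIRED_KEYS:
            bad.append(i)  # missing or extra keys
        elif not all(isinstance(rec[key], str) for key in REQUIRED_KEYS):
            bad.append(i)  # non-string field
        elif not rec["instruction"].strip() or not rec["output"].strip():
            bad.append(i)  # empty question or answer
    return bad


# Placeholder records: the second one lacks the "output" field
example_records = [
    {"instruction": "What is 2 + 2?", "input": "", "output": "4"},
    {"instruction": "Missing output", "input": ""},
]
print(check_alpaca_records(example_records))  # [1]
```

In the notebook, the same check could be applied to `transformed_data`; an empty list means every record has the expected shape.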


### Upload the data to Hugging Face

In [None]:
from camel.datahubs.huggingface import HuggingFaceDatasetManager
from camel.datahubs.models import Record
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    manager = HuggingFaceDatasetManager()

    if dataset_name is None:
        dataset_name = f"{username}/qa-dataset-{datetime.now().strftime('%Y%m%d')}"
    else:
        dataset_name = f"{username}/{dataset_name}"

    # Create dataset
    print(f"Creating dataset: {dataset_name}")
    dataset_url = manager.create_dataset(name=dataset_name)
    print(f"Dataset created: {dataset_url}")

    # Create dataset card
    print("Creating dataset card...")
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="Question-Answer dataset generated by CAMEL O1DataGenerator",
        license="mit",
        language=["en"],
        size_category="<1MB",
        version="0.1.0",
        tags=["camel", "question-answering"],
        task_categories=["question-answering"],
        authors=[username]
    )
    print("Dataset card created successfully.")

    # Create Record objects with user's key-value pairs directly
    records = []
    for item in transformed_data:
        record = Record(**item)  # Use the user's key-value pair directly as the field of Record
        records.append(record)

    # Add records
    print("Adding records to the dataset...")
    manager.add_records(dataset_name=dataset_name, records=records)
    print("Records added successfully.")

    return dataset_url

In [None]:
from camel.datahubs.huggingface import HuggingFaceDatasetManager
from camel.datahubs.models import Record
from datetime import datetime

def upload_to_huggingface(transformed_data, username, dataset_name=None):
    # Initialize the HuggingFaceDatasetManager
    manager = HuggingFaceDatasetManager()

    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)

    # Create the dataset on HuggingFace
    dataset_url = create_dataset(manager, dataset_name)

    # Create the dataset card with metadata
    create_dataset_card(manager, dataset_name, username)

    # Convert transformed data into Record objects
    records = create_records(transformed_data)

    # Add the records to the dataset
    add_records_to_dataset(manager, dataset_name, records)

    return dataset_url

def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generate a default dataset name if none is provided, then prefix the name with the username.
    """
    if dataset_name is None:
        dataset_name = f"{username}/qa-dataset-{datetime.now().strftime('%Y%m%d')}"
    else:
        dataset_name = f"{username}/{dataset_name}"
    return dataset_name

def create_dataset(manager, dataset_name):
    r"""Create a new dataset on HuggingFace and return the dataset URL.
    """
    print(f"Creating dataset: {dataset_name}")
    dataset_url = manager.create_dataset(name=dataset_name)
    print(f"Dataset created: {dataset_url}")
    return dataset_url

def create_dataset_card(manager, dataset_name, username):
    r"""Create a dataset card with metadata for the dataset.
    """
    print("Creating dataset card...")
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="Question-Answer dataset generated by CAMEL O1DataGenerator",
        license="mit",
        language=["en"],
        size_category="<1MB",
        version="0.1.0",
        tags=["camel", "question-answering"],
        task_categories=["question-answering"],
        authors=[username]
    )
    print("Dataset card created successfully.")

def create_records(transformed_data):
    r"""Convert the transformed data into Record objects.
    """
    records = []
    for item in transformed_data:
        record = Record(**item)  # Use the user's key-value pair directly as the field of Record
        records.append(record)
    return records

def add_records_to_dataset(manager, dataset_name, records):
    r"""Add the list of Record objects to the dataset.
    """
    print("Adding records to the dataset...")
    manager.add_records(dataset_name=dataset_name, records=records)
    print("Records added successfully.")
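The `generate_or_validate_dataset_name` function above only formats the name; it does not check it. A sketch of a stricter variant follows. The helper `build_repo_id` is hypothetical, and its pattern is a deliberately conservative assumption, not Hugging Face's official naming rule.

```python
import re
from datetime import datetime

# Conservative pattern (an assumption, not the official Hugging Face rule):
# letters, digits, '-', '_' and '.', starting and ending with a letter or digit
_NAME_PART = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?$")


def build_repo_id(username, dataset_name=None):
    """Hypothetical helper: build '<username>/<dataset_name>' and reject unexpected names."""
    if dataset_name is None:
        # Same default naming scheme as the notebook code
        dataset_name = f"qa-dataset-{datetime.now().strftime('%Y%m%d')}"
    for part in (username, dataset_name):
        if not _NAME_PART.match(part):
            raise ValueError(f"Unexpected characters in {part!r}")
    return f"{username}/{dataset_name}"


print(build_repo_id("my-user", "o1data-demo"))  # my-user/o1data-demo
```

Rejecting names with spaces or slashes locally gives a clearer error than waiting for the upload request to fail.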

### Configure the Hugging Face access token

You can create an access token on the Hugging Face [tokens page](https://huggingface.co/settings/tokens).

In [None]:
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN

Enter your HUGGING_FACE_TOKEN: ··········
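A small sketch for confirming that a token is present in the environment before uploading, without printing it in full. The helpers `require_env` and `mask` are hypothetical, as are the `DEMO_TOKEN` variable and its placeholder value. In the notebook, the variable to check would be `HUGGING_FACE_TOKEN`.

```python
import os


def require_env(name):
    """Hypothetical helper: return an environment variable's value or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before uploading")
    return value


def mask(secret, visible=4):
    """Hide all but the last few characters so the secret is never printed in full."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Placeholder variable and value, for illustration only
os.environ.setdefault("DEMO_TOKEN", "hf_placeholder1234")
print(mask(require_env("DEMO_TOKEN")))
```

Failing early with a named variable makes a missing token easier to diagnose than an authentication error raised during the upload.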


In [None]:
# Set your personal huggingface config, then upload to HuggingFace
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter dataset name (press Enter to use default): ").strip()
if not dataset_name:
    dataset_name = None

try:
    dataset_url = upload_to_huggingface(transformed_data, username, dataset_name)
    print("\nData successfully uploaded to HuggingFace!")
    print(f"Dataset URL: {dataset_url}")
except Exception as e:
    print(f"Error uploading to HuggingFace: {str(e)}")


Enter your HuggingFace username: zjrwtxtechstudio
Enter dataset name (press Enter to use default): o1data26
Creating dataset: zjrwtxtechstudio/o1data26
Dataset created: https://huggingface.co/datasets/zjrwtxtechstudio/o1data26
Creating dataset card...
Dataset card created successfully.
Adding records to the dataset...
Records added successfully.

Data successfully uploaded to HuggingFace!
Dataset URL: https://huggingface.co/datasets/zjrwtxtechstudio/o1data26


### Summary:

This cookbook demonstrates the process of using **CAMEL's O1DataGenerator** to create high-quality question-answer pairs, similar to o1 thinking data. The notebook covers the following steps:

1. **Setup**: Installation of the `camel-ai` library and configuration of the OpenAI API key.
2. **Data Generation**: Utilization of the `O1DataGenerator` to generate answers for predefined questions using an LLM.
3. **Data Transformation**: Conversion of the generated Q&A data into a format compliant with the Alpaca training data schema.
4. **Upload to Hugging Face**: Integration with Hugging Face to upload the transformed dataset, including the creation of a dataset card and metadata.

The cookbook also includes detailed instructions for setting up the environment, handling API keys, and configuring the Hugging Face dataset upload process. The final output is a dataset uploaded to Hugging Face, ready for sharing and further use in AI training tasks.
