<a href="https://colab.research.google.com/github/aviadarn/AiNotebooks/blob/main/prepare_dataset_gemini_humanize_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing a Dataset for Gemini Fine-tuning
This tutorial demonstrates how to prepare a dataset for fine-tuning a Gemini model. We'll cover:

- Setting up the environment
- Loading and examining the data
- Converting data to the required JSONL format
- Validating the prepared dataset

# 1. Setup
Let's install the required packages first:

In [2]:
!pip install pandas datasets



Now we will import the necessary libraries:

In [3]:
import json
import pandas as pd

# 2. Downloading a Dataset from Hugging Face

To download a dataset from Hugging Face, we'll use the `datasets` library. You'll need to have a Hugging Face account and generate an access token to download private datasets or datasets that require authentication.

1.  **Get your Hugging Face Token**: Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and generate a new token with at least 'read' access.
2.  **Load the dataset**: Use the `load_dataset` function, passing your token.
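
The cell below reads the token from Colab's secrets. As a minimal sketch for running the same steps outside Colab, the token lookup could fall back to an environment variable. The helper name `get_hf_token` is hypothetical, and the use of `HF_TOKEN` as the variable name is an assumption about how you store the token locally.

```python
import os

def get_hf_token(secret_name="HUGGING_FACE_TOKEN", env_var="HF_TOKEN"):
    """Return a Hugging Face token from Colab secrets, else from the environment.

    Hypothetical helper: returns None when no token is found, which lets
    public datasets load anonymously.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook
        pass
    return os.environ.get(env_var) or None

token = get_hf_token()
# The token can then be passed on, for example:
# dataset = load_dataset("dmitva/human_ai_generated_text", token=token)
```

Returning `None` instead of raising keeps the loading code identical for public and private datasets.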

In [9]:
# Import the necessary libraries
from datasets import load_dataset
from google.colab import userdata

# Read the token from Colab's Secrets panel (stored under the name 'HUGGING_FACE_TOKEN')
HUGGING_FACE_TOKEN = userdata.get('HUGGING_FACE_TOKEN')

try:
    print("Attempting to load a dataset from Hugging Face...")
    dataset_name = "dmitva/human_ai_generated_text"

    # Pass the token unless it is still the unedited placeholder; in that case load anonymously
    dataset = load_dataset(dataset_name, token=HUGGING_FACE_TOKEN if HUGGING_FACE_TOKEN != "hf_YOUR_ACTUAL_TOKEN_HERE" else None)

    print(f"Dataset '{dataset_name}' loaded successfully!")
    print(dataset)

    # Display the first few rows of the 'train' split if available
    if 'train' in dataset:
        print("\nFirst 5 rows of the 'train' split:")

        df_train = dataset['train'].to_pandas()
        print(df_train.head())

except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Please ensure you have the correct dataset name and a valid Hugging Face token.")



Attempting to load a dataset from Hugging Face...


README.md: 0.00B [00:00, ?B/s]

model_training_dataset.csv:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Dataset 'dmitva/human_ai_generated_text' loaded successfully!
DatasetDict({
    train: Dataset({
        features: ['id', 'human_text', 'ai_text', 'instructions'],
        num_rows: 1000000
    })
})

First 5 rows of the 'train' split:
                                     id  \
0  cc902a20-27c4-4c18-8012-048a328206d1   
1  c4d2fbe3-e966-479d-89c4-62e1729b6255   
2  710f585e-5e98-42b8-81f6-265d7c934645   
3  e4db6c43-7b6b-4385-9b67-04652c71df0c   
4  7a48bcf1-cbb4-4f41-b99a-ea859c56afdf   

                                          human_text  \
0  Also they feel more comfortable at home. Some ...   
1  I can get another job to work on the weekends,...   
2  parents and school should agree on the desicio...   
3  Base in my experiences I'm growing, I try hard...   
4  Many people around the world have different ch...   

                                             ai_text  \
0  \n\nTherefore, when it comes to allowing stude...   
1  It is important to weigh both the potential co...   
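
The preview shows that some `ai_text` values start with blank lines (`\n\n`). Before converting, it may help to strip whitespace and drop rows where either text is missing or empty. The following is a minimal sketch on made-up toy data. The helper name `clean_pairs` is hypothetical, and only the column names `human_text` and `ai_text` come from the dataset.

```python
import pandas as pd

def clean_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing or blank text and strip surrounding whitespace."""
    out = df.dropna(subset=["human_text", "ai_text"]).copy()
    for col in ("human_text", "ai_text"):
        out[col] = out[col].astype(str).str.strip()
    # Keep only rows where both columns still contain text after stripping
    mask = (out["human_text"] != "") & (out["ai_text"] != "")
    return out[mask].reset_index(drop=True)

# Toy rows standing in for df_train (values are invented for illustration)
toy = pd.DataFrame({
    "human_text": ["I like school. ", None, "Short reply"],
    "ai_text": ["\n\nTherefore, students benefit.", "Some text", "   "],
})
cleaned = clean_pairs(toy)
print(cleaned)
```

On the real data this would be called as `clean_pairs(df_train)` before sampling. Whether stripping is desirable depends on how closely the tuned model should mirror the raw text.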


# 3. Converting the Data to JSONL Format

In [14]:
# Sample 100 rows for a small tuning set (fixed seed for reproducibility)
df = df_train.sample(n=100, random_state=42)

# Define the system instruction
example_text = "Your task is to rewrite AI-generated prompts to make them more human-like."
system_instruction = {
    "parts": [
        {"text": example_text}
    ]
}

# Generate the JSONL file
output_file = 'training_data.jsonl'
with open(output_file, 'w', encoding='utf-8') as f:
    for i, (_, row) in enumerate(df.iterrows()):
        # Skip the first two entries that will be used as examples
        if i < 2:
            continue

        entry = {
            "systemInstruction": system_instruction,
            "contents": [
                {
                    "role": "user",
                    "parts": [{"text": f"AI-Generated: {row['ai_text']}"}]
                },
                {
                    "role": "model",
                    "parts": [{"text": f"Human-Like: {row['human_text']}"}]
                }
            ]
        }
        # Write each JSON object as a single line
        f.write(json.dumps(entry, ensure_ascii=False) + '\n')

print(f"\n{output_file} has been generated successfully with {len(df) - 2} entries.")


training_data.jsonl has been generated successfully with 98 entries.
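
Beyond reading entries by eye, a structural check can be run over every line of the file. The sketch below only mirrors the record shape written by this notebook (`systemInstruction`, then one `user` turn and one `model` turn). It is not an official schema validator, and the function names `validate_record` and `validate_jsonl` are hypothetical.

```python
import json

def validate_record(entry) -> list[str]:
    """Return a list of problems found in one record (an empty list means OK)."""
    if not isinstance(entry, dict):
        return ["record is not a JSON object"]
    problems = []
    sys_parts = (entry.get("systemInstruction") or {}).get("parts") or []
    if not sys_parts or not sys_parts[0].get("text"):
        problems.append("missing systemInstruction text")
    contents = entry.get("contents") or []
    roles = [turn.get("role") for turn in contents]
    if roles != ["user", "model"]:
        problems.append(f"unexpected roles: {roles}")
    for turn in contents:
        parts = turn.get("parts") or []
        if not parts or not str(parts[0].get("text", "")).strip():
            problems.append(f"empty text for role {turn.get('role')}")
    return problems

def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to the problems found on that line."""
    errors = {}
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                errors[line_no] = [f"invalid JSON: {exc}"]
                continue
            problems = validate_record(entry)
            if problems:
                errors[line_no] = problems
    return errors

# In-memory records for illustration (texts are invented)
good = {
    "systemInstruction": {"parts": [{"text": "Rewrite the text."}]},
    "contents": [
        {"role": "user", "parts": [{"text": "AI-Generated: hello"}]},
        {"role": "model", "parts": [{"text": "Human-Like: hi"}]},
    ],
}
bad = {"contents": [{"role": "user", "parts": [{"text": ""}]}]}
print(validate_record(good))
print(validate_record(bad))
```

After the file has been written, `validate_jsonl('training_data.jsonl')` would return an empty dictionary if every line passes these checks.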


# 4. Validate the Generated JSONL
Let's examine a few entries from our generated JSONL file to ensure they're formatted correctly:

In [16]:
# Read and display the first few entries of the JSONL file
with open('training_data.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 1:  # Show only the first entry to keep the output short
            break
        entry = json.loads(line)
        print(f"\nEntry {i+1}:")
        print(json.dumps(entry, indent=2))
        print("-" * 80)


Entry 1:
{
  "systemInstruction": {
    "parts": [
      {
        "text": "Your task is to rewrite AI-generated prompts to make them more human-like."
      }
    ]
  },
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "AI-Generated: This increase in self-confidence pushes them to persist in their efforts to reach the goals they have set for themselves. Furthermore, reaching a goal gives an individual a sense of accomplishment and self-praise that can last long after the goal has been achieved.\n\nRecent studies have shown that when people make progress towards a goal, their self-esteem increases. Accomplishment of a goal not only provides an individual with long-lasting satisfaction but also pushes them to continue striving for greatness. \nReaching goals helps to build an individual's self-esteem in many ways. When students improve their grades, their self-concept tends to improve as well.\n\nOverall, the evidence confirms that achieving goal