## Introduction

This Jupyter notebook provides an interactive way to validate your training data (and, optionally, your validation data) in `JSONL` format. It checks that the data meets the Claude 3 Haiku fine-tuning requirements for file size, line count, token count, and data structure.

## Data Validation Checks

The notebook uses a custom data validation script to perform the following checks:

1. **File Structure**: Ensures each line in the JSONL file is valid JSON.
2. **File Size**:
   - Training data: Maximum of 10GB
   - Validation data: Maximum of 1GB
3. **Line Count**:
   - Training data: Between 32 and 10,000 lines
   - Validation data: Between 32 and 1,000 lines
   - Total (training + validation): Must not exceed 10,000 lines
4. **Data Structure**: Validates the structure of each entry in the JSONL file.
5. **Message Structure**: Checks the order and roles of messages in each entry.
6. **Token Count**: Ensures each entry has fewer than 32,000 tokens.
7. **Reserved Keywords**: Checks for the absence of Anthropic's reserved keywords in prompts.
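
As a rough illustration of checks 1 to 3, here is a minimal sketch of file-level validation. It is not the `data_validation` script itself: the function name `check_jsonl_file` and the two limit dictionaries are hypothetical, the numeric limits are copied from the list above, and treating "10GB" as 10 × 1024³ bytes is an assumption.

```python
import json
import os

# Limits copied from the list above. Reading "GB" as 1024**3 bytes is an assumption.
TRAINING_LIMITS = {"max_bytes": 10 * 1024**3, "min_lines": 32, "max_lines": 10_000}
VALIDATION_LIMITS = {"max_bytes": 1 * 1024**3, "min_lines": 32, "max_lines": 1_000}


def check_jsonl_file(path, limits):
    """Return a list of error strings; an empty list means all checks passed."""
    errors = []

    # File size check
    size = os.path.getsize(path)
    if size > limits["max_bytes"]:
        errors.append(f"File size {size} bytes exceeds {limits['max_bytes']} bytes")

    # File structure check: every line must parse as JSON
    line_count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line_count += 1
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"Line {line_number}: invalid JSON ({exc.msg})")

    # Line count check
    if not limits["min_lines"] <= line_count <= limits["max_lines"]:
        errors.append(
            f"Line count {line_count} is outside the range "
            f"{limits['min_lines']} to {limits['max_lines']}"
        )
    return errors
```

Calling `check_jsonl_file("train.jsonl", TRAINING_LIMITS)` on a hypothetical local file returns the collected errors. The combined 10,000-line limit across training and validation data would need a separate comparison of the two line counts.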

### Reserved Keywords

The validation process now includes a check for Anthropic's reserved keywords. The following keywords must not appear in any prompt (system message or user/assistant messages):

- "\nHuman:"
- "\nAssistant:"

Note that variations of these keywords without the colon (e.g., "\nHuman" or "\nAssistant") are allowed.
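
The rule above can be sketched as a plain substring check. This is an illustrative assumption about how such a check might work, not the validator's actual code. The name `find_reserved_keywords` is hypothetical, and the sketch assumes `\n` denotes a real newline character rather than a literal backslash followed by `n`.

```python
# Reserved keywords from the list above; "\n" is taken to be a newline character (assumption).
RESERVED_KEYWORDS = ("\nHuman:", "\nAssistant:")


def find_reserved_keywords(text):
    """Return the reserved keywords that occur anywhere in `text`."""
    return [keyword for keyword in RESERVED_KEYWORDS if keyword in text]


print(find_reserved_keywords("Hi\nHuman: hello"))  # ['\nHuman:']
print(find_reserved_keywords("Hi\nHuman hello"))   # [] (no colon, so allowed)
```

Because the match includes the trailing colon, the colon-free variations mentioned above pass the check.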

## Using the Notebook

### Data Location

The validation script requires your data to be available locally. If your training or validation dataset is stored in S3, the notebook downloads it to the local working directory before validating it.

**Note**: If your datasets are stored in S3, ensure that the notebook has sufficient permissions to access S3.

### Interpreting the Results

After running the validation, you will see output indicating whether the validation was successful or not:

- If validation succeeds, you will see the message "All data passed validation!"
- If validation fails, each error is listed with specific details about the issue and its location.

In [None]:
# Import libraries and data validator

import os
import boto3
from data_validation import validate_data

In [None]:
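# Function to download files from S3 if necessary

def get_local_path(file_path):
    """Return a local path for file_path, downloading it first if it is an S3 URI."""
    if file_path.startswith('s3://'):
        s3 = boto3.client('s3')
        # Save the object in the current working directory under its base name
        local_path = os.path.basename(file_path)
        # Split 's3://bucket/key' into the bucket name and the object key
        bucket, key = file_path[5:].split('/', 1)
        s3.download_file(bucket, key, local_path)
        return local_path
    # Anything that is not an S3 URI is treated as an existing local path
    return file_path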

In [None]:
# Set up the paths to your data files; provide either an S3 URI or a local file path

training_file = "path to your training JSONL file"
validation_file = "path to your validation JSONL file"  # optional; set to None to skip

In [None]:
# Get local path

local_training_path = get_local_path(training_file)
local_validation_path = get_local_path(validation_file) if validation_file else None

In [None]:
# Run validation
validate_data(local_training_path, local_validation_path)

## Next Step

Once validation succeeds, your data is ready to use in a Claude 3 Haiku fine-tuning job.