# L7: Synthetic Data Kit

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Load API keys

In [2]:
import os
from utils import get_llama_api_key
llama_api_key = get_llama_api_key()

In [3]:
os.environ["API_ENDPOINT_KEY"] = llama_api_key

In [4]:
#!pip install synthetic-data-kit==0.0.4b1

# Ingesting PDF files and web pages

In [5]:
!synthetic-data-kit ingest paper.pdf

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Processing paper.pdf...
Text successfully extracted to data/output/paper.txt


In [6]:
!head -50 data/output/paper.txt | tail -10

Chunting Zhou⋄, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†,2,⋄,
Srinivasan Iyer†

FAIR at Meta, 1Paul G. Allen School of Computer Science & Engineering, University of Washington,
2University of Chicago
‡Joint second author, †Joint last author, ⋄Work done at Meta

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first
time, matches tokenization-based LLM performance at scale with significant improvements in inference
efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the
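
The `head -50 ... | tail -10` pipeline above prints lines 41 to 50 of the extracted text. For readers who prefer to stay in Python, the following is a minimal sketch of the same preview. The helper name `preview_lines` is hypothetical (not part of Synthetic Data Kit), and the result matches the shell pipeline only when the file has at least 50 lines.

```python
from itertools import islice
from pathlib import Path


def preview_lines(path, start=40, stop=50):
    """Return the lines in the 0-based slice [start, stop) of a text file.

    With the defaults this yields lines 41..50, like `head -50 | tail -10`
    on a file that has at least 50 lines.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, start, stop)]


# Path taken from the ingest output above; skipped if the file is absent.
parsed_path = Path("data/output/paper.txt")
if parsed_path.exists():
    for line in preview_lines(parsed_path):
        print(line)
```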


In [7]:
!synthetic-data-kit ingest https://ai.meta.com/blog/llama-4-multimodal-intelligence/

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Processing https://ai.meta.com/blog/llama-4-multimodal-intelligence/...
Text successfully extracted to data/output/ai_meta_com.txt



## Creating a QA dataset

In [9]:
!synthetic-data-kit create data/output/paper.txt --type qa

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
Using api-endpoint provider
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Found
Using API key: From env var
Using API base URL: http://jupyter-api-proxy.internal.dlai/rev-proxy/llama-api
Using api-endpoint provider
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint

In [10]:
!cat data/generated/paper_qa_pairs.json

{
  "summary": "The document introduces the Byte Latent Transformer (BLT), a new byte-level LLM architecture that matches tokenization-based LLM performance at scale while improving inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, allocating more compute to complex data. The model comprises three modules: a local encoder, a global latent transformer, and a local decoder. BLT achieves training flop-controlled parity with Llama 3 while using up to 50% fewer flops at inference and demonstrates improved robustness to input noise and character-level understanding.",
  "qa_pairs": [
    {
      "question": "What is the Byte Latent Transformer (BLT) and what does it achieve?",
      "answer": "The Byte Latent Transformer (BLT) is a new byte-level LLM architecture that matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness."
    },
    {
      "question": "How does BLT encode bytes?"
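
The generated file is plain JSON with a `summary` string and a `qa_pairs` list whose items carry `question` and `answer` keys, as the output above shows. The following is a minimal sketch of reading that structure in Python. The helper `summarize_qa_file` is hypothetical, and the sample text is a shortened stand-in for `data/generated/paper_qa_pairs.json`, not the real file.

```python
import json


def summarize_qa_file(text):
    """Parse the JSON text of a generated QA file and return (summary, pairs)."""
    data = json.loads(text)
    return data["summary"], data["qa_pairs"]


# Shortened stand-in with the same keys as the generated file.
sample_text = json.dumps({
    "summary": "The document introduces the Byte Latent Transformer (BLT).",
    "qa_pairs": [
        {
            "question": "How does BLT encode bytes?",
            "answer": "BLT encodes bytes into dynamically sized patches.",
        },
    ],
})

summary, pairs = summarize_qa_file(sample_text)
print(f"{len(pairs)} QA pair(s)")
for pair in pairs:
    print("Q:", pair["question"])
    print("A:", pair["answer"])
```

To inspect the real file, pass the result of `open("data/generated/paper_qa_pairs.json").read()` instead of `sample_text`.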

## Curating the dataset

In [11]:
!synthetic-data-kit curate data/generated/paper_qa_pairs.json --threshold=8

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Found
Using API key: From env var
Using API base URL: http://jupyter-api-proxy.internal.dlai/rev-proxy/llama-api
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Processing 2 batches of QA pairs...

In [12]:
!cat data/cleaned/paper_qa_pairs_cleaned.json

{
  "summary": "The document introduces the Byte Latent Transformer (BLT), a new byte-level LLM architecture that matches tokenization-based LLM performance at scale while improving inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, allocating more compute to complex data. The model comprises three modules: a local encoder, a global latent transformer, and a local decoder. BLT achieves training flop-controlled parity with Llama 3 while using up to 50% fewer flops at inference and demonstrates improved robustness to input noise and character-level understanding.",
  "qa_pairs": [
    {
      "question": "What is the Byte Latent Transformer (BLT) and what does it achieve?",
      "answer": "The Byte Latent Transformer (BLT) is a new byte-level LLM architecture that matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness.",
      "rating": 9
    },
    {
      "question": "How doe
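
Curation adds a `rating` to each pair and keeps only the pairs that pass `--threshold`. The following is a rough sketch of that final filtering step only; the ratings themselves come from the LLM and are not reproduced here. The function `filter_by_rating` is hypothetical, and the `>=` comparison is an assumption that is consistent with this notebook's saved output, where a pair rated 8 survives `--threshold=8`. The third sample pair and its rating of 6 are made up for illustration.

```python
def filter_by_rating(qa_pairs, threshold):
    """Keep pairs whose rating is at least `threshold` (assumed comparison)."""
    return [pair for pair in qa_pairs if pair.get("rating", 0) >= threshold]


rated_pairs = [
    {"question": "What is the Byte Latent Transformer (BLT) and what does it achieve?", "rating": 9},
    {"question": "How does BLT encode bytes?", "rating": 8},
    # Hypothetical low-rated pair, included only to show the filter dropping it.
    {"question": "Is BLT good?", "rating": 6},
]

kept = filter_by_rating(rated_pairs, threshold=8)
print(f"Kept {len(kept)} of {len(rated_pairs)} pairs")
```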

## Saving the dataset

In [13]:
!synthetic-data-kit save-as data/cleaned/paper_qa_pairs_cleaned.json --format jsonl --storage json

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Converting data/cleaned/paper_qa_pairs_cleaned.json to jsonl format with json storage...
Converted to jsonl format and saved to data/final/paper_qa_pairs_cleaned.jsonl


In [14]:
!head -10  data/final/paper_qa_pairs_cleaned.jsonl

{"question": "What is the Byte Latent Transformer (BLT) and what does it achieve?", "answer": "The Byte Latent Transformer (BLT) is a new byte-level LLM architecture that matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness.", "rating": 9}
{"question": "How does BLT encode bytes?", "answer": "BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation.", "rating": 8}
{"question": "What is the main advantage of the BLT architecture over tokenization-based models?", "answer": "The BLT architecture dynamically allocates compute where it is needed, resulting in more efficient allocation of compute than tokenization-based models.", "rating": 9}
{"question": "How does BLT achieve improved flop efficiency compared to Llama 3?", "answer": "BLT matches training flop-controlled performance of Llama 3 while using up to 50% fewer flops at inference.", "rating": 9}
{"question": "Where is th
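
In the JSONL output each line is one self-contained JSON object, so the file can be read record by record. The following is a minimal sketch of such a reader, using two lines copied from the output above as an in-memory sample. The helper `read_jsonl` is hypothetical and not part of Synthetic Data Kit.

```python
import io
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Two records copied from the saved output; a real run would pass an open file.
sample_jsonl = io.StringIO(
    '{"question": "How does BLT encode bytes?", "answer": "BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation.", "rating": 8}\n'
    '{"question": "How does BLT achieve improved flop efficiency compared to Llama 3?", "answer": "BLT matches training flop-controlled performance of Llama 3 while using up to 50% fewer flops at inference.", "rating": 9}\n'
)

records = read_jsonl(sample_jsonl)
average_rating = sum(r["rating"] for r in records) / len(records)
print(len(records), "records, average rating", average_rating)
```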

In [15]:
!synthetic-data-kit save-as data/cleaned/paper_qa_pairs_cleaned.json --format ft --storage json

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Converting data/cleaned/paper_qa_pairs_cleaned.json to ft format with json storage...
Converted to ft format and saved to data/final/paper_qa_pairs_cleaned_ft.json


In [16]:
!head -30 data/final/paper_qa_pairs_cleaned_ft.json

[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the Byte Latent Transformer (BLT) and what does it achieve?"
      },
      {
        "role": "assistant",
        "content": "The Byte Latent Transformer (BLT) is a new byte-level LLM architecture that matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness."
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "How does BLT encode bytes?"
      },
      {
        "role": "assistant",
        "content": "BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation."
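
The `ft` format wraps each QA pair in a `messages` list with `system`, `user`, and `assistant` roles, and the `rating` field does not appear in the records shown above. The following is a sketch of that mapping for a single pair. The function `to_ft_record` is hypothetical and only mirrors the shape of the output; it is not the tool's own implementation.

```python
import json


def to_ft_record(pair, system_prompt="You are a helpful assistant."):
    """Map one QA pair to a chat-style record shaped like the ft output."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]
    }


cleaned_pair = {
    "question": "How does BLT encode bytes?",
    "answer": "BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation.",
    "rating": 8,
}

ft_record = to_ft_record(cleaned_pair)
print(json.dumps(ft_record, indent=2))
```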


## Configuration file

In [17]:
!cat "$(pip show synthetic-data-kit | grep Location | awk '{print $2}')/synthetic_data_kit/config.yaml"

# Master configuration file for Synthetic Data Kit

# Global paths configuration
paths:
  # Input data locations
  input:
    pdf: "data/pdf"
    html: "data/html"
    youtube: "data/youtube"
    docx: "data/docx"
    ppt: "data/ppt"
    txt: "data/txt"
  
  # Output locations
  output:
    parsed: "data/output"      # Where parsed text files are saved
    generated: "data/generated" # Where generated content is saved
    cleaned: "data/cleaned"     # Where cleaned content is saved
    final: "data/final"         # Where final formatted content is saved

# LLM Provider configuration
llm:
  # Provider selection: "vllm" or "api-endpoint"
  provider: "api-endpoint"

# VLLM server configuration
vllm:
  api_base: "http://localhost:8000/v1" # Base URL for VLLM API
  port: 8000                           # Port for VLLM server
  model: "meta-llama/Llama-3.3-70B-Instruct" # Default model to use
  max_retries: 3                       # Number of retries for API call

## Do it yourself: Creating a CoT dataset

This section shows how to create a Chain of Thought (CoT) reasoning dataset from the paper. By default, 10 CoT examples are created. You can change this number with the `--num-pairs` command-line parameter, as shown below.

In [18]:
!synthetic-data-kit create data/output/paper.txt --type cot --num-pairs 5

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
Using api-endpoint provider
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Found
Using API key: From env var
Using API base URL: http://jupyter-api-proxy.internal.dlai/rev-proxy/llama-api
Using api-endpoint provider
Generating cot content from data/output/paper.txt...
INFO:httpx:HTTP Request: POST http://jupyter-api-proxy.internal.dlai/rev-proxy/llama-api/chat/completions "HTTP/1.1 200 OK"

In [19]:
!cat  data/generated/paper_cot_examples.json

{
  "summary": "The Byte Latent Transformer (BLT) is a new byte-level LLM architecture that matches tokenization-based LLM performance at scale while improving inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, allocating more compute to complex data. The model achieves training flop-controlled parity with Llama 3 while using up to 50% fewer flops at inference.",
  "cot_examples": [
    {
      "question": "How does the Byte Latent Transformer (BLT) architecture dynamically allocate compute based on data complexity?",
      "reasoning": "Step 1: Understand the BLT architecture and its components.\nStep 2: Identify how BLT segments bytes into patches.\nStep 3: Analyze how the entropy of the next byte prediction influences patch boundaries.\nStep 4: Recognize how the allocation of compute is dynamically adjusted based on patch size and complexity.",
      "answer": "BLT dynamically allocates compute by segmenting bytes into patches based on the e

Each entry in `cot_examples` is a dictionary with three keys: `question`, `reasoning`, and `answer`. For example:

```
      "question": "How does the Byte Latent Transformer (BLT) architecture dynamically allocate compute based on data complexity?",

      "reasoning": "Step 1: Understand that BLT uses a dynamic method for grouping bytes into patches.\nStep 2: Recognize that the patching function segments bytes based on the entropy of the next byte prediction.\nStep 3: Analyze how the entropy patching method uses a small byte-level language model to compute next byte entropies.\nStep 4: Determine how patch boundaries are identified based on entropy thresholds or relative changes in entropy.\nStep 5: Conclude that BLT dynamically allocates compute by invoking the Latent Transformer based on patch boundaries determined by entropy.",
      
      "answer": "BLT dynamically allocates compute by segmenting bytes into patches based on the entropy of the next byte prediction, using a small byte-level language model to determine patch boundaries."
```
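
The `reasoning` value is a single string in which each step starts with `Step N:` and steps are separated by newlines. The following is a small sketch that splits such a string into numbered steps so they can be reviewed one at a time. The helper `split_steps` is hypothetical, and the sample string is copied from the generated output earlier in this section.

```python
import re


def split_steps(reasoning):
    """Split a 'Step N: ...' reasoning string into (number, text) tuples."""
    steps = []
    for line in reasoning.split("\n"):
        match = re.match(r"Step (\d+):\s*(.*)", line.strip())
        if match:
            steps.append((int(match.group(1)), match.group(2)))
    return steps


sample_reasoning = (
    "Step 1: Understand the BLT architecture and its components.\n"
    "Step 2: Identify how BLT segments bytes into patches.\n"
    "Step 3: Analyze how the entropy of the next byte prediction influences patch boundaries.\n"
    "Step 4: Recognize how the allocation of compute is dynamically adjusted based on patch size and complexity."
)

steps = split_steps(sample_reasoning)
for number, text in steps:
    print(number, text)
```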

If it is hard to judge whether the `reasoning` steps above really lead from `question` to `answer`, the next section uses grade school math problems, where the correctness of the reasoning is easy to verify.

## Do it yourself: Creating a math reasoning dataset

GSM8K is a dataset of about 8,500 high-quality, linguistically diverse grade school math word problems. To load it, first install the Hugging Face `datasets` library: `pip install -U datasets==2.14.6`. The code below takes the first 50 examples from the training set and saves their questions to a text file.

In [20]:
import os
from datasets import load_dataset
from datasets import logging as datasets_logging

# Show only error-level log messages from the datasets library
datasets_logging.set_verbosity_error()

# Create the output directory if it doesn't exist
os.makedirs('data/output', exist_ok=True)

# Load the GSM8K dataset
gsm8k = load_dataset('gsm8k', 'main')

# Take the first 50 examples from the training set
samples = gsm8k['train'].select(range(50))

# Write the questions to a text file as numbered problems
with open('data/output/gsm8k_sample.txt', 'w') as f:
    for i, item in enumerate(samples):
        f.write(f"Problem {i+1}:\n{item['question']}\n\n")

print(f"Created sample with {len(samples)} problems")
print("Sample saved to data/output/gsm8k_sample.txt")


Created sample with 50 problems
Sample saved to data/output/gsm8k_sample.txt
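
The text file written above uses a simple layout: a `Problem N:` line, the question on the next line, then a blank line. The following is a minimal sketch that parses this layout back into a dictionary, which can help when matching generated CoT examples to their source problems. The helper `parse_problems` is hypothetical, and the sample text reuses the first two GSM8K questions from the file.

```python
import re


def parse_problems(text):
    """Parse 'Problem N:' blocks into a {number: question} dictionary."""
    # re.split with a capturing group returns [prefix, n1, body1, n2, body2, ...]
    parts = re.split(r"^Problem (\d+):\n", text, flags=re.MULTILINE)
    return {
        int(number): body.strip()
        for number, body in zip(parts[1::2], parts[2::2])
    }


sample_text = (
    "Problem 1:\n"
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did Natalia sell altogether in April and May?\n\n"
    "Problem 2:\n"
    "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes "
    "of babysitting. How much did she earn?\n\n"
)

problems = parse_problems(sample_text)
print(len(problems), "problems parsed")
print(problems[2])
```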


The first three questions in the sample are:

```
Problem 1:
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Problem 2:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?

Problem 3:
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
```

In [21]:
!cat data/output/gsm8k_sample.txt

Problem 1:
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Problem 2:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?

Problem 3:
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?

Problem 4:
Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?

Problem 5:
James writes a 3-page letter to 2 different friends twice a week.  How many pages does he write a year?

Problem 6:
Mark has a garden with flowers. He planted plants of three

Now run the Synthetic Data Kit tool to create a math reasoning `cot` dataset from the text file.

In [22]:
!synthetic-data-kit create data/output/gsm8k_sample.txt --type cot

Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
Using api-endpoint provider
Loading config from: /usr/local/lib/python3.11/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Found
Using API key: From env var
Using API base URL: http://jupyter-api-proxy.internal.dlai/rev-proxy/llama-api
Using api-endpoint provider
Generating cot content from data/output/gsm8k_sample.txt...
INFO:httpx:HTTP Request: POST http://jupyter-api-proxy.internal.dlai/rev-proxy/llama-api/chat/completions

In [23]:
!cat data/generated/gsm8k_sample_cot_examples.json

{
  "summary": "Here is a summary of the document in 2-3 sentences:\n\nThe document contains 50 math problems covering various topics such as algebra, geometry, and basic arithmetic operations. The problems range from simple calculations to more complex scenarios involving multiple steps and variables. The document appears to be a collection of practice problems or a test bank for assessing mathematical skills.",
  "cot_examples": [
    {
      "question": "If James writes a 3-page letter to 2 different friends twice a week, how many pages does he write in 52 weeks?",
      "reasoning": "Step 1: First, I need to determine how many pages James writes per week. He writes 3 pages per letter and sends 2 letters twice a week, so he writes 3 * 2 * 2 = 12 pages per week.\nStep 2: To find out how many pages he writes in a year, I need to multiply the number of pages he writes per week by the number of weeks in a year. There are 52 weeks in a year.\nStep 3: Multiply 12 pages per week by 52

You can check two examples of the generated dataset and confirm the added reasoning steps are correct:

In [24]:
import json

# Load the generated CoT examples
with open("data/generated/gsm8k_sample_cot_examples.json", "r") as f:
    data = json.load(f)
print(data['cot_examples'][0]['question'])
print(data['cot_examples'][0]['reasoning'])
print(data['cot_examples'][0]['answer'])

If James writes a 3-page letter to 2 different friends twice a week, how many pages does he write in 52 weeks?
Step 1: First, I need to determine how many pages James writes per week. He writes 3 pages per letter and sends 2 letters twice a week, so he writes 3 * 2 * 2 = 12 pages per week.
Step 2: To find out how many pages he writes in a year, I need to multiply the number of pages he writes per week by the number of weeks in a year. There are 52 weeks in a year.
Step 3: Multiply 12 pages per week by 52 weeks to get the total number of pages written in a year.
624


In [25]:
print(data['cot_examples'][-1]['question'])
print(data['cot_examples'][-1]['reasoning'])
print(data['cot_examples'][-1]['answer'])

If a concert ticket costs $40 and Mr. Benson bought 12 tickets with a 5% discount on every ticket exceeding 10, how much did he pay in total?
Step 1: Calculate the cost of the first 10 tickets at full price. 10 tickets * $40 = $400.
Step 2: Determine the number of tickets that exceed 10, which is 12 - 10 = 2 tickets.
Step 3: Calculate the cost of the 2 extra tickets with a 5% discount. The discount on each ticket is $40 * 0.05 = $2. So, the price per ticket after discount is $40 - $2 = $38.
Step 4: The total cost for the 2 discounted tickets is 2 * $38 = $76.
Step 5: Add the cost of the first 10 tickets to the cost of the 2 discounted tickets to get the total cost.
476
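
The arithmetic in both examples is simple enough to recompute directly. The following is a short sketch that redoes each calculation from the reasoning steps shown above and compares the result with the generated `answer`. The variable names are illustrative only.

```python
# Example 1: 3 pages per letter, 2 friends, twice a week, for 52 weeks.
pages_per_week = 3 * 2 * 2
pages_per_year = pages_per_week * 52
print(pages_per_year)  # 624, matching the generated answer

# Example 2: 12 tickets at $40, with 5% off each ticket beyond the first 10.
ticket_price = 40
full_price_cost = 10 * ticket_price             # first 10 tickets: $400
discount_per_ticket = ticket_price * 5 // 100   # 5% of $40 is $2
discounted_cost = (12 - 10) * (ticket_price - discount_per_ticket)  # 2 * $38
total_cost = full_price_cost + discounted_cost
print(total_cost)  # 476, matching the generated answer
```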
