# Data Format Demo
Our project supports two data formats: **Parquet** and **JSONL**. The examples below show how to use each of them.

## Parquet
**Parquet** is an efficient columnar storage format optimized for large-scale data. The open-source speech-to-speech dataset we provide is stored in this format on the Hugging Face Hub. You can download the entire dataset using the code below:

In [None]:
# Download the dataset
from datasets import load_dataset, load_from_disk
ds = load_dataset("gpt-omni/VoiceAssistant-400K")

# Save the dataset to disk if needed
save_path = "/path/to/save/directory"
ds.save_to_disk(save_path)

# Load the dataset from disk if needed
ds = load_from_disk(save_path)

The following code gives a quick look at how we organize data in the Parquet format.

In [14]:
from datasets import load_dataset
parquet_files = [
    "pq_demo-zh.parquet",
    "pq_demo-en.parquet",
]
ds = load_dataset('parquet', data_files=parquet_files)
ds


DatasetDict({
    train: Dataset({
        features: ['split_name', 'index', 'round', 'question', 'question_audio', 'answer', 'answer_cosyvoice_speech_token', 'answer_snac'],
        num_rows: 20
    })
})

In [16]:
index = 10  # You can change this index to select different rows
question = ds['train'][index]['question']
question_audio = ds['train'][index]['question_audio']
answer = ds['train'][index]['answer']
print(question)
print(answer)

import IPython.display as ipd
ipd.Audio(question_audio['array'], rate=question_audio['sampling_rate'])

<USER>: Are there any particular physical benefits to mindful walking, such as improved posture or increased physical fitness?
Yes, there are physical benefits to mindful walking, such as improved posture, increased physical fitness, and better balance. Mindful walking can also help relieve tension in the body, reduce stress, and improve flexibility. It can also improve circulation and help with weight management. By tuning into the body, mindful walking can also help individuals identify and address any imbalances or discomfort, leading to a healthier and more aligned body.


## JSONL
**JSONL** (JSON Lines) is a format that stores one JSON object per line. It is a simple format for structured data that can be read and written one record at a time. You can see the [example](jsonl_demo.jsonl) we provide.
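
As a rough illustration of the one-object-per-line layout, the sketch below writes a few records to a temporary JSONL file and reads them back using only the Python standard library. The helper names `read_jsonl` and `write_jsonl` and the record fields (`index`, `question`, `answer`) are assumptions made for this example; they loosely mirror the Parquet columns shown above and are not necessarily the schema of `jsonl_demo.jsonl`.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err


def write_jsonl(path, records):
    """Write each record as a single JSON object followed by a newline."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical records, used only to demonstrate the file layout
demo_records = [
    {"index": 0, "question": "What is JSONL?", "answer": "One JSON object per line."},
    {"index": 1, "question": "你好吗?", "answer": "我很好。"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    write_jsonl(demo_path, demo_records)
    loaded = list(read_jsonl(demo_path))

print(len(loaded))
print(loaded[1]["question"])
```

If you prefer to stay within the `datasets` library, `load_dataset("json", data_files="jsonl_demo.jsonl")` should also load a JSONL file, provided the records share a consistent set of fields.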