# Text-to-Image Datasets

The Qianfan platform lets users upload text-to-image datasets and use them to train models.

This tutorial shows how to create a dataset locally and upload it to the Qianfan platform for later use.

# Prerequisites

Before you begin, update the Qianfan Python SDK to the latest version:

In [1]:
pip install -U qianfan



Then set your Access Key and Secret Key as environment variables:

In [2]:
import logging
import os

from qianfan.utils import enable_log

os.environ['QIANFAN_ACCESS_KEY'] = 'your_access_key'
os.environ['QIANFAN_SECRET_KEY'] = 'your_secret_key'
your_bos_storage_id = "your_bos_storage_id"
your_bos_storage_path = "your_bos_storage_path"

# Choose the log level to print; INFO-level logs are printed here
enable_log(logging.INFO)

# Main Content

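Text-to-image datasets on the Qianfan platform are organized as a folder. A dataset contains a number of image files with the extension jpg, jpeg, bmp, or png, along with a number of JSON files with the extension json.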

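The JSON files are the part to pay attention to. Each JSON file holds a single JSON object containing only a `prompt` field, which describes the content of the image file that shares the JSON file's name.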

For example, a JSON file named `example.json` that describes "a dog rolling on the lawn" could contain:
```json
{
    "prompt": "A dog rolling on the lawn"
}
```
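A minimal sketch of producing such a file with the Python standard library follows. The helper name `write_prompt_file` and the placeholder image are hypothetical and not part of the Qianfan SDK; the only thing taken from the format description is that the JSON file shares the image's name and holds a single `prompt` field.

```python
import json
import tempfile
from pathlib import Path


def write_prompt_file(image_path: Path, prompt: str) -> Path:
    """Write a JSON annotation file next to image_path and return its path.

    Hypothetical helper for illustration; it is not part of the Qianfan SDK.
    """
    # Same base name as the image, with a .json extension (example.jpg -> example.json).
    json_path = image_path.with_suffix(".json")
    # ensure_ascii=False keeps non-ASCII prompts human-readable in the file.
    json_path.write_text(
        json.dumps({"prompt": prompt}, ensure_ascii=False, indent=4),
        encoding="utf-8",
    )
    return json_path


# Demo in a throwaway directory; the empty file only stands in for a real image.
demo_dir = Path(tempfile.mkdtemp())
demo_image = demo_dir / "example.jpg"
demo_image.write_bytes(b"")
demo_json = write_prompt_file(demo_image, "A dog rolling on the lawn")
print(demo_json.name)
```

Deriving the JSON path from the image path, rather than typing both names by hand, is one way to keep the two file names from drifting apart.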

An image does not need a corresponding JSON file, but if one exists, its file name (apart from the extension) must match the image's file name.
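Before uploading, it can help to check a folder against this naming rule. The sketch below uses only the standard library; `check_dataset_dir` is a hypothetical helper, and matching extensions case-insensitively is an assumption of this example rather than documented platform behavior.

```python
import tempfile
from pathlib import Path

# Image extensions listed in the format description above.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".bmp", ".png"}


def check_dataset_dir(dataset_dir: Path):
    """Return (image names without a JSON file, JSON names without an image).

    Hypothetical helper for illustration; it is not part of the Qianfan SDK.
    """
    image_stems = set()
    json_stems = set()
    for path in dataset_dir.iterdir():
        if not path.is_file():
            continue
        # Assumption: extensions are compared case-insensitively.
        suffix = path.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            image_stems.add(path.stem)
        elif suffix == ".json":
            json_stems.add(path.stem)
    unannotated = sorted(image_stems - json_stems)  # allowed by the format
    orphaned = sorted(json_stems - image_stems)     # JSON with no matching image
    return unannotated, orphaned


# Demo with empty placeholder files in a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
for name in ("1.jpg", "1.json", "2.png", "3.json"):
    (demo_dir / name).write_bytes(b"")

unannotated, orphaned = check_dataset_dir(demo_dir)
print("images without annotation:", unannotated)
print("JSON files without image:", orphaned)
```

Unannotated images are permitted, so the first list is informational; the second list points at JSON files whose names match no image and probably need renaming.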

## Reading with the Qianfan Python SDK

The `Dataset` class in the Qianfan Python SDK can read a text-to-image dataset and return a `Dataset` object containing the relevant information:

In [3]:
from qianfan.dataset import Dataset
from qianfan.dataset.data_source import FileDataSource
from qianfan.dataset.data_source.base import FormatType

file_data_source = FileDataSource(path="data_file/text2img_example_data", file_format=FormatType.Text2Image)

ds = Dataset.load(file_data_source)

print(ds.list())

[INFO] [03-19 16:30:44] dataset.py:880 [t:8094817088]: list local dataset data by None


[{'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/8.jpg', 'annotation': None}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/9.jpg', 'annotation': None}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/4.jpg', 'annotation': {'prompt': '桌子上有一些白色小棒、两块布、一把剪刀。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/5.jpg', 'annotation': {'prompt': '一个男人和一个女人在一张桌子上用电脑。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/7.jpg', 'annotation': {'prompt': '一个戴眼镜的男人旁边有一只黑猫。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/6.jpg', 'annotation': {'prompt': '一群长颈鹿站在一棵大树边。'}}, {'image_path': '/Users/pengyiyang

After loading, you can modify the dataset's contents manually. For example, to filter out all images that have no annotation:

In [4]:
ds = ds.filter(lambda x: x["annotation"] is not None)

print(ds.list())

[INFO] [03-19 16:30:44] dataset.py:880 [t:8094817088]: list local dataset data by None


[{'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/4.jpg', 'annotation': {'prompt': '桌子上有一些白色小棒、两块布、一把剪刀。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/5.jpg', 'annotation': {'prompt': '一个男人和一个女人在一张桌子上用电脑。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/7.jpg', 'annotation': {'prompt': '一个戴眼镜的男人旁边有一只黑猫。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/6.jpg', 'annotation': {'prompt': '一群长颈鹿站在一棵大树边。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/2.jpg', 'annotation': {'prompt': '一个装满早餐的盘子两侧放着刀叉。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/3.jpg', 'annotation': {'prompt': '桌子
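The predicate passed to `filter` only inspects the `annotation` key of each record. The same check can be tried on plain dictionaries shaped like the `ds.list()` output, without the SDK. The records and the helper `split_by_annotation` below are made up for illustration.

```python
def split_by_annotation(records):
    """Split records into (annotated, unannotated) lists.

    Hypothetical helper; `records` mimics the shape printed by ds.list().
    """
    annotated = [r for r in records if r["annotation"] is not None]
    unannotated = [r for r in records if r["annotation"] is None]
    return annotated, unannotated


# Made-up records with the same keys as the printed output above.
sample_records = [
    {"image_path": "data/8.jpg", "annotation": None},
    {"image_path": "data/4.jpg", "annotation": {"prompt": "scissors on a table"}},
    {"image_path": "data/5.jpg", "annotation": {"prompt": "two people at a desk"}},
]

annotated, unannotated = split_by_annotation(sample_records)
print(len(annotated), "annotated;", len(unannotated), "unannotated")
```

Counting both groups before filtering is a cheap way to see how much data the filter will discard.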

## Uploading to Qianfan

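Ultimately, these datasets need to be uploaded to Qianfan so they can be used for training on the platform. The Qianfan SDK provides an interface for uploading: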

In [5]:
from qianfan.dataset.data_source import QianfanDataSource
from qianfan.resources.console.consts import DataStorageType, DataTemplateType

from qianfan.utils.utils import generate_letter_num_random_id

qianfan_data_source = QianfanDataSource.create_bare_dataset(
    f"text_to_image{generate_letter_num_random_id(6)}",
    DataTemplateType.Text2Image,
    DataStorageType.PrivateBos,
    your_bos_storage_id,
    your_bos_storage_path,
)

qianfan_ds = ds.save(qianfan_data_source)

[INFO] [03-19 16:30:44] baidu_qianfan.py:451 [t:8094817088]: start to create dataset on qianfan
[INFO] [03-19 16:30:44] baidu_qianfan.py:469 [t:8094817088]: create dataset on qianfan successfully
[INFO] [03-19 16:30:44] baidu_qianfan.py:237 [t:8094817088]: start to upload data to user BOS
[INFO] [03-19 16:30:45] baidu_qianfan.py:249 [t:8094817088]: uploading data to user BOS finished
[INFO] [03-19 16:30:46] utils.py:476 [t:8094817088]: successfully create importing task
[INFO] [03-19 16:30:48] utils.py:479 [t:8094817088]: polling import task status
[INFO] [03-19 16:30:48] utils.py:486 [t:8094817088]: import status: 1, keep polling
[INFO] [03-19 16:30:50] utils.py:479 [t:8094817088]: polling import task status
[INFO] [03-19 16:30:51] utils.py:486 [t:8094817088]: import status: 1, keep polling
[INFO] [03-19 16:30:53] utils.py:479 [t:8094817088]: polling import task status
[INFO] [03-19 16:30:53] utils.py:486 [t:8094817088]: import status: 1, keep polling
[INFO] [03-19 16:30:55] utils.py:

# Uploading to BOS

The Qianfan Python SDK can also upload to Baidu AI Cloud's object storage service (BOS). Choose whichever option suits your needs:

In [6]:
from qianfan.dataset.data_source import BosDataSource

bos_data_source = BosDataSource(
    region="bj",
    bucket=your_bos_storage_id,
    bos_file_path=your_bos_storage_path + f"text_to_image_dataset/ds_{generate_letter_num_random_id()}",
    file_format=FormatType.Text2Image,
)

bos_ds = ds.save(bos_data_source)

[INFO] [03-19 16:30:56] bos.py:117 [t:8094817088]: start to upload file
[INFO] [03-19 16:30:56] bos_uploader.py:91 [t:8094817088]: check if bos file existed
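The cell above builds `bos_file_path` with `+`, which gives the intended path only when `your_bos_storage_path` already ends with `/`. A small helper can make the joining explicit. The sketch below is plain string handling, not an SDK function, and `join_bos_path` is a hypothetical name; whether BOS expects a leading slash in your setup is left as given by the prefix.

```python
def join_bos_path(prefix: str, *parts: str) -> str:
    """Join path segments with exactly one '/' between them.

    Hypothetical helper; the prefix's leading slash (if any) is kept as-is.
    """
    # Drop the trailing slash from the prefix and surrounding slashes from the rest.
    segments = [prefix.rstrip("/")] + [part.strip("/") for part in parts]
    # Skip empty segments so no doubled slashes appear.
    return "/".join(segment for segment in segments if segment)


with_slash = join_bos_path("/my/prefix/", "text_to_image_dataset", "ds_abc123")
without_slash = join_bos_path("/my/prefix", "text_to_image_dataset", "ds_abc123")
print(with_slash)
print(without_slash)
```

Both calls produce the same path, so the result no longer depends on how the storage path variable happens to be written.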


# Downloading Datasets from Qianfan and BOS

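Besides uploading datasets to Qianfan and BOS, you can also download them to your local machine for further processing. The following uses the datasets uploaded above as an example: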

In [7]:
t2i_dataset_from_qianfan = Dataset.load(qianfan_data_source).save(data_file="t2i_from_qianfan", file_format=FormatType.Text2Image)

t2i_dataset_from_bos = Dataset.load(bos_data_source).save(data_file="t2i_dataset_from_bos", file_format=FormatType.Text2Image)

[INFO] [03-19 16:30:57] dataset.py:462 [t:8094817088]: no destination data source was provided, construct
[INFO] [03-19 16:30:57] dataset.py:257 [t:8094817088]: construct a file data source from path: t2i_from_qianfan, with args: {'file_format': <FormatType.Text2Image: 'text2image'>}
[INFO] [03-19 16:30:57] baidu_qianfan.py:345 [t:8094817088]: no cache was found, download cache
[INFO] [03-19 16:30:57] baidu_qianfan.py:271 [t:8094817088]: get dataset info succeeded for dataset id ds-mt06fsm54cram18v
[INFO] [03-19 16:30:57] utils.py:610 [t:8094817088]: start to export dataset
[INFO] [03-19 16:30:58] utils.py:614 [t:8094817088]: create dataset export task successfully
[INFO] [03-19 16:31:00] utils.py:619 [t:8094817088]: polling export task status
[INFO] [03-19 16:31:01] utils.py:627 [t:8094817088]: export status: 1, keep polling
[INFO] [03-19 16:31:03] utils.py:619 [t:8094817088]: polling export task status
[INFO] [03-19 16:31:03] utils.py:627 [t:8094817088]: export status: 1, keep pollin
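Once a download has finished, the local folder can be inspected without the SDK. The sketch below assumes the saved folder follows the same layout described earlier (images plus same-named JSON files), which this tutorial does not confirm; `read_text2image_dir` is a hypothetical helper, and the demo runs on a temporary folder rather than on real downloaded data.

```python
import json
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".bmp", ".png"}


def read_text2image_dir(dataset_dir):
    """Read a folder into records with 'image_path' and 'annotation' keys.

    Hypothetical helper; assumes images sit next to same-named JSON files.
    """
    records = []
    for image_path in sorted(Path(dataset_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        json_path = image_path.with_suffix(".json")
        annotation = None  # images without a JSON file stay unannotated
        if json_path.is_file():
            annotation = json.loads(json_path.read_text(encoding="utf-8"))
        records.append({"image_path": str(image_path), "annotation": annotation})
    return records


# Demo folder: one annotated image and one unannotated image (placeholder files).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "1.jpg").write_bytes(b"")
(demo_dir / "1.json").write_text(json.dumps({"prompt": "a cat"}), encoding="utf-8")
(demo_dir / "2.jpg").write_bytes(b"")

records = read_text2image_dir(demo_dir)
print(len(records), "images found")
```

Pointing the function at a folder such as `t2i_from_qianfan` would give a quick count to compare against the dataset that was uploaded, provided the layout assumption holds.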