# Generating a dataset

The first part of the fine tuning process is generating a dataset.

The goal of this notebook is to create an "instruction-based" fine tuning dataset from metadata accessed via IIIF,
and then upload that dataset to Huggingface.

## Prerequisites

Ensure you have a [Huggingface](https://huggingface.co) account.

Create a [token](https://huggingface.co/docs/hub/en/security-tokens) with the minimum settings of:
- Read access to contents of all repos under your personal namespace
- Read access to contents of all public gated repos you can access
- Write access to contents/settings of all repos under your personal namespace

## Collecting Data

The first step is to use `loam-iiif` to fetch manifests from a IIIF Collection.

In [1]:
from loam_iiif import iiif

client = iiif.IIIFClient()

# Berkeley Folk Music Festival
collection_url = "https://api.dc.library.northwestern.edu/api/v2/collections/18ec4c6b-192a-4ab8-9903-ea0f393c35f7?as=iiif"
max_manifests = 5000

manifest_ids, _collection_ids = client.get_manifests_and_collections_ids(collection_url, max_manifests)

print(f"🔎 Found {len(manifest_ids)} manifests")

🔎 Found 5000 manifests
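To make the collection step less opaque, here is a minimal sketch of how manifest ids could be gathered from a IIIF Presentation 3 collection. It is not the `loam-iiif` implementation; `collect_manifest_ids`, its `fetch` parameter, and the `example.org` URLs are hypothetical names used only for illustration. It assumes each entry in `items` carries `id` and `type`, and that nested collections have to be fetched separately.

```python
from typing import Callable


def collect_manifest_ids(
    collection: dict,
    fetch: Callable[[str], dict],
    max_manifests: int,
) -> list:
    """Walk a IIIF Presentation 3 collection and gather Manifest ids."""
    manifest_ids = []
    pending = [collection]  # collections still to be inspected

    while pending and len(manifest_ids) < max_manifests:
        current = pending.pop()
        for item in current.get("items", []):
            if len(manifest_ids) >= max_manifests:
                break
            if item.get("type") == "Manifest":
                manifest_ids.append(item["id"])
            elif item.get("type") == "Collection":
                # Assumption: nested collections are references, so fetch them.
                pending.append(fetch(item["id"]))
    return manifest_ids


# Tiny in-memory example (hypothetical data); no network access is needed.
nested = {
    "id": "https://example.org/collection/nested",
    "type": "Collection",
    "items": [{"id": "https://example.org/manifest/3", "type": "Manifest"}],
}
top = {
    "id": "https://example.org/collection/top",
    "type": "Collection",
    "items": [
        {"id": "https://example.org/manifest/1", "type": "Manifest"},
        {"id": "https://example.org/manifest/2", "type": "Manifest"},
        {"id": nested["id"], "type": "Collection"},
    ],
}

found = collect_manifest_ids(top, fetch=lambda url: nested, max_manifests=10)
print(found)
```

Passing `fetch` as a parameter keeps the traversal testable without network access; in the notebook, something like `client.fetch_json` would play that role.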


## Transforming data

The next step is to transform the data into an instruction-based format:

```json
{
    "prompt": "",
    "completion": ""
}
```

<div style="border-radius: .5rem; border:2px solid #ac9a14; background-color: #ffe900; max-width: fit-content; max-height: fit-content; padding: 1rem; color: black; line-height: 1;">
It is not strictly necessary for the data to be in this format, but it provides conveniences later for fine tuning.
</div>

Though the goal of the repo is to fine tune the style of a multimodal model:
- it is not necessary to include images in the data, as the vision layer will not be adjusted
- the text can come from anywhere, which is why this notebook uses the `summary` of a `Manifest` rather than the `description` of a `Canvas`

In [2]:
import json
data = []

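for manifest_id in manifest_ids:
    manifest = client.fetch_json(manifest_id)

    if "summary" not in manifest:
        continue

    # `summary` is a IIIF language map: {"<language>": ["<text>", ...]}
    summary: dict = manifest["summary"]

    if len(summary) == 0:
        continue

    # Prefer the "none" (no language) key; otherwise fall back to the first key.
    # dict views are not subscriptable, so take the first key with next(iter(...)).
    key = "none" if "none" in summary else next(iter(summary))
    summary_text: str = "\n".join(summary[key])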

    line = {
        "prompt": "Describe this image.",
        "completion": summary_text,
    }

    data.append(line)

print(f"✅ Processed {len(data)} manifests")
print("Example:")
print(json.dumps(data[0], indent=2))

✅ Processed 4906 manifests
Example:
{
  "prompt": "Describe this image.",
  "completion": "Sandy Paton playing guitar at Creed's Books in Berkeley, California"
}
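The transformation above can also be factored into small helpers, which makes the language-map handling easy to check in isolation. This is a sketch, not part of the original notebook: `language_map_text` and `to_record` are hypothetical names, the sample manifest is invented, and unlike the cell above it also skips manifests whose summary text is empty.

```python
from typing import Optional


def language_map_text(language_map: dict, preferred: str = "none") -> str:
    """Join the values of a IIIF language map into one string."""
    if not language_map:
        return ""
    # Prefer the requested key; otherwise use the first key present.
    key = preferred if preferred in language_map else next(iter(language_map))
    return "\n".join(language_map[key])


def to_record(manifest: dict, prompt: str = "Describe this image.") -> Optional[dict]:
    """Build a prompt/completion record, or None if there is no summary text."""
    text = language_map_text(manifest.get("summary", {}))
    if not text:
        return None
    return {"prompt": prompt, "completion": text}


# Invented sample manifest with an "en" summary and no "none" key.
sample_manifest = {"summary": {"en": ["A banjo workshop.", "Photographed outdoors."]}}
record = to_record(sample_manifest)
print(record)
```

With these helpers, the loop body reduces to calling `to_record(manifest)` and appending the result when it is not `None`.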


## Saving data

### Local

Save the data locally as a `.jsonl` file.

In [3]:
file_name = "outputs.jsonl"
with open(file_name, "w") as f:
    for i, line in enumerate(data):
        if i != 0:
            f.write("\n")
        f.write(json.dumps(line))
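Before uploading, it can be worth reading the file back to confirm that every line parses and has the expected keys. The following is a minimal sketch of such a round-trip check; `read_jsonl` is a hypothetical helper, and it runs against a temporary sample file rather than `outputs.jsonl` so that it is self-contained.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list:
    """Parse a JSON Lines file into a list of records."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate a missing or extra trailing newline
                records.append(json.loads(line))
    return records


# Sample records; the first has an embedded newline to show that
# json.dumps escapes it, so each record still occupies one line.
sample = [
    {"prompt": "Describe this image.", "completion": "First line\nSecond line"},
    {
        "prompt": "Describe this image.",
        "completion": "Sandy Paton playing guitar at Creed's Books in Berkeley, California",
    },
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(json.dumps(r) for r in sample))

loaded = read_jsonl(path)
assert all(set(r) == {"prompt", "completion"} for r in loaded)
print(f"Read back {len(loaded)} records")
```

The same check could be pointed at `file_name` from the cell above.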

### Huggingface (Optional)

Save the data to Huggingface as a dataset.

In [4]:
import os

from huggingface_hub import login

login(os.environ["HF_TOKEN"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [5]:
from huggingface_hub import HfApi

api = HfApi()

# Change this to your own repo name
new_repo_name = "charlesLoder/northwestern-metadata"
api.create_repo(
    repo_id=new_repo_name,
    repo_type="dataset",
    private=True,
    exist_ok=True,
)

RepoUrl('https://huggingface.co/datasets/charlesLoder/northwestern-metadata', endpoint='https://huggingface.co', repo_type='dataset', repo_id='charlesLoder/northwestern-metadata')

In [6]:
api.upload_file(
    path_or_fileobj=file_name,
    path_in_repo=file_name,
    repo_id=new_repo_name,
    repo_type="dataset",
    commit_message="Create dataset",
)

CommitInfo(commit_url='https://huggingface.co/datasets/charlesLoder/northwestern-metadata/commit/0c5fe6fe00dbbf585c8bd6e6ba44f6d148f686e6', commit_message='Create dataset', commit_description='', oid='0c5fe6fe00dbbf585c8bd6e6ba44f6d148f686e6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/charlesLoder/northwestern-metadata', endpoint='https://huggingface.co', repo_type='dataset', repo_id='charlesLoder/northwestern-metadata'), pr_revision=None, pr_num=None)