---
title:  Dynamically updating a Hugging Face hub organization README
description: Using the huggingface_hub library and Jinja to update a README dynamically
author: "Daniel van Strien"
date: "2023-03-07"
image: preview.png
---

tl;dr we can use the `huggingface_hub` library to auto-generate the README for the [BigLAM organization](https://huggingface.co/biglam).

## What are we aiming to do?

The Hugging Face hub allows organizations to create a README card to describe their organization.

![](https://github.com/davanstrien/blog/raw/master/images/_readme_auto_generate/before_readme.png)

Whilst you can create this manually, some content would be nice to populate automatically. For example, the BigLAM organization is mainly focused on collecting datasets. Since these datasets support many tasks, we might want to list the datasets organized by task. Ideally, we don't want to update this list by hand. Let's see how we can do this!

First we'll install the `huggingface_hub` library which allows us to interact with the hub. We'll install `Jinja2` for templating and `toolz` because `toolz` makes Python infinitely more delightful!

In [None]:
%pip install huggingface_hub toolz Jinja2

Note: you may need to restart the kernel to use updated packages.


In [None]:
import toolz
from huggingface_hub import list_datasets

We list all the datasets under this organization:

In [None]:
big_lam_datasets = list(list_datasets(author="biglam", limit=None, full=True))

We want to check which tasks our organization currently has. If we look at an example of one dataset:

In [None]:
big_lam_datasets[0]

DatasetInfo: {
	id: biglam/illustrated_ads
	sha: 688e7d96e99cd5730a17a5c55b0964d27a486904
	lastModified: 2023-01-18T20:38:15.000Z
	tags: ['task_categories:image-classification', 'task_ids:multi-class-image-classification', 'annotations_creators:expert-generated', 'size_categories:n<1K', 'license:cc0-1.0', 'lam', 'historic newspapers']
	private: False
	author: biglam
	description: The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection.
	citation: @dataset{van_strien_daniel_2021_5838410,
  author       = {van Strien, Daniel},
  title        = {{19th Century United States Newspaper Advert images 
                   with 'illustrated' or 'non illustrated' labels}},
  month        = oct,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {0.0.1},
  doi          = {10.5281/zenodo.5838410},
  url          = {https://doi.org/10.5281/zenodo.5838410}}
	c

We can see that the `cardData` attribute contains a `task_categories` item listing the tasks a dataset supports:

In [None]:
big_lam_datasets[0].cardData['task_categories']

['image-classification']

In [None]:
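def get_task_categories(dataset):
    """Yield each task category listed in a dataset's card metadata."""
    try:
        yield from dataset.cardData['task_categories']
    except (KeyError, TypeError):
        # No `task_categories` key, or no card metadata at all: yield nothing.
        return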

We can use the `toolz.frequencies` function to get counts of these tasks in our org.

In [None]:
task_frequencies = toolz.frequencies(
    toolz.concat(map(get_task_categories, big_lam_datasets))
)
task_frequencies

{'image-classification': 8,
 'text-classification': 6,
 'image-to-text': 2,
 'text-generation': 7,
 'object-detection': 5,
 'fill-mask': 2,
 'text-to-image': 1,
 'image-to-image': 1,
 'token-classification': 1}
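To try the counting step without calling the Hub, here is a minimal sketch that uses stand-in objects in place of `DatasetInfo`. The dataset ids and task lists are made up for illustration. It relies on `collections.Counter` and `itertools.chain.from_iterable` from the standard library, which behave much like `toolz.frequencies` and `toolz.concat` for this purpose.

```python
from collections import Counter
from itertools import chain
from types import SimpleNamespace

# Hypothetical stand-ins for DatasetInfo objects; only `cardData` matters here.
fake_datasets = [
    SimpleNamespace(id="org/a", cardData={"task_categories": ["image-classification"]}),
    SimpleNamespace(id="org/b", cardData={"task_categories": ["fill-mask", "text-generation"]}),
    SimpleNamespace(id="org/c", cardData={"license": ["cc0-1.0"]}),  # no tasks declared
    SimpleNamespace(id="org/d", cardData={"task_categories": ["text-generation"]}),
]


def get_task_categories(dataset):
    """Yield each task category for a dataset, or nothing if none are declared."""
    try:
        yield from dataset.cardData["task_categories"]
    except KeyError:
        return


# Flatten the per-dataset task lists, then count how often each task appears.
all_tasks = chain.from_iterable(map(get_task_categories, fake_datasets))
task_counts = dict(Counter(all_tasks))
print(task_counts)
```

The dataset without a `task_categories` entry is simply skipped, so it never shows up in the counts.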

Since we want to organize by task type, let's grab the names of all the tasks in the BigLAM organization.

In [None]:
tasks = task_frequencies.keys()
tasks

dict_keys(['image-classification', 'text-classification', 'image-to-text', 'text-generation', 'object-detection', 'fill-mask', 'text-to-image', 'image-to-image', 'token-classification'])

We now want to group datasets by the task(s) they support. We can use a `defaultdict` to create a dictionary where the keys are the tasks and the values are lists of datasets supporting each task. **Note:** some datasets support multiple tasks, so they may appear under more than one task key.

In [None]:
from collections import defaultdict

In [None]:
datasets_by_task = defaultdict(list)

In [None]:
for dataset in big_lam_datasets:
    # Iterate directly so we don't overwrite the `tasks` variable defined above.
    for task in get_task_categories(dataset):
        datasets_by_task[task].append(dataset)

We now have a dictionary which allows us to get all datasets supporting a task, for example `fill-mask`:

In [None]:
datasets_by_task["fill-mask"]

[DatasetInfo: {
 	id: biglam/berlin_state_library_ocr
 	sha: a890935d5bd754ddc5b85f56b6f34f6d2bb4abba
 	lastModified: 2022-08-05T09:36:24.000Z
 	tags: ['task_categories:fill-mask', 'task_categories:text-generation', 'task_ids:masked-language-modeling', 'task_ids:language-modeling', 'annotations_creators:machine-generated', 'language_creators:expert-generated', 'multilinguality:multilingual', 'size_categories:1M<n<10M', 'language:de', 'language:nl', 'language:en', 'language:fr', 'language:es', 'license:cc-by-4.0', 'ocr', 'library']
 	private: False
 	author: biglam
 	description: None
 	citation: None
 	cardData: {'annotations_creators': ['machine-generated'], 'language': ['de', 'nl', 'en', 'fr', 'es'], 'language_creators': ['expert-generated'], 'license': ['cc-by-4.0'], 'multilinguality': ['multilingual'], 'pretty_name': 'Berlin State Library OCR', 'size_categories': ['1M<n<10M'], 'source_datasets': [], 'tags': ['ocr', 'library'], 'task_categories': ['fill-mask', 'text-generation'], 't
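The grouping step can also be tried offline. The sketch below uses hypothetical stand-in objects (the ids and task lists are illustrative, not real BigLAM datasets) to show how a dataset with two tasks ends up under both keys.

```python
from collections import defaultdict
from types import SimpleNamespace

# Hypothetical stand-ins for DatasetInfo objects (ids and tasks are illustrative).
fake_datasets = [
    SimpleNamespace(id="org/ocr_text", cardData={"task_categories": ["fill-mask", "text-generation"]}),
    SimpleNamespace(id="org/adverts", cardData={"task_categories": ["image-classification"]}),
    SimpleNamespace(id="org/untagged", cardData={}),
]

datasets_by_task = defaultdict(list)
for dataset in fake_datasets:
    # .get() with a default skips datasets that declare no tasks.
    for task in dataset.cardData.get("task_categories", []):
        datasets_by_task[task].append(dataset)

# Reduce each group to its ids so the result is easy to read.
ids_by_task = {task: [d.id for d in group] for task, group in datasets_by_task.items()}
print(ids_by_task)
```

Here `org/ocr_text` is listed under both `fill-mask` and `text-generation`, while `org/untagged` appears nowhere.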

## How can we create a README that dynamically updates?

We now have our datasets organized by task. However, at the moment, this is in the form of a Python dictionary. It would be much nicer to render it in a more pleasing format. This is where a [templating engine](https://www.fullstackpython.com/template-engines.html) can help. In this case we'll use [Jinja](https://jinja.palletsprojects.com/en/3.0.x/templates/).

A templating engine allows us to create a template which can be updated dynamically based on the values we pass in. We won't go into depth on templating engines or Jinja in this blog post because I'm not an expert in them. This [Real Python article](https://realpython.com/primer-on-jinja-templating/) is a nice introduction to Jinja.

In [None]:
from jinja2 import Environment, FileSystemLoader

We can start by taking a look at our template. Since much of the template I created is static, we'll use `tail` to look at the bottom of the template, which is the part that updates dynamically.

In [None]:
!tail -n 12 templates/readme.jinja

An overview of datasets currently made available via BigLam organised by task type.

{% for task_type, datasets in task_dictionary.items() %}

<details>
  <summary>{{ task_type }}</summary>
    {% for dataset in datasets %}
  - [{{dataset.cardData['pretty_name']}}](https://huggingface.co/datasets/{{ dataset.id }})
  {%- endfor %}

</details>
{% endfor %}

Even if you aren't familiar with templating engines, you can probably see roughly what this does. We loop through all the keys and values in our dictionary and create a section for each task based on the dictionary key. We then loop through the dictionary values (each of which is a list) and create a link for each dataset. Since we're looping through `DatasetInfo` objects in the list, we can grab things like the `pretty_name` for the dataset and dynamically create a URL link.

We can load this template as follows

In [None]:
environment = Environment(loader=FileSystemLoader("templates/"))
template = environment.get_template("readme.jinja")

We create a context dictionary, which we use to pass our `datasets_by_task` dictionary to the template:

In [None]:
context = {
    "task_dictionary": datasets_by_task,
}

We can now render this and see how it looks

In [None]:
print(template.render(context))

---
title: README
emoji: 📚
colorFrom: pink
colorTo: gray
sdk: static
pinned: false
---

BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions who collaborate on various projects within the natural language processing (NLP) space to broaden the accessibility of language datasets while working on challenging scientific questions around training language models.


BigLAM started as a [datasets hackathon](https://github.com/bigscience-workshop/lam) focused on making data from Libraries, Archives, and Museums (LAMS) with potential machine-learning applications accessible via the Hugging Face Hub.
We are continuing to work on making more datasets available via the Hugging Face hub to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine-learning datasets more closely reflect the richness of human culture.


## Dataset Overview

An overview of datasets currently made available via Big

In [None]:
with open('/tmp/README.md','w') as f:
    f.write(template.render(context))
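The render step can also be tried without a template file or Hub access. This is a minimal sketch, assuming a cut-down in-memory template loaded with Jinja's `DictLoader` and stand-in objects in place of `DatasetInfo`. The `example_dataset` entry is hypothetical, and the link here is built from the full `id`, which already includes the organization name.

```python
from types import SimpleNamespace

from jinja2 import DictLoader, Environment

# A cut-down, in-memory version of the dynamic part of the template.
TEMPLATE = """\
{% for task_type, datasets in task_dictionary.items() %}
<details>
  <summary>{{ task_type }}</summary>
{% for dataset in datasets %}
  - [{{ dataset.cardData['pretty_name'] }}](https://huggingface.co/datasets/{{ dataset.id }})
{%- endfor %}

</details>
{% endfor %}
"""

# Stand-ins for DatasetInfo objects; the second one is hypothetical.
berlin = SimpleNamespace(
    id="biglam/berlin_state_library_ocr",
    cardData={"pretty_name": "Berlin State Library OCR"},
)
example = SimpleNamespace(
    id="biglam/example_dataset",
    cardData={"pretty_name": "Example Dataset"},
)
task_dictionary = {"fill-mask": [berlin], "image-classification": [example]}

# DictLoader serves templates from a dict instead of the filesystem.
environment = Environment(loader=DictLoader({"readme.jinja": TEMPLATE}))
rendered = environment.get_template("readme.jinja").render(task_dictionary=task_dictionary)
print(rendered)
```

Each key in `task_dictionary` becomes one collapsible `<details>` section, and each dataset becomes one Markdown link inside it.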

## Updating the README on the Hugging Face Hub

This looks pretty good! It would be nice to also update the org README without having to manually edit the file. The `huggingface_hub` library helps us out here once again. Since the organization README is actually a special type of Hugging Face Space, we can interact with it in the same way we could for models or datasets.

In [None]:
from huggingface_hub import HfApi
from huggingface_hub import notebook_login

We'll create an `HfApi` instance.

In [None]:
api = HfApi()

Since we're planning to write to a repo, we'll need to log in to the Hub.

In [None]:
notebook_login()


We can now upload the rendered README file we created above to our `biglam/README` space.

In [None]:
api.upload_file(
    path_or_fileobj="/tmp/README.md",
    path_in_repo="README.md",
    repo_id="biglam/README",
    repo_type="space",
)

'https://huggingface.co/spaces/biglam/README/blob/main/README.md'
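If you would rather check the upload arguments before writing to the Hub, one option is to wrap the call in a small helper and pass in a stand-in for `HfApi`. This is only a sketch: `publish_readme` and `RecordingApi` are hypothetical names introduced here, not part of `huggingface_hub`. The helper passes the rendered text as bytes, which `upload_file` accepts in place of a file path.

```python
def publish_readme(api, rendered_readme, repo_id="biglam/README"):
    """Upload rendered README text to an organization's README Space."""
    return api.upload_file(
        path_or_fileobj=rendered_readme.encode("utf-8"),  # bytes instead of a path
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="space",
    )


class RecordingApi:
    """Hypothetical stand-in for HfApi that records the call instead of uploading."""

    def __init__(self):
        self.calls = []

    def upload_file(self, **kwargs):
        self.calls.append(kwargs)
        return f"dry-run: {kwargs['repo_id']}/{kwargs['path_in_repo']}"


# Dry run: nothing is sent to the Hub.
dry_run_api = RecordingApi()
result = publish_readme(dry_run_api, "# Demo README\n")
print(result)
```

Once the recorded arguments look right, passing a real, logged-in `HfApi()` instance as `api` would perform the actual upload.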

If we look at our updated README, we'll see we now have some nice collapsible sections for each task type containing the datasets for that task.

![After README](after_readme.png)

As for next steps: whilst this is already quite useful, at the moment we still have to run this code whenever we want to regenerate our README. [Webhooks](https://huggingface.co/docs/hub/webhooks) make it possible to fully automate this by creating a webhook that monitors changes to repos under the BigLAM org. I would love to hear from anyone who tries this out!