<a href="https://colab.research.google.com/github/daspartho/prompt-extend/blob/main/dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Downloading the Parquet table

Since we only need the prompts (not the images), it is much faster and more flexible to load the metadata Parquet table ourselves than to go through the Hugging Face Datasets generator.

In [1]:
from urllib.request import urlretrieve
table_url = 'https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/metadata-large.parquet'
urlretrieve(table_url, 'metadata.parquet')

('metadata.parquet', <http.client.HTTPMessage at 0x7fcf0214bad0>)
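
Re-running the download cell fetches the whole table again each time. A minimal sketch of a cached download follows, assuming it is acceptable to reuse any non-empty local file; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless a non-empty file is already there.

    Hypothetical helper: it does not verify that the cached file is complete.
    """
    path = Path(dest)
    if path.exists() and path.stat().st_size > 0:
        # Reuse the cached copy instead of downloading again.
        return path
    urlretrieve(url, str(path))
    return path
```

With this helper, the cell above could call `download_if_missing(table_url, 'metadata.parquet')` in place of `urlretrieve`. A partially downloaded file would be treated as valid, so delete it by hand if a download is interrupted.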

#### Read the `prompt` column in the table using Pandas

In [2]:
import pandas as pd
prompts = pd.read_parquet('metadata.parquet', columns=['prompt'])
prompts

Unnamed: 0,prompt
0,beautiful porcelain ivory fair face woman biom...
1,complex 3 d render hyper detailed ultra sharp ...
2,complex 3 d render hyper detailed ultra sharp ...
3,complex 3 d render hyper detailed ultra sharp ...
4,complex 3 d render hyper detailed ultra sharp ...
...,...
13999995,"Ibai Berto Romero as Willy Wonka, highly detai..."
13999996,"Ibai Berto Romero as Willy Wonka, highly detai..."
13999997,"Ibai Berto Romero as Willy Wonka, highly detai..."
13999998,"Ibai Berto Romero as Willy Wonka, highly detai..."


#### Get unique prompts from the `prompt` column

In [3]:
prompts_unique = prompts.drop_duplicates('prompt')
prompts_unique

Unnamed: 0,prompt
0,beautiful porcelain ivory fair face woman biom...
1,complex 3 d render hyper detailed ultra sharp ...
15,complex 3 d render hyper detailed ultra sharp ...
16,complex 3 d render hyper detailed ultra sharp ...
33,complex 3 d render hyper detailed ultra sharp ...
...,...
13999977,dreaming electric bicycle and electric car by ...
13999978,"riding neon bycicles in the woods, painted by ..."
13999987,"Ibai Llanos dressed as Willy Wonka, highly det..."
13999993,"Ibai Berto Romero as Willy Wonka, highly detai..."
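
The gaps in the row labels above (1, 15, 16, 33, ...) come from how `drop_duplicates` works: it keeps the first occurrence of each value and leaves the original index untouched. A small sketch on made-up data (the `toy` frame is purely illustrative):

```python
import pandas as pd

# Toy stand-in for the `prompts` table (hypothetical data).
toy = pd.DataFrame({"prompt": ["a cat", "a cat", "a dog", "a cat", "a bird"]})

# keep='first' is the default: the first occurrence of each prompt survives.
toy_unique = toy.drop_duplicates("prompt")

# The surviving rows keep their original labels, so the index has gaps.
print(toy_unique.index.tolist())
print(len(toy), "->", len(toy_unique))
```

The row labels of `prompts_unique` are therefore positions in the original table, not a count of unique prompts.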


#### Installing datasets

In [4]:
!pip install datasets -q

#### Pandas dataframe to HF dataset

In [5]:
from datasets import Dataset
dataset = Dataset.from_pandas(prompts_unique).remove_columns(['__index_level_0__'])
dataset

Dataset({
    features: ['prompt'],
    num_rows: 1819808
})
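
The `__index_level_0__` column removed above appears because the deduplicated frame carries its old row labels into the conversion. One alternative, sketched here on toy data, is to reset the index before converting. The commented-out `preserve_index=False` call is an assumption about an equivalent route and is not executed here.

```python
import pandas as pd

# Toy stand-in for `prompts_unique` (hypothetical data).
toy = pd.DataFrame({"prompt": ["a cat", "a cat", "a dog", "a dog", "a bird"]})
toy_unique = toy.drop_duplicates("prompt")

# Replace the leftover labels with a clean 0..n-1 index and discard the old one.
toy_clean = toy_unique.reset_index(drop=True)
print(toy_clean.index.tolist())

# Assumed equivalent for the real table (requires `datasets`, not run here):
# dataset = Dataset.from_pandas(prompts_unique, preserve_index=False)
```

Either approach should leave `prompt` as the only feature, matching the output shown above.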

#### Log in to HF

In [6]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


#### Upload dataset to HF

In [7]:
dataset.push_to_hub("daspartho/stable-diffusion-prompts")

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]