<a href="https://colab.research.google.com/github/daspartho/prompt-extend/blob/main/tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing required libraries

In [1]:
!pip install transformers sentencepiece datasets -q


### Downloading the corpus of prompts

In [2]:
from datasets import load_dataset

ds = load_dataset("daspartho/stable-diffusion-prompts")
ds


DatasetDict({
    train: Dataset({
        features: ['prompt'],
        num_rows: 1819808
    })
})

In [3]:
example = ds['train'][0]['prompt']
example

'beautiful porcelain ivory fair face woman biomechanical cyborg, close - up, sharp focus, studio light, iris van herpen haute couture headdress made of rhizomorphs, daisies, brackets, colorful corals, fractal mushrooms, puffballs, octane render, ultra sharp, 8 k '

### Transform the dataset into an iterator of batches of prompts

In [4]:
def get_training_corpus():
    # Lazily yield the prompts in batches of 1,000.
    return (
        ds["train"][i : i + 1000]["prompt"]
        for i in range(0, len(ds["train"]), 1000)
    )

training_corpus = get_training_corpus()
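
The generator above slices the dataset into lists of 1,000 prompts. A minimal sketch of the same slicing pattern on a plain Python list may make the behavior easier to see; `batch_iterator` and `toy_prompts` are hypothetical names used only for illustration and are not part of the notebook.

```python
def batch_iterator(items, batch_size):
    """Return a generator of consecutive slices, each at most `batch_size` long."""
    return (
        items[i : i + batch_size]
        for i in range(0, len(items), batch_size)
    )

# Toy stand-in for the prompt column (assumption: any sliceable sequence works).
toy_prompts = [f"prompt {n}" for n in range(2500)]

batches = list(batch_iterator(toy_prompts, 1000))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [1000, 1000, 500]

# A generator can be consumed only once; a second pass yields nothing.
gen = batch_iterator(toy_prompts, 1000)
first_pass = list(gen)
second_pass = list(gen)
print(len(first_pass), len(second_pass))  # 3 0
```

Because a generator is exhausted after one pass, calling `get_training_corpus()` again is the simplest way to get a fresh iterator if the corpus needs to be read a second time.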

### Load the tokenizer

In [5]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")


### Let's see how it performs before training

In [6]:
tokens = old_tokenizer.tokenize(example)
tokens, len(tokens)

(['beaut',
  'iful',
  'Ġpor',
  'cel',
  'ain',
  'Ġivory',
  'Ġfair',
  'Ġface',
  'Ġwoman',
  'Ġbiome',
  'chan',
  'ical',
  'Ġcy',
  'borg',
  ',',
  'Ġclose',
  'Ġ-',
  'Ġup',
  ',',
  'Ġsharp',
  'Ġfocus',
  ',',
  'Ġstudio',
  'Ġlight',
  ',',
  'Ġir',
  'is',
  'Ġvan',
  'Ġher',
  'pen',
  'Ġha',
  'ute',
  'Ġcout',
  'ure',
  'Ġhead',
  'dress',
  'Ġmade',
  'Ġof',
  'Ġrh',
  'iz',
  'omorph',
  's',
  ',',
  'Ġda',
  'is',
  'ies',
  ',',
  'Ġbrackets',
  ',',
  'Ġcolorful',
  'Ġcor',
  'als',
  ',',
  'Ġfract',
  'al',
  'Ġmushrooms',
  ',',
  'Ġpuff',
  'balls',
  ',',
  'Ġoct',
  'ane',
  'Ġrender',
  ',',
  'Ġultra',
  'Ġsharp',
  ',',
  'Ġ8',
  'Ġk',
  'Ġ'],
 70)
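
In GPT-2's byte-level BPE vocabulary, a leading `Ġ` marks a space before the token, so `'beaut', 'iful', 'Ġpor', 'cel', 'ain'` spells "beautiful porcelain". The sketch below rebuilds text from the first few tokens of the output above; `detokenize_ascii` is a hypothetical helper that only handles plain ASCII text. For general input, the tokenizer's own `convert_tokens_to_string` method is the safer choice.

```python
# First eight tokens copied from the output above.
sample_tokens = ['beaut', 'iful', 'Ġpor', 'cel', 'ain', 'Ġivory', 'Ġfair', 'Ġface']

def detokenize_ascii(tokens):
    """Join tokens and map the 'Ġ' marker back to a space (ASCII-only sketch)."""
    return "".join(tokens).replace("Ġ", " ")

text = detokenize_ascii(sample_tokens)
print(text)  # beautiful porcelain ivory fair face

# Count how many tokens each whitespace-separated word was split into.
pieces_per_word = []
for token in sample_tokens:
    if token.startswith("Ġ") or not pieces_per_word:
        pieces_per_word.append(1)   # token starts a new word
    else:
        pieces_per_word[-1] += 1    # token continues the current word
print(pieces_per_word)  # [2, 3, 1, 1, 1]
```

The counts show that words common in prompts, such as "beautiful" and "porcelain", are split into several pieces by the general-purpose vocabulary.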

### Training a new tokenizer

In [7]:
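# Reuse GPT-2's tokenization pipeline, but learn a new 52,000-token vocabulary from the prompts.
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, vocab_size=52000)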
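
`train_new_from_iterator` keeps the existing tokenizer's algorithm (byte-level BPE for GPT-2) and learns a new vocabulary from the supplied corpus. As a rough illustration of what learning BPE merges means, here is a toy sketch in pure Python. It is not the library's implementation: it works on characters rather than bytes, skips pre-tokenization, and `learn_bpe_merges` and `toy_corpus` are hypothetical names.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, fusing every occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

toy_corpus = ["sharp", "sharp", "sharp", "shape"]
merges, segmented = learn_bpe_merges(toy_corpus, 3)
print(merges)
print(dict(segmented))
```

Frequent character sequences are fused first, which is why a tokenizer trained on prompts tends to end up with single tokens for words that are common in that corpus.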

### Let's see how the trained tokenizer performs

In [8]:
tokens = tokenizer.tokenize(example)
tokens, len(tokens)

(['beautiful',
  'Ġporcelain',
  'Ġivory',
  'Ġfair',
  'Ġface',
  'Ġwoman',
  'Ġbiomechanical',
  'Ġcyborg',
  ',',
  'Ġclose',
  'Ġ-',
  'Ġup',
  ',',
  'Ġsharp',
  'Ġfocus',
  ',',
  'Ġstudio',
  'Ġlight',
  ',',
  'Ġiris',
  'Ġvan',
  'Ġherpen',
  'Ġhaute',
  'Ġcouture',
  'Ġheaddress',
  'Ġmade',
  'Ġof',
  'Ġrhizomorphs',
  ',',
  'Ġdaisies',
  ',',
  'Ġbrackets',
  ',',
  'Ġcolorful',
  'Ġcorals',
  ',',
  'Ġfractal',
  'Ġmushrooms',
  ',',
  'Ġpuffballs',
  ',',
  'Ġoctane',
  'Ġrender',
  ',',
  'Ġultra',
  'Ġsharp',
  ',',
  'Ġ8',
  'Ġk',
  'Ġ'],
 50)
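
The two outputs above give 70 tokens for the pretrained GPT-2 tokenizer and 50 for the retrained one on the same prompt. A quick calculation of the relative saving, using only those two numbers:

```python
old_count = 70  # token count from the pretrained GPT-2 tokenizer (output above)
new_count = 50  # token count from the tokenizer trained on prompts (output above)

saved = old_count - new_count
reduction = saved / old_count
print(f"{saved} fewer tokens ({reduction:.1%} reduction)")  # 20 fewer tokens (28.6% reduction)
```

This figure describes a single example prompt; it is not a measurement over the whole dataset.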

### Saving the tokenizer

In [9]:
tokenizer.save_pretrained("prompt-tokenizer")

('prompt-tokenizer/tokenizer_config.json',
 'prompt-tokenizer/special_tokens_map.json',
 'prompt-tokenizer/vocab.json',
 'prompt-tokenizer/merges.txt',
 'prompt-tokenizer/added_tokens.json',
 'prompt-tokenizer/tokenizer.json')

### Uploading the tokenizer to the Hugging Face Hub

Be sure to log in with your auth token below so that you can push the tokenizer to the Hub.

In [10]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


In [11]:
tokenizer.push_to_hub("prompt-tokenizer")

CommitInfo(commit_url='https://huggingface.co/daspartho/prompt-tokenizer/commit/5a57ceb314dfd4622e4137237c30dc76f8250508', commit_message='Upload tokenizer', commit_description='', oid='5a57ceb314dfd4622e4137237c30dc76f8250508', pr_url=None, pr_revision=None, pr_num=None)