<a href="https://colab.research.google.com/github/daspartho/prompt-extend/blob/main/model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Installing the required libraries

In [None]:
!pip install transformers sentencepiece datasets

#### Log in to Hugging Face with an auth token to push the model to the Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

#### Downloading the dataset from the Hugging Face Hub

In [None]:
from datasets import load_dataset

ds = load_dataset("Gustavosta/Stable-Diffusion-Prompts")
ds

DatasetDict({
    train: Dataset({
        features: ['Prompt'],
        num_rows: 73718
    })
    test: Dataset({
        features: ['Prompt'],
        num_rows: 8192
    })
})

#### Tokenizing the dataset

In [None]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained('daspartho/prompt-tokenizer')

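def tokenize(element):
    # Truncate each prompt to at most context_length tokens; no padding here,
    # since the data collator pads each batch later.
    return tokenizer(
        element["Prompt"],
        truncation=True,
        max_length=context_length,
    )

# Tokenize both splits in batches
tok_ds = ds.map(
    tokenize,
    batched=True,
)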
tok_ds

DatasetDict({
    train: Dataset({
        features: ['Prompt', 'input_ids', 'attention_mask'],
        num_rows: 73718
    })
    test: Dataset({
        features: ['Prompt', 'input_ids', 'attention_mask'],
        num_rows: 8192
    })
})

#### Initializing the model

In [None]:
from transformers import AutoConfig, GPT2LMHeadModel

config = AutoConfig.from_pretrained(
    'gpt2',
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

model = GPT2LMHeadModel(config)
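
To get a feel for the size of the freshly initialized model, the parameter count of a GPT-2 style network can be worked out by hand. The sketch below assumes the default `gpt2` hyperparameters (12 layers, hidden size 768, 1024 positions) and tied input/output embeddings. The function name `gpt2_param_count` is hypothetical. The vocabulary size of `daspartho/prompt-tokenizer` is not stated in this notebook, so the stock GPT-2 vocabulary of 50257 is used purely as a sanity check.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 style model with tied embeddings.

    Assumption: default `gpt2` sizes; pass the real vocabulary size to adapt it.
    """
    token_emb = vocab_size * n_embd           # token embedding (shared with LM head)
    pos_emb = n_positions * n_embd            # learned position embedding
    layer_norm = 2 * n_embd                   # scale + bias
    # Attention: fused QKV projection plus output projection (weights + biases)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    # Final layer norm after the last block
    return token_emb + pos_emb + n_layer * block + layer_norm


print(gpt2_param_count(50257))  # 124439808
```

In the notebook itself, `model.num_parameters()` should report the figure for the actual tokenizer; each vocabulary entry adds or removes 768 parameters relative to the number above.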

#### Set up a data collator to take care of creating the batches

In [None]:
from transformers import DataCollatorForLanguageModeling

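# The tokenizer needs a pad token for batching; reuse the EOS token for it
tokenizer.pad_token = tokenizer.eos_token
# mlm=False selects causal language modeling (labels are a copy of the inputs)
data_collator = DataCollatorForLanguageModeling(
    tokenizer,
    mlm=False
)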
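
As a rough illustration of what the collator does with `mlm=False`: it pads every sequence in a batch to the same length and copies `input_ids` into `labels`, replacing padding positions with `-100` so the loss skips them. The plain-Python sketch below is not the library code; `collate_causal_lm` is a hypothetical name, right-padding is assumed, and `pad_id=0` is just an example value.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_causal_lm(examples, pad_id):
    """Pad token-id lists to equal length and build causal-LM labels (sketch)."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Every position equal to pad_id is masked, so a genuine EOS token is
        # ignored as well when the pad token is the EOS token.
        batch["labels"].append([IGNORE_INDEX if t == pad_id else t for t in padded])
    return batch


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

The shift between inputs and targets (predicting token `i + 1` from tokens up to `i`) is not done here; the model applies it internally when it computes the loss.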

#### Training time!

In [None]:
from transformers import Trainer, TrainingArguments

bs = 32
epochs = 3
lr = 1e-4

args = TrainingArguments(
    output_dir="prompt-extend",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    gradient_accumulation_steps=8,  # effective train batch size: 32 * 8 = 256 per device
    num_train_epochs=epochs,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    learning_rate=lr,
    fp16=True,
    report_to='none',
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tok_ds["train"],
    eval_dataset=tok_ds["test"],
)

trainer.train()

Cloning https://huggingface.co/daspartho/prompt-extend into local empty directory.
Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: Prompt. If Prompt are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 73718
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 8
  Total optimization steps = 864
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


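| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 6.3816        | 4.182323        |
| 200  | 3.7123        | 3.303325        |
| 300  | 3.118         | 2.831113        |
| 400  | 2.7291        | 2.550348        |
| 500  | 2.4918        | 2.365279        |
| 600  | 2.3379        | 2.237474        |
| 700  | 2.1952        | 2.171431        |
| 800  | 2.1593        | 2.145278        |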
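
The validation losses are easier to interpret as perplexities. Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for this kind of training loop), perplexity is simply `exp(loss)`. The values are copied from the validation losses logged above; the variable names are illustrative.

```python
import math

# Validation loss per evaluation step, copied from the training log
val_loss = {
    100: 4.182323,
    200: 3.303325,
    300: 2.831113,
    400: 2.550348,
    500: 2.365279,
    600: 2.237474,
    700: 2.171431,
    800: 2.145278,
}

# Perplexity is the exponential of the mean per-token cross-entropy
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: perplexity {ppl:.2f}")
```

These perplexities are measured in tokens of the custom prompt tokenizer, so they are not directly comparable with models that use a different vocabulary.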


The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: Prompt. If Prompt are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8192
  Batch size = 64

TrainOutput(global_step=864, training_loss=3.066827314871329, metrics={'train_runtime': 2410.8542, 'train_samples_per_second': 91.733, 'train_steps_per_second': 0.358, 'total_flos': 1.1016391539456e+16, 'train_loss': 3.066827314871329, 'epoch': 3.0})
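
The step count in the log can be reproduced with a little arithmetic, which also shows where the warmup ends and how the cosine schedule decays. This is a minimal sketch of how the numbers are understood to fit together on a single device, not the `Trainer` source. The helper `lr_at` is hypothetical, and rounding the warmup length up is an assumption about the library's behavior.

```python
import math

num_examples = 73718   # training rows
batch_size = 32        # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
epochs = 3
peak_lr = 1e-4
warmup_ratio = 0.1

batches_per_epoch = math.ceil(num_examples / batch_size)  # 2304
updates_per_epoch = batches_per_epoch // grad_accum        # 288
total_steps = updates_per_epoch * epochs                   # 864
warmup_steps = math.ceil(total_steps * warmup_ratio)       # 87 (assumed rounding)


def lr_at(step):
    """Linear warmup to peak_lr, then a half-cosine decay to zero (sketch)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)  # 864 87
for step in (0, 87, 400, 864):
    print(step, lr_at(step))
```

The same numbers explain the throughput in `TrainOutput`: 864 steps over roughly 2411 seconds is about 0.358 steps per second.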

#### Let's try it out

In [None]:
from transformers import TextGenerationPipeline

text_pipe = TextGenerationPipeline(
    model=model, 
    tokenizer=tokenizer,
    device=0,  # run on the first CUDA device
)

prompt = "munchkin village house"
extended_prompt = text_pipe(prompt+',', num_return_sequences=1)[0]["generated_text"]
extended_prompt

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


'munchkin village house, concept art, highly detailed, artstation, trending, 8 k, studio lighting,'
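
The generated text can end with a dangling comma, as in the output above. If the extended prompt is going to be passed on to an image model, a small clean-up step may help. The helper below is a hypothetical addition, not part of the original notebook.

```python
def tidy_prompt(text):
    """Trim whitespace around comma-separated parts and drop empty ones."""
    parts = [part.strip() for part in text.split(",")]
    return ", ".join(part for part in parts if part)


raw = (
    "munchkin village house, concept art, highly detailed, artstation, "
    "trending, 8 k, studio lighting,"
)
print(tidy_prompt(raw))
```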

#### Push the model to the Hub

In [None]:
trainer.push_to_hub()