# Text generation fine-tuning

In this notebook we fine-tune a transformer model to generate text inspired by the training corpus.

In [1]:
pip install transformers -qq


In [2]:
import numpy as np
import tensorflow as tf
import pandas as pd
import transformers

from sklearn.model_selection import train_test_split

In [3]:
pip install datasets -qq


In [4]:
from datasets import load_dataset

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Once Trained

In [None]:
import torch

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
# load model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/content/drive/MyDrive/Bird is the word/models/GPT-Neo_nature_intellectual')
model.to(device)

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 2048)
    (wpe): Embedding(2048, 2048)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0): GPTNeoBlock(
        (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (out_proj): Linear(in_features=2048, out_features=2048, bias=True)
          )
        )
        (ln_2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=2048, out_features=8192, bias=True)
          (c_proj): Linear(

In [None]:
# test prompt
prompt = tokenizer("Allah is", return_tensors='pt')
prompt = {key: value.to(device) for key, value in prompt.items()}
out = model.generate(**prompt, min_length=200, max_length=200, do_sample=True)
tokenizer.decode(out[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Allah is the One, the Living, the Real. The other Arabs name their prophet Muhammad, but because they see in his life the fulfilment of all the injunctions of the Koran, they call him the Beloved (i.e., the Friend). I too am the Beloved; by my favour he became a Muslim. He recited the verses: ‘If you meet a rich man on the road – give him 250 guruş from your stores. He will give you 100 guruş moreIf you meet a poor man on the road – give him 100 guruş from your store for a journey. But if you come across a rich man – give him 250 guruş – then give him a pair of sandals and 50 guruş – they will give you another 100 guruş, saying, ‘My friend, you have made my day bright.’ – and he said, ‘By God, I entered the service of a rich man’; this too'
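The generation call above passes do_sample=True, so each next token is drawn at random from the model's predicted distribution instead of always taking the most likely one. The following is a minimal sketch of that difference on made-up logits; `pick_next_token` and `toy_logits` are hypothetical names for illustration and are not part of the transformers API.

```python
import math
import random


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample, rng):
    probs = softmax(logits)
    if do_sample:
        # Sampling: draw one index in proportion to its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy decoding: always take the most likely index.
    return max(range(len(probs)), key=probs.__getitem__)


toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores over a 4-token vocabulary
rng = random.Random(0)

greedy = pick_next_token(toy_logits, do_sample=False, rng=rng)
sampled = [pick_next_token(toy_logits, do_sample=True, rng=rng) for _ in range(10)]
print("greedy:", greedy)
print("sampled:", sampled)
```

Greedy decoding returns the same token every time for the same logits, while sampling can return any token with non-zero probability, which is why repeated runs of the cell above give different text.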

## Load and Split Data

In [37]:
dataset = load_dataset("text", data_files="/content/drive/MyDrive/Bird is the word/data/text/nature_intellectual/combined.txt")




In [38]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [39]:
datasets = dataset["train"].train_test_split(test_size = 0.3)

In [40]:
show_random_elements(datasets["test"])

Unnamed: 0,text
0,
1,"Without a doubt there will always be boys who will choose activities that are rambunctious, that call for physical strength and require an element of risk, but there will also be boys who will seek quieter pleasures, who will turn away from risk. There will be boys whose personalities will be somewhere in between these two paradigms. If boys are raised to be empathic and strong; autonomous and connected; responsible to self, to family and friends, and to society; able to make community rooted in a recognition of interbeing, then the solid foundation is present and they will be able to love."
2,
3,advocate for reform and humane treatment of pmu horses. But reform
4,
5,"\t\t\tPsilocybin can leave a lasting impression on people’s minds, like the grin on the Cheshire Cat in Alice’s Adventures in Wonderland, which “remained some time after the rest of it had gone.” In one study, researchers found that a single high dose of psilocybin increased the openness to new experiences, psychological well-being, and life satisfaction of healthy volunteers, a change that persisted in most cases for more than a year. Some studies have found that experiences with psilocybin have helped smokers or alcoholics break their addictions. Other studies have reported enduring increases in subjects’ sense of connection with the natural world."
6,disposable foals.
7,"Early mindfulness researchers proposed that a key impact of mindfulness practice is the reduction of automatic processing. This is supported by more recent findings that mindfulness practice reduces implicit age and race bias. Say you have the associations that black is bad and old is bad. Mindfulness loosens these associations, enabling you to notice and question them, so that you see a person of color or"
8,
9,


## Tokenize Data

In [35]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation = True)

In [41]:
tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

      


In [42]:
tokenized_datasets['train']

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 13959
})

### Concatenate and break into blocks of fixed size

In [43]:
block_size = 128 # Number of tokens in each block

In [44]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, though you could add padding instead if the model supports it
    # In this, as in all things, we advise you to follow your heart
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

CausalLM models in the 🤗 Transformers library shift the labels internally, so that each position is trained to predict the next token. We can therefore copy input_ids into labels without shifting them manually.
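To make the shifting concrete, here is a small sketch of how an unshifted labels copy lines up with next-token targets. The token ids are made up for illustration; the slicing mirrors what the model is understood to do internally, not code taken from the library.

```python
# Hypothetical token ids for one short block.
block = [464, 670, 286, 20994, 318]

input_ids = block
labels = list(block)  # same idea as result["labels"] = result["input_ids"].copy()

# The prediction made at position i is scored against the label at position i + 1,
# so the last position has no target and the first label is never used as one.
pairs = list(zip(input_ids[:-1], labels[1:]))

for seen, target in pairs:
    print(f"after token {seen} the model should predict {target}")
```

A block of N tokens therefore yields N - 1 training targets.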

In [45]:
lm_dataset = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=2,  # rows concatenated per group_texts call (map defaults to 1000); small values drop more leftover tokens
    num_proc=4,
)
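The batch_size passed to map decides how many rows group_texts concatenates at a time, and the remainder below block_size is dropped once per call. The sketch below illustrates the effect on a made-up corpus; `group_texts_toy` and `count_blocks` are hypothetical helpers that imitate the batching in plain Python and ignore the sharding introduced by num_proc.

```python
def group_texts_toy(batch, block_size):
    # Concatenate the rows of one batch, then keep only full blocks.
    concatenated = sum(batch, [])
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i : i + block_size] for i in range(0, usable, block_size)]


def count_blocks(rows, batch_size, block_size):
    # Imitate a batched map: group_texts_toy sees batch_size rows per call.
    blocks = 0
    for start in range(0, len(rows), batch_size):
        blocks += len(group_texts_toy(rows[start : start + batch_size], block_size))
    return blocks


# Made-up corpus: 12 rows of 50 tokens each, 600 tokens in total.
rows = [[0] * 50 for _ in range(12)]

small = count_blocks(rows, batch_size=2, block_size=128)
large = count_blocks(rows, batch_size=1000, block_size=128)
print("blocks with batch_size=2:", small)
print("blocks with batch_size=1000:", large)
```

With two short rows per call the concatenation never reaches 128 tokens, so everything is discarded, while one large batch keeps four full blocks. A larger batch_size is likely to retain more of the corpus here too.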

      


In [49]:
tokenizer.decode(lm_dataset["train"][4]["input_ids"])


' at home see Werrett (2019). The work of Darwin is a notable example. For most of his life, he conducted almost all of his work at home. He bred orchids on the windowsills, apples in the orchard, racing pigeons, and earthworms on the terrace. Much of the evidence Darwin mobilized in support of his theory of evolution came from networks of amateur animal and plant breeders, and he maintained a large volume of correspondence with well-organized networks of hobbyist collectors and backyard enthusiasts (Boulter [2010]). Today, digital platforms open new possibilities. In late 2018, a low-frequency'

In [50]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
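With mlm=False the collator prepares batches for causal language modeling: as far as we understand it, it pads the examples to a common length, rebuilds labels as a copy of input_ids, and replaces padding positions in the labels with -100 so they are ignored by the loss. The sketch below imitates that behaviour in plain Python. `collate_causal_lm` is a hypothetical stand-in, not the library implementation, and the pad id 50256 is the eos id reported in the generation log.

```python
PAD_ID = 50256        # eos token id, reused as the pad token in this notebook
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def collate_causal_lm(examples, pad_id=PAD_ID):
    longest = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = longest - len(ids)
        padded = ids + [pad_id] * n_pad
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs, with every pad-id position masked out.
        batch["labels"].append([IGNORE_INDEX if t == pad_id else t for t in padded])
    return batch


batch = collate_causal_lm([[11, 12, 13], [21, 22]])
print(batch)
```

Because the pad token is the eos token, any genuine eos token in the text would be masked in the labels as well. The blocks in this notebook all have the same length, so no padding should actually be added.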

## Initialize Model

In [51]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-1.3B')


All 🤗 Transformers models can compute an appropriate loss for their task internally (for example, a CausalLM model uses a cross-entropy loss). To do this, the labels must be provided in the input dict, so that they are visible to the model during the forward pass.

This is quite different from the standard Keras way of handling losses, where labels are passed separately and are not visible to the main body of the model, and the loss is computed by a function that the user passes to compile() from the model outputs and the labels.

This notebook trains with the PyTorch Trainer rather than Keras, so compile() and to_tf_dataset() do not apply here. The Trainer uses the model's internal loss, which is why every batch must include a labels entry.
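As a rough illustration of the internal loss, the sketch below computes a mean cross-entropy over next-token targets in plain Python, skipping labels equal to -100. `causal_lm_loss` is a hypothetical helper written for this example under those assumptions; the real computation runs on tensors inside the model.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels):
    """Mean cross-entropy of the logits at position i against the label at i + 1."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked positions do not contribute to the loss
        m = max(step_logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)


# Made-up example: uniform logits over a 4-token vocabulary, 3 positions.
logits = [[0.0] * 4 for _ in range(3)]
labels = [2, 1, 3]

loss = causal_lm_loss(logits, labels)
print(loss, math.log(4))
```

A model that spreads its probability evenly over V tokens has a loss of log(V), which is a useful baseline when reading the validation losses reported during training.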

In [52]:
from transformers import TrainingArguments, Trainer

In [53]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    # num_train_epochs= 2.0, 
)

In [54]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

## Train model

In [55]:
trainer.train()

***** Running training *****
  Num examples = 2060
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 774
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




***** Running Evaluation *****
  Num examples = 821
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 821
  Batch size = 8


Epoch,Training Loss,Validation Loss
1,No log,3.087261
2,2.840900,3.134527
3,2.840900,3.283414


***** Running Evaluation *****
  Num examples = 821
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=774, training_loss=2.56713401932433, metrics={'train_runtime': 16849.6312, 'train_samples_per_second': 0.367, 'train_steps_per_second': 0.046, 'total_flos': 5735618256568320.0, 'train_loss': 2.56713401932433, 'epoch': 3.0})
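The figures in the training log can be checked with a little arithmetic. The sketch below recomputes the number of optimization steps from the logged example count, batch size, and epochs, and converts the logged validation losses to perplexities via exp(loss). All numbers are copied from the output above; the variable names are ours.

```python
import math

# Values reported in the training log above.
num_examples = 2060
batch_size = 8
num_epochs = 3

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print("total optimization steps:", total_steps)

# Validation loss per epoch, converted to perplexity.
val_losses = [3.087261, 3.134527, 3.283414]
perplexities = [math.exp(loss) for loss in val_losses]
best_epoch = 1 + min(range(len(val_losses)), key=val_losses.__getitem__)

for epoch, ppl in enumerate(perplexities, start=1):
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
print("lowest validation loss at epoch", best_epoch)
```

Validation loss is lowest after the first epoch and rises afterwards, which may indicate that the model starts to overfit this fairly small training set.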

In [56]:
# Save the fine-tuned model weights and config to Drive
model.save_pretrained('/content/drive/MyDrive/Bird is the word/models/GPT-Neo_nature_intellectual')

Configuration saved in /content/drive/MyDrive/Bird is the word/models/GPT-Neo_nature_intellectual/config.json
Model weights saved in /content/drive/MyDrive/Bird is the word/models/GPT-Neo_nature_intellectual/pytorch_model.bin


---


In [None]:
from transformers import pipeline

In [58]:
model.to('cpu')


In [59]:
generator = pipeline(model = model, max_length = 200, task="text-generation", tokenizer = tokenizer)

In [75]:
generator('Ameya came to Switzerland to')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Ameya came to Switzerland to search out a new human being. We met again at the ashram. She was a quiet, beautiful, self-possessed young woman, who had learned both to meditate and to become a lay feminist. Although I didn’t share her politics at the time, it became clear that she had already chosen a life of greater responsibility. For her, the emotional life of men was the main area of conflict. She began to lecture men on the need to have compassion for their feelings, and to practice nonviolence toward women whenever they rebelled. She explained that men had a lot of emotional scar tissue they needed to work to heal. She had spent many years working with men in male-female relationships in India and Pakistan to unpack their feelings, and she became a champion of the notion that emotional pain was not inevitable in such relationships, that men could learn to love. She wanted men to know that they did not have to have rage.'}]