In this notebook, we will use a Hugging Face dataset to train a text generation model and push it to the Hugging Face Hub.

In [1]:
# Enter the name of the Hugging Face dataset here
ds = "eliwill/Stoic-dataset"

1. Install dependencies

In [2]:
%%capture
! pip install transformers
! pip install datasets
! pip install huggingface_hub

In [4]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login

# Paste your access token when prompted; do not hard-code it in the notebook
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store
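
For non-interactive runs, the token can be supplied through an environment variable instead of being pasted at the prompt. The following is a minimal sketch, assuming the token is stored in a variable named `HF_TOKEN`; that name and both helper functions are illustrative and not part of the original notebook. `huggingface_hub.login` is only imported once a token has actually been found.

```python
import os


def read_hub_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="HF_TOKEN"):
    """Log in with the token from the environment; return False if none is set."""
    token = read_hub_token(env_var)
    if token is None:
        return False
    # Imported lazily so the helper above works without the library installed
    from huggingface_hub import login

    login(token=token)
    return True
```

If `login_from_env()` returns `False`, falling back to `notebook_login()` keeps the interactive behaviour of the cell above.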


In [6]:
%%capture
# installing git large file storage (lfs)
!apt install git-lfs
!git config --global user.email "elijah.p.williams@vanderbilt.edu"
!git config --global user.name "eli-will-2656"

In [7]:
# Check if version > 4.2
import transformers
print(transformers.__version__)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

4.22.1
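
Printing the version leaves the comparison to the reader. A small sketch of a programmatic check follows, assuming plain dotted version strings; `version_tuple` and `is_at_least` are hypothetical helpers written for this example. Comparing the raw strings is misleading because `"4.22.1"` sorts before `"4.3"` character by character.

```python
from itertools import takewhile


def version_tuple(version, width=3):
    """Convert '4.22.1' or '4.23.0.dev0' into a tuple of ints such as (4, 22, 1)."""
    numbers = []
    for part in version.split(".")[:width]:
        # Keep only the leading digits of each component
        digits = "".join(takewhile(str.isdigit, part))
        numbers.append(int(digits) if digits else 0)
    # Pad short versions such as "4.2" to the requested width
    numbers.extend([0] * (width - len(numbers)))
    return tuple(numbers)


def is_at_least(installed, minimum):
    """Compare two version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


print(is_at_least("4.22.1", "4.2"))  # True: 22 >= 2 when compared as numbers
print("4.22.1" >= "4.3")             # False: string comparison stops at "2" < "3"
```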


2. Loading in and preprocessing dataset

In [8]:
%%capture
from datasets import load_dataset
datasets = load_dataset(ds)



In [9]:
for i in range(3):
  print(f"{i} )", datasets['train']['text'][i], "\n")

0 ) OF BENEFITS IN GENERAL. It is, perhaps, one of the most pernicious errors of a rash and inconsiderate life, the common ignorance of the world in the matter of exchanging benefits. And this arises from a mistake, partly in the person that we would oblige, and partly in the thing itself. 

1 ) To begin with the latter: “A benefit is a good office, done with intention and judgment;” that is to say, with a due regard to all the circumstances of what, how, why, when, where, to whom, how much, and the like; or otherwise: “It is a voluntary and benevolent action that delights the giver in the comfort it brings to the receiver.” It will be hard to draw this subject, either into method or compass: the one, because of the infinite variety and complication of cases; the other, by reason of the large extent of it: for the whole business (almost) of mankind in society falls under this head; the duties of kings and subjects, husbands and wives, parents and children, masters and servants, natives

3. Load in and train causal language model

We will use the [`distilgpt2`](https://huggingface.co/distilgpt2) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead:

In [10]:
model_checkpoint = "distilgpt2"

In [11]:
%%capture
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [12]:
%%capture
# Tokenizing all the texts
def tokenize_function(examples):
    return tokenizer(examples["text"])
  
tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

In [13]:
# block_size = tokenizer.model_max_length
block_size = 128

In [14]:
# Preprocessing text
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, though you could add padding instead if the model supports it
    # In this, as in all things, we advise you to follow your heart
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
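
To see what the grouping step does, it helps to run the same logic on a toy batch. The sketch below repeats the logic of `group_texts` with the block size passed as an argument; the function name `group_into_blocks` and the toy token ids are made up for illustration.

```python
def group_into_blocks(examples, block_size):
    """Concatenate every column, then cut it into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder that does not fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modelling the labels are the inputs themselves
    result["labels"] = result["input_ids"].copy()
    return result


# Three short "documents" with 3 + 4 + 3 = 10 tokens in total
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Document boundaries are ignored, so a block can start in one text and end in the next, and the last two tokens (9 and 10) are discarded because they do not fill a block.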

In [15]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

       


In [16]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'It is a voluntary and benevolent action that delights the giver in the comfort it brings to the receiver.” It will be hard to draw this subject, either into method or compass: the one, because of the infinite variety and complication of cases; the other, by reason of the large extent of it: for the whole business (almost) of mankind in society falls under this head; the duties of kings and subjects, husbands and wives, parents and children, masters and servants, natives and strangers, high and low, rich and poor, strong and weak, friends and enemies. The very meditation of it breeds good blood and'

In [17]:
%%capture
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [18]:
from transformers import AdamWeightDecay

# Adam with decoupled weight decay; `learning_rate` replaces the deprecated `lr` alias
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)


In [19]:
# Compiling model
import tensorflow as tf
model.compile(optimizer=optimizer, jit_compile=True)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [20]:
# Converting to tensorflow dataset
train_set = model.prepare_tf_dataset(
    lm_datasets["train"],
    shuffle=True,
    batch_size=16,
)

validation_set = model.prepare_tf_dataset(
    lm_datasets["validation"],
    shuffle=False,
    batch_size=16,
)
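
The batch size fixes how many optimizer steps make up one epoch. A rough sketch of that arithmetic follows, assuming a hypothetical count of 2,360 training blocks (the notebook does not print the real number); `steps_per_epoch` is a helper written for this example. Whether the final partial batch is kept depends on how the dataset was built, so both cases are shown.

```python
import math


def steps_per_epoch(num_examples, batch_size, drop_remainder):
    """Number of batches drawn from the dataset in one pass."""
    if drop_remainder:
        # The final partial batch is discarded
        return num_examples // batch_size
    # The final partial batch counts as one more step
    return math.ceil(num_examples / batch_size)


num_blocks = 2360  # hypothetical number of 128-token training blocks
print(steps_per_epoch(num_blocks, 16, drop_remainder=True))   # 147
print(steps_per_epoch(num_blocks, 16, drop_remainder=False))  # 148
```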

4. Training the model

Change `epochs` (in the `model.fit` call) and `push_to_hub_model_id` to the desired values.

In [21]:
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard

# Name of the Hub repository the model is pushed to
push_to_hub_model_id = "stoic-generator-10e"

tensorboard_callback = TensorBoard(log_dir="./clm_model_save/logs")

push_to_hub_callback = PushToHubCallback(
    output_dir="./clm_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)

callbacks = [tensorboard_callback, push_to_hub_callback]

model.fit(train_set, validation_data=validation_set, epochs=10, callbacks=callbacks)

Cloning https://huggingface.co/eliwill/stoic-generator-10e into local empty directory.


Epoch 1/10
  6/147 [>.............................] - ETA: 38s - loss: 4.2339



Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file tf_model.h5:   0%|          | 3.34k/313M [00:00<?, ?B/s]

Upload file logs/validation/events.out.tfevents.1664130386.b847084d5e3f.67.1.v2: 100%|##########| 1.56k/1.56k …

Upload file logs/train/events.out.tfevents.1664130319.b847084d5e3f.67.0.v2:   0%|          | 3.34k/871k [00:00…

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/eliwill/stoic-generator-10e
   af176d0..6dc9a4e  main -> main



<keras.callbacks.History at 0x7f4fa0035ed0>
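
The loss printed during training is easier to interpret as perplexity. A minimal sketch, assuming the reported loss is the mean per-token cross-entropy in nats, in which case perplexity is its exponential; the `perplexity` helper is illustrative, and the value 4.2339 is the early epoch-1 loss from the log above rather than a final result.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


early_loss = 4.2339  # loss shown a few steps into epoch 1
print(round(perplexity(early_loss), 1))  # roughly 69
```

The same conversion can be applied to the validation loss stored in the `History` object returned by `model.fit`.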

5. Generating text with the pipeline API

In [None]:
from transformers import pipeline

text_generator = pipeline(task="text-generation",
                          model="eliwill/stoic-generator-10e",
                          framework="tf")

In [24]:
from pprint import pprint
test_sentence = "What is the meaning of life?"
generated_sentence = text_generator(test_sentence, min_length=100, max_length=120)[0]['generated_text']
pprint(generated_sentence)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


('What is the meaning of life?"Life is my life. My life is the work of a great '
 'philosopher." "My life is my life."I suppose, for what purpose? But then I '
 'should begin to imagine things which are according to reason and those which '
 'are independent of reason: and how do I know that not every man has the '
 'right reason or the right faculty of life, but only of an intelligence of '
 'reason and in accordance with nature; how do you find that not only is this '
 'possible but every man has the right faculties and in accordance with '
 'nature; and so that you will find')
