In [1]:
import sys
!conda install --prefix {sys.prefix} -y -c pytorch pytorch

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [2]:
import sys
!{sys.executable} -m pip install transformers



In [6]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

In [9]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

In [10]:
generator("They walked and walked till the Gruffalo said")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "They walked and walked till the Gruffalo said: 'Give my daughter a free ride. I want her to be a member of the ruling family.'\n\n\n\nThe lawyer for Mr Argo, who is fighting against the ruling, told"}]

In [11]:
generator("They walked and walked till the Gruffalo said,")[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'They walked and walked till the Gruffalo said, `That the King is called the King, not the King, but the King is not.\n\n\n"And I am the King of Greece, and the King of Germany.\n"'

In [28]:
generator("They walked and walked till the Gruffalo said,",
          max_length=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'They walked and walked till the Gruffalo said, "O Christ, that is all my Lord, will help me, go there before my Father, to come."\n\n"Oh, then, how much is every prayer of the Lord?\n"And he said to us, "Well, I shall say, when will I go?"\n"Are you looking there? What is his heart?"\n"He said, "Well, but it is that he who gives the God who'}]

In [29]:
generator("They walked and walked till the Gruffalo said,",
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'They walked and walked till the Gruffalo said, \'Sectes.\' "For he did not turn to him! he would cry! for it! He did not turn to him! He let him cry! I put on the bed. But he said, "I am not here! You are not here to stand. I have said a bit of the word! I am not here! I shall come for you! I have said and he laid and stood! Now, this is'

In [27]:
print(generator("The Gruffalo: A New Adventure\n.",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The Gruffalo: A New Adventure
. The King of Spent Time
. The King of Darkness: A Feast of Fools
. The King of Darkness: A Feast of Fools by David Thirteen
. The King of Darkness by David Thirteen: The King of Evil
. The King of Evil
. The King of Evil: A Feast of Evil
Picking The King of Evil
.-The Devil: A Feast of Evil
For your life was the destiny of


In [30]:
generator("They walked and walked till the Gruffalo said,",
          top_k=tokenizer.vocab_size,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'They walked and walked till the Gruffalo said, bound in the earth."\n"SqurainContinence"\nHow about a tickles you nurses?\nWe b","pbs.snatch.pie.narciss\n-thanky troll like f**kin superte.\nWhy we\'re coming in with your name and name and be a baby monster?\nThere\'s you without a fucking head?\n\n-snatch Mafia\n-snatch mob try to kill you,'

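Setting `top_k` to the full vocabulary size, as in the cell above, lets the model sample from every token instead of only the highest-ranked candidates. Combining this with a `temperature` greater than `1.0` can yield some unusual turns of phrase: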

In [31]:
generator("They walked and walked till the Gruffalo said,",
          top_k=tokenizer.vocab_size,
          temperature=1.2,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




At the other extreme, setting `top_k` to `1` ensures that *only* the most likely token is chosen at each step. This is equivalent to ["greedy decoding"](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.greedy_search):

In [32]:
generator("They walked and walked till the Gruffalo said,",
          top_k=1,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"They walked and walked till the Gruffalo said, 'I am a man of the world.'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
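To make the effect of `top_k` and `temperature` more concrete, here is a toy sketch in plain Python. It does not call Transformers at all: the four-word vocabulary, the scores, and the helper `next_token_probs` are all made up for illustration, and the real library applies these steps to tensors over the whole vocabulary.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw scores into sampling probabilities (toy version)."""
    # Dividing by the temperature flattens (>1) or sharpens (<1) the distribution.
    scaled = [score / temperature for score in logits]
    # Keep only the top_k highest-scoring tokens; the rest get probability 0.
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else -math.inf for s in scaled]
    # Softmax over whatever survived the filter.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

vocab = ["mouse", "fox", "owl", "snake"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical next-token scores

print(next_token_probs(logits))                   # unmodified distribution
print(next_token_probs(logits, temperature=1.2))  # flatter: unlikely words gain
print(next_token_probs(logits, top_k=1))          # all probability on one word

# With top_k=1 there is nothing left to sample: the draw is always "mouse".
greedy = next_token_probs(logits, top_k=1)
choice = random.choices(vocab, weights=greedy)[0]
print(choice)
```

Because `top_k=1` leaves a single candidate, sampling becomes deterministic, which is why the cell above behaves like greedy decoding and falls into a run of repeated newlines.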

### Fine-tuning a model

"Fine-tuning" is a way of slightly modifying a model by training it for a few extra steps on a corpus of your choice. This process adjusts the model's probabilities so that they more closely reflect those of the source text you train it on. Fine-tuning models with Transformers is a little bit tricky! First, you'll need to install Hugging Face's `datasets` package:

In [12]:
import sys
!{sys.executable} -m pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-macosx_10_9_x86_64.whl (360 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-macosx_10_9_x86_64.whl (35 kB)
Collecti
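The `huggingface/tokenizers` warning at the top of this output is harmless, but it will reappear every time a `!` shell command runs after the tokenizer has been used. As the message itself suggests, one way to silence it is to set the environment variable explicitly. A minimal sketch, assuming you run it in a cell near the top of the notebook before the tokenizer is first used:

```python
import os

# Explicitly disable tokenizer parallelism so that forking a subprocess
# (which is what `!pip install ...` does) no longer triggers the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```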

In [41]:
import datasets

In [42]:
with open("Gruffalo.txt") as src, open("84-0-20k.txt", "w") as fh:
    # Keep only the first 20,000 characters of the source text.
    fh.write(src.read()[:20000])

In [43]:
training_data = datasets.load_dataset('text', data_files="84-0-20k.txt")

Downloading and preparing dataset text/default to /Users/mica/.cache/huggingface/datasets/text/default-88990be045234c4e/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset text downloaded and prepared to /Users/mica/.cache/huggingface/datasets/text/default-88990be045234c4e/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [44]:
tokenizer.pad_token = tokenizer.eos_token
tokenized_training_data = training_data.map(
    lambda x: tokenizer(x['text']),
    remove_columns=["text"]
)

Map:   0%|          | 0/137 [00:00<?, ? examples/s]

In [45]:
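block_size = 64
# Adapted from https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb
def group_texts(examples):
    # Join the token lists of every line in the batch into one long list per key.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the leftover tokens that don't fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Cut each long list into consecutive chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are the inputs themselves;
    # the model shifts them by one position internally.
    result["labels"] = result["input_ids"].copy()
    return result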
lm_training_data = tokenized_training_data.map(
    group_texts,
    batched=True,
    batch_size=200
)

Map:   0%|          | 0/137 [00:00<?, ? examples/s]
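To see what `group_texts` actually does, it can help to run the same logic on a tiny hand-made batch. The sketch below repeats the function with a `block_size` of 4 instead of 64 so the result is easy to read. The token IDs are made up and nothing here touches the real tokenizer or dataset.

```python
block_size = 4  # tiny value for illustration; the notebook uses 64

def group_texts(examples):
    # Same logic as in the notebook: join every list, then cut into equal blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three "lines" of made-up token IDs with lengths 3, 5 and 2 (10 tokens total).
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Line boundaries disappear, every training example ends up exactly `block_size` tokens long, and the two leftover tokens (`9` and `10`) are dropped because they don't fill a block.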

In [46]:
from transformers import Trainer, TrainingArguments

In [47]:
trainer = Trainer(model=model,
                  train_dataset=lm_training_data['train'],
                  args=TrainingArguments(
                      output_dir='distilgpt2-finetune-frankenstein20k',
                      num_train_epochs=1,
                      do_train=True,
                      do_eval=False
                  ),
                  tokenizer=tokenizer)

In [48]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=3, training_loss=4.276010513305664, metrics={'train_runtime': 5.6148, 'train_samples_per_second': 3.918, 'train_steps_per_second': 0.534, 'total_flos': 359283032064.0, 'train_loss': 4.276010513305664, 'epoch': 1.0})
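The numbers in this `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below copies the values from the output above and assumes the `Trainer` ran with its default batch size of 8 on a single device (the notebook doesn't set one, so that is an inference, not something the output states). It also converts the loss into a perplexity, on the understanding that the reported loss is a mean per-token cross-entropy in nats.

```python
import math

# Values copied from the TrainOutput above.
training_loss = 4.276010513305664
train_runtime = 5.6148            # seconds
train_samples_per_second = 3.918
global_step = 3

# Assumption: default per-device batch size of 8, one device.
batch_size = 8

# Samples seen in one epoch = throughput * runtime (rounded to a whole number).
num_blocks = round(train_runtime * train_samples_per_second)

# One optimizer step per batch; the last batch may be partly empty.
steps_per_epoch = math.ceil(num_blocks / batch_size)

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(training_loss)

print(num_blocks)                      # 22
print(steps_per_epoch == global_step)  # True
print(perplexity)                      # roughly 72
```

In other words, the whole training set is only 22 blocks of 64 tokens, so one epoch amounts to three gradient updates. That is a very light touch, which fits the "slightly modifying" description of fine-tuning above.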

In [49]:
trainer.save_model()

In [50]:
generator("They walked and walked till the Gruffalo said,", max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'They walked and walked till the Gruffalo said, this was the one that was sitting there. Then she said, \'I know she said to me and the Gruffalo said, \'It is not very pleasant.\' That the word "The Gruffalo" could only be pronounced "" "What do you mean in this?" and the Gruffalo said, \'Wherever you are about to speak, I come and we will hear the Gruffalo."\'Then when Gia'

In [51]:
print(generator("The Gruffalo: A New Adventure\n",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The Gruffalo: A New Adventure

"How in the Old World we're the ones who know?"
"Oh my, good man."
"You know, right?""Oh, right?"
"Not exactly."
"Well, and that's what I want!"
"I'm going to get her back."
"So, no, you never."
"Really, my dear,"
"Oh?"
"That's right."
"No. She


In [65]:
print(generator("The Gruffalo said\n",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The Gruffalo said

“You know, we're already in the work of the very first two-hour show, but I can't help but watch a whole movie.“It's an unoriginal show, I know it's going to be very different. You know, every second you make a show, the time frame is usually very short, like 20 minutes, but it seems less long every second.“When you see the show itself, you don't watch


In [83]:
print(generator("The mouse in the woods\n",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The mouse in the woods



"But there was snow here WBFF. I want to get me the water. I'll pull some hot water and turn it into the river, which will be the wind."
"So a little river of water comes down to this river and when you get to take a cup of hot water and fill the cup, and you look at it. I know, I was the best boy in the world. But the only thing I saw was that


In [87]:
print(generator("The eyes of the Gruffalo \n",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The eyes of the Gruffalo 
It seems like the Gruffalo is wearing the clothes of your Gruffalo after you You m are the Gruffalo’s and the Gruffalo has a large, yellow hood.’You are the Gruffalo's and the Gruffalo’s just a little girl’s.’And you are a little black hair.’And you are a little black hair.’
