# 18.065 Final Project
aijia@mit.edu

This PromptGPT-2 notebook cleans the training text, trains a GPT-2 model on it, and generates text from an input prompt.

In [1]:
!git clone https://github.com/kamalkraj/minGPT-TF.git

Cloning into 'minGPT-TF'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 81 (delta 41), reused 42 (delta 17), pack-reused 0
Unpacking objects: 100% (81/81), done.


In [2]:
! pip install fastprogress==0.2.3

Collecting fastprogress==0.2.3
  Downloading fastprogress-0.2.3-py3-none-any.whl (12 kB)
Installing collected packages: fastprogress
  Attempting uninstall: fastprogress
    Found existing installation: fastprogress 1.0.2
    Uninstalling fastprogress-1.0.2:
      Successfully uninstalled fastprogress-1.0.2
Successfully installed fastprogress-0.2.3


In [3]:
import os
os.chdir('minGPT-TF')
import math
import numpy as np
import tensorflow as tf
from mingpt.model import GPT, GPTConfig

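The dataset below samples the text character by character: each item is a random chunk of `block_size + 1` characters, split into an input sequence and a target sequence shifted by one character.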

In [4]:
class CharDataset:

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __iter__(self):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        for _ in range(self.__len__()):
            i = np.random.randint(0, len(self.data) - (self.block_size + 1))
            chunk = self.data[i:i+self.block_size+1]
            dix = [self.stoi[s] for s in chunk]
            x = tf.convert_to_tensor(dix[:-1], dtype=tf.int32)
            y = tf.convert_to_tensor(dix[1:], dtype=tf.int32)
            yield x, y
    
    __call__ = __iter__
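
To make the input/target shift concrete, here is a minimal sketch of what one item of `CharDataset` looks like, written in plain Python without TensorFlow. The toy string, the small `block_size`, and the fixed start offset are assumptions chosen for illustration; the class itself picks the offset at random.

```python
# Toy corpus standing in for the training text (hypothetical).
data = "hello world"
block_size = 4

# Same vocabulary construction as CharDataset.__init__.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Take one chunk of block_size + 1 characters. A fixed offset is used here
# (instead of a random one) so the result is reproducible.
start = 0
chunk = data[start:start + block_size + 1]
dix = [stoi[ch] for ch in chunk]

x = dix[:-1]  # inputs: the first block_size characters
y = dix[1:]   # targets: the same window shifted by one character

print("chunk:  ", repr(chunk))
print("inputs: ", [itos[i] for i in x])
print("targets:", [itos[i] for i in y])
```

At every position the model is asked to predict the next character, so a single chunk yields `block_size` training targets.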

In [5]:
block_size = 128 

First, load *Grimms' Fairy Tales*, an edition with fewer stories, as the training text.

In [6]:
import tensorflow as tf
from tensorflow import keras
path_to_file = tf.keras.utils.get_file('Grimm.txt', 'https://www.gutenberg.org/files/2591/2591-0.txt')

Downloading data from https://www.gutenberg.org/files/2591/2591-0.txt


In [11]:
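# read explicitly as UTF-8 so the curly quotes in the file decode the same way on every platform
text = open(path_to_file, 'r', encoding='utf-8').read()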
text.find('END')


521728

In [20]:
text[520565:521719]  # this part is a note about the authors and should be cut from the training data

'**\n\n\nThe Brothers Grimm, Jacob (1785-1863) and Wilhelm (1786-1859), were born\nin Hanau, near Frankfurt, in the German state of Hesse. Throughout\ntheir lives they remained close friends, and both studied law at Marburg\nUniversity. Jacob was a pioneer in the study of German philology,\nand although Wilhelm’s work was hampered by poor health the brothers\ncollaborated in the creation of a German dictionary, not completed until\na century after their deaths. But they were best (and universally) known\nfor the collection of over two hundred folk tales they made from oral\nsources and published in two volumes of ‘Nursery and Household Tales’ in\n1812 and 1814. Although their intention was to preserve such material as\npart of German cultural and literary history, and their collection was\nfirst published with scholarly notes and no illustration, the tales soon\ncame into the possession of young readers. This was in part due to Edgar\nTaylor, who made the first English translation in 1

In [21]:
# drop the leading 2890 characters and everything from the closing note about the authors onward
text = text[2890:520564]
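
The offsets above were found by inspecting the file and are tied to this exact download. As an alternative, the cut points can be located by searching for marker strings. The helper `slice_between` and the markers in the toy string below are hypothetical, and which markers suit the Gutenberg file is left open; this is only a sketch of the idea.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the first
    end_marker that follows it. Raises ValueError if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found")
    return text[start:end]

# Toy document with made-up markers (not the real Gutenberg ones).
sample_text = "header\n<<BEGIN>>\nOnce upon a time...\n<<FINISH>>\nfooter"
body = slice_between(sample_text, "<<BEGIN>>", "<<FINISH>>").strip()
print(body)
```

Raising on a missing marker avoids the silent failure mode of `str.find`, which returns -1 and would otherwise produce a wrong slice.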

In [22]:
train_dataset_gen = CharDataset(text, block_size) 

data has 517674 characters, 74 unique.


In [23]:
train_dataset = tf.data.Dataset.from_generator(train_dataset_gen,(tf.int32,tf.int32))

In [24]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset_gen.vocab_size, train_dataset_gen.block_size,
                  n_layer=8, n_head=8, n_embd=512)

In [25]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=300, batch_size=64, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=200*len(train_dataset_gen)*block_size,
                      num_workers=4)
trainer = Trainer(GPT, mconf, train_dataset, len(train_dataset_gen), None, None, tconf)

In [26]:
history=trainer.train()

epoch 1: train loss 355.85049. lr 5.999645e-04
epoch 2: train loss 302.38544. lr 5.998549e-04
epoch 3: train loss 288.33002. lr 5.996713e-04
epoch 4: train loss 264.53198. lr 5.994138e-04
epoch 5: train loss 244.00127. lr 5.990824e-04
epoch 6: train loss 226.64413. lr 5.986771e-04
epoch 7: train loss 213.81165. lr 5.981982e-04
epoch 8: train loss 202.46840. lr 5.976457e-04
epoch 9: train loss 193.41008. lr 5.970197e-04
epoch 10: train loss 186.28934. lr 5.963204e-04
epoch 11: train loss 179.58250. lr 5.955481e-04
epoch 12: train loss 173.74802. lr 5.947027e-04
epoch 13: train loss 168.90077. lr 5.937847e-04
epoch 14: train loss 164.28537. lr 5.927941e-04
epoch 15: train loss 160.74887. lr 5.917312e-04
epoch 16: train loss 156.87894. lr 5.905964e-04
epoch 17: train loss 153.98038. lr 5.893899e-04
epoch 18: train loss 150.81343. lr 5.881119e-04
epoch 19: train loss 147.66617. lr 5.867629e-04
epoch 20: train loss 144.21683. lr 5.853430e-04
epoch 21: train loss 141.89911. lr 5.838528e-04
...
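
The learning rates in the log can be checked by hand. The sketch below assumes the trainer uses a linear warmup followed by a cosine decay floored at 10% of the base rate, driven by the number of tokens processed; that schedule is an assumption about the trainer, not something the notebook states. The numeric constants are taken from the cells above, and `lr_after` is a hypothetical helper.

```python
import math

# Values taken from the notebook above.
data_size = 517674
block_size = 128
learning_rate = 6e-4
warmup_tokens = 512 * 20

# CharDataset.__len__: number of chunks drawn per epoch.
samples_per_epoch = math.ceil(data_size / (block_size + 1))
tokens_per_epoch = samples_per_epoch * block_size
final_tokens = 200 * samples_per_epoch * block_size

def lr_after(epoch):
    """Learning rate after `epoch` full epochs under the ASSUMED schedule:
    linear warmup, then cosine decay floored at 10% of the base rate."""
    tokens = epoch * tokens_per_epoch
    if tokens < warmup_tokens:
        mult = tokens / max(1, warmup_tokens)
    else:
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return learning_rate * mult

print(samples_per_epoch, tokens_per_epoch)
for epoch in (1, 2, 3):
    print(f"epoch {epoch}: lr {lr_after(epoch):e}")
```

With these constants the computed rates for the first epochs come out very close to the logged ones, which supports the assumption. Under this formula the multiplier reaches its 0.1 floor at epoch 200, the horizon set by `final_tokens`, while `max_epochs` is 300; because `progress` is not clamped in this sketch, the cosine would start rising again past that point.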

In [27]:
# sample some character-level text from the trained model
from mingpt.utils import sample


In [None]:
context = "Sir!"
x = tf.convert_to_tensor([train_dataset_gen.stoi[s] for s in context], dtype=tf.int32)[None,...]

In [28]:
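new_path = tf.keras.utils.get_file('Grimm2.txt', 'https://www.gutenberg.org/files/52521/52521-0.txt')
# read explicitly as UTF-8 so the curly quotes in the file decode the same way on every platform
test = open(new_path, 'r', encoding='utf-8').read()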
test[6000:].find('THE STAR-MONEY')+6000

Downloading data from https://www.gutenberg.org/files/52521/52521-0.txt


44109

The character-index tensor for a prompt looks like this:

In [34]:
tf.convert_to_tensor([train_dataset_gen.stoi[s] for s in 'And'], dtype=tf.int32)[None,...]


<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[15, 57, 47]], dtype=int32)>

In [31]:
context = test[44109:45694]+'And'
x = tf.convert_to_tensor([train_dataset_gen.stoi[s] for s in context], dtype=tf.int32)[None,...]
y = sample(trainer.model, x, len(context)+30, temperature=0.6, sample=True, top_k=5)[0]
completion = ''.join([train_dataset_gen.itos[int(i)] for i in y])
print(completion[0:len(context)+50])

THE STAR-MONEY


There was once on a time, a little girl whose father and mother were
dead. She was so poor that she no longer had any little room to live
in, or bed to sleep in. At last, she had nothing else but the clothes
she was wearing and a little bit of bread in her hand which some
charitable soul had given her. She was, however, good and pious.

And as she was thus forsaken by all the world, she went forth into the
open country, trusting in the good God.

Then a poor man met her, who said, “Ah, give me something to eat, I am
so hungry!”

She reached him the whole of her piece of bread, and said, “May God
bless it to your use,” and went onward.

Then came a child who moaned and said, “My head is so cold, give me
something to cover it with.”

So she took off her hood and gave it to him.

And when she had walked a little farther, she met another child who had
no jacket and was frozen with cold. Then she gave it her own.

A little farther on one begged for a frock, and she gave awa

In [33]:
print(completion[len(context)-3:len(context)+100])

And thus she went roving on through the wide world, and looked about and
flew into the fields; the flie
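
The `temperature=0.6` and `top_k=5` arguments control how each next character is drawn. The following is a minimal, standard-library sketch of temperature scaling combined with top-k filtering. It is not the code in `mingpt.utils`; the function `sample_top_k` and the logits are made up for illustration.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from `logits` after temperature scaling and optional
    top-k filtering. Illustrative only; not the mingpt.utils code."""
    # Temperatures below 1 sharpen the distribution, above 1 flatten it.
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep the k largest logits and mask the rest out.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else -math.inf for s in scaled]
    # Numerically stable softmax weights (masked entries become 0).
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0, 0.0]  # hypothetical scores for 6 characters
draws = [sample_top_k(logits, temperature=0.6, top_k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Only the three highest-scoring indices can ever be drawn here, and the low temperature pushes most of the probability onto the top one. That trade-off keeps generated text coherent at the cost of some variety.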
