# HW3

In this homework, we'll learn about transformers and chatbots.

It will probably be easiest to run this on http://colab.research.google.com

## minGPT Character Language Model

First, we will inspect Karpathy's [minGPT](https://github.com/karpathy/minGPT/tree/master) library to learn more about transformers.

We'll start by fitting a character language model using minGPT. The training data is the Shakespeare text in the `tinyshakespeare` file downloaded below.

In [1]:
# clone the library
!git clone https://github.com/karpathy/minGPT.git

Cloning into 'minGPT'...
remote: Enumerating objects: 489, done.
remote: Total 489 (delta 0), reused 0 (delta 0), pack-reused 489
Receiving objects: 100% (489/489), 1.44 MiB | 7.26 MiB/s, done.
Resolving deltas: 100% (260/260), done.


In [2]:
# Add mingpt to your Python path, so you can import it.
import sys
sys.path.insert(0, './minGPT')
from mingpt.model import GPT
from mingpt.trainer import Trainer
from mingpt.utils import set_seed
import pandas as pd
import pickle
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
set_seed(3407)

In [3]:
# download shakespeare data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-04-18 00:22:28--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-04-18 00:22:28 (22.0 MB/s) - ‘input.txt’ saved [1115394/1115394]



### Data loading and training code

In [4]:
from mingpt.utils import set_seed, setup_logging, CfgNode as CN
import os
import sys

class CharDataset(Dataset):
    """
    This represents a dataset of characters.
    """
    @staticmethod
    def get_default_config():
        C = CN()
        C.block_size = 128
        return C

    def __init__(self, config, data):
        self.config = config
        self.parse_data(data)

    def parse_data(self, data):
        print('parsing char data')
        # get list of all characters
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        # map from char to int
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        # map from int to char
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.vocab_size = vocab_size
        self.data = data

    def get_vocab_size(self):
        return self.vocab_size

    def get_block_size(self):
        return self.config.block_size

    def __len__(self):
        return len(self.data) - self.config.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.config.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        # return as tensors
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

def get_config():

    C = CN()

    # system
    C.system = CN()
    C.system.seed = 3407
    C.system.work_dir = './out'

    # data
    C.data = CharDataset.get_default_config()

    # model
    C.model = GPT.get_default_config()
    C.model.model_type = 'gpt-micro'

    # trainer
    C.trainer = Trainer.get_default_config()
    C.trainer.learning_rate = 5e-4 # the model we're using is so small that we can go a bit faster

    return C


def train_model(config, train_dataset, sample_fn):
    """
    Train the model.
    config..........CfgNode
    train_dataset...Dataset that emits strings for training
    sample_fn.......function to call during training to show sample output.
    """
    # construct the model
    config.model.vocab_size = train_dataset.get_vocab_size()
    config.model.block_size = train_dataset.get_block_size()
    model = GPT(config.model)

    # construct the trainer object
    trainer = Trainer(config.trainer, model, train_dataset)

    # iteration callback
    def batch_end_callback(trainer):

        if trainer.iter_num % 10 == 0:
            print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

        if trainer.iter_num % 500 == 0:
            # switch to eval mode and print a sample from the model
            model.eval()
            with torch.no_grad():
                # sample from the model...
                context = list(train_dataset.itos.values())[0]
                completion = sample_fn(context, model, trainer, train_dataset, maxlen=100, temperature=1.)
                print('sample from the model:')
                print(completion)
            # save the latest model
            print("saving model")
            ckpt_path = os.path.join(config.system.work_dir, "model.pt")
            torch.save(model.state_dict(), ckpt_path)
            # revert model to training mode
            model.train()

    trainer.set_callback('on_batch_end', batch_end_callback)

    # run the optimization
    trainer.run()
    model.eval()
    return model, trainer

def configure_model(max_iters=100, block_size=128):
    config = get_config()
    config.merge_from_args(['--trainer.max_iters=%d' % max_iters,
                            '--data.block_size=%d' % block_size,
                            '--model.block_size=%d' % block_size])
    setup_logging(config)
    set_seed(config.system.seed)
    return config


def create_char_data(config):
    # construct the training dataset
    with open('input.txt', 'r') as f:
        text = f.read()
    return CharDataset(config.data, text)

def sample_from_char_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ''.join([train_dataset.itos[int(i)] for i in y])

In [8]:
# train the character model.
config = configure_model(max_iters=100, block_size=64)
#config = configure_model(max_iters=100, block_size=64*2)
train_dataset = create_char_data(config)
model, trainer = train_model(config, train_dataset, sample_from_char_model)

command line overwriting config attribute trainer.max_iters with 100
command line overwriting config attribute data.block_size with 128
command line overwriting config attribute model.block_size with 128
parsing char data
data has 1115394 characters, 65 unique.
number of parameters: 0.82M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 4.18332
sample from the model:

trt t  w,the Qe QiQrsv.teeQrs.i srQsr we sihyQ hsw w QihQit ik miktoe, e t tstQiiPosw hw hh, ee,ytsr
saving model
iter_dt 35.53ms; iter 10: train loss 3.26646
iter_dt 37.83ms; iter 20: train loss 2.97872
iter_dt 40.16ms; iter 30: train loss 2.79543
iter_dt 38.67ms; iter 40: train loss 2.68608
iter_dt 36.26ms; iter 50: train loss 2.62622
iter_dt 39.98ms; iter 60: train loss 2.63599
iter_dt 37.28ms; iter 70: train loss 2.53588
iter_dt 38.40ms; iter 80: train loss 2.56386
iter_dt 39.06ms; iter 90: train loss 2.51362


In [6]:
print(sample_from_char_model("Romeo:", model, trainer, train_dataset, maxlen=10, temperature=1))

Romeo: thom myon


**What is the `block_size` variable? Describe in detail what it does.**

You might want to consult the code for [model.py](https://github.com/karpathy/minGPT/blob/master/mingpt/model.py).



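The `block_size` variable is the maximum context length: the largest number of tokens the model can process in a single forward pass. It is used in several places. In the dataset, each training example is a window of `block_size` tokens, and the target is the same window shifted by one position. In the model, it sets the number of rows in the positional embedding table and the size of the causal attention mask, a lower-triangular `block_size` x `block_size` matrix that lets each position attend only to itself and earlier positions in the window. The forward pass rejects sequences longer than `block_size`, and during generation the running context is cropped to the most recent `block_size` tokens. As a result, when predicting a token the model can only use the preceding `block_size` tokens as context.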
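To make the windowing concrete, here is a minimal pure-Python sketch of the three roles described above. The helper names (`training_pair`, `causal_mask`, `crop_context`) are made up for illustration and are not part of minGPT; the slicing mirrors `CharDataset.__getitem__` from the cell above.

```python
# Toy illustration of how block_size shapes training examples and generation.
# The function names are illustrative assumptions, not minGPT APIs.

def training_pair(data, idx, block_size):
    """Slice (input, target) the way CharDataset.__getitem__ does."""
    chunk = data[idx:idx + block_size + 1]   # block_size + 1 tokens
    return chunk[:-1], chunk[1:]             # target is the input shifted by one

def causal_mask(block_size):
    """Lower-triangular mask: row t may attend to columns 0..t only."""
    return [[1 if col <= row else 0 for col in range(block_size)]
            for row in range(block_size)]

def crop_context(tokens, block_size):
    """Keep only the most recent block_size tokens, as generation has to."""
    return tokens[-block_size:]

x, y = training_pair("First Citizen", idx=0, block_size=4)
print(repr(x), repr(y))                  # 'Firs' 'irst'
for row in causal_mask(4):
    print(row)
print(crop_context(list("Romeo:"), 4))   # ['m', 'e', 'o', ':']
```

With `block_size=4`, the prompt "Romeo:" is already longer than the context, so only its last four characters would influence the next prediction.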

**What is the relationship between `block_size` and the total number of parameters in the model?** That is, if we double `block_size`, what happens to the total number of model parameters?

The total number of model parameters increases only slightly. With block_size=64 the model reports 0.81M parameters; with block_size=128 it reports 0.82M. The only parameters that depend on `block_size` are the positional embeddings, stored in a matrix of shape [block_size, n_embd], where n_embd is the embedding dimensionality. Doubling `block_size` doubles the number of rows in this matrix (from [block_size, n_embd] to [2*block_size, n_embd]), which adds block_size * n_embd parameters. For `gpt-micro` (n_embd = 128), going from 64 to 128 adds 64 * 128 = 8,192 parameters, consistent with the change from 0.81M to 0.82M. The causal mask also grows, but it is a fixed buffer rather than a learned parameter, so it does not count. The increase is small compared to the rest of the model.
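The arithmetic can be checked in a few lines. This is only a back-of-the-envelope sketch: the helper name is made up, and `n_embd = 128` is the value minGPT lists for the `gpt-micro` preset, assumed here rather than read from the trained model.

```python
# Back-of-the-envelope check of the positional-embedding cost.
# Assumption: n_embd = 128, the embedding width of the 'gpt-micro' preset.

def positional_embedding_params(block_size, n_embd):
    """Number of entries in a [block_size, n_embd] positional embedding table."""
    return block_size * n_embd

n_embd = 128
small = positional_embedding_params(64, n_embd)
large = positional_embedding_params(128, n_embd)
print(small, large, large - small)   # 8192 16384 8192
```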

**What is the `n_layer` parameter? Describe in detail what it does. If we double this parameter, what happens to the total number of model parameters?**

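The `n_layer` parameter specifies the number of transformer blocks stacked in the model. Each block contains a causal self-attention module and a feed-forward network (MLP), each preceded by layer normalization, and the output of one block is the input to the next. If we double `n_layer`, we double the number of these blocks and therefore the parameters they hold. The token embeddings, positional embeddings, final layer norm, and output head do not change, so the total is slightly less than double. Because the blocks hold most of the parameters in this small-vocabulary model, the result is close to a doubling.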
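A rough parameter count shows how close to doubling this is. The sketch below assumes the layer shapes in minGPT's `model.py` as I understand them (a fused q/k/v projection, an MLP that expands to 4 * n_embd, biases on every linear layer) and the `gpt-micro` preset values `n_layer=4`, `n_embd=128`. The function names are illustrative. Like the log line printed during training, it leaves out the output head.

```python
# Rough parameter count for a minGPT-style model (assumed layer shapes).

def block_params(n_embd):
    """Parameters in one transformer block (weights and biases)."""
    layer_norms = 2 * (2 * n_embd)                  # ln_1 and ln_2: scale + shift
    attention = n_embd * 3 * n_embd + 3 * n_embd    # fused q, k, v projection
    attention += n_embd * n_embd + n_embd           # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd          # expand to 4 * n_embd
    mlp += 4 * n_embd * n_embd + n_embd             # project back to n_embd
    return layer_norms + attention + mlp

def transformer_params(vocab_size, block_size, n_layer, n_embd):
    """Embeddings + blocks + final layer norm; the output head is not counted."""
    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd
    return embeddings + n_layer * block_params(n_embd) + final_norm

base = transformer_params(vocab_size=65, block_size=128, n_layer=4, n_embd=128)
doubled = transformer_params(vocab_size=65, block_size=128, n_layer=8, n_embd=128)
print(base, doubled, round(doubled / base, 2))   # 818048 1611136 1.97
```

Under these assumptions the base count rounds to the 0.82M reported in the training log for 65 characters and a block size of 128, and doubling `n_layer` multiplies the total by about 1.97 rather than exactly 2.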

**What does the temperature parameter do?** See the generate method in [model.py](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L283).

Try setting temperature to different values. What do you observe about the output?

The temperature parameter in the generate method controls the randomness of text generation. The logits for the next token are divided by the temperature before the softmax turns them into probabilities. A temperature above 1 flattens the distribution, so less likely tokens get a relatively higher chance of being selected and the text becomes more random and surprising. A temperature below 1 sharpens the distribution, making the output more predictable and closer to the most likely token at each step. With temperature = 1.0 the logits are unchanged, so sampling follows the model's learned probabilities (restricted here to the 10 most likely tokens, because the sampling functions pass `top_k=10`).
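The effect is easy to see on a toy example. The sketch below is plain Python rather than the PyTorch code in `generate`, and the logits are made-up scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                        # subtract the max to avoid overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top-scoring token is highest at temperature 0.5 and lowest at 2.0, while the ranking of the tokens never changes.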

**What does [line 148](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L148) in model.py do? How does this relate to the transformer model?**  

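`h = nn.ModuleList([Block(config) for _ in range(config.n_layer)])`. This line creates `n_layer` instances of the `Block` class and stores them in an `nn.ModuleList`, which registers their parameters with the model. Each `Block` instance is one transformer block, and the forward pass applies the blocks one after another. This line therefore defines the depth (the multi-layer structure) of the transformer.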

## Word Model
Now, let's fit a word model instead of a character model.

Given a string like:

> The cow     jumped over the moon. The moon is full tonight!

The `WordDataset` class below should create tokens for each space-delimited string:

> ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.', 'The', 'moon', 'is', 'full', 'tonight', '!']

Note that multiple space characters are treated as one (Hint: `re` may help here.)

Using the `CharDataset.parse_data` method above as an example, complete the `parse_data` method below to set the `stoi`, `itos`, `vocab_size`, and `data` attributes of the `WordDataset` class.

In [9]:
import re

class WordDataset(CharDataset):
  def parse_data(self, data):
    """
    data.....A single string representing many sentences.
    """
    # Use regex to split words and punctuation
    tokens = re.findall(r'\w+|[^\w\s]', data)

    # Collect the sorted unique tokens (the vocabulary)
    unique_tokens = sorted(set(tokens))
    vocab_size = len(unique_tokens)
    print('Data has %d tokens, %d unique.' % (len(tokens), vocab_size))

    # Create mappings from words to indices and indices to words
    self.stoi = {token: i for i, token in enumerate(unique_tokens)}
    self.itos = {i: token for i, token in enumerate(unique_tokens)}
    self.vocab_size = vocab_size
    self.data = tokens

word_config = configure_model(max_iters=200, block_size=4)
word_data = WordDataset(word_config.data, 'The cow jumped over the moon. The moon is full tonight!')
word_data.data

command line overwriting config attribute trainer.max_iters with 200
command line overwriting config attribute data.block_size with 4
command line overwriting config attribute model.block_size with 4
Data has 13 tokens, 11 unique.


['The',
 'cow',
 'jumped',
 'over',
 'the',
 'moon',
 '.',
 'The',
 'moon',
 'is',
 'full',
 'tonight',
 '!']

In [10]:
# we can now reuse the training code to fit the word language model.
def sample_from_word_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ' '.join([train_dataset.itos[int(i)] for i in y])

word_model, word_trainer = train_model(word_config, word_data, sample_from_word_model)

number of parameters: 0.80M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 2.42279
sample from the model:
! jumped The moon moon is full jumped moon tonight moon full tonight ! tonight full The is is moon is tonight The moon . The moon is The is full full full is full tonight ! . . full . full is moon moon The moon moon is jumped is The ! The tonight full is . moon is full is ! The tonight . moon moon tonight full The the moon The . moon jumped ! over moon the over moon moon The moon is ! is ! moon moon tonight . . the moon . full is .
saving model
iter_dt 13.65ms; iter 10: train loss 0.63155
iter_dt 14.08ms; iter 20: train loss 0.39190
iter_dt 13.19ms; iter 30: train loss 0.24487
iter_dt 14.01ms; iter 40: train loss 0.16866
iter_dt 14.68ms; iter 50: train loss 0.15440
iter_dt 13.45ms; iter 60: train loss 0.12987
iter_dt 13.91ms; iter 70: train loss 0.10922
iter_dt 13.65ms; iter 80: train loss 0.07356
iter_dt 13.22ms; iter 90: train loss 0.07683
iter_dt 13.25ms; iter 100: tr

In [11]:
sample_from_word_model(["The"], word_model, word_trainer, word_data, maxlen=50, temperature=1.)

'The moon is full tonight ! ! ! full tonight ! ! ! jumped over the moon . The moon is full tonight ! ! ! ! ! full tonight ! ! ! . The moon is full tonight ! ! ! full tonight ! ! ! full tonight ! !'

### Wikipedia

With our word model, let's now fit a language model on the Wikipedia page for [New Orleans](https://en.wikipedia.org/wiki/New_Orleans).

First, we'll install a library to help us fetch the plain text of a Wikipedia page.

In [12]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11680 sha256=cf9641e84cd1ea2360cfb7a68d25bd4f8253085590330845eeab8c9fb18e4820
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [19]:
import wikipedia
wikipedia.set_lang('en')
page = wikipedia.page('New Orleans')
print(page.content[:100])

New Orleans (commonly known as NOLA or the Big Easy among other nicknames) is a consolidated city-pa


**Create new variables `word_config`, `word_data`, `word_model`, `word_trainer` that are analogous to the ones used previously. These should fit a model to the `page` text defined in the previous cell.**

In [32]:
def get_config():

    C = CN()

    # system
    C.system = CN()
    C.system.seed = 3407
    C.system.work_dir = './out'

    # data
    C.data = CharDataset.get_default_config()

    # model
    C.model = GPT.get_default_config()
    C.model.model_type = 'gpt-micro'
    #C.model.n_layer = None
    #C.model.n_embd =  None

    # trainer
    C.trainer = Trainer.get_default_config()
    C.trainer.learning_rate = 5e-4 # the model we're using is so small that we can go a bit faster

    return C

def configure_model(max_iters=100, block_size=128):
    config = get_config()
    config.merge_from_args(['--trainer.max_iters=%d' % max_iters,
                            '--data.block_size=%d' % block_size,
                            '--model.block_size=%d' % block_size])
    setup_logging(config)
    set_seed(config.system.seed)
    return config

word_config = configure_model(max_iters=100, block_size=64)
word_data = WordDataset(word_config.data, page.content)
word_model, word_trainer = train_model(word_config, word_data, sample_from_word_model)


command line overwriting config attribute trainer.max_iters with 100
command line overwriting config attribute data.block_size with 64
command line overwriting config attribute model.block_size with 64
Data has 20377 tokens, 4270 unique.
number of parameters: 1.35M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 8.38579
sample from the model:
" best escaped recorded Archive the expand many break COVID resumption , the many 120 , being many 120 the Bouligny of resumption Upper while Improvising of Exchange categories Democratic gender the the percent of Nagin Improvising turned categories railroads many the established for , 120 office the the of being , 151 It 120 for the southern , order peace libres in the Loyola 6 says 6 expand 6 1792 escaped 1792 1731 housing 120 many 120 Army Appalachian of Luther break , break says decay 120 Archive Navtech of part bypass it housing factors jobs jobs doughnuts bypass of
saving model
iter_dt 25.98ms; iter 10: train loss 7.58332
iter_dt 2

In [28]:
sample_from_word_model(["A", "local", "variant", "for", "hip", "hop", "is"], word_model, word_trainer, word_data, maxlen=200, temperature=1.)

'A local variant for hip hop is been a to the most of the city and New Orleans from the most of the most of the French Quarter to a significant to the city from of a with the city \' s city , the city in New Orleans was the U . In the city to the United States . The the city . = = = = = = = = In the French , and the city was a " . New Orleans . S . According to the French and a to the city \' s largest to the Mississippi , 000 in the world - Katrina , and its s city is Louisiana in the United States . The city in the Mississippi River , the French in the United States ) . = In other of the city \' s city , which . The city \' s , the city as the nation \' Orleans \' s United States . = = = The city was the city \' s first in the city of the world . S . A " , and the city in a term in the first in the city \' s population of New'

Investigate different model settings (`block_size, max_iters, learning_rate, n_embd, n_layer`).

**What effect do you notice from trying different values? Which setting appears to generate the best generated text?**


The runs below keep the default model type (`gpt-micro`), so n_embd and n_layer stay at their default values.


With block_size=64, max_iters=100, learning_rate=5e-4, the final training loss is 5.21.

With block_size=64, max_iters=200, learning_rate=5e-4, the final training loss is 4.03. A sample of the generated text: "A local variant for hip hop is , and African American , New Orleans had been also known . The city had a study by the city of over it East . The New Orleans is the first - largest city ' s population . S . The first to this area is home to the Mississippi River and its - Katrina in the city ' s population . S . In 2010 , and other cities in Louisiana is the most of New Orleans ' s murder rate of New Orleans was the city had been Orleans is the first - Katrina was also ) , with a study by the city . A population . The city and African American . In January that had been Orleans had the National Orleans ' s office , and its own , and Latino Americans , the French Quarter ( all of the National Historical , 000 cities in New Orleans is a historic peak of Louisiana , the New Orleans was in the United States , New Orleans ' major bridge - Katrina - Catholic York , and other parishes . After the nation ' s largest population . = = = = = = = =" This is much better than the first run. Increasing the number of training iterations improves the output.


With block_size=64*4, max_iters=100, learning_rate=5e-4, the final training loss is 4.81265, a modest improvement over the 5.21 obtained with block_size=64.

Changing the learning_rate did not seem to improve performance. Settings with a larger block_size and more iterations (max_iters) yield better performance.

As for n_embd and n_layer, increasing them seems to improve performance.
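Comparisons like these are easier to keep track of when the settings are enumerated systematically. The following is a minimal sketch that only builds the argument lists. The search space values are placeholders, the helper names are made up, and it assumes `merge_from_args` accepts `--trainer.learning_rate=...` in the same `--key=value` style that `configure_model` already uses for `max_iters` and `block_size`.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
search_space = {
    'trainer.max_iters': [100, 200],
    'trainer.learning_rate': [5e-4, 1e-3],
    'data.block_size': [64, 256],
}

def sweep(space):
    """Yield one dict of settings per combination in the search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def to_args(setting):
    """Format a setting as '--key=value' strings, as configure_model does."""
    args = ['--%s=%s' % (k, v) for k, v in setting.items()]
    # keep the model's context length in sync with the dataset's
    args.append('--model.block_size=%s' % setting['data.block_size'])
    return args

runs = list(sweep(search_space))
print(len(runs))           # 8 combinations
print(to_args(runs[0]))
```

Each argument list could then be merged into a fresh config before calling `train_model`, recording the final training loss per run. Varying n_embd or n_layer probably needs more than an extra flag: as far as I can tell, minGPT expects either `model_type` or the explicit sizes (`n_layer`, `n_head`, `n_embd`) to be set, not both, so `model_type` would have to be cleared first.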


Suppose you wanted to take the word model trained on the New Orleans Wikipedia page and use supervised fine-tuning to create a chatbot that answers questions about New Orleans.

**What type of additional training data would you need to do this?**

Provide example data below.

To fine-tune a word model trained on the New Orleans Wikipedia page into a chatbot capable of answering questions about New Orleans, we would need a dataset comprising pairs of questions and answers related to the topic. The questions would simulate potential queries a user might pose about New Orleans, and the answers would provide the appropriate responses based on factual information. For example:

Q: What is New Orleans known for?

A: New Orleans is renowned for its distinctive music, Creole cuisine, unique dialects, and its annual celebrations and festivals, most notably Mardi Gras.
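One possible way to store such pairs and turn them into training text is sketched below. The dictionary keys, the `Q:`/`A:` template, and the `<END>` marker are all assumptions chosen for illustration. The single pair is the example above, and a real dataset would need many more.

```python
# Hypothetical storage format for question/answer pairs.
qa_pairs = [
    {
        'question': 'What is New Orleans known for?',
        'answer': ('New Orleans is renowned for its distinctive music, Creole cuisine, '
                   'unique dialects, and its annual celebrations and festivals, '
                   'most notably Mardi Gras.'),
    },
]

def format_example(pair, end_marker='<END>'):
    """Render one pair with an assumed 'Q: ... A: ...' template and an end marker."""
    return 'Q: %s\nA: %s %s' % (pair['question'], pair['answer'], end_marker)

training_text = '\n'.join(format_example(p) for p in qa_pairs)
print(training_text)
```

The end marker gives the model a signal for when an answer is finished. For that to work, the tokenizer would have to keep the marker as a single token, which the regex in `WordDataset` does not do as written.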



**If this new data contains words that don't appear in the New Orleans Wikipedia page, what will happen? How can you fix this?**

If the new data contains words that do not appear in the New Orleans Wikipedia page, those words are missing from the vocabulary (the `stoi` mapping) built during the initial training. This is the out-of-vocabulary, or unknown token, problem. With the dataset code above, the lookup `self.stoi[s]` raises a `KeyError` for such a word; more generally, the model has no embedding or output logit for it. To fix this, we can extend the vocabulary and use pre-trained word embeddings that cover a broader vocabulary. These embeddings can be fine-tuned along with the model or used as a fixed input layer. Simpler fallbacks are to map unseen words to a reserved unknown token, or to tokenize into characters or subwords so that every new word can be built from known pieces.
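The unknown-token fallback and the vocabulary extension can be sketched in plain Python. The `<unk>` token and the helper names are hypothetical and not part of minGPT; the tokenizer regex is the one used in `WordDataset`.

```python
import re

UNK = '<unk>'  # hypothetical placeholder token, not something minGPT defines

def tokenize(text):
    """Same splitting rule as WordDataset: words, or single punctuation marks."""
    return re.findall(r'\w+|[^\w\s]', text)

def build_vocab(tokens):
    """Map each unique token to an id, reserving id 0 for unknown words."""
    vocab = [UNK] + sorted(set(tokens))
    return {tok: i for i, tok in enumerate(vocab)}

def encode(text, stoi):
    """Look tokens up with a fallback instead of raising KeyError."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokenize(text)]

def extend_vocab(stoi, new_tokens):
    """Append unseen tokens at the end so existing ids stay valid."""
    for tok in new_tokens:
        if tok not in stoi:
            stoi[tok] = len(stoi)
    return stoi

stoi = build_vocab(tokenize('The moon is full tonight!'))
print(encode('The sun is full', stoi))   # [2, 0, 4, 3]: 'sun' maps to <unk>
extend_vocab(stoi, tokenize('The sun'))
print(encode('The sun is full', stoi))   # [2, 7, 4, 3]: 'sun' has its own id
```

Appending new ids at the end keeps the existing embedding rows aligned with their tokens. The model's token embedding table and output layer would still need matching new rows, initialized randomly or from pre-trained embeddings, before fine-tuning.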