# Large Language Models

Author: Julie Butler

Date Created: November 25, 2025

Last Modified: December 1, 2025

In this assignment we will attempt to create a (very small) large language model (LLM) that can generate text similar to Shakespeare. The data set comes from Andrej Karpathy, who used it in his blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/). This blog post is included in Ilya Sutskever's Top 30 Machine Learning Papers, a list said to cover 90% of the current field of machine learning and a good primer for advanced studies in modern machine learning and artificial intelligence. Summaries of the papers can be found [here](https://aman.ai/primers/ai/top-30-papers/), though I suggest reading the papers themselves if you plan on pursuing this field (anyone want to join me in reading them over Christmas break?). Andrej Karpathy also has an excellent [YouTube channel](https://www.youtube.com/@AndrejKarpathy) with some very good (and very long) videos on building rather complex LLMs from scratch.

In [1]:
#############
## IMPORTS ##
#############
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np



In [2]:
#####################
## IMPORT THE DATA ##
#####################
ds, info = tfds.load("tiny_shakespeare", split="train", with_info=True)
text = next(iter(ds.take(1).as_numpy_iterator()))["text"].decode()

print(info)
print()

# Print a sample and the length of the data set.
print(text[0:100])
print()
print(len(text))

# Truncate the data set to make for faster training. You can change
# or remove this to get better performance.
truncate = 50000
text = text[:truncate]

tfds.core.DatasetInfo(
    name='tiny_shakespeare',
    full_name='tiny_shakespeare/1.0.0',
    description="""
    40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in
    Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural
    Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
    
    To use for e.g. character modelling:
    
    ```
    d = tfds.load(name='tiny_shakespeare')['train']
    d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
    # train split includes vocabulary for other splits
    vocabulary = sorted(set(next(iter(d)).numpy()))
    d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
    d = d.unbatch()
    seq_len = 100
    batch_size = 2
    d = d.batch(seq_len)
    d = d.batch(batch_size)
    ```
    """,
    homepage='https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt',
    data_dir='/Users/butlerju/tensorflow_datasets/tiny_shakespeare/1.0.0

In [3]:
##################
## TOKENIZATION ##
##################
seq_length = 128

tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=8000,
    output_sequence_length=None,  # no padding or truncation: keep the full token sequence
    standardize="lower_and_strip_punctuation",
)
tokenizer.adapt([text])

vocab = tokenizer.get_vocabulary()
vocab_size = len(vocab)

full_tokens = tokenizer(tf.constant([text]))
full_tokens = tf.squeeze(full_tokens, axis=0)  
full_tokens = tf.cast(full_tokens, tf.int32)

# Print a sample of the vocab list
print(vocab[:50])

['', '[UNK]', 'the', 'and', 'to', 'you', 'i', 'of', 'a', 'in', 'that', 'he', 'marcius', 'not', 'for', 'your', 'him', 'with', 'my', 'it', 'is', 'have', 'as', 'they', 'we', 'be', 'his', 'are', 'their', 'our', 'first', 'but', 'menenius', 'me', 'all', 'what', 'good', 'shall', 'this', 'will', 'than', 'if', 'o', 'no', 'well', 'cominius', 'at', 'us', 'them', 'so']


In [4]:
#########################
## FORMAT THE DATA SET ##
#########################
def make_dataset(tokens, seq_len):
    # Manually doing time series formatting, could change this to use
    # TimeSeriesGenerator.
    N = len(tokens)
    X = []
    Y = []
    for i in range(N - seq_len - 1):
        X.append(tokens[i : i + seq_len])
        Y.append(tokens[i + 1 : i + seq_len + 1])
    X = tf.stack(X)
    Y = tf.stack(Y)
    return tf.data.Dataset.from_tensor_slices((X, Y))

dataset = make_dataset(full_tokens, seq_length)
dataset = dataset.shuffle(2000).batch(32).prefetch(tf.data.AUTOTUNE)
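
To make the windowing concrete, here is a minimal sketch that applies the same loop as `make_dataset` to a short list of made-up token ids. The helper name `make_windows` and the toy values are assumptions for illustration only; the real code works on tensors.

```python
# Toy illustration of the input/target windows built by make_dataset.
# The token ids below are made up for the example.
tokens = [10, 11, 12, 13, 14, 15]
seq_len = 3


def make_windows(tokens, seq_len):
    """Return (inputs, targets); each target is its input shifted by one token."""
    inputs, targets = [], []
    # Same loop bounds as make_dataset above.
    for i in range(len(tokens) - seq_len - 1):
        inputs.append(tokens[i : i + seq_len])
        targets.append(tokens[i + 1 : i + seq_len + 1])
    return inputs, targets


inputs, targets = make_windows(tokens, seq_len)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each target window is the input window moved one position to the right, so at every position the model is asked for the token that follows.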

In [5]:
##########################
## TRANSFORMER FUNCTION ##
##########################
def transformer(x, embed_dim, num_heads, ff_dim):
    # Attention Network
    attn_out = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads,
        key_dim=embed_dim // num_heads
    )(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn_out)

    # Feedforward Neural Network
    ff = tf.keras.Sequential([
        tf.keras.layers.Dense(ff_dim, activation="relu"),
        tf.keras.layers.Dense(embed_dim),
    ])
    x = tf.keras.layers.LayerNormalization()(x + ff(x))
    return x
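
The attention call above is given no mask, so every position can attend to every other position, including later ones. The sketch below shows in plain NumPy what a causal (masked) version would change; the function names `causal_mask` and `masked_attention_weights` are hypothetical, and the random scores are only a stand-in for query-key dot products.

```python
import numpy as np


def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask: True where a query may attend to a key."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out positions get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for query-key dot products
weights = masked_attention_weights(scores, causal_mask(4))
print(np.round(weights, 2))
```

Row `i` of `weights` is nonzero only in columns `0..i`, so a position never looks at tokens that come after it. Recent Keras versions accept `use_causal_mask=True` when calling `MultiHeadAttention`; check the version you have installed before relying on it.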

In [14]:
###########################################################
## BUILD LARGE LANGUAGE MODEL (I.E. LINKED TRANSFORMERS) ##
###########################################################
# Parameters
embed_dim = 36
num_heads = 2
ff_dim = 64

inputs = tf.keras.Input(shape=(seq_length,), dtype=tf.int32)

x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)

x = transformer(x, embed_dim, num_heads, ff_dim)

x = transformer(x, embed_dim, num_heads, ff_dim)

outputs = tf.keras.layers.Dense(vocab_size)(x)

model = tf.keras.Model(inputs, outputs)
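# The final Dense layer has no softmax, so the model outputs logits;
# from_logits=True tells the loss to apply the softmax itself.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)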

model.summary()
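
The last `Dense` layer has no activation, so the model emits raw scores (logits) rather than probabilities. As a rough sketch of what a sparse cross-entropy computed on logits does, the NumPy code below applies a log-softmax and averages the negative log-probability of the integer targets. The function names and toy shapes are assumptions for illustration, not the Keras implementation.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def sparse_cross_entropy_from_logits(logits, targets):
    """Mean negative log-probability of integer targets given raw scores."""
    log_probs = log_softmax(logits)
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(-picked.mean())


vocab_size = 5
logits = np.zeros((2, 3, vocab_size))  # (batch, seq_len, vocab): a uniform guess
targets = np.array([[0, 1, 2], [3, 4, 0]])
loss_uniform = sparse_cross_entropy_from_logits(logits, targets)
print(loss_uniform, np.log(vocab_size))
```

A model that guesses uniformly scores `ln(vocab_size)`, which is a handy baseline when reading the training loss.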

In [15]:
#######################
## GENERATE NEW TEXT ##
#######################
def generate(model, start_text, max_tokens=50):
    text_so_far = start_text

    for _ in range(max_tokens):
        # Tokenize the current text
        token_ids = tokenizer([text_so_far]) 
        token_ids = tf.squeeze(token_ids, axis=0).numpy().tolist() 
        # Keep only the last seq_length tokens
        token_ids = token_ids[-seq_length:]
        # LEFT-PAD with zeros if shorter than seq_length
        # Can replace this with pad_sequences if you want.
        if len(token_ids) < seq_length:
            pad_len = seq_length - len(token_ids)
            token_ids = [0] * pad_len + token_ids
        token_tensor = tf.constant([token_ids], dtype=tf.int32)
        # Predict next token
        preds = model.predict(token_tensor, verbose=0)
        next_id = int(tf.argmax(preds[0, -1]).numpy())
        # Convert id → word
        next_word = vocab[next_id]
        text_so_far += " " + next_word

    return text_so_far


In [None]:
###########
## TRAIN ##
###########
model.fit(dataset, epochs=3)



Epoch 1/3
276/276 ━━━━━━━━━━━━━━━━━━━━ 16s 56ms/step - loss: 7.6335
Epoch 2/3
201/276 ━━━━━━━━━━━━━━━━━━━━ 4s 57ms/step - loss: 7.6332

In [None]:
#############
## EXAMPLE ##
#############
print(generate(model, "To be, or not to be"))

## Assignment

### Part 1: Understanding the Code
1. Go through the code and provide comments as to what is happening.

Answer the following questions thoroughly, including adding citations to any information learned outside of course materials.

2. What does the tokenizer do? What is its vocabulary?
3. How does the code create input/target sequences for a language model?
4. Describe the self-attention mechanism in the model. Why does it take (x, x) as arguments?
5. Explain the purpose of:
    * Embedding layer
    * Multi-head attention
    * Residual connections
    * LayerNormalization
    * Feedforward block
6. Why does the model use SparseCategoricalCrossentropy?
7. How does the generation function work? What is greedy decoding?

### Part 2: Modify the Code
1. Attempt to make the model better by adjusting the following parameters: `embed_dim`, `num_heads`, `ff_dim`, `output_sequence_length`, `seq_length`, and `epochs`. You can also change the truncation of the training data, activation functions, and any other parameters you like. Comment on how changing these parameters changes the model's performance. Note: I advise against drastically increasing the parameters without testing first, as run times can easily skyrocket.
2. The above LLM only has two transformer blocks. Increase this to three and comment on the change in performance.
3. Increasing the above parameters and architecture still results in a simple LLM. Make two or more of the changes listed below, which will make the model more complex and hopefully better able to generate text. Explain what each change does to the model in a comment or markdown cell.
    * Change the tokenization strategy from the word level to the sub-word or character level. Note the tiny-shakespeare dataset is usually tokenized at the character level.
    * Add positional embeddings (see last week's assignment).
    * Change the attention to causal (masked).
    * Add an adaptive learning rate or a learning rate scheduler.
    * Add dropout.
    * Add temperature, top-k, or top-p sampling.
    * Change the decoding strategy (related to the sampling above).
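
As one possible starting point for the sampling options in the list above, here is a minimal NumPy sketch of temperature and top-k sampling over a vector of logits. The function name `sample_next_id` and the toy logits are hypothetical. In `generate`, something like this could stand in for the `tf.argmax` line, assuming `preds[0, -1]` holds logits.

```python
import numpy as np


def sample_next_id(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from a 1-D array of logits (temperature must be > 0)."""
    if rng is None:
        rng = np.random.default_rng()
    # Lower temperature sharpens the distribution, higher flattens it.
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # Keep the top_k largest logits (ties may keep a few more); mask the rest.
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth_largest, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up scores for a 4-word vocabulary
rng = np.random.default_rng(0)
samples = [sample_next_id(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
print(samples)
```

With `top_k=1` this reduces to greedy decoding, since only the highest-scoring token keeps a nonzero probability.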
    
### Part 3: Text Generation
What combination of changes made in Part 2 gave you the best output (i.e. that sounds like it could be Shakespeare)? What was your favorite output?
    