<h1>CS4619: Artificial Intelligence II</h1>
<h1>Language Models, Generative AI &amp; Sequence Modeling</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
import tensorflow as tf
from tensorflow.keras.utils import get_file
from tensorflow.keras.saving import load_model
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras import Input
from tensorflow.keras import Model
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import GRU
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import Callback



<h1>Acknowledgement</h1>
<ul>
     <li>The code is based closely on code from: 
        A. G&eacute;ron: 
        <i>Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd edn)</i>, O'Reilly, 2022
    </li>
</ul>

<h1>Warning</h1>
<ul>
    <li>The code takes a very long time to run.
    </li>
    <li>It is not important to understand this code in any case.</li>
</ul>

<h1>Language Models</h1>
<ul>
    <li>A <b>language model</b> for a given natural language, such as English, estimates the probability of
        each possible string of words, e.g.
        <ul>
            <li>P("The dog chased the cat") = 0.000002</li>
            <li>P("The cat chased the dog") = 0.0000002</li>
            <li>P("The the chased cat dog") = 0.000000000000001</li>
        </ul>
    </li>
    <li>If we have a <b>character-level language model</b>, then we can predict the most-likely next character.
        <ul>
            <li>E.g. P("h" | "The dog chased t") = 0.9, P("w" | "The dog chased t") = 0.05, P("x" | "The dog chased t") = 0.00001</li>
        </ul>
    </li>
    <li>If we have a <b>word-level language model</b>, then we can predict the most-likely next word.
        <ul>
            <li>E.g. P("the" | "The dog chased") = 0.07, P("a" | "The dog chased") = 0.0689, P("walked" | "The dog chased") = 0.0000004</li>
        </ul>
    </li>
</ul>

<h2>Learning a language model</h2>
<ul>
    <li>If we have lots of text, we can learn a language model.</li>
    <li>A simple-minded approach (using a word-level language model by way of example):
        <ul>
            <li>For each word, count next-word frequencies in the training examples.</li>
            <li>E.g. in the training examples, "the" is followed by "dog" 20 times, by "cat" 15 times, "kangaroo" once, and so on.
            </li>
            <li>From these, we can calculate the probabilities.</li>
        </ul>
        What is the weakness of this?
    </li>
    <li>So, instead, AI researchers use neural networks.</li>
    <li>We'll illustrate with a character-level language model.
        <ul>
            <li>An advantage of character-level models is we have a small number of next possible characters.</li>
            <li>For word-level models, on the other hand, we have to decide on a vocabulary and how to handle words that fall outside the vocabulary.
            </li>
        </ul>
    </li>
</ul>
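
<ul>
    <li>The simple-minded counting approach can be sketched in a few lines of plain Python. This is only an
        illustration: the function names and the toy corpus are made up for this example, and it conditions on
        just the one previous word.
    </li>
</ul>

```python
from collections import Counter, defaultdict

def train_bigram_model(words):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def next_word_probability(counts, current_word, next_word):
    """Estimate P(next_word | current_word) as a relative frequency."""
    total = sum(counts[current_word].values())
    if total == 0:
        return 0.0  # current_word was never seen with a following word
    return counts[current_word][next_word] / total

# A toy corpus (made up for illustration)
corpus = "the dog chased the cat and the dog chased the kangaroo".split()
bigram_counts = train_bigram_model(corpus)

# "the" occurs four times with a following word; two of those are "dog"
print(next_word_probability(bigram_counts, "the", "dog"))
# A pair that never occurs in the corpus gets probability zero
print(next_word_probability(bigram_counts, "the", "walrus"))
```

<ul>
    <li>Note what happens with the second call: any pair of words that does not appear in the training text is
        given a probability of exactly zero.
    </li>
</ul>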

<h2>A Character-Level Language Model using an RNN</h2>
<ul>
    <li>Everyone does this on Shakespeare &mdash; perhaps because if it outputs bad
        Shakespeare some people still think it sounds like Shakespeare!
    </li>  
</ul>

<h3>Preprocessing the training data</h3>
<ul>
    <li>Most of the effort goes into preprocessing the dataset. Don't get bogged down in the details of this code.</li>
    <li>We're encoding each character as an integer id; the network's embedding layer will turn these ids into vectors.</li>
    <li>We're making overlapping windows, shuffling these, and putting them into batches.</li>
</ul>

In [4]:
shakespeare_url = "https://homl.info/shakespeare"
filepath = get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [5]:
# How much Shakespeare are we working with? How many characters?
len(shakespeare_text)

1115394

In [6]:
# Show you the first part of it
shakespeare_text[:148]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n'

In [7]:
# Show you all its distinct characters
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [8]:
# Create and fit a character-level (rather than word-level) tokenizer
# In effect, it lowercases and assigns ids to the 39 distinct characters from 2 to 40 inc, e.g. ' ' is 2, 'e' is 3, etc.
# Ids 0 and 1 are reserved for padding and UNK (unknown). We don't want these so we subtract 2. Now, e.g. ' ' is 0, 'e' is 1, etc.
vectorization_layer = TextVectorization(split="character", standardize="lower")
vectorization_layer.adapt([shakespeare_text])
encoded = vectorization_layer([shakespeare_text])[0]
encoded -= 2

In [9]:
# Show you an encoding
vectorization_layer("speak")

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([ 9, 24,  3,  6, 26])>
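
<ul>
    <li>To see roughly what the tokenizer is doing, here is a plain-Python sketch of a character-level vocabulary
        in which the most frequent characters get the smallest ids and the first two ids are held back for special tokens.
        It is not the Keras implementation: the function names are made up, and breaking ties alphabetically is an
        assumption made just for this sketch.
    </li>
</ul>

```python
from collections import Counter

def build_char_vocabulary(text):
    """Return the distinct lowercased characters, most frequent first."""
    counts = Counter(text.lower())
    # Ties are broken alphabetically: an assumption made for this sketch
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

def encode(text, vocabulary, n_reserved=2):
    """Map each character to its id; ids below n_reserved are kept for special tokens."""
    char_to_id = {ch: i + n_reserved for i, ch in enumerate(vocabulary)}
    # A character that is not in the vocabulary raises KeyError in this sketch
    return [char_to_id[ch] for ch in text.lower()]

sample = "To be or not to be"
vocabulary = build_char_vocabulary(sample)
ids = encode("to be", vocabulary)

print(vocabulary)            # the space is the most frequent character, so it comes first
print(ids)                   # ids start at 2
print([i - 2 for i in ids])  # after subtracting 2, ids start at 0
```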

In [10]:
max_tokens = vectorization_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of characters = 1,115,394

In [11]:
# This function helps us create training, validation and test sets.

def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    # window() splits the data into smaller windows of text
    # Using shift=1 means the first window is, e.g., characters 0 to 100, the second is characters 1 to 101, etc.
    # Using drop_remainder=True means all windows are 101 characters long without needing us to pad the last ones (they are dropped)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    # But window() produces a nested dataset: a dataset containing windows (each of which is a dataset) so we flatten it
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    # Shuffle the windows and put into batches
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    # Separate the inputs (the first 100 characters of each window) from the targets (the last 100 characters, i.e. the inputs shifted by one)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [12]:
max_length = 100

train_set = to_dataset(encoded[:1_000_000], length=max_length, shuffle=True)
val_set = to_dataset(encoded[1_000_000:1_060_000], length=max_length)
test_set = to_dataset(encoded[1_060_000:], length=max_length)
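
<ul>
    <li>The windowing is easier to follow on a tiny example. The sketch below mimics, in plain Python, what
        <code>to_dataset</code> does to produce inputs and targets; it leaves out the shuffling and batching. The
        function name <code>make_windows</code> is made up for this illustration.
    </li>
</ul>

```python
def make_windows(sequence, length):
    """Return (inputs, targets) pairs from overlapping windows of length + 1 items."""
    pairs = []
    # Each window starts one position after the previous one (like shift=1);
    # stopping early means every window is full (like drop_remainder=True)
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        inputs = window[:-1]   # everything except the last item
        targets = window[1:]   # the same items shifted by one position
        pairs.append((inputs, targets))
    return pairs

for inputs, targets in make_windows("to be or", length=4):
    print(repr(inputs), "->", repr(targets))
```

<ul>
    <li>At every position, the target is the character that comes next in the text.</li>
</ul>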

<h3>The RNN</h3>
<ul>
    <li>We'll embed the tokenized input.</li>
    <li>Then we'll use a single GRU layer.</li>
    <li>The output layer has <code>max_tokens</code> neurons, because we're predicting that number of
        distinct characters, i.e. we have <code>max_tokens</code> classes.
    </li>
</ul>

In [13]:
inputs = Input(shape=(None,))  # each example is a sequence of token ids, of any length
x = Embedding(input_dim=max_tokens, output_dim=16)(inputs)
x = GRU(128, activation="tanh", return_sequences=True)(x)
outputs = Dense(max_tokens, activation="softmax")(x)
char_language_model = Model(inputs, outputs)

char_language_model.compile(optimizer=RMSprop(learning_rate=0.0001), loss="sparse_categorical_crossentropy", metrics=["accuracy"]) 

In [None]:
# Either run this, which takes a very, very long time
history = char_language_model.fit(train_set, validation_data=val_set, epochs=10)

In [14]:
# Or simply run this to load one that I saved for you previously
char_language_model = load_model("models/char_lm.keras")

<h3>Self-supervised learning</h3>
<ul>
    <li>Hold on! We are doing supervised learning. But our dataset has no labels. It is simply a lot of
        text.
    </li>
    <li>So, what are we using as labels?</li>
</ul>

<h3>Predictions using the language model</h3>
<ul>
    <li>Given some text (the <b>prompt</b>), we can use the model to predict the next character.
    </li>
</ul>

In [72]:
# A function to preprocess the text whose next character we will predict (tokenize and subtract 2) and make a prediction
def predict_next_char(text):
    x = np.array(vectorization_layer([text])) - 2
    probabilities = char_language_model.predict(x, verbose=0)[0, -1]
    predicted_token = tf.argmax(probabilities)
    predicted_char = vectorization_layer.get_vocabulary()[predicted_token + 2]
    return predicted_char

In [73]:
predict_next_char("To be or not to b")

'e'

In [74]:
predict_next_char("How are yo")

'u'

<h2>Word-Level Language Models</h2>
<ul>
    <li>The ideas are similar but the network predicts words instead of characters.</li>
    <li>However, there are a couple of subtleties.</li>
    <li>They are presented briefly below, but do not worry about them (e.g. do not 'learn off').</li>
</ul>

<h3>Sampled softmax</h3>
<ul>
    <li>The output layer of the character-level language model has one neuron per possible <em>character</em>; 
        see <code>Dense(max_tokens, ...)</code> above. E.g. if there are 39 possible characters, then there
        are 39 neurons in this layer. It outputs 39 probabilities.
    </li>
    <li>The output layer of a word-level language model has one neuron per <em>word</em> in our vocabulary: tens or 
        hundreds of thousands of neurons, and so tens or hundreds of thousands of probabilities. The softmax activation function has to sum
        over the outputs of all the neurons. This is OK if there are a few dozen (character-level language
        model) but it is expensive if there are tens of thousands or more (word-level model).
    </li>
    <li>One solution that helps speed up training is called sampled softmax.
        Without going into the details, in sampled softmax, the loss is estimated from a <em>sample</em>
        of the outputs, instead of all of them.
    </li>
</ul>
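
<ul>
    <li>Here is a rough NumPy sketch of the idea, comparing the loss computed over all the outputs with a loss
        estimated from the target word plus a random sample of the other words. The function names are made up, the
        logits are random numbers, and the other words are sampled uniformly, which is an assumption made to keep the
        sketch short; practical implementations usually sample non-uniformly and correct for that.
    </li>
</ul>

```python
import numpy as np

def full_softmax_loss(logits, target):
    """Cross-entropy loss using the softmax over every output neuron."""
    shifted = logits - logits.max()                      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log of the softmax
    return -log_probs[target]

def sampled_softmax_loss(logits, target, n_sampled, rng):
    """Estimate the loss from the target plus n_sampled uniformly sampled other classes."""
    others = np.delete(np.arange(len(logits)), target)
    negatives = rng.choice(others, size=n_sampled, replace=False)
    subset = np.concatenate(([target], negatives))
    # The target sits at position 0 of the much smaller subset
    return full_softmax_loss(logits[subset], 0)

rng = np.random.default_rng(0)
vocabulary_size = 50_000
logits = rng.normal(size=vocabulary_size)  # made-up outputs for one prediction
target = 123                               # made-up id of the correct next word

print("Loss over all outputs:     ", full_softmax_loss(logits, target))
print("Loss from a sample of 100: ", sampled_softmax_loss(logits, target, 100, rng))
```

<ul>
    <li>The sampled version sums over 101 outputs instead of 50,000. It is only an estimate of the full loss, which is why
        it is used to speed up training rather than to produce the final probabilities.
    </li>
</ul>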

<h3>Transformer decoders for word-level language models</h3>
<ul>
    <li>It's fair to say that most word-level language models these days are <b>not</b> built using RNNs.</li>
    <li>Instead, we use transformers.</li>
    <li>As we know, an RNN takes in one input in each time step (e.g. it takes in one word at a time). Transformers, on the other
        hand, receive the entire input sequence in one go. (It arrives as a matrix with one word encoding per row, and this is
        why we needed positional encoding.) This allows what is sometimes referred to as <b>bidirectional</b> understanding: its
        understanding of a word depends on words that come before it, and words that come after it. (By the way, the word bidirectional 
        seems unfortunate to me, because the word already had a similar but slightly different meaning for RNNs.)</li>
    <li>However, to build a language model similar to the one above, bidirectional understanding is not what we want.</li>
    <li>This leads us to distinguish transformer encoders and decoders.
        <ul>
            <li>For text classification tasks (such as the movie review sentiment analysis we were doing in the previous few lectures),
                we want a transformer encoder, with bidirectional understanding.
            </li>
            <li>For autoregressive tasks (predicting the next item from the previous ones), it would be cheating to work bidirectionally. 
                A transformer decoder is limited to making its predictions based on only the previous items. In a word-level 
                language model that
                uses a transformer decoder, still the entire input sequence is received in one go. But, when making predictions, it
                uses masking to hide future parts of the input sequence.
                <figure style="text-align: center;">
                    <img src="images/transformer_decoder.png" />
                </figure>
                (Note the masking in what is sometimes called a causal self-attention layer.)
            </li>
        </ul>
    </li>
    <li>An example of a transformer decoder word-level language model is GPT-1.</li>
</ul>
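
<ul>
    <li>The masking can be illustrated with a small NumPy sketch. It takes a matrix of attention scores (random numbers
        here, standing in for the scores a real attention layer would compute) and hides, for each position, every
        later position before applying the softmax. The function name is made up, and this is only the masking step,
        not a full attention layer.
    </li>
</ul>

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over each row of scores, with future positions masked out."""
    seq_len = scores.shape[0]
    # allowed[i, j] is True when position i may look at position j, i.e. when j <= i
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(masked)                             # exp(-inf) is 0
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # made-up scores for a 4-word input
print(np.round(causal_attention_weights(scores), 2))
```

<ul>
    <li>In the printed matrix, everything above the diagonal is zero: the first word can attend only to itself, the second
        word to the first two words, and so on.
    </li>
</ul>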

<h1>Natural Language Generation</h1>
<ul>
    <li>An obvious use for a Language Model is text generation.</li>
</ul>

<h2>Generating text using the character-level language model</h2>
<ul>
    <li>To generate text, we want to make repeated predictions:
        <ul>
            <li>Feed in some initial input;</li>
            <li>Predict the most likely next character/word;</li>
            <li>Add the prediction to the end of the input text;</li>
            <li>Feed in the extended input;</li>
            <li>Predict the most likely next character/word;</li>
        </ul>
        and so on.
    </li>
    <li>But this results in output text that is very repetitive.</li>
    <li>Instead, we make it stochastic:
        <ul>
            <li>We pick the next character/word randomly but based on the probabilities that the network produces.</li>
        </ul>
    </li>
    <li>We'll illustrate using the character-level RNN.</li>
</ul>

In [75]:
# A modification of the function for predicting the next character.
# The temperature parameter allows you to tune it: 
# - a value close to zero favours high probability characters, but leads to more repetition
# - a high value gives all characters an almost equal probability
def predict_next_char(text, temperature=1):
    x = np.array(vectorization_layer([text])) - 2
    probabilities = char_language_model.predict(x, verbose=0)[0, -1:]
    rescaled_logits = tf.math.log(probabilities) / temperature
    predicted_token = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    predicted_char = vectorization_layer.get_vocabulary()[predicted_token + 2]
    return predicted_char

# A function that predicts the next character repeatedly
def generate_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += predict_next_char(text, temperature)
    return ''.join(text)
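
<ul>
    <li>To see what the temperature does to the probabilities, here is a small NumPy sketch that applies the same
        rescaling (divide the log-probabilities by the temperature, then renormalize with a softmax) to a made-up
        distribution over three characters. The function name <code>apply_temperature</code> is made up for this
        illustration.
    </li>
</ul>

```python
import numpy as np

def apply_temperature(probabilities, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    rescaled_logits = np.log(probabilities) / temperature
    rescaled_logits = rescaled_logits - rescaled_logits.max()  # for numerical stability
    exps = np.exp(rescaled_logits)
    return exps / exps.sum()

# Made-up next-character probabilities
probabilities = np.array([0.6, 0.3, 0.1])
for temperature in (0.2, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(probabilities, temperature), 3))
```

<ul>
    <li>A temperature of 1 leaves the distribution unchanged; a low temperature puts nearly all the probability on the
        most likely character; a high temperature flattens the distribution.
    </li>
</ul>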

In [94]:
# Some examples
print("Low temperature:\n", generate_text(text="Alas ", temperature=0.2), "\n------------------------------------------------------------")

print("Medium temperature:\n", generate_text(text="Alas ", temperature=0.9), "\n------------------------------------------------------------")

print("High temperature:\n", generate_text(text="Alas ", temperature=2.0), "\n------------------------------------------------------------")

Low temperature:
 Alas with her to the duke of your sight of the state to 
------------------------------------------------------------
Medium temperature:
 Alas my am your honour so to be?
i love thee, as tender 
------------------------------------------------------------
High temperature:
 Alas to: hish in uomagier,
in impripsanmen ofrugled,
im 
------------------------------------------------------------


<ul>
    <li>Not bad for such a small dataset, so few layers and so few training epochs!</li>
</ul>
<!--
<ul>
    <li>How can we make the generated text more convincing?
        <ul>
            <li>Tweak everything! More data, more layers, more neurons per layer, more epochs, &hellip;
            <li>You could make the windows bigger by increasing <code>n_steps</code> but even LSTM and GRUs,
                while better than SimpleRNNs, cannot handle very long sequences.
            </li>
            <li>We could change Char-RNN from being a <b>stateless RNN</b> to being a <b>stateful RNN</b>; see the Appendix.
            </li>
            <li>And, of course, we could use a transformer instead of an RNN, especially if we want a word-level model.</li>
        </ul>
    </li>
</ul>
<h2>Stateless RNNs and Stateful RNNs</h2>
<ul>
    <li><b>Stateless RNN:</b> In a training iteration, 
        <ul>
            <li>will be trained on a batch of random chunks of the text;</li>
            <li>hidden state starts at all zeros;</li>
            <li>processes the input, step by step;</li>
            <li>after the last timestep, throws away the hidden state.</li>
        </ul>
    </li>
    <li><b>Stateful RNN:</b>
        <ul>
            <li>preserve the hidden state at the end of the last timestep;</li>
            <li>use it as the initial hidden state for the next batch.</li>
        </ul>
        This way, we can learn longer patterns despite only back-propagating through short
        sequences.
    </li>
    <li>However, we now must arrange our batches quite carefully.
        <ul>
            <li>Each input sequence in a batch starts where the corresponding sequence in the previous
                batch finished.
            </li>
            <li>In other words, we must remove the overlapping and the shuffling that we used in the
                stateless RNN.
            </li>
        </ul>
    </li>
    <li>Keras comes with a parameter for its recurrent layers, <code>stateful=True</code>.</li>
</ul>
-->

<h2>Generating sequences of things other than words or characters</h2>
<ul>
    <li>We can build language models on many different kinds of sequence data &mdash; not just human language data.</li>
    <li>For example, we could train a language model on a music dataset &mdash; and use it to generate music!
        <ul>
            <li>Here's an example (one that uses an RNN): <a href="https://folkrnn.org/">https://folkrnn.org/</a>
            </li>
        </ul>
    </li>
    <li>Or, for example, we could train a language model on a dataset that contains sequences of brushstrokes &mdash; and use it to generate drawings!</li>
</ul>

<h2>Generating things other than sequences</h2>
<ul>
    <li>We've seen how language models can be used to generate sequences such as text or music.</li>
    <li>Generating sequences is just one example of what has become known as <b>generative AI</b>.</li>
    <li>You are, no doubt, familiar with other examples: AI that generates images, audio or video.</li>
    <!--
        <ul>
            <li><a href="https://ml4a.net/">ml4a: Machine Learning for Art</a></li>
        </ul>
    </li>
    -->
    <li>Unfortunately, we do not have time to discuss these. If you are interested in how they work, you can investigate the following:
        <ul>
            <li>Generative Adversarial Networks (GANs);</li>
            <li>Variational autoencoders (VAEs); and</li>
            <li>Diffusion models, including latent diffusion models such as Stable Diffusion.</li>
        </ul>
    </li>
</ul>

<h1>Sequence Modeling</h1>
<ul>
    <li>Let's review some applications that use sequence data. We will group them, depending on whether they are many-to-one, one-to-many or many-to-many. (Some people use different terminology: sequence-to-vector or seq2vec; vector-to-sequence or vec2seq; and sequence-to-sequence or seq2seq.)</li>
</ul>

<h2>Many-to-one</h2>
<ul>
    <li>The input data is a sequence but the output is not; the output might be a fixed-size vector or a scalar.
        <figure style="text-align: center;">
            <img src="images/seq2vec.png" />
        </figure>
    </li>
    <li>Examples:
        <ul>
            <li>Timeseries forecasting, taking in sequences of numbers (historical rainfall data, stock prices, sales) and predicting the next number (tomorrow's rainfall, stock price, sales);</li>
            <li>Sentiment analysis, taking in text, outputting a class (e.g. positive, neutral, negative) or a score;</li>
            <li>Activity recognition from video, taking in a video (sequence of images), outputting a class (e.g. running, walking, crawling);</li>
            <li>Image generation, taking in the user's natural language description, outputting a corresponding image. Examples include OpenAI's <a href="https://openai.com/dall-e-2">DALL-E 2</a>, Google's <a href="https://imagen.research.google/">Imagen</a> and
                <a href="https://mid-journey.ai/">Midjourney</a>.</li>
        </ul>
    </li>
    <li>We can implement these models using RNNs (the last recurrent layer would say <code>return_sequences=False</code>) or using transformer encoders.</li>
</ul>
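
<ul>
    <li>The effect of <code>return_sequences</code> can be seen in a NumPy sketch of a single recurrent layer. The
        weights are random and untrained, and the function and variable names are made up, so the only thing to look at
        is the shape of what comes out.
    </li>
</ul>

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b, return_sequences=False):
    """Run a tanh recurrent layer over inputs of shape (timesteps, n_features)."""
    h = np.zeros(W_h.shape[0])   # the hidden state starts at all zeros
    states = []
    for x_t in inputs:           # one timestep at a time
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    # Either every hidden state, or only the one after the final timestep
    return np.stack(states) if return_sequences else h

rng = np.random.default_rng(0)
n_features, n_units, timesteps = 3, 5, 7
W_x = rng.normal(size=(n_features, n_units))
W_h = rng.normal(size=(n_units, n_units))
b = np.zeros(n_units)
sequence = rng.normal(size=(timesteps, n_features))

# Many-to-one: one vector summarizing the whole sequence
print(simple_rnn(sequence, W_x, W_h, b).shape)
# One output per timestep
print(simple_rnn(sequence, W_x, W_h, b, return_sequences=True).shape)
```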

<h2>One-to-many</h2>
<ul>
    <li>The input data is not a sequence, but the output is.
        <figure style="text-align: center;">
            <img src="images/vec2seq.png" />
        </figure>
    </li>
    <li>Example:
        <ul>
            <li>Image captioning, taking in an image, outputting a descriptive phrase (sequence of words).</li>
        </ul>
    </li>
    <li>We can implement these models using RNNs (the recurrent layers will all say <code>return_sequences=True</code>) or using a transformer decoder.
        <ul>
            <li>Here, for example, is what Google's image captioning system looked like some years ago.
            <figure style="text-align: center;">
                <img src="images/captioning.png" />
                <figcaption>
                    Google's image captioning system<br /> See
                    <a href="https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html">https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html</a><br />
                    Image comes from Vinyals et al.: <i>Show and Tell: Lessons learned from the 
                    2015 MSCOCO Image Captioning Challenge</i>, CoRR, abs/1609.06647, 2016 
                    (<a href="https://arxiv.org/pdf/1609.06647.pdf">https://arxiv.org/pdf/1609.06647.pdf</a>)
                </figcaption>
            </figure>
            A convolutional neural network processes the image and passes a representation to an RNN that generates the caption.
            </li>
        </ul>
    </li>
</ul>

<h2>Many-to-many</h2>
<ul>
    <li>Both the input and the output are sequences.</li>
    <li>In some cases, many-to-many models work in a synchronized way: there is an output for each element of the input sequence.
        <figure style="text-align: center;">
            <img src="images/seq2seq.png" />
        </figure>
    </li>
    <li>Examples:
        <ul>
            <li>Video analysis, where each frame in the video is classified.</li>
            <li>DNA analysis, taking in a DNA sequence, outputting a corresponding sequence but highlighting a subsequence within the 
                input that codes a certain protein, for example AGCCCCTGTGAGGAACTAG $\rightarrow$ 0011111111111111100
            </li>
            <li>Named entity recognition, taking in text (sequence of words), outputting a corresponding sequence but highlighting
                subsequences that are the names of people, organizations, etc. "Elon Musk works for SpaceX and Tesla Inc." 
                $\rightarrow$ 11001011. Named entity recognition has all sorts of uses. It might be used by search engines when 
                indexing documents, or by customer support departments when routing your queries and complaints &mdash; to give just 
                two examples. Rather than outputting bit strings (0s and 1s), it can be more useful if the system outputs symbols that 
                show where a name starts and where it ends. (There are, by the way, lots of other ways to build named entity recognition
                systems, without using sequence models.)
            </li>
            <li>Part-of-speech tagging, taking in a sequence of words, outputting the grammatical category of each word. For example,
                if the input is "the cat sat on the mat", the output is "determiner noun verb preposition determiner noun".
            </li>
        </ul>
    </li>
    <li>But, for many tasks, synchronization would be wrong. Consider machine translation: "La plume de ma tante est sur la table"
        $\rightarrow$ "My aunt's pen is on the table"; "vin rouge" $\rightarrow$ "red wine". Word-to-word translation from one language 
        to another would not give correct results.
    </li>
    <li>For these tasks, a better architecture is an encoder-decoder architecture:
        <figure style="text-align: center;">
            <img src="images/encoder_decoder.png" />
        </figure>
        <ul>
            <li>The encoder takes in the input sequence; its final output tries to capture the entire input sequence in a 
                single vector.
            </li>
            <li>That vector is passed to the decoder, which generates the output sequence.</li>
        </ul>
    </li>
    <li>In the <i>Attention is all you need</i> paper, transformers were presented as an encoder-decoder architecture:
        <figure style="text-align: center;">
            <img src="images/transformer_encoder_decoder.png" />
        </figure>
    </li>
    <li>Machine translation is not the only application that benefits from an encoder-decoder architecture.
        <ul>
            <li>Document paraphrasing: The input is a document; the output is also a document &mdash; one that covers the same ground but does so using different language.</li>
            <li>Document summarization: The input is a document; the output is a shorter text that expresses the essence of the original document.</li>
            <li>Question-answering: The input is a question; the output is a suitable answer.</li>
            <li>Chatbots: The input may be the history of the conversation so far; 
                and the output is a suitable next contribution to the conversation.</li>
            <li>Playlist captioning: The input is a sequence of songs. The encoder passes a representation of
                those songs to the decoder, which generates a natural language description of the playlist.
            </li>
            <li>Text-to-melody and melody-to-melody, offered by Microsoft-funded <a href="https://www.audiogen.co/">Audiogen</a> and by
                Google's <a href="https://google-research.github.io/seanet/musiclm/examples/">MusicLM</a>.
            </li>
            <li>Text-to-video, which is one of the capabilities of <a href="https://runwayml.com/">Runway</a>.</li>
        </ul>
    </li>
    <li><a href="https://quillbot.com/">Quillbot</a> is one among many tools for paraphrasing, summarizing, and lots more.</li>
</ul>
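
<ul>
    <li>Returning to the named entity recognition example: one common way of outputting symbols that show where a name
        starts and ends is to tag each word as B (begins a name), I (inside a name) or O (outside any name). The sketch
        below converts the bit string from that example into such tags; the function name is made up for this
        illustration.
    </li>
</ul>

```python
def bits_to_bio(bits):
    """Convert per-word 0/1 labels into B (begin), I (inside) and O (outside) tags."""
    tags = []
    previous = "0"
    for bit in bits:
        if bit == "0":
            tags.append("O")
        elif previous == "0":
            tags.append("B")   # first word of a name
        else:
            tags.append("I")   # continuation of the same name
        previous = bit
    return tags

words = "Elon Musk works for SpaceX and Tesla Inc.".split()
for word, tag in zip(words, bits_to_bio("11001011")):
    print(f"{word:8} {tag}")
```

<ul>
    <li>A conversion like this can only guess at the boundaries: if two different names were next to each other, the bit
        string would show one long run of 1s. A system that outputs B and I tags directly can mark where the second
        name begins.
    </li>
</ul>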