<h1>CS4619: Artificial Intelligence II</h1>
<h1>Large Language Models</h1>
<h2>
    Derek Bridge<br />
    School of Computer Science and Information Technology<br />
    University College Cork
</h2>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

<h1>Large Language Models</h1>
<ul>
    <li>We now know what a <b>language model</b> is.</li>
    <li>There has been an explosion of <b>large language models</b> (LLMs).</li>
    <li>In what ways are they large?
        <ul>
            <li>Many more layers, hence many more parameters; and</li>
            <li>Huge training sets.</li>
        </ul>
        For example, here are some stats about the GPT family of LLMs:
        <table>
            <tr>
                <td></td><th>GPT-1</th><th>GPT-2</th><th>GPT-3</th><th>GPT-4</th>
            </tr>
            <tr>
                <th>Year</th><td>2018</td><td>2019</td><td>2020</td><td>2023</td>
            </tr>
            <tr>
                <th>num. parameters</th><td>117 million</td><td>1.5 billion</td><td>175 billion</td><td>rumoured 1-2 trillion</td>
            </tr>
            <tr>
                <th>training set</th><td>BookCorpus dataset (7000 books)</td><td>8 million good quality web pages (40 GB)</td><td>500 billion tokens (web crawl, book datasets, Wikipedia)</td><td>rumoured 13 trillion tokens, but images as well as text</td>
            </tr>
        </table>
    </li>
    <li>Their capabilities have surprised everyone.</li>
    <li>There is a lot of discussion, and even disagreement, about exactly what their capabilities are.</li>
</ul>

<h1>Training an LLM</h1>
<ul>
    <li>Training is typically a three-step process: Pre-training, Supervised Fine-Tuning (SFT) and Preference Alignment.
    <figure>
        <img src="images/llm_training.png" alt="" />
    </figure>
    </li>
    <li>Let's discuss each step in turn.</li>
</ul>

<h2>Pre-training</h2>
<ul>
    <li>The LLM is pre-trained on a large corpus of text.</li>
    <li>There are numerous pre-trained models. 
        Here's a <a href="https://github.com/Hannibal046/Awesome-LLM">web page</a> that curates information about this ever-growing field.   
        Also many models can be found on
        <a href="https://huggingface.co/">Hugging Face</a>, a US company that builds tools but also hosts models and datasets
        for use by researchers. Keras also has some <a href="https://keras.io/keras_hub/presets/">pre-trained LLMs</a>.
    </li>
    <li>Let's discuss a few just to give a flavour of some of the differences.
        <ul>
            <li>GPT: Generative Pre-trained Transformer
                <ul>
                    <li>The GPT models are developed by <a href="https://openai.com/">OpenAI</a>.</li>
                    <li>These models operate at word-level (mostly) and, architecturally, they are all transformer decoder models.
                        <ul>
                            <li>In other words, their self-attention layers use masking to hide future parts of the input sequence 
                                (i.e. they are causal self-attention layers and the model is unidirectional).</li>
                            <li>Hence, they are used to predict the next word.</li>
                        </ul>
                    </li>
                    <li>Training is self-supervised: there are no separate labels; rather, the target comes from the text (the next word). 
                        (Be warned that the terminology used in this area is problematic. The self-supervised pre-training is sometimes 
                        referred to as unsupervised pre-training.)</li>
                </ul>
            </li>
            <li>BERT: Bidirectional Encoder Representations from Transformers
                <ul>
                    <li>Developed by researchers in Google;</li>
                    <li>A transformer encoder;</li>
                    <li>But bidirectional in the sense that, instead of being trained to predict the next word using the previous ones, 
                        it is trained to predict a missing word, using words that come before and after the missing word (doing this
                        is called <i>cloze</i> prediction).
                    </li>
                </ul>
            </li>
            <li>BART: Combining Bidirectional and Auto-Regressive Transformers
                <ul>
                    <li>Developed by researchers in Facebook AI (as they were called then);</li>
                    <li>Combines a bidirectional encoder with a unidirectional decoder;</li>
                    <li>It is trained in part to reconstruct sentences from corrupted versions of those sentences.</li>
                </ul>
            </li>
            <li>LLaMA: Large Language Model Meta AI
                <ul>
                    <li>Developed by Meta AI;</li>
                    <li>A transformer decoder with fewer parameters than GPT-3, but Meta AI claims to show that it is just as good;</li>
                    <li>LLaMA's weights were available to academic researchers and subsequently leaked more widely;</li>
                    <li>Since the leak, Meta AI has released Llama2 and Llama3 models (and, yes, LLaMA has irritating capital letters and Llama2/Llama3 do not!), including weights;</li>
                    <li>This has made Llama models the basis of a lot of research and development in academia and in industry outside of the big tech companies.</li>
                </ul>
            </li>
            <li>Gemini
                <ul>
                    <li>Developed by Google;</li>
                    <li>A transformer decoder;</li>
                    <li>It is multimodal, so inputs and outputs can contain text, images, videos and audio;</li>
                    <li>The Gemini LLM powers the Gemini Chatbot, which is Google's answer to ChatGPT.</li>
                </ul>
            </li>
        </ul>
    </li>
    <li>General-purpose pre-trained models, especially pre-trained LLMs, are sometimes called Foundation Models.</li>
</ul>
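<p>The difference between GPT-style next-word prediction, causal masking and BERT-style cloze prediction can be sketched in a few lines of plain Python. This is only a toy illustration: splitting on whitespace stands in for a real tokenizer, and the function names and the <code>[MASK]</code> placeholder are assumptions made for this example, not part of any library.</p>

```python
# Toy illustration of self-supervised targets. Whitespace splitting stands in
# for a real tokenizer (an assumption made for brevity).

def next_word_examples(text):
    """GPT-style: each prefix of the text is an input; the following word is the target."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cloze_example(text, position, mask_token="[MASK]"):
    """BERT-style: hide one word; the words on both sides stay visible."""
    words = text.split()
    target = words[position]
    masked = words[:position] + [mask_token] + words[position + 1:]
    return masked, target


sentence = "the cat sat on the mat"

# No separate labels are needed: the targets come from the text itself.
for context, target in next_word_examples(sentence):
    print(" ".join(context), "->", target)

# Each row shows which positions are visible ("x") or hidden (".").
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))

print(cloze_example(sentence, 2))
```

<p>The lower-triangular mask is what makes a decoder unidirectional; the cloze example has no such mask, which is why the encoder can use words on both sides of the gap.</p>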

<h2>Supervised Fine-Tuning (SFT)</h2>
<ul>
    <li>For downstream tasks, you add extra layers, and train on a smaller, labeled dataset. (This should remind you of <i>transfer learning</i> in <i>CS4618</i>.)
        <ul>
            <li>For example, to obtain a movie review sentiment analyser that labels reviews as positive or negative, 
                we could add a dense layer comprising a single neuron with a sigmoid
                activation function, and train on the movie review dataset that we used in previous lectures.
            </li>
        </ul>
    </li>
    <li>OpenAI researchers illustrated this with four downstream tasks, summarised in this image:
        <figure style="text-align: center;">
            <img src="images/gpt1.png" />
            <figcaption>
                Figure taken from Radford, A., Narasimhan, K., Salimans, T., &amp; Sutskever, I. (2018). <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Improving language understanding by generative pre-training</a>.<br />(Linear = Dense)
            </figcaption>
        </figure>
        Notice how the training data for the transfer learning is arranged differently for each task.
        <!--
        Classification: (a) whether a sentence is grammatical or not; (b) a binary sentiment analysis task.
        Entailment: relationship between two sentences: neutral, entailment, contradiction.
        Similarity: whether two sentences are similar or not. The training data duplicates all examples, 
        reversing the order of the sentences.
        MCQs: (a) high-school exams: a passage of text, a question and answers to choose from; (b) story completion: a multi-sentence
        story and then two sentences to choose from to complete the story. In both cases, one exam question becomes multiple
        examples, one per option.
        -->
    </li>
    <li>Instruction models
        <ul>
            <li>Perhaps the most important example of Supervised Fine-Tuning is the SFT that OpenAI applied to its GPT-3 and GPT-4 models to make them better at following human instructions.</li>
            <li>They used a human-constructed dataset that comprises (prompt, response) pairs.
                <ul>
                    <li>E.g. 
                        <ul>
                            <li>Prompt: "Create a shopping list from this recipe. Trim the ends of the zucchini&hellip;"</li>
                            <li>Expert response: "Zucchini, beef, onion, mushroom, peppers, cheese, ketchup, salt, pepper."</li>
                        </ul>
                    </li>
                    <li>E.g.
                        <ul>
                            <li>Prompt: "What’s the cause of the 'anxiety lump' in our chest during stressful or disheartening experiences?"</li>
                            <li>Expert response: "The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow&hellip;"</li>
                        </ul>
                    </li>
                </ul>
            </li>
            <li>They refer to these fine-tuned models as InstructGPT models.</li>
        </ul>
    </li>
    <li>As a final example, let's mention some local (UCC) work that prepares LLMs for Irish language tasks. Tran, O'Sullivan &amp; Nguyen take a pre-trained Llama model, which is obviously proficient in English and other widespread languages but not so proficient in minority languages, such as Irish. First, they do more self-supervised pre-training using Irish language texts. Then they do SFT to tune the model for English-Irish machine translation. See these papers if you are interested: <a href="https://arxiv.org/pdf/2405.1301">https://arxiv.org/pdf/2405.1301</a> and <a href="https://aclanthology.org/2024.loresmt-1.20.pdf">https://aclanthology.org/2024.loresmt-1.20.pdf</a>.</li>
</ul>
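<p>To make the idea of arranging the fine-tuning data differently for each task more concrete, here is a minimal sketch of the four arrangements in the GPT-1 figure. The token strings <code>&lt;start&gt;</code>, <code>&lt;delim&gt;</code> and <code>&lt;extract&gt;</code> and the function names are hypothetical, chosen for this example; the figure simply labels them Start, Delim and Extract.</p>

```python
# Hypothetical special tokens; the GPT-1 figure labels them Start, Delim and Extract.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"


def classification_input(text):
    """One text in, one sequence out."""
    return [f"{START} {text} {EXTRACT}"]


def entailment_input(premise, hypothesis):
    """Premise and hypothesis are joined in a fixed order."""
    return [f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"]


def similarity_inputs(sentence_a, sentence_b):
    """Similarity has no natural order, so both orderings are produced."""
    return [
        f"{START} {sentence_a} {DELIM} {sentence_b} {EXTRACT}",
        f"{START} {sentence_b} {DELIM} {sentence_a} {EXTRACT}",
    ]


def multiple_choice_inputs(context, options):
    """One question becomes several sequences, one per candidate answer."""
    return [f"{START} {context} {DELIM} {option} {EXTRACT}" for option in options]


for sequence in multiple_choice_inputs("The sky is", ["blue", "loud", "square"]):
    print(sequence)
```

<p>In each case the pre-trained transformer processes the sequence, and only the small task-specific layer on top has to be learned from scratch.</p>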

<h2>Preference Alignment</h2>
<ul>
    <li>Sometimes, even after SFT, the output of an LLM will be incorrect, untruthful, unhelpful, toxic or biased.</li>
    <li>How can we better <em>align</em> the model to the tasks that users want to perform and how users would want them to be performed?
    </li>
    <li>This is the job of Preference Alignment.</li>
    <li>The basic idea is to collect a dataset of the judgments of human experts and use these to tweak the LLM so 
        that its responses align better with human preferences.
    </li>  
    <li>For those who want a little more detail:
        <ol>
            <li>Collect a dataset of prompts, e.g. from users of the LLM;</li>
            <li>Submit one of the prompts to the LLM and obtain a response;</li>
            <li>Submit the same prompt to the LLM again to obtain another response;</li>
            <li>Ask a human expert which of the two responses is better;</li>
            <li>Tweak the LLM so that, faced with the same prompt again, it is more likely to produce the preferred response. (The tweaking
                typically uses something called Proximal Policy Optimisation.)
            </li>
        </ol>
    </li>
    <li>
        There are several variants of this basic approach (details not important), including:
        <ul>
            <li>Reinforcement Learning from Human Feedback (RLHF): In RLHF, the human experts rank the responses, and then a regressor (the reward model) is trained to predict a reward score for how good a response is. These predicted rewards are used when updating the LLM.</li>
            <li>Direct Preference Optimization (DPO): In DPO, we update the LLM directly from the human-ranked responses, thus avoiding the need to learn a separate reward model.</li>
            <li>Kahneman-Tversky Optimization (KTO): KTO requires only that the humans classify each response as desirable/undesirable, rather than rank the responses.</li>
        </ul>
   </li>
    <li>(The bottleneck in RLHF, DPO and KTO is the need for humans to make judgments. Intriguingly, there is some early work that shows that replacing 
        the humans by a large language model, i.e. asking the language model to rank the responses, works just as well, at least for
        summarization tasks, e.g.: Harrison Lee et al. 2023. 
        <a href="https://arxiv.org/abs/2309.00267">RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback</a>)</li>
</ul>
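<p>For those who want to see what "updating the LLM directly from the human-ranked responses" might look like, here is a minimal sketch of the DPO loss for a single prompt with one preferred and one rejected response. It assumes we already have the log-probability that each model assigns to each response; the function name, argument names and the value of <code>beta</code> are illustrative choices for this example.</p>

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen response, rejected response) triple.

    Each argument is the total log-probability that a model assigns to a
    response given the prompt. 'policy' is the LLM being tuned; 'ref' is a
    frozen copy of it taken before tuning. beta=0.1 is an illustrative value.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Before any tuning, policy and reference agree, so the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))

# The loss falls if the policy makes the preferred response more likely
# and the rejected response less likely, relative to the reference.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

<p>Minimising this loss by gradient descent pushes the LLM towards the preferred response without ever training a separate reward model, which is the difference from RLHF.</p>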

<h1>So what is ChatGPT?</h1>
<ul>
    <li>Everyone knows that ChatGPT is a conversational chatbot from OpenAI.</li>
    <li>But now we can understand it in terms of the earlier parts of this lecture:
        <ul>
            <li>Its Foundation Model is a pretrained version of GPT. Initially, this was GPT-3.5, and now it is GPT-4.</li>
            <li>The Foundation Model is fine-tuned using SFT on a dataset of (prompt, response) pairs, to produce an Instruction Model called InstructGPT.</li>
            <li>RLHF is used on a rolling basis for Preference Alignment to produce a Chat Model.</li>
        </ul>
    </li>
    <li>Of course, that's not all. The Chat Model is just one component.
        <ul>
            <li>Another component is OpenAI's Moderation Model. This is a classifier which enables ChatGPT to filter inappropriate inputs and
                outputs. It classifies into several classes, including hate, harassment, self-harm, sexual and violence. (You
                can use it in your own apps; see <a href="https://platform.openai.com/docs/guides/moderation/overview">moderation</a>.)
            </li>
            <li>These days it claims to have memory between conversations. Most likely, tokens from previous conversations and the current conversation are automatically included in a context window which is joined onto your prompt. The context window length has grown. For GPT-3, it was 2,049 tokens; for GPT-3.5, it was 4,096 tokens; for GPT-4, it is 32,768 tokens.</li>
            <li>ChatGPT with GPT-4 is also multi-modal, i.e. it can handle images as well as text in both the prompts and the responses.</li>
            <li>And ChatGPT with GPT-4 also has the ability to invoke external apps (see below).</li>
        </ul>
    </li>
    <li>Google's Gemini chatbot is similar.</li>
</ul>
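<p>The idea of joining earlier conversation onto the prompt, subject to a fixed context window length, can be sketched as follows. This is a guess at the general mechanism, not a description of what ChatGPT actually does: the function name is hypothetical, tokens are approximated by whitespace-separated words, and the policy of keeping the most recent turns is an assumption.</p>

```python
def build_context(history, prompt, max_tokens):
    """Keep the most recent turns that fit in the window; always keep the new prompt.

    Tokens are approximated by whitespace-separated words (an assumption:
    real systems use a subword tokenizer).
    """
    def count(text):
        return len(text.split())

    budget = max_tokens - count(prompt)
    kept = []
    for turn in reversed(history):  # newest turn first
        cost = count(turn)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    return list(reversed(kept)) + [prompt]


history = ["my name is Ann", "I live in Cork", "I like chess"]
print(build_context(history, "where do I live?", max_tokens=12))
```

<p>With a larger window, more of the earlier conversation survives, which is one reason why longer context windows make a chatbot appear to have a better memory.</p>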

<h1>Applications</h1>
<ul>
    <li>There's huge excitement (and hype) about possible applications of smarter chatbots, like ChatGPT.</li>
    <li>Customer support is the obvious one, since relatively dumb chatbots are being used already. For example,
        <a href="https://www.intercom.com/">Intercom</a> has a new chatbot called <a href="https://www.intercom.com/drlp/fin">Fin</a> 
        that is powered by ChatGPT.
    </li>
    <li><a href="https://www.crossingminds.com/">CrossingMinds</a> have launched <a href="https://www.crossingminds.com/gpt-spotlight">GPT Spotlight</a>, which uses ChatGPT to enable conversational product search and discovery in e-commerce.</li>
    <li><a href="https://www.duolingo.com/">Duolingo</a>, the app that helps people learn foreign languages, is using ChatGPT to offer new language learning
        exercises, including role play.
    </li>
    <li><a href="https://www.khanacademy.org/">Khan Academy</a>, which is a non-profit that offers educational resources, is
        investigating a more conversational offering, using ChatGPT.
    </li>
    <li>Maybe there will be applications in games. Consider, for example, this Dungeons &amp; Dragons game: <a href="https://play.aidungeon.io/main/landing">https://play.aidungeon.io/main/landing</a>.
        (But note the controversy too: some players were typing words that caused the game to generate stories depicting sexual encounters involving children <a href="https://arstechnica.com/gaming/2021/05/it-began-as-an-ai-fueled-dungeon-game-then-it-got-much-darker/">https://arstechnica.com/gaming/2021/05/it-began-as-an-ai-fueled-dungeon-game-then-it-got-much-darker/</a>.)</li>
</ul>

<h1>A Moving Target</h1>
<ul>
    <li>ChatGPT behaviour (and the behaviour of LLMs in general) is not reliably reproducible:
        <ul>
            <li>Suppose that some months ago ChatGPT answered a question that I asked.</li>
            <li>Today, I ask it the exact same question. I get a different response.</li>
            <li>Why might this happen? (Give me at least three reasons.)</li>
        </ul>
    </li>
    <li>This is not great for researchers or teachers.
        <ul>
            <li>E.g. I might find an example prompt whose response illustrates a strength or weakness of ChatGPT. 
                But when a student later submits the same prompt, the response is different and does not exhibit the same strength/weakness.
            </li>
            <li>E.g. I might suspect a student has used ChatGPT inappropriately in their work. But I cannot reliably verify my 
                suspicions: even if I can guess the exact same
                prompt that the student used (unlikely), I will not get the same response that the student got.
            </li>
        </ul>
    </li>
</ul>

<h1>The Debate</h1>
<ul>
    <li>There is huge debate about what ChatGPT knows, or even what it 'knows' (in scare quotes).</li>
    <li>There is even debate about what it can do: what tasks can it actually perform?</li>
    <li>Below, I present two viewpoints.</li>
</ul>

<h2>Autocomplete-on-steroids? Fluent bull-shitters? Stochastic parrots?</h2>
<ul>
    <li>An LLM is trained to fill-in-the-blanks. More specifically, GPT models are trained to predict the next word. They stitch text together without any reference to meaning.</li>
    <li>According to this viewpoint, 
        we can use an LLM to generate text that resembles human language, but there is no real act of communication.
        <!--
        According to speech act theory, language is used to carry out actions: apologizing, promising, ordering, answering, 
        requesting, complaining, warning, inviting, refusing, and congratulating. These are all examples of speech acts. 
        But they depend on having certain beliefs, desires and intentions. If I am to sincerely apologise to you, then I 
        must believe that I have wronged you, desire good relations with you, and intend to restore our good relations (or 
        something like that!). Think about the beliefs, desires and intentions that underlie acts like promising, complaining 
        and so on. Since systems built atop of language models have no beliefs, desires and intentions, any language they produce 
        that resembles a speech act is not really the performance of that act. They may say "I'm sorry that I offended you" 
        but this is not truly the speech act of apologizing.
        -->
    </li>
    <li>Communication depends on beliefs, desires and intentions. An LLM has none of these.</li>
    <li>We (the users of the LLM) attribute meaning to its utterances; we make sense of them; we interpret them as if they were acts of
        communication. But the utterances are, in reality, empty.
    </li>
</ul>
<figure style="text-align: center;">
    <img src="images/parrot.png" />
    <figcaption>
        Image from <a href="https://twitter.com/cuducos">Cuducos</a>
    </figcaption>
</figure>

<h2>Emergent Properties?</h2>
<ul>
    <li>In complex systems, properties that are not possessed by the individual components of the system emerge through their interaction.
        <ul>
            <li>"The whole is greater than the sum of its parts." &mdash; Aristotle</li>
            <li>Examples: ant colonies collectively find the shortest path to food sources; flocks of birds move in what appears to be an
                orchestrated way; maybe consciousness is an emergent property of high complexity brains; &hellip;</li>
        </ul>
    </li>
    <li>According to this alternative viewpoint, 
        LLMs exhibit emergent properties (or are beginning to exhibit them or something like them).</li>
     <li>The key idea is <b>in-context learning</b>.
        <ul>
            <li>Because an LLM is trained on a large dataset, it is exposed repeatedly to examples of many tasks.</li>
            <li>For example, since some of its training set comes from textbooks, it might see examples of arithmetic, or 
                machine translation.
                <figure style="text-align: center;">
                    <img src="images/in_context.png" />
                    <figcaption>
                        In-context learning. (Brown, T. B., et al. (2020). <a href="https://arxiv.org/abs/2005.14165">Language models are few-shot learners</a>.)
                    </figcaption>
                </figure>
            </li>
            <li>So, although it is being trained in self-supervised fashion to predict the next word, &hellip;</li>
            <li>&hellip; it may additionally implicitly learn about these tasks from these 'patterns'.</li>
            <li>Hence, the pre-trained LLM may acquire some skill at these tasks. For these tasks, there may be no need for further training. 
                There is no need, therefore, for a task-specific labeled training set.
            </li>
        </ul>
    </li>
    <li>We can ask an LLM to perform these tasks using <b>zero-shot inference</b>, <b>one-shot inference</b> or <b>few-shot inference</b>.
        <ul>
            <li>Zero-shot inference means that we just ask it to carry out the task. 
                <ul>
                    <li>Here is a classification example (using BART):
                        <figure style="text-align: center;">
                            <img src="images/zero_shot.png" />
                        </figure>
                        In the example, the model is predicting the topic of a text, but it has not been fine-tuned on a dataset of 
                        texts labeled with topics. Its abilities in this task come from in-context learning.
                    </li>
                    <li>Try it yourself: <a href="https://huggingface.co/tasks/zero-shot-classification">https://huggingface.co/tasks/zero-shot-classification</a></li>
                    <li>But perhaps zero-shot inference is unfairly hard: even humans need an example or two.</li>
                </ul>
            </li>
            <li>In one-shot and few-shot inference, we ask the LLM to carry out a task but we give one or more examples.
                <figure style="text-align: center;">
                    <img src="images/few_shot_adapted.png" />
                    <figcaption>
                        Figure adapted from Brown, T. B., et al. (2020). <a href="https://arxiv.org/abs/2005.14165">Language models are few-shot learners</a>.
                    </figcaption>
                </figure>
            </li>
            <li>It is important to understand that in zero-, one- and few-shot inference, there's no further training or fine-tuning going on: we are not feeding in a labeled dataset and 
                we are not updating the weights and biases of the neural network. We are just asking the LLM to carry out a task, and we hope that the ability to perform these tasks has emerged from the ability to predict words (see, e.g., Jason Wei et al. (2022). <a href="https://arxiv.org/abs/2206.07682">Emergent Abilities of Large Language Models</a>).
            </li>
            <li>(Yet again, terminology is all over the place. Some people say zero-shot learning or zero-shot transfer, instead of zero-shot inference. I resist this because there is no additional learning: weights are not being updated when performing these tasks.)</li>
        </ul>     
    </li>
    <li>There is a long list of possible emergent properties, i.e. skills that LLMs may have acquired through in-context learning.
        <ul>
            <li>An extreme example is the Google engineer who claimed that Google's LaMDA LLM is sentient. He was subsequently fired: <a href="https://en.wikipedia.org/wiki/LaMDA#Sentience_claims">Wikipedia</a></li>
            <li>In between these extremes, we have people claiming that LLMs, e.g.:
                <ul>
                    <li>have commonsense knowledge, e.g. Zirui Zhao et al. (2023). <a href="https://arxiv.org/abs/2305.14078">Large Language Models as Commonsense Knowledge for Large-Scale Task Planning</a></li>
                    <li>can do arithmetic, e.g. Zheng Yuan et al. (2023). <a href="https://arxiv.org/abs/2304.02015">How well do Large Language Models perform in arithmetic tasks?</a></li>
                    <li>can do verbal reasoning (answering questions about short texts), e.g. Radford, A., et al. (2019). <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language models are unsupervised multitask learners</a></li>
                    <li>can do analogical reasoning, e.g. Taylor Webb et al. (2022). <a href="https://doi.org/10.48550/arXiv.2212.09196">Emergent Analogical Reasoning in Large Language Models</a> 
</li>
                    <li>can reason about sequences of actions, e.g. Kenneth Li et al. (2022). <a href="https://arxiv.org/abs/2210.13382">Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task</a> shows an LLM predicting legal moves in the board game <i>Othello</i>.</li>
                    <li>have a sense of what they know and don't know: S. Kadavath et al. (2022). <a href="https://arxiv.org/abs/2207.05221">Language Models (Mostly) Know What They Know</a></li>
                    <li>have a theory of mind, i.e. the ability to reason about other people's mental states: Michal Kosinski. (2023). <a href="https://arxiv.org/abs/2302.02083">Theory of Mind May Have Spontaneously Emerged in Large Language Models</a>.</li>
                </ul>
                and so on!
            </li>
        </ul>
    </li>
</ul>
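<p>Zero-, one- and few-shot inference differ only in the text of the prompt, which a short sketch can make concrete. The function name and the <code>=&gt;</code> layout are assumptions for this example (the layout follows the translation example in the Brown et al. figure); no weights are updated anywhere in this code, because all it does is build a string to send to the LLM.</p>

```python
def build_prompt(task_description, examples, query):
    """Build a prompt: zero-shot if examples is empty, one-shot with one
    example, few-shot with several."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the LLM is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French:"
demonstrations = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

print(build_prompt(task, [], "cheese"))                   # zero-shot
print()
print(build_prompt(task, demonstrations[:1], "cheese"))   # one-shot
print()
print(build_prompt(task, demonstrations, "cheese"))       # few-shot
```

<p>Whatever the LLM generates after the final <code>=&gt;</code> is taken as its answer; the examples in the prompt influence that answer only through the model's forward pass.</p>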

<h2>Who is right?</h2>
<ul>
    <li>Make up your own mind!</li>
    <li>In part, the disagreement between the skeptics and the believers is all about the extent to which LLMs <b>memorize</b> and the extent to which they <b>generalize</b>. The skeptics believe that demonstrations of LLMs doing arithmetic, doing verbal reasoning, having a theory of mind, etc., are methodologically problematic.
        <ul>
            <li>In some cases, the task is artificially constrained.
                <ul>
                    <li>One critique of the work on analogical reasoning by Taylor Webb et al. (cited above) is that the task involved selection (multiple-choice questions), rather than generation of new analogies.</li>
                    <li>Generation of new analogies requires generalisation.</li>
                </ul>
            </li>
            <li>There is a high likelihood of leakage.
                <ul>
                    <li>We do not know exactly what is in the training data.</li>
                    <li>Therefore, we cannot ensure that the test data is distinct from the training data.</li>
                    <li>In fact, if we test using well-known problems (e.g. farmer-goat-cabbage) or textbook exercises, then there is a very high likelihood that the LLM will have been trained on the solutions.</li>
                </ul>
            </li>
            <li>The LLMs usually fail on even quite trivial variants of well-known problems, which shows that they don't do much generalization.
               <ul>
                   <li>It is easy to create subversive examples that wrong-foot the normal next-word expectations, revealing that there is no real reasoning:
        <figure style="text-align: center;">
            <figcaption>It gets this right:</figcaption>
            <img src="images/chatgpt_reasoning3.png" />
        </figure>
        <figure style="text-align: center;">
            <figcaption>But it gets this variant wrong (or, at least, it used to):</figcaption>
            <img src="images/chatgpt_reasoning4.png" />
        </figure>
        <figure style="text-align: center;">
            <figcaption>It gets this right (solution not shown):</figcaption>
            <img src="images/chatgpt_reasoning5.png" />
        </figure>
        <figure style="text-align: center;">
            <figcaption>But it gets this variant wrong (or, at least, it used to):</figcaption>
            <img src="images/chatgpt_reasoning6.png" />
        </figure>
    </li>
                    <li>On Theory of Mind problems, for example, LLMs fail to correctly answer
                questions that are trivial variants of the ones that they get right: Tomer Ullman. (2023). <a href="https://arxiv.org/abs/2302.08399">Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks</a>. On a dataset for testing for theory of mind, LLMs score close to zero, whereas humans score around 90 (Hyunwoo Kim et al. (2023) <a href="https://arxiv.org/abs/2310.15421">FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions</a>).</li>
               </ul>
            </li>
            <li>LLMs are trained to predict word probabilities. The skeptics point out that this has a lasting influence when you use an LLM for some other task. For example, consider tasks that are deterministic, such as decrypting using a shift cipher. LLMs do better when the input text has high probability, when the output text has high probability, and when the amount of shift is more common in web examples (e.g. 13 is more common than 12). This is shown in <a href="https://www.pnas.org/doi/pdf/10.1073/pnas.2322420121">this paper</a>. The same authors show that the effect persists even in more capable LLMs: OpenAI's o1, for example, performs better overall but still suffers from this lasting influence.</li>
            <li>This <a href="https://aiguide.substack.com/p/the-llm-reasoning-debate-heats-up">blog post</a> does a brilliant job of reviewing three papers that explore LLM reasoning capabilities. All three papers show that LLMs rely a lot on memorization and exhibit only limited generalization abilities. See also Karthik Valmeekam et al. (2022). <a href="https://arxiv.org/abs/2206.10498">Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change)</a></li>
        </ul>
    </li>
    <li>The skeptics argue that the same can be said for all of the other claims about emergent properties: it is easy to construct 
        examples that the LLM should
        get right but that it doesn't.</li>
</ul>
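<p>To see why the shift cipher results are telling, it helps to look at how simple and deterministic the task is. Here is a minimal sketch (the function names are chosen for this example): a program that decrypts correctly for a shift of 13 decrypts equally well for a shift of 12, and it does not matter how probable the plaintext is.</p>

```python
import string


def shift(text, amount):
    """Shift each letter forward by 'amount' places, wrapping around the
    alphabet; all other characters are left unchanged."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + amount) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + amount) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)


def decrypt(ciphertext, amount):
    """Undo a shift by shifting the same amount in the opposite direction."""
    return shift(ciphertext, -amount)


message = "Stay here"
for amount in (12, 13):
    ciphertext = shift(message, amount)
    print(amount, ciphertext, "->", decrypt(ciphertext, amount))
```

<p>An LLM whose accuracy depends on the shift amount or on the probability of the text is, according to the skeptics, not executing this algorithm but leaning on what it has seen most often.</p>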

<h1>Making shit up</h1>
<ul>
    <li>LLMs cannot reliably be factual.</li>
    <li>Of course, often they will be factual: they have been trained on vast quantities of text, a lot of which is factual.</li>
    <li>But the text they concoct is a pastiche based only on probabilities, and so falsehoods will also be commonplace, e.g.
        <ul>
            <li>Meta's <i>Galactica</i> demo was online for only three days in 2022 before being withdrawn in the face of severe criticism. 
                Trained on scientific articles, Meta claimed it would help scientists write papers but, of course, 
                <a href="https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/">it just made things up</a>, including citations.
            </li>
            <li><a href="https://www.theguardian.com/technology/2023/jun/23/two-us-lawyers-fined-submitting-fake-court-citations-chatgpt">Two lawyers in the US were fined</a>, after submitting court filings written by ChatGPT that cited invented cases.</li>
            <li>A <a href="https://hai.stanford.edu/news/how-well-do-large-language-models-support-clinician-information-needs">study by Stanford University</a> found LLMs to be unsuitable for use as medical assistants because: (a) they are non-deterministic and their responses have high variability; (b) they are inaccurate: only 41% of GPT-4 responses agreed with the known answer to medical questions according to a consensus of 12 physicians; and (c) they have potential for harm: 7% of answers were deemed potentially harmful by the physicians.</li>
            <li>A <a href="https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14">news story</a> reports that medical centres in the US used Open AI's <i>Whisper</i> system to transcibe patent consultations from audio to text, but found that it made things up.</li>
            <li>Last time I asked ChatGPT for a biography of Derek Bridge, it said I was born in Ireland (no!) and it invented a PhD from 
                Edinburgh University (no, mine is from Cambridge University!).</li>
        </ul>
    </li>
    <li>In some domains, inventions might be taken for creativity. But, right now, it is probably true to say that, putting all
        the hype to one side, the concoction of falsehoods by systems such as ChatGPT is the main factor that is holding back the 
        deployment of these systems in real-world applications.</li>
    <li>In any domain where there is an expectation of factuality, you should proceed with extreme caution before deploying an LLM. Some people would argue, for example, that LLMs should not be used in search engines because the users of search engines expect factual answers; some would argue that they should not be used by journalists, by academics, by students, &hellip; Some would argue that they should not be used in medical applications. Even in customer support, there are dangers, e.g. there is <a href="https://www.businessinsider.com/car-dealership-chevrolet-chatbot-chatgpt-pranks-chevy-2023-12">this story</a> of a chatbot offering a car for sale for $1.
    </li>
    <!--
    <li>Within OpenAI, <a href="https://www.alignmentforum.org/posts/BgoKdAzogxmgkuuAt/behavior-cloning-is-miscalibrated">Leo Gao</a>
        and others are claiming that ChatGPT's supervised transfer learning (step 1 in the figure from earlier) is 
        contributing to the falsehoods.
        <ul>
            <li>There's a mismatch between what the LLM has learned (next-word probabilities) and what the humans know and assume when
                they construct the examples (lots of commonsense knowledge, for example). The claim is that this mismatch results in a
                miscalibrated model, that is over/underconfident about things that it wasn't exposed to enough of because the
                human was assuming them. Or something!
            </li>
            <li>This seems contradicted by results on this page: https://openai.com/research/instruction-following where, for the
                Hallucinations dataset, the supervised steps improves matters and it is the RLHF that makes matters worse.
        </ul>
    </li>
    -->
    <li>These concocted falsehoods are now commonly called <b>hallucinations</b>. Referring to them in this way is controversial for at least three reasons:
        <ol>
            <li>In human psychology, an hallucination is a false perception (e.g. seeing something that isn't there).
                But what we have here are false utterances.
            </li>
            <li>Arguably, the word "hallucination" unreasonably anthropomorphises the system.</li>
            <li>Arguably, the word sounds unreasonably benign.</li>
        </ol>
    </li>
    <li>But what to use instead?
        <ul>
            <li>"Lie"? The problem is that to lie is to state something which you believe is false. And LLMs don't have any beliefs.</li>
            <li>"Bull-shit"? Too sweary!</li>
            <li>"Confabulation"? No one knows what it means.</li>
        </ul>
        You'll note that I have been saying "falsehoods" and, when I need a verb, I say that it "concocts" them.
    </li>
</ul>
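
<p>To see why text generated "based only on probabilities" can be false, here is a minimal sketch of sampling a next word. The probabilities are made up purely for illustration (they do not come from any real model), and <code>sample_next_word</code> is a hypothetical helper, not part of any LLM library. The point is only that a plausible but wrong continuation has a non-zero chance of being chosen, and that repeated runs can give different answers.</p>

```python
import random

# Made-up next-word probabilities for the context
# "Derek Bridge received his PhD from ..." (illustrative only).
next_word_probs = {"Cambridge": 0.40, "Edinburgh": 0.35, "Oxford": 0.25}


def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_word(next_word_probs, rng) for _ in range(1000)]

# Count how often each continuation was generated.
counts = {word: samples.count(word) for word in next_word_probs}
print(counts)
```

<p>Only one of the three continuations is true, but the sampler has no notion of truth: it will emit the other two roughly in proportion to their probabilities.</p>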

<h1>Prompt Engineering</h1>
<ul>
    <li>Prompt engineering is the process of designing a prompt that will be input to a generative AI system.</li>
    <li>It can involve:
        <ul>
            <li>deciding how to phrase a request;</li>
            <li>specifying a style of answer;</li>
            <li>describing a context for a request;</li>
            <li>choosing one or more examples for one-shot and few-shot inference;</li>
            <li>&hellip;</li>
        </ul>
    </li>
    <li>Chain-of-thought (CoT) prompting is one example, where the prompt encourages the model to break the problem down. 
        This is often as simple as telling it to "reason in steps".
        <figure style="text-align: center;">
            <figcaption>It works here:</figcaption>
            <img src="images/chatgpt_prompteng1.png" />
        </figure>
    </li>
    <li>Or a prompt might encourage the model to be factual ("Answer only using reliable sources and cite those sources").</li>
    <li>Originally, prompt engineering was seen as a human skill, perhaps even something that would give rise to a whole new career 
        path.
    </li>
    <li>If interested, you can look at this <a href="https://amatriain.net/blog/prompt201">survey of techniques</a>.</li>
    <li>More recently, there has been a flurry of papers about Automatic Prompt Generation, e.g.:
        <ul>
            <li><a href="https://arxiv.org/abs/2210.03493">this paper</a> about automatic chain-of-thought prompting, where the software selects examples to include in the prompts;</li>
            <li><a href="https://ai.googleblog.com/2023/08/teaching-language-models-to-reason.html">this Google blog post</a> about 'teaching' models to reason algorithmically, with examples that show enhanced arithmetic skills, and 
            <a href="https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html">this one</a> about solving
            maths and science problems;</li>
            <li><a href="https://arxiv.org/abs/2309.11495">this paper</a> that claims to reduce hallucination by breaking tasks down into subtasks, and <a href="https://arxiv.org/pdf/2302.12813.pdf">this paper</a> that injects external knowledge and does fact checking
                to reduce concoction of falsehoods (doing this is now being referred to as Retrieval-Augmented Generation (RAG), based on 
            <a href="https://arxiv.org/abs/2005.11401">this paper</a>).</li>
        </ul>
    </li>
    <li>You really should not assume that prompt engineering overcomes the fundamental limitations of LLMs:
        <figure style="text-align: center;">
            <figcaption>It did not help here:</figcaption>
            <img src="images/chatgpt_prompteng2.png" />
        </figure>
    </li>
</ul>
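
<p>The ingredients listed above (phrasing, style, context, examples, and a chain-of-thought instruction) can be combined by a small amount of code. The following is a minimal sketch, not a recommended template: the names <code>Example</code> and <code>build_prompt</code>, and the "Q:"/"A:" layout, are assumptions made for illustration. It only builds the prompt string; sending it to a model is left out.</p>

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A worked example to include in the prompt (hypothetical structure)."""
    question: str
    answer: str


def build_prompt(request, context="", style="", examples=(), chain_of_thought=False):
    """Assemble a prompt string from the ingredients of prompt engineering."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    if style:
        parts.append(f"Answer style: {style}")
    # One example gives one-shot inference; several give few-shot inference.
    for ex in examples:
        parts.append(f"Q: {ex.question}\nA: {ex.answer}")
    parts.append(f"Q: {request}")
    if chain_of_thought:
        # Chain-of-thought: encourage the model to break the problem down.
        parts.append("Reason in steps before giving the final answer.")
    parts.append("A:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "What is 17 * 24?",
    context="You are helping a student check arithmetic.",
    style="one short sentence",
    examples=[Example("What is 2 * 3?", "2 * 3 = 6.")],
    chain_of_thought=True,
)
print(prompt)
```

<p>Automatic prompt generation replaces the hand-written arguments with ones chosen by software, e.g. selecting which examples to include.</p>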

<h1>External Actions</h1>
<ul>
    <li>We can extend the capabilities of systems such as ChatGPT by giving them access to external apps.</li>
    <li>Originally, this was done using what OpenAI called <i>plugins</i>, but now it is done using what they call <i>actions</i>:
        <blockquote>
            "GPT Actions empower ChatGPT users to interact with external applications via RESTful APIs calls outside of ChatGPT simply by using natural language. They convert natural language text into the json schema required for an API call. GPT Actions are usually either used to do data retrieval to ChatGPT (e.g. query a Data Warehouse) or take action in another application (e.g. file a JIRA ticket)." <cite><a href="https://platform.openai.com/docs/actions/introduction">https://platform.openai.com/docs/actions/introduction</a></cite>
        </blockquote>
    </li>
    <li>External actions might enable systems that use LLMs to overcome their limitations (to some extent):
        <ul>
            <li>Invoking a calculator app or a Python interpreter or Wolfram Alpha will allow calculations.</li>
            <li>Invoking a web browser or a search engine will give access to fresh content (more recent than the content in the
                training dataset) and allow verification of accuracy to filter falsehoods.
            </li>
            <li>Invoking other apps will allow the system to carry out actions upon the external world, e.g. booking tickets.</li>
        </ul>
    </li>
    <li>What's cool is that giving ChatGPT access to an external API does not require any programming. 
        <ul>
            <li>You write an <a href="https://platform.openai.com/docs/actions/getting-started">OpenAPI Schema</a>, which is a JSON file that tells ChatGPT how to use the API (e.g. its URL, how to authenticate, and natural language descriptions of the endpoints).</li>
            <li>ChatGPT uses the Schema to decide whether the user's natural language prompt requires use of an API.</li>
        </ul>
        OpenAI are vague about the details but something similar can be seen in this <a href="https://betterprogramming.pub/how-llms-like-chatgpt-can-use-plugins-and-tools-2d0571869e01">blog post</a>. (An alternative to this few-shot approach would be to train the LLM to use the external APIs by fine-tuning it on a dataset of examples of API use, as shown in <a href="https://arxiv.org/abs/2302.04761">this paper</a>.)
    </li>
    <li>This is hugely exciting! It may spell the demise of native apps and web apps. We may not need to invoke apps ourselves; we may not need to learn how to use each one. Instead, something like ChatGPT would act like a natural language operating system, invoking apps for us to help us get our work done!</li>
    <li>However, it is early days! 
        <ul>
            <li>There are concerns about how to ensure that the system uses its extended capabilities safely, especially if it can take actions in the external world.</li>
            <li>It is fair to say that integrations of LLMs with search engines have not been wholly successful, even resulting in some ridicule:
                <ul>
                    <li>Some examples of Microsoft's integration of ChatGPT with the Bing search engine going wrong: <a href="https://twitter.com/MovingToTheSun/status/1625156575202537474">Bing getting argumentative</a> and <a href="https://dkb.blog/p/bing-ai-cant-be-trusted">more examples</a>.</li>
                    <li>A promotional video showing Bard (Google's integration of their LaMDA and PaLM LLMs with their search engine) producing a factual error <a href="https://www.theguardian.com/technology/2023/feb/09/google-ai-chatbot-bard-error-sends-shares-plummeting-in-battle-with-microsoft">wiped $100 billion off Alphabet's shares</a>.</li>
                    <li>Google's integration of its Gemini LLM with its search engine to produce <i>AI Overviews</i> <a href="https://searchengineland.com/google-ai-overview-fails-442575">tells people to eat rocks</a>.</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
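
<p>The general idea of external actions can be sketched as follows: the model is told which tools exist, it emits a structured request when it decides a tool is needed, and ordinary code carries out the request. This is a minimal sketch under loud assumptions: the JSON call format, <code>TOOL_DESCRIPTIONS</code>, <code>FUNCTIONS</code> and <code>handle_model_output</code> are all hypothetical, the model's output is a hard-coded stand-in string, and none of this is OpenAI's actual mechanism (which, as noted, is not fully documented).</p>

```python
import ast
import json
import operator

# Hypothetical natural language tool descriptions that would be placed in the
# prompt, loosely analogous to endpoint descriptions in a schema.
TOOL_DESCRIPTIONS = {
    "calculator": "Evaluate an arithmetic expression, e.g. '2 * (3 + 4)'.",
}

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression):
    """Evaluate +, -, *, / over numbers without running arbitrary code."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


# Maps tool names to the Python functions that implement them.
FUNCTIONS = {"calculator": calculator}


def handle_model_output(text):
    """Run the tool if the output is a JSON tool call; otherwise pass it through."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return text  # ordinary natural language answer
    if not isinstance(call, dict) or call.get("tool") not in FUNCTIONS:
        return text
    result = FUNCTIONS[call["tool"]](**call.get("arguments", {}))
    return str(result)


# Stand-in for what a model might emit after reading TOOL_DESCRIPTIONS.
model_output = '{"tool": "calculator", "arguments": {"expression": "17 * 24"}}'
print(handle_model_output(model_output))
```

<p>In a real system the tool's result would normally be passed back to the model so that it can phrase the final answer. The safety concerns below arise because the dispatching code will run whatever request the model emits.</p>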

<h1>LLM Agents</h1>
<ul>
    <li>As we saw, people are fairly skeptical about the capabilities of LLMs, in particular whether they can reason, given that LLMs are trained only to do autocompletion.</li>
    <li>But in the newest systems, where there is multi-step Automatic Prompt Engineering and use of External Actions, the LLM is just one component of the system.
        <ul>
            <li>LLM-Modulo systems comprise a generate-test framework, in which the LLM generates candidate responses and an external verifier tests them, and the best response is shown to the user.</li>
            <li>In Tree-of-Thoughts (ToT), an algorithm recursively breaks a problem into smaller parts and tries alternative paths for solving each subproblem, invoking the LLM on the subproblems.</li>
            <li>In OpenAI's o1 and o3 family of models, a system that has been trained using Reinforcement Learning selects among different Chain-of-Thought prompts for the LLM. (Or something like this! OpenAI is not very explicit about the details.)</li>
        </ul>
        This image from Rakesh Gohel shows the way that LLMs have become just one component within a larger system:
        <figure>
            <img src="images/llm_evolution.jpg" />
        </figure>
        <!-- Google's tech report on LLm Agents: https://media.licdn.com/dms/document/media/v2/D561FAQH8tt1cvunj0w/feedshare-document-pdf-analyzed/B56ZQq.TtsG8AY-/0/1735887787265?e=1736985600&v=beta&t=pLuArcKyUcxE9B1Her1QWfMHF_UxZL9Q-Y0JTDuSn38 -->
    </li>
    <li>There is no clear terminology for referring to these augmented LLMs: some people call them AI Agents (but there is resistance to this); I tend to call them LLM Agents; others call them Large Reasoning Models or Reasoning Language Models.</li>
    <li>Can these augmented LLMs do reasoning? Are they better at generalization? Do they concoct fewer falsehoods?</li>
    <li>The answer seems to be: yes, but only to some extent (see, e.g., <a href="https://arxiv.org/pdf/2409.13373">this paper</a>).</li>
    <li>And we need to recognize that these improved capabilities come at a cost: responding to your prompt now takes a considerably longer time and uses considerably more energy.</li>
</ul>
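
<p>The generate-test idea behind LLM-Modulo systems can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published framework: <code>fake_llm</code> is a hypothetical stand-in that returns canned candidates instead of calling a model, and <code>verifier</code> is a hypothetical external checker that only understands claims of the form "a * b = c".</p>

```python
def fake_llm(prompt, n):
    """Stand-in for an LLM: returns up to n canned candidates (hypothetical)."""
    candidates = ["17 * 24 = 398", "17 * 24 = 408", "17 * 24 = 418"]
    return candidates[:n]


def verifier(candidate):
    """External test: recompute the product and compare with the claimed result."""
    lhs, rhs = candidate.split("=")
    a, b = (int(x) for x in lhs.split("*"))
    return a * b == int(rhs)


def generate_and_test(prompt, n=3):
    """Return the first candidate that passes the verifier, or None if none does."""
    for candidate in fake_llm(prompt, n):
        if verifier(candidate):
            return candidate
    return None


print(generate_and_test("What is 17 * 24?"))
```

<p>The correctness guarantee comes from the verifier, not from the LLM, and every extra candidate costs another model call, which is one source of the extra time and energy mentioned above.</p>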

<h1>Other Problems</h1>
<ul>
    <li>Despite the use of Preference Alignment and of classifiers for toxic input/output (such as OpenAI's moderation classifier, mentioned earlier), LLMs
        might still produce toxic text. This includes text that: incites hatred; promotes violence; harasses, threatens or
        bullies; is sexual, erotic or pornographic; encourages self-harm; provides ill-founded medical diagnoses or treatments;
        gives instructions for producing dangerous artefacts (e.g. home-made bombs); gives instructions that themselves are
        dangerous (e.g. inappropriate mixing of chemicals); and so on. LLMs and other AI systems that use machine learning cannot
        readily 'unlearn' parts of what they have learned in order to comply with legal or ethical requirements.
    </li>
    <li>As a special case of the previous point but worthy of separate mention, LLMs may generate malware, designed to
        disrupt systems or obtain unauthorised access to systems. <a href="https://simonwillison.net/2023/Apr/14/worst-that-can-happen/">Prompt injection attacks</a> are particularly worrisome
        in the case of external actions. It is plausible that an LLM might invoke actions that are malicious. (See <a href="https://arxiv.org/pdf/2410.13691">this paper</a> for demonstrations of the kind of thing that might happen.)
    </li>
    <li>Despite the use of Preference Alignment and of classifiers such as OpenAI's moderation classifier, there is some evidence that users can bypass the
        safeguards by explicitly instructing the LLM to produce unsafe outputs or by <a href="https://www.vice.com/en/article/epzyva/ai-chatgpt-tokens-words-break-reddit">including unusual token sequences in the
        prompt</a>.
    </li>
    <li>When LLMs are fine-tuned on company-specific data (e.g. to create a useful customer support chatbot), there is a risk of data exfiltration, where the LLM leaks confidential data.</li>
    <li>Harmful activities that rely on producing text are made easier by LLMs. 
        These include: misinformation, spam, phishing, impersonation, and fraudulent writing of, e.g., product reviews, student essays,
        academic publications, and reviews of academic publications. There are no 100% reliable methods for detecting whether content
        has been produced by an LLM (or some other generative AI). OpenAI, for example, developed a classifier, but shut it down within
        a year. Whenever a classifier or something like it is developed, it is easy to modify the AI so that it produces content that
        bypasses detection.
    </li>
    <li>ChatGPT is impressive in English. It performs well in other common languages such as Spanish and Japanese, and in some programming languages. But, in languages which are under-represented on the Web, such as Tigrinya (7 million speakers, mostly in Eritrea),
        Kurdish (27 million speakers, mostly in Turkey and Iraq) and Tamil (78 million speakers in Sri Lanka and parts of India),
        it is not even always grammatical; it makes up words; it fails to follow instructions; and its abilities on reasoning 
        tasks are even lower than they are in English (<a href="https://restofworld.org/2023/chatgpt-problems-global-language-testing/">https://restofworld.org/2023/chatgpt-problems-global-language-testing/</a>).
    </li>
    <li>The text produced by an LLM may reflect biases that are present in the original training data.  Similarly, Preference Alignment aligns the model's
        outputs to the preferences of human experts, but these experts have their own biases (not least, because they are mostly young, 
        computer-savvy, English-speakers
        who live in the USA or South-east Asia). Similarly, the prompts used for the fine-tuning are sampled from prompts that came
        from users of the system &mdash; these users will also not be representative of the population at large.
    </li>
    <li>There are moral and legal issues concerning the use of other people's text to train LLMs.
        <ul>
            <li>The tech giants assume they can use text (and images, and videos) as training data for their models
                without permission or payment.
            </li>
            <li>Sometimes the text that an LLM generates is identifiably related to text in the training dataset, raising questions of 
                plagiarism and copyright. For example, the US media site CNET seems to be making heavy use of AI-generated articles,
                and <a href="https://futurism.com/cnet-ai-plagiarism">people are finding examples of something similar to plagiarism</a>.
            </li>
            <li>There is talk of lawsuits.</li>
            <li>There is research on protecting work from use by AI. For example, for images the University of Chicago's <a href="https://glaze.cs.uchicago.edu/">Glaze</a> uses a cloaking technique that prevents an image generator from being able to accurately replicate the style of an artwork.</li>
        </ul>
    </li>
    <li>The web will start to fill with text that has been generated by LLMs (product reviews, news articles, and so on). What
        happens when future LLMs are trained on data that is crawled from the web? In this <a href="https://www.lightbluetouchpaper.org/2023/06/06/will-gpt-models-choke-on-their-own-exhaust/">blog post</a>, Ross
        Anderson says: "Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re 
        about to fill the Internet with blah. This will make it harder to train newer models by scraping the web &hellip;". 
        A <a href="https://arxiv.org/abs/2305.17493v2">paper</a> on which he is co-author predicts 'model collapse': over time,
        mistakes compound and the models are no longer learning from the true underlying data distribution. Another
        <a href="https://arxiv.org/abs/2307.01850">paper</a> shows models getting progressively worse when trained on their own
        output.
    </li>
    <li>Pre-training LLMs is hugely energy-intensive and water-intensive. Even inference costs (i.e. the costs of using the LLM to respond to prompts) have grown. Systems that allow automatic Chain-of-Thought prompting or other multi-step prompting have some of the highest costs. For example, it was estimated in 2024 that, for some prompts, o1 costs 20 times more than GPT-4.</li>
</ul>

<h1>Conclusion</h1>
<ul>
    <li>Well done if you read through all the above material!</li>
    <li>One thing we can be sure about is that it is already out-of-date :(</li>
    <li>This is an exciting, fast-moving but also troubling field.</li>
</ul>