# LLMs in Action

## First steps with LLM

In [None]:
%%capture
# -y skips the confirmation prompt, which would otherwise block a captured cell
!pip uninstall -y llama-cpp-python
# build on an old Intel MBP where Metal is not supported
!FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

In [None]:
from llama_cpp import Llama

# Pick one model by leaving exactly one assignment uncommented
# model_name = "mistral-7b-instruct-v0.1.Q5_K_M"
model_name = "TinyLlama-1.1B-Chat-v0.3"
# model_name = "ggml-model-Q4_K"


model = Llama(model_path=f"/Users/alleon_g/.cache/llama.cpp/models/{model_name}.gguf", verbose=False)

In [None]:
answer = model('Please answer in one sentence to this question: What is a Large Language Model?', stop=["<|im_start|>"]) # , "\n"])

In [None]:
print(answer)

In [None]:
print(answer['choices'][0]['text'])

## Outline

- Intro to LLMs

- The wide range of abilities of LLMs

- Training and Deploying LLMs

- Getting Commercial Value from LLMs

## Quick history of LLMs

In the vast landscape of technological advancement, the journey of language models stands out as a testament to human ingenuity. Much like the epochs of human evolution, where our ancestors transcended through distinct ages defined by tools and discoveries, language processing models too have charted an illustrious journey, each phase building upon the last, refining the art and science of understanding and generating human language.

The Stone Age of language models saw the rudimentary yet pioneering techniques like Bag of Words and TF-IDF. These methods, in their simplicity, were akin to the early tools fashioned by our forebears - basic, yet fundamental in the march towards progress. The Bronze Age brought sophistication, introducing us to the magic of embeddings via methods like Word2Vec and Doc2Vec. This era celebrated the enchanting algebra of words, with the classic equation "king - man + woman = queen" encapsulating the groundbreaking progress.

The Iron Age was a realm of complexity and nuance. RNNs and LSTMs came to the forefront, harnessing the temporal essence of language. Yet, as metals are tempered to achieve sharper edges, the world of deep learning bore witness to a pivotal mechanism - attention. Just as iron tools were sharper and more durable than bronze, this newfound technique started to show potential beyond what was previously imagined.

The Industrial Revolution, a period of rapid innovation and progress in human history, finds its parallel in the world of language models with the advent of Transformers. A period marked by the dazzling pace of discovery, this era saw models like GPT, BERT, and others rise to prominence. But as with any revolution, refinement is inevitable. The models were fine-tuned, and techniques such as Reinforcement Learning from Human Feedback (RLHF) were developed, echoing the relentless quest for perfection that marked the historical Industrial Revolution.

The sections below follow this journey, drawing parallels between human evolution and the advancement of language models, and highlighting the strides made in a relatively short span of time.

### **The Stone Age of Language Models: Bag of Words and TF-IDF**

The dawn of language processing was characterized by simplicity, foundational discoveries, and the establishment of fundamental techniques that paved the way for more complex developments. Two of the most fundamental techniques of this era in the domain of Natural Language Processing (NLP) were the Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

**1. Bag of Words (BoW):**
The Bag of Words model, as its name suggests, represents a text as an unordered collection (a multiset) of words. It involves tokenizing the text into words and counting the frequency of each word. This method disregards grammar, word order, and context, focusing solely on word occurrence. BoW is intuitive and easy to implement, but its simplicity is also its limitation: it cannot capture semantic relationships between words or the context in which they appear.
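As a minimal sketch of the idea (assuming a naive lowercase, letters-only tokenizer, which is an illustrative choice rather than a standard one), a bag of words is just a table of token counts:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a token -> count mapping, ignoring order and punctuation."""
    # Naive tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


bow = bag_of_words("The cat sat on the mat. The dog sat too.")
print(bow)

# Word order is discarded, so these two sentences get the same representation.
same = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same)
```

The last comparison illustrates the limitation mentioned above: two sentences with opposite meanings can share one representation.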

**2. Term Frequency-Inverse Document Frequency (TF-IDF):**
While the BoW model laid the groundwork, TF-IDF built upon it by introducing a measure of importance to each term in a document relative to a collection of documents or corpus. The core idea behind TF-IDF is that words that appear frequently in one document but not in many documents across the corpus likely hold more significance. This technique balances the frequency of a term (Term Frequency) against its rarity across multiple documents (Inverse Document Frequency), providing a weight that can be used to rank terms' relevance within a document.
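The weighting can be sketched by hand. This version assumes whitespace tokenization and the plain, unsmoothed formula `tf * log(N / df)`; libraries often use smoothed or normalized variants, so the exact numbers will differ from theirs.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "mat" (unique to the first document) outweighs "cat" (shared by two).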

Despite the simplicity of these models, they served as crucial stepping stones. Both BoW and TF-IDF became cornerstones for many early information retrieval systems and remain relevant for various applications today.

However, as we transitioned from this metaphorical 'Stone Age', it became evident that capturing the complexity and nuance of human language required more than just term frequencies and weights. The era demanded models that could understand context, semantic relationships, and the subtleties that make language rich and intricate. This realization led to the subsequent developments and the ushering in of the Bronze Age of NLP.

### **The Bronze Age of Language Models: Word Embeddings and the Semantics of Language**

As human civilization progressed into the Bronze Age, tools became more refined, cultures grew richer, and the limits of the known world expanded. Similarly, in the landscape of NLP, the realization that raw counts and frequencies weren't capturing the essence of language gave rise to more sophisticated methods. Enter the world of word embeddings, where words were represented as dense vectors capturing the semantics and contextual relationships between them.

**1. Word2Vec:**
Developed by [Mikolov et al.](https://arxiv.org/abs/1301.3781) at Google, Word2Vec became one of the most popular methods for learning word embeddings. Instead of the sparse vectors used in BoW and TF-IDF, Word2Vec represents words as dense vectors, typically with several hundred dimensions. These vectors are learned with a neural network by predicting a word given its context (the Continuous Bag of Words approach) or predicting context words given a target word (the Skip-Gram approach). The underlying principle is simple: words that appear in similar contexts tend to have similar meanings. The result was groundbreaking; the word vectors captured a wide range of semantic and syntactic relationships. For instance, the model could capture relationships such as "king" - "man" + "woman" ≈ "queen", showcasing the [algebra of semantics](https://aclanthology.org/N13-1090.pdf).
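The analogy arithmetic can be illustrated with a toy example. The three-dimensional vectors below are made up by hand for illustration (real Word2Vec vectors are learned from data and have hundreds of dimensions); only the mechanics of "subtract, add, find the nearest neighbour by cosine similarity" carry over.

```python
import math

# Hypothetical hand-crafted vectors; dimensions loosely mean (royalty, male, female).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman")
print(result)
```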

**2. Doc2Vec:**
While Word2Vec was revolutionary in representing individual words, there was a need to capture representations for larger text units like sentences or documents. [Le and Mikolov](https://arxiv.org/abs/1405.4053) introduced Doc2Vec (or Paragraph Vector) to address this. Doc2Vec extends the Word2Vec algorithm to capture document-level embeddings, encapsulating the semantic meaning of varied-length pieces of texts.

Visualize word embeddings as a vast landscape where each word is a location. Words with similar meanings are closer to each other, while unrelated words are distant. This spatial representation provides a bird's-eye view of the vast semantics of language, akin to observing settlements and cultures from a high vantage point in the Bronze Age.

These embeddings became the foundation for a plethora of NLP tasks, from sentiment analysis to machine translation. They provided the crucial leap from mere word counts to understanding the semantics of language.

As the Bronze Age set the stage for a subsequent era of enhanced tools and techniques, word embeddings prepared the ground for models that could further dive into the intricacies of language, leading us to the Iron Age of NLP.

### **The Iron Age of Language Models: Recurrent Networks, Attention, and the Intricacies of Temporal Dependencies**

Progressing from the Bronze Age, the Iron Age in human history marked a period of rapid advancements in tools and methods, enabling the construction of stronger, more durable, and complex artifacts. Similarly, in the world of NLP, as the limitations of mere word embeddings became evident, there was a transition to more intricate models that could handle the temporal and sequential nature of language. This led to the rise of Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and the pivotal attention mechanism.

**1. Recurrent Neural Networks (RNNs):**
Traditional neural networks, while potent, struggled with sequential data. They lacked memory of previous inputs in a sequence, which is crucial for understanding languages where words and their meanings often depend on preceding words. [RNNs](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf) addressed this by having loops that allowed information persistence. In essence, RNNs could remember past information and were aptly suited for tasks like language modeling and machine translation.

However, RNNs were not without flaws. They struggled with long-term dependencies due to issues like vanishing and exploding gradients.
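Both the "memory" and the vanishing-gradient problem can be sketched with a scalar RNN. The weights `w_in` and `w_rec` are arbitrary illustrative values, not taken from any trained model.

```python
import math


def rnn_forward(inputs, w_in=0.5, w_rec=0.9, h0=0.0):
    """Scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def gradient_wrt_first_state(states, w_rec=0.9):
    """d h_T / d h_1 as a product of per-step factors w_rec * (1 - h_t^2)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_rec * (1.0 - h * h)
    return grad


# A single impulse followed by silence: the state remembers it, but it fades.
short_states = rnn_forward([1.0, 0.0, 0.0, 0.0])
print(short_states)

# Over 50 steps the gradient flowing back to the first state becomes tiny.
long_states = rnn_forward([1.0] + [0.0] * 49)
long_grad = gradient_wrt_first_state(long_states)
print(long_grad)
```

Because each factor here is below 1, the product shrinks geometrically with sequence length; with factors above 1 it would grow instead, which is the exploding-gradient case.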

**2. Long Short-Term Memory Networks (LSTMs):**
In NLP, the meaning of a word often depends on the preceding words or the overall context, and plain RNNs struggle to carry that context across long sequences. LSTMs were designed to address this. Their architecture, built around input, forget, and output gates, lets them decide what information to retain or discard, thereby preserving context over long sequences. This made them well suited to tasks like language modeling, sentiment analysis, and machine translation. A significant NLP application of LSTMs came in the form of [sequence-to-sequence models](https://papers.nips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf). These models, often used for machine translation, use an LSTM-based encoder to read and encode the source sentence and an LSTM-based decoder to generate the target sentence. LSTMs thus provided a significant leap, enabling models to capture more of the intricacies and subtleties of human language.

**3. Attention Mechanism:**
Despite the capabilities of LSTMs, there was still room for improvement. Enter the [attention mechanism](https://arxiv.org/abs/1409.0473). It allowed models to focus on specific parts of the input sequence when producing an output, much like how humans pay attention to specific words when listening or reading. Initially used to boost performance in machine translation tasks, attention quickly became a cornerstone for various NLP applications. 
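The core computation can be sketched in a few lines: score each input position against a query, turn the scores into weights with a softmax, and return the weighted average of the values. This sketch assumes dot-product scoring for brevity, whereas the linked paper uses an additive (small neural network) scoring function; the toy keys and values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (attention weights, weighted average of values) for one query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Three input positions, each with a key (used for matching) and a value (the content).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

attn_weights, context = attend([2.0, 0.0], keys, values)
print(attn_weights)
print(context)
```

The query aligns with the first and third keys, so those positions receive most of the weight and dominate the resulting context vector.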

Just as iron reshaped civilizations by enabling the construction of advanced machinery, RNNs, LSTMs, and the attention mechanism redefined the landscape of NLP, setting the stage for the next revolutionary era—the Industrial Revolution of language models.

### **The Industrial Revolution of Language Models: Transformers and the Power of Parallel Processing**

As the Industrial Revolution marked a radical transformation of society through mechanization and the development of new technologies, a similar paradigm shift occurred in the world of Natural Language Processing (NLP) with the advent of transformer architectures. The transformer era was characterized by parallel processing, scalability, and models that displayed an unprecedented ability to understand and generate human-like text.

**1. The Transformer Architecture:**
Introduced by [Vaswani et al.](https://arxiv.org/abs/1706.03762) in 2017, the transformer architecture revolutionized NLP by discarding recurrence and focusing entirely on attention mechanisms, enabling parallelization. This allowed for the training of significantly larger models, yielding improved results across a multitude of NLP tasks.

**2. BERT and Pre-trained Language Models:**
Building on the transformer's foundation, [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) by Devlin et al. showcased the power of pre-training and fine-tuning. By pre-training on a vast corpus and fine-tuning on specific tasks, BERT set new state-of-the-art results on eleven natural language processing tasks, emphasizing the benefits of general-purpose language representation.

**3. GPT and Generative Modeling:**
OpenAI's Generative Pre-trained Transformer ([GPT](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)) highlighted the capabilities of transformers in generating coherent, diverse, and contextually relevant text. Subsequent versions, especially GPT-3, stunned the community with their ability to draft essays, answer questions, and even write poetry.

**4. Customization and Fine-tuning with Reinforcement Learning:**
Transformers' scalable nature allowed researchers to explore the intersection of reinforcement learning and NLP. Techniques such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback ([RLHF](https://arxiv.org/abs/1706.03741)) were introduced to fine-tune models, making them more aligned with human values and specific application needs.

Envision this era as an industrial factory, where the assembly line (parallel processing) efficiently processes vast amounts of data, producing refined, high-quality products (human-like text and understanding). Just as the steam engine, telegraph, and railroads accelerated society's progress during the Industrial Revolution, transformers and their progenies are pushing the boundaries of what's possible in NLP.

* Stone Age
  - bag of words [1954]
  - tf-idf [1972]

* Bronze Age
  - word2vec [2013], doc2vec
  - king - man + woman = queen
  - Embeddings -> Outcome: LSTMs, RNNs, ...

* Iron Age
  - attention mechanism
  - self-supervised learning
  - power-law scalings

* Industrial Revolution
  - Reinforcement Learning from Human Feedback (RLHF)
  - Model alignment
  - InstructGPT [2022](https://arxiv.org/abs/2203.02155), ChatGPT [2022](https://help.openai.com/en/articles/6825453-chatgpt-release-notes), GPT-4 [2023](https://arxiv.org/abs/2303.08774), Anthropic, Cohere, Mistral

## Transformers

* Attention was already in use in the _Iron Age_
* But Transformers kickstarted the _Industrial Revolution_
    - demonstrating that **Attention is all you need** [2017](https://arxiv.org/abs/1706.03762)
    - avoiding recurrence and/or convolutions

![Transformers](./Sequence_to_Sequence.gif)

## Transformers in a Nutshell

<table>
  <thead>
    <tr>
      <th colspan="2"></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img width="500" src="transformers.png"/></td>
      <td>
          <ul>
              <li style="font-size: 20pt;">The input of the transformer model is a prompt that needs to be embedded</li>
              <p></p>
              <li style="font-size: 20pt;">The block is the main source of complexity. Each block contains a masked multi-head attention, a feed forward network and several layers of normalization</li>
              <p></p>
              <li style="font-size: 20pt;">The output is fed through one more linear layer to obtain the final output</li>
          </ul>
      </td>
    </tr>
  </tbody>
</table>
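The "masked" part of the masked attention mentioned above can be sketched as follows. This is a single-head illustration that assumes the raw attention scores are already given; it only shows how each position is prevented from looking at later positions.

```python
import math


def causal_attention_weights(scores):
    """scores[i][j] is the raw score of position i attending to position j.

    Returns row-wise softmax weights where every j > i is forced to zero.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]            # positions 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# Uniform raw scores for a 4-token sequence (illustrative input).
uniform_scores = [[0.0] * 4 for _ in range(4)]
causal_weights = causal_attention_weights(uniform_scores)
for row in causal_weights:
    print(row)
```

With uniform scores, the first token attends only to itself, the second splits its attention over two positions, and so on, giving the lower-triangular pattern that lets the model be trained to predict the next token.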



## Tokenization in a Nutshell

In [25]:
%%capture
!pip install torch transformers

In [26]:
from transformers import pipeline, GPT2Tokenizer


In [27]:
generator = pipeline('text-generation', model='gpt2')

In [28]:
generator("The capital of Belgium is Brussels. The capital of France is", max_new_tokens=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The capital of Belgium is Brussels. The capital of France is Paris.'}]

In [29]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [30]:
'Megatron' in tokenizer.get_vocab()

False

In [31]:
tokenizer.encode('Megatron loves only himself')

[42672, 23484, 10408, 691, 2241]

In [32]:
tokenizer.convert_ids_to_tokens(tokenizer.encode('Megatron loves only himself'))

['Meg', 'atron', 'Ġloves', 'Ġonly', 'Ġhimself']

In [33]:
'Ġloves' in tokenizer.get_vocab()

True

In [34]:
encoded_input = tokenizer('Megatron loves only himself')
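The split of `Megatron` into `Meg` and `atron` comes from byte-pair encoding (BPE): the word starts as single characters, and learned merge rules are applied in priority order. The sketch below uses a small hypothetical merge list chosen to reproduce that split; it is not GPT-2's real merge table, and it ignores the byte-level details and the `Ġ` space marker.

```python
def bpe_encode(word, merges):
    """Apply ranked BPE merges to one word and return its subword tokens."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, highest priority first.
toy_merges = [("M", "e"), ("Me", "g"), ("a", "t"), ("r", "o"), ("ro", "n"), ("at", "ron")]

pieces = bpe_encode("Megatron", toy_merges)
print(pieces)
```

Since no rule merges `Meg` with `atron`, the word stays as two tokens, which is why the full word is not found in the vocabulary while its pieces are.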

## Language Models



| Autoregressive models | Autoencoding models |
|---|---|
| Predict tokens based on past context only, e.g. **'Gérard loves __.'** | Predict tokens based on past and future context, e.g. **'Gérard loves __ pizzas.'** |
| Good for Natural Language Generation | Good for Natural Language Understanding |
| GPT architectures | BERT architectures |
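The difference comes down to which tokens the model may look at when predicting a given position. A minimal sketch, assuming a pre-tokenized sentence and a `[MASK]` placeholder (the helper names and the word "hot" are hypothetical, for illustration only):

```python
def autoregressive_context(tokens, position):
    """Context for predicting tokens[position]: only what came before."""
    return tokens[:position]


def autoencoding_context(tokens, position, mask="[MASK]"):
    """Context for predicting tokens[position]: both sides, target hidden."""
    return tokens[:position] + [mask] + tokens[position + 1:]


tokens = ["Gérard", "loves", "hot", "pizzas"]

left_only = autoregressive_context(tokens, 2)
both_sides = autoencoding_context(tokens, 2)
print(left_only)
print(both_sides)
```

An autoregressive model can therefore keep generating by appending its own predictions, while an autoencoding model needs the surrounding text to be present, which suits it to understanding tasks.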

## Large Language Models

- Number of parameters > 100M
    - One of the largest announced is [WuDao 2.0](https://www.baai.ac.cn/english.html#Research) with 1.75T parameters

* Don't necessarily rely on the Transformer architecture
    * but many do - some are hybrid

* Trained on very large corpora
    * The Pile [2020](https://arxiv.org/abs/2101.00027) is an 825 GiB dataset that consists of 22 smaller, high-quality datasets combined together

* Can be finetuned for specific usages

## Bigger LLMs

![LLMs](2023-Alan-D-Thompson-AI-Bubbles-Optimal-Rev-5.png)