<a href="https://colab.research.google.com/github/daphnei/gen-ai-tutorials/blob/main/llm_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Imports and initialization

from abc import ABC

class Engine(ABC):
  def score(self, text):
    """Tokenizes the provided text, and returns a score for each token.
    
    The score is log-likelihood. A higher score means a token was more
    likely according to the model.
    """
    pass

  def generate(self, prompt, top_p, num_tokens):
    """Generates text given the provided prompt text."""
    pass


# Glossary

Here are some of the terms we will be using in this tutorial.


*   **(Language) Model**: A neural network trained to generate text.
*   **Engine**: The code and computers used to do inference with a model.
*   **Accelerator**: A [GPU](https://www.techtarget.com/searchvirtualdesktop/definition/GPU-graphics-processing-unit) or [TPU](https://cloud.google.com/tpu/docs/intro-to-tpu) hardware chip used to massively speed up model inference by enabling fast, parallelized matrix multiplications.



# Unit 1: Working with Pre-Trained Language Models

## 1.1 Choosing a model and engine

Which model should you use? It depends on what your goals are, what your budget is, and what kinds of computational resources you have available.

In this section, we will summarize the pros and cons of some of the popular systems, and guide you through the process of setting them up for inference.

## Open-source models on HuggingFace

HuggingFace is an open-source code framework for loading open-source models onto an accelerator in order to train them or do inference. It is very popular among academic researchers.

There are [hundreds of models](https://huggingface.co/models) available through HuggingFace's model repository. Some prominent ones which you can use for this tutorial are:

* **[Pythia-3B](https://huggingface.co/EleutherAI/pythia-2.8b-deduped)**: Part of a family of models trained by [Eleuther AI](https://www.eleuther.ai/about), a non-profit AI research lab.
* **[GPT-2 XL](https://huggingface.co/gpt2-xl)**: The original large language model from OpenAI, and the last one they open-sourced before moving away from open-source models.
* **[BLOOM-3b](https://huggingface.co/docs/transformers/model_doc/bloom)**: Part of a family of models created by the [BigScience Initiative](https://bigscience.huggingface.co/), a collaboration between HuggingFace and many academic research labs to responsibly build a set of high-quality multilingual models.

### Why use this?
HuggingFace is great if your goal is to write open-source code, with results that anyone can reproduce. HuggingFace gives access to a huge number of models, and it is fairly easy to swap between models.

Because HuggingFace is widely used, it is easy to find help online. Probably someone else has had the same question as you and already posted about it.

### Why not use this?
When you use HuggingFace, you are running the code on your own computer. If you don't have a big enough accelerator, you will run into difficulties loading larger models. For example, Colab (the software you are currently using) gives you access to a Tesla T4 GPU, which has 16 GB of memory. This means you can load the 6.7B-parameter LLaMA, but not any of the bigger ones.
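
The memory limit mentioned above can be sanity-checked with some quick arithmetic. The sketch below is only a rough estimate under stated assumptions: weights are stored in 16-bit floating point (2 bytes per parameter), and the extra memory needed for activations and caching during generation is ignored, so real usage will be higher. The helper name `weight_memory_gb` and the rounded parameter counts are illustrative, not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory (in GB) needed just to hold a model's weights.

    Assumes every parameter takes `bytes_per_param` bytes (2 for 16-bit floats).
    """
    return num_params * bytes_per_param / 1e9


ACCELERATOR_GB = 16.0  # memory of the Tesla T4 mentioned in the text

# Approximate parameter counts (assumed here for illustration).
models = {
    "Pythia-2.8B": 2.8e9,
    "LLaMA (6.7B)": 6.7e9,
    "LLaMA (13B)": 13.0e9,
}

for name, num_params in models.items():
    needed = weight_memory_gb(num_params)
    verdict = "may fit" if needed < ACCELERATOR_GB else "does not fit"
    print(f"{name}: ~{needed:.1f} GB of weights -> {verdict} in {ACCELERATOR_GB:.0f} GB")
```

Because the estimate leaves out everything except the weights, a model that only barely fits on paper can still run out of memory in practice.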

## Open-source LLaMA

Released in February 2023 by Meta AI, LLaMA is the latest-and-greatest in open-source pre-trained LLMs. The LLaMA models range in size from 7 billion to 65 billion parameters.

Since the models' release, they have been finetuned by researchers to improve functionality for a variety of use cases. Some prominent derivatives of LLaMA are:

* **Vicuna**: Finetuned for TODO
* **Alpaca**: Finetuned for TODO
* **TODO**:
 
While LLaMA and its derivative models can be used with the HuggingFace framework, the Colab runtime you are on right now doesn't have a big enough accelerator to load any of these models.

Instead, we have loaded LLaMA on LTI's compute cluster, and you can use the code below to query the cluster. Note that the server being queried in the code below will be taken down at the end of the tutorial.

### Why use this?
TODO

### Why not use this?
TODO

### Setup Instructions
TODO

## Cohere's models

## Anthropic's models

## OpenAI's GPT-3/4

TODO: Add brief summary of these models

### Why use these?
OpenAI's models have become the industry standard for large language models.
They have an API which is very easy to use. Since the models all get run on OpenAI's servers, you don't need your own compute resources, and there is practically no setup involved to get started.

### Why not use these?
OpenAI's models are not open-source and only accessible through an API. There is relatively little information available on how they were trained or what data they were trained on. OpenAI may change the models being used under the hood by the API, and you as a user will not know it. This makes them a bad choice for fully reproducible research.

The OpenAI API costs money to use. See [this link](https://openai.com/pricing) for pricing details.

### Setup Instructions
1. Go to www.openai.com and create an account.
2. Go to https://platform.openai.com/account/api-keys and click the "Create new secret key" button. It doesn't matter what you name it.
3. Copy the secret key and paste it into the code block below. Then run the code block.

In [None]:
#@title Run this code block to use GPT-3 for Unit 1.
MODEL_NAME = "text-davinci-003" #@param ["gpt-4", "text-davinci-003", "text-curie-001", "text-babbage-001", "text-ada-001"]
SECRET_KEY = "Paste your secret key here." #@param {type:"string"}


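# The code below uses the legacy `openai.Completion` interface, which was
# removed in version 1.0 of the openai package, so pin an older release.
!pip install "openai<1.0"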
import openai
openai.api_key = SECRET_KEY

class OpenAIEngine(Engine):
  def __init__(self, model_name):
    self.model_name = model_name

  def score(self, text):
    # yiming: The API does not return logprobs for tokens in the prompt, so
    # implementing score() may be impossible
    raise NotImplementedError("TODO(daphne): fill this in")

  def generate(self, prompt, top_p, num_tokens):
    response = openai.Completion.create(
      engine=self.model_name,
      prompt=prompt,
      temperature=1.0,
      max_tokens=num_tokens,
      top_p=top_p,
      frequency_penalty=0.0,
      presence_penalty=0.0,
      logprobs=1
    )
    return response["choices"][0]["text"]

engine = OpenAIEngine(MODEL_NAME)

## 1.2 Analyzing Likelihoods

As we discussed in the lecture, language models take as input a prompt sequence and then output a score for each token in the vocabulary.

A higher score means the model is more confident that this token fits as the next token in the sequence.

In this unit, we will inspect token likelihoods to build an understanding of why models give tokens higher or lower scores. We will also learn how to compute perplexity, a standard metric people use to measure how well models have learned to represent language.
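
To make the perplexity metric concrete, here is a minimal sketch of how it can be computed from per-token scores. It assumes the scores are natural-log likelihoods, like the ones the `score()` method of an `Engine` is documented to return; the `perplexity` helper and the example numbers are made up for illustration.

```python
import math


def perplexity(token_log_likelihoods):
    """Perplexity = exp of the negative mean log-likelihood per token.

    Assumes natural-log scores. Lower perplexity means the model found
    the text less surprising.
    """
    if not token_log_likelihoods:
        raise ValueError("Need at least one token score.")
    mean_ll = sum(token_log_likelihoods) / len(token_log_likelihoods)
    return math.exp(-mean_ll)


# Hypothetical scores for a three-token sequence, where the model assigned
# probabilities 0.5, 0.25 and 0.125 to the tokens that actually occurred.
example_scores = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(f"Perplexity: {perplexity(example_scores):.2f}")
```

One way to read the result: a perplexity of about 4 means the model was, on average, as uncertain as if it were choosing uniformly among 4 equally likely tokens at each step.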

In [None]:
# TODO

## 1.3 Controlling the Amount of Randomness During Generation

In this section, we will investigate the impact the `top_p` parameter has on the text that a model generates.
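
As a rough illustration of what `top_p` controls, the sketch below implements top-p (nucleus) sampling over a toy next-token distribution: keep the most likely tokens until their cumulative probability reaches `top_p`, renormalize, and sample only from that set. The function names and the probabilities are invented for this example, and real engines may differ in details such as how ties are handled.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches `top_p`, then renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_top_p(probs, top_p, rng=random):
    """Sample one token from the top-p filtered distribution."""
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (made up for illustration).
next_token_probs = {" the": 0.5, " a": 0.3, " this": 0.15, " zebra": 0.05}

for top_p in (0.5, 0.7, 1.0):
    print(top_p, sorted(top_p_filter(next_token_probs, top_p)))
print("Sampled:", sample_top_p(next_token_probs, top_p=0.7))
```

With a small `top_p` only the most likely tokens can ever be chosen, so generations become more predictable; with `top_p=1.0` every token stays in play and the output is more varied.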

In [None]:
# TODO

## 1.4 Creating a Classifier using Few-Shot Learning

In this section, we will build a binary sentiment classifier using an LLM such
as GPT-3.

Writing the perfect prompt is more an art than a science, but a prompt often has
two parts:
- An `instruction` string that instructs the model on how to complete the task.
Adding it is often effective, because models such as OpenAI's `text-davinci-003`
are fine-tuned to follow user instructions.
- Several `demonstration` strings that give examples of completing the task.

For example, a prompt that translates English words to Chinese could look like
the following. Running GPT-3 to complete the prompt provides the correct answer
"松鼠".

```
Translate English to Chinese.

dog -> 狗
apple -> 苹果
coffee -> 咖啡
supermarket -> 超市
squirrel ->
```

Now, let's build a binary sentiment classifier for [Yelp reviews](https://huggingface.co/datasets/yelp_polarity/).

In [None]:
!pip install datasets

import datasets

yelp = datasets.load_dataset("yelp_polarity")
train_data = yelp["train"].shuffle(seed=1).select(range(4))
test_data = yelp["test"].shuffle(seed=1).select(range(20))

print("Data format:", train_data[1])


In [None]:
# convert integer labels to text
label_map = {
    0: " negative",
    1: " positive"
}
# It might seem strange to put a space before "negative". This is due to how
# GPT-3's tokenizer (BPE) works: " negative" is considered one token, and is
# observed much more frequently than "negative" without the whitespace.

instruction = "Classify the sentiment of these yelp reviews as positive or negative.\n\n"
demo_template = "Review: {review}\nSentiment:{sentiment}\n\n"

# construct the prompt by concatenating instructions and templates
prompt_parts = [instruction]
for instance in train_data:
    review = instance["text"]
    sentiment = label_map[instance["label"]]
    prompt_parts.append(demo_template.format(review=review, sentiment=sentiment))
prompt = ''.join(prompt_parts)


print("YOUR PROMPT:", prompt, sep='\n')

In [None]:
# now we can feed the prompt to GPT-3 to classify a new review!

eval_template = "Review: {review}\nSentiment:"

def classify_review(review: str) -> str:
  """ Classify a single Yelp review """
  classify_prompt = prompt + eval_template.format(review=review)
  response = openai.Completion.create(
    engine=MODEL_NAME,
    prompt=classify_prompt,
    temperature=0.0, # <- only need the top-probability answer for classification
    max_tokens=1,
    logit_bias={
      "3967": 100,
      "4633": 100,
    } # <- whitelist only tokens that correspond to a label
  )
  return response["choices"][0]["text"]


def evaluate(verbose=False) -> None:
  """ Evaluate your prompt on the test set """
  correct = []
  for i, instance in enumerate(test_data):
    review = instance["text"]
    label = label_map[instance["label"]]
    predicted = classify_review(review)
    correct.append(1 if label == predicted else 0)

    # Only print per-example details when asked to.
    if verbose:
      print(f"======== {i + 1} / {len(test_data)} ========")
      print(f"REVIEW: {review}")
      print(f"LABEL:{label}")
      print(f"PREDICTED:{predicted}")

  acc = sum(correct) / len(correct)
  print(f"Accuracy of your prompt on {len(test_data)} test examples: {acc:.0%}")


evaluate(verbose=True) # Expect 100% performance

# Unit 2: Models Tuned for Instruction-Following and Dialog

## 2.1 Choosing a model

## 2.2 Comparing Behaviour of Pre-trained and Tuned Models

## 2.3 Building a Persona Bot with In-Context Learning

# Scratch
This is where Daphne is keeping random code pieces which will not end up in the tutorial.

In [None]:
engine.generate("What can", top_p=1.0, num_tokens=4)