<a href="https://colab.research.google.com/github/adammuhtar/llm-notebooks/blob/main/notebooks/alpaca-lora-7b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Alpaca-LoRA 🦙🌲: Low-Rank Adaptation of the Stanford Alpaca model**


The development of large language models (LLMs) have been nothing short of remarkable, revolutionising the field of natural language processing (NLP) and potentially moving humanity slightly closer towards building an artificial general intelligence. With the advent of large pre-trained models such as GPT-3 and GPT-4, these models are becoming increasingly sophisticated, with the latest models leveraging on billions of parameters and demonstrating impressive language processing capabilities. Equally as exciting is the growing trend towards making these models more accessible to researchers and developers alike, with many pre-trained models becoming freely available.

[**LLaMA**](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)

One such language model is [LLaMA (Large Language Model Meta AI)](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) developed by researchers at Meta AI. Smaller but equally performant models such as LLaMA allows the wider community to access this transformative technology without requiring prohibitive amounts of infrastructure and resources to run it. LLaMA is available at several sizes (7B, 13B, 33B, and 65B parameters) and keeping in theme with this series of notebooks, we will run the smallest model, LLaMA 7B (trained on one trillion tokens) to showcase the feasibility of running LLMs on consumer hardware.

[**Alpaca**](https://crfm.stanford.edu/2023/03/13/alpaca.html)

The version of LLaMA this notebook will be running is [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), developed by researchers at Stanford University. The model has been fine-tuned from the LLaMA 7B model based on 52,000 instruction-following demonstrations and researchers' preliminary evaluations of the model indicates that Alpaca performs similarly to OpenAI’s [text-davinci-003](https://platform.openai.com/docs/models/model-endpoint-compatibility) for single-turn instruction following.

[**LoRA**](https://arxiv.org/abs/2106.09685)

Another key concept applied here is the application of LoRA, short for [Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685). Introduced by researchers from Microsoft, this technique addresses the high computational cost associated with fine-tuning LLMs, as they are often computationally expensive to train and deploy, and their fine-tuning process can be even more resource-intensive, limiting their practical applicability. LoRA instead approaches fine-tuning in a more efficient and cost-effective manner by utilising low-rank approximation techniques to reduce the computational complexity of the adaptation process. The approach freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks - this makes LoRA a more efficient adaptation of LLMs without significant hits to their performance. LoRA has shows promise as a practical solution for enhancing the scalability and efficiency of LLMs. LoRA is subclass of a new fine-tuning concept called [Parameter-Efficient Fine-tuning (PEFT)](https://huggingface.co/blog/peft), which yields LLM performance comparable to full fine-tuning while only having a small number of trainable parameters.

---

*References*:
* Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. *arXiv* preprint arXiv:2106.09690.
* Mangrulkar, S. & Paul. S. (2023, February 10). 🤗 PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware. *Hugging Face Blog*. https://huggingface.co/blog/peft
* Meta AI. (2023, February 23). Introducing LLaMA: A foundational, 65-billion-parameter large language model. *Meta AI Blog*. https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

---

## **Table of Contents**

* [1. Notebook setup](#section-1)
* [2. Load LLaMa tokeniser and fine-tuned Alpaca-LoRA model](#section-2)
* [3. Generating text](#section-3)

## 1. Notebook setup <a name="section-1"></a>

This notebook is run using Google Colaboratory (Colab) - Colab is Google's implementation of [Jupyter Notebooks](https://jupyter.org/). This notebook has the following packages installed:
* `python==3.9.16`
* `bitsandbytes==0.38.1`
* `datasets==2.11.0`
* `loralib==0.1.1`
* `peft==0.3.0.dev0`
* `sentencepiece==0.1.98`
* `torch==2.0.0+cu118`
* `transformers==4.28.1`

The `bitsandbytes`, `datasets`, `loralib`, `sentencepiece`, and `transformers` libraries will need to be manually installed into the Colab environment (pip install by running a shell command), following the requirements from the core [`alpaca-lora`](https://github.com/tloen/alpaca-lora/blob/main/requirements.txt) repo. Running this requires hardware accelerators to access higher RAM runtimes; GPU specifications should at least match the Tesla T4 GPU (16 GB GDDR6 @ 320 GB/s), which is available for free in Google Colab.

Replicating this notebook for larger Dolly 2.0 models (e.g [`decapoda-research/llama-13b-hf`](https://huggingface.co/decapoda-research/llama-13b-hf) or [`decapoda-research/llama-30b-hf`](https://huggingface.co/decapoda-research/llama-30b-hf)) on Colab will require Colab Pro, using hardware such as the A100 Tensor Core GPUs.

In [None]:
# Query GPU device status/details
!nvidia-smi

Sun Apr 16 03:55:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check IP address details if there are restrictions running non-local servers
!curl ipinfo.io

{
  "ip": "34.67.31.176",
  "hostname": "176.31.67.34.bc.googleusercontent.com",
  "city": "Council Bluffs",
  "region": "Iowa",
  "country": "US",
  "loc": "41.2324,-95.8751",
  "org": "AS396982 Google LLC",
  "postal": "51501",
  "timezone": "America/Chicago",
  "readme": "https://ipinfo.io/missingauth"
}

In [1]:
!pip install bitsandbytes datasets loralib sentencepiece transformers --quiet
!pip install git+https://github.com/huggingface/peft.git --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.3/104.3 MB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 kB[0m [31m16.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m34.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.9/132.9 kB[0m [31m18.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m200.1/200.1 kB[0m [31m24.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.2/212.2 kB[0m [31m23.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
# Standard library imports
import textwrap

# Third-party imports
from peft import PeftModel
import torch
from transformers import LlamaTokenizer , LlamaForCausalLM, GenerationConfig


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


In [None]:
# Check available GPUs for computation
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    # Print details of all available GPUs
    for i in range(num_gpus):
        gpu_props = torch.cuda.get_device_properties(i)
        print(f"Device details for GPU {i+1}:")
        print(f"* Name: {gpu_props.name}")
        print(f"* Memory size: {round(gpu_props.total_memory / 1024**3, 2)} GB")
        if i == num_gpus-1:
            continue
        else:
            print("-"*79)
    # Get the currently active GPU device and print its name and memory size
    active_gpu = torch.cuda.current_device()
    active_gpu_props = torch.cuda.get_device_properties(active_gpu)
    print("="*79)
    print(f"Currently active GPU device: {active_gpu_props.name}")
    print(f"Memory size: {round(active_gpu_props.total_memory / 1024**3, 2)} GB")
    print("="*79)
else:
    print("No GPU devices found.")

Device details for GPU 1:
* Name: Tesla T4
* Memory size: 14.75 GB
Currently active GPU device: Tesla T4
Memory size: 14.75 GB


## 2. Load LLaMA tokeniser and fine-tuned Alpaca-LoRA model <a name="section-2"></a>

There are a number of base and fine-tuned models that we can choose from. Here are a number of possible entries (non-exhaustive) to choose from:
* Possible values for `base_model`:
    * [`decapoda-research/llama-7b-hf`](https://huggingface.co/decapoda-research/llama-7b-hf)
    * [`decapoda-research/llama-13b-hf`](https://huggingface.co/decapoda-research/llama-13b-hf)
    * [`decapoda-research/llama-30b-hf`](https://huggingface.co/decapoda-research/llama-30b-hf)
* Possible values for `finetuned_model`:
    * [`tloen/alpaca-lora-7b`](https://huggingface.co/tloen/alpaca-lora-7b)
    * [`chansung/gpt4-alpaca-lora-7b`](https://huggingface.co/chansung/gpt4-alpaca-lora-7b)
    * [`chansung/alpaca-lora-13b`](https://huggingface.co/chansung/alpaca-lora-13b)

*N.B.: the size of model run depends on the RAM runtimes capacity*

This section runs [`tloen/alpaca-lora-7b`](https://huggingface.co/tloen/alpaca-lora-7b) from Eric J. Wang. Once chosen, we set up the tokeniser and model objects as follows:

* The `tokeniser` is created using `LlamaTokenizer` from the latest `transformers` library and loaded with the LLaMA tokeniser from the `base_model` model checkpoint.
* `model` is created using `LlamaForCausalLM` from the latest `transformers` library and loaded with the `base_model` checkpoint. `load_in_8bit` argument is set to True, which loads the model in 8-bit mode to reduce memory usage by half with no noticeable loss in quality - this is useful when your GPU is not large enough to fit the uncompressed model. `device_map` is set to "auto" to automatically select the device (CPU or GPU) to run the model on.

In [3]:
# Choose which model to run
base_model = "decapoda-research/llama-7b-hf" #@param ["decapoda-research/llama-7b-hf", "decapoda-research/llama-13b-hf", "decapoda-research/llama-30b-hf"]
finetuned_model = "tloen/alpaca-lora-7b" #@param ["tloen/alpaca-lora-7b", "chansung/alpaca-lora-13b", "chansung/gpt4-alpaca-lora-7b"]

# Load tokeniser, base model and fine-tuned model
tokeniser = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, finetuned_model)

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


Downloading (…)lve/main/config.json:   0%|          | 0.00/427 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/25.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/33 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00002-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00003-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00004-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00005-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00006-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00007-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00008-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00009-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00010-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00011-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00012-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00013-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00014-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00015-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00016-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00017-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00018-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00019-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00020-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00021-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00022-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00023-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00024-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00025-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00026-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00027-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00028-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00029-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00030-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00031-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00032-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00033-of-00033.bin:   0%|          | 0.00/524M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)/adapter_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

Downloading adapter_model.bin:   0%|          | 0.00/67.2M [00:00<?, ?B/s]

In [4]:
def generate_prompt(instruction, input=None):
    """
    Generate a prompt for a given instruction and optional input.

    Args:
        * instruction (`str`): The main instruction for the prompt.
        * input (`str`, optional): Additional input that provides context for
        the task. Defaults to None.

    Returns:
        `str`: A prompt that includes the instruction, input (if provided), and
        a space for the response.
    """
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
        
        ### Instruction:
        {instruction}
        
        ### Input:
        {input}
        
        ### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
        
        ### Instruction:
        {instruction}
        
        ### Response:"""

def alpaca_speak(
    context = None,
    temperature: float = 0.7,
    top_p: float = 0.95,
    repetition_penalty: float = 1.2,
    max_new_tokens: int = 512,
    width: int = 100
):
    """
    This function prompts the user to enter a prompt and generates responses
    using the fine-tuned Alpaca-LoRA model

    Args:
        * context (`str`): Optional. A string that provides additional context
        to the prompt. Default is None.
        * temperature (`float`): Optional. A value that controls the
        "creativity" of the generated sequences. Represents the degree of
        randomness in the generated text. A higher temperature value leads to
        more diverse and unpredictable sequences, while a lower value leads to
        more conservative and predictable sequences (e.g. a value of 1.0
        represents maximum randomness). In this function, the temperature value
        is set to 0.7, which means that the generated sequences will be
        moderately creative.
        * top_p (`float`): Optional. A value that controls the "safety" of the
        generated sequences. Represents the maximum cumulative probability
        allowed for the generated tokens. A higher top_p value leads to more
        conservative and safe sequences, while a lower value leads to more
        diverse and unpredictable sequences (e.g. a value of 1.0 means that all
        tokens with non-zero probability are considered). Default is 0.95.
        * repetition_penalty (`float`): Optional. A value that controls the
        "repetition" of the generated sequences, penalsing the model for
        repeating the same tokens in a sequence. A higher repetition penalty
        value leads to fewer repetitions in the generated sequences, and vice
        versa. Default is 1.2.
        * max_new_tokens (`int`): maximum number of new tokens that can be
        generated by the model in each response. Defaults to 512.
        * width (`int`): Optional. The maximum number of characters
        allowed in a single line of the generated text. Default is 100.
    
    Example usage:
    # Generate 5 responses using a prompt and additional context
    `alpaca_speak(context="I love to play video games", n=5)`
    """
    input_prompt = input("Prompt: ")
    print("-"*100)
    prompt = generate_prompt(input_prompt, context)
    inputs = tokeniser(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty
    )
    print("Response:\n")
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens
    )
    for s in generation_output.sequences:
        output = tokeniser.decode(s)
        print(
            textwrap.fill(
                output.split("### Response:")[1].strip(),
                width=width
            )
        )
    print("-"*100)

## 3. Generating text <a name="section-3"></a>

In this section of the notebook, we'll be working through some examples of various tasks to see how well Alpaca-LoRA performs. Note that these are not meant to be comprehensive or robust tests, but simply anecdotal examples of localised prompts. Compared to other comparable publicly available LLMs such as [`dolly-v2-7b`](https://huggingface.co/databricks/dolly-v2-3b), Alpaca is expected to perform somewhat better as the base LLaMa models are trained with larger set of tokens (1 trillion as opposed to [Eleuther AI's Pythia](https://huggingface.co/EleutherAI/pythia-6.9b)'s 300 billion tokens).

### Test 1: Open Q&A I

> **Prompt: What are Newton's three laws of motion?**

In [5]:
alpaca_speak()

Prompt: What are Newton's three laws of motion?
----------------------------------------------------------------------------------------------------
Response:

Newton’s Three Laws of Motion state that objects will remain at rest or move in straight lines
unless acted upon by another force, every action has an equal and opposite reaction, and for every
action there is an equal but opposite reaction.
----------------------------------------------------------------------------------------------------


### Test 2: Open Q&A II

> **Prompt: What is the Pythagoras Theorem?**

In [7]:
alpaca_speak()

Prompt: What is the Pythagoras Theorem?
----------------------------------------------------------------------------------------------------
Response:

The Pythagorean theorem states that in any right triangle, the square of the hypotenuse (the side
opposite to the right angle) equals the sum of the squares of the other two sides. This equation can
be written as A^2 + B^2 = C^2 where A and B are the lengths of the legs of the triangle and C is the
length of the hypotenuse.
----------------------------------------------------------------------------------------------------


### Test 3: Text Summarisation

> **Prompt: Summarise this text: Spectacular photographs of a Canary Wharf skyscraper being completed have been rediscovered more than 30 years after they were taken.
Amid strong winds, Tony Brien sat in a wooden box suspended by a crane some 250m above east London to capture the images of One Canada Square. Previously believed to have been lost, the photographs were uncovered during a search of Mr Brien's archives. He said he was "completely staggered" to find the photos from November 1990. Mr Brien said he had worked on the Canary Wharf project before when he was approached to photograph the "topping out" of One Canada Square. "I said 'fine', not realising quite at that time that the only way I could really do it was to go up in a bucket or a crate from the ground which was attached to a crane," he explained.**

Source: BBC. (2023, April 15). Canary Wharf: Spectacular photos of skyscraper rediscovered. *BBC News*. https://www.bbc.co.uk/news/uk-england-london-65274803

In [8]:
alpaca_speak(max_new_tokens=256)

Prompt: Summarise this text: Spectacular photographs of a Canary Wharf skyscraper being completed have been rediscovered more than 30 years after they were taken. Amid strong winds, Tony Brien sat in a wooden box suspended by a crane some 250m above east London to capture the images of One Canada Square. Previously believed to have been lost, the photographs were uncovered during a search of Mr Brien's archives. He said he was "completely staggered" to find the photos from November 1990. Mr Brien said he had worked on the Canary Wharf project before when he was approached to photograph the "topping out" of One Canada Square. "I said 'fine', not realising quite at that time that the only way I could really do it was to go up in a bucket or a crate from the ground which was attached to a crane," he explained.
----------------------------------------------------------------------------------------------------
Response:

The rediscovery of spectacular photographs of One Canada Square being

### Test 4: Brainstorming

> **Prompt: What are some fun activities a family can do along the River Thames?**

In [9]:
alpaca_speak()

Prompt: What are some fun activities a family can do along the River Thames?
----------------------------------------------------------------------------------------------------
Response:

Some fun activities for families to do on the River Thames include taking a boat tour, visiting
museums and galleries such as Tate Modern or London Dungeon, going for a picnic in one of the many
parks lining the riverbank, watching street performers at Covent Garden, exploring the markets
around Borough Market, and enjoying a meal together at one of the many restaurants located near the
waterfront.
----------------------------------------------------------------------------------------------------


### Test 5: Creative Writing I

> **Prompt: Write a short story about Buzz Lightyear's adventure to get flowers for Jessie**

In [10]:
alpaca_speak()

Prompt: Write a short story about Buzz Lightyear's adventure to get flowers for Jessie
----------------------------------------------------------------------------------------------------
Response:

Once upon a time, there was a brave and courageous little man named Buzz Lightyear who lived in the
Wild West town of Cactusville. He had been given his own space ship by his parents when he turned 10
years old so they could go on vacation without him. But instead of going on vacations with them,
Buzz decided to explore the galaxy! One day while exploring, he stumbled across a strange planet
filled with beautiful purple and pink flowers. He knew right away this would be perfect gift for his
best friend back home - Jessie. So off he went to find these magical flowers. After many days of
searching, he finally found what he was looking for deep within the jungle. With great excitement,
he picked up all the flowers and headed straight back towards Earth. When he arrived at Cactusville,
everyone

### Test 6: Creative Writing II

> **Prompt: Write a short murder mystery about John Lennon and the Backstreet Boys**

In [11]:
alpaca_speak()

Prompt: Write a short murder mystery about John Lennon and the Backstreet Boys
----------------------------------------------------------------------------------------------------
Response:

The night was dark, but there were still plenty of people out on the streets. It had been months
since anyone heard from John Lennon or his bandmates in the Backstreet Boys, so when they suddenly
appeared at a local bar it caused quite a stir among the patrons.                             Little
did any of them know what would happen next! As soon as John stepped into the room he seemed to have
some sort of strange power over everyone around him - even though none of them knew who he really
was. He quickly made friends with all the locals, and before long he'd managed to convince most of
them to join him for a late-night stroll through town...                               But little
did these unsuspecting citizens realize just how dangerous their new friend could be! Before too
long, one by one ea

### Test 7: Creative Writing III

> **Prompt: Write a short story about the Presidency of Alexander Hamilton in an alternate reality where he had become the US president**

In [12]:
alpaca_speak()

Prompt: Write a short story about the Presidency of Alexander Hamilton in an alternate reality where he had become the US president
----------------------------------------------------------------------------------------------------
Response:

In this alternate universe, Alexander Hamilton was elected President of the United States after his
successful tenure as Secretary of Treasury under George Washington’s administration. He ran on a
platform of fiscal responsibility and economic growth, promising to continue the policies set forth
by Washington while also implementing new ones. His campaign slogan “Hamilton for Progress”
resonated with voters across all demographics, leading him to victory over his opponent, Thomas
Jefferson.   As President, Hamilton focused primarily on domestic policy issues such as tax reform,
infrastructure development, education funding, healthcare accessibility, and immigration reform. In
addition, he worked closely with Congress to pass legislation aimed at 

## Testing GPT4 Alpaca-LoRA 7B

Another LoRA LLM that recently appeared is [`chansung/gpt4-alpaca-lora-7b`](https://huggingface.co/chansung/gpt4-alpaca-lora-7b). While interesting, the model responses - at least the way its set up in this notebook - were verbose and of poorer quality than [`tloen/alpaca-lora-7b`](https://huggingface.co/tloen/alpaca-lora-7b).

In [3]:
# Choose which model to run
base_model = "decapoda-research/llama-7b-hf" #@param ["decapoda-research/llama-7b-hf", "decapoda-research/llama-13b-hf", "decapoda-research/llama-30b-hf"]
finetuned_model = "chansung/gpt4-alpaca-lora-7b" #@param ["tloen/alpaca-lora-7b", "chansung/alpaca-lora-13b", "chansung/gpt4-alpaca-lora-7b"]

# Load tokeniser, base model and fine-tuned model
tokeniser = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, finetuned_model)

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


Downloading (…)lve/main/config.json:   0%|          | 0.00/427 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/25.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/33 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00002-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00003-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00004-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00005-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00006-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00007-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00008-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00009-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00010-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00011-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00012-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00013-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00014-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00015-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00016-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00017-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00018-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00019-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00020-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00021-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00022-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00023-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00024-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00025-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00026-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00027-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00028-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00029-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00030-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00031-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00032-of-00033.bin:   0%|          | 0.00/405M [00:00<?, ?B/s]

Downloading (…)l-00033-of-00033.bin:   0%|          | 0.00/524M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)/adapter_config.json:   0%|          | 0.00/428 [00:00<?, ?B/s]

Downloading adapter_model.bin:   0%|          | 0.00/67.2M [00:00<?, ?B/s]

In [7]:
def generate_prompt(instruction, input=None):
    """
    Generate a prompt for a given instruction and optional input.

    Args:
        * instruction (`str`): The main instruction for the prompt.
        * input (`str`, optional): Additional input that provides context for
        the task. Defaults to None.

    Returns:
        `str`: A prompt that includes the instruction, input (if provided), and
        a space for the response.
    """
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
        
        ### Instruction:
        {instruction}
        
        ### Input:
        {input}
        
        ### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
        
        ### Instruction:
        {instruction}
        
        ### Response:"""

def alpaca_speak(
    context = None,
    temperature: float = 0.7,
    top_p: float = 0.95,
    repetition_penalty: float = 1.2,
    max_new_tokens: int = 512,
    width: int = 100
):
    """
    This function prompts the user to enter a prompt and generates responses
    using the fine-tuned Alpaca-LoRA model

    Args:
        * context (`str`): Optional. A string that provides additional context
        to the prompt. Default is None.
        * temperature (`float`): Optional. A value that controls the
        "creativity" of the generated sequences. Represents the degree of
        randomness in the generated text. A higher temperature value leads to
        more diverse and unpredictable sequences, while a lower value leads to
        more conservative and predictable sequences (e.g. a value of 1.0
        represents maximum randomness). In this function, the temperature value
        is set to 0.7, which means that the generated sequences will be
        moderately creative.
        * top_p (`float`): Optional. A value that controls the "safety" of the
        generated sequences. Represents the maximum cumulative probability
        allowed for the generated tokens. A higher top_p value leads to more
        conservative and safe sequences, while a lower value leads to more
        diverse and unpredictable sequences (e.g. a value of 1.0 means that all
        tokens with non-zero probability are considered). Default is 0.95.
        * repetition_penalty (`float`): Optional. A value that controls the
        "repetition" of the generated sequences, penalsing the model for
        repeating the same tokens in a sequence. A higher repetition penalty
        value leads to fewer repetitions in the generated sequences, and vice
        versa. Default is 1.2.
        * max_new_tokens (`int`): maximum number of new tokens that can be
        generated by the model in each response. Defaults to 512.
        * width (`int`): Optional. The maximum number of characters
        allowed in a single line of the generated text. Default is 100.
    
    Example usage:
    # Generate 5 responses using a prompt and additional context
    `alpaca_speak(context="I love to play video games", n=5)`
    """
    input_prompt = input("Prompt: ")
    print("-"*100)
    prompt = generate_prompt(input_prompt, context)
    inputs = tokeniser(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty
    )
    print("Response:\n")
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens
    )
    for s in generation_output.sequences:
        output = tokeniser.decode(s)
        print(
            textwrap.fill(
                output.split("### Response:")[1].strip(),
                width=width
            )
        )
    print("-"*100)

### Test 1: Open Q&A I

> **Prompt: What are Newton's three laws of motion?**

In [8]:
alpaca_speak()

Prompt: What are Newton's three laws of motion?
----------------------------------------------------------------------------------------------------
Response:

Newton’s Three Laws of Motion state that every object in the universe tends to remain at rest or
move with constant velocity unless acted upon by some external force, and that any change in
momentum (mass times velocity) occurs only as a result of forces acting on it. The first law states
that objects will continue moving along their current path until they experience an unbalanced
force. This means that if there is no net force acting on them, then they will keep going forever
without changing direction.  The second law states that when two objects interact, the sum of all
forces acting on each object equals zero. In other words, for every action, there must be an equal
and opposite reaction. For example, when you push against something, it also pushes back against you
with the same amount of force.  Finally, the third law says

### Test 2: Open Q&A II

> **Prompt: What is the Pythagoras Theorem?**

In [9]:
alpaca_speak()

Prompt: What is the Pythagoras Theorem?
----------------------------------------------------------------------------------------------------
Response:

The Pythagorean theorem, also known as Pythagoras's theorem or simply "Pythagoras", states that in
any right triangle, the square of the hypotenuse (the side opposite the right angle) always equals
the sum of the squares of the other two sides. This relationship can be expressed mathematically by
the equation `a^2 + b^2 = c^2`, where 'c' represents the length of the hypotenuse and 'a', 'b'
represent lengths of the legs of the triangle. The theorem was first discovered around 500 BC by
Greek mathematician Pythagoras. It has been used extensively throughout history to solve problems
involving triangles, including finding missing sides when only two are given. In modern times, it
continues to play an important role in geometry, trigonometry, physics, engineering, architecture,
art, and many other fields.�  In summary, the Pythagoras theore

### Test 3: Text Summarisation

> **Prompt: Summarise this text: Spectacular photographs of a Canary Wharf skyscraper being completed have been rediscovered more than 30 years after they were taken.
Amid strong winds, Tony Brien sat in a wooden box suspended by a crane some 250m above east London to capture the images of One Canada Square. Previously believed to have been lost, the photographs were uncovered during a search of Mr Brien's archives. He said he was "completely staggered" to find the photos from November 1990. Mr Brien said he had worked on the Canary Wharf project before when he was approached to photograph the "topping out" of One Canada Square. "I said 'fine', not realising quite at that time that the only way I could really do it was to go up in a bucket or a crate from the ground which was attached to a crane," he explained.**

Source: BBC. (2023, April 15). Canary Wharf: Spectacular photos of skyscraper rediscovered. *BBC News*. https://www.bbc.co.uk/news/uk-england-london-65274803

In [11]:
alpaca_speak(max_new_tokens=256)

Prompt: Summarise this text: Spectacular photographs of a Canary Wharf skyscraper being completed have been rediscovered more than 30 years after they were taken. Amid strong winds, Tony Brien sat in a wooden box suspended by a crane some 250m above east London to capture the images of One Canada Square. Previously believed to have been lost, the photographs were uncovered during a search of Mr Brien's archives. He said he was "completely staggered" to find the photos from November 1990. Mr Brien said he had worked on the Canary Wharf project before when he was approached to photograph the "topping out" of One Canada Square. "I said 'fine', not realising quite at that time that the only way I could really do it was to go up in a bucket or a crate from the ground which was attached to a crane," he explained.
----------------------------------------------------------------------------------------------------
Response:

The story begins with the discovery of spectacular photographs of one

### Test 4: Brainstorming

> **Prompt: What are some fun activities a family can do along the River Thames?**

In [12]:
alpaca_speak(max_new_tokens=256)

Prompt: What are some fun activities a family can do along the River Thames?
----------------------------------------------------------------------------------------------------
Response:

1. Take a boat ride down the river and enjoy the scenic views of London's iconic landmarks such as
Big Ben, Westminster Abbey, Tower Bridge, and more.  2. Go for a picnic by the riverside or take
part in one of the many festivals held on its banks throughout the year.  3. Visit Shakespeare’s
Globe Theatre to watch a play performed outdoors with stunning views of the river.  4. Enjoy a
leisurely walk along the South Bank promenade while taking in the sights and sounds of this vibrant
city.  5. Explore the historic buildings lining the riverbanks including Tate Modern, The Shard, and
St Paul’s Cathedral.  6. Stop at one of the numerous restaurants and cafes located alongside the
river and treat yourself to delicious food and drinks overlooking the water.  7. Rent bicycles from
one of the many bike rent

### Test 5: Creative Writing I

> **Prompt: Write a short story about Buzz Lightyear's adventure to get flowers for Jessie**

In [13]:
alpaca_speak()

Prompt: Write a short story about Buzz Lightyear's adventure to get flowers for Jessie
----------------------------------------------------------------------------------------------------
Response:

Once upon a time, in a faraway galaxy called Andromeda, there lived a brave and courageous space
ranger named Buzz Lightyear who was on a mission to save his friend Jessie from the clutches of
Emperor Zurg.  Buzz had heard rumors that Emperor Zurg had captured Jessie and imprisoned her deep
within his castle. He knew he must act quickly if he wanted any chance at rescuing her before it was
too late. So with determination in his eyes and a smile on his face, Buzz set off on his journey. As
he traveled through the vastness of outer space, he encountered many obstacles along the way
including asteroids, meteors, and even some pesky alien creatures trying to stop him from reaching
his destination. But nothing could deter this fearless hero as he continued onward towards his goal.
Finally after 

### Test 6: Creative Writing II

> **Prompt: Write a short murder mystery about John Lennon and the Backstreet Boys**

In [14]:
alpaca_speak()

Prompt: Write a short murder mystery about John Lennon and the Backstreet Boys
----------------------------------------------------------------------------------------------------
Response:

John Lennon was one of the most famous musicians in history, known for his work with The Beatles as
well as his solo career. He had recently reunited with Paul McCartney to record new music together
when he received word from his manager that there were some issues at home.  He rushed back to New
York City where he found his wife Yoko Ono unconscious on their bedroom floor. She had been brutally
attacked by someone who had broken into her house while she slept. As John tried to revive her, he
heard strange noises coming from downstairs. Curious, he went to investigate only to find three
members of the popular boy band, the Backstreet Boys, standing over his wife's body holding knives.
They told him they wanted money but refused to say why or how much. When John asked them what would
happen if he di

### Test 7: Creative Writing III

> **Prompt: Write a short story about the Presidency of Alexander Hamilton in an alternate reality where he had become the US president**

In [15]:
alpaca_speak()

Prompt: Write a short story about the Presidency of Alexander Hamilton in an alternate reality where he had become the US president
----------------------------------------------------------------------------------------------------
Response:

In this alternative history, Alexander Hamilton was elected as President of the United States after
winning the 1804 election against Thomas Jefferson and Aaron Burr. He took office on March 5th, 1805
with his Vice-President being John Adams who served alongside him for two terms until his death in
July 1826. During his presidency, Hamilton focused primarily on domestic issues such as improving
infrastructure, expanding trade opportunities, and strengthening the nation's economy through
policies like the National Bank Act which established the First Bank of the United States. His
administration also saw the passage of several landmark pieces of legislation including the Embargo
Act of 1807, the Alien and Sedition Acts, and the Louisiana Purchase.