<a href="https://colab.research.google.com/github/adammuhtar/llm-notebooks/blob/main/notebooks/koala-lora-7b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Koala-LoRA 🐨🐻: Dialogue-Centric Language Model**

The development of large language models (LLMs) has been nothing short of remarkable, revolutionising the field of natural language processing (NLP) and potentially moving humanity slightly closer towards building an artificial general intelligence. With the advent of large pre-trained models such as GPT-3 and GPT-4, these models are becoming increasingly sophisticated, with the latest models leveraging billions of parameters and demonstrating impressive language processing capabilities. Equally exciting is the growing trend towards making these models more accessible to researchers and developers alike, with many pre-trained models becoming freely available.

[**LLaMA**](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)

One such language model is [LLaMA (Large Language Model Meta AI)](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/), developed by researchers at Meta AI. Smaller but equally performant models such as LLaMA allow the wider community to access this transformative technology without requiring prohibitive amounts of infrastructure and resources to run it. LLaMA is available in several sizes (7B, 13B, 33B, and 65B parameters) and, in keeping with the theme of this series of notebooks, we will run the smallest model, LLaMA 7B (trained on one trillion tokens), to showcase the feasibility of running LLMs on consumer hardware.

[**Koala**](https://bair.berkeley.edu/blog/2023/04/03/koala/)

The version of LLaMA this notebook will be running is [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/), developed by researchers at the University of California, Berkeley. [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) is a LLaMA base model fine-tuned on dialogue data scraped from the web and on public datasets, which include high-quality responses to user queries from other large language models, as well as question answering datasets and human feedback datasets. Human evaluations carried out by the researchers suggest that [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) is at least competitive with other existing models. This notebook uses the version of [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) provided by [Sam Witteveen](https://huggingface.co/samwit).

---

*References*:
* Berkeley Artificial Intelligence Research. (2023, April 3). Koala: A Dialogue Model for Academic Research. *BAIR Blog*. https://bair.berkeley.edu/blog/2023/04/03/koala/
* Meta AI. (2023, February 23). Introducing LLaMA: A foundational, 65-billion-parameter large language model. *Meta AI Blog*. https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

---

## **Table of Contents**

* [1. Notebook setup](#section-1)
* [2. Load LLaMA tokeniser and fine-tuned Koala model](#section-2)
* [3. Generating text](#section-3)

## 1. Notebook setup <a name="section-1"></a>

This notebook is run using Google Colaboratory (Colab) - Colab is Google's implementation of [Jupyter Notebooks](https://jupyter.org/). This notebook has the following packages installed:
* `python==3.9.16`
* `accelerate==0.18.0`
* `bitsandbytes==0.38.1`
* `datasets==2.11.0`
* `loralib==0.1.1`
* `sentencepiece==0.1.98`
* `torch==2.0.0+cu118`
* `transformers==4.28.1`

The `accelerate`, `bitsandbytes`, `datasets`, `loralib`, `sentencepiece`, and `transformers` libraries need to be installed manually into the Colab environment (with `pip install`, run as a shell command), similar to the requirements from the core [`alpaca-lora`](https://github.com/tloen/alpaca-lora/blob/main/requirements.txt) repo. Running the notebook requires a GPU runtime; the GPU should at least match the Tesla T4 (16 GB GDDR6 @ 320 GB/s), which is available for free in Google Colab.
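Since Colab images change over time, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the standard library; the `EXPECTED_VERSIONS` dictionary and the `check_versions` helper are illustrative names rather than part of the original notebook, and the `+cu118` suffix on `torch` is deliberately ignored because it depends on the CUDA build.

```python
from importlib import metadata

# Versions listed in this notebook; the torch local suffix ("+cu118") is
# build-specific, so only the base version is compared.
EXPECTED_VERSIONS = {
    "accelerate": "0.18.0",
    "bitsandbytes": "0.38.1",
    "datasets": "2.11.0",
    "loralib": "0.1.1",
    "sentencepiece": "0.1.98",
    "torch": "2.0.0",
    "transformers": "4.28.1",
}


def check_versions(expected):
    """Return {package: (expected, installed)} for each mismatch or missing package."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop any local build suffix such as "+cu118" before comparing
        base = installed.split("+")[0] if installed else None
        if base != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in check_versions(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {installed or 'not installed'}")
```

A mismatch does not necessarily mean the notebook will fail; it only flags where the environment differs from the one used here.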

Replicating this notebook for larger LLaMA models (e.g. [`decapoda-research/llama-13b-hf`](https://huggingface.co/decapoda-research/llama-13b-hf) or [`decapoda-research/llama-30b-hf`](https://huggingface.co/decapoda-research/llama-30b-hf)) on Colab will require Colab Pro, using hardware such as the A100 Tensor Core GPUs.

In [1]:
# Query GPU device status/details
!nvidia-smi

Sun Apr 16 14:03:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Check IP address details if there are restrictions running non-local servers
!curl ipinfo.io

{
  "ip": "34.105.5.29",
  "hostname": "29.5.105.34.bc.googleusercontent.com",
  "city": "The Dalles",
  "region": "Oregon",
  "country": "US",
  "loc": "45.5946,-121.1787",
  "org": "AS396982 Google LLC",
  "postal": "97058",
  "timezone": "America/Los_Angeles",
  "readme": "https://ipinfo.io/missingauth"
}

In [3]:
!pip install accelerate bitsandbytes datasets loralib sentencepiece transformers --quiet


In [4]:
# Standard library imports
import textwrap

# Third-party imports
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline

In [5]:
# Check available GPUs for computation
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    # Print details of all available GPUs
    for i in range(num_gpus):
        gpu_props = torch.cuda.get_device_properties(i)
        print(f"Device details for GPU {i+1}:")
        print(f"* Name: {gpu_props.name}")
        print(f"* Memory size: {round(gpu_props.total_memory / 1024**3, 2)} GB")
        # Print a separator between devices, but not after the last one
        if i < num_gpus - 1:
            print("-"*79)
    # Get the currently active GPU device and print its name and memory size
    active_gpu = torch.cuda.current_device()
    active_gpu_props = torch.cuda.get_device_properties(active_gpu)
    print("="*79)
    print(f"Currently active GPU device: {active_gpu_props.name}")
    print(f"Memory size: {round(active_gpu_props.total_memory / 1024**3, 2)} GB")
    print("="*79)
else:
    print("No GPU devices found.")

Device details for GPU 1:
* Name: Tesla T4
* Memory size: 14.75 GB
Currently active GPU device: Tesla T4
Memory size: 14.75 GB


## 2. Load LLaMA tokeniser and fine-tuned Koala model <a name="section-2"></a>

We set up the tokeniser and model objects as follows:

* The `tokeniser` is created using `LlamaTokenizer` from the `transformers` library and loaded with the LLaMA tokeniser from the `koala_model` checkpoint.
* The model (`base_model`) is created using `LlamaForCausalLM` from the `transformers` library and loaded with the `koala_model` checkpoint. The `load_in_8bit` argument is set to `True`, which loads the weights in 8-bit precision, roughly halving memory usage relative to 16-bit weights with little noticeable loss in quality; this is useful when the GPU is not large enough to fit the uncompressed model. `device_map` is set to `"auto"` so that the model's layers are placed automatically on the available devices (GPU first, then CPU).
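The memory saving from 8-bit loading can be checked with some quick arithmetic. The sketch below estimates the memory taken by the weights alone; it assumes a nominal 7 billion parameters and ignores activations, the generation cache and any layers that `bitsandbytes` keeps in higher precision, so the real footprint will be somewhat larger. The helper name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # assumption: nominal "7B" parameter count

for bits in (32, 16, 8):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 16-bit weights need about 13 GiB, which leaves almost no headroom on the 14.75 GB reported for the T4 above, while 8-bit weights need about 6.5 GiB.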

In [6]:
# Choose which model to run
koala_model = "samwit/koala-7b" #@param ["samwit/koala-7b", "TheBloke/koala-7B"]

# Load tokeniser and fine-tuned Koala model (in 8-bit precision)
tokeniser = LlamaTokenizer.from_pretrained(koala_model)
base_model = LlamaForCausalLM.from_pretrained(
    koala_model,
    load_in_8bit=True,
    device_map="auto"
)

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/881 [00:00<?, ?B/s]



Downloading (…)lve/main/config.json:   0%|          | 0.00/550 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/14 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00014.bin:   0%|          | 0.00/1.96G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00014.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00014.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00004-of-00014.bin:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Downloading (…)l-00005-of-00014.bin:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

Downloading (…)l-00006-of-00014.bin:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Downloading (…)l-00007-of-00014.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00008-of-00014.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00009-of-00014.bin:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Downloading (…)l-00010-of-00014.bin:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

Downloading (…)l-00011-of-00014.bin:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Downloading (…)l-00012-of-00014.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00013-of-00014.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00014-of-00014.bin:   0%|          | 0.00/1.69G [00:00<?, ?B/s]


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [13]:
def koala_speak(
    temperature: float = 0.7,
    top_p: float = 0.95,
    repetition_penalty: float = 1.2,
    max_length: int = 512,
    width: int = 100
):
    """
    Prompts the user for input, generates a single response using the
    fine-tuned Koala model and prints it wrapped to a fixed line width.

    Args:
        * temperature (`float`): Optional. Controls the randomness of the
        generated text by rescaling the token probabilities. Higher values
        lead to more diverse and unpredictable sequences, while lower values
        lead to more conservative and predictable sequences (a value of 1.0
        leaves the model's probabilities unchanged). Default is 0.7, which
        gives moderately creative sequences.
        * top_p (`float`): Optional. Nucleus sampling threshold: only the
        most likely tokens whose cumulative probability reaches top_p are
        considered. A lower top_p value leads to more conservative and
        predictable sequences, while a higher value leads to more diverse
        sequences (e.g. a value of 1.0 means that all tokens with non-zero
        probability are considered). Default is 0.95.
        * repetition_penalty (`float`): Optional. Penalises the model for
        repeating tokens that already appear in the sequence. A higher value
        leads to fewer repetitions in the generated sequences, and vice
        versa (1.0 means no penalty). Default is 1.2.
        * max_length (`int`): Optional. The maximum total length of the
        sequence in tokens, counting both the prompt and the generated
        response. Default is 512.
        * width (`int`): Optional. The maximum number of characters
        allowed in a single line of the printed text. Default is 100.
    """
    input_prompt = input("Prompt: ")
    print("-"*100)
    print("Response:\n")
    # Pass the function arguments through to the pipeline. Note that
    # temperature and top_p only take effect when sampling is enabled
    # (do_sample=True); it is left at the model's default here.
    pipe = pipeline(
        "text-generation",
        model=base_model,
        tokenizer=tokeniser,
        max_length=max_length,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty
    )
    output = pipe(f"BEGINNING OF CONVERSATION: USER: {input_prompt}")
    # Split the generated text into lines based on newline characters
    lines = output[0]["generated_text"].split("\n")
    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)
    print(wrapped_text)
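To make the effect of `temperature` and `top_p` more concrete, here is a small standalone sketch on a toy distribution. It is not how `transformers` implements these options internally; the logits are made-up values for four hypothetical tokens, and the helpers `apply_temperature` and `top_p_filter` are illustrative names.

```python
import math


def apply_temperature(logits, temperature):
    """Convert logits to probabilities with a softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

for t in (0.3, 0.7, 1.5):
    probs = apply_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

probs = apply_temperature(toy_logits, 0.7)
for p in (0.5, 0.95, 1.0):
    print(f"top_p={p}: keeps tokens {sorted(top_p_filter(probs, p))}")
```

Lower temperatures concentrate probability on the most likely token, and lower `top_p` values shrink the set of candidate tokens, so both push generation towards more predictable text.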

## 3. Generating text <a name="section-3"></a>

In this section of the notebook, we'll be working through some examples of various tasks to see how well Koala performs. Note that these are not meant to be comprehensive or robust tests, but simply anecdotal examples of localised prompts.

That said, Koala's outputs seen here are arguably the best of the three models tried in this series of notebooks:
* Koala gets most of the facts right, such as Newton's three laws and the Pythagorean theorem. Where it was wrong, as with the date of Hamilton's death following his duel with Burr, it was only marginally off (Hamilton died on 12 July 1804, not 4 July). That said, the prompt did ask the model to imagine an alternate reality, so this might not count!
* Koala's responses are generally more coherent, with both the brainstorming ideas and the creative writing exhibiting a more reasonable train of thought.
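Each response in the tests that follow begins with the echoed prompt, because the `text-generation` pipeline returns the prompt together with the completion by default. The sketch below shows one way to build the prompt and strip the echo from the output; `build_prompt` and `strip_prompt` are illustrative helpers that are not used by `koala_speak`, and the sample output string is invented for the example.

```python
# Prompt template used by koala_speak in this notebook
PROMPT_TEMPLATE = "BEGINNING OF CONVERSATION: USER: {prompt}"


def build_prompt(user_prompt: str) -> str:
    """Wrap the user's text in the conversation template."""
    return PROMPT_TEMPLATE.format(prompt=user_prompt)


def strip_prompt(generated_text: str, full_prompt: str) -> str:
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(full_prompt):
        return generated_text[len(full_prompt):].lstrip()
    return generated_text


full_prompt = build_prompt("What is the Pythagoras Theorem?")
# Hypothetical pipeline output: the prompt followed by the model's answer
sample_output = full_prompt + "\nThe Pythagorean theorem states that c^2 = a^2 + b^2."
print(strip_prompt(sample_output, full_prompt))
```

The pipeline also accepts a `return_full_text=False` argument that should have a similar effect, although that option is not used in this notebook.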

### Test 1: Open Q&A I

> **Prompt: What are Newton's three laws of motion?**

In [14]:
koala_speak()

Prompt: What are Newton's three laws of motion?
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: What are Newton's three laws of motion?

Newton's three laws of motion are fundamental principles that describe how objects move. They state
that:

1.   An object will continue to accelerate at a constant rate until it reaches its maximum
acceleration, or the point where there is no longer any net force acting on it. This law states that
an object will keep moving in a straight line unless something else acts upon it and changes its
direction.
2.   The second law states that the force applied to an object must be equal to the mass of the
object multiplied by its acceleration. This law states that the strength of the force required to
change the velocity (speed) of an object depends on both the mass of the object and the acceleration
being applied.
3.   The third law states that for every action, 

### Test 2: Open Q&A II

> **Prompt: What is the Pythagoras Theorem?**

In [15]:
koala_speak()

Prompt: What is the Pythagoras Theorem?
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: What is the Pythagoras Theorem?
The Pythagorean theorem, also known as the "Pythagorean theorem," states that in a right-angled
triangle, the square of the length of the longest side (the hypotenuse) is equal to the sum of the
squares of the other two sides. This theorem can be expressed mathematically as follows:

c^2 = a^2 + b^2

where c represents the length of the hypotenuse, and a and b represent the lengths of the other two
sides. The theorem has been proven for thousands of years and is considered one of the most
important mathematical concepts in history. It has many practical applications in fields such as
engineering, architecture, and construction.


### Test 3: Text Summarisation

> **Prompt: Summarise this text: Spectacular photographs of a Canary Wharf skyscraper being completed have been rediscovered more than 30 years after they were taken.
Amid strong winds, Tony Brien sat in a wooden box suspended by a crane some 250m above east London to capture the images of One Canada Square. Previously believed to have been lost, the photographs were uncovered during a search of Mr Brien's archives. He said he was "completely staggered" to find the photos from November 1990. Mr Brien said he had worked on the Canary Wharf project before when he was approached to photograph the "topping out" of One Canada Square. "I said 'fine', not realising quite at that time that the only way I could really do it was to go up in a bucket or a crate from the ground which was attached to a crane," he explained.**

Source: BBC. (2023, April 15). Canary Wharf: Spectacular photos of skyscraper rediscovered. *BBC News*. https://www.bbc.co.uk/news/uk-england-london-65274803

In [16]:
koala_speak()

Prompt: Summarise this text: Spectacular photographs of a Canary Wharf skyscraper being completed have been rediscovered more than 30 years after they were taken. Amid strong winds, Tony Brien sat in a wooden box suspended by a crane some 250m above east London to capture the images of One Canada Square. Previously believed to have been lost, the photographs were uncovered during a search of Mr Brien's archives. He said he was "completely staggered" to find the photos from November 1990. Mr Brien said he had worked on the Canary Wharf project before when he was approached to photograph the "topping out" of One Canada Square. "I said 'fine', not realising quite at that time that the only way I could really do it was to go up in a bucket or a crate from the ground which was attached to a crane," he explained.
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: Summarise this text: Spectacular pho

### Test 4: Brainstorming

> **Prompt: What are some fun activities a family can do along the River Thames?**

In [17]:
koala_speak()

Prompt: What are some fun activities a family can do along the River Thames?
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: What are some fun activities a family can do along the River
Thames?
1.   Take a river cruise and see London from a different perspective.
2.   Visit one of the many museums or galleries located on the banks of the Thames, such as the
British Museum or Tate Modern.
3.   Go for a walk along the South Bank and enjoy views of the city skyline and the river.
4.   Have a picnic in one of the parks that line the Thames, such as Hyde Park or Kensington
Gardens.
5.   Rent a boat and go for a leisurely ride down the river.


### Test 5: Creative Writing I

> **Prompt: Write a short story about Buzz Lightyear's adventure to get flowers for Jessie**

In [18]:
koala_speak()

Prompt: Write a short story about Buzz Lightyear's adventure to get flowers for Jessie
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: Write a short story about Buzz Lightyear's adventure to get flowers
for Jessie

Once upon a time, in the vast expanse of space, there lived a brave and noble spaceman named Buzz
Lightyear. He was known throughout the galaxy as one of the greatest heroes who ever lived, always
ready to defend his friends and protect the universe from harm.

One day, Buzz received an urgent message from his friend Jessie, who was trapped on a distant planet
with no way back home. Jessie had been captured by a group of evil aliens who were holding her
captive and threatening to destroy all life in the galaxy if they didn't receive a ransom payment of
rare and valuable flowers.

Determined to save Jessie and stop the alien threat once and for all, Buzz set out on a dangerous
jou

### Test 6: Creative Writing II

> **Prompt: Write a short murder mystery about John Lennon and the Backstreet Boys**

In [19]:
koala_speak()

Prompt: Write a short murder mystery about John Lennon and the Backstreet Boys
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: Write a short murder mystery about John Lennon and the Backstreet
Boys
John Lennon was found dead in his home on December 8, 1980. The police were called to investigate,
but they quickly ruled out any suspicious activity or motives for the crime. However, over time,
more information came to light that suggested there may have been something sinister behind Lennon's
death.

One of the key pieces of evidence was a mysterious phone call made by someone claiming to be Yoko
Ono, Lennon's wife at the time. According to witnesses, she received a call from an unknown person
who threatened her life if she didn't leave Lennon alone. This led some people to speculate that Ono
had a role in Lennon's death, either as a victim or as a conspirator.

Another piece of evidence was t

### Test 7: Creative Writing III

> **Prompt: Write a short story about the Presidency of Alexander Hamilton in an alternate reality where he had become the US president**

In [20]:
koala_speak()

Prompt: Write a short story about the Presidency of Alexander Hamilton in an alternate reality where he had become the US president
----------------------------------------------------------------------------------------------------
Response:

BEGINNING OF CONVERSATION: USER: Write a short story about the Presidency of Alexander Hamilton in
an alternate reality where he had become the US president after his death.
In this alternate reality, Alexander Hamilton was assassinated on July 4th, 1804 while leaving a
duel with Aaron Burr. However, instead of being buried at Trinity Church Cemetery in New York City
as planned, his body was preserved and placed into cryogenic storage to be revived later when
technology advanced enough to bring him back to life.

Years passed, and by the early 23rd century, medical science had made significant advancements that
allowed for the successful resuscitation of frozen bodies. When it became clear that Alexander
Hamilton's time as President would have be