<img src="images/panel-logo.png" width="200px" align="right" vspace="20px"/>

# Local LLMs Chat App with Panel

<hr>

In the [previous notebook](04-panel-intro.ipynb), `pn.chat.ChatInterface` was introduced with a callback that simply echoed the sent message.

In this section, we will make it much more interesting by connecting a local LLM, specifically Llama3 from earlier.

## Ragna & ExLlama2-quantized Llama3

### Setup

Let's first initialize the model and create a Ragna chat. We'll explore the Python Software Foundation's annual reports.

In [1]:
from local_llm import Llama38BInstruct
from ragna import Rag, source_storages
import panel as pn
pn.extension()

documents = [
    "files/psf-report-2021.pdf",
    "files/psf-report-2022.pdf",
    "files/psf-report-2023.pdf",
]

chat = Rag().chat(
    documents=documents,
    source_storage=source_storages.Chroma,
    assistant=Llama38BInstruct,
)

await chat.prepare();

### Test the chat

We can first do a test run to see if it works.

In [2]:
message = await chat.answer("What does the PSF do?", stream=True)

async for chunk in message:
    print(chunk, end="")



Based on the provided context, the Python Software Foundation (PSF) appears to be involved in various activities related to the Python programming language and community. Some of the specific initiatives mentioned include:

1. Grant disbursements: The PSF provides grants to support various projects and events, such as conferences, workshops, and coding camps.
2. Community awards and expenses: The PSF seems to provide funding for community-related expenses and awards.
3. Packaging Work Group/Infrastructure/Other: The PSF may be involved in maintaining and improving the infrastructure and packaging of Python software.
4. Fiscal sponsorship: The PSF may provide fiscal sponsorship to other organizations or projects.
5. Code of Conduct: The PSF may be involved in promoting and enforcing a code of conduct within the Python community.
6. Developer-in-Residence: The PSF has hired a full-time developer to work on CPython, directed by the Python Steering Council.
7. Packaging Project Manager: 

### Migrate to `ChatInterface`

Now, let's migrate this functionality into `pn.chat.ChatInterface` with a callback.

To do this, we copy and paste the prior cell's code into a function, and then:

1. Prefix the `def` with `async` to make it async
2. Replace the hard-coded string with `contents`
3. Concatenate the chunks into a `response` string
4. Yield the `response`

In [3]:
async def reply(contents, user, instance):
    message = await chat.answer(contents, stream=True)

    response = ""
    async for chunk in message:
        response += chunk
        yield response

chat_interface = pn.chat.ChatInterface(callback=reply)
chat_interface

Now try entering "Who is the Python Developer in Residence?" into the chat.
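
To see why the callback yields the running `response` rather than each individual chunk, here is a minimal sketch that runs without Ragna or a model. `fake_stream` is a hypothetical stand-in for the streamed message and is not part of any library. The point is that each yielded value replaces the message shown in the chat, so every yield has to contain the full text so far.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streamed answer: yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop, like real I/O would
        yield word + " "


async def reply(contents, user=None, instance=None):
    # Same accumulate-and-yield pattern as the callback above.
    response = ""
    async for chunk in fake_stream(contents):
        response += chunk
        yield response  # the full text so far, not just the newest chunk


async def collect(contents):
    # Gather every intermediate value the callback would hand to the chat.
    return [partial async for partial in reply(contents)]


# In a notebook, use `snapshots = await collect(...)` instead of asyncio.run.
snapshots = asyncio.run(collect("The PSF funds Python"))
for snapshot in snapshots:
    print(repr(snapshot))
```

Each snapshot is longer than the previous one, which is what makes the message appear to grow word by word in the interface.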

## LlamaCpp

For completeness, we can use `llama-cpp-python` for quantized models too!

`llama-cpp` can run on both CPU and GPU, and has an API that mimics OpenAI's API.

Personally, I use it because I don't have any spare GPUs lying around and it runs extremely well on my local Mac M2 Pro! It also handles chat template formats internally, so it's just a matter of specifying the proper `chat_format` key.

Here, we:
1. download the quantized model in GGUF format (if it doesn't exist already)
2. instantiate the model, first checking the cache
3. serialize all messages into `transformers` format (new)
4. call the OpenAI-like chat completion API on the messages
5. stream the chunks

In [4]:
from pathlib import Path

import llama_cpp
import panel as pn
from huggingface_hub import hf_hub_download
pn.extension()

model_path = hf_hub_download(
    "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    local_dir=str(Path.home() / "shared/scipy/rags-to-riches")
)  # 1.

# 2.
if model_path in pn.state.cache:
    llama = pn.state.cache[model_path]
else:
    llama = llama_cpp.Llama(
        model_path=model_path,
        n_gpu_layers=-1,
        chat_format="mistral-instruct",
        n_ctx=2048,
        logits_all=True,
        verbose=False,
    )
    pn.state.cache[model_path] = llama

def reply(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = instance.serialize()  # 3.
    message = llama.create_chat_completion_openai_v1(messages=messages, stream=True)  # 4.

    response = ""
    for chunk in message:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response  # 5.

chat_interface = pn.chat.ChatInterface(callback=reply)
chat_interface
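
The `or ""` in the callback guards against chunks whose delta carries no text. The sketch below imitates that with hypothetical stand-in objects built from `SimpleNamespace`. The shape (`choices[0].delta.content`) mirrors what the callback above reads, and the assumption that some chunks have `content` set to `None` (for example a role-only first chunk or a closing chunk) is illustrative rather than a guarantee about every backend.

```python
from types import SimpleNamespace


def fake_chunks():
    """Hypothetical stand-in for streamed chat-completion chunks."""
    for content in (None, "Hello", ",", " world", None):
        delta = SimpleNamespace(content=content)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(chunks):
    # Same loop as the callback: treat a missing delta as an empty string.
    response = ""
    for chunk in chunks:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response


partials = list(accumulate(fake_chunks()))
print(partials[-1])
```

Without the `or ""`, the first `None` would raise a `TypeError` when concatenated to `response`.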

We can even give the model a personality by setting a system message!

Update the callback with a system message.

Note that Mistral Instruct does NOT support the `system` role, so we use `user` instead.

In [5]:
system_message = "You are an excessively passionate Pythonista."

def reply(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = [
        {"role": "user", "content": system_message}  # updated here
    ] + instance.serialize()
    message = llama.create_chat_completion_openai_v1(messages=messages, stream=True)

    response = ""
    for chunk in message:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response

chat_interface.callback = reply
chat_interface
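
To make the message list concrete, here is a small sketch of the two ways a persona could be injected when no `system` role is available. Both helpers are hypothetical and only illustrate the idea. The `history` list assumes the serialized messages are dictionaries with `role` and `content` keys, as the callback above relies on. `prepend_persona` matches what the callback does, while `fold_persona` is an alternative that merges the persona into the first user message, which may help with chat templates that expect strictly alternating roles.

```python
def prepend_persona(persona, history):
    """Return a new list with the persona as an extra leading user message."""
    return [{"role": "user", "content": persona}] + list(history)


def fold_persona(persona, history):
    """Return a new list with the persona merged into the first user message."""
    messages = [dict(message) for message in history]  # copy so history is untouched
    for message in messages:
        if message["role"] == "user":
            message["content"] = f"{persona}\n\n{message['content']}"
            return messages
    # No user message yet: fall back to a standalone persona message.
    return [{"role": "user", "content": persona}] + messages


persona = "You are an excessively passionate Pythonista."
history = [
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A function that wraps another function."},
    {"role": "user", "content": "Show me one."},
]

prepended = prepend_persona(persona, history)
folded = fold_persona(persona, history)
print([m["role"] for m in prepended])
print([m["role"] for m in folded])
```

Neither helper mutates `history`, so the chat's own record of the conversation stays free of the persona text.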

### Your turn ✨

Try aggregating all you've learned to customize the personality of the chatbot on the go!

Again, replace the ellipses with the appropriate code snippets!

In [7]:
from pathlib import Path

import llama_cpp
import panel as pn
from pydantic import BaseModel
from huggingface_hub import hf_hub_download

pn.extension()

model_path = hf_hub_download(
    "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    local_dir=str(Path.home() / "shared/scipy/rags-to-riches")
)

if model_path in pn.state.cache:
    llama = pn.state.cache[model_path]
else:
    llama = llama_cpp.Llama(
        model_path=model_path,
        n_gpu_layers=-1,
        chat_format="mistral-instruct",
        n_ctx=2048,
        logits_all=True,
        verbose=False,
    )
    pn.state.cache[model_path] = llama

def reply(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = [
        {"role": "user", "content": "you are ted lasso"}  # Fill this out
    ] + instance.serialize()
    message = llama.create_chat_completion_openai_v1(
        messages=messages, stream=True
    )

    response = ""
    for chunk in message:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response


system_input = ...  # Fill this out
chat_interface = pn.chat.ChatInterface(callback=reply, min_height=350)
layout = pn.Column(
    system_input,
    chat_interface,
)
layout

That's all for now. Check out [the examples](https://holoviz-topics.github.io/panel-chat-examples/) to see more on how you can integrate `pn.chat.ChatInterface` with other services!

Again, there is also a [HoloViz Discourse](https://discourse.holoviz.org/) if you want to ask questions.

<hr>

_❗️ **Warning:** Make sure to stop the Jupyter Kernel (in the JupyterLab Menu Bar, click on "Kernel" -> "Shut down Kernel") before proceeding to prevent the "insufficient VRAM" error._

<br>

**✨ Next: [Ragna's UI and Experiments](06-UI-and-experiments.ipynb) →**


💬 _Wish to continue discussions after the tutorial? Contact the presenters: [@pavithraes](https://github.com/pavithraes), [@dharhas](https://github.com/dharhas), [@ahuang11](https://github.com/ahuang11)_

<hr>