<h3 style="text-align: center;"><span style="color:red">This example can be overwritten by container updates.</span></h3>
<h4 style="text-align: center;"><span style="color:red">Please create a new notebook if you plan to make changes.</span></h4>

## Intro

This example Jupyter notebook demonstrates basic chat functionality.

## Getting started

Run each cell in order to step through the process. The first run also downloads the model to use.
Note: Run every cell, in order, at least once to set up the environment. Once set up, you can change the prompt and rerun just that cell. If you change a cell, rerun it for the change to take effect.

In [None]:
from helpers import device_info

device_info()

In [None]:
from dataclasses import dataclass

@dataclass
class Model:
    filename: str
    repo_id: str

models = {
    "codeninja"   : Model("codeninja-1.0-openchat-7b.Q4_K_M.gguf", "TheBloke/CodeNinja-1.0-OpenChat-7B-GGUF"),
    "mixtral"     : Model("mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"),
}

model = models["codeninja"] # or change to "mixtral"
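
When switching between models, it can help to guard the lookup so that a mistyped key fails with a readable message. The sketch below repeats the `Model` dataclass and `models` dictionary from the cell above so that it runs on its own; `select_model` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
from dataclasses import dataclass


@dataclass
class Model:
    filename: str
    repo_id: str


models = {
    "codeninja": Model("codeninja-1.0-openchat-7b.Q4_K_M.gguf", "TheBloke/CodeNinja-1.0-OpenChat-7B-GGUF"),
    "mixtral": Model("mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"),
}


def select_model(name: str) -> Model:
    """Hypothetical helper: return the model for `name`, listing valid keys on failure."""
    try:
        return models[name]
    except KeyError:
        raise ValueError(f"Unknown model {name!r}; choose one of {sorted(models)}") from None


model = select_model("codeninja")
print(model.repo_id)
```

Calling `select_model("mixtral")` would pick the other entry; any other key raises a `ValueError` that names the available choices.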

In [None]:
import os
from huggingface_hub import hf_hub_download

# Uncommenting the following line _may_ improve download speeds
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download the model if needed.
hf_hub_download(repo_id=model.repo_id, filename=model.filename, local_dir="/app/data");

In [None]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

In [None]:
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])
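
To preview the text the model will receive, the same template string can be filled in with plain `str.format`. This is only a sketch: it assumes the template uses simple `{question}`-style placeholders, as it does here, and the sample question is made up for illustration.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

# Fill the single placeholder with an illustrative question.
rendered = template.format(question="What is 2 + 2?")
print(rendered)
```

Checking the rendered prompt this way is a quick test of wording changes before loading the model.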

In [None]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [None]:
from gpu_monitor import GPUMonitor

# The monitor shows how the GPU handles different operations. Its output appears below the cell it runs in,
# so it helps to start one before loading the LLM and another before handling prompts. Comment this out if
# not needed.
gpu_monitor = GPUMonitor()
display(gpu_monitor.display());

In [None]:
n_gpu_layers = 12  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=f"/app/data/{model.filename}",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
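
One rough way to pick a starting value for `n_gpu_layers` is to divide the model file size evenly across its layers and see how many fit in free VRAM. The sketch below is a heuristic under stated assumptions, not how llama.cpp allocates memory: it treats all layers as equal in size, and the `reserve_gb` headroom for the context and scratch buffers is a guess. The function name and every number are hypothetical.

```python
def estimate_gpu_layers(model_size_gb: float, n_layers: int,
                        free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Hypothetical heuristic: how many equally sized layers fit in free VRAM."""
    usable_gb = free_vram_gb - reserve_gb
    if usable_gb <= 0 or n_layers <= 0:
        return 0
    per_layer_gb = model_size_gb / n_layers  # assumes all layers are the same size
    return min(n_layers, int(usable_gb // per_layer_gb))


# Illustrative numbers only; check your own model file size and free VRAM.
print(estimate_gpu_layers(model_size_gb=26.4, n_layers=32, free_vram_gb=12.0))
```

Treat the result as a starting point, then adjust `n_gpu_layers` while watching the GPU monitor.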

In [None]:
# The monitor shows how the GPU handles different operations. Its output appears below the cell it runs in,
# so it helps to start one before loading the LLM and another before handling prompts. Comment this out if
# not needed.
gpu_monitor = GPUMonitor()
display(gpu_monitor.display());

In [None]:
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question);

## Cleanup
<h4><span style="color:red">Don't run until you are finished using the LLM.</span></h4>

The following cell resets the variables that hold GPU memory and then clears cached memory.

In [None]:
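import gc
import torch

# Drop the references that keep the model and its GPU memory alive.
callback_manager = None
llm_chain = None
llm = None

# Collect the released objects, then free PyTorch's cached GPU memory.
gc.collect()
torch.cuda.empty_cache();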