<a href="https://colab.research.google.com/github/colinmcnamara/austin_langchain/blob/main/labs/LangChain_101/Misc/101-1-streamlit_llamacpp_mistral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Install the `langchain`, `huggingface_hub`, and `streamlit` packages for Python

In [1]:
%pip install -q langchain huggingface_hub streamlit

* We will use the [Mistral 7B Instruct v0.1 model](https://mistral.ai/). Because the full-precision model is very large, we'll use a quantized (compressed) version published by [Tom Jobbins (TheBloke)](https://huggingface.co/TheBloke) on [Hugging Face](https://huggingface.co/).
* The quantized file is still about 4 GB, so we pre-fetch and cache it now, where we can watch the download progress. When Streamlit requests the model again later, it will already be on disk, and the app will start much faster than it would if it had to download the model first.

In [2]:
from huggingface_hub import hf_hub_download

(repo_id, model_file_name) = ("TheBloke/Mistral-7B-Instruct-v0.1-GGUF", "mistral-7b-instruct-v0.1.Q4_0.gguf")
hf_hub_download(repo_id=repo_id, filename=model_file_name, repo_type="model")

mistral-7b-instruct-v0.1.Q4_0.gguf:   0%|          | 0.00/4.11G [00:00<?, ?B/s]

'/root/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.1-GGUF/snapshots/45167a542b6fa64a14aea61a4c468bbbf9f258a8/mistral-7b-instruct-v0.1.Q4_0.gguf'
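
* To see roughly why the quantized file is about 4 GB, here is a back-of-the-envelope sketch. It assumes about 7.24 billion parameters for Mistral 7B and the Q4_0 layout of 18 bytes per block of 32 weights (sixteen bytes of 4-bit values plus a 2-byte scale). `estimated_size_gb` is a helper name made up for this illustration, and the real file differs slightly because of metadata and tensors stored in other formats.

```python
# Assumption: approximate parameter count of Mistral 7B.
PARAMS = 7.24e9

# Assumption: Q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit values + a 2-byte fp16 scale).
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18


def estimated_size_gb(params: float, bytes_per_weight: float) -> float:
    """Estimate the on-disk size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return params * bytes_per_weight / 1e9


fp16_gb = estimated_size_gb(PARAMS, 2.0)  # 16-bit weights use 2 bytes each
q4_0_gb = estimated_size_gb(PARAMS, Q4_0_BLOCK_BYTES / Q4_0_BLOCK_WEIGHTS)

print(f"fp16: ~{fp16_gb:.1f} GB, Q4_0: ~{q4_0_gb:.1f} GB")
# prints: fp16: ~14.5 GB, Q4_0: ~4.1 GB
```

* The estimate lands close to the 4.11G shown in the download progress bar above.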

* We will install [LlamaCpp](https://python.langchain.com/docs/integrations/llms/llamacpp) for Python (the `llama-cpp-python` package), built with CUDA support so we can use the GPU.
* The model can also run on a standard CPU without GPU support, but responses will be very slow.
* The code in this notebook is configured for a GPU, so make sure the `T4 GPU` runtime is selected. Otherwise the program will crash without showing any errors on screen.

In [3]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.15.tar.gz (7.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.15-cp310-cp310-manylinux_2_35_x86_64.whl size=6969351 sha256=2eee8720722d9fbb2cf7571c75a1015f7108eb4954eebf244375e3995bb0d7af
  Stored in directory: /root/.cache/pip/wheels/0e/9a/85/b27890418a82fb5be7ceddff8e60f573f6ce989f7a2b43f7ca
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.2.15
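
* Since a missing GPU can make the app crash silently, it may help to confirm that the runtime actually exposes one before starting the app. The sketch below is one possible check, not part of the original notebook. It assumes the `nvidia-smi` tool is on the `PATH` whenever a GPU runtime is selected, and `cuda_gpu_visible` is a hypothetical helper name.

```python
import shutil
import subprocess


def cuda_gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        # Assumption: no NVIDIA tools on PATH means a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Each detected device is listed on a line of the form "GPU <index>: <name> ...".
    return result.returncode == 0 and "GPU " in result.stdout


if cuda_gpu_visible():
    print("GPU detected.")
else:
    print("No GPU detected; select the T4 GPU runtime before starting the app.")
```

* If no GPU is reported, switch the runtime type and re-run the install cell above.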


* The original code has been modified to use LlamaCpp, so we can run a local LLM instead of calling OpenAI.

In [4]:
%%writefile streaming_app.py
import streamlit as st
from langchain.llms import LlamaCpp
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.callbacks.base import BaseCallbackHandler
from huggingface_hub import hf_hub_download


# We don't want to reinitialize the llm with every user interaction so we ask Streamlit to cache it
# Downloading a model takes time so we only want to do it once and cache it.
# Once a model is downloaded, it has to be loaded, and the time it takes to load a model is directly
# proportional to its physical size. So we don't want to keep reloading the model. Hence, we ask streamlit
# to cache it instead.
@st.cache_resource
def download_model_and_prepare_llm():

    # We have already run these 2 lines earlier, so when Streamlit runs it again, since it already has the model
    # downloaded locally, it will do an early return with the model's path on disk

    (repo_id, model_file_name) = ("TheBloke/Mistral-7B-Instruct-v0.1-GGUF", "mistral-7b-instruct-v0.1.Q4_0.gguf")

    model_path = hf_hub_download(repo_id=repo_id, filename=model_file_name, repo_type="model")

    # We will now load the model using the LlamaCpp interface which we will later pass to LangChain
    llama_llm = LlamaCpp(
            model_path=model_path,
            temperature=0,
            max_tokens=512,
            top_p=1,
            # callback_manager=callback_manager,
            n_gpu_layers=100,
            n_batch=512,
            n_ctx=4096,
            stop=["[INST]"],
            verbose=False,
            streaming=True,
            )

    return llama_llm


# We cache the llm chain until the system prompt changes. Then we reinitialize it with the new prompt.
@st.cache_resource
def create_chain(system_prompt):

    # We get the model we pre-fetched earlier and loaded using LlamaCpp interface
    llama_llm = download_model_and_prepare_llm()

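    # Mistral 7B Instruct is instruction-tuned and expects each instruction to be wrapped in
    # [INST] ... [/INST] tags, so we build a prompt template that follows that format.
    # We also pass in an initial "system prompt" to set the model's personality.
    # Braces in the system prompt are escaped so PromptTemplate does not treat them as variables.
    escaped_system_prompt = system_prompt.replace("{", "{{").replace("}", "}}")
    template = """<s>[INST]{}[/INST]</s>

[INST]{}[/INST]""".format(escaped_system_prompt, "{question}")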

    # We now prepare the prompt template using the template string above so we can create an llm chain using it
    prompt = PromptTemplate(template=template, input_variables=["question"])

    # We initialize the chain with the model and the prompt template.
    # llm_chain = LLMChain(llm=llama_llm, prompt=prompt)
    llm_chain = prompt | llama_llm

    return llm_chain


# This stream handler is not used at this time.
class StreamHandler(BaseCallbackHandler):
    def __init__(self, container, initial_text=""):
        self.container = container
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.text += token
        self.container.markdown(self.text)


# Create a sidebar text field to allow us to set and modify the initial prompt
with st.sidebar:
    system_prompt = st.text_area(label="Enter a system prompt to adjust the chatbot behavior",
                                 value="You are a helpful AI assistant who answers questions in short sentences.")

# We build our chain by passing the system prompt.
# It is worth noting that the chain will be rebuilt if we change the system prompt.
llm = create_chain(system_prompt)

# We initialize the chat history in streamlit session
# This is not the model's memory. Just a list of messages we passed to it and
# responses we received from it. This will be used to render the chat history.
if "messages" not in st.session_state:
    st.session_state["messages"] = [{"role": "assistant", "content": "How may I help you today?"}]

# We loop through the chat history in the session and display it in a chat-like interface
for msg in st.session_state.messages:
    st.chat_message(msg["role"]).write(msg["content"])

# When we receive user input from the chat input:
if prompt := st.chat_input():

    # 1. we add it to the chat history in the session memory and tag it as user message
    st.session_state.messages.append({"role": "user", "content": prompt})

    # 2. we add it to the chat history on the screen
    st.chat_message("user").write(prompt)

    # stream_handler = StreamHandler(st.empty())

    # 3. pass it to our llm chain as a prompt and get a response back
    # response = llm.run(prompt)
    response = llm.invoke({"question":prompt})

    # 4. we add the response to the chat history in the session memory and tag it as AI response
    st.session_state.messages.append({"role":"assistant", "content": response})

    # 5. We append the response to the chat history on the screen
    with st.chat_message("assistant"):
        st.markdown(response)


Writing streaming_app.py
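
* The app above waits for the full response before showing it, and the `StreamHandler` class is left unused. As a minimal sketch of how token-by-token rendering could work, the helper below accumulates chunks and re-renders the growing text, much like `on_llm_new_token` does. `render_stream` is a hypothetical name, and wiring it to the chain's `stream()` method and an `st.empty()` placeholder is an assumption that has not been tested in this notebook.

```python
from typing import Callable, Iterable


def render_stream(chunks: Iterable[str], render: Callable[[str], None]) -> str:
    """Accumulate text chunks, re-rendering the full text after each one."""
    text = ""
    for chunk in chunks:
        text += chunk
        render(text)  # e.g. overwrite a placeholder with the text so far
    return text


# Assumed usage inside streaming_app.py (step 3 onwards), not verified here:
#
#   with st.chat_message("assistant"):
#       placeholder = st.empty()
#       response = render_stream(llm.stream({"question": prompt}), placeholder.markdown)
#   st.session_state.messages.append({"role": "assistant", "content": response})

# Standalone demonstration with a plain list standing in for the model output.
shown = []
final_text = render_stream(["How ", "may I ", "help?"], shown.append)
print(final_text)
```

* Keeping the accumulation separate from Streamlit makes the logic easy to try outside the app, as the demonstration with a plain list shows.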


In [5]:
!streamlit run streaming_app.py >>/content/logs.txt 2>&1 &
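
* Because Streamlit is started in the background with its output redirected to `/content/logs.txt`, nothing on screen shows when the app is ready. The sketch below polls the port until something accepts connections. It assumes the app listens on port 8501 (the port exposed later in this notebook), and `wait_for_port` is a hypothetical helper name.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 8501, timeout: float = 60.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means a server is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait briefly and retry.
            time.sleep(0.5)
    return False
```

* For example, calling `wait_for_port(port=8501, timeout=120)` before opening the tunnel should return `True` once the app is up. If it returns `False`, `/content/logs.txt` is the place to look for errors.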

## Find the IP of your instance

In [6]:
!curl ipv4.icanhazip.com
!echo "Copy this IP into the webpage that opens below"

34.125.185.186
Copy this IP into the webpage that opens below


## Expose the Streamlit app on port 8501

In [7]:
!npx localtunnel --port 8501

npx: installed 22 in 4.225s
your url is: https://quiet-llamas-battle.loca.lt
^C
