<img src="images/roche-logo-blue.png" alt="Roche logo" style="float: right;" width="150" />
<p>&nbsp;</p>
<p>&nbsp;</p>

# Building a Q&A engine with LangChain and open-source LLMs
## Marek Grzenkowicz

#### September 2023

## About Roche

![Key Roche Informatics hubs](images/about-roche.png)

## Large Language Models did not appear out of nowhere

### Natural language processing (NLP)

- Standardized tasks (question answering, summarization, sentiment analysis, ...)
- Evaluation benchmarks
- **Leaderboards**
- Word and sentence **embeddings** (word2vec, GloVe, fastText, ELMo, BERT, ...)

### Machine learning

- Neural networks
- Deep learning
- CUDA
- Transformer architecture
- Attention (ML technique)
- Reinforcement learning

## Goals

1. Implement an LLM application without deep NLP knowledge
2. Develop and run it locally, instead of making API calls to a cloud-hosted model
3. Use an open-source model
4. Create a prototype with an actual business problem in mind

## Tools

- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) - framework for developing applications powered by LLMs
- [Hugging Face Hub](https://huggingface.co/models) - repository of pre-trained language models
  - [Transformers](https://huggingface.co/docs/transformers/index) - downloading the models
- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) - compact, open-source LLM from [lmsys.org](https://lmsys.org) 
- [ChromaDB](https://www.trychroma.com) - embedding database (vector store)
- [Jupyter Notebook](https://jupyter.org/) with the [RISE](https://github.com/damianavila/RISE) extension - IDE with a slideshow feature

## Why the `lmsys/fastchat-t5-3b-v1.0` model?

- GPU with 4 GB of memory: 1B parameters at best
- CPU with 32 GB of RAM: 3B parameters

- [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) ➡ [`CobraMamba/mamba-gpt-3b-v3`](https://huggingface.co/CobraMamba/mamba-gpt-3b-v3) claims to surpass some 12B models
  - but it is slow ➡ [Open LLM performance leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)


- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)
  - _A commercial-friendly, compact, yet powerful chat assistant_
  - The first model to actually generate any response on my laptop in reasonable time 🎉
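
The hardware limits above roughly follow from the size of the model weights alone. A minimal back-of-the-envelope sketch, assuming 4 bytes per parameter for `float32` weights and ignoring activations and framework overhead (the helper name `weights_size_gb` is made up for this example):

```python
# Rough memory needed just to hold the model weights.
# Assumption: activations, caches, and framework overhead are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_gb(n_params, dtype="float32"):
    """Return the approximate size of the weights in gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for n_params in (1e9, 3e9):
    sizes = {dtype: weights_size_gb(n_params, dtype) for dtype in BYTES_PER_PARAM}
    print(f"{n_params / 1e9:.0f}B parameters: {sizes}")
```

Under these assumptions, 1B parameters in `float32` already take about 4 GB and 3B parameters take about 12 GB, which fits the 32 GB laptop but not the 4 GB GPU.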

## Some duct tape first

In [16]:
# use with care! important warnings may get hidden

import warnings
warnings.filterwarnings('ignore')

In [17]:
# https://github.com/chroma-core/chroma/blob/main/chromadb/__init__.py#L57

import sys
__import__("pysqlite3")
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
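
The swap above is presumably needed only when the SQLite bundled with Python is too old for Chroma. A small check, assuming the minimum is SQLite 3.35.0 (an assumption to verify against the Chroma documentation for your version) and run before the swap, could look like this; `sqlite_is_recent_enough` is a hypothetical helper:

```python
import sqlite3

# Assumption: Chroma needs SQLite 3.35.0 or newer; check the Chroma docs
# for the exact requirement of the version you install.
REQUIRED = (3, 35, 0)


def sqlite_is_recent_enough(version_info=None, required=REQUIRED):
    """Compare an SQLite version tuple against the required minimum."""
    if version_info is None:
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= tuple(required)


print("bundled SQLite:", sqlite3.sqlite_version)
print("pysqlite3 swap needed:", not sqlite_is_recent_enough())
```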

## Load a model

In [37]:
from langchain.llms import HuggingFacePipeline

In [19]:
model_id, task = "lmsys/fastchat-t5-3b-v1.0", "text2text-generation"

# model will be downloaded on first use and cached in ~/.cache/huggingface/hub/

model = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task=task,
    model_kwargs={
        "temperature": 0,
        "max_length": 1000
    },
    device=-1,  # CPU
)

Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.


## Initialize an LLM chain and start asking questions

In [20]:
from langchain import PromptTemplate, LLMChain

template_text = """
{question}
"""
template = PromptTemplate(template=template_text, input_variables=["question"])

llm_chain = LLMChain(llm=model, prompt=template)

In [21]:
llm_chain("Who is Sheryl Crow?")["text"]

'<pad> Sheryl  Crow  is  an  American  singer,  songwriter,  and  actress.  She  is  best  known  for  her  role  as  the  lead  singer  and  lead  guitarist  of  the  rock  band  The  Band wagon,  and  for  her  role  as  the  lead  singer  and  lead  guitarist  of  the  alternative  rock  band  The  Mamas  and  the  Papas.  Crow  has  also  been  a  member  of  the  band  The  Mamas  and  the  Papas  since  its  formation  in  1995.\n'
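
The raw response above starts with a `<pad>` token and contains doubled spaces. A minimal post-processing sketch (the helper `clean_output` is hypothetical, not part of LangChain) could tidy it before display:

```python
import re


def clean_output(text):
    """Drop a leading '<pad>' marker and collapse repeated whitespace."""
    text = re.sub(r"^\s*<pad>\s*", "", text)  # strip the leading pad token
    return re.sub(r"\s+", " ", text).strip()  # collapse spaces and newlines


raw = "<pad> Sheryl  Crow  is  an  American  singer,  songwriter,  and  actress.\n"
print(clean_output(raw))
```

This only fixes formatting; it does nothing about the factual errors in the answer.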

## Prompt engineering

In [22]:
template_text = """
Provide brief answers, use 10 words or less.
{question}
"""
template = PromptTemplate(template=template_text, input_variables=["question"])

llm_chain = LLMChain(llm=model, prompt=template)

In [23]:
llm_chain("Who is Sheryl Crow?")["text"]

'<pad> Singer-songwriter'
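
To see what text the model actually receives, the template can be rendered by hand. This sketch uses plain `str.format` as a dependency-free stand-in for `PromptTemplate` (an assumption; the placeholder syntax is the same, but the real class does more validation), and `render_prompt` is a hypothetical helper:

```python
# Plain str.format stands in for PromptTemplate here (an assumption made
# to keep the sketch dependency-free).
template_text = """
Provide brief answers, use 10 words or less.
{question}
"""


def render_prompt(question):
    """Fill the {question} placeholder, roughly as the chain does before calling the model."""
    return template_text.format(question=question)


print(render_prompt("Who is Sheryl Crow?"))
```

The instruction and the question end up in one string, so the "prompt engineering" here is nothing more than extra text placed in front of the user's question.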

## Easy questions

In [24]:
llm_chain("Who is Poland located?")["text"]

'<pad> Europe'

In [25]:
llm_chain("What is Bialowieza Forest?")["text"]

'<pad> Bialowieza Forest is a protected forest in Poland.'

## Harder questions

In [26]:
llm_chain("What does the name 'Bialowieza' mean in English?")["text"]

'<pad> "Birch Tree"'

In [27]:
llm_chain("What's the length of the Tsar's Trail and where does it begin?")["text"]

"<pad> The Tsar's Trail is a 900 mile long trail that begins in Moscow and ends in St. Petersburg."

# What now?

# Should I fine-tune the base model? 🤔

## Embeddings and cosine similarity

![cosine similarity in 2D](./images/vectors-cos-sim-500.png)

Real embedding spaces have **hundreds of dimensions**! 🤯

Source: https://github.com/grzenkom/do-androids-read/
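
Cosine similarity is the cosine of the angle between two vectors: the dot product divided by the product of their lengths. A minimal pure-Python sketch, using made-up 2D vectors for illustration only:

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2D vectors, chosen only to illustrate the three typical cases.
print(cosine_similarity([1, 0], [2, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

The value depends only on direction, not on vector length, which is why it works as a measure of semantic closeness between embeddings.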

## Arithmetic of word vectors

\begin{equation}
\LARGE{\mathit{ v_{parent} + v_{woman} \approx v_{x} }}
\end{equation}

![sum of "parent" and "woman" vectors](./images/vector-mother.png)

\begin{equation}
\LARGE{\mathit{ v_{seawater} - v_{salt} \approx v_{x} }}
\end{equation}

![difference of "seawater" and "salt" vectors](./images/vector-water.png)

Source: https://github.com/grzenkom/do-androids-read/
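
The same arithmetic can be tried on a toy vocabulary. The vectors below are hand-made and hypothetical (real embeddings are learned and not interpretable per dimension); the sketch adds two vectors and looks up the nearest remaining word by cosine similarity:

```python
import math

# Hypothetical hand-made vectors; dimensions: (is_parent, is_female, is_male).
# Real embeddings are learned from text and are not interpretable like this.
VECTORS = {
    "parent": (1.0, 0.0, 0.0),
    "woman": (0.0, 1.0, 0.0),
    "man": (0.0, 0.0, 1.0),
    "mother": (1.0, 1.0, 0.0),
    "father": (1.0, 0.0, 1.0),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_word(target, exclude=()):
    """Return the vocabulary word whose vector points closest to target."""
    candidates = {w: v for w, v in VECTORS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# v_parent + v_woman, with the two input words excluded from the lookup
v_x = tuple(p + w for p, w in zip(VECTORS["parent"], VECTORS["woman"]))
print(nearest_word(v_x, exclude=("parent", "woman")))
```

Excluding the input words from the lookup is a common convention, since the sum often stays close to its own operands.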

## Retrieval Augmented Generation (RAG)

With RAG, documents are split into chunks and stored as embeddings in a vector
database. The chunks semantically closest to a question are retrieved and
passed into the **model prompt via the context window**. The LLM then uses
these chunks of the original documents to generate an answer.

![Question Answering flow](images/langchain-qa-flow.jpeg)

Source: https://python.langchain.com/docs/use_cases/question_answering/#overview

## LangChain integrations

1. [Catalog](https://integrations.langchain.com/llms)
2. [Documentation](https://python.langchain.com/docs/integrations)

![LangChain integrations](./images/langchain-integrations.png)

## RAG step 1 - load documents

In [60]:
from langchain.document_loaders import WikipediaLoader

loader = WikipediaLoader(query="Białowieża Forest", lang="en")
bf_docs = loader.load()

In [61]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    "https://bpn.com.pl/index.php?option=com_content&task=view&id=651&Itemid=297&lang=en"
)
bpn_page = loader.load()

## RAG step 1 - load documents

In [62]:
# load some unrelated document to test the vector store later

loader = WikipediaLoader(query="Kubeflow", lang="en")
kf_docs = loader.load()

## RAG step 2 - split documents into chunks

In [63]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500, chunk_overlap = 0
)
all_splits = text_splitter.split_documents(bf_docs + bpn_page + kf_docs)
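
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs, lines, and words; `split_text` is a hypothetical helper:

```python
def split_text(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "x" * 1200
chunks = split_text(text, chunk_size=500, chunk_overlap=0)
print([len(chunk) for chunk in chunks])
```

With zero overlap, as in the cell above, a sentence that straddles a chunk boundary is cut in two; a non-zero overlap trades some duplication for a better chance of keeping such sentences retrievable.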

## RAG step 3 - calculate embeddings for text chunks

In [64]:
from langchain.embeddings import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={
        "device": "cpu"
    },
    encode_kwargs={
        "normalize_embeddings": False
    }
)

## RAG step 4 - store embeddings in a database

In [65]:
from langchain.vectorstores import Chroma

vector_store = Chroma.from_documents(
    documents=all_splits, embedding=hf_embeddings
)

## RAG step 4 - store embeddings in a database

In [85]:
relevant_splits = vector_store.similarity_search_with_score("kubeflow", k=4)
[(doc_split.page_content[:100], score) for doc_split, score in relevant_splits]

[('Kubeflow is an open-source platform for machine learning and MLOps on Kubernetes introduced by Googl',
  0.5977725982666016),
 ('=== Kubeflow Pipelines for model training ===\nOnce developed, models are trained in the Kubeflow Pip',
  0.789579451084137),
 ('become among the top 2% of GitHub projects ever. Kubeflow 1.0 was released in March 2020 via a publi',
  0.8031338453292847),
 ('=== KServe for model serving ===\nThe KServe component (previously named KFServing) provides Kubernet',
  0.81390780210495)]
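
Note that the best match has the lowest score, so these values behave like distances rather than similarities. Assuming Chroma's default squared L2 metric and unit-length embeddings (both are assumptions, not confirmed by this notebook), a distance converts to a cosine similarity as follows:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance):
    """Valid only for unit-length vectors: |a - b|^2 = 2 - 2 * cos(a, b)."""
    return 1 - distance / 2


# Hypothetical vectors, normalized to unit length.
a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
distance = squared_l2(a, b)
print(distance, cosine_from_squared_l2(distance))
```

Under these assumptions, the top score of about 0.598 would correspond to a cosine similarity of roughly 0.70.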

## RAG step 5 - initialize a Q&A chain

In [33]:
from langchain.chains import RetrievalQA

# the QA chain is constructed with the LLM model (loaded earlier)
# and the embedding database
qa_chain = RetrievalQA.from_chain_type(
    llm=model, retriever=vector_store.as_retriever()
)

### `RetrievalQA` chain - processing steps

1. query the vector database for document splits relevant to the question,
2. pass the retrieved splits into the model prompt,
3. have the LLM generate an answer based on original documents.
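
The steps above can be sketched without any libraries. This toy version makes several assumptions: word counts stand in for the sentence-transformer embeddings, a plain list stands in for the Chroma store, and the prompt wording is made up rather than the one `RetrievalQA` uses:

```python
import math
import re
from collections import Counter

# Toy stand-ins (assumptions): a list instead of a vector database,
# and three hand-written splits instead of the loaded documents.
SPLITS = [
    "Bialowieza Forest is a protected forest in Poland.",
    "Kubeflow is an open-source platform for machine learning on Kubernetes.",
    "European bison live in Bialowieza Forest.",
]


def embed(text):
    """Toy embedding: lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, k=1):
    """Step 1: rank the stored splits by similarity to the question."""
    q = embed(question)
    return sorted(SPLITS, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]


def build_prompt(question, splits):
    """Step 2: put the retrieved splits into the prompt as context."""
    context = "\n".join(splits)
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"


question = "What is Kubeflow?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # Step 3 would send this prompt to the LLM.
```

The model never sees the whole document collection, only the few splits that fit into the prompt, so answer quality depends heavily on the retrieval step.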

## RAG step 6 - use the Q&A chain to query the loaded documents

In [34]:
question = "What does the name 'Bialowieza' mean in English?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> White  Tower.\n'

In [35]:
question = "What's the length of the Tsar's Trail and where does it begin?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> 4  km  long,  starts  at  Przed  Kosym  Mostem  depot.\n'

## No more hallucinations then?

In [36]:
question = "Is Bialowieza Forest famous for its walking trails?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> No,  it  is  known  for  its  undergrowth  plants  and  animal  trails.\n'

## Next steps

1. Other loaders
    - PDF files
    - slide decks
    - Google Sites
    - files in Google Drive
2. Conversational interface
3. Memory
4. Larger model
5. GPU processing

## Links

### Overview

- [How OpenAI trained ChatGPT](https://blog.quastor.org/p/openai-trained-chatgpt)
    - 🎥 [State of GPT | Andrej Karpathy](https://www.youtube.com/watch?v=s6zNXZaIiiI) 
- [Catching up on the weird world of LLMs](https://simonwillison.net/2023/Aug/3/weird-world-of-llms)
- [What We Know About LLMs  (Primer)](https://willthompson.name/what-we-know-about-llms-primer) 
- [The Many Ways that Digital Minds Can Know](https://moultano.wordpress.com/2023/06/28/the-many-ways-that-digital-minds-can-know)

### Tutorials

- [Running a Hugging Face Large Language Model (LLM) locally on my laptop](https://www.markhneedham.com/blog/2023/06/23/hugging-face-run-llm-model-locally-laptop)
- [Why You (Probably) Don’t Need to Fine-tune an LLM](https://www.tidepool.so/2023/08/17/why-you-probably-dont-need-to-fine-tune-an-llm/)

### Challenges

- [Open challenges in LLM research](https://huyenchip.com/2023/08/16/llm-research-open-challenges.html) 
- [OWASP Top 10 for Large Language Model Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications) 

### Skepticism

- [Anti-hype LLM reading list](https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e)
- [What if Generative AI turned out to be a Dud?](https://garymarcus.substack.com/p/what-if-generative-ai-turned-out)
    - And [Marcus on AI](https://garymarcus.substack.com) in general



# Doing now what patients need next
<p>&nbsp;</p>
<p>&nbsp;</p>

