<img src="images/roche-logo-blue.png" alt="Roche logo" style="float: right;" width="150" />
<p>&nbsp;</p>
<p>&nbsp;</p>

# Building a Q&A engine with LangChain and open-source LLMs
## Marek Grzenkowicz

#### September 2023

## About Roche

![Key Roche Informatics hubs](images/about-roche.png)

## Large Language Models did not appear out of nowhere

- Natural language processing (NLP)
- Neural networks
- Deep learning
- Transformer architecture
- Word and sentence embeddings
- Evaluation benchmarks and leaderboards
- 🚧TODO: add more examples

## Goals

1. Implement an LLM application without deep NLP knowledge
2. Develop locally instead of calling the API of a cloud-hosted model
3. Use an open-source model
4. Create a prototype with an actual business problem in mind

## Tools

- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) - framework for developing applications powered by LLMs
- [Hugging Face Hub](https://huggingface.co/models) - repository of pre-trained language models
  - [Transformers](https://huggingface.co/docs/transformers/index) - downloading the models
- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) - compact, open-source LLM from [lmsys.org](https://lmsys.org) 
- [ChromaDB](https://www.trychroma.com) - embedding database (vector database)
- [Jupyter Notebook](https://jupyter.org/) with the [RISE](https://github.com/damianavila/RISE) extension - IDE with a slideshow feature

## Why the `lmsys/fastchat-t5-3b-v1.0` model?

- GPU with 4 GB of memory: 1B parameters at best
- CPU with 32 GB of RAM: 3B parameters
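
The reasoning behind these limits can be sketched with simple arithmetic: the weights alone need roughly (number of parameters) × (bytes per parameter). The estimate below rests on assumptions that are not from the talk (4 bytes per parameter for float32, 2 for float16, 1 GB = 10^9 bytes) and ignores activations and framework overhead, so real memory use is higher.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# Assumptions (not from the talk): float32 = 4 bytes, float16 = 2 bytes per parameter,
# 1 GB = 10**9 bytes; activations and framework overhead are ignored.

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(n_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the weights in gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 10**9


for n_params in (1_000_000_000, 3_000_000_000):
    for dtype in ("float32", "float16"):
        print(f"{n_params / 1e9:.0f}B params, {dtype}: ~{weights_gb(n_params, dtype):.0f} GB")
```

Under these assumptions, a 3B-parameter model in float32 needs about 12 GB for weights alone, which fits in 32 GB of RAM but not in 4 GB of GPU memory.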

- [CobraMamba/mamba-gpt-3b-v3](https://huggingface.co/CobraMamba/mamba-gpt-3b-v3) claims to surpass some 12B models ([Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard))
  - but it is slow ([Open LLM performance leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard))


- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)
  - _A commercial-friendly, compact, yet powerful chat assistant_
  - The first model to actually generate any response on my laptop 🎉

## Some duct tape first

In [23]:
# do NOT use in production or even during development!

# import warnings
# warnings.filterwarnings('ignore')

# import logging
# logging.basicConfig(level=logging.CRITICAL, format='')

In [24]:
# https://github.com/chroma-core/chroma/blob/main/chromadb/__init__.py#L57
import sys
__import__("pysqlite3")
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
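
The swap above is only needed when the system SQLite is too old for Chroma. A minimal sketch of such a check follows. The minimum version of 3.35.0 is an assumption that should be verified against the Chroma documentation, and `sqlite_too_old` is a hypothetical helper, not part of any library.

```python
import sqlite3

# Assumed minimum SQLite version required by Chroma (verify against the Chroma docs).
MIN_SQLITE = (3, 35, 0)


def sqlite_too_old(version: str, minimum: tuple = MIN_SQLITE) -> bool:
    """Return True if a dotted version string is lower than `minimum`."""
    parts = tuple(int(p) for p in version.split("."))
    return parts < minimum


# sqlite3.sqlite_version is the version of the SQLite library Python is linked against.
if sqlite_too_old(sqlite3.sqlite_version):
    print("System SQLite is too old; the pysqlite3 swap is needed")
else:
    print("System SQLite is recent enough; no swap needed")
```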

## Load a model

In [25]:
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

In [26]:
model_id, task = "lmsys/fastchat-t5-3b-v1.0", "text2text-generation"

# model will be downloaded on first use and cached in ~/.cache/huggingface/hub/

model = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task=task,
    model_kwargs={
        "temperature": 0,
        "max_length": 1000
    },
)

## Initialize an LLM chain and start asking questions

In [27]:
template_text = """
{question}
"""
template = PromptTemplate(template=template_text, input_variables=["question"])

llm_chain = LLMChain(llm=model, prompt=template)
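
To see what the chain sends to the model, it can help to render the template by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`; this is an assumption about the default placeholder style, not LangChain code, and `render` is a hypothetical helper.

```python
# Stand-in for PromptTemplate: plain str.format fills the {question} placeholder.
template_text = """
{question}
"""


def render(template: str, **variables: str) -> str:
    """Fill named placeholders in the template."""
    return template.format(**variables)


prompt = render(template_text, question="Who is Sheryl Crow?")
print(repr(prompt))
```

The leading and trailing newlines come from the triple-quoted string and become part of the prompt.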

In [28]:
llm_chain("Who is Sheryl Crow?")["text"]

'<pad> Sheryl  Crow  is  an  American  singer,  songwriter,  and  actress.  She  is  best  known  for  her  role  as  the  lead  singer  and  lead  guitarist  of  the  rock  band  The  Band wagon,  and  for  her  role  as  the  lead  singer  and  lead  guitarist  of  the  alternative  rock  band  The  Mamas  and  the  Papas.  Crow  has  also  been  a  member  of  the  band  The  Mamas  and  the  Papas  since  its  formation  in  1995.\n'
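
The raw output starts with a `<pad>` token and contains doubled spaces. A small post-processing helper can tidy this up. `clean_answer` is a hypothetical name, and the sketch assumes these are the only artifacts to remove.

```python
import re


def clean_answer(raw: str) -> str:
    """Drop a leading <pad> token and collapse runs of whitespace."""
    without_pad = re.sub(r"^\s*<pad>", "", raw)
    return re.sub(r"\s+", " ", without_pad).strip()


raw = "<pad> Sheryl  Crow  is  an  American  singer,  songwriter,  and  actress.\n"
print(clean_answer(raw))
```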

## Prompt engineering

In [29]:
template_text = """
Provide brief answers, use 10 words or less.
{question}
"""
template = PromptTemplate(template=template_text, input_variables=["question"])

llm_chain = LLMChain(llm=model, prompt=template)

In [30]:
llm_chain("Who is Sheryl Crow?")["text"]

'<pad> Singer-songwriter'

## Easy questions

In [31]:
llm_chain("Where is Poland located?")["text"]

'<pad> Europe'

In [32]:
llm_chain("What is Bialowieza Forest?")["text"]

'<pad> Bialowieza Forest is a protected forest in Poland.'

## Harder questions

In [33]:
llm_chain("What does the name 'Bialowieza' mean in English?")["text"]

'<pad> "Birch Tree"'

In [34]:
llm_chain("What's the length of the Tsar's Trail and where does it begin?")["text"]

"<pad> The Tsar's Trail is a 900 mile long trail that begins in Moscow and ends in St. Petersburg."

## Retrieval Augmented Generation (RAG)

With RAG, documents are split into chunks and stored as embeddings in a vector
database. At query time, the chunks most semantically similar to the question
are retrieved and passed to the **model prompt via the context window**. The LLM
then uses these chunks of the original documents to generate an answer.

![Question Answering flow](images/langchain-qa-flow.jpeg)

Diagram source: https://python.langchain.com/docs/use_cases/question_answering/#overview
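
The flow in the diagram can be sketched without any libraries. The toy below ranks chunks by word overlap with the question (a crude stand-in for embedding similarity) and places the best ones in the prompt. All function names and the sample chunks are made up for illustration, and no model is called.

```python
import re


def words(text: str) -> set:
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks sharing the most words with the question."""
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def build_prompt(question: str, context_chunks: list) -> str:
    """Place the retrieved chunks in the prompt, ahead of the question."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up chunks, for illustration only.
chunks = [
    "The forest is home to European bison.",
    "The trail starts at the old depot and is four kilometres long.",
    "Tickets can be bought at the park entrance.",
]
question = "How long is the trail and where does it start?"
prompt = build_prompt(question, retrieve(question, chunks, k=1))
print(prompt)
```

In the real pipeline the word-overlap ranking is replaced by a similarity search over embeddings, and the prompt is sent to the LLM.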

## Text embeddings (vectorization)

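An embedding is a fixed-length vector of numbers that represents a piece of text; texts with similar meaning get vectors that lie close to each other.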
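
As a rough illustration of how closeness between embeddings can be measured, the sketch below compares hand-made 3-dimensional vectors using cosine similarity. The vectors are invented for the example; a real embedding model computes them from the text and uses hundreds of dimensions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; a real model would compute these from the text.
toy_embeddings = {
    "forest": [0.9, 0.1, 0.0],
    "woodland": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_embeddings["forest"], toy_embeddings["woodland"]))
print(cosine_similarity(toy_embeddings["forest"], toy_embeddings["invoice"]))
```

With these toy values, "forest" scores much higher against "woodland" than against "invoice", which is the property a vector database relies on.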

## LangChain integrations

TODO: add a screenshot here

1. https://integrations.langchain.com/
2. https://python.langchain.com/docs/integrations/document_loaders

## RAG step 1 - load documents

In [35]:
from langchain.document_loaders import WikipediaLoader

loader = WikipediaLoader("Białowieża_Forest", lang="en")
wiki_page = loader.load()

In [36]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://bpn.com.pl/index.php?option=com_content&task=view&id=651&Itemid=297&lang=en")
bpn_page = loader.load()

## RAG step 2 - split documents into chunks

In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(wiki_page + bpn_page)
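
To make the idea of chunking concrete, here is a simplified splitter that packs whole words into chunks of at most `chunk_size` characters. It is a sketch only: `RecursiveCharacterTextSplitter` works differently (it tries a list of separators in turn), and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(text: str, chunk_size: int = 500) -> list:
    """Greedily pack whitespace-separated words into chunks of <= chunk_size chars.

    A single word longer than chunk_size is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in split_into_chunks(sample, chunk_size=15):
    print(repr(chunk))
```

Smaller chunks give more precise retrieval but less context per chunk, so `chunk_size` is worth tuning for the documents at hand.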

## RAG step 3 - calculate embeddings for text chunks

In [38]:
from langchain.embeddings import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={
        'device': 'cpu'
    },
    encode_kwargs={
        'normalize_embeddings': False
    }
)

## RAG step 4 - store embeddings in a database

In [39]:
from langchain.vectorstores import Chroma

vector_store = Chroma.from_documents(documents=all_splits, embedding=hf_embeddings)
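
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors lie closest to a query vector. The in-memory sketch below uses squared Euclidean distance and a linear scan. `TinyVectorStore` is a hypothetical class with made-up data, and it says nothing about how Chroma is implemented internally.

```python
class TinyVectorStore:
    """Toy in-memory vector store with brute-force nearest-neighbour search."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector: list, text: str) -> None:
        self._items.append((vector, text))

    def search(self, query: list, k: int = 1) -> list:
        """Return the k stored texts closest to the query vector."""
        def squared_distance(item):
            vector, _ = item
            return sum((q - v) ** 2 for q, v in zip(query, vector))

        nearest = sorted(self._items, key=squared_distance)[:k]
        return [text for _, text in nearest]


store = TinyVectorStore()
store.add([0.9, 0.1], "chunk about the forest")
store.add([0.1, 0.9], "chunk about ticket prices")
print(store.search([0.8, 0.2], k=1))
```

A real vector database adds persistence and approximate indexes so that the search does not have to scan every stored vector.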

## RAG step 5 - initialize a Q&A chain

In [40]:
from langchain.chains import RetrievalQA

# the QA chain combines the LLM loaded earlier with a retriever backed by the vector store
qa_chain = RetrievalQA.from_chain_type(llm=model, retriever=vector_store.as_retriever())

## RAG step 6 - use the Q&A chain to query the loaded documents

In [41]:
question = "What does the name 'Bialowieza' mean in English?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> "White  Tower"  in  Polish.\n'

In [42]:
question = "What's the length of the Tsar's Trail and where does it begin?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> 4  km  long,  starts  at  Przed  Kosym  Mostem  depot.\n'

## No more hallucinations then?

In [43]:
question = "Is Bialowieza Forest famous for its walking trails?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> No,  it  is  known  for  its  undergrowth  plants  and  animal  trails.\n'

## Summary

1. Working locally is possible
2. LangChain is powerful
... or is it too obvious?

## Next steps

1. Other loaders
    - PDF files
    - slide decks
    - Google Sites
    - files in Google Drive
2. Conversational interface
3. Memory 

## Links

### Overview

- [How OpenAI trained ChatGPT](https://blog.quastor.org/p/openai-trained-chatgpt)
    - 🎥 [State of GPT | Andrej Karpathy](https://www.youtube.com/watch?v=s6zNXZaIiiI) 
- [Catching up on the weird world of LLMs](https://simonwillison.net/2023/Aug/3/weird-world-of-llms)
- [What We Know About LLMs (Primer)](https://willthompson.name/what-we-know-about-llms-primer)
- [The Many Ways that Digital Minds Can Know](https://moultano.wordpress.com/2023/06/28/the-many-ways-that-digital-minds-can-know)

### Tutorials

- [Running a Hugging Face Large Language Model (LLM) locally on my laptop](https://www.markhneedham.com/blog/2023/06/23/hugging-face-run-llm-model-locally-laptop)
- [Why You (Probably) Don’t Need to Fine-tune an LLM](https://www.tidepool.so/2023/08/17/why-you-probably-dont-need-to-fine-tune-an-llm/)

### Challenges

- [Open challenges in LLM research](https://huyenchip.com/2023/08/16/llm-research-open-challenges.html) 
- [OWASP Top 10 for Large Language Model Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications) 

### Skepticism

- [Anti-hype LLM reading list](https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e)
- [What if Generative AI turned out to be a Dud?](https://garymarcus.substack.com/p/what-if-generative-ai-turned-out)
    - And [Marcus on AI](https://garymarcus.substack.com) in general

## Thank you!

TODO: add go.roche.com links to the GitHub repository