# Internal Knowledge Base Q&A Using LangChain & OpenAI

This example shows how to query an internal knowledge base stored in a GitHub repo as Markdown files.

This notebook is adapted from the [Retrieval Question Answering with Sources](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa_with_sources.html) example by LangChain.

## Set Up


In [None]:
!pip install langchain==0.0.123 # https://github.com/hwchase17/langchain/releases
!pip install openai
!pip install faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain==0.0.123
  Downloading langchain-0.0.123-py3-none-any.whl (426 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting aiohttp<4.0.0,>=3.8.3
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)
Collecting aiosignal>=1.1.2
  Downloading aiosigna

### Set up OPENAI_API_KEY and necessary variables

In [None]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key here and hit enter:")

Paste your OpenAI API key here and hit enter:··········


In [None]:
REPO_URL = "https://github.com/GovTechSG/developer.gov.sg"  # Source URL
DOCS_FOLDER = "docs"  # Folder to check out to
REPO_DOCUMENTS_PATH = "collections/_products/categories/devops/ship-hats"  # Set to "" to index the whole data folder
DOCUMENT_BASE_URL = "https://www.developer.tech.gov.sg/products/categories/devops/ship-hats"  # Actual URL
DATA_STORE_DIR = "data_store"

## Build the datastore
*(If the data store has already been saved locally, skip ahead to the section on loading the vector store from files to avoid paying for the embeddings again.)*

### Clone the GitHub repo

In [None]:
!git clone $REPO_URL $DOCS_FOLDER

Cloning into 'docs'...
remote: Enumerating objects: 38440, done.
remote: Counting objects: 100% (1657/1657), done.
remote: Compressing objects: 100% (993/993), done.
remote: Total 38440 (delta 1235), reused 972 (delta 648), pack-reused 36783
Receiving objects: 100% (38440/38440), 465.55 MiB | 30.81 MiB/s, done.
Resolving deltas: 100% (25703/25703), done.
Updating files: 100% (1801/1801), done.


### Load documents and split them into chunks for conversion to embeddings

In [None]:
import os
import pathlib
import re

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

name_filter = "**/*.md"
separator = "\n### "  # Assumes the repo's Markdown docs use ### as the main logical header most of the time
chunk_size_limit = 1000
max_chunk_overlap = 20

repo_path = pathlib.Path(os.path.join(DOCS_FOLDER, REPO_DOCUMENTS_PATH))
document_files = list(repo_path.glob(name_filter))

def convert_path_to_doc_url(doc_path):
  # Convert a relative file path to the public document URL: swap the local
  # prefix for DOCUMENT_BASE_URL and drop the file extension.
  # Raw strings keep the regex escapes (\. and \w) from being treated as string escapes.
  return re.sub(rf"{DOCS_FOLDER}/{REPO_DOCUMENTS_PATH}/(.*)\.[\w\d]+", rf"{DOCUMENT_BASE_URL}/\1", str(doc_path))

documents = [
    Document(
        page_content=file.read_text(encoding="utf-8"),  # read_text closes the file after reading
        metadata={"source": convert_path_to_doc_url(file)}
    )
    for file in document_files
]

# Split each document on the header separator, then merge pieces up to the chunk size limit
text_splitter = CharacterTextSplitter(separator=separator, chunk_size=chunk_size_limit, chunk_overlap=max_chunk_overlap)
split_docs = text_splitter.split_documents(documents)
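
The splitting step above can be pictured with a small pure-Python sketch. It is not LangChain's implementation: `split_on_headers` is a hypothetical helper that splits on the same separator and greedily merges pieces up to a character limit, and it ignores chunk overlap entirely.

```python
def split_on_headers(text, separator="\n### ", chunk_size=1000):
    """Hypothetical helper: split on a separator, then greedily merge pieces."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # Try to extend the current chunk with the next piece
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # The extended chunk would be too long: close the current one
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Intro text\n### Setup\nInstall things.\n### Usage\nRun things."
chunks = split_on_headers(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch a single piece longer than `chunk_size` is kept whole as one oversized chunk, so the limit is a target rather than a guarantee.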



### (Optional) Check estimated tokens and costs

In [None]:
!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tiktoken
  Downloading tiktoken-0.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.3.3


In [None]:
import tiktoken
# Get the GPT-4 tokenizer (cl100k_base, the same encoding used by text-embedding-ada-002)
enc = tiktoken.encoding_for_model("gpt-4")

total_word_count = sum(len(doc.page_content.split()) for doc in split_docs)
total_token_count = sum(len(enc.encode(doc.page_content)) for doc in split_docs)

print(f"\nTotal word count: {total_word_count}")
print(f"\nEstimated tokens: {total_token_count}")
print(f"\nEstimated cost of embedding: ${total_token_count * 0.0004 / 1000}")


Total word count: 2065

Estimated tokens: 5215

Estimated cost of embedding: $0.002086
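
The cost figure above is simple arithmetic: tokens multiplied by a per-1,000-token price. The sketch below reproduces it with a hypothetical helper, `estimate_embedding_cost`. The default rate of $0.0004 per 1K tokens is taken from the cell above and is only an assumption here, since pricing may have changed.

```python
def estimate_embedding_cost(token_count, usd_per_1k_tokens=0.0004):
    """Hypothetical helper: estimated embedding cost in USD for a token count."""
    return token_count * usd_per_1k_tokens / 1000

# 5215 is the token count reported in the output above
cost = estimate_embedding_cost(5215)
print(f"${cost:.6f}")
```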


### Create Vector Store using OpenAI embeddings

In [None]:
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(split_docs, embeddings)

### Verify content of Vector Store with a sample query

In [None]:
from IPython.display import display, Markdown

# Each result is a (Document, score) tuple
search_result = vector_store.similarity_search_with_score("What is SHIP-HATS?")

line_separator = "\n"  # Backslashes are not allowed inside f-string expressions before Python 3.12
display(Markdown(f"""
## Search results:{line_separator}
{line_separator.join([
  f'''
  ### Source:{line_separator}{r[0].metadata['source']}{line_separator}
  #### Score:{line_separator}{r[1]}{line_separator}
  #### Content:{line_separator}{r[0].page_content}{line_separator}
  '''
  for r in search_result
])}
"""))
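
When reading the scores, it helps to know whether higher or lower is better. If the FAISS store is backed by a flat L2 index (an assumption here, worth checking for your LangChain version), each score is a distance, so lower means closer. The toy sketch below illustrates that ranking with made-up three-dimensional vectors; `toy_similarity_search` and the vectors are hypothetical and have nothing to do with real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_similarity_search(query_vec, docs, k=2):
    """Hypothetical helper: return the k (name, distance) pairs closest to the query."""
    scored = [(name, squared_l2(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up vectors standing in for document embeddings
toy_docs = {
    "overview": [0.9, 0.1, 0.0],
    "resources": [0.5, 0.5, 0.0],
    "pricing": [0.0, 0.1, 0.9],
}
results = toy_similarity_search([1.0, 0.0, 0.0], toy_docs)
for name, score in results:
    print(name, round(score, 3))
```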

## (Optional) Save vector store to files and download/save in another location for reuse

In [None]:
vector_store.save_local(DATA_STORE_DIR)
# Download the files `$DATA_STORE_DIR/index.faiss` and `$DATA_STORE_DIR/index.pkl` to your local machine

#### To load the Vector Store from files:

In [None]:
# Upload the saved `index.faiss` and `index.pkl` files into the `$DATA_STORE_DIR` directory first
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

if os.path.exists(DATA_STORE_DIR):
  vector_store = FAISS.load_local(
      DATA_STORE_DIR,
      OpenAIEmbeddings()
  )
else:
  print(f"Missing files. Upload index.faiss and index.pkl files to {DATA_STORE_DIR} directory first")
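
The check above only tests that the directory exists. A slightly stricter variant could confirm that both files are present before loading. The sketch below uses a hypothetical helper, `missing_index_files`, and assumes the two file names mentioned in the cell above. It demonstrates the check against a temporary directory rather than a real data store.

```python
import os
import tempfile

def missing_index_files(store_dir, required=("index.faiss", "index.pkl")):
    """Hypothetical helper: list the required files that are absent from store_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(store_dir, name))]

# Demo: a directory holding only one of the two expected files
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "index.faiss"), "wb").close()
    demo_missing = missing_index_files(tmp)
print(demo_missing)
```

An empty list means both files are in place and loading can go ahead.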

## Query using the vector store with ChatGPT integration
### Set up the chat model and specific prompt

In [None]:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template="""Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)
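
The `{summaries}` placeholder in the system template is where the retrieved chunks end up. The sketch below shows one plausible way that context string could be assembled. The "Content / Source" layout, the `build_summaries` helper, the shortened template, and the example.com URLs are all assumptions for illustration, not the library's confirmed behavior.

```python
def build_summaries(docs):
    """Hypothetical helper: render (content, source) pairs into one context string."""
    return "\n\n".join(
        f"Content: {content}\nSource: {source}" for content, source in docs
    )

# Shortened stand-in for the system template defined above
toy_template = "Answer using the context below.\n----------------\n{summaries}"

# Placeholder chunks and URLs, not real retrieval results
retrieved = [
    ("SHIP-HATS is a CI/CD component.", "https://example.com/overview"),
    ("Training is available for each tool.", "https://example.com/training"),
]
system_message = toy_template.format(summaries=build_summaries(retrieved))
print(system_message)
```

Because every chunk is placed in a single prompt, the number and size of retrieved chunks must fit within the model's context window.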

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256)  # Modify model_name if you have access to GPT-4
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs
)

from IPython.display import display, Markdown

def print_result(result):
  # Read the question from the chain output rather than the global `query`,
  # so a saved result can be printed on its own
  output_text = f"""### Question:
  {result['question']}
  ### Answer:
  {result['answer']}
  ### Sources:
  {result['sources']}
  ### All relevant sources:
  {' '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
  """
  display(Markdown(output_text))

#### Use the chain to query

In [None]:
query = "What is SHIP-HATS?"
result = chain(query)
print_result(result)

### Question: 
  What is SHIP-HATS?
  ### Answer: 
  **SHIP (Secure Hybrid Integration Pipeline)-HATS (Hive Agile Testing Solutions)** is a Continuous Integration/Continuous Delivery (CI/CD) component within SG Government Tech Stack (SGTS) that enables developers to plan, build, test, and deploy code to production. It is a multi-tenanted Software-as-a-Service (SaaS) based end-to-end CI/CD for all applications that is classified as RESTRICTED and below. It comes with security and governance guardrails to ensure policy compliance, better quality, visibility, and transparency. SHIP-HATS also offers shortened time-to-market, economies of scale, and a performance management dashboard. It is managed by GovTech, and it offers commercially off-the-shelf (COTS) tools with the right security and compliance settings. 


  ### Sources: 
  https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview
  ### All relevant sources:
  https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/resources https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/training/tools
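
The result shows the answer and its sources as separate fields, while the prompt asks the model to append them in the form "SOURCES: source1 source2". One way such a reply could be separated is to split on that marker, as sketched below. `split_answer_and_sources` is a hypothetical helper and the sample reply is made up. The chain's actual parsing may differ.

```python
import re

def split_answer_and_sources(raw_text):
    """Hypothetical helper: split a model reply at the first 'SOURCES:' marker."""
    parts = re.split(r"SOURCES:\s*", raw_text, maxsplit=1)
    answer = parts[0].strip()
    # No marker means the model cited nothing
    sources = parts[1].strip() if len(parts) > 1 else ""
    return answer, sources

# Made-up reply with a placeholder URL
raw = "SHIP-HATS is a CI/CD component.\nSOURCES: https://example.com/overview"
answer, sources = split_answer_and_sources(raw)
print(answer)
print(sources)
```

This is why the prompt insists on "SOURCES" in capital letters: a marker with different casing would not match, and the sources would stay embedded in the answer text.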
  

Turn on debugging to see the OpenAI requests

In [None]:
import logging

logging.getLogger("openai").setLevel(logging.DEBUG) # logging.INFO or logging.DEBUG

query = "What is SHIP-HATS?"
result = chain(query)
print_result(result)

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/engines/text-embedding-ada-002/embeddings
DEBUG:openai:api_version=None data='{"input": ["What is SHIP-HATS?"], "encoding_format": "base64"}' message='Post details'
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/engines/text-embedding-ada-002/embeddings processing_ms=16 request_id=d9117df6d84956864393935607df48e8 response_code=200
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/chat/completions
DEBUG:openai:api_version=None data='{"messages": [{"role": "system", "content": "Use the following pieces of context to answer the users question.\\nTake note of the sources and include them in the answer in the format: \\"SOURCES: source1 source2\\", use \\"SOURCES\\" in capital letters regardless of the number of sources.\\nIf you don\'t know the answer, just say that \\"I don\'t know\\", don\'t try to make up an answer.\\n----------------\\

### Question: 
  What is SHIP-HATS?
  ### Answer: 
  **SHIP (Secure Hybrid Integration Pipeline)-HATS (Hive Agile Testing Solutions)** is a Continuous Integration/Continuous Delivery (CI/CD) component within SG Government Tech Stack (SGTS) that enables developers to plan, build, test, and deploy code to production. It is a multi-tenanted Software-as-a-Service (SaaS) based end-to-end CI/CD for all applications that is classified as RESTRICTED and below. It comes with security and governance guardrails to ensure policy compliance, better quality, visibility, and transparency. SHIP-HATS also offers shortened time-to-market, economies of scale, and a performance management dashboard. It is managed by GovTech, and it offers commercially off-the-shelf (COTS) tools with the right security and compliance settings. 


  ### Sources: 
  https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview
  ### All relevant sources:
  https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/resources https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/training/tools
  

Print result again without rerunning

In [None]:
print_result(result)

### Question: 
  What is SHIP-HATS?
  ### Answer: 
  **SHIP (Secure Hybrid Integration Pipeline)-HATS (Hive Agile Testing Solutions)** is a Continuous Integration/Continuous Delivery (CI/CD) component within SG Government Tech Stack (SGTS) that enables developers to plan, build, test, and deploy code to production. It is a multi-tenanted Software-as-a-Service (SaaS) based end-to-end CI/CD for all applications that is classified as RESTRICTED and below. It comes with security and governance guardrails to ensure policy compliance, better quality, visibility, and transparency. SHIP-HATS also offers shortened time-to-market, economies of scale, and a performance management dashboard. It is managed by GovTech, and it offers commercially off-the-shelf (COTS) tools with the right security and compliance settings. 


  ### Sources: 
  https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview
  ### All relevant sources:
  https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/training/tools https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/resources
  