We will use open-source libraries and a large language model (LangChain and Zephyr-7b-beta) to perform conversational Q&A over a PDF (a scientific paper).

We will run the code in Google Colab. Before proceeding, make sure the T4 GPU runtime is selected in the notebook settings.

Code is based on this tutorial: https://medium.com/@nimritakoul01/chat-with-your-pdf-files-using-mistral-7b-and-langchain-f3be9363301c

Note: I tried LlamaIndex, a data framework for connecting custom data sources to LLMs, instead of LangChain, but it didn't work very well.


In [1]:
# Install dependencies
!pip install -q huggingface_hub
!pip install -q chromadb
!pip install -q langchain
!pip install -q pypdf
!pip install -q sentence-transformers
!pip install -q python-dotenv
!pip install -q ctransformers


In [2]:
# import required libraries
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFaceHub
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from dotenv import load_dotenv
import sys

# Load the HUGGINGFACEHUB_API_TOKEN environment variable (Hugging Face token) from a .env file
load_dotenv()

False
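`load_dotenv()` returns `False` when it loads no variables, for example when no `.env` file is found, so the token may not be available to `HuggingFaceHub`. A minimal sketch of a check that could be run before instantiating the LLM is shown here; the helper name `find_hf_token` is hypothetical and not part of any library.

```python
import os

# Name of the environment variable mentioned in the cell above
TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"


def find_hf_token(env=None):
    """Return the token from the mapping (default: os.environ), or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_VAR, "").strip()
    return token or None


if find_hf_token() is None:
    print(f"{TOKEN_VAR} is not set; add it to a .env file or set os.environ[{TOKEN_VAR!r}] first.")
```

In Colab, one option is to set `os.environ["HUGGINGFACEHUB_API_TOKEN"]` directly in a cell instead of relying on a `.env` file.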

Load PDF document and split into chunks

In [3]:
loader = PyPDFLoader(r'Data/Strauss-Liew-3D-SIM.pdf')
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
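To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is only an illustration: LangChain's `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunks will not match this output exactly. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2)
for chunk in demo:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one, which is the effect `chunk_overlap` is meant to have: context that straddles a boundary appears in both chunks.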

Set up the embedder and the database that will be used to embed and store the chunks, respectively

In [4]:
embeddings = HuggingFaceEmbeddings()

# Instantiate database with text and huggingface embeddings
db = Chroma.from_documents(texts, embeddings)

# Retrieve top 3 documents for each query
retriever = db.as_retriever(search_kwargs={'k': 3})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
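The retriever configured with `k=3` returns the three chunks whose embeddings are closest to the query embedding. The idea can be sketched with toy two-dimensional vectors and cosine similarity. This is a simplified illustration, not Chroma's actual index, and the distance metric Chroma uses depends on how the collection is configured. The names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]


# Toy "embeddings" for four chunks and one query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=3))
```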



Specify the model we are going to use and instantiate the LLM using HuggingFaceHub

In [None]:
repo_id = "HuggingFaceH4/zephyr-7b-beta"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature":0.2, "max_new_tokens":200})



Alternatively, try the llama-cpp-python library. Its advantage is that you can choose which quantized GGUF build of the model to run, such as those published by TheBloke. You need to download the model file first (for example with wget) before using it.

In [None]:
# Install llama-cpp-python library and also make sure CUDA is used
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

# Download model. Make sure you find the correct GGUF file link from TheBloke
!wget https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf  #See https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF
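The download link in the wget command follows the Hub's `resolve/main` pattern. As a small sketch, the URL can be assembled from the repository ID and the file name; the helper `gguf_url` is hypothetical, and the file name for any other quantization level is an assumption that should be checked against the repository's file list.

```python
def gguf_url(repo_id, filename, revision="main"):
    """Build a direct-download URL using the /resolve/<revision>/ pattern seen above."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


url = gguf_url("TheBloke/zephyr-7B-beta-GGUF", "zephyr-7b-beta.Q5_K_M.gguf")
print(url)
```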

In [8]:
from langchain.llms import LlamaCpp

# Set n_gpu_layers to the number of layers to offload to the GPU. Set it to 0 if no GPU acceleration is available on your system.
llm = LlamaCpp(
  model_path="./zephyr-7b-beta.Q5_K_M.gguf",  # Download the model file first
  temperature=0,
  max_tokens=4096,  # Upper limit on the number of tokens generated per call
  top_p=1,
  n_ctx=4096,  # Context window in tokens (prompt plus generated text); larger values need much more memory
  n_threads=4,  # The number of CPU threads to use; tailor to your system and the resulting performance
  n_gpu_layers=50  # The number of layers to offload to the GPU, if GPU acceleration is available
)


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model metadata: {'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '32768', 'general.name': 'huggingfaceh4_zephyr-7b-beta', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
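The metadata reports a trained context length of 32768 tokens, while `n_ctx=4096` sets the window actually allocated here. That window has to hold the prompt (question, chat history, and retrieved chunks) as well as the generated answer, so `max_tokens=4096` can never be reached in full. The arithmetic can be sketched as follows; the helper names are hypothetical, and the figure of about 4 characters per token is only a rule of thumb, not a measurement for this tokenizer.

```python
def rough_token_count(n_chars, chars_per_token=4.0):
    """Very rough token estimate from a character count (assumed ratio, not measured)."""
    return int(n_chars / chars_per_token)


def generation_budget(n_ctx, prompt_tokens, max_tokens):
    """Tokens that can actually be generated once the prompt occupies part of the window."""
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt does not fit in the context window")
    return min(max_tokens, n_ctx - prompt_tokens)


# Three retrieved chunks of roughly 500 characters each, plus an assumed
# 400 characters for the question and the prompt template.
prompt_tokens = rough_token_count(3 * 500 + 400)
budget = generation_budget(n_ctx=4096, prompt_tokens=prompt_tokens, max_tokens=4096)
print(prompt_tokens, budget)
```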


Ask your questions and have a conversation with your PDF. The context used to generate each answer and the source document are also printed.

In [13]:
# Create the Conversational Retrieval Chain
qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever, return_source_documents=True)

# Loop until the user exits, saving the chat history so we can chat with the PDF
chat_history = []
while True:
    query = input('\nPrompt: ')
    # To exit, type 'exit', 'quit', or 'q'
    if query.strip().lower() in ["exit", "quit", "q"]:
        print('Exiting')
        break  # End the loop cleanly (sys.exit() raises SystemExit inside a notebook)
    result = qa_chain({'question': query, 'chat_history': chat_history})
    print('Answer: ' + result['answer'] + '\n')
    # Print the retrieved context(s), de-duplicated while keeping retrieval order
    print("Source context(s):")
    print("\n\n".join(dict.fromkeys(x.page_content for x in result['source_documents'])))
    # Print the source file(s)
    print("\n\n###Source file(s)###")
    print("\n".join(dict.fromkeys(x.metadata['source'] for x in result['source_documents'])))
    chat_history.append((query, result['answer']))


Prompt: What is the role of FtsZ in S.aureus?


Llama.generate: prefix-match hit


Answer:  FtsZ is involved in the formation and structure of the cytokinetic ring in S.aureus. Specifically, it forms bead-like structures throughout the entire Z ring, which can vary in concentration and contain visible "gaps" where little or no FtsZ is present. The dynamics of FtsZ within these structures are similar to those observed in B.subtilis, with rapid turnover and exchange between the Z ring and non-ring FtsZ pool required for remodeling and constriction. Visualization using techniques such as 3D-SIM and immunofluorescence confirms the heterogeneous and bead-like arrangement of FtsZ in live and chemically fixed cells, indicating a genuine structure for the Z ring in S.aureus.

Source context(s):
Figure 5. Z ring dynamics in S. aureus .(A) 3D-SIM (OMX Blaze) time-lapse images show how FtsZ localization changes within the Z ring in S.
aureus RN4220 cells (SA89). A white arrowhead marks the position of a gap when it initially forms inside the Z ring. The subsequent position of t

Llama.generate: prefix-match hit
Llama.generate: prefix-match hit


Answer:  The amount of IPTG used for induction of FtsZ-GFP in this study on the role of FtsZ in S.aureus was 0.005 mM. This information can be found in the description of Figure 6, which shows FtsZ-GFP localization in non-dividing cells with the div-355 mutation at the non-permissive temperature using conventional deconvolved microscopy. The authors mention that FtsZ remains dynamic even when division is inhibited and Z ring constriction does not occur. White arrowheads mark areas where EzrA concentration is reduced, and arrowheads indicate the formation of additional gaps in EzrA-GFP rings. Time is indicated on the upper left corner for each image. The diameter of EzrA rings is approximately 0.9 um.

Source context(s):
Figure 5. Z ring dynamics in S. aureus .(A) 3D-SIM (OMX Blaze) time-lapse images show how FtsZ localization changes within the Z ring in S.
aureus RN4220 cells (SA89). A white arrowhead marks the position of a gap when it initially forms inside the Z ring. The subsequen

KeyboardInterrupt: Interrupted by user
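Because the loop appends every `(question, answer)` pair to `chat_history`, the prompt grows with each turn and can eventually exceed the context window. One possible mitigation, sketched here under the assumption that only the most recent turns matter, is to pass a trimmed history to the chain. The helper `trim_history` is hypothetical.

```python
def trim_history(chat_history, max_turns=5):
    """Keep only the most recent max_turns (question, answer) pairs."""
    if max_turns <= 0:
        return []
    return chat_history[-max_turns:]


# Toy history of seven turns
history = [(f"question {i}", f"answer {i}") for i in range(7)]
recent = trim_history(history, max_turns=3)
print(recent)
```

In the loop, this would mean calling `qa_chain({'question': query, 'chat_history': trim_history(chat_history)})` while still appending every turn to the full list.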