<a href="https://colab.research.google.com/github/ZadeFrontier/R2R/blob/main/Conversational_Research_Assistant_Marcktechpost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain-community langchain pypdf sentence-transformers faiss-cpu transformers accelerate einops


In [None]:
import os
import torch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
from IPython.display import display, Markdown

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("Google Drive mounted")

Mounted at /content/drive
Google Drive mounted


In [None]:
def load_documents(pdf_folder_path):
    documents = []

    if not pdf_folder_path:
        print("Downloading a sample paper...")
        !wget -q https://arxiv.org/pdf/1706.03762.pdf -O attention.pdf
        pdf_docs = ["attention.pdf"]
    else:
        pdf_docs = [os.path.join(pdf_folder_path, f) for f in os.listdir(pdf_folder_path)
                   if f.endswith('.pdf')]

    print(f"Found {len(pdf_docs)} PDF documents")

    for pdf_path in pdf_docs:
        try:
            loader = PyPDFLoader(pdf_path)
            documents.extend(loader.load())
            print(f"Loaded: {pdf_path}")
        except Exception as e:
            print(f"Error loading {pdf_path}: {e}")

    return documents


documents = load_documents("")

Downloading a sample paper...
Found 1 PDF documents
Loaded: attention.pdf


In [None]:
def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks

chunks = split_documents(documents)

Split 15 documents into 52 chunks
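
The `chunk_size=1000` and `chunk_overlap=200` settings above can be pictured as a sliding window over the text. The sketch below is a simplified illustration only: `sliding_window_chunks` is a hypothetical helper, and it cuts purely by character position, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, so real chunk boundaries will differ.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not the LangChain implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])  # prints [1000, 1000, 900]
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence that straddles a boundary still appear whole in at least one chunk.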


In [None]:
def create_vector_store(chunks):
    print("Loading embedding model...")
    embedding_model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
    )

    print("Creating vector store...")
    vector_store = FAISS.from_documents(chunks, embedding_model)
    print("Vector store created successfully!")
    return vector_store

vector_store = create_vector_store(chunks)

Loading embedding model...



Creating vector store...
Vector store created successfully!
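
To make the retrieval step less opaque, here is a minimal sketch of what a similarity lookup does conceptually: embed the query, score it against every stored vector, and keep the best `k`. The vectors, `cosine_similarity`, and `top_k` below are toy, hypothetical stand-ins. Cosine similarity is used only for illustration; the FAISS index may rank by a different metric (L2 distance is a common default), so this is not a description of FAISS internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings"; real MiniLM embeddings have far more dimensions.
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.05, 0.0]

print(top_k(toy_query, toy_doc_vectors, k=2))  # prints [0, 1]
```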


In [None]:
!pip install -U bitsandbytes



In [None]:
def load_language_model():
    print("Loading language model...")
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

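    # Note: a package upgraded mid-session may not be picked up until the
    # runtime is restarted, so quantization can still fail after this step.
    try:
        import subprocess
        import sys
        print("Installing/updating bitsandbytes...")
        # Use the current interpreter's pip so the package lands in this environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "bitsandbytes"])
        print("Successfully installed/updated bitsandbytes")
    except Exception:
        print("Could not update bitsandbytes, will proceed without 8-bit quantization")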

    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
    import torch

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    if torch.cuda.is_available():
        try:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
                llm_int8_has_fp16_weight=False
            )

            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                quantization_config=quantization_config
            )
            print("Model loaded with 8-bit quantization")
        except Exception as e:
            print(f"Error with quantization: {e}")
            print("Falling back to standard model loading without quantization")
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto"
            )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float32,
            device_map="auto"
        )

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=2048,
        temperature=0.2,
        top_p=0.95,
        repetition_penalty=1.2,
        return_full_text=False
    )

    from langchain_community.llms import HuggingFacePipeline
    llm = HuggingFacePipeline(pipeline=pipe)
    print("Language model loaded successfully!")
    return llm

llm = load_language_model()

Loading language model...
Installing/updating bitsandbytes...
Successfully installed/updated bitsandbytes



Error with quantization: Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`
Falling back to standard model loading without quantization



Device set to use cuda:0


Language model loaded successfully!
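
The loader tries 8-bit quantization first and falls back to `bfloat16` (or `float32` on CPU). A rough back-of-the-envelope estimate shows why that matters for memory. This is a sketch under stated assumptions: the parameter count of about 1.1 billion is taken from the "1.1B" in the model name, `estimate_weight_memory_gb` is a hypothetical helper, and the figures cover weights only, not activations, the KV cache, or quantization overhead.

```python
def estimate_weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


approx_params = 1.1e9  # assumption: inferred from "1.1B" in the model name

for label, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{label}: ~{estimate_weight_memory_gb(approx_params, nbytes):.1f} GB")
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is the motivation for attempting 8-bit loading on a memory-constrained GPU.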




In [None]:
def create_research_assistant(vector_store, llm):
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    )

    memory = []

    def format_docs(docs):
        formatted = "\n\n".join(f"Document {i+1}:\n{doc.page_content}" for i, doc in enumerate(docs))
        return formatted

    print("Creating conversational research assistant...")

    def process_query(query, return_sources=False):
        nonlocal memory

        # `invoke` replaces the deprecated `get_relevant_documents` call.
        retrieved_docs = retriever.invoke(query)

        context = format_docs(retrieved_docs)

        history_text = "\n".join([f"Human: {q}\nAssistant: {a}" for q, a in memory])

        prompt = (
            f"You are a helpful research assistant. Use the following context to answer the question at the end.\n"
            f"If you don't know the answer or can't find it in the context, say 'I don't have enough information to answer this question.'\n"
            f"Always cite your sources by referring to the Document number.\n\n"
            f"Context:\n{context}\n\n"
            f"Conversation history:\n{history_text}\n\n"
            f"Question: {query}\n"
            f"Answer:"
        )

        # `invoke` replaces the deprecated direct call `llm(prompt)`.
        response = llm.invoke(prompt)

        memory.append((query, response))

        return (response, retrieved_docs) if return_sources else response

    return process_query
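
In `process_query`, the `memory` list grows by one turn per question, and every past turn is pasted into the next prompt. With a small-context model this can eventually crowd out the retrieved context. One possible mitigation, sketched below, is to keep only the most recent turns. `make_history`, `render_history`, and `max_turns` are hypothetical names introduced for illustration, not part of the notebook or of LangChain.

```python
from collections import deque


def make_history(max_turns=3):
    """Create a history buffer that silently drops the oldest turn when full."""
    return deque(maxlen=max_turns)


def render_history(history):
    """Format (question, answer) pairs the same way the prompt above does."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)


history = make_history(max_turns=2)
history.append(("Q1", "A1"))
history.append(("Q2", "A2"))
history.append(("Q3", "A3"))  # pushes out ("Q1", "A1")

print(render_history(history))
# prints:
# Human: Q2
# Assistant: A2
# Human: Q3
# Assistant: A3
```

Capping by turn count is the simplest policy. Capping by token count would track the model's real limit more closely, at the cost of needing the tokenizer inside the helper.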

In [None]:
import textwrap


def format_research_assistant_output(query, response, sources):
    """Render a query, the model's response, and source previews as plain text."""
    output = f"\n{'=' * 50}\n"
    output += f"USER QUERY: {query}\n"
    output += f"{'-' * 50}\n\n"
    output += f"ASSISTANT RESPONSE:\n{response}\n\n"
    output += f"{'-' * 50}\n"
    output += "SOURCES REFERENCED:\n\n"

    for i, doc in enumerate(sources):
        output += f"Source #{i+1}:\n"
        # Preview only the first 200 characters of long chunks.
        if len(doc.page_content) > 200:
            content_preview = doc.page_content[:200] + "..."
        else:
            content_preview = doc.page_content
        wrapped_content = textwrap.fill(content_preview, width=80)
        output += f"{wrapped_content}\n\n"

    output += f"{'=' * 50}\n"
    return output

research_assistant = create_research_assistant(vector_store, llm)

test_queries = [
    "What is the key idea behind the Transformer model?",
    "Explain self-attention mechanism in simple terms.",
    "Who are the authors of the paper?",
    "What are the main advantages of using attention mechanisms?"
]

for query in test_queries:
    response, sources = research_assistant(query, return_sources=True)
    formatted_output = format_research_assistant_output(query, response, sources)
    print(formatted_output)



Creating conversational research assistant...

USER QUERY: What is the key idea behind the Transformer model?
--------------------------------------------------

ASSISTANT RESPONSE:
 The Transformer model uses multi-head attention to jointly attend to information from multiple
representation spaces at different positions. It achieves better BLEU scores than previous state-of-the-art
models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training
cost.

--------------------------------------------------
SOURCES REFERENCED:

Source #1:
The Transformer uses multi-head attention in three different ways: • In
"encoder-decoder attention" layers, the queries come from the previous decoder
layer, and the memory keys and values come from t...

Source #2:
Figure 1: The Transformer - model architecture. The Transformer follows this
overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, ...






USER QUERY: Explain self-attention mechanism in simple terms.
--------------------------------------------------

ASSISTANT RESPONSE:
 Self-attention involves computing the dot product between each input vector and its corresponding
key vector, where the key vector represents the representation space of the same token. This allows us to
attend to information from multiple representations simultaneously without having to compute their individual
dot products.

--------------------------------------------------
SOURCES REFERENCED:

Source #1:
P Epos. We also experimented with using learned positional embeddings [9]
instead, and found that the two versions produced nearly identical results (see
Table 3 row (E)). We chose the sinusoidal vers...

Source #2:
Attention Visualizations Input-Input Layer5 It is in this spirit that a majority
of American governments have passed new laws since 2009 making the registration
or voting process more difficult . <EOS...

Source #3:
3.2 Attention An att




USER QUERY: Who are the authors of the paper?
--------------------------------------------------

ASSISTANT RESPONSE:
 The author of the paper is [insert name].

--------------------------------------------------
SOURCES REFERENCED:

Source #1:
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building
a large annotated corpus of english: The penn treebank. Computational
linguistics, 19(2):313–330, 1993. [26] David McCl...

Source #2:
[37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign
language. In Advances in Neural Information Processing Systems, 2015. [38]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc...

Source #3:
across languages. In Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing, pages 832–841. ACL, August 2009. [15] Rafal
Jozefowicz, Oriol Vinyals, Mike Schuster, Noam ...

Source #4:
comments, corrections and inspiration. References [1] Jimmy Lei Ba, Jamie Ryan
Kiros, and Geoffrey E Hinto




USER QUERY: What are the main advantages of using attention mechanisms?
--------------------------------------------------

ASSISTANT RESPONSE:
 1. Faster Training: Multi-head attention reduces the number of computations needed during training, making it more efficient compared to other attention models like convolutions or recurrence.
2. Better Performance: Self-attention has been shown to perform better than other attention models on tasks like machine translation, natural language processing, and visual recognition.
3. More Parallelizable: Since self-attention relies on matrix multiplication, it can be easily parallelized across GPUs or CPU cores, allowing for greater scalability.

--------------------------------------------------
SOURCES REFERENCED:

Source #1:
Attention Visualizations Input-Input Layer5 It is in this spirit that a majority
of American governments have passed new laws since 2009 making the registration
or voting process more difficult . <EOS...

Source #2:
3.2 At