## RAG System Overview for Jamming Source Localization Research


The code below implements a Retrieval-Augmented Generation (RAG) system designed to support literature review tasks. It gives a language model access to documents published after its training data cutoff, as well as documents that require special access online (e.g., documents behind paywalls or requiring credentials). Grounding responses in context retrieved from these documents can improve their accuracy and quality.

In my application, I utilize this system to process over 30 research papers on jamming source localization.

### Purpose of the RAG System

The main function of this RAG system is to aid in the review and analysis of research papers, particularly those concerning jamming source localization, though it can be adapted to other fields as well. Its objective is to efficiently gather and synthesize relevant information, helping researchers and practitioners stay up to date with the newest developments and techniques in this specialized area of study.


### Core Components of the RAG System

- **OpenAI Embeddings**: Uses OpenAI's embedding model to convert document chunks and user queries into semantic vectors. These embeddings are what the retriever compares in order to find the information most relevant to a user query.

- **FAISS Vector Store**: The Facebook AI Similarity Search (FAISS) Vector Store is critical for storing and swiftly retrieving document embeddings, thus allowing the system to find and pull the most pertinent documents based on content similarity to user queries.

- **Retriever**: Functions like the system's search engine, querying the FAISS Vector Store to fetch documents that match the user's input, ensuring the responses are supported by the most relevant data.

- **Cache-Backed Embedder**: Wraps the OpenAI embeddings model with a local file store, so that an embedding computed once is cached and reused instead of being requested again. This saves time and API calls whenever the vector store is rebuilt from the same chunks.

- **Document Processing Pipeline**: Handles the processing of documents with several key subcomponents:
  - **Document Loader**: Loads and preprocesses the research papers, making them ready for embedding and retrieval.
  - **Text Splitter (Document Chunker)**: Splits extensive research papers into smaller, more manageable chunks, which are easier to process for embedding and retrieval.
  - **Document Class**: Wraps each text chunk in a LangChain Document object (page content plus optional metadata), the form in which chunks are passed to the vector store for management and retrieval.

- **LLM Setup (ChatOpenAI)**: Uses the ChatOpenAI model, specifically gpt-3.5-turbo, with a temperature of 0.2 and max_tokens set to 4096, which caps the length of each generated answer. This model generates contextual responses based on the retrieved documents.

- **Conversational Retrieval Chain**: Integrates the LLM, memory, and retriever with callback functions, enhancing the dynamic interaction between the user and the retrieval system. This chain also returns source documents for transparency and deeper analysis.
  - **Memory Management**: Features a Conversation Buffer Memory that tracks the history of user queries and responses, thereby optimizing the contextual relevance of each interaction.
  - **Output Handler**: Employs a StdOutCallbackHandler to directly print responses, facilitating immediate feedback from the system.
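
The caching idea behind the Cache-Backed Embedder can be illustrated with a minimal, standard-library sketch. It is not the LangChain implementation: `CachedEmbedder`, `toy_embedding`, and `embedder_sketch` are hypothetical names, and the letter-frequency vector is only a stand-in for a real embedding model. The point is that a text seen before is served from the cache rather than embedded again.

```python
import hashlib


class CachedEmbedder:
    """Toy cache-backed embedder: each distinct text is embedded only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # underlying (expensive) embedding function
        self.cache = {}           # maps a hash of the text to its vector
        self.calls = 0            # how often embed_fn was actually invoked

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def toy_embedding(text):
    """Hypothetical stand-in for an embedding model: letter frequencies."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


embedder_sketch = CachedEmbedder(toy_embedding)
texts = ["jammer localization", "received signal strength", "jammer localization"]
vectors = [embedder_sketch.embed(t) for t in texts]
print(f"{len(texts)} texts embedded with {embedder_sketch.calls} model calls")
```

Because the first and third texts are identical, the toy model is invoked only twice for three inputs. In the notebook the same role is played by a file store on disk, so the saving carries over between sessions.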

### Text Splitters Experimentation
The project experiments with two text splitters, RecursiveCharacterTextSplitter and CharacterTextSplitter, to identify the best approach for dividing large text documents into manageable chunks. In a series of tests that vary chunk_overlap while keeping the chunk size fixed, each splitter is evaluated on average chunk size (the average page content length of its chunks), number of chunks, and evaluation time. This process aims to balance efficiency (minimizing computational overhead) against granularity (ensuring detailed segmentation), which benefits both the system's performance and its retrieval accuracy.

A higher number of chunks means the text is split into finer, more detailed parts. This can be beneficial for analyzing specific sections in depth but might increase processing time and complexity. Larger chunks reduce the processing load because there are fewer pieces to manage and index, which can speed up retrieval tasks. However, chunks that are too large might miss finer details.

We want to achieve:
- Adequate granularity without producing an excessive number of small chunks that could slow down the system.
- Manageable chunk sizes that are not too large, maintaining necessary detail for effective analysis and retrieval.
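
To make the effect of chunk_overlap on chunk count concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification and an assumption on my part: the LangChain splitters used in this notebook also split on separators, so their counts will differ. `split_with_overlap` and `sample_text` are hypothetical names introduced for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width sliding window; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample_text = "x" * 2000  # hypothetical 2000-character document
for overlap in (0, 100):
    pieces = split_with_overlap(sample_text, chunk_size=500, chunk_overlap=overlap)
    avg = sum(len(p) for p in pieces) / len(pieces)
    print(f"overlap={overlap}: {len(pieces)} chunks, average size {avg:.1f}")
```

With these toy inputs, an overlap of 0 gives 4 chunks and an overlap of 100 gives 5: a larger overlap means each window advances less, so more chunks are needed to cover the same text.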

### System Interaction and Workflow

The workflow of the RAG system is tailored to enhance research on jamming source localization by following these steps:

1. **Query Processing**: The system begins by taking a user query related to jamming source localization and converting it into an embedding.
2. **Document Retrieval**: It then retrieves documents that closely match the query's embedding from the vector store, focusing on those most relevant to jamming source localization.
3. **Response Generation**: Leveraging the retrieved documents, the system generates detailed responses that are informed by the latest research findings, helping users synthesize current trends and results in the field.
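
The three steps above can be sketched end to end with the standard library alone. This is an illustration under stated assumptions, not the notebook's pipeline: a bag-of-words count vector stands in for the OpenAI embedding, a sorted list stands in for the FAISS index, and the final call to the language model is omitted. The names `embed`, `retrieve`, `build_prompt`, and `toy_chunks` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Steps 1-2: embed the query and return the k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Step 3 (input side): place the retrieved text in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


toy_chunks = [
    "received signal strength is used to estimate the jammer position",
    "this dataset contains images of handwritten digits",
    "a swarm of drones can localize a jammer in 3D space",
]
question = "how is the jammer position estimated"
top_chunks = retrieve(question, toy_chunks, k=2)
print(build_prompt(question, top_chunks))
```

The unrelated chunk about handwritten digits shares no words with the question and is left out of the prompt, which is the behavior the retrieval step is meant to provide.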


This RAG system streamlines access to state-of-the-art research in jamming source localization or any other field being researched. By replacing the research papers in the papers directory with papers on another topic, the literature review stage can be accelerated and the required information can be accessed in a much more flexible way.

In [1]:
# Install required packages
!pip install langchain langchain-community langchain-openai openai faiss-cpu tiktoken pdfplumber ipywidgets  > /dev/null

In [2]:
import os
import sys
import time
import openai
import hashlib
import textwrap
import warnings
import pdfplumber
import contextlib
import regex as re
from io import StringIO
import ipywidgets as widgets
from IPython.display import display, HTML
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # ChatOpenAI now lives in langchain_openai
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.vectorstores import FAISS  # vector stores now live in langchain_community
from langchain.storage import LocalFileStore
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
from langchain.schema import Document
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

In [3]:
from getpass import getpass
import os

# Input cell for setting API key (input is hidden)
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key: ").strip()

# Function to validate the API key
def validate_api_key(api_key):
    if not api_key:
        raise ValueError("The API key is not set. Please set the OPENAI_API_KEY environment variable.")
    elif api_key == "YOUR_API_KEY_HERE": # if required set specific API key here
        raise ValueError("The provided API key is a placeholder. Please provide a valid OpenAI API key.")

# Validate the API key
validate_api_key(os.environ["OPENAI_API_KEY"])

print("API key is set and valid.")


Please enter your OpenAI API key: ··········
API key is set and valid.


In [4]:
from google.colab import drive
drive.mount('/content/drive')
papers_directory = '/content/drive/My Drive/genai_project/papers' # content in papers directory is in .pdf format, papers can be replaced with any relevant content in .txt or .pdf format

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Convert PDF research papers to text
def convert_pdfs_to_text(directory):
    for filename in os.listdir(directory):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(directory, filename)
            text_path = os.path.splitext(pdf_path)[0] + '.txt'  # swap only the file extension
            if not os.path.exists(text_path):  # Check if the text file already exists
                with pdfplumber.open(pdf_path) as pdf:
                    full_text = []
                    for page in pdf.pages:
                        page_text = page.extract_text()
                        if page_text:
                            full_text.append(page_text)
                    full_text = "\n".join(full_text)
                with open(text_path, 'w', encoding='utf-8') as text_file:
                    text_file.write(full_text)
                    print(f"Converted {filename} to text.")

convert_pdfs_to_text(papers_directory)

In [6]:
# Function to evaluate splitters
def evaluate_splitter(documents, splitter, metric='avg_chunk_size'):
    start_time = time.time()
    chunks = [Document(page_content=chunk) for document in documents for chunk in splitter.split_text(document.page_content)]
    end_time = time.time()
    num_chunks = len(chunks)
    avg_chunk_size = sum(len(chunk.page_content) for chunk in chunks) / num_chunks if num_chunks else 0
    eval_time = end_time - start_time

    if metric == 'num_chunks':
        return num_chunks
    elif metric == 'eval_time':
        return eval_time
    else:
        return avg_chunk_size

# Define text splitters and chunk_overlap values to test
chunk_overlap_values = [0, 10, 25, 50, 100]  # added 0 to experiment with
splitters = {
    "RecursiveCharacterTextSplitter": RecursiveCharacterTextSplitter,
    "CharacterTextSplitter": CharacterTextSplitter,
}

# Load documents (pdf files converted to text files)
def load_text_documents(directory):
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as file:
                content = file.read()
                documents.append(Document(page_content=content))
    return documents

# Load documents
documents = load_text_documents(papers_directory)

# Evaluate each splitter with different chunk_overlap values
evaluation_results = {}
for name, splitter_cls in splitters.items():
    for overlap in chunk_overlap_values:
        splitter = splitter_cls(chunk_size=500, chunk_overlap=overlap)  # keep chunk size constant and vary overlap
        result = evaluate_splitter(documents, splitter)
        evaluation_results[(name, overlap)] = result
        print(f"Evaluating Splitter: {name}, Chunk Overlap: {overlap}")
        print(f"Number of Chunks: {evaluate_splitter(documents, splitter, metric='num_chunks')}")
        print(f"Average Chunk Size: {result}")
        print(f"Evaluation Time: {evaluate_splitter(documents, splitter, metric='eval_time'):.2f} seconds\n")

# Select the splitter and chunk_overlap with the smallest average chunk size
best_splitter_name, best_overlap = min(evaluation_results, key=evaluation_results.get)
best_splitter = splitters[best_splitter_name](chunk_size=500, chunk_overlap=best_overlap)
print(f"Best Splitter: {best_splitter_name} with Chunk Overlap: {best_overlap} and Average Chunk Size: {evaluation_results[(best_splitter_name, best_overlap)]}\n")

# Print the chosen configuration
print("Chosen Configuration:")
print(f"Splitter: {best_splitter_name}")
print(f"Chunk Overlap: {best_overlap}")
print(f"Metric Value (Average Chunk Size): {evaluation_results[(best_splitter_name, best_overlap)]}")

# Use the best splitter to split the documents
chunks = [Document(page_content=chunk) for document in documents for chunk in best_splitter.split_text(document.page_content)]

print("\nDocument chunks created using the best splitter and chunk overlap configuration.")


Evaluating Splitter: RecursiveCharacterTextSplitter, Chunk Overlap: 0
Number of Chunks: 4240
Average Chunk Size: 455.62594339622643
Evaluation Time: 0.17 seconds

Evaluating Splitter: RecursiveCharacterTextSplitter, Chunk Overlap: 10
Number of Chunks: 4243
Average Chunk Size: 455.7977845863776
Evaluation Time: 0.12 seconds

Evaluating Splitter: RecursiveCharacterTextSplitter, Chunk Overlap: 25
Number of Chunks: 4259
Average Chunk Size: 455.41629490490726
Evaluation Time: 0.24 seconds

Evaluating Splitter: RecursiveCharacterTextSplitter, Chunk Overlap: 50
Number of Chunks: 4301
Average Chunk Size: 456.0823064403627
Evaluation Time: 0.10 seconds

Evaluating Splitter: RecursiveCharacterTextSplitter, Chunk Overlap: 100
Number of Chunks: 4731
Average Chunk Size: 456.2441344324667
Evaluation Time: 0.30 seconds

Evaluating Splitter: CharacterTextSplitter, Chunk Overlap: 0
Number of Chunks: 40
Average Chunk Size: 48401.325
Evaluation Time: 0.00 seconds

Evaluating Splitter: CharacterTextSplitt

In [7]:
# local file store to cache embeddings
store = LocalFileStore("./cache/")

# Instantiate the OpenAI core embeddings model
core_embeddings_model = OpenAIEmbeddings()

# Create a CacheBackedEmbeddings object from the core embeddings model
embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model,
    store,
    namespace = core_embeddings_model.model
)

# Create a vector store using FAISS from document chunks
vectorstore = FAISS.from_documents(chunks, embedder)

# Instantiate a retriever
retriever = vectorstore.as_retriever()

print("Setup complete, vector store and retriever are ready for use.")

Setup complete, vector store and retriever are ready for use.


In [8]:
# Instantiate the LLM using ChatOpenAI with the specified model
# max_tokens caps the length of each generated answer (the library default is None, i.e. no explicit cap)
llm = ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=4096)

# Setup memory for conversation history
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True, input_key='question', output_key='answer')

# Setup the StdOutCallbackHandler to print outputs directly
handler = StdOutCallbackHandler()

# Setup the ConversationalRetrievalChain
qachat = ConversationalRetrievalChain.from_llm(
    llm=llm,
    memory=memory,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)

print("Conversational Retrieval Chain setup complete.")

Conversational Retrieval Chain setup complete.




In [9]:
warnings.filterwarnings("ignore")

# Context manager to suppress stdout from qachat (ConversationalRetrievalChain)
@contextlib.contextmanager
def suppress_stdout():
    with StringIO() as buf, contextlib.redirect_stdout(buf):
        yield

# Function to process queries through the QA chain
def ask_question(question):
    with suppress_stdout():
        response = qachat.invoke({"question": question})  # invoke() replaces the deprecated direct call
    return response['answer']

# Function to format and print the response with wrapping
def format_response(text, width=80):
    wrapper = textwrap.TextWrapper(width=width)
    wrapped_text = wrapper.fill(text)
    return wrapped_text

# Create a Textarea widget for larger input
textarea = widgets.Textarea(
    value='',
    placeholder='Type your question here...',
    description='',
    disabled=False,
    layout=widgets.Layout(width='60%', height='80px', border='none', margin='20px 0 0 0')
)

# Create a button to submit the query
button = widgets.Button(
    description="Submit",
    layout=widgets.Layout(margin='60px 0 0 20px', width='70px', height='40px')
)

# Create an output widget to display the response
response_output = widgets.Output()

# Function to handle button click event (for submit)
def on_button_click(b):
    user_query = textarea.value
    result = ask_question(user_query)
    formatted_result = format_response(result)

    with response_output:
        response_output.clear_output()

        # Display formatted result
        display(HTML(f"<div style='border:1px solid #ddd; padding:15px; border-radius:10px; background-color:#f9f9f9; width:58%; margin-top:20px;'>{formatted_result}</div>"))

# Attach the button click event handler
button.on_click(on_button_click)

# Styled container for question input
textarea_styled = widgets.HTML(
    value="""
    <style>
        .styled-container {
            border: 1px solid #ddd;
            padding: 15px;
            border-radius: 10px;
            background-color: #f9f9f9;
            width: 30%;
            margin-top: 10px;
            font-family: sans-serif, sans-serif;
            box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
        }
        .styled-container h3 {
            font-weight: bold;
            margin-bottom: 8px;
        }
        .styled-container p {
            margin-bottom: 10px;
        }
        .styled-container textarea {
            width: 100%;
            height: 80px;
            padding: 10px;
            border-radius: 5px;
            border: 1px solid #ccc;
            font-family: sans-serif, sans-serif;
            font-size: 14px;
        }
    </style>
    <div class='styled-container'>
        <h3>RAG Model for Jamming Source Localization</h3>
        <p>Ask a question related to research on jamming source localization.</p>
    </div>
    """
)

# Container for input and button
input_container = widgets.HBox([textarea, button], layout=widgets.Layout(align_items='flex-start', margin='10px 0 0 0'))

# Display the styled container, Textarea, button, and response
display(widgets.VBox([
    textarea_styled,
    input_container,
    response_output
]))


Example inputs:
- What's the latest research on jamming source localization?
- What input data is used by the jamming source localization models?
- Can you name some methods used in jamming source localization research?
- Are there any jamming source localization papers that frame their problem statement in 3D space or do they just use 2D?
- If I have a swarm of drones in 3D space and I want to employ a jamming source localization method, can you name a few that might be relevant?
- Please list the most recently published research papers on jamming source localization.
- Provide a list of jamming source localization research papers that use GPS and received signal strength data to localize the jammer.