# PaperBuddy

PaperBuddy is an AI assistant that answers questions about papers on arXiv (arxiv.org) using an LLM (Llama-3.3-70b-instruct) and retrieval-augmented generation (RAG).

## Install dependencies

In [1]:
!pip install -r requirements.txt

Collecting arxiv==2.2.0 (from -r requirements.txt (line 1))
  Downloading arxiv-2.2.0-py3-none-any.whl.metadata (6.3 kB)
Collecting faiss-cpu==1.11.0 (from -r requirements.txt (line 2))
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting langchain-community==0.3.26 (from -r requirements.txt (line 4))
  Downloading langchain_community-0.3.26-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-nvidia-ai-endpoints==0.3.10 (from -r requirements.txt (line 5))
  Downloading langchain_nvidia_ai_endpoints-0.3.10-py3-none-any.whl.metadata (11 kB)
Collecting pymupdf==1.26.1 (from -r requirements.txt (line 6))
  Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting python-dotenv==1.1.1 (from -r requirements.txt (line 7))
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting feedparser~=6.0.10 (from arxiv==2.2.0->-r requirements.txt (line 1))
  Downloading feedparser-6.0.11-py3-none-any

## Discuss papers with PaperBuddy

In [2]:
import shutil

from google.colab import files
from paperbuddy import PaperBuddy

In [3]:
ARXIV_PAPER_IDS = [
    # The EnvDesign Model: A Method to Solve the Environment Design Problem
    '2412.18109'
]

pb = PaperBuddy(ARXIV_PAPER_IDS)

INFO:PaperBuddy_Logger:Loaded environment variables from .env file.
INFO:PaperBuddy_Logger:Using nvidia/nv-embed-v1 embedding model.
INFO:PaperBuddy_Logger:Using meta/llama-3.3-70b-instruct chat model.
INFO:PaperBuddy_Logger:Created FAISS conversation store.
INFO:PaperBuddy_Logger:Created FAISS document store.
INFO:PaperBuddy_Logger:Adding arXiv papers with IDs ['2412.18109'] to document store
INFO:PaperBuddy_Logger:Split papers into chunks.
INFO:PaperBuddy_Logger:Created paper metadata chunks.
INFO:PaperBuddy_Logger:Added paper and metadata chunks to document store.
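
The log lines above trace the indexing half of a RAG pipeline: the papers are split into chunks, and the chunks are stored so that the most relevant ones can be retrieved for each question. The following is a minimal, library-free sketch of that flow, not PaperBuddy's implementation. It assumes a toy bag-of-words vector in place of the nv-embed-v1 embedding model and a plain Python list in place of the FAISS store; `split_into_chunks`, `embed`, and `retrieve` are hypothetical names.

```python
import math
import re
from collections import Counter

# Hypothetical helpers for illustration only; not PaperBuddy's API.


def split_into_chunks(text, chunk_size=8, overlap=2):
    """Split text into overlapping word windows (assumes chunk_size > overlap)."""
    words = text.split()
    step = chunk_size - overlap
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r'[a-z0-9]+', text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, store, k=1):
    """Return the k stored chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


paper_text = (
    'Simulated annealing modifies a schedule one clique at a time. '
    'Branch and bound builds a schedule clique by clique from an empty '
    'collection.'
)
chunks = split_into_chunks(paper_text)
store = [(chunk, embed(chunk)) for chunk in chunks]  # The "document store".
top_chunks = retrieve('How does branch and bound build a schedule?', store)
print(top_chunks)
```

In a full RAG system, the retrieved chunks would then be placed in the prompt sent to the chat model, alongside the question.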


In [4]:
def print_interaction(question):
    """Print a question, send it to PaperBuddy, and print the answer."""
    print(f'Question: {question}\n\n')
    answer = pb.prompt(question)
    print(f'\n\nAnswer: {answer}\n\n')

In [5]:
print_interaction('What is the EnvDesign model?')

Question: What is the EnvDesign model?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: The EnvDesign model is a method that uses graph theory and optimization algorithms to solve the environment design problem, which involves designing pre-production testing environments that take into account the diversity of server/node properties and dynamically emphasize or de-emphasize certain node properties based on current testing priorities.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [6]:
print_interaction('What optimization algorithms are used?')

Question: What optimization algorithms are used?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: The optimization algorithms used include simulated annealing and branch and bound. Specifically, there are 6 simulated annealing algorithms and 12 branch and bound algorithms tested on different instances of the environment design problem.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [7]:
print_interaction('What do the simulated annealing algorithms do?')

Question: What do the simulated annealing algorithms do?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: The simulated annealing algorithms iteratively modify individual cliques in a given schedule to produce a new, more optimal schedule. They optimize the expanded coverage schedule with or without preservation of clique cover, and replace each chosen clique with a clique built from a random vertex, all but one of the vertices, or a single vertex in the chosen clique.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [8]:
print_interaction('What do the branch and bound algorithms do?')

Question: What do the branch and bound algorithms do?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: The branch and bound algorithms build a schedule, clique by clique, starting from an empty collection of cliques, or modify existing cliques to produce a new, more optimal schedule. They iteratively add cliques to a given schedule until it reaches the desired size, and use strategies such as depth-first or best-first selection to choose the next clique to add.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [9]:
print_interaction(
    'Which optimization algorithm performs the best in the experiments?')

Question: Which optimization algorithm performs the best in the experiments?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: The optimization algorithms that perform the best in the experiments are algorithms 3.3 and 3.4 (traditional and look-ahead branch and bound to optimize the expanded coverage schedule, with a depth-first-best-first selection strategy) under the dimension-based and combination-based objective functions, and algorithm 1.3 (simulated annealing to optimize the expanded coverage schedule, without preservation of clique cover and replacing each chosen clique with a clique built from all but one of the vertices in the chosen clique) under the relationship-based objective function. In Experiment 2, the best algorithm is 1.2 (simulated annealing to optimize the expanded coverage schedule, with preservation of clique cover and replacing each chosen clique with a clique built from a random vertex).

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [10]:
print_interaction(
    'What do cliques represent in the context of the environment design '
    'problem?')

Question: What do cliques represent in the context of the environment design problem?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: In the context of the environment design problem, cliques represent node configurations, where each clique is of size d and contains one vertex from each of the d dimensions. Each node configuration corresponds to a clique of size d in the graph, and the problem of finding the minimal set of node configurations needed to cover each dimension is a more constrained version of the minimum clique cover problem.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [11]:
print_interaction('What are some examples of dimensions?')

Question: What are some examples of dimensions?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: Examples of dimensions include HW model, BIOS version, and VM type. 

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem




In [12]:
print_interaction('What are schedules?')

Question: What are schedules?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: A schedule is a collection of testing configurations, where each configuration has a set of dimension values, one from each dimension. It is generated using algorithms such as branch and bound or simulated annealing to optimize the coverage of dimension values. The schedule is used to test different combinations of dimension values.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem
* Conversation history




In [13]:
print_interaction('Are cliques the same thing as testing configurations? '
                  'If so, is it fair to say that schedules are collections '
                  'of cliques?')

Question: Are cliques the same thing as testing configurations? If so, is it fair to say that schedules are collections of cliques?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: Yes, cliques and testing configurations are equivalent, as each clique represents a node configuration with one vertex from each dimension. A schedule is a collection of testing configurations, and since testing configurations are equivalent to cliques, it is fair to say that schedules are collections of cliques.

Sources:
* The EnvDesign Model: A Method to Solve the Environment Design Problem
* Conversation history
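
The answers above describe a configuration (clique) as one value from each dimension and a schedule as a collection of such configurations. A minimal sketch of that structure follows. The dimension names come from the answers above, but the values are invented for illustration, and the coverage check (every value appears at least once) is the simplest possible notion, not the paper's objective functions.

```python
from itertools import product

# Dimension names come from the answers above; the values are made up.
dimensions = {
    'HW model': ['hw-a', 'hw-b'],
    'BIOS version': ['bios-1', 'bios-2', 'bios-3'],
    'VM type': ['vm-small', 'vm-large'],
}

# Every possible configuration (clique) picks exactly one value per dimension.
all_configurations = list(product(*dimensions.values()))


def uncovered_values(schedule):
    """Return the (dimension, value) pairs that no configuration in the schedule uses."""
    covered = {(name, value)
               for configuration in schedule
               for name, value in zip(dimensions, configuration)}
    every_value = {(name, value)
                   for name, values in dimensions.items()
                   for value in values}
    return every_value - covered


# A schedule is a collection of configurations. Three suffice here because
# the largest dimension has three values.
schedule = [
    ('hw-a', 'bios-1', 'vm-small'),
    ('hw-b', 'bios-2', 'vm-large'),
    ('hw-a', 'bios-3', 'vm-large'),
]

print(len(all_configurations))          # 12 possible configurations
print(uncovered_values(schedule))       # set(): every value is covered
print(uncovered_values(schedule[:2]))   # {('BIOS version', 'bios-3')}
```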




In [14]:
pb.save_data()

INFO:PaperBuddy_Logger:Saved arXiv paper IDs.
INFO:PaperBuddy_Logger:Saved FAISS index for PaperBuddy conversation store.
INFO:PaperBuddy_Logger:Saved FAISS index for PaperBuddy document store.


In [15]:
NEW_ARXIV_PAPER_IDS = [
    # optimizn: a Python Library for Developing Customized Optimization
    # Algorithms
    '2503.00033'
]

pb = PaperBuddy(NEW_ARXIV_PAPER_IDS, load_stores=True)

INFO:PaperBuddy_Logger:Loaded environment variables from .env file.
INFO:PaperBuddy_Logger:Using nvidia/nv-embed-v1 embedding model.
INFO:PaperBuddy_Logger:Using meta/llama-3.3-70b-instruct chat model.
INFO:PaperBuddy_Logger:Loaded PaperBuddy conversation store from saved FAISS index.
INFO:PaperBuddy_Logger:Loaded PaperBuddy document store from saved FAISS index.
INFO:PaperBuddy_Logger:Adding arXiv papers with IDs ['2503.00033'] to document store
INFO:PaperBuddy_Logger:Split papers into chunks.
INFO:PaperBuddy_Logger:Created paper metadata chunks.
INFO:PaperBuddy_Logger:Added paper and metadata chunks to document store.


In [16]:
print_interaction('What is the optimizn library?')

Question: What is the optimizn library?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: The optimizn library is a Python library for developing customized optimization algorithms under general optimization algorithm paradigms, including simulated annealing and branch and bound, with continuous training offerings. 

Sources:
* optimizn: a Python Library for Developing Customized Optimization Algorithms




In [17]:
print_interaction('What is simulated annealing?')

Question: What is simulated annealing?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: Simulated annealing is an optimization algorithm inspired by the annealing of solids, where a solution is iteratively modified into a more optimal solution. It features random restarts and occasionally allows modifications that produce a less optimal solution to escape local minima of the cost/objective function.

Sources:
* optimizn: a Python Library for Developing Customized Optimization Algorithms
* The EnvDesign Model: A Method to Solve the Environment Design Problem
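
The step that "occasionally allows modifications that produce a less optimal solution" is commonly implemented with the Metropolis criterion, where a worse candidate is accepted with probability exp(-Δ/T) for cost increase Δ and temperature T. The sketch below assumes that criterion for a minimization problem; the answer above does not say which acceptance rule either paper uses, and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(current_cost, candidate_cost, temperature):
    """Metropolis criterion for minimization (assumes temperature > 0)."""
    if candidate_cost <= current_cost:
        return 1.0  # Improvements and ties are always accepted.
    return math.exp(-(candidate_cost - current_cost) / temperature)


def accept(current_cost, candidate_cost, temperature, rng=random):
    """Decide randomly whether to move to the candidate solution."""
    return rng.random() < acceptance_probability(
        current_cost, candidate_cost, temperature)


# The same uphill move (cost 10 -> 12) becomes less likely as the
# temperature drops, which is how the search settles over time.
for temperature in (10.0, 1.0, 0.1):
    probability = acceptance_probability(10, 12, temperature)
    print(temperature, round(probability, 4))
```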




In [18]:
print_interaction('What is branch and bound?')

Question: What is branch and bound?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: Branch and bound is an optimization algorithm that represents the problem space as a tree, where the root node is the original constrained optimization problem and its descendant nodes are more constrained versions of the problem. The algorithm iteratively adds nodes to the tree, pruning nodes that will not lead to a solution more optimal than the most optimal solution seen so far, until it finds the optimal solution.

Sources:
* optimizn: a Python Library for Developing Customized Optimization Algorithms
* The EnvDesign Model: A Method to Solve the Environment Design Problem
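
The pruning idea described above can be sketched on a small 0/1 knapsack problem. Knapsack is chosen here only for brevity; it is not the problem studied in either paper, and this depth-first variant with a fractional-relaxation bound is one of several ways to organize the search.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Depth-first branch and bound for 0/1 knapsack (assumes positive weights)."""
    # Consider items in decreasing value density so the bound below is valid.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)

    def upper_bound(index, value, room):
        """Optimistic estimate: fill the remaining room, allowing one fractional item."""
        for i in order[index:]:
            if weights[i] <= room:
                room -= weights[i]
                value += values[i]
            else:
                return value + values[i] * room / weights[i]
        return value

    best_value, best_items = 0, []
    # Each node fixes the include/exclude decision for the first `index` items.
    stack = [(0, 0, capacity, [])]
    while stack:
        index, value, room, chosen = stack.pop()
        if value > best_value:
            best_value, best_items = value, chosen
        if index == len(order):
            continue  # Leaf: every item has been decided.
        # Prune: this subtree cannot beat the best solution seen so far.
        if upper_bound(index, value, room) <= best_value:
            continue
        item = order[index]
        # Branch on the next item: exclude it, or include it if it fits.
        stack.append((index + 1, value, room, chosen))
        if weights[item] <= room:
            stack.append((index + 1, value + values[item],
                          room - weights[item], chosen + [item]))
    return best_value, sorted(best_items)


best_value, best_items = knapsack_branch_and_bound(
    values=[60, 100, 120], weights=[10, 20, 30], capacity=50)
print(best_value, best_items)  # 220 [1, 2]
```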




In [19]:
print_interaction('What is continuous training?')

Question: What is continuous training?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: Continuous training refers to the ability to run an optimization algorithm, save its problem parameters, best solution found, and state, and then resume running from that state later. This allows the algorithm to pick up where it left off and potentially produce solutions that get closer to optimality.

Sources:
* optimizn: a Python Library for Developing Customized Optimization Algorithms
* The EnvDesign Model: A Method to Solve the Environment Design Problem
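
The save-and-resume idea described above can be sketched with the standard library's `pickle` module. This is an illustration on a toy random search, not optimizn's mechanism; the state dictionary layout and the `run` function are assumptions made for the example.

```python
import os
import pickle
import random
import tempfile


def cost(x):
    """Toy objective: squared distance from 3."""
    return (x - 3.0) ** 2


def run(state, iterations):
    """Continue a random local search from a saved state and return the updated state."""
    rng = random.Random()
    rng.setstate(state['rng_state'])  # Resume the random stream as well.
    for _ in range(iterations):
        candidate = state['best_solution'] + rng.uniform(-1.0, 1.0)
        if cost(candidate) < state['best_cost']:
            state['best_solution'] = candidate
            state['best_cost'] = cost(candidate)
        state['iterations'] += 1
    state['rng_state'] = rng.getstate()
    return state


initial = {'best_solution': 0.0, 'best_cost': cost(0.0), 'iterations': 0,
           'rng_state': random.Random(0).getstate()}

# First session: run for a while, then persist everything needed to resume.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pkl')
state = run(dict(initial), 50)
cost_after_first_run = state['best_cost']
with open(checkpoint_path, 'wb') as f:
    pickle.dump(state, f)

# Later session: load the checkpoint and pick up where the search left off.
with open(checkpoint_path, 'rb') as f:
    resumed = run(pickle.load(f), 50)

print(cost_after_first_run, resumed['best_cost'], resumed['iterations'])
```

Because the random generator's state is saved along with the best solution, the resumed run behaves the same as one uninterrupted run of 100 iterations.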




In [20]:
print_interaction(
    'To develop a customized simulated annealing algorithm, what '
    'functions does the user have to implement?')

Question: To develop a customized simulated annealing algorithm, what functions does the user have to implement?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: To develop a customized simulated annealing algorithm, the user has to implement the following functions: 
- get initial solution: Returns the initial solution 
- next candidate: Produces a new solution by modifying the current solution
- cost: Computes the value of the objective function (cost) for a given solution
- reset candidate: Produces a new solution that becomes the current solution under the specified reset probability
- get temperature: Gets the temperature given the number of iterations since the last random restart.

Sources:
* optimizn: a Python Library for Developing Customized Optimization Algorithms
* The EnvDesign Model: A Method to Solve the Environment Design Problem
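
To show how the five functions listed above could fit together, here is a sketch of a driver loop that calls them. `AnnealingProblem` and its method signatures are hypothetical, adapted from the names in the answer; this is not optimizn's actual class. The acceptance rule (Metropolis) and the cooling schedule are also assumptions.

```python
import math
import random


class AnnealingProblem:
    """Hypothetical base class (not optimizn's): subclasses supply five hooks."""

    def __init__(self, reset_probability=0.05, seed=0):
        self.reset_probability = reset_probability
        self.rng = random.Random(seed)

    def solve(self, iterations):
        current = self.get_initial_solution()
        current_cost = self.cost(current)
        best, best_cost = current, current_cost
        since_restart = 0
        for _ in range(iterations):
            if self.rng.random() < self.reset_probability:
                # Random restart: jump to a fresh solution.
                current = self.reset_candidate()
                current_cost = self.cost(current)
                since_restart = 0
            else:
                candidate = self.next_candidate(current)
                delta = self.cost(candidate) - current_cost
                temperature = self.get_temperature(since_restart)
                # Accept improvements always, worse moves occasionally.
                if delta <= 0 or self.rng.random() < math.exp(-delta / temperature):
                    current, current_cost = candidate, current_cost + delta
                since_restart += 1
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        return best, best_cost


class NumberPartition(AnnealingProblem):
    """Toy problem: split numbers into two groups with sums as close as possible."""

    def __init__(self, numbers, **kwargs):
        super().__init__(**kwargs)
        self.numbers = numbers

    def get_initial_solution(self):
        return tuple(0 for _ in self.numbers)  # Everything in group 0.

    def next_candidate(self, solution):
        i = self.rng.randrange(len(solution))  # Move one number to the other group.
        return solution[:i] + (1 - solution[i],) + solution[i + 1:]

    def cost(self, solution):
        return abs(sum(n if side else -n
                       for n, side in zip(self.numbers, solution)))

    def reset_candidate(self):
        return tuple(self.rng.randint(0, 1) for _ in self.numbers)

    def get_temperature(self, iterations_since_restart):
        return 10.0 / (1 + iterations_since_restart)  # Cools after each restart.


problem = NumberPartition([8, 7, 6, 5, 4])
best, best_cost = problem.solve(2000)
print(best, best_cost)
```

The base class owns the loop and the subclass owns the problem-specific pieces, so the same driver can be reused for any problem that provides the five hooks.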




In [21]:
print_interaction(
    "Does optimizn's branch and bound implementation allow users to customize "
    'the order in which the nodes in the tree are selected and evaluated? If '
    'so, how?')

Question: Does optimizn's branch and bound implementation allow users to customize the order in which the nodes in the tree are selected and evaluated? If so, how?




INFO:PaperBuddy_Logger:Retrieved relevant conversation history.
INFO:PaperBuddy_Logger:Retrieved relevant context from papers.
INFO:PaperBuddy_Logger:Generated chat model response to user input with retrieved conversation history and context.
INFO:PaperBuddy_Logger:Split user input and PaperBuddy output into chunks.
INFO:PaperBuddy_Logger:Added chunks to conversation store.




Answer: Yes, optimizn's branch and bound implementation allows users to customize the order in which the nodes in the tree are selected and evaluated. The user can specify the strategy for selecting the next node in the tree to evaluate through the "bnb selection strategy" argument in the BnBProblem class constructor. The supported selection strategies are depth-first, depth-first-best-first, and best-first-depth-first.

Sources:
* optimizn: a Python Library for Developing Customized Optimization Algorithms
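
A selection strategy determines which pending node of the search tree is evaluated next. The sketch below contrasts plain depth-first selection (a LIFO stack) with plain best-first selection (a priority queue keyed on each node's bound) on a made-up tree. It does not use optimizn, and it does not attempt the hybrid strategies named above, since the answer does not define them.

```python
import heapq

# A tiny search tree with a lower bound per node; all numbers are made up.
children = {'root': ['a', 'b'], 'a': ['a1', 'a2'], 'b': ['b1', 'b2']}
lower_bound = {'root': 0, 'a': 5, 'b': 2, 'a1': 6, 'a2': 9, 'b1': 3, 'b2': 8}


def visit_order(strategy):
    """Return the order in which nodes are evaluated under a selection strategy."""
    order = []
    if strategy == 'depth-first':
        frontier = ['root']
        while frontier:
            node = frontier.pop()  # LIFO: most recently added node first.
            order.append(node)
            # Reverse so that the first child is popped first.
            frontier.extend(reversed(children.get(node, [])))
    elif strategy == 'best-first':
        frontier = [(lower_bound['root'], 'root')]
        while frontier:
            _, node = heapq.heappop(frontier)  # Smallest lower bound first.
            order.append(node)
            for child in children.get(node, []):
                heapq.heappush(frontier, (lower_bound[child], child))
    else:
        raise ValueError(f'unknown strategy: {strategy}')
    return order


depth_first_order = visit_order('depth-first')
best_first_order = visit_order('best-first')
print(depth_first_order)  # ['root', 'a', 'a1', 'a2', 'b', 'b1', 'b2']
print(best_first_order)   # ['root', 'b', 'b1', 'a', 'a1', 'b2', 'a2']
```

Depth-first selection reaches complete solutions quickly, which helps pruning start early, while best-first selection spends its effort on the most promising nodes at the cost of a larger frontier.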




In [22]:
pb.save_data()

INFO:PaperBuddy_Logger:Saved arXiv paper IDs.
INFO:PaperBuddy_Logger:Saved FAISS index for PaperBuddy conversation store.
INFO:PaperBuddy_Logger:Saved FAISS index for PaperBuddy document store.


In [23]:
# Zip the saved PaperBuddy data directory and download it from Colab.
shutil.make_archive('paperbuddy_data', 'zip', './paperbuddy_data')
files.download('./paperbuddy_data.zip')
