https://pypi.org/project/arxiv/

Turns out there's a Python wrapper for the arXiv API.

In [1]:
!pip install --upgrade --quiet arxiv
!pip install --upgrade --quiet pymupdf

In [2]:
import arxiv

In [3]:
# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
  print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large result sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1".
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse the client to fetch the paper, then print its title.
# Note: pass `search_by_id` here, not the earlier `search` object.
first_result = next(client.results(search_by_id))
print(first_result.title)

Mean-field theory of 1+1D $\mathbb{Z}_2$ lattice gauge theory with matter
Pair production in rotating electric fields via quantum kinetic equations: Resolving helicity states
Gaussian Process Regression with Soft Inequality and Monotonicity Constraints
Dark energy as a geometrical effect in geodetic brane gravity
Forecasting the sensitivity of Pulsar Timing Arrays to gravitational wave backgrounds
Scalable quantum detector tomography by high-performance computing
Efficient Quantum Circuits for Non-Unitary and Unitary Diagonal Operators with Space-Time-Accuracy trade-offs
Thermodynamics and Quasinormal Modes of the Dymnikova Black Hole in Higher Dimensions
Anyonic quantum multipartite maskers in the Kitaev model
Dancing above the abyss: Environmental effects and dark matter signatures in inspirals into massive black holes
['Mean-field theory of 1+1D $\\mathbb{Z}_2$ lattice gauge theory with matter', 'Pair production in rotating electric fields via quantum kinetic equations: Resolving he

Not that impressive: we already have this functionality. Still, the wrapper could later replace some of the code I wrote earlier in MVP.ipynb.
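If the wrapper does end up replacing hand-written request code, the field-prefixed query strings used in the cell above (`au:`, `ti:`) could be assembled by a small helper. This is only a sketch: `build_query` and `FIELD_PREFIXES` are hypothetical names, not part of the arxiv package, and the prefixes are the ones listed in the arXiv API User Manual.

```python
# Hypothetical helper: builds an arXiv API query string from field filters.
# The prefixes (ti, au, abs, cat, all) come from the arXiv API User Manual.
FIELD_PREFIXES = {
    "title": "ti",
    "author": "au",
    "abstract": "abs",
    "category": "cat",
    "anywhere": "all",
}


def build_query(**filters: str) -> str:
    """Join field filters with AND, in the order they were passed."""
    terms = []
    for field, value in filters.items():
        if field not in FIELD_PREFIXES:
            raise ValueError(f"unknown field: {field!r}")
        # Quote multi-word values so they are matched as a phrase.
        if " " in value:
            value = f'"{value}"'
        terms.append(f"{FIELD_PREFIXES[field]}:{value}")
    return " AND ".join(terms)


query = build_query(author="del_maestro", title="checkerboard")
print(query)
```

The resulting string can be passed straight to `arxiv.Search(query=query)`.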

The only interesting feature is the Result.download_pdf() function. Maybe we could add it so the user can download a preprint they find interesting.
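A rough idea of what such a download feature could look like, written against the `pdf_url`, `title` and `get_short_id()` members of `arxiv.Result` rather than `download_pdf()` itself, so that the file-naming step can be checked without network access. `pdf_filename` and `download_preprint` are hypothetical names, and the `./downloads` directory is an assumption.

```python
import re
import urllib.request
from pathlib import Path


def pdf_filename(short_id: str, title: str) -> str:
    """Build a filesystem-safe name such as '1605.08386v1.Some_title.pdf'."""
    # Old-style IDs look like 'hep-th/9901001v1'; a slash is not valid in a file name.
    safe_id = short_id.replace("/", "_")
    # Replace every run of non-word characters in the title with one underscore.
    safe_title = re.sub(r"\W+", "_", title).strip("_")
    return f"{safe_id}.{safe_title}.pdf"


def download_preprint(result, dirpath: str = "./downloads") -> Path:
    """Save the PDF of an arxiv.Result-like object and return the file path.

    Assumes `result` exposes `pdf_url`, `title` and `get_short_id()`.
    """
    if not result.pdf_url:
        raise ValueError("this result has no PDF link")
    target_dir = Path(dirpath)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / pdf_filename(result.get_short_id(), result.title)
    urllib.request.urlretrieve(result.pdf_url, target)  # performs the network request
    return target
```

With the objects from the cell above, usage would be `download_preprint(first_result)`.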

---

LangChain is probably the best way to make this tool 'conversational'.

In [4]:
!pip install --upgrade langchain

Collecting langchain
  Using cached langchain-0.1.14-py3-none-any.whl.metadata (13 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading SQLAlchemy-2.0.29-cp312-cp312-macosx_10_9_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.3-cp312-cp312-macosx_10_9_x86_64.whl.metadata (7.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Using cached dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Using cached jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.30 (from langchain)
  Using cached langchain_community-0.0.31-py3-none-any.whl.metadata (8.4 kB)
Collecting langchain-core<0.2.0,>=0.1.37 (from langchain)
  Using cached langchain_core-0.1.40-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Using cached langchain_text_splitters-0.0.1-py3-none-any.whl.metadat

In [5]:
from langchain_community.document_loaders import ArxivLoader

In [6]:
docs = ArxivLoader(query="quantum computing", load_max_docs=2).load()
len(docs)

2

In [7]:
docs[0].metadata  # meta-information of the Document

{'Published': '2022-08-01',
 'Title': 'The Rise of Quantum Internet Computing',
 'Authors': 'Seng W. Loke',
 'Summary': 'This article highlights quantum Internet computing as referring to\ndistributed quantum computing over the quantum Internet, analogous to\n(classical) Internet computing involving (classical) distributed computing over\nthe (classical) Internet. Relevant to quantum Internet computing would be areas\nof study such as quantum protocols for distributed nodes using quantum\ninformation for computations, quantum cloud computing, delegated verifiable\nblind or private computing, non-local gates, and distributed quantum\napplications, over Internet-scale distances.'}
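The metadata dictionary is enough to show the user a short citation line next to each hit. A minimal sketch, assuming the keys are exactly those in the output above (`Published`, `Title`, `Authors`); `format_citation` is a hypothetical helper, not a LangChain function.

```python
def format_citation(metadata: dict) -> str:
    """Render ArxivLoader-style metadata as 'Authors (Year). Title.'"""
    # `Published` is shown as 'YYYY-MM-DD'; str() also covers date objects.
    year = str(metadata.get("Published", ""))[:4] or "n.d."
    authors = metadata.get("Authors", "Unknown author")
    title = metadata.get("Title", "Untitled")
    return f"{authors} ({year}). {title}."


example = {
    "Published": "2022-08-01",
    "Title": "The Rise of Quantum Internet Computing",
    "Authors": "Seng W. Loke",
}
print(format_citation(example))
```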

In [8]:
docs[0].page_content[:10000]  # first 10,000 characters of the Document text

'arXiv:2208.00733v1  [cs.ET]  1 Aug 2022\nIEEE IOT MAGAZINE, VOL. XX, NO. X, X 2022\n1\nThe Rise of Quantum Internet Computing\nSeng W. Loke, Member, IEEE\nAbstract—This article highlights quantum Internet computing as referring to distributed quantum computing over the quantum Internet,\nanalogous to (classical) Internet computing involving (classical) distributed computing over the (classical) Internet. Relevant to\nquantum Internet computing would be areas of study such as quantum protocols for distributed nodes using quantum information for\ncomputations, quantum cloud computing, delegated veriﬁable blind or private computing, non-local gates, and distributed quantum\napplications, over Internet-scale distances.\nIndex Terms—quantum Internet computing, quantum Internet, distributed quantum computing, Internet computing, distributed\nsystems, Internet\n”This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this\nv
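The extracted text carries PDF artifacts: hard line breaks at the page width and typographic ligatures (the "fi" in "verifiable" is a single ligature character in the output above). A possible cleanup step before showing the text or passing it to a model is sketched below; `clean_page_text` is a hypothetical helper. Note that NFKC normalization also rewrites other compatibility characters, which may or may not be wanted for formula-heavy papers.

```python
import re
import unicodedata


def clean_page_text(text: str) -> str:
    """Normalize PDF-extracted text: expand ligatures, collapse whitespace."""
    # NFKC maps compatibility characters such as the U+FB01 ligature to 'fi'.
    normalized = unicodedata.normalize("NFKC", text)
    # Line breaks from PDF extraction follow the page width; treat them as spaces.
    return re.sub(r"\s+", " ", normalized).strip()


sample = "delegated veri\ufb01able blind or private computing,\nnon-local gates"
print(clean_page_text(sample))
```

In the notebook this would be applied as `clean_page_text(docs[0].page_content)`.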

### Turns out there's already a semantic search feature for arXiv built into LangChain

In [9]:
# get a token: https://platform.openai.com/account/api-keys

from getpass import getpass

OPENAI_API_KEY = getpass()

In [14]:
import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [12]:
!pip install langchain_openai

Collecting langchain_openai
  Using cached langchain_openai-0.1.1-py3-none-any.whl.metadata (2.5 kB)
Collecting tiktoken<1,>=0.5.2 (from langchain_openai)
  Downloading tiktoken-0.6.0-cp312-cp312-macosx_10_9_x86_64.whl.metadata (6.6 kB)
Collecting regex>=2022.1.18 (from tiktoken<1,>=0.5.2->langchain_openai)
  Downloading regex-2023.12.25-cp312-cp312-macosx_10_9_x86_64.whl.metadata (40 kB)
Using cached langchain_openai-0.1.1-py3-none-any.whl (32 kB)
Downloading tiktoken-0.6.0-cp312-cp312-macosx_10_9_x86_64.whl (974 kB)
Downloading regex-2023.12.25-cp312-cp312-macosx_10_9_x86_64.whl (298 kB)
Installing 

In [21]:
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI
from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(load_max_docs=3)
docs = retriever.get_relevant_documents(query="machine learning")
docs[0].metadata  # meta-information of the Document
docs[0].page_content[:400]  # first 400 characters of the Document content

model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [22]:
questions = [
    #"What are Heat-bath random walks with Markov base?",
    #"What is the ImageBind model?",
    #"How does Compositional Reasoning with Large Language Models work?",
    "What is the Dirichlet Process?",
    "When did John Lafferty most recently publish a paper?",
    "Describe the latest paper by John Lafferty.",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is the Dirichlet Process? 

**Answer**: The Dirichlet Process is a stochastic process used in probability theory and Bayesian statistics. It is often used as a prior distribution in Bayesian nonparametric models. The Dirichlet process is characterized by its property of clustering data into an unknown number of groups, making it a flexible tool in modeling complex data structures. 

-> **Question**: When did John Lafferty most recently publish a paper? 

**Answer**: I don't have that information. 

-> **Question**: Describe the latest paper by John Lafferty. 

**Answer**: I don't know the latest paper published by John Lafferty. 
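The loop above feeds each (question, answer) pair back into the next call, which is what lets follow-up questions refer to earlier turns. The same pattern can be wrapped in a function and exercised without an API key by swapping in a stand-in chain. This is a sketch: `ask_all` and `fake_chain` are hypothetical names, and the only assumption about the chain is that it is a callable taking and returning the dictionaries used above.

```python
from typing import Callable


def ask_all(chain: Callable[[dict], dict], questions: list[str]) -> list[tuple[str, str]]:
    """Ask questions in order, feeding earlier (question, answer) pairs back in."""
    chat_history: list[tuple[str, str]] = []
    for question in questions:
        # Pass a copy so the chain sees only the turns that precede this one.
        result = chain({"question": question, "chat_history": list(chat_history)})
        chat_history.append((question, result["answer"]))
    return chat_history


def fake_chain(inputs: dict) -> dict:
    """Stand-in for the real chain: reports how much history it received."""
    return {"answer": f"seen {len(inputs['chat_history'])} earlier turns"}


history = ask_all(fake_chain, ["first question", "second question"])
print(history)
```

With the real chain, the call would be `ask_all(qa, questions)`.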



In [20]:
questions = [
    "What are Heat-bath random walks with Markov base? Include references to answer.",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What are Heat-bath random walks with Markov base? Include references to answer. 

**Answer**: I don't have information on "Heat-bath random walks with Markov base" in the context provided. 
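One way to investigate an answer like this is to look at which papers the retriever actually handed to the model. A minimal sketch, assuming the chain was built with `return_source_documents=True` so that the result dictionary carries a `source_documents` list, and that each document's metadata has a `Title` key as in the ArxivLoader output earlier. `summarize_sources` is a hypothetical helper, and the `SimpleNamespace` object only mimics a Document for illustration.

```python
from types import SimpleNamespace


def summarize_sources(result: dict) -> list[str]:
    """List the titles of the documents a chain result was grounded on."""
    sources = result.get("source_documents", [])
    return [doc.metadata.get("Title", "<no title>") for doc in sources]


# Stand-in result with the assumed shape; not real chain output.
fake_result = {
    "answer": "I don't have information on that.",
    "source_documents": [
        SimpleNamespace(metadata={"Title": "The Rise of Quantum Internet Computing"}),
    ],
}
print(summarize_sources(fake_result))
print(summarize_sources({"answer": "no sources key"}))
```

If the listed titles are unrelated to the question, the problem lies in retrieval (the query sent to arXiv) rather than in the language model.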

