# Proposal review assist tool

The code below is a (mild) refactoring of the mess I created while serving on one of the panels reviewing observational proposals for one of the ESA missions. Well, actually two missions: my agreement to serve on one of them was carried over from the previous cycle, so I ended up with 70+ proposals to read rather than the 20 I expected. I am a lazy person and had some time to play with LLMs, so I sketched an agentic RAG to help me with the task. Ultimately it did not work fully reliably, so I had to read the proposals anyway, but playing with LLMs made the task less boring :) Here I just updated a few things to account for changes in various packages and removed sensitive information. I also made it a bit simpler by looking at a single proposal (an old one written by me), whereas in principle it should be a function looping over a set of proposal PDFs.

Proposals are typically ranked in several categories, i.e. the reviewer needs to answer several questions and then decide, based on those answers, whether the proposal deserves to be granted time. For instance, those could be something like:
* aims of the proposal
* science addressed
* proposal strengths
* proposal weaknesses
* relevance of the mission in the context of the proposal
* ... can the science be done with another facility...

Usually there are not too many, as the answers also need to be summarized very briefly in the final report. Unfortunately, the relevant information is not always contained in the proposal text itself (either because the authors prefer not to include it, or simply due to lack of space). The basic idea was thus to parse the text and extract the relevant references, search for them on ADS, download the corresponding PDFs, and use those as context for answering the questions. As is commonly done, the context is accumulated via RAG. Here I use Pinecone to store the embeddings, but in principle any vector DB will do. Let us start by reading the PDF of the proposal.


In [151]:
import PyPDF2
import ads

# load environment variables for API keys. Don't forget to put those in the .env file
# specifically OPENAI_API_KEY, PINECONE_API_KEY, and ADS_API_KEY
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())

import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    # PyPDF2 >= 3.0 API: PdfReader / .pages / .extract_text()
    # (PdfFileReader, numPages, getPage and extractText were removed)
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            # extract_text() may return nothing for image-only pages
            text += page.extract_text() or ''
    return text

def extract_text_from_pdf_fitz(pdf_path):
    document = fitz.open(pdf_path)
    all_text = ""
    for page_num in range(len(document)):
        page = document.load_page(page_num)  # Load the page
        all_text += page.get_text()  # Extract text from the page
    return all_text

    
pdf_path = 'xmm_ao17_groj1744.pdf'  # Replace with your PDF file path
proposal_text = extract_text_from_pdf_fitz(pdf_path)
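
For the record, looping over a whole folder of proposals could look roughly like the sketch below. This is not the code I actually ran: `extract_all` and the folder name in the usage comment are made up for illustration, and the extractor is passed in as an argument so that either of the two functions above can be plugged in.

```python
from pathlib import Path

def extract_all(folder, extractor, pattern="*.pdf"):
    """Apply `extractor` to every file matching `pattern`; return {file name: text}."""
    texts = {}
    for pdf in sorted(Path(folder).glob(pattern)):
        try:
            texts[pdf.name] = extractor(str(pdf))
        except Exception as err:
            # one unreadable file should not stop the whole batch
            print(f"skipping {pdf.name}: {err}")
    return texts

# usage (hypothetical folder name):
# all_texts = extract_all("proposals", extract_text_from_pdf_fitz)
```

Everything that follows would then be wrapped into a function taking the text of one proposal.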

The proposal format is predefined, i.e. the references are always at the end. We can thus cut out only the interesting part of the text to save some tokens.

In [152]:
print(proposal_text)

The dark side of the bursting pulsar
PI: V. Doroshenko
COIs: A. Santangelo, S. Tsygankov, A. Mushtukov, R. Doroshenko, V. Suleimanov
1. Abstract
GRO J1744−28 is a unique source which appears as a transient accreting pulsar and a peculiar X-ray burster.
This unusual combination of properties is thought to be due to the magnetic ﬁeld strengh of the neutron
star, which being B ∼5 × 1011 G is both signiﬁcantly higher than in accreting millisecond pulsars but
lower than those of normal X-ray pulsars. Given the estimated ﬁeld, it has been suggested that outside of
the outbursts the source shall switch to the so-called “propeller” regime when the accretion is centrifugally
inhibited, and only non-pulsed thermal emission from neutron stars surface could still be observed. However,
several serendipuous detections of GRO J1744−28 in quiescence revealed a spectrum fully compatible with
that observed during the outbursts rather than much softer spectrum expected from a pulsar in “propeller”
regime

In [153]:
references_text = proposal_text[proposal_text.find('References\n'):]
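
One caveat with the one-liner above: `str.find` returns -1 when the marker is missing, and slicing from -1 silently yields just the last character of the text. A slightly more defensive variant might look like this (a sketch; `split_references` is a hypothetical helper, and using the last occurrence of the marker is my assumption about where the reference list sits):

```python
def split_references(text, marker="References"):
    """Return (body, references); references is '' if the marker is absent."""
    pos = text.rfind(marker)  # last occurrence, in case the word also appears in the body
    if pos == -1:
        return text, ""
    return text[:pos], text[pos:]

body, refs = split_references("1. Abstract\nsome science\nReferences\nCui, W. 1997\n")
print(refs)
```

An empty second element then signals that the proposal needs a manual look.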

In [154]:
print(references_text)

References
Cui, W. 1997, , 482, L163
D’A`
ı, A., Di Salvo, T., Iaria, R., et al. 2015, , 449, 4288
Daigne, F., Goldoni, P., Ferrando, P., et al. 2002, , 386, 531
de Jager, O. C., Raubenheimer, B. C., & Swanepoel, J. W. H. 1989, , 221, 180
Degenaar, N., Wijnands, R., Cackett, E. M., et al. 2012, , 545, A49
Doroshenko, R., Santangelo, A., Doroshenko, V., Suleimanov, V., & Piraino, S. 2015, , 452, 2490
Doroshenko, V., Santangelo, A., Doroshenko, R., et al. 2014, , 561, A96
Illarionov, A. F. & Sunyaev, R. A. 1975, , 39, 185
Nishiuchi, M., Koyama, K., Maeda, Y., et al. 1999, , 517, 436
Rappaport, S. & Joss, P. C. 1997, , 486, 435
Sanna, A., Riggio, A., Burderi, L., et al. 2017, , 469, 2
Tsygankov, S. S., Doroshenko, V., Lutovinov, A. A., Mushtukov, A. A., & Poutanen, J. 2017a, , 605, A39
Tsygankov, S. S., Lutovinov, A. A., Doroshenko, V., et al. 2016a, , 593, A16
Tsygankov, S. S., Mushtukov, A. A., Suleimanov, V. F., et al. 2017b, ArXiv e-prints
Tsygankov, S. S., Mushtukov, A. A., Suleimano

The format here is really simple, but different authors may format references differently, so using an LLM is a more robust approach. I use the OpenAI API for that:

In [155]:
promt = """You are a publication list parser. You parse the text consisting of textual references 
           to papers and output each publication in a separate line. Each line contains a 
           list of authors for a given publication and year of the publication as last element.
           You put all author names in quotation marks, remove et al. throughout the text.
           Substitute special characters with equivalent utf-8 symbols. Year must not contain any letters and must be a number separated by same symbol as authors in the author list. The text is:"""

In [156]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": promt},
    {"role": "user", "content": references_text}
  ], temperature=0
)

So now we have the reference strings parsed by GPT:

In [157]:
paper_search_strings = completion.choices[0].message.content.split('\n')
print(paper_search_strings)

['["Cui, W."] 1997', '["D’Aı, A.", "Di Salvo, T.", "Iaria, R."] 2015', '["Daigne, F.", "Goldoni, P.", "Ferrando, P."] 2002', '["de Jager, O. C.", "Raubenheimer, B. C.", "Swanepoel, J. W. H."] 1989', '["Degenaar, N.", "Wijnands, R.", "Cackett, E. M."] 2012', '["Doroshenko, R.", "Santangelo, A.", "Doroshenko, V.", "Suleimanov, V.", "Piraino, S."] 2015', '["Doroshenko, V.", "Santangelo, A.", "Doroshenko, R."] 2014', '["Illarionov, A. F.", "Sunyaev, R. A."] 1975', '["Nishiuchi, M.", "Koyama, K.", "Maeda, Y."] 1999', '["Rappaport, S.", "Joss, P. C."] 1997', '["Sanna, A.", "Riggio, A.", "Burderi, L."] 2017', '["Tsygankov, S. S.", "Doroshenko, V.", "Lutovinov, A. A.", "Mushtukov, A. A.", "Poutanen, J."] 2017', '["Tsygankov, S. S.", "Lutovinov, A. A.", "Doroshenko, V."] 2016', '["Tsygankov, S. S.", "Mushtukov, A. A.", "Suleimanov, V. F."] 2017', '["Tsygankov, S. S.", "Mushtukov, A. A.", "Suleimanov, V. F.", "Poutanen, J."] 2016', '["Tsygankov, S. S.", "Wijnands, R.", "Lutovinov, A. A.", "Degen
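
The LLM output is only as regular as the model feels like on a given day, so it may be worth validating each line before using it. A minimal sketch, assuming the line format shown above (quoted authors followed by a four-digit year); `parse_reference_line` is a hypothetical helper, not part of the original notebook:

```python
import re

def parse_reference_line(line):
    """Return (authors, year) for one line of LLM output, or None if it is malformed."""
    authors = re.findall(r'"([^"]*)"', line)
    years = re.findall(r'\b(1[89]\d\d|20\d\d)\b', line)
    if not authors or not years:
        return None
    return authors, int(years[-1])

lines = ['["Cui, W."] 1997', '', 'Sure, here is the list:']
parsed = [p for p in map(parse_reference_line, lines) if p is not None]
print(parsed)
```

Empty lines and chatty preambles are dropped instead of causing an `IndexError` further down.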

It's time to search ADS for the papers. The easiest way is just to query for the same author list and year, hence the LLM formatting above. Here, however, there are a couple of issues. First, some papers are not found for various reasons (unicode issues, simply wrong references given by the authors, etc.). Second, there may be too many results. This can be due to truncation of the author list in some proposals (to save space), or simply due to a very common name of the first author. Still, it makes sense to retrieve all abstracts and then filter for the most similar papers using abstract embeddings.

In [158]:
import re, ads

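def search_ads(search_string):
    # authors are the quoted substrings, the year is the first 4-digit number
    authors = re.findall(r'"([^"]*)"',search_string)
    year = re.findall('[0-9][0-9][0-9][0-9]',search_string)
    if not authors or not year:
        # malformed or empty line in the LLM output: nothing to search for
        return []
    # "^" in front of the name restricts the match to the first author
    query = f'collection:astronomy AND author:"^{authors[0]}"'+" AND ".join([f' author:"{x}"' for x in authors[1:]])+f" AND year:{year[0]}"
    res = ads.SearchQuery(q=query)
    print(query)
    res.execute()
    return res.articles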

cited_papers = [search_ads(x) for x in paper_search_strings]

collection:astronomy AND author:"^Cui, W." AND year:1997
collection:astronomy AND author:"^D’Aı, A." author:"Di Salvo, T." AND  author:"Iaria, R." AND year:2015
collection:astronomy AND author:"^Daigne, F." author:"Goldoni, P." AND  author:"Ferrando, P." AND year:2002
collection:astronomy AND author:"^de Jager, O. C." author:"Raubenheimer, B. C." AND  author:"Swanepoel, J. W. H." AND year:1989
collection:astronomy AND author:"^Degenaar, N." author:"Wijnands, R." AND  author:"Cackett, E. M." AND year:2012
collection:astronomy AND author:"^Doroshenko, R." author:"Santangelo, A." AND  author:"Doroshenko, V." AND  author:"Suleimanov, V." AND  author:"Piraino, S." AND year:2015
collection:astronomy AND author:"^Doroshenko, V." author:"Santangelo, A." AND  author:"Doroshenko, R." AND year:2014
collection:astronomy AND author:"^Illarionov, A. F." author:"Sunyaev, R. A." AND year:1975
collection:astronomy AND author:"^Nishiuchi, M." author:"Koyama, K." AND  author:"Maeda, Y." AND year:1999
col
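
Looking at the printed queries, there is no `AND` between the first and the second author. I did not check how ADS treats the missing operator, so a cleaner way is to make all operators explicit. A sketch of such a query builder (`build_ads_query` is a hypothetical helper; only the query string is built here, no request is sent):

```python
def build_ads_query(authors, year, collection="astronomy"):
    """Join all search terms with explicit AND operators."""
    if not authors:
        raise ValueError("at least one author is needed")
    # "^" in front of the name marks the first author
    terms = [f"collection:{collection}", f'author:"^{authors[0]}"']
    terms += [f'author:"{a}"' for a in authors[1:]]
    terms.append(f"year:{year}")
    return " AND ".join(terms)

print(build_ads_query(["Rappaport, S.", "Joss, P. C."], 1997))
```

The resulting string can be passed to `ads.SearchQuery(q=...)` exactly as in the cell above.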

In [159]:
[f"{x[0]}: {len(x[1])} results" for x in zip(paper_search_strings,cited_papers)]

['["Cui, W."] 1997: 9 results',
 '["D’Aı, A.", "Di Salvo, T.", "Iaria, R."] 2015: 1 results',
 '["Daigne, F.", "Goldoni, P.", "Ferrando, P."] 2002: 1 results',
 '["de Jager, O. C.", "Raubenheimer, B. C.", "Swanepoel, J. W. H."] 1989: 2 results',
 '["Degenaar, N.", "Wijnands, R.", "Cackett, E. M."] 2012: 2 results',
 '["Doroshenko, R.", "Santangelo, A.", "Doroshenko, V.", "Suleimanov, V.", "Piraino, S."] 2015: 2 results',
 '["Doroshenko, V.", "Santangelo, A.", "Doroshenko, R."] 2014: 2 results',
 '["Illarionov, A. F.", "Sunyaev, R. A."] 1975: 1 results',
 '["Nishiuchi, M.", "Koyama, K.", "Maeda, Y."] 1999: 1 results',
 '["Rappaport, S.", "Joss, P. C."] 1997: 1 results',
 '["Sanna, A.", "Riggio, A.", "Burderi, L."] 2017: 6 results',
 '["Tsygankov, S. S.", "Doroshenko, V.", "Lutovinov, A. A.", "Mushtukov, A. A.", "Poutanen, J."] 2017: 2 results',
 '["Tsygankov, S. S.", "Lutovinov, A. A.", "Doroshenko, V."] 2016: 1 results',
 '["Tsygankov, S. S.", "Mushtukov, A. A.", "Suleimanov, V. F."] 2

For citations where more than one match is available, it makes sense to select the one whose abstract is most similar to the proposal:

In [160]:
from langchain_openai import OpenAIEmbeddings
from scipy.spatial.distance import cosine
import numpy as np

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
e0 = embeddings.embed_query(proposal_text)

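for i in range(len(cited_papers)):
    if len(cited_papers[i])>1:
        # cosine() is a distance (0 = same direction), so the best match minimizes it;
        # papers without an abstract are treated as orthogonal (distance 1)
        distance = [cosine(e0,embeddings.embed_query(p.abstract)) if p.abstract else 1
                    for p in cited_papers[i]]
        cited_papers[i] = [cited_papers[i][np.argmin(distance)]]
# keep one paper per citation and drop citations without any ADS match
cited_papers = [x[0] for x in cited_papers if len(x)>0]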
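
To make the selection logic explicit without calling any API, here is a toy version in pure Python. The three-component vectors are made up and stand in for real embeddings; `cosine_distance` and `pick_closest` are hypothetical helpers that only illustrate the idea of picking the candidate with the smallest cosine distance.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # treat zero vectors as unrelated
    return 1.0 - dot / norm

def pick_closest(reference, candidates):
    """Index of the candidate with the smallest distance; None entries are skipped."""
    scored = [(cosine_distance(reference, c), i)
              for i, c in enumerate(candidates) if c is not None]
    return min(scored)[1] if scored else None

proposal_vec = [1.0, 0.0, 1.0]
abstract_vecs = [[0.0, 1.0, 0.0], None, [2.0, 0.1, 1.9]]  # None = paper without abstract
best = pick_closest(proposal_vec, abstract_vecs)
print(best)
```

Here the third candidate wins because it points almost in the same direction as the proposal vector.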



To get the full texts for building the RAG database, the only thing we need are the bibcodes, so store those:

In [161]:
bibcodes = [x.bibcode for x in cited_papers]

Now we need to download full texts for these articles. Here ADS API needs to be used directly:

In [162]:
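import requests, urllib.request, tempfile, os
pdf_priority = ['ads_pdf','eprint_pdf','pub_pdf'] # try ADS-stored pdf, then arxiv, then publishers (they have captchas)

def download_file(bibcode,priority):
    # ask the ADS resolver where the pdf of the requested type lives
    request = f"https://api.adsabs.harvard.edu/v1/resolver/{bibcode}/{pdf_priority[priority]}"
    # print(request)
    response = requests.get(request,headers={'Authorization': 'Bearer ' + os.getenv('ADS_API_KEY')})
    if not response.ok:
        return None
    url = response.json()['link']
    # create the temporary file only once there is something to download
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_filename = temp_file.name
    try:
        urllib.request.urlretrieve(url, temp_filename)
    except Exception:
        os.remove(temp_filename)  # do not leave empty files behind
        raise
    return temp_filename
    
def get_fulltext(bibcode):
    text = ''
    for i in range(len(pdf_priority)):
        pdf = None
        try:
            pdf = download_file(bibcode,i)
            if pdf is None:
                continue  # this source is not available, try the next one
            text = extract_text_from_pdf_fitz(pdf)
            break
        except Exception:
            # download or parsing failed, try the next source
            continue
        finally:
            # always clean up the temporary file
            if pdf and os.path.exists(pdf):
                os.remove(pdf)
    return text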

In [163]:
cited_texts = [get_fulltext(x) for x in bibcodes]
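
A cheap sanity check on whatever comes back from the download is to look for the `%PDF-` signature that PDF files start with. The sketch below is not part of the original pipeline; `looks_like_pdf` is a hypothetical helper, and scanning the first 1024 bytes rather than only the very first five is my own (lenient) choice.

```python
def looks_like_pdf(path):
    """True if the PDF signature '%PDF-' appears near the start of the file."""
    with open(path, "rb") as fh:
        return b"%PDF-" in fh.read(1024)
```

Calling this before `extract_text_from_pdf_fitz` would let `get_fulltext` skip HTML pages and images right away and move on to the next source.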

Note that some texts are not fetched because journals respond to bots with a picture or whatever. In principle, one needs some way to "prove you're human" in those cases, but that is not the topic of the current exercise, and something is better than nothing. Now we can concatenate the cited papers plus the proposal itself into one knowledge base and convert it to vector embeddings to feed the RAG.

In [164]:
# join all cited texts and the proposal into a single string
knowledge = ' '.join(cited_texts + [proposal_text])

In [165]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 300,
)

texts = text_splitter.create_documents([knowledge])
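
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified stand-in: plain fixed-size character windows. This is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraph breaks and newlines; `sliding_chunks` is a hypothetical helper for illustration only.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=300):
    """Fixed-size character windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # stop once the remaining tail is already covered by the previous window
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap makes sure that a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks, at the price of storing more vectors.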

Now we can actually feed that to Pinecone. Here is what a single chunk looks like:

In [166]:
texts[0]

Document(page_content='arXiv:astro-ph/9704084v1  10 Apr 1997\nEvidence for “Propeller” Eﬀects In X-ray Pulsars GX 1+4 And GRO J1744-28\nWei Cui1\nABSTRACT\nWe present observational evidence for “propeller” eﬀects in two X–ray pulsars,\nGX 1+4 and GRO J1744–28. Both sources were monitored regularly by the Rossi\nX–ray Timing Explorer (RXTE) throughout a decaying period in the X–ray brightness.\nQuite remarkably, strong X–ray pulsation became unmeasurable when total X–ray ﬂux\nhad dropped below a certain threshold. Such a phenomenon is a clear indication of\nthe propeller eﬀects which take place when pulsar magnetosphere grows beyond the\nco-rotation radius as a result of the decrease in mass accretion rate and centrifugal\nforce prevents accreting matter from reaching the magnetic poles. The entire process\nshould simply reverse as the accretion rate increases. Indeed, steady X–ray pulsation\nwas reestablished as the sources emerged from the non-pulsating faint state. These')

In [167]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY')

# configure client
pc = Pinecone(api_key=api_key)

In [168]:
from pinecone import ServerlessSpec
import time

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)
index_name = 'langchain-rag'
print(spec)

ServerlessSpec(cloud='aws', region='us-east-1')


In [169]:
# start from a clean index; deletion fails harmlessly if the index does not exist yet
try:
    pc.delete_index(index_name)
except Exception:
    pass
pc.create_index(
        index_name,
        dimension=len(e0),  # dimensionality of text-embedding-ada-002
        metric='dotproduct',
        spec=spec
    )
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

In [170]:
index = pc.Index(index_name)
# wait a moment for connection
time.sleep(1)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [171]:
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore.from_documents(
        texts,
        index_name=index_name,
        embedding=embeddings
    )

In [172]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1061}},
 'total_vector_count': 1061}

Now we can test whether similarity searches over the vector store actually work:

In [173]:
vectorstore.similarity_search(
    'Justification of requested observing time',  # our search query
    k=1  # return 1 most relevant docs
)

[Document(page_content='observed pulse period with the values reported during the recent outburst of the source will allow to measure\nthe observed rate of spin frequency change in quiescence. Comparison with the spin evolution of the source\nduring the outburst can then be used to put additional constrains on distance to the source and eﬀective\nmagnetospheric radius (Sanna et al. 2017).\n3. Justiﬁcation of requested observing time, feasibility and visibility\nGRO J1744−28 has alredy been observed in quiescence a number of times with XMM-Newton, and\nChandra, however, none of the observations are suitable to achieve our goals. Most of the observations\nare short (1-5 ks) monitoring pointings of the Galactic centre not targeting speciﬁcally the source. The\ntwo longer XMM observations in quiescence (0506291201, 38 ks, and 0794580301, 25 ks) were performed\nin modes unsuitable to search for the pulsations in GRO J1744−28. The former was devoted to study')]

In [174]:
from langchain_openai import ChatOpenAI  # langchain.chat_models is deprecated
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=os.environ.get('OPENAI_API_KEY'),
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

Now we can test how it works. First try a single shot with a super simple prompt:

In [175]:
query = "what are the main scientific objectives for observing GRO J1744-28 with XMM-Newton"
qa.invoke(query)

{'query': 'what are the main scientific objectives for observing GRO J1744-28 with XMM-Newton',
 'result': 'The main scientific objectives for observing GRO J1744-28 with XMM-Newton include measuring the observed rate of spin frequency change in quiescence, comparing the spin evolution of the source during outbursts to constrain distance and effective magnetospheric radius, discriminating between different spectral models, decreasing uncertainty in spectral parameters, and detecting pulsations to understand the mechanisms for the quiescent X-rays in the system.'}

This works OK on simple questions but is not as impressive on more complex ones, i.e. some optimization is needed... The main issue is that the proposal itself is just one part of the RAG database, whereas in fact it is more than that: it is the context for all the questions. And of course there are a lot of possibilities for prompt engineering and chaining. It makes sense to use LangChain for that.

In [224]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


# escape literal braces in the proposal so that they are not mistaken for template variables
escaped_proposal = proposal_text.replace('{', '{{').replace('}', '}}')
template = f'You are a highly qualified scientist reviewing an observational proposal below:\n{escaped_proposal}\n' + """You analyze this in broader context: {context}.
              You use given context to answer the question and carefully analyze both the context and the question before answering.
              When answering, you first look whether the answer is already provided in the context.
              Otherwise list out relevant arguments and answer the question step by step.
              When doing calculations of the exposure time you clearly distinguish between the already conducted observations and those requested in the proposal.
              If you don't know the answer, say you don't know.
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
summarizing_prompt = ChatPromptTemplate.from_template("Extract facts and provide a concise one-two sentence summary of the following: {text}. Be sure to omit intermediate steps but keep relevant final numbers and object names. Only include output itself without prefixes.")

model = ChatOpenAI(openai_api_key=os.environ.get('OPENAI_API_KEY'),
    model_name='gpt-3.5-turbo',
    temperature=0.0)
retriever = vectorstore.as_retriever()

retrieval_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()  # hand plain text, not the message object, to the summarizer
    | {"text": RunnablePassthrough()}
    | summarizing_prompt
    | model
    | StrOutputParser() 
) 

retrieval_chain.invoke("Relevance of XMM in context of the proposed observations")


'XMM-Newton is highly relevant for the proposed observations of the bursting pulsar GRO J1744−28 due to its sensitivity, spatial resolution, previous successful use in similar studies, and feasibility for the planned observation period. The proposed observation will almost triple the total exposure, allowing for discrimination between spectral models and decreasing uncertainty for spectral parameters by a factor of five. XMM-Newton provides higher sensitivity compared to other instruments, reaching X-ray luminosities significantly deeper than their sensitivity levels, and offers sub-arcsecond spatial resolution crucial for studying the detailed structure of the X-ray emission from the pulsar.'
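
Two details of the prompt assembly are easy to trip over, so here is a small stand-alone illustration using plain `str.format` (my assumption is that the default template format of `ChatPromptTemplate` follows the same brace rules). First, literal curly braces in the proposal text must be doubled, otherwise they are read as template variables. Second, retrieved chunks are better joined into plain text than pasted in as a list of objects. Both `escape_braces` and `format_docs` are hypothetical helpers, and the proposal snippet is invented.

```python
def escape_braces(text):
    """Double curly braces so that format-style templates treat them literally."""
    return text.replace("{", "{{").replace("}", "}}")

def format_docs(docs):
    """Join the text of retrieved chunks; each doc is assumed to have .page_content."""
    return "\n\n".join(d.page_content for d in docs)

proposal = "flux F_{X} ~ 10^{-12} erg/s"  # made-up text containing braces
template = ("Proposal: " + escape_braces(proposal)
            + "\nContext: {context}\nQuestion: {question}")
filled = template.format(context="...", question="Is the target visible?")
print(filled)
```

Without the escaping step, formatting the same template fails with a `KeyError` because `{X}` looks like a missing variable.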

Now we can define our list of questions and create a report addressing the questions we're actually interested in:

In [228]:
questions = [
    "what are the scientific goals of the proposal?",
    "why XMM-Newton data is essential and can the program be fulfilled with other facilities? If yes, which?",
    "are the proposed target or targets suitable to support the scientific case? If yes, Why?",
    "are the proposed observations feasible to fulfill the scientific case?",
    "is visibility of the target discussed?"]
answers = [retrieval_chain.invoke(q) for q in questions]

In [229]:
from IPython.display import display, Markdown
md = ""
for question, answer in zip(questions, answers):
    md+=f"#{question}\n\n{answer}\n\n"

display(Markdown(md))

#what are the scientific goals of the proposal?

The proposal aims to investigate the physical origin of X-ray emission from the bursting pulsar GRO J1744−28 in quiescence by verifying continued accretion, determining emission variability, detecting pulsations, comparing spectra with outburst data and other sources, measuring spin frequency change, and constraining distance and magnetospheric radius.

#why XMM-Newton data is essential and can the program be fulfilled with other facilities? If yes, which?

The XMM-Newton data is essential for the proposed observational program to investigate the physical origin of X-ray emission from GRO J1744−28 in quiescence. Previous observations with Chandra were not suitable due to shorter monitoring pointings and unsuitable observation modes. XMM-Newton is needed for the proposed investigation because it provides the required exposure time, sensitivity, and observation modes that other facilities like Chandra may not meet.

#are the proposed target or targets suitable to support the scientific case? If yes, Why?

The proposed target GRO J1744−28 is suitable for the observational proposal due to its unique properties, uncertainties in behavior, need for further investigation, and feasibility of the proposed observation. Previous observations were not sufficient, and a new 50 ks XMM observation in quiescence is necessary to study the source's accretion and emission.

#are the proposed observations feasible to fulfill the scientific case?

The proposal suggests a 50 ks XMM observation is necessary to investigate the physical origin of the observed X-ray emission from the bursting pulsar GRO J1744−28 in quiescence. A sensitivity analysis indicates that this exposure time is needed to detect pulsations at the expected flux level and rule out a thermal origin of the emission, with the target being observable multiple times within the specified timeframe.

#is visibility of the target discussed?

The proposal discusses the visibility of the target, mentioning it is observable multiple times within specific periods in 2018 and 2019. This shows that the target's visibility was considered in planning the observation.
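
When running over many proposals, it would be handy to keep one report per proposal rather than only displaying it in the notebook. A minimal sketch of that idea (`build_report` and the file name in the comment are hypothetical; the example answer is a placeholder, not model output):

```python
def build_report(title, questions, answers):
    """Assemble a small Markdown report with one section per question."""
    parts = [f"# {title}", ""]
    for question, answer in zip(questions, answers):
        parts += [f"## {question}", "", answer.strip(), ""]
    return "\n".join(parts)

report = build_report(
    "xmm_ao17_groj1744",
    ["is visibility of the target discussed?"],
    ["Yes. "],
)
print(report)

# to keep it on disk (hypothetical file name):
# with open("xmm_ao17_groj1744_review.md", "w") as fh:
#     fh.write(report)
```

Note the space after the `#` signs, which is what standard Markdown expects for headings.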

