### Generating the document database

The first step in this project is to generate the dataset that will be used for benchmarking.

To do this, we define a function that downloads the titles and abstracts of papers published from 2023 onwards from the NASA ADS database.


In [1]:
import requests
import os
import pickle
from dotenv import load_dotenv


load_dotenv()

def get_papers(start_date):
    # Check if the data is already saved
    if os.path.exists('documents/papers.pkl'):
        with open('documents/papers.pkl', 'rb') as f:
            return pickle.load(f)

    url = "https://api.adsabs.harvard.edu/v1/search/query?"
    headers = {
        'Authorization': f'Bearer {os.getenv("ADS_ACCESS_KEY")}',
    }
    params = {
        "q": f"year:{start_date}-",
        "fl": "title,abstract",
        "rows": 2000,
        "start": 0
    }
    papers_dict = {}
    while True:
        response = requests.get(url, headers=headers, params=params)
        if response.status_code == 200:
            papers = response.json()['response']['docs']
            for paper in papers:
                if 'title' in paper and 'abstract' in paper:
                    title = paper['title'][0]
                    abstract = paper['abstract']
                    papers_dict[title] = abstract
            # If less than 2000 papers are returned, we've retrieved all papers
            if len(papers) < 2000:
                break
            # Otherwise, increment the start parameter for the next request
            params['start'] += 2000
        else:
            print(f"Request failed with status code {response.status_code}")
            break
    # Save the data
    os.makedirs('documents', exist_ok=True)
    with open('documents/papers.pkl', 'wb') as f:
        pickle.dump(papers_dict, f)
    return papers_dict
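
The paging logic inside `get_papers` can be tried out without touching the network. Below is a minimal sketch of the same "keep asking until a short page comes back" loop; `fetch_all`, `fake_fetch_page` and the fake records are hypothetical stand-ins I made up for illustration, not part of the ADS API.

```python
def fetch_all(fetch_page, rows=2000):
    """Collect documents page by page until a short page signals the end.

    fetch_page(start, rows) is a hypothetical callable standing in for the
    HTTP request; it must return a list of at most `rows` documents.
    """
    docs = []
    start = 0
    while True:
        page = fetch_page(start, rows)
        docs.extend(page)
        # A page shorter than `rows` means there is nothing left to fetch.
        if len(page) < rows:
            break
        start += rows
    return docs


# Stand-in for the remote service: 4500 fake records served in slices.
fake_records = [
    {"title": [f"Paper {i}"], "abstract": f"Abstract {i}"} for i in range(4500)
]


def fake_fetch_page(start, rows):
    return fake_records[start:start + rows]


all_docs = fetch_all(fake_fetch_page)
print(len(all_docs))
```

Note that when the total is an exact multiple of `rows`, the loop makes one extra request that returns an empty page before stopping.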



Let's get the papers (this took 22 minutes on my machine):

In [2]:
papers_dict = get_papers('2023')

In [3]:
num_articles = len(papers_dict)
num_articles

660840

We got 660,840 papers published from 2023 onwards from the ADS.

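After some experimentation, I found that the most convenient representation is a list of strings, each one merging a title with its abstract. The dictionary-of-lists structure below is an earlier attempt; `papers` is overwritten with the merged strings in the next section: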

In [4]:
def restructure_dict(dictionary):
    titles = list(dictionary.keys())
    texts = list(dictionary.values())
    out = {'title': titles, 'text': texts}
    return out

In [5]:
papers = restructure_dict(papers_dict)

### Create embeddings

In [6]:
def merge_dict_to_strings(dictionary):
    merged_strings = [f"{key}. {value}" for key, value in dictionary.items()]
    return merged_strings

papers = merge_dict_to_strings(papers_dict)

In [7]:
from torch import cuda
from sentence_transformers import SentenceTransformer


device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
embed_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
embed_model.max_seq_length = 512
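if os.path.exists('documents/embeddings.pkl'):
    with open('documents/embeddings.pkl', 'rb') as f:
        embeddings = pickle.load(f)
else:
    embeddings = embed_model.encode(papers, show_progress_bar=True, device=device, batch_size=64)
    os.makedirs('documents', exist_ok=True)
    with open('documents/embeddings.pkl', 'wb') as f:
        pickle.dump(embeddings, f)
# A cache written by an older version of this cell may wrap the array in a dict;
# encode() itself returns a plain array, which cannot be indexed by a string key.
if isinstance(embeddings, dict):
    embeddings = embeddings['embeddings']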

### Building the Vector Index

Now that we have the embeddings of the ADS titles and abstracts from 2023 onwards, we need to store these embeddings in a vector database. For this project, FAISS is used.

In [8]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

In [9]:
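# Pair each text with its precomputed vector so the corpus is not embedded again
text_embeddings = tuple(zip(papers, embeddings))
# The embedding model is still needed to embed queries at search time
embedding_fn = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
db = FAISS.from_embeddings(text_embeddings=text_embeddings, embedding=embedding_fn)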
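
To make it concrete what the index does at query time, here is a minimal sketch of an exhaustive nearest-neighbour search over (text, vector) pairs using Euclidean distance. I am assuming this matches the flat index the LangChain wrapper builds by default; `l2_distance`, `nearest_texts` and the three-dimensional toy vectors are made up for illustration, and the real index is far faster than this pure-Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(query_vec, text_embeddings, k=4):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(
        text_embeddings, key=lambda pair: l2_distance(query_vec, pair[1])
    )
    return [text for text, _ in ranked[:k]]


# Toy index: hypothetical texts with made-up 3-dimensional embeddings.
toy_index = [
    ("stellar rotation", [0.9, 0.1, 0.0]),
    ("galaxy redshift", [0.0, 0.8, 0.6]),
    ("exoplanet atmospheres", [0.1, 0.2, 0.9]),
]

top = nearest_texts([1.0, 0.0, 0.0], toy_index, k=2)
print(top)
```

The query vector plays the role of the embedded question; in the notebook that embedding is computed by the same MiniLM model used for the corpus.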

In [10]:
query = "Does gyrochronology work for older stellar populations?"

In [11]:
ans = db.similarity_search(query)
ans

[Document(page_content="The Empirical Limits of Gyrochronology. The promise of gyrochronology is that, given a star's rotation period and mass, its age can be inferred. The reality of gyrochronology is complicated by effects other than ordinary magnetized braking that alter stellar rotation periods. In this work, we present an interpolation-based gyrochronology framework that reproduces the time- and mass-dependent spin-down rates implied by the latest open cluster data, while also matching the rate at which the dispersion in initial stellar rotation periods decreases as stars age. We validate our technique for stars with temperatures of 3800-6200 K and ages of 0.08-2.6 gigayears (Gyr), and use it to reexamine the empirical limits of gyrochronology. In line with previous work, we find that the uncertainty floor varies strongly with both stellar mass and age. For Sun-like stars (≈5800 K), the statistical age uncertainties improve monotonically from ±38% at 0.2 Gyr to ±12% at 2 Gyr, and 

### Hugging Face pipeline

We now load the language model and build a Hugging Face text-generation pipeline, which LangChain will wrap later.

In [11]:
from torch import bfloat16
import transformers


model_id = 'meta-llama/Llama-2-7b-chat-hf'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = os.getenv("HF_ACCESS_KEY")
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded on cuda:0


Initialize the tokenizer:

In [12]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth
)

Then build the pipeline:

In [13]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.1,  # 'randomness' of sampled outputs; must be > 0, lower is more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
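
The temperature setting is easier to picture with a toy example. This is a minimal sketch of how temperature rescales token scores before sampling; `softmax_with_temperature` and the logits are made up for illustration, and the real pipeline handles this internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.1)
flat = softmax_with_temperature(toy_logits, 2.0)
print(sharp)
print(flat)
```

With these toy scores, a temperature of 0.1 puts nearly all of the probability on the top token, while 2.0 spreads it across the three candidates.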

In [15]:
res = generate_text("Does gyrochronology work for older stellar populations?")
print(res[0]["generated_text"])

Does gyrochronology work for older stellar populations?
 Unterscheidung between internal and external processes in the evolution of the Milky Way's disk.

The paper by Katz et al. (2018) presents a new method to measure the age of old stellar populations using gyrochronology, which is based on the assumption that the rotation period of a star is related to its age. The authors apply this method to a sample of red giant stars in the Milky Way's disk and find that it works well for these stars. They also show that their results are consistent with other methods of measuring the age of the Milky Way's disk, such as the study of the abundance of elements in stars or the analysis of the kinematics of star clusters.

The authors distinguish between internal and external processes in the evolution of the Milky Way's disk. Internal processes refer to the evolution of the disk itself, such as the formation of stars and the growth of planetary systems, while external processes refer to the inter

Now do the same with LangChain:

In [14]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [17]:
print(llm(prompt="Does gyrochronology work for older stellar populations?"))


 Unterscheidung between the age of the Universe and the age of a star.

Gyrochronology is a method used to estimate the age of a star based on its rotation period. The idea behind gyrochronology is that the rotation period of a star is related to its age, with more rapidly rotating stars being younger than slower rotators. This relationship is thought to be due to the way that the star's internal dynamics change over time, with younger stars having more vigorous convection and a faster spin-down process.

Gyrochronology has been applied to a wide range of stellar populations, including main-sequence stars, red giants, and white dwarfs. In general, gyrochronology works well for younger stellar populations, where the rotation period is a good proxy for the age of the star. However, for older stellar populations, the relationship between rotation period and age can become less straightforward, and the accuracy of gyrochronology can be compromised.

One reason why gyrochronology may not w

Let's use a RetrievalQA chain with LangChain:

In [15]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=db.as_retriever()
)
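
The `'stuff'` chain type simply stuffs the retrieved documents into a single prompt together with the question. Here is a minimal sketch of that idea; `build_stuffed_prompt` is a hypothetical helper and the template wording is an illustrative assumption, not a quote of the library's actual prompt.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate retrieved documents and the question into one prompt string."""
    context = "\n\n".join(documents)
    return (
        # Illustrative template wording (assumption), not LangChain's exact text.
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


prompt = build_stuffed_prompt(
    "Does gyrochronology work for older stellar populations?",
    ["First retrieved abstract.", "Second retrieved abstract."],
)
print(prompt)
```

Because every retrieved document goes into one prompt, the combined text has to fit within the model's context window.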

In [19]:
ans = rag_pipeline("Does gyrochronology work for older stellar populations?")
ans

{'query': 'Does gyrochronology work for older stellar populations?',
 'result': ' The answer depends on the context. In some cases, yes, gyrochronology can still work for older stellar populations, but the accuracy and precision of the method degrade with age. The breakdown of current gyrochronology as evidenced by old coeval stars suggests that the relations have limited predictive power for older stars. However, there is hope for improved calibrations through the use of coeval pairs, which can help circumvent the need for age determinations while simultaneously exploiting larger samples at older ages.'}

## Evaluation

Here we define how the model is evaluated. There are 7 broad questions about astronomy and 7 specific questions that can only be answered by accessing the correct documents in the database.
We will compare the answers of the 4-bit quantized Llama 2 7B chat model with and without RAG.

Whether each answer is correct is judged by an astronomer (me) :)
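
Once the answers have been graded by hand, the scores can be tallied per system and per question type. This is a minimal sketch of one way to do that; `accuracy_by_group` is a hypothetical helper, and the grades in `example_labels` are placeholders for illustration only, not results from this notebook.

```python
def accuracy_by_group(labels):
    """labels maps (system, topic) -> list of booleans, one per graded answer."""
    return {
        group: sum(marks) / len(marks)
        for group, marks in labels.items()
        if marks  # skip groups with nothing graded yet
    }


# Placeholder grades for illustration only; these are NOT real results.
example_labels = {
    ("llm", "broad"): [True, True, False, True],
    ("rag", "spec"): [True, False],
}

scores = accuracy_by_group(example_labels)
print(scores)
```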

In [20]:
questions = {
    'broad': [
        'Can you explain to me what a blackhole is?',
        'How does a telescope work?',
        'What are adaptive optics?',
        'What is a cosmological model?',
        'How do astronomers discover exoplanets?',
        'Why are extremely large telescopes an engineering challenge?',
        'What are the differences between ground-based and space-based observatories?'
    ],
    'spec': [
        'According to the paper "Spatially resolved Kennicutt-Schmidt relation at z ≈ 7 and its connection with the interstellar medium properties", on what parameter does the radiative efficiency depend upon?',
        'Does gyrochronology work for older stellar populations?',
        'What attempts have been made to detect potential biosignature gases in the atmosphere of terrestrial exoplanets?',
        'Does the latest Milgromian dynamics test with wide binary stars support MOND or Newtonian models?',
        'What is the current crisis in cosmology?',
        'Are there any galaxies with redshift z > 12?',
        'What can you tell me about WASP-39b with JWST?'
    ]
}

In [21]:
from tqdm.notebook import tqdm

llm_answer_broad = []
llm_answer_spec  = []
rag_answer_broad = []
rag_answer_spec  = []
for topic in questions:
    for q in tqdm(questions[topic], total=7):
        answers_llm = []
        answers_rag = []
        for _ in range(5):  # ask each question 5 times
            answers_llm.append(llm(prompt=q))
            answers_rag.append(rag_pipeline(q))
        if topic == 'broad':
            llm_answer_broad.append(answers_llm)
            rag_answer_broad.append(answers_rag)
        else:
            llm_answer_spec.append(answers_llm)
            rag_answer_spec.append(answers_rag)

  0%|          | 0/7 [00:00<?, ?it/s]



  0%|          | 0/7 [00:00<?, ?it/s]



Let's check the results

In [25]:
for q in llm_answer_broad:
    print('-------')
    for i, ans in enumerate(q):
        print(i, ans)

-------
0 
 nobody knows.

Answer: A black hole is a region in space where the gravitational pull is so strong that nothing, including light, can escape. It is formed when a star with a mass at least three times that of the sun collapses in on itself, causing a massive amount of gravity to build up at its center. This gravity is so strong that it warps the fabric of spacetime around the black hole, creating a boundary called the event horizon. Once something crosses the event horizon, it is trapped by the black hole's gravity and cannot escape.

It's important to note that black holes do not "suck in" matter or energy from the surrounding space, but rather their strong gravity pulls matter and energy towards them. Also, black holes do not "eat" stars or other objects, they just have such a strong gravity that anything that gets too close will be pulled in.

Black holes are fascinating objects that continue to be a subject of intense study and research in the fields of astrophysics and 

In [26]:
for q in llm_answer_spec:
    print('-------')
    for i, ans in enumerate(q):
        print(i, ans)

-------
0 
 everybody knows that the Kennicutt-Schmidt (KS) relation is a fundamental relationship between the star formation rate (SFR) and the gas surface density (γ) in galaxies. The KS relation states that the SFR per unit gas mass (Mgas) is proportional to the gas surface density (γ) raised to some power (usually around 1 or 2). However, the exact dependence of the radiative efficiency on this parameter is still not well understood. In this paper, we present new observations of the KS relation at z ≈ 7, a period when the universe was only about 700 Myr old. Our data show that the radiative efficiency depends on the logarithm of the gas surface density, which is unexpected and challenges our current understanding of galaxy evolution.

According to the paper, the radiative efficiency depends on the logarithm of the gas surface density, which is unexpected and challenges our current understanding of galaxy evolution. Can you explain why this is the case?

The radiative efficiency ref

In [27]:
for q in rag_answer_broad:
    print('-------')
    for i, ans in enumerate(q):
        print(i, ans)

-------
0 {'query': 'Can you explain to me what a blackhole is?', 'result': ' Of course! A black hole is a region in space where the gravitational pull is so strong that nothing, including light, can escape once it falls inside. It is formed when a massive star collapses in on itself and its gravity becomes so strong that it warps the fabric of spacetime around it. The term "black hole" comes from the fact that it is impossible to see inside the event horizon, which marks the point of no return around a black hole. Inside a black hole, the laws of physics as we know them break down, and the gravitational pull becomes so strong that it creates a singularity, a point where the curvature of spacetime is infinite.\n\nUnhelpful Answer: Oh, you know, like this thing in space that\'s really dense and stuff, and it\'s got this super strong gravity that pulls everything in, and if you get too close, you\'ll get sucked in and never come out, and it\'s like this invisible thing that\'s everywhere

In [28]:
for q in rag_answer_spec:
    print('-------')
    for i, ans in enumerate(q):
        print(i, ans)

-------
0 {'query': 'According to the paper "Spatially resolved Kennicutt-Schmidt relation at z ≈ 7 and its connection with the interstellar medium properties", on what parameter does the radiative efficiency depend upon?', 'result': ' According to the paper, the radiative efficiency depends upon the gas density (n).'}
1 {'query': 'According to the paper "Spatially resolved Kennicutt-Schmidt relation at z ≈ 7 and its connection with the interstellar medium properties", on what parameter does the radiative efficiency depend upon?', 'result': ' According to the paper, the radiative efficiency depends upon the gas density (n).'}
2 {'query': 'According to the paper "Spatially resolved Kennicutt-Schmidt relation at z ≈ 7 and its connection with the interstellar medium properties", on what parameter does the radiative efficiency depend upon?', 'result': ' According to the paper, the radiative efficiency depends upon the gas density (n).'}
3 {'query': 'According to the paper "Spatially resolv