# The AstroFest RAGstract Service

This notebook implements question answering over the abstracts of first-author publications by selected U of I Astronomy professors. It uses Google's Gemini Large Language Model (LLM) and the Retrieval Augmented Generation (RAG) methodology.

LLMs are trained on text (and now image) corpora at the scale of the internet. Unfortunately, they are not trained on everything and, worse, they can *hallucinate* when they don't know the answer to a question. Let's ask Gemini a question that can be answered by one of Prof. Ricker's abstracts:

In [1]:
from vertexai.preview.generative_models import GenerativeModel

llm = GenerativeModel("gemini-1.0-pro")
model_parameters = {
    "temperature": 0, # limit response randomness (i.e., creativity)
    "max_output_tokens": 800,
    "top_k": 1, # limit response randomness
}
responses = llm.generate_content(
    "How can a black hole binary of 60 solar masses form?",
    generation_config=model_parameters,
    stream=True
)

for response in responses:
    print(response.text)

## Formation of a
 60 Solar Mass Black Hole Binary

The formation of a 60 solar mass black hole binary is a complex and fascinating process that involves several stages:

**1
. Stellar Evolution:**

* Two massive stars, each with a mass of at least 30 solar masses, are born in a dense stellar cluster.

* These stars have short lifespans and quickly evolve, consuming their fuel and expanding into red supergiants.

**2. Mass Transfer and Common Envelope Phase:**

* As the stars age, they lose mass through stellar winds and Roche lobe overflow.
* One star transfers mass to the other, creating a
 common envelope around both stars.
* This envelope is unstable and eventually expels the stars, leaving them closer together.

**3. Black Hole Formation:**

* Each star collapses under its own gravity, forming a black hole.
* The black holes continue to orbit each other due to their initial momentum and the gravitational
 interaction.

**4. Binary Evolution:**

* The black hole binary loses energy
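
The stray line breaks in the output above (for example `**1` followed by `. Stellar Evolution:**`) are most likely not part of the model's answer. With `stream=True` the response arrives in chunks, and `print` appends a newline after each one. A minimal sketch of the difference, using a hypothetical list of strings `fake_chunks` in place of the `.text` of the real streamed response objects:

```python
# Hypothetical stand-in for the `.text` of each streamed chunk.
fake_chunks = [
    "## Formation of a",
    " 60 Solar Mass Black Hole Binary\n\n",
    "**1",
    ". Stellar Evolution:**",
]

# Printing chunk by chunk is equivalent to joining with newlines,
# which inserts a line break at every chunk boundary.
per_chunk = "\n".join(fake_chunks)

# Joining the chunks first keeps only the model's own line breaks.
joined = "".join(fake_chunks)

print(joined)
```

In the notebook cell itself, `print(response.text, end="")` should have the same effect as joining the chunks before printing.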

## Retrieval Augmented Generation to the rescue

One solution to this problem would be to retrain the LLM so that it includes
the documents we're interested in. For some use cases, this is the way to go,
but there are some problems:  
* The compute resources needed to retrain these models are *EXPENSIVE* and
retraining is time consuming
* The model still has knowledge of its initial training data set and can
have trouble "finding" our relevant documents.

Instead of retraining, RAG searches our documents directly and inserts the most relevant one into our prompt to the LLM, thereby reducing the problem to a summarization task.

## Step 1: Load data

Load the abstracts saved on Google Cloud Storage.

In [2]:
from google.cloud import storage

BUCKET_NAME = 'astro-abstracts'
raw_abstracts = []

client = storage.Client()
blobs = client.list_blobs(BUCKET_NAME)
for blob in blobs:
    raw_abstracts.append(blob.download_as_text())

print(f'{BUCKET_NAME} has {len(raw_abstracts)} abstracts')

astro-abstracts has 101 abstracts
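
Each downloaded abstract is a JSON string with `title`, `author`, `year`, and `text` fields (the retrieval output in Step 2 shows one in full). A small sketch of parsing one record with the standard library follows; `sample_abstract` is a hand-copied stand-in with the `text` field abbreviated, not a file read from the bucket:

```python
import json

# Stand-in for one element of `raw_abstracts`; the text field is abbreviated.
sample_abstract = """{
    "title": "Common Envelope Evolution of Massive Stars",
    "author": "Ricker, Paul M. and Timmes, Frank X. and Taam, Ronald E. and Webbink, Ronald F.",
    "year": 2019,
    "text": "The discovery via gravitational waves of binary black hole systems ..."
}"""

# Parse the JSON string into a dictionary.
record = json.loads(sample_abstract)

# Authors are separated by " and ", so the first entry is the first author.
first_author = record["author"].split(" and ")[0]

print(record["year"], first_author, "-", record["title"])
```

The notebook itself passes the raw JSON string to the embedding model and to the LLM, so parsing like this is only needed when inspecting or filtering the records.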


## Step 2: Vectorize text

* Use Google's text embedding "gecko" model to transform the abstract texts into numerical vectors.
* Create a vector store look up table that allows us to find an abstract based on its embedding.
* Define a function that can test for similarity between the query vector and an abstract vector.

In [3]:
from vertexai.language_models import TextEmbeddingModel

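# Load the embedding model once instead of on every call.
embedding_model = TextEmbeddingModel.from_pretrained('textembedding-gecko@003')

def get_text_embedding(text):
    """Return the embedding of `text` as a hashable tuple of floats."""
    embeddings = embedding_model.get_embeddings([text])
    vector = embeddings[0].values
    return tuple(vector)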

abstract_embeddings = [
    get_text_embedding(abstract) for abstract in raw_abstracts]
print(len(abstract_embeddings[0]))
print(abstract_embeddings[0][:10])

768
(0.041647639125585556, -0.02635003998875618, -0.047203220427036285, -0.018991883844137192, 0.09575393795967102, 0.08949364721775055, 0.025993820279836655, -0.03430594876408577, 0.004437834024429321, 0.02658843994140625)


In [4]:
vector_store = dict(zip(abstract_embeddings, raw_abstracts))

Now that we've vectorized our abstracts, we need a way to test for "closeness"
to our query. To facilitate this, we use the same embedding model to vectorize
the query and use the *cosine similarity metric* to find the most relevant
abstract.

$$
\cos(\theta) = \frac{\vec{q} \cdot \vec{a}}{\lVert \vec{q} \rVert \, \lVert \vec{a} \rVert}
$$

Where $\vec{q}$ is the query vector and $\vec{a}$ is the abstract vector.
If $\cos(\theta) = 1$, then the query and abstract vectors point in the same
direction in the embedding space and have maximum similarity.
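
As a quick check of the formula, the sketch below evaluates it on small hand-picked vectors. The helper `cosine_similarity` and the 3-dimensional toy vectors are illustrative assumptions; the real embeddings have 768 dimensions.

```python
import numpy as np

def cosine_similarity(q, a):
    """Return cos(theta) between vectors q and a."""
    q = np.asarray(q, dtype=float)
    a = np.asarray(a, dtype=float)
    return float(np.dot(q, a) / (np.linalg.norm(q) * np.linalg.norm(a)))

q = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # q scaled by 2
orthogonal = [0.0, 0.0, 5.0]      # shares no components with q
opposite = [-1.0, -2.0, 0.0]      # q reversed

print(cosine_similarity(q, same_direction))
print(cosine_similarity(q, orthogonal))
print(cosine_similarity(q, opposite))
```

Up to floating-point rounding, the three values are 1, 0, and -1. Scaling a vector does not change the result, which is why the metric compares direction rather than magnitude.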

In [5]:
import numpy as np

def find_nearest_abstract(query, vector_store):
    """Return the abstract most similar to `query` and its cosine similarity."""
    query_vector = get_text_embedding(query)
    norm_query = np.linalg.norm(query_vector)
    # Start below the minimum possible cosine similarity (-1) so that a
    # match is always selected, even if every similarity is <= 0.
    max_cos_similarity = -np.inf
    nearest_key = None
    for key in vector_store:
        cos_similarity = (
            np.dot(query_vector, key) / (norm_query * np.linalg.norm(key)))
        if cos_similarity > max_cos_similarity:
            max_cos_similarity = cos_similarity
            nearest_key = key

    return vector_store[nearest_key], max_cos_similarity

find_nearest_abstract(
    "How can a black hole binary of 60 solar masses form?",
    vector_store,
)

('{\n    "title": "Common Envelope Evolution of Massive Stars",\n    "author": "Ricker, Paul M. and Timmes, Frank X. and Taam, Ronald E. and Webbink, Ronald F.",\n    "year": 2019,\n    "text": "The discovery via gravitational waves of binary black hole systems with total masses greater than 60M⊙ has raised interesting questions for stellar evolution theory. Among the most promising formation channels for these systems is one involving a common envelope binary containing a low metallicity, core helium burning star with mass ⁓30 - 40M⊙ and a black hole with mass ⁓30 - 40M⊙. For this channel to be viable, the common envelope binary must eject more than half the giant star\'s mass and reduce its orbital separation by as much as a factor of 80. We discuss issues faced in numerically simulating the common envelope evolution of such systems and present a 3D AMR simulation of the dynamical inspiral of a low-metallicity red supergiant with a massive black hole companion."\n}',
 0.7591530686488)
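
The loop above returns only the single best match. If several candidate abstracts are wanted, the same cosine similarity can be computed for all abstracts at once with a matrix product. The sketch below is one possible way to do that. The function `top_k_abstracts` and the toy 3-dimensional vectors are illustrative assumptions, not part of the notebook. In practice the query vector would come from `get_text_embedding`.

```python
import numpy as np

def top_k_abstracts(query_vector, vector_store, k=3):
    """Return the k (abstract, score) pairs with the highest cosine similarity."""
    keys = list(vector_store)
    matrix = np.asarray(keys, dtype=float)         # shape (n_abstracts, dim)
    query = np.asarray(query_vector, dtype=float)  # shape (dim,)
    # Cosine similarity of the query against every row of the matrix.
    scores = matrix @ query / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = np.argsort(scores)[::-1][:k]            # indices, best first
    return [(vector_store[keys[i]], float(scores[i])) for i in best]

# Toy vector store with the same layout as `vector_store`:
# embedding tuple -> abstract text.
toy_store = {
    (1.0, 0.0, 0.0): "abstract A",
    (0.0, 1.0, 0.0): "abstract B",
    (0.7, 0.7, 0.0): "abstract C",
}
results = top_k_abstracts((0.9, 0.1, 0.0), toy_store, k=2)
print(results)
```

For roughly one hundred abstracts the loop and the matrix version are both fast. The matrix form mainly helps when the collection grows or when more than one abstract should be passed to the LLM.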

## Step 3: Construct LLM Prompt

Once we find the most relevant abstract, we inject it in a prompt to the LLM along
with the query itself. Our prompt can also include specific instructions for
how the response should be formatted.

In [6]:
class RAGstractService:
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.llm = GenerativeModel("gemini-1.0-pro")
        self.model_parameters = {
            "temperature": 0, # limit response randomness
            "max_output_tokens": 800,
            "top_k": 1, # limit response randomness
        }

    def _format_prompt(self, query, abstract):
        instructions = (
            "Use the text field from the following json schema to answer the "
            "questions below. After your answer, include the title, author, and "
            "year of publication."
        )
        return "\n\n".join([instructions, abstract, query])
    
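    def query(self, query):
        # Retrieve the most relevant abstract and build the prompt around it.
        abstract, score = find_nearest_abstract(query, self.vector_store)
        prompt = self._format_prompt(query, abstract)
        # Use the instance's model and parameters rather than the globals
        # defined in the first cell.
        responses = self.llm.generate_content(
            prompt,
            generation_config=self.model_parameters,
            stream=True
        )

        # Concatenate the streamed chunks into a single string.
        output = ''
        for response in responses:
            output += response.text

        return output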

ragstract_service = RAGstractService(vector_store)
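
To see what the model actually receives, the prompt layout can be reproduced outside the class. This sketch copies the instruction string from `_format_prompt` into a standalone function `format_prompt` and uses a made-up, abbreviated record (`demo_abstract`) in place of a retrieved abstract. Both names are hypothetical.

```python
def format_prompt(query, abstract):
    """Join instructions, retrieved abstract, and query with blank lines."""
    instructions = (
        "Use the text field from the following json schema to answer the "
        "questions below. After your answer, include the title, author, and "
        "year of publication."
    )
    return "\n\n".join([instructions, abstract, query])

# Made-up record standing in for a retrieved abstract.
demo_abstract = (
    '{"title": "Demo title", "author": "Doe, J.", '
    '"year": 2020, "text": "Demo text."}'
)
prompt = format_prompt("What does the demo text say?", demo_abstract)
print(prompt)
```

The prompt has three parts separated by blank lines: the instructions, the abstract as raw JSON, and the user's question last.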

## Step 4: Query the RAGstract Service

### How can a black hole binary of 60 solar masses form?

In [7]:
print(
    ragstract_service.query(
        "How can a black hole binary of 60 solar masses form?"
    )
)

One possible formation channel for black hole binaries with total masses greater than 60 solar masses involves a common envelope binary containing a low metallicity, core helium burning star with mass ⁓30 - 40M⊙ and a black hole with mass ⁓30 - 40M⊙. For this channel to be viable, the common envelope binary must eject more than half the giant star's mass and reduce its orbital separation by as much as a factor of 80.

**Title:** Common Envelope Evolution of Massive Stars
**Author:** Ricker, Paul M. and Timmes, Frank X. and Taam, Ronald E. and Webbink, Ronald F.
**Year:** 2019


### How often should hypernovae occur based on elemental abundances of lithium, beryllium, and boron?

In [8]:
print(
    ragstract_service.query(
        "How often should hypernovae occur based on elemental abundances of "
        "lithium, beryllium, and boron?"
    )
)

Based on the elemental abundances of lithium, beryllium, and boron, hypernovae should be rare events, with less than approximately 3 × 10^-2 hypernovae per supernova. This assumes a constant hypernova to supernova ratio over time.

**Title:** Production of Lithium, Beryllium, and Boron by Hypernovae and the Possible Hypernova-Gamma-Ray Burst Connection
**Author:** Fields, Brian D. and Daigne, Frédéric and Cassé, Michel and Vangioni-Flam, Elisabeth
**Year:** 2002


### What are the specifications of the Terahertz Intensity Mapper?

In [9]:
print(
    ragstract_service.query(
        "What are the specifications of the Terahertz Intensity Mapper?"
    )
)

The Terahertz Intensity Mapper (TIM) is an integral-field spectrometer that operates in the far-infrared (FIR) wavelength range of 240-420 microns. It has 3600 kinetic-inductance detectors (KIDs) and is coupled to a 2-meter low-emissivity carbon fiber telescope.

**Title:** The Terahertz Intensity Mapper (TIM): a Next-Generation Experiment for Galaxy Evolution Studies
**Author:** Vieira, Joaquin et al.
**Year:** 2020


### What is the separation between the AGN in SDSS object J0924+0510?

In [10]:
print(
    ragstract_service.query(
        "What is the separation between the AGN in SDSS object J0924+0510?"
    )
)

The projected physical separation between the AGN in SDSS object J0924+0510 is 1 kpc.

Title: Hubble Space Telescope Wide Field Camera 3 Identifies an rp = 1 Kpc Dual Active Galactic Nucleus in the Minor Galaxy Merger SDSS J0924+0510 at z = 0.1495

Author: Xin Liu, Hengxiao Guo, Yue Shen, Jenny E. Greene, Michael A. Strauss

Year: 2018


### How many bolometers does the current receiver on SPT have?

In [11]:
print(
    ragstract_service.query(
        "How many bolometers does the current receiver on SPT have?"
    )
)

The current receiver on the South Pole Telescope, SPT-3G, uses a 68x fMux system to operate its large-format camera of ∼16,000 TES bolometers. 

Title: On-Sky Performance of the SPT-3G Frequency-Domain Multiplexed Readout
Author: Bender, A. N.; Anderson, A. J.; Avva, J. S.; Ade, P. A. R.; Ahmed, Z.; Barry, P. S.; Basu Thakur, R.; Benson, B. A.; Bryant, L.; Byrum, K.; Carlstrom, J. E.; Carter, F. W.; Cecil, T. W.; Chang, C. L.; Cho, H. -M.; Cliche, J. F.; Cukierman, A.; de Haan, T.; Denison, E. V.; Ding, J.; Dobbs, M. A.; Dutcher, D.; Everett, W.; Ferguson, K. R.; Foster, A.; Fu, J.; Gallicchio, J.; Gambrel, A. E.; Gardner, R. W.; Gilbert, A.; Groh, J. C.; Guns, S.; Guyser, R.; Halverson, N. W.; Harke-Hosemann, A. H.; Harrington, N. L.; Henning, J. W.; Hilton, G. C.; Holzapfel, W. L.; Howe, D.; Huang, N.; Irwin, K. D.; Jeong, O. B.; Jonas, M.; Jones, A.; Khaire, T. S.; Kofman, A. M.; Korman, M.; Kubik, D. L.; Kuhlmann, S.; Kuo, C. -L.; Lee, A. T.; Leitch, E. M.; Lowitz, A. E.; Meyer, S.

### What is the self-interacting dark matter density power law resulting from the simulation of a cluster of weakly collisional particles around a massive black hole?

In [12]:
print(
    ragstract_service.query(
        "What is the self-interacting dark matter density power law "
        "resulting from the simulation of a cluster of weakly collisional "
        "particles around a massive black hole?"
    )
)

The self-interacting dark matter density power law resulting from the simulation of a cluster of weakly collisional particles around a massive black hole is r-β, where β=(a+3)/4.

**Title:** Self-interacting dark matter cusps around massive black holes
**Author:** Shapiro, S. L. & Paschalidis, V.
**Year:** 2014
