## Prerequisites

We will use the Transformers library from HuggingFace which is pip-installable:

pip install transformers

You'll also probably want to use PyTorch

In [1]:
pip install transformers torch

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

## Exercise 1: Prompt Engineering 101

The aim of this exercise is to understand how prompt structure affects LLM outputs.

1. Use transformers.pipeline to interact with gpt2 or mistralai/Mistral-7B-Instruct-v0.1

2. Try the three different format prompts:
   * Instructional: "Write a poem about data science in astronomy"
   * Conversational: "What can you tell me about data science in astronomy"
   * Completion-style: "Data science in astronomy is the field of..."
   
3. Vary the temperature of the model and the top_k/top_p settings and see what effects this has on the outputs


In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=-1)  # CPU

# Prompts
prompts = {
    "instructional": "Write a poem about data science in astronomy.",
    "conversational": "What can you tell me about data science in astronomy?",
    "completion": "Data science in astronomy is the field of"
}

# Vary each parameter independently while holding the others constant
temperatures = [0.5, 0.9, 1.2]
top_ks = [0, 20, 50]
top_ps = [0.8, 0.9, 1.0]

# Generate and display results
for prompt_type, prompt in prompts.items():
    print(f"\n{'='*30}\nPROMPT TYPE: {prompt_type.upper()}\nPROMPT: {prompt}\n{'='*30}")

    print("\n--- VARYING TEMPERATURE ---")
    for temp in temperatures:
        output = pipe(prompt, max_new_tokens=30, temperature=temp, top_k=50, top_p=1.0, do_sample=True)[0]['generated_text']
        print(f"[Temp={temp}] → {output.strip()}\n")

    print("\n--- VARYING TOP_K ---")
    for k in top_ks:
        output = pipe(prompt, max_new_tokens=30, temperature=1.0, top_k=k, top_p=1.0, do_sample=True)[0]['generated_text']
        print(f"[Top-k={k}] → {output.strip()}\n")

    print("\n--- VARYING TOP_P ---")
    for p in top_ps:
        output = pipe(prompt, max_new_tokens=30, temperature=1.0, top_k=50, top_p=p, do_sample=True)[0]['generated_text']
        print(f"[Top-p={p}] → {output.strip()}\n")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



PROMPT TYPE: INSTRUCTIONAL
PROMPT: Write a poem about data science in astronomy.

--- VARYING TEMPERATURE ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=0.5] → Write a poem about data science in astronomy. I love it, but it's not the best way to do it. (I'm sure I'll be writing one more time.)

There



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=0.9] → Write a poem about data science in astronomy. What is your goal?

A: A science fiction novel. I hope it's science fiction or something.

You have to get past



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=1.2] → Write a poem about data science in astronomy. (Or you can submit it at this site.) But in the end, you should understand what is really going on here. (Of course, your


--- VARYING TOP_K ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=0] → Write a poem about data science in astronomy. /w/ Twitter fanfic Thomson Cantrell

Image copyright Getty Images Image caption Can Mars be a local laser visitor? /w/ Twitter



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=20] → Write a poem about data science in astronomy.

If any of your questions or comments are not answered on this form or this form, please email me at jessina.mckesson



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=50] → Write a poem about data science in astronomy.

I'm taking the class at UT Southwestern Union's Computer Science and Artificial Intelligence Institute, a digital lab that has five hundred students and a


--- VARYING TOP_P ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=0.8] → Write a poem about data science in astronomy. The paper is called "A Tale of Two Cities," and it is a book on "what you can learn about the physics of the universe" that



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=0.9] → Write a poem about data science in astronomy. Please cite the abstract.



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=1.0] → Write a poem about data science in astronomy.

I'm an astronomer of course. My job is to investigate things that no one would have thought were possible. I'm looking for ways to


PROMPT TYPE: CONVERSATIONAL
PROMPT: What can you tell me about data science in astronomy?

--- VARYING TEMPERATURE ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=0.5] → What can you tell me about data science in astronomy?

In astronomy, there are some basic concepts that make up the science of astronomy. For example, the laws of motion of the Earth are known



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=0.9] → What can you tell me about data science in astronomy?


A recent paper by the same University of London researchers who published their findings has provided evidence that it's possible to observe some data that makes sense



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=1.2] → What can you tell me about data science in astronomy?

A professor of cosmology at UC San Diego, Kevin Bickley, said he was taught by the pioneering cosmological theorist Edwin Hubble


--- VARYING TOP_K ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=0] → What can you tell me about data science in astronomy? Why is Pluto high on the list?

The tricky part is, with this much data we have to reconstruct and orient ourselves. One challenge is



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=20] → What can you tell me about data science in astronomy? Is there a new way to describe the universe? I was asked by Professor Daniel Dyson about these questions by a colleague at the New York Academy of



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=50] → What can you tell me about data science in astronomy? How much more efficient is it with data than with math?

"Science has an enormous impact on the number and quality of measurement methods and tools


--- VARYING TOP_P ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=0.8] → What can you tell me about data science in astronomy?

Data science can help us understand how galaxies interact with one another. We can measure the distance between stars and galaxies, how light travels through them



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=0.9] → What can you tell me about data science in astronomy?

A number of astronomers are working on their work on this subject. In our current world, astronomy is a science of knowing the weather and how



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=1.0] → What can you tell me about data science in astronomy?

I'm going to give you a bit more detail than you needed, and I want this to be more direct than I want you to be


PROMPT TYPE: COMPLETION
PROMPT: Data science in astronomy is the field of

--- VARYING TEMPERATURE ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=0.5] → Data science in astronomy is the field of astronomy in which we study the physical, chemical, and astronomical properties of matter, and the properties of matter in matter theory.

The primary objective



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=0.9] → Data science in astronomy is the field of applying gravitational lensing to observations, and it's not only in terms of scientific knowledge but physics as well. The field is now expanding exponentially, and



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Temp=1.2] → Data science in astronomy is the field of understanding how we observe the universe using instruments attached to the spacecraft. It is the primary scientific scientific activity. Scientists and engineers will continue this research work this


--- VARYING TOP_K ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=0] → Data science in astronomy is the field of knowledge-building science. It fills a great gap in our educational and ethnographic education at universities and medicine-school grad schools. A large-scale



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=20] → Data science in astronomy is the field of astronomy science which uses the principles of optics, gravitation, and general relativity to determine the physical characteristics of the universe. The fundamental properties of the universe



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-k=50] → Data science in astronomy is the field of using science to answer questions that matter so much to the human intellect. A better understanding of astronomy and how it is used should enable the public to better


--- VARYING TOP_P ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=0.8] → Data science in astronomy is the field of astronomy that includes data science in astrophysics, as well as astronomy, which includes information science and data mining, which includes information science for information technologies,



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[Top-p=0.9] → Data science in astronomy is the field of scientific inquiry based on the idea that all living things are connected with the Earth and thus matter.

In a recent review by the Australian Space Agency

[Top-p=1.0] → Data science in astronomy is the field of research which is used by the National Aeronautics and Space Administration (NASA).

The role of the scientists of astronomy is to understand,



## Exercise 2: Text Classification with AstroBERT

The aim of this exercise to use a pre-trained LLM to classify astronomical texts.

1. Create a data set of 20 random sentences from abstracts on arXiv (https://arxiv.org/list/astro-ph/new)

2. Using the AstroBERT model ("EleutherAI/astroBERT") classify these sentences

In [None]:
# I used adsabs/astroBERT instead of EleutherAI/astroBERT, because the latter did not seem to be available

In [14]:
sentences = 'There is general consensus that active galactic nuclei (AGNs) derive their radiating power from a supermassive black hole (SMBH) that accretes matter. Yet, their precise powering mechanisms and the resulting growth of the SMBH are poorly understood, especially for AGNs at high redshift. Blazars are AGNs pointing their jet toward the observer, thus being detectable from radio through gamma rays at high redshift due to Doppler boosting. The blazar MG3 J163554+3629 is located at redshift z = 3,65 and it is a flat spectrum radio quasar (FSRQ). In this work, we show the results of the modeling of its spectral energy distribution (SED) from radio to gamma rays with a one-zone leptonic model. We estimate the uncertainties through a Markov Chain Monte Carlo approach. As a result, we infer the black hole mass MBH = 1,1+0,2 × 109 M⊙ and a modest magnetic field −0,1 of B = 6,56+0.13 × 10−2 G in line with the Compton dominance observed in high-redshift FSRQs. The emitting −0,09 region is outside the broad line region but within the region of the dust torus radius. The rather small accretion efficiency of η = 0,083 is not solely inferred through the SED modeling but also through the energetics. An evolution study suggests that in an Eddington-limited accretion process the SMBH did not have time enough to grow from an initial seed mass of ∼ 106M⊙ at z ≈ 30 into a mass of MBH ≈ 109M⊙ at z = 3,65. Faster mass growth might be obtained in a super-Eddington process throughout frequent episodes. Alternative scenarios propose that the existence of the jet itself can facilitate a more rapid growth. We present results of field-level inference of the baryon acoustic oscillation (BAO) scale rs on rest-frame dark matter halo catalogs. Our field-level constraint on rs is obtained by explicitly sampling the initial conditions along with the bias and noise parameters via the LEFTfield EFT- based forward model. Comparing with a standard reconstruction pipeline applied to the same data and over the same scales, the field-level constraint on the BAO scale improves by a factor of ∼ 1,2 − 1,4 over standard BAO reconstruction. We point to a surprisingly simple source of the additional information. Gravitational waves (GWs) offer a novel avenue for probing the Universe. One of their exciting applications is the independent measurement of the Hubble constant, 𝐻0, using dark standard sirens, which combine GW signals with galaxy catalogues considering that GW events are hosted by galaxies. However, due to the limited reach of telescopes, galaxy catalogues are incomplete at high redshifts. The commonly used GLADE+ is complete only up to redshift 𝑧 = 0.1, necessitating a model accounting for the galaxy luminosity distribution accounting for the selection function of galaxies, typically described by the Schechter function. In this paper, we examine the influence of the Schechter function model on dark sirens, focusing on its redshift evolution and its impact on 𝐻0 and rate parameters measurements. We find that neglecting the evolution of the Schechter function can influence the prior in redshift on GWs, which has particularly high impact for distant GW events with limited galaxy catalogue support. Moreover, conducting a joint estimation of 𝐻0 and the rate parameters, we find that allowing them to vary fixes the bias in 𝐻0 but the rate parameter 𝛾 depends on the evolving Schechter function. Our results underscore the importance of incorporating an evolving Schechter function to account for changes in galaxy populations over cosmic time, as this impacts rate parameters to which 𝐻0 is sensitive.'

sentences = sentences.split('.')
for sentence in sentences:
    if len(sentence) < 50:
        sentences.remove(sentence)
sentences = sentences[:20]
sentences

['There is general consensus that active galactic nuclei (AGNs) derive their radiating power from a supermassive black hole (SMBH) that accretes matter',
 ' Yet, their precise powering mechanisms and the resulting growth of the SMBH are poorly understood, especially for AGNs at high redshift',
 ' Blazars are AGNs pointing their jet toward the observer, thus being detectable from radio through gamma rays at high redshift due to Doppler boosting',
 ' The blazar MG3 J163554+3629 is located at redshift z = 3,65 and it is a flat spectrum radio quasar (FSRQ)',
 ' In this work, we show the results of the modeling of its spectral energy distribution (SED) from radio to gamma rays with a one-zone leptonic model',
 ' We estimate the uncertainties through a Markov Chain Monte Carlo approach',
 ' As a result, we infer the black hole mass MBH = 1,1+0,2 × 109 M⊙ and a modest magnetic field −0,1 of B = 6,56+0',
 '13 × 10−2 G in line with the Compton dominance observed in high-redshift FSRQs',
 ' The 

In [16]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Use the AstroBERT model
model_name = "adsabs/astroBERT"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # May throw if not fine-tuned for classification

# Use GPU if available
device = 0 if torch.cuda.is_available() else -1

# Create a classification pipeline
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)


# Classify each sentence
results = clf(sentences)

# Display results
for sent, res in zip(sentences, results):
    print(f"\nSentence: {sent}\n→ Label: {res['label']}, Score: {res['score']:.4f}")


tokenizer_config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/207k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/679k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/439M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at adsabs/astroBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0



Sentence: There is general consensus that active galactic nuclei (AGNs) derive their radiating power from a supermassive black hole (SMBH) that accretes matter
→ Label: LABEL_0, Score: 0.5296

Sentence:  Yet, their precise powering mechanisms and the resulting growth of the SMBH are poorly understood, especially for AGNs at high redshift
→ Label: LABEL_1, Score: 0.5755

Sentence:  Blazars are AGNs pointing their jet toward the observer, thus being detectable from radio through gamma rays at high redshift due to Doppler boosting
→ Label: LABEL_1, Score: 0.6038

Sentence:  The blazar MG3 J163554+3629 is located at redshift z = 3,65 and it is a flat spectrum radio quasar (FSRQ)
→ Label: LABEL_1, Score: 0.5482

Sentence:  In this work, we show the results of the modeling of its spectral energy distribution (SED) from radio to gamma rays with a one-zone leptonic model
→ Label: LABEL_0, Score: 0.5301

Sentence:  We estimate the uncertainties through a Markov Chain Monte Carlo approach
→ Lab

## Exercise 3: LLMs as Zero-shot Annotators

The aim of this exercise is to repurpose LLMs to do zero-shot labeling.

1. Using the data set of sentences you created in Ex. 2, use a zero-shot classification pipeline to create label topics.

In [21]:
from transformers import pipeline

# Load zero-shot classification model (multilingual works well too)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


# Define your candidate labels
candidate_labels = [
    "Galaxies",
    "Stars",
    "Black holes",
    "Exoplanets",
    "Cosmology",
    "Dark matter",
    "Gravitational waves",
    "Instruments",
    "Interstellar medium",
    "Machine learning"
]

# Classify each sentence
for i, sentence in enumerate(sentences):
    result = classifier(sentence, candidate_labels, multi_label=True)
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    print(f"\nSentence {i+1}: {sentence}")
    print(f"→ Top label: {top_label} (score: {top_score:.4f})")


config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cuda:0



Sentence 1: There is general consensus that active galactic nuclei (AGNs) derive their radiating power from a supermassive black hole (SMBH) that accretes matter
→ Top label: Black holes (score: 0.9906)

Sentence 2:  Yet, their precise powering mechanisms and the resulting growth of the SMBH are poorly understood, especially for AGNs at high redshift
→ Top label: Interstellar medium (score: 0.2157)

Sentence 3:  Blazars are AGNs pointing their jet toward the observer, thus being detectable from radio through gamma rays at high redshift due to Doppler boosting
→ Top label: Instruments (score: 0.0050)

Sentence 4:  The blazar MG3 J163554+3629 is located at redshift z = 3,65 and it is a flat spectrum radio quasar (FSRQ)
→ Top label: Instruments (score: 0.0283)

Sentence 5:  In this work, we show the results of the modeling of its spectral energy distribution (SED) from radio to gamma rays with a one-zone leptonic model
→ Top label: Instruments (score: 0.4887)

Sentence 6:  We estimate th

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



Sentence 10:  The rather small accretion efficiency of η = 0,083 is not solely inferred through the SED modeling but also through the energetics
→ Top label: Instruments (score: 0.2882)

Sentence 11:  An evolution study suggests that in an Eddington-limited accretion process the SMBH did not have time enough to grow from an initial seed mass of ∼ 106M⊙ at z ≈ 30 into a mass of MBH ≈ 109M⊙ at z = 3,65
→ Top label: Instruments (score: 0.0437)

Sentence 12:  Faster mass growth might be obtained in a super-Eddington process throughout frequent episodes
→ Top label: Stars (score: 0.2552)

Sentence 13:  Alternative scenarios propose that the existence of the jet itself can facilitate a more rapid growth
→ Top label: Stars (score: 0.1674)

Sentence 14:  We present results of field-level inference of the baryon acoustic oscillation (BAO) scale rs on rest-frame dark matter halo catalogs
→ Top label: Dark matter (score: 0.9605)

Sentence 15:  Our field-level constraint on rs is obtained by expl

## Exercise 4: Text Summarization and Evaluation

The aim of this exercise is to assess how well LLMs summarize text.

1. Take a few abstracts from arXiv and run them through a summarization pipeline using AstroBERT as the model.

2. Compare the output with using model="sshleifer/distilbart-cnn-12-6".

In [22]:
abstracts = ['There is general consensus that active galactic nuclei (AGNs) derive their radiating power from a supermassive black hole (SMBH) that accretes matter. Yet, their precise powering mechanisms and the resulting growth of the SMBH are poorly understood, especially for AGNs at high redshift. Blazars are AGNs pointing their jet toward the observer, thus being detectable from radio through gamma rays at high redshift due to Doppler boosting. The blazar MG3 J163554+3629 is located at redshift z = 3.65 and it is a flat spectrum radio quasar (FSRQ). In this work, we show the results of the modeling of its spectral energy distribution (SED) from radio to gamma rays with a one-zone leptonic model. We estimate the uncertainties through a Markov Chain Monte Carlo approach. As a result, we infer the black hole mass MBH = 1.1+0.2 × 109 M⊙ and a modest magnetic field −0.1 of B = 6.56+0.13 × 10−2 G in line with the Compton dominance observed in high-redshift FSRQs. The emitting −0.09 region is outside the broad line region but within the region of the dust torus radius. The rather small accretion efficiency of η = 0.083 is not solely inferred through the SED modeling but also through the energetics. An evolution study suggests that in an Eddington-limited accretion process the SMBH did not have time enough to grow from an initial seed mass of ∼ 106M⊙ at z ≈ 30 into a mass of MBH ≈ 109M⊙ at z = 3.65. Faster mass growth might be obtained in a super-Eddington process throughout frequent episodes. Alternative scenarios propose that the existence of the jet itself can facilitate a more rapid growth.',
             'The origin of the high-energy emission in astrophysical jets from black holes is a highly debated issue. This is particularly true for jets from supermassive black holes that are among the most powerful particle accelerators in the Universe. So far, the addition of new observations and new messengers have only managed to create more questions than answers. However, the newly available X-ray polarization observations promise to finally distinguish between emission models. We use extensive multiwavelength and polarization campaigns as well as state-of-the-art polarized spectral energy distribution models to attack this problem by focusing on two X-ray polarization observations of blazar BL Lacertae in flaring and quiescent γ-ray states. We find that regardless of the jet composition and underlying emission model, inverse-Compton scattering from relativistic electrons dominates at X-ray energies.',
             'Early chemical enrichment processes can be revealed by the careful study of metal-poor stars. In our Local Group, we can obtain spectra of individual stars to measure their precise, but not always accurate, chemical abundances. Unfortunately, stellar abundances are typically estimated under the simplistic assumption of local thermodynamic equilibrium (LTE). This can systematically alter both the abundance patterns of individual stars and the global trends of chemical enrichment. The SAGA database compiles the largest catalogue of metal-poor stars in the Milky Way. For the first time, we provide the community with the SAGA catalogue fully corrected for non-LTE (NLTE) effects, using state-of-the-art publicly available grids. In addition, we present an easy-to-use online tool NLiTE that quickly provides NLTE corrections for large stellar samples. For further scientific exploration, NLiTE facilitates the comparison of different NLTE grids to investigate their intrinsic uncertainties. Finally, we compare the NLTE-SAGA catalogue with our cosmological galaxy formation and chemical evolution model, NEFERTITI. By accounting for NLTE effects, we can solve the long-standing discrepancy between models and observations in the abundance ratio of [C/Fe], which is the best tracer of the first stellar populations. At low [Fe/H] < −3.5, models are unable to reproduce the high measured [C/Fe] in LTE, which are lowered in NLTE, aligning with simulations. Other elements are a mixed bag, where some show improved agreement with the models (e.g. Na) and others appear even worse (e.g. Co). Few elemental ratios do not change significantly (e.g. [Mg/Fe], [Ca/Fe]). Properly accounting for NLTE effects is fundamental for correctly interpreting the chemical abundances of metal-poor stars. Our new NLiTE tool, thus, enables a meaningful comparison of stellar samples with chemical and stellar evolution models as well as with low-metallicity gaseous environments at higher redshift.',
             ' Relics are massive, compact and quiescent galaxies that assembled the majority of their stars in the early Universe and lived untouched until today, completely missing any subsequent size-growth caused by mergers and interactions. They provide the unique opportunity to put constraints on the first phase of mass assembly in the Universe with the ease of being nearby. While only a few relics have been found in the local Universe, the INSPIRE project has confirmed 38 relics at higher redshifts (𝑧 ∼ 0.2 − 0.4), fully characterising their integrated kinematics and stellar populations. However, given the very small sizes of these objects and the limitations imposed by the atmosphere, structural parameters inferred from ground-based optical imaging are possibly affected by systematic effects that are difficult to quantify. In this paper, we present the first high-resolution image obtained with Adaptive Optics Ks-band observations on SOUL-LUCI@LBT of one of the most extreme INSPIRE relics, KiDS J0842+0059 at 𝑧 ∼ 0.3. We confirm the disky morphology of this galaxy (axis ratio of 0.24) and its compact nature (circularized effective radius of ∼ 1 kpc) by modelling its 2D surface brightness profile with a PSF-convolved Sérsic model. We demonstrate that the surface mass density profile of KiDS J0842+0059 closely resembles that of the most extreme local relic, NGC 1277, as well as of high-redshift red nuggets. We unambiguously conclude that this object is a remnant of a high-redshift compact and massive galaxy, which assembled all of its mass at 𝑧 > 2, and completely missed the merger phase of the galaxy evolution.',
             'Several assumptions at the foundation of the standard cosmological model have as a direct consequence a specific relation between cosmological distances, known as the dis- tance duality relation, whose violation would be a smoking gun of deviations from standard cosmology. We explore the role of upcoming gravitational wave observations in investigating possible deviations from the distance duality relation, alongside the more commonly used supernovae. We find that, when combined with baryon acoustic oscillations, gravitational waves will provide similar constraining power to the combination of baryon acoustic oscilla- tions and supernovae. Moreover, the combination of observables with different sensitivities to electromagnetic and gravitational physics provides a promising way to discriminate among different physical mechanisms that could lead to violations of the distance duality relation.']


In [None]:
# AstroBert cannot generate text, so I will use GPT 2 and google/pegasus-pubmed instead, in addition to distilbar summarization.

In [24]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load GPT-2
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Prompt GPT-2 with "Summarize this abstract:"
for i, abstract in enumerate(abstracts):
    prompt = f"Summarize this abstract: {abstract.strip()}\nSummary:"
    output = generator(prompt, max_new_tokens=60, do_sample=True, temperature=0.7)[0]['generated_text']
    # Just print the part after "Summary:"
    summary = output.split("Summary:")[-1].strip()
    print(f"\n📝 Abstract {i+1} Summary:\n{summary}")

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



📝 Abstract 1 Summary:
The present study shows that the SED of MBH = 1.1+0.2 × 109 M⊙ and a modest magnetic field −0.1 of B = 6.56−0.13 × 10−2 G in line with the Compton dominance observed in high-red


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



📝 Abstract 2 Summary:
The results of our study, published in the Astrophysical Journal Letters, offer a new way to understand the mechanism that leads to the high-energy emission in astrophysical jets from black holes. We propose that the emission pattern is more complex than previously thought and that it may be the cause of higher


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



📝 Abstract 3 Summary:
Our detailed, systematic, accurate, and robust NLiTE tool results in an abundance ratio of about 2.3 (1.4–3.5) for individual stars. This value is less than the 2.1. In contrast, our numerical model works better at many concentrations, ranging from


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



📝 Abstract 4 Summary:
Relics are huge, compact and quiescent galaxies that assembled the majority of their stars in the early Universe and lived untouched until today, completely missing any subsequent size-growth caused by mergers and interactions. They provide the unique opportunity to put constraints on the first phase of mass assembly in the Universe

📝 Abstract 5 Summary:
The most important question for the future of cosmology is what is the potential consequences of the combined gravitational wave observations of baryon acoustic oscillations and supernovae. The new estimates imply a high probability, the most probable, for a new model of the fundamental laws of physics, which would provide


In [25]:
from transformers import pipeline

# Load Pegasus model fine-tuned on scientific abstracts
summarizer = pipeline("summarization", model="google/pegasus-pubmed")

# Run summarization
for i, abstract in enumerate(abstracts):
    summary = summarizer(abstract, max_length=60, min_length=20, do_sample=False)[0]["summary_text"]
    print(f"\n📚 Abstract {i+1} Summary (Pegasus):\n{summary}")


config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-pubmed and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Device set to use cuda:0



📚 Abstract 1 Summary (Pegasus):
we show the results of the modeling of the spectral distribution ( energy distribution ) from a gamma ray radiozone model . through a radio approach , we infer the black hole mass = 1.1+1.1+0.2 = 0.1+0.1 = 6.5+20.09 in

📚 Abstract 2 Summary (Pegasus):
we present the first high - energy x - ray emission from a black hole in a jet . <n> our analysis is based on observations of the blazart polarized polarization of x - ray photons emitted from the tails of two jets passing through a black hole in their direction . 

📚 Abstract 3 Summary (Pegasus):
we present an online tool ( http://www.cs.umd.edu/ipad/ipad/index.html ) that provides corrections for non- ( publicly available ) state-of-art grids . using this tool <n> , we provide corrections for

📚 Abstract 4 Summary (Pegasus):
we present a high - resolution image of one of the most extreme evolution morphology of relic-bands-@@@ observations of one of the most extreme evolution morphology of relic-bands-@@

In [23]:
from transformers import pipeline

summarizer_bart = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

print("\n=== Summaries with DistilBART ===")
for abstract in abstracts:
    summary = summarizer_bart(abstract, max_length=60, min_length=20, do_sample=False)[0]['summary_text']
    print(f"\nOriginal: {abstract}\n→ Summary: {summary}")


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Device set to use cuda:0



=== Summaries with DistilBART ===

Original: There is general consensus that active galactic nuclei (AGNs) derive their radiating power from a supermassive black hole (SMBH) that accretes matter. Yet, their precise powering mechanisms and the resulting growth of the SMBH are poorly understood, especially for AGNs at high redshift. Blazars are AGNs pointing their jet toward the observer, thus being detectable from radio through gamma rays at high redshift due to Doppler boosting. The blazar MG3 J163554+3629 is located at redshift z = 3.65 and it is a flat spectrum radio quasar (FSRQ). In this work, we show the results of the modeling of its spectral energy distribution (SED) from radio to gamma rays with a one-zone leptonic model. We estimate the uncertainties through a Markov Chain Monte Carlo approach. As a result, we infer the black hole mass MBH = 1.1+0.2 × 109 M⊙ and a modest magnetic field −0.1 of B = 6.56+0.13 × 10−2 G in line with the Compton dominance observed in high-redshift