# Mechanical turk to "write" papers with style

The idea of this experiment is to start with the title and outline of a paper on a particular topic (in this case an observational paper about an accreting pulsar) and turn them into a draft of a draft of a paper, just for fun. I strongly suspect, though, that some colleagues follow a similar methodology when writing their actual papers. Don't do that! Playing with such things is fun, but publishing the result is unethical and makes you look like a fool, as it's perfectly clear when AI is involved (at least for now).

Anyway, let's start with a title and outline (random stuff):

In [24]:
title = "Study of Vela X-1 states with NuSTAR"
outline = """
- Introduction (6 paragraphs)
    * brief history of observations of the source (discovery, information relevant to NuSTAR observations)
    * states observed with NuSTAR and other missions (off-states and flares, variability of absorption and spectral lines)
    discuss origin of the off-states and possible scenarios. Mention magnetospheric inhibition and variability due to clump accretion.
    * goals of the current investigation (conduct flux resolved spectroscopy in different states using broadband nustar data)
- Observations and data analysis (8 paragraphs)
    * describe NuSTAR mission briefly
    * NuSTAR observations of the source (2 observations 50 ks each)
    * brief description of analysis of NuSTAR data (extraction radius, energy range 3-20 keV (soft) and 20-80 keV (hard) used, time bin lightcurves, grouping of the spectra)
    * Timing analysis (NuSTAR)
        * description of the observed variability
        * states detected during current observations (3 offstates with ~5ks total exposure, 1 20 min flare with 1 Crab peak brightness)
        * pulsations and pulse profiles 
    * Spectral analysis (NuSTAR)
        * flux resolved spectral analysis (changes in absorption column, continuum hardness)
        * phase-resolved spectral analysis (8 phase bins, complex variability of parameters)
- Discussion and conclusions (4 paragraphs)
    * observed variability patterns
    * implications of the observed phase dependence of photon index value
    * need for additional observations and outlook
- Acknowledgements (thank the anonymous referee and HEASARC data archive)
- References
""" 

Now we need some hard facts to draw inspiration from (literature reviews) and an author whose style to copy (in this case myself). In both cases, astronomical papers are most easily found on [NASA ADS](https://ui.adsabs.harvard.edu):

In [2]:
import ads
# load environment variables with the API keys. Don't forget to put those in the .env file,
# specifically OPENAI_API_KEY, PINECONE_API_KEY, and ADS_API_KEY
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())

# get papers authored by myself to copy style from
style_list = ads.SearchQuery(q='collection:astronomy AND orcid:0000-0001-8162-1105 AND author:"^Doroshenko"')
style_list.execute()

# get recent papers on the source in question
content_list = ads.SearchQuery(q='title:"Vela X-1" AND property:refereed AND year:2000-2022')
content_list.execute()

# we actually only need bibcodes, so get those
content_list = [x.bibcode for x in content_list.articles]
style_list = [x.bibcode for x in style_list.articles]

Now the full texts are needed. ADS does not like that very much, so be careful with rate limits. To get the text out of the pdfs, some extra functions and imports are needed:

In [3]:
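import os
import tempfile
import urllib.request  # the submodule has to be imported explicitly for urlretrieve

import fitz  # PyMuPDF
import requests

# try the ADS-stored pdf first, then arXiv, then the publisher (publishers have captchas)
pdf_priority = ['ads_pdf', 'eprint_pdf', 'pub_pdf']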

def extract_text_from_pdf(pdf_path):
    document = fitz.open(pdf_path)
    all_text = ""
    for page_num in range(len(document)):
        page = document.load_page(page_num)  # Load the page
        all_text += page.get_text()  # Extract text from the page
    return all_text

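def download_file(bibcode, priority):
    """Download the pdf for a bibcode to a temporary file; return its path or None."""
    request = f"https://api.adsabs.harvard.edu/v1/resolver/{bibcode}/{pdf_priority[priority]}"
    response = requests.get(request, headers={'Authorization': 'Bearer ' + os.getenv('ADS_API_KEY')})
    if not response.ok:
        return None
    url = response.json()['link']
    # create the temporary file only once we know there is something to download
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_filename = temp_file.name
    urllib.request.urlretrieve(url, temp_filename)
    return temp_filename

def get_fulltext(bibcode):
    """Try each pdf source in order of priority; return '' if none of them works."""
    for i in range(len(pdf_priority)):
        pdf = None
        try:
            pdf = download_file(bibcode, i)
            if pdf is not None:
                return extract_text_from_pdf(pdf)
        except Exception:
            continue  # download or text extraction failed, try the next source
        finally:
            # clean up the temporary file whether or not extraction succeeded
            if pdf is not None and os.path.exists(pdf):
                os.remove(pdf)
    return ''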


In [5]:
content_texts = [get_fulltext(x) for x in content_list]
style_texts = [get_fulltext(x) for x in style_list]
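
Because ADS enforces rate limits, it can help to pause between requests and to check how many papers actually yielded any text. The sketch below shows one way to do that. Note that `fetch_all` and its `delay` parameter are hypothetical helpers (not part of the `ads` package), and the default delay is a guess rather than a documented limit.

```python
import time

def fetch_all(bibcodes, fetch, delay=1.0):
    """Call fetch(bibcode) for each bibcode, pausing `delay` seconds between calls.

    Returns a dict {bibcode: text} that contains only the non-empty texts.
    """
    texts = {}
    for i, bibcode in enumerate(bibcodes):
        if i > 0:
            time.sleep(delay)  # be gentle with the API
        text = fetch(bibcode)
        if text:  # an empty string means no pdf could be retrieved
            texts[bibcode] = text
    return texts
```

With the functions defined above this could be used as `content_texts = list(fetch_all(content_list, get_fulltext).values())`, and comparing the length of the result with `len(content_list)` tells how many papers were lost along the way.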

Now it's time to split the texts into chunks and load them into a vector database. Conquer and divide! (i.e. concatenate and then split)

In [6]:
# concatenate all texts into one string each, separated by blank lines
style = '\n\n'.join(style_texts)
content = '\n\n'.join(content_texts)

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 200,
)

style_docs = text_splitter.create_documents([style])
content_docs = text_splitter.create_documents([content])
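
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally prefers to cut at separators such as paragraph breaks; `sliding_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears complete in one of the neighbouring chunks, at the cost of storing some text twice.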

Now we can actually feed that to Pinecone (our vector DB here; feel free to use any other store supported by langchain):

In [8]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY')

# configure client
pc = Pinecone(api_key=api_key)

In [9]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'
spec = ServerlessSpec(cloud=cloud, region=region)

In [10]:
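# delete each index separately, so a missing first index does not skip the second
for name in ('turk-rag-style', 'turk-rag-content'):
    try:
        pc.delete_index(name)
    except Exception:
        pass  # the index did not exist yet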

pc.create_index(
        'turk-rag-style',
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='dotproduct',
        spec=spec
    )

pc.create_index(
        'turk-rag-content',
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='dotproduct',
        spec=spec
    )

In [11]:
index_style = pc.Index('turk-rag-style')
index_content = pc.Index('turk-rag-content')
# wait a moment for connection
import time
time.sleep(1)
index_style.describe_index_stats()
index_content.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
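
A short note on the `dotproduct` metric chosen above: OpenAI describes its embeddings as normalized to unit length, and under that assumption the dot product and the cosine similarity give identical scores. The toy example below checks this for two hand-made vectors; the helper functions are written only for this illustration and are not part of any library used here.

```python
import math

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# toy 3-dimensional "embeddings" (real ada-002 vectors have 1536 components)
query = normalize([1.0, 2.0, 2.0])
doc = normalize([2.0, 1.0, 2.0])

# for unit vectors both similarity measures agree
assert math.isclose(dot(query, doc), cosine(query, doc))
```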

In [12]:
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

vectorstore_style = PineconeVectorStore.from_documents(
        style_docs,
        index_name='turk-rag-style',
        embedding=embeddings
    )
vectorstore_content = PineconeVectorStore.from_documents(
        content_docs,
        index_name='turk-rag-content',
        embedding=embeddings
    )

Now we can set up a RAG chain to expand the outline using langchain. There's a lot of room for improvement with prompt engineering or with more expensive models, but for now let's keep it simple.

In [29]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


template = """You are a highly qualified astrophysicist writing a full-fledged observational paper about an X-ray pulsar Vela X-1 based on the outline:{outline}.
             You expand each outline item and sub-item to at least several paragraphs using the information found here: {content}.
             You always insert appropriate citations to back up your statements, and ensure that those are also present in references section.
             For this paper you come up with some mock observational results to make experiment more interesting, express your creativity!
             You describe details of analysis step by step and ensure that findings are properly interpreted. 
             You discuss physics behind the observations and elaborate on meaning of the findings from physical perspective.
             Observational findings include spectral (continuum and line spectroscopy) and timing properties (flares, off-states, pulse profiles).
             You mimic the writing style in the context:{style}. You use markdown to markup the text and follow approximate lengths of various sections mentioned in the outline
"""

prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(openai_api_key=os.environ.get('OPENAI_API_KEY'),
    model_name='gpt-4o',
    temperature=0.7) # play with temperature to temper model's creativity

retriever_style = vectorstore_style.as_retriever()
retriever_content = vectorstore_content.as_retriever()

retrieval_chain = (
    {"style": retriever_style,"content": retriever_content, "outline": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser() 
) 

draft = retrieval_chain.invoke(outline)
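
In the chain above each retriever hands a list of document objects straight to the prompt, so the template is presumably filled with their string representation, metadata included. A common pattern is to join only the text of the retrieved documents first. The sketch below shows such a helper; `format_docs` is a hypothetical name, and `SimpleNamespace` objects stand in for LangChain documents so that the snippet runs on its own.

```python
from types import SimpleNamespace

def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one string for the prompt."""
    return separator.join(doc.page_content for doc in docs)

# stand-ins for retrieved documents (real ones come from the retriever)
docs = [SimpleNamespace(page_content="first chunk"),
        SimpleNamespace(page_content="second chunk")]
print(format_docs(docs))
```

In the chain this could be plugged in as `"style": retriever_style | format_docs` (and likewise for the content retriever), since langchain wraps plain functions used in a pipe into runnables.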


Now we can display what the AI came up with. Obviously it needs more information to feed the randomness, so the result is not that great...

In [30]:
from IPython.display import display, Markdown
display(Markdown(draft))

# Observational Study of the X-ray Pulsar Vela X-1 with NuSTAR

## Introduction

### Brief History of Observations of the Source
Vela X-1, a well-studied high-mass X-ray binary (HMXB), was first discovered in 1975 by the Ariel V satellite (McClintock et al., 1976). The system consists of a neutron star accreting material from its OB supergiant companion, HD 77581. The pulsar exhibits a pulse period of approximately 283 seconds and orbits its companion with an orbital period of about 8.9 days (Bildsten et al., 1997). Over the decades, observations by various missions like RXTE, INTEGRAL, and Suzaku have provided a wealth of data on its timing and spectral properties, making it an ideal target for detailed study with the Nuclear Spectroscopic Telescope Array (NuSTAR).

### States Observed with NuSTAR and Other Missions
Vela X-1 is known for its complex variability, displaying off-states, flares, and significant changes in absorption and spectral lines. Observations by missions like Suzaku and INTEGRAL have shown that the source can enter low-luminosity off-states where the X-ray flux drops significantly, yet pulsations persist, indicating continued accretion onto the neutron star's magnetic poles (Watanabe et al., 2006). These states are interspersed with high-luminosity flares that can reach up to the brightness of 1 Crab, showcasing the dynamic nature of the accretion process in this system.

### Origins of Off-States and Possible Scenarios
The origin of off-states in Vela X-1 is a subject of ongoing research. One proposed mechanism is magnetospheric inhibition, where the magnetospheric boundary of the neutron star interacts with the accretion flow, temporarily halting accretion (Bozzo et al., 2008). Another possibility is the variability due to clump accretion, where dense clumps in the stellar wind cause fluctuations in the accretion rate (Ducci et al., 2009). These scenarios highlight the intricate interplay between the neutron star's magnetic field and the inhomogeneous stellar wind from the companion.

### Goals of the Current Investigation
The primary goal of this investigation is to conduct flux-resolved spectroscopy of Vela X-1 in different states using broadband data from NuSTAR. By analyzing the spectral and timing properties during off-states, flares, and intermediate states, we aim to better understand the physical processes governing the variability in this system. We also seek to explore the phase-resolved spectral behavior to gain insights into the geometry and dynamics of the accretion flow onto the neutron star.

## Observations and Data Analysis

### Description of the NuSTAR Mission
The NuSTAR mission, launched in 2012, consists of two independent grazing incidence telescopes that focus X-rays between 3.0 and 79 keV onto corresponding focal planes made of cadmium zinc telluride (CZT) pixel detectors (Harrison et al., 2013). NuSTAR provides unprecedented sensitivity and high spectral resolution in the hard X-ray band, making it particularly suited for studying cyclotron resonant scattering features (CRSFs) and other high-energy phenomena in X-ray binaries like Vela X-1.

### NuSTAR Observations of the Source
NuSTAR observed Vela X-1 twice, with each observation lasting approximately 50 ks. The first observation was conducted as a calibration target early in the mission, while the second observation was aimed at scientific studies. The details of these observations, including the orbital phase derived from the ephemeris by Kreykenbohm et al. (2008), are summarized in Table 1.

### Data Analysis
We analyzed the NuSTAR data using the NuSTAR Data Analysis Software (NuSTARDAS) along with HEASOFT version 6.27.1 and the current calibration files (CALDB version 20200526). Source photons were extracted from a circular region centered on Vela X-1 with a radius of 80 arcseconds. The energy range of 3-20 keV was used for soft X-rays, and 20-80 keV for hard X-rays. Light curves with a time resolution of 0.0625 seconds were generated, and the spectra were grouped to have a minimum of 20 counts per bin to ensure robust statistical analysis.

### Timing Analysis
#### Description of Observed Variability
The light curves revealed significant variability, including three distinct off-states with a combined duration of approximately 5 ks, and a single flare event lasting about 20 minutes with a peak brightness of 1 Crab. These features highlight the dynamic nature of the accretion process in Vela X-1.

#### Pulsations and Pulse Profiles
Pulsations with a period of approximately 283 seconds were detected, consistent with previous observations. The pulse profiles exhibited complex structures with energy-dependent variations, particularly at higher energies where a narrow dip at pulse phase ~0.75 became prominent.

### Spectral Analysis
#### Flux-Resolved Spectral Analysis
The flux-resolved spectral analysis revealed significant changes in the absorption column density (N_H) and the continuum hardness. During off-states, the spectra showed increased N_H and softer continua, while during the flare, the N_H decreased, and the continuum hardened, indicating variable accretion dynamics.

#### Phase-Resolved Spectral Analysis
The phase-resolved spectral analysis, conducted using 8 phase bins, showed complex variability in spectral parameters. Notably, the photon index varied significantly with pulse phase, suggesting changes in the emission geometry and accretion flow dynamics as the neutron star rotates.

## Discussion and Conclusions

### Observed Variability Patterns
The observed variability patterns, including off-states and flares, are indicative of the complex interaction between the neutron star's magnetosphere and the inhomogeneous stellar wind from the companion. The detection of pulsations during off-states suggests continued, albeit reduced, accretion onto the magnetic poles.

### Implications of Phase-Dependence of Photon Index
The phase-dependence of the photon index provides valuable insights into the emission geometry and accretion dynamics. Variations in the photon index with pulse phase suggest changes in the viewing angle and possibly the location of the emission regions, highlighting the complex nature of the accretion process in Vela X-1.

### Need for Additional Observations and Outlook
To further elucidate the physical processes governing the observed variability, additional observations with higher temporal and spectral resolution are needed. Future multi-wavelength observations, combining X-ray, optical, and radio data, would provide a more comprehensive understanding of the accretion dynamics and the role of the neutron star's magnetic field.

## Acknowledgements
We thank the anonymous referee for their valuable comments and suggestions. This research has made use of the NuSTAR Data Analysis Software (NuSTARDAS) jointly developed by the ASI Science Data Center (ASDC, Italy) and the California Institute of Technology (USA). We also acknowledge the HEASARC data archive for providing access to the observational data.

## References
- Basko, M. M., & Sunyaev, R. A., 1976, MNRAS, 175, 395
- Becker, P. A., et al., 2012, A&A, 544, A123
- Becker, P. A., & Wolff, M. T., 2007, ApJ, 654, 435
- Bildsten, L., et al., 1997, ApJS, 113, 367
- Blondin, J. M., Stevens, I. R., & Kallman, T. R., 1991, ApJ, 371, 684
- Bozzo, E., Falanga, M., & Stella, L., 2008, ApJ, 683, 1031
- Ducci, L., Sidoli, L., Mereghetti, S., Paizis, A., & Romano, P., 2009, MNRAS, 398, 2152
- Harrison, F. A., et al., 2013, ApJ, 770, 103
- Kreykenbohm, I., Wilms, J., Kretschmar, P., et al. 2008, A&A, 492, 511
- McClintock, J. E., et al., 1976, ApJ, 206, L99
- Watanabe, S., Sako, M., Ishida, M., et al., 2006, ApJ, 651, 421

A better result could be achieved by including more information from the papers found in the search, i.e. with prompt engineering. That could be done by a) extracting key facts from reviews, b) providing some examples of what can go into the various sections, and c) many other things! That's not the point, however: this experiment is mainly meant as a fun way to learn a bit about LLMs.
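
One of those "many other things" could be to expand the outline section by section, so that each call retrieves context relevant to that section only. Here is a minimal sketch, assuming the outline keeps the format used above (top-level items start with `- `); `split_outline` is a hypothetical helper and the short example outline is made up for the illustration.

```python
def split_outline(outline):
    """Split an outline into top-level sections (lines starting with '- ')."""
    sections = []
    for line in outline.strip().splitlines():
        if line.startswith("- "):
            sections.append([line])      # start of a new top-level section
        elif sections and line.strip():
            sections[-1].append(line)    # sub-item of the current section
    return ["\n".join(section) for section in sections]

example = """
- Introduction (2 paragraphs)
    * history of the source
- Discussion (1 paragraph)
    * variability patterns
"""
for section in split_outline(example):
    print(section, end="\n\n")
```

Each section could then be passed to `retrieval_chain.invoke(...)` separately and the resulting pieces concatenated, though whether that actually improves the draft is untested here.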