# Experiment: Downstream applications of polars-dovmed results: LLM-Based Paper Analysis
This is not part of the polars-dovmed package sensu stricto, just a POC/attempt demonstrating exploring the use of LLM for determining the search results relevancy for specific queries.

**Exploratory workflow:** This notebook sends full-text papers to an LLM API to analyze whether they contain actionable biological information (genomic coordinates, accessions, database identifiers) related to specific biological concepts.

We test here providing the full text, the abstract only, the extracted entities only, the entities ± N words around them, etc.  
Originally we tested multiple different LLMs, but settled on llama for cost (locally hosted). Playing around with the prompt seems to affect different models differently.  
We also tested one-shot vs zero-shot prompting - providing an example of the expected output format. We settled on one-shot or few-shot prompting as it seemed to help a bit, but results were still very unstructured.

**Related notebooks:** For cleaning and normalizing the LLM responses, see [`02_clean_llm_responses.ipynb`](./02_clean_llm_responses.ipynb). For fetching actual sequences from databases, see [`03_fetch_sequences_from_databases.ipynb`](./03_fetch_sequences_from_databases.ipynb).

**NOTE!** As anything with LLMs, all results should be taken with a mountain of salt grains and verify everything independently.  
Some of the code here was made for earlier (obsolete) versions of polars-dovmed, e.g. not all of the fields exist in the current output.

In [None]:
import os 
import polars as pl
import json
from tqdm.notebook import tqdm 

from polars_dovmed.llm_convert_context_to_coord import parse_llm_response, call_llm_api
from polars_dovmed.llm_utils import call_llm_api, list_available_models
from polars_dovmed.utils import unstruct_with_suffix, drop_empty_or_null_columns, convert_nested_cols, clean_pattern_for_polars
os.chdir("/clusterfs/jgi/scratch/science/metagen/neri/code/blits/polars_dovmed")
queries = json.load(open("RNA_virus_rss_queries.json"))
file_lists_df = pl.read_parquet("data/pubmed_central/pmc_oa/filelists.parquet")

In [2]:
results_df = pl.read_parquet("results/pubmed_central/processed_literature_test/prcoessed.parquet")
print(results_df.shape)
results_df.sort(by="total_matches",descending=True).head(2)

(336198, 26)


pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches
str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32
"""PMC7122378""","""Ribosomal Frameshifting in Dec…","""Frameshifting provides an eleg…","""Miller, W. Allen, Giedroc, Dav…","""Recoding: Expansion of Decodin…","""2010""","""10.1007/978-0-387-89382-2_9""","""Frameshifting Plant Viruses Pl…",[],"[""virus and Dianthovirus genera (Fig. 9.1 ), despite many differences in sequence. Interestingly, in the dianthoviruses and umbraviruses, the putative LDFE we predict is located upstream of the cap-independent translation element that is also located in the 3 UTR (Mizumoto et al., 2003 ), whereas in the luteoviruses, the LDFE is downstream of the cap""]",[],"[""viruses known to employ minus one ( 1) programmed ribosomal frameshift"", ""virus , undergoes a net +1 reading frame change to translate the viral RdRp coding region (Karasev et al., 1995 ). This would be the first known +1 frameshift in any plant viral RNA. A carlavirus may use 1 frameshift"", … ""viruses appears to be confined to the plant virus world as are the possible 1 frameshift""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses employ frameshifting. Summary In most cases, the biological role and basic mechanism of 1 ribosome frameshifting are likely the same in plant viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin-type pseudoknot; or (iii) a stable, imperfect stem-loop"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""NC_003747"", ""NC_001575"", … ""accession ""]",[],76
"""PMC7114087""","""Programmed ribosomal frameshif…","""Ribosomal frameshifting is a m…","""Dos Ramos, Francisco J., Brier…","""Virus Research""","""2005-11-28""","""10.1016/j.virusres.2005.10.008""","""The translation of most eukary…",[],[],[],"[""programmed ribosomal frameshifting, respectively (reviewed in Gale et al., 2000 , Pe ery and Mathews, 2000 ). Ribosomal frameshift"", ""viruses employ frameshift"", … ""viruses. The huge public health consequences of the HIV-1 pandemic and concerns over future SARS CoV outbreaks demands that all avenues be explored in the quest to counteract these agents. In principle, the replication of any virus that uses a frameshift process could be disrupted by modulation of frameshift""]",[],[],[],[],[],[],[],"[""virus (SIV) suggested the involvement of a stem-loop structure ( Jacks et al., 1988b ). Subsequently, mutational analysis, frameshift assays in transfected mammalian tissue culture cells and virus-infectivity assays have confirmed that the original stem-loop"", ""viral frameshift pseudoknots ( Brierley and Pennell, 2001 ) does not retain the original stem-loop element of Jacks and co-workers. Fig. 1 Proposed stimulatory RNAs at the frameshift sites of HIV-1 and HIV-2. The HIV-1 signal is shown in panels A E and that of HIV-2 in panel F. (A) Basic hairpin. The original stem-loop"", … ""viral agents that target this process may have previously unanticipated consequences on cellular metabolism in uninfected cells. 6 Conclusions It is now established that the stimulatory RNAs present at the frameshift signals of HIV-1 and SARS CoV are examples of stem-loop and pseudoknot stimulators, respectively. However, there remain a number of unanswered questions. Regarding the HIV-1 signal, one of the key uncertainties is whether the stem-loop""]",[],[],[],[],[],49


In [3]:
"PMC9508848" in results_df["pmc_id"]

True

In [7]:
results_df.filter((pl.col("title").str.to_lowercase().str.starts_with("ictv")) & (pl.col("title").str.to_lowercase().str.contains("virus taxonomy")) )["title"].to_list()

['ICTV Virus Taxonomy Profile:Nanoviridae',
 'ICTV Virus Taxonomy Profile:Marnaviridae2021',
 'ICTV Virus Taxonomy Profile:Arteriviridae2021',
 'ICTV Virus Taxonomy Profile:Solemoviridae2021',
 'ICTV Virus Taxonomy Profile:Iflaviridae',
 'ICTV Virus Taxonomy Profile:Picornaviridae',
 'ICTV Virus Taxonomy Profile:Dicistroviridae',
 'ICTV Virus Taxonomy Profile:Thaspiviridae2021',
 'ICTV Virus Taxonomy Profile:Inoviridae',
 'ICTV Virus Taxonomy Profile:Bornaviridae',
 'ICTV Virus Taxonomy Profile:Pseudoviridae',
 'ICTV Virus Taxonomy Profile:Ovaliviridae',
 'ICTV Virus Taxonomy Profile:Herpesviridae2021',
 'ICTV Virus Taxonomy Profile:Belpaoviridae2021',
 'ICTV Virus Taxonomy Profile:Retroviridae2021',
 'ICTV Virus Taxonomy Profile:Geminiviridae2021',
 'ICTV Virus Taxonomy Profile:Hepadnaviridae',
 'ICTV Virus Taxonomy Profile:Arenaviridae2023',
 'ICTV Virus Taxonomy Profile:Yueviridae2023',
 'ICTV Virus Taxonomy Profile:Qinviridae2023',
 'ICTV Virus Taxonomy Profile:Sunviridae2023',
 'I

some statistics about the df 

In [8]:
mcols = results_df.columns 
for col in mcols:
    print(f"{col}: {results_df[col].explode().value_counts(sort=True)}")

pmc_id: shape: (336_196, 2)
┌─────────────┬───────┐
│ pmc_id      ┆ count │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ null        ┆ 3     │
│ PMC7122378  ┆ 1     │
│ PMC7114087  ┆ 1     │
│ PMC7127214  ┆ 1     │
│ PMC11426510 ┆ 1     │
│ …           ┆ …     │
│ PMC7999968  ┆ 1     │
│ PMC7999985  ┆ 1     │
│ PMC7999986  ┆ 1     │
│ PMC7999995  ┆ 1     │
│ PMC7999997  ┆ 1     │
└─────────────┴───────┘
title: shape: (335_301, 2)
┌─────────────────────────────────┬───────┐
│ title                           ┆ count │
│ ---                             ┆ ---   │
│ str                             ┆ u32   │
╞═════════════════════════════════╪═══════╡
│ Introduction                    ┆ 16    │
│ Influenza                       ┆ 11    │
│ Vaccines                        ┆ 7     │
│ Infectious Diseases             ┆ 7     │
│ Abstracts                       ┆ 6     │
│ …                               ┆ …     │
│ Electrochemotherapy in Mucosal… ┆ 1     │
│ Altered

Filter to remove some false positives

In [10]:
rna_or_virus_filt_expr = [pl.col("full_text").str.to_lowercase().str.contains("rna"),
    pl.col("full_text").str.to_lowercase().str.contains_any(["virus","viral","phage"])
]

results_df.filter(rna_or_virus_filt_expr).shape

(306252, 26)

## Removing the zombies (papers with SARS-like stuff in title)
A lot of false positives -  it's probably (?) safe to assume that any element on it already has proper profile/tagging in some of the DBs

In [11]:
zombie_filt_expr = [
    ~pl.col("title").str.to_lowercase().str.contains_any(
        ["kidney","tissue","yoga","retracted","sars","abstracts of","tumor","cancer","vaccine","congress","covid","poster","abstracts","oral"])
]
results_df.filter(zombie_filt_expr,rna_or_virus_filt_expr).shape

(223186, 26)

In [12]:
with_proximity_matches = [
        pl.col("total_matches").ge(1)
]
results_df.filter(with_proximity_matches)

pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches
str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32
"""PMC7122378""","""Ribosomal Frameshifting in Dec…","""Frameshifting provides an eleg…","""Miller, W. Allen, Giedroc, Dav…","""Recoding: Expansion of Decodin…","""2010""","""10.1007/978-0-387-89382-2_9""","""Frameshifting Plant Viruses Pl…",[],"[""virus and Dianthovirus genera (Fig. 9.1 ), despite many differences in sequence. Interestingly, in the dianthoviruses and umbraviruses, the putative LDFE we predict is located upstream of the cap-independent translation element that is also located in the 3 UTR (Mizumoto et al., 2003 ), whereas in the luteoviruses, the LDFE is downstream of the cap""]",[],"[""viruses known to employ minus one ( 1) programmed ribosomal frameshift"", ""virus , undergoes a net +1 reading frame change to translate the viral RdRp coding region (Karasev et al., 1995 ). This would be the first known +1 frameshift in any plant viral RNA. A carlavirus may use 1 frameshift"", … ""viruses appears to be confined to the plant virus world as are the possible 1 frameshift""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses employ frameshifting. Summary In most cases, the biological role and basic mechanism of 1 ribosome frameshifting are likely the same in plant viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin-type pseudoknot; or (iii) a stable, imperfect stem-loop"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""NC_003747"", ""NC_001575"", … ""accession ""]",[],76
"""PMC7114087""","""Programmed ribosomal frameshif…","""Ribosomal frameshifting is a m…","""Dos Ramos, Francisco J., Brier…","""Virus Research""","""2005-11-28""","""10.1016/j.virusres.2005.10.008""","""The translation of most eukary…",[],[],[],"[""programmed ribosomal frameshifting, respectively (reviewed in Gale et al., 2000 , Pe ery and Mathews, 2000 ). Ribosomal frameshift"", ""viruses employ frameshift"", … ""viruses. The huge public health consequences of the HIV-1 pandemic and concerns over future SARS CoV outbreaks demands that all avenues be explored in the quest to counteract these agents. In principle, the replication of any virus that uses a frameshift process could be disrupted by modulation of frameshift""]",[],[],[],[],[],[],[],"[""virus (SIV) suggested the involvement of a stem-loop structure ( Jacks et al., 1988b ). Subsequently, mutational analysis, frameshift assays in transfected mammalian tissue culture cells and virus-infectivity assays have confirmed that the original stem-loop"", ""viral frameshift pseudoknots ( Brierley and Pennell, 2001 ) does not retain the original stem-loop element of Jacks and co-workers. Fig. 1 Proposed stimulatory RNAs at the frameshift sites of HIV-1 and HIV-2. The HIV-1 signal is shown in panels A E and that of HIV-2 in panel F. (A) Basic hairpin. The original stem-loop"", … ""viral agents that target this process may have previously unanticipated consequences on cellular metabolism in uninfected cells. 6 Conclusions It is now established that the stimulatory RNAs present at the frameshift signals of HIV-1 and SARS CoV are examples of stem-loop and pseudoknot stimulators, respectively. However, there remain a number of unanswered questions. Regarding the HIV-1 signal, one of the key uncertainties is whether the stem-loop""]",[],[],[],[],[],49
"""PMC7127214""","""Translating old drugs into new…","""Programmed ribosomal frameshif…","""Ruiz-Echevarria, Maria J, Dinm…","""Trends in Biotechnology""","""1999-3-11""","""10.1016/S0167-7799(97)01167-0""","""The ability of ribosomes to ma…",[],"[""viral mutation rates ensure the selection of mutant viral genes encoding viral proteins that can bypass the actions of the drugs, resulting in drug-resistant functional virus. ( b ) Drugs that target programmed ribosomal frameshifting affect the host translational machinery, which is independent of the viral mutational cap""]",[],"[""programmed to shift their translational reading frame one base in the 5 direction ( 1 ribosomal frameshifting) have been identified ( Table 1 ). Programmed 1 ribosomal frameshifting is most commonly observed in double-stranded RNA (dsRNA) and nonsegmented (+) strand RNA viruses; programmed +1 ribosomal frameshifting, which shifts the ribosome one base in the 3 direction, has also been characterized in at least two viral systems. A few examples of programmed ribosomal frameshifting are known to occur in bacterial genes, and one example of programmed +1 ribosomal frameshift"", ""programmed 1 ribosomal frameshifting. These different ribosomal frameshift systems have been extensively reviewed elsewhere 2 , 3 , 4 , 5 , and so this article will focus exclusively on programmed 1 ribosomal frameshift"", … ""programmed frameshift""]",[],[],[],[],[],[],[],[],[],[],[],[],[],49
"""PMC11426510""","""A novel viral RNA detection me…","""The diagnoses of retroviruses …","""Moreira de Oliveira, Izadora C…","""PLOS ONE""","""2024""","""10.1371/journal.pone.0310171""","""Various methodologies are avai…",[],[],[],[],[],[],[],[],"[""viral RNA detection method based on the combined use of trans-acting ribozyme""]","[""virus in the early stages of infection. It is the primary test employed for emerging viruses, such as SARS-CoV-2. Application of ribozyme"", ""ribozymes exhibit stability during transport, storage, and manipulation. In the early 2000s, ribozymes were employed for detecting the hepatitis C virus. This detection method involved the binding of ribozymes to the viral"", … ""viral RNA detection method based on the combined use of trans-acting ribozyme""]",[],"[""viral RNA initiator fragment through complementary binding to its toehold and stem sequence. Following this, the second DNA hairpin (H2) opens after hybridizing with the first H1, due to complementary binding with its extended loop""]",[],[],[],"[""YP_009724389.1"", ""DS126191"", … ""accession numbers""]",[],47
"""PMC5653336""","""Inhibition of human cytomegalo…","""We have previously engineered …","""Liu, Fenyong, Yang, Zhu, He, L…","""PLoS ONE""","""2017""","""10.1371/journal.pone.0186791""","""Human cytomegalovirus (HCMV) i…",[],[],[],[],[],[],[],[],"[""virus immediate early gene expression and growth by a novel RNase P ribozyme""]","[""virus (KSHV) [ 2 ]. HCMV causes numerous diseases in humans especially in immunocompromised individuals, including AIDS patients [ 1 ]. Thus, creating novel and useful antiviral treatments is essential for combating HCMV infection. The scientific communities have recognized ribozyme"", ""viral agents to ""silence genes by creating a cleavage to the viral mRNA sequences and inhibiting viral growth [ 3 5 ]. Also, ribozyme"", … ""ribozymes in reducing viral replication, HCMV was used to infect the cells with MOI of 1 [ 32 , 34 ]. At 24-hour intervals through 7 days postinfection, the cells and supernatant were collected, and viral""]",[],[],[],[],[],[],[],40
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""PMC7997495""","""Enterovirus A71 Vaccines""","""Enterovirus A71 (EV-A71) is a …","""Brewer, Gary, Shih, Shin-Ru, L…","""Vaccines""","""2021-3""","""10.3390/vaccines9030199""","""Enterovirus A71 (EV-A71) is on…",[],[],[],[],[],[],[],[],[],[],[],"[""viral translation and virulence. Changing cytosine to uridine at position 158 in stem-loop""]",[],[],[],[],[],1
"""PMC7997928""","""Oncolytic virotherapy induced …","""Oncolytic viruses, such as ves…","""Melcher, Alan, Thompson, Jill …","""Nature Communications""","""2021""","""10.1038/s41467-021-22115-1""","""Escape from frontline therapy …",[],"[""viral replication and thereby escape oncolysis. CSDE1 is multi-functional RNA-binding protein that regulates RNA translation 40 47 . CSDE1 has not previously been reported to be involved in the regulation of VSV replication, although it has been shown to stimulate cap-independent translation initiation for several other viruses (reviewed in ref. 45 ). Thus, knockdown of CSDE1 reduced the internal ribosome entry site (IRES)-driven translation of both human rhinovirus (HRV) and poliovirus, while not affecting cap""]",[],[],[],[],[],[],[],[],[],[],[],[],[],"[""NM_144901.4""]",[],1
"""PMC7998283""","""Nanosized Particles Assembled …","""RNA-based molecules have recen…","""Liu, Wei-Lin, Lo, Shih-Yen, Ch…","""Polymers""","""2021-3""","""10.3390/polym13060858""","""RNA-based molecules, including…",[],"[""virus (HCV) is an enveloped, single-stranded positive-sense RNA virus. Its genomic RNA consists of an open reading frame, flanked by two highly structured untranslated regions (UTRs) at its 5 and 3 ends [ 16 ]. The secondary and tertiary structures of the internal ribosome entry site""]",[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],1
"""PMC7998436""","""The Novel Genetic Background o…","""The circulation in Europe of n…","""Lisowska, Anna, Perez, Lester …","""Viruses""","""2021-3""","""10.3390/v13030396""","""Infectious bursal disease viru…",[],"[""viral serine protease (VP4) [ 12 ]. Segment B encodes the viral RNA-dependent RNA polymerase (RdRp) (VP1), which catalyzes replication and transcription [ 13 ]. However, a recent study revealed that IBDV uses a novel cap-independent mechanism of protein synthesis initiation that relies on the viral proteins VP1 and VP3, which act as a substitute for the 5 cap structure [ 14 ]. Of all five viral proteins, VP2 is the major structural protein that builds the viral cap""]",[],[],[],[],[],[],[],[],[],[],[],[],[],"[""KX759540"", ""MT629830"", … ""accession numbers""]",[],1


In [15]:
filtered_df = results_df.filter(zombie_filt_expr,rna_or_virus_filt_expr,with_proximity_matches)
filtered_df.shape

(6444, 26)

Filter down - only things that have accessions or coordinates in them

In [17]:
coord_or_access = filtered_df.filter(
    ~(pl.col("all_accessions").list.len() == 0) | ~(pl.col("all_coordinates").list.len() == 0)
)
coord_or_access.shape

(3611, 26)

And to only with coordinates OR only accessions

In [19]:
coord_df = filtered_df.filter(
    ~(pl.col("all_coordinates").list.len() == 0)
)
access_df =  filtered_df.filter(
    ~(pl.col("all_accessions").list.len() == 0) 
)
print(f"items with acession {access_df.shape}")
print(f"items with coordinates {coord_df.shape}")

items with acession (3564, 26)
items with coordinates (190, 26)


## TEST: prompt contains the entire full text

In [21]:
# how many characters ~~~~~ roughly number of tokens?
coord_or_access.select(pl.col("full_text").str.len_chars().mean()) 

full_text
f64
49148.446137


Remove records with full text that are probably too long for the model

In [22]:
coord_or_access = coord_or_access.filter(pl.col("full_text").str.len_chars().le(100000)) 
coord_or_access.shape

(3485, 26)

In [23]:
"PMC9508848" in coord_or_access["pmc_id"]

True

In [None]:
def create_system_prompt_full_text() -> str:
    """Create the system prompt for full-text LLM analysis."""
    return """You are an expert bioinformatics researcher analyzing complete scientific papers. Your task is to examine the full text of a research paper and respond with structured information in JSON format.

You will be given:
1. A complete scientific paper text
2. A target biological concept type (e.g. molcular function, domain, gene, protein, variant, organism life style or phenomena)
3. A specific term that was detected in the paper based on string matching
4. A list of terms related to the biological concept (note, these may be in a regex format used in the string search)

Your goal is to determine if the detected term genuinely represents the target biological concept, and if the provided full text contain actioanable information about this biological concept, which we could extract.
Actionable information includes genomic/protein coordinates/positions, identifiers/accessions, database names, and organism information.
Examples of actionable information:
1. If the user's concept or terms are about promoter regions, and the full text includes: "... promoter sequence for blablabla123 enzyme is located in the 0 to -15 bases upstream of the start codon...".
2. If user's concept is about the active site of an enzyme, and the full text includes: "...the catalytic site of the protein (aa 50 - 65), forms a hydrophobic pocket..".
3. If the user's interest is in fungal orthologues of some bacterial genes, and the text includes "... a similar function exists in certain fungi, such as Aspergillus nidulans protein Xyz..."
4. If the user's interest is about a parasite life style/cycle, and the text includes "... early stages of parasitus maximus occur in birds, often corvids...".
5. If the user's interest is in a specific molecular function, and the full text includes: "... the reaction (substrate) is oxidized by enzyme Xyz to produce product abc...".
6. If the user's interest is in non-coding RNA in a specific organism, and the text include: "... a non-coding region is transcribed on top (in overlap) of gene bla1, nested between position 100 and 200..."

ANALYSIS APPROCH:
- Read through the entire paper.
- Look for all mentions of the target term and related biological entities
- Search for actionable information (genomic coordinates, protein positions, sequence accessions, database names, and database identifiers).
- Evaluate whether the concept and the actionable information are related (e.g. genomic coordinates refer to the same biological entity where the phenomena of interest is).
- Consider the mentioned context - introduction/background sections (such as litrature overviews) may refer to different entities than those discussed in the results/methods.

CRITICAL EVALUATION CRITERIA:
- False positives are common - the input is only loosely filtered based on string searches, so the detected term may refer to unrelated concepts (e.g., acronym disambiguation).
- Synthetic constructs, expression plasmids, modified sequences, antibodies, vaccines, artifical vectors, and biosensors should generally be marked as not relevant, as they do not describe the naturally occuring phenomena/concept.
- The actioanable information may be scattered across the paper: for example the accessions and database may be mentioned in a "data availability" section, but the coordinates could be in the results, methods or supplementary sections.

RESPONSE FORMAT:
If there is actionable information regarding the concept, extract and consolidate it.
Your response must ONLY contain a valid JSON object with one of these structures:

If the paper is relevant to the concept and you found most of the required/actionable information (database, identifier, coordinates):
{
    "is_relevant": "relevant", 
    "reason": "brief summary of why the manuscript is relevant",
    "coordinate_list": [
        {
            "name": "the user term or biological concept of interest this item is about",
            "type": "RNA, DNA, Protein (amino acid)",
            "organism": "specific taxid/taxon_id if mentioned, if not then species/organism name (if not mentioned, leave empty)",
            "database": "source database for identifiers (GenBank, UniProt, IMG/M etc.)",
            "accession": "unique accession/identifier if available",
            "start": "start position if available",
            "end": "end position if available", 
            "strand": "1 for forward/positive strand, or -1 for reverse/negative strand (if the information is available and the molecule type is nucleic)",
            "sequence": "specific nucleic acid or amino acid sequences if provided"
        }
    ]
}

If the paper is relevant to the concept, but not enough actionable information is available in the text, set "is_relevant" to "insufficient".
For example, if actionable data (coordinates, concept, organism, identifiers) are not entirely mentioned or if they are noted as being available elsewhere (e.g., "...see supplementary material").
{
    "is_relevant": "insufficient",
    "reason": "what information is missing and where it might be found",
    "coordinate_list": []
}

If the term is not relevant to the target concept, or is mentioned in relation to synthetic/artificial constructs:
{
    "is_relevant": "not_relevant",
    "reason": "concise explanation of why the term/information is not relevant",
    "coordinate_list": []
}

IMPORTANT NOTES:
- Only extract information explicitly stated in the paper.
- Do not convert gene names to accessions unless explicitly provided.
- Ensure the term and coordinates refer to the same biological entity.
- If uncertain prefer "insufficient" over "relevant".
- There may be multiple actionable information in the paper relating to the concept of interest - list all of them in the coordinate_list attribute.
- For missing values, use an empty string ("") - do not use "Nan" or "N/A" or "null" or "MISSING".
- ONLY RESPOND WITH A VALID JSON: without comments or text outside of the JSON. String values (even if empty) MUST be enclosed in double quotes. The last item in a list/array should not be followed by a comma.
- All numeric values for strand should be strings ("1" or "-1"), not integers.

"""
def create_user_prompt_full_text(full_text: str, user_terms: str,
                                matched_terms: str, title: str, 
                                prompt_prepend: str | None = None, prompt_append: str|None = None) -> str:
    """Create the user prompt with the full text of the paper to be analyzed."""
    prompt = "Analyze the following full scientific paper text and return the appropriate JSON response:\n\n"
    
    # Add prepend text if provided
    if prompt_prepend and prompt_prepend.strip():
        prompt += f"The user also notes: {prompt_prepend.strip()}\n\n"  
    
    prompt += f"Paper title: {title}\n\n"
    prompt += f"All terms the user is interested in : {user_terms}\n"
    prompt += f"The specific texts that were matched in this paper: '{matched_terms}'\n\n"
    prompt += f"Full paper text:\n{full_text}\n\n"

    # Add append text if provided
    if prompt_append and prompt_append.strip():
        prompt += f"\n\nThe user also notes: {prompt_append.strip()}"
    
    return prompt

In [50]:
len(create_system_prompt_full_text())

6160

In [None]:
list_available_models(
    api_base="https://YOUR.API.PROVIDER.COM",
    api_key=os.environ.get("YOUR_API_KEY"), # type: ignore
)

In [53]:
# Create new dictionary for the loop.
all_terms = queries.copy()

# Remove the unwanted entries
for key in ["virus_taxonomy_report", "disqualifying_terms"]:
    all_terms.pop(key, None) 


for key, value in all_terms.items():
    cleaned_patterns = []
    for pattern_list in value:
        cleaned_pattern_list = [clean_pattern_for_polars(pattern) for pattern in pattern_list]
        cleaned_patterns.append(cleaned_pattern_list)
    all_terms[key] = cleaned_patterns

print(all_terms)

concept_columns = access_df.select(pl.selectors.starts_with(queries.keys())).columns # type: ignore
print(concept_columns)


{'ires': [['viral|virus', 'ires'], ['viral|virus', 'internal', 'ribosome', 'entry', 'site'], ['viral|virus', 'cap independent'], ['viral|virus', 'independent', 'cap'], ['viral|virus', 'internal', 'ribosomal', 'entry'], ['viral|virus', 'ires'], ['ires', 'viral|virus']], 'frameshifting_elements': [['viral|virus|phage', 'frameshift'], ['programmed', 'ribosomal', 'frameshift'], ['programmed', 'frameshift'], ['frameshifting', 'stimulation', 'element'], ['frameshift', 'stimulation', 'element'], ['viral|virus|phage', 'fse']], 'upstream_flanking_sequences': [['cis', 'acting', 'flanking', 'element'], ['cis', 'acting', '5', 'flanking'], ['viral|virus|phage', 'ufs']], 'tr_loops': [['viral|virus|phage', 'tr', 'loop'], ['tr stem', 'loop'], ['viral|virus|phage', 'tr', 'loop'], ['tr', 'stemloop'], ['tr', 'stem loop']], 'dimer_linkage_structures': [['viral|virus|phage', 'dls'], ['viral|virus', 'dimer', 'linkage', 'structure'], ['viral|virus', 'dls']], 'rev_response_elements': [['viral|virus', 'rev', '

In [47]:
tmp_df = coord_df
tmp_df.select(concept_columns).with_columns(pl.concat_list(pl.col(concept_columns))).unique().to_series().to_list()

[['virus-infected and produced protein A, but dsRNA staining was only observed in cells infected by AcR1 and not AcR1 3 ( SI Appendix , Fig. S1 ). The RNA replication defect of AcR1 3 was rescued by transfecting a plasmid expressing a replicable RNA1 template, RNA1fs, that contains a frameshift'],
 ['viral cytokine IFN- was analyzed after exposure for 24 h to the TLR3 agonist Poly-inosinic-cytidylic acid [poly (I:C)]. Real time-PCR analysis indicates that expression of UBAC1 significantly impairs induction of TLR3 downstream target genes, acting on both NF- B and IRF3 signaling ( Figure 2 A). Conversely, abrogation of UBAC1 expression in NHEK using short hairpin'],
 ['virus neutralizing activity ( 61 , 62 , 66 , 74 ). Several broadly neutralizing antibodies all target the same region within the glycan cap. Beneath the 18- 18 hairpin (known as the MLD anchor) sits a highly conserved patch of residues referred to as the MLD cradle ( 93 ) ( Figure 2B ). The cradle forms a hydrophobic pock

In [None]:
models=["Llama-4-Scout-17B-16E-Instruct"] 
all_responses = []

# tmp_df = access_df
tmp_df = filtered_df
# tmp_df = coord_df

for index, row in enumerate(tqdm(tmp_df.iter_rows(named=True), total=len(tmp_df), desc="Processing rows")):
        print(row["title"])
        print(len(row["full_text"]))
        this_row_reponses = dict.fromkeys(models, None)
        #    continue
        # Create the user prompt with full text instead of just context
        user_prompt = create_user_prompt_full_text(
            full_text=row["full_text"],  # Use full text instead of coordinate_text
            title=row["title"],
            user_terms=all_terms,
            matched_terms=tmp_df[index].select(concept_columns).with_columns(pl.concat_list(pl.col(concept_columns))).unique().to_series().to_list(), # type: ignore
            prompt_prepend="My focus is on RNA viruses and RNA secondary structures they might use. I am NOT interested at all in antibodies, vaccines, artifical vectors, synthetic constructs or biosensors - if any of these is mentioned, consider set is_relevant to not_relevant"
        )
        # if (len(user_prompt) > 100000):
             
        for model in models:
            try:
                llm_response = call_llm_api(
                user_prompt=user_prompt,
                system_prompt=create_system_prompt_full_text(),
                api_key=os.environ.get("YOUR_API_KEY"), # type: ignore
                api_base="https://YOUR.API.PROVIDER.COM",
                model=model
                )
                parsed = parse_llm_response(llm_response)

            except Exception as e:
                parsed = {"is_relevant": "ERROR", "reason": e, "coordinate_list":[]}

            this_row_reponses[model] = parsed
        all_responses.append(this_row_reponses)

        print(this_row_reponses)


Processing rows:   0%|          | 0/6444 [00:00<?, ?it/s]

Ribosomal Frameshifting in Decoding Plant Viral RNAs
57573
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses ribosomal frameshifting in plant viral RNAs, which is relevant to RNA viruses and RNA secondary structures.', 'coordinate_list': [{'name': 'frameshift site', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': 'GGGUUUU in BYDV, GGAUUUU in RCNMV'}, {'name': 'ADSL', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': 'large adjacent downstream-bulged stem-loop'}, {'name': 'LDFE', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': 'long-distance frameshift element'}, {'name': 'pseudoknot', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': 'compact pseudoknot adjacent to the shifty site'}]}}
T

JSON parsed successfully after 1 fix attempt(s)


{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper provides detailed information about the crystal structure of a ribosomal frameshifting viral pseudoknot, including the presence of an RNA triplex in the minor groove of stem 1.', 'coordinate_list': [{'name': 'Beet western yellow virus (BWYV) pseudoknot', 'type': 'RNA', 'organism': 'Beet western yellow virus (BWYV), a plant luteovirus', 'database': 'Protein Data Bank', 'accession': '437D', 'start': '', 'end': '', 'strand': '', 'sequence': '', 'coordinate_list': [{'name': 'stem 1', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': 'C3', 'end': '', 'strand': '', 'sequence': ''}, {'name': 'stem 2', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': ''}, {'name': 'loop 1', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': 'C8 A9'}, {'name': 'loop 2', 'type': 'R

JSON parsed successfully after 1 fix attempt(s)


{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper focuses on RNA viruses and RNA secondary structures, specifically analyzing the influenza A virus genome to identify functional RNA structures involved in viral genome replication.', 'coordinate_list': [{'name': 'stem-loop structure at nucleotide positions 39-60 of segment 6', 'type': 'RNA', 'organism': 'Influenza A virus', 'database': '', 'accession': '', 'start': '39', 'end': '60', 'strand': '', 'sequence': ''}, {'name': 'stem-loop structure at nucleotide positions 87-130 of segment 5', 'type': 'RNA', 'organism': 'Influenza A virus', 'database': '', 'accession': '', 'start': '87', 'end': '130', 'strand': '', 'sequence': ''}]}}
Intragenomic Long-Distance RNARNA Interactions in Plus-Strand RNA Plant Viruses
38618
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': "The paper discusses intragenomic long-distance RNA-RNA interactions (LDRIs) in plus-strand RNA plant viruses, which is 

JSON parsed successfully after 1 fix attempt(s)


{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses the structure of the PCBP2/stemloop IV complex underlying translation initiation mediated by the poliovirus type I IRES, which is relevant to RNA viruses and RNA secondary structures.', 'coordinate_list': [{'name': 'PCBP2/SLIVm complex', 'type': 'RNA-Protein complex', 'organism': 'Poliovirus', 'database': 'PDB', 'accession': '', 'start': '278', 'end': '398', 'strand': '', 'sequence': ''}, {'name': 'SLIVm RNA', 'type': 'RNA', 'organism': 'Poliovirus', 'database': '', 'accession': '', 'start': '278', 'end': '398', 'strand': '', 'sequence': ''}, {'name': 'PCBP2-FL', 'type': 'Protein', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': ''}]}}
A risk marker of tribasic hemagglutinin cleavage site in influenza A (H9N2) virus
49267


KeyboardInterrupt: 

Saving the raw response collection for safety

In [61]:
with open(f"results/rna_virus/rna_secondary_structure/llm_full_text_respones/{row['pmc_id']}.json", "w") as outfile:
    json.dump(parsed, outfile)


In [60]:
with open("results/rna_virus/rna_secondary_structure/all_responses.jsonl","w") as outfile:
    for response in all_responses:
        json.dump(response, outfile)
        outfile.write("\n")


In [None]:
response_df = pl.from_dicts(all_responses[:20]) #,infer_schema_length = None,strict=False)
# response_df.write_parquet("results/pubmed_central/processed_literature/llm_out/full_text_bench.parquet")
print(response_df.shape)
response_df.head(1)

In [None]:
all_data = []
for i, col in enumerate(response_df.columns):
    temp_df = response_df.unnest(col)
    col2 = "is_relevant"
    this_results = (temp_df[col2].explode().value_counts(name="count", normalize=False)
                   .with_columns(pl.lit(col).alias("model")))
    all_data.append(this_results)

# Concatenate and pivot
combined_df = pl.concat(all_data)
bench_df = combined_df.pivot(
    values="count", 
    index="is_relevant", 
    on="model"
).fill_null(0.0)

bench_df

In [None]:
response_df = response_df.hstack(tmp_df[:response_df.height])
response_df = unstruct_with_suffix(response_df, col_name="Devstral-Small-2505", suffix="_devstral")
response_df = unstruct_with_suffix(response_df, col_name="Llama-4-Scout-17B-16E-Instruct", suffix="_scout")


In [None]:
response_df.write_parquet("tmp4simon.parquet")

In [None]:
response_df

In [None]:
response_df = response_df.hstack(tmp_df)
response_df = unstruct_with_suffix(response_df, col_name="Devstral-Small-2505", suffix="_devstral")
response_df = unstruct_with_suffix(response_df, col_name="Llama-4-Scout-17B-16E-Instruct", suffix="_scout")
response_df = unstruct_with_suffix(response_df, col_name="concept_devstral", suffix="_devstral")
response_df = unstruct_with_suffix(response_df, col_name="concept_scout", suffix="_scout")
response_df

In [None]:
response_df = unstruct_with_suffix(response_df, col_name="coordinate_dict_devstral", suffix="_devstral")
response_df = unstruct_with_suffix(response_df, col_name="coordinate_dict_scout", suffix="_scout")
response_df

In [None]:
response_df.write_parquet("results/pubmed_central/processed_literature/llm_out/full_text_bench.parquet")


In [None]:
convert_nested_cols(response_df.drop("full_text")).write_csv("results/pubmed_central/processed_literature/llm_out/full_text_bench.csv")

Exploring the extractions where the models differed in response

In [None]:
diff_papers = response_df.filter((pl.col("is_relevant_devstral") != pl.col("is_relevant_scout"))).drop("pmid")
diff_papers

In [None]:

diff_papers = drop_empty_or_null_columns(diff_papers)
diff_papers

In [None]:
convert_nested_cols(diff_papers).drop(["full_text"]).write_csv("results/pubmed_central/processed_literature/llm_out/diff_papers.csv")

In [None]:
relv_papers = response_df.filter((pl.col("is_relevant_devstral") == "relevant") & (pl.col("is_relevant_scout")== "relevant")).drop("pmid")
relv_papers