# Experiment: Downstream applications of polars-dovmed results: Clean and Normalize LLM Responses
This is not part of the polars-dovmed package sensu stricto, just a POC/attempt demonstrating post-processing of LLM-extracted biological data.

**Exploratory workflow:** This notebook processes the raw LLM responses from the analysis step, performing data cleaning, normalization, and quality validation. It handles terminology mapping (e.g., various synonyms to canonical terms like "ires", "fse", "ribozyme"), standardizes missing values, and validates that extracted coordinates contain the required fields.

The notebook also identifies failed extractions and can re-run them through the LLM with adjusted prompts to improve data quality.

**Related notebooks:** For the LLM API calls that generate these responses, see [`01_llm_analyze_papers.ipynb`](./01_llm_analyze_papers.ipynb).  
For fetching actual genomic sequences using the cleaned coordinates, see [`03_fetch_sequences_from_databases.ipynb`](./03_fetch_sequences_from_databases.ipynb).

**NOTE!** As anything with LLMs, all results should be taken with a mountain of salt grains and verify everything independently.  
Some of the code here was made for earlier (obsolete) versions of polars-dovmed, e.g. not all of the fields exist in the current output.

In [1]:
import os 
import polars as pl
import json

# from polars_dovmed.llm_convert_context_to_coord import parse_llm_response, call_llm_api
# from polars_dovmed.llm_utils import call_llm_api, list_available_models
from polars_dovmed.utils import clean_pattern_for_polars
os.chdir("/clusterfs/jgi/scratch/science/metagen/neri/code/blits/polars_dovmed")
queries = json.load(open("./results/rna_virus/rna_secondary_structure/RNA_virus_rss_queries.json"))
response_schema= json.load(open("./results/rna_virus/rna_secondary_structure/rna_virus_schema.json"))
file_lists_df = pl.read_parquet("data/pubmed_central/pmc_oa/filelists.parquet")

In [2]:
results_df = pl.read_parquet("results/pubmed_central/processed_literature_test/filtered.parquet")
print(results_df.shape)
results_df.sort(by="total_matches",descending=True).head(2)

(6257, 26)


pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches
str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32
"""PMC7122378""","""Ribosomal Frameshifting in Dec…","""Frameshifting provides an eleg…","""Miller, W. Allen, Giedroc, Dav…","""Recoding: Expansion of Decodin…","""2010""","""10.1007/978-0-387-89382-2_9""","""Frameshifting Plant Viruses Pl…",[],"[""virus and Dianthovirus genera (Fig. 9.1 ), despite many differences in sequence. Interestingly, in the dianthoviruses and umbraviruses, the putative LDFE we predict is located upstream of the cap-independent translation element that is also located in the 3 UTR (Mizumoto et al., 2003 ), whereas in the luteoviruses, the LDFE is downstream of the cap""]",[],"[""viruses known to employ minus one ( 1) programmed ribosomal frameshift"", ""virus , undergoes a net +1 reading frame change to translate the viral RdRp coding region (Karasev et al., 1995 ). This would be the first known +1 frameshift in any plant viral RNA. A carlavirus may use 1 frameshift"", … ""viruses appears to be confined to the plant virus world as are the possible 1 frameshift""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses employ frameshifting. Summary In most cases, the biological role and basic mechanism of 1 ribosome frameshifting are likely the same in plant viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin-type pseudoknot; or (iii) a stable, imperfect stem-loop"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""NC_003747"", ""NC_001575"", … ""accession ""]",[],76
"""PMC7127214""","""Translating old drugs into new…","""Programmed ribosomal frameshif…","""Ruiz-Echevarria, Maria J, Dinm…","""Trends in Biotechnology""","""1999-3-11""","""10.1016/S0167-7799(97)01167-0""","""The ability of ribosomes to ma…",[],"[""viral mutation rates ensure the selection of mutant viral genes encoding viral proteins that can bypass the actions of the drugs, resulting in drug-resistant functional virus. ( b ) Drugs that target programmed ribosomal frameshifting affect the host translational machinery, which is independent of the viral mutational cap""]",[],"[""programmed to shift their translational reading frame one base in the 5 direction ( 1 ribosomal frameshifting) have been identified ( Table 1 ). Programmed 1 ribosomal frameshifting is most commonly observed in double-stranded RNA (dsRNA) and nonsegmented (+) strand RNA viruses; programmed +1 ribosomal frameshifting, which shifts the ribosome one base in the 3 direction, has also been characterized in at least two viral systems. A few examples of programmed ribosomal frameshifting are known to occur in bacterial genes, and one example of programmed +1 ribosomal frameshift"", ""programmed 1 ribosomal frameshifting. These different ribosomal frameshift systems have been extensively reviewed elsewhere 2 , 3 , 4 , 5 , and so this article will focus exclusively on programmed 1 ribosomal frameshift"", … ""programmed frameshift""]",[],[],[],[],[],[],[],[],[],[],[],[],[],49


some statistics about the df 

In [3]:
mcols = results_df.columns 
for col in mcols:
    print(f"{col}: {results_df[col].explode().value_counts(sort=True)}")

pmc_id: shape: (6_257, 2)
┌────────────┬───────┐
│ pmc_id     ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ PMC7122378 ┆ 1     │
│ PMC7127214 ┆ 1     │
│ PMC5653336 ┆ 1     │
│ PMC7119991 ┆ 1     │
│ PMC7114514 ┆ 1     │
│ …          ┆ …     │
│ PMC7996568 ┆ 1     │
│ PMC7996929 ┆ 1     │
│ PMC7997928 ┆ 1     │
│ PMC7998283 ┆ 1     │
│ PMC7998436 ┆ 1     │
└────────────┴───────┘
title: shape: (6_230, 2)
┌─────────────────────────────────┬───────┐
│ title                           ┆ count │
│ ---                             ┆ ---   │
│ str                             ┆ u32   │
╞═════════════════════════════════╪═══════╡
│ Virus Replication               ┆ 3     │
│ Expression                      ┆ 2     │
│ Release of P-TEFb from the Sup… ┆ 2     │
│ Rfam 15: RNA families database… ┆ 2     │
│ Astroviruses                    ┆ 2     │
│ …                               ┆ …     │
│ Liquid Biomolecular Condensate… ┆ 1     │
│ The Atypical Kinase RIOK3 Li

### The results_df ^ is what the [llm_convert_context_to_coord.py](../src/polars_dovmed/llm_convert_context_to_coord.py) script ran on
With llama scout, took around 4-10sec per record, ran overnight.

Reading the raw responses

In [38]:
response_df = pl.read_parquet(
    "results/rna_virus/rna_secondary_structure/llm_full_text_responses_3/results.parquet"
    ).unnest("Llama-4-Scout-17B-16E-Instruct")
response_df = response_df.hstack(results_df)
print(response_df.shape)
response_df.head(1)

(6257, 30)


is_relevant,reason,coordinate_list,general_viral_rna,pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches
str,str,list[struct[10]],list[struct[9]],str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32
"""relevant""","""Detailed characterization of r…","[{""frameshifting_elements"",""RNA"","""","""","""","""","""","""","""",null}, {""ldfe"",""RNA"",""Barley yellow dwarf virus (BYDV, genus Luteovirus)"","""","""",""4 kb downstream in the genome"","""","""","""",null}, … {""pipo"",""Protein"",""Potyviridae"","""","""","""","""","""","""",null}]",,"""PMC7122378""","""Ribosomal Frameshifting in Dec…","""Frameshifting provides an eleg…","""Miller, W. Allen, Giedroc, Dav…","""Recoding: Expansion of Decodin…","""2010""","""10.1007/978-0-387-89382-2_9""","""Frameshifting Plant Viruses Pl…",[],"[""virus and Dianthovirus genera (Fig. 9.1 ), despite many differences in sequence. Interestingly, in the dianthoviruses and umbraviruses, the putative LDFE we predict is located upstream of the cap-independent translation element that is also located in the 3 UTR (Mizumoto et al., 2003 ), whereas in the luteoviruses, the LDFE is downstream of the cap""]",[],"[""viruses known to employ minus one ( 1) programmed ribosomal frameshift"", ""virus , undergoes a net +1 reading frame change to translate the viral RdRp coding region (Karasev et al., 1995 ). This would be the first known +1 frameshift in any plant viral RNA. A carlavirus may use 1 frameshift"", … ""viruses appears to be confined to the plant virus world as are the possible 1 frameshift""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses employ frameshifting. Summary In most cases, the biological role and basic mechanism of 1 ribosome frameshifting are likely the same in plant viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin-type pseudoknot; or (iii) a stable, imperfect stem-loop"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""NC_003747"", ""NC_001575"", … ""accession ""]",[],76


In [39]:
response_df["is_relevant"].value_counts(sort=True)

is_relevant,count
str,u32
"""not_relevant""",3526
"""relevant""",1858
"""insufficient""",871
"""ERROR""",2


In [40]:
response_df.filter(pl.col("is_relevant").is_in(["ERROR","insufficient"]))["reason"].unique().to_list()


['The paper discusses stem-loop structures in the context of genomoviruses, specifically mentioning a potential stem-loop structure with a nonanucleotide motif at its apex in SsHADV-1, but does not provide specific coordinates or actionable information about the structure.',
 "The paper discusses Sphae, a toolkit for predicting phage therapy candidates from sequencing data, mentioning various bioinformatics tools and techniques used for phage genome assembly and annotation. However, specific actionable information regarding the user's terms of interest (e.g., ires, frameshifting_elements) is not directly provided in the given text.",
 'The paper discusses the cleavage of poly(A)-binding protein (PABP) by duck hepatitis A virus (DHAV) 3C protease, but does not provide specific actionable information about IRES or other requested concepts.',
 'The paper discusses the comparison of four regions in the replicase gene of heterologous infectious bronchitis virus strains, but it does not prov

In [41]:
insufficient_df =  response_df.filter(pl.col("is_relevant").eq("insufficient"))
insufficient_df.write_parquet("results/rna_virus/rna_secondary_structure/insufficient.parquet")

In [42]:
relevant_df = response_df.filter(~pl.col("is_relevant").ne("relevant"))
relevant_df.shape

(1858, 30)

In [43]:
# test_df = unstruct_with_suffix(relevant_df, col_name="coordinate_list", suffix="_devstral")
relevant_df = relevant_df.explode(
    pl.col("coordinate_list")).with_columns(
    pl.col("coordinate_list")
    .struct.rename_fields(["name","type","organism","database","accession","start","end","strand","sequence"]
        ).struct.unnest())
print(relevant_df.shape) # exploded - now every row is an extracted coordinate
relevant_df.head(1)

(4245, 39)


is_relevant,reason,coordinate_list,general_viral_rna,pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches,name,type,organism,database,accession,start,end,strand,sequence
str,str,struct[10],list[struct[9]],str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32,str,str,str,str,str,str,str,str,str
"""relevant""","""Detailed characterization of r…","{""frameshifting_elements"",""RNA"","""","""","""","""","""","""","""",null}",,"""PMC7122378""","""Ribosomal Frameshifting in Dec…","""Frameshifting provides an eleg…","""Miller, W. Allen, Giedroc, Dav…","""Recoding: Expansion of Decodin…","""2010""","""10.1007/978-0-387-89382-2_9""","""Frameshifting Plant Viruses Pl…",[],"[""virus and Dianthovirus genera (Fig. 9.1 ), despite many differences in sequence. Interestingly, in the dianthoviruses and umbraviruses, the putative LDFE we predict is located upstream of the cap-independent translation element that is also located in the 3 UTR (Mizumoto et al., 2003 ), whereas in the luteoviruses, the LDFE is downstream of the cap""]",[],"[""viruses known to employ minus one ( 1) programmed ribosomal frameshift"", ""virus , undergoes a net +1 reading frame change to translate the viral RdRp coding region (Karasev et al., 1995 ). This would be the first known +1 frameshift in any plant viral RNA. A carlavirus may use 1 frameshift"", … ""viruses appears to be confined to the plant virus world as are the possible 1 frameshift""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses employ frameshifting. Summary In most cases, the biological role and basic mechanism of 1 ribosome frameshifting are likely the same in plant viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""viral RNAs there appear to be three classes of RNA structure downstream of the slippery site that can facilitate 1 ribosomal frameshifting: (i) an apical loop internal loop (ALIL) structure in which a bulged stem-loop, located 5 6 nt downstream of the slippery site, base pairs to a distant loop in the 3 UTR; (ii) a very small, highly structured hairpin-type pseudoknot; or (iii) a stable, imperfect stem-loop"", ""virus are known or predicted to have a GGGUUUU shifty site. The structured region (Fig. 9.1A ) that begins six to eight bases downstream of the shifty heptanucleotide consists of a large adjacent downstream-bulged stem-loop (ADSL) that forms a complex pseudoknot by base pairing of a bulge loop in the ADSL to a stem-loop"", … ""viruses as in animal viruses. While some of the shifty heptanucleotide sites are the same in plant and animal viruses, the specific downstream structures that facilitate 1 frameshifting differ significantly between viruses of the two kingdoms. No ALIL-like structures that base pair with a stem-loop""]",[],[],[],"[""NC_003747"", ""NC_001575"", … ""accession ""]",[],76,"""frameshifting_elements""","""RNA""","""""","""""","""""","""""","""""","""""",""""""


### Noramlize verbal stuff / add "category"

In [44]:
coord_cols = ["name","type","organism","database","accession","start","end","strand","sequence"]
for col in coord_cols:
    print(f"{col}: {relevant_df[col].explode().value_counts(sort=True)}")
    print(f"{col}: {relevant_df[col].explode().unique().to_list()}")

name: shape: (2_378, 2)
┌─────────────────────────────────┬───────┐
│ name                            ┆ count │
│ ---                             ┆ ---   │
│ str                             ┆ u32   │
╞═════════════════════════════════╪═══════╡
│ ires                            ┆ 189   │
│ frameshifting_element           ┆ 168   │
│ internal_ribosome_entry_site    ┆ 148   │
│ frameshifting_elements          ┆ 124   │
│ internal_ribosome_entry_site_(… ┆ 115   │
│ …                               ┆ …     │
│ ev_d68_ires                     ┆ 1     │
│ pcmv_mir20a_internal_ribosome_… ┆ 1     │
│ vsv_ifn                         ┆ 1     │
│ igr_p/m                         ┆ 1     │
│ hcv_core_protein                ┆ 1     │
└─────────────────────────────────┴───────┘
name: ['ago1', 'ires_like_sequence', 'viral_circrnas', 'frameshift_stimulatory_element', 'viral_genomic_rna_(grna)', 'ribosome_shunt_configuration', 'aug_872', 'pvs_cp', 'dimerization_initiation_sequence_(dis)', 'ecmv_ires', 'co

In [45]:
missing_syn = [None, "","not provided","not specified","null"]
missing_map = {key : None for key in missing_syn}
# Normalize all coord cols
relevant_df = relevant_df.with_columns(
    pl.col(coord_cols).replace(missing_map)
)

relevant_df.select(pl.col("strand"))["strand"].value_counts() #unique().to_list()

# strand values
plus_syn = ["+","+1","1","forward"]
nega_syn = ["-","-1","negative"]
relevant_df = relevant_df.with_columns(
    pl.when(pl.col("strand").is_in(plus_syn)).then(pl.lit(True))
    .when(pl.col("strand").is_in(nega_syn)).then(pl.lit(False))
    .otherwise(pl.lit(None))
    .alias("strand")
    )
# name 
missing_syn = [None, "","not provided","not specified","null"]
ires_syn = [ "ires", 'arpc2_ires', 'bag_1_ires',"ev_d68_ires", 'vegf_ires_a', 'fgf2_ires', 'snca_ires', "Internal ribosome entry site (IRES)", "internal_ribosome_entry_site_(ires)_like_element", "type_2_ires",     "hcv_internal_ribosomal_entry_site_(ires)",    "emcv_ires_dtof",    "cvb3_ires",    "iapv_ires",    "internal_ribosome_entry_sequence_(ires)",    "ires_stem_loop_ii",    "mhhav_ires",    "internal_ribosomal_entry_site_(ires)",    "viral_ires",    "internal_ribosome_entry_site_(ires)_like_element",    "yap1_ires",    "type_6e_ires",    "ires_5'_utr",    "ires_a_element",    "ecmv_ires",    "5'_utr_(ires)",    "poliovirus_2a_internal_ribosome_entry_site_(ires)",    "viral_internal_ribosome_entry_site_(ires)",    "type_4_ires",    "igr_ires",    "crpv_internal_ribosome_entry_site_(ires)",    "internal_ribosomal_entry_sites_(ires)",    "hcv_ires",    "type_i_ires",    "domain_v_of_ires",    "ires_e73_denv2",    "crpv_like_ires",    "pvy_ires",    "dicistrovirus_igr_ires",    "htlv_1_ires",    "c_src_ires",    "tev_ires",    "ires_motif",    "internal_ribosomal_entry_site_(ires)_of_ev71",    "internal_ribosome_entry_sites_(ires)",    "ires_like_sequence",    "domain_vi_of_ires",    "emcv_ires_element",    "ev_d68_ires",    "ad.cag_myod_ires_egfp",    "ad.cag_myod_ires_cherry",    "denv_ires",    "internal_ribosome_entry_site_(ires)",    "ires_(internal_ribosome_entry_site)",    "kshv_ires",    "trv_ires",    "caledonia_beadlet_anemone_dicistro_like_virus_1_igr_ires",    "hav_ires",    "poliovirus_2a_internal_ribosomal_entry_site_(ires)",    "mmtv_ires",    "mchsle_1_ires",    "hcv_internal_ribosome_entry_site_(ires)",    "iresl_element",    "rabbit_picornavirus_strain_rabbit01/2013/hun_ires",    "ires_stem_loop_ii_(sl_ii)",    "ires_structure",    "viral_vector_csii_ef1_mlotus_ires_venus",    "pcmv_mir20a_internal_ribosome_entry_site_(ires)_green_fluorescent_protein_(gfp)",    "ires_internal_ribosomal_entry_site",    "rhpv_ires",    "ires_of_hepatitis_c_virus",    "crtmv_ires",    "pv_ires",    "ires_rna",    "xiap_ires",    "beihai_picorna_like_virus_85_igr_ires",    "type_5_ires",    "ires2",    "emcv_ires",    "hav_ires_domain_v",    "hiv_1_ires",    "halastavirva_virus_ires",    "gtcv_ires",    "meloe1_ires",    "ires_element",    "hcv_ires_sequence",    "ires_element_in_the_3'_utr_of_gch1",    "ires_elements",    "viral_ires_element",    "internal_ribosome_entry_sequences_(iress)",    "internal_ribosome_entry_site",    "poliovirus_2a_internal_ribosome_entry_site",    "viral_genome_under_the_translational_control_of_the_internal_ribosome_entry_site",    "internal_ribosomal_entry_site",    "internal_ribosome_entry_sites",    "internal_ribosomal_entry_structure",    "encephalomyocarditis_virus_internal_ribosomal_entry_site",    "internal_ribosomal_entry_sites"]
fse_syn = [ "fse", "frameshift_region","sars_cov_2_fse","frameshifting_region","frameshifting_stimulus_elements_(fse)","frameshift_element_(fse)", "fse","ibv_frameshift_signal", "fses_of_simian_immunodeficiency_virus","programmed_ribosome_frameshifting", "hiv_1_frameshift_rna_structure", "programmed_ribosomal_frameshift_site", "ribosomal_frameshift_element", "frameshift_stimulatory_element_(fse)", "frameshift_mutation", "ribosomal_frameshift_motif", "cug_initiation_and_frameshifting_element", "programmed_1_ribosomal_frameshift_(1_prf)_signal","programmed_ribosomal_frameshift_signal", "coronavirus_frameshifting_stimulation_element", "hiv_1_frameshift_element", "programmed_ribosomal_frameshift_element", "frameshift_stimulatory_elements_(fses)", "gag_pol_ribosomal_frameshift_site", "programmed_ribosomal_frameshifting_(prf)", "programmed_ribosomal_frameshifting_(prf)_site", "sars_cov_2_frameshifting_pseudoknot", "frameshift_element_(fse)", "programmed_1_ribosome_frameshift", "ribosomal_frameshift", "frameshifting_elements", "programmed_1_ribosomal_frameshift", "ribosomal_frameshifting", "frameshifting_site", "sars_cov_2_frameshifting_stimulatory_element_(fse)", "ribosomal_frameshifting_site", "frameshift_stimulatory_signal_(fss)", "slippery_ribosomal_frameshift_site", "programmed_1_ribosomal_frameshifting_(1_prf)", "frameshifting_element", "programmed_1_ribosomal_frameshift_(1_prf)", "programmed__1_ribosomal_frameshift", "frameshift_signal", "plekhm2_frameshifting_element", "nd3_174+1_frameshift_insertion", "ribosomal_frameshift_region", "frameshift_mechanism", "rf2_frameshift_site", "programmed_ribosomal_frameshifting", "gag_pol_frameshift_signal", "ribosomal_frameshift_site", "frameshift_signals", "programmed_ribosomal_frameshifting_stimulation_element_(fse)", "ribosomal_frameshifting_mechanism", "rna_stem_loop_functioning_as_an_intergenic_ribosomal_frameshift_signal", "ldri_for__1_frameshifting_in_bydv", "frameshifting_stimulation_element_(fse)", "frameshift_sites", "frameshift_stimulation_element", "frameshift_stimulatory_structure", "ribosome_frameshifting", "programmed_frameshifting", "programmed_ribosomal_frameshifting_pseudoknot_stimulator", "bax_2_ribosomal_frameshift_site", "ribosomal_frameshift_signal_(rfs)", "ribosomal_frameshift_signal", "viral_frameshifting_elements_(fses)", "ty1_frameshift_signal", "prfb_frameshift_signal", "frameshifting_pseudoknot_(pk)", "orf1a/1b_ribosomal_frameshifting_site", "hiv_ribosomal_frameshift_site", "programmed_21_ribosomal_frameshift", "ribosomal_frameshifting_element", "dnax_frameshift_signal", "viral_frameshift", "programmed_1_ribosomal_frameshifting", "ribosomal_frameshifting_sequence", "temperature_dependent_frameshift", "ribosomal_frameshift_sequence", "hiv_1_gag_pol_frameshift", "ribosome_frameshift", "frameshift_stimulatory_element", "dnax_frameshift_cassette", "frameshift_site", "frameshift_motif", "frameshift", "frameshifting_element_(fse)", "frameshift_in_icp0_open_reading_frame", "frameshifts", "frameshifting_element_of_sars_cov_2", "frameshift_stimulating_rna", "frameshift_inducing_stem_loop", "frameshift_promoting_pseudoknot", "programmed_ribosomal_frameshift", "programmed_frameshift_site", "gag/pol_frameshift_site_(fss)", "frameshift_stimulatory_pseudoknot", "programmed_1_frameshifting", "ty1_frameshifting_element", "minimal_frameshifting_element_(mfe)", "gag_pol_frameshift_stem_loop", "peg10_frameshift_element", "frameshift_element", "stem_loop_region_involved_in_frameshifting", "frameshifting_pseudoknot"]
tr_loop_syn = ["tr_loops","tr_loop","tr_stem_loop","tr_stem_loops"]
dls_syn = ["dimer_initiation_site","dimerization_initiation_site_(dis)","dimer_initiation_sequence_(dis)","dimer_linkage_structures","dimer_linkage_structure","dimer_linkage_structure_(dls)", "dimer_linkage_sequence_(dls)","dimerization_linkage_structure_(dls)", "dls",]
dlp_syn = ["downstream_hairpin_loop_(dsh)","downstream_loop","downstream_hairpin_loops","downstream_loop_(dlp)","downstream_loop,_dlp","dlp","downstream_loop_(dlp)","downstream_loop,_dlp","downstream_loop","downstream_hairpin_loop_(dsh)"]
ribozyme_syn = ["ribozyme;_crrna,_crispr_rna;_hdv,_hepatitis_delta_virus_ribozyme;_camv_term,_cauliflower_mosaic_virus","twister_sister_ribozyme","four_way_junctional_twister_sister_ribozyme","hbv_ribozyme","ribozyme_(hhrz)","hdv_ribozyme_sequence","hepatitis_delta_virus_(hdv)_ribozyme","r3c_ligase_ribozyme","hdv_like_self_cleaving_ribozymes","ribozyme_switches","ribozyme_derived_from_the_hepatitis_delta_virus","hepatitis_delta_virus_(hdv)_genomic_or_antigenomic_ribozyme","glms_ribozyme","twister_sister_ribozyme","viral_ribozymes","twister_sister_(ts)_ribozyme","cpeb3_ribozyme","pseudoknot_ribozyme","ribozyme_(hdvrbz)","varkud_satellite_ribozyme","hammerhead_ribozymes","hammerhead_ribozyme_(hamrz)_and_hepatitis_delta_virus_ribozyme_(hdvrz)","hdv_antigenomic_ribozyme","hh_ribozyme","hairpin_ribozyme","ts_ribozyme","twister_ribozymes","hovlinc_ribozyme","ribozyme","ribozymes","hepatitis_delta_virus_ribozyme","viral_ribozyme","self_cleaving_ribozymes","hammerhead_ribozyme_(hhrz)","hdv_ribozyme","rdev_ribozyme","glucosamine_6_phosphate_synthase_(glms)_ribozyme","ribozyme_(hhrbz)","twister_ribozyme","hepatitis_delta_ribozyme","hepatitis_b_virus_hammerhead_ribozyme","pistol_ribozyme","hepatitis_delta_virus_(hdv)_like_ribozymes","neurospora_varkud_satellite_(vs)_ribozyme","hatchet_ribozyme","hammerhead_ribozyme","ribozyme_domain","theta_ribozymes","hammerhead_ribozyme_(hhr)"]
rev_syn = [ "rre","rev_response_elements","rev_response_element_(rre)","rev_responsive_element_(rre)","rev_response_element","hiv_1_rev","stem_loop_iib_of_the_rre","rev_response_element_(rre)","rre_rna_stem_loop","rev_responsive_element_(rre)","rre_ii_0","rre_element_of_hiv","rre_region_of_hiv_1_genome","rre_iib_0","rre","rre_tr_0"]
ufs_syn =["ufs","cis_acting_5'_flanking_element_(ufs)","upstream_flanking_sequences","upstream_flanking_sequence","5'_uar_flanking_stem_(ufs)"]
tar_sym = ["tar_rna","tar_hairpin","transactivation_response_(tar)_element","trans_activation_response_element_rna_(tar)","tar_and_polya_hairpins","tar_5sl_stem_loop","tar_stem_loop","tar","tar_element","tar_loop","trans_activation_response_(tar)_element","tar_element_of_hiv","tar_9sl_stem_loop","tar_rna_stem_loop","transactivating_response_(tar)_element"]
relevant_df = relevant_df.with_columns(
    pl.when(pl.col("name").is_in(missing_syn)).then(pl.lit(None))
    .when(pl.col("name").is_in(ires_syn)).then(pl.lit("ires"))
    .when(pl.col("name").is_in(ufs_syn)).then(pl.lit("ufs"))
    .when(pl.col("name").is_in(ribozyme_syn)).then(pl.lit("ribozyme"))
    .when(pl.col("name").is_in(dls_syn)).then(pl.lit("dls"))
    .when(pl.col("name").is_in(rev_syn)).then(pl.lit("rev"))
    .when(pl.col("name").is_in(fse_syn)).then(pl.lit("fse"))
    .when(pl.col("name").is_in(dlp_syn)).then(pl.lit("dlp"))
    .when(pl.col("name").is_in(tar_sym)).then(pl.lit("tar"))
    .when(pl.col("name").is_in(tr_loop_syn)).then(pl.lit("tr_loop"))
    .otherwise(pl.lit("unclear"))
    .alias("category")
    )
relevant_df["category"].value_counts(sort=True)

category,count
str,u32
"""unclear""",2822
"""ires""",661
"""fse""",486
"""ribozyme""",154
"""rev""",40
"""tar""",34
"""dls""",17
"""dlp""",17
"""ufs""",8
"""tr_loop""",6


In [48]:
relevant_df.filter(pl.col("category").eq("unclear"))["name"].value_counts(sort=True) #to_list()

name,count
str,u32
,79
"""5'_utr""",48
"""stem_loop_structure""",47
"""stem_loop_structures""",45
"""3'_utr""",28
…,…
"""packaging_signals""",1
"""chv4_p24""",1
"""vsv_ifn""",1
"""igr_p/m""",1


#### todo: revisit these, figure out if the system prompt could be even stricter to regarding the guidelines of controled vocabulary for the schema

In [50]:
relevant_df = relevant_df.with_columns(
        pl.sum_horizontal([pl.col(col).ne("") for col in ["name","type","organism","database","accession","start","end","strand","sequence"]])
        .alias("total_coord_attributes")
    ).sort(by="total_coord_attributes", descending=True)
relevant_df.head(1)

is_relevant,reason,coordinate_list,general_viral_rna,pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches,name,type,organism,database,accession,start,end,strand,sequence,category,total_coord_attributes
str,str,struct[10],list[struct[9]],str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32,str,str,str,str,str,str,str,bool,str,str,u32
"""relevant""","""The paper provides a detailed …","{""frameshifting_site"",""RNA"",""West Nile virus"",""ncbi_genbank"",""NC_009942"",""3420"",""3678"",""1"",""The sequence spanning the NS1 and NS2A genes"",null}",,"""PMC11797035""","""RNA elements required for the …","""Graphical AbstractGraphical Ab…","""Cate, JamieH D, Shelke, Rohan …","""Nucleic Acids Research""","""2024-12-19""","""10.1093/nar/gkae1248""","""Programmed ribosomal frameshif…",[],[],"[""virus-induced ribosomal frameshift""]","[""viruses and other organisms to regulate gene expression by altering the messenger RNA (mRNA) reading frame of ribosomes during translation. This process is especially prevalent in RNA viruses, where frameshift"", ""viruses because it exists in two forms: full-length NS1 and an extended NS1 ( 18 20 ). The production of NS1 is dependent on a 1 ribosomal frameshift"", … ""viral life cycle. This aligns with previous studies, which reported that different slippery site sequences can influence the extent and direction of frameshift events ( 88 ). The versatility of the slippery site in promoting both 1 and 2 frameshift""]",[],[],[],[],[],[],[],"[""viral systems, such as HIV, the frameshifting-stimulating element may adopt alternative structures, either forming a stem loop""]",[],[],[],"[""MN908947.3"", ""RefSeq NC_009942""]",[],21,"""frameshifting_site""","""RNA""","""West Nile virus""","""ncbi_genbank""","""NC_009942""","""3420""","""3678""",True,"""The sequence spanning the NS1 …","""fse""",9


### Checking how many extracted coodinates contain all requested fields


In [51]:
relevant_df["total_coord_attributes"].value_counts(sort=True)

total_coord_attributes,count
u32,u32
3,1453
5,729
4,642
2,533
7,303
6,219
8,209
0,79
9,78


In [52]:
relevant_df.filter(pl.col("total_coord_attributes") == 0)["reason"].to_list()

["The paper discusses various mechanisms of gene expression used by viruses, including ribosomal frameshifting and termination suppression, which are relevant to the user's interest in ires, frameshifting_elements, and other viral RNA features.",
 'The paper discusses the role of a non-coding RNA (TMER4) in the pathogenesis of Murine gammaherpesvirus 68 (MHV68), including its essential role in hematogenous dissemination and establishment of peripheral latency.',
 'The paper describes a comprehensive database of ribozymes, which are functional RNA molecules that can catalyze chemical reactions.',
 'The paper explores internal ribosome entry sites (IRES) as therapeutic targets, discussing their role in viral and cellular mRNA translation, and various strategies to target IRES elements for therapeutic gain.',
 'The paper discusses the dynamic RNA structurome and its functions, including the study of RNA structures in living cells using high-throughput sequencing (HTS) technologies.',
 'Th

### For the probably failed entries, goning to redo th llm call


In [53]:
redos_df = relevant_df.filter(pl.col("total_coord_attributes") == 0).drop(set(relevant_df.columns).difference(set(results_df.columns)))
redos_df.shape

(79, 26)

In [54]:
from polars_dovmed.llm_convert_context_to_coord import parse_llm_response,create_user_prompt_full_text,create_system_prompt_full_text,call_llm_api

In [55]:
# Create new dictionary for the loop.
all_terms = queries.copy()

# Remove the unwanted entries
for key in ["virus_taxonomy_report", "disqualifying_terms"]:
    all_terms.pop(key, None) 

for key, value in all_terms.items():
    cleaned_patterns = []
    for pattern_list in value:
        cleaned_pattern_list = [clean_pattern_for_polars(pattern) for pattern in pattern_list]
        cleaned_patterns.append(cleaned_pattern_list)
    all_terms[key] = cleaned_patterns

print(all_terms)

concept_columns = redos_df.select(pl.selectors.starts_with(queries.keys())).columns # type: ignore
print(f"Found concept columns: {concept_columns}")

models=["Llama-4-Scout-17B-16E-Instruct"]
all_responses = []
    
# Create the system prompt
system_prompt = create_system_prompt_full_text(schema=response_schema)
print(f"The system prompt is: {system_prompt}")

for index, row in enumerate(redos_df.iter_rows(named=True)):
    this_row_reponses = dict.fromkeys(models, None)

    user_prompt = create_user_prompt_full_text(
        full_text=row["full_text"],  # Use full text instead of coordinate_text
        title=row["title"],
        user_terms=all_terms,
        matched_terms=redos_df[index].select(concept_columns).with_columns(pl.concat_list(pl.col(concept_columns))).unique().to_series().to_list(), # type: ignore
        prompt_prepend="My focus is on RNA viruses and RNA secondary structures they might use. I am NOT interested at all in antibodies, vaccines, artifical vectors, synthetic constructs or biosensors - if any of these is mentioned, consider set is_relevant to not_relevant"
    )

    for model in models:
        try:
            llm_response = call_llm_api(
            user_prompt=user_prompt,
            system_prompt=system_prompt,
            api_key=os.environ.get("YOUR_API_KEY"), # type: ignore
            api_base="https://YOUR.API.PROVIDER.COM",
            model=model
            )
            parsed = parse_llm_response(llm_response)

        except Exception as e:
            parsed = {"is_relevant": "ERROR", "reason": e, "coordinate_list":[]}

        this_row_reponses[model] = parsed
        with open(f"results/rna_virus/rna_secondary_structure/llm_full_text_respones/{row['pmc_id']}_redo.json", "w") as outfile:
            json.dump(parsed, outfile)
    all_responses.append(this_row_reponses)
    print(this_row_reponses)


{'ires': [['viral|virus', 'ires'], ['viral|virus', 'internal', 'ribosome', 'entry', 'site'], ['viral|virus', 'cap independent'], ['viral|virus', 'independent', 'cap'], ['viral|virus', 'internal', 'ribosomal', 'entry'], ['viral|virus', 'ires'], ['ires', 'viral|virus']], 'frameshifting_elements': [['viral|virus|phage', 'frameshift'], ['programmed', 'ribosomal', 'frameshift'], ['programmed', 'frameshift'], ['frameshifting', 'stimulation', 'element'], ['frameshift', 'stimulation', 'element'], ['viral|virus|phage', 'fse']], 'upstream_flanking_sequences': [['cis', 'acting', 'flanking', 'element'], ['cis', 'acting', '5', 'flanking'], ['viral|virus|phage', 'ufs']], 'tr_loops': [['viral|virus|phage', 'tr', 'loop'], ['tr stem', 'loop'], ['viral|virus|phage', 'tr', 'loop'], ['tr', 'stemloop'], ['tr', 'stem loop']], 'dimer_linkage_structures': [['viral|virus|phage', 'dls'], ['viral|virus', 'dimer', 'linkage', 'structure'], ['viral|virus', 'dls']], 'rev_response_elements': [['viral|virus', 'rev', '

Failed to parse JSON response: Expecting property name enclosed in double quotes: line 1 column 5 (char 4)
Original response text: 
{
  "is_relevant": "relevant",
  "reason": "The paper describes the discovery of sixteen novel mycoviruses co-infecting a single strain of Rhizoctonia zeae, including their genome organization and phylogenetic analysis.",
  "coordinate_list": [
    {
      "name": "RzHV1",
      "type": "RNA",
      "organism": "Rhizoctonia zeae",
      "database": "GenBank",
      "accession": "OQ559666",
      "start": "137",
      "end": "12526",
      "strand": "",
      "sequence": ""
    },
    {
      "name": "RzHV2",
      "type": "RNA",
      "organism": "Rhizoctonia zeae",
      "database": "GenBank",
      "accession": "OQ559672",
      "start": "605",
      "end": "13299",
      "strand": "",
      "sequence": ""
    },
    {
      "name": "RzYkV1",
      "type": "RNA",
      "organism": "Rhizoctonia zeae",
      "database": "GenBank",
      "accession": "OQ559

{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper describes the discovery of sixteen novel mycoviruses co-infecting a single strain of Rhizoctonia zeae, including their genome organization and phylogenetic analysis.', 'coordinate_list': []}}
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses alphavirus replication, tropism, and interference with host gene expression, including RNA secondary structures used by the virus.', 'coordinate_list': [{'name': 'stem loop structure', 'type': 'RNA', 'organism': '', 'database': '', 'accession': '', 'start': '', 'end': '', 'strand': '', 'sequence': ''}]}}
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': "The paper discusses the interaction between the dual-specificity kinase DYRK1A and the Hepatitis B virus genome, and its role in regulating the production of viral RNA. The study focuses on RNA viruses and RNA secondary structures, which is relevant 

Failed to parse JSON response: Expecting property name enclosed in double quotes: line 1 column 5 (char 4)
Original response text: 
{
  "is_relevant": "relevant",
  "reason": "The paper discusses the structural disorder in the proteome and interactome of Alkhurma virus (ALKV), focusing on RNA secondary structures and protein intrinsic disorder, which is relevant to RNA viruses and RNA secondary structures.",
  "coordinate_list": [
    {
      "name": "ALKV genomic polyprotein",
      "type": "Protein",
      "organism": "Alkhurma virus",
      "database": "UniProt",
      "accession": "Q91B85",
      "start": "",
      "end": "",
      "strand": "",
      "sequence": "",
    },
    {
      "name": "Capsid protein C",
      "type": "Protein",
      "organism": "Alkhurma virus",
      "database": "UniProt",
      "accession": "",
      "start": "",
      "end": "",
      "strand": "",
      "sequence": "",
    },
    {
      "name": "Protein prM",
      "type": "Protein",
      "organism

{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses the structural disorder in the proteome and interactome of Alkhurma virus (ALKV), focusing on RNA secondary structures and protein intrinsic disorder, which is relevant to RNA viruses and RNA secondary structures.', 'coordinate_list': []}}
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses defective viral genomes (DVGs) and their role in RNA virus infections, including their generation, mechanisms, and functions in pathogenesis.', 'coordinate_list': []}}
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses the modularity of the N-terminal amphipathic helix conserved in picornavirus 2C proteins and hepatitis C NS5A protein, which is related to RNA secondary structures in RNA viruses.', 'coordinate_list': []}}
{'Llama-4-Scout-17B-16E-Instruct': {'is_relevant': 'relevant', 'reason': 'The paper discusses the VPgPro

In [56]:
from polars_dovmed.schema_utils import normalize_biological_name
redos_df = relevant_df.filter(pl.col("total_coord_attributes") == 0).drop(set(relevant_df.columns).difference(set(results_df.columns)))


redos_df = redos_df.hstack(pl.from_dicts(all_responses).unnest("Llama-4-Scout-17B-16E-Instruct")).filter(~pl.col("is_relevant").ne("relevant"))
redos_df = redos_df.explode(pl.col("coordinate_list")).with_columns(
    pl.col("coordinate_list").struct.rename_fields(["name","type","organism","database","accession","start","end","strand","sequence"]
        ).struct.unnest())

redos_df = redos_df.with_columns(
        pl.sum_horizontal([pl.col(col).ne("") for col in ["name","type","organism","database","accession","start","end","strand","sequence"]])
        .alias("total_coord_attributes")
    ).sort(by="total_coord_attributes", descending=True)

redos_df = redos_df.with_columns(pl.col("name").map_elements(normalize_biological_name,return_dtype=pl.String).alias("name"))
print(redos_df.shape)
redos_df.head(1)


(80, 39)


  redos_df = redos_df.with_columns(pl.col("name").map_elements(normalize_biological_name).alias("name"))


pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches,is_relevant,reason,coordinate_list,name,type,organism,database,accession,start,end,strand,sequence,total_coord_attributes
str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32,str,str,struct[9],str,str,str,str,str,str,str,str,str,u32
"""PMC10409264""","""Influenza A virus inhibits TET…","""Author summaryInfluenza A viru…","""Yue, Min, Li, Yan, Hu, Yixiang…","""PLOS Pathogens""","""2023-7""","""10.1371/journal.ppat.1011550""","""The influenza A virus (IAV) is…",[],[],[],"[""viral replication. The endoribonuclease PA-X is translated from the polymerase acidic protein (PA) mRNA by +1 ribosomal frameshift""]",[],[],[],[],[],[],[],[],[],[],[],"[""NP_001371809.1"", ""NP_001120680.1"", … ""NP_001120680.1""]",[],1,"""relevant""","""The paper discusses the role o…","{""STAT1"",""Protein"",""Homo sapiens"",""UniProt"",""P42291"","""","""","""",""""}","""stat1""","""Protein""","""Homo sapiens""","""UniProt""","""P42291""","""""","""""","""""","""""",5


In [57]:
# Normalize all coord cols
redos_df = redos_df.with_columns(
    pl.col(coord_cols).replace(missing_map)
)

redos_df.select(pl.col("strand"))["strand"].value_counts() #unique().to_list()

# strand values
plus_syn = ["+","+1","1","forward"]
nega_syn = ["-","-1","negative"]
redos_df = redos_df.with_columns(
    pl.when(pl.col("strand").is_in(plus_syn)).then(pl.lit(True))
    .when(pl.col("strand").is_in(nega_syn)).then(pl.lit(False))
    .otherwise(pl.lit(None))
    .alias("strand")
    )
# name 
redos_df = redos_df.with_columns(
    pl.when(pl.col("name").is_in(missing_syn)).then(pl.lit(None))
    .when(pl.col("name").is_in(ires_syn)).then(pl.lit("ires"))
    .when(pl.col("name").is_in(ufs_syn)).then(pl.lit("ufs"))
    .when(pl.col("name").is_in(ribozyme_syn)).then(pl.lit("ribozyme"))
    .when(pl.col("name").is_in(dls_syn)).then(pl.lit("dls"))
    .when(pl.col("name").is_in(rev_syn)).then(pl.lit("rev"))
    .when(pl.col("name").is_in(fse_syn)).then(pl.lit("fse"))
    .when(pl.col("name").is_in(dlp_syn)).then(pl.lit("dlp"))
    .when(pl.col("name").is_in(tar_sym)).then(pl.lit("tar"))
    .when(pl.col("name").is_in(tr_loop_syn)).then(pl.lit("tr_loop"))
    .otherwise(pl.lit("unclear"))
    .alias("category")
    )
redos_df["category"].value_counts(sort=True)

category,count
str,u32
"""unclear""",58
"""ires""",14
"""ribozyme""",4
"""fse""",4


In [58]:
redos_df.filter(pl.col("category") == "unclear")["name"].unique().to_list()

['cvb3_capsid_region',
 'a512t_mutation',
 'internal_extended_stem_loop_structure',
 'cap1',
 'stat1',
 'a576v_mutation',
 'transcription_regulatory_sequences',
 'vpgpro',
 'human_rhinovirus_(hrv)',
 'open_reading_frame_(orf)_1_region',
 'tet2_mrna',
 'hepatitis_b_virus_genome',
 'astrovirus_va3',
 'viral_reads',
 'pa_x',
 "5'_cap",
 'isg15',
 'viral_rna',
 'cap2',
 'tmer4_stem_loop',
 'internal_discontinuous_replication_element',
 'bk_polyomavirus_(bkv)',
 "3'_utr",
 'm6a_modifications',
 'norovirus',
 None,
 'wymv_rna',
 'viral_genome',
 'stem_loop_structure',
 'pseudoknots',
 'eif6',
 'adenovirus_(adv)',
 'i581t_mutation',
 'stem_loop_structures',
 'psi_region',
 'hev_rna',
 'g3bp1_granules',
 'intersection_sequence_(int)',
 'd207s_mutation']

In [59]:
set(relevant_df.columns).difference(redos_df.columns)

{'general_viral_rna'}

In [60]:
redos_df.height + relevant_df.height

4325

In [61]:
test = pl.concat([relevant_df,redos_df],how="diagonal_relaxed")
test.shape

(4325, 41)

In [62]:
relevant_df = test
relevant_df.head(1)

is_relevant,reason,coordinate_list,general_viral_rna,pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches,name,type,organism,database,accession,start,end,strand,sequence,category,total_coord_attributes
str,str,struct[10],list[struct[9]],str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32,str,str,str,str,str,str,str,bool,str,str,u32
"""relevant""","""The paper provides a detailed …","{""frameshifting_site"",""RNA"",""West Nile virus"",""ncbi_genbank"",""NC_009942"",""3420"",""3678"",""1"",""The sequence spanning the NS1 and NS2A genes"",null}",,"""PMC11797035""","""RNA elements required for the …","""Graphical AbstractGraphical Ab…","""Cate, JamieH D, Shelke, Rohan …","""Nucleic Acids Research""","""2024-12-19""","""10.1093/nar/gkae1248""","""Programmed ribosomal frameshif…",[],[],"[""virus-induced ribosomal frameshift""]","[""viruses and other organisms to regulate gene expression by altering the messenger RNA (mRNA) reading frame of ribosomes during translation. This process is especially prevalent in RNA viruses, where frameshift"", ""viruses because it exists in two forms: full-length NS1 and an extended NS1 ( 18 20 ). The production of NS1 is dependent on a 1 ribosomal frameshift"", … ""viral life cycle. This aligns with previous studies, which reported that different slippery site sequences can influence the extent and direction of frameshift events ( 88 ). The versatility of the slippery site in promoting both 1 and 2 frameshift""]",[],[],[],[],[],[],[],"[""viral systems, such as HIV, the frameshifting-stimulating element may adopt alternative structures, either forming a stem loop""]",[],[],[],"[""MN908947.3"", ""RefSeq NC_009942""]",[],21,"""frameshifting_site""","""RNA""","""West Nile virus""","""ncbi_genbank""","""NC_009942""","""3420""","""3678""",True,"""The sequence spanning the NS1 …","""fse""",9


In [63]:
relevant_df = relevant_df.with_columns(
        pl.sum_horizontal([pl.col(col).ne("") for col in ["name","type","organism","database","accession","start","end","strand","sequence"]])
        .alias("total_coord_attributes")
    ).sort(by="total_coord_attributes", descending=True)
print(relevant_df.shape)
relevant_df.head(1)

(4325, 41)


is_relevant,reason,coordinate_list,general_viral_rna,pmc_id,title,abstract_text,authors,journal,publication_date,doi,full_text,ires_extracted_from_title,ires_extracted_from_full_text,frameshifting_elements_extracted_from_title,frameshifting_elements_extracted_from_full_text,upstream_flanking_sequences_extracted_from_full_text,dimer_linkage_structures_extracted_from_full_text,rev_response_elements_extracted_from_full_text,downstream_hairpin_loops_extracted_from_full_text,viral_ribozymes_extracted_from_title,viral_ribozymes_extracted_from_full_text,stem_loop_structures_extracted_from_title,stem_loop_structures_extracted_from_full_text,general_viral_rna_extracted_from_full_text,virus_taxonomy_report_extracted_from_title,virus_taxonomy_report_extracted_from_full_text,all_accessions,all_coordinates,total_matches,name,type,organism,database,accession,start,end,strand,sequence,category,total_coord_attributes
str,str,struct[10],list[struct[9]],str,str,str,str,str,str,str,str,list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],list[str],u32,str,str,str,str,str,str,str,bool,str,str,u32
"""relevant""","""The paper provides a detailed …","{""frameshifting_site"",""RNA"",""West Nile virus"",""ncbi_genbank"",""NC_009942"",""3420"",""3678"",""1"",""The sequence spanning the NS1 and NS2A genes"",null}",,"""PMC11797035""","""RNA elements required for the …","""Graphical AbstractGraphical Ab…","""Cate, JamieH D, Shelke, Rohan …","""Nucleic Acids Research""","""2024-12-19""","""10.1093/nar/gkae1248""","""Programmed ribosomal frameshif…",[],[],"[""virus-induced ribosomal frameshift""]","[""viruses and other organisms to regulate gene expression by altering the messenger RNA (mRNA) reading frame of ribosomes during translation. This process is especially prevalent in RNA viruses, where frameshift"", ""viruses because it exists in two forms: full-length NS1 and an extended NS1 ( 18 20 ). The production of NS1 is dependent on a 1 ribosomal frameshift"", … ""viral life cycle. This aligns with previous studies, which reported that different slippery site sequences can influence the extent and direction of frameshift events ( 88 ). The versatility of the slippery site in promoting both 1 and 2 frameshift""]",[],[],[],[],[],[],[],"[""viral systems, such as HIV, the frameshifting-stimulating element may adopt alternative structures, either forming a stem loop""]",[],[],[],"[""MN908947.3"", ""RefSeq NC_009942""]",[],21,"""frameshifting_site""","""RNA""","""West Nile virus""","""ncbi_genbank""","""NC_009942""","""3420""","""3678""",True,"""The sequence spanning the NS1 …","""fse""",9


In [64]:
relevant_df = relevant_df.filter(pl.col("total_coord_attributes") >= 1)
relevant_df.shape

(4230, 41)

### Total ~4230

In [65]:
relevant_df.write_parquet("results/rna_virus/rna_secondary_structure/llm_full_text_relevant_responses.parquet")
