# Experiment: Downstream applications of polars-dovmed results: Fetch Genomic Sequences from Databases
This notebook is not part of the polars-dovmed package proper. It is a proof of concept that demonstrates some downstream use of the results and explores retrieving the actual genomic sequences from public databases.

**Exploratory workflow:** This notebook takes the cleaned and normalized LLM-extracted coordinates and attempts to fetch the actual nucleotide sequences from public databases (NCBI/RefSeq) using GFF files. It filters for high-quality extractions (those with both coordinates and accessions), cleans sequence formatting, and validates nucleotide sequences.

The goal is to transform LLM-extracted metadata into usable biological sequences for downstream analysis.

**Related notebooks:** For the LLM API calls that extract coordinates, see [`01_llm_analyze_papers.ipynb`](./01_llm_analyze_papers.ipynb). For cleaning and normalizing those responses, see [`02_clean_llm_responses.ipynb`](./02_clean_llm_responses.ipynb).

**NOTE!** As with anything involving LLMs, take all results with a mountain of salt grains and verify everything independently.  
Some of the code here was written for earlier (now obsolete) versions of polars-dovmed, so not all of the fields exist in the current output.

In [2]:
import json
import os
from pathlib import Path

import polars as pl

# All relative paths below are resolved from the project root.
os.chdir("/clusterfs/jgi/scratch/science/metagen/neri/code/blits/polars_dovmed")
results_dir = Path("results/rna_virus/rna_secondary_structure")
# Use context managers so the JSON file handles are closed after loading.
with open(results_dir / "RNA_virus_rss_queries.json") as fh:
    queries = json.load(fh)
with open(results_dir / "rna_virus_schema.json") as fh:
    response_schema = json.load(fh)
file_lists_df = pl.read_parquet("data/pubmed_central/pmc_oa/filelists.parquet")

In [3]:
work_df = pl.read_parquet("results/rna_virus/rna_secondary_structure/llm_full_text_relevant_responses.parquet")
relevant_cols = ["is_relevant","reason","pmc_id","all_coordinates","all_accessions","category","title","publication_date","total_coord_attributes","total_matches","name","type","organism","database","accession","start","end","strand","sequence"]

work_df = work_df.select(relevant_cols).unique()

print(work_df.shape)
work_df.sort(by="total_matches",descending=True).head(2)

(4230, 19)


is_relevant,reason,pmc_id,all_coordinates,all_accessions,category,title,publication_date,total_coord_attributes,total_matches,name,type,organism,database,accession,start,end,strand,sequence
str,str,str,list[str],list[str],str,str,str,u32,u32,str,str,str,str,str,str,str,bool,str
"""relevant""","""Detailed characterization of r…","""PMC7122378""",[],"[""NC_003747"", ""NC_001575"", … ""accession ""]","""unclear""","""Ribosomal Frameshifting in Dec…","""2010""",4,76,"""ldfe""","""RNA""","""Barley yellow dwarf virus (BYD…",,,"""4 kb downstream in the genome""",,,
"""relevant""","""Detailed characterization of r…","""PMC7122378""",[],"[""NC_003747"", ""NC_001575"", … ""accession ""]","""unclear""","""Ribosomal Frameshifting in Dec…","""2010""",3,76,"""alil""","""RNA""","""Barley yellow dwarf virus (BYD…",,,,,,


Trim to the low-hanging fruit (for now): keep only rows where at least two of `start`, `end`, and `accession` are non-empty.

In [4]:
# work_df = work_df.filter(~pl.col("category").eq("unclear"))
# work_df = work_df.filter(pl.col("total_coord_attributes").ge(5))
# work_df = work_df.filter(pl.col("type").eq("RNA"))
work_df = work_df.filter(pl.sum_horizontal([pl.col(col).ne("") for col in ["start","end","accession"]]).ge(2))

work_df

is_relevant,reason,pmc_id,all_coordinates,all_accessions,category,title,publication_date,total_coord_attributes,total_matches,name,type,organism,database,accession,start,end,strand,sequence
str,str,str,list[str],list[str],str,str,str,u32,u32,str,str,str,str,str,str,str,bool,str
"""relevant""","""The paper discusses the geneti…","""PMC4810542""",[],"[""A11020"", ""GU179001"", … ""accession number""]","""fse""","""Genetic Stability of Bacterial…","""2016-3-28""",8,4,"""frameshifting_element""","""RNA""","""Human cytomegalovirus (HCMV)""","""ncbi_genbank""","""GU179001.1""","""3687""","""9738""",,"""TTTTTTTT TTTTTTTTT (frameshift…"
"""relevant""","""The paper examines the stabili…","""PMC11998532""",[],"[""X54430.1"", ""X77950.1"", … ""GenBank accession""]","""unclear""","""Hairpin inserts in viral genom…","""2025-3-21""",6,5,"""domain_2""","""RNA""","""Citrus yellow vein-associated …","""ncbi_genbank""",,"""669""","""2398""",,
"""relevant""","""The paper provides evidence th…","""PMC7522710""",[],"[""SAMN13244308"", ""SAMN13244304"", … ""SRA accession""]","""unclear""","""Non-retroviral Endogenous Vira…","""2020-9-21""",6,1,"""cfav_eve1""","""RNA""","""Aedes aegypti""",,,"""NS2""","""NS2""",false,
"""relevant""","""The paper discusses the discov…","""PMC8876172""",[],"[""OL471343"", ""OL471345"", … ""accession numbers""]","""unclear""","""Viruses Infecting Greenhood Or…","""2022-2""",8,4,"""pterostylis_alphaendornavirus_…","""RNA""","""Pterostylis nutans""","""ncbi_genbank""","""OL471320""","""1""","""14889""",true,
"""relevant""","""The paper describes the engine…","""PMC7409111""",[],"[""NM_001175829"", ""NM_001352010.1"", ""DS126431""]","""unclear""","""EngineeringMaize rayado fino v…","""2020-8""",5,2,"""pol/cp1_junction""","""RNA""","""Zea mays""",,,"""5494""","""5495""",,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""relevant""","""The paper provides detailed in…","""PMC10462655""","[""residues 746-992""]","[""C601003"", ""P69503"", … ""accession code""]","""unclear""","""Dynamically regulated two-site…","""2023""",5,1,"""j_k_st_region""","""RNA""","""encephalomyocarditis virus""",,,"""G680""","""C787""",,
"""relevant""","""The paper identifies and chara…","""PMC9559581""",[],"[""ON241323"", ""ON241332"", … ""accession number""]","""unclear""","""Virome ofPseudostellaria heter…","""2022""",8,3,"""pseudostellaria_heterophylla_a…","""RNA""","""Pseudostellaria heterophylla""","""ncbi_genbank""","""ON241319""","""171""","""1335""",true,
"""relevant""","""The paper provides information…","""PMC6615781""",[],"[""CUHK63749"", ""CUHK65899"", … ""accession ""]","""ires""","""Molecular epidemiological stud…","""2019-7-1""",7,1,"""internal_ribosome_entry_site""","""RNA""","""Enterovirus D68""","""ncbi_genbank""","""MG739632-MG739646""","""681""","""731""",,
"""relevant""","""The paper discusses the HIV-1 …","""PMC7687225""",[],"[""PDB: 1ANR"", ""ID:""]","""tar""","""TargetDirected AzideAlkyne Cyc…","""2020-7-20""",5,1,"""tar_rna""","""RNA""","""Human immunodeficiency virus t…",,,"""1""","""59""",,


In [4]:
for col in relevant_cols:
    print(f"{col}: {work_df[col].explode().value_counts(sort=True)}")
    print(f"{col}: {work_df[col].explode().unique().to_list()}")

is_relevant: shape: (1, 2)
┌─────────────┬───────┐
│ is_relevant ┆ count │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ relevant    ┆ 194   │
└─────────────┴───────┘
is_relevant: ['relevant']
reason: shape: (166, 2)
┌─────────────────────────────────┬───────┐
│ reason                          ┆ count │
│ ---                             ┆ ---   │
│ str                             ┆ u32   │
╞═════════════════════════════════╪═══════╡
│ The paper discusses the geneti… ┆ 4     │
│ The paper discusses frameshift… ┆ 4     │
│ The paper provides detailed in… ┆ 3     │
│ The paper provides detailed an… ┆ 3     │
│ The paper identifies and chara… ┆ 3     │
│ …                               ┆ …     │
│ The paper characterizes the IR… ┆ 1     │
│ The paper discusses the produc… ┆ 1     │
│ The paper discusses the molecu… ┆ 1     │
│ The paper discusses the use of… ┆ 1     │
│ The paper provides detailed in… ┆ 1     │
└─────────────────────────────────┴───────┘
reason:

In [5]:
# Clean the free-text sequence field in one chained expression.
# "7A stretch" must be expanded before spaces are stripped, otherwise it never matches.
work_df = work_df.with_columns(
    pl.col("sequence")
    .str.replace_all("7A stretch", "AAAAAAA", literal=True)
    .str.replace_all("5'-", "", literal=True)  # leading 5' end marker
    .str.replace_all("-3", "", literal=True)  # trailing 3' end marker (apostrophe removed below)
    .str.replace_all(r"[_ ']", "")  # underscores, spaces, stray apostrophes
    .alias("sequence")
)


In [None]:
# Keep a sequence only if it contains nothing but A/T/G/C/U; otherwise set it to null.
work_df = work_df.with_columns(
    pl.when(~pl.col("sequence").str.to_lowercase().str.contains("[^atgcu]"))
    .then(pl.col("sequence")).otherwise(None)
    .alias("sequence")
)


In [7]:
work_df

is_relevant,reason,pmc_id,all_coordinates,all_accessions,category,title,publication_date,total_coord_attributes,total_matches,name,type,organism,database,accession,start,end,strand,sequence
str,str,str,list[str],list[str],str,str,str,u32,u32,str,str,str,str,str,str,str,bool,str
"""relevant""","""The paper provides detailed in…","""PMC7114287""",[],"[""AJ271965"", ""X53459"", … ""GenBank accession""]","""fse""","""The complete sequence of the b…","""2005-8-30""",8,1,"""frameshifting_elements""","""RNA""","""Bovine torovirus""","""ncbi_genbank""","""AY427798""","""14148""","""14205""",,"""UUUAAAC"""
"""relevant""","""The paper discusses the role o…","""PMC11401462""",[],"[""JQ804832"", ""H00010189"", … ""accession number""]","""ires""","""NSUN2 mediates distinct pathwa…","""2024-5-18""",7,1,"""ires_motif""","""RNA""","""Enterovirus 71""","""ncbi_genbank""","""JQ804832""","""584""","""584""",,
"""relevant""","""The paper discusses the discov…","""PMC9869654""",[],"[""KR868724"", ""KX645667"", … ""KR868723""]","""fse""","""Discovery of novel Mamastrovir…","""2023-1-02""",8,2,"""frameshifting_element""","""RNA""","""Bactrian camel astrovirus (BcA…","""ncbi_genbank""","""KR868721, KR868722, KR868723, …","""near the end of ORF1a""","""followed by a stem-loop struct…",,"""AAAAAAC"""
"""relevant""","""The paper discusses the regula…","""PMC8776124""",[],"[""ID:""]","""ires""","""G-Quadruplex Regulation ofVEGF…","""2022-1""",6,2,"""ires_a_element""","""RNA""","""Homo sapiens""","""ncbi_genbank""",,"""749""","""1038""",,
"""relevant""","""The paper characterizes two to…","""PMC10674808""",[],"[""ON812795"", ""OR250783"", … ""accession numbers""]","""fse""","""Molecular Characterization of …","""2023-11""",9,2,"""frameshifting_element""","""RNA""","""Geotrichum candidum""","""ncbi_genbank""","""OR250782""","""1958""","""1964""",true,"""GGUUUAAU"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""relevant""","""The paper characterizes the IR…","""PMC5587806""",[],[],"""ires""","""The IRES5UTRof the dicistrovir…","""2017-7-18""",6,1,"""ires_5'_utr""","""RNA""","""Cricket paralysis virus""","""ncbi_genbank""",,"""357""","""761""",,
"""relevant""","""The paper discusses the produc…","""PMC5599999""",[],"[""Q13255"", ""P33535"", … ""GenBank accession""]","""ires""","""Production of G proteincoupled…","""2017-10""",6,2,"""internal_ribosome_entry_site""","""RNA""",,"""ncbi_genbank""","""AF218039""","""6025""","""6216""",,
"""relevant""","""The paper discusses the molecu…","""PMC7949111""",[],[],"""ires""","""Enterovirus D68 molecular and …","""2021-1-21""",9,1,"""ev_d68_ires""","""RNA""","""Enterovirus D68""","""ncbi_genbank""","""JX101610""","""1""","""903""",true,"""GGUUUAAU"""
"""relevant""","""The paper discusses the use of…","""PMC7785429""",[],"[""MT240929"", ""MT240924"", … ""ID:""]","""ires""","""Consensus small interfering RN…","""2020-11-17""",8,1,"""ires_structure""","""RNA""","""Hepatitis C virus sub-genotype…","""ncbi_genbank""","""Y11604""","""141""","""279""",true,


In [None]:
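# Test run on a handful of the fetched GFF files.
all_fetched = Path("data/refseq/ncbi_datasets/").glob("*/*.gff")
ix = 0
all_frames = []
for gff_file in all_fetched:
    # The reader call is commented out here (`pb` is not imported in this notebook);
    # while it stays disabled, `all_frames` remains empty.
    # all_frames.append(pb.read_gff(str(gff_file)))
    ix += 1
    if ix > 10:  # checked after the increment, so 11 files are visited
        break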


In [10]:
test_df = pl.concat(pl.collect_all(all_frames))
test_df

0rows [00:00, ?rows/s]

chrom,start,end,type,source,score,strand,phase,attributes
str,u32,u32,str,str,f32,str,u32,list[struct[2]]
"""NC_001422.1""",1,5386,"""region""","""RefSeq""",,"""+""",,"[{""ID"",""NC_001422.1:1..5386""}, {""Dbxref"",""taxon:2886930""}, … {""old-name"",""BACTERIOPHAGE PHI-X174""}]"
"""NC_001422.1""",51,221,"""gene""","""RefSeq""",,"""+""",,"[{""ID"",""gene-phiX174p04""}, {""Dbxref"",""GeneID:2546403""}, … {""locus_tag"",""phiX174p04""}]"
"""NC_001422.1""",51,221,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-NP_040706.1""}, {""Parent"",""gene-phiX174p04""}, … {""transl_table"",""11""}]"
"""NC_001422.1""",57,57,"""sequence_alteration""","""RefSeq""",,"""+""",,"[{""ID"",""id-phiX174p04""}, {""Dbxref"",""GeneID:2546403""}, … {""replace"",""c""}]"
"""NC_001422.1""",117,117,"""sequence_alteration""","""RefSeq""",,"""+""",,"[{""ID"",""id-phiX174p04-2""}, {""Dbxref"",""GeneID:2546403""}, … {""replace"",""a""}]"
…,…,…,…,…,…,…,…,…
"""NC_002166.1""",36211,36834,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-NP_037694.1""}, {""Parent"",""gene-HK022p52""}, … {""transl_table"",""11""}]"
"""NC_002166.1""",37267,37590,"""gene""","""RefSeq""",,"""+""",,"[{""ID"",""gene-HK022p53""}, {""Dbxref"",""GeneID:1262515""}, … {""locus_tag"",""HK022p53""}]"
"""NC_002166.1""",37267,37590,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-NP_037695.1""}, {""Parent"",""gene-HK022p53""}, … {""transl_table"",""11""}]"
"""NC_002166.1""",37574,38050,"""gene""","""RefSeq""",,"""+""",,"[{""ID"",""gene-HK022p54""}, {""Dbxref"",""GeneID:1262473""}, … {""locus_tag"",""HK022p54""}]"


In [41]:
test_df.collect_schema()

Schema([('chrom', String),
        ('start', UInt32),
        ('end', UInt32),
        ('type', String),
        ('source', String),
        ('score', Float32),
        ('strand', String),
        ('phase', UInt32),
        ('attributes', List(Struct({'tag': String, 'value': String})))])

In [None]:
# gff_test = pb.read_gff("./data/refseq/ncbi_datasets/GCF_000880055.1/genomic.gff").collect() #,attr_fields=["ID","Parent","1","2","3","4","5","6","7","8","9","10","11","12"]).collect()
# gff_test#.explode(pl.col("attributes")) #.struct.unnest())

0rows [00:00, ?rows/s]

chrom,start,end,type,source,score,strand,phase,attributes
str,u32,u32,str,str,f32,str,u32,list[struct[2]]
"""NC_010800.1""",1,27657,"""region""","""RefSeq""",,"""+""",,"[{""ID"",""NC_010800.1:1..27657""}, {""Dbxref"",""taxon:11152""}, … {""nat-host"",""turkey""}]"
"""NC_010800.1""",529,20336,"""gene""","""RefSeq""",,"""+""",,"[{""ID"",""gene-TCoV_gp01""}, {""Dbxref"",""GeneID:6353556""}, … {""locus_tag"",""TCoV_gp01""}]"
"""NC_010800.1""",529,12354,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-YP_001941164.2""}, {""Parent"",""gene-TCoV_gp01""}, … {""protein_id"",""YP_001941164.2""}]"
"""NC_010800.1""",12354,20336,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-YP_001941164.2""}, {""Parent"",""gene-TCoV_gp01""}, … {""protein_id"",""YP_001941164.2""}]"
"""NC_010800.1""",529,2547,"""mature_protein_region_of_CDS""","""RefSeq""",,"""+""",,"[{""ID"",""id-YP_001941164.2:1..673""}, {""Parent"",""cds-YP_001941164.2""}, … {""protein_id"",""YP_001941175.1""}]"
…,…,…,…,…,…,…,…,…
"""NC_010800.1""",25669,25866,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-YP_001941172.1""}, {""Parent"",""gene-TCoV_gp09""}, … {""protein_id"",""YP_001941172.1""}]"
"""NC_010800.1""",25863,26111,"""gene""","""RefSeq""",,"""+""",,"[{""ID"",""gene-TCoV_gp10""}, {""Dbxref"",""GeneID:6353561""}, … {""locus_tag"",""TCoV_gp10""}]"
"""NC_010800.1""",25863,26111,"""CDS""","""RefSeq""",,"""+""",0,"[{""ID"",""cds-YP_001941173.1""}, {""Parent"",""gene-TCoV_gp10""}, … {""protein_id"",""YP_001941173.1""}]"
"""NC_010800.1""",26054,27283,"""gene""","""RefSeq""",,"""+""",,"[{""ID"",""gene-TCoV_gp11""}, {""Dbxref"",""GeneID:6353558""}, … {""locus_tag"",""TCoV_gp11""}]"


In [None]:
# gff_test[1]["attributes"]


attributes
list[struct[2]]
"[{""ID"",""gene-TCoV_gp01""}, {""Dbxref"",""GeneID:6353556""}, … {""locus_tag"",""TCoV_gp01""}]"

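GFF coordinates are 1-based and inclusive, so turning a `start`/`end` pair into a subsequence needs an off-by-one adjustment when slicing a Python string. The helper below is a hypothetical sketch, not part of polars-dovmed: it assumes the genome is already available as a DNA-alphabet string and that a boolean flag marks the forward strand, as in the `strand` column above.

```python
# Complement table for the DNA alphabet (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(genome: str, start: int, end: int, forward: bool = True) -> str:
    """Return genome[start..end] using 1-based inclusive coordinates.

    For the reverse strand, the reverse complement of the region is returned.
    """
    if not (1 <= start <= end <= len(genome)):
        raise ValueError(f"coordinates {start}..{end} fall outside 1..{len(genome)}")
    region = genome[start - 1 : end]  # shift to 0-based, end-exclusive slicing
    if forward:
        return region
    return region.translate(_COMPLEMENT)[::-1]


# Toy genome (made up) to show the coordinate convention.
genome = "AACCGGTTAC"
print(extract_region(genome, 1, 4))
print(extract_region(genome, 1, 4, forward=False))
```

Raising on out-of-range coordinates is deliberate: LLM-extracted positions that do not fit the fetched record are more likely extraction errors than valid edge cases.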