### Making list of SRA accessions to fetch

In this script we will parse Entrez results to choose families that were not included in model training. We will only consider Illumina runs with more than 50M bases, pick one run at random for each taxon, and then randomly keep up to 20 taxa per family. We will export the results as a csv table, which will be used by a bash script later to download data using fastq-dump.

In [1]:
import pandas as pd, numpy as np, xmltodict

In [2]:
Eukarya_SRA = pd.read_csv('Eukarya_SRA.csv')
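# Parse the taxonomy XML inside a context manager so the file handle is closed afterwards
with open('Eukarya_taxonomy_clean.xml','rb') as handle:
    Eukarya_taxonomy = xmltodict.parse(handle)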

First, let's make a table relating each entry of Eukarya_taxonomy to its family and kingdom.

In [3]:
taxonomy_data = []
for tax in Eukarya_taxonomy['TaxaSet']['Taxon']:
    family = {'ScientificName':np.nan, 'TaxId':np.nan}
    kingdom = {'ScientificName':np.nan}
    for parent in tax['LineageEx']['Taxon']:
        if parent['Rank'] == 'family':
            family = parent
        if parent['Rank'] == 'kingdom':
            kingdom = parent

    taxonomy_data.append(dict(TaxID=tax['TaxId'],
                              Rank=tax['Rank'],
                              FamilyName=family['ScientificName'],
                              FamilyID=family['TaxId'],
                              Kingdom=kingdom['ScientificName']))

In [23]:
taxonomy_data = pd.DataFrame(taxonomy_data)
taxonomy_data

Unnamed: 0,TaxID,Rank,FamilyName,FamilyID,Kingdom
4,100047,species,Melanommataceae,45307,Fungi
12,100188,species,Embiotocidae,50791,Metazoa
15,10020,species,Heteromyidae,10015,Metazoa
34,1003752,species,Creediidae,270598,Metazoa
66,1007502,species,Nannizziopsiaceae,1368691,Fungi
...,...,...,...,...,...
23916,99158,species,Sarcocystidae,5809,
23920,992336,species,Oxycarenidae,1545379,Metazoa
23928,992840,species,Cordiaceae,1561080,Viridiplantae
23999,999555,species,Bonnetiaceae,125011,Viridiplantae
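One detail of the lineage loop is worth keeping in mind: xmltodict returns a list when an element has several children with the same tag, but a single dict when there is only one. The sketch below shows one way to guard against that. The helper names `as_list` and `family_and_kingdom` are hypothetical, the records are toy data rather than real Entrez output, and missing values are represented with `None` instead of `np.nan` to keep the example free of dependencies.

```python
def as_list(node):
    """Wrap node in a list unless it already is one (None becomes [])."""
    if node is None:
        return []
    return node if isinstance(node, list) else [node]


def family_and_kingdom(taxon):
    """Extract family and kingdom from one parsed <Taxon> record."""
    family = {'ScientificName': None, 'TaxId': None}
    kingdom = {'ScientificName': None}
    lineage = (taxon.get('LineageEx') or {}).get('Taxon')
    for parent in as_list(lineage):
        if parent.get('Rank') == 'family':
            family = parent
        elif parent.get('Rank') == 'kingdom':
            kingdom = parent
    return dict(TaxID=taxon['TaxId'],
                Rank=taxon['Rank'],
                FamilyName=family['ScientificName'],
                FamilyID=family['TaxId'],
                Kingdom=kingdom['ScientificName'])


# Toy records shaped like xmltodict output (made-up names and IDs).
many = {'TaxId': '1', 'Rank': 'species',
        'LineageEx': {'Taxon': [
            {'TaxId': '10', 'ScientificName': 'KingdomA', 'Rank': 'kingdom'},
            {'TaxId': '20', 'ScientificName': 'FamilyA', 'Rank': 'family'}]}}
# Only one ancestor: xmltodict would give a dict here, not a list.
single = {'TaxId': '2', 'Rank': 'species',
          'LineageEx': {'Taxon':
                        {'TaxId': '20', 'ScientificName': 'FamilyA', 'Rank': 'family'}}}

row_many = family_and_kingdom(many)
row_single = family_and_kingdom(single)
```

An alternative that may be simpler is to pass `force_list=('Taxon',)` to `xmltodict.parse`, which asks the parser to always return lists for that tag.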


Now let's read which family IDs were included in model training and remove those families from this table, so that only families the model has not seen remain.

In [24]:
taxonomy_data = (taxonomy_data
 .loc[~taxonomy_data['FamilyID'].isin(pd.read_csv('runs_to_download_data.csv')['FamilyID'].astype(str))]
)
taxonomy_data

Unnamed: 0,TaxID,Rank,FamilyName,FamilyID,Kingdom
4,100047,species,Melanommataceae,45307,Fungi
12,100188,species,Embiotocidae,50791,Metazoa
15,10020,species,Heteromyidae,10015,Metazoa
34,1003752,species,Creediidae,270598,Metazoa
66,1007502,species,Nannizziopsiaceae,1368691,Fungi
...,...,...,...,...,...
23916,99158,species,Sarcocystidae,5809,
23920,992336,species,Oxycarenidae,1545379,Metazoa
23928,992840,species,Cordiaceae,1561080,Viridiplantae
23999,999555,species,Bonnetiaceae,125011,Viridiplantae
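The `astype(str)` cast in the filter matters because IDs parsed from XML are strings, while IDs read from a CSV are usually inferred as integers. The toy example below (made-up IDs, not the real tables) illustrates what would presumably happen without the cast: `isin` finds no matches and no family is excluded.

```python
import pandas as pd

# Toy tables: parsed-XML IDs are strings, CSV IDs are integers.
taxonomy = pd.DataFrame({'TaxID': ['1', '2', '3'],
                         'FamilyID': ['10', '20', '30']})
trained = pd.DataFrame({'FamilyID': [20]})  # read_csv would infer int64 here

# Without the cast nothing matches, so no family would be excluded.
naive_mask = taxonomy['FamilyID'].isin(trained['FamilyID'])

# With the cast, family 20 is recognised as already used in training.
cast_mask = taxonomy['FamilyID'].isin(trained['FamilyID'].astype(str))

# Keep only the families that were not part of training.
held_out = taxonomy.loc[~cast_mask]
```

A quick check such as comparing the number of rows before and after the filter can confirm that the exclusion actually removed something.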


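Done, now let's keep only Illumina runs with more than 50M bases. The `bases != 'bases'` condition drops rows where the column holds the literal column name (presumably repeated header lines), so that the remaining values can be cast to integers.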

In [25]:
Eukarya_illumina = (Eukarya_SRA.
 loc[lambda x: (x['Platform']=='ILLUMINA') & (x['bases'] != 'bases')].
 loc[lambda x: x['bases'].astype(int) > 50000000])
Eukarya_illumina

Unnamed: 0,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
2,SRR23456104,2023-02-14 14:07:26,2023-02-14 14:05:15,20507732,4108229833,20507732,200,1802,,https://sra-pub-run-odp.s3.amazonaws.com/sra/S...,...,,,,,UNIVERSIDAD MIGUEL HERNANDEZ DE ELCHE,SRA1590205,,public,9F8C2DBCDDADE87A80B4911E4B0E00ED,83DBC8A95A6FFD10C4CC2820A44CB4E4
3,SRR23456103,2023-02-14 14:23:02,2023-02-14 14:09:20,28604938,5609916031,28604938,196,2474,,https://sra-pub-run-odp.s3.amazonaws.com/sra/S...,...,,,,,UNIVERSIDAD MIGUEL HERNANDEZ DE ELCHE,SRA1590205,,public,BBA52B72031251C433B69E337E25E9A8,F8B229FA37419D3B7809B1DDAFADA1F5
4,SRR23456708,2023-02-14 19:25:58,2023-02-14 17:25:31,239175255,26028423817,239175255,108,13279,,https://sra-pub-run-odp.s3.amazonaws.com/sra/S...,...,,,,,TEXAS A&M UNIVERSITY,SRA1590262,,public,9E4030F31BB1C10BFC516C2C8132F7E3,B8809611A93461D6A22DBD7291EF7A9E
5,SRR23456707,2023-02-14 16:31:07,2023-02-14 16:10:26,106897386,13867013483,106897386,129,7092,,https://sra-pub-run-odp.s3.amazonaws.com/sra/S...,...,,,,,TEXAS A&M UNIVERSITY,SRA1590262,,public,C872803832B08AE183A2B02520EE40D5,23CE303E68DC0EF9550301D5665907F7
6,SRR23456706,2023-02-14 19:25:58,2023-02-14 17:37:23,281934610,34674308367,281934610,122,17656,,https://sra-pub-run-odp.s3.amazonaws.com/sra/S...,...,,,,,TEXAS A&M UNIVERSITY,SRA1590262,,public,09BFBC6C77AA0C92AECDA90BA8A48671,F10575BCA517F35F7A9542DD5E26FFBD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
657768,SRR8873051,2019-04-09 22:53:12,2019-04-09 22:51:14,9417471,2844076242,9417471,302,889,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,JGI,SRA871723,,public,A8B46B1D317516291098503EE71A7CA6,3B5B38A56BC4EFF660E86ECE02FEB70A
657769,SRR9945544,2022-12-20 05:05:39,2019-08-11 05:57:49,316440553,63288110600,316440553,200,21469,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"KUNMING INSTITUTE OF ZOOLOGY, CHINESE ACADEMY ...",SRA937931,,public,4083EE6BA927651ADDD49C37D21308D8,6318B3B84A0ECA49D04C7EA8AA4BE2C6
657770,SRR9945545,2022-12-20 05:05:40,2019-08-11 03:35:44,52936555,10587311000,52936555,200,4089,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"KUNMING INSTITUTE OF ZOOLOGY, CHINESE ACADEMY ...",SRA937931,,public,2A5F7C57C9CE673CB993C9F2BCC85690,AC35773D2578D5A2E166064CB014DCFF
657771,SRR9945550,2022-12-20 05:05:40,2019-08-11 03:45:24,68035329,13607065800,68035329,200,4830,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"KUNMING INSTITUTE OF ZOOLOGY, CHINESE ACADEMY ...",SRA937931,,public,ECA5CC8AAFCD0BE099D75BF29F74D7EA,915EFB39F79C5F77D6161BEC40F8044A
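As a variation on the filter above, `pd.to_numeric` with `errors='coerce'` turns any non-numeric entry into NaN, which removes the need to compare against the literal string 'bases'. This is only a sketch on a toy table with made-up accessions; `MIN_BASES` is a name introduced here for the threshold.

```python
import pandas as pd

# Toy run table; the second row imitates a stray header line.
runs = pd.DataFrame({
    'Run': ['RUN_A', 'Run', 'RUN_B', 'RUN_C'],
    'Platform': ['ILLUMINA', 'Platform', 'ILLUMINA', 'PACBIO_SMRT'],
    'bases': ['60000000', 'bases', '1000', '70000000'],
})

MIN_BASES = 50_000_000  # same threshold as the notebook cell

# Non-numeric entries become NaN, and NaN > MIN_BASES evaluates to False.
bases = pd.to_numeric(runs['bases'], errors='coerce')
illumina = runs.loc[(runs['Platform'] == 'ILLUMINA') & (bases > MIN_BASES)]
```

Only `RUN_A` survives here: the header row and the PacBio run fail the platform test, and `RUN_B` has too few bases.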


Now let's find the overlap between both tables and count, for each family, the distinct taxa that have at least one qualifying run.

In [27]:
families_to_keep = (Eukarya_illumina.loc[:,['BioSample','TaxID']].
 merge(taxonomy_data, on = 'TaxID', how = 'left').
 dropna(subset=['FamilyID']).
 loc[:,['TaxID','FamilyID','FamilyName','Kingdom']].
 drop_duplicates().
 loc[:,['Kingdom','FamilyID','FamilyName']].
 value_counts().
 reset_index().
 rename(columns = {0:'count'})
)

families_to_keep 

Unnamed: 0,Kingdom,FamilyID,FamilyName,count
0,Metazoa,81368,Lampridae,2
1,Metazoa,81379,Percopsidae,2
2,Metazoa,81641,Lotidae,2
3,Metazoa,81707,Nycteribiidae,2
4,Metazoa,8184,Centropomidae,2
...,...,...,...,...
1324,Metazoa,27843,Fasciolidae,1
1325,Metazoa,27840,Mermithidae,1
1326,Metazoa,27830,Strongylidae,1
1327,Metazoa,27822,Ischnochitonidae,1
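The same count can be written with `groupby` and `nunique`, which names the count column explicitly. That may be more robust than renaming column `0`, since the name of the column produced by `value_counts().reset_index()` changed to `count` in pandas 2.0. The table below is toy data with made-up IDs and family names.

```python
import pandas as pd

# Toy merged table: two runs of taxon 1, one run each of taxa 2 and 3,
# and one run whose taxon has no family assigned.
merged = pd.DataFrame({
    'TaxID': [1, 1, 2, 3, 4],
    'FamilyID': ['10', '10', '10', '20', None],
    'FamilyName': ['Aidae', 'Aidae', 'Aidae', 'Bidae', None],
    'Kingdom': ['Metazoa', 'Metazoa', 'Metazoa', 'Fungi', None],
})

taxa_per_family = (merged
    .dropna(subset=['FamilyID'])                       # drop runs without a family
    .groupby(['Kingdom', 'FamilyID', 'FamilyName'])['TaxID']
    .nunique()                                         # distinct taxa, not runs
    .reset_index(name='count')
    .sort_values('count', ascending=False, ignore_index=True))
```

Taxon 1 has two runs but is counted once, so family 10 ends up with a count of 2 and family 20 with a count of 1.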


How many families in each kingdom?

In [28]:
families_to_keep['Kingdom'].value_counts()

Kingdom
Metazoa          909
Fungi            216
Viridiplantae    204
Name: count, dtype: int64

Ok, now let's randomly choose up to 20 taxa per family, with one accession per taxon. To do this, we first pick one accession at random for each taxon, and then randomly keep up to 20 of the resulting rows per family. Both steps use a fixed random seed, so the selection is reproducible.

In [29]:
runs_to_keep = (Eukarya_illumina.merge(taxonomy_data, on = 'TaxID', how = 'left').
                loc[lambda x: x['FamilyID'].isin(families_to_keep['FamilyID'])].
                groupby('TaxID').
                sample(n=1, random_state=2948763)['Run'])


final_runs = (Eukarya_illumina.
 merge(taxonomy_data, on = 'TaxID', how = 'left').
 loc[lambda x: x['FamilyID'].isin(families_to_keep['FamilyID'])].
 loc[lambda x: x['Run'].isin(runs_to_keep)].
 groupby('FamilyID').
 apply(lambda x: x.sample(20, random_state=87635) if len(x) > 20 else x).
 reset_index(drop=True)
)

final_runs

Unnamed: 0,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash,Rank,FamilyName,FamilyID,Kingdom
0,ERR4083945,2020-10-29 12:46:18,2020-11-07 10:58:48,10834146,3250243800,10834146,300,1312,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,MAX PLANCK INSTITUTE FOR DEVELOPMENTAL BIOLOGY,ERA2534765,,public,75BD0E3BB07DB60E5D229FB45D5377CA,9BC27E5E47A2C53A7FD4098B88B44C2F,species,Heteromyidae,10015,Metazoa
1,SRR1646414,2014-11-10 16:09:12,2014-11-10 15:53:22,132049916,26674083032,132049916,202,16620,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,BCM,SRA200536,,public,4DEFC90B06BD66C18D6CAA8B61DC7893,AA647CF62CACA7B8AB93E9A288BDF965,species,Heteromyidae,10015,Metazoa
2,SRR17013387,2022-06-25 00:21:28,2021-11-23 05:47:58,162394765,48718429500,162394765,300,14455,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,UNIVERSITE DE MONTPELLIER,SRA1333473,,public,A42BCD6176CA648E63BC511A651C1E5A,5102B66C4FE42CDEF1BED63E69B02F7E,species,Ploceidae,1002748,Metazoa
3,SRR10019913,2020-03-31 01:31:26,2019-08-26 21:31:04,199344555,19535766390,199344555,98,12479,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,B10K CONSORTIUM,SRA946809,,public,8ABDC963CFE30E3910618BB7D6BC5662,0CBA7BE1FE684EA259CB8914761F84E2,species,Ploceidae,1002748,Metazoa
4,SRR16214291,2021-10-06 01:24:08,2021-10-06 00:03:56,16616753,4985025900,16616753,300,1915,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,"KUNMING INSTITUTE OF BOTANY, CHINESE ACADEMY O...",SRA1306706,,public,FFCCE55C1D780A1A3ADE88150373BE98,5B0E344C8D69A3E600DEC1263D7A15C9,species,Strombosiaceae,1003241,Viridiplantae
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1692,SRR8245998,2019-11-28 02:41:55,2018-11-27 02:29:54,22216601,6118974105,22216601,275,2761,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,FAIRYLAKE BOTANICAL GARDEN,SRA814262,,public,0BD5F1EC8200AC3A7AD792D2F64EA8AB,F2944DE1BFFBCD0DC42EC16C0457E21E,species,Lophocoleaceae,984509,Viridiplantae
1693,SRR8246021,2019-11-28 02:41:55,2018-11-27 02:36:57,33163474,9701250030,33163474,292,3493,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,FAIRYLAKE BOTANICAL GARDEN,SRA814262,,public,7648FB8C458799BE16E2F29363D3D4D3,724367A775BE4D713CE66E1C40B12189,subspecies,Myliaceae,984539,Viridiplantae
1694,SRR6846270,2019-06-20 17:09:59,2018-03-16 05:35:03,332630019,99789005700,332630019,300,38976,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,NORTHWESTERN POLYTECHNICAL UNIVERSITY,SRA666987,,public,DBBB81E0D3DE045105F5ADD17699BD9A,E69E86E2571B82F4039C4DC3ABD91D0E,species,Antilocapridae,9889,Metazoa
1695,SRR1190496,2015-07-22 17:12:22,2015-12-15 19:00:15,54039161,10807832200,54039161,200,6995,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,UNIVERSITY OF OXFORD,SRA145727,,public,68D3435E28D423B95052C0D16185667D,E26991E50BD025202EE95364234B3DF6,species,Gracillariidae,98966,Metazoa
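The two-stage sampling can also be expressed without `apply`. The sketch below is not the cell above but an equivalent idea: after picking one run per taxon, the rows are shuffled and the first `cap` rows of each family are kept, which gives a random subset of at most `cap` taxa. The function name `sample_runs`, the single shared seed, and the toy table are assumptions made for illustration, so the selected rows would differ from the notebook's.

```python
import pandas as pd


def sample_runs(df, cap, seed):
    """One random run per taxon, then at most `cap` taxa per family."""
    # Stage 1: one run per taxon.
    one_per_taxon = df.groupby('TaxID').sample(n=1, random_state=seed)
    # Stage 2: shuffle all rows, then keep the first `cap` rows of each family.
    shuffled = one_per_taxon.sample(frac=1, random_state=seed)
    return (shuffled.groupby('FamilyID').head(cap)
            .sort_values(['FamilyID', 'TaxID'], ignore_index=True))


# Toy table with made-up accessions: family 10 has four taxa, family 20 has two.
toy = pd.DataFrame({
    'Run': [f'RUN_{i}' for i in range(8)],
    'TaxID': [1, 1, 2, 3, 4, 4, 5, 6],
    'FamilyID': ['10', '10', '10', '10', '10', '10', '20', '20'],
})
picked = sample_runs(toy, cap=3, seed=0)
```

With `cap=3`, family 10 is reduced from four taxa to three, while family 20 keeps both of its taxa.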


Just to verify, let's count the distinct taxa per family in the final table:

In [30]:
(final_runs.loc[:,['TaxID','FamilyID','FamilyName','Kingdom']].
 drop_duplicates().
 loc[:,['Kingdom','FamilyID','FamilyName']].
 value_counts().
 reset_index().
 rename(columns = {0:'count'}))

Unnamed: 0,Kingdom,FamilyID,FamilyName,count
0,Metazoa,81368,Lampridae,2
1,Metazoa,81379,Percopsidae,2
2,Metazoa,81641,Lotidae,2
3,Metazoa,81707,Nycteribiidae,2
4,Metazoa,8184,Centropomidae,2
...,...,...,...,...
1324,Metazoa,27843,Fasciolidae,1
1325,Metazoa,27840,Mermithidae,1
1326,Metazoa,27830,Strongylidae,1
1327,Metazoa,27822,Ischnochitonidae,1
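Beyond reading the table by eye, the properties the sampling is meant to guarantee can be checked directly. The helper below is a hypothetical sketch (`check_selection` is not part of the notebook) and is run on a toy table with made-up accessions.

```python
import pandas as pd


def check_selection(runs, cap=20):
    """Summarise sanity checks on a table of selected runs."""
    taxa_per_family = runs.groupby('FamilyID')['TaxID'].nunique()
    return {
        'unique_runs': bool(runs['Run'].is_unique),        # no accession listed twice
        'one_run_per_taxon': bool(runs['TaxID'].is_unique),
        'max_taxa_per_family': int(taxa_per_family.max()),
        'within_cap': bool(taxa_per_family.max() <= cap),
    }


toy = pd.DataFrame({'Run': ['RUN_A', 'RUN_B', 'RUN_C'],
                    'TaxID': [1, 2, 3],
                    'FamilyID': ['10', '10', '20']})
report = check_selection(toy, cap=2)
```

Applied to the real table, the call would be `check_selection(final_runs, cap=20)`.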


Now let's save the table with samples to download as a csv file:

In [31]:
final_runs.to_csv('runs_notincluded_to_download_data.csv',index=False)

And now let's save a simplified version of this table with just the information that we need for fastq-dump: the run accession and the family ID, comma-separated and without a header row.

In [32]:
final_runs[['Run','FamilyID']].to_csv('runs_notincluded_to_download.txt',index=False,header=False)

In [33]:
final_runs[['Run','FamilyID']]

Unnamed: 0,Run,FamilyID
0,ERR4083945,10015
1,SRR1646414,10015
2,SRR17013387,1002748
3,SRR10019913,1002748
4,SRR16214291,1003241
...,...,...
1692,SRR8245998,984509
1693,SRR8246021,984539
1694,SRR6846270,9889
1695,SRR1190496,98966
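The download itself is done by a separate bash script that is not shown here. As a rough illustration of how the two-column file could be consumed, the sketch below builds one fastq-dump argument list per row. The function name `fastq_dump_commands`, the `fastq/<FamilyID>` output layout, and the choice of the `--split-files` and `--gzip` options are assumptions, not the settings of the actual script.

```python
import csv
import io


def fastq_dump_commands(lines, out_root='fastq'):
    """Build one fastq-dump argument list per (Run, FamilyID) row."""
    commands = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        run, family_id = row[0], row[1]
        commands.append(['fastq-dump', '--split-files', '--gzip',
                         '--outdir', f'{out_root}/{family_id}', run])
    return commands


# Two rows in the same format as runs_notincluded_to_download.txt.
example = io.StringIO('ERR4083945,10015\nSRR17013387,1002748\n')
commands = fastq_dump_commands(example)
```

With sra-tools installed, each list could be passed to `subprocess.run(cmd, check=True)`; opening the real file with `open('runs_notincluded_to_download.txt', newline='')` would replace the `io.StringIO` stand-in.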
