# 📥 Download and Filter NASA GeneLab Omics Datasets

This notebook automates the retrieval and pre-processing of omics datasets from the NASA GeneLab Open Science Data Repository (OSDR) using the `genelab_utils` package. It supports both incremental and full updates, applies pre-filters to reduce file size, and writes a manifest of downloaded files.

Author: Peter W. Rose, UC San Diego (pwrose.ucsd@gmail.com)

In [1]:
import pandas as pd
import genelab_utils as gl

In [2]:
MANIFEST_PATH = "../data/manifest.csv" # file to save dataset info

## Incremental vs Full Update
By default, this notebook runs an incremental update: it downloads and preprocesses only datasets that have not been downloaded yet and whose technology type appears in the `technology_types` list below.

If any datasets have been updated, set the `RESET` variable to `True` to run a complete update.

The downloaded datasets are saved in the "datasets" directory.

In [3]:
RESET = False # run incremental update
# RESET = True # run a complete update to refresh datasets
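
The difference between the two modes can be sketched as follows. This is a minimal illustration, not the `genelab_utils` implementation: the helper `files_to_download` and the file names are hypothetical, and the sketch assumes that an incremental update simply skips files that already exist on disk.

```python
from pathlib import Path
import tempfile


def files_to_download(filenames, data_dir, reset=False):
    """Return the files that still need to be fetched (hypothetical helper).

    Incremental mode (reset=False) skips files already present in data_dir;
    a full update (reset=True) fetches every file again.
    """
    data_dir = Path(data_dir)
    if reset:
        return list(filenames)
    return [name for name in filenames if not (data_dir / name).exists()]


# Demonstrate with a temporary directory that already holds one file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.csv").write_text("x\n1\n")
    incremental = files_to_download(["a.csv", "b.csv"], tmp, reset=False)
    full = files_to_download(["a.csv", "b.csv"], tmp, reset=True)

print(incremental)  # only the file that is missing on disk
print(full)         # every file, regardless of what is on disk
```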

## Get a List of GeneLab Processed Datasets

In [4]:
dataset_info = gl.get_processed_datasets()

https://visualization.osdr.nasa.gov/biodata/api/v2//query/metadata/?study.characteristics.organism.term%20accession%20number&investigation.study%20assays.study%20assay%20technology%20type&investigation.study%20assays.study%20assay%20measurement%20type&study.characteristics.material%20type&study.characteristics.material%20type.term%20accession%20number&file.category&file.subcategory
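
The query URL printed above is percent-encoded, which makes the requested metadata fields hard to read. As a small aid, the sketch below decodes the query string with the standard library; it only parses the printed URL and makes no request to the API.

```python
from urllib.parse import unquote, urlsplit

# Query URL copied verbatim from the output above.
url = (
    "https://visualization.osdr.nasa.gov/biodata/api/v2//query/metadata/"
    "?study.characteristics.organism.term%20accession%20number"
    "&investigation.study%20assays.study%20assay%20technology%20type"
    "&investigation.study%20assays.study%20assay%20measurement%20type"
    "&study.characteristics.material%20type"
    "&study.characteristics.material%20type.term%20accession%20number"
    "&file.category"
    "&file.subcategory"
)

# Each '&'-separated part of the query names one metadata field.
fields = [unquote(part) for part in urlsplit(url).query.split("&")]
for field in fields:
    print(field)
```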


## Filter by Study and Technology Type

In [5]:
id_accession = ["OSD-557", "OSD-568", "OSD-583", "OSD-679", "OSD-680", "OSD-681",]

technology_types = ["Micro-Computed Tomography", 
                    "Immunostaining",
                    "Microscopy",
                    "Tonometry",
                    "Ultrasonography",
                    "Optical Coherence Tomography",
                    "Magnetic Resonance Imaging",
                    "Implanted Sensor Data",
                   ]

# Omics technology types (currently disabled):
# ["RNA Sequencing (RNA-Seq)",
#  "DNA microarray",
#  "Whole Genome Bisulfite Sequencing",
#  "Reduced-Representation Bisulfite Sequencing",
# ]
dataset_info = gl.filter_by_study(dataset_info, id_accession)
dataset_info = gl.filter_by_technology_type(dataset_info, technology_types)
dataset_info.head()
print(dataset_info['identifier'])

206672    OSD-557
206713    OSD-557
206757    OSD-557
208959    OSD-568
209028    OSD-568
209029    OSD-568
230232    OSD-583
245531    OSD-679
246419    OSD-679
246643    OSD-679
247309    OSD-679
249088    OSD-680
Name: identifier, dtype: object
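
As a rough idea of what the two filters do, the sketch below assumes they reduce to membership tests on the `identifier` and `technology` columns. That is an assumption for illustration; the actual `genelab_utils` functions may differ. The toy DataFrame and its values are made up.

```python
import pandas as pd

# Toy stand-in for dataset_info; the rows are illustrative only.
toy = pd.DataFrame({
    "identifier": ["OSD-557", "OSD-557", "OSD-100", "OSD-568"],
    "technology": ["Immunostaining", "RNA Sequencing (RNA-Seq)",
                   "Microscopy", "Tonometry"],
})

studies = ["OSD-557", "OSD-568"]
technologies = ["Immunostaining", "Tonometry", "Microscopy"]

# Keep rows from the selected studies, then rows with a selected technology.
by_study = toy[toy["identifier"].isin(studies)]
by_both = by_study[by_study["technology"].isin(technologies)]
print(by_both)
```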


## Filter by Organism

In [6]:
print(f"Available organisms: {dataset_info['taxonomy'].unique()}")

Available organisms: ['10090' '10116']


In [7]:
taxids = {"9606": "Homo sapiens",
          # -- Rodents --
          "10090": "Mus musculus",
          "10116": "Rattus norvegicus",
          # -- Fish --
          "7955": "Danio rerio",
          "8090": "Oryzias latipes",
          # -- Nematoda --
           "6239": "Caenorhabditis elegans",
          # -- Insecta --
           "7227": "Drosophila melanogaster",
           "63436": "Leptopilina heterotoma",
           "63433": "Leptopilina boulardi",
          # -- Bacteria --
          "562": "Escherichia coli",
          "287": "Pseudomonas aeruginosa",
          "1423": "Bacillus subtilis",
          "1781": "Mycobacterium marinum",
          "148447": "Paraburkholderia phymatum",
          # -- Fungi --
           "4932": "Saccharomyces cerevisiae",
          # -- Plants --
           "3711": "Brassica rapa",
           "15368": "Brachypodium distachyon",
           "3702": "Arabidopsis thaliana",
         }
          
dataset_info = gl.filter_by_organism(dataset_info, taxids)

In [8]:
print(f"Filtered organisms: {dataset_info['taxonomy'].unique()}")

Filtered organisms: ['10090' '10116']
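
The organism filter can be pictured as keeping rows whose `taxonomy` value is a key of the `taxids` dictionary and attaching the organism name. The sketch below assumes exactly that behavior, which is a guess rather than the documented behavior of `gl.filter_by_organism`. The toy data, including the taxonomy ID `99999`, is made up.

```python
import pandas as pd

# Toy stand-in for dataset_info; "99999" is a made-up taxonomy ID.
toy = pd.DataFrame({
    "identifier": ["OSD-1", "OSD-2", "OSD-3"],
    "taxonomy": ["10090", "10116", "99999"],
})
toy_taxids = {"10090": "Mus musculus", "10116": "Rattus norvegicus"}

# Keep only rows with a known taxonomy ID, then add the organism name.
kept = toy[toy["taxonomy"].isin(list(toy_taxids))].copy()
kept["organism"] = kept["taxonomy"].map(toy_taxids)
print(kept)
```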


In [9]:
#dataset_info = dataset_info[["identifier", "technology", "measurement", "assay_name", "taxonomy", "organism", "material"]].copy()
#dataset_info.drop_duplicates(inplace=True)
#dataset_info.head()

## Select Datasets to Download
The map below pairs each technology type with a substring that identifies its processed files. A file counts as processed only if its name contains this substring.

In [10]:
file_types = {"Micro-Computed Tomography": "TRANSFORMED", 
                "Immunostaining": "TRANSFORMED",
                "Microscopy": "TRANSFORMED",
                "Tonometry": "TRANSFORMED",
                "Ultrasonography": "TRANSFORMED",
                "Optical Coherence Tomography": "TRANSFORMED",
                "Magnetic Resonance Imaging": "TRANSFORMED",
                "Implanted Sensor Data": "TRANSFORMED",}

# Omics file types (currently disabled):
# {"DNA microarray": "differential_expression",
#  "RNA Sequencing (RNA-Seq)": "differential_expression",
#  "Whole Genome Bisulfite Sequencing": "differential_methylation_tiles",
#  "Reduced-Representation Bisulfite Sequencing": "differential_methylation_tiles",}
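
To make the substring rule concrete, the sketch below selects files by technology type and file name. It illustrates the matching idea only and is not the selection code used by `genelab_utils`. The `SUBMITTED` and `example_` file names are hypothetical.

```python
toy_file_types = {"Immunostaining": "TRANSFORMED"}

# (technology, file name) pairs; the second and third names are hypothetical.
candidates = [
    ("Immunostaining", "LSDS-1_immunostaining_Overbey_PNA_TRANSFORMED.csv"),
    ("Immunostaining", "LSDS-1_immunostaining_Overbey_PNA_SUBMITTED.csv"),
    ("Tonometry", "example_TRANSFORMED.csv"),  # technology not in the map
]

# A file is selected when its technology is mapped and the mapped
# substring occurs in the file name.
selected = [
    name
    for technology, name in candidates
    if technology in toy_file_types and toy_file_types[technology] in name
]
print(selected)
```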

#### Define pre-filters to reduce the files to the essential data

In [11]:
# def differential_expression_filter(df, threshold=0.05):
#     filtered_df = df[df['ENTREZID'].notna() & (df['ENTREZID'].astype(str) != '')]
#     # Keep only required columns
#     filtered_df = filtered_df.filter(regex=r"^(ENTREZID|GENENAME|Log2fc_|Adj\.p\.value_)")
#     adj_pval_cols = [col for col in filtered_df.columns if col.startswith("Adj.p.value_")]
#     filtered_df = filtered_df[filtered_df[adj_pval_cols].le(threshold).any(axis=1)]
#     # Explode rows with multiple genes
#     if "ENTREZID" in filtered_df.columns:
#         filtered_df["ENTREZID"] = filtered_df["ENTREZID"].astype(str)
#         filtered_df["ENTREZID"] = filtered_df["ENTREZID"].apply(lambda x:x.split('|'))
#         filtered_df = filtered_df.explode('ENTREZID')
#         filtered_df["ENTREZID"] = filtered_df["ENTREZID"].str.strip()
#     return filtered_df

In [12]:
# def differential_methylation_filter(df, threshold=0.05):
#     filtered_df = df[df['ENTREZID'].notna() & (df['ENTREZID'].astype(str) != '')]
#     # Keep only required columns
#     filtered_df = filtered_df.filter(regex=r"^(ENTREZID|GENENAME|chr|start|end|dist.to.feature|prom|exon|intron|meth.diff_|qvalue_)")
#     qval_cols = [col for col in filtered_df.columns if col.startswith("qvalue_")]
#     filtered_df = filtered_df[filtered_df[qval_cols].le(threshold).any(axis=1)]
#     # Explode rows with multiple genes
#     if "ENTREZID" in filtered_df.columns:
#         filtered_df["ENTREZID"] = filtered_df["ENTREZID"].astype(str)
#         filtered_df["ENTREZID"] = filtered_df["ENTREZID"].apply(lambda x:x.split('|'))
#         filtered_df = filtered_df.explode('ENTREZID')
#         filtered_df["ENTREZID"] = filtered_df["ENTREZID"].str.strip()
#     return filtered_df

In [13]:
def trivial_filter(df):
    return df
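
A pre-filter here is a function that takes a DataFrame and returns a DataFrame, so `trivial_filter` passes the data through unchanged. As a hypothetical alternative that is not part of this notebook, a filter for tabular result files might drop rows and columns that are entirely empty. The name `drop_empty_filter` and the toy data are assumptions for illustration.

```python
import pandas as pd


def drop_empty_filter(df):
    """Hypothetical pre-filter: remove all-empty rows, then all-empty columns."""
    return df.dropna(axis=0, how="all").dropna(axis=1, how="all")


# Toy table with one empty row and one empty column.
toy = pd.DataFrame({
    "sample": ["GC15", None, "GC16"],
    "value": [1.5, None, 2.0],
    "unused": [None, None, None],
})

cleaned = drop_empty_filter(toy)
print(cleaned)
```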

In [14]:
filters = {#"differential_expression": differential_expression_filter,
           #"differential_methylation_tiles": differential_methylation_filter,
           "trivial_filter": trivial_filter,}

In [17]:
manifest = gl.download_data_files(dataset_info, file_types, filters, reset=RESET)
manifest.to_csv(MANIFEST_PATH, index=False)

https://osdr.nasa.gov/geode-py/ws/studies/OSD-557/download?source=datamanager&file=LSDS-1_microCT_Overbey_microCT_TRANSFORMED.csv
File already exist: OSD-557_LSDS-1_microCT_Overbey_microCT_TRANSFORMED.csv
https://osdr.nasa.gov/geode-py/ws/studies/OSD-557/download?source=datamanager&file=LSDS-1_immunostaining_Overbey_HNE_RetinaLayer_TRANSFORMED.csv
File already exist: OSD-557_LSDS-1_immunostaining_Overbey_HNE_RetinaLayer_TRANSFORMED.csv
https://osdr.nasa.gov/geode-py/ws/studies/OSD-557/download?source=datamanager&file=LSDS-1_immunostaining_Overbey_HNE_PhotoreceptorLayer_TRANSFORMED.csv
Downloading: OSD-557_LSDS-1_immunostaining_Overbey_HNE_PhotoreceptorLayer_TRANSFORMED.csv
https://osdr.nasa.gov/geode-py/ws/studies/OSD-557/download?source=datamanager&file=LSDS-1_immunostaining_Overbey_PNA_TRANSFORMED.csv
File already exist: OSD-557_LSDS-1_immunostaining_Overbey_PNA_TRANSFORMED.csv
https://osdr.nasa.gov/geode-py/ws/studies/OSD-557/download?source=datamanager&file=LSDS-1_microCT_Overbey_m

In [16]:
manifest.head()

Unnamed: 0,identifier,assay_name,sample_name,measurement,technology,material,study.characteristics.material type.term accession number,study.characteristics.organism.term accession number,file.category,file.subcategory,taxonomy,organism,filename,url
206672,OSD-557,OSD-557_molecular-cellular-imaging_immunostain...,GC15,Molecular Cellular Imaging,Immunostaining,Left eye,http://purl.org/sig/ont/fma/fma54450,http://purl.bioontology.org/ontology/NCBITAXON...,Immunostaining,Tabular Result Files,10090,Mus musculus,OSD-557_LSDS-1_microCT_Overbey_microCT_TRANSFO...,https://osdr.nasa.gov/geode-py/ws/studies/OSD-...
206672,OSD-557,OSD-557_molecular-cellular-imaging_immunostain...,GC15,Molecular Cellular Imaging,Immunostaining,Left eye,http://purl.org/sig/ont/fma/fma54450,http://purl.bioontology.org/ontology/NCBITAXON...,Immunostaining,Tabular Result Files,10090,Mus musculus,OSD-557_LSDS-1_immunostaining_Overbey_HNE_Reti...,https://osdr.nasa.gov/geode-py/ws/studies/OSD-...
206672,OSD-557,OSD-557_molecular-cellular-imaging_immunostain...,GC15,Molecular Cellular Imaging,Immunostaining,Left eye,http://purl.org/sig/ont/fma/fma54450,http://purl.bioontology.org/ontology/NCBITAXON...,Immunostaining,Tabular Result Files,10090,Mus musculus,OSD-557_LSDS-1_immunostaining_Overbey_HNE_Phot...,https://osdr.nasa.gov/geode-py/ws/studies/OSD-...
206672,OSD-557,OSD-557_molecular-cellular-imaging_immunostain...,GC15,Molecular Cellular Imaging,Immunostaining,Left eye,http://purl.org/sig/ont/fma/fma54450,http://purl.bioontology.org/ontology/NCBITAXON...,Immunostaining,Tabular Result Files,10090,Mus musculus,OSD-557_LSDS-1_immunostaining_Overbey_PNA_TRAN...,https://osdr.nasa.gov/geode-py/ws/studies/OSD-...
206713,OSD-557,OSD-557_molecular-cellular-imaging_immunostain...,GC15,Molecular Cellular Imaging,Immunostaining,Left eye,http://purl.org/sig/ont/fma/fma54450,http://purl.bioontology.org/ontology/NCBITAXON...,Immunostaining,Tabular Result Files,10090,Mus musculus,OSD-557_LSDS-1_microCT_Overbey_microCT_TRANSFO...,https://osdr.nasa.gov/geode-py/ws/studies/OSD-...
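
The preview above suggests that the same file can be listed in more than one manifest row, so a per-study count of distinct files is a useful sanity check after a run. The sketch below uses a small made-up manifest that has only the `identifier` and `filename` columns. In practice the file written to `MANIFEST_PATH` could be loaded with `pd.read_csv` instead.

```python
import io

import pandas as pd

# Toy manifest; the file names are made up for illustration.
csv_text = """identifier,filename
OSD-557,OSD-557_a_TRANSFORMED.csv
OSD-557,OSD-557_a_TRANSFORMED.csv
OSD-557,OSD-557_b_TRANSFORMED.csv
OSD-568,OSD-568_c_TRANSFORMED.csv
"""
toy_manifest = pd.read_csv(io.StringIO(csv_text))

# Count distinct files per study, ignoring repeated rows for the same file.
files_per_study = toy_manifest.groupby("identifier")["filename"].nunique()
print(files_per_study)
```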
