# 📥 Download and Filter NASA GeneLab Omics Datasets

This notebook automates the retrieval and pre‑processing of omics datasets from the NASA GeneLab Open Science Data Repository (OSDR) using the `genelab_utils` package. It supports both incremental and full updates, applies pre‑filters to reduce file size, and writes a manifest of downloaded files.

Author: Peter W. Rose, UC San Diego (pwrose.ucsd@gmail.com)

In [16]:
import pandas as pd
import genelab_utils as gl

In [17]:
MANIFEST_PATH = "../data/manifest.csv" # file to save dataset info

## Incremental vs Full Update
By default, this notebook runs an incremental update: it downloads and preprocesses only new datasets whose technology type appears in the `technology_types` list below.

If any datasets have been updated since the last run, set the `RESET` variable to `True` to run a complete update.

The downloaded datasets are saved in the "datasets" directory.

In [18]:
RESET = False # run incremental update
# RESET = True # run a complete update to refresh datasets
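
The difference between the two modes can be pictured with a small sketch. The helper `needs_download` is hypothetical and is not taken from `genelab_utils`; it only assumes that an incremental run skips files already on disk, while a reset refreshes everything.

```python
from pathlib import Path
import tempfile


def needs_download(path: Path, reset: bool) -> bool:
    """Hypothetical helper: decide whether a file should be (re)downloaded."""
    # A full update always refreshes; an incremental update skips existing files.
    return reset or not path.exists()


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "existing.csv"
    existing.write_text("placeholder")      # simulate a previously downloaded file
    missing = Path(tmp) / "missing.csv"     # simulate a dataset not yet downloaded

    incremental = [needs_download(p, reset=False) for p in (existing, missing)]
    full = [needs_download(p, reset=True) for p in (existing, missing)]

print("incremental:", incremental)
print("full:", full)
```

In this sketch the incremental run fetches only the missing file, whereas the full run fetches both.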

## Get a List of GeneLab Processed Datasets

In [19]:
dataset_info = gl.get_processed_datasets()

## Filter by Technology Type

In [20]:
technology_types = ["RNA Sequencing (RNA-Seq)", 
                    "DNA microarray", 
                    "Whole Genome Bisulfite Sequencing",
                    "Reduced-Representation Bisulfite Sequencing",
                   ]
dataset_info = gl.filter_by_technology_type(dataset_info, technology_types)
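
For readers unfamiliar with this kind of filter, the following is a minimal pandas sketch of selecting rows by technology type. It is an illustration on made-up rows, not the implementation of `gl.filter_by_technology_type`; the only assumption borrowed from the notebook is that the table has a `technology` column.

```python
import pandas as pd

# Toy table; the identifiers and the "mass spectrometry" row are invented for illustration.
toy = pd.DataFrame({
    "identifier": ["OSD-1", "OSD-2", "OSD-3"],
    "technology": ["RNA Sequencing (RNA-Seq)", "mass spectrometry", "DNA microarray"],
})

wanted = ["RNA Sequencing (RNA-Seq)", "DNA microarray"]

# Keep rows whose technology exactly matches one of the wanted values.
subset = toy[toy["technology"].isin(wanted)]
kept_ids = subset["identifier"].tolist()
print(kept_ids)
```

Exact matching with `isin` means the strings in the list must be spelled exactly as they appear in the metadata, including capitalization and parentheses.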

## Filter by Organism

In [21]:
print(f"Available organisms: {dataset_info['taxonomy'].unique()}")

Available organisms: ['7227' '10090' '6239' '3702' '9606' '1423' '' '287' '10116' '7955'
 '15368' '562' '1781' '3711' '63436' '63433' '4932' '148447' '8090']


In [22]:
taxids = {"9606": "Homo sapiens",
          # -- Rodents --
          "10090": "Mus musculus",
          "10116": "Rattus norvegicus",
          # -- Fish --
          # "7955": "Danio rerio",
          "8090": "Oryzias latipes",
          # -- Nematoda --
          # "6239": "Caenorhabditis elegans",
          # -- Insecta --
          # "7227": "Drosophila melanogaster",
          # "63436": "Leptopilina heterotoma",
          # "63433": "Leptopilina boulardi",
          # -- Bacteria --
          "562": "Escherichia coli",
          "287": "Pseudomonas aeruginosa",
          "1423": "Bacillus subtilis",
          "1781": "Mycobacterium marinum",
          "148447": "Paraburkholderia phymatum",
          # -- Fungi --
          # "4932": "Saccharomyces cerevisiae",
          # -- Plants --
          # "3711": "Brassica rapa",
          # "15368": "Brachypodium distachyon",
          # "3702": "Arabidopsis thaliana",
         }
          
dataset_info = gl.filter_by_organism(dataset_info, taxids)

In [23]:
print(f"Filtered organisms: {dataset_info['taxonomy'].unique()}")

Filtered organisms: ['10090' '9606' '1423' '287' '10116' '562' '1781' '148447' '8090']


In [24]:
dataset_info = dataset_info[["identifier", "technology", "measurement", "assay_name", "taxonomy", "organism", "material"]].copy()
dataset_info.drop_duplicates(inplace=True)
dataset_info.head()

| | identifier | technology | measurement | assay_name | taxonomy | organism | material |
|---|---|---|---|---|---|---|---|
| 279 | OSD-100 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-100_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | left eye |
| 286 | OSD-101 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-101_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | Left gastrocnemius |
| 281 | OSD-102 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-102_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | Left kidney |
| 265 | OSD-103 | Whole Genome Bisulfite Sequencing | DNA methylation profiling | OSD-103_dna-methylation-profiling_whole-genome... | 10090 | Mus musculus | Quadriceps-left |
| 272 | OSD-103 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-103_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | Quadriceps-left |


## Select Datasets to Download
The map below specifies the technology type and a substring used to identify processed files. Processed files must contain this substring.
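
As a rough illustration of this substring rule, the sketch below picks processed files from a list of candidate names. The helper `select_processed_files` and the map `example_file_types` are hypothetical and only mimic the idea; how `genelab_utils` performs the match internally is not shown in this notebook. The second candidate filename is invented.

```python
# Hypothetical map from technology type to the substring a processed file must contain.
example_file_types = {
    "DNA microarray": "differential_expression",
    "Whole Genome Bisulfite Sequencing": "differential_methylation_tiles",
}


def select_processed_files(technology, filenames, file_types):
    """Hypothetical helper: return filenames containing the substring for a technology."""
    substring = file_types.get(technology)
    if substring is None:
        return []  # unknown technology type: nothing to select
    return [name for name in filenames if substring in name]


candidates = [
    "GLDS-109_array_differential_expression_GLmicroarray.csv",
    "GLDS-109_array_normalized_expression.csv",  # invented name, should be rejected
]

selected = select_processed_files("DNA microarray", candidates, example_file_types)
print(selected)
```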

In [25]:
file_types = {"DNA microarray": "differential_expression",
              "RNA Sequencing (RNA-Seq)": "differential_expression",
              "Whole Genome Bisulfite Sequencing": "differential_methylation_tiles",
              "Reduced-Representation Bisulfite Sequencing": "differential_methylation_tiles",}

#### Define pre-filters to reduce the files to the essential data

In [26]:
def differential_expression_filter(df, threshold=0.05):
    """Keep annotated genes that are significant in at least one comparison."""
    # Drop rows without an Entrez gene ID
    filtered_df = df[df['ENTREZID'].notna() & (df['ENTREZID'].astype(str) != '')]
    # Keep only required columns
    filtered_df = filtered_df.filter(regex=r"^(ENTREZID|GENENAME|Log2fc_|Adj\.p\.value_)")
    # Keep rows with an adjusted p-value <= threshold in any comparison
    adj_pval_cols = [col for col in filtered_df.columns if col.startswith("Adj.p.value_")]
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    filtered_df = filtered_df[filtered_df[adj_pval_cols].le(threshold).any(axis=1)].copy()
    # Explode rows with multiple '|'-separated genes into one row per gene
    if "ENTREZID" in filtered_df.columns:
        filtered_df["ENTREZID"] = filtered_df["ENTREZID"].astype(str)
        filtered_df["ENTREZID"] = filtered_df["ENTREZID"].apply(lambda x: x.split('|'))
        filtered_df = filtered_df.explode('ENTREZID')
        filtered_df["ENTREZID"] = filtered_df["ENTREZID"].str.strip()
    return filtered_df
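
To make the effect of the expression pre-filter concrete, here is a toy run. The function is repeated in condensed form so the example runs on its own, and the comparison suffix `A_v_B`, the gene names, and all values are made up; real GeneLab tables are assumed to carry one `Log2fc_` and one `Adj.p.value_` column per comparison.

```python
import pandas as pd


def differential_expression_filter(df, threshold=0.05):
    # Condensed copy of the notebook's filter, repeated so this example is self-contained.
    out = df[df["ENTREZID"].notna() & (df["ENTREZID"].astype(str) != "")]
    out = out.filter(regex=r"^(ENTREZID|GENENAME|Log2fc_|Adj\.p\.value_)")
    pval_cols = [c for c in out.columns if c.startswith("Adj.p.value_")]
    out = out[out[pval_cols].le(threshold).any(axis=1)].copy()
    out["ENTREZID"] = out["ENTREZID"].astype(str).apply(lambda x: x.split("|"))
    out = out.explode("ENTREZID")
    out["ENTREZID"] = out["ENTREZID"].str.strip()
    return out


toy = pd.DataFrame({
    "ENTREZID": ["101|102", "103", None, "104"],   # row 0 maps to two genes
    "GENENAME": ["a", "b", "c", "d"],
    "Log2fc_A_v_B": [1.5, -0.2, 2.0, 0.1],
    "Adj.p.value_A_v_B": [0.01, 0.50, 0.001, 0.04],
    "SYMBOL": ["A", "B", "C", "D"],                # not a required column
})

result = differential_expression_filter(toy)
kept_ids = result["ENTREZID"].tolist()
kept_columns = list(result.columns)
print(result)
```

The row without an Entrez ID and the row with an adjusted p-value of 0.50 are dropped, the `SYMBOL` column is removed, and the `101|102` row becomes two rows, so `kept_ids` is `['101', '102', '104']`.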

In [27]:
def differential_methylation_filter(df, threshold=0.05):
    """Keep annotated tiles that are significant in at least one comparison."""
    # Drop rows without an Entrez gene ID
    filtered_df = df[df['ENTREZID'].notna() & (df['ENTREZID'].astype(str) != '')]
    # Keep only required columns
    filtered_df = filtered_df.filter(regex=r"^(ENTREZID|GENENAME|chr|start|end|dist.to.feature|prom|exon|intron|meth.diff_|qvalue_)")
    # Keep rows with a q-value <= threshold in any comparison
    qval_cols = [col for col in filtered_df.columns if col.startswith("qvalue_")]
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    filtered_df = filtered_df[filtered_df[qval_cols].le(threshold).any(axis=1)].copy()
    # Explode rows with multiple '|'-separated genes into one row per gene
    if "ENTREZID" in filtered_df.columns:
        filtered_df["ENTREZID"] = filtered_df["ENTREZID"].astype(str)
        filtered_df["ENTREZID"] = filtered_df["ENTREZID"].apply(lambda x: x.split('|'))
        filtered_df = filtered_df.explode('ENTREZID')
        filtered_df["ENTREZID"] = filtered_df["ENTREZID"].str.strip()
    return filtered_df

In [28]:
filters = {"differential_expression": differential_expression_filter,
           "differential_methylation_tiles": differential_methylation_filter}

In [29]:
manifest = gl.download_data_files(dataset_info, file_types, filters, reset=RESET)
manifest.to_csv(MANIFEST_PATH, index=False)

File already exist: GLDS-100_rna_seq_differential_expression.csv
File already exist: GLDS-101_rna_seq_differential_expression.csv
File already exist: GLDS-102_rna_seq_differential_expression.csv
File already exist: GLDS-103_Gwgbs_differential_methylation_tiles_GLMethylSeq.csv
File already exist: GLDS-103_rna_seq_differential_expression.csv
File already exist: GLDS-104_rna_seq_differential_expression.csv
File already exist: GLDS-105_Gwgbs_differential_methylation_tiles_GLMethylSeq.csv
File already exist: GLDS-105_rna_seq_differential_expression.csv
File already exist: GLDS-109_array_differential_expression_GLmicroarray.csv
File already exist: GLDS-117_array_differential_expression_GLmicroarray.csv
Downloading: GLDS-124_array_differential_expression_GLmicroarray.csv
Skipping file: GLDS-124_array_differential_expression_GLmicroarray.csv. No data after filtering.
File already exist: GLDS-125_array_differential_expression_GLmicroarray.csv
File already exist: GLDS-127_rna_seq_differential_ex
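
One practical point when the manifest is read back later: taxonomy IDs are stored as strings in this notebook, but `pd.read_csv` infers numeric types by default. The sketch below, which uses a one-row toy manifest written to a temporary directory rather than the real `MANIFEST_PATH`, shows how passing `dtype` keeps the IDs as strings.

```python
import os
import tempfile

import pandas as pd

# One-row toy manifest with a subset of the columns used in this notebook.
toy_manifest = pd.DataFrame({
    "identifier": ["OSD-100"],
    "taxonomy": ["10090"],
    "filename": ["GLDS-100_rna_seq_differential_expression.csv"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "manifest.csv")
    toy_manifest.to_csv(path, index=False)

    default_read = pd.read_csv(path)                           # taxonomy inferred as integer
    typed_read = pd.read_csv(path, dtype={"taxonomy": str})    # taxonomy kept as text

default_is_numeric = pd.api.types.is_integer_dtype(default_read["taxonomy"])
typed_value = typed_read["taxonomy"].iloc[0]
print(default_is_numeric, repr(typed_value))
```

Keeping the column as text matters if the manifest is later compared against string keys such as those in the `taxids` dictionary.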

In [30]:
manifest.head()

| | identifier | technology | measurement | assay_name | taxonomy | organism | material | filename | url |
|---|---|---|---|---|---|---|---|---|---|
| 279 | OSD-100 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-100_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | left eye | GLDS-100_rna_seq_differential_expression.csv | https://osdr.nasa.gov/geode-py/ws/studies/OSD-... |
| 286 | OSD-101 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-101_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | Left gastrocnemius | GLDS-101_rna_seq_differential_expression.csv | https://osdr.nasa.gov/geode-py/ws/studies/OSD-... |
| 281 | OSD-102 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-102_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | Left kidney | GLDS-102_rna_seq_differential_expression.csv | https://osdr.nasa.gov/geode-py/ws/studies/OSD-... |
| 265 | OSD-103 | Whole Genome Bisulfite Sequencing | DNA methylation profiling | OSD-103_dna-methylation-profiling_whole-genome... | 10090 | Mus musculus | Quadriceps-left | GLDS-103_Gwgbs_differential_methylation_tiles_... | https://osdr.nasa.gov/geode-py/ws/studies/OSD-... |
| 272 | OSD-103 | RNA Sequencing (RNA-Seq) | transcription profiling | OSD-103_transcription-profiling_rna-sequencing... | 10090 | Mus musculus | Quadriceps-left | GLDS-103_rna_seq_differential_expression.csv | https://osdr.nasa.gov/geode-py/ws/studies/OSD-... |
