# 🌐 Create Study & Mission Nodes in the SPOKE‑GeneLab Knowledge Graph

This notebook reads the GeneLab dataset manifest file, extracts mission‑ and study‑level metadata, and writes Neo4j node and relationship files for integration into the SPOKE‑GeneLab Knowledge Graph via the `genelab_utils` package.

Author: Peter W. Rose, UC San Diego (pwrose.ucsd@gmail.com)

In [14]:
import pandas as pd
import genelab_utils as gl

In [15]:
pd.set_option('display.max_rows', None)  # Shows all rows
pd.set_option('display.max_colwidth', None)  # Shows full content of each cell

## Setup Environment Variables
Edit `../.env` to configure the environment.

In [16]:
# Node and relationship directory paths
node_dir, rel_dir = gl.setup_environment()

Environment setup for KG version: v0.0.3


## Validate the KG Metadata Files in the `../kg/v#.#.#/metadata` Directory

In [17]:
gl.validate_kg_metadata()

Metadata files passed the check!


## Get Info About Available Datasets

In [18]:
MANIFEST_PATH = "../data/manifest.csv"  # File with dataset info

In [19]:
manifest = pd.read_csv(MANIFEST_PATH)
manifest.head()

Unnamed: 0,identifier,technology,measurement,assay_name,taxonomy,organism,material,filename,url
0,OSD-100,RNA Sequencing (RNA-Seq),transcription profiling,OSD-100_transcription-profiling_rna-sequencing-(rna-seq),10090,Mus musculus,left eye,GLDS-100_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-100/download?source=datamanager&file=GLDS-100_rna_seq_differential_expression.csv
1,OSD-101,RNA Sequencing (RNA-Seq),transcription profiling,OSD-101_transcription-profiling_rna-sequencing-(rna-seq)_Illumina,10090,Mus musculus,Left gastrocnemius,GLDS-101_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-101/download?source=datamanager&file=GLDS-101_rna_seq_differential_expression.csv
2,OSD-102,RNA Sequencing (RNA-Seq),transcription profiling,OSD-102_transcription-profiling_rna-sequencing-(rna-seq)_Illumina HiSeq 4000,10090,Mus musculus,Left kidney,GLDS-102_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-102/download?source=datamanager&file=GLDS-102_rna_seq_differential_expression.csv
3,OSD-103,Whole Genome Bisulfite Sequencing,DNA methylation profiling,OSD-103_dna-methylation-profiling_whole-genome-bisulfite-sequencing,10090,Mus musculus,Quadriceps-left,GLDS-103_Gwgbs_differential_methylation_tiles_GLMethylSeq.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-103/download?source=datamanager&file=GLDS-103_Gwgbs_differential_methylation_tiles_GLMethylSeq.csv
4,OSD-103,RNA Sequencing (RNA-Seq),transcription profiling,OSD-103_transcription-profiling_rna-sequencing-(rna-seq),10090,Mus musculus,Quadriceps-left,GLDS-103_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-103/download?source=datamanager&file=GLDS-103_rna_seq_differential_expression.csv
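
The preview shows one row per dataset rather than one row per study: OSD-103 appears twice, once for each assay. A minimal sketch of how to list such studies is given here. It uses a toy frame whose values are copied from the preview above; `toy_manifest` is a hypothetical stand-in for the real `manifest`.

```python
import pandas as pd

# Toy stand-in for the manifest, with values copied from the preview above
toy_manifest = pd.DataFrame({
    "identifier": ["OSD-100", "OSD-101", "OSD-102", "OSD-103", "OSD-103"],
    "measurement": [
        "transcription profiling",
        "transcription profiling",
        "transcription profiling",
        "DNA methylation profiling",
        "transcription profiling",
    ],
})

# Number of dataset rows per study identifier
datasets_per_study = toy_manifest.groupby("identifier").size()

# Studies that contribute more than one dataset
multi_dataset_studies = datasets_per_study[datasets_per_study > 1]
print(multi_dataset_studies)
```

Running the same `groupby` on the real `manifest` would show how many studies carry more than one dataset, which is why later steps see repeated study and mission rows.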


## Get GeneLab Mission and Study Metadata

In [20]:
metadata = gl.get_metadata(manifest)
metadata.head()

Unnamed: 0,identifier,project_type,project_title,taxonomy,organism,flight_program,space_program,mission_id,name,start_date,end_date
0,OSD-100,Spaceflight Study,Rodent Research 1,10090,Mus musculus,International Space Station (ISS),NASA,SpaceX-4,SpaceX-4,2014-09-21,2014-10-25
1,OSD-101,Spaceflight Study,Rodent Research 1,10090,Mus musculus,International Space Station (ISS),NASA,SpaceX-4,SpaceX-4,2014-09-21,2014-10-25
2,OSD-102,Spaceflight Study,Rodent Research 1,10090,Mus musculus,International Space Station (ISS),NASA,SpaceX-4,SpaceX-4,2014-09-21,2014-10-25
3,OSD-103,Spaceflight Study,Rodent Research 1,10090,Mus musculus,International Space Station (ISS),NASA,SpaceX-4,SpaceX-4,2014-09-21,2014-10-25
4,OSD-103,Spaceflight Study,Rodent Research 1,10090,Mus musculus,International Space Station (ISS),NASA,SpaceX-4,SpaceX-4,2014-09-21,2014-10-25


## Create Mission Nodes

In [21]:
missions = metadata[["mission_id", "name", "flight_program", "space_program", "start_date", "end_date"]]
missions = missions[missions["name"] != ""].copy()
missions.rename(columns={"mission_id": "identifier"}, inplace=True)

In [22]:
mission_nodes = gl.save_dataframe_to_kg(missions, 'Mission', node_dir)
print(f"Number of Mission nodes: {mission_nodes.shape[0]}")
mission_nodes.head()

Number of Mission nodes: 22


Unnamed: 0,identifier,name,flight_program,space_program,start_date,end_date
0,SpaceX-4,SpaceX-4,International Space Station (ISS),NASA,2014-09-21,2014-10-25
13,Expedition-14,Expedition 14,International Space Station (ISS),NASA,2006-09-18,2007-04-21
14,SpaceX-8,SpaceX-8,International Space Station (ISS),NASA,2016-04-08,2016-05-11
25,SpaceX-9,SpaceX-9,International Space Station (ISS),NASA,2016-07-18,2016-08-26
35,STS-135,STS-135,Space Transportation System (STS),NASA,2011-07-08,2011-07-21
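
The index labels in this output (0, 13, 14, 25, 35) are not contiguous, which is consistent with repeated mission rows being dropped and the first occurrence kept. Whether `gl.save_dataframe_to_kg` performs this step internally is an assumption; the sketch below only illustrates the equivalent operation in plain pandas on a hypothetical toy frame. It also treats missing values the same way as empty strings, which the `!= ""` filter alone does not.

```python
import pandas as pd

# Hypothetical toy metadata: repeated missions, one empty and one missing entry
toy_metadata = pd.DataFrame({
    "mission_id": ["SpaceX-4", "SpaceX-4", "", "SpaceX-8", None],
    "name": ["SpaceX-4", "SpaceX-4", "", "SpaceX-8", None],
})

toy_missions = toy_metadata.rename(columns={"mission_id": "identifier"})

# Keep rows with a non-empty name; NaN/None is treated like an empty string
has_name = toy_missions["name"].fillna("").str.strip() != ""

# Collapse repeated missions, keeping the first occurrence (assumed behavior)
toy_missions = toy_missions[has_name].drop_duplicates(subset="identifier", keep="first")
print(toy_missions)
```

The surviving rows keep their original index labels, which mirrors the gaps seen in the output above.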


## Create Study Nodes

In [23]:
studies = metadata[["identifier", "project_title", "project_type", "organism", "taxonomy"]].copy()
studies["name"] = studies["identifier"]
studies = studies[["identifier", "name", "project_title", "project_type", "organism", "taxonomy"]]

In [24]:
study_nodes = gl.save_dataframe_to_kg(studies, 'Study', node_dir)
print(f"Number of Study nodes: {study_nodes.shape[0]}")
study_nodes.head()

Number of Study nodes: 125


Unnamed: 0,identifier,name,project_title,project_type,organism,taxonomy
0,OSD-100,OSD-100,Rodent Research 1,Spaceflight Study,Mus musculus,10090
1,OSD-101,OSD-101,Rodent Research 1,Spaceflight Study,Mus musculus,10090
2,OSD-102,OSD-102,Rodent Research 1,Spaceflight Study,Mus musculus,10090
3,OSD-103,OSD-103,Rodent Research 1,Spaceflight Study,Mus musculus,10090
5,OSD-104,OSD-104,Rodent Research 1,Spaceflight Study,Mus musculus,10090
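
Because a study can appear on several metadata rows, collapsing to one node per study is only safe if those rows agree on every attribute. The following is a minimal sketch of such a consistency check, assuming plain pandas and a hypothetical toy frame; it is not part of `genelab_utils`.

```python
import pandas as pd

# Hypothetical toy study rows; OSD-103 is repeated, as in the metadata preview
toy_studies = pd.DataFrame({
    "identifier": ["OSD-103", "OSD-103", "OSD-104"],
    "project_title": ["Rodent Research 1", "Rodent Research 1", "Rodent Research 1"],
    "organism": ["Mus musculus", "Mus musculus", "Mus musculus"],
})

# Drop rows that are identical in every column
distinct_rows = toy_studies.drop_duplicates()

# Any identifier still appearing more than once has conflicting attributes
conflicts = distinct_rows[distinct_rows.duplicated(subset="identifier", keep=False)]
print(f"Distinct study rows: {len(distinct_rows)}, conflicting rows: {len(conflicts)}")
```

An empty `conflicts` frame means each identifier maps to exactly one set of attributes.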


## Create Mission-CONDUCTED_MIcS-Study Relationships

In [25]:
mission_conducted_study = metadata[["mission_id", "identifier"]]
# Not all studies have an associated mission (e.g., ground studies)
mission_conducted_study = mission_conducted_study[mission_conducted_study["mission_id"] != ""].copy()
mission_conducted_study.rename(columns={"mission_id": "from", "identifier": "to"}, inplace=True)

In [26]:
mission_conducted_study_rels = gl.save_dataframe_to_kg(mission_conducted_study, 'Mission-CONDUCTED_MIcS-Study', rel_dir)
print(f"Number of Mission-CONDUCTED_MIcS-Study relationships: {mission_conducted_study_rels.shape[0]}")
mission_conducted_study_rels.head()

Number of Mission-CONDUCTED_MIcS-Study relationships: 74


Unnamed: 0,from,to
0,SpaceX-4,OSD-100
1,SpaceX-4,OSD-101
2,SpaceX-4,OSD-102
3,SpaceX-4,OSD-103
5,SpaceX-4,OSD-104
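
Each relationship row should point from an existing Mission node to an existing Study node. A minimal sketch of a referential check is shown here, assuming the node and relationship frames expose the `identifier`, `from`, and `to` columns seen above. The helper `find_dangling` and the mission ID `SpaceX-99` are hypothetical and serve only to illustrate a failing row.

```python
import pandas as pd

# Hypothetical toy node and relationship frames
toy_mission_nodes = pd.DataFrame({"identifier": ["SpaceX-4"]})
toy_study_nodes = pd.DataFrame({"identifier": ["OSD-100", "OSD-101"]})
toy_rels = pd.DataFrame({
    "from": ["SpaceX-4", "SpaceX-4", "SpaceX-99"],  # SpaceX-99 is made up
    "to": ["OSD-100", "OSD-101", "OSD-100"],
})


def find_dangling(rels, from_nodes, to_nodes):
    """Return relationship rows whose endpoints are missing from the node tables."""
    bad_from = ~rels["from"].isin(from_nodes["identifier"])
    bad_to = ~rels["to"].isin(to_nodes["identifier"])
    return rels[bad_from | bad_to]


dangling = find_dangling(toy_rels, toy_mission_nodes, toy_study_nodes)
print(dangling)
```

Applied to the real frames, the call would be `find_dangling(mission_conducted_study_rels, mission_nodes, study_nodes)`; an empty result indicates that every relationship has both endpoints.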
