# 🌱 Create Model‑Organism & Human Gene Nodes with Orthology Links

This notebook processes the GeneLab datasets to build Neo4j node and relationship files for model‑organism genes (MGene), human ortholog genes (Gene), and the orthology links between them. It uses the `genelab_utils` and `ortholog_mapper` packages to extract gene IDs, map them to human orthologs, and write CSV files ready for SPOKE ingestion.

Author: Peter W. Rose, UC San Diego (pwrose.ucsd@gmail.com)

In [15]:
import pandas as pd
import genelab_utils as gl
import ortholog_mapper

In [16]:
pd.set_option('display.max_rows', None)  # Shows all rows
pd.set_option('display.max_colwidth', None)  # Shows full content of each cell

## Setup Environment Variables
Edit `../.env` to configure the environment.

In [17]:
# Node and relationship directory paths
node_dir, rel_dir = gl.setup_environment()

Environment setup for KG version: v0.0.3


In [18]:
MANIFEST_PATH = "../data/manifest.csv" # file with dataset info

## Get Info about Available Datasets

In [19]:
manifest = pd.read_csv(MANIFEST_PATH)
manifest.head()

Unnamed: 0,identifier,technology,measurement,assay_name,taxonomy,organism,material,filename,url
0,OSD-100,RNA Sequencing (RNA-Seq),transcription profiling,OSD-100_transcription-profiling_rna-sequencing-(rna-seq),10090,Mus musculus,left eye,GLDS-100_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-100/download?source=datamanager&file=GLDS-100_rna_seq_differential_expression.csv
1,OSD-101,RNA Sequencing (RNA-Seq),transcription profiling,OSD-101_transcription-profiling_rna-sequencing-(rna-seq)_Illumina,10090,Mus musculus,Left gastrocnemius,GLDS-101_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-101/download?source=datamanager&file=GLDS-101_rna_seq_differential_expression.csv
2,OSD-102,RNA Sequencing (RNA-Seq),transcription profiling,OSD-102_transcription-profiling_rna-sequencing-(rna-seq)_Illumina HiSeq 4000,10090,Mus musculus,Left kidney,GLDS-102_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-102/download?source=datamanager&file=GLDS-102_rna_seq_differential_expression.csv
3,OSD-103,Whole Genome Bisulfite Sequencing,DNA methylation profiling,OSD-103_dna-methylation-profiling_whole-genome-bisulfite-sequencing,10090,Mus musculus,Quadriceps-left,GLDS-103_Gwgbs_differential_methylation_tiles_GLMethylSeq.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-103/download?source=datamanager&file=GLDS-103_Gwgbs_differential_methylation_tiles_GLMethylSeq.csv
4,OSD-103,RNA Sequencing (RNA-Seq),transcription profiling,OSD-103_transcription-profiling_rna-sequencing-(rna-seq),10090,Mus musculus,Quadriceps-left,GLDS-103_rna_seq_differential_expression.csv,https://osdr.nasa.gov/geode-py/ws/studies/OSD-103/download?source=datamanager&file=GLDS-103_rna_seq_differential_expression.csv
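
The manifest can hold several assays per study (OSD-103 appears above with both a methylation and an RNA-Seq file), so it can help to filter it before parsing. The following is a minimal sketch on a toy dataframe that reuses the column names and a few values from the table above; `toy_manifest`, `rna_seq`, and `datasets_per_taxonomy` are illustrative names, not part of `genelab_utils`.

```python
import pandas as pd

# Toy manifest with a subset of the columns shown above (values abbreviated).
toy_manifest = pd.DataFrame(
    {
        "identifier": ["OSD-100", "OSD-103", "OSD-103"],
        "technology": [
            "RNA Sequencing (RNA-Seq)",
            "Whole Genome Bisulfite Sequencing",
            "RNA Sequencing (RNA-Seq)",
        ],
        "taxonomy": [10090, 10090, 10090],
    }
)

# Keep only the RNA-Seq assays.
rna_seq = toy_manifest[toy_manifest["technology"] == "RNA Sequencing (RNA-Seq)"]

# Count distinct studies per taxonomy id.
datasets_per_taxonomy = rna_seq.groupby("taxonomy")["identifier"].nunique()
print(datasets_per_taxonomy)
```

The same two expressions could be applied to the real `manifest` dataframe, assuming its `technology` and `taxonomy` columns are populated as in the rows shown.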


## Create MGene (Model Organism Gene) Nodes

In [20]:
# Parse all data files for gene IDs (ENTREZID) and build a unique list of genes.
mgenes = gl.extract_gene_info(manifest)

In [21]:
mgene_nodes = gl.save_dataframe_to_kg(mgenes, 'MGene', node_dir)
print(f"Number of MGene nodes: {mgene_nodes.shape[0]}")
mgene_nodes.head()

Number of MGene nodes: 57789


Unnamed: 0,identifier,name,organism,taxonomy
0,23849,Kruppel-like factor 6,Mus musculus,10090
1,235339,dihydrolipoamide S-acetyltransferase (E2 component of pyruvate dehydrogenase complex),Mus musculus,10090
2,12444,cyclin D2,Mus musculus,10090
3,66108,NADH:ubiquinone oxidoreductase subunit A9,Mus musculus,10090
4,57278,basal cell adhesion molecule,Mus musculus,10090


## Map Model Organism Genes to Human Orthologs
Orthologs are mapped using the taxonomy id and ENTREZ gene identifier. Suggestions for suitable ortholog databases are listed below.

In [22]:
# List of taxonomies in the dataset with statistically significant data that have ENTREZ IDs
print(f"List of taxonomy ids in the datasets: {mgenes['taxonomy'].unique()}")

List of taxonomy ids in the datasets: ['10090' '9606' '10116']


In [23]:
# List of supported ortholog_dbs that have a mapping to human ENTREZ IDs
suggestions = ortholog_mapper.suggest_ortholog_dbs(mgenes, "taxonomy", "identifier")
suggestions

Unnamed: 0,taxonomy,supported_dbs
0,10090,"[Panther, HGNC, Ensembl, EggNOG, HomoloGene, PhylomeDB, Treefam, JAX, OMA, OrthoDB, NCBI, Inparanoid]"
1,10116,"[Panther, Ensembl, EggNOG, HomoloGene, PhylomeDB, Treefam, JAX, OMA, OrthoDB, NCBI, Inparanoid]"


In [24]:
# Map each (taxonomy, identifier) pair to a human ENTREZ ID, written to "human_entrez_id"
mapped_genes = ortholog_mapper.map_orthologs(
    mgenes,
    "taxonomy",
    "identifier",
    "human_entrez_id",
    ortholog_dbs=["JAX", "Ensembl"],
)
# Remove any genes that cannot be mapped to human ENTREZ IDs
mapped_genes = mapped_genes[mapped_genes["human_entrez_id"] != ""]
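
The internals of `ortholog_mapper.map_orthologs` are not shown in this notebook. As a rough illustration of mapping by taxonomy id and ENTREZ gene identifier, the sketch below performs a left join against a hypothetical lookup table and then drops unmapped rows. The `lookup` table is an assumption made for illustration; its two entries reuse mouse-to-human pairs that appear in the relationship output of this notebook, and the third gene ID is made up.

```python
import pandas as pd

# Hypothetical lookup table: (taxonomy, model-organism ENTREZ ID) -> human ENTREZ ID.
# The real ortholog databases used by ortholog_mapper are not reproduced here.
lookup = pd.DataFrame(
    {
        "taxonomy": ["10090", "10090"],
        "identifier": ["23849", "12444"],
        "human_entrez_id": ["1316", "894"],
    }
)

genes = pd.DataFrame(
    {
        "taxonomy": ["10090", "10090", "10090"],
        "identifier": ["23849", "12444", "99999999"],  # last ID is made up (no match)
    }
)

# A left join keeps every input gene; unmatched genes get a missing value.
mapped = genes.merge(lookup, on=["taxonomy", "identifier"], how="left")

# Normalize missing values to "" so the same filter as in the cell above applies.
mapped["human_entrez_id"] = mapped["human_entrez_id"].fillna("")
mapped = mapped[mapped["human_entrez_id"] != ""]
print(mapped)
```

Joining on both columns rather than on the gene identifier alone keeps the mapping tied to the source organism, which matters when several taxonomies are present in one dataframe.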

## Create Human Ortholog Gene Nodes
The mapped genes include non-protein-coding genes.

In [25]:
human_genes = mapped_genes[['human_entrez_id']].copy()
human_genes.rename(columns={'human_entrez_id': 'identifier'}, inplace=True)

In [26]:
human_gene_nodes = gl.save_dataframe_to_kg(human_genes, 'Gene', node_dir)
print(f"Number of Gene nodes: {human_gene_nodes.shape[0]}")
human_gene_nodes.head()

Number of Gene nodes: 28291


Unnamed: 0,identifier
0,1316
1,1737
2,894
3,4704
4,4059
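
The Gene node count (28291) is smaller than the number of ortholog relationships reported later (56761), which is consistent with several model-organism genes mapping to the same human gene. Whether `gl.save_dataframe_to_kg` removes duplicates itself is not shown here; the sketch below is one way to check for and remove repeated identifiers before saving, using made-up toy values.

```python
import pandas as pd

# Toy stand-in for the human_genes dataframe; "1316" is repeated on purpose.
toy_human_genes = pd.DataFrame({"identifier": ["1316", "1737", "1316", "894"]})

# Count rows whose identifier already appeared earlier in the dataframe.
n_duplicates = int(toy_human_genes["identifier"].duplicated().sum())

# Keep the first occurrence of each identifier and renumber the index.
unique_genes = toy_human_genes.drop_duplicates(subset="identifier").reset_index(drop=True)

print(f"Duplicates removed: {n_duplicates}")
print(unique_genes)
```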


## Create MGene-IS_ORTHOLOG_MGiG-Gene Relationships

In [27]:
model_to_human_genes = mapped_genes[["identifier", "human_entrez_id"]].copy()
model_to_human_genes.rename(columns={"identifier": "from", "human_entrez_id": "to"}, inplace=True)

In [28]:
model_to_human_gene_rels = gl.save_dataframe_to_kg(model_to_human_genes, 'MGene-IS_ORTHOLOG_MGiG-Gene', rel_dir)
print(f"Number of MGene-IS_ORTHOLOG_MGiG-Gene relationships: {model_to_human_gene_rels.shape[0]}")
model_to_human_gene_rels.head()

Number of MGene-IS_ORTHOLOG_MGiG-Gene relationships: 56761


Unnamed: 0,from,to
0,23849,1316
1,235339,1737
2,12444,894
3,66108,4704
4,57278,4059
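
Before ingestion it can be useful to confirm that every relationship endpoint refers to an existing node. The sketch below is a minimal check on toy data, assuming the `from` column holds MGene identifiers and the `to` column holds Gene identifiers as in the table above; the last relationship row is deliberately made up to show a dangling endpoint.

```python
import pandas as pd

# Toy node identifiers taken from the first rows of the node tables above.
mgene_ids = pd.Series(["23849", "235339", "12444"])
gene_ids = pd.Series(["1316", "1737", "894"])

# Toy relationships; the last row points to a made-up Gene identifier.
rels = pd.DataFrame(
    {
        "from": ["23849", "235339", "12444", "12444"],
        "to": ["1316", "1737", "894", "999999999"],
    }
)

# Relationships whose source or target has no matching node.
dangling_from = rels[~rels["from"].isin(mgene_ids)]
dangling_to = rels[~rels["to"].isin(gene_ids)]

# Exact duplicate edges (same source and target).
duplicate_edges = rels[rels.duplicated(subset=["from", "to"])]

print(f"Dangling sources: {len(dangling_from)}")
print(f"Dangling targets: {len(dangling_to)}")
print(f"Duplicate edges: {len(duplicate_edges)}")
```

With the real dataframes, `mgene_nodes["identifier"]` and `human_gene_nodes["identifier"]` would take the place of the toy series, provided the identifier columns share the same dtype as the relationship columns.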
