# TCGA Mutation and Clinical Data to RDF Knowledge Graph

**Author:** Chiao-Feng Lin  
**Email:** clin at dnanexus.com  
**Date:** Oct 21 2023

## Description

This Jupyter notebook is part of the Data Management for Transformer Models hackathon project. It demonstrates how to convert The Cancer Genome Atlas (TCGA) mutation and clinical sample data into an RDF (Resource Description Framework) knowledge graph. The resulting graph can be used for semantic representation and querying of TCGA data.

## Project Details

- **Hackathon Name:** Data Management for Transformer Models
- **Team Name:** Cohort-based vcfs to Knowledge Graphs
- **Team Members:**
  - Chiao-Feng Lin (team lead)
  - Rachit Kumar
  - Soham Shirolkar 
  - Aniket Naik

## Dependencies

This notebook relies on the following Python libraries and tools:
- rdflib, pandas, requests
- A BioPortal API key

## License

This notebook is provided under the MIT license.

In [1]:
# BioPortal API calls require an API key.
# Obtain one and replace the placeholder below.
apikey = "BioPortal-API-Key"

In [2]:
pip install rdflib

Collecting rdflib
  Downloading rdflib-7.0.0-py3-none-any.whl (531 kB)
Collecting isodate<0.7.0,>=0.6.0
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.6.1 rdflib-7.0.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF, URIRef
import requests


In [4]:
def query_mondo(cancer_type_query=None):
    """Return the Mondo ID of the first BioPortal result for a cancer type, or None."""
    # BioPortal API endpoint for MONDO ontology classes
    bioportal_api_url = "https://data.bioontology.org/ontologies/MONDO/classes"

    # Query parameters; `apikey` is the BioPortal API key defined in the first cell
    params = {
        "q": cancer_type_query,  # e.g. "Colon Adenocarcinoma"
        "apikey": apikey,
    }

    # Send an HTTP GET request to the BioPortal API
    response = requests.get(bioportal_api_url, params=params, timeout=30)

    # Anything other than HTTP 200 counts as a failed lookup
    if response.status_code != 200:
        print(f"Failed to query the BioPortal API. Status code: {response.status_code}")
        return None

    # Parse the JSON response and take the first result, if any
    data = response.json()
    if "collection" in data and len(data["collection"]) > 0:
        first_result = data["collection"][0]
        # A result without an "@id" yields None rather than a placeholder string,
        # so callers never turn a placeholder into a URI
        return first_result.get("@id")

    print(f"No matching results found for '{cancer_type_query}'")
    return None
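
The response handling in `query_mondo` can be exercised without network access. The sketch below assumes a payload shaped like the one the function reads (a `collection` list whose entries carry an `@id`). The helper name `first_concept_id`, the `MONDO_PREFIX` constant, and the sample payload are hypothetical, and the prefix check is an added safeguard that is not part of the original code.

```python
from typing import Optional

# Assumed IRI prefix for Mondo classes, taken from the ID printed later in the notebook
MONDO_PREFIX = "http://purl.obolibrary.org/obo/MONDO_"


def first_concept_id(payload: dict) -> Optional[str]:
    """Return the '@id' of the first entry in payload['collection'], or None."""
    collection = payload.get("collection") or []
    if not collection:
        return None
    concept_id = collection[0].get("@id")
    # Illustrative safeguard: only accept IRIs that look like Mondo classes
    if concept_id and concept_id.startswith(MONDO_PREFIX):
        return concept_id
    return None


# Hypothetical payload for illustration only, not a recorded API response
sample_payload = {
    "collection": [{"@id": "http://purl.obolibrary.org/obo/MONDO_0001552"}]
}
print(first_concept_id(sample_payload))
print(first_concept_id({"collection": []}))
```

Returning `None` for every failure mode lets the caller decide whether to skip the sample or stop, instead of passing a placeholder string on to the graph.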
    

## Read TCGA mutation file

In [5]:
mut_df = pd.read_csv("/mnt/project/data/coad_cptac_2019/data_mutations.txt",sep="\t")
mut_df


Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,gencode_transcript_name,gencode_transcript_status,gencode_transcript_tags,gencode_transcript_type,gene_id,gene_type,havana_transcript,ref_context,secondary_variant_classification,transcript_id
0,PTPN22,26191,,GRCh37,1,114380884,114380884,+,missense_variant,Missense_Mutation,...,PTPN22-001,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000033015.1,AGCTCCAGAAAGTCAAAAGAA,,ENST00000359785
1,KIF17,57576,,GRCh37,1,21016724,21016724,+,synonymous_variant,Silent,...,KIF17-009,KNOWN,basic|appris_candidate_longest|CCDS,protein_coding,,protein_coding,OTTHUMT00000276995.1,ACAGCCTGACGTCATATGAGT,,ENST00000247986
2,C1orf167,284498,,GRCh37,1,11849446,11849446,+,synonymous_variant,Silent,...,MTHFR-001,KNOWN,basic|appris_candidate|CCDS,protein_coding,,protein_coding,OTTHUMT00000006538.1,AGGAAGCCGCCAGAGCACCGC,,ENST00000433342
3,CD1D,912,,GRCh37,1,158151458,158151458,+,missense_variant,Missense_Mutation,...,CD1D-001,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000058340.1,CGGGTTTATCGAAGCAGCTTC,,ENST00000368171
4,ZMYM1,79830,,GRCh37,1,35580705,35580705,+,missense_variant,Missense_Mutation,...,ZMYM1-001,NOVEL,alternative_5_UTR|basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000012705.1,TACCCTGCCTCGTCTTAAGAC,,ENST00000373330
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73828,LINC00283,100874057,,GRCh37,13,103399819,103399819,+,downstream_gene_variant,3'Flank,...,,,,,0.0,,,TTTCTGATAATTTTTTTTTAA,,ENST00000430111
73829,C17orf50,146853,,GRCh37,17,34095314,34095315,+,downstream_gene_variant,3'Flank,...,,,,,0.0,,,GCGCCTTCCTTGGGGGCTGTAG,,ENST00000285023
73830,MALRD1,340895,,GRCh37,10,19492731,19492731,+,upstream_gene_variant,5'Flank,...,,,,,0.0,,,TGACTGGATACGGAGCTCTCA,,ENST00000377266
73831,MYO15B,80022,,GRCh37,17,73609112,73609112,+,downstream_gene_variant,3'Flank,...,MYO15B-001,KNOWN,sequence_error|basic,processed_transcript,,protein_coding,OTTHUMT00000448172.2,GATCATGGGCGCATACCTGGT,,ENST00000583560


## Filter for mutations having ClinVar annotations

In [6]:
mut_df = mut_df[mut_df['ClinVar_TYPE'].notna()]
mut_df

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,gencode_transcript_name,gencode_transcript_status,gencode_transcript_tags,gencode_transcript_type,gene_id,gene_type,havana_transcript,ref_context,secondary_variant_classification,transcript_id
27,TP53,7157,,GRCh37,17,7577548,7577548,+,missense_variant,Missense_Mutation,...,TP53-001,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000367397.1,CGGTTCATGCCGCCCATGCAG,,ENST00000269305
65,APC,324,,GRCh37,5,112175273,112175273,+,stop_gained,Nonsense_Mutation,...,APC-201,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000250738.2,AGCAGTGTCACAGCACCCTAG,,ENST00000257430
173,INSR,3643,,GRCh37,19,7122669,7122669,+,missense_variant,Missense_Mutation,...,INSR-001,KNOWN,basic|appris_candidate_longest|CCDS,protein_coding,,protein_coding,OTTHUMT00000458544.1,GCAGTTTCTCGCTGCCAGGTC,,ENST00000302850
248,TAAR9,134860,,GRCh37,6,132859609,132859609,+,stop_lost,Nonstop_Mutation,...,TAAR9-001,KNOWN,mRNA_end_NF|cds_end_NF|basic,polymorphic_pseudogene,,polymorphic_pseudogene,OTTHUMT00000042254.2,CCTTCACTTCTAACAACTGCA,,ENST00000434551
317,TP53,7157,,GRCh37,17,7577114,7577114,+,missense_variant,Missense_Mutation,...,TP53-001,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000367397.1,AGGACAGGCACAAACACGCAC,,ENST00000269305
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72809,APC,324,,GRCh37,5,112174405,112174405,+,frameshift_variant,Frame_Shift_Del,...,APC-201,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000250738.2,AGTTGAACTCTGGAAGGCAAA,,ENST00000257430
72952,APC,324,,GRCh37,5,112175727,112175736,+,frameshift_variant,Frame_Shift_Del,...,APC-201,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000250738.2,GTTCAGAGGGTCCAGGTTCTTCCAGATGCT,,ENST00000257430
73011,PAX6,5080,,GRCh37,11,31812317,31812317,+,frameshift_variant,Frame_Shift_Del,...,PAX6-008,KNOWN,not_organism_supported|basic|appris_candidate|...,protein_coding,,protein_coding,OTTHUMT00000099293.4,CTGCATATGTGGGGGGGTGTA,,ENST00000419022
73040,OCA2,4948,,GRCh37,15,28200305,28200305,+,"frameshift_variant,splice_region_variant",Frame_Shift_Del,...,OCA2-001,KNOWN,basic|appris_principal|CCDS,protein_coding,,protein_coding,OTTHUMT00000250823.1,GCTGGGTACCTTTTTTTGGAG,Frame_Shift_Del,ENST00000354638


In [7]:
len(mut_df.loc[~mut_df['ClinVar_TYPE'].isna()]['Tumor_Sample_Barcode'].unique())

93

## Read TCGA clinical sample file

In [8]:
clinical_df = pd.read_csv("/mnt/project/data/coad_cptac_2019/data_clinical_sample.txt",sep="\t",skiprows=4)
clinical_df

Unnamed: 0,PATIENT_ID,SAMPLE_ID,SPECIMEN_PRESERVATION,SEQUENCED,COPY_NUMBER,MRNA_DATA,MICRORNA_DATA,METHYLATION_STATUS,PROTEIN,PHOSPHOPROTEIN,MSI_STATUS,PATHOLOGY_STATUS,PRIMARY_SITE,ONCOTREE_CODE,CANCER_TYPE,CANCER_TYPE_DETAILED,SOMATIC_STATUS,TMB_NONSYNONYMOUS
0,01CO001,01CO001,Frozen Tissue,1,1,1,1,0,0,0,MSS,Malignant,Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.566667
1,01CO005,01CO005,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,4.366667
2,01CO006,01CO006,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Ascending Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.700000
3,01CO008,01CO008,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Descending Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,5.166667
4,01CO013,01CO013,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.633333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,05CO014,05CO014,[Not Available],1,1,1,1,0,0,0,MSS,[Not Available],[Not Available],COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.866667
106,05CO055,05CO055,[Not Available],1,1,1,1,0,0,0,MSS,[Not Available],Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,3.600000
107,11CO059,11CO059,[Not Available],1,1,1,1,0,0,0,MSI-H,[Not Available],Ascending Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,102.033333
108,16CO012,16CO012,[Not Available],1,1,1,1,0,0,0,MSS,[Not Available],Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,3.733333


## Filter for samples having ClinVar-annotated mutations

In [9]:
clinical_df.loc[clinical_df['SAMPLE_ID'].isin(mut_df.loc[~mut_df['ClinVar_TYPE'].isna()]['Tumor_Sample_Barcode'].unique())]

Unnamed: 0,PATIENT_ID,SAMPLE_ID,SPECIMEN_PRESERVATION,SEQUENCED,COPY_NUMBER,MRNA_DATA,MICRORNA_DATA,METHYLATION_STATUS,PROTEIN,PHOSPHOPROTEIN,MSI_STATUS,PATHOLOGY_STATUS,PRIMARY_SITE,ONCOTREE_CODE,CANCER_TYPE,CANCER_TYPE_DETAILED,SOMATIC_STATUS,TMB_NONSYNONYMOUS
0,01CO001,01CO001,Frozen Tissue,1,1,1,1,0,0,0,MSS,Malignant,Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.566667
1,01CO005,01CO005,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,4.366667
3,01CO008,01CO008,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Descending Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,5.166667
4,01CO013,01CO013,Frozen Tissue,1,1,1,1,0,1,1,MSS,Malignant,Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.633333
5,01CO014,01CO014,Frozen Tissue,1,1,1,1,0,1,1,MSI-H,Malignant,Hepatix Flexure,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,33.900000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,22CO006,22CO006,Frozen Tissue,1,1,1,1,0,1,1,MSI-H,Malignant,Ascending Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,34.933333
104,05CO005,05CO005,[Not Available],1,1,1,1,0,0,0,MSS,[Not Available],Sigmoid Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,4.466667
105,05CO014,05CO014,[Not Available],1,1,1,1,0,0,0,MSS,[Not Available],[Not Available],COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,2.866667
107,11CO059,11CO059,[Not Available],1,1,1,1,0,0,0,MSI-H,[Not Available],Ascending Colon,COAD,Colorectal Cancer,Colon Adenocarcinoma,Matched,102.033333
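
The two filtering steps above (keep ClinVar-annotated mutations, then keep the clinical rows for the samples they belong to) can be traced on a toy example. The column names match the notebook; the rows, including the `"SNV"` value used for `ClinVar_TYPE`, are made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names come from the notebook
toy_mut = pd.DataFrame({
    "Tumor_Sample_Barcode": ["S1", "S1", "S2", "S3"],
    "Hugo_Symbol": ["TP53", "APC", "KRAS", "TP53"],
    "ClinVar_TYPE": ["SNV", None, None, "SNV"],
})
toy_clinical = pd.DataFrame({
    "SAMPLE_ID": ["S1", "S2", "S3", "S4"],
    "CANCER_TYPE_DETAILED": ["Colon Adenocarcinoma"] * 4,
})

# Step 1: keep mutations that carry a ClinVar annotation
annotated = toy_mut[toy_mut["ClinVar_TYPE"].notna()]

# Step 2: collect the distinct samples those mutations belong to
annotated_ids = annotated["Tumor_Sample_Barcode"].unique()

# Step 3: keep only the clinical rows for those samples
kept = toy_clinical[toy_clinical["SAMPLE_ID"].isin(annotated_ids)]

print(len(annotated_ids))
print(kept["SAMPLE_ID"].tolist())
```

Sample `S2` drops out because its only mutation has no annotation, and `S4` drops out because it has no mutations at all.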


## Make mapping between cancer type and Mondo ID

In [10]:
# Map each detailed cancer type to the Mondo ID returned by BioPortal
cancer_mondo = {}
for c in clinical_df['CANCER_TYPE_DETAILED'].unique():
    cancer_mondo[c] = query_mondo(c)

In [11]:
print(cancer_mondo)

{'Colon Adenocarcinoma': 'http://purl.obolibrary.org/obo/MONDO_0001552'}


## Load Mondo ontology

In [12]:
mondo_json = requests.get('https://github.com/monarch-initiative/mondo/releases/latest/download/mondo.json').json()
mondo_nodes = mondo_json['graphs'][0]['nodes']
mondo_nodes[0]

{'id': 'http://identifiers.org/hgnc/10001', 'lbl': 'RGS5', 'type': 'CLASS'}

In [13]:
def get_gene_mondo_id(target_lbl=None):
    """Return the ID of the first node in the Mondo JSON labeled target_lbl, or None.

    For a gene symbol such as 'SHH', the matching node carries an HGNC identifier.
    """
    for node in mondo_json['graphs'][0]['nodes']:
        if 'lbl' in node and node['lbl'] == target_lbl:
            return node['id']
    # No node with this label was found
    return None

In [14]:
get_gene_mondo_id('SHH')

'http://identifiers.org/hgnc/10848'
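
`get_gene_mondo_id` scans the full node list on every call, which adds up when it is called once per mutation row. One possible alternative, sketched below, is to build a label-to-ID dictionary once and look symbols up in it. The helper name `build_label_index` is hypothetical; the toy nodes reuse the two IDs shown in the outputs above, plus one made-up node without a label.

```python
def build_label_index(nodes):
    """Map each node label to the ID of the first node carrying it."""
    index = {}
    for node in nodes:
        label = node.get("lbl")
        # Keep the first match only, mirroring the linear scan's behavior
        if label is not None and label not in index:
            index[label] = node["id"]
    return index


# Toy nodes in the same shape as mondo_json['graphs'][0]['nodes']
toy_nodes = [
    {"id": "http://identifiers.org/hgnc/10001", "lbl": "RGS5", "type": "CLASS"},
    {"id": "http://identifiers.org/hgnc/10848", "lbl": "SHH", "type": "CLASS"},
    {"id": "http://example.org/no-label"},  # made-up node without a label
]

label_to_id = build_label_index(toy_nodes)
print(label_to_id.get("SHH"))
print(label_to_id.get("NOT_A_GENE"))
```

With the real data, the index would be built from `mondo_nodes`, and `label_to_id.get(symbol)` would stand in for `get_gene_mondo_id(symbol)`.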

## Add nodes to a graph

In [15]:
g = Graph()

# Define namespaces
tt = Namespace("http://tinytcga.org/")  # tiny TCGA namespace
mondo = Namespace("http://purl.obolibrary.org/obo/mondo#")  # Mondo namespace

# Define properties
hasHugoSymbol = tt.hasHugoSymbol
isCancerTypeOf = tt.isCancerTypeOf

# Samples that have at least one ClinVar-annotated mutation
annotated_samples = mut_df.loc[mut_df['ClinVar_TYPE'].notna(), 'Tumor_Sample_Barcode'].unique()

# Add data to the graph
for index, row in clinical_df.loc[clinical_df['SAMPLE_ID'].isin(annotated_samples)].iterrows():
    sample = URIRef(f"http://tinytcga.org/{row['SAMPLE_ID']}")

    # Link the sample to its cancer type; skip types for which no Mondo ID was found
    mondoid = cancer_mondo.get(row['CANCER_TYPE_DETAILED'])
    if mondoid is not None:
        g.add((sample, isCancerTypeOf, URIRef(mondoid)))

    # Link the sample to each mutated gene that has a node in the Mondo JSON
    for j, mrow in mut_df[mut_df['Tumor_Sample_Barcode'] == row['SAMPLE_ID']].iterrows():
        gene_mondo = get_gene_mondo_id(mrow['Hugo_Symbol'])
        if gene_mondo is not None:
            g.add((sample, hasHugoSymbol, URIRef(gene_mondo)))


## Serialize the graph to a Turtle (.ttl) file

In [16]:
g.serialize(destination="tinytcga.ttl", format="turtle")

<Graph identifier=Nd31bd351b075470382740518518f6cf4 (<class 'rdflib.graph.Graph'>)>