In [1]:
import pandas as pd
import requests

## Grabbing Mutational Data from cBioPortal Swagger API
cBioPortal exposes several API access points. We found a Swagger API endpoint and experimented with some of the listed examples to find a way to access mutation data. Identifying which endpoints return the data we are interested in is proving tricky, though. I'm hoping that if we can grab all mutations (variants), therapeutics, and diseases (outcomes too?) for a sample set, we can use the unique sample / patient identifiers to link everything back together and create a "Study" node in MetaKB v2.


### Grab Study Sample Lists and Identifiers

In [22]:
r = requests.get('https://www.cbioportal.org/api/sample-lists?projection=DETAILED&pageSize=10000000&pageNumber=0&direction=ASC')
print(r.status_code)
response = r.json()

200
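
The request above asks for everything in one page by setting a very large `pageSize`. If that ever becomes too slow or the server limits page sizes, the same `pageSize` / `pageNumber` parameters could be used to page through the results instead. This is only a sketch: `iter_pages` and `fetch_sample_lists_page` are hypothetical helper names, and stopping on the first short page is an assumption about how the API behaves, not something I have confirmed.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int, int], list], page_size: int = 1000) -> Iterator:
    """Yield records page by page; a page shorter than page_size ends the loop (assumed)."""
    page_number = 0
    while True:
        records = fetch_page(page_number, page_size)
        yield from records
        if len(records) < page_size:
            break
        page_number += 1


def fetch_sample_lists_page(page_number: int, page_size: int) -> list:
    """Fetch one page of sample lists from the same endpoint used above."""
    import requests  # imported lazily so iter_pages can be tried without network access

    r = requests.get(
        'https://www.cbioportal.org/api/sample-lists',
        params={
            'projection': 'DETAILED',
            'pageSize': page_size,
            'pageNumber': page_number,
            'direction': 'ASC',
        },
        timeout=60,
    )
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    return r.json()


# Usage (performs real HTTP requests, so it is left commented out):
# response = list(iter_pages(fetch_sample_lists_page, page_size=500))
```

The fetch function is passed in as an argument so the paging logic can be checked against a fake data source before pointing it at the real API.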


In [23]:
data = []
for record in response:
    row = {
        'category': record.get('category',None),
        'name': record.get('name',None),
        'sampleListId': record.get('sampleListId',None),
        'studyId': record.get('studyId',None),
        'sampleCount': record.get('sampleCount',None),
        'sampleIds': record.get('sampleIds',None)
    }
    data.append(row)

cbio_studies = pd.DataFrame(data)
cbio_studies

Unnamed: 0,category,name,sampleListId,studyId,sampleCount,sampleIds
0,all_cases_in_study,All samples,pancreas_msk_2024_all,pancreas_msk_2024,395,"[P-0000142-T01-IM3, P-0002230-T01-IM3, P-00027..."
1,all_cases_with_cna_data,Samples with CNA data,pancreas_msk_2024_cna,pancreas_msk_2024,395,"[P-0000142-T01-IM3, P-0002230-T01-IM3, P-00027..."
2,all_cases_with_mutation_and_cna_data,Samples with mutation and CNA data,pancreas_msk_2024_cnaseq,pancreas_msk_2024,395,"[P-0000142-T01-IM3, P-0002230-T01-IM3, P-00027..."
3,all_cases_with_mutation_data,Samples with mutation data,pancreas_msk_2024_sequenced,pancreas_msk_2024,395,"[P-0000142-T01-IM3, P-0002230-T01-IM3, P-00027..."
4,all_cases_with_sv_data,Samples with SV data,pancreas_msk_2024_sv,pancreas_msk_2024,395,"[P-0000142-T01-IM3, P-0002230-T01-IM3, P-00027..."
...,...,...,...,...,...,...
2543,all_cases_with_mutation_data,Samples with mutation data,thyroid_gatci_2024_sequenced,thyroid_gatci_2024,190,"[ANPT0001P, ANPT0002P, ANPT0003P, ANPT0004P, A..."
2544,all_cases_in_study,All samples,mbn_sfu_2023_all,mbn_sfu_2023,297,"[BLGSP-71-06-00001-01A, BLGSP-71-06-00002-01C,..."
2545,all_cases_with_mrna_rnaseq_data,Samples with mRNA data (RNA Seq V2),mbn_sfu_2023_rna_seq_v2_mrna,mbn_sfu_2023,297,"[BLGSP-71-06-00001-01A, BLGSP-71-06-00002-01C,..."
2546,all_cases_with_mutation_data,Samples with mutation data,mbn_sfu_2023_sequenced,mbn_sfu_2023,297,"[BLGSP-71-06-00001-01A, BLGSP-71-06-00002-01C,..."


The `studyId` and `sampleListId` columns can be used to grab mutations and other data sets in the API calls below.  
  
These are just a few of the endpoints I have looked at; there are others to explore in the Swagger UI: `https://www.cbioportal.org/api/swagger-ui/index.html#/`
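
Since the goal is to link records by sample identifier, it may help to reshape the `sampleIds` column so that each row holds a single sample rather than a list. A minimal sketch with pandas `explode`, using a toy frame shaped like `cbio_studies` (the sample lists are shortened to a few IDs taken from the `pptc_2019` output; `toy_studies` and `sample_membership` are names made up for this example):

```python
import pandas as pd

# Toy rows shaped like `cbio_studies`; the sample lists are truncated for illustration.
toy_studies = pd.DataFrame([
    {'sampleListId': 'pptc_2019_all', 'studyId': 'pptc_2019',
     'sampleIds': ['ALL-03', 'MLL-1', 'MLL-14']},
    {'sampleListId': 'pptc_2019_sequenced', 'studyId': 'pptc_2019',
     'sampleIds': ['ALL-03', 'MLL-1']},
])

# One row per (sample list, sample) pair instead of one list per row.
sample_membership = (
    toy_studies
    .explode('sampleIds')
    .rename(columns={'sampleIds': 'sampleId'})
    .reset_index(drop=True)
)

# Which sample lists contain a given sample?
lists_with_mll14 = sample_membership.loc[
    sample_membership['sampleId'] == 'MLL-14', 'sampleListId'
].tolist()
```

The long format makes it straightforward to join sample-list membership against any other table that carries a `sampleId` column.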

In [25]:
cbio_studies[cbio_studies['studyId'].str.contains('pptc_2019')]

Unnamed: 0,category,name,sampleListId,studyId,sampleCount,sampleIds
593,all_cases_with_mrna_rnaseq_data,Samples with mRNA data (RNA Seq),pptc_2019_rna_seq_mrna,pptc_2019,244,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."
594,all_cases_in_study,All samples,pptc_2019_all,pptc_2019,261,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."
595,all_cases_with_cna_data,Samples with CNA data,pptc_2019_cna,pptc_2019,252,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."
596,all_cases_with_mutation_and_cna_data,Samples with mutation and CNA data,pptc_2019_cnaseq,pptc_2019,232,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."
597,all_cases_with_mutation_and_cna_and_mrna_data,Complete samples,pptc_2019_3way_complete,pptc_2019,222,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."
598,all_cases_with_mutation_data,Samples with mutation data,pptc_2019_sequenced,pptc_2019,261,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."
599,all_cases_with_sv_data,Samples with SV data,pptc_2019_sv,pptc_2019,244,"[ALL-03, MLL-1, MLL-14, MLL-2, MLL-3, MLL-5, M..."


I can't find any solid documentation on this, but some endpoints seem to expect an ID built from the parent `studyId` plus a suffix for the data type (e.g. `pptc_2019` becomes `pptc_2019_mutations` for the mutations endpoint, where it is passed as the molecular profile ID).
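
To keep that guess in one place, the ID construction could be wrapped in a small helper. This is a sketch under the undocumented naming assumption described above (`{studyId}_mutations` for the molecular profile, `{studyId}_all` for the sample list); `mutations_url` is a hypothetical name, and a study that does not follow the pattern would simply produce a URL that fails.

```python
def mutations_url(study_id: str, sample_list_suffix: str = 'all') -> str:
    """Build the mutations URL for a study, assuming the `{studyId}_<suffix>` pattern."""
    molecular_profile_id = f'{study_id}_mutations'        # assumed pattern, not documented
    sample_list_id = f'{study_id}_{sample_list_suffix}'   # e.g. pptc_2019_all
    return (
        'https://www.cbioportal.org/api/molecular-profiles/'
        f'{molecular_profile_id}/mutations'
        f'?sampleListId={sample_list_id}&projection=SUMMARY'
    )


url = mutations_url('pptc_2019')
```

A safer variant would check that the generated `sampleListId` actually appears in the `sampleListId` column of `cbio_studies` before making the request.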

### Grab Mutations
Mutation data from each sample set seems useful for identifying variants. Let's grab it and build a dataframe that we can hopefully link to other data later using the patient IDs.

In [11]:
study_name = 'pptc_2019'
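# The mutations endpoint takes a molecular profile ID (`{study_name}_mutations`) and a sample list ID.
r = requests.get(f'https://www.cbioportal.org/api/molecular-profiles/{study_name}_mutations/mutations?sampleListId={study_name}_all&projection=SUMMARY&pageSize=10000000&pageNumber=0&direction=ASC')
r.raise_for_status()  # fail on a non-2xx response instead of parsing an error body
response = r.json()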

In [None]:
data = []
for record in response:
    row = {
        'uniqueSampleKey': record.get('uniqueSampleKey',None),
        'uniquePatientKey': record.get('uniquePatientKey',None),
        'molecularProfileId': record.get('molecularProfileId',None),
        'sampleId': record.get('sampleId',None),
        'patientId': record.get('patientId',None),
        'startPosition': record.get('startPosition',None),
        'endPosition': record.get('endPosition',None),
        'ncbiBuild': record.get('ncbiBuild',None),
        'referenceAllele': record.get('referenceAllele',None),
        'variantAllele': record.get('variantAllele',None),
        'mutationType': record.get('mutationType',None),
        'proteinChange': record.get('proteinChange',None)
    }
    data.append(row)

df = pd.DataFrame(data)
df

Unnamed: 0,uniqueSampleKey,uniquePatientKey,molecularProfileId,sampleId,patientId,startPosition,endPosition,ncbiBuild,referenceAllele,variantAllele,mutationType,proteinChange
0,QUxMLTAzOnBwdGNfMjAxOQ,UDAwMDI6cHB0Y18yMDE5,pptc_2019_mutations,ALL-03,P0002,20711345,20711345,GRCh37,G,A,Missense_Mutation,R132Q
1,QUxMLTAzOnBwdGNfMjAxOQ,UDAwMDI6cHB0Y18yMDE5,pptc_2019_mutations,ALL-03,P0002,151545824,151545824,GRCh37,G,A,Missense_Mutation,R355Q
2,QUxMLTAzOnBwdGNfMjAxOQ,UDAwMDI6cHB0Y18yMDE5,pptc_2019_mutations,ALL-03,P0002,56142751,56142751,GRCh37,C,T,Missense_Mutation,P276L
3,QUxMLTAzOnBwdGNfMjAxOQ,UDAwMDI6cHB0Y18yMDE5,pptc_2019_mutations,ALL-03,P0002,28377812,28377812,GRCh37,A,T,Missense_Mutation,L4132Q
4,QUxMLTAzOnBwdGNfMjAxOQ,UDAwMDI6cHB0Y18yMDE5,pptc_2019_mutations,ALL-03,P0002,35403920,35403920,GRCh37,G,A,Missense_Mutation,E1556K
...,...,...,...,...,...,...,...,...,...,...,...,...
42571,TkNILVdULTgtUzEzLTM3MDA6cHB0Y18yMDE5,UDAxOTE6cHB0Y18yMDE5,pptc_2019_mutations,NCH-WT-8-S13-3700,P0191,205273009,205273009,GRCh37,G,A,Nonsense_Mutation,Q486*
42572,TkNILVdULTgtUzEzLTM3MDA6cHB0Y18yMDE5,UDAxOTE6cHB0Y18yMDE5,pptc_2019_mutations,NCH-WT-8-S13-3700,P0191,41967015,41967015,GRCh37,G,A,Missense_Mutation,E812K
42573,TkNILVdULTgtUzEzLTM3MDA6cHB0Y18yMDE5,UDAxOTE6cHB0Y18yMDE5,pptc_2019_mutations,NCH-WT-8-S13-3700,P0191,34942916,34942916,GRCh37,A,G,Missense_Mutation,N301S
42574,TkNILVdULTgtUzEzLTM3MDA6cHB0Y18yMDE5,UDAxOTE6cHB0Y18yMDE5,pptc_2019_mutations,NCH-WT-8-S13-3700,P0191,19368846,19368846,GRCh37,T,G,Missense_Mutation,Y330S
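
The `uniqueSampleKey` and `uniquePatientKey` values in the table above appear to be unpadded base64 encodings of `<id>:<studyId>`, which would make them handy for linking across studies where a bare `patientId` might not be unique. A minimal sketch for decoding them: `decode_unique_key` and `split_unique_key` are hypothetical helper names, and the URL-safe base64 alphabet is an assumption (the keys shown here decode the same way with either alphabet).

```python
import base64


def decode_unique_key(key: str) -> str:
    """Decode an unpadded base64 key such as 'UDAwMDI6cHB0Y18yMDE5'."""
    padded = key + '=' * (-len(key) % 4)  # restore the stripped '=' padding
    return base64.urlsafe_b64decode(padded).decode('utf-8')


def split_unique_key(key: str) -> tuple:
    """Split a decoded key into (identifier, studyId) at the last colon."""
    identifier, _, study_id = decode_unique_key(key).rpartition(':')
    return identifier, study_id


patient = split_unique_key('UDAwMDI6cHB0Y18yMDE5')   # key from row 0 above
sample = split_unique_key('QUxMLTAzOnBwdGNfMjAxOQ')  # key from row 0 above
```

If this holds for other studies too, the decoded pair gives the same information as the `patientId` / `sampleId` columns plus the study, so the keys could serve as join columns directly.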


### Get Clinical Attributes
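Judging by the response, this endpoint returns attribute definitions (display name, datatype, and a `patientAttribute` flag) rather than the clinical values themselves.

TODO: Figure out whether this is useful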

In [15]:
study_name = 'pptc_2019'
r = requests.get(f'https://www.cbioportal.org/api/studies/{study_name}/clinical-attributes?projection=DETAILED&pageSize=10000000&pageNumber=0&direction=ASC')
r.json()

[{'displayName': 'Age',
  'description': 'Age',
  'datatype': 'NUMBER',
  'patientAttribute': True,
  'priority': '1',
  'clinicalAttributeId': 'AGE',
  'studyId': 'pptc_2019'},
 {'displayName': 'Cancer Subtype Curated',
  'description': 'Cancer Subtype Curated',
  'datatype': 'STRING',
  'patientAttribute': False,
  'priority': '5',
  'clinicalAttributeId': 'CANCER_SUBTYPE_CURATED',
  'studyId': 'pptc_2019'},
 {'displayName': 'Cancer Type',
  'description': 'Cancer Type',
  'datatype': 'STRING',
  'patientAttribute': False,
  'priority': '1',
  'clinicalAttributeId': 'CANCER_TYPE',
  'studyId': 'pptc_2019'},
 {'displayName': 'Cancer Type Detailed',
  'description': 'Cancer Type Detailed',
  'datatype': 'STRING',
  'patientAttribute': False,
  'priority': '1',
  'clinicalAttributeId': 'CANCER_TYPE_DETAILED',
  'studyId': 'pptc_2019'},
 {'displayName': 'EXPRESSION',
  'description': 'EXPRESSION',
  'datatype': 'STRING',
  'patientAttribute': False,
  'priority': '2',
  'clinicalAttribut
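
One way to get an overview of these records is to put them in a dataframe and split on the `patientAttribute` flag. The sketch below uses only the entries that are fully visible in the output above, with a subset of their fields. Reading `patientAttribute` as "patient-level rather than sample-level" is my interpretation of the field name, not something confirmed by the documentation.

```python
import pandas as pd

# Records copied from the response above (only the fully shown entries, subset of fields).
attributes = [
    {'clinicalAttributeId': 'AGE', 'datatype': 'NUMBER', 'patientAttribute': True},
    {'clinicalAttributeId': 'CANCER_SUBTYPE_CURATED', 'datatype': 'STRING', 'patientAttribute': False},
    {'clinicalAttributeId': 'CANCER_TYPE', 'datatype': 'STRING', 'patientAttribute': False},
    {'clinicalAttributeId': 'CANCER_TYPE_DETAILED', 'datatype': 'STRING', 'patientAttribute': False},
]

attr_df = pd.DataFrame(attributes)

# Assumed meaning: True = attribute describes the patient, False = describes the sample.
patient_level = attr_df.loc[attr_df['patientAttribute'], 'clinicalAttributeId'].tolist()
sample_level = attr_df.loc[~attr_df['patientAttribute'], 'clinicalAttributeId'].tolist()
```

With the real response, `pd.DataFrame(r.json())` should give the same kind of table for every attribute in the study.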

### Other Datasets
TODO: Identify whether any of the other desired data points (therapeutics, diseases, predicates, outcomes) exist on any of these endpoints. If not, figure out how to get them from cBioPortal (because I definitely see them in the general data explorer).