In [1]:
import cptac
import pandas as pd

In [2]:
br = cptac.Brca()
help(br)

Help on Brca in module cptac.cancers.brca object:

class Brca(cptac.cancers.cancer.Cancer)
 |  Brca(no_internet=False)
 |  
 |  Manages BRCA (Breast Cancer) data from various sources.
 |  
 |  This class extends the base Cancer class and initializes the BRCA data from 
 |  a variety of sources including BCM, Broad Institute, MSSM, University of Michigan, 
 |  Washington University, and a Harmonized dataset. 
 |  
 |  Attributes:
 |      _sources (dict): A dictionary holding data from different sources.
 |  
 |  Method resolution order:
 |      Brca
 |      cptac.cancers.cancer.Cancer
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, no_internet=False)
 |      Initializes the Brca object.
 |      
 |      Args:
 |          no_internet (bool): If True, the object will not attempt to download data from the internet. 
 |                              Default is False.
 |      
 |      Raises:
 |          ValueError: If the 'no_internet' argument is not of boolea

Scrolling through the help(br) output, we find the following method:

get_proteomics(self, source: str = None, tissue_type: str = 'both', imputed: bool = False) -> pandas.core.frame.DataFrame
 |      Get the proteomics dataframe from the specified data source.

In [8]:
br_prot = br.get_proteomics("bcm")


Downloading BRCA_proteomics_gene_abundance_log2_reference_intensity_normalized_Tumor.txt.gz: 100%|██████████| 9.48M/9.48M [00:09<00:00, 1.02MB/s]  
Downloading gencode.v34.basic.annotation-mapping.txt.gz: 100%|██████████| 1.75M/1.75M [00:02<00:00, 610kB/s] 


In [6]:
br._sources

{'bcm': <cptac.cancers.bcm.bcmbrca.BcmBrca at 0x1186d0d90>,
 'broad': <cptac.cancers.broad.broadbrca.BroadBrca at 0x10ebed6d0>,
 'mssm': <cptac.cancers.mssm.mssm.Mssm at 0x15610ac90>,
 'umich': <cptac.cancers.umich.umichbrca.UmichBrca at 0x157f2fd90>,
 'washu': <cptac.cancers.washu.washubrca.WashuBrca at 0x158043690>,
 'harmonized': <cptac.cancers.harmonized.harmonized.Harmonized at 0x158043d10>}

The washu, mssm, and harmonized sources did not have proteomics data; bcm is the first one that does. However, bcm doesn't have somatic mutation data.
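Rather than trying each source by hand, the check can be looped. This is a minimal sketch: sources_with_data is a hypothetical helper, not part of cptac, and it assumes that asking a source for a data type it lacks raises an exception. The exact exception class isn't shown above, so the sketch catches Exception broadly. The fake getter is only a stand-in so the example runs without downloading anything.

```python
def sources_with_data(getter, sources):
    """Return {source: True/False} for whether getter(source) succeeds.

    Hypothetical helper, not part of cptac. Assumes a missing data type
    surfaces as an exception raised by the getter.
    """
    available = {}
    for source in sources:
        try:
            getter(source)
            available[source] = True
        except Exception:  # broad on purpose: the real error type is an assumption
            available[source] = False
    return available


# Stand-in getter so the sketch runs offline; it is not the cptac method.
def fake_get_proteomics(source):
    if source not in {"bcm", "umich"}:
        raise ValueError(f"no proteomics in {source}")
    return f"proteomics from {source}"


demo = sources_with_data(fake_get_proteomics, ["bcm", "mssm", "umich"])
print(demo)
```

With the real object this would presumably be called as sources_with_data(br.get_proteomics, br._sources), and again with br.get_somatic_mutation. Each successful call may trigger a download.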

In [34]:
source = "umich"
br_prot = br.get_proteomics(source)

Actually, it looks like no single source has both proteomic data and somatic mutation data. That would be fatal if patient IDs can't be matched across sources. We may be cooked.

In [33]:
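patient_ids = br_prot.index  # IDs in the proteomics table
mut_ids = br.get_somatic_mutation("harmonized").index  # IDs in the harmonized mutation table
print(len(patient_ids))
print(len(mut_ids))  # counts rows, which need not equal the number of distinct patients
print(len(set(patient_ids).intersection(set(mut_ids))))  # distinct IDs present in both tables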

Downloading PanCan_Union_Maf_Broad_WashU_v1.1.maf.gz: 100%|██████████| 138M/138M [00:54<00:00, 2.53MB/s]    


125
29017
120


ACTUALLY: I think we're okay! 120 of the 125 patient IDs in the proteomics data also appear in the harmonized somatic mutation data, and according to chat, allowing that kind of consistency across datasets is the whole point of cptac. So, in other words, #print("yay!")
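The 29017 printed above is a row count, which suggests the mutation table's index repeats patient IDs (presumably one row per mutation, though that is an assumption). A small sketch of the same overlap check done on unique IDs, using toy tables in place of the real cptac data; the values, the gene labels, and the ID 99XX999 are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: one row per patient for proteomics, repeated IDs for mutations.
prot = pd.DataFrame(
    {"ARF5": [0.2, 0.0, -0.3]},
    index=["05BR001", "11BR027", "01BR023"],
)
mut = pd.DataFrame(
    {"Gene": ["GENE_A", "GENE_B", "GENE_A", "GENE_C"]},
    index=["05BR001", "05BR001", "11BR027", "99XX999"],
)

prot_ids = prot.index.unique()
mut_ids = mut.index.unique()              # collapse repeated patient IDs
shared = prot_ids.intersection(mut_ids)   # IDs present in both tables

print(len(prot_ids), len(mut_ids), len(shared))
```

Comparing unique counts makes the three numbers directly comparable, since each one then counts patients rather than rows.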

We retrieve the patients_without_driver_mutations variable that was stored from the other notebook, then intersect it with the proteomics patient IDs.

In [None]:
%store -r patients_without_driver_mutations
print(len(patients_without_driver_mutations))
patient_ids_of_interest = set(patient_ids).intersection(set(patients_without_driver_mutations))
print(len(patient_ids_of_interest))

23
13


It seems we don't have proteomic data for all of the patients of interest: only 13 of the 23 patients without driver mutations appear in the proteomics table. That should be okay. If not, I wonder if we can get the rest from another source.
The code below selects the proteomic data for just those patients of interest.

In [48]:
br_prot.loc[list(patient_ids_of_interest)]

Name,ARF5,M6PR,ESRRA,FKBP4,NDUFAF7,FUCA2,DBNDD1,SEMA3F,CFTR,CYP51A1,...,DDHD1,WIZ,GBF1,APOA5,WIZ,LDB1,WIZ,RFX7,SWSAP1,SVIL
Database_ID,ENSP00000000233.5,ENSP00000000412.3,ENSP00000000442.6,ENSP00000001008.4,ENSP00000002125.4,ENSP00000002165.5,ENSP00000002501.6,ENSP00000002829.3,ENSP00000003084.6,ENSP00000003100.8,...,ENSP00000500986.2,ENSP00000500993.1,ENSP00000501064.1,ENSP00000501141.1,ENSP00000501256.3,ENSP00000501277.1,ENSP00000501300.1,ENSP00000501317.1,ENSP00000501355.1,ENSP00000501521.1
Patient_ID,,,,,,,,,,,,,,,,,,,,,
05BR001,0.235065,-0.574439,,-0.202276,-0.15692,-0.55196,0.282831,-0.762923,,-0.699721,...,0.666705,,-0.062854,,,-0.344965,-0.079027,0.199019,,-1.200344
11BR027,0.041073,-0.034179,-0.146448,-0.465687,-0.77664,0.855552,-1.430503,-0.213314,-0.749593,-0.181497,...,-0.253134,0.138083,0.276236,,,0.97093,-0.028558,0.210965,,
01BR023,-0.303983,-0.114765,-0.765492,0.469224,0.333034,-0.328081,-0.472246,0.409787,,-0.23043,...,-0.320561,,-0.129917,,,0.082966,0.342258,0.183923,-0.72719,0.794657
11BR015,0.131767,-0.208966,-0.206715,1.171356,0.054821,-0.526126,-0.350226,0.826188,0.496219,-0.687686,...,1.124724,,-0.101972,,,0.601303,0.842491,0.107564,0.754653,
03BR002,0.021738,-0.791756,-0.042158,0.142145,0.135307,0.16713,0.144323,0.813034,-1.502186,-0.453388,...,-0.288471,,-0.321458,,,-0.16186,-0.432187,-0.045841,0.297138,
18BR017,0.431638,-0.257658,,0.25614,-0.382104,0.159356,0.89182,0.146538,0.299602,0.669755,...,-0.116827,,0.38004,,0.287355,0.001492,-0.094358,-0.020082,,-0.540734
05BR003,-0.771361,-0.532881,-0.326465,0.212867,-0.001153,-0.606051,-0.012429,-0.727154,-2.034999,-0.428694,...,-0.103391,,-0.102442,,,0.187046,-0.020788,-0.021146,0.296405,
11BR076,0.088269,-0.037398,0.211992,-0.471725,0.253513,0.03956,0.363207,0.716435,1.312877,0.010232,...,0.329388,,0.240846,,,-0.181492,-0.094372,0.348717,0.116377,
11BR057,0.919826,0.471811,0.461863,0.579499,0.632575,1.66594,0.250632,1.588594,0.261681,0.156741,...,-1.195099,,0.466116,,,0.117651,-0.433174,-1.214647,0.030587,
11BR049,-0.407304,0.246336,,0.198524,-0.392402,-0.055206,-0.010324,-0.738146,0.67103,-0.200922,...,-0.20191,,0.087507,,0.099454,0.366716,-0.235007,0.368724,,-0.924415
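Because the intersection is taken before selecting, patients with no proteomics row drop out silently. If we do end up looking for them in another source, it helps to list them explicitly. A minimal sketch on toy data; prot and patients_of_interest are stand-ins for br_prot and patients_without_driver_mutations, and the ID 00XX000 is made up.

```python
import pandas as pd

# Toy stand-ins for the real proteomics table and the stored patient list.
prot = pd.DataFrame(
    {"ARF5": [0.2, 0.0, -0.3], "M6PR": [-0.6, 0.0, -0.1]},
    index=pd.Index(["05BR001", "11BR027", "01BR023"], name="Patient_ID"),
)
patients_of_interest = ["11BR027", "05BR001", "00XX000"]  # last ID is made up

wanted = pd.Index(patients_of_interest)
present = wanted.intersection(prot.index)   # IDs that have a proteomics row
missing = wanted.difference(prot.index)     # IDs that would need another source

subset = prot.loc[present]
print(subset)
print("missing:", list(missing))
```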


Now there are 12,922 columns. We can't deal with that many proteins directly, so we should reduce the dimensionality. This is where PCA comes in: we'll use it to reduce the proteins to a small number of principal components, maybe 4 or so, and see if any patterns emerge.
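As a rough sketch of that step, here is PCA done directly with numpy's SVD on a toy table (not with a dedicated library, and not on the real data). Two things are assumptions: pca_scores is a hypothetical helper name, and proteins with any missing value are simply dropped, which is only one of several ways to handle the NaNs visible in the table above.

```python
import numpy as np
import pandas as pd


def pca_scores(df, n_components=4):
    """Project the rows of df onto the top principal components via SVD."""
    complete = df.dropna(axis=1)            # assumption: drop proteins with any NaN
    centered = complete - complete.mean()   # center each protein column
    u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    explained = (s ** 2) / (s ** 2).sum()             # variance fraction per PC
    cols = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=cols), explained[:n_components]


# Toy data: 13 samples by 50 "proteins" of random noise, with one missing value.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(13, 50)), index=[f"P{i}" for i in range(13)])
toy.iloc[0, 3] = np.nan

scores, explained = pca_scores(toy, n_components=4)
print(scores.shape)
print(explained.round(3))
```

With only 13 patients there can be at most 12 components carrying any variance after centering, so 4 is a reasonable starting point to inspect.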