In [2]:
import cptac
import pandas as pd

In [3]:
br = cptac.Brca()
help(br)

Help on Brca in module cptac.cancers.brca object:

class Brca(cptac.cancers.cancer.Cancer)
 |  Brca(no_internet=False)
 |
 |  Manages BRCA (Breast Cancer) data from various sources.
 |
 |  This class extends the base Cancer class and initializes the BRCA data from
 |  a variety of sources including BCM, Broad Institute, MSSM, University of Michigan,
 |  Washington University, and a Harmonized dataset.
 |
 |  Attributes:
 |      _sources (dict): A dictionary holding data from different sources.
 |
 |  Method resolution order:
 |      Brca
 |      cptac.cancers.cancer.Cancer
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, no_internet=False)
 |      Initializes the Brca object.
 |
 |      Args:
 |          no_internet (bool): If True, the object will not attempt to download data from the internet.
 |                              Default is False.
 |
 |      Raises:
 |          ValueError: If the 'no_internet' argument is not of boolean type.
 |
 |  -------------

Using help(br) we find the following function:

get_proteomics(self, source: str = None, tissue_type: str = 'both', imputed: bool = False) -> pandas.core.frame.DataFrame
 |      Get the proteomics dataframe from the specified data source.

In [4]:
br_prot = br.get_proteomics("bcm")


Downloading BRCA_proteomics_gene_abundance_log2_reference_intensity_normalized_Tumor.txt.gz: 100%|██████████| 9.48M/9.48M [00:04<00:00, 1.91MB/s]  
Downloading gencode.v34.basic.annotation-mapping.txt.gz: 100%|██████████| 1.75M/1.75M [00:01<00:00, 1.24MB/s]


In [5]:
br._sources

{'bcm': <cptac.cancers.bcm.bcmbrca.BcmBrca at 0x122bf595280>,
 'broad': <cptac.cancers.broad.broadbrca.BroadBrca at 0x122bf597ec0>,
 'mssm': <cptac.cancers.mssm.mssm.Mssm at 0x122e1c1a5d0>,
 'umich': <cptac.cancers.umich.umichbrca.UmichBrca at 0x122e1b77050>,
 'washu': <cptac.cancers.washu.washubrca.WashuBrca at 0x122e1c1bf20>,
 'harmonized': <cptac.cancers.harmonized.harmonized.Harmonized at 0x122e1c195e0>}

The washu, mssm, and harmonized sources do not have proteomics data; bcm is the first source that does. However, bcm does not have somatic mutation data.

In [6]:
source = "umich"
br_prot = br.get_proteomics(source)

Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 100%|██████████| 15.6M/15.6M [00:06<00:00, 2.36MB/s]  
Downloading prosp-brca-all-samples.txt.gz: 100%|██████████| 3.85k/3.85k [00:00<00:00, 5.83kB/s]


Actually, it looks like no single source has both proteomic data and somatic mutation data. That would be fatal if the patient IDs did not line up across sources, so the next cell checks the overlap.

In [7]:
patient_ids = br_prot.index
mut_ids = br.get_somatic_mutation("harmonized").index
print(len(patient_ids))
print(len(mut_ids))
print(len(set(patient_ids).intersection(set(mut_ids))))

125
29017
120


I think we're okay! 120 of the 125 patient IDs in the umich proteomics data also appear in the harmonized somatic mutation data. Reportedly, allowing this kind of consistency across datasets is the whole point of cptac.
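
The 29017 printed above is the length of the mutation table's index, which is far larger than the number of patients, presumably because that table holds one row per mutation. Here is a minimal sketch on toy data (every ID and value below is made up, not taken from cptac) of counting distinct patients and listing the proteomics IDs that have no mutation rows:

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical IDs and values).
prot = pd.DataFrame(
    {"ARF5": [0.1, -1.0, 0.4]},
    index=pd.Index(["P1", "P2", "P3"], name="Patient_ID"),
)
# One row per mutation, so a patient can appear several times in the index.
mut = pd.DataFrame(
    {"Gene": ["TP53", "PIK3CA", "TP53", "GATA3"]},
    index=pd.Index(["P1", "P1", "P2", "P4"], name="Patient_ID"),
)

n_rows = len(mut.index)            # counts mutation rows, not patients
n_patients = mut.index.nunique()   # counts distinct patients

# Patients present in both tables, and proteomics patients with no mutation rows.
shared = prot.index.intersection(mut.index.unique())
missing = prot.index.difference(mut.index)

print(n_rows, n_patients)
print(sorted(shared))
print(list(missing))
```

On the real data, the same difference call would name the five proteomics patients that are absent from the mutation table.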

Next we retrieve normal_patients, which was saved from the other notebook with %store.

In [8]:
%store -r normal_patients
print(len(normal_patients))
patient_ids_of_interest = set(patient_ids).intersection(set(normal_patients))
print(len(patient_ids_of_interest))

11
11


In [9]:
%store -r all_patients
all_prot = br_prot.loc[list(all_patients)]
%store all_prot

Stored 'all_prot' (DataFrame)
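
The .loc lookup in the cell above works only if every ID in all_patients is present in the proteomics index; otherwise pandas raises a KeyError. Also, if all_patients is a set, the row order produced by list(...) is not guaranteed to be the same between sessions. A minimal sketch of a more defensive selection, on toy data (the IDs and values are hypothetical):

```python
import pandas as pd

# Toy proteomics table (hypothetical IDs and values).
prot = pd.DataFrame(
    {"ARF5": [0.1, -1.0, 0.4], "M6PR": [-0.2, -1.0, 0.0]},
    index=pd.Index(["P1", "P2", "P3"], name="Patient_ID"),
)
wanted = {"P3", "P1", "P9"}  # "P9" has no proteomics row

# Keep only the requested IDs that exist, in the table's own row order.
present = prot.index[prot.index.isin(wanted)]
subset = prot.loc[present]

# Record what was requested but not found, instead of raising KeyError.
absent = sorted(wanted - set(prot.index))

print(list(subset.index))
print(absent)
```

Filtering through the index keeps the row order stable and makes any dropped IDs explicit.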


All 11 of the patients we're interested in have proteomic data, which is great.
The code below selects just the proteomic data for those patients.

In [10]:
normal_prot = br_prot.loc[list(patient_ids_of_interest)]
normal_prot

Name,ARF5,M6PR,ESRRA,FKBP4,NDUFAF7,FUCA2,DBNDD1,SEMA3F,CFTR,CYP51A1,...,DDHD1,WIZ,GBF1,APOA5,WIZ,LDB1,WIZ,RFX7,SWSAP1,SVIL
Database_ID,ENSP00000000233.5,ENSP00000000412.3,ENSP00000000442.6,ENSP00000001008.4,ENSP00000002125.4,ENSP00000002165.5,ENSP00000002501.6,ENSP00000002829.3,ENSP00000003084.6,ENSP00000003100.8,...,ENSP00000500986.2,ENSP00000500993.1,ENSP00000501064.1,ENSP00000501141.1,ENSP00000501256.3,ENSP00000501277.1,ENSP00000501300.1,ENSP00000501317.1,ENSP00000501355.1,ENSP00000501521.1
Patient_ID,,,,,,,,,,,,,,,,,,,,,
11BR015,0.131767,-0.208966,-0.206715,1.171356,0.054821,-0.526126,-0.350226,0.826188,0.496219,-0.687686,...,1.124724,,-0.101972,,,0.601303,0.842491,0.107564,0.754653,
03BR005,-1.045496,-1.036901,0.938419,-0.516023,0.17369,-0.478226,-0.236865,0.801878,,-0.954885,...,0.235481,,-0.668006,,,0.62521,-0.276358,0.46696,,3.013091
11BR027,0.041073,-0.034179,-0.146448,-0.465687,-0.77664,0.855552,-1.430503,-0.213314,-0.749593,-0.181497,...,-0.253134,0.138083,0.276236,,,0.97093,-0.028558,0.210965,,
18BR004,0.86187,-0.113748,1.184738,0.132507,0.454913,0.401912,0.826343,0.176891,-2.142509,-0.638793,...,0.080048,,0.337514,,,0.604032,-0.61864,0.463813,1.275217,0.46121
05BR003,-0.771361,-0.532881,-0.326465,0.212867,-0.001153,-0.606051,-0.012429,-0.727154,-2.034999,-0.428694,...,-0.103391,,-0.102442,,,0.187046,-0.020788,-0.021146,0.296405,
01BR025,0.514449,-0.18117,-0.273682,0.526654,0.422711,0.675484,0.03264,-0.006112,-4.767301,-0.418856,...,-1.401882,,0.237409,,,-0.359958,0.109089,0.039473,1.250211,
03BR002,0.021738,-0.791756,-0.042158,0.142145,0.135307,0.16713,0.144323,0.813034,-1.502186,-0.453388,...,-0.288471,,-0.321458,,,-0.16186,-0.432187,-0.045841,0.297138,
18BR017,0.431638,-0.257658,,0.25614,-0.382104,0.159356,0.89182,0.146538,0.299602,0.669755,...,-0.116827,,0.38004,,0.287355,0.001492,-0.094358,-0.020082,,-0.540734
05BR001,0.235065,-0.574439,,-0.202276,-0.15692,-0.55196,0.282831,-0.762923,,-0.699721,...,0.666705,,-0.062854,,,-0.344965,-0.079027,0.199019,,-1.200344
01BR023,-0.303983,-0.114765,-0.765492,0.469224,0.333034,-0.328081,-0.472246,0.409787,,-0.23043,...,-0.320561,,-0.129917,,,0.082966,0.342258,0.183923,-0.72719,0.794657
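
The columns in the table above have two levels, Name and Database_ID, and some gene names (WIZ, for example) appear more than once with different protein IDs. A minimal sketch of working with that layout, on a toy table (the protein IDs and values are hypothetical):

```python
import pandas as pd

nan = float("nan")

# Toy table with two column levels, mirroring Name / Database_ID.
cols = pd.MultiIndex.from_tuples(
    [("ARF5", "ENSP_A"), ("WIZ", "ENSP_B"), ("WIZ", "ENSP_C")],
    names=["Name", "Database_ID"],
)
prot = pd.DataFrame(
    [[0.1, nan, 0.8], [-1.0, 0.3, nan]],
    index=pd.Index(["P1", "P2"], name="Patient_ID"),
    columns=cols,
)

# Selecting by gene name returns every column recorded for that gene.
wiz = prot["WIZ"]

# Count how many columns share each gene name.
per_gene = prot.columns.get_level_values("Name").value_counts()

# One option: join the two levels into a single unique label per column.
flat = prot.copy()
flat.columns = ["|".join(pair) for pair in prot.columns]

print(list(wiz.columns))
print(list(flat.columns))
```

Flattening is only one choice; keeping the MultiIndex and selecting by Database_ID works as well when a specific protein entry is needed.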


Now there are 12,922 columns. That is too many proteins to work with directly, so we should try to reduce the dimensionality. This is where PCA comes in: we'll use it to reduce the proteins to a small number of principal components, maybe 4 or so, and see if we can find some patterns.

In [11]:
%store normal_prot

Stored 'normal_prot' (DataFrame)
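
As a rough illustration of the PCA step mentioned above, here is a minimal sketch using NumPy's SVD on a small random table standing in for normal_prot. Dropping proteins with missing values is an assumption made here for simplicity (imputation is an alternative), and the toy shapes and names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy stand-in for normal_prot: 6 samples x 5 proteins (random values).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(6, 5)),
    index=[f"P{i}" for i in range(6)],
    columns=[f"prot{j}" for j in range(5)],
)
data.iloc[0, 1] = np.nan  # simulate a missing measurement

# Assumption: drop any protein that has a missing value.
complete = data.dropna(axis=1)

# Center each protein, then take the SVD; rows of Vt are the principal axes.
centered = complete - complete.mean(axis=0)
U, S, Vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)

# Project the samples onto the first few principal components.
n_components = 4
scores = pd.DataFrame(
    U[:, :n_components] * S[:n_components],
    index=complete.index,
    columns=[f"PC{k + 1}" for k in range(n_components)],
)

# Fraction of total variance carried by each component, largest first.
explained = S**2 / np.sum(S**2)

print(scores.shape)
print(explained.round(3))
```

With only 11 samples, centering leaves at most 10 components with nonzero variance, so asking for about 4 is well within what the data can support.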
