## Part 2: Compute mRNA-protein correlation for new tumour studies

**Input:** Transcriptomics and Proteomics data of the tumour studies listed below.
1. <a href=https://tinyurl.com/3fv3wdrp>Clear cell renal carcinoma (2019)</a>     
2. <a href=https://tinyurl.com/k829xt9r>Endometrial Cancer (2020)</a>         
3. <a href=https://tinyurl.com/yfvtp3wp>Lung Adenocarcinoma (2020)</a>         
4. <a href=https://tinyurl.com/2zsjx24s>Breast Cancer (2020)</a>           
5. <a href=https://tinyurl.com/2mxy6vw7>Head and Neck Squamous Cell Carcinoma (2021)</a>
6. <a href=https://tinyurl.com/3f65rr3x>Glioblastoma (2021)</a>            

**Output:** Gene-wise correlation between mRNA and protein abundances 

<div class="alert alert-block alert-info">
    <b>Note:</b> The input data are downloaded via the <a href=https://pypi.org/project/cptac/>CPTAC Python API</a>.
</div>    

#### Import Packages

In [1]:
import os
import cptac
import numpy as np
import pandas as pd
from cptac import utils as ut

%load_ext autoreload
%autoreload 1
%aimport standardised_pipeline_utils

In [2]:
def get_local_data_path(folders, fname):
    # Build a normalised path under ../local_data from sub-folder names and a file name
    return os.path.normpath(os.path.join('..', 'local_data', *folders, fname))

# Output File
file_tumour_correlation = get_local_data_path(['processed', 'correlation_mRNA_protein'], 'cptac_tumour_studies.csv')
file_samples_info = get_local_data_path(['processed', 'correlation_mRNA_protein'], 'samples_info.csv')

In [3]:
def get_transcriptomics(data, multiIndexed=False):
    # Tumour samples only; cptac returns samples x genes
    transcriptomics = data.get_transcriptomics(tissue_type='tumor')
    if multiIndexed:
        # Drop the Database_ID level so that columns are labelled by gene name only
        transcriptomics = ut.reduce_multiindex(transcriptomics, levels_to_drop='Database_ID', quiet=True)
    # Transpose to genes x samples
    transcriptomics = transcriptomics.transpose()
    print("Dimensions: ", transcriptomics.shape)
    print("Null values count: ", transcriptomics.isnull().sum().sum())
    return transcriptomics

def get_proteomics(data, multiIndexed=True):
    # Tumour samples only; cptac returns samples x proteins
    proteomics = data.get_proteomics(tissue_type='tumor')
    if multiIndexed:
        # Drop the Database_ID level so that columns are labelled by gene name only
        proteomics = ut.reduce_multiindex(proteomics, levels_to_drop='Database_ID', quiet=True)
    # Transpose to proteins x samples, then report
    proteomics = proteomics.transpose()
    print("Dataframe transposed.")
    print("Dimensions: ", proteomics.shape)
    print("Null values count: ", proteomics.isnull().sum().sum())
    return proteomics
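
The helpers above call `ut.reduce_multiindex` to drop the `Database_ID` level and then transpose to a genes x samples layout. As a rough illustration of that reshaping, here is a minimal sketch on a toy frame using plain pandas. The identifiers are made up, and `droplevel` is only a stand-in chosen for the sketch, not a description of what cptac does internally.

```python
import pandas as pd

# Toy samples x genes frame with a two-level column index (hypothetical IDs).
columns = pd.MultiIndex.from_tuples(
    [("A1BG", "ID_1"), ("A1CF", "ID_2"), ("A2M", "ID_3")],
    names=["Name", "Database_ID"],
)
toy = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    index=pd.Index(["S1", "S2"], name="Patient_ID"),
    columns=columns,
)

# Drop the identifier level (stand-in for reduce_multiindex), then
# transpose so that genes are rows and samples are columns.
flat = toy.droplevel("Database_ID", axis=1).transpose()
print(flat)
```

The result has gene names in the row index (`Name`) and sample IDs in the columns (`Patient_ID`), which is the orientation used in the rest of this notebook.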

In [4]:
# Information collected for the Supplemental Table S1B
common_samples = {}
common_genes = {}
transcriptomic_samples = {}
proteomic_samples = {}
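
These four dictionaries are keyed by study name and filled in as each study is processed. One possible way to turn them into a single table is sketched below; the `toy_` dictionaries, their values, and the column labels are placeholders invented for the sketch, not the layout of Supplemental Table S1B.

```python
import pandas as pd

# Placeholder counts standing in for the dictionaries filled in later cells.
toy_transcriptomic_samples = {"StudyA": 10, "StudyB": 12}
toy_proteomic_samples = {"StudyA": 11, "StudyB": 12}
toy_common_samples = {"StudyA": 10, "StudyB": 12}
toy_common_genes = {"StudyA": 500, "StudyB": 650}

# A dict of dicts becomes one column per quantity and one row per study.
samples_info = pd.DataFrame({
    "Transcriptomic samples": toy_transcriptomic_samples,
    "Proteomic samples": toy_proteomic_samples,
    "Common samples": toy_common_samples,
    "Common genes": toy_common_genes,
})
samples_info.index.name = "Study"
print(samples_info)
```

A frame like this could then be written out with `samples_info.to_csv(...)`.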

### Download data from CPTAC

In [5]:
cptac.list_datasets()

Dataset name,Description,Data reuse status,Publication link
Brca,breast cancer,no restrictions,https://pubmed.ncbi.nlm.nih.gov/33212010/
Ccrcc,clear cell renal cell carcinoma (kidney),no restrictions,https://pubmed.ncbi.nlm.nih.gov/31675502/
Colon,colorectal cancer,no restrictions,https://pubmed.ncbi.nlm.nih.gov/31031003/
Endometrial,endometrial carcinoma (uterine),no restrictions,https://pubmed.ncbi.nlm.nih.gov/32059776/
Gbm,glioblastoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/33577785/
Hnscc,head and neck squamous cell carcinoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/33417831/
Lscc,lung squamous cell carcinoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/34358469/
Luad,lung adenocarcinoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/32649874/
Ovarian,high grade serous ovarian cancer,no restrictions,https://pubmed.ncbi.nlm.nih.gov/27372738/
Pdac,pancreatic ductal adenocarcinoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/34534465/


In [6]:
cptac.download('Ccrcc', version='0.1.1')
cptac.download('Endometrial', version='2.1.1')
cptac.download('Luad', version='3.1.1')
cptac.download('Brca', version='5.4')
cptac.download('Gbm', version='3.0')
cptac.download('Hnscc', version='2.0')

                                                

True

### Clear Cell Renal Carcinoma (ccRCC)

In [7]:
ccrcc = cptac.Ccrcc(version='0.1.1')
ccrcc_clinical_info = ccrcc.get_clinical()
interested_samples = ccrcc_clinical_info.index[ccrcc_clinical_info['histologic_type']=='Clear cell renal cell carcinoma']
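
The selection above builds an index of sample IDs from the clinical table, which is later used to subset the genes x samples matrices. The sketch below shows the same pattern on toy data (all values hypothetical) and adds an intersection step. That step is an assumption added for illustration: selecting columns with a list of labels raises a `KeyError` in current pandas if any label is missing from the matrix.

```python
import pandas as pd

# Toy clinical table indexed by sample ID (hypothetical values).
clinical = pd.DataFrame(
    {"histologic_type": ["Clear cell renal cell carcinoma",
                         "Other",
                         "Clear cell renal cell carcinoma"]},
    index=pd.Index(["S1", "S2", "S3"], name="Patient_ID"),
)
# Toy genes x samples matrix; S3 is deliberately absent.
expression = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]], index=["A1BG", "A1CF"], columns=["S1", "S2"]
)

wanted = clinical.index[clinical["histologic_type"] == "Clear cell renal cell carcinoma"]
# Keep only the wanted samples that are actually present in the matrix.
available = expression.columns.intersection(wanted)
subset = expression[available]
print(list(subset.columns))
```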

                                          

In [8]:
ccrcc_transcriptomics = get_transcriptomics(ccrcc)

Dimensions:  (19275, 110)
Null values count:  0


In [9]:
ccrcc_transcriptomics = ccrcc_transcriptomics[interested_samples]
ccrcc_transcriptomics_processed = standardised_pipeline_utils.process(ccrcc_transcriptomics)
ccrcc_transcriptomics_processed[:2]

Dimensions:  (16718, 103)


Name,C3L-00004,C3L-00010,C3L-00011,C3L-00026,C3L-00079,C3L-00088,C3L-00096,C3L-00097,C3L-00103,C3L-00183,...,C3N-01220,C3N-01261,C3N-01361,C3N-01522,C3N-01524,C3N-01646,C3N-01648,C3N-01649,C3N-01651,C3N-01808
A1BG,0.995336,0.6794,0.354549,2.543775,4.355205,1.114256,1.624697,1.060201,1.294317,1.091372,...,1.400173,1.3495,2.57843,0.664676,4.472127,2.823319,7.008482,2.212953,1.268049,0.903522
A1CF,16.677828,16.682712,0.245606,16.347532,4.858958,13.654469,8.107277,4.541293,1.853419,6.29322,...,0.215648,6.03394,3.416981,9.471037,8.165651,2.720128,0.018267,2.237772,13.311588,12.117981


In [10]:
ccrcc_proteomics = get_proteomics(ccrcc)
ccrcc_proteomics[:2]

Dataframe transposed.
Dimensions:  (11710, 110)
Null values count:  284842


Name,C3L-00004,C3L-00010,C3L-00011,C3L-00026,C3L-00079,C3L-00088,C3L-00096,C3L-00097,C3L-00103,C3L-00183,...,C3N-01220,C3N-01261,C3N-01361,C3N-01522,C3N-01524,C3N-01646,C3N-01648,C3N-01649,C3N-01651,C3N-01808
A1BG,-0.304302,1.195915,-0.286155,0.13573,-0.123959,0.427542,-0.242107,0.506469,0.720836,0.082946,...,0.791576,0.31854,0.093607,-0.504522,0.788178,-0.173487,-0.350081,0.246378,-0.242872,0.171883
A1CF,0.641447,0.19462,-0.780455,0.404286,-0.677773,0.310249,-0.128732,-0.513243,-1.135859,-0.128068,...,-0.892166,-0.251923,-0.535844,0.087143,-0.12676,-0.686012,-0.699248,-0.847288,0.48695,0.364511


In [11]:
assert len(ccrcc_proteomics.columns[ccrcc_proteomics.columns.duplicated()]) == 0, "columns contain duplicates"

<div class="alert alert-block alert-warning">
<b>Note:</b> Even after dropping the <code>Database_ID</code> level of the multi-index, there are no duplicated columns (samples).
</div>
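
The assertion above checks the column labels, which hold the samples after transposition. Since the gene names sit in the row index at that point, a check that reports duplicates on both axes may also be informative. The helper below is a hypothetical sketch on toy data and is not part of `standardised_pipeline_utils`.

```python
import pandas as pd

def duplicated_labels(frame: pd.DataFrame) -> dict:
    """Return duplicated row and column labels of a frame (hypothetical helper)."""
    return {
        "rows": frame.index[frame.index.duplicated()].tolist(),
        "columns": frame.columns[frame.columns.duplicated()].tolist(),
    }

# Toy genes x samples frame in which one gene name occurs twice.
toy = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=["A1BG", "A2M", "A2M"],
    columns=["S1", "S2"],
)
report = duplicated_labels(toy)
print(report)
```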

In [12]:
ccrcc_proteomics = ccrcc_proteomics[interested_samples]
ccrcc_proteomics_processed = standardised_pipeline_utils.process(ccrcc_proteomics)
ccrcc_proteomics_processed[:2]

Dimensions:  (7820, 103)


Name,C3L-00004,C3L-00010,C3L-00011,C3L-00026,C3L-00079,C3L-00088,C3L-00096,C3L-00097,C3L-00103,C3L-00183,...,C3N-01220,C3N-01261,C3N-01361,C3N-01522,C3N-01524,C3N-01646,C3N-01648,C3N-01649,C3N-01651,C3N-01808
A1BG,-0.304302,1.195915,-0.286155,0.13573,-0.123959,0.427542,-0.242107,0.506469,0.720836,0.082946,...,0.791576,0.31854,0.093607,-0.504522,0.788178,-0.173487,-0.350081,0.246378,-0.242872,0.171883
A1CF,0.641447,0.19462,-0.780455,0.404286,-0.677773,0.310249,-0.128732,-0.513243,-1.135859,-0.128068,...,-0.892166,-0.251923,-0.535844,0.087143,-0.12676,-0.686012,-0.699248,-0.847288,0.48695,0.364511


In [13]:
transcriptomic_samples['ccRCC'] = ccrcc_transcriptomics_processed.shape[1]
proteomic_samples['ccRCC'] = ccrcc_proteomics_processed.shape[1]

In [14]:
ccrcc_transcriptomics_processed, ccrcc_proteomics_processed = standardised_pipeline_utils.match_proteins_samples(
    ccrcc_transcriptomics_processed,
    ccrcc_proteomics_processed)

Number of common proteins:  7609
Number of common samples:  103
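
`match_proteins_samples` reports the number of genes and samples shared by the two matrices. Its implementation is not shown in this notebook, so the following is only a guess at the general idea: restrict both genes x samples frames to the intersection of their row and column labels. The function name and the toy data are hypothetical.

```python
import pandas as pd

def match_rows_and_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Restrict two genes x samples frames to shared genes and samples.

    Hypothetical stand-in, not the code in standardised_pipeline_utils.
    """
    genes = left.index.intersection(right.index)
    samples = left.columns.intersection(right.columns)
    return left.loc[genes, samples], right.loc[genes, samples]

# Toy frames that overlap in two genes and two samples.
mrna = pd.DataFrame(1.0, index=["A1BG", "A1CF", "A2M"], columns=["S1", "S2", "S3"])
protein = pd.DataFrame(2.0, index=["A1BG", "A2M", "AAAS"], columns=["S2", "S3", "S4"])

mrna_m, protein_m = match_rows_and_columns(mrna, protein)
print("Number of common proteins: ", mrna_m.shape[0])
print("Number of common samples: ", mrna_m.shape[1])
```

After such a step both frames share the same row and column order, which is what a row-by-row correlation needs.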


In [15]:
common_samples['ccRCC'] = ccrcc_transcriptomics_processed.shape[1]
common_genes['ccRCC'] = ccrcc_proteomics_processed.shape[0]

In [16]:
correlation_ccrcc = standardised_pipeline_utils.correlate_genewise(ccrcc_transcriptomics_processed, 
                                                                    ccrcc_proteomics_processed, 'ccRCC')

Median Spearman Correlation:  0.4103


In [17]:
correlation_ccrcc_pearson = standardised_pipeline_utils.correlate_genewise(ccrcc_transcriptomics_processed, 
                                                                    ccrcc_proteomics_processed, 'ccRCC', method='pearson')

Median Pearson Correlation:  0.4233
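
For readers without access to `standardised_pipeline_utils`, a gene-wise correlation of two aligned genes x samples frames can be sketched with `DataFrame.corrwith`. This is an assumption about the general approach, not the notebook's actual `correlate_genewise`; the function name and the simulated data are invented for the example.

```python
import numpy as np
import pandas as pd

def genewise_correlation(mrna, protein, method="spearman"):
    """Correlate matching rows (genes) of two aligned genes x samples frames.

    Hypothetical sketch; correlate_genewise in the pipeline may differ.
    """
    return mrna.corrwith(protein, axis=1, method=method)

rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(20)]
mrna = pd.DataFrame(rng.normal(size=(3, 20)), index=["G1", "G2", "G3"], columns=samples)

protein = mrna.copy()                        # G1: identical profiles
protein.loc["G2"] = -mrna.loc["G2"]          # G2: perfectly anti-correlated
protein.loc["G3"] = np.exp(mrna.loc["G3"])   # G3: monotonic but non-linear

spearman = genewise_correlation(mrna, protein)
pearson = genewise_correlation(mrna, protein, method="pearson")
print("Median Spearman correlation: ", spearman.median())
print("Median Pearson correlation: ", pearson.median())
```

The `G3` row shows why the two methods can disagree: a monotonic but non-linear relationship keeps the rank-based Spearman coefficient at its maximum, while the Pearson coefficient drops below 1.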


<a id=Endometrial_Cancer></a>
### Endometrial Cancer

In [18]:
endo = cptac.Endometrial(version='2.1.1')

                                                

In [19]:
endo_transcriptomics = get_transcriptomics(endo)
endo_transcriptomics[:2]

Dimensions:  (28057, 95)
Null values count:  0


Name,C3L-00006,C3L-00008,C3L-00032,C3L-00090,C3L-00098,C3L-00136,C3L-00137,C3L-00139,C3L-00143,C3L-00145,...,C3N-01219,C3N-01267,C3N-01346,C3N-01349,C3N-01510,C3N-01520,C3N-01521,C3N-01537,C3N-01802,C3N-01825
A1BG,4.02,4.81,6.24,5.31,9.84,5.03,3.17,6.03,6.02,5.71,...,5.34,5.22,5.63,7.63,4.85,4.84,4.8,5.95,5.25,6.79
A1BG-AS1,2.16,2.21,6.43,4.87,8.83,5.59,3.56,5.46,5.9,5.43,...,5.77,5.5,6.89,7.54,3.44,4.95,5.02,5.92,5.97,6.72


In [20]:
endo_transcriptomics_processed = standardised_pipeline_utils.process(endo_transcriptomics)
endo_transcriptomics_processed[:2]

Dimensions:  (20807, 95)


Name,C3L-00006,C3L-00008,C3L-00032,C3L-00090,C3L-00098,C3L-00136,C3L-00137,C3L-00139,C3L-00143,C3L-00145,...,C3N-01219,C3N-01267,C3N-01346,C3N-01349,C3N-01510,C3N-01520,C3N-01521,C3N-01537,C3N-01802,C3N-01825
A1BG,4.02,4.81,6.24,5.31,9.84,5.03,3.17,6.03,6.02,5.71,...,5.34,5.22,5.63,7.63,4.85,4.84,4.8,5.95,5.25,6.79
A1BG-AS1,2.16,2.21,6.43,4.87,8.83,5.59,3.56,5.46,5.9,5.43,...,5.77,5.5,6.89,7.54,3.44,4.95,5.02,5.92,5.97,6.72


In [21]:
endo_proteomics = get_proteomics(endo, multiIndexed = False)
endo_proteomics[:2]

Dataframe transposed.
Dimensions:  (10999, 95)
Null values count:  116089


Name,C3L-00006,C3L-00008,C3L-00032,C3L-00090,C3L-00098,C3L-00136,C3L-00137,C3L-00139,C3L-00143,C3L-00145,...,C3N-01219,C3N-01267,C3N-01346,C3N-01349,C3N-01510,C3N-01520,C3N-01521,C3N-01537,C3N-01802,C3N-01825
A1BG,-1.18,-0.685,-0.528,-1.67,-0.374,-1.08,-1.32,-0.467,-1.12,-0.716,...,-0.295,-1.3,-0.67,0.687,-0.269,-1.07,-1.28,-0.29,0.266,0.692
A2M,-0.863,-1.07,-1.32,-1.19,-0.0206,-0.708,-0.708,0.37,-1.31,-0.885,...,-0.0589,-1.29,-1.11,1.44,0.944,-0.712,-0.736,-0.32,1.39,0.589


In [22]:
endo_proteomics_processed = standardised_pipeline_utils.process(endo_proteomics)
endo_proteomics_processed[:2]

Dimensions:  (9099, 95)


Name,C3L-00006,C3L-00008,C3L-00032,C3L-00090,C3L-00098,C3L-00136,C3L-00137,C3L-00139,C3L-00143,C3L-00145,...,C3N-01219,C3N-01267,C3N-01346,C3N-01349,C3N-01510,C3N-01520,C3N-01521,C3N-01537,C3N-01802,C3N-01825
A1BG,-1.18,-0.685,-0.528,-1.67,-0.374,-1.08,-1.32,-0.467,-1.12,-0.716,...,-0.295,-1.3,-0.67,0.687,-0.269,-1.07,-1.28,-0.29,0.266,0.692
A2M,-0.863,-1.07,-1.32,-1.19,-0.0206,-0.708,-0.708,0.37,-1.31,-0.885,...,-0.0589,-1.29,-1.11,1.44,0.944,-0.712,-0.736,-0.32,1.39,0.589


In [23]:
transcriptomic_samples['EC'] = endo_transcriptomics_processed.shape[1]
proteomic_samples['EC'] = endo_proteomics_processed.shape[1]

In [24]:
endo_transcriptomics_processed, endo_proteomics_processed = standardised_pipeline_utils.match_proteins_samples(
    endo_transcriptomics_processed,
    endo_proteomics_processed)

Number of common proteins:  8998
Number of common samples:  95


In [25]:
common_samples['EC'] = endo_transcriptomics_processed.shape[1]
common_genes['EC'] = endo_transcriptomics_processed.shape[0]

In [26]:
correlation_endo = standardised_pipeline_utils.correlate_genewise(endo_transcriptomics_processed, 
                                                                    endo_proteomics_processed, 'EC')

Median Spearman Correlation:  0.4839


In [27]:
correlation_endo_pearson = standardised_pipeline_utils.correlate_genewise(endo_transcriptomics_processed, 
                                                                    endo_proteomics_processed, 'EC', method='pearson')

Median Pearson Correlation:  0.5116


<a id="Luad"></a>
### Lung Adenocarcinoma

In [28]:
lung_adenocarcinoma = cptac.Luad(version='3.1.1')

                                         

In [29]:
luad_transcriptomics = get_transcriptomics(lung_adenocarcinoma)
luad_transcriptomics[:2]

Dimensions:  (18099, 110)
Null values count:  64320


Name,C3L-00001,C3L-00009,C3L-00080,C3L-00083,C3L-00093,C3L-00094,C3L-00095,C3L-00140,C3L-00144,C3L-00263,...,C3N-02572,C3N-02582,C3N-02586,C3N-02587,C3N-02588,C3N-02729,X11LU013,X11LU016,X11LU022,X11LU035
A1BG,2.2545,1.477,1.5103,3.0398,1.7528,0.5742,2.2706,0.9331,1.0906,1.1381,...,1.0492,2.3424,1.6608,0.6747,0.7923,-0.1409,0.7896,0.6455,1.6085,0.0421
A1CF,-2.7845,-1.9278,-4.9913,-3.544,-5.2883,-3.6693,-2.2618,-3.7772,-5.185,-1.038,...,-2.9021,-5.5138,-4.746,-3.9404,-4.3875,-6.2414,-5.1965,-5.9792,-3.424,-5.5014


In [30]:
assert len(luad_transcriptomics.columns[luad_transcriptomics.columns.duplicated()]) == 0, "columns contain duplicates"

In [31]:
luad_transcriptomics_processed = standardised_pipeline_utils.process(luad_transcriptomics)
luad_transcriptomics_processed[:2]

Dimensions:  (17022, 110)


Name,C3L-00001,C3L-00009,C3L-00080,C3L-00083,C3L-00093,C3L-00094,C3L-00095,C3L-00140,C3L-00144,C3L-00263,...,C3N-02572,C3N-02582,C3N-02586,C3N-02587,C3N-02588,C3N-02729,X11LU013,X11LU016,X11LU022,X11LU035
A1BG,2.2545,1.477,1.5103,3.0398,1.7528,0.5742,2.2706,0.9331,1.0906,1.1381,...,1.0492,2.3424,1.6608,0.6747,0.7923,-0.1409,0.7896,0.6455,1.6085,0.0421
A1CF,-2.7845,-1.9278,-4.9913,-3.544,-5.2883,-3.6693,-2.2618,-3.7772,-5.185,-1.038,...,-2.9021,-5.5138,-4.746,-3.9404,-4.3875,-6.2414,-5.1965,-5.9792,-3.424,-5.5014


In [32]:
luad_proteomics = get_proteomics(lung_adenocarcinoma)
luad_proteomics[:2]

Dataframe transposed.
Dimensions:  (10699, 110)
Null values count:  90431


Name,C3L-00001,C3L-00009,C3L-00080,C3L-00083,C3L-00093,C3L-00094,C3L-00095,C3L-00140,C3L-00144,C3L-00263,...,C3N-02572,C3N-02582,C3N-02586,C3N-02587,C3N-02588,C3N-02729,X11LU013,X11LU016,X11LU022,X11LU035
A1BG,-2.5347,-0.5627,-1.9422,2.1636,-1.0022,-1.5576,-1.0718,-1.0799,-1.9159,-1.1384,...,-1.4006,-3.3718,-1.2578,-1.13,-1.6323,-0.7255,-1.3882,-1.4884,-0.3318,-0.7338
A2M,-3.4057,-1.7945,-2.3782,3.1227,-0.9632,-3.0225,-3.1204,-0.7682,-3.338,-2.0141,...,-3.4726,-4.1354,-3.0975,-1.7842,-2.8213,-3.2235,-2.4728,-3.4264,-1.1635,-1.8498


In [33]:
assert len(luad_proteomics.columns[luad_proteomics.columns.duplicated()]) == 0, "columns contain duplicates"

In [34]:
luad_proteomics_processed = standardised_pipeline_utils.process(luad_proteomics)
luad_proteomics_processed[:2]

Dimensions:  (8758, 110)


Name,C3L-00001,C3L-00009,C3L-00080,C3L-00083,C3L-00093,C3L-00094,C3L-00095,C3L-00140,C3L-00144,C3L-00263,...,C3N-02572,C3N-02582,C3N-02586,C3N-02587,C3N-02588,C3N-02729,X11LU013,X11LU016,X11LU022,X11LU035
A1BG,-2.5347,-0.5627,-1.9422,2.1636,-1.0022,-1.5576,-1.0718,-1.0799,-1.9159,-1.1384,...,-1.4006,-3.3718,-1.2578,-1.13,-1.6323,-0.7255,-1.3882,-1.4884,-0.3318,-0.7338
A2M,-3.4057,-1.7945,-2.3782,3.1227,-0.9632,-3.0225,-3.1204,-0.7682,-3.338,-2.0141,...,-3.4726,-4.1354,-3.0975,-1.7842,-2.8213,-3.2235,-2.4728,-3.4264,-1.1635,-1.8498


In [35]:
transcriptomic_samples['LUAD'] = luad_transcriptomics_processed.shape[1]
proteomic_samples['LUAD'] = luad_proteomics_processed.shape[1]

In [36]:
luad_transcriptomics_processed, luad_proteomics_processed = standardised_pipeline_utils.match_proteins_samples(
    luad_transcriptomics_processed,
    luad_proteomics_processed)

Number of common proteins:  8507
Number of common samples:  110


In [37]:
common_samples['LUAD'] = luad_transcriptomics_processed.shape[1]
common_genes['LUAD'] = luad_transcriptomics_processed.shape[0]

In [38]:
correlation_luad = standardised_pipeline_utils.correlate_genewise(luad_transcriptomics_processed, 
                                                                   luad_proteomics_processed, 'LUAD')

Median Spearman Correlation:  0.5465


In [39]:
correlation_luad_pearson = standardised_pipeline_utils.correlate_genewise(luad_transcriptomics_processed, 
                                                                   luad_proteomics_processed, 'LUAD', method='pearson')

Median Pearson Correlation:  0.5618


<a id="Breast_Cancer"></a>
### Breast Cancer (2020)

In [40]:
brca = cptac.Brca(version='5.4')

                                         

In [41]:
brca_transcriptomics = get_transcriptomics(brca)
brca_transcriptomics[:2]

Dimensions:  (23121, 122)
Null values count:  608161


Name,CPT000814,CPT001846,X01BR001,X01BR008,X01BR009,X01BR010,X01BR015,X01BR017,X01BR018,X01BR020,...,X20BR002,X20BR005,X20BR006,X20BR007,X20BR008,X21BR001,X21BR002,X21BR010,X22BR005,X22BR006
A1BG,1.9265,3.6578,0.9896,0.5535,2.8359,1.5804,1.9006,-0.8184,-0.2645,,...,4.712,,,,,2.1736,,-0.3261,-1.2102,0.7403
A1BG-AS1,2.4267,2.6524,2.6363,2.2119,3.3449,2.1647,2.5487,-0.3528,1.3557,,...,1.9708,1.7106,0.6634,0.3475,1.3309,2.1405,,1.0329,0.6457,1.6475


In [42]:
brca_transcriptomics_processed = standardised_pipeline_utils.process(brca_transcriptomics)
brca_transcriptomics_processed[:2]

Dimensions:  (16409, 122)


Name,CPT000814,CPT001846,X01BR001,X01BR008,X01BR009,X01BR010,X01BR015,X01BR017,X01BR018,X01BR020,...,X20BR002,X20BR005,X20BR006,X20BR007,X20BR008,X21BR001,X21BR002,X21BR010,X22BR005,X22BR006
A1BG,1.9265,3.6578,0.9896,0.5535,2.8359,1.5804,1.9006,-0.8184,-0.2645,,...,4.712,,,,,2.1736,,-0.3261,-1.2102,0.7403
A1BG-AS1,2.4267,2.6524,2.6363,2.2119,3.3449,2.1647,2.5487,-0.3528,1.3557,,...,1.9708,1.7106,0.6634,0.3475,1.3309,2.1405,,1.0329,0.6457,1.6475


In [43]:
brca_proteomics = get_proteomics(brca)
brca_proteomics[:2]

Dataframe transposed.
Dimensions:  (10107, 122)
Null values count:  65248


Name,CPT000814,CPT001846,X01BR001,X01BR008,X01BR009,X01BR010,X01BR015,X01BR017,X01BR018,X01BR020,...,X20BR002,X20BR005,X20BR006,X20BR007,X20BR008,X21BR001,X21BR002,X21BR010,X22BR005,X22BR006
A1BG,-0.6712,1.3964,2.0219,-0.529,1.2556,-0.3843,1.0394,1.1533,1.9579,-0.1637,...,1.8732,-0.4227,1.5862,-0.297,1.6767,-0.661,-1.3735,1.1583,0.4948,0.5049
A2M,-0.2075,1.3302,1.6269,0.3267,3.4489,-1.0239,-0.1915,2.5655,2.4185,-0.581,...,1.5261,-1.911,1.6519,1.3457,1.7907,-0.6402,0.4227,0.3329,-1.0986,-0.6582


In [44]:
assert len(brca_proteomics.columns[brca_proteomics.columns.duplicated()]) == 0, "columns contain duplicates"

In [45]:
brca_proteomics_processed = standardised_pipeline_utils.process(brca_proteomics)
brca_proteomics_processed[:2]

Dimensions:  (8785, 122)


Name,CPT000814,CPT001846,X01BR001,X01BR008,X01BR009,X01BR010,X01BR015,X01BR017,X01BR018,X01BR020,...,X20BR002,X20BR005,X20BR006,X20BR007,X20BR008,X21BR001,X21BR002,X21BR010,X22BR005,X22BR006
A1BG,-0.6712,1.3964,2.0219,-0.529,1.2556,-0.3843,1.0394,1.1533,1.9579,-0.1637,...,1.8732,-0.4227,1.5862,-0.297,1.6767,-0.661,-1.3735,1.1583,0.4948,0.5049
A2M,-0.2075,1.3302,1.6269,0.3267,3.4489,-1.0239,-0.1915,2.5655,2.4185,-0.581,...,1.5261,-1.911,1.6519,1.3457,1.7907,-0.6402,0.4227,0.3329,-1.0986,-0.6582


In [46]:
transcriptomic_samples['BrCa (2020)'] = brca_transcriptomics_processed.shape[1]
proteomic_samples['BrCa (2020)'] = brca_proteomics_processed.shape[1]

In [47]:
brca_transcriptomics_processed, brca_proteomics_processed = standardised_pipeline_utils.match_proteins_samples(
    brca_transcriptomics_processed,
    brca_proteomics_processed)

Number of common proteins:  8243
Number of common samples:  122


In [48]:
common_samples['BrCa (2020)'] = brca_transcriptomics_processed.shape[1]
common_genes['BrCa (2020)'] = brca_transcriptomics_processed.shape[0]

In [49]:
correlation_brca = standardised_pipeline_utils.correlate_genewise(brca_transcriptomics_processed, 
                                                                  brca_proteomics_processed, 'BrCa (2020)')

Median Spearman Correlation:  0.4348


In [50]:
correlation_brca_pearson = standardised_pipeline_utils.correlate_genewise(brca_transcriptomics_processed, 
                                                                  brca_proteomics_processed, 'BrCa (2020)', method='pearson')

Median Pearson Correlation:  0.425


<a id="Hnscc_Cancer"></a>
### Head and Neck Squamous Cell Carcinoma (HNSCC)

In [51]:
hnscc = cptac.Hnscc(version='2.0')

                                          

In [52]:
hnscc_transcriptomics = get_transcriptomics(hnscc)
hnscc_transcriptomics[:2]

Dimensions:  (38456, 109)
Null values count:  0


Name,C3L-00977,C3L-00987,C3L-00994,C3L-00995,C3L-00997,C3L-00999,C3L-01138,C3L-01237,C3L-02617,C3L-02621,...,C3N-03933,C3N-04152,C3N-04273,C3N-04275,C3N-04276,C3N-04277,C3N-04278,C3N-04279,C3N-04280,C3N-04611
A1BG,6.2,5.44,5.14,5.69,4.54,4.89,6.59,5.42,4.26,6.21,...,5.14,3.77,4.26,5.22,6.64,6.3,5.67,5.76,5.79,4.77
A1BG-AS1,6.79,6.63,6.31,6.06,5.14,5.76,6.74,5.38,5.42,5.9,...,6.31,4.75,5.41,5.26,6.8,7.03,7.02,6.14,6.23,5.92


In [53]:
# Samples removed because they either (i) failed QC or (ii) are HPV positive, as specified in Table S1 of the HNSCC Cancer Cell paper
# https://www.cell.com/cancer-cell/fulltext/S1535-6108(20)30655-3#supplementaryMaterial
removed_samples = ['C3N-01643', 'C3N-02693']

In [54]:
hnscc_transcriptomics.drop(columns = [col for col in hnscc_transcriptomics.columns if col in removed_samples], 
                           inplace=True)

In [55]:
hnscc_transcriptomics_processed = standardised_pipeline_utils.process(hnscc_transcriptomics)
hnscc_transcriptomics_processed[:2]

Dimensions:  (26134, 108)


Name,C3L-00977,C3L-00987,C3L-00994,C3L-00995,C3L-00997,C3L-00999,C3L-01138,C3L-01237,C3L-02617,C3L-02621,...,C3N-03933,C3N-04152,C3N-04273,C3N-04275,C3N-04276,C3N-04277,C3N-04278,C3N-04279,C3N-04280,C3N-04611
A1BG,6.2,5.44,5.14,5.69,4.54,4.89,6.59,5.42,4.26,6.21,...,5.14,3.77,4.26,5.22,6.64,6.3,5.67,5.76,5.79,4.77
A1BG-AS1,6.79,6.63,6.31,6.06,5.14,5.76,6.74,5.38,5.42,5.9,...,6.31,4.75,5.41,5.26,6.8,7.03,7.02,6.14,6.23,5.92


In [56]:
hnscc_proteomics = get_proteomics(hnscc, multiIndexed=False)
hnscc_proteomics[:2]

Dataframe transposed.
Dimensions:  (11744, 115)
Null values count:  238975


Name,C3L-00977,C3L-00987,C3L-00994,C3L-00995,C3L-00997,C3L-00999,C3L-01138,C3L-01237,C3L-02617,C3L-02621,...,C3N-04278,C3N-04279,C3N-04280,C3N-04611,C3L-00994.C,C3L-02617.C,C3L-04350.C,C3L-05257.C,C3N-01757.C,C3N-03042.C
A1BG,27.725342,28.152905,28.348186,28.004445,27.735214,27.949122,28.293267,28.216073,27.452281,27.70104,...,27.475926,27.349798,27.786746,27.881307,27.865217,28.26514,28.828969,28.14128,28.141688,28.680969
A1CF,19.056377,,18.058554,,,,,,19.320859,,...,,,,,,,,,,


In [57]:
# Average the replicate samples (suffix '.C'), as they have a Spearman correlation > 0.9
hnscc_proteomics.rename(columns=lambda x: str(x).replace('.C', ''), inplace=True)
# Group on the transposed frame: groupby(axis=1) is deprecated in recent pandas releases
hnscc_proteomics = hnscc_proteomics.T.groupby(level=0).mean().T
# Drop removed samples from our analysis
hnscc_proteomics.drop(columns=[x for x in hnscc_proteomics.columns if x in removed_samples], inplace=True)
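
The replicate handling above can be illustrated on a toy matrix: check how well a replicate tracks its original, then average columns that share a sample ID once the `.C` suffix is removed. The sample names and values below are hypothetical, and anchoring the pattern to the end of the label (`\.C$`) is a choice made for this sketch rather than what the cell above does.

```python
import pandas as pd

# Toy genes x samples matrix; "S1.C" is a replicate of "S1" (hypothetical IDs).
toy = pd.DataFrame(
    {"S1": [1.0, 2.0, 3.0, 4.0],
     "S1.C": [1.2, 2.1, 3.3, 3.9],
     "S2": [4.0, 1.0, 3.0, 2.0]},
    index=["G1", "G2", "G3", "G4"],
)

# Rank correlation between the replicate and its original.
rho = toy["S1"].corr(toy["S1.C"], method="spearman")

# Strip only a trailing ".C", then average columns that now share a label.
base_ids = toy.columns.str.replace(r"\.C$", "", regex=True).to_numpy()
merged = toy.T.groupby(base_ids).mean().T
print(rho)
print(merged)
```

Missing values are skipped by `mean()`, so a protein quantified in only one of the two runs keeps that single measurement.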

In [58]:
hnscc_proteomics_processed = standardised_pipeline_utils.process(hnscc_proteomics)
hnscc_proteomics_processed[:2]

Dimensions:  (8696, 110)


Name,C3L-00977,C3L-00987,C3L-00994,C3L-00995,C3L-00997,C3L-00999,C3L-01138,C3L-01237,C3L-02617,C3L-02621,...,C3N-03933,C3N-04152,C3N-04273,C3N-04275,C3N-04276,C3N-04277,C3N-04278,C3N-04279,C3N-04280,C3N-04611
A1BG,27.725342,28.152905,28.106702,28.004445,27.735214,27.949122,28.293267,28.216073,27.85871,27.70104,...,27.853538,27.473217,27.351122,27.520569,28.709391,27.736346,27.475926,27.349798,27.786746,27.881307
A2M,28.565472,29.374443,30.06764,29.267877,28.724642,29.352481,29.229332,29.100412,29.014712,28.806067,...,28.97223,28.476319,28.415198,29.133823,29.190307,29.491318,29.05276,28.465475,28.505049,28.674746


In [59]:
transcriptomic_samples['HNSCC'] = hnscc_transcriptomics_processed.shape[1]
proteomic_samples['HNSCC'] = hnscc_proteomics_processed.shape[1]

In [60]:
hnscc_transcriptomics_processed, hnscc_proteomics_processed = standardised_pipeline_utils.match_proteins_samples(
    hnscc_transcriptomics_processed,
    hnscc_proteomics_processed)

Number of common proteins:  8583
Number of common samples:  108
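
The source of `match_proteins_samples` is not shown in this notebook. As a rough sketch of what such a matching step might do, the hypothetical function below keeps only the genes (rows) and samples (columns) present in both tables; the function name and toy data are assumptions, not the helper's actual implementation.

```python
import pandas as pd

def match_rows_and_columns(mrna, protein):
    """Hypothetical stand-in: restrict both tables to shared genes and shared samples."""
    common_genes = mrna.index.intersection(protein.index)
    common_samples = mrna.columns.intersection(protein.columns)
    return (mrna.loc[common_genes, common_samples],
            protein.loc[common_genes, common_samples])

# Toy gene-by-sample tables with partial overlap
mrna = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                    index=["A1BG", "A2M"], columns=["S1", "S2", "S3"])
protein = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]],
                       index=["A2M", "AAAS"], columns=["S2", "S3"])

mrna_matched, protein_matched = match_rows_and_columns(mrna, protein)
```

After matching, both tables have identical row and column labels, which is what makes a row-by-row correlation meaningful.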


In [61]:
common_samples['HNSCC'] = hnscc_transcriptomics_processed.shape[1]
common_genes['HNSCC'] = hnscc_transcriptomics_processed.shape[0]

In [62]:
correlation_hnscc = standardised_pipeline_utils.correlate_genewise(hnscc_transcriptomics_processed, 
                                                                   hnscc_proteomics_processed, 'HNSCC')

Median Spearman Correlation:  0.5296
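
The source of `correlate_genewise` is likewise not shown here. A gene-wise Spearman correlation could be sketched as follows, assuming both tables are already matched and contain no missing values; `genewise_spearman` and the toy data are hypothetical.

```python
import pandas as pd

def genewise_spearman(mrna, protein):
    """Hypothetical helper: Spearman's rho per gene (row) across samples (columns)."""
    # Spearman's rho equals the Pearson correlation of the ranks.
    # Ranking first assumes no missing values in either table.
    mrna_ranks = mrna.rank(axis=1)
    protein_ranks = protein.rank(axis=1)
    return mrna_ranks.corrwith(protein_ranks, axis=1)

samples = ["S1", "S2", "S3", "S4"]
mrna = pd.DataFrame([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]],
                    index=["G1", "G2"], columns=samples)
protein = pd.DataFrame([[10.0, 20.0, 35.0, 50.0], [0.5, 0.1, 0.9, 1.5]],
                       index=["G1", "G2"], columns=samples)

rho = genewise_spearman(mrna, protein)
median_rho = rho.median()
```

Here `G1` is perfectly monotone (rho = 1.0) and `G2` is mostly reversed (rho = -0.8), so the median over genes is 0.1.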


In [63]:
correlation_hnscc_pearson = standardised_pipeline_utils.correlate_genewise(hnscc_transcriptomics_processed, 
                                                                   hnscc_proteomics_processed, 'HNSCC', method='pearson')

Median Pearson Correlation:  0.5603


<a id="GBM_Cancer"></a>
### GBM

In [64]:
gbm = cptac.Gbm(version='3.0')


In [65]:
gbm_transcriptomics = get_transcriptomics(gbm, multiIndexed=True)
gbm_transcriptomics[:2]

Dimensions:  (60483, 99)
Null values count:  0


Name,C3L-00104,C3L-00365,C3L-00674,C3L-00677,C3L-01040,C3L-01043,C3L-01045,C3L-01046,C3L-01048,C3L-01049,...,C3N-02788,C3N-03070,C3N-03088,C3N-03180,C3N-03182,C3N-03183,C3N-03184,C3N-03186,C3N-03188,C3N-03473
5S_rRNA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5S_rRNA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
assert len(gbm_transcriptomics.columns[gbm_transcriptomics.columns.duplicated()]) == 0, "columns contain duplicates"

In [67]:
gbm_transcriptomics_processed = standardised_pipeline_utils.process(gbm_transcriptomics)
gbm_transcriptomics_processed[:2]

Dimensions:  (29628, 99)


Name,C3L-00104,C3L-00365,C3L-00674,C3L-00677,C3L-01040,C3L-01043,C3L-01045,C3L-01046,C3L-01048,C3L-01049,...,C3N-02788,C3N-03070,C3N-03088,C3N-03180,C3N-03182,C3N-03183,C3N-03184,C3N-03186,C3N-03188,C3N-03473
5S_rRNA,38886.167971,26532.401969,43273.830168,34050.208702,42224.30997,44217.30486,47885.637617,45879.768789,49169.733975,48205.474974,...,28434.976673,62476.163734,23732.935866,25153.158215,40051.285619,36967.844139,80850.137537,37860.60954,34691.447141,61714.912891
7SK,354300.002325,270272.124551,89523.626156,121644.956366,93259.597405,130696.137435,116847.685886,106623.883707,105016.376686,81733.377534,...,125895.14843,147671.301692,64687.124482,172464.282283,186020.044424,181443.805445,271901.473898,216507.270389,73552.366041,134589.00757


In [68]:
gbm_proteomics = get_proteomics(gbm, multiIndexed=False)
gbm_proteomics[:2]

Dataframe transposed.
Dimensions:  (11141, 99)
Null values count:  85305


Name,C3L-00104,C3L-00365,C3L-00674,C3L-00677,C3L-01040,C3L-01043,C3L-01045,C3L-01046,C3L-01048,C3L-01049,...,C3N-02788,C3N-03070,C3N-03088,C3N-03180,C3N-03182,C3N-03183,C3N-03184,C3N-03186,C3N-03188,C3N-03473
A1BG,0.07763,-0.145975,0.821991,-0.064567,-0.763691,1.094879,-0.027903,-0.375754,-0.394736,-0.025968,...,-0.327487,1.942106,0.27851,1.04588,-0.424647,0.230843,-0.635316,0.61664,-0.059547,-0.899255
A2M,0.487228,0.798796,1.09647,0.129385,-1.031834,0.769231,-0.735991,-0.037553,-0.485108,-0.310086,...,-0.340301,1.657565,0.8366,1.151704,-0.733923,0.426624,-0.478657,0.767029,-0.526563,-0.333312


In [69]:
gbm_proteomics_processed = standardised_pipeline_utils.process(gbm_proteomics)
gbm_proteomics_processed[:2]

Dimensions:  (9786, 99)


Name,C3L-00104,C3L-00365,C3L-00674,C3L-00677,C3L-01040,C3L-01043,C3L-01045,C3L-01046,C3L-01048,C3L-01049,...,C3N-02788,C3N-03070,C3N-03088,C3N-03180,C3N-03182,C3N-03183,C3N-03184,C3N-03186,C3N-03188,C3N-03473
A1BG,0.07763,-0.145975,0.821991,-0.064567,-0.763691,1.094879,-0.027903,-0.375754,-0.394736,-0.025968,...,-0.327487,1.942106,0.27851,1.04588,-0.424647,0.230843,-0.635316,0.61664,-0.059547,-0.899255
A2M,0.487228,0.798796,1.09647,0.129385,-1.031834,0.769231,-0.735991,-0.037553,-0.485108,-0.310086,...,-0.340301,1.657565,0.8366,1.151704,-0.733923,0.426624,-0.478657,0.767029,-0.526563,-0.333312


In [70]:
transcriptomic_samples['GBM'] = gbm_transcriptomics_processed.shape[1]
proteomic_samples['GBM'] = gbm_proteomics_processed.shape[1]

In [71]:
gbm_transcriptomics_processed, gbm_proteomics_processed = standardised_pipeline_utils.match_proteins_samples(
    gbm_transcriptomics_processed,
    gbm_proteomics_processed)

Number of common proteins:  9348
Number of common samples:  99


In [72]:
common_samples['GBM'] = gbm_transcriptomics_processed.shape[1]
common_genes['GBM'] = gbm_proteomics_processed.shape[0]

In [73]:
correlation_gbm = standardised_pipeline_utils.correlate_genewise(gbm_transcriptomics_processed, 
                                                                   gbm_proteomics_processed, 'GBM')

Median Spearman Correlation:  0.5014


In [74]:
correlation_gbm_pearson = standardised_pipeline_utils.correlate_genewise(gbm_transcriptomics_processed, 
                                                                   gbm_proteomics_processed, 'GBM', method='pearson')

Median Pearson Correlation:  0.5061


In [75]:
def dataframe_from_dict(*dict_args):
    # Turn each {study: count} dictionary into a one-column frame and join them side by side on the study name
    dataframe = pd.DataFrame.from_dict(dict_args[0], orient='index')
    for i in range(1, len(dict_args)):
        dataframe = pd.concat([dataframe, pd.DataFrame.from_dict(dict_args[i], orient='index')], axis=1)
    dataframe.reset_index(inplace=True)
    dataframe.columns = ['Data', '# Samples in Transcriptomic Data', '# Samples in Proteomic Data',
                         '# Common Samples', '# Common Proteins']
    return dataframe.set_index('Data')

sample_data = dataframe_from_dict(transcriptomic_samples, proteomic_samples, common_samples, common_genes)
# Append without a header row; re-running this cell appends the same rows again
sample_data.to_csv(file_samples_info, header=False, mode='a')
sample_data

Data,# Samples in Transcriptomic Data,# Samples in Proteomic Data,# Common Samples,# Common Proteins
ccRCC,103,103,103,7609
EC,95,95,95,8998
LUAD,110,110,110,8507
BrCa (2020),122,122,122,8243
HNSCC,108,110,108,8583
GBM,99,99,99,9348


In [76]:
correlation_combined = pd.concat([correlation_ccrcc, correlation_endo, correlation_luad, correlation_brca, 
                                  correlation_hnscc, correlation_gbm], axis=1)
correlation_combined.to_csv(file_tumour_correlation)
correlation_combined[:2]

Name,ccRCC,EC,LUAD,BrCa (2020),HNSCC,GBM
A1BG,0.035573,0.324974,-0.119663,0.077242,0.065872,-0.192863
A1CF,0.908613,,,,,
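
The empty cells for `A1CF` arise because `pd.concat(..., axis=1)` aligns on the union of gene names and fills genes missing from a study with `NaN`. A small sketch with invented study names and values shows this, and that `median()` skips those gaps by default:

```python
import pandas as pd

# Toy per-study correlation results; names and values are made up for illustration
study_a = pd.Series({"A1BG": 0.04, "A1CF": 0.91, "A2M": 0.30}, name="StudyA")
study_b = pd.Series({"A1BG": 0.32, "A2M": 0.50}, name="StudyB")

# Outer join on the gene index: A1CF has no value in StudyB, so it becomes NaN there
combined = pd.concat([study_a, study_b], axis=1)

# Number of genes actually quantified in each study
n_genes_per_study = combined.notna().sum()

# median() ignores NaN by default, so each study's median uses only its own genes
medians = combined.median()
```

Each per-study median is therefore computed over a different gene set, which is worth bearing in mind when comparing studies.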


In [77]:
correlation_combined_pearson = pd.concat([correlation_ccrcc_pearson, correlation_endo_pearson, 
                                          correlation_luad_pearson, correlation_brca_pearson, 
                                          correlation_hnscc_pearson, correlation_gbm_pearson], axis=1)
round(correlation_combined_pearson.median(), 2)

ccRCC          0.42
EC             0.51
LUAD           0.56
BrCa (2020)    0.43
HNSCC          0.56
GBM            0.51
dtype: float64