# TCGA Data obtaining
In this notebook, it is going to be downloaded data from The Cancer Genome Project (TCGA).  

It is going to be used the package *xenaPython* (https://github.com/ucscXena/xenaPython), Python API for Xena Hub.

Datasets we are interested in are:

- gene expression profiles from cancer patients (https://toil.xenahubs.net/download/tcga_RSEM_gene_tpm.gz)

- clinical data from cancer patients ()

## Gene expression profiles (BRCA and non-BRCA)

In [1]:
import xenaPython as xena

HUB = 'https://toil.xenahubs.net'
DATASET = "tcga_RSEM_gene_tpm"
all_samples = xena.dataset_samples(HUB, DATASET, None)
all_genes = xena.dataset_field(HUB, DATASET)

In [None]:
print("Number of samples (patients): " + str(len(all_samples)))
print("Number of genes: " + str(len(all_genes)))

When downloading all expression levels for each patient and gene, there should be an error because of timeout (HTTP 504: Timeout). In order to solve it, data is going to be downloaded using multiprocessing.

In [14]:
import xenaPython as xena
from multiprocessing import Pool

class ParallelDownloader:
       
    def __init__(self, gene_amount_per_download, hub, dataset):
        self.HUB = hub
        self.DATASET = dataset
        self.gene_amount_per_pool = gene_amount_per_download
        self.all_samples = xena.dataset_samples(HUB, DATASET, None)
        self.all_genes = xena.dataset_field(HUB, DATASET)
        self.GENE_AMOUNT_PER_POOL = 5000
        self.NUMBER_OF_GENES = len(self.all_genes)
        self.NUMBER_OF_POOLS = self.NUMBER_OF_GENES//self.GENE_AMOUNT_PER_POOL
        if(self.NUMBER_OF_GENES%self.GENE_AMOUNT_PER_POOL != 0):
             self.NUMBER_OF_POOLS += 1
        
    
    def download_expression_profiles_part(self, n):
        ini = n*self.GENE_AMOUNT_PER_POOL
        fin = (n+1)*self.GENE_AMOUNT_PER_POOL
        if fin > self.NUMBER_OF_GENES:
            fin = self.NUMBER_OF_GENES + 1
        print("Processing " + str(n))
        genes_to_download = self.all_genes[slice(ini, fin, 1)]
        res = xena.dataset_probe_values(self.HUB, self.DATASET, self.all_samples, genes_to_download)
        res[0] = genes_to_download
        print("Processed " + str(n))
        return res
    
    def download_expression_profiles(self):
        pool = Pool(self.NUMBER_OF_POOLS)
        pre_all_data = pool.map_async(self.download_expression_profiles_part, list(range(self.NUMBER_OF_POOLS+1)))
        all_data = pre_all_data.get()
        return all_data


In [None]:
%%time

HUB = 'https://toil.xenahubs.net'
DATASET = "tcga_RSEM_gene_tpm"
downloader = ParallelDownloader(5000, HUB, DATASET)
all_data = downloader.download_expression_profiles()

Processing 0
Processing 1
Processing 2
Processing 3
Processing 4
Processing 5
Processing 6
Processing 7
Processing 8
Processing 9
Processing 10
Processing 11
Processing 12
Processed 12
Processing 13
Processed 13


Now we have all dataset stored in memory.

It is stored in multiple list of lists, then must be transformed to a Pandas DataFrame, like it is done down bellow:

In [10]:
import pandas as pd

df_gene_exp = pd.DataFrame()

for x in range(len(all_data)):
    header = all_data[x][0]
    indexes = all_samples
    tmp = pd.DataFrame.transpose(pd.DataFrame(all_data[x][1], columns=indexes, index=header))
    df_gene_exp = pd.concat([df_gene_exp, tmp], axis=1)

In [None]:
# Saving data
df_gene_exp.to_hdf('data/tcga_RSEM_gene_tpm.h5', key='gene_expression')

In [29]:
import pandas as pd
df_gene_exp = pd.read_hdf('data/tcga_RSEM_gene_tpm.h5', key='gene_expression').iloc[:, :-1]

## Clinical BRCA

In [30]:
%%time

import xenaPython as xena

HUB = 'https://tcga.xenahubs.net'
DATASET = "TCGA.BRCA.sampleMap/BRCA_clinicalMatrix"
brca_samples = xena.dataset_samples(HUB, DATASET, None)
all_fields = xena.dataset_field(HUB, DATASET)
brca_clinical = xena.dataset_probe_values(HUB, DATASET, brca_samples, all_fields)

CPU times: user 56.5 ms, sys: 6.78 ms, total: 63.2 ms
Wall time: 6.59 s


In [3]:
import pandas as pd
brca_clinical = pd.DataFrame(brca_clinical[1], columns=brca_samples, index=all_fields).T

## Clinical non-BRCA

In [31]:
%%time

import xenaPython as xena

HUB = 'https://pancanatlas.xenahubs.net'
DATASET = "Survival_SupplementalTable_S1_20171025_xena_sp"
all_samples = xena.dataset_samples(HUB, DATASET, None)
non_brca_samples = list(set(all_samples).difference(brca_samples).intersection(list(df_gene_exp.index)))
all_fields = xena.dataset_field(HUB, DATASET)
non_brca_clinical = xena.dataset_probe_values(HUB, DATASET, non_brca_samples, all_fields)

CPU times: user 104 ms, sys: 15.7 ms, total: 119 ms
Wall time: 4.15 s


In [5]:
import pandas as pd
non_brca_clinical = pd.DataFrame(non_brca_clinical[1], columns=non_brca_samples, index=all_fields).T

In [101]:
non_brca_clinical['OS'].value_counts()

0      6195
1      3083
NaN       7
Name: OS, dtype: int64

## Cancer type (NOT VALID)

In [8]:
%%time

import xenaPython as xena

HUB = 'https://pancanatlas.xenahubs.net'
DATASET = "TCGA_phenotype_denseDataOnlyDownload.tsv"
all_samples = xena.dataset_samples(HUB, DATASET, None)
all_fields = xena.dataset_field(HUB, DATASET)
cancertype = xena.dataset_probe_values(HUB, DATASET, all_samples, all_fields)

CPU times: user 51.3 ms, sys: 13.3 ms, total: 64.6 ms
Wall time: 3.06 s


In [9]:
import pandas as pd
cancertype = pd.DataFrame(cancertype[1], columns=all_samples, index=all_fields).T

In [10]:
cancertype.loc[set(list(cancertype.index)).intersection(brca_clinical.index),'_primary_disease'].value_counts()


10    1241
Name: _primary_disease, dtype: int64

## Data exploration

In [32]:
print("Total number of samples: {}".format(len(df_gene_exp.index)))

Total number of samples: 10535


In [34]:
print("Total number of BRCA samples: {}".format(len(brca_clinical.index)))

TypeError: object of type 'builtin_function_or_method' has no len()

In [133]:
print("Total number of non-BRCA samples: {}".format(len(non_brca_clinical.index)))

Total number of non-BRCA samples: 9285


In [21]:
index_with_clincal_info = set(brca_clinical.index).union(non_brca_clinical.index)
index_with_NO_exp_info = index_with_clincal_info.difference(df_gene_exp.index)

In [26]:
# Drop patients in brca_clinical with no 
brca_clinical = brca_clinical.drop(index_with_NO_exp_info.intersection(brca_clinical.index))
brca_clinical.shape

(1212, 203)

In [25]:
# Dropped patients in df_gene_exp are dropped too in each clinical data
non_brca_clinical = non_brca_clinical.drop(index_with_NO_exp_info.intersection(non_brca_clinical.index))
non_brca_clinical.shape

(9285, 34)

In [27]:
1212+9285

10497

# Data exportation

In [174]:
%%time
#Saving clinical data (brca)

with pd.HDFStore("data/TCGA_data.h5", "w") as store:
    store['brca_clinical'] = brca_clinical
    store['non_brca_clinical'] = non_brca_clinical
    store['both_gene_expression'] = df_gene_exp

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->['AJCC_Stage_nature2012', 'Age_at_Initial_Pathologic_Diagnosis_nature2012', 'CN_Clusters_nature2012', 'Converted_Stage_nature2012', 'Days_to_Date_of_Last_Contact_nature2012', 'Days_to_date_of_Death_nature2012', 'ER_Status_nature2012', 'Gender_nature2012', 'HER2_Final_Status_nature2012', 'Integrated_Clusters_no_exp__nature2012', 'Integrated_Clusters_unsup_exp__nature2012', 'Integrated_Clusters_with_PAM50__nature2012', 'Metastasis_Coded_nature2012', 'Metastasis_nature2012', 'Node_Coded_nature2012', 'Node_nature2012', 'OS', 'OS.time', 'OS.unit', 'OS_Time_nature2012', 'OS_event_nature2012', 'PAM50Call_RNAseq', 'PAM50_mRNA_nature2012', 'PR_Status_nature2012', 'RFS', 'RFS.time', 'RFS.unit', 'RPPA_Clusters_nature2012', 'SigClust_Intrinsic_mRNA_nature2012', 'SigClust_Unsupervised_mRNA_nature2012', 'Survival_Data_Form_nature2012', 'Tum

CPU times: user 16.3 s, sys: 5.78 s, total: 22 s
Wall time: 42 s
