### cBioPortal REST API test

List of studies:
- https://github.com/cBioPortal/datahub/tree/master/public
    - If needed, the study names could be parsed from that page and fed into the queries below.
    
    
Issues:
- Is this the proper way to use the REST API?
- Only 1000 rows can be queried at a time
    - i.e. the pageSize=1000 parameter
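
One way around the 1000-row limit is to request consecutive pages until one comes back short. The sketch below assumes the endpoint honours the pageNumber and pageSize parameters used in this notebook and that only the last page holds fewer than pageSize rows. The helper names (`clinical_data_url`, `fetch_page`, `fetch_all_rows`) are hypothetical and not part of cBioPortal.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://cbioportal-rc.herokuapp.com/api"  # host used in this notebook


def clinical_data_url(study_id, page_number, page_size=1000):
    """Build the clinical-data URL for one page of one study."""
    query = urlencode({
        "projection": "SUMMARY",
        "pageNumber": page_number,
        "pageSize": page_size,
        "direction": "ASC",
    })
    return f"{BASE_URL}/studies/{study_id}/clinical-data?{query}"


def fetch_page(study_id, page_number, page_size=1000):
    """Download one page and return it as a list of dicts."""
    with urlopen(clinical_data_url(study_id, page_number, page_size)) as response:
        return json.load(response)


def fetch_all_rows(study_id, page_size=1000, get_page=fetch_page):
    """Collect rows page by page.

    Assumption: a page shorter than page_size (or empty) marks the end.
    get_page can be swapped out, e.g. for an offline stub.
    """
    rows = []
    page_number = 0
    while True:
        page = get_page(study_id, page_number, page_size)
        rows.extend(page)
        if len(page) < page_size:
            break
        page_number += 1
    return rows
```

If this holds for the server, `pd.DataFrame(fetch_all_rows('ov_tcga'))` should give the same long-format table as `pd.read_json` on a single page, only with every row included.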

#### Raw data from the REST API

In [4]:
import pandas as pd

pd.read_json('http://cbioportal-rc.herokuapp.com/api/studies/acyc_mskcc_2013/clinical-data?projection=SUMMARY&pageNumber=0&pageSize=1000&direction=ASC').head(8)

Unnamed: 0,clinicalAttributeId,entityId,value
0,CANCER_TYPE,ACYC-MSKCC_09_12352,Salivary Gland Cancer
1,CANCER_TYPE_DETAILED,ACYC-MSKCC_09_12352,Adenoid Cystic Carcinoma
2,METASTATIC_SITE,ACYC-MSKCC_09_12352,Bone
3,METASTATIC_TUMOR_INDICATOR,ACYC-MSKCC_09_12352,No
4,ONCOTREE_CODE,ACYC-MSKCC_09_12352,ACYC
5,PERINEURAL_INVASION,ACYC-MSKCC_09_12352,Microscopic
6,PLATFORM,ACYC-MSKCC_09_12352,WGS/WES
7,PRIMARY_SITE,ACYC-MSKCC_09_12352,Salivary gland


In [5]:
def json_to_df(df):
    # get the unique entities - study subjects
    # (lists, because recent pandas does not accept a set as an index)
    unique_ids = list(set(df.entityId))
    # get unique attributes - attributes in this study
    unique_attrib = list(set(df.clinicalAttributeId))
    # create an empty dataframe of (study subjects x attributes)
    new_df = pd.DataFrame(index=unique_ids, columns=unique_attrib)
    # go through every id
    for i in unique_ids:
        # and every attribute
        for j in unique_attrib:
            # values recorded for this (subject, attribute) pair
            match = df.loc[(df['entityId'] == i) & (df['clinicalAttributeId'] == j), 'value']
            # the value is often missing; those cells stay NaN
            if not match.empty:
                # .at replaces DataFrame.set_value, which was removed in pandas 1.0
                new_df.at[i, j] = match.iloc[0]
    return new_df
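
The nested loop above is a long-to-wide reshape, which pandas can also do in one call. The following is a minimal sketch, assuming the input has the three columns shown in the raw output (entityId, clinicalAttributeId, value). The function name `long_to_wide` and the toy rows are made up for illustration.

```python
import pandas as pd


def long_to_wide(df):
    """Pivot (entityId, clinicalAttributeId, value) rows into one row per entity.

    aggfunc='first' keeps the first value when a pair occurs more than once,
    which mirrors what the loop-based version does.
    """
    return df.pivot_table(
        index="entityId",
        columns="clinicalAttributeId",
        values="value",
        aggfunc="first",
    )


# toy data, not taken from any study
long_df = pd.DataFrame({
    "entityId": ["S1", "S1", "S2"],
    "clinicalAttributeId": ["CANCER_TYPE", "PRIMARY_SITE", "CANCER_TYPE"],
    "value": ["Breast Cancer", "Breast", "Salivary Gland Cancer"],
})
wide_df = long_to_wide(long_df)
print(wide_df)
```

Unlike the loop, the pivot returns rows and columns in sorted order, and missing pairs (S2 / PRIMARY_SITE here) come out as NaN.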



In [6]:
df1 = json_to_df(pd.read_json('http://cbioportal-rc.herokuapp.com/api/studies/acyc_mskcc_2013/clinical-data?projection=SUMMARY&pageNumber=0&pageSize=1000&direction=ASC'))
df1.head(5)

Unnamed: 0,METASTATIC_SITE,CANCER_TYPE_DETAILED,TUMOR_TISSUE_SITE,PERINEURAL_INVASION,ONCOTREE_CODE,TUMOR_STAGE,SAMPLE_TYPE,PLATFORM,CANCER_TYPE,METASTATIC_TUMOR_INDICATOR,PRIMARY_SITE
ACYC-MSKCC_001781,,Adenoid Cystic Carcinoma,Maxilla,Yes,ACYC,Advanced,Primary,WGS/WES,Salivary Gland Cancer,No,Head and Neck
ACYC-MSKCC_001947,,Adenoid Cystic Carcinoma,Parotid gland,Yes,ACYC,Early,Primary,WGS/WES,Salivary Gland Cancer,No,Salivary gland
ACYC-MSKCC_001739,Lung,Adenoid Cystic Carcinoma,Base of tongue,Yes,ACYC,Early,Primary,WGS/WES,Salivary Gland Cancer,Yes,Oral cavity
ACYC-MSKCC_05_6986,Lung,Adenoid Cystic Carcinoma,Parotid gland,No,ACYC,Early,Primary,WGS/WES,Salivary Gland Cancer,No,Salivary gland
ACYC-MSKCC_131169,Lung,Adenoid Cystic Carcinoma,Submandibular gland,Yes,ACYC,Early,Primary,WGS/WES,Salivary Gland Cancer,Yes,Salivary gland


In [7]:
df2 = json_to_df(pd.read_json('http://cbioportal-rc.herokuapp.com/api/studies/acbc_mskcc_2015/clinical-data?projection=SUMMARY&pageNumber=0&direction=ASC'))
df2.head(5)

Unnamed: 0,IHC_HER2,CANCER_TYPE_DETAILED,MYB_NFIB_CNA,METASTATIC_SITE,ER_STATUS_BY_IHC,TUMOR_TISSUE_SITE,TYPE_OF_SURGERY,ONCOTREE_CODE,MYB_NFIB_NONSYNONYMOUS_COUNT,TUMOR_STAGE,SAMPLE_TYPE,PR_STATUS_BY_IHC,PLATFORM,CANCER_TYPE,METASTATIC_TUMOR_INDICATOR,TUMOR_SIZE,PRIMARY_SITE
AdCC11T,Negative,Adenoid Cystic Breast Cancer,17q21-q25.1 gain,,Negative,Breast,Breast Conservation,ACBC,7,II,Primary,Negative,WES,Breast Cancer,No,24,Breast
AdCC8T,Negative,Adenoid Cystic Breast Cancer,12q12-q14.1 loss,,Negative,Breast,Breast Conservation/Mastectomy,ACBC,9,I,Primary,Negative,WES,Breast Cancer,No,17,Breast
AdCC5T,Negative,Adenoid Cystic Breast Cancer,9q13-q34.2 loss,,Negative,Breast,Breast Conservation/Mastectomy,ACBC,11,II,Primary,Negative,WES,Breast Cancer,No,40,Breast
AdCC12T,Negative,Adenoid Cystic Breast Cancer,,,Negative,Breast,Breast Conservation,ACBC,18,II,Primary,Negative,WES,Breast Cancer,No,20,Breast
AdCC32T,Negative,Adenoid Cystic Breast Cancer,12q12-q14.1 loss,,Negative,Breast,Mastectomy,ACBC,14,I,Primary,Negative,WES,Breast Cancer,No,35,Breast


In [8]:
# read other studies
df3 = json_to_df(pd.read_json('http://cbioportal-rc.herokuapp.com/api/studies/ov_tcga/clinical-data?projection=SUMMARY&pageNumber=0&direction=ASC'))
df4 = json_to_df(pd.read_json('http://cbioportal-rc.herokuapp.com/api/studies/ccrcc_utokyo_2013/clinical-data?projection=SUMMARY&pageNumber=0&direction=ASC'))

### Jaccard similarity between study attributes

In [9]:
import numpy as np

def similarity(df1, df2):
    # get the unique attributes
    study1_attrs = set(df1.columns.values)
    study2_attrs = set(df2.columns.values)

    # create a zero-filled dataframe of the desired dimension
    # (lists, because recent pandas does not accept a set as an index)
    df = pd.DataFrame(0.0, index=list(study1_attrs), columns=list(study2_attrs))

    # loop over unique values by column per dataframe
    # first go over unique values in dataframe1 columns
    for i in study1_attrs:
        unique_vals_1 = set(df1[i].unique())
        # then go over unique values in dataframe2 columns
        for j in study2_attrs:
            # get unique vals for dataframe2
            unique_vals_2 = set(df2[j].unique())
            # get the cardinality
            intersection_cardinality = len(unique_vals_1 & unique_vals_2)
            union_cardinality = len(unique_vals_1 | unique_vals_2)
            # jaccard coefficient, taken as 0 when both columns are empty
            jaccard_coef = intersection_cardinality / union_cardinality if union_cardinality else 0.0
            # set the value into pandas dataframe
            # (.at replaces set_value/get_value, which were removed in pandas 1.0)
            df.at[i, j] = jaccard_coef

    # if you print 'df' you can actually get the jaccard coefficient between each column per study

    # now format the presentation of similar attributes
    similar_attributes_value = ''
    for i in df.index.values:
        for j in df.columns.values:
            if df.at[i, j] > 0.5:
                similar_attributes_value += (i + ' with ' + j + ' and Jacardi coef = ' + str(df.at[i, j]) + ', ')

    # get the same named columns
    same_attributes = study1_attrs.intersection(study2_attrs)
    # these are unique attributes in each study
    unique_attrs_study1 = study1_attrs - same_attributes
    unique_attrs_study2 = study2_attrs - same_attributes

    print("Study1 attributes:", study1_attrs)
    print()
    print("Study2 attributes:", study2_attrs)
    print()
    print("-----------------------------------------------------------------------------------------\n")
    print("Same attributesin both studies:", same_attributes)
    print("-------------------------------------\n")
    print("Similar attributes based on Jaccardi coefficient:", similar_attributes_value)

    return
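
For reference, the coefficient computed inside similarity is the Jaccard index of two value sets, J(A, B) = |A ∩ B| / |A ∪ B|. The stand-alone sketch below shows it on toy values. The helper name `jaccard` is hypothetical, and returning 0.0 for two empty sets is a convention chosen here, not something the notebook defines.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # convention chosen for this sketch: two empty sets score 0.0
        return 0.0
    return len(a & b) / len(union)


# toy values: one shared value out of two distinct values overall
print(jaccard(["Primary", "Metastasis"], ["Primary"]))  # 0.5
```

With the 0.5 threshold used in similarity, this pair would not be reported, because the comparison is strictly greater than 0.5.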


In [10]:
similarity(df1,df2)

Study1 attributes: {'METASTATIC_SITE', 'CANCER_TYPE_DETAILED', 'TUMOR_TISSUE_SITE', 'PERINEURAL_INVASION', 'ONCOTREE_CODE', 'TUMOR_STAGE', 'SAMPLE_TYPE', 'PLATFORM', 'CANCER_TYPE', 'METASTATIC_TUMOR_INDICATOR', 'PRIMARY_SITE'}

Study2 attributes: {'IHC_HER2', 'CANCER_TYPE_DETAILED', 'MYB_NFIB_CNA', 'METASTATIC_SITE', 'ER_STATUS_BY_IHC', 'TUMOR_TISSUE_SITE', 'TYPE_OF_SURGERY', 'ONCOTREE_CODE', 'MYB_NFIB_NONSYNONYMOUS_COUNT', 'TUMOR_STAGE', 'SAMPLE_TYPE', 'PR_STATUS_BY_IHC', 'PLATFORM', 'CANCER_TYPE', 'METASTATIC_TUMOR_INDICATOR', 'TUMOR_SIZE', 'PRIMARY_SITE'}

-----------------------------------------------------------------------------------------

Same attributesin both studies: {'METASTATIC_SITE', 'CANCER_TYPE_DETAILED', 'TUMOR_TISSUE_SITE', 'ONCOTREE_CODE', 'TUMOR_STAGE', 'SAMPLE_TYPE', 'PLATFORM', 'CANCER_TYPE', 'METASTATIC_TUMOR_INDICATOR', 'PRIMARY_SITE'}
-------------------------------------

Similar attributes based on Jaccardi coefficient: SAMPLE_TYPE with SAMPLE_TYPE and Jaca

In [11]:
similarity(df3,df4)

Study1 attributes: {'OTHER_SAMPLE_ID', 'PATHOLOGY_REPORT_FILE_NAME', 'SPECIMEN_SECOND_LONGEST_DIMENSION', 'SAMPLE_TYPE', 'VIAL_NUMBER', 'SHORTEST_DIMENSION', 'SAMPLE_TYPE_ID', 'LONGEST_DIMENSION', 'IS_FFPE', 'PATHOLOGY_REPORT_UUID'}

Study2 attributes: {'METASTATIC_SITE', 'CANCER_TYPE_DETAILED', 'STAGE_AT_DIAGNOSIS', 'GENE_EXPRESSION_CLUSTER', 'GRADE', 'SARCOMATOID_COMPONENT', 'VHL_MUTATION_AA', 'VHL_MUTATION_CODON', 'CANCER_TYPE', 'GENE_PANEL'}

-----------------------------------------------------------------------------------------

Same attributesin both studies: set()
-------------------------------------

Similar attributes based on Jaccardi coefficient: 
