### Using `cbiotorch`: an example workflow
In this notebook we'll use some of the key functionality provided by the `cbiotorch` package to develop a simple PyTorch prediction model.

#### What is `cbiotorch` for?
CBioPortal is a fantastic resource of curated cancer genomics datasets. Mutation profiles for samples from cancer patients are an excellent basis for developing predictive models of clinical cancer outcomes, but they require substantial pre-processing and reconciliation. This includes:
* reconciliation of data across multiple studies, including the use of varying gene panels;
* separation/pooling of different cancer types; 
* identification and cleaning of clinical outcomes; and
* data processing for ease of use with ML libraries.

This package addresses these needs and prepares CBioPortal datasets for use with PyTorch, a popular and flexible library for applying ML methods. The following tutorial demonstrates a simple application of that workflow by loading two studies.

#### Loading CBioPortal datasets with `cbiotorch`

First off we need some studies. In `cbiotorch`, these are stored in the `MutationDataset` class. We can specify which of these we use by providing a list of study identifiers. Here we'll use two studies, "msk_impact_2017" and "tmb_mskcc_2018". 

In [1]:
from cbiotorch.data import MutationDataset

msk_mutations = MutationDataset(study_id=["msk_impact_2017", "tmb_mskcc_2018"])
msk_mutations.write(replace=True)

Note that this takes a little while to run. Because we don't have the datasets loaded, `cbiotorch` has to query CBioPortal's REST API. We can write the datasets to file using the `.write()` method. Once we've run this, in future `MutationDataset` will look for the saved files, and so this will be much faster.
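
The load-or-fetch behaviour described above can be sketched in plain Python. This is only an illustration of the pattern, not how `cbiotorch` is implemented: `load_or_fetch`, `fake_fetch`, and the one-JSON-file-per-study layout are hypothetical names and choices made for this example.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(study_id, cache_dir, fetch, replace=False):
    """Return records for a study, reading a cached file when one exists."""
    cache_file = Path(cache_dir) / f"{study_id}.json"
    if cache_file.exists() and not replace:
        # Fast path: a previous run already wrote this study to disk.
        return json.loads(cache_file.read_text())
    # Slow path: query the (possibly remote) source, then cache the result.
    records = fetch(study_id)
    cache_file.write_text(json.dumps(records))
    return records


calls = []


def fake_fetch(study_id):
    """Hypothetical stand-in for a slow API query; logs each call."""
    calls.append(study_id)
    return [{"studyId": study_id, "hugoGeneSymbol": "KEAP1"}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("msk_impact_2017", tmp, fake_fetch)
    second = load_or_fetch("msk_impact_2017", tmp, fake_fetch)

print(f"fetch calls: {len(calls)}")
```

The second call is served from the cached file, so the fetch function runs only once. Passing `replace=True` forces a fresh fetch and overwrites the file.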

One problem when combining multiple datasets (and sometimes even within a single dataset) is that different gene panels are used to profile different samples. This is a problem for prediction, because an absent mutation could mean either that the gene was unmutated or that it was never profiled. In this case, some samples were profiled using the "IMPACT341" panel and some using the "IMPACT410" panel. `MutationDataset` automatically generates a "maximal valid gene panel", i.e. the set of genes that were profiled in every sample across the data. We can see below that this is 341 genes long (i.e. simply the IMPACT341 panel, which is a subset of IMPACT410), and look at some of the genes it contains.

In [2]:
print(f"Length of maximum viable panel: {len(msk_mutations.auto_gene_panel)}")
print(f"Some genes in that panel: {', '.join(msk_mutations.auto_gene_panel[:5])}.")

Length of maximum viable panel: 341
Some genes in that panel: KEAP1, IFNGR1, DAXX, BARD1, CHEK1.
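
The idea behind a maximal valid panel is a set intersection over the panels actually used. A minimal sketch, assuming toy panels (the gene lists and panel names below are illustrative, not the real IMPACT panels, and this is not necessarily how `cbiotorch` computes `auto_gene_panel`):

```python
# Toy panels: a small panel that is a subset of a larger one.
panels = {
    "PANEL_SMALL": {"KEAP1", "DAXX", "BARD1"},
    "PANEL_LARGE": {"KEAP1", "DAXX", "BARD1", "CHEK1", "IFNGR1"},
}
# Which panel each sample was profiled with.
sample_panels = {"sample_1": "PANEL_SMALL", "sample_2": "PANEL_LARGE"}

# Only panels that were actually used constrain the result.
used_panels = [panels[name] for name in set(sample_panels.values())]

# A gene is safe to use as a feature only if every sample profiled it.
shared_genes = sorted(set.intersection(*used_panels))
print(shared_genes)  # ['BARD1', 'DAXX', 'KEAP1']
```

Because the small panel is a subset of the large one, the intersection is just the small panel, mirroring the IMPACT341/IMPACT410 result above.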


#### Pre-processing data: lung cancer example

To use mutation datasets properly, we have to apply various processing steps. These can occur at different stages in a workflow, but we achieve all of them using *transforms*. Transforms in `cbiotorch` come (unsurprisingly) from the `transforms` module, which is designed to behave similarly to the `torchvision` module of the same name. Broadly speaking, there are two stages at which we might employ them: at dataset initialisation, where they are applied to the entire dataset as it is assembled (i.e. before it is written to file in the example use of `MutationDataset` above), and during data loading in model training, where they are applied only to individual samples. We'll discuss the latter type of transform later on, and for now focus on situations where we want to apply a pre-processing transform. Here we'll assume we only want to work with lung cancer samples.
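
To make the idea of a dataset-level transform concrete, here is a minimal sketch of callable transform objects chained together, in the spirit of `torchvision.transforms.Compose`. The classes `FilterByValue`, `KeepColumns`, and `Compose` are hypothetical names written for this illustration and are not taken from `cbiotorch`; the rows are toy data.

```python
class FilterByValue:
    """Dataset-level transform: keep rows whose column equals a value."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def __call__(self, rows):
        return [row for row in rows if row[self.column] == self.value]


class KeepColumns:
    """Dataset-level transform: drop every column not listed."""

    def __init__(self, columns):
        self.columns = columns

    def __call__(self, rows):
        return [{col: row[col] for col in self.columns} for row in rows]


class Compose:
    """Chain transforms, feeding each one the output of the previous."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, data):
        for transform in self.transforms:
            data = transform(data)
        return data


# Toy rows; the values are made up for the example.
rows = [
    {"sampleId": "S-1", "CANCER_TYPE": "Lung Cancer", "hugoGeneSymbol": "KEAP1"},
    {"sampleId": "S-2", "CANCER_TYPE": "Breast Cancer", "hugoGeneSymbol": "BARD1"},
    {"sampleId": "S-3", "CANCER_TYPE": "Lung Cancer", "hugoGeneSymbol": "DAXX"},
]

lung_only = Compose(
    [
        FilterByValue("CANCER_TYPE", "Lung Cancer"),
        KeepColumns(["sampleId", "hugoGeneSymbol"]),
    ]
)
lung_rows = lung_only(rows)
print(lung_rows)
```

Writing each step as a small callable keeps the steps reusable: the same object can be applied once to a whole table at initialisation, or wrapped to run on one sample at a time during loading.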

#### Extracting clinical outcomes

#### Training a model

In [5]:
from cbiotorch.transforms import ToSparseCountTensor

transform_sparse = ToSparseCountTensor(
    dims=["hugoGeneSymbol", "variantType"], dim_refs=msk_mutations.auto_dim_refs
)
msk_mutations.add_transform(transform_sparse)
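
As a rough picture of what a sparse count encoding over two dimensions involves, the sketch below maps each mutation to a pair of integer indices and counts repeats. It is an assumption-laden illustration, not the `ToSparseCountTensor` implementation: the function `to_sparse_counts`, the shape of `dim_refs` (a dict from dimension name to a list of reference values), and the toy mutations are all hypothetical.

```python
from collections import Counter

# Assumed layout: one ordered list of reference values per dimension.
dim_refs = {
    "hugoGeneSymbol": ["KEAP1", "DAXX", "BARD1"],
    "variantType": ["SNP", "DEL", "INS"],
}

# Toy mutations for a single sample.
mutations = [
    {"hugoGeneSymbol": "KEAP1", "variantType": "SNP"},
    {"hugoGeneSymbol": "KEAP1", "variantType": "SNP"},
    {"hugoGeneSymbol": "BARD1", "variantType": "DEL"},
]


def to_sparse_counts(mutations, dims, dim_refs):
    """Return COO-style (indices, values, shape) counting mutations per cell."""
    # Position of each reference value along its dimension.
    lookup = {dim: {v: i for i, v in enumerate(dim_refs[dim])} for dim in dims}
    # Count how many mutations fall into each (gene, variant type) cell.
    counts = Counter(
        tuple(lookup[dim][mutation[dim]] for dim in dims) for mutation in mutations
    )
    indices = sorted(counts)
    values = [counts[index] for index in indices]
    shape = tuple(len(dim_refs[dim]) for dim in dims)
    return indices, values, shape


indices, values, shape = to_sparse_counts(
    mutations, ["hugoGeneSymbol", "variantType"], dim_refs
)
print(indices, values, shape)  # [(0, 0), (2, 1)] [2, 1] (3, 3)
```

Only the non-zero cells are stored, which matters because most genes are unmutated in any one sample. With PyTorch installed, the transposed index list and the values could be handed to `torch.sparse_coo_tensor` to obtain an actual sparse tensor.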

### Mess

In [3]:
from cbiotorch.data import ClinicalDataset

msk_clinical = ClinicalDataset(["msk_impact_2017", "tmb_mskcc_2018"])
msk_clinical.write(replace=True)

In [5]:
from cbiotorch.cbioportal import CBioPortalSwaggerClient


client = CBioPortalSwaggerClient.from_url(
    "https://www.cbioportal.org/api/v2/api-docs",
    config={
        "validate_requests": False,
        "validate_responses": False,
        "validate_swagger_spec": False,
    },
)

In [6]:
from cbiotorch.api import get_clinical_from_api


get_clinical_from_api(client, "msk_impact_2017").columns

Index(['studyId', 'patientId', 'uniqueSampleKey', 'uniquePatientKey',
       'sampleId', 'clinicalAttribute', 'CANCER_TYPE', 'CANCER_TYPE_DETAILED',
       'DNA_INPUT', 'FRACTION_GENOME_ALTERED', 'MATCHED_STATUS',
       'METASTATIC_SITE', 'MUTATION_COUNT', 'ONCOTREE_CODE', 'PRIMARY_SITE',
       'SAMPLE_CLASS', 'SAMPLE_COLLECTION_SOURCE', 'SAMPLE_COVERAGE',
       'SAMPLE_TYPE', 'SOMATIC_STATUS', 'SPECIMEN_PRESERVATION_TYPE',
       'SPECIMEN_TYPE', 'TMB_NONSYNONYMOUS', 'TUMOR_PURITY'],
      dtype='object', name='clinicalAttributeId')

In [10]:
from typing import cast, List

import pandas as pd
from tqdm import tqdm

from cbiotorch.cbioportal import ClinicalDataModel, PatientModel


def get_patients_from_api(client: CBioPortalSwaggerClient, study_id: str):
    """Get patients dataframe from CBioPortal API."""
    patients = client.Patients.getAllPatientsInStudyUsingGET(studyId=study_id).result()
    patients = cast(List[PatientModel], patients)
    patients_df = pd.DataFrame(
        [
            {
                patient_attribute: getattr(patient, patient_attribute)
                for patient_attribute in dir(patient)
            }
            for patient in patients
        ]
    )
    print("reading clinical patient data")
    patients_clinical = []
    for patient in tqdm(patients_df.patientId):
        patient_clinical = client.Clinical_Data.getAllClinicalDataOfPatientInStudyUsingGET(
            studyId=study_id, patientId=patient
        ).result()
        patient_clinical = cast(List[ClinicalDataModel], patient_clinical)
        patients_clinical += [
            {k: getattr(clinical_attribute, k) for k in dir(clinical_attribute)}
            for clinical_attribute in patient_clinical
        ]

    patient_clinical_long_df = pd.DataFrame(patients_clinical)
    patient_clinical_df = pd.pivot(
        data=patient_clinical_long_df,
        values="value",
        index=list(
            set(patient_clinical_long_df.columns) - set(["clinicalAttributeId"] + ["value"])
        ),
        columns=["clinicalAttributeId"],
    ).reset_index()
    return patient_clinical_df.drop(columns=["sampleId", "uniqueSampleKey"])
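
The pivot at the end of the function above turns a long table (one row per patient and attribute) into a wide one (one row per patient). The same reshaping can be sketched without pandas to show what it does; `long_to_wide` and the toy rows are hypothetical and only illustrate the idea.

```python
# Toy long-format rows, shaped like the API's per-attribute records.
long_rows = [
    {"patientId": "P-1", "clinicalAttributeId": "SEX", "value": "Female"},
    {"patientId": "P-1", "clinicalAttributeId": "OS_MONTHS", "value": "12"},
    {"patientId": "P-2", "clinicalAttributeId": "SEX", "value": "Male"},
]


def long_to_wide(rows, index, column, value):
    """Collect each index's (column, value) pairs into a single record."""
    wide = {}
    for row in rows:
        # Create the patient's record on first sight, then add the attribute.
        record = wide.setdefault(row[index], {index: row[index]})
        record[row[column]] = row[value]
    return list(wide.values())


wide_rows = long_to_wide(long_rows, "patientId", "clinicalAttributeId", "value")
print(wide_rows)
```

Unlike `pd.pivot`, this sketch simply omits attributes a patient does not have (here `OS_MONTHS` for `P-2`) rather than filling them with `NaN`.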

In [14]:
tmb_patients = get_patients_from_api(client, "tmb_mskcc_2018")

reading clinical patient data


100%|██████████| 1661/1661 [03:30<00:00,  7.89it/s]
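
One request per patient is slow when made sequentially, as the timing above shows. A common way to reduce the wall-clock time of many independent requests is a thread pool. The sketch below uses a hypothetical stand-in, `fetch_patient`, instead of the real client call; whether the Swagger client can safely be shared across threads, and whether the server tolerates concurrent requests, has not been checked here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_patient(patient_id):
    """Hypothetical stand-in for one per-patient API request."""
    time.sleep(0.01)  # simulate network latency
    return {"patientId": patient_id}


# Made-up identifiers for the example.
patient_ids = [f"PT-{i:04d}" for i in range(20)]

# pool.map returns results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_patient, patient_ids))

print(len(results))
```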


In [15]:
tmb_patients

clinicalAttributeId,studyId,patientId,uniquePatientKey,clinicalAttribute,AGE_GROUP,DRUG_TYPE,OS_MONTHS,OS_STATUS,SAMPLE_COUNT,SEX
0,tmb_mskcc_2018,P-0000057,UC0wMDAwMDU3OnRtYl9tc2tjY18yMDE4,,31-50,PD-1/PDL-1,0,1:DECEASED,1,Female
1,tmb_mskcc_2018,P-0000062,UC0wMDAwMDYyOnRtYl9tc2tjY18yMDE4,,>71,PD-1/PDL-1,1,1:DECEASED,1,Male
2,tmb_mskcc_2018,P-0000063,UC0wMDAwMDYzOnRtYl9tc2tjY18yMDE4,,61-70,PD-1/PDL-1,42,0:LIVING,1,Male
3,tmb_mskcc_2018,P-0000071,UC0wMDAwMDcxOnRtYl9tc2tjY18yMDE4,,61-70,PD-1/PDL-1,43,0:LIVING,1,Male
4,tmb_mskcc_2018,P-0000082,UC0wMDAwMDgyOnRtYl9tc2tjY18yMDE4,,50-60,PD-1/PDL-1,57,0:LIVING,1,Male
...,...,...,...,...,...,...,...,...,...,...
1656,tmb_mskcc_2018,P-0026892,UC0wMDI2ODkyOnRtYl9tc2tjY18yMDE4,,>71,PD-1/PDL-1,3,0:LIVING,1,Male
1657,tmb_mskcc_2018,P-0026970,UC0wMDI2OTcwOnRtYl9tc2tjY18yMDE4,,61-70,PD-1/PDL-1,4,0:LIVING,1,Male
1658,tmb_mskcc_2018,P-0027031,UC0wMDI3MDMxOnRtYl9tc2tjY18yMDE4,,31-50,PD-1/PDL-1,5,0:LIVING,1,Female
1659,tmb_mskcc_2018,P-0027041,UC0wMDI3MDQxOnRtYl9tc2tjY18yMDE4,,61-70,PD-1/PDL-1,41,0:LIVING,1,Female
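
The outcome columns in this table need cleaning before they can serve as model targets: `OS_STATUS` combines a numeric code and a label (e.g. `1:DECEASED`), and the API returns values such as `OS_MONTHS` as strings. A minimal sketch of one way to parse them, where `parse_survival` is a hypothetical helper rather than part of `cbiotorch`:

```python
def parse_survival(os_status, os_months):
    """Return (event, months): event is 1 for deceased and 0 for living."""
    # "1:DECEASED" -> code "1"; the label after the colon is not needed.
    code, _, _label = os_status.partition(":")
    return int(code), float(os_months)


print(parse_survival("1:DECEASED", "0"))   # (1, 0.0)
print(parse_survival("0:LIVING", "2.76"))  # (0, 2.76)
```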


In [47]:
from cbiotorch.api import get_patients_from_api


patients = get_patients_from_api(client=client, study_id="tmb_mskcc_2018")
for i, patient in enumerate(patients.patientId):
    print(i)
    client.Clinical_Data.getAllClinicalDataOfPatientInStudyUsingGET(
        studyId="tmb_mskcc_2018", patientId=patient
    ).result()

0
1
2
...
275
276
...

In [48]:
client.Clinical_Data.getAllClinicalDataOfPatientInStudyUsingGET(
    studyId="tmb_mskcc_2018", patientId=patient
).result()

[ClinicalData(clinicalAttribute=None, clinicalAttributeId='AGE_GROUP', patientId='P-0027346', sampleId=None, studyId='tmb_mskcc_2018', uniquePatientKey='UC0wMDI3MzQ2OnRtYl9tc2tjY18yMDE4', uniqueSampleKey=None, value='>71'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='DRUG_TYPE', patientId='P-0027346', sampleId=None, studyId='tmb_mskcc_2018', uniquePatientKey='UC0wMDI3MzQ2OnRtYl9tc2tjY18yMDE4', uniqueSampleKey=None, value='PD-1/PDL-1'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='OS_MONTHS', patientId='P-0027346', sampleId=None, studyId='tmb_mskcc_2018', uniquePatientKey='UC0wMDI3MzQ2OnRtYl9tc2tjY18yMDE4', uniqueSampleKey=None, value='3'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='OS_STATUS', patientId='P-0027346', sampleId=None, studyId='tmb_mskcc_2018', uniquePatientKey='UC0wMDI3MzQ2OnRtYl9tc2tjY18yMDE4', uniqueSampleKey=None, value='0:LIVING'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SAMPLE_COUNT', patientId='P-0027346

In [42]:
client.Clinical_Data.getAllClinicalDataOfPatientInStudyUsingGET(
    patientId="P-0004007", studyId="msk_impact_2017"
).result()

[ClinicalData(clinicalAttribute=None, clinicalAttributeId='OS_MONTHS', patientId='P-0004007', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA3Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='2.76'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='OS_STATUS', patientId='P-0004007', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA3Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='0:LIVING'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SAMPLE_COUNT', patientId='P-0004007', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA3Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='2'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SEX', patientId='P-0004007', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA3Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='Male'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SMOKING_HISTORY', patientId='P-

In [38]:
client.Clinical_Data.getAllClinicalDataOfPatientInStudyUsingGET(
    patientId="P-0004005", studyId="msk_impact_2017"
).result()

[ClinicalData(clinicalAttribute=None, clinicalAttributeId='OS_MONTHS', patientId='P-0004005', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA1Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='22.26'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='OS_STATUS', patientId='P-0004005', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA1Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='0:LIVING'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SAMPLE_COUNT', patientId='P-0004005', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA1Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='1'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SEX', patientId='P-0004005', sampleId=None, studyId='msk_impact_2017', uniquePatientKey='UC0wMDA0MDA1Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey=None, value='Female'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='SMOKING_HISTORY', patientId=

In [27]:
import pandas as pd

clinical_alt = client.Clinical_Attributes.getAllClinicalAttributesInStudyUsingGET(
    studyId="msk_impact_2017"
).result()
clinical_alt_df_long = pd.DataFrame(
    [
        {clinical_attribute: getattr(c, clinical_attribute) for clinical_attribute in dir(c)}
        for c in clinical_alt
    ]
)

clinical_alt_df = pd.pivot(
    data=clinical_alt_df_long,
    values="value",
    index=list(set(clinical_alt_df_long.columns) - set(["clinicalAttributeId"] + ["value"])),
    columns=["clinicalAttributeId"],
).reset_index()

KeyError: 'value'

In [19]:
from cbiotorch.api import get_samples_from_api


get_samples_from_api(client, "tmb_mskcc_2018")

Unnamed: 0,copyNumberSegmentPresent,patientId,sampleId,sampleType,sequenced,studyId,uniquePatientKey,uniqueSampleKey
0,,P-0000057,P-0000057-T01-IM3,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDAwMDU3OnRtYl9tc2tjY18yMDE4,UC0wMDAwMDU3LVQwMS1JTTM6dG1iX21za2NjXzIwMTg
1,,P-0000062,P-0000062-T01-IM3,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDAwMDYyOnRtYl9tc2tjY18yMDE4,UC0wMDAwMDYyLVQwMS1JTTM6dG1iX21za2NjXzIwMTg
2,,P-0000063,P-0000063-T01-IM3,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDAwMDYzOnRtYl9tc2tjY18yMDE4,UC0wMDAwMDYzLVQwMS1JTTM6dG1iX21za2NjXzIwMTg
3,,P-0000071,P-0000071-T01-IM3,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDAwMDcxOnRtYl9tc2tjY18yMDE4,UC0wMDAwMDcxLVQwMS1JTTM6dG1iX21za2NjXzIwMTg
4,,P-0000082,P-0000082-T01-IM3,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDAwMDgyOnRtYl9tc2tjY18yMDE4,UC0wMDAwMDgyLVQwMS1JTTM6dG1iX21za2NjXzIwMTg
...,...,...,...,...,...,...,...,...
1656,,P-0026892,P-0026892-T01-IM6,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDI2ODkyOnRtYl9tc2tjY18yMDE4,UC0wMDI2ODkyLVQwMS1JTTY6dG1iX21za2NjXzIwMTg
1657,,P-0026970,P-0026970-T01-IM6,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDI2OTcwOnRtYl9tc2tjY18yMDE4,UC0wMDI2OTcwLVQwMS1JTTY6dG1iX21za2NjXzIwMTg
1658,,P-0027031,P-0027031-T01-IM6,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDI3MDMxOnRtYl9tc2tjY18yMDE4,UC0wMDI3MDMxLVQwMS1JTTY6dG1iX21za2NjXzIwMTg
1659,,P-0027041,P-0027041-T01-IM6,Primary Solid Tumor,,tmb_mskcc_2018,UC0wMDI3MDQxOnRtYl9tc2tjY18yMDE4,UC0wMDI3MDQxLVQwMS1JTTY6dG1iX21za2NjXzIwMTg
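
Sample-level and patient-level tables share the `patientId` column, so patient outcomes can be attached to each sample with a join on that key. The sketch below shows the idea on toy data with a hypothetical helper, `attach_patient_data`; the identifiers are made up, and a patient may contribute more than one sample.

```python
# Toy patient-level outcomes keyed by patient identifier.
patients = {
    "PT-1": {"OS_STATUS": "0:LIVING", "OS_MONTHS": "42"},
    "PT-2": {"OS_STATUS": "1:DECEASED", "OS_MONTHS": "3"},
}
# Toy sample-level rows; PT-1 has two samples, PT-3 has no outcome data.
samples = [
    {"sampleId": "PT-1-T01", "patientId": "PT-1"},
    {"sampleId": "PT-1-T02", "patientId": "PT-1"},
    {"sampleId": "PT-2-T01", "patientId": "PT-2"},
    {"sampleId": "PT-3-T01", "patientId": "PT-3"},
]


def attach_patient_data(samples, patients):
    """Inner join: copy each patient's fields onto that patient's samples."""
    merged = []
    for sample in samples:
        patient = patients.get(sample["patientId"])
        if patient is None:
            continue  # no outcome recorded for this patient, so skip the sample
        merged.append({**sample, **patient})
    return merged


merged = attach_patient_data(samples, patients)
print(len(merged))  # 3
```

Dropping samples without a matching patient keeps every remaining row usable as a training example. Keeping them with missing outcomes instead would be the equivalent of a left join.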
