### Example `cbiotorch` workflow
In this notebook we'll use some of the key functionality provided by the `cbiotorch` package to develop a simple PyTorch prediction model.

#### What is `cbiotorch` for?
CBioPortal is a fantastic resource of curated cancer genomics datasets. Mutation profiles for samples from cancer patients are an excellent basis for developing predictive models of clinical cancer outcomes, but they require substantial pre-processing and reconciliation. This includes
* reconciliation of data across multiple studies, including the use of varying gene panels;
* separation/pooling of different cancer types; 
* identification and cleaning of clinical outcomes; and
* data processing for ease of use with ML libraries.

This package handles these steps and prepares CBioPortal datasets for use with PyTorch, a popular and flexible library for applying ML methods. The following tutorial demonstrates a simple application of that workflow by loading two 

#### Loading CBioPortal datasets with `cbiotorch`

First, we need some studies. In `cbiotorch`, these are held in a `MutationDataset` object, and we choose which studies to load by passing a list of study identifiers. Here we'll use two studies, "msk_impact_2017" and "tmb_mskcc_2018". 

In [1]:
from cbiotorch.data import MutationDataset

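# Load both studies into one dataset (queries CBioPortal if no saved copy is found).
msk_mutations = MutationDataset(study_id=["msk_impact_2017", "tmb_mskcc_2018"])
# Save the data to file so that later runs can read it locally.
msk_mutations.write(replace=True)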

Note that this takes a little while to run. Because the datasets are not yet saved locally, `cbiotorch` has to query CBioPortal's REST API. We can write the datasets to file using the `.write()` method. Once this has been run, `MutationDataset` will look for the saved files in future sessions, so loading will be much faster.
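
The "query once, then read from file" behaviour described above is a common caching pattern. The sketch below illustrates the general idea only. It is not how `cbiotorch` is implemented: `load_or_fetch`, `fake_fetch` and the JSON file layout are all hypothetical names and choices made for this example.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(study_id, cache_dir, fetch):
    """Return records for a study, reading a local file when one exists."""
    cache_file = Path(cache_dir) / f"{study_id}.json"
    if cache_file.exists():
        # Fast path: reuse what an earlier run wrote to disk.
        return json.loads(cache_file.read_text())
    # Slow path: call the (hypothetical) remote fetcher, then save the result.
    records = fetch(study_id)
    cache_file.write_text(json.dumps(records))
    return records


calls = []


def fake_fetch(study_id):
    """Stand-in for a slow API request; it records each call it receives."""
    calls.append(study_id)
    return [{"sample": "S1", "gene": "TP53"}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_fetch("demo_study", cache_dir, fake_fetch)
    second = load_or_fetch("demo_study", cache_dir, fake_fetch)
```

Here the second call never reaches `fake_fetch`, because the file written by the first call is found on disk. Unlike this sketch, `cbiotorch` saves the data only when `.write()` is called explicitly.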

One problem when combining multiple datasets (and sometimes even within a single dataset) is that different gene panels are used to profile different samples. This can be a problem for prediction, because for a gene with no recorded mutation it is not necessarily possible to tell whether it was unmutated or simply not profiled. In this case, some samples were profiled using the "IMPACT341" panel and some using the "IMPACT410" panel. `MutationDataset` automatically generates a "maximum viable gene panel", i.e. the set of genes that were profiled in every sample across the data. We can see below that this is 341 genes long (i.e. simply the IMPACT341 panel, which is a subset of IMPACT410), and look at some of the genes it contains.

In [13]:
print(f"Length of maximum viable panel: {len(msk_mutations.auto_gene_panel)}")
print(f"Some genes in that panel: {', '.join(msk_mutations.auto_gene_panel[:5])}.")

Length of maximum viable panel: 341
Some genes in that panel: KEAP1, IFNGR1, DAXX, BARD1, CHEK1.
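
Conceptually, a panel of "genes profiled in every sample" is the intersection of the panels used across the samples. The sketch below shows that idea with plain Python sets. The panel names, gene lists and sample IDs are toy values invented for illustration, and this is not necessarily how `cbiotorch` computes `auto_gene_panel` internally.

```python
# Toy panels: names and gene lists are made up and far shorter than real panels.
panels = {
    "PANEL_SMALL": {"TP53", "KRAS", "EGFR"},
    "PANEL_LARGE": {"TP53", "KRAS", "EGFR", "BRAF", "PIK3CA"},
}

# Which (hypothetical) panel each sample was profiled with.
sample_panels = {"S1": "PANEL_SMALL", "S2": "PANEL_LARGE", "S3": "PANEL_LARGE"}

# Keep only the genes that appear in the panel of every sample.
shared_genes = set.intersection(*(panels[name] for name in sample_panels.values()))
shared = sorted(shared_genes)

print(f"Genes profiled in every sample: {', '.join(shared)}")
```

Because the smaller toy panel is a subset of the larger one, the intersection is just the smaller panel, which mirrors the IMPACT341/IMPACT410 result above.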


#### Pre-processing data: lung cancer example

#### Extracting clinical outcomes

In [5]:
from cbiotorch.transforms import ToSparseCountTensor


transform_sparse = ToSparseCountTensor(
    dims=["hugoGeneSymbol", "variantType"], dim_refs=msk_mutations.auto_dim_refs()
)
msk_mutations.add_transform(transform_sparse)
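
To give an intuition for what a sparse count representation over `hugoGeneSymbol` and `variantType` might look like, here is a minimal pure-Python sketch. It assumes the transform counts mutations per (gene, variant type) pair and indexes them against a fixed reference list for each dimension. That reading is inferred from the class and argument names, not from the `cbiotorch` source, and `to_sparse_counts`, the toy records and the `dim_refs` values are all hypothetical.

```python
from collections import Counter

# Toy mutation records for one sample (values invented for illustration).
mutations = [
    {"hugoGeneSymbol": "TP53", "variantType": "SNP"},
    {"hugoGeneSymbol": "TP53", "variantType": "SNP"},
    {"hugoGeneSymbol": "KRAS", "variantType": "DEL"},
]

# Assumed reference values for each dimension; position in the list is the index.
dims = ["hugoGeneSymbol", "variantType"]
dim_refs = {
    "hugoGeneSymbol": ["EGFR", "KRAS", "TP53"],
    "variantType": ["DEL", "INS", "SNP"],
}


def to_sparse_counts(records, dims, dim_refs):
    """Count records per coordinate and return them in coordinate (COO) layout."""
    lookup = {d: {value: i for i, value in enumerate(dim_refs[d])} for d in dims}
    counts = Counter(tuple(lookup[d][rec[d]] for d in dims) for rec in records)
    coords = sorted(counts)
    # One row of indices per dimension, one column per non-zero entry.
    indices = [[coord[k] for coord in coords] for k in range(len(dims))]
    values = [counts[coord] for coord in coords]
    shape = tuple(len(dim_refs[d]) for d in dims)
    return indices, values, shape


indices, values, shape = to_sparse_counts(mutations, dims, dim_refs)
```

The `indices`, `values` and `shape` outputs follow the coordinate layout that `torch.sparse_coo_tensor(indices, values, size)` accepts, so only the two non-zero cells of the 3 x 3 grid need to be stored.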