### Using `cbiotorch`: an example workflow
In this notebook we'll use some of the key functionality provided by the `cbiotorch` package to develop a simple PyTorch prediction model.

#### What is `cbiotorch` for?
CBioPortal is a fantastic resource of curated cancer genomics datasets. Mutation profiles from cancer patients' samples are an excellent basis for developing predictive models of clinical cancer outcomes, but they require substantial pre-processing and reconciliation. This includes
* reconciliation of data across multiple studies, including the use of varying gene panels;
* separation/pooling of different cancer types; 
* identification and cleaning of clinical outcomes; and
* data processing for ease of use with ML libraries.

This package addresses these tasks and prepares CBioPortal datasets for use with PyTorch, a popular and flexible library for applying ML methods. The following tutorial demonstrates a simple application of that workflow by loading two studies.

#### Loading CBioPortal datasets with `cbiotorch`

First off we need some studies. In `cbiotorch`, these are stored in the `MutationDataset` class. We can specify which of these we use by providing a list of study identifiers. Here we'll use two studies, "msk_impact_2017" and "tmb_mskcc_2018". 

In [4]:
from cbiotorch.data import MutationDataset

msk_mutations = MutationDataset(study_id=["msk_impact_2017", "tmb_mskcc_2018"])
msk_mutations.write(replace=True)

INFO:cbiotorch:Searching for mutations from File for msk_impact_2017.
  mutations_df = pd.read_csv(join(self.from_dir, study_id, "mutations.csv"))
INFO:cbiotorch:Read mutations
INFO:cbiotorch:Read samples
INFO:cbiotorch:Read sample/genes
INFO:cbiotorch:Searching for mutations from File for tmb_mskcc_2018.
INFO:cbiotorch:Read mutations
INFO:cbiotorch:Read samples
INFO:cbiotorch:Read sample/genes


Note that this can take a little while to run. If the datasets have not been saved locally, `cbiotorch` has to query CBioPortal's REST API. The `.write()` method writes the datasets to file; on later runs, `MutationDataset` will look for the saved files, which is much faster.
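The load-from-file-if-present behaviour described above is a common caching pattern. The following is a minimal sketch of that pattern only, not of how `cbiotorch` implements it: `fetch_from_api`, `load_study`, the JSON file layout and the records are all hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path


def fetch_from_api(study_id):
    """Hypothetical stand-in for a slow remote query; returns made-up records."""
    return [{"study": study_id, "gene": "GENE_A"}]


def load_study(study_id, cache_dir, fetch=fetch_from_api):
    """Return (records, source): read the saved file if present, else fetch and save."""
    path = Path(cache_dir) / f"{study_id}.json"
    if path.exists():
        return json.loads(path.read_text()), "file"
    records = fetch(study_id)
    path.write_text(json.dumps(records))
    return records, "api"


with tempfile.TemporaryDirectory() as cache_dir:
    # The first call has nothing saved, so it fetches; the second reads the file.
    first, first_source = load_study("example_study", cache_dir)
    second, second_source = load_study("example_study", cache_dir)

print(first_source, second_source)
```

The second call returns the same records without touching the (simulated) remote source, which is why repeated runs are faster once the data has been written.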

One problem when combining multiple datasets (and sometimes even within a single dataset) is that different gene panels are used to profile different samples. This can be a problem for prediction, because for a gene outside a sample's panel we cannot tell whether it was unmutated or simply not profiled. In this case, some samples were profiled using the "IMPACT341" panel and some using the "IMPACT410" panel. `MutationDataset` automatically generates a "maximal valid gene panel", i.e. the set of genes that were profiled in every sample across the data. We can see below that this is 341 genes long (i.e. simply the IMPACT341 panel, which is a subset of IMPACT410), and look at some of the genes it contains.

In [2]:
print(f"Length of maximum viable panel: {len(msk_mutations.auto_gene_panel)}")
print(f"Some genes in that panel: {', '.join(msk_mutations.auto_gene_panel[:5])}.")

Length of maximum viable panel: 341
Some genes in that panel: KEAP1, IFNGR1, DAXX, BARD1, CHEK1.
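Conceptually, a panel of genes profiled in every sample is the intersection of the gene panels in use. Here is a small sketch of that idea with toy data; the panel names, gene names and the `maximal_valid_panel` helper are placeholders invented for illustration and are not part of `cbiotorch`.

```python
# Toy panels: names and contents are placeholders, not the real IMPACT panels.
panels = {
    "PANEL_SMALL": {"GENE_A", "GENE_B", "GENE_C"},
    "PANEL_LARGE": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
}
# Which panel was used to profile each (made-up) sample.
sample_panels = {
    "sample_1": "PANEL_SMALL",
    "sample_2": "PANEL_LARGE",
    "sample_3": "PANEL_LARGE",
}


def maximal_valid_panel(sample_panels, panels):
    """Genes profiled in every sample: the intersection of all panels in use."""
    used = set(sample_panels.values())
    if not used:
        return []
    common = set.intersection(*(panels[name] for name in used))
    return sorted(common)


auto_panel = maximal_valid_panel(sample_panels, panels)
print(auto_panel)
```

Because the smaller toy panel is a subset of the larger one, the result is simply the smaller panel, mirroring the IMPACT341/IMPACT410 case above.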


#### Pre-processing data: lung cancer example

To use mutation datasets properly, we have to apply various processing steps. These can occur at different stages in a workflow, but we achieve all of them using *transforms*. Transforms in `cbiotorch` come (unsurprisingly) from the `transform` module, which is designed to behave similarly to `torchvision.transforms`. Broadly speaking, there are two stages at which we might employ them: at dataset initiation, where they are applied to the entire dataset as it is assembled (i.e. before it is written to file in the example use of `MutationDataset` above), and during data loading in model training, where they are applied only to individual samples. We'll discuss the latter type of transform later on, and for now focus on situations where we want to apply a pre-processing transform. Here we'll assume we only want to work with lung cancer samples.
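To make the two stages concrete, here is a rough sketch of the general pattern using plain Python callables. It is an illustration under assumptions, not the `cbiotorch` API: `FilterByCancerType`, `CountMutations`, `ToyDataset` and the records are all hypothetical.

```python
class FilterByCancerType:
    """Dataset-level step: keep only records of one cancer type."""

    def __init__(self, cancer_type):
        self.cancer_type = cancer_type

    def __call__(self, records):
        return [r for r in records if r["cancer_type"] == self.cancer_type]


class CountMutations:
    """Sample-level step: reduce one record to its number of mutated genes."""

    def __call__(self, record):
        return len(record["mutated_genes"])


class ToyDataset:
    """Applies pre_transform once at construction and transform on each access."""

    def __init__(self, records, pre_transform=None, transform=None):
        self.records = pre_transform(records) if pre_transform else list(records)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        record = self.records[index]
        return self.transform(record) if self.transform else record


# Made-up records for illustration only.
records = [
    {"cancer_type": "lung", "mutated_genes": ["GENE_A", "GENE_B"]},
    {"cancer_type": "breast", "mutated_genes": ["GENE_C"]},
    {"cancer_type": "lung", "mutated_genes": ["GENE_A"]},
]

dataset = ToyDataset(
    records,
    pre_transform=FilterByCancerType("lung"),
    transform=CountMutations(),
)
print(len(dataset), dataset[0], dataset[1])
```

The filter runs once over the whole collection, while the per-sample transform runs lazily each time an item is requested, which is the same split between dataset initiation and data loading described above.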

#### Extracting clinical outcomes

We might want to compare which clinical features are available in each of these two studies. We can do this using the `ClinicalDataset` class.

In [12]:
from cbiotorch.data import ClinicalDataset
from cbiotorch.transform import ToTensor, FilterSelect
from cbiotorch.write import PandasWriter

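# Build a clinical dataset for the same two studies; the pre_transform is
# applied to the whole dataset as it is assembled.
msk_clinical = ClinicalDataset(
    study_id=["tmb_mskcc_2018", "msk_impact_2017"],
    pre_transform=FilterSelect(),
)

# Attach a writer, then save the dataset to file.
msk_clinical.set_writer(PandasWriter(replace=True))
msk_clinical.write()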

#### Training a model

In [5]:
from cbiotorch.transform import ToSparseCountTensor

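# Add a transform that converts mutations to a sparse count tensor over
# gene symbol and variant type, using the dataset's own dimension references.
transform_sparse = ToSparseCountTensor(
    dims=["hugoGeneSymbol", "variantType"], dim_refs=msk_mutations.auto_dim_refs
)
msk_mutations.add_transform(transform_sparse)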
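As a rough illustration of what a sparse count representation over two dimensions looks like, the sketch below counts (gene, variant type) pairs for one sample and lays them out in coordinate (COO) form. It uses only the standard library and toy data; the `sparse_counts` helper is hypothetical and is not a description of how `ToSparseCountTensor` works internally.

```python
from collections import Counter

# Made-up mutations for a single sample: (gene symbol, variant type) pairs.
mutations = [("GENE_A", "SNP"), ("GENE_A", "SNP"), ("GENE_B", "DEL")]
# Reference orderings for each dimension (toy values).
genes = ["GENE_A", "GENE_B", "GENE_C"]
variant_types = ["SNP", "DEL", "INS"]


def sparse_counts(mutations, genes, variant_types):
    """Return COO-style (indices, values, shape) of counts per (gene, variant type)."""
    gene_index = {g: i for i, g in enumerate(genes)}
    variant_index = {v: i for i, v in enumerate(variant_types)}
    counts = Counter((gene_index[g], variant_index[v]) for g, v in mutations)
    coords = sorted(counts)
    # One row of indices per dimension, as COO formats expect.
    indices = [[row for row, _ in coords], [col for _, col in coords]]
    values = [counts[c] for c in coords]
    shape = (len(genes), len(variant_types))
    return indices, values, shape


indices, values, shape = sparse_counts(mutations, genes, variant_types)
print(indices, values, shape)
```

With PyTorch installed, these three pieces could be passed to `torch.sparse_coo_tensor(indices, values, shape)` to build an actual sparse tensor; only the non-zero cells are stored, which suits mutation data where most gene/variant combinations are empty.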
