In [1]:
import tensorsignatures as ts
%matplotlib inline
from helper import hide_toggle
hide_toggle()


# The TensorSignatures CLI

The TensorSignatures CLI comes with six subprograms:

* `boot`: computes bootstrap intervals for a TensorSignature initialisation,
* `data`: simulates mutation count data for a TensorSignature inference,
* `prep`: computes a normalisation constant and formats a count tensor,
* `refit`: refits the exposures to a set of fixed tensor signatures (Sec. A.2.3),
* `train`: runs a de novo extraction of tensor signatures (Sec. A.2.3),
* `write`: creates an `hdf5` file out of dumped tensor signature pickle files.


The goal of this tutorial is to illustrate how to run TensorSignatures in a practical setting. For this reason, we first simulate mutation count data using `tensorsignatures data`, and subsequently run `tensorsignatures train` to extract the constituent signatures. In the next section, we analyse the results of this experiment in Jupyter with the help of the `tensorsignatures` API.

## Simulate data via CLI

To create a reproducible synthetic dataset with the CLI, we invoke the `data` subprogram. The first positional argument sets the seed (573) and the second sets the number of mutational signatures (5):

In [2]:
%%bash
tensorsignatures data 573 5 data.h5 -s 100 -m 1000 -d 3 -d 5

which simulates 100 samples (`-s 100`), each with 1,000 mutations (`-m 1000`), and two additional genomic dimensions with 3 and 5 states respectively (`-d 3 -d 5`). The program writes an `hdf5` file, `data.h5`, to the current folder. It contains the datasets `SNV` and `OTHER`, which hold the SNV count tensor and all other variants respectively.

## Running TensorSignatures using the command line interface

Since we know the number of signatures that made up the dataset, we can run a TensorSignatures decomposition simply by executing

In [76]:
%%bash
tensorsignatures --verbose train data.h5 my_first_run.pkl 5

m: (1, 1, 1, 1, 5)
S1: (3, 3, 1, 96, 5)
A: (3, 3, 1, 1, 5)
B: (3, 3, 1, 1, 5)
k0: (3, 5)
k1: (5, 5)
K: (1, 1, 15, 1, 5)
S: (3, 3, 15, 96, 5)
E: (5, 100)
T: (234, 5)
Chat1: (3, 3, 15, 96, 100)
Chat2: (234, 100)
C1: (3, 3, 15, 96, 100)
C2: (234, 100)
Using negative binomial likelihood





which saves a pickleable binary file to disk that we can load into an interactive Python session (e.g. a Jupyter notebook) for further investigation

In [19]:
init = ts.load_dump('my_first_run.pkl')
init.S.shape

(3, 3, 3, 5, 96, 5, 1)
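
The verbose training log lists the signature tensor as `S: (3, 3, 15, 96, 5)`, whereas the loaded object reports `(3, 3, 3, 5, 96, 5, 1)`. One plausible reading, offered here as an assumption rather than documented behaviour, is that the two simulated genomic dimensions (`-d 3 -d 5`) are flattened to 3 x 5 = 15 states during training and unfolded again on loading, with a trailing axis for the single initialisation. The sketch below only checks that both shapes hold the same number of entries; the axis names are illustrative guesses.

```python
from math import prod

# Shapes copied from the verbose training log and from init.S.shape above.
training_shape = (3, 3, 15, 96, 5)
loaded_shape = (3, 3, 3, 5, 96, 5, 1)

# Assumed axis meanings (our interpretation, not confirmed by the source).
assumed_axes = (
    "transcription", "replication", "dim_1", "dim_2",
    "mutation_type", "signature", "initialisation",
)

# The two extra genomic dimensions were simulated with 3 and 5 states.
extra_dims = (3, 5)
flattened_states = prod(extra_dims)

# Both layouts should contain exactly the same number of entries.
n_training = prod(training_shape)
n_loaded = prod(loaded_shape)

for name, size in zip(assumed_axes, loaded_shape):
    print(f"{name:>15}: {size}")
print("flattened states:", flattened_states)
print("same number of entries:", n_training == n_loaded)
```

If the entry counts differed, the flattening interpretation would be ruled out; equal counts are consistent with it but do not prove it.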

However, we usually do not know the number of active mutational processes a priori. For this reason, it is necessary to run the algorithm at different decomposition ranks and subsequently select the most appropriate model for the data. Moreover, we recommend running several initialisations of the algorithm at each decomposition rank. This is necessary because non-negative matrix factorisation produces stochastic solutions, i.e. each decomposition represents a local minimum of the objective function that is used to train the model. As a result, it is worthwhile to sample the solution space thoroughly and to pick the solution that maximises the log-likelihood.

Running TensorSignatures at different decomposition ranks while computing several initialisations is easy using the CLI. For example, to compute decompositions from rank 2 to 10 with 3 initialisations each, we would simply write a nested bash loop (*Caution: this may take some time*).

In [None]:
%%bash
for rank in {2..10}; do
  for init in {0..2}; do
    tensorsignatures train data.h5 sol_${rank}_${init}.pkl ${rank} -i ${init} -j MyFirstExperiment
  done
done

Also note the additional arguments we pass to the program here: the mandatory `-i` argument identifies each initialisation uniquely, and the `-j` parameter lets us name the experiment, which in this context denotes multiple TensorSignatures decompositions across a range of ranks extracted using the same hyperparameters (number of epochs, dispersion, etc.).
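
For readers who prefer to drive the runs from Python, the same grid of ranks and initialisations can be generated as a list of commands. This is a minimal dry-run sketch: `build_train_commands` is a hypothetical helper, and it uses only the arguments shown in the bash loop (`train`, the rank, `-i` and `-j`). It prints the commands instead of executing them.

```python
import shlex
from itertools import product


def build_train_commands(ranks, inits, data="data.h5", job="MyFirstExperiment"):
    """Return one `tensorsignatures train` argument list per (rank, init) pair."""
    commands = []
    for rank, init in product(ranks, inits):
        commands.append([
            "tensorsignatures", "train", data,
            f"sol_{rank}_{init}.pkl", str(rank),
            "-i", str(init), "-j", job,
        ])
    return commands


# Ranks 2..10 and initialisations 0..2, mirroring the bash loop.
commands = build_train_commands(range(2, 11), range(3))
print(len(commands))  # 9 ranks x 3 initialisations = 27
for cmd in commands[:3]:
    print(shlex.join(cmd))
```

To launch the runs for real, each argument list could be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing run.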

## Summarising the result from many initialisations with `tensorsignatures write`


The loop above produces three initialisations for each rank (2-10) and saves the results as pickleable binary files to disk. Loading the 9 x 3 = 27 initialisations manually using `ts.load_dump` would be quite tedious, and impracticable in larger experiments. For this reason, we included the subprogram `tensorsignatures write`, which takes a `glob` filename pattern and an output filename as arguments and generates an `hdf5` file containing all initialisations. (The output shown below reports 90 files because it was recorded from a run with ten initialisations per rank.)
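
Before merging the dumps, it can be useful to check which files a pattern such as `sol_*.pkl` would pick up and whether every rank has the expected number of initialisations. The sketch below is not how `tensorsignatures write` is implemented; `group_by_rank` is a hypothetical helper that assumes the `sol_<rank>_<init>.pkl` naming scheme used in the loop above and works on a plain list of filenames.

```python
import re
from fnmatch import fnmatch

PATTERN = "sol_*.pkl"
# Assumed naming scheme from the bash loop: sol_<rank>_<init>.pkl
NAME_RE = re.compile(r"^sol_(\d+)_(\d+)\.pkl$")


def group_by_rank(filenames, pattern=PATTERN):
    """Map each rank to the sorted initialisation ids found among the filenames."""
    grouped = {}
    for name in filenames:
        if not fnmatch(name, pattern):
            continue  # not matched by the glob-style pattern
        match = NAME_RE.match(name)
        if match is None:
            continue  # matched the pattern but not the naming scheme
        rank, init = map(int, match.groups())
        grouped.setdefault(rank, []).append(init)
    return {rank: sorted(inits) for rank, inits in sorted(grouped.items())}


# Stand-in for the directory listing; in practice use glob.glob(PATTERN).
names = [f"sol_{r}_{i}.pkl" for r in range(2, 11) for i in range(3)]
names.append("data.h5")  # unrelated file that must be ignored

grouped = group_by_rank(names)
print(len(grouped), sum(len(v) for v in grouped.values()))
```

A rank whose list is shorter than expected points to a run that failed or has not finished yet.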

In [23]:
%%bash
tensorsignatures write "sol_*.pkl" results.h5

Processing 90 files.

