# PhyliCS: tutorial

PhyliCS is a pipeline for **multi-sample** copy-number variation (CNV) analysis on
**single-cell DNA** sequencing data. It makes it possible to quantify **intra-tumor heterogeneity** and to investigate the **temporal and spatial evolution** of tumors.

This tutorial shows how to run the main stages of the pipeline. Specifically, we execute the code needed to reproduce the second use case *(Temporal evolution)* presented in the Supplementary Materials of our paper. For the first use case *(Spatial intra-tumor heterogeneity)* we only list the required commands, since its dataset contains approximately 6000 cells and the execution time would not be feasible for a tutorial.

## Temporal evolution
Here, we use PhyliCS to investigate the temporal evolution of CNVs in a cancer case. To this end, we take advantage of the results of one of the CNV analyses performed by Garvin et al. [1] to validate their tool (*Ginkgo*). Specifically, that analysis uses the single-cell data of two samples, a breast tumor and its liver metastasis (T16P/M), which Navin et al. [2] used in their study on intra-tumor heterogeneity characterization. Since the CNV calls are publicly available in the Ginkgo GitHub repository, we skip the CNV calling stage and move directly to the analysis.

### Single-sample analysis
This step should follow the CNV calling stage, which we can skip since it simply consists of running Ginkgo on the single-cell alignment files. It is worth noting, however, that Ginkgo produces several output files, two of which are used by our application:
- `SegCopy`: a table where the columns correspond to the cells and the rows to the bins the genome has been divided into. Each element of this table therefore contains the copy-number computed for the corresponding cell and position.
- `results.txt`: a table containing some useful statistics about each cell, such as the mean copy-number.
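
To make the layout of `SegCopy` more concrete, here is a minimal Python sketch that reads a small table of that shape and computes the mean copy-number of each cell. It is only an illustration, not the code PhyliCS runs: the toy data, the helper names (`read_segcopy`, `mean_ploidy`) and the assumption that the first three columns hold the bin coordinates (`CHR`, `START`, `END`) are ours, and the mean is unweighted (bin widths are ignored).

```python
import csv
import io
from statistics import mean

# Toy stand-in for a SegCopy file (tab-separated). The three leading
# coordinate columns are an assumption about the layout, not a guarantee.
SEGCOPY_TEXT = (
    "CHR\tSTART\tEND\tcell_1\tcell_2\n"
    "chr1\t0\t500000\t2\t3\n"
    "chr1\t500000\t1000000\t2\t4\n"
    "chr2\t0\t500000\t1\t3\n"
)


def read_segcopy(handle, n_coord_cols=3):
    """Return (bins, profiles): bin coordinates and one copy-number list per cell."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cells = header[n_coord_cols:]
    bins = []
    profiles = {cell: [] for cell in cells}
    for row in reader:
        bins.append(tuple(row[:n_coord_cols]))
        for cell, value in zip(cells, row[n_coord_cols:]):
            profiles[cell].append(float(value))
    return bins, profiles


def mean_ploidy(profiles):
    """Unweighted mean copy-number of each cell over all bins."""
    return {cell: mean(values) for cell, values in profiles.items()}


bins, profiles = read_segcopy(io.StringIO(SEGCOPY_TEXT))
ploidy = mean_ploidy(profiles)
for cell, value in ploidy.items():
    print(f"{cell}: mean copy-number = {value:.3f}")
```

To try it on a real file, replace the `io.StringIO` object with an open file handle, after checking that the column layout matches the assumption above.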

In order to perform single-sample analysis, type:

In [1]:
!mkdir data/navin_out 
!phylics --run --run_single --input_dirs primary:data/navin_primary metastasis:data/navin_metastasis --output_path data/navin_out --verbose

mkdir: cannot create directory "data/navin_out": File exists
[single_sample_post_analysis]  Complete analysis
--------------------------------------------------------------------------------
[single_sample_post_analysis]  Computing heatmap and phylogenetic tree (method = complete, metric = euclidean)
[single_sample_post_analysis] -- cophenet coefficient: 0.9922210550775276
--------------------------------------------------------------------------------
[single_sample_post_analysis]  Plotting mean ploidy distribution
--------------------------------------------------------------------------------
[single_sample_post_analysis]  Computing mean CNV profile
[single_sample_post_analysis] -- mean ploidy = 2.884057
--------------------------------------------------------------------------------
[single_sample_post_analysis]  Computing the optimal number of clusters
[single_sample_post_analysis] -- n_clusters = 2 - The average silhouette_score is :0.856911221767078
[single_sample

#### Parameters
- `--run`*(required)*: analysis execution mode. If no sub-mode (`--run_cnvs | --run_single | --run_multiple`) is specified, the complete analysis is executed (CNV calling, single-sample analysis, multi-sample analysis).
- `--run_single`*(required)*: single-sample mode. It specifies that only the single-sample analysis has to be executed.
- `--input_dirs`*(required)*: list of pairs made of sample name and input directory path, separated by ":", for each input sample. The input directories must contain Ginkgo output files (`SegCopy`, `results.txt`).
- `--output_path` *(required)*: output directory path. All output files and directories will be created at this position.
- `--verbose` *(optional)*: verbose mode. 
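
The `name:path` convention used by `--input_dirs` can be illustrated with a few lines of Python. This is a hypothetical helper (`parse_input_dirs` is our name, not a PhyliCS function) that shows one reasonable way to split such pairs; how PhyliCS parses them internally may differ.

```python
def parse_input_dirs(pairs):
    """Turn ["name:path", ...] into {"name": "path", ...}.

    Hypothetical helper for illustration only. The split happens at the
    first ":" so that any later colon stays part of the path.
    """
    samples = {}
    for pair in pairs:
        name, sep, path = pair.partition(":")
        if not sep or not name or not path:
            raise ValueError(f"expected NAME:PATH, got {pair!r}")
        if name in samples:
            raise ValueError(f"duplicate sample name {name!r}")
        samples[name] = path
    return samples


samples = parse_input_dirs(
    ["primary:data/navin_primary", "metastasis:data/navin_metastasis"]
)
for name, path in samples.items():
    print(f"{name} -> {path}")
```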

#### Output
`data/navin_out/<sample_name>_post_CNV`:
    - `heatmap.png`: heatmap and dendrogram computed by the phylogenetic algorithm. 
    - `mean_cnv.png`: average copy-number plot. It shows the average copy-number, computed over all cells, for each genome position.
    - `mean_ploidy_distribution.svg`: mean ploidy density distribution plot. The mean ploidy is the mean copy-number of a single cell, and this plot shows how the mean ploidies are distributed. It helps to highlight groups of pseudo-diploid cells.
    - `silhouette_results.png`: silhouette plot for each of the tested K's.
    - `per_k_silhouette_scores.csv`: average silhouette scores for each of the tested K's.
    - `silhouette_summary.png`: dot plot of the silhouette score for the tested K's.
    - `clusters.tsv`: composition and mean copy-number of the clusters built with the Silhouette method.
    - `clusters_heatmap.png`: heatmaps of the  clusters built with the Silhouette method.
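
Several of the files above report silhouette scores. As a reminder of what that number measures, the following sketch computes the average silhouette width of a given clustering with plain Python. It is a textbook implementation on toy one-dimensional data, written under our own assumptions (Euclidean distance, a width of 0 for singleton clusters); it is not taken from the PhyliCS code base.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Average silhouette width of a clustering (needs at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, hence len(own) - 1)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated groups
bad = silhouette_score(points, [0, 1, 0, 1])   # groups that mix the two ends
print(f"good labelling: {good:.4f}, bad labelling: {bad:.4f}")
```

A score close to 1 indicates compact, well-separated clusters, while a negative score suggests that many cells sit closer to another cluster than to their own.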
    

### Cell filtering
To filter out normal cells, type:

In [2]:
!phylics --run_cell_filtering --input_dirs primary:data/navin_primary --intervals 1.5-2.3 --output_path data/navin_out --verbose
!phylics --run_cell_filtering --input_dirs metastasis:data/navin_metastasis --intervals 1.5-2.3 --output_path data/navin_out --verbose

Cell filtering execution
[valid_cells]  Initial cells: 52
[valid_cells]  Filtered out cells: 33
[valid_cells]  Remaining cells: 19
Cell filtering execution
[valid_cells]  Initial cells: 48
[valid_cells]  Filtered out cells: 25
[valid_cells]  Remaining cells: 23


#### Parameters
- `--run_cell_filtering` *(required)*: cell filtering execution mode.
- `--input_dirs` *(required)*: same as before, but only one sample at a time is accepted, here.
- `--output_path` *(required)*: same as before.
- `--intervals | --values` *(required)*: list of mean ploidy ranges or single values. Cells whose mean ploidy falls in one of the specified ranges or matches one of the single values are filtered out. At least one of these two parameters must be specified.
- `--verbose` *(optional)*: as before.
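
The filtering rule described by `--intervals` and `--values` can be sketched as follows. The helper names (`parse_interval`, `split_cells`), the toy ploidy values, the inclusive range bounds and the tolerance-based comparison for single values are all assumptions made for this illustration; the actual PhyliCS behavior at the boundaries may differ.

```python
from math import isclose


def parse_interval(text):
    """Parse a range such as "1.5-2.3" into the tuple (1.5, 2.3)."""
    low, sep, high = text.partition("-")
    if not sep:
        raise ValueError(f"expected LOW-HIGH, got {text!r}")
    low, high = float(low), float(high)
    if low > high:
        raise ValueError(f"lower bound exceeds upper bound in {text!r}")
    return low, high


def split_cells(mean_ploidy, intervals=(), values=()):
    """Return (kept, removed) cell names.

    A cell is removed when its mean ploidy lies inside one of the ranges
    (bounds included, by assumption) or matches one of the single values.
    """
    ranges = [parse_interval(item) for item in intervals]
    kept, removed = [], []
    for cell, ploidy in mean_ploidy.items():
        in_range = any(low <= ploidy <= high for low, high in ranges)
        is_value = any(isclose(ploidy, value) for value in values)
        (removed if in_range or is_value else kept).append(cell)
    return kept, removed


# Toy mean ploidies; in practice they would come from `results.txt`.
toy_ploidy = {"cell_a": 1.98, "cell_b": 2.05, "cell_c": 3.10, "cell_d": 2.88}
kept, removed = split_cells(toy_ploidy, intervals=["1.5-2.3"])
print("kept:", kept)
print("removed:", removed)
```

With the range `1.5-2.3` used in the commands above, the two near-diploid toy cells are discarded and the two aneuploid ones are kept.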

#### Output
`data/navin_out/<sample_name>_filtered`:
    - `SegCopy`: filtered CNV file.
    - `results.txt`: filtered statistics file.


### Multiple-sample analysis
To perform multiple-sample analysis, type:

In [3]:
!phylics --run --run_multiple --input_dirs primary:data/navin_out/primary_filtered metastasis:data/navin_out/metastasis_filtered  --output_path data/navin_out --verbose

[multi_sample_post_analysis]  CNV calls merging
--------------------------------------------------------------------------------
[multi_sample_post_analysis]  Complete analysis
--------------------------------------------------------------------------------
[multi_sample_post_analysis]  Heterogeneity score computation
[multi_sample_post_analysis] -- Permutation test (n_permutations = 1000)
[multi_sample_post_analysis] ---- iteration: 0
[multi_sample_post_analysis] ---- iteration: 100
[multi_sample_post_analysis] ---- iteration: 200
[multi_sample_post_analysis] ---- iteration: 300
[multi_sample_post_analysis] ---- iteration: 400
[multi_sample_post_analysis] ---- iteration: 500
[multi_sample_post_analysis] ---- iteration: 600
[multi_sample_post_analysis] ---- iteration: 700
[multi_sample_post_analysis] ---- iteration: 800
[multi_sample_post_analysis] ---- iteration: 900
[multi_sample_post_analysis] ---- Permutation test done (10.91587233543396s)
[multi_sample_post_analysis] -- Samples: [

#### Parameters
- `--run`*(required)*: same as before.
- `--run_multiple`*(required)*: multiple-sample mode. It specifies that only the multi-sample analysis has to be executed.
- `--input_dirs`*(required)*: as before.
- `--output_path` *(required)*: as before.
- `n_permutations
- `--verbose` *(optional)*: as before. 
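
The log above mentions a permutation test with 1000 permutations. The sketch below shows the general idea of such a test on toy copy-number profiles: the sample labels are shuffled many times and the statistic observed on the real labels is compared with the shuffled ones. Note that the statistic used here (mean between-sample distance minus mean within-sample distance) is a simple stand-in chosen for this example; it is not the heterogeneity score computed by PhyliCS, and all names are hypothetical.

```python
import random
from itertools import combinations
from math import dist
from statistics import mean


def separation(profiles, labels):
    """Mean between-sample distance minus mean within-sample distance."""
    within, between = [], []
    for i, j in combinations(range(len(profiles)), 2):
        d = dist(profiles[i], profiles[j])
        (within if labels[i] == labels[j] else between).append(d)
    return mean(between) - mean(within)


def permutation_test(profiles, labels, n_permutations=1000, seed=0):
    """Return (observed statistic, one-sided p-value) under label shuffling."""
    rng = random.Random(seed)
    observed = separation(profiles, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if separation(profiles, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy profiles: four cells per sample, three bins per cell.
profiles = [
    (2, 2, 2), (2, 2, 3), (2, 3, 2), (3, 2, 2),  # "primary"
    (4, 4, 4), (4, 5, 4), (4, 4, 5), (5, 4, 4),  # "metastasis"
]
labels = ["primary"] * 4 + ["metastasis"] * 4
observed, p_value = permutation_test(profiles, labels)
print(f"observed separation = {observed:.3f}, p-value = {p_value:.3f}")
```

A small p-value means that the separation between the two samples is rarely matched when cells are assigned to samples at random.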

#### Output
`data/navin_out/<sample_name>_<sample_name>_post_CNV`:
    - `heatmap.png`: heatmap and dendrogram computed by the phylogenetic algorithm. 
    - `mean_cnv.png`: average copy-number plot. It shows the average copy-number, computed over all cells, for each genome position.
    - `mean_ploidy_distribution.svg`: mean ploidy density distribution plot. The mean ploidy is the mean copy-number of a single cell, and this plot shows how the mean ploidies are distributed. It helps to highlight groups of pseudo-diploid cells.
    - `silhouette_results.png`: silhouette plot for each of the tested K's.
    - `per_k_silhouette_scores.csv`: average silhouette scores for each of the tested K's.
    - `silhouette_summary.png`: dot plot of the silhouette score for the tested K's.
    - `clusters.tsv`: composition and mean copy-number of the clusters built with the Silhouette method.
    - `clusters_heatmap.png`: heatmaps of the  clusters built with the Silhouette method.
    

## Spatial intra-tumor heterogeneity
Here, we just show the commands you need to execute to reproduce the first use case. The purpose was to study intra-tumor heterogeneity from a spatial perspective, by taking into account multiple samples taken from different regions of the same tumor. We used the public data provided by 10x Genomics on their official website [3].

### Download data
First, you need to download the alignment data and the file containing the cell barcodes, produced by Cell Ranger DNA, for each sample.

```
wget -O data/10x_breastA/possorted_bam.bam http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-dna/1.1.0/breast_tissue_A_2k/breast_tissue_A_2k_possorted_bam.bam

wget -O data/10x_breastA/per_cell_summary_metrics.csv http://cf.10xgenomics.com/samples/cell-dna/1.1.0/breast_tissue_A_2k/breast_tissue_A_2k_per_cell_summary_metrics.csv

wget -O data/10x_breastB/possorted_bam.bam http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-dna/1.1.0/breast_tissue_B_2k/breast_tissue_B_2k_possorted_bam.bam

wget -O data/10x_breastB/per_cell_summary_metrics.csv http://cf.10xgenomics.com/samples/cell-dna/1.1.0/breast_tissue_B_2k/breast_tissue_B_2k_per_cell_summary_metrics.csv

wget -O data/10x_breastC/possorted_bam.bam http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-dna/1.1.0/breast_tissue_C_2k/breast_tissue_C_2k_possorted_bam.bam

wget -O data/10x_breastC/per_cell_summary_metrics.csv http://cf.10xgenomics.com/samples/cell-dna/1.1.0/breast_tissue_C_2k/breast_tissue_C_2k_per_cell_summary_metrics.csv
```
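
Note that `wget -O` does not create missing parent directories, so the three sample directories must exist before the downloads start. A minimal Python sketch for creating them is given below; the helper name `make_sample_dirs` is ours, and the directory names are simply those used in the commands above.

```python
from pathlib import Path

# Directory names taken from the wget commands above.
SAMPLES = ("10x_breastA", "10x_breastB", "10x_breastC")


def make_sample_dirs(base_dir, samples=SAMPLES):
    """Create one sub-directory per sample under base_dir and return the paths.

    Directories that already exist are left untouched (exist_ok=True), so
    the function can be called again safely.
    """
    created = []
    for sample in samples:
        path = Path(base_dir) / sample
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_sample_dirs("data")` from the directory where you run the tutorial prepares `data/10x_breastA`, `data/10x_breastB` and `data/10x_breastC`.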

### Data preparation
Then, you need to run the data-preparation module, which calls `sctools_demultiplex` to split the 10x alignment files into single-cell alignment files, filter out poor-quality reads and multi-mappers, and produce single-cell .bed files. You can do it by simply typing:

```
phylics --run_10x_preproc --input_dirs breast_A:data/10x_breastA --output_path data --verbose

phylics --run_10x_preproc --input_dirs breast_B:data/10x_breastB --output_path data --verbose

phylics --run_10x_preproc --input_dirs breast_C:data/10x_breastC --output_path data --verbose
```

### Data analysis
Now, you can run the complete pipeline, without going step by step. To do so, type:

```
mkdir 10x_breast_u
phylics --run --input_dirs breast_A:data/breast_A_sc breast_B:data/breast_B_sc breast_C:data/breast_C_sc --genome GrCh38 --binning variable_500000_101_bwa --output_path  --output_prefix third_run --tasks 8 --verbose
```

If you prefer to follow each stage, you can adapt the instructions shown in the previous example.



## References
1. Tyler Garvin, Robert Aboukhalil, Jude Kendall, Timour Baslan, Gurinder S Atwal, James Hicks, Michael Wigler, and Michael C Schatz. Interactive analysis and assessment of single-cell copy-number variations. Nature Methods, 12(11):1058, 2015.

2. Nicholas Navin, Alexander Krasnitz, Linda Rodgers, Kerry Cook, Jennifer Meth, Jude Kendall, Michael Riggs, Yvonne Eberling, Jennifer Troge, Vladimir Grubor, et al. Inferring tumor progression from genomic heterogeneity. Genome Research, 20(1):68–80, 2010.

3. 10x Genomics. 10x Genomics: Biology at True Resolution, 2019. URL https://www.10xgenomics.com.
