# Chronos Vignette

This vignette walks through a simple exercise in training Chronos on a subset of DepMap public 20Q4 data. 

## Imports

In [1]:
import numpy as np
import pandas as pd
import chronos
import os
from matplotlib import pyplot as plt
import seaborn as sns
from taigapy import default_tc as tc

## Setting up the Data

Chronos always requires at least three dataframes: 
* A matrix of readcounts with sequenced entities as the index, individual sgRNAs as the columns, and values indicating how many reads were found for that sgRNA. A sequenced entity is any vector of sgRNA readcounts read out during the experiment. It could be a sequencing run of pDNA, or of a biological replicate at some time point during the experiment.
* A sequence map mapping sequenced entities to either pDNA or a cell line and giving the days since infection and pDNA batch. 
* A guide map mapping sgRNAs to genes. Each sgRNA included must map to one and only one gene.
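
As a rough sketch of these three shapes, the toy frames below use made-up sequence IDs, guides, and genes (none of them come from the DepMap files). Only the column names are taken from the sequence map and guide map displayed in this vignette.

```python
import numpy as np
import pandas as pd

# Toy readcounts: rows are sequenced entities, columns are sgRNAs.
toy_readcounts = pd.DataFrame(
    [[500.0, 480.0, 510.0],
     [350.0, 90.0, np.nan],   # NaNs are allowed
     [340.0, 110.0, 420.0]],
    index=["pDNA_run1", "lineA_repA", "lineA_repB"],
    columns=["GUIDE1", "GUIDE2", "GUIDE3"],
)

# Toy sequence map: one row per sequenced entity.
toy_sequence_map = pd.DataFrame({
    "sequence_ID": ["pDNA_run1", "lineA_repA", "lineA_repB"],
    "cell_line_name": ["pDNA", "lineA", "lineA"],
    "pDNA_batch": [0, 0, 0],
    "days": [0, 21, 21],
})

# Toy guide map: each sgRNA maps to exactly one gene.
toy_guide_map = pd.DataFrame({
    "sgrna": ["GUIDE1", "GUIDE2", "GUIDE3"],
    "gene": ["GENE_A", "GENE_A", "GENE_B"],
})

# The three frames should agree on sequence IDs and sgRNA names.
ids_match = set(toy_sequence_map["sequence_ID"]) == set(toy_readcounts.index)
guides_match = set(toy_guide_map["sgrna"]) == set(toy_readcounts.columns)
print(ids_match, guides_match)
```

Checking that the identifiers line up before building the model is a cheap way to catch mismatched file versions.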

In [2]:
sequence_map = pd.read_csv("Data/Achilles_sequence_map.csv")
guide_map = pd.read_csv("Data/avana_guide_map.csv")
readcounts = pd.read_csv("Data/Achilles_readcounts.csv", index_col=0)

Sequence maps must have the columns

* sequence_ID (str): must match a row in readcounts.
* cell_line_name (str): must be "pDNA" for pDNA measurements. Each pDNA batch must have at least one pDNA measurement.
* pDNA_batch (any simple hashable type, preferably int or str): pDNA measurements sharing the same batch will be grouped and averaged, then used as the reference for all biological replicate sequencings assigned that same batch. If you don't have multiple pDNA batches (by far the most common experimental condition), just fill this column with 0 or some other constant value.
* days: days post infection. This value will be ignored for pDNA.

Other columns will be ignored.
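
The requirements above can be checked before training. The helper below is a hypothetical sketch, not part of the chronos package. It assumes the column names shown in the sequence map displayed in this vignette and reports any sequence ID missing from readcounts and any batch without a pDNA measurement.

```python
import pandas as pd

REQUIRED_COLUMNS = ["sequence_ID", "cell_line_name", "pDNA_batch", "days"]

def check_sequence_map(sequence_map, readcounts):
    """Hypothetical helper: return a list of problems, empty if none are found."""
    missing = [c for c in REQUIRED_COLUMNS if c not in sequence_map.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # Every sequence_ID should be a row of readcounts.
    unknown = sorted(set(sequence_map["sequence_ID"]) - set(readcounts.index))
    if unknown:
        problems.append(f"sequence IDs not in readcounts: {unknown}")
    # Every batch needs at least one pDNA measurement as its reference.
    is_pdna = sequence_map["cell_line_name"] == "pDNA"
    all_batches = set(sequence_map["pDNA_batch"].tolist())
    pdna_batches = set(sequence_map.loc[is_pdna, "pDNA_batch"].tolist())
    without_pdna = sorted(all_batches - pdna_batches)
    if without_pdna:
        problems.append(f"batches without pDNA: {without_pdna}")
    return problems

# Toy example: batch 3 has a replicate but no pDNA measurement.
toy_map = pd.DataFrame({
    "sequence_ID": ["p1", "a1", "b1"],
    "cell_line_name": ["pDNA", "lineA", "lineB"],
    "pDNA_batch": [2, 2, 3],
    "days": [0, 21, 21],
})
toy_counts = pd.DataFrame(0.0, index=["p1", "a1", "b1"], columns=["G1"])
sequence_problems = check_sequence_map(toy_map, toy_counts)
print(sequence_problems)
```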

In [3]:
sequence_map

Unnamed: 0,sequence_ID,cell_line_name,pDNA_batch,days
0,SNU-840 311Cas9_RepC_p6_batch2,ACH-000280,2,21
1,A549-311Cas9 Rep A p5_batch3,ACH-000681,3,21
2,SNU-840 311Cas9_RepA_p6_batch2,ACH-000280,2,21
3,COLO320-311Cas9 Rep B p6_batch3,ACH-000202,3,21
4,Avana4pDNA20160601-311cas9 RepG10_batch2,pDNA,2,0
5,Panc.02.03-311cas9 Rep C p6_batch3,ACH-000042,3,21
6,A2058-311cas9_RepA_p6_batch3,ACH-000788,3,21
7,Avana4pDNA20160601-311cas9 RepG12_batch2,pDNA,2,0
8,A2058-311cas9_RepB_p6_batch3,ACH-000788,3,21
9,SNU-840 311Cas9_RepB_p6_batch2,ACH-000280,2,21


Guide maps must have the columns 

* sgrna (str): must match a column in readcounts. An sgrna can only appear once in this data frame.
* gene (str): the gene the sgrna maps to.

Other columns will be ignored.
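
A similar check can be sketched for the guide map. `check_guide_map` is a hypothetical helper written for illustration, not a chronos function. It looks for sgRNAs that appear more than once and sgRNAs that have no column in readcounts.

```python
import pandas as pd

def check_guide_map(guide_map, readcounts):
    """Hypothetical helper: list violations of the guide map requirements."""
    problems = []
    # An sgRNA may appear only once, so it maps to exactly one gene.
    repeated = guide_map.loc[guide_map["sgrna"].duplicated(), "sgrna"]
    repeated = sorted(set(repeated))
    if repeated:
        problems.append(f"sgRNAs listed more than once: {repeated}")
    # Every sgRNA must match a column of readcounts.
    unknown = sorted(set(guide_map["sgrna"]) - set(readcounts.columns))
    if unknown:
        problems.append(f"sgRNAs not in readcounts: {unknown}")
    return problems

# Toy example: G1 maps to two genes and G3 has no readcounts column.
toy_guides = pd.DataFrame({
    "sgrna": ["G1", "G1", "G2", "G3"],
    "gene": ["GENE_A", "GENE_B", "GENE_B", "GENE_C"],
})
toy_counts = pd.DataFrame(0.0, index=["rep1"], columns=["G1", "G2"])
guide_problems = check_guide_map(toy_guides, toy_counts)
print(guide_problems)
```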

In [4]:
guide_map[:4]

Unnamed: 0,sgrna,genome_alignment,gene,n_alignments,positive_control,negative_control
0,AAAAAAATCCAGCAATGCAG,chr10_110964620_+,SHOC2 (8036),1,True,False
1,AAAAAAGAAGAAAAAACCAG,chr4_75970356_-,SDAD1 (55153),1,True,False
2,AAAAACTCTGGGAAATGACT,chr1_229305137_+,SPHAR (10638),1,False,True
3,AAAAAGATCATGATTGAGCG,chr12_123630253_-,EIF2B1 (1967),1,True,False


Finally, here's what readcounts should look like. They can include NaNs. Note the axes: sequenced entities are the rows and sgRNAs are the columns.

In [5]:
readcounts.iloc[:4, :3]

Unnamed: 0,AAAAAAATCCAGCAATGCAG,AAAAAAGAAGAAAAAACCAG,AAAAACTCTGGGAAATGACT
SNU-840 311Cas9_RepC_p6_batch2,348.0,78.0,430.0
A549-311Cas9 Rep A p5_batch3,257.0,100.0,398.0
SNU-840 311Cas9_RepA_p6_batch2,531.0,160.0,358.0
COLO320-311Cas9 Rep B p6_batch3,441.0,154.0,608.0


### NaNing clonal outgrowths

In Achilles, we've observed rare instances where a single guide in a single biological replicate will produce an unexpectedly large number of readcounts, while other guides targeting the same gene or other replicates of the same cell line do not show many readcounts. We suspect this is the result of a single clone gaining some fitness advantage. Although it _could_ be related to a change induced by the guide, in general it's probably misleading. Therefore Chronos has an option to identify and remove these events.

In [6]:
chronos.nan_outgrowths(readcounts=readcounts, guide_gene_map=guide_map,
                       sequence_map=sequence_map)

calculating LFC
finding maximum LFC cells
finding sequences and guides with outgrowth
22 (0.00442% of) readcounts to be removed
                                    gene              
SNU-840 311Cas9_RepC_p6_batch2      PTEN (5728)           2.331286
A549-311Cas9 Rep A p5_batch3        OR2G2 (81470)         2.223027
                                    PSPC1 (55269)         2.356427
                                    PTEN (5728)           3.103739
                                    TNFRSF13C (115650)    2.500551
                                    TRABD2B (388630)      2.215409
Panc.02.03-311cas9 Rep C p6_batch3  VIPR2 (7434)          2.276700
SNU-840 311Cas9_RepB_p6_batch2      PTEN (5728)           2.435050
A549-311Cas9 Rep D p5_batch3        FIBP (9158)           2.795995
                                    PTEN (5728)           2.372597
dtype: float64
NaNing bad outgrowths
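
To make the idea concrete, here is a simplified illustration of outgrowth flagging. It is not the implementation used by `chronos.nan_outgrowths`. The function `flag_outgrowths`, its thresholds, and the toy log fold change (LFC) values are all assumptions made for this sketch. A cell is flagged when one guide's LFC is large and far above every other guide for the same gene in the same replicate.

```python
import numpy as np
import pandas as pd

def flag_outgrowths(lfc, guide_to_gene, min_lfc=2.0, min_gap=2.0):
    """Illustrative only: flag (replicate, guide) cells with a lone, large LFC."""
    flags = pd.DataFrame(False, index=lfc.index, columns=lfc.columns)
    for gene, guides in guide_to_gene.groupby(guide_to_gene).groups.items():
        guides = list(guides)
        if len(guides) < 2:
            continue  # no other guide to compare against
        for guide in guides:
            # Best LFC among the other guides for the same gene, per replicate.
            others = lfc[guides].drop(columns=guide).max(axis=1)
            flags[guide] = (lfc[guide] > min_lfc) & (lfc[guide] - others > min_gap)
    return flags

# Toy LFC matrix: replicates as rows, guides as columns.
toy_lfc = pd.DataFrame(
    [[3.0, -0.5, 0.1],
     [-0.4, -0.6, 0.0]],
    index=["repA", "repB"], columns=["G1", "G2", "G3"],
)
# Guide-to-gene lookup: G1 and G2 target the same gene.
toy_genes = pd.Series(["GENE_A", "GENE_A", "GENE_B"], index=["G1", "G2", "G3"])

toy_flags = flag_outgrowths(toy_lfc, toy_genes)
masked_lfc = toy_lfc.mask(toy_flags)  # flagged cells become NaN
print(toy_flags)
```

In this toy case only G1 in repA is flagged, because its partner guide G2 shows no enrichment in the same replicate.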


## Train Chronos

Now we initialize the model. Note the form of the data: each of the three parameters is actually a dictionary. If we were training the model with data from multiple libraries simultaneously, each library's data would have its own entries in the dict.

In [7]:
model = chronos.Chronos(
    sequence_map={"avana": sequence_map},
    guide_gene_map={"avana": guide_map},
    readcounts={"avana": readcounts}
)



Finding all unique guides and genes
found 17785 unique guides and 4712 unique genes in avana
found 17785 unique guides and 4712 unique genes overall

finding guide-gene mapping indices

finding all unique sequenced replicates, cell lines, and pDNA batches
found 20 unique sequences (excluding pDNA) and 7 unique cell lines in avana
found 20 unique replicates and 7 unique cell lines overall

finding replicate-cell line mappings indices

finding replicate-pDNA mappings indices


assigning float constants
initializing graph


AttributeError: module 'tensorflow' has no attribute 'placeholder'

Now, optimizing the model:

In [None]:
model.train(1001)

## Scaling

We can retrieve the model's gene effect estimates as follows:

In [None]:
gene_effects = model.gene_effect

gene_effects.iloc[:4, :5]

The gene effects represent Chronos' estimate of the relative change in growth rate for a cell line, but the scale is unintuitive. We'll use the usual Achilles convention, where -1 is the median of essential gene effects and 0 is the median of nonessential gene effects. Note that we are just scaling the data overall, not per cell line as DepMap does with CERES scores.

In [None]:
# guide_map has one row per guide, so deduplicate the control gene lists
nonessentials = guide_map.query("negative_control").gene.unique()
essentials = guide_map.query("positive_control").gene.unique()
gene_effects -= gene_effects[nonessentials].median().median()
gene_effects /= np.abs(gene_effects[essentials].median().median())

gene_effects.iloc[:4, :5]
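
As a quick check of this convention, the sketch below applies the same two steps to randomly generated toy scores. The gene names, means, and spreads are invented for illustration. After scaling, the median of the nonessential gene medians should be 0 and the median of the essential gene medians should be -1, up to floating point error.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_noness = [f"NONESS{i}" for i in range(5)]
toy_ess = [f"ESS{i}" for i in range(5)]

# Toy gene effects: cell lines as rows, genes as columns.
toy_effects = pd.DataFrame(
    np.hstack([rng.normal(0.2, 0.05, (4, 5)),     # nonessential genes
               rng.normal(-1.5, 0.1, (4, 5))]),   # essential genes
    index=[f"line{i}" for i in range(4)],
    columns=toy_noness + toy_ess,
)

# Step 1: shift so the median nonessential gene sits at 0.
scaled = toy_effects - toy_effects[toy_noness].median().median()
# Step 2: rescale so the median essential gene sits at -1.
scaled = scaled / np.abs(scaled[toy_ess].median().median())

noness_center = scaled[toy_noness].median().median()
ess_center = scaled[toy_ess].median().median()
print(np.isclose(noness_center, 0.0), np.isclose(ess_center, -1.0))
```

The order matters: the shift has to come first, because dividing by the essential median only lands on -1 once the nonessential median is already at 0.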

Here's what the distribution of scores for the two control groups looks like over all cell lines.

In [None]:
sns.kdeplot(gene_effects[nonessentials].stack(), label="Nonessentials",
            gridsize=1000)
sns.kdeplot(gene_effects[essentials].stack(), label="Essentials",
            gridsize=1000)
plt.xlabel("Chronos Gene Effect Scores")
plt.legend()  # show the labels for the two curves

## Copy Number Correction

If you have gene-level copy number calls, Chronos includes an option to correct gene effect scores after the fact. This works best if the data has been scaled, as above.

In [None]:
cn = pd.read_csv("Data/CCLE_copy_number.csv", index_col=0)
cn.iloc[:4, :3]

Unfortunately, we don't have copy number calls for some of the genes targeted by the Avana library:

In [None]:
try:
    corrected, shifts = chronos.alternate_CN(gene_effects, cn)
except ValueError as e:
    print(e)

We could choose to drop these genes. Instead, we'll assume normal ploidy (=1, in the current CCLE convention) for them and fill in the CN matrix accordingly.

In [None]:
for col in set(gene_effects.columns) - set(cn.columns):
    cn[col] = 1
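
The same fill can be written as a single `reindex` call, which avoids adding columns one at a time (pandas can warn about fragmentation when many columns are inserted in a loop). The sketch below uses invented toy frames and keeps the assumption from above that a relative copy number of 1 means normal ploidy.

```python
import pandas as pd

# Toy gene effects and copy number: GENE_B and GENE_C lack CN calls.
toy_effects = pd.DataFrame(
    [[-0.1, -1.0, 0.0]], index=["lineA"],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)
toy_cn = pd.DataFrame(
    [[1.2, 0.9]], index=["lineA"], columns=["GENE_A", "GENE_D"],
)

# Genes with gene effect scores but no copy number call.
missing_genes = toy_effects.columns.difference(toy_cn.columns)

# Add the missing genes in one step, filled with the assumed normal value.
all_genes = toy_cn.columns.union(toy_effects.columns)
filled_cn = toy_cn.reindex(columns=all_genes, fill_value=1.0)

print(list(missing_genes))
print(filled_cn)
```

Existing calls are left untouched; only the newly added columns receive the fill value.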

In [None]:
corrected, shifts = chronos.alternate_CN(gene_effects, cn)

The "shifts" dataframe contains some information about the inferred CN effect, while "corrected" contains the corrected gene effects matrix. Overall, gene effect matrices will change little after correction, since most genes in most lines are near diploid.

In [None]:
sns.kdeplot(corrected[nonessentials].stack(), label="Nonessentials",
            gridsize=1000)
sns.kdeplot(corrected[essentials].stack(), label="Essentials",
            gridsize=1000)
plt.xlabel("Chronos Gene Effect Scores")
plt.legend()  # show the labels for the two curves