# Chronos Vignette

This vignette walks through a simple exercise in training Chronos on a subset of DepMap public 20Q4 and the Sanger Institute's Project Score data. 

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
import chronos
import os
from matplotlib import pyplot as plt
import seaborn as sns
from taigapy import default_tc as tc

Some tweaks that will make plots more legible:

In [3]:
from matplotlib import rcParams
rcParams['axes.titlesize'] = 14
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['savefig.dpi'] = 200
rcParams['savefig.transparent'] = False
rcParams['font.family'] = 'Arial'
rcParams['font.size'] = '11'
rcParams['figure.dpi'] = 200
rcParams["savefig.facecolor"] = (1, 1, 1.0, 0.2)

rcParams['xtick.labelsize'] = 10
rcParams['ytick.labelsize'] = 10
rcParams['legend.fontsize'] = 7

## Setting up the Data

Chronos always requires at least three dataframes: 
* A matrix of readcounts with sequenced entities as the index, individual sgRNAs as the columns, and values indicating how many reads were found for that sgRNA. A sequenced entity is any vector of sgRNA readcounts read out during the experiment. It could be a sequencing run of pDNA, or of a biological replicate at some time point during the experiment.
* A sequence map mapping sequenced entities to either pDNA or a cell line and giving the days since infection and pDNA batch. 
* A guide map mapping sgRNAs to genes. Each sgRNA included must map to one and only one gene.
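
As a rough illustration of how the three inputs fit together, here is a toy version built by hand. All names and values are made up, and the column names follow the sample tables shown later in this vignette; this is a sketch of the expected shapes, not real data.

```python
import pandas as pd

# Toy readcounts: rows are sequenced entities, columns are sgRNAs.
readcounts_toy = pd.DataFrame(
    [[500, 480, 510], [120, 450, 530], [100, 470, 495]],
    index=["pDNA_run1", "lineA_repA", "lineA_repB"],
    columns=["sgA1", "sgA2", "sgB1"],
)

# Toy sequence map: one row per sequenced entity (row of readcounts).
sequence_map_toy = pd.DataFrame({
    "sequence_ID": ["pDNA_run1", "lineA_repA", "lineA_repB"],
    "cell_line_name": ["pDNA", "lineA", "lineA"],
    "pDNA_batch": [0, 0, 0],
    "days": [0, 21, 21],
})

# Toy guide map: each sgRNA appears once and maps to exactly one gene.
guide_map_toy = pd.DataFrame({
    "sgrna": ["sgA1", "sgA2", "sgB1"],
    "gene": ["GENE_A", "GENE_A", "GENE_B"],
})

print(readcounts_toy.shape)
```

The readcounts index lines up with the sequence map's IDs, and the readcounts columns line up with the guide map's sgRNAs.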

Below, we'll load a small subset of the DepMap Avana data. The files have been reformatted from the release to the format Chronos expects.

In [4]:
sequence_map = pd.read_csv("Data/SampleData/AvanaSequenceMap.csv")
guide_map = pd.read_csv("Data/SampleData/AvanaGuideMap.csv")
readcounts = chronos.read_hdf5("Data/SampleData/AvanaReadcounts.hdf5")

Sequence maps must have the columns

* sequence_ID (str), which must match a row in readcounts.
* cell_line_name (str). Must be "pDNA" for pDNA, and each pDNA batch must have at least one pDNA measurement.
* pDNA_batch (any simple hashable type, preferably int or str). pDNA measurements sharing the same batch will be grouped and averaged, then used as the reference for all biological replicate sequencings assigned that same batch. If you don't have multiple pDNA batches (by far the most common experimental condition), just fill this column with 0 or some other constant value.
* days: days post infection. This value will be ignored for pDNA.

Other columns will be ignored.
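
It can help to check these requirements before building a model. The helper below is a hypothetical sketch (`check_sequence_map` is not part of Chronos), and it assumes the column names that appear in the sample table: `sequence_ID`, `cell_line_name`, `pDNA_batch`, and `days`.

```python
import pandas as pd

REQUIRED_COLUMNS = ["sequence_ID", "cell_line_name", "pDNA_batch", "days"]

def check_sequence_map(sequence_map, readcounts):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in sequence_map.columns]
    if missing:
        return ["missing columns: %s" % missing]
    problems = []
    # Every sequence ID should be a row of the readcounts matrix.
    unmatched = set(sequence_map["sequence_ID"]) - set(readcounts.index)
    if unmatched:
        problems.append("sequence IDs not in readcounts: %s"
                        % sorted(unmatched, key=str))
    # Every batch needs at least one pDNA measurement to serve as reference.
    is_pdna = sequence_map["cell_line_name"] == "pDNA"
    no_reference = (set(sequence_map["pDNA_batch"])
                    - set(sequence_map.loc[is_pdna, "pDNA_batch"]))
    if no_reference:
        problems.append("batches with no pDNA measurement: %s"
                        % sorted(no_reference, key=str))
    return problems

# Toy example: batch2 has a replicate but no pDNA measurement.
toy_map = pd.DataFrame({
    "sequence_ID": ["pDNA_run1", "lineA_repA", "lineB_repA"],
    "cell_line_name": ["pDNA", "lineA", "lineB"],
    "pDNA_batch": ["batch1", "batch1", "batch2"],
    "days": [0, 21, 21],
})
toy_counts = pd.DataFrame(0, index=toy_map["sequence_ID"], columns=["sg1"])
sequence_problems = check_sequence_map(toy_map, toy_counts)
print(sequence_problems)
```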

In [5]:
sequence_map[:5]

Unnamed: 0,sequence_ID,ScreenID,days,pDNA_batch,Replicate,ScreenType,cell_line_name,ModelConditionID,Library,PassesQC
0,HEL-311Cas9_RepA_p4_Avana-3,SC-000004.AV01,21,Avana-3,A,2DS,ACH-000004,MC-000004-pA3k,Avana,True
1,HEL-311Cas9_RepB_p4_Avana-3,SC-000004.AV01,21,Avana-3,B,2DS,ACH-000004,MC-000004-pA3k,Avana,True
2,KU812-311cas9-RepA-p6_Avana-3,SC-000074.AV01,21,Avana-3,A,2DS,ACH-000074,MC-000074-OKtM,Avana,True
3,KU812-311cas9-RepB-p6_Avana-3,SC-000074.AV01,21,Avana-3,B,2DS,ACH-000074,MC-000074-OKtM,Avana,True
4,T47D-311Cas9-RepA-p6_Avana-4,SC-000147.AV01,21,Avana-4,A,2DS,ACH-000147,MC-000147-Uovr,Avana,True


Guide maps must have the columns 

* sgrna (str): must match a column in readcounts. An sgrna can only appear once in this data frame.
* gene (str): the gene the sgrna maps to.

Other columns will be ignored.
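
A similar check can be sketched for the guide map. `check_guide_map` is a hypothetical helper, not a Chronos function; it only tests the two requirements listed above.

```python
import pandas as pd

def check_guide_map(guide_map, readcounts):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    # Each sgRNA may appear only once in the guide map.
    duplicated = guide_map["sgrna"][guide_map["sgrna"].duplicated()].unique()
    if len(duplicated):
        problems.append("sgRNAs listed more than once: %s" % sorted(duplicated))
    # Each sgRNA must be a column of the readcounts matrix.
    unmatched = set(guide_map["sgrna"]) - set(readcounts.columns)
    if unmatched:
        problems.append("sgRNAs missing from readcounts: %s" % sorted(unmatched))
    return problems

# Toy example: sg2 is listed twice, mapping to two different genes.
toy_guides = pd.DataFrame({
    "sgrna": ["sg1", "sg2", "sg2"],
    "gene": ["GENE_A", "GENE_A", "GENE_B"],
})
toy_guide_counts = pd.DataFrame([[10, 20]], index=["rep1"], columns=["sg1", "sg2"])
guide_problems = check_guide_map(toy_guides, toy_guide_counts)
print(guide_problems)
```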

In [6]:
guide_map[:4]

Unnamed: 0,sgrna,GenomeAlignment,gene,nAlignments,DropReason,UsedByChronos
0,AAAAATGCGCAAATTCAGCG,chr3_138742712_-,PIK3CB (5291),1.0,,True
1,AAAACACATCAGTATAACAT,chr3_49368469_+,RHOA (387),1.0,,True
2,AAAACTACAGAAGCCTCCCG,chr10_34450424_-,PARD3 (56288),1.0,,True
3,AAAAGGCCTGACATATCTGA,chr15_66444677_+,MAP2K1 (5604),2.0,,True


Finally, here's what readcounts should look like. They can include NaNs. Note the axes.

In [7]:
readcounts.iloc[:4, :3]

Unnamed: 0,AAAAATGCGCAAATTCAGCG,AAAACACATCAGTATAACAT,AAAACTACAGAAGCCTCCCG
HEL-311Cas9_RepA_p4_Avana-3,101.0,224.0,636.0
HEL-311Cas9_RepB_p4_Avana-3,147.0,400.0,350.0
KU812-311cas9-RepA-p6_Avana-3,124.0,191.0,364.0
KU812-311cas9-RepB-p6_Avana-3,129.0,536.0,1280.0


To QC the data, we'll want control groups. We'll use predefined sets of common essential and nonessential genes, and use these to define control sets of sgRNAs.

In [8]:
common_essentials = pd.read_csv("Data/SampleData/AchillesCommonEssentialControls.csv")["Gene"]
nonessentials = pd.read_csv("Data/SampleData/AchillesNonessentialControls.csv")["Gene"]

In [9]:
positive_controls = guide_map.sgrna[guide_map.gene.isin(common_essentials)]
negative_controls = guide_map.sgrna[guide_map.gene.isin(nonessentials)]

### NaNing clonal outgrowths

In Achilles, we've observed rare instances where a single guide in a single biological replicate will produce an unexpectedly large number of readcounts, while other guides targeting the same gene or other replicates of the same cell line do not show many readcounts. We suspect this is the result of a single clone gaining some fitness advantage. Although it _could_ be related to a change induced by the guide, in general it's probably misleading. Therefore Chronos has an option to identify and remove these events.
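
To make the idea concrete, here is a toy sketch of flagging a suspicious outgrowth. It is not Chronos' implementation: it only compares guides for one gene within the same replicate, and both thresholds (`min_lfc`, `min_gap`) are hypothetical values chosen for illustration.

```python
import numpy as np

def flag_outgrowths(lfc, min_lfc=2.0, min_gap=2.0):
    """Flag the top guide in a replicate if its log fold change is both large
    and far above the second-highest guide for the same gene.

    lfc: 2D array, rows are replicates, columns are guides for one gene.
    """
    lfc = np.asarray(lfc, dtype=float)
    flags = np.zeros(lfc.shape, dtype=bool)
    for i, row in enumerate(lfc):
        order = np.argsort(row)
        top, second = order[-1], order[-2]
        if row[top] > min_lfc and row[top] - row[second] > min_gap:
            flags[i, top] = True
    return flags

# Toy data: the third guide shoots up in the first replicate only.
toy_lfc = np.array([
    [-0.2, 0.1, 4.5, -0.1],
    [-0.3, 0.0, 0.2, -0.2],
])
flags = flag_outgrowths(toy_lfc)
masked = np.where(flags, np.nan, toy_lfc)  # NaN out the flagged entries
print(flags)
```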

In [10]:
chronos.nan_outgrowths(
    readcounts=readcounts,
    guide_gene_map=guide_map,
    sequence_map=sequence_map
)

calculating LFC
finding maximum LFC calls
filtering
finding second highest LFC calls
finding sequences and guides with outgrowth
NAing 296 readcounts (0.00090 of total)


### QCing the data

You can generate a report with basic QC metrics about your data. You don't have to have control guides to do this, but the report is most useful if you do. If you don't have the `reportlab` python package installed, this section will error and should be skipped. This command will write a pdf report named "Initial QC.pdf" in the `./Data/reports` directory.

In [11]:
reportdir = "./Data/reports"
# permanently deletes the directory - careful if you edit this line!
! rm -rf "./Data/reports"
! mkdir "./Data/reports"

In [12]:
from chronos import reports
metrics = reports.qc_initial_data(
    "Initial QC", readcounts, sequence_map, guide_map,
    negative_controls, positive_controls,
    directory=reportdir
)

calculating replicate correlation
generating control separation metrics
Plotting log fold-change distribution
plotting control separation metrics



Look in the Data/reports directory to see the QC report, "Initial QC.pdf".
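
For intuition about what a control separation metric measures, the sketch below computes the null-normalized median difference (NNMD), one commonly used measure of how far positive controls fall from negative controls. This is an illustration on made-up log fold changes; it is an assumption that this resembles what the report computes, and the function name is hypothetical.

```python
import numpy as np

def null_normalized_median_difference(lfc, positive_mask, negative_mask):
    """(median of positives - median of negatives) / MAD of negatives."""
    lfc = np.asarray(lfc, dtype=float)
    pos = lfc[np.asarray(positive_mask)]
    neg = lfc[np.asarray(negative_mask)]
    mad = np.median(np.abs(neg - np.median(neg)))
    return (np.median(pos) - np.median(neg)) / mad

# Toy log fold changes: three positive-control guides, five negative controls.
toy_qc_lfc = np.array([-2.0, -1.8, -2.2, 0.1, -0.1, 0.0, 0.2, -0.2])
is_positive = np.array([True, True, True, False, False, False, False, False])
nnmd = null_normalized_median_difference(toy_qc_lfc, is_positive, ~is_positive)
print(nnmd)
```

More negative values indicate that positive controls are depleted well beyond the spread of the negative controls.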

## Train Chronos

### Creating the model

Now we initialize the model. Note the form of the data: each of the three parameters is actually a dictionary. If we were training the model with data from multiple libraries simultaneously, each library's data would have its own entries in the dict. 

The `negative_control_sgrnas` is an optional parameter, but including it will allow 1. better removal of library size effects from readcounts, and 2. estimation of the negative binomial quadratic overdispersion parameter per screen, which is otherwise a fixed hyperparameter. If provided, these should be cutting sgRNAs that are strongly expected to have no viability impact.
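
For readers unfamiliar with the term, "quadratic overdispersion" refers to the NB2 form of the negative binomial, whose variance is the mean plus the overdispersion parameter times the mean squared. The short sketch below only illustrates that relationship; the function name is ours and it is not part of Chronos.

```python
def nb2_variance(mean, alpha):
    """Variance of an NB2 negative binomial: mean + alpha * mean**2."""
    return mean + alpha * mean ** 2

# With alpha = 0 the variance equals the mean, as for a Poisson distribution.
poisson_like = nb2_variance(100.0, 0.0)
# A larger alpha means noisier counts at the same mean.
overdispersed = nb2_variance(100.0, 0.05)
print(poisson_like, overdispersed)
```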

`log_dir` is an optional argument specifying a directory for tensorflow to write summaries to. We include it here so that tensorboard can load the model.

In [13]:
logdir = "./Data/logs"
# permanently deletes the directory - careful if you edit this line!
! rm -rf "./Data/logs"
! mkdir "./Data/logs"

In [14]:
model = chronos.Chronos(
    sequence_map={"avana": sequence_map},
    guide_gene_map={"avana": guide_map},
    readcounts={"avana": readcounts},
    negative_control_sgrnas={"avana": negative_controls},
    log_dir=logdir
)

normalizing readcounts


Finding all unique guides and genes
found 3474 unique guides and 883 unique genes in avana
found 3474 unique guides and 883 unique genes overall

finding guide-gene mapping indices

finding all unique sequenced replicates, cell lines, and pDNA batches
found 92 unique sequences (excluding pDNA) and 44 unique cell lines in avana
found 92 unique replicates and 44 unique cell lines overall

finding replicate-cell line mappings indices

finding replicate-pDNA mappings indices


assigning float constants
Estimating or aligning variances
	Estimating excess variance (alpha) for avana
Between 0 (batch=Index(['avana_Avana-2', 'avana_Avana-3', 'avana_Avana-4'], dtype='object')) and 0 (batch=Index(['avana_Avana-2', 'avana_Avana-3', 'avana_Avana-4'], dtype='object')) negative control sgRNAs were found to be systematically over- or under-represented in the screens and excluded.
Creating excess variance tensors
	Created excess variance tensor for avana with shape [92, 1]
init

2023-08-23 16:03:24.490054: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2023-08-23 16:03:24.542028: W tensorflow/c/c_api.cc:304] Operation '{name:'excess_variance/avana/Assign' id:6 op device:{requested: '', assigned: ''} def:{{{node excess_variance/avana/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](excess_variance/avana, excess_variance/avana/Initializer/initial_value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


avana
(92, 3474)
(92,)
building other regularizations

Creating optimizer
	creating log at ./Data/logs
initializing rest of graph
estimating initial screen efficacy and gene effect
	 avana


2023-08-23 16:03:24.991346: W tensorflow/c/c_api.cc:304] Operation '{name:'GE/library_effect/avana/Assign' id:149 op device:{requested: '', assigned: ''} def:{{{node GE/library_effect/avana/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](GE/library_effect/avana, GE/library_effect/avana/Initializer/initial_value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


	verifying graph integrity
verifying user inputs
verifying variables
verifying calculated terms
	avana _gene_effect
	avana _selected_efficacies
	avana_predicted_readcounts_unscaled
	avana _predicted_readcounts
	avana _normalized_readcounts
	avana _cost_presum
sess run
	avana _cost
	avana _full_costs
ready to train


If you have tensorboard, the cell below will show Chronos' node structure. `GE` means gene effect (relative change in growth rate), `FC` means predicted fold change, `t0` is the inferred relative guide abundance at t0, and `out_norm` is the predicted readcounts. 

In [15]:
%reload_ext tensorboard
!kill $(ps -e | grep 'tensorboard' | awk '{print $1}')
%tensorboard --logdir ./Data/logs

Now, optimizing the model:

### Train

Below, we train the model for 301 epochs. This should take a minute or so, with progress reported every 50 epochs.

In [16]:
model.train(301, report_freq=50, burn_in_period=50, ge_only=0)

NB2 cost 0.3275882564646015
Full cost 0.39445532037733083
relative_growth_rate
	avana max 1.035, min 0.94536
mean guide efficacy 0.9921441878147884
t0_offset SD: [('avana', 7.073982246190464e-05)]

gene mean -0.21312644010028722
SD of gene means 0.29797867921901705
Mean of gene SDs 0.2506958572484661



51 epochs trained, time taken 0:00:02, projected remaining 0:00:08
NB2 cost 0.21536041941021716
Full cost 0.25132057996620094
relative_growth_rate
	avana max 1.373, min 0.58844
mean guide efficacy 0.9337792276370737
t0_offset SD: [('avana', 0.1428844417973969)]

gene mean -0.10686461815328012
SD of gene means 0.43405764849143774
Mean of gene SDs 0.2141562123855888



101 epochs trained, time taken 0:00:03, projected remaining 0:00:06
NB2 cost 0.20541944722051755
Full cost 0.22997130612679043
relative_growth_rate
	avana max 1.607, min 0.45889
mean guide efficacy 0.8829074844509959
t0_offset SD: [('avana', 0.1357214065176107)]

gene mean -0.02065916764819174
SD of gene means 0.43437010748

## After Training

### Saving and Restoring

Chronos' `save` method dumps all the inputs, outputs, and model parameters to the specified directory. These files are written such that they can be read in individually and analyzed, but also used to restore the model by passing the directory path to the function `load_saved_model`.

In [17]:
savedir = "Data/Achilles_run_compare"

In [18]:
if not os.path.isdir(savedir):
    os.mkdir(savedir)

In [19]:
model.save(savedir, overwrite=True)

In [20]:
print("Saved files:\n\n" + '\n'.join(['\t' + s for s in os.listdir(savedir)
                if s.endswith("csv")
                or s.endswith("hdf5")
                or s.endswith("json")
                ]))

Saved files:

	library_effect.csv
	cell_line_growth_rate.csv
	avana_predicted_readcounts.hdf5
	parameters.json
	guide_efficacy.csv
	avana_sequence_map.csv
	t0_offset.csv
	avana_predicted_lfc.hdf5
	avana_guide_gene_map.csv
	gene_effect_corrected.hdf5
	avana_negative_control_sgrnas.csv
	screen_delay.csv
	screen_excess_variance.csv
	cell_line_efficacy.csv
	gene_effect.hdf5
	avana_readcounts.hdf5


The .hdf5 files are binaries written with chronos' `write_hdf5` function, which is an efficient method for writing large matrices. They can be read with chronos' `read_hdf5` function.

Restoring the model can be done with a single function call:

In [21]:
model_restored = chronos.load_saved_model(savedir)



Finding all unique guides and genes
found 3474 unique guides and 883 unique genes in avana
found 3474 unique guides and 883 unique genes overall

finding guide-gene mapping indices

finding all unique sequenced replicates, cell lines, and pDNA batches
found 92 unique sequences (excluding pDNA) and 44 unique cell lines in avana
found 92 unique replicates and 44 unique cell lines overall

finding replicate-cell line mappings indices

finding replicate-pDNA mappings indices


assigning float constants
Estimating or aligning variances
	Estimating excess variance (alpha) for avana
Between 0 (batch=Index(['avana_avana_Avana-2', 'avana_avana_Avana-3', 'avana_avana_Avana-4'], dtype='object')) and 0 (batch=Index(['avana_avana_Avana-2', 'avana_avana_Avana-3', 'avana_avana_Avana-4'], dtype='object')) negative control sgRNAs were found to be systematically over- or under-represented in the screens and excluded.
Creating excess variance tensors
	Created excess variance tensor for avana with shape

2023-08-23 16:03:39.291670: W tensorflow/c/c_api.cc:304] Operation '{name:'excess_variance_1/avana/Assign' id:3099 op device:{requested: '', assigned: ''} def:{{{node excess_variance_1/avana/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](excess_variance_1/avana, excess_variance_1/avana/Initializer/initial_value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


initializing rest of graph
	verifying graph integrity
verifying user inputs
verifying variables


2023-08-23 16:03:39.824066: W tensorflow/c/c_api.cc:304] Operation '{name:'GE_1/library_effect/avana/Assign' id:3242 op device:{requested: '', assigned: ''} def:{{{node GE_1/library_effect/avana/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](GE_1/library_effect/avana, GE_1/library_effect/avana/Initializer/initial_value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


verifying calculated terms
	avana _gene_effect
	avana _selected_efficacies
	avana_predicted_readcounts_unscaled
	avana _predicted_readcounts
	avana _normalized_readcounts
	avana _cost_presum
sess run
	avana _cost
	avana _full_costs
ready to train
assigning trained parameters
	library effect
	gene effect
	guide efficacy
	cell efficacy
	cell growth rate
	screen excess variance
	screen delay
	t0 offset
Complete.
Cost when saved: 0.204619, cost now: 0.204620
Full cost when saved: 0.227851, full cost now: 0.227855


In [22]:
print("trained model cost: %f\nrestored model cost: %f" % (model.cost, model_restored.cost))

trained model cost: 0.204619
restored model cost: 0.204620


2023-08-23 16:03:40.830772: W tensorflow/c/c_api.cc:304] Operation '{name:'GE/library_effect/avana/Adam_1/Assign' id:1637 op device:{requested: '', assigned: ''} def:{{{node GE/library_effect/avana/Adam_1/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](GE/library_effect/avana/Adam_1, GE/library_effect/avana/Adam_1/Initializer/zeros)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


The most important file for most use cases is gene_effect.hdf5, which holds Chronos' estimate of the relative change in growth rate caused by gene knockouts. Negative values indicate inhibitory effects. You can also access the gene effect (and other parameters) from the trained model directly:
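
As a small illustration of reading a gene effect matrix (cell lines as rows, genes as columns), the sketch below lists genes falling under a cutoff in each line. The matrix is made up and the cutoff of -0.5 is an arbitrary, hypothetical choice, not a recommended threshold.

```python
import pandas as pd

toy_gene_effect = pd.DataFrame(
    [[-1.1, 0.05, -0.2], [-0.9, -0.6, 0.1]],
    index=["lineA", "lineB"],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)

threshold = -0.5  # hypothetical cutoff for calling a dependency

# More negative values mean a stronger inhibitory effect of the knockout.
dependencies = {
    line: list(row.index[row < threshold])
    for line, row in toy_gene_effect.iterrows()
}
print(dependencies)
```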

In [23]:
gene_effects = model.gene_effect

gene_effects.iloc[:4, :5]

gene,A1CF (29974),A2M (2),A2ML1 (144568),A3GALT2 (127550),A4GALT (53947)
cell_line_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ACH-000004,0.329518,0.063793,0.260753,0.378038,0.206696
ACH-000074,0.409007,0.467706,0.637598,0.378765,0.283279
ACH-000147,0.318163,0.295269,0.413231,0.131998,0.408138
ACH-000168,0.36581,0.374993,0.516981,0.29318,0.246941


### Copy Number Correction

If you have gene-level copy number calls, Chronos includes an option to correct gene effect scores after the fact. This works best if the data has been scaled, as above.

In [24]:
cn = chronos.read_hdf5("Data/SampleData/OmicsCNGene.hdf5")
cn.iloc[:4, :3]

Unnamed: 0,A1CF (29974),A2M (2),A2ML1 (144568)
ACH-001636,1.110663,1.04767,1.04767
ACH-000784,1.193826,1.161938,1.109943
ACH-000147,1.365585,0.499068,0.499068
ACH-000657,1.033547,1.040547,1.040547


Unfortunately, we don't have copy number calls for one of the genes targeted by the Avana library:

In [25]:
try:
    corrected, shifts = chronos.alternate_CN(gene_effects, cn)
except ValueError as e:
    print(e)

Missing 1 genes from gene_effect in copy_number.
Examples: ['POU2AF3 (120376)']


We could choose to drop these genes. Instead, we'll assume normal ploidy (=1, in the current CCLE convention) for them and fill in the CN matrix accordingly.

In [26]:
for col in set(gene_effects.columns) - set(cn.columns):
    cn[col] = 1

In [27]:
corrected, shifts = chronos.alternate_CN(gene_effects, cn)


Fitting cell line group 1 of 1
finding low CN gene effect shifts
smoothing and interpolating cutting toxicity for all genes
constructed spline matrix of shape 38852, 105
	cost: 0.03937682375926195
	cost: 0.0384827817216349
	cost: 0.03834606931877082
	cost: 0.03826118870198349
	cost: 0.038201696197665364
	cost: 0.03816025534423646
generating matrix


The `shifts` dataframe contains some information about the inferred CN effect, while `corrected` contains the corrected gene effects matrix. Overall, gene effect matrices will change little after correction, since most genes in most lines are near diploid.

We'll write the corrected dataframe to the saved directory we made earlier.

In [28]:
chronos.write_hdf5(corrected, os.path.join(savedir, "gene_effect_corrected.hdf5"))

### QC report

The function `dataset_qc_report` in the `reports` module of Chronos presents a variety of QC metrics and interrogates some specific examples. The report minimally requires a set of positive and negative control genes. To get the full report requires copy number, mutation data, expression data, a list of expression addictions (genes which are dependencies in highly expressing lines), and oncogenic mutations.

Below, we'll load an annotated DepMap MAF file (subsetted to our cell lines). We'll select gain of function cancer driver events from it and generate a binary mutation matrix. We have a prior belief that cell lines with driver gain of function mutation events will be dependent on the mutated gene, so this matrix will be used by the QC report to assess our ability to identify selective dependencies. Specifically, we expect the oncogenes in this matrix to be dependencies in cell lines where the matrix is `True`, and not otherwise.

In [29]:
maf = pd.read_csv("Data/SampleData/OmicsSomaticMutations.csv")

In [30]:
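# Keep driver (or likely driver) events that are also likely gain of function.
cancer_relevant = maf[
    (maf.Driver | maf.LikelyDriver)
    & maf.LikelyGoF
]

# Keep one row per (cell line, gene) pair. The copy avoids assigning a new
# column to a slice of `maf` in the next step.
cancer_relevant = cancer_relevant[
    ~cancer_relevant.duplicated(subset=["ModelID", "Gene"])
].copy()

cancer_relevant['truecol'] = True

# Rows are cell lines, columns are genes, and True marks a driver GoF event.
gof_matrix_base = pd.pivot(cancer_relevant, index="ModelID", columns="Gene", values="truecol")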
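
To see what the pivot step produces, here is the same pattern on a toy table with made-up model IDs. Note that cell line and gene pairs with no event come out as NaN rather than `False`. Whether downstream code needs them filled is not stated here, so the `notna()` conversion below is just one option.

```python
import pandas as pd

toy_events = pd.DataFrame({
    "ModelID": ["ACH-1", "ACH-1", "ACH-2"],
    "Gene": ["KRAS", "KRAS", "BRAF"],
})

# Drop repeated (cell line, gene) pairs, then mark the remaining events.
toy_events = toy_events.drop_duplicates(subset=["ModelID", "Gene"]).copy()
toy_events["truecol"] = True

# Pairs without an event are NaN after the pivot.
toy_matrix = toy_events.pivot(index="ModelID", columns="Gene", values="truecol")

# A strictly boolean version: True where an event was recorded.
toy_bool = toy_matrix.notna()
print(toy_bool)
```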

Another way to evaluate selective dependencies is using expression addictions, a common pattern in which a gene is a stronger dependency in lines with higher expression. We'll use a list derived from DepMap RNAi (Tsherniak et al., Cell 2017), and subset our expression matrix to match.

In [31]:
expression_addictions = pd.read_csv("Data/SampleData/RNAiExpressionAddictions.csv")['Gene']

In [32]:
addiction_expressions = chronos.read_hdf5("Data/SampleData/OmicsExpressionProteinCodingGenesTPMLogp1.hdf5")[
    expression_addictions
]
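
The expression addiction pattern described above shows up as a negative correlation between a gene's expression and its gene effect across cell lines. The sketch below uses made-up numbers for a single gene to illustrate this. It is not how the QC report scores these genes.

```python
import numpy as np

# Toy values for one gene across five cell lines.
toy_expression = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. log expression
toy_effect = np.array([0.1, -0.1, -0.4, -0.8, -1.2])        # gene effect

# Higher expression paired with more negative gene effect gives r near -1.
r = np.corrcoef(toy_expression, toy_effect)[0, 1]
print(round(r, 3))
```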

Now, we're ready to run the QC report on Chronos' results:

In [33]:
metrics = reports.dataset_qc_report("ChronosAvana", savedir, 
                          common_essentials, nonessentials,
                          gof_matrix_base, addiction_expressions,
                          cn, directory="Data/reports",
                          gene_effect_file="gene_effect_corrected.hdf5"
                         )

Loading data from Data/Achilles_run_compare
plotting global control separation
plotting selective dependency separation
plotting gene effect mean relationships
plotting copy number effect
plotting screen efficacy and growth rate
plotting readcount predictions


plotting LFC predictions


plotting difference from naive gene score
summarizing
plotting genes with low agreement with naive gene effect
	FOXR1 (283150)
avana    avana
dtype: object avana    avana
dtype: object
Guide and replicate key for FOXR1 (283150), ACH-000004:
avana    av
dtype: object
GAGACCTCCAGCTTTCCAGG    avGuide1
GGAAGATGCCAGCTGCTCAG    avGuide2
TGAGACCTCCAGCTTTCCAG    avGuide3
TGGGATTTACCCACATCCAG    avGuide4
dtype: object
HEL-311Cas9_RepA_p4_Avana-3    avRep1
HEL-311Cas9_RepB_p4_Avana-3    avRep2
dtype: object
avana    avana
dtype: object avana    avana
dtype: object
Guide and replicate key for FOXR1 (283150), ACH-000750:
avana    av
dtype: object
GAGACCTCCAGCTTTCCAGG    avGuide1
GGAAGATGCCAGCTGCTCAG    avGuide2
TGAGACCTCCAGCTTTCCAG    avGuide3
TGGGATTTACCCACATCCAG    avGuide4
dtype: object
LOXIMVI-311Cas9_RepA_p6_Avana-2    avRep1
LOXIMVI-311Cas9_RepB_p6_Avana-2    avRep2
dtype: object
	TNPO3 (23534)
avana    avana
dtype: object avana    avana
dtype: object
Guide and replicate key for TNPO3 (23534

## Running with multiple libraries

We can add Sanger's [Project Score](https://www.nature.com/articles/s41586-019-1103-9) data (screened with the KY library) and run Chronos jointly on it and the Avana data. 

In [34]:
ky_guide_map = pd.read_csv("./Data/SampleData/KYGuideMap.csv")
ky_sequence_map = pd.read_csv("./Data/SampleData/KYSequenceMap.csv")
ky_readcounts = chronos.read_hdf5("./Data/SampleData/KYReadcounts.hdf5")

In [35]:
ky_positive_controls = ky_guide_map.sgrna[ky_guide_map.gene.isin(common_essentials)]
ky_negative_controls = ky_guide_map.sgrna[ky_guide_map.gene.isin(nonessentials)]

Note how the call signature of Chronos with multiple libraries is constructed:

In [39]:
model2 = chronos.Chronos(
    sequence_map={"avana": sequence_map, 'ky': ky_sequence_map},
    guide_gene_map={"avana": guide_map, 'ky': ky_guide_map},
    readcounts={"avana": readcounts, 'ky': ky_readcounts},
    negative_control_sgrnas={"avana": negative_controls, "ky": ky_negative_controls}
)

normalizing readcounts


Finding all unique guides and genes
found 3474 unique guides and 883 unique genes in avana
found 4084 unique guides and 833 unique genes in ky
found 7558 unique guides and 887 unique genes overall

finding guide-gene mapping indices

finding all unique sequenced replicates, cell lines, and pDNA batches
found 92 unique sequences (excluding pDNA) and 44 unique cell lines in avana
found 63 unique sequences (excluding pDNA) and 23 unique cell lines in ky
found 155 unique replicates and 58 unique cell lines overall

finding replicate-cell line mappings indices

finding replicate-pDNA mappings indices


assigning float constants
Estimating or aligning variances
	Estimating excess variance (alpha) for avana
Between 0 (batch=Index(['avana_Avana-2', 'avana_Avana-3', 'avana_Avana-4'], dtype='object')) and 0 (batch=Index(['avana_Avana-2', 'avana_Avana-3', 'avana_Avana-4'], dtype='object')) negative control sgRNAs were found to be systematically over- or under-represented 

2023-08-23 16:07:46.199308: W tensorflow/c/c_api.cc:304] Operation '{name:'inferred_t0_5/base_avana/Assign' id:7596 op device:{requested: '', assigned: ''} def:{{{node inferred_t0_5/base_avana/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](inferred_t0_5/base_avana, inferred_t0_5/base_avana/Initializer/initial_value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


building other regularizations

Creating optimizer
initializing rest of graph


2023-08-23 16:07:47.119229: W tensorflow/c/c_api.cc:304] Operation '{name:'GE_5/library_effect/ky/Assign' id:7811 op device:{requested: '', assigned: ''} def:{{{node GE_5/library_effect/ky/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_DOUBLE, validate_shape=false](GE_5/library_effect/ky, GE_5/library_effect/ky/Initializer/initial_value)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


estimating initial screen efficacy and gene effect
	 avana
	 ky
	verifying graph integrity
verifying user inputs
verifying variables
verifying calculated terms
	avana _gene_effect
	avana _selected_efficacies
	avana_predicted_readcounts_unscaled
	avana _predicted_readcounts
	avana _normalized_readcounts
	avana _cost_presum
sess run
	avana _cost
	avana _full_costs
	ky _gene_effect
	ky _selected_efficacies
	ky_predicted_readcounts_unscaled
	ky _predicted_readcounts
	ky _normalized_readcounts
	ky _cost_presum
sess run
	ky _cost
	ky _full_costs
ready to train


In [None]:
model2.train(301)

Note that the gene effect now has NAs. These are cases where a cell line was only screened in one library and that library had no guides for that gene.
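
The toy sketch below shows where such NAs come from. The libraries, genes, and cell lines are made up; a gene is left without an estimate for a line when none of the libraries that line was screened in has guides for it.

```python
# Which genes each library targets, and which libraries screened each line.
library_genes = {
    "avana": {"GENE_A", "GENE_B"},
    "ky": {"GENE_B", "GENE_C"},
}
line_libraries = {
    "lineA": {"avana"},
    "lineB": {"ky"},
    "lineC": {"avana", "ky"},
}

def covered_genes(line):
    """Genes targeted by at least one library the line was screened in."""
    genes = set()
    for lib in line_libraries[line]:
        genes |= library_genes[lib]
    return genes

all_genes = set().union(*library_genes.values())
missing_by_line = {
    line: sorted(all_genes - covered_genes(line)) for line in line_libraries
}
print(missing_by_line)
```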

Chronos infers library batch effects. Note that these are only inferred for genes present in all libraries.

In [None]:
model2.library_effect
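
To see ahead of time which genes are present in all libraries, you can intersect the genes in the guide maps. The sketch below uses toy guide maps with made-up names; with the real data, the dictionary values would be `guide_map` and `ky_guide_map`.

```python
import pandas as pd

toy_avana_guides = pd.DataFrame({
    "sgrna": ["a1", "a2", "a3"],
    "gene": ["GENE_A", "GENE_B", "GENE_B"],
})
toy_ky_guides = pd.DataFrame({
    "sgrna": ["k1", "k2"],
    "gene": ["GENE_B", "GENE_C"],
})
toy_guide_maps = {"avana": toy_avana_guides, "ky": toy_ky_guides}

# Genes targeted by every library.
shared_genes = set.intersection(
    *(set(gm["gene"]) for gm in toy_guide_maps.values())
)
print(sorted(shared_genes))
```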

## Running your screen with pretrained DepMap parameters

If you conducted a screen in one of the DepMap integrated libraries (currently Avana, KY, or Humagne-CD), you can load parameters from the trained DepMap model and use them to process your specific screen. This gives you many of the benefits of coprocessing your screen with the complete DepMap dataset without the computational expense. 

The following command fetches the 22Q3 public dataset from Figshare and stores it in the Chronos package directory under Data/DepMapParameters.

In [None]:
chronos.fetch_parameters()

First, we create a model with the data we want to train as before, but with two important details:
- we pass the argument `pretrained=True` when we initialize
- the library batch names must match the DepMap library batch names, as that's what we're using for the pretrained model
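
If your inputs are already keyed by your own library names, one way to meet the second requirement is to rename the keys consistently across all arguments. The helper below is a hypothetical sketch (`rename_libraries` is not a Chronos function); the target names are the ones used in this vignette's pretrained example, and strings stand in for the dataframes.

```python
def rename_libraries(inputs, name_map):
    """Rename library keys in every per-library dict.

    inputs: dict mapping argument name -> {library name: data}
    name_map: dict mapping current library name -> new library name
    """
    renamed = {}
    for arg, by_library in inputs.items():
        unknown = set(by_library) - set(name_map)
        if unknown:
            raise KeyError("no new name given for: %s" % sorted(unknown))
        renamed[arg] = {name_map[lib]: data for lib, data in by_library.items()}
    return renamed

name_map = {"avana": "Achilles-Avana-2D", "ky": "Achilles-KY-2D"}

# Placeholder strings stand in for the actual dataframes.
toy_inputs = {
    "sequence_map": {"avana": "avana sequence map", "ky": "ky sequence map"},
    "readcounts": {"avana": "avana readcounts", "ky": "ky readcounts"},
}
renamed_inputs = rename_libraries(toy_inputs, name_map)
print(sorted(renamed_inputs["readcounts"]))
```

The renamed dictionaries could then be unpacked as keyword arguments when constructing the model.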

In [None]:
model2_pretrained = chronos.Chronos(
    sequence_map={"Achilles-Avana-2D": sequence_map, 'Achilles-KY-2D': ky_sequence_map},
    guide_gene_map={"Achilles-Avana-2D": guide_map, 'Achilles-KY-2D': ky_guide_map},
    readcounts={"Achilles-Avana-2D": readcounts, 'Achilles-KY-2D': ky_readcounts},
    negative_control_sgrnas={"Achilles-Avana-2D": negative_controls, "Achilles-KY-2D": ky_negative_controls},
    pretrained=True
)

Now we import the DepMap data from the directory into the model, and train:

In [None]:
model2_pretrained.import_model("./Data/DepMapParameters/")

In [None]:
model2_pretrained.train()