# Chronos Vignette

This vignette walks through a simple exercise in training Chronos on a subset of DepMap public 20Q4 and the Sanger Institute's Project Score data. 

## Imports

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import pandas as pd
import chronos
import os
from matplotlib import pyplot as plt
import seaborn as sns
from taigapy import default_tc as tc

Some tweaks that will make plots more legible

In [None]:
from matplotlib import rcParams
rcParams['axes.titlesize'] = 14
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['savefig.dpi'] = 200
rcParams['savefig.transparent'] = False
rcParams['font.family'] = 'Arial'
rcParams['font.size'] = '11'
rcParams['figure.dpi'] = 200
rcParams["savefig.facecolor"] = (1, 1, 1.0, 0.2)

rcParams['xtick.labelsize'] = 10
rcParams['ytick.labelsize'] = 10
rcParams['legend.fontsize'] = 7

## Setting up the Data

Chronos always requires at least three dataframes: 
* A matrix of readcounts with sequenced entities as the index, individual sgRNAs as the columns, and values indicating how many reads were found for that sgRNA. A sequenced entity is any vector of sgRNA readcounts read out during the experiment. It could be a sequencing run of pDNA, or of a biological replicate at some time point during the experiment.
* A sequence map mapping sequenced entities to either pDNA or a cell line and giving the days since infection and pDNA batch. 
* A guide map mapping sgRNAs to genes. Each sgRNA included must map to one and only one gene.

Below, we'll load a small subset of the DepMap Avana data. The files have been reformatted from the release to the format Chronos expects.

In [None]:
sequence_map = pd.read_csv("Data/SampleData/AvanaSequenceMap.csv")
guide_map = pd.read_csv("Data/SampleData/AvanaGuideMap.csv")
readcounts = chronos.read_hdf5("Data/SampleData/AvanaReadcounts.hdf5")

Sequence maps must have the columns

* sequence_id (str), which must match a row in readcounts
* cell_line_name (str). Must be "pDNA" for pDNA, and each pDNA batch must have at least one pDNA measurement.
* pDNA batch (any simple hashable type, preferably int or str). pDNA measurements sharing the same batch will be grouped and averaged, then used as the reference for all biological replicate sequencings assigned that same batch. If you don't have multiple pDNA batches (by far the most common experimental condition), just fill this column with 0 or some other constant value.
* days: days post infection. This value will be ignored for pDNA.

Other columns will be ignored.

In [None]:
sequence_map[:5]
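
Before training, it can help to confirm that a sequence map satisfies the requirements listed above. The following is a minimal sketch on toy data, not part of Chronos: the helper `check_sequence_map` is hypothetical, the ID column name follows the text above, and the batch column is assumed to be named `pDNA_batch` (change `batch_col` if yours is spelled differently).

```python
import pandas as pd

def check_sequence_map(sequence_map, readcounts, batch_col="pDNA_batch"):
    """Hypothetical helper: return a list of problems found in a sequence map."""
    problems = []
    # Every sequence ID must match a row of the readcounts matrix
    missing = set(sequence_map["sequence_id"]) - set(readcounts.index)
    if missing:
        problems.append(f"sequence IDs missing from readcounts: {sorted(missing)}")
    # Every pDNA batch needs at least one pDNA measurement
    is_pdna = sequence_map["cell_line_name"] == "pDNA"
    pdna_batches = set(sequence_map.loc[is_pdna, batch_col].tolist())
    no_pdna = set(sequence_map[batch_col].tolist()) - pdna_batches
    if no_pdna:
        problems.append(f"batches without a pDNA measurement: {sorted(no_pdna)}")
    return problems

# Toy inputs (made up for illustration)
toy_sequence_map = pd.DataFrame({
    "sequence_id": ["pDNA_1", "lineA_rep1", "lineB_rep1"],
    "cell_line_name": ["pDNA", "lineA", "lineB"],
    "pDNA_batch": [0, 0, 1],
    "days": [0, 21, 21],
})
toy_readcounts = pd.DataFrame(
    [[100, 120], [80, 10]],
    index=["pDNA_1", "lineA_rep1"],
    columns=["sgrna_1", "sgrna_2"],
)

problems = check_sequence_map(toy_sequence_map, toy_readcounts)
for problem in problems:
    print(problem)
```

In this toy case the third row is reported twice: its sequence ID has no readcounts row, and its batch has no pDNA measurement.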

Guide maps must have the columns 

* sgrna (str): must match a column in readcounts. An sgrna can only appear once in this data frame.
* gene (str): the gene the sgrna maps to.

Other columns will be ignored.

In [None]:
guide_map[:4]
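
The two guide map requirements can be checked in the same spirit. This is a sketch with a hypothetical helper (`check_guide_map`) and toy data, using only the `sgrna` and `gene` columns described above.

```python
import pandas as pd

def check_guide_map(guide_map, readcounts):
    """Hypothetical helper: report sgRNAs that violate the guide map rules."""
    # An sgRNA may appear only once in the guide map
    dup_mask = guide_map["sgrna"].duplicated(keep=False)
    duplicated = sorted(guide_map.loc[dup_mask, "sgrna"].unique().tolist())
    # Every sgRNA must match a column of the readcounts matrix
    unmatched = sorted(set(guide_map["sgrna"]) - set(readcounts.columns))
    return {"duplicated_sgrnas": duplicated, "not_in_readcounts": unmatched}

# Toy inputs (made up for illustration)
toy_guide_map = pd.DataFrame({
    "sgrna": ["g1", "g2", "g2", "g3"],
    "gene": ["GENE_A", "GENE_A", "GENE_B", "GENE_C"],
})
toy_readcounts = pd.DataFrame([[10, 20]], index=["rep1"], columns=["g1", "g2"])

guide_report = check_guide_map(toy_guide_map, toy_readcounts)
print(guide_report)
```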

Finally, here's what readcounts should look like. They can include NaNs. Note the axes.

In [None]:
readcounts.iloc[:4, :3]

To QC the data, we'll want control groups. We'll use predefined sets of common and nonessential genes, and use these to define control sets of sgRNAs.

In [None]:
common_essentials = pd.read_csv("Data/SampleData/AchillesCommonEssentialControls.csv")["Gene"]
nonessentials = pd.read_csv("Data/SampleData/AchillesNonessentialControls.csv")["Gene"]

In [None]:
positive_controls = guide_map.sgrna[guide_map.gene.isin(common_essentials)]
negative_controls = guide_map.sgrna[guide_map.gene.isin(nonessentials)]

### NaNing clonal outgrowths

In Achilles, we've observed rare instances where a single guide in a single biological replicate will produce an unexpectedly large number of readcounts, while other guides targeting the same gene or other replicates of the same cell line do not show many readcounts. We suspect this is the result of a single clone gaining some fitness advantage. Although it _could_ be related to a change induced by the guide, in general it's probably misleading. Therefore Chronos has an option to identify and remove these events.

In [None]:
chronos.nan_outgrowths(readcounts=readcounts, guide_gene_map=guide_map,
                       sequence_map=sequence_map)

### QCing the data

You can generate a report with basic QC metrics about your data. You don't have to have control guides to do this, but the report is most useful if you do. If you don't have the `reportlab` python package installed, this section will error and should be skipped. This command will write a pdf report named "Initial QC.pdf" in the `./Data/reports` directory.

In [None]:
reportdir = "./Data/reports"
# permanently deletes the directory - careful if you edit this line!
! rm -rf "./Data/reports"
! mkdir "./Data/reports"

In [None]:
from chronos import reports
metrics = reports.qc_initial_data(
    "Initial QC", readcounts, sequence_map, guide_map,
    negative_controls, positive_controls, directory=reportdir,
)

Look in the Data/reports directory to see the QC report, "Initial QC.pdf".

## Train Chronos

### Creating the model

Now we initialize the model. Note the form of the data: each of the three parameters is actually a dictionary. If we were training the model with data from multiple libraries simultaneously, each library's data would have its own entries in the dict. 

`negative_control_sgrnas` is an optional parameter, but including it allows (1) better removal of library size effects from readcounts and (2) estimation of the negative binomial quadratic overdispersion parameter per screen, which is otherwise a fixed hyperparameter. If provided, these should be cutting sgRNAs that are strongly expected to have no viability impact.

`log_dir` is an optional argument containing a directory for tensorflow to write summaries to. We include it here so that tensorboard can load the model.

In [None]:
logdir = "./Data/logs"
# permanently deletes the directory - careful if you edit this line!
! rm -rf "./Data/logs"
! mkdir "./Data/logs"

In [None]:
import warnings
warnings.filterwarnings("error")

In [None]:
model = chronos.Chronos(
    sequence_map={"avana": sequence_map},
    guide_gene_map={"avana": guide_map},
    readcounts={"avana": readcounts},
    negative_control_sgrnas={"avana": negative_controls},
    log_dir=logdir
)

If you have tensorboard, the cell below will show Chronos' node structure. `GE` means gene effect (relative change in growth rate), `FC` means predicted fold change, `t0` is the inferred relative guide abundance at t0, and `out_norm` is the predicted readcounts. 

In [None]:
%reload_ext tensorboard
!kill $(ps -e | grep 'tensorboard' | awk '{print $1}')
%tensorboard --logdir ./Data/logs

Now, optimizing the model:

### Train

Below, we train the model for 301 epochs. This should take a minute or so, with periodic updates provided.

In [None]:
model.train(301)

## After Training

### Saving and Restoring

Chronos' `save` method dumps all the inputs, outputs, and model parameters to the specified directory. These files are written such that they can be read in individually and analyzed, but also used to restore the model by passing the directory path to the function `load_saved_model`.

In [None]:
savedir = "Data/Achilles_run_compare"

In [None]:
if not os.path.isdir(savedir):
    os.mkdir(savedir)

In [None]:
model.save(savedir, overwrite=True)

In [None]:
print("Saved files:\n\n" + '\n'.join(['\t' + s for s in os.listdir(savedir)
                if s.endswith("csv")
                or s.endswith("hdf5")
                or s.endswith("json")
                ]))

The .hdf5 files are binaries written with chronos' `write_hdf5` function, which is an efficient method for writing large matrices. They can be read with chronos' `read_hdf5` function.

Restoring the model can be done with a single function call:

In [None]:
model_restored = chronos.load_saved_model(savedir)

In [None]:
print("trained model cost: %f\nrestored model cost: %f" % (model.cost, model_restored.cost))

The most important file for most use cases is gene_effect.hdf5, which holds Chronos' estimate of the relative change in growth rate caused by gene knockouts. Negative values indicate inhibitory effects. You can also access the gene effect (and other parameters) from the trained model directly:

In [None]:
gene_effects = model.gene_effect

gene_effects.iloc[:4, :5]

If your library includes many depleting genes with negative gene effect scores, this can drive nonessential genes towards positive values as Chronos tries to maintain the overall mean score near 0: 

In [None]:
print("Mean of all effects: %1.3f, mean of nonessential gene effects: %1.3f" %(
    np.nanmean(gene_effects.mean()), np.nanmean(gene_effects.reindex(columns=nonessentials))
))

We usually want nonessential gene effects to be centered at 0, so we can interpret 0 gene effect as "no impact on viability." This is a trivial change to make.

In [None]:
gene_effects -= np.nanmean(gene_effects.reindex(columns=nonessentials))
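
The effect of this shift is easiest to see on a toy matrix. The sketch below uses made-up gene names and values; it applies the same grand-mean subtraction as the cell above and confirms that the nonessential genes end up centered at 0, even when some listed controls are absent from the matrix.

```python
import numpy as np
import pandas as pd

# Toy gene effect matrix: cell lines x genes (values are made up)
toy_gene_effects = pd.DataFrame(
    {"ESS1": [-1.0, -1.2], "NON1": [0.3, 0.1], "NON2": [0.2, np.nan]},
    index=["line1", "line2"],
)
# NON3 is listed as a control but is not in the matrix; reindex turns it into NaNs
toy_nonessentials = ["NON1", "NON2", "NON3"]

# Grand mean over every non-missing nonessential score
offset = np.nanmean(toy_gene_effects.reindex(columns=toy_nonessentials).to_numpy())
toy_centered = toy_gene_effects - offset

print("offset:", round(offset, 3))
print(toy_centered.round(3))
```

Every score moves by the same constant, so differences between genes and between cell lines are unchanged.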

In [None]:
sns.kdeplot(np.ravel(gene_effects))
plt.xlabel("Distribution of adjusted gene effects")

### Copy Number Correction

If you have gene-level copy number calls, Chronos includes an option to correct gene effect scores after the fact. This works best if the data has been scaled, as above.

In [None]:
cn = chronos.read_hdf5("Data/SampleData/OmicsCNGene.hdf5")
cn.iloc[:4, :3]

Unfortunately, we don't have copy number calls for one of the genes targeted by the Avana library:

In [None]:
try:
    corrected, shifts = chronos.alternate_CN(gene_effects, cn)
except ValueError as e:
    print(e)

We could choose to drop these genes. Instead, we'll assume normal ploidy (=1, in the current CCLE convention) for them and fill in the CN matrix accordingly.

In [None]:
for col in set(gene_effects.columns) - set(cn.columns):
    cn[col] = 1

In [None]:
corrected, shifts = chronos.alternate_CN(gene_effects, cn)

The `shifts` dataframe contains some information about the inferred CN effect, while `corrected` contains the corrected gene effects matrix. Overall, gene effect matrices will change little after correction, since most genes in most lines are near diploid.

We'll write the corrected dataframe to the saved directory we made earlier.

In [None]:
chronos.write_hdf5(corrected, os.path.join(savedir, "gene_effect_corrected.hdf5"))

### QC report

The function `dataset_qc_report` in the `reports` module of Chronos presents a variety of QC metrics and interrogates some specific examples. The report minimally requires a set of positive and negative control genes. To get the full report requires copy number, mutation data, expression data, a list of expression addictions (genes which are dependencies in highly expressing lines), and oncogenic mutations.

Below, we'll load an annotated DepMap MAF file (subsetted to our cell lines). We'll select gain of function cancer driver events from it and generate a binary mutation matrix. We have a prior belief that cell lines with driver gain of function mutation events will be dependent on the mutated gene, so this matrix will be used by the QC report to assess our ability to identify selective dependencies. Specifically, we expect the oncogenes in this matrix to be dependencies in cell lines where the matrix is `True`, and not otherwise.

In [None]:
maf = pd.read_csv("Data/SampleData/OmicsSomaticMutations.csv")

In [None]:
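# Keep driver (or likely driver) mutations that are also likely gain of function
cancer_relevant = maf[
    (maf.Driver | maf.LikelyDriver) & maf.LikelyGoF
]

# Keep one row per (cell line, gene) pair; copy so that the column assignment
# below acts on an independent frame rather than a slice of `maf`
cancer_relevant = cancer_relevant[~cancer_relevant.duplicated(subset=["ModelID", "Gene"])].copy()

cancer_relevant['truecol'] = True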

gof_matrix_base = pd.pivot(cancer_relevant, index="ModelID", columns="Gene", values="truecol")
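
It is worth knowing what this pivot produces: (cell line, gene) pairs with no qualifying mutation come out as `NaN`, not `False`. The toy example below (made-up model IDs and genes) shows this, along with one way to get a strictly boolean matrix if you want one for your own analyses. Whether `dataset_qc_report` needs that conversion is not something this sketch assumes either way.

```python
import pandas as pd

# Toy long-format mutation table (IDs and genes are made up)
toy_maf = pd.DataFrame({
    "ModelID": ["MODEL_1", "MODEL_1", "MODEL_2"],
    "Gene": ["KRAS", "BRAF", "KRAS"],
})
toy_maf["truecol"] = True

# Wide matrix: True where a mutation is listed, NaN everywhere else
toy_wide = pd.pivot(toy_maf, index="ModelID", columns="Gene", values="truecol")

# Strictly boolean version: every listed entry is True, so "not missing" is enough
toy_bool = toy_wide.notna()

print(toy_wide)
print(toy_bool)
```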

Another way to evaluate selective dependencies is using expression addictions, a common pattern in which a gene is a stronger dependency in lines with higher expression. We'll use a list derived from DepMap RNAi (Tsherniak et al., Cell 2017), and subset our expression matrix to match.

In [None]:
expression_addictions = pd.read_csv("Data/SampleData/RNAiExpressionAddictions.csv")['Gene']

In [None]:
addiction_expressions = chronos.read_hdf5("Data/SampleData/OmicsExpressionProteinCodingGenesTPMLogp1.hdf5")[
    expression_addictions
]

Now, we're ready to run the QC report on Chronos' results:

In [None]:
metrics = reports.dataset_qc_report("ChronosAvana", savedir, 
                          common_essentials, nonessentials,
                          gof_matrix_base, addiction_expressions,
                          cn, directory="Data/reports",
                          gene_effect_file="gene_effect_corrected.hdf5"
                         )

## Identifying Hits

You may be interested in getting a list of genes that are true dependencies in your screen. Chronos provides two methods to do this:

- Given a set of negative control genes (or a boolean matrix of individual genes within specific cell lines that are negative controls, such as a matrix of unexpressed genes), Chronos can compute empirical p-values for the null hypothesis that the KO had no viability effect against the alternative hypothesis that it caused loss of viability.
- Given a set of positive and negative control genes, Chronos can estimate the probability that a given gene effect score came from the distribution of positive controls vs negative controls. If the controls are representative of essential/nonessential genes, then the probability tells you how likely it is that a given score represents an essential gene for that cell line. 

To get an unbiased estimate, it is important not to use CRISPR results from the same library to choose or refine the control sets. We want the scores for the gene sets to capture any bias or artifacts present in CRISPR.

In [None]:
from chronos.hit_calling import get_probability_dependent, get_pvalue_dependent

In [None]:
pvalues = get_pvalue_dependent(corrected, nonessentials)

In [None]:
probabilities = get_probability_dependent(corrected, nonessentials, common_essentials)

Each of these has advantages and disadvantages for hit-calling. The probability of dependency is highly dependent on choosing a set of positive controls that accurately capture the distribution of gene-loss-driven depleting phenotypes in your screen. Too stringent a list will cause underestimates, while too loose a list will lead to a failure to control false discovery. The inverse is true of the negative controls, but it is often possible to identify a very rigorous and representative set of negative controls using unexpressed genes. You should plot the distributions of all your gene effect scores, your negative controls, and your positive controls to see if the positive controls really do represent the left tail:

In [None]:
sns.kdeplot(np.ravel(corrected), label="All genes", color="green")
sns.kdeplot(np.ravel(corrected.reindex(columns=common_essentials)), label="Positive controls", color="red", fill=True)
sns.kdeplot(np.ravel(corrected.reindex(columns=nonessentials)), label="Negative controls", color="blue", fill=True)
plt.legend()
plt.xlabel("Gene Effect")

On the other hand, the power of the empirical p-values will be strictly limited by the number of negative controls. In fact, the minimum possible _p_ that can be achieved is 1 / len(negative_controls). This is likely to be an issue in a subgenome library. Below, we see the effect of this cap on significance:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

plt.sca(axes[0])
for ind in probabilities.index:
    plt.scatter(corrected.loc[ind], probabilities.loc[ind], s=5, alpha=.7, linewidth=.5, label=ind)
plt.legend(fontsize=4, loc="lower left")
plt.xlabel("Gene Effect Estimate")
plt.ylabel("Probability of Dependency")

plt.sca(axes[1])
for ind in pvalues.index:
    plt.scatter(corrected.loc[ind], -np.log10(pvalues.loc[ind]), s=5, alpha=.7, linewidth=.5, label=ind)
plt.xlabel("Gene Effect Estimate")
plt.ylabel("-Log10 P")
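
The floor on empirical p-values can be reproduced with a toy calculation. This sketch assumes a simple one-sided empirical p-value: the fraction of negative control scores at or below the observed score, floored at one over the number of controls. The helper `empirical_pvalues` is hypothetical, and `get_pvalue_dependent` may differ in its details.

```python
import numpy as np

def empirical_pvalues(scores, null_scores):
    """One-sided empirical p-values against a set of negative control scores."""
    null_sorted = np.sort(np.asarray(null_scores, dtype=float))
    scores = np.asarray(scores, dtype=float)
    # Number of null scores less than or equal to each observed score
    n_at_or_below = np.searchsorted(null_sorted, scores, side="right")
    # Floor the count at 1 so that p can never be smaller than 1 / n
    return np.maximum(n_at_or_below, 1) / len(null_sorted)

# 20 made-up negative control scores spread evenly around 0
toy_null = np.linspace(-0.5, 0.5, 20)
toy_scores = [-3.0, -0.5, 0.0, 1.0]

toy_pvalues = empirical_pvalues(toy_scores, toy_null)
print(toy_pvalues)
```

However extreme the first score is, its p-value cannot fall below 1/20 with only 20 controls, which is why small control sets limit power.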

We can make a roughly head-to-head comparison of discoveries from the two methods by estimating a false discovery rate using each. Using p-values for many hypotheses, false discovery is typically controlled with the Benjamini-Hochberg procedure. This is a frequentist FDR. In contrast, using probabilities of dependency, we can directly estimate the Bayesian FDR: the number of true discoveries below a threshold is just the sum of the probabilities of dependency. See https://arxiv.org/pdf/1803.05284.pdf for a discussion of frequentist and Bayesian false discovery.

In [None]:
from chronos.hit_calling import get_fdr_from_probabilities, get_fdr_from_pvalues
fdr_from_probabilities = get_fdr_from_probabilities(probabilities)
fdr_from_pvalues = get_fdr_from_pvalues(pvalues)
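
To make the two notions of FDR concrete, here is a minimal sketch of each on a flat vector of toy values. Both helpers are written for illustration under the textbook definitions and are not claimed to match the internals of `get_fdr_from_probabilities` or `get_fdr_from_pvalues`.

```python
import numpy as np

def bayesian_fdr(probabilities):
    """Expected fraction of false calls among everything at least as confident."""
    prob = np.asarray(probabilities, dtype=float)
    order = np.argsort(-prob)  # most confident calls first
    expected_false = np.cumsum(1.0 - prob[order])
    fdr_sorted = expected_false / np.arange(1, len(prob) + 1)
    fdr = np.empty_like(fdr_sorted)
    fdr[order] = fdr_sorted  # return to the input order
    return fdr

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    pvals = np.asarray(pvalues, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

toy_probabilities = [0.9, 0.5, 1.0]
toy_pvals = [0.01, 0.04, 0.03, 0.5]

toy_fdr_bayes = bayesian_fdr(toy_probabilities)
toy_fdr_bh = benjamini_hochberg(toy_pvals)
print(toy_fdr_bayes)
print(toy_fdr_bh)
```

Note that the largest Bayesian FDR equals one minus the mean probability, so it only reaches 1 if no score has any probability of being a dependency.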

We'll do a quick, nonrigorous calibration check, considering only the control sets:

In [None]:
def calibration_check(fdr, positive_controls, negative_controls):
    controls_only = fdr\
    .reindex(columns=list(positive_controls) + list(negative_controls))\
    .dropna(axis=1)

    is_essential = pd.DataFrame(
        np.repeat(
            controls_only.columns.isin(positive_controls).reshape(1, -1), 
            len(controls_only), axis=0
        ),
       index=controls_only.index,
        columns=controls_only.columns
    )
    
    calibration = pd.DataFrame({
        "FDR": np.ravel(controls_only),
        "IsTrue": np.ravel(is_essential)
    }).sort_values("FDR")
    calibration["ProportionFalse"] = np.cumsum(~calibration.IsTrue) / np.arange(1, len(calibration)+1)
    
    return calibration

In [None]:
calibration_probabilities = calibration_check(fdr_from_probabilities, common_essentials, nonessentials)
calibration_pvalues = calibration_check(fdr_from_pvalues, common_essentials, nonessentials)

In [None]:
plt.plot(
    calibration_probabilities["ProportionFalse"],
    calibration_probabilities["FDR"],
    label="Using Probabilities (Bayesian)"
)

plt.plot(
    calibration_pvalues["ProportionFalse"],
    calibration_pvalues["FDR"],
    '--', 
    label="Using P-values (Frequentist)"
)

max_fdr = calibration_probabilities.ProportionFalse.max()
plt.plot([0, max_fdr], [0, max_fdr], '-.', color="black", label="Perfect Calibration")

plt.ylabel("Estimated FDR")
plt.xlabel("True FDR")
plt.legend()

We can see that both estimates of FDR are conservative, which is surely preferable to being optimistic. This may be partly due to the presence of the noncontrol genes in the full data, which have a lower proportion of true dependencies than the controls and affect the FDR estimates in the controls. The probability-based method has an unfair advantage, as it has seen the same set of positive controls as well as the negative controls. To be truly rigorous we would need to split the controls used in training and in evaluating calibration, and run Chronos without any non-control genes.

Note that the BH estimates reach 1, while probability-based estimates saturate below 1. This is because the BH procedure assumes the proportion of true discoveries is approximately 0 over the whole dataset:

In [None]:
fdr_from_pvalues.max().max(), fdr_from_pvalues.max(axis=1).mean()

In [None]:
fdr_from_probabilities.max(axis=1).max(), fdr_from_probabilities.max(axis=1).mean()

If an estimated FDR is correct, 1 minus the maximum FDR for a cell line should be equal to the fraction of true dependencies in that cell line. The lowest probability-based estimate of 0.38 is close to the proportion of common essential genes in the dataset (0.36), illustrating the greater power of the FDRs computed this way. When we compute FDRs from p-values using the Benjamini-Hochberg procedure, we are estimating adjusted p-values to _control_ false discovery rather than the true FDR, and the adjusted p-values reach 1.

We should also note that having a large number of true dependencies in our test library improves the apparent performance of the frequentist method, because it allows values to remain significant after the BH procedure. In a setting with few true dependencies, the frequentist method may become severely underpowered.

Summing up: using `get_probability_dependent` is more powerful, and especially so in the case of subgenome libraries with limited negative control sets and a modest proportion of true hits. However, it requires correctly specifying the positive control distribution by identifying _representative_ positive control genes independently from your screening data (ideally, independent of any CRISPR data). 

## Comparing a Screen in Two Conditions

### Data Format for Comparing Conditions

A common use case for comparing screens is the anchor screen, in which the same model is screened in two different conditions. In [DeWeirdt et al.](https://doi.org/10.1038%2Fs41467-020-14620-6) (2020), the Meljuso, OVCAR8, and A375 cell lines were screened using either small molecule inhibitors or S. aureus knockouts of BCL2L1, MCL1, and PARP/PARP1 in combination with the Brunello genome-wide library. A subset of their screens with the BCL2L1 inhibitor A-1331852 is loaded below.

In [None]:
deweird_readcounts = chronos.read_hdf5("Data/SampleData/DeWeirdtReadcounts.hdf5")

deweirdt_condition_map = pd.read_csv("Data/SampleData/DeWeirdtConditionMap.csv")

deweirdt_guide_map = pd.read_csv("Data/SampleData/DeWeirdtGuideMap.csv")

deweirdt_negative_controls = deweirdt_guide_map.sgrna[
    deweirdt_guide_map.gene.isin([s.split(' ')[0] for s in nonessentials])
]

The readcounts and guide maps are formatted just like the Avana data above. The condition map is very similar to a sequence map, but requires two additional columns: `"replicate"` and `"condition"`. __Different replicates should be biologically independent__, ideally independently library-transfected cell populations. Different sequences of the same replicate should be assigned the same replicate name.

In [None]:
deweirdt_condition_map

Any cell line with less than two unique replicates in each of the two conditions being compared will not be evaluated.
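
You can check ahead of time which cell lines will survive this filter. The sketch below uses a hypothetical helper (`lines_with_enough_replicates`) and a toy condition map with the `"condition"` and `"replicate"` columns described above; the condition name `"Drug"` is a placeholder.

```python
import pandas as pd

def lines_with_enough_replicates(condition_map, conditions, min_replicates=2):
    """Cell lines with at least `min_replicates` unique replicates in every condition."""
    subset = condition_map[condition_map["condition"].isin(conditions)]
    counts = (
        subset.groupby(["cell_line_name", "condition"])["replicate"]
        .nunique()
        .unstack("condition")
        .reindex(columns=list(conditions))
        .fillna(0)  # a condition that is absent for a line counts as 0 replicates
    )
    keep = (counts >= min_replicates).all(axis=1)
    return sorted(keep[keep].index.tolist())

# Toy condition map: lineB has only one replicate in the "Drug" condition
toy_condition_map = pd.DataFrame({
    "cell_line_name": ["lineA"] * 4 + ["lineB"] * 3 + ["pDNA"],
    "condition": ["Control", "Control", "Drug", "Drug",
                  "Control", "Control", "Drug", None],
    "replicate": ["A", "B", "A", "B", "A", "B", "A", None],
})

evaluable = lines_with_enough_replicates(toy_condition_map, ("Control", "Drug"))
print(evaluable)
```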

It is important when running the comparison to first normalize the readcounts and apply `nan_outgrowths`. Otherwise, outliers that occur in single replicates will produce excessive noise that will reduce statistical power.

In [None]:
deweirdt_normed = chronos.normalize_readcounts(deweird_readcounts, deweirdt_negative_controls, deweirdt_condition_map)

In [None]:
chronos.nan_outgrowths(deweirdt_normed, deweirdt_condition_map, deweirdt_guide_map, rpm_normalize=False)

### Training with Conditions Distinguished

To understand how `ConditionComparison` works, it helps to first manually create a model with conditions distinguished. First, we create a sequence map that distinguishes conditions. We'll compare "Control" and "A-1331852". The function `create_condition_sequence_map` creates the map, while the function `filter_sequence_map_by_condition` restricts the sequences considered to those matching one of the designated conditions and to cell lines with at least two unique replicates in each condition. It will also trim the number of replicates considered so that there is an even and equal number in each condition for each cell line.

In [None]:
from chronos.hit_calling import filter_sequence_map_by_condition, create_condition_sequence_map

condition_pair = ("Control", "A-1331852")
distinguished_map = create_condition_sequence_map(
    filter_sequence_map_by_condition(deweirdt_condition_map, condition_pair),
    condition_pair
)
    

In [None]:
distinguished_map

Notice that the `"cell_line_name"` column has been overwritten in the format "<cell_line>__in__<condition>". When we train Chronos, we'll get an individual assessment of each gene's effect in each cell line in each condition. The new column `"true_cell_line_name"` exists for convenience.
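
If you need to build or parse these combined names yourself, for example to split a gene effect index back into cell line and condition, the naming scheme is simple to reproduce. The two helpers below are hypothetical conveniences, not Chronos functions, and assume only the `__in__` separator shown above.

```python
SEPARATOR = "__in__"

def distinguished_name(cell_line, condition):
    """Combine a cell line and a condition into a single name."""
    return f"{cell_line}{SEPARATOR}{condition}"

def split_distinguished_name(name):
    """Recover (cell_line, condition) from a combined name."""
    cell_line, condition = name.split(SEPARATOR, 1)
    return cell_line, condition

combined = distinguished_name("Meljuso", "A-1331852")
print(combined)
print(split_distinguished_name(combined))
```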

Running Chronos works just like before:

In [None]:
distinguished_model = chronos.Chronos(
    readcounts={"brunello": deweird_readcounts},
    sequence_map={"brunello": distinguished_map},
    guide_gene_map={"brunello": deweirdt_guide_map},
    negative_control_sgrnas={"brunello": deweirdt_negative_controls}
)

In [None]:
distinguished_model.train()

MCL1 and BCL2L1 are a well-established synthetic lethal pair in cancer. Below, we see a difference of about 1.5 between the gene effects of MCL1 in the control condition and in the presence of the BCL2L1 inhibitor A-1331852.

In [None]:
distinguished_model.gene_effect['MCL1']

But are these differences significant?

### Running the Comparator

To test for significance, we first create a `hit_calling.ConditionComparison` instance. The input syntax is almost exactly the same as a `Chronos` model instance, except we replace the `sequence_map` argument with a `condition_map`.

In [None]:
from chronos.hit_calling import ConditionComparison
comparator = ConditionComparison(
    readcounts={"brunello": deweird_readcounts},
    condition_map={"brunello": deweirdt_condition_map},
    guide_gene_map={"brunello": deweirdt_guide_map},
    negative_control_sgrnas={"brunello": deweirdt_negative_controls},
)

To identify significant differences, `ConditionComparison` will first train a distinguished model, as above, then permute the condition labels so that each condition label has an equal number of replicates from each of the real conditions and train models on the permutations. These models have no biological difference between conditions, and can be used to form the null distribution. Permutations that have every condition flipped from an existing permutation are discarded. An example of a permuted map is shown below.

In [None]:
from chronos.hit_calling import create_permuted_sequence_maps

create_permuted_sequence_maps(deweirdt_condition_map, condition_pair)[0]

The number of permutations limits the minimum p-value that can be calculated. Due to the requirement that we only consider permutations that have equal numbers of replicates from each condition, the number of permutations actually generated is as follows:

- Less than 2 replicates for any condition for a cell line: 0 permutations, that line is discarded
- 2-3 replicates for any condition of any cell line: 2 permutations
- At least 4 replicates for all conditions for all cell lines: 18 permutations
- At least 6 replicates for all conditions for all cell lines: 200 permutations
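
These counts follow from simple combinatorics. With n replicates per condition (n even, after trimming), a balanced relabeling picks n/2 replicates from each real condition for the first label, giving C(n, n/2)² choices; counting each relabeling and its mirror image once halves that number. The sketch below reproduces the figures in the list under that assumption. The helper name is hypothetical and this is not the code Chronos uses.

```python
from math import comb

def n_balanced_permutations(replicates_per_condition):
    """Balanced label permutations, counting each mirror-image pair once."""
    # Trim to an even number of replicates, as described above
    n = replicates_per_condition - replicates_per_condition % 2
    if n < 2:
        return 0
    half = n // 2
    # Choose half of the first label's replicates from each real condition,
    # then divide by 2 because swapping the two labels gives the same split
    return comb(n, half) ** 2 // 2

for n_reps in (1, 2, 3, 4, 6):
    print(n_reps, "replicates per condition:", n_balanced_permutations(n_reps), "permutations")
```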
  
For obvious reasons, we don't recommend including more than 4 replicates per condition unless your library is very small.

Now, to compare the conditions. Note that with three models to train, this may take about 10 minutes.

In [None]:
comparison_statistics = comparator.compare_conditions(("Control", "A-1331852"))

Below, we can see the most significant differences. The expected hit MCL1 was found in both screens, while BAX and BCL2 were found in Meljuso. [MARCH5, UBE2J, and UBE2K are also expected findings](https://doi.org/10.1038/s41375-024-02178-x). Meljuso appears to be a cleaner screen with overall more significant hits.

In [None]:
comparison_statistics.sort_values(["gene"]).loc[lambda x: x.likelihood_fdr < .05]

With multiple cell lines in the same pair of conditions, you may be interested in which changes in which genes can be recovered by considering all lines together. `get_consensus_difference_statistics` combines likelihood changes across cell lines to create a consensus estimate of significance. This reports a number of additional genes not significant in either cell line individually.

In [None]:
from chronos.hit_calling import get_consensus_difference_statistics
consensus = get_consensus_difference_statistics(comparison_statistics)
consensus.sort_values("likelihood_fdr").loc[lambda x: x.likelihood_fdr < .05]

### Other notes

If you want to compare two different screens in the same condition, you can create a condition map in which the two cell lines are assigned the same `"cell_line_name"` but a different `"condition"`. However, bear in mind that your real differences are likely to be confounded with batch effects. You can assess the degree of this problem by checking for false discoveries among negative controls or unexpressed genes.

You may also find that your hits are dominated by common essentials. This can happen because the two conditions have different screen quality. In particular, if one condition is mildly or moderately growth-inhibiting, this can lead to less apparent common essential dropout (because all other cells are also dropping out) vs the other condition and make these genes appear systematically different in gene effect. This can be addressed by increasing the `gene_effect_hierarchical` regularization in `ConditionComparison` (see `chronos.Chronos` for a description), and potentially also by adding new bins to `gene_readcount_total_bin_quantiles` (default: `[0.05]`) in `compare_conditions` (see `chronos.hit_calling.ConditionComparison.compare_conditions`).

## Running with multiple libraries

We can add Sanger's [Project Score](https://www.nature.com/articles/s41586-019-1103-9) data (screened with the KY library) and run Chronos jointly on it and the Avana data. 

In [None]:
ky_guide_map = pd.read_csv("./Data/SampleData/KYGuideMap.csv")
ky_sequence_map = pd.read_csv("./Data/SampleData/KYSequenceMap.csv")
ky_readcounts = chronos.read_hdf5("./Data/SampleData/KYReadcounts.hdf5")

In [None]:
ky_positive_controls = ky_guide_map.sgrna[ky_guide_map.gene.isin(common_essentials)]
ky_negative_controls = ky_guide_map.sgrna[ky_guide_map.gene.isin(nonessentials)]

In [None]:
# Count missing readcounts per sequenced entity in the KY data
ky_readcounts.isnull().sum(axis=1).sort_values()

Note how the call signature of Chronos with multiple libraries is constructed:

In [None]:
model2 = chronos.Chronos(
    sequence_map={"avana": sequence_map, 'ky': ky_sequence_map},
    guide_gene_map={"avana": guide_map, 'ky': ky_guide_map},
    readcounts={"avana": readcounts, 'ky': ky_readcounts},
    negative_control_sgrnas={"avana": negative_controls, "ky": ky_negative_controls}
)

In [None]:
model2.train(301)

Note that the gene effect now has NAs. These are cases where a cell line was only screened in one library and that library had no guides for that gene.
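
The pattern of missing values can be predicted from which libraries cover which genes and which lines were screened in which libraries. The toy bookkeeping below (made-up genes, lines, and coverage) illustrates the rule stated above; it does not call Chronos.

```python
# Which genes each library has guides for (made-up coverage)
library_genes = {
    "avana": {"GENE_A", "GENE_B"},
    "ky": {"GENE_B", "GENE_C"},
}
# Which libraries each cell line was screened in (made-up assignments)
line_libraries = {
    "line1": ["avana"],
    "line2": ["avana", "ky"],
}

def has_estimate(line, gene):
    """A gene effect exists if any library used for the line targets the gene."""
    return any(gene in library_genes[lib] for lib in line_libraries[line])

all_genes = sorted(set().union(*library_genes.values()))
expected_missing = [
    (line, gene)
    for line in line_libraries
    for gene in all_genes
    if not has_estimate(line, gene)
]
print(expected_missing)
```

Here `line1` was screened only with the library that lacks guides for `GENE_C`, so that single entry is the one expected to be missing.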

Chronos also infers library batch effects. Note that these are only inferred for genes present in all libraries.

## Running your screen with pretrained DepMap parameters

If you conducted a screen in one of the DepMap integrated libraries (currently Avana, KY, or Humagne-CD), you can load parameters from the trained DepMap model and use them to process your specific screen. This gives you many of the benefits of coprocessing your screen with the complete DepMap dataset without the computational expense. 

The following command fetches the 23Q4 public dataset from Figshare and stores it in the Chronos package directory under Data/DepMapParameters.

In [None]:
chronos.fetch_parameters()

First, we create a model with the data we want to train as before, but with two important details:
- we pass the argument `pretrained=True` when we initialize
- the library batch names must match the DepMap library batch names, as that's what we're using for the pretrained model

In [None]:
model2_pretrained = chronos.Chronos(
    sequence_map={"Achilles-Avana-2D": sequence_map, 'Achilles-KY-2D': ky_sequence_map},
    guide_gene_map={"Achilles-Avana-2D": guide_map, 'Achilles-KY-2D': ky_guide_map},
    readcounts={"Achilles-Avana-2D": readcounts, 'Achilles-KY-2D': ky_readcounts},
    negative_control_sgrnas={"Achilles-Avana-2D": negative_controls, "Achilles-KY-2D": ky_negative_controls},
    pretrained=True
)

Now we import the DepMap data from the directory into the model, and train:

In [None]:
model2_pretrained.import_model("./Data/DepMapParameters/")

In [None]:
model2_pretrained.train()