<a href="https://colab.research.google.com/github/WCSCourses/CRISPR_Informatics_2024/blob/jb63_screening_module/modules/crispr_screen_analysis/generate_crispr_counts_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Guide Counts from raw sequencing data

In this notebook we will retrieve raw sequencing data from a publicly available dataset and convert the reads to guide counts. For this task, we'll use `SRATOOLS` to obtain the reads and `MAGECK` to generate counts.
`MAGECK` can take a FASTQ or BAM file and generate counts for single-guide CRISPR screens.

We can get ready for counting by installing `condacolab`, which sets up a conda environment inside our Colab session:

In [1]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:15
🔁 Restarting kernel...


Once the kernel has restarted, we can install `sra-tools` and `mageck` from the Bioconda channel by executing the following code section. To confirm that a tool is available afterwards, and to get some info on how it works, you can run it with a help flag (`--help`); you should see a `Usage:` message and lots of options for the tool.

In [1]:
!conda install bioconda::sra-tools bioconda::mageck

Channels:
 - conda-forge
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): ...

In [2]:
!fasterq-dump SRR8293875

spots read      : 73,227,278
reads read      : 73,227,278
reads written   : 73,227,278


In [5]:
!head SRR8293875.fastq

@SRR8293875.1 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13155 length=50
GAAGACCCTTGTGGAAAGGACGAAACACCGAGCGCGGACAGCATCCCGCC
+SRR8293875.1 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13155 length=50
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR8293875.2 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13984 length=50
TGCACCTTGTGGAAAGGACGAAACACCGTTCTGAGCAGCAGTTCTGCCGT
+SRR8293875.2 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13984 length=50
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR8293875.3 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:17994 length=50
AGCTTGTGGAAAGGACGAAACACCGTGTGCGCAAAGTGTGCGGCCGTTTC
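
To make the four-line structure of each FASTQ record in the output above concrete, here is a minimal Python sketch that parses records from a file handle. The `FastqRecord` type and the `read_fastq` function are illustrative names chosen for this example, not part of `SRATOOLS` or `MAGECK`, and the input is simply the first two records shown above.

```python
import io
from itertools import islice
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # read identifier, without the leading "@"
    sequence: str   # the base calls
    quality: str    # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: header, sequence, separator, quality."""
    while True:
        chunk = [line.rstrip("\n") for line in islice(handle, 4)]
        if not chunk:
            return  # clean end of file
        if len(chunk) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = chunk
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:].split()[0], sequence, quality)


# The first two records from the `head` output above, held in memory.
example = io.StringIO(
    "@SRR8293875.1 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13155 length=50\n"
    "GAAGACCCTTGTGGAAAGGACGAAACACCGAGCGCGGACAGCATCCCGCC\n"
    "+SRR8293875.1 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13155 length=50\n"
    "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n"
    "@SRR8293875.2 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13984 length=50\n"
    "TGCACCTTGTGGAAAGGACGAAACACCGTTCTGAGCAGCAGTTCTGCCGT\n"
    "+SRR8293875.2 D00212:CB21JANXX170514:CB21JANXX:3:1101:10000:13984 length=50\n"
    "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n"
)

records = list(read_fastq(example))
for record in records:
    print(record.name, len(record.sequence))
```

To run this on the downloaded file instead, you could pass `open("SRR8293875.fastq")` as the handle, ideally wrapped in `islice` so that you only look at the first few records of a 73-million-read file.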


## Tool inputs

Now that the tool is ready, we need several input files to run counting on a sample:

1) A library file, describing the CRISPR guides we would like to quantify. This will be a tab-separated (`.tsv`) file containing two columns: one column of guide IDs and one column of guide sequences
2) An experiment description file (written in the `.yaml` file format). This file allows us to describe simple and more complex screens using a structured but human-readable text file.
3) A sample `.cram` (or `.fastq`) file containing sequencing reads
4) (Required when generating counts from a `.cram` file) A copy of the reference genome file used in generating the `.cram` file


The next few code blocks will demonstrate what each of these files should look like:

## The CRISPR library file

In [None]:
head libraries/yusa.tsv
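
As a rough illustration of the two-column layout described above, the following Python sketch loads a library into a dictionary keyed by guide sequence. It assumes the guide ID is in the first column, the sequence is in the second, and there is no header row; none of that is confirmed by this notebook, and the guide names and sequences are invented for the example. `load_library` is a hypothetical helper, not part of any counting tool.

```python
import csv
import io
from typing import Dict, TextIO


def load_library(handle: TextIO) -> Dict[str, str]:
    """Map guide sequence -> guide ID from a two-column, tab-separated file.

    Assumes column 1 is the guide ID and column 2 is the guide sequence.
    """
    guides: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(row) < 2:
            raise ValueError(f"expected two columns, got: {row!r}")
        guide_id, sequence = row[0], row[1].upper()
        if sequence in guides:
            raise ValueError(f"duplicate guide sequence: {sequence}")
        guides[sequence] = guide_id
    return guides


# Invented guides, purely for illustration.
example = io.StringIO(
    "guide_A\tACGTACGTACGTACGTACG\n"
    "guide_B\tTTGCATTGCATTGCATTGC\n"
)
library = load_library(example)
print(library)
```

Keying the dictionary by sequence rather than by ID makes the later lookup of a read against the library a single dictionary access.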

## The experimental description file

YAML files are commonly used for configuring computational tools. This file contains a lot of "boilerplate" code, which is necessary for the tool to be flexible, but there are really only a few things we need to worry about when conducting a simple screen.

The key attributes we define in the YAML file are what kind of sequencing run we performed (`single-end`) and which CRISPR library we want to generate counts from. If we had run a paired-end sequencing run or a dual-guide experiment we would have to describe that experiment differently.

Note: lines starting with `#` are ignored by the tool and are purely descriptive.


In [None]:
cat experiment.yaml

## Generating counts

With our inputs defined, we can run the counting tool like so:


In [None]:
mkdir output_dir

crispr-lib-matching \
-o output_dir \
-l libraries \
experiment.yaml \
--count-mm-reads \
--sort-mm-read-counts \
-r /lustre/scratch124/casm/team113/secure-lustre/resources/dermatlas/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-c 4 \
test/merged_crams/7672STDY14727349_merged.cram


N.B. If you're curious about the `\` characters in this command: I often put them at the ends of lines of bash code to break long commands across several lines and make them easier for you to read.

## Recap

The command we just ran specifies the input files we want to use for counting and that we would like to use 4 compute cores to process the reads in parallel.

I've also included two optional arguments in the command:
`--count-mm-reads` and `--sort-mm-read-counts`. These tell the tool to:

- Tally sequences in the sample which don't match our expected CRISPR library
- Find the most abundant sequences which don't match the expected CRISPR library

Performing these steps can be really helpful for troubleshooting what's wrong with a library when you get no matches. For example, maybe there's an extra `G` appended to all our sequences that is preventing exact matching of reads to guides from working.
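
The idea behind these two options can be sketched in a few lines of Python. This is not how the counting tool is implemented; it is a toy version of exact matching that tallies unmatched reads and ranks them by abundance, using invented guide sequences. Here every read for `guide_B` carries an extra trailing `G`, so it never matches exactly.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_guides(
    reads: Iterable[str], library: Dict[str, str]
) -> Tuple[Counter, Counter]:
    """Exact-match reads against a {sequence: guide_id} library.

    Returns (counts per guide ID, counts per unmatched read sequence).
    """
    matched: Counter = Counter()
    unmatched: Counter = Counter()
    for read in reads:
        guide_id = library.get(read)
        if guide_id is None:
            unmatched[read] += 1   # tally the unexpected sequence
        else:
            matched[guide_id] += 1
    return matched, unmatched


# Invented library and reads, purely for illustration.
library = {
    "ACGTACGTACGTACGTACG": "guide_A",
    "TTGCATTGCATTGCATTGC": "guide_B",
}
reads = [
    "ACGTACGTACGTACGTACG",    # exact match to guide_A
    "TTGCATTGCATTGCATTGCG",   # guide_B with an extra trailing G
    "TTGCATTGCATTGCATTGCG",
    "CCCCCCCCCCCCCCCCCCC",    # unrelated sequence
]

matched, unmatched = count_guides(reads, library)
print(matched)
print(unmatched.most_common(1))  # the most abundant unexpected sequence
```

Looking at the top unmatched sequence and comparing it by eye with the library is often enough to spot a systematic offset like this one.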



## Running jobs on the Farm

You might remember from the farm course that the farm relies on an LSF (Load Sharing Facility) system to organise computational tasks. Outside of notebooks like this one, you might find you need to run tasks like this within a `bsub` statement, so I thought I'd include an example here.

As a refresher from the farm course, `bsub` commands require you to specify how many CPU cores you'd like a task to run on, how much memory the job will need and where you'd like any messages the tool generates to be stored.

Here is an example translation of our prior counting command into a bsub job:

In [None]:
mkdir output_dir

bsub -q normal \
-o test%J.o \
-e test%J.err \
-n 4 \
-M 20GB \
-R "select[mem>20GB] rusage[mem=20GB] span[hosts=1]" \
"crispr-lib-matching \
-o output_dir \
-l libraries \
experiment.yaml \
--count-mm-reads \
--sort-mm-read-counts \
-r /lustre/scratch124/casm/team113/secure-lustre/resources/dermatlas/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-c 4 \
test/merged_crams/7672STDY14727349_merged.cram;"

In English, this roughly translates to:

- `bsub`: Submit a job to the LSF queue.
- `-q normal`: Specify the queue named "normal" for the job.
- `-o test%J.o`: Direct the standard output of the job to a file named "test" followed by the job ID and the extension ".o".
- `-e test%J.err`: Direct the standard error of the job to a file named "test" followed by the job ID and the extension ".err".
- `-n 4`: Request 4 CPU cores for the job.
- `-M 20GB`: Set the memory limit for the job to 20 GB.
- `-R "select[mem>20GB] rusage[mem=20GB] span[hosts=1]"`: Specify the resource requirements for the job:
  - `select[mem>20GB]`: Select only hosts with more than 20 GB of memory available.
  - `rusage[mem=20GB]`: Specify that the job will use 20 GB of memory.
  - `span[hosts=1]`: Ensure that all requested CPU cores are allocated on the same host.

In summary, this command submits a job to the "normal" queue, requests 4 CPU cores and 20 GB of memory on a single host, and directs the standard output and error to appropriately named files.
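
If you find yourself writing many of these submissions, one option is to assemble the `bsub` arguments programmatically. The Python sketch below only builds and prints the command; it does not submit anything. `build_bsub_command` is a hypothetical helper written for this example, it uses only the `bsub` options explained above, and the `echo hello` job is a placeholder.

```python
import shlex
from typing import List


def build_bsub_command(
    job_command: str,
    cores: int,
    memory: str,
    queue: str = "normal",
    log_prefix: str = "test",
) -> List[str]:
    """Assemble a bsub argument list mirroring the example above."""
    resources = f"select[mem>{memory}] rusage[mem={memory}] span[hosts=1]"
    return [
        "bsub",
        "-q", queue,                  # queue to submit to
        "-o", f"{log_prefix}%J.o",    # standard output, %J is the job ID
        "-e", f"{log_prefix}%J.err",  # standard error
        "-n", str(cores),             # number of CPU cores
        "-M", memory,                 # memory limit
        "-R", resources,              # resource requirements
        job_command,                  # the command to run, as one string
    ]


argv = build_bsub_command("echo hello", cores=4, memory="20GB")
print(shlex.join(argv))  # shell-quoted form, handy for checking by eye
```

On a machine where `bsub` is available, the list could be handed to `subprocess.run(argv, check=True)`; keeping the arguments as a list avoids having to quote the `-R` string by hand.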


The job will take a little while to complete. Whilst you're waiting, you might want to check its status using the `bjobs` command in a terminal session.

## Output files

Once the CRISPR counting tool has finished, it will populate the output directory you specified with several statistics and results files.


### Guide counts
Probably the most important of these from your perspective is the `lib.0.counts.tsv` file. This file should contain guide sequences and the number of times they were observed within the sample. We can inspect the first few lines of this file with:

In [None]:
head output_dir/lib.0.counts.tsv


### Mismatch counts
Because we ran the tool with the `--count-mm-reads` and `--sort-mm-read-counts` options enabled, you should also have a `mm.read.counts.tsv` file containing "unexpected" sequences observed in the sample. These could be contaminants or near matches to guides due to sequencing errors. What is the most abundant of the unexpected sequences? Take a look with:

In [None]:
head output_dir/mm.read.counts.tsv


### Summary files
The `lib.0.stats.json` and other JSON files in your output directory should give a convenient high-level overview of the sample. They report the number of reads in the sample, the number of guides that were missing from the sample, the average count per guide and how even the composition of the library is (Gini coefficient).

In [None]:
cat output_dir/lib.0.stats.json
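
To give a feel for what these summary numbers mean, here is a small Python sketch that computes them from a list of per-guide counts. The formula is the standard Gini coefficient for a sorted sample; whether the counting tool uses exactly this calculation, and what its JSON keys are called, is not something this notebook confirms, so the key names below are hypothetical.

```python
from typing import Dict, Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient: 0 for a perfectly even library, near 1 when a few guides dominate."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(counts)
    # Weight each count by its 1-based rank in ascending order.
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def summarise(counts: Sequence[int]) -> Dict[str, float]:
    """High-level library statistics from per-guide counts (non-empty input)."""
    if not counts:
        raise ValueError("no guide counts supplied")
    return {
        "total_reads": sum(counts),
        "zero_count_guides": sum(1 for c in counts if c == 0),
        "mean_count": sum(counts) / len(counts),
        "gini": gini(counts),
    }


even = [5, 5, 5, 5]      # every guide equally represented
skewed = [0, 0, 0, 10]   # one guide takes all the reads
print(summarise(even))
print(summarise(skewed))
```

A low Gini coefficient alongside few zero-count guides suggests the library is evenly represented in the sample; a high value points to a skewed sample that is worth investigating before downstream analysis.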

## Running counting across an entire experiment



**WORK IN PROGRESS** TOFIX shortly

Whilst helpful, the above commands only allow you to process one sample at a time. Since your experiment might involve a large number of samples, a way of handling many samples at once might be helpful.

There are a number of ways you could handle multiple sample processing. For instance, you could write a for-loop to process samples in series.

To make your life a bit easier though, I've built a demo Nextflow pipeline that will run counting on all files within a directory for you.



In [None]:
module load nextflow
nextflow run /lustre/scratch124/casm/team113/secure-lustre/resources/crispr/toy-crispr-pipeline/main.nf \
-params-file ~/crispr_project_template/crispr_pipeline_config.json \
-profile farm22 \
-c /lustre/scratch124/casm/team113/secure-lustre/resources/crispr/toy-crispr-pipeline/nextflow.config
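
For comparison, the for-loop approach mentioned earlier in this section could look something like the Python sketch below. It only builds one counting command per `.cram` file and prints it; nothing is executed. The argument order mirrors the single-sample command shown earlier in this notebook, while `counting_commands`, the per-sample output directory layout and the placeholder file names are assumptions made for the example.

```python
import shlex
import tempfile
from pathlib import Path
from typing import List


def counting_commands(
    sample_dir: Path, output_root: Path, reference: Path
) -> List[List[str]]:
    """Build one counting command per .cram file, in sorted order."""
    commands = []
    for cram in sorted(sample_dir.glob("*.cram")):
        commands.append([
            "crispr-lib-matching",
            "-o", str(output_root / cram.stem),  # one output directory per sample
            "-l", "libraries",
            "experiment.yaml",
            "--count-mm-reads",
            "--sort-mm-read-counts",
            "-r", str(reference),
            "-c", "4",
            str(cram),
        ])
    return commands


# Demonstrate with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp)
    for name in ("sample_2.cram", "sample_1.cram", "notes.txt"):
        (sample_dir / name).touch()
    demo = counting_commands(sample_dir, Path("output_dir"), Path("reference.fa"))

for command in demo:
    print(shlex.join(command))
```

Running samples one after another like this is simple but slow for large experiments, which is where a workflow manager that runs samples in parallel earns its keep.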