# **Installation and Setup**

---

Run this to set up the CRISPRware environment in Colab.



In [1]:
!mkdir -p /root/.mamba/pkgs
!chmod -R 777 /root/.mamba
!wget -qO- https://micromamba.snakepit.net/api/micromamba/linux-64/latest | tar -xvj bin/micromamba

bin/micromamba


In [2]:
!git clone https://github.com/ericmalekos/crisprware crisprware
%cd crisprware

Cloning into 'crisprware'...
remote: Enumerating objects: 630, done.
remote: Counting objects: 100% (157/157), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 630 (delta 97), reused 90 (delta 37), pack-reused 473 (from 1)
Receiving objects: 100% (630/630), 107.79 MiB | 12.50 MiB/s, done.
Resolving deltas: 100% (298/298), done.
Updating files: 100% (125/125), done.
/content/crisprware


In [3]:
!/content/bin/micromamba env create -f environment.yml -n crisprware --root-prefix /content/micromamba --quiet -y
!/content/bin/micromamba run -n crisprware --root-prefix /content/micromamba pip install .

    Be aware that packages installed with 'pip' are managed independently from 'conda-forge' channel.
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 35.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 76.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.7/12.7 MB 77.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.9/39.9 MB 36.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 98.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 MB 52.6 MB/s eta 0:00:00
Processing /content/crisprware
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: crisprware
  Building wheel for crisprware (setup.py) ... 

In [4]:
# A helper function to run commands in the crisprware environment
# Wrap each module in this command before running
# This is only required in the Colab environment
def run_in_crisprware(command):
  !/content/bin/micromamba run -n crisprware --root-prefix /content/micromamba {command}
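
The helper above relies on the notebook `!` syntax, so it only works inside IPython or Colab. Outside a notebook, roughly the same call could be sketched with `subprocess`. The names `build_crisprware_command` and `run_in_crisprware_subprocess` are hypothetical and not part of CRISPRware; the paths are the ones used in the cells above.

```python
import shlex
import subprocess

# Paths and environment name copied from the setup cells above
MICROMAMBA = "/content/bin/micromamba"
ROOT_PREFIX = "/content/micromamba"
ENV_NAME = "crisprware"


def build_crisprware_command(command):
    """Return the argument list that runs `command` inside the environment."""
    return [MICROMAMBA, "run", "-n", ENV_NAME,
            "--root-prefix", ROOT_PREFIX, *shlex.split(command)]


def run_in_crisprware_subprocess(command):
    """Hypothetical non-notebook variant of run_in_crisprware."""
    return subprocess.run(build_crisprware_command(command), check=True)


print(build_crisprware_command("preprocess_annotation --min 1"))
```

One difference to keep in mind: the `!` form passes the command through a shell, so patterns such as `*_cleaned.tsv` are expanded, whereas an argument list built with `shlex.split` is not glob-expanded and the file names would have to be listed explicitly.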

# RNASeq

This section covers usage of RNASeq quantification to limit the search space to expressed transcripts.


---

### Kallisto example

The ENCODE consortium uses Kallisto for transcript quantification in their cell line characterization studies. We can find Kallisto quantification files for the K562 cell line at [ENCODE project ENCSR000CPH](https://www.encodeproject.org/experiments/ENCSR000CPH/). The documentation indicates they used the [GENCODE hg38 v29 annotation](https://www.gencodegenes.org/human/release_29.html), so we download that as well.

In [5]:
%cd /content/crisprware/
!mkdir -p kallisto
%cd kallisto
!wget -q --show-progress -O K562_rep1.tsv https://www.encodeproject.org/files/ENCFF823JHX/@@download/ENCFF823JHX.tsv
!wget -q --show-progress -O K562_rep2.tsv https://www.encodeproject.org/files/ENCFF940GYO/@@download/ENCFF940GYO.tsv
!wget -q --show-progress -O gencode.v29.gtf.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.chr_patch_hapl_scaff.annotation.gtf.gz && gunzip -f gencode.v29.gtf.gz
!ls

/content/crisprware
/content/crisprware/kallisto
gencode.v29.gtf  K562_rep1.tsv	K562_rep2.tsv


When we check the quantification files, we see that the first column contains much more than the transcript_id, and when we check whether every entry has the expected ENST00000######.# transcript identifier, we find that spike-ins are included. Let's remove this extra information.

In [6]:
!echo "Extra info in the id column:" && head -5 K562_rep1.tsv && echo ""
!echo "Unexpected entries not found in Gencode:" && grep -v "ENST" K562_rep1.tsv | tail -4 && echo ""
!echo "Cleaning files" && for file in *.tsv; do { head -n1 "$file"; awk -F'\t' 'NR>1 && /ENST/ {split($1,a,"|"); $1=a[1]; print}' OFS='\t' "$file"; } > "${file%.tsv}_cleaned.tsv"; done && echo ""
!echo "Reformated files:" && head -5 K562_rep1_cleaned.tsv && echo ""

Extra info in the id column:
target_id	length	eff_length	est_counts	tpm
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|	1657	1480.58	52.0305	0.439011
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|	632	455.749	0	0
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|	1351	1174.58	2072.92	22.047
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|	68	28.2727	3.73048	1.64833

Unexpected entries not found in Gencode:
ERCC-00168	1024	847.58	5	0.073695
ERCC-00170	1023	846.58	2725	40.2112
ERCC-00171	505	328.817	968	36.7765
phiX174	5386	5209.58	0	0

Cleaning files

Reformated files:
target_id	length	eff_length	est_counts	tpm
ENST00000456328.2	1657	1480.58	52.0305	0.439011
ENST00000450305.2	632	455.749	0	0
ENST00000488147.1	1351	1174.58	20
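
For readers less familiar with `awk`, the cleaning step can also be sketched in Python. This is an approximation of the one-liner above, not a drop-in replacement: it keeps the header, drops rows whose first column does not start with `ENST` (the `awk` version matches `ENST` anywhere in the line), and truncates the first column at the first `|`. The function name `clean_kallisto_lines` is made up for this example.

```python
def clean_kallisto_lines(lines):
    """Keep the header and ENST rows; reduce column 1 to the transcript ID."""
    cleaned = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i == 0:
            cleaned.append(line)  # header line is kept unchanged
            continue
        fields = line.split("\t")
        if not fields[0].startswith("ENST"):
            continue  # drops spike-ins such as ERCC-* and phiX174
        fields[0] = fields[0].split("|")[0]
        cleaned.append("\t".join(fields))
    return cleaned


# Rows copied from the output above
example = [
    "target_id\tlength\teff_length\test_counts\ttpm",
    "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
    "OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"
    "\t1657\t1480.58\t52.0305\t0.439011",
    "ERCC-00168\t1024\t847.58\t5\t0.073695",
]
for row in clean_kallisto_lines(example):
    print(row)
```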

With the files properly formatted, we can use the `preprocess_annotation` module to generate consensus models for each gene. We will require that a transcript has a median expression value of at least 10 TPM across the two replicates for it to pass filtering.

In [7]:
run_in_crisprware('\
                  preprocess_annotation --tpm_files *_cleaned.tsv \
                  --gtf gencode.v29.gtf \
                  --median 10 \
                  --model consensus'\
                  )



	Processing isoform quantification files

	Removing transcripts below threshold

	Inferring file type from header line

		K562_rep1_cleaned.tsv is a Kallisto file

	Initial unique transcripts:			206694
	Transcripts after filtering by expression:	10392

	Generating transcript-gene relationships

	Saving transcript-gene relationships to:	/content/crisprware/kallisto/gencode.v29/tmp/tx2gene.tsv
	Final unique genes:		5728
	Final unique transcripts:	10392
	Saving quantification file to:		/content/crisprware/kallisto/gencode.v29/tmp/filtered_gencode.v29.tsv
	Saving transcript filtered GTF to:	/content/crisprware/kallisto/gencode.v29/gencode.v29_filtered.gtf
	Saving consensus GTF to: /content/crisprware/kallisto/gencode.v29/gencode.v29_consensus.gtf

	A CONSENSUS MODEL COULD NOT BE GENERATED FOR 68 GENES
	If this number is large, consider filtering by TPM expression more strictly or using a more conservative GTF.
	If this number is small, consider manually removing problematic transcripts f
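
To make the filtering criterion concrete, here is a small sketch of a median-based filter next to a minimum-based one (the `--min` option used later in this notebook). It only illustrates the idea; it is not CRISPRware's implementation, the transcript names and TPM values are invented, and whether the real threshold is inclusive is an assumption.

```python
from statistics import median


def filter_transcripts(tpm_by_transcript, threshold, summary=median):
    """Return transcripts whose summarised TPM meets the threshold.

    `summary` collapses the per-replicate TPM values into one number,
    e.g. statistics.median or the built-in min.
    """
    return sorted(tx for tx, tpms in tpm_by_transcript.items()
                  if summary(tpms) >= threshold)


# Hypothetical TPM values for two replicates
example = {
    "tx_a": [22.0, 4.0],   # median 13.0, minimum 4.0
    "tx_b": [0.4, 12.0],   # median 6.2, minimum 0.4
    "tx_c": [15.0, 11.0],  # median 13.0, minimum 11.0
}
print(filter_transcripts(example, 10))        # median-based
print(filter_transcripts(example, 10, min))   # minimum-based
```

With only two replicates the median equals the mean, so one highly expressed replicate can carry a transcript past the median threshold while a minimum-based filter would reject it.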

### Salmon example

The paper [*Rauber & Mohammadian et al.*](https://www.nature.com/articles/s41590-024-01774-4) uses Salmon for quantification in human fibroblasts, deposited at [GEO accession GSE228953](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE228953). According to the paper, the Ensembl 104 annotation was used, which we can find at the [Ensembl FTP server](https://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/).

In [8]:
%cd /content/crisprware/
!mkdir -p salmon
%cd salmon
!wget --show-progress -q -O GRCh38.104.gtf.gz https://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.gtf.gz && gunzip -f GRCh38.104.gtf.gz
!curl -L "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE228953&format=file" -o GSE228953.tar && tar -xf GSE228953.tar
!find . -name "*quant.tar.gz" -type f -exec sh -c 'tar -xzf "$0" -C "$(dirname "$0")" && rm "$0"' {} \;
!ls

/content/crisprware
/content/crisprware/salmon
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.1M  100 50.1M    0     0  43.1M      0  0:00:01  0:00:01 --:--:-- 43.1M
GRCh38.104.gtf	PsA_ST4_quant  PsA_ST8_quant  RA_ST2_quant  RA_ST6_quant
GSE228953.tar	PsA_ST5_quant  PsA_ST9_quant  RA_ST3_quant  RA_ST7_quant
PsA_ST2_quant	PsA_ST6_quant  RA_ST10_quant  RA_ST4_quant  RA_ST8_quant
PsA_ST3_quant	PsA_ST7_quant  RA_ST1_quant   RA_ST5_quant  RA_ST9_quant


In [9]:
run_in_crisprware('\
                  preprocess_annotation --gtf GRCh38.104.gtf \
                  --tpm_files PsA_ST2_quant/quant.sf \
                  --min 1 \
                  ')



	Processing isoform quantification files

	Removing transcripts below threshold

	Inferring file type from header line

		PsA_ST2_quant/quant.sf is a Salmon file

	Initial unique transcripts:			187432
	Transcripts after filtering by expression:	38451

	Generating transcript-gene relationships

	Saving transcript-gene relationships to:	/content/crisprware/salmon/GRCh38.104/tmp/tx2gene.tsv


	ENST00000634176.1
	ENST00000634111.1
	ENST00000610439.4
	ENST00000632021.1
	ENST00000633283.1
	ENST00000631392.1
	ENST00000633575.1
	ENST00000633328.1
	ENST00000631882.1
	ENST00000631559.1
	...
	and 38441 more

	GENE LEVEL FILTERING OF TOP TRANSCRIPTS WILL BE IGNORED
	CHECK THAT THE SAME ANNOTATION WAS USED FOR QUANTIFICATION AND CURRENT PROCESSING
	CHECK THAT THE FIRST COLUMN OF THE TPM QUANTIFICATION FILES CONTAINS ONLY THE TRANSCRIPT ID

	Saving quantification file to:		/content/crisprware/salmon/GRCh38.104/tmp/filtered_GRCh38.104.tsv
	Saving transcript filtered GTF to:	/content/crisprware/salm

Notice this warning output:  

	Warning: Transcripts not found in GTF/GFF:

	ENST00000634176.1
	ENST00000634111.1
	ENST00000610439.4
	ENST00000632021.1
	ENST00000633283.1
	ENST00000631392.1
	ENST00000633575.1
	ENST00000633328.1
	ENST00000631882.1
	ENST00000631559.1
	...
	and 38441 more

	GENE LEVEL FILTERING OF TOP TRANSCRIPTS WILL BE IGNORED
	CHECK THAT THE SAME ANNOTATION WAS USED FOR QUANTIFICATION AND CURRENT PROCESSING
	CHECK THAT THE FIRST COLUMN OF THE TPM QUANTIFICATION FILES CONTAINS ONLY THE TRANSCRIPT ID

This indicates that the transcript IDs do not match between the Salmon quant.sf files and the GTF.  
Let's investigate:  

In [10]:
!echo "quant.sf transcript format:"
!grep "ENST00000634176" PsA_ST2_quant/quant.sf | cut -f1 && echo ""

!echo "Ensembl GTF transcript format:"
!grep "ENST00000634176" GRCh38.104.gtf | awk -F 'transcript_id "' '{print $2}' | awk -F '"' '{print $1}' | sort -u

quant.sf transcript format:
ENST00000634176.1

Ensembl GTF transcript format:
ENST00000634176


Notice that the quant.sf file carries a transcript version suffix (".1") while the GTF does not. Passing the `--strip_tx_id` flag in the processing step accounts for this and resolves the warning.

In [11]:
run_in_crisprware('\
                  preprocess_annotation --gtf GRCh38.104.gtf \
                  --tpm_files PsA_ST2_quant/quant.sf \
                  --min 1 \
                  --strip_tx_id \
                  ')



	Processing isoform quantification files

	Removing transcripts below threshold

	Inferring file type from header line

		PsA_ST2_quant/quant.sf is a Salmon file

	Initial unique transcripts:			187432
	Unique transcripts after ID stripping:		187432
	Transcripts after filtering by expression:	38451

	Generating transcript-gene relationships

	Saving transcript-gene relationships to:	/content/crisprware/salmon/GRCh38.104/tmp/tx2gene.tsv
	Final unique genes:		13811
	Final unique transcripts:	38451
	Saving quantification file to:		/content/crisprware/salmon/GRCh38.104/tmp/filtered_GRCh38.104.tsv
	Saving transcript filtered GTF to:	/content/crisprware/salmon/GRCh38.104/GRCh38.104_filtered.gtf
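
The mismatch and its fix can be illustrated with a few lines of Python. This is a sketch of the idea behind stripping version suffixes, not the code CRISPRware runs; `strip_version` and `count_matches` are names invented for this example, and the IDs are taken from the warning above.

```python
def strip_version(transcript_id):
    """Drop a trailing version suffix: 'ENST00000634176.1' -> 'ENST00000634176'."""
    return transcript_id.split(".", 1)[0]


def count_matches(quant_ids, gtf_ids, strip=False):
    """Count quantification IDs that are present in the GTF ID set."""
    if strip:
        quant_ids = [strip_version(tx) for tx in quant_ids]
    return sum(tx in gtf_ids for tx in quant_ids)


quant_ids = ["ENST00000634176.1", "ENST00000610439.4"]  # versioned, as in quant.sf
gtf_ids = {"ENST00000634176", "ENST00000610439"}        # unversioned, as in the Ensembl GTF

print(count_matches(quant_ids, gtf_ids))              # no IDs match as-is
print(count_matches(quant_ids, gtf_ids, strip=True))  # all IDs match after stripping
```

A quick check like this on a handful of IDs is a cheap way to confirm that a quantification file and an annotation agree before running the full preprocessing step.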




### FLAIR example
The paper [*Robinson & Jagannatha et al.*](https://elifesciences.org/articles/69431) has an associated FLAIR quantification and GTF file at [GEO accession GSE141750](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE141750).

We can pull that data:

In [12]:
%cd /content/crisprware/
!mkdir -p flair
%cd flair
!wget --show-progress -q -O GSE141750_Mse_nanopore.isoforms.gtf.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE141nnn/GSE141750/suppl/GSE141750%5FMse%5Fnanopore.isoforms.gtf.gz
!wget --show-progress -q -O GSE141750_counts_matrix.xlsx https://ftp.ncbi.nlm.nih.gov/geo/series/GSE141nnn/GSE141750/suppl/GSE141750%5Fcounts%5Fmatrix.xlsx
!ls

/content/crisprware
/content/crisprware/flair
GSE141750_counts_matrix.xlsx  GSE141750_Mse_nanopore.isoforms.gtf.gz


The counts are saved in an "xlsx" (Microsoft Excel) format. We need to convert it to TSV format:

In [13]:
from pandas import read_excel
# Suppress the UserWarning that openpyxl prints while reading the workbook
from warnings import filterwarnings
filterwarnings("ignore", category=UserWarning, module='openpyxl')

# Read the first sheet and write it back out as a tab-separated file
# (to_csv returns None when given a path, so there is nothing to assign)
read_excel('GSE141750_counts_matrix.xlsx', index_col=None).to_csv('GSE141750_counts_matrix.tsv', sep='\t', encoding='utf-8', index=False)

Let's look at the format. The transcript IDs are in the first column, followed by counts for three replicates each of untreated (ctl_batch) and LPS-treated (lps_batch) samples.

In [14]:
!head -6 GSE141750_counts_matrix.tsv

ids	m1-lps.fastq_lps_batch1	m2-lps.fastq_lps_batch1	m3-lps.fastq_lps_batch1	m1-ctl.fastq_ctl_batch1	m2-ctl.fastq_ctl_batch1	m3-ctl.fastq_ctl_batch1
0000c6c2-62f7-48c8-821a-e2263cfb3fe2;16_chr11:70651000	0	0	1	0	1	1
000145b6-0f98-46c3-b0e4-8880c8bd2a61;16_ENSMUSG00000024187.14	4	4	4	10	6	17
0008107d-5030-496d-a9a4-b4829718e3f2;0_ENSMUSG00000031447.7	7	4	3	9	5	14
000aab60-efca-43c7-86de-ba61147dde51;16_ENSMUSG00000019082.18	1	1	4	0	0	0
00102dfe-3450-4599-bc4d-e0cdc42889ec;0_ENSMUSG00000063506.14	4	2	3	1	0	1


We will take all samples and filter the GTF file to include only transcripts with at least 1 count in each sample.

In [15]:
run_in_crisprware('\
                  preprocess_annotation --gtf GSE141750_Mse_nanopore.isoforms.gtf.gz \
                  --tpm_files GSE141750_counts_matrix.tsv \
                  --min 1 \
                  ')



	Unzipping GSE141750_Mse_nanopore.isoforms.gtf.gz
	Unzipped file saved as GSE141750_Mse_nanopore.isoforms.gtf
	Processing isoform quantification files

	Removing transcripts below threshold

	Inferring file type from header line

		GSE141750_counts_matrix.tsv is a FLAIR file

	Initial unique transcripts:			33244
	Transcripts after filtering by expression:	11837

	Generating transcript-gene relationships

	Saving transcript-gene relationships to:	/content/crisprware/flair/GSE141750_Mse_nanopore.isoforms/tmp/tx2gene.tsv
	Final unique genes:		6689
	Final unique transcripts:	11837
	Saving quantification file to:		/content/crisprware/flair/GSE141750_Mse_nanopore.isoforms/tmp/filtered_GSE141750_Mse_nanopore.isoforms.tsv
	Saving transcript filtered GTF to:	/content/crisprware/flair/GSE141750_Mse_nanopore.isoforms/GSE141750_Mse_nanopore.isoforms_filtered.gtf


	Removing file: GSE141750_Mse_nanopore.isoforms.gtf


We can also do the same with only the LPS counts:

In [16]:
!cut -f1,2,3,4 GSE141750_counts_matrix.tsv > LPS_GSE141750_counts_matrix.tsv
run_in_crisprware('\
                  preprocess_annotation --gtf GSE141750_Mse_nanopore.isoforms.gtf.gz \
                  --tpm_files LPS_GSE141750_counts_matrix.tsv \
                  --min 1 \
                  ')



	Unzipping GSE141750_Mse_nanopore.isoforms.gtf.gz
	Unzipped file saved as GSE141750_Mse_nanopore.isoforms.gtf
	Processing isoform quantification files

	Removing transcripts below threshold

	Inferring file type from header line

		LPS_GSE141750_counts_matrix.tsv is a FLAIR file

	Initial unique transcripts:			33244
	Transcripts after filtering by expression:	17322

	Generating transcript-gene relationships

	Saving transcript-gene relationships to:	/content/crisprware/flair/GSE141750_Mse_nanopore.isoforms/tmp/tx2gene.tsv
	Final unique genes:		7565
	Final unique transcripts:	17322
	Saving quantification file to:		/content/crisprware/flair/GSE141750_Mse_nanopore.isoforms/tmp/filtered_GSE141750_Mse_nanopore.isoforms.tsv
	Saving transcript filtered GTF to:	/content/crisprware/flair/GSE141750_Mse_nanopore.isoforms/GSE141750_Mse_nanopore.isoforms_filtered.gtf


	Removing file: GSE141750_Mse_nanopore.isoforms.gtf
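
The LPS-only run keeps more transcripts (17322 versus 11837) because a transcript now only has to reach the threshold in three samples instead of six. The sketch below applies that rule to the five rows printed by `head` earlier. It illustrates the "at least 1 count in each sample" criterion as described in the text and is not CRISPRware's own code; `keep_ids` is a made-up helper.

```python
# Column order: lps1, lps2, lps3, ctl1, ctl2, ctl3 (rows copied from the head output)
ROWS = [
    ("0000c6c2-62f7-48c8-821a-e2263cfb3fe2;16_chr11:70651000", [0, 0, 1, 0, 1, 1]),
    ("000145b6-0f98-46c3-b0e4-8880c8bd2a61;16_ENSMUSG00000024187.14", [4, 4, 4, 10, 6, 17]),
    ("0008107d-5030-496d-a9a4-b4829718e3f2;0_ENSMUSG00000031447.7", [7, 4, 3, 9, 5, 14]),
    ("000aab60-efca-43c7-86de-ba61147dde51;16_ENSMUSG00000019082.18", [1, 1, 4, 0, 0, 0]),
    ("00102dfe-3450-4599-bc4d-e0cdc42889ec;0_ENSMUSG00000063506.14", [4, 2, 3, 1, 0, 1]),
]


def keep_ids(rows, columns, min_count=1):
    """Return IDs whose count is at least min_count in every selected column."""
    return [row_id for row_id, counts in rows
            if min(counts[c] for c in columns) >= min_count]


all_samples = keep_ids(ROWS, columns=range(6))
lps_only = keep_ids(ROWS, columns=range(3))
print(len(all_samples), "of", len(ROWS), "rows pass with all samples")
print(len(lps_only), "of", len(ROWS), "rows pass with LPS samples only")
```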


# RiboSeq


---

A number of tools exist for calling translated ORFs from RiboSeq. To find gRNAs against these putative coding regions, we can convert the output of these programs into a GTF with annotated coding sequence (CDS) entries and run the CRISPRware pipeline normally.

Currently this works for [PRICE](https://github.com/erhard-lab/price) and [RiboTISH](https://github.com/zhpn1024/ribotish) output. For other RiboSeq ORF callers, raise a [GitHub issue](https://github.com/ericmalekos/crisprware/issues) and I will address it.

In [17]:
%cd /content/crisprware/
!mkdir -p riboseq
%cd riboseq

/content/crisprware
/content/crisprware/riboseq


### RiboTISH

For ORFs called with RiboTISH, set these options in the `ribotish predict` command: `--inframecount`, `--blocks`, `--aaseq`, and provide the conversion script with the same GTF that was passed to RiboTISH. We have a test file with these columns, and we will download the corresponding GTF from GENCODE.

In [18]:
!head -5 /content/crisprware/tests/test_data/ribotish/GSE208041_ATGORFs.tsv
!wget --show-progress -q -O gencodev41.gtf.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz && gunzip -f gencodev41.gtf.gz

Gid	Tid	Symbol	GeneType	GenomePos	StartCodon	Start	Stop	TisType	TISGroup	TISCounts	TISPvalue	RiboPvalue	RiboPStatus	FisherPvalue	TISQvalue	FrameQvalue	FisherQvalue	AALen	Seq	AASeq	Blocks	InFrameCount
ENSG00000100985.7	ENST00000372330.3	MMP9	protein_coding	chr20:46008926-46016368:+	ATG	19	2143	Annotated	0	0	None	5.01092337210136E-145	N	None	None	1.34551249593E-142	None	707	ATGAGCCTCTGGCAGCCCCTGGTCCTGGTGCTCCTGGTGCTGGGCTGCTGCTTTGCTGCCCCCAGACAGCGCCAGTCCACCCTTGTGCTCTTCCCTGGAGACCTGAGAACCAATCTCACCGACAGGCAGCTGGCAGAGGAATACCTGTACCGCTATGGTTACACTCGGGTGGCAGAGATGCGTGGAGAGTCGAAATCTCTGGGGCCTGCGCTGCTGCTTCTCCAGAAGCAACTGTCCCTGCCCGAGACCGGTGAGCTGGATAGCGCCACGCTGAAGGCCATGCGAACCCCACGGTGCGGGGTCCCAGACCTGGGCAGATTCCAAACCTTTGAGGGCGACCTCAAGTGGCACCACCACAACATCACCTATTGGATCCAAAACTACTCGGAAGACTTGCCGCGGGCGGTGATTGACGACGCCTTTGCCCGCGCCTTCGCACTGTGGAGCGCGGTGACGCCGCTCACCTTCACTCGCGTGTACAGCCGGGACGCAGACATCGTCATCCAGTTTGGTGTCGCGGAGCACGGAGACGGGTATCCCTTCGACGGGAAGGACGGGCTCCTGGCACACGCCTTTCCTCCTGGCCCCGGCATTCAGGGAGACGCCCATTTCGACGATGACGAGT

Only a single ORF type can be selected at a time (e.g. 5'UTR, 3'UTR, Novel) using the required `--tistype` specifier. Note that TIS types that include a single quote must be passed in double quotes with the quote escaped, e.g. 3'UTR -> `"3\'UTR"`.

In [19]:
run_in_crisprware('gtf_from_ribotish.py \
                  --input_gtf ./gencodev41.gtf \
                  --ribotish /content/crisprware/tests/test_data/ribotish/GSE208041_ATGORFs.tsv \
                  --select_based_on AALen \
                  --min_aalen 100 \
                  --tistype "3\'UTR" \
                  --output_gtf ./uORF_100AA.gtf')

Sorted GTF file has been written to ./uORF_100AA.gtf
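
The escaping rule for TIS types with a single quote follows from how the command is written: the whole command is a single-quoted Python string, so the inner quote needs a backslash to keep the string open, and the double quotes protect the value when a shell later splits the command into arguments. The short sketch below uses `shlex.split` as a stand-in for shell word splitting and only inspects the string; it does not run the script.

```python
import shlex

# Same quoting as in the cell above, shortened to the relevant options
command = 'gtf_from_ribotish.py --tistype "3\'UTR" --min_aalen 100'

# Python has already removed the backslash; the shell will see "3'UTR"
print(command)

# POSIX-style splitting then removes the double quotes and keeps the apostrophe
args = shlex.split(command)
print(args)
```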


In [20]:
!head -10 ./uORF_100AA.gtf

chr1	HAVANA	exon	1253909	1255487	.	-	.	gene_id "ENSG00000160087.21"; transcript_id "ENST00000450390.6"; gene_type "protein_coding"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_name "UBE2J2-208"; exon_number 8; exon_id "ENSE00001675740.1"; level 2; protein_id "ENSP00000407565.2"; transcript_support_level "5"; hgnc_id "HGNC:19268"; havana_gene "OTTHUMG00000001911.9"; havana_transcript "OTTHUMT00000005431.3";	ENST00000450390.6
chr1	HAVANA	transcript	1253909	1273853	.	-	.	gene_id "ENSG00000160087.21"; transcript_id "ENST00000450390.6"; gene_type "protein_coding"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_name "UBE2J2-208"; level 2; protein_id "ENSP00000407565.2"; transcript_support_level "5"; hgnc_id "HGNC:19268"; havana_gene "OTTHUMG00000001911.9"; havana_transcript "OTTHUMT00000005431.3";	ENST00000450390.6
chr1	HAVANA	exon	1253912	1255487	.	-	.	gene_id "ENSG00000160087.21"; transcript_id "ENST00000349431.11"; gene_type "protei

### PRICE

For ORFs called with PRICE, we use the default settings and pass the orfs.tsv output to the `gtf_from_price.py` script. We have a test file in this format, and we will download the corresponding GTF from GENCODE, this time for mouse.

In [21]:
!head -5 /content/crisprware/tests/test_data/price/GSE22004_ORFS.tsv
!wget --show-progress -q -O gencodevM31.gtf.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M31/gencode.vM31.annotation.gtf.gz && gunzip -f gencodevM31.gtf.gz

Gene	Id	Location	Candidate Location	Codon	Type	Start	Range	p value	./GSE22004/BMDN_WT_merged	Total
ENSMUSG00000033793.13	ENSMUST00000044369.13_uORF_1	1+:5153417-5153501|5154639-5154666	1+:5153417-5153501|5154639-5154666	CTG	uORF	0.61	0.49	0.0053767	31.5	31.5
ENSMUSG00000033793.13	ENSMUST00000044369.13_uORF_3	1+:5153499-5153501|5154639-5154673	1+:5153499-5153501|5154639-5154673	AGG	uORF	0.10	0.26	0.74776	9.0	9.0
ENSMUSG00000033793.13	ENSMUST00000194676.6_uORF_4	1+:5154643-5154673	1+:5154643-5154673	GTG	uORF	0.23	0.12	0.36007	7.3	7.3
ENSMUSG00000033793.13	ENSMUST00000044369.13_Trunc_0	1+:5159280-5159334|5163585-5163675|5165837-5165951|5168251-5168356|5171292-5171346|5187612-5187710|5194499-5194692|5203306-5203485|5206034-5206160|5213972-5214074|5220170-5220284|5232327-5232388	1+:5154682-5154786|5159231-5159334|5163585-5163675|5165837-5165951|5168251-5168356|5171292-5171346|5187612-5187710|5194499-5194692|5203306-5203485|5206034-5206160|5213972-5214074|5220170-5220284|5232327-5232388	ATG	

In [23]:
run_in_crisprware('gtf_from_price.py \
                  -i /content/crisprware/tests/test_data/price/GSE22004_ORFS.tsv \
                  -g ./gencodevM31.gtf \
                  --tis_type uORF \
                  --start_codon CTG \
                  --min_p_value 0.001 \
                  -o price_CTG_uORFs.gtf')

In [24]:
!tail -6 price_CTG_uORFs.gtf

chrX	HAVANA	transcript	165240249	165262863	.	-	.	gene_id "ENSMUSG00000079316.11"; transcript_id "ENSMUST00000112091.9_uORF_1";
chrX	HAVANA	exon	165262536	165262863	.	-	.	gene_id "ENSMUSG00000079316.11"; transcript_id "ENSMUST00000112091.9_uORF_1";
chrX	HAVANA	exon	165245311	165245338	.	-	.	gene_id "ENSMUSG00000079316.11"; transcript_id "ENSMUST00000112091.9_uORF_1";
chrX	HAVANA	exon	165240249	165241339	.	-	.	gene_id "ENSMUSG00000079316.11"; transcript_id "ENSMUST00000112091.9_uORF_1";
chrX	Price	CDS	165245335	165245338	.	-	0	gene_id "ENSMUSG00000079316.11"; transcript_id "ENSMUST00000112091.9_uORF_1";
chrX	Price	CDS	165262536	165262612	.	-	0	gene_id "ENSMUSG00000079316.11"; transcript_id "ENSMUST00000112091.9_uORF_1";
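
To see how the `Location` column of the PRICE table relates to the exon and CDS records above, here is a sketch that parses one location string into chromosome, strand, and blocks. It is only a reading aid under the assumption that the format is `<chrom><strand>:<start>-<end>|<start>-<end>|...`, as in the rows printed earlier; `parse_price_location` is a hypothetical helper, and the sketch makes no claim about how PRICE or `gtf_from_price.py` treat coordinate conventions.

```python
import re

# <chrom><strand>:<start>-<end>|<start>-<end>|...
LOCATION_RE = re.compile(r"^(?P<chrom>.+?)(?P<strand>[+-]):(?P<blocks>.+)$")


def parse_price_location(location):
    """Split a PRICE location string into (chrom, strand, [(start, end), ...])."""
    match = LOCATION_RE.match(location)
    if match is None:
        raise ValueError(f"Unrecognised location: {location!r}")
    blocks = []
    for block in match["blocks"].split("|"):
        start, end = block.split("-")
        blocks.append((int(start), int(end)))
    return match["chrom"], match["strand"], blocks


# Location copied from the first data row of the test file shown above
chrom, strand, blocks = parse_price_location("1+:5153417-5153501|5154639-5154666")
print(chrom, strand, blocks)
```

Note that the test file names chromosomes without a `chr` prefix, while the GTF records in the output above use `chrX`-style names.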


# ChIP/ATAC/etc.

Chromatin accessibility is a strong determinant of CRISPR activity, and we can use widely available data to rationally target sites of interest. Let's use ATAC-Seq in K562 along with our RNA-Seq from earlier to target open chromatin near transcript start sites of expressed genes. We will use files from [ENCODE project ENCSR483RKN](https://www.encodeproject.org/experiments/ENCSR483RKN/).

In [25]:
%cd /content/crisprware/
!mkdir -p K562_TSS
%cd K562_TSS

/content/crisprware
/content/crisprware/K562_TSS


## Bigwig signal file

This workflow uses a Bigwig signal to determine areas of open chromatin. This has the benefit of not relying on assumptions that go into peak calling, instead looking for the windows of highest aggregate signal.

In [26]:
!wget -q --show-progress https://www.encodeproject.org/files/ENCFF600FDO/@@download/ENCFF600FDO.bigWig
!wget -q --show-progress https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz && gunzip -f chr21.fa.gz



With the signal file downloaded, let's use the RNA-Seq from above to find the TSSs of isoforms expressed at 10 TPM or more. We will start by generating a window of +/- 500 bp around each TSS. **This requires that you've run the Kallisto portion of the RNA-Seq section.**


In [27]:
run_in_crisprware('\
                  preprocess_annotation --tpm_files ../kallisto/*_cleaned.tsv \
                  --gtf ../kallisto/gencode.v29.gtf \
                  --min 10 \
                  --tss_window 500 500'
                  )



	Processing isoform quantification files

	Removing transcripts below threshold

	Inferring file type from header line

		../kallisto/K562_rep1_cleaned.tsv is a Kallisto file

	Initial unique transcripts:			206694
	Transcripts after filtering by expression:	8629

	Generating transcript-gene relationships

	Saving transcript-gene relationships to:	/content/crisprware/K562_TSS/gencode.v29/tmp/tx2gene.tsv
	Final unique genes:		5063
	Final unique transcripts:	8629
	Saving quantification file to:		/content/crisprware/K562_TSS/gencode.v29/tmp/filtered_gencode.v29.tsv
	Saving transcript filtered GTF to:	/content/crisprware/K562_TSS/gencode.v29/gencode.v29_filtered.gtf

	Saving TSS:	/content/crisprware/K562_TSS/gencode.v29/TSS_filtered.bed
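
As a rough picture of what a TSS window is, the sketch below derives one from a transcript's GTF coordinates and strand. Several things here are assumptions rather than documented CRISPRware behavior: that the two values of `--tss_window` mean upstream then downstream, that the output follows the usual 0-based half-open BED convention, and that windows are clipped at position 0. `tss_window` is a name made up for this example.

```python
def tss_window(start, end, strand, upstream=500, downstream=500):
    """Return a 0-based, half-open BED interval around a transcript's TSS.

    `start` and `end` are 1-based inclusive GTF coordinates. On the plus
    strand the TSS is `start`; on the minus strand it is `end`, and
    upstream points toward larger coordinates.
    """
    if strand == "+":
        tss = start
        low, high = tss - upstream, tss + downstream
    elif strand == "-":
        tss = end
        low, high = tss - downstream, tss + upstream
    else:
        raise ValueError(f"Unknown strand: {strand!r}")
    # Convert the 1-based inclusive range to BED and clip at the chromosome start
    return max(0, low - 1), high


print(tss_window(1000, 5000, "+"))  # window centred on position 1000
print(tss_window(1000, 5000, "-"))  # window centred on position 5000
```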




Now we can use the `bigwig_to_signalwindow.py` script with the TSS windows BED and the Bigwig file to find the 250 bp window with the highest signal (i.e. most open chromatin) within each TSS window.

In [None]:
run_in_crisprware('\
                  bigwig_to_signalwindow.py \
                  --window_size 250 \
                  ./gencode.v29/TSS_filtered.bed \
                  ./ENCFF600FDO.bigWig \
                  ./open_TSS.bed'
                  )

Processing BED file: ./gencode.v29/TSS_filtered.bed
Using BigWig file: ./ENCFF600FDO.bigWig
Output will be saved to: ./open_TSS.bed
Window size: 250 bp
Number of chromosomes in BigWig file: 148
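
The idea of picking the window with the highest aggregate signal can be sketched with a running sum over per-base values. This is a generic sliding-window maximum written for illustration, not the code in `bigwig_to_signalwindow.py`, and the signal values are invented.

```python
def best_window(signal, window_size):
    """Return (start, total) of the window with the highest summed signal.

    `signal` is a list of per-base values for one region; `start` is the
    0-based offset of the winning window within that region. Ties keep
    the leftmost window.
    """
    if window_size <= 0 or window_size > len(signal):
        raise ValueError("window_size must be between 1 and len(signal)")
    total = sum(signal[:window_size])
    best_start, best_total = 0, total
    for start in range(1, len(signal) - window_size + 1):
        # Slide by one base: add the entering value, drop the leaving one
        total += signal[start + window_size - 1] - signal[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total


toy_signal = [0, 1, 5, 6, 2, 0, 0]  # hypothetical per-base signal
print(best_window(toy_signal, 3))
```

The running sum keeps the scan linear in the region length, which matters little for a 1 kb TSS window but scales to larger regions.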


## BED called peaks

If Bigwig files are not available, we can also use called peaks in BED format to achieve the same end.

In [None]:
!wget -q --show-progress https://www.encodeproject.org/files/ENCFF558BLC/@@download/ENCFF558BLC.bed.gz
!wget -q --show-progress https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr16.fa.gz
!gunzip -f *.gz



After downloading, we can find protospacers by passing the open TSS windows and the ATAC peaks to the `--locations_to_keep` argument. By default this will look for protospacers in the intersection of the two inputs.

In [None]:
run_in_crisprware('\
                  generate_guides \
                  --fasta chr16.fa \
                  --locations_to_keep ENCFF558BLC.bed open_TSS.bed \
                  --threads 2'
                  )


	No GTF file, '--feature exon' will be ignored.


	Chromosomes for which to find targets:	chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9
	Processing chr16

	Saved output file to /content/crisprware/K562_TSS/chr16_gRNA/chr16_gRNA.bed
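
To make "intersection of the two inputs" concrete, here is a small sketch that intersects two sorted interval lists on a single chromosome. It shows the general technique only; how `generate_guides` computes the intersection internally is not covered here, and the coordinates are invented.

```python
def intersect(intervals_a, intervals_b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals.

    Intervals are half-open, as in BED, and both lists refer to the same
    chromosome.
    """
    i = j = 0
    shared = []
    while i < len(intervals_a) and j < len(intervals_b):
        low = max(intervals_a[i][0], intervals_b[j][0])
        high = min(intervals_a[i][1], intervals_b[j][1])
        if low < high:
            shared.append((low, high))
        # Advance whichever interval ends first
        if intervals_a[i][1] < intervals_b[j][1]:
            i += 1
        else:
            j += 1
    return shared


peaks = [(100, 400), (900, 1200)]        # hypothetical ATAC peaks
tss_windows = [(300, 550), (1000, 1250)]  # hypothetical open TSS windows
print(intersect(peaks, tss_windows))
```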



These outputs can then be scored and ranked as done in the primary CRISPRware tutorial.