# Downloading and preparing example metadata

A digenic (dual-guide) experiment targeting paralogs is used here as an example of combinatorial CRISPR screen quantification with pyCROQUET. The publication associated with this dataset is:

>Gonatopoulos-Pournatzis, T., Aregger, M., Brown, K.R. et al.   
>**Genetic interaction mapping and exon-resolution functional genomics with a hybrid Cas9–Cas12a platform.**   
>*Nat Biotechnol 38, 638–648 (2020).*   
>https://doi.org/10.1038/s41587-020-0437-z

Data availability:

* [GEO series GSE144281](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE144281) - raw counts and sample metadata 
* [SRA project PRJNA603290](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA603290&o=acc_s%3Aa) - raw sequencing (FASTQ) and run metadata

This notebook is designed to run with the R kernel. Alternatively, once the repository has been downloaded, the commands can be run directly in R or RStudio. You will need to install the [tidyverse](https://cran.r-project.org/web/packages/tidyverse/index.html) library if it is not already present.
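As a minimal sketch of checking for the dependency before running the notebook, the snippet below reports which required packages are not installed. The helper name `missing_packages` is hypothetical and not part of the original notebook; the installation call is left as a comment so that nothing is installed unintentionally.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages this notebook relies on
required <- c("tidyverse")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # To install them, run: install.packages(to_install)
} else {
  message("All required packages are installed.")
}
```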

***

We manually created a mapping from SRA run identifiers (e.g. SRR10969645) to sample identifiers (e.g. HH-79). With this mapping we can download and merge the sample metadata, such as the sample labels used in the expected (CHyMErA) count matrix, with the SRA metadata, such as the FASTQ filenames associated with each sample.

1. Load R library dependencies.

In [1]:
# Load R libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



2. Read in the SRA run to sample ID mapping (manually generated).

In [2]:
# Read in SRA run ID to sample mapping
sra_run_to_sample_id <- read.table('sra_run_to_sample_id.tsv', header = TRUE, sep = "\t")

# Preview data
head(sra_run_to_sample_id)

| | Sample.ID | Run |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | HH-79 | SRR10969645 |
| 2 | HH-80 | SRR10969649 |
| 3 | HH-81 | SRR10969650 |
| 4 | HH-82 | SRR10969651 |
| 5 | HH-83 | SRR10969661 |
| 6 | HH-84 | SRR10969665 |
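Because this mapping was created by hand, it may be worth checking it before using it as a join key. The sketch below is an illustration, not part of the original workflow. It uses a small hand-typed stand-in for `sra_run_to_sample_id` and a hypothetical helper, `find_duplicates`, to look for repeated run accessions and for values that do not look like SRA run accessions.

```r
# Toy stand-in for sra_run_to_sample_id, copied from the first rows of the preview
mapping <- data.frame(
  Sample.ID = c("HH-79", "HH-80", "HH-81"),
  Run = c("SRR10969645", "SRR10969649", "SRR10969650"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: values that occur more than once in a vector
find_duplicates <- function(x) unique(x[duplicated(x)])

# Each run accession should appear only once in the mapping
dup_runs <- find_duplicates(mapping$Run)

# Run accessions are assumed to look like "SRR" followed by digits
bad_runs <- mapping$Run[!grepl("^SRR[0-9]+$", mapping$Run)]

# Stop early if the mapping contains duplicated or malformed run accessions
stopifnot(length(dup_runs) == 0, length(bad_runs) == 0)
```

Whether `Sample.ID` must also be unique depends on whether a sample can span several runs, so it is not asserted here.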


3. Download the sample name to sample ID mapping from the GEO [GSE144281 supplementary file](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE144281).

In [3]:
# Download GEO sample metadata to a temporary file (binary mode keeps the gzip file intact on Windows)
tmp <- tempfile()
download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144281/suppl/GSE144281_chymera_sample_key.txt.gz', tmp, mode = 'wb')

# Read the temporary file into a data frame and write it to the metadata directory
GSE144281_sample_key <- read.csv(gzfile(tmp), sep = "\t", header = TRUE)
write.table(GSE144281_sample_key, 'GSE144281_chymera_sample_key.txt', sep = "\t", quote = FALSE, row.names = FALSE)

# Preview data
head(GSE144281_sample_key)

| | Dataset | Sample.ID | Sample.Name | Label | R1 | R2 | md5sum.R1 | md5sum.R2 | Summary.File | Read.Length | X | X.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | HumanAndMouseOptimization_NovaSeq_20171205 | MH-5 | N2a T6_Lb_A | N2A_Lb_T6A | Blencowe_1_MH-5_S1_R1_001.fastq.gz | Blencowe_1_MH-5_S1_R2_001.fastq.gz | 2fed71c28d92a36706beb2c8c2245b0b | 5f7d86c8bf666a9970849969f855ee3d | Blencowe_1_MH-5_counts.txt | 151bp | 151bp | Illumina NovaSeq5000 |
| 2 | HumanAndMouseOptimization_NovaSeq_20171205 | MH-6 | N2a T6_Lb_B | N2A_Lb_T6B | Blencowe_2_MH-6_S2_R1_001.fastq.gz | Blencowe_2_MH-6_S2_R2_001.fastq.gz | 215619cd4e6e043431f6e6ef56386408 | f03e8fd322146d49fd72797eb85029bb | Blencowe_2_MH-6_counts.txt | 151bp | 151bp | Illumina NovaSeq5000 |
| 3 | HumanAndMouseOptimization_NovaSeq_20171205 | MH-7 | N2a T6_Lb_C | N2A_Lb_T6C | Blencowe_3_MH-7_S3_R1_001.fastq.gz | Blencowe_3_MH-7_S3_R2_001.fastq.gz | 0db7bb50fb0ba1c487b7f544e359ad06 | 86621956022f4042e33286d3482e8ee0 | Blencowe_3_MH-7_counts.txt | 151bp | 151bp | Illumina NovaSeq5000 |
| 4 | HumanAndMouseOptimization_NovaSeq_20171205 | MH-8 | CGR8 T6_Lb_A | CGR8_Lb_T6A | Blencowe_4_MH-8_S4_R1_001.fastq.gz | Blencowe_4_MH-8_S4_R2_001.fastq.gz | 583d907d9d8f57e0905abd4056e260ba | c9738a700f37be83f468a6788336faa6 | Blencowe_4_MH-8_counts.txt | 151bp | 151bp | Illumina NovaSeq5000 |
| 5 | HumanAndMouseOptimization_NovaSeq_20171205 | MH-9 | CGR8 T6_Lb_B | CGR8_Lb_T6B | Blencowe_5_MH-9_S5_R1_001.fastq.gz | Blencowe_5_MH-9_S5_R2_001.fastq.gz | bb5dc892f35ecaa0eac66b0eeb0a4769 | bd9a0dedfb27ef11a241a89fc1fb0b8f | Blencowe_5_MH-9_counts.txt | 151bp | 151bp | Illumina NovaSeq5000 |
| 6 | HumanAndMouseOptimization_NovaSeq_20171205 | MH-10 | CGR8 T6_Lb_C | CGR8_Lb_T6C | Blencowe_6_MH-10_S6_R1_001.fastq.gz | Blencowe_6_MH-10_S6_R2_001.fastq.gz | c35f5cddb23ec078b35e1a1f25bf6c8c | a4e0c0f6d7ebcedd8d4ab9059e01de59 | Blencowe_6_MH-10_counts.txt | 151bp | 151bp | Illumina NovaSeq5000 |
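The sample key includes `md5sum.R1` and `md5sum.R2` columns, which could be used to verify FASTQ files against their recorded checksums. The following is a minimal sketch using `tools::md5sum()` from base R. The helper `verify_md5` is hypothetical, and the demonstration runs on a small temporary file rather than a real FASTQ. The recorded checksums presumably describe the originally submitted files, so FASTQ files regenerated from SRA archives may not match them.

```r
# Hypothetical helper: TRUE when a file's MD5 digest equals the expected value
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))  # NA if the file cannot be read
  !is.na(observed) && identical(tolower(observed), tolower(expected))
}

# Demonstrate on a small temporary file instead of a real FASTQ
demo_file <- tempfile(fileext = ".txt")
writeLines("example", demo_file)
demo_md5 <- unname(tools::md5sum(demo_file))

# The file matches its own digest
matches_self <- verify_md5(demo_file, demo_md5)

# A deliberately wrong digest does not match
matches_wrong <- verify_md5(demo_file, "00000000000000000000000000000000")
```

In practice, `path` would be a downloaded FASTQ file and `expected` the corresponding `md5sum.R1` or `md5sum.R2` value for that sample.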


4. Read in run metadata downloaded separately from [SRA project PRJNA603290](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA603290&o=acc_s%3Aa).

In [4]:
# The SRA project metadata was downloaded directly from SRA
# Read the SRA project metadata into a data frame
PRJNA603290 <- read.csv('PRJNA603290_SRA_metadata.csv', header = TRUE)

# Preview data
head(PRJNA603290)

| | Run | Assay.Type | AvgSpotLen | Bases | BioProject | BioSample | Bytes | Center.Name | Consent | DATASTORE.filetype | ⋯ | ReleaseDate | Sample.Name | SRA.Study | annotation_file | cas_protein | Cell_Line | crispr_library | library_file | source_name | treatment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | SRR10969567 | OTHER | 295 | 16450165253 | PRJNA603290 | SAMN13927460 | 6140557629 | GEO | public | fastq,sra | ⋯ | 2020-02-04T00:00:00Z | GSM4284922 | SRP245362 | human_optimization_libV7_annot.txt | LbCas12a | HAP1 | CHyMErA Human Optimization Libary | human_optimization_libV7_guides.fasta | dual-guide amplicon | non-treated |
| 2 | SRR10969568 | OTHER | 297 | 14048233088 | PRJNA603290 | SAMN13927460 | 5281661513 | GEO | public | fastq,sra | ⋯ | 2020-02-04T00:00:00Z | GSM4284922 | SRP245362 | human_optimization_libV7_annot.txt | LbCas12a | HAP1 | CHyMErA Human Optimization Libary | human_optimization_libV7_guides.fasta | dual-guide amplicon | non-treated |
| 3 | SRR10969569 | OTHER | 260 | 75964566399 | PRJNA603290 | SAMN13927460 | 28455240865 | GEO | public | fastq,sra | ⋯ | 2020-02-04T00:00:00Z | GSM4284922 | SRP245362 | human_optimization_libV7_annot.txt | LbCas12a | HAP1 | CHyMErA Human Optimization Libary | human_optimization_libV7_guides.fasta | dual-guide amplicon | non-treated |
| 4 | SRR10969570 | OTHER | 300 | 12194668141 | PRJNA603290 | SAMN13927460 | 4502839618 | GEO | public | fastq,sra | ⋯ | 2020-02-04T00:00:00Z | GSM4284922 | SRP245362 | human_optimization_libV7_annot.txt | LbCas12a | HAP1 | CHyMErA Human Optimization Libary | human_optimization_libV7_guides.fasta | dual-guide amplicon | non-treated |
| 5 | SRR10969571 | OTHER | 292 | 12109078531 | PRJNA603290 | SAMN13927460 | 4612455213 | GEO | public | fastq,sra | ⋯ | 2020-02-04T00:00:00Z | GSM4284922 | SRP245362 | human_optimization_libV7_annot.txt | LbCas12a | HAP1 | CHyMErA Human Optimization Libary | human_optimization_libV7_guides.fasta | dual-guide amplicon | non-treated |
| 6 | SRR10969572 | OTHER | 292 | 13377172364 | PRJNA603290 | SAMN13927460 | 4925672548 | GEO | public | fastq,sra | ⋯ | 2020-02-04T00:00:00Z | GSM4284922 | SRP245362 | human_optimization_libV7_annot.txt | LbCas12a | HAP1 | CHyMErA Human Optimization Libary | human_optimization_libV7_guides.fasta | dual-guide amplicon | non-treated |
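The preview shows several runs (SRR10969567 to SRR10969572) sharing a single GEO sample (GSM4284922). One way to see how many runs belong to each GEO sample is to tabulate the `Sample.Name` column. The sketch below uses a small hand-typed stand-in for `PRJNA603290` rather than the real metadata file.

```r
# Toy stand-in for the SRA metadata, copied from the first rows of the preview
sra <- data.frame(
  Run = c("SRR10969567", "SRR10969568", "SRR10969569"),
  Sample.Name = c("GSM4284922", "GSM4284922", "GSM4284922"),
  stringsAsFactors = FALSE
)

# Count the number of runs recorded for each GEO sample
runs_per_sample <- table(sra$Sample.Name)

# GEO samples that are represented by more than one run
multi_run_samples <- names(runs_per_sample)[runs_per_sample > 1]
```

On the real data, the same two lines could be applied to `PRJNA603290$Sample.Name`.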


5. Join the SRA run to sample ID mapping with the GEO and SRA metadata.

In [5]:
# Join manually generated SRA-to-sample ID mapping to GEO sample metadata to get sample labels
# Join this to the SRA metadata to get sample-to-fastq mapping
sample_metadata <- sra_run_to_sample_id %>%
  left_join(GSE144281_sample_key, by = 'Sample.ID') %>% 
  left_join(PRJNA603290 %>% rename('GEO.Sample.Name' = 'Sample.Name'), by = 'Run')

# Preview data
head(sample_metadata)

| | Sample.ID | Run | Dataset | Sample.Name | Label | R1 | R2 | md5sum.R1 | md5sum.R2 | Summary.File | ⋯ | ReleaseDate | GEO.Sample.Name | SRA.Study | annotation_file | cas_protein | Cell_Line | crispr_library | library_file | source_name | treatment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | HH-79 | SRR10969645 | HumanParalogScreen_NovaSeq_20181109 | HAP1 Paralogue/Dual T0 | | Moffat_HH-79_S1_R1_001.fastq.gz | Moffat_HH-79_S1_R2_001.fastq.gz | 86086da3a6812f925cb310584ed98998 | 90feaecdf3bcc0a4331c9cf0e00f66de | HH-79_counts.txt | ⋯ | 2020-02-04T00:00:00Z | GSM4284932 | SRP245362 | human_dualTargetingParalog_libV3_annot.txt | LbCas12a | HAP1 | CHyMErA Dual-Targeting & Paralog Library | human_dualTargetingParalog_libV3_guides.fasta | dual-guide amplicon | non-treated |
| 2 | HH-80 | SRR10969649 | HumanParalogScreen_NovaSeq_20181109 | HAP1 Paralogue/Dual T18A | | Moffat_HH-80_S2_R1_001.fastq.gz | Moffat_HH-80_S2_R2_001.fastq.gz | 8361ecc9327f3e433495496dc48c5205 | 3a9f8791a79e041e264698517c4450cf | HH-80_counts.txt | ⋯ | 2020-02-04T00:00:00Z | GSM4284932 | SRP245362 | human_dualTargetingParalog_libV3_annot.txt | LbCas12a | HAP1 | CHyMErA Dual-Targeting & Paralog Library | human_dualTargetingParalog_libV3_guides.fasta | dual-guide amplicon | non-treated |
| 3 | HH-81 | SRR10969650 | HumanParalogScreen_NovaSeq_20181109 | HAP1 Paralogue/Dual T18B | | Moffat_HH-81_S3_R1_001.fastq.gz | Moffat_HH-81_S3_R2_001.fastq.gz | 96c53d3db54583397376705003702a24 | cb493111b507ab9b23e497e4edcea4fa | HH-81_counts.txt | ⋯ | 2020-02-04T00:00:00Z | GSM4284932 | SRP245362 | human_dualTargetingParalog_libV3_annot.txt | LbCas12a | HAP1 | CHyMErA Dual-Targeting & Paralog Library | human_dualTargetingParalog_libV3_guides.fasta | dual-guide amplicon | non-treated |
| 4 | HH-82 | SRR10969651 | HumanParalogScreen_NovaSeq_20181109 | HAP1 Paralogue/Dual T18C | | Moffat_HH-82_S4_R1_001.fastq.gz | Moffat_HH-82_S4_R2_001.fastq.gz | 5db539774d53d93c41265f640babb6e4 | a8c3a9370d01582df728a7fdec0c47c1 | HH-82_counts.txt | ⋯ | 2020-02-04T00:00:00Z | GSM4284932 | SRP245362 | human_dualTargetingParalog_libV3_annot.txt | LbCas12a | HAP1 | CHyMErA Dual-Targeting & Paralog Library | human_dualTargetingParalog_libV3_guides.fasta | dual-guide amplicon | non-treated |
| 5 | HH-83 | SRR10969661 | HumanParalogScreen_NovaSeq_20181109 | RPE1 Paralogue/Dual T0 | | Moffat_HH-83_S5_R1_001.fastq.gz | Moffat_HH-83_S5_R2_001.fastq.gz | 2d4e3802336532ff2bbfaef25c1bb6cb | 88befb772811bef0b6bce537dd67cf49 | HH-83_counts.txt | ⋯ | 2020-02-04T00:00:00Z | GSM4284934 | SRP245362 | human_dualTargetingParalog_libV3_annot.txt | LbCas12a | RPE1 | CHyMErA Dual-Targeting & Paralog Library | human_dualTargetingParalog_libV3_guides.fasta | dual-guide amplicon | non-treated |
| 6 | HH-84 | SRR10969665 | HumanParalogScreen_NovaSeq_20181109 | RPE1 Paralogue/Dual T24A | | Moffat_HH-84_S6_R1_001.fastq.gz | Moffat_HH-84_S6_R2_001.fastq.gz | 249822c500ab5c6ff8dbb5e47606d5a6 | 43dc730005ac251ea6dbc95e8d3cd8d3 | HH-84_counts.txt | ⋯ | 2020-02-04T00:00:00Z | GSM4284934 | SRP245362 | human_dualTargetingParalog_libV3_annot.txt | LbCas12a | RPE1 | CHyMErA Dual-Targeting & Paralog Library | human_dualTargetingParalog_libV3_guides.fasta | dual-guide amplicon | non-treated |
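A left join silently fills unmatched rows with `NA`, so it can help to check that every key found a partner and that the row count did not change. The sketch below illustrates this on toy data. It uses base R so that it runs without extra packages, with `merge(..., all.x = TRUE)` standing in for `left_join()`. The `HH-XX` sample and `SRR00000000` run are made up to show what an unmatched key looks like.

```r
# Toy stand-in for the run-to-sample mapping; the last row is a made-up unmatched key
mapping <- data.frame(
  Sample.ID = c("HH-79", "HH-80", "HH-XX"),
  Run = c("SRR10969645", "SRR10969649", "SRR00000000"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the GEO sample key
sample_key <- data.frame(
  Sample.ID = c("HH-79", "HH-80"),
  Sample.Name = c("HAP1 Paralogue/Dual T0", "HAP1 Paralogue/Dual T18A"),
  stringsAsFactors = FALSE
)

# Sample IDs in the mapping that have no partner in the sample key
unmatched_ids <- setdiff(mapping$Sample.ID, sample_key$Sample.ID)

# Base R equivalent of a left join on Sample.ID
joined <- merge(mapping, sample_key, by = "Sample.ID", all.x = TRUE)

# With a unique key on the right-hand table, the row count should be unchanged
same_row_count <- nrow(joined) == nrow(mapping)

# Unmatched rows carry NA in the columns that came from the right-hand table
n_missing <- sum(is.na(joined$Sample.Name))
```

Applied to the real tables, a non-empty `unmatched_ids` or a changed row count would suggest a typo in the manual mapping or duplicated keys in one of the metadata files.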


6. Write sample metadata to file.

In [6]:
write.table(sample_metadata, 'PRJNA603290_GSE144281_sample_metadata.tsv', sep = "\t", row.names = FALSE, quote = FALSE)
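Since the file is written without quoting, one optional check is to read it back and confirm that the shape and column names survive the round trip. The following sketch does this on a toy data frame and a temporary file, so it does not touch the real output. Reading with `quote = ""` is an assumption chosen to mirror `quote = FALSE` on the write side.

```r
# Toy stand-in for sample_metadata
toy <- data.frame(
  Sample.ID = c("HH-79", "HH-80"),
  Sample.Name = c("HAP1 Paralogue/Dual T0", "HAP1 Paralogue/Dual T18A"),
  stringsAsFactors = FALSE
)

# Write with the same options used for the real metadata file
out_file <- tempfile(fileext = ".tsv")
write.table(toy, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read back as tab-delimited text with quoting disabled
reread <- read.delim(out_file, quote = "", stringsAsFactors = FALSE)

# Compare dimensions, column names, and one free-text column
round_trip_ok <- identical(dim(reread), dim(toy)) &&
  identical(names(reread), names(toy)) &&
  identical(reread$Sample.Name, toy$Sample.Name)
```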