CeLAuth is a collection of Perl scripts that allow to authenticate cell lines from ChIP-seq experiments.
CeLAuth requires Perl, Java and R
- Parallel::ForkManager
- Sort::Key::Natural
- Statistics::Basic
To install perl modules:
cpan Parallel::ForkManager Sort::Key::Natural Statistics::Basic
- ggplot2
- gplots
- RColorBrewer
- gtools
- tidyr
- dplyr
- grid
- gridExtra
To install R packages on an R session:
R
install.packages(c("ggplot2",“gplots”, “RColorBrewer”, "gtools", "tidyr", "dplyr", "grid", "gridExtra"))
Then you can download and install the latest release:
git clone --recursive https://github.com/bdolmo/CeLAuth.git
The whole pipeline is written to be modular in design. This means that each step can be executed independently as long as the required input files are available
This can be achieved by executing the script alignReads.pl
. This script makes use of an standard pipeline for mapping short reads based on BWA-MEM to generate aligned files (BAM), sorting and index creation through SAMtools and removal of PCR duplicates with Picard.
perl alignReads.pl -i <fastq_input_dir> -o <output_dir> -g <reference_fasta> -r <hg19/hg38> -t <CPUs>
Where:
-i,--input STRING Input directory with gzipped FASTQ files (fq.gz or fastq.gz)
-o,--outdir STRING Output directory where all BAM files will be generated
-g,--genome STRING Reference genome in FASTA format
-r,--reference STRING Genome version. Choose between [hg19, hg38] (default = hg19)
-m,--mode STRING Alignment mode. Choose between [dna, rna] (default = dna)
-t,--threads INTEGER Number of CPUs to be used at mapping (default = 4)
BWA/STAR require a genome index file located at the same directory as the reference genome. When executing alignReads.pl, if BWA index is absent, a prompt message will appear asking the user for its creation (here shown for BWA)
Press ‘y’ to create a BWA index or press ‘n’ to cancel the operation
If accepted, a new genome index will be generated and then the FASTQ alignment will be resumed.
Next to the mapping phase we will perform the variant calling for every BAM. We will use callSNV.pl
which wraps FreeBayes variant caller. FreeBayes is a popular tool that is commonly used to detect germline and somatic mutations in a wide variety of sequencing assays.
perl callSNV.pl -i <bam_input_dir> -o <output_dir> -t <CPUs> -r <reference_fasta>
Where:
-i,--input STRING Input directory with BAM files
-o,--outdir STRING Output directory where all VCF files will be generated
-g,--genome STRING Reference genome in FASTA format
-t,--threads INTEGER Number of CPUs to be used at var calling (default = 4)
In this phase we will extract all genomic segments that have a minimum coverage for each of the samples being analyzed. The coverage metric, refers to the total number of times a genomic position has been sequenced or the total number of reads that overlap that genomic position. In our preliminary analysis we observed that a minimum coverage of 3X was enough to get reliable results. This script uses fast coverage extractions using mosdepth
To extract the coverage:
perl extractCoverage.pl -i <bam_input_dir> -o <output_dir> -r <hg19/hg38> -n <normalization_factor> -m <minimum_coverage> -t <CPUs>
Where:
-i,--input STRING Input directory with BAM files
-o,--outdir STRING Output directory where all COVERAGE.bed files ending with *COVERAGE.bed will be generated
-r,--reference STRING Genome version (choose: hg19 or hg39, default = hg19)
-n,--normfactor INTEGER Normalization factor (default = 10e7)
-m,--mincov INTEGER Minimum coverage required (default = 3)
-t,--threads INTEGER Number of CPUs to be used at coverage extraction (default = 4)
Now, we can plot a histogram of normalized counts at protocadherin promoters by using profilePcdh.pl
as follows:
Get normalized counts at protocadherin promoters and create per sample histograms
perl profilePcdh.pl -i <dir_with_norm_bed_gz> -o <output_dir> -e <bed_regions> -n <normalization_factor> -r <hg19/hg38>
Where:
-i,--input STRING Input directory with *.normalized.bed.gz files
-o,--outdir STRING Output directory
-e,--regions STRING BED regions to extract counts
-n,--normfactor INT Normalization factor (default=10e7)
-r,--reference STRING Genome version. Choose between [hg19, hg38] (default = hg19)
Instead of using all shared regions between the whole dataset, we preferred to compare each pairwise sample coverage profile to identify common regions. This comes at a cost of higher computational tasks, but offers the possibility to pool samples with greater differences at both sequencing characteristics or antibody class.
For this step we will use the script extractCommonRegions.pl
:
perl extractCommonRegions.pl -i <coverage_dir> -t <CPUs>
Where:
-i,--input STRING Input directory with files ending with *COVERAGE.bed
-t,--threads INTEGER Number of CPUs to be used (default = 4)
Output files ending with the suffix *shared.bed will be released at the same input directory where all input files ending with *COVERAGE.bed were found.
The last analytical step consists on calculate the relatedness of every sample from their VCF files. Each pairwise is restricted over the shared sequenced regions identified previously. We will use SNVs exclusively, since INDEL lower genotyping robustness may introduce artifacts.
perl correlateGenotypes.pl -v <input_vcf_dir> -b <input_shared_bed_dir> -r <hg19_hg38> -o <output_dir> -t <CPUs>
Where:
-v,--vcf_dir STRING Input directory with VCF files
-b,--bed_dir STRING Input directory where with BED files ending with *shared.bed
-o,--outdir INTEGER Output directory
-r,--reference STRING Genome version (choose: hg19 or hg39, default = hg19)
-t,--threads INTEGER Number of CPUs to be used (default = 4)
The output matrix is written on a Tab-Separated Values (TSV) file.
we have written a Perl script that takes as input the matrix in TSV format and produces a heatmap using gplots.
perl plotHeatmap.pl -i <INPUT_TSV> -o <OUTPUT_NAME>
Where:
-i STRING Input TSV file with sample genotype correlations
-o STRING Output name of the resulting PNG image
Here we show and example plot from the test samples included in the project. We can observe that samples 3,9 and 10 can be clustered together, and also samples 6,7 and 8, whereas samples 1,2,4 and 5 can be classified as unrelated.