src
: Contains source code used to perform the analysis in the publication.doc
: Relevant documentation to help reproduce the analysis. Includes aNavarro_2020_datasets_metadata.xlsx
table with metadata required to perform the analysis. Additionally, a set of tablesdatasets_ebi.tsv
,datasets_se.tsv
anddatasets_pe.tsv
is available to run with the Nextflow pipeline.data
: Files used in the analysis that are not a byproduct of its execution (for instance, bed annotations from other publications), or that may be useful per se, like peak annotations.sample
: A stubnextflow.config
file to use as a base to reproduce the primary analysis locally.
After running some of the analyses, you will end up with extra directories:
intermediate
: Files generated in the primary analysis that are required for downstream results (main figures of the publication). This includes: bigWig, bam, binned score files, called peaks. Note: Alternatively to running the primary analysis yourself, pre-generated bigwig files and other outputs are available at the data repository attached to this publication.figures
: Output directory where the publication results go.
This data analysis has been run under a HPC environment (UPPMAX) that runs CentOS Linux 7 (Core) and SLURM scheduling. The provided code here is runnable on a local computer. However, the primary analysis can be computationally expensive, so you can download the generated bigwig files instead (these will be available upon publication).
Tools used and versions:
nextflow (version 19.07.0 build 5106)
bowtie2 (v 2.3.5.1)
deepTools (v 3.1.0)
samtools (v 1.10)
Picard (2.20.4)
sratools (v 2.9.6-1)
MACS2 (v 2.1.2)
bwtool (v1.0)
bedtools (v2.27.1)
matplotlib (v 3.1.1)
BEDOPS (v 2.4.38)
intervene (v 0.6.4)
R (v 3.6.0)
R packages required:
rtracklayer
GenomicRanges
ggplot2
reshape2
eulerr
ggpubr
purrr
effsize
pheatmap
The software provided here is not a package and it's therefore not intended for installation. It relies heavily on the standard wide-used tools mentioned above and can be run as long as the required tools are installed. It is available for transparency and reproducibility reasons.
A Nextflow pipeline is provided to re-run the primary analysis. If you want to run it
yourself, you need to fill out the nextflow.config
file with paths to bowtie
reference index for mm9 and reference genome.
Then you should be able to run it using nextflow:
nextflow run public_chip.nf -c nextflow.config --acclist doc/datasets_pe.tsv
Note: There are currently three versions of this nextflow pipeline, for running paired-end data, single-end data and
single-end data coming from another sequencing repository. The other versions of the pipeline run with the table in doc
that has its matching name as --acclist
, for example:
nextflow run public_chip_se.nf -c nextflow.config --acclist doc/datasets_se.tsv
Running the nextflow pipeline will by default output everything on the intermediate
directory.
This can be used to run the rest of the analyses available under src
.
Note: Alternatively to running the primary analysis yourself, pre-generated bigwig files and other outputs, such as IGV sessions with the most relevant tracks, bed annotations and documentation are available at the data repository attached to this publication.
A few more relevant intermediate files (peaks calling and aggregation, annotation) are processed
through the corresponding sh
scripts.
You'll need the accession file list (as provided in the supplementary data of
this publication) and fill out corresponding nextflow.config
values (an example
nextflow.config
file under sample
):
params {
base_dir = './'
acclist = 'acclist.csv'
bw_index = 'path to bowtie index of mm9 genome'
refdir = 'path to mm9 reference genome'
genomesize = 2150570000
max_threads = '12'
binsize = 5000
}
profiles {
standard {
process.executor = 'local'
}
}
This study analyses data from many genomics studies. Relevant metadata needed for its download
and processing through the pipeline will be available under doc
. The main pipeline
is public_chip.nf
, and there are two variants public_chip_se.nf
(for the processing of
single-ended libraries, specifies on the datasets_se.csv
file) and public_ebi.nf
(for
downloading files that are in EBI database instead).
Intermediate data output from each dataset will be under intermediate/data_type/dataset_id
,
for instance: intermediate/bw/Deaton_2016
.
This workflow processes lines in the accession list one by one. You can check the kind of output provided just by by providing only the first couple of lines in the .csv accession list.
Nextflow will generate an intermediate
folder with corresponding subfolders
per dataset:
bw
: Bigwig files.bins
: Bed files with 5kb bins signal values.bam
: Bowtie produced alignments.peaks
: Peaks called for some of the samples.bed
: IAP annotation generated by the consensus run.
It will also generate logging information on the pipeline run.
Expected running times rely heavily on the type of machine, sizes of datasets, internet connection and queuing status in case of running this on a HPC system.
Code used to generate the figures on the publication is also available in src
.
Additionally, data repository attached to this publication contains pre-generated
bigwig files and also for some of the figures tables with either bin values,
bed file mean coverage values, or summarized coverage values. When these tables
are used to reproduce a plot, corresponding r
scripts state so.
Overall:
- Peaks were called using MACS2 on broadPeak with qval = 0.05 for each replicate.
The aggregated peak set consists of loci that appear in at least 2 out of the 3
replicates. These are the
reliable
groups available underpeaks/Deaton_2016/aggregated
. - Peak density heatmaps were calculated using
deepTools computeMatrix
. The precomputed matrix was then visualized bypeak_density_annotated.py
script. - Average heatmaps were calculated using
rtracklayer
andGenomicRanges
libraries in R. Corresponding scripts are found per-figure insrc
. - Bin-based plots also calculated in R and scripts for this are provided.
- Profile plots were calculated by a customized version of ngsplot that
calculates Y axes as 1x FPGC (Fragments per Genome Content) coverage. Very similar results
can also be obtained by seqplots providing the corresponding bigwig and bedfiles, where
reads were extended to match fragment size in paired-end datasets. If you want to re-run
the customized version of ngsplot, you need to download their original source code and
replace the relevant files with the ones provided. Our customized version includes several
modes of calculating Y axis. The parameter you need to set is
-M relative
, as opposed tostandard
which reproduces ngsplot usual behavior. - IAP consensus dataset was calculated integrating several datasets
src/run_iap_consensus.sh
. This requires the BAM files to be calculated. - Venn intersection overlap was calculated by
intervene
, and size proportional euler diagrams using R packageeulerr
.