This repository contains the active Snakemake workflow for preparing HeLa G4 reference sets, quantifying RNA-seq time-series data, and testing whether promoter-proximal G4 loci are associated with transcriptional state and UV-response behavior.
The current workflow entry point is workflow/Snakefile. Run Snakemake with -s workflow/Snakefile from the repository root.
The active DAG includes all of the modules below:
- Reference-genome preparation for
hg38 - OQS import, liftOver, filtering, and structure annotation
- GC-rich genomic background sampling and structure annotation
- G4 ChIP-seq peak calling and structure-aware peak preparation
- G4 CUT&Tag peak calling and structure-aware peak preparation
- Merging G4 ChIP-seq and CUT&Tag loci into a union G4 set
- RNA-seq time-series import from Salmon, gene-level matrices, DESeq2, and QC
g4_tss: promoter/TSS transcription analysis (Tasks 1-8)g4_tss_atac_split: ATAC-filtered versions of the TSS analysis fornoUV,15m,30m, and60m
The active workflow consumes the following configuration blocks in config/config.yaml:
genomeoqsrna_seqg4_tss_atac_split
Key configured inputs currently used by the workflow:
hg38primary assembly, GENCODE v48 annotation, ENCODE blacklist, RepeatMasker, Umap/Bismap, andhg19tohg38liftOver chain- strand-specific OQS BED files for
KandPDSconditions - local paired-end FASTQs for G4 ChIP-seq and G4 CUT&Tag
- Salmon quantifications in
resources/rna-seq/SU_* - RNA-seq bigWigs
merged_t00.bw,merged_t12.bw,merged_t30.bw,merged_t60.bw - ATAC peak BED files for
noUV,15m,30m, and60m
workflow/rules/prepare_genome.smk downloads and prepares shared hg38 resources:
- genome FASTA
- Bowtie2 index
- FASTA index
- genome BED
- ENCODE blacklist BED
- RepeatMasker BED
- mappability BED
- GENCODE v48 GTF
hg19tohg38liftOver chain
The workflow builds four complementary G4-related datasets:
workflow/rules/g4p_chip_seq.smk: aligns G4 ChIP-seq FASTQs, filters BAMs, calls MACS2 peaks, and annotates structure-supported lociworkflow/rules/g4p_cut_tag.smk: does the same for G4 CUT&Tagworkflow/rules/g4chip_g4cuttag_merge.smk: merges prepared ChIP-seq and CUT&Tag intervals into a non-redundant union BEDworkflow/rules/prepare_oqs.smk: downloads OQS BEDs, standardizes fields, lifts tohg38, filters blacklist overlaps, combines strands, and annotates structure class
workflow/rules/gc_rich_bg.smk samples GC-rich windows, removes blacklist and OQS overlaps, and applies the same structure-aware annotation used for G4 datasets.
Current parameters in the rule file:
- tile size:
150 bp - GC threshold:
0.28 - requested sample size:
20000 - random seed:
42
workflow/rules/rnaseq.smk discovers Salmon quantifications matching ^SU_(?P<rep>[123])(?P<time>00|12|30|60)$, then:
- builds a sample manifest
- downloads GENCODE v35 for RNA-seq gene models
- derives
tx2geneand gene-annotation tables - imports transcript quantification to gene level
- writes counts, TPM, lengths, and normalized-count matrices
- runs omnibus DESeq2 likelihood-ratio testing across time
- runs all pairwise DESeq2 contrasts across
0,12,30,60 - generates PCA and sample-distance QC outputs
- writes a markdown summary of the RNA-seq branch
The current RNA-seq summary in results/rnaseq/summary.md reports:
12samples across0,12,30,6017064genes retained after prefiltering5741significant genes in the omnibus time-course LRT at adjustedp <= 0.05
workflow/rules/g4_tss.smk answers the question: are promoter-proximal G4s linked to active transcription and UV response?
It performs eight linked analyses:
- extracts canonical TSSs from the RNA-seq GTF and expands them to
±500 bpwindows - classifies promoter windows into
G4_TSS,GC_bg_TSS, orNo_overlap - summarizes baseline expression by group from TPM and normalized counts
- tests expression differences across groups and generates violin plots plus group BED files
- generates RNA-seq TSS metaprofiles with deepTools
- tests G4 enrichment across expression deciles
- tests UV-response behavior by promoter group using DESeq2 LRT and pairwise contrasts
- tests G4 structure-class enrichment across expression classes and assembles an HTML report
workflow/rules/g4_tss_atac_split.smk reuses the shared TSS annotation and reruns the promoter analysis after filtering merged G4 loci by ATAC peak overlap.
Configured analyses:
noUVusesmerged_t00.bwand baseline expression only15muses ATAC peaks at15m, RNA-seq expression at12 min, and the0_vs_12DESeq2 contrast30muses ATAC peaks at30m, RNA-seq expression at30 min, and the0_vs_30contrast60muses ATAC peaks at60m, RNA-seq expression at60 min, and the0_vs_60contrast
Each timepoint produces a matched set of annotation tables, expression summaries, decile analysis outputs, structure-enrichment outputs, metaprofiles, and an HTML report. UV-response plots are produced for 15m, 30m, and 60m; the noUV report omits UV-response sections.
The current results support a strong link between promoter-proximal G4 overlap and active transcription in this HeLa UV-response dataset.
- The RNA-seq branch analyzes
12samples across0,12,30, and60minutes, retains17064genes after prefiltering, and detects5741significant time-responsive genes in the omnibus DESeq2 LRT. - In the main
g4_tssanalysis,9637genes are classified asG4_TSS, compared with3530No_overlapgenes and only15GC_bg_TSSgenes. Baseline expression is highest in theG4_TSSgroup (mean10.04, median10.22) versusNo_overlap(mean7.52, median7.37) andGC_bg_TSS(mean8.22, median8.58). - Baseline expression differences are statistically strong:
G4_TSSexceedsNo_overlapin the one-sided Mann-Whitney test (p_adj = 0, rank-biserialr = -0.49) and also exceedsGC_bg_TSSdespite the very small GC-rich control set (p_adj = 0.00446). - G4 promoter overlap rises sharply with expression rank. The fraction of genes with promoter G4 overlap increases from
24.3%in decile1to86.0%in decile10, while GC-rich control overlap stays near background (~0.2-0.7%). The decile trend is strong for merged G4s (Spearman rho = 0.903,p = 3.44e-4) and absent for GC-rich controls. - UV responsiveness is also enriched among promoter-G4 genes. At
lrt_padj <= 0.05,3887/9637(40.3%)G4_TSSgenes are significant versus919/3543(25.9%) in all non-G4_TSSgenes (odds ratio = 1.93, Fisherp = 2.94e-54). - Directionality is mixed rather than uniformly induced: among
G4_TSSgenes inpG4_DEG_intersect.tsv, the largest response classes arerepressed_sustained(1035genes),induced_sustained(461),repressed_transient(416), andinduced_transient(214), with an additional2677genes significant only in the omnibus LRT. - Promoter G4 strength is associated with UV-response significance. The
max_signalstratification shows a significant association between strength bin and LRT significance (chi-square p = 1.50e-5), and continuous G4 signal is associated with stronger LRT significance (Spearman p = 1.34e-6). - Structure-class composition is broadly stable across expression classes. No structure reaches
FDR < 0.05in theQ4_vs_Q1enrichment tests, althoughtwoTetrads,loop1_3, andloop4_5show nominal upward trends in the highest-expression class. - The ATAC-restricted follow-up preserves the same baseline pattern. Filtering by G4-center overlap with ATAC peaks retains
54.96%of merged G4 intervals innoUV,36.92%at15m,57.61%at30m, and68.66%at60m, and theG4_TSSgroup still has the highest baseline expression in every split (mean10.03-10.10vs8.51-9.02forNo_overlap). - Expression-decile enrichment becomes clearest in the later accessible-G4 analyses: the merged-G4 decile correlation is weak in
noUVand15m, but significant at30m(rho = 0.733,p = 0.0158) and60m(rho = 0.818,p = 0.00381). - UV-response enrichment also persists after ATAC filtering. The fraction of significant LRT genes in accessible
G4_TSSpromoters is40.4%at15m,40.2%at30m, and40.7%at60m, compared with34.7%,33.3%, and31.5%in the correspondingNo_overlapsets.
results/g4chip/
g4_hela_peaks.narrowPeak,g4_hela_peaks.xls,g4_hela_summits.bedg4_hela_peaks.bedg4_hela_peaks_prepared.bedg4_hela_peaks_prepared.tsvg4_hela_peaks_prepared_center.bed
results/g4cuttag/
g4_hela_peaks.narrowPeak,g4_hela_peaks.xls,g4_hela_summits.bedg4_hela_peaks_cuttag.bedg4_hela_peaks_prepared.bedg4_hela_peaks_prepared.tsvg4_hela_peaks_prepared_center.bed
results/g4chip_g4cuttag/
g4_hela_chip_cuttag_merged.bed
results/oqs/
- strand-specific downloaded and lifted intermediates for
plus/minusandK/PDS - final merged sets:
oqs_K_prepared.bedoqs_K_prepared.tsvoqs_K_prepared_center.bedoqs_PDS_prepared.bedoqs_PDS_prepared.tsvoqs_PDS_prepared_center.bed
results/gc_rich_bg/
gc_rich_bg_sampled.bedgc_rich_bg_sampled_no_oqs.bedgc_rich_bg_prepared.bedgc_rich_bg_prepared.tsvgc_rich_bg_prepared_center.bedgc_rich_bg_selection_summary.tsv
results/rnaseq/metadata/
sample_manifest.tsvtx2gene.tsvgene_annotation.tsvtximport_mapping_summary.tsv
results/rnaseq/matrices/
gene_counts.tsv.gzgene_tpm.tsv.gzgene_lengths.tsv.gznormalized_counts.tsv.gz
results/rnaseq/deseq2/
time_lrt_results.tsv.gztime_lrt_significant.tsv.gzpairwise/0_vs_12_results.tsv.gzpairwise/0_vs_12_significant.tsv.gzpairwise/0_vs_30_results.tsv.gzpairwise/0_vs_30_significant.tsv.gzpairwise/0_vs_60_results.tsv.gzpairwise/0_vs_60_significant.tsv.gzpairwise/12_vs_30_results.tsv.gzpairwise/12_vs_30_significant.tsv.gzpairwise/12_vs_60_results.tsv.gzpairwise/12_vs_60_significant.tsv.gzpairwise/30_vs_60_results.tsv.gzpairwise/30_vs_60_significant.tsv.gzpairwise/pairwise_significant_gene_counts.tsv
results/rnaseq/qc/
vst_pca.tsv,vst_pca.png,vst_pca.pdfsample_distance.tsv,sample_distance.png,sample_distance.pdf
Top-level RNA-seq outputs:
results/rnaseq/summary.md
results/g4_tss/
- annotation and promoter windows:
canonical_tss.bedcanonical_tss_windows_1kb.bedgene_name_table.tsvwindows_g4_intersect.bedwindows_gc_bg_intersect.bedtss_group_annotation.tsv
- baseline expression outputs:
baseline_tpm.tsvbaseline_tpm_summary.tsvbaseline_normalized_counts.tsvgene_expression_by_group.tsv
- statistics and visualizations:
expression_group_statistics.tsvexpression_group_fisher.tsvexpression_group_summary.tsvexpression_violin_by_group.pdftss_G4_TSS.bedtss_GC_bg_TSS.bedtss_No_overlap.bedrnaseq_tss_matrix.gzrnaseq_tss_metaprofile.pdf
- decile analysis:
expression_deciles.tsvdecile_overlap_fractions.tsvdecile_correlation_stats.tsvdecile_enrichment_plot.pdf
- UV-response analysis:
uv_group_lrt_summary.tsvuv_group_fold_change_stats.tsvuv_fold_change_by_group.pdfuv_volcano_by_group.pdf
- structure enrichment:
g4_structure_by_expression_class.tsvstructure_class_enrichment_stats.tsvstructure_class_by_expression_class.pdf
- final report:
g4_tss_transcription_report.htmlsoftware_versions.txt
results/g4_tss_atac_split/{timepoint}/ for noUV, 15m, 30m, 60m
Shared outputs at every timepoint:
g4_atac_center_filtered.bedg4_atac_center_filter_summary.tsvwindows_g4_intersect.bedtss_group_annotation.tsvbaseline_tpm.tsvbaseline_tpm_summary.tsvbaseline_normalized_counts.tsvgene_expression_by_group.tsvexpression_group_statistics.tsvexpression_group_fisher.tsvexpression_group_summary.tsvexpression_violin_by_group.pdftss_G4_TSS.bedtss_GC_bg_TSS.bedtss_No_overlap.bedrnaseq_tss_matrix.gzrnaseq_tss_metaprofile.pdfexpression_deciles.tsvdecile_overlap_fractions.tsvdecile_correlation_stats.tsvdecile_enrichment_plot.pdfg4_structure_by_expression_class.tsvstructure_class_enrichment_stats.tsvstructure_class_by_expression_class.pdfg4_tss_atac_split_report.htmlsoftware_versions.txt
Additional UV-response outputs for 15m, 30m, and 60m:
uv_group_lrt_summary.tsvuv_group_fold_change_stats.tsvuv_fold_change_by_group.pdfuv_volcano_by_group.pdf
Prepared G4-like datasets such as *_prepared.tsv use the structure annotation generated by workflow/scripts/prepare_external_g4_dataset.py and workflow/scripts/regex_structures_common.py.
The TSV schema is:
chrom start end name source_signal strand structure_start structure_end structure_center structure source_dataset
Current structure labels include:
loop1_3loop4_5loop6_7longLoopsimpleBulgecomplexBulgetwoTetrads
Rows without a supported structure match are dropped during preparation.
rule all is defined through workflow/rules/common.smk. It currently requests:
- core prepared G4 and background outputs
- the main
g4_tssreport and representative intermediate deliverables - one
g4_tss_atac_split_report.htmlper configured ATAC timepoint
This means a default run builds the full RNA-seq branch plus both promoter-analysis branches, not just the original G4 preparation steps.
config/
config.yaml
slurm/config.yaml
logs/
cluster/
g4_tss/
g4_tss_atac_split/
rnaseq/
resources/
ref_genomes/
rna-seq/
samples/
results/
g4chip/
g4cuttag/
g4chip_g4cuttag/
gc_rich_bg/
g4_tss/
g4_tss_atac_split/
oqs/
rnaseq/
workflow/
Snakefile
envs/
rules/
scripts/
Software used by the active workflow:
- Snakemake
- Conda or Mamba
- Bowtie2
- samtools
- bedtools
- MACS2
- UCSC
liftOver - deepTools
wget,curl,gunzip,tar,awk,csplit- R with the packages required by
workflow/envs/rnaseq_deseq2.yaml
If you use the bundled SLURM profile, Snakemake also needs the cluster-generic executor plugin.
Dry-run:
snakemake -s workflow/Snakefile -n -pLocal execution:
snakemake -s workflow/Snakefile --cores 8 --use-conda -pSLURM execution with the bundled profile:
snakemake -s workflow/Snakefile --profile config/slurmBefore cluster execution, review config/slurm/config.yaml and update site-specific defaults such as account, partition, qos, and memory settings.
Run a single branch target:
snakemake -s workflow/Snakefile --use-conda results/g4_tss/g4_tss_transcription_report.htmlRun one ATAC-split report:
snakemake -s workflow/Snakefile --use-conda results/g4_tss_atac_split/30m/g4_tss_atac_split_report.htmlUnlock after an interrupted run:
snakemake -s workflow/Snakefile --unlockShow a summary of tracked outputs:
snakemake -s workflow/Snakefile --summary