# ATAC-seq Data Analysis

---

## Quest 1: subset aligned bams and normalize by sample

The goal is for each sample (not each replicate) to have the same number of paired-end (PE) reads.

1. Start with the BAMs produced by `bwa` and keep only properly paired reads.

2. Find the minimum number of properly paired reads across all samples, and use that number to subset the BAMs.

**PERTINENT SCRIPTS:**

`filter_bam.py`: Filter a paired-end BAM file so that a pair is kept when either read 1 or read 2 has MAPQ >= threshold. BAMs must be sorted by name, with read 1 and read 2 sharing the same qname and sitting right next to each other.

`filter_bam.sh`: First filter the original \*.sorted.bam to include only properly paired alignments (`-f 2`). Then sort the BAMs by name (`samtools sort -n`). Final saved BAMs are called `*.PE.mapq20.subset.bam`.
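
The pair-level rule described for `filter_bam.py` can be sketched in `awk`. This is only an illustration, not the contents of `filter_bam.py`: the records are made up, they mimic just the first five SAM columns (QNAME, FLAG, RNAME, POS, MAPQ), and the sketch assumes name-sorted input in which mates sit on consecutive lines.

```bash
#!/usr/bin/env bash
# Toy name-sorted records (hypothetical values): QNAME FLAG RNAME POS MAPQ
records='readA 99 chr1 100 60
readA 147 chr1 180 12
readB 83 chr2 500 5
readB 163 chr2 420 7
readC 99 chr3 900 20
readC 147 chr3 960 20'

threshold=20

# Buffer the first mate; when the second mate arrives, print both
# if at least one of them has MAPQ >= threshold.
kept=$(printf '%s\n' "$records" | awk -v t="$threshold" '
    NR % 2 == 1 { first = $0; name = $1; q1 = $5 + 0; next }
    $1 == name && (q1 >= t + 0 || $5 + 0 >= t + 0) { print first; print }')

echo "$kept"
```

Here `readA` survives because one mate passes, `readB` is dropped because neither does, and `readC` sits exactly at the threshold.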

Output files:
- `ATAC*.sorted.bam`: Alignment files produced from `bwa`. **Initial bam.**
- `ATAC*.PE.bam`: Properly paired alignments sorted by chr
- `ATAC*.PE.sortbyname.bam`: Properly paired alignments sorted by name
- `ATAC*.PE.filtered.bam`: Properly paired alignments with mapq > threshold (20)
- **`ATAC*.PE.mapq.bam`**: Properly paired alignments with **mapq > threshold** and **sorted by chr**. **Final output.**

**SUBSAMPLING BAMS:**

The goal is for every sample to have the same number of paired-end reads with MAPQ above the threshold. Read counts can differ between the two replicates of a sample, so the target is set per sample: based on the total number of alignments in each `*.filtered.bam`, each sample gets `14,000,000` reads. Most BAMs (A1-A6, A9, A10) contribute `7,000,000` per replicate; the exceptions are `A7: 6,700,000`, `A8: 7,300,000`, `A11: 7,200,000`, and `A12: 6,800,000`.

- The subsampled BAMs are `ATAC*.PE.subset.bam`.
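
The per-sample target can be derived mechanically from per-replicate counts. The sketch below is a minimal illustration with made-up counts; it also assumes that replicates are numbered consecutively (ATAC1 and ATAC2 form one sample, ATAC7 and ATAC8 another), which is inferred from the numbers above rather than stated anywhere.

```bash
#!/usr/bin/env bash
# Toy input (hypothetical values): replicate name and usable read count.
counts='ATAC1 7400000
ATAC2 7900000
ATAC7 6700000
ATAC8 7300000'

# Sum the two replicates of each sample, then take the smallest sample total.
target=$(printf '%s\n' "$counts" | awk '
    { n = substr($1, 5) + 0            # replicate number from "ATAC<n>"
      s = int((n + 1) / 2)             # assumed sample index: (1,2)->1, (7,8)->4
      total[s] += $2 }
    END { min = -1
          for (s in total) if (min < 0 || total[s] < min) min = total[s]
          print min }')

echo "$target"
```

Every sample is then subset down to this total, split across its replicates according to what each replicate can supply.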

In [7]:
cd /c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3

In [11]:
# Count "lines" in each BAM to decide how to subset it.
# NOTE: BAM is compressed binary, so `wc -l` counts newline bytes in the file,
# not alignment records; treat these numbers as a rough proxy only.
if [[ -e readCounts_of_PE_mapq_filtered_bam.txt ]] ; then
    truncate -s 0 readCounts_of_PE_mapq_filtered_bam.txt
fi
PE_mapq_bams=(*.PE.filtered.bam)
for p in "${PE_mapq_bams[@]}"; do
    wc -l "$p" >> readCounts_of_PE_mapq_filtered_bam.txt
done

In [6]:
# Count "lines" in each BAM to decide how to subset it.
# NOTE: BAM is compressed binary, so `wc -l` counts newline bytes in the file,
# not alignment records; treat these numbers as a rough proxy only.
if [[ -e readCounts_of_PE_mapq_bam.txt ]] ; then
    truncate -s 0 readCounts_of_PE_mapq_bam.txt
fi
PE_mapq_bams=(*.PE.mapq.bam)
for p in "${PE_mapq_bams[@]}"; do
    wc -l "$p" >> readCounts_of_PE_mapq_bam.txt
done
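
For an actual record count, the alignments have to be decoded to SAM text first. The helper below is a hypothetical sketch (the function name `count_sam_records` is not part of this project) that counts alignment records in SAM text while skipping `@` header lines; the toy input uses shortened, space-separated records for readability.

```bash
#!/usr/bin/env bash
# Count alignment records in SAM text on stdin, ignoring '@' header lines.
count_sam_records() {
    awk '!/^@/ { n++ } END { print n + 0 }'
}

# Toy SAM-like text (hypothetical): one header line and three records.
toy_sam='@HD VN:1.6 SO:queryname
readA 99 chr1 100 60
readA 147 chr1 180 60
readB 83 chr2 500 30'

n_records=$(printf '%s\n' "$toy_sam" | count_sam_records)
echo "$n_records"
```

With real data the text would come from something like `samtools view -h ATAC1.PE.mapq.bam | count_sam_records`; `samtools view -c` reports the record count directly.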

In [9]:
cat readCounts_of_PE_mapq_bam.txt | sort -k2 -V

6495432 ATAC1.PE.mapq.bam
6004665 ATAC2.PE.mapq.bam
6648471 ATAC3.PE.mapq.bam
6378458 ATAC4.PE.mapq.bam
5921502 ATAC5.PE.mapq.bam
8228211 ATAC6.PE.mapq.bam
5484665 ATAC7.PE.mapq.bam
5906854 ATAC8.PE.mapq.bam
11218517 ATAC9.PE.mapq.bam
7892896 ATAC10.PE.mapq.bam
6348925 ATAC11.PE.mapq.bam
5781403 ATAC12.PE.mapq.bam


In [None]:
# subsampling needs to work on ATAC*.PE.filtered.bam because we want to subsample by pairs
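
Why name-sorted input matters for pair-wise subsetting can be shown on toy data. The sketch below is not the logic of `filter_bam.sh`; it simply keeps both mates of the first `n_pairs` read names, using made-up records with only QNAME and FLAG columns. Taking the first pairs is not a random sample, so this only illustrates how mates stay together.

```bash
#!/usr/bin/env bash
# Toy name-sorted records (hypothetical): QNAME FLAG
records='readA 99
readA 147
readB 83
readB 163
readC 99
readC 147'

n_pairs=2   # hypothetical target number of pairs

# Count distinct read names as they stream by; a record is kept only while
# its read name is among the first n_pairs names, so mates are never split.
subset=$(printf '%s\n' "$records" | awk -v n="$n_pairs" '
    $1 != prev { seen++; prev = $1 }
    seen <= n  { print }')

echo "$subset"
```

For random subsampling on real BAMs, `samtools view -s` applies the same keep-or-drop decision to all records of a read pair, though it targets a fraction rather than an exact count.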

In [None]:
cat filter_bam.sh

---

In [5]:
samtools view ATAC1.PE.subset.bam | cut -f 1 | sort | uniq | wc -l
samtools view ATAC1.PE.subset.sam | cut -f 1 | sort | uniq | wc -l

3500000
3500000


In [54]:
samtools view ATAC11.PE.subset.bam | cut -f 1 | sort | uniq | wc -l
samtools view ATAC11.PE.subset.sam | cut -f 1 | sort | uniq | wc -l

3650000
3650000


In [57]:
rm toy.o7695883.*

---

## Quest 2: Peak calling (MACS2) for chromVAR

**INPUTS:**

- BAM files filtered to keep only properly paired reads that pass the mapping-quality filter (MAPQ of either read 1 or read 2 > 20), i.e. the `ATAC*.PE.mapq.bam` files. *(Note: the subsampled BAMs are not used, because `chromVAR` does its own normalization, so pre-normalizing is probably unnecessary.)*
- The chromVAR paper filters BAMs with MAPQ > 30.


**PEAK CALLING**

Peak calling should use narrow peaks, because narrow peaks give the summit coordinates. We can then use the summit coordinates to give all peaks a uniform width of 500 bp.

MACS2 parameters used to call narrow peaks:

```
macs2 callpeak -t $bam -n $SAMPLE_NAME \
	--format BAMPE \
	--gsize hs \
	--qvalue 0.05 \
	--cutoff-analysis \
	--keep-dup all \
	--outdir "/gpfs/commons/groups/sanjana_lab/cdai/TFscreen/atac/macs2/v7" \
	--bdg
```
    
_script:_ `../macs2/v7/atac_macs2_v7-pe.sh`

`NAME_peaks.narrowPeak` is a BED6+4 format file that contains the peak locations together with the `peak summit`, `p-value`, and `q-value`. It can be loaded into the UCSC genome browser. Selected columns are defined as follows:

- 1st: chromosome
- 2nd: start
- 3rd: end
- 4th: name. In this case peak names. 
- 5th: integer score for display. It's calculated as int(-10*log10pvalue) or int(-10*log10qvalue) depending on whether -p (pvalue) or -q (qvalue) is used as score cutoff. It's used for genome browser display.
- 7th: **fold-change** at peak summit
- 8th: **-log10pvalue** at peak summit
- 9th: **-log10qvalue** at peak summit
- 10th: **relative summit position to peak start**. Note it's **RELATIVE**
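
Because the 10th column is relative to the peak start, the absolute summit coordinate is `start + column 10`. A minimal sketch on a single made-up narrowPeak record (all values hypothetical):

```bash
#!/usr/bin/env bash
# One toy narrowPeak record, tab-separated (hypothetical values).
peak=$(printf 'chr1\t1000\t1600\tpeak_1\t250\t.\t8.5\t30.2\t25.1\t312\n')

# Absolute summit = peak start (col 2) + relative summit offset (col 10),
# written as a 1-bp BED interval.
summit=$(printf '%s\n' "$peak" | awk 'BEGIN { FS = OFS = "\t" }
    { s = $2 + $10; print $1, s, s + 1, $4 }')

echo "$summit"
```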

**ADJUSTING PEAK WIDTH**

Resize every peak to 500 bp centered on its summit, following the chromVAR recommendation for ATAC-seq.

bash commands below:

```
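# MACS2 summits are 1-bp intervals (end = start + 1), so padding both columns
# by 250 yields 501-bp, space-separated intervals; starts can go negative
# for summits within 250 bp of a contig start.
SUMMITS=(/c/groups/sanjana_lab/cdai/TFscreen/atac/macs2/v7/ATAC*summits.bed)
for summit in "${SUMMITS[@]}"; do
    SAMPLE_NAME=$(basename -s .bed "$summit" | sed -E 's/_summits//')
    awk '{print $1, $2 - 250 , $3 + 250, $4, $5}' "$summit" > "${SAMPLE_NAME}_500_Peaks.bed"
done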
```
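
If exactly 500 bp, tab-delimited intervals are wanted, one option is to compute both ends from the summit start alone and clamp the start at zero. This is a sketch on made-up summit records, not the command that produced the files in this notebook; note that a clamped interval is shorter than 500 bp.

```bash
#!/usr/bin/env bash
# Toy summits.bed records, tab-separated (hypothetical values).
summits=$(printf 'chr1\t1312\t1313\tpeak_1\t25.1\nchr1\t100\t101\tpeak_2\t5.0\n')

# Window = [summit - 250, summit + 250), with the start clamped at 0.
windows=$(printf '%s\n' "$summits" | awk 'BEGIN { FS = OFS = "\t" }
    { start = $2 - 250; if (start < 0) start = 0
      print $1, start, $2 + 250, $4, $5 }')

echo "$windows"
```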


In [94]:
cd /c/groups/sanjana_lab/cdai/TFscreen/atac/macs2/v7 # macs2 output path

In [95]:
# These are the input bam files
ls /c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC*mapq*bam

/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC10.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC11.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC12.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC1.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC2.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC3.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC4.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC5.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC6.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC7.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC8.PE.mapq.bam
/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC9.PE.mapq.bam


---

### Quest 3 - Manipulating peak files to make 500 width files

Use chromVAR to calculate deviation scores, a metric of chromatin accessibility.

Inputs:
- Peak files, 500 bp centered at summit. `ATAC*_500_Peaks.bed`
- Annotations:
    - Method A: use `JASPAR 2016` TF motif package. See chromVAR documentation.
    - Method B: use custom constructed promoter region annotation according to chromVAR documentation.
    
Scripts:
See the code in `ATAC-chromVAR-R.ipynb`, the `R` notebook for the chromVAR analysis.

In [152]:
ls ../../*.csv

../../ATACseqSampleName.csv  ../../samplesheet.csv


In [156]:
head -2 ../../samplesheet.csv

SampleID,Tissue,Factor,Condition,Treatment,Replicate,bamReads,ControlID,bamControl,Peaks,Peakcaller
A1,ESC,TF,ESC,Full,1,/c/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3/ATAC1.PE.mapq.bam,NA,NA,/c/groups/sanjana_lab/cdai/TFscreen/atac/macs2/v7/ATAC1_500_Peaks.bed,bed


In [221]:
pwd

/c/groups/sanjana_lab/cdai/TFscreen/atac


In [222]:
samtools view bams_v3/ATAC1.PE.mapq.bam | awk 'NR > 20398130 && NR < 20398135 {print NR, $0}'

20398131 NB501157:251:HG7FNBGX9:4:12406:1456:10474	83	chr8	141998714	60	37M	=	141998712	-39	CCCTACAGCTCCCAGGGCCCAGGCCAGCTCCACCTCC	EAEE/E/AE//EE/AAEAEEAEEE66EE6EEEAAAA/	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
20398132 NB501157:251:HG7FNBGX9:4:21407:10452:12148	99	chr8	141998724	60	37M	=	141998734	47	CCCAGGGCCCAGGCCAGCTCCACCTCCAGGCTTGCTC	AAAAAEAEAE/EAEAEAE/<AEAEA<EEE<EEEEAE/	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
20398133 NB501157:251:HG7FNBGX9:3:22411:11399:19754	83	chr8	141998734	60	37M	=	141998620	-151	AGGCCAGCTCCACCTCCAGGCTTGCTCCAAGTCCTTC	A<EEEEAEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
20398134 NB501157:251:HG7FNBGX9:4:21407:10452:12148	147	chr8	141998734	60	37M	=	141998724	-47	AGGCCAGCTCCACCTCCAGGCTTGCTCCAAGTCCTTC	EAAEEEEEEEEEEEAEEEEEEEEEEEEEAEEEAAAAA	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37


---

### Quest 4 Find differentially accessible peaks in the promoter region or gene body region

Inputs: differentially accessible peaks identified by DiffBind
- `diffbind_24H_significant_peaks.bed`
- `diffbind_5D_significant_peaks.bed`

Annotation: gencode promoter region and gene region bed files
- `Protein_coding_gene_promoters.bed`
- `Protein_coding_genes.bed`

Output: selected peaks that are intersected with annotation
- 5D promoter region: `diffbind_5D_significant_peaks_promoter.bed`
- 24H promoter region: `diffbind_24H_significant_peaks_promoter.bed`
- 5D gene body region: `diffbind_5D_significant_peaks_gene.bed`
- 24H gene body region: `diffbind_24H_significant_peaks_gene.bed`

In [8]:
cd /c/groups/sanjana_lab/cdai/TFscreen/atac

In [45]:
wc -l diffbind_*promo*.bed diffbind_*gene*.bed

    918 diffbind_24H_significant_peaks_promoter.bed
   9437 diffbind_5D_significant_peaks_promoter.bed
   3105 diffbind_24H_significant_peaks_gene.bed
  21070 diffbind_5D_significant_peaks_gene.bed
  34530 total


### Quest 5 - annotate consensus peak set with promoter region

Inputs: 
- Readcount matrix with peak coordinates.
`diffbind_consensu_min2overlap_readcounts_with_coordinates.txt`
- Promoter region bed file. `Protein_coding_gene_promoters.bed`

```
intersectBed -wo -f 0.55 \
             -a diffbind_consensu_min2overlap_readcounts_with_coordinates.bed \
             -b Protein_coding_gene_promoters.bed > diffbind_consensu_min2overlap_readcounts_promoterRegion.txt
```

Output:
- Filtered consensus peak set, with annotated gene names: `diffbind_consensu_min2overlap_readcounts_promoterRegion.txt`
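
The last column that `-wo` appends is the overlap in bp, and `-f 0.55` requires that overlap to be at least 55% of the peak (the `-a` interval). The arithmetic can be checked by hand; the sketch below recomputes it for two peak/feature pairs whose coordinates are copied from the promoter-region and gene-region outputs in this notebook.

```bash
#!/usr/bin/env bash
# Columns: chrom peak_start peak_end feature_start feature_end gene
pairs='chr1 999111 999611 999138 1001638 ISG15
chr1 966291 966791 966497 975865 PLEKHN1'

# Overlap = min(ends) - max(starts); fraction is relative to the peak length.
overlaps=$(printf '%s\n' "$pairs" | awk '
    { s = ($2 > $4) ? $2 : $4          # larger of the two starts
      e = ($3 < $5) ? $3 : $5          # smaller of the two ends
      ov = e - s
      printf "%s\t%d\t%.3f\n", $6, ov, ov / ($3 - $2) }')

echo "$overlaps"
```

The overlaps come out as 473 bp and 294 bp, matching the last column of the corresponding output rows, and both fractions exceed 0.55.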

In [52]:
head diffbind_consensu_min2overlap_readcounts_promoterRegion.txt
wc -l diffbind_consensu_min2overlap_readcounts_promoterRegion.txt

chr1	959088	959588	66	50	73	62	84	53	49	64	141	73	93	97	chr1	958809	961309	ENSG00000188976.11	NOC2L	-	500
chr1	959088	959588	66	50	73	62	84	53	49	64	141	73	93	97	chr1	958584	961084	ENSG00000187961.14	KLHL17	+	500
chr1	960230	960730	61	44	61	88	57	71	44	60	124	76	61	78	chr1	958809	961309	ENSG00000188976.11	NOC2L	-	500
chr1	960230	960730	61	44	61	88	57	71	44	60	124	76	61	78	chr1	958584	961084	ENSG00000187961.14	KLHL17	+	500
chr1	966291	966791	63	81	75	64	64	77	35	88	125	90	47	38	chr1	964497	966997	ENSG00000187583.10	PLEKHN1	+	500
chr1	999111	999611	69	36	78	49	61	60	37	50	96	66	47	83	chr1	999138	1001638	ENSG00000187608.10	ISG15	+	473
chr1	1000098	1000598	106	107	71	104	124	114	81	149	219	154	184	149	chr1	999672	1002172	ENSG00000188290.10	HES4	-	500
chr1	1000098	1000598	106	107	71	104	124	114	81	149	219	154	184	149	chr1	999138	1001638	ENSG00000187608.10	ISG15	+	500
chr1	1001589	1002089	76	97	67	65	88	102	62	72	134	98	73	46	chr1	999672	1002172	ENSG00000188290.10	HES4	-	500
chr1	1019256	101


### Quest 5 - annotate consensus peak set with gene region

Inputs: 
- Readcount matrix with peak coordinates.
`diffbind_consensu_min2overlap_readcounts_with_coordinates.txt`
- Gene region bed file. `Protein_coding_genes.bed`

```
intersectBed -wo -f 0.55 \
             -a diffbind_consensu_min2overlap_readcounts_with_coordinates.bed \
             -b Protein_coding_genes.bed > diffbind_consensu_min2overlap_readcounts_geneRegion.txt
```

Output:
- Filtered consensus peak set, with annotated gene names: `diffbind_consensu_min2overlap_readcounts_geneRegion.txt`

In [54]:
ls *.bed

diffbind_24H_significant_peaks.bed
diffbind_24H_significant_peaks_gene.bed
diffbind_24H_significant_peaks_promoter.bed
diffbind_5D_significant_peaks.bed
diffbind_5D_significant_peaks_gene.bed
diffbind_5D_significant_peaks_promoter.bed
diffbind_consensu_min2overlap.bed
diffbind_consensu_min2overlap_readcounts_with_coordinates.bed
Protein_coding_gene_promoters.bed
Protein_coding_genes.bed


In [56]:
head diffbind_consensu_min2overlap_readcounts_geneRegion.txt
wc -l diffbind_consensu_min2overlap_readcounts_geneRegion.txt

chr1	935787	936287	33	36	35	43	56	46	42	27	108	71	154	145	chr1	923928	944581	ENSG00000187634.12	SAMD11	+	500
chr1	940274	940774	57	65	62	61	76	83	49	77	112	96	167	123	chr1	923928	944581	ENSG00000187634.12	SAMD11	+	500
chr1	941555	942055	72	82	80	71	87	91	72	79	154	111	216	121	chr1	923928	944581	ENSG00000187634.12	SAMD11	+	500
chr1	944512	945012	40	42	61	43	69	40	29	57	147	94	282	232	chr1	944203	959309	ENSG00000188976.11	NOC2L	-	500
chr1	966291	966791	63	81	75	64	64	77	35	88	125	90	47	38	chr1	966497	975865	ENSG00000187583.10	PLEKHN1	+	294
chr1	975934	976434	81	62	85	77	68	75	68	50	117	94	99	40	chr1	975204	982093	ENSG00000187642.9	PERM1	-	500
chr1	999111	999611	69	36	78	49	61	60	37	50	96	66	47	83	chr1	998962	1000172	ENSG00000188290.10	HES4	-	500
chr1	1001589	1002089	76	97	67	65	88	102	62	72	134	98	73	46	chr1	1001138	1014540	ENSG00000187608.10	ISG15	+	500
chr1	1013263	1013763	45	27	33	38	43	61	21	46	48	31	34	32	chr1	1001138	1014540	ENSG00000187608.10	ISG15	+	500
chr1	1020372	1020872	82	93


---

## Misc 1 - custom script to count reads at promoter region 
`qsub` did not work reliably for this step, so the commands were run manually.

- First convert BAM to BED using `bamtobed` (not shown here).
- Then use `coverage` to count the reads in each peak, requiring at least 10% of each read to overlap the peak (`-F 0.1`). The code is below.

In [35]:
for i in {1..12}; do
    echo "bedtools coverage -counts -F 0.1 -a ../macs2/v6/ATAC${i}_peaks.broadPeak -b ATAC${i}_filtered.sortbychr.bed > ATAC${i}.counts.bed"
    bedtools coverage -counts -F 0.1 -a ../macs2/v6/ATAC${i}_peaks.broadPeak -b ATAC${i}_filtered.sortbychr.bed > ATAC${i}.counts.bed
done

bedtools coverage -counts -F 0.1 -a ../macs2/v6/ATAC1_peaks.broadPeak -b ATAC1_filtered.sortbychr.bed > ATAC1.counts.bed
GL000008.2	78	115	NB501157:251:HG7FNBGX9:2:13211:10341:6575/2	23	+

GL000008.2	78	115	NB501157:251:HG7FNBGX9:2:13211:10341:6575/2	23	+

bedtools coverage -counts -F 0.1 -a ../macs2/v6/ATAC2_peaks.broadPeak -b ATAC2_filtered.sortbychr.bed > ATAC2.counts.bed
GL000008.2	78	115	NB501157:251:HG7FNBGX9:2:12307:24914:17792/1	40	+

GL000008.2	78	115	NB501157:251:HG7FNBGX9:2:12307:24914:17792/1	40	+

bedtools coverage -counts -F 0.1 -a ../macs2/v6/ATAC3_peaks.broadPeak -b ATAC3_filtered.sortbychr.bed > ATAC3.counts.bed
GL000008.2	129	166	NB501157:251:HG7FNBGX9:1:11109:8208:9189/2	60	+

GL000008.2	129	166	NB501157:251:HG7FNBGX9:1:11109:8208:9189/2	60	+

bedtools coverage -counts -F 0.1 -a ../macs2/v6/ATAC4_peaks.broadPeak -b ATAC4_filtered.sortbychr.bed > ATAC4.counts.bed
GL000008.2	188	225	NB501157:251:HG7FNBGX9:1:22103:2890:17327/1	60	+

GL000008.2	188	225	NB501157:251:HG7FN

In [138]:
head  "/gpfs/commons/home/nliscovitch/neuron_diff_atac/consensus_peaks_diffbind_centered.bed"

chr1	10383	10683
chr1	19871	20171
chr1	29010	29310
chr1	36291	36591
chr1	136953	137253
chr1	237470	237770
chr1	385380	385680
chr1	437803	438103
chr1	450051	450351
chr1	545224	545524


In [139]:
pwd

/c/groups/sanjana_lab/cdai/TFscreen/atac/macs2/v7


In [141]:
head ATAC10_500_Peaks.bed

GL000008.2 536 1037 ATAC10_peak_1 3.05205
GL000008.2 2792 3293 ATAC10_peak_2 9.83333
GL000194.1 24193 24694 ATAC10_peak_3 4.82873
GL000195.1 51043 51544 ATAC10_peak_4 8.67152
GL000195.1 66620 67121 ATAC10_peak_5 6.23942
GL000195.1 130255 130756 ATAC10_peak_6 4.20744
GL000205.2 1293 1794 ATAC10_peak_7 2.75170
GL000205.2 3437 3938 ATAC10_peak_8 2.01589
GL000205.2 13553 14054 ATAC10_peak_9 2.16511
GL000205.2 39240 39741 ATAC10_peak_10 3.08570


In [142]:
head ATAC9_500_Peaks.bed

GL000008.2 2081 2582 ATAC9_peak_1 4.46012
GL000008.2 2795 3296 ATAC9_peak_2 14.77612
GL000008.2 3630 4131 ATAC9_peak_3 2.69125
GL000008.2 3972 4473 ATAC9_peak_4 2.17865
GL000194.1 24431 24932 ATAC9_peak_5 9.03100
GL000195.1 30600 31101 ATAC9_peak_6 9.63581
GL000195.1 51043 51544 ATAC9_peak_7 9.63581
GL000195.1 66799 67300 ATAC9_peak_8 3.39180
GL000195.1 68398 68899 ATAC9_peak_9 18.79097
GL000205.2 39119 39620 ATAC9_peak_10 8.50252


In [65]:
cd /c/groups/sanjana_lab/cdai/TFscreen/atac

In [66]:
ls *.bed

diffbind_24H_significant_peaks.bed
diffbind_24H_significant_peaks_gene.bed
diffbind_24H_significant_peaks_promoter.bed
diffbind_5D_significant_peaks.bed
diffbind_5D_significant_peaks_gene.bed
diffbind_5D_significant_peaks_promoter.bed
diffbind_consensu_min2overlap.bed
diffbind_consensu_min2overlap_readcounts_with_coordinates.bed
Protein_coding_gene_promoters.bed
Protein_coding_genes.bed


In [190]:
head diffbind_consensu_min2overlap.bed

chr1 629661 630161
chr1 633788 634288
chr1 778487 778987
chr1 822856 823356
chr1 827265 827765
chr1 869627 870127
chr1 876567 877067
chr1 904528 905028
chr1 912614 913114
chr1 920994 921494


In [67]:
head Protein_coding_gene_promoters.bed

chr1	63419	65919	ENSG00000186092.6	OR4F5	+
chr1	451197	453697	ENSG00000284733.1	OR4F29	-
chr1	686173	688673	ENSG00000284662.1	OR4F16	-
chr1	921928	924428	ENSG00000187634.12	SAMD11	+
chr1	958809	961309	ENSG00000188976.11	NOC2L	-
chr1	958584	961084	ENSG00000187961.14	KLHL17	+
chr1	964497	966997	ENSG00000187583.10	PLEKHN1	+
chr1	981593	984093	ENSG00000187642.9	PERM1	-
chr1	999672	1002172	ENSG00000188290.10	HES4	-
chr1	999138	1001638	ENSG00000187608.10	ISG15	+


In [70]:
tar -xvf appTGENE_5.1.01571668689245-1904007515.tar.gz TGENE

tar: TGENE: Not found in archive
tar: Exiting with failure status due to previous errors



In [72]:
tar -xvf appTGENE_5.1.01571668689245-1904007515.tar.gz

appTGENE_5.1.01571668689245-1904007515/messages.txt
appTGENE_5.1.01571668689245-1904007515/tgene.html
appTGENE_5.1.01571668689245-1904007515/links.tsv
appTGENE_5.1.01571668689245-1904007515/peaks_match_zbtb18_motif.bed


In [73]:
pwd

/c/groups/sanjana_lab/cdai/TFscreen/atac


### Quest 6. Intersect promoter regions that are targets of ZBTB18 with ATAC-seq peaks

In [74]:
ls *.bed

diffbind_24H_significant_peaks.bed
diffbind_24H_significant_peaks_gene.bed
diffbind_24H_significant_peaks_promoter.bed
diffbind_5D_significant_peaks.bed
diffbind_5D_significant_peaks_gene.bed
diffbind_5D_significant_peaks_promoter.bed
diffbind_consensu_min2overlap.bed
diffbind_consensu_min2overlap_readcounts_with_coordinates.bed
peaks_match_zbtb18_motif.bed
Protein_coding_gene_promoters.bed
Protein_coding_genes.bed
x_zbx.bed
y_peaks.bed


In [88]:
bedtools intersect -wo -a x_zbx.bed -b y_peaks.bed > zbx_target.bed

In [90]:
head zbx_target.bed

chr1	1324191	1326691	2501	-	ENSG00000127054.20	INTS11	chr1	1324531	1325030	500	*	0.758	499
chr1	1322756	1325256	2501	+	ENSG00000224051.7	CPTP	chr1	1324531	1325030	500	*	0.758	499
chr1	1398835	1401335	2501	-	ENSG00000221978.12	CCNL2	chr1	1399206	1399705	500	*	0.734	499
chr1	1406793	1409293	2501	-	ENSG00000242485.6	MRPL20	chr1	1407077	1407576	500	*	0.708	499
chr1	1510151	1512651	2501	+	ENSG00000197785.13	ATAD3A	chr1	1511791	1512290	500	*	0.71	499
chr1	1540124	1542624	2501	-	ENSG00000205090.9	TMEM240	chr1	1540579	1541078	500	*	0.76	499
chr1	1630095	1632595	2501	+	ENSG00000189409.13	MMP23B	chr1	1630222	1630721	500	*	0.734	499
chr1	1630095	1632595	2501	+	ENSG00000189409.13	MMP23B	chr1	1631814	1632313	500	*	0.754	499
chr1	1745499	1747999	2501	-	ENSG00000215790.7	SLC35E2A	chr1	1746126	1746625	500	*	0.644	499
chr1	1890617	1893117	2501	-	ENSG00000078369.18	GNB1	chr1	1891061	1891560	500	*	0.776	499


### Quest 7. Subset peaks to include only peaks that are in the promoter region

Inputs: 
- `../atac/diffbind_consensu_min2overlap.bed`: peak files that have been processed by diffbind
- `../Protein_coding_gene_promoters.bed`: protein coding gene promoter region coordinates

Outputs:
- `../atac/diffbind_consensu_min2overlap_promoterOnly.bed`
- `../atac/diffbind_consensu_min2overlap_promoterOnly_counts.txt`

In [100]:
# get the promoter only region peak coordinates
intersectBed -wa -a diffbind_consensu_min2overlap_readcounts_with_coordinates.bed \
                 -b Protein_coding_gene_promoters.bed | cut -f 1,2,3 \
                > diffbind_consensu_min2overlap_promoterOnly.bed

In [101]:
# get the corresponding read counts (A1-A12) for the promoter-only peaks
intersectBed -wa -a diffbind_consensu_min2overlap_readcounts_with_coordinates.bed \
                 -b Protein_coding_gene_promoters.bed | cut -f 4-15 \
                > diffbind_consensu_min2overlap_promoterOnly_counts.txt
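
With `-wa` alone, a peak that overlaps several promoters is written once per overlap, so both files above can contain repeated rows (in the same order, which keeps them aligned). If one row per peak is preferred, duplicates can be dropped on the coordinate columns. A sketch on rows abbreviated from the promoter intersect output earlier in this notebook (first five columns only):

```bash
#!/usr/bin/env bash
# The first peak appears twice because it overlaps two promoters.
hits='chr1 959088 959588 66 50
chr1 959088 959588 66 50
chr1 966291 966791 63 81'

# Keep the first occurrence of each (chrom, start, end) triple.
unique=$(printf '%s\n' "$hits" | awk '!seen[$1, $2, $3]++')

echo "$unique"
```

`bedtools intersect -u` achieves the same effect at the intersect step by reporting each `-a` entry at most once. Whether duplicates should be removed depends on how the counts are used downstream.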

### Quest 8. Get annotation BED of genes extended 2000 bp upstream

In [104]:
ls *coding*

Protein_coding_gene_promoters.bed  Protein_coding_genes.txt
Protein_coding_genes.bed


In [105]:
head Protein_coding_genes.bed

chr1	65419	71585	ENSG00000186092.6	OR4F5	+
chr1	450703	451697	ENSG00000284733.1	OR4F29	-
chr1	685679	686673	ENSG00000284662.1	OR4F16	-
chr1	923928	944581	ENSG00000187634.12	SAMD11	+
chr1	944203	959309	ENSG00000188976.11	NOC2L	-
chr1	960584	965719	ENSG00000187961.14	KLHL17	+
chr1	966497	975865	ENSG00000187583.10	PLEKHN1	+
chr1	975204	982093	ENSG00000187642.9	PERM1	-
chr1	998962	1000172	ENSG00000188290.10	HES4	-
chr1	1001138	1014540	ENSG00000187608.10	ISG15	+


In [106]:
chromsize=/c/groups/sanjana_lab/cdai/ref_genome/hg38_chrom_size.txt

In [110]:
slopBed -s -i Protein_coding_genes.bed -g $chromsize -l 2000 -r 0 > Protein_coding_genes_Up_2k.bed
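
With `-s`, `-l` is interpreted as "upstream" relative to the strand: the start moves left for `+` genes and the end moves right for `-` genes. The `awk` sketch below mimics that behavior on two records taken from `Protein_coding_genes.bed`; unlike `slopBed`, it does not clamp against the chromosome sizes file at the right-hand end.

```bash
#!/usr/bin/env bash
# Two records copied from Protein_coding_genes.bed, tab-separated.
genes=$(printf 'chr1\t65419\t71585\tENSG00000186092.6\tOR4F5\t+\nchr1\t450703\t451697\tENSG00000284733.1\tOR4F29\t-\n')

# Extend 2000 bp upstream in a strand-aware way.
up2k=$(printf '%s\n' "$genes" | awk 'BEGIN { FS = OFS = "\t" }
    $6 == "+" { $2 = ($2 > 2000) ? $2 - 2000 : 0 }   # upstream is to the left
    $6 == "-" { $3 = $3 + 2000 }                      # upstream is to the right
    { print }')

echo "$up2k"
```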

---

In [9]:
pwd

/c/groups/sanjana_lab/cdai/TFscreen/atac


In [22]:
ls *.csv

ATAC_deviation_score_jaspar.csv
ATAC_FoldChangeByTimePoints_geneRegion.csv
ATAC_FoldChangeByTimePoints_promoterRegion.csv
ATACseqSampleName.csv
chromVAR_jaspar2018_zScore.csv
chromVAR_jaspar2018_zScore_promoterOnly.csv
chromVAR_jaspar2020_zScore.csv
JASPAR2020_meta.csv
samplesheet2.csv
samplesheet.csv
TF_ScreenSelectedNormalizedCounts.csv
TF_target_interactions_all.csv
TF_target_interactions.csv
TF_target_interactions_with_RNAseq.csv


### Quest 9. make bed files of protein coding genes

`Annotations.ipynb` constructs GenomicRanges of gene, exon, intron, UTR, promoter, and enhancer regions.

It has also been converted to an R script: `/c/groups/sanjana_lab/cdai/TFscreen/atac/annotations/annotation.R`

In [28]:
head Gencode_hg38_v31_proteincoding_gene_features.bed

chr1	65419	65433	exon	OR4F5	+
chr1	65520	65573	exon	OR4F5	+
chr1	69037	71585	exon	OR4F5	+
chr1	65419	65433	UTR	OR4F5	+
chr1	65520	65564	UTR	OR4F5	+
chr1	70006	71585	UTR	OR4F5	+
chr1	65434	65519	intron	OR4F5	+
chr1	65574	69036	intron	OR4F5	+
chr1	63419	65918	promoter	OR4F5	+
chr1	55419	63418	enhancer	OR4F5	+


In [60]:
cd /gpfs/commons/groups/sanjana_lab/cdai/TFscreen/atac/macs2/v7/annotate_peaks_w_features

In [61]:
ls

ATAC10_peaks_annotated.bed  ATAC4_peaks_annotated.bed
ATAC11_peaks_annotated.bed  ATAC5_peaks_annotated.bed
ATAC12_peaks_annotated.bed  ATAC6_peaks_annotated.bed
ATAC1_peaks_annotated.bed   ATAC7_peaks_annotated.bed
ATAC2_peaks_annotated.bed   ATAC8_peaks_annotated.bed
ATAC3_peaks_annotated.bed   ATAC9_peaks_annotated.bed


### Quest 10. Further clean ATAC*.PE.mapq.bed files: remove reads not mapped to chr1-chr22, chrX, chrY

In [1]:
cd /gpfs/commons/groups/sanjana_lab/cdai/TFscreen/atac/bams_v3

In [4]:
cat bed_filter_chr.sh

#!/bin/bash
#$ -N fixbed
#$ -t 1-12
#$ -j y
#$ -cwd
#$ -V
#$ -pe smp 1
#$ -l h_vmem=20G




echo "Current task: $SGE_TASK_ID"
date

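oldbed="ATAC${SGE_TASK_ID}.PE.mapq.bed"
out="$(basename -s .bed "$oldbed").chr.bed"

# Keep only chr1-chr22, chrX and chrY. The pattern is anchored so that
# suffixed names (e.g. chr1_KI270706v1_random) are not kept by accident.
awk '$1 ~ /^chr([0-9]+|[XY])$/' "$oldbed" > "$out"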

echo "Done at $(date)"
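
A chromosome-name filter is easy to check on a handful of names before submitting the array job. The sketch below uses an anchored pattern, `^chr([0-9]+|[XY])$`, so that mitochondrial, unplaced, and suffixed names are excluded; the name list is illustrative, and `chr1_KI270706v1_random` is a UCSC-style name that may not occur in these files.

```bash
#!/usr/bin/env bash
# Illustrative sequence names, one per line.
names='chr1
chr22
chrX
chrY
chrM
GL000008.2
chr1_KI270706v1_random'

# Keep names that are exactly "chr" followed by digits, X, or Y.
kept=$(printf '%s\n' "$names" | awk '$1 ~ /^chr([0-9]+|[XY])$/')

echo "$kept"
```

Only `chr1`, `chr22`, `chrX`, and `chrY` pass.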





In [12]:
cd ../../../ref_genome

In [13]:
ls

bwa_gencode_genome
bwa_gencode_transcripts
bwa_refseq
gencode_GRCh38.primary_assembly.genome.fa
gencode_GRCh38.primary_assembly.genome.fa.fai
gencode.v31.primary_assembly.annotation.gene_idTOgene_name.txt
gencode.v31.primary_assembly.annotation.gff3
gencode.v31.primary_assembly.annotation.gtf
gencode.v31.primary_assembly.annotation.pandas.df.20190805.txt
gencode.v31.primary_assembly.annotation.pandas.df.20200108.txt
gencode.v31.primary_assembly.annotation.pandas.df.txt
gencode.v31.transcripts.fa
hg38_chrom_size.txt
hg38_gene_size.txt
hisat2.genome
kallisto_index
Log.out
MAT
ncbi_GRCh38_latest_genomic.gff
ncbi_GRCh38_latest_rna.fna
rsem.1.2.21.gencode.v31.ref
RSEM.out
sh_bwa_refseq_index.sh
sh_hisat2_build_index.sh
sh_rsem_ref.sh
sh_star_index.sh

In [None]:
bedtools -fi gencode_GRCh38.primary_assembly.genome.fa \
    -

In [3]:
pwd

/gpfs/commons/home/cdai/notebooks/TFscreen
