Profile plot description:
The figure was generated using the plotProfile utility of deeptools. Both samples show a peak in ChIP signal at the TSS, followed by an immediate drop-off and low signal across the gene body through to the TES, as expected. This enrichment is consistent with open, TF-bound chromatin at the TSS, as expected for conditions with high TF activity.

Methods:
Raw reads were first assessed with FastQC. After inspecting the QC metrics, reads were trimmed using Trimmomatic and aligned with Bowtie2 to an index built from the human reference genome (GRCh38), also using Bowtie2. Aligned reads were sorted and indexed with samtools, and samtools flagstat was used to generate alignment metrics. At this point, QC reports and log files were compiled using MultiQC. BigWig files were then generated using the deeptools bamCoverage utility, and coverage was summarized and compared across samples using the deeptools multiBigwigSummary and plotCorrelation utilities. Peak calling was then performed with MACS3 on the sorted BAM files. The bedtools intersect utility was used to generate a set of reproducible peaks by keeping peaks from the two replicates that overlapped reciprocally by at least 50%, before bedtools intersect was used again to filter out peaks overlapping blacklisted regions. Peaks were annotated using HOMER with default parameters. The deeptools computeMatrix and plotProfile utilities were then used to generate a profile plot from the BigWig files. Lastly, the HOMER findMotifsGenome.pl script was used to find motifs in the filtered peaks.
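
To make the reciprocal 50% overlap criterion concrete, here is a minimal pure-Python sketch of the test that bedtools intersect presumably applies when run with its `-f 0.5 -r` options. The function names and the toy coordinates are hypothetical and are not part of the actual pipeline; intervals are treated as half-open (start, end) pairs, as in BED files.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """True if half-open intervals a and b overlap by >= min_frac of EACH interval."""
    a_start, a_end = a
    b_start, b_end = b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a_end - a_start)
            and overlap >= min_frac * (b_end - b_start))


def reproducible_peaks(rep1, rep2, min_frac=0.5):
    """Keep rep1 peaks that reciprocally overlap any rep2 peak on the same chromosome."""
    kept = []
    for chrom, start, end in rep1:
        for chrom2, start2, end2 in rep2:
            if chrom == chrom2 and reciprocal_overlap((start, end), (start2, end2), min_frac):
                kept.append((chrom, start, end))
                break  # one supporting peak in rep2 is enough
    return kept


# Toy peaks (made-up coordinates)
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 140, 240), ("chr1", 590, 900), ("chr2", 100, 200)]

print(reproducible_peaks(rep1, rep2))
# [('chr1', 100, 200), ('chr2', 100, 200)]
```

The chr1:500-600 peak is dropped because it shares only 10 bp with its nearest rep2 peak, which is less than half of either interval. The nested loop is quadratic and only meant for illustration; bedtools uses a far more efficient approach on real peak sets.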

QC Report Summary:
Looking first at the FastQC reports, GC content follows a roughly normal distribution but differs slightly between INPUT and IP samples, and all distributions show what looks like a small second peak. Given that we expect different sequence composition between IP and INPUT, this isn't a red flag. However, FastQC does flag the GC distributions of the INPUT samples, which may need further investigation. Quality scores look fine, with high quality across all samples. Duplication is higher in the IP samples, but that is expected. One concern is the much lower number of sequences in the rep2 INPUT: 10M instead of the ~30M in the other samples. The flagstat results show no red flags either, with most reads from all samples passing QC checks and mapping. Trimmomatic also dropped few sequences, which is a good sign.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [None]:
# read in peaks
peaks = pd.read_table("results/annotated_peaks.txt")

# set colnames for readability
peaks.columns = ['id', 'chrom', 'start', 'end', 'strand', 'score', 'focus', 'annotation', 'details', 'dist', 'promID', 'entrez', 'unigene', 'refseq', 'ensembl', 'gene_name', 'alias', 'desc', 'type']
peaks.head(10)

In [None]:
# read in rnaseq res
rnaseq = pd.read_table("rnaseq_res.txt")

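# filter by padj, using the threshold defined in pcut
pcut = 0.05
rnaseq_fltr = rnaseq[rnaseq['padj'] < pcut].copy()
# rename columns; assumes the table has exactly these four columns in this order
rnaseq_fltr.columns = ['gene_name', 'transcript', 'log2FoldChange', 'padj']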
rnaseq_fltr.head()

In [None]:
merged = pd.merge(rnaseq_fltr, peaks, on='gene_name')
merged.head()
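
One thing worth keeping in mind about the merge above: `pd.merge` performs an inner join by default, so genes without an annotated peak are dropped, and a gene with several annotated peaks appears once per peak. The sketch below illustrates this on toy data; the gene names, peak IDs, and distances are made up and do not come from the real tables.

```python
import pandas as pd

# Toy stand-ins for the filtered RNA-seq table and the annotated peaks (hypothetical values)
rnaseq_toy = pd.DataFrame({
    "gene_name": ["GeneA", "GeneB", "GeneC"],
    "log2FoldChange": [2.0, -1.5, 1.2],
})
peaks_toy = pd.DataFrame({
    "id": ["peak1", "peak2", "peak3"],
    "gene_name": ["GeneA", "GeneA", "GeneB"],
    "dist": [-300, 12000, 4500],
})

# Inner join on gene_name (the default for pd.merge)
merged_toy = pd.merge(rnaseq_toy, peaks_toy, on="gene_name")

print(len(merged_toy))                    # 3 rows: GeneA twice, GeneB once, GeneC dropped
print(merged_toy["gene_name"].nunique())  # 2 distinct genes
```

Because of this, row counts on the merged table count gene-peak pairs rather than genes, which matters whenever the result is reported as a number or percentage of genes.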

In [None]:
# read in motif finding
motifs = pd.read_table("results/homermotifs/knownResults.txt")
motifs.head(20)

In [None]:
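# count unique genes so a gene with several transcripts or peaks is counted once
numUp = rnaseq_fltr.loc[rnaseq_fltr['log2FoldChange'] > 1, 'gene_name'].nunique()
numDn = rnaseq_fltr.loc[rnaseq_fltr['log2FoldChange'] < -1, 'gene_name'].nunique()

# combine conditions with & in one mask instead of chaining two boolean indexers
isUp = merged['log2FoldChange'] > 1
isDn = merged['log2FoldChange'] < -1
dist = merged['dist'].abs()
numUp5 = merged.loc[isUp & (dist < 5000), 'gene_name'].nunique()
numUp20 = merged.loc[isUp & (dist < 20000), 'gene_name'].nunique()
numDn5 = merged.loc[isDn & (dist < 5000), 'gene_name'].nunique()
numDn20 = merged.loc[isDn & (dist < 20000), 'gene_name'].nunique()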

botVals = [numUp5, numDn5, numUp20, numDn20]
topVals = [x - y for x,y in zip([numUp, numDn, numUp, numDn], botVals)]
botPct = [x * 100 / y for x,y in zip(botVals, [numUp, numDn, numUp, numDn])]
topPct = [x * 100 / y for x,y in zip(topVals, [numUp, numDn, numUp, numDn])]

fig, ax = plt.subplots(figsize=(10, 6))
x = range(4)
bar1 = ax.bar(x, botPct, label='Runx1 Binding', color='red')
bar2 = ax.bar(x, topPct, bottom=botPct, label='No Runx1 Binding', color='lightgray')
ax.set_xticks(x)
ax.legend()

for i in range(4):
    # Label for "with ChIP"
    ax.text(x[i], botPct[i] / 2, str(botVals[i]),
            ha='center', va='center', color='white', fontsize=10)

    # Label for "without ChIP"
    ax.text(x[i], botPct[i] + topPct[i] / 2, str(topVals[i]),
            ha='center', va='center', color='black', fontsize=10)
ax.set_ylabel('Percentage of Genes')
ax.set_xticklabels(['+/-5kb of TSS Upregulated', '+/-5kb of TSS Downregulated',
                    '+/-20kb of TSS Upregulated', '+/-20kb of TSS Downregulated'],
                    rotation=30)

plt.show()
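
The bound/unbound split plotted above could also be wrapped in a small reusable function, which makes it easier to try other distance windows. The following is only a sketch under assumptions: the function name `bound_gene_counts` is hypothetical, the toy table reuses the `gene_name` and `dist` column names from the merged table, and genes are counted once regardless of how many peaks they have.

```python
import pandas as pd


def bound_gene_counts(de_genes, merged, window):
    """Split DE genes into those with / without a peak within `window` bp of the TSS.

    Returns (bound, unbound, percent_bound).
    """
    near = merged.loc[merged["dist"].abs() < window, "gene_name"]
    genes = set(de_genes)
    bound = len(genes & set(near))
    unbound = len(genes) - bound
    pct_bound = 100 * bound / len(genes) if genes else 0.0  # avoid dividing by zero
    return bound, unbound, pct_bound


# Toy data (made-up genes and distances)
merged_toy = pd.DataFrame({
    "gene_name": ["GeneA", "GeneB", "GeneA"],
    "dist": [-300, 12000, 15000],
})
up_genes = ["GeneA", "GeneB", "GeneC", "GeneD"]

print(bound_gene_counts(up_genes, merged_toy, 5000))   # (1, 3, 25.0)
print(bound_gene_counts(up_genes, merged_toy, 20000))  # (2, 2, 50.0)
```

Widening the window from 5 kb to 20 kb picks up GeneB, whose only peak lies 12 kb from the TSS, while GeneA is still counted once despite having two peaks.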

