In [None]:
import pandas as pd

Let's pull out significantly differentially expressed genes.

In [None]:
deseq2_dir = "/oasis/tscc/scratch/biom200/featurecounts/"
deseq2_result = pd.read_csv(deseq2_dir+"differential_expression.csv", index_col=0)
deseq2_result.head()

The GeneID alone isn't very informative, so let's join the gene names onto the dataframe.

In [None]:
peak_dir = "/oasis/tscc/scratch/biom200/fto_clip/"

gene_names = pd.read_table(peak_dir+"gencode.v19.annotation.genenames.txt", index_col=0)
gene_names.head()

In [None]:
deseq2_result = deseq2_result.join(gene_names)
deseq2_result.head()
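To make the join step concrete, here is a minimal sketch on toy data. The gene IDs, names, and values below are made up for illustration; only the `join` behavior is the point. `DataFrame.join` aligns on the index and defaults to a left join, so every row of the left table is kept and IDs missing from the right table get NaN.

```python
import pandas as pd

# Toy stand-ins for the DESeq2 table and the gene-name table (hypothetical values).
results = pd.DataFrame(
    {"log2FoldChange": [1.5, -2.0, 0.1], "padj": [0.01, 0.03, 0.8]},
    index=["ENSG1", "ENSG2", "ENSG3"],
)
names = pd.DataFrame({"gene_name": ["GENEA", "GENEB"]}, index=["ENSG1", "ENSG2"])

# join defaults to a left join on the index: all rows of `results` survive,
# and ENSG3, which has no entry in `names`, gets NaN in gene_name.
joined = results.join(names)
print(joined)
```

If the gene-name column comes back as all NaN on the real data, the two indexes probably don't use the same ID format.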

Which genes have a significant adjusted p-value, that is, a value below 0.05 in the padj column?

In [None]:
sig_genes = deseq2_result.loc[deseq2_result['padj'] < 0.05]
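One thing worth knowing about this filter: DESeq2 output commonly contains missing values in padj, and a comparison against NaN is always False, so those genes are silently dropped. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted p-values; g3 has no padj at all.
toy = pd.DataFrame(
    {"padj": [0.001, 0.2, np.nan, 0.049]},
    index=["g1", "g2", "g3", "g4"],
)

# NaN < 0.05 evaluates to False, so g3 is excluded without any warning.
mask = toy["padj"] < 0.05
sig = toy.loc[mask]
print(sig)
```

That is usually the behavior you want here, but it explains why the number of rows kept plus the number of rows with padj >= 0.05 may not add up to the full table.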

Let's separate those into upregulated and downregulated genes, keeping only genes with at least a two-fold change (log2FoldChange above 1 or below -1).

In [None]:
sig_genes_up = sig_genes.loc[sig_genes['log2FoldChange'] > 1]
print(sig_genes_up.shape)
sig_genes_up.head()

In [None]:
sig_genes_down = sig_genes.loc[sig_genes['log2FoldChange'] < -1]
print(sig_genes_down.shape)
sig_genes_down.head()
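As a quick reminder of what the cutoffs of 1 and -1 mean on the original scale, here is a tiny sketch. The helper name `fold_change` is just for illustration.

```python
def fold_change(log2_fc):
    """Convert a log2 fold change back to a plain ratio."""
    return 2.0 ** log2_fc

# A log2 fold change of 1 is a doubling, -1 is a halving.
for lfc in (1, -1, 2.5):
    print(lfc, fold_change(lfc))
```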

Now that I have called these genes as significant, I only want to keep their gene IDs. I don't care about the rest of the columns.

In [None]:
upregulated_genes = sig_genes_up.index
downregulated_genes = sig_genes_down.index

We are going to use bedtools to intersect those genes with a list of peaks that we called from FTO CLIP. Check out the bedtools documentation. In particular, we are going to use bedtools intersect.
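Conceptually, an intersect asks which intervals from one file overlap intervals from another. The sketch below is not how bedtools is implemented and ignores strand; it only illustrates the overlap test for BED-style coordinates (0-based start, end not included). The gene and peak coordinates are made up.

```python
def overlaps(a_start, a_stop, b_start, b_stop):
    """True if two half-open intervals [start, stop) on the same chromosome overlap."""
    return a_start < b_stop and b_start < a_stop

# Hypothetical gene and peaks.
gene = ("chr1", 1000, 5000)
peaks = [("chr1", 900, 1100), ("chr1", 5000, 5200), ("chr2", 1200, 1300)]

# Keep peaks on the same chromosome that overlap the gene.
hits = [p for p in peaks
        if p[0] == gene[0] and overlaps(gene[1], gene[2], p[1], p[2])]
print(hits)
```

The second peak starts exactly where the gene ends, so with half-open coordinates it does not count as an overlap.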

To use bedtools intersect, we need a BED file of genes, not just a list of gene IDs. I put a BED file of genes in the shared folder. Let's load it as a dataframe and make new BED files containing only the genes we are interested in.

In [None]:
bedfile_of_genes = pd.read_table(peak_dir+"hg19_genes.bed",  
                              names = ['chrom','start','stop','geneid','name','strand'])
bedfile_of_genes.head()

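I want to set the geneid column as the index so that I can select rows by gene ID. Passing drop=False keeps geneid as a regular column too, so it is still written out when we save the BED file.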

In [None]:
bedfile_of_genes.set_index("geneid", drop=False, inplace=True)
bedfile_of_genes.head()

How do I use this new index to grab only upregulated genes?

In [None]:
upregulated_bed = bedfile_of_genes.loc[upregulated_genes]

In [None]:
downregulated_bed = bedfile_of_genes.loc[downregulated_genes]
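A caveat, sketched on toy data: in recent pandas versions, `.loc` with a list of labels raises a `KeyError` if any label is missing from the index. If some significant gene IDs might not be in the BED file, one option is to keep only the IDs present in both. The table and IDs below are hypothetical.

```python
import pandas as pd

# Toy gene BED table indexed by gene ID (made-up coordinates).
bed = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [100, 500], "stop": [200, 900],
     "geneid": ["ENSG1", "ENSG2"], "name": ["GENEA", "GENEB"],
     "strand": ["+", "-"]}
).set_index("geneid", drop=False)

wanted = pd.Index(["ENSG2", "ENSG9"])  # ENSG9 is not in the BED table

# Restrict to IDs that exist in the BED index before selecting.
present = bed.index.intersection(wanted)
subset = bed.loc[present]
print(subset)
```

Comparing `len(present)` with `len(wanted)` tells you how many genes were lost at this step.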

Let's save those files. BED files don't have a header, so we don't want to write one, and we don't want to write the index either because geneid is already a column. The files also need to be tab separated.

In [None]:
save_dir = "/home/ucsd-train01/projects/fto_shrna/fto_clip/"
upregulated_bed.to_csv(save_dir+"upregulated_genes.bed", index=False, header=False, sep="\t")
downregulated_bed.to_csv(save_dir+"downregulated_genes.bed", index=False, header=False, sep="\t")
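To check what these options actually produce, here is a small sketch that writes a toy table to an in-memory buffer instead of a file. The rows are made up.

```python
import io
import pandas as pd

# Toy BED-like table (hypothetical genes).
bed = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [100, 500], "stop": [200, 900],
     "geneid": ["ENSG1", "ENSG2"], "name": ["GENEA", "GENEB"],
     "strand": ["+", "-"]}
)

# No index, no header, tab separated: one plain BED record per line.
buf = io.StringIO()
bed.to_csv(buf, index=False, header=False, sep="\t")
lines = buf.getvalue().splitlines()
print(lines)
```

Running `head` on the saved file from the command line is an equally good check.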

One more thing: we need a BED file of significant peaks to compare against these upregulated and downregulated genes. Let's load the peak file and filter on p-value and fold-change cutoffs.

In [None]:
rep1_peaks = pd.read_table(peak_dir+"fto_clip_rep1.bed", index_col=0, 
                          names = ['chrom','start','stop','pval','fc','strand'])
rep1_peaks.head()

How do we select rows with pval greater than 3 and fold change greater than 2?

In [None]:
rep1_peaks_sig_peaks = rep1_peaks.loc[(rep1_peaks['pval'] > 3) &
               (rep1_peaks['fc'] > 2)]
rep1_peaks_sig_peaks.head()
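The pattern here is two boolean masks combined with `&`. The parentheses around each comparison are required, because `&` binds more tightly than `>` in Python. A minimal sketch on made-up peaks:

```python
import pandas as pd

# Hypothetical peaks with the same pval and fc columns as above.
peaks = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"],
     "start": [100, 400, 900], "stop": [150, 460, 980],
     "pval": [5.2, 2.1, 8.0], "fc": [3.5, 4.0, 1.2],
     "strand": ["+", "-", "+"]}
)

# Each comparison gives a boolean Series; & keeps rows where both are True.
keep = (peaks["pval"] > 3) & (peaks["fc"] > 2)
sig_peaks = peaks.loc[keep]
print(sig_peaks)
```

Only the first peak passes both cutoffs: the second fails on pval and the third fails on fc.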

In [None]:
# chrom is the index here (index_col=0), so we keep the index to write it as the first column
rep1_peaks_sig_peaks.to_csv(save_dir+"fto_rep1_sig_peaks.bed", header=False, sep="\t")

Let's do the same thing for the rep2 peaks.

In [None]:
rep2_peaks = pd.read_table(peak_dir+"fto_clip_rep2.bed", index_col=0, 
                          names = ['chrom','start','stop','pval','fc','strand'])
rep2_peaks.head()

In [None]:
rep2_peaks_sig_peaks = rep2_peaks.loc[(rep2_peaks['pval'] > 3) &
               (rep2_peaks['fc'] > 2)]
rep2_peaks_sig_peaks.head()

In [None]:
rep2_peaks_sig_peaks.to_csv(save_dir+"fto_rep2_sig_peaks.bed", header=False, sep="\t")

Now we're ready to move on to bedtools.