# Extract upstream sequences from differentially expressed genes to find TFBS

## Overall purpose of the function

Extract the upstream sequences of differentially expressed genes so that they can be searched for transcription factor binding sites (TFBS) with oPOSSUM.

**The basic inputs of the function include:**
1. CSV file with differentially expressed genes (e.g. output from limma-voom, edgeR or DESeq)
2. reference genome (FASTA)
3. GFF annotation file

**The outputs from this function are:**
1. FASTA file with upstream **background sequences** from random genes in the genome
2. FASTA file with upstream **target sequences** from differentially expressed genes

The background and target sequences can be used to query servers such as oPOSSUM to find TFBS.

Link to oPOSSUM website: http://opossum.cisreg.ca/cgi-bin/oPOSSUM3/opossum_seq_ssa

## What are transcription factors?
Transcription factors are proteins that regulate the transcription of genes, that is, the copying of a gene's DNA into RNA on the way to making a protein.

https://www.khanacademy.org/science/biology/gene-regulation/gene-regulation-in-eukaryotes/a/eukaryotic-transcription-factors

## How do transcription factors work?
A typical transcription factor binds to DNA at a certain target sequence. Once it's bound, the transcription factor makes it either harder or easier for RNA polymerase to bind to the promoter of the gene.

Some transcription factors activate transcription. For instance, they may help the general transcription factors and/or RNA polymerase bind to the promoter, as shown in the diagram below.

In [13]:
from IPython.display import display, Image
Image(url = 'https://ka-perseus-images.s3.amazonaws.com/6567f50d30ad3ac65aff1e815caf202b3abd7111.png')

## Transcription factor binding sites (TFBS)

The target sequence that a transcription factor recognizes is also called a motif. By binding there and making it harder or easier for RNA polymerase to bind to the promoter, the factor regulates the amount of messenger RNA (mRNA) produced by the gene.

Transcription factor binding sites (TFBS) are often located in the 5' upstream region of target genes, where they modulate the rate of gene transcription. DNA binding sites can thus be defined as short DNA sequences (typically 4 to 30 base pairs long) that are specifically bound by one or more DNA-binding proteins or protein complexes.
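
To make the idea of a short binding motif concrete, here is a minimal sketch that scans a sequence for exact occurrences of a motif on both strands. The motif `TATAAA`, the example sequence, and the helper names `reverse_complement` and `find_motif` are illustrative assumptions and not part of this notebook. Dedicated tools such as oPOSSUM score candidate sites with binding profiles rather than exact string matches, so this only illustrates the concept.

```python
import re

# Translation table for complementing DNA bases
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_motif(seq, motif):
    """Return sorted (position, strand) pairs for exact occurrences of motif.

    Positions are 0-based offsets on the forward strand.
    """
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # The lookahead allows overlapping occurrences to be reported
        for match in re.finditer("(?=" + re.escape(pattern) + ")", seq):
            hits.append((match.start(), strand))
    return sorted(hits)

# Made-up sequence with one forward and one reverse-strand occurrence
hits = find_motif("GGTATAAAGGCCTTTATACC", "TATAAA")
print(hits)
```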

# Steps

## 1. Load and filter differentially expressed genes using a threshold value

In [43]:
# Import the widgets
from ipywidgets import widgets, interact, interactive, Button, Layout
# Import the display function for explicitly displaying widgets in the notebook
from IPython.display import display, clear_output

In [76]:
import pandas as pd
import numpy as np
import math

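def get_degenes(filepath, gene_id, threshold, threshold_col_id):
    # Read the differential expression table and drop rows with missing values
    genes = pd.read_csv(filepath)
    genes = genes.dropna()
    # Gene IDs must be strings so that they match the IDs parsed from the GFF file
    if pd.api.types.is_float_dtype(genes[gene_id]):
        # Go through int first so that 123.0 becomes '123' rather than '123.0'
        genes = genes.astype({gene_id:int}).astype({gene_id:str})
    elif pd.api.types.is_integer_dtype(genes[gene_id]):
        genes = genes.astype({gene_id:str})
    else:
        print("gene names are strings, great!")

    # Keep genes at or beyond the threshold in either direction (up- or down-regulated)
    DEgenes = genes.loc[(genes[threshold_col_id] >= threshold) | (genes[threshold_col_id] <= -threshold)]
    # Return a single column with the standard name 'gene_id'
    DEgenes = DEgenes[[gene_id]].rename(columns={gene_id:'gene_id'})

    return DEgenes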

In [45]:
## Function parameters 
style = {'description_width': 'initial'}
#filepath = widgets.Text(value = '/Volumes/HOME_INTEL/RNAseq-POMV/Results/ControlvsPOMV6_ALL.csv', description='Filepath:',disabled=False)

filepath = widgets.Text(value = '', description='Filepath:',disabled=False)

gene_id = widgets.Text(value = 'ENTREZID', description='Gene ID:', disabled=False)

threshold_col_id = widgets.Text(value = 'logFC', description='Threshold column ID:', disabled=False, style=style)

threshold = widgets.FloatSlider(value=2.0, min=0.5, max=10.0, step=0.5, description='Threshold', disabled=False, continuous_update=False, orientation='horizontal', readout=True, readout_format='.1f', style=style)
threshold.style.handle_color = 'lightblue'

ui1 = widgets.VBox([filepath, gene_id, threshold_col_id, threshold])

display(ui1)

In [79]:
button1 = widgets.Button(description="Filter Genes", layout=Layout(width='30%', height='40px'), button_style = 'info')
out1=widgets.Output()

def on_button_clicked(button1): 
    
    if filepath.value == '':
        with out1:
            clear_output(wait = True)
            print("Filepath to differentially expressed genes not provided")
            pass
    
    else: 
        with out1: 
            clear_output(wait = True)
            DEgenes = get_degenes(filepath.value, gene_id.value, threshold.value, threshold_col_id.value)
            print("Total number of differentially expressed genes:", len(DEgenes))

button1.on_click(on_button_clicked)

display(button1, out1)

In [47]:
if filepath.value == '':
    pass
else:
    DEgenes1 = get_degenes(filepath.value, gene_id.value, threshold.value, threshold_col_id.value)
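
The filter in `get_degenes` keeps rows whose value in the threshold column is at least `threshold` in either direction. A minimal sketch on an in-memory table, assuming the default column names `ENTREZID` and `logFC`; the gene IDs and fold changes below are invented for illustration:

```python
import pandas as pd

# Hypothetical differential expression table; IDs and values are made up
genes = pd.DataFrame({
    "ENTREZID": [101.0, 102.0, 103.0, 104.0],
    "logFC": [2.5, -3.0, 0.4, -1.9],
})

threshold = 2.0

# Float IDs such as 101.0 are converted to the string '101'
genes = genes.astype({"ENTREZID": int}).astype({"ENTREZID": str})

# Keep strongly up- or down-regulated genes
selected = genes.loc[(genes["logFC"] >= threshold) | (genes["logFC"] <= -threshold)]
de_ids = selected["ENTREZID"].tolist()
print(de_ids)
```

Gene `104` is excluded because its log fold change of -1.9 does not reach the threshold of 2.0 in absolute value.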

## 2. Extract features and coordinates from the GFF file

In [87]:
import re

def get_features(gff, feature, search_gff, attribute, coord):
    # The nine standard GFF columns
    col_names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    mygff = pd.read_csv(gff, sep='\t', comment='#', low_memory=False, header=None, names=col_names)
    CDS = mygff[mygff['type'] == feature].copy()

    def make_extractor(tag):
        # Capture the value after "<tag><separator>" up to the next ',' or ';' (or the end of the string)
        pattern = re.compile(r'{}\W(?P<value>.+?)(?:[,;]|$)'.format(re.escape(tag)))
        def extract(attributes_str):
            res = pattern.search(attributes_str)
            return '' if res is None else res.group('value')
        return extract

    CDS['gene_id'] = CDS.attributes.apply(make_extractor(search_gff))
    CDS['attribute'] = CDS.attributes.apply(make_extractor(attribute))
    CDS = CDS.drop('attributes', axis=1)

    # One start coordinate per gene, unless all features are requested
    group_cols = ['seqid', 'gene_id', 'attribute', 'strand']
    if coord == 'all':
        CDS_start_points = CDS
    elif coord == 'min':
        CDS_start_points = CDS.groupby(group_cols, as_index=False)['start'].min()
    elif coord == 'max':
        CDS_start_points = CDS.groupby(group_cols, as_index=False)['start'].max()
    elif coord == 'median':
        CDS_start_points = CDS.groupby(group_cols, as_index=False)['start'].median()
        CDS_start_points = CDS_start_points.astype({'start':int})
    else:
        # Fail early instead of returning an undefined variable
        raise ValueError("Invalid value for coord: expected 'all', 'min', 'max' or 'median'")

    return CDS_start_points

In [88]:
## Function parameters 
#gff = widgets.Text(value = '/Users/sam079/Documents/2018_Transcription_factors/Data/GCF_000233375.1_ICSASG_v2_genomic.gff', description='GFF:',disabled=False)

gff = widgets.Text(description='Filepath:',disabled=False)

feature = widgets.Text(value = 'CDS', description='Feature:', disabled=False)

search_gff = widgets.Text(value = 'GeneID', description='Gene ID tag in GFF', disabled=False, style=style)

attribute = widgets.Text(value = 'product', description='Other tags in GFF', disabled=False, style=style)

coord = widgets.RadioButtons(options=['all', 'min', 'max', 'median'],value='min', description='Start coordinates',disabled=False, style=style, button_style = 'info')

ui2 = widgets.VBox([gff, feature, search_gff, attribute, coord])

display(ui2)

In [90]:
button2 = widgets.Button(description="Get Features and Coordinates",layout=Layout(width='30%', height='40px'), button_style = 'info')
button2.style
out2=widgets.Output()

def on_button_clicked(button2):
    
    if gff.value == '':
        with out2:
            clear_output(wait = True)
            print("Filepath to GFF file not provided")
            pass
    
    else: 
        with out2: 
            clear_output(wait = True)
            CDS_start_points = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
            print("Total number of selected features:", len(CDS_start_points),'\n',CDS_start_points.head(10))

button2.on_click(on_button_clicked)

display(button2, out2)

In [94]:
if gff.value == '':
    pass
else:
    CDS_start_points1 = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
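
A pattern along the lines of the one in `get_features` can be tried on a single attributes string. The string below is a made-up example in the style of a RefSeq GFF record (the IDs and product name are invented), and `extract_tag` is a hypothetical helper used only for this illustration. The end-of-string alternative `$` lets the last attribute on the line match as well.

```python
import re

# Hypothetical attributes field (column 9) of a CDS line; all values are invented
attributes = (
    "ID=cds-XP_000001.1;Parent=rna-XM_000001.1;"
    "Dbxref=GeneID:100000001,Genbank:XP_000001.1;"
    "gbkey=CDS;product=example protein"
)

def extract_tag(tag, attributes_str):
    """Return the value following `tag` up to the next ',' or ';' (or end of string)."""
    pattern = r"{}\W(?P<value>.+?)(?:[,;]|$)".format(re.escape(tag))
    match = re.search(pattern, attributes_str)
    return "" if match is None else match.group("value")

gene_id = extract_tag("GeneID", attributes)         # tag nested inside Dbxref
product = extract_tag("product", attributes)        # last attribute on the line
missing = extract_tag("gene_biotype", attributes)   # tag that is not present

print(gene_id, "|", product, "|", repr(missing))
```

Tags that are absent yield an empty string, so features without a gene ID end up grouped under `''` unless they are filtered out.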

### Examine the features extracted from the GFF file

Look up the extracted features for any gene (for example, one from the list of differentially expressed genes) by entering its Gene ID.

In [52]:
def find_genes(gene_name):
    gene_name = str(gene_name)
    # get_features stores the parsed IDs in the 'gene_id' column
    query = CDS_start_points1[CDS_start_points1['gene_id'] == gene_name]
    return query

In [53]:
## Function parameters 
gene_name = widgets.Text(description='Search for a Gene ID:', disabled=False,style=style)

ui_CDS = widgets.VBox([gene_name])

display(ui_CDS)

In [54]:
button_CDS = widgets.Button(description="Search Genes",layout=Layout(width='30%', height='40px'), button_style = 'info')
out_CDS=widgets.Output()

def on_button_clicked(button_CDS):
    
    if gene_name.value == '':
        with out_CDS:
            clear_output(wait = True)
            print("Gene ID not provided")
            pass
    
    else: 
        with out_CDS: 
            clear_output(wait = True)
            query = find_genes(gene_name.value)
            print(query)

button_CDS.on_click(on_button_clicked)

display(button_CDS, out_CDS)

## 3. Extract background and target sequences from genome

### Background sequences

In [95]:
from pyfaidx import Fasta

def create_background_fasta(CDS_start_points1, genome, background_outfile, upstream_nucl):
    genome = Fasta(genome)
    # Draw up to 500 random features as the background set
    CDS_random = CDS_start_points1.sample(min(500, len(CDS_start_points1)))

    with open(background_outfile, "w") as outfile:
        for index, row in CDS_random.iterrows():
            start = int(row['start'])
            # Clamp at the beginning of the sequence when fewer than upstream_nucl bases are available
            begin = max(0, start - upstream_nucl)
            # The window is taken to the left of 'start' regardless of strand
            sequences = genome[row['seqid']][begin:start + 3]
            outfile.write(">" + row['gene_id'] + " " + sequences.fancy_name + "\n" + sequences.seq + "\n")
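
Two details are worth keeping in mind when computing the window. GFF coordinates are 1-based and inclusive, whereas Python slices are 0-based and half-open. In addition, for a gene on the minus strand the upstream region lies to the right of the feature's `end` coordinate, while the function here takes the window to the left of `start` for every feature. The sketch below shows one way a strand-aware window could be computed. `upstream_window` is a hypothetical helper, not part of this notebook, and it needs the `end` column, which is only retained when the 'all' option is selected.

```python
def upstream_window(start, end, strand, upstream_nucl, seq_len):
    """Return 0-based, half-open (begin, stop) slice bounds of the upstream region.

    start, end: 1-based inclusive GFF coordinates of the feature.
    seq_len: length of the chromosome or scaffold, used to clamp the window.
    """
    if strand == "+":
        # Bases immediately before the first base of the feature
        stop = start - 1
        begin = max(0, stop - upstream_nucl)
    elif strand == "-":
        # On the minus strand, upstream lies after the feature's end coordinate
        begin = end
        stop = min(seq_len, end + upstream_nucl)
    else:
        raise ValueError("strand must be '+' or '-'")
    return begin, stop

# Made-up coordinates on a 10,000 bp sequence
print(upstream_window(1001, 1500, "+", 500, 10000))
print(upstream_window(1500, 2000, "-", 500, 10000))
```

The function returns slice bounds only. A sequence taken from the minus-strand window would still need to be reverse-complemented so that it reads 5' to 3' relative to the gene.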

In [56]:
## Function parameters 
#genome = widgets.Text(value = '/Volumes/OSM_CBR_AF_POMV_work/POMV_RNA_seq/Genomes/Salmo_salar/GCF_000233375.1_ICSASG_v2_genomic.fna', description='Genome:',disabled=False)

genome = widgets.Text(description='Genome:',disabled=False)

background_outfile = widgets.Text(value = 'background_sequences.txt', description = 'Output file', disabled=False)

upstream_nucl1 = widgets.IntSlider(value=5000, min=100, max=10000, step=100, description='Sequence length', disabled=False, continuous_update=False, orientation='horizontal', readout=True, readout_format='d', style=style)

upstream_nucl1.style.handle_color = 'lightblue'

ui3 = widgets.VBox([genome, background_outfile, upstream_nucl1])

display(ui3)

In [93]:
button3=widgets.Button(description="Create Background Fasta",layout=Layout(width='30%', height='40px'), button_style = 'info')
out3=widgets.Output()

def on_button_clicked(button3):

    with out3:
        clear_output(wait = True)
        # Both the GFF file (for the coordinates) and the genome are required
        if gff.value == '' or genome.value == '':
            print("Filepath to GFF file and/or Genome file not provided")
        else:
            CDS_start_points1 = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
            create_background_fasta(CDS_start_points1, genome.value, background_outfile.value, upstream_nucl1.value)
            print("Background sequences located in:",background_outfile.value,"and have a length of:",upstream_nucl1.value,"nucleotides")

button3.on_click(on_button_clicked)

display(button3, out3)

### Target sequences

In [96]:
def create_target_fasta(DEgenes1, CDS_start_points1, genome, target_outfile, upstream_nucl):
    genome = Fasta(genome)
    # Keep only the features whose gene_id is among the differentially expressed genes
    newdf = pd.merge(DEgenes1, CDS_start_points1, on='gene_id')
    seq_list = []

    with open(target_outfile, "w") as outfile:
        for index, row in newdf.iterrows():
            start = int(row['start'])
            # Clamp at the beginning of the sequence when fewer than upstream_nucl bases are available
            begin = max(0, start - upstream_nucl)
            # The window is taken to the left of 'start' regardless of strand
            sequences = genome[row['seqid']][begin:start + 3]
            seq_list.append({row['gene_id']: sequences})
            outfile.write(">" + row['gene_id'] + " " + sequences.fancy_name + "\n" + sequences.seq + "\n")

    return seq_list
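
The `pd.merge` call in `create_target_fasta` joins the two tables on their shared `gene_id` column and, by default, performs an inner join. A minimal sketch of what that implies, using invented gene IDs and coordinates:

```python
import pandas as pd

# Hypothetical inputs; IDs and coordinates are made up
de_genes = pd.DataFrame({"gene_id": ["101", "102", "999"]})
features = pd.DataFrame({
    "seqid": ["chr1", "chr1", "chr2"],
    "gene_id": ["101", "102", "103"],
    "strand": ["+", "-", "+"],
    "start": [12000, 48000, 3000],
})

# Inner join: only genes present in both tables survive
targets = pd.merge(de_genes, features, on="gene_id")
print(targets)

# Differentially expressed genes without a matching feature are silently dropped
unmatched = sorted(set(de_genes["gene_id"]) - set(targets["gene_id"]))
print("No feature found for:", unmatched)
```

Comparing the number of differentially expressed genes with the number of merged rows is a quick way to spot ID mismatches, for example when the CSV and the GFF file use different identifier types.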

In [97]:
## Function parameters 
target_outfile = widgets.Text(value = 'target_sequences.txt', description = 'Output file', disabled=False)

upstream_nucl2 = widgets.IntSlider(value=5000, min=100, max=10000, step=100, description='Sequence length', disabled=False, continuous_update=False, orientation='horizontal', readout=True, readout_format='d', style=style)

upstream_nucl2.style.handle_color = 'lightblue'

ui4 = widgets.VBox([target_outfile, upstream_nucl2])

display(ui4)

In [98]:
button4=widgets.Button(description="Create Target Fasta",layout=Layout(width='30%', height='40px'), button_style = 'info')
out4=widgets.Output()

def on_button_clicked(button4):

    with out4:
        clear_output(wait = True)
        # The gene list, the GFF file and the genome are all required
        if filepath.value == '' or gff.value == '' or genome.value == '':
            print("Filepaths to the differentially expressed genes, GFF file and Genome file must all be provided")
        else:
            DEgenes1 = get_degenes(filepath.value, gene_id.value, threshold.value, threshold_col_id.value)
            CDS_start_points1 = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
            create_target_fasta(DEgenes1, CDS_start_points1, genome.value, target_outfile.value, upstream_nucl2.value)
            print("Target sequences located in:",target_outfile.value,"and have a length of:",upstream_nucl2.value,"nucleotides")

button4.on_click(on_button_clicked)

display(button4, out4)

In [61]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')