# Finding Transcription Factor Binding Sites

## Overall purpose of this Graphical User Interface

Extract the upstream sequences of differentially expressed genes so that they can be searched for transcription factor binding sites (TFBS) with oPOSSUM.

**The basic inputs of the function include:**
1. CSV file with differentially expressed genes (e.g. output from limma-voom, edgeR or DESeq2)
2. reference genome
3. gff annotation file

**The outputs from this function are:**
1. FASTA file with upstream **background sequences** from random genes in the genome
2. FASTA file with upstream **target sequences** from differentially expressed genes

The background and target sequences can be used to query servers such as oPOSSUM to find TFBS.

Link to oPOSSUM website: http://opossum.cisreg.ca/cgi-bin/oPOSSUM3/opossum_seq_ssa
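
For orientation, both output files are plain FASTA: a header line starting with `>` followed by the sequence. The sketch below shows that layout with made-up gene IDs and sequences. The helper name `format_fasta` is hypothetical and not part of this notebook, and the line wrapping is optional (the functions further down write each sequence on a single line).

```python
def format_fasta(records, width=60):
    """Return FASTA text for an iterable of (header, sequence) pairs."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        # Wrap the sequence at a fixed width (optional in FASTA).
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

# Made-up example records, for illustration only.
example = [("gene1", "ACGTACGTAC"), ("gene2", "TTGACA")]
fasta_text = format_fasta(example, width=8)
print(fasta_text)
```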

## Transcription factor binding sites (TFBS)

Transcription factor binding sites (TFBS) are short DNA sequences (typically 4 to 30 base pairs long) that are specifically bound by one or more DNA-binding proteins or protein complexes. They are often located in the 5' upstream region of target genes (up to 10,000 nucleotides upstream), where they modulate the rate of gene transcription.

In [156]:
from IPython.display import display, Image
Image(url = 'https://ka-perseus-images.s3.amazonaws.com/6567f50d30ad3ac65aff1e815caf202b3abd7111.png')

# Steps

## 1. Load and filter differentially expressed genes using a threshold value

- Enter the filepath to the file with differentially expressed genes produced by edgeR, limma-voom or DESeq2
- Enter the header of the column that contains gene names (defaults to ENTREZID)
- Enter the header of the column used as the filtering threshold (defaults to logFC)
- Move the threshold slider to the desired value (defaults to 2; genes with a value >= threshold or <= -threshold are kept)
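
The filtering step can be illustrated on a toy table. This is only a sketch under stated assumptions: the values are made up, and the names `toy` and `selected` are hypothetical. It keeps genes whose absolute log fold change meets the threshold and converts the IDs to strings.

```python
import pandas as pd

# Toy table standing in for a differential expression result (values are made up).
toy = pd.DataFrame({
    "ENTREZID": [101, 102, 103, 104],
    "logFC": [2.5, -3.0, 0.4, -1.9],
})

cutoff = 2.0
# Keep genes whose absolute log fold change meets the cutoff.
selected = toy.loc[toy["logFC"].abs() >= cutoff, ["ENTREZID"]]
# Store IDs as strings so they can later be matched against IDs parsed from the GFF.
selected = selected.astype({"ENTREZID": str}).rename(columns={"ENTREZID": "gene_id"})
print(selected["gene_id"].tolist())
```

Gene 104 (logFC -1.9) falls just short of the cutoff and is dropped, which is the behaviour to expect when moving the slider.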

In [157]:
# Import the widgets
from ipywidgets import widgets, interact, interactive, Button, Layout

# Import the display function for explicitly displaying widgets in the notebook
from IPython.display import display, clear_output

In [158]:
#Import packages
import pandas as pd
import numpy as np
import math

def get_degenes(filepath, gene_id, threshold, threshold_col_id):
    """Return a one-column DataFrame ('gene_id') with the genes passing the threshold."""
    genes = pd.read_csv(filepath)
    genes = genes.dropna()
    # Gene IDs must be strings to match the IDs parsed from the GFF.
    # Float IDs (e.g. 1234.0) are cast to int first so they become '1234'.
    if pd.api.types.is_float_dtype(genes[gene_id]):
        genes = genes.astype({gene_id: int})
    genes = genes.astype({gene_id: str})

    # Keep genes whose value is >= threshold or <= -threshold
    DEgenes = genes.loc[genes[threshold_col_id].abs() >= threshold, [gene_id]]
    # rename returns a new DataFrame, which avoids a SettingWithCopyWarning
    DEgenes = DEgenes.rename(columns={gene_id: 'gene_id'})

    return DEgenes

In [159]:
## Function parameters and styling of widgets

style = {'description_width': 'initial'}

filepath = widgets.Text(value = '', 
                        description='Filepath:',disabled=False)

gene_id = widgets.Text(value = 'ENTREZID', 
                       description='Gene ID:', disabled=False)

threshold_col_id = widgets.Text(value = 'logFC', 
                                description='Threshold column ID:', 
                                disabled=False, style=style)

threshold = widgets.FloatSlider(value=2.0, min=0.5, max=10.0, step=0.5, 
                                description='Threshold', 
                                disabled=False, continuous_update=False, 
                                orientation='horizontal', readout=True, 
                                readout_format='.1f', style=style)

threshold.style.handle_color = 'lightblue'

ui1 = widgets.VBox([filepath, gene_id, threshold_col_id, threshold])

display(ui1)

In [160]:
# Styling of button and configuration of button clicked

button1 = widgets.Button(description="Filter Genes", 
                    layout=Layout(width='30%', height='40px'), 
                    button_style = 'info', 
                    style = {'font_weight': 'bold', 'font-size' : '30px'}, 
                    tooltip = 'Description', icon = 'check')

out1=widgets.Output()

def on_button_clicked(button1): 
    
    if filepath.value == '':
        with out1:
            clear_output(wait = True)
            print("Filepath to differentially expressed genes not provided")
            pass    
    else: 
        with out1: 
            clear_output(wait = True)
            DEgenes = get_degenes(filepath.value, gene_id.value, threshold.value, threshold_col_id.value)
            print("Total number of differentially expressed genes:", len(DEgenes))

button1.on_click(on_button_clicked)

display(button1, out1)

## 2. Extract features and coordinates from the GFF file

- Enter the filepath to the GFF file
- Select the feature type to extract from the GFF (defaults to CDS)
- Enter the tag or label for gene IDs in the GFF. The gene IDs must match the IDs in the differentially expressed genes file (defaults to 'GeneID')
- Enter the tag or label of another attribute to extract from the GFF (defaults to 'product', the gene product)
- Select how the start coordinate of each gene is chosen (defaults to min, the smallest start position among the gene's features in the GFF)
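
To make the tag lookup concrete, here is a minimal sketch of pulling two values out of a GFF attributes field with a regular expression. The attributes string and its IDs are made up, and `extract_tag` is a hypothetical helper name; the pattern assumes values end at the next `,` or `;` or at the end of the field.

```python
import re

# A made-up attributes field in the style of an NCBI GFF3 record.
attributes = "ID=cds0;Parent=rna0;Dbxref=GeneID:12345,Genbank:XP_000000.1;product=example protein"

def extract_tag(attributes_str, tag):
    """Return the value following `tag`, up to the next ',' or ';' or the end."""
    pattern = re.compile(r"{}\W(?P<value>[^,;]+)".format(re.escape(tag)))
    match = pattern.search(attributes_str)
    return match.group("value") if match else ""

parsed_id = extract_tag(attributes, "GeneID")
parsed_product = extract_tag(attributes, "product")
print(parsed_id, "|", parsed_product)
```

A tag that is absent from the field yields an empty string, so such rows would not match any gene ID from the differential expression file.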

In [161]:
import re

In [162]:
def get_features(gff, feature, search_gff, attribute, coord):
    col_names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    mygff = pd.read_csv(gff, sep='\t', comment='#', low_memory=False, header=None, names=col_names)
    CDS = mygff[mygff.type == feature]
    CDS = CDS.copy()

    def extract_tag(attributes_str, tag):
        # Capture the value after `tag` up to the next ',' or ';' (or the end of
        # the field, so a final attribute without a trailing ';' is still found).
        # re.escape guards against regex metacharacters in the user-supplied tag.
        res = re.search(r'{}\W(?P<value>[^,;]+)'.format(re.escape(tag)), attributes_str)
        return '' if res is None else res.group('value')

    CDS['gene_id'] = CDS.attributes.apply(extract_tag, tag=search_gff)
    CDS['attribute'] = CDS.attributes.apply(extract_tag, tag=attribute)

    CDS.drop('attributes', axis=1, inplace=True)
    
    if coord == 'all':
        CDS_start_points = CDS
    elif coord == 'min':
        CDS_start_points = (CDS.groupby(['seqid', 'gene_id', 'attribute', 'strand'], as_index=False)['start'].min())
    elif coord == 'max':
        CDS_start_points = (CDS.groupby(['seqid', 'gene_id', 'attribute', 'strand'], as_index=False)['start'].max())
    elif coord == 'median':
        CDS_start_points = (CDS.groupby(['seqid', 'gene_id', 'attribute', 'strand'], as_index=False)['start'].median())
        CDS_start_points = CDS_start_points.astype({'start':int})
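    else:
        # Fail loudly instead of reaching the return with CDS_start_points undefined
        raise ValueError("Invalid value for coord: {!r}; expected 'all', 'min', 'max' or 'median'".format(coord))

    return CDS_start_points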

In [163]:
## Function parameters and styling of widgets

gff = widgets.Text(description='Filepath:',
                    disabled=False, 
                    style=style)

feature = widgets.RadioButtons(options=['CDS', 'exon'], 
                                value = 'CDS', 
                                description='Feature:', 
                                disabled=False, 
                                style=style)


search_gff = widgets.Text(value = 'GeneID', 
                          description='GeneID label', 
                          disabled=False, 
                          style=style)

attribute = widgets.Text(value = 'product', 
                         description='Other labels', 
                         disabled=False, 
                         style=style)

coord = widgets.RadioButtons(options=['all', 'min', 'max', 'median'],
                             value='min', description='Start coordinates',
                             disabled=False, 
                             style=style)

ui2 = widgets.VBox([gff, feature, search_gff, attribute, coord])

display(ui2)

In [164]:
# Styling of button and configuration of button clicked

button2 = widgets.Button(description="Get Features and Coordinates",
                            layout=Layout(width='30%', height='40px'), 
                            button_style = 'info', 
                            style = {'font_weight': 'bold', 'font-size' : '30px'}, 
                            tooltip = 'Description', 
                            icon = 'check')

out2=widgets.Output()

def on_button_clicked(button2):
    
    if gff.value == '':
        with out2:
            clear_output(wait = True)
            print("Filepath to GFF file not provided")
            pass
    
    else: 
        with out2: 
            clear_output(wait = True)
            CDS_start_points = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
            print("Total number of selected features:", len(CDS_start_points),'\n',CDS_start_points.head(10))

button2.on_click(on_button_clicked)

display(button2, out2)

## 3. Extract background and target sequences from genome

- Enter the filepath to the species genome (FASTA)
- Enter a filepath for the output FASTA file with background sequences
- Select the number of nucleotides to extract upstream of the start coordinates of background and target genes
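
The functions in this step take the window to the left of the start coordinate regardless of strand. As a point of comparison, the sketch below shows one way an upstream window could be computed with the strand taken into account. It is an illustration under stated assumptions, not the notebook's method: coordinates are 1-based and inclusive as in GFF, the chromosome is a plain Python string, the helper names are hypothetical, and an `end` coordinate is assumed to be available (the grouped table from step 2 does not carry one for the min, max and median options).

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]

def upstream_sequence(chrom_seq, start, end, strand, n_upstream):
    """Return up to n_upstream bases 5' of a feature.

    start and end are 1-based inclusive GFF coordinates.
    """
    if strand == "+":
        # Bases immediately left of `start`; clamp at the chromosome start.
        lo = max(0, start - 1 - n_upstream)
        return chrom_seq[lo:start - 1]
    # Minus strand: upstream lies to the right of `end` on the forward strand,
    # and is reported as read from the minus strand.
    hi = min(len(chrom_seq), end + n_upstream)
    return reverse_complement(chrom_seq[end:hi])

chrom = "ACGTTGCAAGGCTTAC"  # made-up 16-base "chromosome"
plus_up = upstream_sequence(chrom, start=9, end=12, strand="+", n_upstream=4)
minus_up = upstream_sequence(chrom, start=5, end=8, strand="-", n_upstream=4)
print(plus_up, minus_up)
```

Clamping with `max` and `min` handles features that sit closer to a contig edge than the requested window, returning whatever sequence is available.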

### Background sequences

In [165]:
from pyfaidx import Fasta

def create_background_fasta(CDS_start_points1, genome, background_outfile, upstream_nucl):
    genome = Fasta(genome)
    # Sample up to 500 random features; sample() raises if asked for more rows than exist
    CDS_random = CDS_start_points1.sample(min(500, len(CDS_start_points1)))

    # Note: the window is taken to the left of 'start' regardless of strand
    with open(background_outfile, "w") as outfile:
        for index, row in CDS_random.iterrows():
            if row['start'] > upstream_nucl:
                sequences = genome[row['seqid']][row['start'] - upstream_nucl:row['start'] + 3]
            else:
                # Feature lies closer to the contig start than upstream_nucl: take what is available
                sequences = genome[row['seqid']][0:row['start']]
            outfile.write(">" + row['gene_id'] + " " + sequences.fancy_name + "\n" + sequences.seq + "\n")

In [166]:
## Function parameters and styling of widgets

genome = widgets.Text(description='Genome:',
                      disabled=False)

background_outfile = widgets.Text(description = 'Output file', 
                                  disabled=False)

upstream_nucl1 = widgets.IntSlider(value=5000, min=100, max=10000, step=100, 
                                   description='Sequence length', 
                                   disabled=False, 
                                   continuous_update=False, 
                                   orientation='horizontal', 
                                   readout=True, 
                                   readout_format='d', 
                                   style=style)

upstream_nucl1.style.handle_color = 'lightblue'

ui3 = widgets.VBox([genome, background_outfile, upstream_nucl1])

display(ui3)

In [167]:
# Styling of button and configuration of button clicked

button3=widgets.Button(description="Create Background Fasta",
                                layout=Layout(width='30%', height='40px'), 
                                button_style = 'info', 
                                style = {'font_weight': 'bold', 'font-size' : '30px'}, 
                                tooltip = 'Description', icon = 'check')


out3=widgets.Output()

def on_button_clicked(button3):
    
    if genome.value == '':
        with out3:
            clear_output(wait = True)
            print("Filepath to Genome file not provided")
            pass
    
    else: 
        with out3: 
            clear_output(wait = True)
            CDS_start_points1 = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
            create_background_fasta(CDS_start_points1, genome.value, background_outfile.value, upstream_nucl1.value)
            print("Your background sequences are located in:",background_outfile.value,"and have a length of:",upstream_nucl1.value,"nucleotides")

button3.on_click(on_button_clicked)

display(button3, out3)

### Target sequences

In [168]:
def create_target_fasta(DEgenes1, CDS_start_points1, genome, target_outfile, upstream_nucl):
    genome = Fasta(genome)
    # Join on the shared 'gene_id' column to keep only differentially expressed genes
    newdf = pd.merge(DEgenes1, CDS_start_points1, on='gene_id')
    seq_list = []

    # Note: the window is taken to the left of 'start' regardless of strand
    with open(target_outfile, "w") as outfile:
        for index, row in newdf.iterrows():
            if row['start'] > upstream_nucl:
                sequences = genome[row['seqid']][row['start'] - upstream_nucl:row['start'] + 3]
            else:
                # Feature lies closer to the contig start than upstream_nucl: take what is available
                sequences = genome[row['seqid']][0:row['start']]
            seq_list.append({row['gene_id']: sequences})
            outfile.write(">" + row['gene_id'] + " " + sequences.fancy_name + "\n" + sequences.seq + "\n")

    return seq_list

In [169]:
## Function parameters and styling of widgets
target_outfile = widgets.Text(description = 'Output file', 
                              disabled=False)

upstream_nucl2 = widgets.IntSlider(value=5000, min=100, max=10000, step=100, 
                                   description='Sequence length', 
                                   disabled=False, 
                                   continuous_update=False, 
                                   orientation='horizontal', 
                                   readout=True, 
                                   readout_format='d', 
                                   style=style)

upstream_nucl2.style.handle_color = 'lightblue'

ui4 = widgets.VBox([target_outfile, upstream_nucl2])

display(ui4)

In [170]:
# Styling of button and configuration of button clicked

button4=widgets.Button(description="Create Target Fasta",
                                layout=Layout(width='30%', height='40px'), 
                                button_style = 'info', 
                                style = {'font_weight': 'bold', 'font-size' : '30px'}, 
                                tooltip = 'Description', icon = 'check')

out4=widgets.Output()

def on_button_clicked(button4):

    if genome.value == '':
        with out4:
            clear_output(wait = True)
            print("Filepath to Genome file not provided")
            pass
    
    else: 
        with out4: 
            clear_output(wait = True)
            DEgenes1 = get_degenes(filepath.value, gene_id.value, threshold.value, threshold_col_id.value)
            CDS_start_points1 = get_features(gff.value, feature.value, search_gff.value, attribute.value, coord.value)
            create_target_fasta(DEgenes1, CDS_start_points1, genome.value, target_outfile.value, upstream_nucl2.value)    
            print("Your target sequences are located in:",target_outfile.value,"and have a length of:",upstream_nucl2.value,"nucleotides")

button4.on_click(on_button_clicked)

display(button4, out4)

# Additional Functions

## Extract sequences from the genome
The gene ID entered in the 'Find Gene ID' field must match a gene ID in the GFF file under the tag given in the 'GeneID tag in GFF' field.
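
When pulling a feature's sequence, the coordinate convention matters: GFF positions are 1-based and inclusive, while Python-style slicing is 0-based and half-open. The sketch below shows the conversion on a made-up string; `gff_to_slice` is a hypothetical helper, not part of this notebook.

```python
def gff_to_slice(start, end):
    """Convert 1-based inclusive GFF coordinates to a 0-based half-open slice."""
    return slice(start - 1, end)

chrom = "ACGTTGCAAGGCTTAC"  # made-up sequence
feature_seq = chrom[gff_to_slice(3, 6)]
# The extracted length equals end - start + 1.
print(feature_seq, len(feature_seq))
```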

In [171]:
def extract_gene_sequences(gff, feature, search_gff, genome, query_gene):
    col_names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    mygff = pd.read_csv(gff, sep='\t', comment='#', low_memory=False, header=None, names=col_names)
    mygff = mygff[mygff.type == feature]
    mygff = mygff.copy()

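    # re.escape guards against regex metacharacters in the tag; [^,;]+ also matches
    # a final attribute that has no trailing ';'
    RE_GENE_NAME = re.compile(r'{}\W(?P<gene_id>[^,;]+)'.format(re.escape(search_gff)))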
    def extract_gene_name(attributes_str):
        res = RE_GENE_NAME.search(attributes_str)
        if res is None:
            return ''
        else:
            return res.group('gene_id')
    mygff['gene_id'] = mygff.attributes.apply(extract_gene_name)
    mygff.drop('attributes', axis=1, inplace=True)
    
    genome = Fasta(genome)
    
    mygff_idx = mygff.set_index('gene_id')
    
    if query_gene not in mygff_idx.index:           
        print("Gene query was not found in GFF")
        pass
    
    else: 
      
        seq_list = []
        seq_dict = {}
    
        genes = mygff_idx.loc[query_gene]
       
        # GFF coordinates are 1-based and inclusive; slicing the genome is 0-based
        # and half-open, hence start - 1 (consistent with the loop in the else branch)
        if feature == 'gene' or len(genes.axes) == 1:
            seqid = genes['seqid']
            start = genes['start']
            end = genes['end']
            sequence = genome[seqid][start - 1:end]
            seq_dict[query_gene] = sequence
            seq_list.append(seq_dict)
        
        else:
            genes = mygff_idx.loc[query_gene].index
            seqid = mygff_idx.loc[query_gene]['seqid']
            start = mygff_idx.loc[query_gene]['start']
            end = mygff_idx.loc[query_gene]['end']

            for gene, x, y, z in zip(genes, seqid, start, end):
                sequence = genome[x][y - 1:z]
                seq_dict[gene] = sequence
                seq_list.append(seq_dict)
                seq_dict = {}
    
        return seq_list

In [172]:
## Function parameters and styling of widgets

gff2 = widgets.Text(value = '/OSM/CBR/AF_POMV/work/POMV_RNA_seq/Genomes/Salmo_salar/GCF_000233375.1_ICSASG_v2_genomic.gff', 
                    description='GFF:',disabled=False)

feature2 = widgets.RadioButtons(options=['gene', 'CDS', 'exon'], 
                                value = 'gene', 
                                description='Feature:', 
                                disabled=False)

search_gff2 = widgets.Text(value = 'GeneID', 
                           description='GeneID tag in GFF', 
                           disabled=False, 
                           style=style)

genome2 = widgets.Text(value = '/OSM/CBR/AF_POMV/work/POMV_RNA_seq/Genomes/Salmo_salar/GCF_000233375.1_ICSASG_v2_genomic.fna', 
                       description='Genome:',
                       disabled=False)

query_gene = widgets.Text(description='Find Gene ID', 
                          disabled=False, 
                          style=style)

ui5 = widgets.VBox([gff2, feature2, search_gff2, genome2, query_gene])

display(ui5)

In [173]:
# Styling of button and configuration of button clicked

button_gene = widgets.Button(description="Find Gene Sequence",
                                    layout=Layout(width='30%', height='40px'), 
                                    button_style = 'info', 
                                    style = {'font_weight': 'bold', 'font-size' : '30px'}, 
                                    tooltip = 'Description', 
                                    icon = 'check')

out_gene=widgets.Output()

def on_button_clicked(button_gene):
    
    if query_gene.value == '':
        with out_gene:
            clear_output(wait = True)
            print("Gene ID not provided")
            pass

    else: 
        with out_gene: 
            clear_output(wait = True)
            myseq = extract_gene_sequences(gff2.value, feature2.value, search_gff2.value, genome2.value, query_gene.value)
            print(myseq)
            
button_gene.on_click(on_button_clicked)

display(button_gene, out_gene)

In [174]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')