# Developing a function to find transcription factor binding sites from differentially expressed genes

## Function aim
Build a command-line Python program to count the number of transcription factor binding sites (TFBSs) in the upstream sequences of differentially expressed genes.

## What are transcription factors
Transcription factors are proteins that regulate the transcription of genes, that is, their copying into RNA on the way to making a protein.

https://www.khanacademy.org/science/biology/gene-regulation/gene-regulation-in-eukaryotes/a/eukaryotic-transcription-factors

## How do transcription factors work?

A typical transcription factor binds to DNA at a certain target sequence. Once it's bound, the transcription factor makes it either harder or easier for RNA polymerase to bind to the promoter of the gene.

Some transcription factors **activate transcription**. For instance, they may help the general transcription factors and/or RNA polymerase bind to the promoter, as shown in the diagram below.


In [4]:
# @hidden_cell
from IPython.display import display, Image
Image(url = 'https://ka-perseus-images.s3.amazonaws.com/6567f50d30ad3ac65aff1e815caf202b3abd7111.png')

Other transcription factors **repress transcription**. This repression can work in a variety of ways. As one example, a repressor may get in the way of the basal transcription factors or RNA polymerase, making it so they can't bind to the promoter or begin transcription.

## Binding sites

A typical transcription factor binds to DNA at a certain target sequence (or motif). Once it's bound, the transcription factor makes it either harder or easier for RNA polymerase to bind to the promoter of the gene, and consequently regulates the amount of messenger RNA (mRNA) produced by the gene. Some transcription factors activate transcription, while others repress transcription.
 
Transcription factor binding sites (TFBSs) are often located in the 5’-upstream region of target genes, where they modulate the rate of gene transcription. DNA binding sites can thus be defined as short DNA sequences (typically 4 to 30 base pairs long) that are specifically bound by one or more DNA-binding proteins or protein complexes.

In [5]:
# @hidden_cell
Image(url = 'https://ka-perseus-images.s3.amazonaws.com/1ba8fe2b28b3dd5cd79ec75b74982ee87692dc9e.png')

The flexibility of DNA is what allows transcription factors at distant binding sites to do their job. The DNA loops like cooked spaghetti to bring far-off binding sites and transcription factors close to general transcription factors or "mediator" proteins.

In the cartoon above, an activating transcription factor bound at a far-away site helps RNA polymerase bind to the promoter and start transcribing.

To find and count TFBSs, my function should complete the following tasks:
1. Find the coordinates of differentially expressed genes or target genes
2. Extract the upstream sequences of these target genes
3. Find and count transcription factor binding sites (TFBSs) in the extracted sequences
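
Task 3 can be illustrated on a toy sequence before any real data is loaded. The sketch below counts exact occurrences of a motif on both strands, including overlapping ones. It assumes the motif is a plain string, which is a simplification because real binding sites are often described by degenerate patterns or weight matrices. The function names `reverse_complement` and `count_motif` and the example sequence are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_motif(sequence, motif):
    """Count occurrences of `motif` on both strands, including overlapping ones."""
    sequence = sequence.upper()
    motif = motif.upper()
    # A zero-width lookahead lets re.finditer report overlapping matches
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    forward = sum(1 for _ in pattern.finditer(sequence))
    rc = reverse_complement(motif)
    if rc == motif:
        # Palindromic motif: the forward hits already cover both strands
        return forward
    rc_pattern = re.compile("(?=" + re.escape(rc) + ")")
    reverse = sum(1 for _ in rc_pattern.finditer(sequence))
    return forward + reverse


# Hypothetical upstream sequence and an arbitrary example motif
upstream = "ttgacgtcaTGACGTCAgg"
print(count_motif(upstream, "TGACGTCA"))  # 2
print(count_motif("AAAA", "AA"))          # 3 (overlapping matches are counted)
```

Searching the forward strand for the reverse complement of the motif is equivalent to searching the reverse strand for the motif itself, so the extracted sequence never has to be reverse-complemented.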

### 0. Filter differentially expressed genes by logFC values

In [2]:
import pandas as pd
import numpy as np
import math

In [58]:
# Arguments to pass to first function 
filepath = '/Volumes/sam079/RNAseq-POMV/Results/ControlvsPOMV6_ALL.csv'
gene_name_columnid = 'ENTREZID'
threshold = 2
threshold_column_id = 'logFC'
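
Because the aim is a command-line program, the four values above could eventually come from the command line instead of being hard-coded. A minimal sketch with `argparse` follows. The option names, the defaults and the `build_parser` helper are hypothetical choices made for illustration; only the meaning of each value comes from the notebook.

```python
import argparse


def build_parser():
    """Build a parser for the four inputs of the filtering step.

    The option names are hypothetical, chosen only for this sketch.
    """
    parser = argparse.ArgumentParser(
        description="Filter differentially expressed genes by logFC."
    )
    parser.add_argument("filepath",
                        help="CSV file of differential expression results")
    parser.add_argument("--gene-column", default="ENTREZID",
                        help="column holding the gene identifiers")
    parser.add_argument("--threshold", type=float, default=2.0,
                        help="minimum absolute value in the threshold column")
    parser.add_argument("--threshold-column", default="logFC",
                        help="column the threshold is applied to")
    return parser


# Parse an explicit argument list so the example also runs inside a notebook
args = build_parser().parse_args(["results.csv", "--threshold", "1.5"])
print(args.filepath, args.gene_column, args.threshold, args.threshold_column)
```

In a script, calling `parse_args()` with no argument list would read the values from `sys.argv` instead.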

In [66]:
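def filter_DE_genes_names(filepath, columnid, threshold, threshold_column_id):
    """Return the IDs of genes whose absolute logFC is at least `threshold`."""
    genes = pd.read_csv(filepath) # might crash in the office
    genes = genes.dropna()

    # Gene IDs must be strings so they can later be merged with the annotation
    if pd.api.types.is_float_dtype(genes[columnid]):
        # Go through int first so that 100380312.0 becomes '100380312'
        genes = genes.astype({columnid: int}).astype({columnid: str})
    elif pd.api.types.is_integer_dtype(genes[columnid]):
        genes = genes.astype({columnid: str})
    else:
        print("gene names are strings, great!")

    # Keep genes that are up- or down-regulated beyond the threshold
    DEgenes = genes.loc[genes[threshold_column_id].abs() >= threshold]
    # Use the function argument here, not the global gene_name_columnid
    DEgenes = DEgenes[[columnid]]

    return DEgenes

DEgenes = filter_DE_genes_names(filepath, gene_name_columnid, threshold, threshold_column_id)
print(DEgenes)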

        ENTREZID
195    100380312
297    100286653
355    106607818
374    100136513
476    100195251
484    106560450
522    100302577
537    100195148
603    100196303
632    106586518
648    106580888
729    106612490
747    100306782
851    100380724
956    100337625
1050   100196244
1078   100136541
1083   100136587
1089   100194553
1106   100194720
1107   100194722
1125   100196683
1237   106560422
1285   106560850
1304   106561065
1323   106561186
1339   106561321
1362   106561470
1363   106561472
1375   106561504
...          ...
9143   106610285
9144   106610286
9145   106610287
9146   106610288
9206   106610637
9271   106610967
9319   106611246
9462   106611944
9539   106612344
9557   106612527
9692   106613461
9702   106613521
9704   106613533
9717   106613622
9731   106613674
9747   106613747
9942   100194889
9943   100136920
10063  100196022
10397  100196087
10428  100196194
10573  100196256
10682  106602560
10728  106586517
10736  106587588
10756  100380557
10780  1001963

### 1. Find the coordinates of differentially expressed genes or target genes

https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy

In [127]:
col_names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']

In [168]:
df = pd.read_csv('/Users/sam079/Documents/2018_Transcription_factors/Data/GCF_000233375.1_ICSASG_v2_genomic.gff',
                 sep='\t', comment='#', low_memory=False,
                 header=None, names=col_names)

In [169]:
df.seqid.unique()

array(['NC_027300.1', 'NC_027301.1', 'NC_027302.1', ..., 'NW_012498597.1',
       'NW_012561447.1', 'NC_001960.1'], dtype=object)

In [170]:
df.type.value_counts()

exon                     1313909
CDS                      1155790
region                    232155
mRNA                       97746
gene                       79030
cDNA_match                 41529
tRNA                       23676
lnc_RNA                     9435
transcript                  2603
pseudogene                  2556
V_gene_segment               144
C_gene_segment                38
rRNA                           2
D_loop                         1
origin_of_replication          1
Name: type, dtype: int64

In [171]:
bestref = df[df.source.isin(['BestRefSeq'])]
bestref.sample(20)

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,attributes
765926,NC_027308.1,BestRefSeq,CDS,17016238,17016423,.,-,2,ID=cds28891;Parent=rna33150;Dbxref=GeneID:1001...
820820,NC_027308.1,BestRefSeq,exon,66053376,66054575,.,-,.,ID=id398971;Parent=rna35258;Dbxref=GeneID:1001...
2170389,NC_027321.1,BestRefSeq,exon,51124260,51124558,.,-,.,ID=id1057465;Parent=rna87789;Dbxref=GeneID:100...
453264,NC_027303.1,BestRefSeq,CDS,48241592,48241753,.,-,2,ID=cds16500;Parent=rna18890;Dbxref=GeneID:1001...
2125571,NC_027321.1,BestRefSeq,exon,17220330,17220397,.,+,.,ID=id1035226;Parent=rna86112;Dbxref=GeneID:100...
2204442,NC_027322.1,BestRefSeq,CDS,22692679,22692743,.,+,2,ID=cds78518;Parent=rna89039;Dbxref=GeneID:1001...
2584578,NW_012341938.1,BestRefSeq,exon,10281,10378,.,-,.,ID=id1265106;Parent=rna104255;Dbxref=GeneID:10...
1499991,NC_027314.1,BestRefSeq,exon,17399415,17399543,.,+,.,ID=id729124;Parent=rna61976;Dbxref=GeneID:1001...
766023,NC_027308.1,BestRefSeq,exon,17068049,17068125,.,-,.,ID=id372042;Parent=rna33154;Dbxref=GeneID:1001...
1176020,NC_027311.1,BestRefSeq,CDS,42141299,42141371,.,+,0,ID=cds43438;Parent=rna49449;Dbxref=GeneID:1001...


In [172]:
CDS = bestref[bestref.type == 'CDS']

In [173]:
CDS = CDS.copy()

In [174]:
CDS.sample(10).attributes.values

array(['ID=cds52243;Parent=rna59369;Dbxref=GeneID:100380835,Genbank:NP_001167428.1;Name=NP_001167428.1;Note=The RefSeq protein aligns at 85%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=orc2;inference=similar to AA sequence (same species):RefSeq:NP_001167428.1;partial=true;product=Origin recognition complex subunit 2;protein_id=NP_001167428.1',
       'ID=cds134;Parent=rna161;Dbxref=GeneID:100196112,Genbank:NP_001134613.1;Name=NP_001134613.1;Note=The RefSeq protein has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=cdkn3;inference=similar to AA sequence (same species):RefSeq:NP_001134613.1;product=cyclin-dependent kinase inhibitor 3;protein_id=NP_001134613.1',
       'ID=cds44351;Parent=rna50430;Dbxref=GeneID:100195178,Genbank:NP_001133679.1;Name=NP_001133679.1;gbkey=CDS;gene=rheb;product=ras homolog enriched in brain;protein_id=NP_001133679.1',
      

In [175]:
import re


RE_GENE_NAME = re.compile(r'GeneID:(?P<ENTREZID>[0-9]+)')

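def extract_gene_name(attributes_str):
    res = RE_GENE_NAME.search(attributes_str)
    # Return an empty string instead of failing when no GeneID is present
    if res is None:
        return ''
    else:
        return res.group('ENTREZID')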


CDS['ENTREZID'] = CDS.attributes.apply(extract_gene_name)

In [176]:
# [^;]+ also matches when product= is the last field and has no trailing ';'
RE_DESC = re.compile(r'product=(?P<desc>[^;]+)')
def extract_description(attributes_str):
    res = RE_DESC.search(attributes_str)
    if res is None:
        return ''
    else:
        return res.group('desc')


CDS['desc'] = CDS.attributes.apply(extract_description)
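
The two regular expressions above each pull a single field out of the attributes column. A more general sketch splits the whole string into a dictionary, assuming the GFF3 convention of `key=value` pairs separated by `;` with percent-encoded special characters (the `85%25` in one of the sampled rows above is an encoded `%`). The helper name `parse_gff_attributes` is hypothetical, and the example string is copied from the sample output above.

```python
from urllib.parse import unquote


def parse_gff_attributes(attributes_str):
    """Turn a GFF3 attributes string into a dict of decoded key/value pairs."""
    attributes = {}
    for field in attributes_str.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        key, value = field.split("=", 1)
        attributes[key] = unquote(value)
    return attributes


example = ("ID=cds44351;Parent=rna50430;Dbxref=GeneID:100195178,"
           "Genbank:NP_001133679.1;Name=NP_001133679.1;gbkey=CDS;gene=rheb;"
           "product=ras homolog enriched in brain;protein_id=NP_001133679.1")
attrs = parse_gff_attributes(example)

# Dbxref holds comma-separated "database:identifier" entries
dbxref = dict(entry.split(":", 1) for entry in attrs["Dbxref"].split(","))
print(dbxref["GeneID"], attrs["gene"], attrs["product"])
```

With a dictionary per row, any other field (for example `gene` or `protein_id`) can be added as a column without writing another regular expression.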

In [177]:
CDS.drop('attributes', axis=1, inplace=True)
CDS

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ENTREZID,desc
497,NC_027300.1,BestRefSeq,CDS,1342527,1342649,.,+,0,100316613,interferon promoter stimulating protein 1
498,NC_027300.1,BestRefSeq,CDS,1342922,1343099,.,+,0,100316613,interferon promoter stimulating protein 1
499,NC_027300.1,BestRefSeq,CDS,1343567,1344243,.,+,2,100316613,interferon promoter stimulating protein 1
500,NC_027300.1,BestRefSeq,CDS,1345061,1345903,.,+,0,100316613,interferon promoter stimulating protein 1
2020,NC_027300.1,BestRefSeq,CDS,3499221,3499242,.,-,0,100194980,Mediator of RNA polymerase II transcription su...
2021,NC_027300.1,BestRefSeq,CDS,3498755,3498917,.,-,2,100194980,Mediator of RNA polymerase II transcription su...
2022,NC_027300.1,BestRefSeq,CDS,3498083,3498174,.,-,1,100194980,Mediator of RNA polymerase II transcription su...
2023,NC_027300.1,BestRefSeq,CDS,3496072,3496154,.,-,2,100194980,Mediator of RNA polymerase II transcription su...
2024,NC_027300.1,BestRefSeq,CDS,3495610,3495718,.,-,0,100194980,Mediator of RNA polymerase II transcription su...
2025,NC_027300.1,BestRefSeq,CDS,3494733,3494851,.,-,2,100194980,Mediator of RNA polymerase II transcription su...


In [237]:
from pyfaidx import Fasta

In [239]:
genome = Fasta('/Volumes/OSM_CBR_AF_POMV_work/POMV_RNA_seq/Genomes/Salmo_salar/GCF_000233375.1_ICSASG_v2_genomic.fna')

In [240]:
CDS.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ENTREZID,desc
497,NC_027300.1,BestRefSeq,CDS,1342527,1342649,.,+,0,100316613,interferon promoter stimulating protein 1
498,NC_027300.1,BestRefSeq,CDS,1342922,1343099,.,+,0,100316613,interferon promoter stimulating protein 1
499,NC_027300.1,BestRefSeq,CDS,1343567,1344243,.,+,2,100316613,interferon promoter stimulating protein 1
500,NC_027300.1,BestRefSeq,CDS,1345061,1345903,.,+,0,100316613,interferon promoter stimulating protein 1
2020,NC_027300.1,BestRefSeq,CDS,3499221,3499242,.,-,0,100194980,Mediator of RNA polymerase II transcription su...


In [196]:
DEgenes_up = DEgenes_up.astype(str)

In [197]:
DEgenes_up.dtypes

ENTREZID    object
dtype: object

In [198]:
CDS.dtypes

seqid       object
source      object
type        object
start        int64
end          int64
score       object
strand      object
phase       object
ENTREZID    object
desc        object
dtype: object

In [200]:
# Inner join on the shared gene ID column (both sides must be strings)
newdf = pd.merge(DEgenes_up, CDS, on='ENTREZID')

In [203]:
newdf_small = newdf.head()
newdf_small

Unnamed: 0,ENTREZID,seqid,source,type,start,end,score,strand,phase,desc
0,100196194,NC_027314.1,BestRefSeq,CDS,3515019,3515088,.,+,0,Platelet basic protein precursor
1,100196194,NC_027314.1,BestRefSeq,CDS,3515273,3515399,.,+,2,Platelet basic protein precursor
2,100196194,NC_027314.1,BestRefSeq,CDS,3515676,3515768,.,+,1,Platelet basic protein precursor
3,100196194,NC_027314.1,BestRefSeq,CDS,3516526,3516538,.,+,1,Platelet basic protein precursor
4,100136513,NC_027308.1,BestRefSeq,CDS,10605995,10606582,.,-,0,thymidylate kinase
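
The merged table holds one row per CDS segment, so a gene appears several times. For task 1 a single coordinate range per gene may be more convenient. The sketch below collapses the segments of each gene to their outermost coordinates, using rows copied from the table above. It assumes every gene sits on one sequence and one strand, and the name `gene_coords` is hypothetical.

```python
import pandas as pd

# A few rows copied from the merged table above
rows = pd.DataFrame({
    "ENTREZID": ["100196194", "100196194", "100196194", "100196194", "100136513"],
    "seqid": ["NC_027314.1"] * 4 + ["NC_027308.1"],
    "start": [3515019, 3515273, 3515676, 3516526, 10605995],
    "end": [3515088, 3515399, 3515768, 3516538, 10606582],
    "strand": ["+", "+", "+", "+", "-"],
})

# Collapse the CDS segments of each gene to their outermost coordinates
gene_coords = (
    rows.groupby("ENTREZID", sort=False)
        .agg(seqid=("seqid", "first"), strand=("strand", "first"),
             start=("start", "min"), end=("end", "max"))
        .reset_index()
)
print(gene_coords)
```

For a plus-strand gene the coding region then begins at `start`; for a minus-strand gene it begins at `end`, which matters when the upstream sequence is extracted.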


In [299]:
mylist = []
for index, row in newdf_small.iterrows():
    gene = row['ENTREZID']
    # Python-style slice [start - 3, start) on the forward strand
    sequence = genome[row['seqid']][row['start'] - 3:row['start']].seq
    mylist.append({gene: sequence})
print(mylist)

[{'100196194': 'CCA'}, {'100196194': 'AGG'}, {'100196194': 'AGT'}, {'100196194': 'AGG'}, {'100136513': 'ACC'}]


In [300]:
mylist = []
for index, row in newdf_small.iterrows():
    gene = row['ENTREZID']
    # Python-style slice [start, start + 3) on the forward strand
    sequence = genome[row['seqid']][row['start']:row['start'] + 3].seq
    mylist.append({gene: sequence})
print(mylist)

[{'100196194': 'TGA'}, {'100196194': 'CCA'}, {'100196194': 'GTC'}, {'100196194': 'AGT'}, {'100136513': 'TGT'}]
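
For the first row (a plus-strand CDS with phase 0), the slice ending at `start` returns `CCA` and the slice beginning at `start` returns `TGA`. That pattern is what would be expected if the CDS opens with `ATG` at the 1-based GFF position `start`, which is index `start - 1` in a 0-based slice, so the first slice probably includes the first coding base. The two cells also ignore the strand column. The sketch below shows one way to handle both points on a plain string; the function name `upstream_sequence` and the toy chromosome are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def upstream_sequence(chrom_seq, start, end, strand, length):
    """Return `length` bases upstream of a feature given 1-based GFF coordinates.

    `chrom_seq` is the sequence of one chromosome or scaffold as a string.
    """
    if strand == "+":
        # 1-based `start` is 0-based index start - 1, so upstream ends there
        lo = max(start - 1 - length, 0)
        return chrom_seq[lo:start - 1]
    # Minus strand: upstream lies after `end` on the forward strand and is
    # reported as its reverse complement
    region = chrom_seq[end:end + length]
    return region.translate(COMPLEMENT)[::-1]


# Toy chromosome: a plus-strand feature at 1-based 6..11 (ATGAAA) and a
# minus-strand feature at 1-based 15..20 (TTTCAT on the forward strand)
chrom = "GGCCA" + "ATGAAA" + "CCC" + "TTTCAT" + "TGGAA"

print(upstream_sequence(chrom, 6, 11, "+", 3))   # CCA
print(upstream_sequence(chrom, 15, 20, "-", 3))  # CCA
print(chrom[6 - 3:6])  # CAA: the notebook-style slice includes the A of ATG
```

The same index arithmetic should carry over to the `genome[seqid][a:b].seq` slices used above, since those outputs behave like 0-based Python slices.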
