__Author:__ Bram Van de Sande
    
__Date:__ 14 JUN 2019

__Outline:__ This notebook generates lists of transcription factors (TFs) for human and mouse. These lists can be used as input for the network inference step of SCENIC (step 1 - GENIE3/GRNBoost2).

__DATA ACQUISITION:__
1. Download motif annotations for _H. sapiens_ - HGNC symbols: `wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl`
2. Download motif annotations for _M. musculus_ - MGI symbols: `wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl`
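3. Download the list of curated human transcription factors from: Lambert SA et al. The Human Transcription Factors. Cell 2018 https://dx.doi.org/10.1016/j.cell.2018.01.029. Save the gene symbols, one per line, as `lambert2018.txt` in `../resources/`, next to the two motif annotation files.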
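
The two `wget` downloads in steps 1 and 2 can also be scripted from Python. The following is a minimal sketch and not part of the original notebook: the helper names `motif_annotation_url` and `download_motif_annotations` are hypothetical, and it assumes the files should land in the `../resources/` folder that the notebook reads from.

```python
import os
import urllib.request

# Base URL and file-name pattern taken from the wget commands above.
BASE_URL = 'https://resources.aertslab.org/cistarget/motif2tf/'
FNAME_TEMPLATE = 'motifs-v9-nr.{nomenclature}-m0.001-o0.0.tbl'


def motif_annotation_url(nomenclature):
    """Return the download URL for 'hgnc' (human) or 'mgi' (mouse)."""
    if nomenclature not in ('hgnc', 'mgi'):
        raise ValueError('expected "hgnc" or "mgi", got {!r}'.format(nomenclature))
    return BASE_URL + FNAME_TEMPLATE.format(nomenclature=nomenclature)


def download_motif_annotations(nomenclature, folder='../resources/'):
    """Download one annotation table into `folder` and return the local path.

    The default folder is an assumption that mirrors BASEFOLDER_NAME below.
    """
    os.makedirs(folder, exist_ok=True)
    url = motif_annotation_url(nomenclature)
    local_fname = os.path.join(folder, os.path.basename(url))
    urllib.request.urlretrieve(url, local_fname)
    return local_fname
```

Calling `download_motif_annotations('hgnc')` and `download_motif_annotations('mgi')` would then fetch both tables. The Lambert et al. list still has to be obtained manually from the publication.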

In [1]:
import os
import pandas as pd

In [16]:
BASEFOLDER_NAME = '../resources/'

MOTIFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'motifs-v9-nr.hgnc-m0.001-o0.0.tbl')
MOTIFS_MGI_FNAME = os.path.join(BASEFOLDER_NAME, 'motifs-v9-nr.mgi-m0.001-o0.0.tbl')
CURATED_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'lambert2018.txt')

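# Two distinct names: previously both assignments used OUT_TFS_HGNC_FNAME,
# so the second silently replaced the first.
OUT_TFS_HGNC_MOTIFS_FNAME = os.path.join(BASEFOLDER_NAME, 'hs_hgnc_tfs.txt')  # motif-based human list
OUT_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'hs_hgnc_curated_tfs.txt')  # curated human list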
OUT_TFS_MGI_FNAME = os.path.join(BASEFOLDER_NAME, 'mm_mgi_tfs.txt')

__MUS MUSCULUS__

In [6]:
# Every gene with at least one annotated motif counts as a TF.
df_motifs_mgi = pd.read_csv(MOTIFS_MGI_FNAME, sep='\t')
mm_tfs = df_motifs_mgi.gene_name.unique()
# Write one gene symbol per line.
with open(OUT_TFS_MGI_FNAME, 'wt') as f:
    f.write('\n'.join(mm_tfs) + '\n')
len(mm_tfs)

1721
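
How `unique()` turns the annotation table into a TF list can be illustrated on a toy table. This is only a sketch with made-up rows: the `gene_name` column is the one used in the notebook, while the `motif_id` column name and all values are invented for illustration. The `dropna()` call is an extra precaution that is not present in the cell above; it guards against rows without a gene symbol, which would otherwise break the `'\n'.join(...)` step.

```python
import pandas as pd

# Toy stand-in for the motif annotation table (values are invented).
toy = pd.DataFrame({
    'motif_id': ['m1', 'm2', 'm3', 'm4'],
    'gene_name': ['Sox2', 'Pou5f1', 'Sox2', None],
})

# Several motifs can point to the same gene, so deduplicate the symbols.
# unique() keeps the order of first appearance.
tfs = toy['gene_name'].dropna().unique()

# Same file layout as the notebook: one symbol per line, trailing newline.
tf_list_text = '\n'.join(tfs) + '\n'
```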

__HOMO SAPIENS__

List of TFs based on motif collection.

In [7]:
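df_motifs_hgnc = pd.read_csv(MOTIFS_HGNC_FNAME, sep='\t')
hs_tfs = df_motifs_hgnc.gene_name.unique()
# OUT_TFS_HGNC_FNAME is reserved for the curated list written at the end of the
# notebook, so the motif-based list goes to hs_hgnc_tfs.txt explicitly.
with open(os.path.join(BASEFOLDER_NAME, 'hs_hgnc_tfs.txt'), 'wt') as f:
    f.write('\n'.join(hs_tfs) + '\n')
len(hs_tfs)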

1839

List of TFs based on Lambert SA et al.

In [13]:
with open(CURATED_TFS_HGNC_FNAME, 'rt') as f:
    hs_curated_tfs = [line.strip() for line in f]
len(hs_curated_tfs)

1639

List of human curated TFs for which a motif can be assigned based on our current version of the motif collection.

In [15]:
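# Sort the result: set iteration order is arbitrary, so an unsorted list would
# make the output file differ between runs.
hs_curated_tfs_with_motif = sorted(set(hs_tfs).intersection(hs_curated_tfs))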
len(hs_curated_tfs_with_motif)

1390
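
The counts above also tell us what falls outside the intersection: 1839 - 1390 = 449 motif-based TFs are not in the curated list, and 1639 - 1390 = 249 curated TFs have no motif, the latter assuming `lambert2018.txt` contains no duplicate or blank lines. The sketch below shows the same set operations on toy data; the gene symbols are placeholders chosen for illustration and say nothing about the actual content of either list.

```python
# Toy stand-ins for hs_tfs and hs_curated_tfs (placeholder symbols).
motif_tfs = ['TP53', 'SOX2', 'MYC', 'GENE_A']
curated_tfs = ['TP53', 'SOX2', 'GATA1']

motif_set = set(motif_tfs)
curated_set = set(curated_tfs)

# Curated TFs for which a motif is available (what the notebook keeps).
with_motif = sorted(motif_set & curated_set)

# Curated TFs that would be dropped because no motif is annotated.
curated_without_motif = sorted(curated_set - motif_set)

# Motif-annotated genes that the curated list does not recognise as TFs.
motif_not_curated = sorted(motif_set - curated_set)
```

Inspecting the two difference sets on the real data is a quick way to spot symbol mismatches (for example outdated aliases) between the two sources.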

In [17]:
with open(OUT_TFS_HGNC_FNAME, 'wt') as f:
    f.write('\n'.join(hs_curated_tfs_with_motif) + '\n')
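
The files written by this notebook contain one gene symbol per line. A minimal sketch for reading such a file back and checking that it round-trips is given below; the helper name `read_tf_list` is hypothetical, and a temporary directory stands in for `../resources/` so that the example does not touch the real output files.

```python
import os
import tempfile


def read_tf_list(fname):
    """Read a TF list with one gene symbol per line, skipping blank lines."""
    with open(fname, 'rt') as f:
        return [line.strip() for line in f if line.strip()]


# Placeholder symbols; the real lists come from the cells above.
tfs = ['GATA1', 'SOX2', 'TP53']

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, 'example_tfs.txt')
    # Same layout as the notebook output: newline-separated, trailing newline.
    with open(fname, 'wt') as f:
        f.write('\n'.join(tfs) + '\n')
    tfs_read_back = read_tf_list(fname)
```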