# A template notebook to run mFinder from Uri Alon's lab

- mFinder is a package for detecting network motifs: subgraph patterns that occur significantly more often in the real network than in randomly permuted versions of it.

- Designed for Windows, but it can also run on Linux.
- mFinder 1.21 is used here (the binary itself reports "Version 1.20" in its output).
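
To make "over-represented" concrete: a common way to score a motif is a Z-score that compares its count in the real network against its counts in randomized networks. The sketch below uses invented counts purely for illustration; the function name `motif_zscore` is hypothetical and is not part of mFinder, and the exact statistic mFinder reports should be checked in its manual.

```python
import numpy as np

def motif_zscore(n_real, n_rand):
    """Z-score of a subgraph count against counts from randomized networks."""
    n_rand = np.asarray(n_rand, dtype=float)
    std = n_rand.std(ddof=1)  # sample standard deviation across random networks
    if std == 0:
        return float("nan")  # undefined when all random counts are identical
    return (n_real - n_rand.mean()) / std

# Illustrative numbers only (not results from this notebook)
z = motif_zscore(n_real=30, n_rand=[10, 12, 8, 11, 9])
print(z)
```

A large positive Z-score means the subgraph appears far more often in the real network than expected from the randomized ensemble.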

Last updated: 09/19/2023

Author: Yang-Joon Kim



In [3]:
# 0. Import
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns


In [4]:
import celloracle as co
co.__version__



'0.14.0'

In [3]:
# visualization settings
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# Step 1. Import the GRN (cell-type specific, in this case)

- Our GRN is filtered down to 2000 edges, based on (1) the p-value and (2) the strength of each edge (see the CellOracle paper).
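
The filtering idea can be sketched in plain pandas as follows. This is only an approximation of the idea under stated assumptions, not CellOracle's implementation: the function name `filter_edges`, the p-value cutoff of 0.001, and the toy values are all invented for illustration; only the column names `coef_abs` and `p` come from the GRN table shown in this notebook.

```python
import pandas as pd

def filter_edges(links, p_max=0.001, n_edges=2000):
    """Keep edges with p <= p_max, then the n_edges strongest by coef_abs."""
    significant = links[links["p"] <= p_max]
    return significant.sort_values("coef_abs", ascending=False).head(n_edges)

# Toy edge list with the same columns as the GRN table (values are invented)
toy = pd.DataFrame({
    "source": ["a", "b", "c", "d"],
    "target": ["x", "y", "z", "w"],
    "coef_abs": [0.50, 0.40, 0.30, 0.20],
    "p": [1e-5, 0.2, 1e-4, 1e-6],
})
top = filter_edges(toy, p_max=0.001, n_edges=2)
print(top)
```

Edge `b -> y` is dropped for its p-value even though it is strong, and the remaining edges are ranked by absolute coefficient.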

In [106]:
GRN_links_TDR118 = co.load_hdf5("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/08_TDR118_celltype_GRNs.celloracle.links")
GRN_links_TDR118

<celloracle.network_analysis.links_object.Links at 0x15528d0b7a30>

In [6]:
GRN_NMP = GRN_links_TDR118.filtered_links["NMPs"]
GRN_NMP

Unnamed: 0,source,target,coef_mean,coef_abs,p,-logp
171664,nfatc1,slit3,0.167520,0.167520,6.099314e-13,12.214719
166079,hmga1a,si:ch73-281n10.2,0.134133,0.134133,8.328944e-17,16.079410
171665,creb5b,slit3,0.118250,0.118250,2.857576e-08,7.544002
27748,her9,cirbpa,0.109563,0.109563,5.209617e-10,9.283194
154387,mafbb,rpl7a,0.091194,0.091194,7.477311e-13,12.126255
...,...,...,...,...,...,...
871,foxi3a,actb1,0.010103,0.010103,1.992044e-07,6.700701
40775,sox3,dag1,0.010103,0.010103,2.348269e-09,8.629252
107705,sox21a,mdka,0.010102,0.010102,4.750005e-10,9.323306
26231,otpb,cfl1,0.010096,0.010096,3.951234e-05,4.403267


In [29]:
# unfiltered GRN
# GRN_NMPs = pd.read_csv("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/raw_GRN_for_NMPs.csv")
# GRN_NMPs

# Step 2. Reformat the GRN for mFinder

- mFinder requires the following input format:
- columns: "source node", "target node", "edge weight"
- The source node and target node must be integers. Therefore, we need to map each gene_name to a unique integer.
- NOTE that mFinder does not take the "edge weight" into account, so the edge weight should be "1" for all edges. This means that mFinder only tells us which TFs interact (the network motifs); the sign of each interaction (positive/negative) has to be determined later using the "edge weight" from CellOracle.

Here, we will reformat the cell-type specific GRN by converting gene_names to unique integers.
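
A minimal sketch of this reformatting, wrapped in a hypothetical helper `to_mfinder_format` (the name and the toy input are not from the original notebook). It sorts the gene names before numbering them, because the iteration order of a Python `set` of strings may differ between sessions, which would otherwise change the integer assigned to each gene.

```python
import pandas as pd

def to_mfinder_format(grn):
    """Return (edge table with integer node IDs, {integer: gene_name} mapping)."""
    # Sort the gene names so the mapping is reproducible across sessions
    genes = sorted(set(grn["source"]) | set(grn["target"]))
    name_to_int = {name: i for i, name in enumerate(genes)}
    edges = pd.DataFrame({
        "source": grn["source"].map(name_to_int),
        "target": grn["target"].map(name_to_int),
        "edge_weight": 1,  # constant weight for every edge
    })
    int_to_name = {i: name for name, i in name_to_int.items()}
    return edges, int_to_name

toy = pd.DataFrame({"source": ["her9", "sox3"], "target": ["cirbpa", "her9"]})
edges, int_to_name = to_mfinder_format(toy)
print(edges)
```

The returned `int_to_name` dictionary is what later translates node IDs in the mFinder output back into gene names.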

In [8]:
# # import the anndata to make a mapping dictionary of "gene_name" and "integer" pairs.
# base_GRN = pd.read_parquet("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/05_TDR118_base_GRN_dataframe.parquet")
# base_GRN

Unnamed: 0,peak_id,gene_short_name,A6H8I1_DANRE,CABZ01017151.1,CABZ01056727.1,CABZ01057488.2,CABZ01066696.1,CABZ01067175.1,CABZ01079847.1,CABZ01081359.1,...,znf143b,znf148,znf281a,znf281b,znf652,znf653,znf711,znf740b,znf76,zzz3
0,chr10_10310135_10311044,mir219-1,0,0,1,0,0,1,0,0,...,0,1,1,1,0,0,0,0,0,0
1,chr10_10312654_10313520,urm1,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,chr10_10318857_10319768,mir219-1,0,0,1,1,0,1,0,0,...,0,1,1,1,0,0,0,0,0,0
3,chr10_10330150_10331040,mir219-1,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,chr10_10728430_10729439,swi5,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14506,chr9_9670995_9671898,gsk3ba,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
14507,chr9_9841650_9842440,fstl1b,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
14508,chr9_9960217_9961167,prmt2,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
14509,chr9_9977190_9977958,ugt1a1,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# make a dictionary of "integer":"gene_names" across all cell-types


15380

In [11]:
list_genes_TFs = list(set(GRN_NMP.source).union(set(GRN_NMP.target)))
len(list_genes_TFs)

390

In [12]:
# Create a dictionary mapping integers to gene names
gene_dict = {index: gene_name for index, gene_name in enumerate(list_genes_TFs)}
gene_dict

{0: 'esrrga',
 1: 'rfx4',
 2: 'serpinh1b',
 3: 'pknox2',
 4: 'rxraa',
 5: 'hoxd3a',
 6: 'en2b',
 7: 'foxi3a',
 8: 'alx4b',
 9: 'cyth1b',
 10: 'gbx1',
 11: 'foxn3',
 12: 'tfap2b',
 13: 'hmx3a',
 14: 'pbx1b',
 15: 'spry4',
 16: 'pnx',
 17: 'msx2b',
 18: 'qkia',
 19: 'agrn',
 20: 'il17rd',
 21: 'prox1a',
 22: 'smad1',
 23: 'neurod1',
 24: 'hoxd12a',
 25: 'fn1b',
 26: 'nr2f5',
 27: 'nr5a2',
 28: 'sox10',
 29: 'foxi2',
 30: 'si:ch73-281n10.2',
 31: 'fezf2',
 32: 'tfec',
 33: 'nop58',
 34: 'spon1b',
 35: 'dag1',
 36: 'raraa',
 37: 'hoxb3a',
 38: 'sox1a',
 39: 'rx3',
 40: 'creb5b',
 41: 'h3f3d',
 42: 'cdh11',
 43: 'tlx3b',
 44: 'tal1',
 45: 'etv4',
 46: 'rorb',
 47: 'hsp90ab1',
 48: 'zbtb18',
 49: 'cntfr',
 50: 'hnf4a',
 51: 'irx7',
 52: 'uncx4.1',
 53: 'smad3a',
 54: 'gli2a',
 55: 'ctnnd2b',
 56: 'nfic',
 57: 'nucks1a',
 58: 'her9',
 59: 'pax6b',
 60: 'pax7b',
 61: 'hoxc11a',
 62: 'asph',
 63: 'zbtb16a',
 64: 'aopep',
 65: 'onecutl',
 66: 'rarga',
 67: 'sox21a',
 68: 'meis3',
 69: 'mnx2a',
 

In [13]:
# Now, we will reformat the GRN as described above
# 1) grab the GRN and extract the "source" and "target" columns into a new dataframe
df_mfinder = GRN_NMP[["source", "target"]].copy()

# 2) convert the "source", "target" gene_names to "integers" using the inverse of gene_dict
name_to_int = {gene_name: index for index, gene_name in gene_dict.items()}
df_mfinder["source"] = df_mfinder["source"].map(name_to_int)
df_mfinder["target"] = df_mfinder["target"].map(name_to_int)

# 3) add the "edge weight" as "1" for the third column
df_mfinder["edge_weight"] = 1
df_mfinder


Unnamed: 0,source,target,edge_weight
171664,77,91,1
166079,374,30,1
171665,40,91,1
27748,58,95,1
154387,328,353,1
...,...,...,...
871,7,151,1
40775,331,35,1
107705,67,111,1
26231,109,199,1


In [71]:
# save the reformatted GRN into a txt file
df_mfinder.to_csv("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/filtered_GRN_for_NMPs_mfinder_format.txt",
                  sep="\t", header=False, index=False)
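
Before writing the file, it can help to confirm that every gene name was mapped and that the node columns are integers, since an unmapped name becomes NaN and turns the column into floats. The check below is a sketch; `check_mfinder_edges` is a hypothetical helper, not part of mFinder or CellOracle.

```python
import pandas as pd

def check_mfinder_edges(df):
    """Raise ValueError if the edge table is not ready to be written for mFinder."""
    cols = ["source", "target", "edge_weight"]
    if list(df.columns) != cols:
        raise ValueError(f"expected columns {cols}, got {list(df.columns)}")
    if df[cols].isna().any().any():
        raise ValueError("some gene names were not mapped to an integer")
    for col in ["source", "target"]:
        if not pd.api.types.is_integer_dtype(df[col]):
            raise ValueError(f"column {col!r} is not an integer column")
    return True

toy = pd.DataFrame({"source": [1, 2], "target": [0, 1], "edge_weight": [1, 1]})
print(check_mfinder_edges(toy))
```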


# Step 3. Run mFinder in linux terminal

- Refer to mFinder documentation (Uri Alon's lab website)
- link: 

1) Use "screen" so that a long run keeps going if the terminal session disconnects.
2) The motif computation takes less than a minute with the default settings (network_size=3) for a GRN with 2000 edges from CellOracle.
3) However, increasing network_size to 4 raised the runtime to 20 minutes for the same input dataset.
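
The command for a given motif size can be assembled as in the sketch below. Only `-f` appears in this notebook; the `-s` (motif size) and `-r` (number of random networks) flags are written from memory of the mFinder manual and are assumptions that should be verified against the documentation for the installed version. The helper name `build_mfinder_command` is hypothetical, and nothing is executed here.

```python
import shlex

def build_mfinder_command(input_file, output_prefix, motif_size=3, n_random=None):
    """Build the argument list for an mfinder run (nothing is executed here)."""
    # "-s" and "-r" are assumed flag names; check them against the mFinder manual
    cmd = ["mfinder", input_file, "-s", str(motif_size)]
    if n_random is not None:
        cmd += ["-r", str(n_random)]
    cmd += ["-f", output_prefix]
    return cmd

cmd = build_mfinder_command(
    "filtered_GRN_for_NMPs_mfinder_format.txt", "NMP_motifs", motif_size=4
)
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems if a cell-type name ever contains spaces.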


In [76]:
# Change the current working directory
os.chdir("/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21/")


In [85]:
! export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH

In [89]:
os.system("export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH")

0

In [101]:
mfinder_path = "/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21"

In [103]:
cmd = "export PATH="+mfinder_path+":$PATH"
cmd

'export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH'
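
Note that `! export PATH=...` and `os.system("export PATH=...")` each run in a short-lived subshell, so the change is lost as soon as that shell exits. A sketch of an alternative that does persist for the notebook process and for any command it launches afterwards is to modify `os.environ` directly. The helper name `prepend_to_path` and the placeholder directory are illustrative only.

```python
import os
import subprocess
import sys

def prepend_to_path(directory):
    """Prepend a directory to PATH for this process and its future children."""
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")

demo_dir = "/path/to/mfinder1.21"  # placeholder, replace with the real install directory
prepend_to_path(demo_dir)

# A child process started afterwards inherits the updated PATH
child_path = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['PATH'])"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(child_path.startswith(demo_dir))
```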

In [94]:
# move to the directory where the input txt file is saved
# ! cd /hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/
os.chdir("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/")

In [95]:
! ls

archive_motifs				  raw_GRN_for_Muscle.csv
filtered_GRN_for_NMPs_mfinder_format.txt  raw_GRN_for_Neural_Anterior.csv
filtered_GRN_Somites_mfinder_format.txt   raw_GRN_for_Neural_Crest.csv
motifs_Somites_OUT.txt			  raw_GRN_for_Neural_Posterior.csv
NMP_motifs_OUT.txt			  raw_GRN_for_NMPs.csv
raw_GRN_for_Adaxial_Cells.csv		  raw_GRN_for_Notochord.csv
raw_GRN_for_Differentiating_Neurons.csv   raw_GRN_for_PSM.csv
raw_GRN_for_Endoderm.csv		  raw_GRN_for_Somites.csv
raw_GRN_for_Epidermal.csv		  raw_GRN_for_unassigned.csv
raw_GRN_for_Lateral_Mesoderm.csv


In [98]:
input_file_name

'filtered_GRN_Somites_mfinder_format.txt'

In [100]:
cmd = "mfinder "+input_file_name + " -f "+output_filename
cmd

os.system(cmd)

Input Network file is filtered_GRN_Somites_mfinder_format.txt
mfinder Version 1.20

Loading Network
	Reading Network file in <Source,Target,Weight> Format
Searching motifs size 3
Processing Real network...

 (Real network processing runtime was:    2.0 seconds.)
Processing Random networks
..........
 Estimated run time left :      8 minutes.

..........
 Estimated run time left :      7 minutes.

..........
 Estimated run time left :      6 minutes.

..........
 Estimated run time left :      5 minutes.

..........
 Estimated run time left :    256 seconds.

..........
 Estimated run time left :    206 seconds.

..........
 Estimated run time left :    155 seconds.

..........
 Estimated run time left :    103 seconds.

..........
 Estimated run time left :     52 seconds.

..........
 Estimated run time left :      0 seconds.


Calculating Results...

MOTIF FINDER RESULTS:

	Network name: filtered_GRN_Somites_mfinder_format.txt
	Network type: Directed
	Num of Nodes: 762 Num of Edges: 

0

 (Real network processing runtime was:    2.0 seconds.)

 (Single Random network processing runtime was:    5.2 seconds.)

Output File motifs_Somites_OUT.txt was generated


In [59]:
# os.getcwd()
# ! mfinder

/bin/bash: mfinder: command not found


In [81]:
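# NOTE: "! cd" runs in a temporary subshell, so the notebook's working directory is unchanged
# (see the pwd output below); use os.chdir() or %cd to change it
! cd /hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/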

In [83]:
pwd

'/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21'

In [84]:
# run the mfinder
#! cd /hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21/
! mfinder filtered_GRN_for_NMPs_mfinder_format.txt \
            -f NMP_motifs # no file extension needed: mfinder appends "_OUT.txt" to this name


Input Network file is filtered_GRN_for_NMPs_mfinder_format.txt
mfinder Version 1.20

Loading Network

Error: Cannot open input file : filtered_GRN_for_NMPs_mfinder_format.txt
	F input file name and path
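
The error above occurs because the input file is looked up relative to the current working directory, which was still the mfinder directory. One way to avoid depending on the notebook's working directory is to pass the data directory explicitly when launching the command and to check for the input file first. The sketch below uses a harmless stand-in command instead of mfinder; `run_in_dir` is a hypothetical helper.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_in_dir(cmd, workdir, required_file=None):
    """Run cmd with workdir as its working directory, checking the input first."""
    workdir = Path(workdir)
    if required_file is not None and not (workdir / required_file).is_file():
        raise FileNotFoundError(f"{required_file} not found in {workdir}")
    return subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)

# Demonstration with a stand-in command that lists the working directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "edges.txt").write_text("0\t1\t1\n")
    result = run_in_dir(
        [sys.executable, "-c", "import os; print(os.listdir('.'))"],
        workdir=tmp, required_file="edges.txt",
    )
    print(result.stdout.strip())
```

For a real run, `cmd` would be the mfinder argument list and `required_file` the edge file name.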


# Step 4. Repeat the mfinder run for all cell types

# Main Command - using for loop to run mFinder for all cell-types

In [76]:
# Change the current working directory
#os.chdir("/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21/")
#! export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH
#os.system("export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH")


In [6]:
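# add the mfinder directory to PATH for this Python process and its child processes
# (os.system("export PATH=...") only changes a short-lived subshell, so it does not persist)
mfinder_path = "/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21"

os.environ["PATH"] = mfinder_path + os.pathsep + os.environ["PATH"]
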

In [8]:
# move to the directory where the input txt file is saved
# ! cd /hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/
filepath = "/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/"
os.chdir(filepath)

In [10]:
# import the Links object
GRN_links_TDR118 = co.load_hdf5("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/08_TDR118_celltype_GRNs.celloracle.links")
GRN_links_TDR118

# grab only the filtered object (2000 edges)
GRN_all = GRN_links_TDR118.filtered_links

In [11]:
# # define the mfinder executable
# ! export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH

# # filepath
# filepath = "/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/"

# # define the GRN (A Links object from CellOracle - filtered for 2000 edges for each cell-type)
# # This should be the input argument for the function
# GRN_all = GRN_links_TDR118.filtered_links

for celltype in GRN_all.keys():
    
    print(celltype)
    
#     # Step 1. subset the GRN for each celltype
#     GRN_celltype = GRN_all[celltype]
    
#     # Step 2. reformat the GRN (unique integers for gene_names)    
#     list_genes_TFs = list(set(GRN_celltype.source).union(set(GRN_celltype.target)))
#     # Create a dictionary mapping integers to gene names
#     gene_dict = {index: gene_name for index, gene_name in enumerate(list_genes_TFs)}
#     gene_dict
    
#     # Now, we will reformat the GRN as described above
#     # 1) grab the GRN, then extract the "source", "target", and create a dataframe
#     # 2) add the "edge weight" as "1" for the third column
#     df_mfinder = pd.DataFrame(columns =["source", "target", "edge_weight"])
#     df_mfinder

#     df_mfinder["source"] = GRN_celltype["source"]
#     df_mfinder["target"] = GRN_celltype["target"]
#     df_mfinder["edge_weight"] = 1

#     df_mfinder

#     # 3) convert the "source", "target" gene_names to "integers" using the gene_dict
#     df_mfinder["source"] = df_mfinder["source"].map({v: k for k, v in gene_dict.items()})
#     df_mfinder["target"] = df_mfinder["target"].map({v: k for k, v in gene_dict.items()})
#     df_mfinder
#     # save the reformatted GRN into a txt file
#     df_mfinder.to_csv(filepath + "filtered_GRN_"+celltype+"_mfinder_format.txt",
#                       sep="\t", header=False, index=False)
    
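    # Step 3. run mFinder
    # default setting (network_size=3, num_random_)
    # input filename
    input_file = f"filtered_GRN_{celltype}_mfinder_format.txt"
    # output filename (mfinder appends "_OUT.txt" to this name)
    output_file = f"motifs_{celltype}"

    # define the mFinder command
    cmd = f"mfinder {input_file} -f {output_file}"
    # run mFinder; os.system returns the exit status (0 on success)
    status = os.system(cmd)
    if status == 0:
        print(celltype + " network motif computing completed")
    else:
        print(celltype + " mFinder run failed with exit status " + str(status))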

Adaxial_Cells
Input Network file is filtered_GRN_Adaxial_Cells_mfinder_format.txt
mfinder Version 1.20

Loading Network
	Reading Network file in <Source,Target,Weight> Format
Searching motifs size 3
Processing Real network...

 (Real network processing runtime was:    1.0 seconds.)
Processing Random networks
..........
 Estimated run time left :     18 seconds.

..........
 Estimated run time left :     20 seconds.

..........
 Estimated run time left :     19 seconds.

..........
 Estimated run time left :     16 seconds.

..........
 Estimated run time left :     14 seconds.

..........
 Estimated run time left :     11 seconds.

..........
 Estimated run time left :      9 seconds.

..........
 Estimated run time left :      6 seconds.

..........
 Estimated run time left :      3 seconds.

..........
 Estimated run time left :      0 seconds.


Calculating Results...

MOTIF FINDER RESULTS:

	Network name: filtered_GRN_Adaxial_Cells_mfinder_format.txt
	Network type: Directed
	Num of
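
Because each cell type gets its own gene-to-integer mapping, the node IDs in one mFinder output file cannot be decoded with another cell type's dictionary. A sketch of the per-cell-type preparation step that also saves the mapping next to each edge file is given below. The helper name `write_mfinder_inputs` and the `gene_dict_<celltype>.json` file name are hypothetical; the edge file name follows the pattern used in this notebook.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def write_mfinder_inputs(grn_by_celltype, out_dir):
    """Write one mFinder edge file and one gene-ID mapping file per cell type."""
    out_dir = Path(out_dir)
    written = {}
    for celltype, grn in grn_by_celltype.items():
        genes = sorted(set(grn["source"]) | set(grn["target"]))
        name_to_int = {name: i for i, name in enumerate(genes)}
        edges = pd.DataFrame({
            "source": grn["source"].map(name_to_int),
            "target": grn["target"].map(name_to_int),
            "edge_weight": 1,
        })
        edge_file = out_dir / f"filtered_GRN_{celltype}_mfinder_format.txt"
        edges.to_csv(edge_file, sep="\t", header=False, index=False)
        # Hypothetical file name: keep the mapping so node IDs can be decoded later
        map_file = out_dir / f"gene_dict_{celltype}.json"
        map_file.write_text(json.dumps({i: name for name, i in name_to_int.items()}))
        written[celltype] = edge_file
    return written

# Toy input standing in for GRN_links_TDR118.filtered_links
toy = {"NMPs": pd.DataFrame({"source": ["sox3"], "target": ["dag1"]})}
with tempfile.TemporaryDirectory() as tmp:
    files = write_mfinder_inputs(toy, tmp)
    content = files["NMPs"].read_text()
print(repr(content))
```

Note that JSON stores the integer keys as strings, so they need to be converted back with `int()` when the mapping is reloaded.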

