# A template notebook to run mFinder from Uri Alon's lab

- mFinder is a package to compute network motifs, a pattern of motifs that is over-represented than randomly permuted network.

- Designed for Windows machine, but can run on Linux environment.
- mfinder 1.21 is used here.

Last updated: 08/10/2023
Author: Yang-Joon Kim



In [2]:
# 0. Import
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns


In [3]:
import celloracle as co
co.__version__

  def twobit_to_dna(twobit: int, size: int) -> str:
  def dna_to_twobit(dna: str) -> int:
  def twobit_1hamming(twobit: int, size: int) -> List[int]:
INFO:matplotlib.font_manager:Failed to extract font properties from /usr/share/fonts/google-noto-emoji/NotoColorEmoji.ttf: In FT2Font: Can not load face (unknown file format; error code 0x2)


'0.14.0'

In [4]:
# visualization settings
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# Step 1. Import the GRN (cell-type specific, in this case)

- Our GRN is "filtered" for only 2000 edges, based on (1) p-value and (2) strength of the edges (CellOracle paper).

In [25]:
GRN_links_TDR118 = co.load_hdf5("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/08_TDR118_celltype_GRNs.celloracle.links")
GRN_links_TDR118

<celloracle.network_analysis.links_object.Links at 0x15290fc9c070>

In [28]:
GRN_NMP = GRN_links_TDR118.filtered_links["NMPs"]
GRN_NMP

Unnamed: 0,source,target,coef_mean,coef_abs,p,-logp
171664,nfatc1,slit3,0.167520,0.167520,6.099314e-13,12.214719
166079,hmga1a,si:ch73-281n10.2,0.134133,0.134133,8.328944e-17,16.079410
171665,creb5b,slit3,0.118250,0.118250,2.857576e-08,7.544002
27748,her9,cirbpa,0.109563,0.109563,5.209617e-10,9.283194
154387,mafbb,rpl7a,0.091194,0.091194,7.477311e-13,12.126255
...,...,...,...,...,...,...
871,foxi3a,actb1,0.010103,0.010103,1.992044e-07,6.700701
40775,sox3,dag1,0.010103,0.010103,2.348269e-09,8.629252
107705,sox21a,mdka,0.010102,0.010102,4.750005e-10,9.323306
26231,otpb,cfl1,0.010096,0.010096,3.951234e-05,4.403267


In [29]:
# unfiltered GRN
# GRN_NMPs = pd.read_csv("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/raw_GRN_for_NMPs.csv")
# GRN_NMPs

# Step 2. Reformat the GRN for mFinder

- mFinder requires the following input format
- "source node", "target node", "edge weight"
- source node, and target node should be "integers". Therefore, we need to map each gene_name to a unique integer.
- NOTE that mFinder does not take into the "edge weight" is not taken into account, and should be "1" for all edges. This means that we can only learn about the "interaction" between TFs (network motifs), but their exact interaction (positive/negative) should be figured out using the "edge weight" from CellOracle later. 

Here, we will reformat the cell-type specific GRN by converting gene_names to unique integers.

In [5]:
# # import the anndata to make a mapping dictionary of "gene_name" and "integer" pairs.
# base_GRN = pd.read_parquet("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/05_TDR118_base_GRN_dataframe.parquet")
# base_GRN

Unnamed: 0,peak_id,gene_short_name,A6H8I1_DANRE,CABZ01017151.1,CABZ01056727.1,CABZ01057488.2,CABZ01066696.1,CABZ01067175.1,CABZ01079847.1,CABZ01081359.1,...,znf143b,znf148,znf281a,znf281b,znf652,znf653,znf711,znf740b,znf76,zzz3
0,chr10_10310135_10311044,mir219-1,0,0,1,0,0,1,0,0,...,0,1,1,1,0,0,0,0,0,0
1,chr10_10312654_10313520,urm1,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,chr10_10318857_10319768,mir219-1,0,0,1,1,0,1,0,0,...,0,1,1,1,0,0,0,0,0,0
3,chr10_10330150_10331040,mir219-1,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,chr10_10728430_10729439,swi5,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14506,chr9_9670995_9671898,gsk3ba,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
14507,chr9_9841650_9842440,fstl1b,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
14508,chr9_9960217_9961167,prmt2,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
14509,chr9_9977190_9977958,ugt1a1,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# base_GRN["CABZ01017151.1"].sum()

521

In [10]:
# len(set(base_GRN.gene_short_name.unique()) - set(base_GRN.columns))

12009

In [21]:
# set(base_GRN.columns) - set(base_GRN.gene_short_name.unique())

In [24]:
# list_genes_TFs = list(set(base_GRN.columns).union(set(base_GRN.gene_short_name)))
# list_genes_TFs.unique()

In [61]:
# # Create a dictionary mapping integers to gene names
# gene_dict = {index: gene_name for index, gene_name in enumerate(list_genes_TFs)}
# gene_dict

In [67]:
list_genes_TFs = list(set(GRN_NMP.source).union(set(GRN_NMP.target)))
len(list_genes_TFs)

390

In [68]:
# Create a dictionary mapping integers to gene names
gene_dict = {index: gene_name for index, gene_name in enumerate(list_genes_TFs)}
gene_dict

{0: 'pbx1b',
 1: 'gsx1',
 2: 'hoxa11b',
 3: 'hoxa4a',
 4: 'sp5l',
 5: 'phox2bb',
 6: 'marcksl1a',
 7: 'sox11a',
 8: 'smad1',
 9: 'sox1a',
 10: 'col11a1a',
 11: 'col18a1a',
 12: 'mafbb',
 13: 'dlx5a',
 14: 'asph',
 15: 'foxd3',
 16: 'h3f3d',
 17: 'ntn1a',
 18: 'pax7a',
 19: 'nr5a2',
 20: 'hmx3a',
 21: 'tfec',
 22: 'meis3',
 23: 'dmrta2',
 24: 'runx3',
 25: 'efna3b',
 26: 'smad3a',
 27: 'etv2',
 28: 'fosaa',
 29: 'creb5b',
 30: 'marcksl1b',
 31: 'vax2',
 32: 'hbbe3',
 33: 'nfatc1',
 34: 'neurod4',
 35: 'zbtb18',
 36: 'gli2a',
 37: 'hoxc6b',
 38: 'tlx2',
 39: 'hoxc8a',
 40: 'cyth1b',
 41: 'otpb',
 42: 'gbx2',
 43: 'barhl2',
 44: 'tbx16l',
 45: 'chsy1',
 46: 'tlx3b',
 47: 'hnf4a',
 48: 'hoxd9a',
 49: 'mef2aa',
 50: 'emx3',
 51: 'tenm4',
 52: 'pax6b',
 53: 'mnx2b',
 54: 'prox1a',
 55: 'aopep',
 56: 'nkx2.4b',
 57: 'klf12b',
 58: 'foxf1',
 59: 'hoxd12a',
 60: 'fgfr3',
 61: 'serbp1a',
 62: 'greb1l',
 63: 'hoxa9b',
 64: 'dlx3b',
 65: 'nkx2.5',
 66: 'robo1',
 67: 'en1b',
 68: 'sox6',
 69: 'kif2

In [69]:
# Now, we will reformat the GRN as described above
# 1) grab the GRN, then extract the "source", "target", and create a dataframe
# 2) add the "edge weight" as "1" for the third column
df_mfinder = pd.DataFrame(columns =["source", "target", "edge_weight"])
df_mfinder

df_mfinder["source"] = GRN_NMP["source"]
df_mfinder["target"] = GRN_NMP["target"]
df_mfinder["edge_weight"] = 1

df_mfinder

# 3) convert the "source", "target" gene_names to "integers" using the gene_dict
df_mfinder["source"] = df_mfinder["source"].map({v: k for k, v in gene_dict.items()})
df_mfinder["target"] = df_mfinder["target"].map({v: k for k, v in gene_dict.items()})
df_mfinder


Unnamed: 0,source,target,edge_weight
171664,33,137,1
166079,154,199,1
171665,29,137,1
27748,160,260,1
154387,12,213,1
...,...,...,...
871,236,264,1
40775,211,150,1
107705,126,102,1
26231,41,227,1


In [71]:
# save the reformatted GRN into a txt file
df_mfinder.to_csv("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/filtered_GRN_for_NMPs_mfinder_format.txt",
                  sep="\t", header=False, index=False)


# Step 3. Run mFinder in linux terminal

- Refer to mFinder documentation (Uri Alon's lab website)
- link: 

1) Use "screen"
2) the motif computation takes less than a minute (with the default settings, network_size=3, for a GRN with 2000 edges, from CellOracle).
3) however, if we increase the network_size=4, the runtime increased to 20 minutes, for the same input dataset.


In [55]:
# Change the current working directory
os.chdir("/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21/")


In [46]:
! export PATH=/hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21:$PATH

In [53]:
# move to the directory where the input txt file is saved
! cd "/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR118_cicero_output/07_TDR118_celloracle_GRN/"

In [59]:
# os.getcwd()
# ! mfinder

/bin/bash: mfinder: command not found


In [60]:
# run the mfinder
#! cd /hpc/projects/data.science/yangjoon.kim/github_repos/mfinder/mfinder1.21/
! mfinder filtered_GRN_for_NMPs_mfinder_format.txt \
            -f NMP_motifs # note that there's no need for .txt formatting here.


/bin/bash: mfinder: command not found


# Step 4. Repeat the mfinder run for all cell types

- 

In [74]:
for celltype in GRN_links_TDR118.filtered_links.keys():
    # reformat the GRN (unique integers for gene_names)
    
    # run mFinder
    # default setting (network_size=3, num_random_)
    ! mfinder filtered_GRN_for_NMPs_mfinder_format.txt \
            -f NMP_motifs # note that there's no need for .txt formatting here.

Adaxial_Cells
Differentiating_Neurons
Endoderm
Epidermal
Lateral_Mesoderm
Muscle
NMPs
Neural_Anterior
Neural_Crest
Neural_Posterior
Notochord
PSM
Somites
unassigned
