# Setup
This notebook walks through the steps of characterizing iModulons with the semi-automated tools in PyModulon. You will need:

* M and A matrices
* Expression data (e.g. `log_tpm_norm.csv`)
* Gene table and KEGG/GO annotations (generated in `1_create_the_gene_table.ipynb`)
* Sample table, with `project` and `condition` columns
* TRN file

Optional:
* iModulon table (if you already have some characterized iModulons)

In [1]:
from pymodulon.core import IcaData
from pymodulon.plotting import *
from os import path
import pandas as pd
import re
from Bio.KEGG import REST
from tqdm.notebook import tqdm

In [2]:
# Enter the location of your data here
data_dir = path.join('..','data','processed_data')

# GO and KEGG annotations are in the 'external' folder
external_data = path.join('..','data','external')

## Check your sample table (i.e. metadata file)
Your metadata file will probably have a lot of columns, most of which you may not care about. Feel free to save a secondary copy of your metadata file with only columns that seem relevant to you. The two most important columns are:
1. `project`
2. `condition`

Make sure that both columns exist in your metadata file and that neither contains missing values.

In [3]:
df_metadata = pd.read_csv(path.join(data_dir,'metadata.tsv'),index_col=0,sep='\t')
df_metadata[['project','condition']].head()

Unnamed: 0,project,condition
SRX14953908,multistage_ferment_AFEX,6perACSH_Glucose
SRX14953909,multistage_ferment_AFEX,6perACSH_Glucose
SRX14953910,multistage_ferment_AFEX,6perACSH_Glucose
SRX14953911,multistage_ferment_AFEX,6perACSH_Glucose
SRX14953912,multistage_ferment_AFEX,6perACSH_Glucose


In [4]:
print(df_metadata.project.notnull().all())
print(df_metadata.condition.notnull().all())

True
True
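The same check can be wrapped in a small reusable function. The sketch below uses only the standard library and an in-memory TSV; the helper name `missing_metadata` and the sample rows are hypothetical and are not part of PyModulon.

```python
import csv
import io

REQUIRED_COLUMNS = ("project", "condition")

def missing_metadata(handle, required=REQUIRED_COLUMNS):
    """Return {column: [sample IDs whose value is missing or empty]}."""
    reader = csv.DictReader(handle, delimiter="\t")
    absent = [col for col in required if col not in reader.fieldnames]
    if absent:
        raise KeyError(f"metadata file lacks required columns: {absent}")
    id_column = reader.fieldnames[0]  # assumption: first column holds the sample ID
    problems = {col: [] for col in required}
    for row in reader:
        for col in required:
            if not (row[col] or "").strip():
                problems[col].append(row[id_column])
    return problems

# Hypothetical two-sample table; the second sample has no condition
example = io.StringIO(
    "sample\tproject\tcondition\n"
    "S1\tprojA\tglucose\n"
    "S2\tprojA\t\n"
)
print(missing_metadata(example))  # {'project': [], 'condition': ['S2']}
```

Returning the offending sample IDs, rather than a single boolean, makes it easier to see which rows need fixing.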


## Check your TRN

Each row of the TRN file represents a regulatory interaction.  
**Your TRN file must have the following columns:**
1. `regulator` - Name of regulator (`/` or `+` characters will be converted to `;`)
1. `gene_id` - Locus tag of gene being regulated

The following columns are optional, but are helpful to have:
1. `regulator_id` - Locus tag of regulator
1. `gene_name` - Name of gene (can automatically update this using `name2num`)
1. `direction` - Direction of regulation ('+' for activation, '-' for repression, '?' or NaN for unknown)
1. `evidence` - Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)
1. `PMID` - Reference for regulation

You may add any other columns that could help you. TRNs may be saved as either CSV or TSV files. See below for an example:

In [5]:
#df_trn = pd.read_csv(path.join(external_data,'TRN.csv'))
#df_trn.head()

The `regulator` and `gene_id` columns must be filled in for every row.

In [6]:
#print(df_trn.regulator.notnull().all())
#print(df_trn.gene_id.notnull().all())
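As a rough illustration of the two requirements above (regulator names with `/` or `+` are converted to `;`, and both required columns must be filled in), the sketch below works on plain dictionaries. The function names and TRN rows are hypothetical; PyModulon performs its own conversion when the TRN is loaded.

```python
import re

def normalize_regulator(name):
    """Replace '/' and '+' in a regulator name with ';'."""
    return re.sub(r"[/+]", ";", name)

def incomplete_rows(rows):
    """Return the positions of rows lacking a regulator or a gene_id."""
    return [i for i, row in enumerate(rows)
            if not row.get("regulator") or not row.get("gene_id")]

# Hypothetical TRN rows
trn = [
    {"regulator": "RegA/RegB", "gene_id": "gene_0001"},
    {"regulator": "RegC+RegD", "gene_id": "gene_0002"},
    {"regulator": "RegE", "gene_id": ""},
]

print([normalize_regulator(row["regulator"]) for row in trn])
# ['RegA;RegB', 'RegC;RegD', 'RegE']
print(incomplete_rows(trn))  # [2]
```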

## Load the data
You're now ready to load your IcaData object!

In [7]:
A = pd.read_csv(path.join(data_dir,'A.csv'), index_col = 0)
X = pd.read_csv(path.join(data_dir,'log_tpm_norm.csv'), index_col = 0)
M = pd.read_csv(path.join(data_dir,'M.csv'), index_col = 0)
iM_table = pd.read_csv(path.join(data_dir, 'imodulon_table.csv'), index_col = 0)
index_to_iM = {str(index) : row['name'] for index, row in iM_table.iterrows()}
M = M.rename(columns = index_to_iM)
index_to_iM = {int(index) : row['name'] for index, row in iM_table.iterrows()}
A = A.rename(index = index_to_iM)
iM_table = iM_table.set_index('name')

ica_data = IcaData(M = M,
                   A = A,
                   X = X,
                   imodulon_table = iM_table,
                   gene_table = path.join(data_dir,'gene_info.csv'),
                   sample_table = path.join(data_dir,'metadata.tsv'),
                   #trn = path.join(external_data,'TRN.csv'),
                   optimize_cutoff=True)



If you don't have a TRN (or only have a very minimal one), use `threshold_method = 'kmeans'`.

In [8]:
# ica_data = IcaData(M = path.join(data_dir,'M.csv'),
#                    A = path.join(data_dir,'A.csv'),
#                    X = path.join(data_dir,'log_tpm_norm.csv'),
#                    gene_table = path.join(data_dir,'gene_info.csv'),
#                    sample_table = path.join(data_dir,'metadata.tsv'),
#                    trn = path.join(data_dir,'TRN.csv'),
#                    threshold_method = 'kmeans')

# Regulatory iModulons
Use `compute_trn_enrichment` to automatically check for regulatory iModulons. The more complete your TRN, the more regulatory iModulons you'll find.

In [9]:
#ica_data.compute_trn_enrichment()

You can also search for AND/OR combinations of regulators using the `max_regs` argument.

Regulator enrichments can be directly saved to the `imodulon_table` using the `save` argument. This saves the enrichment with the lowest q-value to the table.

In [10]:
# First search for regulator enrichments with 2 regulators
#ica_data.compute_trn_enrichment(max_regs=2,save=True)

# Next, search for regulator enrichments with just one regulator. This will supersede the 2 regulator enrichments.
#ica_data.compute_trn_enrichment(max_regs=1,save=True)
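To make the AND/OR idea concrete, the sketch below builds the candidate gene sets that a two-regulator search would compare against an iModulon. It assumes the convention that `+` joins regulators with AND (genes shared by all) and `/` joins them with OR (genes in any); the function `combined_regulons` and the toy TRN are hypothetical and only illustrate the set logic, not PyModulon's implementation.

```python
from itertools import combinations

def combined_regulons(trn, max_regs=2):
    """Build gene sets for single regulators and their AND/OR combinations."""
    candidates = {reg: set(genes) for reg, genes in trn.items()}
    for size in range(2, max_regs + 1):
        for regs in combinations(sorted(trn), size):
            gene_sets = [set(trn[r]) for r in regs]
            candidates["+".join(regs)] = set.intersection(*gene_sets)  # AND
            candidates["/".join(regs)] = set.union(*gene_sets)         # OR
    return candidates

# Toy TRN: regulator -> set of regulated genes
trn = {"RegA": {"g1", "g2", "g3"}, "RegB": {"g2", "g3", "g4"}}
sets = combined_regulons(trn)
print(sorted(sets["RegA+RegB"]))  # ['g2', 'g3']
print(sorted(sets["RegA/RegB"]))  # ['g1', 'g2', 'g3', 'g4']
```

The number of combinations grows quickly with `max_regs`, which is one reason to keep it small.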

The regulatory iModulons are listed below.

In [11]:
#regulatory_imodulons = ica_data.imodulon_table[ica_data.imodulon_table.regulator.notnull()]
#print(len(ica_data.imodulon_table),'Total iModulons')
#print(len(regulatory_imodulons),'Regulatory iModulons')
#regulatory_imodulons

You can rename iModulons in this Jupyter notebook, or you can save the iModulon table as a CSV and edit it in Excel.

If two iModulons have the same regulator (e.g. 'Reg'), they will be named 'Reg-1' and 'Reg-2'.

In [12]:
#ica_data.rename_imodulons(regulatory_imodulons.regulator.to_dict())
#ica_data.imodulon_table.head()
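The 'Reg-1'/'Reg-2' naming convention can be previewed before renaming. This is a minimal sketch of the convention only; `number_duplicates` is a hypothetical helper, and `rename_imodulons` handles duplicates itself.

```python
from collections import Counter

def number_duplicates(names):
    """Append '-1', '-2', ... to names that are assigned more than once."""
    totals = Counter(names.values())
    seen = Counter()
    renamed = {}
    for imodulon, name in names.items():
        if totals[name] > 1:
            seen[name] += 1
            renamed[imodulon] = f"{name}-{seen[name]}"
        else:
            renamed[imodulon] = name
    return renamed

# Hypothetical mapping of iModulon index -> enriched regulator
print(number_duplicates({0: "Reg", 1: "Other", 2: "Reg"}))
# {0: 'Reg-1', 1: 'Other', 2: 'Reg-2'}
```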

In [13]:
#regulatory_imodulons = ica_data.imodulon_table[ica_data.imodulon_table.regulator.notnull()]

# Functional iModulons

GO annotations and KEGG pathways/modules were generated in the `1_create_the_gene_table.ipynb` notebook. Enrichments are calculated in this notebook and curated further in the `3_manual_iModulon_curation` notebook.

## GO Enrichments

First, load the Gene Ontology annotations.

In [14]:
DF_GO = pd.read_csv(path.join(external_data,'GO_annotations_curated.csv'),index_col=0)
DF_GO.head()

Unnamed: 0,gene_id,gene_name,gene_ontology
45,ZCP4_0936,efp,cytoplasm
46,ZCP4_0936,efp,translation elongation factor activity
65,ZCP4_1645,hisH,imidazoleglycerol-phosphate synthase activity
73,ZCP4_1199,rpe,metal ion binding
74,ZCP4_1199,rpe,"pentose-phosphate shunt, non-oxidative branch"


In [15]:
#DF_GO_enrich = ica_data.compute_annotation_enrichment(DF_GO,'gene_ontology')

In [16]:
#DF_GO_enrich.head()

## KEGG Enrichments

### Load KEGG mapping
The `kegg_mapping.csv` file contains KEGG orthologies, pathways, modules, and reactions. Only pathways and modules are relevant to iModulon characterization.

In [17]:
DF_KEGG = pd.read_csv(path.join(external_data,'kegg_mapping.csv'),index_col=0)
print(DF_KEGG.database.unique())
DF_KEGG.head()

['KEGG_pathway' 'KEGG_module' 'KEGG_reaction']


Unnamed: 0,gene_id,database,kegg_id
0,ZCP4_0005,KEGG_pathway,-
8,ZCP4_0006,KEGG_pathway,map00361
9,ZCP4_0006,KEGG_pathway,map00364
10,ZCP4_0006,KEGG_pathway,map00623
11,ZCP4_0006,KEGG_pathway,map01100


In [18]:
kegg_pathways = DF_KEGG[DF_KEGG.database == 'KEGG_pathway']
kegg_modules = DF_KEGG[DF_KEGG.database == 'KEGG_module']

### Perform enrichment
Enrichments are computed with the `compute_annotation_enrichment` function.

In [21]:
DF_pathway_enrich = ica_data.compute_annotation_enrichment(kegg_pathways,'kegg_id')

# Add links to KEGG maps with each iModulon's EC numbers highlighted
links = []
for index, row in DF_pathway_enrich.iterrows():
    EC_nums = ica_data.view_imodulon(row['imodulon']).EC_number
    EC_num_opts = set()
    # Skip genes without an EC number (NaN or empty string)
    for val in EC_nums.dropna():
        if val == '':
            continue
        # A gene may carry several comma-separated EC numbers
        EC_num_opts.update(val.split(','))
    # '%0A' is a URL-encoded newline, which separates the highlighted entries
    url = 'https://www.kegg.jp/kegg-bin/show_pathway?map='+row['kegg_id']+'&multi_query='+'%0A'.join(EC_num_opts)
    links.append(url)
DF_pathway_enrich['KEGG_highlight_link'] = links

DF_module_enrich = ica_data.compute_annotation_enrichment(kegg_modules,'kegg_id')
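The link construction can be isolated into a function for testing outside the notebook. The sketch below follows the same URL pattern as the cell above but sorts the EC numbers so the result is reproducible, which the cell above does not do; `kegg_highlight_url` is a hypothetical helper name.

```python
def kegg_highlight_url(kegg_id, ec_values):
    """Build a KEGG pathway URL that highlights the given EC numbers."""
    ec_numbers = set()
    for value in ec_values:
        if not isinstance(value, str) or value == "":
            continue  # skip NaN and empty entries
        ec_numbers.update(value.split(","))  # one gene may list several EC numbers
    query = "%0A".join(sorted(ec_numbers))   # '%0A' is a URL-encoded newline
    return ("https://www.kegg.jp/kegg-bin/show_pathway?map="
            + kegg_id + "&multi_query=" + query)

print(kegg_highlight_url("map00010", ["2.7.1.2", float("nan"), "4.1.2.14,4.1.3.42"]))
# https://www.kegg.jp/kegg-bin/show_pathway?map=map00010&multi_query=2.7.1.2%0A4.1.2.14%0A4.1.3.42
```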

In [22]:
DF_pathway_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,KEGG_highlight_link
0,nitrogen_fixation,map00625,2.295555e-06,0.0004223821,0.3,0.5,0.375,3.0,6.0,10.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...
1,nitrogen_fixation,map00910,9.558823e-06,0.0008794117,0.3,0.333333,0.315789,3.0,9.0,10.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...
2,translation_2,map03010,7.694818e-30,1.415847e-27,1.0,0.339623,0.507042,18.0,53.0,18.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...
3,ATP_synthase_1,map00195,4.782999e-18,8.800718e-16,0.346154,1.0,0.514286,9.0,9.0,26.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...
4,ATP_synthase_1,map00190,4.062174e-13,3.7372e-11,0.346154,0.473684,0.4,9.0,19.0,26.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...
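The `precision`, `recall`, and `f1score` columns in the table above are consistent with the usual definitions in terms of `TP`, `target_set_size`, and `imodulon_size`. The sketch below reproduces the first two displayed rows; the formulas are inferred from those rows rather than taken from the PyModulon source.

```python
def enrichment_scores(tp, target_set_size, imodulon_size):
    """Return (precision, recall, f1) for one iModulon/annotation pair."""
    precision = tp / imodulon_size   # fraction of iModulon genes in the annotation
    recall = tp / target_set_size    # fraction of annotated genes in the iModulon
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Row 0: nitrogen_fixation vs. map00625 (TP=3, target_set_size=6, imodulon_size=10)
print([round(x, 6) for x in enrichment_scores(3, 6, 10)])    # [0.3, 0.5, 0.375]
# Row 2: translation_2 vs. map03010 (TP=18, target_set_size=53, imodulon_size=18)
print([round(x, 6) for x in enrichment_scores(18, 53, 18)])  # [1.0, 0.339623, 0.507042]
```

A high precision with low recall, as in the second example, means every gene in the iModulon belongs to the pathway but the iModulon covers only part of it.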


In [23]:
DF_module_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,nitrogen_fixation,M00175,1.157677e-07,2.002781e-05,0.3,1.0,0.461538,3.0,3.0,10.0
1,translation_2,M00178,7.694818e-30,1.331204e-27,1.0,0.339623,0.507042,18.0,53.0,18.0
2,translation_2,M00179,5.858429000000001e-29,5.067541e-27,0.888889,0.516129,0.653061,16.0,31.0,18.0
3,ATP_synthase_1,M00157,4.782999e-18,8.274588e-16,0.346154,1.0,0.514286,9.0,9.0,26.0
4,ATP_synthase_1,M00001,8.710231e-10,7.53435e-08,0.230769,0.6,0.333333,6.0,10.0,26.0
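The `qvalue` column appears consistent with a Benjamini-Hochberg correction applied within each iModulon: in the rows above, the best hit of each iModulon has a q-value roughly 173 times its p-value, and the second-best hit of `translation_2` roughly half that factor. This is an inference from the displayed rows, not a statement about PyModulon's internals. A minimal pure-Python sketch of the procedure:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is
    # capped by the q-values of all larger p-values (monotonicity).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues

# Toy p-values, not taken from the tables above
print([round(q, 6) for q in benjamini_hochberg([0.01, 0.04, 0.03, 0.005])])
# [0.02, 0.04, 0.04, 0.02]
```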


### Convert KEGG IDs to human-readable names

In [24]:
def kegg_name(database, key):
    """Return the KEGG name for an ID, or None if it cannot be found."""
    # IDs containing '-' mark genes without an annotation; skip the lookup
    if '-' in key:
        return None
    text = REST.kegg_find(database, key).read()
    # The name follows the first tab on the first line of the response
    match = re.search(r'\t(.*)\n', text)
    return match.group(1) if match else None

for idx,key in tqdm(DF_pathway_enrich.kegg_id.items(),total=len(DF_pathway_enrich)):
    DF_pathway_enrich.loc[idx,'pathway_name'] = kegg_name('pathway', key)

for idx,key in tqdm(DF_module_enrich.kegg_id.items(),total=len(DF_module_enrich)):
    DF_module_enrich.loc[idx,'module_name'] = kegg_name('module', key)

  0%|          | 0/44 [00:00<?, ?it/s]

  0%|          | 0/33 [00:00<?, ?it/s]
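The name extraction relies on the lookup response being tab-delimited, with the ID before the tab and the name after it. The sketch below applies the same regular expression to a sample string; the exact response shape is an assumption made for illustration, and `parse_kegg_name` is a hypothetical helper.

```python
import re

def parse_kegg_name(text):
    """Return the text between the first tab and the next newline, or None."""
    match = re.search(r"\t(.*)\n", text)
    return match.group(1) if match else None

# Assumed shape of a single-hit response
sample = "path:map00010\tGlycolysis / Gluconeogenesis\n"
print(parse_kegg_name(sample))  # Glycolysis / Gluconeogenesis

# An empty response yields None instead of raising an exception
print(parse_kegg_name("\n"))    # None
```

If a lookup returns several lines, this pattern keeps only the name from the first line.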

In [25]:
DF_pathway_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,KEGG_highlight_link,pathway_name
0,nitrogen_fixation,map00625,2.295555e-06,0.0004223821,0.3,0.5,0.375,3.0,6.0,10.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...,Chloroalkane and chloroalkene degradation
1,nitrogen_fixation,map00910,9.558823e-06,0.0008794117,0.3,0.333333,0.315789,3.0,9.0,10.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...,Nitrogen metabolism
2,translation_2,map03010,7.694818e-30,1.415847e-27,1.0,0.339623,0.507042,18.0,53.0,18.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...,Ribosome
3,ATP_synthase_1,map00195,4.782999e-18,8.800718e-16,0.346154,1.0,0.514286,9.0,9.0,26.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...,Photosynthesis
4,ATP_synthase_1,map00190,4.062174e-13,3.7372e-11,0.346154,0.473684,0.4,9.0,19.0,26.0,https://www.kegg.jp/kegg-bin/show_pathway?map=...,Oxidative phosphorylation


In [30]:
pd.set_option('display.max_colwidth', 1000)
DF_pathway_enrich[DF_pathway_enrich['kegg_id'] == 'map00010']

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,KEGG_highlight_link,pathway_name
7,ATP_synthase_1,map00010,4.805597e-09,1.76846e-07,0.269231,0.333333,0.297872,7.0,21.0,26.0,https://www.kegg.jp/kegg-bin/show_pathway?map=map00010&multi_query=1.1.1.363%0A4.1.1.1%0A5.4.2.11%0A4.1.1.12%0A4.2.1.12%0A2.7.1.2%0A3.1.1.31%0A2.7.1.40%0A4.1.3.42%0A1.1.1.49%0A2.7.2.3%0A4.1.2.14%0A4.2.1.11%0A1.1.99.28%0A1.2.1.12%0A3.6.3.14,Glycolysis / Gluconeogenesis


In [27]:
ica_data.view_imodulon('ATP_synthase_1')

Unnamed: 0,gene_weight,gene_name,accession,start,end,strand,gene_product,COG,uniprot,operon,kegg_maps,EC_number
ZCP4_0092,0.124386,gpmA,CP006818.1,101594.0,102280.0,+,phosphoglycerate mutase,Carbohydrate transport and metabolism,P30798,Op234,map00010;map00260;map00680;map01100;map01110;m...,5.4.2.11
ZCP4_0117,0.081,ZCP4_0117,CP006818.1,132676.0,132999.0,+,hypothetical protein,No COG annotation,,Op258,,
ZCP4_0322,0.129047,eda,CP006818.1,362336.0,362962.0,-,2-keto-3-deoxy-phosphogluconate aldolase,Carbohydrate transport and metabolism,Q00384,Op452,map00030;map00630;map01100;map01120;map01200,"4.1.2.14,4.1.3.42"
ZCP4_0604,0.123002,gfo,CP006818.1,689454.0,690755.0,+,putative dehydrogenase,Function unknown,Q07982,Op718,,1.1.99.28
ZCP4_0621,0.116757,atpF,CP006818.1,710898.0,711536.0,-,F0F1-type ATP synthase%2C beta subunit,Energy production and conversion,Q5NPR5,Op735,map00190;map00195;map01100,
ZCP4_0622,0.111882,atpF,CP006818.1,711550.0,712050.0,-,F0F1-type ATP synthase%2C beta subunit,Energy production and conversion,Q5NPR7,Op736,map00190;map00195;map01100,
ZCP4_0623,0.142469,ZCP4_0623,CP006818.1,712172.0,712411.0,-,ATP synthase F0 subcomplex C subunit,Energy production and conversion,Q5NPR8,Op737,map00190;map00195;map01100,
ZCP4_0624,0.123656,atpB,CP006818.1,712447.0,713247.0,-,ATP synthase F0 subcomplex A subunit,Energy production and conversion,Q5NPR9,Op738,map00190;map00195;map01100,
ZCP4_0625,0.099825,ZCP4_0625,CP006818.1,713292.0,713609.0,-,hypothetical protein,Function unknown,D2N0W2,Op739,,
ZCP4_0903,0.100852,glk,CP006818.1,995849.0,996823.0,-,glucokinase,Carbohydrate transport and metabolism,P21908,Op1007,map00010;map00052;map00500;map00520;map00521;m...,2.7.1.2


In [28]:
DF_module_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,module_name
0,nitrogen_fixation,M00175,1.157677e-07,2.002781e-05,0.3,1.0,0.461538,3.0,3.0,10.0,"Nitrogen fixation, nitrogen => ammonia"
1,translation_2,M00178,7.694818e-30,1.331204e-27,1.0,0.339623,0.507042,18.0,53.0,18.0,
2,translation_2,M00179,5.858429000000001e-29,5.067541e-27,0.888889,0.516129,0.653061,16.0,31.0,18.0,
3,ATP_synthase_1,M00157,4.782999e-18,8.274588e-16,0.346154,1.0,0.514286,9.0,9.0,26.0,"F-type ATPase, prokaryotes and chloroplasts"
4,ATP_synthase_1,M00001,8.710231e-10,7.53435e-08,0.230769,0.6,0.333333,6.0,10.0,26.0,"Glycolysis (Embden-Meyerhof pathway), glucose ..."


## Save files

In [None]:
#DF_GO_enrich['source'] = 'GO'
# DF_pathway_enrich['source'] = 'KEGG pathways'
# DF_module_enrich['source'] = 'KEGG modules'
# DF_subti_enrich['source'] = 'SubtiWiki'

#DF_GO_enrich.rename({'gene_ontology':'annotation'},axis=1, inplace=True)
# DF_pathway_enrich.rename({'kegg_id':'annotation'},axis=1, inplace=True)
# DF_module_enrich.rename({'kegg_id':'annotation'},axis=1, inplace=True)
# DF_subti_enrich.rename({'value':'annotation'},axis=1, inplace=True)

#DF_enrichments = pd.concat([DF_GO_enrich, DF_pathway_enrich, DF_module_enrich, DF_subti_enrich])
#DF_enrichments.to_csv(path.join(data_dir,'functional_enrichments.csv'))

# Check for single gene iModulons

Some iModulons are dominated by a single, high-coefficient gene. These iModulons may result from:
1. Overdecomposition of the dataset to identify noisy genes
1. Artificial knock-out of single genes
1. Regulons with only one target gene

No matter what causes these iModulons, it is important to be aware of them. The `find_single_gene_imodulons` function identifies iModulons that are likely dominated by a single gene.

The iModulons identified by `find_single_gene_imodulons` may contain more than one gene, since a threshold-agnostic method is used to identify them.

In [None]:
sg_imods = ica_data.find_single_gene_imodulons(save=True)
len(sg_imods)
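One possible threshold-agnostic criterion, assumed here purely for illustration and not confirmed as PyModulon's implementation, is to flag an iModulon when its largest absolute gene weight exceeds twice the second largest. Because the test ignores the iModulon's gene threshold, a flagged iModulon can still contain several genes.

```python
def is_single_gene(weights, ratio=2.0):
    """Return True if the top absolute weight exceeds `ratio` times the runner-up."""
    ranked = sorted((abs(w) for w in weights), reverse=True)
    return len(ranked) > 1 and ranked[0] > ratio * ranked[1]

# Hypothetical gene weights for two iModulons
print(is_single_gene([0.9, 0.1, -0.05]))  # True: one gene dominates
print(is_single_gene([0.3, -0.25, 0.2]))  # False: weights are comparable
```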

In [None]:
for i,mod in enumerate(sg_imods):
    ica_data.rename_imodulons({mod:'SG_'+str(i+1)})

In [None]:
ica_data.imodulon_table[ica_data.imodulon_table.single_gene == True]

In [None]:
ica_data.view_imodulon('SG_1')

# Add chromosome/plasmid location

In [None]:
# Map each gene to the sequence region (chromosome or plasmid) it is listed under
gene_to_reg = {}
seq_region = None
with open('../Zymomonas_mobilis/sequence_files/genome.gff3', 'r') as f:
    for line in f:
        # A '##sequence-region' line names the replicon for the features that follow
        if '##sequence-region' in line:
            seq_region = line.split(' ')[1]
        if 'ID=gene-' in line:
            gene = line.split('ID=gene-')[1].split(';')[0]
            gene_to_reg[gene] = seq_region

# Add the region as a new column, in gene table order
ica_data.gene_table['chromosome_id'] = [gene_to_reg[index] for index in ica_data.gene_table.index]
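The parsing logic can be tried on a few in-memory lines before pointing it at a real file. The sketch below uses made-up GFF3 content and a hypothetical function name, `map_genes_to_regions`; it assumes, as the cell above does, that each `##sequence-region` line precedes the genes of that replicon.

```python
def map_genes_to_regions(lines):
    """Map gene IDs (from 'ID=gene-...') to the most recent sequence region."""
    gene_to_region = {}
    region = None
    for line in lines:
        if line.startswith("##sequence-region"):
            region = line.split()[1]
        elif "ID=gene-" in line:
            gene = line.split("ID=gene-")[1].split(";")[0].strip()
            gene_to_region[gene] = region
    return gene_to_region

# Made-up GFF3 content with one chromosome and one plasmid
gff3 = (
    "##sequence-region chr1 1 2000\n"
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene-G0001;Name=abc\n"
    "##sequence-region plasmidA 1 500\n"
    "plasmidA\tsrc\tgene\t10\t400\t.\t-\t.\tID=gene-G0002;Name=xyz\n"
).splitlines()

print(map_genes_to_regions(gff3))
# {'G0001': 'chr1', 'G0002': 'plasmidA'}
```

If a file lists all `##sequence-region` lines at the top, reading the sequence ID from the first tab-separated column of each feature line would be a more robust alternative.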

# Save iModulon object

In [None]:
from pymodulon.util import explained_variance
from pymodulon.io import *

In [None]:
# Add iModulon sizes and explained variance
for im in ica_data.imodulon_names:
    ica_data.imodulon_table.loc[im,'imodulon_size'] = len(ica_data.view_imodulon(im))
    ica_data.imodulon_table.loc[im,'explained_variance'] = explained_variance(ica_data,imodulons=im)

This will save your iModulon table, your thresholds, and any other information stored in the `ica_data` object.

In [None]:
save_to_json(ica_data, path.join('..','data','interim','zmo_raw.json.gz'))

If you prefer to view and edit your iModulon table in Excel, save it as a CSV and reload it as before.

In [None]:
ica_data.imodulon_table.to_csv(path.join('..','data','processed_data','imodulon_table_raw.csv'))