# Setup
This IPython notebook walks through the steps of characterizing iModulons using the semi-automated tools in PyModulon. You will need:

* M and A matrices
* Expression data (e.g. `log_tpm_norm.csv`)
* Gene table and KEGG/GO annotations (Generated in `gene_annotation.ipynb`)
* Sample table, with a column for `project` and `condition`
* TRN file

Optional:
* iModulon table (if you already have some characterized iModulons)

In [1]:
from pymodulon.core import IcaData
from pymodulon.plotting import *
from os import path
import pandas as pd
import re
from Bio.KEGG import REST
from tqdm.notebook import tqdm

In [2]:
# Enter the location of your data here
data_dir = path.join('../..','data','processed')

# GO and KEGG annotations are in the 'external' folder
external_data = path.join('../..','data','external')
interim_data = path.join('../..','data','interim')

## Check your sample table (i.e. metadata file)
Your metadata file will probably have a lot of columns, most of which you may not care about. Feel free to save a secondary copy of your metadata file with only columns that seem relevant to you. The two most important columns are:
1. `project`
2. `condition`

Make sure that both columns exist in your metadata file and that neither contains empty values.

In [4]:
df_metadata = pd.read_csv(path.join(data_dir,'metadata.tsv'),index_col=0,sep='\t')
df_metadata[['project','condition']].head()

experiment,project,condition
Control1-MSG,Control,control
Control2-MSG,Control,control
L-galactose-1,Azenta01,L-galactose
L-galactose-2,Azenta01,L-galactose
MSG-1A,Azenta01,Monosodium glutamate


In [5]:
print(df_metadata.project.notnull().all())
print(df_metadata.condition.notnull().all())

True
True
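
The same check can be wrapped in a small helper that reports which required columns are missing or contain empty values. This is a minimal sketch: `check_sample_table` is a hypothetical helper, not part of PyModulon, and the toy table stands in for `df_metadata`.

```python
import pandas as pd

def check_sample_table(df, required=("project", "condition")):
    """Return a dict of problems found in the required columns (empty if none)."""
    problems = {}
    for col in required:
        if col not in df.columns:
            problems[col] = "missing column"
        elif df[col].isnull().any():
            problems[col] = f"{int(df[col].isnull().sum())} empty value(s)"
    return problems

# Toy sample table standing in for df_metadata (hypothetical values)
toy = pd.DataFrame(
    {"project": ["Control", "Azenta01", None]},
    index=["Control1-MSG", "L-galactose-1", "MSG-1A"],
)
problems = check_sample_table(toy)
print(problems)
```

An empty dictionary means both columns are present and fully populated.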


# Organize the data
The column order of `A.csv` governs the order of samples in the dashboard, so we reorder the columns of `A.csv` and `log_tpm_norm.csv` to match the `sample_id` column of `df_metadata`.

In [24]:
# read A.csv
A = pd.read_csv(path.join(data_dir,'A.csv'),index_col=0)
A = A.reindex(columns=df_metadata['sample_id'])
A.to_csv(path.join(data_dir,'A_ordered.csv'))

In [26]:
# read log_tpm_norm.csv
log_tpm = pd.read_csv(path.join(data_dir,'log_tpm_norm.csv'),index_col=0)
log_tpm = log_tpm.reindex(columns=df_metadata['sample_id'])
log_tpm.to_csv(path.join(data_dir,'log_tpm_norm_ordered.csv'))
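
One thing to watch when reordering: `DataFrame.reindex` does not raise an error for labels that are absent from the original table. It creates all-NaN columns for them and drops any columns that were not requested. The sketch below uses toy data (the sample names are made up) to show how to detect both cases before saving.

```python
import pandas as pd

# Toy activity matrix (iModulons x samples) and a desired sample order
A_toy = pd.DataFrame(
    [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]],
    index=[0, 1],
    columns=["s1", "s2", "s3"],
)
sample_order = pd.Series(["s3", "s1", "s4"])  # "s4" is absent from A_toy

# Labels requested but not present become all-NaN columns after reindex
missing = [s for s in sample_order if s not in A_toy.columns]
# Columns present but not requested are dropped
dropped = [s for s in A_toy.columns if s not in set(sample_order)]

reordered = A_toy.reindex(columns=sample_order)
print(missing, dropped)
print(reordered.isnull().all())
```

If either list is non-empty, the sample IDs in the metadata and in the matrix probably do not match exactly and are worth checking before the reordered file is written.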

## Load the data
You're now ready to load your IcaData object!

In [6]:
ica_data = IcaData(M = path.join(data_dir,'M.csv'),
                   A = path.join(data_dir,'A_ordered.csv'),
                   X = path.join(data_dir,'log_tpm_norm_ordered.csv'),
                   gene_table = path.join(external_data,'gene_info.csv'),
                   sample_table = path.join(data_dir,'metadata.tsv'),
                   threshold_method='kmeans')



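If you don't have a TRN (or have a very minimal TRN), use `threshold_method = 'kmeans'`, as in the cell above. If you do have a TRN, pass it through the `trn` argument, as in the commented-out cell below.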

In [8]:
# ica_data = IcaData(M = path.join(data_dir,'M.csv'),
#                    A = path.join(data_dir,'A.csv'),
#                    X = path.join(data_dir,'log_tpm_norm.csv'),
#                    gene_table = path.join(data_dir,'gene_info.csv'),
#                    sample_table = path.join(data_dir,'metadata.tsv'),
#                    trn = path.join(data_dir,'TRN.csv'),
#                    threshold_method = 'kmeans')

# Functional iModulons

GO annotations and KEGG pathways/modules were generated in the `1_create_the_gene_table.ipynb` notebook. Enrichments are calculated in this notebook and further curated in the `3_manual_iModulon_curation` notebook.

## GO Enrichments

First, load the Gene Ontology annotations.

In [7]:
DF_GO = pd.read_csv(path.join(external_data,'GO_annotations_curated.csv'),index_col=0)
DF_GO.head()

Unnamed: 0,locus_tag,GOs
54,XNR_RS00270,mannosyltransferase activity
55,XNR_RS00270,molecular_function
56,XNR_RS00270,catalytic activity
57,XNR_RS00270,dolichyl-phosphate beta-D-mannosyltransferase ...
58,XNR_RS00270,cellular_component


In [8]:
# Change 'locus_tag' to 'gene_id'
DF_GO = DF_GO.rename(columns={'locus_tag':'gene_id'})

In [9]:
DF_GO_enrich = ica_data.compute_annotation_enrichment(DF_GO,'GOs')

In [10]:
DF_GO_enrich.head()

Unnamed: 0,imodulon,GOs,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,15,ribonucleoprotein complex,1.1761319999999999e-63,1.336674e-60,0.272727,0.893617,0.41791,42.0,47.0,154.0
1,15,ribosome,1.1761319999999999e-63,1.336674e-60,0.272727,0.893617,0.41791,42.0,47.0,154.0
2,15,intracellular non-membrane-bounded organelle,3.889383e-61,1.7681129999999998e-58,0.272727,0.84,0.411765,42.0,50.0,154.0
3,15,ribosomal subunit,2.810101e-61,1.7681129999999998e-58,0.25974,0.909091,0.40404,40.0,44.0,154.0
4,15,non-membrane-bounded organelle,3.889383e-61,1.7681129999999998e-58,0.272727,0.84,0.411765,42.0,50.0,154.0
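
The `precision`, `recall`, and `f1score` columns in the table above are consistent with the usual definitions applied to `TP`, `target_set_size`, and `imodulon_size`. The short check below recomputes them for the first row; reading `TP` as the number of genes shared between the iModulon and the annotation is an interpretation based on those values, not something stated in the output.

```python
# Values from the first row of the table above
# (iModulon 15, "ribonucleoprotein complex")
TP = 42               # genes in both the iModulon and the GO term (assumed meaning)
target_set_size = 47  # genes annotated with the GO term
imodulon_size = 154   # genes in the iModulon

precision = TP / imodulon_size    # share of the iModulon covered by the term
recall = TP / target_set_size     # share of the term captured by the iModulon
f1score = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 6), round(f1score, 6))
# prints: 0.272727 0.893617 0.41791
```

A high recall with a low precision, as here, suggests that the iModulon contains most of the annotated genes along with many genes outside the annotation.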


## KEGG Enrichments

### Load KEGG mapping
The `kegg_mapping.csv` file contains KEGG orthologies, pathways, modules, and reactions. Only pathways and modules are relevant to iModulon characterization.

In [11]:
DF_KEGG = pd.read_csv(path.join(external_data,'kegg_mapping.csv'),index_col=0)
print(DF_KEGG.database.unique())
DF_KEGG.head()

['KEGG_ko' 'KEGG_Pathway' 'KEGG_Module' 'KEGG_Reaction']


Unnamed: 0,gene_id,database,kegg_id
0,XNR_RS30570,KEGG_ko,-
1,XNR_RS00010,KEGG_ko,-
2,XNR_RS00015,KEGG_ko,-
3,XNR_RS00020,KEGG_ko,-
4,XNR_RS00025,KEGG_ko,-


In [12]:
kegg_pathways = DF_KEGG[DF_KEGG.database == 'KEGG_Pathway']
kegg_modules = DF_KEGG[DF_KEGG.database == 'KEGG_Module']
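
The mapping file uses `-` as a placeholder where a gene has no identifier, and these placeholders show up later as `Bad KEGG ID: -` during the name lookup. One option is to drop them when building the pathway and module tables. The sketch below shows the filter on toy data; the gene IDs are made up, and whether to filter before or after enrichment is a judgment call.

```python
import pandas as pd

# Toy mapping table with the same columns as DF_KEGG
toy_kegg = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4"],
    "database": ["KEGG_Pathway", "KEGG_Pathway", "KEGG_Module", "KEGG_Module"],
    "kegg_id": ["map01212", "-", "M00373", "-"],
})

# Keep only rows that carry a real identifier
has_id = toy_kegg.kegg_id != "-"
toy_pathways = toy_kegg[(toy_kegg.database == "KEGG_Pathway") & has_id]
toy_modules = toy_kegg[(toy_kegg.database == "KEGG_Module") & has_id]

print(toy_pathways.kegg_id.tolist(), toy_modules.kegg_id.tolist())
```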

### Perform enrichment
Enrichments are computed with the `compute_annotation_enrichment` function.

In [13]:
DF_pathway_enrich = ica_data.compute_annotation_enrichment(kegg_pathways,'kegg_id')
DF_module_enrich = ica_data.compute_annotation_enrichment(kegg_modules,'kegg_id')

In [14]:
DF_pathway_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,0,map01212,3.272854e-26,7.953035999999999e-24,0.216495,0.446809,0.291667,21.0,47.0,97.0
1,0,map01200,9.216777000000001e-17,1.119838e-14,0.237113,0.153333,0.186235,23.0,150.0,97.0
2,0,map00071,1.505542e-15,9.756987e-14,0.14433,0.333333,0.201439,14.0,42.0,97.0
3,0,map00280,1.606088e-15,9.756987e-14,0.154639,0.288462,0.201342,15.0,52.0,97.0
4,0,map00640,9.103613e-15,4.424356e-13,0.14433,0.297872,0.194444,14.0,47.0,97.0


In [15]:
DF_module_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size
0,0,M00373,6.401638e-13,1.76045e-10,0.092784,0.5625,0.159292,9.0,16.0,97.0
1,0,M00087,8.7454e-11,1.202493e-08,0.082474,0.470588,0.140351,8.0,17.0,97.0
2,0,M00088,1.384311e-09,1.268951e-07,0.061856,0.666667,0.113208,6.0,9.0,97.0
3,0,M00375,3.414887e-09,2.347735e-07,0.061856,0.6,0.11215,6.0,10.0,97.0
4,0,M00095,5.903501e-08,3.246926e-06,0.051546,0.625,0.095238,5.0,8.0,97.0


### Convert KEGG IDs to human-readable names

In [16]:
import urllib.error

for idx,key in tqdm(DF_pathway_enrich.kegg_id.items(),total=len(DF_pathway_enrich)):
    try:
        text = REST.kegg_find('pathway',key).read()
        name = re.search('\t(.*)\n',text).group(1)
        DF_pathway_enrich.loc[idx,'pathway_name'] = name
    except (AttributeError, urllib.error.HTTPError):
        print(f"Bad KEGG ID: {key}")
        DF_pathway_enrich.loc[idx,'pathway_name'] = None
    
for idx,key in tqdm(DF_module_enrich.kegg_id.items(),total=len(DF_module_enrich)):
    try:
        text = REST.kegg_find('module',key).read()
        name = re.search('\t(.*)\n',text).group(1)
        DF_module_enrich.loc[idx,'module_name'] = name
    except (AttributeError, urllib.error.HTTPError):
        print(f"Bad KEGG ID: {key}")
        DF_module_enrich.loc[idx,'module_name'] = None

  0%|          | 0/163 [00:00<?, ?it/s]

Bad KEGG ID: map00072
Bad KEGG ID: map00281
Bad KEGG ID: map01130
Bad KEGG ID: map01130
Bad KEGG ID: map00281
Bad KEGG ID: -
Bad KEGG ID: -


  0%|          | 0/108 [00:00<?, ?it/s]

Bad KEGG ID: M00167
Bad KEGG ID: M00323
Bad KEGG ID: M00670
Bad KEGG ID: M00210
Bad KEGG ID: M00669
Bad KEGG ID: M00178
Bad KEGG ID: M00179
Bad KEGG ID: M00183
Bad KEGG ID: M00215
Bad KEGG ID: M00208
Bad KEGG ID: M00491
Bad KEGG ID: M00206
Bad KEGG ID: M00207
Bad KEGG ID: M00240
Bad KEGG ID: M00188
Bad KEGG ID: -
Bad KEGG ID: M00236
Bad KEGG ID: M00435
Bad KEGG ID: M00436
Bad KEGG ID: M00239
Bad KEGG ID: M00239
Bad KEGG ID: M00237
Bad KEGG ID: M00236
Bad KEGG ID: M00439
Bad KEGG ID: M00205
Bad KEGG ID: M00201
Bad KEGG ID: M00237
Bad KEGG ID: M00200
Bad KEGG ID: M00215
Bad KEGG ID: M00216
Bad KEGG ID: M00619
Bad KEGG ID: M00178
Bad KEGG ID: M00233
Bad KEGG ID: M00208
Bad KEGG ID: -
Bad KEGG ID: M00460
Bad KEGG ID: M00237
Bad KEGG ID: M00439
Bad KEGG ID: M00239
Bad KEGG ID: M00222
Bad KEGG ID: M00254
Bad KEGG ID: M00216
Bad KEGG ID: M00212
Bad KEGG ID: M00454
Bad KEGG ID: M00221
Bad KEGG ID: M00335
Bad KEGG ID: M00237
Bad KEGG ID: M00254
Bad KEGG ID: M00606
Bad KEGG ID: M00239
...
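
The same KEGG ID is often enriched in several iModulons, so the loops above query the KEGG server repeatedly for identical IDs. A possible refinement is to look up each distinct ID once and map the names back onto the table. The sketch below runs offline: `fake_kegg_find` is a hypothetical stand-in for `REST.kegg_find(...).read()`, and the response format it returns is an assumption.

```python
import re
import pandas as pd

def parse_kegg_name(text):
    """Extract the name from the first tab-separated line, or None if absent."""
    match = re.search(r"\t(.*)\n", text)
    return match.group(1) if match else None

def fake_kegg_find(database, key):
    """Hypothetical offline stand-in for REST.kegg_find(database, key).read()."""
    records = {"map01212": "path:map01212\tFatty acid metabolism\n"}
    return records.get(key, "\n")  # unknown IDs return a line with no name

calls = []  # records every lookup so the number of queries can be inspected

def lookup(key):
    calls.append(key)
    return parse_kegg_name(fake_kegg_find("pathway", key))

toy = pd.DataFrame({"kegg_id": ["map01212", "-", "map01212", "-"]})

# Query each distinct ID once, then map the names back onto every row
names = {key: lookup(key) for key in toy.kegg_id.unique()}
toy["pathway_name"] = toy.kegg_id.map(names)

print(len(calls), "lookups for", len(toy), "rows")
```

Returning `None` from the parser also removes the need to catch `AttributeError` when the response contains no name.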

In [17]:
DF_pathway_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,pathway_name
0,0,map01212,3.272854e-26,7.953035999999999e-24,0.216495,0.446809,0.291667,21.0,47.0,97.0,Fatty acid metabolism
1,0,map01200,9.216777000000001e-17,1.119838e-14,0.237113,0.153333,0.186235,23.0,150.0,97.0,Carbon metabolism
2,0,map00071,1.505542e-15,9.756987e-14,0.14433,0.333333,0.201439,14.0,42.0,97.0,Fatty acid degradation
3,0,map00280,1.606088e-15,9.756987e-14,0.154639,0.288462,0.201342,15.0,52.0,97.0,"Valine, leucine and isoleucine degradation"
4,0,map00640,9.103613e-15,4.424356e-13,0.14433,0.297872,0.194444,14.0,47.0,97.0,Propanoate metabolism


In [18]:
DF_module_enrich.head()

Unnamed: 0,imodulon,kegg_id,pvalue,qvalue,precision,recall,f1score,TP,target_set_size,imodulon_size,module_name
0,0,M00373,6.401638e-13,1.76045e-10,0.092784,0.5625,0.159292,9.0,16.0,97.0,Ethylmalonyl pathway
1,0,M00087,8.7454e-11,1.202493e-08,0.082474,0.470588,0.140351,8.0,17.0,97.0,beta-Oxidation
2,0,M00088,1.384311e-09,1.268951e-07,0.061856,0.666667,0.113208,6.0,9.0,97.0,"Ketone body biosynthesis, acetyl-CoA => acetoa..."
3,0,M00375,3.414887e-09,2.347735e-07,0.061856,0.6,0.11215,6.0,10.0,97.0,Hydroxypropionate-hydroxybutylate cycle
4,0,M00095,5.903501e-08,3.246926e-06,0.051546,0.625,0.095238,5.0,8.0,97.0,"C5 isoprenoid biosynthesis, mevalonate pathway"


## Save files

In [19]:
DF_GO_enrich['source'] = 'GO'
DF_pathway_enrich['source'] = 'KEGG pathways'
DF_module_enrich['source'] = 'KEGG modules'
# DF_subti_enrich['source'] = 'SubtiWiki'

# Use a shared 'annotation' column so the tables line up when concatenated
# (the GO annotation column in DF_GO_enrich is named 'GOs')
DF_GO_enrich.rename({'GOs':'annotation'},axis=1, inplace=True)
DF_pathway_enrich.rename({'kegg_id':'annotation'},axis=1, inplace=True)
DF_module_enrich.rename({'kegg_id':'annotation'},axis=1, inplace=True)
# DF_subti_enrich.rename({'value':'annotation'},axis=1, inplace=True)

DF_enrichments = pd.concat([DF_GO_enrich, DF_pathway_enrich, DF_module_enrich])
DF_enrichments.to_csv(path.join(data_dir,'functional_enrichments.csv'))
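
When enrichment tables are concatenated, columns with different names are kept side by side and padded with missing values. Giving every table the same `annotation` column and a `source` label keeps the combined file tidy. The toy example below illustrates the difference; the two single-row tables are placeholders.

```python
import pandas as pd

go = pd.DataFrame({"imodulon": [15], "GOs": ["ribosome"]})
pathways = pd.DataFrame({"imodulon": [0], "kegg_id": ["map01212"]})

# Without harmonizing, each source keeps its own annotation column
ragged = pd.concat([go, pathways], ignore_index=True)

# With a shared 'annotation' column and a 'source' label, the rows line up
go_h = go.rename(columns={"GOs": "annotation"}).assign(source="GO")
pw_h = pathways.rename(columns={"kegg_id": "annotation"}).assign(source="KEGG pathways")
tidy = pd.concat([go_h, pw_h], ignore_index=True)

print(ragged.isnull().sum().sum(), tidy.isnull().sum().sum())
# prints: 2 0
```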

# Check for single gene iModulons

Some iModulons are dominated by a single, high-coefficient gene. These iModulons may result from:
1. Overdecomposition of the dataset to identify noisy genes
1. Artificial knock-out of single genes
1. Regulons with only one target gene

No matter what causes these iModulons, it is important to be aware of them. The `find_single_gene_imodulons` function identifies iModulons that are likely dominated by a single gene.

Because the method is threshold-agnostic, the iModulons it flags may still contain more than one gene.
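
To make "dominated by a single gene" concrete, here is one plausible threshold-agnostic heuristic: flag a component when its largest absolute gene weight is more than twice the second largest. The ratio of 2 and the helper name `looks_single_gene` are assumptions for illustration; the criterion used inside `find_single_gene_imodulons` may differ.

```python
import pandas as pd

def looks_single_gene(weights, ratio=2.0):
    """True if the largest absolute weight exceeds `ratio` times the second largest."""
    ranked = weights.abs().sort_values(ascending=False)
    return bool(ranked.iloc[0] > ratio * ranked.iloc[1])

# Toy M matrix (genes x iModulons) with made-up weights
M_toy = pd.DataFrame(
    {"im_a": [0.30, 0.02, -0.01, 0.03], "im_b": [0.10, -0.09, 0.08, 0.07]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

flagged = [name for name in M_toy.columns if looks_single_gene(M_toy[name])]
print(flagged)
# prints: ['im_a']
```

Because the check compares weights only with each other, it does not depend on the gene-membership threshold, which is why a flagged iModulon can still contain several genes.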

In [20]:
sg_imods = ica_data.find_single_gene_imodulons(save=True)
len(sg_imods)

0

In [14]:
for i,mod in enumerate(sg_imods):
    ica_data.rename_imodulons({mod:'SG_'+str(i+1)})

In [15]:
ica_data.imodulon_table[ica_data.imodulon_table.single_gene == True]

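AttributeError: 'DataFrame' object has no attribute 'single_gene'

This error is expected here: no single-gene iModulons were found (`len(sg_imods)` is 0), so the `single_gene` column was presumably never added to the iModulon table.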

In [21]:
ica_data.view_imodulon(17)

Unnamed: 0,gene_weight,gene_name,eggNOG_OGs,Description,GOs,EC,KEGG_ko,KEGG_Pathway,KEGG_Module,KEGG_Reaction,...,BiGG_Reaction,PFAMs,accession,old_locus_tag,start,end,strand,gene_product,COG,operon
XNR_RS00455,-0.027750,gloA,"COG0346@1|root,COG0346@2|Bacteria,2IHZH@201174...",glyoxalase bleomycin resistance protein dioxyg...,-,-,-,-,-,-,...,-,Glyoxalase,NC_020990.1,XNR_0096,113835,114227,+,VOC family protein,Amino acid transport and metabolism,Op78
XNR_RS00460,-0.026695,XNR_RS00460,"COG1846@1|root,COG1846@2|Bacteria,2GM7K@201174...",MarR family,-,-,-,-,-,-,...,-,MarR_2,NC_020990.1,XNR_0097,114261,114791,-,MarR family transcriptional regulator,Transcription,Op79
XNR_RS00470,-0.035224,XNR_RS00470,"290IR@1|root,2ZN7H@2|Bacteria,2IFZ1@201174|Act...",Domain of unknown function (DUF5134),-,-,-,-,-,-,...,-,DUF5134,NC_020990.1,XNR_0099,116714,117280,+,DUF5134 domain-containing protein,Function unknown,Op80
XNR_RS00540,0.045767,XNR_RS00540,"COG2971@1|root,COG2971@2|Bacteria,2GKBF@201174...",BadF BadG BcrA BcrD,-,-,-,-,-,-,...,-,BcrAD_BadFG,NC_020990.1,XNR_0113,130210,131250,+,ATPase BadF/BadG/BcrA/BcrD type,Carbohydrate transport and metabolism,Op94
XNR_RS00625,-0.038861,XNR_RS00625,"COG4291@1|root,COG4291@2|Bacteria,2IIVD@201174...",membrane,-,-,-,-,-,-,...,-,DUF1345,NC_020990.1,XNR_0130,154417,155052,+,DUF1345 domain-containing protein,Energy production and conversion,Op108
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XNR_RS28205,0.033241,XNR_RS28205,"COG1280@1|root,COG1280@2|Bacteria,2GMDP@201174...",Lysine exporter protein (LysE YggA),-,-,-,-,-,-,...,-,LysE,NC_020990.1,XNR_5687,6414812,6415465,+,LysE family translocator,Amino acid transport and metabolism,Op4258
XNR_RS31650,-0.058175,XNR_RS31650,,,,,,,,,...,,,NC_020990.1,XNR_5714,6438487,6438621,+,hypothetical protein,X,Op4277
XNR_RS31655,-0.034436,XNR_RS31655,"COG0654@1|root,COG0654@2|Bacteria,2GN2W@201174...",PFAM monooxygenase FAD-binding,-,"1.14.13.232,1.14.13.233,1.14.13.7","ko:K03380,ko:K14252","ko00253,ko00623,ko00627,ko01057,ko01120,ko0113...","M00780,M00823","R00815,R03566,R05462,R09190",...,-,"FAD_binding_3,Phe_hydrox_dim",NC_020990.1,,6497081,6497611,-,FAD-dependent monooxygenase,Energy production and conversion,Op4309
XNR_RS28855,0.032170,dppC,"COG1173@1|root,COG1173@2|Bacteria,2GKAW@201174...",PFAM binding-protein-dependent transport syste...,-,-,"ko:K02031,ko:K02034","ko02024,map02024",M00239,-,...,-,"ABC_tran,BPD_transp_1",NC_020990.1,XNR_5822,6555690,6556559,+,ABC transporter permease,Amino acid transport and metabolism,Op4344


# Save iModulon object

In [22]:
from pymodulon.util import explained_variance
from pymodulon.io import *

In [23]:
# Add iModulon sizes and explained variance
for im in ica_data.imodulon_names:
    ica_data.imodulon_table.loc[im,'imodulon_size'] = len(ica_data.view_imodulon(im))
    ica_data.imodulon_table.loc[im,'explained_variance'] = explained_variance(ica_data,imodulons=im)

In [24]:
ica_data.imodulon_table.head()

Unnamed: 0,imodulon_size,explained_variance
0,97.0,0.006052
1,24.0,0.003215
2,16.0,0.004825
3,21.0,0.010579
4,17.0,0.002131
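
As a rough intuition for the `explained_variance` column, the sketch below computes the fraction of the squared signal in a data matrix that a single component reconstructs on its own. This is a simplified definition chosen for illustration; PyModulon's `explained_variance` function may center or normalize the data differently, so the numbers are not expected to match it.

```python
import numpy as np

def explained_variance_sketch(X, M, A, k):
    """Fraction of the squared signal in X recovered by component k alone."""
    reconstruction = np.outer(M[:, k], A[k, :])
    residual = np.linalg.norm(X - reconstruction) ** 2
    return 1.0 - residual / np.linalg.norm(X) ** 2

rng = np.random.default_rng(0)
M_toy = rng.normal(size=(50, 2))   # genes x components
A_toy = rng.normal(size=(2, 10))   # components x samples
X_toy = M_toy @ A_toy              # data built from both components

ev = [explained_variance_sketch(X_toy, M_toy, A_toy, k) for k in range(2)]
print(ev)
```

Under this definition a component that reproduces the data exactly scores 1, and a component that adds more error than it removes can score below 0.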


This will save your iModulon table, your thresholds, and any other information stored in the `ica_data` object.

In [26]:
save_to_json(ica_data, path.join('../..','data','interim','salb_raw.json.gz'))

If you prefer to view and edit your iModulon table in Excel, save it as a CSV file and reload the `IcaData` object as before.

In [27]:
ica_data.imodulon_table.to_csv(path.join('../..','data','interim','imodulon_table_raw.csv'))
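
When the CSV is read back with pandas, the saved index returns as an ordinary column named `Unnamed: 0` unless `index_col=0` is passed. The round trip below demonstrates both behaviors in memory, using the first two rows of the table above.

```python
import io
import pandas as pd

table = pd.DataFrame(
    {"imodulon_size": [97.0, 24.0], "explained_variance": [0.006052, 0.003215]},
    index=[0, 1],
)
csv_text = table.to_csv()

# Default read: the saved index comes back as a data column
plain = pd.read_csv(io.StringIO(csv_text))
# index_col=0 restores it as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
# prints: ['Unnamed: 0', 'imodulon_size', 'explained_variance']
print(list(restored.columns))
# prints: ['imodulon_size', 'explained_variance']
```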

In [29]:
# Read the CSV file containing the iModulon names
imodulon_names_df = pd.read_csv(path.join(interim_data, 'imodulon_table_raw.csv'))
# Extract the names into a list
imodulon_names = imodulon_names_df['Unnamed: 0'].tolist()
all_genes = []  # (gene, iModulon) pairs collected from every iModulon

for imodulon_name in imodulon_names:
    imodulon_data = ica_data.view_imodulon(imodulon_name)
    genes = imodulon_data['gene_name'].tolist()
    for gene in genes:
        all_genes.append((gene, imodulon_name))

# Convert the collected gene data to a pandas DataFrame
df = pd.DataFrame(all_genes, columns=['Gene', 'Imodulon'])

# Save the DataFrame to a CSV file
df.to_csv(path.join(data_dir,'genes_with_imodulon.csv'), index=False)

In [30]:
imodulon_names_df

Unnamed: 0.1,Unnamed: 0,imodulon_size,explained_variance
0,0,97.0,0.006052
1,1,24.0,0.003215
2,2,16.0,0.004825
3,3,21.0,0.010579
4,4,17.0,0.002131
...,...,...,...
58,58,514.0,0.036333
59,59,42.0,0.004464
60,60,411.0,0.002833
61,61,6.0,0.000503