# Modelling Pipeline (Culling Variant) Tutorial

This tutorial walks through running the modeling pipeline, which performs "iterative culling" of interaction terms based on their p-values after running LassoCV on the initial formulas. It also shows how to add or substitute main effects into the final formulas obtained from this process, in an effort to improve the variance explained.

## Imports

In [1]:
import statsmodels.api as sm
import patsy
from sklearn.metrics import r2_score
import numpy as np
from sklearn.model_selection import StratifiedKFold
#from sklearn.linear_model import LassoCV
import warnings
import asyncio

import pandas as pd
from scipy.stats import rankdata, pearsonr
import nest_asyncio
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
from patsy import dmatrices, dmatrix

from typing import List, Tuple, Any, Dict, Optional

nest_asyncio.apply()

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=matplotlib.MatplotlibDeprecationWarning)
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings("ignore", message="The least populated class in y has only")

from yeastdnnexplorer.interface import *

In [12]:
from iterative_culling import (
    create_formulas_with_max_lrb,
    process_tf_data_combined,
    find_common_features_across_tfs,
    filter_significant_features,
    create_and_fit_combined_filtered_models,
    get_existing_formulas_and_tfs,
    evaluate_model,
    get_cross_validation_folds_from_dataframe,
    classify_genes,
    create_max_LRB_columns,
    custom_wrapper_cross_validation,
    iterative_model_selection,
    add_main_effects,
)

In [3]:
# load the data obtained from the DB; a tutorial on retrieving it can be found in the LassoCV notebook
all_cc_mcisaac_data = pd.read_csv("/Users/ericjia/Downloads/updated_chase_cc_mcisaac_data.csv")
all_tfs = [ "WTM1", "MIG2", "RIM101", "GZF3", "ASH1", "TEC1", "SIP3", "SKN7", "WTM2", "HAA1", "MET31", "CRZ1", "CHA4", "ZAP1", "SKO1", "FZF1", "HAP2", "HAP3", "HAP5", "INO4", "RTG1", "MOT3", "CBF1", "MSN2", "RTG3", "RSF2", "HIR2", "SIP4", "UME1", "CIN5", "ROX1", "XBP1", "RDR1", "PDR3", "RLM1", "SFL1", "SMP1", "PHD1", "SUT1", "SOK2", "STP2", "AFT2", "YRR1", "GAL4", "LEU3", "SWI6", "ACE2", "RGM1", "GCN4", "MIG3", "STB5", "RFX1", "ARG81", "AZF1", "SFP1", "GTS1", "FKH1", "YOX1", "FKH2", "DIG1", "MET28", "RGT1", ]

# keep only the columns for TFs in all_tfs (plus target_locus_tag); otherwise the max_LRB calculation picks up error/noise
all_cc_mcisaac_data = all_cc_mcisaac_data[[col for col in all_cc_mcisaac_data.columns if any(substr in col for substr in all_tfs) or col == 'target_locus_tag']]

# add the max_LRB terms corresponding to each TF
all_cc_mcisaac_data = create_max_LRB_columns(all_cc_mcisaac_data)
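
To make the column filter above concrete, here is a minimal sketch that applies the same condition to a plain list of column names. The `LRB_`/`LRR_` prefixes follow the formulas used in this tutorial, but the specific toy columns (`LRB_SWI5`, `gene_name`) are made up for illustration.

```python
# Toy inputs: the TF list and column names are illustrative assumptions.
tfs = ["ACE2", "GAL4"]
columns = [
    "target_locus_tag",
    "LRB_ACE2",
    "LRR_ACE2",
    "LRB_GAL4",
    "LRB_SWI5",   # TF not in `tfs`, so it should be dropped
    "gene_name",  # unrelated column, also dropped
]

# Same condition as the notebook cell: keep a column if any TF name occurs
# in it as a substring, or if it is the target_locus_tag key column.
kept = [
    col for col in columns
    if any(tf in col for tf in tfs) or col == "target_locus_tag"
]
print(kept)
# ['target_locus_tag', 'LRB_ACE2', 'LRR_ACE2', 'LRB_GAL4']
```

Because the test is a substring match, a column whose name merely contains a TF name would also be kept, which is worth checking against the actual column names in your data.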

## LassoCV

First, we create the formulas to be passed into the LASSO regression methods. The formulas take the form:

LRR_{tf1} ~ LRB_{tf1} + LRB_{tf1}:LRB_{tf2} + LRB_{tf1}:LRB_{tf3} + ... + max_LRB_{tf1}

The final term, max_LRB, is an additional term included for each response TF to potentially account for variance that the other terms do not explain.

In [6]:
max_formulas = create_formulas_with_max_lrb(all_tfs)
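
As a rough illustration of the formula shape described above, the sketch below builds one formula string per TF. `build_formulas_sketch` is a hypothetical stand-in, not the real `create_formulas_with_max_lrb`; in particular, returning a dictionary keyed by TF name is an assumption.

```python
from typing import Dict, List


def build_formulas_sketch(tfs: List[str]) -> Dict[str, str]:
    """Hypothetical helper: one patsy-style formula string per response TF."""
    formulas = {}
    for tf in tfs:
        # One interaction term between the response TF and every other TF.
        interactions = [f"LRB_{tf}:LRB_{other}" for other in tfs if other != tf]
        terms = [f"LRB_{tf}"] + interactions + [f"max_LRB_{tf}"]
        formulas[tf] = f"LRR_{tf} ~ " + " + ".join(terms)
    return formulas


example = build_formulas_sketch(["ACE2", "GAL4", "GCN4"])
print(example["ACE2"])
# LRR_ACE2 ~ LRB_ACE2 + LRB_ACE2:LRB_GAL4 + LRB_ACE2:LRB_GCN4 + max_LRB_ACE2
```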

Now we call the wrapper method, which performs LASSO in both cases and returns a dictionary of the surviving features for each response TF. Moving forward, we keep only the features that survive in both.

In [7]:
common_features = find_common_features_across_tfs(all_tfs, all_cc_mcisaac_data, max_formulas)
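
The idea of keeping only features that survive two LASSO fits can be sketched on synthetic data. This is not the implementation of `find_common_features_across_tfs`. It assumes the two cases are the full dataset and the top 10% of rows ranked by one predictor (mirroring the culling step later in this tutorial), and the feature names `x0`..`x4` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n = 500
names = ["x0", "x1", "x2", "x3", "x4"]
X = rng.normal(size=(n, len(names)))
# Only x0 and x1 truly drive the response in this synthetic data.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)


def surviving_features(X, y):
    """Return the names of features with a non-zero LassoCV coefficient."""
    model = LassoCV(cv=5, random_state=0).fit(X, y)
    return {name for name, coef in zip(names, model.coef_) if coef != 0}


# Case 1: all rows.
full = surviving_features(X, y)

# Case 2 (assumed): the top 10% of rows ranked by x0.
k = n // 10
top = np.argsort(X[:, 0])[-k:]
subset = surviving_features(X[top], y[top])

# Keep only the features that survive in both fits.
common = full & subset
print(sorted(common))
```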

## Iterative Culling

Next, we run the culling process in two stages. First, we fit on the entire dataset with a p-value threshold of 0.001 and cull insignificant terms until all remaining terms are significant. Then we repeat the process on the top 10% of the data by binding, with a less stringent p-value threshold of 0.01, again culling until all features are significant. We keep track of the resulting models.

In [8]:
final_models = create_and_fit_combined_filtered_models(common_features, all_cc_mcisaac_data, full_data_iterations=3, top10_data_iterations = 3)
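
The two-stage culling described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code behind `create_and_fit_combined_filtered_models`: `cull_by_pvalue` is a hypothetical helper, all insignificant terms are dropped at once on each pass, the second stage starts from the first stage's survivors, and "top 10% by binding" is taken to mean the top decile of the response TF's own `LRB_` column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def cull_by_pvalue(response, terms, data, threshold, max_iterations=3):
    """Refit OLS and drop terms with p-value >= threshold until none are dropped."""
    terms = list(terms)
    for _ in range(max_iterations):
        rhs = " + ".join(terms) if terms else "1"
        fit = smf.ols(f"{response} ~ {rhs}", data=data).fit()
        pvalues = fit.pvalues.drop("Intercept")
        insignificant = set(pvalues[pvalues >= threshold].index)
        if not insignificant:
            break  # every remaining term is significant
        terms = [t for t in terms if t not in insignificant]
    return terms


# Synthetic data: only LRB_A and LRB_A:LRB_B carry signal.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["LRB_A", "LRB_B", "LRB_C", "LRB_D"])
df["LRR_A"] = 1.0 * df["LRB_A"] + 0.8 * df["LRB_A"] * df["LRB_B"] + rng.normal(size=n)

start = ["LRB_A", "LRB_A:LRB_B", "LRB_A:LRB_C", "LRB_A:LRB_D"]

# Stage 1: all rows, strict threshold.
stage1 = cull_by_pvalue("LRR_A", start, df, threshold=0.001)

# Stage 2: top 10% of rows by LRB_A (assumed meaning of "by binding"), looser threshold.
top10 = df[df["LRB_A"] >= df["LRB_A"].quantile(0.9)]
stage2 = cull_by_pvalue("LRR_A", stage1, top10, threshold=0.01)

print("after stage 1:", stage1)
print("after stage 2:", stage2)
```

The `max_iterations=3` default loosely mirrors the `full_data_iterations=3` and `top10_data_iterations=3` arguments in the call above; whether those arguments cap the number of culling passes in the same way is an assumption.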

Here is an example of the interaction terms that remain in the formula for the response variable ACE2:

In [9]:
final_models["ACE2"].summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | LRR_ACE2 | R-squared: | 0.189 |
| Model: | OLS | Adj. R-squared: | 0.182 |
| Method: | Least Squares | F-statistic: | 25.48 |
| Date: | Tue, 12 Nov 2024 | Prob (F-statistic): | 2.90e-27 |
| Time: | 09:41:15 | Log-Likelihood: | -450.36 |
| No. Observations: | 662 | AIC: | 914.7 |
| Df Residuals: | 655 | BIC: | 946.2 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0944 | 0.076 | -1.248 | 0.212 | -0.243 | 0.054 |
| LRB_ACE2:LRB_HAA1 | -0.1046 | 0.027 | -3.853 | 0.000 | -0.158 | -0.051 |
| LRB_ACE2:LRB_SUT1 | 0.0844 | 0.029 | 2.901 | 0.004 | 0.027 | 0.142 |
| LRB_ACE2:LRB_GCN4 | -0.1365 | 0.031 | -4.335 | 0.000 | -0.198 | -0.075 |
| LRB_ACE2:LRB_HIR2 | -0.1157 | 0.027 | -4.225 | 0.000 | -0.169 | -0.062 |
| LRB_ACE2 | 0.5456 | 0.058 | 9.425 | 0.000 | 0.432 | 0.659 |
| LRB_ACE2:LRB_GAL4 | 0.1479 | 0.030 | 4.931 | 0.000 | 0.089 | 0.207 |

| | | | |
|---|---|---|---|
| Omnibus: | 111.784 | Durbin-Watson: | 2.139 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 186.211 |
| Skew: | 1.049 | Prob(JB): | 3.67e-41 |
| Kurtosis: | 4.532 | Cond. No. | 18.6 |
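
If you would rather work with these numbers programmatically than read them off the summary, a fitted statsmodels result exposes them as attributes. The sketch below uses a small synthetic model (the data and the `LRB_A`/`LRB_B` names are placeholders, not the ACE2 model above) and collects the coefficients, p-values, and confidence intervals into one DataFrame.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; not the tutorial's dataset.
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(300, 2)), columns=["LRB_A", "LRB_B"])
toy["LRR_A"] = (
    0.5 * toy["LRB_A"]
    + 0.3 * toy["LRB_A"] * toy["LRB_B"]
    + rng.normal(scale=0.5, size=300)
)

fit = smf.ols("LRR_A ~ LRB_A + LRB_A:LRB_B", data=toy).fit()

# params and pvalues are pandas Series indexed by term name.
coef_table = pd.DataFrame({"coef": fit.params, "p_value": fit.pvalues})

# conf_int() returns a DataFrame whose columns 0 and 1 are the 95% bounds.
ci = fit.conf_int()
coef_table["ci_low"] = ci[0]
coef_table["ci_high"] = ci[1]

print(coef_table.round(3))
print(f"R-squared: {fit.rsquared:.3f}")
```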


## Experimenting with Main Effects

We may also want to explore adding main effects to our formulas to improve the variance explained. For each interaction term in final_models above, we consider two modifications: 1) replace the interaction term with its corresponding main effect, or 2) add the main effect to the formula alongside the interaction term. We use cross-validation to determine which formula has the highest average R-squared. If one of the modified formulas wins, we log the change, and once all interaction terms have been evaluated we apply the logged changes to the original formula.

In [10]:
final_results = add_main_effects(final_models, all_cc_mcisaac_data)
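
The comparison described above can be sketched for a single interaction term as follows. This is a simplified illustration, not the logic inside `add_main_effects`: it assumes the "corresponding main effect" of `LRB_A:LRB_B` is `LRB_B`, uses plain shuffled K folds rather than any stratified scheme the pipeline may use, and `cv_r2` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def cv_r2(formula, data, response, n_folds=5, seed=0):
    """Average out-of-fold R-squared for an OLS formula (simple shuffled K folds)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), n_folds)
    scores = []
    for test_idx in folds:
        test = data.iloc[test_idx]
        train = data.drop(data.index[test_idx])
        pred = smf.ols(formula, data=train).fit().predict(test)
        ss_res = ((test[response] - pred) ** 2).sum()
        ss_tot = ((test[response] - test[response].mean()) ** 2).sum()
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))


# Synthetic data in which the main effect LRB_B, not the interaction, matters.
rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame(rng.normal(size=(n, 2)), columns=["LRB_A", "LRB_B"])
df["LRR_A"] = 0.5 * df["LRB_A"] + 1.0 * df["LRB_B"] + rng.normal(scale=0.5, size=n)

candidates = {
    "original": "LRR_A ~ LRB_A + LRB_A:LRB_B",
    "substitute": "LRR_A ~ LRB_A + LRB_B",           # interaction replaced by main effect
    "add": "LRR_A ~ LRB_A + LRB_A:LRB_B + LRB_B",    # main effect added alongside
}

scores = {name: cv_r2(formula, df, "LRR_A") for name, formula in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("best formula variant:", best)
```

Using the same seed for every candidate means all three formulas are scored on identical folds, so differences in the averages come from the formulas rather than from the split.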

The final results list, for each response TF (column 1), the surviving terms from this process (column 2), sorted within each response TF by the absolute magnitude of their coefficient in the linear model (column 3).

In [11]:
final_results

| | Response TF | Feature | Coefficient |
|---|---|---|---|
| 0 | ACE2 | LRB_GAL4 | 0.215798 |
| 1 | ACE2 | LRB_ACE2 | 0.145691 |
| 2 | ACE2 | LRB_ACE2:LRB_GCN4 | -0.124202 |
| 3 | ACE2 | LRB_ACE2:LRB_SUT1 | 0.112207 |
| 4 | ACE2 | LRB_SUT1 | 0.092949 |
| ... | ... | ... | ... |
| 220 | YOX1 | LRB_YOX1:LRB_GZF3 | -0.077286 |
| 221 | YOX1 | LRB_YOX1:LRB_TEC1 | 0.009965 |
| 222 | ZAP1 | LRB_GAL4 | 0.132575 |
| 223 | ZAP1 | LRB_ZAP1:LRB_AZF1 | -0.093389 |
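
The ordering in this table (grouped by response TF, then by coefficient magnitude regardless of sign) can be reproduced with pandas. The sketch below uses a few rows copied from the output above in shuffled order; how `add_main_effects` sorts internally is not shown in this tutorial, so treat this as one possible way to get the same layout.

```python
import pandas as pd

# A handful of rows from the table above, deliberately out of order.
results = pd.DataFrame(
    {
        "Response TF": ["YOX1", "ACE2", "ACE2", "YOX1", "ACE2"],
        "Feature": [
            "LRB_YOX1:LRB_TEC1",
            "LRB_ACE2:LRB_GCN4",
            "LRB_GAL4",
            "LRB_YOX1:LRB_GZF3",
            "LRB_ACE2",
        ],
        "Coefficient": [0.009965, -0.124202, 0.215798, -0.077286, 0.145691],
    }
)

# Sort by TF name, then by absolute coefficient (largest first) within each TF.
ordered = (
    results.assign(abs_coef=results["Coefficient"].abs())
    .sort_values(["Response TF", "abs_coef"], ascending=[True, False])
    .drop(columns="abs_coef")
    .reset_index(drop=True)
)
print(ordered)
```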
