In [1]:
# Main changes: added a separate workflow after the end of the first one. Altered the try_interactor_variants method
# so that it only substitutes the main effect and does not add it

# configure the logger to print to console
from typing import Union
import logging
from matplotlib import pyplot as plt
import re
import random

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import r2_score



from yeastdnnexplorer.ml_models.lasso_modeling import (
    generate_modeling_data,
    stratification_classification,
    stratified_cv_modeling,
    bootstrap_stratified_cv_modeling,
    examine_bootstrap_coefficients,
    get_significant_predictors,
    stratified_cv_r2,
    try_interactor_variants,
    get_interactor_importance,
    get_non_zero_predictors,
    backwards_OLS_feature_selection,
    get_full_data,
    select_significant_features,
    )


logging.basicConfig(level=logging.ERROR)

random.seed(42)
np.random.seed(42)

# Interactor Modeling Workflow 1

This tutorial describes a process of modeling perturbation response by binding data
with the goal of discovering a meaningful set of interactor terms. More specifically,
we start with the following model:

$$
tf_{perturbed} \sim tf_{perturbed} + tf_{perturbed}:tf_{1} + tf_{perturbed}:tf_{2} + \dots + \max(\text{non-perturbed binding})
$$

Where the response variable is the $tf_{perturbed}$ perturbation response, and the
predictor variables are binding data (e.g., calling card experiments). Predictor
terms such as $tf_{perturbed}:tf_{2}$ represent the interaction between the
$tf_{perturbed}$ binding and the binding of another transcription factor. The final
term, $\max(\text{non-perturbed binding})$, is defined as the maximum binding score
for each gene, excluding $tf_{perturbed}$. This term is included to mitigate the
effect of outlier genes which may have high binding scores across multiple
transcription factors, potentially distorting the model.
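
As a small illustration of the final term, the sketch below computes the per-gene maximum binding score over every TF except the perturbed one. The gene names and binding values are invented for the example; only the `drop(...).max(axis=1)` pattern mirrors what this notebook does later when it builds `max_lrb`.

```python
import pandas as pd

# Toy binding scores: rows are genes, columns are TFs (values are made up).
predictors = pd.DataFrame(
    {
        "CBF1": [0.2, 3.1, 0.5],
        "MET28": [1.0, 0.4, 2.5],
        "GAL4": [0.1, 4.0, 0.3],
    },
    index=["gene_a", "gene_b", "gene_c"],
)

perturbed_tf = "CBF1"

# Per-gene maximum over every TF except the perturbed one.
max_lrb = predictors.drop(columns=perturbed_tf).max(axis=1)
print(max_lrb.to_dict())
```

A gene such as `gene_b`, which binds strongly to several TFs, gets a large `max_lrb` value, so the model can absorb that effect in a single term.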

We assume that the actual relationship between the perturbation response and the
binding data is sparse and use the following steps to identify significant terms.
These terms represent a set of TFs which, when considered as interactors with the
perturbed TF, improve the inferred relationship between the binding and perturbation
data.


## Interactor sparse modeling

1. First, we apply bootstrapping to a 4-fold cross-validated Lasso model. The folds
are stratified based on the binding data domain of the perturbed TF, ensuring that
each fold better represents the domain structure.

    - We produce two variations of this model:
        
        1. A model trained using all available data.
        
        2. A model trained using only the top 10% of data based on the binding
        score of the perturbed TF.

1. For model `1.1`, we select coefficients whose 99.8% confidence interval does not
include zero. For model `1.2`, we select coefficients whose 90.0% confidence interval
does not include zero. We assume that, due to the non-linear relationship between
perturbation response and binding, interaction effects are more pronounced in the
top 10% of the data. By intersecting the coefficients from both models, we keep only
those that are significant both in the top 10% and across the full dataset.

1. With this set of predictors, we next create an OLS model using the same 4-fold
stratified cross validation, from which we calculate an average $R^2$. Next, for each
interactor in the model, we produce one other cross-validated OLS model by
replacing the interactor with its corresponding main effect. We note whether this
variant yields a better average $R^2$, and we remove from our set every interactor whose main effect outperforms it.

1. Finally, we report as significant interactors the interaction terms that survive the steps above. We compare the average $R^2$ achieved by this model with the average $R^2$ achieved by its univariate counterpart (the response TF predicted solely by its main effect). We would like to create a boxplot of this comparison across all TFs to see how much variance this pipeline explains relative to the simple univariate model.
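
The confidence-interval rule in the second step can be sketched as follows. This is an illustration only: it assumes a central percentile interval over bootstrap coefficient draws, which may differ from what `get_significant_predictors()` does internally. The helper name `ci_excludes_zero` and the synthetic draws are hypothetical.

```python
import numpy as np


def ci_excludes_zero(boot_coefs, names, ci_percentile):
    """Return {name: (low, high)} for coefficients whose central CI excludes zero."""
    alpha = (100.0 - ci_percentile) / 2.0
    low = np.percentile(boot_coefs, alpha, axis=0)
    high = np.percentile(boot_coefs, 100.0 - alpha, axis=0)
    return {
        name: (float(lo), float(hi))
        for name, lo, hi in zip(names, low, high)
        if lo > 0 or hi < 0
    }


rng = np.random.default_rng(0)
names = ["CBF1:MET28", "CBF1:GAL4", "CBF1:SWI6"]

# Synthetic bootstrap draws: 1000 replicates x 3 coefficients.
boot_coefs = np.column_stack(
    [
        rng.normal(0.15, 0.02, 1000),   # clearly positive
        rng.normal(-0.10, 0.02, 1000),  # clearly negative
        rng.normal(0.00, 0.05, 1000),   # centered on zero
    ]
)

selected = ci_excludes_zero(boot_coefs, names, ci_percentile=99.8)
print(sorted(selected))
```

A wider interval (99.8%) is a stricter test than a narrower one (90.0%), which is why the stricter level is applied to the larger, all-data model.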


***NOTE***: To generate the `response_df` and `predictors_df` below, see the first six 
cells in the LassoCV tutorial.

In [2]:
response_df = pd.read_csv("~/htcf_local/lasso_bootstrap/erics_tfs/response_dataframe_20241105.csv", index_col=0)
predictors_df = pd.read_csv("~/htcf_local/lasso_bootstrap/erics_tfs/predictors_dataframe_20241105.csv", index_col=0)

## Step 1: Find significant predictors using all of the data

The function `get_significant_predictors()` is a wrapper of the LassoCV bootstrap
protocol described in the LassoCV notebook. It allows using the same code to produce
both the 'all data' (step 1.1) and 'top 10%' models (step 1.2), and returns the 
significant coefficients as described in the protocol above.

In [5]:
all_data_sig_coef, all_y, all_stratification_classes = get_significant_predictors(
    "CBF1",
    response_df,
    predictors_df,
    ci_percentile=99.8,
    n_bootstraps=100,
    add_max_lrb=True)

top10_data_sig_coef, top10_y, top10_stratification_classes = get_significant_predictors(
    "CBF1",
    response_df,
    predictors_df,
    quantile_threshold=0.1,
    ci_percentile=90.0,
    n_bootstraps=100,
    add_max_lrb=True)

Significant coefficients for 99.8, where intervals are entirely above or below ±0.0:
CBF1:SWI6: (-0.15340683343647527, -0.0007913600573889841)
CBF1:RGM1: (1.108602887592501e-05, 0.17429468895434033)
CBF1:ARG81: (-0.21682807454185382, -0.04237904791794094)
CBF1:MET28: (0.078854251438266, 0.18700460281512202)
CBF1:AZF1: (-0.15333695359516117, -0.012929575918097505)
CBF1:GAL4: (0.11671175479397583, 0.32225532638329824)
CBF1:MSN2: (0.08410392831397877, 0.2548485159840825)
max_lrb: (0.000988469681798027, 0.11028595405133462)
Significant coefficients for 90.0, where intervals are entirely above or below ±0.0:
CBF1:ARG81: (-0.16367301981599305, -0.018194961118955785)
CBF1:MET28: (0.0808741229105233, 0.19464882619151444)
CBF1:AZF1: (-0.08412597626524772, -0.0004959148887297946)
CBF1:GAL4: (0.00914657465324337, 0.229843453877801)


## Step 2

We next intersect the significant coefficients (see definitions above) from both
models. In this case, four interactors survive. Note that this example uses only 100
bootstraps in the interest of speed; we recommend no fewer than 1000 in practice.

In [6]:
intersect_coefficients = set(all_data_sig_coef.keys()).intersection(set(top10_data_sig_coef.keys()))
print(f"The surviving coefficients are: {intersect_coefficients}")

The surviving coefficients are: {'CBF1:MET28', 'CBF1:AZF1', 'CBF1:ARG81', 'CBF1:GAL4'}


## Step 3

We next implement the method that searches alternative models: for each surviving
interactor term, we build a variant of the model in which that term is replaced by its
main effect. In this case, we have 4 terms. The goal of this process, remember, is to
generate a set of high confidence interactor terms for this TF. If the predictive power
of the main effect is equivalent to or better than that of the interactor, we consider
that a low confidence interactor effect.

After identifying all interactor terms for which substituting in the main effect improves
the average R-squared from CV, we drop these terms from our set of features. We then log the final average R-squared achieved by this model.
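
To make the substitution test concrete, here is a minimal sketch on synthetic data. It is not the implementation of `get_interactor_importance()`; the helper `avg_cv_r2`, the quartile-based stratification labels, and the simulated response are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def avg_cv_r2(X, y, classes, n_splits=4):
    """Average R^2 of an OLS model over stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    splits = list(skf.split(X, classes))
    scores = cross_val_score(LinearRegression(), X, y, cv=splits, scoring="r2")
    return scores.mean()


rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame(
    {
        "CBF1": rng.normal(size=n),
        "GAL4": rng.normal(size=n),
        "MET28": rng.normal(size=n),
    }
)
data["CBF1:GAL4"] = data["CBF1"] * data["GAL4"]
data["CBF1:MET28"] = data["CBF1"] * data["MET28"]

# Synthetic response: driven by the GAL4 main effect and the CBF1:MET28 interaction.
y = data["GAL4"] + data["CBF1:MET28"] + rng.normal(scale=0.5, size=n)

# Hypothetical stratification: quartiles of the perturbed TF's binding.
classes = pd.qcut(data["CBF1"], 4, labels=False)

interactors = ["CBF1:GAL4", "CBF1:MET28"]
baseline = avg_cv_r2(data[interactors], y, classes)

to_remove = []
for term in interactors:
    main_effect = term.split(":")[1]
    # Swap this one interactor for its main effect; keep the other terms.
    variant_cols = [main_effect if t == term else t for t in interactors]
    if avg_cv_r2(data[variant_cols], y, classes) > baseline:
        to_remove.append(term)

print(f"Interactors outperformed by their main effect: {to_remove}")
```

In this toy setup the response depends on `GAL4` directly, so swapping `CBF1:GAL4` for `GAL4` improves the fit and that interactor is flagged, while `CBF1:MET28` is retained.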

In [7]:

# get the additional main effects which will be tested from the intersect_coefficients
main_effects = []
for term in intersect_coefficients:
    if ":" in term:
        main_effects.append(term.split(":")[1])
    else:
        main_effects.append(term)

# combine these main effects with the intersect_coefficients
interactor_terms_and_main_effects = list(intersect_coefficients) + main_effects

# generate a model matrix with the intersect terms and the main effects. This full
# model will not be used for modeling -- subsets of the columns will be, however.
_, full_X = generate_modeling_data(
    'CBF1',
    response_df,
    predictors_df,
    formula = f"~ {' + '.join(interactor_terms_and_main_effects)}",
    drop_intercept=False,
)

full_X["max_lrb"] = predictors_df.drop(columns="CBF1").max(axis=1)

# Currently, this function tests each interactor term in intersect_coefficients
# against one variant, in which the interaction term is replaced by its main effect.
# Every variant with a higher avg r-squared than the intersect model is returned,
# along with the avg r-squared of the intersect model itself.
full_avg_rsquared, x = get_interactor_importance(
    all_y,
    full_X,
    all_stratification_classes,
    intersect_coefficients
)

print(f"The full model avg r-squared is {full_avg_rsquared}")
print(f"The interactor results are: {x}")

# Now that we have identified the interactors whose main effects improve average R-squared, we drop them from our model

if x:
    interactors_to_remove = set()
    for dictionary in x:
        interactor = dictionary.get("interactor")
        interactors_to_remove.add(interactor)  
        print(f"Removing term: {interactor}")

    final_feature_set = intersect_coefficients.difference(interactors_to_remove)
    # get the final avg r-squared for this set
    final_model_avg_r_squared, _ = get_interactor_importance(
        all_y,
        full_X,
        all_stratification_classes,
        final_feature_set
    )

else:
    final_feature_set = intersect_coefficients
    final_model_avg_r_squared = full_avg_rsquared

print(f"The final model avg r-squared is {final_model_avg_r_squared}")
print(f"Final set of terms: {final_feature_set}")



The full model avg r-squared is 0.03092194305599505
The interactor results are: [{'interactor': 'CBF1:GAL4', 'variant': 'GAL4', 'avg_r2': 0.048032396899616414}]
Removing term: CBF1:GAL4
The final model avg r-squared is 0.017883902536610485
Final set of terms: {'CBF1:MET28', 'CBF1:AZF1', 'CBF1:ARG81'}




## Step 4: Comparing our final model to a univariate model

In our last step, we take the remaining set of features from the end of Step 3 and compare its performance to that of a univariate model in which the response TF is predicted solely by its main effect. We use the average R-squared achieved by 4-fold CV on both models.

In [12]:
y, X = generate_modeling_data("CBF1", response_df, predictors_df, drop_intercept=True, formula="CBF1_LRR ~ CBF1")
classes = stratification_classification(X["CBF1"].squeeze(), y.squeeze())
avg_r2_univariate = stratified_cv_r2(
    y,
    X,
    classes,
)

print(f"The univariate average R-squared is: {avg_r2_univariate}")
print(f"The final model average R-squared is {final_model_avg_r_squared}")

The univariate average R-squared is: 0.004970412632932103
The final model average R-squared is 0.017883902536610485




As we can see, the final model we achieved through Workflow 1 has a higher 4-fold CV average R-squared than the univariate model (about 0.0179 versus 0.0050).

# Interactor Modeling Workflow 2

An alternative workflow for identifying a meaningful set of transcription factors takes a slightly different approach. It shares some steps with Workflow 1, and we point these out in the outline below.


1. First, we apply a 4-fold cross-validated Lasso model (without bootstrapping - this is the 
difference between this step and Step 1 from Workflow 1). The folds
are stratified based on the binding data domain of the perturbed TF, ensuring that
each fold better represents the domain structure.

    - We produce two variations of this model:
        
        1. A model trained using all available data.
        
        2. A model trained using only the top 10% of data based on the binding
        score of the perturbed TF.

2. Each Lasso model from Step 1 will return a set of non-zero coefficients. We then intersect the coefficients from both models, and retain this list of features
in the exact same fashion as Step 2 of Workflow 1.  

3. With this set of predictors, we then perform a "backwards OLS feature selection," in
which we create OLS models using this set of predictors on the entire dataset, and remove
features which have a p-value above a threshold for significance (0.001). We continue to
remove features and refit models on the reduced set of features until all of the
features in a model are significant. Then, taking this filtered set of predictors, we
perform the same process, but this time on the top 10% of the data based on the binding
score of the perturbed TF. Since we are now using a smaller dataset, we relax the
threshold for significance to 0.01, and we repeat the process until we arrive at a
final model in which all of the terms are significant.

From here onwards, we follow the same steps as Step 3 and Step 4 in Workflow 1 above. Their descriptions are repeated here, renumbered as Steps 4 and 5 to match this workflow.

4. Now, with our reduced set of features, we perform the exact same process as Step 3 in
Workflow 1 above. That is, with this set of predictors, we create an OLS model using the same 4-fold
stratified cross validation, from which we calculate an average $R^2$. Next, for each
interactor in the model, we produce one other cross-validated OLS model by
replacing the interactor with its corresponding main effect. We note whether this
variant yields a better average $R^2$, and we remove from our set every interactor whose main effect outperforms it.

5. Finally, we report as significant interactors the interaction terms that survive the steps above. We compare the average $R^2$ achieved by this model with the average $R^2$ achieved by its univariate counterpart (the response TF predicted solely by its main effect). We would like to create a boxplot of this comparison across all TFs to see how much variance this pipeline explains relative to the simple univariate model.

Let's choose a particular TF to run through this workflow.

In [13]:
tf_of_interest = "CBF1"

## Step 1: get the non-zero coefficients from the Lasso models

The function `get_non_zero_predictors()` is a wrapper of the stratified CV modeling
protocol described in the LassoCV notebook. It allows using the same code to produce
both the 'all data' (step 1.1) and 'top 10%' (step 1.2) Lasso models, and returns the
features with non-zero coefficients as described in the protocol above.
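
For readers who want to see the idea without the wrapper, the sketch below fits a `LassoCV` on stratified folds and keeps the features with non-zero coefficients. The synthetic data and the quartile-based stratification labels are assumptions for illustration; the actual stratification and preprocessing inside `get_non_zero_predictors()` may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 300
cols = ["CBF1:MET28", "CBF1:GAL4", "CBF1:SWI6", "CBF1:AZF1"]
X = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)

# Synthetic response: only two of the four terms carry signal.
y = 2.0 * X["CBF1:MET28"] - 1.5 * X["CBF1:AZF1"] + rng.normal(scale=0.5, size=n)

# Hypothetical stratification labels (quartiles of one predictor).
classes = pd.qcut(X["CBF1:MET28"], 4, labels=False)
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
folds = list(skf.split(X, classes))

# LassoCV accepts an iterable of (train, test) index pairs as `cv`.
model = LassoCV(cv=folds, random_state=42).fit(X, y)

non_zero = [name for name, coef in zip(cols, model.coef_) if coef != 0]
print(f"Non-zero terms: {non_zero}")
```

Because the penalty is chosen by cross validation, weak noise terms can still receive small non-zero coefficients, which is one motivation for the intersection and backwards selection steps that follow.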

In [14]:
# Step 1.1: get the non-zero features on all data
tf_surviving_terms = get_non_zero_predictors(tf_of_interest, response_df, predictors_df)
# Step 1.2: get the non-zero features on only the top10% of genes
tf_top10_surviving_terms = get_non_zero_predictors(tf_of_interest, response_df, predictors_df, quantile_threshold=0.1)

## Step 2: Intersect the features from both Lasso models

Here, we simply intersect the sets of non-zero features found by both of the Lasso models in Step 1.

In [15]:
intersect_coefficients = set(tf_surviving_terms).intersection(set(tf_top10_surviving_terms))
print(f"The surviving coefficients are: {intersect_coefficients}")

The surviving coefficients are: {'CBF1:MET28', 'CBF1:SKN7', 'CBF1:SWI6', 'CBF1:AZF1'}


## Step 3: Perform backwards OLS feature selection on the intersected features

Now, taking our set of intersecting features from the step above, we perform the process of backwards OLS feature selection as described above. The function `backwards_OLS_feature_selection()` is a wrapper function that repeatedly calls `select_significant_features()` to perform the iterative process of removing insignificant features based on their p-value with respect to the given threshold. It also uses a helper method called `get_full_data()` to transform the input data into a single DataFrame that is usable by Patsy to generate design matrices for these OLS models.
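
The two-stage elimination described above can be sketched with plain NumPy and SciPy. This is a simplified stand-in, not the code behind `backwards_OLS_feature_selection()`: the helpers `ols_pvalues` and `backward_select` are hypothetical, the data are synthetic, and the sketch assumes that all insignificant terms are dropped together at each round.

```python
import numpy as np
from scipy import stats


def ols_pvalues(X, y):
    """Two-sided t-test p-values for OLS slopes (an intercept is fitted, then dropped)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    return (2 * stats.t.sf(np.abs(beta / se), dof))[1:]


def backward_select(X, y, names, threshold):
    """Refit, dropping terms with p-value above threshold, until all remaining terms pass."""
    keep = list(range(len(names)))
    while keep:
        pvals = ols_pvalues(X[:, keep], y)
        survivors = [i for i, p in zip(keep, pvals) if p <= threshold]
        if len(survivors) == len(keep):
            break
        keep = survivors
    return [names[i] for i in keep]


rng = np.random.default_rng(0)
n = 500
names = ["CBF1:MET28", "CBF1:AZF1", "CBF1:SWI6"]
X = rng.normal(size=(n, len(names)))
binding = rng.normal(size=n)  # stand-in for the perturbed TF's binding score
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Stage 1: all data, strict threshold.
stage1 = backward_select(X, y, names, threshold=0.001)

# Stage 2: top 10% of genes by binding, relaxed threshold, stage 1 survivors only.
top = binding >= np.quantile(binding, 0.9)
cols = [names.index(term) for term in stage1]
stage2 = backward_select(X[top][:, cols], y[top], stage1, threshold=0.01)

print(f"After stage 1: {stage1}")
print(f"After stage 2: {stage2}")
```

The relaxed threshold in the second stage reflects the smaller sample: with roughly a tenth of the rows, the same effect size yields a much larger p-value.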

In [16]:
backward_OLS_feature_result = backwards_OLS_feature_selection(tf_of_interest, intersect_coefficients, response_df, predictors_df)

## Step 4

This is now the exact same workflow as Step 3 from Workflow 1. To recap, we implement the method that searches alternative models: for each surviving interactor term, we build a variant in which that term is replaced by its main effect. The goal of this process is to generate a set of high confidence interactor terms for this TF. If the predictive power of the main effect is equivalent to or better than that of the interactor, we consider that a low confidence interactor effect.

In [17]:

# get the additional main effects which will be tested from the backward_OLS_feature_result
main_effects = []
for term in backward_OLS_feature_result:
    if ":" in term:
        main_effects.append(term.split(":")[1])
    else:
        main_effects.append(term)

# combine these main effects with the backward_OLS_feature_result
interactor_terms_and_main_effects = list(backward_OLS_feature_result) + main_effects

# generate a model matrix with the intersect terms and the main effects. This full
# model will not be used for modeling -- subsets of the columns will be, however.
_, full_X = generate_modeling_data(
    tf_of_interest,
    response_df,
    predictors_df,
    formula = f"~ {' + '.join(interactor_terms_and_main_effects)}",
    drop_intercept=False
)

# add the max_lrb term to the data
full_X["max_lrb"] = predictors_df.drop(columns=tf_of_interest).max(axis=1)

# have to generate the stratification classes and the "y" column for input into 
# get_interactor_importance below
y, X = generate_modeling_data(tf_of_interest, response_df, predictors_df)
all_stratification_classes = stratification_classification(X[tf_of_interest].squeeze(), y.squeeze())

# Currently, this function tests each interactor term in backward_OLS_feature_result
# against one variant, in which the interaction term is replaced by its main effect.
# Every variant with a higher avg r-squared than the full model is returned,
# along with the avg r-squared of the full model itself.
full_avg_rsquared, x = get_interactor_importance(
    y,
    full_X,
    all_stratification_classes,
    backward_OLS_feature_result
)

print(f"The full model avg r-squared is {full_avg_rsquared}")
print(f"The interactor results are: {x}")

if x:
    interactors_to_remove = set()
    for dictionary in x:
        interactor = dictionary.get("interactor")
        interactors_to_remove.add(interactor)  
        print(f"Removing term: {interactor}")

    # Start from the terms that survived backwards OLS selection (Step 3), since
    # full_X and full_avg_rsquared above were built from that set.
    final_feature_set = set(backward_OLS_feature_result).difference(interactors_to_remove)
    # get the final avg r-squared for this set
    final_model_avg_r_squared, _ = get_interactor_importance(
        y,
        full_X,
        all_stratification_classes,
        final_feature_set
    )

else:
    final_feature_set = set(backward_OLS_feature_result)
    final_model_avg_r_squared = full_avg_rsquared

print(f"The final model avg r-squared is {final_model_avg_r_squared}")
print(f"Final set of terms: {final_feature_set}")

The full model avg r-squared is 0.016951766474153418
The interactor results are: []
The final model avg r-squared is 0.016951766474153418
Final set of terms: {'CBF1:MET28', 'CBF1:SKN7', 'CBF1:SWI6', 'CBF1:AZF1'}




## Step 5

This is the same as Step 4 from Workflow 1 above. 

In [18]:
y, X = generate_modeling_data(tf_of_interest, response_df, predictors_df, drop_intercept=True, formula=f"{tf_of_interest}_LRR ~ {tf_of_interest}")
classes = stratification_classification(X[tf_of_interest].squeeze(), y.squeeze())
avg_r2_univariate = stratified_cv_r2(
    y,
    X,
    classes,
)

print(f"The univariate average R-squared is: {avg_r2_univariate}")
print(f"The final model average R-squared is {final_model_avg_r_squared}")

The univariate average R-squared is: 0.004970412632932103
The final model average R-squared is 0.016951766474153418




As we can see, the final model we achieved through Workflow 2 has a higher 4-fold CV average R-squared than the univariate model (about 0.0170 versus 0.0050).