# lassoCV

This notebook demonstrates how to fit a Lasso model with stratified k-fold
cross-validation on the Calling Cards data.

## Pulling the data

The calling cards data should now be taken only from the data source 'brent_nf_cc'. All
of the Mitra data has been reprocessed through the nf-core/callingcards:1.0.0 pipeline.

Where there are multiple replicates, they have been aggregated. The `deduplicate`
parameter to `PromoterSetSigAPI()` selects aggregated data where it exists. Where
there is a single passing replicate, that replicate is used.

## Setup

As usual, import the `yeastdnnexplorer` interface functions.

In [None]:
import patsy as pt
import pandas as pd
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold
from sklearn.base import BaseEstimator, clone

from yeastdnnexplorer.interface import *

# configure the logger to print to console
import logging

logging.basicConfig(level=logging.DEBUG)

pss_api = PromoterSetSigAPI()
expression_api = ExpressionAPI()

## Pull the deduplicated calling cards data

This will pull all of the currently usable data. In the future, we will remove
"unreviewed". This will take a minute or two, as it needs to fetch all of the
underlying data.

In [None]:
pss_api.push_params(
    {
        "source_name": "brent_nf_cc",
        "deduplicate": "true",
        "data_usable": ["unreviewed", "pass"],
    }
)

pss_res = await pss_api.read(retrieve_files=True)

## Pull the corresponding perturbation data

In this case, we are pulling the McIsaac data. In order to label blacklisted genes,
we'll need the shrunken data. For modelling, we will use the unshrunken data.

In [None]:
expression_api.push_params(
    {
        "regulator_symbol": ",".join(
            pss_res.get("metadata").regulator_symbol.unique().tolist()
        ),
        "source_name": "mcisaac_oe",
        "time": "15",
    }
)

expression_res_shrunken = await expression_api.read(retrieve_files=True)

# add the effect_colname parameter to the expression API so that the next read
# returns the unshrunken log2_ratio values
expression_api.push_params(
    {
        "effect_colname": "log2_ratio",
    }
)

expression_res_unshrunken = await expression_api.read(retrieve_files=True)



## Transform the data into a usable format for modelling

Note that there are new functions, `metric_arrays` and
`negative_log_transform_by_pvalue_and_enrichment`. See the API section of this
documentation for more details.

You will likely want to save the results of this cell so that you do not have to run
the DB or transformation steps in future sessions, unless of course you need or want
to update your data.
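One way to cache the transformed objects is to pickle them with pandas. The sketch below is only an illustration: it uses a small stand-in DataFrame rather than the real `predictors_df`, writes to a temporary directory, and the file name `predictors_df.pkl` is a hypothetical choice, not something the package prescribes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for `predictors_df`; gene and TF names are illustrative only
predictors_demo = pd.DataFrame(
    {"TF_A": [0.1, 1.2, 2.3], "TF_B": [0.0, 0.7, 1.9]},
    index=["gene1", "gene2", "gene3"],
)

# A temporary directory keeps the example self-contained; in practice,
# choose a persistent location
cache_dir = Path(tempfile.mkdtemp())
cache_path = cache_dir / "predictors_df.pkl"  # hypothetical file name

# Write once, after the transformation step has finished ...
predictors_demo.to_pickle(cache_path)

# ... and reload in later sessions instead of re-querying the database
reloaded = pd.read_pickle(cache_path)
```

The same pattern would apply to `Y_filtered_transformed`. Pickle preserves the index, column names, and dtypes, which is why it is used here instead of CSV.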

In [None]:
# extract the data to a more manageable format using `metric_arrays`

X = metric_arrays(
    pss_res,
    {"poisson_pval": np.min, "callingcards_enrichment": np.max},
)

Y = metric_arrays(
    expression_res_unshrunken,
    {"effect": np.max},
)

Y_shrunken = metric_arrays(
    expression_res_shrunken,
    {"effect": np.max},
)

# define a set of common genes between X and Y
common_genes = X["poisson_pval"].index.intersection(
    Y.get("effect", pd.DataFrame()).index
)

# flag each entry of the shrunken effect DataFrame as True if the value is exactly 0,
# i.e. the gene is unresponsive to that perturbation.
# We wish to exclude any genes that are always unresponsive
Y_binary = Y_shrunken.get("effect", pd.DataFrame()).eq(0)

# define blacklisted genes as those records where they are labeled "unresponsive"
# in all columns
blacklisted_genes = Y_binary[Y_binary.all(axis=1)].index.intersection(common_genes)

# remove the blacklist genes from Y and retain only common genes
Y_filtered = Y.get("effect", pd.DataFrame()).loc[common_genes].drop(blacklisted_genes)

# remove the blacklisted_genes for X and retain only the common genes
X_filtered = {}
for key in X.keys():
    X_filtered[key] = X[key].loc[common_genes].drop(blacklisted_genes)

# Next, transform the X object into a predictors_df using the shifted negative log rank
# transformation. See `negative_log_transform_by_pvalue_and_enrichment` for more
# details
scores_list = [
    negative_log_transform_by_pvalue_and_enrichment(
        X_filtered["poisson_pval"].loc[:, i],
        X_filtered["callingcards_enrichment"].loc[:, i],
    )
    for i in X_filtered["poisson_pval"].columns
]

# Convert the list of scores into a DataFrame
predictors_df = pd.DataFrame(scores_list).T

# Set the index and columns to match X_filtered["poisson_pval"]
predictors_df.index = X_filtered["poisson_pval"].index
predictors_df.columns = X_filtered["poisson_pval"].columns


# conduct a similar shifted negative log rank transformation on the Y values
Y_filtered_ranked = Y_filtered.rank(ascending=False, method="average")
Y_filtered_transformed = Y_filtered_ranked.apply(shifted_negative_log_ranks, axis=0)
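To give a sense of what the rank transformation does to a response column, the sketch below ranks a toy effect vector and applies a stand-in transform. The helper `shifted_neg_log_ranks_sketch` is hypothetical: it assumes the library function computes `log10(max_rank) - log10(rank)`, which has not been checked against the `yeastdnnexplorer` source. Treat it as an illustration and rely on the API documentation for the actual definition.

```python
import numpy as np
import pandas as pd


def shifted_neg_log_ranks_sketch(ranks: pd.Series) -> pd.Series:
    """Hypothetical stand-in: log10(max rank) - log10(rank).

    ASSUMPTION: this mirrors what `shifted_negative_log_ranks` does; the
    real function may differ.
    """
    return np.log10(ranks.max()) - np.log10(ranks)


# Toy effect values for four illustrative genes
effects = pd.Series([2.5, -0.1, 0.8, 0.0], index=["geneA", "geneB", "geneC", "geneD"])

# Same ranking call as in the cell above: the largest effect gets rank 1
ranks = effects.rank(ascending=False, method="average")

# Under the assumed transform, the top-ranked gene gets the largest score
# and the bottom-ranked gene gets 0
scores = shifted_neg_log_ranks_sketch(ranks)
```

Under this assumption, the transform compresses the long tail of low-ranked genes toward zero while spreading out the top of the ranking.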

## Model the data, one TF at a time

Some TFs have replicates in the expression data. Those can be combined in the
future, but for now I have left them separate in order to examine the modelling
results. The important feature to notice in this section is that the modelling is
done one column of `Y` at a time, so the per-TF fits are independent of each other.
This is embarrassingly parallel.

In [8]:
def stratification_classification(binding_vector: pd.Series, perturbation_vector: pd.Series) -> np.ndarray:
    """
    Bin the binding and perturbation data and create groups for stratified k folds
    """
    
    # Rank genes by binding and perturbation scores
    binding_rank = pd.Series(binding_vector).rank(method='min', ascending=False).values
    perturbation_rank = pd.Series(perturbation_vector).rank(method='min', ascending=False).values
    
    # Define bins for classification
    bins = [0, 8, 64, 512, np.inf]
    labels = [1, 2, 3, 4]
    
    # Bin genes based on binding and perturbation ranks
    binding_bin = pd.cut(binding_rank, bins=bins, labels=labels, right=True).astype(int)
    perturbation_bin = pd.cut(perturbation_rank, bins=bins, labels=labels, right=True).astype(int)
    
    # Generate a combined classification value
    return (binding_bin - 1) * 4 + perturbation_bin

def interaction_modeling(
    colname: str,
    Y_filtered_transformed: pd.DataFrame,
    predictors_df: pd.DataFrame,
    formula: str = None,
    estimator: BaseEstimator = LassoCV()
) -> BaseEstimator:
    """
    Model interaction terms with a specified target column from Y_filtered_transformed
    and predictors in predictors_df, using LassoCV with stratified cross-validation.

    :param colname: The column of the response data to model. Note: it is
        assumed that the columns of Y_filtered_transformed are a subset of
        the columns of predictors_df
    :param Y_filtered_transformed: The transformed target DataFrame
    :param predictors_df: The predictors DataFrame
    :param formula: The formula to use for the interaction model. If None, the formula
        defaults to `<colname>_LRR ~ <colname> + <colname>:<other>`, with one
        interaction term for every other column of predictors_df
    :param estimator: The estimator to use for fitting the model. It must have a `cv`
        attribute that can be set with a list of StratifiedKFold splits

    :return: The fitted clone of `estimator`

    :raises ValueError: If the estimator does not have a `cv` attribute
    """
    # Verify estimator has a `cv` attribute
    if not hasattr(estimator, 'cv'):
        raise ValueError("The estimator must support a `cv` parameter.")

    # Step 1: Create a temporary DataFrame and add the target column as `<colname>_LRR`
    tmp_df = predictors_df.copy()
    tmp_df[f"{colname}_LRR"] = Y_filtered_transformed[colname]

    # Step 2: Drop the row where index matches `colname` if it exists
    if colname in tmp_df.index:
        tmp_df = tmp_df.drop(index=colname)

    # Step 3: Define the interaction formula
    if formula is None:
        interaction_terms = " + ".join(
            [
                f"{colname}:{other_col}"
                for other_col in predictors_df.columns
                if other_col != colname
            ]
        )
        formula = f"{colname}_LRR ~ {colname} + {interaction_terms}"

    # Step 4: Generate X, y matrices with patsy
    y, X = pt.dmatrices(formula, tmp_df, return_type="dataframe")

    # Step 5: Generate bins for stratified k-fold cross-validation
    classes = stratification_classification(X[colname], y.values.ravel())
    
    # Step 6: Initialize StratifiedKFold for stratified splits
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    folds = list(skf.split(X, classes))

    # Clone the estimator and set the `cv` attribute with predefined folds
    model = clone(estimator)
    model.cv = folds

    # Step 7: Fit the model using the custom cross-validation folds
    model.fit(X, y.values.ravel())

    return model
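The stratification step is easier to follow on a small example. The sketch below reproduces the binning logic of `stratification_classification` on synthetic scores (the random data and the helper name `rank_bin` are illustrative only): each vector is ranked, the ranks are cut into the bins 1-8, 9-64, 65-512 and above 512, and the two bin labels are combined into a single class between 1 and 16.

```python
import numpy as np
import pandas as pd

BINS = [0, 8, 64, 512, np.inf]
LABELS = [1, 2, 3, 4]


def rank_bin(scores) -> np.ndarray:
    """Rank scores (largest score gets rank 1) and assign each rank to a bin."""
    ranks = pd.Series(scores).rank(method="min", ascending=False).to_numpy()
    return pd.cut(ranks, bins=BINS, labels=LABELS, right=True).astype(int)


# Synthetic binding and perturbation scores for 1000 hypothetical genes
rng = np.random.default_rng(0)
n_genes = 1000
binding = rng.random(n_genes)
perturbation = rng.random(n_genes)

binding_bin = rank_bin(binding)
perturbation_bin = rank_bin(perturbation)

# Combine the two 4-level bins into one label in 1..16
classes = (binding_bin - 1) * 4 + perturbation_bin

# With no ties, the bins hold 8, 56, 448 and the remaining 488 genes
bin_sizes = pd.Series(binding_bin).value_counts().sort_index()

# Size of each combined class; some of the 16 classes may be empty or tiny
class_counts = pd.Series(classes).value_counts().sort_index()
```

Because the top bins are small by design, some combined classes can have fewer members than `n_splits`. It is worth inspecting `class_counts` on real data, since scikit-learn warns when the least populated class is smaller than the number of splits.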


## Modelling per TF

`estimator` can be any scikit-learn estimator that accepts a `cv` parameter, such as `LassoCV`.
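The `cv` requirement exists because the stratified folds are passed to the estimator as a predefined list of `(train, test)` index pairs. The sketch below shows that mechanism in isolation, assuming synthetic data and an arbitrary four-level stratification label; none of these values come from the calling cards data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold

# Synthetic regression problem: only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Arbitrary stratification labels, 50 samples per stratum
strata = np.repeat([0, 1, 2, 3], 50)

# Build the folds on the strata, not on the continuous response
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
folds = list(skf.split(X, strata))

# `cv` accepts an iterable of (train, test) index arrays
model = LassoCV(cv=folds, max_iter=10000).fit(X, y)
```

Stratifying on a separate label is what lets a regression target be used with `StratifiedKFold`, which otherwise expects class labels.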

In [None]:
lasso_estimator = LassoCV(max_iter=10000)

lasso_model = interaction_modeling("CBF1", Y_filtered_transformed, predictors_df, estimator=lasso_estimator)
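Since each TF is fit independently, the single call above extends naturally to every column. The helper below is a minimal sketch of that pattern. `fit_all` and `fake_fit` are hypothetical names, and `fake_fit` only stands in for the real fit so that the example runs on its own. In the notebook, the callable would presumably wrap `interaction_modeling` for one column name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fit_all(
    colnames: Iterable[str],
    fit_fn: Callable[[str], object],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Apply `fit_fn` to each column name independently; return results by name."""
    names = list(colnames)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names
        results = list(pool.map(fit_fn, names))
    return dict(zip(names, results))


def fake_fit(colname: str) -> str:
    """Stand-in for the real per-TF fit (ASSUMPTION: illustration only)."""
    return f"model_for_{colname}"


# Illustrative TF names; in the notebook these would be the columns of
# Y_filtered_transformed
models = fit_all(["CBF1", "GAL4", "GCN4"], fake_fit)
```

Threads are used here only to keep the sketch dependency-free. For CPU-bound fits, a process-based approach may scale better, and that choice is left open.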