# Control and IV automation

## Hybrid Variable Selection: Lasso and Multiple Linear Regression

### Overview
This step combines the strengths of **Lasso regression** and **Multiple Linear Regression** to identify potential confounders and candidate instruments. The goal is to:
1. Efficiently reduce the dimensionality of the predictors using Lasso.
2. Refine the selection by testing statistical significance using p-values from multiple linear regression.

---

### Steps

#### **Step 1: Lasso for Initial Selection**
- Fit a **Lasso regression** model to predict the treatment variable $ X $.
- Use the Lasso coefficients to select variables with **non-zero coefficients**.
- This step reduces the dimensionality of the dataset by discarding irrelevant variables, especially in high-dimensional settings or when multicollinearity is present.
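
As a rough illustration of this step, the sketch below runs Lasso on synthetic data. The predictor names (`z0` to `z9`), the coefficients used to build the toy treatment, and `alpha=0.1` are all assumptions made for the example; they are not taken from the QoG data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10

# Hypothetical predictors z0..z9; only z0 and z3 drive the toy treatment X
Z = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"z{i}" for i in range(p)])
X = 2.0 * Z["z0"] - 1.5 * Z["z3"] + rng.normal(scale=0.1, size=n)

# Standardize so the L1 penalty treats every predictor on the same scale
Z_scaled = StandardScaler().fit_transform(Z)

lasso = Lasso(alpha=0.1).fit(Z_scaled, X)
coefs = pd.Series(lasso.coef_, index=Z.columns)

# Keep only predictors whose coefficient was not shrunk to exactly zero
selected = coefs[coefs != 0].index.tolist()
print(selected)
```

Standardizing first matters because the Lasso penalty acts on coefficient size, which depends on the scale of each predictor.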

---

#### **Step 2: Multiple Linear Regression for Refinement**
- Using only the variables selected by Lasso:
  1. Regress the treatment variable $ X $ on all selected predictors jointly.
  2. Regress the outcome variable $ Y $ on the same selected predictors.
- In each regression, a variable counts as significant when its **p-value is below a predefined threshold** (e.g., $ p < 0.1 $).

---

#### **Step 3: Final Confounder Selection**
- The final set of potential confounders includes variables that are statistically significant predictors of **both**:
  - $ X $ (treatment), and
  - $ Y $ (outcome).
- Variables that are significant for $ X $ but not for $ Y $ are kept separately as potential instruments.
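
The classification rule can be written in a few lines of pandas. The p-values below are invented for illustration. The instrument rule (significant for $ X $ but not for $ Y $) mirrors what the function later in this notebook returns.

```python
import pandas as pd

# Hypothetical p-values from the two regressions in Step 2
pvals_x = pd.Series({"a": 0.001, "b": 0.030, "c": 0.400, "d": 0.600})
pvals_y = pd.Series({"a": 0.020, "b": 0.700, "c": 0.010, "d": 0.900})
threshold = 0.1

sig_x = pvals_x < threshold
sig_y = pvals_y < threshold

# Significant for both X and Y -> potential confounder
confounders = pvals_x.index[sig_x & sig_y].tolist()
# Significant for X but not for Y -> candidate instrument
instruments = pvals_x.index[sig_x & ~sig_y].tolist()

print(confounders)  # ['a']
print(instruments)  # ['b']
```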

---

### Advantages of This Approach
1. **Dimensionality Reduction**: Lasso effectively handles high-dimensional data by shrinking irrelevant coefficients to zero.
2. **Statistical Validation**: Multiple linear regression tests the significance of the selected variables, ensuring they are meaningful.
3. **Balances Complexity**: The combination of Lasso and p-values provides an interpretable and computationally efficient framework.
4. **Flexibility**: Users can adjust:
   - The regularization strength $ \alpha $ in Lasso.
   - The p-value threshold for statistical significance.

---

### Example Parameters
- **Regularization Strength** ($ \alpha $):
  - Controls the sparsity of Lasso. Smaller values ($ \alpha \to 0 $) include more variables; larger values ($ \alpha > 0.1 $) select fewer.
- **P-Value Threshold**:
  - A common threshold is $ p < 0.1 $ for exploratory purposes. Use $ p < 0.05 $ for stricter selection.
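
To get a feel for how $ \alpha $ controls sparsity, one option is to count the non-zero coefficients over a small grid. The synthetic data, the grid of values, and `max_iter=10_000` are assumptions for this sketch, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
Z = rng.normal(size=(n, p))

# Hypothetical treatment with a few strong and a few weak drivers
true_coefs = np.array([2.0, 1.0, 0.5, 0.2, 0.05, 0, 0, 0, 0, 0])
X = Z @ true_coefs + rng.normal(scale=0.5, size=n)

Z_scaled = StandardScaler().fit_transform(Z)

# Count how many predictors survive at each regularization strength
n_selected = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(Z_scaled, X).coef_
    n_selected[alpha] = int(np.sum(coef != 0))

print(n_selected)
```

On data like this, the count should fall as $ \alpha $ grows, which is the trade-off described above.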

---

### Why Use This Hybrid Approach?
- Combines the **efficiency of Lasso** with the **rigor of p-values**.
- Balances computational efficiency with interpretability.
- Ensures selected confounders are statistically significant for both $ X $ (treatment) and $ Y $ (outcome).

---


In [88]:
!pip install doubleml



In [89]:
import pandas as pd
import numpy as np

from doubleml import DoubleMLData, DoubleMLPLR

from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

## Step 0
The first step is to load your cleaned, perfectly balanced dataset (`df_cleaned`) and merge it with the full QoG dataset (`df_full`) using a left join (`how='left'`), so that only the country-year rows present in your dataset are kept. Of the merged columns, only those with no missing values on these rows are retained.
Finally, drop the variables that will not be used in the regressions (e.g. identifiers).

In [90]:
# Step 0: Load and prepare the dataset
df_full = pd.read_csv('https://www.qogdata.pol.gu.se/data/qog_std_ts_jan24.csv')

# Define variables of interest for initial analysis
variables_of_interest = ['cname', 'ccodealp', 'year', 'wdi_co2', 'sgi_ec', 'ti_cpi',
                         'wdi_gdppppcon2017', 'wdi_popurb', 'pg_regtoreen', 'wdi_oilrent',
                         'wdi_popgr', 'wdi_trade']
df_cleaned = df_full[variables_of_interest]

# Drop rows with missing values and apply transformations
df_cleaned = df_cleaned.dropna()
df_cleaned['wdi_co2'] = df_cleaned['wdi_co2'] * 1000
df_cleaned['pg_regtoreen'] = df_cleaned['pg_regtoreen'] / 1000
df_cleaned = pd.get_dummies(df_cleaned, columns=['ccodealp'], drop_first=False)

# Merge df_cleaned with df_full (left join) and keep only columns with no missing values
df_merged = pd.merge(df_cleaned[['cname', 'year']], df_full, on=['cname', 'year'], how='left')
perfect_match_columns = [col for col in df_merged.columns if df_merged[col].notna().all()]
df = df_merged[perfect_match_columns]


# Drop identifier columns that are not needed
df = df.drop([ 'cname','ccode', 'ccode_qog', 'cname_qog', 'ccodecow', 'version', 'cname_year', 'ccodealp_year'], axis=1)

# Sort the DataFrame by country code ('ccodealp') and 'year' to ensure correct ordering
df = df.sort_values(['ccodealp', 'year']).reset_index(drop=True)

df.head()



Unnamed: 0,year,ccodealp,al_ethnic2000,al_language2000,al_religion2000,banko_for1,banko_for1_db,banko_for2,banko_for2_db,banko_for3,...,who_homm,who_homt,who_infmortf,who_infmortm,who_infmortt,who_matmort,who_roadtrd,who_suif,who_suim,who_suit
0,2013,AUS,0.092902,0.33495,0.821086,0.033884,0.030838,0.026492,0.02411,0.026492,...,1.4,1.1,3.17,3.76,3.47,6.0,5.3,5.3,15.3,10.2
1,2014,AUS,0.092902,0.33495,0.821086,0.031372,0.028509,0.023955,0.021769,0.023955,...,1.4,1.1,3.07,3.62,3.35,5.0,5.4,5.8,17.0,11.3
2,2015,AUS,0.092902,0.33495,0.821086,0.034574,0.031554,0.027088,0.024722,0.027088,...,1.5,1.1,2.99,3.53,3.27,5.0,5.2,6.0,17.7,11.8
3,2016,AUS,0.092902,0.33495,0.821086,0.033336,0.030631,0.026213,0.024086,0.026213,...,1.3,1.0,2.94,3.48,3.22,5.0,5.6,5.6,16.3,10.9
4,2017,AUS,0.092902,0.33495,0.821086,0.036331,0.035585,0.02921,0.02861,0.02921,...,1.3,1.0,2.91,3.47,3.2,5.0,5.2,5.9,17.7,11.8
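
The left join and the "fully observed columns" filter from Step 0 can be checked on a toy example. All frames and column names below (`df_cleaned_toy`, `df_full_toy`, `complete_var`, `gappy_var`) are hypothetical.

```python
import pandas as pd

# Hypothetical "cleaned" sample: the country-years we want to keep
df_cleaned_toy = pd.DataFrame({
    "cname": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
})

# Hypothetical "full" dataset with an extra row and some missing values
df_full_toy = pd.DataFrame({
    "cname": ["A", "A", "B", "C"],
    "year": [2000, 2001, 2000, 2000],
    "complete_var": [1.0, 2.0, 3.0, 4.0],
    "gappy_var": [0.5, None, 0.7, 0.9],
})

# Left join keeps exactly the rows of the cleaned sample
merged = pd.merge(df_cleaned_toy, df_full_toy, on=["cname", "year"], how="left")

# Keep only columns that are fully observed on those rows
complete_cols = [c for c in merged.columns if merged[c].notna().all()]
print(merged[complete_cols])
```

Here `gappy_var` is dropped because one of the retained country-years has a missing value, while country `C` never enters the merged frame.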


In [91]:
# Compute the first difference of each variable to remove time-invariant country fixed effects

# Identify columns to difference (all columns except the panel identifiers)
exclude_columns = ['year', 'ccodealp']
columns_to_diff = df.columns[~df.columns.isin(exclude_columns)]

# Sort the DataFrame by country code ('ccodealp') and 'year' to ensure correct ordering
df_sorted = df.sort_values(['ccodealp', 'year']).reset_index(drop=True)

# Compute the first difference within each individual group
df_s = df_sorted.groupby('ccodealp')[columns_to_diff].diff().dropna()  # Drop NaN for the first observation in each group


# Display the first-differenced DataFrame
df_s.head()

Unnamed: 0,al_ethnic2000,al_language2000,al_religion2000,banko_for1,banko_for1_db,banko_for2,banko_for2_db,banko_for3,banko_for3_db,banko_soe1,...,who_homm,who_homt,who_infmortf,who_infmortm,who_infmortt,who_matmort,who_roadtrd,who_suif,who_suim,who_suit
1,0.0,0.0,0.0,-0.002512,-0.002329,-0.002537,-0.002341,-0.002537,-0.002341,0.0,...,0.0,0.0,-0.1,-0.14,-0.12,-1.0,0.1,0.5,1.7,1.1
2,0.0,0.0,0.0,0.003203,0.003045,0.003133,0.002953,0.003133,0.002953,0.0,...,0.1,0.0,-0.08,-0.09,-0.08,0.0,-0.2,0.2,0.7,0.5
3,0.0,0.0,0.0,-0.001239,-0.000923,-0.000875,-0.000636,-0.000875,-0.000636,0.0,...,-0.2,-0.1,-0.05,-0.05,-0.05,0.0,0.4,-0.4,-1.4,-0.9
4,0.0,0.0,0.0,0.002995,0.004953,0.002997,0.004524,0.002997,0.004524,0.0,...,0.0,0.0,-0.03,-0.01,-0.02,0.0,-0.4,0.3,1.4,0.9
5,0.0,0.0,0.0,0.002259,0.00273,0.002198,0.002575,0.002198,0.002575,0.0,...,0.0,0.0,-0.01,0.0,-0.01,-1.0,-0.4,-0.5,-0.6,-0.5
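
A small hand-checkable example of within-group first differencing is sketched below. The panel and its column names (`country`, `co2`, `fixed_trait`) are made up for illustration.

```python
import pandas as pd

# Hypothetical two-country panel; "fixed_trait" never changes within a country
panel = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "co2": [1.0, 3.0, 6.0, 10.0, 10.0, 15.0],
    "fixed_trait": [0.3, 0.3, 0.3, 0.8, 0.8, 0.8],
})

# Sort so that diff() compares consecutive years within each country
panel = panel.sort_values(["country", "year"]).reset_index(drop=True)

# diff() leaves NaN on each country's first year; dropna() removes those rows
diffs = panel.groupby("country")[["co2", "fixed_trait"]].diff().dropna()
print(diffs)
```

The time-invariant column differences to zero in every row, just as `al_ethnic2000` does in the output above. This is why first differencing removes country-level fixed effects.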


## Steps 1 to 3

The following function, `hybrid_variable_selection_with_iv`, runs Steps 1 to 3.

In [108]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

def hybrid_variable_selection_with_iv(data, x_col, y_col, alpha=0.01, pval_threshold=0.1, max_features=10):
    """
    Hybrid approach to select confounders and instruments using Lasso and Multiple Linear Regression,
    with a cap on the number of variables selected by Lasso.

    Parameters:
    - data: pd.DataFrame, input dataset with all variables.
    - x_col: str, name of the treatment variable (X).
    - y_col: str, name of the outcome variable (Y).
    - alpha: float, regularization strength for Lasso.
    - pval_threshold: float, significance threshold for p-values in multiple linear regression.
    - max_features: int, maximum number of variables to retain from Lasso selection.

    Returns:
    - confounders: list, selected potential confounders (significant for both X and Y).
    - instruments: list, selected potential instruments (significant for X but not Y).
    """
    # Prepare predictors
    predictors = data.drop(columns=[x_col, y_col])
    feature_names = predictors.columns

    # Standardize predictors
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(predictors)

    # Step 1: Lasso for Initial Selection
    lasso = Lasso(alpha=alpha, max_iter=1000)
    lasso.fit(X_scaled, data[x_col])

    # Get Lasso coefficients, labelled by feature name
    coef_series = pd.Series(lasso.coef_, index=feature_names)

    # Keep the top `max_features` variables based on absolute coefficients
    top_features = coef_series[coef_series != 0].abs().nlargest(max_features).index.tolist()


    # Filter dataset to keep only top features
    selected_data = data[top_features]

    # Step 2: Multiple Linear Regression for Refinement
    def ols_pvalues(response, predictors):
        """Fit OLS with an intercept and return each predictor's p-value."""
        model = sm.OLS(response, sm.add_constant(predictors)).fit()
        return model.pvalues.drop('const')  # Exclude constant's p-value

    # P-values of the selected variables in the regression of X (treatment)
    pvals_for_x = ols_pvalues(data[x_col], selected_data)

    # P-values of the selected variables in the regression of Y (outcome)
    pvals_for_y = ols_pvalues(data[y_col], selected_data)

    # Step 3: Create lists for confounders and instruments
    confounders = [
        var for var in top_features
        if (pvals_for_x[var] < pval_threshold) and (pvals_for_y[var] < pval_threshold)
    ]
    instruments = [
        var for var in top_features
        if (pvals_for_x[var] < pval_threshold) and (pvals_for_y[var] >= pval_threshold)
    ]

    return confounders, instruments

# Example usage
# df = your_data_frame
confounders, instruments = hybrid_variable_selection_with_iv(
    data=df_s,
    x_col='wdi_oilrent',  # Treatment variable
    y_col='wdi_co2',       # Outcome variable
    alpha=0.01,            # Regularization strength for Lasso
    pval_threshold=0.1,    # Significance threshold for p-values
    max_features=1000        # Limit the number of features selected by Lasso
)

print(f"Selected Potential Confounders (Count:{len(confounders)}) :", confounders)
print(f"Selected Potential Instruments (Count:{len(instruments)}) :", instruments)


Selected Potential Confounders (Count:6) : ['wdi_gdpcappppcur', 'dr_pg', 'wdi_gdpind', 'top_top1_income_share', 'sgi_ectx', 'wdi_empserf']
Selected Potential Instruments (Count:21) : ['wdi_gdpcappppcon2017', 'wdi_gdpcapcur', 'wdi_acelu', 'wdi_acelr', 'wdi_pop1564', 'pwt_tfp', 'wdi_lfpmne15', 'pwt_shhc', 'fi_sm', 'wdi_internet', 'sgi_ecbg', 'wgov_minage', 'who_suif', 'wdi_popurbagr', 'ef_gl', 'banko_for2', 'fh_fog', 'pwt_xr', 'ti_se', 'ti_cpi_max', 'wbgi_vae']


## Step 4

You can now go to [The Causal Insight Assistant](https://chatgpt.com/g/g-673cf983ab908191b25350ef96397965-causal-insight-assistant)

The Causal Insight Assistant is a customized GPT designed for analyzing causal relationships in datasets.
Its purpose is to assist researchers by:
- Identifying potential confounders.
- Evaluating instrumental variables.


Here is a prompt example:

`I am trying to measure the effect of oil rents (% of GDP) on CO2 emissions (metric tons per capita), and I need to include confounders.

To do so, I need you to find the definition of each of the following variables and state whether it might indeed be a confounder.

 ['wdi_gdpcappppcur', 'dr_pg', 'wdi_gdpind', 'top_top1_income_share', 'sgi_ectx', 'wdi_empserf']`