# Double Machine Learning (DML)

Double Machine Learning (DML) is a framework, initially proposed by Chernozhukov et al. {cite}`chernozhukov2018double`, for estimating causal effects when many confounding variables are present. It combines machine learning techniques with econometric methods to control for these confounders and obtain unbiased estimates of treatment effects.

For example, suppose we want to quantify the effect of wind power (WP) or solar power (SP) production on electricity prices. In this case, WP and SP production are our **treatment variables**, while the electricity price is our **response**. We expect these sources to reduce prices because of their low marginal costs. However, many other factors also affect electricity prices (e.g., demand, gas prices, macroeconomic trends). These are the **confounders**. Some of these confounders affect both the treatment and the response. For example, the season and the hour of the day clearly affect SP generation, but they also affect demand (and hence prices) through the well-known daily and seasonal consumption profiles. Ignoring these confounders can lead to biased estimates.

With DML, we are trying to isolate the effect of the treatment variables on the response. This framework assumes that the response $y$ (e.g., the prices) is a function of the treatment $w$ and other confounding variables $x$:

\begin{equation}
    y = g(w, x) + \epsilon
\end{equation}

where $g$ is an arbitrary function (linear or nonlinear) and $\epsilon$ is the error term.

Similarly, since we assumed that the treatment is also affected by other confounding variables, we have that $w$ can be modeled as a function of $x$:

\begin{equation}
    w = m(x) + \nu
\end{equation}

where $m$ is an arbitrary function (linear or nonlinear) and $\nu$ is the error term.

Now, the DML framework involves two main stages:
1. **Nuisance parameter estimation**: use machine learning models to estimate the unknown functions $g$ and $m$, obtaining $\hat{g}(w, x)$ and $\hat{m}(x)$.
2. **Orthogonalization and estimation**: use the estimated functions to "remove" the effect of the confounding variables from both $w$ and $y$. Then, we estimate the causal effects by regressing the residuals of the response on the residuals of the treatment.

The **key intuition** is that once we remove the effect of the confounders from both the treatment and the response, the variation left in the treatment residuals is no longer explained by the confounders, so its association with the response residuals can be attributed to the treatment itself. Note that this approach assumes we already know the causal graph and that there are no omitted (unobserved) confounders.
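
To make this intuition concrete, the following is a minimal simulation sketch, not part of the original analysis. It assumes, for simplicity, that the treatment enters the response linearly with a hypothetical coefficient `beta_true`, and that the confounder effects `m` and `g` are known functions chosen purely for illustration. It compares a naive regression of $y$ on $w$ with a regression on residuals computed from the true functions (an "oracle" that is never available in practice).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
beta_true = -2.0  # hypothetical causal effect, chosen for illustration


# Hypothetical confounder effects (assumptions, not taken from real data).
def m(x):
    return np.sin(x)  # effect of the confounder on the treatment


def g(x):
    return 3.0 * np.sin(x) + x**2  # effect of the confounder on the response


x = rng.uniform(-3, 3, n)                        # confounder
w = m(x) + rng.normal(0, 1, n)                   # treatment
y = beta_true * w + g(x) + rng.normal(0, 1, n)   # response


def slope(a, b):
    """OLS slope of b regressed on a (with intercept)."""
    a_c = a - a.mean()
    return float(a_c @ (b - b.mean()) / (a_c @ a_c))


# Naive estimate: ignores the confounder, so it is biased.
naive = slope(w, y)

# Oracle residuals: remove the part of w and y explained by x.
v_w = w - m(x)
v_y = y - (beta_true * m(x) + g(x))  # y minus its conditional mean given x
oracle = slope(v_w, v_y)

print(f"true effect:      {beta_true:.3f}")
print(f"naive estimate:   {naive:.3f}")
print(f"oracle residuals: {oracle:.3f}")
```

In this synthetic setting the naive slope is pulled away from `beta_true` because $x$ drives both $w$ and $y$, while the residual-on-residual slope stays close to it. DML replaces the oracle functions with machine learning estimates.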


## The Partially Linear Case

For simplicity, we now consider a partially linear case where the relationship between the outcome $y$ and the treatment $w$ can be expressed linearly, while allowing for a potentially complex, nonlinear relationship between the confounders $x$ and both the treatment and outcome.

In this case, the model is specified as follows:

\begin{equation}
    y = \beta w + g(x) + \epsilon
\end{equation}

\begin{equation}
    w = m(x) + \nu
\end{equation}

Here:
- $\beta$ is the coefficient capturing the causal effect of the treatment $w$ on the outcome $y$. This is what we are trying to estimate!
- $g(x)$ is an unknown function capturing the effect of the confounders $x$ on the outcome.
- $m(x)$ is an unknown function capturing the effect of the confounders $x$ on the treatment.
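
As a short worked derivation of why regressing residuals on residuals recovers $\beta$ in this model, assume (this is an added assumption, stated here explicitly) that $E[\epsilon \mid x, w] = 0$ and $E[\nu \mid x] = 0$. Taking the conditional expectation of the outcome equation given $x$ yields

\begin{equation}
    E[y \mid x] = \beta\, m(x) + g(x)
\end{equation}

Subtracting this from the outcome equation removes $g(x)$ entirely:

\begin{equation}
    y - E[y \mid x] = \beta \left( w - m(x) \right) + \epsilon = \beta \nu + \epsilon
\end{equation}

so that $\beta$ is the slope of a regression of the outcome residual on the treatment residual:

\begin{equation}
    \beta = \frac{E\left[ \nu \left( y - E[y \mid x] \right) \right]}{E\left[ \nu^2 \right]}
\end{equation}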


The **key steps** to implement DML in the partially linear case are:

1. **Split the Data**: Randomly split the data into $K$ folds.
2. **Train Predictive Models**:
    - For each fold $k$ (where $k \in \{1, 2, ..., K\}$):
        1. **Treatment Model**: Train a machine learning model $\hat{m}_{-k}(x)$ using $K-1$ folds to predict $w$ from $x$.
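        2. **Outcome Model**: Train a machine learning model $\hat{g}_{-k}(x)$ using $K-1$ folds to predict $y$ from $x$. Note that, because $w$ is not among the inputs, this model approximates the conditional mean $E[y \mid x] = \beta m(x) + g(x)$ rather than $g(x)$ alone.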
3. **Generate Residuals**:
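    - Use the models trained on the other $K-1$ folds to predict the treatment and the outcome for the observations in the held-out fold $k$, giving $\hat{W} = \hat{m}_{-k}(X)$ and $\hat{Y} = \hat{g}_{-k}(X)$, where capital letters denote the observed data in fold $k$.
    - Compute the residuals for the treatment and outcome models on that fold: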
        \begin{equation}
            \hat{V}_W = W - \hat{W}, \quad \hat{V}_Y = Y - \hat{Y}
        \end{equation}
4. **Regress Residuals**:
    - Regress the residualized outcome $\hat{V}_Y$ on the residualized treatment $\hat{V}_W$ to estimate the causal effect $\beta$:
        \begin{equation}
        \hat{\beta}_k = \text{coef}\left( \hat{V}_Y \sim \hat{V}_W \right)
        \end{equation}
5. **Average Estimates**:
    - Repeat steps 2-4 for each fold and average the resulting $K$ estimates to obtain the final causal estimate:
        \begin{equation}
            \hat{\beta} = \frac{1}{K} \sum_{k=1}^{K} \hat{\beta}_k
        \end{equation}
6. **Robustness**:
    - For more robustness with respect to random partitioning in finite samples, repeat the algorithm multiple times (e.g., 100 times) with different random splits and report the median estimate.
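
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than a reference implementation: the data are simulated with a hypothetical `beta_true`, the confounder is a single scalar, and `fit_poly` (a polynomial least-squares fit) is a hypothetical stand-in for whatever machine learning model one would actually use, such as a random forest or gradient boosting.

```python
import numpy as np


def fit_poly(x, target, degree=7):
    """Hypothetical stand-in learner: polynomial least squares in a scalar x."""
    coef, *_ = np.linalg.lstsq(np.vander(x, degree + 1), target, rcond=None)
    return lambda x_new: np.vander(x_new, degree + 1) @ coef


def dml_plr(x, w, y, n_folds=5, seed=0):
    """Cross-fitted DML estimate of beta in the partially linear model."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly split the observation indices into K folds.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    betas = []
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Step 2: fit treatment and outcome models on the other K-1 folds.
        m_hat = fit_poly(x[train], w[train])
        y_hat = fit_poly(x[train], y[train])
        # Step 3: out-of-fold residuals on the held-out fold.
        v_w = w[held_out] - m_hat(x[held_out])
        v_y = y[held_out] - y_hat(x[held_out])
        # Step 4: slope of outcome residuals on treatment residuals
        # (no intercept, since both residuals are centred around zero).
        betas.append(float(v_w @ v_y / (v_w @ v_w)))
    # Step 5: average the K fold-specific estimates.
    return float(np.mean(betas))


# Simulated data (all values below are assumptions for illustration).
rng = np.random.default_rng(42)
n, beta_true = 5_000, -2.0
x = rng.uniform(-3, 3, n)
w = np.sin(x) + rng.normal(0, 1, n)
y = beta_true * w + 3.0 * np.sin(x) + x**2 + rng.normal(0, 1, n)

# Step 6: repeat with different random splits and report the median.
estimates = [dml_plr(x, w, y, seed=s) for s in range(20)]
beta_hat = float(np.median(estimates))
print(f"true beta = {beta_true}, DML estimate = {beta_hat:.3f}")
```

The learner is passed the training folds only, so every residual is computed from a model that never saw the corresponding observation. Swapping `fit_poly` for a more flexible model changes only the two fitting lines in step 2.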

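This algorithm makes the estimate of the treatment effect insensitive (orthogonal) to small errors in the estimated nuisance functions $g$ and $m$, which reduces the regularization bias introduced by the machine learning models. Cross-fitting, in turn, removes the bias that would arise from overfitting if the same observations were used both to fit the nuisance models and to estimate $\beta$. Under suitable regularity conditions, the resulting estimator is approximately unbiased and asymptotically normal.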
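
The orthogonality property can be stated more precisely through the score function that the residual regression solves. As a sketch, write $\ell(x) = E[y \mid x]$ for the outcome nuisance function (this symbol is introduced here for convenience) and assume $E[\epsilon \mid x, w] = 0$ and $E[\nu \mid x] = 0$. The partialling-out score is

\begin{equation}
    \psi(\beta; \ell, m) = \left( y - \ell(x) - \beta \left( w - m(x) \right) \right) \left( w - m(x) \right)
\end{equation}

At the true values, $E[\psi] = E[\epsilon \nu] = 0$. Perturbing the nuisance functions in arbitrary directions $\delta_\ell(x)$ and $\delta_m(x)$ by a small amount $r$ gives

\begin{equation}
    \frac{\partial}{\partial r} E\left[ \psi(\beta; \ell + r \delta_\ell, m + r \delta_m) \right] \Big|_{r=0} = E\left[ \left( \beta \delta_m(x) - \delta_\ell(x) \right) \nu \right] - E\left[ \epsilon\, \delta_m(x) \right] = 0
\end{equation}

because $\nu$ and $\epsilon$ have zero mean conditional on $x$. First-order errors in $\hat{\ell}$ and $\hat{m}$ therefore do not shift the moment condition that defines $\hat{\beta}$.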

The **key advantages** of DML in the partially linear case are:
1. **Flexibility**: allows the use of flexible machine learning models to estimate complex, nonlinear relationships between confounders and the treatment/outcome.
2. **Bias Reduction**: the orthogonalization step makes the estimate of the causal effect insensitive, to first order, to errors in how the confounder effects are modeled.
3. **Robustness**: cross-fitting guards against overfitting bias, and repeating the procedure over multiple random splits reduces the sensitivity of the estimate to any single partition.


## Electricity example
