# Causal ML for Marketing Campaigns

A retailer wants to make its discount-based marketing campaigns more effective. It distributes promotions through several channels and wants to refine its marketing strategy using data on user demographics, campaign and coupon details, product information, and past transactions. The original dataset is available on [Kaggle](https://www.kaggle.com/datasets/vasudeva009/predicting-coupon-redemption), and the specific sample comes from [this source](https://doi.org/10.7910/DVN/2P8AY0).

**Data dictionary:**

- dailyspending: the customer's daily spending (outcome variable)
- coupons: whether the customer received a coupon (treatment variable)
- coupons_preperiod: whether the customer received a coupon in the previous period
- dailyspending_preperiod: the customer's daily spending in the previous period
- income_bracket: income level from 1 to 12
- age_range: age range from 1 to 6
- married: whether the customer is married
- rented: whether the customer rents their home
- family_size: number of people in the customer's household

In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model, ensemble

import warnings
warnings.simplefilter('ignore')

## Check the data

In [None]:
# Read data
path_data = 'https://github.com/pabloestradac/experimentation-notebooks/raw/main/data/'
df = pd.read_csv(path_data + 'coupon.csv')
df.head()

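| | dailyspending | coupons | coupons_preperiod | dailyspending_preperiod | income_bracket | age_range | married | rented | family_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 411.624 | 0 | 0 | 0.0 | 4 | 6 | 1 | 0 | 2 |
| 1 | 253.574444 | 0 | 0 | 411.624 | 4 | 6 | 1 | 0 | 2 |
| 2 | 261.673684 | 1 | 0 | 253.574444 | 4 | 6 | 1 | 0 | 2 |
| 3 | 0.0 | 1 | 1 | 0.0 | 5 | 4 | 1 | 0 | 2 |
| 4 | 0.0 | 1 | 1 | 0.0 | 5 | 4 | 1 | 0 | 2 |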


In [None]:
# Descriptive Statistics
df.describe().round(2)

| | dailyspending | coupons | coupons_preperiod | dailyspending_preperiod | income_bracket | age_range | married | rented | family_size |
|---|---|---|---|---|---|---|---|---|---|
| count | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 |
| mean | 291.45 | 0.24 | 0.18 | 269.47 | 5.01 | 3.57 | 0.74 | 0.08 | 2.54 |
| std | 310.26 | 0.43 | 0.39 | 380.83 | 2.35 | 1.3 | 0.44 | 0.27 | 1.19 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 56.09 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | 0.0 | 0.0 | 2.0 |
| 50% | 210.57 | 0.0 | 0.0 | 123.42 | 5.0 | 4.0 | 1.0 | 0.0 | 2.0 |
| 75% | 427.36 | 0.0 | 0.0 | 395.34 | 6.0 | 4.0 | 1.0 | 0.0 | 3.0 |
| max | 1975.75 | 1.0 | 1.0 | 3565.34 | 12.0 | 6.0 | 1.0 | 1.0 | 5.0 |


## Regression

What is the effect of sending coupons on customers' daily spending?

$$
\text{dailyspending} = \beta_0 + \beta_1 \text{coupons} + e
$$

In [None]:
# OLS no controls
model_base = ('dailyspending ~ coupons')
base = smf.ols(model_base, data=df)
results_ols = base.fit(cov_type='HC1')
print(results_ols.summary().tables[1])

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    268.7191      9.405     28.572      0.000     250.285     287.153
coupons       95.4337     21.778      4.382      0.000      52.750     138.117


Customers who received a coupon spend about 95 dollars more per day, on average, than customers who did not (coefficient 95.43, robust standard error 21.78). Because the p-value is below 0.05, we reject the null hypothesis of no effect: the difference between the group that received the coupon and the group that did not is statistically significant.
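The arithmetic behind this reading can be checked directly. The sketch below first recomputes the z statistic, p-value, and 95% interval from the coefficient and robust standard error printed in the regression table, and then uses simulated data (hypothetical values, not the coupon dataset) to illustrate that the OLS slope on a binary treatment is the difference in mean outcomes between the two groups.

```python
from statistics import NormalDist

import numpy as np

# Taken from the regression table above: coefficient and HC1 standard error for `coupons`.
coef, se = 95.4337, 21.778

z = coef / se                                   # test statistic under H0: beta_1 = 0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value, normal approximation
crit = NormalDist().inv_cdf(0.975)              # about 1.96
ci_low, ci_high = coef - crit * se, coef + crit * se
print(f"z = {z:.3f}, p = {p_value:.2g}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 1000
coupons = rng.binomial(1, 0.25, size=n)
spending = 250 + 90 * coupons + rng.normal(0, 300, size=n)

# Difference in mean spending between coupon and no-coupon customers.
diff_in_means = spending[coupons == 1].mean() - spending[coupons == 0].mean()

# OLS of spending on an intercept and the coupon dummy.
design = np.column_stack([np.ones(n), coupons])
beta, *_ = np.linalg.lstsq(design, spending, rcond=None)

print(f"difference in means: {diff_in_means:.4f}")
print(f"OLS slope:           {beta[1]:.4f}")
```

The two printed numbers in the second part agree up to floating-point error, which is why the coefficient on `coupons` in the model without controls can be read as a raw comparison of group averages.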

In [None]:
# code pending review


Let's add covariates measured before the experiment to the model:

$$
\text{dailyspending} = \beta_0 + \beta_1 \text{coupons} + \beta_2' X + e
$$
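One common motivation for adding pre-period covariates is precision: when treatment is randomized, covariates that predict the outcome absorb residual variance and typically shrink the standard error of the treatment coefficient. The sketch below illustrates this on simulated data. All numbers are hypothetical, and `ols_hc1` is a small helper written for this example (it mirrors the `cov_type='HC1'` option used in this notebook, but is not a statsmodels function).

```python
import numpy as np


def ols_hc1(X, y):
    """Return OLS coefficients and HC1 heteroskedasticity-robust standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = xtx_inv @ meat @ xtx_inv * n / (n - k)   # HC1 small-sample correction
    return beta, np.sqrt(np.diag(cov))


# Simulated randomized experiment (assumption): true coupon effect is 70.
rng = np.random.default_rng(1)
n = 2000
pre_spending = rng.gamma(shape=2.0, scale=150.0, size=n)
coupons = rng.binomial(1, 0.5, size=n)
spending = 100 + 70 * coupons + 0.8 * pre_spending + rng.normal(0, 100, size=n)

const = np.ones(n)
beta_raw, se_raw = ols_hc1(np.column_stack([const, coupons]), spending)
beta_adj, se_adj = ols_hc1(np.column_stack([const, coupons, pre_spending]), spending)

print(f"no controls:   effect = {beta_raw[1]:.2f}, SE = {se_raw[1]:.2f}")
print(f"with controls: effect = {beta_adj[1]:.2f}, SE = {se_adj[1]:.2f}")
```

Both specifications target the same effect in this simulation; the adjusted one simply estimates it with less noise.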

In [None]:
# OLS with additive controls
X = df.drop(columns=['dailyspending'])
X = sm.add_constant(X)
Y = df['dailyspending']
results_ols_add = sm.OLS(Y, X).fit(cov_type='HC1')
print(results_ols_add.summary().tables[1])

                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                     170.7605     40.484      4.218      0.000      91.414     250.107
coupons                    73.6412     23.234      3.169      0.002      28.103     119.180
coupons_preperiod         -22.6512     24.574     -0.922      0.357     -70.815      25.513
dailyspending_preperiod     0.1200      0.029      4.127      0.000       0.063       0.177
income_bracket             19.6230      4.227      4.643      0.000      11.339      27.907
age_range                 -16.0827      6.030     -2.667      0.008     -27.902      -4.263
married                    40.0459     19.473      2.056      0.040       1.879      78.213
rented                     11.3779     27.913      0.408      0.684     -43.331      66.087
family_size                 1.4065      9.218      0.153      0.879     -16.660 

In [None]:
# OLS with interacted controls
X = df.drop(columns=['dailyspending', 'coupons'])
X = X - X.mean(axis=0)
X[['coupons*' + col for col in X.columns]] = df[['coupons']].values * X
X['coupons'] = df['coupons']
X = sm.add_constant(X)
Y = df['dailyspending']
results_ols_int = sm.OLS(Y, X).fit(cov_type='HC1')
print(results_ols_int.summary().tables[1])

                                      coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                             273.4269      9.944     27.497      0.000     253.937     292.917
coupons_preperiod                 -24.1186     30.235     -0.798      0.425     -83.378      35.141
dailyspending_preperiod             0.0800      0.035      2.283      0.022       0.011       0.149
income_bracket                     22.6714      5.047      4.492      0.000      12.780      32.562
age_range                         -11.3180      6.471     -1.749      0.080     -24.000       1.364
married                            16.9742     22.445      0.756      0.450     -27.018      60.966
rented                              9.7466     34.742      0.281      0.779     -58.347      77.841
family_size                        12.4076     11.451      1.084      0.279     -10.036      34.851
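The interacted specification subtracts the column means (`X - X.mean(axis=0)`) before multiplying the covariates by `coupons`. A minimal sketch of why that matters, on simulated data with a single hypothetical covariate `x`: without centering, the coefficient on the treatment dummy is the effect at `x = 0`; with centering, it is the effect at the average `x`, which in this linear setup is the average treatment effect.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Simulated data (assumption): effect is 10 + 20 * x, and x has mean 3,
# so the average treatment effect is 10 + 20 * 3 = 70.
rng = np.random.default_rng(5)
n = 5000
x = rng.normal(loc=3.0, scale=1.0, size=n)
d = rng.binomial(1, 0.5, size=n)
y = 100 + 30 * x + d * (10 + 20 * x) + rng.normal(0, 50, size=n)

const = np.ones(n)

# Uncentered interaction: coefficient on d is the effect at x = 0.
raw = ols(np.column_stack([const, d, x, d * x]), y)

# Centered interaction: coefficient on d is the effect at the mean of x.
xc = x - x.mean()
centered = ols(np.column_stack([const, d, xc, d * xc]), y)

print(f"coefficient on d, uncentered: {raw[1]:.2f}")
print(f"coefficient on d, centered:   {centered[1]:.2f}")
print(f"interaction slope (both):     {raw[3]:.2f}, {centered[3]:.2f}")
```

The interaction slope is the same in both fits, since centering only reparameterizes the model; what changes is the point at which the treatment coefficient is evaluated.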


## Double Machine Learning

Instead of assuming a linear relationship between the covariates and the outcome, we can use machine learning models to estimate this relationship $g(\cdot)$ flexibly.

$$
\begin{gathered}
\text{dailyspending} = \beta_1 \text{coupons} + g(X) + u \\
\text{coupons} = m(X) + v
\end{gathered}
$$
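The partialling-out idea behind this model can be sketched by hand: predict the outcome and the treatment from $X$ on held-out folds (cross-fitting), then regress the outcome residuals on the treatment residuals. The code below is a minimal illustration on simulated data, not the DoubleML implementation. Plain least squares stands in for both learners (including a linear probability model for the treatment), and the helper `fit_predict_linear` and all data-generating values are assumptions made for this example.

```python
import numpy as np


def fit_predict_linear(X_train, y_train, X_test):
    """Least-squares fit with intercept; a stand-in for any ML learner."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


# Simulated data (assumption): true treatment effect is 70.
rng = np.random.default_rng(2)
n, n_folds = 2000, 5
X = rng.normal(size=(n, 3))
propensity = 1 / (1 + np.exp(-0.5 * X[:, 0]))
d = rng.binomial(1, propensity)
y = 70 * d + 50 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 100, size=n)

# Cross-fitting: each observation is residualized by models trained on the other folds.
y_res = np.empty(n)
d_res = np.empty(n)
for test_idx in np.array_split(rng.permutation(n), n_folds):
    train = np.ones(n, dtype=bool)
    train[test_idx] = False
    y_res[test_idx] = y[test_idx] - fit_predict_linear(X[train], y[train], X[test_idx])
    d_res[test_idx] = d[test_idx] - fit_predict_linear(X[train], d[train], X[test_idx])

# Residual-on-residual regression gives the effect estimate.
theta = (d_res @ y_res) / (d_res @ d_res)

# Standard error from the partialling-out score.
psi = (y_res - theta * d_res) * d_res
se = np.sqrt(np.mean(psi**2) / np.mean(d_res**2) ** 2 / n)

print(f"theta = {theta:.2f}, SE = {se:.2f}")
```

Swapping `fit_predict_linear` for a lasso or random forest changes only the residualization step; the final residual-on-residual regression stays the same.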

In [None]:
!pip install doubleml~=0.7.0

Collecting doubleml~=0.7.0
  Downloading DoubleML-0.7.1-py3-none-any.whl.metadata (7.7 kB)
Downloading DoubleML-0.7.1-py3-none-any.whl (256 kB)
Installing collected packages: doubleml
Successfully installed doubleml-0.7.1


In [None]:
import doubleml as dml

In [None]:
# DML with linear and logistic regression
splits = 10
covariates = list(df.drop(['dailyspending', 'coupons'], axis = 1).columns)
dml_data = dml.DoubleMLData(df, y_col='dailyspending', d_cols='coupons', x_cols=covariates)
ml_g = linear_model.LinearRegression() # outcome model
ml_m = linear_model.LogisticRegression() # treatment model
results_dml_linear = dml.DoubleMLPLR(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_linear)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: LinearRegression()
Learner ml_m: LogisticRegression()
Out-of-sample Performance:
Learner ml_l RMSE: [[300.91433731]]
Learner ml_m RMSE: [[0.3648944]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|      2.5 %     97.5 %
coupons  75.313516  23.020088  3.271643  0.001069  30.194972  120.43206


In [None]:
# DML with lasso
cv = 10
ml_g = linear_model.LassoCV(cv=cv)
ml_m = linear_model.LogisticRegressionCV(penalty='l1', solver='saga', cv=cv)
results_dml_lasso = dml.DoubleMLPLR(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_lasso)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: LassoCV(cv=10)
Learner ml_m: LogisticRegressionCV(cv=10, penalty='l1', solver='saga')
Out-of-sample Performance:
Learner ml_l RMSE: [[301.57974678]]
Learner ml_m RMSE: [[0.48567981]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|      2.5 %     97.5 %
coupons  57.191812  17.169714  3.330971  0.000865  23.539791  90.843834


In [None]:
# DML with random forest
ml_g = ensemble.RandomForestRegressor(max_features='sqrt')
ml_m = ensemble.RandomForestClassifier()
results_dml_rf = dml.DoubleMLPLR(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_rf)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_features='sqrt')
Learner ml_m: RandomForestClassifier()
Out-of-sample Performance:
Learner ml_l RMSE: [[311.79916999]]
Learner ml_m RMSE: [[0.39740697]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
             coef    std err        t     P>|t|     2.5 %     97.5 %
coupons  45.35179  23.686851  1.91464  0.055538 -1.073585  91.777165


We can also use a nonlinear model for the outcome equation in which the treatment interacts with the covariates (an interactive regression model):

$$
\begin{gathered}
\text{dailyspending} = g(\text{coupons}, X) + u \\
\text{coupons} = m(X) + v
\end{gathered}
$$
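For this model, the average treatment effect is usually estimated with a doubly robust (AIPW) score that combines outcome predictions for treated and untreated units with inverse-propensity-weighted residuals. The sketch below is a hand-rolled illustration on simulated data, not the DoubleML code path. The helpers, the linear outcome models, the Newton-step logistic regression, and the data-generating values are all assumptions. Propensities are clipped at 0.01 and 0.99 to mimic a truncation rule, and no weight normalization is applied.

```python
import numpy as np


def add_const(X):
    return np.column_stack([np.ones(len(X)), X])


def fit_linear(X, y):
    """Least-squares coefficients (with intercept)."""
    coef, *_ = np.linalg.lstsq(add_const(X), y, rcond=None)
    return coef


def fit_logistic(X, d, n_iter=25):
    """Unregularized logistic regression fitted by Newton's method."""
    A = add_const(X)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-A @ w))
        hess = A.T @ (A * (p * (1 - p))[:, None])
        w += np.linalg.solve(hess, A.T @ (d - p))
    return w


# Simulated data (assumption): effect is 70 + 20 * X[:, 1], so the ATE is 70.
rng = np.random.default_rng(3)
n, n_folds = 2000, 5
X = rng.normal(size=(n, 2))
d = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X[:, 0])))
y = 50 * X[:, 0] + d * (70 + 20 * X[:, 1]) + rng.normal(0, 100, size=n)

# Cross-fitted nuisance predictions: g(0, X), g(1, X), and m(X).
g0_hat, g1_hat, m_hat = np.empty(n), np.empty(n), np.empty(n)
for test_idx in np.array_split(rng.permutation(n), n_folds):
    train = np.ones(n, dtype=bool)
    train[test_idx] = False
    A_test = add_const(X[test_idx])
    g0_hat[test_idx] = A_test @ fit_linear(X[train & (d == 0)], y[train & (d == 0)])
    g1_hat[test_idx] = A_test @ fit_linear(X[train & (d == 1)], y[train & (d == 1)])
    m_hat[test_idx] = 1 / (1 + np.exp(-A_test @ fit_logistic(X[train], d[train])))

m_hat = np.clip(m_hat, 0.01, 0.99)  # truncate extreme propensities

# AIPW score: model-based contrast plus weighted residual corrections.
psi = (g1_hat - g0_hat
       + d * (y - g1_hat) / m_hat
       - (1 - d) * (y - g0_hat) / (1 - m_hat))
ate = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)

print(f"ATE = {ate:.2f}, SE = {se:.2f}")
```

Because the outcome model is fitted separately for treated and untreated units, the effect is allowed to vary with the covariates, which is what distinguishes this setup from the partially linear model.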

In [None]:
# DML with interacted regression and lasso
ml_g = linear_model.LassoCV(cv=cv)
ml_m = linear_model.LogisticRegressionCV(penalty='l1', solver='saga', cv=cv)
results_dml_int = dml.DoubleMLIRM(dml_data, ml_g, ml_m, n_folds=splits,
                         normalize_ipw=True, trimming_rule='truncate', trimming_threshold=0.01).fit()
print(results_dml_int)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: LassoCV(cv=10)
Learner ml_m: LogisticRegressionCV(cv=10, penalty='l1', solver='saga')
Out-of-sample Performance:
Learner ml_g0 RMSE: [[288.94782743]]
Learner ml_g1 RMSE: [[333.00815688]]
Learner ml_m RMSE: [[0.4853062]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|      2.5 %      97.5 %
coupons  71.606016  21.989593  3.256359  0.001129  

In [None]:
# Get effects by age range
groups = df[['age_range']].astype('str')
gate_age = results_dml_int.gate(groups=groups)
print(gate_age)


------------------ Fit summary ------------------
               coef     std err         t     P>|t|      [0.025      0.975]
Group_1   60.654107  104.949771  0.577935  0.563409 -145.237292  266.545507
Group_2   75.843856   54.291160  1.396984  0.162659  -30.665027  182.352739
Group_3   78.277595   41.248270  1.897718  0.057957   -2.643630  159.198820
Group_4   49.272657   40.434615  1.218576  0.223228  -30.052332  128.597646
Group_5   78.567455   68.965511  1.139228  0.254820  -56.729701  213.864611
Group_6  107.235622   67.449600  1.589863  0.112111  -25.087607  239.558851
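Group effects of this kind can be understood as averages of a per-observation effect signal within each group. The sketch below illustrates the idea on simulated data, using the simplest possible signal: a transformed outcome that is valid when the treatment probability is known (here 0.5, as in a randomized experiment). This is an assumption-laden simplification; the library works with its own orthogonal score, and the groups, effects, and noise levels below are hypothetical.

```python
import numpy as np

# Simulated randomized experiment (assumption) with three groups and known effects.
rng = np.random.default_rng(4)
n = 6000
true_effects = np.array([40.0, 70.0, 100.0])
group = rng.integers(1, 4, size=n)          # group labels 1, 2, 3
e = 0.5                                     # known treatment probability
d = rng.binomial(1, e, size=n)
y = 250 + true_effects[group - 1] * d + rng.normal(0, 100, size=n)

# Transformed outcome: its conditional mean equals the treatment effect.
signal = (d - e) * y / (e * (1 - e))

# Group average treatment effect = mean of the signal within each group.
gate = {}
for g in (1, 2, 3):
    s = signal[group == g]
    gate[g] = (s.mean(), s.std(ddof=1) / np.sqrt(len(s)))
    print(f"Group_{g}: effect = {gate[g][0]:7.2f}, SE = {gate[g][1]:.2f}")
```

The standard errors are wide because each group uses only its own observations, which is consistent with the wide intervals in the age-range table above: point estimates can differ noticeably across groups without the differences being statistically distinguishable.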


## Summary

In [None]:
results = pd.DataFrame(columns=['Estimate', 'SE', 't-stat', 'p-value', 'CI_low', 'CI_high'],
                       index=['OLS', 'OLS_add', 'OLS_int', 'DML_linear', 'DML_lasso', 'DML_rf', 'DML_int'])

for i, res in enumerate([results_ols, results_ols_add, results_ols_int]):
    results.iloc[i, 0] = res.params['coupons']
    results.iloc[i, 1] = res.bse['coupons']
    results.iloc[i, 2] = res.tvalues['coupons']
    results.iloc[i, 3] = res.pvalues['coupons']
    results.iloc[i, 4] = res.conf_int().loc['coupons', 0]
    results.iloc[i, 5] = res.conf_int().loc['coupons', 1]


for i, res in enumerate([results_dml_linear, results_dml_lasso, results_dml_rf, results_dml_int]):
    results.iloc[i+3, 0] = res.coef[0]
    results.iloc[i+3, 1] = res.se[0]
    results.iloc[i+3, 2] = res.t_stat[0]
    results.iloc[i+3, 3] = res.pval[0]
    results.iloc[i+3, 4] = res.confint().iloc[0, 0]
    results.iloc[i+3, 5] = res.confint().iloc[0, 1]

results.astype('float').round(2)

| | Estimate | SE | t-stat | p-value | CI_low | CI_high |
|---|---|---|---|---|---|---|
| OLS | 95.43 | 21.78 | 4.38 | 0.0 | 52.75 | 138.12 |
| OLS_add | 73.64 | 23.23 | 3.17 | 0.0 | 28.1 | 119.18 |
| OLS_int | 68.32 | 23.23 | 2.94 | 0.0 | 22.79 | 113.84 |
| DML_linear | 75.31 | 23.02 | 3.27 | 0.0 | 30.19 | 120.43 |
| DML_lasso | 57.19 | 17.17 | 3.33 | 0.0 | 23.54 | 90.84 |
| DML_rf | 45.35 | 23.69 | 1.91 | 0.06 | -1.07 | 91.78 |
| DML_int | 71.61 | 21.99 | 3.26 | 0.0 | 28.51 | 114.7 |
