In [1]:
# !conda install --y pandas
# !conda install --y numpy
# !pip install -q sweetviz
# !pip install --quiet facets-overview


# 1. Getting Familiar with the Data

**Importance to project**

Having clean datasets that you trust is fundamental to causal inference. Causal inference is often counterintuitive, and when counterintuitive conclusions come up, it’s nice to be sure that it isn’t just some data bug crawling in. Trusting and understanding your data is an important step to save you time down the road.

## 1.1 Dataset, Libraries, and Setup

## 1.2 Getting Familiar with the Data

**Workflow**

1. Use pandas to read non_rand_discount.csv into memory.
2. Display the first few lines of the dataset.
3. Check whether the data types are correct (no numeric data treated as strings).
4. If they are not correct, convert data that should be numeric to the appropriate type.
5. Look for nulls (missing entries) in your dataset. Nulls are problematic if present in the outcome (profit) or treatment variable (discount). If there aren’t nulls in either of these columns, it’s OK to proceed without addressing them.
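
The checks in the workflow above can be sketched on a toy frame. This is a minimal sketch, not the notebook's actual data: the three rows and their values are made up for illustration, and only the `US$ ` prefix mirrors what the `discount` column looks like later in this notebook.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset.
toy = pd.DataFrame({
    "discount": ["US$ 10.0", "US$ 25.5", "US$ 0.0"],  # numeric data stored as strings
    "profit": [12.5, None, 3.0],                       # one missing outcome
    "age": [31, 45, 28],
})

# Check the data types: list the columns that are not numeric.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Convert the string column to a numeric type after stripping the currency prefix.
toy["discount"] = pd.to_numeric(toy["discount"].str.replace("US$ ", "", regex=False))

# Look for nulls: keep only the columns that have at least one missing entry.
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(non_numeric, toy["discount"].dtype, cols_with_nulls)
```

Here the null sits in the outcome column, which is the case the workflow flags as problematic.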

### 1.3 Read the data into a pandas dataframe

1. Use pandas to read non_rand_discount.csv into memory.
2. Display the first few lines of the dataset.

In [None]:
from IPython.display import display, HTML, Markdown
import pandas as pd
import toml
import sys
import os
import pycaret.regression as pyc  # used below as pyc.setup / pyc.compare_models
from pycaret.regression import setup, create_model, tune_model, plot_model
from sklearn.impute import KNNImputer
import seaborn as sns

# Change the working directory to the root of the project
while os.path.basename(os.getcwd()) != "Causality":
    os.chdir("..")

# Load the configuration
config = toml.load("config.toml")

# Add the source folder to the path
sys.path.append("./causality")
from data import load_data

# If the file does not exist, download it
if not os.path.exists(config["path"]["filename"]):
    data_raw = load_data(config=config)

# Check if the file exists
# destiny_file = f"{config["path"]["data_raw"]}/{config["path"]["filename"]}"

# Load the data
data_raw = pd.read_parquet("data/raw/non_rand_discount.parquet")
data_raw.sample(5)

### 1.3.1 Get familiar with the data

In [None]:
data_raw.describe()

In [None]:
data_raw.select_dtypes('O')

In [61]:
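# Note: this is an alias, not a copy, so the conversion below also changes data_raw
data_clean = data_raw
# Strip the "US$ " prefix and convert discount to float
data_clean['discount'] = data_clean['discount'].apply(lambda x: float(x.replace('US$ ','')))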

In [None]:
data_raw.info()

In [63]:
# Int
cols_int_ls = [i[0] for i in (data_clean.dtypes==int).items() if i[1]==True]
for column in cols_int_ls:
    data_clean[column] = pd.to_numeric(data_clean[column], downcast="integer")


# Float
float_cols_ls = [i[0] for i in (data_clean.dtypes==float).items() if i[1]==True]
for column in float_cols_ls:
    data_clean[column] = pd.to_numeric(data_clean[column], downcast="float")

In [64]:
# import sweetviz as sv

# my_report = sv.analyze(data_raw, target_feat ='sales')
# my_report.show_html()

### 1.3.2 Know what columns it has

### 1.3.3 Check for missing values

In [None]:
data_raw.isnull().sum()[data_raw.isnull().sum()>0]

In [None]:
# Share of missing entries per column (nulls divided by total rows)
(data_raw.isnull().sum()/len(data_raw))[data_raw.isnull().sum()>0]

### 1.3.4 Clean the data for further use

In [68]:
# Save the categorical columns once, before they are dropped
if "__temp" not in locals():
    __temp = data_raw.select_dtypes('O')  # Save the categorical columns
data_raw = data_raw.drop(columns=["cust_state"])  # Drop the categorical column

imputer = KNNImputer(n_neighbors=2)
data_raw= pd.DataFrame(imputer.fit_transform(data_raw), columns = data_raw.columns)

data_clean = data_raw.join(__temp)

In [None]:
# Drop remaining nulls from data_clean so the re-joined categorical columns are kept
data_clean = data_clean.dropna()

data_train = data_clean.sample(frac=0.9, random_state=42)
data_test = data_clean.drop(data_train.index)

data_train.reset_index(drop=True, inplace=True)
data_test.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data_train.shape))
print('Unseen Data For Predictions: ' + str(data_test.shape))

# 2. Understanding the Treatment Assignment Mechanism

## 2.1 Objective

- Understand the treatment assignment mechanism
- Get acquainted with potential outcome notation

## 2.2 Importance to project

A big part of causal inference is understanding the treatment assignment mechanism (who are we giving discounts to) so that we can know what sorts of bias might be lurking in our data. By the end of this milestone, you should have a basic idea about what customer profile receives more of a discount.

One very well-known treatment assignment mechanism is random: we give discounts to different customers at random. That would be the best assignment mechanism to identify the causal effect of discounts on profits because customers with different discounts would be comparable to each other. In other words, the only thing that would change systematically across different discount levels would be the discount itself. We would then be able to isolate the effect of increasing discounts on profits very easily.

However, random assignments are not always available. Developing randomized experiments is costly either in money or in time (or both). For instance, in our case with the e-commerce company, we know that discounts were not randomly assigned. This makes isolating the causal effect harder.

For example, suppose that the discount does increase profits. But the e-commerce company gives more discounts to low-income customers (maybe because they are more willing to negotiate harder for lower prices), who, on average, generate less profit than high-income customers. Then, since we are giving more discounts to less profitable customers, it will look like the relationship between discounts and profits is negative, even though the discount increases profits.

![img](https://github.com/matheusfacure/causal-inf-and-personalization-manning/blob/main/img/project1/income-bias.png?raw=true)
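
The income example above can be reproduced with a small simulation. This is a sketch under stated assumptions: the data-generating process, the variable names, and every number in it are made up for illustration and are not taken from the project data. The true effect of discount on profit is set to be positive, yet the naive slope should come out negative because low-income customers get more discount and generate less profit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process (all numbers are made up).
high_income = rng.integers(0, 2, size=n)  # 1 = high income, 0 = low income
# Low-income customers receive more discount on average.
discount = 20 + 30 * (1 - high_income) + rng.normal(0, 5, size=n)
true_effect = 1.0  # each unit of discount adds one unit of profit
profit = 10 + 100 * high_income + true_effect * discount + rng.normal(0, 5, size=n)

def slope(x, y):
    """OLS slope of y on x."""
    return np.polyfit(x, y, 1)[0]

# Pooled slope ignores income; the other two compare like with like.
naive_slope = slope(discount, profit)
slope_low = slope(discount[high_income == 0], profit[high_income == 0])
slope_high = slope(discount[high_income == 1], profit[high_income == 1])

print(naive_slope, slope_low, slope_high)
```

Within each income group the slope recovers the positive effect, while the pooled slope has the opposite sign.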

Understanding the treatment assignment mechanism is so important that we can’t even make causal inference statements without also saying something about the treatment assignment mechanism. Some might even argue that once this step is done, causal inference is easy. Hence, it is only fair we start our causal inference analysis with understanding very well how the treatment was assigned.

Note: To talk about the treatment assignment mechanism, we will use the language of potential outcomes. I strongly recommend you get familiar with it before starting this milestone. Refer to the reference materials for some quick resources to learn about it.

## 2.3 Workflow

### 2.3.1 Exploring relation between discounts and profit

1. This particular e-commerce company makes 5% on sales, so its revenues are `0.05 * Sales`. You can consider that its costs are mostly how much it gives in discounts. Hence, profitability is given by `5% * Sales - Discount`. Check how discount correlates with the outcomes.
   - Use Seaborn or another plotting library to plot the relationship between discount and sales.
   - Plot the relationship between discount and profits. Does profit increase or decrease with discount?
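
The profitability formula in this step is simple enough to write down directly. A minimal sketch: the helper name `profit_from` and the example customers are hypothetical, and only the 5% margin comes from the text above.

```python
def profit_from(sales: float, discount: float, margin: float = 0.05) -> float:
    """Profit as the margin earned on sales minus the discount given."""
    return margin * sales - discount

# Hypothetical customer: 1,000 in sales and 20 in discount.
example_profit = profit_from(sales=1000.0, discount=20.0)

# Hypothetical customer with small sales and the same discount: profit turns negative.
small_sales_profit = profit_from(sales=100.0, discount=20.0)

print(example_profit, small_sales_profit)
```

On a dataframe the same arithmetic would be `0.05 * df["sales"] - df["discount"]`, assuming those column names.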

In [None]:
# Display the Facets Dive visualization for the cleaned data.

jsonstr = data_clean.to_json(orient='records')
HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

### 2.3.2 Influencing variables

2. We recall from the introduction that the e-commerce company allocates more discounts to customers with higher sales predictions, where sales predictions are given by their machine learning model. This means people with different discount levels are probably different in other dimensions as well.
   - Plot the relationship between age and discount. Do older people get more or fewer discounts?
   - Plot the relationship between customer state and discount. Do different states get different discounts?
   - Plot the relationship between sales prediction and discount. Is the e-commerce company giving more discounts to those it expects to buy more?
   - Plot the relationship between any other feature of your choice and discount. Can you learn more about how discounts were allocated?

In [None]:
sns.set(rc={'figure.figsize':(21,11)})

# Discount distribution by age decile
sns.violinplot(x=pd.qcut(data_clean.age, 10),
               y='discount',
               data=data_clean)

In [None]:
sns.set(rc={'figure.figsize':(21,11)})

# Discount distribution by tenure
sns.violinplot(x='tenure',
               y='discount',
               data=data_clean)

In [None]:
# Discount distribution by customer state
sns.violinplot(x='cust_state', y='discount',
               data=data_clean)

In [None]:
# Discount distribution by sales prediction decile
sns.violinplot(x=pd.qcut(data_clean.sales_prediction, 10),
               y='discount',
               data=data_clean)

In [None]:
sns.scatterplot(x='sales',
               y='sales_prediction', 
               hue='discount',
               data=data_clean
               )

#### 2.3.2.1 Models

In [None]:
#@markdown <p>The discount mechanism can be completely explained by *sales* and *profit*. There is probably some information leakage in the synthetic data generation process, because when the discount is assigned there should be no knowledge of the sales outcome. </p>
#@markdown <p>For this reason I explicitly removed the sales value from the training data.</p>
target = 'discount'
drop_cols = set([
                 'sales', 'profit', # Data leakage
                 'sales_prediction_bins'  # possible multicollinearity
                 ])
numeric_features = [i for i in cols_int_ls + float_cols_ls if i not in drop_cols.union([target])]

exp_reg101 = pyc.setup(data = data_train.drop(columns=drop_cols), target = target, session_id=42,
                   numeric_features = numeric_features)

In [None]:
best = pyc.compare_models(fold = 5, sort = 'RMSLE')

In [None]:
models = {
    "lightgbm" : create_model('lightgbm'),
    "br" : create_model('br'),
    "ridge" : create_model('ridge'),
    "lr" : create_model('lr'),
    "lasso" : create_model('lasso')
}

for model in models.keys():
    models[model] = tune_model(models[model])

##### Light GBM

In [None]:
plot_model(models['lightgbm'], plot='residuals')

In [None]:
plot_model(models['lightgbm'], plot = 'error')

In [None]:
plot_model(models['lightgbm'], plot='feature')

In [None]:
plot_model(models['lightgbm'], plot='manifold')

##### Bayesian Ridge

In [None]:
plot_model(models['br'])

In [None]:
plot_model(models['br'], plot = 'error')

In [None]:
plot_model(models['br'], plot='feature')

##### Ridge

In [None]:
plot_model(models['ridge'])

In [None]:
plot_model(models['ridge'], plot = 'error')

In [None]:
plot_model(models['ridge'], plot='feature')

##### Lasso

In [None]:
plot_model(models['lasso'])

In [None]:
plot_model(models['lasso'], plot = 'error')

In [None]:
plot_model(models['lasso'], plot='feature')

##### Linear Regression

In [None]:
plot_model(models['lr'])

In [None]:
plot_model(models['lr'], plot = 'error')

In [None]:
plot_model(models['lr'], plot='feature')

In [None]:
plot_model(models['lr'], plot = 'manifold')

### 2.3.3 The E-Commerce Company's Treatment Allocation Mechanism

3. From what we saw on the plots in the previous step, it looks like sales prediction is a big driver of discounts. In other words, the treatment assignment mechanism is given primarily by sales prediction. As it turns out, this e-commerce company has a machine learning (ML) model that estimates how much customers will buy (sales prediction), and it uses that model to allocate more discounts to those customers that are expected to buy the most. That is why we see a high degree of correlation between discount and sales prediction. Also, since this ML model uses as input sales information such as age and purchase history, other features can also be correlated with discount. In the industry, allocating treatment according to some model is a very common practice, so when we start a causal inference project and want to learn about the treatment assignment mechanism, we can either explore like we did here or talk to those involved in the allocation of the treatment to understand how it was assigned.
>Note: This last strategy is not always available, as the developers who dealt with the treatment allocation may have forgotten the details or even left the company.

- Now that we have a clearer picture of the treatment allocation, use Seaborn to plot the relationship between profits and discount (as in Step 1), adding a color dimension, which is `sales_prediction_bins`. This plot should resemble the one with high and low income that we covered in the **Importance to project** section. Can you make an argument similar to how discount influences profits?


In [None]:
sns.set(rc={'figure.figsize':(21,11)})

# Discount distribution by sales decile
sns.violinplot(x=pd.qcut(data_clean.sales, 10),
               y='discount',
               data=data_clean)

In [None]:
sns.set(rc={'figure.figsize':(21,11)})

sns.scatterplot(x='sales',
               y='profit', 
               hue='discount',
               data=data_clean
               )

### 2.3.4 Describing Customers with Potential Outcomes

4. By now, you probably have a good idea of the difference between customers with high and low discount levels (the former have high sales prediction and the latter have low). Now let’s see how we can use potential outcomes (PO) to describe these customers. Potential outcomes tell us what outcome (profits) a customer would have under different levels of the treatment (discount). Here is an example of how to use them. For simplicity’s sake, suppose discounts are a binary treatment, taking a value of low or high.

![img](https://github.com/matheusfacure/causal-inf-and-personalization-manning/blob/main/img/project1/po.png?raw=true)

For customers that received a low discount `(T=low or 0)`, we would observe the low discount potential outcome `Y(0)`. This is denoted by the dark blue dots above. But the high discount potential outcome `Y(1)` would also be defined; we just wouldn’t be able to see it. `Y(1)` for the low discount customers would tell us what profits (outcome) the low discount customer would have generated had they received high discounts. This is denoted by the light red dots in the image above. Notice that the potential outcome `Y(1)|T=low` is counterfactual, meaning we cannot see it, while `Y(0)|T=low` is factual, meaning that the potential outcome is the outcome we can observe: `Y(0)=Y` when `T=0`. In contrast, when we look at the high discount group, we can see `Y(1)` (dark red), but `Y(0)` is counterfactual. Of course, this is just an illustration. Knowing what you now know about the data, what can we say about the potential outcomes of customers with high and low discounts?
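
The notation above can be made concrete with a toy table of potential outcomes. This is a minimal sketch with made-up numbers for four hypothetical customers; in real data only one of the two potential outcomes is ever observed per customer. It shows that the difference in observed means equals the effect on the treated plus a bias term that compares `Y(0)` across the two groups.

```python
# Made-up potential outcomes for four customers (never both observable in practice).
y0 = [10, 12, 30, 34]  # profit under a low discount, Y(0)
y1 = [11, 14, 33, 36]  # profit under a high discount, Y(1)
t = [0, 0, 1, 1]       # discount actually received (0 = low, 1 = high)

# The observed outcome is the potential outcome matching the treatment received.
y = [y1_i if t_i == 1 else y0_i for y0_i, y1_i, t_i in zip(y0, y1, t)]

def mean(values):
    return sum(values) / len(values)

treated = [i for i, t_i in enumerate(t) if t_i == 1]
control = [i for i, t_i in enumerate(t) if t_i == 0]

# Computable from observed data: difference in observed means.
naive_diff = mean([y[i] for i in treated]) - mean([y[i] for i in control])
# Average effect on the treated (needs the counterfactual Y(0) of the treated).
att = mean([y1[i] - y0[i] for i in treated])
# Bias: how different the groups would be if nobody got the high discount.
bias = mean([y0[i] for i in treated]) - mean([y0[i] for i in control])

print(naive_diff, att, bias)
```

In this toy table the high discount group would have been more profitable even under a low discount, so most of the observed difference is bias rather than effect.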

In [43]:
data_extra=data_clean.assign(highDiscount = data_clean.discount>80)

In [None]:
sns.set(rc={'figure.figsize':(21,11)})

sns.boxplot(x=pd.qcut(data_extra['sales_prediction'],6)
               , y='profit'
               , data=data_extra
               , hue='highDiscount'
               )

In [None]:
sns.set(rc={'figure.figsize':(21,11)})

sns.boxplot(x=pd.qcut(data_extra['sales'],6)
               , y='profit'
               , data=data_extra
               , hue='highDiscount'
               )

In [None]:
ax = sns.barplot(x = 'highDiscount'
               , y = 'profit'
               , data = data_extra
            #    , hue = 'highDiscount'
               )

In [None]:
ax = sns.swarmplot(x=pd.qcut(data_extra['sales'],6)
               , y='profit'
               , data=data_extra
               , hue='highDiscount'
            #    , split=True
            #    , inner="points"
            #    , bw=.4
               )

ax.set(ylabel="")

### 2.4 Section Summary

This Jupyter notebook explores how the treatment (discount) was assigned and what that means in terms of potential outcomes of customers with high and low discount levels.

- Who do you think receives more discounts?

- And what type of customer receives less discount?

This notebook shows how the discount mechanism works.



# 3. Confounding Bias

In [48]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

## 3.1 Simple relations

### sales ~ discount

In [None]:
y, X = dmatrices('sales ~ discount', data=data_clean, return_type='dataframe')
mod = sm.OLS(y, X)

res = {'s__d':
       {
           'model': mod.fit(),
           'description': "Sales given by discount",
           'formula': "sales ~ discount"
           }
       }
res['s__d']['model'].summary()

In [None]:
fig = sm.graphics.plot_partregress_grid(res['s__d']['model'])
fig.tight_layout(pad=1.0)

In [None]:
data_clean[['sales', 'discount']].corr()

### profit ~ discount

In [None]:
mod = smf.ols(formula='profit ~ discount', data=data_clean)  # Profit discount
res['p__d'] = {
    'model' : mod.fit(),
    'formula' : "profit ~ discount"
}

res['p__d']['model'].summary()

In [None]:
fig = sm.graphics.plot_partregress_grid(res['p__d']['model'])
fig.tight_layout(pad=1.0)

In [None]:
data_clean[['profit', 'discount']].corr()

## 3.2
By how much do we expect sales and profit to change (increase or decrease) for each additional unit of discount? Is this in line with the plot you saw earlier?

## 3.3 Relation model:
discount ~ sales_prediction + age

In [None]:
mod = smf.ols(formula='discount ~ sales_prediction + age', data=data_clean)  # Discount explained by sales prediction and age
res['p__sp_a'] = {
    'model' : mod.fit(),
    'formula' : "discount ~ sales_prediction + age"
}
res['p__sp_a']['model'].summary()

In [None]:
fig = sm.graphics.plot_partregress_grid(res['p__sp_a']['model'])
fig.tight_layout(pad=1.0)

## 3.4 Questions

Based on what you know about the potential outcomes of customers with high and low discounts.

1. What can you say about the relationship between discounts and profits?

$$profits = 39.3178 + 0.0753*discount$$

There's a clear but small relationship between `discounts` and `profits`. It is critical to mention that a lot of customers have a negative profit, as they might have small sales with big discounts, which affects our overall discount performance. We observe something similar when we introduce `sales prediction` and `discount`.

2. Can we interpret that relationship as causal? If not, why?

No. The relationship between discount and profits is quite small and many customers generated negative profit, so it's probable that the customers generating huge profits would have generated a big profit anyway. It's like buying something that happens to be on sale: you were going to purchase it anyway, but since it was on sale you saved some cash.

3. Identify how the correlation between discount and profitability is different from the causal effect of discount on profitability. Is this bias negative or positive?

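The correlation is `0.141303`.
The coefficient for the linear regression is `0.0753`, with an intercept of about 40. This means that a customer with no discount is expected to generate a profit of about 40, and each additional unit of discount is associated with only `0.0753` more profit, so the discount had little to do with it.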
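
The correlation and the regression coefficient quoted above describe the same linear association on different scales: the slope is the correlation multiplied by the ratio of the standard deviations. Neither number is causal on its own. A minimal sketch of that arithmetic, using made-up values rather than the project data:

```python
import numpy as np

# Made-up numbers, only to show the arithmetic.
discount = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])
profit = np.array([38.0, 43.0, 40.0, 47.0, 44.0, 49.0])

# Pearson correlation and the OLS fit of profit on discount.
r = np.corrcoef(discount, profit)[0, 1]
slope, intercept = np.polyfit(discount, profit, 1)

# The slope is the correlation rescaled by the ratio of standard deviations.
slope_from_r = r * profit.std(ddof=1) / discount.std(ddof=1)

print(r, slope, slope_from_r)
```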

In [None]:
len(data_clean[data_clean.discount==0])/len(data_clean)

This was because only `2.7135%` of the clients did not receive a discount. The other `97%` got a discount averaging `$ 80`, which translated into an average loss.

In [None]:
data_clean[data_clean.discount>0].describe()

In [None]:
data_clean[data_clean.discount==0].describe()

We can further observe this with our model
$$discount \sim sales\_prediction + age$$

Once we train an OLS model on it, we realize that we start from the assumption of giving a discount.

$$discount = -103.1282 + 0.1024*sales\_prediction + 2.0833*age$$

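Here the coefficient on age is larger than the one on sales prediction, which suggests that age matters more than sales prediction for predicting the discount, when it should be the other way around. Note that the two variables are measured on different scales, so the raw coefficients are not directly comparable.

One way to fix this would have been to require a minimum sale value.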

## 3.5 Causal graph

Since discounts were not randomly assigned and potential outcome Y(0) is different across different discount levels, we can say that there is confounding bias in the relationship between discounts and profits. Draw a causal graph that is in line with our understanding of the bias in this relationship. You can use the Python library graphviz if you want. Or simply draw the graph on PowerPoint or Google Slides.

In [None]:
import graphviz as gr

g = gr.Digraph()

g.edge("sales_prediction", "discount")
g.edge("Tenure", "discount")
g.edge("age", "discount")

# g.edge("age", "Tenure")

g.edge("discount", "sale")
g.edge("age", "sale")

g.edge("sale", "profit")

g

# 4. Adjustment Formula

## Workflow

 1. Using Seaborn, plot the regression line between discount and profits alongside the scatterplot (like in the image above). Notice customers with high sales_prediction_bins got more discounts. This is a key factor in biasing our data. Ideally, discounts would be randomized and we would get all discount levels equally distributed along the sales_prediction_bins spectrum.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15,8))

sns.scatterplot(data=data_clean, x="discount", y="profit", hue="sales_prediction_bins")
sns.regplot(data=data_clean, x="discount", y="profit", scatter=False)
plt.ylim(-20, 200);

 2. We don’t have randomized discounts, but linear regression can be used to make the discount data look as good as randomly assigned. As shown by the FWL (Frisch-Waugh-Lovell) theorem, regression has partialling-out properties that allow us to estimate what would happen to profits if we increased the discount while keeping other variables, such as sales_prediction_bins, fixed. Don’t worry if it is still not clear how all of this works. We will break it down into the following steps:
- Use statsmodels to regress discount on sales_prediction_bins. Remember to treat sales_prediction_bins as a categorical variable: `"discount~C(sales_prediction_bins)"`.
- Create a new dataframe df_discount_res that is the original one plus a column with the above model’s discount residual plus the discount average: `df["discount"] - prediction + df["discount"].mean()`.

In [None]:
mod = smf.ols("discount~C(sales_prediction_bins)", data=data_clean)  # Discount explained by sales prediction bins
res['d__spb'] = {
    'model' : mod.fit(),
    'formula' : "discount~C(sales_prediction_bins)"
}

# Residualized discount: model residual plus the overall discount mean
df_res = data_clean.assign(discount_res = res['d__spb']['model'].resid + data_clean["discount"].mean())

res['d__spb']['model'].summary()

 3. This discount residual can be seen as a version of discounts that have been debiased from sales_prediction_bins. Remember how sales_prediction_bins was causing confounding bias because it is a common cause of both profits and discounts? Now the residuals we get from predicting discounts from sales_prediction_bins are, by definition, no longer explained by sales_prediction_bins. This is almost magical. To see it, using this new dataframe you created in Step 2, plot profit by discount residual and set colors to be the sales predictions bins (just like in Step 1). What happens to the bias? If everything went well, you should see that customers with high residualized discounts no longer have high sales_prediction_bins. In fact, it looks like this residualized discount is as good as randomly assigned!
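
Why the residuals are "no longer explained" by the bins can be checked by hand. Regressing on a categorical variable alone predicts each group's mean, so the residual is the deviation from the group mean. A minimal sketch with two hypothetical bins and made-up discounts (the bin labels and values are assumptions for illustration):

```python
from collections import defaultdict

# Made-up discounts for customers in two hypothetical sales prediction bins.
bins = ["low", "low", "low", "high", "high", "high"]
discount = [10.0, 20.0, 30.0, 70.0, 80.0, 90.0]

# A regression on the bin alone predicts each bin's mean discount.
by_bin = defaultdict(list)
for b, d in zip(bins, discount):
    by_bin[b].append(d)
bin_mean = {b: sum(v) / len(v) for b, v in by_bin.items()}

overall_mean = sum(discount) / len(discount)
# Residual plus overall mean, as described in the workflow step above.
discount_res = [d - bin_mean[b] + overall_mean for b, d in zip(bins, discount)]

# After residualizing, every bin has the same average discount.
res_by_bin = defaultdict(list)
for b, r in zip(bins, discount_res):
    res_by_bin[b].append(r)
res_bin_mean = {b: sum(v) / len(v) for b, v in res_by_bin.items()}

print(discount_res, res_bin_mean)
```

Both bins end up centered on the overall mean, so knowing the bin no longer tells you anything about the average residualized discount.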

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(data=df_res, x="discount_res", y="profit", hue="sales_prediction_bins")
sns.regplot(data=df_res, x="discount_res", y="profit", scatter=False)
plt.ylim(-20, 200);

4. Continuing on the task of controlling for sales predictions, we will repeat Steps 2 and 3, but now with **profits** instead of discounts.
- Regress profit on sales_prediction_bins. Remember to treat sales_prediction_bins as a categorical variable.
- Create a new dataframe df_profit_res that is df_discount_res plus a column with the above model’s profit residual plus the profit average:
<br>`df["profit"] - prediction + df["profit"].mean()`.

In [None]:
formula="profit~C(sales_prediction_bins)"
mod = smf.ols(formula=formula, data=data_clean)  # Profit explained by sales prediction bins
res['d__cspb'] = {
    'model' : mod.fit(),
    'formula' : formula
}

df_res = (df_res.assign(profit_res = res['d__cspb']['model'].resid + df_res["profit"].mean()))

res['d__cspb']['model'].summary()

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(data=df_res, x="discount_res", y="profit_res", hue="sales_prediction_bins")
sns.regplot(data=df_res, x="discount_res", y="profit_res", scatter=False)
plt.ylim(-20, 200);

5. Using this new dataframe, just like in Step 3, plot profit residual by discount residual and set colors to be the sales prediction bins. Do you see any difference?
Using the data with both profit and discount residuals, df_profit_res, use statsmodels to regress profit residuals on discount residuals. What is the linear coefficient? This coefficient can be interpreted as how much we expect profit to change as we increase discount by one unit while holding sales_prediction_bins fixed.

In [None]:
model = smf.ols("profit_res~discount_res", data=df_res).fit()
model.summary().tables[1]

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(data=df_res, x="discount_res", y="profit_res", hue="sales_prediction_bins")
sns.regplot(data=df_res, x="discount_res", y="profit_res", scatter=False)
plt.ylim(-20, 200);

6. Run a regression model where profit is the dependent variable and discount and sales_prediction_bins are the independent variables: `"profit~discount + C(sales_prediction_bins)"`. What is the discount coefficient? How does it compare to the result you found in Step 5? If everything goes well, it should be exactly the same as the number you got on Step 5.
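
The equality between the two routes is the FWL theorem at work, and it can be checked on simulated data. This is a sketch under stated assumptions: the data-generating process, the three bins, and all coefficients are made up, and plain NumPy least squares stands in for statsmodels.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical data: three bins drive both discount and profit.
bins = rng.integers(0, 3, size=n)
discount = 10 * bins + rng.normal(0, 2, size=n)
profit = 5 * bins - 0.5 * discount + rng.normal(0, 1, size=n)

# One indicator column per bin (together they play the role of the intercept).
dummies = np.eye(3)[bins]

def residualize(y, X):
    """Residuals of y after a least-squares fit on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Route 1: regress the profit residual on the discount residual.
d_res = residualize(discount, dummies)
p_res = residualize(profit, dummies)
coef_fwl = (d_res @ p_res) / (d_res @ d_res)

# Route 2: one regression of profit on discount plus the bin indicators.
X_full = np.column_stack([discount, dummies])
coef_full = np.linalg.lstsq(X_full, profit, rcond=None)[0][0]

print(coef_fwl, coef_full)
```

Adding the overall mean back to each residual, as the notebook does, shifts the points but leaves the slope unchanged as long as the residual regression includes an intercept.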

In [None]:
model = smf.ols("profit~discount + C(sales_prediction_bins)", data=data_clean).fit()
model.summary().tables[1]

7. We’ve debiased the relationship between discount and profits so that it is no longer confounded by sales prediction. If we’ve managed to remove the bias, it means that we can now interpret this relationship as causal.

1- **If that is the case, what is the causal effect of discount on profits?**

The effect is small and not statistically significant, but if anything it might be negative, as a discount reduces the profits.
Also, although not considered in this analysis, once a customer satisfies a need with a product, they might no longer need it, which means one less thing to sell.

2- **Do you recommend the e-commerce company keep the discounts or not?**

No, because there's no evidence of any benefit for the company.