# Data Analysis and Causal Mechanisms

## Environment Setup and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid") 
sns.set_palette('viridis')
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['font.family'] = 'monospace'

In [None]:
# Load data
potential_outcomes_df = pd.read_pickle('../data/potential_outcomes_df.pkl')
observational_df = pd.read_pickle('../data/observational_df.pkl')

## Causal Mechanisms

<center>
<img 
  src="../assets/confounding_bias.png" 
  alt="Confounding Relationships" 
  style="width:500px;height:auto;"
>
</center>

In [None]:
# Observational data
observational_df.head()

### True Causal Effects
The `observational_df` data frame contains data that we might observe during a typical business process. It contains hidden causal mechanisms, such as confounding bias, that affect both treatment assignment (Upsell Marketing) and the outcome (AMU Signup).


#### Treatment Assignment Mechanism
Our observational data was simulated with known confounding mechanisms, and the true causal effects are stored in the `potential_outcomes_df` data frame. Many of the columns in this data would be impossible to observe in reality, but seeing them helps build a deep understanding of the goals of causal inference models.

Let's go through the different aspects of these unobservable features.

In [None]:
potential_outcomes_df.head()

#### Potential Outcomes
Causal inference is based on the concept of potential outcomes. A potential outcome denotes what would have happened to an individual under a given treatment condition, whether or not that condition was actually applied.

For every customer in our data, we have two potential outcomes:
- $Y^{0}$ denotes the outcome that would have happened if the customer did not get treated (did not see an upsell marketing message)
- $Y^{1}$ denotes the outcome that would have happened if the customer did see an upsell marketing message

The fundamental problem of causal inference is the fact that it is impossible to observe both of the potential outcomes for the same individual. Instead, we get the outcome Y, which is the realization of either $Y^{0}$ or $Y^{1}$ depending on treatment assignment.
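The link between the observed outcome and the two potential outcomes can be sketched in a few lines. The arrays below are made-up values for five hypothetical customers, not taken from the workshop data, and `treatment` is an illustrative name for the treatment indicator.

```python
import numpy as np

# Hypothetical potential outcomes for five customers (illustrative values only)
y_0 = np.array([0, 0, 1, 0, 1])        # outcome without the upsell message
y_1 = np.array([0, 1, 1, 1, 1])        # outcome with the upsell message
treatment = np.array([1, 1, 0, 0, 1])  # 1 = saw the upsell message

# Observed outcome: Y = T * Y^1 + (1 - T) * Y^0
y_observed = treatment * y_1 + (1 - treatment) * y_0

# The counterfactual outcome is the one we never get to see
y_counterfactual = treatment * y_0 + (1 - treatment) * y_1

print(y_observed)
print(y_counterfactual)
```

Only `y_observed` and `treatment` would be available in real data; `y_counterfactual` exists here only because the values are simulated.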

In our `potential_outcomes_df`, we have the true potential outcome likelihoods as well as the individual treatment effects for all customers. In the real world, these are hidden from us!

In our simulated data, each customer has a baseline probability of 0.05 of signing up for AMU in the absence of upsell marketing exposure. The `y_1_propensity` is the change to the baseline probability as a function of the customer's confounding features. The more active a customer is, the more likely they are to sign up for our service if shown an upsell message.
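As a rough sketch of how individual and average treatment effects follow from the potential outcome probabilities, the toy example below builds a small data frame from scratch. The `activity` score, the linear form of the lift, and all column names are assumptions made for illustration; they are not the mechanism used to generate `potential_outcomes_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_customers = 1_000

# Hypothetical activity score in [0, 1]; stands in for the confounding features
activity = rng.uniform(0, 1, size=n_customers)

toy_df = pd.DataFrame({
    'activity': activity,
    # Baseline signup probability without upsell marketing (0.05, as in the text)
    'p_signup_untreated': 0.05,
    # Assumed form: the lift from upsell marketing grows linearly with activity
    'p_signup_treated': 0.05 + 0.10 * activity,
})

# Individual treatment effect: difference between the two potential outcome probabilities
toy_df['ite'] = toy_df['p_signup_treated'] - toy_df['p_signup_untreated']

# Average treatment effect: mean of the individual effects
ate = toy_df['ite'].mean()
print(f'Toy ATE: {ate:.2%}')
```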

In [None]:
potential_outcomes_df \
.drop(
    columns=['upsell_marketing', 'upsell_confounding_propensity', 'amu_signup_rct', 'upsell_marketing_rct']) \
.head()

#### Confounding in Treatment Assignment
The `potential_outcomes_df` also has the true causal mechanism for treatment assignment encoded in the `upsell_confounding_propensity` column. This measures the likelihood that a customer sees an upsell marketing message as a function of their activity. In the real world, this is also hidden from us!
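A confounded assignment mechanism of this kind might look like the following sketch. The `activity` score and the linear propensity are assumptions chosen for illustration, not the formula behind `upsell_confounding_propensity`.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_customers = 10_000

# Hypothetical activity score in [0, 1]
activity = rng.uniform(0, 1, size=n_customers)

# Assumed form: more active customers are more likely to see the upsell message
treatment_propensity = 0.2 + 0.6 * activity

# Treatment is a Bernoulli draw from each customer's propensity
treated = rng.binomial(n=1, p=treatment_propensity)

# Treated customers end up more active on average than untreated ones
mean_activity_treated = activity[treated == 1].mean()
mean_activity_untreated = activity[treated == 0].mean()
print(f'Mean activity (treated):   {mean_activity_treated:.2f}')
print(f'Mean activity (untreated): {mean_activity_untreated:.2f}')
```

Because the propensity depends on activity, the two groups differ before any message is shown, which is what makes a direct comparison of their outcomes misleading.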

In [None]:
potential_outcomes_df \
.drop(
    columns=[
        'y_0_propensity', 'y_1_propensity', 'individual_treatment_effect', 'amu_signup_rct', 'upsell_marketing_rct']) \
.head()

### Dangers of Unadjusted Estimates

<center>
<img 
  src="../assets/correlation_causation.png" 
  alt="Correlation is not causation" 
  style="width:300px;height:auto;"
>
</center>

In [None]:
# Check treatment effect
signup_rate_by_treatment = potential_outcomes_df.groupby('upsell_marketing')['amu_signup'].mean()
print(f"Signup rates by treatment group: {signup_rate_by_treatment}")

# Plot distribution
plt.figure(figsize=(8, 6))
signup_rate_by_treatment.plot(kind='bar', color=['tab:blue', 'tab:green'])
plt.title("Signup Rate by Upsell Message Exposure")
plt.xlabel("Upsell Marketing (0 = No, 1 = Yes)")
plt.ylabel("Signup Rate")
plt.xticks(rotation=0)
plt.show()

In [None]:
# Difference in conversion rates by treatment type
biased_lift = (
    (potential_outcomes_df['amu_signup'][potential_outcomes_df.upsell_marketing==1].mean()) - 
    (potential_outcomes_df['amu_signup'][potential_outcomes_df.upsell_marketing==0].mean())
)

# True average treatment effect based on ITE
actual_lift = potential_outcomes_df.individual_treatment_effect.mean()
print(
    f'Biased Marketing Lift: {biased_lift:.2%}',
    f'Actual Marketing Lift: {actual_lift:.2%}', 
    sep='\n'
)
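To see why the unadjusted comparison can be biased, here is a minimal self-contained simulation. It assumes a single hypothetical confounder (`activity`) that raises both the chance of seeing the message and the size of the response to it. The functional forms and parameter values are illustrative assumptions, not those behind the workshop data.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_customers = 200_000

# Hypothetical activity score in [0, 1]; acts as the confounder
activity = rng.uniform(0, 1, size=n_customers)

# Assumed mechanisms: activity raises both exposure and the effect of exposure
treated = rng.binomial(1, 0.2 + 0.6 * activity)
p_y0 = np.full(n_customers, 0.05)   # baseline signup probability
p_y1 = p_y0 + 0.20 * activity       # signup probability if shown the message

# Draw both potential outcomes, then reveal only the one matching treatment
y0 = rng.binomial(1, p_y0)
y1 = rng.binomial(1, p_y1)
y_observed = np.where(treated == 1, y1, y0)

naive_lift = y_observed[treated == 1].mean() - y_observed[treated == 0].mean()
true_ate = (p_y1 - p_y0).mean()

print(f'Naive lift: {naive_lift:.2%}')
print(f'True ATE:   {true_ate:.2%}')
```

In this toy setup the naive lift overstates the average treatment effect, because the treated group is more active than average and more active customers respond more strongly.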

## Exploratory Data Analysis

### Correlation

Correlation measures the strength and direction of the **linear** relationship between two numeric variables.

It ranges from -1 to 1.
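As a quick illustration, the sketch below computes the Pearson correlation with `np.corrcoef` for three synthetic variables. The data is randomly generated for this example and has no connection to `observational_df`.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.normal(size=500)
noise = rng.normal(size=500)

# Construct variables with a positive, a negative, and no linear relationship to x
y_positive = 2 * x + noise
y_negative = -2 * x + noise
y_unrelated = noise

for label, y in [('positive', y_positive), ('negative', y_negative), ('unrelated', y_unrelated)]:
    r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
    print(f'{label:>9}: r = {r:+.2f}')
```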

Below are some common patterns that appear in a scatter plot for different correlation values.

<br>
<center>
<img 
  src="../assets/negative_correlation.png" 
  alt="Negative Correlation" 
  style="width:auto;height:auto;"
>
</center>
<br>
<br>

<br>
<center>
<img 
  src="../assets/positive_correlation.png" 
  alt="Positive Correlation" 
  style="width:auto;height:auto;"
>
</center>
<br>
<br>


<br>
<center>
<img 
  src="../assets/zero_correlation.png" 
  alt="Zero Correlation" 
  style="width:auto;height:auto;"
>
</center>
<br>
<br>


**Beware**: A correlation of zero does not imply a lack of association between two numeric variables. A strong non-linear relationship can still produce a correlation near zero.

<br>
<center>
<img 
  src="../assets/non_linear_relationship.png" 
  alt="Non Linear Relationship" 
  style="width:auto;height:auto;"
>
</center>
<br>
<br>
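A small numeric check of this warning, using a synthetic example in which `y` is completely determined by `x` yet the two are linearly uncorrelated:

```python
import numpy as np

# Symmetric grid around zero, with y fully determined by x
x = np.linspace(-1, 1, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f'Pearson correlation: {r:.3f}')
```

The correlation is zero up to floating point error, because the positive and negative halves of the parabola cancel out in the linear fit.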

In [None]:
observational_df \
.corr() \
.style \
.background_gradient(cmap='coolwarm_r', vmin=-1, vmax=1) \
.format('{:.2f}')

In [None]:
# Summary statistics
observational_df \
.describe() \
.drop('count') \
.style \
.format({
    'amu_signup': '{:.1%}',
    'upsell_marketing': '{:.1%}',
    'streaming_tier_prime': '{:.1%}',
    'play_days': '{:,.2f}',
    'songs_listened': '{:,.2f}',
    'other_subscriptions': '{:.1%}',
    'retail_spending': '${:,.2f}'})

In [None]:
# Summary statistics of customer features by treatment group
observational_df \
.groupby('upsell_marketing', as_index=False) \
.mean() \
.style \
.format({
    'amu_signup': '{:.1%}',
    'streaming_tier_prime': '{:.1%}',
    'play_days': '{:,.2f}',
    'songs_listened': '{:,.2f}',
    'other_subscriptions': '{:.1%}',
    'retail_spending': '${:,.2f}'})

### Randomized Control Trials
RCTs remove confounding bias by randomly assigning treatment, which makes the control (no-upsell) and treatment (upsell) groups similar, on average, in their attributes and behavioral profiles.
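The balancing effect of randomization can be sketched with synthetic data. The example compares the gap in a hypothetical `activity` score between groups under a confounded assignment rule (an assumed linear propensity) and under a coin flip.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n_customers = 200_000

# Hypothetical activity score in [0, 1]
activity = rng.uniform(0, 1, size=n_customers)

# Confounded assignment (assumed form): propensity depends on activity
treated_confounded = rng.binomial(1, 0.2 + 0.6 * activity)

# Randomized assignment: a fair coin flip that ignores activity
treated_rct = rng.binomial(1, 0.5, size=n_customers)

def activity_gap(treated):
    """Difference in mean activity between treated and untreated customers."""
    return activity[treated == 1].mean() - activity[treated == 0].mean()

gap_confounded = activity_gap(treated_confounded)
gap_rct = activity_gap(treated_rct)
print(f'Activity gap, confounded assignment: {gap_confounded:+.3f}')
print(f'Activity gap, randomized assignment: {gap_rct:+.3f}')
```

Under the coin flip the gap shrinks toward zero as the sample grows, while under the confounded rule it persists no matter how much data is collected.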

In [None]:
potential_outcomes_df \
.filter(
    items=['amu_signup', 'upsell_marketing', 'amu_signup_rct', 'upsell_marketing_rct']) \
.head()

In [None]:
potential_outcomes_df \
.filter(
    items=[
        'upsell_marketing_rct', 'amu_signup', 
        'streaming_tier_prime', 'play_days', 
        'songs_listened', 'other_subscriptions', 
        'retail_spending'], axis=1) \
.groupby('upsell_marketing_rct', as_index=False) \
.mean() \
.style \
.format({
    'amu_signup': '{:.1%}',
    'streaming_tier_prime': '{:.1%}',
    'play_days': '{:,.2f}',
    'songs_listened': '{:,.2f}',
    'other_subscriptions': '{:.1%}',
    'retail_spending': '${:,.2f}'})

What do RCTs do to our estimates of the causal effect?
- They remove confounding bias, so a simple difference in group means is an unbiased estimate of the true average treatment effect!
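The reasoning can be written out using the potential outcomes notation from earlier. Here $T$ denotes the treatment indicator (a symbol introduced for this sketch), and the second step relies on the assumption that randomization makes $T$ independent of $(Y^{0}, Y^{1})$:

$$
\begin{aligned}
E[Y \mid T=1] - E[Y \mid T=0] &= E[Y^{1} \mid T=1] - E[Y^{0} \mid T=0] \\
&= E[Y^{1}] - E[Y^{0}] \\
&= \text{ATE}
\end{aligned}
$$

Without randomization the second equality can fail, which is why the unadjusted comparison in observational data can be biased.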

In [None]:
# Difference in conversion rates by treatment type in an RCT
rct_lift = (
    (potential_outcomes_df['amu_signup_rct'][potential_outcomes_df.upsell_marketing_rct==1].mean()) - 
    (potential_outcomes_df['amu_signup_rct'][potential_outcomes_df.upsell_marketing_rct==0].mean())
)

# True average treatment effect based on ITE
actual_lift = potential_outcomes_df.individual_treatment_effect.mean()
print(
    f'RCT Marketing Lift: {rct_lift:.2%}',
    f'Actual Marketing Lift: {actual_lift:.2%}', 
    sep='\n'
)