# An Introduction to Quantile Regression
__Date__: Fall 2022 <br>
__Author__: Alex Parker

This notebook borrows heavily from the __[Medium blog post](https://towardsdatascience.com/mean-vs-median-causal-effect-37057a6c54c9)__ by Matteo Courthoud.

#### Quantile Regression Pros:
1. Can be more informative than OLS by showing impact on the entire distribution
2. Useful for highly skewed data that may have a large proportion of 0 values
3. Can be used to show effects at different quantiles
4. More robust to outliers than OLS
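
The robustness point can be illustrated with the simplest possible case. In an intercept-only model, OLS returns the sample mean and median regression returns the sample median, so comparing the two statistics with and without a single extreme value gives a rough feel for the difference. The numbers below are made up for illustration only.

```python
import numpy as np

# Illustrative values only (not taken from the notebook's data)
values = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
with_outlier = np.append(values, 1000.0)  # add one extreme observation

# How far each summary statistic moves when the outlier is added
mean_shift = with_outlier.mean() - values.mean()
median_shift = np.median(with_outlier) - np.median(values)

print(f"shift in mean:   {mean_shift:.2f}")
print(f"shift in median: {median_shift:.2f}")
```

The mean is pulled strongly toward the outlier while the median barely moves, which is the intuition behind pro number 4.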


#### Quantile Regression Cons:
1. Median analysis does not translate well into an estimate of total business impact
2. Interpreting coefficients as effects on individual observations assumes rank invariance: the ranks of the observations do not change as a result of an intervention

#### Interpretation
Under the assumption of rank invariance, a QR coefficient can be interpreted as the estimated effect on a single observation at the corresponding quantile. Simply put, if a QR is run at the 50th percentile of customer spend, then the coefficients represent the estimated effects of the covariates on the median customer.
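
As a minimal sketch of this interpretation, the toy example below fits a median regression with a single binary regressor and compares the coefficient with the raw difference in group medians. The data, the column names `treated` and `y`, and the 10% effect size are all assumptions made up for this illustration; they are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 20000

# Hypothetical toy data: a skewed outcome that treatment scales up by 10%
toy = pd.DataFrame({"treated": rng.integers(0, 2, size=n)})
toy["y"] = rng.lognormal(mean=3.0, sigma=0.5, size=n) * (1 + 0.10 * toy["treated"])

# Median regression (q = 0.5) with the treatment dummy as the only regressor
median_fit = smf.quantreg("y ~ treated", data=toy).fit(q=0.5)
qr_effect = median_fit.params["treated"]

# Raw difference in medians and in means between the two groups
median_gap = toy.loc[toy.treated == 1, "y"].median() - toy.loc[toy.treated == 0, "y"].median()
mean_gap = toy.loc[toy.treated == 1, "y"].mean() - toy.loc[toy.treated == 0, "y"].mean()

print(f"QR coefficient at the median: {qr_effect:.3f}")
print(f"Difference in group medians:  {median_gap:.3f}")
print(f"Difference in group means:    {mean_gap:.3f}")
```

With only a treatment dummy in the model, the median regression coefficient should come out very close to the difference in group medians, up to the numerical tolerance of the solver.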



In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
import plotly.graph_objects as go
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy
from my_functions import *  # local helper module (not shown here); provides run_quantile_regressions
from plotly.subplots import make_subplots


sns.set_theme(context = 'notebook', style = 'whitegrid')

### Data Generation: Randomized Experiment with some demographic information

In [None]:
n = 25000

# personal characteristics
age = np.random.randint(18, 70, n)
gender = np.random.choice(['male', 'female','other'], p=[0.51, 0.41,.08], size= n)
income = np.random.lognormal(4 + np.log(age), 0.1, n)

# treatment status
treatment = np.random.choice(['control','treatment'],p =[.5,.5], size = n)

# Dependent variable
spend = 50*(gender=='female') + 25*(gender == 'other') + (income/10)*np.random.normal(loc = 1, scale = .1, size = n)
spend = spend + spend*(treatment == 'treatment')*.05
spend = np.maximum(np.round(spend, 2) - 250, 0)

# Generate the dataframe
df = pd.DataFrame({'spend':spend,'treatment': treatment, 'age': age, 'gender': gender,'income':income})
df = df.assign(
    treatment = df.treatment.astype('category'),
    gender = df.gender.astype('category'))

df.head()

#### The dependent variable is highly skewed, with a large share of zero values

In [None]:
fig = px.histogram(df, x = 'spend', color = 'treatment')
fig.show()

In [None]:
df.groupby('treatment')['spend'].describe(percentiles = [.1,.25,.5,.75,.9])

In [None]:
sample = df.sample(frac = .1)
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=False)
fig.suptitle('Scatterplots')

# Left panel: spend against age
sns.scatterplot(ax=axes[0], x=sample.age, y=sample.spend, hue = sample.treatment)
axes[0].set_title('Spend and Age')

# Middle panel: spend against income
sns.scatterplot(ax=axes[1], x=sample.income, y=sample.spend, hue = sample.treatment)
axes[1].set_title('Spend and Income')

# Right panel: income against age
sns.scatterplot(ax=axes[2], x=sample.age, y=sample.income, hue = sample.treatment)
axes[2].set_title('Income and Age')

plt.show()

In [None]:
### How does spend vary across the categorical variables?
fig = px.box(df, x = 'gender', y = 'spend', color = 'treatment')
fig.show()

## Analysis of the Causal Effect of Treatment

### Difference of means

In [None]:
df.groupby('treatment')['spend'].agg(['count','mean','median','std'])

### OLS

In [None]:
smf.ols("spend ~ treatment", data=df).fit().summary().tables[1]

In [None]:
smf.ols("spend ~ treatment + age + gender", data=df).fit().summary().tables[1]

### Quantile Regression

In [None]:
smf.quantreg("spend ~ treatment", data=df).fit(q = .5).summary().tables[1]

In [None]:
smf.quantreg("spend ~ treatment + gender + income", data=df).fit(q = .5).summary().tables[1]

### Analysis of Effect at several different quantiles

In [None]:
qrs = run_quantile_regressions(df, formula = "spend ~ treatment + gender + income", varname = "treatment[T.treatment]", q = .05)
qrs
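
The helper `run_quantile_regressions` comes from the local `my_functions` module, which is not shown in this notebook. The sketch below is one guess at what such a helper might look like. It assumes that `q` is the step size of the quantile grid and that the result has the columns `q`, `coeff`, `ci_lower` and `ci_upper` used in the plot that follows. The function name `run_quantile_regressions_sketch` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def run_quantile_regressions_sketch(df, formula, varname, q=0.05):
    """Hypothetical helper: fit one quantile regression per quantile on a grid.

    Assumes q is the grid step, so q=0.05 gives quantiles 0.05, 0.10, ..., 0.95.
    """
    n_steps = int(round(1 / q))
    quantiles = np.round(np.arange(1, n_steps) * q, 4)  # strictly inside (0, 1)

    model = smf.quantreg(formula, data=df)
    rows = []
    for tau in quantiles:
        res = model.fit(q=float(tau))
        ci = res.conf_int()  # DataFrame with columns 0 (lower) and 1 (upper)
        rows.append({
            "q": float(tau),
            "coeff": res.params[varname],
            "ci_lower": ci.loc[varname, 0],
            "ci_upper": ci.loc[varname, 1],
        })
    return pd.DataFrame(rows, columns=["q", "coeff", "ci_lower", "ci_upper"])


# Small made-up dataset to exercise the helper (not the notebook's df)
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "treatment": rng.choice(["control", "treatment"], size=n),
    "x": rng.normal(size=n),
})
demo["y"] = 1.0 + 0.5 * (demo["treatment"] == "treatment") + demo["x"] + rng.normal(size=n)

qrs_demo = run_quantile_regressions_sketch(
    demo, formula="y ~ treatment + x", varname="treatment[T.treatment]", q=0.25
)
print(qrs_demo)
```

Building the model once and refitting it for each quantile keeps the loop simple. The actual helper in `my_functions` may differ in its defaults, its grid, or how it computes the confidence intervals.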

In [None]:
# Plot the treatment coefficient and its confidence band by quantile, with the OLS estimate for reference
fig, ax = plt.subplots()
sns.lineplot(data=qrs, x='q', y='coeff')
ax.fill_between(data=qrs, x='q', y1='ci_lower', y2='ci_upper', alpha=0.1);
plt.axhline(y=0, c="k", lw=2, zorder=1)
ols_coeff = smf.ols("spend ~ treatment + gender + income", data=df).fit().params["treatment[T.treatment]"]
plt.axhline(y=ols_coeff, ls="--", c="C1", label="OLS coefficient", zorder=1)
plt.legend()
plt.title("Estimated coefficient, by quantile")
plt.show()