# Live Well Dorset Referral Trend Causal Inference
Causal inference analysis for Live Well Dorset programmes communication interventions. 
[Research project](https://andyist.github.io/mres/)

In [1]:
import pandas as pd                              # tables and data manipulations
from causalimpact import CausalImpact            # estimate causal effect of intervention on a time series
from datetime import datetime                    # date helpers
import warnings                                  # `do not disturbe` mode              
warnings.filterwarnings('ignore')

%run Livewell.ipynb # Project specific helpers

#Load and prepare client data (not publicly available)
clients = pd.read_csv("csv-data/18Jan2019/clients.csv", index_col='clientID', parse_dates=['DateRegistered','Date_registered_Month_Year'], dayfirst=True)
clients = prepare_clients(clients)
clients = clients.loc[~clients.DateRegistered.isin(['2016-06-15 17:03:00','2016-06-15 17:04:00'])]
clients = clients.loc[clients.DateRegistered >= '2016-01-01 00:00:00']
# Remove the time component from the date stamps (we are not interested in finer resolutions than daily)
clients = clients.assign(DateRegistered=clients.DateRegistered.dt.round('D'))

### 1. Counting referrals
This analysis is conducted using only referral date and if the refereal soure is defined as GP based or not. In preperation the two parameteres are extracted from the general client data, the GP source is then split to two dummy columns represented by boolean vaulues. This structrue allows for daily grouping and counting of the GP and non GP referrrals.

In [2]:
# Split gp_referral into boolean columns
gp_dates = clients[['DateRegistered','gp_referral']]
gp_dates['gp_referral'] = gp_dates['gp_referral'].astype(str)
dum = gp_dates.gp_referral.str.get_dummies().astype(bool)
gp_dates = pd.concat([gp_dates, dum], axis=1)
gp_dates.rename(columns = {'0': 'no-gp', '1': 'gp'}, inplace=True)
# Group by date, counting referrals
gp_date_counts = gp_dates.groupby('DateRegistered').sum()
gp_date_counts = gp_date_counts.reset_index()
gp_date_counts['DateRegistered'] = pd.to_datetime(gp_date_counts['DateRegistered'])
gp_date_counts = gp_date_counts.set_index('DateRegistered')

### 2. Padding missing values
The inconsistent nature of referral gathering results in the possability for non existant dates within the timeline. A xero fill process is added to pad the data such that missing dates in the time series are accounted for with 0 referrals.

In [3]:
# Trim date range
df = gp_date_counts.copy()
# Populate dates with no value as 0 to produce a continuous timeline
df = df.asfreq(freq='D', fill_value=0)
# Optionally reduce the time series to entries from a specific date
#df = df.loc[df.index > '2018-01-01']

### 3. Causal inferance analysis
The analysis is acheived using the Caulsa Impact library developed by Brodersen, K.H. et al (2015). The exmaple used in this study realied upon a discreeet intervention date being identified with a format of year,month,day. This is subsuqnelty used by a project specific get_periods function to standardise the production of boundary dates. This function requires the data frame to be analysed (as results fomr steps 1 and 2), the intervention date object, and the post period with option for a pre-period multiplier - that is how much more pre-period date range should be used for the analysis compared to the desire post period counterfactual projection. 

The Causal Impact library provides a useul interface for outputting the statistical results summary and visualisations if required. 

It is intended that this approach be run for each intervention date that exists within the date range and the statistical results be gathered and reviewed as per the researches particular requirments.

In [4]:
# Set the date of the intervention
intervention = datetime(2018,3,29)
# Prepare pre and post date ranges - default pre is 3 times post which is 28 days
pre_period, post_period = get_periods(df, intervention, 60, 3)
# Reduce the time series to the period of analysis
mask = ((df.index >= pre_period[0]) & (df.index <= post_period[1]))
s = df.loc[mask]
s = s[['gp','no-gp']]
# Run the analysis and output the sumary results
ci = CausalImpact(s, pre_period, post_period)
print('From ' + pre_period[0] + ' to ' + post_period[1])
print('-------')
print(ci.summary())

From 2017-10-01 to 2018-05-29
-------
Posterior Inference {Causal Impact}
                          Average            Cumulative
Actual                    8.0                488.0
Prediction (s.d.)         5.7 (0.7)          348.5 (43.9)
95% CI                    [4.3, 7.1]         [263.3, 435.3]

Absolute effect (s.d.)    2.3 (0.7)          139.5 (43.9)
95% CI                    [0.9, 3.7]         [52.7, 224.7]

Relative effect (s.d.)    40.0% (12.6%)      40.0% (12.6%)
95% CI                    [15.1%, 64.5%]     [15.1%, 64.5%]

Posterior tail-area probability p: 0.0
Posterior prob. of a causal effect: 100.00%

For more details run the command: print(impact.summary('report'))


In [5]:
ci.plot()

<Figure size 1500x1200 with 3 Axes>

Brodersen, K.H. et al., 2015. Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9, pp.247–274.
https://ai.google/research/pubs/pub41854