# PyGeoLift Walkthrough

## Summary

Example of using the `pygeolift` package. This largely follows the R package vignette for [GeoLift](https://github.com/facebookincubator/GeoLift/blob/main/vignettes/GeoLift_Walkthrough.md#example---data).

## Prerequisites

Follow the steps in `../README.md` to install the R and Python dependencies. This notebook should be run with a Python kernel from the poetry virtual environment.

Ensure that pygeolift is in the path.

In [3]:
import sys
import os
sys.path.append("..")

Development settings for reloading `pygeolift`.

In [4]:
%reload_ext autoreload
# Mode 1: reload all modules imported with %aimport every time before executing the Python code typed.
%autoreload 1
%aimport pygeolift

## Reading Example Data

The `pygeolift` package includes the example datasets from the [GeoLift](https://github.com/facebookincubator/GeoLift) R package.

The dataset `GeoLift_PreTest` contains conversions (`Y`) by city (`location`) and date. It is intended as an example for selecting markets with GeoLift.

In [5]:
import pygeolift.data
geo_lift_pre_test = pygeolift.data.load_GeoLift_PreTest()
geo_lift_pre_test.head()

|   | location | Y    | date       |
|---|----------|------|------------|
| 0 | new york | 3300 | 2021-01-01 |
| 1 | new york | 3202 | 2021-01-02 |
| 2 | new york | 4138 | 2021-01-03 |
| 3 | new york | 3716 | 2021-01-04 |
| 4 | new york | 3270 | 2021-01-05 |


This data needs to be preprocessed with `geo_data_read` before running the market selection algorithm. The preprocessing includes converting dates to integer time indices (1, 2, ..., max time) and dropping any units with missing data.

In [7]:
from pygeolift.geolift import geo_data_read
geo_test_data_pre_test = geo_data_read(geo_lift_pre_test,
                                        date_id = "date",
                                        location_id = "location",
                                        Y_id = "Y",
                                        X = [], #empty list as we have no covariates
                                        format = "yyyy-mm-dd",
                                        summary = True)

R[write to console]: ##################################
#####       Summary       #####
##################################

* Raw Number of Locations: 40
* Time Periods: 90
* Final Number of Locations (Complete): 40
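
To make the two preprocessing steps concrete, here is a minimal pandas sketch of the same idea. It is not the code behind `geo_data_read` (which delegates to the R package); the function name `sketch_geo_data_read` and the toy data are assumptions made for illustration only.

```python
import pandas as pd


def sketch_geo_data_read(df, date_id="date", location_id="location", Y_id="Y"):
    """Illustrative stand-in for the preprocessing described above (hypothetical helper)."""
    out = df[[location_id, date_id, Y_id]].copy()
    out[date_id] = pd.to_datetime(out[date_id], format="%Y-%m-%d")

    # Map each distinct date to an integer time index 1, 2, ..., max time,
    # in chronological order.
    codes, _ = pd.factorize(out[date_id], sort=True)
    out["time"] = codes + 1

    # Keep only locations that have a non-missing Y in every time period.
    n_periods = out["time"].nunique()
    clean = out.dropna(subset=[Y_id])
    periods_per_location = clean.groupby(location_id)["time"].nunique()
    complete = periods_per_location[periods_per_location == n_periods].index
    clean = clean[clean[location_id].isin(complete)]

    return (clean[[location_id, "time", Y_id]]
            .sort_values([location_id, "time"])
            .reset_index(drop=True))


# Toy data: location "b" is missing Y on one date, so it is dropped.
toy = pd.DataFrame({
    "location": ["a", "a", "a", "b", "b", "b"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 2,
    "Y": [1.0, 2.0, 3.0, 4.0, None, 6.0],
})
result = sketch_geo_data_read(toy)
print(result)
```

In the toy example only location `a` survives, with `time` running from 1 to 3.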



In [9]:
from pygeolift.geolift import geo_lift_market_selection
import numpy as np

market_selections = geo_lift_market_selection(data = geo_test_data_pre_test,
                                        treatment_periods = [10,15],
                                        N = [2,3,4,5],
                                        Y_id = "Y",
                                        location_id = "location",
                                        time_id = "time",
                                        effect_size = list(np.arange(-0.25, 0.25, 0.05)),
                                        lookback_window = 1, 
                                        include_markets = ["chicago"],
                                        exclude_markets = ["honolulu"],
                                        cpic = 7.50,
                                        budget = 100000,
                                        alpha = 0.1,
                                        Correlations = True,
                                        fixed_effects = True,
                                        side_of_test = "one_sided"
                                        )


R[write to console]: Setting up cluster.

R[write to console]: Importing functions into cluster.

R[write to console]: Attempting to load the environment ‘package:dplyr’

R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


R[write to console]: 
Deterministic setup with 2 locations in treatment.

R[write to console]: 
Deterministic setup with 3 locations in treatment.

R[write to console]: 
Deterministic setup with 4 locations in treatment.

R[write to console]: 
Deterministic setup with 5 locations in treatment.



  ID                                           location duration EffectSize
1  1            atlanta, chicago, las vegas, saint paul       15       0.05
2  2                                  chicago, portland       15       0.05
3  3             chicago, cincinnati, houston, portland       15       0.05
4  4                  chicago, houston, nashville, reno       15       0.05
5  5       atlanta, chicago, cleveland, las vegas, reno       10       0.05
6  6 atlanta, chicago, cleveland, las vegas, saint paul       10       0.05
  Power AvgScaledL2Imbalance Investment   AvgATT Average_MDE ProportionTotal_Y
1     1            0.5558341   85305.00 189.2567  0.04991417        0.08767192
2     1            0.1738778   32281.87 146.5321  0.05111983        0.03306537
3     1            0.1971864   74118.37 159.3627  0.04829912        0.07576405
4     1            0.3321341   75556.12 174.0647  0.05193036        0.07816073
5     1            0.4536741   69300.00 195.0787  0.05292824        0.107

The output is a Python class equivalent to the output of the R function [geolift::GeoLiftMarketSelection](https://rdrr.io/github/facebookincubator/GeoLift/man/print.GeoLiftMarketSelection.html). See those docs for details, or see `../pygeolift/geolift.py`.

In [12]:
type(market_selections)

pygeolift.geolift.GeoLiftMarketSelection

This object has three attributes: 

In [14]:
print(market_selections.BestMarkets.head())  # Data frame of the locations ranked from best to worst
print(market_selections.PowerCurves.head()) # Data frame of the power for each effect size for each (location, duration) 
print(market_selections.parameters) # parameters used when selecting markets

   ID                                      location  duration  EffectSize  \
1   1       atlanta, chicago, las vegas, saint paul      15.0        0.05   
2   2                             chicago, portland      15.0        0.05   
3   3        chicago, cincinnati, houston, portland      15.0        0.05   
4   4             chicago, houston, nashville, reno      15.0        0.05   
5   5  atlanta, chicago, cleveland, las vegas, reno      10.0        0.05   

   Power  AvgScaledL2Imbalance  Investment      AvgATT  Average_MDE  \
1    1.0              0.555834   85305.000  189.256736     0.049914   
2    1.0              0.173878   32281.875  146.532081     0.051120   
3    1.0              0.197186   74118.375  159.362699     0.048299   
4    1.0              0.332134   75556.125  174.064736     0.051930   
5    1.0              0.453674   69300.000  195.078734     0.052928   

   ProportionTotal_Y  abs_lift_in_zero   Holdout  rank  correlation  
1           0.087672             0.000  

The attribute `BestMarkets` contains a data frame with the best markets. 
Each row is a unique (location, duration) in the data. The values shown for each (location, duration) are those for 
the minimum effect size that had power > 80%. Locations that lack power and are over budget are filtered out prior to returning values.

- `AvgATT`: Estimated incremental `Y` in the treated units. ATT is the *Average Treatment Effect on the Treated*.
- `EffectSize`: Relative effect size. I.e. 0.05 means that the treatment condition increases the `Y` of all treated units by 5%. Since 
   the average values of `Y` vary between units, this means that the treatment effect in levels varies between units. Only the row for the lowest (in abs value) `EffectSize` for a (location, duration) is shown in this table. See `PowerCurves` for the power of all effect sizes of that (location, duration).
- `duration`: Number of time periods in the experiment (usually days)
- `Power`: The proportion of simulations that were significant for that `EffectSize`. The number of simulations is equal to `lookback_window`, so if `lookback_window = 1`, power is either 1 or 0.
- `Average_MDE`: Estimated relative lift. This is misnamed. It is *NOT* the MDE, which is the treatment effect in the true DGP.
- `rank`: Average of the ranks of several metrics, including `abs(EffectSize)` and `abs_lift_in_zero` (the difference between estimated and actual effect size).
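- `AvgScaledL2Imbalance`: A goodness-of-fit metric used to choose between models. It compares the pre-treatment imbalance of the fitted synthetic control with that of equal weights: values near 0 indicate a close pre-treatment fit, and values near 1 indicate the model adds little beyond a simple average of the control units. The theory is that weights that differ more
   from equal weights mean the model is more informative.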
- `Investment`: If a cost per incremental conversion (CPIC) is provided, it is equal to `sum(Y * EffectSize * CPIC)` for the treated units.
- `abs_lift_in_zero`: The `mean(abs(EffectSize - detected_lift))`. This is a measure of the goodness of the estimator; closer to 0 is better. In practice, so few simulations are run that it may have little meaning.
- `ProportionTotal_Y`: Proportion of total `Y` in the treated group. Warning: GeoLift does weird things for negative effect sizes and switches the groups in those cases. I don't fully understand it.
- `Holdout`: Proportion of total `Y` in the control group.
- `correlation`: It is intended to measure the correlation between the treated group and the overall population of geo-units. The higher it is, the more representative the treated units should be of the overall population of units. This is needed because the method is only concerned with maximizing power and estimating treatment effects *for the treated units*; it does not assess how well it estimates the treatment effect for all units. I don't fully understand their calculations.
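
The arithmetic behind a few of these columns can be sketched with toy numbers. Everything below is hypothetical: the p-values, the `Y` values, and the detected lifts are made up, and treating a simulation as "significant" when its p-value is below `alpha` is an assumption about how GeoLift counts them, not something taken from its source.

```python
import numpy as np

alpha = 0.1
effect_size = 0.05
cpic = 7.50

# Power: share of simulations that were significant. With lookback_window = 1
# there is a single simulation, so the result can only be 0.0 or 1.0.
p_values = np.array([0.03])  # hypothetical p-value from one simulation
power = float(np.mean(p_values < alpha))

# Investment: sum(Y * EffectSize * CPIC) over the treated units.
# Rows are treated units, columns are test periods (hypothetical values).
treated_Y = np.array([
    [120.0, 130.0, 125.0],
    [80.0, 90.0, 85.0],
])
investment = float(np.sum(treated_Y * effect_size * cpic))

# abs_lift_in_zero: mean(abs(EffectSize - detected_lift)), here over a
# hypothetical run with three simulations.
detected_lift = np.array([0.049, 0.052, 0.051])
abs_lift_in_zero = float(np.mean(np.abs(effect_size - detected_lift)))

print(power, investment, abs_lift_in_zero)
```

With a single simulation the `Power` column carries one bit of information, which is why `abs_lift_in_zero` and `Power` should be read with caution when `lookback_window` is small.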

In [13]:
market_selections.BestMarkets.head()

|   | ID | location | duration | EffectSize | Power | AvgScaledL2Imbalance | Investment | AvgATT | Average_MDE | ProportionTotal_Y | abs_lift_in_zero | Holdout | rank | correlation |
|---|----|----------|----------|------------|-------|----------------------|------------|--------|-------------|-------------------|------------------|---------|------|-------------|
| 1 | 1 | atlanta, chicago, las vegas, saint paul | 15.0 | 0.05 | 1.0 | 0.555834 | 85305.0 | 189.256736 | 0.049914 | 0.087672 | 0.0 | 0.912328 | 1 | 0.977539 |
| 2 | 2 | chicago, portland | 15.0 | 0.05 | 1.0 | 0.173878 | 32281.875 | 146.532081 | 0.05112 | 0.033065 | 0.001 | 0.966935 | 2 | 0.93211 |
| 3 | 3 | chicago, cincinnati, houston, portland | 15.0 | 0.05 | 1.0 | 0.197186 | 74118.375 | 159.362699 | 0.048299 | 0.075764 | 0.002 | 0.924236 | 3 | 0.914481 |
| 4 | 4 | chicago, houston, nashville, reno | 15.0 | 0.05 | 1.0 | 0.332134 | 75556.125 | 174.064736 | 0.05193 | 0.078161 | 0.002 | 0.921839 | 3 | 0.914669 |
| 5 | 5 | atlanta, chicago, cleveland, las vegas, reno | 10.0 | 0.05 | 1.0 | 0.453674 | 69300.0 | 195.078734 | 0.052928 | 0.107147 | 0.003 | 0.892853 | 5 | 0.978876 |
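
The rule that each `BestMarkets` row reports the smallest effect size (in absolute value) with power above 80% can be sketched as follows. The power-curve table here is hypothetical, including its column names; the real `PowerCurves` data frame may be laid out differently.

```python
import pandas as pd

# Hypothetical power curves for two (location, duration) combinations.
power_curves = pd.DataFrame({
    "location": ["chicago, portland"] * 3 + ["chicago, houston"] * 3,
    "duration": [15] * 6,
    "EffectSize": [0.05, 0.10, 0.15] * 2,
    "power": [1.0, 1.0, 1.0, 0.0, 1.0, 1.0],
})

# Keep effect sizes with power above 80%, then pick the one with the
# smallest absolute value within each (location, duration).
powered = power_curves[power_curves["power"] > 0.8].copy()
powered["abs_effect"] = powered["EffectSize"].abs()
idx = powered.groupby(["location", "duration"])["abs_effect"].idxmin()
best = powered.loc[idx, ["location", "duration", "EffectSize", "power"]]

print(best)
```

In this toy table, `chicago, portland` is reported at an effect size of 0.05, while `chicago, houston` is reported at 0.10 because its 0.05 simulation was not significant.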


Displaying the `BestMarkets` table requires summarizing the MDE vs. duration or investment. Alternatively, providing various ways to filter and rank on additional criteria could help.
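
One way to filter and rank on additional criteria is plain pandas on the `BestMarkets` data frame. The sketch below rebuilds a few columns from the rows displayed above so that it runs on its own; the thresholds `max_investment` and `min_correlation` are hypothetical choices, not recommendations.

```python
import pandas as pd

# Subset of columns copied from the BestMarkets rows shown above.
best_markets = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "location": [
        "atlanta, chicago, las vegas, saint paul",
        "chicago, portland",
        "chicago, cincinnati, houston, portland",
        "chicago, houston, nashville, reno",
        "atlanta, chicago, cleveland, las vegas, reno",
    ],
    "duration": [15.0, 15.0, 15.0, 15.0, 10.0],
    "Investment": [85305.0, 32281.875, 74118.375, 75556.125, 69300.0],
    "correlation": [0.977539, 0.93211, 0.914481, 0.914669, 0.978876],
})

max_investment = 80_000   # hypothetical spend ceiling
min_correlation = 0.92    # hypothetical representativeness floor

# Filter on the extra criteria, then prefer shorter and cheaper tests.
mask = (
    (best_markets["Investment"] <= max_investment)
    & (best_markets["correlation"] >= min_correlation)
)
shortlist = (
    best_markets[mask]
    .sort_values(["duration", "Investment"])
    .reset_index(drop=True)
)

print(shortlist[["ID", "location", "duration", "Investment", "correlation"]])
```

With these thresholds the shortlist contains market sets 5 and 2, in that order. In the notebook, the same mask can be applied directly to `market_selections.BestMarkets`.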

We might like to have a plot like the one produced by ![plot.GeoLiftMarketSelection](https://github.com/facebookincubator/GeoLift/blob/main/vignettes/GeoLift_Walkthrough_files/figure-markdown_github/GeoLiftMarketSelection_Plots-1.png).  
However, that only displays information about one market at a time.  

The R code for `plot.GeoLiftMarketSelection()` is [here](https://github.com/facebookincubator/GeoLift/blob/502ff162d2a556f89a05d456993c1e181d8da144/R/plots.R#L686).  I haven't ported it to Python.
