> 1. DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.
> 2. SET THE "General Access" OF THE COPIED NOTEBOOK TO "Anyone with the link" BY CLICKING ON "Share" TO ENABLE SHARING WITH YOUR PEERS FOR REVIEW.
> 3. Download the data from the course and "mount it" (see cell below) so that you can work with it in this notebook


### This project is from the *Causal Inference for Data Science course on CoRise.* Learn more about the course [here](https://corise.com/course/causal-inference-for-data-science).



---





# Week 3 Project: Applying Difference-in-Differences
***

Welcome to the third project for Causal Inference for Data Science!

This project marks the end of our core curriculum. After this week, we'll take a brief tour of advanced methods by relating them to the ones you've already learned. We'll also work on a final "wrap-up" project that you can add to your data science portfolio.

But we're getting ahead of ourselves: We still need to complete Week 3 :)


## Scenario

Your previous analyses have gone viral within Tongass. Suddenly, everyone is talking about the "dynamic synergies" between in-store and online sales.

Pretty soon, you get a call from the CEO. "I'm blown away," she says.

"Uh, thanks," you say, trying to keep it cool.

"I'm *so* blown away that I've decided to launch new physical locations in the tri-state area. Customers in New Jersey, New York, and Connecticut will have way more options going forward!"

"Wow!" you reply. "That's amazing."

"Indeed, but I want to make sure we approach this launch thoughtfully before we expand to other geos. Could you analyze the impact of opening new stores in these states?"

"Of course!" you say. "I'll make sure we have a solid causal inference plan and get a rigorous read on how these openings affect Tongass's business."

"Causal inference what now?"

"Sorry," you reply. "The point is: We're going to show Amazon who's boss."

"Excellent!" she says.


## Project notes

As always, we start with the same notes:

### Data

We will work with a consistent data set throughout this course (we introduce the data set more fully below). Not all parts of the data set will be applicable in any given week. The goal is to demonstrate how a single set of granular data can be transformed to apply different causal inference techniques. We also hope to convey that manipulating data is in many ways the most important aspect of statistical modeling.

### Structure

We attempt to strike a balance between providing concrete steps to follow and making room for exploration. That said, we encourage you to explore: The best way to become a causal inference expert is to attack a single problem from multiple angles to see how different modeling choices affect an analysis. If this freedom is overwhelming, **don't panic**! You can simply fill out the code blocks marked "TODO" and ignore the optional ones. When we ask you to build models, we will provide the treatment effect you should expect so you can check your work.

In [13]:
# loading necessary packages
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()

import statsmodels.formula.api as smf

## I. Load the data
***

We will work with a consistent data set throughout this course. The data set is in the file called `tongass_transactions.csv`.

Note: The data set is at the **transaction level**, not the customer level. Any given customer can (and likely does) have multiple transactions. Some measures and fields are at the customer level, while others are at the transaction level. It will be up to you to manipulate this data set so that it can be used for analysis. As we'll discuss, this week's problem can be tackled at different levels of aggregation (e.g., individuals, states, treatment v. control groups). We'll need to be thoughtful about how to aggregate our data.

Below, we define the fields that are relevant for this week:
- `customer_id`: the unique identifier for a given customer
- `age`: the age of the customer
- `income`: the income of the customer 
- `state`: the customer's state of residence
- `distance`: the distance (in miles) from a customer's home to the nearest Tongass store
- `tx_order`: whether the transaction is the customer's first, second, third... etc.
- `amount`: the dollar value of the transaction
- `tx_date`: the date of the transaction
- `is_credit`: whether the transaction involved a credit card or a different payment method (1 if credit card, 0 if other)
- `in_store`: whether the transaction happened in a physical store (1 if yes, 0 if no and happened on tongass.com)

**NOTE**: If we don't mention a field above, then it won't be relevant for this week :)

In [12]:
# TODO: read in data (already filled out for you :)
df = pd.read_csv('./tongass_transactions.csv')
df.head()

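| Unnamed: 0 | customer_id | age | income | state | received_re | received_in_store_re | distance | index | tx_order | amount | in_store | tx_date | is_credit | is_bonus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 65 | 122753 | ND | 0 | 0 | 6.765402 | 0 | 0.0 | 61.964375 | 0.0 | 2020-12-31 | 0.0 | 0.0 |
| 1 | 0 | 65 | 122753 | ND | 0 | 0 | 6.765402 | 1 | 1.0 | 41.057234 | 0.0 | 2021-03-31 | 0.0 | 0.0 |
| 2 | 0 | 65 | 122753 | ND | 0 | 0 | 6.765402 | 2 | 2.0 | 71.752128 | 1.0 | 2021-06-30 | 1.0 | 0.0 |
| 3 | 0 | 65 | 122753 | ND | 0 | 0 | 6.765402 | 3 | 3.0 | 93.129942 | 1.0 | 2022-10-31 | 1.0 | 0.0 |
| 4 | 1 | 79 | 32977 | DC | 0 | 0 | 3.146723 | 0 | 0.0 | 61.334116 | 0.0 | 2020-01-31 | 0.0 | 0.0 |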


## II. Modify the data set to prepare it for difference-in-differences analysis
***

In previous weeks, we got familiar with the data set. (If we hadn't, we would recommend repeating that step from Week 1!)

Now, we want to approach our data through the lens of difference-in-differences specifically.

First, that means adding in "treatment" and "time period" variables. Let's do that below.
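If the mechanics are unfamiliar, here is a minimal sketch on a made-up toy table (not the Tongass data). The names `toy`, `toy_treated_states`, `toy_opening_date`, `treated`, and `post` are our own placeholders, and treating a transaction dated exactly on the opening date as "post" is an assumption you may want to revisit.

```python
import pandas as pd

# Toy transactions (made-up values, not the Tongass data).
toy = pd.DataFrame({
    "state": ["NJ", "NY", "ND", "DC"],
    "tx_date": ["2021-01-31", "2021-06-30", "2021-02-28", "2021-09-30"],
})

# Hypothetical stand-ins for the treatment definition.
toy_treated_states = ["NJ", "NY", "CT"]
toy_opening_date = pd.Timestamp("2021-03-31")

# Parse dates so the comparison is chronological rather than string-based.
toy["tx_date"] = pd.to_datetime(toy["tx_date"])

# Cast booleans to 0/1 integers, which are handier in regression formulas.
toy["treated"] = toy["state"].isin(toy_treated_states).astype(int)

# Assumption: a transaction dated on the opening date itself counts as "post".
toy["post"] = (toy["tx_date"] >= toy_opening_date).astype(int)

print(toy)
```

The same two lines of logic should carry over to the real data once the date column is parsed.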

In [10]:
treated_states = ['NJ', 'NY', 'CT'] # stores open in these states
store_opening_date = '2021-03-31' # stores open during this month

In [1]:
# TODO: use the "treated_states" list to create a column for whether
# an observation is in treatment group; use the "store_opening_date" variable
# to create a column for whether an observation happens after treatment begins. keep in mind
# that using 1s and 0s is more convenient than "True"s and "False"s when it comes to
# fitting linear regressions

## III. Aggregate the data set to the appropriate "level" and vet the parallel trends assumption; this also helps us see if our causal question is worth answering (spoiler: it will be)
***

As you know, difference-in-differences analysis hinges on a critical assumption known as "parallel trends" — that is, in the absence of the treatment, the treatment group would have seen the same trends as the control group.
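To make the assumption concrete, here is a small worked example with made-up group averages (the numbers and the `means` table are purely illustrative). It shows how the control group's change stands in for the treatment group's missing counterfactual, and how the DD estimate falls out of that.

```python
import pandas as pd

# Hypothetical average sales for each group and period (made-up numbers).
means = pd.DataFrame(
    {"pre": [100.0, 80.0], "post": [130.0, 95.0]},
    index=["treatment", "control"],
)

# Change over time within each group.
change = means["post"] - means["pre"]

# Under parallel trends, the control group's change stands in for what the
# treatment group would have experienced without the store openings.
counterfactual_post = means.loc["treatment", "pre"] + change["control"]

# The DD estimate is the treatment group's change minus the control group's.
dd_estimate = change["treatment"] - change["control"]

print(f"Counterfactual treated post-period mean: {counterfactual_post}")
print(f"Difference-in-differences estimate: {dd_estimate}")
```

Here the treatment group rose by 30 and the control group by 15, so the estimate is 15. If the two groups were already on different trajectories, that 15 would mix the treatment effect with the pre-existing gap in trends.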

Unfortunately, we can never be 100% confident in parallel trends; however, we can at least gain some comfort with it.

One way is by comparing the treatment and control groups visually to see whether they follow parallel trends prior to treatment. Try this now.

The [seaborn lineplot function](https://seaborn.pydata.org/generated/seaborn.lineplot.html) is particularly helpful here.

**Keep in mind that we could do this analysis at multiple "levels"**: in particular, we could visualize and model the effect of treatment on individual customers, states, or treatment/control groups overall. Similarly, we could aggregate the data to the month level or simply to the "pre/post" level. For this project, we'll aggregate the data to the "state" and "month" level, but there is no right answer here. Indeed, we'd encourage you to experiment with different levels of analysis to see how that changes your modeling process, but we'll leave that to the optional section below ;)

In [17]:
# TODO: aggregate data to state/month level and visualize treatment versus control for parallel trends

## IV. Fit difference-in-differences model and interpret the results
***

Woohoo! It's finally time to start modeling. Fit a DD regression using state and month as controls (similar to how we fit a DD regression using farm and month as controls in this week's material).
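For reference, here is one way such a regression could look on simulated data where we know the true effect is 10. Everything in it (the `panel` table, the `sales` and `treated_post` columns, the six fake states) is invented for illustration, so treat it as a sketch of the specification rather than the solution to this project.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated state-by-month panel (made-up data with a known effect of 10).
rng = np.random.default_rng(0)
states = ["A", "B", "C", "D", "E", "F"]
panel = pd.DataFrame(
    [(s, m) for s in states for m in range(1, 9)],
    columns=["state", "month"],
)
panel["treated"] = panel["state"].isin(["A", "B", "C"]).astype(int)
panel["post"] = (panel["month"] >= 5).astype(int)
panel["treated_post"] = panel["treated"] * panel["post"]

# Outcome = state level + common monthly trend + treatment effect + noise.
state_effect = panel["state"].map({s: 5.0 * i for i, s in enumerate(states)})
panel["sales"] = (
    100.0
    + state_effect
    + 2.0 * panel["month"]
    + 10.0 * panel["treated_post"]
    + rng.normal(scale=0.5, size=len(panel))
)

# State and month dummies absorb the "treated" and "post" main effects,
# so only the interaction term needs to appear explicitly.
dd_fit = smf.ols(
    "sales ~ treated_post + C(state) + C(month)", data=panel
).fit()

print(dd_fit.params["treated_post"])
print(dd_fit.conf_int().loc["treated_post"])
```

The coefficient on `treated_post` is the DD estimate, and its confidence interval is the row of `conf_int()` with the same label.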

In [2]:
# TODO: fit a difference-in-differences model and interpret the results

# CHECK: depending on how you aggregate your data and which additional controls you include, your
# treatment effect should be ~329 (with a 95% CI of 48 to 610)

In [3]:
# TODO: convert this to a markdown cell and write a quick explanation of your model's results

## V. Cluster your standard errors
***

Given how we chose to aggregate our data, we should cluster our standard errors to see if that changes the precision of our estimated treatment effect.

Indeed, regardless of whether we fit our model at the customer- or state-level, it's common to cluster standard errors at the level of treatment (in this case, "state" is the level of treatment). It's also reasonable to try two-way clustering (e.g., at the state- _and_ month-level), although we don't need to worry about that for now.

The key thing to remember about clustering is that it helps us account for subtle correlations between observations. Even if we account for state-level effects by including `state` in our model, our within-state error terms could still be correlated (e.g., if the economy in one or two states randomly booms, it might encourage shoppers in that state to spend more).

Unfortunately, there is no "one-size-fits-all" guidance about how to cluster your standard errors. (If you want a ton of detail, [this](https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf) is an excellent resource.) We personally like to try different justifiable clustering strategies and pick the most conservative (i.e., least precise).
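In `statsmodels`, one way to request clustered standard errors is to pass `cov_type="cluster"` and a `groups` array to `fit`. The sketch below does this on simulated data; the `panel` table and its columns are made up, and converting state labels to integer codes is just a cautious choice on our part.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated state-by-month panel (made-up data, true effect of 8).
rng = np.random.default_rng(1)
states = list("ABCDEFGH")
panel = pd.DataFrame(
    [(s, m) for s in states for m in range(1, 7)],
    columns=["state", "month"],
)
panel["treated_post"] = (
    panel["state"].isin(list("ABCD")) & (panel["month"] >= 4)
).astype(int)
panel["sales"] = (
    50.0 + 8.0 * panel["treated_post"] + rng.normal(size=len(panel))
)

model = smf.ols("sales ~ treated_post + C(state) + C(month)", data=panel)

# Default (non-robust) standard errors.
plain = model.fit()

# Same point estimates, but standard errors clustered by state.
state_codes = pd.factorize(panel["state"])[0]
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": state_codes})

print(plain.bse["treated_post"], clustered.bse["treated_post"])
```

Clustering changes only the standard errors, never the coefficients, so comparing the two `bse` values is a quick way to see how much the choice matters. With only a handful of clusters, as in this toy example, clustered standard errors can themselves be unreliable.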

In [17]:
# TODO: cluster your standard errors at the state-level
# CHECK: Your clustered standard errors should actually be lower than the
# unclustered versions — but remember: that can happen!

In [21]:
# TODO: pick the version that you feel most confident in

## VI. Tease out the causal mechanism by considering different outcome variables
***

In your analysis above, you likely used `amount` as your dependent variable. But there are other possible dependent variables we could consider, e.g., in-store sales or online sales.

Try repeating your model-fitting process with these dependent variables.

These complementary models help us determine whether opening new stores **only** boosts in-store sales, or whether overall sales increase but at the expense of online sales. Although overall sales is probably most important from a business standpoint, it's still important to understand _why_ a treatment works. We don't want people to think there are "dynamic synergies" between online and physical stores when, in reality, physical stores simply encourage customers to ignore the website.
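One possible way to build these outcomes is to split `amount` by the `in_store` flag before aggregating. The sketch below uses a made-up toy table, and the column names `in_store_amount` and `online_amount` are our own invention.

```python
import pandas as pd

# Toy transactions (made-up values); in_store is 1 for physical stores.
toy = pd.DataFrame({
    "state": ["NJ", "NJ", "NJ", "ND", "ND"],
    "month": ["2021-04", "2021-04", "2021-05", "2021-04", "2021-04"],
    "amount": [60.0, 40.0, 25.0, 30.0, 10.0],
    "in_store": [1, 0, 1, 0, 0],
})

# Route each transaction's dollars to exactly one channel.
toy["in_store_amount"] = toy["amount"] * toy["in_store"]
toy["online_amount"] = toy["amount"] * (1 - toy["in_store"])

# Aggregate all three outcomes to the state/month level in one pass.
outcomes = toy.groupby(["state", "month"], as_index=False)[
    ["amount", "in_store_amount", "online_amount"]
].sum()

print(outcomes)
```

Because every transaction lands in exactly one channel, the two channel totals add up to `amount` in each state-month, so the in-store and online effects should roughly account for the total effect when the models are otherwise identical.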

In [6]:
# TODO: fit a version of your final model with "in-store sales" as your outcome variable

# CHECK: depending on how you aggregate your data and which additional controls you include, your
# treatment effect should be ~1955

In [7]:
# TODO: fit a version of your final model with "online sales" as your outcome variable

# CHECK: depending on how you aggregate your data and which additional controls you include, your
# treatment effect should be ~1612

In [8]:
# TODO: convert this cell to a markdown cell and comment on what this means
# for the causal mechanism we're analyzing

## VII. Consolidate the analysis you performed above so it's useful for a stakeholder
***

### Congratulations!

You've done a ton of incredible work. Now, it's time to package it all together so the Tongass CEO can follow along.

We will annoyingly repeat the same advice from previous weeks:

This step often feels like doing an analysis "in reverse." We don't want to step someone through all the logic we just went through to arrive at our answer (as tempting as that might be). We want to share our answer **first,** then help our stakeholders understand it intuitively by sharing visuals and explaining how confident we can be.

Here is a set of suggested steps, but feel free to tweak as you see fit:
- Share the results from your final model, making sure to put the results in **business terms** (e.g., "opening stores in the tri-state area increased online sales by X and total sales by Y. Assuming stores cost less than Z to operate, or we can continue to increase revenues, we should do this in more states")
- Show key visuals to help someone grok the relationship intuitively
- Comment on our degree of confidence in the results, both in quantitative terms (e.g., confidence interval) and qualitative terms (e.g., "model seems robust/sensitive to controls, which means we can be confident/should consider this a preliminary hypothesis warranting deeper experimentation")

In [3]:
# TODO: change this cell to a markdown cell and write an "executive summary" that
# explains your results

In [4]:
# TODO: output a key visual (either from above or a new one) that you think communicates
# your results in a statistically responsible way (tip: the same visual you used for parallel
# trends might be helpful here)

In [5]:
# TODO: change this cell to a markdown cell and write a blurb on how confident you
# are in your results and why

## VIII. OPTIONAL: Consider additional analysis steps
***
1. We didn't ask for this explicitly, but it's possible to include additional controls in our difference-in-differences models to improve the precision of our estimates. For example, we might want to include additional "state-level" variables (e.g., average age of state, average distance of customers in state, etc.) to ensure we account for such differences between states. Consider fitting additional models with these kinds of controls.
2. We would highly recommend tackling this problem in multiple ways to see how it changes your analysis. For example, you could try aggregating to an even higher level (just treatment v. control groups, just pre v. post time periods) to see whether that changes your estimated treatment effect. Conversely, it's possible to go the other way and repeat this analysis at the individual _customer_ level. How would that analysis look? What causal question would that analysis be answering? (Note: This gets complicated, and we didn't want to get into all the nuances of panel data methods, which are a broader set of problems than difference-in-differences! That said, if you're interested, consider doing some reading about modeling individual panel data where observations are clustered into higher-level units (in this case, states) and tackling it again.) You have all the foundational tools you need to take this on!
3. The [`linearmodels`](https://bashtage.github.io/linearmodels/index.html) package in Python is made for working with panel data (e.g., it doesn't include every single state and month in your regression summary ;). Consider fitting models in this package so you have another tool in your DS arsenal.
4. An important robustness check in some DD models is allowing variable time trends. As mentioned in the written material for this week, differential time trends allow for the possibility that treatment and control groups were following different (non-parallel ;) trends prior to treatment. Consider fitting another model with variable time trends as a robustness check. (TBH, this is mostly so you have practice — we can see from the parallel trends visualization that we don't need to worry about differential time trends.)
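
For the time-trends check in item 4, one common specification adds a group-specific linear trend to the DD regression. The sketch below simulates data where treated states were already trending upward before treatment (true effect of 6), so you can see how the plain DD estimate gets inflated. The `panel` table, its columns, and the linear form of the trend are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel (made-up data) in which parallel trends does NOT hold.
rng = np.random.default_rng(2)
states = list("ABCDEF")
panel = pd.DataFrame(
    [(s, m) for s in states for m in range(1, 11)],
    columns=["state", "month"],
)
panel["treated"] = panel["state"].isin(list("ABC")).astype(int)
panel["treated_post"] = panel["treated"] * (panel["month"] >= 6).astype(int)

# Treated states gain an extra 1.5 per month even before treatment starts.
panel["sales"] = (
    20.0
    + 1.0 * panel["month"]
    + 1.5 * panel["treated"] * panel["month"]
    + 6.0 * panel["treated_post"]
    + rng.normal(scale=0.5, size=len(panel))
)

# Plain DD: attributes the pre-existing trend gap to the treatment.
baseline = smf.ols(
    "sales ~ treated_post + C(state) + C(month)", data=panel
).fit()

# DD with a separate linear time trend for the treatment group.
with_trend = smf.ols(
    "sales ~ treated_post + treated:month + C(state) + C(month)", data=panel
).fit()

print(baseline.params["treated_post"], with_trend.params["treated_post"])
```

If the two estimates are close on your real data, that is some reassurance that differential trends are not driving the result. The trend term usually costs precision, so expect a wider confidence interval.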

## IX. EXTREMELY OPTIONAL: Tackle another problem
***

When it comes to learning causal inference, there is no substitute for practice. We would strongly support finding data sets in the wild (e.g. [here](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0), [here](https://ourworldindata.org/), or [here](https://github.com/awesomedata/awesome-public-datasets)) and using the same general framework we leveraged here toward a causal question you're interested in.