Inconsistent cohort definition? #3
Comments
Hi Andres, thank you for noticing this and for raising the issue!
I think the confusion comes from the "post" label attached to the disaggregated plot, not from how cohort is defined: the cohort is the date/time of the treatment for a focal entity.
The disaggregated plot
There is indeed an inconsistency in the disaggregated plot. The post-treatment period should be defined as in the aggregations: time >= cohort, that is, a relative period >= 0, where relative period 0 is the period when the treatment is administered. I will fix this in a new release; in the meantime, you can still use the same method and get a plot that uses time >= cohort. After running your example:
from differences import ATTgt, simulate_data
panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)
att = ATTgt(data=panel_data, cohort_name="cohort")
att.fit(formula="y", est_method="reg")
you can run:
import numpy as np
from differences.tools.utility import single_idx
# just to process the ATTgt output to a format easier to plot
_, res = single_idx(att.results())
res = (
    res
    .assign(
        post=lambda x: np.where(x["cohort"] <= x["time"], "post", "pre"),
        cohort=lambda x: x["cohort"].astype(str),  # a string plays nicer with the legend (which is interactive)
    )
)
# if you want the plot to look like the ones in this package
# you can still use the ATTgt method which has some preset parameters: just change the data for now
chart = att.plot(data=res, color_by="cohort", shape_by="post")
The simulated data
Regarding the To put it differently, if, in the fake data generated with The post period is defined as post relative to the treatment date (time >= cohort) and not post relative to when the outcome changes. In a fake dataset we can manually define the size of the treatment effect on the outcome and the time when the effect starts kicking in, but that is not known in a non-simulated dataset. So I think the confusion was created by having access (in this example) to the "effect" column.
In this example, when calling
from differences import simulate_data
panel_data = simulate_data(
    n_cohorts=5,
    intensity_by="cohort",
    tau=-1,  # <--- change here
    alpha=1,  # and here
    low=2,
    high=2
)
You can also change a number of other things in I have also not checked or tested the I am not sure if other packages in R or Stata have a function to simulate the data, but
Since I posted the package I have had very little time to work on it, but I hope to be able to make some improvements and additions soon! So thank you very much for taking the time to raise this issue, and please let me know if I misunderstood or have not fully answered your points.
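To make the distinction between "post relative to the treatment date" and "post relative to when the outcome changes" concrete, here is a minimal sketch in plain Python. It does not use the package: the years, the `effect_starts` variable, and the row layout are all hypothetical and only mirror the 1903/1904 example from this thread.

```python
# Toy timeline for a single treated unit; every value here is made up.
cohort = 1903          # treatment date (the cohort)
effect_starts = 1904   # when the outcome starts to move; only knowable in simulated data

rows = []
for time in range(1901, 1907):
    relative_period = time - cohort
    rows.append(
        {
            "time": time,
            "relative_period": relative_period,
            # post is defined relative to the treatment date: time >= cohort
            "post": "post" if relative_period >= 0 else "pre",
            # separate question: has the outcome actually changed yet?
            "outcome_changed": time >= effect_starts,
        }
    )

for row in rows:
    print(row)
```

Under these assumptions, 1903 is labelled "post" even though the outcome has not moved yet, which is the situation discussed above.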
Hi Bernardo. Thanks for the detailed response. I had identified the same line (attgt - utility.py - L1021) as the source of the problem. Regarding the Happy to help with improvements and new features.
Definitely agree that the defaults of simulate_data should be more intuitive. About helping, that would be wonderful. Did you have something in mind? Some things that I think would enhance the package significantly are:
Improvements in the documentation, especially for the functions and classes related to the attgt and did modules. Additionally, I think it would be ideal if the package explained what is happening when one calls ATTgt and its various methods, maybe in a notebook.
Tests also need a lot of improvement. Right now they only check that the results are consistent with the R did package; ideally the package would have much better code coverage.
When I have some time, I will try to lay out some improvements in the code and features that I think could be implemented, but of course, feel free to open issues about enhancements or anything else. I may not be able to respond quickly in the next couple of months, but I'll reply asap.
Hi there. I'm a little bit confused by how the cohort variable is defined and later used within the ATTgt class. Let me provide an example here to determine whether this is an actual bug or I'm just misunderstanding something. Let's start by using the simulate_data function to simulate some (panel) data.
The simulated data set the cohort variable as the last time step before the intervention (e.g. 1903 for unit e0 when the treatment effect kicks in in 1904). This cohort definition is non-standard, as the cohort is usually defined as the start of treatment. I then proceed to fit the model and plot the estimated ATTs.
The resulting plot looks correct, with marker shapes correctly encoding pre- and post-intervention periods (e.g. for cohort = 1903, time step 1903 is labeled as pre, while 1904 is labeled as post).
However, there seems to be an inconsistency with the above cohort definition when aggregating the ATTs. For instance, when I run the aggregate method for the "cohort" level, I get the following results.
The average ATTs presented here follow the standard cohort definition (i.e. start of the intervention), as they include the time == cohort data points. This clearly biases the estimated ATTs, as it includes a pre-intervention time period. This is easy to see by manually aggregating the ATTs. Clearly the second result, which does not include the time == cohort estimates, is the right one.
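The manual comparison described above can be sketched roughly as follows. This is plain Python with made-up ATT values, not output from the package, and it assumes a simple unweighted mean per cohort (the package's aggregate method may weight estimates differently). The `cohort_average` helper and the `att_gt` records are hypothetical names used only for illustration.

```python
from statistics import mean

# Hypothetical ATT(g, t) estimates for a single cohort; the numbers are invented.
att_gt = [
    {"cohort": 1903, "time": 1902, "att": 0.0},
    {"cohort": 1903, "time": 1903, "att": 0.0},  # time == cohort: no effect yet in this toy data
    {"cohort": 1903, "time": 1904, "att": 2.0},
    {"cohort": 1903, "time": 1905, "att": 2.0},
]


def cohort_average(estimates, cohort, include_cohort_period):
    """Unweighted mean ATT for one cohort over the periods counted as post-treatment."""
    selected = [
        e["att"]
        for e in estimates
        if e["cohort"] == cohort
        and (e["time"] >= cohort if include_cohort_period else e["time"] > cohort)
    ]
    return mean(selected)


avg_including = cohort_average(att_gt, 1903, include_cohort_period=True)   # time >= cohort
avg_excluding = cohort_average(att_gt, 1903, include_cohort_period=False)  # time > cohort
print(avg_including, avg_excluding)
```

With these toy numbers, including the time == cohort estimate pulls the cohort average below the value obtained from the later periods alone.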
Hence, it seems like both the simulate_data function and the (disaggregated) plot method generate/expect a non-standard cohort definition which is inconsistent with other functions/methods in the package. Thoughts? Sorry if I'm missing something.