### Do Giffen goods exist for poor households?
#### Replicating Published Results in a Jupyter Notebook
#### Kenneth Flamm
##### University of Texas at Austin
kflamm@mail.utexas.edu
##### version 0
#### February 2020

In this exercise, we will analyze actual data that reveal so-called Giffen behavior by poor households participating in a randomized field experiment in Hunan, China.

#### Introduction

In many microeconomics textbooks, the rarest of all commodities is the so-called **Giffen good**$^1$, a product whose demand declines when its price declines or, equivalently, whose demand increases when its price increases: in short, an upward-sloping demand curve. Economic theory (specifically, the theory of consumer behavior: budget-constrained optimization of a welfare index derived from consumer preferences) says such a demand curve is theoretically possible, if the good is a very inferior one (demand increases greatly, cet. par., as income decreases), budget share for the good is large, and substitutability with other goods is relatively weak.

[See *CORE Team, The Economy, Chapter 3*](https://www.core-econ.org/the-economy/book/text/03.html), for a discussion of income vs. substitution effects determining consumer response to a price change, in the context of a model of leisure time demand.
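
The three conditions for Giffen behavior listed above can be read off the Slutsky equation. As a sketch in standard textbook notation (the symbols are not from the original text: $x$ is ordinary demand, $h$ is compensated demand, $p$ is the good's own price, $m$ is income):

$$
\frac{\partial x}{\partial p} \;=\; \frac{\partial h}{\partial p} \;-\; x\,\frac{\partial x}{\partial m}
$$

Multiplying through by $p/x$ gives the elasticity form, with $s = px/m$ the budget share, $\eta$ the income elasticity, and $\varepsilon^{c} \le 0$ the compensated own-price elasticity:

$$
\varepsilon \;=\; \varepsilon^{c} \;-\; s\,\eta
$$

An upward-sloping demand curve ($\varepsilon > 0$) therefore requires $\eta < 0$ (an inferior good) and $s\,|\eta| > |\varepsilon^{c}|$: the good must be strongly inferior, take a large share of the budget, and have weak substitution possibilities.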

A famous historical example sometimes cited as a possible real-life Giffen good is linked to the Great Famine in 19th century Ireland$^2$, where it has been asserted that Irish demand for potatoes increased even as the price rose. The heuristic description was that poor Irish peasants, faced with a rise in the price of potatoes during the Great Irish Famine, fended off starvation by increasing calories derived from potato consumption, while cutting back on their consumption of their "fancy" luxury good, meat.

![Image](https://miro.medium.com/max/1016/1*NvuhsBl0yacGcnjUZ9yabg.jpeg)

This directly contradicts what micro textbooks like to call "The Law of Demand", the assertion that real-life demand curves are almost always downward-sloping. 

Fortunately for militant Law-of-Demanders, the Irish example has been widely discounted as an example of Giffen behavior by economic historians. The Irish potato blight that occurred at the time of the famine greatly reduced potato supply, driving prices up. It is therefore not clear that Irish potato consumption actually increased as prices rose. Because the potatoes were in fact being grown by the same poor Irish peasants consuming them, the net effect on Irish peasant incomes was negative, with income loss from the blight more than offsetting the increased market price of what potatoes could be sold. In addition to the income effect (negative) attributable to the price increase in potatoes consumed, there was an additional direct negative effect on household income and wealth from reduced potato production that reduced demand, and that has to be taken into account.

In short, the *ceteris paribus* demand curve assumption-- that all factors affecting demand other than a good's own price were being held constant-- which is needed to argue that observed changes in market prices and quantities trace out a demand curve, fails. (In technical economist jargon, this is referred to as ***the identification problem***$^3$.)



$^1$ refs to come

$^2$ https://www.tandfonline.com/doi/pdf/10.1080/00213624.1995.11505707 .

$^3$ refs to come

$^4$

$^5$ Though one economic historian writes: "Scholars have long debated whether there was enough food in Ireland to feed the population during the Great Irish Famine; there has been less detailed examination of high-frequency data to understand how markets distributed food after the harvests failed. This article explores a hitherto unused weekly price and quantity data set from the Cork city markets to analyse how markets may have hindered the distribution of available food from 1846 to 1849. Although, historically, economists have long suspected that raw data on the market for potatoes during the Irish Famine behaved like that for a classical ‘Giffen’ good, there is little evidence for this among foodstuffs available throughout the crisis in Cork. But bacon pigs – a food that never reached a stable equilibrium but completely disappeared from the market in 1847 – exhibited some characteristics which do not appear to accord with the classical law of demand. Further analysis of this data suggests that middle-class purchasing power outbid the poorest people in Ireland at a time when there was a surplus of superior foods and a deficiency of inferior foods. These circumstances indicate that unusual market behaviour may have made the crop failure’s redistributive consequences – as well as its mortality toll – much worse."

[Charles Read, "The Irish Famine and Unusual Market Behavior in Cork", 2017](https://journals.sagepub.com/doi/10.1177/0332489317705461)

#### The Search for Giffen Behaviour

Ever since, however, economists have searched for real-life examples which might actually illustrate Giffen behavior, contradicting the "Law of Demand". How would this work? "Although the price increase makes the staple less attractive in relative terms, the fact that it makes the consumer so much poorer (in real terms) forces him to consume more bread. Translating this to the language of consumer theory, the conditions under which Giffen behavior is likely to be observed, therefore, include that the good in question be strongly inferior and that expenditure on that good comprise a large portion of the consumer’s budget."$^4$ The potato might seem to meet that description for poor Irish subsistence farmers in 1849, but the available data seem inadequate for rigorous analysis.$^5$

In 1991, a group of researchers at Texas A&M (Kagel, John Henry; Battalio, Raymond Charles; Green, Leonard (1995). Economic choice theory: an experimental analysis of animal behavior. New York: Cambridge University Press, pp. 25-28.) noted that while it would be very difficult to run an experiment testing for Giffen behavior on very poor human beings for ethical reasons, rat behavior might provide a suitable analog. They designed an experiment that rewarded thirsty rats facing a fixed budget of lever pushes with either root beer (good tasting water) or quinine water (rats don't like quinine-flavored water, but it slakes their thirst). The "price" for each liquid was fixed in terms of required lever pushes. One group of "wealthy" thirsty rats was allocated a large budget of lever pushes, while another group of "poor" thirsty rats was allocated half the lever push budget of their wealthy comrades. What do you suppose happened when the price of quinine water (in lever pushes) was increased?

However, these researchers proved the existence of a Giffen good at the individual rat level and not at the market level, since rats don't interact and trade in markets.



#### In search of the elusive Giffen good...
A 2008 paper by Jensen and Miller experimentally demonstrated the existence of Giffen goods among people at the household level, through a field experiment that directly subsidized purchases of rice and wheat flour for extremely poor families in China.
[Link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2964162/pdf/nihms242289.pdf)

The intuition behind their experimental setup is shown in the following figure:
![Image](giffen_pic.png)

Let's start by seeing if we can reproduce their result using the replication data set available off the ***American Economic Review*** web site.

In [9]:
import numpy as np # import the numpy numerical package:Python data science computational engine
import matplotlib.pyplot as plt # import the graphics engine
import pandas as pd
import seaborn as sns
%matplotlib inline
print(plt.style.available)
plt.style.use('seaborn-white') # on matplotlib 3.6 and later this style is named 'seaborn-v0_8-white'

['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']


In [10]:
df_g=pd.read_stata('Giffen.dta')
df_g

Unnamed: 0,county,round,household,person_id,relationship,age,male,family_size,income_per_capita,expend_per_capita,...,pct_ch_amt_rice,ch_hh_rice_percap,pct_ch_sub_rice_arc,pct_ch_sub_wheat_arc,min_p,ch_log_hh_rice,ch_log_p_sub_rice,ch_log_hh_people,ch_log_hh_pay,ch_log_hh_nonwage
0,430626,1,1,1,Head,69.0,0,2.0,149.583328,342.666656,...,,,,,1.0,,,,,
1,430626,2,1,1,Head,69.0,0,2.0,39.166668,113.333336,...,80.000000,125.000000,-8.510638,0.000000,1.0,0.441833,-0.127833,0.0,0.0,-1.340027
2,430626,3,1,1,Head,69.0,0,2.0,12.083333,70.416664,...,52.631580,150.000000,8.163265,0.000000,1.0,0.356675,0.204794,0.0,0.0,-1.175999
3,430626,1,1,2,Spouse,73.0,1,2.0,149.583328,342.666656,...,,,,,1.0,,,,,
4,430626,2,1,2,Spouse,73.0,1,2.0,39.166668,113.333336,...,15.384615,125.000000,-8.510638,0.000000,1.0,0.441833,-0.127833,0.0,0.0,-1.340027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10423,622701,2,120,2,Parent,66.0,1,3.0,233.333328,204.000000,...,-200.000000,-116.666664,0.000000,-24.096386,1.0,-5.857933,0.117783,0.0,0.0,-0.016998
10424,622701,3,120,2,Parent,66.0,1,3.0,236.666672,148.000000,...,0.000000,100.000000,0.000000,22.099449,1.0,5.703783,-0.117783,0.0,0.0,0.014185
10425,622701,1,120,3,Parent,68.0,0,3.0,237.333328,132.366669,...,,,,,1.0,,,,,
10426,622701,2,120,3,Parent,68.0,0,3.0,233.333328,204.000000,...,-200.000000,-116.666664,0.000000,-24.096386,1.0,-5.857933,0.117783,0.0,0.0,-0.016998


In [11]:
df_g.dtypes

county                                int32
round                                  int8
household                             int16
person_id                              int8
relationship                       category
age                                 float64
male                                   int8
family_size                         float32
income_per_capita                   float32
expend_per_capita                   float32
province                             object
subsidy_group                       float32
hhid                                float64
hh_rice_percap                      float32
hh_all_wheat_and_noodles_percap     float32
hh_cereals_percap                   float32
hh_meat_percap                      float32
hh_pulses_percap                    float32
hh_veg_percap                       float32
hh_dairy_percap                     float32
hh_fats_percap                      float32
hh_cals_percap                      float32
hh_rice_calorie_share           

In [12]:
df_g['round'].value_counts()

1    3600
2    3441
3    3387
Name: round, dtype: int64

In [13]:
df_g['hhid'].value_counts()

431028001.0    24
433125083.0    23
431028079.0    22
433125036.0    21
622701050.0    21
               ..
431028038.0     1
431028034.0     1
620123022.0     1
433125107.0     1
622701012.0     1
Name: hhid, Length: 1296, dtype: int64

___

#### Create Hunan family subset

In [14]:
# Hunan household data subset
df_h=df_g.loc[(df_g.person_id==df_g.min_p)&(df_g.province=='Hunan'),:]

In [15]:
df_h.hhid.nunique()

646

In [16]:
df_h['round'].value_counts()

1    644
3    634
2    631
Name: round, dtype: int64

* 646 total households, but no more than 644 in any single round, and only 631 in round 2
* we call this an "unbalanced" panel data set

In [17]:
# let's create a variable for how many time period 'rounds' that a hh is in
# First, define a handy "groupby" I can use to aggregate and transform data by household group
gbhh=df_h.groupby('hhid')

In [18]:
df_h.loc[:,'nrounds']=gbhh['round'].transform('nunique') # this does the same thing as the Stata 'egen' command with by(hhid)
# except Stata didn't use to have a built-in `nunique` transformation  !!
df_h.nrounds.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


3    1881
2      18
1      10
Name: nrounds, dtype: int64

* subset of 627 households observed all 3 rounds = 1881/3
* 9 hh observed over 2 rounds =18 /2
* 10 hh observed only in a single round, 'singletons', have no impact on estimated coeffs in a model with fixed hh effects = 10/1
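
Why singletons drop out of a fixed-effects model can be seen from the within transformation, which subtracts each household's own mean from its observations. The sketch below uses a small made-up panel (the column names `rice` and `price` and all values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy panel (hypothetical values): household 1 has three rounds, household 2 is a singleton.
toy = pd.DataFrame({
    'hhid':  [1, 1, 1, 2],
    'round': [1, 2, 3, 1],
    'rice':  [200.0, 260.0, 230.0, 400.0],
    'price': [1.0, 0.8, 1.0, 1.0],
})

# Within (fixed effects) transformation: subtract each household's own mean.
within = toy[['rice', 'price']] - toy.groupby('hhid')[['rice', 'price']].transform('mean')
within['hhid'] = toy['hhid']
print(within)

# The singleton's demeaned row is identically zero, so it carries no information
# about the slope coefficients.
singleton_rows = within.loc[within['hhid'] == 2, ['rice', 'price']]
```

With only one observation, the household mean equals the observation itself, so every demeaned variable is zero.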

In [19]:
pd.crosstab(df_h['nrounds'],df_h['round'],margins=True)

round,1,2,3,All
nrounds,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,8,0,2,10
2,9,4,5,18
3,627,627,627,1881
All,644,631,634,1909


In [20]:
# in treatment and control
pd.crosstab(df_h['nrounds'],df_h['subsidy_group'])

subsidy_group,0.0,10.0,20.0,30.0
nrounds,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,3,0,3,4
2,2,6,4,6
3,471,477,471,462


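* control group size = about 1/3 of the combined size of the 3 treatment groups (each subsidy level has roughly as many observations as the control group)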

* possibly an experimental design choice to improve identification of the price elasticity parameter:
    * increased variation in subsidy level improves precision of coefficient estimate
    * single subsidy level would give you sign of effect, but no information about size of slope coefficient for consumption-price variation
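
The precision point can be illustrated with the textbook result that the variance of an OLS slope is proportional to $1/\sum_i (x_i-\bar{x})^2$. The sketch below compares two hypothetical designs with the same number of households; the group sizes and the choice of a 10-unit single subsidy are assumptions made for illustration only, not the experiment's actual design:

```python
from statistics import fmean

def sum_sq_dev(levels):
    """Sum of squared deviations of the regressor from its mean."""
    m = fmean(levels)
    return sum((x - m) ** 2 for x in levels)

# Hypothetical designs with 400 households each (group sizes are assumptions).
two_level = [0] * 200 + [10] * 200                               # control + one subsidy level
four_level = [0] * 100 + [10] * 100 + [20] * 100 + [30] * 100    # control + three levels

ssd_two = sum_sq_dev(two_level)
ssd_four = sum_sq_dev(four_level)

# Ratio of slope standard errors (two-level design relative to four-level design),
# holding the error variance fixed.
se_ratio = (ssd_four / ssd_two) ** 0.5
print(ssd_two, ssd_four, se_ratio)
```

The gain here comes from the wider spread of subsidy values; having more than two distinct levels additionally makes it possible to check whether the response is linear in the subsidy.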

In [21]:
pd.crosstab(df_h['round'],df_h['subsidy_group'])

subsidy_group,0.0,10.0,20.0,30.0
round,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,161,162,162,159
2,157,161,158,155
3,158,160,158,158


#### quick guide to using pivot tables and crosstab:
[link to pivot table guide](https://medium.com/@yangdustin5/quick-guide-to-pandas-pivot-table-crosstab-40798b33e367)

In [22]:
# number of observations by round and subsidy group, split by number of rounds the hh is observed
pd.pivot_table(df_h,index=['round','subsidy_group'],values='hhid', columns='nrounds',aggfunc='count',
              margins=True)

Unnamed: 0_level_0,nrounds,1,2,3,All
round,subsidy_group,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0.0,3.0,1.0,157.0,161
1,10.0,,3.0,159.0,162
1,20.0,3.0,2.0,157.0,162
1,30.0,2.0,3.0,154.0,159
2,0.0,,,157.0,157
2,10.0,,2.0,159.0,161
2,20.0,,1.0,157.0,158
2,30.0,,1.0,154.0,155
3,0.0,,1.0,157.0,158
3,10.0,,1.0,159.0,160


In [23]:
gbhh.hhid.count().value_counts()

3    627
1     10
2      9
Name: hhid, dtype: int64

#### we just learned one very important thing: not all households observed in all periods
* let's pull the variables we are going to use from the larger dataset, subset = `df_h`

In [24]:
keep_columns=['nrounds','county','round','person_id','province','hhid',
              'subsidy_group','family_size','income_per_capita','hh_rice_percap',
              'hh_staple_calorie_share_1']
column_list=['min_p','pct_ch_sub_rice_arc', 'ch_log_hh_rice', 'ch_log_p_sub_rice', 
             'ch_log_hh_people', 'ch_log_hh_pay','ch_log_hh_nonwage']
df_h=df_h.loc[:,df_h.columns.isin(keep_columns+column_list)]
df_h

Unnamed: 0,county,round,person_id,family_size,income_per_capita,province,subsidy_group,hhid,hh_rice_percap,hh_staple_calorie_share_1,pct_ch_sub_rice_arc,min_p,ch_log_hh_rice,ch_log_p_sub_rice,ch_log_hh_people,ch_log_hh_pay,ch_log_hh_nonwage,nrounds
0,430626,1,1,2.0,149.583328,Hunan,10.0,430626001.0,225.000000,0.740947,,1.0,,,,,,3
1,430626,2,1,2.0,39.166668,Hunan,10.0,430626001.0,350.000000,0.740947,-8.510638,1.0,0.441833,-0.127833,0.000000,0.000000,-1.340027,3
2,430626,3,1,2.0,12.083333,Hunan,10.0,430626001.0,500.000000,0.740947,8.163265,1.0,0.356675,0.204794,0.000000,0.000000,-1.175999,3
6,430626,1,1,1.0,80.000000,Hunan,10.0,430626002.0,250.000000,0.510488,,1.0,,,,,,3
7,430626,2,1,1.0,80.000000,Hunan,10.0,430626002.0,500.000000,0.510488,-8.510638,1.0,0.693147,-0.127833,0.000000,0.000000,0.000000,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5216,433130,2,1,3.0,135.555557,Hunan,10.0,433130099.0,466.666656,0.625744,-8.695652,1.0,-0.305382,-0.087011,0.000000,0.000000,0.135876,3
5219,433130,3,2,3.0,122.611115,Hunan,10.0,433130099.0,450.000000,0.625744,8.510638,2.0,-0.441833,0.127833,-0.405465,0.000000,-0.100364,3
5223,433130,1,1,4.0,1517.291626,Hunan,10.0,433130100.0,400.000000,0.860290,,1.0,,,,,,3
5224,433130,2,1,4.0,18.333332,Hunan,10.0,433130100.0,250.000000,0.860290,-8.695652,1.0,-0.470004,-0.087011,0.000000,-8.699514,0.058499,3


In [25]:
# quality check hh_rice_percap variable: change in log rice_percap = ch_log_hh_rice - ch_log_hh_people
# variables involved: hh_rice_percap, ch_log_hh_rice, ch_log_hh_people

In [26]:
rice_chk=df_h[['hhid','round','family_size','hh_rice_percap', 'ch_log_hh_rice', 'ch_log_hh_people']]
rice_chk.sort_values(['hhid','round'],inplace=True)
rice_chk

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,hhid,round,family_size,hh_rice_percap,ch_log_hh_rice,ch_log_hh_people
0,430626001.0,1,2.0,225.000000,,
1,430626001.0,2,2.0,350.000000,0.441833,0.000000
2,430626001.0,3,2.0,500.000000,0.356675,0.000000
6,430626002.0,1,1.0,250.000000,,
7,430626002.0,2,1.0,500.000000,0.693147,0.000000
...,...,...,...,...,...,...
5216,433130099.0,2,3.0,466.666656,-0.305382,0.000000
5219,433130099.0,3,3.0,450.000000,-0.441833,-0.405465
5223,433130100.0,1,4.0,400.000000,,
5224,433130100.0,2,4.0,250.000000,-0.470004,0.000000


In [29]:
# change in log family_size between consecutive rows, filled in for rounds 2 and 3 only
later_rounds=(rice_chk['round']==2)|(rice_chk['round']==3)
rice_chk.loc[later_rounds,'ch_L_hsize']=(np.log(rice_chk.family_size)
                    -np.log(rice_chk.family_size).shift(1))
# compare with the change in log household members supplied in the dataset
((rice_chk['ch_log_hh_people']-rice_chk['ch_L_hsize']).describe())

count    1256.000000
mean       -0.044133
std         0.194671
min        -1.386294
25%         0.000000
50%         0.000000
75%         0.000000
max         1.098612
dtype: float64

* number of people is not always equal to `family_size`!
* people, not `family_size`, appear to be used as the "capitas" in log(rice per cap)
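
One caveat on the check above: `shift(1)` runs down the whole sorted frame, so a household whose first row is round 2 or 3 can be differenced against the previous household's last row. A minimal sketch of a per-household alternative, on a made-up frame (all values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'hhid':        [1, 1, 1, 2, 2],
    'round':       [1, 2, 3, 2, 3],   # household 2 has no round 1
    'family_size': [2.0, 2.0, 4.0, 3.0, 3.0],
}).sort_values(['hhid', 'round'])

log_size = np.log(toy['family_size'])

# Plain shift: household 2's first row is differenced against household 1's last row.
toy['d_plain'] = log_size - log_size.shift(1)

# Grouped diff: each household's first row is NaN, and nothing leaks across households.
toy['d_grouped'] = log_size.groupby(toy['hhid']).diff()
print(toy)
```

Whether this matters for the summary statistics above depends on how many such rows exist; in this dataset `ch_log_hh_people` is missing for most of them, so the comparison is probably little affected.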

In [30]:
# wish to reproduce result in Jensen and Miller, Table 4, first three columns, standard first difference log-log model
# need to create county_time interaction variable
df_h.loc[:,'county_time']=(df_h['county']*10+df_h['round'])
df_h.county_time.describe()

count    1.909000e+03
mean     4.315765e+06
std      1.088952e+04
min      4.306261e+06
25%      4.307262e+06
50%      4.309223e+06
75%      4.331252e+06
max      4.331303e+06
Name: county_time, dtype: float64

### Subtle pro points
If you were to convert `county_time` to a categorical variable before dropping NaNs, you would not have to use C() syntax in OLS, **but**

`statsmodels` would create dummy variables for all categories, including the dropped ones, and that would produce apparent 0 coefficients and a singular covariance matrix warning when results are reported.

For this reason, using C() syntax is IMO the preferred way to create categorical dummies in a `statsmodels` regression model: one less thing to think about when interpreting dropped coefficients and covariance warnings.
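
The pandas side of this can be checked directly. The sketch below, on a made-up three-row frame (the `dy` column and its values are hypothetical), shows that a categorical column keeps levels whose rows have been dropped, and that `remove_unused_categories()` trims them. How `statsmodels` then treats leftover levels is as described above; the sketch does not test that part.

```python
import pandas as pd

toy = pd.DataFrame({
    'county_time': [4306261, 4306262, 4306263],
    'dy': [float('nan'), 0.44, 0.36],   # round 1 has no first difference
})
toy['county_time'] = toy['county_time'].astype('category')

used = toy.dropna()
# The dropped round-1 level is still listed as a category.
print(list(used['county_time'].cat.categories))

# remove_unused_categories() keeps only the levels that still occur.
trimmed = used['county_time'].cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```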

Also, since authors estimate only first difference models, the only `unbalanced` observations used in regression models are observed in periods 1 and 2 (1 only, 3 only, and 1 and 3 only, are dropped from estimation, and there are no households observed only in 2 and 3). Let's remember that for later.

* **Another thing worth knowing as we proceed:**  misaligned assignment can result in NaN's being generated without an error message. You should always check that your transformations are doing what you think they are doing to your data. Easy to do in Jupyter notebook.
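
A minimal sketch of how misaligned assignment produces NaNs silently, using made-up data (the frame and column names are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
# A Series built elsewhere, with a different index (labels 10, 11, 12).
other = pd.Series([10.0, 20.0, 30.0], index=[10, 11, 12])

# pandas aligns on index labels, not on position; no labels match, so every value is NaN.
left['y_misaligned'] = other

# Assigning the underlying array bypasses alignment and is positional.
left['y_positional'] = other.to_numpy()

n_missing = int(left['y_misaligned'].isna().sum())
print(left)
print(n_missing)
```

No error or warning is raised in either case, which is why a quick `isna().sum()` or `describe()` after each transformation is worth the keystrokes.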

### first, let's confirm we can replicate Jensen-Miller estimation results

In [31]:
# create replication double-log model dataset
hrep=df_h[['ch_log_hh_rice','ch_log_p_sub_rice','ch_log_hh_pay','ch_log_hh_nonwage',
           'ch_log_hh_people','hh_staple_calorie_share_1','hhid','county_time','round','nrounds']].dropna()
hrep

Unnamed: 0,ch_log_hh_rice,ch_log_p_sub_rice,ch_log_hh_pay,ch_log_hh_nonwage,ch_log_hh_people,hh_staple_calorie_share_1,hhid,county_time,round,nrounds
1,0.441833,-0.127833,0.000000,-1.340027,0.000000,0.740947,430626001.0,4306262,2,3
2,0.356675,0.204794,0.000000,-1.175999,0.000000,0.740947,430626001.0,4306263,3,3
7,0.693147,-0.127833,0.000000,0.000000,0.000000,0.510488,430626002.0,4306262,2,3
8,-0.223144,0.204794,0.000000,0.000000,0.000000,0.510488,430626002.0,4306263,3,3
10,0.344841,-0.040822,-7.824046,3.992066,0.000000,0.824068,430626003.0,4306262,2,3
...,...,...,...,...,...,...,...,...,...,...
5208,-0.167054,0.127833,0.000000,0.123276,0.000000,0.543143,433130098.0,4331303,3,3
5216,-0.305382,-0.087011,0.000000,0.135876,0.000000,0.625744,433130099.0,4331302,2,3
5219,-0.441833,0.127833,0.000000,-0.100364,-0.405465,0.625744,433130099.0,4331303,3,3
5224,-0.470004,-0.087011,-8.699514,0.058499,0.000000,0.860290,433130100.0,4331302,2,3


In [32]:
pd.crosstab(hrep.nrounds,hrep.county_time,dropna=False)

county_time,4306262,4306263,4307262,4307263,4309222,4309223,4310282,4310283,4331252,4331253,4331302,4331303
nrounds,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2,0,0,1,0,1,0,2,0,0,0,0,0
3,110,110,103,102,104,104,103,103,107,106,100,100


Jensen and Miller allow for "cluster" correlation of OLS disturbance terms.
![Image](https://miro.medium.com/max/2484/1*gFo31s49sqAbnjl08QzHCA.png)

In [62]:
import statsmodels.formula.api as smf
mod = smf.ols(
'''ch_log_hh_rice ~ ch_log_p_sub_rice + ch_log_hh_pay + ch_log_hh_nonwage +
   ch_log_hh_people + C(county_time) ''', data=hrep)
full_sampl=mod.fit(cov_type='cluster',cov_kwds={'groups':hrep['hhid']})
print(full_sampl.summary())

                            OLS Regression Results                            
Dep. Variable:         ch_log_hh_rice   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.096
Method:                 Least Squares   F-statistic:                     12.65
Date:                Wed, 26 Feb 2020   Prob (F-statistic):           7.72e-28
Time:                        12:42:43   Log-Likelihood:                -1415.5
No. Observations:                1256   AIC:                             2863.
Df Residuals:                    1240   BIC:                             2945.
Df Model:                          15                                         
Covariance Type:              cluster                                         
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept             

##### Note that with C() syntax for categorical variables, statsmodels automatically dropped first category 4306262 when intercept included in model

In [63]:
import statsmodels.formula.api as smf
# subset hrep with its own column so that data and cluster groups share one index
hrep_poor=hrep.loc[hrep['hh_staple_calorie_share_1']<=.8,:]
mod = smf.ols(
'''ch_log_hh_rice ~ ch_log_p_sub_rice + ch_log_hh_pay + ch_log_hh_nonwage +
   ch_log_hh_people + C(county_time) ''',
              data=hrep_poor)
poor=mod.fit(cov_type='cluster',cov_kwds={'groups':hrep_poor['hhid']})
print(poor.summary())

                            OLS Regression Results                            
Dep. Variable:         ch_log_hh_rice   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.092
Method:                 Least Squares   F-statistic:                     9.255
Date:                Wed, 26 Feb 2020   Prob (F-statistic):           3.07e-19
Time:                        12:42:46   Log-Likelihood:                -1188.2
No. Observations:                 997   AIC:                             2408.
Df Residuals:                     981   BIC:                             2487.
Df Model:                          15                                         
Covariance Type:              cluster                                         
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept             

* **Pro tip** Note also that you need to be careful to make sure that dataframe used for covariance calculation matches dataframe specified in model, otherwise will get misalignment-related error message!

In [66]:
import statsmodels.formula.api as smf
# subset hrep with its own column so that data and cluster groups share one index
hrep_poorest=hrep.loc[hrep['hh_staple_calorie_share_1']>.8,:]
mod = smf.ols(
'''ch_log_hh_rice ~ ch_log_p_sub_rice + ch_log_hh_pay + ch_log_hh_nonwage +
   ch_log_hh_people + C(county_time) ''',
              data=hrep_poorest)
poorest=mod.fit(cov_type='cluster',cov_kwds={'groups':hrep_poorest['hhid']})
print(poorest.summary())

                            OLS Regression Results                            
Dep. Variable:         ch_log_hh_rice   R-squared:                       0.312
Model:                            OLS   Adj. R-squared:                  0.269
Method:                 Least Squares   F-statistic:                     11.16
Date:                Wed, 26 Feb 2020   Prob (F-statistic):           5.79e-17
Time:                        12:44:42   Log-Likelihood:                -155.96
No. Observations:                 259   AIC:                             343.9
Df Residuals:                     243   BIC:                             400.8
Df Model:                          15                                         
Covariance Type:              cluster                                         
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept             

![Image](jm_tabl4.png)

#### Can we test alternative specifications?
* This dataset represents poor practice for replication datasets.
* Impossible to recover original sample data; deliberate obfuscation effort by authors.
* But, can recover enough information to use in a **fixed effects** model (to come).
    * just need changes from round-to-round, not levels....


In [36]:
len(hrep['hhid'].unique())

631

In [37]:
len(df_h['hhid'].unique())

646

##### 15 hh's not in J-M estimation sample

In [38]:
jm=hrep['hhid'].unique()
kf=df_h['hhid'].unique()
not_jm=df_h.loc[~df_h['hhid'].isin(jm),:]
not_jm[['hhid','round','nrounds']]

Unnamed: 0,hhid,round,nrounds
1068,430726023.0,1,2
1069,430726023.0,3,2
1072,430726024.0,1,2
1073,430726024.0,3,2
1561,430726099.0,1,2
1562,430726099.0,3,2
1565,430726100.0,1,2
1566,430726100.0,3,2
1570,430726102.0,1,2
1571,430726102.0,3,2


##### 10 singleton hh's and 5 doubleton (x2 rounds) hh's  = 15 hh in 20 observations
doubletons can potentially add info to improve FE estimates.
only the contiguous doubletons (rounds 1-2) are being used in the J-M estimates; the rounds 1-3 doubletons are being dropped

the rounds 1-3 doubletons contain no info on the direct effect of the subsidy (both unsubsidized periods) but can improve estimates of other time-varying control variables and thus indirectly improve subsidy coefficient precision.

In [39]:
not_kf=hrep.loc[~hrep.hhid.isin(kf),:]
not_kf

Unnamed: 0,ch_log_hh_rice,ch_log_p_sub_rice,ch_log_hh_pay,ch_log_hh_nonwage,ch_log_hh_people,hh_staple_calorie_share_1,hhid,county_time,round,nrounds


* just confirmed that all households in JM estimation sample are in larger `df_h` (kf) sample

##### Drop single obs unused hh's and non-contiguous doubletons from sample
        * create df_h2 that is same as J-M estimation sample

In [40]:
print(df_h.shape)
df_h2=df_h.loc[df_h['hhid'].isin(jm),:]
df_h2.shape

(1909, 19)


(1889, 19)

* dropped 20 observations (15 households, including 5 potentially FE-usable doubleton households), to replicate JM sample

In [41]:
# make sure df_h2 sorted by hhid and period
df_h2.sort_values(['hhid','round'],inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


* not sure why pesky warning message keeps popping up, don't need with `inplace=True` which means I'm already aware I am overwriting
    * could submit to development team if I wanted to figure out how
    * most likely this is considered a feature, not a bug
    * `Pandas` team design philosophy seems to be giving lots of warning messages to get you to carefully consider what is going on with your data
    * not a bad thing
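
A likely explanation for the warning (an interpretation, not something the output states): `df_h2` was created by slicing `df_h`, so pandas cannot tell whether an in-place change is meant to reach the parent frame. A common way to avoid it is to take an explicit copy and reassign instead of using `inplace=True`. The sketch below uses a made-up frame with hypothetical names; behavior also varies with the pandas version, since newer releases move to copy-on-write semantics.

```python
import pandas as pd

parent = pd.DataFrame({
    'hhid':  [2, 1, 2, 1],
    'round': [1, 2, 2, 1],
    'keep':  [True, True, False, True],
})

# .copy() makes the subset an independent frame, so later edits are unambiguous.
child = parent.loc[parent['keep'], :].copy()

# Reassigning the sorted result avoids inplace=True on a derived frame.
child = child.sort_values(['hhid', 'round'])
child['nrounds'] = child.groupby('hhid')['round'].transform('nunique')
print(child)
```

The parent frame is left untouched, which is usually what is intended when building an estimation subsample.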
    
    

In [42]:
df_h[df_h['nrounds']==2].groupby(['hhid'])['round'].agg(['min','max']) # look at doubletons in original dataframe

Unnamed: 0_level_0,min,max
hhid,Unnamed: 1_level_1,Unnamed: 2_level_1
430726014.0,1,2
430726023.0,1,3
430726024.0,1,3
430726099.0,1,3
430726100.0,1,3
430726102.0,1,3
430922005.0,1,2
431028032.0,1,2
431028093.0,1,2


##### Twice observed hhids: 4 hhids observed in 1 and 2 only, 5 observed in 1 and 3 only, total = 9 households observed in only 2 rounds

In [43]:
df_h[df_h['nrounds']==1].groupby(['hhid'])['round'].agg(['min','max']) #look at singletons

Unnamed: 0_level_0,min,max
hhid,Unnamed: 1_level_1,Unnamed: 2_level_1
430922008.0,1,1
430922016.0,1,1
430922074.0,3,3
430922078.0,1,1
431028034.0,1,1
431028038.0,1,1
431028104.0,3,3
433125048.0,1,1
433125052.0,1,1
433125107.0,1,1


##### 8 hhids observed only in 1, 2 observed only in 3


In [44]:
# `!=np.NaN` is always True; notna() is the actual test for non-missing values
df_h.loc[(df_h['round']==3) & df_h['ch_log_p_sub_rice'].notna(),'ch_log_p_sub_rice'].count()

625

In [45]:
# `!=np.NaN` is always True; notna() is the actual test for non-missing values
df_h.loc[(df_h['round']==2) & df_h['ch_log_p_sub_rice'].notna(),'ch_log_p_sub_rice'].count()

631

back to slimmed down `df_h2` dataframe

In [46]:
# let's drop all remaining obs with missing price change in round 3
print(df_h2.shape)
df_h2.drop(index=df_h.loc[(df_h['round']==3) & (df_h['ch_log_p_sub_rice']==np.NaN),:].index)
df_h2.shape

(1889, 19)


(1889, 19)

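nothing was dropped, but this cell could not have dropped anything: a comparison with `==np.NaN` is always False, and the result of `drop` is not assigned back to `df_h2`. The 625 non-missing round 3 price changes counted above, against 627 households observed in all three rounds, suggest that two round 3 observations in `df_h2` still have a missing price change; `.isna()` would find them.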
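
A minimal sketch of a missing-value filter that does drop rows, on a made-up frame (the column name `d_price` and all values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'round': [2, 3, 3],
    'd_price': [-0.13, np.nan, 0.20],
})

# Equality with NaN is always False, so this mask never selects a row.
mask_eq = (toy['round'] == 3) & (toy['d_price'] == np.nan)

# isna() is the test that actually finds missing values.
mask_na = (toy['round'] == 3) & toy['d_price'].isna()

# drop() returns a new frame; assign it back to keep the result.
cleaned = toy.drop(index=toy.index[mask_na])
print(mask_eq.sum(), mask_na.sum(), cleaned.shape)
```

Printing the shape before and after, as the notebook already does, is a quick way to confirm that a filter removed what it was meant to remove.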

In [47]:
print(df_h2.loc[(df_h2['nrounds']==2 ) & ((df_h2['round']==3)),'hhid'])
drop_list=df_h2.loc[(df_h2['nrounds']==2 ) & ((df_h2['round']==3)),'hhid'].values
drop_list

Series([], Name: hhid, dtype: float64)


array([], dtype=float64)

### Nothing to drop
#### could consider using doubleton households appearing but not used in Jensen-Miller estimation sample
#### households in which data reported for rounds 1 and 3 only (note: no observations reported for 2 and 3 only)