--- 
Microeconometrics | Summer 2021 | M.Sc. Economics, Bonn University 

# Replication of Angrist, J., and Evans, W. (1998). "Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size". <a class="tocSkip">   

[Carolina Alvarez Garavito](https://github.com/carolinalvarez)
---

**Angrist, J.D., & Evans, W.N. (1998).** [Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size](https://www.jstor.org/stable/116844?seq=1). *The American Economic Review*, 88(3). 450-477. 

# Table of contents
* [Introduction](#Introduction)
* [Identification Strategy](#Identification)
* [Empirical Methodology](#Empirical-Methodology)
* [Replication Angrist & Evans (1998)](#Replication-of-Angrist-&-Evans-(1998))
 * [Data & Descriptive Statistics](#Data-&-Descriptive-Statistics)

In [1]:
%matplotlib inline
#!pip install linearmodels
#!pip install stargazer
import numpy as np
import pandas as pd
import pandas.io.formats.style
import seaborn as sns
import statsmodels as sm
import statsmodels.formula.api as smf
import statsmodels.api as sm_api
from linearmodels.iv import IV2SLS
import matplotlib.pyplot as plt
import copy
from IPython.display import HTML
from stargazer.stargazer import Stargazer
from statsmodels.api import add_constant
from functools import reduce

In [2]:
from auxiliary.auxiliary_data_preparation import ( 
    data_preparation_1980,
    data_preparation_1990,
    get_data_all_women_1980,
    get_data_all_women_1990,
    data_preparation_married_couples,
    data_preparation_married_couples_1990,
    rename_interactions_earnings,
    families_one_more_kid
)

from auxiliary.auxiliary_statistics import ( 
    table_sum_stats,
    table_sum_stats_husbands, 
    Table_3_panel_1,
    Table_3_panel_2,
    difference_means
)

from auxiliary.auxiliary_plots import ( 
    plot_distribution
)

from auxiliary.auxiliary_regressions import ( 
    OLS_Regressions_more2k,
    OLS_Labor_Supply_Models,
    OLS_Labor_Supply_Interactions_wifes,
    OLS_Labor_Supply_Interactions_husbands,
    OLS_Labor_Supply_First_Stage_wifes,
    OLS_Labor_Supply_First_Stage_husbands,
    IV_Labor_Supply_Interactions,
    mean_samples,
    IV_Comparison_Models,
    mean_differences_instruments,
    wald_estimates_regressions
)

---
# Introduction 
---

---
# Identification
--- 
![causal graph1](files/causal_graph_v1.png)


Angrist and Evans (1998) study the causal link between fertility and the labor supply of both men and women. The authors begin by explaining the theoretical and practical reasons for studying this relationship. First, economic models have been developed that link the family and the labor market. Second, the relationship between fertility and labor supply could help explain the increase in women's labor-market participation in the post-war period, since having fewer children could have raised the female share of the labor force. Meanwhile, other studies have linked fertility to women's withdrawal from the labor market and to lower wages relative to men.

The majority of empirical studies on childbearing and labor supply find a negative correlation between family size (i.e., fertility) and female labor-force participation. However, in his assessment of the economics of the family, Robert J. Willis argues that there have not been well-measured exogenous variables that make it possible to separate cause-and-effect relationships from correlations among variables such as delayed marriage, declining childbearing, rising divorce, and rising female labor-force participation.

In this vein, the authors argue that the difficulty of establishing a causal association between family size and labor supply arises from the theoretical argument that both are jointly determined. For example, labor-supply models often use child-status variables as regressors for hours of work, while economic demographers usually measure the effect of wages on fertility. According to the authors, "*since fertility variables cannot be both dependent and exogenous at the same time, it seems unlikely that either sort of regression has a causal interpretation*". 

Angrist and Evans (1998) contribute by using an **instrumental variable (IV) strategy** based on the sex mix of children in families with two or more kids. It exploits parental preferences for mixed-sex siblings: parents whose first two children are of the same sex are much more likely to have an additional child.

**Endogeneity Problem**

<center>Fertility 🠊 Labor supply</center>
<center>Labor supply 🠊 Fertility </center>

**Instrument** 

<center>Dummy variable for whether the sex of the second child matches the sex of the first child</center> 

---
# Empirical Methodology
## Causal estimation with a Binary IV

\begin{equation}
Y = \alpha + \delta D + \epsilon
\end{equation}

\begin{equation}
E[Y] = E[\alpha + \delta D + \epsilon]= \alpha + \delta E[D] + E[\epsilon]
\end{equation}

Conditioning on $Z$, we rewrite it as the difference between $Z=1$ and $Z=0$ and divide both sides by $E[D|Z=1] - E[D|Z=0]$, which yields:

\begin{equation}
\frac{E[Y|Z=1]-E[Y|Z=0]}{E[D|Z=1]-E[D|Z=0]} =\frac{\delta (E[D|Z=1]-E[D|Z=0]) + (E[\epsilon|Z=1]-E[\epsilon|Z=0])}{E[D|Z=1]-E[D|Z=0]}
\end{equation}

If the causal graph depicted above holds for the data, then $Z$ has no association with $\epsilon$ and therefore:

\begin{equation}
\frac{E[Y|Z=1]-E[Y|Z=0]}{E[D|Z=1]-E[D|Z=0]} =\delta
\end{equation}

Under these conditions, the ratio of the population-level association between $Y$ and $Z$ to that between $D$ and $Z$ equals the causal effect of $D$ on $Y$. Hence, if $Z$ is associated with $D$ but not with $\epsilon$, replacing the population expectations with sample means $E_N[\cdot]$ gives an estimator that is consistent for $\delta$:

\begin{equation}
\hat{\delta}_{IV,WALD} = \frac{E_N[y_i|z_i=1] - E_N[y_i|z_i=0]}{E_N[d_i|z_i=1] - E_N[d_i|z_i=0]}
\end{equation}

This is the IV estimator, known as the Wald estimator when the instrument is binary. Its numerator is the difference in the average observed outcome between those who were exposed to the instrument ($z_i=1$) and those who were not ($z_i=0$). Its denominator is the difference in the average treatment take-up between the same two groups.
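The ratio can be computed directly from group means. The following is a minimal sketch on simulated data, not the notebook's own `wald_estimates_regressions` helper; the function name `wald_estimate`, the data-generating process, and the value of `true_delta` are assumptions chosen only for illustration.

```python
import numpy as np

def wald_estimate(y, d, z):
    """Difference in mean outcomes over difference in mean treatment, by instrument status."""
    y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
    exposed = z == 1
    reduced_form = y[exposed].mean() - y[~exposed].mean()
    first_stage = d[exposed].mean() - d[~exposed].mean()
    return reduced_form / first_stage

# Simulated data (illustrative only): the confounder u drives both d and y.
rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, size=n)      # binary instrument, independent of u
u = rng.normal(size=n)              # unobserved confounder
d = (0.5 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(float)
true_delta = -2.0                   # assumed constant treatment effect
y = 1.0 + true_delta * d + u + rng.normal(size=n)

# The naive comparison of treated and untreated units is contaminated by u.
naive = y[d == 1].mean() - y[d == 0].mean()
print("naive difference in means:", round(naive, 3))
print("Wald estimate:", round(wald_estimate(y, d, z), 3))
```

Because the simulated effect is the same for everyone, the Wald ratio should land close to `true_delta`, while the naive difference in means does not.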

## IV Estimation as LATE Estimation

Imbens and Angrist (1994) developed a framework for classifying individuals as: i) those who respond positively to an instrument; ii) those who remain unaffected by the instrument; iii) those who rebel against the instrument. When $D$ and $Z$ are binary variables, there are four possible groups of individuals:

| Status                                    |Potential treatment assignment         | 
| ------------------------------------------|:-------------------------------------:| 
| Compliers ($\tilde{C}=c$)                 | $D^{Z=0}=0; D^{Z=1}=1$                | 
| Defiers ($\tilde{C}=d$)                   | $D^{Z=0}=1; D^{Z=1}=0$                | 
| Always takers ($\tilde{C}=a$)             | $D^{Z=0}=1; D^{Z=1}=1$                | 
| Never takers ($\tilde{C}=n$)              | $D^{Z=0}=0; D^{Z=1}=0$                | 


A valid instrument $Z$ for the causal effect of $D$ on $Y$ must satisfy three assumptions in order to identify the **LATE**. Let $k = D^{Z=1} - D^{Z=0}$ denote the individual-level effect of the instrument on the treatment:

* Independence assumption: $(Y^{1}, Y^{0}, D^{Z=1}, D^{Z=0}) \perp\!\!\!\perp Z$. This is analogous to the assumption that $cov(Z, \varepsilon)=0$ in the traditional IV literature.
* Non-zero effect of instrument assumption: $k \neq 0$ for at least some $i$
* Monotonicity assumption: either $k \geq 0$ for all $i$ or $k \leq 0$ for all $i$ 
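To see what these assumptions buy, the sketch below simulates a population with compliers, always takers and never takers (no defiers, so monotonicity holds by construction) and compares the Wald ratio with the average effect among compliers. The group shares and effect sizes are hypothetical values picked for illustration, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000

# Hypothetical shares: 30% compliers, 20% always takers, 50% never takers.
kind = rng.choice(["c", "a", "n"], size=n, p=[0.3, 0.2, 0.5])
d_z0 = (kind == "a").astype(float)                     # potential treatment D^{Z=0}
d_z1 = ((kind == "a") | (kind == "c")).astype(float)   # potential treatment D^{Z=1}

# Heterogeneous treatment effects by group (assumed values).
effect = np.where(kind == "c", -1.0, np.where(kind == "a", -3.0, 0.5))
y0 = rng.normal(size=n)
y1 = y0 + effect

z = rng.integers(0, 2, size=n)      # instrument assigned independently of everything else
d = np.where(z == 1, d_z1, d_z0)    # observed treatment
y = np.where(d == 1, y1, y0)        # observed outcome

first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = (y[z == 1].mean() - y[z == 0].mean()) / first_stage
late = effect[kind == "c"].mean()   # average effect among compliers
ate = effect.mean()                 # average effect in the whole population

print("share of compliers (first stage):", round(first_stage, 3))
print("Wald:", round(wald, 3), "LATE:", round(late, 3), "ATE:", round(ate, 3))
```

In this setup the first stage recovers the complier share, and the Wald ratio tracks the complier effect rather than the population average effect.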


---

---
# Replication of Angrist & Evans (1998)
---

## A. Data & Descriptive Statistics

Angrist and Evans (1998) use two extracts from the Census Public Use Micro Samples (PUMS), for the years 1980 and 1990. The Census contains information on labor supply, the sex of the mother's first two children, and an indicator of multiple births.

However, the PUMS data sets contain no retrospective fertility information other than the total number of children ever born; that is, the census does not track children across households. The authors therefore matched children to mothers within households according to the following strategy: they attached people in a household labeled as *child* to a female householder or to the spouse of a male householder, and deleted any mother for whom the number of children in the household did not match the number of children ever born. In households with multiple families, relationship codes and subfamily identifiers were also used to pair children with mothers.
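The deletion rule in this matching strategy can be sketched with pandas on toy person-level records. All column names (`household_id`, `relationship`, `children_ever_born`) are hypothetical stand-ins for the PUMS fields, and the sketch ignores the subfamily identifiers used for multi-family households.

```python
import pandas as pd

# Toy person-level records (hypothetical columns and values).
persons = pd.DataFrame({
    "household_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "relationship": ["mother", "child", "child", "mother", "child",
                     "mother", "child", "child"],
    "children_ever_born": [2, None, None, 3, None, 2, None, None],
})

# Count the people labeled as children in each household.
kids_in_household = (
    persons[persons["relationship"] == "child"]
    .groupby("household_id")
    .size()
    .rename("kids_in_household")
)

# Attach that count to each mother.
mothers = persons[persons["relationship"] == "mother"].merge(
    kids_in_household, left_on="household_id", right_index=True, how="left"
)

# Keep a mother only if the children observed at home equal her children ever born.
matched = mothers[mothers["kids_in_household"] == mothers["children_ever_born"]]
print(matched[["household_id", "children_ever_born", "kids_in_household"]])
```

Household 2 is dropped in this toy example because the mother reports three children ever born while only one child is observed in the household.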

The sample is then limited to mothers aged 21-35 whose oldest child was less than 18 years old at the time of the Census. There are two main reasons for this restriction. First, few women younger than 21 have two children, so including younger women would contribute few observations for the variable of interest, *more than two children*. Second, a child over 18 is very likely to have moved to a different household, and it is very unlikely that a woman aged 35 at the time of the census has a child aged 18 or older. Restricting the sample to women aged 35 or younger therefore makes it likely that the first two children still live in the household and are still financially dependent on their parents.

For the empirical analysis, the authors use two samples for each census year. The first includes all women with two or more children (after restricting the sample to mothers aged 21-35). The second includes only married women, in order to test the main theories of household production (e.g., Gronau, 1973) and to explore the impact of children on fathers' labor supply as well.

The following table summarizes the samples created by the authors and used for the empirical analysis:


| Year        | Sample         | Description                                                             |
| :----       | :----          | :----                                                                   |
| 1980        | Full sample    | Women with two or more children, age 21-35 years old                    |
|             | Married sample | Couples married at time of census, only once and at time of first birth |
| 1990        | Full sample    | Women with two or more children, age 21-35 years old                    |
|             | Married sample | Women married at time of census                                         |

Variables with information on the timing of first marriage and the number of marriages are not available in the 1990 PUMS; therefore, the 1990 married sample is built using only the variable indicating whether the woman was married at the time of the census.

In [3]:
census_1_1980 = pd.read_stata("data/m_d_806_1.dta")
census_2_1980 = pd.read_stata("data/m_d_806_2.dta")
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two extracts the same way
data_1980=pd.concat([census_1_1980, census_2_1980], ignore_index=False, verify_integrity=False, sort=False)
data_1980=data_preparation_1980(data_1980)

In [None]:
census_1_1990 = pd.read_stata("data/m_d_903_1.dta")
census_2_1990 = pd.read_stata("data/m_d_903_2.dta")
census_3_1990 = pd.read_stata("data/m_d_903_3.dta")
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three extracts the same way
data_1990=pd.concat([census_1_1990, census_2_1990, census_3_1990], ignore_index=False, verify_integrity=False, sort=False)
data_1990=data_preparation_1990(data_1990)

---
<span style="color:coral">**NOTE**:</span> The original data provided by the authors can be found [here](https://economics.mit.edu/faculty/angrist/data1/data/angev98). For this replication the data is split into several .dta-files due to GitHub size constraints.

---

In [None]:
data_1980.describe()

In [None]:
data_1990.describe()

In [None]:
data_1980.head()

In [None]:
data_1990.head()

In [4]:
data_all_women_1980=get_data_all_women_1980(data_1980)
print("The sample of all women for 1980 aged between 21 and 35 with second kid at least 1 year old has", len(data_all_women_1980), "observations.")

The sample of all women for 1980 aged between 21 and 35 with second kid at least 1 year old has 394840 observations.


In [None]:
data_all_women_1990=get_data_all_women_1990(data_1990)
print("The sample of all women for 1990 aged between 21 and 35 with second kid at least 1 year old has", len(data_all_women_1990), "observations.")

In [None]:
data_all_women_1980.head()

In [5]:
data_all_women_1980=data_preparation_married_couples(data_all_women_1980)

In [6]:
#creating the sample for married couples 1980

# notna() keeps only rows where the husband's age is observed
msample_1980=data_all_women_1980[(data_all_women_1980['TIMESMAR']==1) & (data_all_women_1980['MARITAL']==0) & (data_all_women_1980['illegit']==0) & (data_all_women_1980['agefstd']>=15) &
            (data_all_women_1980['agefstm']>=15) & (data_all_women_1980["AGED"].notna())]

print("The sample of married couples has", len(msample_1980), "observations.")

The sample of married couples has 254652 observations.
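Filters on missing values deserve some care: NaN never compares equal to itself, so an element-wise comparison against `np.nan` is true for every row and drops nothing, whereas `notna()` flags only observed values. A toy illustration, using a three-row frame with an `AGED` column made up for this purpose:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"AGED": [34.0, np.nan, 29.0]})

always_true = toy["AGED"] != np.nan   # NaN != NaN is True, so every row passes
observed = toy["AGED"].notna()        # True only where a value is present

print(always_true.tolist())   # [True, True, True]
print(observed.tolist())      # [True, False, True]
```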


In [None]:
data_all_women_1990=data_preparation_married_couples_1990(data_all_women_1990)

In [None]:
#creating the sample for married couples 1990

# notna() keeps only rows where the husband's age is observed
msample_1990=data_all_women_1990[(data_all_women_1990['MARITAL']==0) & (data_all_women_1990['agefstd']>=15) &
            (data_all_women_1990['agefstm']>=15) & (data_all_women_1990["AGED"].notna())]

print("The sample of married couples for 1990 has", len(msample_1990), "observations.")

In [None]:
#Creating sample of only middle income husbands 1980
sample_middle_third=msample_1980[msample_1980["husband_distribution"]=="middle_third"].copy()

print("The sample of married couples whose husband belongs to the middle third of the income distribution has", len(sample_middle_third), "observations.")

In [None]:
#Creating sample of only middle income husbands 1990
sample_middle_third_1990=msample_1990[msample_1990["husband_distribution"]=="middle_third"].copy()

print("The sample of married couples whose husband belongs to the middle third of the income distribution for the 1990 Census has", len(sample_middle_third_1990), "observations.")

In [None]:
sample_middle_third=rename_interactions_earnings(sample_middle_third)

In [None]:
sample_middle_third_1990=rename_interactions_earnings(sample_middle_third_1990)

In [7]:
#Samples of moms by education 1980
sample01=msample_1980[msample_1980["lessgrad"]==1]
sample02=msample_1980[msample_1980["hsgrad"]==1]
sample03=msample_1980[msample_1980["moregrad"]==1]

In [None]:
#Samples of moms by education 1990
sample04=msample_1990[msample_1990["lessgrad"]==1]
sample05=msample_1990[msample_1990["hsgrad"]==1]
sample06=msample_1990[msample_1990["moregrad"]==1]

In [None]:
# Creating the married sample out of the total sample of moms 1980

# Subtract one from non-negative quarter-of-marriage codes
data_1980["qtrmar"] = np.where((data_1980["QTRMAR"] >= 0), data_1980["QTRMAR"] - 1, data_1980["QTRMAR"])

# Year of marriage: add one year when the marriage quarter precedes the mother's birth quarter
data_1980["yom"] = np.where((data_1980["QTRBTHM"] <= data_1980["qtrmar"]), data_1980["YOBM"] + data_1980["AGEMAR"], data_1980["YOBM"] + data_1980["AGEMAR"]+1)

# Dates of marriage and of first birth, in years with quarterly precision
data_1980["dom_q"]=(data_1980.yom + (data_1980.qtrmar)/4)
data_1980["do1b_q"]=(data_1980.YOBK + (data_1980.QTRBKID)/4)

# illegit = 1 if the marriage took place after the first birth, 0 otherwise
data_1980["illegit"]= np.nan
data_1980.loc[data_1980["dom_q"] - data_1980["do1b_q"] > 0, "illegit"] = 1
data_1980.loc[data_1980["dom_q"] - data_1980["do1b_q"] <= 0, "illegit"] = 0

#creating the sample for married couples out of total sample of moms 1980
# notna() keeps only rows where the husband's age is observed

msample_total_1980=data_1980[((data_1980['AGEM']>=21) & (data_1980['AGEM']<=35)) & (data_1980['TIMESMAR']==1) & (data_1980['MARITAL']==0) & (data_1980['illegit']==0) & (data_1980['agefstd']>=15) & 
            (data_1980['agefstm']>=15) & (data_1980["AGED"].notna())]

In [None]:
#creating the sample for married couples  out of total sample of moms 1990

# notna() keeps only rows where the husband's age is observed
msample_total_1990=data_1990[(data_1990['MARITAL']==0) & (data_1990['agefstd']>=15) &
            (data_1990['agefstm']>=15) & (data_1990["AGED"].notna())]

print("The sample of married couples for 1990 has", len(msample_total_1990), "observations.")

In [None]:
data_all_women_1980_one=families_one_more_kid(data_1980)
data_all_women_1980_one=data_preparation_married_couples(data_all_women_1980_one)

In [None]:
data_all_women_1990_one=families_one_more_kid(data_1990)

In [None]:
# notna() keeps only rows where the husband's age is observed
msample_1980_one=data_all_women_1980_one[(data_all_women_1980_one['TIMESMAR']==1) & (data_all_women_1980_one['MARITAL']==0) & (data_all_women_1980_one['illegit']==0) & (data_all_women_1980_one['agefstd']>=15) &
            (data_all_women_1980_one['agefstm']>=15) & (data_all_women_1980_one["AGED"].notna())]

In [None]:
# notna() keeps only rows where the husband's age is observed
msample_1990_one=data_all_women_1990_one[(data_all_women_1990_one['MARITAL']==0) & (data_all_women_1990_one['agefstd']>=15)
                                         & (data_all_women_1990_one['agefstm']>=15) & (data_all_women_1990_one["AGED"].notna())]

In [None]:
plot_distribution(msample_1980, "total_incomed")
msample_1980["total_incomed"].mean()

In [None]:
msample_1980['husband_distribution'].value_counts().plot(kind='bar')

In [None]:
# Kernel density of husbands' total income (fill replaces the deprecated shade argument)
fig, ax = plt.subplots(figsize=(8, 8))
sns.kdeplot(x=msample_1980['total_incomed'], fill=True, ax=ax)
ax.set_xlim(0, 400000)
ax.set_xlabel('Total Income Dad')
ax.set_ylabel('Density')

**Table 2| Part 1: Descriptive Statistics, Women aged 21-35 with 2 or more children - 1980 PUMS**

In [None]:
#Table 2 for 1980
table1_1=table_sum_stats(data_all_women_1980)
table1_2=table_sum_stats(msample_1980)
table1_3=table_sum_stats_husbands(msample_1980)
data_frames = [table1_1, table1_2, table1_3]
Table2_1980 = reduce(lambda  left,right: pd.merge(left,right,on=['Variable'],
                                            how='left'), data_frames)

Table2_1980.rename(columns = {'Mean_x':'All women (mean)', 
                       'Std. Dev._x':'All women (std.dev)',
                       'Mean_y':'Married women (mean)',
                      'Std. Dev._y':'Married women (std.dev)',
                      'Mean':'Husbands (mean)',
                      'Std. Dev.':'Husbands (std.dev)'}, 
            inplace = True)

Table2_1980=Table2_1980[["Variable", "All women (mean)", "All women (std.dev)", "Married women (mean)", "Married women (std.dev)", 'Husbands (mean)', "Husbands (std.dev)"]]
Table2_1980 = Table2_1980.replace(np.nan, '-', regex=True)

print("The sample of all women for 1980 has", len(data_all_women_1980), "observations, while the sample for married couples has" , len(msample_1980), "observations")
Table2_1980

**Table 2| Part 2: Descriptive Statistics, Women aged 21-35 with 2 or more children - 1990 PUMS**

In [None]:
#Table 2 for 1990
table1_1=table_sum_stats(data_all_women_1990)
table1_2=table_sum_stats(msample_1990)
table1_3=table_sum_stats_husbands(msample_1990)
data_frames = [table1_1, table1_2, table1_3]
Table2_1990 = reduce(lambda  left,right: pd.merge(left,right,on=['Variable'],
                                            how='left'), data_frames)

Table2_1990.rename(columns = {'Mean_x':'All women (mean)', 
                       'Std. Dev._x':'All women (std.dev)',
                       'Mean_y':'Married women (mean)',
                      'Std. Dev._y':'Married women (std.dev)',
                      'Mean':'Husbands (mean)',
                      'Std. Dev.':'Husbands (std.dev)'}, 
            inplace = True)

Table2_1990=Table2_1990[["Variable", "All women (mean)", "All women (std.dev)", "Married women (mean)", "Married women (std.dev)", 'Husbands (mean)', "Husbands (std.dev)"]]
Table2_1990 = Table2_1990.replace(np.nan, '-', regex=True)
print("The sample of all women for 1990 has", len(data_all_women_1990), "observations, while the sample for married couples has" , len(msample_1990), "observations")
Table2_1990

Table 2 part 1 provides information on statistics and variable definition for covariates, instruments and dependent variables later used in the empirical analysis for the 1980 census data, while Table 2 part 2 provides the same information for 1990 census data. 

The covariate of main interest is *more than two children*, and its first instrument is *same sex*, an indicator for whether the first two children are of the same sex. The table also shows the two components of *same sex*, which are *two boys* and *two girls*. As stated in Angrist and Evans (1998), 40.2 percent of all women who already had a second child had a third, while the corresponding fraction for the 1990 sample is 37 percent. In both samples, around 50% of all families with two children have children of the same sex, and just over 51% of first births are boys.
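As a minimal sketch of how such indicators can be derived from the sex of the first two children (the notebook's own construction lives in the auxiliary data-preparation module and is not shown here), assume `boy1st` and `boy2nd` are 0/1 dummies. The column names `two_boys` and `two_girls` are hypothetical.

```python
import pandas as pd

# Toy data: 1 = boy, 0 = girl, for the first and second child.
kids = pd.DataFrame({"boy1st": [1, 1, 0, 0], "boy2nd": [1, 0, 1, 0]})

kids["two_boys"] = ((kids["boy1st"] == 1) & (kids["boy2nd"] == 1)).astype(int)
kids["two_girls"] = ((kids["boy1st"] == 0) & (kids["boy2nd"] == 0)).astype(int)

# same_sex is the sum of its two mutually exclusive components.
kids["same_sex"] = kids["two_boys"] + kids["two_girls"]
print(kids)
```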

Meanwhile, another instrument used in the empirical analysis is multiple births, or *twins*. In the 1980 PUMS, a multiple birth is identified as siblings who have the same age and quarter of birth (note: the indicator is built from the ages of the second and third child, so the twin birth corresponds to the mother's second birth). For the 1980 PUMS, the mean of *twins* is 0.09 for the sample of all women and 0.08 for the sample of married women. Since quarter of birth is not reported in the 1990 PUMS dataset, the multiple-birth variable is defined there as children who have the same age.
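The two definitions can be contrasted on toy data. This is a sketch under assumed column names (`age_2nd`, `age_3rd`, `qtr_birth_2nd`, `qtr_birth_3rd`), which are hypothetical and do not correspond to the PUMS variable names.

```python
import pandas as pd

# Toy data: age and quarter of birth of the second and third child.
moms = pd.DataFrame({
    "age_2nd": [5, 5, 7, 3],
    "age_3rd": [5, 5, 4, None],          # None: no third child
    "qtr_birth_2nd": [2, 2, 1, 4],
    "qtr_birth_3rd": [2, 3, 1, None],
})

# 1980-style rule: same age and same quarter of birth.
moms["twins_1980_rule"] = (
    (moms["age_2nd"] == moms["age_3rd"])
    & (moms["qtr_birth_2nd"] == moms["qtr_birth_3rd"])
).astype(int)

# 1990-style rule: quarter of birth is unavailable, so only age is compared.
moms["twins_1990_rule"] = (moms["age_2nd"] == moms["age_3rd"]).astype(int)
print(moms[["twins_1980_rule", "twins_1990_rule"]])
```

The second row shows why the age-only rule is coarser: two siblings of the same age born in different quarters are flagged as twins under the 1990-style rule but not under the 1980-style rule.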

In [None]:
Table3_1_1=Table_3_panel_1(data_all_women_1980_one)
Table3_1_2=Table_3_panel_1(msample_1980_one)
Table3_2_1=Table_3_panel_2(data_all_women_1980)
Table3_2_2=Table_3_panel_2(msample_1980)

keys = ['All women, PUMS 1980', 'Married women, PUMS 1980']
frames1 = [Table3_1_1, Table3_1_2]
frames2 = [Table3_2_1, Table3_2_2]
table3_1 = pd.concat(frames1, axis=1, keys=keys) 
table3_2 = pd.concat(frames2, axis=1, keys=keys)

In [None]:
table3_1

In [None]:
table3_2

In [None]:
Table3_1_1=Table_3_panel_1_prueba(data_all_women_1980_one)
Table3_1_2=Table_3_panel_1_prueba(msample_1980_one)
Table3_2_1=Table_3_panel_2_prueba(data_all_women_1980)
Table3_2_2=Table_3_panel_2_prueba(msample_1980)

keys = ['All women, PUMS 1980', 'Married women, PUMS 1980']
frames1 = [Table3_1_1, Table3_1_2]
frames2 = [Table3_2_1, Table3_2_2]
table3_1 = pd.concat(frames1, axis=1, keys=keys) 
table3_2 = pd.concat(frames2, axis=1, keys=keys)
table3_2

In [None]:
Table3_1_3=Table_3_panel_1(data_all_women_1990_one)
Table3_1_4=Table_3_panel_1(msample_1990_one)
Table3_2_3=Table_3_panel_2(data_all_women_1990)
Table3_2_4=Table_3_panel_2(msample_1990)

keys = ['All women, PUMS 1990', 'Married women, PUMS 1990']
frames1 = [Table3_1_3, Table3_1_4]
frames2 = [Table3_2_3, Table3_2_4]
table3_3 = pd.concat(frames1, axis=1, keys=keys) 
table3_4 = pd.concat(frames2, axis=1, keys=keys)

In [None]:
table3_3

In [None]:
table3_4

**Table 6 Part 1: OLS Estimates of *More than 2 children* equations for 1980 PUMS**

In [None]:
OLS_Regressions_more2k(data_all_women_1980, msample_1980)

**Table 6 Part 2: OLS Estimates of *More than 2 children* equations for 1990 PUMS**

In [None]:
OLS_Regressions_more2k(data_all_women_1990, msample_1990)

**Table 7**

In [None]:
outcomes_labor_supply_moms=["workedm", "WEEKSM", "HOURSM", "total_incomem", "faminc_log", "nonmomi_log"]
outcomes_labor_supply_dads=["workedd", "WEEKSD", "HOURSD", "total_incomed", "faminc_log", "nonmomi_log"]
controls_OLS_moms = ["const", "more2k", 'AGEM', 'agefstm', "boy1st", "boy2nd", "blackm", "hispm", "otheracem"]
controls_IV_1_moms=["const", 'AGEM', 'agefstm', "boy1st", "boy2nd", "blackm", "hispm", "otheracem"]
controls_IV_2_moms=["const", 'AGEM', 'agefstm', "boy1st", "blackm", "hispm", "otheracem"]
controls_OLS_dads = ["const", "more2k", 'AGED', 'agefstd', "boy1st", "boy2nd", "blackd", "hispd", "otheraced"]
controls_IV_1_dads=["const", 'AGED', 'agefstd', "boy1st", "boy2nd", "blackd", "hispd", "otheraced"]
controls_IV_2_dads=["const", 'AGED', 'agefstd', "boy1st", "blackd", "hispd", "otheraced"]

In [None]:
Table7_1=OLS_Labor_Supply_Models(data_all_women_1980, outcomes_labor_supply_moms, controls_OLS_moms, controls_IV_1_moms, controls_IV_2_moms)
Table7_2=OLS_Labor_Supply_Models(msample_1980, outcomes_labor_supply_moms, controls_OLS_moms, controls_IV_1_moms, controls_IV_2_moms)
Table7_3=OLS_Labor_Supply_Models(msample_1980, outcomes_labor_supply_dads, controls_OLS_dads, controls_IV_1_dads, controls_IV_2_dads)

keys=["All women", "Married Women", "Husbands"]
frames=[Table7_1, Table7_2, Table7_3]
Table7=pd.concat(frames, axis=1, keys=keys)
Table7 = Table7.replace(np.nan, '-', regex=True)
Table7

**Table 8**

In [None]:
Table8_1=OLS_Labor_Supply_Models(data_all_women_1990, outcomes_labor_supply_moms, controls_OLS_moms, controls_IV_1_moms, controls_IV_2_moms)
Table8_2=OLS_Labor_Supply_Models(msample_1990, outcomes_labor_supply_moms, controls_OLS_moms, controls_IV_1_moms, controls_IV_2_moms)
Table8_3=OLS_Labor_Supply_Models(msample_1990, outcomes_labor_supply_dads, controls_OLS_dads, controls_IV_1_dads, controls_IV_2_dads)

keys=["All women", "Married Women", "Husbands"]
frames=[Table8_1, Table8_2, Table8_3]
Table8=pd.concat(frames, axis=1, keys=keys)
Table8 = Table8.replace(np.nan, '-', regex=True)
Table8

**Table 9**

In [None]:
Interaction1="more2k_bottomthird", "more2k_middlethird", "more2k_upperthird", "more2k_lessgrad", "more2k_hsgrad", "more2k_moregrad",    
outcome="workedm"
Table9_1=OLS_Labor_Supply_Interactions_wifes(msample_1980, Interaction1, "workedm")

Interaction2="more2k_lessgrad_earnings", "more2k_hsgrad_earnings", "more2k_moregrad_earnings" 
outcome="workedm"
Table9_1_2=OLS_Labor_Supply_Interactions_wifes(sample_middle_third, Interaction2, "workedm")

Interaction3="more2k_lessgrad_husbands", "more2k_hsgrad_husbands", "more2k_moregrad_husbands" 
outcome="workedd"
Table9_1_3=OLS_Labor_Supply_Interactions_husbands(sample01, sample02, sample03, Interaction3, "workedd")

frames=[Table9_1, Table9_1_2, Table9_1_3]
Table9_OLS_worked=pd.concat(frames, axis=0)

In [None]:
Interaction4="samesex_bottomthird", "samesex_middlethird", "samesex_upperthird", "samesex_lessgrad", "samesex_hsgrad", "samesex_moregrad",       
outcome="more2k"
Table9_1_4=OLS_Labor_Supply_First_Stage_wifes(msample_1980, Interaction4, "more2k")

Interaction5="samesex_lessgrad_earnings", "samesex_hsgrad_earnings", "samesex_moregrad_earnings",       
outcome="more2k"
Table9_1_5=OLS_Labor_Supply_First_Stage_wifes(sample_middle_third, Interaction5, "more2k")

Interaction6="samesex_lessgrad", "samesex_hsgrad", "samesex_moregrad",       
outcome="more2k"
Table9_1_6=OLS_Labor_Supply_First_Stage_husbands(msample_1980, Interaction6, "more2k")

Table9_OLS_fs=pd.concat([Table9_1_4, Table9_1_5, Table9_1_6], axis=0)
Table9_OLS_fs.index=Table9_OLS_worked.index

In [None]:
Table9_IV_worked=IV_Labor_Supply_Interactions(msample_1980, sample_middle_third, sample01, sample02, sample03, "workedm", "workedd")

**Table 9| Part 1 : OLS and 2SLS Estimates for Labor Supply Models with Interaction Terms Using 1980 Census Data with *Worked for pay* as dependent variable**

In [None]:
Variables1=["bottom_third", "middle_third", "upper_third", "lessgrad", "hsgrad", "moregrad"]
mean_1=mean_samples(msample_1980, Variables1, "workedm")

In [None]:
Variables2=["lessgrad", "hsgrad", "moregrad"]
mean_2=mean_samples(sample_middle_third, Variables2, "workedm")

In [None]:
Variables3=["lessgrad", "hsgrad", "moregrad"]
mean_3=mean_samples(msample_1980, Variables3, "workedd")

In [None]:
frames=[mean_1, mean_2, mean_3]
Table9_means_1=pd.concat(frames, axis=0)
Table9_means_1.index=Table9_OLS_worked.index

In [None]:
Table9_part1 = pd.concat([Table9_OLS_fs, Table9_means_1, Table9_OLS_worked, Table9_IV_worked], axis=1, keys=["First-Stage (more than 2 kids)", "", "OLS", "2SLS"])
Table9_part1

The table reports estimates of the coefficient on *More than 2 children*, with *worked for pay* as the dependent variable, for both the married women and husband samples (1980) in equation (4), which was modified to allow interactions with the wife's schooling and the husband's education. Main effects for each interaction in each sample are included in the equation. Other covariates in the model are those listed in the vector $w_i$.

**Table 9| Part 2 : OLS and 2SLS Estimates for Labor Supply Models with Interaction Terms Using 1980 Census Data with *weeks worked per year* as dependent variable**

In [None]:
Interaction1="more2k_bottomthird", "more2k_middlethird", "more2k_upperthird", "more2k_lessgrad", "more2k_hsgrad", "more2k_moregrad",   
outcome="WEEKSM"
Table9_2=OLS_Labor_Supply_Interactions_wifes(msample_1980, Interaction1, outcome)


Interaction2="more2k_lessgrad_earnings", "more2k_hsgrad_earnings", "more2k_moregrad_earnings" 
outcome="WEEKSM"
Table9_2_2=OLS_Labor_Supply_Interactions_wifes(sample_middle_third, Interaction2, outcome)


Interaction3="more2k_lessgrad_husbands", "more2k_hsgrad_husbands", "more2k_moregrad_husbands" 
outcome="WEEKSD"
Table9_2_3=OLS_Labor_Supply_Interactions_husbands(sample01, sample02, sample03, Interaction3, outcome)

frames=[Table9_2, Table9_2_2, Table9_2_3]
Table9_OLS_worked_2=pd.concat(frames, axis=0) 

Table9_IV_worked_2=IV_Labor_Supply_Interactions(msample_1980, sample_middle_third, sample01, sample02, sample03, "WEEKSM", "WEEKSD")

mean_4=mean_samples(msample_1980, Variables1, "WEEKSM")
mean_5=mean_samples(sample_middle_third, Variables2, "WEEKSM")
mean_6=mean_samples(msample_1980, Variables3, "WEEKSD")

frames=[mean_4, mean_5, mean_6]
Table9_means_2=pd.concat(frames, axis=0)
Table9_means_2.index=Table9_OLS_worked_2.index

In [None]:
Table9_part2 = pd.concat([Table9_means_2, Table9_OLS_worked_2, Table9_IV_worked_2], axis=1, keys=["", "OLS", "2SLS"]).round(3)
Table9_part2

The table reports estimates of the coefficient on *More than 2 children*, with *weeks worked per year* as the dependent variable, for both the married women and husband samples (1980 Census Data) in equation (4), which was modified to allow interactions with the wife's schooling and the husband's education. Main effects for each interaction in each sample are included in the equation. Other covariates in the model are those listed in the vector $w_i$.

**Table 10 | Part 1**

In [None]:
Table10_1=OLS_Labor_Supply_Interactions_wifes(msample_1990, Interaction1, "workedm")
Table10_1_2=OLS_Labor_Supply_Interactions_wifes(sample_middle_third_1990, Interaction2, "workedm")
Table10_1_3=OLS_Labor_Supply_Interactions_husbands(sample04, sample05, sample06, Interaction3, "workedd")

frames=[Table10_1, Table10_1_2, Table10_1_3]
Table10_OLS_worked=pd.concat(frames, axis=0)

In [None]:
Table10_1_4=OLS_Labor_Supply_First_Stage_wifes(msample_1990, Interaction4, "more2k")
Table10_1_5=OLS_Labor_Supply_First_Stage_wifes(sample_middle_third_1990, Interaction5, "more2k")
Table10_1_6=OLS_Labor_Supply_First_Stage_husbands(msample_1990, Interaction6, "more2k")
Table10_OLS_fs=pd.concat([Table10_1_4, Table10_1_5, Table10_1_6], axis=0)
Table10_OLS_fs.index=Table10_OLS_worked.index

In [None]:
Table10_IV_worked=IV_Labor_Supply_Interactions(msample_1990, sample_middle_third_1990, sample04, sample05, sample06, "workedm", "workedd")

In [None]:
mean_7=mean_samples(msample_1990, Variables1, "workedm")
mean_8=mean_samples(sample_middle_third_1990, Variables2, "workedm")
mean_9=mean_samples(msample_1990, Variables3, "workedd")
frames=[mean_7, mean_8, mean_9]
Table10_means_1=pd.concat(frames, axis=0)
Table10_means_1.index=Table10_OLS_worked.index
Table10_part1 = pd.concat([Table10_OLS_fs, Table10_means_1, Table10_OLS_worked, Table10_IV_worked], axis=1, keys=["First-Stage (more than 2 kids)", "", "OLS", "2SLS"])
Table10_part1

**Table 10 | Part 2**

In [None]:
Table10_2=OLS_Labor_Supply_Interactions_wifes(msample_1990, Interaction1, "WEEKSM")
Table10_2_2=OLS_Labor_Supply_Interactions_wifes(sample_middle_third_1990, Interaction2, "WEEKSM")
Table10_2_3=OLS_Labor_Supply_Interactions_husbands(sample04, sample05, sample06, Interaction3, "WEEKSD")

frames=[Table10_2, Table10_2_2, Table10_2_3]
Table10_OLS_worked_2=pd.concat(frames, axis=0) 

Table10_IV_worked_2=IV_Labor_Supply_Interactions(msample_1990, sample_middle_third_1990, sample04, sample05, sample06, "WEEKSM", "WEEKSD")

mean_10=mean_samples(msample_1990, Variables1, "WEEKSM")
mean_11=mean_samples(sample_middle_third_1990, Variables2, "WEEKSM")
mean_12=mean_samples(msample_1990, Variables3, "WEEKSD")

frames=[mean_10, mean_11, mean_12]
Table10_means_2=pd.concat(frames, axis=0)
Table10_means_2.index=Table10_OLS_worked_2.index

In [None]:
Table10_part2 = pd.concat([Table10_means_2, Table10_OLS_worked_2, Table10_IV_worked_2], axis=1, keys=["", "OLS", "2SLS"]).round(3)
Table10_part2

**Table 11: Comparison of 2SLS Estimates using *Same sex* and *Twins-2* instruments in 1980 Census Data**

In [None]:
controls_IV_comp_moms=["const", "AGEM", "agefstm", "AGEQK", "AGEQ2ND", "boy1st", "boy2nd", "blackm", "hispm", "otheracem"]
controls_IV_comp_dads=["const", "AGED", "agefstd", "AGEQK", "AGEQ2ND", "boy1st", "boy2nd", "blackd", "hispd", "otheraced"]

Table11_1=IV_Comparison_Models(data_all_women_1980, outcomes_labor_supply_moms, controls_IV_comp_moms)
Table11_2=IV_Comparison_Models(msample_1980, outcomes_labor_supply_moms, controls_IV_comp_moms)
Table11_3=IV_Comparison_Models(msample_1980, outcomes_labor_supply_dads, controls_IV_comp_dads)
Table11_3.index=Table11_1.index

In [None]:
Table11=pd.concat([Table11_1, Table11_2, Table11_3], axis=1, keys=["All women", "Married women", "Husbands"])
Table11

The table above reports 2SLS estimates of the coefficient on *More than 2 children* in equation (4), using *Same sex* and *Twins-2* (an indicator for whether the second birth was a twin birth) as instruments, for the all-women, married-women, and husbands samples (1980 Census data). The other covariates in the model are age, age at first birth, age of the first child, age of the second child, *boy1st*, *boy2nd*, and indicators for black, Hispanic, and other race.
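
To make the mechanics behind these estimates concrete, here is a minimal sketch of 2SLS with a single binary instrument on synthetic data. It is not the notebook's `IV_Comparison_Models` routine: the data-generating process, the helper names (`ols_fit`, `two_stage_ls`, `simulate`), the value of `TRUE_EFFECT`, and the omission of the covariates listed above are all assumptions made for illustration. One simulated instrument is common but only shifts the treatment probability (in the spirit of *Same sex*), the other is rare but forces the treatment (in the spirit of *Twins-2*).

```python
import random

TRUE_EFFECT = -0.12  # assumed causal effect of the treatment on the outcome


def ols_fit(x, y):
    """Simple OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def two_stage_ls(y, d, z):
    """2SLS with one instrument z for one endogenous regressor d."""
    intercept, slope = ols_fit(z, d)              # first stage: d on z
    d_hat = [intercept + slope * zi for zi in z]  # fitted treatment
    return ols_fit(d_hat, y)[1]                   # second stage: y on d_hat


def simulate(n, p_z, shift, seed):
    """Draw (y, d, z); u confounds d and y but is independent of z."""
    rng = random.Random(seed)
    y, d, z = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                       # unobserved confounder
        zi = 1 if rng.random() < p_z else 0       # binary instrument
        di = 1 if shift * zi - 0.8 * u + rng.gauss(0, 1) > 0.3 else 0
        yi = 0.5 + TRUE_EFFECT * di + 0.2 * u + rng.gauss(0, 0.3)
        y.append(yi)
        d.append(di)
        z.append(zi)
    return y, d, z


# Common instrument with a moderate first stage (hypothetical parameters).
y_a, d_a, z_a = simulate(50_000, p_z=0.5, shift=1.0, seed=1)
# Rare instrument that forces the treatment (hypothetical parameters).
y_b, d_b, z_b = simulate(50_000, p_z=0.02, shift=10.0, seed=2)

ols_a = ols_fit(d_a, y_a)[1]         # naive OLS, contaminated by u
iv_a = two_stage_ls(y_a, d_a, z_a)   # 2SLS with the common instrument
iv_b = two_stage_ls(y_b, d_b, z_b)   # 2SLS with the rare instrument

print(f"OLS: {ols_a:.3f} | 2SLS (common z): {iv_a:.3f} | 2SLS (rare z): {iv_b:.3f}")
```

In this setup the OLS slope is biased because `u` moves both the treatment and the outcome, while both instruments recover a value near `TRUE_EFFECT` up to sampling noise.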

**Table 4: Differences in means**

In [None]:
dem_var_1980=["AGEM", "agefstm", "blackm", "whitem", "otheracem", "hispm", "educm"]
dem_var_1990=["AGEM", "agefstm", "blackm", "whitem", "otheracem", "hispm", "YEARSCHM"]
Table4_80_1=difference_means(data_all_women_1980, dem_var_1980, "same_sex")
Table4_80_2=difference_means(data_all_women_1980, dem_var_1980, "twins")
Table4_90=difference_means(data_all_women_1990, dem_var_1990, "same_sex")
Table4_90.index=Table4_80_1.index
Table4=pd.concat([Table4_80_1, Table4_90, Table4_80_2], axis=1, keys=["By same sex-1980", "By same sex-1990", "By twins-1980"])
Table4
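
The balance check behind this table compares covariate means across the two values of an instrument. The following is a minimal sketch of such a comparison on toy data. The helper `diff_in_means` and the unequal-variance standard error are assumptions, since the internals of the notebook's `difference_means` are not shown here, and the toy values are made up.

```python
import math


def diff_in_means(values, group):
    """Difference in means (group 1 minus group 0) and its standard error.

    The standard error uses separate sample variances for each group
    (an assumption; the notebook's helper may use a different formula).
    """
    g1 = [v for v, g in zip(values, group) if g == 1]
    g0 = [v for v, g in zip(values, group) if g == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    var1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
    var0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
    se = math.sqrt(var1 / len(g1) + var0 / len(g0))
    return m1 - m0, se


# Toy data: mother's age and a same-sex indicator (made-up values).
age = [30, 28, 33, 31, 29, 32]
same_sex = [1, 0, 1, 0, 1, 0]

diff, se = diff_in_means(age, same_sex)
print(f"difference: {diff:.3f}, standard error: {se:.3f}")
```

A difference that is small relative to its standard error is what one would expect if the instrument is as good as randomly assigned with respect to that covariate.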

**Table 5 | Part 1: Mean Difference by Instrument**

In [None]:
outcomes_means=["more2k", "KIDCOUNT", "workedm", "WEEKSM", "HOURSM", "total_incomem", "faminc_log"]
Table5_means_1980=mean_differences_instruments(data_all_women_1980, outcomes_means)
Table5_means_1980

**Table 5 | Part 2: Wald Estimates of Labor Supply Models**

In [None]:
outcomes_wald_1980=["workedm", "WEEKSM", "HOURSM", "total_incomem", "faminc_log"]
Table5_1=wald_estimates_regressions(data_all_women_1980, outcomes_wald_1980, "same_sex")
Table5_2=wald_estimates_regressions(data_all_women_1980, outcomes_wald_1980, "twins")
Table5_1980_part2=pd.concat([Table5_1, Table5_2], axis=1, keys=["Wald Estimates-Same sex", "Wald Estimates-Twins"])

In [None]:
Table5_1980_part2

In [None]:
outcomes_wald_1990=["workedm", "WEEKSM", "HOURSM", "total_incomem", "faminc_log"]
Table5_1990=wald_estimates_regressions(data_all_women_1990, outcomes_wald_1990, "same_sex")
Table5_1990
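
For a binary instrument, a Wald estimate is the difference in mean outcome between the two instrument groups divided by the difference in mean treatment between the same groups. The sketch below illustrates that ratio on toy data. The function `wald_estimate` and the data are hypothetical and are not the notebook's `wald_estimates_regressions` helper.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def wald_estimate(y, d, z):
    """Ratio of the reduced-form to the first-stage mean difference."""
    reduced_form = (mean([yi for yi, zi in zip(y, z) if zi == 1])
                    - mean([yi for yi, zi in zip(y, z) if zi == 0]))
    first_stage = (mean([di for di, zi in zip(d, z) if zi == 1])
                   - mean([di for di, zi in zip(d, z) if zi == 0]))
    return reduced_form / first_stage


# Toy data (made-up values): instrument, treatment, outcome.
same_sex = [1, 1, 1, 1, 0, 0, 0, 0]
more2k = [1, 1, 1, 0, 1, 0, 0, 0]   # first stage: 0.75 - 0.25 = 0.5
worked = [0, 0, 1, 1, 0, 1, 1, 1]   # reduced form: 0.5 - 0.75 = -0.25

wald = wald_estimate(worked, more2k, same_sex)
print(wald)  # -0.25 / 0.5 = -0.5
```

The estimate is only informative when the first-stage difference is clearly different from zero, which is why the mean differences in Part 1 are reported alongside the Wald estimates.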

**EXTENSION: CAUSAL TREES**

In [8]:
#!pip install econml
from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split
from econml.grf import CausalIVForest

In [9]:
# Replace missing and non-finite values (including -inf) with 0 before fitting the forest.
data_all_women_1980.replace([np.nan, np.inf, -np.inf], 0, inplace=True)

In [15]:
# Split each subgroup into training and test sets.
# First subgroup: women with less education.
train_01, test_01 = train_test_split(sample01, test_size=0.2)

In [None]:
treatment = ['more2k']
outcome = ['workedm']
covariates = ["AGEM", "agefstm", "boy1st", "boy2nd", "blackm", "hispm", "otheracem", "total_incomed"]
instruments = ['same_sex']

Y_1 = train_01[outcome]
T_1 = train_01[treatment]
X_1 = train_01[covariates]
Z_1 = train_01[instruments]
X_test_1 = test_01[covariates]
W_1 = None

In [None]:
est_01 = CausalIVForest(
    criterion='het',
    n_estimators=500,
    min_samples_leaf=5,
    max_depth=None,
    max_samples=0.5,
    honest=True,
    inference=True,
    fit_intercept=True,
)

est_01.fit(X_1, T_1, y=Y_1, Z=Z_1)

In [None]:
treatment_effects_01, lb_01, ub_01 = est_01.predict(X_1, interval=True, alpha=0.05)

# Collect point estimates and 95% confidence bounds in one DataFrame.
df_te_01 = pd.DataFrame({
    "cate": treatment_effects_01[:, 0],
    "lb": lb_01[:, 0],
    "ub": ub_01[:, 0],
})

In [None]:
df_te_01["cate"].mean()

## Appendix: Dictionary of key variables

#### data_2

| **Name**        | **Description**                            |
|-----------------|--------------------------------------------|
| **index**       |                                            |
| byr             | birth year                                 |
| race            | ethnicity, 1 for white and 2 for nonwhite  |
| interval        | interval of draft lottery numbers, 73 intervals with the size of five consecutive numbers        |
| year            | year for which earnings are collected      |
| **variables**   |                                            |
| vmn1            | nominal earnings                           |
| vfin1           | fraction of people with zero earnings      |
| vnu1            | sample size                                |
| vsd1            | standard deviation of earnings             |

-------
Notebook by Carolina Alvarez | GitHub profile: https://github.com/carolinalvarez.

---

# References 

* **Angrist, J., & Evans, W. (1998)**. *Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size*. The American Economic Review, 88(3), 450-477.

