# Notebook Objective

In this notebook, I use an instrumental variable to estimate the effect of time-limited welfare eligibility on family well-being.

Data are from the Family Transition Program (FTP). FTP was the first welfare reform initiative in which some families reached a time limit on their welfare eligibility and had their benefits canceled. The program took place in Escambia County, Florida, from 1994 to 1999. Key findings from the study, as well as additional background information, can be found [here](https://www.mdrc.org/publication/family-transition-program).

# Preparing the Data

In [1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels import IV2SLS # pip install linearmodels

I leverage both the administrative data (which contains official employment and income records) and survey data (which contains participant self-reported data on well-being) in the analysis. We'll need to load, merge, and clean the data before estimating treatment effects. 

In [2]:
def admin_data_loader():
    """Loads ftp administrative dataset."""

    path = '/Users/danielchen/Desktop/UChicago/Year Two/Autumn 2020/Program Evaluation/Problem Sets/Problem Set 2/ftp_ar.dta'
    df = pd.read_stata(path)

    return df


def survey_data_loader():
    """Loads ftp survey dataset."""

    path = '/Users/danielchen/Desktop/UChicago/Year Two/Autumn 2020/Program Evaluation/Problem Sets/Problem Set 1/ftp_srv.dta'
    df = pd.read_stata(path)

    return df


def ftp_merger(dataframe1, dataframe2):
    """Merges the two ftp datasets."""

    df = pd.merge(dataframe1, dataframe2, on='sampleid')

    return df

In [4]:
admin = admin_data_loader()
survey = survey_data_loader()
df = ftp_merger(admin, survey)
df

Unnamed: 0,sampleid,e_x,cflag,longtdec,b_aidst,gender,ethnic,marital,afdctime,afdchild,...,emppq1_y,yrearn_y,yrearnsq_y,pearn1_y,recpc1_y,yrrec_y,yrkrec_y,rfspc1_y,yrrfs_y,yrkrfs_y
0,1,0,1.0,1,2.0,2.0,5.0,5.0,2.0,3.0,...,1,5700,32490000,600,0,0,0,1,1,12
1,100,0,,2,2.0,2.0,1.0,1.0,2.0,2.0,...,1,2350,5522500,1100,0,0,0,1,1,11
2,1000,0,,1,2.0,2.0,1.0,5.0,2.0,3.0,...,1,7500,56250000,1600,1,1,2,1,1,7
3,1004,0,1.0,1,1.0,2.0,1.0,4.0,5.0,1.0,...,1,9600,92160000,1700,0,1,1,1,1,11
4,1007,0,1.0,5,1.0,2.0,1.0,3.0,4.0,3.0,...,0,400,160000,0,1,1,11,1,1,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1724,994,1,1.0,1,1.0,2.0,1.0,3.0,3.0,3.0,...,1,35100,1232010000,8500,0,0,0,0,0,0
1725,995,1,1.0,5,2.0,2.0,1.0,1.0,6.0,1.0,...,0,0,0,0,0,0,0,1,1,12
1726,996,0,1.0,7,2.0,2.0,1.0,1.0,5.0,1.0,...,1,300,90000,100,1,1,12,1,1,12
1727,997,0,1.0,6,1.0,2.0,1.0,1.0,1.0,1.0,...,1,3300,10890000,1800,1,1,10,1,1,12

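The `_x` and `_y` suffixes in the output above are pandas' default way of disambiguating non-key columns that appear in both inputs to `pd.merge`. A minimal sketch on toy data (the two small frames and their values are invented for illustration; only the `sampleid` key and the `yrearn` and `fmi2` column names are borrowed from the FTP files):

```python
import pandas as pd

# Toy stand-ins for the admin and survey files (values are invented).
admin_toy = pd.DataFrame({'sampleid': [1, 2, 3], 'yrearn': [5700, 2350, 7500]})
survey_toy = pd.DataFrame({'sampleid': [1, 2, 3], 'yrearn': [5700, 2350, 7500],
                           'fmi2': [1, 2, 1]})

# Columns present in both frames (other than the key) get '_x' / '_y' suffixes;
# columns that appear in only one frame keep their original names.
merged_toy = pd.merge(admin_toy, survey_toy, on='sampleid')
print(list(merged_toy.columns))
```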

The merged dataframe contains duplicate columns, indicated by the "_x" or "_y" suffixes attached to certain column names. The next two functions remove the duplicate columns and clean up the remaining names.

In [3]:
def drop_y_columns(dataframe):
    """Drops duplicate columns."""

    df = dataframe.copy()

    cols_to_drop = [col for col in df if col.endswith('_y')]
    # Pass columns= explicitly; the positional axis argument was removed in pandas 2.0.
    df = df.drop(columns=cols_to_drop)

    return df


def colunm_renamer(dataframe):
    """Removes '_x' from column names after merging."""

    df = dataframe.copy()

    col_names = [col for col in df.columns.values]
    new_names = [col_name[:-2] if col_name.endswith('_x') else col_name for col_name in col_names]
    df.columns = new_names

    return df

In [5]:
df = drop_y_columns(df)
df = colunm_renamer(df)
df

Unnamed: 0,sampleid,e,cflag,longtdec,b_aidst,gender,ethnic,marital,afdctime,afdchild,...,nkids0,nkids1,nkids2,nkidsge3,ageykid,himed,hioth,khimed,khioth,married
0,1,0,1.0,1,2.0,2.0,5.0,5.0,2.0,3.0,...,0.0,1.0,0.0,0.0,10.0,0.0,1.0,3.0,1.0,1.0
1,100,0,,2,2.0,2.0,1.0,1.0,2.0,2.0,...,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0
2,1000,0,,1,2.0,2.0,1.0,5.0,2.0,3.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3.0,1.0,0.0
3,1004,0,1.0,1,1.0,2.0,1.0,4.0,5.0,1.0,...,0.0,0.0,0.0,1.0,7.0,0.0,1.0,3.0,1.0,0.0
4,1007,0,1.0,5,1.0,2.0,1.0,3.0,4.0,3.0,...,0.0,0.0,0.0,1.0,5.0,0.0,1.0,3.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1724,994,1,1.0,1,1.0,2.0,1.0,3.0,3.0,3.0,...,0.0,0.0,0.0,1.0,6.0,0.0,1.0,3.0,3.0,1.0
1725,995,1,1.0,5,2.0,2.0,1.0,1.0,6.0,1.0,...,0.0,0.0,0.0,1.0,6.0,1.0,0.0,1.0,0.0,0.0
1726,996,0,1.0,7,2.0,2.0,1.0,1.0,5.0,1.0,...,0.0,1.0,0.0,0.0,8.0,0.0,1.0,3.0,1.0,0.0
1727,997,0,1.0,6,1.0,2.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0


The cleaned dataframe contains 1729 rows or observations for 1729 unique families.
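The claim that each row is a distinct family can be checked directly rather than assumed. A minimal sketch, using a hypothetical helper `count_duplicate_ids` (not part of the original notebook) on invented data:

```python
import pandas as pd

def count_duplicate_ids(dataframe, id_col='sampleid'):
    """Hypothetical helper: number of rows whose id repeats an earlier row."""
    return int(dataframe[id_col].duplicated().sum())

# Invented ids; '100' appears twice on purpose.
toy = pd.DataFrame({'sampleid': ['1', '100', '1000', '100']})

print(count_duplicate_ids(toy))                    # rows repeating an earlier id
print(count_duplicate_ids(toy.drop_duplicates()))  # after removing exact duplicates
```

On the merged data, a result of zero from `count_duplicate_ids(df)` would support reading the row count as a count of unique families.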

# Understanding Key Variables + Summary Statistics

- `e` is the treatment dummy: 0 means that a family was randomly assigned to the control group and 1 means that a family was randomly assigned to the treatment group. The treatment group had their benefits time limited, while the control group did not. The treatment group also received a variety of other benefits that are beyond the scope of this notebook.
- `fmi2` is from the survey data. Families were asked whether they believed they were subject to the time limit. Possible responses also include "don't know" and "no response".

First, let's get a broad overview of how many people believed that they were subject to the time limit and how many did not.

In [6]:
def summary_stats(dataframe):
    """Returns a dataframe showing how many people believed in the time limit 
    vs. how many people did not. 
    """

    df = dataframe.copy()

    categories = [
        'Believed Subject to Time Limit',
        "Didn't Believe Subject to Time Limit",
        "Don't Know"
    ]

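    # value_counts() sorts by frequency, so the labels above rely on that order.
    sum_table = pd.DataFrame({'CATEGORY': categories, 
                              'COUNTS': df['fmi2'].value_counts()})

    # DataFrame.append was removed in pandas 2.0, so the total row is added
    # with pd.concat. len() counts every row, including missing `fmi2` values.
    total_row = pd.DataFrame({'CATEGORY': ['Valid Responses'], 
                              'COUNTS': [len(df['fmi2'])]})
    sum_table = pd.concat([sum_table, total_row], ignore_index=True)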

    return sum_table

In [8]:
summary_stats(df)

Unnamed: 0,CATEGORY,COUNTS
0,Believed Subject to Time Limit,666
1,Didn't Believe Subject to Time Limit,365
2,Don't Know,118
3,Valid Responses,1729

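The three response categories sum to 666 + 365 + 118 = 1,149, while the last row reports 1,729. The difference arises because `value_counts()` skips missing values by default, whereas `len(df['fmi2'])` counts every row. A minimal sketch on invented data of how `value_counts(dropna=False)` makes that gap visible (the code 8 for "don't know" is a placeholder, not the documented FTP coding):

```python
import numpy as np
import pandas as pd

# Invented responses; NaN stands for rows where fmi2 is missing.
fmi2_toy = pd.Series([1, 1, 2, 8, np.nan, np.nan], name='fmi2')

answered = fmi2_toy.value_counts()                  # NaN excluded by default
with_missing = fmi2_toy.value_counts(dropna=False)  # NaN gets its own row

print(answered.sum(), len(fmi2_toy))  # answered rows vs. all rows
print(int(fmi2_toy.isna().sum()))     # rows with a missing value
```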

For this analysis, I keep only observations where the family gave a definite answer: either they believed that they were subject to the time limit or they believed that they were not. Families who answered "don't know" are dropped. I'll subset the data and create a new dummy variable, `TLyes`, equal to 1 for families who believed the time limit applied to them and 0 for those who did not.

In [9]:
def new_dummy_creator(dataframe):
    """Creates a new treatment variable. 1 for those who believed in time limit. 
    0 for those who did not. Everyone else is dropped. 
    """

    df = dataframe.copy()

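    # Keep definite answers only; .copy() avoids a SettingWithCopyWarning
    # when the new column is assigned below.
    df = df[(df['fmi2'] == 1) | (df['fmi2'] == 2)].copy()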

    df['TLyes'] = [1 if val == 1 else 0 for val in df['fmi2']]

    return df

In [10]:
df = new_dummy_creator(df) 

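The same subsetting and recoding can be written without a Python loop. A minimal sketch on invented data, assuming (as in `new_dummy_creator`) that `fmi2 == 1` means the family believed it was subject to the time limit and `fmi2 == 2` means it did not; the code 8 is a placeholder for any other response:

```python
import pandas as pd

# Invented rows; sampleid values are arbitrary labels.
toy = pd.DataFrame({'sampleid': ['a', 'b', 'c', 'd'], 'fmi2': [1, 2, 8, 1]})

# Keep only definite answers, then recode: 1 -> 1 (believed), 2 -> 0 (did not).
subset = toy[toy['fmi2'].isin([1, 2])].copy()
subset['TLyes'] = (subset['fmi2'] == 1).astype(int)

print(subset['TLyes'].tolist())
```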
Next, I'd like to get a sense of whether or not there was confusion around the time limit. A family may have been randomly assigned to the treatment group, and therefore had their benefits time limited, yet not have believed that they were subject to the time limit, and vice versa. Below, I cross-tabulate the assignment variable `e` against the self-reported belief variable `TLyes`.

In [11]:
def xtab_generator(dataframe):
    """Generates crosstabs of original dummy variable vs. new dummy variable."""

    df = dataframe.copy()

    tabs = pd.crosstab(index=df['e'], columns=df['TLyes'], 
                       margins=True, margins_name='Total',
                       rownames=['Original Treatment'],
                       colnames=['Time Limit Belief'])

    return tabs

In [12]:
xtab_generator(df)

Original Treatment / Time Limit Belief,0,1,Total
0,300,205,505
1,65,461,526
Total,365,666,1031


From the table above, it's clear that many participants were confused about whether the time limit applied to their families. Of the 505 families originally assigned to control, 59.4% correctly identified that the time limit did not apply to them. However, a sizable minority (40.6%) incorrectly thought that their benefits were time limited when in reality they were not. Participants assigned to the treatment group better understood the rules governing their benefits: 87.6% of these 526 families correctly identified that their benefits were time limited. The remaining 12.4% thought that they had no limit when, in reality, they did.
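The percentages quoted above are row shares of the crosstab. A minimal sketch that recomputes them from the counts in the table (the frame is typed in by hand from the output, not read from the FTP files):

```python
import pandas as pd

# Counts copied from the crosstab: rows are assignment `e`, columns are `TLyes`.
counts = pd.DataFrame({0: [300, 65], 1: [205, 461]}, index=[0, 1])
counts.index.name = 'e'
counts.columns.name = 'TLyes'

# Divide each row by its total to get shares within each assignment group.
row_shares = counts.div(counts.sum(axis=1), axis=0)
print((100 * row_shares).round(1))
```

With the real data, `pd.crosstab(df['e'], df['TLyes'], normalize='index')` should give the same shares in one step.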