### The Impact of NAFTA on US Wages and Income Inequality at the State Level
#### Part I, ver 0.1
#### Individual Term Project
#### Prof. Flamm
#### Spring 2020

##### Preface
This project uses a dataset created by Hakobyan and McLaren (2016)\*, and used by them to analyze the impact of NAFTA on US wages at the national level. We are going to be using this same data in a different way.

\* [Link to Hakobyan and McLaren](https://www.mitpressjournals.org/doi/10.1162/REST_a_00587)

[Link to supplemental datafile deposited for replication when article was published](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MJAPQ9)

#### First Steps
> First, you need to pick a state and inform me of your choice. One student per state; first come, first served.

> Picking Texas because you love it (or California or New York) is a great idea but comes with a price. The dataset you will download will need to be held in memory on your laptop. Big states have lots of people and lots of observations, but use up lots of memory. If you pick a big state, you will need a laptop with lots of memory to be sure you can do the analysis in a reasonable amount of computing time. Smaller states are generally easier to deal with.

>The dataframe we will be using contains observations on a U.S. census 5% sample of the U.S. population, and an estimate of the underlying population numbers that each randomly drawn observation is intended to represent. (This is called a *stratified* sample.) 

> Within each state, geographic areas are further subdivided into Public Use Microdata Areas that are consistent across census years (consistent PUMAs, or `conspuma`'s). These are the most spatially disaggregated geographic units identified in the anonymized public use sample data on individual households distributed by the US Census Bureau. This data is used for many purposes by US policymakers, businesses, government agencies, nonprofits, academics, and private analysts. Any effort invested in becoming familiar with it and learning how to use it will be worthwhile.

> There are 543 of these `conspuma`'s distributed across the 50 states + DC. California has the most (233), while Vermont and Wyoming have the fewest (4 each).

***Tip:*** If you want to get better estimates of standard errors using less stringent assumptions about unobserved statistical disturbances (*cluster robust* standard errors), you probably want a state with at least 20 `conspuma`'s to get minimally decent approximations.

(May be of interest: The above rule of thumb is based on published Monte Carlo studies; new computation-intensive methods of getting standard errors through simulation, which work acceptably with smaller numbers of spatial units, are coming online in statistical software. But that is beyond the scope of this course.)
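To check how many `conspuma`'s a state has in the data itself, something like the following sketch could be used. It assumes a pandas dataframe with one row per person and the columns `statefip` and `conspuma`; the toy values and the names `people`, `n_clusters`, and `MIN_CLUSTERS` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the project data: one row per person.
# `statefip` and `conspuma` are the column names used in this document;
# the values below are invented for illustration only.
people = pd.DataFrame({
    "statefip": [56, 56, 56, 56, 50, 50, 50],
    "conspuma": [540, 541, 541, 542, 470, 471, 471],
})

# Number of distinct conspumas observed in each state
n_clusters = people.groupby("statefip")["conspuma"].nunique()

# Flag states that meet the rule-of-thumb minimum of 20 clusters
MIN_CLUSTERS = 20
enough = n_clusters >= MIN_CLUSTERS

print(n_clusters)
print(enough)
```

On the real data, the same two lines (`groupby` then `nunique`) would give the cluster count for your chosen state.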

**Conspuma Cheat Sheet**

    rank	statefip	#codes	stname	stusps

    0	6	233	California	CA
    1	48	153	Texas	TX
    2	36	143	New York	NY
    3	12	127	Florida	FL
    4	42	92	Pennsylvania	PA
    5	39	91	Ohio	OH
    6	17	87	Illinois	IL
    7	26	68	Michigan	MI
    8	13	63	Georgia	GA
    9	34	61	New Jersey	NJ
    10	37	58	North Carolina	NC
    11	25	52	Massachusetts	MA
    12	18	48	Indiana	IN
    13	53	46	Washington	WA
    14	24	44	Maryland	MD
    15	47	44	Tennessee	TN
    16	51	42	Virginia	VA
    17	29	41	Missouri	MO
    18	8	38	Colorado	CO
    19	27	37	Minnesota	MN
    20	22	36	Louisiana	LA
    21	4	36	Arizona	AZ
    22	55	31	Wisconsin	WI
    23	1	30	Alabama	AL
    24	21	30	Kentucky	KY
    25	41	27	Oregon	OR
    26	45	27	South Carolina	SC
    27	9	25	Connecticut	CT
    28	28	23	Mississippi	MS
    29	20	21	Kansas	KS
    30	5	19	Arkansas	AR
    31	19	19	Iowa	IA
    32	40	18	Oklahoma	OK
    33	49	16	Utah	UT
    34	35	15	New Mexico	NM
    35	32	15	Nevada	NV
    36	31	14	Nebraska	NE
    37	54	12	West Virginia	WV
    38	33	11	New Hampshire	NH
    39	23	10	Maine	ME
    40	15	9	Hawaii	HI
    41	16	9	Idaho	ID
    42	44	7	Rhode Island	RI
    43	46	7	South Dakota	SD
    44	30	7	Montana	MT
    45	10	6	Delaware	DE
    46	11	5	District of Columbia	DC
    47	2	5	Alaska	AK
    48	38	5	North Dakota	ND
    49	50	4	Vermont	VT
    50	56	4	Wyoming	WY

See `data_prep_indiv-proj_2020.ipynb` for the data prep code that produced this table.

[Reference for `conspuma`'s at this link.](https://usa.ipums.org/usa-action/variables/CONSPUMA#description_section)



#### Overview of what you will be asked to do
> 1. In your chosen state, I will be asking you to divide the labor force into 4 groups based on educational attainment: less than a high school degree, high school degree, some college, and college degree or higher. For each group, we will explore what the effects of NAFTA on wages and salaries were over the 1990-2000 period. In addition, we will divide the industries in which these workers are employed into groups based on their vulnerability to Mexican imports: we can think of those industries not facing competition from Mexican imports, or with tariffs unaffected by NAFTA, as our "control" group, and break other industries into "treatment" groups depending on their level of tariff cuts.

> 2. To begin, we will look at the distribution of workers across the state economy by educational group, before and after NAFTA took effect, for the state as a whole. I will ask you to put together a chart examining how the distribution of total wage and salary income in the state between educational groups changed after NAFTA went into effect. You can use a simple bar chart, or something fancier if you are so inclined.
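As a rough sketch of the calculation behind such a chart, the snippet below computes each educational group's share of total wage and salary income in each census year. The column names `year`, `educ_group`, and `incwage` are assumptions made for this example (they may differ in the project data), `perwt` is the person weight in the dataset, and all numbers are invented.

```python
import pandas as pd

# Toy data; `year`, `educ_group`, and `incwage` are hypothetical column names.
df = pd.DataFrame({
    "year": [1990] * 4 + [2000] * 4,
    "educ_group": ["<HS", "HS", "Some college", "College+"] * 2,
    "incwage": [10000, 20000, 30000, 50000, 12000, 25000, 40000, 80000],
    "perwt": [20, 30, 25, 25, 15, 30, 25, 30],
})

# Income represented in the population by each sampled person
df["wt_income"] = df["incwage"] * df["perwt"]

# Total weighted income by group (rows) and year (columns)
totals = df.groupby(["year", "educ_group"])["wt_income"].sum().unstack("year")

# Each group's percentage share of the state total, year by year
shares = 100 * totals / totals.sum()

print(shares.round(1))
```

With matplotlib installed, `shares.plot.bar()` would draw the before/after bars side by side for each group.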

> 3. Next, I would like you to look at mean and median wages in your state, overall and in each of the educational groups/industry groups, before and after NAFTA kicked in. Because this is survey data, it can get a little complicated. The variable `perwt` gives you the number of individual employees each observation represents in the underlying population.
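To see why the weights matter, here is a minimal sketch of a weighted mean and a weighted median, using `perwt`-style weights. The helper names `weighted_mean` and `weighted_median` and the numbers are made up for illustration; the median here is the smallest wage at which the cumulative weight reaches half the total, which is one of several conventions.

```python
import numpy as np

def weighted_mean(values, weights):
    """Mean of `values`, each counted `weights` times."""
    return float(np.average(values, weights=weights))

def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return float(values[idx])

# Three sampled workers standing in for 60, 20, and 20 people (invented numbers)
wages = [20000, 30000, 100000]
perwt = [60, 20, 20]

print(weighted_mean(wages, perwt))    # weighted by population represented
print(weighted_median(wages, perwt))
print(np.mean(wages), np.median(wages))  # unweighted, for comparison
```

In this toy example the unweighted statistics overstate typical wages, because the low-wage observation represents most of the population.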

> 4. To follow up, I will ask you to do a "skyscraper-type" bar graph showing what percent of total state employment income went to each decile of the labor income distribution within each educational group, before and after NAFTA kicked in. I will show you in class (and in a forthcoming exercise) how to calculate the share of income in a group going to a decile, making use of the Python package `statsmodels`.
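Ahead of the `statsmodels` approach to be covered in class, the idea of a decile share can be sketched with plain NumPy. This is not the class method, only an approximation under stated assumptions: the function name `decile_shares` is hypothetical, and each observation is assigned whole to the decile containing the midpoint of its span of cumulative weight (so observations straddling a decile boundary are not split).

```python
import numpy as np

def decile_shares(income, weights):
    """Percent of total weighted income going to each weighted decile."""
    income = np.asarray(income, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort people from lowest to highest income
    order = np.argsort(income)
    income, weights = income[order], weights[order]

    # Midpoint of each observation's position in the cumulative
    # weight distribution, scaled to the interval (0, 1)
    cum = np.cumsum(weights)
    position = (cum - 0.5 * weights) / cum[-1]

    # Map positions to decile indices 0 (bottom) through 9 (top)
    decile = np.minimum((position * 10).astype(int), 9)

    # Sum weighted income within each decile and convert to percentages
    totals = np.bincount(decile, weights=income * weights, minlength=10)
    return 100 * totals / totals.sum()

# Ten equally weighted workers earning 1000, 2000, ..., 10000 (invented)
income = 1000 * np.arange(1, 11)
shares = decile_shares(income, np.ones(10))
print(shares.round(2))
```

Applying the function separately to each educational group and year gives the heights of the bars in the graph.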

> 5. While the previous analysis is a start at understanding what was happening, there were a lot of things going on over 1990-2000 that had nothing to do with NAFTA but may still be affecting changes in wages: wages may have been going up or down in different states and educational groups because of overall economic changes affecting all industries in a state. 

> Fortunately, not all locations (`conspuma`'s) were equally affected by new competition from increased imports from Mexico stimulated by a lowering of tariff rates under NAFTA. Some locations had few if any workers in industries competing with cheaper Mexican imports (because Mexico was not a competitor in that industry, or because tariffs were already very low), while other locations were faced with large reductions in tariffs on products in which Mexico was very competitive. We can think of the latter set of locations (opened up to Mexican competition by NAFTA) as "treatment" groups in an experiment on how easy it is for workers to relocate, and the former group of locations and industries (not in competition with their Mexican industry counterparts, or not facing NAFTA tariff cuts) as "control" groups.

> One methodology we can use if we adopt this perspective is called "difference in differences": we compare the outcomes of two groups, both before and after a policy took effect. The treatment group consists of those who were affected by the policy; the control group consists of those who were not.

> "Specifically, we take the difference in outcomes of the treatment and control group before the policy was implemented, and compare it with the difference in outcomes after the policy was implemented. This method is known in economics as differences-in-differences: A method that applies an experimental research design to outcomes observed in a natural experiment. It involves comparing the difference in the average outcomes of two groups, a treatment and control group, both before and after the treatment took place. We need to compare outcomes before the policy has happened, because in a natural experiment we cannot choose exactly who receives the treatment (whereas in the lab we could randomly assign the treatment). Since the two groups are not randomly chosen, we need to account for any pre-existing differences between the two groups that could affect the outcomes, for example differences in age (for people) or characteristics (for products). If these other factors remain constant over the period considered, then we can reasonably conclude that any observed changes in the outcome differences between the groups are due to the policy. Natural experiments therefore allow us to make causal statements about policies and outcomes." 
[link to source of above quote](https://www.core-econ.org/doing-economics/book/text/03-01.html)
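The arithmetic in the quoted description can be illustrated with four group means. The wage figures below are invented purely for illustration and are not drawn from the project data.

```python
# Hypothetical mean wages for each group and period (invented numbers)
mean_wage = {
    ("control", "before"): 30000.0,
    ("control", "after"): 33000.0,
    ("treatment", "before"): 28000.0,
    ("treatment", "after"): 29500.0,
}

# Treatment-minus-control gap before the policy
gap_before = mean_wage[("treatment", "before")] - mean_wage[("control", "before")]

# Treatment-minus-control gap after the policy
gap_after = mean_wage[("treatment", "after")] - mean_wage[("control", "after")]

# Difference in differences: how much the gap changed
did = gap_after - gap_before

print(gap_before, gap_after, did)
```

Here the treatment group started 2,000 below the control group and ended 3,500 below it, so the estimated effect of the policy is the 1,500 widening of the gap, not the raw 3,500 difference.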

> This methodology makes the ***identifying assumption*** of "*parallel trends*". "It requires that in the absence of treatment, the difference between the 'treatment' and 'control' group is constant over time. Although there is no statistical test for this assumption, visual inspection is useful when you have observations over many time points."

![Image](https://www.mailman.columbia.edu/sites/default/files/png/DIDgraph.png)

> We can think of the "parallel trends" assumption, in the context of NAFTA, as being that wages for a given group of workers with like educational attainment and other characteristics in industries and locations affected by NAFTA tariff cuts would have moved in the same way over time as wages for industries and locations not affected by NAFTA tariff cuts, absent the NAFTA tariff cuts. 

> A simple linear regression model, which allows us to control for worker characteristics, industries, and locations when we compare pre- and post-NAFTA outcomes, is all we need to estimate a difference-in-differences model. We will talk about how to use this simple framework in Part II of this project description.
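As a preview, and only as a sketch under simplifying assumptions, the snippet below fits the most basic version of such a regression with NumPy least squares: wage on a treatment dummy, a post-period dummy, and their interaction. The data are invented, there are no controls, weights, or standard errors, and the variable names (`treat`, `post`, `wage`) are hypothetical; the specification actually used in the project will be described in Part II.

```python
import numpy as np

# Invented data: two workers in each group/period cell
treat = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # 1 = treatment group
post = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)   # 1 = after the policy
wage = np.array([29000, 31000, 32000, 34000,
                 27000, 29000, 29000, 30000], dtype=float)

# Design matrix: intercept, treat, post, and the treat x post interaction
X = np.column_stack([np.ones_like(treat), treat, post, treat * post])

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
intercept, b_treat, b_post, b_did = beta

print(f"difference-in-differences estimate: {b_did:.1f}")
```

Because this model has one parameter per group/period cell, the interaction coefficient equals the change in the treatment-control gap computed directly from the four cell means.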

> 6. We will also be using more advanced econometric models, which will be discussed in class, to examine NAFTA impacts as part of this exercise.