Demography 88<br>
Fall 2018<br>
Carl Mason (cmason@berkeley.edu)<br>

# Lab 7  The wage impacts of immigration 

## Outline
1. Review TVD results from the Great Migration (Chicago) lab
1. Discuss the Clemens paper
1. Consider what a productivity-measuring statistic might look like
    1. Design a productivity-based statistic of interest
    1. Compute the statistic
    1. Characterize its uncertainty with a bootstrap simulation


## The Clemens paper

1. Huge gains from removing restrictions on immigration
   1. Reasons why the trillion dollar bill is left on the sidewalk:
1. Assumptions and requirements
   1. migrants from poor to rich countries see large productivity gains
       1. productivity = output per unit of input (in this case, labor); marginal productivity = wage
       1. is productivity a matter of place or of person (migration selectivity)?
   1. gain/loss to others is hard to estimate BUT presumed minor/canceling/merely pecuniary
       1. wages might rise in poor countries
       1. wages might fall in rich countries
       1. returns to other factors (capital) might rise in rich countries
       1. returns to other factors (capital) might fall in poor countries
       1. Not all externalities deserve a "Pigovian tax"
           1. smoke stacks yes; price/wage changes *no*
   1. lots of models with production functions, general equilibrium, etc.

### A smaller but livelier version of Clemens' Figure 1

<a href="http://shiny.demog.berkeley.edu/carlm/EconImmig0/" target="_new"> Neoclassical model of wage effects of immigration</a>



<img src="https://courses.demog.berkeley.edu/mason88/images/clemensFig1.png">



In [None]:
# Run this cell to import the stuff we'll need
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt 
plt.style.use('fivethirtyeight')

%matplotlib inline
from datascience import Table
from datascience.predicates import are
from datascience.util import *

from IPython.display import HTML, IFrame, display
datasite="https://courses.demog.berkeley.edu/mason88/data/"
quizsite="https://courses.demog.berkeley.edu/mason88/cgi-bin/quiz.py"
  
def cquiz(qno) : 
    import IPython, requests 
    try:
        sid
    except NameError: 
        print("HEY! did you enter your sid way up at the top of this notebook?")
        return  # without an sid we cannot build the quiz link
    Linkit='{0}?qno={1}&sid={2}'.format(quizsite,qno,sid)
    #print(Linkit)
    html = requests.get(Linkit)
    #display(IFrame(Linkit, 1000, 300))
    display(IFrame(Linkit, 1000, 400))


    
######################
# Here it is ... the obvious place to put your student id
sid=""
######################
if sid == "" :
    print("HEY! didn't I tell you to put your sid in the obvious place")
 


In [None]:
### #### #### ####
###  SELECTING  LAB  PARTNERS
### ### #### ####

N=18
numbers=np.arange(N)
# if N is odd, add a "lucky" placeholder so that the numbers split evenly into pairs
if N % 2 != 0:
    numbers=np.append(numbers,"lucky")
    N+=1
numbersTab=Table().with_column('n',numbers)
randomized=numbersTab.sample(k=N,with_replacement=False)
selection=randomized['n']
selection.shape = (2,int(N/2))
Table().with_columns('zero',selection[0],'one',selection[1]).show()

In [None]:
cquiz('wageimp0-partners')


## The data for this week

1. Residents of California, New York, Illinois, or Texas in 2000 and 2015
1. Respondents grouped by age and year: 25-30 years old (in 2000) vs. 40-45 (in 2015)
1. Immigrants who arrived at ages 20-25
1. productivity = total annual wages / (usual hours per week * weeks worked last year)
1. educ4 = education level coded into 5 mutually exclusive groups
1. Data come from American Community Survey via IPUMS
1. Data include weights which we must use.
>Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0.





In [None]:
# Read the data and examine the columns
wg=Table().read_table(datasite+'earnings2015new.csv')
## adding a column "age" to avoid clumsy typing
wg.append_column('age',['old' if a > 35 else 'young' for a in wg['AGE']])
wg.where('immig',False).where('YEAR',2000).show(10)
wg.where('immig',True).where('YEAR',2000).show(10)
wg.where('immig',False).where('YEAR',2015).show(10)
wg.where('immig',True).where('YEAR',2015).show(10)

## Before we proceed with science, let's discuss the strengths and weaknesses of the data.

1. Did the world change between 2000 and 2015 ?
1. Are we observing the same people in 2000 and 2015 ?
1. Are five year age categories ideal ?



## Productivity measure

Clemens refers to productivity without defining it, which is reasonable because economists all know what it means: productivity = output per unit of input. Since the input is labor, Clemens is talking about *output per unit of labor*. To confuse matters further, his Figure 1 shows **not** productivity but rather two *demand curves for labor* (one for rich countries, the other for poor countries).

Productivity and demand for labor are linked by the "fact" that in a perfect market, the *marginal productivity of labor = the wage*. Thus wages are the key to everything, AND wages are *theoretically* the productivity of the last worker hired, a.k.a. the "marginal productivity of labor".

We should perhaps discuss this in class.

Having convinced ourselves of the above economics "fact" we can compute a measure of *marginal productivity* of various sources of labor--which is *sort of* what Clemens is talking about.
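
As a minimal sketch of the arithmetic behind this measure, the example below computes an hourly wage from annual wages, usual hours per week, and weeks worked. The numbers are hypothetical, not drawn from the lab data, and the handling of zero hours (marking the hourly wage as NaN) is an assumption about how one might treat such records, not something the lab data are known to require.

```python
import numpy as np

# Hypothetical records (NOT from the lab data):
# annual wages, usual hours per week, weeks worked last year
incwage = np.array([52000.0, 31200.0, 0.0, 40000.0])
uhrswork = np.array([40.0, 30.0, 0.0, 50.0])
wkswork = np.array([52.0, 52.0, 0.0, 40.0])

# total hours worked last year
hours = uhrswork * wkswork

# Hourly wage is undefined when no hours were worked, so start from NaN
# and only divide where hours > 0 (an assumed convention for this sketch)
prod = np.full(hours.shape, np.nan)
np.divide(incwage, hours, out=prod, where=hours > 0)

print(prod)              # hourly wages: 25.0, 20.0, NaN, 20.0
print(np.nanmean(prod))  # mean over the three defined values
```

Using `np.nanmean` here mirrors the unweighted table in this lab: records with an undefined hourly wage are simply left out of the mean.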

In [None]:
# computing productivity
wg.append_column('prod',wg["INCWAGE"]/(wg["UHRSWORK"]*wg["wkswork"]))

tab0=wg.select("immig","age","prod").groups(['age','immig'],np.nanmean)
tab0.show()


In [None]:
cquiz("wage0-01")

## Digression on PERWT 

The above table of means of marginal productivity by age and immigrant status is nice BUT as respectable data scientists, we must take note of the following:

* These data comprise a structured random sample rather than a "simple" random sample, and thus it is <b> essential that we take account of weights</b> if we ever want to publish our results.
    
*  In the Great Migration labs, recall that when working with sums, it was proper to divide by .05 since the sample was a 5% simple random sample. Another way of thinking about that is that each observation in that Great Migration lab sample represented 20 people in the US in 1960.  Consequently, giving each observation a "weight" of 20 was appropriate in all computations.  Because *all* observations had a weight of 20, it was not necessary to explicitly include the weights when computing means -- because all those 20s would just cancel out.  NOT SO THIS TIME.

*  In *this* data set, sampling was done according to a complicated set of rules that ensured that the sample would have much broader coverage in terms of geography and ethnicity than would a simple random sample of the same size.  The price we pay for that is that we must take weights into account *even when computing means*. While in the Great Migration sample all weights were "20", in this sample the weights are stored in 'PERWT' and the variation across individuals is considerable.
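
To see why equal weights "cancel out" while unequal weights do not, here is a small sketch with made-up numbers. The wage values and the weights are hypothetical; only the values 20 and 100 echo the typical weights discussed in this lab.

```python
import numpy as np

# Hypothetical hourly wages for three people
x = np.array([10.0, 20.0, 60.0])

# Equal weights (every observation stands for 20 people):
# the weighted mean is identical to the plain mean
equal_w = np.array([20.0, 20.0, 20.0])
print(np.average(x, weights=equal_w), np.mean(x))

# Unequal weights: the first person now stands for 100 people,
# so the weighted mean is pulled toward that person's wage
unequal_w = np.array([100.0, 20.0, 20.0])
print(np.average(x, weights=unequal_w))
```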

In [None]:
# Since the data from 2000 are drawn from a 5% sample, most observations have weights around 20. However,
# the data from 2015 are a 1% sample, so many observations should have weights of around 100 ... and of course,
# there are a few that are way out there.

print(wg.select("PERWT").stats())
wg.select('PERWT').hist(bins=100)

#### So we can improve the table from the cell "Computing productivity" by using  *weighted means* rather than unweighted means.

Weighted means are not hard to compute, but they involve an extra step. The formula is:

$$ \text{Weighted mean} = \frac{\sum_{i=1}^{N}{x_iw_i}}{\sum_{i=1}^{N}{w_i}}$$

where $x_i$ is the $i^{th}$ observation of the variable that you are taking the weighted mean of, and $w_i$ is the weight corresponding to the $i^{th}$ observation. In our present case, $x$ is a measure of productivity and $w$ is 'PERWT'.

Actually, owing to the limitations of the `Table.group` method, computing the weighted means also requires some extra Python -- which slows things down a bit.
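
For comparison, the same formula can be applied group by group with pandas, which this notebook already imports. This is only a sketch on a hypothetical five-row table; the column names imitate the lab data, but the values are invented and this is not the approach the lab itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the lab data (values are invented)
df = pd.DataFrame({
    "immig": [True, True, False, False, False],
    "age":   ["young", "young", "old", "old", "young"],
    "prod":  [10.0, 20.0, 30.0, 50.0, 15.0],
    "PERWT": [20.0, 100.0, 20.0, 20.0, 100.0],
})

# numerator terms of the weighted mean: x_i * w_i
df["xw"] = df["prod"] * df["PERWT"]

# sum(x_i * w_i) and sum(w_i) within each (immig, age) group
sums = df.groupby(["immig", "age"])[["xw", "PERWT"]].sum()

# weighted mean = sum(x_i * w_i) / sum(w_i)
wtd = sums["xw"] / sums["PERWT"]
print(wtd)
```

Summing the numerator and denominator separately and then dividing is the "hard way" from the formula, done once for all groups rather than inside a loop.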

In [None]:
# computing weighted productivity
res=[]
for a in np.unique(wg['age']):
    for i in np.unique(wg['immig']) :
        # create a table that is a subset of wg
        temp=wg.where("immig",i).where("age",a)
        # compute wtd mean the "hard way"
        hard= np.sum(temp['prod']*temp['PERWT'])/np.sum(temp['PERWT'])
        # and the "easy way"
        easy=np.average(temp['prod'],weights=temp['PERWT'])
        print("immig={0} age={1} wtd mean: hard={2} easy={3}".format(i,a,hard,easy))
        res.append({"immig":i,"age":a, "wtdProd":easy})
Table.from_records(res)

In [None]:
cquiz('wage0-02')

## What conclusions might we draw from weighted means of marginal productivities?

##### Do immigrants really lose ground as they gain experience in the US ?
##### What other explanations might there be?

#### Consider this statistic :

$$ \frac{\text{prod}_{\text{immig,old}}}{\text{prod}_{\text{immig,young}}}/  \frac{\text{prod}_{\text{USborn,old}}}{\text{prod}_{\text{USborn,young}}} $$

Can we justify the above statistic as an estimate of the degree to which immigrants lose (or gain) ground relative to US-born workers' wages with experience in US labor markets?
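
A quick worked example may help with reading the statistic. The four weighted means below are hypothetical numbers chosen for easy arithmetic; they are not results from the lab data.

```python
# Hypothetical weighted mean hourly wages (NOT results from the lab data)
prod = {
    ("immig", "young"): 16.0,
    ("immig", "old"): 24.0,
    ("USborn", "young"): 20.0,
    ("USborn", "old"): 40.0,
}

# wage growth from the young group to the old group, within each population
immig_growth = prod[("immig", "old")] / prod[("immig", "young")]      # 24/16 = 1.5
usborn_growth = prod[("USborn", "old")] / prod[("USborn", "young")]   # 40/20 = 2.0

# ratio of ratios: immigrant growth relative to US-born growth
ratio_of_ratios = immig_growth / usborn_growth
print(ratio_of_ratios)  # 0.75
```

In this made-up case immigrant wages rise by 50% while US-born wages double, so the statistic is below 1: immigrants gain in absolute terms but lose ground relative to the US-born. A value of exactly 1 would mean both groups' wages grew by the same factor.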


In [None]:
## Ratio of ratios 
res=dict()
for a in np.unique(wg['age']):
    for i in np.unique(wg['immig']) :
        # create a table that is a subset of wg
        temp=wg.where("immig",i).where("age",a)

        easy=np.average(temp['prod'],weights=temp['PERWT'])
        # store result in dictionary indexed by tuple
        res[(i,a)]=easy
# compute the ratio-of-ratios statistic
(res[True,'old']/res[True,'young']) / (res[False,'old']/res[False,'young'])

In [None]:
cquiz('wage0-03')

 ## And what about uncertainty ?
 
 Do I hear you crying out for a bootstrap?  or what ?
 
 1. Write a function that computes the ratio-of-ratios statistic from a sample of wg records
 1. Write a loop that draws a sample; calls the function to compute statistic; stores result
 1. Interpret result
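
The three steps above can be sketched with plain NumPy on synthetic data. Everything in this example is an assumption made for illustration: the data are randomly generated, the function name `ratio_of_ratios` is hypothetical, and resampling rows uniformly while carrying each row's weight along is a simplification rather than a full treatment of a complex survey design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the lab data (all values are invented)
n = 2000
immig = rng.random(n) < 0.3                          # True = immigrant
old = rng.random(n) < 0.5                            # True = older age group
prod = rng.lognormal(mean=3.0, sigma=0.5, size=n)    # hourly wage
perwt = rng.choice([20.0, 100.0], size=n)            # person weights

def ratio_of_ratios(immig, old, prod, perwt):
    """Weighted ratio-of-ratios statistic computed from parallel arrays."""
    def wmean(mask):
        return np.average(prod[mask], weights=perwt[mask])
    immig_growth = wmean(immig & old) / wmean(immig & ~old)
    usborn_growth = wmean(~immig & old) / wmean(~immig & ~old)
    return immig_growth / usborn_growth

# Step 1 is the function above; step 2 is the resampling loop
boot = []
for _ in range(200):
    # draw n row indices with replacement: one bootstrap resample
    idx = rng.integers(0, n, size=n)
    boot.append(ratio_of_ratios(immig[idx], old[idx], prod[idx], perwt[idx]))

# Step 3: the middle 90% of the bootstrap distribution
lo, hi = np.percentile(boot, [5, 95])
print(lo, hi)
```

Because the synthetic wages are generated without any group differences, the interval here should sit around 1; with real data the location of the interval relative to 1 is the thing to interpret.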

In [None]:
# Write Function that computes statistic
def getStat(wg=wg) :
    res=dict()
    for a in np.unique(wg['age']):
        for i in np.unique(wg['immig']) :
            # create a table that is a subset of wg
            
            temp=wg.where("immig",i).where("age",a)

            easy=np.average(temp['prod'],weights=temp['PERWT'])
            # store result in dictionary indexed by tuple so we can use it later
            res[(i,a)]=easy
    # compute the ratio-of-ratios statistic
    return((res[True,'old']/res[True,'young'])/( res[False,'old']/res[False,'young']))

In [None]:

##### Write loop that..
# draws samples; computes and stores the statistic for each sample
##
bs=[]
for n in np.arange(10):
    # note end='\r'  this will address our impatience.
    print('sampling iteration: {0}'.format(n),end='\r')

    # TODO: replace ????? with a bootstrap resample of wg, then uncomment the next line
    # bs.append(getStat(?????))


In [None]:
### Interpret the result

Table().with_columns('bs',bs).hist()
plt.scatter(np.percentile(bs,[5,95]), [0,0], color='red', s=30)
plt.scatter(getStat(wg),0 ,color='blue',s=30)


In [None]:
cquiz('wage0-04')

## Where do we go next?

So based on the above analysis, we have a result that indicates that overall, immigrants tend to become less productive relative to US-born workers as they gain experience in the US labor market.  Is it just me, or does that also strike you as odd?  Many immigrants arrive in the US unable to speak English and without much in the way of job networks or experience with US workplace norms.  Could language skills and experience in the US really make immigrants *less* productive?

Really?  Let's discuss this in class and agree that you should try our analysis on subsets of the data.


## On your own ...

Copy cells from above as needed and write some additional code to redo our analysis for immigrants from *individual* countries.  Specifically, for the four largest countries of origin, compute the ratio-of-ratios statistic **AND** run the bootstrap procedure to construct 90% confidence bounds.

## Here is a code snippet that you might find valuable

`wg.append_column('keep',[(w['bpl']=="China") | (w['immig'] == False) for w in wg.to_array()])`

Hints:
 - You only need to copy and modify the cells called "Write loop that ..."  and "Interpret the result".
 - You can get away with running only 25 bootstrap samples ... but more is better.
 - The cell below can inform us as to which are the largest origin countries of US immigrants.
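
The idea behind the `keep` snippet, restated as a sketch in pandas on a toy table: for a given origin country, keep the immigrants born there plus all US-born workers, so that the comparison group stays the same for every country. The table contents and the helper name `keep_country` are hypothetical.

```python
import pandas as pd

# Toy table (invented values); 'bpl' is birthplace, 'immig' flags immigrants
toy = pd.DataFrame({
    "bpl":   ["Mexico", "China", "California", "Mexico", "New York"],
    "immig": [True, True, False, True, False],
})

def keep_country(df, country):
    """Return immigrants born in `country` plus all US-born rows."""
    mask = (df["bpl"] == country) | (~df["immig"])
    return df[mask]

subset = keep_country(toy, "China")
print(subset)
```

Wrapping the filter in a function makes it easy to repeat the analysis in a loop over several origin countries.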

In [None]:
# which are the four largest origin countries?
wg.where('immig', True).group('bpl').sort('count',descending=True)

In [None]:
## Here is a blank cell for your work

In [None]:
cquiz('wage0-05')

In [None]:
cquiz('wage0-06')

## That is it for Lab 7

###  Please take the trouble to evaluate so that next year's students' suffering can be reduced.

In [None]:
cquiz('wage0-eval')