Demography 88<br>
Fall 2019<br>
Carl Mason (cmason@berkeley.edu)<br>

# The impact of immigration on wages 

## Outline
1. Discuss IPUMS.org as source for final project data
1. Review TVD results from Great Migration (Chicago) Lab
1. Discuss The Clemens paper
1. Consider what a productivity measuring statistic might look like
    1. Design a productivity based statistic of interest
    1. Compute the statistic
    1. Characterize its uncertainty with a bootstrap simulation


## The Clemens paper

1. Huge gains from removing restrictions on immigration
   1. Reasons why the trillion dollar bill is left on the sidewalk:
1. Assumptions and requirements
   1. migrants from poor to rich countries see large productivity gains
       1. productivity = output per unit of input (in this case, labor); marginal productivity = wage
       1. is productivity a matter of place or of person (migration selectivity)?
   1. gain/loss to others hard to estimate BUT presumed minor/canceling/merely pecuniary
       1. wages might rise in poor countries
       1. wages might fall in rich countries
       1. returns to other factors (capital) might rise in rich countries
       1. returns to other factors (capital) might fall in poor countries
       1. Not all externalities deserve a "pigovian tax"
           1. smoke stacks yes; price/wage changes *no*
   1. lots of models with production functions, general equilibrium, etc.

### A smaller but livelier version of Clemens' figure 1

<a href="https://shiny.demog.berkeley.edu/carlm/EconImmig0/" target="_new"> Neoclassical model of wage effects of immigration</a>



<img src="https://courses.demog.berkeley.edu/mason88/images/clemensFig1.png">



In [None]:
# Run this cell to import the stuff we'll need
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt 
import gc
#plt.style.use('fivethirtyeight')

%matplotlib inline
from datascience import Table
from datascience.predicates import are
from datascience.util import *

from IPython.display import HTML, IFrame, display
datasite="https://courses.demog.berkeley.edu/mason88/data/"
quizsite="https://courses.demog.berkeley.edu/mason88/cgi-bin/quiz.py"
  
def cquiz(qno) : 
    import IPython, requests 
    try:
        sid
    except NameError: 
        print("HEY! did you enter your sid way up at the top of this notebook?")
        # stop here: without sid the quiz link below cannot be built
        return
    Linkit='{0}?qno={1}&sid={2}'.format(quizsite,qno,sid)
    #print(Linkit)
    html = requests.get(Linkit)
    #display(IFrame(Linkit, 1000, 300))
    display(IFrame(Linkit, 1000, 400))


    
######################
# Here it is ... the obvious place to put your student id
sid=""
######################
if sid == "" :
    print("HEY! didn't I tell you to put your sid in the obvious place")
 


In [None]:
### #### #### ####
###  SELECTING  LAB  PARTNERS
### ### #### ####

N=8
numbers=np.arange(N)
print(np.mod(N,2))
if (not np.mod(N,2) == 0) :
    numbers=np.append(numbers,"lucky")
    N+=1
numbersTab=Table().with_column('n',numbers)
randomized=numbersTab.sample(k=N,with_replacement=False)
selection=randomized['n']
selection.shape = (2,int(N/2))
Table().with_columns('zero',selection[0],'one',selection[1]).show()

In [None]:
cquiz('wage01-partners')


## The data for this week

1. Residents of California, New York, Illinois, or Texas in 2000 and 2015
1. Two age groups, one per year: 25-30 years old (in 2000) vs 40-45 (in 2015)
1. Immigrants who arrived at ages 20-25
1. productivity = total annual wages/(usual hours per week * weeks worked last year)
1. educ4 = education level coded into 5 mutually exclusive groups
1. Data come from American Community Survey via IPUMS
1. Data include weights which we must use.
>Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0.





In [None]:
# Read the data and examine the columns

wg=Table().read_table(datasite+'earnings2015new.csv')
## changing 'immig' to string instead of logical
wg.append_column('immig',['immigrant' if im else 'USborn' for im in wg['immig']])
## adding a column "age" to avoid clumsy typing
wg.append_column('age',['old' if a > 35 else 'young' for a in wg['AGE']])
wg.where('immig','immigrant').where('YEAR',2000).show(10)
wg.where('immig','USborn').where('YEAR',2000).show(10)
wg.where('immig','immigrant').where('YEAR',2015).show(10)
wg.where('immig','USborn').where('YEAR',2015).show(10)

## Cleaning up the birth place variable

#### with a very useful python trick

In [None]:
## tidy up the 'bpl'  variable --with a clever python trick
bpl_modified=['USA' if w['immig'] == "USborn" else w['bpl'] for w in wg.to_array()]

wg.append_column('bpl',bpl_modified)
## tidy up immig variable no fancy trick needed

wg.pivot('immig','bpl').sort('USborn',descending=True)

## Before we proceed with science, let's discuss the strengths and weaknesses of the data.

1. Did the world change between 2000 and 2015 ?
1. Are we observing the same people in 2000 and 2015 ?
1. Are five year age categories ideal ?



## Productivity measure

Clemens refers to productivity without defining it, which is reasonable because economists all know what it means: productivity = output per unit of input. Since the input is labor, Clemens is talking about *output per unit of labor*. To confuse matters further, his figure 1 shows **not** productivity but rather two *demand curves for labor* (one for rich countries, the other for poor countries).

Productivity and demand for labor are linked by the "fact" that in a perfect market, the *marginal productivity of labor = the wage*. Thus wages are the key to everything, AND wages are *theoretically* the productivity of the last worker hired, aka the "marginal productivity of labor".

We should perhaps discuss this in class.

Having convinced ourselves of the above economics "fact" we can compute a measure of *marginal productivity* of various sources of labor--which is *sort of* what Clemens is talking about.
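
As a minimal sketch of the arithmetic, using made-up numbers rather than the lab data: hourly wage is annual wages divided by annual hours (usual hours per week times weeks worked). The `hourly_wage` helper and its guard against zero annual hours are illustrative assumptions, not part of the lab code.

```python
import numpy as np

def hourly_wage(annual_wage, hours_per_week, weeks_worked):
    """Annual wages divided by annual hours; NaN where annual hours are zero."""
    annual_wage = np.asarray(annual_wage, dtype=float)
    annual_hours = (np.asarray(hours_per_week, dtype=float)
                    * np.asarray(weeks_worked, dtype=float))
    out = np.full(annual_wage.shape, np.nan)
    # divide only where annual hours are positive; other entries stay NaN
    np.divide(annual_wage, annual_hours, out=out, where=annual_hours > 0)
    return out

# Three hypothetical workers; the third reports zero weeks worked.
wages = hourly_wage([52000, 20800, 10000], [40, 20, 40], [52, 52, 0])
print(wages)
```

The first two workers earn 25 and 20 dollars per hour. The third has no defined hourly wage, which is why the sketch returns NaN instead of dividing by zero.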

In [None]:
# computing productivity
wg.append_column('prod',wg["INCWAGE"]/(wg["UHRSWORK"]*wg["wkswork"]))

wg.select(['age','immig','prod'])


## Digression on PERWT 

The marginal productivity numbers that we just computed are nice BUT as respectable data scientists, we must take note of the following:

* These data comprise a structured random sample rather than a "simple" random sample, and thus it is <b> essential that we take account of weights</b> if we ever want to publish our results.
    
*  In the Great Migration labs, recall that when working with sums, it was proper to divide by .05 since the sample was a 5% simple random sample. Another way of thinking about that is that each observation in that Great Migration lab sample represented 20 people in the US in 1960.  Consequently, giving each observation a "weight" of 20 was appropriate in all computations.  Because *all* observations had a weight of 20, it was not necessary to explicitly include the weights when computing means -- because all those 20s would just cancel out.  NOT SO THIS TIME.

*  In *this* data set, sampling was done according to a complicated set of rules that ensured that the sample would have much broader coverage in terms of geography and ethnicity than would a simple random sample of the same size.  The price we pay for that is that we must take weights into account *even when computing means*. While in the Great Migration sample all weights were 20, in this sample the weights are stored in 'PERWT' and they vary considerably across individuals.
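
The cancellation argument can be checked with a small sketch. The three wages and both sets of weights are hypothetical. When every record carries the same weight, the weighted mean equals the plain mean. When weights differ, it does not.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])        # hypothetical hourly wages

equal_w = np.array([20, 20, 20])        # every record stands for 20 people
unequal_w = np.array([20, 100, 100])    # records stand for different numbers of people

plain = np.mean(x)
wtd_equal = np.average(x, weights=equal_w)      # the 20s cancel
wtd_unequal = np.average(x, weights=unequal_w)  # heavier records pull the mean
print(plain, wtd_equal, wtd_unequal)
```

With unequal weights the mean moves toward the records that represent more people, so ignoring the weights would give a biased answer.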

In [None]:
# Since the data from 2000 are drawn from a 5% sample, most observations have weights around 20; however,
# the data from 2015 are a 1% sample, so many observations should have weights of around 100 ... and of course,
# there are a few that are way out there.

print(wg.select("PERWT").stats())
wg.select('PERWT').hist(bins=100)

#### So we can improve our measure of marginal productivity by using  *weights*


Weighted means are not hard to compute, but they involve an extra step. The formula is:

$$ \text{Weighted mean} = \frac{\sum_{i=1}^{N}{x_iw_i}}{\sum_{i=1}^{N}{w_i}}$$

where $x_i$ is the $i^{th}$ observation of the variable that you are taking the weighted mean of, and $w_i$ is the weight corresponding to the $i^{th}$ observation. In our present case, $x$ is a measure of productivity and $w$ is 'PERWT'.

Actually, owing to the limitations of the Table.group method, computing the weighted means also requires some extra Python -- which slows things down a bit.



## Computing the weighted mean of productivity

Fill in the ???

In [None]:
## The weighted mean of "productivity"
'''
## The hard way:
hardway=np.sum(wg[???]*wg[???]/np.sum(wg[???]))
## The easy way:
easyway=np.average(wg['prod'],weights=wg['PERWT'])
print("hardway: {0}  easyway: {1}".format(hardway,easyway))
'''

## But what does  27.196  tell us?

The weighted mean productivity of what ? 

In [None]:
# computing weighted productivity for groups whose wtd productivity we wish 
# to compare
pdict=dict()
for a in ['young','old']:
    for i in ['immigrant','USborn'] :
        # create a table that is a subset of wg
        temp=wg.where("immig",i).where("age",a)
      
        easy=np.average(temp['prod'],weights=temp['PERWT'])
      
        pdict[(i,a)]=easy

        
print(pdict)

In [None]:
## Note that pdict is a dictionary, not a table. This means that we
## can reference individual cells much more easily, e.g.:
pdict['USborn','old']

## What conclusions might we draw from weighted means of marginal productivities?

#### Are "old" immigrants in 2015  more disadvantaged relative to USborns than are "young" immigrants in 2000 ?

#### Consider this statistic :

$$ \frac{\text{prod}_{\text{immig,old}}}{\text{prod}_{\text{immig,young}}}/  \frac{\text{prod}_{\text{USborn,old}}}{\text{prod}_{\text{USborn,young}}} $$

Can we justify the above statistic as an estimate of the degree to which immigrants lose (or gain) ground, in relation to USborn wages, with experience in US labor markets?
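
To get a feel for how the statistic behaves, here is a small sketch with hypothetical weighted mean wages. The `toy` dictionary and the `ratio_of_ratios` helper are made up for illustration and are not results from the lab data.

```python
# Hypothetical weighted mean hourly wages, keyed by (nativity, age group).
toy = {
    ('immigrant', 'young'): 16.0,
    ('immigrant', 'old'):   24.0,
    ('USborn',    'young'): 20.0,
    ('USborn',    'old'):   36.0,
}

def ratio_of_ratios(p):
    """Immigrant wage growth (old/young) divided by US-born wage growth."""
    immigrant_growth = p['immigrant', 'old'] / p['immigrant', 'young']
    usborn_growth = p['USborn', 'old'] / p['USborn', 'young']
    return immigrant_growth / usborn_growth

stat = ratio_of_ratios(toy)
print(stat)
```

With these made-up numbers the statistic is about 0.83. Read another way, the immigrant/USborn wage ratio is 0.80 among the young and about 0.67 among the old, so a value below 1 means immigrants lost ground and a value above 1 means they gained.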


In [None]:


(pdict['immigrant','old']/pdict['immigrant','young'])/\
    (pdict['USborn','old']/pdict['USborn','young'])

## Well that looks like the answer ... 

####  What does 0.844 mean  ?

So based on the above analysis, we have a result that indicates that, overall, immigrants tend to become less productive relative to US born workers as they gain experience in the US labor market. Is it just me, or does that also strike you as odd?  Many immigrants arrive in the US unable to speak English and without much in the way of job networks or experience with US workplace norms.  Could language skills and experience in the US really make immigrants *less* productive?

Really ?  Let's discuss this in class 

 ## And what about uncertainty ?
 
 ##### Could this 0.844 be a chance event due to random sampling?
 

#### Do I hear you crying out for a simulation?  or what ?
 
In previous labs, the null hypothesis was that random chance alone explained the difference between the observed and expected quantities.  That allowed us to construct a model of the null hypothesis based on random sampling, which could show us how life would be if the null hypothesis were indeed operating.

It's a little bit different in the present case.  

* What is the null hypothesis in this case?
* What is the source of random error ?
 
### What can we do?

 1. Write a function that computes the ratio-of-ratios statistic from a sample of wg records
 1. Write a loop that draws a sample; calls the function to compute statistic; stores result
 1. Interpret result
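
The three steps above can be sketched on synthetic data with numpy alone. Everything here is an assumption for illustration: the arrays `prod`, `perwt`, `immig`, and `old` are randomly generated stand-ins for the lab columns, `toy_stat` is a hypothetical analogue of the lab's statistic function, and resampling is done on row indices rather than with the Table methods used in the lab.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

# Synthetic stand-in for the lab data: hourly wage, person weight, group flags.
n = 2000
immig = rng.random(n) < 0.3
old = rng.random(n) < 0.5
prod = rng.lognormal(mean=3.0, sigma=0.5, size=n)
perwt = rng.choice([20, 100], size=n)

def toy_stat(idx):
    """Ratio-of-ratios of weighted mean wages for the rows listed in idx."""
    def wmean(mask):
        rows = idx[mask[idx]]   # keep sampled rows in this group, with duplicates
        return np.average(prod[rows], weights=perwt[rows])
    return (wmean(immig & old) / wmean(immig & ~old)) / \
           (wmean(~immig & old) / wmean(~immig & ~old))

all_rows = np.arange(n)
observed = toy_stat(all_rows)

# Bootstrap: resample row indices with replacement, recompute the statistic.
boot = np.array([toy_stat(rng.choice(all_rows, size=n, replace=True))
                 for _ in range(200)])
low, high = np.percentile(boot, [5, 95])
print(observed, low, high)
```

Because the synthetic wages do not depend on nativity or age, the statistic should land near 1. The spread between the 5th and 95th percentiles shows how much it can wander from sampling alone.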

In [None]:
# A Function that computes statistic
def getStat(wg=wg) :
    res=dict()
    for a in ['old','young']:
        for i in ['immigrant','USborn'] :
            # create a table that is a subset of wg
            
            temp=wg.where("immig",i).where("age",a)

            easy=np.average(temp['prod'],weights=temp['PERWT'])
            # store result in dictionary indexed by tuple so we can use it later
            res[(i,a)]=easy
    # compute the ratio-of-ratios statistic
    return((res['immigrant','old']/res['immigrant','young'])/\
           ( res['USborn','old']/res['USborn','young']))

In [None]:

##### Write loop that..
# draws samples; computes and stores the statistic for each sample
##
bs=[]
for n in np.arange(10):
    # note end='\r'  this will address our impatience.
    print('sampling iteration: {0}'.format(n),end='\r')

    "bs.append(getStat(?????))"


In [None]:
### Interpret the result

Table().with_columns('bs',bs).hist()
plt.scatter(np.percentile(bs,[5,95]), [0,0], color='red', s=30)
plt.scatter(getStat(wg),0 ,color='blue',s=30)
plt.xlabel('Simulated statistic')

###  Do you think education matters ?

Do wages of immigrants with more or less education catch up faster to non-immigrants with similar education levels?

Let's do the analysis separately by education level

In [None]:

#Important Cell: Education
## store results of bootstrap
bs=dict()
stat=dict()
for cat in np.unique(wg['educ4']) :
    bs[cat]=[]
    # subsample of data for analysis
    temp=wg.where('educ4',cat)
    # observed stat
    stat[cat]=getStat(temp)
    ## do some bootstraps ... more is better
    for n in np.arange(10):
        print('sampling iteration category ={0}: {1}'.format(cat,n),end='       \r')
        bs[cat].append(getStat(temp.sample()))

# show results        
for cat in bs.keys():
    
    Table().with_columns(cat,bs[cat]).hist()
    plt.scatter(np.percentile(bs[cat],[5,95]), [0,0], color='red', s=30)
    plt.scatter(stat[cat], 0,color='black', s=30)
    plt.title("")


## On your own ...

So we have learned that the trajectory of economic assimilation of immigrants is complicated. Overall, since 2000 we found that immigrants *fell behind US born workers* in that the wage gap between young newly arrived immigrants and similarly aged US borns in 2000 was _*smaller*_ than the corresponding gap between older immigrants and US borns in 2015. 


But if we run the analysis on subsamples based on education, there is a hint that those with more education fared a bit better -- but the real losers are those with a high school degree.
But it's not perfectly clear because as the sample sizes get smaller, the uncertainty increases. 
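
The link between sample size and uncertainty can be illustrated with a short sketch. The synthetic wage population, the two sample sizes, and the `boot_interval_width` helper are all assumptions for illustration. The sketch uses a plain mean rather than the lab's statistic to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the sketch is repeatable

def boot_interval_width(sample, n_boot=500):
    """Width of the 5th-95th percentile bootstrap interval for the sample mean."""
    means = [rng.choice(sample, size=len(sample), replace=True).mean()
             for _ in range(n_boot)]
    low, high = np.percentile(means, [5, 95])
    return high - low

population = rng.lognormal(mean=3.0, sigma=0.5, size=100_000)  # synthetic wages
small = rng.choice(population, size=100, replace=False)
large = rng.choice(population, size=10_000, replace=False)

width_small = boot_interval_width(small)
width_large = boot_interval_width(large)
print(width_small, width_large)
```

The interval from the sample of 100 is much wider than the one from the sample of 10,000, which is the pattern to expect when the data are split into small education or birthplace subgroups.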

Question: Does this mean that people without a high school degree did better than those with a diploma?


The final exercise in this lab is to run the same bootstrap analysis as we did for education ... for the four largest immigrant groups.  That is, run the same bootstrap code as we did in "Important Cell: Education" BUT subset the data by country of origin rather than by education level.  


You'll do this by copying the "Important Cell: Education" and modifying it so that it selects subsets that include only those who were born in a particular country AND those who were born in the US. 

The cell below will help you find the four largest immigrant sending countries.



Hints:
 - You only need to copy and modify the cell called "Important Cell: Education".
 - Test your code running only 2 bootstraps for each country of origin -- but once it works, run 100 trials on each (get coffee).
 - The cell below can inform us as to which are the largest origin countries of US immigrants.
 - The cell below the cell below has some code snippets that you are very likely to find helpful

In [None]:
# this is the "cell below"
# which are the four largest countries of origin?

wg.where('immig', 'immigrant').group('bpl').sort('count',descending=True)

In [None]:
## this is the "cell below the cell below"
## Could be useful .. check this out

foo=wg.where('bpl','China').with_rows(wg.where('bpl','USA').to_array())
foo.group('bpl')

In [None]:
cquiz('wage01-prod')

In [None]:
cquiz('wage01-ratrat')

In [None]:
cquiz('wage01-bs')

In [None]:
cquiz('wage01-bias')

In [None]:
cquiz('wage01-adjusted')

## That is all for Lab 7

###  Please take the trouble to evaluate so that next year's students' suffering can be reduced.

In [None]:
cquiz('wage01-eval')