Demography 88<br>
Fall 2019<br>
Carl Mason (cmason@berkeley.edu)<br>
## The Great Migration (part 2) : Chicago

In this lab we look at the characteristics of migrants from Southern states to Northern states in 1960. In the first Great Migration lab, we looked at the conditions of people (particularly African American people) who *lived* in a southern state. This week, we will look at people who were *born* in a southern state, and we'll compare the characteristics of southern-born people **who no longer live in the south** with those who continue to reside in the state of their birth. In other words, we're going to compare *migrants* with *non-migrants*. To be more tediously precise, we're going to compare the characteristics of migrants with those *at risk of migrating*, that is, with the universe of people born in the particular state (actually, as we'll see, it's a bit more complicated than that).

The data that we will use in this week's exercise is very similar to what we used in the first Great Migration lab. You'll notice that the variables have similar names (although there are fewer of them this time). Like last time, the UPPER CASE variables are drawn directly from IPUMS, whereas the lower case variables have been modified slightly by your instructor -- generally by replacing numbers with words.

ALSO NOTE that this is a five percent simple random sample, so no weighting is required. Counts (sums) taken from the sample should be divided by 0.05 (equivalently, multiplied by 20) in order to get an *estimate* of the full 1960 value.
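As a quick illustration of the scaling rule above, here is a minimal sketch in plain Python. The function name `estimate_population_total` and the sample count of 1,200 are made up for illustration; only the 0.05 sampling fraction comes from this lab.

```python
SAMPLING_FRACTION = 0.05  # five percent sample, as stated above

def estimate_population_total(sample_count, sampling_fraction=SAMPLING_FRACTION):
    """Scale a count observed in the sample up to an estimate for the full population."""
    return sample_count / sampling_fraction

# Hypothetical example: 1,200 sampled people fall in some category
print(round(estimate_population_total(1200)))
```

Proportions and means need no such adjustment, since the sampling fraction cancels out of the numerator and denominator.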

### Run the next few cells ... then let's discuss what we can do with these data

In [None]:
# Run this cell to import the stuff we'll need
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt 
#plt.style.use('fivethirtyeight')
import gc

%matplotlib inline
from datascience import Table
from datascience.predicates import are
from datascience.util import *

from IPython.display import HTML, IFrame, display
datasite="https://courses.demog.berkeley.edu/mason88/data/gmig1/"
quizsite="https://courses.demog.berkeley.edu/mason88/cgi-bin/quiz.py"
  
def cquiz(qno) : 
    import IPython, requests 
    try:
        sid
    except NameError: 
        print("HEY! did you enter your sid way up at the top of this notebook?")
        return  # without a sid we cannot build the quiz link
    Linkit='{0}?qno={1}&sid={2}'.format(quizsite,qno,sid)
    #print(Linkit)
    html = requests.get(Linkit)
    #display(IFrame(Linkit, 1000, 300))
    display(IFrame(Linkit, 1000, 400))


    
######################
# Here it is ... the obvious place to put your student id
sid=""
######################
if sid == "" :
    print("HEY! didn't I tell you to put your sid in the obvious place")
 


In [None]:
### #### #### ####
###  SELECTING  LAB  PARTNERS
### ### #### ####

N=8  # count from zero; add one to the highest number
numbers=np.arange(N)
print(np.mod(N,2))
if (not np.mod(N,2) == 0) :
    numbers=np.append(numbers,"your choice")
    N+=1
numbersTab=Table().with_column('n',numbers)
randomized=numbersTab.sample(k=N,with_replacement=False)
selection=randomized['n']
selection.shape = (2,int(N/2))
Table().with_columns('Excellent Partner 1',selection[0],'Excellent Partner 2',selection[1]).show()

In [None]:
cquiz('greatmig02-partners')

## Reading the data

The first time through, in order to make sure that your code is doing the right thing, we will all
use data from Mississippi -- that's where Clarksdale is so it's particularly of interest -- later on, once
your code is solid,  you'll come back up to the next cell and modify it to select a random state.

In [None]:
## Call this cell "Reading the Data"
## Read a file that contains the names of all the state level data sets
files=Table().read_table(datasite+'States1960.csv')
## choose a file at random
randomState=files.sample(k=1)[1][0]
## ## ## ## ## ##
## FIRST TIME THROUGH WE'RE ALL GOING TO LEARN ABOUT MISSISSIPPI
## later you will comment out the next line in order to read a randomly selected state

randomState="Mississippi1960.csv"
print(randomState)

## read the data corresponding to the state that fate has determined that you should know more about

# NOTE -- after reading a state1960.csv file one time, you may uncomment and modify the following line to
# make sure that you read that same (randomly assigned) state in subsequent sessions.  It is helpful to do this
# if you don't have enough time to do the entire lab in one sitting OR if you are prone to having your
# notebook restart for want of RAM

# randomState="??????1960.csv"

bp60=Table().read_table(datasite+randomState)
# delete observations for which nothing is known about migration status
bp60=bp60.where('migrate5',are.not_containing('nan'))
bp60.show(5)

## Examine the data

Just as in previous labs, let's explore the variables that we have read in. Since these data are also drawn from the IPUMS public use sample of the 1960 US Census, the names and codes have the same meanings as they did in the first Great Migration lab. Since our questions are more focused -- and our RAM is constrained -- there are fewer variables in this week's data set. But as you are always thinking about your final project... be aware that all the variables from the first Great Migration lab (and more) are available from ipums.org, which you should cite in the following way.

>Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0.

1. 'statefip' state in which respondent currently resides
1. 'bpl' birthplace of respondent (should be the same for all observations)
1. 'edyears' years of education, constructed from 'educ'
1. 'educ' highest level of education thus far attained
1. 'marst' marital status
1. 'CHBORN' children ever born
1. 'sex' sex of respondent
1. 'AGE' age of respondent
1. 'labforce' in labor force means either employed or seeking work
1. 'race' African-American or White
1. 'migrate5' did respondent live in the same house/state five years ago
1. 'migplac5' state of residence of respondent five years ago (1955)
1. 'region' census region of current residence (see map below)

<img src="http://courses.demog.berkeley.edu/mason88/images/Census_Regions_and_Division_of_the_United_States.png" style="width:400px">

In [None]:
list(bp60)

In [None]:
## Crosstabs and descriptive statistics
import re

## list of numerical columns
nums=[i for i in list(bp60) if bp60[i].dtype == 'int64']
## list of categorical columns
## (string columns have dtypes such as '<U12'; a raw string avoids an invalid-escape warning)
cats=[i  for i in list(bp60) if re.match(r'<',str(bp60[i].dtype))]

#descriptive statistics for things numerical
bp60.select(nums).stats().show()
#frequencies of things categorical
for vname in cats:
    print("\n\n"+vname+"\n")
  
    bp60.select([vname]).group(vname).show()


# What shall we do with these data?

### We are interested in the decision to migrate...how shall we observe that ?
* Can we use migrate5 and migplac5?


### Which characteristics of people in the data might influence the decision to migrate?

### How can we structure the question in such a way that we might be able to answer it with the data at hand?
Can we use the Jury Selection problem in https://www.inferentialthinking.com/chapters/11/2/Multiple_Categories that you **just completed in Data8?**

1. How is African-American migration from the South to the North analogous to Jury Selection?
   1. What problems do you see?
1. Is education analogous to race?
   1. What problems do you see ? 
1. How does our data differ in *structure* from the data in the Jury Selection example?
   1. What will we have to do to compensate ?
   
### Let's outline the steps



### Suppose we observe a TVD that greatly exceeds all or most of the bootstrap TVDs that we generate. 
What is the strongest correct statement that we can make ?

## Towards a table of empirical distributions

### The Jury Selection example begins with a table of empirical distributions called 'jury'
https://www.inferentialthinking.com/chapters/11/2/Multiple_Categories#composition-of-panels-in-alameda-county

Our data, however, begin as observations on individuals, so a reasonable first-ish step is to create a table of 
empirical distributions comparing the education levels of those who migrate with those who do not.  To build that table, we'll obviously need to identify which observations correspond to migrants and which do not.
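One way to think about flagging individuals is sketched below in plain Python. The records and the key names (`lived_in_birth_state_1955`, `lives_in_north_1960`) are hypothetical stand-ins, not columns of the lab data; the point is only that multiplying two True/False values behaves like a logical AND, and that the mean of a 0/1 flag is a proportion.

```python
# Hypothetical individual-level records (not real census rows)
people = [
    {"lived_in_birth_state_1955": True,  "lives_in_north_1960": True},
    {"lived_in_birth_state_1955": True,  "lives_in_north_1960": False},
    {"lived_in_birth_state_1955": False, "lives_in_north_1960": True},
    {"lived_in_birth_state_1955": True,  "lives_in_north_1960": False},
]

# Multiplying two booleans gives 1 only when both conditions hold
flags = [p["lived_in_birth_state_1955"] * p["lives_in_north_1960"] for p in people]

# The mean of a 0/1 flag is the share of rows that are flagged
proportion_flagged = sum(flags) / len(flags)
print(flags, proportion_flagged)  # [1, 0, 0, 0] 0.25
```

The same idea carries over to NumPy arrays, where the multiplication and the mean are applied to whole columns at once.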



In [None]:
# Building the table of empirical distributions

# Let's create a dependent variable that takes on the values
##  1 if the person qualifies as a migrant : 
##           -- born in MS (all observations)
##           -- Was resident of MS  5 years prior (1955)
##           -- was resident of East North Central Div. (includes Chicago) in 1960
##  0 if the person did not migrate 

## The raw materials are clearly statefip, bpl, migrate5, migplac5
# Chicago and many other places mentioned by Lemann are in the "East North Central Div." 
# (region)
# for everything to work just right, it would be nice if we got 
# 0.01442847598333973 as the proportion of
# newMigrants from MS -- that is, np.mean(bp60['newMigrant']) should = 0.014428....
# 
"""
# 
# 'region' refers to current location; Chicago is in "East North Central Div."
#  'bpl' is birthplace (state)
#  'migplac5' is state of residence 5 years prior (1955)
# let's define "North" as consisting of these regions
north=["East North Central Div.","Middle Atlantic Division","Pacific Division"]
bp60.append_column('north_res',[reg in north for reg in bp60['region']])

bp60.append_column('newMigrant',(?==?) * (bp60['north_res']))
np.mean(bp60['newMigrant'])
"""

In [None]:
18551+1686+3663

## Interpret the pivot table of newMigrant v region(of current residence)

In [None]:
# pivot table of newMigrant
## Consider this table as you answer the question below
bp60.where('race','Black/African American/Negro').pivot('region','newMigrant').show()


## How might we improve our code and definition of newMigrant?

What should we do about people in our sample who don't live in MS or in what we have defined as the North? Who are these people who show up as False for newMigrant but nonetheless with a current region that is neither East South Central (i.e. MS) nor one of the "North" regions?
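To make the question concrete, here is a small sketch in plain Python that cross-tabulates a migrant flag against region and then counts the rows that fit neither group. The rows are invented, and the region labels other than the three "North" regions used in this lab are assumed spellings.

```python
from collections import Counter

# Hypothetical (newMigrant, region) pairs; not real census rows
rows = [
    (False, "East South Central Div."),
    (False, "East South Central Div."),
    (True, "East North Central Div."),
    (True, "Pacific Division"),
    (False, "West South Central Div."),
    (False, "South Atlantic Division"),
]

north = {"East North Central Div.", "Middle Atlantic Division", "Pacific Division"}
home_region = "East South Central Div."

# Cross-tabulate: how many rows fall in each (flag, region) cell
crosstab = Counter(rows)
for (flag, region), n in sorted(crosstab.items()):
    print(flag, region, n)

# Rows that are not new migrants yet live outside both the home region and the North
leftover = [r for r in rows if not r[0] and r[1] != home_region and r[1] not in north]
print(len(leftover))
```

In this toy version the leftover rows are people who moved somewhere other than the "North"; in the real data, region alone is a coarse filter, since several states share a region with MS.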


In [None]:
## deleting observations of people who are neither non-migrants nor new migrants
# (for example, folks who left MS before 1955)
bp60.append_column('nonMigrant',(bp60['bpl']==bp60['statefip']))
bp60.append_column('keep',np.logical_or(bp60['newMigrant'], bp60['nonMigrant']))
bp60=bp60.where('keep',True)
bp60.pivot('newMigrant','region')

### Let's compare "new-migrants" with eligibles

At last we are ready to construct a table analogous to the 'jury' table: a table showing the empirical distributions of education for migrants and *eligibles*.


getEdist() takes a table and the name of one of its columns and returns a table of the empirical distribution of that column.
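The idea of an empirical distribution can be sketched without any table library. The function `empirical_distribution` and the list of years of education below are hypothetical; the sketch simply counts each value and divides by the total.

```python
from collections import Counter

def empirical_distribution(values):
    """Return {value: proportion of observations with that value}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: counts[value] / total for value in sorted(counts)}

# Hypothetical years of education for eight people
edyears = [8, 8, 12, 6, 12, 8, 10, 12]
dist = empirical_distribution(edyears)
print(dist)  # {6: 0.125, 8: 0.375, 10: 0.125, 12: 0.375}
```

The proportions always sum to one, which is what lets us compare groups of very different sizes.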

In [None]:
def getEdist(dat,column):
    """
    Expects a table of data and the name of a column, returns a table of the empirical frequency distribution
    of the column
    """
    tab0=dat.select(column).group(column)
    tab0.append_column('pct',tab0['count']/np.sum(tab0['count']))
    return(tab0)

In [None]:
# Important Cell number 1
tempMig=getEdist(bp60.where('race','Black/African American/Negro').where('newMigrant',1),'edyears')

# just like in the jury selection example - where anyone who could be a juror is included
# in among the 'eligibles'
tempEligible=getEdist(bp60.where('race','Black/African American/Negro'),'edyears')
EdDist=tempMig.join('edyears',tempEligible)
EdDist.relabel('pct','migrants').relabel('pct_2','eligibles').relabel('count','migrantsN').\
    relabel('count_2','eligiblesN')
EdDist.show()
EdDist.select(['edyears','migrants','eligibles']).barh('edyears')

## Compute the total variation distance between the education level of new migrants and that of the at-risk population (those born in Mississippi)

Use the functions below which were borrowed from the Jury Selection example ...
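For intuition, here is a tiny worked example of the total variation distance in plain Python: half the sum of the absolute differences between two distributions. The two distributions are invented, and the dict-based `tvd` helper is a stand-in for illustration, not the function used in this lab.

```python
def tvd(dist_a, dist_b):
    """Total variation distance between two distributions given as dicts."""
    categories = set(dist_a) | set(dist_b)
    # A category missing from one distribution counts as proportion 0 there
    return sum(abs(dist_a.get(c, 0.0) - dist_b.get(c, 0.0)) for c in categories) / 2

# Hypothetical distributions over three education groups
migrants = {"0-7": 0.25, "8-11": 0.50, "12+": 0.25}
eligibles = {"0-7": 0.50, "8-11": 0.375, "12+": 0.125}

# |0.25-0.50| + |0.50-0.375| + |0.25-0.125| = 0.5, and half of that is 0.25
print(tvd(migrants, eligibles))  # 0.25
```

A value of 0 means the two distributions are identical; a value of 1 means they share no categories at all.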

In [None]:



## Borrowed from Jury Selection example
## https://www.inferentialthinking.com/chapters/10/1/jury-selection.html
def total_variation_distance(distribution_1, distribution_2):
    return np.abs(distribution_1 - distribution_2).sum()/2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

In [None]:
## Use the function(s) above to compute the TVD between the two empirical distributions
## in the EdDist table
EdDist.show(5)
## uncomment and fill in the ???
"""
ObservedTVD=table_tvd(EdDist,'??????','??????')
print("The observed TVD between migrants and eligibles is {}".format(ObservedTVD))
"""
print(randomState)


# Is the total variation distance an important measure of something?

In the language of the Jury Selection problem: are the "new migrants" a "representative sample" of those born in Mississippi? Just as the Jury Selection example is limited to one characteristic (in that case race/ethnicity), in our case we are considering only years of education.


#### Use the same technique that you learned in the foundation class to compute 5000 TVDs between the observed empirical distribution of years of education of new migrants and randomly selected observations of Mississippi-born persons (aka the at-risk population)

### In order to get the right answer to the question below, make sure that you run the np.random.seed(13531) command once before generating your 5000 trials.
Your instructor believes that the correct value of MAX TVD under these conditions is: 
0.08836562703788241
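The general shape of such a simulation is sketched below in plain Python, under stated assumptions: the distributions, the panel size of 200, and the helper names are all hypothetical, and the sketch uses Python's own random number generator rather than NumPy's. Because of that, its numbers will not match the instructor's values even with the same seed.

```python
import random

def tvd(dist_a, dist_b):
    """Total variation distance between two distributions given as dicts."""
    categories = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(c, 0.0) - dist_b.get(c, 0.0)) for c in categories) / 2

def simulate_tvds(base_dist, panel_size, repetitions, seed):
    """Draw random panels from base_dist and return each panel's TVD from base_dist."""
    rng = random.Random(seed)
    categories = list(base_dist)
    weights = [base_dist[c] for c in categories]
    tvds = []
    for _ in range(repetitions):
        panel = rng.choices(categories, weights=weights, k=panel_size)
        panel_dist = {c: panel.count(c) / panel_size for c in categories}
        tvds.append(tvd(panel_dist, base_dist))
    return tvds

# Hypothetical distributions of years of education (not lab results)
eligibles = {"0-7": 0.50, "8-11": 0.375, "12+": 0.125}
migrants = {"0-7": 0.25, "8-11": 0.50, "12+": 0.25}

observed = tvd(migrants, eligibles)
simulated = simulate_tvds(eligibles, panel_size=200, repetitions=1000, seed=13531)

# Share of simulated TVDs at least as large as the observed one
p_value = sum(t >= observed for t in simulated) / len(simulated)
print(observed, max(simulated), p_value)
```

Fixing the seed makes the run repeatable: calling `simulate_tvds` again with the same arguments returns the same list.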

In [None]:
# Let's remind ourselves what the EdDist table looks like
EdDist.show(5)

In [None]:
# Important cell number 2

## TVDs 
'''
panel_size = np.sum(EdDist['migrantsN'])
repetitions = 5000

tvds = make_array()
np.random.seed(13531)

for i in np.arange(repetitions):


    new_sample = proportions_from_distribution(EdDist,'??????',panel_size)

    tvds = np.append(tvds, table_tvd(new_sample,'???????', 'Random Sample'))

results = Table().with_column('TVD', tvds)

print('Max Simulated TVD: {0}\n should be same every time if np.random.seed() is set otherwise should be different'.\
      format(np.max(results['TVD'])))

observed_tvd=table_tvd(EdDist,'eligibles','migrants')

print("OBSERVED TVD:{0}".format(observed_tvd))

results.hist()
plt.scatter(observed_tvd, 0, color='red', s=30)


np.mean(results['TVD'] >= observed_tvd)
'''


## Now we have the code in place and tested ... perhaps we should refine our code/question a bit.

At this point, you should have a nice chunk of code for assessing total variation distance between empirical distributions of educational attainment.  But we overlooked (at least) one thing: AGE

In [None]:

#getEdist(bp60.where('newMigrant',True),'AGE')
tmp=bp60.where('race','Black/African American/Negro').pivot('AGE','newMigrant')
tmp=getEdist(bp60.where('newMigrant',True),'AGE').\
  join('AGE',getEdist(bp60.where('newMigrant',False),'AGE'))
tmp.relabel('pct','migrants').relabel('pct_2','eligibles')
tmp.select(['AGE','migrants','eligibles']).bar('AGE')
plt.title("Age distribution of migrants and eligibles")
print(randomState)


### Since we know from the previous lab that the level of a person's education (in MS in 1960) varies a lot with age...

Recall those downward sloping (after age 20 or so) mean years of education by age graphs?

Well it is also the case that people tend to migrate much more at relatively young ages. Like Ruby, migrants are generally in their 20s.   So our finding that migrants tended to be better educated than nonMigrants could just reflect that migrants are generally younger than nonMigrants.

Consequently, you'll need to make the following modifications:

1. Limit the analysis to African Americans. (done already but best to keep track)
1. Limit the analysis to people of similar ages -- let's say age 20-34.
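The two restrictions above amount to a row filter. Here is a minimal sketch in plain Python on invented records; the helper name `in_analysis_sample` is hypothetical, and the label "White" is an assumed spelling (check the frequencies printed earlier for the exact labels in your data).

```python
# Hypothetical records (not real census rows)
people = [
    {"race": "Black/African American/Negro", "AGE": 19, "edyears": 8},
    {"race": "Black/African American/Negro", "AGE": 20, "edyears": 10},
    {"race": "Black/African American/Negro", "AGE": 34, "edyears": 12},
    {"race": "Black/African American/Negro", "AGE": 35, "edyears": 6},
    {"race": "White", "AGE": 25, "edyears": 12},
]

def in_analysis_sample(person, race, min_age=20, max_age=34):
    """True if the person has the given race label and an age in [min_age, max_age]."""
    return person["race"] == race and min_age <= person["AGE"] <= max_age

subset = [p for p in people if in_analysis_sample(p, "Black/African American/Negro")]
print([p["AGE"] for p in subset])  # [20, 34]
```

Note that both endpoints are included here. If you filter with the `datascience` predicate `are.between`, keep in mind that its upper bound is exclusive, so ages 20-34 would be written as `are.between(20, 35)`.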

## Assemble the code necessary to limit the analysis to African Americans between the ages of 20 and 34

To do this you will need to copy the contents of the two "important" cells (1 and 2) from above into the cell below and then make a few 
**changes in order to limit the data used in the analysis.**   

HINT: your instructor believes that the max TVD for Mississippi with np.random.seed(13531) and with the race and age restrictions discussed above is:
Max Simulated TVD 
0.07907577497129736

In [None]:
## Let's call this cell
#####
## Important Cell number 3
#####

# copy the contents of Important Cell number 1 and Important cell number 2 into this cell
# then  modify as necessary to run the analysis on the more limited subset of the population.



## What about White people ?

We have a pretty good analysis of the role of education in the migration of Black Americans from MS (or another southern state).  Based on our reading of Lemann, we also know that discrimination in the South was nasty and at least in part, that nastiness was driving the migration. 

So what about white people?  It turns out that whites were also migrating north during this time period.

Copy your important cell #3 (in which you refined the analysis to the 20-34 age range)
and modify the code to look at the effect of education on white migration (keep the age restrictions)

Your instructor believes that for Whites 20-34 born in MS and with np.random.seed(13531) the max TVD is 0.17484293257712805


## Now that we have the machinery in place let's consider some additional important questions

### (1) Do years of education play a similar role for White new migrants as it does for African Americans?
### (2) Do the patterns of education and migration for Whites and African Americans look the same for people born in a state other than Mississippi?

In [None]:
#Question (1)  Do years of education play a similar role for White new migrants 
# as it does for African Americans?

## copy the contents of Important Cell number 3 into this cell
## make the modification(s) necessary to run the analysis for Whites 
## run the analysis and answer the question below

## HINT  your instructor believes that for Whites 20-34 born in MS and with 
## np.random.seed(13531) the max
# TVD is   0.1344595706804368

##  And now,  another state

Let's repeat the analysis for Whites and African Americans in a different state, in order to see if the finding that you just produced generalizes to at least one other state.

To do this it should only be necessary to :

1. comment out one line in the cell called "Reading the Data" 
2. Run the entire notebook by selecting Cell -> Run All

<font color='red'> Heads up: make sure there are no cells with errors; otherwise re-running the notebook will trick you. It will stop processing but OLD RESULTS WILL STILL BE VISIBLE.</font>

If you tend towards tidiness, it is also possible to copy code selectively: read the data, create the newMigrant variable, and do just what's needed in order to run the variants of "Important Cell number 3".

#### Notice (if you wish to) that because each southern state had its own unique migration pattern -- determined by where the railroads went as well as by historical chance and family networks -- we use a pretty wide definition of the "North" to which "new migrants" move.  You could experiment with narrowing this definition -- but doing so is not part of this lab.

## Congratulations, you have finished the second Great Migration lab.

Please take a moment to evaluate the experience, and remember to also answer the question on the reading.

In [None]:
cquiz('greatmig02-tvd')

In [None]:
cquiz('greatmig02-sim')

In [None]:
cquiz('greatmig02-null')

In [None]:
cquiz('greatmig02-state')

In [None]:
cquiz('greatmig02-white')

In [None]:
cquiz('greatmig02-combined')

In [None]:
cquiz('greatmig02-eval')