Demography 88<br>
Fall 2017<br>
Carl Mason (cmason@berkeley.edu)<br>
## Lab 6: The Great Migration: Chicago

In this lab we look at the characteristics of migrants from Southern states to Northern states in 1960. In the first Great Migration lab, we looked at the conditions of people (particularly African American people) who *lived* in a southern state. This week, we will look at people who were *born* in a southern state, and we'll compare the characteristics of southern-born people who no longer live in the south with those who continue to reside in the state of their birth. In other words, we're going to compare *migrants* with non-migrants. To be more tediously precise, we're going to compare the characteristics of migrants with those of people *at risk of migrating*, that is, with the universe of people born in the particular state.

The data that we will use in this week's exercise is very similar to what we used in the first Great Migration lab. You'll notice that the variables have similar names (although there are fewer of them this time). Like last time, the UPPER CASE variables are drawn directly from IPUMS, whereas the lower case variables have been modified slightly by your instructor -- generally by replacing numbers with words.

ALSO NOTE that this is a five percent simple random sample, so no weighting is required. To get an *estimate* of the full 1960 value, divide sums (such as counts of people) by 0.05.
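
As a quick illustration of the scaling rule above, the sketch below converts a sample count into an estimated population total. The count is made up for illustration, and the helper name `estimate_total` is hypothetical; it is not part of the lab's code.

```python
# Scale a count from a 5 percent simple random sample up to an
# estimate of the full-population value.
SAMPLING_FRACTION = 0.05  # stated in the lab: a five percent sample


def estimate_total(sample_count, fraction=SAMPLING_FRACTION):
    """Return the estimated population total implied by a sample count."""
    return sample_count / fraction


# Hypothetical count of sampled people in some category (not real data)
example_sample_count = 1200
estimated_total = estimate_total(example_sample_count)
print("Estimated 1960 total: about {0:,.0f}".format(estimated_total))
```

Proportions and means computed from the sample need no such adjustment, because the scaling factor cancels out.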

### We wish to address two questions:
1. Were the people who chose to migrate to the north generally more or less educated than those who chose to stay?
1. Is the difference in education levels between these two groups statistically *significant*?

Ponder these questions while you run the next cell.

In [None]:
# Run this cell to import the stuff we'll need
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt 
plt.style.use('fivethirtyeight')

%matplotlib inline
from datascience import Table
from datascience.predicates import are
from datascience.util import *

from IPython.display import HTML, IFrame, display
datasite="http://courses.demog.berkeley.edu/mason88/data/gmig1/"
quizsite="http://courses.demog.berkeley.edu/mason88/cgi-bin/quiz.py"
  
def cquiz(qno) : 
    import IPython, requests 
    try:
        sid
    except NameError: 
        print("HEY! did you enter your sid way up at the top of this notebook?")
        return  # without an sid we cannot build the quiz link
    Linkit='{0}?qno={1}&sid={2}'.format(quizsite,qno,sid)
    #print(Linkit)
    html = requests.get(Linkit)
    #display(IFrame(Linkit, 1000, 300))
    display(IFrame(Linkit, 1000, 400))


    
######################
# Here it is ... the obvious place to put your student id
sid=""
######################
if sid == "" :
    print("HEY! didn't I tell you to put your sid in the obvious place")
 


In [None]:
cquiz('greatmig1-01')

In [None]:
### #### #### ####
###  SELECTING  LAB  PARTNERS
### ### #### ####

N=18
numbers=np.arange(N)
print(np.mod(N,2))
if (not np.mod(N,2) == 0) :
    numbers=np.append(numbers,"lucky")
    N+=1
numbersTab=Table().with_column('n',numbers)
randomized=numbersTab.sample(k=N,with_replacement=False)
selection=randomized['n']
selection.shape = (2,int(N/2))
Table().with_columns('zero',selection[0],'one',selection[1]).show()

In [None]:
cquiz('greatmig1-partners')

In [None]:
## Borrowed from Jury Selection example
## https://www.inferentialthinking.com/chapters/10/1/jury-selection.html
def total_variation_distance(distribution_1, distribution_2):
    return np.abs(distribution_1 - distribution_2).sum()/2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

## Reading the data

The first time through, in order to make sure that your code is doing the right thing, we will all
use data from Mississippi. That's where Clarksdale is, so it is of particular interest. Later on, once
your code is solid, you'll come back up to the next cell and modify it to select a random state.

In [None]:
## Read a file that contains the names of all the state level data sets
files=Table().read_table(datasite+'States1960.csv')
## choose a file at random
randomState=files.sample(k=1)[1][0]
## ## ## ## ## ##
## FIRST TIME THROUGH WE'RE ALL GOING TO LEARN ABOUT MISSISSIPPI
## later you will comment out the next line in order to read a randomly selected state
randomState="Mississippi1960.csv"
print(randomState)

## read the data corresponding to the state that fate has determined that you should know more about
bp60=Table().read_table(datasite+randomState)
# delete observations for which nothing is known about migration status
bp60=bp60.where('migrate5',are.not_containing('nan'))

## Examine the data

Just as in previous labs, let's explore the variables that we have read in. Since these data are also drawn from the IPUMS public use sample of the 1960 US Census, the names and codes have the same meanings as they did in the first Great Migration lab. Since our questions are more focused -- and our RAM is constrained -- there are many fewer variables in this week's data set. But as you are always thinking about your final project, be aware that all the variables from the first Great Migration lab (and more) are available from ipums.org, which you should cite in the following way.

>Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0.


In [None]:
list(bp60)

In [None]:
## Crosstabs and descriptive statistics
import re

## list of numerical columns
nums=[i for i in list(bp60) if bp60[i].dtype == 'int64']
## list of categorical columns (NumPy string dtypes print like '<U12', hence the match on '<')
cats=[i for i in list(bp60) if re.match('<',str(bp60[i].dtype))]

#descriptive statistics for things numerical
bp60.select(nums).stats().show()
#frequencies of things categorical
for vname in cats:
    print("\n\n"+vname+"\n")
  
    bp60.select([vname]).group(vname).show()


## Question 1:  How does the education level of migrants compare to that of the population from which they selected themselves?

#### Question 0.1 -- who is a migrant?

Before we can tackle the bigger question regarding education, we need to define migrant in terms of the variables at our disposal.

Might I humbly propose that we consider two types of people:
1. People who were born in Mississippi (MS) (or, later, some other specific state) and who are, in 1960, residents of any state. Let's call these people the "at risk population", as it includes everyone who could have been a migrant from MS.
1. People who were born in Mississippi (MS) (or, later, some other specific state) and resided there in 1955 but in 1960 reside in the East North Central Census region (ENC). Let's call these "new migrants".
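
One way to turn a definition like this into a column is to combine elementwise comparisons into a single boolean array. The sketch below uses tiny made-up arrays with hypothetical names (`birth_state`, `state_1955`, `region_1960`) and invented codes; they are stand-ins for illustration, not the lab's actual variables or values.

```python
import numpy as np

# Made-up records for five people (hypothetical values, not IPUMS data)
birth_state = np.array(["MS", "MS", "MS", "MS", "MS"])
state_1955 = np.array(["MS", "MS", "IL", "MS", "TN"])
region_1960 = np.array(["ENC", "South", "ENC", "ENC", "ENC"])

# A person counts as a "new migrant" only if BOTH conditions hold:
#   1. they still lived in their birth state in 1955, and
#   2. they live in the destination region in 1960.
lived_in_birth_state_1955 = state_1955 == birth_state
in_destination_1960 = region_1960 == "ENC"
new_migrant = lived_in_birth_state_1955 & in_destination_1960

print(new_migrant)
print("Number of new migrants:", np.sum(new_migrant))
```

In this toy data the third person already lived outside the birth state in 1955, so under this definition they are an earlier migrant rather than a new one.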


The five year distinction is carefully chosen...because it is possible to do with our data. But it also raises some interesting questions:

1. Since 1960 is towards the end of the Great Migration, the characteristics of these new migrants might differ from those of people who came earlier.
2. People who have been in the north for a long time may have acquired different characteristics as a result of migration rather than prior to migration.

Before we are finished with this lab, we will also need to take age into account but for now let's put that aside.

## Census Divisions:
<img src="http://courses.demog.berkeley.edu/mason88/images/Census_Regions_and_Division_of_the_United_States.png">



In [None]:

## create a variable (column of bp60) that takes on the value of True if the person migrated from their state of birth 
## to the ENC Census region *within the previous 5 years*
#HINT:  look at migplac5, bpl

## fill in the ?????
bp60.append_column('newMigrant',???????)
np.sum(bp60['newMigrant'])

In [None]:
cquiz('greatmig1-02')

In [None]:
## Consider this table as you answer the question below
bp60.where('race','Black/African American/Negro').pivot('region','newMigrant').show()
cquiz('greatmig1-03')

### Let's compare "new-migrants" with non-migrants

A useful way to compare the education level of new migrants and non-migrants is by looking at the empirical distributions of years of education of each group.  Below is a handy function for doing just that.

getEdist() takes a table and the name of one of its columns, and returns a table of the empirical distribution of that column.

In [None]:
def getEdist(dat,column):
    """
    Expects a table of data and the name of a column; returns a table of the empirical frequency distribution
    of the column
    """
    tab0=dat.select(column).group(column)
    tab0.append_column('pct',tab0['count']/np.sum(tab0['count']))
    return(tab0)

In [None]:
## Example using getEdist
## note that educ and edyears hold the same information -- edyears is a recoding of educ. Sometimes it is easier
## to use edyears and sometimes educ
temp0=getEdist(bp60,'educ')
temp0.show()
temp0.drop('count').barh('educ')
## same thing with edyears
temp1=getEdist(bp60.where('newMigrant',True),'edyears')
temp1.show()
temp1.drop('count').barh('edyears')
## 
getEdist(bp60.where('newMigrant',True).where('sex','Female'),'marst').drop('count').barh('marst')


In [None]:
## Use getEdist() to find the empirical distribution of state of residence of those whom we are calling "new Migrants"
## from Mississippi  and answer the question below
cquiz('greatmig1-04')
#HINT: the variable 'statefip' holds the state of current residence

In [None]:
## Let's call this cell 
######
##Important Cell number 1 ##
##### 
## we'll refer to it later

## Let's construct a table and bar graph just like the one that compares "Eligibles" and "Panels" in the
## jury selection example. We'll use getEdist and a .join(), and we'll relabel stuff to keep better track

## define the at-risk population: everyone born in the state
atriskPop=bp60
## get the empirical distribution of new migrants -- which are a subset of the at-risk population
newMigs=getEdist(atriskPop.where('newMigrant',True),'edyears')
## relabel the columns in anticipation of a .join
newMigs.relabel(['count','pct'],['newMigN','newMigPct'])
## get the empirical dist of the at-risk population
atRisk=getEdist(atriskPop,'edyears')
atRisk.relabel(['count','pct'],['atRiskN','atRiskPct'])
## and here's the .join
EdDist=atRisk.join('edyears',newMigs)
EdDist.show()
EdDist.drop(['atRiskN','newMigN']).barh('edyears')
cquiz('greatmig1-041')

In [None]:
# Compute the total variation distance between the education level of new migrants and that of
# the at risk population (those born in Mississippi)

#HINT: there are some functions for this:
## Borrowed from Jury Selection example
## https://www.inferentialthinking.com/chapters/10/1/jury-selection.html
def total_variation_distance(distribution_1, distribution_2):
    return np.abs(distribution_1 - distribution_2).sum()/2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

cquiz('greatmig1-05')

## Is the total variation distance important?

In the language of the Jury Selection problem: are the "new migrants" a "representative sample" of those born in Mississippi? Just as the Jury Selection example is limited to one characteristic (in that case race/ethnicity), in our case we are considering only years of education.
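
To make the quantity concrete, here is a small worked example of the total variation distance between two made-up distributions over four categories. The proportions are invented for illustration; the function mirrors the `total_variation_distance` definition used in this notebook.

```python
import numpy as np


def total_variation_distance(distribution_1, distribution_2):
    """Half the sum of absolute differences between two distributions."""
    return np.abs(distribution_1 - distribution_2).sum() / 2


# Invented proportions over the same four categories; each sums to 1
population = np.array([0.40, 0.30, 0.20, 0.10])
subgroup = np.array([0.25, 0.30, 0.25, 0.20])

toy_tvd = total_variation_distance(population, subgroup)
print("Toy TVD:", round(toy_tvd, 4))
```

Here the absolute differences are 0.15, 0, 0.05 and 0.10, which sum to 0.30, so the TVD is 0.15. A TVD of 0 means the two distributions are identical, and larger values mean they differ more.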

#### Use the same technique as you learned in the foundation class to compute 5000 TVDs between the observed empirical distribution of years of education of new migrants and randomly selected observations of Mississippi born persons (aka the at-risk population)

### In order to get the right answer to the question below, make sure that you run the `np.random.seed(13531)` command once before generating your 5000 trials.

Your instructor believes that the correct value of MAX TVD under these conditions is: 0.053950728832533934
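
The general shape of such a simulation can be sketched with toy numbers: draw many random samples from a known distribution, compute the TVD of each sample from that distribution, and see where an observed TVD falls. Everything below (the distributions, the sample size, the seed, and the use of NumPy's multinomial sampler) is an assumption chosen for illustration. It is not the lab's data, it is not necessarily the approach your solution should take, and it will not reproduce the instructor's value.

```python
import numpy as np


def total_variation_distance(distribution_1, distribution_2):
    return np.abs(distribution_1 - distribution_2).sum() / 2


# Invented "population" distribution and an invented observed subgroup
population = np.array([0.40, 0.30, 0.20, 0.10])
observed = np.array([0.25, 0.30, 0.25, 0.20])
sample_size = 200      # hypothetical size of the observed subgroup
repetitions = 5000

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
simulated_tvds = np.empty(repetitions)
for i in range(repetitions):
    # Counts per category for one random sample drawn from the population
    counts = rng.multinomial(sample_size, population)
    simulated_tvds[i] = total_variation_distance(counts / sample_size, population)

observed_tvd = total_variation_distance(observed, population)
# Share of simulated TVDs at least as large as the observed one
share_as_extreme = np.mean(simulated_tvds >= observed_tvd)
print("Observed TVD:", observed_tvd)
print("Largest simulated TVD:", simulated_tvds.max())
print("Share of simulations at least as extreme:", share_as_extreme)
```

If very few simulated TVDs are as large as the observed one, the observed group does not look like a random sample from the population.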

In [None]:
# Let's call this cell:
######
# Important cell number 2
######
# Compute the empirical distribution of the TVDs
panel_size = np.sum(EdDist['newMigN'])
repetitions = 5000

tvds = make_array()
np.random.seed(13531)
for i in np.arange(repetitions):

    new_sample = ????
    tvds = ????

results = Table().with_column('TVD', tvds)
print('Max Simulated TVD:{0}\n should be same every time if np.random.seed() is set otherwise should be different'.\
      format(np.max(results['TVD'])))
observed_tvd=table_tvd(EdDist,'atRiskPct','newMigPct')
print("OBSERVED TVD:{0}".format(observed_tvd))
results.hist()



In [None]:
cquiz('greatmig1-07')

### Now we have the code in place and tested ... but we need to refine our question

At this point, you should have a nice chunk of code for assessing total variation distance. Now it's time to think about the substantive question at hand.

While the code you have written is certainly beautiful, let's remind ourselves about what informs our inquiry. In *Promised Land*, Lemann writes about the migrations of *African Americans* from Clarksdale, Mississippi to Chicago, Illinois. He tells us, more or less, that the deteriorating social and economic conditions in urban areas like Chicago were due largely to a combination of forces:
1. The influx of a large number of poor African American sharecroppers with cultural traits that are poorly adapted for success.
1. Relentless segregation in the north, which limited opportunities and made the new arrivals easily exploitable.

### In that spirit let's consider the question:  were *African Americans* with poorer prospects more likely to migrate from the South to the North?

If we take years of education to be a measure of future prospects, you can use the code that you have developed above to answer this question. You'll need to make the following modifications:

1. Limit the analysis to African Americans.
1. Limit the analysis to people of similar ages -- let's say age 25-34.
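
Restrictions like these amount to keeping only the rows that satisfy every condition. The sketch below shows the idea with boolean masks on small made-up arrays; the arrays, the category labels and the values are all invented stand-ins, and with the lab's tables you would apply the equivalent row selection to your own data.

```python
import numpy as np

# Invented records for six people (not real census data)
race = np.array(["Black", "White", "Black", "Black", "White", "Black"])
age = np.array([22, 30, 25, 34, 41, 35])
edyears = np.array([8, 12, 6, 10, 12, 9])

# Keep a row only if it meets the race condition AND falls in the age range
keep = (race == "Black") & (age >= 25) & (age <= 34)

print("Rows kept:", np.sum(keep))
print("Years of education in the restricted group:", edyears[keep])
```

Both endpoints are included here, so ages 25 and 34 are kept while 35 is dropped.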

Why is the second limitation advisable?



In [None]:
cquiz('greatmig1-08')

## Assemble the code necessary to conduct the analysis on African Americans between the ages of 25 and 34

To do this you will need to copy the contents of two "important" cells from above into the cell below and then make a few 
changes in order to limit the data used in the analysis.  Copy the contents of the cells above which are labeled:
"important cell number 1" and "important cell number 2"  then comment out the calls to cquiz() and finally  make the necessary modification(s) and run the code.

HINT: your instructor believes that the max TVD for Mississippi with np.random.seed(13531) and with the race and age restrictions discussed above is:
Max Simulated TVD: 0.0925147882149493

In [None]:
## Let's call this cell
#####
## Important Cell number 3
#####

# copy the contents of Important Cell number 1 and Important cell number 2 into this cell
# then comment out the calls to cquiz() 
# and finally modify as necessary to run the analysis on the more limited subset of the population.



In [None]:
cquiz('greatmig1-09')

## Now that we have the machinery in place let's consider some additional important questions

### (1) Do years of education play a similar role for White new migrants as they do for African Americans?
### (2) Do the patterns of education and migration for both Whites and African Americans look the same for people born in some state other than Mississippi?

In [None]:
#Question (1)  Do years of education play a similar role for White new migrants as they do for African Americans?

## copy the contents of Important Cell number 3 into this cell
## make the modification(s) necessary to run the analysis for Whites 
## run the analysis and answer the question below

## HINT: your instructor believes that for Whites 20-34 born in MS and with np.random.seed(13531) the max
# TVD is  Max Simulated TVD: 0.1924837075637157

In [None]:
cquiz('greatmig1-10')

##  And now,  another state

Let's repeat the analysis for Whites and African Americans in a different state, in order to see if the finding that you just produced generalizes to at least one other state.

To do this it should only be necessary to:
1. Run the cell below to read in data from another state. Run the cell twice or even three times if you have to, to make sure that you aren't getting Mississippi again.
2. Copy the contents of Important Cell number 3 into the two waiting cells below and make the necessary modifications such that one cell does the analysis for African Americans and the other does it for Whites.

#### Notice (if you wish to) that because each southern state had its own unique migration pattern -- determined by where the railroads went as well as by historical chance and family networks -- it is wise to broaden our definition of "new migrant" to include folks who wound up in several northern regions in addition to the East North Central. The code is in the next cell, so this new definition will just happen without any effort on your part.

In [None]:
## Read a file that contains the names of all the state level data sets
files=Table().read_table(datasite+'States1960.csv')
## choose a file at random
randomState=files.sample(k=1)[1][0]
## This time there is no Mississippi override: we keep whichever state was drawn at random
print(randomState)
print("run this again if random state is Mississippi")
## read the data corresponding to the state that fate has determined that you should know more about
bp60=Table().read_table(datasite+randomState)
bp60=bp60.where('migrate5',are.not_containing('nan'))
## and the new migrant column
## Note that since there are strong migration patterns from places in the south to places in the north --
## not all states send large numbers of migrants to Chicago or the East North Central region for that matter.
## consequently, it's best if we broaden our definition of new migrant by adding two other regions of
## residence to the definition
#bp60.append_column('newMigrant',(bp60['migplac5']==bp60['bpl'])*(bp60['region']=="East North Central Div."))
def north(reg):
    """True if reg is one of the destination regions that count as 'northern' here."""
    northern={"East North Central Div.","Middle Atlantic Division","Pacific Division"}
    return reg in northern
    

bp60.append_column('newMigrant',(bp60['migplac5']==bp60['bpl'])*bp60.apply(north,'region'))
getEdist(bp60,'region')                  

In [None]:
## Copy Important Cell number 3 here and modify it as necessary to do the usual TVD analysis on African Americans
## age 25-34 from the current state

In [None]:
## Copy Important Cell number 3 here and modify it as necessary to do the usual TVD analysis on Whites
## age 25-34 from the current state

In [None]:
cquiz('greatmig1-11')

## Congratulations, you have finished the second Great Migration lab.

Please take a moment to evaluate the experience, and remember to also answer the question on the reading.

In [None]:
cquiz('greatmig1-eval')

In [None]:
cquiz('clemens')