Demography 88<br>
Fall 2017<br>
Carl Mason (cmason@berkeley.edu)<br>
## Lab 4 The Great Migration : Clarksdale

This week the plan is to use data from the 1960 Census to investigate/illustrate Nicholas Lemann's description of life in the US South. Even to those who have not read the Clarksdale chapter of <i>Promised Land</i> (of whom there are none among us) -- it will come as no surprise that Black Americans in the US South endured inequality in virtually every sphere of life.  The fact that this is well known should not deter the young data scientist from further exploration.

In the *second* lab on the Great Migration, we will take a look at the characteristics of those who chose to migrate from the South to the North.  Keep that in mind as we investigate conditions in the South, because the obvious question that you might be asked is whether the people who suffered most were the most likely to leave.

The Census data that we are using today comes to us via the Integrated Public Use Microdata Series project at the University of Minnesota Population Center. [http://ipums.org] is a fantastic resource containing not only census data but also data from many other sources both US and from other countries -- all of it carefully harmonized to make the data as comparable as possible. Here's the official citation:

>Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0.

## Some important notes about the data we are going to use today:

#### While the US Census is a complete enumeration, what the IPUMS distributes is not.  In order to save (a lot of) money, IPUMS draws a random sample of 5 percent of households (this is done with the manuscripts of the ancient census returns, of course, but it is functionally the same as if they had selected the dwelling units).  By randomly selecting *households* rather than *individuals*, much more useful information is available to us -- things like number of siblings in a household; the characteristics of parents *and* children together; characteristics of dwelling units; and the relationship among people who share a dwelling.

#### It is worth keeping this in mind as you ponder your final project, as this sort of data is available from IPUMS for lots of times and places.  It is also important for our purpose today to understand that the data that we will use is in what is sometimes known as "wide format" (as opposed to the "long format" data that we worked with in the demographic transition lab).  This means that each row of the table that we read in will refer to an individual -- it will contain information on lots of things pertaining to that individual, for example, sex, age, wage income, and education.  Each row also contains a variable called "SERIAL"; this number uniquely identifies households.  Groups of rows of the table with the same value of SERIAL are households.  Each person within such a household has a unique PERNUM value, starting with 1.  Within each household, PERNUM is unique for each individual while SERIAL is the same.  All household level variables, e.g. dwelling unit characteristics, will be the same for all members of the household.
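
To make the row layout concrete, here is a minimal sketch in plain Python. The `people` list is invented for illustration (it is not drawn from the IPUMS file); only the column names SERIAL, PERNUM and AGE come from the description above.

```python
from collections import defaultdict

# Hypothetical person-level rows: one dict per person, as in the extract.
people = [
    {"SERIAL": 101, "PERNUM": 1, "AGE": 44},
    {"SERIAL": 101, "PERNUM": 2, "AGE": 41},
    {"SERIAL": 101, "PERNUM": 3, "AGE": 12},
    {"SERIAL": 102, "PERNUM": 1, "AGE": 67},
    {"SERIAL": 103, "PERNUM": 1, "AGE": 30},
    {"SERIAL": 103, "PERNUM": 2, "AGE": 29},
]

# Rows that share a SERIAL value form one household.
households = defaultdict(list)
for person in people:
    households[person["SERIAL"]].append(person)

# Household size is the number of rows per SERIAL.
household_size = {serial: len(members) for serial, members in households.items()}
print(household_size)

# SERIAL and PERNUM together identify each person uniquely.
person_keys = {(p["SERIAL"], p["PERNUM"]) for p in people}
print(len(person_keys) == len(people))
```

The same idea carries over to the real table: grouping on SERIAL gives households, and the pair (SERIAL, PERNUM) picks out one person.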

#### In this particular sample (1960, 5 percent) simple random sampling is used. As we noted in the Fertility lab, this is not always the case with IPUMS.  But this time it is, so we need not worry about weights -- except when we wish to present estimates for the entire population.  If we wanted, for example, to know the number of 13 year olds in the population, we would count them up in our sample and then divide that count by .05 (or multiply it by 20, whatever).
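
The scaling arithmetic can be sketched as follows. The function name `population_estimate` and the sample count are made up for illustration; only the 5 percent sampling fraction comes from the text above.

```python
SAMPLING_FRACTION = 0.05  # the sample is described above as 5 percent of households


def population_estimate(sample_count, sampling_fraction=SAMPLING_FRACTION):
    """Scale a count from a simple random sample up to the whole population."""
    return sample_count / sampling_fraction


# Hypothetical number of 13 year olds counted in the sample (not a real figure).
sample_13_year_olds = 1500
estimate = population_estimate(sample_13_year_olds)
print(round(estimate))
```

Dividing by .05 and multiplying by 20 are the same operation; this shortcut only works because every household had the same chance of being sampled.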

## Put your student id in the obvious place and then run the cell below

In [None]:
# Run this cell to import the stuff we'll need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
plt.style.use('fivethirtyeight')

%matplotlib inline
from datascience import Table
from datascience.predicates import are
from IPython.display import HTML, IFrame, display
datasite="http://courses.demog.berkeley.edu/mason88/data/gmig0/"
quizsite="http://courses.demog.berkeley.edu/mason88/cgi-bin/quiz.py"
  
def cquiz(qno) : 
    import IPython, requests 
    try:
        sid
    except NameError: 
        print("HEY! did you enter your sid way up at the top of this notebook?")
        return  # without a sid we cannot build the quiz link
    Linkit='{0}?qno={1}&sid={2}'.format(quizsite,qno,sid)
    #print(Linkit)
    html = requests.get(Linkit)
    #display(IFrame(Linkit, 1000, 300))
    display(IFrame(Linkit, 1000, 400))


    
######################
# Here it is ... the obvious place to put your student id
sid=""
######################
if sid == "" :
    print("HEY! didn't I tell you to put your sid in the obvious place")
 


## And as usual indicate with whom you are working this week

In [None]:
cquiz('greatMig0-partners')


# Background on African American Migration

1. African American History can be told (well oversimplified) as the story of three migrations
    1. The "Middle Passage"
    1. The 19th Century migration from the old (tobacco) South to the "new" (cotton) South
    1. The 20th Century "Great Migration" from the rural South to the urban North between 1917 and 1970
    

#  Antebellum Slavery

1. Cotton is the big story
   1. The cotton gin makes "short staple" cotton profitable, which opens up vast territory for cotton cultivation
   1. BAD news for Native Americans who occupy much of that land by treaty
   1. BAD news for enslaved Americans who will move in large numbers onto cotton plantations
   
   1. Cotton is a hugely important crop for the Industrial Revolution
       1. 80% of British cotton imports come from the US South on the eve of the Civil War
       1. Also key to industrialization in New England
     
   
   

# The 18th Century view of slavery and the Civil War

1. Slavery was morally wrong and therefore inefficient
1. slavery was demographically doomed
    1. Limited amount of land where slave agriculture is profitable
    1. Population grows as Malthus would want it to ==>
        ==> too many slaves not enough land
        ==> cheaper to hire workers at subsistence wage

1. Cotton Gin expanded area of cultivation OUCH!
1. Big unsettled question whether technology was improving efficiency of slave ag.
1. By the 1850s the issue is the territories. Imperative for anti-slavery folks
   to keep slavery out of the territories. Equally imperative for pro-slavery folks to open new lands.
   


# Reconstruction

1. The end of the Civil War brought emancipation but no 40 acres and no mule
    1. Self sufficiency was not widely possible for freed slaves; see above (and Lemann)
    1. New methods of social control and labor exploitation develop
  
    

<img src="http://courses.demog.berkeley.edu/mason88/images/Census_1900_Percent_Black.png">

## The US South in 1960

By 1960, much of the Great Migration has already run its course.  The migration began around 1917 with US involvement in WWI, and by the 1970s the general move towards the "Sun Belt" overwhelms what's left of the Great Migration and the net migration pattern of Black Americans turns southward.  But the reverse great migration is different in many respects from the south to north flow -- most importantly the reverse flow was (and is) an urban to urban migration whereas much of the Great Migration was rural to urban.

Today the proportion of African Americans who live in the South (~ 55%) is close to what it was in 1960 (~60%).  So while the pioneers of the Great Migration were already well established in the North,  the population of the South was still experiencing natural growth and still sending considerable numbers of people north.

This week, to add interest, we can each choose a southern state at random for analysis.  The structure of each state
data file is the same, but presumably some of the values and results will be somewhat different. Doing state level analyses will also speed us along by keeping our data sets to a manageable size. Since the state data files are identically structured, code written for any state will work on any other-- so even though you and your partner may be working on different states, you can still share code snippets without any difficulty.

The goals  for this week are

1. To examine household and family characteristics of Black and White Americans in the South to validate / corroborate what we read in Lemann.

2. To quantify the discrimination  to which Black Americans in the South were subject in 1960. Discrimination is not easily measured as it has a lot to do with what is in people's brains and the US Census is far too cowardly to try to measure that directly.

## Discuss how we might infer discrimination if we cannot measure what is in people's brains?


In [None]:
## Read a file that contains the names of all the state level data sets
files=Table().read_table(datasite+'States1960.csv')
## choose a file at random
randomState=files.sample(k=1)[1][0]
print(randomState)
## read the data corresponding to the state that fate has determined that you should know more about
st60=Table().read_table(datasite+randomState)

## Let's explore the data

with some summary statistics and cross tabs

list(st60) gives us a list of column names in the st60 table. Note that some are lower case and others are UPPER case.  Variables in lower case are those that your instructor has cleaned up -- generally these are categorical variables, and your instructor has changed number codes in the original file to more informative text strings.  The variables in UPPER CASE are straight from the IPUMS data file. UPPER case variables are always numbers, but the information that they represent is not always numerical. For example, SERIAL is a numerical identifier that is unique for each household, but the number itself has no significance.


In [None]:
# A list of variables
list(st60)


# Cross tabulations and summary statistics

The code below uses the .group() method to produce cross tabs -- by which we mean
counts of observations at each level of a categorical variable.
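
For readers who want to see what "counts at each level" means without loading the state file, here is a small sketch in plain Python. The list of values is invented; "Yes, in the labor force" is a level that appears later in this notebook, while the other two levels are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical values of a categorical variable, one entry per person.
labforce = [
    "Yes, in the labor force",
    "No, not in the labor force",  # assumed level name, for illustration only
    "Yes, in the labor force",
    "N/A",                         # assumed level name, for illustration only
    "Yes, in the labor force",
]

# A one-way cross tab is just the number of observations at each level.
counts = Counter(labforce)
for level, n in counts.most_common():
    print(f"{level:30s} {n}")
```

The .group() call in the next cell does the same counting, one categorical column at a time.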

In [None]:
## Crosstabs and descriptive statistics
import re
import pandas as pd
#st60.column(3).dtype
## list of numerical columns
nums=[i for i in list(st60) if st60[i].dtype == 'int64']
## list of categorical columns
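## string columns have a dtype that prints like '<U25'; a raw string avoids an invalid-escape warning
cats=[i  for i in list(st60) if re.match(r'<',str(st60[i].dtype))]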

#descriptive statistics for things numerical
st60.select(nums).stats().show()
#frequencies of things categorical
for vname in cats:
    print("\n\n"+vname+"\n")
    #print(pd.crosstab(st60[i],columns="count"))
    st60.select([vname]).group(vname).show()
#my_tab = pd.crosstab(st60['labforce'],    columns="count")      # Name the count column
#my_tab

In [None]:

cquiz('greatmig0-01')


## Let's compare some important determinants of socioeconomic status by race

Since many determinants of socioeconomic status *should* vary with age and experience, inequality *within* age categories is more informative with respect to discrimination.

Let's compare education and income by drawing some scatter plots with AGE on the x axis and with separate dots for White and Black averages. 

In [None]:
## Education
# edyears = years of education
## a table of mean years of education by race and age
edyearsP0=st60.select(['AGE','race','edyears']).groups(['AGE','race'],collect=np.nanmean)
edyearsP0.where('AGE',25).show()
edyearsP0.relabel('edyears nanmean','YearsEducation')
edyearsP0.scatter('AGE','YearsEducation',colors='race')
edyearsP0.where('AGE',are.not_above(20)).scatter('AGE','YearsEducation',colors='race')


In [None]:
cquiz('Greatmig0-trick0')

In [None]:
cquiz('greatmig0-011')

In [None]:
## Whether or not education levels alone constitute discrimination, we can also look at income
## differences -- and at income differences within age and education levels.

## Now income -- we'll use incwage -- income from wages but we'll need to make sure that
## we look only at people who are in the labor force
# find out what values labforce has so we can select on it
st60.select(['labforce']).group('labforce')

In [None]:
## include only laborforce participants
incwageP0=st60.where('labforce','Yes, in the labor force').select(['AGE','race','incwage']).\
    groups(['AGE','race'],collect=np.nanmean)
incwageP0.show(5)
## 
incwageP0.relabel('incwage nanmean','WageIncome')
incwageP0.scatter('AGE','WageIncome',colors='race')
#edyearsP0.where('AGE',are.not_above(20)).scatter('AGE','WageIncome',colors='race')


# The scatter plot of WageIncome above is informative -- but ..

One aspect of the inequality that should interest us is the extent to which it falls unequally across ages.
Do young Black Americans earn less relative to Whites of the same age than do older Black Americans?
From the scatter plot above, it is quite clear that there is a racial income gap at each age. What isn't perfectly clear is how the size of that gap varies with age. Do whites earn x times as much as blacks at every age -- or does the gap increase with age?

To get a better look,  we might also plot the *ratio* (or difference) of wages at each age.
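
As a rough illustration of the arithmetic, the sketch below pairs the two group means at each age and then computes both the ratio and the difference. All of the wage figures and the short race labels are invented for the example; they are not results from any state file.

```python
# Hypothetical mean wage income keyed by (age, race); the numbers are invented.
mean_wage = {
    (25, "White"): 3000.0,
    (25, "Black"): 1800.0,
    (45, "White"): 5000.0,
    (45, "Black"): 2500.0,
}

ages = sorted({age for age, _ in mean_wage})

# Ratio: Black mean divided by White mean at the same age.
wage_ratio = {
    age: mean_wage[(age, "Black")] / mean_wage[(age, "White")] for age in ages
}

# Difference: White mean minus Black mean at the same age.
wage_gap = {
    age: mean_wage[(age, "White")] - mean_wage[(age, "Black")] for age in ages
}

for age in ages:
    print(age, round(wage_ratio[age], 2), wage_gap[age])
```

In this made-up example the ratio falls with age while the dollar gap grows, which is why it is worth deciding up front which of the two you mean by "the gap".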


In [None]:
# Plotting the black:white wage ratio requires .join()

incwageP1=incwageP0.where('race','White').\
    join('AGE',incwageP0.where('race','Black/African American/Negro'))
incwageP1.show(5)
## drop ages where wages are uninteresting to prevent warning messages about division by zero
incwageP1=incwageP1.where('AGE',are.between_or_equal_to(15,70))
incwageP1.append_column('wageRatio',incwageP1['WageIncome_2']/incwageP1['WageIncome'])

## and to reduce the effect of odd ones at old and young ages
incwageP1.where('AGE',are.between(15,70)).scatter('AGE','wageRatio')
print(randomState)

## So, in your state is there an age pattern to the black:white wage ratio?




# Income by age within education levels 

In the cell below I have copied the code from above that creates scatter plots of wages v age
and of wage ratio v age for black and white residents of your state.

In the def statement, I include st60subset=st60 to clarify what we are up to. st60subset will be the name used in the function to reference the dataset on which we want the body of the code to be run.  By default st60subset will just be st60 -- in which case the smallScatter function will produce the very same graphs that we have already seen.

The cool trick is that we can also pass this function *subsets* of st60 -- for example, we could give it only the rows of st60 that refer to people who have college degrees.  That would then produce graphs like the ones above in structure -- but only for those with college degrees.
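
The pattern of "a default data set, or any subset you pass in" can be sketched on toy data. The rows and the `mean_wage` helper below are hypothetical stand-ins; only the column names edyears and incwage come from this notebook.

```python
# Hypothetical rows standing in for st60; the values are invented.
rows = [
    {"edyears": 8, "incwage": 1200},
    {"edyears": 12, "incwage": 2600},
    {"edyears": 16, "incwage": 5200},
    {"edyears": 16, "incwage": 4800},
]


def mean_wage(subset=rows):
    """Mean of incwage over whatever rows are passed in (all rows by default)."""
    return sum(r["incwage"] for r in subset) / len(subset)


# Default: the whole data set.
overall = mean_wage()

# Same function, but only the rows with 16 years of education.
college = [r for r in rows if r["edyears"] == 16]
college_only = mean_wage(college)

print(overall, college_only)
```

One thing to keep in mind: Python evaluates a default argument once, when the def statement runs. If you later re-run the cell that picks a random state, st60 will refer to a new table, but a function defined earlier will keep using the old one as its default until you re-run the cell that defines it.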

In [None]:
def smallScatter(st60subset=st60):
    """
    expects a table that looks like st60 BUT most likely will contain only a subset of rows 
    perhaps only those with 11 years of education... draws the white and black mean income by
    age graphs like those above
    """
    ## include only laborforce participants
    incwageP0=st60subset.where('labforce','Yes, in the labor force').\
    select(['AGE','race','incwage']).groups(['AGE','race'],collect=np.nanmean)
    
    ## 
    incwageP0.relabel('incwage nanmean','WageIncome')
    incwageP0.scatter('AGE','WageIncome',colors='race')
    

    incwageP1=incwageP0.where('race','White').join('AGE',incwageP0.\
                        where('race','Black/African American/Negro'))

    ## drop ages where wages are uninteresting to prevent warning messages about division by zero
    incwageP1=incwageP1.where('AGE',are.between_or_equal_to(15,70))
    incwageP1.append_column('wageRatio',incwageP1['WageIncome_2']/incwageP1['WageIncome'])

    ## and to reduce the effect of odd ones at old and young ages
    incwageP1.where('AGE',are.between(15,70)).scatter('AGE','wageRatio')
    print(randomState)

In [None]:
##  call smallScatter with interesting subsets of education levels
smallScatter(st60.where('edyears',are.below_or_equal_to(11)))

# Use your new smallScatter() function to explore the age pattern of the wage ratio at various education levels

Discuss with everyone you can and then answer the next quiz question.


In [None]:
cquiz('Greatmig0-02')

# Household composition 

One of the features of Black lives described by Lemann is fragmented families.  Lemann does not talk much about White families, but Ruby Daniel's experience as a more or less itinerant share cropper is treated in considerable depth.  

Can we find evidence in the Census to indicate that household composition for Black Americans was different from that of Whites in your favorite state?

## Generating a histogram of household size for black  and white households (code required)

It is a little bit tricky to extract household level information from our datasets because each record corresponds to an individual.  Households, in our dataset, are composed of a variable number of
rows which share a common household ID number which is called 'SERIAL'. Since household size varies (as we are about to see), data for large households is over-represented.

Consider, for example, the number of rooms.  Assuming that larger households generally live in houses with more rooms, simply taking the mean of the 'rooms' variable over all rows can be misleading, because a household with six members contributes its room count six times.
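
A toy example shows the size of the effect. The two households below are invented; the column names SERIAL, PERNUM and rooms are the ones used in this notebook.

```python
# Hypothetical person-level rows. 'rooms' is a household-level variable, so it
# repeats on every row of the same household.
people = [
    {"SERIAL": 1, "PERNUM": 1, "rooms": 6},
    {"SERIAL": 1, "PERNUM": 2, "rooms": 6},
    {"SERIAL": 1, "PERNUM": 3, "rooms": 6},
    {"SERIAL": 1, "PERNUM": 4, "rooms": 6},
    {"SERIAL": 2, "PERNUM": 1, "rooms": 2},
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Mean over persons: the four-person household is counted four times.
person_mean = mean(p["rooms"] for p in people)

# Mean over households: keep one row per household (PERNUM == 1).
household_mean = mean(p["rooms"] for p in people if p["PERNUM"] == 1)

print(person_mean, household_mean)
```

Neither number is wrong; they answer different questions (rooms in the home of the average person versus rooms in the average home).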

In [None]:
# the mean number of rooms reported by each *person* in our sample
print(np.nanmean(st60['rooms']))

# PERNUM is a unique identifier of a person **within** a household. Every household
# has a record with PERNUM == 1;  only households with two or more people have someone
# with PERNUM==2 and so on.  So selecting only PERNUM==1 means that we are looking at
# one observation PER HOUSEHOLD
print(np.nanmean(st60.where('PERNUM',1)['rooms']))
cquiz('greatmig0-04')

## Now the histograms of household size (coding required)


In [None]:
##  GENERATE Histograms of household size
# histogram of household size
#########
## Here's some code that produces a histogram for Black households 
##  COPY THE TWO UNCOMMENTED LINES BELOW and make the *small* modification necessary to produce
##  the corresponding histogram of white household size
#####
st60.where('race','Black/African American/Negro').select(['SERIAL']).\
    group(['SERIAL']).hist(['count'],bins=np.arange(20))

print(randomState+' Black ')

## Discuss the difference between these histograms

In most states the distribution of household size is quite different between White and Black households.
It is well worth the trouble to think and discuss what this might imply about the differing life experiences
of Black and White residents of your state AND what might be causing this not exactly subtle difference.

HINT: people who live in institutions have unique SERIAL values.
HINT: https://www.washingtonpost.com/news/wonk/wp/2014/07/15/charting-the-shocking-rise-of-racial-disparity-in-our-criminal-justice-system/?utm_term=.7c555c63b112


In [None]:
st60.append_column('NonInst',st60['relate'] != "Institutional inmates")
cquiz('greatmig0-03')

In [None]:
#st60.pivot('race','relate',values='PERNUM',collect=len).show()
st60.pivot('race','gqtype',values='PERNUM',collect=len).show()

## Families and households

In <i>Promised Land</i> Lemann describes the family lives of share croppers anecdotally through the story of Ruby and her family.  We can use a generalized version of our smallScatter() function to investigate some broad indicators such as 
the proportion of women at each age who live with a spouse and the age distribution of the number of children ever born.

In the cell below fill in the <WHAT>s to produce a working genScatter() function.

In [None]:
def genScatter(stDat=st60, xvar='AGE', depvar='incwage') :
    """
    A function that generates a scatter plot of depvar vs xvar for blacks and whites 
    To limit scatterplot to a subset of st60 -- use .where() in defining stDat
    """
    ## Assume that stDat is a subset of rows and columns of st60 -- the user will have done
    ## any necessary .where() and .select()  before feeding the data to this function
    ## so tab0 will be the result of the .group()ing by the specified xvariable and by race
    #tab0=st60.where('edyears',EDlevel).where('labforce','Yes, in the labor force').select(['AGE','race','incwage']).groups(['AGE','race'],collect=np.nanmean)
    tab0=stDat.select([xvar,'race',depvar]).groups([xvar,'race'],collect=np.nanmean)
    ## tab0 should have columns: xvar, race and depvar+' nanmean'
    ## the first scatter plot what goes on the x axis ?
    tab0.scatter( <WHAT>,depvar +' nanmean',colors='race')
    tabj=tab0.where('race','Black/African American/Negro').join(<WHAT>,tab0.where('race',<WHAT>))
    #tabj.show(5)
    tabj.append_column('B:W Ratio',tabj[depvar+ ' nanmean']/tabj[depvar+ ' nanmean_2'])

    tabj.scatter(<WHAT>,<WHAT>)

In [None]:
## Create a variable indicating the state of married-with-spouse-present
## then use it with genScatter to see what we can learn about marriage patterns
st60.append_column('mrdSpres', st60['marst']=="Married, spouse present")
genScatter(stDat=st60.where('sex','Female'),depvar='mrdSpres')

### Children ever born

Children ever born by age can tell us about fertility patterns with relatively little effort.
The variable CHBORN is simply the number of children that a woman reports having ever given birth to. There is some
uncertainty about how women report children who died very young -- particularly when the woman is at an advanced age. But is there any reason to believe that such biases would be different across race?

In [None]:
genScatter(st60.where('sex','Female'),depvar='CHBORN')

### Discuss the age distribution of children ever born

Speculate on why (if it is so in your state) at ages above 60, Black women often report fewer births than White women of the same age. 

##  Housing conditions

Since, as we have seen, Blacks are being paid on the order of half as much as whites with similar education and experience,  we should expect that most Black southerners will live in less desirable housing than White southerners. 

We can see that for example in the number of rooms per household:


In [None]:
# just use ONE record per household (dead horse, well beaten by now)
st60.where('PERNUM',1).pivot('ownership','race',values='rooms',collect=np.nanmean)

## But consider a more stringent definition of discrimination

What we have seen so far, is irrefutable evidence of inequality.    But does inequality imply discrimination?
Well ... in this case, by the sheer magnitude of the inequality in education and income, "yes" would be a pretty good answer. But for 
the sake of being difficult, let's consider a more stringent definition of discrimination -- such as one might encounter in an economics department: 

If we take as given the "initial endowment" (i.e. we don't consider the reasons why some people are poorer than 
others) we might call it discrimination only when people with the same resources face different constraints due to some unrelated characteristic.

To investigate this sort of thing, let's use our genScatter() along with the variable rentDecile -- which is one that your instructor built -- it breaks renters up into deciles (e.g. 10th percentile, 20th percentile ...) so a value of "30" for example means that the household's gross rent is higher than 30% of all renting households. 
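
How rentDecile was actually built is not shown in this notebook. The sketch below is only one plausible construction, under the assumption that a household's decile is the share of renting households paying strictly less, rounded down to a multiple of 10. The rents and the `rent_decile` helper are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical monthly gross rents for ten renting households.
rents = [20, 25, 30, 35, 40, 45, 50, 60, 80, 120]
sorted_rents = sorted(rents)


def rent_decile(rent, reference=sorted_rents):
    """Percent of reference rents strictly below `rent`, floored to a multiple of 10."""
    below = bisect_left(reference, rent)  # number of rents strictly less than `rent`
    return (10 * below // len(reference)) * 10


for r in (20, 35, 120):
    print(r, rent_decile(r))
```

Grouping households by a variable like this lets us compare what Black and White renters get for roughly the same money, which is the comparison the stricter definition of discrimination calls for.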

Click on the quiz below BEFORE trying to plot rooms by rentDecile

In [None]:
genScatter(stDat=st60.where('ownership','Rented').\
           where('PERNUM',1),xvar='rentDecile',depvar='rooms')
cquiz('greatmig0-05')

# What's in a toilet?

Unfortunately, that question is not *just* a stupid joke.  In the next cell, we create a variable called "toiletYES" which indicates whether or not a person lives in a house with their own toilet.  We also create a similar variable indicating whether or not the house has piped hot and cold water.

Use the genScatter() function to investigate the availability of private toilets and hot water across different rent levels by race,
then click on the quiz link below to answer a question that we will discuss in class next week.

In [None]:
## note the list comprehension
st60.append_column('toiletYES',[1 if re.match('Yes',toilet) else 0 for toilet in st60['toilet']])
np.mean(st60['toiletYES'])
## how do you make scatter plots out of this variable ?
st60.append_column('hotwaterYES', st60['hotwater']== 'Hot and cold piped water')
np.mean(st60['hotwaterYES'])

# a genScatter call is called for
#genScatter(?????)

In [None]:
cquiz('greatmig0-06')

## Optional Extra Challenge (fascinating and not too hard)


Create a new variable which you might call "atHome" which is True or 1  if the individual
lives with his or her parents one of whom is the head of the household (see the variable called 'relate'). 

Then use genScatter() to look at the age pattern of home leaving (aka the proportion of people of each age/race who do not live with a parent). 

Ignore the following unfortunate facts:

1. That children who leave home and leave the state at the same time, are not in the data and so reduce the proportions of home leavers. 
1. That parental death looks like home leaving by this measure. 

Please answer the optional question below


In [None]:
##OPTIONAL##
cquiz('greatmig0-opt0')

## Optional challenge - open ended and life changing (maybe)

Use the genScatter() function to investigate an aspect of life in the South that we have not yet touched on.  Remember that you can pass subsets of st60 to genScatter in order to look at the effect of an additional variable on the age/race pattern -- as we did with income and education. Remember also that way up at the top of this notebook, you can find a list of all the variables and a little information on each.  And finally ...

#### This problem could be the excuse you've been waiting for to go to office hours.

The question below asks you to discuss your results:

In [None]:
cquiz('greatmig0-opt1')

## Congratulations you are finished with Great Migration Lab 0

Please take a moment to give your thoughtful evaluation of this lab and

#### Remember to check out the question on the reading for next week

In [None]:
cquiz('greatmig0-eval')

In [None]:
cquiz('mpi1965')