Demography 88<br>
Fall 2019<br>
Carl Mason (cmason@berkeley.edu)<br>
##  The Great Migration : Clarksdale (part 1)

#### Goals for this week:

1. To examine conditions and characteristics of Black and White Americans in the South to validate / corroborate what we read in Lemann.

2. To develop statistics that quantify the discrimination  to which Black Americans in the South were subject in 1960. Discrimination is not easily measured as it has a lot to do with what is in people's brains and the US Census is far too cowardly to try to measure that directly.

3. To quantify the uncertainty due to sampling around our estimates of discrimination


This week the plan is to use data from the 1960 Census to investigate/illustrate Nicholas Lemann's description of life in the US South in the middle of the 20th Century. Even to those who have not read the Clarksdale chapter of <i>Promised Land</i> (of whom there are none among us) -- it will come as no surprise that Black Americans in the US South endured inequality in virtually every sphere of life.  The fact that this is well known should not deter the young data scientist from further exploration.

In the *next*  lab on the Great Migration,  we will take a look at the characteristics of those who chose to migrate from the South to the North.  Keep that in mind as we investigate conditions in the South because the obvious question that you might be asked is whether the people who suffered most were the most likely to leave.

The Census data that we are using today comes to us via the Integrated Public Use Microdata Series project at the University of Minnesota Population Center. [http://ipums.org] is a fantastic resource containing not only census data but also data from many other sources both US and from other countries -- all of it carefully harmonized to make the data as comparable as possible. Here's the official citation:

>Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015. http://doi.org/10.18128/D010.V6.0.

## Some important notes about the data we are going to use today:

#### While the US Census is a complete enumeration, what IPUMS distributes is not.  In order to save a lot of money, IPUMS draws a random sample of 5 percent of households (this is done with the manuscripts of the ancient census returns, of course,  but it is functionally the same as if they had selected the dwelling units).  By randomly selecting *households*  rather than *individuals*,  much more useful information is available to us -- things like number of siblings in a household; the characteristics of parents *and* children together; characteristics of dwelling units; and the relationship among people who share a dwelling.

#### It is worth keeping this in mind as you ponder your final project, as this sort of data is available from IPUMS for lots of times and places.  As is typical for census data from IPUMS, each row of the table that we read in will refer to an individual -- it will contain information on lots of things pertaining to that individual, for example,  sex, age, wage income, and education.  Each row also contains a variable called "SERIAL"; this number uniquely identifies households.  Groups of rows of the table with the same value of SERIAL are households.  Each person within such a household has a unique PERNUM value, starting with 1.  Within each household, PERNUM is unique for each individual while SERIAL is the same.  All household level variables, e.g. dwelling unit characteristics,  will be the same for all members of the household.
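
To make the SERIAL / PERNUM layout concrete, here is a minimal sketch using a handful of made-up person records (the values are invented for illustration and do not come from the IPUMS file). It uses plain Python rather than the `datascience` Table so that it runs on its own.

```python
from collections import defaultdict

# Made-up person records: one dict per row, mimicking the IPUMS layout.
# SERIAL identifies the household; PERNUM numbers people within it.
people = [
    {"SERIAL": 101, "PERNUM": 1, "AGE": 42, "rooms": 4},
    {"SERIAL": 101, "PERNUM": 2, "AGE": 40, "rooms": 4},
    {"SERIAL": 101, "PERNUM": 3, "AGE": 13, "rooms": 4},
    {"SERIAL": 102, "PERNUM": 1, "AGE": 67, "rooms": 2},
]

# Group rows by SERIAL to recover households.
households = defaultdict(list)
for person in people:
    households[person["SERIAL"]].append(person)

for serial, members in households.items():
    # A household level variable (here: rooms) repeats on every member's row.
    distinct_rooms = {m["rooms"] for m in members}
    print(serial, "has", len(members), "members; rooms values:", distinct_rooms)

household_sizes = {serial: len(members) for serial, members in households.items()}
```

Rows that share a SERIAL form one household, and the household level value (`rooms`) is identical on each of that household's rows.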

#### In this particular sample (1960, 5 percent) simple random sampling is used.  As we noted in the Fertility lab,  this is not always the case with IPUMS.  But this time it is, so we need not worry about weights -- except when we wish to present estimates for the entire population.  If we wanted, for example, to know the number of 13 year olds in the population,  we would count them up in our sample and then divide that count by .05 (or, equivalently, multiply it by 20).
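
The scaling step can be sketched in a couple of lines. The count of 1,500 below is hypothetical (it is not a figure from any state file), and `population_estimate` is just an illustrative helper name.

```python
SAMPLING_FRACTION = 0.05  # the 1960 sample described above is 5 percent of households

def population_estimate(sample_count, fraction=SAMPLING_FRACTION):
    """Scale a count from a simple random sample up to the full population."""
    return sample_count / fraction

# Hypothetical: suppose we counted 1,500 thirteen year olds in our sample.
sample_count = 1500
estimate = population_estimate(sample_count)
print(round(estimate))

# Dividing by .05 and multiplying by 20 are the same operation.
same_thing = sample_count * 20
```

This only works so simply because every household had the same chance of being sampled; with unequal sampling probabilities each row would need its own weight.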

## Put your student id in the obvious place and then run the cell below

In [None]:
# Run this cell to import the stuff we'll need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
#plt.style.use('fivethirtyeight')

%matplotlib inline
from datascience import Table
from datascience.predicates import are
from datascience.util import *
from IPython.display import HTML, IFrame, display
datasite="https://courses.demog.berkeley.edu/mason88/data/gmig0/"
quizsite="https://courses.demog.berkeley.edu/mason88/cgi-bin/quiz.py"
  
def cquiz(qno) : 
    import IPython, requests 
    try:
        sid
    except NameError: 
        print("HEY! did you enter your sid way up at the top of this notebook?")
        return  # without a sid there is no quiz link to build
    Linkit='{0}?qno={1}&sid={2}'.format(quizsite,qno,sid)
    #print(Linkit)
    html = requests.get(Linkit)
    #display(IFrame(Linkit, 1000, 300))
    display(IFrame(Linkit, 1000, 400))


    
######################
# Here it is ... the obvious place to put your student id
sid=""
######################
if sid == "" :
    print("HEY! didn't I tell you to put your sid in the obvious place")
 


## And as usual indicate with whom you are working this week

In [None]:
cquiz('greatMig0-partners')


# Background on African American Migration

1. African American History can be told (well oversimplified) as the story of three migrations
    1. The "Middle Passage"
    1. The 19th Century migration from the old (tobacco) South to the "new" (cotton) South
    1. The 20th Century "Great Migration" from the rural South to the urban North between 1917 and 1970
    

#  Antebellum Slavery

1. Cotton is the big story
   1. The cotton gin makes "short staple" cotton profitable and opens up vast territory for cotton cultivation
   1. BAD news for Native Americans who occupy much of that land by treaty
   1. BAD news for enslaved Americans who will move in large numbers onto cotton plantations
   
   1. Cotton is hugely important crop for Industrial Revolution
       1. 80% of British imports from US South on eve of Civil War
       1. Also key to industrialization in New England
     
   
   

# The 18th Century view of slavery and the Civil War

1. Slavery was morally wrong and therefore inefficient
1. slavery was demographically doomed
    1. Limited amount of land where slave agriculture is profitable
    1. Population grows as Malthus would want it to 
        ==> too many slaves not enough land
        ==> cheaper to hire workers at subsistence wage

1. Cotton Gin expanded area of cultivation OUCH!
1. Big unsettled question whether technology was improving efficiency of slave ag.
1. By 1850s the issue is the colonies ("territories"). 
   1. Imperative for anti-slavery folks, to keep slavery out of territories. 
   1. Equally imperative for pro-slavery to open new lands.
   


# Reconstruction

1. The end of the Civil War brought emancipation but no 40 acres and no mule
    1. Self sufficiency was not widely possible for freed slaves see above (and Lemann)
    1. New methods of social control and labor exploitation develop
  
    

<img src="http://courses.demog.berkeley.edu/mason88/images/Census_1900_Percent_Black.png">

## The US South in 1960

By 1960, much of the Great Migration had already run its course.  The migration began around 1917 with US involvement in WWI, and by the 1970s the general move towards the "Sun Belt" overwhelmed what was left of the Great Migration and the net migration pattern of Black Americans turned southward.  But the reverse great migration is different in many respects from the south to north flow -- most importantly the reverse flow was (and is) an urban to urban migration whereas much of the Great Migration was rural to urban.

Today the proportion of African Americans who live in the South (~ 55%) is close to what it was in 1960 (~60%).  So while the pioneers of the Great Migration  established themselves in the North decades earlier,  the population of the South was still experiencing natural growth and still sending considerable numbers of people north.

This week, to add interest, we can each choose a southern state at random for analysis.  The structure of each state
data file is the same, but presumably some of the values and results will be somewhat different. Doing state level analyses will also speed us along by keeping our data sets to a manageable size. Since the state data files are identically structured, code written for any state will work on any other-- so even though you and your partner may be working on different states, you can still share code snippets without any difficulty.



### Read some data and develop some code to compare the positions of Black and White Americans

In [None]:
## Read a file that contains the names of all the state level data sets
files=Table().read_table(datasite+'States1960.csv')
## choose a file at random
print("Pick one state at random from this list")
files.show()
randomState=files.sample(k=1)[1][0]

print("Reading data for :"+randomState)
## read the data corresponding to the state that fate has determined that you should know more about
###
# NOTE -- after reading a state1960.csv file one time, you may uncomment and modify the following line to
# make sure that you read that same (randomly assigned) state in subsequent sessions.  It is helpful to do this
# if you don't have enough time to do the entire lab in one sitting  OR if you are prone to having your
# notebook restart for want of RAM

# randomState="??????1960.csv"
st60=Table().read_table(datasite+randomState)
st60.show(5)
print("number of rows: {} ".format(st60.num_rows))


## Let's explore the data

with some summary statistics and cross tabs

`list(st60)` gives us a list of column names in the st60 table. Note that some are lower case and others are UPPER case.  Variables in lower case are those that your instructor has cleaned up -- generally these are categorical variables, and your instructor has changed number codes in the original file to more informative text strings.  The variables in UPPER CASE are straight from the IPUMS data file. UPPER case variables are always numbers, but the information that they represent is not always numerical: for example, SERIAL is a numerical identifier that is unique for each household, but the number itself has no significance.


In [None]:
# A list of variables
list(st60)


# Cross tabulations and summary statistics

In the next few cells, we'll develop some functions that will help us explore the data. We'll use a tiny bit of pandas but once again, for this class, knowing how to *use* the functions is what's important. 

We'll deal with continuous and categorical variables separately. The immediate goal is to create some functions that will show us differences in socio-economic sorts of measures between Black and White Americans in the US South in 1960.

In [None]:
# a function to produce a one way frequency table of a categorical 
# variable
def pfreq(vname,data=st60):
    """
    expects a categorical col from a table; returns frequency dist.
    Mostly this will be called by other functions
    """

    df = pd.Series(data[vname]).value_counts()
    res=df/len(data[vname])
    result=Table().with_column(vname,res.index).with_column('value',res.values)
    
    return(result)
    #return(df/len(var))


    

In [None]:
# One way frequency distribution aka "empirical distribution" of 
# employment status
pfreq('empstat')

In [None]:
pfreq('relate')

### disaggregating by race

Since our focus is on discrimination, we'll generally want to look at variables like 'empstat' *disaggregated* by race.  Let's develop a function that does that.
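
Before building the real function, here is a rough standalone sketch of what "disaggregated by race" means for a categorical variable. The rows are made up, and the sketch uses plain Python instead of the `datascience` Table, so treat it as an illustration of the idea rather than the lab's own code.

```python
from collections import Counter, defaultdict

# Made-up (race, empstat) pairs standing in for two columns of st60.
rows = [
    ("Black", "Employed"), ("Black", "Employed"), ("Black", "Unemployed"),
    ("Black", "Not in labor force"),
    ("White", "Employed"), ("White", "Employed"), ("White", "Employed"),
    ("White", "Not in labor force"),
]

# Count categories separately within each group.
counts = defaultdict(Counter)
for race, empstat in rows:
    counts[race][empstat] += 1

# Convert counts to proportions so that each group's distribution sums to 1.
proportions = {
    race: {cat: n / sum(c.values()) for cat, n in c.items()}
    for race, c in counts.items()
}

# A category that never occurs in a group is simply missing from its dict,
# so it has to be filled in with 0 before the groups can be lined up side by side.
white_unemployed = proportions["White"].get("Unemployed", 0.0)
print(proportions["Black"]["Employed"], proportions["White"]["Employed"], white_unemployed)
```

Note the last step: any comparison table has to decide what to do with categories that are empty for one group, and treating them as a frequency of 0 is the natural choice.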


In [None]:
# uses pfreq to produce empirical distributions disaggregated by
# race -- or some other variable if desired

def byraceCAT(vname,data=st60,byvar='race'):
    """
    expects a table with at least 2 columns: vname -- the column of 
    interest, and byvar (defaults to 'race') -- the column by which 
    frequencies shall be disaggregated and displayed
    """
    vcats=np.unique(data[vname])
    result=None
    for r in np.unique(data[byvar]):
        # frequency distribution of vname within this category of byvar
        res=pfreq(vname,data.where(byvar,r)).relabeled('value',r)
        # categories that never occur in this group get a frequency of 0
        for vc in vcats :
            if vc not in res[vname]:
                res.append(Table().with_columns(vname,vc,r,0))
        # join this group's column onto the columns built so far
        if result is None:
            result=res
        else:
            result=result.join(vname,res)
    return(result)

In [None]:
byraceCAT(vname='empstat')


In [None]:
## We can capture the output of byraceCAT as a table
edRace=byraceCAT('edyears')
edRace.show()

In [None]:
byraceCAT('edyears',byvar='labforce')

In [None]:
def byraceCON(vname,byvar='race',data=st60):
    """
    expects a variable name (vname), byvar and data; produces descriptive stats of
    vname disaggregated by byvar. Both must be columns of data (a table)
    """
    result=Table().with_column('Statistic',["N","mean","median","std","min","max"])
    for r in np.unique(data[byvar]):
       
        pds=pd.Series(data.where(byvar,r)[vname])
        res=[pds.count(),pds.mean(),pds.median(),pds.std(),pds.min(),pds.max()]
      
        result=result.with_column(r,res)
    return(result)    

In [None]:
byraceCON('CHBORN').show()
# once again limiting the input data can produce more meaningful results
byraceCON('CHBORN',data=st60.where('AGE',are.between(15,45)).where('sex','Female'))

In [None]:
byraceCON('incwage').show()
byraceCON('incwage',data=st60.where('edyears',12))

## Graphs ... yes graphs are useful in our search for evidence of discrimination.

Graphs are especially useful when there are a large number of categories (e.g. edyears) or a continuous variable (e.g. AGE) across which we want to compare values corresponding to Black and White Americans.  We'll make use of the .groups() method to draw graphs.

Since many determinants of socioeconomic status *should* vary with age and experience, inequality *within* age categories is more informative with respect to discrimination.  Age lends itself quite well to the x-axis

Let's compare income by age by drawing some scatter plots with AGE on the x axis and with separate dots for White and Black averages. 

In [None]:
## INCOME
# incwage = wage and salary income
## a table of mean wage income by race and age

## THIS is the important line of code -- using .groups()
incwageA=st60.select(['AGE','race','incwage']).\
    groups(['AGE','race'],collect=np.nanmean)
# here is what the result looks like
incwageA.where('AGE',are.between(15,20)).show()
## clean up variable names
incwageA.relabel('incwage nanmean','WageIncome')

## use the datascience package method for scatter plot
incwageA.scatter('AGE','WageIncome',colors='race')
plt.title("Income v Age")


## Why might a careful scientist (with no knowledge of the history of race in the US) *not* accept this as evidence of *discrimination*?

It ***definitely shows difference*** but is wage inequality per se proof of discrimination?

### Enlightening discussion ensues




#### Consider the graph below

In [None]:
## Education
# edyears = years of education
## a table of mean years of education by race and age

## THIS is the important line of code -- using .groups()
edyearsP0=st60.select(['AGE','race','edyears']).groups(['AGE','race'],collect=np.nanmean)
# here is what the result looks like
edyearsP0.where('AGE',are.between(25,30)).show()
## clean up variable names
edyearsP0.relabel('edyears nanmean','YearsEducation')

## use the datascience package method for scatter plot
edyearsP0.scatter('AGE','YearsEducation',colors='race')
plt.title("Education v Age")
edyearsP0.where('AGE',are.not_above(20)).scatter('AGE','YearsEducation',colors='race')
plt.title("Closeup of early ages")

## OK is that sufficient ?


How about taken together the graphs of wage income v age and education v age... is that discrimination?

Once again, knowing what we do about race in America, it is very hard to insist --in the face of these graphs -- that your state's labor market and education systems were color blind.  But ...


## Pause for further discussion

In [None]:
## How about income by years of education...
incwageEd=st60.where('labforce','Yes, in the labor force').\
    where('sex','Male').\
    select(['race','edyears','incwage']).\
    groups(['race','edyears'],collect=np.nanmean)

##  cleanup by relabeling the column of interest
incwageEd.relabel('incwage nanmean','WageIncome')
# take a look at the table we just built
incwageEd.show(5)
# and make some graphs
incwageEd.scatter('edyears','WageIncome',colors='race')
plt.title("Wage income by years of education")
##

# And finally ... income by age within education levels disaggregated by race

The graphs of wage income by age and by edyears are pretty compelling, but we can put a yet finer point on it by considering age.  Over the life course, a person with a given level of education should see some wage growth as she gains experience. If workers of different race/education groups have different age profiles ... that could be gumming things up.  

Is it likely that age could produce the pattern we just saw?  Probably not but as demographers, we must press on regardless.
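
To see how age *could* gum things up in principle, consider a toy example. The wage schedule and the two age mixes below are entirely hypothetical; the point is only that two groups paid identically at every age can still show different average wages if one group is older.

```python
from statistics import mean

# Hypothetical wage schedule that depends only on age, not on group.
wage_at_age = {20: 1000, 40: 3000, 60: 3500}

# Two made-up groups with the same education but different age mixes.
ages_group_a = [20, 20, 20, 40, 60]   # younger on average
ages_group_b = [20, 40, 40, 60, 60]   # older on average

mean_a = mean([wage_at_age[a] for a in ages_group_a])
mean_b = mean([wage_at_age[a] for a in ages_group_b])

print(mean_a, mean_b)
# Within any single age the two groups earn exactly the same wage,
# so the whole gap between the averages comes from the age mix.
```

Comparing groups *within* age (and education) categories removes this kind of composition effect.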

In [None]:
## plot incwage by age,education and race 

## include only male laborforce participants aged 16 to 63 (are.between excludes the upper bound)
## group by age,race and edyears
# make a table of mean incwage by age, race and edyears.

incwageP0=st60.where('labforce','Yes, in the labor force').\
    where('AGE',are.between(16,64)).\
    where('sex','Male').\
    select(['AGE','race','edyears','incwage']).\
    groups(['AGE','race','edyears'],collect=np.nanmean)

##  cleanup by relabeling the column of interest
incwageP0.relabel('incwage nanmean','WageIncome')
# take a look at the table we just built
incwageP0.show(5)
# and make some graphs
incwageP0.where('edyears',are.between_or_equal_to(0,8)).scatter('AGE','WageIncome',colors='race')
plt.title("0-8 years of education")
##
incwageP0.where('edyears',are.between_or_equal_to(9,11)).scatter('AGE','WageIncome',colors='race')
plt.title("9-11 years of education")

##
incwageP0.where('edyears',12).scatter('AGE','WageIncome',colors='race')
plt.title("12 years of education")
##
incwageP0.where('edyears',are.above(12)).scatter('AGE','WageIncome',colors='race')
plt.title("more than 12 years of education")
#edyearsP0.where('AGE',are.not_above(20)).scatter('AGE','WageIncome',colors='race')


In [None]:
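## NOTE: smallScatter is not defined earlier in this notebook. The definition below is an
## assumed reconstruction (mean wage income by age and race, as in the previous cell).
def smallScatter(data):
    tab=data.where('labforce','Yes, in the labor force').where('sex','Male').\
        where('AGE',are.between(16,64)).select(['AGE','race','incwage']).\
        groups(['AGE','race'],collect=np.nanmean)
    tab.relabel('incwage nanmean','WageIncome').scatter('AGE','WageIncome',colors='race')
##  call smallScatter with interesting subsets of education levels
print("Just people who have NOT graduated high school")
smallScatter(st60.where('edyears',are.below_or_equal_to(11)))
## In order to answer the next question, you will need to produce scatter plots like this one
## for high school graduates (edyears == 12)  and for those with some college (edyears > 12)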


##  Housing conditions

Since, as we have seen, Blacks are being paid on the order of half as much as whites with similar education and experience,  we should expect that most Black southerners will live in less desirable housing than White southerners. 

We can see that for example in the number of rooms per household:


In [None]:
# just use ONE record per household 
rooms=st60.where('PERNUM',1).pivot('ownership','race',values='rooms',collect=np.nanmean)
## Create a function to compute household size
def mnunique(x) :
    'return the mean count per unique element -- used with SERIAL to get mean household size'
    unique_elements, counts_elements = np.unique(x, return_counts=True)
    return(np.mean(counts_elements))  
# use ALL person records here: household size is the number of rows sharing a SERIAL
hhsize=st60.pivot('ownership','race',values='SERIAL',collect=mnunique)

In [None]:
print("Rooms per house")
rooms.show()
print("People per house")
hhsize.show()

## Consider a more stringent definition of discrimination

As we saw with education and income, discrimination is surely there, but the more complicated the systems and markets involved, the harder it is to separate discrimination from mere difference.  The housing market is a little less complicated than the labor market and education systems so we can take our analysis a little further.

Let's consider a definition of discrimination such as one might encounter in an economics department: 

If we take as given the "initial endowment" (i.e. we don't consider the reasons why some people are poorer than others) we might call it discrimination only when people with the same resources face different constraints due to some unrelated characteristic.

To investigate this sort of thing, let's use the variable rentDecile -- which is one that your instructor built -- it breaks renters up into deciles (e.g. 10th percentile, 20th percentile ...) so a value of "30" for example means that the household's gross rent is higher than 30% of all renting households. To be more precise households whose rent is in the 30th to 39.9999th percentile of rents will have a rentDecile value of 30. 
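
Here is a rough sketch of how a decile variable of this kind *could* be computed. This is an assumption for illustration only: the function name `rent_decile`, the made-up rents, and the exact rule (share of households paying strictly less, rounded down to a multiple of 10) are not taken from the instructor's code, and the sketch says nothing about how non-renters are coded.

```python
def rent_decile(rent, all_rents):
    """Return 0, 10, ..., 90: the decile band of `rent` among `all_rents`.

    Assumed rule: percentile = percent of households paying strictly
    less rent, rounded down to a multiple of 10.
    """
    below = sum(1 for r in all_rents if r < rent)
    percentile = 100 * below / len(all_rents)
    return int(percentile // 10) * 10

# Made-up monthly gross rents for ten renting households.
rents = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
deciles = [rent_decile(r, rents) for r in rents]
print(deciles)

# A rent of 37 is higher than 4 of the 10 rents (40 percent), so it lands in band 40.
band_for_37 = rent_decile(37, rents)
```

Under this rule a household in band 30 pays more than at least 30 percent, but less than 40 percent, of renting households, which matches the description above.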

We can use the rentDecile variable along with some housing characteristics to look for evidence of discrimination in the rental housing market. 

### NOTE that in the code below, we have to limit our computation to  observations where "PERNUM" == 1.  PERNUM is just an index number within each household.  Every household has a person whose PERNUM ==1, because every household has at least one person in it.  By limiting our data to one person per household, we are effectively looking at households rather than people.  That's a good thing because otherwise we would over count houses with more people in them.
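
A tiny made-up example shows why this matters. With two hypothetical households, averaging over every person row gives a different answer than averaging over one row per household.

```python
from statistics import mean

# Made-up person rows: (SERIAL, PERNUM, rooms). Household 1 has 4 people in
# 6 rooms; household 2 has 1 person in 2 rooms.
rows = [(1, 1, 6), (1, 2, 6), (1, 3, 6), (1, 4, 6), (2, 1, 2)]

# Averaging over every person counts the large household four times.
person_level_mean = mean([rooms for _, _, rooms in rows])

# Keeping only PERNUM == 1 leaves exactly one row per household.
household_level_mean = mean([rooms for _, pernum, rooms in rows if pernum == 1])

print(person_level_mean, household_level_mean)
```

The person level average is pulled toward the larger household; the household level average treats each dwelling once.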

In [None]:
# Mean number of rooms by rent decile disaggregated by race

roomcost=st60.where('PERNUM',1).where('rentDecile',are.above(0)).\
select(['rentDecile','race','rooms']).groups(['rentDecile','race'],collect=np.nanmean)
roomcost.show()
roomcost.scatter('rentDecile','rooms nanmean',colors='race')

## How about uncertainty ?

Once again, we must remind ourselves that we are working with a random sample and as such, any conclusions we might wish to draw about the underlying universe (i.e. some southern state in 1960) are subject to many kinds of uncertainty. Some sources of uncertainty, e.g. bias among census enumerators, we have no way of addressing. One important kind of uncertainty that we can address is sampling uncertainty.  

Suppose for example that after all this computation, we find that in Tennessee in 1960 the IPUMS sample reveals that among households paying about the median rent (rentDecile = 50), those households headed by White people enjoy an average of 3.80 rooms while Black households on average have only 3.32. **(If your state is NOT Tennessee, you are probably NOT seeing these numbers in the graphs that you produced above.)**

Although there is plenty of room to quibble, a reasonable person might argue that the ratio of these two numbers 3.32/3.80 = 0.87 is a measure of discrimination that even an economist ought to accept.  We are comparing Black and White households **who are paying about the same amount of money for rent** and we are observing that **Black headed households get fewer rooms.** Let's call this ratio the Black-White room ratio (BWRR).
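
As a quick check of the arithmetic, the ratio can be computed directly from the two averages quoted in the text. The helper name `black_white_ratio` is just for illustration.

```python
def black_white_ratio(black_mean, white_mean):
    """Ratio of the Black group mean to the White group mean."""
    return black_mean / white_mean

# The two average room counts quoted in the text above.
bwrr = black_white_ratio(3.32, 3.80)
print(round(bwrr, 2))  # 0.87
```

A value below 1 means Black headed households get fewer rooms, on average, at this rent level; a value of exactly 1 would mean no difference.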

Again, quibbling is possible (let's do that in class).  But if we accept that the BWRR at least plausibly quantifies discrimination,  then we might also observe that the value 1.0 is an important benchmark.  If the Black to White ratio for the whole population (of median renters) were 1.0, then Blacks and Whites who pay the same rent would generally get the same number of rooms.


And if 1.0 is an acceptable benchmark, then we can proceed to estimate how (un)confident we are that there was discrimination in the rental market by...yes you guessed it... simulating some data  and comparing the observed BWRR to 1.0.

If you didn't guess that, it's probably because you haven't seen it yet in data 8 -- but you will very shortly.

In terms of data 8 we can speak of a "model" wherein renters of median priced housing are equally likely to have the same amenities regardless of race.  According to that model the BWRR <i>statistic</i> should be 1.0.  The degree to which the BWRR falls short of 1 is a measure of discrimination in the housing market.

But reminding ourselves that we are dealing with a *sample*, rather than a complete enumeration, the possibility emerges that even if the true underlying world were without discrimination, we might still have an unlucky day in which an unusually high number of black headed households *with fewer than the typical number of rooms* just happened to be selected.

That's the uncertainty that we want to quantify:

What is the probability -- assuming that the true state of nature (your state's 1960 rental market) is without discrimination -- that we *could* draw a sample that nonetheless produces a BWRR of less than 1?

## Let's ponder this

1. Does the size of the sample make a difference?
1. Does the size of the underlying population make a difference?

## Simulating a discrimination free world

The most straightforward data 8 sort of way to look at this problem is with a model, as is done in Chapter 11.1 of the textbook --  look for *Swain v Alabama*.  As in the example from the book, our model of the universe is one in which there is no discrimination. In the context of rooms, that means that anyone who pays the same amount of rent should, on average, get the same number of rooms. In other words, in our model universe, the BWRR should be 1 with variation due only to the randomness of our sample.
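
The shuffling idea behind such a simulation can be sketched on made-up data. Everything below is hypothetical (the room counts, the group sizes, the helper `bw_ratio`), and it uses only the Python standard library so it runs anywhere; it is a sketch of the general technique, not the lab's own code.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up households at one rent level: race of head and number of rooms.
races = ["Black"] * 8 + ["White"] * 10
rooms = [2, 3, 3, 3, 4, 3, 2, 4] + [3, 4, 4, 5, 3, 4, 4, 5, 4, 3]

def bw_ratio(races, rooms):
    """Mean rooms for Black headed households divided by mean for White headed."""
    black = [x for race, x in zip(races, rooms) if race == "Black"]
    white = [x for race, x in zip(races, rooms) if race == "White"]
    return mean(black) / mean(white)

observed = bw_ratio(races, rooms)

# In a world without discrimination the race labels carry no information
# about rooms, so shuffling rooms against fixed labels mimics that world.
simulated = []
for _ in range(1000):
    shuffled = rooms[:]
    random.shuffle(shuffled)
    simulated.append(bw_ratio(races, shuffled))

print("observed ratio:", round(observed, 3))
print("mean simulated ratio:", round(mean(simulated), 3))
```

The simulated ratios cluster around 1 because shuffling breaks any link between race and rooms; the question is then how far out in that cluster the observed ratio sits.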




In [None]:
## select some

In [None]:
## simulating a color blind market
## households paying about the median rent (rentDecile == 50), as in the text above
rent50=st60.where('PERNUM',1).where('rentDecile',50)
result=[]
for trial in np.arange(100):
    smp=rent50.sample(with_replacement=False).select(['rooms'])
    smp.append_column('race',rent50['race'])
    res=smp.group("race",np.mean)
    result.append(res['rooms mean'][0]/res['rooms mean'][1])
np.mean(result)

In [None]:
# draw histogram of simulated result (BW ratio under color blind model)
plt.hist(result)
# compute the BWratio observed in the data
tratio=rent50.select(['race','rooms']).group('race',np.mean)
print(tratio)
BWratio=tratio['rooms mean'][0]/tratio['rooms mean'][1]
BWratio
# add the red dot of observed reality to the histogram
plt.scatter(BWratio, 0, color='red', s=30)
plt.title("Histogram of uncertainty with red dot of reality")
print("simulated values at or below observed: {}".format(np.sum(np.array(result) <= BWratio)))

## So what's the bottom line:

It depends on your state of course, but in general, you're probably looking at a pretty normal bell shaped sort of histogram with a red dot over to the left entirely beyond the left most bar of the histogram.  

### Fill in the blank:

The probability of observing a BWratio as low(high) as the observed BWratio is approximately _*BLANK*_.
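
One common way to turn a list of simulated ratios into an approximate probability is to take the share of simulated values that are at least as extreme as the observed one. The sketch below uses made-up numbers and a hypothetical helper name, `empirical_p_value`; it looks only at the "as low as" direction.

```python
def empirical_p_value(simulated, observed):
    """Share of simulated statistics that are at or below the observed one."""
    at_or_below = sum(1 for s in simulated if s <= observed)
    return at_or_below / len(simulated)

# Made-up simulated ratios and observed ratios, for illustration only.
simulated_ratios = [0.93, 0.97, 0.99, 1.00, 1.01, 1.02, 1.04, 1.05, 1.08, 1.10]

p_moderate = empirical_p_value(simulated_ratios, 0.98)  # 2 of 10 are at or below
p_extreme = empirical_p_value(simulated_ratios, 0.85)   # none are at or below

print(p_moderate, p_extreme)
```

When none of the simulated values is as extreme as the observed one, the honest statement is "less than 1 in the number of trials" rather than exactly zero.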

## Your turn

Your task now is to repeat the above analysis for two additional variables: 

1. HotWater -- the presence of piped hot water in the house
1. Toilet -- the exclusive use of an indoor toilet in the house

### NOTE:

1. Since these two variables are True/False rather than numerical (like the number of rooms) the mean of these variables is equivalent to the proportion of the population for which the value of the variable is "True".  This presents no problem you can still take the Black:White ratio as we did above and make all the same computations.
1. The two new variables are computed for you in the next cell. Note the CaptAliZation.
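
The first point can be checked with a made-up True/False list (the values are invented, not taken from the data):

```python
# Made-up True/False indicator, e.g. whether a household has piped hot water.
has_hot_water = [True, False, True, True, False]

# True counts as 1 and False as 0, so the mean is the share that is True.
proportion = sum(has_hot_water) / len(has_hot_water)
print(proportion)  # 0.6
```

Three of the five made-up households are True, so the mean is 0.6, which is exactly the proportion with hot water.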


In [None]:
st60.append_column('HotWater',[x =='Hot and cold piped water' for x in st60['hotwater']])
st60.append_column('Toilet',[x =='Yes, exclusive use' for x in st60['toilet']])
print()

In [None]:
cquiz('greatmig01-signif')

In [None]:
cquiz('greatmig01-disc')

In [None]:
cquiz('greatmig01-hot')

In [None]:
cquiz('greatmig01-toilet')

In [None]:
cquiz('greatmig01-tbd0')

In [None]:
cquiz('greatmig01-tbd1')

## That's it for part 1 of the great migration lab

Please take a minute to evaluate your experience on this lab.  And remember to look on the course calendar for readings for next week.

In [None]:
cquiz('greatmig01-eval')