Discussion 3: Working with Fogel and Engerman's Data

In [None]:
from datascience import Table
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use("fivethirtyeight")
import numpy as np

We will explore a smaller version of the Fogel and Engerman dataset, focusing on:

/1/ Creating a table and visualization for a common qualitative variable V15, Gender

/2/ Creating a table and visualization for a common quantitative variable V4, Sales by Year

One way to demonstrate the usefulness of these activities is to replicate the process with somewhat more involved variables. We look at two. First, V40 is conceptually similar to V15, but tracks whether or not the sale was “guaranteed.” How would you make a table and visualization that depicts this in an appropriate manner? Second, a variable can seem straightforward while its presentation still requires some thought. What is an insightful way to depict the number of sales per year? Instead of annual figures, why not present biennial (2 years), quinquennial (5 years), or decadal (10 years) figures?

/3/ Creating a table and visualization for a very uncommon qualitative variable V17, Color

Sale records frequently included descriptions of color (listed below), raising a whole host of questions. One such question: how do we, today, know what was considered Griff in the 1800s?

V17, skin color:
Value  Label
1      Negro
2      Griff
3      Yellow
4      Mulatto
5      Copper
7      Light
8      Brown
9      Creole
11     Dark
12     American Negro
13     African Negro
14     Unknown
15     Not Recorded

Apparently, at the time, Griff implied a color lighter than that of those classified as Negro but darker than that of those classified as Mulatto; here's an article that covers some of this: http://www.uvm.edu/~psearls/johnson.html. Of course, such an answer only pushes the question back: how do we today know the rule for classifying some as Negro and others as Mulatto? These are very difficult questions, and we will discuss how to approach them from a historically minded perspective.


In [None]:
data = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data1.csv")
data

/1/ Gender: The dataset has about 5,000 rows and over 40 variables, many of which are familiar to us from our last discussion. One pattern we noted was that men tended to be represented somewhat more frequently than women in sale records. We can see what pattern the records from New Orleans, a major hub in the slave trade, reveal.

We will do this in several steps:

In [None]:
#Counting the rows for each value of V15 across the entire dataset

V15_counts = data.group('V15')
V15_counts

In [None]:
#Adding labels

gender_counts = V15_counts.with_column('Gender', ['Male', 'Female', 'Other'])
gender_counts

In [None]:
#Simplification

gender_counts = V15_counts.with_column('Gender', ['Male', 'Female', 'Other']).select(['Gender', 'count'])
gender_counts

In [None]:
#Visualization
#Note, as motivation for these steps, compare: data.group('V15').barh('V15')

gender_counts.barh('Gender')
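
To make the "somewhat more frequently" comparison concrete, it can help to turn counts into shares. The following is a minimal sketch in plain Python rather than the datascience library, and it assumes that grouping amounts to counting each distinct code and ordering the result by code. The codes in sample_codes are made up for illustration and are not values from the dataset.

```python
from collections import Counter

# Made-up codes standing in for a column such as V15; not values from the dataset.
sample_codes = [1, 2, 1, 1, 2, 3, 1, 2]

# Count how often each distinct code occurs, then sort by code
# (assumed here to mirror the row order of a grouped table).
counts = sorted(Counter(sample_codes).items())

# Express each count as a percentage of all rows.
total = sum(count for _, count in counts)
shares = {code: round(100 * count / total, 1) for code, count in counts}

for code, count in counts:
    print(code, count, f"{shares[code]}%")
```

Shares are often easier to compare across datasets of different sizes than raw counts, which is one reason to report both.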

/2/ Sales by year: This variable records the year of a sale, and there are no missing values. Histograms are a common way to present data, and we will use them quite a bit.

In [None]:
#Selecting V4, and presenting it visually

viz_salesbyyear = data.select(["V4"])
viz_salesbyyear.hist()

In [None]:
#A more detailed view complements the default settings /10 bins/
#Note that the dates cover the period 1804-1862

viz_salesbyyear = data.select(["V4"])
viz_salesbyyear.hist(bins=np.arange(1800, 1870, 1))

In [None]:
#Another way to present the information is in percent format

viz_salesbyyear = data.select(["V4"])
viz_salesbyyear.hist(bins=np.arange(1800, 1870, 1), normed=True)
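
The step passed to np.arange sets the bin width, so a step of 5 or 10 in place of 1 would give quinquennial or decadal bins. To see what such a regrouping does to the same years, here is a minimal sketch in plain Python. The helper bin_counts is hypothetical (it is not part of the datascience library), it uses left-closed bins, and the years in sample_years are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

def bin_counts(years, start, width):
    """Count years in left-closed bins [start, start + width), [start + width, ...)."""
    counts = Counter(start + ((year - start) // width) * width for year in years)
    return dict(sorted(counts.items()))

# Made-up sale years for illustration only; not values from the dataset.
sample_years = [1804, 1806, 1811, 1811, 1819, 1820, 1835, 1839, 1862]

# Each key is the first year of a bin; each value is the number of sales in it.
quinquennial = bin_counts(sample_years, start=1800, width=5)
decadal = bin_counts(sample_years, start=1800, width=10)

print(quinquennial)
print(decadal)
```

Wider bins smooth out year-to-year noise but can hide short spikes or gaps, which is the trade-off to weigh when choosing a width.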

/3/ Color: We can apply the approach from /1/, and note that the results raise a series of further questions. 

In [None]:
#Counting the rows for each value of V17 across the entire dataset; note the addition of 'show'

V17_counts = data.group('V17')
V17_counts.show()

In [None]:
#Adding labels

color_counts = V17_counts.with_column('Color', ['Negro', 'Griff', 'Yellow', 'Mulatto', 'Copper', 'Black', 'Light', 
                                                'Brown', 'Creole', 'Quadroon', 'Dark Color', 'American Negro', 
                                                'African Negro', 'Unknown', 'Not Recorded'])
color_counts.show()

In [None]:
#Motivation for the next step, Simplification

color_counts.barh('Color')

In [None]:
#Simplification 

color_counts = V17_counts.with_column('Color', ['Negro', 'Griff', 'Yellow', 'Mulatto', 'Copper', 'Black', 'Light', 
                                                'Brown', 'Creole', 'Quadroon', 'Dark Color', 'American Negro', 
                                                'African Negro', 'Unknown', 'Not Recorded']).select(['Color','count'])
color_counts.show()

In [None]:
#Visualization

color_counts.barh('Color')
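
The labels above are attached by position, so they line up with the grouped rows only if the number and order of codes in the data match the list exactly. A lookup keyed on the code avoids that dependence. The following is a minimal sketch in plain Python using the V17 values and labels listed earlier in this discussion; the helper label_for and the codes in sample_codes are hypothetical and made up for illustration.

```python
# Codebook for V17 as listed earlier in this discussion.
V17_LABELS = {
    1: "Negro", 2: "Griff", 3: "Yellow", 4: "Mulatto", 5: "Copper",
    7: "Light", 8: "Brown", 9: "Creole", 11: "Dark",
    12: "American Negro", 13: "African Negro", 14: "Unknown", 15: "Not Recorded",
}

def label_for(code):
    """Return the codebook label, or flag a code the listing does not include."""
    return V17_LABELS.get(code, f"Unlisted code {code}")

# Made-up codes for illustration only; not values from the dataset.
sample_codes = [1, 2, 4, 6, 15]
labels = [label_for(code) for code in sample_codes]
print(labels)
```

Flagging unlisted codes, rather than silently shifting every later label by one, makes it easier to spot a mismatch between the data and the codebook.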

Replication of these results helps reinforce what we have covered. Look over the codebook, and see which variables interest you -- consider exploring the dataset for your project. For now, two suggestions:

/a/ Consider V40, and especially think about how you would present the information about a "guarantee of sale," for instance in a newspaper or a textbook.

/b/ Consider V4: how many bins seem reasonable, and what is lost and what is gained by moving away from the default settings?