In [None]:
from datascience import Table
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use("fivethirtyeight")
import numpy as np

# Discussion 4: Examining Fogel and Engerman's Data Further

Last time, we looked at an example of a qualitative variable, namely gender (V15), and a quantitative one, namely recorded number of sales per year (V4). This time we will explore what seems at first a straightforward variable, price (V14). One motivation for this is the still quite controversial claim about the economic efficiency of slavery that Fogel outlines in "Coming to Terms with the Economic Viability of Slavery."

As an outline, here are the topics we will cover today: 

/1/ Thinking about prices

/2/ Central tendencies

/3/ Exploring the dataset further

Let's start by reading in the data and doing a bit of exploration.

/a/ Exploring the data about price

In [None]:
data = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data1.csv")
data

In [None]:
plots.hist(data["V14"], bins=100)

# Several alternatives exist; try them and see which you prefer, or not:
# data.hist('V14', bins=100)
# plots.hist(data["V14"], bins=np.arange(0, 100000, 1000))

In [None]:
data.hist('V12', bins=np.arange(0,50,1))

In [None]:
data.hist('V13', bins=np.arange(0,50,1))

/b/  What does $350 in 1804 actually mean, or $1,000?

Before talking further about price, let's examine several ways to think of historical prices:

https://www.measuringworth.com/tutorial1.php 

Note that there are very, very high prices in the dataset.

/c/ Simplification: limiting exploration of price, and inferences we can draw

To get a rough sense of prices, three simplifications help: keep only the records where V14 != 99999, where V12 == 1, and where V13 == 1. By doing this, we will see a trend, and we will also see that the trend describes only a part of what the dataset records. Would you consider this a huge caveat, maybe a necessary one, or maybe both?
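To see why the first simplification matters, here is a minimal sketch on made-up numbers (they are not taken from the dataset). It assumes that 99999 marks a sale whose price was not recorded, rather than an actual price; check the Codebook before relying on that reading.

```python
import numpy as np

# Made-up prices for illustration only.
# Assumption: 99999 is a placeholder for "price not recorded".
toy_prices = np.array([350, 500, 800, 99999, 1000, 99999])

# Averaging everything treats the placeholder as if it were a real price
mean_with_code = np.mean(toy_prices)

# Keep only the entries that are not the placeholder, then average
recorded = toy_prices[toy_prices != 99999]
mean_recorded = np.mean(recorded)

print("Average with 99999 included:", mean_with_code)
print("Average of recorded prices: ", mean_recorded)
```

Two placeholder entries are enough to push the average far above any of the real prices in this toy example.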

In [None]:
PriceRecorded=data.where(data["V14"] != 99999)
PriceRecorded.sort("V14", descending=True)

In [None]:
PriceRecorded.hist('V14', bins=np.arange(0, 35000,100))

#Your graph may look a bit nicer; I worked with an older version of Tables

In [None]:
#To confirm, one option is: 
#PriceBins = PriceRecorded.bin('V14', bins=np.arange(0,31600,100))
#PriceBins.sort("bin", descending=True)
#Note: this provides a useful view: PriceBins.sort("V14 count", descending=True)

In [None]:
PriceRecorded.where(PriceRecorded.column('V14')> 5000).sort('V14', descending=True).show()

In class we saw the use of np.logical_or to subselect only certain values from a dataset (see below). We can do a similar thing in several steps:
/i/ keep only the records where V12 is 1
/ii/ keep only the records where V13 is 1
/iii/ sort by V14 to confirm the earlier findings about the 99999 values


Here is what I had in mind:

both = census.where(census.column('SEX') != 0)
both.where(np.logical_or(both.column('AGE') == 18, 
                         both.column('AGE') == 19))
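The difference between np.logical_or and np.logical_and is easy to mix up, so here is a small sketch on made-up arrays. The names col_a and col_b are hypothetical stand-ins for two columns of a table; they are not part of the dataset.

```python
import numpy as np

# Made-up values for illustration only (hypothetical stand-ins for two columns)
col_a = np.array([1, 1, 2, 3, 1])
col_b = np.array([1, 2, 1, 3, 1])

# logical_and: True only where BOTH conditions hold
both_one = np.logical_and(col_a == 1, col_b == 1)

# logical_or: True where AT LEAST ONE condition holds
either_one = np.logical_or(col_a == 1, col_b == 1)

print("and:", both_one, "rows kept:", both_one.sum())
print("or: ", either_one, "rows kept:", either_one.sum())
```

Because we want records where both values equal 1 at the same time, logical_and is the version that fits our three steps.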

In [None]:
SingularRecord=PriceRecorded.where(np.logical_and(PriceRecorded.column('V12')==1, 
                                                PriceRecorded.column('V13')==1))
SingularRecord.sort('V14', descending=True)

In [None]:
SingularRecord.hist('V14', bins=np.arange(0,2501, 100))

Does it seem like there is a central tendency? Does the visual from class, showing duration of bike rides, come to mind? What happens when changing bins? Consider what this implies about prices. This brings us to a second topic: how might prices differ?
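To get a feel for what changing bins does, here is a rough sketch that counts the same made-up prices with two different bin widths. The numbers and the bin ranges are chosen only for illustration.

```python
import numpy as np

# Made-up prices for illustration only
toy_prices = np.array([120, 180, 450, 480, 520, 610, 640, 900, 1500])

# Same data, two bin widths: 100 and 500
counts_narrow, edges_narrow = np.histogram(toy_prices, bins=np.arange(0, 2001, 100))
counts_wide, edges_wide = np.histogram(toy_prices, bins=np.arange(0, 2001, 500))

print("Bins of width 100:", counts_narrow)
print("Bins of width 500:", counts_wide)
```

Narrow bins show gaps and small clusters; wide bins smooth them away. Neither view is wrong, but each suggests a slightly different story about where the typical price sits.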

/2/ Central Tendencies: A Critical Step to Evaluating Differences between Values

How do we know what price is really high, and not just high? The intuition we have is that those values far from the average value, the central tendency (or the mean, if you like), are unusual. In this context, those values way to the right of the average value are really high (say, $2000), while those to the left are really low (look at the first bin).
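One way to make "far from the average" concrete is to measure distance from the mean in units of the standard deviation. The sketch below does this on made-up prices. The cutoff of 2 standard deviations is a common rule of thumb and an assumption here, not a definition we have adopted in class.

```python
import numpy as np

# Made-up prices for illustration only
toy_prices = np.array([400, 450, 500, 500, 550, 600, 2000])

mean_price = np.mean(toy_prices)
std_price = np.std(toy_prices)

# Distance of each price from the mean, in standard deviations
distance = np.abs(toy_prices - mean_price) / std_price

# Hypothetical cutoff: call a price unusual if it is more than 2 SDs from the mean
cutoff = 2
unusual = toy_prices[distance > cutoff]

print("Mean:", mean_price, "SD:", std_price)
print("Unusual prices:", unusual)
```

Note that the very high price also pulls the mean upward, which is one reason to look at the histogram alongside any single number.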

To make this clearer, let's find the average price using the variable V15, Gender (or, in the original Codebook, Sex). One question to ask is, what are the average, and the really unusual prices given Gender.

/a/ Sub-selecting just the two variables, V15 and V14, may help to highlight the main findings:

In [None]:
Gender_SR = SingularRecord.select(["V15", "V14"])
Gender_SR

In [None]:
Male_SR = Gender_SR.where('V15',1)
Male_SR.sort('V14', descending=True)

In [None]:
Male_SR.hist('V14', bins=np.arange(0, 2501,100))

In [None]:
Female_SR = Gender_SR.where('V15', 2)
Female_SR.sort('V14', descending=True)

In [None]:
Female_SR.hist('V14', bins=np.arange(0, 2501, 100))

/b/ Visual inspection leads to what tentative conclusions? Do you note an obvious difference in price?

We can express numerically some of the quantities of interest, for instance the total value of all those prices as well as the average price. Comparing average prices across Gender begins to address the question of what prices mean, how they differ, or not. Of course, recall that we made several assumptions to permit us to do this analysis, and we came back to these assumptions in our discussion.

First, we add up, or sum, the values for V14; second, we divide the total by the number of Singular Records. The result can be thought of as the average sale value among sales that have Singular Records.

Doing the same process for the two Genders permits a comparison, and raises a host of questions, including whether the result is surprising given, for example, the so-called gender gap in pay you may have heard about.
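The sum-then-divide recipe can be sketched on made-up records as follows. The helper average_price is a hypothetical name introduced here, and the codes 1 and 2 simply mirror how V15 is used in this notebook; the prices are invented.

```python
import numpy as np

# Made-up records for illustration only; codes 1 and 2 mirror V15 in this notebook
toy_gender = np.array([1, 2, 1, 2, 2])
toy_price = np.array([600, 500, 800, 400, 600])

def average_price(prices):
    """Add up the prices, then divide by how many there are."""
    return sum(prices) / len(prices)

# Apply the same recipe to each group separately
avg_1 = average_price(toy_price[toy_gender == 1])
avg_2 = average_price(toy_price[toy_gender == 2])

print("Average for code 1:", avg_1)
print("Average for code 2:", avg_2)
print("Matches np.mean:", avg_1 == np.mean(toy_price[toy_gender == 1]))
```

Dividing by len(prices) rather than a typed-in count means the recipe still works if the filtering steps change how many records are left.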

In [None]:
sum(Female_SR['V14'])

In [None]:
sum(Female_SR['V14'])/1402

# Another way to do this:
# np.mean(Female_SR['V14'])
# np.std(Female_SR['V14'])

In [None]:
# Similar steps for Male:

sum(Male_SR['V14'])

In [None]:
sum(Male_SR['V14'])/1313

Given the assumptions we have made, we can observe some difference in price across Gender, but this is not a very firm conclusion. Over the next few classes we will explore whether it is possible to arrive at a firmer conclusion, or whether we can only sketch out roughly what appears to be a difference.

/3/ Some ideas for exploring the dataset further:

/a/ V16 records the age. See the distribution of this variable, and note the description in the Codebook. Can you make some tentative conclusions? Is there other information you would like to have?

/b/ V17 records what the dataset describes as skin color. Do you think this variable affects price? Make sure to check the Codebook carefully, and review last week’s Notebook. One suggestion would be to compare the average prices across what contemporaries regarded as very different skin colors.