In [1]:
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use("fivethirtyeight")
import numpy as np
import math

In [2]:
data = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data1.csv")

# Exploring Averages and Dispersion

There are several ways to assess what values data tend to center around. In class, we saw two such measures of central tendency: the mean and the median. Two others are sometimes used: the mode and the midrange (the latter is defined briefly below but is not part of the examples).

The [mean], frequently denoted x-bar, and the median, denoted x-tilde, are defined in the textbook. Briefly, the mean is the arithmetic average (sum all the scores and divide by the total number of scores), and the [median] is the “half-way point” (rank all the scores; if there is an odd number of scores, the middle one is the half-way point, and if there is an even number of scores, the half-way point is the mean of the middle two scores).
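As a minimal sketch of the two definitions above, the following computes both measures step by step. The array `scores` is a made-up toy example, not a column from the lab data.

```python
import numpy as np

# Hypothetical toy scores, not taken from the lab data set.
scores = np.array([4, 8, 15, 16, 23, 42])

# Mean: sum all the scores and divide by the number of scores.
mean_by_hand = sum(scores) / len(scores)

# Median: rank the scores, then take the middle one (odd count)
# or the mean of the middle two (even count).
ranked = np.sort(scores)
n = len(ranked)
if n % 2 == 1:
    median_by_hand = ranked[n // 2]
else:
    median_by_hand = (ranked[n // 2 - 1] + ranked[n // 2]) / 2

print(mean_by_hand, median_by_hand)
```

With an even number of scores, as here, the median falls between two observed values and need not be a score that actually occurs.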

Two other measures of central tendency to keep in mind are the mode and the midrange. The [mode], frequently denoted M, is simply the most frequent score. Is it possible to have a set of scores without a mode, or with several? If no score is repeated, then there is no mode. By contrast, the mean and the median always exist. If two different scores share the same, greatest frequency, the data set is described as bimodal (and, yes, when more than two scores appear with the same greatest frequency, the data set is multimodal). Finally, the [midrange], the least commonly used measure of central tendency, is the average of the highest and lowest scores in the data set.
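The cases described above (one mode, several modes, no mode) can be sketched with a small helper. The function names `modes` and `midrange` and the toy lists are illustrative assumptions, not part of the lab's code.

```python
from collections import Counter

def modes(scores):
    """Return every score tied for the greatest frequency, or [] if no score repeats."""
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1:
        return []  # no score is repeated, so there is no mode
    return sorted(s for s, c in counts.items() if c == top)

def midrange(scores):
    """Average of the lowest and highest score."""
    return (min(scores) + max(scores)) / 2

print(modes([1, 2, 2, 3]))      # a single mode
print(modes([1, 1, 2, 2, 3]))   # two scores tie: bimodal
print(modes([1, 2, 3]))         # nothing repeats: no mode
print(midrange([1, 2, 2, 10]))
```

Returning a list keeps the bimodal and multimodal cases explicit instead of silently picking one of the tied scores.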

Why so many measures? Each one has a use. For instance, the midrange is sensitive to extreme values, so it can complement the median when the data set has extreme, or outlier, scores. The mode is useful for nominal (“name only”) data: in this election season, it is useful to note the mode of the survey responses yes, no, and undecided, and we saw that it was useful for noting the slight gender imbalance in the data set.
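To see how differently the measures react to one extreme score, here is a small sketch on invented numbers. The arrays and the `midrange` helper are assumptions made for illustration only.

```python
import numpy as np

def midrange(a):
    """Average of the lowest and highest score."""
    return (a.min() + a.max()) / 2

# Hypothetical toy scores; the second array adds a single extreme value.
typical = np.array([20, 22, 25, 27, 30])
with_outlier = np.append(typical, 99)

for label, a in [("typical", typical), ("with outlier", with_outlier)]:
    print(label, "mean:", np.mean(a), "median:", np.median(a), "midrange:", midrange(a))
```

The median moves only slightly when the extreme score is added, while the midrange more than doubles, which is why a large gap between the two can flag outliers.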

In the first part of the lab, we will look at the central tendency, focusing on the age of the principal slave on the record and of the first child, V16 and V20, respectively. In the second part of the lab, we will look at a measure of the variation of scores around the mean, the standard deviation. Specifically, we will calculate and compare the standard deviations of the ages captured by V16 and V20, and discuss the results. In the final part, you will be asked to look at a different variable and calculate its central tendency and standard deviation. The purpose of these exercises is to be able to calculate the so-called [[standard units]], the z-score, the topic to which we turn next.

In [None]:
#Renaming variables, to begin. Relabel V16, the age of the "principal slave"; and V20, the age of first child.
data.relabel(...)
data.relabel(...)

There is a numpy function available for the mean, but it's worth finding this "by hand" once or twice to make sure you know how. In the following cells, we'll find different statistics about the Age of Principal Slave column.
If you wish, create a variable to hold the column you're interested in to avoid typing data.column(...) all the time!

In [None]:
# You may not notice something bizarre in the later cells if you don't take a look at the histogram of this column first!
# Do that here.
data.hist(...)

In [None]:
# Use this cell to set the variable principal_age to a useful array, using .where and .column
# You may want to refer to examples from previous discussions
principal_age = data.where(...).column(...)

In [None]:
#Find the mean without using np.mean


In [None]:
#Find the median. You can use np.median!


In [None]:
#Assess the mode visually by making a histogram. Think - how do you avoid the "age" of 99? There are a few ways!
# Suggestion: create a variable called "my_bins" for convenience


What do you notice? Do the mean and the mode look unusual? How does the histogram look? What might explain its shape?

Let's see if similar patterns appear in V20.

In [None]:
# Again, notice anything strange first?
data.hist(...)

In [None]:
# Use this cell to set the variable child_age to a useful array, using .where and .column
child_age = data.where(...).column(...)

In [None]:
#Find the mean without using np.mean


In [None]:
#Find the median


In [None]:
#Assess the mode visually again. You can use almost the exact same code as in the last mode question!


It'll be interesting to overlay these two distributions, but to do that, we have to consider that principal_age is much longer than child_age. To make a non-misleading histogram, we should sample from principal_age to make two distributions of the same length, and then compare these!
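The idea of cutting the longer array down to the length of the shorter one can be sketched with plain numpy. Everything here is an assumption for illustration: the two arrays are randomly generated stand-ins, and this sketch samples without replacement, which may differ from what the lab's own code does.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical stand-ins for a long array and a much shorter one.
long_group = rng.integers(15, 60, size=1000)
short_group = rng.integers(0, 15, size=120)

# Draw as many values from the long array as the short array contains.
subsample = rng.choice(long_group, size=len(short_group), replace=False)

print(len(long_group), len(short_group), len(subsample))
```

After this step both groups contribute the same number of values, so neither dominates the overlaid histogram simply by being larger.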

In [None]:
#This is an old piece of code from last semester. They didn't expect us to come up with this code - don't worry!
def compare(t, conditions, values):
    """Overlay histograms of values for two conditions."""
    t['count'] = 1
    e = t.pivot(conditions, values, 'count', sum)
    for label in e.labels[1:]:
        e[label] = e[label]/sum(e[label]) # Normalize each column of counts
    e.hist(counts=values, normed=True, overlay=True)

#Take a look at this line and find where we sample:
ages = data.where(data.column("principal age") != 99).select('principal age').sample(len(child_age)).with_column("category", "principal")
ages.relabel("principal age", "age")

child_ages = data.where(data.column("first child age") != 99).select('first child age').with_column("category", "child")
child_ages.relabel("first child age", "age")

ages.append(child_ages)

compare(ages, 'category', 'age')

# Exploring Standard Deviation


Variation is among the most important concepts in statistics, so this is among the most important exercises. First, we reiterate the concepts intuitively; then we work through examples that let us calculate variation.

Variation refers to the amount that scores vary among themselves. Intuitively, scores that are relatively close together have low measures of variation, whereas scores that are spread further apart have a larger measure of variation. The standard deviation is an especially useful measure of variation, and it is important to understand how to calculate the standard deviation and, of course, how to interpret what that calculation means. We derive the formula through a series of steps below; to begin with, think of the mean of a set of scores, something we calculated in the examples above, and the standard deviation as a measure of variation of scores around the mean. 

Like the mean, standard deviation has a numpy function available. However, let's avoid using that this time in order to understand what goes into the calculation of the standard deviation! Remember: "root mean square of deviations from average"
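Read from right to left, the phrase "root mean square of deviations from average" spells out the order of operations. As a sketch in symbols, where n (the number of scores) and x_i (the i-th score) are notation introduced here and x-bar is the mean as before:

$$
\text{SD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

The deviations are the differences inside the parentheses, the square and the mean give the variance, and the root returns the result to the original units of the scores.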

In [None]:
# Find the standard deviation of principal_age
# First, find the mean:
principal_mean = ...

# Next, create an array of all the ages minus the mean ("deviations")
principal_deviations = ...

#Next, square the deviations:
principal_squared = ...

#Find the average of the squared deviations ("variance"):
principal_variance = ...

#Take the square root to get to the standard deviation.
principal_std = ...
principal_std

In [None]:
# Compare your result to what numpy gives you. If they're different, what Python quirk could be causing this?
np.std(principal_age)

Now, the same process for the age of the first child.

In [None]:
# Find the standard deviation of child_age
# First, find the mean:
child_mean = ...

# Next, create an array of all the ages minus the mean ("deviations")
child_deviations = ...

#Next, square the deviations:
child_squared = ...

#Find the average of the squared deviations ("variance"):
child_variance = ...

#Take the square root to get to the standard deviation.
child_std = ...
child_std

In [None]:
#Again, compare your result:
np.std(child_age)

How would you compare the two standard deviations? How do they compare to the standard deviation calculated in the textbook for height data, a variable of a similar type to age?

Exercises:

/1/ Using the formula from the textbook for Chebychev's inequality (the proportion in the range “average ± 2 SDs” is at least 1 - 1/4 = 0.75; the proportion in the range “average ± 3 SDs” is at least 1 - 1/9 ≈ 0.89), would you say that this is insightful for the type of values we looked at in the examples above? Do we learn more about sales or about age? Put differently, what do we learn about the sales from information on the age of the principal slave, and what do we do with all those 99s?

/2/ Pick a variable of interest that we have not explored together, and calculate the mean, median and standard deviation. Calculate the so-called z-score using the formula from the textbook, and consider what the score tells you. 
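Both exercises rest on the mean and the standard deviation, so a small sketch on invented numbers may help before you start. The function names and toy arrays below are assumptions made for illustration, and the z-score is taken here to be (value - average) / SD; check this against the formula in the textbook.

```python
import numpy as np

def chebyshev_lower_bound(k):
    """Proportion guaranteed to lie within k SDs of the average (informative for k > 1)."""
    return 1 - 1 / k**2

def proportion_within(values, k):
    """Observed proportion of values within k SDs of their own average."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std()
    return np.mean(np.abs(values - mean) <= k * sd)

def standard_units(values):
    """Convert values to z-scores: deviations from the average, measured in SDs."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical toy ages with one 99-style sentinel value mixed in.
toy_ages = np.array([18, 21, 22, 25, 25, 27, 30, 34, 40, 99])
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), proportion_within(toy_ages, k))

# Hypothetical toy scores chosen so the mean and SD come out as whole numbers.
toy_scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])
z = standard_units(toy_scores)
print(z)
```

The observed proportions can sit well above the Chebychev bounds, because the inequality is a guarantee for any distribution rather than a description of a particular one.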


In [None]:
# Begin exercises here, feel free to use as many cells as you find useful.