In [1]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')
# datascience version number of last run of this notebook
version.__version__

'0.5.6'

<h1>Class 4: Microdata, subcharacteristics, and crazy cat people</h1>

Many of you will want to <i>compare subgroups</i> within a dataset as part of your term paper research. 

For example, maybe you have a dataset of people that includes measures of their health and their smoking. Smoking could be a dichotomous yes/no or 0/1 measure, in which case you'd like to compare smokers to nonsmokers. Or it might be a trichotomous 1/2/3 measure, where we know *current* smokers, *former* smokers, and *never* smokers. There are several interesting comparisons one could draw between those three groups, and measuring them separately is important. Let's use some Python to separate data tables into sub-tables. 

<h2>Data sources</h2>

Let's examine data from the U.S. <a href="http://hrsonline.isr.umich.edu">Health and Retirement Study</a> (HRS), a panel survey of thousands of Americans aged 50 and over and their spouses that has been conducted biennially (every 2 years) since 1992.

In its 11th wave in 2012, a subsample of about 1,700 individuals was asked about <b>pet ownership</b>. Let's examine pet ownership and health in these data.

<u><h3>File I/O</h3></u>

Python can import an external file into a table using the `Table.read_table` function and a valid URL like this one; the target file probably needs to be comma-delimited (CSV):

In [32]:
HRSpets = Table.read_table("http://demog.berkeley.edu/~redwards/Courses/LS88/c04_hrspets.csv")
HRSpets

hhidpn,age,sex,edyrs,health,anypets,numdogs,numcats
10059030,84,1,17,2,0,0,0
10210020,73,1,6,5,0,0,0
10372010,76,2,10,3,0,0,0
10395020,74,2,15,5,1,1,0
10458030,68,2,16,4,1,0,1
10475010,81,2,17,3,0,0,0
10648010,71,1,12,4,0,0,0
10773030,83,1,17,2,0,0,0
10818040,63,2,13,2,0,0,0
11141010,78,2,16,2,0,0,0
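
To see what "comma-delimited" means in practice, here is a minimal sketch that parses the first four rows shown above from an in-memory string, with no download needed. It uses NumPy's `genfromtxt` rather than `Table.read_table`, and the names `csv_text` and `sample` are just illustrative.

```python
import io
import numpy as np

# The header line and first four data rows shown above,
# pasted in as a small in-memory "file".
csv_text = """hhidpn,age,sex,edyrs,health,anypets,numdogs,numcats
10059030,84,1,17,2,0,0,0
10210020,73,1,6,5,0,0,0
10372010,76,2,10,3,0,0,0
10395020,74,2,15,5,1,1,0
"""

# names=True takes the column labels from the header line;
# delimiter="," splits each row on commas.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True, dtype=int)

print(sample.dtype.names)   # the column labels
print(sample["age"])        # one column, selected by its label
```

The first line of the file supplies the column labels, and every later line is one respondent.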


<img src="http://vignette1.wikia.nocookie.net/simpsons/images/b/b5/230px-Eleanor_Abernathy.png/revision/latest?cb=20140817113422" align=right valign=top width=160> 
Let's debunk or rebunk some ridiculous theories, and then let's actually look at health and how it varies.

First, some fun. 

<h3>The Theory of the Crazy Cat Lady</h3>

Everybody's favorite TV show of the 1990s, *The Simpsons*, included a memorable character apparently named <a href="http://simpsons.wikia.com/wiki/Eleanor_Abernathy">Eleanor Abernathy MD JD, whom most of us know as the Crazy Cat Lady</a>. Go figure.


Are the cat owners in the HRS more likely to be female?

To find out, you can create a sub-table that includes only the cat owners (numcats > 0) using this code:

In [44]:
Catowners = HRSpets.where(HRSpets['numcats'] > 0)
Catowners

hhidpn,age,sex,edyrs,health,anypets,numdogs,numcats
10458030,68,2,16,4,1,0,1
11332010,80,1,12,3,1,1,2
11612010,72,1,10,3,1,0,2
11876010,76,2,12,4,1,0,1
12315040,70,2,12,3,1,0,1
12738020,64,2,3,5,1,1,1
13140010,80,2,12,2,1,1,1
14427010,71,1,12,3,1,2,1
14587010,71,2,14,1,1,1,1
15516020,71,2,12,1,1,0,3
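
The comparison `HRSpets['numcats'] > 0` evaluates to an array of True/False values, one per row, and that array is presumably what `.where` uses to decide which rows to keep. A minimal sketch of the same idea on plain NumPy arrays, using made-up values rather than HRS data:

```python
import numpy as np

# Made-up values for six hypothetical respondents (not HRS data).
numcats = np.array([0, 1, 0, 2, 0, 3])
sex = np.array([1, 2, 1, 2, 2, 1])

owns_cat = numcats > 0          # array of True/False, one entry per person
print(owns_cat)

# Indexing with the mask keeps only the entries where it is True.
print(sex[owns_cat])            # sex codes of the cat owners
print(sex[~owns_cat])           # ~ flips the mask: everyone else
```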


Now let's measure the average sex of the cat owners. To select just a column of the table, call the table name followed by square brackets around the single-quoted column that you want:

In [35]:
catowners_sex = Catowners['sex']

To measure the average or mean of that column, attach `.mean()` to the end of the column name. You can combine this with the column-selecting command one step ago, too.

In [57]:
Catowners['sex'].mean()

1.6829971181556196

As a quick aside, note that there are all kinds of nifty functions in numpy like `.mean()`, and another one of probable interest is the standard deviation, `.std()`.

In [60]:
Catowners['sex'].std()

0.46530855864333526
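
One detail worth knowing: by default NumPy's `.std()` divides by $n$ (the "population" formula), while many statistics packages divide by $n - 1$. A small sketch with made-up values shows how to get either one; the variable names are illustrative.

```python
import numpy as np

sex_codes = np.array([1, 2, 2, 2, 1, 2])  # made-up values, not HRS data

population_sd = sex_codes.std()        # default ddof=0: divide by n
sample_sd = sex_codes.std(ddof=1)      # ddof=1: divide by n - 1

print(population_sd, sample_sd)
```

With a few hundred cat owners the two versions are nearly identical, so the choice matters mostly in small samples.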

Anyway, back to means:

In [37]:
catowners_sex.mean()

1.6829971181556196

It looks to me like Python *might* also like to perform `.mean()` on an entire table; column-wise appears to be the default (along with a fun error message).

In [39]:
Catowners.mean()



hhidpn,age,sex,edyrs,health,anypets,numdogs,numcats
306675000.0,64.3487,1.683,13.1037,2.71182,1,0.850144,2.04611
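
For comparison, a plain NumPy array behaves a little differently: `.mean()` with no arguments averages every number in the array, and you ask for column-wise means with `axis=0`. A minimal sketch, using the age, sex, and health values from the first three cat owners listed earlier:

```python
import numpy as np

# Three rows of (age, sex, health) copied from the Catowners output above.
rows = np.array([
    [68, 2, 4],
    [80, 1, 3],
    [72, 1, 3],
])

column_means = rows.mean(axis=0)   # one mean per column
row_means = rows.mean(axis=1)      # one mean per row (rarely meaningful here)
overall_mean = rows.mean()         # no axis: a single mean of all nine numbers

print(column_means)
print(overall_mean)
```

Note also that a mean is computed for every column whether or not it makes sense; the average of an ID column like `hhidpn` has no interpretation.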


Is this number 1.683 strange or what?

It's actually not so bad.  Males are 1's and females are 2's. Maybe it's already intuitive to you, but 1.683 as an average means that 68.3% of this group is female. Why? Imagine we took `sex` and subtracted 1 from its value across everyone. Then effectively the variable is 0 for males and 1 for females, and its average, which must equal $1.683 - 1 = 0.683$, is more clearly the share of the subsample that is female.
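
The recoding argument can be checked directly. This sketch uses ten made-up respondents (not HRS data) and shows that subtracting 1 before averaging gives the same answer as counting the share of 2's:

```python
import numpy as np

sex = np.array([1, 2, 2, 1, 2, 2, 2, 1, 2, 2])  # made-up: 1 = male, 2 = female

female = sex - 1                 # recode to 0 = male, 1 = female
share_female = female.mean()     # the mean of a 0/1 variable is a proportion

print(sex.mean())                # mean of the 1/2 coding
print(share_female)              # same thing minus 1
print((sex == 2).mean())         # direct count of females divided by n
```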




<font color=blue>Q1: Now ascertain whether crazy cat owners are in fact more likely to be female.</font>

You could check this in a variety of ways.  Two that come to mind are that you'll want to check the average sex of cat owners compared to (A) the average sex of everyone in the sample, or compared to (B) the average sex of people in the sample who aren't cat owners.  (There are other comparisons I can think of as well.) Calculate both A and B!

In [41]:
average_sex_of_everyone = ...

In [46]:
Noncatowners = HRSpets.where(HRSpets['numcats'] == 0)

In [None]:
average_sex_of_noncatowners = ...

<font color=blue>Q2: What else is interesting about Eleanor Abernathy? Look at those uppercase letters that come after her name. Can you test the implicit hypothesis about an average characteristic of cat owners other than sex? *(There are multiple ways to do this, but one general idea.)*</font>

<font color=blue>Q3: Now let's tackle a question about health. What do we see when we compare the self-reported health status of cat owners to other people?  What about dog owners?  Do as much as you can that's interesting. *Remember the coding* of `health`, which is that 1 is excellent and 5 is poor.</font>

In [61]:
Catowners['health'].mean()

2.711815561959654