In [None]:
from datascience import *
import numpy as np
%matplotlib inline

In [None]:
# read the csv file and produce a table

tbl = Table.read_table('WRA/AAD_records_100.csv')
tbl

In [None]:
# group the values in the column "PRIMARY OCCUPATION" and count the results

occ = tbl.group("PRIMARY OCCUPATION")
occ

# display the same results as a bar chart

occ.barh('PRIMARY OCCUPATION')
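
For intuition, we can think of `group` as counting how often each distinct value appears in a column. Here is a minimal sketch of that idea in plain Python. The occupation values are made up for illustration and are not taken from the WRA file.

```python
from collections import Counter

# Hypothetical sample values; the real column comes from the CSV file.
occupations = ["Farmer", "nan", "Clerk", "Farmer", "nan", "nan"]

# Count each distinct value, much as group() produces a count column.
counts = Counter(occupations)

# List each distinct value with its count, in sorted order.
for value, count in sorted(counts.items()):
    print(value, count)
```

Even in this toy list, the "nan" entries outnumber every real occupation, which is the same effect we see in the chart.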

"nan" stands for "not a number," and in this case we can interpret these as NULL values.  But why de we have so many?  They are skewing the bar chart.  Let's get rid of them...

In [None]:
# filter the results to exclude 'nan' values

occ_list = occ.where('PRIMARY OCCUPATION', are.not_equal_to("nan"))

# display the filtered primary occupation counts as a bar chart

occ_list.barh("PRIMARY OCCUPATION")

But, wait a minute... what did we just delete?  What are all those 'nan' values?  Do they _mean_ something? And what is the difference between those and "Undefined Code"?

_Discuss with your neighbor_:

First, make a guess.  This is your hypothesis, your historian's hunch.  It is important to start thinking about possible explanations, because that will help you know where to look next.  But guessing alone is not enough.  Let's take a closer look at the evidence...

In [None]:
# filter the results to show only 'nan' values

nan_results = tbl.where('PRIMARY OCCUPATION', are.equal_to("nan"))

# show the full table (up to 999 rows)

nan_results.show(max_rows=999)

In [None]:
# a better way to show the full table...

nan_rows = nan_results.num_rows
nan_results.show(max_rows=nan_rows)


Read the results.  From skimming them, what do you think the explanation is?  What are the 'nan' values?  Does this confirm or contradict your original hunch?

Having looked at some examples, we should have a better idea now, or at least be more secure with our theory.  But we can go further with the data...

Let's observe the year of birth of the people listed having 'nan' as their primary occupation.

In [None]:
nan_results.group('YEAR OF BIRTH').barh('YEAR OF BIRTH')

There are a few things to say about this chart:
- There are values above 42, but they only start at 71.  What are we seeing here?
- There is a spike starting at 24... why?
- Does that spike explain our 'nan' values?
- Does it _fully_ explain them?

Let's take a closer look at the outliers.

In [None]:
# reduce the set of rows to those with a year of birth outside the range 24 to 42
nan_exceptions = nan_results.where('YEAR OF BIRTH', are.not_between_or_equal_to(24, 42))
nan_exceptions.group('YEAR OF BIRTH').show(max_rows=999)

# we are also curious about the total number of exceptions here:
print(nan_exceptions.num_rows, 'out of', nan_results.num_rows)
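
As far as we understand it, `are.not_between_or_equal_to(24, 42)` keeps every value outside the closed range 24 to 42, so it would match values below 24 as well as values above 42. Here is a sketch of the same test in plain Python. The helper `outside_range` and the two-digit years are hypothetical, for illustration only.

```python
def outside_range(value, low, high):
    """Return True when value lies outside the closed range [low, high]."""
    return value < low or value > high

# Hypothetical two-digit birth years, for illustration only.
years = [10, 24, 30, 42, 71, 95]

# Keep only the years that fall outside 24..42 (both endpoints excluded).
exceptions = [y for y in years if outside_range(y, 24, 42)]
print(exceptions)  # prints [10, 71, 95]
```

Note that 24 and 42 themselves are dropped, because the range includes both endpoints.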

But let's see that value as a percentage...

In [None]:
excepts = nan_exceptions.num_rows
total = nan_results.num_rows
perc = excepts / total * 100
print("Percentage of those with no listed primary occupation who are adults: {0:.2f}%".format(perc))
    # If interested, read more about string formatting here: 
    # https://docs.python.org/3.1/library/string.html#string-formatting
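
To see what the `{0:.2f}` part of the format string does, here is a quick sketch with made-up counts (these are not the real totals from our table). It divides the part by the whole, scales to a percentage, and prints the result with two decimal places.

```python
# Hypothetical counts, for illustration only.
excepts = 7
total = 56

# Part divided by whole, scaled to a percentage.
perc = excepts / total * 100

# {0:.2f} formats the first argument as a float with two decimal places.
print("{0:.2f}%".format(perc))  # prints 12.50%
```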

What happens when we group these 'nan' exceptions by sex and marital status?

In [None]:
nan_smstatus = nan_exceptions.group('SEX AND MARITAL STATUS')
nan_smstatus.show()
nan_smstatus.barh('SEX AND MARITAL STATUS')

How do we interpret this table, historically?

In [None]:
tbl_2 = tbl.where('YEAR OF BIRTH', are.not_between_or_equal_to(24,42))

# Data Cleaning: Create new column 'GENDER' that is 
# just the first letter of column 'SEX AND MARITAL STATUS'
def first_letter(string):
    return string[0]
tbl_2 = tbl_2.with_column('GENDER', tbl_2.apply(first_letter, 'SEX AND MARITAL STATUS'))

# Query for female gender
tbl_f = tbl_2.where('GENDER', are.equal_to('F'))

# Group females by primary occupation and display bar graph
occ_f = tbl_f.group('PRIMARY OCCUPATION')
occ_f.barh('PRIMARY OCCUPATION')
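
One caveat with `first_letter`: if a value is empty or is not a string, indexing with `[0]` raises an error. Here is a more defensive sketch. The name `first_letter_safe`, the `"?"` placeholder, and the sample values are all hypothetical, and we are assuming the codes begin with `F` or `M` only because the queries in this notebook do.

```python
def first_letter_safe(value, default="?"):
    """Return the first character of a non-empty string, else a default."""
    if isinstance(value, str) and value:
        return value[0]
    return default

# Hypothetical sample values; float('nan') stands in for a missing entry.
samples = ["F-Single", "M-Married", "", float("nan")]
print([first_letter_safe(s) for s in samples])  # prints ['F', 'M', '?', '?']
```

A function like this could be passed to `apply` in place of `first_letter`, if the column turns out to contain missing entries.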


In [None]:
# now let's do the same for those listed as "MALE"
tbl_m = tbl_2.where('GENDER', are.equal_to('M'))
occ_m = tbl_m.group('PRIMARY OCCUPATION')
occ_m.barh('PRIMARY OCCUPATION')

We've figured out how to understand the 'nan' values.  But what about the "Undefined Code"?

Your assignment for next week will be to recreate what we did here, but looking at what the data tells us about the age and gender of the people given "Undefined Code" as an occupation.  (I'm still working out the details of using Jupyter for assignments, so I'll send more instructions by Thursday.)