# Lecture 06 Demo: Census

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Table Review: Pokemon Data

From the [Kaggle.com website](https://www.kaggle.com/datasets/rounakbanik/pokemon), I downloaded a CSV file with data about Pokemon characters. The website includes metadata, such as descriptions of the various columns in the table. I saved the `pokemon.csv` file to my local directory (where the file for this notebook resides), so now I can bring that data to life in my notebook using `Table.read_table(filename)`:

In [None]:
pokemon = Table.read_table('pokemon.csv')
pokemon.show(3)

In [None]:
# How many columns in the table?
pokemon.num_columns

In [None]:
# How many rows? show(3) displayed 3 rows and omitted 798, so expect 3 + 798 = 801:
pokemon.num_rows

Notice that some columns hold text data (strings), and other columns hold numerical data. But there's no column that has a mixture of text and numbers. Each column is a NumPy array, and an array stores all of its items as one data type, so a single column can't hold some items as text and others as numbers.
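
Here is a minimal sketch of that point, using made-up values rather than the Pokemon data. When NumPy builds an array from a mix of numbers and text, it converts every item to one common type (here, strings):

```python
import numpy as np

# Made-up values for illustration; not taken from pokemon.csv
mixed = np.array([1, 'two', 3])

# NumPy picks a single type for the whole array: every item becomes text
print(mixed.dtype.kind)      # 'U' stands for Unicode text
print(mixed[0] + mixed[0])   # joins two strings; it does not add numbers
```

The numbers did not survive as numbers, which is why a column that looks numeric but contains even one text entry will behave like text.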

The first column (index 0) looks a little unusual; let's double-check its data type by looking at the type of the first item in the first column.

In [None]:
first_item = pokemon.column(0).item(0)
# A cell only displays its last expression, so print the type explicitly
print(type(first_item))
first_item


41 columns is a lot of data. Let's focus in on a handful of columns. `t.select()` is useful here.

In [None]:
pokemon_subset = pokemon.select('name', 'type1', 'attack')
pokemon_subset

What is the average attack rating for Pokemon where the primary type (`type1`) is **fire**?

In [None]:
# Make a table showing only 'fire' type (type1)
firetype = pokemon_subset.where('type1', 'fire')
firetype

In [None]:
# Calculate the mean of the attack column
firetype.column('attack').mean()
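
Conceptually, the filter-then-average steps above can be sketched with plain NumPy arrays. This is only an illustration of the idea on made-up data; it is not how the `datascience` library implements `where`:

```python
import numpy as np

# Hypothetical toy data, not values from pokemon.csv
type1 = np.array(['fire', 'water', 'fire', 'grass'])
attack = np.array([52, 48, 64, 49])

# Keep only the attack values whose matching type1 entry is 'fire'
fire_attack = attack[type1 == 'fire']

# Average what is left: (52 + 64) / 2
fire_mean = fire_attack.mean()
print(fire_mean)
```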

In [None]:
# Eyeball the various abilities
pokemon.column('abilities')

In [None]:
# Which Pokemon have the `Super Luck` ability?
pokemon.where('abilities', are.containing('Super Luck')).column('name')

Notice that we can keep all the columns and just focus in on the rows for Pokemon with the `Super Luck` ability:

In [None]:
pokemon.where('abilities', are.containing('Super Luck'))
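
The `are.containing` predicate keeps a row when the given text appears somewhere inside that row's value. A rough plain-Python sketch of the same test, on hypothetical toy rows written to resemble the `abilities` column:

```python
# Hypothetical toy rows; the real entries come from pokemon.csv
names = ['Absol', 'Pikachu', 'Togepi']
abilities = ["['Pressure', 'Super Luck', 'Justified']",
             "['Static', 'Lightningrod']",
             "['Hustle', 'Serene Grace', 'Super Luck']"]

# Keep a name when its abilities text contains the substring 'Super Luck'
lucky = [name for name, text in zip(names, abilities) if 'Super Luck' in text]
print(lucky)
```

Because each `abilities` entry is one long string, a substring test is what lets us find a single ability inside it.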

In [None]:
# What are the various classifications?
pokemon.select('classification')

Oh, snap! Whoever created the CSV file introduced a typo. "classification" is misspelled as "classfication". Can we fix that?

Of course we can, using the `relabeled` method.

In [None]:
pokemon = pokemon.relabeled('classfication', 'classification')
pokemon.labels

In [None]:
# What are the various classifications?
pokemon.column('classification')

That's really hard to look at. Maybe we should sort the table by classification first.

In [None]:
pokemon.sort('classification').column('classification')

In [None]:
# Find the maximum height of a Mushroom Pokemon
mushrooms = pokemon.where('classification', are.containing('Mushroom'))
mushrooms.select('name', 'height_m')

In [None]:
max_mushroom_height = mushrooms.column('height_m').max()
max_mushroom_height

In [None]:
# Suppose we are interested in all the Mushroom Pokemon that are
# at least 60 cm (0.6 m) in height.
mushrooms.where('height_m', are.above_or_equal_to(.6))

In [None]:
# Notice we can now use num_rows to count the number of mushrooms with
# height at least 60 cm
mushrooms.where('height_m', are.above_or_equal_to(.6)).num_rows

In [None]:
# Another way to get this same result:
mushrooms.column('height_m') >= 0.6

In [None]:
# The np.count_nonzero function will count the number of True values
np.count_nonzero(mushrooms.column('height_m') >= 0.6)

In [None]:
# The total number of all Pokemon with height at least 60 cm:
np.count_nonzero(pokemon.column('height_m') >= 0.6)

In [None]:
# Option 3 for counting: just sum up the Trues (1s) and Falses (0s)
np.sum(mushrooms.column('height_m') >= 0.6)

In [None]:
np.sum(pokemon.column('height_m') >= 0.6)
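
To see why the counting options agree, here is a small sketch on made-up heights (not the Pokemon data). The comparison produces an array of `True`/`False` values, and each `True` counts as 1:

```python
import numpy as np

# Made-up heights in meters, for illustration only
heights = np.array([0.3, 0.6, 1.0, 0.4])

# Comparing an array to a number gives one True/False per item
tall = heights >= 0.6
print(tall)

count_a = np.count_nonzero(tall)   # count the True values
count_b = np.sum(tall)             # True adds 1, False adds 0
count_c = len(heights[tall])       # keep the matching items, then count them
print(count_a, count_b, count_c)
```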

Anything else you'd like to review?

## Census Data

In [None]:
full = Table.read_table('nc-est2019-agesex-res.csv')
full.show(3)

Each SEX-AGE combination is represented in a row. 

  - Remember, SEX is coded (0 = all, 1 = male, 2 = female).

  - The AGE values are ages in years, except 100 is interpreted as "100 or older" and 999 is interpreted as "all ages".
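
As a sketch of how these codes read, the helper functions below (hypothetical names, not part of the `datascience` library or the census file) translate one SEX-AGE pair into words:

```python
# Hypothetical helpers; the code meanings are the ones listed above
SEX_LABELS = {0: 'all', 1: 'male', 2: 'female'}

def describe_age(age):
    """Turn an AGE code into a readable label."""
    if age == 999:
        return 'all ages'
    if age == 100:
        return '100 or older'
    return 'age ' + str(age)

def describe_row(sex, age):
    """Describe the group of people that one SEX-AGE row covers."""
    return SEX_LABELS[sex] + ', ' + describe_age(age)

print(describe_row(2, 100))
print(describe_row(0, 999))
```

Keeping the special codes in mind matters later: a row with AGE 999 or SEX 0 is a total, not a separate group of people.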

Let's look at the rows where AGE equals 0 (all the people who are less than 1 year old) and, separately, the rows where AGE equals 100 (all the people who are at least 100).

In [None]:
full.where('AGE', 0)

In [None]:
full.where('AGE', 100)

Compare the previous two tables. What do you notice?

Next, build a new table with four columns; call it `partial`. We are interested in the population estimates for 2014 and 2019:

In [None]:
partial = full.select(0, 1, 8, 13)
partial

In [None]:
# We can shorten the annoyingly long column names
us_pop = partial.relabeled(2, '2014').relabeled(3, '2019')
us_pop.show(4)

In [None]:
us_pop.where('AGE', are.above_or_equal_to(80))

In [None]:
# We could also sort by AGE
us_pop.where('AGE', are.above_or_equal_to(80)).sort('AGE')

Looking at this last result, what patterns do you notice?