In [None]:
# make sure to run this cell
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import warnings

%matplotlib inline
plt.style.use('fivethirtyeight')
warnings.simplefilter(action="ignore", category=FutureWarning)

# Lab 3 - Exploring the Data

I. [Loading the Data](#load)

II. [Initial Exploration](#init)

III. [Age in Years](#years)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 - [Age Heaping](#heap)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2 - [Population Pyramids](#pyramid)

IV. [Working with Birthdays](#bday)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 - [Month Distribution](#month)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2 - [Recalculating Age](#recalc)

V. [Saving](#save)


To get credit for this lab, prepare a document that contains answers to the items in <font color="blue">blue</font>. Some answers can be typed in. If a figure is required, take a screenshot and paste it in. Once you have answered every item, convert your document to a PDF and upload it to bCourses Assignments.

## Loading the Data <a id='load'></a>

In the cell below, load the data set that you created in Lab 1.

In [None]:
# Import data that you saved at the end of Lab 1
data = Table.read_table('Data_Roster_2_11.csv')

# If the file is not found, change to the data directory first, then load it again
%cd ~/Child-Dev-2019/data
%ls
data = Table.read_table('Data_Roster_2_11.csv')

# If your Lab 1 data did not include the interview date, load the data Prof. Reynolds prepared for you instead.
data = Table.read_table('DataFromDrR.csv')
# Then insert the lines of code from Lab 1 that renamed your variables.


## Initial Exploration <a id='init'></a>

Look at the first family in your Table.
<font color="blue"> Item 1: Who are the members of the family? How old are they? (For example: mother (age 25), father (age 30), two sons (ages 12 & 5), and grandfather (age 65).) Challenge: if there’s a grandparent, can you tell whether they are maternal or paternal? </font>

*Hint: The first family in the table will have the lowest master ID.*

<font color="blue"> Item 2: How many individuals are in your survey?

*Hint: There is a row for each survey participant.* 

<font color="blue"> Item 3: How many households are in your survey?
<font color="black">
*Hint: Group will count the number of times a unique value occurs.*
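
The idea behind this hint can be sketched on a toy list, without the `datascience` library. The household IDs below are made up for illustration; in your notebook, the household ID column of `data` plays this role.

```python
from collections import Counter

# Hypothetical household IDs, one entry per survey participant (not real data).
household_ids = [101, 101, 101, 102, 102, 103, 103, 103, 103]

# Counting how often each unique ID occurs gives one entry per household,
# which mirrors what grouping a table by the household ID column produces.
members_per_household = Counter(household_ids)

num_individuals = len(household_ids)             # one row per participant
num_households_toy = len(members_per_household)  # one entry per unique ID
print(num_individuals, num_households_toy)
```

Each entry of `members_per_household` is a household size, so the same grouped result can be reused for the questions about household size.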

In [None]:
households = ...
num_households = ...
num_households


<font color="blue"> Item 4: What's the size of the biggest household? <font color="black">
<font color="black">

*Hint: The 'stats' method will give summary statistics for a table.*
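
As a rough illustration of the summary statistics involved, here is a sketch on a made-up list of household sizes. The numbers are hypothetical; in your notebook they would come from the count column of your grouped table.

```python
from statistics import mean

# Hypothetical household sizes (members per household), not real survey data.
household_sizes = [3, 2, 4, 6, 5]

largest = max(household_sizes)   # size of the biggest household
average = mean(household_sizes)  # total members divided by number of households
print(largest, average)
```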

<font color="blue"> Item 5: What's the average household size?

## Exploring the Age in Years variable <a id='years'></a>
Look at the values in the age column. There are too many to scroll through and check one by one, so let's group the data by the values in the age column. This gives a more manageable table that we can scroll through. Are there values less than 0 or greater than the age of the oldest person alive?

In [None]:
ages = ...
ages.show() # show will let us see all of the rows



Do you see any strange values? Do these correspond to missing codes in your codebook or questionnaire? If there are missing or coded values that differ from the ones we covered in Lab 1, we need to account for them. Let us know if you see any!

<font color="blue"> Item 6: What percentage of individuals are missing age data? </font>

*Hint: `where` might be helpful here.*
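
The percentage is the number of rows flagged as missing divided by the total number of rows, times 100. A minimal sketch on a toy list follows; the missing-value codes `-1` and `999` are assumptions made for this example, so check your own codebook for the codes your survey uses.

```python
# Hypothetical missing-value codes; replace with the codes from your codebook.
MISSING_CODES = {-1, 999}

# Hypothetical ages, one per participant (not real survey data).
ages_toy = [34, 7, 999, 52, -1, 18, 999, 41]

num_missing = sum(1 for age in ages_toy if age in MISSING_CODES)
percent_missing = 100 * num_missing / len(ages_toy)
print(percent_missing)
```

With a table, the filtering step is what `where` does for you; the division by the total number of rows stays the same.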

<font color="blue"> Item 7: Make a histogram of age and paste in the document. Explain what you notice.
<font color="black">

*Hint: Your code might look like this:* `ages.hist(counts='Age in Years', unit='year', bins=np.arange(0,110,1), normed=False)`

### Age Heaping <a id='heap'></a>

Let's test formally for age heaping. Below is the formula for Whipple's index. Let's calculate it for the entire population, even though it was originally designed to be used for the population between ages 23 and 62. $N_{(0,5)}$ is the number of individuals whose age ends in 0 or 5, and $N$ is the size of the entire population.

$$\left(\frac{N_{(0,5)}}{N}\right) \times 5$$
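
To make the arithmetic concrete, here is a small worked example on a made-up list of ages. The list and the helper name `ends_in_0_or_5` are hypothetical; the calculation follows the formula above.

```python
# Hypothetical reported ages (not real survey data).
ages_toy = [20, 25, 30, 31, 42, 45, 50, 57, 60, 63]

def ends_in_0_or_5(age):
    # An age ends in 0 or 5 exactly when it is divisible by 5.
    return age % 5 == 0

n_0_5 = sum(1 for age in ages_toy if ends_in_0_or_5(age))  # N_(0,5)
n = len(ages_toy)                                          # N
whipple_toy = 5 * n_0_5 / n
print(n_0_5, n, whipple_toy)
```

In this toy sample, 6 of the 10 ages end in 0 or 5, so the index is $5 \times 6 / 10 = 3$.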



<font color="blue"> Item 8: Theoretically, if there is no age heaping, what is the value of the index?

<font color="blue"> Item 8a: Calculate the Whipple index for your country.  Looking at the histogram, do you think it is a good choice to base the heaping index on 0 and 5, or should another value be used?


In [None]:
# Use Table.apply() and the functions provided below to calculate the Whipple Index
def divisible_by_5(n):
    return n % 5 == 0

def divisible_by_10(n):
    return n % 10 == 0

whipple5 = ...
whipple10 = ...

(whipple5, whipple10)


### Population Pyramids <a id='pyramid'></a>

We would like to make a population pyramid for our sample.  Remind yourself what population pyramids look like with [this link](http://www.prb.org/Publications/Lesson-Plans/HumanPopulation/Change.aspx).
Unfortunately, our Python knowledge is not yet sophisticated enough to make graphs that are reflections of each other on a rotated axis, so we will make overlaid graphs instead. Keep in mind, though, what these would look like if you could rotate and unfold them.
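
To picture the "unfolded" version, it can help to tabulate counts per age band for each sex. The sketch below uses a made-up sample and an arbitrary band width of 10 years, and prints a rough text pyramid with males on the left and females on the right. It is only an illustration of the shape, not a replacement for the histograms.

```python
from collections import Counter

# Hypothetical toy sample of (sex, age) pairs; not real survey data.
people = [("M", 3), ("F", 7), ("M", 12), ("F", 15), ("F", 34), ("M", 38), ("F", 61)]

BAND = 10  # width of each age band in years (an arbitrary choice for this sketch)

def band_start(age):
    # Lowest age in the band that contains `age`, e.g. 34 -> 30.
    return (age // BAND) * BAND

counts = Counter((sex, band_start(age)) for sex, age in people)

# Oldest band first, so the printout reads like a pyramid from top to bottom.
bands = sorted({band_start(age) for _, age in people}, reverse=True)
rows = [(b, counts[("M", b)], counts[("F", b)]) for b in bands]

for b, males, females in rows:
    print(f"{b:>3}-{b + BAND - 1:<3} {'#' * males:>5} | {'#' * females}")
```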

<font color="blue"> Item 9: Let's do a histogram of males and females. What do you notice?  
Give this graph a title and upload it to the Google Doc: https://docs.google.com/a/berkeley.edu/document/d/1AurSgy2Ucl3tGCerfgMdrkUGcDsSm_XDnfJTNJwceXw/edit?usp=sharing.

<font color="blue"> Item 10: Let's do this just for male & female children ages 10 & younger. What do you notice?

## Working with birthdays <a id='bday'></a>
Age in years may not be accurate.  We will need more accurate ages for the young kids when we analyze their height.  Let's look at the birthdate and interview date variable(s).  We need to get these into a form where we can figure out how much time passed between them.  Make sure your table has the following columns *with numeric entries (ints or floats)*: birth year, birth month, birth day and interview year, interview month, interview day.  Make sure the year has all 4 digits.
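
If your survey stores years with only two digits, one possible way to expand them is sketched below. The function name `to_four_digit_year` is hypothetical, and the rule it uses rests on an assumption that may not hold for your data: that nobody in the sample is 100 or older, so a birth year can never fall after the interview year.

```python
def to_four_digit_year(year, interview_year):
    """Expand a two-digit year so that it does not fall after the interview year.

    Assumption for this sketch: nobody in the sample is 100 or older.
    """
    year = int(year)  # also turns text such as "87" into a number
    if year >= 100:
        return year  # already has four digits
    century = (interview_year // 100) * 100
    candidate = century + year
    if candidate > interview_year:
        candidate -= 100  # a year in the future must belong to the previous century
    return candidate

print(to_four_digit_year("87", 2011), to_four_digit_year(5, 2011))
```

A function like this could be applied to a year column with `Table.apply`. Missing-value codes would need to be handled separately before converting.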

### Month Distribution <a id='month'></a>

<font color="blue"> Item 11: What month has the most birthdays?  Is it the same in the US?

In [None]:
months = data.group('Month of Birth')
months....



### Recalculating Age <a id='recalc'></a>

Now write a formula to calculate the days between the birthdate and interview dates. We can get a pretty good approximation by calculating the days of each from 0 a.d. and subtracting.

$$\langle\textrm{days since 0 a.d.}\rangle = 365.25 \times \textrm{year} + 30.5 \times (\textrm{month}-1) + \textrm{day}$$

Use the difference between the interview date and the birthdate to find how many days old each person is. Name this new column `Days Old`.

*Make sure to remove rows where the birthdate is unknown.*
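
To see how good the approximation is, here is a worked example for a single made-up person, compared against the exact calendar difference from Python's `datetime` module. The dates are hypothetical, and the exact comparison is there only as a sanity check; the lab itself uses the approximate formula.

```python
from datetime import date

def days_total(years, months, days):
    # Approximate day count: 365.25 days per year, 30.5 days per month.
    return 365.25 * years + 30.5 * (months - 1) + days

# Hypothetical person: born 15 March 2010, interviewed 20 June 2015.
birth = (2010, 3, 15)
interview = (2015, 6, 20)

approx_days_old = days_total(*interview) - days_total(*birth)

# Exact calendar difference, for comparison only.
exact_days_old = (date(*interview) - date(*birth)).days

print(approx_days_old, exact_days_old)
```

For this person the formula gives 1922.75 days and the calendar gives 1923 days, so the approximation is off by a quarter of a day.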

In [None]:
def days_total(years, months, days):
    return 365.25 * years + 30.5 * (months - 1) + days

byears = ...
bmonths = ...
bdays = ...
bdays_total = ...

# Do the same for the interview date

iyears = ...
imonths = ...
idays = ...
idays_total = ...

# Make a new column "Days Old": the interview date's day count minus the birthdate's day count
days = ...
data = ...
data



Calculate two new columns from `Days Old`: `Months Old` and `Years Old`. 
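
One way to do the conversion is to divide by the same constants used in the days formula. The sketch below works on a made-up list of values. Using 30.5 days per month is an assumption carried over from that formula; note that twelve such months add up to 366 days, slightly more than the 365.25 days used for a year.

```python
# Hypothetical "Days Old" values (not real survey data).
days_old = [1922.75, 365.25, 7305.0]

DAYS_PER_YEAR = 365.25  # same constant as in the days formula
DAYS_PER_MONTH = 30.5   # same constant as in the days formula (an approximation)

years_old = [d / DAYS_PER_YEAR for d in days_old]
months_old = [d / DAYS_PER_MONTH for d in days_old]

for d, y, m in zip(days_old, years_old, months_old):
    print(d, round(y, 2), round(m, 1))
```

With a table, the same division can be applied to the whole `Days Old` column at once, since column arrays support element-wise arithmetic.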

In [None]:
data = data.with_column('Years Old', ...)
data = data.with_column('Months Old', ...)

# removing unknown birthdays
removed = data.where(..., ...)
removed


Let's compare years old that we calculated to years old from the survey question.  Make a scatter plot to compare these. 

<font color="blue"> Item 12: Repeat the above graph restricting the sample to people with age in years between 45 and 55.  Comment on how well people know their ages.

## Saving Data <a id='save'></a>

Save the data table, which now includes the newly calculated `Months Old` variable. Make sure that you include the entire population, even those missing birthdate data. Save this large data set as a .csv file.

In [None]:
# you need to add this code!


#### Congratulations, you're done! Make sure to save this notebook, save your data set (we'll be using it next week), and turn in your document.