In [None]:
# modules for research report
from datascience import *
import numpy as np
import random
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# module for YouTube video
from IPython.display import YouTubeVideo

# okpy config
from client.api.notebook import Notebook
ok = Notebook('wealth-final-project.ok')
_ = ok.auth(inline=True)

# Family & Household Wealth in the United States (2009-2013)

This dataset comes from the household-level files of the American Community Survey (ACS) Public Use Microdata Sample (PUMS). The ACS
is a yearly, ongoing survey conducted by the U.S. Census Bureau. The data is taken from the ACS 5-year PUMS, spanning the years
2009-2013, and it has been cleaned for your convenience: observations and variables not of interest have been removed, and a random
sample of 1,200 entries has been provided. A brief description of the dataset is provided below.

**NB: You may not copy any public analyses of this dataset. Doing so will result in a zero.**

## Summary

The origins of the ACS can be traced back to the mid-20th century, the postwar period in the United States that saw massive
population growth and swiftly changing household, urban, and rural demographics. Beginning in the 1960s, lawmakers, unable to get
actionable data about their rapidly changing communities from the once-every-ten-years Census, started to look for
ways to get more immediate information about the people in their districts.

It wasn't until the 1990s, however, that plans to get more frequent Census-type data came to fruition. Congress, seeing a drop in
Census response rates as a result of its burdensome length, directed the U.S. Census Bureau to develop different ways to get the
much-needed information. The U.S. Census Bureau, in concert with statisticians from other organizations, eventually developed a
yearly survey now known as the ACS. The ACS was officially launched in 2005. 

According to the [U.S. Census Bureau website](https://www.census.gov/programs-surveys/acs/geography-acs/areas-published.html): 
>“American Community Survey (ACS) data are tabulated for a variety of different geographic areas ranging in size from broad geographic
regions to cities, towns, county subdivisions, and block groups.” 


## Disclaimer

At the time this data was collected, the U.S. Census Bureau and the ACS only considered binary, opposite-sex couples in the context of marital status; any time this dataset mentions a married pair or a spouse, it is referring to an opposite-sex couple or partner. This dataset covers the period 2009-2013, when same-sex marriage had not yet been legalized under U.S. federal law, though some states maintained legal same-sex marriage during this time period.

## Data Description

This dataset uses the following terms:
* `GQ`: group quarters
* `Non-Family`: a household that is not associated with any family. If referring to an individual, the respondent is considered a member of group quarters. If referring to a property, the property is considered vacant. 

This dataset contains three tables, included in the `data` folder:
1. `families_data` provides information about characteristics of each household.
2. `resources_data` provides information about the resources available to each household.
3. `states_data` provides the full names and abbreviations of U.S. states as strings, as well as an integer code for each state that corresponds to the integer codes found in the previous two tables. 

A description of each table's variables is provided below:
1. `families_data`:
    * `ID`: a unique identifier for each household
    * `REGION`: region of the United States 
    * `DIVISION`: division of the United States, more specific than region
    * `STATE`: state of the United States
    * `FAMILY INCOME`: yearly income by family, not adjusted for inflation
    * `HOUSEHOLD LANGUAGE`: description of the main language spoken in the household. The categories are: English only, and Other Non-English.
    * `HOUSEHOLD INCOME`: yearly income by household, not adjusted for inflation 
    * `WORKERS IN FAMILY`: number of workers in each family 
    * `PERSONS IN FAMILY`: number of persons in each family
     
    

2. `resources_data`: 
    * `ID`: a unique identifier for each household
    * `REGION`: region of the United States 
    * `DIVISION`: division of the United States, more specific than region
    * `STATE`: state of the United States 
    * `MONTHLY RENT`: the monthly rent each renting household is paying. If the property is owned by the household, this value is the string "Owner". 
    * `GROSS MONTHLY RENT`: the gross rent (monthly amount) each renting household is paying. If the property is owned by the household, this value is the string "Owner". 
    * `OCCUPANCY STATUS`: description of the occupancy status for a particular property; for example: “owned free and clear” or “rented.” 
    * `NUMBER OF VEHICLES`: number of vehicles a particular household has access to. 
    * `HOUSEHOLD TELEPHONE`: a binary variable indicating whether or not a household has access to a telephone.
    * `PROPERTY VALUE`: the value of the property in dollars ($).
    
    
    
3. `states_data`:
    * `CODE`: integer state code, matching the integer codes used in the previous two tables
    * `FULL NAME`: full state name
    * `ABRV`: abbreviation


## Note about Non-Family Values

Many values in the table below are categorized as "non-family" if the census respondent is not part of a family unit (e.g. the respondent
resides in group quarters). Our data contains a lot of family-specific data, like `FAMILY INCOME` or `PERSONS IN FAMILY`; it wouldn't
make sense for a person who is not part of a family to have responses to those. 

Non-family respondents, however, account for a significant portion of census data: roughly 15% to 30%, depending on the sample.
If you would like to work with the non-family respondent data, we encourage you to consider variables that do not depend on family
status, like `HOUSEHOLD INCOME` or `PROPERTY VALUE`, among others. If, however, you do not want to work with these respondents, we
encourage you to filter their rows out of your data (in a process called "data cleaning") before you get started.

*Hint: if you want to clean your data so that non-family values do not appear, consider filtering using `.where`.*

## Inspiration

A variety of exploratory analyses, hypothesis tests, and prediction problems can be tackled with this data. Here are a few ideas to get you started:

1. Is there a relationship between property value and English-only-speaking households? 
2. Is there a significant difference in monthly rent for households in the West region of the United States compared to the Northeast region? 
3. How do rows containing ‘Non-Family’ data compare to responses filled out by heads of families?
4. Where is household telephone access limited? How is this associated with various measures of wealth?
5. What states and regions have higher or lower vehicle ownership? See `NUMBER OF VEHICLES`.

If you'd like to learn more about wealth in the United States, check out the following resources:
1. [Where does your net worth rank in the United States?](https://www.nytimes.com/interactive/2019/08/12/upshot/are-you-rich-where-does-your-net-worth-rank-wealth.html) Data visualization by the *New York Times*. 
2. Consider taking Wealth & Poverty (PUBPOL C103) with former U.S. Secretary of Labor and current UC Berkeley professor Robert Reich. 
3. Consider taking Contemporary Theories of Political Economy (POLECON 101) with Professor Khalid Kadir (it's Data 8 GSI Maya's favorite class at Berkeley!).

*Credit to Prof. Lexin Li, and his course Big Data: A Public Health Perspective (PBHLTH 244), for introducing the Data 8 staff to this dataset. Feel free to ask him about your path to upper division and graduate level biostatistics courses at UC Berkeley.* 

Don't forget to review the [Final Project Guidelines](https://docs.google.com/document/d/1NuHDYTdWGwhPNRov8Y3I8y6R7Rbyf-WDOfQwovD-gmw/edit?usp=sharing) for a complete list of requirements.

## Preview

In [None]:
families_data = Table.read_table('data/families.csv').relabel(4, 'FAMILY INCOME').relabel(6, 'HOUSEHOLD INCOME')
families_data

In [None]:
resources_data = Table.read_table('data/resources.csv').relabel(4, 'PROPERTY VALUE').relabel(5, 'MONTHLY RENT').relabel(6, 'GROSS MONTHLY RENT')
resources_data

In [None]:
states_data = Table.read_table('data/states.csv')
states_data

<br>

# Research Report

## Introduction

*Replace this text with your introduction*

## Hypothesis Testing and Prediction Questions

**Please bold your hypothesis testing and prediction questions.**

*Replace this text with your hypothesis testing and prediction questions*

## Exploratory Data Analysis

**You may change the order of the plots and tables.**

**Quantitative Plot:**

In [None]:
# Use this cell to generate your quantitative plot
...

*Replace this text with an analysis of your plot*

**Qualitative Plot:**

In [None]:
# Use this cell to generate your qualitative plot
...

*Replace this text with an analysis of your plot*

**Aggregated Data Table:**

In [None]:
# Use this cell to generate your aggregated data table
...

*Replace this text with an analysis of your table*

**Table Requiring a Join Operation:**

In [None]:
# Use this cell to join two datasets
...

*Replace this text with an analysis of your table*

## Hypothesis Testing

**Do not copy code from demo notebooks or homeworks! You may split portions of your code into distinct cells. Also, be sure to
set a random seed so that your results are reproducible.**
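
As a small illustration of what seeding buys you, the sketch below shuffles a made-up list of region labels twice under the same seed and gets the same order both times. The helper name `shuffled_labels` is hypothetical. Note that Python's `random` module and NumPy keep separate random states, so code that draws through NumPy (for example `np.random.choice`) would need `np.random.seed` as well.

```python
import random

def shuffled_labels(seed):
    """Return one shuffled copy of a fixed label list under the given seed."""
    random.seed(seed)
    labels = ['West'] * 3 + ['Northeast'] * 3
    random.shuffle(labels)
    return labels

first = shuffled_labels(1231)
second = shuffled_labels(1231)

# Re-seeding with the same value replays the same sequence of random draws.
print(first == second)  # True
```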

In [None]:
# set the random seeds so that results are reproducible
random.seed(1231)
np.random.seed(1231)  # NumPy keeps its own random state, separate from the random module

...

## Prediction

**Be sure to set a random seed so that your results are reproducible.**

In [None]:
# set the random seeds so that results are reproducible
random.seed(1231)
np.random.seed(1231)  # NumPy keeps its own random state, separate from the random module

...

## Conclusion

*Replace this text with your conclusion*

## Presentation

*In this section, you'll need to provide a link to your video presentation. If you've uploaded your presentation to YouTube,
you can include the URL in the code below. We've provided an example to show you how to do this. Otherwise, provide the link
in a markdown cell.*

**Link:** *Replace this text with a link to your video presentation*

In [None]:
# Full Link: https://www.youtube.com/watch?v=BKgdDLrSC5s&feature=emb_logo
# Plug in the string between "v=" and "&feature":
YouTubeVideo('BKgdDLrSC5s')
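
If you would rather not pick the ID out of the URL by hand, one option is to parse it with the standard library. This is only a sketch: the helper name `youtube_video_id` is hypothetical, and it assumes a `watch?v=...` style link like the one above.

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Return the value of the 'v' query parameter from a YouTube watch URL."""
    query = parse_qs(urlparse(url).query)
    return query['v'][0]

video_id = youtube_video_id('https://www.youtube.com/watch?v=BKgdDLrSC5s&feature=emb_logo')
print(video_id)  # BKgdDLrSC5s
```

The resulting string can be passed straight to `YouTubeVideo`.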

# Submission

*Just as with the other assignments in this course, please submit your research notebook to Okpy. We suggest that you
submit often so that your progress is saved.*

In [None]:
# Run this line to submit your work
_ = ok.submit()