# Project 1: Digital Divide
### Data Analysis

#### Based on PPIC's Just the Facts report ["California's Digital Divide"](https://www.ppic.org/publication/californias-digital-divide/)

## Research Question(s):
1. What share of households with school-age children in X state have access to high-speed internet?
2. Does this share vary across demographic groups (in this case, race/ethnicity)?

## Goal:
* Use our `working-data` dataset (created in the [Data_Prep notebook](00_DigitalDivide_Data_Prep.ipynb)) to answer our research questions.

## Context:
* Write yourself a description of the context: Include a description of the data (_data set contains X state's data for YYYY year_)

***

#### Step 1: Set up your working environment.

Import all necessary libraries and create `Path`s to your data directories. This ensures reproducibility across file systems (Windows uses `\` instead of `/`).

We need 
1. `pandas` to work with the data.
2. `pathlib`, and more specifically its `Path` object, to work with paths. This will ensure our code works in both Windows (which uses `\` in its file paths) and MacOS/Linux (which uses `/`).
3. `datetime` - tip: There are version control systems for data but tagging your data files with the date is not a bad first step if you're getting started.
4. `tree` - to display a directory's tree.
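
As a minimal sketch of why `Path` helps, the snippet below builds a dated file name with the `/` operator. The fixed date and the `working_data` file name mirror this notebook's outputs; they are illustrative values, not a requirement.

```python
from datetime import datetime as dt
from pathlib import Path

# A fixed date (illustrative) keeps the example reproducible;
# the notebook itself uses dt.today().
stamp = dt(2019, 5, 1).strftime("%d-%b-%y")

# The / operator joins path parts with the right separator for the OS.
interim = Path("..") / "data" / "interim"
target = interim / f"working_data-{stamp}.dta"

print(target.name)
print(target.suffix)
```

Because the separator is never typed by hand, the same lines should work unchanged on Windows, macOS, and Linux.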

In [1]:
# setting up working environment
import pandas as pd
from pathlib import Path
from tools import tree
from datetime import datetime as dt
today = dt.today().strftime("%d-%b-%y")

print(today)

01-May-19


In [2]:
# data folder and paths
RAW_DATA_PATH = Path("../data/raw/")
INTERIM_DATA_PATH = Path("../data/interim/")
PROCESSED_DATA_PATH = Path("../data/processed/")
FINAL_DATA_PATH = Path("../data/final/")

In [3]:
tree(INTERIM_DATA_PATH)

+ ../data/interim
    + placeholder
    + state_data-01-May-19.dta
    + working_data-01-May-19.dta


In [4]:
data = pd.read_stata(INTERIM_DATA_PATH / f'working_data-{today}.dta')

In [5]:
data.shape

(44816, 18)

In [6]:
data.head()

Unnamed: 0,year,serial,hhwt,statefip,countyfip,gq,cinethh,cihispeed,pernum,perwt,relate,related,sex,age,race,raced,hispan,hispand
0,2017,953662,57,ohio,0,households under 1970 definition,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,58,head/householder,head/householder,female,48,white,white,not hispanic,not hispanic
1,2017,953662,57,ohio,0,households under 1970 definition,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",2,62,child,child,male,20,white,white,not hispanic,not hispanic
2,2017,953662,57,ohio,0,households under 1970 definition,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",3,78,child,child,female,9,white,white,not hispanic,not hispanic
3,2017,953668,140,ohio,61,households under 1970 definition,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,140,head/householder,head/householder,male,28,black/african american/negro,black/african american/negro,not hispanic,not hispanic
4,2017,953668,140,ohio,61,households under 1970 definition,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",2,192,sibling,sibling,female,16,black/african american/negro,black/african american/negro,not hispanic,not hispanic


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44816 entries, 0 to 44815
Data columns (total 18 columns):
year         44816 non-null category
serial       44816 non-null int32
hhwt         44816 non-null int16
statefip     44816 non-null category
countyfip    44816 non-null int16
gq           44816 non-null category
cinethh      44816 non-null category
cihispeed    44816 non-null category
pernum       44816 non-null int8
perwt        44816 non-null int16
relate       44816 non-null category
related      44816 non-null category
sex          44816 non-null category
age          44816 non-null category
race         44816 non-null category
raced        44816 non-null category
hispan       44816 non-null category
hispand      44816 non-null category
dtypes: category(13), int16(3), int32(1), int8(1)
memory usage: 1.4 MB


Our **unit of observation** is still a (weighted) person but we're interested in **household-level** data. 

From IPUMS docs:
>HHWT indicates how many households in the U.S. population are represented by a given household in an IPUMS sample. <br><br>
>It is generally a good idea to use HHWT when conducting a household-level analysis of any IPUMS sample. The use of HHWT is optional when analyzing one of the "flat" or unweighted IPUMS samples. Flat IPUMS samples include the 1% samples from 1850-1930, all samples from 1960, 1970, and 1980, the 1% unweighted samples from 1990 and 2000, the 10% 2010 sample, and any of the full count 100% census datasets. HHWT must be used to obtain nationally representative statistics for household-level analyses of any sample other than those.<br><br>
>**Users should also be sure to select one person (e.g., PERNUM = 1) to represent the entire household.**

***

#### Step 2: Drop all observations where `pernum` doesn't equal 1

In [8]:
mask_pernum = (data['pernum'] == 1)

In [9]:
data[mask_pernum].shape

(11109, 18)

Save your data to an appropriately named variable.

In [10]:
state_households = data[mask_pernum].copy()
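
To see what this filter does, here is a small sketch on made-up person-level rows (the `serial`, `pernum`, and `hhwt` values are invented for illustration). Keeping only `pernum == 1` leaves exactly one row per household, so household weights are not counted once per household member.

```python
import pandas as pd

# Hypothetical person-level data: three households, six people.
people = pd.DataFrame({
    "serial": [1, 1, 1, 2, 2, 3],
    "pernum": [1, 2, 3, 1, 2, 1],
    "hhwt":   [57, 57, 57, 140, 140, 80],
})

# Keep one person per household to represent the whole household.
households = people[people["pernum"] == 1].copy()

# Each household (serial) now appears once.
print(households["serial"].is_unique)

# Summing hhwt on the person-level rows counts each household
# once per member; summing on the filtered rows does not.
print(people["hhwt"].sum(), households["hhwt"].sum())
```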

***

#### Step 3: Familiarize yourself with your variables of interest

From IPUMS [docs](https://usa.ipums.org/usa-action/variables/CINETHH#description_section):

>CINETHH reports whether any member of the household accesses the Internet. Here, "access" refers to whether or not someone in the household uses or connects to the Internet, regardless of whether or not they pay for the service.

In [11]:
# find the value_counts for your cinethh series
state_households['cinethh'].value_counts()

yes, with a subscription to an internet service                10442
no internet access at this house, apartment, or mobile home      476
yes, without a subscription to an internet service               191
Name: cinethh, dtype: int64

From IPUMS [docs](https://usa.ipums.org/usa-action/variables/CIHISPEED#description_section):
>CIHISPEED reports whether the respondent or any member of their household subscribed to the Internet using broadband (high speed) Internet service such as cable, fiber optic, or DSL service. <br><br>
>User Note: The ACS 2016 introduced changes to the questions regarding computer use and Internet access. See the comparability section and questionnaire text for more information. Additional information provided by the Census Bureau regarding these question alterations are available in the report: ACS Content Test Shows Need to Update Terminology

In [12]:
# find the value_counts for your cihispeed series
state_households['cihispeed'].value_counts()

yes (cable modem, fiber optic or dsl service)    8920
no                                               1522
n/a (gq)                                          667
Name: cihispeed, dtype: int64

_quick tip_ `.value_counts()` _has a_ `normalize` _parameter:_

In [13]:
pd.Series.value_counts?

Signature:
pd.Series.value_counts(
    self,
    normalize=False,
    sort=True,
    ascending=False,
    bins=None,
    dropna=True,
)
Docstring:
Return a Series containing counts of unique values.

The resulting object will be in descending order so that the
first element is the most frequently-occurring element.
Excludes NA values by default.

Parameters
----------
normalize : boolean, default False
    If True then the object returned will contain the relative
    frequencies of the unique values.

In [19]:
# try it on your cinethh series
state_households['cinethh'].value_counts(normalize = True)

yes, with a subscription to an internet service                0.939959
no internet access at this house, apartment, or mobile home    0.042848
yes, without a subscription to an internet service             0.017193
Name: cinethh, dtype: float64

In [20]:
# on cihispeed 
state_households['cihispeed'].value_counts(normalize = True)

yes (cable modem, fiber optic or dsl service)    0.802953
no                                               0.137006
n/a (gq)                                         0.060041
Name: cihispeed, dtype: float64

***

This would be the end of our analysis if we weren't working with **weighted** data. **Weighted** data means each of our observations represents more than one person or household.

`perwt` = "Person's weight"

`hhwt` = "Household's weight"

`.value_counts(normalize=True)` counts the number of **observations** for each of a series' values and then divides it by the total count. If each of our observations was 1 person/household, we would have the answer already. 

What we need to do is **aggregate**.

***

#### Step 4: Grouping and aggregating data

The mechanics are almost the same:
1. Group the observations by each of the values in a series.
2. Add up **not the number of observations** but the weight of each observation in the group.
3. Divide by the total.
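
The difference between counting observations and summing weights can be sketched on a tiny made-up table (the four rows and their `hhwt` values are invented for illustration):

```python
import pandas as pd

# Hypothetical household rows with very different weights.
toy = pd.DataFrame({
    "cihispeed": ["yes", "yes", "no", "no"],
    "hhwt":      [300, 100, 50, 50],
})

# Unweighted: every row counts as one household.
unweighted = toy["cihispeed"].value_counts(normalize=True)

# Weighted: add up hhwt within each group, then divide by the total weight.
weighted = toy.groupby("cihispeed")["hhwt"].sum() / toy["hhwt"].sum()

print(unweighted["yes"])
print(weighted["yes"])
```

In this toy table half of the rows say "yes", but those rows stand for 400 of the 500 represented households, so the two shares differ.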

#### Step 4.1: Group your data by their corresponding values

In [14]:
state_households.groupby("cihispeed")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x110a038d0>

From the [docs](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html):

>A groupby operation involves some combination of splitting the
object, __applying a function__, and combining the results. This can be
used to group large amounts of data and compute operations on these
groups.

We're missing the **applying a function** part of it.

Try the following:
```python
state_households.groupby("countyfip").sum()
```

You can apply _almost_ any aggregation method here.

Try `.mean()`, `.max()`, `.min()`, `.std()`.
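
If you want several of these at once, one option is `.agg()` with a list of function names. This is a sketch on made-up data, not a step this notebook depends on; the `countyfip` and `hhwt` values are invented.

```python
import pandas as pd

# Hypothetical households in two counties.
toy = pd.DataFrame({
    "countyfip": [0, 0, 3, 3, 3],
    "hhwt":      [10, 30, 5, 15, 40],
})

# One row per county, one column per aggregation.
summary = toy.groupby("countyfip")["hhwt"].agg(["sum", "mean", "max"])

print(summary)
```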

In [15]:
state_households.groupby("countyfip").sum()

Unnamed: 0_level_0,serial,hhwt,pernum,perwt
countyfip,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,4650504000.0,446164.0,4755.0,446156.0
3,120700700.0,10555.0,123.0,10557.0
7,96689510.0,9868.0,99.0,9871.0
17,305148300.0,37891.0,312.0,37893.0
23,146433800.0,13809.0,150.0,13800.0
29,84099090.0,9099.0,86.0,9106.0
35,1122669000.0,117782.0,1148.0,117759.0
41,236990800.0,23567.0,242.0,23568.0
45,169306900.0,17631.0,173.0,17622.0
49,1015987000.0,132324.0,1038.0,132311.0


You can select columns just like you would with any regular dataframe.

In [16]:
state_households.groupby("countyfip")['hhwt'].sum()

countyfip
0      446164.0
3       10555.0
7        9868.0
17      37891.0
23      13809.0
29       9099.0
35     117782.0
41      23567.0
45      17631.0
49     132324.0
57      15170.0
61      77647.0
89      18966.0
93      31271.0
103     18136.0
109      9352.0
113     53118.0
133     15673.0
139     11236.0
153     48930.0
165     27706.0
169     10623.0
Name: hhwt, dtype: float64

***

In [18]:
# select the "yes" group by its label; positional index 2 is the "no" group
n_households = state_households.groupby("cihispeed")['hhwt'].sum()['yes (cable modem, fiber optic or dsl service)']
_state = state_households['statefip'].unique()[0]
print(f"""
We can see now {n_households:,} households in {_state} have access to high-speed internet. But, out of how many?

To make this easier to follow, let's save our results to a variable:
""")


We can see now 910,268.0 households in ohio have access to high-speed internet. But, out of how many?

To make this easier to follow, let's save our results to a variable:



In [21]:
households_with_highspeed_access = state_households.groupby("cihispeed")["hhwt"].sum()

households_with_highspeed_access

cihispeed
n/a (gq)                                          79136.0
yes (cable modem, fiber optic or dsl service)    910268.0
no                                               167114.0
Name: hhwt, dtype: float64

This looks like any regular `pandas.Series`. How do we find the total (`.sum()`) of a series' elements?

![math](../../static/math.png)

In [22]:
households_with_highspeed_access.sum()

1156518.0

That's our denominator! 

![nice](../../static/nooice.gif)

***

When you _apply_ an operation to a `pandas.Series`, it _maps_ to each of its elements.

Try the following:
```python
households_with_highspeed_access * 1_000_000
```

```python
households_with_highspeed_access + 1_000_000
```

```python
households_with_highspeed_access / 1_000_000
```

In [23]:
households_with_highspeed_access * 1_000_000

cihispeed
n/a (gq)                                         7.913600e+10
yes (cable modem, fiber optic or dsl service)    9.102680e+11
no                                               1.671140e+11
Name: hhwt, dtype: float64

In [24]:
households_with_highspeed_access + 1_000_000

cihispeed
n/a (gq)                                         1079136.0
yes (cable modem, fiber optic or dsl service)    1910268.0
no                                               1167114.0
Name: hhwt, dtype: float64

In [25]:
households_with_highspeed_access / 1_000_000

cihispeed
n/a (gq)                                         0.079136
yes (cable modem, fiber optic or dsl service)    0.910268
no                                               0.167114
Name: hhwt, dtype: float64

Now that you know the denominator of our equation (the total number of households in X state), how would you find the share of the total for each of the 3 values in `households_with_highspeed_access`?

In [26]:
households_with_highspeed_access / households_with_highspeed_access.sum()

cihispeed
n/a (gq)                                         0.068426
yes (cable modem, fiber optic or dsl service)    0.787076
no                                               0.144498
Name: hhwt, dtype: float64

***
***

### Part 2 of analysis: Creating derived variables

Now that you have answered **Research Question 1**, we can move on to Q2: 
>_Does this number vary across demographic groups? (in this case race/ethnicity)._

The pandas `.groupby()` method can take a list of columns to group by.

Try the following:
```python
state_households.groupby(['race', 'cihispeed'])[['hhwt']].sum()
```

_Notice that I'm passing_ `[['hhwt']]` _(a 1-element list) and not just_ `['hhwt']`. _Try both yourself and let's discuss the difference._
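
As a hint for that discussion, the sketch below checks the type of each result on a small made-up table (the rows are invented for illustration). Selecting with a list keeps a table; selecting with a single name returns one column.

```python
import pandas as pd

# Hypothetical rows standing in for state_households.
toy = pd.DataFrame({
    "race":      ["white", "white", "black", "black"],
    "cihispeed": ["yes", "no", "yes", "no"],
    "hhwt":      [100, 20, 60, 30],
})

by_list = toy.groupby(["race", "cihispeed"])[["hhwt"]].sum()  # list of columns
by_name = toy.groupby(["race", "cihispeed"])["hhwt"].sum()    # single column

print(type(by_list).__name__)
print(type(by_name).__name__)
```

The numbers are the same either way; only the container differs, which matters later when you divide one result by another or reset the index.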

In [27]:
state_households.groupby(['race', 'cihispeed'])[['hhwt']].sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,hhwt
race,cihispeed,Unnamed: 2_level_1
white,n/a (gq),57742.0
white,"yes (cable modem, fiber optic or dsl service)",742182.0
white,no,124811.0
black/african american/negro,n/a (gq),15443.0
black/african american/negro,"yes (cable modem, fiber optic or dsl service)",114753.0
black/african american/negro,no,31265.0
american indian or alaska native,n/a (gq),60.0
american indian or alaska native,"yes (cable modem, fiber optic or dsl service)",1842.0
american indian or alaska native,no,529.0
chinese,n/a (gq),117.0


In [28]:
state_households.groupby(['race', 'cihispeed'])['hhwt'].sum()

race                              cihispeed                                    
white                             n/a (gq)                                          57742.0
                                  yes (cable modem, fiber optic or dsl service)    742182.0
                                  no                                               124811.0
black/african american/negro      n/a (gq)                                          15443.0
                                  yes (cable modem, fiber optic or dsl service)    114753.0
                                  no                                                31265.0
american indian or alaska native  n/a (gq)                                             60.0
                                  yes (cable modem, fiber optic or dsl service)      1842.0
                                  no                                                  529.0
chinese                           n/a (gq)                                            117.0


***

#### Step 1: Define your groups



Pandas' `.loc` indexer serves not only to slice dataframes but also to assign new values to certain slices of dataframes.

For example,
```python
mask_madeup_data = (data['column_1'] == 'no answer')
data.loc[mask_madeup_data, 'new_column'] = 'this row did not answer'
```

The code above grabs all the rows that satisfy the condition and then looks for `'new_column'`. If the column doesn't exist, pandas creates it and assigns the value `'this row did not answer'` to all the rows that match the condition. The rest are filled with null values (NaNs).
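
Here is a runnable version of that made-up example, so you can see the new column and its null values for yourself. The dataframe contents are invented; only the `.loc` pattern matters.

```python
import pandas as pd

# Hypothetical answers; two rows match the condition.
df = pd.DataFrame({"column_1": ["yes", "no answer", "no", "no answer"]})

mask_madeup_data = (df["column_1"] == "no answer")

# 'new_column' does not exist yet, so .loc creates it.
df.loc[mask_madeup_data, "new_column"] = "this row did not answer"

print(df)

# Rows that did not match the mask were left as NaN.
print(df["new_column"].isna().sum())
```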

###### Let's create our masks

In [29]:
mask_latino = (state_households['hispan'] != 'not hispanic')

In [30]:
mask_white = (state_households['hispan'] == 'not hispanic') & (state_households['race'] == 'white')

In [31]:
mask_black = (state_households['hispan'] == 'not hispanic') & (state_households['race'].str.contains('black'))
                                                                                

In [32]:
mask_native = (state_households['hispan'] == 'not hispanic') & (state_households['race'] == 'american indian or alaska native')

In [35]:
mask_API = (state_households['hispan'] == 'not hispanic') & ((state_households['race'] >= 'chinese') & (state_households['race'] <= 'other asian or pacific islander'))

In [36]:
mask_other = (state_households['hispan'] == 'not hispanic') & (state_households['race'] >= 'other race, nec')
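
The `>=` and `<=` comparisons in `mask_API` and `mask_other` rely on `race` being an *ordered* categorical, which is what `pd.read_stata` produces by default from Stata value labels. The sketch below shows the idea; the category list is a shortened, hypothetical stand-in for the real IPUMS codes.

```python
import pandas as pd

# Hypothetical, shortened category order (the real variable has more codes).
categories = ["white", "black", "chinese", "japanese", "other race, nec"]

race = pd.Series(pd.Categorical(
    ["white", "black", "chinese", "japanese", "other race, nec"],
    categories=categories,
    ordered=True,
))

# Comparisons follow the category order, not alphabetical order.
mask_range = (race >= "chinese") & (race <= "japanese")

print(mask_range.tolist())
```

If the column were plain strings, the same comparison would be alphabetical and would pick different rows, so it is worth confirming the ordering with `state_households['race'].cat.categories` before trusting range masks.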

Assign the values to a new column `'racen'` for Race/Ethnicity

In [37]:
state_households.loc[mask_latino, 'racen'] = 'Latino'
state_households.loc[mask_white, 'racen'] = 'White'
state_households.loc[mask_black, 'racen'] = 'Black/African-American'
state_households.loc[mask_native, 'racen'] = 'Am. Indian / Alaska Native'
state_households.loc[mask_API, 'racen'] = 'Asian / Pacific Islander'
state_households.loc[mask_other, 'racen'] = 'other'


Checking your results.

Under your new logic, all `race` values should fit into `racen` values so there should not be any null values, right?

Pandas `.isna()` returns a series of either True or False for each value of a series depending on whether or not it is Null. 

AND

In Python, True = 1 and False = 0.

What do you think would happen if you ask for the `.sum()` total of a `pandas.Series` of booleans?

In [38]:
state_households['racen'].isna().sum()

0
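
The same idea on a tiny made-up series (values invented for illustration), where two entries are missing:

```python
import pandas as pd

# Hypothetical labels with two missing entries.
labels = pd.Series(["White", None, "Latino", None])

missing = labels.isna()   # one boolean per element

print(True + True)        # booleans add up like 1s
print(missing.sum())      # number of null values
```

A result of 0 on the real column means every household was assigned to one of the groups.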

***

##### Multiple ways of grouping data

Now that you have derived a working variable for race/ethnicity you can aggregate your data to answer **RQ2**. In pandas, there are many ways to do this, some of them are:
1. `.groupby()` like we've done so far.
2. `.pivot_table()`
3. `pd.crosstab()` <- this one is a top-level `pandas` function, not a DataFrame method. More later.
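
As a preview of the third option, here is a minimal sketch of `pd.crosstab` on made-up data (group names and weights are invented). Passing `values` and `aggfunc` makes it sum weights instead of counting rows, and `normalize="index"` turns each row into shares.

```python
import pandas as pd

# Hypothetical households in two groups.
toy = pd.DataFrame({
    "racen":     ["A", "A", "A", "B", "B"],
    "cihispeed": ["yes", "yes", "no", "yes", "no"],
    "hhwt":      [200, 100, 100, 50, 150],
})

shares = pd.crosstab(
    index=toy["racen"],
    columns=toy["cihispeed"],
    values=toy["hhwt"],
    aggfunc="sum",        # add up weights in each cell
    normalize="index",    # divide each row by its row total
)

print(shares)
```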

##### GroupBy

In [40]:
state_households.groupby(['racen', 'cihispeed'])[['hhwt']].sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,hhwt
racen,cihispeed,Unnamed: 2_level_1
Am. Indian / Alaska Native,n/a (gq),60.0
Am. Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",1140.0
Am. Indian / Alaska Native,no,529.0
Asian / Pacific Islander,n/a (gq),1172.0
Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",24535.0
Asian / Pacific Islander,no,2172.0
Black/African-American,n/a (gq),15254.0
Black/African-American,"yes (cable modem, fiber optic or dsl service)",114289.0
Black/African-American,no,30805.0
Latino,n/a (gq),4408.0


Let's save that to an appropriately named variable since we'll be using it later.

In [41]:
cihispeed_by_racen = state_households.groupby(['racen', 'cihispeed'])[['hhwt']].sum()

Now, this grouped dataframe has the total number of households in each of these racen-cihispeed groups. 

We need the share of cihispeed values by racen group. 

In our equation,

$$ \frac{households\ with\ high\ speed\ internet}{total\ households\ in\ racen\ group}$$

We need to find the denominator.

In [54]:
households_by_racen = state_households.groupby('racen')[['hhwt']].sum()

In [57]:
shares_cihispeed_by_racen = cihispeed_by_racen / households_by_racen

shares_cihispeed_by_racen

Unnamed: 0_level_0,Unnamed: 1_level_0,hhwt
racen,cihispeed,Unnamed: 2_level_1
Am. Indian / Alaska Native,n/a (gq),0.034702
Am. Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",0.659341
Am. Indian / Alaska Native,no,0.305957
Asian / Pacific Islander,n/a (gq),0.042039
Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",0.880053
Asian / Pacific Islander,no,0.077908
Black/African-American,n/a (gq),0.095131
Black/African-American,"yes (cable modem, fiber optic or dsl service)",0.712756
Black/African-American,no,0.192113
Latino,n/a (gq),0.084897
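
An alternative way to build the denominator, sketched here on made-up data, is to compute each group's total with `.transform("sum")` so that it already has the same multi-level index as the numerator. This is a different route to the same shares, not the method used above; the rows and group names are invented.

```python
import pandas as pd

# Hypothetical households in two groups.
toy = pd.DataFrame({
    "racen":     ["A", "A", "A", "B", "B"],
    "cihispeed": ["yes", "yes", "no", "yes", "no"],
    "hhwt":      [200, 100, 100, 50, 150],
})

grouped = toy.groupby(["racen", "cihispeed"])[["hhwt"]].sum()

# Total per racen group, repeated on every row of that group.
totals = grouped.groupby(level="racen").transform("sum")

shares = grouped / totals

print(shares)
```

Since numerator and denominator share one index, the division is a plain row-by-row operation and each group's shares add up to 1.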


This is a multi-level index dataframe and there are a few ways to slice it. Let's try 3:
1. a classic `.loc` slice
2. a cross-section (`.xs()`)
3. the `.reset_index()` method

**Classic `.loc`**

In [58]:
shares_cihispeed_by_racen.loc[(slice(None), 'yes (cable modem, fiber optic or dsl service)'), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,hhwt
racen,cihispeed,Unnamed: 2_level_1
Am. Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",0.659341
Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",0.880053
Black/African-American,"yes (cable modem, fiber optic or dsl service)",0.712756
Latino,"yes (cable modem, fiber optic or dsl service)",0.714553
White,"yes (cable modem, fiber optic or dsl service)",0.804646
other,"yes (cable modem, fiber optic or dsl service)",0.685934


**cross-section**

In [59]:
shares_cihispeed_by_racen.xs?

Signature: shares_cihispeed_by_racen.xs(key, axis=0, level=None, drop_level=True)
Docstring:
Return cross-section from the Series/DataFrame.

This method takes a `key` argument to select data at a particular
level of a MultiIndex.

Parameters
----------
key : label or tuple of label
    Label contained in the index, or partially in a MultiIndex.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis to retrieve cross-section on.
level : object, defaults to first n levels (n=1 or len(key))
    In case of a key partially contained in a MultiIndex, indicate
    which levels are used. Levels can be referred by label or position.
drop_level : bool, default True
    If False, returns object with same levels as self.

Returns
-------
Series or DataFrame
    Cross-se

In [60]:
shares_cihispeed_by_racen.xs(key = 'yes (cable modem, fiber optic or dsl service)', level = 1)


Unnamed: 0_level_0,hhwt
racen,Unnamed: 1_level_1
Am. Indian / Alaska Native,0.659341
Asian / Pacific Islander,0.880053
Black/African-American,0.712756
Latino,0.714553
White,0.804646
other,0.685934


**`.reset_index()`**

Another way to slice a multi-level index dataframe is to make it a not-multi-level index dataframe. To do that you need to _reset_ its index.

In [61]:
shares_cihispeed_by_racen = shares_cihispeed_by_racen.reset_index()

In [64]:
shares_cihispeed_by_racen

Unnamed: 0,racen,cihispeed,hhwt
0,Am. Indian / Alaska Native,n/a (gq),0.034702
1,Am. Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",0.659341
2,Am. Indian / Alaska Native,no,0.305957
3,Asian / Pacific Islander,n/a (gq),0.042039
4,Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",0.880053
5,Asian / Pacific Islander,no,0.077908
6,Black/African-American,n/a (gq),0.095131
7,Black/African-American,"yes (cable modem, fiber optic or dsl service)",0.712756
8,Black/African-American,no,0.192113
9,Latino,n/a (gq),0.084897


In [65]:
mask_yes_cihispeed = (shares_cihispeed_by_racen['cihispeed'] == 'yes (cable modem, fiber optic or dsl service)')

shares_cihispeed_by_racen[mask_yes_cihispeed]

Unnamed: 0,racen,cihispeed,hhwt
1,Am. Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",0.659341
4,Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",0.880053
7,Black/African-American,"yes (cable modem, fiber optic or dsl service)",0.712756
10,Latino,"yes (cable modem, fiber optic or dsl service)",0.714553
13,White,"yes (cable modem, fiber optic or dsl service)",0.804646
16,other,"yes (cable modem, fiber optic or dsl service)",0.685934


***

##### Pivot Tables

The second way to aggregate our data is with `.pivot_table()`.

If you've worked with Excel, you might already be familiar with what a pivot table is.

From [Wikipedia](https://en.wikipedia.org/wiki/Pivot_table):
>A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

In [66]:
state_households.pivot_table?

Signature:
state_households.pivot_table(
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
)
Docstring:
Create a spreadsheet-style pivot table as a DataFrame. The levels in
the pivot table will be stored in MultiIndex objects (hierarchical
indexes) on the index and columns o

What we need are four things:
1. What variable will become our `index`?
2. What variable will become our `columns`?
3. What variable will become our `values`?
4. How will we aggregate our values?

Pandas grabs each unique value in the variables you choose and uses them as rows in the `.index` or as separate columns in `.columns`. The `values` variable should be _quantitative_ in this case (though it doesn't have to be). By default, `.pivot_table` computes the `mean` of your `values` variable for each cell in the new table. Here we don't care about the `mean`; we want to `sum` up the total number of households.

Try the following:

```python
state_households.pivot_table(
    index = '______',
    columns = '______', 
    values = 'hhwt',
    aggfunc = '___',
    margins = True,
)
```

In [69]:
state_households.pivot_table(
    index = 'racen',
    columns = 'cihispeed', 
    values = 'hhwt',
    aggfunc = 'sum',
    margins = True,
)

cihispeed,n/a (gq),"yes (cable modem, fiber optic or dsl service)",no,All
racen,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Am. Indian / Alaska Native,60.0,1140.0,529.0,1729.0
Asian / Pacific Islander,1172.0,24535.0,2172.0,27879.0
Black/African-American,15254.0,114289.0,30805.0,160348.0
Latino,4408.0,37101.0,10413.0,51922.0
White,55415.0,717266.0,118725.0,891406.0
other,2827.0,15937.0,4470.0,23234.0
All,167114.0,79136.0,910268.0,1156518.0


Save it to an appropriately named variable.

In [70]:
households_pivot_table = state_households.pivot_table(
    index = 'racen',
    columns = 'cihispeed', 
    values = 'hhwt',
    aggfunc = 'sum',
    margins = True,
)

What do you think the next step should be?

In [71]:
households_pivot_table['yes (cable modem, fiber optic or dsl service)'] / households_pivot_table['All']

racen
Am. Indian / Alaska Native    0.659341
Asian / Pacific Islander      0.880053
Black/African-American        0.712756
Latino                        0.714553
White                         0.804646
other                         0.685934
All                           0.068426
dtype: float64