# Project 1: Digital Divide

### Based on PPIC's Just the Facts report ["California's Digital Divide"](https://www.ppic.org/publication/californias-digital-divide/)

## Question(s):
1. What share of households have access to high-speed internet?
2. Does this share vary across demographic groups (in this case, race/ethnicity)?

***

In [1]:
# setting up working environment
import pandas as pd
from pathlib import Path
from datetime import datetime as dt
today = dt.today().strftime("%d-%b-%y")

print(today)

from tools import tree 

27-Apr-19


In [2]:
# data folder and paths
RAW_DATA_FOLDER = Path("../data/raw/")
INTERIM_DATA_FOLDER = Path("../data/interim/")
PROCESSED_DATA_FOLDER = Path("../data/processed/")
FINAL_DATA_FOLDER = Path("../data/final/")

In [3]:
tree(INTERIM_DATA_FOLDER)

+ ../data/interim
    + working_dataset-26-Apr-19.dta
    + working_dataset-27-Apr-19.dta


In [4]:
data = pd.read_stata(INTERIM_DATA_FOLDER / f'working_dataset-{today}.dta')

In [5]:
data.shape

(44816, 14)

In [6]:
data.head()

,year,serial,hhwt,stateicp,countyfip,cinethh,cihispeed,pernum,perwt,relate,sex,age,race,hispan
0,2017,953662,57,ohio,0,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,58,head/householder,female,48,white,not hispanic
1,2017,953662,57,ohio,0,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",2,62,child,male,20,white,not hispanic
2,2017,953662,57,ohio,0,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",3,78,child,female,9,white,not hispanic
3,2017,953668,140,ohio,61,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,140,head/householder,male,28,black/african american/negro,not hispanic
4,2017,953668,140,ohio,61,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",2,192,sibling,female,16,black/african american/negro,not hispanic


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44816 entries, 0 to 44815
Data columns (total 14 columns):
year         44816 non-null category
serial       44816 non-null int32
hhwt         44816 non-null int16
stateicp     44816 non-null category
countyfip    44816 non-null int16
cinethh      44816 non-null category
cihispeed    44816 non-null category
pernum       44816 non-null int8
perwt        44816 non-null int16
relate       44816 non-null category
sex          44816 non-null category
age          44816 non-null category
race         44816 non-null category
hispan       44816 non-null category
dtypes: category(9), int16(3), int32(1), int8(1)
memory usage: 1.2 MB


Our **unit of observation** is still a (weighted) person but we're interested in **household-level** data. 

From IPUMS docs:
>HHWT indicates how many households in the U.S. population are represented by a given household in an IPUMS sample. <br><br>
>It is generally a good idea to use HHWT when conducting a household-level analysis of any IPUMS sample. The use of HHWT is optional when analyzing one of the "flat" or unweighted IPUMS samples. Flat IPUMS samples include the 1% samples from 1850-1930, all samples from 1960, 1970, and 1980, the 1% unweighted samples from 1990 and 2000, the 10% 2010 sample, and any of the full count 100% census datasets. HHWT must be used to obtain nationally representative statistics for household-level analyses of any sample other than those.<br><br>
>**Users should also be sure to select one person (e.g., PERNUM = 1) to represent the entire household.**

STEPS:
1. Drop all observations where `pernum != 1`.
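To see why one person per household is enough, here is a minimal sketch using the five person rows shown in `data.head()` above. The names `people`, `overcounted_total`, and `household_total` are illustrative, not part of the notebook. It checks that `hhwt` is constant within a household (`serial`) and shows how summing it over every person row would overcount.

```python
import pandas as pd

# Person-level rows copied from the data.head() output above.
people = pd.DataFrame({
    "serial": [953662, 953662, 953662, 953668, 953668],
    "pernum": [1, 2, 3, 1, 2],
    "hhwt":   [57, 57, 57, 140, 140],
    "perwt":  [58, 62, 78, 140, 192],
})

# Every person in a household carries the same household weight.
hhwt_is_constant = people.groupby("serial")["hhwt"].nunique().eq(1).all()

# Summing hhwt over all person rows counts each household once per member.
overcounted_total = people["hhwt"].sum()

# Keeping only pernum == 1 counts each household exactly once.
household_total = people.loc[people["pernum"] == 1, "hhwt"].sum()

print(hhwt_is_constant, overcounted_total, household_total)
```

With these five rows the overcounted total is 57 * 3 + 140 * 2 = 451, while the household total is 57 + 140 = 197.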

In [8]:
mask_household = (data['pernum'] == 1)

In [9]:
data[mask_household].shape

(11109, 14)

In [10]:
# descriptive variable names
households_in_state = data[mask_household].copy()

Let's explore our internet variables

In [11]:
households_in_state['cinethh'].value_counts()

yes, with a subscription to an internet service                10442
no internet access at this house, apartment, or mobile home      476
yes, without a subscription to an internet service               191
Name: cinethh, dtype: int64

From IPUMS docs:

>CINETHH reports whether any member of the household accesses the Internet. Here, "access" refers to whether or not someone in the household uses or connects to the Internet, regardless of whether or not they pay for the service.

In [12]:
households_in_state['cihispeed'].value_counts()

yes (cable modem, fiber optic or dsl service)    8920
no                                               1522
n/a (gq)                                          667
Name: cihispeed, dtype: int64

From IPUMS docs:
>CIHISPEED reports whether the respondent or any member of their household subscribed to the Internet using broadband (high speed) Internet service such as cable, fiber optic, or DSL service. <br><br>
>User Note: The ACS 2016 introduced changes to the questions regarding computer use and Internet access. See the comparability section and questionnaire text for more information. Additional information provided by the Census Bureau regarding these question alterations are available in the report: ACS Content Test Shows Need to Update Terminology

Quick tip: `.value_counts()` has a `normalize` parameter:

In [13]:
pd.Series.value_counts?

Signature:
pd.Series.value_counts(
    self,
    normalize=False,
    sort=True,
    ascending=False,
    bins=None,
    dropna=True,
)
Docstring:
Return a Series containing counts of unique values.

The resulting object will be in descending order so that the
first element is the most frequently-occurring element.
Excludes NA values by default.

Parameters
----------
normalize : boolean, default False
    If True then the object returned will contain the relative
    frequencies of the unique values.

In [14]:
households_in_state['cihispeed'].value_counts(normalize = True)

yes (cable modem, fiber optic or dsl service)    0.802953
no                                               0.137006
n/a (gq)                                         0.060041
Name: cihispeed, dtype: float64

In [15]:
households_in_state['cinethh'].value_counts(normalize = True)

yes, with a subscription to an internet service                0.939959
no internet access at this house, apartment, or mobile home    0.042848
yes, without a subscription to an internet service             0.017193
Name: cinethh, dtype: float64

This, however, does not help us here because we are working with **weighted data**. These normalized values are based on the number of observations (i.e., "80% of our observations have access to high-speed internet"). If each of our rows represented just one household (i.e., unweighted data), then this would be very close to the end of our analysis.
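A small made-up example may help show how far apart the two numbers can be. The three households and their weights below are hypothetical, chosen only to exaggerate the effect.

```python
import pandas as pd

# Hypothetical households: three sampled rows with very different weights.
toy = pd.DataFrame({
    "cihispeed": ["yes", "yes", "no"],
    "hhwt": [10, 10, 80],
})

# Unweighted: share of *rows* in each category.
unweighted = toy["cihispeed"].value_counts(normalize=True)

# Weighted: share of the *households those rows represent*.
is_yes = toy["cihispeed"] == "yes"
weighted_yes = toy.loc[is_yes, "hhwt"].sum() / toy["hhwt"].sum()

print(unweighted["yes"], weighted_yes)
```

Two of the three rows say "yes" (about 67% of observations), but those rows represent only 20 of the 100 weighted households (20%).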

### Grouping and aggregating data

What we want is to ___group___ our data by their `cihispeed` values and _add_ their `hhwt` (household weight) values to know how many households are in each category.

We do this by using `.groupby()`:

In [16]:
households_in_state.groupby?


In [17]:
households_in_state.groupby("cihispeed")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x114b5a710>

From the [docs](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

>A groupby operation involves some combination of splitting the
object, __applying a function__, and combining the results. This can be
used to group large amounts of data and compute operations on these
groups.

We have our groups; now we need to apply a function to them.
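To make the split / apply / combine idea concrete, here is a small sketch on made-up data (the `toy` frame and its weights are hypothetical). It does the three steps by hand and then lets pandas do them in one call.

```python
import pandas as pd

# Hypothetical data: five households with made-up weights.
toy = pd.DataFrame({
    "cihispeed": ["yes", "no", "yes", "no", "yes"],
    "hhwt": [10, 20, 30, 40, 50],
})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs.
# Apply: sum the weights within each piece.
by_hand = {key: group["hhwt"].sum() for key, group in toy.groupby("cihispeed")}

# Combine: calling .sum() on the GroupBy performs all three steps at once
# and returns a Series indexed by the group keys.
combined = toy.groupby("cihispeed")["hhwt"].sum()

print(by_hand)
print(combined)
```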

In [18]:
households_in_state.groupby("cihispeed").sum()

cihispeed,serial,hhwt,countyfip,pernum,perwt
n/a (gq),652148900.0,79136.0,23214.0,667.0,79122.0
"yes (cable modem, fiber optic or dsl service)",8723978000.0,910268.0,375594.0,8920.0,910266.0
no,1488372000.0,167114.0,47596.0,1522.0,167094.0


But we don't want to apply it to the whole DataFrame; we just want one column: `hhwt`, our household weight.

In [19]:
households_in_state.groupby("cihispeed")['hhwt'].sum()

cihispeed
n/a (gq)                                          79136.0
yes (cable modem, fiber optic or dsl service)    910268.0
no                                               167114.0
Name: hhwt, dtype: float64

In [20]:
# select the "yes" category by label; a positional index such as [2] would pick the "no" row
n_households = households_in_state.groupby("cihispeed")['hhwt'].sum()["yes (cable modem, fiber optic or dsl service)"]
_state = households_in_state['stateicp'].unique()[0]
print(f"""
We can now see that {n_households:,} households in {_state} have access to high-speed internet. But, out of how many?

To make this easier to follow, let's save our results to a variable:
""")


We can now see that 910,268.0 households in ohio have access to high-speed internet. But, out of how many?

To make this easier to follow, let's save our results to a variable:



In [21]:
households_w_highspeed_access = households_in_state.groupby("cihispeed")['hhwt'].sum()

households_w_highspeed_access

cihispeed
n/a (gq)                                          79136.0
yes (cable modem, fiber optic or dsl service)    910268.0
no                                               167114.0
Name: hhwt, dtype: float64

This is a pandas `Series` object and we can easily find the sum total of its values by applying `.sum()`

In [22]:
households_w_highspeed_access.sum()

1156518.0

That's the weighted total number of households in our state (Ohio, in this sample). That's our denominator.

When you apply an arithmetic operator to a pandas Series, it is applied to each of its elements.

In [23]:
households_w_highspeed_access + 5000000

cihispeed
n/a (gq)                                         5079136.0
yes (cable modem, fiber optic or dsl service)    5910268.0
no                                               5167114.0
Name: hhwt, dtype: float64

In [24]:
households_w_highspeed_access * 53214

cihispeed
n/a (gq)                                         4.211143e+09
yes (cable modem, fiber optic or dsl service)    4.843900e+10
no                                               8.892804e+09
Name: hhwt, dtype: float64

So we can divide our Series of 3 values by the total number of households in the state and get each category's share of the total:

In [25]:
households_w_highspeed_access / households_w_highspeed_access.sum()

cihispeed
n/a (gq)                                         0.068426
yes (cable modem, fiber optic or dsl service)    0.787076
no                                               0.144498
Name: hhwt, dtype: float64

From IPUMS [docs](https://usa.ipums.org/usa-action/variables/CIHISPEED#universe_section):

>**Universe** <br>
    ACS, PRCS: Not in group quarters

# On your own
1. Filter out those **outside** of your universe, i.e., "Out of all **households**, what share has access to high-speed internet?"

The value `n/a (gq)` means the ACS already flagged this _household_ (what we thought was a household) as group quarters, so we can drop those rows and see what the percentages are.
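One possible way to approach this, sketched on the weighted totals printed earlier rather than on the full dataset: drop the `n/a (gq)` category and recompute the shares over what remains. The names `hhwt_by_access`, `in_universe`, and `shares` are illustrative.

```python
import pandas as pd

YES = "yes (cable modem, fiber optic or dsl service)"

# Weighted household totals copied from the groupby output above.
hhwt_by_access = pd.Series(
    {"n/a (gq)": 79136.0, YES: 910268.0, "no": 167114.0},
    name="hhwt",
)

# Keep only categories inside the CIHISPEED universe ...
in_universe = hhwt_by_access.drop("n/a (gq)")

# ... and divide by the new, smaller denominator.
shares = in_universe / in_universe.sum()

print(shares)
```

On the row-level data, the same filter could be written as a boolean mask such as `households_in_state['cihispeed'] != 'n/a (gq)'` applied before grouping. Whether every `n/a (gq)` row really belongs outside the denominator is worth checking against `cinethh` first.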

In [26]:
households_in_state.head()

,year,serial,hhwt,stateicp,countyfip,cinethh,cihispeed,pernum,perwt,relate,sex,age,race,hispan
0,2017,953662,57,ohio,0,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,58,head/householder,female,48,white,not hispanic
3,2017,953668,140,ohio,61,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,140,head/householder,male,28,black/african american/negro,not hispanic
6,2017,953671,135,ohio,0,"yes, with a subscription to an internet service",no,1,134,head/householder,female,35,black/african american/negro,not hispanic
9,2017,953685,46,ohio,35,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,45,head/householder,male,56,white,not hispanic
12,2017,953690,151,ohio,113,"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",1,151,head/householder,male,42,white,not hispanic


From the IPUMS [docs](https://usa.ipums.org/usa-action/source_documents/enum_form_ACS(2016)_tag.xml#51) you will learn that `n/a` also includes households that access the internet without paying for a subscription.


In [27]:
# compute the grouped totals once, then divide by their overall sum
hhwt_by_access = households_in_state.groupby(['cinethh', 'cihispeed'])[['hhwt']].sum()
hhwt_by_access / hhwt_by_access.sum()

cinethh,cihispeed,hhwt
"yes, with a subscription to an internet service",n/a (gq),
"yes, with a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",0.787076
"yes, with a subscription to an internet service",no,0.144498
"yes, without a subscription to an internet service",n/a (gq),0.018273
"yes, without a subscription to an internet service","yes (cable modem, fiber optic or dsl service)",
"yes, without a subscription to an internet service",no,
"no internet access at this house, apartment, or mobile home",n/a (gq),0.050153
"no internet access at this house, apartment, or mobile home","yes (cable modem, fiber optic or dsl service)",
"no internet access at this house, apartment, or mobile home",no,


## Part 2 of Analysis: Creating derived variables

Right now, through `groupby`, you could already compute high-speed internet access rates by race/ethnicity, but the raw categories are a little too detailed:

In [28]:
households_in_state.groupby(['race', 'cihispeed',])[['hhwt']].sum()

race,cihispeed,hhwt
white,n/a (gq),57742.0
white,"yes (cable modem, fiber optic or dsl service)",742182.0
white,no,124811.0
black/african american/negro,n/a (gq),15443.0
black/african american/negro,"yes (cable modem, fiber optic or dsl service)",114753.0
black/african american/negro,no,31265.0
american indian or alaska native,n/a (gq),60.0
american indian or alaska native,"yes (cable modem, fiber optic or dsl service)",1842.0
american indian or alaska native,no,529.0
chinese,n/a (gq),117.0


In [29]:
households_in_state.groupby(['hispan','race', 'cihispeed',])[['hhwt']].sum()

hispan,race,cihispeed,hhwt
not hispanic,white,n/a (gq),55415.0
not hispanic,white,"yes (cable modem, fiber optic or dsl service)",717266.0
not hispanic,white,no,118725.0
not hispanic,black/african american/negro,n/a (gq),15254.0
not hispanic,black/african american/negro,"yes (cable modem, fiber optic or dsl service)",114289.0
not hispanic,black/african american/negro,no,30805.0
not hispanic,american indian or alaska native,n/a (gq),60.0
not hispanic,american indian or alaska native,"yes (cable modem, fiber optic or dsl service)",1140.0
not hispanic,american indian or alaska native,no,529.0
not hispanic,chinese,n/a (gq),117.0


In [30]:
mask_latino = (households_in_state['hispan'] != 'not hispanic')
mask_white = (households_in_state['hispan'] == 'not hispanic') & (households_in_state['race'] == 'white')
mask_native = (households_in_state['hispan'] == 'not hispanic') & (households_in_state['race'] == 'american indian or alaska native')
mask_black = (households_in_state['hispan'] == 'not hispanic') & (households_in_state['race'].str.contains('black'))

In [31]:
# Categorical way: `race` is an ordered categorical, so ranges of categories can be selected with comparisons

In [32]:
households_in_state['race'].unique()

[white, black/african american/negro, two major races, other asian or pacific islander, three or more major races, american indian or alaska native, other race, nec, chinese, japanese]
Categories (9, object): [white < black/african american/negro < american indian or alaska native < chinese ... other asian or pacific islander < other race, nec < two major races < three or more major races]

In [33]:
households_in_state['race'].cat.categories

Index(['white', 'black/african american/negro',
       'american indian or alaska native', 'chinese', 'japanese',
       'other asian or pacific islander', 'other race, nec', 'two major races',
       'three or more major races'],
      dtype='object')

In [34]:
mask_API = (households_in_state['hispan'] == 'not hispanic') & ((households_in_state['race'] >= 'chinese') & (households_in_state['race'] <= 'other asian or pacific islander'))

In [35]:
mask_other = (households_in_state['hispan'] == 'not hispanic') & (households_in_state['race'] >= 'other race, nec')
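The two masks above rely on `race` being an *ordered* categorical, as the `<` signs in the `.unique()` output suggest. A minimal sketch of how such comparisons behave, using the category order from `.cat.categories` and four hand-picked values (the `race` Series here is a toy, not the notebook's column):

```python
import pandas as pd

# Category order copied from the .cat.categories output above.
race_order = [
    "white", "black/african american/negro",
    "american indian or alaska native", "chinese", "japanese",
    "other asian or pacific islander", "other race, nec",
    "two major races", "three or more major races",
]

# Toy values for illustration only.
race = pd.Series(
    pd.Categorical(
        ["white", "chinese", "japanese", "two major races"],
        categories=race_order,
        ordered=True,
    )
)

# Comparisons follow the category order, not alphabetical order:
# "white" sorts after "chinese" as a string but comes first as a category.
is_api = (race >= "chinese") & (race <= "other asian or pacific islander")
is_other = race >= "other race, nec"

print(is_api.tolist())
print(is_other.tolist())
```

This only works because the categories for each group happen to sit next to each other in the ordering; with a different ordering, `.isin([...])` with an explicit list would be the safer choice.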

In [36]:
# add label
households_in_state.loc[mask_latino, 'race-ethnicity'] = 'Latino'
households_in_state.loc[mask_white, 'race-ethnicity'] = 'White'
households_in_state.loc[mask_black, 'race-ethnicity'] = 'Black'
households_in_state.loc[mask_native, 'race-ethnicity'] = 'American Indian / Alaska Native'
households_in_state.loc[mask_API, 'race-ethnicity'] = 'Asian / Pacific Islander'
households_in_state.loc[mask_other, 'race-ethnicity'] = 'Other / 2+ races'

In [37]:
# check that every household received a label
households_in_state['race-ethnicity'].isna().sum()

0
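As an alternative to assigning labels one mask at a time, `numpy.select` can build the column in a single step. This is only a sketch on hypothetical rows: the `"mexican"` value and the `"Unclassified"` default are assumptions made for illustration, and only three of the six groups are included.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "mexican" stands in for any value other than "not hispanic".
toy = pd.DataFrame({
    "hispan": ["not hispanic", "mexican", "not hispanic", "not hispanic"],
    "race": ["white", "white", "chinese", "two major races"],
})

not_hispanic = toy["hispan"] == "not hispanic"
conditions = [
    ~not_hispanic,
    not_hispanic & (toy["race"] == "white"),
    not_hispanic & (toy["race"] == "chinese"),
]
labels = ["Latino", "White", "Asian / Pacific Islander"]

# The first matching condition wins; rows matching none get the default,
# which plays the role of the isna() check above.
toy["race-ethnicity"] = np.select(conditions, labels, default="Unclassified")
n_unclassified = (toy["race-ethnicity"] == "Unclassified").sum()

print(toy)
print(n_unclassified)
```

The last row is left as "Unclassified" on purpose, to show how a missing mask would surface.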

In [38]:
households_in_state.groupby(['race-ethnicity', 'cihispeed'])[['hhwt']].sum()

race-ethnicity,cihispeed,hhwt
American Indian / Alaska Native,n/a (gq),60.0
American Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",1140.0
American Indian / Alaska Native,no,529.0
Asian / Pacific Islander,n/a (gq),1172.0
Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",24535.0
Asian / Pacific Islander,no,2172.0
Black,n/a (gq),15254.0
Black,"yes (cable modem, fiber optic or dsl service)",114289.0
Black,no,30805.0
Latino,n/a (gq),4408.0


In [39]:
households_in_state.groupby(['race-ethnicity', 'cihispeed'])[['hhwt']].sum() / households_in_state.groupby(['race-ethnicity'])[['hhwt']].sum()

race-ethnicity,cihispeed,hhwt
American Indian / Alaska Native,n/a (gq),0.034702
American Indian / Alaska Native,"yes (cable modem, fiber optic or dsl service)",0.659341
American Indian / Alaska Native,no,0.305957
Asian / Pacific Islander,n/a (gq),0.042039
Asian / Pacific Islander,"yes (cable modem, fiber optic or dsl service)",0.880053
Asian / Pacific Islander,no,0.077908
Black,n/a (gq),0.095131
Black,"yes (cable modem, fiber optic or dsl service)",0.712756
Black,no,0.192113
Latino,n/a (gq),0.084897
