# Research Question 3 (ACS)

What are the differences between the 2014 and 2019 5-year American Community Survey datasets? Which variables or labels are present in the 2019 data but missing from the 2014 data, and vice versa? Which 2019 variables have changed significantly from the 2014 estimates?

## Task 1

Using the ACS data dictionary found [here](https://drive.google.com/file/d/1Nd1TgI7-IARuoqDiexzT8MWLR6A3RQhx/view), compare variables from 2014 with variables from 2019. Do the variable codes and labels remain consistent across ACS versions? For example, does DP_002E in 2014 and DP_002E in 2019 refer to the same information? (Note that you can attempt this task using any county's ACS datasets: we already know that, for example, the 2014 Hillsborough County and 2014 Miami-Dade County variables are consistent with each other.)

In [1]:
import pandas as pd

In [2]:
acs_data_dictionary_path = '../data/acs5_variable_dict_2014_2019.csv'
acs_data_dictionary_df = pd.read_csv(acs_data_dictionary_path)
acs_data_dictionary_df.head()

Unnamed: 0,variable_code,label,concept,predicateType,group,limit,predicateOnly,acs_year
0,DP02_0019EA,Annotation of Estimate!!RELATIONSHIP!!Populati...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
1,DP02_0126E,Estimate!!ANCESTRY!!Total population!!Danish,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014
2,DP02_0072EA,Annotation of Estimate!!DISABILITY STATUS OF T...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
3,DP02_0069PMA,Annotation of Percent Margin of Error!!VETERAN...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
4,DP02_0126M,Margin of Error!!ANCESTRY!!Total population!!D...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014


In [3]:
acs_2014 = acs_data_dictionary_df.loc[acs_data_dictionary_df['acs_year'] == 2014]
acs_2014.head()

Unnamed: 0,variable_code,label,concept,predicateType,group,limit,predicateOnly,acs_year
0,DP02_0019EA,Annotation of Estimate!!RELATIONSHIP!!Populati...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
1,DP02_0126E,Estimate!!ANCESTRY!!Total population!!Danish,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014
2,DP02_0072EA,Annotation of Estimate!!DISABILITY STATUS OF T...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
3,DP02_0069PMA,Annotation of Percent Margin of Error!!VETERAN...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
4,DP02_0126M,Margin of Error!!ANCESTRY!!Total population!!D...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014


In [4]:
acs_2019 = acs_data_dictionary_df.loc[acs_data_dictionary_df['acs_year'] == 2019]
acs_2019.head()

Unnamed: 0,variable_code,label,concept,predicateType,group,limit,predicateOnly,acs_year
4088,DP02_0019EA,Annotation of Estimate!!RELATIONSHIP!!Populati...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2019
4089,DP02_0126E,Estimate!!ANCESTRY!!Total population!!Czech,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2019
4090,DP02_0072EA,Annotation of Estimate!!DISABILITY STATUS OF T...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2019
4091,DP02_0069PMA,Annotation of Percent Margin of Error!!VETERAN...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2019
4092,DP02_0126M,Margin of Error!!ANCESTRY!!Total population!!C...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2019
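
Task 1 also asks whether a given code keeps the same label across years. The `head()` outputs above already hint that it does not always do so: `DP02_0126E` is labelled Danish in 2014 and Czech in 2019. Below is a minimal sketch of a systematic check, assuming the dictionary has the `variable_code`, `label`, and `acs_year` columns shown above. The toy frame stands in for the real file, and the `DP02_XXXXE` row is invented for illustration rather than taken from the ACS.

```python
import pandas as pd

# Toy stand-in for the data dictionary. The DP02_0126E labels are copied from the
# head() outputs above; the DP02_XXXXE row is hypothetical.
toy_dictionary_df = pd.DataFrame({
    'variable_code': ['DP02_0126E', 'DP02_XXXXE', 'DP02_0126E', 'DP02_XXXXE'],
    'label': [
        'Estimate!!ANCESTRY!!Total population!!Danish',
        'Estimate!!EXAMPLE!!Unchanged',
        'Estimate!!ANCESTRY!!Total population!!Czech',
        'Estimate!!EXAMPLE!!Unchanged',
    ],
    'acs_year': [2014, 2014, 2019, 2019],
})

toy_2014 = toy_dictionary_df.loc[toy_dictionary_df['acs_year'] == 2014]
toy_2019 = toy_dictionary_df.loc[toy_dictionary_df['acs_year'] == 2019]

# An inner merge keeps only codes present in both years; the suffixes keep the
# two years' labels in separate columns.
merged = toy_2014.merge(toy_2019, on='variable_code', suffixes=('_2014', '_2019'))

# Rows where the same code carries a different label in each year.
changed = merged.loc[merged['label_2014'] != merged['label_2019'],
                     ['variable_code', 'label_2014', 'label_2019']]
print(changed.to_string(index=False))
```

Swapping the toy frame for `acs_2014` and `acs_2019` should give the same kind of table for the real dictionary, provided each code appears only once per year.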


In [5]:
sheet_path = "../data/acs5_variable_dict_2014_2019.csv"
df = pd.read_csv(sheet_path, header=0)
df.head()

Unnamed: 0,variable_code,label,concept,predicateType,group,limit,predicateOnly,acs_year
0,DP02_0019EA,Annotation of Estimate!!RELATIONSHIP!!Populati...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
1,DP02_0126E,Estimate!!ANCESTRY!!Total population!!Danish,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014
2,DP02_0072EA,Annotation of Estimate!!DISABILITY STATUS OF T...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
3,DP02_0069PMA,Annotation of Percent Margin of Error!!VETERAN...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
4,DP02_0126M,Margin of Error!!ANCESTRY!!Total population!!D...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014


In [6]:
hillsborough_2014_df = pd.read_csv('../data/hillsborough_acs5-2014_census.csv')
hillsborough_2014_df.head()

Unnamed: 0,index,DP02_0001E,DP02_0001PE,DP02_0002E,DP02_0002PE,DP02_0003E,DP02_0003PE,DP02_0004E,DP02_0004PE,DP02_0005E,...,B25087_027E,B25087_028E,B25087_029E,B25088_001E,B25088_002E,B25088_003E,B25092_001E,B25092_002E,B25092_003E,GEOID
0,"Census Tract 111.03, Hillsborough County, Flor...",1268,1268,1012,79.8,374,29.5,826,65.1,302,...,86,9,161,1452,1944,555,17.9,22.3,11.5,12057011103
1,"Census Tract 114.08, Hillsborough County, Flor...",1078,1078,811,75.2,332,30.8,692,64.2,288,...,37,8,77,1512,1666,656,22.4,23.6,13.1,12057011408
2,"Census Tract 114.13, Hillsborough County, Flor...",2111,2111,1444,68.4,704,33.3,1012,47.9,465,...,37,12,71,1476,1776,455,23.3,26.6,14.8,12057011413
3,"Census Tract 116.08, Hillsborough County, Flor...",428,428,319,74.5,146,34.1,272,63.6,110,...,4,10,46,1603,1866,711,21.2,21.5,19.2,12057011608
4,"Census Tract 116.11, Hillsborough County, Flor...",1596,1596,990,62.0,438,27.4,609,38.2,241,...,37,46,17,1032,1187,437,22.7,24.3,16.8,12057011611


We know that the labels and codes for the 2014 Hillsborough County and 2014 Miami-Dade County variables are consistent with each other. Checking that the two files have the same columns confirms that our code so far is correct.

In [7]:
miami_dade_county_2014_df = pd.read_csv('../data/miami_dade_acs5-2014_census.csv')
miami_dade_county_2014_df.head()

Unnamed: 0,index,DP02_0001E,DP02_0001PE,DP02_0002E,DP02_0002PE,DP02_0003E,DP02_0003PE,DP02_0004E,DP02_0004PE,DP02_0005E,...,B25087_027E,B25087_028E,B25087_029E,B25088_001E,B25088_002E,B25088_003E,B25092_001E,B25092_002E,B25092_003E,GEOID
0,"Census Tract 1.07, Miami-Dade County, Florida:...",1255,1255,611,48.7,171,13.6,468,37.3,142,...,0,0,284,1624,2547,1001,24.2,24.5,22.3,12086000107
1,"Census Tract 1.18, Miami-Dade County, Florida:...",506,506,332,65.6,90,17.8,229,45.3,27,...,0,11,219,2185,3388,1001,46.5,50.0,26.3,12086000118
2,"Census Tract 1.19, Miami-Dade County, Florida:...",2206,2206,1192,54.0,498,22.6,894,40.5,249,...,45,73,552,1146,1672,828,26.3,38.7,16.0,12086000119
3,"Census Tract 1.21, Miami-Dade County, Florida:...",780,780,487,62.4,100,12.8,401,51.4,70,...,0,0,421,1906,4001,1001,36.0,25.8,41.5,12086000121
4,"Census Tract 1.20, Miami-Dade County, Florida:...",1855,1855,998,53.8,475,25.6,789,42.5,403,...,40,41,280,1696,1953,1001,28.2,33.8,13.1,12086000120


In [8]:
hillsborough_2014_columns = set(hillsborough_2014_df.columns)
hillsborough_2014_columns.discard('index')

miami_dade_county_2014_columns = set(miami_dade_county_2014_df.columns)
miami_dade_county_2014_columns.discard('index')

hillsborough_2014_columns == miami_dade_county_2014_columns

True

The column sets are equal, as expected. Now let's compare the column sets of every pair of county files.

In [9]:
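# NOTE: 'orange_acs5-2019_census' is listed twice; the second entry may have been
# intended as the 2014 Orange County file. It is left as-is to match the output below.
county_file_names = ['hillsborough_acs5-2014_census', 'miami_dade_acs5-2014_census', 
                     'hillsborough_acs5-2019_census', 'miami_dade_acs5-2019_census', 
                     'orange_acs5-2019_census', 'orange_acs5-2019_census']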

def load_county_file_columns(county_file_name):
    full_file_name = '../data/{}.csv'.format(county_file_name)
    county_df = pd.read_csv(full_file_name)
    county_columns = set(county_df.columns)
    county_columns.discard('index')
    
    return county_columns

# Load each file's columns once instead of re-reading both files for every pair.
columns_by_file = {name: load_county_file_columns(name) for name in county_file_names}

not_equal_list = []
for i in range(len(county_file_names)):
    county1 = county_file_names[i]
    for j in range(i + 1, len(county_file_names)):
        county2 = county_file_names[j]
        if columns_by_file[county1] != columns_by_file[county2]:
            not_equal_list.append((county1, county2))

print("Inconsistent county pairs:")            
for county1, county2 in not_equal_list:
    print('\t{} \t{}'.format(county1, county2))

Inconsistent county pairs:
	hillsborough_acs5-2014_census 	hillsborough_acs5-2019_census
	hillsborough_acs5-2014_census 	miami_dade_acs5-2019_census
	hillsborough_acs5-2014_census 	orange_acs5-2019_census
	hillsborough_acs5-2014_census 	orange_acs5-2019_census
	miami_dade_acs5-2014_census 	hillsborough_acs5-2019_census
	miami_dade_acs5-2014_census 	miami_dade_acs5-2019_census
	miami_dade_acs5-2014_census 	orange_acs5-2019_census
	miami_dade_acs5-2014_census 	orange_acs5-2019_census
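
The pairwise check only reports that two files differ, not how. A small sketch of a helper that lists the columns unique to each side follows; `describe_column_difference` and the toy column sets are hypothetical names introduced here, not part of the original notebook or the real files.

```python
def describe_column_difference(columns_a, columns_b):
    """Return (only_in_a, only_in_b) as sorted lists of column names."""
    return sorted(columns_a - columns_b), sorted(columns_b - columns_a)

# Toy column sets for illustration only; they do not reproduce the real files.
toy_2014_columns = {'GEOID', 'DP02_0001E', 'DP02_0002E'}
toy_2019_columns = {'GEOID', 'DP02_0001E', 'DP02_0002E', 'DP02_0153E'}

only_2014, only_2019 = describe_column_difference(toy_2014_columns, toy_2019_columns)
print('Only in 2014:', only_2014)
print('Only in 2019:', only_2019)
```

Passing the sets returned by `load_county_file_columns` for a 2014 file and a 2019 file would show which columns account for each inconsistent pair.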


## Tasks 2 and 3

Using the ACS data dictionary, compare the total number of variables reported in the 2014 5-year ACS to the 2019 5-year ACS. Are there any variables new to the 2019 survey that were not included in the 2014 survey?

Using the ACS data dictionary, compare the total number of variables reported in the 2014 5-year ACS to the 2019 5-year ACS. Are there any variables missing from the 2019 survey that were included in the 2014 survey?

In [10]:
acs_2014_variable_codes = set(acs_2014.variable_code.unique())
acs_2019_variable_codes = set(acs_2019.variable_code.unique())
acs_2014_variable_codes == acs_2019_variable_codes

False

No, the sets of variable codes differ between the two years. Let's look at the codes that appear in one year but not the other.

In [11]:
all_variable_codes = acs_2014_variable_codes.union(acs_2019_variable_codes)
all_variable_codes_list = list(all_variable_codes)
n = len(all_variable_codes_list)
print('variable_code\tIn 2014\tIn 2019')
for i in range(n):
    variable_code = all_variable_codes_list[i]
    in_2014 = variable_code in acs_2014_variable_codes
    in_2019 = variable_code in acs_2019_variable_codes
    if in_2014 and in_2019:
        continue
    print('{}\t{}\t{}'.format(variable_code, in_2014, in_2019))

variable_code	In 2014	In 2019
DP05_0086M	False	True
DP04_0143EA	False	True
DP05_0087PEA	False	True
DP05_0083EA	False	True
DP05_0087M	False	True
DP05_0085EA	False	True
DP02_0153EA	False	True
DP05_0085M	False	True
DP02_0153MA	False	True
DP05_0082M	False	True
DP05_0088E	False	True
DP04_0143PMA	False	True
DP05_0083PM	False	True
DP05_0089PE	False	True
DP04_0142PEA	False	True
DP05_0083M	False	True
DP05_0082MA	False	True
DP05_0083PMA	False	True
DP02_0153PMA	False	True
DP05_0084EA	False	True
DP04_0142PM	False	True
DP04_0142M	False	True
DP05_0084PE	False	True
DP05_0085PMA	False	True
DP05_0086MA	False	True
DP05_0089MA	False	True
DP05_0088PEA	False	True
DP05_0084MA	False	True
DP05_0085PE	False	True
DP05_0087E	False	True
DP05_0086E	False	True
DP05_0088M	False	True
DP05_0086EA	False	True
DP02_0153M	False	True
DP02_0153E	False	True
DP05_0087PMA	False	True
DP04_0142EA	False	True
DP04_0142E	False	True
DP05_0087MA	False	True
DP05_0084E	False	True
DP05_0082EA	False	True
DP05_0086PE	False	True
DP02_0153P

We can see that many variable codes were added in 2019. No variable code present in 2014 is missing from 2019.
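
Because the printed list is long and unordered, a summary by table group can be easier to read. Here is a minimal sketch, assuming that the part of a code before the underscore identifies its group (as the `group` column in the dictionary suggests). The toy sets reuse a few codes from the outputs above and are not the full dictionary.

```python
from collections import Counter

# Toy code sets built from a handful of codes that appear in the outputs above.
toy_2014_codes = {'DP02_0126E', 'DP02_0019EA'}
toy_2019_codes = toy_2014_codes | {'DP02_0153E', 'DP04_0142E', 'DP05_0086E', 'DP05_0087E'}

# Set differences give the codes unique to each year directly.
added_in_2019 = sorted(toy_2019_codes - toy_2014_codes)
removed_in_2019 = sorted(toy_2014_codes - toy_2019_codes)

# Count the added codes per group prefix (the text before the underscore).
added_by_group = Counter(code.split('_')[0] for code in added_in_2019)

print('Added in 2019:', len(added_in_2019))
print('Removed in 2019:', len(removed_in_2019))
for group, count in sorted(added_by_group.items()):
    print('\t{}\t{}'.format(group, count))
```

With the real `acs_2014_variable_codes` and `acs_2019_variable_codes` sets, an empty `removed_in_2019` would confirm the statement above that nothing was dropped.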

## Task 4

Using the results of tasks 1-3, create a list of variables common to both survey versions (and make sure that the variable codes consistently refer to the same labels) and filter both the 2014 and 2019 ACS datasets to include only those values.

In [14]:
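# TODO Write this out to a file.  Too much information.

variable_codes_in_both_years = acs_2014_variable_codes.intersection(acs_2019_variable_codes)
both_years_list = list(variable_codes_in_both_years)
consistent_codes = []    # label and concept match across years
inconsistent_codes = []  # label or concept differs across years
for variable_code in both_years_list:
    val_in_2014 = acs_2014.loc[acs_2014['variable_code'] == variable_code].iloc[0]
    val_in_2019 = acs_2019.loc[acs_2019['variable_code'] == variable_code].iloc[0]
    # TODO Finish for all columns.  Then filter data sets.
    # Note: a value that is missing (NaN) in both years compares as unequal here.
    label_equal = val_in_2014['label'] == val_in_2019['label']
    concept_equal = val_in_2014['concept'] == val_in_2019['concept']
    if label_equal and concept_equal:
        consistent_codes.append(variable_code)
    else:
        inconsistent_codes.append(variable_code)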
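
The filtering step that Task 4 describes could be sketched as follows. `filter_to_common_columns` is a hypothetical helper, and the toy frame is illustrative: the `DP02_0001E` values and GEOIDs are copied from the Hillsborough output above, while the `DP02_0153E` values and tract names are made up. The sketch assumes the identifier columns `index` and `GEOID` should always be kept.

```python
import pandas as pd

def filter_to_common_columns(county_df, common_codes, keep=('index', 'GEOID')):
    """Keep identifier columns plus the variable columns listed in common_codes,
    preserving the original column order."""
    wanted = [col for col in county_df.columns if col in common_codes or col in keep]
    return county_df[wanted]

# Toy county frame for illustration only.
toy_county_df = pd.DataFrame({
    'index': ['Tract A', 'Tract B'],
    'DP02_0001E': [1268, 1078],
    'DP02_0153E': [10, 20],  # stands in for a 2019-only variable
    'GEOID': [12057011103, 12057011408],
})
toy_common_codes = {'DP02_0001E'}

filtered = filter_to_common_columns(toy_county_df, toy_common_codes)
print(list(filtered.columns))
```

Applying the same call to each county DataFrame, with the set of codes that are common to both years and consistently labelled, would leave the 2014 and 2019 files with matching columns.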