# Census Data Processing

Virginia census data by race and county was downloaded from [here](https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-detail.html), with the data dictionary available [here](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/cc-est2019-alldata.pdf). We used the dataset titled "Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin: April 1, 2010 to July 1, 2019 (CC-EST2019-ALLDATA)."

In [1]:
import pandas as pd
import numpy as np

In [2]:
census = pd.read_csv('cc-est2019-alldata-51.csv')

According to the data dictionary, `YEAR` code 12 is the July 1, 2019 population estimate, code 11 is the July 1, 2018 estimate, and so on back to code 3, the July 1, 2010 estimate. Codes 1 and 2 are both dated April 1, 2010 (the 2010 census population and the estimates base). For consistency, we use the July estimates for each year.

In [3]:
census19 = census[census['YEAR']==12]

In [4]:
census19['TOT_POP'].sum()

17071038

After filtering to a single year, the total population is about double what we expected (~8.5 million). The data dictionary explains why: `AGEGRP` 0 holds the total population across all ages, while age groups 1 to 18 break that same population out into 5-year age intervals, so summing every row counts each person twice. For now we are concerned with county and race, not age, so we filter to the total (`AGEGRP` 0) rows only.

In [5]:
census19 = census19[census19['AGEGRP']==0]

In [6]:
census19['TOT_POP'].sum()

8535519

That aligns with the expected total population of ~8.5 million.
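The double counting can be reproduced on a toy frame. The sketch below uses made-up numbers and only two age brackets instead of 18; the column names match the census file, but `toy` and its values are purely illustrative.

```python
import pandas as pd

# Toy data (made-up values): one county, one year.
# AGEGRP 0 = all ages; AGEGRP 1 and 2 = brackets that add up to the total.
toy = pd.DataFrame({
    "CTYNAME": ["Example County"] * 3,
    "YEAR": [12, 12, 12],
    "AGEGRP": [0, 1, 2],
    "TOT_POP": [100, 60, 40],
})

# Summing every row counts each person twice.
naive_total = toy["TOT_POP"].sum()

# Keeping only the AGEGRP 0 rows gives the real total.
true_total = toy.loc[toy["AGEGRP"] == 0, "TOT_POP"].sum()

# The age brackets on their own add up to the same total.
bracket_total = toy.loc[toy["AGEGRP"] > 0, "TOT_POP"].sum()

print(naive_total, true_total, bracket_total)
```

The same comparison of `AGEGRP == 0` against `AGEGRP > 0` could be run per county on the real data as a sanity check.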

We actually want the data for all years, so we apply the same filters to the full dataset. To make joining easier, we also map year codes 3-12 to the calendar years 2010-2019.

In [7]:
census_clean = census[(census['AGEGRP']==0) & (census['YEAR'] >= 3)].copy()

In [8]:
census_clean['YEAR'].unique()

array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [9]:
conditions = [census_clean['YEAR'] == 3, census_clean['YEAR'] == 4, census_clean['YEAR'] == 5, 
              census_clean['YEAR'] == 6, census_clean['YEAR'] == 7, census_clean['YEAR'] == 8, 
              census_clean['YEAR'] == 9, census_clean['YEAR'] == 10, census_clean['YEAR'] == 11, 
              census_clean['YEAR'] == 12]

choices = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

census_clean['YEAR'] = np.select(conditions, choices)
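Because the year codes kept here run consecutively from 3 to 12, the same recoding could also be written as a constant offset or a dictionary lookup. The following is a sketch of those alternatives on a stand-in Series (`codes` and `year_lookup` are illustrative names, not part of the notebook), checked against the `np.select` approach used above.

```python
import numpy as np
import pandas as pd

# Stand-in for census_clean['YEAR'] after filtering to codes 3..12.
codes = pd.Series([3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

# Option 1: the codes are consecutive, so code 3 -> 2010 is an offset of 2007.
years_offset = codes + 2007

# Option 2: explicit lookup table. An unmapped code would become NaN here,
# whereas np.select silently falls back to its default of 0.
year_lookup = {code: code + 2007 for code in range(3, 13)}
years_mapped = codes.map(year_lookup)

# Reference: the np.select approach, built with comprehensions.
conditions = [codes == c for c in range(3, 13)]
choices = list(range(2010, 2020))
years_select = pd.Series(np.select(conditions, choices))

print(years_offset.equals(years_mapped), years_offset.equals(years_select))
```

The offset is the shortest form, but the lookup table makes the code-to-year correspondence explicit and surfaces unexpected codes as missing values.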

In [10]:
census_clean = census_clean.drop(['SUMLEV', 'AGEGRP'], axis = 1)

In [11]:
census_clean.head()

| | STATE | COUNTY | STNAME | CTYNAME | YEAR | TOT_POP | TOT_MALE | TOT_FEMALE | WA_MALE | WA_FEMALE | ... | HWAC_MALE | HWAC_FEMALE | HBAC_MALE | HBAC_FEMALE | HIAC_MALE | HIAC_FEMALE | HAAC_MALE | HAAC_FEMALE | HNAC_MALE | HNAC_FEMALE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 51 | 1 | Virginia | Accomack County | 2010 | 33148 | 16152 | 16996 | 11341 | 11496 | ... | 1433 | 1161 | 63 | 56 | 62 | 56 | 7 | 6 | 8 | 9 |
| 57 | 51 | 1 | Virginia | Accomack County | 2011 | 33225 | 16225 | 17000 | 11348 | 11478 | ... | 1503 | 1182 | 73 | 63 | 71 | 60 | 8 | 10 | 8 | 10 |
| 76 | 51 | 1 | Virginia | Accomack County | 2012 | 33268 | 16275 | 16993 | 11321 | 11463 | ... | 1485 | 1157 | 81 | 79 | 69 | 63 | 9 | 11 | 8 | 9 |
| 95 | 51 | 1 | Virginia | Accomack County | 2013 | 32969 | 16092 | 16877 | 11171 | 11390 | ... | 1444 | 1153 | 79 | 87 | 74 | 66 | 8 | 8 | 10 | 9 |
| 114 | 51 | 1 | Virginia | Accomack County | 2014 | 32971 | 16076 | 16895 | 11114 | 11409 | ... | 1399 | 1153 | 97 | 94 | 72 | 61 | 9 | 9 | 7 | 10 |


In [12]:
census_clean.to_csv('va_census_clean.csv')
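Note that `to_csv` writes the DataFrame index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is re-read. If the original row numbers are not needed downstream, passing `index=False` is one option. The sketch below illustrates the difference with an in-memory buffer and a three-column excerpt of the first two rows shown above; `df` and the buffer names are illustrative.

```python
import io
import pandas as pd

# Two rows with a non-default index, mimicking census_clean after filtering.
df = pd.DataFrame(
    {
        "CTYNAME": ["Accomack County", "Accomack County"],
        "YEAR": [2010, 2011],
        "TOT_POP": [33148, 33225],
    },
    index=[38, 57],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index depends on whether anything downstream relies on the original row numbers from the raw file.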