#### 4.2 Create the state dimension table
- Create the state dimension table, starting from the health data
- Calculate area and population density from the county dimension table, grouped by state
- No partitioning needed: there are only 51 rows anyway (the 50 states plus the District of Columbia)

##### Setup
I'll need Spark for this, since I want to use some of its functionality, such as creating temporary SQL views of my dataframes.

In [1]:
from setup import create_spark_session

spark = create_spark_session()

Imports and output paths:

In [2]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *

from clean import *
from etl import *

# Write locally for now; later on this could go to S3 instead
output_path = "output/"

Let's start with the health data, since it contains all the relevant information about states. How can we tell that a row is a state and not a county? State FIPS codes end in "000" and county ones don't. The one exception is the US as a whole, which has a FIPS code of "00000", so we need to exclude it. Because `fips` is loaded as an integer, the test becomes "a multiple of 1000, but not 0".
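The rule can be sketched in plain Python to make the edge cases explicit. This is only an illustration, not part of the pipeline: `is_state_fips` is a hypothetical helper, and it assumes the codes have already been parsed as integers, so that "01000" becomes 1000.

```python
def is_state_fips(fips: int) -> bool:
    """Return True for state-level FIPS codes, assuming integer codes.

    State codes are multiples of 1000 (e.g. 1000 for "01000");
    0 is the code for the US as a whole and is excluded.
    """
    return fips != 0 and fips % 1000 == 0


codes = [0, 1000, 1001, 48000, 48001]
state_codes = [c for c in codes if is_state_fips(c)]
print(state_codes)  # [1000, 48000]
```

The Spark filter used further down expresses the same two conditions on the `fips` column.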

In [3]:
health_df = load_health_data(spark)

Started loading county health data
Finished loading county health data


In [4]:
health_df.printSchema()

root
 |-- fips: integer (nullable = true)
 |-- county_name: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- poor_health: double (nullable = true)
 |-- smokers: double (nullable = true)
 |-- obesity: double (nullable = true)
 |-- physical_inactivity: double (nullable = true)
 |-- excessive_drinking: double (nullable = true)
 |-- uninsured: double (nullable = true)
 |-- physicians: double (nullable = true)
 |-- unemployment: double (nullable = true)
 |-- air_pollution: double (nullable = true)
 |-- housing_problems: double (nullable = true)
 |-- household_overcrowding: double (nullable = true)
 |-- food_insecurity: double (nullable = true)
 |-- residential_segregation: double (nullable = true)
 |-- over_sixtyfives: double (nullable = true)
 |-- rural: double (nullable = true)



In [5]:
health_df = health_df.withColumnRenamed('state', 'abbreviation')

In [6]:
# Keep state rows only: fips is a multiple of 1000, excluding 0 (the US as a whole)
state_dim_df = health_df.where((health_df['fips'] != 0) & (health_df['fips'] % 1000 == 0)).drop('fips').withColumnRenamed('county_name', 'state')

In [7]:
state_dim_df.limit(5).show()

+----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+
|     state|abbreviation|population| poor_health|     smokers|obesity|physical_inactivity|excessive_drinking|   uninsured| physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|       rural|
+----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+
|   Alabama|          AL|   4887871|0.2202870285|0.2092735311|  0.355|              0.298|      0.1390351529|0.1104478259|6.482388E-4|0.0393356691|         11.0|    0.1434070208|         

In [8]:
state_dim_df.count()

51

Looking good! A count of 51 is what we'd expect: the 50 states plus the District of Columbia. We do see some null values in the data above, but only in the columns where we'd expect them.

The next part is trickier: I want population density data for states as well. The problem is that we don't have state information in the area data, so we'll have to get it from the county dimension table instead.
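One thing worth spelling out: a state's density should come from total population divided by total area, not from averaging the county densities, because counties differ wildly in size. Here is a minimal sketch with made-up numbers (the two counties are hypothetical, not taken from the data). In the notebook the state population comes from the state rows of the health data; here it is summed from the counties only to keep the sketch self-contained.

```python
# Two hypothetical counties in the same state (made-up numbers).
counties = [
    {"state": "X", "population": 900_000, "area": 100.0},    # small and dense
    {"state": "X", "population": 100_000, "area": 9_900.0},  # large and sparse
]

total_population = sum(c["population"] for c in counties)
total_area = sum(c["area"] for c in counties)

# What we want: total population over total area.
state_density = total_population / total_area

# Misleading alternative: unweighted mean of the county densities.
mean_county_density = sum(
    c["population"] / c["area"] for c in counties
) / len(counties)

print(state_density)        # 100.0
print(mean_county_density)  # about 4505, dominated by the small dense county
```

This is why only the county areas are summed per state, with the density computed afterwards from the state-level totals.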

In [9]:
county_dim_df = load_county_dimension_table(spark)

Started loading county dimension table
Finished loading county dimension table


In [10]:
county_dim_df.limit(5).show()

+-----+-----------+-----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+--------+------------------+-----+
| fips|county_name|   latitude|   longitude|population| poor_health|     smokers|obesity|physical_inactivity|excessive_drinking|   uninsured| physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|       rural|    area|population_density|state|
+-----+-----------+-----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+--------+------------------+-----+
|48001|   Anderson|31.815347

In [11]:
state_area_df = county_dim_df.groupBy('state').agg(F.sum('area').alias('area'))
state_area_df.show()

+--------------------+------------------+
|               state|              area|
+--------------------+------------------+
|                Utah| 82169.62100000001|
|              Hawaii|          6422.628|
|           Minnesota|         79626.745|
|                Ohio| 40860.69500000001|
|            Arkansas| 52035.47799999999|
|              Oregon|         95988.012|
|               Texas|261231.70899999997|
|        North Dakota| 69000.79599999999|
|        Pennsylvania|         44742.702|
|         Connecticut|          4842.356|
|            Nebraska| 76824.17799999997|
|             Vermont| 9216.655999999999|
|              Nevada|109781.17999999998|
|          Washington|          66455.52|
|            Illinois|55518.925999999985|
|            Oklahoma|          68594.92|
|District of Columbia|            61.048|
|            Delaware|1948.5439999999999|
|              Alaska| 553559.5180000002|
|          New Mexico|121298.14800000002|
+--------------------+------------

In [12]:
state_dim_df = state_dim_df.join(state_area_df, on=["state"], how="inner").select(state_dim_df["*"], state_area_df["area"])
state_dim_df.limit(5).show()

+----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+------------------+
|     state|abbreviation|population| poor_health|     smokers|obesity|physical_inactivity|excessive_drinking|   uninsured| physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|       rural|              area|
+----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+------------------+
|   Alabama|          AL|   4887871|0.2202870285|0.2092735311|  0.355|              0.298|      0.1390351529|0.1104478259|6.482388

In [13]:
state_dim_df.count()

51

In [14]:
state_dim_df.printSchema()

root
 |-- state: string (nullable = true)
 |-- abbreviation: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- poor_health: double (nullable = true)
 |-- smokers: double (nullable = true)
 |-- obesity: double (nullable = true)
 |-- physical_inactivity: double (nullable = true)
 |-- excessive_drinking: double (nullable = true)
 |-- uninsured: double (nullable = true)
 |-- physicians: double (nullable = true)
 |-- unemployment: double (nullable = true)
 |-- air_pollution: double (nullable = true)
 |-- housing_problems: double (nullable = true)
 |-- household_overcrowding: double (nullable = true)
 |-- food_insecurity: double (nullable = true)
 |-- residential_segregation: double (nullable = true)
 |-- over_sixtyfives: double (nullable = true)
 |-- rural: double (nullable = true)
 |-- area: double (nullable = true)



In [15]:
state_dim_df = state_dim_df.withColumn('population_density', state_dim_df['population'] / state_dim_df['area'])
state_dim_df.select('state', 'population', 'area', 'population_density').limit(5).show()

+---------+----------+-----------------+------------------+
|    state|population|             area|population_density|
+---------+----------+-----------------+------------------+
|     Utah|   3161105|82169.62100000001|  38.4704829051116|
|   Hawaii|   1420491|         6422.628|221.16974546867732|
|Minnesota|   5611179|        79626.745| 70.46852160037434|
|     Ohio|  11689442|40860.69500000001| 286.0803517903941|
| Arkansas|   3013825|52035.47799999999|57.918656959392216|
+---------+----------+-----------------+------------------+



In [16]:
state_dim_df.agg({'population_density': 'min'}).show()

+-----------------------+
|min(population_density)|
+-----------------------+
|     1.3321747274156701|
+-----------------------+



In [17]:
state_dim_df.agg({'population_density': 'max'}).show()

+-----------------------+
|max(population_density)|
+-----------------------+
|      11506.60136286201|
+-----------------------+



A quick data quality check: the minimum and maximum values look reasonable (not negative, not infinite).
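If these checks should run unattended rather than by eye, they could be turned into assertions. The sketch below works on plain Python rows (for example, collected rows converted to dicts). `check_densities` and the default expected row count of 51 are assumptions for illustration, not part of the project code. It also rejects missing values explicitly, since `min` and `max` aggregates skip nulls and would not reveal them.

```python
import math


def check_densities(rows, expected_count=51):
    """Raise ValueError if the row count or any density looks wrong."""
    if len(rows) != expected_count:
        raise ValueError(f"expected {expected_count} rows, got {len(rows)}")
    for row in rows:
        density = row["population_density"]
        # None is tested first so math.isfinite never receives it.
        if density is None or not math.isfinite(density) or density <= 0:
            raise ValueError(f"bad density for {row['state']}: {density}")


# Two rows copied from the output above; expected_count is lowered to match.
sample = [
    {"state": "Utah", "population_density": 38.4704829051116},
    {"state": "Hawaii", "population_density": 221.16974546867732},
]
check_densities(sample, expected_count=2)  # passes silently
```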

Now we're ready to write out the dimension table to parquet.

In [19]:
state_dim_df.write.mode('overwrite').parquet(output_path + "state_dim.parquet")