Cleaning states is, surprisingly, not trivial.

The reason is that different data sources refer to states in different ways, so standardizing states across all tables is an *entity resolution* problem. Fortunately, there are only a limited number of states.

Solving it matters because we need states to join the following tables:
- immigration ports and airports
- airports with demographics
- immigration address and temperatures

Issues:
- immigration ports include not only US states but also US territories; these have state-like abbreviations, so they require no special handling.
- in the i94 data dictionary:
    - some state names are abbreviated and appear in all uppercase
    - port state abbreviations appear as substrings in some (but not all) port names, and the port names themselves have no uniform format
- in airports, there is no explicit state column; instead, the state abbreviation is either a suffix (for US states) or a prefix (for US territories) of the iso_region column
- in demographics, there is an explicit and clean State Code column (but some states are missing demographic data)
- in temperatures, state names are used instead of state abbreviations, and temperatures for US territories are missing.
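The suffix/prefix rule for iso_region can be sketched in plain Python. This is only an illustration, not the notebook's airport cleaning code: the value shapes (`US-CA` for a state, `PR-U-A` for a territory) and the helper name `state_from_iso_region` are assumptions, and the territory codes are passed in so that unrelated regions such as `CA-ON` are not mistaken for states.

```python
from typing import Optional, Set


def state_from_iso_region(iso_region: str, territory_ids: Set[str]) -> Optional[str]:
    """Return the state or territory abbreviation found in iso_region, else None.

    Assumed shapes (not verified against the airports data):
      - US states:      "US-CA"  -> abbreviation is the suffix
      - US territories: "PR-U-A" -> abbreviation is the prefix
    """
    parts = iso_region.strip().upper().split('-')
    if len(parts) == 2 and parts[0] == 'US':
        # US state: take the part after the country code
        return parts[1]
    if parts[0] in territory_ids:
        # US territory: the territory code comes first
        return parts[0]
    return None


territories = {'PR', 'GU'}  # hypothetical subset of territory codes
print(state_from_iso_region('US-CA', territories))
print(state_from_iso_region('PR-U-A', territories))
print(state_from_iso_region('CA-ON', territories))
```

In a Spark pipeline the same rule would more likely be written as a column expression and validated against the clean state table.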

Goal:
- create a clean, standard states table that includes both state abbreviations and state names, with an additional column indicating whether each entry is a regular US state or a territory.
    - when cleaning any other table that must have a state column, use this state table as the standard.
    - for this reason, the state table must be cleaned before any other clean table that refers to states.
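As a rough illustration of using the state table as the standard, the sketch below resolves either a state name or an abbreviation to a single state_id. It is plain Python with a few hand-written sample rows; the rows and the helper names `build_lookup` and `resolve_state` are assumptions for illustration, not part of the notebook.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical sample rows shaped like the clean table: (state_id, name, type)
SAMPLE_STATES = [
    ('AL', 'Alabama', 'State'),
    ('DC', 'District of Columbia', 'Federal District'),
    ('PR', 'Puerto Rico', 'Territory'),
]


def build_lookup(states: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map both the uppercased abbreviation and the uppercased name to state_id."""
    lookup = {}
    for state_id, name, _type in states:
        lookup[state_id.upper()] = state_id
        lookup[name.upper()] = state_id
    return lookup


def resolve_state(raw: Optional[str], lookup: Dict[str, str]) -> Optional[str]:
    """Normalize whitespace and case, then look the value up; None if unknown."""
    if raw is None:
        return None
    key = ' '.join(raw.split()).upper()
    return lookup.get(key)


lookup = build_lookup(SAMPLE_STATES)
print(resolve_state(' alabama ', lookup))
print(resolve_state('PR', lookup))
print(resolve_state('Atlantis', lookup))
```

Values that resolve to None are the ones that would need a manual mapping or would be dropped before a join.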

In [1]:
import findspark
findspark.init()

import pyspark.sql.types as T
import pyspark.sql.functions as F

import pandas as pd
pd.set_option('display.max_rows', 100)

from etl import SparkETL

In [2]:
etl = SparkETL()
spark = etl.spark

22/05/05 14:10:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/05 14:10:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
states_schema = (
    T.StructType([
        T.StructField('Type', T.StringType(), True),
        T.StructField('Name', T.StringType(), True),
        T.StructField('Abbreviation', T.StringType(), True),
        T.StructField('Capital', T.StringType(), True),
        T.StructField('Population (2015)', T.StringType(), True),
        T.StructField('Population (2019)', T.StringType(), True),
        T.StructField('area (square miles)', T.StringType(), True),
    ])
)

In [4]:
states_staging = (
    spark
    .read
    .format('csv')
    .schema(states_schema)
    .option('encoding', 'ISO-8859-1')
    .option('header', 'true')
    .load(etl.data_sources['states'])
)

In [5]:
def filter_missing_state_id(df):
    """
    Description: filter out those "states" that have no abbreviation
    """
    return (
        df
        .filter(F.col('Abbreviation').isNotNull())
    )

In [6]:
def clean_state_id(df):
    """
    Description: use the trimmed abbreviation as the state identifier
    """
    return df.withColumn('state_id', F.trim('Abbreviation'))

In [7]:
def clean_name(df):
    """
    Description: some state names have a "[E]" suffix; remove it
    """
    return df.withColumn('name', F.expr("""
        IF(
            SUBSTR(TRIM(Name), -1) = ']',
            TRIM(SUBSTRING_INDEX(TRIM(Name), '[', 1)),
            TRIM(Name)
        )
    """))


In [8]:
def clean_type_id(df):
    return df.withColumn('type_id', F.expr("""
        CASE TRIM(Type)
            WHEN 'State' THEN 0
            WHEN 'Federal District' THEN 1
            WHEN 'Territory' THEN 2
        END
    """))

In [9]:
def clean_type(df):
    return df.withColumn('type', F.trim('Type'))

In [10]:
def clean_schema(df):
    """
    Description: select only the required columns
    """
    return df.select('state_id', 'name', 'type_id', 'type')

In [11]:
def clean_states(df):
    """
    Description: apply all state cleaning steps in order
    """
    return (
        df
        .pipe(filter_missing_state_id)
        .pipe(clean_state_id)
        .pipe(clean_name)
        .pipe(clean_type_id)
        .pipe(clean_type)
        .pipe(clean_schema)
    )
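A note on `.pipe`: it is not a built-in method of `pyspark.sql.DataFrame` (the native equivalent is `DataFrame.transform`), so it is presumably attached elsewhere, for example in the `etl` module. The sketch below shows one way such a method could be attached to a class. The names `add_pipe` and `Rows` are hypothetical, and `Rows` is only a stand-in so the example runs without Spark.

```python
def pipe(self, func, *args, **kwargs):
    """Call func(self, *args, **kwargs) and return its result, enabling chaining."""
    return func(self, *args, **kwargs)


def add_pipe(cls):
    """Attach the pipe method to cls (hypothetical helper, not from the etl module)."""
    cls.pipe = pipe
    return cls


class Rows(list):
    """Stand-in for a DataFrame: a list of values."""


add_pipe(Rows)


def drop_none(rows):
    # Analogous to filtering out rows with a missing abbreviation
    return Rows(r for r in rows if r is not None)


def to_upper(rows):
    return Rows(r.upper() for r in rows)


result = Rows(['al', None, 'ak']).pipe(drop_none).pipe(to_upper)
print(result)
```

With PySpark installed, the same helper could in principle be applied to `pyspark.sql.DataFrame`, or each `.pipe(f)` call could be replaced by `.transform(f)`.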

In [12]:
etl.save_clean_table(
    states_staging.pipe(clean_states),
    'state'
)

In [13]:
etl.read_clean_table('state').toPandas()

Unnamed: 0,state_id,name,type_id,type
0,AL,Alabama,0,State
1,AK,Alaska,0,State
2,AZ,Arizona,0,State
3,AR,Arkansas,0,State
4,CA,California,0,State
5,CO,Colorado,0,State
6,CT,Connecticut,0,State
7,DE,Delaware,0,State
8,FL,Florida,0,State
9,GA,Georgia,0,State
