# permits-data / Load Data

ETL pipeline for construction permits data in Los Angeles, California, USA.

For more information:
https://data.lacity.org/A-Prosperous-City/Building-and-Safety-Permit-Information/yv23-pmwf

In [10]:
import os
import sys
from dotenv import load_dotenv, find_dotenv
import numpy as np
import pandas as pd
import psycopg2

In [11]:
# Set notebook display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [12]:
# Get project root directory
root_dir = os.path.dirname(os.getcwd())

# Set path for modules
sys.path[0] = '../'

# Set environment variables
load_dotenv(find_dotenv());
POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
POSTGRES_DB = os.getenv("POSTGRES_DB")
DB_PORT = os.getenv("DB_PORT")
DB_HOST = os.getenv("DB_HOST")
DATA_URL = os.getenv("DATA_URL")

# Environment variables specific to notebook
DATA_DIR = os.path.dirname(root_dir) + '/data'
DB_TABLE = "permits_raw"

## 1. Import Data

In [13]:
# Connect to PostgreSQL, useful only for notebook
def connect_db():
    try:
        con = psycopg2.connect(dbname=POSTGRES_DB,
                               user=POSTGRES_USER,
                               password=POSTGRES_PASSWORD,
                                host=DB_HOST, 
                                port=DB_PORT,
                              connect_timeout=3)
        print('Connected as user "{}" to database "{}" on {}:{}\n'.format(POSTGRES_USER,POSTGRES_DB,
                                                           DB_HOST,DB_PORT))
              
    except Exception as e:
        print('Error:\n', e)
    
    return con

In [14]:
conn = connect_db()

Connected as user "postgres" to database "permits" on localhost:5432



### 1.1 Update Table Columns in PostgreSQL Database

In [15]:
# Get raw data column names
def get_table_names(table, con):
    sql = "SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = N'{}'".format(table)
    etl = pd.read_sql_query(sql, con)
    columns = etl['column_name']
    
    return columns

In [16]:
# Check table names
get_table_names("permits_raw", conn).head(10)

0                 assessor_book
1                 assessor_page
2               assessor_parcel
3                         tract
4                         block
5                           lot
6    reference_no_old_permit_no
7                pcis_permit_no
8                        status
9                   status_date
Name: column_name, dtype: object

In [17]:
# Retrieve table column names
old_columns = get_table_names("permits_raw", conn)

In [18]:
# Rename columns, will update table later
def format_names(series):
    
    replace_map = {' ': '_', '-': '_', '#': 'No', '/': '_', 
                   '.': '', '(': '', ')': '', "'": ''}

    def replace_chars(text):
        for oldchar, newchar in replace_map.items():
            text = text.replace(oldchar, newchar).lower()
        return text

    return series.apply(replace_chars)

In [19]:
# Transform table column names for permits_raw
new_columns = format_names(old_columns)

In [20]:
new_columns.head()

0      assessor_book
1      assessor_page
2    assessor_parcel
3              tract
4              block
Name: column_name, dtype: object

In [21]:
# Creates a SQL query to update table columns and writes to text file
### add path string
def create_query(old_columns, new_columns, db_table, con, run=False):
    
    sql = 'ALTER TABLE {} '.format(db_table) + 'RENAME "{old_name}" to {new_name};'
    
    
    sql_query = []

    for idx, name in old_columns.iteritems():
        sql_query.append(sql.format(old_name=name, new_name=new_columns[idx]))
        
    update_names = '\n'.join(sql_query)
    
    # replace with path
    with open('../postgres/sql/update_names.sql', 'w') as text:
        text.write(update_names)
        
    # Update db is desired
    if run:
        # try in case connection not open
        cur = con.cursor()
        # replace with path
        sql_file = open('../postgres/sql/update_names.sql', 'r')
        cur.execute(sql_file.read())
        con.commit()
        #conn.close()

In [22]:
# Create SQL query for permits_raw
try:
    create_query(old_columns, new_columns, run=True, con=conn, db_table=DB_TABLE)
except Exception as e: 
    conn.rollback()
    print('Error:\n', e)

Error:
 column "assessor_book" of relation "permits_raw" already exists



In [23]:
# Check table names are updated
get_table_names("permits_raw", conn).head()

0      assessor_book
1      assessor_page
2    assessor_parcel
3              tract
4              block
Name: column_name, dtype: object

In [24]:
# TEST: 
assert (get_table_names("permits_raw", 
                        conn) == new_columns).mean() == 1, "Database table names do not match new table names"

In [25]:
# Extract full dataset
sql_all = 'SELECT * FROM {} LIMIT 1500;'.format(DB_TABLE)

# Extract full dataset
data = pd.read_sql_query(sql_all, conn)
data.head()

Unnamed: 0,assessor_book,assessor_page,assessor_parcel,tract,block,lot,reference_no_old_permit_no,pcis_permit_no,status,status_date,permit_type,permit_sub_type,permit_category,project_number,event_code,initiating_office,issue_date,address_start,address_fraction_start,address_end,address_fraction_end,street_direction,street_name,street_suffix,suffix_direction,unit_range_start,unit_range_end,zip_code,work_description,valuation,floor_area_la_zoning_code_definition,no_of_residential_dwelling_units,no_of_accessory_dwelling_units,no_of_stories,contractors_business_name,contractor_address,contractor_city,contractor_state,license_type,license_no,principal_first_name,principal_middle_name,principal_last_name,license_expiration_date,applicant_first_name,applicant_last_name,applicant_business_name,applicant_address_1,applicant_address_2,applicant_address_3,zone,occupancy,floor_area_la_building_code_definition,census_tract,council_district,latitude_longitude,applicant_relationship,existing_code,proposed_code
0,4317,3,***,TR 30210-C,,LT 1,,15044-90000-08405,Permit Finaled,09/10/2015,HVAC,1 or 2 Family Dwelling,No Plan Check,,,INTERNET,2015-08-18,1823,1/2,1823,1/2,S,THAYER,AVE,,,,90025,,,,,,,CONDITIONED AIRE MECHANICAL & ENGINEERING INC,18650 PARTHENIA STREET,NORTHRIDGE,CA,C20,532440,BRETT,MOORE,HOFFER,2016-06-30,BRETT,HOFFER,,18650 PARTHENIA ST,,"NORTHRIDGE, CA",R3-1-O,,0.0,2671.0,5,"(34.05474, -118.42628)",Net Applicant,,
1,5005,10,017,CHESTERFIELD SQUARE,,465,16SL57806,16016-70000-02464,Permit Finaled,08/01/2017,Bldg-Alter/Repair,1 or 2 Family Dwelling,No Plan Check,,,SOUTH LA,2016-02-04,2122,,2122,,W,54TH,ST,,,,90062,General rehabilitation for single family dwell...,40000.0,,,,,OWNER-BUILDER,,,,,0,JAVIER,,TALAMANTES,,JAVIER,TALAMANTES,OWNER-BUILDER,,,,C2-1VL,,,2325.0,8,"(33.99307, -118.31668)",Owner-Bldr,1.0,
2,5154,23,022,SUN-SET TRACT,D,13,14VN81535,14016-20000-13092,Issued,08/13/2014,Bldg-Alter/Repair,Apartment,Plan Check,,,VAN NUYS,2014-08-13,415,,415,,S,BURLINGTON,AVE,,1-30,1-30,90057,PHOTOVOLTAIC SOLAR PANELS ON ROOF OF (E) APT BLDG,37000.0,,,,,PERMACITY CONSTRUCTION CORP,5570 W WASHINGTON BLVD,LOS ANGELES,CA,B,827864,JONATHAN,SAUL,PORT,2015-11-30,LINDA,MARTON,,710 WILSHIRE BLVD,,"SANTA MONICA, CA",R4-1,,,2089.04,1,"(34.06012, -118.26997)",Agent for Owner,5.0,
3,4404,30,010,TR 12086,,2,,16044-30000-09658,Permit Finaled,08/29/2016,HVAC,1 or 2 Family Dwelling,No Plan Check,,,WEST LA,2016-08-22,315,,315,,S,OCEANO,DR,,,,90049,,,,,,,E/C HEATING AND AIR CONDITION,26888 CUATRO MILPAS ST,VALENCIA,CA,C20,651051,EDY,RUDOLFO,CORDON,2018-07-31,,,,,,,RS-1,,0.0,2640.0,11,"(34.05707, -118.4732)",Contractor,,
4,2646,19,011,TR 7158,,11,,17042-90000-31792,Permit Finaled,12/28/2017,Plumbing,1 or 2 Family Dwelling,No Plan Check,,,INTERNET,2017-12-26,13640,,13640,,W,PIERCE,ST,,,,91331,,,,,,,TITANIUM POWER INC,1545 S LA CIENEGA BLVD,LOS ANGELES,CA,B,989217,DENNIS,HARUO,MIYAHIRA,2017-12-31,YONI,GHERMEZI,,1545 S LA CIENEGA BLVD,,"LOS ANGELES, CA",R1-1-O,,0.0,1044.03,7,"(34.25487, -118.43002)",Net Applicant,,


### 1.2 Update Data Types in database

Before working with the data it will be important to have the correct data types mapped to the values. That way when extracting and loading into a Pandas dataframe, for example, it is immediately ready for analysis. 

Determining the correct data types involves some preliminary exploration of the data for which uses two helper functions: *get_snapshot* and 

In [26]:
### Returns overview with column, dtype, # unique values, # missing values and sample value
def get_snapshot(dataframe):
    
    """
    Takes an existing DataFrame and creates a pandas DataFrame 
    where each row displays the original DataFrame column name, 
    number of unique values, number of missing values, and a
    random sample value from that column. 
    
    Useful for exploring raw data to quickly figure out appropriate 
    data types.
    
    Example
    -------
    
    overview = get_overview(my_dataframe)
    
    """
    unique = dataframe.nunique(axis=0)
    is_null = dataframe.isnull().sum()
    data_types = dataframe.dtypes
    
    samples = pd.DataFrame()
    column_names = pd.DataFrame()
    
    for column, row in dataframe.iteritems():
        try:
            sample = dataframe[column].dropna(axis=0).sample()
            column_name = pd.Series(column)
        except:
            pass

        samples = pd.concat([samples, sample], axis=0).reset_index(drop=True)
        column_names = pd.concat([column_names, column_name], 
                                 axis=0, ).reset_index(drop=True)

        examples = pd.concat([column_names, samples], axis=1, ignore_index=True)
        examples.columns = ['COLUMN', 'SAMPLE VALUE']
    
    overview = pd.concat([data_types, unique, is_null], axis=1)
    overview.reset_index(inplace=True)
    overview.columns = columns=['COLUMN', 'DATA TYPE', '# UNIQUE VALUES', '# MISSING VALUES']
    overview = overview.merge(right=examples, on='COLUMN').drop_duplicates(subset=['COLUMN']).set_index('COLUMN')
    
    return overview

In [27]:
get_snapshot(data)

Unnamed: 0_level_0,DATA TYPE,# UNIQUE VALUES,# MISSING VALUES,SAMPLE VALUE
COLUMN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
assessor_book,object,742,1,5034
assessor_page,object,49,1,010
assessor_parcel,object,111,1,017
tract,object,1202,5,TR 6372
block,object,96,1178,BLK 42
lot,object,296,14,2
reference_no_old_permit_no,object,520,877,14LA37385
pcis_permit_no,object,1500,0,13041-90000-03938
status,object,11,0,Permit Finaled
status_date,object,962,0,11/06/2018


In [64]:
# Creates a report to show value counts for columns with less than n unique values
def explore_value_counts(dataframe, n=None, max_n=1500, columns=None, printed=True):
    
    """
    This function is helpful for quickly determining which values
    should be converted to integer or category types in a dataframe.
    
    Prints a series of custom text summaries with n value counts 
    for each column. Can work if neither n nor columns are specified.
    
    Also can return a generator yielding a text summary with n
    value counts for each column.
    
    Example:
    --------
    ## Iterates through individual tables
    gen = explore_value_counts(data, printed=False)
    print(next(gen)) 
    
    ## Prints all tables to STDOUT
    explore_value_counts(data, printed=True)
    
    Params
    --------
    dataframe : pandas DataFrame
        DataFrame with columns to be summarized.
    n : integer or string
        Max number of unique categories in column, or 'all'.
    max_n : integer
        Ceiling safeguard to avoid extremely large values of n
    columns : list 
        Columns to include in output.
    printed : bool
        If True prints to console; if False returns generator object
        which can be printed as text.
    
    Returns
    --------
    if printed=True: prints all formatted text of all tables
    
    if printed=False: generator object that outputs one table
    
    """
    
    # Parsing arguments
    if columns:
        dataframe = dataframe[columns]
    
    if n == 'all':
        n = len(dataframe) if len(dataframe) <= max_n else max_n
    elif not n:
        n = 30
    else:
        n = n if n <= max_n else max_n
        
        
    
    def make_tables():
        dataframe_n = pd.DataFrame()

        # Data selection
        for column, row in dataframe.iteritems():
            
            n_unique = dataframe[column].nunique()
            
            # Throws error if float64 is removed
            if (dataframe[column].dtype not in ['float64', '<M8[ns]']):
                dataframe_n = pd.concat([dataframe_n, dataframe[column]], axis=1)

        summary_list = []

        # Text generation
        for column, row in dataframe_n.iteritems(): 
            series = dataframe_n[column]
            name = series.name
            
            index_slice = n if n <= len(series) else len(series)
            
            # Create dataframe of value counts
            counted = series.value_counts(sort=True)[:index_slice]
            percent = series.value_counts(sort=True, normalize=True)[:index_slice]
            summary = pd.concat([counted, percent], axis=1)
            summary.columns = ['COUNT', 'PERCENTAGE']
            summary.index = summary.index.rename('UNIQUE VALUES:')
            
            # Create a custom table with n unique, missing values to print to console as text
            summary_text = 'COLUMN:   "{}"\nDATA TYPE:  {}'.format(name, series.dtype)
            summary_text = summary_text + '\nTOTAL UNIQUE:  {}'.format(name, series.nunique())
            summary_text = summary_text + '\nTOTAL MISSING:  {}'.format(series.isnull().sum())
            summary_text = summary_text + '\n' + summary.to_string() + '\n\n'

            summary_list.append(summary_text)
        
        if not printed:
            summary_gen = iter(summary_list)
            return summary_gen
        else:
            return summary_list
        
    if printed:
        print('\n'.join(make_tables()))
    else:
        return make_tables()

In [52]:
# Examine variables to determine appropriate data types
explore_value_counts(data, columns=['block'], n=10, printed=True)

COLUMN:   "block"
DATA TYPE:  object
TOTAL UNIQUE:  block
TOTAL MISSING:  1178
                COUNT  PERCENTAGE
UNIQUE VALUES:                   
2                  20    0.062112
3                  18    0.055901
1                  14    0.043478
5                  13    0.040373
4                  12    0.037267
6                  11    0.034161
7                  10    0.031056
11                  9    0.027950
B                   9    0.027950
C                   9    0.027950




#### Summary of Data Types
The following columns will have their data types programmatically updated in the database:

***Pandas Category columns:***
* status - VARCHAR(30)
* permit_type - VARCHAR(30)
* permit_sub_type - VARCHAR(30)
* permit_category - VARCHAR(30)
* initiating_office - VARCHAR(30)
* license_type - VARCHAR(10)
* zone - VARCHAR(10)
* census_tract - VARCHAR(10)
* applicant_relationship - VARCHAR(30)

***Pandas Object columns:***
* 

***Pandas Integer columns:***
* council_district - SMALLINT
* project_number - SMALLINT
* address_start - INTEGER
* address_end - INTEGER
* no_of_residential_dwelling_units - SMALLINT
* no_of_accessory_dwelling_units - SMALLINT
* no_of_stories - SMALLINT
* license_no - INTEGER

***Pandas Float columns:***
* valuation - NUMERIC(12, 2)

***DateTime columns***

In [36]:
# Updates column types in PostgreSQL database
def update_table_types(column_dict, sql_string, table, printed=False, 
                       write=False, path=None, run=False, con=None):
    
    """
    Takes a sql statement to ALTER COLUMN types (string)
    and appends it to an ALTER TABLE statement and prints
    or returns a full string. Made for PostgreSQL.
    
    Example
    ---------
    # Dictionary of SQL types
    numeric_cols = {'valuation': 'NUMERIC(12, 2)'}
    
    # ALTER column statement as string, do not end with ',' or ';'
    sql_numeric = "ALTER {column} TYPE {col_type} USING {column}::numeric"
    
    # Writes a out text file to disk
    update_table_types(integer_cols, sql_integer, printed=True, 
                                        write=True, path='./sql')
    
    Output
    ---------
    >>> "ALTER TABLE public.permits_raw
            ALTER valuation TYPE NUMERIC(12, 2) USING valuation::numeric;"
            
    Parameters
    ----------
    column_dict : dictionary
        Dictionary in form {'column name': 'SQL datatype'}
    
    sql_string : string
        Must be a SQL query string in form:
        "ALTER {column} TYPE {col_type} USING {column}::numeric
    
    printed : boolean
        If True will output to console
    
    write : boolean
        If True will write to specified path
    
    path : string
        Path of directory to write text file to
        
    run : boolean
        If True, will run query in database.
        
    con : psycopg2 connection object
        Required if run = True
    """
    
    # Define SQL update queries
    sql_alter_table = "ALTER TABLE public.{db_table}\n\t".format(db_table=table)

    # Append comma, new line and tab
    sql_string = sql_string + ",\n\t"
    
    # Update types
    sql_update_type = []
    for column, col_type in column_dict.items():
        sql_update_type.append(sql_string.format(column=column, col_type=col_type))
    
    # Join strings to create full sql query
    sql_update_type = sql_alter_table + ''.join(sql_update_type)
    
    # Replace very last character with ";"
    sql_update_type = sql_update_type[:-3] + ";"
    
    if printed:
        print(sql_update_type)
    
    if write:
        with open(path, 'w') as text:
            text.write(sql_update_type)
        print("\nSQL written to:\n{}\n".format(path))
    
        if run:
            #assert con, "No connection to database."
            # try in case connection not open
            try:
                print("Connecting...")
                cur = con.cursor()
                sql_file = open(path, 'r')
                print("Executing query...")
                cur.execute(sql_file.read())
                print("Committing changes...")
                con.commit()
                con.close()
                print("Database updated successfully.")
            except Exception as e:
                conn.rollback()
                print('Error:\n', e)
                
    elif run and not write:
        print('Set "write=True" and define path to run query from file.')
        
    return sql_update_type

Each column can be easily examined using *explore_value_counts* to determine the type constraint:

In [39]:
data.head(1)

Unnamed: 0,assessor_book,assessor_page,assessor_parcel,tract,block,lot,reference_no_old_permit_no,pcis_permit_no,status,status_date,permit_type,permit_sub_type,permit_category,project_number,event_code,initiating_office,issue_date,address_start,address_fraction_start,address_end,address_fraction_end,street_direction,street_name,street_suffix,suffix_direction,unit_range_start,unit_range_end,zip_code,work_description,valuation,floor_area_la_zoning_code_definition,no_of_residential_dwelling_units,no_of_accessory_dwelling_units,no_of_stories,contractors_business_name,contractor_address,contractor_city,contractor_state,license_type,license_no,principal_first_name,principal_middle_name,principal_last_name,license_expiration_date,applicant_first_name,applicant_last_name,applicant_business_name,applicant_address_1,applicant_address_2,applicant_address_3,zone,occupancy,floor_area_la_building_code_definition,census_tract,council_district,latitude_longitude,applicant_relationship,existing_code,proposed_code
0,4317,3,***,TR 30210-C,,LT 1,,15044-90000-08405,Permit Finaled,09/10/2015,HVAC,1 or 2 Family Dwelling,No Plan Check,,,INTERNET,2015-08-18,1823,1/2,1823,1/2,S,THAYER,AVE,,,,90025,,,,,,,CONDITIONED AIRE MECHANICAL & ENGINEERING INC,18650 PARTHENIA STREET,NORTHRIDGE,CA,C20,532440,BRETT,MOORE,HOFFER,2016-06-30,BRETT,HOFFER,,18650 PARTHENIA ST,,"NORTHRIDGE, CA",R3-1-O,,0,2671.0,5,"(34.05474, -118.42628)",Net Applicant,,


In [136]:
# Examine variables to determine appropriate data types
explore_value_counts(data, columns=['tract'], n='all', max_n=30, printed=True)

COLUMN:   "tract"
DATA TYPE:  object
TOTAL UNIQUE:  tract
TOTAL MISSING:  5
                                                    COUNT  PERCENTAGE
UNIQUE VALUES:                                                       
RANCHO SAUSAL REDONDO                                  13    0.008696
TR 1000                                                13    0.008696
TR 5609                                                12    0.008027
TR 9300                                                 7    0.004682
LANKERSHIM RANCH LAND AND WATER CO.                     6    0.004013
P M 3784                                                6    0.004013
TR 6170                                                 6    0.004013
IVANHOE                                                 6    0.004013
ORD'S SURVEY                                            6    0.004013
TR 5822                                                 5    0.003344
RANCHO LA BREA                                          5    0.003344
HUBER TRACT   

#### Update VARCHAR/CHAR types

In [138]:
conn = connect_db()

# Dictionary of columns and new varchar types
varchar_cols = {'status':'VARCHAR(50)', 'permit_type':'VARCHAR(50)', 'permit_sub_type':'VARCHAR(50)', 
                'permit_category':'VARCHAR(50)', 'initiating_office':'VARCHAR(50)', 
                'license_type':'VARCHAR(50)', 'zone':'VARCHAR(50)', 'census_tract':'VARCHAR(50)', 
                'applicant_relationship':'VARCHAR(50)', 'block':'VARCHAR(50)', 'lot':'VARCHAR(50)', 
                'reference_no_old_permit_no':'VARCHAR(50)','pcis_permit_no':'VARCHAR(50)', 
               'address_fraction_start': 'CHAR(3)', 'address_fraction_end': 'CHAR(3)', 
                'street_direction': 'CHAR(1)', 'street_name': 'VARCHAR(50)', 'street_suffix': 'VARCHAR(10)',
               'suffix_direction': 'VARCHAR(10)', 'unit_range_start': 'VARCHAR(50)', 'unit_range_end': 'VARCHAR(50)',
               'work_description': 'TEXT', 'floor_area_la_zoning_code_definition': 'VARCHAR(10)', 
               'contractors_business_name': 'VARCHAR(100)', 'contractor_address': 'VARCHAR(100)',
               'contractor_city': 'VARCHAR(50)', 'contractor_state': 'CHAR(2)', 'license_type': 'VARCHAR(10)', 
               'principal_first_name': 'VARCHAR(50)', 'principal_middle_name': 'VARCHAR(50)', 
                'principal_last_name': 'VARCHAR(50)', 'applicant_first_name': 'VARCHAR(50)', 
                'applicant_last_name': 'VARCHAR(50)', 'applicant_business_name': 'VARCHAR(100)',
               'applicant_address_1': 'VARCHAR(50)', 'applicant_address_2': 'VARCHAR(50)', 
                'applicant_address_3': 'VARCHAR(50)', 'occupancy': 'VARCHAR(50)', 
                'floor_area_la_building_code_definition': 'VARCHAR(10)', 'census_tract': 'VARCHAR(10)',
                'latitude_longitude': 'VARCHAR(50)', 'assessor_parcel': 'CHAR(3)', 'tract': 'VARCHAR(200)'}

sql_varchar = "ALTER {column} TYPE {col_type}"

# Path to varchar query file
path_varchar = root_dir + '/postgres/sql/update_varchar.sql'

# Update column types
update_table_types(varchar_cols, sql_varchar, table='permits_raw', printed=True, 
                   write=True, path=path_varchar, run=True, con=conn);

Connected as user "postgres" to database "permits" on localhost:5432

ALTER TABLE public.permits_raw
	ALTER applicant_first_name TYPE VARCHAR(50),
	ALTER reference_no_old_permit_no TYPE VARCHAR(50),
	ALTER license_type TYPE VARCHAR(10),
	ALTER applicant_business_name TYPE VARCHAR(100),
	ALTER floor_area_la_building_code_definition TYPE VARCHAR(10),
	ALTER address_fraction_start TYPE CHAR(3),
	ALTER contractors_business_name TYPE VARCHAR(100),
	ALTER permit_type TYPE VARCHAR(50),
	ALTER applicant_address_2 TYPE VARCHAR(50),
	ALTER work_description TYPE TEXT,
	ALTER applicant_last_name TYPE VARCHAR(50),
	ALTER principal_first_name TYPE VARCHAR(50),
	ALTER latitude_longitude TYPE VARCHAR(50),
	ALTER floor_area_la_zoning_code_definition TYPE VARCHAR(10),
	ALTER status TYPE VARCHAR(50),
	ALTER unit_range_end TYPE VARCHAR(50),
	ALTER occupancy TYPE VARCHAR(50),
	ALTER pcis_permit_no TYPE VARCHAR(50),
	ALTER permit_sub_type TYPE VARCHAR(50),
	ALTER street_suffix TYPE VARCHAR(10),
	ALTER princ

#### Update INTEGER types

In [119]:
conn = connect_db()

# Dictionary of columns and new integer types
integer_cols = {'assessor_book': 'SMALLINT', 'assessor_page': 'SMALLINT', 'council_district': 'SMALLINT', 
                'project_number': 'SMALLINT', 'address_start': 'INTEGER', 
                'address_end': 'INTEGER', 'no_of_residential_dwelling_units': 'SMALLINT', 
                'no_of_accessory_dwelling_units': 'SMALLINT', 'no_of_stories': 'SMALLINT', 
                'license_no': 'INTEGER', 'zip_code': 'INTEGER', 'existing_code': 'SMALLINT', 
                'proposed_code': 'SMALLINT'}

sql_integer = "ALTER {column} TYPE {col_type} USING {column}::{col_type}"

# Path to varchar query file
path_integer = root_dir + '/postgres/sql/update_integer.sql'

# Update column types
update_table_types(integer_cols, sql_integer, table='permits_raw', printed=True, 
                   write=True, path=path_integer, run=True, con=conn);

Connected as user "postgres" to database "permits" on localhost:5432

ALTER TABLE public.permits_raw
	ALTER existing_code TYPE SMALLINT USING existing_code::SMALLINT,
	ALTER council_district TYPE SMALLINT USING council_district::SMALLINT,
	ALTER no_of_stories TYPE SMALLINT USING no_of_stories::SMALLINT,
	ALTER assessor_book TYPE SMALLINT USING assessor_book::SMALLINT,
	ALTER address_start TYPE INTEGER USING address_start::INTEGER,
	ALTER no_of_accessory_dwelling_units TYPE SMALLINT USING no_of_accessory_dwelling_units::SMALLINT,
	ALTER assessor_page TYPE SMALLINT USING assessor_page::SMALLINT,
	ALTER proposed_code TYPE SMALLINT USING proposed_code::SMALLINT,
	ALTER address_end TYPE INTEGER USING address_end::INTEGER,
	ALTER project_number TYPE SMALLINT USING project_number::SMALLINT,
	ALTER no_of_residential_dwelling_units TYPE SMALLINT USING no_of_residential_dwelling_units::SMALLINT,
	ALTER license_no TYPE INTEGER USING license_no::INTEGER,
	ALTER zip_code TYPE INTEGER USING zip_code

#### Update NUMERIC types

In [133]:
conn = connect_db()

# Dictionary of columns and new numeric types
numeric_cols = {'valuation': 'NUMERIC(12, 2)'}

sql_numeric = "ALTER {column} TYPE {col_type} USING {column}::" + "{col_type}"

# Path to varchar query file
path_numeric = root_dir + '/postgres/sql/update_numeric.sql'

# Update column types
update_table_types(numeric_cols, sql_numeric, table='permits_raw', printed=True, 
                   write=True, path=path_numeric, run=True, con=conn);

Connected as user "postgres" to database "permits" on localhost:5432

ALTER TABLE public.permits_raw
	ALTER valuation TYPE NUMERIC(12, 2) USING valuation::NUMERIC(12, 2);

SQL written to:
/Users/gregory/Documents/00 Data Projects/project-portfolio/permits-data/postgres/sql/update_numeric.sql

Connecting...
Executing query...
Committing changes...
Database updated successfully.


#### Update TEMPORAL types

In [132]:
conn = connect_db()

# Dictionary of columns and new integer types
date_cols = {'status_date': 'DATE', 'issue_date': 'DATE', 'license_expiration_date': 'DATE'}

#sql_date = "ALTER {column} TYPE {col_type}"
sql_date = "ALTER {column} TYPE {col_type} USING {column}::" + "{col_type}"

# Path to varchar query file
path_integer = root_dir + '/postgres/sql/update_temporal.sql'

# Update column types
update_table_types(date_cols, sql_date, table='permits_raw', printed=True, 
                   write=True, path=path_integer, run=True, con=conn);

Connected as user "postgres" to database "permits" on localhost:5432

ALTER TABLE public.permits_raw
	ALTER issue_date TYPE DATE USING issue_date::DATE,
	ALTER status_date TYPE DATE USING status_date::DATE,
	ALTER license_expiration_date TYPE DATE USING license_expiration_date::DATE;

SQL written to:
/Users/gregory/Documents/00 Data Projects/project-portfolio/permits-data/postgres/sql/update_temporal.sql

Connecting...
Executing query...
Committing changes...
Database updated successfully.


## Summary

The functions in this notebook allow the pipeline to connect to a PostgreSQL database on any server, update the column names in tables, and update the data types as well. The notebook is not the pipeline itself, this can be run through the Makefile...
```bash
set -o allexport; source .env; set +o allexport; \
make data
```

## 2. Clean Data

In [58]:
# Connect to db
conn = connect_db()

# Extract partial dataset
sql_all = 'SELECT * FROM {} LIMIT 1500;'.format(DB_TABLE)

# Columns to parse as dates
date_columns = ['status_date', 'issue_date', 'license_expiration_date']

# Fetch fresh data
data = pd.read_sql_query(sql_all, conn, parse_dates=date_columns, 
                         coerce_float=False)

Connected as user "postgres" to database "permits" on localhost:5432



In [59]:
data.head()

Unnamed: 0,assessor_book,assessor_page,assessor_parcel,tract,block,lot,reference_no_old_permit_no,pcis_permit_no,status,status_date,permit_type,permit_sub_type,permit_category,project_number,event_code,initiating_office,issue_date,address_start,address_fraction_start,address_end,address_fraction_end,street_direction,street_name,street_suffix,suffix_direction,unit_range_start,unit_range_end,zip_code,work_description,valuation,floor_area_la_zoning_code_definition,no_of_residential_dwelling_units,no_of_accessory_dwelling_units,no_of_stories,contractors_business_name,contractor_address,contractor_city,contractor_state,license_type,license_no,principal_first_name,principal_middle_name,principal_last_name,license_expiration_date,applicant_first_name,applicant_last_name,applicant_business_name,applicant_address_1,applicant_address_2,applicant_address_3,zone,occupancy,floor_area_la_building_code_definition,census_tract,council_district,latitude_longitude,applicant_relationship,existing_code,proposed_code
0,4317,3,***,TR 30210-C,,LT 1,,15044-90000-08405,Permit Finaled,2015-09-10,HVAC,1 or 2 Family Dwelling,No Plan Check,,,INTERNET,2015-08-18,1823,1/2,1823,1/2,S,THAYER,AVE,,,,90025,,,,,,,CONDITIONED AIRE MECHANICAL & ENGINEERING INC,18650 PARTHENIA STREET,NORTHRIDGE,CA,C20,532440,BRETT,MOORE,HOFFER,2016-06-30,BRETT,HOFFER,,18650 PARTHENIA ST,,"NORTHRIDGE, CA",R3-1-O,,0.0,2671.0,5,"(34.05474, -118.42628)",Net Applicant,,
1,5005,10,017,CHESTERFIELD SQUARE,,465,16SL57806,16016-70000-02464,Permit Finaled,2017-08-01,Bldg-Alter/Repair,1 or 2 Family Dwelling,No Plan Check,,,SOUTH LA,2016-02-04,2122,,2122,,W,54TH,ST,,,,90062,General rehabilitation for single family dwell...,40000.0,,,,,OWNER-BUILDER,,,,,0,JAVIER,,TALAMANTES,NaT,JAVIER,TALAMANTES,OWNER-BUILDER,,,,C2-1VL,,,2325.0,8,"(33.99307, -118.31668)",Owner-Bldr,1.0,
2,5154,23,022,SUN-SET TRACT,D,13,14VN81535,14016-20000-13092,Issued,2014-08-13,Bldg-Alter/Repair,Apartment,Plan Check,,,VAN NUYS,2014-08-13,415,,415,,S,BURLINGTON,AVE,,1-30,1-30,90057,PHOTOVOLTAIC SOLAR PANELS ON ROOF OF (E) APT BLDG,37000.0,,,,,PERMACITY CONSTRUCTION CORP,5570 W WASHINGTON BLVD,LOS ANGELES,CA,B,827864,JONATHAN,SAUL,PORT,2015-11-30,LINDA,MARTON,,710 WILSHIRE BLVD,,"SANTA MONICA, CA",R4-1,,,2089.04,1,"(34.06012, -118.26997)",Agent for Owner,5.0,
3,4404,30,010,TR 12086,,2,,16044-30000-09658,Permit Finaled,2016-08-29,HVAC,1 or 2 Family Dwelling,No Plan Check,,,WEST LA,2016-08-22,315,,315,,S,OCEANO,DR,,,,90049,,,,,,,E/C HEATING AND AIR CONDITION,26888 CUATRO MILPAS ST,VALENCIA,CA,C20,651051,EDY,RUDOLFO,CORDON,2018-07-31,,,,,,,RS-1,,0.0,2640.0,11,"(34.05707, -118.4732)",Contractor,,
4,2646,19,011,TR 7158,,11,,17042-90000-31792,Permit Finaled,2017-12-28,Plumbing,1 or 2 Family Dwelling,No Plan Check,,,INTERNET,2017-12-26,13640,,13640,,W,PIERCE,ST,,,,91331,,,,,,,TITANIUM POWER INC,1545 S LA CIENEGA BLVD,LOS ANGELES,CA,B,989217,DENNIS,HARUO,MIYAHIRA,2017-12-31,YONI,GHERMEZI,,1545 S LA CIENEGA BLVD,,"LOS ANGELES, CA",R1-1-O,,0.0,1044.03,7,"(34.25487, -118.43002)",Net Applicant,,


### 2.1 Missing Data

#### Overview of Unique Values in Qualitative Data

Before making decisions regarding missing values or data types, it is important to be familiar with the content of each column especially considering the permits dataset mostly contains qualitative data.

#### Summary
* *zip_code* and *latitude_longitude* need their missing values to be inferred through geocoding.
* *issue_date* and *license_expiration_date* should be parsed as datetime objects on import.
* All address columns should be combined into one column containing the full address.
* Columns will be converted to more appropriate data types such as float, integer, and category.

### 2.2 Processing Missing Data

***Overview:***
* Split *latitude_longitude* into separate columns and convert to float values: *latitude*, *longitude*<br>
* Combine address columns into one columns: *full_address*<br>
* Geocode missing *latitude_longitude* with *full_address*<br>
* Geocode missing *zip_code* with complete *latitude_longitude*<br>
* Geocode any missing *full_address* with *latitude_longitude*<br>