# permits-data / Clean Data

ETL pipeline for construction permits data in Los Angeles, California, USA.

For more information:
https://data.lacity.org/A-Prosperous-City/Building-and-Safety-Permit-Information/yv23-pmwf

## Setup

In [1]:
import os
import sys

# Set path for modules
sys.path[0] = '../'

from dotenv import load_dotenv, find_dotenv
import numpy as np
import pandas as pd
import psycopg2

# Import custom eda and sql functions
from src.toolkits.eda import get_snapshot, explore_value_counts
from src.toolkits.sql import connect_db

# Import dependencies for geocoding
from geopy.geocoders import Nominatim
from geopy.geocoders import GoogleV3
from geopy.extra.rate_limiter import RateLimiter

In [2]:
# Set notebook display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [3]:
# Get project root directory
root_dir = os.path.dirname(os.getcwd())

# Set environment variables
load_dotenv(find_dotenv());
POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
POSTGRES_DB = os.getenv("POSTGRES_DB")
DB_PORT = os.getenv("DB_PORT")
DB_HOST = os.getenv("DB_HOST")
DATA_URL = os.getenv("DATA_URL")

# Google Maps environment variables
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Environment variables specific to notebook
DATA_DIR = os.path.dirname(root_dir) + '/data'
DB_TABLE = "permits_raw"

## 1. Clean Data

In [4]:
# Connect to db
conn = connect_db()

# Extract partial dataset
sql_all = 'SELECT * FROM {} LIMIT 500;'.format(DB_TABLE)

# Columns to parse as dates
date_columns = ['status_date', 'issue_date', 'license_expiration_date']

# Fetch fresh data
data = pd.read_sql_query(sql_all, conn, parse_dates=date_columns, coerce_float=False)

# Replace None with np.nan
data.fillna(np.nan, inplace=True)

Connected as user "postgres" to database "permits" on localhost:5432



In [5]:
#data.iloc[542]

### 1.1 Missing Data

#### Overview of Unique Values in Qualitative Data

Before deciding how to handle missing values, it helps to be familiar with the content of each column. Depending on the column, missing data can be left alone, imputed, recollected, or dropped from the dataset. Since the permits data is mostly qualitative data and unstructured text, most of it will be left alone.

Geographic data such as addresses and lat/long coordinates is the exception: missing values need to be geocoded accurately. Since address information is split across several columns, those columns will be concatenated into one.

In [6]:
# Get an overview of data types, # unique values, # missing values and sample value
# for each column
get_snapshot(data)

COLUMN,DATA TYPE,# UNIQUE VALUES,# MISSING VALUES,SAMPLE VALUE
assessor_book,int64,366,0,5159
assessor_page,int64,44,0,6
assessor_parcel,object,74,0,009
tract,object,446,3,TR 37916
block,object,52,384,17
lot,object,157,4,40
reference_no_old_permit_no,object,165,305,18VN
pcis_permit_no,object,500,0,17041-90000-32881
status,object,8,0,Permit Finaled
status_date,datetime64[ns],424,0,2018-04-30 00:00:00


At the moment the only missing data of interest are the *zip_code* and *latitude_longitude* columns, since these are necessary for mapping.

### 1.2 Processing Missing Data

***Overview:***
* 1.2.1 - Combine address columns into one column: *full_address*<br>
    - Correct *suffix_direction*
    - Convert *zip_code* to string
    - Concatenate to form *full_address*
* 1.2.2 - Geocode missing *latitude_longitude* with *full_address*<br>
* 1.2.3 - Split *latitude_longitude* into separate columns and convert to float values: *latitude*, *longitude*<br>
<br>
* Geocode missing *zip_code* with complete *latitude_longitude*<br>
* Geocode any missing *full_address* with *latitude_longitude*<br>
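The last two items in the overview rely on reverse geocoding. As a minimal sketch of the *zip_code* step, the helper below pulls a postal code out of a raw result dictionary. It assumes the dictionary is shaped like a Google Geocoding API response (a list of `address_components`, each with `types` and `long_name`), which is what a geopy `Location.raw` from `GoogleV3` is expected to contain. The function name `extract_zip_code` and the sample dictionary are hypothetical and are not part of this notebook.

```python
def extract_zip_code(raw_result):
    """Return the postal code from a Google-style geocoding result, or None."""
    # Each address component lists the roles it plays in its "types" field
    for component in raw_result.get("address_components", []):
        if "postal_code" in component.get("types", []):
            return component.get("long_name")
    # No postal code component was found
    return None


# Hypothetical sample shaped like one reverse-geocoding result
sample_result = {
    "address_components": [
        {"long_name": "Los Angeles", "types": ["locality", "political"]},
        {"long_name": "90036", "types": ["postal_code"]},
    ]
}

zip_code = extract_zip_code(sample_result)
```

Keeping the parsing separate from the API call means the extraction logic can be checked without spending geocoding requests.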

#### 1.2.1 Concatenate *full_address*

1) Correct values in *suffix_direction*.<br>
2) Convert *zip_code* to string.<br>
3) Concatenate to form a complete street address string.
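The concatenation step can be illustrated on a single row with plain Python. This is only a sketch under assumptions: `build_full_address` is a hypothetical helper, not a function from this project, and it skips missing or empty parts so that no extra spaces end up in the result.

```python
import math


def build_full_address(parts):
    """Join address parts into one string, skipping missing or empty values."""
    cleaned = []
    for part in parts:
        # Treat None and float NaN as missing
        if part is None or (isinstance(part, float) and math.isnan(part)):
            continue
        text = str(part).strip()
        if text:
            cleaned.append(text)
    return " ".join(cleaned)


# Parts in the same order as the address columns; two of them are missing
example = build_full_address(["1925", None, "CENTURY PARK", "", "E", "90067"])
print(example)  # 1925 CENTURY PARK E 90067
```

Filtering out empty parts before joining avoids having to clean up repeated spaces afterwards.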

In [7]:
# Truncate suffix_direction to first letter (N, S, E, W)
data['suffix_direction'] = data['suffix_direction'].str[0].fillna('')

# Convert zip_code to string
data['zip_code'] = data['zip_code'].fillna('').astype(str)

# Combine address columns to concatenate
address_columns = ["address_start", "street_direction", "street_name", "street_suffix", "suffix_direction",
                  "zip_code"]

# Concatenate address values, collapsing the repeated spaces left by empty fields
data['full_address'] = data[address_columns].fillna('').astype(str).apply(' '.join, axis=1) \
                           .str.replace(r'\s+', ' ', regex=True).str.strip()

# Replace empty strings with NaN values
data[address_columns] = data[address_columns].replace('', np.nan)

In [8]:
# Display
data[address_columns + ['full_address']].head()

Unnamed: 0,address_start,street_direction,street_name,street_suffix,suffix_direction,zip_code,full_address
0,1823,S,THAYER,AVE,,90025,1823 S THAYER AVE 90025
1,2122,W,54TH,ST,,90062,2122 W 54TH ST 90062
2,415,S,BURLINGTON,AVE,,90057,415 S BURLINGTON AVE 90057
3,315,S,OCEANO,DR,,90049,315 S OCEANO DR 90049
4,13640,W,PIERCE,ST,,91331,13640 W PIERCE ST 91331


#### 1.2.2 Geocode missing *latitude_longitude*

In [9]:
# Extract rows with missing latitude_longitude (copy to avoid SettingWithCopyWarning)
data_missing = data[data['latitude_longitude'].isnull()].copy()

# Size
data_missing.shape

(21, 60)

In [10]:
data_missing

Unnamed: 0,assessor_book,assessor_page,assessor_parcel,tract,block,lot,reference_no_old_permit_no,pcis_permit_no,status,status_date,permit_type,permit_sub_type,permit_category,project_number,event_code,initiating_office,issue_date,address_start,address_fraction_start,address_end,address_fraction_end,street_direction,street_name,street_suffix,suffix_direction,unit_range_start,unit_range_end,zip_code,work_description,valuation,floor_area_la_zoning_code_definition,no_of_residential_dwelling_units,no_of_accessory_dwelling_units,no_of_stories,contractors_business_name,contractor_address,contractor_city,contractor_state,license_type,license_no,principal_first_name,principal_middle_name,principal_last_name,license_expiration_date,applicant_first_name,applicant_last_name,applicant_business_name,applicant_address_1,applicant_address_2,applicant_address_3,zone,occupancy,floor_area_la_building_code_definition,census_tract,council_district,latitude_longitude,applicant_relationship,existing_code,proposed_code,full_address
5,2219,27,052,TR 73820,,52,18VN77133,17010-20000-02747,CofO Issued,2019-04-05,Bldg-New,1 or 2 Family Dwelling,Plan Check,,,VAN NUYS,2018-09-21,7111,,7111,,N,MARISA,RD,,,,91405,"NEW SFD/GARAGE - PLAN 1A, LOT-52",196660.0,1560.0,1.0,,2.0,OWNER-BUILDER,,,,,0,,,,NaT,DAVID,LELIE,,25152 SPRINGFIELD CT,#180,"VALENCIA, CA",(T)(Q)RD2-1,,1985.0,1278.03,6,,Agent for Owner,,1.0,7111 N MARISA RD 91405
113,2537,7,012,TR 6026,,121,,17041-20000-01717,Permit Finaled,2017-04-25,Electrical,1 or 2 Family Dwelling,No Plan Check,,,VAN NUYS,2017-01-18,12453,,12453,,W,BROMWICH,ST,,,,91331,,,,,,,MSP CONSTRUCTION,7175 DE PALMA ST,DOWNEY,CA,B,789577,MIGUEL,,SOLTERO,2018-09-30,,,,,,,R1-1-CUGU,,0.0,1047.03,7,,,,,12453 W BROMWICH ST 91331
148,2656,5,160,SUBDIVISION NO. 1 OF THE PROPERTY OF THE PORTE...,,1 SEC 21 T2N R15W,,18042-20000-08835,Issued,2018-04-10,Plumbing,1 or 2 Family Dwelling,No Plan Check,,,VAN NUYS,2018-04-10,9842,,9842,,N,LASSEN,ROAD,,LOT 14,,91345,,,,,,,SEWER AND PIPELINE CONTRACTOR INC,4518 S WESTERN AVE,LOS ANGELES,CA,C36,904635,MANUEL,SANTANA,CHAMUL,2019-10-31,URIU &,ASSOCIATES,,830 S GLENDALE AVE,,"GLENDALE, CA",RD2-1,,0.0,1171.02,7,,Architect,,,9842 N LASSEN ROAD 91345
161,5512,3,042,TR 45628,,LT 7,17LA81020,16016-10001-25903,Issued,2017-04-24,Bldg-Alter/Repair,Commercial,Plan Check,,,METRO,2017-04-24,101,,101,,S,THE GROVE,DR,,,,90036,Supplemental to permit #16016-10000-25903 and ...,50000.0,,,,,NEXT VENTURE INC,560 RIVERDALE DRIVE,GLENDALE,CA,B,749452,CARL,JUAN,FROMMER,2018-05-31,JENNY,DIAZ,,1300 DOVE STREET,100,"NEWPORT BEACH, CA",(T)C2-2D-O,,,2145.01,4,,Agent for Owner,21.0,,101 S THE GROVE DR 90036
171,5586,7,007,LOPEZ VILLA TRACT,,8,,18042-40000-25623,Issued,2018-10-23,Plumbing,Apartment,No Plan Check,,,SANPEDRO,2018-10-23,1956,,1956,,N,CARMEN,AVE,,,,90068,,,,,,,A 1 COPPER REPIPE SPECIALIST,1082 E ARTESIA BLVD STE A,LONG BEACH,CA,C36,883229,RICARDO,HERNANDEZ,AMEZCUA,2020-08-31,RICARDO,AMEZCUA,,,,,R3-1XL,,0.0,1895.0,4,,Contractor,,,1956 N CARMEN AVE 90068
237,5511,8,013,TR 10389,,120,,17042-20000-32019,Permit Finaled,2018-07-30,Plumbing,Apartment,No Plan Check,,,VAN NUYS,2017-12-28,135,1/2,135,1/2,N,HARPER,AVE,,,,90048,,,,,,,ARNOLD'S REMODELING,8146 LONGRIDGE AVENUE,NORTH HOLLYWOOD,CA,B,929036,JOSE,ARNOLDO,ORANTES,2019-02-28,,,,,,,RD1.5-1-O,,0.0,2146.0,5,,,,,135 N HARPER AVE 90048
263,4319,2,060,TR 30364,,4,,18042-10001-07950,Permit Finaled,2018-05-11,Plumbing,Commercial,No Plan Check,,,METRO,2018-05-08,1925,,1925,,,CENTURY PARK,,E,18TH FLOOR,,90067,,,,,,,MUIR-CHASE PLUMBING CO INC,4530 BRAZIL STREET,LOS ANGELES,CA,C36,539835,GRANT,DRAKE,MUIR,2018-08-31,PHILLIP,HONS,MUIR-CHASE PLUMBING CO INC,,,,C2-2-O,,0.0,2679.01,5,,Agent for Contractor,,,1925 CENTURY PARK E 90067
264,7426,23,010,TR 4251,,117,,18044-40000-13741,Refund Completed,2019-03-11,HVAC,1 or 2 Family Dwelling,No Plan Check,,,SANPEDRO,2018-11-08,1440,,1440,,,GAMBLE,AVE,,,,90744,,,,,,,PRECISE AIR SYSTEMS INC,P O BOX 39609,LOS ANGELES,CA,C20,428900,FRED,,KHACHEKIAN,2020-10-31,DIANA,OLIVAS,,,,,R1-1XL-O-CUGU,,0.0,2941.2,15,,Agent for Contractor,1.0,,1440 GAMBLE AVE 90744
268,5527,31,014,TR 6790,,313,16LA,15042-20000-13885,Permit Expired,2018-09-25,Plumbing,Commercial,Plan Check,,,VAN NUYS,2016-06-13,465,,465,,N,FAIRFAX,AVE,,,,90036,"INSTALLATION OF POTABLE WATER, WASTE & VENT AN...",,,,,,FREE CREATION INC,5478 WILSHIRE BLVD UNIT 214,LOS ANGELES,CA,B,1003352,BEN,,KISLER,2017-05-31,TED MORENO,,,9111 MORNING GLOW WAY,,"SUN VALLEY, CA",C2-1VL,,0.0,1945.0,5,,Agent for Owner,,,465 N FAIRFAX AVE 90036
275,5073,7,020,THE W. G. NEVIN TRACT,6.0,2,,18046-90000-02205,Issued,2018-12-21,Elevator,Apartment,No Plan Check,,,INTERNET,2018-12-21,1511,,1511,,S,ST ANDREWS,PL,,1-45,,90019,,,,,,,CONSOLIDATED ELEVATOR COMPANY INC,964 E BADILLO ST #303,COVINA,CA,C11,1019792,MICHAEL,JOSEPH,BROWN,2018-10-31,DAVID,SANDOVAL,,964 E BADILLO ST,303,"COVINA, CA",[Q]R4-1,,0.0,2213.04,10,,Net Applicant,,,1511 S ST ANDREWS PL 90019


In [11]:
# Create helper function to geocode missing latitude_longitude values
def geocode(address, key, agent, timeout=None):
    """
    Uses GoogleMaps API to geocode an address string to (latitude, longitude) coordinates.
    RateLimiter is to avoid timeout errors. If an address is empty or cannot be geocoded,
    NaN is returned. Use of GoogleMaps API incurs a charge of $0.005 per request.
    """
    if not address:
        return np.nan

    # Initialize GoogleMaps geocoder
    geolocator = GoogleV3(api_key=key,
                          user_agent=agent,
                          timeout=timeout)

    # Add rate limiter to space out requests (named so it does not shadow this function)
    rate_limited_geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    # Geocode address input through the rate limiter
    location = rate_limited_geocode(address)

    # The geocoder returns None when no match is found
    if location is None:
        return np.nan

    # Format for dataframe
    return round(location.latitude, 7), round(location.longitude, 7)

In [12]:
# Calculate cost
cost = len(data_missing) * 0.005
print("Cost for geocoding {} addresses is ${:.2f}.".format(len(data_missing), cost))

# Geocode missing coordinates using full addresses
data_missing['latitude_longitude'] = data_missing['full_address'].apply(geocode, args=(GOOGLE_API_KEY, 
                                                                                       "permits-data"))

# Update dataframe
data.update(data_missing)

Cost for geocoding 21 addresses is $0.10.




In [13]:
data_missing

Unnamed: 0,assessor_book,assessor_page,assessor_parcel,tract,block,lot,reference_no_old_permit_no,pcis_permit_no,status,status_date,permit_type,permit_sub_type,permit_category,project_number,event_code,initiating_office,issue_date,address_start,address_fraction_start,address_end,address_fraction_end,street_direction,street_name,street_suffix,suffix_direction,unit_range_start,unit_range_end,zip_code,work_description,valuation,floor_area_la_zoning_code_definition,no_of_residential_dwelling_units,no_of_accessory_dwelling_units,no_of_stories,contractors_business_name,contractor_address,contractor_city,contractor_state,license_type,license_no,principal_first_name,principal_middle_name,principal_last_name,license_expiration_date,applicant_first_name,applicant_last_name,applicant_business_name,applicant_address_1,applicant_address_2,applicant_address_3,zone,occupancy,floor_area_la_building_code_definition,census_tract,council_district,latitude_longitude,applicant_relationship,existing_code,proposed_code,full_address
5,2219,27,052,TR 73820,,52,18VN77133,17010-20000-02747,CofO Issued,2019-04-05,Bldg-New,1 or 2 Family Dwelling,Plan Check,,,VAN NUYS,2018-09-21,7111,,7111,,N,MARISA,RD,,,,91405,"NEW SFD/GARAGE - PLAN 1A, LOT-52",196660.0,1560.0,1.0,,2.0,OWNER-BUILDER,,,,,0,,,,NaT,DAVID,LELIE,,25152 SPRINGFIELD CT,#180,"VALENCIA, CA",(T)(Q)RD2-1,,1985.0,1278.03,6,"(34.2003503, -118.4533963)",Agent for Owner,,1.0,7111 N MARISA RD 91405
113,2537,7,012,TR 6026,,121,,17041-20000-01717,Permit Finaled,2017-04-25,Electrical,1 or 2 Family Dwelling,No Plan Check,,,VAN NUYS,2017-01-18,12453,,12453,,W,BROMWICH,ST,,,,91331,,,,,,,MSP CONSTRUCTION,7175 DE PALMA ST,DOWNEY,CA,B,789577,MIGUEL,,SOLTERO,2018-09-30,,,,,,,R1-1-CUGU,,0.0,1047.03,7,"(34.2538783, -118.40469)",,,,12453 W BROMWICH ST 91331
148,2656,5,160,SUBDIVISION NO. 1 OF THE PROPERTY OF THE PORTE...,,1 SEC 21 T2N R15W,,18042-20000-08835,Issued,2018-04-10,Plumbing,1 or 2 Family Dwelling,No Plan Check,,,VAN NUYS,2018-04-10,9842,,9842,,N,LASSEN,ROAD,,LOT 14,,91345,,,,,,,SEWER AND PIPELINE CONTRACTOR INC,4518 S WESTERN AVE,LOS ANGELES,CA,C36,904635,MANUEL,SANTANA,CHAMUL,2019-10-31,URIU &,ASSOCIATES,,830 S GLENDALE AVE,,"GLENDALE, CA",RD2-1,,0.0,1171.02,7,"(34.2498959, -118.4665838)",Architect,,,9842 N LASSEN ROAD 91345
161,5512,3,042,TR 45628,,LT 7,17LA81020,16016-10001-25903,Issued,2017-04-24,Bldg-Alter/Repair,Commercial,Plan Check,,,METRO,2017-04-24,101,,101,,S,THE GROVE,DR,,,,90036,Supplemental to permit #16016-10000-25903 and ...,50000.0,,,,,NEXT VENTURE INC,560 RIVERDALE DRIVE,GLENDALE,CA,B,749452,CARL,JUAN,FROMMER,2018-05-31,JENNY,DIAZ,,1300 DOVE STREET,100,"NEWPORT BEACH, CA",(T)C2-2D-O,,,2145.01,4,"(34.072878, -118.357463)",Agent for Owner,21.0,,101 S THE GROVE DR 90036
171,5586,7,007,LOPEZ VILLA TRACT,,8,,18042-40000-25623,Issued,2018-10-23,Plumbing,Apartment,No Plan Check,,,SANPEDRO,2018-10-23,1956,,1956,,N,CARMEN,AVE,,,,90068,,,,,,,A 1 COPPER REPIPE SPECIALIST,1082 E ARTESIA BLVD STE A,LONG BEACH,CA,C36,883229,RICARDO,HERNANDEZ,AMEZCUA,2020-08-31,RICARDO,AMEZCUA,,,,,R3-1XL,,0.0,1895.0,4,"(34.1068231, -118.3226816)",Contractor,,,1956 N CARMEN AVE 90068
237,5511,8,013,TR 10389,,120,,17042-20000-32019,Permit Finaled,2018-07-30,Plumbing,Apartment,No Plan Check,,,VAN NUYS,2017-12-28,135,1/2,135,1/2,N,HARPER,AVE,,,,90048,,,,,,,ARNOLD'S REMODELING,8146 LONGRIDGE AVENUE,NORTH HOLLYWOOD,CA,B,929036,JOSE,ARNOLDO,ORANTES,2019-02-28,,,,,,,RD1.5-1-O,,0.0,2146.0,5,"(34.075354, -118.369252)",,,,135 N HARPER AVE 90048
263,4319,2,060,TR 30364,,4,,18042-10001-07950,Permit Finaled,2018-05-11,Plumbing,Commercial,No Plan Check,,,METRO,2018-05-08,1925,,1925,,,CENTURY PARK,,E,18TH FLOOR,,90067,,,,,,,MUIR-CHASE PLUMBING CO INC,4530 BRAZIL STREET,LOS ANGELES,CA,C36,539835,GRANT,DRAKE,MUIR,2018-08-31,PHILLIP,HONS,MUIR-CHASE PLUMBING CO INC,,,,C2-2-O,,0.0,2679.01,5,"(34.0605808, -118.4147074)",Agent for Contractor,,,1925 CENTURY PARK E 90067
264,7426,23,010,TR 4251,,117,,18044-40000-13741,Refund Completed,2019-03-11,HVAC,1 or 2 Family Dwelling,No Plan Check,,,SANPEDRO,2018-11-08,1440,,1440,,,GAMBLE,AVE,,,,90744,,,,,,,PRECISE AIR SYSTEMS INC,P O BOX 39609,LOS ANGELES,CA,C20,428900,FRED,,KHACHEKIAN,2020-10-31,DIANA,OLIVAS,,,,,R1-1XL-O-CUGU,,0.0,2941.2,15,"(33.794012, -118.2435141)",Agent for Contractor,1.0,,1440 GAMBLE AVE 90744
268,5527,31,014,TR 6790,,313,16LA,15042-20000-13885,Permit Expired,2018-09-25,Plumbing,Commercial,Plan Check,,,VAN NUYS,2016-06-13,465,,465,,N,FAIRFAX,AVE,,,,90036,"INSTALLATION OF POTABLE WATER, WASTE & VENT AN...",,,,,,FREE CREATION INC,5478 WILSHIRE BLVD UNIT 214,LOS ANGELES,CA,B,1003352,BEN,,KISLER,2017-05-31,TED MORENO,,,9111 MORNING GLOW WAY,,"SUN VALLEY, CA",C2-1VL,,0.0,1945.0,5,"(34.0800261, -118.3617445)",Agent for Owner,,,465 N FAIRFAX AVE 90036
275,5073,7,020,THE W. G. NEVIN TRACT,6.0,2,,18046-90000-02205,Issued,2018-12-21,Elevator,Apartment,No Plan Check,,,INTERNET,2018-12-21,1511,,1511,,S,ST ANDREWS,PL,,1-45,,90019,,,,,,,CONSOLIDATED ELEVATOR COMPANY INC,964 E BADILLO ST #303,COVINA,CA,C11,1019792,MICHAEL,JOSEPH,BROWN,2018-10-31,DAVID,SANDOVAL,,964 E BADILLO ST,303,"COVINA, CA",[Q]R4-1,,0.0,2213.04,10,"(34.0451794, -118.3120002)",Net Applicant,,,1511 S ST ANDREWS PL 90019


In [42]:
# Check that there are no more missing coordinates before proceeding
assert data['latitude_longitude'].notnull().all(), "Missing coordinates must be geocoded."

#### 1.2.3 Split *latitude_longitude* 

Split coordinates into separate columns and convert to float values.
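To make the string handling explicit, here is a minimal sketch of the same parsing on a single value in plain Python. `parse_lat_long` is a hypothetical helper, not part of this notebook, and it assumes coordinates are stored in the "(latitude, longitude)" form shown in the outputs above.

```python
import math


def parse_lat_long(value):
    """Parse '(lat, long)' into a (float, float) tuple; return NaNs if missing."""
    # Treat None and float NaN as missing coordinates
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return (math.nan, math.nan)
    # Drop the surrounding parentheses, then split on the comma
    text = str(value).strip().strip("()")
    lat_text, long_text = text.split(",")
    return (float(lat_text), float(long_text))


latitude, longitude = parse_lat_long("(34.05474, -118.42628)")
print(latitude, longitude)  # 34.05474 -118.42628
```

Because the value is converted with `str` first, the helper also accepts the tuples produced by the geocoding step.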

In [22]:
# Split latitude_longitude into separate columns and convert to float values: latitude, longitude
lat_long_series = data['latitude_longitude'].astype(str).str[1:-1].str.split(',', expand=True) \
                        .astype(float).rename(columns={0: "latitude", 1: "longitude"})

# Add to original data
data = pd.concat([data, lat_long_series], axis=1)

In [47]:
# Display
data[['latitude_longitude', 'latitude', 'longitude']].head(1)

Unnamed: 0,latitude_longitude,latitude,longitude
0,"(34.05474, -118.42628)",34.05474,-118.42628


In [59]:
# Check for null values
assert data['latitude'].notnull().all(), 'Column "latitude" has missing values.'
assert data['longitude'].notnull().all(), 'Column "longitude" has missing values.'