# permits-data / Clean Data

ETL pipeline for construction permits data in Los Angeles, California, USA.

For more information:
https://data.lacity.org/A-Prosperous-City/Building-and-Safety-Permit-Information/yv23-pmwf

## Setup

In [1]:
import os
import sys

# Set path for modules
sys.path[0] = '../'

from dotenv import load_dotenv, find_dotenv
import numpy as np
import pandas as pd
import psycopg2

# Import custom eda and sql functions
from src.toolkits.eda import get_snapshot
from src.toolkits.sql import connect_db

# Import dependencies for geocoding
from geopy.geocoders import Nominatim
from geopy.geocoders import GoogleV3
from geopy.extra.rate_limiter import RateLimiter

In [2]:
# Set notebook display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [3]:
# Get project root directory
root_dir = os.path.dirname(os.getcwd())

# Set environment variables
load_dotenv(find_dotenv());
POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
POSTGRES_DB = os.getenv("POSTGRES_DB")
DB_PORT = os.getenv("DB_PORT")
DB_HOST = os.getenv("DB_HOST")
DATA_URL = os.getenv("DATA_URL")

# Google Maps environment variables
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Environment variables specific to notebook
DATA_DIR = os.path.dirname(root_dir) + '/data'
DB_TABLE = "permits_raw"

## 1. Clean Data

In [4]:
# Connect to db
conn = connect_db()

# Extract partial dataset
sql = 'SELECT * FROM {} LIMIT 500;'.format(DB_TABLE)

# Columns to parse as dates
date_columns = ['status_date', 'issue_date', 'license_expiration_date']

# Fetch fresh data
data = pd.read_sql_query(sql, conn, parse_dates=date_columns, coerce_float=False)

# Replace None with np.nan
data.fillna(np.nan, inplace=True)

Connected as user "postgres" to database "permits" on http://localhost:5432.



### 1.1 Missing Data

#### Overview of Unique Values in Qualitative Data

Before deciding how to handle missing values, it helps to be familiar with the content of each column. Depending on the column, missing data can be left alone, imputed, recollected, or dropped from the dataset. Since the permits data consists mostly of qualitative data and unstructured text, most of it will be left alone.

Geographic data such as addresses and lat/long coordinates are the exception: missing values need to be geocoded accurately. Since the address information is split across several columns, these columns will be concatenated into one.

In [5]:
# Get an overview of data types, # unique values, # missing values and sample value
# for each column
get_snapshot(data)

COLUMN,DATA TYPE,# UNIQUE VALUES,# MISSING VALUES,SAMPLE VALUE
assessor_book,int64,366,0,2701
assessor_page,int64,44,0,26
assessor_parcel,object,74,0,904
tract,object,446,3,TR 8978
block,object,52,384,B
lot,object,157,4,LT 1
reference_no_old_permit_no,object,165,305,17VN46725
pcis_permit_no,object,500,0,17043-90000-03110
status,object,8,0,Permit Finaled
status_date,datetime64[ns],424,0,2018-04-11 00:00:00


At the moment the only missing data of interest are the *zip_code* and *latitude_longitude* columns, since these are necessary for mapping.
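
A quick way to confirm how many values are missing in just these two columns is sketched below. The toy frame and its values are illustrative stand-ins, not rows from the permits table.

```python
import pandas as pd

# Toy frame standing in for the permits data (illustrative values only)
toy = pd.DataFrame({
    "zip_code": ["90025", None, "90057"],
    "latitude_longitude": ["(34.05474, -118.42628)", None, None],
})

# Count missing values in the two columns needed for mapping
missing_counts = toy[["zip_code", "latitude_longitude"]].isnull().sum()
print(missing_counts)
```

On the real data the same expression would be applied to `data` instead of `toy`.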

### 1.2 Processing Missing Data

***Overview:***
* 1.2.1 - Combine address columns into one column: *full_address*<br>
    - Correct *suffix_direction*
    - Convert *zip_code* to string
    - Concatenate to form *full_address*
* 1.2.2 - Geocode missing *latitude_longitude* with *full_address*<br>
* 1.2.3 - Split *latitude_longitude* into separate columns and convert to float values: *latitude*, *longitude*<br>
<br>
* Geocode missing *zip_code* with complete *latitude_longitude*<br>
* Geocode any missing *full_address* with *latitude_longitude*<br>
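
The last two items are not implemented in this notebook. As a rough sketch of the first one, missing *zip_code* values could be filled from coordinates as shown below. It assumes separate `latitude` and `longitude` float columns already exist (step 1.2.3). The names `fill_missing_zip`, `reverse_lookup`, and `fake_lookup` are hypothetical; in practice `reverse_lookup` would wrap a reverse-geocoding service and return a zip code string.

```python
import pandas as pd

def fill_missing_zip(df, reverse_lookup):
    """Return a copy of df with missing zip_code filled via reverse_lookup(lat, lon)."""
    df = df.copy()

    # Rows that lack a zip code but have complete coordinates
    needs_zip = (
        df["zip_code"].isnull()
        & df["latitude"].notnull()
        & df["longitude"].notnull()
    )

    if needs_zip.any():
        df.loc[needs_zip, "zip_code"] = df.loc[needs_zip].apply(
            lambda row: reverse_lookup(row["latitude"], row["longitude"]), axis=1
        )
    return df

# Toy data and a stub lookup (hypothetical; no geocoding service is called)
toy = pd.DataFrame({
    "zip_code": ["90025", None],
    "latitude": [34.05474, 34.2003503],
    "longitude": [-118.42628, -118.4533963],
})
fake_lookup = lambda lat, lon: "91405"

filled = fill_missing_zip(toy, fake_lookup)
print(filled["zip_code"].tolist())
```

Passing the lookup in as a callable keeps the fill logic testable without incurring API charges.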

#### 1.2.1 Concatenate *full_address*

1) Correct values in *suffix_direction*.<br>
2) Convert *zip_code* to string.<br>
3) Concatenate to form a complete street address string.

In [6]:
# Truncate suffix_direction to first letter (N, S, E, W)
data['suffix_direction'] = data['suffix_direction'].str[0].fillna('')

# Convert zip_code to string
data['zip_code'] = data['zip_code'].fillna('').astype(str)

# Combine address columns to concatenate
address_columns = ["address_start", "street_direction", "street_name", "street_suffix", "suffix_direction",
                  "zip_code"]

# Concatenate address values, collapsing any run of whitespace left by empty fields
data['full_address'] = data[address_columns].fillna('').astype(str).apply(' '.join, axis=1) \
                        .str.replace(r'\s+', ' ', regex=True).str.strip()

# Replace empty strings with NaN values
data[address_columns] = data[address_columns].replace('', np.nan)

In [7]:
# Display
data[address_columns + ['full_address']].head()

,address_start,street_direction,street_name,street_suffix,suffix_direction,zip_code,full_address
0,1823,S,THAYER,AVE,,90025,1823 S THAYER AVE 90025
1,2122,W,54TH,ST,,90062,2122 W 54TH ST 90062
2,415,S,BURLINGTON,AVE,,90057,415 S BURLINGTON AVE 90057
3,315,S,OCEANO,DR,,90049,315 S OCEANO DR 90049
4,13640,W,PIERCE,ST,,91331,13640 W PIERCE ST 91331


#### 1.2.2 Geocode missing *latitude_longitude*

In [8]:
# Extract rows missing latitude_longitude (copy to avoid SettingWithCopyWarning on assignment)
data_missing = data[data['latitude_longitude'].isnull()].copy()

# Size
data_missing.shape

(21, 60)

In [9]:
# Display
data_missing[['full_address', 'latitude_longitude']].head()

,full_address,latitude_longitude
5,7111 N MARISA RD 91405,
113,12453 W BROMWICH ST 91331,
148,9842 N LASSEN ROAD 91345,
161,101 S THE GROVE DR 90036,
171,1956 N CARMEN AVE 90068,


In [10]:
# Create helper function to geocode missing latitude_longitude values
def geocode(address, key, agent, timeout=None):
    """
    Uses the Google Maps API (GoogleV3) to geocode an address string to lat/long coordinates.
    The request goes through RateLimiter to help avoid timeout errors. If an address is empty
    or cannot be geocoded, NaN is returned. Use of the Google Maps API incurs a charge of
    $0.005 per request.
    """

    # Skip missing or empty addresses
    if pd.isnull(address) or not str(address).strip():
        return np.nan

    # Initialize Google Maps geocoder
    geolocator = GoogleV3(api_key=key,
                          user_agent=agent,
                          timeout=timeout)

    # Wrap the geocoder with RateLimiter. Note that the delay only spaces out calls
    # made through the same wrapper, and this wrapper is rebuilt on every call.
    rate_limited_geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    # Geocode address input through the rate-limited wrapper
    location = rate_limited_geocode(address)

    # Leave as NaN when no match is found
    if location is None:
        return np.nan

    # Format for dataframe
    return round(location.latitude, 7), round(location.longitude, 7)
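
For the one-second delay to take effect between rows, the same rate-limited callable has to be reused for every address. The sketch below shows that pattern with the geocoder passed in as a callable. `geocode_addresses`, `Location`, and `fake_geocode` are hypothetical names used for illustration; in the notebook, `geocode_fn` could be a single `RateLimiter(geolocator.geocode, min_delay_seconds=1)` built once before the loop.

```python
import math
from collections import namedtuple

# Stand-in for the object a geocoder returns (hypothetical, for illustration)
Location = namedtuple("Location", ["latitude", "longitude"])

def geocode_addresses(addresses, geocode_fn):
    """Geocode each address with one shared callable.

    Returns a list with a (lat, lon) tuple per address, or NaN where the
    address is empty or the geocoder returns no match.
    """
    results = []
    for address in addresses:
        # Skip missing or blank addresses without spending a request
        if not isinstance(address, str) or not address.strip():
            results.append(float("nan"))
            continue

        location = geocode_fn(address)
        if location is None:
            results.append(float("nan"))
        else:
            results.append((round(location.latitude, 7), round(location.longitude, 7)))
    return results

def fake_geocode(address):
    """Stub geocoder that knows a single address (no network access)."""
    known = {"7111 N MARISA RD 91405": Location(34.2003503, -118.4533963)}
    return known.get(address)

coords = geocode_addresses(
    ["7111 N MARISA RD 91405", "", "NOT A REAL ADDRESS"], fake_geocode
)
print(coords)
```

The resulting list can be assigned back to a dataframe column as long as it is built from the rows in their existing order.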

In [11]:
# Calculate cost
cost = len(data_missing) * 0.005
print("Cost for geocoding {} addresses is ${:.2f}.".format(len(data_missing), cost))

# Geocode missing coordinates using full addresses
data_missing['latitude_longitude'] = data_missing['full_address'].apply(geocode, args=(GOOGLE_API_KEY, 
                                                                                       "permits-data"))

# Update dataframe
data.update(data_missing)

Cost for geocoding 21 addresses is $0.10.




In [12]:
# Display
data_missing[['full_address', 'latitude_longitude']].head()

,full_address,latitude_longitude
5,7111 N MARISA RD 91405,"(34.2003503, -118.4533963)"
113,12453 W BROMWICH ST 91331,"(34.2538783, -118.40469)"
148,9842 N LASSEN ROAD 91345,"(34.2498959, -118.4665838)"
161,101 S THE GROVE DR 90036,"(34.072878, -118.357463)"
171,1956 N CARMEN AVE 90068,"(34.1068231, -118.3226816)"


In [13]:
# Check that there are no more missing coordinates before proceeding
assert data['latitude_longitude'].notnull().all(), "Missing coordinates must be geocoded."
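
When checking for remaining gaps, the choice between `.any()` and `.all()` matters. A minimal sketch on toy data (the series below is illustrative, not taken from the permits table):

```python
import numpy as np
import pandas as pd

# One populated value and one missing value
coords = pd.Series(["(34.05474, -118.42628)", np.nan])

# .any() passes as soon as a single value is present
has_some = bool(coords.notnull().any())

# .all() passes only when every value is present
has_all = bool(coords.notnull().all())

print(has_some, has_all)
```

Only the `.all()` form fails when a single coordinate is still missing, which is the behavior a completeness check needs.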

#### 1.2.3 Split *latitude_longitude* 

Split coordinates into separate columns and convert to float values.

In [14]:
# Split latitude_longitude into separate columns and convert to float values: latitude, longitude
lat_long_series = data['latitude_longitude'].astype(str).str[1:-1].str.split(',', expand=True) \
                        .astype(float).rename(columns={0: "latitude", 1: "longitude"})

# Add to original data
data = pd.concat([data, lat_long_series], axis=1)

In [15]:
# Display
data[['latitude_longitude', 'latitude', 'longitude']].head(1)

,latitude_longitude,latitude,longitude
0,"(34.05474, -118.42628)",34.05474,-118.42628


In [16]:
# Check for null values
assert data['latitude'].notnull().all(), 'Column "latitude" has missing values.'
assert data['longitude'].notnull().all(), 'Column "longitude" has missing values.'

# Check for erroneous coordinates. All coordinates should fall within Los Angeles County.
assert (data['latitude'] > 33.2).all() and (data['latitude'] < 34.9).all(), "Incorrect latitude detected"
assert (data['longitude'] > -118.9).all() and (data['longitude'] < -118).all(), "Incorrect longitude detected"
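
If one of the range checks fails, it helps to see which rows are responsible. The sketch below reuses the same bounds on a toy frame; the second row is a deliberately out-of-range example, and the constant names are introduced here for illustration only.

```python
import pandas as pd

# Same bounds as the assertions in this cell
LAT_MIN, LAT_MAX = 33.2, 34.9
LON_MIN, LON_MAX = -118.9, -118.0

# Toy coordinates: the first row is in range, the second is not
toy = pd.DataFrame({
    "latitude": [34.05474, 40.7128],
    "longitude": [-118.42628, -74.0060],
})

# Boolean mask of rows that fall strictly inside the bounding box
in_bounds = (
    (toy["latitude"] > LAT_MIN) & (toy["latitude"] < LAT_MAX)
    & (toy["longitude"] > LON_MIN) & (toy["longitude"] < LON_MAX)
)

# Rows to inspect or re-geocode
out_of_bounds = toy[~in_bounds]
print(out_of_bounds)
```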

## 2. Update PostgreSQL Database

In [17]:
get_snapshot(data)

COLUMN,DATA TYPE,# UNIQUE VALUES,# MISSING VALUES,SAMPLE VALUE
assessor_book,float64,366,0,4006
assessor_page,float64,44,0,20
assessor_parcel,object,74,0,024
tract,object,446,3,BONNIE BRAE TRACT
block,object,52,384,BLK 9
lot,object,157,4,521
reference_no_old_permit_no,object,165,305,13VN68383
pcis_permit_no,object,500,0,16016-10001-21896
status,object,8,0,Permit Finaled
status_date,datetime64[ns],424,0,2015-02-17 00:00:00
