# permits-data / Clean Data

ETL pipeline for construction permits data in Los Angeles, California, USA.

For more information:
https://data.lacity.org/A-Prosperous-City/Building-and-Safety-Permit-Information/yv23-pmwf

## Setup

In [237]:
import os
import sys

# Set path for modules
sys.path[0] = '../'

from dotenv import load_dotenv, find_dotenv
import numpy as np
import pandas as pd
import psycopg2

# Import custom eda and sql functions
from src.toolkits.eda import get_snapshot, explore_value_counts
from src.toolkits.sql import connect_db

# Import dependencies for geocoding
from geopy.geocoders import Nominatim
from geopy.geocoders import GoogleV3
from geopy.extra.rate_limiter import RateLimiter

In [238]:
# Set notebook display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [239]:
# Get project root directory
root_dir = os.path.dirname(os.getcwd())

# Set environment variables
load_dotenv(find_dotenv());
POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
POSTGRES_DB = os.getenv("POSTGRES_DB")
DB_PORT = os.getenv("DB_PORT")
DB_HOST = os.getenv("DB_HOST")
DATA_URL = os.getenv("DATA_URL")

# Google Maps environment variables
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Environment variables specific to notebook
DATA_DIR = os.path.dirname(root_dir) + '/data'
DB_TABLE = "permits_raw"

## 1. Clean Data

In [240]:
# Connect to db
conn = connect_db()

# Extract partial dataset
sql_all = 'SELECT * FROM {} LIMIT 1500;'.format(DB_TABLE)

# Columns to parse as dates
date_columns = ['status_date', 'issue_date', 'license_expiration_date']

# Fetch fresh data
data = pd.read_sql_query(sql_all, conn, parse_dates=date_columns, coerce_float=False)

# Replace None with np.nan
data.fillna(np.nan, inplace=True)

Connected as user "postgres" to database "permits" on localhost:5432



### 1.1 Missing Data

#### Overview of Unique Values in Qualitative Data

Before deciding how to handle missing values, it helps to know what each column contains. Depending on the column, missing data can be left alone, imputed, recollected, or dropped from the dataset. Since the permits data is mostly qualitative data and unstructured text, most of it will be left alone.

Geographic data such as addresses and lat/long coordinates is the exception: missing values there need to be geocoded accurately. Because the address information is split across several columns, those columns will be concatenated into one.

In [241]:
# Get an overview of data types, # unique values, # missing values and sample value
# for each column
get_snapshot(data)

COLUMN,DATA TYPE,# UNIQUE VALUES,# MISSING VALUES,SAMPLE VALUE
assessor_book,float64,742,1,2106
assessor_page,float64,49,1,17
assessor_parcel,object,111,1,031
tract,object,1202,5,CABRILLO BEACH HEIGHTS TRACT
block,object,96,1178,9
lot,object,296,14,13
reference_no_old_permit_no,object,520,877,15WL68346
pcis_permit_no,object,1500,0,15026-10000-00305
status,object,11,0,Permit Finaled
status_date,datetime64[ns],962,0,2013-04-16 00:00:00
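
The source of `get_snapshot` is not shown in this notebook. As a rough sketch only, a summary like the one above could be assembled with plain pandas. The function name `snapshot_sketch` and the toy data are hypothetical and are not the project's implementation.

```python
import numpy as np
import pandas as pd


def snapshot_sketch(df):
    """Summarize dtype, unique count, missing count and one sample value per column."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        rows.append({
            "COLUMN": col,
            "DATA TYPE": str(df[col].dtype),
            "# UNIQUE VALUES": df[col].nunique(),  # missing values are not counted
            "# MISSING VALUES": int(df[col].isnull().sum()),
            "SAMPLE VALUE": non_null.iloc[0] if len(non_null) else np.nan,
        })
    return pd.DataFrame(rows).set_index("COLUMN")


# Toy data (hypothetical), not taken from the permits table
toy = pd.DataFrame({
    "block": ["9", None, None],
    "lot": ["13", "14", "13"],
})
summary = snapshot_sketch(toy)
print(summary)
```

Columns with a high missing count and few unique values, such as *block* in the output above, are the ones where leaving the data alone is usually the cheapest choice.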


At the moment, the only missing data of interest are *zip_code* and *latitude_longitude* coordinates, since these are necessary for mapping.

### 1.2 Processing Missing Data

***Overview:***
* 1.2.1 - Combine address columns into one column: *full_address*<br>
    - Correct *suffix_direction*
    - Convert *zip_code* to string
    - Concatenate to form *full_address*
* 1.2.2 - Geocode missing *latitude_longitude* with *full_address*<br>
* 1.2.3 - Split *latitude_longitude* into separate columns and convert to float values: *latitude*, *longitude*<br>
<br>
* Geocode missing *zip_code* with complete *latitude_longitude*<br>
* Geocode any missing *full_address* with *latitude_longitude*<br>
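
The last two items are not shown in this part of the notebook. As a hedged sketch of the zip code step, the function below pulls a postal code out of a raw result shaped like a Google Geocoding API response (a list of `address_components`, each carrying a list of `types`). The name `extract_zip` and the sample dictionary are hypothetical. With geopy, the dictionary would presumably come from the `.raw` attribute of a location returned by a reverse lookup.

```python
def extract_zip(raw_result):
    """Return the postal code from a Google-style geocoding result, or None if absent."""
    for component in raw_result.get("address_components", []):
        if "postal_code" in component.get("types", []):
            return component.get("long_name")
    return None


# Hypothetical raw result, trimmed to the fields used here
sample_raw = {
    "address_components": [
        {"long_name": "Los Angeles", "short_name": "LA", "types": ["locality", "political"]},
        {"long_name": "90025", "short_name": "90025", "types": ["postal_code"]},
    ]
}

print(extract_zip(sample_raw))
print(extract_zip({"address_components": []}))
```

Returning `None` for results without a postal code keeps unresolved rows easy to find afterwards.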

#### 1.2.1 Concatenate *full_address*

1) Correct the values in *suffix_direction*.<br>
2) Convert *zip_code* to string.<br>
3) Concatenate to form a complete street address string.

In [242]:
# Truncate suffix_direction to first letter (N, S, E, W)
data['suffix_direction'] = data['suffix_direction'].str[0].fillna('')

In [243]:
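# Convert zip_code to string, dropping the trailing two characters
# (the '.0' of values such as 90025.0); missing values become '' instead of a truncated 'nan'
zip_missing = data['zip_code'].isnull()
data['zip_code'] = data['zip_code'].astype(str).str[:-2].mask(zip_missing, '')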

In [244]:
# Combine address columns to concatenate
address_columns = ["address_start", "street_direction", "street_name", "street_suffix", "suffix_direction",
                  "zip_code"]
# Concatenate address values
data['full_address'] = data[address_columns].fillna('').astype(str).apply(' '.join, axis=1).str.replace('  ', ' ')

In [245]:
# Display
data[address_columns + ['full_address']].head()

,address_start,street_direction,street_name,street_suffix,suffix_direction,zip_code,full_address
0,1823,S,THAYER,AVE,,90025,1823 S THAYER AVE 90025
1,2122,W,54TH,ST,,90062,2122 W 54TH ST 90062
2,415,S,BURLINGTON,AVE,,90057,415 S BURLINGTON AVE 90057
3,315,S,OCEANO,DR,,90049,315 S OCEANO DR 90049
4,13640,W,PIERCE,ST,,91331,13640 W PIERCE ST 91331
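
The three steps above can be sketched on a toy frame that includes missing parts, which the displayed rows do not. This is an illustration under assumptions: the rows are hypothetical, zip codes are assumed to arrive as floats, and whitespace is collapsed with a regular expression rather than a single double-space replacement.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical); the second row has no zip code or street direction
toy = pd.DataFrame({
    "address_start": [1823, 415],
    "street_direction": ["S", None],
    "street_name": ["THAYER", "BURLINGTON"],
    "street_suffix": ["AVE", "AVE"],
    "suffix_direction": [None, "SOUTH"],
    "zip_code": [90025.0, np.nan],
})

# Keep only the first letter of suffix_direction; missing values become ''
toy["suffix_direction"] = toy["suffix_direction"].str[0].fillna("")

# Convert float zip codes to strings without turning NaN into the text 'nan'
toy["zip_code"] = toy["zip_code"].map(lambda z: "" if pd.isnull(z) else str(int(z)))

address_columns = ["address_start", "street_direction", "street_name",
                   "street_suffix", "suffix_direction", "zip_code"]

# Join the parts, then collapse any run of whitespace and trim the ends
toy["full_address"] = (
    toy[address_columns]
    .fillna("")
    .astype(str)
    .apply(" ".join, axis=1)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(toy["full_address"].tolist())
```

Collapsing every run of whitespace matters when two adjacent parts are both empty, since a single pass replacing double spaces can leave a stray space behind.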


#### 1.2.2 Geocode missing *latitude_longitude*

In [246]:
# Extract rows missing latitude_longitude
data_missing = data[data['latitude_longitude'].isnull()]

# Size
data_missing.shape

(58, 60)

In [247]:
# Initialize the Google Maps geocoder once so the rate limiter is shared across calls
geolocator = GoogleV3(api_key=GOOGLE_API_KEY,
                      user_agent="permits-data",
                      timeout=None)

# Add a rate limiter to space out requests
rate_limited_geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Create helper function to geocode missing latitude_longitude values
def geocode(address):

    """Uses the Google Maps API to geocode an address string to a "(lat, long)" string.
    The shared RateLimiter spaces out requests to avoid timeout errors. Not all addresses
    can be geocoded, so some NaNs will remain. Use of the Google Maps API incurs a charge
    of $0.005 per request."""

    if not address:
        return np.nan

    # Geocode the address through the rate-limited callable
    location = rate_limited_geocode(address)

    # The geocoder returns None when it finds no match
    if location is None:
        return np.nan

    # Format as a "(lat, long)" string so the value can be split like the existing ones
    return "({:.6f}, {:.6f})".format(location.latitude, location.longitude)
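
Because each Google Maps request is billed, it may help to exercise the helper's logic against a stand-in geocoder first. The sketch below assumes a hypothetical `make_geocoder` factory and `FakeLocation` stub, neither of which is part of the notebook or of geopy. It only illustrates how empty addresses and unmatched results can be turned into NaN.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for the object a geocoder returns
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def make_geocoder(lookup):
    """Wrap any address -> location callable so it returns '(lat, long)' strings or NaN."""
    def geocode(address):
        if not address or not address.strip():
            return math.nan
        location = lookup(address)
        if location is None:  # the lookup found no match
            return math.nan
        return "({:.6f}, {:.6f})".format(location.latitude, location.longitude)
    return geocode


# Stub lookup with a single known address and illustrative coordinates
known = {"1823 S THAYER AVE 90025": FakeLocation(34.05474, -118.42628)}
geocode_stub = make_geocoder(known.get)

print(geocode_stub("1823 S THAYER AVE 90025"))
print(geocode_stub("NO SUCH ADDRESS"))
```

With geopy, `lookup` could be the `RateLimiter`-wrapped `geolocator.geocode`, so the same logic would run against the live service.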

In [248]:
# Calculate cost
cost = len(data_missing) * 0.005
print("Cost for geocoding {} addresses is ${:.2f}.".format(len(data_missing), cost))

# Geocode missing coordinates using full addresses; assigning through .loc on the
# original dataframe avoids pandas' SettingWithCopyWarning
missing = data['latitude_longitude'].isnull()
data.loc[missing, 'latitude_longitude'] = data.loc[missing, 'full_address'].apply(geocode)

Cost for geocoding 58 addresses is $0.29.

In [249]:
# Check that there are no more missing coordinates before proceeding
assert data['latitude_longitude'].isnull().sum() == 0, "Missing coordinates must be geocoded."

#### 1.2.3 Split *latitude_longitude* 

Split coordinates into separate columns and convert to float values.

In [250]:
# Split latitude_longitude into separate columns and convert to float values: latitude, longitude
lat_long_series = data['latitude_longitude'].str[1:-1].str.split(',', expand=True) \
                        .astype(float).rename(columns={0: "latitude", 1: "longitude"})

# Add to original data
data = pd.concat([data, lat_long_series], axis=1)

In [251]:
# Display
data[['latitude_longitude', 'latitude', 'longitude']].head(1)

,latitude_longitude,latitude,longitude
0,"(34.05474, -118.42628)",34.05474,-118.42628
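
The split above assumes every value is a well-formed "(lat, long)" string. As an alternative sketch, `str.extract` with a regular expression yields NaN for malformed or missing values instead of raising during the float conversion. The toy values and the pattern are assumptions for illustration.

```python
import pandas as pd

# Toy values (hypothetical): one well-formed pair, one malformed, one missing
toy = pd.DataFrame({
    "latitude_longitude": ["(34.05474, -118.42628)", "not available", None],
})

# Capture two signed decimal numbers inside parentheses; non-matching rows become NaN
pattern = r"\(\s*(?P<latitude>-?\d+(?:\.\d+)?)\s*,\s*(?P<longitude>-?\d+(?:\.\d+)?)\s*\)"
coords = toy["latitude_longitude"].str.extract(pattern).astype(float)

# Add the new columns alongside the original one
toy = pd.concat([toy, coords], axis=1)
print(toy)
```

The named groups in the pattern become the column names, so no separate rename step is needed.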
