# permits-data / ETL Pipeline

A simple ETL pipeline for construction permit data from the [Los Angeles Open Data Portal](https://data.lacity.org/), built with bash, Python, Docker, and PostgreSQL. It includes a basic [Object-Relational Mapper](https://en.wikipedia.org/wiki/Object-relational_mapping) (ORM) for PostgreSQL built on `psycopg2`, and a notebook that walks through each step of the pipeline.

### Background
Quoted from [Building and Safety Permit Information](https://data.lacity.org/A-Prosperous-City/Building-and-Safety-Permit-Information-Old/yv23-pmwf):<br>
>"*The Department of Building and Safety issues permits for the construction, remodeling, and repair of buildings and structures in the City of Los Angeles. Permits are categorized into building permits, electrical permits, and mechanical permits*"

The raw permit data available from the [Los Angeles Open Data Portal](https://data.lacity.org/) is missing latitude and longitude coordinates for some properties. The pipeline in this notebook geocodes the missing coordinates from each property's address and writes them to a local database using a basic ORM for PostgreSQL.

### Prerequisites

1) [Anaconda](https://docs.anaconda.com/anaconda/install/)<br>
2) [Docker](https://docs.docker.com/get-docker/)<br>
3) [API key for Google Maps](https://developers.google.com/maps/documentation/geocoding/get-api-key). It may be necessary to set up a developer account. Note that geocoding incurs a charge of 0.005 USD per request, although Google does give an initial 300 USD credit.<br>
4) Make sure the `.env` file is present. Refer to the README for instructions on this.

5) Before starting Jupyter Notebook, create and activate the Anaconda environment:
  ```zsh
  make create_env
  conda activate permits_pipeline_env
  ```
6) Export the variables defined in `.env` to the current shell by running:
  ```bash
  set -o allexport; source .env; set +o allexport;
  ```
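
If the Jupyter kernel was started from a shell where the variables were not exported, the same values can be loaded from inside Python. The following is a minimal sketch, not part of this repository: `parse_env` and `load_env` are hypothetical helpers, and the parser assumes plain `KEY=VALUE` lines with optional quotes and `#` comments.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding single or double quotes from the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env(path: str = ".env") -> None:
    """Copy parsed values into os.environ, keeping values that are already set."""
    for key, value in parse_env(Path(path).read_text()).items():
        os.environ.setdefault(key, value)
```

From a notebook that lives one directory below the project root (an assumption), this could be called as `load_env("../.env")` before creating the database connection.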
  
### Instructions
Run the command `make load_db` to automatically download the construction permits data and load it into a PostgreSQL database running in Docker, then run this notebook to transform columns and geocode missing addresses.

## Setup

In [1]:
import os
import sys
# Make the project root importable without overwriting the existing first entry
sys.path.insert(0, '../')

import pandas as pd
import psycopg2

from src.pipeline.dictionaries import types_dict, replace_map
from src.pipeline.transform_data import create_full_address, split_lat_long
from src.toolkits.geospatial import geocode_from_address
from src.toolkits.postgresql import Database, Table
from src.toolkits.eda import explore_value_counts

# Set notebook display options
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

### Connect to PostgreSQL

Connection parameters are read from the `.env` file in the root directory by default; alternatively, they can be passed directly to the `Database()` class.

In [2]:
# Create instance of database
permits = Database()

# List tables
permits.list_tables()

['permits_raw']
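
The fallback from explicit arguments to environment variables can be sketched as below. This is an illustration only, not the code in `src.toolkits.postgresql`: the helper `connection_params` and the `POSTGRES_*` variable names are assumptions, so substitute whatever keys your `.env` actually defines.

```python
import os

# Hypothetical mapping from psycopg2 keyword to environment variable name
ENV_KEYS = {
    "host": "POSTGRES_HOST",
    "port": "POSTGRES_PORT",
    "dbname": "POSTGRES_DB",
    "user": "POSTGRES_USER",
    "password": "POSTGRES_PASSWORD",
}


def connection_params(environ=None, **overrides):
    """Build connection keywords: environment first, explicit arguments win."""
    environ = os.environ if environ is None else environ
    params = {name: environ[key] for name, key in ENV_KEYS.items() if key in environ}
    # Explicit arguments take precedence; None means "not provided"
    params.update({k: v for k, v in overrides.items() if v is not None})
    return params


# Usage (requires a running database):
#   import psycopg2
#   conn = psycopg2.connect(**connection_params(port="5432"))
```

Keeping the parameter lookup separate from the connection call makes it easy to check which values will be used before anything touches the database.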

In [3]:
# Drops and rewrites table for testing purposes; allows restarting kernel
#params = {"table_name":"permits_raw", "types_dict":types_dict_abbrev, "id_col":"pcis_permit_no"}
#permits.drop_table('permits_raw').drop_table('tmp_permits_raw').create_table(**params)
#!cd ../ && bash scripts/load_db.sh

### Update Table Column Names & Types

In [4]:
# Creates instance of table
permits_raw = Table(name="permits_raw", id_col="pcis_permit_no")

In [5]:
# Update column names
permits_raw.format_table_names(replace_map=replace_map, update=True)

# Update column types
permits_raw.update_types(types_dict)

Updated names in "permits_raw".
Updated types in "permits_raw".


## Extract

Pulls the data from the PostgreSQL database running in Docker into a pandas DataFrame.

In [6]:
# Fetch data
data = permits_raw.fetch_data()

## Transform

Transformations applied:
* All address columns are concatenated into one single column: *full_address*
* The *full_address* column is used as input to the *geocode_from_address( )* function to find missing GPS coordinates
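
The concatenation step can be sketched as follows. This is not the implementation of `create_full_address`; the helper `build_full_address` and the column names in `ADDRESS_COLUMNS` are hypothetical and chosen only to illustrate joining the non-empty address parts of one row.

```python
# Hypothetical column names; the real table may use different ones
ADDRESS_COLUMNS = (
    "address_start",
    "street_direction",
    "street_name",
    "street_suffix",
    "zip_code",
)


def build_full_address(row, columns=ADDRESS_COLUMNS):
    """Join the non-empty address parts of one row with single spaces."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Skip missing values: None, or NaN (the only value not equal to itself)
        if value is None or value != value:
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


row = {
    "address_start": 10817,
    "street_direction": "W",
    "street_name": "HUSTON",
    "street_suffix": "ST",
    "zip_code": "91601",
}
print(build_full_address(row))  # 10817 W HUSTON ST 91601
```

Because a pandas row also supports `.get`, the same function could be applied to a DataFrame with `data.apply(build_full_address, axis=1)`, assuming the columns exist under these names.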

In [7]:
# Transform columns: concatenate all address columns to create full_address
data = create_full_address(data)

# Geocoding missing GPS coordinates
geocode_from_address(data); # Updates dataframe in place
data = split_lat_long(data)

Cost for geocoding 42 addresses is $0.21.
Geocoding...
42 locations were assigned coordinates.
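
The reported cost follows from the per-request rate given in the prerequisites (0.005 USD). A small sketch of the arithmetic, using hypothetical helper names and `Decimal` to avoid floating-point rounding in currency values:

```python
from decimal import Decimal

# Rate per geocoding request in USD, as quoted in the prerequisites
PRICE_PER_REQUEST = Decimal("0.005")


def count_missing(coords):
    """Count (latitude, longitude) pairs where either value is missing."""
    return sum(1 for lat, lon in coords if lat is None or lon is None)


def geocoding_cost(n_requests, price=PRICE_PER_REQUEST):
    """Return the total cost in USD for n_requests geocoding calls."""
    return price * n_requests


print(f"Cost for geocoding 42 addresses is ${geocoding_cost(42):.2f}.")
# Cost for geocoding 42 addresses is $0.21.
```

Estimating the cost before calling the API is a cheap safeguard when the number of rows without coordinates is large.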


## Load

Writes the transformed data back to the PostgreSQL database running in Docker.

In [8]:
# Update database with new values
permits_raw.update_values(data=data, id_col="pcis_permit_no", types_dict=types_dict, 
                          columns=['full_address', 'latitude', 'longitude'])

Added new columns to "permits_raw":
['full_address', 'longitude', 'latitude']
Updated types in "permits_raw".
Created temporary table "tmp_permits_raw".
Dataframe columns do not match table "permits_raw".
Rearranged dataframe columns to match "permits_raw".
Copy successful on table "permits_raw".
Updated values in "permits_raw".
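
The log above suggests a common pattern: copy the dataframe into a temporary table, then update the target table from it by joining on the ID column. The sketch below only builds the SQL for that last step. It is an assumption about the approach, not the code behind `update_values`, and `build_update_sql` is a hypothetical helper.

```python
import re

# Accept only simple lowercase identifiers to avoid SQL injection via names
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def build_update_sql(table, tmp_table, id_col, columns):
    """Build a PostgreSQL UPDATE ... FROM statement joining on id_col."""
    for name in (table, tmp_table, id_col, *columns):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    # SET targets are unqualified; values come from the temporary table (s)
    assignments = ", ".join(f"{col} = s.{col}" for col in columns)
    return (
        f"UPDATE {table} AS t SET {assignments} "
        f"FROM {tmp_table} AS s WHERE t.{id_col} = s.{id_col};"
    )


print(build_update_sql(
    "permits_raw", "tmp_permits_raw", "pcis_permit_no",
    ["full_address", "latitude", "longitude"],
))
```

Updating from a temporary table applies all changes in one set-based statement, which is generally faster than issuing one `UPDATE` per row.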


#### Check transformations:

In [9]:
# Using ORM
permits_raw.fetch_data(sql="SELECT full_address, latitude, longitude FROM permits_raw LIMIT 10;")



|   | full_address | latitude | longitude |
|---|---|---|---|
| 0 | 1999 S AVENUE OF THE STARS 90067 | 34.05886 | -118.41642 |
| 1 | 10817 W HUSTON ST 91601 | 34.16004 | -118.36635 |
| 2 | 10910 W WALNUT DR 91040 | 34.25029 | -118.36814 |
| 3 | 1440 GAMBLE AVE 90744 | 33.794012 | -118.2435141 |
| 4 | 3554 S SAWTELLE BLVD 90066 | 34.01545 | -118.42271 |
| 5 | 8838 N SWINTON AVE 91343 | 34.23139 | -118.48606 |
| 6 | 12514 W OXNARD ST 91607 | 34.17926 | -118.40576 |
| 7 | 1453 W 56TH ST 90062 | 33.99157 | -118.30207 |
| 8 | 555 S GAYLEY AVE 90024 | 34.06886 | -118.44989 |
| 9 | 10944 W VENTURA BLVD 91604 | 34.13979 | -118.36893 |


In [10]:
# Access psql running in Docker to run query
!docker exec -i postgres_db psql -h localhost -U postgres -p 5432 \
permits -c 'SELECT full_address, latitude, longitude FROM permits_raw LIMIT 10;'

           full_address            | latitude  |  longitude   
-----------------------------------+-----------+--------------
 1999 S AVENUE OF THE STARS  90067 |  34.05886 |   -118.41642
 10817 W HUSTON ST 91601           |  34.16004 |   -118.36635
 10910 W WALNUT DR 91040           |  34.25029 |   -118.36814
 1440 GAMBLE AVE 90744             | 33.794012 | -118.2435141
 3554 S SAWTELLE BLVD 90066        |  34.01545 |   -118.42271
 8838 N SWINTON AVE 91343          |  34.23139 |   -118.48606
 12514 W OXNARD ST 91607           |  34.17926 |   -118.40576
 1453 W 56TH ST 90062              |  33.99157 |   -118.30207
 555 S GAYLEY AVE 90024            |  34.06886 |   -118.44989
 10944 W VENTURA BLVD 91604        |  34.13979 |   -118.36893
(10 rows)

