The goal of this notebook is to generate a set of employer-level characteristics from a UI wage records table. We will compute these characteristics for all employers with at least five unique employees in Kentucky for 2012Q4-2014Q3. We define each employer as a unique `employeeno` value in this dataset.

Here are the statistics we will find:

    - Total payroll
    - Average earnings per employee
    - Earnings per employee at the 75th percentile
    - Earnings per employee at the 25th percentile
    - Industry
    - Number of full quarter employees
    - Total payroll for full quarter employees
    - Average earnings per full quarter employee
    - Employment, Separation, and Hiring Growth Rates
    
Our final output from this notebook is a permanent table with employer-level information spanning 2012Q4-2014Q3 for each employer with at least five employees that appears in Kentucky's UI wage records.

In [None]:
# load packages
library(lubridate)
library(tidyverse)
library(DBI)
library(RPostgreSQL)

In [None]:
# connect to our database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

There are a few suspect `ssn` values that we will not include in this analysis. Therefore, we will create a temporary table of Kentucky's UI wage records while subsetting the data to the year/quarter combinations we will need to generate this table.

First, we will create temporary tables of all the UI wage record information from `ky_wages_sub` for the eight quarters 2012Q4-2014Q3, plus the quarter before our desired start (2012Q3), for nine tables in total. We need 2012Q3 because the growth rates for 2012Q4 compare that quarter's employment, separation, and hiring counts against the same counts one quarter earlier.
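
The quarter bookkeeping used throughout this notebook can also be derived rather than typed by hand. The following is a minimal sketch in base R; `shift_quarter` is a hypothetical helper introduced here for illustration and is not part of the notebook's pipeline. It parses the year and quarter out of each table name and computes the adjacent quarters.

```r
# Table names follow the notebook's q<quarter>_<year> convention.
tables <- c("q3_2012", "q4_2012", "q1_2013", "q2_2013", "q3_2013",
            "q4_2013", "q1_2014", "q2_2014", "q3_2014")
yr <- as.integer(substr(tables, 4, 7))
q  <- as.integer(substr(tables, 2, 2))

# Hypothetical helper: shift year/quarter pairs by k quarters.
shift_quarter <- function(year, qtr, k) {
  idx <- year * 4L + (qtr - 1L) + k   # count quarters on a single axis
  data.frame(year = idx %/% 4L, qtr = idx %% 4L + 1L)
}

prev_q <- shift_quarter(yr, q, -1L)   # quarter before each table
next_q <- shift_quarter(yr, q, 1L)    # quarter after each table
print(cbind(table = tables, prev_q))
```

Deriving the neighbors this way avoids keeping several hand-written vectors in sync when the date range changes.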

In [None]:
qry = "
select *
from ada_ky_20.ky_wages_sub
limit 5
"
dbGetQuery(con,qry)

In [None]:
table<-c("q3_2012", "q4_2012","q1_2013","q2_2013","q3_2013", "q4_2013", "q1_2014", "q2_2014","q3_2014")
as.integer(substr(table, 2,2))

2012Q4-2014Q3

In [None]:
table<-c("q3_2012", "q4_2012","q1_2013","q2_2013","q3_2013", "q4_2013", "q1_2014", "q2_2014","q3_2014")
year <- as.integer(substr(table, 4,7))
q <- as.integer(substr(table,2,2))
for(i in 1:9){
    qry = '
    create temp table "%s" as 
    select *
    from ada_ky_20.ky_wages_sub
    where qtr = %d and calendaryear = %d and employeeno is not null
    '
    full_qry = sprintf(qry, table[i], q[i], year[i])
    dbExecute(con, full_qry)
}

In [None]:
qry = "
select * from q3_2012 limit 5
"
dbGetQuery(con, qry)

Then, we will add columns to track if each `employeeno`/`coleridge_id` combination within a given quarter exists in the wage record table the quarter before and/or the quarter after. This will be important in tracking full-quarter employment, as well as hiring and separation numbers.

In [None]:
# initialize pre and post employment columns
new_cols <- c('pre_emp', 'post_emp')

for(col in new_cols){
    for(i in 1:9){
        qry='
        ALTER TABLE "%s" ADD COLUMN "%s" int
        '
        full_qry = sprintf(qry,table[i], col)
        dbExecute(con, full_qry)
    }
}

After the `pre_emp` and `post_emp` columns are initialized in each of these temporary tables, we can set these as indicator variables if the `coleridge_id`/`employeeno` combination that appeared in the UI wage records for the given year/quarter combination also existed in the previous and future quarter.
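
To make the flag logic concrete, here is a small sketch in base R on made-up records. It is an illustration of the idea only, not the notebook's SQL; the data values and the `job_key` helper are assumptions. A job is a `coleridge_id`/`employeeno` pair, and a flag is set to 1 only when the same pair appears in the neighboring quarter.

```r
# Toy wage records (made-up values): one row per person/employer/quarter.
wages <- data.frame(
  coleridge_id = c("a", "a", "a", "b", "b", "c"),
  employeeno   = c("E1", "E1", "E1", "E1", "E1", "E1"),
  calendaryear = c(2012, 2012, 2013, 2012, 2013, 2012),
  qtr          = c(3, 4, 1, 4, 1, 4)
)

# Hypothetical helper: a job is identified by person plus employer.
job_key <- function(d) paste(d$coleridge_id, d$employeeno)

current <- wages[wages$calendaryear == 2012 & wages$qtr == 4, ]
prev    <- wages[wages$calendaryear == 2012 & wages$qtr == 3, ]
nxt     <- wages[wages$calendaryear == 2013 & wages$qtr == 1, ]

# Matched jobs get 1; unmatched jobs stay NA, like rows an UPDATE never touches.
current$pre_emp  <- ifelse(job_key(current) %in% job_key(prev), 1L, NA_integer_)
current$post_emp <- ifelse(job_key(current) %in% job_key(nxt), 1L, NA_integer_)
print(current)
```

In this toy data, person `a` is flagged in both directions, `b` only has a following quarter, and `c` appears in 2012Q4 alone.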

In [None]:
# year and quarter immediately before each entry of `table`, in the same order
preYr = c(2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014)
preQ = c(2, 3, 4, 1, 2, 3, 4, 1, 2)

# for each quarterly table, flag jobs that also appear in the previous quarter
for(i in 1:9){
    # matched rows get pre_emp = 1; rows with no match keep pre_emp = NULL
    qry='
    UPDATE "%s" a SET pre_emp = 
        CASE WHEN b.wages is null THEN 0 ELSE 1 END
    FROM ada_ky_20.ky_wages_sub b
    WHERE b.calendaryear= %d AND b.qtr= %d --grab correct quarter
        AND a.coleridge_id=b.coleridge_id AND a.employeeno=b.employeeno --ensure same job
    '
    full_qry = sprintf(qry, table[i], preYr[i], preQ[i])
    dbExecute(con, full_qry)
    }

In [None]:
# see values of pre_emp
qry = "
select pre_emp, count(*)
from q4_2012 group by pre_emp
"
dbGetQuery(con, qry)

In [None]:
# year and quarter immediately after each entry of `table`, in the same order
postYr = c(2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014)
postQ = c(4, 1, 2, 3, 4, 1, 2, 3, 4)

# for each quarterly table, flag jobs that also appear in the following quarter
for(i in 1:9){
    # matched rows get post_emp = 1; rows with no match keep post_emp = NULL
    qry='
    UPDATE "%s" a SET post_emp = 
        CASE WHEN b.wages is NULL THEN 0 ELSE 1 END
    FROM ada_ky_20.ky_wages_sub b
    WHERE b.calendaryear= %d AND b.qtr= %d --grab correct quarter
        AND a.coleridge_id=b.coleridge_id AND a.employeeno=b.employeeno --ensure same job
    '
    full_qry = sprintf(qry, table[i], postYr[i], postQ[i])
    dbExecute(con, full_qry)
    }

In [None]:
# take a peek at one of the tables
qry <- "
select *
from q3_2012
limit 5
"
dbGetQuery(con, qry)

Now that we have pre and post-quarter employment indicators for each `coleridge_id`/`employeeno` combination, we can add hiring and separation indicators into these tables.
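
The rule applied in the next cells is that a missing flag signals a transition: no match in the previous quarter counts as a hire, and no match in the following quarter counts as a separation. A minimal base R sketch on made-up flag values (the `full_quarter` column is added here only for illustration):

```r
# Toy flags (made-up): NA means the job was not found in the adjacent quarter.
flags <- data.frame(
  coleridge_id = c("a", "b", "c"),
  pre_emp      = c(1L, NA, NA),
  post_emp     = c(1L, 1L, NA)
)

flags$hire <- as.integer(is.na(flags$pre_emp))    # absent last quarter
flags$sep  <- as.integer(is.na(flags$post_emp))   # absent next quarter
# Full-quarter employment requires presence on both sides.
flags$full_quarter <- as.integer(!is.na(flags$pre_emp) & !is.na(flags$post_emp))
print(flags)
```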

In [None]:
new_cols <- c('sep', 'hire')

for(col in new_cols){
    for(i in 1:9){
        qry='
        ALTER TABLE "%s" ADD COLUMN "%s" int
        '
        full_qry = sprintf(qry,table[i], col)
        dbExecute(con, full_qry)
    }
}

In [None]:
# take a peek at one of the tables
qry <- "
select *
from q3_2012
limit 5
"
dbGetQuery(con, qry)

In [None]:
for(i in 1:9){
    qry='
    UPDATE "%s" a SET sep = 
    CASE WHEN post_emp is null THEN 1 ELSE 0 END
    '
    full_qry = sprintf(qry,table[i])
    dbExecute(con, full_qry)
}

In [None]:
# look at different values of sep
qry = '
select count(*), sep
from q3_2012 group by sep
'

dbGetQuery(con, qry)

In [None]:
for(i in 1:9){
    qry='
    UPDATE "%s" a SET hire = 
    CASE WHEN pre_emp is null THEN 1 ELSE 0 END
    '
    full_qry = sprintf(qry,table[i])
    dbExecute(con, full_qry)
}

In [None]:
# look at one of the tables again
qry = '
select * 
from q3_2012 
limit 5'
dbGetQuery(con, qry)

In [None]:
# look at different values of hire
qry = '
select count(*), hire
from q3_2012 group by hire
'

dbGetQuery(con, qry)

### Aggregate by Employer

At this point, we have all the information we need to aggregate on the `employeeno` values. We will do these aggregations in separate steps, as they will require separate `WHERE` clauses. In the first, we will find values for all measures outside of the full-quarter employee-related ones.
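
As a rough cross-check of what the aggregation computes, the same statistics can be reproduced in base R on a toy table. This sketch uses made-up wages and assumes that `quantile(..., type = 1)` (the inverse empirical CDF) corresponds to PostgreSQL's `percentile_disc`, since both return an observed value rather than interpolating.

```r
# Toy job-level records for one quarter (made-up values).
jobs <- data.frame(
  employeeno   = c(rep("E1", 5), rep("E2", 2)),
  coleridge_id = c("a", "b", "c", "d", "e", "f", "g"),
  wages        = c(1000, 2000, 3000, 4000, 5000, 6000, 8000),
  hire         = c(0, 1, 0, 0, 1, 0, 0),
  sep          = c(0, 0, 1, 0, 0, 0, 1)
)

# One summary row per employer.
by_emp <- do.call(rbind, lapply(split(jobs, jobs$employeeno), function(d) {
  n <- length(unique(d$coleridge_id))
  data.frame(
    employeeno       = d$employeeno[1],
    num_employed     = n,
    total_earnings   = sum(d$wages),
    avg_earnings     = sum(d$wages) / n,
    bottom_25_pctile = unname(quantile(d$wages, 0.25, type = 1)),
    top_25_pctile    = unname(quantile(d$wages, 0.75, type = 1)),
    num_hire         = sum(d$hire),
    num_sep          = sum(d$sep)
  )
}))
print(by_emp)
```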

In [None]:
emp_tabs <- c("gen_q4_2012", "gen_q1_2013", "gen_q2_2013", "gen_q3_2013", "gen_q4_2013", 
              "gen_q1_2014", "gen_q2_2014", "gen_q3_2014")
for(i in 1:8){    
    qry = '
    create temp table "%s" as
    select employeeno,  naics, qtr, calendaryear, count(distinct(coleridge_id)) as num_employed,
    sum(wages)/count(distinct(coleridge_id)) as avg_earnings, sum(wages) as total_earnings,
    percentile_disc(0.25) within group (order by wages) as bottom_25_pctile,
    percentile_disc(0.75) within group (order by wages) as top_25_pctile,
    sum(hire) as num_hire, sum(sep) as num_sep
    from "%s"
    group by employeeno, naics, qtr, calendaryear
    '
    full_qry = sprintf(qry, emp_tabs[i], table[i+1])
    dbExecute(con, full_qry)
    }

In [None]:
# see these stats aggregated by employer for 2012Q4
qry = "
select * from gen_q4_2012 limit 5
"
dbGetQuery(con, qry)

In a separate table, we can find all of the statistics related to full-quarter employment.

In [None]:
full_tabs <- c("fq_q4_2012", "fq_q1_2013", "fq_q2_2013", "fq_q3_2013", "fq_q4_2013", 
               "fq_q1_2014", "fq_q2_2014", "fq_q3_2014")
for(i in 1:8){    
    qry = '
    create temp table "%s" as
    select employeeno, naics, qtr, calendaryear, count(distinct(coleridge_id)) as full_num_employed, 
    sum(wages)/count(distinct(coleridge_id)) as full_avg_earnings, sum(wages) as full_total_earnings
    from "%s"
    where post_emp = 1 and pre_emp = 1
    group by employeeno, naics, qtr, calendaryear
    '
    full_qry = sprintf(qry, full_tabs[i], table[i+1])
    dbExecute(con, full_qry)
    }

In [None]:
# see a full quarter employment table
qry = "
select * from fq_q4_2012 limit 5
"
dbGetQuery(con, qry)

Finally, we need information on these employers' hiring, employment, and separation numbers for the prior quarter to calculate their growth rates.

In [None]:
old_tabs <- c("pre_q4_2012", "pre_q1_2013", "pre_q2_2013", "pre_q3_2013", "pre_q4_2013", 
              "pre_q1_2014", "pre_q2_2014", "pre_q3_2014")
for(i in 1:8){    
    qry = '
    create temp table "%s" as
    select employeeno,  naics, qtr, calendaryear, 
    count(distinct(coleridge_id)) as num_employed_pre, sum(hire) as num_hire_pre, sum(sep) as num_sep_pre
    from "%s"
    group by employeeno, naics, qtr, calendaryear
    '
    full_qry = sprintf(qry, old_tabs[i], table[i])
    dbExecute(con, full_qry)
    }

In [None]:
# see quarter before information
qry = "select * from pre_q4_2012 limit 5"
dbGetQuery(con, qry)

Now that we have all the information we need in three tables, we can join them together based on the `employeeno` and `naics` values.
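
The join keeps every qualifying employer even when it has no full-quarter employees, filling the missing count with 0. A minimal base R sketch of that behavior on made-up tables (the values are assumptions; only the column names come from this notebook):

```r
# Toy employer-level tables (made-up values).
gen <- data.frame(
  employeeno   = c("E1", "E2", "E3"),
  naics        = c("11", "22", "33"),
  num_employed = c(12, 7, 3)
)
fq <- data.frame(
  employeeno        = "E1",
  naics             = "11",
  full_num_employed = 9
)

# Keep employers with at least five employees, then left join the
# full-quarter measures; all.x = TRUE keeps employers with no match.
emp <- merge(gen[gen$num_employed >= 5, ], fq,
             by = c("employeeno", "naics"), all.x = TRUE)

# No full-quarter row means zero full-quarter employees.
emp$full_num_employed[is.na(emp$full_num_employed)] <- 0
print(emp)
```

Here `E3` is dropped by the size filter, while `E2` is kept with a full-quarter count of 0.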

In [None]:
tabs <- c("emp_q4_2012", "emp_q1_2013", "emp_q2_2013", "emp_q3_2013", "emp_q4_2013", "emp_q1_2014", "emp_q2_2014", "emp_q3_2014")
for(i in 1:8){
    qry = '
    create temp table "%s" as
    select a.*, 
    case 
        when b.full_num_employed is null then 0 
        else b.full_num_employed end as full_num_employed,
    b.full_avg_earnings, b.full_total_earnings
    from "%s" a
    left join "%s" b
    on a.employeeno = b.employeeno and a.qtr = b.qtr and a.naics = b.naics and a.calendaryear = b.calendaryear
    where a.num_employed >= 5
    '
    full_qry = sprintf(qry, tabs[i], emp_tabs[i], full_tabs[i])
    dbExecute(con, full_qry)
    }

In [None]:
# see joined full quarter and current quarter measures
qry = "select * from emp_q3_2014 limit 5"
dbGetQuery(con, qry)

To calculate the hiring, separation, and employment growth rates, we will use the following function from <a href='https://academic.oup.com/qje/article-abstract/107/3/819/1873525'>Davis and Haltiwanger (1992)</a> to calculate 1) employment growth rate: `emp_rate`; 2) separation growth rate: `sep_rate`; 3) hire growth rate: `hire_rate`.

$$ g_{et}=\frac{2(x_{et} - x_{e,t-1})}{(x_{et} + x_{e,t-1})} $$

In this function, $g_{et}$ represents employment/separation/hire growth rate of employer $e$ at time $t$. $x_{et}$ and $x_{e,t-1}$ are employer $e$'s employment/separation/hire at time $t$ and $t-1$, respectively. According to Davis and Haltiwanger (1992):

"*This growth rate measure is symmetric about zero, and it lies in the closed interval [-2,2] with deaths (births) corresponding to the left (right) endpoint. A virtue of this measure is that it facilitates an integrated treatment of births, deaths, and continuing establishments in the empirical analysis.*"

In other words, a firm with $ g_{et} = 2$ is a new firm, while a firm with $ g_{et} = -2$ is a firm that exited the economy.
    
> Why do the two endpoints represent firms' deaths and births? Calculate the value of $g_{et}$ when $x_{et}=0$ and when $x_{e,t-1}=0$ and see what you get.

In practice, we will apply this formula for every `employeeno`. The one exception is an employer with no hires (or no separations) in both the current and previous quarters: the denominator would be zero, so we assign a rate of 0 instead of triggering a divide-by-zero error.
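
The formula and its zero-denominator convention can be written as a small R function. This is a sketch for checking values by hand; `dhs_rate` is a hypothetical name, and the notebook itself computes the rates in SQL.

```r
# Davis-Haltiwanger growth rate with the zero-denominator convention above.
dhs_rate <- function(x_t, x_prev) {
  denom <- x_t + x_prev
  # When both quarters are zero the rate is set to 0 by convention.
  ifelse(denom == 0, 0, 2 * (x_t - x_prev) / denom)
}

dhs_rate(12, 8)   # employer that grew from 8 to 12: 0.4
dhs_rate(0, 0)    # no activity in either quarter: 0
```

Because the function is vectorized, it can be applied to whole columns, for example `dhs_rate(df$num_hire, df$num_hire_pre)` on a hypothetical data frame `df`.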

In [None]:
final_table <- c("all_q4_2012", "all_q1_2013", "all_q2_2013", "all_q3_2013", "all_q4_2013", "all_q1_2014", "all_q2_2014", "all_q3_2014")
for(i in 1:8){
    qry = '
    create temp table "%s" as
    select a.employeeno, a.naics, a.qtr, a.calendaryear, a.num_employed, a.avg_earnings, a.total_earnings, 
    a.bottom_25_pctile, a.top_25_pctile, a.full_num_employed, a.full_avg_earnings, a.full_total_earnings,
        (2.0 * (a.num_employed - b.num_employed_pre))/(a.num_employed + b.num_employed_pre) as emp_rate,
    case
        when b.calendaryear = 2012 and b.qtr = 3 then null
        when a.num_hire = 0 and b.num_hire_pre = 0 then 0
        else (2.0 * (a.num_hire - b.num_hire_pre))/(a.num_hire + b.num_hire_pre) end as hire_rate, 
    case
        when a.num_sep = 0 and b.num_sep_pre = 0 then 0
        else (2.0 * (a.num_sep - b.num_sep_pre))/(a.num_sep + b.num_sep_pre) end as sep_rate
    from "%s" a
    left join "%s" b
    on a.employeeno = b.employeeno
    '
    full_qry = sprintf(qry, final_table[i], tabs[i], old_tabs[i])
    dbExecute(con, full_qry)
    }

In [None]:
qry <- "
select * from all_q3_2013 limit 5
"
dbGetQuery(con, qry)

Since these eight tables have identical columns, we can simply union them to create our final output: `employers_2013`.

    create table ada_ky_20.employers_2013 as
    select * from all_q4_2012
    union all
    select * from all_q1_2013
    union all
    select * from all_q2_2013
    union all
    select * from all_q3_2013
    union all
    select * from all_q4_2013
    union all
    select * from all_q1_2014
    union all
    select * from all_q2_2014
    union all
    select * from all_q3_2014
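
The union statement above can also be assembled from the table names instead of being typed out. A minimal sketch in base R; it only builds and prints the SQL string, and the commented `dbExecute` call assumes the `con` connection opened at the top of the notebook.

```r
final_table <- c("all_q4_2012", "all_q1_2013", "all_q2_2013", "all_q3_2013",
                 "all_q4_2013", "all_q1_2014", "all_q2_2014", "all_q3_2014")

# One SELECT per quarterly table, glued together with UNION ALL.
selects <- sprintf("select * from %s", final_table)
union_qry <- paste0(
  "create table ada_ky_20.employers_2013 as\n",
  paste(selects, collapse = "\nunion all\n")
)
cat(union_qry)

# To run it against the database (assumes `con` from earlier in the notebook):
# dbExecute(con, union_qry)
```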