<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<center> Julia Lane, Benjamin Feder, Angela Tombari, Ekaterina Levitskaya, Tian Lou, Lina Osorio-Copete. </center> 

# Using Employment and Employer-Level Measures to Understand Kentucky's Labor Market

## Introduction

While in the Data Exploration notebook we focused primarily on understanding our cohort's earnings, here we will first analyze two measures of stable employment. From there, we will analyze some employer-level measures created in a supplementary notebook to get a better sense of Kentucky's labor market and how employers of our cohort fit into the overall labor market.

## Learning Objectives

This notebook features two prominent segments:

1. Different measures of stable employment
1. Labor market interactions

These two sections will have two different units of analysis: the first will focus directly on the individuals in our cohort while the second will focus on employers. 

Before we start looking at their employers, a logical prelude would be taking a deeper dive into our cohort's employment. Here, we will walk through two different measures of stable employment within a cohort and see if their earnings differed significantly from those without stable employment. From there, we will load in our employer-level measures file and look at some characteristics we can use to get a better sense of Kentucky's labor market at the times of potential employment. In parallel with our cohort, the employer-level file contains information on all of Kentucky's employers in the UI wage records data from 2012Q4-2014Q3, the potential quarters of employment for our cohort when limiting earnings to one year post graduation.

We would also like to discover any distinguishing factors between the overall labor market in Kentucky and the employers that hired members of our 2013 cohort. Ultimately, we want to gain a better understanding of the demand side when it comes to employment opportunities for our Kentucky graduates.

Similar to the Data Exploration notebook, we will pose a few direct questions we will use to answer our ultimate question: **How can we use labor market interactions to help explain employment outcomes of Kentucky graduates?**

Before we do so, we need to load our external R packages and connect to the database.

### R Setup

In [None]:
#database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)

# scaling data
library(scales)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

## Stable Employment Measures

As discussed above, we will spend some time in this section taking a look at our 2013 cohort's employment outcomes using two different measures of stable employment. We will examine how average quarterly earnings differ for individuals who satisfy these definitions of stable employment. We have listed the two questions we will seek to answer in this section below:

1. How many of our graduates found stable employment within Kentucky? What percentage is this of our total cohort?
1. What were the average quarterly earnings within these stable jobs?

Let's first load our table matching our 2013 cohort to their employment outcomes into R.

### Stable Employment Measures #1 and #2: Creating the Metrics

In [None]:
# read table into R
qry = "
select *
from ada_ky_20.cohort_wages
"
df_wages = dbGetQuery(con, qry)

In [None]:
# take a look at df_wages
glimpse(df_wages)

<font color=green><h3>Question 1: How many graduates found stable employment? What percentage is this of our total cohort? </h3></font> 

How would you define stable employment? Keep in mind that stable employment is a subjective measure and can take on many definitions given the context of your analysis. Here are the two definitions of stable employment we will look at: 

1. Those with positive earnings **all four quarters** after exit with the **same employer**.
2. Those that experienced full-quarter employment. By full-quarter employment, an individual had earnings in **quarters t-1, t, and t+1** from the **same employer**.

> Since stable employment can be defined in many different ways, if you choose to analyze stable employment within a specific cohort (highly recommended), make sure you clearly state your definition of stable employment.

#### Stable Employment Measure #1: Positive earnings all four quarters with the same employer

This calculation is relatively simple because we only need to manipulate `df_wages`. We will approach this calculation by counting the number of quarters each individual (`coleridge_id`) received wages from each employer (`employeeno`), and then filter for just those `coleridge_id`/`employeeno` combinations that appear in all four potential quarters of employment.

> To see if they appeared in all four quarters, we can count the number of distinct `job_date` values each `coleridge_id`/`employeeno` combination appeared in the UI wage records.
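Before running this on the real table, it can help to see the counting logic on a handful of made-up rows. The sketch below assumes a toy data frame `toy_wages` (a hypothetical name, with invented values) that mimics the `coleridge_id`, `employeeno`, `job_date`, and `wages` columns of `df_wages`.

```r
library(dplyr)

# the four potential quarters of employment, as quarter start dates
quarters <- as.Date(c("2013-10-01", "2014-01-01", "2014-04-01", "2014-07-01"))

# toy wage records (hypothetical values): one row per person/employer/quarter
toy_wages <- data.frame(
  coleridge_id = c(rep("p1", 4), rep("p2", 4), rep("p3", 2)),
  employeeno   = c(rep("e1", 4), "e1", "e1", "e2", "e2", "e3", "e3"),
  job_date     = c(quarters, quarters, quarters[1:2]),
  wages        = c(5000, 5200, 5100, 5300, 3000, 3100, 4000, 4200, 2500, 2600),
  stringsAsFactors = FALSE
)

# count distinct quarters per person/employer pair and keep pairs seen in all four
stable_pairs <- toy_wages %>%
  group_by(coleridge_id, employeeno) %>%
  summarize(n_quarters = n_distinct(job_date)) %>%
  ungroup() %>%
  filter(n_quarters == 4)

print(stable_pairs)
```

In this toy data, `p2` works in all four quarters but switches employers halfway through, so only the `p1`/`e1` pair survives the filter.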

In [None]:
# see if we can calculate stable employment measure #1
df_wages %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    head()                          #does this data frame look the way we think it should? (functions as a gut check)

From here, we can add one line of code `summarize(n_distinct(coleridge_id))` to calculate the number of individuals in our cohort that experienced this measure of stable employment.

In [None]:
## calculate number of individuals in our cohort that experienced stable employment measure #1
df_wages %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    summarize(n_distinct(coleridge_id))

If you are curious about the number of members of our cohort that found stable employment (according to this definition) with multiple employers, you can find it by counting the number of times each `coleridge_id` appears after filtering the data frame to only those `coleridge_id`/`employeeno` combinations that appeared in all four quarters of our time frame of interest.

In [None]:
# count the number who experienced stable employment measure #1 with multiple employers
df_wages %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)
    ) %>%                                        #same code as above, counts number of quarters of employment by person/employer combination
    ungroup() %>%
    filter(n_quarters==4) %>%                    #filters so that only person/employer combinations that persist across all 4 quarters are included
    group_by(coleridge_id) %>%
    summarize(n=n()) %>%                         #how many person/employer combinations meet this criteria for each person
    ungroup() %>%
    filter(n>1) %>%                              #restricts to only those people with at least 2 employers in all four quarters
    summarize(num=n())

We can now calculate the percentage of our original cohort that experienced stable employment fairly easily. We just need to load our original cohort into R to provide a frame of reference.

>As a reminder, `df_wages` includes only those individuals in our cohort who found employment in Kentucky during at least one of the four quarters post-graduation.

In [None]:
# select all rows from our cohort
qry <- "
SELECT *
FROM ada_ky_20.cohort
"

#read into R as df
df <- dbGetQuery(con,qry)

To make the code a bit more readable, we can save the number of individuals in our cohort that satisfied this definition of stable employment first before dividing it by the number of members of our original cohort.

In [None]:
# save to calculate stable employment percentage
stable <- df_wages %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    summarize(num = n_distinct(coleridge_id))

# percentage employed all four quarters
percent((stable$num/n_distinct(df$coleridge_id)), .01)

This percentage may not mean anything to us yet.  Is this bad?  Is this good?  One thing we can look at is how this changes the story in the context of our research question. The Postsecondary Feedback Report uses any UI-covered wage-earning employment within the given federal fiscal year (FFY).  We can do something similar here to see what employment looks like if we counted any UI-covered wage-earning employment as employment within this time frame.

In [None]:
# save to calculate any employment percentage
any_employment <- df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    ungroup() %>%
    filter(total_wages > 0) %>%
    summarize(num = n_distinct(coleridge_id))

# percentage with any UI-covered employment in the time frame
percent((any_employment$num/n_distinct(df$coleridge_id)), .01)

This presents a stark contrast for us in terms of the story we are telling. It would be interesting to assess the difference between any employment and the different versions of stable employment several years post-graduation.  The Postsecondary Feedback Report focuses on employment 3-years out as the first reportable employment percentage under the assumption that employment stabilizes over time.

>On your own, you can see how this story changes if you restrict to employment in all four quarters regardless of employer.  What does this suggest about employment post-graduation? Does the literature support the stories that we are seeing?
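As a rough illustration of how the choice of definition moves the percentage, the sketch below compares three definitions on invented data. `toy_cohort` and `toy_wages` are hypothetical stand-ins for `df` and `df_wages`, and the `wages > 0` filter is an assumption added here to make "positive earnings" explicit.

```r
library(dplyr)

quarters <- as.Date(c("2013-10-01", "2014-01-01", "2014-04-01", "2014-07-01"))

# hypothetical cohort of four graduates; p4 never appears in the wage records
toy_cohort <- data.frame(coleridge_id = c("p1", "p2", "p3", "p4"),
                         stringsAsFactors = FALSE)

# toy wage records (hypothetical values)
toy_wages <- data.frame(
  coleridge_id = c(rep("p1", 4), rep("p2", 4), rep("p3", 2)),
  employeeno   = c(rep("e1", 4), "e1", "e1", "e2", "e2", "e3", "e3"),
  job_date     = c(quarters, quarters, quarters[1:2]),
  wages        = c(5000, 5200, 5100, 5300, 3000, 3100, 4000, 4200, 2500, 2600),
  stringsAsFactors = FALSE
) %>%
  filter(wages > 0)   # assumption: only positive earnings count as employment

n_cohort <- n_distinct(toy_cohort$coleridge_id)

# definition A: any employment in the time frame
any_emp <- toy_wages %>%
  summarize(num = n_distinct(coleridge_id))

# definition B: employed in all four quarters, regardless of employer
four_q_any <- toy_wages %>%
  group_by(coleridge_id) %>%
  summarize(n_quarters = n_distinct(job_date)) %>%
  filter(n_quarters == 4) %>%
  summarize(num = n_distinct(coleridge_id))

# definition C: employed in all four quarters with the same employer
four_q_same <- toy_wages %>%
  group_by(coleridge_id, employeeno) %>%
  summarize(n_quarters = n_distinct(job_date)) %>%
  ungroup() %>%
  filter(n_quarters == 4) %>%
  summarize(num = n_distinct(coleridge_id))

# share of the cohort under each definition
shares <- c(any = any_emp$num,
            four_quarters = four_q_any$num,
            same_employer = four_q_same$num) / n_cohort
print(shares)
```

Each definition is stricter than the one before it, so the shares can only stay the same or shrink from A to C.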

Now, let's see how the percentage changes when we employ (pun intended) our second definition of stable employment.

#### Stable Employment Measure #2: Full-Quarter Employment

Finding full-quarter employment is a bit more complicated. Instead of solely using R, we will venture back into SQL, since we will need to find earnings for our cohort for an extended period of time relative to the information available in `df_wages`. To ensure that an individual experienced full-quarter employment with an employer, they must have earnings from the same employer in times t-1, t, and t+1, with t representing a quarter.  


**Example: Data Needed to Assess Full-Quarter (FQ) Employment for Person1/EmployerA in Each of the 4 Quarters Post-Graduation**

|Person/Employer Combination|High Degree |FQ YearQtr |t-1|t|t+1|
|---|---|---|---|---|---|
|_Person 1/Employer A_ |_2013 Q3_ |_2013 Q4_ |<font color=green>2013 Q3</font> |**2013 Q4** |2014 Q1 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q1_ |2013 Q4 |**2014 Q1** |2014 Q2 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q2_ |2014 Q1 |**2014 Q2** |2014 Q3 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q3_ |2014 Q2 |**2014 Q3** |<font color=green>2014 Q4</font> |



Our previous Kentucky wage data frame, `df_wages`, included employment for the 4 quarters post-graduation. As can be seen in the table above, calculating full quarter employment for the same four quarter span requires two additional quarters of wage information.  This requires us to extend our dataframe to include the employment quarter of graduation as well as one additional quarter after our final quarter of interest.  In the example above, this means including employment in <font color=green>2013 Q3</font> and extending employment to encompass <font color=green>2014 Q4</font>. 

>`df_wages` had four potential quarters of employment per person.  Our extended wage data frame will have 6 possible quarters of employment per person to assess full-quarter employment for the year following graduation.

Recall that in our permanent table `ky_wages_sub` in the `ada_ky_20` schema, created in the Data Exploration notebook, we limited the maximum `calendaryear`/`qtr` combination to 2014Q4. Let's verify that we can combine our `cohort` table with `ky_wages_sub` to find these six quarters of employment history for each `coleridge_id` by making sure that the most recent wage record in `df_wages` at the very most is from 2014Q3.

In [None]:
# find most recent date in df_wages
max(df_wages$job_date)

This is great news! That means we do not need to recreate a version of the `ui_wage_record` table in the `kystats_2020` schema with a relative "job date" in date format. Let's take a quick look at `ky_wages_sub` before joining it to `cohort`.

In [None]:
# look at ky_wages_sub
qry = '
select *
from ada_ky_20.ky_wages_sub
limit 5
'
dbGetQuery(con, qry)

To find earnings outcomes for these six quarters, we will follow a nearly identical process to that of creating our `cohort_wages` table, except instead of limiting employment outcomes to just the year following graduation, we will extend it to include all employment for 15 months post graduation, as well as in the quarter of graduation. We have already created the permanent table `full_cohort_wages` for you, and the code to do so is featured below.

    create table ada_ky_20.full_cohort_wages as
    select a.coleridge_id, a.degreegroup, a.degreerank, a.kpeds_major1, a.kpeds_major1_cip, a.kpeds_instname, a.kpeds_institution, a.kpeds_sector, a.deg_class, b.industry, b.majorindustry, b.wages, b.fein, b.employeeno, a.deg_date, b.job_date, (b.job_date - a.deg_date) as time_after_grad
    from ada_ky_20.cohort a
    left join ada_ky_20.ky_wages_sub b
    on a.coleridge_id = b.coleridge_id
    where ((a.deg_date + '15 month'::interval) >= b.job_date) and (a.deg_date <= b.job_date)

Now that we have all potential earnings covered in the UI wage records for the six quarters of interest relative to each `coleridge_id`, we can calculate full-quarter employment. To do so, we will use three copies of the same table, and then use a `WHERE` clause to make sure we are identifying the same individual and employer combination across three consecutive quarters.

The `\'3 month\'::interval` code can be used when working with dates (`job_date` in this case), as it will match to exactly three months from the original date. Before or after the original date can be indicated with `+` or `-` signs.
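For intuition, the same three-month shift can be reproduced in base R. This is only a sketch on a made-up date (`example_date` is a hypothetical name, not a value taken from the tables), using `seq()` with a `by = "3 months"` step as a rough analogue of the PostgreSQL interval arithmetic.

```r
# hypothetical quarter start date, chosen only for illustration
example_date <- as.Date("2013-10-01")

# one quarter later: second element of a two-step sequence moving forward
next_quarter <- seq(example_date, by = "3 months", length.out = 2)[2]

# one quarter earlier: a negative step moves backwards in time
prev_quarter <- seq(example_date, by = "-3 months", length.out = 2)[2]

print(next_quarter)
print(prev_quarter)
```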

In [None]:
# see if we can calculate full-quarter employment
qry = '
select a.coleridge_id, a.employeeno, a.job_date, a.wages
from ada_ky_20.full_cohort_wages a, ada_ky_20.full_cohort_wages b, ada_ky_20.full_cohort_wages c
where a.coleridge_id = b.coleridge_id and a.employeeno = b.employeeno and a.job_date = (b.job_date - \'3 month\'::interval)::date and 
      a.coleridge_id = c.coleridge_id and a.employeeno = c.employeeno and a.job_date = (c.job_date + \'3 month\'::interval)::date and
      a.wages > 0 and b.wages > 0 and c.wages > 0
order by a.coleridge_id, a.job_date
limit 5
'
dbGetQuery(con, qry)

The query above will only select earnings for quarters where an individual experienced full-quarter employment with an employer, and due to the `WHERE` clause, it will only select rows in our original four quarters of interest post graduation, since it is impossible to verify if someone achieved full-quarter employment for either the earliest or latest quarter of employment history.
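The three-way self-join can also be sketched in R on toy data. The example below is an illustration under stated assumptions rather than a translation of the query: quarters are encoded as an integer index (`q_index`, a hypothetical column computed as `calendaryear * 4 + (qtr - 1)`) instead of dates, and each person/employer/quarter combination appears at most once.

```r
library(dplyr)

# toy records (hypothetical values): p1 has six consecutive quarters with e1,
# p2 has a gap in 2014Q2 with e2
toy_full <- data.frame(
  coleridge_id = c(rep("p1", 6), rep("p2", 3)),
  employeeno   = c(rep("e1", 6), rep("e2", 3)),
  calendaryear = c(2013, 2013, 2014, 2014, 2014, 2014, 2013, 2014, 2014),
  qtr          = c(3, 4, 1, 2, 3, 4, 4, 1, 3),
  wages        = c(4000, 5000, 5000, 5000, 5000, 5000, 3000, 3000, 3000),
  stringsAsFactors = FALSE
) %>%
  mutate(q_index = calendaryear * 4 + (qtr - 1)) %>%
  filter(wages > 0)

# keys of each record, used to look up neighbouring quarters
keys <- toy_full %>% select(coleridge_id, employeeno, q_index)

full_quarter <- toy_full %>%
  # t-1: shift the earlier quarter's index forward by one so it lines up with t
  inner_join(keys %>% mutate(q_index = q_index + 1),
             by = c("coleridge_id", "employeeno", "q_index")) %>%
  # t+1: shift the later quarter's index back by one so it lines up with t
  inner_join(keys %>% mutate(q_index = q_index - 1),
             by = c("coleridge_id", "employeeno", "q_index")) %>%
  arrange(coleridge_id, q_index)

print(full_quarter)
```

Only the middle four quarters for `p1` remain. The first and last quarters drop out because one neighbour is missing, and `p2` never has three consecutive quarters with the same employer.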

Now that we have made sure that we can calculate full-quarter employment, let's find all cases of full-quarter employment in `full_cohort_wages` and read the resulting table into R.

In [None]:
# find all individuals who achieved full-quarter employment in a given quarter
# the first two rows of the where clause are designed to ensure the person/employer combination existed in:
# the quarter after  (t+1, table aliased as b) and 
# the quarter before (t-1, table aliased as c)
# relative to the target quarter (t, table aliased as a)

qry = '
select a.coleridge_id, a.employeeno, a.job_date, a.wages
from ada_ky_20.full_cohort_wages a, ada_ky_20.full_cohort_wages b, ada_ky_20.full_cohort_wages c
where a.coleridge_id = b.coleridge_id and a.employeeno = b.employeeno and a.job_date = (b.job_date - \'3 month\'::interval)::date and 
      a.coleridge_id = c.coleridge_id and a.employeeno = c.employeeno and a.job_date = (c.job_date + \'3 month\'::interval)::date and
      a.wages > 0 and b.wages > 0 and c.wages > 0
order by a.coleridge_id, a.job_date
'
cohort_full <- dbGetQuery(con, qry)

Now that we have all records of full-quarter employment, along with their earnings in the quarter, we can simply calculate the number of individuals in our cohort who experienced our second measure of stable employment in at least one quarter.

In [None]:
# calculate number of individuals in our cohort that experienced full-quarter employment
cohort_full %>%
    summarize(n=n_distinct(coleridge_id))

We can then find the percentage of individuals in our cohort that achieved this measure in at least one quarter by following a similar process as we did for our first definition of stable employment.

>Reminder: `df` is the name we assigned to our full cohort table of graduates when calculating percentage of graduates meeting the criteria for stable employment measure #1.

In [None]:
# save number of individuals in our cohort that experienced at least one quarter of full-quarter employment
full_n <- cohort_full %>%
    summarize(n=n_distinct(coleridge_id))

# calculate proportion of people in our cohort that experienced at least one quarter of full-quarter employment
percent((full_n$n/n_distinct(df$coleridge_id)), .01)

We can also calculate the percentage of individuals in our cohort that experienced full quarter employment with the same employer in all four quarters.

In [None]:
# graduates with full-quarter employment in all four quarters
cohort_full %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)) %>%
    ungroup() %>%
    filter(n_quarters == 4) %>%
    summarize(n=n_distinct(coleridge_id))

And then we can calculate this percentage.

In [None]:
# save graduates with full-quarter employment in all four quarters into a new data frame named full_4
full_4 <- cohort_full %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)) %>%
    ungroup() %>%
    filter(n_quarters == 4) %>%
    summarize(n=n_distinct(coleridge_id))

percent((full_4$n/n_distinct(df$coleridge_id)), .01)

If you're curious, you can see if anyone experienced full quarter employment all four quarters with multiple employers as well.

In [None]:
# check count of individuals with more than one employer providing full quarter employment across all four quarters
cohort_full %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)) %>%
    ungroup() %>%
    filter(n_quarters == 4) %>%
    group_by(coleridge_id) %>%
    summarize(n_emps = n_distinct(employeeno)) %>%
    filter(n_emps > 1) %>%
    summarize(n=n_distinct(coleridge_id))

Are you surprised at the difference in percentages for our two measures of stable employment?

<font color=red><h3> Checkpoint 1: Sector-level Analysis </h3></font> 

Find the most common sectors of the individuals that satisfied either one of these definitions of stable employment. Are you surprised? Do you think this may vary by location?

In [None]:
# complete checkpoint


### Stable Employment Measures #1 and #2: Exploring Quarterly Earnings

<font color=green><h3>Question 2: What were the average wages earned within these stable jobs by quarter?</h3></font> 

Now let's see if earnings differed for our cohort when comparing our two measures of stable employment. 

#### Stable Employment Measure #1: Average Quarterly Earnings

We'll start with our first measure of those that had earnings with the same employer for all four quarters within our time frame. First, we will isolate all `coleridge_id`/`employeeno` combinations that satisfied this stable employment measure, and then filter our original earnings data frame `df_wages` to just include wages for these combinations.

In [None]:
# all coleridge_id and employeeno values from stable employment measure #1 and save to stable_emp_1
stable_emp_1 <- df_wages %>%
    group_by(coleridge_id, employeeno) %>%
    summarize(n_quarters = n_distinct(job_date)
    ) %>%
    ungroup() %>%
    filter(n_quarters==4) %>%
    select(-n_quarters)

# see stable_emp_1
head(stable_emp_1)

> The code used to create `stable_emp_1` is copied from the code used earlier to isolate those who had earnings with the same employer for all four quarters within our time frame. The final line (`select(-n_quarters)`) drops the variable containing the number of employed quarters (which is always four given the `filter` condition above).

Now, we just need to join rows in `df_wages` for those with the same `employeeno` and `coleridge_id` combinations as in `stable_emp_1`, and from there we can find the average quarterly earnings.

>Note: We do not need to match on `job_date` here because this metric is based on all four quarters of employment with the same person/employer combination.

In [None]:
# find average quarterly earnings by employer for these individuals
# as noted above, a small subset of people meet this stable employment criteria for multiple employers

df_wages %>% 
    inner_join(stable_emp_1, by = c('coleridge_id', 'employeeno')) %>%
    summarize(mean_wages = mean(wages))

In [None]:
# if you are interested in what the average quarterly earnings overall were for individuals across all jobs that met the criteria for stable employment metric #1,
# you would need to account for individuals who met the stable employment criteria for multiple employers.

df_wages %>% 
    inner_join(stable_emp_1, by = c('coleridge_id', 'employeeno')) %>%
    group_by(coleridge_id, job_date)%>%
    summarize(quarterly_wages = sum(wages)) %>% 
    ungroup() %>%
    summarize(mean_wages = mean(quarterly_wages))

The code chunk below is just an example of a gut check that can be run to assess the above snippet of code.  If we restrict our overall dataframe to those that met this stable employment metric definition for only one employer, the two code chunks should match.  Feel free to ignore this gut check.

In [None]:
##restricting to those that meet the stable employment metric #1 for ONLY one employer
df_wages_one_emp <- df_wages %>%
    inner_join(stable_emp_1, by = c('coleridge_id', 'employeeno')) %>%
    group_by(coleridge_id) %>%
    mutate(n_employers = n_distinct(employeeno)) %>%
    filter(n_employers ==1) %>%
    ungroup() 

#first code chunk from above (treats each row as the unit of analysis)
df_wages_one_emp %>% 
    inner_join(stable_emp_1, by = c('coleridge_id', 'employeeno')) %>%
    summarize(mean_wages = mean(wages))

#second code chunk from above (forces quarterly wages to be the sum of wages in that quarter per person)
#if each person only meets the criteria for one employer, results will be the same
df_wages_one_emp %>% 
    inner_join(stable_emp_1, by = c('coleridge_id', 'employeeno')) %>%
    group_by(coleridge_id, job_date)%>%
    summarize(quarterly_wages = sum(wages)) %>%
    ungroup() %>%
    summarize(mean_wages = mean(quarterly_wages))
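To see why the per-job and per-person averages for measure #1 can differ, consider a small invented example. `toy_stable` is a hypothetical data frame in which one person (`p1`) holds two stable jobs at once and another (`p2`) holds one.

```r
library(dplyr)

quarters <- as.Date(c("2013-10-01", "2014-01-01", "2014-04-01", "2014-07-01"))

# toy stable-employment records (hypothetical values)
toy_stable <- data.frame(
  coleridge_id = c(rep("p1", 8), rep("p2", 4)),
  employeeno   = c(rep("e1", 4), rep("e2", 4), rep("e3", 4)),
  job_date     = c(quarters, quarters, quarters),
  wages        = c(rep(1000, 8), rep(4000, 4)),
  stringsAsFactors = FALSE
)

# average per person/employer/quarter row (each job counted separately)
per_job <- toy_stable %>%
  summarize(mean_wages = mean(wages))

# average per person/quarter (wages from simultaneous stable jobs are summed first)
per_person <- toy_stable %>%
  group_by(coleridge_id, job_date) %>%
  summarize(quarterly_wages = sum(wages)) %>%
  ungroup() %>%
  summarize(mean_wages = mean(quarterly_wages))

print(per_job)
print(per_person)
```

Here the per-job average is 2000 while the per-person average is 3000, because `p1`'s two 1000-dollar jobs are combined into a single 2000-dollar quarter before averaging.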

#### Stable Employment Measure #2: Average Quarterly Earnings

For our second stable employment measure, we have already identified `coleridge_id`/`employeeno`/`job_date` combinations for full-quarter employment. We will use a similar strategy in filtering `df_wages` before finding the average quarterly earnings for quarters in which members of our cohort experienced full-quarter employment, starting with our data frame `cohort_full`.

In [None]:
# see cohort_full
head(cohort_full)

Given that we have already found these individuals' wages for each quarter of full-quarter employment, we can simply find the average of all the values in the `wages` column.

> Note: `cohort_full` contains all instances of full-quarter employment, not four quarters of full-quarter employment for a single employer.

In [None]:
# find average quarterly earnings for stable employment measure 2
cohort_full %>%
    summarize(mean_wages = mean(wages))

The code above gives us an approximation of the average quarterly earnings per quarter/employer/graduate combination meeting the full-quarter employment criteria. If your interest is in how much a person who meets full-quarter employment makes on average in a quarter, you would need to dig deeper and see how many people meet these criteria for multiple employers in the same quarter (a corner case that is not handled above).  If the corner case exists, wages would need to be summed, similar to the process shown for stable employment measure #1 above.
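One possible way to check for this corner case is to count person/quarter combinations with more than one full-quarter employer. The sketch below uses `toy_cohort_full`, a hypothetical stand-in for `cohort_full` with invented values.

```r
library(dplyr)

# toy full-quarter records (hypothetical values): p1 has two full-quarter
# employers in 2013Q4 and one in 2014Q1
toy_cohort_full <- data.frame(
  coleridge_id = c("p1", "p1", "p1", "p2"),
  employeeno   = c("e1", "e2", "e1", "e3"),
  job_date     = as.Date(c("2013-10-01", "2013-10-01", "2014-01-01", "2013-10-01")),
  wages        = c(3000, 1000, 3000, 4000),
  stringsAsFactors = FALSE
)

# number of distinct full-quarter employers per person and quarter
emps_per_quarter <- toy_cohort_full %>%
  group_by(coleridge_id, job_date) %>%
  summarize(n_emps = n_distinct(employeeno)) %>%
  ungroup()

# person/quarter combinations where the corner case occurs
n_multi <- emps_per_quarter %>%
  filter(n_emps > 1) %>%
  nrow()

print(n_multi)
```

If the count is zero on the real data, the row-level average already equals the per-person average and no summing is needed.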

<font color=red><h3> Checkpoint 2: Wages for 4 Quarters of Full-Quarter Employment</h3></font> 

Find the distribution of wages (represented by a histogram) for those in our cohort that experienced full-quarter employment for all four quarters of interest. Is the visualization more, less, or of the same value as looking at the average quarterly wages?

In [None]:
# average quarterly wages under stable employment measure #2 for all four quarters in histogram



Now that we have explored two potential definitions of stable employment within the context of our cohort, we will shift our focus for the rest of the notebook to the employers in order to get a better sense of Kentucky's labor market at the time of our analysis.

## Kentucky's Employers

In this section, we'll look at the characteristics of Kentucky's employers. First, let's load in and take a quick look at our employer-level characteristics file `employers_2013`, located in the `ada_ky_20` schema. It covers all employers *with at least five distinct employees* in each of the potential quarters in which we analyze earnings outcomes (2012Q4 - 2014Q3).

Once we take a quick look at the `employers_2013` table, we will try to answer some broad questions about Kentucky's labor market through some more direct questions:

- What is the total number of jobs represented in `employers`? What about total number of full quarter jobs?
- What are the most popular industries by number of employees? What about by number of employers?
- What is the distribution of both total and full-quarter employment of employers per quarter?
- What is the distribution of total and average annual earnings by quarter of these employers?
- Did average employment, hiring, and separation growth rates across all employers vary by `calendaryear`/`qtr` combinations?

### Load the dataset

Before we get started answering these questions, let's load and then take a look at this file.

In [None]:
# read into R
qry <- "
select *
from ada_ky_20.employers_2013
"
employers <- dbGetQuery(con, qry)

# see employers
head(employers)

Let's see how many rows are in `employers`.

In [None]:
# number of rows
nrow(employers)

Let's also see how many employers we have on file per `calendaryear`/`qtr` combination.

In [None]:
# number of employers by quarter
employers %>%
    count(calendaryear, qtr)

Are there any reasons you can think of that may explain why we see a sizeable REDACTED in the number of employers from REDACTED to REDACTED? Now that we have taken a quick look into `employers`, we can start answering the questions posed at the beginning of this section. 

<font color=green><h3>Question 1: What is the total number of jobs represented in `employers`? What about total number of full quarter jobs?</h3></font> 

There are two columns in `employers` we will focus on to answer this set of questions: `num_employed`, which is a calculation of the number of employees, and `full_num_employed`, which tracks the number of full-quarter employees.

In [None]:
# find number of employees and full-quarter employees
employers %>%
    summarize(total_jobs = sum(num_employed),
             total_full_quarter_jobs = sum(full_num_employed))

<font color=green><h3>Question 2: What are the most popular industries by number of employees? What about by number of employers?</h3></font> 

You may have noticed the `naics` column in `employers`. Let's take a quick look at it.

In [None]:
# see naics codes
employers %>%
    select(naics) %>%
    head()

In past notebooks, we worked with `majorindustry` values, which were the titles associated with two-digit NAICS codes. We can select the first two digits of every value in `naics` with the `substring()` function.

In [None]:
# demonstrate substring function
employers %>%
    select(naics) %>%
    mutate(two_digit = substring(naics, 1, 2)) %>%
    head()

So to answer this question, we can simply include `substring(naics, 1, 2)` in our `group_by()` clause before counting the number of employees by 2-digit NAICS code.

In [None]:
# 10 most popular industries by number of employees
employers %>%
    group_by(substring(naics,1,2)) %>%
    summarize(num_employed = sum(num_employed)) %>%
    rename(naics = 1) %>%
    arrange(desc(num_employed)) %>%
    head(10)

# save as pop_naics
pop_naics <- employers %>%
    group_by(substring(naics,1,2)) %>%
    summarize(num_employed = sum(num_employed)) %>%
    rename(naics = 1) %>%
    arrange(desc(num_employed)) %>%
    head(10)

Let's use our industry crosswalk to put some names to these NAICS codes. In cases where the names are not already joined to the NAICS codes, we can use the `naics_2012` table in the `public` schema to act as a crosswalk.

> NAICS codes are revised every five years, and a 2017 revision followed the 2012 one. There is an associated `naics_2017` table, but since we are focusing on employers from 2012-2014, we will use the NAICS lookup table that matches this time frame.

You can see this table by running the code cell below.

In [None]:
# preview the first rows of the naics_2012 table
qry = '
select *
from public.naics_2012
limit 5
'
dbGetQuery(con, qry)

If you're curious, you can see that there are a few NAICS codes that won't work for direct matching due to the presence of `-` in the `naics_us_code` value. We can see them here:

> This is apparent in all of the tables.

In [None]:
# see naics with dashes
qry = "
select *
from public.naics_2012
where naics_us_code like '%-%'
"
dbGetQuery(con, qry)

For the `naics_2012` table, we have added in these direct numbers in a permanent table `naics_2012_upd` in the `ada_ky_20` schema to allow for direct matching. Let's read this table into R.

In [None]:
# read updated naics table for 2012 into R
qry = "
select *
from ada_ky_20.naics_2012_upd
"

naics <- dbGetQuery(con, qry)
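The notebook does not show how `naics_2012_upd` was built. As a rough sketch of one way ranged codes such as `31-33` could be expanded into directly matchable rows, the example below uses a toy crosswalk and a hypothetical helper, `expand_code()`. Both are assumptions for illustration and are not the code used to create the permanent table.

```r
# toy crosswalk rows; column names follow the notebook, values are a small subset
toy_naics <- data.frame(
  naics_us_code  = c("11", "31-33", "44-45"),
  naics_us_title = c("Agriculture, Forestry, Fishing and Hunting",
                     "Manufacturing",
                     "Retail Trade"),
  stringsAsFactors = FALSE
)

# hypothetical helper: turn "31-33" into c("31", "32", "33");
# codes without a dash are returned unchanged
expand_code <- function(code) {
  if (!grepl("-", code, fixed = TRUE)) return(code)
  bounds <- as.integer(strsplit(code, "-", fixed = TRUE)[[1]])
  as.character(seq(bounds[1], bounds[2]))
}

# build one row per individual two-digit code, repeating the title as needed
toy_naics_upd <- do.call(rbind, lapply(seq_len(nrow(toy_naics)), function(i) {
  data.frame(
    naics_us_code  = expand_code(toy_naics$naics_us_code[i]),
    naics_us_title = toy_naics$naics_us_title[i],
    stringsAsFactors = FALSE
  )
}))

print(toy_naics_upd)
```

After the expansion, a two-digit code such as `32` matches the Manufacturing title directly in a join.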

Since we have already stored `pop_naics` as a data frame, we can `left_join()` it to `naics` to find the industries associated with each 2-digit NAICS code.

In [None]:
# get industry names of most popular naics
pop_naics %>% 
    left_join(naics, by=c('naics' = 'naics_us_code')) %>%
    # don't include the other columns
    select(-c(seq_no,naics)) %>%
    # sort order of columns
    select(naics_us_title, num_employed)

Do any of these industries surprise you? Now, let's move on to our most common industries by number of employers.

> In the following code, `n_distinct()` is used to calculate the number of unique employers from 2012Q4-2014Q3.

In [None]:
# calculate popular industries by number of distinct employers in this time frame
employers %>%
    filter(!is.na(naics)) %>%
    group_by(substring(naics,1,2)) %>%
    summarize(distinct_emp = n_distinct(employeeno)) %>%
    arrange(desc(distinct_emp)) %>%
    ungroup() %>%
    rename(naics = 1) %>%
    head(10)

Again, we can find the associated industry names with a quick join after saving the resulting data frame above.

In [None]:
# calculate number of distinct employers from 2012Q4-2014Q3
# save to pop_naics_emps
pop_naics_emps <- employers %>%
    filter(!is.na(naics)) %>%
    group_by(substring(naics,1,2)) %>%
    summarize(distinct_emp = n_distinct(employeeno)) %>%
    arrange(desc(distinct_emp)) %>%
    ungroup() %>%
    rename(naics = 1) %>%
    head(10)

In [None]:
# get industry names of most popular naics
pop_naics_emps %>% 
    left_join(naics, by=c('naics' = 'naics_us_code')) %>%
    # keep only the industry title and employer count, in that order
    select(naics_us_title, distinct_emp)

How does this list compare to the list of the most popular industries by total number of employees?

<font color=green><h3>Question 3: What is the distribution of both total and full-quarter employment of employers per quarter?</h3></font> 

Now, instead of aggregating `num_employed` by quarter, we will simply look at the distribution of `num_employed` within each quarter. First, we'll look at a simple histogram of this distribution, and then find some underlying percentile values as well.

> Because these visualizations are exploratory and not meant to be finalized, we are not going to adjust the default bin size or add any titles.
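
If you later want a more polished version, the sketch below shows one possible set of adjustments on simulated data: an explicit number of bins, a log-scaled x-axis for a right-skewed variable, and axis labels. The data frame `toy_employers` is simulated and the specific choices (20 bins, log scale) are assumptions, not recommendations for the Kentucky data.

```r
library(ggplot2)

# simulated, right-skewed employer sizes (not Kentucky data)
set.seed(1)
toy_employers <- data.frame(
    num_employed = ceiling(rlnorm(500, meanlog = 2, sdlog = 1.2))
)

# histogram with an explicit bin count, a log-scaled x-axis, and labels
p <- ggplot(toy_employers, aes(x = num_employed)) +
    geom_histogram(bins = 20) +
    scale_x_log10() +
    labs(title = "Distribution of employer size (simulated)",
         x = "Employees per employer-quarter (log scale)",
         y = "Number of employer-quarters")

p
```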

In [None]:
# see num_employed distribution
employers %>%
    ggplot(aes(x=num_employed)) + 
    geom_histogram()

Above, we can see that most employers have REDACTED employees per quarter. Let's take a look at some percentiles to get a better understanding of this distribution.

In [None]:
# find distribution of total employees by employer and quarter
employers %>%
    summarize('.01' = quantile(num_employed, .01, na.rm=TRUE),
              '.1'  = quantile(num_employed, .1, na.rm=TRUE),
              '.25' = quantile(num_employed, .25, na.rm=TRUE),
              '.5'  = quantile(num_employed, .5, na.rm=TRUE),
              '.75' = quantile(num_employed, .75, na.rm=TRUE),
              '.9'  = quantile(num_employed, .9, na.rm=TRUE),
              '.99' = quantile(num_employed, .99, na.rm=TRUE),
             )
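
The percentile summary above calls `quantile()` once per percentile. As a compact alternative sketch, base R's `quantile()` also accepts a vector of probabilities and returns all of the requested percentiles at once. The vector `num_employed_toy` below is made up for illustration.

```r
# toy employer sizes, including a missing value
num_employed_toy <- c(1, 2, 2, 3, 5, 8, 13, 40, NA)

# same percentiles as in the notebook, requested in a single call
probs <- c(.01, .1, .25, .5, .75, .9, .99)
pcts <- quantile(num_employed_toy, probs = probs, na.rm = TRUE)

pcts
```

The result is a named numeric vector (names such as `50%`), which is convenient for a quick look but would need reshaping to match the one-row data frame that `summarize()` returns.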

We can do the same for the number of full-quarter employees in each quarter by employer.

In [None]:
# see full_num_employed distribution
employers %>%
    ggplot(aes(x=full_num_employed)) + 
    geom_histogram()

From this histogram, it seems as though even more employers tend to have not REDACTED. Let's confirm that notion by looking at some of the percentile values.

In [None]:
# find distribution of full-quarter employees by employer and quarter
employers %>%
    summarize('.01' = quantile(full_num_employed, .01, na.rm=TRUE),
              '.1'  = quantile(full_num_employed, .1, na.rm=TRUE),
              '.25' = quantile(full_num_employed, .25, na.rm=TRUE),
              '.5'  = quantile(full_num_employed, .5, na.rm=TRUE),
              '.75' = quantile(full_num_employed, .75, na.rm=TRUE),
              '.9'  = quantile(full_num_employed, .9, na.rm=TRUE),
              '.99' = quantile(full_num_employed, .99, na.rm=TRUE),
             )

What does this tell you about the relative size of employers in Kentucky?

<font color=green><h3>Question 4: What is the distribution of total and average payroll by quarter of these employers? </h3></font> 

We will follow a similar protocol for answering these questions, starting with a histogram of the distribution before looking at some percentile values.

> If you would like to display large numbers without scientific notation in a histogram, you can use `options(scipen=999)`. This is a base R option that applies to the whole R session, so it works in a Jupyter notebook environment as well as elsewhere.

In [None]:
# Display full numbers instead of scientific notation
options(scipen=999)
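
As a quick illustration of what this option changes, the sketch below formats the same number under the default setting and under `scipen = 999`, then restores the previous setting. The variable names are just for illustration.

```r
# start from the default penalty and remember the previous setting
old <- options(scipen = 0)
sci_version <- format(1e10)

# a large penalty makes R prefer fixed notation
options(scipen = 999)
fixed_version <- format(1e10)

# restore whatever setting was in place before
options(old)

c(sci_version, fixed_version)
```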

In [None]:
# see total_earnings distribution
employers %>%
    ggplot(aes(x=total_earnings)) + 
    geom_histogram()

It appears as though the total payrolls of most employers tend to be REDACTED.

In [None]:
# find distribution of total payroll by employer and quarter
employers %>%
    summarize('.01' = quantile(total_earnings, .01, na.rm=TRUE),
              '.1'  = quantile(total_earnings, .1, na.rm=TRUE),
              '.25' = quantile(total_earnings, .25, na.rm=TRUE),
              '.5'  = quantile(total_earnings, .5, na.rm=TRUE),
              '.75' = quantile(total_earnings, .75, na.rm=TRUE),
              '.9'  = quantile(total_earnings, .9, na.rm=TRUE),
              '.99' = quantile(total_earnings, .99, na.rm=TRUE),
             )

In [None]:
# see avg_earnings distribution
employers %>%
    ggplot(aes(x=avg_earnings)) + 
    geom_histogram()

Can you determine much from this visualization? If not, what would you like to add to it? Feel free to make any additions to the above visualizations.

In [None]:
# find distribution of average payroll by employer and quarter
employers %>%
    summarize('.1'  = quantile(avg_earnings, .1, na.rm=TRUE),
              '.25' = quantile(avg_earnings, .25, na.rm=TRUE),
              '.5'  = quantile(avg_earnings, .5, na.rm=TRUE),
              '.75' = quantile(avg_earnings, .75, na.rm=TRUE),
              '.9'  = quantile(avg_earnings, .9, na.rm=TRUE),
              '.99' = quantile(avg_earnings, .99, na.rm=TRUE),
             )

Is this what you were expecting to see? How do overall average payrolls for all employees compare to average quarterly earnings within our cohort?

<font color=green><h3>Question 5: Did average employment, hiring, and separation growth rates across all employers vary by `calendaryear`/`qtr` combinations?</h3></font> 

Here, we will go back to using `group_by` and `summarize` to find the means and standard deviations of these growth rates across all `calendaryear`/`qtr` combinations in `employers`.

In [None]:
# find mean and standard deviation of employment rates by quarter
employers %>%
    group_by(calendaryear,qtr) %>%
    summarize(mean = mean(emp_rate, na.rm=TRUE),
             sd = sd(emp_rate, na.rm=TRUE))

In [None]:
# find mean and standard deviation of hiring rates by quarter
employers %>%
    group_by(calendaryear, qtr) %>%
    summarize(mean = mean(hire_rate, na.rm=TRUE),
              sd = sd(hire_rate, na.rm=TRUE))

In [None]:
# find mean and standard deviation of separation rates by quarter
employers %>%
    group_by(calendaryear, qtr) %>%
    summarize(mean = mean(sep_rate, na.rm=TRUE),
              sd = sd(sep_rate, na.rm=TRUE))
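
The three summaries above follow the same pattern, so they can also be computed in a single pass. The sketch below does this with `across()` (available in dplyr 1.0.0 and later) on a toy data frame; `toy_rates` and its values are invented, and only the column names mirror `employers`.

```r
library(dplyr)

# toy employer-quarter growth rates (illustrative values only)
toy_rates <- tibble(
    calendaryear = c(2013, 2013, 2013, 2013),
    qtr = c(1, 1, 2, 2),
    emp_rate = c(0.10, 0.30, -0.20, NA),
    hire_rate = c(0.2, 0.4, 0.1, 0.3),
    sep_rate = c(0.1, 0.1, 0.3, 0.5)
)

# mean and standard deviation of all three rates by year and quarter
rate_summary <- toy_rates %>%
    group_by(calendaryear, qtr) %>%
    summarize(across(c(emp_rate, hire_rate, sep_rate),
                     list(mean = ~ mean(.x, na.rm = TRUE),
                          sd = ~ sd(.x, na.rm = TRUE))),
              .groups = "drop")

rate_summary
```

The output columns are named by combining the input column and the function name, for example `emp_rate_mean` and `emp_rate_sd`.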

Based on your knowledge of employment patterns at the time, are these results consistent with the overall trends in the United States?

<font color=red><h3> Checkpoint 3: Understanding Our Cohort within the Labor Market </h3></font> 

As stated at the beginning of the notebook, we would like to get a better sense of who is employing our 2013 cohort - are they larger employers with lots of turnover? Do they tend to pay their employees better? Please find the answers to the questions posed in "Kentucky's Employers" for employers that employed members of our cohort. Filter the `employers` data frame based on the `employeeno`, `qtr`, and `calendaryear` values.
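
One possible way to do this kind of filtering is a `semi_join()`, sketched below on toy data. It keeps the employer rows that have a match in the cohort's job records without duplicating an employer that hired several cohort members. The data frames `employers_toy` and `cohort_jobs_toy` are hypothetical stand-ins; the name and structure of your cohort's wage records may differ.

```r
library(dplyr)

# toy employer-quarter measures (illustrative values only)
employers_toy <- tibble(
    employeeno = c("A", "A", "B", "C"),
    calendaryear = c(2013, 2013, 2013, 2014),
    qtr = c(3, 4, 3, 1),
    num_employed = c(50, 55, 12, 300)
)

# hypothetical cohort job records: two cohort members worked for A in 2013Q3
cohort_jobs_toy <- tibble(
    employeeno = c("A", "A", "C"),
    calendaryear = c(2013, 2013, 2014),
    qtr = c(3, 3, 1)
)

# keep employer-quarters that employed at least one cohort member
cohort_employers_toy <- employers_toy %>%
    semi_join(cohort_jobs_toy, by = c("employeeno", "calendaryear", "qtr"))

cohort_employers_toy
```

An `inner_join()` would return employer A's 2013Q3 row twice here, which would distort any employer-level distribution computed afterwards.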

In [None]:
# guiding question 1



In [None]:
# guiding question 2



In [None]:
# guiding question 3



In [None]:
# guiding question 4



In [None]:
# guiding question 5



In this notebook, you have explored two separate definitions of stable employment and how quarterly wages changed under the two definitions. Then, you switched over to looking at the demand side of the labor market, learning about all of Kentucky's employers from 2012Q4-2014Q3. 

After answering the final checkpoint, you will be able to compare employers of our cohort to the overall labor market in Kentucky. Did you find that individuals in our cohort were often employed by certain types of employers? 