<center><img style="float: center;" src="images/CI_horizontal.png" width="600"></center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

<center> Julia Lane, Benjamin Feder, Angela Tombari, Ekaterina Levitskaya, Tian Lou, Lina Osorio-Copete. </center> 

# Outcome measurement and imputation

## Introduction

What should you do when you encounter missing values in your data? Unfortunately, there is usually no *right* answer. However, you can try to impute these missing values, providing your best guess for each missing point's true value. Here, you will learn how to implement common imputation methods you can use in approaching missing values in your own work.

### Learning Objectives

* Explore options for imputing missing values

* Visualize estimate changes following imputation

In this notebook, you will focus on 2012-13 Kentucky graduates' earnings during their first year after graduation, particularly in their first and fourth quarters after graduation. Recall that in the [Data Exploration](03_Dataset_Exploration.ipynb) notebook, you initially examined the earnings distribution for all members of this cohort who had positive earnings in this time period in Kentucky. To evaluate the earnings outcomes of all 2012-13 Kentucky graduates, you need to decide what to do when you cannot find their earnings in the Kentucky Unemployment Insurance (UI) wage records. A person may not appear in Kentucky's UI wage records for several reasons:
- The person is unemployed. 
- The person is out of the labor force, e.g., due to schooling or childcare.
- The person was employed outside of Kentucky.
- The person's job is not covered in UI wage records, e.g., self-employed workers, independent contractors, federal government workers, etc. <a href='https://www.nap.edu/read/10206/chapter/11#294'>(Hotz and Scholz, 2002)</a>

You will explore the resulting earnings outcomes after applying different earnings imputation methods. The methods covered in this notebook include:
- Dropping all "missing" values
- Filling in zero for people who do not have records in Kentucky UI wage records data 
- Substituting missing values with the average earnings of people who are in the same degree fields and have the same gender
- Regression imputation
- Adding in Ohio, Indiana, Missouri, Tennessee, and Illinois UI wage records for the cohort in question

## R Setup and Database Connection

Before you begin, you need to run the code cells below to import the libraries and connect to our PostgreSQL database.

In [None]:
#database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)

# scaling data, calculating percentages, overriding default graphing
library(scales)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

## Brief Manipulation: Isolating Earnings during first quarter after graduation

Before we start performing imputation, we need to do some quick data manipulation to isolate earnings from the first quarter after each individual's graduation. To do so, using the same approach as in the last [section](03_Data_Exploration.ipynb/#Common-Employment-Patterns) of the Data Exploration notebook, we will create a new column, `q_after_grad`, by dividing `time_after_grad` by 90 and rounding to the nearest whole number.

In [None]:
# read wage table into R
qry <- "
select *
from ada_ky_20.cohort_wages
"
df_wages <- dbGetQuery(con, qry)

In [None]:
# add in quarter after graduation
df_wages <- df_wages %>%
    mutate(q_after_grad = round(time_after_grad/90)) #default rounding behavior rounds to an integer

# see unique values of q_after_grad
df_wages %>%
    distinct(q_after_grad)
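
To make the rounding step concrete, here is a minimal sketch on made-up day counts (the values in `days_after_grad` are hypothetical and not taken from the data). It also shows a detail worth keeping in mind: R's `round()` sends exact halves to the nearest even integer, so 45 days maps to quarter 0 while 135 days maps to quarter 2.

```r
# hypothetical numbers of days between graduation and a wage record
days_after_grad <- c(30, 45, 90, 134, 135, 225, 360)

# same transformation used to build q_after_grad
q_after_grad <- round(days_after_grad / 90)

# 45/90 = 0.5 rounds to 0, 135/90 = 1.5 rounds to 2, 225/90 = 2.5 rounds to 2
data.frame(days_after_grad, q_after_grad)
```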

Now, we can simply create a data frame with just first quarter post-graduation wages using `filter()`.

In [None]:
# Filter quarter 1 after graduation
q1_wages <- df_wages %>%
    filter(q_after_grad == 1)

Because we will want to estimate the total wages for each `coleridge_id` in this quarter, not necessarily their wages per employer, let's aggregate `q1_wages` to find the total earnings for each member of this cohort in the entire quarter.

In [None]:
# aggregate on coleridge_id
q1_wages <- q1_wages %>%
    group_by(coleridge_id) %>%
    summarize(tot_wages = sum(wages)) %>%
    ungroup()

In [None]:
nrow(q1_wages)

In [None]:
q1_num <- q1_wages %>%
    summarize(n=n_distinct(coleridge_id))

cat('The total number of graduates with positive earnings during their first quarter after graduation:', q1_num$n)

To see the percentage of our cohort represented in `q1_wages`, let's load in our original cohort into R.

In [None]:
qry <- "
select *
from ada_ky_20.cohort
"
df <- dbGetQuery(con, qry)

In [None]:
cat('That is', percent(q1_num$n/nrow(df), .01), 'of the study cohort.')

<h3 style="color:red">Checkpoint 1: Identifying Earnings in the Fourth Quarter after Graduation</h3>

Given the code above, create a data subset `q4_wages` that contains all earnings for the cohort in their fourth quarter after graduation. How many members of our cohort had positive earnings in this quarter? Do you expect this number to be higher or lower than the number in the first quarter?

## Add graduates without positive earnings for Q1

Our current data frame, `q1_wages`, only contains individuals with positive earnings in Kentucky in their first quarter after graduation. Let's add in members of our cohort who did not appear in Kentucky's wage records during this time period, as well as the additional variables from the original cohort table to better describe the individuals. This will let us easily analyze different earnings distributions in the cohort's first quarter after graduation as we progress throughout this notebook.

We can do so by using a `left_join()` of the original cohort, `df`, to `q1_wages`, as this will add in one row for each `coleridge_id` in the original cohort that was not included in `q1_wages`.

In [None]:
# add in employment outcomes for all of those in the original cohort
q1_all_wages <- df %>%
    left_join(q1_wages, c("coleridge_id"))

As a quick check, we can see if the number of individuals in `q1_all_wages` that either have or do not have null wages makes sense given the total number of individuals in the cohort that were in `q1_wages`. We can do so by adding an indicator variable for whether the `tot_wages` column is null for each individual in `q1_all_wages`, and then counting the number of distinct individuals based on this new variable.

In [None]:
# employment outcomes for all of those in our original cohort
q1_all_wages %>%
    mutate(wage_ind = ifelse(is.na(tot_wages), 'no', 'yes')) %>%
    group_by(wage_ind) %>%
    summarize(n=n_distinct(coleridge_id))

In [None]:
# check number of individuals in q1_wages
q1_num$n

We can see that these numbers make sense. If they did not add up, chances are there was an issue with the details of your join.
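
The same check can be sketched on a toy example. The data frames `toy_cohort` and `toy_wages` below are hypothetical stand-ins for `df` and `q1_wages`; the point is only that, after a `left_join()`, the matched and unmatched counts should add up to the size of the cohort.

```r
library(dplyr)

# hypothetical cohort of five people, three of whom have Q1 wage records
toy_cohort <- data.frame(coleridge_id = 1:5)
toy_wages <- data.frame(coleridge_id = c(1L, 3L, 4L), tot_wages = c(5200, 800, 12000))

# every cohort member is kept; those without a wage record get NA
toy_all <- toy_cohort %>%
    left_join(toy_wages, by = "coleridge_id")

n_with <- sum(!is.na(toy_all$tot_wages))
n_without <- sum(is.na(toy_all$tot_wages))

# stops with an error if the counts do not add up
stopifnot(n_with == nrow(toy_wages), n_with + n_without == nrow(toy_cohort))
```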

For future usage, let's add the gender, birth year, and corresponding `ssn` to each individual in `q1_all_wages`. All three of these variables can be accessed within the table `master_person` in the `kystats_2020` schema. Let's load the contents of this table into R in preparation for the join, but only the contents for those in the original cohort.

>Note: `ssn` is a hashed value to prevent direct reidentification of any individual within the ADRF.

In [None]:
# load master_person into R
qry <- "
select *
from kystats_2020.master_person
where coleridge_id in (select coleridge_id from ada_ky_20.cohort)
"
master_person <- dbGetQuery(con, qry)

In [None]:
# check to see every individual is in master_person
nrow(master_person)

We can now simply `left_join()` `master_person` to `q1_all_wages` to find the corresponding gender, birth year, and ssn values for each individual in `q1_all_wages`. To ensure we are just selecting these specific columns from `master_person`, we can de-select all of the other columns in `master_person`, all of which start with `ceds`.

> The `rowid` and `contentarea` columns will also not provide any additional information to the resulting data frame, so we will de-select these variables as well.

In [None]:
# join and de-select unnecessary variables and see the names of the columns
q1_all_wages %>%
    left_join(master_person, 'coleridge_id') %>%
    select(-c(starts_with('ceds'), starts_with('rowid'), starts_with('contentarea'))) %>%
    names()

In [None]:
# update q1_all_wages
q1_all_wages <- q1_all_wages %>%
    left_join(master_person, 'coleridge_id') %>%
    select(-c(starts_with('ceds'), starts_with('rowid'), starts_with('contentarea')))

Just to confirm, we can check to see if the number of rows in `q1_all_wages` is equal to the number of rows in `df`, the original cohort, as each individual in the original cohort should correspond to a single row regardless of employment status.

In [None]:
nrow(df) == nrow(q1_all_wages)

Let's also check to see if we have any missing values for our demographic variables. If so, let's fill these in with placeholder values (`U` for gender, `unknown` for birth year) so they won't be dropped in future analyses.

In [None]:
# see number of na values by column
colSums(is.na(q1_all_wages))

Since we will not be using `kpeds_major2` in this notebook, we will only use `replace_na()` for `gender` and `birthyear`.

In [None]:
# replace na
q1_all_wages<-q1_all_wages %>%
    replace_na(list(
        gender = 'U',
        birthyear='unknown'
    )
              )

# see na distribution now
colSums(is.na(q1_all_wages))

> Theoretically, you could apply these imputation methods to these missing demographic values. However, for the purposes of this notebook, we will focus our imputation techniques on missing earnings values.

<h3 style="color:red">Checkpoint 2: Replicate for Q4</h3>

Create a data frame `q4_all_wages` that mirrors `q1_all_wages` except for Q4. Feel free to add in as many code cells as you deem necessary.

## Impute Wage Values

Now that we have confirmed that our `q1_all_wages` data frame is ready to use for testing our imputation methods, we can get started. As a reminder, here are the five methods we will be trying out in this notebook:
1. Dropping all people with "missing" values on the variable of interest (Q1 wages)
2. Filling in zero for people who do not have records in Kentucky UI data
3. Filling in missing values with the average Kentucky UI earnings of people who are in the same degree fields and have the same gender
4. Regression
5. Filling in missing values by adding in Ohio, Indiana, Missouri, Tennessee, and Illinois UI records for the cohort in question

### 1. Drop All Missing Values

First, let's look at the earnings outcomes during the first quarter after graduation when we drop all missing earnings values. Here, by ignoring individuals whose earnings are missing, we are hoping that their true earnings follow the same distribution as the observed earnings. Although this approach is fairly common, you should **never, ever, ever** use it in practice. 

> Deleting missing values is often called listwise deletion and essentially assumes that missing values are missing completely at random (MCAR). For a scholarly treatment of this issue, see (amongst others): 
> - Rubin (1976) "Inference and Missing Data" for the initial presentation, or
> - Peugh and Enders (2004) "Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement" for a more recent discussion.  

In [None]:
# drop missing values
q1_no_missing <- q1_all_wages %>%
    filter(!is.na(tot_wages))

In [None]:
# see earnings distribution
summary(q1_no_missing$tot_wages)

<h4 style="color:red">Checkpoint 3: Replicate for Q4</h4>

What does the earnings distribution look like for Q4 when you drop missing values?

### 2. Fill in Missing Values with Zero

Next, let's see how the earnings distribution shifts when we encode all missing earnings outcomes as 0. Here, we are assuming that all missing earnings are due to unemployment.

In [None]:
# fill all null tot_wages with 0
q1_wages_zero <- q1_all_wages %>%
    mutate(tot_wages = ifelse(is.na(tot_wages), 0, tot_wages)) 

In [None]:
# Take a look at the distribution. How does it vary from the distribution you get in method 1?
summary(q1_wages_zero$tot_wages)

In [None]:
cat('Average earnings if missing wages are dropped is $', round(mean(q1_no_missing$tot_wages), 2), sep = '', '.')

cat('\nAverage earnings if missing wages are imputed as 0 is $', round(mean(q1_wages_zero$tot_wages), 2), sep = '', '.')
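
To see why the two averages differ, consider a minimal sketch with made-up wages (the vector `toy_wages` is hypothetical). When missing wages are encoded as 0, the average equals the observed-only average multiplied by the share of people with observed wages.

```r
# hypothetical quarterly wages: two observed, two missing
toy_wages <- c(4000, 6000, NA, NA)

# method 1: average over the two observed values only
mean_dropped <- mean(toy_wages, na.rm = TRUE)

# method 2: average over all four people, counting missing wages as 0
mean_zero <- mean(ifelse(is.na(toy_wages), 0, toy_wages))

# share of people with observed wages
share_observed <- mean(!is.na(toy_wages))

cat('Dropping missing values: $', mean_dropped, '\n', sep = '')
cat('Imputing zero: $', mean_zero, '\n', sep = '')
cat('Observed share times observed mean: $', share_observed * mean_dropped, '\n', sep = '')
```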

<h4 style="color:red">Checkpoint 4: Replicate for Q4</h4>

What does the earnings distribution look like for Q4 when you fill missing values with zero?

### 3. Fill in Missing Values with Major/Gender Mean Earnings

Now, instead of either ignoring missing values or assuming the earnings are 0, we will try imputing missing earnings for each individual as the average quarterly earnings of the other individuals in our cohort of the same `gender` and `kpeds_major1`.

Here, our strategy is as follows:
- Using populated wages, find mean earnings for each major by gender
- Merge the mean earnings, based on major and gender, to the overall cohort
 - creates an additional column `mean_wages`
- Recode so that missing values are populated with mean earnings
 - data stored in a new column `imputed_wages`


>Note: This process is frequently termed mean imputation. Implementing this method will compress the variance and covariance of the imputed variable, resulting in biased parameter estimates for all parameters except the mean (Peugh & Enders, 2004, p.529). In this example, we are assuming that the missing values in wages are conditional on both gender and major. We also assume that the missingness is not truly indicative of lack of wages.

In [None]:
# mean earnings by gender/kpeds_major1 grouping
q1_all_wages %>%
    group_by(gender, kpeds_major1) %>%
    summarize(mean_wages = mean(tot_wages, na.rm=T)) %>%
    head()

In [None]:
#mean earnings by gender/kpeds_major1 grouping saved
q1_major_gend <- q1_all_wages %>%
    group_by(gender, kpeds_major1) %>%
    summarize(mean_wages = mean(tot_wages, na.rm=T)) %>%
    ungroup()

Now, we will merge the two data frames, `q1_major_gend` and `q1_all_wages`, using `inner_join()`.
> Note: `left_join()` would also work in this case.

In [None]:
# see if join works
q1_all_wages %>%
    inner_join(q1_major_gend, by=c('gender', 'kpeds_major1')) %>%
    head()

In [None]:
# save join results to q1_joined_major_gend
q1_joined_major_gend <- q1_all_wages %>%
    inner_join(q1_major_gend, by=c('gender', 'kpeds_major1'))

Now, we can add a new column to `q1_joined_major_gend` to include the mean wage, based on gender and major, *if* the individual did not appear in the Kentucky UI wage records data. 

In [None]:
# see if mutation works as designed
q1_joined_major_gend %>%
    mutate(imputed_wages = ifelse(is.na(tot_wages), mean_wages, tot_wages)) %>%
    select(tot_wages, mean_wages, imputed_wages) %>%
    head()

In [None]:
# save mutation to q1_major_gend_impute
q1_major_gend_impute <- q1_joined_major_gend %>%
    mutate(imputed_wages = ifelse(is.na(tot_wages), mean_wages, tot_wages))

In using this method, there is a chance we cannot impute missing values for all individuals in the cohort. If `imputed_wages` is still `NA`, we can assume there were no individuals in the cohort with non-missing earnings with the same major/gender combination.

In [None]:
# see if any still don't have imputed earnings
q1_major_gend_impute %>%
    filter(is.na(imputed_wages)) %>%
    summarize(n=n())
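
The reason a group with no observed wages stays missing can be shown in two lines. In this small sketch, `empty_group` is a hypothetical gender/major group in which nobody has wage records: its mean is `NaN`, which `is.na()` also treats as missing.

```r
# a hypothetical gender/major group in which nobody has observed wages
empty_group <- c(NA_real_, NA_real_)

# after removing the NAs there is nothing left to average
group_mean <- mean(empty_group, na.rm = TRUE)

group_mean          # NaN
is.na(group_mean)   # TRUE, so the imputed wage is still counted as missing
```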

Unfortunately, it seems as though we do not have available earnings for every combination of gender and primary degree. For the sake of the exercise, we will ignore the individuals whose earnings we could not impute using this method.

In [None]:
summary(q1_major_gend_impute$imputed_wages)
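
The note at the start of this section warns that mean imputation compresses the variance of the imputed variable. A minimal sketch on made-up data illustrates this; the data frame `toy` and its values are hypothetical, and the group mean is computed inside `mutate()` rather than through a separate join, which is an alternative to the approach above rather than a replacement for it.

```r
library(dplyr)

# hypothetical wages for one gender/major group, two of them missing
toy <- data.frame(
    gender = 'F',
    kpeds_major1 = 'Nursing',
    tot_wages = c(3000, 5000, 7000, NA, NA)
)

# fill each missing wage with the mean of its gender/major group
toy_imputed <- toy %>%
    group_by(gender, kpeds_major1) %>%
    mutate(imputed_wages = ifelse(is.na(tot_wages), mean(tot_wages, na.rm = TRUE), tot_wages)) %>%
    ungroup()

# the mean is unchanged by the imputation
mean_observed <- mean(toy$tot_wages, na.rm = TRUE)
mean_imputed <- mean(toy_imputed$imputed_wages)

# the variance shrinks, because the imputed values sit exactly at the mean
var_observed <- var(toy$tot_wages, na.rm = TRUE)
var_imputed <- var(toy_imputed$imputed_wages)

c(mean_observed = mean_observed, mean_imputed = mean_imputed,
  var_observed = var_observed, var_imputed = var_imputed)
```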

<h4 style="color:red">Checkpoint 5: Replicate for Q4</h4>

Impute missing earnings values as the mean earnings of individuals in the cohort with the same gender (`gender`) and degree designation (`kpeds_major1`) in quarter 4. What does the earnings distribution look like? For how many individuals could you not impute values using this method?

### 4. Regression imputation

We can also use regression to try to get more accurate earnings values. We will build a regression equation from the observations for which we know the earnings, then use the equation to predict the missing earnings values. This is, in effect, an extension of the mean imputation by subgroup. Here, we will use demographic information of graduates such as birth year, gender, institution of graduation, date of degree received, and whether the individual received a STEM degree.

> Note: We will not be checking the assumptions associated with linear regressions, as this example is aimed at merely displaying how to use a linear regression for imputation. If you plan on using regression imputation, please check all assumptions before employing a predictive model.

In [None]:
# subset to variables included in regression analysis
q1_reg <- q1_all_wages %>%
    select(coleridge_id, tot_wages, birthyear, gender, kpeds_instname, kpeds_isstem, deg_date)

In [None]:
# see types of the variables
glimpse(q1_reg)

In this case, it may make sense for `birthyear` to be a numeric variable rather than a character vector, as there may be some predictive power in numerically analyzing the ages of the graduates. Let's change `birthyear` to a `numeric` variable.

> Null birth years were previously replaced with a character vector. Any individual with an unknown birth year will be dropped to allow conversion to a numeric variable for imputation purposes.

In [None]:
# see types of the variables after type change
q1_reg %>%
    mutate(birthyear = as.numeric(birthyear)) %>%
    glimpse()

In [None]:
# save change of types of variables
q1_reg_sub <- q1_reg %>%
    mutate(birthyear = as.numeric(birthyear))

Since we will build the model using the members of our cohort with non-missing wages, we will split `q1_reg_sub` into two datasets, one for testing (`q1_wages_na`) and one for training (`q1_wages_pred`).

In [None]:
# split into training and testing sets
# don't need tot_wages because they are null 
q1_wages_na <- q1_reg_sub %>%
    filter(is.na(tot_wages)) %>%
    select(-c(tot_wages))

q1_wages_pred <- q1_reg_sub %>%
    filter(!is.na(tot_wages))

The model creation process for a linear regression can be done using the `lm()` function. The first argument to `lm()` is a formula: the variable we are trying to predict is on the left-hand side of the `~`, and the predictors are all of the variables on the right-hand side of the `~`.

In [None]:
# run model and fit coefficients
q1_wages_model <- lm(tot_wages ~ birthyear + gender + kpeds_instname + kpeds_isstem + deg_date, data = q1_wages_pred)

Now that we have fit coefficients for each of the predictors in the model, we can predict the `tot_wages` variable for the test set using `predict()`.

In [None]:
# predict earnings for test set
pred_earnings <- data.frame(tot_wages = predict(q1_wages_model, newdata=q1_wages_na))

In [None]:
# see predicted earnings
head(pred_earnings)

Because the output for `predict()` retains the same order of rows from `q1_wages_na`, we can add the `tot_wages` variable from `pred_earnings` into the existing `q1_wages_na` data frame.

In [None]:
# see updated data frame with predicted earnings
cbind(q1_wages_na, pred_earnings) %>% 
    head()

In [None]:
# save updated data frame
q1_wages_na_w_earnings <- cbind(q1_wages_na, pred_earnings)
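
The whole fit, predict, and fill-in sequence can also be sketched on a toy data set. Everything below is hypothetical (the `experience` predictor is made up and is not in the cohort data), and the sketch fills the predictions in place by row position instead of splitting and recombining the data frame, which is one possible alternative to the `cbind()` approach above.

```r
# hypothetical data: wages rise with experience, two wages are missing
toy <- data.frame(
    id = 1:6,
    experience = c(1, 2, 3, 4, 5, 6),
    tot_wages = c(2000, 4000, NA, 8000, NA, 12000)
)

# fit the model on rows with observed wages only
toy_model <- lm(tot_wages ~ experience, data = toy[!is.na(toy$tot_wages), ])

# predict for rows with missing wages and fill them in place
missing_rows <- is.na(toy$tot_wages)
toy$imputed_wages <- toy$tot_wages
toy$imputed_wages[missing_rows] <- predict(toy_model, newdata = toy[missing_rows, ])

# observed wages are untouched; only the two missing rows were filled
toy
```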

Finally, before we can see the effects of the imputation method, we need to combine our training set, which already has `tot_wages`, with our testing set and its predicted `tot_wages`. 

In [None]:
# combine training and testing sets
rbind(q1_wages_na_w_earnings, q1_wages_pred) %>% 
    head()

In [None]:
# save combined training and testing sets
q1_reg_earnings <- rbind(q1_wages_na_w_earnings, q1_wages_pred)

Now we can see the entire earnings distribution for the cohort after applying regression imputation.

In [None]:
# see earnings distribution for full cohort
summary(q1_reg_earnings$tot_wages)

In [None]:
# see earnings distribution for imputed portion of cohort
summary(q1_wages_na_w_earnings$tot_wages)

<h4 style="color:red">Checkpoint 6: Switch `kpeds_isstem` with `deg_class` and re-run the regression</h4>

When you switch `kpeds_isstem` with `deg_class` in the regression, how does the earnings distribution compare to the one using the previous linear regression to impute values?

### 5. Add in Ohio, Indiana, Missouri, Tennessee, and Illinois UI data

Finally, let's see how the earnings distribution changes when we add in some bordering states' UI wage records. You will see how we joined Ohio, Indiana, Missouri, Tennessee and Illinois UI wage records to our `cohort` table. Afterwards, we will combine these tables to analyze the overall earnings distribution.

By adding in contiguous states' wage records, we should be able to capture most earnings of our cohort that were outside of Kentucky.

Recall that in the Data Exploration [notebook](03_Data_Exploration.ipynb/#Join-Cohort-to-Ohio's-UI-Wage-Records), we created the permanent table `oh_wages` by joining `cohort_w_ssns` to `small_ohio_ui`, which was a subset of the entire UI wage records within Ohio. The following SQL queries created `in_wages`, `tn_wages`, `il_wages`, and `mo_wages`.

	create table ada_ky_20.in_wages as 
    select b.uiacct::varchar as employeeno, b.wages, b.job_date, a.coleridge_id, a.degreegroup, a.degreerank, 
    a.kpeds_major1, a.kpeds_major1_cip, a.kpeds_instname, a.kpeds_sector, a.deg_date, (b.job_date - a.deg_date) as time_after_grad, 
    'IN'::varchar as state, a.deg_class
    from ada_ky_20.cohort_w_ssns a
    left join ada_ky_20.small_indiana_ui b
    on a.ssn = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

    create table ada_ky_20.mo_wages as 
    select b.empr_no::varchar as employeeno, b.wage as wages, b.job_date, a.coleridge_id, a.degreegroup, a.degreerank, 
    a.kpeds_major1, a.kpeds_major1_cip, a.kpeds_instname, a.kpeds_sector, a.deg_date, (b.job_date - a.deg_date) as time_after_grad, 
    'MO'::varchar as state, a.deg_class
    from ada_ky_20.cohort_w_ssns a
    left join ada_ky_20.small_mo_ui b
    on a.ssn = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

    create table ada_ky_20.tn_wages as 
    select b.empr_nbr::varchar as employeeno, b.wge_amt as wages, b.job_date, a.coleridge_id, a.degreegroup, a.degreerank, 
    a.kpeds_major1, a.kpeds_major1_cip, a.kpeds_instname, a.kpeds_sector, a.deg_date, (b.job_date - a.deg_date) as time_after_grad, 
    'TN'::varchar as state, a.deg_class
    from ada_ky_20.cohort_w_ssns a
    left join ada_ky_20.small_tn_ui b
    on a.ssn = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

    create table ada_ky_20.il_wages as 
    select b.empr_no::varchar as employeeno, b.wage as wages, b.job_date, a.coleridge_id, a.degreegroup, a.degreerank, 
    a.kpeds_major1, a.kpeds_major1_cip, a.kpeds_instname, a.kpeds_sector, a.deg_date, (b.job_date - a.deg_date) as time_after_grad, 
    'IL'::varchar as state, a.deg_class
    from ada_ky_20.cohort_w_ssns a
    left join ada_ky_20.small_illinois_ui b
    on a.ssn = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

Let's briefly explore these tables to see how many `coleridge_id` values these created tables captured.

In [None]:
# count distinct coleridge_id values in the IN table
qry = "
select count(distinct(coleridge_id))
from ada_ky_20.in_wages
"
dbGetQuery(con, qry)

In [None]:
# count distinct coleridge_id values in the MO table
qry = "
select count(distinct(coleridge_id))
from ada_ky_20.mo_wages
"
dbGetQuery(con, qry)

In [None]:
# count distinct coleridge_id values in the IL table
qry = "
select count(distinct(coleridge_id))
from ada_ky_20.il_wages
"
dbGetQuery(con, qry)

In [None]:
# count distinct coleridge_id values in the TN table
qry = "
select count(distinct(coleridge_id))
from ada_ky_20.tn_wages
"
dbGetQuery(con, qry)

We have access to six tables that have 2013 AY Kentucky graduates' UI records from the six states. We can append these tables by using `union all` in SQL.

>Note: In this case `union` and `union all` are equivalent. In general, `union` removes duplicate rows, so it should be avoided if duplication of rows is meaningful.

In [None]:
# combine cohort_wages (KY) with the OH, IN, IL, TN, and MO wage tables
qry = "
select coleridge_id, degreegroup, degreerank, kpeds_major1, kpeds_major1_cip, kpeds_instname, kpeds_sector, deg_date, 
wages, job_date, time_after_grad, deg_class, 'KY'::varchar as state
from ada_ky_20.cohort_wages
UNION ALL
select coleridge_id, degreegroup, degreerank, kpeds_major1, kpeds_major1_cip, kpeds_instname, kpeds_sector, deg_date, 
wages, job_date, time_after_grad, deg_class, state
from ada_ky_20.oh_wages
UNION ALL
select coleridge_id, degreegroup, degreerank, kpeds_major1, kpeds_major1_cip, kpeds_instname, kpeds_sector, deg_date, 
wages, job_date, time_after_grad, deg_class, state
from ada_ky_20.in_wages
UNION ALL
select coleridge_id, degreegroup, degreerank, kpeds_major1, kpeds_major1_cip, kpeds_instname, kpeds_sector, deg_date, 
wages, job_date, time_after_grad, deg_class, state
from ada_ky_20.il_wages
UNION ALL
select coleridge_id, degreegroup, degreerank, kpeds_major1, kpeds_major1_cip, kpeds_instname, kpeds_sector, deg_date, 
wages, job_date, time_after_grad, deg_class, state
from ada_ky_20.tn_wages
UNION ALL
select coleridge_id, degreegroup, degreerank, kpeds_major1, kpeds_major1_cip, kpeds_instname, kpeds_sector, deg_date, 
wages, job_date, time_after_grad, deg_class, state
from ada_ky_20.mo_wages
"

#this is the critical difference- here we assign the results to a data frame in our environment
combined_wages <- dbGetQuery(con, qry)
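
To illustrate when `union` and `union all` would give different answers, here is a small sketch in R on made-up records. The data frames `ky` and `oh` are hypothetical; `rbind()` stands in for `union all` and `unique()` applied to the result stands in for `union`.

```r
# hypothetical wage records; person 2 has two rows that agree on every column
ky <- data.frame(coleridge_id = c(1, 2, 2), wages = c(3000, 1500, 1500), state = 'KY')
oh <- data.frame(coleridge_id = 3, wages = 2200, state = 'OH')

# behaves like UNION ALL: every row is kept
union_all_rows <- rbind(ky, oh)

# behaves like UNION: identical rows collapse into one
union_rows <- unique(union_all_rows)

# total wages for person 2 under each approach
wages_all <- sum(union_all_rows$wages[union_all_rows$coleridge_id == 2])
wages_dedup <- sum(union_rows$wages[union_rows$coleridge_id == 2])

c(rows_union_all = nrow(union_all_rows), rows_union = nrow(union_rows),
  wages_union_all = wages_all, wages_union = wages_dedup)
```

If identical rows can represent genuinely separate payments, dropping them would understate that person's total earnings.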

The `combined_wages` data frame contains earnings observations for the cohort in all four quarters post-graduation. For consistency, let's focus on just the earnings for their first quarter after graduation.

In [None]:
# filter for first quarter
q1_combined_wages <- combined_wages %>%
    mutate(q_after_grad = round(time_after_grad/90)) %>% 
    filter(q_after_grad == 1)

Let's see how many distinct `coleridge_id` are in `q1_combined_wages`.

In [None]:
# see number of distinct coleridge_id values 
n_distinct(q1_combined_wages$coleridge_id)

As you may recall, the number of unique `coleridge_id` values in `q1_combined_wages` is not the same number as in the original cohort. Since we just combined all earnings observations for these six states, we are still missing a portion of the original cohort that did not appear in any of these states' UI wage records in their first quarter after graduation.

To allow for reasonable comparison, let's add in these individuals using another `left_join()`.

In [None]:
# add in those present in df but not q1_combined_wages
q1_all_combined_wages <- df %>%
    left_join(q1_combined_wages, c('coleridge_id', 'degreegroup', 'degreerank', 'kpeds_major1', 'kpeds_major1_cip', 
                                'kpeds_instname', 'kpeds_sector', 'deg_date', 'deg_class'))

In [None]:
# see number of unique coleridge_ids
n_distinct(q1_all_combined_wages$coleridge_id)

Now that we have the UI wage records for the cohort in six states, let's quickly explore `q1_all_combined_wages`.

In [None]:
# Let's check how many people have earnings in each state
q1_all_combined_wages %>%
    group_by(state) %>%
    summarize(n=n_distinct(coleridge_id)) %>%
    arrange(desc(n))

Let's see the breakdown of the number of states each person worked in during this time frame.

In [None]:
# Count number of jobs in different states by coleridge_id
q1_all_combined_wages %>%
    filter(!is.na(state)) %>%
    group_by(coleridge_id) %>%
    summarize(n_states = n_distinct(state)) %>%
    ungroup() %>%
    group_by(n_states) %>%
    summarize(n=n_distinct(coleridge_id))

Let's check how many missing values we have filled in by adding additional states' UI records.

In [None]:
cat('By adding in UI wage records from a handful of bordering states, we have managed to find wage records for', 
   n_distinct(q1_combined_wages$coleridge_id) - n_distinct(q1_wages$coleridge_id),
   'more people, as well as augmented earnings for some others.')

In [None]:
# Let's see the earnings distribution after we add UI records from other states
q1_combined_agg_wages <- q1_all_combined_wages %>%
    group_by(coleridge_id) %>%
    summarize(tot_wages = sum(wages))

summary(q1_combined_agg_wages$tot_wages)
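
One detail of the aggregation above is worth spelling out: `sum()` is called without `na.rm = TRUE`, so people with no wage record in any of the six states keep a missing total instead of a total of zero. A minimal sketch with made-up vectors (`no_record` and `two_states` are hypothetical):

```r
# a person with no wage record has a single NA wage after the left join
no_record <- c(NA_real_)

# a person with wage records in two states
two_states <- c(2500, 1200)

total_missing <- sum(no_record)                 # NA: this person stays missing
total_as_zero <- sum(no_record, na.rm = TRUE)   # 0: "missing" would silently become zero earnings
total_two <- sum(two_states)                    # 3700

c(total_missing = total_missing, total_as_zero = total_as_zero, total_two = total_two)
```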

## Visualizing Earnings Distributions

We can quickly determine if these different imputation methods significantly altered the pre-imputation wage distribution by visualizing the overall earnings distribution. Plotting side-by-side boxplots can be an effective choice. To do so, we need to bind the earnings from all of these methods by rows, meaning they must have the same columns. For the sake of simplicity, we will have three columns in this data frame:

- `coleridge_id`, the person identifier
- `tot_wages`, cumulative earnings in first quarter post-graduation
- `method`, type of imputation method

In [None]:
# adapt q1_no_missing
q1_no_missing %>%
    select(coleridge_id, tot_wages) %>% head()

q1_no_missing <- q1_no_missing %>%
    select(coleridge_id, tot_wages) %>%
    mutate(method = 'remove missing')

In [None]:
# adapt q1_reg_earnings
q1_reg_earnings%>%
    select(coleridge_id, tot_wages) %>% head()

q1_reg_earnings <- q1_reg_earnings %>%
    select(coleridge_id, tot_wages) %>%
    mutate(method = 'regression')

In [None]:
# adapt q1_wages_zero
q1_wages_zero %>%
    select(coleridge_id, tot_wages) %>% head()

q1_wages_zero <- q1_wages_zero %>%
    select(coleridge_id, tot_wages) %>%
    mutate(method = 'zero')

In [None]:
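#adapt q1_major_gend_impute
q1_major_gend_impute %>% select(coleridge_id, imputed_wages) %>% rename(tot_wages = imputed_wages) %>% head()

# keep the imputed wages (not the original tot_wages, which still contains missing values)
q1_major_gend_impute <- q1_major_gend_impute %>%
    select(coleridge_id, imputed_wages) %>%
    rename(tot_wages = imputed_wages) %>%
    mutate(method = 'mean')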

Now that these methods all have the same column names, we can feed them into `rbind()`.

In [None]:
# combine earnings from all methods
all_methods <- rbind(q1_major_gend_impute, q1_reg_earnings, q1_no_missing, q1_wages_zero)

Instead of plotting the earnings distributions of each method one at a time, we can plot them all in a side-by-side fashion by using the `facet_grid()` function as we did in the Data Visualization [notebook](05_Data_Visualization.ipynb/#Distribution-of-quarterly-wages-by-degree-rank).

In [None]:
# boxplot of all methods
all_methods %>%
    ggplot(aes(x=tot_wages, y ='')) +
    geom_boxplot() + 
    facet_grid(method ~ .) +
    labs(
        title = "The Q1 Earnings Distribution's Quartiles up to the 75th are largely affected by \n imputation method",
        x='Quarter 1 Earnings',
        caption = 'Source: KPEDS and KY UI wage records data'
    ) +
    theme_minimal()
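
A numeric summary by method can complement the boxplots. The sketch below uses a hypothetical data frame, `toy_methods`, with the same three columns as `all_methods`; with the real data, `na.rm = TRUE` would likely be needed inside `quantile()` and `median()` for any method that still contains missing wages.

```r
library(dplyr)

# hypothetical stacked data in the same three-column shape as all_methods
toy_methods <- data.frame(
    coleridge_id = rep(1:4, times = 2),
    tot_wages = c(4000, 6000, 0, 0, 4000, 6000, 5000, 5000),
    method = rep(c('zero', 'mean'), each = 4)
)

# quartiles of earnings within each imputation method
toy_summary <- toy_methods %>%
    group_by(method) %>%
    summarize(
        q25 = quantile(tot_wages, 0.25),
        median = median(tot_wages),
        q75 = quantile(tot_wages, 0.75)
    )

toy_summary
```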

## Multiple histograms

We can also look at the differences in the earnings distribution by looking at side-by-side histograms. Instead of using the `geom_` layer `geom_boxplot()`, we will use `geom_histogram()`.

In [None]:
all_methods %>%
    ggplot(aes(x=tot_wages)) +
    geom_histogram() + 
    facet_grid(method ~ .) +
    labs(
        title = 'REDACTED has a significant change on the overall earnings distribution',
        y = 'Count',
        x='Quarterly Wages',
        caption = 'Source: KPEDS and KY UI wage records data'
    ) +
    theme_minimal()

<h3 style="color:red">Checkpoint 7: Visualizing cross state earnings</h3>
Add the cross state earnings distribution to either the above multiple histograms or boxplots.

### (Optional) Advanced: Using machine learning to impute values

To impute values, we can also use machine learning algorithms such as `K-nearest Neighbors` and `Decision Trees`. The principle behind `K-nearest Neighbors` is quite simple: a missing value can be imputed from the values of its "closest neighbors", where closeness is measured using other, known features. 

For example, if the earnings of some graduates were completely missing, we could approximate their earnings from the earnings of graduates who share other characteristics with them, such as major group (their "closest neighbors" in terms of characteristics).

The algorithm calculates the distance between each observation with a missing value and the observations with known values, based on the features they share, and imputes the missing value from the nearest observations. Imputing missing data using machine learning is an active research area, and there are plenty of papers covering the various algorithms if you are curious.
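
As a rough illustration of the idea, here is a hand-written sketch of K-nearest-neighbors imputation in base R. It is not a production implementation: the data frame `toy`, its `age` and `stem` features, and the choice of `k = 2` are all assumptions made for this example, and a dedicated package would normally be used instead.

```r
# hypothetical graduates: two numeric features and wages, one wage missing
toy <- data.frame(
    age = c(22, 23, 30, 31, 24),
    stem = c(1, 1, 0, 0, 1),
    tot_wages = c(9000, 9500, 6000, 6400, NA)
)

k <- 2  # number of neighbors (an arbitrary choice for this sketch)

# standardize the features so that no single feature dominates the distance
features <- scale(toy[, c('age', 'stem')])
observed <- !is.na(toy$tot_wages)

toy$imputed_wages <- toy$tot_wages
for (i in which(!observed)) {
    # Euclidean distance from row i to every row with observed wages
    diffs <- sweep(features[observed, , drop = FALSE], 2, features[i, ])
    dists <- sqrt(rowSums(diffs^2))
    # average the wages of the k closest observed rows
    nearest <- order(dists)[1:k]
    toy$imputed_wages[i] <- mean(toy$tot_wages[observed][nearest])
}

toy
```

Here the graduate with the missing wage is closest to the two other young STEM graduates, so the imputed value is the average of their wages.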

## References

Peugh, J. L., & Enders, C. K. (2004). Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement. _Review of Educational Research_, 74(4), 525-556. doi: 10.3102/00346543074004525

Rubin, D. B. (1976). Inference and Missing Data. _Biometrika_, 63(3), 581-592. doi:10.2307/2335739