<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<div align='center'>Julia Lane, Benjamin Feder, Angie Tombari, and Ekaterina Levitskaya.</div>

**_Disclosure Review Examples & Exercises_**

This notebook contains information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of outputs before submitting an export request and provides an overview of the information needed for disclosure review. _Please read through the entire notebook because it separately discusses the different types of outputs that will be flagged in the disclosure review process._

In [None]:
#database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)

# scaling data, calculating percentages, overriding default graphing
library(scales)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

# General Remarks on Disclosure Review

## Files you can export
In general, any file format can be exported. In practice, researchers typically export tables, graphs, regression outputs, and aggregated data. Every result intended for export should therefore be saved as a .csv file, a .txt file, or a graph file.

## Jupyter notebooks are only exported to retrieve code
Unfortunately, results cannot be exported in a Jupyter notebook. Doing disclosure reviews on output in Jupyter notebooks is too burdensome. Jupyter notebooks will only be exported when the output is deleted for the purpose of exporting code. **This does not mean that Jupyter notebooks will not be needed during the export process.** 

## Documentation of code is important
Provide the code for every output requested for export. ADRF staff need the code to understand how each output was created, and understanding how results were produced is essential to reviewing them. It is therefore important to document every step of an analysis in a Jupyter notebook. 

## General rules to keep in mind
A more detailed description of the rules for exporting results can be found on the class website. This is just a quick overview. Please go to the class website and read the guidelines in full (link below) before preparing files for export. 
- The disclosure review is based on the underlying observations of your study. **Every statistic for export must be based on at least 10 individuals. When reporting any employment statistics in Kentucky, in addition to the restriction of a minimum of 10 individuals, it must be shown that 1) there are at least 10 firms and 2) employment in no one firm comprises more than 80% of the associated group to receive an export**. In other states, the statistics must be based on at least 3 firms, and employment in no one firm may comprise more than 80% of the associated group. The disclosure review team must be shown that every statistic for export meets these thresholds by providing the associated counts/percentages in an input file. 
- Document code so the reviewer can follow the data work provided. Assessing re-identification risk depends heavily on context, so it is imperative to provide contextual information with the analysis for the reviewer. When writing comments in the code, make sure not to include any individual statistic (e.g. the mean is ...).
- Save the requested output with the corresponding code in the input and output folders. Make sure the code is executable. The code should exactly produce the output that is requested.
- Please request an export only when results are final and they are needed for a presentation or final project report.

## To-Do:
Read through the **documentation** link: adrf.readthedocs.io/en/latest/export_of_results/guidelines.html#documentation

# Disclosure Review Walkthrough

The statistics and visualizations created in the Data Exploration and Data Visualization notebooks will be reconstructed and prepared to pass the disclosure review process.

## Counts

Recall the first guiding question in the "Understanding Our Graduates" [section](03_Data_Exploration.ipynb/#Understanding-Our-Graduates) of the Data Exploration notebook:

- How many graduates are there by highest degree rank (within the 2013 AY)?

First, the table containing the cohort must be read into R.

In [None]:
# read cohort into R
qry <- "
select *
from ada_ky_20.cohort
"
df <- dbGetQuery(con, qry)

# see df
head(df)

From here, the number of graduates by `degreerank` can be found using `count()`. Because the desired statistics for export are not from any of the employment tables, the counts within each of these groups just need to be at least 10. 

Otherwise, as mentioned above, each count would need to be based on at least 10 firms (in Kentucky; 3 in all other states), and the employment in no one firm could make up more than 80% of the group to receive the export.

In [None]:
# count by degreerank
df %>%
    count(degreerank)
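
A quick way to confirm that every group meets the minimum is to flag any count below 10. The sketch below is only an illustration: it uses base R and a made-up data frame so that it runs without a database connection, and the names `toy`, `min_n`, and `passes` are hypothetical rather than part of the ADRF data.

```r
# Toy data (hypothetical): one row per graduate
toy <- data.frame(
  degreerank = c(rep("Bachelor", 12), rep("Associate", 10), rep("Doctoral", 4))
)

# Count individuals per group (base R analogue of count(degreerank))
counts <- as.data.frame(table(degreerank = toy$degreerank), responseName = "n")

# Flag groups that fall below the minimum of 10 individuals
min_n <- 10
counts$passes <- counts$n >= min_n

# Groups with passes == FALSE would not meet the threshold for export
counts
```

Here the "Doctoral" group has only 4 individuals, so it is the one group flagged as not passing.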

The `write_csv()` function will allow for the transformation of the data frame to a .csv as long as the file path and the name of the .csv are designated. Here, the file will be called `counts_by_degreerank.csv` (the more descriptive the name of the file, the easier it is for the Coleridge Initiative's export team to review).

> In the file path, the `user` object is assigned to your ADRF username, so the csv will be saved in the `user`'s home folder. Review the `sprintf` section in the Data Exploration [notebook](03_Data_Exploration.ipynb/#sprintf) if there is any uncertainty around the usage of `sprintf`.

In [None]:
# change to your username for saving files
user <- 'benjaminfeder'

In [None]:
# save as csv
df %>%
    count(degreerank) %>%
    write_csv(sprintf('/nfshome/%s/counts_by_degreerank.csv', user))

### Percentages

Percentages are a special case of working with counts: for any reported percentage, the underlying counts of the numerators and denominators must be provided for each group involved in your desired export. The following example illustrates this. It is taken from the "Understanding Our Graduates" [section](03_Data_Exploration.ipynb/#Understanding-Our-Graduates) in the Data Exploration notebook:

- What are the percentages of graduates who received their primary degrees within the seven major groups? Does this differ by institution location?

Recall that to answer this question, the `deg_class` variable in `df` was employed.

In [None]:
# see values of deg_class
unique(df$deg_class)

The following code was written to find the answer to the first part:

In [None]:
# find percentage of graduates by degree type
df %>%
    count(deg_class) %>%
    mutate(pct = round((n/sum(n)) * 100, 2)) %>%
    arrange(desc(pct)) %>%
    select(-n)

Notice how there are no counts provided in this output. To receive it as part of an export, a supplementary file of the counts within each of the groups must be provided. In this case, it is sufficient to `count()` the number of rows by `deg_class` to find the underlying counts for these percentages since each row in `df` corresponds to a unique individual.

In [None]:
# find counts by deg_class
df %>%
    count(deg_class)

Since there is the same denominator for this csv (total number of individuals in our cohort), an extra row for the denominator can be added using `rbind()`.

> To export a table with percentages calculated as a within-row calculation by dividing two columns, the counts for the variables within each row that were used to find the percentage(s) must be provided.

In [None]:
# find counts by deg_class
df %>%
    count(deg_class) %>%
    rbind(data.frame(deg_class = "total (denominator)", n = n_distinct(df$coleridge_id)))

These data frames can now be written to csv, as they are ready for export. If one file provides supporting counts and/or general information for a file designated for export, please give the supporting file the same name with `_counts` appended, as seen below.

In [None]:
# save as csv
df %>%
    count(deg_class) %>%
    mutate(pct = round((n/sum(n)) * 100, 2)) %>%
    arrange(desc(pct)) %>%
    select(-n) %>%
    write_csv(sprintf('/nfshome/%s/pct_by_degree_type.csv', user))

In [None]:
# save support csv
df %>%
    count(deg_class) %>%
    rbind(data.frame(deg_class = "total (denominator)", n = n_distinct(df$coleridge_id))) %>%
    write_csv(sprintf('/nfshome/%s/pct_by_degree_type_counts.csv', user))
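
Before submitting, it can help to check that the supporting counts actually reproduce the percentages and that every numerator and the denominator meet the minimum of 10. The following is a minimal sketch under assumed data: the `support` data frame and its values are invented, and only the `"total (denominator)"` label mirrors the convention used above.

```r
# Hypothetical supporting counts in the same layout as the _counts file
support <- data.frame(
  deg_class = c("A", "B", "C", "total (denominator)"),
  n = c(50, 30, 20, 100)
)

# Separate the denominator row from the numerator rows
is_total <- support$deg_class == "total (denominator)"
denom <- support$n[is_total]

# Recompute the percentages exactly as in the export file
pct <- round(support$n[!is_total] / denom * 100, 2)

# Every numerator and the denominator should be at least 10
meets_minimum <- all(support$n >= 10)

pct
meets_minimum
```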

Now, to export the percentage of graduates by `deg_class` based on their institution's location, first the institution crosswalk must be loaded. 

In [None]:
# read kpeds_inst_xwalk into R
qry <- "
select *
from kystats_2020.kpeds_inst_xwalk
"
inst_xwalk <- dbGetQuery(con, qry)

The following code can be used to find the answer to the second part of this question:
- Does this (the percentages of graduates who received their primary degrees within the seven major groups) differ by institution location?

In [None]:
# can match on the institution code
df_app <- df %>% 
    left_join(inst_xwalk, by=c("kpeds_institution" = "inst_code"))

In [None]:
# percentages of graduates by major group for colleges located in Appalachian counties
appalachian <- df_app %>%
    filter(appalachian == 1) %>%
    count(deg_class) %>%
    mutate(pct = round((n/sum(n)) * 100, 2)) %>%
    select(-n)
                

# percentages of graduates by major group for colleges located in non-Appalachian counties
nonappalachian <- df_app %>%
    filter(appalachian == 0) %>%
    count(deg_class) %>%
    mutate(pct = round((n/sum(n)) * 100, 2)) %>%
    select(-n)

#binding these two tibbles to look for differences descriptively
cbind(appalachian %>% mutate(college_location = "Appalachian"), 
      nonappalachian %>% mutate(college_location = "Non-Appalachian"))

Again, before this data frame can be exported, it is necessary to provide the underlying counts of the numerators and denominators used to generate these percentages. The numerators can easily be found by running `count(deg_class)` after filtering by Appalachian status, and the denominators will also be added into the data frame.

In [None]:
# find counts for appalachian
app_counts <- df_app %>%
    filter(appalachian == 1) %>%
    count(deg_class) %>%
    rbind(data.frame(deg_class = 'total (denominator)', n = n_distinct(df_app[df_app$appalachian == 1,]$coleridge_id)))

# find counts for non appalachian
nonapp_counts <- df_app %>%
    filter(appalachian == 0) %>%
    count(deg_class) %>%
    rbind(data.frame(deg_class = 'total (denominator)', n = n_distinct(df_app[df_app$appalachian == 0,]$coleridge_id)))

# combine the two
cbind(app_counts %>% mutate(college_location = "Appalachian"), 
      nonapp_counts %>% mutate(college_location = "Non-Appalachian"))

Now that all of the underlying counts have been confirmed to be at least 10, these data frames can be saved as csv files for export.

> If some counts are less than 10, the counts should still be included as evidence in the input folder.

In [None]:
# save for export
cbind(appalachian %>% mutate(college_location = "Appalachian"), 
      nonappalachian %>% mutate(college_location = "Non-Appalachian")) %>%
    write_csv(sprintf('/nfshome/%s/pct_by_degree_type_appalachian.csv', user))

cbind(app_counts %>% mutate(college_location = "Appalachian"), 
      nonapp_counts %>% mutate(college_location = "Non-Appalachian")) %>%
    write_csv(sprintf('/nfshome/%s/pct_by_degree_type_appalachian_counts.csv', user))

## Fuzzy percentiles

Exact percentiles cannot be exported under any circumstances, regardless of whether the unit of analysis is directly subject to disclosure review. To get a sense of percentiles in the data, fuzzy percentiles can be exported instead; a fuzzy percentile is the average of two true percentiles. 

For example, a fuzzy median can be created by finding the average of the true 45th and 55th percentiles. Preparing such data for export will be illustrated with another example from the Data Exploration notebook, this time from the "Understanding Post-Graduation In-State Employment and Earnings" [section](03_Data_Exploration.ipynb/#Understanding-Post-Graduation-In-State-Employment-and-Earnings):

- How do annual earnings post-graduation differ by degree rank?

Since the cohort's earnings will be the focus of this section, the `cohort_wages` table from the `ada_ky_20` schema must be read into R.

In [None]:
# read cohort's wages into R
qry <- "
select *
from ada_ky_20.cohort_wages
"
df_wages <- dbGetQuery(con, qry)

# see df_wages
head(df_wages)

Initially, to answer this question, the following code was used:

In [None]:
# more nuanced look at distribution
df_wages %>%
    group_by(coleridge_id, degreerank) %>%
    summarize(total_wages = sum(wages)) %>% 
    ungroup() %>% #we ungroup to get rid of the coleridge_id variable
    group_by(degreerank) %>% #group by only degreerank to get us a wage summary by this variable
    summarize('.1'  = quantile(total_wages, .1),
              '.25' = quantile(total_wages, .25),
              '.5'  = quantile(total_wages, .5),
              '.75' = quantile(total_wages, .75),
              '.9'  = quantile(total_wages, .9)
             )

This data frame is not ready for export for four reasons:
    
1. These numbers represent exact percentiles and will not be accepted in the export process
2. There are no underlying counts for each group (`degreerank` in this example)
3. No evidence of lack of employer dominance
4. No presence of underlying employer counts per each group

The items in this list will be addressed in order. Percentiles can be made fuzzy by averaging the two true percentile values equidistant from the percentile in question (in the code below, 5 percentage points below and above it).

In [None]:
# fuzzy quantiles
df_wages %>%
    group_by(coleridge_id, degreerank) %>%
    summarize(total_wages = sum(wages)) %>% 
    ungroup() %>% #we ungroup to get rid of the coleridge_id variable
    group_by(degreerank) %>% #group by only degreerank to get us a wage summary by this variable
    summarize('fuzzy 10' = (quantile(total_wages, .05) + quantile(total_wages, .15))/2,
              'fuzzy 25' = (quantile(total_wages, .20) + quantile(total_wages, .30))/2,
              'fuzzy 50' = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
              'fuzzy 75' = (quantile(total_wages, .70) + quantile(total_wages, .80))/2,
              'fuzzy 90' = (quantile(total_wages, .85) + quantile(total_wages, .95))/2
             )
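
The same averaging can be wrapped in a small helper so that the two neighboring percentiles are always equidistant from the target. This is only a sketch: `fuzzy_quantile()` and its `width` argument are hypothetical names introduced here for illustration, and the input vector is made up.

```r
# Hypothetical helper: average of the true percentiles at p - width and p + width
fuzzy_quantile <- function(x, p, width = 0.05) {
  # the two neighboring percentiles must stay inside [0, 1]
  stopifnot(p - width >= 0, p + width <= 1)
  unname((quantile(x, p - width) + quantile(x, p + width)) / 2)
}

# Made-up values standing in for total wages
x <- 1:101

fuzzy_median <- fuzzy_quantile(x, 0.50)  # averages the 45th and 55th percentiles
fuzzy_90     <- fuzzy_quantile(x, 0.90)  # averages the 85th and 95th percentiles
```

Inside `summarize()`, a call such as `fuzzy_quantile(total_wages, .9)` could then stand in for the hand-written expression, which avoids mistyping one of the two percentiles.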

The fuzzy percentiles have been successfully found for the desired outputs. To finalize this aspect of the export, the underlying counts within each of the groups must be added, which can easily be done by adding `n()` to the `summarize()` call, since each row corresponds to a unique `coleridge_id`.

In [None]:
# fuzzy quantiles with counts
df_wages %>%
    group_by(coleridge_id, degreerank) %>%
    summarize(total_wages = sum(wages)) %>% #this gets us up to the code used above
    ungroup() %>% #we ungroup to get rid of the coleridge_id variable
    group_by(degreerank) %>% #group by only degreerank to get us a wage summary by this variable
    summarize('fuzzy 10' = (quantile(total_wages, .05) + quantile(total_wages, .15))/2,
              'fuzzy 25' = (quantile(total_wages, .20) + quantile(total_wages, .30))/2,
              'fuzzy 50' = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
              'fuzzy 75' = (quantile(total_wages, .70) + quantile(total_wages, .80))/2,
              'fuzzy 90' = (quantile(total_wages, .85) + quantile(total_wages, .95))/2,
              n = n()
             )

Since the underlying counts of individuals are included in the same data frame designated for export, it is not necessary to additionally export a data frame of the underlying counts per group. However, because this export concerns earnings and employment data in Kentucky, there must be evidence that at least ten employers contribute to the wage calculations within each group, as well as a lack of single-employer dominance, to receive the export. Employer dominance can be defined as one employer providing at least 80 percent of the weight in the calculation. Therefore, if a group's employment relies on that ratio or greater from one employer, the group will need to be redacted.

The number of employers involved in the calculations by group will be found first (still `degreerank` in this example).

In [None]:
# see number of employers per group
df_wages %>% 
    group_by(degreerank) %>%
    summarize(
        n_employers = n_distinct(employeeno)
    )

Finally, to verify the lack of employer dominance within any of the groups, the following steps can be worked through:
- Count the instances of the employers within each `degreerank` (grouping variable)
- Use `group_by()` to group the grouping variable
- Calculate the proportion of entries per grouping variable for each employer
- Select the most common employer using `top_n()`

> In this case, the variable by which `top_n()` sorts does not need to be specified because sorting by `n` or `prop` in this example will yield the same result.

In [None]:
# see employer dominance per group
df_wages %>% 
    count(degreerank, employeeno) %>%
    group_by(degreerank) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1)
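
To turn the dominance table into an explicit pass/fail check, the largest single-employer share in each group can be compared with the 80 percent threshold. The sketch below uses base R on invented records; the `toy` data frame, the `employer` column, and the `dominated` flag are hypothetical, and treating a share of exactly 80 percent as dominance is an assumption based on the definition given above.

```r
# Toy wage records (hypothetical): one row per wage record
toy <- data.frame(
  degreerank = c(rep("Bachelor", 10), rep("Associate", 10)),
  employer   = c(rep("E1", 9), "E2",                        # E1 holds 9 of 10 Bachelor rows
                 rep("E3", 4), rep("E4", 3), rep("E5", 3))  # no dominant employer
)

# Largest single-employer share of rows within each group
max_share <- sapply(
  split(toy$employer, toy$degreerank),
  function(e) max(table(e)) / length(e)
)

# Groups at or above the threshold would need to be redacted
dominated <- max_share >= 0.80

max_share
dominated
```

In this toy data the "Bachelor" group is flagged because one employer accounts for 90 percent of its rows, while "Associate" is not.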

Now that high enough individual counts within each group have been verified, fuzzy percentiles have been created, and the employers of these individuals have been analyzed, the data frame in question can be simply exported by piping the output into `write_csv()`.

In [None]:
# save as csv
df_wages %>%
    group_by(coleridge_id, degreerank) %>%
    summarize(total_wages = sum(wages)) %>% 
    ungroup() %>% #we ungroup to get rid of the coleridge_id variable
    group_by(degreerank) %>% #group by only degreerank to get us a wage summary by this variable
    summarize('fuzzy 10' = (quantile(total_wages, .05) + quantile(total_wages, .15))/2,
              'fuzzy 25' = (quantile(total_wages, .20) + quantile(total_wages, .30))/2,
              'fuzzy 50' = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
              'fuzzy 75' = (quantile(total_wages, .70) + quantile(total_wages, .80))/2,
              'fuzzy 90' = (quantile(total_wages, .85) + quantile(total_wages, .95))/2,
              n = n()
             ) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_wages_by_group.csv', user))

The corresponding employer counts and dominance tests will be saved for export as well.

In [None]:
# save employer dominance per group
df_wages %>% 
    count(degreerank, employeeno) %>%
    group_by(degreerank) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_wages_by_group_employer_dominance.csv', user))

In [None]:
# save number of employers per group
df_wages %>% 
    group_by(degreerank) %>%
    summarize(
        n_employers = n_distinct(employeeno)
    ) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_wages_by_group_employer_counts.csv', user))

## Visualizations

The same disclosure controls covered up to this point for counts and other statistics apply for visualizations too, as underlying counts by groups for each visualization must be provided. The examples below are based on those from the Data Visualization notebook. The first example covered is a boxplot.

### Boxplots

Disclosure-proofing boxplots can be quite tricky since there are restrictions against exporting a boxplot with true percentiles or outliers. Instead, the boxplot must be created from manually supplied fuzzy values. This example is modified from the [Fuzzy Percentiles](Disclosure_Review.ipynb/#Fuzzy-percentiles) section above:

- Creating a boxplot of the distribution of earnings for the cohort

Here are the elements required in creating a safe boxplot:

- fuzzy 25th percentile
- fuzzy 75th percentile
- fuzzy median (50th percentile)
- fuzzy minimum
- fuzzy maximum
- no outliers

These separate components of the boxplot (apart from the absence of outliers) can be manually supplied in the `ggplot(aes())` call. Therefore, a data frame will be created containing each of these components and then piped into `ggplot()` accordingly. First, the fuzzy 25th, 50th and 75th percentiles must be generated. In this example, they will all be saved to `stats`.

In [None]:
# find 25, 50 and 75 fuzzy percentiles
stats<-df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    summarize(
        'fuzzy_25' = (quantile(total_wages, .20) + quantile(total_wages, .30))/2,
        'fuzzy_50' = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
        'fuzzy_75' = (quantile(total_wages, .70) + quantile(total_wages, .80))/2
        )
# see stats
stats

To handle fuzzy minimum and maximum values, _first_, the cutoff values will be calculated (in both directions) to determine if an individual would be viewed as an outlier by total wages. _Second_, all individuals whose total wages are outside of these bounds will be `filter()`ed out, and _third_, the average of the two individuals with the lowest wages and of the two individuals with the highest wages within these bounds will be taken to find the fuzzy minimum and maximum values, respectively.

Once the first step is completed, the minimum and maximum cutoff values will be added to `stats` so they can be easily referred to in the future.

In [None]:
# find minimum and maximum cutoff values
stats <- stats %>%
    mutate(
        fuzzy_min_cutoff = (fuzzy_25 - 1.5*(fuzzy_75 - fuzzy_25)),
        fuzzy_max_cutoff = (fuzzy_75 + 1.5*(fuzzy_75 - fuzzy_25))
    )
# see stats
stats

Next, the original data frame, `df_wages %>% group_by(coleridge_id) %>% summarize(total_wages = sum(wages))`, will be filtered to only include individuals whose total wages fall between the `fuzzy_min_cutoff` and `fuzzy_max_cutoff` values. This data frame will be named `new_df`.

In [None]:
# find df with no outliers as per fuzzy min and max cutoffs
new_df <- df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    filter((total_wages > stats$fuzzy_min_cutoff) & (total_wages < stats$fuzzy_max_cutoff)) %>%
    ungroup()

Finally, the `fuzzy_min` and `fuzzy_max` values will be found by taking the average of the two individuals with the lowest, as well as the two individuals with the highest, wages in `new_df`. These values will be saved as variables in `stats`.

In [None]:
# find fuzzy max
new_max <- new_df %>%
    arrange(desc(total_wages)) %>%
    head(2) %>%
    summarize(m = mean(total_wages))

# find fuzzy min
new_min <- new_df %>%
    arrange(total_wages) %>%
    head(2) %>%
    summarize(m = mean(total_wages))

# save fuzzy_min and fuzzy_max to stats
stats <- stats %>%
    mutate(
        fuzzy_min = new_min$m,
        fuzzy_max = new_max$m
    )
# see stats
stats

Now, `stats` can be piped into the `ggplot()` call, feeding in everything besides the `fuzzy_min_cutoff` and `fuzzy_max_cutoff` values.

In [None]:
stats %>%    
    ggplot(aes(x="", ymin = fuzzy_min, lower = fuzzy_25, middle = fuzzy_50, upper = fuzzy_75, ymax = fuzzy_max)) +
    geom_boxplot(stat="identity") +
    labs(
        title = 'Most individuals in the cohort earned within [redacted] and [redacted] dollars in the \n year after graduation',
        y = 'Total Wages',
        x='',
        caption = 'Source: KPEDS and KY UI wage records data'
    ) +
    theme_minimal()

Putting all of the code together, it would look like the code cell below.

> `ggsave()` was added at the end, as it will allow for the saving of the most recent visualization to a PDF.

In [None]:
# make fuzzy boxplot
# get fuzzy 25, 50, 75 and min/max cutoffs
# find 25, 50 and 75 fuzzy percentiles
stats<-df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    summarize(
        'fuzzy_25' = (quantile(total_wages, .20) + quantile(total_wages, .30))/2,
        'fuzzy_50' = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
        'fuzzy_75' = (quantile(total_wages, .70) + quantile(total_wages, .80))/2
        ) %>%
    # find min and max cutoff values
   mutate(
        fuzzy_min_cutoff = (fuzzy_25 - 1.5*(fuzzy_75 - fuzzy_25)),
        fuzzy_max_cutoff = (fuzzy_75 + 1.5*(fuzzy_75 - fuzzy_25))
    )
# find df with no outliers as per fuzzy min and max cutoffs
new_df <- df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    filter((total_wages > stats$fuzzy_min_cutoff) & (total_wages < stats$fuzzy_max_cutoff)) %>%
    ungroup()

# find fuzzy max
new_max <- new_df %>%
    arrange(desc(total_wages)) %>%
    head(2) %>%
    summarize(m = mean(total_wages))

# find fuzzy min
new_min <- new_df %>%
    arrange(total_wages) %>%
    head(2) %>%
    summarize(m = mean(total_wages))

# plot same graph
stats %>%
    mutate(
        fuzzy_min = new_min$m,
        fuzzy_max = new_max$m
    ) %>%
    ggplot(aes(x="", ymin = fuzzy_min, lower = fuzzy_25, middle = fuzzy_50, upper = fuzzy_75, ymax = fuzzy_max)) +
    geom_boxplot(stat="identity") +
    labs(
        title = 'Most individuals in the cohort earned within [redacted] and [redacted] dollars in the \n year after graduation',
        y = 'Total Wages',
        x='',
        caption = 'Source: KPEDS and KY UI wage records data'
    ) +
    theme_minimal()

# save plot
ggsave(sprintf("/nfshome/%s/fuzzy_boxplot_grad_earnings.pdf", user))

Because the visualization uses individuals from Kentucky's UI wage records, input files containing underlying people counts, as well as underlying employer counts and lack of employer dominance must be added. The number of underlying employers and individuals can be found using `n_distinct()`.

In [None]:
# number of individuals
df_wages %>%
    summarize(n_individuals = n_distinct(coleridge_id)) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_num_individuals.csv', user))

In [None]:
# number of employers
df_wages %>%
    summarize(n_employers = n_distinct(employeeno)) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_num_employers.csv', user))

To find evidence of the potential lack of employer dominance, the number of rows pertaining to each employer can be counted, and then the relative proportion can be added in using `mutate()`. To find the highest proportion of rows pertaining to a single `employeeno`, the `top_n()` function can be used.

In [None]:
# proof of lack of employer dominance
df_wages %>%
    count(employeeno) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_employer_dominance.csv', user))

This example can be continued to look at the wage distribution after grouping the institutions of graduation by their sector. The `kpeds_sector` values need to be updated first.

In [None]:
# update kpeds_sector
df_wages <- df_wages %>%
    mutate(kpeds_sector = ifelse(kpeds_sector == "AIKCU", "4 year independent", kpeds_sector))

This output will be transformed so it is completely disclosure-proofed. A similar structure can be followed as before, except this time the statistics in `stats` will be computed inside a `for()` loop, once for each value of `kpeds_sector`. In the last step of each iteration, the results for that sector are appended to `upd_stats` with `rbind()`. At the end, `upd_stats` will be printed; it contains the fuzzy minimum, maximum, and 25th, 50th, and 75th percentiles for each value of `kpeds_sector`.

> For disclosure-proofing boxplots, the code cell below can be updated to fit your requirements.

In [None]:
# initialize data frame
upd_stats <- data.frame()

# go through each of the different sectors (groups)
# use df_wages here so the loop sees the updated kpeds_sector values
for(grp in unique(df_wages$kpeds_sector)){
    new_df <- df_wages %>%
        group_by(coleridge_id, kpeds_sector) %>%
        summarize(total_wages = sum(wages)) %>%
        ungroup() %>%
        filter(kpeds_sector == grp)

    stats <- new_df %>%
        group_by(kpeds_sector) %>% # the grouping variable
        summarize(
            'fuzzy_25' = (quantile(total_wages, .20) + quantile(total_wages, .30))/2,
            'fuzzy_50' = (quantile(total_wages, .45) + quantile(total_wages, .55))/2,
            'fuzzy_75' = (quantile(total_wages, .70) + quantile(total_wages, .80))/2
            ) %>%
       # find min and max cutoff values
        mutate(
            fuzzy_min_cutoff = (fuzzy_25 - 1.5*(fuzzy_75 - fuzzy_25)),
            fuzzy_max_cutoff = (fuzzy_75 + 1.5*(fuzzy_75 - fuzzy_25))
           )

    df_grp <- new_df %>%
        filter(total_wages > stats[stats$kpeds_sector == grp,]$fuzzy_min_cutoff, 
           total_wages < stats[stats$kpeds_sector == grp,]$fuzzy_max_cutoff)

    # find fuzzy max
    new_max <- df_grp %>%
        arrange(desc(total_wages)) %>%
        head(2) %>%
        summarize(m = mean(total_wages))

    # find fuzzy min
    new_min <- df_grp %>%
        arrange(total_wages) %>%
        head(2) %>%
        summarize(m = mean(total_wages))

    stats<-stats %>%
        mutate(
            fuzzy_min = new_min$m,
            fuzzy_max = new_max$m
        )
    # fill upd_stats with the stats for each of the kpeds_sectors
    upd_stats <- rbind(upd_stats, stats)
    }


print(upd_stats)
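
As an alternative to the loop, the same steps can be collected into one function and applied to each group. The sketch below is written in base R on made-up numbers so it runs on its own; `fuzzy_box_stats()`, the `toy` data frame, and its `sector` column are hypothetical names, and the function is only one possible arrangement of the steps described above.

```r
# Hypothetical helper: fuzzy boxplot components for one numeric vector
fuzzy_box_stats <- function(x) {
  # fuzzy percentile: average of the true percentiles 5 points below and above p
  fq <- function(p) unname((quantile(x, p - 0.05) + quantile(x, p + 0.05)) / 2)
  q25 <- fq(0.25)
  q50 <- fq(0.50)
  q75 <- fq(0.75)

  # outlier cutoffs based on the fuzzy interquartile range
  iqr <- q75 - q25
  inside <- sort(x[x > q25 - 1.5 * iqr & x < q75 + 1.5 * iqr])

  # fuzzy min/max: mean of the two lowest / two highest non-outlier values
  c(fuzzy_min = mean(head(inside, 2)),
    fuzzy_25  = q25,
    fuzzy_50  = q50,
    fuzzy_75  = q75,
    fuzzy_max = mean(tail(inside, 2)))
}

# Made-up total wages for two sectors, each with one extreme value
toy <- data.frame(
  sector      = rep(c("2 year", "4 year"), each = 51),
  total_wages = c(1:50, 1000, 101:150, 5000)
)

# One row of fuzzy statistics per sector
by_sector <- t(sapply(split(toy$total_wages, toy$sector), fuzzy_box_stats))
by_sector
```

The extreme values (1000 and 5000) fall outside the cutoffs, so the fuzzy maximum for each sector is taken from the two largest remaining values.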


From here, `upd_stats` can be piped into the `ggplot()` call.

In [None]:
upd_stats %>%    
    ggplot(aes(x=kpeds_sector, ymin = fuzzy_min, lower = fuzzy_25, middle = fuzzy_50, upper = fuzzy_75, ymax = fuzzy_max)) +
    geom_boxplot(stat="identity") + 
    labs(
        title = 'Graduates from [REDACTED] tend to earn [REDACTED] in their first year \n after graduation',
        y = 'Total Earnings',
        x='Institution Type',
        caption = 'Source: KPEDS and KY UI wage records data'
    ) +
    theme_minimal()

This figure can be saved once again using `ggsave()`.

In [None]:
ggsave(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_by_sector.pdf', user))

Before finishing this example, underlying individual counts, employer counts, and evidence of a lack of employer dominance must be provided within each group.

In [None]:
# number of individuals
df_wages %>%
    group_by(kpeds_sector) %>% 
    summarize(n_individuals = n_distinct(coleridge_id)) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_by_sector_individual_counts.csv', user))

In [None]:
# number of employers
df_wages %>%
    group_by(kpeds_sector) %>%
    summarize(n_employers = n_distinct(employeeno)) %>%
    write_csv(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_by_sector_employer_counts.csv', user))

In [None]:
# proof of lack of employer dominance
df_wages %>%
    count(employeeno, kpeds_sector) %>%
    group_by(kpeds_sector) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>% 
    write_csv(sprintf('/nfshome/%s/fuzzy_boxplot_grad_earnings_by_sector_employer_dominance.csv', user))

### Histogram

The thought process behind exporting histograms is very similar to the ones for other visualizations, as there must be verification that each group (or bin, in this case) follows the disclosure review guidelines. While this may seem like a simple idea, it can sometimes require a bit of manipulation.

This example will walk through preparing a histogram of the distribution of wages within the cohort for export, starting with the default settings of `geom_histogram()`.

> Note: In using a density plot, there is no requirement to provide counts per bin because there are no bins - the underlying counts per line just need to be included.

In [None]:
# default geom_histogram settings
df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    ungroup() %>%
    ggplot(aes(x=total_wages)) +
    geom_histogram()

Although the count is naturally displayed on the vertical axis, the counts within each bin may not always be clear. The `stat_bin()` layer can be added to `geom_histogram()` to display the counts per bin.

In [None]:
# displaying counts per bin
df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    ungroup() %>%
    ggplot(aes(x=total_wages)) +
    geom_histogram() + 
    stat_bin(aes(y=..count.., label= ..count..), geom="text", vjust = 0)

It is obvious that these bins do not all have at least 10 observations. Although this issue can be solved in a variety of ways, in this notebook, the edges of the bins will be manually adjusted. To start, the code below creates a bin for every [REDACTED] dollars using the `seq()` function. 

> Note: The `breaks` argument must be added to both the `geom_histogram()` and the `stat_bin()` calls.

In [None]:
# displaying counts per bin with manually set breaks
df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    ungroup() %>%
    ggplot(aes(x=total_wages)) +
    geom_histogram(breaks = seq(REDACTED)) +
    stat_bin(aes(y=..count.., label= ..count..), geom="text", vjust = 0, breaks = seq(REDACTED))
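The per-bin counts can also be checked numerically rather than read off the plot. The following is a minimal sketch on synthetic data, not the notebook's actual wage records: the `total_wages` values and the `breaks` edges are made up for illustration, and the real breaks passed to `geom_histogram()` should be substituted. It uses base R's `cut()`, whose right-closed intervals with `include.lowest = TRUE` should correspond to the default binning of `stat_bin()`.

```r
# Toy data: synthetic annual wages for 200 hypothetical individuals
set.seed(1)
total_wages <- round(runif(200, min = 0, max = 50000))

# Hypothetical bin edges; substitute the breaks passed to geom_histogram()
breaks <- seq(0, 50000, by = 10000)

# cut() assigns each value to a bin; include.lowest = TRUE keeps a value
# equal to the lowest edge in the first bin; table() counts values per bin
bin_counts <- table(cut(total_wages, breaks = breaks, include.lowest = TRUE))
print(bin_counts)

# TRUE only if every bin holds at least 10 observations
all_bins_ok <- all(bin_counts >= 10)
print(all_bins_ok)
```

If `all_bins_ok` is `FALSE`, the bin edges can be widened and the check repeated before the plot is drawn.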

Since each bin consists of at least 10 observations, this visualization could ordinarily be saved for export using `ggsave()`. However, since the example leverages Kentucky UI wage records, the underlying number of employers, as well as a lack of employer dominance, must be shown for each bin. In this case, due to the need to encode these groups, that task may be arduous. Instead, it may be easier to use a density plot to represent the general distribution, as the number of individuals, number of employers, and lack of employer dominance only need to be displayed for the entire sample.

In [None]:
# displaying counts as a density plot to prevent disclosure issues
df_wages %>%
    group_by(coleridge_id) %>%
    summarize(total_wages = sum(wages)) %>%
    ungroup() %>%
    ggplot(aes(x=total_wages)) +
    geom_density()

In [None]:
# save viz for export
ggsave(sprintf('/nfshome/%s/wage_distribution_density.pdf', user))

The same code as in the first fuzzy boxplot export can be used as evidence for the required counts and dominance proofing.

In [None]:
# number of individuals
df_wages %>%
    summarize(n_individuals = n_distinct(coleridge_id)) %>%
    write_csv(sprintf('/nfshome/%s/wage_distribution_density_num_individuals.csv', user))

In [None]:
# number of employers
df_wages %>%
    summarize(n_employers = n_distinct(employeeno)) %>%
    write_csv(sprintf('/nfshome/%s/wage_distribution_density_num_employers.csv', user))

In [None]:
# proof of lack of employer dominance
df_wages %>%
    count(employeeno) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    write_csv(sprintf('/nfshome/%s/wage_distribution_density_employer_dominance.csv', user))

## Barplot

An example of preparing a barplot for disclosure review using wage record data is shown below. It may be useful to recall the example in the Data Visualization [notebook](05_Data_Visualization.ipynb/#Key-Sectors-by-Institution-Location) of visualizing the percentage of graduates in key sectors by Appalachian status. The original visualization (created with the code below) will be prepared for export.

In [None]:
# add in five key sectors
# changed names a bit for viz
df_wages <- df_wages %>%
    mutate(key_sect = case_when(
        majorindustry == "Manufacturing" ~ "Manufacturing",
        majorindustry == "Construction" ~ "Construction",
        majorindustry == "Health Care and Social Assistance" ~ "Health Sciences",
        majorindustry == "Transportation and Warehousing" ~ "Transportation",
        majorindustry %in% c("Professional, Scientific, and Technical Services",
                             "Finance and Insurance", 
                             "Information",
                             "Wholesale Trade") ~ "Business",
        TRUE ~ "Non_Key"
    )
          )
# can match on the institution code
df_wages_app <- df_wages %>% 
    left_join(inst_xwalk, by=c("kpeds_institution" = "inst_code"))

# find breakdown of employed graduates in KY by appalachian status of institution
app_break<-df_wages_app %>% 
    group_by(appalachian) %>%
    summarize(n_total=n_distinct(coleridge_id))

In [None]:
# bar plot of percentages by industry and appalachian status
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n = n_distinct(coleridge_id)) %>%
    ungroup() %>%
    left_join(app_break, "appalachian") %>%
    mutate(pct = (n/n_total)*100) %>%    
    ggplot(aes(x=word(key_sect, 1), y=pct, fill=as.factor(appalachian))) +
    geom_bar(stat="identity", position=position_dodge()) +
    labs(
        title = 'Graduates from Appalachian Institutions were more likely to end up in [redacted] or \n a [redacted] sector',
        y = 'Percentage of graduates',
        x='Sector',
        fill = 'Appalachian Status',
        caption = 'Source: KPEDS and KY UI wage records data'
    ) +
    theme_minimal() +
    ylim(0,100)

The above visualization cannot be exported just yet for the following reasons:
- No counts of numerators
- No counts of denominators
- Because this visualization is using Kentucky's UI wage records, proof that each numerator and denominator is associated with at least 10 employers, as well as proof of the lack of employer dominance within each group must be shown

The counts for the numerators and denominators by group (`key_sect` and `appalachian`) will be found first.

The number of individuals who were employed by sector and institution location can be found simply by counting the number of individuals within each `key_sect`, `appalachian` group. These groups must all contain at least 10 members of our cohort to proceed accordingly.

> `count()` cannot be used here since each row does not correspond to one graduate.

In [None]:
# counts of key_sect/appalachian combination
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n=n_distinct(coleridge_id))
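The difference between counting rows and counting individuals can be seen on a small made-up table. This is a sketch only: the `toy` data frame is hypothetical, with one person appearing in two wage records of the same group.

```r
library(dplyr)

# Toy data: person "a" has two wage records in the Construction group
toy <- tibble(
    coleridge_id = c("a", "a", "b", "c"),
    key_sect     = c("Construction", "Construction", "Construction", "Business")
)

toy_counts <- toy %>%
    group_by(key_sect) %>%
    summarize(
        n_rows        = n(),                      # rows per group, as count() would report
        n_individuals = n_distinct(coleridge_id)  # unique people per group
    )

toy_counts
```

In the Construction group, `n_rows` is 3 while `n_individuals` is 2, so a row count would overstate the number of graduates.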

Additionally, the counts of the denominators must be added for each of these calculations: the total number of Appalachian and non-Appalachian graduates.

> The number and the names of the columns in `df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(coleridge_id))` must be identical to those of the data frames added in using `rbind()`.

In [None]:
# counts of key_sect/appalachian combination with totals
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n=n_distinct(coleridge_id)) %>%
    ungroup() %>%
    rbind(data.frame(key_sect = 'Total', appalachian = 0, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(coleridge_id)) %>% filter(appalachian == 0))$n),
          data.frame(key_sect = 'Total', appalachian = 1, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(coleridge_id)) %>% filter(appalachian == 1))$n)
         )

Here, the number for the total does not equal the sum of the counts per subgroup. That makes sense in this example, as some individuals may have worked in multiple sectors within this time frame.
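The gap between a total and the sum of its subgroups can be reproduced on toy data. The sketch below uses a hypothetical `toy` data frame in which one person worked in two sectors. It also appends the totals with `bind_rows()`, which matches columns by name; this is offered as one possible alternative to `rbind()`, not as the notebook's method.

```r
library(dplyr)

# Toy data: person "a" worked in two sectors
toy <- tibble(
    coleridge_id = c("a", "a", "b", "c"),
    key_sect     = c("Construction", "Business", "Construction", "Business"),
    appalachian  = c(1, 1, 1, 0)
)

# unique individuals per sector and Appalachian status
by_group <- toy %>%
    group_by(key_sect, appalachian) %>%
    summarize(n = n_distinct(coleridge_id), .groups = "drop")

# unique individuals per Appalachian status, labelled as the "Total" row
totals <- toy %>%
    group_by(appalachian) %>%
    summarize(n = n_distinct(coleridge_id)) %>%
    mutate(key_sect = "Total")

# bind_rows() aligns columns by name, so column order does not matter
with_totals <- bind_rows(by_group, totals)
with_totals
```

For `appalachian == 1`, the subgroup counts sum to 3 while the total is 2, because person "a" is counted once in each sector but only once in the total.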

The final portion of information required before exporting the visualization is at the employer level, both in terms of counts and dominance. The counts of the employers per group can be found using a similar process as the one for finding the number of graduates per group.

In [None]:
# find number of employers per group
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n=n_distinct(employeeno))

The number of total employers for graduates of institutions in each geographic location should be added as well.

In [None]:
# number of employers per group and total
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n=n_distinct(employeeno)) %>%
    ungroup() %>%
    rbind(data.frame(key_sect = 'Total', appalachian = 0, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(employeeno)) %>% filter(appalachian == 0))$n),
          data.frame(key_sect = 'Total', appalachian = 1, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(employeeno)) %>% filter(appalachian == 1))$n)
          )

Before this visualization is fully prepared for export, there must be evidence of a lack of employer dominance.

In [None]:
# employer dominance by group
df_wages_app %>%
    count(key_sect, appalachian, employeeno) %>% 
    group_by(key_sect, appalachian) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    select(-c(employeeno, n))
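To make the dominance calculation concrete, here is a minimal sketch on made-up data. As in the cell above, an employer's share is measured as its fraction of wage records within a group; whether the guidelines define dominance by records or by some other measure is not stated here, so treat that as an assumption. The `toy` data frame and its `group` column are hypothetical. The sketch uses `slice_max()`, which dplyr offers as the successor to `top_n()`, and names the ranking column explicitly rather than relying on `top_n(1)` picking the last column.

```r
library(dplyr)

# Toy data: one row per wage record;
# employer "e1" holds 3 of the 4 records in group "A"
toy <- tibble(
    group      = c("A", "A", "A", "A", "B", "B"),
    employeeno = c("e1", "e1", "e1", "e2", "e3", "e4")
)

dominance <- toy %>%
    count(group, employeeno) %>%                  # records per employer within group
    group_by(group) %>%
    mutate(prop = n / sum(n)) %>%                 # employer's share of the group's records
    slice_max(prop, n = 1, with_ties = FALSE) %>% # keep one row: the largest share
    ungroup() %>%
    select(group, prop)

dominance
```

Group "A" has a largest share of 0.75 and group "B" of 0.5. Setting `with_ties = FALSE` returns a single row per group even when two employers share the maximum.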

To finish this portion of the export process, the maximum employer dominance across all sectors, broken down by Appalachian status, must be added. First, for readability purposes, this information will be assigned to `emp_dom` before it is added into the original data frame.

In [None]:
# employer dominance of totals
emp_dom <- df_wages_app %>%
    count(appalachian, employeeno) %>% 
    group_by(appalachian) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    select(-c(employeeno, n)) %>%
    ungroup()

emp_dom

In [None]:
# add in total group employer dominance
df_wages_app %>%
    count(key_sect, appalachian, employeeno) %>% 
    group_by(key_sect, appalachian) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    ungroup() %>%
    select(-c(employeeno, n)) %>%
    rbind(
        data.frame(key_sect = as.character("Total"), appalachian = 0, prop = (emp_dom %>% filter(appalachian == 0))$prop),
        data.frame(key_sect = as.character("Total"), appalachian = 1, prop = (emp_dom %>% filter(appalachian == 1))$prop)
    )

Since all of the counts of individuals and employers are at least 10 for each subgroup and there is no employer dominance within any of the subgroups, this visualization will pass disclosure review. To finish it off, these outputs will all be saved in their respective file formats.

In [None]:
# counts of key_sect/appalachian combination with totals
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n=n_distinct(coleridge_id)) %>%
    ungroup() %>%
    rbind(data.frame(key_sect = 'Total', appalachian = 0, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(coleridge_id)) %>% filter(appalachian == 0))$n),
          data.frame(key_sect = 'Total', appalachian = 1, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(coleridge_id)) %>% filter(appalachian == 1))$n)
         ) %>%
        write_csv(sprintf('/nfshome/%s/appalachian_sector_barplot_individual_counts.csv', user))

In [None]:
# write number of employers to csv
df_wages_app %>%
    group_by(key_sect, appalachian) %>%
    summarize(n=n_distinct(employeeno)) %>%
    ungroup() %>%
    rbind(data.frame(key_sect = 'Total', appalachian = 0, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(employeeno)) %>% filter(appalachian == 0))$n),
          data.frame(key_sect = 'Total', appalachian = 1, n = (df_wages_app %>% group_by(appalachian) %>% summarize(n=n_distinct(employeeno)) %>% filter(appalachian == 1))$n)
          ) %>%
        write_csv(sprintf('/nfshome/%s/appalachian_sector_barplot_emp_counts.csv', user))

In [None]:
# write employer dominance to csv
df_wages_app %>%
    count(key_sect, appalachian, employeeno) %>% 
    group_by(key_sect, appalachian) %>%
    mutate(prop = n/sum(n)) %>%
    top_n(1) %>%
    ungroup() %>%
    select(-c(employeeno, n)) %>%
    rbind(
        data.frame(key_sect = as.character("Total"), appalachian = 0, prop = (emp_dom %>% filter(appalachian == 0))$prop),
        data.frame(key_sect = as.character("Total"), appalachian = 1, prop = (emp_dom %>% filter(appalachian == 1))$prop)
    ) %>%
    write_csv(sprintf('/nfshome/%s/appalachian_sector_barplot_emp_dominance.csv', user))

In [None]:
# save for export
ggsave(sprintf('/nfshome/%s/appalachian_sector_barplot.pdf', user))

## Machine Learning

When exporting clusters, each cluster must be treated like any other grouping variable: it must satisfy the minimum number of individuals and (when applicable) employers to pass disclosure control.
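As an illustration, cluster sizes can be tabulated directly from a fitted model. The sketch below runs `kmeans()` on random synthetic features; the data, the choice of three clusters, and the variable names are all hypothetical. With real data, each row would need to correspond to one individual for the table to represent individual counts.

```r
# Toy data: 200 hypothetical individuals with two synthetic features
set.seed(42)
features <- matrix(rnorm(200 * 2), ncol = 2)

# fit k-means with an arbitrary choice of 3 clusters
fit <- kmeans(features, centers = 3)

# number of individuals assigned to each cluster
cluster_sizes <- table(fit$cluster)
print(cluster_sizes)

# labels of any clusters with fewer than 10 individuals
too_small <- names(cluster_sizes)[cluster_sizes < 10]
print(too_small)
```

Any cluster listed in `too_small` would need to be merged or suppressed before export, in the same way as an undersized group in a table.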

## Reminders
Every single item designated for export, regardless of whether it is a .csv, .pdf, .png, or something else, must have corresponding proof in an input folder to show that every group used to create this statistic followed the present disclosure review guidelines.

Additionally, when exporting employer-level characteristics, there must be a lack of employer dominance and a count of at least three employers (or 10, depending on the state) per group.
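A supporting counts table can be screened against these minimums before an export request is submitted. This is a rough sketch: `group_summary` and its values are invented, and the `min_employers` value of 3 is only one of the two thresholds mentioned above, so the correct value for the relevant state should be confirmed.

```r
library(dplyr)

# Hypothetical summary table: one row per exported group
group_summary <- tibble(
    group         = c("A", "B", "C"),
    n_individuals = c(25, 8, 40),
    n_employers   = c(12, 11, 2)
)

# Assumed thresholds; the employer minimum may be 10 depending on the state
min_individuals <- 10
min_employers   <- 3

# keep only the groups that fail at least one threshold
flagged <- group_summary %>%
    mutate(
        individuals_ok = n_individuals >= min_individuals,
        employers_ok   = n_employers >= min_employers
    ) %>%
    filter(!individuals_ok | !employers_ok)

flagged
```

Here groups "B" and "C" are flagged: "B" has too few individuals and "C" too few employers. This check does not cover employer dominance, which needs its own proof.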