# Data Measurement Checkpoint Answers

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

**Anna-Carolina Haensch** \
University of Maryland \
LMU Munich

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256974.svg)](https://doi.org/10.5281/zenodo.10256974)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

In [None]:
#Load libraries
import pandas as pd
import numpy as np

#### **Checkpoint 1: Generate Completers, Non-Completers, and Degree Pursuers**

1. Load the 2013 cohort data and save it in `df_2013`.
2. Add the `group` column to the 2013 cohort data. The `group` column identifies *'completer, Associate'*, *'completer, Bachelor'*, *'completer, Master'*, *'non-completer'*, and *'degree pursuer'*.
3. What percentages of the 2013 cohort are completers, non-completers, and degree pursuers?

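Before running the code below, please change <font color='red'> **YOUR DATA DIRECTORY**</font> to your own file path. The path must end with a path separator (for example, `/`), because the file name is appended directly to it.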

In [None]:
# Question 1
#Define data directory
data_directory = 'YOUR DATA DIRECTORY'

#Read in 2013 cohort data
df_2013 = pd.read_csv(data_directory+'syntucky_cohort_2013.csv')
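If you would rather not depend on a trailing separator, one option is to build the path with `pathlib`. This is a minimal sketch, not part of the original notebook; `csv_path` is a name introduced here for illustration.

```python
from pathlib import Path

# Placeholder, as in the cell above; replace with your own folder
data_directory = Path('YOUR DATA DIRECTORY')

# The / operator inserts the correct separator for the operating system
csv_path = data_directory / 'syntucky_cohort_2013.csv'
print(csv_path)

# pd.read_csv accepts a Path object, so the read would become:
# df_2013 = pd.read_csv(csv_path)
```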

In [None]:
# Question 2

#Conditions list
conditions_2013 = [df_2013['high_completion_label'] == 'Associate', #Completers whose highest degree is an associate degree
                   df_2013['high_completion_label'] == 'Bachelor', #Completers whose highest degree is a bachelor's degree
                   df_2013['high_completion_label'] == 'Master', #Completers whose highest degree is a master's degree
                   ((df_2013['year7_enrolled'] == 0) &
                    (~df_2013['high_completion_label'].isin(['Associate', 'Bachelor', 'Master']))), #Non-completers
                   ((df_2013['year7_enrolled'] == 1) &
                    (~df_2013['high_completion_label'].isin(['Associate', 'Bachelor', 'Master'])))] #Degree pursuers

#Choices (or values) list
choices_2013 = ['completer, Associate',
                'completer, Bachelor',
                'completer, Master',
                'non-completer',
                'degree pursuer']

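#Assign results to the indicator 'group' based on conditions; rows that match no condition get a null value
#We use None because np.NaN was removed in NumPy 2.0, and recent NumPy versions reject a float default with string choices
df_2013['group'] = np.select(conditions_2013, choices_2013, default = None)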
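To see how `np.select` assigns the groups, here is a small sketch on made-up rows. The toy values, the `is_completer` helper, and the `'unclassified'` placeholder are assumptions added for illustration; only the column names come from the cohort data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names mirror the cohort file
toy = pd.DataFrame({
    'high_completion_label': ['Associate', 'Bachelor', None, None, 'Master'],
    'year7_enrolled': [0, 1, 0, 1, 0],
})

# A single flag for "has any degree" keeps the later conditions short
is_completer = toy['high_completion_label'].isin(['Associate', 'Bachelor', 'Master'])

conditions = [
    toy['high_completion_label'] == 'Associate',
    toy['high_completion_label'] == 'Bachelor',
    toy['high_completion_label'] == 'Master',
    (toy['year7_enrolled'] == 0) & ~is_completer,
    (toy['year7_enrolled'] == 1) & ~is_completer,
]
choices = ['completer, Associate', 'completer, Bachelor', 'completer, Master',
           'non-completer', 'degree pursuer']

# np.select checks the conditions in order and uses the first one that is True;
# the placeholder 'unclassified' marks rows that match none of them
toy['group'] = np.select(conditions, choices, default='unclassified')
print(toy)
```

Note that the second toy student holds a bachelor's degree and is still enrolled in year 7. Because the completer conditions come first, that student is labeled a completer, not a degree pursuer.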

In [None]:
# Question 3
#Counts of students in each group; save the result to DataFrame 'df_cnt_group_2013'
df_cnt_group_2013 = df_2013.groupby(['group'], dropna = False)['id'].agg(['count']).reset_index()

#Add a new column, 'percent', stored as a proportion between 0 and 1 (multiply by 100 for percentages)
#Recall that df_2013.shape[0] is the number of rows in 'df_2013', i.e., number of students in 'df_2013'
df_cnt_group_2013['percent'] = (df_cnt_group_2013['count'] / df_2013.shape[0]).round(2)

#See the results
df_cnt_group_2013
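As a cross-check on the group shares, `value_counts` can produce them in one step. This is a sketch on a made-up `group` column, not the Syntucky data; `share` and `percent` are names introduced here.

```python
import pandas as pd

# Toy 'group' column (hypothetical values), including one unclassified student
groups = pd.Series(['non-completer', 'non-completer', 'degree pursuer',
                    'completer, Bachelor', None], name='group')

# normalize=True returns shares that sum to 1; dropna=False keeps the missing group
share = groups.value_counts(normalize=True, dropna=False)

# Multiply by 100 to report percentages rather than proportions
percent = (share * 100).round(1)
print(percent)
```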

#### **Checkpoint 2: Generate Basic Job Quality Measure**

Please calculate year 7 average earnings and average number of employers for completers, non-completers, and degree pursuers in the 2013 cohort. Do you observe the same trends as the 2015 cohort?

In [None]:
#Average year 7 earnings and number of employers by group
#We put the variable list vertically to improve the readability of the code
df_job_quality_group_2013 = df_2013.groupby(['group'])[['year7_earnings', 
                                                   'year7_ct_employers']].agg(['mean']).reset_index()

#See the result
df_job_quality_group_2013

Yes, we observe the same trend in the 2013 cohort as the 2015 cohort.

#### **Checkpoint 3: Generate Additional Job Quality Measures**

1. Please calculate year 7 high earnings job rate, average employment duration, and average earnings per employed quarter for completers, non-completers, and degree pursuers in the 2013 cohort. Do you observe the same trends as the 2015 cohort?

2. There are other measures available in the data, such as:
   -  `year7_max_qtrs_one_employer`: Number of quarters a person was employed by the most consistent employer in year 7.
   -  `year7_earnings_most_consistent_employer`: Inflation-adjusted earnings from the most consistent employer in year 7.

    See if you can use them to create other job quality measures.

3. Check the missingness of your outcome variables.

In [None]:
# Question 1
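#High earnings job indicator: 1 if year 7 earnings exceed 15080, otherwise 0
#Note that missing earnings are also coded as 0, because a comparison with a missing value is False
df_2013['year7_high_earnings'] = (df_2013['year7_earnings'] > 15080) * 1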

#Average earnings per employed quarter
df_2013['year7_avg_qtr_earnings'] = df_2013['year7_earnings'] / df_2013['year7_ct_qtrs_employed']

#Define the list of job quality measures
measure_list_2013 = ['year7_high_earnings', 'year7_ct_qtrs_employed', 'year7_avg_qtr_earnings']

#Job Quality by group
df_job_quality_group_2013 = df_2013.groupby(['group'])[measure_list_2013].agg(['mean']).reset_index()

#See the result
df_job_quality_group_2013

Yes, we can observe the same trend in the 2013 cohort as the 2015 cohort.
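Two edge cases can affect these group means. The comparison `> 15080` codes missing earnings as 0, and dividing by zero employed quarters returns `inf` or `NaN`. The notebook does not say whether such rows occur in the Syntucky data, so the following is only a sketch of one way to handle them, using made-up values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names mirror the cohort file
toy = pd.DataFrame({
    'year7_earnings': [20000.0, 0.0, np.nan, 9000.0],
    'year7_ct_qtrs_employed': [4, 0, np.nan, 2],
})

# Treat zero employed quarters as undefined so the ratio is NaN, not inf
qtrs = toy['year7_ct_qtrs_employed'].replace(0, np.nan)
toy['year7_avg_qtr_earnings'] = toy['year7_earnings'] / qtrs

# Keep the indicator missing when earnings are missing, instead of coding it as 0
high = (toy['year7_earnings'] > 15080).astype(float)
toy['year7_high_earnings'] = high.where(toy['year7_earnings'].notna())
print(toy)
```

Whether a missing value should count as "not a high earnings job" or be excluded from the rate is an analytic choice; the sketch shows the second option.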

In [None]:
# Question 2

#Calculate average quarterly earnings from the most consistent employer
df_2013['avg_qtrs_earnings_most_consistent_employer'] = (df_2013['year7_earnings_most_consistent_employer'] /
                                                         df_2013['year7_max_qtrs_one_employer'])

#Update our job quality measure list
measure_list_new = ['year7_max_qtrs_one_employer',
                    'year7_earnings_most_consistent_employer',
                    'avg_qtrs_earnings_most_consistent_employer']

#Job Quality by group
df_job_quality_group_new = df_2013.groupby(['group'])[measure_list_new].agg(['mean']).reset_index()

#See the result
df_job_quality_group_new

In [None]:
# Question 3

#Generate missing indicator: 1 if year 7 number of quarters employed is missing, otherwise 0
df_2013['y7_ct_qtrs_employed_missing'] = df_2013['year7_ct_qtrs_employed'].isna().astype(int)

#Calculate missing rate and number of people with missing year 7 number of quarters employed by group
df_missing_2013 = df_2013.groupby(['group'])['y7_ct_qtrs_employed_missing'].agg(['mean', 'sum']).reset_index()

#Rename columns
df_missing_2013 = df_missing_2013.rename(columns = {'mean' : 'year7_ct_qtrs_employed_missing_rate',
                                                    'sum' : 'year7_ct_qtrs_employed_missing_cnt'})

df_missing_2013
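The same check can cover several outcome variables at once without creating an indicator column for each. This is a sketch on made-up rows; `outcome_cols`, `missing_rate`, and `missing_cnt` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names mirror the cohort file
toy = pd.DataFrame({
    'group': ['non-completer', 'non-completer', 'degree pursuer', 'degree pursuer'],
    'year7_earnings': [12000.0, np.nan, 8000.0, np.nan],
    'year7_ct_qtrs_employed': [4.0, np.nan, 3.0, 2.0],
})
outcome_cols = ['year7_earnings', 'year7_ct_qtrs_employed']

# isna() gives a True/False table; its group-wise mean is the missing rate
missing_rate = toy[outcome_cols].isna().groupby(toy['group']).mean()

# Its group-wise sum is the number of students with a missing value
missing_cnt = toy[outcome_cols].isna().groupby(toy['group']).sum()

print(missing_rate)
print(missing_cnt)
```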

#### **Checkpoint 4: Examine Job Quality By Major**

1. Create two DataFrames. The first DataFrame only includes the 2013 cohort completers. The second DataFrame only includes the 2013 cohort non-completers and degree pursuers.

2. Examine job quality by major for completers, non-completers, and degree pursuers in the 2013 cohort.

In [None]:
# Question 1
#Save completers to DataFrame df_2013_comp
df_2013_comp = df_2013[(df_2013['group'] == 'completer, Associate') | 
                       (df_2013['group'] == 'completer, Bachelor') |
                       (df_2013['group'] == 'completer, Master')]

#Save non-completers and degree pursuers to df_2013_non_comp
df_2013_non_comp = df_2013[(df_2013['group'] == 'non-completer') | 
                           (df_2013['group'] == 'degree pursuer')]
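An equivalent way to write these filters is `isin`, which avoids chaining several `|` comparisons. This is a sketch on made-up rows; the `.copy()` call is an addition that makes each subset an independent DataFrame, which helps avoid pandas' `SettingWithCopyWarning` if you add columns to the subsets later.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'group': ['completer, Associate', 'completer, Master',
              'non-completer', 'degree pursuer', None],
})

comp_groups = ['completer, Associate', 'completer, Bachelor', 'completer, Master']
non_comp_groups = ['non-completer', 'degree pursuer']

# isin() is True when the value matches any entry in the list
toy_comp = toy[toy['group'].isin(comp_groups)].copy()
toy_non_comp = toy[toy['group'].isin(non_comp_groups)].copy()

print(toy_comp)
print(toy_non_comp)
```

Rows with a missing `group` fall into neither subset, which matches the behavior of the `==` comparisons.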

In [None]:
# Question 2
#Define the list of job quality measures
#This step just puts all column names in one list to improve the readability of the groupby() code.
measure_list_2013 = ['year7_earnings', 'year7_ct_employers', 'year7_high_earnings', 
                     'year7_ct_qtrs_employed', 'year7_avg_qtr_earnings', 'y7_ct_qtrs_employed_missing']

#Job Quality by major for completers
df_job_quality_comp_2013 = df_2013_comp.groupby(['high_completion', 'group'])[measure_list_2013].agg(['mean']).reset_index()

#See the result; display() is needed here because a cell only shows its last expression automatically
display(df_job_quality_comp_2013)

#Job Quality by major for non-completers and degree pursuers
df_job_quality_non_comp_2013 = df_2013_non_comp.groupby(['first_enroll', 'group'])[measure_list_2013].agg(['mean']).reset_index()

#See the result
df_job_quality_non_comp_2013
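Breaking results down by major can leave very few students in some cells, so it may help to report the cell size next to each mean. The sketch below uses pandas named aggregation, which also yields flat column names instead of the two-level header produced by `.agg(['mean'])`. The major names and values are made up, and the output column names are introduced here.

```python
import pandas as pd

# Toy data (hypothetical majors and values); column names mirror the cohort file
toy = pd.DataFrame({
    'high_completion': ['Nursing', 'Nursing', 'Business'],
    'group': ['completer, Associate', 'completer, Associate', 'completer, Bachelor'],
    'year7_earnings': [30000.0, 34000.0, 41000.0],
    'year7_ct_employers': [1, 2, 1],
})

# Each keyword is an output column: new_name=(source_column, aggregation)
summary = (toy.groupby(['high_completion', 'group'])
              .agg(n_students=('year7_earnings', 'size'),
                   mean_earnings=('year7_earnings', 'mean'),
                   mean_employers=('year7_ct_employers', 'mean'))
              .reset_index())
print(summary)
```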