# Data Exploration Checkpoint Answers

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

**Anna-Carolina Haensch** \
University of Maryland \
LMU Munich

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10257347.svg)](https://doi.org/10.5281/zenodo.10257347)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

Before running the code below, please change <font color='red'> **YOUR DATA DIRECTORY**</font> to your own file path.

In [None]:
#Load libraries
import pandas as pd
import numpy as np

#Define data folder directory (end the path with a slash, because file names are appended to it directly)
data_directory = 'YOUR DATA DIRECTORY'

#### **Checkpoint 1: Import Cohort Data**

Please load the 2013 cohort data and save it in `df_2013`. The csv file of the 2013 cohort data is *"syntucky_cohort_2013.csv"*.

In [None]:
#Read in 2013 Syntucky cohort data 
df_2013 = pd.read_csv(data_directory + 'syntucky_cohort_2013.csv')

#Check the data
df_2013.head()
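The cell above builds the file path by string concatenation, which only works when `data_directory` ends with a path separator. As a minimal sketch of an alternative, the standard-library `pathlib` module can join the folder and the file name. The variable `csv_path` is a name introduced here for illustration only.

```python
from pathlib import Path

# Placeholder from this notebook; replace it with your own folder
data_directory = 'YOUR DATA DIRECTORY'

# Path inserts the separator itself, so a missing trailing slash is harmless
csv_path = Path(data_directory) / 'syntucky_cohort_2013.csv'

print(csv_path.name)
```

`pd.read_csv` accepts a path object like `csv_path` in place of a string, so the rest of the checkpoint would stay the same.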

#### **Checkpoint 2: Check Data Structure**

Please check whether the data you loaded in Checkpoint 1 (`df_2013`) has duplicates and whether it has the key information you need for your planned analysis.

In [None]:
#Check number of rows and number of columns in df_2013
df_2013.shape

In [None]:
#Check number of unique individuals in df_2013
df_2013['id'].nunique()

*The number of rows and the number of unique individuals are the same in `df_2013`. We do not need to worry about duplicates.*

In [None]:
#Check columns
df_2013.columns
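The duplicate and column checks in this checkpoint can also be written as explicit tests. The sketch below runs on a small toy DataFrame with made-up values (not Syntucky records), and the list `needed` is a hypothetical example of the columns an analysis might require.

```python
import pandas as pd

# Toy data with hypothetical values; id 2 is repeated on purpose
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'high_completion': ['education', 'business', 'business', 'education'],
    'year7_earnings': [41000.0, 52000.0, 52000.0, None],
})

# True only when every id appears exactly once
ids_are_unique = toy['id'].is_unique

# Rows whose id appears more than once (keep=False flags every copy)
repeated_rows = toy[toy['id'].duplicated(keep=False)]

# Columns the planned analysis needs but the data lacks (hypothetical list)
needed = ['id', 'high_completion', 'high_completion_label', 'year7_earnings']
missing_columns = [col for col in needed if col not in toy.columns]

print(ids_are_unique)
print(repeated_rows)
print(missing_columns)
```

Replacing `toy` with `df_2013` would apply the same checks to the cohort data.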

#### **Checkpoint 3: Create a Subset of Your Data**

Please use the subsetting technique you learned in this section to save students in the 2013 cohort who earned a bachelor's degree in education to a new DataFrame `df_2013_ed_ba`.

Are you interested in other degree levels and majors? Try to save them in other DataFrames.

In [None]:
#Save students who earned a bachelor's degree in education in df_2013_ed_ba
df_2013_ed_ba = df_2013[(df_2013['high_completion'] == 'education') & (df_2013['high_completion_label'] == 'Bachelor')]

#Check the number of people
df_2013_ed_ba.shape

In [None]:
# You can also use groupby and agg to check number of people in each degree level and each major
df_2013.groupby(['high_completion_label', 'high_completion'])['id'].agg(['count']).reset_index()
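For the follow-up question about other degree levels and majors, the same boolean-mask pattern applies with different values. The sketch below uses a toy DataFrame. The values `'business'`, `'health'`, and `'Associate'` are assumptions made for illustration and may not match the labels in the Syntucky file, so check the groupby output above for the actual values.

```python
import pandas as pd

# Toy data with hypothetical values (not Syntucky records)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'high_completion': ['education', 'business', 'education', 'health', 'business'],
    'high_completion_label': ['Bachelor', 'Bachelor', 'Associate', 'Bachelor', 'Associate'],
})

# Same two-condition pattern as df_2013_ed_ba, with a different first value
toy_bus_ba = toy[(toy['high_completion'] == 'business') &
                 (toy['high_completion_label'] == 'Bachelor')]

# isin() keeps rows that match any value in the list
toy_ed_or_health = toy[toy['high_completion'].isin(['education', 'health'])]

print(toy_bus_ba.shape)
print(toy_ed_or_health['id'].tolist())
```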

#### **Checkpoint 4: Generate Descriptive Statistics and Check Missing Values**

1. Use the 2013 cohort DataFrame you saved in Checkpoint 1, `df_2013`, to get the 2013 cohort's mean and median earnings in years 5 through 7.
2. Use the 2013 education bachelor's degree earners DataFrame you saved in Checkpoint 3, `df_2013_ed_ba`, to get their mean and median earnings in years 5 through 7.
3. How many students in `df_2013_ed_ba` have missing earnings in year 7? 

In [None]:
#Summary Statistics of the 2013 cohort's year 5 to year 7 earnings
df_2013[['year5_earnings', 'year6_earnings', 'year7_earnings']].describe()
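In the `describe()` output, the `mean` row gives the mean and the `50%` row gives the median. If only those two statistics are wanted, one option is `agg`, sketched here on a toy DataFrame with made-up earnings (not Syntucky records). Both statistics skip missing values by default.

```python
import pandas as pd

# Toy earnings with hypothetical values; None marks a missing value
toy = pd.DataFrame({
    'year5_earnings': [30000.0, 40000.0, 50000.0, None],
    'year6_earnings': [32000.0, 42000.0, 58000.0, 36000.0],
    'year7_earnings': [35000.0, None, 61000.0, 39000.0],
})

earnings_cols = ['year5_earnings', 'year6_earnings', 'year7_earnings']

# One row per statistic, one column per year
summary = toy[earnings_cols].agg(['mean', 'median'])

print(summary)
```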

In [None]:
#Summary statistics of year 5 to year 7 earnings for 2013 education bachelor's degree earners
df_2013_ed_ba[['year5_earnings', 'year6_earnings', 'year7_earnings']].describe()
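The `count` row of `describe()` counts non-missing values only, so a count below the number of rows already signals missing earnings. As a small sketch on toy data (hypothetical values, not Syntucky records), `isna().sum()` gives the number of missing values and `isna().mean()` gives their share directly.

```python
import pandas as pd

# Toy data with hypothetical values; two of four year 7 earnings are missing
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'year7_earnings': [35000.0, None, 61000.0, None],
})

# Number of rows with missing year 7 earnings
n_missing = toy['year7_earnings'].isna().sum()

# The mean of a True/False column is the share of True values
share_missing = toy['year7_earnings'].isna().mean()

print(f"{n_missing} of {len(toy)} rows ({share_missing:.1%}) have missing year 7 earnings.")
```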

In [None]:
#Number of 2013 cohort students who earned education bachelor's degree
count_ed_ba = df_2013_ed_ba['id'].nunique()

#Number of 2013 cohort students who earned education bachelor's degree and 
# have missing year 7 earnings
count_ed_ba_no_y7earn = df_2013_ed_ba['year7_earnings'].isna().sum()

#print results
print("In the 2013 cohort,", count_ed_ba, 
      "students' highest degree is a bachelor's degree in education.")
print("Of these students,", count_ed_ba_no_y7earn, "have missing earnings in year 7.")
print("In other words, a proportion of", round(count_ed_ba_no_y7earn/count_ed_ba, 3), 
      "of education bachelor's degree earners have missing earnings in year 7.")