# Data Visualization Checkpoint Answers

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

**Anna-Carolina Haensch** \
University of Maryland \
LMU Munich

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10257134.svg)](https://doi.org/10.5281/zenodo.10257134)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

Before running the code below, please change <font color='red'> **YOUR DATA DIRECTORY**</font> to your own file path.

In [None]:
#Data analysis libraries
import pandas as pd
import numpy as np

#Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Define data folder directory
#End the path with a slash, because file names are appended directly to this string
data_directory = 'YOUR DATA DIRECTORY'

#### **Checkpoint 1: Create a Boxplot for the 2013 Cohort Bachelor's Degree Holders**

Please load the 2013 cohort data and save students whose highest degree is a bachelor's degree in `df_2013_ba`. Then create a boxplot to show the distribution of year 7 earnings for bachelor's degree holders by major.

Before running the code below, please change <font color='red'> **YOUR NAME**</font> in the `plt.savefig()` line to your username, or replace the whole path with your own file path.

In [None]:
#Read in 2013 cohort data
df_2013 = pd.read_csv(data_directory + 'syntucky_cohort_2013.csv')

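#Students in the 2013 cohort whose highest degree level is bachelor's degree
#.copy() creates an independent dataframe, so adding columns later does not raise a SettingWithCopyWarning
df_2013_ba = df_2013[df_2013['high_completion_label'] == 'Bachelor'].copy()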

In [None]:
#Create a boxplot

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

# Create a boxplot for 'year7_earnings' across all 'high_completion' groups
# We also use `order = ` to sort majors alphabetically
sns.boxplot(x = 'high_completion', y = 'year7_earnings', data = df_2013_ba,
           order = ['business', 'computerscience', 'education', 'nursing', 'other'])

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Year 7 Earnings', fontsize = 12)
ax.set_xlabel('Highest Degree Major', fontsize = 12)

# Y-axis tick labels: Change the format of earnings displayed on Y-axis
ax.yaxis.set_major_formatter('${x:,.0f}')

# X-axis tick labels: 
# We can define x tick labels, such as using capital letters for the first letter of each word
# Make sure the order of your labels is consistent with the order you defined in sns.boxplot()
# Rotate x tick labels for better readability if there are many categories
ax.set_xticklabels(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'],
                   rotation = 20)

# Set the title
plt.title('Of the 2013 Cohort Bachelor\'s Degree Holders, \n Nursing Students Had the Highest Median Earnings in Year 7')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.12, 'Data Source: Syntucky Data')

#Save the plot
plt.savefig(r"C:\Users\YOUR NAME\Documents\ba_2013_y7_earnings_by_major.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()
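
A headline such as the title above can be checked numerically before it goes on the plot. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `df_2013_ba` with made-up values, and only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df_2013_ba (values are made up)
toy = pd.DataFrame({
    'high_completion': ['nursing', 'nursing', 'business', 'business', 'education'],
    'year7_earnings': [60000.0, 70000.0, 40000.0, None, 45000.0],
})

# Median year 7 earnings by major, highest first
# pandas skips missing earnings when computing the median
median_by_major = (toy.groupby('high_completion')['year7_earnings']
                      .median()
                      .sort_values(ascending = False))

print(median_by_major)
```

Running the same `groupby` on `df_2013_ba` should show whether the major named in the title really has the highest median.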

#### **Checkpoint 2: Use a Countplot to Check Counts of Missing Earnings** 

Please create a countplot to check how many students have missing and non-missing year 7 earnings in the 2013 cohort by the highest degree major. 

Before running the code below, please change <font color='red'> **YOUR NAME**</font> in the `plt.savefig()` line to your username, or replace the whole path with your own file path.

In [None]:
# Create a new column indicating whether data in 'year7_earnings' column is missing
# 1 = missing, 0 = non-missing
df_2013_ba.loc[:, 'year7_earn_missing'] = df_2013_ba['year7_earnings'].isna().astype(int)
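
Because the indicator is coded 0/1, its mean within each major is the share of students with missing earnings. The sketch below shows the idea on a hypothetical `toy` dataframe with made-up values; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'high_completion': ['business', 'business', 'business', 'nursing', 'nursing'],
    'year7_earnings': [None, None, 52000.0, 61000.0, 58000.0],
})

# 1 = missing, 0 = non-missing
toy['year7_earn_missing'] = toy['year7_earnings'].isna().astype(int)

# The mean of a 0/1 indicator is the proportion of 1s in each group
missing_share = toy.groupby('high_completion')['year7_earn_missing'].mean()

print(missing_share)
```

A table like this is one way to back up a percentage stated in a plot title, since a countplot shows counts rather than shares.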

In [None]:
# Create a countplot

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

# Create a countplot to show counts of missing and non-missing 'year7_earnings' by major groups
# We also use `order = ` to sort majors alphabetically
sns.countplot(x = 'high_completion', hue = 'year7_earn_missing', data = df_2013_ba,
              order = ['business', 'computerscience', 'education', 'nursing', 'other'])

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Student Counts', fontsize = 12)
ax.set_xlabel('Highest Degree Major', fontsize = 12)

# Y-axis tick labels: Change the format of counts displayed on Y-axis
ax.yaxis.set_major_formatter('{x:,.0f}')

# X-axis tick labels: 
# Make sure the order of your labels is consistent with the order you defined in sns.countplot()
# Rotate x tick labels for better readability if there are many categories
ax.set_xticklabels(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'],
                   rotation = 20)

# Set the title
plt.title('More Than 50% of Students in Business and \n Computer Science Majors Have Missing Year 7 Earnings')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.12, 'Data Source: Syntucky Data')

# Define legend; labels follow the order of the hue values (0 = non-missing, 1 = missing)
plt.legend(['Non-missing', 'Missing'], fontsize = 12)

#Save the plot
plt.savefig(r"C:\Users\YOUR NAME\Documents\ba_2013_y7_earnings_missingness_by_major.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()
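
The tick label format strings used so far, `'${x:,.0f}'` and `'{x:,.0f}'`, follow Python's `str.format` syntax, where `x` is the tick value. As far as we know, matplotlib 3.3 and later wrap such a string in a `StrMethodFormatter`. The sketch below calls that class directly to preview the labels without drawing a plot; the tick values are arbitrary examples.

```python
from matplotlib.ticker import StrMethodFormatter

# ',' adds a thousands separator and '.0f' shows no decimal places
dollar_formatter = StrMethodFormatter('${x:,.0f}')
count_formatter = StrMethodFormatter('{x:,.0f}')

# Calling a formatter with a tick value returns the label text
dollar_label = dollar_formatter(45000)
count_label = count_formatter(1250)

print(dollar_label)
print(count_label)
```

Changing `.0f` to `.2f` would show two decimal places instead.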

#### **Checkpoint 3: Depict Median Earnings Over Time** 

Please create a line plot to show year 5 to year 9 median earnings for the 2013 cohort by the highest degree major. 

Before running the code below, please change <font color='red'> **YOUR NAME**</font> in the `plt.savefig()` line to your username, or replace the whole path with your own file path.

In [None]:
# Choose the variables we need in order to create the line plot
df_2013_ba_earn_wide = df_2013_ba[['id', 'high_completion', 'year5_earnings', 'year6_earnings', 'year7_earnings','year8_earnings','year9_earnings']]

#Change column names for easier labeling
df_2013_ba_earn_wide = df_2013_ba_earn_wide.rename(columns = {'year5_earnings' : 'Year 5',
                                                              'year6_earnings' : 'Year 6',
                                                              'year7_earnings' : 'Year 7',
                                                              'year8_earnings' : 'Year 8',
                                                              'year9_earnings' : 'Year 9'})

# Transform the dataframe from the wide format to the long format using the melt function
df_2013_ba_earn_long = pd.melt(df_2013_ba_earn_wide, 
                               id_vars = ['id', 'high_completion'],
                               value_vars = ['Year 5','Year 6','Year 7', 'Year 8','Year 9'],
                               var_name = 'year',
                               value_name = 'earnings')

#See the first five rows of the long format
df_2013_ba_earn_long.head()
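
To see what `pd.melt` does, it can help to run it on a very small table. The sketch below uses a hypothetical two-student dataframe with made-up earnings; the column names mirror the ones above.

```python
import pandas as pd

# Hypothetical wide-format data: one row per student, one column per year
wide_df = pd.DataFrame({
    'id': [1, 2],
    'high_completion': ['nursing', 'business'],
    'Year 5': [50000, 40000],
    'Year 6': [55000, 42000],
})

# Long format: one row per student per year
long_df = pd.melt(wide_df,
                  id_vars = ['id', 'high_completion'],
                  value_vars = ['Year 5', 'Year 6'],
                  var_name = 'year',
                  value_name = 'earnings')

print(long_df)
```

Two students and two year columns give four rows in the long format, which is the shape `sns.lineplot` expects when `x = 'year'` and `y = 'earnings'`.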

In [None]:
# Create a line plot

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

# Create a line plot to show earnings trends for students in different majors
sns.lineplot(x = 'year', y = 'earnings', hue = 'high_completion', data = df_2013_ba_earn_long,
             #Define orders of categories shown on the legend
             hue_order = ['business', 'computerscience', 'education', 'nursing', 'other'], 
             # We use `estimator=` to define what statistics to show on the graph, 
             # you can replace np.nanmedian with other statistics such as np.mean
             estimator = np.nanmedian, 
             # We hide the `errorbar` here, but it can be confidence interval, standard deviation, etc. 
             errorbar = None)

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Median Earnings', fontsize = 12)
ax.set_xlabel('Number of Years Since College Entrance', fontsize = 12)

# Y-axis tick labels: Change the format of earnings displayed on Y-axis
ax.yaxis.set_major_formatter('${x:,.0f}')

# Rotate x tick labels for better readability
plt.xticks(rotation = 20)

# Set the title
# '\n' allows you to start the sentence in a new line
plt.title('Of the 2013 Cohort Bachelor\'s Degree Holders, \n Nursing Students Had the Highest Median Earnings from Year 5 to Year 9 \n')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.08, 'Data Source: Syntucky Data')

# Define legend; The order of the labels should be consistent with the categories defined in `hue_order`
plt.legend(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'], fontsize = 10)

#Save the plot
plt.savefig(r"C:\Users\YOUR NAME\Documents\ba_2013_earnings_trends_by_major.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()

#### **Checkpoint 4: Show Year 7 Earnings by Major and Degree Completion Status** 

Please use a bar chart to show year 7 median earnings for the 2013 cohort by major and degree completion status (completers, non-completers, and degree pursuers).

Before running the code below, please change <font color='red'> **YOUR NAME**</font> in the `plt.savefig()` line to your username, or replace the whole path with your own file path.

In [None]:
#Generate a group indicator

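#Remove the records where high_completion_label is Doctoral
#.copy() creates an independent dataframe, so adding columns later does not raise a SettingWithCopyWarning
df_2013 = df_2013[df_2013['high_completion_label'] != 'Doctoral'].copy()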

#Conditions list
conditions = [df_2013['high_completion_label'] == 'Associate', #Completers whose highest degrees are associate
              df_2013['high_completion_label'] == 'Bachelor', #Completers whose highest degrees are bachelor
              df_2013['high_completion_label'] == 'Master', #Completers whose highest degrees are master
              ((df_2013['year7_enrolled'] == 0) & 
               ( ~ df_2013['high_completion_label'].isin(['Associate', 'Bachelor', 'Master']))), #Non-completers
              ((df_2013['year7_enrolled'] == 1) & 
               ( ~ df_2013['high_completion_label'].isin(['Associate', 'Bachelor', 'Master'])))] #Degree pursuers

#Choices (or values) list
choices = ['Completer, Associate', 
           'Completer, Bachelor', 
           'Completer, Master', 
           'Non-completer', 
           'Degree pursuer']

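#Assign results to the indicator 'group' based on conditions
#The default is None (a missing value) for rows that match no condition;
#recent NumPy versions cannot combine a float NaN default with string choices
df_2013['group'] = np.select(conditions, choices, default = None)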
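
`np.select` checks the conditions in order and, for each row, returns the choice paired with the first condition that is true. The sketch below runs the same pattern on a hypothetical four-row dataframe. The values, including the 'Certificate' label, are made up, and it uses `default = None` as one way to mark rows that match no condition as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    'high_completion_label': ['Bachelor', None, None, 'Certificate'],
    'year7_enrolled': [0, 0, 1, None],
})

# True for students whose highest degree is one of the completer levels
has_degree = toy['high_completion_label'].isin(['Associate', 'Bachelor', 'Master'])

conditions = [toy['high_completion_label'] == 'Bachelor',
              (toy['year7_enrolled'] == 0) & ~has_degree,
              (toy['year7_enrolled'] == 1) & ~has_degree]

choices = ['Completer, Bachelor', 'Non-completer', 'Degree pursuer']

# Rows that match no condition get None, which pandas treats as missing
toy['group'] = np.select(conditions, choices, default = None)

print(toy)
```

The last row matches no condition because its enrollment status is missing, so its `group` stays empty and is left out of any plot that uses `hue = 'group'`.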

In [None]:
#Major indicator

#For completers, we use the highest degree majors.
#For non-completers and degree pursuers, we use the first enrollment majors.

#Conditions list
conditions = [((df_2013['group'] == 'Completer, Associate') | # condition 1: completers
               (df_2013['group'] == 'Completer, Bachelor') |
               (df_2013['group'] == 'Completer, Master')), 
              ((df_2013['group'] == 'Non-completer') | # condition 2: non-completers and degree pursuers
               (df_2013['group'] == 'Degree pursuer'))] 
    

#Choices (or values) list
choices = [df_2013['high_completion'],
           df_2013['first_enroll']]

#Assign results to the indicator 'major' based on conditions; Default choice is the null value
df_2013['major'] = np.select(conditions, choices, default = np.nan)

In [None]:
# Create a bar chart

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

#Create a bar chart to show year 7 earnings by majors and degree completion status
sns.barplot(x = 'major', y = 'year7_earnings', hue = 'group', data=df_2013,
            #Define the order of student groups
            hue_order = ['Completer, Associate', 'Completer, Bachelor', 'Completer, Master', 
                         'Non-completer', 'Degree pursuer'],
            #Define the order of majors displayed on x-axis
            order = ['business', 'computerscience', 'education', 'nursing', 'other'],
            #Define statistics to show on each bar
            estimator = np.nanmedian, errorbar = None)

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Median Earnings', fontsize = 12)
ax.set_xlabel('Major', fontsize = 12)

# Y-axis tick labels: Change the format of earnings displayed on Y-axis
ax.yaxis.set_major_formatter('${x:,.0f}')

# X-axis tick labels: 
# Make sure the order of your labels is consistent with the order you defined in sns.barplot()
# Rotate x tick labels for better readability if there are many categories
ax.set_xticklabels(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'],
                   rotation = 20)

# Set the title
# '\n' allows you to start the sentence in a new line
plt.title('Nursing and Computer Science Students Had the Highest Year 7 \n Median Earnings Among Completers')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.12, 'Data Source: Syntucky Data')

# Set legend font size
plt.legend(fontsize = 8)

#Save the plot
plt.savefig(r"C:\Users\YOUR NAME\Documents\2013_year7_earnings_by_major_group.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()