# Data Visualization

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

**Anna-Carolina Haensch** \
University of Maryland \
LMU Munich

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10257134.svg)](https://doi.org/10.5281/zenodo.10257134)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

## Goals:
In this notebook, we examine job quality for students in the 2015 Syntucky cohort using several types of visualizations: boxplot, countplot, line plot, and bar chart. The Python code we use to create these visualizations may look lengthy. However, once the code is written, it is easy to fine-tune the details of a visualization and, most importantly, to reproduce the same type of visualization with other data.

**The specific questions we seek to answer in this notebook are**:
1. Of the 2015 cohort bachelor degree holders, what are their year 7 earnings distributions by major?
2. Of the 2015 cohort bachelor degree holders, how does the missingness in year 7 earnings vary by major?
3. Of the 2015 cohort bachelor degree holders, what are their earnings trends over time by major?
4. Of the 2015 cohort, how does the year 7 median earnings vary by major and degree completion status (completer, non-completer, and degree pursuer)?

**After completing this notebook, you should:**
1. Learn how to create simple but informative visualizations
2. Understand how to present your research findings properly and concisely by using different types of visualizations
3. Be able to use Matplotlib and Seaborn functions

This is, like the rest of the class, only an introduction to data visualization with Matplotlib and Seaborn. Cheat sheets are a valuable resource for deepening your understanding of both libraries, such as the [Seaborn Cheat Sheet](https://www.kaggle.com/discussions/getting-started/126958) and the [Matplotlib Cheat Sheet](https://matplotlib.org/cheatsheets/).

## 1. Import Data

In this notebook, we introduce you to two data visualization libraries: **Matplotlib** and **Seaborn**. *Matplotlib* offers a wide range of tools for creating static, animated, or interactive plots. It is highly customizable and allows you to control almost every aspect of your plots. *Seaborn*, on the other hand, is built on top of Matplotlib and integrates well with Pandas data structures. It provides a high-level interface for drawing attractive and informative statistical graphics. It comes with several built-in themes for styling Matplotlib graphics and tools to create complex visualizations.

In [None]:
#Data analysis libraries
import pandas as pd
import numpy as np

#Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

As usual, we use the 2015 cohort data to create visualizations in this notebook. Before running the code below, please change <font color='red'> **YOUR DATA DIRECTORY**</font> to your own file path.

In [None]:
#Define data folder directory
data_directory = 'YOUR DATA DIRECTORY'

#Read in 2015 cohort data
df_2015 = pd.read_csv(data_directory+'syntucky_cohort_2015.csv')

#Check the first five rows of the data
df_2015.head()

## 2. Create Visualizations

### Example 1: Boxplot

A *boxplot*, also known as a *box and whisker plot*, is a graphical representation of statistical data. It usually includes five elements: minimum, first quartile (Q1, or the 25th percentile), median (Q2 or the 50th percentile), third quartile (Q3 or the 75th percentile), and maximum. The "box" in a boxplot encompasses the interquartile range (IQR), which spans Q1 to Q3. The "line" inside the box indicates the median. The "whiskers" extend from the box toward the minimum and maximum, but typically only as far as the most extreme data point within 1.5 times the IQR from the edge of the box. Any data points beyond the whiskers can be considered *outliers* and are often displayed as individual points. **Boxplots are useful in visualizing the spread and skewness of data and in spotting outliers.** Placing boxplots side by side makes it easy to compare these characteristics across different categories or groups in a dataset.
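
As a rough illustration of the elements described above, the following sketch computes the quartiles, the IQR, the whisker ends, and the outliers by hand with NumPy. The earnings values are made up for this example and are not taken from the Syntucky data, and the 1.5 × IQR rule is assumed to be the one in use.

```python
import numpy as np

# Hypothetical earnings values, made up for illustration only
earnings = np.array([18000, 22000, 25000, 27000, 30000, 32000, 35000, 90000])

# Quartiles that define the box and the line inside it
q1, median, q3 = np.percentile(earnings, [25, 50, 75])
iqr = q3 - q1

# Limits for the whiskers under the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the limits
inside = earnings[(earnings >= lower_fence) & (earnings <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the limits would be drawn individually as outliers
outliers = earnings[(earnings < lower_fence) | (earnings > upper_fence)]

print('Q1, median, Q3:', q1, median, q3)
print('Whiskers:', whisker_low, whisker_high)
print('Outliers:', outliers)
```

In this toy sample, the single very high value falls outside the upper limit, so it would appear as a separate point rather than stretching the whisker.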

In this example, we will create a boxplot to show year 7 earnings distributions by major for bachelor's degree holders in the 2015 cohort. First, let's save students in the 2015 cohort whose highest degree level is a bachelor's degree in the DataFrame `df_2015_ba`.

In [None]:
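#Students in the 2015 cohort with the highest degree level of a bachelor's degree
#.copy() creates an independent DataFrame, so columns can be added to it later without a pandas warning
df_2015_ba = df_2015[(df_2015['high_completion_label'] == 'Bachelor')].copy()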

Now we can use the seaborn function `sns.boxplot()` to create a simple boxplot. You need to define three parameters in this function: 1) `x = 'high_completion'` means we want the highest degree majors to be the x-axis; 2) `y = 'year7_earnings'` means the y-axis values are year 7 earnings; 3) `data = df_2015_ba` means we want to use the data in DataFrame `df_2015_ba`. We also need to use `plt.show()` to display the plot in this Jupyter notebook.

The resulting boxplot is a clear visualization of the distribution and central tendencies of year 7 earnings for bachelor's degree holders in each major. We can see that for the 2015 cohort, bachelor's degree holders in nursing had the highest year 7 median earnings, as well as the highest 25th and 75th percentile year 7 earnings.

In [None]:
# Create a boxplot for 'year7_earnings' across all 'high_completion' major groups
sns.boxplot(x = 'high_completion', y = 'year7_earnings', data = df_2015_ba)

# Show the plot
plt.show()

**How can we improve our visualization?** While the above boxplot is informative, a few important components are missing and it is hard for our audience to tell the key information presented on this graph. We can improve the boxplot by including:

1. **A graph title that summarizes the key point of the graph**: the title should tell your audience the key takeaway of your graph. For example, we want to know which major had the highest median earnings in year 7 among the 2015 cohort bachelor's degree holders. Our graph title should answer this question. We can use *"Of the 2015 Cohort Bachelor Degree Holders, Nursing Students Had the Highest Median Earnings in Year 7"* as our title. Of course, if you interpret the same graph differently, you can use an alternative title to summarize your key takeaways. To add the title, we use the `plt.title()` function.

2. **Clear x-axis and y-axis labels**: in the current graph, the x-axis and y-axis labels are column names and they could be confusing to our audience. We need to replace them with concise descriptions. To change the labels, we can use `ax.set_ylabel` and `ax.set_xlabel`.

3. **Data Source**: labeling the source of your data increases the credibility of your findings and enables others to replicate your results. We can use `plt.figtext()` to add additional text on the graph.

There are additional functions you can use to improve your visualizations, such as setting your graph size with `plt.subplots(figsize = (8, 5))` and rotating your x-axis tick labels slightly to make them easier to read with `ax.set_xticklabels(labels, rotation = 20)`, where `labels` is the list of tick labels to display. Please read the comments in the code below to see what each line of code means.
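
The improvements listed above can be tried out on a small, self-contained figure before applying them to the cohort data. The sketch below uses Matplotlib only. The two groups, their values, and the title are made up for illustration, and `dollar_format` is a name chosen for this example.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

# Set plot size: width 8, height 5
fig, ax = plt.subplots(figsize=(8, 5))

# Made-up earnings for two hypothetical groups
ax.boxplot([[30000, 35000, 42000, 50000], [45000, 52000, 60000, 71000]])

# Fix the tick positions first, then name and rotate the tick labels
ax.set_xticks([1, 2])
ax.set_xticklabels(['Group A', 'Group B'], rotation=20)

# Descriptive axis labels instead of raw column names
ax.set_xlabel('Group', fontsize=12)
ax.set_ylabel('Year 7 Earnings', fontsize=12)

# Show y-axis ticks as dollars with thousands separators and no decimals
dollar_format = StrMethodFormatter('${x:,.0f}')
ax.yaxis.set_major_formatter(dollar_format)

# Title that states the takeaway, plus a data source note below the axes
ax.set_title('Group B Had Higher Median Earnings')
fig.text(0.65, -0.02, 'Data Source: made-up values')

# The formatter can be called directly to preview how a tick value will look
print(dollar_format(50000))
```

In a notebook, adding `plt.show()` at the end displays the figure.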

> You can save your graph by using `plt.savefig()`. Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> in the second to last line of code to your username or your own file path.

In [None]:
#Create a boxplot

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

# Create a boxplot for 'year7_earnings' across all 'high_completion' groups
# We also use `order = ` to sort majors alphabetically
sns.boxplot(x = 'high_completion', y = 'year7_earnings', data = df_2015_ba,
           order = ['business', 'computerscience', 'education', 'nursing', 'other'])

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Year 7 Earnings', fontsize = 12)
ax.set_xlabel('Highest Degree Major', fontsize = 12)

# Y-axis tick labels: Change the format of earnings displayed on Y-axis
# A format string can be passed directly: '${x:,.0f}' shows dollars with thousands separators and no decimals
ax.yaxis.set_major_formatter('${x:,.0f}')

# X-axis tick labels: 
# We can define x tick labels, such as using capital letters for the first letter of each word
# Make sure the order of your labels is consistent with the order you defined in sns.boxplot()
# Rotate x tick labels for better readability if there are many categories
ax.set_xticklabels(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'],
                   rotation = 20)

# Set the title
plt.title('Of the 2015 Cohort Bachelor Degree Holders, \n Nursing Students Had the Highest Median Earnings in Year 7')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.12, 'Data Source: Syntucky Data')

#Save the plot
plt.savefig(r"C:\Users\YOUR USERNAME\Documents\ba_2015_y7_earnings_by_major.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()

#### **Checkpoint 1: Create a Boxplot for the 2013 Cohort Bachelor's Degree Holders**

Please load the 2013 cohort data and save students whose highest degree is a bachelor's degree in `df_2013_ba`. Then create a boxplot to show year 7 earnings distributions for bachelor's degree holders by major.

### Example 2: Countplot

One of the foci of the class, and an important part of data analytics, is to examine the pattern of "missingness" in the data. **How many people are the earnings distributions in the boxplot based on? How many people have missing year 7 earnings in each major?** We can present this information visually with a *countplot*, again using the seaborn library. A *countplot* is essentially a histogram across a categorical variable: it displays the count of observations in each category as a bar.

In this countplot, the x-axis will still be the major categories, `high_completion`, and the y-axis will be the counts of people in each category. Since we also want to display the counts of students by whether their year 7 earnings are missing, we need to create a dummy variable to indicate the missingness of `year7_earnings` first.

In [None]:
# Create a new column indicating whether data in 'year7_earnings' column is missing
# .isna() returns True/False; .astype(int) converts it to 1 (missing) or 0 (non-missing)
df_2015_ba.loc[:, 'year7_earn_missing'] = df_2015_ba['year7_earnings'].isna().astype(int)

Now we can use `sns.countplot()` to create the countplot. We use `hue = 'year7_earn_missing'` to indicate that we want to show counts of students with missing and non-missing year 7 earnings in separate bars and in different colors. This countplot provides a quick, at-a-glance understanding of how the missingness in `year7_earnings` is distributed across different `high_completion` categories. As in example 1, we should customize the x-axis and y-axis labels and ticks. We also use `plt.legend()` to specify which color represents the count of missing earnings and which color represents the count of non-missing earnings.
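
Before plotting, it can help to look at the numbers a countplot would draw. The sketch below is a minimal illustration on a made-up table. The column names mirror the ones used in this notebook, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names mirror the notebook's
toy = pd.DataFrame({
    'high_completion': ['business', 'business', 'business',
                        'nursing', 'nursing', 'nursing'],
    'year7_earnings': [np.nan, np.nan, 41000, 52000, 58000, np.nan],
})

# 1 if year 7 earnings are missing, 0 otherwise
toy['year7_earn_missing'] = toy['year7_earnings'].isna().astype(int)

# Counts per major and missingness status: these are the bar heights of a countplot
counts = pd.crosstab(toy['high_completion'], toy['year7_earn_missing'])

# Share of students with missing earnings in each major
missing_share = toy.groupby('high_completion')['year7_earn_missing'].mean()

print(counts)
print(missing_share)
```

A table like `missing_share` is also a convenient way to check a claim such as "more than 50% are missing" before writing it into a graph title.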

> Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> in the second to last line of code to your username or your own file path.

In [None]:
# Create a countplot

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

# Create a countplot to show counts of missing and non-missing 'year7_earnings' by major groups
# We also use `order = ` to sort majors alphabetically
sns.countplot(x = 'high_completion', hue = 'year7_earn_missing', data = df_2015_ba,
              order = ['business', 'computerscience', 'education', 'nursing', 'other'])

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Student Counts', fontsize = 12)
ax.set_xlabel('Highest Degree Major', fontsize = 12)

# Y-axis tick labels: Change the format of counts displayed on Y-axis
# A format string can be passed directly: '{x:,.0f}' shows thousands separators and no decimals
ax.yaxis.set_major_formatter('{x:,.0f}')

# X-axis tick labels: 
# Make sure the order of your labels is consistent with the order you defined in sns.countplot()
# Rotate x tick labels for better readability if there are many categories
ax.set_xticklabels(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'],
                   rotation = 20)

# Set the title
plt.title('More Than 50% of Students in Business and \n Computer Science Majors Have Missing Year 7 Earnings')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.12, 'Data Source: Syntucky Data')

# Define legend
plt.legend(['Non-missing', 'Missing'], fontsize = 12)

#Save the plot
plt.savefig(r"C:\Users\YOUR USERNAME\Documents\ba_2015_y7_earnings_missingness_by_major.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()

#### **Checkpoint 2: Use a Countplot to Check Counts of Missing Earnings** 

Please create a countplot to check how many students have missing and non-missing year 7 earnings in the 2013 cohort by the highest degree major. 

### Example 3: Line Plot

In the previous examples, we have examined year 7 earnings distributions and the missingness in year 7 earnings by major. Now, you may wonder: **what is the time trend in earnings for bachelor degree holders? Do nursing bachelor degree holders always have the highest median earnings?** To answer these questions, we can use a line plot to show earnings over time for students from different majors.

Before we create the line plot, it is important to ensure that our data is in a format suitable for plotting time trends. Recall that we have year 5 to year 7 earnings for the 2015 cohort. Our current data is in the *wide format*:

<text><center>**Wide Format Data**</center></text>

|id|year5_earnings|year6_earnings|year7_earnings|
|:--------:|:--------:|:--------:|:--------:|
|1|NaN|NaN|10000|
|2|20000|NaN|50000|

However, to create the line plot, the data should be in the *long format*:

<text><center>**Long Format Data**</center></text>

|id|year|earnings|
|:--------:|:--------:|:--------:|
|1|year5_earnings|NaN|
|1|year6_earnings|NaN|
|1|year7_earnings|10000|
|2|year5_earnings|20000|
|2|year6_earnings|NaN|
|2|year7_earnings|50000|

The `pd.melt()` function allows us to transform a wide DataFrame to the long format. In the code below, we first select the columns we need and save them to a new DataFrame `df_2015_ba_earn_wide`. Then we use the `pd.melt()` function to reshape the wide data to the long format. In this function, we need to define the DataFrame that has the wide format data, the columns to use as identifiers (`id_vars = `), the columns to be reshaped into long format (`value_vars = `), and the names of the two new columns (`var_name = ` for the column holding the former column names and `value_name = ` for the column holding their values).
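
To see the reshaping in isolation, the small sketch below applies `pd.melt()` to the two-row wide table shown above. The DataFrame names `wide` and `long` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# The two-row wide table from the illustration above
wide = pd.DataFrame({
    'id': [1, 2],
    'year5_earnings': [np.nan, 20000],
    'year6_earnings': [np.nan, np.nan],
    'year7_earnings': [10000, 50000],
})

# Reshape: one row per student and year
long = pd.melt(wide,
               id_vars=['id'],
               value_vars=['year5_earnings', 'year6_earnings', 'year7_earnings'],
               var_name='year',
               value_name='earnings')

# melt stacks one value column after another, so sort to group rows by student
long = long.sort_values(['id', 'year']).reset_index(drop=True)

print(long)
```

Each student now appears once per year, and missing earnings are kept as `NaN` rows rather than dropped.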

In [None]:
# Choose the variables we need in order to create the line plot
df_2015_ba_earn_wide = df_2015_ba[['id', 'high_completion', 'year5_earnings', 'year6_earnings', 'year7_earnings']]

#Change column names for easier labeling
df_2015_ba_earn_wide = df_2015_ba_earn_wide.rename(columns = {'year5_earnings' : 'Year 5',
                                                              'year6_earnings' : 'Year 6',
                                                              'year7_earnings' : 'Year 7'})

# Transform the dataframe from the wide format to the long format using the melt function
df_2015_ba_earn_long = pd.melt(df_2015_ba_earn_wide, 
                               id_vars = ['id', 'high_completion'],
                               value_vars = ['Year 5','Year 6','Year 7'],
                               var_name = 'year',
                               value_name = 'earnings')

#See the first five rows of the long format
df_2015_ba_earn_long.head()

Now we can use the `sns.lineplot()` function to create the line plot. In this function, we need to define the x-axis (`x = 'year'`, number of years since college entrance), the y-axis (`y = 'earnings'`, earnings for students in each major and each year), categories (`hue = 'high_completion'`, highest degree major), and the underlying data (`data = df_2015_ba_earn_long`). Since we want to show median earnings, we set the statistic with `estimator = np.nanmedian`. You can replace `np.nanmedian` with other NumPy statistics, such as `np.mean`. By default, Seaborn's lineplot draws a confidence interval around each line. We hide it here with `errorbar = None`, but you can remove this argument to restore the default or choose another option, such as `errorbar = 'sd'` for the standard deviation.
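
The values that such a line plot draws can also be computed directly as a table, which is a useful cross-check on the graph. The sketch below assumes long format data with the same column names as above. The earnings values are made up.

```python
import numpy as np
import pandas as pd

# Toy long format data with made-up earnings; NaN marks missing earnings
long = pd.DataFrame({
    'high_completion': ['nursing'] * 6 + ['business'] * 6,
    'year': ['Year 5', 'Year 5', 'Year 5', 'Year 6', 'Year 6', 'Year 6'] * 2,
    'earnings': [40000, 44000, np.nan, 45000, np.nan, 51000,
                 30000, np.nan, 36000, 33000, 35000, 40000],
})

# Median earnings per major and year; pandas skips NaN by default
medians = (long.groupby(['high_completion', 'year'])['earnings']
               .median()
               .unstack('year'))
print(medians)

# np.nanmedian ignores missing values, while np.median returns NaN if any value is missing
sample = np.array([40000, 44000, np.nan])
print(np.nanmedian(sample), np.median(sample))
```

Each cell of `medians` corresponds to one point on the line plot: one line per major, one point per year.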

> Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> in the second to last line of code to your username or your own file path.

In [None]:
# Create a line plot

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

# Create a line plot to show earnings trends for students in different majors
sns.lineplot(x = 'year', y = 'earnings', hue = 'high_completion', data = df_2015_ba_earn_long,
             #Define orders of categories shown on the legend
             hue_order = ['business', 'computerscience', 'education', 'nursing', 'other'], 
             # We use `estimator=` to define what statistics to show on the graph, 
             # you can replace np.nanmedian with other statistics such as np.mean
             estimator = np.nanmedian, 
             # We hide the `errorbar` here, but it can be confidence interval, standard deviation, etc. 
             errorbar = None)

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Median Earnings', fontsize = 12)
ax.set_xlabel('Number of Years Since College Entrance', fontsize = 12)

# Y-axis tick labels: Change the format of earnings displayed on Y-axis
ax.yaxis.set_major_formatter('${x:,.0f}')

# Rotate x tick labels for better readability
plt.xticks(rotation = 20)

# Set the title
# '\n' allows you to start the sentence in a new line
plt.title('Of the 2015 Cohort Bachelor Degree Holders, \n Nursing Students Had the Highest Median Earnings from Year 5 to Year 7 \n')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.08, 'Data Source: Syntucky Data')

# Define legend; The order of the labels should be consistent with the categories defined in `hue_order`
plt.legend(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'], fontsize = 10)

#Save the plot
plt.savefig(r"C:\Users\YOUR USERNAME\Documents\ba_2015_earnings_trends_by_major.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()

#### **Checkpoint 3: Depict Median Earnings Over Time** 

Please create a line plot to show year 5 to year 9 median earnings for the 2013 cohort by the highest degree major. 

### Example 4: Bar Chart

A *bar chart* shows an estimate of the central value of a numerical variable through the height of each bar and can convey the uncertainty around that estimate with error bars. Note that `sns.barplot()` displays the mean by default. The function accepts input data in several formats: lists, NumPy arrays, or pandas Series can be passed directly to the `x`, `y`, and `hue` parameters. Typically, a categorical variable goes on the x-axis, a numerical variable on the y-axis, and a second categorical variable, represented by `hue`, splits each category into groups for comparison.

In the previous example, we find that nursing bachelor degree holders had the highest median earnings from year 5 to year 7. How about students in other degree levels? **How do year 7 earnings vary by majors and degree completion status (completers, non-completers, and degree pursuers)?** A bar chart is one of the most suitable visualizations to present this information. Before we create the bar chart, we need to create the student group indicator and the major indicator (the definitions and code are the same as what we used in the data measurement notebook).
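
The indicator cells below rely on `np.select()`, which walks through a list of conditions and assigns the matching choice to each row. The following sketch shows the pattern on a made-up four-row table. The `has_degree` helper and the `'Unclassified'` default are choices made for this illustration and differ from the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names mirror the notebook's
toy = pd.DataFrame({
    'high_completion_label': ['Bachelor', None, None, 'Associate'],
    'year7_enrolled': [0, 0, 1, 0],
})

# True for students who completed one of the listed degrees
has_degree = toy['high_completion_label'].isin(['Associate', 'Bachelor', 'Master'])

# Conditions are checked in order; the first one that is True decides the choice
conditions = [
    toy['high_completion_label'] == 'Associate',
    toy['high_completion_label'] == 'Bachelor',
    ~has_degree & (toy['year7_enrolled'] == 0),   # no degree, not enrolled in year 7
    ~has_degree & (toy['year7_enrolled'] == 1),   # no degree, still enrolled in year 7
]
choices = ['Completer, Associate', 'Completer, Bachelor',
           'Non-completer', 'Degree pursuer']

# Rows matching no condition receive the default label (hypothetical label for this sketch)
toy['group'] = np.select(conditions, choices, default='Unclassified')

print(toy)
```

Checking a handful of rows by hand like this is a quick way to confirm that the conditions and choices line up in the intended order.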

In [None]:
#Generate a group indicator

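#Remove the records where high_completion_label is Doctoral
#.copy() creates an independent DataFrame, so new columns can be added without a pandas warning
df_2015 = df_2015[df_2015['high_completion_label'] != 'Doctoral'].copy()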

#Conditions list
conditions = [df_2015['high_completion_label'] == 'Associate', #Completers whose highest degrees are associate
              df_2015['high_completion_label'] == 'Bachelor', #Completers whose highest degrees are bachelor
              df_2015['high_completion_label'] == 'Master', #Completers whose highest degrees are master
              ((df_2015['year7_enrolled'] == 0) & 
               ( ~ df_2015['high_completion_label'].isin(['Associate', 'Bachelor', 'Master']))), #Non-completers
             ((df_2015['year7_enrolled'] == 1) & 
               ( ~ df_2015['high_completion_label'].isin(['Associate', 'Bachelor', 'Master'])))] #Degree pursuers

#Choices (or values) list
choices = ['Completer, Associate', 
           'Completer, Bachelor', 
           'Completer, Master', 
           'Non-completer', 
           'Degree pursuer']

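#Assign results to the indicator 'group' based on conditions; Default choice is the null value
#We use None as the default: np.NaN was removed in NumPy 2.0, and None keeps unmatched rows
#as true missing values in a column of text labels
df_2015['group'] = np.select(conditions, choices, default = None)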

In [None]:
#Major indicator

#For completers, we use the highest degree majors.
#For non-completers and degree pursuers, we use the first enrollment majors.

#Conditions list
conditions = [((df_2015['group'] == 'Completer, Associate') | # condition 1: completers
               (df_2015['group'] == 'Completer, Bachelor') |
               (df_2015['group'] == 'Completer, Master')), 
              ((df_2015['group'] == 'Non-completer') | # condition 2: non-completers and degree pursuers
               (df_2015['group'] == 'Degree pursuer'))] 
    

#Choices (or values) list
choices = [df_2015['high_completion'],
           df_2015['first_enroll']]

#Assign results to the indicator 'major' based on conditions; Default choice is the null value
#np.nan is used because the alias np.NaN was removed in NumPy 2.0
df_2015['major'] = np.select(conditions, choices, default = np.nan)

To create the bar chart, we use the function `sns.barplot()`. As in the previous examples, we need to define the x-axis (`x = 'major'`), the y-axis (`y = 'year7_earnings'`), the categories (`hue = 'group'`), and the data (`data = df_2015`). In this function, we can also define the order of student groups (`hue_order = `) and the order of majors (`order = `) we want to show on the graph. The rest of the code adjusts the details of the visualization. Please read the comments and adjust the parameters based on your needs.
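
The bar heights of such a chart can be previewed as a two-way table of medians. The sketch below assumes columns named as in this notebook and uses made-up earnings. `bar_heights` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names mirror the notebook's
toy = pd.DataFrame({
    'major': ['nursing', 'nursing', 'nursing', 'nursing',
              'business', 'business', 'business'],
    'group': ['Completer, Bachelor', 'Completer, Bachelor',
              'Non-completer', 'Non-completer',
              'Completer, Bachelor', 'Non-completer', 'Non-completer'],
    'year7_earnings': [60000, 64000, 30000, np.nan, 48000, 26000, 30000],
})

# Median year 7 earnings for every combination of major (rows) and group (columns)
# Missing earnings are ignored when the median is computed
bar_heights = toy.pivot_table(index='major', columns='group',
                              values='year7_earnings', aggfunc='median')

print(bar_heights)
```

Each row of the table corresponds to one cluster of bars on the x-axis, and each column to one color in the legend.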

> Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> in the second to last line of code to your username or your own file path.

In [None]:
# Create a bar chart

# Set plot size
# The first number is the width and the second number is the height.
fig, ax = plt.subplots(figsize = (8, 5))

#Create a bar chart to show year 7 earnings by majors and degree completion status
sns.barplot(x = 'major', y = 'year7_earnings', hue = 'group', data=df_2015,
            #Define the order of student groups
            hue_order = ['Completer, Associate', 'Completer, Bachelor', 'Completer, Master', 
                         'Non-completer', 'Degree pursuer'],
            #Define the order of majors displayed on x-axis
            order = ['business', 'computerscience', 'education', 'nursing', 'other'],
            #Define statistics to show on each bar
            estimator = np.nanmedian, errorbar = None)

# Change Y-axis and X-axis labels
# You can change the label size in `fontsize = `
ax.set_ylabel('Median Earnings', fontsize = 12)
# 'major' is the highest degree major for completers and the first enrollment major for other students
ax.set_xlabel('Major', fontsize = 12)

# Y-axis tick labels: Change the format of earnings displayed on Y-axis
ax.yaxis.set_major_formatter('${x:,.0f}')

# X-axis tick labels: 
# Make sure the order of your labels is consistent with the order you defined in sns.barplot()
# Rotate x tick labels for better readability if there are many categories
ax.set_xticklabels(['Business', 'Computer Science', 'Education', 'Nursing', 'Other Majors'],
                   rotation = 20)

# Set the title
# '\n' allows you to start the sentence in a new line
plt.title('Nursing and Computer Science Students Had the Highest Year 7 \n Median Earnings Among Completers and Degree Pursuers, Respectively')

# Data source
# The first two numbers indicate the coordinate (or (x,y) position) of the text
# You may need to adjust the numbers a few times to place it in an ideal place
plt.figtext(0.65, -0.12, 'Data Source: Syntucky Data')

# Set legend font size
plt.legend(fontsize = 8)

#Save the plot
plt.savefig(r"C:\Users\YOUR USERNAME\Documents\2015_year7_earnings_by_major_group.jpg", bbox_inches = 'tight')

# Show the plot
plt.show()

#### **Checkpoint 4: Show Year 7 Earnings by Major and Degree Completion Status** 

Please use a bar chart to show year 7 median earnings for the 2013 cohort by major and degree completion status (completers, non-completers, and degree pursuers).