# Introduction to Data Visualization


Welcome to the final and **the coolest** part of this course! 😎 I'm not saying that the previous topics aren't cool. But visualizations are the bridges that can connect your data analytics skills with the outside world. The last thing you want to do is print out a large DataFrame to report your findings.

Data visualization translates raw data into visual stories that reveal patterns, trends, and relationships. People usually spot a trend or an outlier faster in a chart than in a table of numbers, so effective visualizations help analysts think, communicate, and persuade.

> There is no such thing as information overload. There is only bad design.
>
> -- Edward Tufte


---

## ✨ From Data to Visualization: A Workflow

1. **Define the Question**: What decision or hypothesis are you exploring?
2. **Acquire and Clean Data**: Remove noise, handle missing values.
3. **Choose the Visual Form**: Match the data type (categorical, continuous, time-series) to an appropriate chart.
4. **Design and Annotate**: Add titles, captions, and labels.
5. **Iterate**: Test different layouts and get user feedback.


---

## 📊 Types of Data Visualization

### Exploratory vs. Explanatory

![Exploratory vs Explanatory](images/dataviz/exploratory-vs-explanatory.png)

**Exploratory** visualizations help you, the analyst, discover what the data contains. **Explanatory** visualizations communicate a finding you already have to an audience.

When we visualize data, we are not simply drawing charts - we are searching for _patterns_. Most quantitative insights can be grouped into five broad categories of data patterns. Recognizing these patterns helps analysts choose the most effective visual representation for their message.

![Data patterns](images/dataviz/data-patterns.png)

:::{seealso} Kevin Hartman's Digital Marketing Analytics: In Theory and In Practice

This categorization is from Kevin Hartman's [Digital Marketing Analytics: In Theory and In Practice](https://www.amazon.in/Digital-Marketing-Analytics-Theory-Practice/dp/B08J5BD5RM).

He taught digital marketing analytics in Gies iDegree programs, although he now teaches at his alma mater, Loyola University Chicago.

:::

### 1️⃣ Change

- **Definition:** Shows how a variable evolves over time.
- **Examples:** Line charts, area charts, or bar charts that track trends.
- **Use When:** You want to highlight growth, decline, or seasonal fluctuations.

### 2️⃣ Clustering

- **Definition:** Reveals natural groupings or segments within data.
- **Examples:** Scatter plots or bubble charts showing customer segments, product groups, or behavioral clusters.
- **Use When:** You want to explore differences and similarities between observations.

### 3️⃣ Relativity

- **Definition:** Displays how parts relate to a whole.
- **Examples:** Pie charts, donut charts, or stacked bar charts.
- **Use When:** You want to emphasize proportions or contribution to a total.

### 4️⃣ Ranking

- **Definition:** Compares ordered categories to identify leaders or laggards.
- **Examples:** Horizontal bar charts, lollipop charts, or sorted column charts.
- **Use When:** You want to show the top or bottom performers (e.g., top 10 sales regions).

### 5️⃣ Correlation

- **Definition:** Illustrates relationships between two or more quantitative variables.
- **Examples:** Scatter plots, correlation matrices, or regression lines.
- **Use When:** You want to determine whether changes in one variable are associated with changes in another.

### 🧪 Why It Matters

Identifying these five data patterns helps analysts choose the right visualization for their story. Rather than focusing on chart types first, start by asking:

> "What kind of pattern am I trying to show - change, clustering, relativity, ranking, or correlation?"


### 🗂️ Common Chart Types

| Chart Type             | Best For                                      | Avoid When                |
| ---------------------- | --------------------------------------------- | ------------------------- |
| **Bar Chart**          | Comparing categorical values                  | Too many categories       |
| **Line Chart**         | Showing trends over time                      | Non-sequential categories |
| **Scatter Plot**       | Revealing relationships between two variables | Too few data points       |
| **Histogram**          | Showing data distribution                     | Comparing multiple groups |
| **Box Plot**           | Summarizing distribution & outliers           | Small samples             |
| **Pie / Donut Chart**  | Showing parts of a whole                      | Many small slices         |
| **Heatmap**            | Displaying matrix or correlation patterns     | Hard-to-read color scales |
| **TreeMap / Sunburst** | Hierarchical proportions                      | Need precise comparisons  |


---

## 🧾 Common Pitfalls in Data Visualization

| Pitfall          | Description                   | Better Practice           |
| ---------------- | ----------------------------- | ------------------------- |
| Overuse of 3D    | Distorts proportions          | Stick to 2D               |
| Too Many Colors  | Confuses the audience         | Use ≤ 5 meaningful colors |
| Truncated Y-Axis | Exaggerates small differences | Start at 0 for bar charts |
| Dense Dashboards | Causes cognitive overload     | Prioritize key visuals    |
| Unlabeled Axes   | Leaves the meaning ambiguous  | Always label variables    |

:::{danger} Avoid 3D charts

3D charts are appropriate in very few cases, such as surface plots or spatial visualizations. Otherwise, they often distort data and make it harder to interpret. Stick to 2D visualizations for clarity.

If you need to show multiple dimensions, consider using color, size, or facets instead of adding a third spatial dimension.

If you create a 3D chart because it looks "cool", I will track you down and personally make you redo it in 2D. 😡

:::


---

## 📚 Dataviz libraries

The most commonly used library for data visualization in introductory data analytics courses is [**matplotlib**](https://matplotlib.org/), a low-level visualization library for Python. [**seaborn**](https://seaborn.pydata.org/) is another popular library built on top of matplotlib that provides a higher-level interface for creating attractive and informative statistical graphics.

Below are some of the most popular and battle-tested data visualization libraries - all of which are free and open source:

- [**matplotlib**](https://matplotlib.org/): Low-level visualization library for Python
- [**seaborn**](https://seaborn.pydata.org/): High-level visualization library for Python **built on matplotlib**
- [**bokeh**](https://bokeh.org/): Interactive visualizations for modern web browsers
- [**plotnine**](https://plotnine.readthedocs.io/): Python implementation of [ggplot2](https://ggplot2.tidyverse.org/)
- [**plotly**](https://plotly.com/python/): Interactive visualization library supporting Python, JavaScript, and R
- [**altair**](https://altair-viz.github.io/): Declarative visualization library for Python based on [Vega](https://vega.github.io/vega/)

### 🎨 Plot.ly

We'll use [plotly](https://plotly.com/python/), which provides both low-level and high-level interfaces to create publication-ready graphs.

![plotly logo image](images/dataviz/plotly-logo.png)

To use `plotly` in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/), install the `plotly` and `anywidget` packages in the same environment as `jupyterlab`, using `pip` inside a terminal:

```
pip install plotly anywidget
```

or `conda`:

```
conda install plotly anywidget
```

You can use the exclamation mark `!` to run shell commands directly from a Jupyter notebook cell:

```
!pip install plotly anywidget
```

or 

```
!conda install -y plotly anywidget
```

It's generally a better practice to run installation commands in a terminal rather than inside a Jupyter notebook to avoid environment issues.

:::{attention} Install `anywidget`

The plotly charts in JupyterLab require the `anywidget` package to render properly. Make sure to install it in the same environment where you have plotly and JupyterLab installed.

:::

:::{important} Manual fallback

If your plots are still not rendering correctly after installing `anywidget`, you may need to manually set the renderer for plotly in JupyterLab using the following code:

```python
import plotly.io as pio
pio.renderers.default = "notebook"
```

:::


---

## 🛠️ Exercises using an HR Dataset


▶️ Import the following Python packages.

1. `pandas`: Use alias `pd`.
2. `numpy`: Use alias `np`.
3. `plotly.express`: Use alias `px`.
4. `plotly.graph_objects`: Use alias `go`.


In [1]:
import pandas as pd
import numpy as np

import plotly.graph_objects as go
import plotly.express as px

▶️ Check the version of plotly installed in your environment.


In [2]:
import plotly

print(f"Plotly version: {plotly.__version__}")

Plotly version: 6.3.1


Today, we work with an HR Dataset to uncover insights about HR metrics, measurement, and analytics. The data has been downloaded from [https://rpubs.com/rhuebner/hr_codebook_v14](https://rpubs.com/rhuebner/hr_codebook_v14) without any modification.


▶️ Import the HR Dataset. 🐷👧👨🏻‍🦰👩🏼‍🦳👳🏽‍♂️👩🏾‍🦲🐼.


In [3]:
# Display all columns
pd.set_option("display.max_columns", 50)

df_hr = pd.read_csv("https://github.com/bdi475/datasets/raw/main/HR-dataset-v14.csv")

display(df_hr)

Unnamed: 0,Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
0,"Adinolfi, Wilson K",10026,0,0,1,1,5,4,0,62506,0,19,Production Technician I,MA,1960,07/10/83,M,Single,US Citizen,No,White,7/5/2011,,N/A-StillEmployed,Active,Production,Michael Albert,22.0,LinkedIn,Exceeds,4.60,5,0,1/17/2019,0,1
1,"Ait Sidi, Karthikeyan",10084,1,1,1,5,3,3,0,104437,1,27,Sr. DBA,MA,2148,05/05/75,M,Married,US Citizen,No,White,3/30/2015,6/16/2016,career change,Voluntarily Terminated,IT/IS,Simon Roup,4.0,Indeed,Fully Meets,4.96,3,6,2/24/2016,0,17
2,"Akinkuolie, Sarah",10196,1,1,0,5,5,3,0,64955,1,20,Production Technician II,MA,1810,09/19/88,F,Married,US Citizen,No,White,7/5/2011,9/24/2012,hours,Voluntarily Terminated,Production,Kissy Sullivan,20.0,LinkedIn,Fully Meets,3.02,3,0,5/15/2012,0,3
3,"Alagbe,Trina",10088,1,1,0,1,5,3,0,64991,0,19,Production Technician I,MA,1886,09/27/88,F,Married,US Citizen,No,White,1/7/2008,,N/A-StillEmployed,Active,Production,Elijiah Gray,16.0,Indeed,Fully Meets,4.84,5,0,1/3/2019,0,15
4,"Anderson, Carol",10069,0,2,0,5,5,3,0,50825,1,19,Production Technician I,MA,2169,09/08/89,F,Divorced,US Citizen,No,White,7/11/2011,9/6/2016,return to school,Voluntarily Terminated,Production,Webster Butler,39.0,Google Search,Fully Meets,5.00,4,0,2/1/2016,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
306,"Woodson, Jason",10135,0,0,1,1,5,3,0,65893,0,20,Production Technician II,MA,1810,05/11/85,M,Single,US Citizen,No,White,7/7/2014,,N/A-StillEmployed,Active,Production,Kissy Sullivan,20.0,LinkedIn,Fully Meets,4.07,4,0,2/28/2019,0,13
307,"Ybarra, Catherine",10301,0,0,0,5,5,1,0,48513,1,19,Production Technician I,MA,2458,05/04/82,F,Single,US Citizen,No,Asian,9/2/2008,9/29/2015,Another position,Voluntarily Terminated,Production,Brannon Miller,12.0,Google Search,PIP,3.20,2,0,9/2/2015,5,4
308,"Zamora, Jennifer",10010,0,0,0,1,3,4,0,220450,0,6,CIO,MA,2067,08/30/79,F,Single,US Citizen,No,White,4/10/2010,,N/A-StillEmployed,Active,IT/IS,Janet King,2.0,Employee Referral,Exceeds,4.60,5,6,2/21/2019,0,16
309,"Zhou, Julia",10043,0,0,0,1,3,3,0,89292,0,9,Data Analyst,MA,2148,02/24/79,F,Single,US Citizen,No,White,3/30/2015,,N/A-StillEmployed,Active,IT/IS,Simon Roup,4.0,Employee Referral,Fully Meets,5.00,3,5,2/1/2019,0,11


---

### 📦 Box Plot

Box plots divide the data into four sections that each contain roughly 25% of the observations. They are useful for quickly reading off the shape of a distribution from Q1, Q2 (the median), and Q3.

![box plot explanation](images/dataviz/box-plot-explanation.png)


▶️ Create a simple box plot of 13 GPAs. NumPy is used here to calculate the statistical figures.


In [4]:
gpa = np.array(
    [3.33, 2.67, 3.0, 3.67, 3.67, 2.33, 3.0, 3.0, 2.67, 4.0, 3.33, 2.67, 4.0]
)
gpa

array([3.33, 2.67, 3.  , 3.67, 3.67, 2.33, 3.  , 3.  , 2.67, 4.  , 3.33,
       2.67, 4.  ])

In [5]:
fig = px.box(x=gpa, title="GPA Distribution (Horizontal Box Plot)")
fig.show()

In [6]:
print(f"Mean: {np.mean(gpa)}")
print(f"Median: {np.median(gpa)}")
print(f"Q1: {np.quantile(gpa, 0.25)}")
print(f"Q3: {np.quantile(gpa, 0.75)}")
print(f"IQR: {np.quantile(gpa, 0.75) - np.quantile(gpa, 0.25)}")

Mean: 3.18
Median: 3.0
Q1: 2.67
Q3: 3.67
IQR: 1.0


**🗺️ Findings**

- Median is `3`.
- Minimum is `2.33`.
- Maximum is `4`.
- Interquartile range is `1`.
  - You can calculate this value by subtracting Q1 from Q3: `3.67 - 2.67`.
- There is a positive skew.
  - This is also shown by comparing the mean and the median.
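
The findings above can be double-checked numerically. This sketch recomputes the quartiles with NumPy and applies the common 1.5 × IQR rule for flagging outliers; the variable names are illustrative. Plotly computes quartiles through its own `quartilemethod` setting, so on other datasets the drawn box may differ slightly from NumPy's defaults.

```python
import numpy as np

gpa = np.array(
    [3.33, 2.67, 3.0, 3.67, 3.67, 2.33, 3.0, 3.0, 2.67, 4.0, 3.33, 2.67, 4.0]
)

q1, median, q3 = np.quantile(gpa, [0.25, 0.5, 0.75])
iqr = q3 - q1

# 1.5 * IQR rule: points beyond these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = gpa[(gpa < lower_fence) | (gpa > upper_fence)]

# A mean above the median hints at a longer right tail (positive skew).
mean_minus_median = gpa.mean() - median

print(f"Min / Q1 / Median / Q3 / Max: {gpa.min()} / {q1} / {median} / {q3} / {gpa.max()}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Number of outliers: {outliers.size}")
print(f"Mean - median: {mean_minus_median:.2f}")
```

No GPA falls outside the fences here, which matches a box plot whose whiskers reach the minimum and maximum with no separate outlier points.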


---

### 🎯 Example 1: Salary box plot (vertical)

▶️ Draw a vertical box plot of `Salary` in `df_hr`.


In [7]:
fig = px.box(df_hr, y="Salary", title="Salary Distribution (Vertical)")
fig.show()

---

### 🎯 Example 2: Salary box plot (horizontal)

▶️ Draw a horizontal box plot of `Salary`.


In [8]:
fig = px.box(df_hr, x="Salary", title="Salary Distribution (Horizontal)")
fig.show()

---

### 🎯 Example 3: Salary distribution by citizenship status

▶️ Draw horizontal box plots of `Salary` by `CitizenDesc`.


In [9]:
fig = px.box(
    df_hr,
    x="Salary",
    y="CitizenDesc",
    title="Salary Distribution by Citizenship Status",
)
fig.show()

---

### 🎯 Example 4: Salary distribution by performance

▶️ Draw horizontal box plots of `Salary` by `PerformanceScore`.


In [10]:
# YOUR CODE BEGINS
fig = px.box(
    df_hr,
    x="Salary",
    y="PerformanceScore",
    title="Salary Distribution by Performance Score",
)
fig.show()
# YOUR CODE ENDS

---

### 🎯 Example 5: Salary distribution by department

▶️ Draw horizontal box plots of `Salary` by `Department`.


In [11]:
# YOUR CODE BEGINS
fig = px.box(
    df_hr,
    x="Salary",
    y="Department",
    title="Salary Distribution by Department",
    height=600,
)
fig.show()
# YOUR CODE ENDS

---

### 🧶 Histogram

Histograms display frequency distributions using bars of different heights.

Here is an example histogram showing the distribution of 500 random values drawn from a standard normal distribution.


In [12]:
fig = px.histogram(x=np.random.randn(500))
fig.show()
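
To see what a histogram computes under the hood, the sketch below bins 500 random values with NumPy and prints a rough text version of the bars. The seed, the choice of 10 bins, and the text rendering are arbitrary choices for illustration; `px.histogram` picks its own bins unless you guide it with the `nbins` argument.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded so the result is repeatable
values = rng.standard_normal(500)

# Split the value range into 10 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=10)

# Print one row per bin: its range, its count, and a bar of '#' characters.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    bar = "#" * (int(count) // 5)
    print(f"{left:6.2f} to {right:6.2f}: {int(count):3d} {bar}")
```

Each bar height in a histogram is simply one of these counts.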

---

### 🎯 Example 6: Salary histogram

▶️ Draw a histogram of `Salary` in `df_hr`.


In [13]:
fig = px.histogram(df_hr, x="Salary", title="Salary Distribution")
fig.show()

---

### 🎯 Example 7: Number of absences histogram

▶️ Draw a histogram of `Absences` in `df_hr`.


In [14]:
fig = px.histogram(df_hr, x="Absences", title="Distribution of Absences")
fig.show()

---

### 🎯 Example 8: Salary histograms by gender

▶️ Draw overlaid histograms of `Salary` in `df_hr` by `GenderID`.


In [15]:
fig = go.Figure()
# In this dataset, GenderID 0 corresponds to Sex "F" and GenderID 1 to Sex "M"
fig.add_trace(go.Histogram(x=df_hr[df_hr["GenderID"] == 0]["Salary"], name="Female"))

fig.add_trace(go.Histogram(x=df_hr[df_hr["GenderID"] == 1]["Salary"], name="Male"))

# Overlay both histograms
fig.update_layout(barmode="overlay")

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.6)
fig.show()

:::{note} Graph objects

This example uses the low-level `graph_objects` interface of plotly to create overlaid histograms. This approach provides more control over the individual traces and their properties.

You will not be asked to use `graph_objects` in the exercises and case studies. However, it is good to be aware of this interface as you advance your data visualization skills.

:::
