<a href="https://colab.research.google.com/github/egynzhu-personal/siop-python-seminar-2024/blob/main/02_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**3 Data Visualization**


In this tutorial, aimed at people analytics data scientists, we explore techniques for visualizing HR-related data. Data visualization is a crucial skill for any data scientist because it helps reveal patterns and trends hidden in the data.

First, let's import the necessary libraries:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

***3.1 Read Data and Exploratory Analysis***

Before visualizing the data, it's essential to understand its structure and characteristics. Let's perform some basic exploratory data analysis.

In [None]:
# Read data from the GitHub repository and look at the first few rows
# NOTE: The data is simulated
data = pd.read_csv("https://github.com/egynzhu-personal/siop-python-seminar-2024/blob/main/data/hr_data.csv?raw=true")
data.head()

In [None]:
# Examine the data type of each column
data.dtypes

In [None]:
# Get summary statistics for numeric columns
data.describe()

In [None]:
# You can get the value counts for categorical variables
data["department"].value_counts()

In [None]:
# Examine statistics by department
data.drop("attrition", axis=1).groupby("department").mean()

In [None]:
# Examine the correlation matrix
data.drop(["department", "attrition"], axis=1).corr()

***3.2 Plotting with Seaborn***

Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for statistical graphics and works directly with Pandas DataFrames, so you can refer to columns by name. A few lines of code are enough to produce scatter plots, bar plots, histograms, heatmaps, and more. Seaborn also ships with built-in themes and color palettes for styling your plots.
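As a minimal sketch of the built-in themes and palettes mentioned above: `sns.set_theme` sets a global style, and `sns.color_palette` returns a list of colors. The style and palette names used here (`"whitegrid"`, `"muted"`) are illustrative choices, not settings used elsewhere in this tutorial.

```python
import seaborn as sns

# Apply a global theme; this affects every plot drawn afterwards in the session
sns.set_theme(style="whitegrid", palette="muted")

# Retrieve five colors from a named palette as a list of RGB tuples
colors = sns.color_palette("muted", n_colors=5)
print(len(colors))
```

Because the theme is global, it changes the look of all later plots. Calling `sns.reset_defaults()` restores the default settings.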

In [None]:
# Tenure Distribution, Histogram
sns.histplot(data["tenure"],  # Column to plot
             bins=8,  # Number of bins
             kde=True,  # Overlay a kernel density estimate (KDE) curve
             color="salmon")  # Color of bars
plt.title("Tenure Distribution")  # Set figure title
plt.xlabel("Tenure")  # Set x-axis label
plt.ylabel("Frequency")  # Set y-axis label

In [None]:
# Age and Tenure Distribution by Department, Subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))  # Set subplot properties
sns.boxplot(ax=ax[0], x="department", y="age", data=data, hue="department", palette="cool")  # Plot in position 0
ax[0].set_title("Age Distribution by Department")  # Set subplot title
sns.boxplot(ax=ax[1], x="department", y="tenure", data=data, hue="department", palette="cool")  # Plot in position 1
ax[1].set_title("Tenure Distribution by Department")  # Set subplot title

In [None]:
# Salary Distribution by Department, Multiple histograms
sns.displot(data=data,  # Dataset used for plot
            x="salary",   # Variable to plot
            col="department",  # Variable that splits the data into one panel per value
            hue="department",  # Variable to change color
            palette="spring")  # Color palette
# displot draws one subplot per department, so set a figure-level suptitle instead of a title
plt.suptitle("Salary Distribution by Department", y=1.1, fontsize='xx-large')

In [None]:
# Relation Plot
sns.relplot(data=data, x="performance", y="salary", hue="attrition", kind="scatter", palette="Accent")
plt.title("Performance vs Salary by Attrition")

In [None]:
# Linear Model Plot
sns.lmplot(data=data, x="performance", y="salary", hue="attrition", scatter_kws={'alpha':0.3}, palette="Accent")
plt.title("Performance vs Salary by Attrition with Regression")
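The correlation matrix computed in section 3.1 can also be drawn as a heatmap, one of the plot types listed in the Seaborn overview. This is a sketch only: the `demo` DataFrame and its values are simulated here as a hypothetical stand-in for the numeric HR columns, and the colormap is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated stand-in for the numeric HR columns (hypothetical values)
rng = np.random.default_rng(0)
n = 200
tenure = rng.uniform(0, 20, n)
demo = pd.DataFrame({
    "age": 22 + tenure + rng.normal(0, 5, n),
    "tenure": tenure,
    "performance": rng.normal(3, 1, n),
})
demo["salary"] = 40000 + 2000 * demo["tenure"] + rng.normal(0, 5000, n)

# Pairwise Pearson correlations between the numeric columns
corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr,
            annot=True,  # Print the coefficient in each cell
            fmt=".2f",  # Two decimal places
            cmap="coolwarm",  # Diverging colormap centered near zero
            vmin=-1, vmax=1,  # Fix the color scale to the full correlation range
            ax=ax)
ax.set_title("Correlation Matrix")
```

With the tutorial's data, `data.drop(["department", "attrition"], axis=1).corr()` would take the place of `demo.corr()`.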

**Activity**: Now, it's your turn! Use the techniques learned in this tutorial to visualize the following aspects of the HR data:

1. Distribution of performance by attrition.
2. Relationship between salary and performance by department.

Feel free to explore additional visualizations and share your findings with your peers.

In [None]:
# Activity 1


In [None]:
# Activity 2

**References**

[Pandas Documentation](https://pandas.pydata.org/)

[Matplotlib Documentation](https://matplotlib.org/)

[Seaborn Documentation](https://seaborn.pydata.org/)