# Data visualisation

In [None]:
%pip install seaborn

Often, the amount and complexity of the available data are so high that it is impossible to "just look at the numbers".

This is where data visualisation helps. With appropriate configuration and parameters, it condenses a large volume of data into an image that is easier for humans to take in.

Traditional spreadsheet tools like Excel offer similar functionality and are often easier to use, but they are significantly limited in the amount of data they can process.

## In this tutorial

Here, we will see how to use a popular software library for the Python language, called *Seaborn*, to visualise different aspects of our data.

We will use two ready-made data sets for this purpose.

### Initialisation

The next block of code does some basic initialisation and setup so that we can use the NumPy, Pandas and *Seaborn* libraries.

It is not very important that you understand the details.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from warnings import simplefilter

simplefilter("ignore")


### Loading the datasets

As in previous sessions, we first need to *load* the data into a Pandas data frame to be able to manipulate it.

The following code will create *two* different Pandas data frames:

* `super_store_data`: Contains the data from the superstore dataset. Please see the description of the dataset here:  https://www.kaggle.com/datasets/vivek468/superstore-dataset-final
* `human_resources_data`: Contains the data from the human resources dataset. Please see the description of the dataset here: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

In [None]:
# Load the super store dataset here
super_store_data = pd.read_csv("Sample - Superstore.csv", encoding = 'ISO-8859-1')

# Load the human resources dataset here
human_resources_data = pd.read_csv("HRDataset_v14.csv")

print("SuperStore columns:", super_store_data.columns)
print()
print("HR columns:", human_resources_data.columns)



#### Print a small sample of the super store data

In [None]:
super_store_data.head()


#### Print a small sample of the HR data

In [None]:
human_resources_data.head()

### Basics of figures

To create a diagram (called a "*plot*" here) with *Seaborn*, you often start by executing the `plt.figure(...)` command.
This is useful for setting the size of the final diagram, and it clearly indicates that this cell will produce a plot at the end.

After the `plt.figure()` command, one or more `sns.XYZ()` commands will be added, where `XYZ` is different depending on the type of plot wanted.

We will see a number of different plots here and examples of how to use them.
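
The pattern described above can be sketched on a tiny, made-up table before we turn to the real datasets. The data frame `toy` and its columns `x` and `y` are hypothetical and exist only for this illustration; `sns.scatterplot` stands in for any of the `sns.XYZ()` functions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not taken from either tutorial dataset.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})

# 1. Create the figure and set its size (width, height) in inches.
fig = plt.figure(figsize=(10, 6))

# 2. Optional decoration, such as a title.
plt.title("Toy example")

# 3. The sns call draws into the current figure and returns its axes.
ax = sns.scatterplot(x="x", y="y", data=toy)

print(fig.get_size_inches())
print(ax.get_title())
```

The returned `ax` object belongs to the figure created in step 1, which is why the size and title set earlier apply to the plot.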

## 1. Relationship plots
Relationship plots are useful for showing the relationship between two or more variables.

### 1. lineplot
In this example, we create a figure, set its size and title, and then add a lineplot with some data.

The size of the figure is given as a 2-tuple (i.e. a pair) in the `figsize` parameter of the `plt.figure()` function.

In the lineplot added (`sns.lineplot()`):

* The data comes from the HR dataset (`data = human_resources_data`)
* The x axis values are the values of the `EngagementSurvey` column
* The y axis values are the values of the `Salary` column

In [None]:
# Use this line to change the width and height of the seaborn plot.
plt.figure(figsize = (10, 6))

# Set the title of the plot.
plt.title("Relationship between EngagementSurvey and Salary")

# Add a lineplot, with the x axis values being the values of the EngagementSurvey column in the HR dataset (human_resources_data)
sns.lineplot(x = "EngagementSurvey", y = "Salary", data = human_resources_data)

**EXERCISE**: Question: What is the relationship between the `Salary` and the `EngagementSurvey` columns? 

#### Choices:

1. Linearly correlated (one is proportional to the other)
2. Non-linearly correlated
3. Non-correlated / no relationship exists

#### **Your Answer**: ____

#### Another example of the use of `lineplot`.

In this example, you will notice a lightly shaded band around the line.

When there are multiple y values for the same x value, `lineplot` shows an *estimate* of the *central tendency* (by default, the mean) and a *confidence interval* for that estimate.

A wide shaded band therefore indicates high uncertainty in the estimate for those x values, typically because the y values vary a lot.
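
To make the idea of a central tendency more concrete, here is a minimal sketch on a made-up table (the columns `projects` and `salary` are hypothetical). It uses Pandas to compute the mean of the y values for each x value, which is the default estimate that `lineplot` draws, together with the standard deviation as a rough indication of where the shaded band would be wide. The band itself is computed differently by Seaborn, so this only illustrates the idea.

```python
import pandas as pd

# Hypothetical toy data: several salary values for each project count.
toy = pd.DataFrame({
    "projects": [0, 0, 0, 1, 1, 2, 2, 2],
    "salary":   [50, 52, 54, 60, 80, 70, 70, 70],
})

# One row per x value: the mean (the line), the spread and the group size.
summary = toy.groupby("projects")["salary"].agg(["mean", "std", "count"])
print(summary)
```

Here the group with `projects == 1` has the largest spread, so it is where we would expect the widest band.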

In [None]:
plt.figure(figsize = (10, 6))

plt.title("Relationship between SpecialProjectsCount and Salary")

sns.lineplot(x = "SpecialProjectsCount", y = "Salary", data = human_resources_data)

**EXERCISE**: In the same way as the previous plot, do a plot of the *Relationship between `EmpSatisfaction` and `Salary`* using the HR dataset.

In [None]:
# TODO: write here


### 2. scatterplot 

The `scatterplot` is similar in many ways to the `lineplot` but the `sns.scatterplot()` function will not try to estimate the central tendency or the confidence interval in the data.

In this first example, we show the relationship between the number of absences and the number of days the employee was late in the last 30 days in the HR dataset.

It should be clear here that there does not seem to be a relationship between these two variables.

In [None]:
# Use this line to change the width and height of the seaborn plot.
plt.figure(figsize = (10, 6))

# Set the title of the plot.
plt.title("Relationship between Absences and DaysLateLast30")

sns.scatterplot(x = "Absences", y = "DaysLateLast30", data = human_resources_data)


As an extension, we can also color the dots (the "*markers*" as they are called) with a color coming from another column. Let's use the `PerformanceScore` column to color the markers.

In [None]:
# Use this line to change the width and height of the seaborn plot.
plt.figure(figsize = (10, 6))

# Set the title of the plot.
plt.title("Relationship between Absences and DaysLateLast30")

sns.scatterplot(x = "Absences", y = "DaysLateLast30", hue="PerformanceScore", data = human_resources_data)


**EXERCISE**: Question: What is the relationship between the `Absences`, `DaysLateLast30` and `PerformanceScore` columns? 

#### **Your Answer**: ____

Finally, we can also change the shape (use the `style` parameter) and size (use the `size` parameter) of the markers with shapes and sizes coming from other columns.

However, this may make the diagram hard to read, as in this example:

In [None]:
# Use this line to change the width and height of the seaborn plot.
plt.figure(figsize = (10, 6))

# Set the title of the plot.
plt.title("Relationship between Absences and DaysLateLast30")

sns.scatterplot(x = "Absences", y = "DaysLateLast30", hue="PerformanceScore", style="EmpSatisfaction", size="SpecialProjectsCount", data = human_resources_data)


**EXERCISE**: Create a scatterplot of the **super store** dataset, of the `Sales` (y) versus the `Profit` (x) and color the markers based on the `Category` column.

In [None]:
# TODO: write here


## 2. Categorical plots

Sometimes the data in a column is not numerical or otherwise ordered, but rather categorical. This type of data does not have a natural order. For example, in the HR dataset there is a column named `RecruitmentSource`, which records where the employee was recruited from. Example values are `LinkedIn` and `Indeed`, but there is no obvious order to these values (e.g. one cannot say that `Indeed` > `LinkedIn`).


### 1. countplot

The `countplot` counts the number of rows that share the same value in the column it gets as a parameter.

In the example below, using the HR dataset, a separate count is produced for each value of the `RecruitmentSource` column.
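
The counting itself can be sketched with Pandas alone. The table below is made up (the column name `source` and the value `Referral` are hypothetical); `value_counts()` returns the same numbers that a `countplot` would draw as bar heights.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "source": ["LinkedIn", "Indeed", "LinkedIn", "Referral", "Indeed", "LinkedIn"],
})

# Number of rows for each distinct value, most frequent first.
counts = toy["source"].value_counts()
print(counts)
```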

In [None]:
plt.figure(figsize = (18, 6))

# Set the title of the plot.
plt.title("Number of Employees hired from each Platform.")

sns.countplot(x = "RecruitmentSource", data = human_resources_data)

**EXERCISE**: Question: Which are the two most frequent recruitment sources?

#### **Your Answer**: ____

**EXERCISE**: Create a countplot of the **HR** dataset, of the `PerformanceScore` (x) column.

In [None]:
# TODO: write here


**EXERCISE**: Create a countplot of the **HR** dataset, of the `MarriedID` (x) column and also color the counts based on the `GenderID` column.

*Hint: This works in a similar way as for the scatterplot. You can also consult the documentation.*

In [None]:
# TODO: write here


### 2. boxplot

The `boxplot` (or box-and-whisker plot) shows the **distribution** of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.

The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers" using a method that is a function of the inter-quartile range.
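
As a rough illustration of the quantities involved, the sketch below computes the quartiles of a small, made-up array and flags outliers. It assumes the common rule that a point is an outlier when it lies more than 1.5 times the inter-quartile range (IQR) beyond the box; to our knowledge this matches the default whisker setting, but treat the factor as an assumption.

```python
import numpy as np

# Hypothetical values with one obviously extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed outlier rule: beyond 1.5 * IQR from the edges of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("quartiles:", q1, median, q3)
print("outliers:", outliers)
```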

In [None]:
plt.figure(figsize = (10, 6))
sns.boxplot(x = "Category", y = "Sales",  data = super_store_data)

**EXERCISE**: Create a boxplot of the **super store** dataset, of the `Sales` (y) versus the `Region` (x) and color the boxes based on the `Category` column.

In [None]:
# TODO: write here


### 3. violinplot

The `violinplot` shows the distribution of quantitative data across categorical variables such that those distributions can be compared. 

Unlike a box plot, in which all of the plot components correspond to **actual datapoints**, the violin plot features a kernel density **estimation** of the underlying distribution.

In [None]:
plt.figure(figsize = (10, 6))

sns.violinplot(x = "Category", y = "Sales",  data = super_store_data)

## 3. Distribution plots

It is often useful to plot the distribution of the values of a column. These kinds of plots are called distribution plots.

### 1. histplot
The `histplot` plots histograms to show the distribution of the values in a dataset.

A histogram is a classic visualization tool that represents the distribution of one or more variables by **counting the number of observations** that fall within discrete **bins**.
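
The binning step can be sketched with NumPy on a handful of made-up numbers. This is only an illustration of the counting idea, not necessarily how Seaborn computes its bins internally.

```python
import numpy as np

# Hypothetical values to be binned.
values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 9, 10], dtype=float)

# Split the range of the data into 3 equal-width bins and count per bin.
counts, edges = np.histogram(values, bins=3)

print("bin edges:", edges)
print("counts:", counts)
```

Each count becomes the height of one bar, and the counts always add up to the number of observations.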


In [None]:
plt.figure(figsize = (10, 6))
sns.histplot(x='Sales', data = super_store_data, bins=30)

**EXERCISE**: Create a histplot of the **HR** dataset, of the `PerformanceScore` (x) column and color the bars based on the `EmpSatisfaction` column.

In [None]:
# TODO: write here


### 2. kdeplot
The `kdeplot` uses a kernel density estimate (KDE) as a method for visualizing the distribution of observations in a dataset, analogous to a histogram.

The KDE represents the data using a continuous probability density curve in one or more dimensions.
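
A minimal sketch of the idea behind a KDE, under stated assumptions: we use a Gaussian kernel and pick the bandwidth (the width of each "bump") by hand, whereas `kdeplot` selects a bandwidth automatically. The function name `gaussian_kde_sketch` and the sample values are hypothetical.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance of every grid point to every sample, in units of the bandwidth.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical observations: four close together and one far away.
samples = [1.0, 2.0, 2.5, 3.0, 7.0]
grid = np.linspace(-5, 15, 2001)
density = gaussian_kde_sketch(samples, grid, bandwidth=1.0)

# A probability density should integrate to (approximately) one.
area = density.sum() * (grid[1] - grid[0])
print("area under the curve:", round(area, 3))
```

The curve is highest where the observations cluster, and a larger bandwidth would give a smoother, flatter curve.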

In [None]:
plt.figure(figsize = (10, 6))
# fill = True shades the area under the curve (the older shade parameter is deprecated).
sns.kdeplot(data = super_store_data, x = "Sales", fill = True)

**EXERCISE**: Create a kdeplot of the **HR** dataset, of the `EngagementSurvey` (x) column and color the curves based on the `EmpSatisfaction` column.

In [None]:
# TODO: write here


## 4. Matrix plots

One useful thing to examine in a dataset is if two columns are correlated.

To compute the pairwise correlation of all the numerical columns, excluding NA/None values, you can use the `corr()` method on any Pandas dataframe object, as in the example below.
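
Before applying this to a real dataset, here is a small sketch on a made-up table (all column names and values are hypothetical). It assumes a Pandas version that supports the `numeric_only` parameter of `corr()` (1.5 or later), which restricts the computation to numerical columns.

```python
import pandas as pd

# Hypothetical toy data with three numerical columns and one text column.
toy = pd.DataFrame({
    "sales":    [10.0, 20.0, 30.0, 40.0],
    "profit":   [1.0, 2.0, 3.0, 4.0],   # exactly proportional to sales
    "discount": [0.4, 0.3, 0.2, 0.1],   # falls linearly as sales rise
    "category": ["A", "B", "A", "B"],   # text column, skipped by numeric_only
})

# Pairwise (Pearson) correlation between the numerical columns only.
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix)
```

Values close to 1 indicate that two columns rise together, values close to -1 that one falls as the other rises, and values near 0 that there is no linear relationship.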


In [None]:
plt.figure(figsize = (10, 6))

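# Get the correlation matrix between all the numerical columns in the dataframe super_store_data and store it in the new corr_matrix variable.
# numeric_only = True (Pandas 1.5+) skips the text columns; recent Pandas versions raise an error without it.
corr_matrix = super_store_data.corr(numeric_only = True)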

# Plot the correlation matrix
sns.heatmap(corr_matrix, vmax=0.9, annot=True)

**EXERCISE**: Compute and show the correlation matrix for the **HR** dataset.

In [None]:
# TODO: write here


## 5. Scatterplot Matrix

Finally, a way to quickly see the relationships between all the numerical columns in a dataframe is to use the scatterplot matrix.

For the **super store** dataset, it would be the following (it may take some time to finish):


In [None]:
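# pairplot creates its own figure, so a preceding plt.figure() call would only
# produce an extra empty figure. Use the height parameter (inches per subplot)
# to control the size instead.
sns.pairplot(super_store_data, height = 2.5)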

**EXERCISE**: Compute and show the scatterplot matrix for the **super store** dataset, but color the markers based on the `Category` column.

In [None]:
# TODO: write here
