## Week 4 Notebook 2: Comparing Categories with Seaborn

We want to be able to create plots that compare the data by category.

In this notebook we will try out a new data science library called [Seaborn](https://seaborn.pydata.org/index.html). 


## The Seaborn Library

Seaborn is a data visualisation library with high level functions built on top of Matplotlib. Seaborn's plotting functions allow us to create many plots quickly, especially when we want to compare categories of data.


### Importing Seaborn

Seaborn should be included with your Anaconda distribution, so you can import it with the statement below. Seaborn is usually imported as `sns`.

In [None]:
# import Seaborn together with the other libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline

### Dataset

We will continue to use the `wids_train` data set. Let's import it for this notebook.


In [None]:
# read in the WIDS dataset.
wids_train = pd.read_csv("wids-climate-train.csv")
wids_train.head()

Looking at the data, we can see that there are some categorical variables such as `State_Factor`, `building_class` and `facility_type` that define the different sites. 

We want to be able to compare the values of `site_eui` based on these categories.

### Histogram with Seaborn

Because Seaborn is built on top of Matplotlib and is integrated with Pandas, we can pass a data frame and its column names directly to Seaborn's plotting functions.

A basic histogram can be created by specifying the data and the column to be used for the x-axis.

In [None]:
# Create histogram with seaborn
import seaborn as sns
fig, ax = plt.subplots()
ax = sns.histplot(data = wids_train, x = 'site_eui')
plt.show()

## Comparing Categories

Seaborn makes it easy to add categorical comparisons by colour. 

We can compare the distribution for different States by adding the keyword argument `hue`.
Specifying the `State_Factor` as the `hue` will separate each state by colour.


In [None]:
# Differentiate states by hue
fig, ax = plt.subplots()
ax = sns.histplot(data = wids_train, x = 'site_eui', hue='State_Factor', element='step')   # draw as a step function
ax.set_title('Distribution of Site EUI by State')
plt.show()
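
Conceptually, a histogram with `hue` counts each category separately over the same bin edges. The sketch below illustrates that idea with NumPy. It is only an illustration: the data frame `demo`, its values and the state codes are made up and are not taken from `wids_train`.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'State_Factor': ['State_1', 'State_1', 'State_1', 'State_2', 'State_2', 'State_2'],
    'site_eui':     [10.0, 20.0, 30.0, 20.0, 40.0, 50.0],
})

# Shared bin edges computed from all the values together
bin_edges = np.histogram_bin_edges(demo['site_eui'], bins=4)

# One set of counts per category, all using the same bin edges
counts_by_state = {
    state: np.histogram(group['site_eui'], bins=bin_edges)[0]
    for state, group in demo.groupby('State_Factor')
}

for state, counts in counts_by_state.items():
    print(state, counts)
```

Sharing the bin edges is what makes the coloured histograms directly comparable on one axis.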

### Seaborn boxplot

Similarly, we can create a comparative boxplot that compares the distribution of site EUI by state and building class. 



In [None]:
# Create boxplot by State and Building Class
fig, ax = plt.subplots()
ax = sns.boxplot(data = wids_train, 
                 x='State_Factor', 
                 y='site_eui', 
                 hue='building_class')
ax.set_ylim(0,200)
ax.legend(loc='upper right')
plt.show()


### Scatterplot with Seaborn

Recall that we previously created a scatterplot using the `scatter` method in Matplotlib.

In [None]:
# Scatterplot with matplotlib
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(data = wids_train, x='energy_star_rating', y='site_eui', alpha=0.4)
plt.show()

Let's say we wanted to create the same scatter plot but we want to compare the `building_class` category.

All we have to do is to add the `building_class` for the `hue` argument, and this will *colour* the points according to the building class.

In [None]:
# Create the scatterplot using seaborn, still using figure and ax as before 

fig, ax = plt.subplots(figsize=(10,5))
ax = sns.scatterplot(data = wids_train, 
                     x='energy_star_rating', y='site_eui', 
                     hue='building_class', alpha = 0.4)
plt.show()

The points are now coloured based on the `building_class` value, and we can see that more residential buildings seem to have lower `site_eui` values than commercial buildings for the same `energy_star_rating` values.
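
To appreciate what `hue` saves us, here is a rough Matplotlib-only sketch of the same idea: loop over the categories and draw one set of points per group. The data frame `demo` and its values are made up for illustration and are not from `wids_train`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'building_class': ['Commercial', 'Residential', 'Commercial', 'Residential'],
    'energy_star_rating': [40, 55, 70, 85],
    'site_eui': [120.0, 80.0, 95.0, 60.0],
})

fig, ax = plt.subplots(figsize=(10, 5))

# Draw one set of points per category; Matplotlib uses the next colour each time
for name, group in demo.groupby('building_class'):
    ax.scatter(group['energy_star_rating'], group['site_eui'], alpha=0.4, label=name)

ax.set_xlabel('energy_star_rating')
ax.set_ylabel('site_eui')
ax.legend(title='building_class')
```

In a notebook the figure is displayed at the end of the cell; add `plt.show()` when running this as a script. Seaborn handles the grouping, colours and legend for us with the single `hue` argument.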


### Bar Chart 
A bar chart is another type of plot that is usually used to compare categorical data. Seaborn can automatically create a barplot with an estimated summary value by category.

For example, we can create a barplot comparing the mean `site_eui` based on `building_class`.


In [None]:
# bar plot of mean site eui
fig, ax = plt.subplots()
ax = sns.barplot(data = wids_train, x = 'building_class', y='site_eui')
ax.set_title("Mean Site EUI by Building Class")
plt.show()

The little black lines at the top of each bar are error bars that show the confidence interval estimate for the mean. We can turn them off by using the argument `ci=None`. (In Seaborn 0.12 and later, `ci` is deprecated; use `errorbar=None` instead.)

In [None]:
# bar plot of mean site eui
fig, ax = plt.subplots()
ax = sns.barplot(data = wids_train, 
                 x = 'building_class', 
                 y='site_eui', ci=None)
ax.set_title("Mean Site EUI by Building Class")
plt.show()
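
To give a feel for what the error bar represents, the sketch below estimates a 95% confidence interval for a mean by bootstrapping with NumPy. Seaborn's default for `barplot` is also a bootstrapped 95% interval, but this is only an illustration: the sample values, the variable names and the seed are assumptions, and Seaborn's internal implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the result is repeatable

# Made-up site EUI values for one category (illustration only)
values = np.array([52.0, 61.0, 75.0, 80.0, 88.0, 95.0, 110.0, 130.0])

# Resample with replacement many times and record the mean of each resample
n_boot = 1000
boot_means = np.array([
    rng.choice(values, size=len(values), replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrap means gives the interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {values.mean():.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

A short error bar means the mean is estimated precisely; a long one usually indicates few observations or a lot of spread within the category.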

**Estimator Values**

When we specify the categories that we want to plot, Seaborn will calculate a single value for each category group, depending on the `estimator=` argument. 

The default estimator is the mean, but we can specify a different estimator such as
- `estimator=len` for the number of observations in the group
- `estimator=np.median` for the median value (using numpy)
- `estimator=sum` for the sum of all the values
- `estimator=max` for the highest value
- `estimator=min` for the lowest value

For example, the barplot below shows the median site eui for each building class in each state.

In [None]:
# bar plot of Median site eui
import numpy as np
fig, ax = plt.subplots()
ax = sns.barplot(data = wids_train, 
                 x = 'State_Factor', 
                 y='site_eui', 
                 hue='building_class', 
                 ci=None, 
                 estimator=np.median)
ax.set_title("Median Site EUI")
plt.show()
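
One way to check what the bars represent is to compute the same summaries directly with Pandas. The sketch below does this on a small made-up data frame (`demo` and its values are for illustration only, not the WIDS data); each column corresponds to one possible estimator.

```python
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'building_class': ['Commercial', 'Commercial', 'Commercial',
                       'Residential', 'Residential'],
    'site_eui': [50.0, 70.0, 120.0, 40.0, 60.0],
})

# One summary value per category for each estimator
summary = demo.groupby('building_class')['site_eui'].agg(
    ['count', 'mean', 'median', 'sum', 'max', 'min']
)
print(summary)
```

The height of each bar in a barplot should match the value in the corresponding column for the estimator you chose.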

**Building Type (Site EUI)**

We need to further investigate the facility type for some clarity. First check how many observations there are for each facility type:

In [None]:
# Count the number of observations for each facility type
wids_train['facility_type'].value_counts()

It looks like there are many levels, so we will compare the 'Commercial' facility types first. 

Putting the `facility_type` as the `y=` value will make the barplot appear horizontally.

In [None]:
# filter to obtain Commercial Buildings
comm_bldgs = wids_train[wids_train['building_class']=='Commercial']

# Create a barplot by facility type
fig, ax = plt.subplots()
ax = sns.barplot(data =comm_bldgs, y="facility_type",x="site_eui", ci=None)
ax.set_title("Mean Site EUI for commercial buildings")
plt.show()


There are many facility types, so you may want to increase the figure size with the `figsize` argument of the `subplots()` method. 

Let's show just the 10 facility types with the highest mean `site_eui`. To do this, we can pass the sort order to the plot with the `order` argument.

First we have to sort the values.
Let's find the order in which we should show the bars with the following steps:

In [None]:
# 1. group by facility type and calculate the mean site eui for each group
mean_eui_by_facility = comm_bldgs.groupby('facility_type')['site_eui'].mean()
print(mean_eui_by_facility )

In [None]:
# 2. Now sort in descending order
mean_eui_sorted = mean_eui_by_facility.sort_values(ascending=False)
print (mean_eui_sorted)


In [None]:
# 3. then just get the facility type names: index
plot_order = mean_eui_sorted.index

We can put the steps 1 - 3 above in one statement like this:

In [None]:
# 1. group by facility type to calculate the mean site eui
# 2. then sort in descending order
# 3. then just get the facility type names: index
plot_order = comm_bldgs.groupby('facility_type')['site_eui'].mean().sort_values(ascending=False).index


In [None]:
# plot showing only the top 10 using plot_order[:10]
fig, ax = plt.subplots()
ax = sns.barplot(data =comm_bldgs, y="facility_type",x="site_eui", order=plot_order[:10], ci=None)
ax.set_title("Top 10 commercial facility types by mean site EUI")

plt.show()
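
As an alternative to sorting and then slicing, the Pandas `nlargest()` method returns the top values in one step. The sketch below compares the two approaches on made-up data (the facility type names and values are hypothetical and not taken from `wids_train`).

```python
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'facility_type': ['Data_Center', 'Office', 'Office',
                      'Warehouse', 'Data_Center', 'Retail'],
    'site_eui': [300.0, 80.0, 100.0, 30.0, 340.0, 70.0],
})

mean_eui = demo.groupby('facility_type')['site_eui'].mean()

# Approach used above: sort in descending order, then slice the index
order_sorted = mean_eui.sort_values(ascending=False).index[:2]

# Alternative: ask directly for the 2 largest means
order_nlargest = mean_eui.nlargest(2).index

print(list(order_sorted))
print(list(order_nlargest))
```

Either index can be passed to the `order` argument of `sns.barplot()`.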

In [None]:
# Similarly, plot the Residential buildings
res_bldgs = wids_train[wids_train['building_class']=='Residential']
plot_order = res_bldgs.groupby('facility_type')['site_eui'].mean().sort_values(ascending=False).index

# only 6 types, so not necessary to find top 10

fig, ax = plt.subplots()
ax = sns.barplot(data =res_bldgs, y="facility_type",x="site_eui", order=plot_order, ci=None)
ax.set_title("Mean site EUI for Residential facilities")
plt.show()

Based on the two horizontal bar plots, we can see that:

* For Commercial buildings, data centres have the highest mean site EUI.
* For Residential buildings, mixed-use facilities have the highest mean site EUI.


### Heatmaps with Seaborn

When a category has many levels, like the `facility_type` above, a heatmap is useful for comparing the values by colour. 

A heatmap is a way of representing the data in a 2-dimensional form. The data values are represented as colours in the graph. 
The goal of the heatmap is to provide a coloured visual summary of information. 

For example, we can calculate the median site EUI for each building class by state. First we create a two-way table for the two dimensions, `building_class` and `State_Factor`.

In [None]:
# Calculate median for each group of building class and state
eui_By_building_class = wids_train.groupby(['building_class','State_Factor'])['site_eui'].median()
data = eui_By_building_class.unstack()
data

Now that the rows represent the building class and the columns represent the states, we can plot the heatmap:

In [None]:
# Create heatmap
fig, ax = plt.subplots(figsize=(12,9))
ax = sns.heatmap(data, annot=True, fmt=".1f", cmap="Blues")   # annotate each cell to 1 decimal place
ax.set_title('Median Site EUI by Building Class and State')
plt.show()
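
The two-way table can also be built with the Pandas `pivot_table()` method instead of `groupby()` followed by `unstack()`. The sketch below compares the two on a small made-up data frame (`demo`, its values and the state codes are for illustration only); both routes should give the same table.

```python
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'building_class': ['Commercial', 'Commercial', 'Commercial',
                       'Residential', 'Residential', 'Residential'],
    'State_Factor':   ['State_1', 'State_1', 'State_2',
                       'State_1', 'State_2', 'State_2'],
    'site_eui':       [100.0, 80.0, 60.0, 50.0, 40.0, 70.0],
})

# Route 1: group by both categories, then move the states into columns
via_groupby = demo.groupby(['building_class', 'State_Factor'])['site_eui'].median().unstack()

# Route 2: build the two-way table in one call
via_pivot = demo.pivot_table(index='building_class', columns='State_Factor',
                             values='site_eui', aggfunc='median')

print(via_groupby)
print(via_pivot)
```

Either table can be passed to `sns.heatmap()` as the data.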

Let's try the heatmap with the facility types.

In [None]:
# get the data required as a two-way table
eui_by_Facility = wids_train.groupby(['facility_type','State_Factor'])['site_eui'].median()
data = eui_by_Facility.unstack()
data


In [None]:
# Create heatmap using 
# two way table data
fig, ax = plt.subplots(figsize=(20,15))
ax=sns.heatmap(data, 
               linewidths=0.5, 
               cmap='cividis')
plt.show()

In this heatmap, although we have not annotated the values, the use of colour helps to identify the facility types and states with the highest median site EUI values.

### Line Charts

Another common chart is a line chart, which is often used to visualise trends. The x-axis is usually a time sequence.

Seaborn can generate the line chart and estimate the mean value of the required variable at each point of the x-axis value. 

**Year Built**

We can plot the site eui data according to the year the building was built. To simplify this exercise we will choose only buildings built after 1950.



In [None]:
# Plot line chart using Seaborn lineplot()
fig, ax = plt.subplots()
ax = sns.lineplot(data=wids_train[wids_train['year_built']>1950], x='year_built', y='site_eui')
ax.set_title('Mean site EUI of buildings built after 1950')
plt.show()

Newer buildings tend to have a lower mean site EUI than older buildings.

By default, Seaborn calculates the mean at each x-axis value and plots it with a confidence interval estimate, shown as the shaded band above and below the line. To remove the confidence interval, set `ci=None` (in Seaborn 0.12 and later, where `ci` is deprecated, use `errorbar=None`).

In [None]:
# plot without confidence interval and separate by building class
fig, ax = plt.subplots()
ax = sns.lineplot(data=wids_train[wids_train['year_built']>1950], x='year_built', y='site_eui', hue = 'building_class', ci=None)
ax.set_title('Mean site EUI of buildings built after 1950, by building class')
plt.show()

We have again added the hue to show the building class. It is clear that the residential buildings start to have lower site EUI than commercial buildings after the 1970s.
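
The values that the line passes through can be checked with Pandas: filter the rows, group by the x-axis column and take the mean. The sketch below does this on made-up data (`demo` and its values are for illustration only, not the WIDS data).

```python
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'year_built': [1949, 1960, 1960, 1975, 1975, 1975, 2001],
    'site_eui':   [150.0, 100.0, 120.0, 90.0, 80.0, 70.0, 60.0],
})

# Keep only buildings built after 1950, as in the plots above
recent = demo[demo['year_built'] > 1950]

# One mean per year: these are the points a mean line would connect
mean_by_year = recent.groupby('year_built')['site_eui'].mean()
print(mean_by_year)
```

Adding a second column to the `groupby()` list, such as `building_class`, would give one series of means per category, which is what `hue` does in the line chart.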

### Other Category Plots

There are many other plots provided by Seaborn for plotting categorical data. You can check them out here: [Seaborn Category Plots](https://seaborn.pydata.org/tutorial/categorical.html).

Here is an example of a violin plot, which shows the distribution of `site_eui` for each combination of the `State_Factor` **and** `building_class` categories:
- the width on either side of each 'violin' shows the density of observations at each value, like a smoothed histogram
- the height of the violin shows the range of values
- the mini box within the violin indicates the quartiles, like a boxplot.

In [None]:
# violin plot 

fig, ax = plt.subplots(figsize=(12,5))
ax = sns.violinplot(data = wids_train, 
                 x = 'State_Factor', 
                 y='site_eui', 
                 hue='building_class', split=True)
ax.set_title("Distribution of site eui by State and Building class")
plt.show()
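
The quartiles marked inside each violin (and drawn as the box in a boxplot) can be calculated per category with Pandas. The sketch below is an illustration on made-up data (`demo` and its values are not from `wids_train`).

```python
import pandas as pd

# Made-up data for illustration only (not the WIDS data)
demo = pd.DataFrame({
    'building_class': ['Commercial'] * 5 + ['Residential'] * 5,
    'site_eui': [20.0, 40.0, 60.0, 80.0, 100.0,
                 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Lower quartile, median and upper quartile for each category
quartiles = (
    demo.groupby('building_class')['site_eui']
        .quantile([0.25, 0.5, 0.75])
        .unstack()          # one row per category, one column per quantile
)
print(quartiles)
```

Comparing these numbers with the plot is a quick way to confirm that you are reading the violins correctly.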

## Exercises

Let's try some of the category plots as exercises.

Suppose we want to further investigate the differences in bike rentals in the Capital Bike Share data.

Read in the data again for this notebook:

In [None]:
# make sure the libraries are imported
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# read in the bike rental data, parsing the date columns (day comes first in the dates)
date_cols = ['rental_date','started_at', 'ended_at']
bikes = pd.read_csv('bikes_clean.csv', parse_dates = date_cols, dayfirst=True)
bikes.head()

Q1. Create a lineplot to show the mean duration of rentals, in minutes, by the rental date. 

In [None]:
# Q1 Answer


Q2. Let's see if there is any difference in the duration of rentals based on the day of the week. 

Create a box plot  to compare the `duration_in_min` for each `day_of_week`. You might have to set the y limits to a smaller range to view the differences clearly.

In [None]:
# Q2 Answer


Q3. Add the `hue` argument to the boxplot above to compare the duration of rental for members vs casual users.

In [None]:
# Q3 Answer



Q4. Create a barplot to compare the rideable type, type of member and mean duration of rental

In [None]:
# Q4 Answer


Q5. Create a violinplot to compare the duration of rental for each day of week, split by members vs casual users 

In [None]:
# Q5 Answer


Q6a. Let's try a heatmap. Create a two-way table that calculates the number of rentals by `day_of_week` and `rental_hour` from the `bikes` data set.

In [None]:
# Q6a. Answer


Q6b. Now create the heatmap. Check the colormaps [here](https://matplotlib.org/stable/tutorials/colors/colormaps.html) and select a suitable colormap to match the heatmap. Which days and hours have the highest number of rentals?

In [None]:
# Q6b answer


Great! We can see that Seaborn helps us to visualise our data and compare across categories, especially with the use of colour. Next we will try to organise our figures with subplots.