# Week 4 Notebook 1 Visualising Distributions (with Solutions)

This week we will focus on how we create data visualisations to tell stories about our data.

In this notebook we will explore how to create the following plots to visualise distributions:
- histograms,
- boxplots, and
- scatterplots

This week we will also try out a new data science library called [Seaborn](https://seaborn.pydata.org/index.html), which is introduced in the next notebook.

We will use a new data set for the analysis.

**About the Analysis**

The aim of this analysis is to carry out an exploratory analysis of a training dataset and gain some insight into how each of the variables relates to our target variable, `site_eui` (Site Energy Usage Intensity: the amount of heat and electricity consumed by a building). We have 64 attributes (including the target). The data is obtained from the [Women in Data Science datathon 2022](https://www.widsconference.org/datathon.html).

**WIDS Dataset**

The dataset consists of building characteristics (e.g. floor area, facility type etc), weather data for the location of the building (e.g. annual average temperature, annual total precipitation etc) as well as the energy usage for the building and the given year, measured as Site Energy Usage Intensity (Site EUI). Each row in the data corresponds to a single building observed in a given year.

The task is to predict the Site EUI for each row, given the characteristics of the building and the weather data for the location of the building. For now, we will just focus on exploring and visualising the data set.

Let's get started with importing our usual libraries.


In [None]:
# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt

# show plots within this notebook
%matplotlib inline

### Basic Exploration

Let's read in the data set and have a look at it.
It should be in the same directory as this notebook.


In [None]:
# Read in the data set.
wids_train = pd.read_csv('wids-climate-train.csv')
wids_train.info()

There are 64 columns and 75757 rows. 
There appear to be a large number of missing values for the `energy_star_rating` column, so we might want to do something about that.
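One way to put a number on the missing values is to count them per column. The sketch below does this on a tiny made-up DataFrame; only the two column names are borrowed from the data set, and the values are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only (not from the WIDS data set)
toy = pd.DataFrame({
    "site_eui": [54.5, 75.3, 97.2, 120.1],
    "energy_star_rating": [float("nan"), 61.0, float("nan"), 88.0],
})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)

# Share of missing values in each column (0.5 means half the rows)
share = toy.isna().mean()
print(share)
```

The same two calls could be applied to `wids_train` to see which columns are most affected.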

Let's continue exploring the data first. 

In [None]:
# Check the first five rows
wids_train.head()

Let's try to analyse the site EUI in terms of state, type of building, and year built. 

To try to understand the data, we can check the distribution of the `site_eui` data. This can be done using a histogram or boxplot.


**Histogram**

A histogram is a graphical representation that organises a group of data points into user-specified ranges (bins). Like a bar chart, a histogram consists of a series of vertical bars along the x-axis; the height of each bar shows how many data points fall into that range. Histograms are most commonly used to depict what a set of data looks like in aggregate.
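To make the idea of bins concrete, the sketch below counts a few made-up values into equal-width bins with `numpy.histogram`, which Matplotlib's `hist` uses for its binning. The numbers are invented for illustration and are not taken from the WIDS data.

```python
import numpy as np

# Made-up values for illustration only (not from the WIDS data set)
values = [10, 20, 30, 40, 55, 60, 75, 90, 100, 400]

# 4 equal-width bins between the minimum and the maximum
counts_4, edges_4 = np.histogram(values, bins=4)
print(edges_4)    # 5 edges delimit 4 bins
print(counts_4)

# Doubling the number of bins halves the bin width and shows more detail
counts_8, edges_8 = np.histogram(values, bins=8)
print(counts_8)
```

With 4 bins, nine of the ten values land in the first bar; with 8 bins, that bar splits in two, which is the effect we look for when adding bins to a plot.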


In [None]:
# Basic histogram
fig, ax = plt.subplots()
ax.hist(wids_train['site_eui'])
plt.show()

We can see that most of the buildings have values between 0 and 100. Let's add more *bins*, and limit the x-axis to the range 0 to 500, to see this part of the distribution more clearly.


In [None]:
# Create the histogram again, but this time with 200 bins
fig, ax = plt.subplots()
ax.hist(wids_train['site_eui'], bins = 200)
ax.set_title('Distribution of Site EUI', fontsize=20)
ax.set_xlabel('Site Energy Usage Intensity (EUI)')
ax.set_xlim(0,500)
ax.set_ylabel('Number of buildings')
plt.show()

**Boxplot**

Another plot that can be used to visualise the data is a boxplot. A boxplot gives a good indication of how the values in the data are spread out. 

Boxplots are a standardised way of displaying the distribution of data based on a five-number summary:
1. minimum value
2. first quartile (Q1)
3. median (Q2)
4. third quartile (Q3)
5. maximum value
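As a rough illustration, the five values can be computed by hand for a small list of made-up numbers using Python's built-in `statistics` module. The `method="inclusive"` setting is chosen here on the assumption that we want the same linear interpolation between data points that pandas uses by default.

```python
import statistics

# Made-up values for illustration only (not from the WIDS data set)
values = [12, 35, 48, 54, 60, 75, 81, 97, 130]

# n=4 splits the data into quarters and returns the three cut points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

five_number_summary = {
    "min": min(values),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": max(values),
}
print(five_number_summary)
```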

We can see these values for `site_eui`, along with the count, mean and standard deviation, using the `describe()` method:

In [None]:
# print summary statistics of site EUI (includes the five-number summary)
summary = wids_train['site_eui'].describe()
print(summary)


As you can see, the first quartile is labelled 25%, which means that 25% of the values are below 54.528601. The median is labelled 50% and is 75.293716.

Let's try to create a boxplot for the `site_eui`. 

In [None]:
# Create a basic boxplot
fig, ax = plt.subplots()
ax.boxplot(wids_train['site_eui'])
ax.set_xlabel('Site EUI')
plt.show()



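The markers above the boxplot's upper 'whisker' are considered *outliers*: by default, a value is drawn as an individual point when it lies more than 1.5 times the interquartile range (IQR = Q3 - Q1) above the third quartile, or below the first quartile. Let's try to take a closer look at the box and whiskers.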

We can do this by limiting the y-axis range.

In [None]:
# Create a basic boxplot
fig, ax = plt.subplots()
ax.boxplot(wids_train['site_eui'])
ax.set_xlabel('Site EUI')

# show y axis from 0 to 200 only to 'zoom in'
ax.set_ylim(0, 200)

# add annotation to mark interquartile range
ax.annotate(xy=(0.90, summary['25%']), xytext=(0.55, summary['25%']), text='1st quartile', arrowprops=dict(arrowstyle='simple'))
ax.annotate(xy=(0.90, summary['75%']), xytext=(0.55, summary['75%']), text='3rd quartile', arrowprops=dict(arrowstyle='simple'))
ax.annotate(xy=(1.1, summary['50%']), xytext=(1.25, summary['50%']), text='Median', arrowprops=dict(arrowstyle='simple'))

# calculate the upper limit for the whisker: Q3 + 1.5 * IQR
capVal = 1.5 * (summary['75%'] - summary['25%']) + summary['75%']
ax.annotate(xy=(1.05, capVal), xytext=(1.1, capVal), text='3rd Quartile + \n(1.5 * Interquartile Range)', arrowprops=dict(arrowstyle='simple'))


plt.show()
print(summary)


We can see the box more clearly now, and some annotation has been added to mark the important points on the boxplot.
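The 'Q3 + 1.5 * IQR' limit annotated on the plot can also be used to list the outliers directly. Here is a minimal sketch on made-up numbers; the variable names `lower_limit` and `upper_limit` are chosen for this example only.

```python
import pandas as pd

# Made-up values for illustration only (not from the WIDS data set)
values = pd.Series([20, 40, 50, 55, 60, 70, 75, 80, 95, 110, 400])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside these limits are drawn as individual outlier markers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = values[(values < lower_limit) | (values > upper_limit)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}")
print(outliers.tolist())
```

Note that the whisker itself is drawn to the most extreme data point that is still inside the limits (110 in this toy example), so it can stop short of the calculated limit.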

We can also make the boxplot horizontal by setting the `vert` (for vertical) argument to `False`.

In [None]:
# Create a horizontal boxplot
fig, ax = plt.subplots(figsize=(20,3))
ax.boxplot(wids_train['site_eui'], vert=False)
ax.set_title('Distribution of site EUI')

# add more tick marks
ax.xaxis.set_major_locator(plt.MultipleLocator(50))
plt.show()

### Comparing Site EUI by State

We can compare the distribution of `site_eui` for each state using boxplots, by plotting the states' respective `site_eui` values on the same plot. First we should check how many values there are for each state.

In [None]:
# Count how many rows (buildings) there are for each state
states = wids_train['State_Factor'].value_counts()
print(states.index)
print(states)

Next we will separate the `site_eui` values for each state, and store them in a dictionary that uses the state name as the key. 

In [None]:
# create a dictionary of each state's site_eui values
statesList={}
# For each state, store the series of site EUI as the value and the name of the state as the key
for state in states.index:
    statesList[state]=wids_train[wids_train['State_Factor']==state]['site_eui']
# show the dictionary
statesList

Now we can use Matplotlib to plot each state's `site_eui` values.

In [None]:
# plot the values stored in the dictionary, by state
fig, ax = plt.subplots()
ax.boxplot(list(statesList.values()), labels=list(statesList.keys()))
ax.set_ylim(0,200)
ax.set_title("Distribution of site EUI by state")
plt.show()

The plot shows a boxplot for each state, which gives us an idea of the distribution of the site EUI by state.
For example, we can see that State 10 has a smaller range of values, and State 11's site EUI values appear to be generally lower than those for State 6.
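To back up a visual reading like this with numbers, one option is to group the rows by state and compare the median and interquartile range. The sketch below uses a made-up DataFrame; the column names follow the data set, but the state labels and values are invented.

```python
import pandas as pd

# Made-up rows for illustration only (not from the WIDS data set)
toy = pd.DataFrame({
    "State_Factor": ["State_1", "State_1", "State_1",
                     "State_2", "State_2", "State_2"],
    "site_eui": [40.0, 60.0, 80.0, 90.0, 100.0, 140.0],
})

# Group the site_eui values by state
by_state = toy.groupby("State_Factor")["site_eui"]

# Median (the line inside each box) and IQR (the height of each box)
medians = by_state.median()
iqr = by_state.quantile(0.75) - by_state.quantile(0.25)

print(medians)
print(iqr)
```

A lower median corresponds to a box that sits lower on the plot, and a smaller IQR to a shorter box.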

## Scatter Plots

Scatter plots are useful for comparing two numerical variables: each member of the dataset is plotted as a point, with one variable on each axis.
They are particularly useful for exploring and visualising correlations between variables.

We have previously created scatterplots using the `plot` method and hiding the lines. However, we can also create scatterplots using the `scatter` method in `Matplotlib`.

Let's create one to compare the `energy_star_rating` and `site_eui` values.


In [None]:
# Create a scatter plot
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(wids_train['energy_star_rating'], wids_train['site_eui'], alpha=0.3)  # alpha to set transparency
ax.set_title('Site EUI by energy star rating')
ax.set_xlabel('Energy Star Rating')
ax.set_ylabel('Site EUI')
plt.show()

There are many points on this scatter plot, so the darker points indicate more overlap. Generally we can see that there is a slight decrease in the site EUI values for higher energy star rating values.
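A trend read off a scatter plot can be checked with a correlation coefficient. The sketch below uses made-up values (only the column names come from the data set) and the pandas `corr` method, which computes the Pearson correlation by default and skips pairs with a missing value.

```python
import pandas as pd

# Made-up rows for illustration only (not from the WIDS data set)
toy = pd.DataFrame({
    "energy_star_rating": [10, 30, 50, 70, 90, float("nan")],
    "site_eui": [120, 100, 90, 70, 60, 80],
})

# Pearson correlation; the row with the missing rating is ignored
r = toy["energy_star_rating"].corr(toy["site_eui"])
print(round(r, 3))
```

A negative value means that higher ratings tend to go with lower site EUI values; the closer it is to -1, the stronger the linear relationship.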

### Note on specifying arguments

We have been creating our plots by defining the x or y values using the column names in square brackets, like this: 

`ax.scatter(wids_train['energy_star_rating'], wids_train['site_eui'], alpha=0.3)`

An alternative is to pass the DataFrame with the `data=` keyword argument and then give the x and y values as column names.

`ax.scatter(data = wids_train, x ='energy_star_rating', y='site_eui', alpha=0.3)`

In [None]:
# Create a scatter plot using data= keyword argument
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(data = wids_train, x ='energy_star_rating', y='site_eui', alpha=0.3)
ax.set_title('Site EUI by energy star rating')
plt.show()

So far, we have created some plots to visualise the distribution of the Site EUI, with respect to the states and the energy star rating.

Besides these comparisons, we want to compare the site EUI values based on the state, building class and facility type. 

In order to do this, we might need more sophisticated plotting functions.

In the next notebook we will introduce the `Seaborn` library for quickly creating some plots to compare the data across categories.

## Exercises

For this exercise we will use the data on bike rentals from [Capital Bike Share](https://www.capitalbikeshare.com/system-data). 

We have a set of cleaned data about January 2022 rentals in the `bikes_clean.csv` file.

Let's read in the data first. We will *parse* the start and end dates so that they are read in as dates.
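To show what parsing does, the sketch below reads two made-up rides from a string instead of a file. The column names match the ones used in this exercise, but the rows and the `minutes` column are invented for illustration, and the day/month/year format is an assumption.

```python
import io
import pandas as pd

# Two made-up rides in day-first format (day/month/year)
csv_text = (
    "started_at,ended_at\n"
    "03/01/2022 08:15:00,03/01/2022 08:45:00\n"
    "13/01/2022 17:00:00,13/01/2022 17:12:00\n"
)

toy_bikes = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["started_at", "ended_at"],  # read these columns as dates
    dayfirst=True,                           # 03/01/2022 means 3 January
)
print(toy_bikes.dtypes)

# Dates can be subtracted, which is not possible with plain text
toy_bikes["minutes"] = (
    toy_bikes["ended_at"] - toy_bikes["started_at"]
).dt.total_seconds() / 60
print(toy_bikes)
```

Without `parse_dates`, both columns would be read as text and the subtraction would fail.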


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
# Reading in data and parsing the dates
date_cols = ['started_at', 'ended_at']
bikes = pd.read_csv('bikes_clean.csv', parse_dates = date_cols, dayfirst=True)
bikes.head()

###  Questions

Q1. Create a histogram for the `duration_in_min` values from the `bikes` data set.
- Put the data into 200 bins.
- Adjust the x-axis limits so that the plot is clearer.
- Add a title for the figure.

In [None]:
# Q1 Answer

# Histogram showing distribution of bike rental duration
fig, ax = plt.subplots()
ax.hist(bikes['duration_in_min'], bins=200)
ax.set_xlim(0,200)
ax.set_title('Duration of Rental in Minutes')
plt.show()

Q2. Create a ***horizontal*** boxplot for the `duration_in_min` values from the `bikes` data set. Set the x-axis limits from 0 to 100, as it appears that most of the data lies within this range.

In [None]:
# Q2 Answer
# Boxplot showing distribution of duration in minutes

fig, ax = plt.subplots(figsize=(20,3))
ax.boxplot(bikes['duration_in_min'], vert=False)           # Create a horizontal boxplot
ax.set_title('Duration of Rental, in Minutes')
ax.set_xlim(0,100)

# optionally add more tick marks (uncomment the next line to use)
#ax.xaxis.set_major_locator(plt.MultipleLocator(5))
plt.show()

Q3. Create a scatterplot to compare the `start_lat` on the x-axis with the `start_lng` on the y-axis for the values in the `bikes` data set. Add suitable axis labels and a title. This would give us an idea of the popular locations that the bikes are being rented from.

In [None]:
# Q3 Answer
# Create a scatter plot
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(data = bikes, x = 'start_lat',y='start_lng', alpha=0.2)  # alpha to set transparency
ax.set_title('Starting Rental Locations')
ax.set_xlabel('Latitude')
ax.set_ylabel('Longitude')
plt.show()

Q4. Let's check if there is any difference in the start locations between members and casual users. To do this, we have to separate the data into two subsets.

The set of casual users data is shown as an example. You are required to add the scatter plot for members with a different colour and show the legend.


In [None]:
# Q4 Answer
# get casual users
casual_data = bikes[bikes['member_casual']=='casual']


# Enter your answer to get the member data
member_data = bikes[bikes['member_casual']=='member']


# Create a scatter plot 
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(data = casual_data, x = 'start_lat',y='start_lng', alpha=0.2, c= 'pink',  label = 'Casual')  # alpha to set transparency

# add the points for the member data
ax.scatter(data = member_data, x = 'start_lat',y='start_lng', alpha=0.2, c = 'blue', label = 'Member')  # alpha to set transparency


ax.set_title('Starting Rental Locations for casual vs members')
ax.set_xlabel('Latitude')
ax.set_ylabel('Longitude')

# add a statement to show the legend
ax.legend()

# show the plot
plt.show()

You might find that the points overlap, so the lower layer of values is not clearly visible. We can do this better with the `Seaborn` library, so we'll cover that in the next notebook.