# Week 4 Notebook 4 Colours


Let's have a look at how we can choose colours for our plots.

In this notebook, we will cover:
- Using the color argument
- Different colormaps
- Seaborn Palettes

Let's start by importing the libraries and reading in the WIDS datathon data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors
wids_train = pd.read_csv('wids-climate-train.csv')

In [None]:
wids_train.head()

## Matplotlib colours

A good tutorial on colours is available in [Plotting with Pride: Colors in Matplotlib](https://petercbsmith.github.io/color-tutorial.html). Here we will try to view some of the colours available.

You might have noticed we added another import above: `import matplotlib.colors as mcolors`. This gives us the `colors` module which allows us to find the names of each of the colours available.
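
As a quick sketch of what the `colors` module holds, the dictionaries below map colour names to their values, and `mcolors.to_hex` converts any recognised name into a hex string. The two names looked up here (`'b'` and `'tab:blue'`) are only illustrative choices.

```python
import matplotlib.colors as mcolors

# The single-letter base colours
print(sorted(mcolors.BASE_COLORS))

# How many named colours each dictionary provides
for dict_name in ['BASE_COLORS', 'TABLEAU_COLORS', 'CSS4_COLORS', 'XKCD_COLORS']:
    print(dict_name, len(getattr(mcolors, dict_name)))

# Convert colour names to hex strings
blue_hex = mcolors.to_hex('b')
tab_blue_hex = mcolors.to_hex('tab:blue')
print(blue_hex, tab_blue_hex)
```

Any of these names can be passed wherever Matplotlib expects a colour.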

Without using specific named colours, Matplotlib will give us default colours. We can show them in the legend below.

In [None]:
# Show the default colours
val = 1
fig, ax= plt.subplots(figsize=(5,5))
ax.set_facecolor("lightgray")
for color in range(10):
    choice =("C"+str(color)) 
    plt.plot(val, val, c=choice, label=choice, linewidth=14, marker='o', markersize=14)
    val= val+1
    
# plot the legend to show the colors, use bbox_to_anchor to put the legend outside the bounding boxes of the axes
plt.legend(fontsize=20, bbox_to_anchor=(1,1), facecolor="lightgray") 
plt.show()

In [None]:
# Print the colours in a legend
val = 1
fig, ax= plt.subplots(figsize=(5,5))
ax.set_facecolor("lightgray")

# look through the colours defined in mcolors.BASE_COLORS
# You can try other colours eg mcolors.TABLEAU_COLORS, mcolors.CSS4_COLORS, mcolors.XKCD_COLORS
for color in mcolors.BASE_COLORS:   
    plt.plot(val, val, c=color, label=color, linewidth=14, marker='*', markersize=10)
    val= val+1
    
# plot the legend to show the colors, use bbox_to_anchor to put the legend outside the bounding boxes of the axes
plt.legend(fontsize=20, bbox_to_anchor=(1,1), facecolor="lightgray") 
plt.show()

## Plotting the WIDS data

Let's see how we can use colours to distinguish groups in our data.

Using basic Matplotlib plotting, we use the `color=` or `c=` argument to define the colour for plot elements.


In [None]:
# Select only sites built in 2015, for example
data = wids_train[wids_train['year_built']==2015]

# Define a dictionary so that each state is given a specific colour
colors_by_state = {'State_1': 'b', 'State_2': 'g', 'State_4': 'r', 'State_8': 'c',
                   'State_10': 'm', 'State_11': 'y', 'State_6': 'k'}

# Create the plot
fig, ax= plt.subplots()

# plot the points for each state
for state_name in colors_by_state:
    state_data = data[data['State_Factor']==state_name]  # get the data for one state
    ax.scatter(state_data['energy_star_rating'], state_data['site_eui'], 
               color=colors_by_state[state_name], label=state_name, alpha=0.6)
ax.legend(bbox_to_anchor=(1,1))
ax.set_xlabel('energy star rating')
ax.set_ylabel('site EUI')
ax.set_title('Higher Energy Star Rating should reflect lower Site EUI values')
fig.suptitle('Sites built in 2015')
plt.show()
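
An alternative to looping, sketched here on a small made-up table rather than the WIDS data, is to translate the state column into a list of colours with `map` and pass that list to `c=`. The values in `toy` are invented for illustration only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows, only to illustrate the technique
toy = pd.DataFrame({
    'State_Factor': ['State_1', 'State_2', 'State_1', 'State_4'],
    'energy_star_rating': [80, 55, 30, 95],
    'site_eui': [60, 90, 140, 40],
})
colors_by_state = {'State_1': 'b', 'State_2': 'g', 'State_4': 'r'}

# Look up one colour per row from the dictionary
point_colors = toy['State_Factor'].map(colors_by_state).tolist()

fig, ax = plt.subplots()
ax.scatter(toy['energy_star_rating'], toy['site_eui'], c=point_colors, alpha=0.6)
ax.set_xlabel('energy star rating')
ax.set_ylabel('site EUI')
plt.show()
```

The trade-off is that a single `scatter` call does not create one legend entry per state, so the loop is the handier option when a legend is needed.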

## Case Study - Checking Energy Star Rating Awarded by States

We want to be able to compare the energy star ratings awarded vs the site EUI values for each state. Maybe this can tell us which states are able to award the energy star ratings accurately.

Let's try to check which facility types in which states have the highest site EUI ratings vs energy star ratings. 
First we have to group and find the mean site EUI and mean energy star rating for each state.

You might recall that there are some missing values for the energy star ratings. As we want to find out which buildings have more accurate energy star ratings, those with missing values will not help in our analysis.

In [None]:
# drop data where energy star rating value is null
star_rated = wids_train.dropna(subset=['energy_star_rating'])
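
To see what `dropna(subset=...)` does, here is a minimal sketch on a made-up data frame (the rows are invented, not taken from the WIDS data). Only rows with a missing value in the named column are removed.

```python
import numpy as np
import pandas as pd

# Made-up rows: two of the four ratings are missing
toy = pd.DataFrame({
    'facility_type': ['Office', 'Warehouse', 'Hotel', 'Office'],
    'energy_star_rating': [75.0, np.nan, 40.0, np.nan],
    'site_eui': [55.0, 30.0, 120.0, 80.0],
})

# Count the missing ratings, then drop those rows
n_missing = toy['energy_star_rating'].isna().sum()
toy_rated = toy.dropna(subset=['energy_star_rating'])

print('missing ratings:', n_missing)
print('rows before:', len(toy), 'rows after:', len(toy_rated))
```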


**Group By State, Building and Facility Type**

Now we can calculate the mean values for site EUI and energy star rating for each state, building class and facility type.
We will reset the index on the data frame so that we can get a new data frame to be used in the plot. 

Don't forget we want to use the `star_rated` data set.

In [None]:
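# Calculate mean values for each group of state, building class and facility type that have been awarded energy star ratings
# numeric_only=True guards against any remaining text columns, which newer versions of pandas refuse to average
mean_by_state_bldg = star_rated.groupby(['State_Factor','building_class','facility_type']).mean(numeric_only=True)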
mean_by_state_bldg.reset_index(inplace=True)
mean_by_state_bldg
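
The group-then-reset pattern can be sketched on a made-up data frame (the rows and the `notes` column are invented for illustration). Passing `numeric_only=True` to `mean` leaves out text columns that cannot be averaged, and `reset_index` turns the group keys back into ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'State_Factor': ['State_1', 'State_1', 'State_2', 'State_2'],
    'facility_type': ['Office', 'Office', 'Office', 'Hotel'],
    'energy_star_rating': [80.0, 60.0, 50.0, 30.0],
    'site_eui': [40.0, 60.0, 90.0, 150.0],
    'notes': ['a', 'b', 'c', 'd'],  # a text column that cannot be averaged
})

# Mean of the numeric columns for each state and facility type
toy_means = (toy.groupby(['State_Factor', 'facility_type'])
                .mean(numeric_only=True)
                .reset_index())
print(toy_means)
```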


Using this data, we can create a scatter plot of the mean site EUI vs the Energy Star Rating and find the top Facility Types.

In [None]:
# Scatter plot where size of marker indicates higher site EUI
fig, ax= plt.subplots()
ax.scatter(mean_by_state_bldg['energy_star_rating'], mean_by_state_bldg['site_eui'], c='blue',
           alpha=0.5, s=mean_by_state_bldg['site_eui'])

ax.set_xlabel('Energy Star Rating')
ax.set_ylabel('Site EUI')
ax.set_ylim(0,1000)
ax.set_title('Mean Site EUI vs energy star rating')

# find the facility types and states with the three highest mean site EUI values

# First sort the values by site EUI
top3 = mean_by_state_bldg.sort_values(by='site_eui', ascending=False)

# Then put the text for the top 3 only - can change to 5, or 10
for i in range(3):
    top = top3.iloc[i]
    xVal= top['energy_star_rating']
    
    # add annotation to the top 3 only
    ax.annotate(top['facility_type']+',\n'+top['State_Factor'], xy=(xVal,top['site_eui']), xytext=(xVal-10,top['site_eui']-(i*50) ))

plt.show()

## Using Colours to represent Values

In the plot above we have used the size of the markers to indicate the site EUI value - bigger markers show that the site EUI is higher. We can also use colours to show the value of another dimension.

For example, if we want to use the colour based on the energy star rating:


In [None]:
# Scatter plot where size of marker indicates higher site EUI and colour indicates the energy star rating
fig, ax= plt.subplots()
sc = ax.scatter(mean_by_state_bldg['energy_star_rating'], mean_by_state_bldg['site_eui'],
           alpha=0.5, s=mean_by_state_bldg['site_eui'], c=mean_by_state_bldg['energy_star_rating'])

ax.set_xlabel('Energy Star Rating')
ax.set_ylabel('Site EUI')
ax.set_ylim(0,1000)
ax.set_title('Mean Site EUI vs energy star rating')

# add the color bar to indicate the color values
plt.colorbar(sc, label='Energy Star Rating')

# get the sizes for the legend
handles, labels = sc.legend_elements(prop='sizes')
ax.legend(handles, labels, bbox_to_anchor=(1.5,1), title='Site EUI')

plt.show()
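
The `legend_elements` call above lets Matplotlib choose how many size entries to show. As a hedged sketch on made-up values, the `num` argument asks for roughly a given number of entries. Matplotlib treats it as a target rather than an exact count.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values, only to illustrate the size legend
x = np.arange(1, 11)
sizes = x * 50

fig, ax = plt.subplots()
sc = ax.scatter(x, x, s=sizes, c=x, alpha=0.5)

# Ask for about four size entries in the legend
handles, labels = sc.legend_elements(prop='sizes', num=4)
ax.legend(handles, labels, bbox_to_anchor=(1.5, 1), title='Size')

plt.colorbar(sc, label='Value')
plt.show()
```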

## Colormaps

The `colorbar` that has been added shows the colours that the values of energy star rating map to. This is known as the ***colormap***. You have a large choice as you can see in the [Choosing Colormaps in Matplotlib](https://matplotlib.org/stable/tutorials/colors/colormaps.html) reference.

The default colormap used in the plot above is `viridis`, which runs from dark purple through green to yellow. These colours do not naturally suggest which energy star ratings are good and which are poor.

We can specify a more suitable colormap. For example, to indicate that green is best and red is worst, we can use the `RdYlGn` colormap:



In [None]:
# Scatter plot where size of marker indicates higher site EUI and colour indicates the energy star rating
fig, ax= plt.subplots()
sc = ax.scatter(mean_by_state_bldg['energy_star_rating'], mean_by_state_bldg['site_eui'],
           alpha=0.8, s=mean_by_state_bldg['site_eui'], c=mean_by_state_bldg['energy_star_rating'], cmap='RdYlGn')

ax.set_xlabel('Energy Star Rating')
ax.set_ylabel('Site EUI')
ax.set_ylim(0,1000)
ax.set_title('Mean Site EUI vs energy star rating')

# add the colorbar to indicate the color values
plt.colorbar(sc, label='Energy Star Rating')
plt.show()
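
To check which colour a particular value maps to, a colormap can be sampled directly. This is a minimal sketch that assumes the ratings run from 0 to 100; the `Normalize` range is an illustrative assumption and is not read from the data.

```python
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

cmap = plt.get_cmap('RdYlGn')
norm = mcolors.Normalize(vmin=0, vmax=100)  # assumed rating range, for illustration

# Print the hex colour that a low, middle and high rating would receive
for rating in [0, 50, 100]:
    print(rating, mcolors.to_hex(cmap(norm(rating))))

low_rgba = cmap(norm(0))     # reddish end of the colormap
high_rgba = cmap(norm(100))  # greenish end of the colormap
```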

### Reversing Colormaps

To reverse the sequence of the colour values in a colormap, you can just add `_r` to the name of the colormap. For example, the plot below does not use the `summer` colormap, where lower values are green and higher values are yellow. Instead, the colormap has been set to `cmap='summer_r'`, so that lower values are yellow and higher values are green.

In [None]:
fig, ax= plt.subplots(figsize=(8,5))

# choose the summer colormap but reverse it in the cmap argument
scatter=ax.scatter(mean_by_state_bldg['State_Factor'], 
               mean_by_state_bldg['site_eui'],
               c=mean_by_state_bldg['energy_star_rating'], 
               cmap='summer_r', marker='*', s=mean_by_state_bldg['site_eui'])
plt.colorbar(scatter, label = 'Mean Energy Star Rating')

ax.set_ylabel('Site EUI')

ax.set_title('Mean Site EUI for facility types in each state')
plt.show()
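
The effect of the `_r` suffix can be checked directly by comparing the two ends of each colormap. This is a small sketch and no data is needed.

```python
import matplotlib.pyplot as plt
import numpy as np

summer = plt.get_cmap('summer')
summer_r = plt.get_cmap('summer_r')

# The start of the reversed colormap should match the end of the original, and vice versa
same_as_flipped = (np.allclose(summer_r(0.0), summer(1.0))
                   and np.allclose(summer_r(1.0), summer(0.0)))

print('summer   starts at', summer(0.0), 'and ends at', summer(1.0))
print('summer_r starts at', summer_r(0.0), 'and ends at', summer_r(1.0))
print('reversed:', same_as_flipped)
```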

In [None]:
fig, ax= plt.subplots(figsize=(10,10))

# choose the summer colormap but reverse it in the cmap argument
scatter=ax.scatter(mean_by_state_bldg['State_Factor'], 
               mean_by_state_bldg['site_eui'],
               c=mean_by_state_bldg['energy_star_rating'], 
               cmap='summer_r', marker='*', s=mean_by_state_bldg['site_eui'])
ax.set_ylabel('Site EUI')
ax.set_title('Mean Site EUI by facility type')

# set the colorbar at the bottom instead
plt.colorbar(scatter, location='bottom', ticks=None, label='Energy Star Rating')


# Then the legend on the right
handles, labels = scatter.legend_elements(prop='sizes')
plt.legend(handles, labels, bbox_to_anchor=(1,1), title='Site EUI')
plt.show()

## Seaborn Palette

In the previous notebooks we have used the Seaborn `hue` argument to distinguish categories. However, we can also specify the choice of colours that the hue should use. This uses the `palette` argument.


In [None]:
# specifying the palette to use
fig, ax = plt.subplots()
ax = sns.boxplot(data = wids_train, 
                 x='State_Factor', 
                 y='site_eui', 
                 hue='building_class', 
                 palette = 'hsv')

ax.set_ylim(0,200)
ax.set_xlabel('')
ax.set_ylabel('Site EUI')
ax.legend(loc='upper right')
ax.set_title('Site EUI by State')
plt.show()

The tutorial on [Choosing Color Palettes](https://seaborn.pydata.org/tutorial/color_palettes.html) provides some examples of palette names. Seaborn is also compatible with the Matplotlib colormaps, so you can choose from either the colormap names or the palette names.
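
To preview the colours behind a palette name before using it in a plot, `sns.color_palette` returns them as a list. This is a minimal sketch; asking for three colours is an arbitrary choice for illustration.

```python
import seaborn as sns

# Ask for three colours from the 'hsv' colormap
palette = sns.color_palette('hsv', n_colors=3)

# Each entry is an (r, g, b) tuple; as_hex() gives the matching hex strings
hex_codes = list(palette.as_hex())
print(hex_codes)
```

In a notebook, leaving `palette` as the last line of a cell should display the colours as swatches.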

Let's try out more plots in the exercises!


# Exercises

Let's try to customise the plots for the bike sharing data set with more colours.


In [None]:
date_cols = ['rental_date','started_at', 'ended_at']
bikes = pd.read_csv('bikes_clean.csv', parse_dates = date_cols, dayfirst=True)
bikes.head()

In [None]:
# create the bike groups, then find the mean duration in minutes and the number of rentals in each group
groups = bikes.groupby(['rideable_type','member_casual','day_of_week','rental_date','rental_hour'])['duration_in_min'].agg(mean_duration='mean', num_rentals='size')
daily_rentals = groups.reset_index()
daily_rentals

**Q1. Basic Scatterplot**

Create a basic scatterplot of the hour of day by duration in minutes, to view the mean duration of rental for different hours of the day throughout the whole month, for different types of bikes.
- create the dictionary of colours for each rideable_type
- use a for loop to loop through the dictionary of bike types
- using the `daily_rentals` data set, plot the scatter plot with `rental_hour` on the x-axis, `mean_duration` on the y-axis, and the color for the points based on the bike type.
- try to add the size of the points based on the `mean_duration`. 



In [None]:
# Set colors for the types of bikes - choose suitable colors
bike_colors = {'docked':'red', 'classic':'blue','electric':'green'}

# Create a basic Matplotlib scatter plot to loop through the dictionary 
# plot the points for each bike type



You should be able to see from the plot that electric bikes are taken at all hours but only for shorter durations.

**Q2. Scatter Plot**

Create a scatter plot to see the number of rentals over the whole month, with the colour based on the rental hour. 
- Use the `daily_rentals` data set
- use the `rental_date` on the x-axis
- use the `num_rentals` on the y-axis, to show the number of rentals for the different bikes and hours
- pass the `rental_hour` column to the `c=` argument so that its values are mapped to colours
- use a suitable colormap for the rental hour with the `cmap=` argument.
- add a colorbar

In [None]:
# Q2. Scatterplot to show the number of rentals by rental date for different hours of the day


## Using Seaborn 

**Q3. Seaborn Lineplot**

Create a lineplot using Seaborn on the `daily_rentals` data set to show the rental date on the x-axis, the mean duration on the y-axis, and the `hue` argument to show two lines of different colours, each representing whether the rentals are from members or casual users.
Select a suitable value for the `palette` argument for the colours.

In [None]:
# Q3 Answer
# plot duration of rental by date


## Seaborn Heatmap

Let's practice creating the heatmap and selecting suitable colours. To create a heatmap, we have to create a two-way table of data first.

**Q4a. Two-Way Table**

Create a two-way table using the `groupby()` method on the `daily_rentals` data set to find the ***sum*** of `num_rentals` for the `rental_date` by the `rental_hour`.


In [None]:
# Q4a Answer 


**Q4b. Heatmap**

Using the data in Q4a, create a heatmap that shows the rental date vs the rental hour and select a suitable colormap to represent the number of rentals. Don't forget to use `annot=True` so that you can check the values against the colour.

In [None]:
#Q4b Answer

