# Data Exploration - Domestic Energy Ratings in England and Wales

The aim of this project is to look at the energy ratings (EPC) of domestic properties in England and Wales. I want to analyse this data and present it in an interactive format (with maps). 

If possible, I would like to perform some machine learning and perhaps create an output website where people can look up the ratings of their area and see the trends for future years.

The data is from the [UK Gov Live tables on Energy Performance of Buildings Certificates](https://www.gov.uk/government/statistical-data-sets/live-tables-on-energy-performance-of-buildings-certificates), table D3.



### What questions can I answer with this dataset?
- Which region is performing best wrt EPC ratings?
- Which region is the worst performing wrt EPC ratings?

- What is the overall trend for EPC ratings over time (increasing, decreasing, stagnant)?


Steps:

In [None]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
csv = 'dataset_2.csv'
#read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv(csv)

In [None]:
df.head(3)

In [None]:
#making column names simpler to use. Create list, fix, replace columns with new list.

cols = df.columns.to_list()

#cleaning column names

new_cols = []

for item in cols:
    item = item.lower()
    item = re.sub(' ','_', item)
    item = re.sub(r'\(','', item)
    item = re.sub(r'\)','', item)
    item = re.sub(r'/','', item)
    item = re.sub(r'£','', item)
    item = re.sub('__','_', item)
    new_cols.append(item)

df.columns = new_cols

df.info()
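
The column clean-up loop above can also be packaged as a small reusable helper. This is only a sketch: `clean_column_name` is a name introduced here for illustration, and the two example column names are hypothetical rather than taken from the dataset. Unlike the loop, it collapses any run of underscores, not just a single double underscore.

```python
import re

def clean_column_name(name):
    """Lower-case a column name and strip or replace awkward characters."""
    name = name.lower().strip()
    name = re.sub(r'[()/£]', '', name)   # drop brackets, slashes and pound signs
    name = re.sub(r'\s+', '_', name)     # each run of whitespace becomes one underscore
    name = re.sub(r'_+', '_', name)      # collapse repeated underscores
    return name

# Hypothetical column names, for illustration only
print(clean_column_name('Local Authority Code'))
print(clean_column_name('Average Cost (£/year)'))
```

It could then be applied with `df.columns = [clean_column_name(c) for c in df.columns]`.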

In [None]:
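#converting the number columns to floats
#assigning by column name (rather than via iloc) so the new float dtype is kept
num_cols = df.columns[3:]
df[num_cols] = df[num_cols].replace(',','', regex=True).astype(float)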

In [None]:
#converting the quarter format to datetime
df['quarter'] = df['quarter'].replace('/', '-Q', regex=True)
df['quarter'] = pd.to_datetime(df['quarter'])

In [None]:
#np.nan rather than np.NaN, as the latter alias was removed in NumPy 2.0
df = df.replace(r'Unknown', np.nan, regex=True)

In [None]:
df = df.dropna()
df.head(3)

In [None]:
#How many local authorities are in the dataset?

total_las = len(df['local_authority'].unique())

print('Total Local Authorities = {}'.format(total_las))

## Finding the best and worst performing regions

Method:

- A lodgement is taken of a new EPC rating when a property is sold or newly let. New lodgements are recorded each quarter.
- The quarter doesn't really help me, so I'm going to sum by year for each region
- A simple count of lodgements won't tell us much, as some regions will have more property changes than others. So, convert to percentages.
- Find the regions with the highest percentages of A-graded EPC properties (and for each subsequent grade)
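
The steps above can be sketched on a toy table before running them on the real data. The two local authorities and their counts below are made up for illustration. The point is that the area with fewer A-rated lodgements in absolute terms can still rank first once counts are converted to percentages.

```python
import pandas as pd

# Toy quarterly data: two hypothetical local authorities, two quarters each
toy = pd.DataFrame({
    'local_authority': ['Area X', 'Area X', 'Area Y', 'Area Y'],
    'year': [2022, 2022, 2022, 2022],
    'number_of_lodgements': [100, 100, 1000, 1000],
    'a': [10, 10, 20, 20],
})

# Step 1: sum quarterly counts into yearly counts per local authority
yearly = toy.groupby(['local_authority', 'year'], as_index=False)[
    ['number_of_lodgements', 'a']
].sum()

# Step 2: convert the raw count to a percentage of all lodgements
yearly['percentage_a'] = (100 * yearly['a'] / yearly['number_of_lodgements']).round(2)

# Step 3: rank by percentage rather than by raw count
top = yearly.nlargest(1, 'percentage_a')
print(top[['local_authority', 'a', 'percentage_a']])
```

Here Area Y has more A-rated lodgements (40 against 20), but Area X has the higher share (10% against 2%).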

In [None]:
df = df.copy(deep=True)

In [None]:
#summing by year for each region

#getting the year
df['year'] = df['quarter'].dt.year
#new groupby dataframe to sum quarterly counts into yearly counts, by local authority
df_years = df.groupby(['local_authority_code','local_authority','year'])[['number_of_lodgements','a','b','c','d','e','f','g','not_recorded']].sum()

In [None]:
df_years.head(3)

In [None]:
df_years = df_years.reset_index()

In [None]:
#Adding a boolean column to see if a local authority is a London Borough
#Using ONS labeling standards, local authority code E09 = London Boroughs
#This column will help us understand the data better later on
df_years['is_greater_london'] = df_years['local_authority_code'].str.contains('E09')

In [None]:
#Creating a function that will automatically calculate the
#percentage of properties lodged under each grade for each local authority, every year

ratings = ['a','b','c','d','e','f','g']

def percentager(dataframe):
    for item in ratings:
        dataframe['percentage_'+item] = round(100*(dataframe[item]/dataframe['number_of_lodgements']),2)
    return dataframe

In [None]:
#percentager works on whole columns, so it can be called on the dataframe directly
df_years = percentager(df_years)

In [None]:
#Finding the top 10 performing regions - i.e. those with the most A rated EPC properties in 2022
most_a = df_years[df_years['year'] == 2022].nlargest(10, 'percentage_a')
most_a.head(5)

In [None]:
#color = ['lime', 'limegreen', 'yellowgreen', 'gold', 'darkorange', 'orangered','red']

In [None]:
ax1 = sns.barplot(data=most_a, x='local_authority', y='percentage_a', color='lightgreen')
plt.xlabel('Local Authority')
plt.ylabel('Percentage')
plt.xticks(rotation=90)
plt.title('Percent of newly lodged EPC A-rated properties\n Top 10 Local Authorities across England and Wales 2022')
#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('pct_a_rated.jpg')
plt.show()

In [None]:
#Bottom 10 performing regions - i.e. those with the most G rated properties in 2022
most_g = df_years[df_years['year'] == 2022].nlargest(10, ['percentage_g']).sort_values(by='percentage_g',
                                                                                      ascending=False)
most_g.head(3)

In [None]:
ax2 = sns.barplot(data=most_g, x='local_authority', y='percentage_g', color='red')
plt.xlabel('Local Authority')
plt.ylabel('Percentage')
plt.xticks(rotation=90)
plt.title('Percent of newly lodged EPC G-rated properties\n Worst performing Local Authorities across England and Wales 2022')
#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('pct_g_rated.jpg')
plt.show()

So far we can see that 9 of the 10 areas with the highest share of newly lodged A-rated EPC properties in 2022 were London Boroughs. In contrast, all 10 of the regions with the highest share of newly lodged G-rated EPC properties were outside London: 5 in Wales, 3 in Devon and Cornwall, and 2 in the north of England (Ryedale and Eden).

However, G-rated properties made up quite a small share of new lodgements in 2022, at most just under 6%. Perhaps we should look at G and F rated properties together for more insight.

In [None]:
#Bottom 10 performing regions - i.e. those with the most F AND G combined rated properties in 2022
df_years['f_and_g'] = df_years['percentage_f'] + df_years['percentage_g']

most_f_and_g= df_years[df_years['year'] == 2022].nlargest(10,['f_and_g']).sort_values(by='f_and_g', ascending=False)
most_f_and_g.head(3)

In [None]:
ax3 = sns.barplot(data=most_f_and_g, x='local_authority', y='f_and_g', color='coral')
plt.xlabel('Local Authority')
plt.ylabel('Percentage')
plt.xticks(rotation=90)
plt.title('Percent of newly lodged EPC F and G-rated properties\n Worst Performing Local Authorities across England and Wales 2022')
#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('pct_f_g_rated.jpg')
plt.show()

In [None]:
mean_f_and_g = round(most_f_and_g['f_and_g'].mean(),2)
print('Average % of lodgements in the worst performing areas at rating F or G = {}%'.format(mean_f_and_g))

In [None]:
#The mean without Scilly in the top 10
#Scilly is the first row, as most_f_and_g is sorted by f_and_g in descending order
mean_most_f_and_g_no_scilly = round(most_f_and_g['f_and_g'].iloc[1:].mean(),2)
print('Average % of lodgements in the worst performing areas at rating F or G, excluding Isles of Scilly = {}%'.format(mean_most_f_and_g_no_scilly))

In [None]:
df_2022 = df_years[df_years['year'] == 2022]
df_2022['f_and_g'].describe()

In [None]:
ax4 = plt.hist(df_2022['f_and_g'], bins=50)
plt.title('Combined percent of F and G lodged properties in 2022')
#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('f_g_hist.jpg')
plt.show()

The names on our list of local authorities with the most F and G lodged properties combined in 2022 are not too different from the names listed in the G-only chart.

However, we can see that in these worst-performing regions, F and G rated properties accounted for an average of 20.02% of new lodgements in 2022. The average across all local authorities in 2022 is 5.3%, and the standard deviation is 3.9%, so these poorly performing regions sit at the extreme end of the scale (shown in the histogram above).
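
To make "the extreme end of the scale" concrete, the gap can be expressed in standard deviations. This is a small worked calculation that uses only the rounded figures quoted in the text, so the result is approximate.

```python
# Figures quoted in the text (percent of 2022 lodgements rated F or G)
overall_mean = 5.3      # mean across all local authorities
overall_std = 3.9       # standard deviation across all local authorities
worst_10_mean = 20.02   # mean of the 10 worst-performing local authorities

# Distance of the worst-10 mean from the overall mean, in standard deviations
z = (worst_10_mean - overall_mean) / overall_std
print(f'Worst-10 mean is about {z:.1f} standard deviations above the overall mean')
```

On these figures the worst ten sit nearly four standard deviations above the overall mean.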

Five of the top 10 in this list are again in Wales, two in Norfolk, two in the Cornwall area (the Isles of Scilly being off the coast of Cornwall) and one in the north of England (Eden). None are in London.

The Isles of Scilly is particularly poor performing and is bringing up the average slightly: without it, the mean of the remaining 9 local authorities in that list is 17.61%. However, there is also quite a small number of lodgements in that area annually, the smallest of all regions, as shown below.

In [None]:
#Top 3 Local Authorities with the fewest lodgements in 2022
df_years[df_years['year'] == 2022].nsmallest(3,'number_of_lodgements')

As the mean shown above is around 5%, what percentage of regions with > 5% F and G rated properties are in London vs outside London?

In [None]:
#percentage to calculate
pct = 5

#filter for London data
london_greater_than_x_pct_f_g = df_years[(df_years['year'] == 2022) 
                                   & (df_years['f_and_g'] >= pct)
                                   & (df_years['is_greater_london'] == True)].sort_values(by='f_and_g')

#filter for non-London data
non_london_greater_than_x_pct_f_g = df_years[(df_years['year'] == 2022) 
                                   & (df_years['f_and_g'] >= pct)
                                   & (df_years['is_greater_london'] == False)].sort_values(by='f_and_g')


total_non_london = non_london_greater_than_x_pct_f_g['is_greater_london'].value_counts().sum()
total_london = london_greater_than_x_pct_f_g['is_greater_london'].value_counts().sum()

print('Number of non-London regions with >{}% F and G EPC rated properties registered in 2022 = {}'
      .format(pct, total_non_london))


print('Number of London regions with >{}% F and G EPC rated properties registered in 2022 = {}'
      .format(pct, total_london))

proportion_non_london = round(100*total_non_london/len(df_years['local_authority'].unique()),2)

print('\nNon-London regions with >{}% F and G EPC rated properties registered in 2022, as a proportion of all regions = {}%'
      .format(pct, proportion_non_london))

That's quite a staggering difference between the capital and the rest of England and Wales. Just 1 London borough had >5% of its properties registered as F or G rated in 2022, compared to 127 regions outside the capital (over 1/3 of all regions).
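
The same comparison can be written more compactly with a single groupby. This is a sketch on made-up data: the local authority names and percentages are hypothetical, and the share here is calculated within each group (London, non-London) rather than against all regions.

```python
import pandas as pd

# Toy 2022 data: hypothetical local authorities and combined F and G percentages
toy = pd.DataFrame({
    'local_authority': ['Borough 1', 'Borough 2', 'Area 1', 'Area 2', 'Area 3'],
    'is_greater_london': [True, True, False, False, False],
    'f_and_g': [1.2, 6.0, 4.9, 7.5, 12.0],
})

pct = 5  # threshold in percent

# Label each row, flag rows above the threshold, then count per group
toy['group'] = toy['is_greater_london'].map({True: 'London', False: 'Non-London'})
toy['above'] = toy['f_and_g'] > pct

summary = toy.groupby('group')['above'].agg(count_above='sum', total='count')
summary['share_above_pct'] = (100 * summary['count_above'] / summary['total']).round(2)
print(summary)
```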

Shall we see how this trend has changed over the years?

In [None]:
#Creating a filter for London Boroughs versus non-London Local Authorities

london_filter = df['local_authority_code'].str.contains('E09')

#Finding the total london and non-london local authorities - helpful for later calcs
total_london_las = len(df[london_filter].local_authority.unique())
total_non_london_las = len(df[~london_filter].local_authority.unique())

In [None]:
#Creating filtered df of London Boroughs with >5% F and G properties, all years
london_f_g_x_pct = df_years[(df_years['f_and_g'] >= pct)
                                   & (df_years['is_greater_london'] == True)].sort_values(by='f_and_g')

#Creating filtered df of non-London LAs with >5% F and G properties, all years
non_london_f_g_x_pct = df_years[(df_years['f_and_g'] >= pct)
                                   & (df_years['is_greater_london'] == False)].sort_values(by='f_and_g')

#Using groupby on both the above dfs to get a count of the local authorities by year
grouped_london = london_f_g_x_pct.groupby('year', as_index=False)['local_authority'].value_counts()
grouped_non_london = non_london_f_g_x_pct.groupby('year', as_index=False)['local_authority'].value_counts()

#Summing the above counts
london_f_and_g = pd.DataFrame(grouped_london).groupby('year', as_index=False)['count'].sum()
non_london_f_and_g = pd.DataFrame(grouped_non_london).groupby('year', as_index=False)['count'].sum()

#Calculating the percentage of F and G properties each year, out of the total Local Authorities for that area
# (i.e. London, non-London)
london_f_and_g['pct_of_total_las'] = round(100*london_f_and_g['count']/total_london_las,2)
non_london_f_and_g['pct_of_total_las'] = round(100*non_london_f_and_g['count']/total_non_london_las,2)

In [None]:
non_london_f_and_g

In [None]:
london_f_and_g

In [None]:
ax5 = sns.barplot(data=london_f_and_g, x='year', y='pct_of_total_las'
                  , color='purple', alpha=.5, label='London')
ax6 = sns.barplot(data=non_london_f_and_g, x='year', y='pct_of_total_las'
                  , color='pink', alpha=.5, label='Non-London')
plt.xlabel('year')
plt.xticks(rotation=90)
plt.ylabel('percentage')
plt.title('Percentage of London vs non-London Local Authorities\nwith >5% EPC F and G rated properties by year')

handles, labels = ax5.get_legend_handles_labels()
ax5.legend(handles, labels)

#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('f_g_5pct.jpg')
plt.show()

It looks as though both London and the wider non-London local authorities of England and Wales have seen a drop in the percentage of properties with F and G ratings over time. However, it looks like the trend has been more pronounced in London versus outside of London, particularly since 2019. Let's check that.

In [None]:
#Find regression trend of both groups

from sklearn.linear_model import LinearRegression

# Training data
X_london = london_f_and_g.loc[:, ['year']]  # features
y_london = london_f_and_g.loc[:, 'pct_of_total_las']  # target

# Train the model
model = LinearRegression()
model.fit(X_london, y_london)

y_pred_london = pd.Series(model.predict(X_london), index=X_london.index)

In [None]:
# Training data
X_non_london = non_london_f_and_g.loc[:, ['year']]  # features
y_non_london = non_london_f_and_g.loc[:, 'pct_of_total_las']  # target

# Train the model
model = LinearRegression()
model.fit(X_non_london, y_non_london)

y_pred_non_london = pd.Series(model.predict(X_non_london), index=X_non_london.index)

In [None]:
#Replotting with the regression lines

ax7 = y_pred_london.plot(label='London')
ax8 = y_pred_non_london.plot(label='Non-London')


ax5 = sns.barplot(data=london_f_and_g, x='year', y='pct_of_total_las'
                  , color='purple', alpha=.5, label='London')
ax6 = sns.barplot(data=non_london_f_and_g, x='year', y='pct_of_total_las'
                  , color='pink', alpha=.5, label='Non-London')
plt.xlabel('year')
plt.xticks(rotation=90)
plt.ylabel('percentage')
plt.title('Percentage of London vs non-London Local Authorities\nwith >5% EPC F and G rated properties by year')

handles, labels = ax5.get_legend_handles_labels()
ax5.legend(handles, labels)

#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('f_g_5ct_with_trends.jpg')
plt.show()

From the regression lines above we can see that London Boroughs have had a steeper, and therefore faster, decline in the share of local authorities where more than 5% of yearly registrations are rated F or G, compared to Local Authorities outside of London.
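
To put numbers on "steeper", the slopes of the two fitted lines can be compared directly. This is a minimal sketch that uses `np.polyfit` in place of scikit-learn's `LinearRegression` (both fit a least-squares line). The yearly values below are made up for illustration and are not the notebook's results.

```python
import numpy as np

# Hypothetical yearly percentages, for illustration only
years = np.array([2018, 2019, 2020, 2021, 2022])
london = np.array([40.0, 30.0, 20.0, 10.0, 0.0])
non_london = np.array([60.0, 55.0, 50.0, 45.0, 40.0])

# np.polyfit with degree 1 returns the slope and intercept of the least-squares line
slope_london, _ = np.polyfit(years, london, 1)
slope_non_london, _ = np.polyfit(years, non_london, 1)

print(f'London: {slope_london:.2f} percentage points per year')
print(f'Non-London: {slope_non_london:.2f} percentage points per year')
```

With the fitted scikit-learn models, the same figure is available as `model.coef_[0]`, provided each model is kept in its own variable rather than overwriting `model`.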

Both regions show a decline in the overall percent of F and G rated properties being registered, indicating that property energy efficiency over England and Wales is increasing. 

Note that this data just represents new EPC registrations yearly. EPCs are valid for 10 years, and are usually renewed when a property is sold or a new rental agreement is required, after the previous EPC has expired. So by no means does the data cover the whole property market, but it does give us a slice of the picture where property movement is taking place.

In the London data we see a dramatic drop in the percentage of F and G rated properties between 2018 and 2019. It is possible that this reflects the **Minimum Energy Efficiency Standards (MEES)**. Introduced in England and Wales in 2018, MEES requires properties with new tenancy agreements (or renewals for existing tenants) to have an EPC rating of E or above.

With London being such a popular place and renting common in the capital, it is reasonable that this dramatic drop in the years after 2018 reflects the scale of these reforms. Reference: [Gov.uk Guidance](https://www.gov.uk/guidance/domestic-private-rented-property-minimum-energy-efficiency-standard-landlord-guidance)

One way we can confirm this is by seeing if E-rated or above properties have increased over time instead.

In [None]:
#Want to create a stacked bar chart of London Borough EPC ratings over the years

#Creating a filter for just years and EPC rating percentages in London Boroughs
needed_cols = ['year','percentage_a', 'percentage_b', 'percentage_c',
           'percentage_d', 'percentage_e', 'percentage_f', 'percentage_g']

pct_cols = ['percentage_a', 'percentage_b', 'percentage_c',
           'percentage_d', 'percentage_e', 'percentage_f', 'percentage_g']

london_pcts = df_years[(df_years['is_greater_london'] == True)][needed_cols]

#As we have data for many Boroughs, will take the mean
#percentage of properties in each rating per year, via a groupby
london_pcts_mean = london_pcts.groupby('year')[pct_cols].mean()

In [None]:
london_pcts_mean.head(5)

In [None]:
#I want to create a stacked barchart with the data above to visualise how the proportions change over time
label = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
london_pcts_mean.plot(kind='bar', stacked=True, color=['green','limegreen', 'greenyellow','gold','orange',
                                                       'tomato', 'firebrick'])
plt.legend(bbox_to_anchor=(1.2,1), loc='upper right', labels=label, title='EPC Rating')
plt.title('Mean London EPC rated property registrations annually')
plt.ylabel('percent')

#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('stacked_london.jpg')
plt.show()

We can see from the above chart that London overall has seen an increase in the proportion of higher-grade properties being registered yearly, compared to properties with rating E or lower. It appears that properties with a C or D rating have become the most commonly registered groups since 2018, again coinciding with the introduction of MEES.
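
One caveat on the method: the chart takes an unweighted mean of each borough's percentages, so a borough with few lodgements counts as much as one with many. A lodgement-weighted (pooled) figure can differ. The sketch below shows the difference on two hypothetical boroughs with made-up counts.

```python
import pandas as pd

# Two hypothetical boroughs with very different lodgement counts
toy = pd.DataFrame({
    'local_authority': ['Small borough', 'Large borough'],
    'year': [2022, 2022],
    'number_of_lodgements': [100, 900],
    'c': [50, 180],
})
toy['percentage_c'] = 100 * toy['c'] / toy['number_of_lodgements']

# Unweighted: mean of the borough percentages (the approach used for the chart)
unweighted = toy.groupby('year')['percentage_c'].mean()

# Pooled: sum the counts first, then take the percentage
totals = toy.groupby('year')[['c', 'number_of_lodgements']].sum()
pooled = 100 * totals['c'] / totals['number_of_lodgements']

print('Unweighted mean:', unweighted.loc[2022])
print('Pooled percentage:', pooled.loc[2022])
```

Which one is appropriate depends on the question: the unweighted mean describes a typical borough, while the pooled figure describes a typical lodgement.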

I'll create the same chart for LAs outside of London and see how it compares

In [None]:
#Filter for non-London Local Authorities
non_london_pcts = df_years[(df_years['is_greater_london'] == False)][needed_cols]

#As we have data for many LAs, will take the mean
#percentage of properties in each rating per year, via a groupby
non_london_pcts_mean = non_london_pcts.groupby('year')[pct_cols].mean()

non_london_pcts_mean.plot(kind='bar', stacked=True, color=['green','limegreen', 'greenyellow','gold','orange',
                                                       'tomato', 'firebrick'])
plt.legend(bbox_to_anchor=(1.2,1), loc='upper right', labels=label, title='EPC Rating')
plt.title('Non-London mean EPC rated property registrations')
plt.ylabel('percent')

#saving before plt.show(), otherwise the saved image can come out blank
plt.savefig('stacked_non_london.jpg')
plt.show()

Outside of London we see some similar trends to the above. For example, there is an increase in the proportion of properties with rating D or higher since around 2017. This is perhaps indicating that landlords outside the capital prepared earlier for the MEES regulations (though this is just speculative). 

Overall, E or lower rated properties are becoming a smaller proportion of the overall property stock outside of London, being displaced by D and C rated properties.

One of the more notable differences between the London and non-London stacked charts is that A-rated properties are much more common in the capital. One explanation might be that upgrading a home's energy efficiency does not always come cheaply. The relative difference in wealth in London versus the rest of England and Wales may reflect the ability of landlords and homeowners in that region to invest in their homes to a greater degree than in regional areas. 

Another explanation could be that the very competitive nature of the London property market means that customers (in particular, property buyers) expect better standards from their properties, and have the ability to be fussier about energy ratings, leading to overall standards increasing.


# Plotting the data on a map

I now want to plot our EPC rating percentages for each local authority on a map, to get a clearer idea of the distribution of ratings across England and Wales geographically.

I'll look at making a time lapse map of E, F and G rated properties over time by local authority. This will give us further insight into how the MEES regulations impacted properties across the two countries.

I'm going to use GeoPandas to do this, with Local Authority boundary data from the Office for National Statistics. The data can be found [here](https://geoportal.statistics.gov.uk/datasets/ons::local-authority-districts-december-2021-uk-bfc-1/about) (I used the geojson file).

In [None]:
#Create column for E, F and G

df_years['efg'] = df_years['percentage_e'] + df_years['percentage_f'] + df_years['percentage_g']

In [None]:
#Let's create a dataset just for this work
#dropping columns by position (3 to 19), passing the column labels rather than a dataframe slice
drop_cols = df_years.columns[3:20]
efg = df_years.drop(columns=drop_cols)

In [None]:
efg.head(3)

In [None]:
import geopandas as gpd

In [None]:
#Reading our geojson file

gj = gpd.read_file('condensed_uk_local_authorities_2021.json')

#Got a condensed version of the official one from: 
#https://github.com/thomasvalentine/Choropleth/blob/main/Local_Authority_Districts_(December_2021)_GB_BFC.json
#Thank you!

In [None]:
#Checking the shape of the plots
gj.plot(figsize=(3,3))

We can see from the map plot above that the GeoJSON file includes all local authorities in the UK. Our EPC dataset is just for Wales and England, so we can drop the entries for Scotland and Northern Ireland.

In [None]:
#Removing Northern Irish and Scottish local authorities
ni_scot_filters = (gj['LAD21CD'].str.startswith('N')) | (gj['LAD21CD'].str.startswith('S'))
gj_ew_only = gj[~ni_scot_filters]
gj_ew_only.shape

In [None]:
#Check map looks OK - should show just English and Welsh local authorities now
gj_ew_only.boundary.plot()

In [None]:
gj.head(3)

In [None]:
import plotly.express as px

from urllib.request import urlopen
import json


with urlopen('https://raw.githubusercontent.com/thomasvalentine/Choropleth/main/Local_Authority_Districts_(December_2021)_GB_BFC.json') as response:
    local_authorities_json = json.load(response)

lat = 53.16972805776037
lon = -2.1522074544910885


fig = px.choropleth_mapbox(df_years,
                           geojson=local_authorities_json,
                           locations='local_authority',
                           color='efg',
                           featureidkey="properties.LAD21NM",
                           color_continuous_scale=px.colors.sequential.OrRd,
                           mapbox_style="carto-positron",
                           center={"lat": lat, "lon": lon},
                           zoom=5.5,
                           range_color=[0,70],
                           animation_frame='year',
                           labels={'efg':'Percent'},
                           title='Percent of EPC E, F, and G properties registered annually'
                          )

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
fig.write_html("EFG.html") 
fig.show()


It looks like I am missing some data for some years/regions. A hover over the map shows the relevant counties are Buckinghamshire, West Northamptonshire and North Northamptonshire.

Was this data in the original dataset? A quick check below:

In [None]:
#Our originally imported and cleaned dataframe was called df

missing_data_las = df[(df['local_authority'] == 'Buckinghamshire') | 
         (df['local_authority'] == 'North Northamptonshire') |
         (df['local_authority'] == 'West Northamptonshire')]
missing_data_las['year'].unique()

Looks like a 'no'! Those counties only have data since 2020 in the original dataset. That's OK, we can live without it.
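
The same check can be run for every local authority at once, rather than only the three spotted on the map. This is a sketch on made-up data: the names and years are hypothetical. It lists each local authority whose records cover fewer years than the dataset as a whole.

```python
import pandas as pd

# Toy data: 'New Unitary' stands in for an authority that only has recent data
toy = pd.DataFrame({
    'local_authority': ['Area X', 'Area X', 'Area X', 'New Unitary', 'New Unitary'],
    'year': [2019, 2020, 2021, 2020, 2021],
})

# First year, last year and number of distinct years for each local authority
coverage = toy.groupby('local_authority')['year'].agg(
    first_year='min', last_year='max', n_years='nunique'
)

# Keep only the local authorities that are missing at least one year
all_years = toy['year'].nunique()
incomplete = coverage[coverage['n_years'] < all_years]
print(incomplete)
```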

In [None]:
df_years['abc'] = df_years['percentage_a'] + df_years['percentage_b'] + df_years['percentage_c']

In [None]:
import plotly.express as px

from urllib.request import urlopen
import json


with urlopen('https://raw.githubusercontent.com/thomasvalentine/Choropleth/main/Local_Authority_Districts_(December_2021)_GB_BFC.json') as response:
    local_authorities_json = json.load(response)

lat = 53.16972805776037
lon = -2.1522074544910885


fig = px.choropleth_mapbox(df_years,
                           geojson=local_authorities_json,
                           locations='local_authority',
                           color='abc',
                           featureidkey="properties.LAD21NM",
                           color_continuous_scale="Viridis",
                           mapbox_style="carto-positron",
                           center={"lat": lat, "lon": lon},
                           zoom=5.5,
                           range_color=[0,70],
                           animation_frame='year',
                           labels={'abc':'Percent'},
                           title='Percent of EPC A, B, or C properties registered annually'
                          )

fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
fig.write_html("ABC.html")                           
fig.show()
