# House Prices

The aim of this project was to build an interactive web app for displaying house prices in England and Wales, using a heat-map overlay to visualise the differences by region.

##### Initial questions

- What granularity to show average price at? e.g. county, town, postcode, council?
- Could add feature for user to zoom in e.g. initially show at county level, then zoom into postcode level
- Mean or median average price? - allow user to choose, outliers will be interesting as well
- Might be interesting to see range of house prices by area too
    - Min, Max
    - Mean
    - 10%, 25%, 50%, 75%, 90% percentiles
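
As a rough sketch of how the statistics listed above could be computed per area with pandas. The `area` and `price` column names, the `price_summary` helper and the figures are all made up for illustration, not taken from the real dataset:

```python
import pandas as pd

# Toy transactions; area codes and prices are invented for illustration only.
sales = pd.DataFrame({
    "area": ["PO", "PO", "PO", "PO", "SW", "SW", "SW", "SW"],
    "price": [100_000, 150_000, 200_000, 250_000,
              300_000, 500_000, 700_000, 900_000],
})


def price_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return min, max, mean and selected percentiles of price per area."""
    grouped = df.groupby("area")["price"]
    stats = grouped.agg(["min", "max", "mean"])
    # quantile() with a list returns a Series indexed by (area, quantile);
    # unstack() turns the quantile level into columns.
    percentiles = grouped.quantile([0.10, 0.25, 0.50, 0.75, 0.90]).unstack()
    percentiles.columns = [f"p{int(round(q * 100))}" for q in percentiles.columns]
    return stats.join(percentiles)


area_summary = price_summary(sales)
print(area_summary)
```

The median is the `p50` column, so a user-facing "mean or median" toggle would only need to switch which column is mapped.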

A next step would be to see how house prices have changed over time. This could be a separate overlay on the map (user chooses which overlay to view), effectively a different page on the website. The challenge will be how to visualise this change over time by area.

- How far back in time will data go?
- Maybe take snapshots of average price by area every 5 or 10 years
- One simple view would be to allow the user to select a date range, e.g. 1960-2020, then visually show the average price difference between the 2 dates
    - Allowing this to be controlled with a slider would make it easier to find trends

##### How to get the data?
Write description here of how I got the data and how it's created/published.

##### Preparing the data

Loading the CSV file into a MySQL database.

~~~~sql
DROP DATABASE IF EXISTS `houseprices`;
CREATE DATABASE `houseprices`;
USE `houseprices`;

CREATE TABLE `pricepaid` (
`unique_id` VARCHAR(100),
`price_paid` DECIMAL,
`deed_date` DATE,
`postcode` VARCHAR(8),
`property_type` VARCHAR(1),
`new_build` VARCHAR(1),
`estate_type` VARCHAR(1),
`saon` VARCHAR(50),
`paon` VARCHAR(50),
`street` VARCHAR(50),
`locality` VARCHAR(50),
`town` VARCHAR(50),
`district` VARCHAR(50),
`county` VARCHAR(50),
`transaction_category` VARCHAR(1),
`linked_data_uri` VARCHAR(1),
PRIMARY KEY (unique_id)
);

SET GLOBAL local_infile=ON;
SET autocommit=0;
SET unique_checks=1;
SET foreign_key_checks=0;

LOAD DATA LOW_PRIORITY 
LOCAL INFILE 'Path/To/Project/pricepaid.csv'
INTO TABLE pricepaid 
CHARACTER SET armscii8
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n' 
(`unique_id`,`price_paid`,`deed_date`,`postcode`,`property_type`,`new_build`,`estate_type`,`saon`,`paon`,`street`,`locality`,`town`,`district`,`county`,`transaction_category`,`linked_data_uri`);

-- autocommit is switched off above, so commit the load explicitly
COMMIT;
~~~~

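Creating an index

~~~~sql
-- Example only: which column to index is an assumption;
-- index the columns your queries filter or group on
CREATE INDEX idx_postcode
ON pricepaid (postcode);
~~~~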

I had originally planned to use the full dataset in my Shiny app, but the full table is ~5GB in size with ~29m rows. I got around this by sampling the dataset. In a simple random sample, the number of rows drawn from each area would be proportional to the number of transactions recorded in that area, so areas with few sales would be under-represented. To ensure that each area had an equal number of samples, I used stratified sampling instead.

I needed to choose a level of granularity to which to stratify the data. A UK postcode is made up of 2 parts, the outward code (first part) and inward code (second part), separated by a space. The outward code consists of the postcode area (either 1 or 2 letters) followed by the postcode district (usually 1 or 2 digits). For example, in the postcode PO16 7GZ, PO16 is the outward code (or outcode), PO is the area and 16 is the district.
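
The postcode structure described above can be sketched in Python as follows. The `split_postcode` helper and `PostcodeParts` type are hypothetical names introduced for illustration, and the pattern is a simplification that assumes a well-formed, space-separated postcode:

```python
import re
from typing import NamedTuple


class PostcodeParts(NamedTuple):
    outcode: str   # outward code, e.g. PO16
    incode: str    # inward code, e.g. 7GZ
    area: str      # postcode area, e.g. PO
    district: str  # postcode district, e.g. 16


def split_postcode(postcode: str) -> PostcodeParts:
    """Split a space-separated UK postcode into its component parts."""
    outcode, incode = postcode.strip().upper().split()
    # Area: 1 or 2 leading letters. District: a digit, optionally followed
    # by a second digit or a letter (as in EC1A).
    match = re.fullmatch(r"([A-Z]{1,2})([0-9][0-9A-Z]?)", outcode)
    if match is None:
        raise ValueError(f"Unrecognised outward code: {outcode!r}")
    return PostcodeParts(outcode, incode, match.group(1), match.group(2))


print(split_postcode("PO16 7GZ"))
```

Anchoring the area on the leading letters, rather than stripping digits, keeps districts with a trailing letter (such as EC1A) from leaking into the area.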

OutCode and PostcodeArea were added as generated columns to the pricepaid table, along with a Year column and a YearBin column.

~~~~sql
-- MySQL requires the generated-column expression to be wrapped in parentheses
ALTER TABLE pricepaid ADD COLUMN OutCode VARCHAR(4) GENERATED ALWAYS AS (substr(postcode, 1, locate(' ', postcode) - 1)) STORED;
-- strip everything from the first digit onwards, so that e.g. EC1A gives EC rather than ECA
ALTER TABLE pricepaid ADD COLUMN PostcodeArea VARCHAR(3) GENERATED ALWAYS AS (regexp_replace(OutCode, '[0-9].*', '')) STORED;
ALTER TABLE pricepaid ADD COLUMN Year INT GENERATED ALWAYS AS (year(deed_date)) STORED;
-- VARCHAR(11) is wide enough for the longest label, '1995 - 2004'
ALTER TABLE pricepaid ADD COLUMN YearBin VARCHAR(11) GENERATED ALWAYS AS (case when (Year < 2005) then '1995 - 2004' when (Year < 2015) then '2005 - 2014' else '2015 +' end) STORED;
~~~~

Taking a stratified sample of 100 observations for each distinct OutCode and YearBin.

~~~~sql
SELECT t.* FROM
(SELECT pp.*, ROW_NUMBER() OVER (PARTITION BY OutCode, YearBin ORDER BY RAND()) AS SeqNum
FROM pricepaid pp) t
WHERE t.SeqNum <= 100
INTO OUTFILE 'C:/Users/danjr/Documents/Projects/UK House Prices Visualisation/pricepaidsample.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
~~~~
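
For comparison, the same stratified sampling idea can be sketched in pandas on toy data. This is an illustration under assumptions rather than the method used for the project: the `year_bin` and `stratified_sample` helpers and the toy rows are invented, and only the column names and bin labels mirror the SQL above.

```python
import pandas as pd


def year_bin(year: int) -> str:
    """Mirror the YearBin CASE expression from the SQL above."""
    if year < 2005:
        return "1995 - 2004"
    if year < 2015:
        return "2005 - 2014"
    return "2015 +"


def stratified_sample(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Keep at most n randomly chosen rows per (OutCode, YearBin) group."""
    # Shuffle all rows (the analogue of ORDER BY RAND()) ...
    shuffled = df.sample(frac=1, random_state=seed)
    # ... then keep the first n rows of each group (the analogue of SeqNum <= n).
    return shuffled.groupby(["OutCode", "YearBin"]).head(n)


# Toy data: 50 sales in PO16 spread over 1996-2020, and a single sale in SW1A.
rows = [("PO16", 1996 + i % 25, 100_000 + 1_000 * i) for i in range(50)]
rows.append(("SW1A", 2016, 900_000))
toy = pd.DataFrame(rows, columns=["OutCode", "Year", "price_paid"])
toy["YearBin"] = toy["Year"].map(year_bin)

sample = stratified_sample(toy, n=3)
print(sample.groupby(["OutCode", "YearBin"]).size())
```

Shuffling and then taking the head of each group tolerates groups with fewer than `n` rows, which are simply kept whole, just as the `SeqNum <= 100` filter does.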

I sampled the data by area to reduce its size so that it could be loaded into memory. However, summarising the sample by area inside the app was too slow, so instead I loaded pre-summarised data into the app.

~~~~python
import os
from pathlib import Path
import json
import pandas as pd
from ipyleaflet import Map, Choropleth
from branca.colormap import linear

### Load data
appDir = Path(os.path.abspath(''))
# used https://mapshaper.org/ to simplify the GeoJSON file, reducing its size using the Visvalingam / weighted area method with a 1% zoom level
with open(appDir / 'OutcodeCoordinates_compressed.json', 'r') as f:
    outcodeCoordinates = json.load(f)
summary = pd.read_csv(appDir / 'summary.csv')
yearBins = list(summary['YearBin'].unique())

# placeholder values for the user inputs
input_yearBin = tuple()
input_switch = False
input_statistic = 'mean'

# filter the summary dataset and build a lookup dictionary with a key for each Outcode
# select all time periods if none are selected
# (named selectedYearBins to avoid shadowing the built-in filter())
if input_yearBin == tuple():
    selectedYearBins = yearBins
else:
    selectedYearBins = list(input_yearBin)

# logic for comparing summary statistics between time periods
if input_switch:
    minYearBin = selectedYearBins[0]
    maxYearBin = selectedYearBins[-1]
    dfMin = summary[summary['YearBin'] == minYearBin][['Outcode', input_statistic]]
    dfMax = summary[summary['YearBin'] == maxYearBin][['Outcode', input_statistic]]
    # the merge suffixes the statistic column with _x (earliest) and _y (latest)
    df = pd.merge(dfMin, dfMax, how="inner", on="Outcode")
    df['diff'] = df[input_statistic + '_y'] - df[input_statistic + '_x']
    df = df.set_index('Outcode')
    df['decile'] = pd.qcut(df['diff'], 10, labels=False)
    choroData = df['decile'].to_dict()
# if not comparing time periods then just show summary statistic
else:
    df = summary[summary['YearBin'].isin(selectedYearBins)].set_index('Outcode')
    df['decile'] = pd.qcut(df[input_statistic], 10, labels=False)
    choroData = df['decile'].to_dict()

# create a Map object and add a Choropleth layer to it
m = Map(center=(54.00366, -2.547855), zoom=5.5)

layer = Choropleth(
    geo_data=outcodeCoordinates,
    choro_data=choroData,
    key_on='id',
    colormap=linear.viridis,
    border_color='black',
    style={'fillOpacity': 0.8, 'dashArray': '5, 5'}
)

m.add(layer)

# display the map (last expression in a notebook cell)
m
~~~~
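
To make the time-period comparison easier to follow, here is a small self-contained sketch of the same merge, difference and quantile-binning steps on toy data. The `change_quantiles` helper and the figures are invented for illustration, and it uses quartiles rather than deciles only because the toy table has four outcodes:

```python
import pandas as pd

# Toy pre-summarised data: one row per Outcode and YearBin (values invented).
toy_summary = pd.DataFrame({
    "Outcode": ["AB1", "AB2", "AB3", "AB4"] * 2,
    "YearBin": ["1995 - 2004"] * 4 + ["2015 +"] * 4,
    "mean": [100, 120, 90, 200, 150, 300, 100, 260],
})


def change_quantiles(summary, first_bin, last_bin, statistic="mean", q=4):
    """Bin each Outcode by how much `statistic` changed between two year bins."""
    first = summary[summary["YearBin"] == first_bin][["Outcode", statistic]]
    last = summary[summary["YearBin"] == last_bin][["Outcode", statistic]]
    # Inner merge keeps only the outcodes present in both periods.
    merged = first.merge(last, on="Outcode", suffixes=("_first", "_last"))
    merged = merged.set_index("Outcode")
    diff = merged[f"{statistic}_last"] - merged[f"{statistic}_first"]
    # labels=False returns the integer bin number (0 = smallest change).
    return pd.qcut(diff, q, labels=False).to_dict()


bins = change_quantiles(toy_summary, "1995 - 2004", "2015 +")
print(bins)
```

The resulting dictionary has the same shape as `choroData`: one integer bin per outcode, ready to be passed as `choro_data`.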