# In-class exercise: average house prices vs average income

In this guided practice, we'll merge and explore two datasets from the [London Datastore](https://data.london.gov.uk/):
* [Average house prices by borough](https://data.london.gov.uk/dataset/average-house-prices-borough) (Land Registry)
* [Average income of tax payers by borough](https://data.london.gov.uk/dataset/average-income-tax-payers-borough) (HMRC)

Let's start by loading some libraries.

In [1]:
import numpy as np
import pandas as pd

Next, we define the location of the two datasets.

In [2]:
AHP_URL = 'https://files.datapress.com/london/dataset/average-house-prices-borough/2016-08-01T11:27:10/average-house-prices-borough.xls'
INCOMES_URL = 'https://files.datapress.com/london/dataset/average-income-tax-payers-borough/2016-04-05T08:55:06/income-of-tax-payers.csv'

## Over to you!

Read in the *Median Annual* sheet from the Average House Prices by Borough dataset at `AHP_URL` into a DataFrame called `ahp`.

In [3]:
ahp = pd.read_excel(AHP_URL, sheet_name='Median Annual')  # `sheetname` was renamed to `sheet_name` in newer pandas

Filter the DataFrame so that only boroughs are included (hint: check the structure of `ahp.Code`).

In [4]:
ahp = ahp[ahp.Code.str.startswith('E09', na=False)]
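
To see what the filter does, here is a minimal sketch on made-up data. The codes and area names below are illustrative assumptions, not rows taken from the spreadsheet. The point is that `na=False` turns missing codes into `False`, so the mask stays purely boolean; without it, a missing `Code` yields a missing value in the mask, which typically cannot be used for indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two borough-style codes, one non-borough code, one missing code
toy = pd.DataFrame({
    'Code': ['E09000001', 'E09000002', 'E12000007', np.nan],
    'Area': ['City of London', 'Barking and Dagenham', 'London', 'Footnote row'],
})

# na=False treats rows with a missing code as non-matches
mask = toy.Code.str.startswith('E09', na=False)
toy_boroughs = toy[mask]
print(toy_boroughs)
```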

Set `Code` as the index, so that it is no longer one of the DataFrame's columns.

In [5]:
ahp.set_index('Code', drop=True, inplace=True)

We will now convert (`melt`) the dataset from 'wide' to 'long' format, and convert `Year` to integer.

In [6]:
ahp = pd.melt(ahp, id_vars='Area', var_name='Year', value_name='Price')
ahp['Year'] = ahp.Year.astype('int')
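
If the wide-to-long reshaping is unfamiliar, the sketch below applies the same two steps to a tiny, made-up table. The boroughs, years and prices are hypothetical, and the year columns are assumed to be strings here; in the real sheet they may already be numbers.

```python
import pandas as pd

# Hypothetical wide table: one row per borough, one column per year
wide = pd.DataFrame({
    'Area': ['Camden', 'Hackney'],
    '2014': [100, 80],
    '2015': [110, 90],
})

# Each (borough, year) pair becomes its own row
long = pd.melt(wide, id_vars='Area', var_name='Year', value_name='Price')
long['Year'] = long.Year.astype('int')
print(long)
```

Two boroughs and two year columns give four rows, each holding one `Area`, one `Year` and one `Price`.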

Calculate mean house prices by year.

In [7]:
ahp.groupby('Year').Price.mean()

Year
1996     83983.303030
1997     94842.954545
1998    108486.969697
1999    127183.863636
2000    151637.969697
2001    170181.363636
2002    196561.212121
2003    215584.090909
2004    233728.727273
2005    244355.090909
2006    262362.318182
2007    295134.848485
2008    295976.075758
2009    286555.151515
2010    315221.939394
2011    321733.909091
2012    334741.515152
2013    367204.545455
2014    429028.787879
2015    465467.969697
Name: Price, dtype: float64

Calculate mean house prices by borough using only data from 2010 onwards.

In [8]:
ahp[ahp.Year >= 2010].groupby('Area').Price.mean()

Area
Barking and Dagenham      192415.833333
Barnet                    361583.333333
Bexley                    231666.666667
Brent                     340208.333333
Bromley                   305495.833333
Camden                    569166.666667
City of London            594999.833333
Croydon                   245666.666667
Ealing                    337816.666667
Enfield                   263666.666667
Greenwich                 278665.833333
Hackney                   370083.333333
Hammersmith and Fulham    560166.666667
Haringey                  346916.666667
Harrow                    329608.333333
Havering                  238250.000000
Hillingdon                278937.333333
Hounslow                  286316.666667
Islington                 461129.166667
Kensington and Chelsea    966666.666667
Kingston upon Thames      341408.333333
Lambeth                   358750.000000
Lewisham                  275075.000000
Merton                    332495.833333
Newham                    240500.00

Identify the three boroughs with the highest mean house prices (across all years).

In [9]:
ahp.groupby('Area').Price.mean().sort_values(ascending=False).head(3)

Area
Kensington and Chelsea    585968.625
Westminster               441171.250
Camden                    366874.950
Name: Price, dtype: float64
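
As an aside, `Series.nlargest` is an alternative to sorting and taking the head. The sketch below compares the two on a small, made-up table; the area labels and prices are hypothetical.

```python
import pandas as pd

# Hypothetical long-format prices for four areas
toy = pd.DataFrame({
    'Area': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Price': [10, 20, 50, 70, 30, 30, 5, 15],
})

means = toy.groupby('Area').Price.mean()

# Two ways of picking the three largest means
top3_sorted = means.sort_values(ascending=False).head(3)
top3_nlargest = means.nlargest(3)
print(top3_nlargest)
```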

Read in the Average Income of Tax Payers by Borough dataset from `INCOMES_URL` into a DataFrame called `incomes`.

In [10]:
incomes = pd.read_csv(INCOMES_URL)

Keep only the columns indicating the borough and the medians for each year.

In [11]:
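# Boolean column mask: keep 'Area' plus every column whose name starts with 'Median'.
# .copy() gives an independent DataFrame, so later in-place changes are safe.
incomes = incomes.loc[:, (incomes.columns == 'Area') | incomes.columns.str.startswith('Median')].copy()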

Rename the year columns so that each name contains only the starting year (e.g. `1999-00` becomes `1999`).

In [12]:
incomes.rename(columns=dict(zip(
    incomes.columns[1:],
    incomes.columns[1:].str[11:15]  # Characters 11-14 hold the starting year; a regular expression would also work
)), inplace=True)
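
The comment in the cell above mentions regular expressions. A minimal sketch of that variant follows; the column labels are assumed for illustration, and the real headers may differ. Extracting the first run of four digits avoids depending on the exact character positions.

```python
import pandas as pd

# Assumed column labels, for illustration only
cols = pd.Index(['Area', 'Median - 1999-00', 'Median - 2000-01'])

# Capture the first run of four digits, i.e. the starting year
years = cols[1:].str.extract(r'(\d{4})', expand=False)
mapping = dict(zip(cols[1:], years))
print(mapping)
```

The resulting `mapping` could be passed to `rename(columns=...)` in the same way as the slice-based dictionary.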

Melt the DataFrame and convert `Year` to integer.

In [13]:
incomes = pd.melt(incomes, id_vars='Area', var_name='Year', value_name='Income')
incomes['Year'] = incomes.Year.astype('int')

Merge `incomes` with `ahp`, keeping only observations found in both.

In [14]:
ahp = pd.merge(ahp, incomes, on=['Area', 'Year'], how='inner')  # join keys: the two shared columns
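
When `on` is not given, `pd.merge` joins on the columns that both tables share; naming them explicitly makes the keys visible. The sketch below shows an inner join on two miniature, made-up tables (all values are hypothetical): only the (`Area`, `Year`) pairs present in both survive.

```python
import pandas as pd

# Hypothetical miniature versions of the two long-format tables
prices = pd.DataFrame({
    'Area': ['Camden', 'Camden', 'Hackney'],
    'Year': [2007, 2008, 2007],
    'Price': [300, 310, 200],
})
earnings = pd.DataFrame({
    'Area': ['Camden', 'Hackney', 'Hackney'],
    'Year': [2007, 2007, 2009],
    'Income': [30, 20, 22],
})

# Inner join: keep only (Area, Year) pairs that appear in both tables
merged = pd.merge(prices, earnings, on=['Area', 'Year'], how='inner')
print(merged)
```

In this toy data, 2008 appears only in `prices` and 2009 only in `earnings`, so neither year is in the result.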

Compute mean house prices and incomes by year.

In [15]:
ahp.pivot_table(values=['Price', 'Income'], index='Year')

            Income          Price
Year
1999  17575.757576  127183.863636
2000  19166.666667  151637.969697
2001  19269.696970  170181.363636
2002  19645.454545  196561.212121
2003  19948.484848  215584.090909
2004  20033.333333  233728.727273
2005  21596.969697  244355.090909
2006  22063.636364  262362.318182
2007  23378.787879  295134.848485
2009  25154.545455  286555.151515

Compute the correlation between house prices and incomes.

In [16]:
ahp.Price.corr(ahp.Income)

0.6727862643323711
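
By default, `Series.corr` computes Pearson's correlation coefficient. As a rough check of what that means, the sketch below computes it by hand on made-up numbers (the values are hypothetical, not taken from the datasets) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
price = pd.Series([100.0, 150.0, 200.0, 400.0])
income = pd.Series([20.0, 22.0, 21.0, 30.0])

# Pearson's r: sum of products of deviations, scaled by the spread of each variable
dx = price - price.mean()
dy = income - income.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = price.corr(income)
print(r_manual, r_pandas)
```

Note that the figure above pools all boroughs and years together, so it reflects both differences between boroughs and changes over time.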