# Dublin House Prices by Post Code
## Part 3 - Using Machine Learning to Find Missing Values
We used a `Google Maps Client` to clean up some initial house price data from the [Property Price Register](https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/page/ppr-home-en) website in [Part 1](). Then, in [Part 2](), we cleaned that data and associated as many addresses as we could with eirKeys, the new post code format in Ireland. Now, in Part 3, we hope to use Machine Learning techniques to fill in the rest.

In [1]:
import pandas as pd
import pickle
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [2]:
with open('../library/pickle/2016/houses_cleaned_after_google_2016_pickle', 'rb') as f:
    houses = pickle.load(f)
houses.dtypes

Address                    object
County                     object
Date               datetime64[ns]
Description                object
FullMarketPrice            object
Lat                       float64
Lon                       float64
PostCode                   object
Size                       object
VAT                        object
gCheck                     object
Price                     float64
PC                         object
dtype: object

In [3]:
houses.PC.value_counts()

glugger    4194
D15        1038
D04         683
D24         540
D18         539
D07         498
D09         494
D16         471
D12         437
D13         430
D08         413
D11         407
D06         383
D14         376
D03         345
D05         332
A94         279
K78         262
A96         251
K67         246
D22         230
D01         202
K32         183
D02         137
K36         125
D10         122
D17          85
D20          78
K34          49
K45          42
A98          17
K56          17
A45           2
A41           1
A42           1
Name: PC, dtype: int64

### Using Naive Bayes to Find eirKeys for Addresses
Houses for which we don't have an eirKey are called 'gluggers', from the Irish _ubh ghliogair_, a rotten egg. The good news is that the addresses for which we have both geographic coordinates and an eirKey can help us train a model that will track down eirKeys for addresses for which we have only coordinates. This is how we'll do it:

1. We divide our data frame into three types:
    * `good_eggs`, where we have both post codes and coordinates,
    * `glugger_coord`, where we have coordinates but no post code,
    * `glugger_lost`, where we have neither coordinates nor post code.
2. This being done, we can use the extant post codes in `good_eggs` to train a Naive Bayes classification model.
3. We can then use our Naive Bayes model to assign post codes to our `glugger_coord` data frame, using the coordinates.
4. Concatenate the three data frames again.
5. Resolve the different post code columns.

In [4]:
glugger_coord = houses[(houses['PC'] == 'glugger') & (houses['Lat'].notnull())].copy()
glugger_lost = houses[(houses['PC'] == 'glugger') & (houses['Lat'].isnull())].copy()
good_eggs = houses[(houses['PC'] != 'glugger') & (houses['Lat'].notnull())].copy()

print "We have {:,} addresses with good coordinates.".format(good_eggs.shape[0])
print "We have {:,} addresses for which we have coordinates but no eirKey.".format(glugger_coord.shape[0])
print "We have {:,} addresses for which we have neither eirKey nor coordinates. These are our most challenging problem.".format(glugger_lost.shape[0])

We have 6,634 addresses with good coordinates.
We have 2,610 addresses for which we have coordinates but no eirKey.
We have 1,584 addresses for which we have neither eirKey nor coordinates. These are our most challenging problem.


We're now going to use the data for which we have coordinates and eirKeys, the `good_eggs` data frame, to train a model to predict eirKeys based on the co-ordinates that exist in the `glugger_coord` data frame.

In [5]:
coords = good_eggs[['Lat', 'Lon']]

f_train, f_test, l_train, l_test = train_test_split(coords,
                                                    good_eggs['PC'],
                                                    test_size = 0.25,
                                                    random_state = 33)

clf = GaussianNB()
clf.fit(f_train, l_train)

GaussianNB()
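
The cell above holds back 25% of `good_eggs`, but the notebook never scores the classifier against it. As a rough illustration of what Gaussian Naive Bayes does with latitude/longitude pairs, and of how a held-out check might look, here is a pure-Python sketch on made-up coordinates. The helper names (`fit_gaussian_nb`, `predict_one`) and the data are hypothetical, and this is not scikit-learn's implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(points, labels):
    """Return {label: (prior, [(mean, variance), ...])}, one pair per feature."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)

    model = {}
    total = float(len(points))
    for label, rows in grouped.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # A small floor keeps the variance non-zero for tiny groups.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / total, stats)
    return model


def predict_one(model, point):
    """Pick the label with the highest log prior plus log likelihood."""
    best_label, best_score = None, float('-inf')
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(point, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up (Lat, Lon) pairs; the labels are only illustrative.
train_points = [(53.33, -6.26), (53.32, -6.25), (53.34, -6.27),
                (53.45, -6.22), (53.46, -6.21), (53.44, -6.23)]
train_labels = ['D06', 'D06', 'D06', 'K67', 'K67', 'K67']
test_points = [(53.335, -6.255), (53.455, -6.215)]
test_labels = ['D06', 'K67']

model = fit_gaussian_nb(train_points, train_labels)
predictions = [predict_one(model, p) for p in test_points]
hits = sum(p == t for p, t in zip(predictions, test_labels))
accuracy = hits / float(len(test_labels))
print("held-out accuracy: {:.2f}".format(accuracy))
```

In the notebook itself, `clf.score(f_test, l_test)` should report the equivalent figure for the real data.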

Next, we're going to call the `predict` method on our `clf` classifier to predict eirKeys for every address in `glugger_coord`, and store those predictions in a new *Prediction* column.

In [6]:
glugger_coord['Prediction'] = clf.predict(glugger_coord[['Lat', 'Lon']])
print glugger_coord.head()

                                              Address  County       Date  \
2   Moyola Mews, Churchtown Rd Lower, Churchtown L...  Dublin 2016-01-04   
5   Hampton Lodge, Grace Park Rd, Whitehall, Dubli...  Dublin 2016-01-04   
6    32 Latchford Square, Castaheany, Dublin, Ireland  Dublin 2016-01-04   
8   5 Mount Argus Terrace, Harold's Cross, Dublin ...  Dublin 2016-01-04   
14      Clarendon Hall, Aungier St, Dublin 2, Ireland  Dublin 2016-01-05   

                              Description FullMarketPrice        Lat  \
2   Second-Hand Dwelling house /Apartment              No  53.298311   
5   Second-Hand Dwelling house /Apartment              No  53.376031   
6   Second-Hand Dwelling house /Apartment              No  53.396303   
8   Second-Hand Dwelling house /Apartment              No  53.322198   
14  Second-Hand Dwelling house /Apartment              No  53.340596   

         Lon  PostCode Size VAT gCheck     Price       PC Prediction  
2  -6.253560                 No   Good 

Finally, we rejoin the three data frames into one, consolidate the eirKey data, and see what we've got.

In [7]:
house_prices = pd.concat([good_eggs, glugger_coord, glugger_lost], ignore_index=True)

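def apply_final_postcodes(aRow):
    # Use the Naive Bayes prediction only where the original post code is
    # missing and a prediction exists (rows that never went through the
    # classifier have NaN in the Prediction column).
    if aRow['PC'] == 'glugger' and pd.notnull(aRow['Prediction']):
        return aRow['Prediction']
    else:
        return aRow['PC']

house_prices['Bayes'] = house_prices.apply(apply_final_postcodes, axis=1)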
house_prices['Bayes'].value_counts()

glugger    1584
D15         781
D12         557
A96         492
D24         447
A94         441
D04         433
D07         420
D14         405
D09         404
D11         400
D06         385
D18         367
D16         365
D13         343
K67         328
D03         322
D05         315
D08         313
K78         305
K36         280
K32         231
D22         225
D01         146
D02         133
D10         113
K34          65
D20          59
D17          53
K45          47
K56          45
A98          19
A45           3
A41           1
A42           1
Name: Bayes, dtype: int64

We still have 1,584 addresses, 11 per cent of the dataset total, for which we don't have eirKeys, meaning we have valid eirKeys for 89% of our sample. It may be possible to cut that number further by using the Google Maps geodata we carefully put away earlier - just in case.

### Google Geodata
Google Maps creates a JSON-like record for every address on its maps. _Neighborhood_ and _locality_ are two of the keys they use. Our geodata object from Part 1 has neighborhood and locality information for all the addresses for which we were able to find co-ordinates. We'll now use this to whittle down our missing values a little more.

This process is much more grunt-work than the Naive Bayes implementation, but it takes all sorts, of course.

**Firstly**, we load the geodata, which is a list. We iterate through the geodata list, appending the co-ordinate, neighborhood and locality details to a second list, `neighborhoods`. From this list we create a `geocode_df` data frame so we can get a look at it.

In [9]:
with open('../library/pickle/2016/geodata_dublin_2016_pickle', 'rb') as f:
    gd = pickle.load(f)

neighborhoods = []
neighborhoods.append(['Address',
                      'Lat',
                      'Lon',
                      'Locality',
                      'Neighborhood',
                      'eirKey'])

def find_address_types(address):
    returnValue = {'locality':'',
                   'neighborhood':'',
                   'eirKey':''}
    for a in address:
        if 'locality' in a['types']:
            returnValue['locality'] = a['short_name']
        if 'neighborhood' in a['types']:
            returnValue['neighborhood'] = a['short_name']
        if 'postal_code' in a['types']:
            returnValue['eirKey'] = a['short_name'][:3]
    
    return returnValue


for address in gd:
    lat = address['geometry']['location']['lat']
    lon = address['geometry']['location']['lng']
    others = find_address_types(address['address_components'])
    
    temp = [address['formatted_address'],
            lat,
            lon,
            others['locality'],
            others['neighborhood'],
            others['eirKey']]
            
    neighborhoods.append(temp)
    
geocode_df = pd.DataFrame(neighborhoods[1:],
                      columns = neighborhoods[0])

geocode_df.head()


| | Address | Lat | Lon | Locality | Neighborhood | eirKey |
|---|---|---|---|---|---|---|
| 0 | 34 Mountpleasant Terrace, Dublin 6, D06 YC58, ... | 53.328587 | -6.261495 | | | D06 |
| 1 | 2 Brighton Rd, Brighton Hall, Kerrymount, Dubl... | 53.258165 | -6.174641 | | Kerrymount | D18 |
| 2 | Moyola Mews, Churchtown Rd Lower, Churchtown L... | 53.298311 | -6.25356 | | Churchtown Lower | |
| 3 | 24 Woodstown Meadow, Ballycullen, Dublin 16, D... | 53.273761 | -6.327188 | | Ballycullen | D16 |
| 4 | 28 Belton Park Gardens, Clontarf, Dublin 9, D0... | 53.375715 | -6.226541 | | Clontarf | D09 |
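
To see what `find_address_types` picks out, here is the same logic run against a single hand-written record shaped like the `address_components` list the loop above reads. The values are invented for illustration; only the `types` and `short_name` layout follows what the notebook code relies on.

```python
def find_address_types(address_components):
    """Pull locality, neighborhood and eirKey out of one address record."""
    found = {'locality': '', 'neighborhood': '', 'eirKey': ''}
    for component in address_components:
        if 'locality' in component['types']:
            found['locality'] = component['short_name']
        if 'neighborhood' in component['types']:
            found['neighborhood'] = component['short_name']
        if 'postal_code' in component['types']:
            # Keep only the first three characters, which this series
            # calls the eirKey.
            found['eirKey'] = component['short_name'][:3]
    return found


# Invented sample; not taken from the real geodata pickle.
sample_components = [
    {'short_name': '24', 'types': ['street_number']},
    {'short_name': 'Example Rd', 'types': ['route']},
    {'short_name': 'Ballycullen', 'types': ['neighborhood', 'political']},
    {'short_name': 'Dublin', 'types': ['locality', 'political']},
    {'short_name': 'D16 XXXX', 'types': ['postal_code']},
]

result = find_address_types(sample_components)
print("{} / {} / {}".format(result['locality'],
                            result['neighborhood'],
                            result['eirKey']))
```

A record with no `postal_code` component leaves `eirKey` as an empty string, which is how row 2 in the table above ends up without one.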


**Secondly**, we group our `geocode_df` data frame by both _locality_ and _neighborhood_, to prevent confusion with a generic placename like 'Hill' or 'Cross' showing up in more than one place. We then use these (locality, neighborhood) groups as keys in an `area_to_eirKey_mapper` dictionary, which will return a set of eirKeys for every tuple of locality and neighborhood.

We check for sets with more than one element, as these will have to be either resolved by hand or else ignored, depending on the time-output balance in the project.
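
As a toy illustration of the grouping just described, the sketch below builds the same kind of dictionary from a few invented (locality, neighborhood, eirKey) rows and lists any area that maps to more than one eirKey. The place names and the `area_to_keys` name are made up for the example.

```python
from collections import defaultdict

# Hypothetical (locality, neighborhood, eirKey) rows, invented for illustration.
rows = [
    ('Northtown', 'Cross', 'K67'),
    ('Northtown', 'Cross', 'K67'),
    ('Southtown', 'Cross', 'A96'),   # same neighborhood name, other locality
    ('Southtown', 'Hill', 'A96'),
    ('Southtown', 'Hill', ''),       # missing eirKey: skipped
    ('Southtown', 'Hill', 'A98'),    # conflicting eirKey for the same area
]

area_to_keys = defaultdict(set)
for locality, neighborhood, eir_key in rows:
    if eir_key:
        area_to_keys[(locality.lower(), neighborhood.lower())].add(eir_key)

# Areas with more than one eirKey need resolving by hand, or ignoring.
ambiguous = {area: keys for area, keys in area_to_keys.items() if len(keys) > 1}
for area in sorted(ambiguous):
    print("{} -> {}".format(area, sorted(ambiguous[area])))
```

Keying on the pair keeps the two 'Cross' entries apart, while the set makes a conflicting area easy to spot.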

In [11]:
from collections import defaultdict
area_to_eirKey_mapper = defaultdict(set)
byArea = geocode_df.groupby(['Locality', 'Neighborhood'])
for area, rest in byArea:
    if area[0] in ('', 'Dublin', 'DUB') or area[1] in ('', 'Dublin', 'DUB'):
        continue
    else:
        temp = set(rest['eirKey'])
        for t in temp:
            if t not in ('', 'Dublin'):
                area_to_eirKey_mapper[(area[0].lower(), area[1].lower())].add(t)
            else:
                continue

for a, b in area_to_eirKey_mapper.iteritems():
    if len(b)>1:
        print a, b


Nothing printed - this is a level of multiplicity we can live with. **Thirdly**, then, we copy that slice of the `house_prices` data frame for which we don't have eirKeys. We iterate through the addresses and, where there's a match for either a locality or a neighborhood in an address string, we store the matching eirKey in our `address_mapper` dictionary.

In [13]:
address_mapper = {}
multiples = []
gluggers = house_prices[house_prices.Bayes == 'glugger'].copy()
addresses = gluggers['Address']

for address in addresses:
    # Look for the first area whose locality or neighborhood name appears
    # anywhere in the lower-cased address string.
    found = False
    for a, b in area_to_eirKey_mapper.iteritems():
        if a[0] in address.lower() or a[1] in address.lower():
            found = True
            break

    if not found:
        address_mapper[address] = 'glugger'
        continue

    # b still holds the set of eirKeys for the area that matched.
    temp = list(b)
    if len(temp) == 1:
        address_mapper[address] = temp[0]
    else:
        multiples.append([a, b, address])
        address_mapper[address] = 'multiple'


print len(multiples)

0
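
The loop above relies on plain substring tests, so a short place name can also match inside a longer, unrelated word. The sketch below shows the same lookup with a word-boundary regular expression. This is an alternative rather than what the notebook does, and the mapper contents, addresses and the `match_whole_words` name are all hypothetical.

```python
import re

# Hypothetical mapper: (locality, neighborhood) -> set of eirKeys.
area_to_keys = {
    ('rush', 'kenure'): {'K56'},
    ('lusk', 'the green'): {'K45'},
}


def match_whole_words(address, area_to_keys):
    """Return the eirKey of the first area whose locality or neighborhood
    appears in the address as whole words, else 'glugger'."""
    lowered = address.lower()
    for names in sorted(area_to_keys):
        for name in names:
            if re.search(r'\b' + re.escape(name) + r'\b', lowered):
                keys = sorted(area_to_keys[names])
                return keys[0] if len(keys) == 1 else 'multiple'
    return 'glugger'


print(match_whole_words('12 Main Street, Rush', area_to_keys))
# A plain substring test would find 'rush' inside 'brushfield'.
print(match_whole_words('4 Brushfield Court, Somewhere', area_to_keys))
```

Iterating over `sorted(area_to_keys)` also makes the first match deterministic, where plain dictionary order may not be.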


In [14]:
gluggers['area_mapper'] = gluggers['Address'].map(address_mapper)

gluggers['area_mapper'].value_counts()

glugger    664
K36        222
K67        189
A94        113
A96         76
K78         71
K34         64
D24         57
K32         46
K56         37
A41         14
K45         14
A98          6
A42          4
D13          3
D15          2
D22          1
D18          1
Name: area_mapper, dtype: int64

In [15]:
multiples

[]

In [16]:
gluggers[gluggers['area_mapper'] == 'glugger'].head(10)

| | Address | County | Date | Description | FullMarketPrice | Lat | Lon | PC | PostCode | Prediction | Price | Size | VAT | gCheck | Bayes | area_mapper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9247 | 167 The Crescent, Park West Point, Gallanstown | Dublin | 2016-01-05 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 130000.0 | | No | NoResponse | glugger | glugger |
| 9249 | 5 NICHOLAS AVENUE, CHURCH STREET, DUBLIN 7 | Dublin | 2016-01-06 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 225000.0 | | No | NoResponse | glugger | glugger |
| 9251 | APT 1B, BEDFORD COURT, LOWER KIMMAGE RD DUBLIN 6W | Dublin | 2016-01-06 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 130000.0 | | No | NoResponse | glugger | glugger |
| 9255 | TROODOS, JORDANSTOWN, BALLOUGH | Dublin | 2016-01-07 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 312800.0 | | No | NoResponse | glugger | glugger |
| 9259 | 3 MONKSLEA, BORU COURT, FOREST ROAD | Dublin | 2016-01-08 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 170000.0 | | No | NoResponse | glugger | glugger |
| 9260 | 41 Priory Road, Harolds Cross Road, Dublin 6W | Dublin | 2016-01-08 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 750000.0 | | No | NoResponse | glugger | glugger |
| 9265 | 3 Glencarrig, Brides Glen Road, Rathmichael | Dublin | 2016-01-11 | New Dwelling house /Apartment | No | | | glugger | | | 748898.67 | greater than 125 sq metres | Yes | NoResponse | glugger | glugger |
| 9271 | 6 THE PIERRE APTS, VICTORIA TERRACE, DUN LAOGH... | Dublin | 2016-01-15 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 875000.0 | | No | NoResponse | glugger | glugger |
| 9277 | 65 BOWBRIDGE PLACE, KILMAINHAM, DUBLIN 8 | Dublin | 2016-01-19 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 205000.0 | | No | NoResponse | glugger | glugger |
| 9278 | 77 CASTLEDAWSON, ROCK RD, SION HILL | Dublin | 2016-01-19 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 585000.0 | | No | NoResponse | glugger | glugger |


Looking at the addresses in `gluggers.head()`, it's clear that the majority of these missing addresses can be filled in by hand. But, for the purposes of this exercise, having 664 missing values means that we have 95.2% of our eirKeys present and correct, and the effort involved in chasing down these 664 isn't worth it for the difference.

Nothing to do now but return these values to make a complete `houses` data frame once more, and tidy it up ready for its close-up in [Part 4: Data Analysis]().

In [18]:
non_gluggers = house_prices[house_prices['Bayes'] != 'glugger'].copy()
final = pd.concat([non_gluggers, gluggers], ignore_index=True)
final.dtypes

Address                    object
Bayes                      object
County                     object
Date               datetime64[ns]
Description                object
FullMarketPrice            object
Lat                       float64
Lon                       float64
PC                         object
PostCode                   object
Prediction                 object
Price                     float64
Size                       object
VAT                        object
area_mapper                object
gCheck                     object
dtype: object

In [19]:
def apply_final_postcodes(aRow):
    if aRow['Bayes'] == 'glugger':
        return aRow['area_mapper']
    else:
        return aRow['Bayes']

final['eirKey'] = final.apply(lambda x: apply_final_postcodes(x), axis=1)

In [20]:
final.eirKey.value_counts()

D15        783
glugger    664
A96        568
D12        557
A94        554
K67        517
D24        504
K36        502
D04        433
D07        420
D14        405
D09        404
D11        400
D06        385
K78        376
D18        368
D16        365
D13        346
D03        322
D05        315
D08        313
K32        277
D22        226
D01        146
D02        133
K34        129
D10        113
K56         82
K45         61
D20         59
D17         53
A98         25
A41         15
A42          5
A45          3
Name: eirKey, dtype: int64

### eirKey Names
As Lieutenant Columbo used to say, there's just one more thing: those eirKeys aren't very intuitive. Let's see if we can use the geodata information we have to create an `eirKeyName` column.

We're going to group `geocode_df` by eirKey, and then count up the Localities for each eirKey. That `value_counts()` Series is then passed to the `check_values()` function, which removes any unhelpful localities, such as 'Dublin' or '', and then returns the locality at the top of the tree. If the pruning has eliminated the entire tree, it returns the eirKey itself.

The localities aren't quite ideal - Blanchardstown is subsumed into Castleknock, Christchurch into Dolphin's Barn (I imagine wigs on the green at the next DCC meeting if that were to happen in real life), and so on. Again, these could be pruned by hand, but such pruning isn't really worthwhile for the purposes of this exercise.

In [21]:
eirKey_namer = {}
def check_values(values, ek):
    forbidden_localities = ['Dublin', 'dublin', 'dub', 'Dub', 'DUB', '']
    for f in forbidden_localities:
        if f in values:
            values.remove(f)
    if len(values) > 0:
        return values[0]
    else:
        return ek
    
byEirKey = geocode_df.groupby('eirKey')
for ek, details in byEirKey:
    temp = details['Locality'].value_counts()
    values = list(temp.keys())
    value = check_values(values, ek)
    eirKey_namer[ek] = value
    
final['eirKeyName'] = final['eirKey'].map(eirKey_namer)
final.eirKeyName.value_counts()
    

Clonee           783
Killiney         568
D12              557
Blackrock        554
Swords           517
Saggart          504
Malahide         502
D04              433
D07              420
D14              405
D09              404
D11              400
D06              385
Lucan            376
Cabinteely       368
D16              365
Portmarnock      346
D03              322
D05              315
D08              313
Balbriggan       277
Newcastle        226
D01              146
D02              133
Skerries         129
D10              113
Rush              82
Lusk              61
D20               59
D17               53
Bray              25
Ballyboughal      15
Garristown         5
Oldtown Court      3
Name: eirKeyName, dtype: int64
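
A stripped-down version of the naming rule may make the fallback easier to see. The sketch below uses `collections.Counter` as a stand-in for `value_counts()` and lower-cases each locality before comparing, which replaces the explicit list of capitalisations. The function name and the locality lists are invented for illustration.

```python
from collections import Counter

FORBIDDEN = {'dublin', 'dub', ''}


def most_common_locality(localities, eir_key):
    """Return the most frequent useful locality, or the eirKey as a fallback."""
    counts = Counter(loc for loc in localities if loc.lower() not in FORBIDDEN)
    if counts:
        return counts.most_common(1)[0][0]
    return eir_key


# Invented locality lists, for illustration only.
print(most_common_locality(['Swords', 'Dublin', 'Swords', 'Kinsealy'], 'K67'))
print(most_common_locality(['Dublin', '', 'DUB'], 'D06'))
```

The second call shows why so many rows above are named after their own eirKey: once 'Dublin' and blanks are pruned, nothing is left to pick from.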

In [22]:
final.dtypes

Address                    object
Bayes                      object
County                     object
Date               datetime64[ns]
Description                object
FullMarketPrice            object
Lat                       float64
Lon                       float64
PC                         object
PostCode                   object
Prediction                 object
Price                     float64
Size                       object
VAT                        object
area_mapper                object
gCheck                     object
eirKey                     object
eirKeyName                 object
dtype: object

In [23]:
final.drop(['Bayes', 'County', 'PC', 'PostCode', 'Prediction', 'area_mapper', 'gCheck'], axis=1, inplace=True)

In [24]:
final.head()

| | Address | Date | Description | FullMarketPrice | Lat | Lon | Price | Size | VAT | eirKey | eirKeyName |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34 Mountpleasant Terrace, Dublin 6, D06 YC58, ... | 2016-01-01 | Second-Hand Dwelling house /Apartment | No | 53.328587 | -6.261495 | 170000.0 | | No | D06 | D06 |
| 1 | 2 Brighton Rd, Brighton Hall, Kerrymount, Dubl... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.258165 | -6.174641 | 1150000.0 | | No | D18 | Cabinteely |
| 2 | 24 Woodstown Meadow, Ballycullen, Dublin 16, D... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.273761 | -6.327188 | 430000.0 | | No | D16 | D16 |
| 3 | 28 Belton Park Gardens, Clontarf, Dublin 9, D0... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.375715 | -6.226541 | 225000.0 | | No | D09 | D09 |
| 4 | 48A Beaufield Park, Stillorgan, Dublin, A94 XH... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.290123 | -6.203367 | 345000.0 | | No | A94 | Blackrock |


In [25]:
with open('../library/pickle/2016/prices_final_dublin_2016_pickle', 'wb') as f:
    pickle.dump(final, f)