# Dublin House Prices by Post Code
## Part 3 - Using Machine Learning to Find Missing Values
We used a `Google Maps Client` to clean up some initial house price data from the [Property Price Register]() website in [Part 1](). Then, in [Part 2](), we cleaned that data and associated as many addresses as we could with eirKeys, the new post code format in Ireland. Now, in Part 3, we hope to use Machine Learning techniques to fill in the rest.

In [1]:
import pandas as pd
import pickle
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [2]:
with open('../library/pickle/houses_cleaned_after_google_2016_2015_pickle', 'r') as f:
    houses = pickle.load(f)
houses.dtypes

Address                    object
County                     object
Date               datetime64[ns]
Description                object
FullMarketPrice            object
Lat                       float64
Lon                       float64
PostCode                   object
Size                       object
VAT                        object
gCheck                     object
Price                     float64
PC                         object
dtype: object

In [3]:
houses.PC.value_counts()

glugger    11295
D15         1703
D24         1007
D07          969
D09          967
D18          939
D04          920
D12          865
D08          817
D11          762
D16          728
D13          716
D03          689
D14          633
D06          624
D05          600
A94          578
K78          558
A96          508
D01          462
D22          456
K67          234
D10          209
D02          199
K32          164
D20          151
D17          129
K36          120
K34           47
K45           36
A98           31
K56           17
A45            2
A86            2
A41            1
A42            1
Name: PC, dtype: int64

### Using Naive Bayes to Find eirKeys for Addresses
Houses for which we don't have an eirKey are called 'gluggers', from the Irish _ubh ghliogair_, a rotten egg. The good news is that the addresses for which we have both geographic coordinates and an eirKey can help us train a model that will track down eirKeys for addresses for which we have only coordinates. This is how we'll do it:

1. Divide our data frame into three groups:
    * `good_eggs`, where we have both post codes and coordinates;
    * `glugger_coord`, where we have coordinates but no post code;
    * `glugger_lost`, where we have neither coordinates nor post code.
2. Use the extant post codes in `good_eggs` to train a Naive Bayes classification model.
3. Use the Naive Bayes model to assign post codes to the `glugger_coord` data frame, based on its coordinates.
4. Concatenate the three data frames again.
5. Resolve the different post code columns.

In [4]:
glugger_coord = houses[(houses['PC'] == 'glugger') & (houses['Lat'].notnull())].copy()
glugger_lost = houses[(houses['PC'] == 'glugger') & (houses['Lat'].isnull())].copy()
good_eggs = houses[(houses['PC'] != 'glugger') & (houses['Lat'].notnull())].copy()

print "We have {:,} addresses with good coordinates.".format(good_eggs.shape[0])
print "We have {:,} addresses for which we have coordinates but no eirKey.".format(glugger_coord.shape[0])
print "We have {:,} addresses for which we have neither eirKey nor coordinates. These are our most challenging problem.".format(glugger_lost.shape[0])

We have 13,230 addresses with good coordinates.
We have 8,322 addresses for which we have coordinates but no eirKey.
We have 2,973 addresses for which we have neither eirKey nor coordinates. These are our most challenging problem.
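
The three-way split can be sketched in plain Python on a handful of made-up records. The rows below are hypothetical, and the snippet uses Python 3 syntax rather than the notebook's Python 2. It also makes one property of the three masks visible: a row that has an eirKey but no coordinates matches none of them, so it ends up in none of the three data frames.

```python
# Hypothetical rows: 'PC' is the eirKey (or 'glugger'); 'Lat' is None when geocoding failed.
rows = [
    {"Address": "A", "PC": "D06", "Lat": 53.3286},
    {"Address": "B", "PC": "glugger", "Lat": 53.2808},
    {"Address": "C", "PC": "glugger", "Lat": None},
    {"Address": "D", "PC": "D01", "Lat": None},
]

def partition(rows):
    """Split rows the same way as the three boolean masks in the cell above."""
    groups = {"good_eggs": [], "glugger_coord": [], "glugger_lost": [], "unassigned": []}
    for row in rows:
        has_key = row["PC"] != "glugger"
        has_coord = row["Lat"] is not None
        if has_key and has_coord:
            groups["good_eggs"].append(row)
        elif not has_key and has_coord:
            groups["glugger_coord"].append(row)
        elif not has_key and not has_coord:
            groups["glugger_lost"].append(row)
        else:
            # eirKey present but no coordinates: matched by none of the three masks
            groups["unassigned"].append(row)
    return groups

groups = partition(rows)
for name, members in groups.items():
    print(name, [m["Address"] for m in members])
```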


We're now going to use the data for which we have coordinates and eirKeys, the `good_eggs` data frame, to train a model to predict eirKeys based on the co-ordinates that exist in the `glugger_coord` data frame.

In [5]:
coords = good_eggs[['Lat', 'Lon']]

f_train, f_test, l_train, l_test = train_test_split(coords,
                                                    good_eggs['PC'],
                                                    test_size = 0.25,
                                                    random_state = 33)

clf = GaussianNB()
clf.fit(f_train, l_train)

GaussianNB()
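
To give a feel for what a Gaussian Naive Bayes classifier does with coordinates, here is a minimal from-scratch sketch in plain Python 3. It is not scikit-learn's implementation: it simply estimates, for each eirKey, a prior plus a mean and variance for latitude and longitude, and then picks the eirKey with the highest log posterior. The coordinates, the labels and the small variance floor are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit(points, labels):
    """Estimate per-class prior, mean and variance of (lat, lon)."""
    by_class = defaultdict(list)
    for point, label in zip(points, labels):
        by_class[label].append(point)
    model = {}
    for label, pts in by_class.items():
        n = len(pts)
        means = [sum(p[i] for p in pts) / n for i in range(2)]
        # Small constant keeps the variance strictly positive (an assumption of this sketch).
        variances = [sum((p[i] - means[i]) ** 2 for p in pts) / n + 1e-9 for i in range(2)]
        model[label] = (n / len(points), means, variances)
    return model

def log_posterior(point, prior, means, variances):
    """log P(class) plus one log Gaussian density per feature."""
    total = math.log(prior)
    for x, mu, var in zip(point, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return total

def predict(model, point):
    """Return the class whose log posterior is highest for this point."""
    return max(model, key=lambda label: log_posterior(point, *model[label]))

# Hypothetical training points: (Lat, Lon) with a known eirKey.
train_points = [(53.33, -6.26), (53.32, -6.25), (53.34, -6.27),
                (53.39, -6.43), (53.40, -6.42), (53.38, -6.44)]
train_labels = ["D06", "D06", "D06", "D15", "D15", "D15"]
model = fit(train_points, train_labels)

# Held-out points, used to check the model before trusting its predictions.
test_points = [(53.325, -6.255), (53.395, -6.425)]
test_labels = ["D06", "D15"]
predictions = [predict(model, p) for p in test_points]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

In the notebook itself, the held-out `f_test` and `l_test` produced by `train_test_split` are never scored; calling `clf.score(f_test, l_test)` would report the classifier's accuracy on them.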

Next, we're going to call the `predict` method on our `clf` classifier to predict eirKeys for every address in `glugger_coord`, and store those predictions in a new *Prediction* column.

In [6]:
glugger_coord['Prediction'] = clf.predict(glugger_coord[['Lat', 'Lon']])
print glugger_coord.head()

                                             Address  County       Date  \
1                       Saggart, Co. Dublin, Ireland  Dublin 2016-01-04   
2                       Saggart, Co. Dublin, Ireland  Dublin 2016-01-04   
4  Moyola Mews, Churchtown Rd Lower, Churchtown L...  Dublin 2016-01-04   
7   32 Latchford Square, Castaheany, Dublin, Ireland  Dublin 2016-01-04   
9  5 Mount Argus Terrace, Harold's Cross, Dublin ...  Dublin 2016-01-04   

                             Description FullMarketPrice        Lat       Lon  \
1          New Dwelling house /Apartment              No  53.280840 -6.443184   
2          New Dwelling house /Apartment              No  53.280840 -6.443184   
4  Second-Hand Dwelling house /Apartment              No  53.298311 -6.253560   
7  Second-Hand Dwelling house /Apartment              No  53.396310 -6.431049   
9  Second-Hand Dwelling house /Apartment              No  53.322198 -6.289609   

   PostCode                                               Size

Finally, we rejoin the three data frames into one, consolidate the eirKey data, and see what we've got.

In [7]:
house_prices = pd.concat([good_eggs, glugger_coord, glugger_lost], ignore_index=True)

import numpy as np

def apply_final_postcodes(aRow):
    if aRow['PC'] == 'glugger' and type(aRow['Prediction']) != float:
        return aRow['Prediction']
    else:
        return aRow['PC']

house_prices['Bayes'] = house_prices.apply(lambda x: apply_final_postcodes(x), axis=1)
house_prices['Bayes'].value_counts()

glugger    2973
D15        2003
D24        1276
D12        1163
A96        1111
D13        1109
D04        1080
D16        1065
D18        1016
A94        1012
D11         991
D06         982
D07         968
D14         907
D09         895
D03         828
D08         804
K78         728
D05         639
D01         528
D22         522
K36         373
K67         347
D10         258
K32         244
D02         235
D20         166
D17         108
K34          70
K45          42
A98          38
K56          38
A86           2
A45           2
A41           1
A42           1
Name: Bayes, dtype: int64

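We still have 2,973 addresses, about 12 per cent of the 24,525 rows in `house_prices`, for which we don't have eirKeys, meaning we have eirKeys, known or predicted, for about 88% of our sample. These are the `glugger_lost` addresses, which have no coordinates for the classifier to work with. It may be possible to cut that number further by using the Google Maps geodata we stored earlier.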

### Google Geodata
Google Maps creates a JSON-like record for every address on its maps. _Neighborhood_ and _locality_ are two of the address component types it uses. Our geodata object from Part 1 has neighborhood and locality information for all the addresses for which we were able to find coordinates. We'll now use this to whittle down our missing values a little more.

This process is much more grunt-work than the Naive Bayes implementation, but it takes all sorts, of course.

**Firstly**, we load the geodata, which is a list. We iterate through the geodata list, appending the coordinates, locality, neighborhood and eirKey of each address to a second list, `neighborhoods`. From this list we create a `geocode_df` data frame so we can get a look at it.

In [10]:
with open('../library/pickle/2016/geodata_dublin_2016_pickle', 'r') as f:
    gd = pickle.load(f)

with open('../library/pickle/2015/geodata_2015_pickle', 'r') as f:
    gd15 = pickle.load(f)

gd.extend(gd15)  # add the 2015 records to the 2016 list
print type(gd)

<type 'list'>


In [11]:
neighborhoods = []
neighborhoods.append(['Address',
                      'Lat',
                      'Lon',
                      'Locality',
                      'Neighborhood',
                      'eirKey'])

def find_address_types(address):
    returnValue = {'locality':'',
                   'neighborhood':'',
                   'eirKey':''}
    for a in address:
        if 'locality' in a['types']:
            returnValue['locality'] = a['short_name']
        if 'neighborhood' in a['types']:
            returnValue['neighborhood'] = a['short_name']
        if 'postal_code' in a['types']:
            returnValue['eirKey'] = a['short_name'][:3]
    
    return returnValue


for address in gd:
    lat = address['geometry']['location']['lat']
    lon = address['geometry']['location']['lng']
    others = find_address_types(address['address_components'])
    
    temp = [address['formatted_address'],
            lat,
            lon,
            others['locality'],
            others['neighborhood'],
            others['eirKey']]
            
    neighborhoods.append(temp)
    
geocode_df = pd.DataFrame(neighborhoods[1:],
                      columns = neighborhoods[0])

geocode_df.head()

| | Address | Lat | Lon | Locality | Neighborhood | eirKey |
|---|---|---|---|---|---|---|
| 0 | 34 Mountpleasant Terrace, Dublin 6, D06 YC58, ... | 53.328587 | -6.261495 | Ranelagh | | D06 |
| 1 | Saggart, Co. Dublin, Ireland | 53.28084 | -6.443184 | Saggart | | |
| 2 | Saggart, Co. Dublin, Ireland | 53.28084 | -6.443184 | Saggart | | |
| 3 | 2 Brighton Rd, Brighton Hall, Kerrymount, Dubl... | 53.258165 | -6.174641 | Foxrock | Kerrymount | D18 |
| 4 | Moyola Mews, Churchtown Rd Lower, Churchtown L... | 53.298311 | -6.25356 | Churchtown | Churchtown Lower | |
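
To make the shape of one geocoding record concrete, the sketch below runs the same extraction logic over a single hand-written record. The record is hypothetical and trimmed to the fields the cell above reads (`formatted_address`, `geometry`, and `address_components` with `short_name` and `types`); real records carry more fields. The snippet is Python 3.

```python
# A hypothetical, trimmed-down geocoding record with only the fields read above.
sample = {
    "formatted_address": "1 Example Terrace, Ranelagh, Dublin 6, D06 XXXX, Ireland",
    "geometry": {"location": {"lat": 53.3286, "lng": -6.2615}},
    "address_components": [
        {"short_name": "Ranelagh", "types": ["locality", "political"]},
        {"short_name": "Dublin 6", "types": ["postal_town"]},
        {"short_name": "D06 XXXX", "types": ["postal_code"]},
    ],
}

def find_address_types(components):
    """Pick out locality, neighborhood and eirKey (first 3 characters of the post code)."""
    found = {"locality": "", "neighborhood": "", "eirKey": ""}
    for component in components:
        if "locality" in component["types"]:
            found["locality"] = component["short_name"]
        if "neighborhood" in component["types"]:
            found["neighborhood"] = component["short_name"]
        if "postal_code" in component["types"]:
            found["eirKey"] = component["short_name"][:3]
    return found

details = find_address_types(sample["address_components"])
row = [sample["formatted_address"],
       sample["geometry"]["location"]["lat"],
       sample["geometry"]["location"]["lng"],
       details["locality"], details["neighborhood"], details["eirKey"]]
print(row)
```

A record with no `neighborhood` or `postal_code` component simply leaves those fields as empty strings, which is why several cells in the table above are blank.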


**Secondly**, we group our `geocode_df` data frame by both _locality_ and _neighborhood_, to prevent confusion with a generic placename like 'Hill' or 'Cross' showing up in more than one place. We then use these (locality, neighborhood) groups as keys in an `area_to_eirKey_mapper` dictionary, which returns a set of eirKeys for every tuple of locality and neighborhood.

We check for sets with more than one element, as these will have to be either resolved by hand or else ignored, depending on the time-output balance in the project.

In [12]:
from collections import defaultdict
area_to_eirKey_mapper = defaultdict(set)
byArea = geocode_df.groupby(['Locality', 'Neighborhood'])
for area, rest in byArea:
    if area[0] in ('', 'Dublin', 'DUB') or area[1] in ('', 'Dublin', 'DUB'):
        continue
    else:
        temp = set(rest['eirKey'])
        for t in temp:
            if t not in ('', 'Dublin'):
                area_to_eirKey_mapper[(area[0].lower(), area[1].lower())].add(t)
            else:
                continue

for a, b in area_to_eirKey_mapper.iteritems():
    if len(b)>1:
        print a, b


(u'donnycarney', u'clontarf') set([u'D05', u'D03', u'D09'])
(u'cabinteely', u'johnstown') set([u'A96', u'D18'])
(u'glasnevin', u'botanic') set([u'D11', u'D09'])
(u'sandyford', u'balally') set([u'D16', u'D18'])
(u'drumcondra', u'drumcondra south') set([u'D03', u'D09'])
(u'glasnevin', u'ballygall') set([u'D11', u'D09'])
(u'teach meal\xf3g', u'templeogue') set([u'D6W', u'D16'])
(u'churchtown', u'dundrum') set([u'D14', u'D16'])
(u'dundrum', u'drummartin') set([u'D14', u'D16'])
(u'churchtown', u'churchtown upper') set([u'D14', u'D16'])
(u'killester', u'clontarf east') set([u'D05', u'D03'])
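
The grouping step can be sketched without pandas, using a `defaultdict(set)` over a few made-up rows. The tuples below are hypothetical stand-ins for the `Locality`, `Neighborhood` and `eirKey` columns, and the snippet is Python 3.

```python
from collections import defaultdict

# Hypothetical (locality, neighborhood, eirKey) rows standing in for geocode_df.
rows = [
    ("Foxrock", "Kerrymount", "D18"),
    ("Churchtown", "Dundrum", "D14"),
    ("Churchtown", "Dundrum", "D16"),
    ("Saggart", "", "D24"),          # skipped: no neighborhood
    ("Dublin", "Botanic", "D09"),    # skipped: generic locality
    ("Foxrock", "Kerrymount", ""),   # skipped: no eirKey to record
]

GENERIC = ("", "Dublin", "DUB")
area_to_eirkey = defaultdict(set)
for locality, neighborhood, eirkey in rows:
    if locality in GENERIC or neighborhood in GENERIC:
        continue
    if eirkey in ("", "Dublin"):
        continue
    area_to_eirkey[(locality.lower(), neighborhood.lower())].add(eirkey)

# Areas that map to more than one eirKey need resolving by hand, or ignoring.
ambiguous = {area: keys for area, keys in area_to_eirkey.items() if len(keys) > 1}
print(dict(area_to_eirkey))
print(ambiguous)
```

Using a set as the dictionary value means repeated (area, eirKey) pairs collapse automatically, so the length of each set is the number of distinct eirKeys seen for that area.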


This is a level of multiplicity we can live with. **Thirdly**, then, we copy the slice of the `house_prices` data frame for which we don't have eirKeys. We iterate through its addresses and, where either a locality or a neighborhood appears in an address string, we store the matching eirKey in our `address_mapper` dictionary. Addresses that match an area with more than one eirKey are marked 'multiple', and addresses with no match stay as 'glugger'.

In [13]:
address_mapper = {}
multiples = []
gluggers = house_prices[house_prices.Bayes == 'glugger'].copy()
addresses = gluggers['Address']

for address in addresses:
    found = False
    for a, b in area_to_eirKey_mapper.iteritems():
        if a[0] in address.lower() or a[1] in address.lower():
            found = True
            break
            
    if found == False:
        address_mapper[address] = 'glugger'
        continue

    temp = list(b)
    if len(temp) == 1:
        address_mapper[address] = temp[0]
        continue
    else:
        multiples.append([a, b, address])
        address_mapper[address] = 'multiple'


print len(multiples)

3
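
The matching loop can also be written as a small function, which makes its three possible outcomes easier to see. This is a sketch in Python 3 with a hypothetical two-entry mapper; like the loop above, it stops at the first area whose locality or neighborhood appears anywhere in the address. Because this is a plain substring test, a short placename can also match inside a longer word, so results are worth spot-checking.

```python
# Hypothetical mapper in the same shape as area_to_eirKey_mapper.
mapper = {
    ("foxrock", "kerrymount"): {"D18"},
    ("churchtown", "dundrum"): {"D14", "D16"},
}

def lookup(address, mapper):
    """Return an eirKey, 'multiple' or 'glugger' for one address string."""
    lowered = address.lower()
    for (locality, neighborhood), keys in mapper.items():
        if locality in lowered or neighborhood in lowered:
            # First matching area wins, as in the loop above.
            return next(iter(keys)) if len(keys) == 1 else "multiple"
    return "glugger"

print(lookup("2 BRIGHTON RD, FOXROCK", mapper))
print(lookup("5 Main Street, Dundrum, Dublin 14", mapper))
print(lookup("37 Langford Street, Killorglin", mapper))
```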


In [14]:
gluggers['area_mapper'] = gluggers['Address'].map(address_mapper)

gluggers['area_mapper'].value_counts()

K67         762
K36         479
glugger     423
K32         265
A94         214
K34         158
K56         116
K78         102
K45          95
D24          72
D09          57
A96          51
D18          42
D05          27
D15          16
D22          16
D13          12
D16          11
D12          10
D14          10
D03           7
A42           7
A45           6
D07           3
multiple      3
A98           3
D06           2
D08           1
D11           1
D20           1
D10           1
Name: area_mapper, dtype: int64

In [15]:
multiples

[[(u'teach meal\xf3g', u'templeogue'),
  {u'D16', u'D6W'},
  u'NO 11 ORLAGH NA GREEN, TEMPLEOGUE, DUBLIN16'],
 [(u'donnycarney', u'clontarf'),
  {u'D03', u'D05', u'D09'},
  u'11 Churchfield, 59/60 Clontarf Road, Dublin 3'],
 [(u'teach meal\xf3g', u'templeogue'),
  {u'D16', u'D6W'},
  u'MANTUA, 256 TEMPLEOGUE RD, TEMPLEOGUE BRIDGE DUBLIN 6']]

In [16]:
gluggers[gluggers['area_mapper'] == 'glugger'].head(10)

| | Address | County | Date | Description | FullMarketPrice | Lat | Lon | PC | PostCode | Prediction | Price | Size | VAT | gCheck | Bayes | area_mapper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21558 | 3 MONKSLEA, BORU COURT, FOREST ROAD | Dublin | 2016-01-08 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 170000.0 | | No | NoResponse | glugger | glugger |
| 21559 | 41 Priory Road, Harolds Cross Road, Dublin 6W | Dublin | 2016-01-08 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 750000.0 | | No | NoResponse | glugger | glugger |
| 21568 | 77 CASTLEDAWSON, ROCK RD, SION HILL | Dublin | 2016-01-19 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 585000.0 | | No | NoResponse | glugger | glugger |
| 21569 | 13 MONTGOMERY COURT, DUBLIN 1, DUBLIN | Dublin | 2016-01-20 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 200000.0 | | No | BadLat | glugger | glugger |
| 21573 | 37 Langford Street, Killorglin | Dublin | 2016-01-21 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 95000.0 | | No | BadLat | glugger | glugger |
| 21575 | 10 BORRIG HOUSE, BORRIS AVE, DUNLAOGHAIRE | Dublin | 2016-01-22 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 220500.0 | | No | NoResponse | glugger | glugger |
| 21578 | 7 Barnwell Walk, Hansfield, Barnwell | Dublin | 2016-01-22 | New Dwelling house /Apartment | No | | | glugger | | | 254966.0 | greater than or equal to 38 sq metres and less... | Yes | BadLat | glugger | glugger |
| 21589 | 19 BLACK STREET, INFIRMARY ROAD, STONEYBATTER | Dublin | 2016-01-29 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 217000.0 | | No | NoResponse | glugger | glugger |
| 21595 | 10 HIBERNIA, DE VESCI COURT, THE SLOPES | Dublin | 2016-01-30 | Second-Hand Dwelling house /Apartment | Yes | | | glugger | | | 120000.0 | | No | NoResponse | glugger | glugger |
| 21597 | 113 PARKVIEW MANSION, APT 8, HAROLDS CROSS DUB... | Dublin | 2016-02-01 | Second-Hand Dwelling house /Apartment | No | | | glugger | | | 145000.0 | | No | NoResponse | glugger | glugger |


Looking at the addresses in `gluggers.head(10)`, it's clear that the majority of these missing addresses could be filled in by hand. But, for the purposes of this exercise, having 423 missing values (plus three marked 'multiple') means that we have 98.3% of our eirKeys in place, and the effort involved in chasing down the rest isn't worth it for the difference.
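
As a quick check on the coverage at each stage, the arithmetic can be worked through from the counts printed in the outputs above. This is a small Python 3 sketch; `coverage` is a hypothetical helper, not something defined in the notebook.

```python
# Counts taken from the outputs printed earlier in this notebook.
good_eggs, glugger_coord, glugger_lost = 13230, 8322, 2973
total = good_eggs + glugger_coord + glugger_lost

def coverage(missing, total):
    """Percentage of rows that have an eirKey, given the number still missing."""
    return 100.0 * (total - missing) / total

after_bayes = coverage(glugger_lost, total)  # only rows without coordinates remain
after_area = coverage(423 + 3, total)        # 'glugger' plus 'multiple' in area_mapper
print(total, round(after_bayes, 1), round(after_area, 1))
```

On these counts the Naive Bayes step leaves roughly 88% of rows with an eirKey, and the locality and neighborhood matching brings that to a little over 98%.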

Nothing to do now but return these values to make a complete `houses` data frame once more, and tidy it up ready for its close-up in [Part 4: Data Analysis]().

In [17]:
non_gluggers = house_prices[house_prices['Bayes'] != 'glugger'].copy()
final = pd.concat([non_gluggers, gluggers], ignore_index=True)
final.dtypes

Address                    object
Bayes                      object
County                     object
Date               datetime64[ns]
Description                object
FullMarketPrice            object
Lat                       float64
Lon                       float64
PC                         object
PostCode                   object
Prediction                 object
Price                     float64
Size                       object
VAT                        object
area_mapper                object
gCheck                     object
dtype: object

In [18]:
def apply_final_postcodes(aRow):
    if aRow['Bayes'] == 'glugger':
        return aRow['area_mapper']
    else:
        return aRow['Bayes']

final['eirKey'] = final.apply(lambda x: apply_final_postcodes(x), axis=1)

In [19]:
final.eirKey.value_counts()

D15         2019
D24         1348
A94         1226
D12         1173
A96         1162
D13         1121
K67         1109
D04         1080
D16         1076
D18         1058
D11          992
D06          984
D07          971
D09          952
D14          917
K36          852
D03          835
K78          830
D08          805
D05          666
D22          538
D01          528
K32          509
glugger      423
D10          259
D02          235
K34          228
D20          167
K56          154
K45          137
D17          108
A98           41
A42            8
A45            8
multiple       3
A86            2
A41            1
Name: eirKey, dtype: int64

### eirKey Names
As Lieutenant Columbo used to say, there's just one more thing: those eirKeys aren't very intuitive. Let's see if we can use the geodata information we have to create an `eirKeyName` column.

We're going to group `geocode_df` by eirKey, and then count up the Localities for each eirKey. That `value_counts()` Series is then passed to the `check_values()` function, which removes any unhelpful localities, such as 'Dublin' or '', and then returns the locality at the top of the tree. If the pruning has eliminated the entire tree, it returns the eirKey itself.

The localities aren't quite ideal: Blanchardstown is subsumed into Castleknock, Christchurch into Dolphins Barn (I imagine wigs on the green at the next DCC meeting if that were to happen in real life), and so on. Again, these could be pruned by hand, but such pruning isn't really worthwhile for the purposes of this exercise.

In [20]:
eirKey_namer = {}
def check_values(values, ek):
    forbidden_localities = ['Dublin', 'dublin', 'dub', 'Dub', 'DUB', '']
    for f in forbidden_localities:
        if f in values:
            values.remove(f)
    if len(values) > 0:
        return values[0]
    else:
        return ek
    
byEirKey = geocode_df.groupby('eirKey')
for ek, details in byEirKey:
    temp = details['Locality'].value_counts()
    values = list(temp.keys())
    value = check_values(values, ek)
    eirKey_namer[ek] = value
    
final['eirKeyName'] = final['eirKey'].map(eirKey_namer)
final.eirKeyName.value_counts()
    

Castleknock      2019
Tallaght         1348
Blackrock        1226
Crumlin          1173
Killiney         1162
Sutton           1121
Swords           1109
Ballsbridge      1080
Ballinteer       1076
Foxrock          1058
Finglas           992
Rathmines         984
Cabra             971
Drumcondra        952
Churchtown        917
Malahide          852
Clontarf          835
Lucan             830
Dolphins Barn     805
Raheny            666
Clondalkin        538
D01               528
Balbriggan        509
Ballyfermot       259
D02               235
Skerries          228
Palmerstown       167
Rush              154
Lusk              137
Coolock           108
Bray               41
Garristown          8
Oldtown Court       8
Dunboyne            2
Ballyboughal        1
Name: eirKeyName, dtype: int64
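
The naming rule can be sketched with `collections.Counter` on plain lists. This is a Python 3 illustration with hypothetical locality lists; unlike `check_values()`, it compares generic names case-insensitively, which is a design choice of the sketch rather than the notebook's behaviour.

```python
from collections import Counter

FORBIDDEN = {"dublin", "dub", ""}

def name_for(eirkey, localities):
    """Most common locality for an eirKey, ignoring generic names; fall back to the key."""
    counts = Counter(loc for loc in localities if loc.lower() not in FORBIDDEN)
    if counts:
        return counts.most_common(1)[0][0]
    return eirkey

# Hypothetical locality lists per eirKey.
print(name_for("D15", ["Castleknock", "Blanchardstown", "Castleknock", "Dublin"]))
print(name_for("D01", ["Dublin", "Dublin", ""]))
```

The second call shows why `D01` and `D02` keep their eirKey as a name in the output above: once the generic localities are pruned, nothing is left to choose from.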

In [21]:
final.dtypes

Address                    object
Bayes                      object
County                     object
Date               datetime64[ns]
Description                object
FullMarketPrice            object
Lat                       float64
Lon                       float64
PC                         object
PostCode                   object
Prediction                 object
Price                     float64
Size                       object
VAT                        object
area_mapper                object
gCheck                     object
eirKey                     object
eirKeyName                 object
dtype: object

In [22]:
final.drop(['Bayes', 'County', 'PC', 'PostCode', 'Prediction', 'area_mapper', 'gCheck'], axis=1, inplace=True)

In [23]:
final.head()

| | Address | Date | Description | FullMarketPrice | Lat | Lon | Price | Size | VAT | eirKey | eirKeyName |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34 Mountpleasant Terrace, Dublin 6, D06 YC58, ... | 2016-01-01 | Second-Hand Dwelling house /Apartment | No | 53.328587 | -6.261495 | 170000.0 | | No | D06 | Rathmines |
| 1 | 2 Brighton Rd, Brighton Hall, Kerrymount, Dubl... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.258165 | -6.174641 | 1150000.0 | | No | D18 | Foxrock |
| 2 | 24 Woodstown Meadow, Ballycullen, Dublin 16, D... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.273761 | -6.327188 | 430000.0 | | No | D16 | Ballinteer |
| 3 | 28 Belton Park Gardens, Clontarf, Dublin 9, D0... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.375715 | -6.226541 | 225000.0 | | No | D09 | Drumcondra |
| 4 | 48A Beaufield Park, Stillorgan, Dublin, A94 XH... | 2016-01-04 | Second-Hand Dwelling house /Apartment | No | 53.290123 | -6.203367 | 345000.0 | | No | A94 | Blackrock |


In [24]:
with open('../library/pickle/prices_final_dublin_2016_2015_pickle', 'w') as f:
    pickle.dump(final, f)
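
The cells in this notebook open pickle files in text mode (`'r'` and `'w'`), which works under Python 2. On Python 3 the same files have to be opened in binary mode. The round trip below is a minimal sketch of that, using a temporary directory and a hypothetical record rather than the real `final` data frame.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the final data frame.
record = {"Address": "1 Example Terrace", "eirKey": "D06", "eirKeyName": "Rathmines"}

path = os.path.join(tempfile.mkdtemp(), "prices_example_pickle")
with open(path, "wb") as f:   # binary mode: required on Python 3
    pickle.dump(record, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)
```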