# Dublin House Prices by Post Code
## Part 1 - Sourcing and Cleaning the Data
Certain details on the sales of property in Ireland going back to 2010 are available for download here, on the Property Price Register website: https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/page/ppr-home-en

The website makes it clear that the Register is not a “Property Price Index” as such; rather, it is a record of submissions made to the Revenue Commissioners for stamp duty purposes as part of the house conveyance process. Like that well-remembered man, Oliver Cromwell, the data is presented as it was filed, warts and all.

Our aim is to break down Dublin property prices in 2016 by time and post code. Let's see how far we can get with that using the **Property Price Register** data.

In [1]:
import pandas as pd

ppr = pd.read_csv('../library/PPR/PPR-2016-Dublin.csv')
ppr.describe()

| | Date of Sale (dd/mm/yyyy) | Address | Postal Code | County | Price (€) | Not Full Market Price | VAT Exclusive | Description of Property | Property Size Description |
|---|---|---|---|---|---|---|---|---|---|
| count | 12757 | 12757 | 8482 | 12757 | 12757 | 12757 | 12757 | 12757 | 1890 |
| unique | 272 | 12690 | 23 | 1 | 2434 | 2 | 2 | 3 | 4 |
| top | 17/02/2016 | KILN HOUSE, KILSALLAGHAN, DUBLIN | Dublin 15 | Dublin | €250,000.00 | No | No | Second-Hand Dwelling house /Apartment | greater than or equal to 38 sq metres and less... |
| freq | 243 | 3 | 1071 | 12757 | 146 | 12370 | 10881 | 10866 | 1403 |


Good, but not great. The *Property Size Description* field is mostly empty, for instance: only 1,890 of the 12,757 rows have a value. More importantly for our immediate purpose, the *Postal Code* field is filled in for only 8,482 rows. How are we to resolve that?
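To put a share on those gaps, the `count` row of the `describe()` output above can be turned into missing-value totals. This is a minimal sketch with the counts copied in by hand; the dictionary and function names are illustrative only.

```python
# Non-null counts copied from the `count` row of ppr.describe().
TOTAL_ROWS = 12757
non_null = {
    "Postal Code": 8482,
    "Property Size Description": 1890,
}


def missing_summary(counts, total):
    """Return {column: (missing rows, missing share)} for each column."""
    return {col: (total - n, (total - n) / float(total))
            for col, n in counts.items()}


summary = missing_summary(non_null, TOTAL_ROWS)
for col, (missing, share) in summary.items():
    print("{}: {:,} rows missing ({:.1%})".format(col, missing, share))
```

With the real DataFrame loaded, `ppr.isnull().sum()` should give the same missing counts directly.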

### Google Maps API
As with so many contemporary problems, we can resolve ours through Google. Google provides a Maps API that will allow us to look up each address in our PPR data and, hopefully, return quite a lot of data about it, not least the house's co-ordinates in latitude and longitude.

I've written a program called `coordinateCollector.py`, as the Jupyter notebook format isn't ideal for this sort of data munging, not least as the file can take a long time to run. The program is stored [here](), and I'll go through its classes now.

#### The `Raw` Class
The `Raw` class is the most straightforward of the classes. It has two attributes: the file location, and the list of houses that the `create_houses_list` method creates. The encoding of the .csv is specified as 'latin-1'; unfortunate, but there's nothing to be done about it.

```python
import csv


class Raw(list):

    def __init__(self, houses_url):
        self.houses_url = houses_url
        self.houses_list = self.create_houses_list(self.houses_url)

    def create_houses_list(self, houses_url):
        houses_original = []
        # The encoding belongs on open(); csv.reader() has no encoding argument.
        with open(houses_url, 'r', encoding='latin-1', newline='') as f:
            reader = csv.reader(f)
            for r in reader:
                houses_original.append(r)
        print("There are {:,} rows.".format(len(houses_original)))

        return houses_original
```

#### The `GeoCode` Class
This is the workhorse of the whole project. The `GeoCode` class has four methods, three of which are helper methods for the major `get_geodata()` method of the class.
Two important things happen on initialization.
1. The `address_list` attribute is populated from the `houses_list` attribute of a `Raw` instance.
2. A `Google Maps Client` object is instantiated using a key read in from a local file - it is never good practice to hard-code keys. Notice also that the key attribute is `__myKey`, rather than `myKey`. The double underscore triggers Python's name mangling, which guards against accidental access from outside the class; it is a convention rather than real security.
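
The leading double underscore on `__myKey` makes Python mangle the attribute name. Here is a minimal sketch of what that does in practice, using a hypothetical `KeyHolder` class and a made-up key rather than the real `GeoCode` class:

```python
class KeyHolder(object):
    """Hypothetical stand-in for GeoCode, holding a made-up key."""

    def __init__(self):
        # Inside the class body, __myKey is stored as _KeyHolder__myKey.
        self.__myKey = {"mapper": "not-a-real-key"}


holder = KeyHolder()

# The plain name is not visible from outside the class...
plain_access_works = hasattr(holder, "__myKey")

# ...but the mangled name is, so the value is hidden rather than protected.
mangled_value = holder._KeyHolder__myKey["mapper"]

print(plain_access_works)
print(mangled_value)
```

Keeping the key in a file outside the repository, as the class does, is what actually keeps it out of sight.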

```python
import json

import googlemaps
import pandas as pd


class GeoCode(object):
    
    def __init__(self, houses_url):
        self.address_list = Raw(houses_url).houses_list
        with open("/Users/anthonymunnelly/Documents/TECH/Google API/api.json", 'r') as f:
            self.__myKey = json.load(f)

        self.gmaps = googlemaps.Client(key=self.__myKey['mapper'])
        self.house_data = self.get_geodata()
        
        self.houses = pd.DataFrame(self.house_data[0], columns = ["Date",
                                                      "Address",
                                                      "PostCode",
                                                      "County",
                                                      "Price",
                                                      "FullMarketPrice",
                                                      "VAT",
                                                      "Description",
                                                      "Size",
                                                      "Lat",
                                                      "Lon",
                                                      "gCheck"])
    
        self.bad_add = pd.DataFrame(self.house_data[1], columns = ["Date",
                                                      "Address",
                                                      "PostCode",
                                                      "County",
                                                      "Price",
                                                      "FullMarketPrice",
                                                      "VAT",
                                                      "Description",
                                                      "Size",
                                                      "gCheck"])
                                                      
        self.geodata = self.house_data[2]
        
        

    def coordinate_finder(self, anAddress):
        result = self.gmaps.geocode(anAddress)
        if len(result) > 0:
            return result[0]
        else:
            return -1
        
    def address_components_searcher(self, component_list, sought_item):
        returnValue = ''
        for c in component_list:
            if sought_item in c['types']:
                returnValue = c['short_name']
        
        return returnValue
    
    def check_geometry(self, lat, lon):
        north = (53.426621, -6.249899) # airport
        south = (53.148672, -6.092088) # greystones
        west = (53.371680, -6.514402) # leixlip amenities
        east = (53.329053, -5.341059) # middle of the Irish sea
        if lat > north[0] or lat < south[0]:
            return [False, 'BadLat']
        elif lon < west[1] or lon > east[1]:
            return [False, 'BadLon']
        else:
            return [True, 'Good']


    def get_geodata(self):
        houses_list = []
        bad_addresses_list = []
        geodata_list = []
        
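        # enumerate() gives the running position directly; list.index() would
        # rescan the list on every pass and misreport duplicate rows.
        for sanity_checker, item in enumerate(self.address_list, 1):
            if sanity_checker % 100 == 0:
                print('\nThis is item {:5} of {:,}.'.format(sanity_checker, len(self.address_list)))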
            try:
                geocode = self.coordinate_finder(item[1])
                if geocode == -1:
                    item.append('NoResponse')
                    bad_addresses_list.append(item)
                    continue
                address = geocode['formatted_address']
                latitude = geocode['geometry']['location']['lat']
                longitude = geocode['geometry']['location']['lng']
                gCheck = self.check_geometry(latitude, longitude)
                if gCheck[0]:
                    postcode = self.address_components_searcher(geocode['address_components'], 'postal_code')
                    
                    clean_line = [item[0],
                                  address,
                                  postcode,
                                  item[3],
                                  item[4],
                                  item[5],
                                  item[6],
                                  item[7],
                                  item[8],
                                  latitude,
                                  longitude,
                                  gCheck[1]]
                    
                    houses_list.append(clean_line)
                    geodata_list.append(geocode)
                else:
                    item.append(gCheck[1])
                    bad_addresses_list.append(item)
                    
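            except Exception:
                # Catch Exception rather than using a bare except, so that
                # Ctrl-C can still interrupt a long run.
                print('exception')
                item.append('Exception')
                bad_addresses_list.append(item)
                continue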
    
        return (houses_list, bad_addresses_list, geodata_list)

```

`get_geodata()`

The `get_geodata()` method iterates through each address that we have. It prints a sanity-check message on every hundredth line, because this program can take a long time to run and I am not a patient man. This calms me.

The `Google Client` is called on every address. If Google has nothing for an address, the `coordinate_finder()` method returns -1; we then append the address details to `bad_addresses_list` and move along.
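
The empty-response case can be tried out without touching the network. This is a minimal sketch that assumes a hypothetical `StubClient` with canned answers in place of the real Google client; the addresses are made up.

```python
class StubClient(object):
    """Hypothetical stand-in for the Google client; makes no network calls."""

    def __init__(self, canned):
        self.canned = canned  # maps an address string to a list of results

    def geocode(self, address):
        # Like the real client, answer with a list, empty when nothing is found.
        return self.canned.get(address, [])


def coordinate_finder(client, address):
    """Return the first geocoding result, or -1 if there are none."""
    result = client.geocode(address)
    return result[0] if result else -1


client = StubClient({
    "1 MAIN ST, DUBLIN": [{"formatted_address": "1 Main St, Dublin, Ireland"}],
})
found = coordinate_finder(client, "1 MAIN ST, DUBLIN")
missing = coordinate_finder(client, "NOWHERE AT ALL")
```

Here `found` holds the first canned result, while `missing` is the -1 sentinel that sends a row to the bad-address list.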

If the response is good, there are further checks. We use the `check_geometry` method to see if the latitude and longitude coordinates returned are reasonable ones for Dublin, by checking them against reasonably generous cardinal coordinates for north, south, east and west of the city. If we don't check this, we can get addresses in Dublin, Ohio, or in New Zealand and other far-flung places. Not helpful.
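
The bounding-box idea can be sketched as a standalone function. The limits are the ones from `check_geometry` above; the function name and the two test points (approximate coordinates) are only for illustration.

```python
# Limits taken from the check_geometry method.
NORTH_LAT = 53.426621  # airport
SOUTH_LAT = 53.148672  # Greystones
WEST_LON = -6.514402   # Leixlip
EAST_LON = -5.341059   # middle of the Irish Sea


def in_dublin_box(lat, lon):
    """Return True if the point falls inside the rough Dublin bounding box."""
    return SOUTH_LAT <= lat <= NORTH_LAT and WEST_LON <= lon <= EAST_LON


# Approximate coordinates, for illustration only.
dublin_city_centre = (53.3498, -6.2603)
dublin_ohio = (40.0992, -83.1141)

print(in_dublin_box(*dublin_city_centre))
print(in_dublin_box(*dublin_ohio))
```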

The final step is to see if our `Google Client` has logged a post code - it is post codes we are after, after all. If it hasn't, the post code field is left empty.
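
The post code lookup amounts to scanning the `address_components` list for an entry whose `types` include `postal_code`. This is a minimal sketch of that search; the function name and the sample components are made up, in the general shape of a geocoding result.

```python
def find_component(components, sought_type):
    """Return the short_name of the last component tagged sought_type, or ''."""
    value = ''
    for c in components:
        if sought_type in c['types']:
            value = c['short_name']
    return value


# Made-up data in the shape of a result's address_components list.
sample_components = [
    {"long_name": "Main Street", "short_name": "Main St", "types": ["route"]},
    {"long_name": "Dublin 15", "short_name": "Dublin 15", "types": ["postal_code"]},
]

with_postcode = find_component(sample_components, "postal_code")
without_postcode = find_component(sample_components[:1], "postal_code")
```

An empty string comes back when no component carries the sought type, which is why some rows can still end up without a post code.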

If all this has gone well, we append the new, fuller details of the property to `houses_list`. Note also that we use Google's `formatted_address` data where possible, as it will be more consistent than the formatting in our original PPR document. If the process has fallen down at any stage, we append the property details to `bad_addresses_list`.

Finally, the `get_geodata` method returns a tuple of `houses_list`, `bad_addresses_list` and a list of all the geodata we've gathered. You never know when this will turn out useful.

#### The `Ferment` Class
The `Ferment` class allows us to pickle all the data that we've gathered in order to use it in Jupyter notebooks, as we're about to do in [Part 2]() of **Dublin House Prices by Post Code**.
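
The `Ferment` class itself isn't reproduced here. As a rough idea of what pickling the gathered data could look like, here is a minimal sketch using the standard `pickle` module. The function names, file name and sample row are hypothetical, not taken from `coordinateCollector.py`.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path in pickle format."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# A made-up row in the spirit of houses_list.
sample_houses = [["01/01/2016", "1 Main St, Dublin", "Dublin 15", 53.39, -6.39]]

path = os.path.join(tempfile.mkdtemp(), "houses.pkl")
save_pickle(sample_houses, path)
restored = load_pickle(path)
```

A notebook would then only need the `load_pickle` half to pick up where the collector left off.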