## Data Wrangling of Singapore Map Data
### by: Darren Liu

[Mapzen: Central Singapore, Singapore](https://mapzen.com/projects/search?query=Singapore%2C%20Central%20Singapore&endpoint=place&gid=gn%3Alocality%3A1880252&selectedLat=1.28967&selectedLng=103.85007&lng=103.74372&lat=1.37213&zoom=12)

> I have chosen Singapore because it is well known for its central planning and compact landmass. It would be interesting to see whether and how Singapore's map information differs from that of American cities.

In [1]:
from pymongo import MongoClient
from pprint import pprint

client = MongoClient("mongodb://localhost:27017")
db = client.openmap
mr = db.mapdata_raw
pc = db.postcode

### 1. Problems to Solve

> I admittedly lack an understanding of map data and of what to expect from good node, way, or relation data points compared to bad ones. But I do know what good addresses look like, or so I thought. After some light research on how addresses work in Singapore, I came across some facts that are further testament to its central planning prowess. According to Wikipedia, every building in Singapore is given its own unique postcode. This means any address that can be linked to its actual postcode can potentially be cleaned into a perfect address, provided there is a postcode database with that information. Luckily, there are two: one you pay for, and one you work for. I worked for it.

> The nimble parser worked two full days to extract 124,285 unique postcodes and their perfect addresses from the poor site. (There are actually 124,289, but I missed a few and don't know which ones.) The quest for perfect addresses can begin.

In [2]:
pc.find_one()

{u'_id': ObjectId('56e0c1a994e1beeae6709e51'),
 u'building': u'Og Albert Complex',
 u'city': u'Singapore',
 u'full': u'Og Albert Complex, Albert Street, 60, Singapore, Albert, Bugis, Victoria Street, Rochor, Central',
 u'postcode': u'189969',
 u'region1': u'Central',
 u'region2': u'Bugis, Victoria Street, Rochor',
 u'region3': u'Singapore',
 u'street': u'Albert Street, 60'}

> The raw data download in its entirety has a daunting 1,134,428 observations.

In [3]:
mr.find().count()

1134428

> However, only 34,015 have any sort of address information.

In [4]:
query = {'address': {'$exists': 1}}
print(mr.find(query).count())

34015


> Going further, it is surprising to find that many of them are not even in Singapore but in neighboring territories, such as Johor Bahru in Malaysia.

In [5]:
# count of records with an address, grouped by city and country
query = [
    {'$match': {'address': {'$exists': 1}}},
    {'$group': 
     {'_id':
      {'city': '$address.city',
       'country': '$address.country'},
      'count': { '$sum': 1 }
     }
    },
    {'$sort': {'count': -1}}
]
result_in = mr.aggregate(query)
result = [x for x in result_in]
for x in result[:5]:
    pprint(x)

{u'_id': {u'city': u'Johor Bahru'}, u'count': 13781}
{u'_id': {}, u'count': 8931}
{u'_id': {u'city': u'Singapore', u'country': u'SG'}, u'count': 8200}
{u'_id': {u'country': u'SG'}, u'count': 2124}
{u'_id': {u'city': u'Singapore'}, u'count': 850}


> To be searchable by our method, a record needs a street or a postcode. Restricting to records tagged with city Singapore or country SG, this leaves us the opportunity to polish 10,692 addresses.

In [6]:
# Singapore records (by city or country tag) that carry a street or a postcode.
# Matching on address.city or address.country already implies 'address' exists.
query = {"$and": [
            {"$or": [
                {'address.city': 'Singapore'},
                {'address.country': 'SG'}
            ]},
            {"$or": [
                {'address.street': {'$exists': 1}},
                {'address.postcode': {'$exists': 1}}
            ]},
        ]}
dat_in = list(mr.find(query))
len(dat_in)

10692
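
> As a sanity check, the filter above can also be written as a plain Python predicate and tried on individual documents without a database. This is only a sketch: `is_cleanable` is a hypothetical helper name, and the sample documents are made up for illustration.

```python
def is_cleanable(doc):
    """Return True if doc would pass the filter in the query above."""
    address = doc.get('address')
    if not isinstance(address, dict):
        return False
    # Tagged as Singapore either by city or by country code
    in_singapore = (address.get('city') == 'Singapore'
                    or address.get('country') == 'SG')
    # Needs at least one field we can search the postcode database with
    searchable = 'street' in address or 'postcode' in address
    return in_singapore and searchable


# Made-up sample documents, shaped like the records in mapdata_raw
samples = [
    {'address': {'city': 'Singapore', 'street': 'Albert Street'}},
    {'address': {'country': 'SG', 'postcode': '189969'}},
    {'address': {'city': 'Johor Bahru', 'street': 'Jalan Wong Ah Fook'}},
    {'address': {'city': 'Singapore'}},
    {'name': 'no address at all'},
]
flags = [is_cleanable(d) for d in samples]
print(flags)
```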

### 2. Cleaning process
> The proposed method is simple. Each record to be cleaned has a street address, a postcode, or both. A cleaner script attempts to find the unique database address from whatever information is available. For a street address, a regular expression built from the street and the house number, if available, is matched against the full addresses in the database. For a postcode, a simple lookup retrieves the unique address. There are several ways this can play out. For the 10,692 addresses available for cleaning, the exact outcomes are listed below:

> - Both street and postcode available:

>> both matched and the database results are the same: __3411__

>> both matched but the database results are not the same: __597__

>> only the postcode matched: __0__

>> only the street matched: __0__

> - Only postcode available and matched: __19__

> - Only postcode available and not matched: __0__

> - Only street available and matched: __6447__

> - Only street available and not matched: __0__

> - Exception: __218__
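
> The matching and tallying described above can be sketched roughly as follows. This is not the actual cleaner script: the in-memory `POSTCODE_DB` list stands in for the postcode collection, its second record is invented, and the helper names (`find_by_postcode`, `find_by_street`, `classify`) are hypothetical. It assumes database streets are formatted like `Albert Street, 60`, as in the sample record shown earlier.

```python
import re
from collections import Counter

# Tiny stand-in for the postcode collection. The first record follows the
# sample shown earlier; the second is invented for illustration.
POSTCODE_DB = [
    {'postcode': '189969', 'street': 'Albert Street, 60'},
    {'postcode': '000001', 'street': 'Example Road, 1'},
]


def find_by_postcode(postcode):
    """Exact postcode lookup."""
    return [r for r in POSTCODE_DB if r['postcode'] == postcode]


def find_by_street(street, housenumber=None):
    """Regex match on the database street field ("<street>, <number>")."""
    pattern = '^' + re.escape(street)
    if housenumber:
        pattern += r',\s*' + re.escape(housenumber) + '$'
    regex = re.compile(pattern, re.IGNORECASE)
    return [r for r in POSTCODE_DB if regex.search(r['street'])]


def classify(address):
    """Label one address with the outcome category it falls into."""
    street = address.get('street')
    postcode = address.get('postcode')
    if not street and not postcode:
        raise ValueError('need a street or a postcode to search')
    by_street = find_by_street(street, address.get('housenumber')) if street else []
    by_postcode = find_by_postcode(postcode) if postcode else []

    if street and postcode:
        if by_street and by_postcode:
            if by_street == by_postcode:
                return 'both matched, same result'
            return 'both matched, different result'
        if by_postcode:
            return 'only postcode matched'
        if by_street:
            return 'only street matched'
        return 'neither matched'
    if postcode:
        return 'postcode only, matched' if by_postcode else 'postcode only, not matched'
    return 'street only, matched' if by_street else 'street only, not matched'


# Made-up inputs, one per category we expect to hit
inputs = [
    {'street': 'Albert Street', 'housenumber': '60', 'postcode': '189969'},
    {'street': 'Albert Street', 'housenumber': '60', 'postcode': '000001'},
    {'postcode': '189969'},
    {'street': 'Example Road'},
]
tally = Counter(classify(a) for a in inputs)
print(tally)
```

> Note that a street with no house number matches every database address on that street, which is one way a street search and a postcode search can disagree.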

> Aside from the exceptions, which need to be investigated, all data points had matching information in the database. The next natural question concerns the 597 records that matched the database on both street and postcode but with conflicting results. After some digging, these were the reasons I could find:

>> In many cases the street name was ambiguous, with no house number to narrow the street match down to one unique result

>> In many cases the street field already includes the house number, so the regex search string either ends up with two house numbers attached or is inconsistent with the database address format

>> In some cases, words are simply misspelled

>> In some cases, the street does not appear anywhere in the address retrieved by postcode, which is a headache because it is uncertain which piece of information is the valid one
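
> One possible way to attack the second problem is to split an embedded house number out of the street field before building the search pattern. The sketch below is an assumption about what such a step could look like, not part of the cleaner script. It only handles two input shapes, `60 Albert Street` and `Albert Street, 60`, and `split_housenumber` and `to_db_format` are hypothetical names. A bare trailing number is deliberately left alone, because numbered streets such as Ang Mo Kio Avenue 3 would otherwise be cut apart.

```python
import re

# "60 Albert Street" or "60A Albert Street": number first, then the name
LEADING = re.compile(r'^\s*(\d+[A-Za-z]?)\s+(.+?)\s*$')
# "Albert Street, 60": name, a comma, then the number
TRAILING = re.compile(r'^\s*(.+?)\s*,\s*(\d+[A-Za-z]?)\s*$')


def split_housenumber(street):
    """Return (street_name, housenumber), with housenumber None if absent."""
    m = LEADING.match(street)
    if m:
        return m.group(2), m.group(1)
    m = TRAILING.match(street)
    if m:
        return m.group(1), m.group(2)
    return street.strip(), None


def to_db_format(street):
    """Rewrite a street field into the "<street>, <number>" database style."""
    name, number = split_housenumber(street)
    return name if number is None else '{}, {}'.format(name, number)


for raw in ['60 Albert Street', 'Albert Street, 60', 'Ang Mo Kio Avenue 3']:
    print(repr(raw), '->', repr(to_db_format(raw)))
```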

### 3. Conclusion

> Singapore's unique postcode system made our address cleaning strategy possible. Aside from the 218 exceptions, every one of the 10,692 search attempts found a corresponding database match, although 597 of them yielded inconsistencies that could be eliminated with more work. This project can be improved by fixing the problems found in section 2 and the program exceptions. It could go further by using xy coordinates to enhance incomplete street address information, resolve inconsistencies between street address and postcode search results, and spot potentially erroneous postcodes.
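
> Even before coordinates come into play, a cheap first pass at spotting erroneous postcodes is a format check. The sketch below assumes the six-digit postcode format described in the Wikipedia source, and `looks_like_sg_postcode` is a hypothetical helper name. It only checks the shape of a value, not whether that postcode actually exists in the database.

```python
import re

# Assumption: a valid Singapore postcode is exactly six ASCII digits
SIX_DIGITS = re.compile(r'^[0-9]{6}$')


def looks_like_sg_postcode(value):
    """Return True if value has the shape of a six-digit postcode."""
    if value is None:
        return False
    return bool(SIX_DIGITS.match(str(value).strip()))


# Made-up examples: one well-formed value and three malformed ones
candidates = ['189969', 'S189969', '18996', None]
checks = [looks_like_sg_postcode(c) for c in candidates]
print(checks)
```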

#### Source
- https://en.wikipedia.org/wiki/Postal_codes_in_Singapore
- http://sgp.postcodebase.com/