# <span style='color:grey'> Data Wrangling with MongoDB </span> 
______________________
## <span style='color:orange'> Wrangling OpenStreetMap Data </span> 
### <span style='color:grey'> Adan Olivera </span> 

   <span style='color:orange'> Map Area: São Paulo, SP, Brazil </span> 

- [Map area in OpenStreetMap](https://www.openstreetmap.org/relation/298285)
- [Metro extract in MapZen](https://mapzen.com/data/metro-extracts.html#sao-paulo_brazil)

In this report, we're going to detail the relevant aspects of the Wrangle OpenStreetMap Data project from Udacity's Data Analyst Nanodegree.

The goal of this project is to use data munging techniques (e.g. assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity) to clean OpenStreetMap data for a part of the world, and then import the cleansed data into a MongoDB database to run exploratory queries against it.

I chose the São Paulo region in Brazil for this project because it is where I live. It is the country's main economic region and its most populous one, with 20 million people as of 2015.


____________

## <span style='color:orange'> Contents </span> 


1.Problems encountered in the map

    1.1 Street names
    1.2 Post codes
    1.3 Phone numbers
   
   
2.Data Overview
    
    2.1 File sizes
    2.2 Number of documents 
    2.3 Number of nodes
    2.4 Number of ways
    2.5 Number of unique users
    2.6 Number of pizza or japanese restaurants in the area
    
    
3.Additional Ideas
    
    3.1 Improving Cycling Information
    3.2 Additional data exploration using MongoDB queries

4.Conclusion

5.References

____________

## <span style='color:orange'> 1. Problems Encountered in the Map </span> 

After downloading the metro extract .osm file for the city of São Paulo and running some auditing scripts against it (listed in data_auditing_scripts.py), I noticed validity, accuracy and uniformity problems in some sections of the data:
1. **Street names:** some street names contained invalid street types and some were incomplete;
2. **Post codes:** some post codes were inaccurate, with missing or extra characters. They were also formatted inconsistently, and some were outright invalid;
3. **Phone numbers:** I also found problems with the structure of phone numbers. Some were inaccurate, some were invalid, and they generally lacked uniformity.

### <span style='color:grey'> 1.1 Street names </span> 

From the auditing script detailed below (adapted from lesson 6), I found the following problems with street name strings:

- **Over-abbreviated street names**: there were several abbreviation variants for each expected street type, making the data inconsistent. To fix this, I mapped every abbreviation variant to its full street type and, using a Python script, replaced the abbreviated substring in each problematic street name, so that "Al. Santos" became "Alameda Santos", for example.

- **Informal and unexpected street types**: some elements had informal street types (e.g. "passagem", "via", "acost"), which strictly speaking wouldn't be used in street names. Since these types may still be useful despite being informal, I chose to keep them as they were and simply added them to my list of expected types.

- **Incomplete street names**: some names had their type missing and contained only the street name itself, or just part of it. For these cases, I searched for the names manually on Google Maps to find their street types and confirm their accuracy. I then mapped them and, using the same script that fixed the abbreviations, replaced the incomplete strings with accurate and valid ones that include the street type (e.g. "Alfonso" became "Avenida Alfonso").
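
The replacement step described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the cleaning script actually used in the project: the function name `clean_street_name`, the two dictionaries and their entries are assumptions chosen to mirror the examples in the text.

```python
import re

# Hypothetical mapping of abbreviations to full street types (illustrative only;
# the real mapping would be built from the audit output).
ABBREVIATIONS = {
    "Al": "Alameda",
    "Av": "Avenida",
}

# Hypothetical mapping for names whose street type was missing and had to be
# looked up by hand.
INCOMPLETE_NAMES = {
    "Alfonso Bovero": "Avenida Alfonso Bovero",
}

# First word of the name, with an optional trailing dot left out of the group.
first_token_re = re.compile(r"^(\S+?)\.?(?=\s|$)")


def clean_street_name(name):
    """Return the street name with its type spelled out when a fix is known."""
    name = name.strip()
    if name in INCOMPLETE_NAMES:
        return INCOMPLETE_NAMES[name]
    match = first_token_re.match(name)
    if match and match.group(1) in ABBREVIATIONS:
        # Swap only the leading abbreviation (and its dot), keep the rest as is.
        return ABBREVIATIONS[match.group(1)] + name[match.end():]
    return name
```

Names whose first word is already an expected type, such as "Rua Augusta", pass through unchanged, so the function can be applied to every `addr:street` value.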

In [10]:
### Script used to audit ways for unexpected street types, collecting the occurrences of each unexpected type.
### Output is a dictionary where the keys are the unexpected types and the values are sets of occurrences 
### for each unexpected type, as can be seen below.

import xml.etree.cElementTree as ET
import pprint
import re
from collections import defaultdict

OSMFILE = "sao-paulo_brazil.osm"
# Matches the first word of the street name, which is where the street type sits in Portuguese
street_type_re = re.compile(r'\S+\.?\b', re.IGNORECASE)

expected = ["Rua", "Avenida", "Alameda", "Quarteirão", "Quadra", "Lugar", "Viela", "Faixa", "Estrada",
                "Trilha", "Praça", "Passarela", 'Acesso', 'Largo', "Rodovia", "Travessa"]

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type.encode('utf-8','ignore')  not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that the child <tag> elements are already parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types

st_types = audit(OSMFILE)
pprint.pprint(dict(st_types))

{u'1': set([u'1\xaa Travessa da Estrada do Morro Grande']),
 '3': set(['3']),
 'AC': set(['AC SAO BERNARDO DO CAMPO']),
 'Acost': set(['Acost. Direita KM 12,0 /Marg.Tie. Expr.']),
 u'Al': set(['Al. Barros',
             'Al. Jauaperi',
             u'Al. Joaquim Eug\xeanio de Lima',
             u'Al. Jos\xe9 Maria Lisboa',
             'Al. Lorena',
             'Al. Pamplona',
             'Al. Santos',
             u'Al. Sarutai\xe1']),
 'Alfonso': set(['Alfonso Bovero']),
 'Antonio': set(['Antonio Caputo']),
 u'Av': set(['Av C',
             'Av Dr. Silvio de Campos',
             u'Av Guap\xe9',
             u'Av Jac\xfa Pessego / Nova Trabalhadores',
             u'Av. Agenor C. de Magalh\xe3es',
             u'Av. Ant\xf4nio Joaquim de Moura Andrade',
             'Av. Augusto Zorzi Baradel Furquim',
             'Av. Comendador Masatoshi Shinmyo',
             'Av. Francisco Matarazzo',
             u'Av. Francisco N\xf3brega Barbosa',
             'Av. Presidente Juscelino Kub

### <span style='color:grey'> 1.2 Post codes </span> 

From the post code auditing script shown below (adapted from lesson 6), I noticed several issues with the post codes registered in the OSM file:

- **Inconsistent formatting**: many codes were written with dots, extra hyphens, no hyphen or white spaces instead of the single hyphen required by the national standard ('00000-000'). The exceptional values (with extra hyphens, dots or spaces) were mapped and replaced along with the simpler ones (only missing the hyphen) using a cleaning script in Python.

- **Invalid values**: some codes contained text instead of only digits (e.g. "CEP", "Igreja Presbiteriana Vila Gustavo"). As there weren't many, all of them were replaced by mapping the correct values, found either through Google searches or by intuition. There was also one post code apparently from a region outside the analysed map. Once the data was imported into MongoDB, I searched for the document containing that post code and discovered that the error was just a typo. I then updated its value with the correct one, using the commands below:

```
# find the document with the incorrect code
db.sao_paulo_brazil.find_one({"address.postcode":"25450-000"})

# replace the incorrect code
db.sao_paulo_brazil.update({"address.postcode":"25450-000"},
                           {"$set": {"address.postcode":"02545-000"}})

```

- **Incomplete or inaccurate codes**: There were incomplete post codes (missing digits) and some with extra digits. For most of them, I was able to find the correct code by searching for intuitive variations on the national postal agency website ([Correios](http://www.buscacep.correios.com.br/sistemas/buscacep/)). These were cases where an additional 0 was added to the code (e.g. '042010-000'), where 0s were missing (e.g. '09380') or where the correct code was surrounded by incorrect digits (e.g. '09890-1 09890-080 00'). They were mapped and replaced using the cleaning script mentioned before.
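
A post code cleaning step along the lines described above might look like the sketch below. It is only an illustration under stated assumptions: the names `clean_post_code` and `MANUAL_FIXES` are hypothetical, and the single manual entry is taken from the example in the text. Values that cannot be fixed mechanically are returned as `None` so they can be reviewed by hand.

```python
import re

# Hypothetical manual corrections for values that need human judgement.
# The entry below comes from the example in the text; a real mapping would
# be filled in after looking the codes up.
MANUAL_FIXES = {
    "09890-1 09890-080 00": "09890-080",
}


def clean_post_code(raw):
    """Return the code as '00000-000', or None when it needs manual review."""
    value = raw.strip()
    if value in MANUAL_FIXES:
        return MANUAL_FIXES[value]
    # Drop dots, spaces, hyphens and any text, keeping only the digits.
    digits = re.sub(r"\D", "", value)
    if len(digits) == 8:
        return digits[:5] + "-" + digits[5:]
    # Too few or too many digits: cannot be fixed without a lookup.
    return None
```

Codes with missing or extra digits, such as '09380' or '042010-000', come back as `None` here, which matches the cases that had to be looked up on the Correios website.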

In [32]:
### Script used to audit postcodes by grouping them into different problematic scenarios (e.g. extra or 
### missing characters), and listing examples for each case.
### Output is a dictionary of tuples with one element being the count of occurrences in each scenario 
### and the other being a list of up to 10 examples.

import xml.etree.cElementTree as ET
import pprint
import re

OSMFILE = "sao-paulo_brazil.osm"
post_code_re = re.compile(r'\d{5}\-\d{3}', re.IGNORECASE)
correct = []
extra_chars = []
missing_chars = []
missing_hyphen = []
wrong_region = []

def audit_post_code(post_code_types, post_code):
    post_code = post_code.encode('ascii','ignore') 
    if ("-" not in post_code) and (len(post_code) < 8):
        missing_chars.append(post_code)
        post_code_types["missing_chars"] =(len(missing_chars), missing_chars[:10])
    
    elif (("-" not in post_code) and (len(post_code) > 8) or (len(post_code) > 9)):
        extra_chars.append(post_code)
        post_code_types["extra_chars"] = (len(extra_chars), extra_chars[:10])
    
    elif "-" not in post_code:
        missing_hyphen.append(post_code)
        post_code_types["missing_hyphen"] = (len(missing_hyphen), missing_hyphen[:10])
    
    elif (post_code[0] != "0") and (post_code[0] != "1"):
        wrong_region.append(post_code)
        post_code_types["wrong_region"] = (len(wrong_region), wrong_region[:10])
    
    elif re.search(post_code_re, post_code) is not None:
        correct.append(post_code)
        post_code_types["correct"] = (len(correct), correct[:10])
        
def is_post_code(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit(osmfile):
    post_code_types = {"missing_chars":0, "extra_chars":0, "missing_hyphen":0, "wrong_region":0, "correct":0}
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that the child <tag> elements are already parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_post_code(tag):
                        audit_post_code(post_code_types, tag.attrib['v'])
    return post_code_types

pc_types = audit(OSMFILE)
pprint.pprint(pc_types)

{'correct': (8099,
             ['03331-000',
              '03302-000',
              '03164-010',
              '05461-010',
              '02120-020',
              '03403-003',
              '06455-000',
              '01301-000',
              '01301-000',
              '01318-001']),
 'extra_chars': (97,
                 ['010196-200',
                  '12.216-540',
                  '13.308-911',
                  '023630000',
                  '04783 020',
                  '042010-000',
                  '042010-000',
                  '03032.030',
                  '03032.030',
                  '03032.030']),
 'missing_chars': (3, ['09380', '05410', '12242']),
 'missing_hyphen': (194,
                    ['04345000',
                     '12315280',
                     '01309010',
                     '01309000',
                     '05006000',
                     '01304001',
                     '02615020',
                     '05025010',
                     '09930270

### <span style='color:grey'> 1.3 Phone numbers </span> 

Phone numbers posed a more complex challenge than the previous data types. Through the exploratory auditing script below, I found many unexpected cases and numerous formatting variations. I experimented with different cleaning alternatives until I found a helpful Python module called "[phonenumbers](https://github.com/daviddrysdale/python-phonenumbers)" with functions for treating phone strings. I then used it to parse the phone number strings into a consistent international format. The main problems encountered with phone numbers were the following:

- **Inconsistent formatting**: There were numbers with multiple hyphens (e.g. '55-11-37120713'), parentheses (e.g. '+55 (11) 3583-1810'), dots (e.g. '011-2986.8540'), slashes (e.g. '+55 11 2949-1844 / 11 99602-0973'), white spaces (e.g. '+55 11 3322 2200') and some with none of these separators at all (e.g. '551151829947'). All these cases were converted into a consistent international format ('+00 00 0000-0000') using functions from the phonenumbers module in a custom Python cleaning script.

- **More than one number per tag**: Many phone number strings were actually composed of multiple phone numbers (e.g. '+55-11-32274554 +55-11-997537015' or '11 2959-3594 / 2977-2491'). These were also parsed using functions from the phonenumbers module and converted into a list with the individual numbers in international format as elements.

- **Missing area codes**: Some numbers had codes missing: either the country code (e.g. '011-2986.8540') or the local area code (e.g. '5514-7964'). In some cases, the country code "plus" sign was missing (e.g. '55-11-37120713') or even misplaced (e.g. '55+ (11) 3670-8000'). As the phonenumbers module can't correctly parse numbers without local codes, I treated these strings to fix the problems before running them through the parser. After being treated, they were parsed into the international format.

- **Incomplete and inaccurate numbers**: I also found numbers missing digits (e.g. '+55 11190' or '11193'), with additional digits (e.g. '+55 11 1 3135 4156') or with text among the digits (e.g. '+55 11 2949-1844 / 11 99602-0973 com Sander'). As there weren't many of these, they were mapped to their respective correct forms whenever possible and then fixed before being parsed. When there was text among the digits, a function from the phonenumbers module was used to filter out the numbers appropriately. Numbers that couldn't be guessed were parsed into empty strings.
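
To make the normalization steps above concrete, here is a standard-library sketch. It deliberately does not use the phonenumbers module that the project relied on; it swaps in plain regular expressions so that the logic is visible and the example runs without extra dependencies. The function names, the default area code of 11 for numbers lacking one, and the choice to leave 0800 numbers untouched are all assumptions made for illustration.

```python
import re

COUNTRY_CODE = "55"
# Assumption: a number without an area code belongs to the city itself.
DEFAULT_AREA_CODE = "11"


def split_numbers(raw):
    """Split a tag value holding several numbers on '/', ';' or a space before '+'."""
    parts = re.split(r"\s*[/;]\s*|\s+(?=\+)", raw)
    return [part for part in parts if part.strip()]


def normalize(part):
    """Return one number as '+55 AA NNNN-NNNN', or '' when it cannot be fixed."""
    # Keeping only digits also discards stray text such as names.
    digits = re.sub(r"\D", "", part)
    if digits.startswith("0800"):
        # Toll-free numbers have no area code; this sketch leaves them as they are.
        return part.strip()
    if digits.startswith(COUNTRY_CODE) and len(digits) in (12, 13):
        digits = digits[len(COUNTRY_CODE):]
    # Drop a leading trunk zero, as in '011-...'.
    digits = digits.lstrip("0")
    if len(digits) in (8, 9):
        digits = DEFAULT_AREA_CODE + digits
    if len(digits) not in (10, 11):
        return ""
    area, local = digits[:2], digits[2:]
    return "+%s %s %s-%s" % (COUNTRY_CODE, area, local[:-4], local[-4:])


def clean_phone(raw):
    """Return the list of normalized numbers found in one phone tag value."""
    cleaned = [normalize(part) for part in split_numbers(raw)]
    return [number for number in cleaned if number]
```

Numbers with extra digits, such as '+55 11 1 3135 4156', are not detected by this sketch and would still need the manual mapping described above.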

In [39]:
### Script used to audit phone numbers by grouping them into different problematic scenarios (e.g. extra or missing characters), 
### and listing examples for each case.
### Output is a dictionary of tuples with one element being the count of occurrences in each scenario and 
### the other being a list of up to 10 examples.

import xml.etree.cElementTree as ET
import pprint
import re

OSMFILE = "sao-paulo_brazil.osm"

other = []
extra_chars = []
missing_chars = []
missing_hyphen = []

def audit_phone_number(phone_number_types, phone_number):
    if ("-" not in phone_number) and (len(phone_number) < 8):
        missing_chars.append(phone_number)
        phone_number_types["missing_chars"] =(len(missing_chars), missing_chars[:10])

    elif ("-" not in phone_number) and (len(phone_number) > 8):
        extra_chars.append(phone_number)
        phone_number_types["extra_chars"] = (len(extra_chars), extra_chars[:10])

    elif "-" not in phone_number:
        missing_hyphen.append(phone_number)
        phone_number_types["missing_hyphen"] = (len(missing_hyphen), missing_hyphen[:10])

    else:
        other.append(phone_number)
        phone_number_types["other"] = (len(other), other[:10])


def is_phone_number(elem):
    return (elem.attrib['k'] == "phone")


def audit(osmfile):
    phone_number_types = {"missing_chars":0, "extra_chars":0, "missing_hyphen":0, "other":0}
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that the child <tag> elements are already parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_phone_number(tag):
                        audit_phone_number(phone_number_types, tag.attrib['v'])
    return phone_number_types


pn_types = audit(OSMFILE)
pprint.pprint(pn_types)

{'extra_chars': (1204,
                 ['+55 11 3104 0678',
                  '0800 772 3633',
                  '+55 11 33726800',
                  '551151829947',
                  '+55 11 3412 7611',
                  '+55 11 22920977',
                  '55 (11) 32592776',
                  '+55 11 32728280',
                  '+55 11 3289 1586',
                  '+55 11 31710311']),
 'missing_chars': (8,
                   ['+55 11',
                    '+55 11',
                    '+55 11',
                    '+55 11',
                    '193',
                    '190',
                    '193',
                    '+55 11']),
 'missing_hyphen': (1, ['26455667']),
 'other': (416,
           ['+55 11 2292-2365',
            '+55 13 3495-5504',
            '+55 11 4648-1048',
            '+55 11 4191-8707',
            '+55 11 3255-2817',
            '55-11-3222-1007',
            '+55 11 2028-1010',
            '+55 11 2692-0482',
            '+55 11 2533-9791',
          

## <span style='color:orange'> 2. Data Overview </span> 

After running auditing scripts and mapping problems to be fixed, I defined a set of scripts to clean the data before importing it into MongoDB.

To import the OSM XML, I first converted it into a JSON file and then used the mongoimport tool to bulk insert its documents into a database named "osm" and a collection named "sao_paulo_brazil".

Along with the XML to JSON conversion scripts, I also ran data cleaning scripts to correct the problems mentioned in the first section before writing the values to the converted JSON. Only elements of type “node” and “way” were imported, and the data model used for the documents follows the example below:

```
{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}
```
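
As a rough illustration of how an OSM element could be shaped into a document following this model, consider the sketch below. It is an assumption-laden outline rather than the conversion script used in the project: the function name `shape_element` and the `CREATED` list are hypothetical, the cleaning functions are left out, and `nd` references inside ways are ignored.

```python
import json
import xml.etree.ElementTree as ET

# Attributes grouped under the "created" sub-document in the data model.
CREATED = ["version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    """Turn a <node> or <way> element into a dict; return None for other tags."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc["created"][key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    address = {}
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if key.startswith("addr:"):
            sub_key = key[len("addr:"):]
            if ":" not in sub_key:  # skip deeper keys such as addr:street:name
                address[sub_key] = value
        else:
            # Top-level attributes win over tags that share their key.
            doc.setdefault(key, value)
    if address:
        doc["address"] = address
    return doc


# Made-up sample element, only to show the resulting shape.
sample = ET.fromstring(
    '<node id="1" lat="-23.55" lon="-46.63" version="2" changeset="10" '
    'timestamp="2013-08-03T16:43:42Z" user="someone" uid="7" visible="true">'
    '<tag k="addr:street" v="Alameda Santos"/>'
    '<tag k="amenity" v="restaurant"/>'
    '</node>'
)
sample_doc = shape_element(sample)
sample_line = json.dumps(sample_doc)  # one JSON document per line suits mongoimport
```

In a full conversion, the cleaning functions for street names, post codes and phone numbers would be applied to the tag values before they are written to the document.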

After the data was imported, I ran some queries to explore it. Here are some basic statistics extracted in this exploration, and the queries used to gather them:

In [51]:
## Defining functions to be used for queries

def get_db(db_name):
    #creates a connection and selects a database
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def aggregate(db, pipeline):
    #runs the aggregation pipeline and iterates through documents in it
    return [doc for doc in db.sao_paulo_brazil.aggregate(pipeline)]

db = get_db('osm')

###  <span style='color:grey'> 2.1 File sizes </span> 



```
> sao-paulo_brazil.osm ......... 389.3 MB
> sao-paulo_brazil.osm.json .... 562.1 MB

```

###  <span style='color:grey'> 2.2 Number of documents </span> 

In [40]:
db.sao_paulo_brazil.find().count()

1999296

###  <span style='color:grey'> 2.3 Number of nodes </span> 

In [43]:
db.sao_paulo_brazil.find({"type":"node"}).count()

1757483

###  <span style='color:grey'> 2.4 Number of ways </span> 

In [42]:
db.sao_paulo_brazil.find({"type":"way"}).count()

241756

###  <span style='color:grey'> 2.5 Number of unique users </span> 

In [41]:
len(db.sao_paulo_brazil.distinct("created.user"))

1655

###  <span style='color:grey'> 2.6 Number of pizza or japanese restaurants in the area </span> 

In [62]:
db.sao_paulo_brazil.find({"amenity":"restaurant",
                              "cuisine": {"$in": ["pizza", "japanese"]}
                             }).count()

134

## <span style='color:orange'> 3. Additional Ideas </span> 

###  <span style='color:grey'> 3.1 Improving Cycling Information </span> 

Cycling information for this region of the map is very scarce. Only about 2% of ways have bicycle information of any type, and only 82 places have registered bicycle parking, a mere 0.8% of the total number of amenities.

Given the importance of bicycles as a clean transportation alternative and as a solution to the local traffic problems, it would be beneficial to make sure that all ways have relevant bicycle information. Increasing the quality and volume of cycling information would make it easier to use bicycles in the city and would enable the development of software applications to support their use.

There are some alternatives that would enable us to bridge this gap:

- The solution should involve the local OSM community or local software engineers motivated by the cause. We could contact local cycling groups and find engineers or developers among them who wish to volunteer and help.
- The city government is another party that could contribute, by promoting awareness or even with its own resources. It has demonstrated interest in the subject with recent cycling projects in the city, like the construction of cycle lanes.
- Local cycling-related businesses are another group that could help. They have a direct interest in increasing bicycle use and could contribute with resources or by raising awareness in the community.

With volunteers or people commercially involved in improving cycling data in the region, the task could be initiated by first listing the ways without bicycle information. Then, starting with the ways with the highest traffic, they could map the cycling infrastructure and populate OSM with their findings. Another way to tackle this would be to crowdsource the input of information from cyclists through a smartphone app, for example. OpenStreetMap provides [guidelines in their wiki](https://wiki.openstreetmap.org/wiki/Bicycle) for uploading bicycle information.
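
The first step, listing the ways without bicycle information, could start from an aggregation pipeline like the one sketched below. The function name is hypothetical, and since the dataset holds no traffic counts, grouping by the `highway` value is only an assumed rough proxy for deciding which kinds of ways to survey first.

```python
def ways_missing_bicycle_pipeline(limit=10):
    """Build a pipeline counting ways that lack bicycle data, per highway type."""
    return [
        # Keep only road-like ways that carry neither a bicycle nor a cycleway tag.
        {"$match": {"type": "way",
                    "highway": {"$exists": True},
                    "bicycle": {"$exists": False},
                    "cycleway": {"$exists": False}}},
        # Count the remaining ways for each highway type.
        {"$group": {"_id": "$highway", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]
```

The pipeline can be run with the `aggregate(db, pipeline)` helper defined in section 2, for example `aggregate(db, ways_missing_bicycle_pipeline())`.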

####  <span style='color:grey'> - Number and share of ways with bicycle information </span> 

In [165]:
number_ways = db.sao_paulo_brazil.find({"type":"way"}).count()
number_bicycle = db.sao_paulo_brazil.find({"type":"way","bicycle":{"$exists":1}}).count()
share_ways_bicycle = number_bicycle/float(number_ways)
print "number of bicycle tags is %1.0f" %(number_bicycle) 
print "bicycle information as a percentage of ways is %1.4f" %(share_ways_bicycle*100) + "%"

number of bicycle tags is 4178
bicycle information as a percentage of ways is 1.7282%


####  <span style='color:grey'> - Number and share of ways with cyclelanes </span> 

In [176]:
number_high_cycleways = db.sao_paulo_brazil.find({"type":"way","highway":"cycleway"}).count()
number_way_cycleways = db.sao_paulo_brazil.find({"type":"way","cycleway":{"$exists":1}}).count()
number_cycleways = number_high_cycleways + number_way_cycleways
share_ways_cycleways = number_cycleways/float(number_ways)
print "number of cycleways is %1.0f" %(number_cycleways) 
print "cycleways as a percentage of ways is %1.4f" %(share_ways_cycleways*100) + "%"

number of cycleways is 798
cycleways as a percentage of ways is 0.3301%


####  <span style='color:grey'> - Number of places with bicycle parking, bicycle rental and compressed air </span> 

In [178]:
print "number of places with bicycle parking is %1.0f" %(db.sao_paulo_brazil.find({"amenity":"bicycle_parking"}).count())
print "number of places with bicycle rental is %1.0f" %(db.sao_paulo_brazil.find({"amenity":"bicycle_rental"}).count())
print "number of places with compressed air is %1.0f" %(db.sao_paulo_brazil.find({"amenity":"compressed_air"}).count()) 

number of places with bicycle parking is 82
number of places with bicycle rental is 95
number of places with compressed air is 1


###  <span style='color:grey'> 3.2 Additional data exploration using MongoDB queries </span> 

####  <span style='color:grey'> - Top 10 amenities in the region </span> 

In [195]:
pipeline = [{"$match":{"amenity":{"$exists":1}}}, 
            {"$group":{"_id":"$amenity","count":{"$sum":1}}}, 
            {"$sort":{"count":-1}}, 
            {"$limit":10}]

result = aggregate(db, pipeline)
pprint.pprint(result)

[{u'_id': u'fuel', u'count': 1455},
 {u'_id': u'parking', u'count': 1140},
 {u'_id': u'restaurant', u'count': 930},
 {u'_id': u'school', u'count': 843},
 {u'_id': u'bank', u'count': 768},
 {u'_id': u'place_of_worship', u'count': 487},
 {u'_id': u'pharmacy', u'count': 349},
 {u'_id': u'fast_food', u'count': 336},
 {u'_id': u'pub', u'count': 269},
 {u'_id': u'hospital', u'count': 248}]


####  <span style='color:grey'> - Top 10 leisure options in the region </span> 

In [199]:
pipeline = [{"$match":{"leisure":{"$exists":1}}}, 
            {"$group":{"_id":"$leisure", "count":{"$sum":1}}},        
            {"$sort":{"count":-1}}, 
            {"$limit":10}]

result = aggregate(db, pipeline)
pprint.pprint(result)

[{u'_id': u'park', u'count': 2498},
 {u'_id': u'pitch', u'count': 1590},
 {u'_id': u'swimming_pool', u'count': 336},
 {u'_id': u'sports_centre', u'count': 261},
 {u'_id': u'garden', u'count': 132},
 {u'_id': u'playground', u'count': 94},
 {u'_id': u'stadium', u'count': 40},
 {u'_id': u'common', u'count': 23},
 {u'_id': u'recreation_ground', u'count': 22},
 {u'_id': u'track', u'count': 16}]


####  <span style='color:grey'> - Top 10 types of shopping places </span> 

In [200]:
pipeline = [{"$match":{"shop":{"$exists":1}}}, 
            {"$group":{"_id":"$shop", "count":{"$sum":1}}},        
            {"$sort":{"count":-1}}, 
            {"$limit":10}]

result = aggregate(db, pipeline)
pprint.pprint(result)

[{u'_id': u'supermarket', u'count': 728},
 {u'_id': u'yes', u'count': 711},
 {u'_id': u'bakery', u'count': 389},
 {u'_id': u'car', u'count': 217},
 {u'_id': u'car_repair', u'count': 191},
 {u'_id': u'clothes', u'count': 176},
 {u'_id': u'convenience', u'count': 162},
 {u'_id': u'mall', u'count': 137},
 {u'_id': u'hardware', u'count': 117},
 {u'_id': u'fashion', u'count': 99}]


####  <span style='color:grey'> - Top 10 building types </span> 

In [204]:
pipeline = [{"$match":{"building":{"$exists":1}}}, 
            {"$group":{"_id":"$building","count":{"$sum":1}}}, 
            {"$sort":{"count":-1}}, 
            # Take the 11 largest groups, then drop the first one, leaving 10
            {"$limit":11},
            {"$skip":1}]

result = aggregate(db, pipeline)
pprint.pprint(result)

[{u'_id': u'house', u'count': 5096},
 {u'_id': u'apartments', u'count': 1774},
 {u'_id': u'residential', u'count': 1666},
 {u'_id': u'industrial', u'count': 1471},
 {u'_id': u'roof', u'count': 1295},
 {u'_id': u'commercial', u'count': 455},
 {u'_id': u'warehouse', u'count': 405},
 {u'_id': u'school', u'count': 241},
 {u'_id': u'retail', u'count': 192},
 {u'_id': u'public', u'count': 173}]


####  <span style='color:grey'> - 5 most popular cuisines in the region </span> 

In [205]:
pipeline = [{"$match":{"amenity":"restaurant"}}, 
            {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},        
            {"$sort":{"count":-1}}, 
            {"$limit":6},
            {"$skip":1}]

result = aggregate(db, pipeline)
pprint.pprint(result)

[{u'_id': u'regional', u'count': 136},
 {u'_id': u'pizza', u'count': 82},
 {u'_id': u'japanese', u'count': 52},
 {u'_id': u'italian', u'count': 23},
 {u'_id': u'burger', u'count': 18}]


## <span style='color:orange'> 4. Conclusion </span> 

In this project we have seen that OpenStreetMap is a powerful tool for exploring maps: it is open source, well documented and, though far from complete, it holds very detailed information. 

Because it is an open platform, different people contribute in different ways, often with wrong, mistyped or incomplete data, which leads to inaccurate, invalid or inconsistent datasets. Part of the dirty data in this extract, more specifically street names, post codes and phone numbers, was cleansed and standardized during this project, making the data for the city of São Paulo more accurate and uniform. 

With the power and simplicity of MongoDB, we were able to explore the region, discovering interesting statistics about it and revealing its scarcity of cycling-related data, which could be used to improve transportation in the region if it were more complete.

## <span style='color:orange'> 5. References </span> 

- [OpenStreetMap Wiki](http://wiki.openstreetmap.org/wiki/Main_Page)
- [Correios website](http://www.buscacep.correios.com.br/sistemas/buscacep/)
- [Bicycles section in OpenStreetMap's wiki](https://wiki.openstreetmap.org/wiki/Bicycle)
- [phonenumbers Python Library](https://github.com/daviddrysdale/python-phonenumbers)
- [Google Maps](https://www.google.com.br/maps)
- [MongoDB 3.2 Manual](https://docs.mongodb.org/manual/)