# Introduction

I have lived in the Boston area for the last few years, since grad school, and the dataset analyzed in this project covers that area. It was exported from [OpenStreetMap](http://www.openstreetmap.org/#map=17/40.71652/-73.94470&layers=H). The analysis included the following steps:

* **Question Phase:** This phase involves asking general questions about the dataset. The questions frame the problem we are trying to solve.
* **Data Auditing:** This phase involves auditing the data to identify anomalies and patterns. For example, the street map data could contain street names with special characters in them, or Boston-area zip codes that contain alphabetical characters.
* **Data Cleansing:** This phase involves classifying the anomalies identified in the previous step and devising approaches to clean up the data. The cleansing can be done manually or programmatically. This project uses both approaches: the focus is mostly on cleansing the data programmatically, but in certain cases a manual review is also needed.

Data Auditing and Data Cleansing are repeated iteratively until a fair number of data anomalies have been identified and cleansed appropriately.

* **Conclusion:** This phase involves drawing conclusions about the dataset, based on the auditing and cleansing steps.
* **Communication:** This phase involves communicating the results of the analysis to the audience. In a real-life scenario this would be the business users who make decisions based on the dataset analysis.

In addition, this project involves importing the dataset into [MongoDB](https://www.mongodb.com/) and then running some of MongoDB's aggregation commands to further analyze the imported data.






# Question Phase



# Data Auditing
## Identifying the Tags and Counting the Occurrences of Each Tag

This step involves an initial analysis of the dataset and an assessment of the XML nodes, including a count of the number of instances of each node type. While this step is a good start to the data auditing process, it leaves many questions unanswered. It does confirm that the XML is well formed, because the XML parser (`ET.iterparse`) is able to parse the entire file.
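The tag count described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the code used in the project: the helper name `count_tags` and the tiny sample document are made up, and only the use of `ET.iterparse` comes from the text.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element tag occurs in an XML file or file object."""
    counts = Counter()
    # iterparse streams the document, so a large OSM export is never
    # loaded into memory all at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop children that have already been counted
    return counts


# Made-up OSM fragment, used only to show the shape of the result.
sample = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='amenity' v='cafe'/></node>"
    b"<node id='2'/>"
    b"<way id='3'><nd ref='1'/><nd ref='2'/></way>"
    b"</osm>"
)
tag_counts = count_tags(sample)
print(dict(tag_counts))
```

For the real dataset, the same function could be called with the path to the export, for example `count_tags('Boston.osm')`. If the parser reaches the end of the file without raising `ET.ParseError`, the file is well formed.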

The output of the step above is the **"Boston.osm.json"** file, which is later imported into MongoDB. In addition, the street names, phone numbers and zip codes were cleaned up as part of the import.
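The cleaning rules themselves are not shown in this section. As a hedged illustration of what a zip code cleanup might look like, the sketch below assumes that a valid value is five digits with an optional four-digit extension. The helper name `clean_postcode` and the sample values are hypothetical.

```python
import re

# Assumption: a valid postcode is five digits, optionally followed by "-1234".
POSTCODE_RE = re.compile(r"^(\d{5})(-\d{4})?$")


def clean_postcode(raw):
    """Return the five-digit postcode in raw, or None if it needs manual review."""
    value = raw.strip()
    match = POSTCODE_RE.match(value)
    if match:
        return match.group(1)
    # Fall back to the first standalone five-digit run, e.g. "MA 02116".
    found = re.search(r"(?<!\d)\d{5}(?!\d)", value)
    return found.group(0) if found else None


# Made-up examples of the kinds of values an audit might surface.
samples = ["02116", "02116-1234", "MA 02116", "Boston", "0211"]
cleaned = [clean_postcode(s) for s in samples]
print(cleaned)
```

Values that come back as `None` cannot be repaired automatically under these assumptions and would fall into the manual review bucket.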

# Setting up for Mongo Data Analysis

In [None]:
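import pymongo
from pymongo import MongoClient
import pprint

# Connect to the MongoDB server on the default host and port (localhost:27017).
client = MongoClient()
# Use the "boston" database.
db = client.boston
print(db)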

# Data Analysis/Data Exploration in MongoDB

# Assessing the Size of the Original OSM File and the JSON File 

In [None]:
import os

# os.path.getsize returns bytes; divide by 1e6 to report megabytes.
print('The original OSM file is {:.1f} MB'.format(os.path.getsize('Boston.osm') / 1.0e6))
print('The JSON file is {:.1f} MB'.format(os.path.getsize('Boston.osm.json') / 1.0e6))

In [None]:
boston = db['bostonc']

# Number of Documents

In [None]:
# Cursor.count() was removed in PyMongo 4; count_documents needs PyMongo 3.7 or later.
boston.count_documents({})

# Number of Nodes and Ways

In [None]:
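# count_documents needs PyMongo 3.7 or later; it replaces the removed Cursor.count().
print("Number of nodes: {}".format(boston.count_documents({'tag': 'node'})))
print("Number of ways: {}".format(boston.count_documents({'tag': 'way'})))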

# Top 10 Contributors along with the UserNames

In [None]:
# Group documents by contributing user, count each group, and keep the ten largest.
result = boston.aggregate([
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
])

pprint.pprint(list(result))
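To sanity-check the pipeline on a handful of documents without a database, the same group, sort and limit logic can be sketched in plain Python. The helper name `top_contributors` and the sample documents are made up for illustration; only the `created.user` field comes from the query above.

```python
from collections import Counter


def top_contributors(documents, limit=10):
    """Mimic $group on created.user, $sort by count descending, then $limit.

    Unlike MongoDB, which groups documents without the field under a null
    _id, this sketch simply skips them.
    """
    counts = Counter(
        doc["created"]["user"]
        for doc in documents
        if "user" in doc.get("created", {})
    )
    return counts.most_common(limit)


# Made-up documents in the same shape as the imported data.
sample_docs = [
    {"created": {"user": "alice"}},
    {"created": {"user": "bob"}},
    {"created": {"user": "alice"}},
    {"created": {"user": "alice"}},
    {"tag": "node"},  # no "created" field, so it is skipped
]
top = top_contributors(sample_docs, limit=2)
print(top)
```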

# List of Top 50 Amenities in the Boston Area

In [None]:
# Keep only documents that have an amenity, then count documents per amenity type.
result = boston.aggregate([
    {"$match": {"amenity": {"$exists": 1}}},
    {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 50},
])

pprint.pprint(list(result))

# Extracting the List of Colleges from the DataSet

In [None]:
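# Matching on the value already implies the field exists, so a separate
# "$exists" check (a duplicate dict key in Python) is not needed.
colleges = boston.aggregate([
    {"$match": {"amenity": "college"}},
    {"$group": {"_id": {"Name": "$name"}, "count": {"$sum": 1}}},
    # Expose the count as "Count" so that the "$sort" stage below can find it.
    {"$project": {"_id": 0, "College": "$_id.Name", "Count": "$count"}},
    {"$sort": {"Count": -1}},
    {"$limit": 10},
])
pprint.pprint(list(colleges))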

**This list is definitely missing some of the key universities in the Boston area, such as Harvard, MIT and Northeastern. On further review of the dataset I noticed that the missing schools and colleges are in fact part of the dataset; they just don't have an amenity of college attached to them.**
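One possible way to widen the search is to match on more than one amenity value. OpenStreetMap defines an `amenity=university` tag, but whether the missing schools in this extract actually carry it is an assumption that would need to be checked against the data. The helper name `build_amenity_pipeline` is hypothetical.

```python
def build_amenity_pipeline(amenities, limit=10):
    """Build an aggregation pipeline that counts names across several amenity values."""
    return [
        {"$match": {"amenity": {"$in": list(amenities)}}},
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


pipeline = build_amenity_pipeline(["college", "university"])
# With a live connection this could be run as: list(boston.aggregate(pipeline))
print(pipeline[0])
```

Building the pipeline in a function keeps the list of amenity values in one place, so further tags can be tried without rewriting the query.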

# Extracting the list of Public Buildings in the Boston Area

In [None]:
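# Matching on the value already implies the field exists, so a separate
# "$exists" check (a duplicate dict key in Python) is not needed.
building = boston.aggregate([
    {"$match": {"amenity": "public_building"}},
    {"$group": {"_id": {"Name": "$name"}, "count": {"$sum": 1}}},
    # Expose the count as "Count" so that the "$sort" stage below can find it.
    {"$project": {"_id": 0, "Building": "$_id.Name", "Count": "$count"}},
    {"$sort": {"Count": -1}},
    {"$limit": 10},
])
pprint.pprint(list(building))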

# Extracting the Top Cities in the Boston Area