[OpenStreetMap](https://www.openstreetmap.org/#map=11/41.4980/-81.7070) is a user-generated map of the entire world, freely available to download. 
The data extract of Cleveland used for this project was downloaded from [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/metro/cleveland_ohio/).

The first step was to download the map as an XML file. The original file was nearly 5.5 million lines long, which made it unwieldy to process in full on every run. I therefore created two sample files for testing my auditing and cleaning scripts: a small file for test-running functions, and an intermediate file for identifying the most common problems before cleaning the data.

The original file contains the following breakdown of top-level elements: {'way': 189248, 'node': 1795742, 'relation': 3732}.
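
A breakdown like this can be produced by streaming through the file once instead of loading it all into memory. The following is a minimal sketch under that assumption, not necessarily the script used for this project: the function name `count_top_level_tags` is hypothetical, and the tiny in-memory document stands in for the real extract.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_top_level_tags(source):
    """Count the direct children of the root element, grouped by tag name.

    `source` may be a file name or a binary file-like object.
    """
    counts = Counter()
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                # We just closed a direct child of the root (node/way/relation).
                counts[elem.tag] += 1
                # Drop the element's children and attributes to limit memory use.
                elem.clear()
    return dict(counts)


# Stand-in data; with the real extract, pass its file name instead.
demo = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='a' v='b'/></node>"
    b"<node id='2'/>"
    b"<way id='3'><nd ref='1'/></way>"
    b"</osm>"
)
print(count_top_level_tags(demo))
```

Tracking the nesting depth means nested elements such as `tag` and `nd` are not counted, only the elements directly under the root.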

The breakdown of top-level elements for the three file sizes is summarized in the following table:

| File                    | 'node'  | 'way'  | 'relation' | File Size (MB) |
|-------------------------|---------|--------|------------|----------------|
| __Full file__           | 1795742 | 189248 | 3732       | 392.6          |
| __Intermediate sample__ | 179575  | 18924  | 374        | 39.9           |
| __Small sample__        | 17958   | 1892   | 38         | 3.9            |
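
The sample sizes above are roughly one tenth and one hundredth of the full file, which is what you would get by keeping every k-th top-level element. The sketch below shows one way to do that, assuming systematic sampling was used; the names `iter_top_level` and `write_sample` and the file names in the usage comment are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

TOP_LEVEL_TAGS = ("node", "way", "relation")


def iter_top_level(source, tags=TOP_LEVEL_TAGS):
    """Yield each completed top-level element while streaming through the file."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # discard elements already processed to limit memory use


def write_sample(source, out, k):
    """Write every k-th top-level element of `source` to the text stream `out`."""
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
    for i, elem in enumerate(iter_top_level(source)):
        if i % k == 0:
            # encoding="unicode" returns a str without an XML declaration
            out.write(ET.tostring(elem, encoding="unicode"))
    out.write("</osm>\n")


# Hypothetical usage; k=10 keeps about one element in ten, k=100 one in a hundred.
# with open("sample_intermediate.osm", "w", encoding="utf-8") as out:
#     write_sample("cleveland_ohio.osm", out, k=10)
```

Because the counter runs over all top-level elements together, the per-tag counts in a sample will be close to, but not exactly, the full counts divided by k.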

#### Overall Approach
Being in possession of a large dataset can be both exciting and intimidating! There are numerous possibilities, but the sheer amount of data can be overwhelming. Before I began any data wrangling, I wanted a step-by-step plan to guide me through cleaning the data efficiently. I decided that the following approach, adapted from the Udacity course on Data Wrangling, was a solid guide:

1. Audit the data: identify errors/missing data in the XML data
2. Create a data cleaning plan based on the audit
    * Identify the causes of any "dirty" or inconsistent data 
    * Develop a set of corrective cleaning actions and test on a small sample of the XML data
3. Implement the data cleaning plan: run scripts and transfer the cleaned data to .csv files
4. Manually correct as necessary: import the data from .csv to SQL and perform SQL queries on the data to identify any further inconsistencies that would necessitate returning to step 2. 

Data wrangling is an iterative procedure, and as such, I expected that I might need to cycle through these steps several times. However, I knew that having a clear outline of the procedure to follow would save me untold hours of work and confusion.



### Auditing the Data

There are five main aspects of data quality to consider when auditing a dataset:
1. Validity: Does the data conform to a schema (standard format)?
2. Accuracy: Does the data conform to reality or a trusted external source?
3. Completeness: Are all records present?
4. Consistency: Is data in a field or across a row in logical agreement?
5. Uniformity: Are the same units used for a given field?
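
To illustrate the validity check, one could compare the `amenity` values in the map data against a set of expected values and tally anything that falls outside it. This is only a sketch under assumptions: the function name `audit_amenity_values` is hypothetical, the expected set is a small illustrative subset, and the in-memory document stands in for a sample file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative subset only; a real audit would use the full list of accepted values.
EXPECTED_AMENITIES = {"bar", "cafe", "restaurant", "school"}


def audit_amenity_values(source, expected=EXPECTED_AMENITIES):
    """Tally amenity values that are not in the expected set."""
    unexpected = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        # OSM stores key/value pairs as <tag k="..." v="..."/> child elements.
        if elem.tag == "tag" and elem.get("k") == "amenity":
            value = elem.get("v")
            if value not in expected:
                unexpected[value] += 1
    return unexpected


# Stand-in data; with a real sample, pass its file name instead.
demo = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='amenity' v='cafe'/></node>"
    b"<node id='2'><tag k='amenity' v='Cafe'/></node>"
    b"<way id='3'><tag k='amenity' v='resturant'/></way>"
    b"</osm>"
)
print(audit_amenity_values(demo))
```

Values flagged this way are not necessarily wrong; they are candidates to review by hand before deciding on a cleaning rule.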

In [1]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

In [2]:
source_file = 'amenitiesSource.xml'

In [24]:
tree = ET.parse(source_file)
root = tree.getroot()

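# Print the text of every link found in the cells of each 'wikitable' table
for table in root.iter('table'):
    # .get() avoids a KeyError for tables that have no class attribute
    if table.get('class') == 'wikitable':
        for row in table:
            for data in row.findall('td'):
                for element in data.findall('a'):
                    if element.text:
                        print(element.text)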

 bar
 bbq
 biergarten
 cafe
 drinking_water
 fast_food
 food_court
 ice_cream
 pub
 restaurant
 college
 kindergarten
 library
 public_bookcase
 school
 music_school
 driving_school
 language_school
 university
 bicycle_parking
 bicycle_repair_station
 bicycle_rental
 boat_sharing
 bus_station
 car_rental
 car_sharing
 car_wash
 charging_station
 ferry_terminal
 fuel
 grit_bin
 motorcycle_parking
 parking
 parking_entrance
 parking_space
 taxi
 atm
 bank
 bureau_de_change
 baby_hatch
 clinic
 dentist
 doctors
 hospital
 nursing_home
 pharmacy
 social_facility
 veterinary
 healthcare
 blood_donation
 arts_centre
 brothel
 casino
 cinema
 community_centre
 fountain
 gambling
 nightclub
 amenity=stripclub
 planetarium
 social_centre
 stripclub
 studio
 swingerclub
 theatre
 animal_boarding
 animal_shelter
 baking_oven
 bench
 clock
 courthouse
 coworking_space
 crematorium
 crypt
 dive_centre
 dojo
 embassy
 fire_station
Tag:leisure=firepit
 game_feeding
 grave_yard
 hunting_stand
 intern