# Data Wrangling SQL

## OpenStreetMap

The OpenStreetMap (OSM) Foundation is building a free and editable map of the world, enabling the development of freely reusable geospatial data. OpenStreetMap data is used by many applications, such as Foursquare and Craigslist.

To look at the map or download your area of interest, visit http://www.openstreetmap.org.

For more information, see the Wikipedia article on OpenStreetMap:
https://en.wikipedia.org/wiki/OpenStreetMap

## Area Chosen

For this project, I chose Chicago, the Windy City in the US. This is the city where I got my undergrad degree.

The original file is about 2 GB in size, so I use a sample file of about 50 MB for my initial analysis. Finally, I run the code on the original file to create the CSV files for my database.
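
One way to produce such a sample is to keep every k-th top-level element of the original file. The sketch below is an assumption about how this could be done, not the exact script used for this project; the names `iter_top_level`, `write_sample`, and `sample.osm` are hypothetical, and the code is written for Python 3.

```python
import xml.etree.ElementTree as ET

def iter_top_level(source, tags=("node", "way", "relation")):
    """Yield each element whose tag is in `tags` once it is fully parsed."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the root <osm> element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # free the elements that were already handled

def write_sample(source, out, k=10):
    """Write every k-th top-level element to the text stream `out`.

    Returns the number of elements that were kept.
    """
    out.write("<osm>\n")
    kept = 0
    for i, elem in enumerate(iter_top_level(source)):
        if i % k == 0:
            out.write(ET.tostring(elem, encoding="unicode"))
            kept += 1
    out.write("</osm>\n")
    return kept

# Hypothetical usage (file names are placeholders):
# with open("chicago_illinois.osm", "rb") as src, open("sample.osm", "w") as dst:
#     write_sample(src, dst, k=40)
```

A larger `k` gives a smaller sample; the value has to be tuned by hand until the output is close to the size you want.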


## Data Exploration

Using ET.iterparse (i.e. iterative parsing) is efficient here because the original file is too large to load into memory all at once.
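
As a minimal sketch of why iterative parsing helps, the Python 3 snippet below counts element names while discarding each element as soon as it has been counted. The function name `count_tags_streaming` is hypothetical. Calling `elem.clear()` is one common way to limit memory use, although the emptied elements stay attached to the root, so memory still grows slowly on very large files.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def count_tags_streaming(source):
    """Count element names in an XML file without keeping element contents in memory."""
    counts = Counter()
    # 'end' events fire once an element, including its children, is fully read
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's attributes and children after counting
    return counts
```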

The main problem we encountered in the dataset is inconsistent street names. For example, the street type 'Avenue' appears in several forms:
- Avenue (starting with capital letter)
- Ave
- Ave.
- avenue (starting with small letter)

To be able to process the data, we need to make these street types uniform. Then, if we later search for a specific avenue, we can search on the word 'Avenue' alone and be sure that we are not missing names that use an abbreviation of it.

In [1]:
import xml.etree.cElementTree as ET
import pprint

OSMFILE = 'map.osm'

def count_tags(filename):
    tags= {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tags.keys():
            tags[elem.tag] = 1
        else:
            tags[elem.tag] += 1
    
    pprint.pprint(tags)
    
count_tags(OSMFILE)

{'bounds': 1,
 'member': 2586,
 'nd': 53000,
 'node': 43731,
 'osm': 1,
 'relation': 48,
 'tag': 16952,
 'way': 6097}


In [2]:
import xml.etree.cElementTree as ET
import pprint
import re

OSMFILE = 'map.osm'


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib['k'] #the 'k' attribute of the tag element holds the key
        if re.search(lower, k):
            keys['lower'] += 1
        elif re.search(lower_colon, k):
            keys['lower_colon'] += 1
        elif re.search(problemchars, k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1

    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    pprint.pprint(keys)
    
process_map(OSMFILE)

{'lower': 11580, 'lower_colon': 5117, 'other': 255, 'problemchars': 0}


In [3]:
OSMFILE = 'map.osm'

def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        if element.tag in ('node', 'way', 'relation'):
            userid = element.attrib['uid']
            users.add(userid)

    print len(users)
    
process_map(OSMFILE)

146


This is the number of unique users who contributed to the map.

In [4]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint

OSMFILE = 'map.osm'
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# the list of street types that we want to have
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]


The 'audit_street_type' function extracts the street type from a street name using the regular expression and compares it to the expected list. If the street type is not in the expected list, the function adds the street name to the street_types dictionary under that street type.

In [5]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
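
To see what the regular expression treats as the street type, the short Python 3 sketch below applies the same pattern to a few names from this dataset. The helper name `last_token` is hypothetical and exists only for illustration.

```python
import re

# Same pattern as in the audit code: the last run of non-space characters
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def last_token(street_name):
    """Return the part of the name that the audit treats as the street type."""
    m = street_type_re.search(street_name)
    return m.group() if m else None

for name in ["W. Stadium Ave.", "Klondike Rd", "US 52", "Cumberland Ave #D"]:
    print(name, "->", last_token(name))
```

The pattern simply takes the final token, so a unit suffix such as '#D' or a highway number such as '52' is reported as a street type. This is why entries like 'D' and '52' show up in the audit results.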

In [6]:
'''
The 'is_street_name' function takes an element from the file
(i.e. a tag element) and returns True if its key attribute
is equal to 'addr:street'.
The 'audit' function uses iterative parsing to go through the XML file,
parse node and way elements, and iterate through their tag elements.
It then calls the 'audit_street_type' function on the value attribute
of each street tag (i.e. the street name).
'''

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")


def audit(osmfile):
    street_types = defaultdict(set)

    # parses the XML file; 'end' events are used so that the child 'tag'
    # elements are fully read before they are inspected
    with open(osmfile, "r") as osm_file:
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            # iterate through the 'tag' elements of node and way elements
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types

In [7]:
street_types = audit(OSMFILE)

pprint.pprint(dict(street_types))

{'52': set(['US 52']),
 'Ave.': set(['W. Stadium Ave.']),
 'D': set(['Cumberland Ave #D']),
 'Dr': set(['Agriculture Mall Dr']),
 'Lagrange': set(['Hamilton & Lagrange']),
 'Mall': set(['Memorial Mall', 'Purdue Mall', 'Stadium Mall']),
 'North': set(['West 350 North']),
 'Rd': set(['Klondike Rd']),
 'St': set(['East State St', 'W Wood St']),
 'St.': set(['N. Russell St.', 'N. University St.']),
 'W': set(['Sagamore Pkwy W']),
 'Way': set(['Geddes Way']),
 'West': set(['Sagamore Parkway West'])}


Going through this list of street names, I update the 'mapping' dictionary. Each entry gives the form of a street type found in the file (left) and the form it needs to be changed to (right).


In [8]:
#A dictionary mapping street types found in the data to their form in the expected list
mapping = { "St": "Street",
            "St.": "Street",
            "street": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "AVE": "Avenue",
            "avenue": "Avenue",
            "Rd.": "Road",
            "Rd": "Road",
            "road": "Road",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Blvd,": "Boulevard",
            "boulevard": "Boulevard",
            "broadway": "Broadway",
            "square": "Square",
            "way": "Way",
            "Dr.": "Drive",
            "Dr": "Drive",
            "ct": "Court",
            "Ct": "Court",
            "court": "Court",
            "Sq": "Square",
            "cres": "Crescent",
            "Cres": "Crescent",
            "Ctr": "Center",
            "Hwy": "Highway",
            "hwy": "Highway",
            "Ln": "Lane",
            "Ln.": "Lane",
            "parkway": "Parkway"
            }

To match the expected list of street types and replace the abbreviated ones, I wrote a function that uses the mapping to do this conversion.

I take the street name and split it at the space character. If any of the resulting parts matches a key in the mapping, I replace it with the form I have specified for it. For example, when the function finds 'Blvd' in 'N California Blvd', it looks it up in the mapping and maps it to 'Boulevard', so the final street name comes out as 'N California Boulevard'.

In [9]:
def update_name(name, mapping):
    output = list()
    parts = name.split(" ")
    for part in parts:
        if part in mapping:
            output.append(mapping[part])
        else:
            output.append(part)
    return " ".join(output)
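
The behaviour described above can be checked with a small, self-contained Python 3 sketch. It uses a compact equivalent of `update_name` and a reduced mapping named `sample_mapping`, which is hypothetical and used only for illustration. 'St. Joseph Ave' is a made-up name chosen to show a side effect of the approach.

```python
def update_name(name, mapping):
    """Replace every space-separated part of the name that appears in the mapping."""
    return " ".join(mapping.get(part, part) for part in name.split(" "))

# Reduced mapping for illustration only
sample_mapping = {"Blvd": "Boulevard", "Ave": "Avenue", "St.": "Street"}

print(update_name("N California Blvd", sample_mapping))
print(update_name("Cumberland Ave #D", sample_mapping))
print(update_name("St. Joseph Ave", sample_mapping))
```

Because every part of the name is looked up, not only the last one, an abbreviation that is part of a proper name is also expanded: the third call returns 'Street Joseph Avenue'. If that matters for your area, one option is to apply the mapping only to the final part of the name.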

Let's print the results to see how the changes have been applied. I iterate through street_types, which holds the street types collected by the 'audit' function, and call the 'update_name' function on each street name.

In [10]:
for st_type, ways in street_types.iteritems():
        for name in ways:
            better_name = update_name(name, mapping)
            print name, "=>", better_name

West 350 North => West 350 North
Sagamore Pkwy W => Sagamore Pkwy W
Sagamore Parkway West => Sagamore Parkway West
N. Russell St. => N. Russell Street
N. University St. => N. University Street
US 52 => US 52
Klondike Rd => Klondike Road
Stadium Mall => Stadium Mall
Purdue Mall => Purdue Mall
Memorial Mall => Memorial Mall
Geddes Way => Geddes Way
Hamilton & Lagrange => Hamilton & Lagrange
W. Stadium Ave. => W. Stadium Avenue
East State St => East State Street
W Wood St => W Wood Street
Agriculture Mall Dr => Agriculture Mall Drive
Cumberland Ave #D => Cumberland Avenue #D


## Auditing Postcodes

Postcodes are another type of data entered inconsistently into the map. The inconsistency is either in how they are represented (with or without a prefix such as a state abbreviation) or in how long they are.

The 'dicti' function maintains a dictionary in which I store postcodes. The dictionary key is the postcode itself and the dictionary value is the number of times that postcode appears in the map.

In [11]:
OSMFILE = 'map.osm'

def dicti(data, item):
    data[item] += 1

The 'get_postcode' function takes a 'tag' element as input and returns True if its key is equal to 'addr:postcode'.

The 'audit' function, like the one for street names, parses the XML file and iterates through node and way elements. It extracts the value attribute (i.e. the postcode) and passes it to the 'dicti' function, which adds it to the dictionary.

In [12]:
def get_postcode(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit(osmfile):
    data = defaultdict(int)
    # parsing the XML file; 'end' events are used so that the child 'tag'
    # elements are fully read before they are inspected
    with open(osmfile, "r") as osm_file:
        for event, elem in ET.iterparse(osm_file, events=("end",)):

            # iterating through node and way elements.
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if get_postcode(tag):
                        dicti(data, tag.attrib['v'])

    return data

Now I call the 'audit' function and print the output, which is a dictionary mapping each postcode to the number of times it appears.

In [13]:
postcodes = audit(OSMFILE)

pprint.pprint(dict(postcodes))

{'47406': 1, '47906': 57, '47907': 43}


### Different Postcodes and Ways to Clean Them up
Postcodes in the data can appear in these formats:
- A 5-digit format (e.g. 12345)
- A 5-digit format followed by more numbers after a hyphen (e.g. 12345-6789)

To deal with the postcodes, I divide them into three categories:
- The first category includes postcodes:
    - Whose length equals 5 (e.g. 12345)
    - Whose length is greater than 5 and which contain letters, such as a state abbreviation (e.g. CA 12345)
    
- The second category includes postcodes:
    - Whose length is greater than 5 and which are followed by a hyphen and more digits (e.g. 12345-6789)
    
- The third category includes postcodes:
    - Whose length is greater than 5 but which are not followed by a hyphen (e.g. 123456)
    - Whose length is less than 5 (e.g. 1234, 515)
    - That are equal to 'CA'
 

In [14]:
def update_postcode(digit):
    # First category: exactly five digits, optionally preceded by
    # non-digit characters (e.g. '12345', 'CA 12345')
    first_category = re.compile(r'^\D*(\d{5})$')

    # Second category: five digits, a hyphen and four more digits (e.g. '12345-6789')
    second_category = re.compile(r'^(\d{5})-\d{4}$')

    # Third category: six or more digits without a hyphen (e.g. '123456')
    third_category = re.compile(r'^\d{6,}$')

    first_match = first_category.search(digit)
    second_match = second_category.search(digit)

    if first_match:
        return first_match.group(1)

    elif second_match:
        return second_match.group(1)

    # the third category also covers 'CA' and postcodes shorter than five characters
    elif third_category.search(digit) or digit == 'CA' or len(digit) < 5:
        return '00000'

    # anything that fits none of the categories is returned as an empty string
    return ''
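
The three categories can be checked against a handful of sample values. The following Python 3 sketch is a compact variant, not the function above: the name `clean_postcode` is hypothetical, and unlike `update_postcode` it assumes that every value outside the first two categories should become the placeholder '00000'.

```python
import re

FIVE_DIGITS = re.compile(r'^\D*(\d{5})$')        # '12345' or 'CA 12345'
ZIP_PLUS_FOUR = re.compile(r'^(\d{5})-\d{4}$')   # '12345-6789'
PLACEHOLDER = '00000'

def clean_postcode(raw):
    """Return the 5-digit postcode, or a placeholder if none can be extracted."""
    for pattern in (FIVE_DIGITS, ZIP_PLUS_FOUR):
        m = pattern.search(raw)
        if m:
            return m.group(1)
    return PLACEHOLDER

for raw in ['47906', 'IN 47906', '47906-1234', '479066', '4790', 'CA']:
    print(repr(raw), '->', clean_postcode(raw))
```

Using a placeholder keeps the column uniform, at the cost of hiding the original value. Keeping the raw value in a separate column would be an alternative if the invalid postcodes need to be reviewed later.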


I will print the output after the changes are made to the postcodes.

In [15]:
for postcode, nums in postcodes.iteritems():
    better_code = update_postcode(postcode)
    print postcode, "=>", better_code


## Preparing the Data for the Database

To load the data into the SQLite database, I need to transfer it from the XML file to CSV files. I create multiple CSV files, and later create the corresponding tables in my database based on them.

The CSV files I want to have are:
- Node
- Node_tags
- Way
- Way_tags
- Way_nodes

Each of these CSV files contains different columns and stores data based on those columns. The columns used in the CSV files will be the table columns in the database. This is the schema:
- NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
- NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
- WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
- WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
- WAY_NODES_FIELDS = ['id', 'node_id', 'position']
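
In the two tag tables, the 'key' and 'type' columns both come from the tag's k attribute: a key without a colon is stored as is with the type 'regular', while a key such as 'addr:street' is split into the type 'addr' and the key 'street'. The Python 3 sketch below illustrates the idea with `str.partition`. It is a simplification rather than the regular-expression logic used in shape_element, and the helper name `split_key` is hypothetical.

```python
def split_key(k, default_type="regular"):
    """Split a tag's k attribute into (key, type) at the first colon."""
    if ":" not in k:
        return k, default_type
    # Everything before the first colon is the type, the rest is the key
    tag_type, _, key = k.partition(":")
    return key, tag_type

print(split_key("amenity"))
print(split_key("addr:street"))
print(split_key("addr:street:name"))
```

With this sketch a key with more than one colon keeps the remaining colons in the key part, for example ('street:name', 'addr').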



In [16]:
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_element(element):

    node_attribs = {} # Handle the attributes in node element
    way_attribs = {} # Handle the attributes in way element
    way_nodes = [] # Handle the 'nd' tag in the way element
    tags = []  # Handle secondary tags the same way for both node and way elements
    
    # Handling node elements
    if element.tag == 'node':
        for item in NODE_FIELDS:
            try:
                node_attribs[item] = element.attrib[item]
            except KeyError: # the attribute is missing from this element
                node_attribs[item] = "9999999"
        
        # Iterating through the 'tag' tags in the node element
        for tg in element.iter('tag'):
            if not PROBLEMCHARS.search(tg.attrib['k']):
                tag_dict_node = {}
                tag_dict_node['id'] = element.attrib['id']

                # Calling the update_name function to clean up problematic street names based on audit.py file
                if is_street_name(tg):
                    better_name = update_name(tg.attrib['v'], mapping)
                    tag_dict_node['value'] = better_name

                # Calling the update_postcode function to clean up problematic postcodes based on audit.py file
                elif get_postcode(tg):
                    better_postcode = update_postcode(tg.attrib['v'])
                    tag_dict_node['value'] = better_postcode
                
                # For other values that are not street names or postcodes
                else:
                    tag_dict_node['value'] = tg.attrib['v']

                if ':' not in tg.attrib['k']:
                    tag_dict_node['key'] = tg.attrib['k']
                    tag_dict_node['type'] = 'regular'
                else: 
                    character_before_colon = re.findall('^[a-zA-Z]*:', tg.attrib['k'])
                    character_after_colon = re.findall(':[a-zA-Z_]+' , tg.attrib['k'])
                    if len(character_after_colon) != 0:
                        tag_dict_node['key'] = character_after_colon[0][1:]
                    else:
                        tag_dict_node['key'] = 'regular'

                    if len(character_before_colon) != 0:
                        tag_dict_node['type'] = character_before_colon[0][: -1]
                    else:
                        tag_dict_node['type'] = 'regular'
                tags.append(tag_dict_node)
            
        return {'node': node_attribs, 'node_tags': tags}
        
    # Handling way elements
    elif element.tag == 'way':
        for item in WAY_FIELDS:
            try:
                way_attribs[item] = element.attrib[item]
            except KeyError: # the attribute is missing from this element
                way_attribs[item] = "9999999"
        
        # Iterating through 'tag' tags in way element
        for tg in element.iter('tag'):
            if not PROBLEMCHARS.search(tg.attrib['k']):
                tag_dict_way = {}
                tag_dict_way['id'] = element.attrib['id']

                # Calling the update_name function to clean up problematic street names based on audit.py file
                if is_street_name(tg):
                    better_name_way = update_name(tg.attrib['v'], mapping)
                    tag_dict_way['value'] = better_name_way

                # Calling the update_postcode function to clean up problematic postcodes based on audit.py file
                elif get_postcode(tg):
                    better_postcode_way = update_postcode(tg.attrib['v'])
                    tag_dict_way['value'] = better_postcode_way

                # For other values that are not street names or postcodes
                else:
                    tag_dict_way['value'] = tg.attrib['v']

                if ':' not in tg.attrib['k']:
                    tag_dict_way['key'] = tg.attrib['k']
                    tag_dict_way['type'] = 'regular'
                else:
                    character_before_colon = re.findall('^[a-zA-Z]*:', tg.attrib['k'])
                    character_after_colon = re.findall(':[a-zA-Z_]+', tg.attrib['k'])
                
                    if len(character_after_colon) == 1:
                        tag_dict_way['key'] = character_after_colon[0][1:]
                    if len(character_after_colon) > 1:
                        tag_dict_way['key'] = character_after_colon[0][1: ] + character_after_colon[1]
                
                    if len(character_before_colon) != 0:
                        tag_dict_way['type'] = character_before_colon[0][: -1]
                    else:
                        tag_dict_way['type'] = 'regular'
                
                tags.append(tag_dict_way)
        
        # Iterating through 'nd' tags in way element
        count = 0
        for tg in element.iter('nd'):
            tag_dict_nd = {}
            tag_dict_nd['id'] = element.attrib['id']
            tag_dict_nd['node_id'] = tg.attrib['ref']
            tag_dict_nd['position'] = count
            count += 1
            
            way_nodes.append(tag_dict_nd)
        
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}


With the shape_element function in place, I can now parse and shape the data, and write it to CSV files.

The main function, process_map, calls shape_element, which in turn uses the functions from audit.py to update street names and postcodes. The Python script shaping_csv.py takes care of creating the CSV files.

## Create CSV 

In [25]:

"""
After auditing is complete the next step is to prepare the data to be inserted into a SQL database.
To do so I will parse the elements in the OSM XML file, transforming them from document format to
tabular format, thus making it possible to write to .csv files.  These csv files can then easily be
imported to a SQL database as tables.
"""

#Initial imports
import csv
import codecs
import pprint
import re
import xml.etree.cElementTree as ET
from audit import *  #Imports all the functions from audit.py file
import cerberus
import schema

#The directory where the OSM file is located
OSM_PATH = 'chicago_illinois.osm'

#The directory where the created CSV files will be located 
NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

#The SQL schema that is defined in schema.py file. Both files need to be in the same directory.
SCHEMA = schema.schema

#Regular expression pattern to find problematic characters in key attributes
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

#Regular expression pattern to find different types of streets in street names
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

#The list of street types that we want to have
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

#A dictionary mapping street types found in the data to their form in the 'expected' list
mapping = { "St": "Street", "St.": "Street", "street": "Street",
            "Ave": "Avenue", "Ave.": "Avenue", "AVE": "Avenue", "avenue": "Avenue",
            "Rd.": "Road", "Rd": "Road", "road": "Road",
            "Blvd": "Boulevard", "Blvd.": "Boulevard", "Blvd,": "Boulevard", "boulevard": "Boulevard",
            "broadway": "Broadway",
            "square": "Square", "Sq": "Square",
            "way": "Way",
            "Dr.": "Drive", "Dr": "Drive",
            "ct": "Court", "Ct": "Court", "court": "Court",
            "cres": "Crescent", "Cres": "Crescent", "Ctr": "Center",
            "Hwy": "Highway", "hwy": "Highway",
            "Ln": "Lane", "Ln.": "Lane",
            "parkway": "Parkway" }

#The columns in the CSV files. The same columns need to be created for the database
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
 
def shape_element(element):
    """Shape each element into several data structures.

    Arg:
    -param element: 'node' and 'way' elements that are passed to this function from
    the get_element function; mainly called by the process_map function

    The function goes through node and way elements, defining the values for the
    node, node_tags, way, way_tags, and way_nodes dictionaries.
    
    The "node" field holds a dictionary of the following top level node 
    attributes: id, user, uid, version, lat, lon, timestamp, changeset

    The "way" field holds a dictionary of the following top level way
    attributes: id, user, uid, version, timestamp, changeset
    
    The "node_tags" and "way_tags" fields hold a list of dictionaries, one per 
    secondary tag. Secondary tags are child tags of a node or way which have the tag 
    name/type: "tag". Each dictionary has the following fields from the secondary
    tag attributes:
    - id: the top level node or way id attribute value
    - key: the full tag "k" attribute value if no colon is present or the 
        characters after the colon if one is.
    - value: the tag "v" attribute value
    - type: either the characters before the colon in the tag "k" value 
        or "regular" if a colon is not present.
    For the value field, I call the update_name and update_postcode functions to
    clean problematic street names or postcodes. I call these functions on both
    node and way elements.

    Return:
    A dictionary with the following keys is returned (node and node_tags for a
    node element; way, way_nodes, and way_tags for a way element):
    -node
    -node_tags
    -way
    -way_nodes
    -way_tags
    """
    node_attribs = {} # Handle the attributes in node element
    way_attribs = {} # Handle the attributes in way element
    way_nodes = [] # Handle the 'nd' tag in the way element
    tags = []  # Handle secondary tags the same way for both node and way elements
    
    # Handling node elements
    if element.tag == 'node':
        for item in NODE_FIELDS:
            #If a field is missing from the element, "9999999" is used as a placeholder
            try:
                node_attribs[item] = element.attrib[item]
            except KeyError:
                node_attribs[item] = "9999999"
        
        # Iterating through the 'tag' tags in the node element
        for tg in element.iter('tag'):
            if not PROBLEMCHARS.search(tg.attrib['k']): #Ignoring tags whose keys contain problematic characters
                tag_dict_node = {}
                tag_dict_node['id'] = element.attrib['id']

                # Calling the update_name function to clean up problematic street names based on audit.py file
                if is_street_name(tg):
                    better_name = update_name(tg.attrib['v'], mapping)
                    tag_dict_node['value'] = better_name

                # Calling the update_postcode function to clean up problematic postcodes based on audit.py file
                elif get_postcode(tg):
                    better_postcode = update_postcode(tg.attrib['v'])
                    tag_dict_node['value'] = better_postcode
                
                # For other values that are not street names or postcodes
                else:
                    tag_dict_node['value'] = tg.attrib['v']

                if ':' not in tg.attrib['k']:
                    tag_dict_node['key'] = tg.attrib['k']
                    tag_dict_node['type'] = 'regular'
                #Dividing words before and after a colon ':'
                else: 
                    character_before_colon = re.findall('^[a-zA-Z]*:', tg.attrib['k'])
                    character_after_colon = re.findall(':[a-zA-Z_]+' , tg.attrib['k'])
                    if len(character_after_colon) != 0: #If the key was an empty field
                        tag_dict_node['key'] = character_after_colon[0][1:]
                    else:
                        tag_dict_node['key'] = 'regular'

                    if len(character_before_colon) != 0: #If the type was an empty field
                        tag_dict_node['type'] = character_before_colon[0][: -1]
                    else:
                        tag_dict_node['type'] = 'regular'
                tags.append(tag_dict_node)
            
        return {'node': node_attribs, 'node_tags': tags}
        
    # Handling way elements
    elif element.tag == 'way':
        for item in WAY_FIELDS:
            #If a field is missing from the element, "9999999" is used as a placeholder
            try:
                way_attribs[item] = element.attrib[item]
            except KeyError:
                way_attribs[item] = "9999999"
        
        # Iterating through 'tag' tags in way element
        for tg in element.iter('tag'):
            if not PROBLEMCHARS.search(tg.attrib['k']):
                tag_dict_way = {}
                tag_dict_way['id'] = element.attrib['id']

                # Calling the update_name function to clean up problematic street names based on audit.py file
                if is_street_name(tg):
                    better_name_way = update_name(tg.attrib['v'], mapping)
                    tag_dict_way['value'] = better_name_way

                # Calling the update_postcode function to clean up problematic postcodes based on audit.py file
                elif get_postcode(tg):
                    better_postcode_way = update_postcode(tg.attrib['v'])
                    tag_dict_way['value'] = better_postcode_way

                # For other values that are not street names or postcodes
                else:
                    tag_dict_way['value'] = tg.attrib['v']

                if ':' not in tg.attrib['k']:
                    tag_dict_way['key'] = tg.attrib['k']
                    tag_dict_way['type'] = 'regular'
                #Dividing words before and after a colon ':'
                else:
                    character_before_colon = re.findall('^[a-zA-Z]*:', tg.attrib['k'])
                    character_after_colon = re.findall(':[a-zA-Z_]+', tg.attrib['k'])
                
                    if len(character_after_colon) == 1:
                        tag_dict_way['key'] = character_after_colon[0][1:]
                    if len(character_after_colon) > 1:
                        tag_dict_way['key'] = character_after_colon[0][1: ] + character_after_colon[1]
                
                    if len(character_before_colon) != 0: #If the type was an empty field
                        tag_dict_way['type'] = character_before_colon[0][: -1]
                    else:
                        tag_dict_way['type'] = 'regular'
                
                tags.append(tag_dict_way)
        
        # Iterating through 'nd' tags in way element
        count = 0
        for tg in element.iter('nd'):
            tag_dict_nd = {}
            tag_dict_nd['id'] = element.attrib['id']
            tag_dict_nd['node_id'] = tg.attrib['ref']
            tag_dict_nd['position'] = count
            count += 1
            
            way_nodes.append(tag_dict_nd)
        
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}


# ================================================== #
#   Helper Functions - Written by Udacity Lecturers  #
# ================================================== #
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

#Validating that during creation of CSV files the fields are all in accordance with the columns that should be
#in the CSV files
def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)
        
        raise Exception(message_string.format(field, error_string))


class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

# =================================================================== #
#  Main Function-Creating the CSV files-Written by Udacity Lecturers  #
# =================================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        count = 1
        for element in get_element(file_in, tags=('node', 'way')):
            if count % 10000 == 0: #Setting a counter to show how many rows the code has processed
                print count
            count += 1
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])

if __name__ == '__main__':
    # Note: If the validation is set to True, the process takes much longer than when it is set to False

    process_map(OSM_PATH, validate=False)


10000
20000
30000
...
1380000
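
The write pattern used by process_map can be tried in isolation. The following is a minimal sketch written for Python 3 (the notebook itself runs on Python 2). It uses the standard `csv.DictWriter` in place of the course-supplied `UnicodeDictWriter`; the `write_nodes` helper and the sample row are hypothetical, and the field list is assumed to match the columns of the nodes table.

```python
import csv
import io

# Assumed to mirror NODE_FIELDS; the names follow the columns of the nodes table.
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']


def write_nodes(rows, out_file, report_every=10000):
    """Write node dicts to out_file as CSV and return the number of rows written."""
    writer = csv.DictWriter(out_file, fieldnames=NODE_FIELDS)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
        if count % report_every == 0:
            print(count)  # progress indicator, as in process_map
    return count


# Hypothetical sample row, for illustration only.
sample = [{'id': 1, 'lat': 41.88, 'lon': -87.63, 'user': 'example_user',
           'uid': 42, 'version': 1, 'changeset': 100,
           'timestamp': '2017-01-01T00:00:00Z'}]

buffer = io.StringIO()  # in-memory stand-in for nodes.csv
written = write_nodes(sample, buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer first makes it easy to inspect the header and a few rows before running the full extract.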

## Create Database from above CSV

In [26]:
"""
Build a database from the CSV files, using the respective table names.
"""

import csv, sqlite3

con = sqlite3.connect('db.sqlite')
con.text_factory = str
cur = con.cursor()

# create nodes table
cur.execute("CREATE TABLE nodes (id, lat, lon, user, uid, version, changeset, timestamp);")
with open('nodes.csv','rb') as fin:
    dr = csv.DictReader(fin) 
    to_db = [(i['id'], i['lat'], i['lon'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) \
             for i in dr]

cur.executemany("INSERT INTO nodes (id, lat, lon, user, uid, version, changeset, timestamp) \
                VALUES (?, ?, ?, ?, ?, ?, ?, ?);", to_db)
con.commit()

#create nodes_tags table
cur.execute("CREATE TABLE nodes_tags (id, key, value, type);")
with open('nodes_tags.csv','rb') as fin:
    dr = csv.DictReader(fin) 
    to_db = [(i['id'], i['key'], i['value'], i['type']) for i in dr]

cur.executemany("INSERT INTO nodes_tags (id, key, value, type) VALUES (?, ?, ?, ?);", to_db)
con.commit()

#Create ways table
cur.execute("CREATE TABLE ways (id, user, uid, version, changeset, timestamp);")
with open('ways.csv','rb') as fin:
    dr = csv.DictReader(fin) 
    to_db = [(i['id'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr]

cur.executemany("INSERT INTO ways (id, user, uid, version, changeset, timestamp) VALUES (?, ?, ?, ?, ?, ?);", to_db)
con.commit()

#Create ways_nodes table
cur.execute("CREATE TABLE ways_nodes (id, node_id, position);")
with open('ways_nodes.csv','rb') as fin:
    dr = csv.DictReader(fin) 
    to_db = [(i['id'], i['node_id'], i['position']) for i in dr]

cur.executemany("INSERT INTO ways_nodes (id, node_id, position) VALUES (?, ?, ?);", to_db)
con.commit()

#Create ways_tags table
cur.execute("CREATE TABLE ways_tags (id, key, value, type);")
with open('ways_tags.csv','rb') as fin:
    dr = csv.DictReader(fin) 
    to_db = [(i['id'], i['key'], i['value'], i['type']) for i in dr]

cur.executemany("INSERT INTO ways_tags (id, key, value, type) VALUES (?, ?, ?, ?);", to_db)
con.commit()
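
The five load steps above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch written for Python 3, not the code used for this project: `load_csv` and the sample data are hypothetical, the in-memory database stands in for db.sqlite, and the column types are an assumption (the tables above are created without declared types).

```python
import csv
import io
import sqlite3


def load_csv(con, table, columns, csv_file):
    """Create `table` from (name, type) pairs and fill it from csv_file.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    names = [name for name, _ in columns]
    col_defs = ', '.join('{} {}'.format(name, sql_type) for name, sql_type in columns)
    placeholders = ', '.join('?' for _ in names)
    con.execute('CREATE TABLE {} ({});'.format(table, col_defs))
    reader = csv.DictReader(csv_file)
    rows = [tuple(row[name] for name in names) for row in reader]
    con.executemany(
        'INSERT INTO {} ({}) VALUES ({});'.format(table, ', '.join(names), placeholders),
        rows)
    con.commit()
    return len(rows)


# Hypothetical two-row stand-in for ways_nodes.csv.
sample_csv = io.StringIO('id,node_id,position\n10,100,0\n10,101,1\n')

con = sqlite3.connect(':memory:')
loaded = load_csv(con, 'ways_nodes',
                  [('id', 'INTEGER'), ('node_id', 'INTEGER'), ('position', 'INTEGER')],
                  sample_csv)
stored = con.execute(
    'SELECT id, node_id, position FROM ways_nodes ORDER BY position').fetchall()
```

Declaring the id columns as INTEGER lets SQLite store them as numbers instead of text, so numeric comparisons and sorting behave as expected.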


## Data Overview

I want to get some information about the CSV files and the database I created.

The 'hurry.filesize' library translates file sizes from bytes to KB or MB. To install it, run 'pip install hurry.filesize' on your machine. I got the idea for this method from the post below:  
https://discussions.udacity.com/t/display-files-and-their-sizes-in-directory/186741

In [17]:
from pprint import pprint
import os
from hurry.filesize import size 
dirpath = 'submission'  # main directory

# Collect (filename, human-readable size) pairs for every file under dirpath
files_list = []
for path, dirs, files in os.walk(dirpath):
    files_list.extend([(filename, size(os.path.getsize(os.path.join(path, filename)))) for filename in files])

# A separate loop variable name keeps the imported size() function intact
for filename, file_size in files_list:
    print '{:.<40s}: {:5s}'.format(filename, file_size)
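
If installing 'hurry.filesize' is not an option, a similar listing can be produced with the standard library alone. This is a minimal sketch written for Python 3: `human_size` and `list_file_sizes` are hypothetical helpers, and the rounding and unit labels are my own choice, so the output will not match the library's formatting exactly.

```python
import os


def human_size(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes here)."""
    value = float(num_bytes)
    for unit in ('B', 'KB', 'MB', 'GB'):
        # Stop at the first unit that fits, or at GB as the largest unit.
        if value < 1024 or unit == 'GB':
            return '{:.1f} {}'.format(value, unit)
        value /= 1024


def list_file_sizes(dirpath):
    """Return (filename, formatted size) pairs for every file under dirpath."""
    result = []
    for path, _dirs, files in os.walk(dirpath):
        for filename in files:
            full_path = os.path.join(path, filename)
            result.append((filename, human_size(os.path.getsize(full_path))))
    return result


# 'submission' is the directory name used in the cell above.
for filename, file_size in list_file_sizes('submission'):
    print('{:.<40s}: {}'.format(filename, file_size))
```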

Now that I have audited and cleaned the data and transferred everything into tables in my database, I can start running queries on it. The queries answer questions such as:   
- Number of nodes
- Number of ways
- Number of unique users
- Most contributing users
- Number of users who contributed only once
- Top 10 amenities
- Shops 
- Users who added amenities 


In [18]:
import sqlite3

sqlite_file = 'db.sqlite'
con = sqlite3.connect(sqlite_file)
cur = con.cursor()

### Number of nodes

In [19]:
def number_of_nodes():
    output = cur.execute('SELECT COUNT(*) FROM nodes')
    return output.fetchone()[0]

print 'Number of nodes: \n' , number_of_nodes()

Number of nodes: 
8701756


### Number of ways

In [20]:
def number_of_ways():
    output = cur.execute('SELECT COUNT(*) FROM ways')
    return output.fetchone()[0]

print 'Number of ways: \n' , number_of_ways()

Number of ways: 
1231106


### Number of unique users

In [21]:
def number_of_unique_users():
    output = cur.execute('SELECT COUNT(DISTINCT e.uid) FROM \
                         (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')
    return output.fetchone()[0]

print 'Number of unique users: \n' , number_of_unique_users()

Number of unique users: 
2823
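
The query above stacks the uid columns of both tables with UNION ALL and then counts the distinct values, so a user who edited both nodes and ways is counted only once. A minimal sketch on a toy in-memory database illustrates this (written for Python 3; the rows are invented for illustration and are not taken from the Chicago data):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nodes (id, uid)')
con.execute('CREATE TABLE ways (id, uid)')
# uid 7 appears in both tables, uid 8 only in nodes, uid 9 only in ways.
con.executemany('INSERT INTO nodes VALUES (?, ?)', [(1, 7), (2, 7), (3, 8)])
con.executemany('INSERT INTO ways VALUES (?, ?)', [(10, 7), (11, 9)])

# Same shape as the query in number_of_unique_users().
unique_users = con.execute(
    'SELECT COUNT(DISTINCT e.uid) FROM '
    '(SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e'
).fetchone()[0]

# Without DISTINCT, every node and way row is counted.
total_rows = con.execute(
    'SELECT COUNT(*) FROM '
    '(SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e'
).fetchone()[0]

print(unique_users, total_rows)
```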


### Most contributing users

In [22]:
def most_contributing_users():
    
    output = cur.execute('SELECT e.user, COUNT(*) as num FROM \
                         (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
                         GROUP BY e.user \
                         ORDER BY num DESC \
                         LIMIT 10 ')
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Most contributing users: \n'
top_users = most_contributing_users()

Most contributing users: 

[(u'chicago-buildings', 5605968),
 (u'Umbugbene', 1091115),
 (u'woodpeck_fixbot', 219369),
 (u'alexrudd (NHD)', 204341),
 (u'g246020', 107386),
 (u'patester24', 105214),
 (u'mpinnau', 103495),
 (u'asdf1234', 101397),
 (u'Oak_Park_IL', 101251),
 (u'TIGERcnl', 93141)]
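
One detail of sqlite3 cursors matters for helper functions like this one: fetchall() consumes the result set, so a second call on the same cursor returns an empty list. Storing the rows in a variable and reusing it avoids the problem. A minimal sketch on a toy table (written for Python 3; the rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nodes (user)')
con.executemany('INSERT INTO nodes VALUES (?)', [('a',), ('a',), ('b',)])

cursor = con.execute(
    'SELECT user, COUNT(*) AS num FROM nodes GROUP BY user ORDER BY num DESC')

first_call = cursor.fetchall()   # consumes every remaining row
second_call = cursor.fetchall()  # nothing is left to fetch

print(first_call)
print(second_call)
```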

### Number of users who contributed once

In [23]:
def number_of_users_contributed_once():
    
    output = cur.execute('SELECT COUNT(*) FROM \
                             (SELECT e.user, COUNT(*) as num FROM \
                                 (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
                                  GROUP BY e.user \
                                  HAVING num = 1) u')
    
    return output.fetchone()[0]
                         
print 'Number of users who have contributed once: \n', number_of_users_contributed_once()

Number of users who have contributed once: 
636
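
The nested query first builds one row per user with a contribution count, and the HAVING clause keeps only the users whose count is exactly 1. The sketch below shows the same idea on toy data and lists the matching users instead of counting them (written for Python 3; table contents are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nodes (id, user)')
con.execute('CREATE TABLE ways (id, user)')
con.executemany('INSERT INTO nodes VALUES (?, ?)',
                [(1, 'alice'), (2, 'alice'), (3, 'bob')])
con.executemany('INSERT INTO ways VALUES (?, ?)',
                [(10, 'carol'), (11, 'bob')])

# alice has 2 rows, bob has 2 rows (one node, one way), carol has 1 row.
one_time_users = con.execute(
    'SELECT e.user FROM '
    '(SELECT user FROM nodes UNION ALL SELECT user FROM ways) e '
    'GROUP BY e.user '
    'HAVING COUNT(*) = 1 '
    'ORDER BY e.user'
).fetchall()

print(one_time_users)
```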


### Top 10 amenities 

In [24]:
def top_ten_amenities_in_sf():
    output = cur.execute('SELECT value, COUNT(*) as num FROM nodes_tags\
                            WHERE key="amenity" \
                            GROUP BY value \
                            ORDER BY num DESC \
                            LIMIT 10' )
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Top ten amenities: \n'
top_amenities = top_ten_amenities_in_sf()

Top ten amenities: 

[(u'place_of_worship', 3038),
 (u'school', 1906),
 (u'restaurant', 1568),
 (u'fast_food', 899),
 (u'parking', 603),
 (u'cafe', 450),
 (u'bench', 426),
 (u'bicycle_parking', 414),
 (u'fuel', 388),
 (u'bicycle_rental', 361)]

### Top 10 cuisines

In [25]:
def cuisines_in_sf():
    output = cur.execute ('SELECT value, COUNT(*) as num FROM ways_tags \
                           WHERE key="cuisine" \
                           GROUP BY value \
                           ORDER BY num DESC \
                           LIMIT 10')
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Top 10 cuisines: \n'
top_cuisines = cuisines_in_sf()

Top 10 cuisines: 

[(u'burger', 340),
 (u'mexican', 72),
 (u'chicken', 64),
 (u'pizza', 58),
 (u'american', 53),
 (u'coffee_shop', 36),
 (u'sandwich', 31),
 (u'italian', 30),
 (u'chinese', 17),
 (u'ice_cream', 16)]

### Different types of shops

In [26]:
def shops_in_sf():
    output = cur.execute('SELECT value, COUNT(*) as num FROM nodes_tags\
                            WHERE key="shop" \
                            GROUP BY value \
                            ORDER BY num DESC' )
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Different types of shops: \n'
shop_types = shops_in_sf()

### Users who added amenities to the map

In [27]:
def users_who_added_amenity():
    # GROUP BY value yields one row per amenity value; SQLite takes the
    # user from an arbitrary row of each group
    output = cur.execute('SELECT DISTINCT(nodes.user), nodes_tags.value FROM \
                            nodes join nodes_tags \
                            on nodes.id=nodes_tags.id \
                            WHERE key="amenity" \
                            GROUP BY value \
                            LIMIT 10' ) # Remove this part to view the whole list of users
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Users who added amenity to the map: \n'
amenity_users = users_who_added_amenity()

Users who added amenity to the map: 

[(u'DACGroup', u'Family Restaurant'),
 (u'DACGroup', u'Furniture Store'),
 (u'Dark Asteroid', u'Lombard Village Hall'),
 (u'DACGroup',
  u'Portable Toilet Supplier;Trailer Rental Service;Construction Equipment Supplier;Fence Contractor'),
 (u'FrankRKryzak', u'arts_centre'),
 (u'Zol87', u'artwork'),
 (u'Tomasz11', u'atm'),
 (u'PhQ', u'baby_hatch'),
 (u'Tomasz11', u'bank'),
 (u'Umbugbene', u'banquet_hall')]
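
Because the query groups by value, each amenity type appears once with a single user. If the goal is instead to rank users by how many amenity tags are attached to their nodes, a query along the following lines could be used. This is a sketch on toy data, written for Python 3 and not run against the Chicago database; the rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nodes (id, user)')
con.execute('CREATE TABLE nodes_tags (id, key, value, type)')
con.executemany('INSERT INTO nodes VALUES (?, ?)',
                [(1, 'alice'), (2, 'alice'), (3, 'bob')])
con.executemany('INSERT INTO nodes_tags VALUES (?, ?, ?, ?)',
                [(1, 'amenity', 'cafe', 'regular'),
                 (2, 'amenity', 'bank', 'regular'),
                 (3, 'amenity', 'cafe', 'regular'),
                 (3, 'name', 'Some Cafe', 'regular')])

# Group by user instead of by amenity value, counting amenity tags only.
amenities_per_user = con.execute(
    'SELECT nodes.user, COUNT(*) AS num '
    'FROM nodes JOIN nodes_tags ON nodes.id = nodes_tags.id '
    "WHERE nodes_tags.key = 'amenity' "
    'GROUP BY nodes.user '
    'ORDER BY num DESC, nodes.user'
).fetchall()

print(amenities_per_user)
```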

### List of postcodes


In [28]:
def list_of_postcodes():
    output = cur.execute('SELECT e.value, COUNT(*) as num FROM \
                            (SELECT value FROM nodes_tags WHERE key="postcode"\
                             UNION ALL SELECT value FROM ways_tags WHERE key="postcode") e \
                            GROUP BY e.value \
                            ORDER BY num DESC \
                            LIMIT 5' ) # Remove this limit to see the complete list of postcodes
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'List of postcodes: \n'
postcodes = list_of_postcodes()


List of postcodes: 

[(u'60201', 9392),
 (u'60202', 7727),
 (u'60305', 1720),
 (u'60564', 1684),
 (u'60136', 1306)]

### Amenities 

I also checked which amenities are most common across the dataset. The query joins nodes_tags to the set of node ids that carry an amenity tag and counts each amenity value; it does not filter by postcode. Since the list is quite long, I limited it to the 20 most frequent amenities.

In [29]:
def most_common_amenities():
    output = cur.execute('SELECT nodes_tags.value, COUNT(*) as num \
                          FROM nodes_tags \
                            JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE key="amenity") AS amenities \
                            ON nodes_tags.id = amenities.id \
                            WHERE nodes_tags.key="amenity"\
                            GROUP BY nodes_tags.value \
                            ORDER BY num DESC \
                            LIMIT 20' ) # Remove this limit to see the complete list of amenities
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Most common amenities: \n'
common_amenities = most_common_amenities()

Most common amenities: 

[(u'place_of_worship', 3038),
 (u'school', 1906),
 (u'restaurant', 1568),
 (u'fast_food', 899),
 (u'parking', 603),
 (u'cafe', 450),
 (u'bench', 426),
 (u'bicycle_parking', 414),
 (u'fuel', 388),
 (u'bicycle_rental', 361),
 (u'bank', 330),
 (u'drinking_water', 282),
 (u'bar', 257),
 (u'fountain', 229),
 (u'grave_yard', 216),
 (u'shelter', 204),
 (u'fire_station', 192),
 (u'toilets', 179),
 (u'pharmacy', 166),
 (u'pub', 162)]

### Popular Cafes

In [30]:
def most_popular_cafes():
    output = cur.execute('SELECT nodes_tags.value, COUNT(*) as num \
                          FROM nodes_tags \
                            JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value="coffee_shop") AS cafes \
                            ON nodes_tags.id = cafes.id \
                            WHERE nodes_tags.key="name"\
                            GROUP BY nodes_tags.value \
                            ORDER BY num DESC \
                            LIMIT 10' ) # Remove this limit to see the complete list of cafes
    rows = output.fetchall()  # fetch once; a second fetchall() would return []
    pprint(rows)
    return rows

print 'Most popular cafes: \n'
popular_cafes = most_popular_cafes()

Most popular cafes: 

[(u'Starbucks', 50),
 (u"Dunkin' Donuts", 13),
 (u'Starbucks Coffee', 10),
 (u"Peet's Coffee & Tea", 3),
 (u'Intelligentsia', 2),
 (u'Intelligentsia Coffee', 2),
 (u'Blue Max Coffee', 1),
 (u'Bow Truss Coffee Roasters', 1),
 (u'Brew Brew Coffee Lounge', 1),
 (u'Bridgeport Coffeehouse', 1)]
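
The cafe query relies on a self-join: the subquery collects the ids of nodes that have a tag with the value coffee_shop, and the outer query then reads the name tags of exactly those nodes. The sketch below reproduces the pattern on a toy nodes_tags table (written for Python 3; the tag rows and cafe names are invented for illustration).

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE nodes_tags (id, key, value, type)')
con.executemany('INSERT INTO nodes_tags VALUES (?, ?, ?, ?)', [
    (1, 'cuisine', 'coffee_shop', 'regular'), (1, 'name', 'Cafe A', 'regular'),
    (2, 'cuisine', 'coffee_shop', 'regular'), (2, 'name', 'Cafe A', 'regular'),
    (3, 'cuisine', 'pizza', 'regular'), (3, 'name', 'Pizza B', 'regular'),
    (4, 'cuisine', 'coffee_shop', 'regular'), (4, 'name', 'Cafe C', 'regular'),
])

# Nodes 1, 2 and 4 are coffee shops; node 3 is filtered out by the join.
cafe_counts = con.execute(
    'SELECT nodes_tags.value, COUNT(*) AS num '
    'FROM nodes_tags '
    "JOIN (SELECT DISTINCT id FROM nodes_tags WHERE value = 'coffee_shop') AS cafes "
    'ON nodes_tags.id = cafes.id '
    "WHERE nodes_tags.key = 'name' "
    'GROUP BY nodes_tags.value '
    'ORDER BY num DESC, nodes_tags.value'
).fetchall()

print(cafe_counts)
```

Names are grouped by their exact spelling, which is why 'Starbucks' and 'Starbucks Coffee' appear as separate rows in the result above.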

# Reference

https://github.com/Dalaska/Udacity-Data-Wrangling-Clean-OpenStreetMap

https://github.com/bestkao/data-wrangling-with-openstreetmap-and-mongodb

https://discussions.udacity.com/t/display-files-and-their-sizes-in-directory/186741

https://github.com/Nazaniiin/DataWrangling_OpenStreetMap

https://github.com/alphagammamle/Udacity-Data-Analyst-Nanodegree/tree/master/P3-OpenStreetMap-Wrangling-with-SQL

https://github.com/alphagammamle/OpenStreetMap-Toronto

https://github.com/paul-reiners/udacity-data-wrangling-mongo-db/blob/master/docs/ProjectReport.md

http://napitupulu-jon.appspot.com/posts/wrangling-openstreetmap.html

http://puwenning.github.io/2016/02/10/P3-project-openstreetmap-data-case-study/

http://fch808.github.io/Data%20Wrangling%20with%20MongoDB%20-%20Write-up.html