# Auditing the OSM Data

Before we can do the real auditing, let's get the preliminary chores over with:

In [1]:
# Modules
from __future__  import print_function
from collections import Counter, defaultdict
from lxml.etree  import iterparse  # lxml is much faster
import re

%run 'vendor/pprint_utf.py'        # Fix the problem of displaying Chinese

# Config constants
SAMPLE_FILE  = 'osm/sample.osm'
OSM_FILE     = 'osm/Xian-Xianyang.osm'
OSM_ELEMENTS = ('node', 'way', 'relation')

# Regex patterns (retroactively collected here from the auditing process below)
ALL_EN = re.compile(u'^[a-zA-Z0-9\s]+$')
ALL_CN = re.compile(u'^[\u4e00-\u9fa5]+$')
CN_EN  = re.compile(u'^[\u4e00-\u9fa5]+[^\u4e00-\u9fa5]+')
CN_SPACE_EN = re.compile(u'^[\u4e00-\u9fa5]+\s[^\u4e00-\u9fa5]+')

# Helper functions (retroactively collected here from the auditing process below)
#
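# Generate every tag key and value with its parent OSM element
# (body reconstructed from how all_tags() is called below)
def all_tags(osm_file):
    for _, elem in iterparse(osm_file, tag=OSM_ELEMENTS):
        for tag in elem.iter('tag'):
            yield elem, tag.attrib['k'], tag.attrib['v']
        elem.clear()

# Print the element id and all its tags in the form of a dict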
def print_element(elem):
    print(elem.tag, 'id:', elem.attrib['id'])
    pprint({ tag.attrib['k']: tag.attrib['v'] 
             for tag in elem.iter('tag') })
    print()
            
# Find an OSM element by id and
# print all its tags in the form of a dict
def find_and_print(id):
    for _, elem in iterparse(OSM_FILE, tag=OSM_ELEMENTS):
        if elem.attrib['id'] == id:
            print_element(elem)
            elem.clear()
            return
        else:
            elem.clear()
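
To get a feel for what the four regex patterns above accept, they can be tried on a few strings. This is only a sketch: the `classify()` helper, its labels, and the order in which the patterns are tested are assumptions made for illustration, not the way the audit itself uses them. `\s` is written with a doubled backslash only so the snippet runs unchanged on Python 2 and 3. The sample strings are tag values that appear in the outputs of this notebook.

```python
# -*- coding: utf-8 -*-
import re

# Same patterns as in the config cell ("\\s" is "\s" with the backslash escaped)
ALL_EN = re.compile(u'^[a-zA-Z0-9\\s]+$')
ALL_CN = re.compile(u'^[\u4e00-\u9fa5]+$')
CN_EN = re.compile(u'^[\u4e00-\u9fa5]+[^\u4e00-\u9fa5]+')
CN_SPACE_EN = re.compile(u'^[\u4e00-\u9fa5]+\\s[^\u4e00-\u9fa5]+')

# Hypothetical labels. CN_SPACE_EN is tested before CN_EN because every
# string matched by CN_SPACE_EN is also matched by CN_EN.
PATTERNS = [
    ('all_en', ALL_EN),
    ('all_cn', ALL_CN),
    ('cn_space_en', CN_SPACE_EN),
    ('cn_en', CN_EN),
]

def classify(value):
    """Return the label of the first pattern that matches, else 'other'."""
    for label, pattern in PATTERNS:
        if pattern.match(value):
            return label
    return 'other'

for value in [u'Xihan Expressway', u'西汉高速', u'西安绕城高速G3001', u'Cao Lăng']:
    print(value, '->', classify(value))
```

Note that `CN_EN` and `CN_SPACE_EN` are not anchored at the end, so they only describe how a value starts.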

## The Shape of the OSM File

A good first step is to list every XML element type in the file together with its number of occurrences, in descending order.

In [2]:
xml_element_count = Counter(elem.tag for _, elem in iterparse(SAMPLE_FILE))
xml_element_count.most_common()

[('nd', 30411),
 ('node', 25623),
 ('tag', 8036),
 ('way', 3415),
 ('member', 602),
 ('relation', 21),
 ('osm', 1)]
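
One caveat with the one-liner above: `iterparse` keeps building the tree as it goes, so on the full file it may be worth clearing each element once it has been counted. A minimal sketch of such a variant, assuming the standard library's `xml.etree.ElementTree` (so it runs without lxml) and a made-up inline document in place of the real file; `count_elements` is a hypothetical name:

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(source):
    """Count XML elements by tag name, clearing each one after it is counted."""
    counts = Counter()
    for _, elem in ET.iterparse(source):  # default event is 'end'
        counts[elem.tag] += 1
        elem.clear()                      # drop children, attributes and text
    return counts

# Tiny made-up document, only for demonstration
SAMPLE = b"""<osm>
  <node id="1"><tag k="place" v="city"/></node>
  <node id="2"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="primary"/></way>
</osm>"""

counts = count_elements(io.BytesIO(SAMPLE))
print(counts.most_common())
```

Clearing is safe here because only the tag names are needed; code that reads attributes or children must do so before calling `clear()`.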

Now that we know which XML elements occur and how often, let's look at the shape of the file. Opening it in any text editor reveals the outline below, which accounts for all the XML elements counted above:
```
<osm>
    <node>          # a point on the map
        <tag></tag> # contains a key/value pair
    </node>
    
    <way>           # a list of node elements
        <nd></nd>   # refers to a node element
        <tag></tag> # contains a key/value pair
    </way>
    
    <relation>            # a list of node/way/relation elements
        <member></member> # refers to a node/way/relation element
        <tag></tag>       # contains a key/value pair
    </relation>
</osm>
```

It can reasonably be deduced that the three OSM element types (node, way and relation) provide the structure of the map data, while the real meat resides in the tag elements. So tags will be the focal point of our auditing.

## Surveying Tags

Since we will probably spend a lot of time with tags, I have defined an **all_tags()** helper function (placed in the very first cell), which gives us easy access to every tag's key and value along with its parent OSM element.

The code below shows how **all_tags()** is used: it prints every tag in the sample file.

In [3]:
for elem, key, val in all_tags(SAMPLE_FILE):
    pprint([elem.attrib['id'], key, val])

['244080278', 'name', 'Gaoling Xian']
['244080278', 'place', 'county']
['244080278', 'gns:DSG', 'ADM3']
['244080278', 'gns:UFI', '-1906363']
['244080278', 'gns:UNI', '6816820']
['244080278', 'name:vi', u'Cao Lăng']
['244080278', 'name:zh', u'高陵县']
['244080278', 'gns:ADM1', '26']
['244080278', 'created_by', 'dkt_GNS-import-1']
['244080278', 'name:zh_pinyin', 'Gaoling Xian']
['244081932', 'name', u'渭南市']
['244081932', 'place', 'city']
['244081932', 'gns:DSG', 'ADM2']
['244081932', 'gns:UFI', '-1930304']
['244081932', 'gns:UNI', '6816771']
['244081932', 'name:de', 'Weinan']
['244081932', 'name:en', 'Weinan']
['244081932', 'name:fr', 'Weinan']
['244081932', 'name:ja', u'渭南市']
['244081932', 'name:vi', u'Vị Nam']
['244081932', 'name:zh', u'渭南市']
['244081932', 'gns:ADM1', '26']
['244081932', 'wikipedia', u'zh:渭南市']
['244081932', 'is_in:country', 'China']
['244081932', 'name:zh_pinyin', u'Wèinán Shì']
['244081932', 'is_in:continent', 'Asia']
['244081932', 'is_in:country_code', 'CN']
...
['1828989797', 'source', 'Bing']
['1828989830', 'power', 'tower']
['1828989830', 'source', 'Bing']
['1828989876', 'power', 'tower']
['1828989876', 'source', 'Bing']
['1876844147', 'foot', 'yes']
['1876844147', 'name', u'西北工业大学友谊校区南门']
['1876844147', 'horse', 'no']
['1876844147', 'barrier', 'gate']
['1876844147', 'bicycle', 'yes']
['1876844147', 'motor_vehicle', 'designated']
['1905583315', 'highway', 'motorway_junction']
['1918384840', 'name', u'谢王桥']
['1918384840', 'highway', 'motorway_junction']
['1918760562', 'highway', 'traffic_signals']
['1927956207', 'highway', 'motorway_junction']
['2152344863', 'power', 'tower']
['2152344896', 'power', 'tower']
['2152344953', 'power', 'tower']
['2152344965', 'power', 'tower']
['2152345039', 'power', 'tower']
['2152345071', 'power', 'tower']
['2152345136', 'power', 'tower']
['2153564006', 'power', 'tower']
['2153564020', 'power', 'tower']
['2153564050', 'power', 'tower']
['2153564060', 'power', 'tower']
['2153564070', 'power', 'tower']
...
['2968410795', 'railway', 'station']
['2972370360', 'name', u'安远门站']
['2972370360', 'railway', 'station']
['2977155119', 'name', u'保税区站']
['2977155119', 'railway', 'station']
['2977172947', 'name', u'石家街站']
['2977172947', 'railway', 'station']
['2984012340', 'name', u'秦始皇陵']
['2997258126', 'highway', 'traffic_signals']
['2999960009', 'name', u'两岔河站']
['2999960009', 'railway', 'station']
['3003935030', 'name', u'漠西站']
['3003935030', 'railway', 'station']
['3011811575', 'name', u'黄良街办']
['3011811575', 'place', 'town']
['3011812876', 'name', u'王寺街办']
['3011812876', 'place', 'town']
['3013880327', 'name', u'代王街办']
['3013880327', 'place', 'town']
['3015663398', 'name', u'洪庆街办']
['3015663398', 'place', 'town']
['3060726439', 'name', u'西关']
['3060726439', 'place', 'town']
['3074748427', 'name', u'普集街镇']
['3074748427', 'place', 'town']
['3079707878', 'name', u'东关南街']
['3079707878', 'place', 'town']
['3079707888', 'name', u'谭家']
['3079707888', 'place', 'town']
['3079723028', 'name', u'长延堡']
...
['28294894', 'oneway', 'yes']
['28294894', 'highway', 'primary']
['28320649', 'name', u'二环南路西段']
['28320649', 'lanes', '2']
['28320649', 'oneway', 'yes']
['28320649', 'highway', 'trunk']
['28340417', 'name', u'二环南路']
['28340417', 'lanes', '2']
['28340417', 'oneway', 'yes']
['28340417', 'highway', 'trunk']
['28343383', 'ref', 'G3011']
['28343383', 'lanes', '3']
['28343383', 'layer', '1']
['28343383', 'bridge', 'yes']
['28343383', 'oneway', 'yes']
['28343383', 'highway', 'motorway']
['28356512', 'ref', 'S105']
['28356512', 'name', 'S105']
['28356512', 'oneway', 'no']
['28356512', 'highway', 'primary']
['28356512', 'surface', 'asphalt']
['28358449', 'lanes', '4']
['28358449', 'layer', '2']
['28358449', 'bridge', 'yes']
['28358449', 'highway', 'trunk']
['28358567', 'name', 'Car Repair Parking']
['28358567', 'amenity', 'parking']
['28460937', 'lanes', '2']
['28460937', 'layer', '2']
['28460937', 'bridge', 'yes']
['28460937', 'oneway', 'yes']
['28460937', 'highway', 'trunk_link']
...
['84170167', 'oneway', 'yes']
['84170167', 'source', 'Bing']
['84170167', 'highway', 'motorway']
['84170167', 'name:en', 'Xihan Expressway']
['85318488', 'aerialway', 'cable_car']
['88216827', 'name', 'Huaqing Hot Springs']
['88216827', 'leisure', 'park']
['93087711', 'layer', '1']
['93087711', 'bridge', 'yes']
['93087711', 'oneway', 'yes']
['93087711', 'highway', 'motorway_link']
['93087726', 'lanes', '2']
['93087726', 'oneway', 'yes']
['93087726', 'highway', 'motorway_link']
['93483183', 'ref', 'G5']
['93483183', 'name', u'西汉高速']
['93483183', 'oneway', 'yes']
['93483183', 'source', 'GPS']
['93483183', 'highway', 'motorway_link']
['93483183', 'name:en', 'Xihan Expressway']
['93483883', 'highway', 'tertiary']
['93483883', 'name', u'渭滨路']
['93484769', 'highway', 'residential']
['93484782', 'highway', 'unclassified']
['93484782', 'name', u'瞪羚路']
['93484782', 'oneway', 'no']
['120859725', 'ref', 'G5']
['120859725', 'name', u'西禹高速']
['120859725', 'oneway', 'yes']
...
['148554787', 'oneway', 'yes']
['148554787', 'source', 'Bing']
['148832624', 'name', u'公园南路环道']
['148832624', 'oneway', 'yes']
['148832624', 'highway', 'secondary']
['148976742', 'name', u'西太路']
['148976742', 'oneway', 'yes']
['148976742', 'source', 'Bing']
['148976742', 'highway', 'trunk']
['149141973', 'name', u'八府庄北路']
['149141973', 'highway', 'residential']
['149143828', 'highway', 'tertiary']
['149699973', 'name', u'团结一路']
['149699973', 'highway', 'residential']
['149735092', 'name', u'新桥路']
['149735092', 'highway', 'residential']
['152427720', 'layer', '1']
['152427720', 'bridge', 'yes']
['152427720', 'highway', 'primary']
['152427993', 'lanes', '2']
['152427993', 'oneway', 'yes']
['152427993', 'highway', 'trunk']
['152428516', 'name', u'二环南路西段']
['152428516', 'lanes', '2']
['152428516', 'layer', '1']
['152428516', 'oneway', 'yes']
['152428516', 'highway', 'trunk']
['152429475', 'name', '2nd Ring Road']
['152429475', 'lanes', '2']
['152429475', 'oneway', 'yes']
...
['181212629', 'source', 'Bing']
['181212629', 'highway', 'residential']
['181212639', 'source', 'Bing']
['181212639', 'highway', 'residential']
['181212776', 'highway', 'secondary']
['181212776', 'name', u'天谷八路']
['181212776', 'source', 'Bing']
['181215282', 'bridge', 'yes']
['181215282', 'highway', 'motorway']
['181215282', 'lanes', '3']
['181215282', 'layer', '1']
['181215282', 'name', u'环高速公路']
['181215282', 'name:en', 'Ring Expressway']
['181215282', 'oneway', 'yes']
['181215282', 'ref', 'G3001']
['181215282', 'toll', 'yes']
['181215292', 'lanes', '3']
['181215292', 'layer', '1']
['181215292', 'bridge', 'yes']
['181215292', 'oneway', 'yes']
['181215292', 'highway', 'motorway']
['181215307', 'layer', '2']
['181215307', 'bridge', 'yes']
['181215307', 'oneway', 'yes']
['181215307', 'source', 'Bing']
['181215307', 'highway', 'motorway_link']
['181338927', 'layer', '3']
['181338927', 'bridge', 'yes']
['181338927', 'oneway', 'yes']
['181338927', 'highway', 'motorway_link']
...
['218574509', 'oneway', 'yes']
['218574509', 'source', 'bing']
['218574509', 'highway', 'motorway_link']
['218574527', 'source', 'bing']
['218574527', 'highway', 'tertiary']
['218574543', 'source', 'bing']
['218574543', 'highway', 'tertiary']
['218574557', 'highway', 'trunk']
['218574557', 'oneway', 'yes']
['218574557', 'ref', 'G312']
['218816671', 'name', u'学9楼']
['218816671', 'building', 'yes']
['218816671', 'addr:housename', u'学9楼']
['219050112', 'highway', 'steps']
['220422988', 'highway', 'living_street']
['220545805', 'highway', 'living_street']
['220546389', 'highway', 'living_street']
['220546399', 'highway', 'living_street']
['220547878', 'highway', 'living_street']
['220548566', 'highway', 'living_street']
['220549381', 'name', u'学4楼']
['220549381', 'building', 'yes']
['220557319', 'name', u'体育馆']
['220557319', 'leisure', 'sports_centre']
['220682934', 'name', u'水果屋']
['220682934', 'building', 'yes']
['220684361', 'name', u'学3楼']
['220684361', 'building', 'yes']
...
['229922835', 'ref', 'G30']
['229922835', 'name', u'西宝高速']
['229922835', 'lanes', '2']
['229922835', 'oneway', 'yes']
['229922835', 'highway', 'motorway']
['229922835', 'int_ref', 'AH5']
['229922835', 'name:en', 'Xibao Expressway']
['229922845', 'ref', 'G30']
['229922845', 'name', u'西宝高速']
['229922845', 'lanes', '2']
['229922845', 'layer', '1']
['229922845', 'bridge', 'yes']
['229922845', 'oneway', 'yes']
['229922845', 'highway', 'motorway']
['229922845', 'int_ref', 'AH5']
['229922845', 'name:en', 'Xibao Expressway']
['229923445', 'lanes', '4']
['229923445', 'layer', '1']
['229923445', 'bridge', 'yes']
['229923445', 'oneway', 'yes']
['229923445', 'highway', 'motorway']
['229923455', 'ref', 'G30']
['229923455', 'name', u'西宝高速']
['229923455', 'lanes', '4']
['229923455', 'oneway', 'yes']
['229923455', 'highway', 'motorway']
['229923455', 'int_ref', 'AH5']
['229923455', 'name:en', 'Xibao Expressway']
['230006053', 'highway', 'tertiary']
['230059420', 'highway', 'secondary']
...
['236672203', 'highway', 'motorway']
['236672580', 'highway', 'tertiary']
['237065467', 'ref', 'G3001']
['237065467', 'name', u'西安绕城高速G3001']
['237065467', 'lanes', '3']
['237065467', 'oneway', 'yes']
['237065467', 'highway', 'motorway']
['237069236', 'highway', 'residential']
['237069877', 'highway', 'unclassified']
['237070116', 'highway', 'residential']
['237070510', 'highway', 'residential']
['237070568', 'highway', 'residential']
['237077992', 'highway', 'residential']
['237081232', 'highway', 'track']
['237087088', 'oneway', 'yes']
['237087088', 'highway', 'motorway_link']
['237087292', 'highway', 'residential']
['237110063', 'highway', 'residential']
['237113977', 'name', u'测绘路']
['237113977', 'highway', 'residential']
['237113977', 'name:en', 'Cehui Rd']
['237116435', 'highway', 'residential']
['237117203', 'highway', 'tertiary']
['237117203', 'name', u'天台八路']
['237572690', 'highway', 'unclassified']
['237572690', 'name', u'渔场路']
['237900035', 'highway', 'service']
...
['246609124', 'building', 'house']
['246610878', 'highway', 'residential']
['246614163', 'highway', 'unclassified']
['246614163', 'name', u'创新路']
['246620144', 'highway', 'residential']
['246620165', 'name', u'红枫林']
['246620165', 'building', 'house']
['246620165', 'addr:city', u'西安市']
['246620165', 'addr:housename', u'红枫林']
['246620184', 'name', u'红枫林']
['246620184', 'building', 'house']
['246620184', 'addr:city', u'西安市']
['246620184', 'addr:housename', u'红枫林']
['246763840', 'building', 'house']
['246764785', 'highway', 'residential']
['246764795', 'building', 'house']
['246765364', 'building', 'house']
['246767548', 'building', 'house']
['246767564', 'highway', 'residential']
['246767580', 'building', 'house']
['247807436', 'name', u'团结四路']
['247807436', 'highway', 'residential']
['247807449', 'name', u'人民西巷']
['247807449', 'highway', 'residential']
['247807449', 'name:en', 'W. Renming Alley']
['247813443', 'highway', 'residential']
['247813460', 'name', u'枣园西路']
...
['255698431', 'name', u'问远路']
['255698441', 'highway', 'residential']
['255698441', 'name', u'天问路']
['255869827', 'name', u'长安汽车站']
['255869827', 'office', 'company']
['255872506', 'bridge', 'yes']
['255872506', 'highway', 'footway']
['255892580', 'building', 'college']
['255906190', 'name', u'长安一中初中部']
['255906190', 'amenity', 'school']
['255907923', 'name', u'明 秦惠王墓']
['255907923', 'landuse', 'cemetery']
['255915447', 'name', u'长安妇幼保健医院']
['255915447', 'amenity', 'hospital']
['255915447', 'building', 'yes']
['255986671', 'name', u'高新四小']
['255986671', 'amenity', 'school']
['255988821', 'highway', 'unclassified']
['255988831', 'highway', 'unclassified']
['255988842', 'highway', 'unclassified']
['255991434', 'embankment', 'yes']
['255991434', 'highway', 'residential']
['255992171', 'highway', 'residential']
['255992181', 'highway', 'unclassified']
['255992181', 'website', 'http://zjq.in']
['255992755', 'highway', 'unclassified']
['255992765', 'highway', 'unclassified']
...
['263633893', 'name', u'网球场']
['263633893', 'sport', 'tennis']
['263633893', 'leisure', 'pitch']
['263634515', 'name', u'西花园']
['263634515', 'leisure', 'park']
['263635110', 'source', 'bing']
['263635110', 'waterway', 'river']
['263636854', 'source', 'GPS']
['263636854', 'highway', 'tertiary']
['263738531', 'oneway', 'yes']
['263738531', 'highway', 'secondary']
['263739288', 'name', u'环城南路东段']
['263739288', 'oneway', 'yes']
['263739288', 'highway', 'trunk_link']
['263769902', 'highway', 'trunk']
['263769902', 'name', u'环城东路']
['263769902', 'name:en', 'East Ring Road']
['263770044', 'oneway', 'yes']
['263770044', 'highway', 'primary']
['263773392', 'oneway', 'yes']
['263773392', 'highway', 'trunk_link']
['263774000', 'lanes', '4']
['263774000', 'highway', 'trunk']
['263774505', 'tunnel', 'yes']
['263774505', 'highway', 'cycleway']
['263774842', 'oneway', 'yes']
['263774842', 'highway', 'service']
['263775868', 'lanes', '3']
['263775868', 'oneway', 'yes']
['263775868', 'highway', 'trunk']
...
['267371242', 'name:zh', u'陇海线']
['267371242', 'railway', 'rail']
['267371242', 'voltage', '25000']
['267371242', 'frequency', '50']
['267371242', 'electrified', 'contact_line']
['267372303', 'name', u'陇海线']
['267372303', 'gauge', '1435']
['267372303', 'layer', '1']
['267372303', 'usage', 'main']
['267372303', 'bridge', 'yes']
['267372303', 'name:en', 'Longhai Line']
['267372303', 'name:zh', u'陇海线']
['267372303', 'railway', 'rail']
['267372303', 'voltage', '25000']
['267372303', 'frequency', '50']
['267372303', 'electrified', 'contact_line']
['267372781', 'name', u'陇海线']
['267372781', 'gauge', '1435']
['267372781', 'layer', '1']
['267372781', 'usage', 'main']
['267372781', 'bridge', 'yes']
['267372781', 'name:en', 'Longhai Line']
['267372781', 'name:zh', u'陇海线']
['267372781', 'railway', 'rail']
['267372781', 'voltage', '25000']
['267372781', 'frequency', '50']
['267372781', 'electrified', 'contact_line']
['267397517', 'railway', 'rail']
['267397623', 'highway', 'primary']
...
['283064990', 'ref', 'S107']
['283064990', 'lanes', '3']
['283064990', 'oneway', 'yes']
['283064990', 'source', 'GPS']
['283064990', 'highway', 'primary']
['283397529', 'landuse', 'grass']
['283547965', 'name', u'西安软件园秦风阁']
['283547965', 'building', 'commercial']
['283552821', 'highway', 'residential']
['283552832', 'name', u'土木工程大楼']
['283552832', 'building', 'house']
['283580431', 'ref', 'S108']
['283580431', 'source', 'GPS']
['283580431', 'highway', 'secondary']
['283722350', 'railway', 'rail']
['283722350', 'service', 'spur']
['283722360', 'railway', 'rail']
['283722360', 'service', 'spur']
['284164584', 'building', 'yes']
['284164942', 'building', 'yes']
['284969427', 'ref', 'S108']
['284969427', 'name', 'S108']
['284969427', 'source', 'GPS']
['284969427', 'highway', 'primary']
['284969437', 'man_made', 'pipeline']
['284969758', 'railway', 'rail']
['284969758', 'voltage', '25000']
['284969758', 'frequency', '50']
['284970071', 'landuse', 'grass']
['284971699', 'waterway', 'river']

['289322516', 'highway', 'primary']
['289323046', 'oneway', 'yes']
['289323046', 'highway', 'primary_link']
['289323056', 'highway', 'residential']
['289323066', 'highway', 'residential']
['289323883', 'oneway', 'yes']
['289323883', 'highway', 'motorway_link']
['289323894', 'highway', 'residential']
['289323904', 'mtb', 'yes']
['289323904', 'ref', 'G108;G310']
['289323904', 'note', 'conventional trunk road superceded by expressway']
['289323904', 'source', 'osm-gpx']
['289323904', 'highway', 'trunk']
['289324235', 'source', 'bing']
['289324235', 'railway', 'rail']
['289324245', 'highway', 'tertiary']
['289324723', 'highway', 'residential']
['289331705', 'highway', 'residential']
['289331957', 'landuse', 'residential']
['289336225', 'building', 'yes']
['289338511', 'oneway', 'yes']
['289338511', 'highway', 'primary_link']
['289338521', 'name', u'中街']
['289338521', 'highway', 'primary']
['289343362', 'building', 'yes']
['289385375', 'oneway', 'yes']
...
['291270021', 'highway', 'service']
['291270813', 'highway', 'tertiary']
['291270823', 'name', u'渭阳西路']
['291270823', 'oneway', 'yes']
['291270823', 'highway', 'primary']
['291270833', 'highway', 'residential']
['291271355', 'name', u'联盟四路']
['291271355', 'highway', 'residential']
['291271385', 'highway', 'residential']
['291289715', 'landuse', 'grass']
['291294745', 'highway', 'footway']
['291294755', 'highway', 'footway']
['291294765', 'highway', 'footway']
['291295501', 'name', u'水房']
['291295501', 'building', 'house']
['291296779', 'highway', 'footway']
['291297682', 'highway', 'footway']
['291297692', 'highway', 'footway']
['291298999', 'landuse', 'grass']
['291299012', 'landuse', 'grass']
['291426078', 'highway', 'residential']
['291426088', 'highway', 'residential']
['291426401', 'highway', 'tertiary']
['291426411', 'highway', 'unclassified']
['291574433', 'landuse', 'grass']
['291575967', 'amenity', 'restaurant']
['291575967', 'building', 'yes']
['291575967', 'name', u'食堂']
...
['293494489', 'waterway', 'river']
['293494499', 'waterway', 'river']
['293495545', 'ref', 'G5']
['293495545', 'name', u'西汉高速']
['293495545', 'lanes', '2']
['293495545', 'bridge', 'yes']
['293495545', 'oneway', 'yes']
['293495545', 'source', 'GPS']
['293495545', 'highway', 'motorway']
['293495545', 'name:en', 'Xihan Expressway']
['293495555', 'ref', 'S107']
['293495555', 'bridge', 'yes']
['293495555', 'source', 'GPS']
['293495555', 'highway', 'primary']
['293498132', 'highway', 'residential']
['293498142', 'bridge', 'yes']
['293498142', 'highway', 'residential']
['293498152', 'bridge', 'yes']
['293498152', 'highway', 'secondary']
['293499681', 'name', u'铁一中']
['293499681', 'amenity', 'school']
['293500870', 'railway', 'rail']
['293500870', 'service', 'spur']
['293509597', 'highway', 'secondary']
['293509607', 'railway', 'disused']
['293510537', 'highway', 'tertiary_link']
['293553190', 'name', u'浐灞湿地公园']
['293553190', 'leisure', 'park']
['293553215', 'lanes', '3']
...
['301071013', 'source', 'bing']
['301071013', 'highway', 'primary']
['301071625', 'name', 'G40']
['301071625', 'lanes', '2']
['301071625', 'layer', '1']
['301071625', 'bridge', 'yes']
['301071625', 'oneway', 'yes']
['301071625', 'highway', 'motorway']
['301071635', 'bridge', 'yes']
['301071635', 'oneway', 'yes']
['301071635', 'highway', 'motorway_link']
['301071645', 'waterway', 'river']
['301073815', 'ref', 'G40']
['301073815', 'name', 'G40']
['301073815', 'lanes', '2']
['301073815', 'oneway', 'yes']
['301073815', 'highway', 'motorway']
['301073836', 'ref', 'G40']
['301073836', 'name', 'G40']
['301073836', 'lanes', '2']
['301073836', 'oneway', 'yes']
['301073836', 'highway', 'motorway']
['301075886', 'highway', 'tertiary']
['301191495', 'highway', 'trunk_link']
['301278783', 'highway', 'tertiary']
['301419561', 'railway', 'rail']
['301419561', 'service', 'siding']
['301464618', 'highway', 'residential']
['301483274', 'ref', 'G30']
['301483274', 'name', u'西潼高速G20']
...
['333416295', 'addr:housenumber', u'127号']
['333418727', 'foot', 'yes']
['333418727', 'name', u'永宁门隧道']
['333418727', 'horse', 'no']
['333418727', 'lanes', '6']
['333418727', 'layer', '-1']
['333418727', 'oneway', 'no']
['333418727', 'tunnel', 'yes']
['333418727', 'bicycle', 'yes']
['333418727', 'highway', 'trunk']
['333418727', 'maxspeed', '60']
['333536495', 'amenity', 'public_building']
['333536495', 'building', 'yes']
['333553735', 'waterway', 'drain']
['333750108', 'highway', 'secondary']
['333752109', 'railway', 'rail']
['333840997', 'name', u'东方红广场']
['333840997', 'leisure', 'playground']
['333988762', 'name', u'行政楼']
['333988762', 'building', 'school']
['333988772', 'highway', 'footway']
['334007337', 'name', u'三星立交']
['334007337', 'layer', '1']
['334007337', 'bridge', 'yes']
['334007337', 'oneway', 'yes']
['334007337', 'highway', 'motorway_link']
['334281568', 'oneway', 'yes']
['334281568', 'highway', 'trunk_link']
['334282303', 'highway', 'unclassified']
...
['351364229', 'gauge', '1435']
['351364229', 'layer', '1']
['351364229', 'usage', 'main']
['351364229', 'bridge', 'yes']
['351364229', 'railway', 'rail']
['351364229', 'maxspeed', '350']
['351364229', 'highspeed', 'yes']
['351377432', 'highway', 'residential']
['351896268', 'highway', 'residential']
['351896278', 'highway', 'residential']
['351897176', 'landuse', 'residential']
['351899933', 'waterway', 'drain']
['352030782', 'landuse', 'residential']
['352678526', 'layer', '-1']
['352678526', 'tunnel', 'yes']
['352678526', 'highway', 'service']
['354224088', 'oneway', 'yes']
['354224088', 'highway', 'motorway_link']
['354424278', 'name', u'行贝鲁']
['354424278', 'bridge', 'yes']
['354424278', 'highway', 'residential']
['354424288', 'name', u'行者路']
['354424288', 'lanes', '4']
['354424288', 'highway', 'secondary']
['354424288', 'name:en', 'Xingzhe Rd']
['354426807', 'landuse', 'residential']
['354480473', 'landuse', 'residential']
['354603288', 'natural', 'water']
...
['377421409', 'source', 'GPS']
['377421409', 'highway', 'motorway']
['377421409', 'int_ref', 'AH5']
['377421419', 'layer', '1']
['377421419', 'bridge', 'yes']
['377421419', 'highway', 'motorway_link']
['377422391', 'ref', 'G40;G70']
['377422391', 'bridge', 'yes']
['377422391', 'oneway', 'yes']
['377422391', 'source', 'GPS']
['377422391', 'highway', 'motorway']
['377422391', 'int_ref', 'AH5']
['377422402', 'ref', 'G40;G70']
['377422402', 'bridge', 'yes']
['377422402', 'oneway', 'yes']
['377422402', 'source', 'GPS']
['377422402', 'highway', 'motorway']
['377422402', 'int_ref', 'AH5']
['377422412', 'ref', 'G40;G70']
['377422412', 'oneway', 'yes']
['377422412', 'source', 'GPS']
['377422412', 'highway', 'motorway']
['377422412', 'int_ref', 'AH5']
['377422422', 'ref', 'G40;G70']
['377422422', 'oneway', 'yes']
['377422422', 'source', 'GPS']
['377422422', 'tunnel', 'yes']
['377422422', 'highway', 'motorway']
['377422422', 'int_ref', 'AH5']
['377422432', 'ref', 'G40;G70']
...
['396260777', 'highway', 'primary_link']
['396260807', 'oneway', 'yes']
['396260807', 'highway', 'trunk_link']
['396742649', 'highway', 'unclassified']
['396761527', 'highway', 'motorway_link']
['398607098', 'highway', 'tertiary']
['398806573', 'highway', 'secondary']
['398848959', 'bridge', 'yes']
['398848959', 'oneway', 'yes']
['398848959', 'highway', 'motorway_link']
['398848969', 'tunnel', 'yes']
['398848969', 'highway', 'residential']
['398848979', 'ref', 'G30N']
['398848979', 'name', u'西咸北环线']
['398848979', 'highway', 'motorway']
['398849651', 'ref', 'G30N']
['398849651', 'name', u'西咸北环线']
['398849651', 'bridge', 'yes']
['398849651', 'oneway', 'yes']
['398849651', 'highway', 'motorway']
['398849661', 'ref', 'G30N']
['398849661', 'name', u'西咸北环线']
['398849661', 'oneway', 'yes']
['398849661', 'highway', 'motorway']
['398849671', 'oneway', 'yes']
['398849671', 'highway', 'motorway_link']
['398849681', 'ref', 'G30N']
['398849681', 'name', u'西咸北环线']
['398849681', 'layer', '1']
...
['403871858', 'highway', 'residential']
['403890315', 'ref', 'G5']
['403890315', 'name', u'西汉高速']
['403890315', 'lanes', '2']
['403890315', 'oneway', 'yes']
['403890315', 'source', 'Bing']
['403890315', 'highway', 'motorway']
['403890315', 'name:en', 'Xihan Expressway']
['403917018', 'bridge', 'yes']
['403917018', 'highway', 'unclassified']
['403917018', 'layer', '1']
['404217651', 'highway', 'unclassified']
['404219619', 'highway', 'residential']
['404222006', 'highway', 'unclassified']
['404326557', 'landuse', 'residential']
['404954927', 'highway', 'unclassified']
['404955667', 'highway', 'residential']
['405041063', 'layer', '1']
['405041063', 'bridge', 'yes']
['405041063', 'highway', 'residential']
['405086325', 'highway', 'residential']
['405100460', 'highway', 'unclassified']
['405376611', 'highway', 'residential']
['405377979', 'highway', 'residential']
['405450772', 'highway', 'residential']
['405450964', 'highway', 'tertiary']
['406344042', 'landuse', 'residential']
...
['432120889', 'building', 'yes']
['432120899', 'building', 'yes']
['432120909', 'building', 'yes']
['432120919', 'building', 'yes']
['432120929', 'building', 'yes']
['432120939', 'building', 'yes']
['432120949', 'building', 'yes']
['432337216', 'height', '25']
['432337216', 'building:part', 'yes']
['432337226', 'building', 'yes']
['432337236', 'building', 'yes']
['432337246', 'building', 'yes']
['432337256', 'building', 'yes']
['432337266', 'building', 'yes']
['432337276', 'building', 'yes']
['432337287', 'building', 'yes']
['432337297', 'building', 'yes']
['432337308', 'building', 'yes']
['432337318', 'building', 'yes']
['432337328', 'building', 'yes']
['432337338', 'building', 'yes']
['432337348', 'building', 'yes']
['432337358', 'building', 'yes']
['432337368', 'building', 'yes']
['432337378', 'highway', 'path']
['432337388', 'highway', 'service']
['432337388', 'service', 'alley']
['432337398', 'highway', 'footway']
['432723419', 'building', 'yes']
['432723429', 'building', 'yes']
...
['446878869', 'highway', 'residential']
['446878879', 'highway', 'secondary']
['446878879', 'lanes', '2']
['446878879', 'oneway', 'yes']
['446878889', 'highway', 'residential']
['446878899', 'highway', 'residential']
['446878899', 'oneway', 'yes']
['446878909', 'highway', 'residential']
['446946765', 'building:part', 'yes']
['446946765', 'height', '7.7']
['446946765', 'min_height', '6.2']
['447230394', 'highway', 'residential']
['447231624', 'highway', 'residential']
['447232335', 'highway', 'residential']
['447288635', 'building', 'yes']
['447288645', 'building', 'yes']
['447288655', 'building', 'yes']
['447288667', 'building', 'yes']
['447288677', 'building', 'yes']
['447288687', 'building', 'yes']
['447288698', 'building', 'yes']
['447288715', 'building', 'yes']
['447288729', 'highway', 'service']
['447288745', 'access', 'private']
['447288745', 'highway', 'service']
['447365572', 'building', 'yes']
['447365582', 'building', 'yes']
['447365592', 'building', 'yes']
...
['457277830', 'lanes', '2']
['457277830', 'oneway', 'yes']
['457301946', 'highway', 'residential']
['457301946', 'name', u'电子东街']
['457301956', 'highway', 'unclassified']
['457301956', 'name', u'天谷八路']
['457301956', 'source', 'Bing']
['457301967', 'bridge', 'yes']
['457301967', 'highway', 'tertiary']
['457301967', 'layer', '1']
['457301967', 'maxspeed', '70']
['457301967', 'name', 'Keji 7 road']
['457301977', 'highway', 'unclassified']
['457307900', 'highway', 'residential']
['457307900', 'name', u'北新街']
['457307900', 'name:en', 'Beixing Street']
['457336029', 'highway', 'tertiary']
['457336029', 'name', u'纺渭路']
['457336069', 'highway', 'residential']
['457336094', 'highway', 'unclassified']
['457336094', 'name', u'公园北路']
['457340032', 'highway', 'unclassified']
['457340032', 'name', u'柳虹路']
['457344965', 'highway', 'unclassified']
['457344965', 'name', u'新广路']
['457643057', 'landuse', 'recreation_ground']
['403556', 'ref', 'G210']
['403556', 'name', u'210国道']
...

Printing all the tags at once yields too much information to grok; counting the tag keys is more useful:

In [4]:
tag_key_count = Counter(key for _, key, _ in all_tags(OSM_FILE))
tag_key_count.most_common()

[('highway', 22151),
 ('name', 8907),
 ('building', 6508),
 ('oneway', 6352),
 ('source', 4035),
 ('bridge', 3294),
 ('ref', 2544),
 ('layer', 2472),
 ('lanes', 2236),
 ('name:en', 2072),
 ('railway', 1957),
 ('landuse', 1707),
 ('power', 1267),
 ('amenity', 1138),
 ('service', 767),
 ('surface', 734),
 ('name:zh', 648),
 ('leisure', 619),
 ('tunnel', 606),
 ('place', 579),
 ('natural', 503),
 ('int_ref', 463),
 ('waterway', 378),
 ('website', 341),
 ('gauge', 275),
 ('tourism', 251),
 ('name:zh_pinyin', 248),
 ('lanes:start', 246),
 ('lanes:end', 246),
 ('electrified', 223),
 ('flagfp', 223),
 ('frequency', 215),
 ('access', 213),
 ('barrier', 208),
 ('type', 204),
 ('voltage', 204),
 ('maxspeed', 190),
 ('name:fr', 189),
 ('foot', 172),
 ('wikipedia', 167),
 ('fixme', 162),
 ('shop', 160),
 ('water', 152),
 ('addr:street', 144),
 ('addr:city', 144),
 ('boundary', 143),
 ('admin_level', 143),
 ('bicycle', 130),
 ('note', 129),
 ('usage', 128),
 ('buildingpart', 113),
 ('sport', 113),
 ...
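
The key counts are also a convenient place to look for keys that are probably variants of one another: both `buildingpart` and `building:part`, and both `flagfp` and `Flagfp`, show up in the outputs of this notebook. A rough sketch of one way to surface such candidates, assuming that ignoring case and punctuation is a reasonable notion of "same key" (it can over-match, so the result is only a list of suspects); `near_duplicate_keys` is a hypothetical helper:

```python
import re
from collections import defaultdict

def near_duplicate_keys(keys):
    """Group keys that become identical once case and punctuation are ignored."""
    groups = defaultdict(set)
    for key in keys:
        normalized = re.sub(r'[^a-z0-9]', '', key.lower())
        groups[normalized].add(key)
    # Keep only the groups that contain more than one distinct spelling
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)

# A handful of keys taken from the outputs in this notebook
keys = ['building', 'building:part', 'buildingpart', 'flagfp', 'Flagfp', 'highway']
print(near_duplicate_keys(keys))
```

On the real data the same function could be fed `tag_key_count.keys()`.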


For another overview, aggregate the tag values into one set per key:

In [5]:
agg = defaultdict(set)
for elem, key, val in all_tags(OSM_FILE):
    agg[key].add(val)
pprint(dict(agg))

{'Flagfp': set(['1']),
 'ISO3166-1': set(['CN']),
 'ISO3166-1:alpha2': set(['CN']),
 'ISO3166-1:alpha3': set(['CHN']),
 'ISO3166-1:numeric': set(['156']),
 'ISO3166-2': set(['CN-61']),
 'access': set(['customers',
                'designated',
                'destination',
                'half restricted',
                'no',
                'permissive',
                'private',
                'unknown',
                'yes']),
 'addr:city': set(['Qu Jiang New District Xian',
                   'Si-an',
                   u'Weiyang District, Xi’an, Shaanxi',
                   "Xi'an",
                   'Xian',
                   u'咸阳',
                   u'咸阳市',
                   u'西安',
                   u'西安市',
                   u'陕西西安']),
 'addr:country': set(['CN']),
 'addr:housename': set(['14#',
                        'KaiYuan Mall',
                        'Saigao Block',
                        u'万家灯火小区',
                        u'中国建设银行',
                        

 'bridge': set(['aqueduct', 'no', 'viaduct', 'yes']),
 'building': set(['apartments',
                  'college',
                  'commercial',
                  'dam',
                  'dormitory',
                  'hospital',
                  'house',
                  'hut',
                  'industrial',
                  'no',
                  'public',
                  'residential',
                  'retail',
                  'roof',
                  'school',
                  'stadium',
                  'tower',
                  'university',
                  'warehouse',
                  'yes',
                  u'仁厚庄园']),
 'building:levels': set(['1',
                         '11',
                         '18',
                         '2',
                         '22',
                         '24',
                         '26',
                         '27',
                         '28',
                         '3',
                         '32',
     

              u'人民广场街',
              u'人民西巷',
              u'人民西路',
              u'人民路',
              u'人民路广场',
              u'人民银行大街',
              u'什王村',
              u'仁义东巷',
              u'仁义路',
              u'仁厚庄北路',
              u'仁厚庄南路',
              u'仁宗庙',
              u'仁村',
              u'介家巷',
              u'从新巷',
              u'仓程路',
              u'仓门巷',
              u'付村花园东区',
              u'代家镇',
              u'代王街办',
              u'仪井镇',
              u'仪凤东街',
              u'仪凤南街',
              u'仪风北街',
              u'仪风西街',
              u'仲英书院 东1舍',
              u'仲英书院 东21舍',
              u'仲英书院 东2舍',
              u'仲英书院 东3舍',
              u'任留街道',
              u'企业路',
              u'伊古斋黄桂柿子饼',
              u'优佳超市',
              u'会展东路',
              u'会展中心',
              u'会展路',
              u'会昌路',
              u'会议中心',
              u'伞塔路',
              u'伟丰花园',
              u'低温实验室',
              u'住房',
              u'体乐南巷',

              u'振兴南街',
              u'振兴路',
              u'振华北路',
              u'振华南路',
              u'振华路',
              u'排球场',
              u'排球羽毛球乒乓球场',
              u'接待室',
              u'控制室',
              u'搪瓷北巷',
              u'搪瓷南巷',
              u'操场',
              u'操场东巷',
              u'收发室',
              u'政府街',
              u'政法巷',
              u'政通大道',
              u'故市镇',
              u'敏行路',
              u'教一楼',
              u'教二100',
              u'教二楼',
              u'教公寓3楼',
              u'教公寓4楼',
              u'教公寓5楼',
              u'教单1楼',
              u'教单2楼',
              u'教单3楼',
              u'教单4楼',
              u'教单5楼',
              u'教场门',
              u'教堂',
              u'教学一楼',
              u'教学七楼',
              u'教学三楼',
              u'教学九楼',
              u'教学二楼',
              u'教学五楼',
              u'教学八楼',
              u'教学六楼',
              u'教学区东门',
              u'教学区北一门',
              u'教学区北二门',
              

              u'纺六路',
              u'纺南路',
              u'纺四路',
              u'纺园一路',
              u'纺园二路',
              u'纺建路',
              u'纺新街',
              u'纺机路',
              u'纺渭路',
              u'纺织公园',
              u'纺织城',
              u'纺织城东街',
              u'纺织城小学',
              u'纺织城正街',
              u'纺织城站',
              u'纺织城西街',
              u'细柳街办',
              u'终南镇',
              u'终南音乐厅',
              u'经九路',
              u'经发一路',
              u'经发二路',
              u'经发路',
              u'经济管理学院',
              u'经电三路',
              u'结构力学实验室',
              u'统一路',
              u'综合服务大楼',
              u'综合楼',
              u'绿园度假村：324，600，616，游9',
              u'绿园度假村：324，923',
              u'绿地世纪城 Igress Park',
              u'绿地世纪城A区',
              u'绿地世纪城B区',
              u'绿地世纪城B区地面停车场',
              u'绿地世纪城C区',
              u'绿地世纪城篮球场',
              u'绿地笔克会展中心',
              u'缤纷南郡',
              u'缤纷南郡（陕西大会堂）',
           

                 'Changlefang',
                 'Changming Rd',
                 'Changrenli',
                 'Changsheng Str',
                 'Chaocheng Alley',
                 'Chaoliu',
                 'Chengguan',
                 'Chengouchun Rd',
                 'Chengxin Rd',
                 'Chenhe',
                 'China',
                 'China National Highway 108',
                 'China National Highway 211',
                 'China National Highway 310',
                 'China National Highway 312',
                 'Chishui',
                 'Chongning',
                 'Chongye Rd',
                 'Chuangxin Rd',
                 'Chunhua County',
                 'Cien Rd',
                 'Cliffside Path',
                 'Columbia',
                 'Cuihua Mountain',
                 'Cuihua Rd',
                 'Da Qing Lu',
                 'Dajing',
                 'Dali County',
                 'Dalianhuachi Street',
                 'Dama

                 u'渭南市',
                 u'礼泉県',
                 u'興平市',
                 u'華陰市',
                 u'藍田県',
                 u'西安市',
                 u'長安区',
                 u'閻良区',
                 u'陝西省',
                 u'高陵区']),
 'name:jbo': set(['.djunguos.']),
 'name:ka': set([u'ჩინეთი']),
 'name:kaa': set([u'Qıtay']),
 'name:kab': set(['Ccinwa']),
 'name:kbd': set([u'Хъутей Джылэ Республикэ']),
 'name:kg': set(['Sina']),
 'name:ki': set(['China']),
 'name:kk': set([u'Қытай Халық Республикасы']),
 'name:kl': set(['Kina']),
 'name:kn': set([u'ಚೀನಿ ಜನರ ಗಣರಾಜ್ಯ',
                 u'ಶೀಅನ್']),
 'name:ko': set([u'골동품 마켓',
                 u'당 대명궁 공원',
                 u'시안 성벽 동문',
                 u'종루',
                 u'중화인민공화국',
                 u'청진대사',
                 u'팔선암']),
 'name:koi': set([u'Кина']),
 'name:kr': set([u'고루', u'혁명공원']),
 'name:krc': set([u'Къытай Халкъ Республика']),
 'name:ks': set([u'چیٖن']),
 'name:ku': set([u'Çîn']),
 'name:kv': set([u

               'yes']),
 'plant:output:hot_water': set(['yes']),
 'population': set(['5000', '5500', '6501200']),
 'postal_code': set(['713400']),
 'power': set(['generator',
               'line',
               'minor_line',
               'plant',
               'pole',
               'station',
               'substation',
               'tower',
               'transformer']),
 'public_transport': set(['platform', 'station', 'stop_position']),
 'railway': set(['abandoned',
                 'buffer_stop',
                 'construction',
                 'crossing',
                 'disused',
                 'halt',
                 'level_crossing',
                 'platform',
                 'rail',
                 'station',
                 'subway',
                 'subway_entrance',
                 'yes']),
 'ramp': set(['yes']),
 'ref': set(['05/23',
             '05L/23R',
             '05R/23L',
             '06L/24R',
             '06R/24L',
             '07/25',
 

Now, let's investigate the tags whose keys occur most often. First, the highway tags:

In [6]:
pprint(agg['highway'])

set(['bus_stop',
     'construction',
     'crossing',
     'cycleway',
     'footway',
     'living_street',
     'mini_roundabout',
     'motorway',
     'motorway_junction',
     'motorway_link',
     'path',
     'pedestrian',
     'platform',
     'primary',
     'primary_link',
     'raceway',
     'residential',
     'rest_area',
     'road',
     'secondary',
     'secondary_link',
     'service',
     'services',
     'steps',
     'tertiary',
     'tertiary_link',
     'track',
     'traffic_signals',
     'trunk',
     'trunk_link',
     'turning_circle',
     'unclassified',
     'via_ferrata'])


Meh, the highway tags look as boring as all the hours you could spend asking "Are we there yet?" when driving on them. What about the name tags?

In [7]:
pprint(agg['name'])

set([u' 陕西中医大学',
     u'06号',
     '1',
     '103',
     '106,721',
     '107',
     '107, 34',
     '107,34',
     '107,512,34',
     '107,512,608,34',
     u'107省道',
     u'108国道',
     u'108省道',
     u'10号宿舍楼',
     u'10号教学楼',
     u'113分叉',
     u'113县道',
     u'113县道3',
     u'11号宿舍楼',
     u'12号宿舍楼',
     u'12楼',
     u'13号宿舍楼',
     '14,312,...',
     u'14号宿舍楼',
     u'17楼',
     u'19号楼',
     u'19号楼（1994年建）',
     u'1北楼',
     u'1号宿舍楼',
     u'1号教学楼',
     u'1号楼',
     u'1号高层',
     '2',
     '205',
     '206',
     '207,35,...',
     '209, ...',
     u'20楼',
     u'210国道',
     u'211国道',
     '212, ...',
     '220',
     u'24楼',
     '251',
     '251, 908, 512, 608, ...',
     '251, 908, 608',
     '251,29,908',
     '251,901,29',
     '251,908,29, ...',
     '261',
     '261, 262',
     '262',
     '29,411',
     '29,908,411',
     '2nd Ring Road',
     u'2号宿舍楼',
     u'2号教学楼',
     u'2号楼',
     u'2号高层',
     '3',
     '308,X7,...',
     u'310国道',
     '312,14,604,X7,...',
  

     u'万寿北路',
     u'万寿南路',
     u'万寿路站',
     u'万年路',
     u'万庆巷',
     u'万豪酒店',
     u'万达广场一号路',
     u'万达百货民乐园店',
     u'丈八一路',
     u'丈八七路',
     u'丈八三路',
     u'丈八东路',
     u'丈八二路',
     u'丈八五路',
     u'丈八六路',
     u'丈八北路',
     u'丈八北路 / Zhangba Bei Lu',
     u'丈八北路站',
     u'丈八四路',
     u'丈八四路北段',
     u'丈八街办',
     u'丈八西路',
     u'丈杜路',
     u'三五零七社区',
     u'三兆村',
     u'三兆路',
     u'三兆路 Sānzhào Rd',
     u'三南路',
     u'三原',
     u'三原县',
     u'三原县 (Sanyuan)',
     u'三原站',
     u'三号坑',
     u'三号楼',
     u'三号路',
     u'三姐妹饺子 @ 东木头市店',
     u'三家庄',
     u'三张镇',
     u'三星',
     u'三星工业园区',
     u'三星快速干道',
     u'三星立交',
     u'三桥收费站',
     u'三桥村',
     u'三桥立交',
     u'三桥站',
     u'三桥街办',
     u'三殿桥',
     u'三民村站',
     u'三渠镇',
     u'三爻站',
     u'三环辅道',
     u'三航路',
     u'三贤路',
     u'三过村',
     u'三里镇',
     u'上小径',
     u'上林体育馆',
     u'上林斜路',
     u'上林路',
     u'上河道',
     u'上王村',
     u'上草村',
     u'上陆陌',
     u'下北街',
     u'下寨镇',
     u'下庙镇',
     u'下草村',
     u'下西街',
     u'专

     u'咸通路',
     u'咸铜铁路',
     u'咸阳',
     u'咸阳东',
     u'咸阳中学',
     u'咸阳北站',
     u'咸阳南',
     u'咸阳博物馆',
     u'咸阳市',
     u'咸阳市 / Xianyang',
     u'咸阳湖',
     u'咸阳秦都',
     u'咸阳立交匝道',
     u'咸阳站',
     u'咸阳西',
     u'咸阳钟楼',
     u'哈佛公馆',
     u'哈哈',
     u'响桥村',
     u'哑柏镇',
     u'唐久便利店',
     u'唐兴数码',
     u'唐兴路',
     u'唐园1号楼',
     u'唐园2号楼',
     u'唐园3号楼',
     u'唐城墙公园',
     u'唐城墙遗址公园',
     u'唐大慈恩寺遗址公园',
     u'唐宫仙指',
     u'唐延南路',
     u'唐延路',
     u'唐苑东路（天街）',
     u'唐苑北路',
     u'唯实路',
     u'商州区 (Shangzhou)',
     u'商洛北站',
     u'商洛市 / Shangluo',
     u'商通路',
     u'商铺',
     u'啤酒路',
     u'喂子坪乡',
     u'喷泉广场',
     u'嘉华街',
     u'嘉天國際',
     u'四号楼',
     u'四号路',
     u'四合院',
     u'四大发明广场',
     u'四季东巷',
     u'四季西巷',
     u'四屯镇',
     u'四府街',
     u'四段巷',
     u'四浩庄',
     u'团委',
     u'团结一路',
     u'团结三路',
     u'团结东路',
     u'团结中路',
     u'团结二路',
     u'团结北路',
     u'团结南路',
     u'团结四路',
     u'团结村',
     u'团结西路',
     u'团结路',
     u'园丁面食屋',
     u'围墙巷',
     u'围棋寨村',

     u'杏林镇',
     u'杏渭路',
     u'材料学院 强度楼',
     u'杜化路',
     u'杜家巷',
     u'杜曲街办',
     u'杜陵邑南路',
     u'杜雁路',
     u'杨凌',
     u'杨凌大道',
     u'杨凌西',
     u'杨孔寺三组四组八组',
     u'杨孔寺水库',
     u'杨孔寺水库大坝',
     u'杨家村路',
     u'杨家滩村',
     u'杨庄乡',
     u'杨陵区',
     u'杨陵区 (Yangling)',
     u'杨陵南站',
     u'杨陵水上运动中心',
     u'杨陵镇站',
     u'板桥',
     u'林学院东院',
     u'林苑宾馆',
     u'林荫小道',
     u'枣园',
     u'枣园东路',
     u'枣园南路',
     u'枣园巷',
     u'枣园站',
     u'枣园西路',
     u'枫丹白露苑',
     u'枫叶北路 / Fengye Bei Lu',
     u'枫叶南路 / Fengye Nan Lu',
     u'枫叶新新花园 / Fengye Xinxin Huayuan',
     u'枫叶新都市',
     u'枫叶苑',
     u'枫林华府',
     u'枫林华府 东门',
     u'枫林华府 服务中心',
     u'枫林路 / Fenglin Road',
     u'柏树林',
     u'某小区工地',
     u'柞水县 (Zhashui)',
     u'柳亭路',
     u'柳仓街',
     u'柳新路',
     u'柳枝镇',
     u'柳烟路',
     u'柳荫路',
     u'柳莺路',
     u'柳虹路',
     u'柳雪路',
     u'柳鸣路',
     u'柴家十字',
     u'标准厂房',
     u'标新街',
     u'标缝路',
     u'栎阳街道',
     u'树园站',
     u'栖凤街',
     u'栖斜路',
     u'校务办公楼',
     u'校务楼',
  

     u'西部大道',
     u'西部欣桥农产品物流中心',
     u'西里路',
     u'西铁分局',
     u'西铁双维超市',
     u'西铜高速',
     u'西门',
     u'西闸口',
     u'西阳镇',
     u'西阶',
     u'西韦巷',
     u'西韩村',
     u'西韩街',
     u'西飞大道',
     u'西食堂',
     u'西高新',
     u'西黄高速',
     u'西龙窝村',
     u'观音山',
     u'观音禅院',
     u'规划局家属院',
     u'规划路',
     u'角楼',
     u'解放军三二三医院',
     u'解放南路',
     u'解放市场',
     u'解放路',
     u'解放门',
     u'计算机中心',
     u'许士庙街',
     u'许滨北路',
     u'试飞路',
     u'试飞院路',
     u'试验田',
     u'诚信路',
     u'诚字楼',
     u'调剂餐厅',
     u'谢王桥',
     u'谭家',
     u'谭家滩村',
     u'谷家巷',
     u'豁口',
     u'豪邦时尚购物广场',
     u'贞元镇',
     u'贞观路',
     u'贡院门',
     u'财东路',
     u'货车安全检查站',
     u'贾三灌汤包子馆',
     u'贾家滩村',
     u'贾家馄饨馆',
     u'赛格电脑城',
     u'赛格购物中心',
     u'赤栏桥村',
     u'赤水河',
     u'赤水站',
     u'赤水镇',
     u'赵佳宝甜食',
     u'赵公明财神庙',
     u'赵围东路',
     u'赵家堡村',
     u'赵村镇',
     u'赵王巷',
     u'赵王村',
     u'赵镇',
     u'足球场',
     u'跳楼塔',
     u'车棚',
     u'车站东路',
     u'车站北路',
     u'车站南路',
     u'软件学院',
   

Bingo! The problems are easy to spot: some name tags contain only English words (when they shouldn't), and others duplicate the name in both Chinese and English:

- ('Gaoling Xian', '244080278')
- (u'索菲特国际会展中心 Sofitel on Renmin Square', '850352828')

## Auditing Name Tags

Using another helper function, **find_and_print()**, defined at the beginning, let's look at the complete elements in which the two potentially problematic name tags above live:

In [8]:
find_and_print('244080278')
find_and_print('850352828')

node id: 244080278
{'created_by': 'dkt_GNS-import-1',
 'gns:ADM1': '26',
 'gns:DSG': 'ADM3',
 'gns:UFI': '-1906363',
 'gns:UNI': '6816820',
 'name': 'Gaoling Xian',
 'name:vi': u'Cao Lăng',
 'name:zh': u'高陵县',
 'name:zh_pinyin': 'Gaoling Xian',
 'place': 'county'}

node id: 850352828
{'addr:housenumber': '319',
 'addr:street': 'Dongxin',
 'name': u'索菲特国际会展中心 Sofitel on Renmin Square',
 'name:en': 'Sofitel on Renmin Square',
 'name:zh': u'索菲特国际会展中心',
 'tourism': 'hotel'}



Ha, the OSM elements do have the correct Chinese names, although they are buried in the 'name:zh' tags ('高陵县' and '索菲特国际会展中心'). Fixing such tags once we have found them should be easy (copy the 'name:zh' value to the 'name' tag). However, pinpointing this kind of problematic name tag could be challenging.

First, let's ascertain whether we can safely use the 'name:zh' tags as the gold standard to fix our name tags by searching for counterexamples:

In [9]:
ALL_CN = re.compile(u'^[\u4e00-\u9fa5\s]+$')

# search 'name:zh' tags containing non-Chinese characters

for elem, key, val in all_tags(OSM_FILE):
    if key == 'name:zh' and not ALL_CN.match(val):
        tags = { tag.attrib['k']: tag.attrib['v'] 
                     for tag in elem.iter('tag') }
        name = tags.get('name', '')
        
        print('id     ', elem.attrib['id'])
        print('name   ', name)
        print('name:zh', val, '\n')


id      310515765
name    兵马俑 terracotta army
name:zh Exército de Terracota 

id      1761733711
name    长乐门（东门）
name:zh 长乐门（东门） 

id      2535980901
name    都城隍庙
name:zh Duchenghuang Temple of Xi'an 

id      4308320198
name    센양 국제공항
name:zh Aeroporto Internacional de Xi'an Xianyang 

id      4317716489
name    陕西省计算机专业技术资格水平考试办公室
name:zh 地   址：西安市高新区丈八五路10号 陕西省科技资源统筹中心D座217室 

id      4376910800
name    Yong Ning International Art museum
name:zh Yong Ning International Art museum 

id      4377098289
name    greenants outdoor
name:zh greenants outdoor 

id      4379974989
name    K2 summit
name:zh K2 summit 

id      4381895289
name    Yuan village food tourist park
name:zh Yuan village food tourist park 

id      4388892491
name    Three Sisters Dumplings
name:zh Three Sisters Dumplings 

id      4394361889
name    pagoda
name:zh pagoda 

id      4397221090
name    PSB office for visa extensions
name:zh PSB office for visa extensions 

id      4405705298
name    Yaijie Guesthouse


Unfortunately, counterexamples do exist. This does not prevent us from using the 'name:zh' tags as the gold standard, but it does mean we have to use them with care: check that a 'name:zh' value is actually in Chinese before using it.
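
The "check before using" rule can be sketched as a small guard. This is only an illustration: `trusted_zh_name` is a hypothetical helper name, the tags are assumed to be a plain dict of key/value pairs, and the pattern is the `ALL_CN` from the cell above (written with `\\s` so the literal behaves the same on Python 2 and 3):

```python
from __future__ import print_function
import re

ALL_CN = re.compile(u'^[\u4e00-\u9fa5\\s]+$')

def trusted_zh_name(tags):
    """Return the 'name:zh' value only if it is written entirely in Chinese."""
    zh = tags.get('name:zh', u'')
    return zh if ALL_CN.match(zh) else None

good = {'name': u'Gaoling Xian', 'name:zh': u'高陵县'}
bad = {'name': u'pagoda', 'name:zh': u'pagoda'}

print(trusted_zh_name(good))
print(trusted_zh_name(bad))
```

Note that a value such as '长乐门（东门）' is also rejected, because the full-width parentheses fall outside the character range of `ALL_CN`.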

So we can split up the problematic name tags into two groups depending on whether the 'name:zh' tags can be used as the gold standard:


In [10]:
ALL_EN = re.compile(u'^[a-zA-Z0-9\s]+$')
CN_EN  = re.compile(u'^[\u4e00-\u9fa5\s]+[^\u4e00-\u9fa5]+$')

names_zh_good = []
names_zh_bad  = []

for elem, key, val in all_tags(OSM_FILE):
    if key == 'name':
        tags = { tag.attrib['k']: tag.attrib['v'] 
                     for tag in elem.iter('tag') }
        zh = tags.get('name:zh', '')
    
        if ALL_EN.match(val) and ALL_CN.match(zh):
            names_zh_good.append([elem.attrib['id'], val, zh])
            
        elif CN_EN.match(val):
            if ALL_CN.match(zh):
                names_zh_good.append([elem.attrib['id'], val, zh])  
            else:
                names_zh_bad.append([elem.attrib['id'], val])
                
print("Tags of 'name:zh' in Chinese:\n")
for id, name, zh in names_zh_good:
    print('id     ', id)
    print('name   ', name)
    print('name:zh', zh, '\n')

Tags of 'name:zh' in Chinese:

id      244080278
name    Gaoling Xian
name:zh 高陵县 

id      310740922
name    Watsons
name:zh 屈臣氏 

id      850352828
name    索菲特国际会展中心 Sofitel on Renmin Square
name:zh 索菲特国际会展中心 

id      1511453998
name    KFC
name:zh 肯德基 

id      2254514432
name    KFC
name:zh 肯德基 

id      2260053296
name    华山南峰 Hua Shan South Peak
name:zh 华山南峰 

id      2379990928
name    Vanguard
name:zh 华润万家 

id      2634971740
name    KFC
name:zh 肯德基 

id      2945306642
name    KFC
name:zh 肯德基 

id      3995724828
name    翠华山 Cuihua Mountain
name:zh 翠华山 

id      4086083089
name    KFC
name:zh 肯德基 

id      4119774589
name    KFC
name:zh 肯德基 

id      4198263692
name    KFC
name:zh 肯德基 

id      28258008
name    Tangyan Lu
name:zh 唐延路 

id      28258930
name    长安南路 Cháng‘ān nán lù
name:zh 长安南路 

id      28258946
name    桃园南路 / Taoyuan Nan Lu
name:zh 桃园南路 

id      28339844
name    高新六路 / Gaoxin Liu Lu
name:zh 高新六路 

id      28343179
name    2nd Ring Road
name:zh 二环路 

id    

Among the name tags without a gold-standard 'name:zh' tag (shown below), only some can be easily fixed, namely those with a space separating the Chinese and English parts, such as:

'兵马俑 terracotta army', 

'高薪四路 / Gaoxin Si Lu' and 

'高薪第一小学 / Gaoxin Diyi Xiaoxue'.

We have also incidentally noticed a typo: the Chinese characters '高薪' here should be '高新' instead.
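
The final audit code refers to a `TYPOS` mapping that is not defined in this notebook. A minimal sketch of what it could look like, assuming it maps each wrong string to its correction and contains only the one typo noticed here (`fix_typos` is a hypothetical helper name):

```python
from __future__ import print_function

# wrong => right; assumed to hold only the typo found during this audit
TYPOS = {u'高薪': u'高新'}

def fix_typos(value, typos=TYPOS):
    """Replace every known typo in a tag value."""
    for wrong, right in typos.items():
        value = value.replace(wrong, right)
    return value

print(fix_typos(u'高薪四路'))
```

A blanket replacement like this is only safe as long as no legitimate name in the dataset contains '高薪', which is worth checking before applying it to the full file.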

In [11]:
print("No 'name:zh' tag or 'name:zh' tag not in Chinese:\n")
pprint(names_zh_bad)

No 'name:zh' tag or 'name:zh' tag not in Chinese:

[['310515765', u'兵马俑 terracotta army'],
 ['312727586', u'西安软件园: 260, 262'],
 ['315470145', u'丁家桥: 107, 218, 251, 908'],
 ['1558667396', u'陕西电视塔 Shaanxi Television Tower'],
 ['1655332222', u'交通银行ATM'],
 ['1828308887', u'招行ATM'],
 ['2615853708', u'农行ATM'],
 ['2615853709', u'建行ATM'],
 ['2615870236', u'丁家桥: 218, 251'],
 ['2615873516', u'西安软件园: 218, 253, 260, 262'],
 ['2616229440', u'西寨： 905，917，923，4-19'],
 ['2616238121', u'西寨：905，917，923，4-19'],
 ['2616238122', u'西寨：323，500，4-04，4-09，4-10，4-13'],
 ['2616238123', u'西寨：323，500，4-04，4-09，4-10，4-13，4-18'],
 ['2616238124', u'西寨：215，229，324，918，4-18'],
 ['2616238125', u'西寨：215，229，324，918'],
 ['2616238126', u'青年南街：215，229，324，918'],
 ['2616238127', u'青年南街：215，229，324，918'],
 ['2616238128', u'长安党校：215，229，324，918'],
 ['2616238130', u'太阳水岸新城：162'],
 ['2616238131', u'候家湾村：324'],
 ['2616238132', u'绿园度假村：324，923'],
 ['2921585612', u'太阳水岸新城：162'],
 ['2921585613', u'长安二中：162'],
 ['2921585614', u'毓秀园：1

Thus we update the regular expression to isolate those easily-fixable name tags, and show how to fix them:

In [12]:
CN_SPACE_EN  = re.compile(u'^[\u4e00-\u9fa5\s]+\s[^\u4e00-\u9fa5]+$')

for elem, key, val in all_tags(SAMPLE_FILE):
    if key == 'name' and CN_SPACE_EN.match(val):
        cn, _, _ = val.partition(' ')
        print('id:\t\t', elem.attrib['id'])
        print('name:\t', val)
        print('fixed:\t', cn, '\n')

id:		 310515765
name:	 兵马俑 terracotta army
fixed:	 兵马俑 

id:		 850352828
name:	 索菲特国际会展中心 Sofitel on Renmin Square
fixed:	 索菲特国际会展中心 

id:		 3283285289
name:	 绿地世纪城 Igress Park
fixed:	 绿地世纪城 

id:		 28258958
name:	 高薪四路 / Gaoxin Si Lu
fixed:	 高薪四路 

id:		 28294894
name:	 锦业路 JinYe road
fixed:	 锦业路 

id:		 28465736
name:	 新纪元公园 / Xinji Yuan Huayuan
fixed:	 新纪元公园 

id:		 55696761
name:	 白沙路 / Baisha Lu
fixed:	 白沙路 

id:		 142323478
name:	 新开门南路 New Open Door South Rd
fixed:	 新开门南路 

id:		 162658753
name:	 高薪路 / Gaoxin Lu
fixed:	 高薪路 

id:		 251744557
name:	 雁南三路 Yàn nán sān lù
fixed:	 雁南三路 

id:		 251746753
name:	 长安南路 Cháng‘ān nán lù
fixed:	 长安南路 

id:		 313208278
name:	 高薪第一小学 / Gaoxin Diyi Xiaoxue
fixed:	 高薪第一小学 

id:		 335518217
name:	 高新二路 / Gaoxin Er Lu
fixed:	 高新二路 

id:		 335522611
name:	 丈八北路 / Zhangba Bei Lu
fixed:	 丈八北路 

id:		 335527388
name:	 科技路 / Keji Lu
fixed:	 科技路 

id:		 335527413
name:	 锦业路 JinYe road
fixed:	 锦业路 

id:		 335531767
name:	 桃园南路 / Taoyuan Nan Lu
fixed:	 桃
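
One caveat: `CN_SPACE_EN` allows whitespace inside the Chinese part, so `val.partition(' ')` would cut a name at its first space even if more Chinese text follows. As a hedged alternative, a capturing group can take the whole leading Chinese run. The pattern `CN_PART`, the helper `chinese_part`, and the third example name are assumptions made for illustration, not part of the original audit:

```python
from __future__ import print_function
import re

# Group 1: Chinese characters (spaces allowed inside) ending on a Chinese character,
# followed by whitespace and a purely non-Chinese remainder
CN_PART = re.compile(u'^([\u4e00-\u9fa5\\s]*[\u4e00-\u9fa5])\\s+[^\u4e00-\u9fa5]+$')

def chinese_part(name):
    """Return the leading Chinese part of a mixed name, or None if it does not fit."""
    m = CN_PART.match(name)
    return m.group(1) if m else None

for name in [u'兵马俑 terracotta army',
             u'高薪四路 / Gaoxin Si Lu',
             u'枫林华府 东门 East Gate',   # made-up name with a space in the Chinese part
             u'交通银行ATM']:
    print(name, '=>', chinese_part(name))
```

Names without a separating space, such as '交通银行ATM', still yield `None` and need a different fix.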

## Auditing Name:en Tags

In [13]:
ENDING = re.compile(r'\b\S+\.?$', re.IGNORECASE)     
EXPECTED = ['City', 'District', 'Section', 'Temple', 'Park', 'Base', 'Line', 'Luanzhen', 'Area',
    'Hospital','Tunnel', 'Plaza', 'Railway', 'River', 'Expressway', 'Road', 'Square', 'Alley',
    'County', 'Street', 'Village', 'School', 'Avenue', 'Pavilion', 'Path', 'University', 
    'Airport', 'Cemetery', 'Factory', 'Sanyao', 'Baqiao', 'Yinzhen', 'Jingyang', 'Lantian',
    'Gujiling', 'Caotang', "Chang'an", 'Linwei', 'Station', 'Gate', 'Yangling', 'Restaurant',
    'Island', 'Tower', 'Mall', 'Exit', 'Stadium', 'Hotel', 'Store', 'station', 'Peak',
    'Market',
]

# Record a name under its last word if that word is not an expected ending
def audit_ending(endings, name):
    m = ENDING.search(name)
    if m:
        ending = m.group()
        if ending not in EXPECTED:
            endings[ending].add(name)

endings = defaultdict(set)     
for elem, key, val in all_tags(SAMPLE_FILE):
    if key == 'name:en':
        audit_ending(endings, val)

pprint(endings.keys())

['Doumen',
 'Warriors',
 'Qiaonan',
 'Ginwa',
 'Rd',
 'cafe',
 'Dayang',
 'Gengzhen',
 'Houjiacun',
 'Zhongnan',
 'Columbia',
 'Caochangpo',
 'Expy',
 '2',
 'Xizhakou',
 'Jiaochangmen',
 '0006',
 'Dajing',
 'Weinan',
 'Fengqiyuan',
 'Str',
 'Beicaojia',
 'Fanjia',
 'Tongyuan',
 'Thirteen']
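
To make the cell above easier to follow: the `ENDING` pattern picks out the last whitespace-delimited word of each 'name:en' value, and every name whose last word is not an expected ending is grouped under that word. A small stand-alone sketch of the same idea, using a deliberately shortened expected set and a handful of names taken from the outputs in this notebook:

```python
from __future__ import print_function
from collections import defaultdict
import re

ENDING = re.compile(r'\b\S+\.?$', re.IGNORECASE)
EXPECTED = {'Road', 'Street', 'Mountain'}   # shortened, for illustration only

names = ['Xiwan Rd', 'Changsheng Str', 'Cuihua Mountain', 'Cien Rd']

endings = defaultdict(set)
for name in names:
    m = ENDING.search(name)
    if m and m.group() not in EXPECTED:
        endings[m.group()].add(name)

for ending in sorted(endings):
    print(ending, sorted(endings[ending]))
```

Endings such as 'Rd' and 'Str' that show up this way are the candidates for the abbreviation table in the next cell.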


In [14]:
ABBRES = {
    'Blvd': 'Boulevard',
    'Rd': 'Road',
    'Lu': 'Road',
    'Rd(E)': 'East Road',
    'Rd(W)': 'West Road',
    'Rd(N)': 'North Road',
    'Rd(S)': 'South Road',
    'Expy': 'Expressway',
    'Str': 'Street',
    'St': 'Street',
    'Jie': 'Street',
    u'Jie（S.）': 'Street',
    'Qu': 'District',
}

In [15]:
for elem, key, val in all_tags(OSM_FILE):
    if key == 'name:en':
        for abbre, full in ABBRES.items():
            if val.endswith(abbre):
                # Replace only the trailing abbreviation, not every occurrence
                print(abbre, ':', val, '=>', val[:-len(abbre)] + full)


Qu : Lintong Qu => Lintong District
Qu : Yanta Qu => Yanta District
Rd : Xiwan Rd => Xiwan Road
Rd : E. Zhangba Rd => E. Zhangba Road
Rd : N. Changan Rd => N. Changan Road
Rd : S. Changan Rd => S. Changan Road
Rd : TV Tower Rd => TV Tower Road
Rd : Changming Rd => Changming Road
Rd : W. Xiaozhai Rd => W. Xiaozhai Road
Rd : W. Huaqing Rd => W. Huaqing Road
Rd : Huichang Rd => Huichang Road
Jie（S.） : Zhuque Da Jie（S.） => Zhuque Da Street
Rd : Daqing Rd => Daqing Road
Lu : Da Qing Lu => Da Qing Road
Rd : Fengqing Rd => Fengqing Road
Rd : Yanhuan Rd => Yanhuan Road
Lu : Guang De Lu => Guang De Road
Rd : W. Renming Rd => W. Renming Road
Rd : Weiyang East Rd => Weiyang East Road
Rd(E) : Youyi Rd(E) => Youyi East Road
Rd : Cien Rd => Cien Road
Rd : Yannan 1st Rd => Yannan 1st Road
Rd : Chongye Rd => Chongye Road
Rd : Yongsong Rd => Yongsong Road
Lu : Jinye Er Lu => Jinye Er Road
Rd : Weiyang West Rd => Weiyang West Road
Rd : Jianshe Rd => Jianshe Road
Rd : W. Cehui Rd => W. Cehui Road
Rd : S. 
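
Two details are worth keeping in mind when expanding abbreviations: `str.replace` rewrites every occurrence in the name rather than just the ending, and `endswith` alone also matches when the abbreviation is merely the tail of a longer word. The sketch below is one possible way to guard against both, assuming abbreviations are always separated from the rest of the name by a space. `expand_ending` is a hypothetical helper, the table is a shortened copy of `ABBRES`, and 'Lu Jia Lu' is a made-up name:

```python
from __future__ import print_function

ABBRES = {'Rd': 'Road', 'Lu': 'Road', 'Rd(E)': 'East Road',
          'Str': 'Street', 'St': 'Street'}

def expand_ending(name, abbres=ABBRES):
    """Expand a trailing abbreviation, leaving the rest of the name untouched."""
    # Try longer abbreviations first so the most specific suffix wins
    for abbre in sorted(abbres, key=len, reverse=True):
        if name == abbre or name.endswith(' ' + abbre):
            return name[:-len(abbre)] + abbres[abbre]
    return name

for name in ['Xiwan Rd', 'Youyi Rd(E)', 'Lu Jia Lu', 'Cuihua Mountain']:
    print(name, '=>', expand_ending(name))
```

With `str.replace`, the made-up 'Lu Jia Lu' would become 'Road Jia Road'; slicing off only the suffix avoids that.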

## Final Audit Code

What we learnt so far can be written up as an **audit_tags** function to be used in the audit.py file (which will be imported and used in the to_csv.py file):

```
def audit_tags(tags):
    # Apply what we learnt in audit.ipynb:
    for i, tag in enumerate(tags):
        value = tag['value']

        # Fix name tags
        if tag['key'] == 'name' and (ALL_EN.match(value) or 
                                     CN_EN.match(value)):
            zh_tag = get_zh_tag(tags)

            # Use name:zh as the gold standard
            if zh_tag and ALL_CN.match(zh_tag['value']):
                tags[i]['value'] = zh_tag['value']

            # Fix tags with a space separating the Chinese 
            # and English parts, such as:
            # '兵马俑 terracotta army'
            # '高薪四路 / Gaoxin Si Lu'
            elif CN_SPACE_EN.match(value):
                cn, _, _ = value.partition(' ')
                tags[i]['value'] = cn

            # Fix typos
            # '高薪' => '高新'
            for wrong, right in TYPOS.items():
                if wrong in tags[i]['value']:
                    tags[i]['value'] = tags[i]['value'].replace(wrong, right)

        # Fix abbreviations in name:en
        if tag['key'] == 'en':
            for abbre, full in ABBRES.items():
                if value.endswith(abbre):
                    # Replace only the trailing abbreviation
                    tags[i]['value'] = value[:-len(abbre)] + full
                    break

    return tags
```
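
`audit_tags` relies on a `get_zh_tag` helper that is not shown in this notebook. Purely as an assumption about its shape: if each tag is a dict in which a key such as 'name:zh' has been split into a 'type' of 'name' and a 'key' of 'zh' (which the `tag['key'] == 'en'` check above hints at but does not confirm), it might look roughly like this:

```python
from __future__ import print_function

def get_zh_tag(tags):
    """Return the tag dict holding the Chinese name, or None if there is none.

    Assumes the hypothetical tag layout {'type': ..., 'key': ..., 'value': ...}.
    """
    for tag in tags:
        if tag.get('type') == 'name' and tag.get('key') == 'zh':
            return tag
    return None

# Made-up tag list mirroring node 244080278 from the audit above
tags = [
    {'type': 'regular', 'key': 'name', 'value': u'Gaoling Xian'},
    {'type': 'name', 'key': 'zh', 'value': u'高陵县'},
]
print(get_zh_tag(tags)['value'])
```

If the real to_csv.py stores tags differently, only the two comparisons inside the loop would need to change.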