
In [1]:
import xml.etree.ElementTree as ET
from tqdm import tqdm
from collections import defaultdict, Counter

## Basic Exploration

Let's make a single pass over the file and count the element types, the tag keys used by each element type, and the distinct contributors.

In [2]:
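# Tag-key counts, grouped by the type of the element that carries the tags
tags_by_type = defaultdict(Counter)
# Number of occurrences of each element type
elem_count_by_type = Counter()
# Distinct contributor ids
users = set()

# iterparse yields each element on its 'end' event, once its children are complete
for event, elem in tqdm(ET.iterparse('mumbai_india.osm')):
    elem_count_by_type[elem.tag] += 1

    if 'uid' in elem.attrib:
        users.add(elem.attrib['uid'])

    # Count the key of every <tag> at or below this element
    for tag in elem.iter('tag'):
        tags_by_type[elem.tag][tag.attrib['k']] += 1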

5112846it [01:41, 50303.04it/s]
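
The loop above never releases parsed elements, so the whole tree stays in memory by the end of the run. Below is a minimal sketch of a lower-memory variant, assuming that only `node`, `way`, and `relation` carry the `uid` attribute and the `tag` children we care about. The `scan` function, the `TOP_LEVEL` set, and the sample XML are hypothetical names and data introduced for illustration; they are not part of the original notebook.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Tiny made-up sample standing in for the real .osm file.
SAMPLE_OSM = b"""<osm>
  <node id="1" uid="10" lat="19.0" lon="72.8">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" uid="11" lat="19.1" lon="72.9"/>
  <way id="3" uid="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

# Assumption: these are the only element types that own tags and a uid.
TOP_LEVEL = {'node', 'way', 'relation'}


def scan(source):
    """Count elements, tag keys per top-level type, and distinct uids."""
    elem_counts = Counter()
    tag_keys = defaultdict(Counter)
    uids = set()
    for _event, elem in ET.iterparse(source, events=('end',)):
        elem_counts[elem.tag] += 1
        if elem.tag in TOP_LEVEL:
            if 'uid' in elem.attrib:
                uids.add(elem.attrib['uid'])
            # Direct children only; they are complete at the parent's 'end' event.
            for tag in elem.findall('tag'):
                tag_keys[elem.tag][tag.attrib['k']] += 1
            # Drop the children and attributes we no longer need.
            elem.clear()
    return elem_counts, tag_keys, uids


elem_counts, tag_keys, uids = scan(io.BytesIO(SAMPLE_OSM))
print(elem_counts.most_common())
print(dict(tag_keys))
print(sorted(uids))
```

Clearing only at the top-level elements matters: a child `tag` must still be intact when its parent's `end` event arrives, so the children are not cleared on their own.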


### Elements in the document

In [3]:
elem_count_by_type.most_common(100)

[('nd', 2360309),
 ('node', 2055448),
 ('tag', 395568),
 ('way', 284435),
 ('member', 13091),
 ('relation', 3993),
 ('bounds', 1),
 ('osm', 1)]

Above, we see every element type present in the XML file, with its count. The most important elements are:

1. `node` is a single point defined by its latitude and longitude
1. `way` is an ordered list of nodes representing an open path or a closed area
1. `tag` is a key/value pair that describes a `node`, a `way`, or a `relation`
1. `relation` groups nodes, ways, and/or other relations, which it references through `member` elements
1. `nd` is how a node is referenced from within a `way`
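
To make the element roles in the list above concrete, here is a small sketch that parses a hand-written OSM fragment and pulls out the references and tags. The ids, coordinates, and tag values are made up for illustration and do not come from the Mumbai extract.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; ids, coordinates, and tag values are invented.
FRAGMENT = """<osm>
  <node id="1" lat="19.07" lon="72.87">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="19.08" lon="72.88"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>"""

root = ET.fromstring(FRAGMENT)

way = root.find('way')
# A way lists its nodes, in order, through <nd ref="..."> children.
way_node_refs = [nd.attrib['ref'] for nd in way.findall('nd')]

relation = root.find('relation')
# A relation points at its members through <member> children.
relation_members = [(m.attrib['type'], m.attrib['ref'])
                    for m in relation.findall('member')]

# Tags are key/value pairs attached to a node, way, or relation.
way_tags = {t.attrib['k']: t.attrib['v'] for t in way.findall('tag')}

print(way_node_refs)
print(relation_members)
print(way_tags)
```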

### Types of tags

Each total printed below is the number of distinct tag keys on that element type; the `Counter` values show how often each key occurs.

In [6]:
print("Total tags in ways: ", len(tags_by_type['way']))
tags_by_type['way']

('Total tags in ways: ', 323)


Counter({'AND:importance_level': 118,
         'AND_a_c': 2,
         'AND_a_i': 7,
         'AND_a_nosr_r': 115,
         'AND_a_w': 2,
         'FIXME': 9,
         'Golden Park': 1,
         'Legality': 1,
         'University': 1,
         'abandoned:landuse': 1,
         'abandoned:railway': 5,
         'abutters': 2,
         'access': 401,
         'accomodation': 1,
         'addr:city': 1393,
         'addr:country': 7,
         'addr:district': 3,
         'addr:flats': 5,
         'addr:full': 3,
         'addr:housename': 230,
         'addr:housenumber': 696,
         'addr:place': 5,
         'addr:postcode': 1504,
         'addr:state': 4,
         'addr:street': 1404,
         'addr:substreet': 1,
         'addr:suburb': 30,
         'addr:unit': 6,
         'admin_level': 16,
         'administrative': 3,
         'aerialway': 1,
         'aerodrome': 1,
         'aerodrome:type': 1,
         'aeroway': 117,
         'alt_name': 208,
         'alt_name:mr': 8,
        

In [7]:
print("Total tags in nodes: ", len(tags_by_type['node']))
tags_by_type['node']

('Total tags in nodes: ', 312)


Counter({'AND_a_nosr_p': 117,
         'Cable TV Provider': 1,
         'City': 1,
         'EState Consultants': 1,
         'Family': 1,
         'General Goods': 1,
         'General Items': 8,
         'General Store': 1,
         'Genral Goods': 1,
         'Guard_type': 56,
         'Gym': 1,
         'IR:zone': 1,
         'Legality': 1,
         'Mahesh Jain': 1,
         'Name': 115,
         'Photocopy': 8,
         'Print': 11,
         'Sector': 1,
         'access': 77,
         'addr:city': 744,
         'addr:country': 25,
         'addr:district': 1,
         'addr:housename': 303,
         'addr:housenumber': 488,
         'addr:place': 13,
         'addr:postcode': 1065,
         'addr:province': 2,
         'addr:state': 9,
         'addr:street': 1307,
         'addr:street_1': 1,
         'addr:suburb': 1,
         'addr:unit': 2,
         'admin_level': 2,
         'aerodrome': 1,
         'aeroway': 121,
         'alt_name': 28,
         'amenity': 3130,
        

In [8]:
# Adding Counters sums the per-key counts; relation tags are not included here
tags_overall = tags_by_type['way'] + tags_by_type['node']
print("Total tags overall: ", len(tags_overall))

tags_overall

('Total tags overall: ', 488)


Counter({'AND:importance_level': 118,
         'AND_a_c': 2,
         'AND_a_i': 7,
         'AND_a_nosr_p': 117,
         'AND_a_nosr_r': 115,
         'AND_a_w': 2,
         'Cable TV Provider': 1,
         'City': 1,
         'EState Consultants': 1,
         'FIXME': 9,
         'Family': 1,
         'General Goods': 1,
         'General Items': 8,
         'General Store': 1,
         'Genral Goods': 1,
         'Golden Park': 1,
         'Guard_type': 56,
         'Gym': 1,
         'IR:zone': 1,
         'Legality': 2,
         'Mahesh Jain': 1,
         'Name': 115,
         'Photocopy': 8,
         'Print': 11,
         'Sector': 1,
         'University': 1,
         'abandoned:landuse': 1,
         'abandoned:railway': 5,
         'abutters': 2,
         'access': 478,
         'accomodation': 1,
         'addr:city': 2137,
         'addr:country': 32,
         'addr:district': 4,
         'addr:flats': 5,
         'addr:full': 3,
         'addr:housename': 533,
         'add
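
The totals above (323 keys on ways, 312 on nodes, 488 combined) imply that 323 + 312 - 488 = 147 keys are used on both element types. The sketch below shows how `Counter` addition produces that kind of result, using a handful of counts copied from the outputs above; the variable names are illustrative only.

```python
from collections import Counter

# A few per-key counts copied from the way and node outputs above.
way_keys = Counter({'access': 401, 'addr:city': 1393, 'FIXME': 9})
node_keys = Counter({'access': 77, 'addr:city': 744, 'Name': 115})

# `+` sums the counts of keys present in both and keeps the rest as-is.
combined = way_keys + node_keys

# Keys that occur on both element types.
shared = set(way_keys) & set(node_keys)

print(combined['access'])     # 478, matching the combined output above
print(combined['addr:city'])  # 2137
print(len(combined))          # distinct keys = 3 + 3 - 2 shared
print(sorted(shared))
```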

### No. of distinct users who have contributed

In [9]:
len(users)

1791

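With 1,791 distinct contributors and free-form tag keys, the data is likely to contain inconsistencies. The key lists above already show some, such as misspellings (`Genral Goods`, `accomodation`) and keys that look like values (`Mahesh Jain`, `Golden Park`).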
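
One way to start auditing such keys is to flag those that do not follow a simple naming rule. The sketch below assumes a rule of lowercase words joined by underscores, with optional colon-separated namespaces such as `addr:city`. The pattern and the `suspicious_keys` helper are assumptions made for illustration, not an official OSM validator, and a rule like this will not catch lowercase misspellings such as `accomodation`.

```python
import re
from collections import Counter

# Assumed convention: lowercase letters, digits, and underscores, with
# optional colon-separated namespaces. Illustrative rule only.
KEY_PATTERN = re.compile(r'^[a-z][a-z0-9_]*(:[a-z0-9_]+)*$')


def suspicious_keys(key_counts):
    """Return the keys that do not match the assumed naming rule, sorted."""
    return sorted(k for k in key_counts if not KEY_PATTERN.match(k))


# A few keys and counts copied from the combined output above.
sample = Counter({
    'addr:city': 2137,
    'access': 478,
    'abandoned:railway': 5,
    'Genral Goods': 1,
    'Mahesh Jain': 1,
    'Name': 115,
    'FIXME': 9,
})

for key in suspicious_keys(sample):
    print(key, sample[key])
```

Flagged keys are candidates for manual review rather than confirmed errors; a key can break the rule and still be intentional.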