 # Table of Contents
<div class="toc" style="margin-top: 1em;"><ul class="toc-item" id="toc-level0"><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Basic-statistics" data-toc-modified-id="Basic-statistics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Basic statistics</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Types-of-tags" data-toc-modified-id="Types-of-tags-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Types of tags</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Investigate-tag-keys" data-toc-modified-id="Investigate-tag-keys-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Investigate tag keys</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Explore-users" data-toc-modified-id="Explore-users-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Explore users</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Audit-street-names" data-toc-modified-id="Audit-street-names-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Audit street names</a></span><ul class="toc-item"><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Look-at-some-sample-street-names" data-toc-modified-id="Look-at-some-sample-street-names-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Look at some sample street names</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Store-street-names-for-further-processing" data-toc-modified-id="Store-street-names-for-further-processing-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Store street names for further processing</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Investigate-street-suffixes" data-toc-modified-id="Investigate-street-suffixes-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Investigate street suffixes</a></span></li><li><span><a 
href="http://localhost:8890/notebooks/Exploration.ipynb#Investigate-suffixes" data-toc-modified-id="Investigate-suffixes-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Investigate suffixes</a></span><ul class="toc-item"><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Names-of-places" data-toc-modified-id="Names-of-places-5.4.1"><span class="toc-item-num">5.4.1&nbsp;&nbsp;</span>Names of places</a></span></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Street-names-with-commas" data-toc-modified-id="Street-names-with-commas-5.4.2"><span class="toc-item-num">5.4.2&nbsp;&nbsp;</span>Street names with commas</a></span></li></ul></li></ul></li><li><span><a href="http://localhost:8890/notebooks/Exploration.ipynb#Valid-street-suffixes" data-toc-modified-id="Valid-street-suffixes-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Valid street suffixes</a></span></li></ul></div>

In [55]:
import xml.etree.ElementTree as ET
from tqdm import tqdm
from collections import defaultdict, Counter
import re

import pickle

In [56]:
def load_data(filename):
    with open(filename + '.pickle', 'rb') as f:
        return pickle.load(f)
    
def save_data(filename, obj):
    with open(filename + '.pickle', 'wb') as f:
        return pickle.dump(obj, f)

## Basic statistics

These are the things that I'm interested in:

1. Total number of nodes
1. Total number of ways
1. Distinct tags and their counts within nodes
1. Distinct tags and their counts within ways

In [22]:
tags_by_type = defaultdict(Counter)
elem_count_by_type = Counter()
users = set()

for event, elem in tqdm(ET.iterparse('mumbai_india.osm')):
    elem_count_by_type[elem.tag] += 1
    
    if 'uid' in elem.attrib:
        users.add(elem.attrib['uid'])

    for tag in elem.iter('tag'):
        tags_by_type[elem.tag][tag.attrib['k']] += 1

5112846it [02:01, 42211.28it/s]


In [8]:
elem_count_by_type.most_common(100)

[('nd', 2360309),
 ('node', 2055448),
 ('tag', 395568),
 ('way', 284435),
 ('member', 13091),
 ('relation', 3993),
 ('bounds', 1),
 ('osm', 1)]

In [7]:
tags_by_type['way'].most_common()

[('building', 223891),
 ('highway', 40519),
 ('name', 11884),
 ('oneway', 4477),
 ('source', 4135),
 ('landuse', 3054),
 ('building:levels', 2702),
 ('layer', 1956),
 ('bridge', 1876),
 ('railway', 1518),
 ('addr:postcode', 1504),
 ('lanes', 1452),
 ('addr:street', 1404),
 ('addr:city', 1393),
 ('ref', 1216),
 ('leisure', 1158),
 ('amenity', 1029),
 ('natural', 966),
 ('gauge', 963),
 ('electrified', 948),
 ('surface', 915),
 ('waterway', 853),
 ('voltage', 801),
 ('addr:housenumber', 696),
 ('frequency', 638),
 ('area', 557),
 ('name:mr', 557),
 ('old_name', 522),
 ('ref:old', 476),
 ('usage', 473),
 ('service', 460),
 ('power', 433),
 ('access', 401),
 ('maxspeed', 383),
 ('passenger_lines', 372),
 ('man_made', 363),
 ('motorroad', 306),
 ('foot', 303),
 ('postal_code', 283),
 ('lit', 281),
 ('type', 267),
 ('cables', 243),
 ('int_ref', 239),
 ('addr:housename', 230),
 ('sport', 225),
 ('tunnel', 218),
 ('alt_name', 208),
 ('bicycle', 192),
 ('religion', 179),
 ('railway:traffic_mode

In [9]:
tags_by_type['node'].most_common()

[('source', 18416),
 ('power', 8970),
 ('name', 6219),
 ('created_by', 4497),
 ('amenity', 3130),
 ('natural', 2645),
 ('highway', 1496),
 ('addr:street', 1307),
 ('name:en', 1286),
 ('addr:postcode', 1065),
 ('place', 1039),
 ('operator', 930),
 ('shop', 796),
 ('addr:city', 744),
 ('building', 544),
 ('addr:housenumber', 488),
 ('railway', 406),
 ('tourism', 397),
 ('wikidata', 335),
 ('wikipedia', 330),
 ('description', 319),
 ('cuisine', 315),
 ('addr:housename', 303),
 ('website', 302),
 ('public_transport', 294),
 ('religion', 268),
 ('ref', 250),
 ('opening_hours', 210),
 ('phone', 200),
 ('brand:wikidata', 186),
 ('barrier', 181),
 ('atm', 173),
 ('extrude', 171),
 ('visibility', 171),
 ('tessellate', 171),
 ('ele', 148),
 ('leisure', 136),
 ('man_made', 132),
 ('bus', 128),
 ('emergency', 126),
 ('aeroway', 121),
 ('AND_a_nosr_p', 117),
 ('Name', 115),
 ('short_name', 111),
 ('name:mr', 102),
 ('office', 83),
 ('shelter', 82),
 ('access', 77),
 ('leaf_type', 76),
 ('internet_a

## Types of tags

As the element counts above show, the five most common element types, in order, are:

1. nd
1. node
1. tag
1. way
1. member

As per the OSM documentation (http://wiki.openstreetmap.org/wiki/Elements), the basic components are:
1. **node:** Defines a point in space. http://wiki.openstreetmap.org/wiki/Node
1. **way:** Defines linear features and area boundaries. It is an ordered list of nodes, each referenced by an `nd` element. http://wiki.openstreetmap.org/wiki/Way
1. **relation:** Describes how other elements work together
1. **tag:** Contained within a `node`, `way`, or `relation`; a key/value pair that describes that element
1. **member:** Contained within a `relation`; references a `node`, `way`, or another `relation` that is part of that relation
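
To make the nesting concrete, here is a minimal sketch that parses a tiny hand-written OSM fragment. The fragment, its ids, and its tag values are invented for illustration and are not taken from the Mumbai extract.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; ids and values are invented.
SAMPLE_OSM = """
<osm>
  <node id="1" lat="19.07" lon="72.87">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="19.08" lon="72.88"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <tag k="type" v="multipolygon"/>
  </relation>
</osm>
""".strip()

root = ET.fromstring(SAMPLE_OSM)

# A way lists its nodes, in order, through <nd ref="..."> children.
way = root.find('way')
way_node_refs = [nd.attrib['ref'] for nd in way.findall('nd')]

# A relation lists its members through <member> children.
relation = root.find('relation')
relation_members = [(m.attrib['type'], m.attrib['ref'], m.attrib['role'])
                    for m in relation.findall('member')]

# Tags are key/value children of a node, way, or relation.
way_tags = {t.attrib['k']: t.attrib['v'] for t in way.findall('tag')}

print(way_node_refs)
print(relation_members)
print(way_tags)
```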

## Investigate tag keys

Tag keys fall into a few categories:

1. Plain lowercase ASCII, possibly with underscores, like "highway"
1. Namespaced with a colon, like "addr:street"
1. Containing special characters
1. Something else altogether

In [38]:
lower = re.compile(r'^([a-z_])*$')
lower_colon = re.compile(r'^([a-z_])*:([a-z_\:])*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def get_tag_key_type(tag):
    if lower.match(tag):
        return 'lower'
    elif lower_colon.match(tag):
        return 'lower_colon'
    elif problemchars.search(tag):
        return 'problemchars'
    else:
        return 'other'
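
As a quick sanity check, the classifier can be run on a handful of keys. The snippet below repeats the patterns from the cell above so that it runs on its own. Most sample keys appear in the tag counts earlier in this notebook, while `addr.street` is a made-up key used only to trigger the `problemchars` branch.

```python
import re

# Same patterns as in the cell above, repeated so the snippet is self-contained.
lower = re.compile(r'^([a-z_])*$')
lower_colon = re.compile(r'^([a-z_])*:([a-z_\:])*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def get_tag_key_type(tag):
    if lower.match(tag):
        return 'lower'
    elif lower_colon.match(tag):
        return 'lower_colon'
    elif problemchars.search(tag):
        return 'problemchars'
    else:
        return 'other'

# 'addr.street' is a hypothetical key; the others occur in the data above.
sample_keys = ['highway', 'addr:street', 'name:mr', 'Name', 'AND_a_nosr_p', 'addr.street']
key_types = {key: get_tag_key_type(key) for key in sample_keys}

for key in sample_keys:
    print(key, '->', key_types[key])
```

Keys with uppercase letters or digits match none of the three patterns, so they end up in `other`.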

In [39]:
tags_by_key_type = defaultdict(set)

for tag_type in ['node', 'way']:
    for tag in tags_by_type[tag_type]:
        tag_key_type = get_tag_key_type(tag)
        tags_by_key_type[tag_key_type].add(tag)

In [40]:
for key_type in ['lower', 'lower_colon', 'problemchars', 'other']:
    print(key_type, len(tags_by_key_type[key_type]))
    print("====")
    for i in tags_by_key_type[key_type]:
        print(i)
    print("====\n")

('lower', 261)
====
shop
taxi
maxspeed
managed
office
restriction
man_made
indoor
social_facility
area
industrial
postal_code
motorcycle
int_ref
golf
trail_visibility
is_in
note
proposed
mumbai
residential
boundary
created_by
bench
attraction
charge
location
communication
usage
covered
hiway
administrative
psv
fax
junction
cycleway
short_name
service_times
leisure
number
icao
dock
historic
bridge_name
motor_vehicle
foot
tourism
smoothness
alternative_name
new_name
accomodation
fixme
vending
name
designation
level
steps
subway
embankment
fuel
crossing
gauge
repeat_on
cmt
disused
toll
internet_access
is_capital
bicycle
frequency
crop
nearyouu
platforms
cutting
street
wpt_symbol
parking
loc_name
sport
monorail
cargo
local_ref
capacity
network
substance
wikipedia
leaf_type
access
religion
buildingpart
playground
capital
print
ele
construction
artwork_type
ref
email
highway
public_transport
foundation
barrier
toilets
trees
electrified
denomination
water
tracks
studio
sym
leaf_cycle
tidal
ho

## Explore users

In [23]:
len(users)

1791

In [25]:
users

{'4976084',
 '1743639',
 '2293842',
 '616774',
 '4163648',
 '1306709',
 '2891069',
 '32529',
 '2535944',
 '2891065',
 '2891060',
 '2891062',
 '6671181',
 '3951013',
 '2891084',
 '2891083',
 '2891081',
 '3185511',
 '5043846',
 '2891088',
 '404532',
 '3800745',
 '2482676',
 '2219985',
 '2928127',
 '2928126',
 '2928125',
 '5018343',
 '2928123',
 '2952446',
 '2482411',
 '2932656',
 '2482415',
 '2482417',
 '118021',
 '336460',
 '5644374',
 '250196',
 '486933',
 '2928233',
 '2928236',
 '4512849',
 '556713',
 '2482306',
 '5009377',
 '5228128',
 '1286620',
 '5536733',
 '2482309',
 '3013340',
 '1821999',
 '441368',
 '289334',
 '436707',
 '6705583',
 '3884409',
 '4317520',
 '5761393',
 '1237830',
 '3889257',
 '3582761',
 '57987',
 '5475717',
 '5178468',
 '3339376',
 '336056',
 '3339378',
 '4687216',
 '675579',
 '223444',
 '445283',
 '2901893',
 '6110823',
 '7260',
 '90311',
 '6243687',
 '42123',
 '3367858',
 '1848945',
 '123364',
 '5125037',
 '510836',
 '883727',
 '2724216',
 '2447853',
 '512102

## Audit street names

### Look at some sample street names

In [45]:
sample_street_names = []
# Use 'end' events: on 'start', an element's children may not be parsed yet
for event, elem in ET.iterparse('mumbai_india.osm', events=('end',)):
    if elem.tag == 'way':
        for tag in elem.iter('tag'):
            if tag.attrib['k'] == 'addr:street':
                sample_street_names.append(tag.attrib['v'])

    # Stop once we have more than 10 samples
    if len(sample_street_names) > 10:
        break




31966it [00:14, 2156.25it/s]

In [46]:
sample_street_names

['Juhu Tara Road',
 'Boman Behram Road (Mere Weather Road)',
 'D Road',
 'Wing Mess Road',
 'TL Wasvani Road',
 'P Ramabai Marg',
 'Forjet Street',
 'Sahar Elevated Road',
 'Infinite Corridoor',
 'Powai',
 'Powai']

There are a couple of things to observe here:

1. We have suffixes like "Road", "Street", and even "Marg". In India, to my knowledge, "Road" and "Street" mean the same thing, and which one a street carries is mostly a matter of history. Both translate to the same Hindi or Marathi word.
1. "Marg" is the local word for "Road" or "Street".
1. Thus, for computation purposes, all of these words can be treated as synonyms.
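
One way to act on this observation is to map every synonymous suffix to a single canonical form. The sketch below is only an assumption about how that could look: the `SUFFIX_SYNONYMS` mapping and the `canonical_suffix` helper are hypothetical names, and picking "road" as the canonical form is arbitrary.

```python
# Hypothetical mapping; extend it as the audit turns up more variants.
SUFFIX_SYNONYMS = {
    'road': 'road',
    'street': 'road',
    'marg': 'road',
}

def canonical_suffix(street_name):
    """Return the canonical form of the last word, or None if it is not a known suffix."""
    words = street_name.lower().split()
    if not words:
        return None
    return SUFFIX_SYNONYMS.get(words[-1])

samples = ['Juhu Tara Road', 'P Ramabai Marg', 'Forjet Street', 'Powai']
canonical = [canonical_suffix(name) for name in samples]
print(canonical)
```

Names whose last word is not in the mapping, such as "Powai", come back as `None` and can be set aside for manual review.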

Alternative names:
1. In case of "Boman Behram Road (Mere Weather Road)", we see that an alternative name is within parentheses.
1. Since we can have only one name, I'll ignore the alternative name.

In [52]:
alternative_name_regex = re.compile(r'\s*\([^)]*\)\s*')

print(alternative_name_regex.sub('', 'Boman Behram Road (Mere Weather Road)'))

Boman Behram Road


### Store street names for further processing

Parsing the XML file is slow, so let's extract all the street names once, store them, and do the rest of the processing on the stored copy.

In [58]:
# Store all streets in one place for further processing
street_names = set()
# Use 'end' events so that each element's child tags are fully parsed
for event, elem in ET.iterparse('mumbai_india.osm', events=('end',)):
    if elem.tag in ['way', 'node']:
        for tag in elem.iter('tag'):
            if tag.attrib['k'] == 'addr:street':
                street_names.add(tag.attrib['v'])

save_data('street_names', street_names)

In [59]:
street_names = load_data('street_names')
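
Because each pass over the file is slow, it can also help to keep memory use low while extracting. The sketch below is one possible variant of the extraction step, not the code that produced the results in this notebook. It listens for `end` events so that every element's children are fully parsed, and it clears each top-level element once it has been handled. The `collect_street_names` name and the inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def collect_street_names(source):
    """Collect distinct addr:street values from the nodes and ways in an OSM XML source."""
    names = set()
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.iter('tag'):
                if tag.attrib['k'] == 'addr:street':
                    names.add(tag.attrib['v'])
        if elem.tag in ('node', 'way', 'relation'):
            # Drop the children and attributes of fully processed elements
            # so that they do not pile up in memory.
            elem.clear()
    return names

# Tiny invented sample; the relation's street must not be collected.
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:street" v="D Road"/></node>
  <way id="2"><nd ref="1"/><tag k="addr:street" v="Wing Mess Road"/><tag k="name" v="x"/></way>
  <relation id="3"><tag k="addr:street" v="Ignored Road"/></relation>
</osm>"""

print(sorted(collect_street_names(io.BytesIO(SAMPLE))))
```

`ET.iterparse` accepts either a filename or a file object, so the same function could be pointed at the `.osm` file directly.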

### Investigate street suffixes

The following rules will be applied:
1. The street name is always lowercased
1. Remove trailing alternative names. So "Name (Alt name)" becomes just "Name"

We'll group the street names by their suffix (the last word). Then, curious suffixes can be investigated further to find out what's really happening.

In [60]:
street_suffixes = defaultdict(set)

for street_name in street_names:
    # Lowercase everything
    street_name = street_name.lower()
    
    # Remove alternative name
    street_name = alternative_name_regex.sub('', street_name)
    
    # Get suffix
    street_suffix = street_name.split()[-1]
    
    street_suffixes[street_suffix].add(street_name)
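
To decide which suffixes deserve a closer look, it can help to rank them by how many distinct street names end with each one. The following is a small sketch on made-up data: `toy_suffixes` stands in for `street_suffixes`, and `rank_suffixes` is a hypothetical helper.

```python
from collections import defaultdict

def rank_suffixes(suffix_to_names):
    """Return (suffix, count) pairs, most frequent first, ties broken alphabetically."""
    counts = [(suffix, len(names)) for suffix, names in suffix_to_names.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))

# Build a toy version of the suffix index from a few lowercased names.
toy_suffixes = defaultdict(set)
for name in ['juhu tara road', 'd road', 'p ramabai marg', 'powai']:
    toy_suffixes[name.split()[-1]].add(name)

ranked = rank_suffixes(toy_suffixes)
print(ranked)
```

Frequent suffixes are likely to be genuine street types, while suffixes that occur only once are good candidates for manual inspection.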

In [62]:
street_suffixes.keys()

['shop',
 'bhayandar,',
 'hendrapada',
 'gully',
 'gold',
 'hattimohalla',
 'nere-chipale,',
 'multiplex',
 'avenue',
 'raigad',
 '25',
 '26',
 '20',
 '21',
 '22',
 'mumbai',
 'jhopadpatti',
 'thane',
 'karjat',
 'kherajroad',
 'complex',
 '4',
 'sakinaka',
 'mandal',
 'circle',
 'crossroad',
 'vashi',
 'bandra',
 '11',
 'garden',
 'bhavan',
 'kharghar',
 'gardens,',
 'baingawadi',
 '9',
 'b',
 'east,',
 'compound',
 'world',
 'chandivali',
 'bunglow',
 'gokuldham',
 'mankhurd',
 'j.v.link.rod',
 '50',
 'amruta',
 'koliwada',
 'belgaum',
 'no.6',
 'kone,',
 'no.4',
 'no.3',
 'no.2',
 'no.1',
 '18',
 'chowk,',
 'mumbai,',
 'chauk',
 'mg',
 'nagar',
 'society',
 'mulshi',
 'street',
 'sahar',
 'flyover',
 'chawl',
 'godrej',
 'jnpt',
 'versova',
 '04',
 'bombay',
 'sector-48',
 'asalpha',
 'rd',
 '3',
 'showroom,',
 'fob',
 'new',
 'perarawadi',
 'rd.',
 'temple',
 'highway',
 'sector-6',
 'city',
 'chs',
 'koldongri',
 'ghansoli',
 'path',
 'ranwar',
 'govandi',
 '33',
 '32',
 'nager,t.

### Investigate suffixes

Let's investigate certain suffixes to understand what's happening better.

#### Names of places

Keys like 'bhayandar,', 'mumbai', 'thane', and 'karjat' are names of places, not street suffixes. Let's see how they got there.

In [63]:
street_suffixes['mumbai']

{'hindcycle road, worli, mumbai',
 'janki kutir juhu church road, mumbai',
 'liberty tower, plot no. k 10, behind reliable plaza, thane-belapur road, airoli navi mumbai',
 'lokhandawala road, 4 bungalows, andheri, mumbai',
 'new link road, andheri west, mumbai',
 'opposite bon bon, four bungalows, andheri west, mumbai'}

In [64]:
street_suffixes['thane']

{'naupada, thane', 'thane'}

In [65]:
street_suffixes['karjat']

{'karjat'}

In [66]:
street_suffixes['sakinaka']

{'jagnnath mandir road,satyanagar, sakinaka', 'satyanagar, sakinaka'}

We observe a few things:

1. Some of these values are full addresses, with the street name at the start. We can truncate everything after the first comma.
1. Some of the values are place names rather than street names. Thane, for example, is a place that contains streets, not a street itself. It's worth investigating street names that are just one word long, and possibly ignoring them.
1. Some of the values are directions, like "opposite bon bon, four bungalows, andheri west, mumbai". We should ignore these.
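
These three observations can be combined into a small triage function. This is a sketch under assumptions, not the cleaning code used later: the `triage_street_name` helper, its return labels, and the `DIRECTION_PREFIXES` tuple are hypothetical, and the prefix list only covers prefixes that appear in this notebook's examples.

```python
# Hypothetical prefixes that mark a value as a direction or landmark rather than a street.
DIRECTION_PREFIXES = ('opposite ', 'near ')

def triage_street_name(raw_name):
    """Return a (label, cleaned_name) pair for a raw addr:street value."""
    name = raw_name.lower().strip()
    if name.startswith(DIRECTION_PREFIXES):
        return ('direction', None)
    # Keep only the part before the first comma.
    name = name.split(',', 1)[0].strip()
    if len(name.split()) <= 1:
        # A single word (or nothing) is more likely a place name than a street.
        return ('needs_review', name)
    return ('street', name)

for raw in ['hindcycle road, worli, mumbai',
            'opposite bon bon, four bungalows, andheri west, mumbai',
            'naupada, thane',
            'thane']:
    print(raw, '->', triage_street_name(raw))
```

Returning a label instead of silently dropping values keeps the doubtful cases visible, so they can be reviewed by hand.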

#### Street names with commas

Let's look at street names with commas in more detail.

In [74]:
# Keeps one example street name per suffix (later names overwrite earlier ones)
streets_with_commas_suffixes = {}

for street_name in street_names:
    # Only consider street names with commas
    if "," in street_name:
        # Lowercase everything
        street_name = street_name.lower()

        # Remove alternative name
        street_name = alternative_name_regex.sub('', street_name)

        # Just get the part before the first comma (if one is left)
        name = street_name.split(",", 1)[0].strip()

        if name == '':
            continue

        # Get suffix
        street_suffix = name.split()[-1]

        streets_with_commas_suffixes[street_suffix] = street_name

In [76]:
streets_with_commas_suffixes.keys()

['plaza',
 'estate',
 'cinema',
 'reclamation',
 'centre',
 'nagar',
 '10e',
 'street',
 'shastrinagar',
 'gograswadi',
 'junction',
 'st',
 'avenue',
 'mandir',
 'factory',
 'rd',
 'marg',
 '06',
 '21',
 'area',
 'nere-chipale',
 '45',
 'karjat',
 'heights',
 '1',
 'club',
 '3',
 'complex',
 '186',
 '4',
 '7',
 'retibandar',
 'circle',
 'depot',
 'beach',
 '19/20',
 '24',
 'lokhandwala',
 'north',
 'garden',
 'naupada',
 'towers',
 'colony',
 'bungalows',
 'park',
 'apt.',
 'ltd',
 'esplanade',
 'oshiwara',
 'sec-2',
 'g-13',
 'a',
 'lane',
 'satyanagar',
 'mahal',
 '14',
 'extension',
 'mobile',
 'crest',
 'floor',
 '2556',
 'vangani',
 'nerul',
 'nager',
 'no.3',
 'subway',
 'bon',
 'd-sector',
 'tower',
 'wing',
 'road',
 'tilakwadi']

In [77]:
streets_with_commas_suffixes['shastrinagar']

'shastrinagar, andheri west'

Shastrinagar is the name of a locality, not the street name. "Linking Rd" would be an example street within the locality.

In [78]:
streets_with_commas_suffixes['mandir']

'near sai baba mandir, pathanwadi bus stand, malad east'

This is not a street name, but a landmark. We should ignore records starting with "near."

In [80]:
streets_with_commas_suffixes['nerul']

'nerul, navi mumbai,  sector no 19 a'

Again, Nerul is a locality with many sectors, not a street.

In [81]:
streets_with_commas_suffixes['1']

'aramnagar part 1, versova'

Again, name of a locality.

In [82]:
streets_with_commas_suffixes['no.3']

'plot no.3, sector 2, kharghar'

Not a street.

In [83]:
streets_with_commas_suffixes['colony']

'aarye milk colony, royal palms'

This colony is too big to be considered a street.

## Valid street suffixes

Looking at the previous records, let's drill down on suffixes of valid street names: