# OpenStreetMap Data Case Study

## Map Area:
### Manchester, England
 - http://www.openstreetmap.org/relation/146656#map=11/53.4427/-2.2337
 - https://mapzen.com/data/metro-extracts/#manchester-england
 - http://www.openstreetmap.org/export#map=11/53.4427/-2.2337
 
 
I chose Manchester because it is home for me. I am also interested in learning what has changed and in discovering interesting things I hadn't known before.


### Map Exploration

In [2]:
# Iterative Parsing

In [None]:
# Sample Result: 
# Full Result
{'bounds': 1,
 'member': 27148,
 'nd': 1758747,
 'node': 1421700,
 'osm': 1,
 'relation': 2203,
 'tag': 805691,
 'way': 201946}

In [None]:
# Check tags

In [29]:
# tag issues
manchester_england = {'higher': 280,
                         'lower': 608844,
                         'lower_colon': 74340,
                         'naptan': 135348,
                         'other': 769,
                         'problemchars': 2}

# Non-errors: 122225 | Errors 683466
683466/122225
# 5

Non-errors: 122225 | Errors: 683466


5

In [None]:
# Exploring Users

In [None]:
283  Users - Sample
1919 Users - Full

## Problems Encountered in the Map

- Short Street Name Suffixes 
- Binary options mixed with categorical variables (e.g. Yes, No, Loony)
- Naptan imported keys - Data imported where source is equal to naptan_import prefixes its attributes with naptan: (e.g. naptan:CommonName)


### Short Street Names

I used the original update_name method, which follows a dictionary mapping strategy: a regex finds specific patterns within street names, and matches are replaced with more consistent values across the data set. I added some very common British street abbreviations, with their respective full forms, to the dictionary for replacement:

"RD": "Road"
"Pk": "Park"
"Gr": "Grove"
"Gn": "Green"


This included changes such as Q-Park Deansgate North Pk to Q-Park Deansgate North Park. The process caught many inconsistent values and made street names more consistent across the map. 
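
The mapping strategy described above can be sketched as follows. This is a minimal illustration, not the exact update_name used in the project: the regex, the `MAPPING` name and the function body are assumptions, and only the four abbreviations listed above are included.

```python
import re

# Abbreviation -> full form. These four entries are the ones listed above;
# a real mapping would hold many more.
MAPPING = {"RD": "Road", "Pk": "Park", "Gr": "Grove", "Gn": "Green"}

# Capture the last whitespace-delimited token, ignoring a trailing full stop.
STREET_TYPE_RE = re.compile(r"(\S+?)\.?$")


def update_name(name, mapping=MAPPING):
    """Replace an abbreviated final word in a street name, if it is in mapping."""
    match = STREET_TYPE_RE.search(name)
    if match is None:
        return name
    suffix = match.group(1)
    if suffix in mapping:
        # Keep everything before the final token and append the full form.
        return name[:match.start()] + mapping[suffix]
    return name


print(update_name("Q-Park Deansgate North Pk"))
```

Only the final token is rewritten, so the "Park" inside "Q-Park" is left alone. Matching is case-sensitive here, which means "Rd" would need its own entry alongside "RD".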

### Binary Mixed With Categorical Variables
The options for bicycle lanes could be any of the following:
    
 [u'yes', u'no', u'permissive', u'designated', u'dismount',
 u'destination', u'permissive or not at all!', u'private',
 u'loony_only']
 
After reviewing the options, I first thought that I could fold each of the categorical values into one of the binary ones. That might help an analysis, but it would be less helpful if I wanted to understand the particularities of bike lane rules. 

So a better change, to me, would be to turn the binary values into categorical ones:

yes -> full_access
no -> no_access

This would also allow newcomers to be more specific, rather than having to choose between overlapping binary and categorical values.







In [None]:
    # Recode the binary bicycle values as categorical ones
    if tag.attrib['k'] == 'bicycle':
        if tag.attrib['v'] == 'yes':
            tag.attrib['v'] = 'full_access'
        elif tag.attrib['v'] == 'no':
            tag.attrib['v'] = 'no_access'

In [None]:
# Before

	loony_only	1
	permissive or not at all!	1
	destination	4
	private	44
	dismount	56
	permissive	174
	designated	346
	no	889
	yes	2357

In [None]:
# After
0	loony_only	1
1	permissive or not at all!	1
2	destination	4
3	private	44
4	dismount	56
5	permissive	174
6	designated	346
7	no_access	889
8	full_access	2357


### Naptan Import

##### After scanning over the tags, I found naptan-prefixed tags followed by common fields like street, which wouldn't work well without adding complexity to the query. So I chose to look at naptan-prefixed tags only and compare them with addr-prefixed tags, which seem to be the standard:

{'k': 'naptan:ShortCommonName', 'v': 'Hockley PO'}
{'k': 'name', 'v': 'Willow Close'}
{'k': 'source', 'v': 'naptan_import'}
{'k': 'highway', 'v': 'bus_stop'}
{'k': 'naptan:Street', 'v': 'Park Lane'}
{'k': 'naptan:Bearing', 'v': 'W'}
{'k': 'naptan:AtcoCode', 'v': '0600MA0495'}
{'k': 'naptan:Crossing', 'v': 'Willow Close'}
{'k': 'naptan:Landmark', 'v': 'House number 191'}
{'k': 'naptan:verified', 'v': 'no'}
{'k': 'naptan:Indicator', 'v': 'W-bound'}
{'k': 'naptan:CommonName', 'v': 'Willow Close'}
{'k': 'naptan:NaptanCode', 'v': 'chepawa'}
{'k': 'naptan:ShortCommonName', 'v': 'Willow Close'}

##### And I compared it to one of the typical cases

{'k': 'addr:city', 'v': 'Whaley Bridge'}
{'k': 'dispensing', 'v': 'yes'}
{'k': 'addr:street', 'v': 'Market Street'}
{'k': 'addr:postcode', 'v': 'SK23 7LP'}
{'k': 'addr:housenumber', 'v': '40'}
        
        
##### I used a regular expression to match 'naptan:'
^([naptan]|_)*:
    
##### After adding a count of all the tags that have the naptan prefix, the total is substantial: 

'naptan': 135348    
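
One caveat about the pattern: a character class such as `[naptan]` matches any single one of those letters, not the literal word, so the pattern also accepts keys like `ant:`. A minimal sketch of a stricter prefix count follows, assuming the keys are available as plain strings; `NAPTAN_RE`, `LOOSE_RE` and `count_naptan_keys` are illustrative names, not part of the original code.

```python
import re
from collections import Counter

# Literal prefix match: the key must start with the exact text "naptan:".
NAPTAN_RE = re.compile(r"^naptan:")

# The character-class form accepts any run of the letters n, a, p, t (or "_")
# followed by a colon, so it is looser than a literal prefix.
LOOSE_RE = re.compile(r"^([naptan]|_)*:")


def count_naptan_keys(keys):
    """Count keys that carry the literal naptan: prefix versus everything else."""
    counts = Counter()
    for key in keys:
        counts["naptan" if NAPTAN_RE.match(key) else "other"] += 1
    return counts


sample = ["naptan:Street", "naptan:Bearing", "addr:street", "ant:example"]
print(count_naptan_keys(sample))
```

For this dataset the two patterns may well give the same count, since few real keys consist only of those letters, but the literal form states the intent more directly.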

##### Since the prefixed keys don't all have direct matches, I tried to find the most reliable and relatable key names to rename. Thus far, address is the most consistent.

{'k': 'naptan:Street', 'v': 'Park Lane'}
##### should become
{'k': 'addr:street', 'v': 'Park Lane'}

13292 naptan:Street tags
7212 addr:street tags
    
##### This surprised me, since I had assumed the latter key would be in the majority.


##### After checking through the other naptan tags, such as Bearing and AtcoCode, I found that none of the other keys were directly transferable into addr keys; only the Street key was.
###### (e.g. Landmark values: ('House number 191', 'ICL', 'Fenner Sales'))

#### So I decided to replace all naptan:Street keys, so that any further street-based data analysis would take in the 13292 values that might otherwise have slipped through.

##### I added 'naptan:Street' in 3 different places 
- 1. In the key_type function to count how many instances there were.
- 2. In the improving street names section to ensure that it received preprocessing.
- 3. In the database import section, where I replaced the default tag 'naptan:Street' with 'addr:street'

#### This resulted in a substantial change in the total number of street addresses:
###### From 7212 to 20503
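
The key replacement in the import step might look roughly like the sketch below. It assumes each tag is held as a `{'k': ..., 'v': ...}` dict, as in the samples above; `rename_naptan_street` is a hypothetical helper name, not the function used in the project.

```python
def rename_naptan_street(tags):
    """Return a copy of tags with 'naptan:Street' keys renamed to 'addr:street'."""
    renamed = []
    for tag in tags:
        key = "addr:street" if tag["k"] == "naptan:Street" else tag["k"]
        renamed.append({"k": key, "v": tag["v"]})
    return renamed


tags = [
    {"k": "naptan:Street", "v": "Park Lane"},
    {"k": "naptan:Bearing", "v": "W"},
]
print(rename_naptan_street(tags))
```

Building a new list rather than mutating the parsed element keeps the original tags available for comparison while checking the counts.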
        


- I had quite a few errors in my code, mostly NoneType issues. I added type checking before appending to a dict or list.


In [37]:
# Now to import the db

# Data Overview and Additional Ideas

After converting the map data into CSV files based on the provided schema, 
here are the results:

 #### File Sizes
    ______________________________
    |                             |
    |- nodes_tags.csv | (11.3MB)  |
    |- nodes.csv      | (115.3MB) |
    |- ways_tags.csv  | (18MB)    |
    |- ways.csv       | (11.8MB)  |
    |                             |
    |- man_eng.osm    | (318MB)   |
    |- manchester.db  | (225.6MB) |
    |_____________________________|
    
    
    
#### Number of Nodes
nn = 'SELECT COUNT(*) FROM node;'
#### Number of ways
nw = 'SELECT COUNT(*) FROM way;'
#### Number of users
nu = 'SELECT COUNT(DISTINCT(e.uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;'
#### Number of node tags
nnt = 'SELECT COUNT(*) FROM node_tags;'
#### Number of way tags
nwt = 'SELECT COUNT(*) FROM way_tags;'
     _____________________________
    |                             |
    |- Number of Users | 1899     |
    |- Number of Nodes | 1421701  |
    |- Number of ways  | 201947   |
    |- Number of Tags  | 796322   |
    |- - Nodes         | 290902   |
    |- - Ways          | 505420   |
     _____________________________
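
Counts like those above can be collected with Python's built-in sqlite3 module. The sketch below is only an illustration: it assumes tables named `nodes` and `ways`, each with a `uid` column, and it runs against a tiny in-memory stand-in rather than manchester.db.

```python
import sqlite3

# Table names here are assumptions; adjust them to match the actual schema.
QUERIES = {
    "nodes": "SELECT COUNT(*) FROM nodes;",
    "ways": "SELECT COUNT(*) FROM ways;",
    "users": ("SELECT COUNT(DISTINCT e.uid) FROM "
              "(SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;"),
}


def run_counts(conn, queries=QUERIES):
    """Run each single-value query and return {label: count}."""
    cursor = conn.cursor()
    return {label: cursor.execute(sql).fetchone()[0]
            for label, sql in queries.items()}


# Small in-memory database used only to exercise the queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, uid INTEGER);
    CREATE TABLE ways (id INTEGER PRIMARY KEY, uid INTEGER);
    INSERT INTO nodes VALUES (1, 10), (2, 10), (3, 11);
    INSERT INTO ways VALUES (1, 11), (2, 12);
""")
counts = run_counts(conn)
print(counts)
```

To run it against the real data, the connection would point at the database file instead, for example `sqlite3.connect("manchester.db")`.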



### Top 10 most common node_tags
- created_by	38485
- name	        22294
- highway	    18903
- source	    17875
- AtcoCode	    13404
- CommonName	13291
- Street	    13291
- verified	    13286
- Indicator	    13280
- Landmark	    13164

### Top 10 most common way_tags

- created_by	38485
- name	        22294
- highway	    18903
- source	    17875
- AtcoCode	    13404
- CommonName	13291
- Street	    13291
- verified	    13286
- Indicator	    13280
- Landmark	    13164


## Religion

#### Node_Tags

- 0	religion	christian	321
- 1	religion	muslim	16
- 2	religion	jewish	6
- 3	religion	buddhist	2
- 4	religion	hindu	1
- 5	religion	scientologist	1
- 6	religion	sikh	1
 
#### Way Tags

- 0	religion	christian	282
- 1	religion	muslim	9
- 2	religion	jewish	6
- 3	religion	buddhist	1
- 4	religion	hindu	1

<img src="files/images/religion_pie_chart.png" />



### Bicycle Lanes

I also looked at bicycle lanes. 
I've added the possible values below. As you can tell, there are some unique
values such as 'loony_only'. I'm unsure of its operational definition; it seems
to be a sort of joke, and it isn't helpful because it's too unique for the overall data.

We might want to categorise these options with a second field: 
yes and no at the highest level, and then an optional second attribute
which says the type of yes. 

- Possible Values:
    [u'yes', u'no', u'permissive', u'designated', u'dismount',
       u'destination', u'permissive or not at all!', u'private',
       u'loony_only']
       
#### Bicycle Stats      
- loony_only	                1
- permissive or not at all!	    1
- destination	                4
- private	                   44
- dismount	                   56
- permissive	              174
- designated	              346
- no	                      889
- yes	                     2357


##### SQL CODE
'SELECT value, count(*) AS num FROM way_tags WHERE key = "bicycle" GROUP BY value ORDER BY num;'
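
The two-level idea suggested in this section (yes/no at the top, with an optional detail) could be sketched as below. The grouping of values into `ALLOWED` and `DENIED` is a hypothetical judgment call made for illustration; neither the dataset nor this write-up defines it.

```python
# Hypothetical grouping of the detailed values under a top-level yes/no.
ALLOWED = {"yes", "permissive", "designated", "destination"}
DENIED = {"no", "private", "dismount"}


def classify_bicycle(value):
    """Return (top_level, detail) for a raw bicycle tag value."""
    if value in ALLOWED:
        return ("yes", None if value == "yes" else value)
    if value in DENIED:
        return ("no", None if value == "no" else value)
    # One-off values such as 'loony_only' are kept but flagged as unknown.
    return ("unknown", value)


for raw in ["yes", "designated", "private", "loony_only"]:
    print(raw, "->", classify_bicycle(raw))
```

Keeping the raw value as the detail means no information is lost, while an analysis that only needs yes/no can ignore the second element.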

## FOOD!

- 0	chinese	131
- 1	indian	90
- 2	fish_and_chips	88
- 3	pizza	75
- 4	coffee_shop	61
- 5	sandwich	55

Apparently Chinese food is listed as most popular, followed by Indian, then Fish & Chips. This surprises me because I remember Indian food being the most popular, most notably the infamous Curry Mile. 

There are quite a few problems in this part of the dataset as well,
such as the use of semicolons, commas and colons to delimit multiple cuisines. 

The fix here is to add a regex pattern that looks for these three characters (; , :) 
and then to pass that regex to the split function
(e.g. pizza;Burger;Fish_&_Chips;Kebab).
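
A minimal sketch of that split, using `re.split` with a character class covering the three delimiters. The lower-casing step and the name `split_cuisines` are additions for illustration, not part of the original cleaning code.

```python
import re

# Split on ';', ',' or ':' and swallow any whitespace around the delimiter.
CUISINE_SPLIT_RE = re.compile(r"\s*[;,:]\s*")


def split_cuisines(value):
    """Split a raw cuisine value into a list of lower-cased cuisine names."""
    return [part.lower() for part in CUISINE_SPLIT_RE.split(value.strip()) if part]


print(split_cuisines("pizza;Burger;Fish_&_Chips;Kebab"))
```

Empty parts are dropped, so a stray trailing delimiter does not produce an empty cuisine.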

## Amenities

amenity = 'SELECT value, count(*) as num FROM node_tags WHERE key="amenity" GROUP BY value ORDER BY num DESC;'

- pub	            1270
- post_box	        936
- place_of_worship	575
- fast_food	        571
- bench	            478


As you can see, there is no lack of pubs in Manchester, as could be said for much of England. What is surprising is that, according to the dataset, there are more places of worship than fast food outlets. 

## Additional Ideas


### Mixed Cuisine
I am interested in looking at restaurants that combine several cuisines in one place, across different areas at the city and country level. In this case, we would just need to split by ":" and work out each food combination. There were only a few in the dataset I looked at, but I'd be interested in seeing which combinations are most and least popular (e.g. chinese:fish_chips).
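
Counting the combinations could be sketched like this. It assumes the raw cuisine values have already been pulled out of the database as strings, and it splits on all three delimiters seen in the data rather than ":" alone; `count_combinations` is an illustrative name.

```python
import re
from collections import Counter


def count_combinations(cuisine_values):
    """Count unordered cuisine combinations among multi-cuisine values."""
    combos = Counter()
    for value in cuisine_values:
        # A set removes repeats; sorting makes 'a:b' and 'b:a' the same combination.
        parts = sorted({p.strip().lower()
                        for p in re.split(r"[;,:]", value) if p.strip()})
        if len(parts) > 1:
            combos[tuple(parts)] += 1
    return combos


values = ["chinese:fish_chips", "fish_chips;chinese", "pizza"]
print(count_combinations(values).most_common())
```

Single-cuisine values are skipped, so the result holds only genuine combinations.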

### Improving the Dataset 

In some ways OSM reminds me of the Waze app, but it lacks a visible public community: it isn't obvious what one should do to contribute without diving into the docs. OSM has many quirks (like the label "only loonies would ride bikes here" that can be found in this dataset) and they vary from city to city, so I'd be interested in expanding on the locality of the maps. That would mean keeping a standard map that is cross-referenced with other map technology (Google Maps/Bing), alongside customised maps which users can add to for whatever reason they like. A map for a pub run; a Mini Cooper map that highlights all of the points in a city or area with winding turns or fewer stop lights. These maps could add gamification by allowing users to set up goals such as the "Liverpudlian Pub Run". This is a bit like Foursquare, but think instead of custom maps that can be made private and membership-controlled in an open source way, not unlike a forum. Imagine that the "loony" bike lanes become a map for those who love BMX, or paths for parkour. 

I think we already have map technology that is much more accurate than OSM, and if accuracy is OSM's goal then I don't see it meeting a user's needs better than Google/Bing. 
Go custom, localised, particular. Maps for friends, maps for family. Maps that tell the secrets of a city. Now that's exciting.   

I'd also like to discuss some major issues that I can foresee with the advent of this technology.


- Data Reliability
Because we would be using Google/Bing-based APIs, we would also be relying on Google/Bing as dependencies for the map itself. Fetching data asynchronously is one of the most common problems in web technology today, so we'd be placing ourselves at the centre of it. If we choose to set up webhook-like updates and live-update the map, we run into problems with some people seeing an updated map and others seeing an older one. Because API requests are rate-limited and the services do not have 100% uptime, we would be subject to latency and availability issues. In the middle of updating a map we could stop receiving data, which would compromise data quality. We would have to build a system with redundancy built in and the ability to roll back or refetch when we cannot finish an API or DB transaction. This introduces another possible fault point that is unavoidable in developing these types of systems.

- Data Integrity
No matter how we implement it, we are adding complexity to the system, which means the database and its transactions will have to be remodelled, and ultimately this will change the way the data is used. By cross-referencing Google/Bing maps technology, we have to consider how OSM will *weight* user-generated content as opposed to external map sources. The data will become much more complex as it combines data from the existing user base (user contributions and bots) with data from APIs, which might need reshaping to fit OSM's model. Since OSM is committed to user contributions, we might consider keeping user-contributed data as the primary source of truth, with updates from external technology approved by OSM users. If we assume external sources are more accurate than OSM's, then we place a heavier weight on the external sources and the data model should reflect that. 

Customised Map
- How might updates cascade? 
Upon introducing the customised map feature we would be adding in facades or building data layers on top of the foundational map that OSM provides. As the underlying map changes, the references within the customised maps must also change. Keeping customised maps up to date involves running into the same issues I've talked about above. Some potentially new issues are as follows:

Customised maps will 
- introduce complexity to the user access system.
- store duplicate data (map data or references)

This also means that if the 'foundational' OSM map is updated, the update must cascade to the customised maps as well. For any given point on the foundational map, there are possibly thousands of references to it within the customised maps, and those maps depend absolutely on these references: a customised map becomes incomplete if a reference it was built around is lost. If the reference point is merely updated, all should be well. But if it is moved or deleted, the maps are at risk of becoming useless, or at least of losing the completeness they had when they were created. 


 


### Conclusion

After weaving through the dataset, I was able to find some interesting data that intrigued me enough to dig deeper and investigate the OSM data further. There were quite a few errors and the data required extensive cleaning, much more than could be done here. But I guess this is the nature of open source street maps. That being said, there are also beautiful particularities in the data, such as bike lanes marked "only looneys" and other unique entries, which sometimes support one another and give the data more reliability.