# OpenStreetMap Data Case Study for the Metro Area of Berlin, Germany

## Map Area

Berlin Metro Area, Germany. Berlin was chosen because it is my hometown.

## Data Wrangling

I downloaded the available data from https://mapzen.com/data/metro-extracts/ (May 2nd, 2016), extracted the nodes and ways, and imported the data into a SQLite database (see data_preparation.py; for the database schema, see schema.txt). Some unrecognized characters in usernames (Cyrillic) led to problems when extracting the 'nodes' and 'ways' from the OSM file, and SQL import errors led to an incomplete database.

Therefore, the row counts in the database were checked against the CSV files:

In [2]:
# Warning: loading the large CSV files takes some time.

import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0. read_csv with index_col=0
# keeps the first column (the id) as the index, as from_csv did.
df_nodes = pd.read_csv('nodes.csv', index_col=0)
df_nodes_tags = pd.read_csv('nodes_tags.csv', index_col=0)
df_ways = pd.read_csv('ways.csv', index_col=0)
df_ways_tags = pd.read_csv('ways_tags.csv', index_col=0)
df_ways_nodes = pd.read_csv('ways_nodes.csv', index_col=0)

In [3]:
print("Count of rows in csv files")
print("nodes: ", len(df_nodes.index.values))
print("nodes_tags: ", len(df_nodes_tags.index.values))
print("ways: ", len(df_ways.index.values))
print("ways_tags: ", len(df_ways_tags.index.values))
print("ways_nodes: ", len(df_ways_nodes.index.values))

Count of rows in csv files
nodes:  10460000
nodes_tags:  3658234
ways:  1596861
ways_tags:  4191676
ways_nodes:  13362536


Count of rows in database:

    SELECT Count(*) FROM nodes;
    > 10,460,000
    
    SELECT Count(*) FROM nodes_tags;
    > 3,658,235
    
    SELECT Count(*) FROM ways;
    > 1,596,861
    
    SELECT Count(*) FROM ways_tags;
    > 4,191,677
    
    SELECT Count(*) FROM ways_nodes;
    > 13,362,537

The tables nodes_tags, ways_tags and ways_nodes are each one row longer than their CSV files. A manual inspection revealed that the column names had been imported as a data row. These rows were removed manually.
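
The manual cleanup could also be scripted. The following is only a minimal sketch on an in-memory database: the table layout and the name `conn_demo` are assumptions for illustration and are not taken from schema.txt. It relies on the header row being the only row whose id column holds the literal text 'id'.

```python
import sqlite3

# In-memory stand-in for the real database; this simplified schema is an
# assumption for illustration, not the project's schema.txt.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE nodes_tags (id, key, value, type)")
conn_demo.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?, ?)",
    [
        ("id", "key", "value", "type"),  # CSV header imported as a data row
        (1, "amenity", "cafe", "regular"),
        (2, "addr:postcode", "10115", "regular"),
    ],
)

# Delete the row whose id column contains the column name instead of a number.
deleted = conn_demo.execute("DELETE FROM nodes_tags WHERE id = 'id'").rowcount
remaining = conn_demo.execute("SELECT COUNT(*) FROM nodes_tags").fetchone()[0]
print(deleted, remaining)
```

Skipping the header line during the import itself would avoid the problem in the first place; how to do that depends on the import tool used.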

Furthermore, since nodes_tags and ways_tags are child tables of nodes and ways, every id in a tags file should always refer to a valid id in the corresponding nodes or ways file.

In [4]:
print(set(df_nodes_tags.index.values) <= set(df_nodes.index.values))
print(set(df_ways_tags.index.values) <= set(df_ways.index.values))

True
True


In [9]:
import sqlite3

conn = sqlite3.connect('data1.db')
c = conn.cursor()

nodes_set = set([n[0] for n in c.execute("SELECT id FROM nodes").fetchall()])
nodes_tags_set = set([n[0] for n in c.execute("SELECT id FROM nodes_tags").fetchall()])
ways_set = set([n[0] for n in c.execute("SELECT id FROM ways").fetchall()])
ways_tags_set = set([n[0] for n in c.execute("SELECT id FROM ways_tags").fetchall()])

print(nodes_tags_set <= nodes_set)
print(ways_tags_set <= ways_set)

True
True
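
Loading every id into a Python set works, but needs a lot of memory for tables of this size. As an alternative, the same check can be pushed into SQLite with an anti-join. This is a sketch only: the helper `count_orphans` and the tiny demo tables are hypothetical, and it assumes both tables share an `id` column as in the queries above.

```python
import sqlite3


def count_orphans(conn, child, parent):
    """Count rows in `child` whose id has no matching id in `parent`."""
    # Table names cannot be bound as parameters, so pass trusted names only.
    query = (
        f"SELECT COUNT(*) FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.id = p.id "
        f"WHERE p.id IS NULL"
    )
    return conn.execute(query).fetchone()[0]


# Tiny in-memory example; the real database file would be used instead.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY);
    CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT);
    INSERT INTO nodes VALUES (1), (2);
    INSERT INTO nodes_tags VALUES (1, 'amenity', 'cafe'), (3, 'fixme', 'check');
""")
print(count_orphans(demo, "nodes_tags", "nodes"))  # tag row with id 3 has no node
```

A result of 0 corresponds to the `True` printed by the set comparison.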


## File Sizes

* 'berlin.osm':    2.29 GB (uncompressed)
* 'nodes.csv':      833 MB
* 'nodes_tags.csv': 131 MB
* 'ways.csv':        93 MB
* 'ways_nodes.csv': 316 MB
* 'ways_tags.csv':  140 MB

## Evaluating the data

While evaluating the data the following problems were encountered:

* (nodes table) Columns lat and lon use different precision. 
* (nodes_tags table) Column key has values that are probably inconsistent, like 'addr' and 'address' or 'abbr' and 'abrevation'.
* (nodes_tags table) The keys 'fixme', 'FIXME' and 'TODO' were found.
* (ways_tags table) For the key maxspeed, the column value holds unexpected values. 250 is unlikely (39 times), as are 210 and 190. Also, the speed limit of 30 seems to be encoded in various ways (30, DE:zone30, DE:zone:30, DE:30, PL:zone30, DE:zone(:30), zone30).
* (ways_tags table) For the key postcode, the column value holds unexpected values. Postcodes have five digits and, in Berlin, start with a 1. '66-470' (1,632 times), '74-500' (1,486 times) and '74-505' (938 times) do not match these criteria. There are codes starting with a '0' (mostly in the area around Berlin), and one code is '39264' (a place called Deetz, quite a bit away from Berlin).
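
The postcode criterion from the list above can be expressed as a small classifier. This is an illustrative sketch, not the check that was actually run: the function name `classify_postcode` and the three category labels are made up for this example.

```python
import re

# Criterion from the text: five digits, starting with 1 for Berlin.
BERLIN_RE = re.compile(r"1\d{4}")
FIVE_DIGIT_RE = re.compile(r"\d{5}")


def classify_postcode(value):
    """Sort a postcode string into one of three hypothetical categories."""
    value = value.strip()
    if BERLIN_RE.fullmatch(value):
        return "berlin"
    if FIVE_DIGIT_RE.fullmatch(value):
        return "five_digit_other"  # well-formed, but not a Berlin code
    return "malformed"             # e.g. codes containing a hyphen


for code in ["10115", "39264", "03046", "66-470"]:
    print(code, classify_postcode(code))
```

Applied to the values of the postcode key, such a function would separate codes that merely lie outside Berlin from codes that do not have the five-digit format at all.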

## Evaluating the problems

### nodes table : columns lat and lon

### nodes_tags table:  Inconsistent keys

### nodes_tags table: fixme and todo keys

### ways_tags table: maxspeed

### ways_tags table: postcode

## Evaluating the contributors

### Number of unique contributors

In [11]:
print(c.execute("SELECT Count(*) FROM (SELECT uid FROM nodes UNION SELECT uid FROM ways) tmp;").fetchall()[0][0])

7903


### Top 15 contributors by count



In [20]:
from pprint import pprint

statement = """
SELECT user, COUNT(*) FROM nodes
  GROUP BY user
UNION ALL
SELECT user, COUNT(*) FROM ways
  GROUP BY user
ORDER BY COUNT(*) DESC
LIMIT 15;
"""

for n in c.execute(statement).fetchall():
    print(n[0], "{:,}".format(n[1]))

atpl_pilot 2,378,801
jacobbraeutigam 574,371
r-michael 337,015
streckenkundler 335,778
anbr 329,417
atpl_pilot 312,716
WegefanHB 281,135
Bot45715 242,853
Konrad Aust 166,110
toaster 156,494
Elwood 151,421
g0ldfish 145,945
geozeisig 120,498
Polarbear 116,260
Randbewohner 102,982
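
Note that the query above ranks node counts and way counts as separate rows, which is why atpl_pilot appears twice in the list. If a single total per user is wanted, the two tables can be combined before grouping. The following sketch shows the idea on a tiny in-memory database with made-up users; the constant `COMBINED_TOP_USERS` is a name chosen for this example, and the sketch assumes both tables have a `user` column as in the query above.

```python
import sqlite3

# Stack the user columns of both tables first, then count once per user.
COMBINED_TOP_USERS = """
SELECT user, COUNT(*) AS edits
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways)
GROUP BY user
ORDER BY edits DESC, user
LIMIT ?;
"""

# Made-up demo data; the real database file would be used instead.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE nodes (id INTEGER, user TEXT);
    CREATE TABLE ways (id INTEGER, user TEXT);
    INSERT INTO nodes VALUES (1, 'alice'), (2, 'alice'), (3, 'bob');
    INSERT INTO ways VALUES (10, 'alice'), (11, 'carol');
""")

rows = demo.execute(COMBINED_TOP_USERS, (15,)).fetchall()
for user, edits in rows:
    print(user, "{:,}".format(edits))
```

The secondary sort on `user` only makes the order of ties deterministic.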


### Top 15 longest active contributors

## Additional Evaluations

### Amenities

### Cluster of Italian places

## Ideas for Improvement

The data quality for Berlin is generally high. Common standards are partly missing for values such as maxspeed on ways. It should be possible to address this programmatically.
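
As a rough illustration of what such a programmatic cleanup might look like, the sketch below maps the maxspeed variants found during the evaluation to a plain number. The function name `normalize_maxspeed` and the regular expression are assumptions made for this example. Whether zone values should be collapsed into plain numbers at all is a judgment call, since a zone tag carries more information than the bare limit.

```python
import re

# Optional two-letter country prefix, optional "zone", optional separator
# characters, then the numeric limit.
MAXSPEED_RE = re.compile(r"^(?:[A-Z]{2}:)?(?:zone)?[:(]*(\d+)\)?$")


def normalize_maxspeed(value):
    """Return the numeric limit as an int, or None if it is not recognised."""
    match = MAXSPEED_RE.match(value.strip())
    return int(match.group(1)) if match else None


variants = ["30", "DE:zone30", "DE:zone:30", "DE:30",
            "PL:zone30", "DE:zone(:30)", "zone30"]
print({v: normalize_maxspeed(v) for v in variants})
print(normalize_maxspeed("walk"))  # not numeric, so None
```

Values that return None would need a manual look, and implausibly high numbers such as 250 still pass this step and would need a separate range check.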

## Conclusion

Berlin is big, and cleaning up all the "fixme" entries and other open ends is a lifetime job. Considering its size, the data is generally quite good.