# Project documentation

## Encountered problems and data cleaning

The code for parsing nodes and ways is pretty close to what has been covered in the lectures. A few extensions have been necessary to extract some more meta information from the data:

 * Extract shops and shop specific information
 * Compute the length of streets
 * Restrict ways to actual streets
 * Extract street name pattern
 * Special characters

Let's discuss these problems in more detail:

### Extract shops and shop specific information
The statistics below count how many shops share the same name. After parsing the data it became clear that nodes which are shops carry a specific set of tags. For all shops, these tags are read and restructured so that the information becomes available under a new "shop" entry.

The following code parses the tags of a node and restructures the data into a dictionary that maps a set of selected keys _k_ to the corresponding values _v_. Here we extract the name, wheelchair accessibility and opening hours for shops, but this can easily be extended to, say, bus stops.

In [None]:
# Maps each type of special information to a list of fields which we care for.
SPECIAL_TYPES = {"shop": ["name", "wheelchair", "opening_hours"]}

    
def parse_specials(element):
    """Parses and restructures special information for nodes."""
    content = {}
    type_ = None
    for subtag in element.iter("tag"):
        k = subtag.attrib["k"]
        v = subtag.attrib["v"]
        if k in SPECIAL_TYPES:
            type_ = k
        else:
            content[k] = v
    content = {k: v for k, v in content.items() 
               if type_ in SPECIAL_TYPES and k in SPECIAL_TYPES[type_]}       
    
    return type_, content

For example, this transforms the following input into the Python dictionary below:

In [None]:
<node id="25433634" lat="51.0195473" lon="13.7223831" version="8" timestamp="2017-08-11T16:08:51Z" changeset="51038779" uid="86504" user="Thomas8122">
    <tag k="level" v="0"/>
    <tag k="name" v="Kaufland"/>
    <tag k="opening_hours" v="Mo-Sa 07:00-21:00"/>
    <tag k="shop" v="supermarket"/>
    <tag k="wheelchair" v="yes"/>
</node>

In [None]:
{"shop": {"name": "Kaufland", "opening_hours": "Mo-Sa 07:00-21:00", "wheelchair": "yes"}}
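
Note that `parse_specials` returns a `(type, content)` pair rather than the nested dictionary shown above. The following is a minimal sketch of how the pair might be folded into the node data, assuming the caller stores the content under the type key. The helper `attach_specials` is hypothetical and not part of the project code; `parse_specials` is repeated in compact form only so that the sketch runs on its own.

```python
import xml.etree.ElementTree as ET

SPECIAL_TYPES = {"shop": ["name", "wheelchair", "opening_hours"]}


def parse_specials(element):
    """Same logic as above, repeated so that the sketch is self-contained."""
    content = {}
    type_ = None
    for subtag in element.iter("tag"):
        k, v = subtag.attrib["k"], subtag.attrib["v"]
        if k in SPECIAL_TYPES:
            type_ = k
        else:
            content[k] = v
    wanted = SPECIAL_TYPES.get(type_, [])
    return type_, {k: v for k, v in content.items() if k in wanted}


def attach_specials(node, element):
    """Hypothetical helper: store the selected fields under the type key."""
    type_, content = parse_specials(element)
    if type_ is not None:
        node[type_] = content
    return node


xml = """<node id="25433634" lat="51.0195473" lon="13.7223831">
    <tag k="level" v="0"/>
    <tag k="name" v="Kaufland"/>
    <tag k="opening_hours" v="Mo-Sa 07:00-21:00"/>
    <tag k="shop" v="supermarket"/>
    <tag k="wheelchair" v="yes"/>
</node>"""

node = attach_specials({}, ET.fromstring(xml))
print(node)
```

Nodes without one of the special types are left unchanged by this helper.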

### Compute the length of streets
During the data wrangling, the length of each street has to be computed so that statistics on street lengths can be assembled later on. This is done in three steps:

1. Remembering the coordinates of the nodes while parsing the `<node>` XML in a map `NODE_LOCATIONS`. 
2. When parsing the `<way>` XML we first collect the `ref` attribute of the `<nd>` tags and resolve them using the `NODE_LOCATIONS` map.
3. Compute the pairwise distance between adjacent nodes and sum up over the course of the street. 


In [None]:
# Stores node coordinates for distance calculation during node parsing.
NODE_LOCATIONS = {}  


def parse_way(element):
    way = {}
    element_counter["way"] += 1
    way_nodes = parse_way_nodes(element)
    if way_nodes:
        way["nodes"] = way_nodes    
        way["length"] = compute_way_length(way_nodes)
    return way


def parse_way_nodes(element):
    nodes = []
    for node in element.iter("nd"):
        nodes.append(node.attrib["ref"])
    return nodes     


def compute_way_length(way):
    """Assumes that the way nodes are in the correct order."""
    distance = 0
    for s, t in pairwise(way):
        try:
            lat1, lon1 = NODE_LOCATIONS[s]
            lat2, lon2 = NODE_LOCATIONS[t]
            distance += geo_distance(lat1, lon1, lat2, lon2)
        except KeyError:
            pass  # skip segments with unknown nodes, which happens a lot in sampled data
    return distance

Although the Euclidean distance would yield a sufficient approximation for this part of the earth, it is generally not applicable for computing the distance between coordinates on the earth's surface. Especially at extreme latitudes (close to ±90°) the error becomes very high. Instead, we assume the earth is a perfect sphere and use the great-circle distance.

The functions for `geo_distance` and `pairwise` are skipped for brevity. They can be looked up in the file `audit.py`.
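
For readers without access to `audit.py`, the two helpers could look roughly like the sketch below. This is not necessarily the implementation in `audit.py`: it assumes the haversine form of the great-circle distance, a mean earth radius of 6371 km, and a result in kilometers.

```python
import math
from itertools import tee

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius


def pairwise(iterable):
    """Yield adjacent pairs: (s0, s1), (s1, s2), ..."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one element
    return zip(first, second)


def geo_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (haversine formula).

    All arguments are given in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))
```

The haversine form is chosen here because it stays numerically stable for the very short distances between adjacent street nodes.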

#### A note on data size
The dataset becomes too large during the computation because the coordinates of every node have to be kept in memory in order to compute the street lengths later on. To fit it into the 3 GB of RAM of my machine, I had to drop other data such as the creation metadata from OSM. I am aware that this could have been done in MongoDB as well; however, the Python API falls back to JavaScript for its MapReduce code and I am not fluent in that, so it seemed more reasonable to do this within the parser code.

The following code skips this information.

In [None]:
CREATED = ["version", "changeset", "timestamp", "user", "uid"]


def parse_element(element):
    entity = None
    if element.tag == "way":
        entity = parse_way(element)
    elif element.tag == "node":
        entity = parse_node(element)
    entity["type"] = element.tag
    lat = lon = None
    for k, v in element.attrib.items():
        if k in CREATED:
            # ignore the creation data
            pass

### Restrict ways to actual streets
The investigation should only focus on actual streets. They are extracted by selecting `<way>` elements with a `highway` tag and a valid `name` tag. Some street names are unusual, so I created a list of street name patterns (like "Straße" or "Ring", or more descriptive types of street names like "An der Fabrik", which means "Next to the factory") and iteratively extended the list until the patterns matched all the valid street names. Ways which are not streets (i.e. do not have a "highway" tag) and streets without a valid name are discarded during parsing.

The following code checks if a common component is contained in the street name or if it starts with one of the typical patterns.

In [None]:

def parse_element(element):
    # ...
    
    # For ways, only consider streets (highway tag) with a valid name:
    if element.tag == "way":
        tags = entity["tags"]
        if ("highway" in tags and 
            "name" in tags and 
            is_valid_street_name(tags["name"])):
            return entity
        
        
common_street_names = re.compile(
        r"(weg|straße|platz|allee|gasse|ring|berg|grund|" +
        r"steig|hof|ufer|höhe|leite|brücke|passage|steg|graben|tunnel)", 
        re.IGNORECASE)
common_start_phrases = re.compile(
        r"(Am |An de|Alt|Hinter de|Im |Zum |Zur )", 
        re.IGNORECASE)


def is_valid_street_name(name):
    return common_street_names.search(name) or common_start_phrases.match(name)



### Extract street name pattern
This came to my mind when restricting to actual streets. If we already check for common street name components, why not extract them and create a statistic on which are the most common? 

The following code reuses the patterns created above and stores the matching regular expression groups within the way data.

In [None]:
def extract_street_name_component(name):
    match = common_street_names.search(name)
    if match:
        return match.group(1).lower()
    match = common_start_phrases.match(name)
    if match:
        return match.group(1).lower()
    return ""

This is an example of how this transforms XML into a dictionary:

In [None]:
<way id="25359184" version="5" timestamp="2014-08-24T15:57:51Z" changeset="24982129" uid="48393" user="Wurgwitz">
    <nd ref="2729352943"/>
    <nd ref="3039521232"/>
    <nd ref="3039521213"/>
    <tag k="name" v="Hinter dem Rathaus"/>
    <tag k="highway" v="residential"/>
</way>
<way id="25368605" version="3" timestamp="2014-07-05T11:47:01Z" changeset="23965179" uid="2675" user="Eckhart Wörner">
    <nd ref="35314674"/>
    <nd ref="578090458"/>
    <nd ref="276516857"/>
    <tag k="name" v="Hirschbacher Weg"/>
    <tag k="highway" v="residential"/>
    <tag k="maxspeed" v="30"/>
    <tag k="postal_code" v="01277"/>
</way>

The result would be:

In [None]:
{"name": "Hinter dem Rathaus", "street_name_component_match": "hinter de"},
{"name": "Hirschbacher Weg", "street_name_component_match": "weg"},

### Special characters (UTF-8 handling)
The dataset is from Germany and thus many names include German special characters, encoded as UTF-8. I expected a lot more trouble with handling these correctly. However, it turned out that Python 3 gets the encoding right and there have been no problems with special characters at all, even when transferring the data to MongoDB and back. The key is to use Python 3 instead of Python 2.

## Overview of the data

### General statistics

The following numbers have been determined from the database. Note that during parsing, some irrelevant data has been dropped to work around memory issues. For instance, the following investigation ignores ways which are not roads (i.e. not tagged as "highway") as well as the information on which OSM user created an element and at what time.

#### File size

The uncompressed dataset has a size of 312 MByte.

#### Number of elements
`> db.osm_dresden.count()`

1341909


#### Number of nodes
`> db.osm_dresden.find({"type": "node"}).count()`

1323207

#### Number of streets
`> db.osm_dresden.find({"type": "way"}).count()`

18702


### Computed statistics

#### Most frequent shops in town

| Name | Count |
|-|-|
|None|340|
|Rossmann|19|
|Konsum|17|
|Netto Marken-Discount|13|
|Netto|13|
|Fleischerei Richter|12|
|Rewe|11|
|Richter|11|
|Sternenbäck|11|
|Eisold|9|

#### Longest streets:

| Street name | length in kilometers |
|-|-|
|Ullersdorf-Langebrücker Straße|2.49|
|Radeberger Landstraße|2.49|
|Fütterungsweg|2.44|
|Mittelweg|2.35|
|Tunnel Coschütz|2.35|
|Tunnel Coschütz|2.31|
|Alter Bahndamm|2.29|
|Ullersdorf-Langebrücker Straße (5)|2.18|
|Prießnitzgrundweg|2.09|
|Kesselsdorfer Straße|1.98|

#### Streets with the most nodes:

| Street name | number of nodes |
|-|-|
|Schloßplatz|120|
|Am Pulverturm|75|
|Leitenweg|74|
|Wiener Straße|74|
|Seebachstraße|70|
|Wieckestraße|68|
|Jorge-Gomondai-Platz|67|
|Unkersdorfer Landstraße|67|
|Altgorbitzer Ring|66|
|Elberadweg|63|

#### Frequency of common street name components:

| Street name component | absolute frequency |
|-|-|
|straße|10148|
|weg|2719|
|berg|1318|
|platz|690|
|am |631|
|hof|459|
|alt|392|
|grund|316|
|brücke|311|
|an de|297|
|ring|297|
|allee|245|
|steig|144|
|gasse|141|
|ufer|124|
|höhe|117|
|zur |93|
|leite|70|
|graben|65|
|zum |64|
|steg|34|
|tunnel|12|
|im |7|
|passage|5|
|hinter de|3|

## Other ideas about the dataset

During the project, the following extensions of the analysis came to my mind:

* Determine and combine split streets
* Missing shop names
* Parse and process shop opening hours
* More precise distance function

Let's discuss them in some detail.

### Determine and combine split streets

Some long streets seem to be split in the data set: There are streets which have the same name or the same name except for some number. For some reason, these are separated in the OSM data.

One could determine these streets with similar names, combine them and then compute the length of the combined streets. However, distinct streets may also share the same name, and these should not be merged. Finding out which street parts belong together and which are distinct is not trivial.
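
A naive first step could be sketched as follows. It groups ways purely by name and sums their lengths, assuming that split parts either share the exact name or carry a trailing number in parentheses, as in "Ullersdorf-Langebrücker Straße (5)" from the table above. The names `base_name` and `combine_by_name` are hypothetical, and the sketch deliberately ignores the harder problem of distinct streets that happen to share a name.

```python
import re
from collections import defaultdict

# Assumed convention: a split part may end in a number in parentheses.
PART_SUFFIX = re.compile(r"\s*\(\d+\)$")


def base_name(name):
    """Strip a trailing part number such as ' (5)' from a street name."""
    return PART_SUFFIX.sub("", name)


def combine_by_name(ways):
    """Sum the lengths of all ways that share the same base name."""
    totals = defaultdict(float)
    for way in ways:
        totals[base_name(way["name"])] += way["length"]
    return dict(totals)


# Lengths in kilometers, taken from the "Longest streets" table.
ways = [
    {"name": "Ullersdorf-Langebrücker Straße", "length": 2.49},
    {"name": "Ullersdorf-Langebrücker Straße (5)", "length": 2.18},
    {"name": "Mittelweg", "length": 2.35},
]
combined = combine_by_name(ways)
print(combined)
```

A more careful version would probably also require that two parts share an end node before merging them, which needs the node references of each way.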


### Missing shop names

Many shops do not have a name. This cannot be reconstructed from the data. I decided to keep them in the statistic anyway. It would be interesting to find out why there is no name, for example by checking some of the shops on Google street view or reconstructing them by looking up the addresses in an external API.


### Parse and process shop opening hours

In the present data auditing, shop opening hours have just been read as-is. This means that most are strings composed of weekday abbreviations and hours of the day (such as `Mo-Sa 07:00-21:00`). These should really be converted to datetime objects. The results could be stored in the shop data, and a MongoDB aggregation query could be constructed that yields the number of open grocery stores for every hour of the day. The query would select all shops from the data, group them by opening hours and then count the number of shops per hour of the day.

The query could look like this (untested):

In [None]:
result = db.osm_dresden.aggregate([
    {"$match": {"type": {"$eq": "node"}, "shop": {"$exists": True}}},  # select only shops
    {"$unwind": "$opening_hours"},  # create one entry per each opening hour
    {"$group": {"_id": "$opening_hours", "count": {"$sum": 1}}},  # group by opening hour and count number of open shops
    {"$sort": {"_id": 1}},  # sort by field opening hour, ascending
])
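
The conversion step that would have to come before such a query could be sketched as below. It only handles the simple form seen in the example data (`Mo-Sa 07:00-21:00`, optionally several such rules separated by `;`) and assumes the two-letter English weekday abbreviations of the OSM `opening_hours` key. The real syntax is much richer, so anything else is reported as unparsed. The function name `parse_opening_hours` is hypothetical.

```python
import re
from datetime import time

WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
# One rule: a weekday or weekday range, followed by a single time span.
RULE = re.compile(r"^(\w{2})(?:-(\w{2}))?\s+(\d{2}):(\d{2})-(\d{2}):(\d{2})$")


def parse_opening_hours(value):
    """Map weekday index (0 = Monday) to a list of (open, close) times.

    Returns None if any rule is outside the simple supported subset.
    """
    hours = {}
    for rule in value.split(";"):
        match = RULE.match(rule.strip())
        if not match:
            return None
        first, last, h1, m1, h2, m2 = match.groups()
        last = last or first  # a single day is a range of length one
        if first not in WEEKDAYS or last not in WEEKDAYS:
            return None
        start, end = WEEKDAYS.index(first), WEEKDAYS.index(last)
        if start > end:
            return None  # ranges wrapping over the weekend are not handled
        try:
            span = (time(int(h1), int(m1)), time(int(h2), int(m2)))
        except ValueError:
            return None  # e.g. "24:00", which datetime.time cannot represent
        for day in range(start, end + 1):
            hours.setdefault(day, []).append(span)
    return hours


print(parse_opening_hours("Mo-Sa 07:00-21:00"))
```

Returning `None` for unsupported values keeps the unparsed shops visible, so they can be counted or handled separately instead of silently distorting the per-hour statistic.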

### More precise distance function

The great-circle distance used to compute the street length assumes the earth is a perfect sphere, which it is not. To get more precise values, the [WGS 84](https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84) reference ellipsoid could be used, which models the earth as an ellipsoid.
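
One possible way to do this is Vincenty's inverse formula, sketched below with the WGS 84 semi-major axis and flattening. This is an illustration only and not part of the project code: the function name `ellipsoid_distance` is hypothetical, the result is in meters rather than kilometers, and the iteration may fail to converge for nearly antipodal points, in which case the sketch returns `None`.

```python
import math

WGS84_A = 6378137.0            # semi-major axis in meters
WGS84_F = 1 / 298.257223563    # flattening
WGS84_B = (1 - WGS84_F) * WGS84_A  # semi-minor axis in meters


def ellipsoid_distance(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Distance in meters on the WGS 84 ellipsoid (Vincenty, inverse)."""
    u1 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat2)))
    big_l = math.radians(lon2 - lon1)
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)

    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cos_u2 * sin_lam,
                               cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        if sin_sigma == 0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos_sq_alpha = 1 - sin_alpha ** 2
        # cos_sq_alpha is zero for lines along the equator
        cos_2sm = (cos_sigma - 2 * sin_u1 * sin_u2 / cos_sq_alpha
                   if cos_sq_alpha != 0 else 0.0)
        c = WGS84_F / 16 * cos_sq_alpha * (4 + WGS84_F * (4 - 3 * cos_sq_alpha))
        lam_new = big_l + (1 - c) * WGS84_F * sin_alpha * (
            sigma + c * sin_sigma * (
                cos_2sm + c * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        return None  # no convergence, e.g. nearly antipodal points

    u_sq = cos_sq_alpha * (WGS84_A ** 2 - WGS84_B ** 2) / WGS84_B ** 2
    big_a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    big_b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = big_b * sin_sigma * (cos_2sm + big_b / 4 * (
        cos_sigma * (-1 + 2 * cos_2sm ** 2)
        - big_b / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2)
        * (-3 + 4 * cos_2sm ** 2)))
    return WGS84_B * big_a * (sigma - delta_sigma)
```

For street segments of a few hundred meters the difference to the spherical model is small, so this would mainly matter if the lengths were compared against surveyed values.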

## Conclusion

After doing some analysis and restructuring of the data, it seems that the selected data set is of fairly good quality. German special characters are encoded in a sound way. The information on different entities is surprisingly detailed. There are some streets which are separated in the data where no reason or pattern could be found. 

The present project analysed the longest streets in the data set, listed the most frequent shops in the area and did a lexicographical analysis of the frequency of common street name wordings. 

Several extensions have been suggested which could be used to draw conclusions from the data.


### A note about sampling

The project submission includes only a sample of the data, which has been created using the code provided in the project guideline. The calculation of the street length only works correctly if every node of a street is present. The sampling process leads to OSM ways being fragmented, so the resulting lengths are probably not correct.
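
To make this effect measurable, the length computation could also report how many segments had to be skipped. The following is a minimal sketch under that assumption. The function `way_length_with_coverage` is hypothetical, and the distance function is passed in as a parameter; the example uses a simple stand-in instead of the real `geo_distance`.

```python
def way_length_with_coverage(way_nodes, node_locations, distance):
    """Return (length, missing), where missing counts skipped segments.

    A segment is skipped when at least one of its two end nodes has no
    known coordinates, which is what happens in sampled data.
    """
    length, missing = 0.0, 0
    for s, t in zip(way_nodes, way_nodes[1:]):
        if s in node_locations and t in node_locations:
            length += distance(*node_locations[s], *node_locations[t])
        else:
            missing += 1
    return length, missing


def stand_in_distance(lat1, lon1, lat2, lon2):
    """Stand-in for geo_distance, used only to keep the example short."""
    return abs(lat2 - lat1) + abs(lon2 - lon1)


locations = {"a": (0.0, 0.0), "b": (0.0, 1.0)}  # node "c" was not sampled
length, missing = way_length_with_coverage(["a", "b", "c"], locations,
                                           stand_in_distance)
print(length, missing)
```

Ways with a non-zero `missing` count could then be excluded from the length statistics or flagged as incomplete.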