# OpenStreetMap Data Project 

## Arlington, Virginia Street Data 

## 1. Libraries used to parse data 

### After downloading a large XML file of data from openstreetmap.org, I used several Python libraries to parse, clean, and import the data into a local MongoDB database.

In [1]:
from pymongo import MongoClient
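try:
    import xml.etree.cElementTree as ET  # C-accelerated parser on Python 2
except ImportError:
    import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9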
from collections import defaultdict
from collections import deque
import re
import pprint


#### The following code opens my street data for the city of Arlington, VA, downloaded from OpenStreetMap, compiles a regular expression that matches the last word of a street name, and creates two different dictionaries that I will use to store the data later.

In [None]:
arlington_xml_data = open("arlington_xml_data", "r")
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
street_types = defaultdict(set)
experiment = {}
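
To make the pattern concrete, here is a minimal sketch of what the regular expression picks out of a street name. The helper name `last_token` and the sample street names are made up for illustration; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as above: the last whitespace-delimited token of the name.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def last_token(street_name):
    """Return the trailing token that the audit treats as the street type, or None."""
    match = street_type_re.search(street_name)
    return match.group() if match else None


# Hypothetical street names, used only for illustration.
samples = ["Wilson Boulevard", "Main St.", "14th Street Northwest", "Jefferson Davis Hwy"]
for name in samples:
    print(name, "->", last_token(name))
```

Because the pattern is anchored at the end of the string, a trailing direction such as Northwest is reported as the "street type", which is why directions show up in the audit results.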

#### List of unwanted street types

In [None]:
bad_list = ['Southwest', 'S.W.', 'Southeast', 'St.', 'Hwy', 'North1', 'Northeast', 'Northwest', 'Ave.', 'Southeast\\']


#### List of expected street types

In [2]:
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", "Trail", "Parkway", "Commons"]

#### Below is my connection to my MongoDB database for inserting cleaned data

In [3]:
client = MongoClient(port=27017)
db = client.project2try2


### The following code is for auditing the dataset and then adding the street data to the street types dictionary

In [None]:
def audit_street_types(street_types, street_name):
    x = street_type_re.search(street_name)
    if x:
        street_type = x.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)


def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")


def audit():
    # "end" events fire after an element's children have been parsed, so the
    # nested <tag> elements are guaranteed to be available here.
    for event, elem in ET.iterparse(arlington_xml_data, events=("end",)):
        if elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_types(street_types, tag.attrib['v'])
    #pprint.pprint(dict(street_types))
    return dict(street_types)
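
As a quick way to see what the audit produces, the sketch below runs the same logic on a tiny in-memory OSM fragment instead of the full download. The function name `audit_source` and the XML fragment are assumptions made for this example, and the sketch listens for "end" events so that child tags are fully parsed before they are read.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = {"Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Square", "Lane", "Road", "Trail", "Parkway", "Commons"}


def audit_source(source):
    """Group addr:street values found on <way> elements by their unexpected last token."""
    found = defaultdict(set)
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "way":
            continue
        for tag in elem.iter("tag"):
            if tag.attrib.get("k") == "addr:street":
                match = street_type_re.search(tag.attrib["v"])
                if match and match.group() not in expected:
                    found[match.group()].add(tag.attrib["v"])
    return dict(found)


# Hypothetical OSM fragment, for illustration only.
sample = b"""<osm>
  <way id="1"><tag k="addr:street" v="Wilson Boulevard"/></way>
  <way id="2"><tag k="addr:street" v="Main St."/></way>
  <way id="3"><tag k="highway" v="residential"/></way>
</osm>"""

print(audit_source(io.BytesIO(sample)))
```

Only "Main St." is reported here, because "Boulevard" is in the expected list and the third way has no addr:street tag.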


## 2. Prepare data for database

#### To keep unwanted street types out of the database, I first create a key in the experiment dictionary for every street type that is not in the list of unwanted types. This is the dictionary I use to prepare the data for database insertion.


In [None]:
def create_default_dict_structure(street_types):
    for street in street_types:
        if street not in bad_list:
            experiment.setdefault(street, [])


### The "change_data" function checks for certain street types that I wanted to change before adding the data to the experiment dictionary. I abbreviated directions (for example, Southwest to SW), expanded Hwy to the full Highway, and corrected a few erroneous names (for example, North1 to North).

In [None]:
# Corrections applied to unwanted street types (same keys as bad_list).
replacements = {
    'Southwest': 'SW',
    'S.W.': 'SW',
    'Southeast': 'SE',
    'Southeast\\': 'SE',
    'Northeast': 'NE',
    'Northwest': 'NW',
    'North1': 'North',
    'Ave.': 'Ave',
    'St.': 'St',
    'Hwy': 'Highway',
}


def change_data(street_types_dict):
    for x in street_types_dict:
        if x in replacements:
            new_type = replacements[x]
            experiment.setdefault(new_type, [])
            for street in street_types_dict[x]:
                # swap every occurrence of the unwanted term in the street name
                name = [new_type if n == x else n for n in street.split(' ')]
                experiment[new_type].append(' '.join(name))
        elif x not in bad_list:
            # street types that need no correction are copied over unchanged
            experiment.setdefault(x, [])
            for street in street_types_dict[x]:
                experiment[x].append(street)
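
To show the effect of the cleaning step on individual names, here is a small sketch that applies the corrections to one street name at a time. The names `REPLACEMENTS`, `normalize_street_name`, and `group_by_type` are hypothetical helpers written for this example, and the street names are made up. Unlike the function above, this sketch only rewrites the final word of each name.

```python
# Assumed mapping, based on the corrections described above.
REPLACEMENTS = {
    "Southwest": "SW", "S.W.": "SW",
    "Southeast": "SE", "Southeast\\": "SE",
    "Northeast": "NE", "Northwest": "NW",
    "North1": "North", "Ave.": "Ave", "St.": "St", "Hwy": "Highway",
}


def normalize_street_name(street_name, replacements=REPLACEMENTS):
    """Replace the final word of a street name if it has a known correction."""
    words = street_name.split(" ")
    words[-1] = replacements.get(words[-1], words[-1])
    return " ".join(words)


def group_by_type(street_names):
    """Build {street type: [cleaned names]} from an iterable of raw names."""
    grouped = {}
    for raw in street_names:
        cleaned = normalize_street_name(raw)
        grouped.setdefault(cleaned.split(" ")[-1], []).append(cleaned)
    return grouped


# Hypothetical street names, for illustration only.
print(group_by_type(["14th Street Southwest", "Lee Hwy", "Main St."]))
```

A dictionary lookup replaces the long chain of comparisons, so adding another correction only takes one more entry in the mapping.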


## 3. Insert data into the local MongoDB database (the database server must be running on the port given in the connection above)

In [None]:
audit()
create_default_dict_structure(street_types)
change_data(street_types)

### This function builds one JSON object per street type, holding the list of street names for that type, and then inserts it into the database.

In [None]:
def db_insert(experiment):
    street_type = 'street_type'
    street_name = 'street_name'
    for x in experiment:
        street_insert = {
            street_type : x,
            street_name : experiment[x]
        }
        result = db.streets.insert_one(street_insert)
        print(result.inserted_id)



db_insert(experiment)
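
The document shape can be checked without a running server by building the documents first and inserting them afterwards. The sketch below is one possible way to do that; `build_documents` and `insert_documents` are hypothetical helpers, the sample data is made up, and the insert step is left as a comment because it needs a live MongoDB connection. It uses PyMongo's `insert_many`, which sends all documents in one call instead of one call per street type.

```python
def build_documents(experiment):
    """Return one document per street type, in the same shape db_insert uses."""
    return [
        {"street_type": street_type, "street_name": list(names)}
        for street_type, names in experiment.items()
    ]


def insert_documents(collection, documents):
    """Insert all documents in one call; hypothetical helper, needs a live server."""
    if not documents:  # insert_many rejects an empty list
        return []
    return collection.insert_many(documents).inserted_ids


# Hypothetical cleaned data, for illustration only.
docs = build_documents({"SW": ["14th Street SW"], "Highway": ["Lee Highway"]})
print(docs)
# With the connection from above: insert_documents(db.streets, docs)
```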

## 4. Data Queries 

### Number of documents

In [None]:
db.streets.count_documents({})  # Cursor.count() was removed in PyMongo 4

In [None]:
1771

### Querying the database to see the most common street types 

In [None]:
> db.streets.find({"street_type" : "SE"}).count()
406
> db.streets.find({"street_type" : "NE"}).count()
397
> db.streets.find({"street_type" : "NW"}).count()
623
> db.streets.find({"street_type" : "SW"}).count()
127
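
Another way to compare street types is to rank them by how many street names each one holds. The sketch below does this in memory on a dictionary shaped like the experiment dictionary, and also spells out an aggregation pipeline that should give a similar ranking on the server, assuming every document stores its names as an array in the street_name field. The function name `rank_street_types`, the field name `n_streets`, and the sample data are made up for this example.

```python
def rank_street_types(experiment):
    """Return (street type, number of names) pairs, largest group first."""
    counts = ((street_type, len(names)) for street_type, names in experiment.items())
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Assumed server-side equivalent; requires street_name to be an array in every document.
pipeline = [
    {"$project": {"_id": 0, "street_type": 1, "n_streets": {"$size": "$street_name"}}},
    {"$sort": {"n_streets": -1}},
]

# Hypothetical data, for illustration only.
sample = {"SW": ["a", "b"], "NE": ["c"], "Highway": ["d", "e", "f"]}
print(rank_street_types(sample))
# With a running server: list(db.streets.aggregate(pipeline))
```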

## 5. Thoughts for next time

### I could do some more interesting queries if I used a more complex data set to insert into my database. I also played around a bit with the structure of the JSON objects that I inserted into my MongoDB database and the types of characteristics that I included.

### During data cleaning I certainly could have written more efficient code. I would often wait minutes at a time after making small changes to my code just to see if it would run properly on the dataset. Part of that is the nature of analyzing a large dataset, but some of my functions have multiple nested for loops, which can be extremely slow.
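
One possible way to shorten that feedback loop is to stream street names through a generator and stop after the first few while testing a change. The sketch below is an illustration under that assumption, not a measured improvement; `iter_street_names` is a hypothetical helper and the XML fragment is made up. It also clears each top-level element after use so that parsed children do not pile up in memory.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice


def iter_street_names(source):
    """Yield addr:street values from <way> elements, clearing each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            if elem.tag == "way":
                for tag in elem.iter("tag"):
                    if tag.attrib.get("k") == "addr:street":
                        yield tag.attrib["v"]
            elem.clear()  # drop the children and attributes already processed


# Hypothetical OSM fragment, for illustration only.
sample = b"""<osm>
  <way id="1"><tag k="addr:street" v="Wilson Boulevard"/></way>
  <way id="2"><tag k="addr:street" v="Main St."/></way>
</osm>"""

# Parsing stops as soon as the requested number of names has been produced.
print(list(islice(iter_street_names(io.BytesIO(sample)), 1)))
```

Because the generator is lazy, `islice` stops the parser early, so a quick check on the first few hundred names does not have to read the whole file.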

### I enjoyed learning how to use MongoDB, and it was cool to see that I could use the MongoClient class from the pymongo library to connect to my local DB and then push data directly from my Python file.

## Sources 

#### https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
#### https://docs.mongodb.com/manual/mongo/
#### https://docs.python.org/2/library/collections.html