# Wrangling OpenStreetMap Data for SW West Virginia

## Motivation
In order to develop my data wrangling skills, I am auditing and organizing the OpenStreetMap data for SW West Virginia. I chose West Virginia for multiple reasons:
1. My wife's family is from there (and most still live in the state) 
2. I assumed that, as with most self-reported or survey-based data collection efforts, a state with a high proportion of poor and rural areas is likely to have poor data coverage and quality. My work on the map data may help improve this.
3. The state of WV is an area racked by a number of unfortunate statistics, not the least of which is [a very high drug overdose rate](https://www.cdc.gov/drugoverdose/data/statedeaths.html). This area of West Virginia (in particular, Huntington, WV) [suffers particularly badly](https://www.npr.org/2017/06/29/534868012/what-happens-when-the-heroin-epidemic-hits-small-town-america), with the city of Huntington sometimes being called the drug overdose death capital of America.

In wrangling the data for this region of the US, I hope to provide some value to an otherwise ignored set of communities. It is my hope that I will be able to take my progress here and push the audited data to OpenStreetMap as a final step.

## Data Provenance

First of all, let's establish where these data came from and some basic information about them. The data were pulled as a custom extract from https://mapzen.com/ and the final file size for the region (unzipped) is 538 MB. An image of the region extracted is shown here: ![](imgs/SW_WV_ExtractConfirm.png).

## Project Steps

I'm going to tackle this project using the following steps (mostly included here for my own mental organization, but hopefully also helpful for anyone following this work too!):

1. Sample the data to generate a relatively small data set (e.g. 1-10 MB) that I can sift through to identify recurring concerns as part of my data audit. The code I will use for sampling the data is called *SampleMapData_Small.py*. 

2. Audit the data sample in the tradition of the Udacity Data Wrangling course, using NumPy and Pandas in Python. This will entail:
    1. **Auditing data validity:** do the data conform to a pre-defined data schema? In the case of this project, some relevant questions are *Do the tags include zip codes that are within the region of interest?* or *Does there seem to be a valid tag hierarchy being obeyed throughout?*
    2. **Auditing data accuracy:** do the data conform to some gold standard? An example of this will likely be zip codes only being 5 or 9 digits long.
    3. **Auditing data completeness:** one approach I may take to this is ensuring that cities or counties I know should be included in the region I'm parsing are actually recorded in the data. Given the region chosen here, one obvious audit that will need to occur is easily identifying the non-WV data in the set. While these data are not inherently invalid, different state policies regarding the naming of roads, counties, etc. may affect the apparent validity of the data at hand and should be identified early.
    4. **Auditing data consistency:** I will investigate this by looking across tags of the same type to ensure a standard format is being followed throughout. I'll also investigate other issues of internal consistency as they arise.
    5. **Auditing data uniformity:** for these data, this will likely take the form of checking whether data values are within a reasonable range (e.g. no latitudes or longitudes well outside the region of interest).

3. Check whether new audit problems appear when a larger sample is drawn. Keep iterating on this cycle of "increase sample size, sample, audit" until no new audit failures are found.

4. Correct the data problems identified in the preliminary audits (one script per data field) and export the corrected data (with an additional script) in a CSV format that follows the data schema provided by the Udacity team. This schema creates tables tracking `nodes`, `node_tags`, `ways`, `ways_tags`, and `ways_nodes` and allows for the creation of the SQL database needed for recording these data.

5. Create a SQL database with the provided schema.

6. Import the CSV data files (one per SQL table) into the SQL database created.

7. Query this database in a variety of ways to:
    1. Check to make sure no other auditing errors exist. If any are found, deal with them by modifying the auditing code, re-creating the CSV files, and removing then re-adding records in the SQL database (if necessary).
    2. Determine some descriptive aspects of the data. For example, data-oriented queries of interest could include the number of contributing users and some representative contributions from the highest-frequency contributors. This could identify any recurring issues with the data these individuals are supplying and speed up any further auditing that may be needed. Region-specific auditing would include things like determining whether the counts of cities and counties are accurate. Statistics specifically mentioned in the project rubric (with some additions from me) are:
        1. Size of all files used in this project
        2. Number of unique users
        3. Number of nodes and ways
        4. Numbers of specific types of nodes (e.g. cafes)
        5. Numbers of nodes and ways lacking any child tags to describe them beyond their latitude/longitude
    
    3. Investigate any other interesting counts present in the data, such as:
        1. The number of unincorporated townships vs. the number of cities/villages
        2. Comparing the number of residential properties to the number of commercial properties. 
        
        WV is known by many to have issues with the presence (or lack thereof) of basic infrastructure needs, given its high density of rural locations. As such, I'll also explore how the count of residences compares to the count of:
        1. grocery stores, 
        2. restaurants, and 
        3. hospitals + doctor offices 
        
        This will potentially illuminate the issue of "healthcare and food deserts." The healthcare component in particular is relevant to the aforementioned overdose concerns for this area.
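
The format and range checks described in the audit step above can be prototyped as small predicate functions. What follows is a minimal sketch, not the audit code used in this project: the function names are hypothetical, and the bounding box values are placeholders that would need to be replaced with the real bounds of the extract.

```python
import re

# ZIP codes should be 5 digits, or 9 digits (ZIP+4) with an optional hyphen.
ZIP_RE = re.compile(r"\d{5}(-?\d{4})?")

# Hypothetical bounding box for the extract: (min_lat, max_lat, min_lon, max_lon).
# These numbers are placeholders, not the real bounds of the SW WV extract.
REGION_BOUNDS = (37.0, 39.5, -83.5, -80.0)


def is_valid_zip(value):
    """Return True if value looks like a 5-digit or 9-digit ZIP code."""
    return ZIP_RE.fullmatch(value.strip()) is not None


def in_region(lat, lon, bounds=REGION_BOUNDS):
    """Return True if the coordinate falls inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = bounds
    return min_lat <= float(lat) <= max_lat and min_lon <= float(lon) <= max_lon


def audit_values(values, check):
    """Return the values that fail a check, for manual review."""
    return [v for v in values if not check(v)]
```

For example, `audit_values(["25701", "WV 25701", "2570"], is_valid_zip)` returns `["WV 25701", "2570"]`, which are the two entries that would need a closer look.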

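The database steps in the plan (create, import, query) can be rehearsed on a tiny in-memory SQLite database before touching the real data. This is only a sketch under stated assumptions: the schema below is a simplified stand-in with just the columns the example queries need (it is not the schema provided by Udacity), the rows are made up, and the CSV import is skipped in favor of direct inserts.

```python
import sqlite3

# Assumed, simplified schema: only the columns needed for the example queries.
SCHEMA = """
CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL, user TEXT, uid INTEGER);
CREATE TABLE node_tags (id INTEGER REFERENCES nodes(id), key TEXT, value TEXT);
CREATE TABLE ways (id INTEGER PRIMARY KEY, user TEXT, uid INTEGER);
"""


def build_demo_db():
    """Create an in-memory database filled with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", [
        (1, 38.42, -82.44, "alice", 10),
        (2, 38.43, -82.45, "bob", 20),
        (3, 38.44, -82.46, "alice", 10),
    ])
    conn.executemany("INSERT INTO node_tags VALUES (?, ?, ?)", [
        (1, "amenity", "cafe"),
        (2, "amenity", "hospital"),
    ])
    conn.executemany("INSERT INTO ways VALUES (?, ?, ?)", [
        (100, "carol", 30),
        (101, "bob", 20),
    ])
    conn.commit()
    return conn


def unique_user_count(conn):
    """Count distinct contributors across nodes and ways."""
    query = "SELECT COUNT(*) FROM (SELECT uid FROM nodes UNION SELECT uid FROM ways)"
    return conn.execute(query).fetchone()[0]


def untagged_node_count(conn):
    """Count nodes that have no child tags at all."""
    query = """
        SELECT COUNT(*) FROM nodes
        LEFT JOIN node_tags ON nodes.id = node_tags.id
        WHERE node_tags.id IS NULL
    """
    return conn.execute(query).fetchone()[0]


def amenity_counts(conn):
    """Return (amenity, count) pairs, most frequent first."""
    query = """
        SELECT value, COUNT(*) FROM node_tags
        WHERE key = 'amenity'
        GROUP BY value
        ORDER BY COUNT(*) DESC, value
    """
    return conn.execute(query).fetchall()
```

The same query functions should work unchanged against the real database file, provided the table and column names match; only the connection would change.
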
```python
# Let's see what pops out when we sample every 1000th element (k = 1000) of our big data file.
# This will involve running the script "SampleMapData_Small" to generate the 'data_sample.osm' file
# and investigating it in a text editor like Sublime.

# TODO: 'sample' the main OSM file with k = 1 so it comes out as unicode instead of ASCII and doesn't have parsing issues
# ...may not be an issue for conversion to CSV, but just something to consider
```



As all of the significant problems seem to exist in nodes and ways with child tags, I modified my sampling algorithm to skip any nodes or ways that were not parents of tags, so as to make for easier data parsing and spot analysis. Doing so reduced the sampled file (with k = 1000) from 6,832 lines to 4,582 lines.
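
The modified sampling approach described above might look roughly like the following. This is a sketch, not the contents of *SampleMapData_Small.py*: the function names are hypothetical, and it assumes that every k-th element is counted among the tagged nodes and ways only (rather than among all elements), which the text does not specify.

```python
import xml.etree.ElementTree as ET


def iter_tagged_elements(osm_path, wanted=("node", "way")):
    """Yield nodes and ways that have at least one <tag> child."""
    context = ET.iterparse(osm_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in wanted:
            if elem.find("tag") is not None:
                yield elem
            root.clear()  # free memory held by elements already processed


def write_sample(osm_path, sample_path, k=1000):
    """Write every k-th tagged node/way to sample_path; return how many were written."""
    written = 0
    with open(sample_path, "wb") as out:
        out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        for i, elem in enumerate(iter_tagged_elements(osm_path)):
            if i % k == 0:
                # encoding="utf-8" keeps non-ASCII characters as UTF-8 bytes
                out.write(ET.tostring(elem, encoding="utf-8"))
                written += 1
        out.write(b"</osm>\n")
    return written
```

Parsing incrementally with `iterparse` and clearing the root after each element keeps memory use modest, which matters for a 538 MB input file.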

### Issues Identified During Audit

#### TO DO: MAKE SURE, AT LEAST, THE ITEMS ON THIS LIST ACCOUNT FOR YOUR RESEARCH QUESTIONS

1. Many objects seem to use the Geographic Names Information System (GNIS) schema produced by the USGS for providing tag data. While not necessarily a problem, this schema incorporates duplicative data (e.g. a vague `is_in` attribute key whose value is always the county, but also a `gnis:County` key) that is not helpful.
2. Street types need to be edited (as in the Udacity case study) so that they are consistent throughout
3. IS THIS ONE?: check that none of the amenity types are misspelled
4. IS THIS ONE?: check that no amenity is listed duplicatively and identify the correct one if they aren't identical (e.g. have different addresses)
5. 
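
For the street type issue in the list above, one possible approach is to pull the last word of each street name, flag anything not on an expected list, and then rewrite known abbreviations. The sketch below is in the spirit of the Udacity case study but is not taken from it; the expected list and the mapping are illustrative assumptions and would need to be built from what the audit actually finds in this region.

```python
import re
from collections import defaultdict

# Matches the last whitespace-delimited word of a street name, e.g. "St." or "Ave".
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Illustrative values only; the real lists should come from the audit results.
EXPECTED = {"Street", "Avenue", "Road", "Drive", "Lane", "Court", "Boulevard"}
MAPPING = {
    "St": "Street", "St.": "Street",
    "Ave": "Avenue", "Ave.": "Avenue",
    "Rd": "Road", "Rd.": "Road",
    "Dr": "Drive", "Dr.": "Drive",
}


def audit_street_type(street_types, street_name):
    """Record street names whose final word is not an expected street type."""
    match = STREET_TYPE_RE.search(street_name)
    if match and match.group() not in EXPECTED:
        street_types[match.group()].add(street_name)


def update_name(name, mapping=MAPPING):
    """Replace a known abbreviation at the end of a street name."""
    match = STREET_TYPE_RE.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name
```

Names such as `US-60` would be flagged by the audit but left untouched by `update_name`, since they have no entry in the mapping; those cases are the ones worth reviewing by hand.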