# Wrangling OpenStreetMap Data for SW West Virginia

## Motivation
In order to develop my data wrangling skills, I am auditing and organizing the OpenStreetMap data for SW West Virginia. I chose West Virginia for multiple reasons:
1. My wife's family is from there (and most still live in the state) 
2. I assumed that (as with most self-reported or survey-based data collection efforts) a state with a high proportion of poor and rural areas is likely to have very poor data coverage and quality. This is something that my work on map data may help improve.
3. The state of WV is an area racked by a number of unfortunate statistics, not the least of which is [a very high drug overdose rate](https://www.cdc.gov/drugoverdose/data/statedeaths.html). This area of West Virginia (in particular, Huntington, WV) [suffers particularly badly](https://www.npr.org/2017/06/29/534868012/what-happens-when-the-heroin-epidemic-hits-small-town-america), with the city of Huntington sometimes being called the drug overdose death capital of America.

In wrangling the data for this region of the US, I hope to provide some value to an otherwise ignored set of communities. It is my hope that I will be able to take my progress here and push the audited data to OpenStreetMap as a final step.

## Data Provenance

First of all, let's establish where these data came from and some basic information about them. The data were pulled as a custom extract from https://mapzen.com/ and the final file size for the region (unzipped) is 538 MB. An image of the region extracted is shown here **(note that this region includes WV as well as some portions of VA, KY, and OH)**: ![](imgs/SW_WV_ExtractConfirm.png).

## Project Steps

I'm going to tackle this project using the following steps (mostly included here for my own mental organization, but hopefully also helpful for anyone following this work too!):

1. Sample the data to generate a relatively small data set (e.g. 1-10 MB) that I can sift through to identify recurring concerns as part of my data audit. The code I will use for sampling the data is called *SampleMapData_Small.py*. 

2. Audit the data sample in the tradition of the Udacity Data Wrangling course, using NumPy and Pandas in Python. This will entail:
    1. **Auditing data validity:** do the data conform to a pre-defined data schema? In the case of this project, a relevant question would be *Do the tags include zip codes that are within the region of interest?*
    2. **Auditing data accuracy:** do the data conform to some gold standard? An example of this will likely be zip codes only being 5 or 9 digits long.
    3. **Auditing data completeness:** one approach I may take to this is ensuring that cities or counties I know should be included in the region I'm parsing are actually recorded in the data. Given the region chosen here, one obvious audit that will need to occur is making the non-WV data in the set easy to identify. While these data are not inherently invalid, different state policies regarding the naming of roads, counties, etc. may affect the apparent validity of the data at hand and should be identified early.
    4. **Auditing data consistency:** I will investigate this by looking across tags of the same type to ensure a standard format is being followed throughout and, if it is not, determining if a correction is needed to answer the questions of greatest interest to me. I'll also investigate other issues of internal consistency as they arise.
    5. **Auditing data uniformity:** for these data, this will likely take the form of checking whether data values are within a reasonable range (e.g. no latitudes or longitudes well outside the region of interest).

3. Check to see if more audit problems are found when sampling a larger data set than previously. Keep iterating on this approach of "increase sample size, sample, audit" until no new audit failures are found in the full data set.

4. Correct the data problems identified in the audits as part of the CSV data export process that follows the data schema provided by the Udacity team. This schema creates tables tracking `nodes`, `node_tags`, `ways`, `ways_tags`, and `ways_nodes` and allows for the creation of the SQL database needed for recording these data.

5. Create a SQL database with the provided schema.

6. Import the CSV data files (one per SQL table) into the SQL database created.

7. Query this database in a variety of ways to:
    1. Check to make sure no other auditing errors exist. If any are found, deal with them by modifying the auditing code, re-creating the CSV files, and removing then re-adding records in the SQL database (if necessary).
    2. Determine some descriptive aspects of the data. For example, data-oriented queries of interest could include counting the contributing users and pulling some representative contributions from the highest-frequency contributors. This could reveal any recurring issues with the data these individuals are supplying and speed up any further auditing that may be needed. Region-specific auditing would include things like determining whether the number of cities and counties is accurate. Statistics specifically mentioned in the project rubric (with some additions from me) are:
        1. Size of all files used in this project
        2. Number of unique users
        3. Number of nodes and ways
        4. Numbers of specific types of nodes (e.g. cafes)
        5. Numbers of nodes and ways lacking any child tags to describe them beyond their latitude/longitude
    
    3. Investigate any other interesting counts present in the data, such as:
        1. The number of unincorporated townships vs. the number of cities/villages
        2. The number of residential properties vs. the number of commercial properties
        3. WV is known by many to have issues with the presence (or lack thereof) of basic infrastructure needs, given its high density of rural locations. As such, I'll also explore how the count of residences compares to the count of:
            * grocery stores, 
            * restaurants, and 
            * hospitals + doctor offices 
        
        This will potentially illuminate the issue of "healthcare and food deserts." The healthcare component in particular is relevant to the aforementioned overdose concerns for this area.
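
The descriptive queries listed above could look roughly like the following sketch. It builds a tiny in-memory SQLite database so that it runs on its own. The column names (`user`, `key`, `value`) are my assumptions about the course schema, and the rows are made-up placeholders, not results from the real extract.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL, user TEXT);
        CREATE TABLE ways (id INTEGER PRIMARY KEY, user TEXT);
        CREATE TABLE node_tags (id INTEGER, key TEXT, value TEXT);
    """)
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", [
        (1, 38.41, -82.44, "alice"),
        (2, 38.42, -82.43, "bob"),
        (3, 38.35, -81.63, "alice"),
    ])
    conn.executemany("INSERT INTO ways VALUES (?, ?)", [(10, "carol"), (11, "bob")])
    conn.executemany("INSERT INTO node_tags VALUES (?, ?, ?)", [
        (1, "amenity", "cafe"),
        (2, "amenity", "hospital"),
    ])
    return conn


def unique_users(conn):
    """Count distinct contributors across nodes and ways (UNION drops duplicates)."""
    query = "SELECT COUNT(*) FROM (SELECT user FROM nodes UNION SELECT user FROM ways)"
    return conn.execute(query).fetchone()[0]


def amenity_counts(conn):
    """Return {amenity value: number of tagged nodes}."""
    query = ("SELECT value, COUNT(*) FROM node_tags "
             "WHERE key = 'amenity' GROUP BY value")
    return dict(conn.execute(query).fetchall())


def untagged_nodes(conn):
    """Count nodes that have no child tags at all."""
    query = "SELECT COUNT(*) FROM nodes WHERE id NOT IN (SELECT id FROM node_tags)"
    return conn.execute(query).fetchone()[0]
```

Against the real database, the same three functions would be called on a connection opened with `sqlite3.connect` on the project's database file.
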

As all of the significant problems seem to exist in nodes and ways with child tags, I modified my sampling algorithm to skip any nodes or ways that were not parents of tags, so as to make for easier data parsing and spot analysis. Doing so reduced the sampled file (with k = 1000) from 6,832 lines to 4,582 lines.
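
A minimal sketch of that kind of filtered sampling is shown below. This is not the contents of *SampleMapData_Small.py*; the function names and the demo XML are hypothetical, and I am assuming that "every k-th element" is counted over the tagged nodes and ways only.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny made-up OSM fragment, used only to demonstrate the functions below
DEMO_OSM = b"""<osm>
  <node id="1" lat="38.41" lon="-82.44"/>
  <node id="2" lat="38.42" lon="-82.43"><tag k="amenity" v="cafe"/></node>
  <way id="3"><nd ref="1"/></way>
  <way id="4"><nd ref="1"/><tag k="highway" v="residential"/></way>
</osm>"""


def iter_tagged_elements(osm_source, kinds=("node", "way")):
    """Yield node/way elements that have at least one <tag> child."""
    context = ET.iterparse(osm_source, events=("start", "end"))
    _, root = next(context)  # grab the root so parsed children can be released
    for event, elem in context:
        if event == "end" and elem.tag in kinds:
            if elem.find("tag") is not None:
                yield elem
            root.clear()  # keep memory use flat on large extracts


def sample_every_k(osm_source, k):
    """Serialize every k-th tagged element (assumed sampling rule)."""
    return [
        ET.tostring(elem, encoding="unicode")
        for i, elem in enumerate(iter_tagged_elements(osm_source))
        if i % k == 0
    ]
```

Calling `sample_every_k(BytesIO(DEMO_OSM), 2)` keeps only the tagged node with `id="2"`; for the real extract, a file path can be passed in place of the `BytesIO` object.
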

### Issues Identified During Audit

#### TO DO: MAKE SURE, AT LEAST, THE ITEMS ON THIS LIST ACCOUNT FOR YOUR RESEARCH QUESTIONS
#### Items of import: 
* Do unincorporated townships get called out specifically, or are they just non-city areas?
    * Can assume that (for nodes) place=town, place=village, place=hamlet, and place=isolated_dwelling always refer to unincorporated communities (reference justifying this [can be found here](https://wiki.openstreetmap.org/wiki/United_States_admin_level#Unincorporated_areas))
* Is there consistent (if at all) identification of residential vs. commercial properties?
    * If there is both an `addr:housenumber` tag key and a `name` key, then the property is likely commercial; if there is only `addr:housenumber`, assume it is residential
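
The two rules of thumb above can be written down as a short sketch. The function names are hypothetical, and `tags` is assumed to be a plain dict of an element's tag keys and values.

```python
# place=* values assumed (per the wiki reference above) to mark unincorporated communities
UNINCORPORATED_PLACES = {"town", "village", "hamlet", "isolated_dwelling"}


def is_unincorporated(tags):
    """True when a node's place tag is one of the assumed unincorporated types."""
    return tags.get("place") in UNINCORPORATED_PLACES


def classify_property(tags):
    """Heuristic only: house number plus name means commercial, house number alone means residential."""
    if "addr:housenumber" not in tags:
        return None  # not enough information to guess
    return "commercial" if "name" in tags else "residential"
```

Because this is a heuristic, named residential buildings (for example, apartment complexes) would be counted as commercial, so the resulting counts should be read as rough estimates.
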


1. **A variety of amenity types exist that satisfy my research questions** regarding food and healthcare deserts (e.g. shop=grocery and shop=supermarket for grocery stores).
    * *I need to generate a list of all amenities in the file and then make sure I'm seeking out all of the relevant ones.* This exercise also helps ensure there aren't any erroneous amenities in there.
3. **There are barely any postcodes/zip codes** (addr:postcode) for nodes in the every-100 data sample (10 total!) and most are commercial entities.
    * However, there are a number of zip codes on way tags, identifying the start zip and end zip (presumably, they're called "left" and "right" in the data).
    * *I currently have no corrective plan for this, as zip code is not a crucial element and correcting this would be a data creation exercise (due to the GIS nature of the problem) rather than a data munging exercise.*
4. **County and state names are inconsistent**, depending upon which system generated them (e.g. GNIS, Tiger, etc.). This is important, as so many features of WV are referenced by county by WV residents, thus any analysis will need to be based off of counties to have any intuitive value to residents (and likely the same is true of state and local legislators). Some of the issues are:
    * The state name is included with the county name (e.g. `tiger:county` = "County, State (2-letter)") and not as a separate data field
    * The county names are referenced by differing tags across the dataset (e.g. `gnis:county_name` + `addr:state`; OR `gnis:County` + `gnis:ST_alpha`)
    * The tags only include a numeric `gnis:county_id` and `gnis:state_id` for amenities and shops (e.g. instead of naming the county/state as they do for non-commercial nodes). These commercial locations also typically lack street addresses, only including the requisite longitude and latitude. This latter problem is beyond the scope of this exercise, **but I can create consistency by developing a mapping algorithm for GNIS county IDs to county names and state IDs to state names.**
    * **I plan to standardize these data** by: 
        * Removing the state from any county entry, making a state-specific data entry of the form `addr:state`, and including a county name of the format `addr:county`.
            * While `addr:county` [is not considered standard for OpenStreetMap](https://help.openstreetmap.org/questions/25170/counties-and-building-names-in-addresses) (as it can be derived from other administrative data contained in OSM, such as the county borders), it is useful for the purposes of our database, so that results may be provided in the context of counties without requiring extra calculations to determine the county borders, a GIS exercise that is beyond the scope of this project. 
        * When only a numeric identifier is available for the state and/or county, replacing that with `addr:state` or `addr:county` (as appropriate) and using the name of the state/county, instead of its numeric identifier.
5. **No clinics, doctors, hospitals, or generic healthcare establishments were found in the every-100 sample file**. While this certainly doesn't mean they are missing from the full data file, it is concerning.
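
The county/state standardization described in item 4 might be sketched as follows. The function name is hypothetical, `tags` is assumed to be a dict of one element's tags, and the county names in the usage note are illustrative values. The sketch does not cover the numeric `gnis:county_id`/`gnis:state_id` case (which needs a separate lookup table) or `tiger:county` values that list more than one county.

```python
COUNTY_KEYS = ("gnis:county_name", "gnis:County")
STATE_KEYS = ("addr:state", "gnis:ST_alpha")


def normalize_county_state(tags):
    """Return a copy of tags that uses only addr:county and addr:state."""
    county = state = None

    # tiger:county is assumed to look like "County, ST"
    tiger = tags.get("tiger:county", "")
    if "," in tiger:
        county, state = (part.strip() for part in tiger.rsplit(",", 1))

    # Fall back to the GNIS-style keys when TIGER gave nothing
    county = county or next((tags[k] for k in COUNTY_KEYS if k in tags), None)
    state = state or next((tags[k] for k in STATE_KEYS if k in tags), None)

    # Drop the source-specific keys, then add the standardized ones
    dropped = {"tiger:county", *COUNTY_KEYS, *STATE_KEYS}
    cleaned = {k: v for k, v in tags.items() if k not in dropped}
    if county:
        cleaned["addr:county"] = county
    if state:
        cleaned["addr:state"] = state
    return cleaned
```

For example, `normalize_county_state({"tiger:county": "Cabell, WV"})` yields `addr:county` = "Cabell" and `addr:state` = "WV", and an element tagged with `gnis:County` + `gnis:ST_alpha` ends up with the same two keys.
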




### Non-issues that Surprised Me (in the sense that they did not require correction)

1. Found no street types that were unexpected (when sampling every 100 tags), given a relatively small list: `["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", "Trail", "Parkway", "Commons", "Circle", "Terrace"]`, nor any obviously mis-formatted or over-abbreviated street names.
    * The exception: found at least one street name of the form "County Road 107". Given that it is unique (and accurate), I did not change it in any way.
2. No zip codes were identified that were poorly formatted and none were outside the relevant region.
3. No latitudes or longitudes were identified that are outside the expected bounds.
    * These bounds (four corners of a rectangle) are: 
        * [-82.6730344023,37.1523636424],
        * [-80.2011105742,37.1523636424],
        * [-80.2011105742,39.0498347562],
        * [-82.6730344023,39.0498347562]
        * **Basic rule: longitude should be between -82.67 and -80.20, latitude should be between 37.15 and 39.05**
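
The zip code and coordinate checks can be sketched as below, using the bounding box listed above. The function name and the problem labels are hypothetical, and I am assuming that a 9-digit zip code may be written with or without a hyphen.

```python
import re
from collections import defaultdict

# Bounding box of the extract (taken from the corner coordinates above)
LON_MIN, LON_MAX = -82.6730344023, -80.2011105742
LAT_MIN, LAT_MAX = 37.1523636424, 39.0498347562

# 5 digits, optionally followed by 4 more (hyphen optional; an assumption)
POSTCODE_RE = re.compile(r"\d{5}(-?\d{4})?")


def audit_node(node_id, lat, lon, postcode=None, problems=None):
    """Record any out-of-bounds coordinate or badly formatted zip code for one node."""
    if problems is None:
        problems = defaultdict(list)
    if not LAT_MIN <= lat <= LAT_MAX:
        problems[node_id].append("Bad lat")
    if not LON_MIN <= lon <= LON_MAX:
        problems[node_id].append("Bad lon")
    if postcode is not None and not POSTCODE_RE.fullmatch(postcode):
        problems[node_id].append("Bad postcode")
    return problems
```

Passing the same `problems` dict into repeated calls accumulates one list of problem labels per node ID, so an empty dict at the end means nothing was flagged.
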


```python
from collections import defaultdict
import pprint

# Map each element ID to the list of audit problems recorded for it
test = defaultdict(list)
test['123'].append('Bad lat')
test['123'].append('Bad lon')

pprint.pprint(test)
```

```
defaultdict(<class 'list'>, {'123': ['Bad lat', 'Bad lon']})
```
