# Wrangling OpenStreetMap Data for SW West Virginia

## Motivation
In order to develop my data wrangling skills, I am auditing and organizing the OpenStreetMap data for SW West Virginia. I chose West Virginia for multiple reasons:
1. My wife's family is from there (and most still live in the state) 
2. I assumed that, as with most self-reported or survey-based data collection efforts, a state with a high proportion of poor and rural areas is likely to have poor data coverage and quality. My work on the map data may help with this.
3. The state of WV is an area racked by a number of unfortunate statistics, not the least of which is [a very high drug overdose rate](https://www.cdc.gov/drugoverdose/data/statedeaths.html). This area of West Virginia (in particular, Huntington, WV) [suffers particularly badly](https://www.npr.org/2017/06/29/534868012/what-happens-when-the-heroin-epidemic-hits-small-town-america), with the city of Huntington sometimes being called the drug overdose death capital of America.

In wrangling the data for this region of the US, I hope to provide some value to an otherwise ignored set of communities. It is my hope that I will be able to take my progress here and push the audited data to OpenStreetMap as a final step.

## Data Provenance

First of all, let's establish where these data came from and some basic information about them. The data were pulled as a custom extract from https://mapzen.com/ and the final file size for the region (unzipped) is 538 MB. An image of the region extracted is shown here **(note that this region includes WV as well as some portions of VA, KY, and OH)**: ![](imgs/SW_WV_ExtractConfirm.png).

## Project Steps

I'm going to tackle this project using the following steps (mostly included here for my own mental organization, but hopefully also helpful for anyone following this work too!):

1. Sample the data to generate a relatively small data set (e.g. 1-10 MB) that I can sift through to identify recurring concerns as part of my data audit. The code I will use for sampling the data is called *SampleMapData_Small.py*. 

2. Audit the data sample in the tradition of the Udacity Data Wrangling course, using NumPy and Pandas in Python. This will entail:
    1. **Auditing data validity:** do the data conform to a pre-defined data schema? In the case of this project, a relevant question would be *Do the tags include zip codes that are within the region of interest?*
    2. **Auditing data accuracy:** do the data conform to some gold standard? An example of this will likely be zip codes only being 5 or 9 digits long.
    3. **Auditing data completeness:** one approach I may take is checking that cities or counties I know should be in the region I'm parsing are actually recorded in the data. Given the region chosen here, one obvious audit is finding an easy way to identify the non-WV data in the set. While these data are not inherently invalid, different state policies regarding the naming of roads, counties, etc. may affect the apparent validity of the data at hand and should be identified early.
    4. **Auditing data consistency:** I will investigate this by looking across tags of the same type to ensure a standard format is being followed throughout and, if it is not, determining if a correction is needed to answer the questions of greatest interest to me. I'll also investigate other issues of internal consistency as they arise.
    5. **Auditing data uniformity:** for these data, this will likely take the form of checking whether data values are within a reasonable range (e.g. no latitudes or longitudes well outside the region of interest).

3. Check to see if more audit problems are found when sampling a larger data set than previously. Keep iterating on this approach of "increase sample size, sample, audit" until no new audit failures are found in the full data set.

4. Correct the data problems identified in the audits as part of the CSV data export process that follows the data schema provided by the Udacity team. This schema creates tables tracking `nodes`, `nodes_tags`, `ways`, `ways_tags`, and `ways_nodes` and allows for the creation of the SQL database needed for recording these data.

5. Create a SQL database with the provided schema.

6. Import the CSV data files (one per SQL table) into the SQL database created.

7. Query this database in a variety of ways to:
    1. Check to make sure no other auditing errors exist. If any are found, deal with them by modifying the auditing code, re-creating the CSV files, and removing then re-adding records in the SQL database (if necessary).
    2. Determine some descriptive aspects of the data. For example, data-oriented queries of interest could include the number of users contributing and identifying some representative contributions of the highest-frequency contributors. This could identify any recurring issues with the data these individuals are supplying and speed up any further auditing that may be needed. Region-specific auditing would include things like determining if the number of cities and counties are accurate. Statistics specifically mentioned in the project rubric (with some additions from me) are:
        1. Size of all files used in this project
        2. Number of unique users
        3. Number of nodes and ways
        4. Numbers of specific types of nodes (e.g. cafes)
        5. Numbers of nodes and ways lacking any child tags to describe them beyond their latitude/longitude
    
    3. Investigate any other interesting counts present in the data, such as:
        1. The number of unincorporated townships vs. the number of cities/villages
        2. Comparing the number of residential properties to the number of commercial properties.         
        3. WV is known by many to have issues with the presence (or lack thereof) of basic infrastructure needs, given its high density of rural locations. As such, I'll also explore how the count of residences compares to the count of:
            * grocery stores, 
            * restaurants, 
            * alcohol serving-selling businesses, and 
            * hospitals + doctor offices 
        
        This will potentially illuminate the issue of "healthcare and food deserts." The healthcare component in particular is relevant to the aforementioned overdose concerns for this area. I include alcohol-oriented establishments in these results to replicate [a similar study done in 2013](https://www.usatoday.com/story/dispatches/2013/12/06/top-bar-and-pizza-cities/3882089/) regarding the number of bars per capita vs. food options per capita. In my case, I plan to compare the different counties in terms of their ratios of alcohol to food, alcohol to healthcare, and fast food to healthcare, among other combinations.
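
The descriptive queries listed above could be sketched roughly as follows. This is only an illustration against a tiny in-memory SQLite database: the table layout is a simplified stand-in for the Udacity schema (only the columns the queries need), the rows are made up, and the helper function names are hypothetical.

```python
import sqlite3

# Simplified, assumed subset of the schema: only the columns used below.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL, uid INTEGER);
CREATE TABLE ways (id INTEGER PRIMARY KEY, uid INTEGER);
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
""")

# Made-up rows, purely so the queries have something to count.
cur.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)",
                [(1, 38.41, -82.44, 10), (2, 38.35, -81.63, 11), (3, 38.42, -82.45, 10)])
cur.executemany("INSERT INTO ways VALUES (?, ?)", [(100, 11), (101, 12)])
cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)",
                [(1, "amenity", "cafe", "regular"), (2, "amenity", "hospital", "regular")])


def count_unique_users(cur):
    """Number of distinct user IDs across nodes and ways (UNION de-duplicates)."""
    cur.execute("SELECT COUNT(*) FROM (SELECT uid FROM nodes UNION SELECT uid FROM ways)")
    return cur.fetchone()[0]


def count_nodes_of_type(cur, key, value):
    """Number of node tags matching a key/value pair, e.g. amenity=cafe."""
    cur.execute("SELECT COUNT(*) FROM nodes_tags WHERE key = ? AND value = ?", (key, value))
    return cur.fetchone()[0]


def count_untagged_nodes(cur):
    """Number of nodes with no child tags beyond their latitude/longitude."""
    cur.execute("SELECT COUNT(*) FROM nodes WHERE id NOT IN (SELECT id FROM nodes_tags)")
    return cur.fetchone()[0]
```

The same pattern should extend to ways (via `ways_tags`) once the real CSV files are imported.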

#### Note

As all of the significant problems seem to exist in nodes and ways with child tags, I modified my sampling algorithm (when using smaller sample OSM files for initial code development) to skip any nodes or ways that were not parents of tags, so as to make for easier data parsing and spot analysis. Doing so reduced the sampled file (with sample frequency = 1000 tags) from 6,832 lines to 4,582 lines.
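
A minimal sketch of this kind of tag-only sampling is shown here. It is not the actual contents of *SampleMapData_Small.py*; the function names and the tiny inline XML are hypothetical, and it assumes that the "every k-th element" count is applied after untagged nodes and ways are dropped.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up OSM fragment used only to exercise the functions below.
SAMPLE_XML = b"""<osm>
  <node id="1" lat="38.4" lon="-82.4"><tag k="name" v="A"/></node>
  <node id="2" lat="38.5" lon="-82.3"/>
  <node id="3" lat="38.6" lon="-82.2"><tag k="amenity" v="cafe"/></node>
  <way id="10"><nd ref="1"/><tag k="highway" v="residential"/></way>
</osm>"""


def get_tagged_elements(osm_file, tags=("node", "way")):
    """Yield top-level nodes/ways that have at least one <tag> child."""
    context = ET.iterparse(osm_file, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            if elem.find("tag") is not None:
                yield elem
            root.clear()  # drop processed children so memory use stays flat


def sample_elements(osm_file, k):
    """Yield every k-th tagged element (assumed sampling rule)."""
    for i, elem in enumerate(get_tagged_elements(osm_file)):
        if i % k == 0:
            yield elem


sampled_ids = [elem.get("id") for elem in sample_elements(io.BytesIO(SAMPLE_XML), 2)]
```

With the fragment above, node 2 is skipped because it has no child tags, and the remaining elements are sampled at every second position.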

### Issues Requiring Correction Identified During Audit

1. Zip code formatting and/or just plain wrong-ness.
2. State/county inconsistency, formatting, or representation as a code number instead of a name.
3. Amenity/shop types were only rarely incorrectly labeled/spelled. These rare instances needing correction are discussed briefly below.

### Non-issues that Surprised Me (non-issues in the sense that they do not require correction/cannot reasonably be corrected for this project)

1. I found no street types that were unexpected (when sampling every 100 tags), given a relatively small list: `["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road","Trail", "Parkway", "Commons", "Circle", "Terrace"]` nor any obviously mis-formatted or overabbreviated street names.
2. **I found a number of references to highways that could not be treated with the Udacity-class-developed street auditing code.** For example, "Highway 644" is a valid representation of a way tag and could even have buildings along it whose addresses (and thus node tags) would be written this way. It's not at all wrong, but it also violates the assumption that street types (such as "Road") will only be placed at the end of the address.
    * In addition, a number of street names ended with a cardinal direction (e.g. "East" or "Northeast") which is completely valid in many areas, such as Washington, DC, but also makes it so that the more traditional street type identifier comes earlier in the street name (e.g. "101 K Street NE")
    * **As these are not inherently incorrect ways of representing these street names, I will not be correcting these 'issues'**.
3. No latitudes or longitudes were identified that are outside the expected bounds.
    * These bounds (four corners of a rectangle) are: 
        * [-82.6730344023,37.1523636424],
        * [-80.2011105742,37.1523636424],
        * [-80.2011105742,39.0498347562],
        * [-82.6730344023,39.0498347562]
        * **Basic rule: longitude should be between -82.67 and -80.20, latitude should be between 37.15 and 39.05**
4. **Multiple businesses were identified that did not have any business type (e.g. shop, amenity, etc.) associated with them.** *At this time, there is no obvious way to correct for this error, as no discernible pattern exists as to when the tag is included or excluded.* For example, many of the nodes that were clearly businesses (based upon their names) had `k="name"` values, but other nodes also had this and were just bus stops, waterways, or something else entirely. As a result, this will be an issue that needs to remain for now, until a gold standard data set for business names can be identified and used in the auditing process (e.g. by comparing `k="name"` values to the list of known businesses and extracting the business type label from that same list to put into the OSM data).
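
The latitude/longitude rule from the list above can be sketched as a small check. The function name is hypothetical; the bounds are the rectangle corners quoted above, and coordinates are assumed to arrive as strings (as they do in OSM XML attributes) or as floats.

```python
# Bounding rectangle taken from the corner coordinates listed above.
LON_MIN, LON_MAX = -82.6730344023, -80.2011105742
LAT_MIN, LAT_MAX = 37.1523636424, 39.0498347562


def in_bounds(lat, lon):
    """Return True if the point lies inside the extract's bounding rectangle."""
    lat, lon = float(lat), float(lon)
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX
```

For example, a point near Huntington, WV (roughly 38.42, -82.45) passes, while one near Washington, DC (roughly 38.9, -77.0) does not.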


### Other Items of Relevance

1. The best way to identify the quantity of residential vs. commercial use areas is by using the tag keys `k="landuse"` and `k="building"`. The relevant options are as follows:
    * Residential
        * `landuse:residential`
        * `landuse:village_green`
        * `landuse:recreation_ground`
        * `landuse:allotments`
        * `building:apartments`
        * `building:farm`
            * [According to OSM](https://wiki.openstreetmap.org/wiki/Key:building) this is a purely residential designation
        * `building:house`
        * `building:detached`
        * `building:residential`
        * `building:dormitory`
        * `building:houseboat`
        * `building:bungalow`
        * `building:static_caravan`
            * This refers to a mobile home (semi)permanently left on a single site       
        * `building:cabin`
    * Commercial
        * `landuse:commercial`
        * `landuse:depot`
        * `landuse:industrial`
        * `landuse:landfill`
        * `landuse:orchard`
        * `landuse:plant_nursery`
        * `landuse:port`
        * `landuse:quarry`
        * `landuse:retail`   
        * `building:hotel`
        * `building:commercial`
        * `building:industrial`
        * `building:retail`
        * `building:warehouse`
        * `building:kiosk`
        * `building:hospital`
        * `building:stadium`
    * **Note:** any types that appear to be mixed residential + commercial usage or have mixed ownership models (e.g. `landuse:farmyard` or `building:university`) have been excluded from consideration for the sake of clarity.
2. We can assume that (for nodes) `place:town` (e.g. `<tag k="place" v="town" />`), `place:village`, `place:hamlet`, and `place:isolated_dwelling` **always refer to unincorporated communities** (reference justifying this [can be found here](https://wiki.openstreetmap.org/wiki/United_States_admin_level#Unincorporated_areas)) for purposes of analysis.
3. **A variety of amenity types exist (in general, not necessarily in my data file, but often in the data file too) that satisfy my research questions** regarding food and healthcare deserts:
    * Grocery stores
        * `shop:grocery`
        * `shop:greengrocer`
        * `shop:convenience`
        * `shop:supermarket`
    * Restaurants
        * `amenity:restaurant`
        * `amenity:cafe`
        * `amenity:fast_food`
        * `amenity:food_court`
    * Alcohol serving/selling locations
        * `amenity:biergarten`
        * `amenity:pub`
        * `amenity:bar`
        * `shop:alcohol`
        * `shop:wine`
    * Healthcare locations
        * `amenity:clinic`
        * `amenity:doctors`
        * `shop:optician`
        * `amenity:dentist`
        * `amenity:hospital`
        * `healthcare:*`
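
One way the food and healthcare groupings above might be turned into a lookup is sketched below. The names `CATEGORY_TAGS` and `categorize` are hypothetical, and the sketch assumes that any tag whose key is `healthcare` counts as a healthcare location (my reading of `healthcare:*`).

```python
from collections import Counter

# (key, value) pairs copied from the groupings listed above.
CATEGORY_TAGS = {
    "grocery": {("shop", "grocery"), ("shop", "greengrocer"),
                ("shop", "convenience"), ("shop", "supermarket")},
    "restaurant": {("amenity", "restaurant"), ("amenity", "cafe"),
                   ("amenity", "fast_food"), ("amenity", "food_court")},
    "alcohol": {("amenity", "biergarten"), ("amenity", "pub"), ("amenity", "bar"),
                ("shop", "alcohol"), ("shop", "wine")},
    "healthcare": {("amenity", "clinic"), ("amenity", "doctors"), ("shop", "optician"),
                   ("amenity", "dentist"), ("amenity", "hospital")},
}


def categorize(key, value):
    """Map a tag's key/value to one of the categories above, or None."""
    if key == "healthcare":  # assumed meaning of the healthcare:* wildcard
        return "healthcare"
    for category, pairs in CATEGORY_TAGS.items():
        if (key, value) in pairs:
            return category
    return None


# Made-up tags, only to show how counts per category could be tallied.
example_tags = [("amenity", "pub"), ("shop", "supermarket"),
                ("healthcare", "pharmacy"), ("shop", "tiles")]
category_counts = Counter(c for c in (categorize(k, v) for k, v in example_tags) if c)
```

The residential and commercial `landuse`/`building` lists could be handled with the same kind of set lookup.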

### Formatting of Zip Codes

**Some issues identified in this area:**
1. There are some postcodes/zip codes for nodes that include non-numbers in them and are just plain wrong (e.g. 'WV')
2. Some of the codes appear to follow the 9-digit zip code standard instead of the more general 5-digit standard (e.g. '12345-1111')
3. Some of the codes include lists of zip codes separated by either colons or semi-colons

**I will correct these issues during export into CSV files for the SQL tables by (numbered to match the issues above):**
1. Ignoring zips that are just letters (such as 'WV')
2. Shortening each 9-digit zip code to only include the first 5 digits
3. Extracting individual zip codes when they appear as lists, giving them each their own tag record associated with the node/way tag ID in question

NOTE: Each zip code will be encoded into the nodes_tags or ways_tags table as `addr:postcode` as this is the most common way to record a zip code in OSM.
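
The three zip code corrections described above could look something like the following. This is a sketch rather than the export code itself: the function name is hypothetical, and it assumes that a valid code is exactly 5 digits, optionally followed by a 4-digit extension with or without a hyphen.

```python
import re

# 5 digits, optionally followed by a ZIP+4 extension (hyphen optional by assumption).
ZIP_RE = re.compile(r"^(\d{5})(?:-?\d{4})?$")


def clean_postcodes(raw):
    """Return the list of 5-digit zip codes found in a raw postcode value.

    Handles lists separated by colons or semicolons, truncates 9-digit
    codes to their first 5 digits, and drops anything non-numeric.
    """
    cleaned = []
    for part in re.split(r"[;:]", raw):
        match = ZIP_RE.match(part.strip())
        if match:
            cleaned.append(match.group(1))
    return cleaned
```

An empty list means the value should be ignored; a list with several entries means one `addr:postcode` tag record per zip code for the node or way in question.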

### State/County Naming 
            
**County and state names are inconsistent**, depending upon which system generated them (e.g. GNIS, TIGER, etc.). This matters because WV residents reference so many features of the state by county; any analysis will need to be organized by county to have intuitive value to residents (and likely to state and local legislators as well). Also, if states are recorded that are clearly outside the territory sampled, that needs to be corrected.

**Some issues identified in this area:**
1. The state name is included with the county name (e.g. `tiger:county` = "County, State (2-letter)") and not as a separate data field
2. Some (ways, presumably) also include *lists* of county,state entries in a single tag, separated by colons or semicolons (sometimes with, sometimes without, leading and trailing spaces for each item in the list).
3. The county names are referenced by differing tag keys across the dataset (e.g. `gnis:county_name` + `addr:state`; OR `gnis:County` + `gnis:ST_alpha`; and many others)
4. The tags only include a numeric `gnis:county_id` and `gnis:state_id` for some nodes and ways, typically amenities and shops (e.g. instead of naming the county/state as they do for non-commercial nodes). 
    1. These commercial locations also typically lack street addresses, only including the requisite longitude and latitude. 


**I will correct these issues during export into CSV files for the SQL tables by (numbered to match the issues above):**
1. Removing the state from any county entry, making a state-specific tag record for the state name extracted (when exporting to CSV) and keeping the county name as its own data field (these would be of the form `addr:county` or `addr:state` in the nodes_tags/ways_tags table)
2. Extracting each state/county name from the list and storing in a separate tag record for the node/way in question
3. Auditing done by `Audit_Simple.py` has revealed that the following state and county tags are most relevant to our analysis here, and these will be the ones that are the focus of extraction:
    * **Types of County Tags**:
        * `gnis:County`
        * `gnis:County_num`
        * `gnis:county_id`
        * `gnis:county_name`
        * `is_in:county`
        * `tiger:county`

    * **Types of State Tags**
        * `addr:state`
        * `gnis:ST_alpha`
        * `gnis:ST_num`
        * `gnis:state_id`
        * `nist:state_fips`
4. I can create consistency by developing a mapping algorithm for county/state ID numbers to county and state names, using the US Census Data API to ensure that the proper mapping and vintage of data set are being utilized.
    1. The problem of commercial locations lacking street addresses is beyond the scope of this project, as it requires substantial additional GIS data that is supposed to be provided by OSM in the first place.
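
Corrections 1 and 2 above (separating the state from the county and unpacking lists) might be sketched as follows. The function names are hypothetical, and the sketch assumes entries shaped like "County, ST" separated by colons or semicolons, as described for `tiger:county`.

```python
import re


def split_county_state(raw):
    """Split a value like 'Cabell, WV; Lawrence, OH' into (county, state) pairs."""
    pairs = []
    for item in re.split(r"[;:]", raw):
        item = item.strip()
        if not item:
            continue
        county, _, state = item.partition(",")
        # state is None when the entry has no ', ST' suffix
        pairs.append((county.strip(), state.strip() or None))
    return pairs


def to_tag_records(element_id, raw):
    """Build one addr:county / addr:state record per extracted name."""
    records = []
    for county, state in split_county_state(raw):
        records.append({"id": element_id, "key": "addr:county", "value": county})
        if state:
            records.append({"id": element_id, "key": "addr:state", "value": state})
    return records
```

Each returned dictionary corresponds to one row destined for the nodes_tags or ways_tags CSV file.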
    
NOTE: the 2007 import of USGS GNIS data into OpenStreetMap used the fields `gnis:County_num` and `gnis:ST_num`, which are (presumably 2007-vintage) county and state FIPS codes, resp. The 2009 import of USGS GNIS data had similar fields with slightly different tag keys: `gnis:county_id` and `gnis:state_id`, which are also FIPS codes, presumably of 2009 vintage. **As such, my extraction of these data will include mapping these numbers to their actual names, as per these two years' FIPS code mappings, using the US Census Bureau's Census Data API.** [Please see here](https://wiki.openstreetmap.org/wiki/USGS_GNIS) for the reference regarding these import actions for OSM and [here](https://www.census.gov/data/developers/updates/new-discovery-tool.html) for the Census Data API description page, including links to usage documentation.
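
The ID-to-name mapping could be sketched like this. To keep the example self-contained, a small hand-written lookup table stands in for whatever the Census Data API would return; the table covers only a few codes, and the function name is hypothetical.

```python
# Hand-written stand-in for an API-derived table (not the Census API itself).
STATE_FIPS = {"54": "WV", "21": "KY", "39": "OH", "51": "VA"}
COUNTY_FIPS = {("54", "011"): "Cabell", ("54", "039"): "Kanawha"}


def lookup_names(state_id, county_id):
    """Map state/county FIPS codes to (county_name, state_abbreviation).

    Codes are zero-padded first, so both "11" and "011" are accepted.
    Unknown codes map to None.
    """
    state_code = str(state_id).zfill(2)
    county_code = str(county_id).zfill(3)
    return COUNTY_FIPS.get((state_code, county_code)), STATE_FIPS.get(state_code)
```

Returning `None` for unknown codes makes it easy to log the IDs that still need a mapping instead of silently guessing.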

### Shop/Amenity Tag Formatting

There were no extraneous or misspelled amenity types/values as far as I could tell, except for the incorrect `amenity:'ATV Trails'` entry and the incorrectly-capitalized `shop:Tiles`. 

**I will correct this by changing these tags to `amenity:atv` and `shop:tiles`, resp., when exporting to CSV, even though these are not critical to my proposed analysis.**
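
As a sketch, these two corrections could be applied with a simple lookup during export (the names `TAG_VALUE_FIXES` and `fix_tag_value` are hypothetical):

```python
# The two corrections described above, keyed by (tag key, original value).
TAG_VALUE_FIXES = {
    ("amenity", "ATV Trails"): "atv",
    ("shop", "Tiles"): "tiles",
}


def fix_tag_value(key, value):
    """Return the corrected value for a tag, or the original if no fix applies."""
    return TAG_VALUE_FIXES.get((key, value), value)
```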

```python
from collections import defaultdict
import pprint

print(3 % 1)  # prints 0
```
