# OpenStreetMap Data Wrangling Project


The city I chose to work with initially was Austin, Texas. 

* http://www.openstreetmap.org/relation/113314
* https://mapzen.com/data/metro-extracts/metro/austin_texas/

I chose this city because it's the place I presently call home, and tackling the Austin data set gives me a chance to get better acquainted with the place I live. 

## General Audit and Overview

As a preliminary step to working with irregularities in the data set, I'll take a look at the distribution of tags to see which are abundant enough to serve as good data wrangling practice. I've created an Audit class for the initial survey of the data file, and I'll start by creating an Audit object for the Austin OSM file and getting some preliminary stats.

In [1]:
import osm_auditor

In [2]:
austin = osm_auditor.Audit(r'austin_texas.osm')
austin.get_file_size()
austin.get_osm_stats()

Opened OSM file size = 1415.3 megabytes
Counting element tags in file...
Relations in austin_texas.osm = 2400
Ways in austin_texas.osm = 669817
Nodes in austin_texas.osm = 6389941
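
The Audit class itself isn't reproduced here. As a rough sketch of how such a count can be done with the standard library, the snippet below streams the XML and tallies the top-level element types. The `count_elements` helper and the tiny inline sample are assumptions for illustration, not the class's actual code.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source, wanted=("node", "way", "relation")):
    """Stream an OSM XML source and count the element types in `wanted`."""
    counts = Counter()
    # iterparse yields each element when its end tag is read, so the whole
    # file never has to be loaded into memory at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in wanted:
            counts[elem.tag] += 1
            elem.clear()  # drop the element's children once it is counted
    return counts


# Hypothetical three-element sample standing in for austin_texas.osm.
SAMPLE = b"""<osm>
  <node id="1" lat="30.27" lon="-97.74"/>
  <node id="2" lat="30.28" lon="-97.75"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

counts = count_elements(io.BytesIO(SAMPLE))
print(counts["node"], counts["way"], counts["relation"])
```

Streaming matters at this scale: a 1.4 GB file parsed into a full tree would likely need several times that in memory.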


## Detailed Audit

The file looks like it has plenty of data points for wrangling, so next I'll look at some of the specific tags, performing a quick audit of each to see which might need cleaning. Since addresses make up a significant part of the OSM data, I'll focus on address tags. In the OSM data, address components (house number, street name, postal code, city name, etc.) are each (ideally) assigned a different tag. I'll start with the postal code tag, creating a set of all unique values contained in the data file, and then check for inconsistencies, formatting issues, or postal codes that fall outside of the Austin metro area.

In [3]:
austin.audit_tags('addr:postcode')

Processing OSM file...
Data processed in 142.8992 secs. 
Resulting set of tag names: 

{'78656', '78957', '78616', '78750', '78669', 'TX 78728', '78744', '78732', '78621', '78653', '78753', 'TX 78758', '78626', '78717', '78749', '78644', '78727', '78739', '78645', '78737', '78665', '78712', '76574-4649', '78613-2277', '78759-3504', 'TX 78724', '78731', '78620', '78742', '78613', '78738', '78676', '78640-6137', '78704-7205', '78728', '78602', '78660', '78758', '78628', 'TX 78745', '78729', '78617', '78747', '78664', '78759', '76574', '78640', '78754', '78746', '78758-7013', 'tx', '78735', '78646', '78704', '78654', '78722', '78681', 'Texas', '78733', '78723', '78719', '78642', '78752', '78741', '78724-1199', '78682', '78705', '78641', '78745', '78619', '78626\u200e', '78757', '78652', '78736', '78615', '78751', '78612', '78724', 'TX 78613', '78680', 'TX 78735', '78704-5639', '78721', '78640-4520', '78691', '78666', '78756', '78705-5609', '78758-7008', '78610', '78703', '14150', '78726',

The postal code data shows minor inconsistencies such as the use of nine-digit postal codes, the inclusion of the state abbreviation TX, state names entered in place of a code ('tx', 'Texas'), a formatting error (a stray invisible character in '78626\u200e'), and one postal code (14150) from New York. On the whole, however, the data in the postal code field look relatively clean.
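
Most of these issues could be handled by extracting the five-digit code and discarding everything else. The sketch below is one way to do that; `clean_postcode` is a hypothetical helper, not part of the Audit class.

```python
import re

# A five-digit ZIP that is not part of a longer digit run, optionally
# followed by a ZIP+4 suffix that is discarded.
ZIP_RE = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?(?!\d)")


def clean_postcode(raw):
    """Return the five-digit ZIP found in `raw`, or None if there is none."""
    match = ZIP_RE.search(raw)
    return match.group(1) if match else None


# Values taken from the audit output above.
for raw in ["78704", "TX 78728", "78759-3504", "78626\u200e", "tx", "Texas"]:
    print(repr(raw), "->", clean_postcode(raw))
```

A format check like this does not catch an out-of-area code such as 14150. That would need a separate comparison against a list of known Austin-area postal codes.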

Next I'll survey the addr:city field. 

In [4]:
austin.audit_tags('addr:city')

Processing OSM file...
Data processed in 145.9873 secs. 
Resulting set of tag names: 

{'Georgetown', 'San Gabriel Village Boulevard', 'Creedmoor', 'Dripping Springs', 'Taylor, TX', 'West Lake Hills', 'Cedar Park', 'Cedar Park, TX', 'Elgin, TX', 'Lakeway', 'Austin, TX', 'Bastrop', 'Buda', 'Georgetown, TX', 'Austin, Tx', 'Ste 128, Austin', 'Cedar Creek', 'Manchaca', 'Spicewood, TX', 'Dripping Springs TX', 'Dale', 'Bastrop, TX', 'kyle', 'Leander, TX', 'San Marcos', 'Manor', 'Austin;TX;USA', 'Lago Vista', 'Manchaca,', 'Maxwell', 'Sunset Valley', 'Del Valle', 'Smithville', 'Lost Pines', 'Webberville', 'Dripping Springs, Tx', 'Bee Cave', 'Kyle', 'N Austin', 'Spicewood', 'Elgin', 'Wimberley', 'Pflugerville, TX', 'Kyle, TX', 'Westlake Hills, TX', 'Round Rock', 'Taylor', 'Hutto', 'Jonestown', 'Barton Creek', 'austin', 'Driftwood', 'Round Rock, TX', 'Austin', 'Pflugerville', 'Liberty Hill', 'Leander'}


This field has similar issues: inconsistent capitalization, state abbreviations appended to the city name, and street names or suite numbers entered where the city should be (for example, 'San Gabriel Village Boulevard'). Overall, though, this field is also relatively clean.
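
The capitalization and state-suffix problems are mechanical enough to fix with a small normalizer. This is a minimal sketch, assuming a hypothetical `clean_city` helper: it keeps the text before the first comma or semicolon, drops a trailing "TX", and title-cases the result.

```python
import re


def clean_city(raw):
    """Normalize a city value: drop state suffixes and fix capitalization."""
    # Keep only the part before the first comma or semicolon.
    first = re.split(r"[,;]", raw)[0].strip()
    # Remove a trailing state abbreviation such as "Dripping Springs TX".
    first = re.sub(r"\s+tx$", "", first, flags=re.IGNORECASE)
    return first.title()


# Values taken from the audit output above.
for raw in ["Austin, Tx", "Austin;TX;USA", "Dripping Springs TX", "kyle", "Manchaca,"]:
    print(repr(raw), "->", clean_city(raw))
```

Values such as 'Ste 128, Austin' or 'San Gabriel Village Boulevard' would not be repaired by this rule and would still need to be reviewed by hand.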

I'll also follow the lead of the lessons and take a look at the addr:street field, focusing on the street type and looking for inconsistencies in formatting. For this I'll rely on a dedicated method within the Audit class.

In [3]:
austin.audit_streets()

Processing OSM file...
Data processed in 142.4671 secs. 
Resulting set of tag values for street type: 

{'', 'Post', '#306', 'D1', 'Park', 'Lajitas', 'View', 'Media', 'Mohawk', 'Crescent', 'Cv', 'Run', 'Path', '#F-4', 'Tiempo', 'Tr', 'court', 'Royale', 'Road,1100', 'Wren', 'Lane', 'Dorado', '129', 'Alto', '#311', 'Boggy', '2244', '#B100', 'Jacinto', 'Tropez', 'Voyageurs', 'Pflugerville', 'Talamore', '45', 'Juniper', 'Fandango', 'Horn', 'Hat', 'Plaza', 'Toro', 'Limon', 'Highlander', 'Ave.', 'Edenderry', 'Fields', 'Grande', 'Liberty', 'Spicewood', '969', 'Ravine', '320', '104', 'Cave', 'Gonzales', 'Landing', 'Turn', 'Hollow', 'Mesa', 'Dale', 'Linda', 'RM1431', 'Calle', '1100', 'Quarry', 'Slew', 'Terrance', '#100', 'N', 'Gunsmoke', 'Vale', 'Blvd', 'Jr', '170', 'Bonanza', 'Barrhead', 'Rose', '1826', 'Highway', 'South', 'Claw', 'Harborway', 'Spring', '452', 'Flower', 'Willo', 'A500', 'Race', 'Oak', 'Saddles', 'F', '414', 'Birch', 'Flat', 'Tealwood', 'D5000', 'Arrow', 'Boulevard', 'Camelback

The street type data, in contrast with the postal codes, is a complete mess. The most common problem appears to be that users didn't bother to include a street type when entering a location address. Other issues include inconsistent abbreviations, inconsistent capitalization, and apartment numbers or letters appearing as the last element in the street field instead of a street type.
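
For reference, an audit of this kind usually comes down to taking the last token of each street name and collecting the ones that aren't on an expected list. The sketch below assumes that approach. `audit_street_types`, the `EXPECTED` set, and the sample street names are illustrative assumptions rather than the actual `audit_streets()` implementation.

```python
import re
from collections import defaultdict

# The last whitespace-delimited token of a street name.
STREET_TYPE_RE = re.compile(r"\S+$")

# Hypothetical list of acceptable street types.
EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Court", "Lane", "Road"}


def audit_street_types(street_names):
    """Map each unexpected street type to the street names that use it."""
    unexpected = defaultdict(set)
    for name in street_names:
        match = STREET_TYPE_RE.search(name.strip())
        street_type = match.group() if match else ""
        if street_type not in EXPECTED:
            unexpected[street_type].add(name)
    return unexpected


# Hypothetical street names showing the kinds of problems listed above.
names = ["Burnet Road", "South Congress Ave.", "Barton court",
         "Research Boulevard #306", ""]
result = audit_street_types(names)
print(sorted(result))
```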

The addr:street field would be an excellent candidate for munging, but since it was already demonstrated in the lessons, I'm going to focus instead on a different tag, population. My suspicion is that OSM population data relies on US Census data from 2010 and is probably outdated and contains significant inaccuracies. For this data wrangling exercise, I'll focus on updating the population tags for Austin and its satellite towns.

OpenStreetMap refers to cities, towns, villages, and hamlets collectively as settlements. Settlement nodes include a population tag and a population-dependent tag for the settlement type (one of those just listed) under the tag name 'place'. The table below outlines the possible values for the 'place' tag and how each relates to population.

Tag|Population|Description
---|----------|-----------
place=city|100,000+|
place=town|10,000 - 100,000|an urban settlement with local importance
place=village|<10,000|incorporated municipality, regardless of its population	
place=hamlet|<100|unincorporated settlement with less than 100 inhabitants
place=isolated_dwelling|<= 2 households|the smallest kind of human settlement
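
Read as plain thresholds, the table can be sketched as a small function. This is a simplification: `place_for_population` is a hypothetical helper, it ignores `isolated_dwelling` (which is defined by households rather than population), and it ignores the incorporation status mentioned in the descriptions.

```python
def place_for_population(population):
    """Return the 'place' value implied by the population thresholds."""
    if population >= 100_000:
        return "city"
    if population >= 10_000:
        return "town"
    if population >= 100:
        return "village"
    return "hamlet"


# Populations chosen to hit each threshold band.
for population in (239740, 17204, 1901, 80):
    print(population, "->", place_for_population(population))
```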

To clean and update these two data points, population and settlement type, I've created a Popul class with methods for parsing and cleaning the target data as well as for writing it to csv files and an SQL database.

In [5]:
import osm_popul_wrangle as popul

I'll start by creating a Popul object for the Austin data file, and since we've already seen the file stats, I'll jump right into processing the data. The .process_data() method parses the Austin XML file and extracts settlement nodes. All the node information is written to csv files. The method makes use of a shape_data() helper function that compares OSM population values to 2016 estimates and writes both values to the csv file. The values for the 'place' tags are also checked against the 2016 population estimates and updated accordingly. A boolean value marks whether the 'place' tag value has been changed.

The 2016 Texas population estimates are downloaded as a csv file from the Texas Demographic Center (TDC) website and parsed into a dictionary with settlement names as keys and two-element lists as values. Each list contains the 2010 Census figure and the 2016 TDC estimate. The TDC csv file is loaded and parsed in the .get_popul_est() method.
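
A minimal sketch of that parsing step, assuming a csv with one row per settlement. The column names (`name`, `census_2010`, `estimate_2016`) and the `load_estimates` helper are assumptions; the real TDC file layout and the internals of .get_popul_est() may differ.

```python
import csv
import io

# Stand-in for the TDC download; the column names are assumed.
SAMPLE_CSV = """name,census_2010,estimate_2016
Irving,216290,239740
Toco,75,80
"""


def load_estimates(handle):
    """Build {settlement: [census_2010, estimate_2016]} from a csv handle."""
    reader = csv.DictReader(handle)
    return {
        row["name"]: [row["census_2010"], row["estimate_2016"]]
        for row in reader
    }


pop_est = load_estimates(io.StringIO(SAMPLE_CSV))
print(pop_est["Irving"])
```

The values stay as strings here, matching the dictionary sample printed in the next cell, so they need converting with `int()` before any arithmetic.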

In [6]:
# Initialize Austin Popul object
aust_popul = popul.Popul(r'austin_texas.osm')
# Create estimates dictionary
aust_popul.get_popul_est()
# Print sample of population estimates dictionary
print(dict(list(aust_popul.pop_est.items())[:20]))

{'Irving': ['216290', '239740'], 'Laguna Heights CDP': ['3488', '4102'], 'Lakeside City (Archer)': ['997', '1010'], 'Angus': ['414', '438'], 'Lindale': ['4818', '5495'], 'Bryson': ['539', '558'], 'Louise CDP': ['995', '1005'], 'Toco': ['75', '80'], 'Cumby': ['777', '812'], 'Yantis': ['388', '396'], 'White Settlement': ['16116', '17204'], 'Tivoli CDP': ['479', '506'], 'Premont': ['2653', '2628'], 'Waller': ['2326', '2653'], 'Krugerville': ['1662', '1987'], 'Natalia': ['1431', '1485'], 'Venus': ['2960', '3349'], 'Alto Bonito Heights CDP': ['342', '361'], "Port O'Connor CDP": ['1253', '1175'], 'Kingsbury CDP': ['782', '873']}


With the Austin Popul object initialized and the estimates dictionary in place, we can get processing!

In [7]:
aust_popul.process_data()

Processing OSM file...
Data processed in 166.7125 secs. 
29 population tags found and cleaned.


With the data parsed, cleaned, and written to csv files, I'll now move to the next step of saving the csv data as tables in an SQL database and then examining the data. First the SQL database.

In [9]:
aust_popul.write_sql()
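
The write_sql() internals aren't shown. As a rough sketch, loading a csv into SQLite with the standard library might look like the following. The table name `settlement_popul` and the columns `name` and `osm_population` come from a query used later in this write-up; the `est_population` column and the two placeholder rows are hypothetical.

```python
import csv
import io
import sqlite3

# Placeholder rows, not real settlements.
SAMPLE_CSV = """name,osm_population,est_population
Example Town,1000,1200
Sample Village,500,500
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute(
    "CREATE TABLE settlement_popul ("
    "name TEXT PRIMARY KEY, osm_population INTEGER, est_population INTEGER)"
)

# Convert the csv strings to integers before inserting.
rows = [
    (row["name"], int(row["osm_population"]), int(row["est_population"]))
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
]
conn.executemany("INSERT INTO settlement_popul VALUES (?, ?, ?)", rows)
conn.commit()

# Count settlements whose OSM value differs from the estimate.
revised = conn.execute(
    "SELECT COUNT(*) FROM settlement_popul "
    "WHERE osm_population != est_population"
).fetchone()[0]
print("Number of revised populations =", revised)
```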

Next, I'll reference the SQL tables to get a count of how many population tags were revised.

In [10]:
aust_popul.get_pop_edits()

Number of revised populations = 29



Let's take a look at the settlements that were revised and the differences between the OSM population data and the TDC estimates. Also included are the means for overall population change and proportional population change.

In [11]:
aust_popul.get_pop_change_list()
aust_popul.get_averages()

          Settlement  Increase  Proportion
0      Sunset Valley       -60      -0.076
1          Jonestown      -103      -0.049
2               Hays         2       0.008
3          Wimberley        42       0.016
4      Mountain City        21       0.030
5       Liberty Hill       128       0.085
6            Bastrop       687       0.091
7             Thrall        79       0.093
8         Bear Creek        41       0.109
9         San Leanna        56       0.115
10            Taylor      1895       0.124
11       Rollingwood       171       0.125
12         Woodcreek       216       0.147
13         Creedmoor        28       0.147
14            Austin    131391       0.166
15      Pflugerville      9377       0.200
16        Round Rock     20181       0.202
17           Lakeway      2554       0.224
18        San Marcos     13165       0.293
19              Kyle     10309       0.368
20       Webberville       120       0.390
21        Georgetown     22009       0.518
22         

Central Texas is growing at a rapid pace and, as expected, most of the settlements in the Austin metro area saw at least some change in population. Some of the settlements closest to Austin and within reasonable commuting distance, such as Buda, Kyle, Hutto, and Manor, saw significant upward population shifts. Did any of these settlements also see a change in 'place' tag values? The Popul class includes a method for checking this against the SQL tables created earlier.

In [12]:
aust_popul.get_designation_changes()

Number of place designations changed = 1

  Settlement New Designation
0       Buda            town



Buda, Texas, just south of Austin, more than tripled in size and moved from the village classification to town by OSM 'place' tag standards.

Lastly, I'll check the sources for the OSM population data to test my initial hunch that most of it is sourced from the 2010 US Census. 

In [13]:
aust_popul.get_sources()

      Data Source  Count
0            None     23
1  US Census 2010      6



The auditing, wrangling, and cleaning I've set up programmatically should be applicable to the OSM data of any area within Texas. To check its applicability, I'll run the class on another Texas metro area OSM file, Houston.

In [14]:
# Initialize Houston Popul object
houst_popul = popul.Popul(r'houston_texas.osm')
# Create estimates dictionary
houst_popul.get_popul_est()

With a new Popul object initialized for a different Texas metro area, I'll run through the data.

In [15]:
houst_popul.process_data()

Processing OSM file...
Data processed in 93.5103 secs. 
101 population tags found and cleaned.


In [16]:
houst_popul.write_sql()
houst_popul.get_pop_edits()

Number of revised populations = 101



In [17]:
houst_popul.get_pop_change_list()
houst_popul.get_averages()

              Settlement  Increase  Proportion
0           Roman Forest     -1423      -0.428
1             Plum Grove      -354      -0.355
2         Surfside Beach      -292      -0.338
3           Todd Mission       -46      -0.289
4                 Bonney      -113      -0.260
5              Kendleton      -133      -0.248
6                Orchard      -113      -0.231
7          Meadows Place     -1470      -0.222
8                  Kemah      -497      -0.201
9               Woodloch       -50      -0.197
10     Brookside Village      -361      -0.181
11              Kenefick      -118      -0.165
12           Bayou Vista      -204      -0.120
13          Oyster Creek      -121      -0.098
14          Dayton Lakes       -10      -0.094
15              Daisetta       -97      -0.089
16     Clear Lake Shores      -105      -0.081
17         West Columbia      -269      -0.064
18       Oak Ridge North      -178      -0.053
19             Needville      -152      -0.044
20           

So far the Popul class has held up with a new data set, but the changes list presents some counterintuitive results that deserve follow-up. Again, most of Texas has experienced steady, if not accelerating, growth since the last census. Some negative growth values might be explained by migration to urban areas. Nonetheless, the negative growth values for some of the settlements in the list above don't make sense. I'll work with Roman Forest, Texas as an example. First let's check the TDC estimates in our estimates dictionary.

In [18]:
print(houst_popul.pop_est['Roman Forest'])

['1538', '1901']


I cross-checked these values ([US Census 2010, TDC 2016 Estimate]) against other online sources and they check out. This puts the OSM data in question. Let's look at the OSM population tag value saved to the SQL database.

In [20]:
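import sqlite3
# Open the project database and look up the stored OSM value for Roman Forest
p3_db = sqlite3.connect(r'p3_osm5')
curs = p3_db.cursor()
# Pass the name as a bound parameter instead of embedding it in the SQL string
curs.execute("SELECT name, osm_population FROM settlement_popul WHERE name = ?", ('Roman Forest',))
name, osm_popul = curs.fetchone()
print(name, osm_popul)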

Roman Forest 3324


The OSM population for Roman Forest, Texas is clearly inaccurate, which explains the apparent negative growth. At least in the case of Roman Forest (and more than likely in other cases), the population change figure is actually a measure of the degree of inaccuracy.
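
The arithmetic for Roman Forest makes the point concrete. The sketch below assumes the Proportion column is the change divided by the starting value, which is consistent with the figures in the table; `pop_change` is a hypothetical helper.

```python
def pop_change(start, end):
    """Return (absolute change, proportional change rounded to 3 places)."""
    increase = end - start
    return increase, round(increase / start, 3)


osm_value = 3324      # OSM population tag for Roman Forest
census_2010 = 1538    # US Census 2010 figure from the estimates dictionary
estimate_2016 = 1901  # TDC 2016 estimate from the estimates dictionary

# Change measured from the OSM value, as in the table above.
print(pop_change(osm_value, estimate_2016))    # (-1423, -0.428)
# Change measured from the census value.
print(pop_change(census_2010, estimate_2016))  # (363, 0.236)
```

Measured from the census figure, Roman Forest grew by roughly 24 percent. The 43 percent "decline" appears only because the OSM starting value is more than double the census count.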

In [21]:
houst_popul.get_designation_changes()

Number of place designations changed = 15

       Settlement New Designation
0        Pasadena            city
1    Jacinto City            town
2        Magnolia         village
3    Dayton Lakes          hamlet
4         Webster            town
5    Prairie View         village
6        Quintana         village
7   West Columbia         village
8        Pearland            city
9          Sweeny         village
10    League City            city
11        Liberty         village
12           Ames         village
13        Anahuac         village
14         Arcola         village



In [22]:
houst_popul.get_sources()

      Data Source  Count
0            None     92
1  US Census 2010      6
2       US Census      3



## Suggestions for Further Analyzing and Improving Data

As we saw with the address tags, user-entered data are a good place to start when improving the accuracy and reliability of OSM information. With regard to the population tag values, I would suggest investigating negative growth values as a likely sign of inaccurate OSM figures. This might involve corroborating population figures with a third data source.
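
One possible shape for such a check is sketched below. It flags an OSM value that falls well outside the range spanned by the census figure and the current estimate. The `is_suspect` helper and the 25 percent tolerance are assumptions chosen for illustration, not tested thresholds.

```python
def is_suspect(osm_value, census_2010, estimate_2016, tolerance=0.25):
    """Flag an OSM population far outside the census-to-estimate range."""
    low, high = sorted((census_2010, estimate_2016))
    return (osm_value < low * (1 - tolerance)
            or osm_value > high * (1 + tolerance))


# Roman Forest: the OSM value of 3324 is far above both reference figures.
print(is_suspect(3324, 1538, 1901))
# An OSM value that simply repeats the 2010 census figure is not flagged.
print(is_suspect(1538, 1538, 1901))
```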

## Conclusions

The Austin OSM data set offered abundant opportunities for data munging. Address tags were most in need of attention, and any use of OSM address data would have required a significant investment of time in auditing and cleaning. The population field I settled on, while ultimately less challenging programmatically, gave me practice in data corroboration and correction across sources. Where the address fields would have required correction and standardization, the population field basically amounted to an exercise in updating data, one that yielded a Python class I was able to apply to another metro area within Texas with consistent results. The population growth statistics gleaned during updating demonstrate that federal censuses are conducted too infrequently to be relied on as an accurate measure of a rapidly growing population across a whole decade. The state population estimates offer an interim proxy for more current and reliable figures.