I downloaded OpenStreetMap data for the Berkeley area using the Overpass API query form:
http://overpass-api.de/query_form.html
With query:
(node(37.7983,-122.3504,37.8873,-122.1929);<;);out meta;

Saved the file as berkeley.osm

In [3]:
import xml.etree.cElementTree as ET
import re
from collections import defaultdict
import pprint
import codecs
import json

In [4]:
osm_file = 'berkeley.osm'

In [5]:
expected_street_type_list = ['Street','Avenue','Boulevard','Drive','Court',
                             'Place','Alameda','Broadway','Road','Parkway','Way',
                            'Plaza','Square','Telegraph']
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
special_chars_re = re.compile(r'[=\+/&<>;"\?%#$@\,\.]')
abbreviations_re = re.compile(r'[Bb]e?twe?e?n ')
cardinal_direction_re = re.compile(r'[\s]*[NSEW][\s]+')
starts_numeric_re = re.compile(r'^[\d]+')
numbered_street_re = re.compile(r'^[\d]+(st|nd|rd|th)')

#find numbers that start street names, except for 
#1st, 2nd, 3rd 4th, 5th street etc
#numbers probably should go with the street number field
non_numeric_re = re.compile(r'^(\d+)') 
word_replace = {'Btwn': 'Between',
               'btwn': 'between',
               'St': 'Street',
               'St.': 'Street',
               'Ct': 'Court',
               'Ct.': 'Court',
               'Pl': 'Plaza',
               'Pl.': 'Plaza',
               'Ave': 'Avenue',
               'Ave.': 'Avenue',
               'Sq': 'Square',
               'Sq.': 'Square'}

In [6]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

In [7]:
def is_postal(elem):
    return (elem.attrib['k'] == "addr:postcode")

In [8]:
def is_housenumber(elem):
    return (elem.attrib['k'] == "addr:housenumber")

In [9]:
def is_lanes(elem):
    return (elem.get('k') == "lanes")

In [10]:
def is_amenity(elem):
    return (elem.get('k') == "amenity")

In [11]:
def audit_street(street_dict, val):
    #m = street_type_re.search(val)
    #if m:
    #    street_type = m.group()
    #    if street_type not in expected_street_type_list:
    #        street_dict[street_type].add(val)
    #m = special_chars_re.search(val)
    #if m:
    #    street_dict[m.group()].add(val)
    #m = abbreviations_re.search(val)
    #if m:
    #    street_dict[m.group()].add(val)
    #m = cardinal_direction_re.search(val)
    #if m:
    #    street_dict[m.group()].add(val)
    m = starts_numeric_re.search(val)
    n = numbered_street_re.search(val)
    if m and not n:
        street_dict[m.group()].add(val)

In [12]:
def audit_postal(audit_dict,postal):
    postal_length = len(postal)
    audit_dict[postal_length].add(postal)

In [13]:
def audit_housenumber(audit_dict,val):
    m = non_numeric_re.search(val)
    if m:
        audit_dict[m.group()].add(val)

In [14]:
def audit_lanes(audit_dict,val):
    audit_dict[val] = val

In [15]:
def audit_amenity(audit_dict,val):
    audit_dict[val] = val

In [16]:
def audit(osm_file,audit_dict):
    counter = 0
    counter_max = 500000
    # use "end" events: on "start" the child <tag> elements may not be parsed yet
    for event, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag in ["way",'node']:
            counter +=1
            for tag in elem.iter("tag"):
                val = tag.attrib['v']
                #if is_street_name(tag):
                #    audit_street(audit_dict,tag.attrib['v'])
                if is_postal(tag):
                    audit_postal(audit_dict,tag.attrib['v'])
                #if is_housenumber(tag):
                #    audit_housenumber(audit_dict,tag.attrib['v'])
                #if is_lanes(tag):
                #    audit_lanes(audit_dict,val)
                #if is_amenity(tag):
                #    audit_amenity(audit_dict,val)

        else:
            continue
        if counter > counter_max:
            break
    pprint.pprint(dict(audit_dict))

In [17]:
if __name__ == '__main__':
    #street_type_dict = defaultdict(set)
    #audit(osm_file,street_type_dict)
    audit_dict = defaultdict(set)
    audit(osm_file,audit_dict)

{2: set(['ca']),
 5: set(['93710',
         '94109',
         '94110',
         '94601',
         '94602',
         '94605',
         '94606',
         '94607',
         '94608',
         '94609',
         '94610',
         '94611',
         '94612',
         '94618',
         '94702',
         '94703',
         '94704',
         '94705',
         '94706',
         '94707',
         '94708',
         '94709',
         '94710',
         '94720',
         '95476']),
 8: set(['CA 94607']),
 10: set(['94612-2202', '94720-1076'])}


Some expected street types are Plaza, Court, and Square.  The abbreviations Ct and Pl can be replaced by Court and Plaza.

I noticed that some street names are more like descriptions of intersections, or of the stretch between two streets.  The abbreviation [Bb]twn can be replaced with 'between'.  The @ can be replaced with 'at'.
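
A minimal sketch of how these replacements could be applied word by word. The mapping STREET_WORD_REPLACE and the function clean_street_name are hypothetical names used only for illustration; the mapping follows the replacements described in these notes and is not the final cleaning code.

```python
# Hypothetical mapping for illustration; it mirrors the replacements
# described in the notes above (Ct -> Court, Pl -> Plaza, @ -> at, ...).
STREET_WORD_REPLACE = {
    'St': 'Street', 'St.': 'Street',
    'Ct': 'Court', 'Ct.': 'Court',
    'Pl': 'Plaza', 'Pl.': 'Plaza',
    'Ave': 'Avenue', 'Ave.': 'Avenue',
    'Sq': 'Square', 'Sq.': 'Square',
    'Btwn': 'Between', 'btwn': 'between',
    '@': 'at',
}

def clean_street_name(name):
    """Replace known abbreviations word by word; other words are untouched."""
    # Pad '@' with spaces so that 'Broadway@14th' also splits into words.
    words = name.replace('@', ' @ ').split()
    return ' '.join(STREET_WORD_REPLACE.get(w, w) for w in words)

print(clean_street_name('Broadway @ 14th St.'))
```

Matching whole words rather than substrings avoids rewriting the 'St' inside a name such as 'Stanford'.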

Also, a few street names include the street number, such as '111 Grand Avenue' and '3605 Telegraph'.  I may want to move the number into the addr:housenumber field.
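
One possible way to separate a leading number from the street name, sketched under the assumption that the number is always followed by whitespace. The names leading_number_re and split_leading_number are hypothetical. Numbered streets such as '4th Street' are left alone because their digits are followed by letters, not a space.

```python
import re

# Leading digits, then whitespace, then the rest of the name.
# '4th Street' does not match: the digit is followed by 't', not a space.
leading_number_re = re.compile(r'^(\d+)\s+(.+)$')

def split_leading_number(street):
    """Return (housenumber, street); housenumber is None if nothing was split."""
    m = leading_number_re.match(street)
    if m:
        return m.group(1), m.group(2)
    return None, street

print(split_leading_number('111 Grand Avenue'))
```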

For postcode, some include the state abbreviation 'CA' followed by the 5-digit postal code, some have the 'ca' state only, and some have the hyphenated zip code extension with four additional digits.  If there are non-numeric values in postcode, I will remove the letters.  I'll also remove the hyphen and extension.
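
A rough sketch of the postcode cleanup. Instead of deleting letters and the extension in separate steps, this version keeps the first run of five digits, which gives the same result for the values seen in the audit. The names five_digit_re and clean_postcode are hypothetical, and returning None for values with no digits (such as 'ca') is an assumption about how to handle them.

```python
import re

# First run of exactly five digits anywhere in the value.
five_digit_re = re.compile(r'\d{5}')

def clean_postcode(postcode):
    """Return the 5-digit postal code, or None if the value has none."""
    m = five_digit_re.search(postcode)
    return m.group() if m else None

for raw in ['94704', 'CA 94607', '94612-2202', 'ca']:
    print(raw, '->', clean_postcode(raw))
```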

For housenumber, some include alpha chars and dashes to represent a range, or are comma separated.  For those that are '-' or ',' separated, it might make sense to turn that into a list.
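
A small sketch of turning '-' or ',' separated house numbers into a list. The function name split_housenumber is hypothetical. Note that a range such as '2120-2122' is only split into its two endpoints here; expanding it into every number in the range is left out, since I can't tell from the data whether both sides of the street are included.

```python
import re

def split_housenumber(value):
    """Split a housenumber on ',' or '-' and drop empty pieces."""
    parts = [p.strip() for p in re.split(r'[,-]', value)]
    return [p for p in parts if p]

print(split_housenumber('2120-2122'))
print(split_housenumber('100, 102,104'))
```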

For lanes, one of the values is 18 lanes.

In [109]:
<way id="236348366" version="5" timestamp="2015-04-26T17:19:49Z" changeset="30510468" uid="616774" user="mueschel">
    <nd ref="293598417"/>
    <nd ref="667724547"/>
    <tag k="bicycle" v="no"/>
    <tag k="hgv" v="no"/>
    <tag k="highway" v="motorway"/>
    <tag k="lanes" v="18"/>
    <tag k="maxspeed" v="50 mph"/>
    <tag k="oneway" v="yes"/>
    <tag k="ref" v="I 80"/>
    <tag k="tiger:cfcc" v="A11"/>
    <tag k="tiger:county" v="Alameda, CA"/>
    <tag k="tiger:name_base" v="I-80"/>
    <tag k="toll" v="yes"/>
  </way>


Looking at the node associated with this way, it's a toll booth, so 18 lanes probably makes sense.  It's probably the toll plaza for the San Francisco-Oakland Bay Bridge.

In [110]:
  <node id="293598417" lat="37.8247804" lon="-122.3138369" version="6" timestamp="2015-04-26T17:19:50Z" changeset="30510468" uid="616774" user="mueschel">
    <tag k="barrier" v="toll_booth"/>
    <tag k="lanes" v="18"/>


Amenity has some values that are similar and can probably be standardized:

Standardize these
 'car_share'
 'car_sharing'
 
 These are different: a post_box is probably just a box for dropping off letters, whereas a post_office has a name and refers to a post office building with employees
 'post_box'
 'post_office'
 
 Simplify these to just parking
 'parking'
 'parking_entrance'
 'parking_space'
 

The values 'college' and 'university' are used somewhat interchangeably, so it may make sense to put these in a sub-dictionary

 This may be useful information for taxi drivers
 'toilets'
 


Toilets are either denoted by k="amenity" v="toilets", or in the case of BART public train stations, by k="toilets" v="yes"
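
The amenity notes above could be sketched as a lookup table plus a small helper for the two toilet conventions. The names AMENITY_REPLACE, clean_amenity, and has_toilets are hypothetical, and choosing 'car_sharing' and 'parking' as the standard values is my assumption. The helper expects the tags of one element as a plain dict of k -> v.

```python
# Hypothetical mapping; the values on the right are the ones I assume
# should be kept. post_box and post_office are deliberately not merged.
AMENITY_REPLACE = {
    'car_share': 'car_sharing',
    'parking_entrance': 'parking',
    'parking_space': 'parking',
}

def clean_amenity(value):
    """Map an amenity value to its standardized form, if one is defined."""
    return AMENITY_REPLACE.get(value, value)

def has_toilets(tags):
    """True for amenity=toilets, or for toilets=yes (as on BART stations)."""
    return tags.get('amenity') == 'toilets' or tags.get('toilets') == 'yes'

print(clean_amenity('car_share'))
print(has_toilets({'railway': 'station', 'toilets': 'yes'}))
```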

Clean data