Although I migrated this project away from a notebook into `.py` files, I am starting to see that notebooks do have their benefits :). A notebook remembers the output of each chunk of code, lets you annotate it with Markdown, and lets you run only one cell at a time.

Of course, if you need something that runs as a standalone script, then I don't think a notebook is the right tool for the job, but I'm starting to grow fond of them.

Before we actually work with OpenCage, let's see what the requests return and how we can tweak them.

In [3]:
from opencage.geocoder import OpenCageGeocode
import dotenv
import os
dotenv.load_dotenv()
KEY = os.getenv('OPENCAGE')
geocoder = OpenCageGeocode(KEY)

First, let's have some actual addresses to test it with.

In [2]:
addr1 = '83 MOUNT VERNON ST'

addr2 = """FOREST HILLS ST & GLEN RD
JAMAICA PLAIN, MA 02130
UNITED STATES"""

# The current script chops the addresses, unfortunately
addr2_chopped = "FOREST HILLS ST & GLEN RD"

addr3 = """ARUNDEL ST & MOUNTFORT ST
BOSTON, MA 02215
UNITED STATES
"""

addr3_chopped = "ARUNDEL ST & MOUNTFORT ST"

We can try "just geocoding it":

In [4]:
geocoder.geocode(addr1)

[{'annotations': {'DMS': {'lat': "42° 21' 30.46608'' N",
    'lng': "71° 4' 2.47872'' W"},
   'FIPS': {'county': '25025', 'state': '25'},
   'MGRS': '19TCG2974791647',
   'Maidenhead': 'FN42li16wa',
   'Mercator': {'x': -7911181.794, 'y': 5186030.125},
   'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=405903359#map=17/42.35846/-71.06736',
    'note_url': 'https://www.openstreetmap.org/note/new#map=17/42.35846/-71.06736&layers=N',
    'url': 'https://www.openstreetmap.org/?mlat=42.35846&mlon=-71.06736#map=17/42.35846/-71.06736'},
   'UN_M49': {'regions': {'AMERICAS': '019',
     'NORTHERN_AMERICA': '021',
     'US': '840',
     'WORLD': '001'},
    'statistical_groupings': ['MEDC']},
   'callingcode': 1,
   'currency': {'alternate_symbols': ['US$'],
    'decimal_mark': '.',
    'disambiguate_symbol': 'US$',
    'html_entity': '$',
    'iso_code': 'USD',
    'iso_numeric': '840',
    'name': 'United States Dollar',
    'smallest_denomination': 1,
    'subunit': 'Cent',
    '

It returned quite a lot: the Massachusetts address comes first, but then there are also results in Detroit and even in other countries!

The easiest thing to do is to filter by country.

We can also try removing the annotations:

>  If you do not need the information provided in the annotations please set no_annotations=1.
> This enables us to do less work and significantly reduces the response size and thus reply more quickly. 

*API reference*: <https://opencagedata.com/api>. It includes a "best practices" section.

In [5]:
geocoder.geocode(addr1, countrycode='us', no_annotations=1) # unfortunately there is no autocompletion or documentation within Python

[{'bounds': {'northeast': {'lat': 42.3585968, 'lng': -71.0672842},
   'southwest': {'lat': 42.3583318, 'lng': -71.0674175}},
  'components': {'ISO_3166-1_alpha-2': 'US',
   'ISO_3166-1_alpha-3': 'USA',
   'ISO_3166-2': ['US-MA'],
   '_category': 'building',
   '_type': 'building',
   'city': 'Boston',
   'continent': 'North America',
   'country': 'United States',
   'country_code': 'us',
   'county': 'Suffolk County',
   'house_number': '83',
   'postcode': '02108',
   'road': 'Mount Vernon Street',
   'state': 'Massachusetts',
   'state_code': 'MA',
   'suburb': 'Beacon Hill'},
  'confidence': 10,
  'formatted': '83 Mount Vernon Street, Boston, MA 02108, United States of America',
  'geometry': {'lat': 42.3584628, 'lng': -71.0673552}},
 {'bounds': {'northeast': {'lat': 41.6520829, 'lng': -70.9400616},
   'southwest': {'lat': 41.6519695, 'lng': -70.9401799}},
  'components': {'ISO_3166-1_alpha-2': 'US',
   'ISO_3166-1_alpha-3': 'USA',
   'ISO_3166-2': ['US-MA'],
   '_category': 'build

The convenient OSM link is now gone. I guess we can leave annotations on while we figure out how to use this.

The places returned in order were:
* Boston, MA
* New Bedford, MA
* Somerville, MA
* Malden, MA
* Melrose, MA
* Lowell, MA
* Mount Vernon, GA
* Detroit, MI
* Worcester, MA

That is quite a lot of them!
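Besides changing the query itself, one option is to filter the returned list on the client side, using the `components` dictionary visible in the output above. This is only a sketch: `filter_by_components` is a hypothetical helper (not part of the `opencage` package), and the sample list is a hand-trimmed stand-in for a real response, with the Detroit components assumed by analogy with the Boston ones.

```python
def filter_by_components(results, **expected):
    """Keep only results whose 'components' match every expected key/value."""
    return [
        r for r in results
        if all(r.get('components', {}).get(key) == value
               for key, value in expected.items())
    ]

# Hand-trimmed sample shaped like the response above (not a live API call).
# The Detroit components are assumed, since that part of the output is cut off.
sample = [
    {'components': {'city': 'Boston', 'state_code': 'MA'},
     'formatted': '83 Mount Vernon Street, Boston, MA 02108, United States of America'},
    {'components': {'city': 'Detroit', 'state_code': 'MI'},
     'formatted': '83 Mount Vernon Street, Detroit, MI 48202, United States of America'},
]

boston_only = filter_by_components(sample, city='Boston', state_code='MA')
print([r['formatted'] for r in boston_only])
```

With real data, the same call would take the list returned by `geocoder.geocode(...)` in place of `sample`.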

What Eric did was to just append `, Boston, MA` to the address and to always use the first result:

In [5]:
def clean_address(addr):
    return addr.strip() + ", Boston, MA"

clean_address(addr1)

'83 MOUNT VERNON ST, Boston, MA'

So we can try again:

In [6]:
geocoder.geocode(clean_address(addr1), countrycode='us', no_annotations=1)

[{'bounds': {'northeast': {'lat': 42.3585968, 'lng': -71.0672842},
   'southwest': {'lat': 42.3583318, 'lng': -71.0674175}},
  'components': {'ISO_3166-1_alpha-2': 'US',
   'ISO_3166-1_alpha-3': 'USA',
   'ISO_3166-2': ['US-MA'],
   '_category': 'building',
   '_type': 'building',
   'city': 'Boston',
   'continent': 'North America',
   'country': 'United States',
   'country_code': 'us',
   'county': 'Suffolk County',
   'house_number': '83',
   'postcode': '02108',
   'road': 'Mount Vernon Street',
   'state': 'Massachusetts',
   'state_code': 'MA',
   'suburb': 'Beacon Hill'},
  'confidence': 10,
  'formatted': '83 Mount Vernon Street, Boston, MA 02108, United States of America',
  'geometry': {'lat': 42.3584628, 'lng': -71.0673552}},
 {'bounds': {'northeast': {'lat': 42.37483, 'lng': -71.0580528},
   'southwest': {'lat': 42.37473, 'lng': -71.0581528}},
  'components': {'ISO_3166-1_alpha-2': 'US',
   'ISO_3166-1_alpha-3': 'USA',
   'ISO_3166-2': ['US-MA'],
   '_category': 'building'

Once we constrain the query to Boston, just the first 2 results already share the same street address.

![first result](https://matrix.mit.edu/media/EOWgbQUfzMgssslXNvSdGids)

Only the first result is identified (reverse geocoded) by Google Maps as our desired address.

![second result](https://matrix.mit.edu/media/qijfNUMXxXMmaawtNYuOSAaV)
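To get a feel for how far apart these two same-named results are, here is a rough check using the haversine formula. `haversine_km` is a helper written for this note, not something OpenCage provides. The first point is the `geometry` of the first result; the second is the midpoint of the second result's `bounds`, since its `geometry` is cut off in the output above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km is the mean Earth radius

first = (42.3584628, -71.0673552)  # geometry of the first result
second = (42.37478, -71.0581028)   # midpoint of the second result's bounds

distance = haversine_km(first, second)
print(f"{distance:.2f} km apart")
```

The two points come out roughly 2 km apart, so picking the wrong one would be a noticeable error at city scale.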

Since we only care about the coordinates and address, let's have a convenience function for this.

In [12]:
def geocode(addr):
    results = geocoder.geocode(addr, countrycode='us', no_annotations=1)
    return [
        (result['formatted'], (result['geometry']['lat'], result['geometry']['lng']))
        for result in results
    ] 

In [13]:
geocode(addr1)

[('83 Mount Vernon Street, Boston, MA 02108, United States of America',
  (42.3584628, -71.0673552)),
 ('83 Mount Vernon Street, New Bedford, MA 02746, United States of America',
  (41.6520262, -70.9401207)),
 ('83 Mount Vernon Street, Somerville, MA 02145, United States of America',
  (42.3826346, -71.0817058)),
 ('83 Mount Vernon Street, Malden Centre, Malden, MA 02148, United States of America',
  (42.4321091, -71.0626237)),
 ('83 Mount Vernon Street, Wyoming, Melrose, MA 02176, United States of America',
  (42.4493622, -71.0639848)),
 ('81;83 Mount Vernon Street, Lowell, MA 01854, United States of America',
  (42.646815, -71.3259686)),
 ('83 Mount Vernon Street, Mount Vernon, Montgomery County, GA 30445, United States of America',
  (32.1795359, -82.5952588)),
 ('83 Mount Vernon Street, Detroit, MI 48202, United States of America',
  (42.3760953, -83.076043)),
 ('83 Mount Vernon Street, Chandler Hill, Worcester, MA 01605, United States of America',
  (42.2761316, -71.7879753))]

In [14]:
geocode(clean_address(addr1))

[('83 Mount Vernon Street, Boston, MA 02108, United States of America',
  (42.3584628, -71.0673552)),
 ('83 Mount Vernon Street, Boston, MA 02129, United States of America',
  (42.37478, -71.0581028)),
 ('83 Mount Vernon Street, Boston, MA 02135, United States of America',
  (42.3473389, -71.1568004)),
 ('83 Mount Vernon Street, Boston, MA 02132, United States of America',
  (42.2849704, -71.1600953)),
 ('83 Mount Vernon Street, Boston, MA 02125, United States of America',
  (42.3215379, -71.0561875)),
 ('Mount Vernon St, Boston, MA, United States of America',
  (42.321558, -71.054962)),
 ('Boston, Massachusetts, United States of America', (42.35843, -71.05977))]
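The "always use the first result" rule can be sketched as a tiny helper on top of the list of `(formatted, (lat, lng))` pairs that `geocode` returns. `first_match` is a hypothetical name; the sample pairs are copied from the output above so the block runs without an API key.

```python
def first_match(pairs):
    """Return the first (formatted, (lat, lng)) pair, or None if the list is empty."""
    return pairs[0] if pairs else None

# First two pairs copied from the geocode(clean_address(addr1)) output above
pairs = [
    ('83 Mount Vernon Street, Boston, MA 02108, United States of America',
     (42.3584628, -71.0673552)),
    ('83 Mount Vernon Street, Boston, MA 02129, United States of America',
     (42.37478, -71.0581028)),
]

best = first_match(pairs)
print(best)
print(first_match([]))  # an empty result list gives None instead of an IndexError
```

Returning `None` for an empty list keeps the caller from crashing when a query has no matches at all.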

Let's try the other addresses:

In [15]:
geocode(addr2)

[('Jamaica Plain, Massachusetts, United States of America',
  (42.308353, -71.100431)),
 ('Suffolk County, MA 02130, United States of America', (42.3126, -71.1115))]

It seems like it doesn't like the format.

In [19]:
print(clean_address(addr2_chopped))
geocode(clean_address(addr2_chopped))

FOREST HILLS ST & GLEN RD, Boston, MA


[('Boston, Massachusetts, United States of America', (42.312169, -71.066139)),
 ('Boston, Massachusetts, United States of America', (42.35843, -71.05977))]

Wow, that is so unspecific.
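One way to flag such vague hits automatically might be to look at the raw `components` (which the `geocode` convenience function throws away) and check whether a `road` key is present, as it is for the building-level result earlier. This is an assumption: the raw output for the city-level hits is not shown here, so the second sample entry below is only a guess at its shape, and `is_street_level` is a hypothetical helper.

```python
def is_street_level(result):
    """True if the raw result names a road, i.e. is more specific than a city."""
    return 'road' in result.get('components', {})

# First entry trimmed from the raw output shown earlier;
# second entry is an ASSUMED shape for a city-level hit.
raw_results = [
    {'components': {'city': 'Boston', 'house_number': '83',
                    'road': 'Mount Vernon Street', 'state_code': 'MA'},
     'formatted': '83 Mount Vernon Street, Boston, MA 02108, United States of America'},
    {'components': {'city': 'Boston', 'state_code': 'MA'},
     'formatted': 'Boston, Massachusetts, United States of America'},
]

specific = [r['formatted'] for r in raw_results if is_street_level(r)]
print(specific)
```

If the filtered list is empty, the address could be set aside for manual review rather than being silently placed at the city centre.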

In [20]:
# Let's try to manually correct the address
addr2_corrected = "FOREST HILLS ST & GLEN RD, JAMAICA PLAIN, MA 02130, UNITED STATES"
geocode(addr2_corrected)

[('Jamaica Plain, Massachusetts, United States of America',
  (42.308353, -71.100431)),
 ('Suffolk County, MA 02130, United States of America', (42.3126, -71.1115))]

At least it says it's in Jamaica Plain, but that is still quite unspecific. I guess OpenCage is not good at this type of address (an intersection of 2 streets).
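The manual correction above can be automated instead of chopping the address down to its first line. A minimal sketch, where `flatten_address` is a hypothetical helper name:

```python
def flatten_address(addr):
    """Join the non-empty lines of a multi-line address with ', '."""
    lines = (line.strip() for line in addr.splitlines())
    return ", ".join(line for line in lines if line)

addr2 = """FOREST HILLS ST & GLEN RD
JAMAICA PLAIN, MA 02130
UNITED STATES"""

flattened = flatten_address(addr2)
print(flattened)
```

This produces the same string as `addr2_corrected`, and it also copes with a trailing newline like the one in `addr3`.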

Let's try the third address

In [21]:
geocode(addr3_chopped)

[]

No results: I forgot to append the Boston suffix.

In [23]:
geocode(clean_address(addr3_chopped))

[('Boston, Massachusetts, United States of America', (42.347549, -71.103523))]

RIP.

In [24]:
geocode(addr3)

[('Boston, Massachusetts, United States of America', (42.347549, -71.103523)),
 ('Suffolk County, MA 02215, United States of America', (42.3471, -71.1027))]

Okay, in Google Maps it is *sort of* close to the second street (Mountfort St).

And for the second address, the result does sort of land near the two named streets.

> FOREST HILLS ST & GLEN RD, JAMAICA PLAIN, MA 02130, UNITED STATES

![address 3](https://matrix.mit.edu/media/QLXITettpqgdjcHQBOPKHVAX)
