Add flag for multiple matches to /geocode/multiple for detecting low quality addresses. #32

Closed
JustinBerger opened this issue Oct 19, 2016 · 5 comments

@JustinBerger

When running /geocode, low-quality addresses return multiple matches with similar scores. With /geocode/multiple, however, it becomes impossible to tell that an address is low quality. If we had some way to know that there were multiple 'good' matches, we could flag those records for later processing using the non-batched /geocode.
For example, 201 main street and SLC (with acceptScore = 90) in the single geocode returns:

{
 "result": {
  "location": {
   "x": 424785.93550203653,
   "y": 4514020.436742754
  },
  "score": 90.97,
  "locator": "Centerlines.StatewideRoads",
  "matchAddress": "201 N MAIN ST, SALT LAKE CITY",
  "inputAddress": "201 main street, slc",
  "addressGrid": "SALT LAKE CITY",
  "candidates": [
   {
    "address": "201 S MAIN ST, SALT LAKE CITY",
    "location": {
     "x": 424827.46269343514,
     "y": 4512996.445548089
    },
    "score": 90.87,
    "locator": "AddressPoints.AddressGrid",
    "addressGrid": "SALT LAKE CITY"
   }
  ]
 },
 "status": 200
}

The multi geocode returns:

{
 "result": {
  "addresses": [
   {
    "id": 1,
    "location": {
     "x": 424785.93550203653,
     "y": 4514020.436742754
    },
    "score": 90.97,
    "locator": "Centerlines.StatewideRoads",
    "matchAddress": "201 N MAIN ST, SALT LAKE CITY",
    "inputAddress": "201 main street, slc",
    "addressGrid": "SALT LAKE CITY"
   }
  ]
 },
 "status": 200
}

The suggestion is to add some sort of indicator that multiple matches were found.

{
 "result": {
  "addresses": [
   {
    "id": 1,
    "location": {
     "x": 424785.93550203653,
     "y": 4514020.436742754
    },
    "score": 90.97,
    "locator": "Centerlines.StatewideRoads",
    "matchAddress": "201 N MAIN ST, SALT LAKE CITY",
    "inputAddress": "201 main street, slc",
    "addressGrid": "SALT LAKE CITY",
    "multipleDetected": true
   }
  ]
 },
 "status": 200
}
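
For illustration, a client can already derive a flag like this from the single /geocode response. This is a minimal sketch, not the API's implementation: the function name `multiple_detected` is hypothetical, and it assumes "multiple" means that at least one candidate also reaches the caller's acceptScore.

```python
def multiple_detected(result: dict, accept_score: float) -> bool:
    """True when the top match and at least one candidate both reach accept_score."""
    if result.get("score", 0.0) < accept_score:
        return False
    return any(c["score"] >= accept_score for c in result.get("candidates", []))


# Trimmed copy of the single-geocode "result" object shown above.
single = {
    "score": 90.97,
    "matchAddress": "201 N MAIN ST, SALT LAKE CITY",
    "candidates": [{"address": "201 S MAIN ST, SALT LAKE CITY", "score": 90.87}],
}

print(multiple_detected(single, accept_score=90))  # True
```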
@JustinBerger JustinBerger changed the title Add match count to /geocode/multiple for detecting low quality addresses. Add flag for multiple matches to /geocode/multiple for detecting low quality addresses. Oct 19, 2016
@steveoh
Member

steveoh commented Oct 19, 2016

Thanks, I will consider this!

It was a deliberate decision not to show candidates when doing multiple geocodes, for payload size and similar reasons. What decisions would you make from knowing that there were multiple candidates?

For the time being you could switch to single geocodes. We won't tell the neighbors.

I ask because the example you give is an invalid address: it's missing a direction. Because of that, you are getting addresses on both sides of the north temple. Based on your input, there is no clear way to determine which address is correct. I'm not sure why a tenth of a score point was given to North over South, but it really should be a tie.

@JustinBerger
Author

JustinBerger commented Oct 25, 2016

Yeah. I figured the output was intentionally simplified.

The biggest thing I would like to do with the enhancement is use it to detect invalid addresses. It is nearly impossible to detect invalid addresses using SQL (too many exceptions to the rules), so if we ran into a record with multiple possible geocodes, we could mark it for manual processing. I don't want to use the single geocode for the first pass because we have a lot of data, and many of the records are going to be pretty solid matches. But you will notice both of these were > 90, so score alone doesn't tell you if the address is vague. I would also be marking anything with a low score, or without any successful matches.

We have a team who connects records from our various sources, and it would be quite easy to give them an interface which pulls up each marked record, which uses the single match to give them a drop down of options. Then a human could decide which is the right option, or mark the record to be corrected.
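
The marking rules described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `triage` function, the `low_score` cutoff of 95, the idea that unmatched records are simply absent from the `addresses` array, and the `multipleDetected` field itself, which is only the proposal from this issue and not an existing API property.

```python
def triage(addresses, submitted_ids, low_score=95.0):
    """Return (id, reason) pairs for records that need manual processing."""
    by_id = {a["id"]: a for a in addresses}
    flagged = []
    for record_id in submitted_ids:
        match = by_id.get(record_id)
        if match is None:
            # Assumption: a record with no successful match is missing from the batch result.
            flagged.append((record_id, "no match"))
        elif match.get("multipleDetected", False):
            # Hypothetical flag proposed in this issue.
            flagged.append((record_id, "multiple matches"))
        elif match["score"] < low_score:
            flagged.append((record_id, "low score"))
    return flagged


batch = [
    {"id": 1, "score": 90.97, "multipleDetected": True},
    {"id": 2, "score": 100.0, "multipleDetected": False},
    {"id": 3, "score": 91.0},
]
for record_id, reason in triage(batch, submitted_ids=[1, 2, 3, 4]):
    print(record_id, reason)
```

Each flagged id could then be re-run through the single geocode to populate the drop-down of options for a human reviewer.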

@steveoh
Member

steveoh commented Oct 25, 2016

I think every geocode will always have candidates if you ask for them. I would suggest doing multiple iterations over the data, possibly starting with the acceptScore set to 100 to pull out the good addresses. Then, with the remaining addresses that didn't match, lower the acceptScore. Anything under, say, 95 would need inspection.
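
A rough sketch of that multi-pass approach, under stated assumptions: `multi_pass` and its parameters are hypothetical names, the `geocode` argument stands in for whatever client call actually hits the API (it is expected to return `None` when nothing meets the acceptScore), and the score thresholds are only the example values from this comment.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def multi_pass(
    addresses: Iterable[str],
    geocode: Callable[[str, float], Optional[dict]],
    accept_scores: Tuple[float, ...] = (100, 95, 90),
    inspect_below: float = 95,
) -> Tuple[Dict[str, dict], List[str]]:
    """Geocode with a strict acceptScore first, then retry the misses with lower ones."""
    matched: Dict[str, dict] = {}
    remaining = list(addresses)
    for accept_score in accept_scores:
        unmatched = []
        for address in remaining:
            result = geocode(address, accept_score)
            if result is None:
                unmatched.append(address)
                continue
            matched[address] = {
                "score": result["score"],
                "pass": accept_score,
                "needs_inspection": result["score"] < inspect_below,
            }
        remaining = unmatched
    return matched, remaining


# Stand-in geocoder with made-up scores, so the sketch runs without network access.
FAKE_SCORES = {"a": 100.0, "b": 96.5, "c": 90.97, "d": 50.0}


def fake_geocode(address: str, accept_score: float) -> Optional[dict]:
    score = FAKE_SCORES[address]
    return {"score": score} if score >= accept_score else None


matched, leftover = multi_pass(FAKE_SCORES, fake_geocode)
print(matched)
print(leftover)
```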

I'm not sure that a multipleDetected flag would be helpful, given it would always be true. I could see it being useful, though, to set a property when the delta between the first- and second-place matches is small, say within 5 points.

@JustinBerger
Author

JustinBerger commented Oct 26, 2016

True. I had been thinking along the lines of the candidates also matching the acceptScore, but I really like your delta idea better. Knowing that the 1st and 2nd place records are within maxDelta = 5 of each other gives a lot of flexibility to make better decisions about the matches.

If I were processing with acceptScore = 90 and ran into an 89.9 and a 90.1, I would never know about the lower score. Knowing that delta = 0.2, or that delta <= maxDelta, tells me (regardless of my choice of acceptScore) that it's going to need a human to look at it.

I was already planning on the algorithm making multiple passes, so I think I can implement my own version of the delta from the single match. I do worry that I'm going to have too few hits on the first pass with acceptScore = 100, but I'll run some tests to see what the hit rate looks like.
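
A client-side version of the delta check might look something like this. It is only a sketch: `score_delta`, `needs_review`, and `max_delta` are hypothetical names, and it assumes the single-geocode result includes the `candidates` array shown earlier in this thread.

```python
from typing import Optional


def score_delta(result: dict) -> Optional[float]:
    """Gap between the match score and the best candidate score, or None without candidates."""
    candidates = result.get("candidates", [])
    if not candidates:
        return None
    runner_up = max(c["score"] for c in candidates)
    return abs(result["score"] - runner_up)


def needs_review(result: dict, max_delta: float = 5.0) -> bool:
    """Flag a result for a human when first and second place are within max_delta."""
    delta = score_delta(result)
    return delta is not None and delta <= max_delta


close_call = {"score": 90.1, "candidates": [{"score": 89.9}]}
clear_win = {"score": 98.0, "candidates": [{"score": 80.0}]}

print(needs_review(close_call))  # True
print(needs_review(clear_win))   # False
```

Because the check looks only at the gap between scores, it flags the 89.9 / 90.1 case regardless of which acceptScore was used for the request.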

@steveoh
Member

steveoh commented Oct 26, 2016

acceptScore only works on the match address and does not filter the candidates. Let's consider moving forward with the delta as this could go into a 1.x release. Modifying the acceptScore and candidates would be considered a breaking change.
