Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A collection of geospatial bug fixes #4444

Merged
merged 1 commit into from Feb 20, 2018

Conversation

mistercrunch
Copy link
Member

@mistercrunch mistercrunch commented Feb 16, 2018

Also using geopy to parse single-column spatial info, supports many different types of lat/lng styles

p = Point(s)
return (p.latitude, p.longitude)

df[key] = df[spatial.get('lonlatCol')].apply(tupleify)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely benchmark this with our current data before deploying, since it's using regular expressions to do the job: https://github.com/geopy/geopy/blob/master/geopy/point.py#L310-L339

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [2]: from geopy.point import Point

In [3]: Point('234,239')
Out[3]: Point(54.0, -121.0, 0.0)

In [4]: %timeit Point('234,239')
The slowest run took 4.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.7 µs per loop

In [5]: %timeit (float(v) for v in '234,239'.split(','))
The slowest run took 6.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 593 ns per loop

So roughly double the time. On 1M points, that means 0.5 second which to me is fine as almost negligible compare to the network time it takes to bring that over. Note that there's probably a numpy way of doing this that would be much faster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, you mean 20x the time. Calling Point on 1M points would take ~12 seconds (11.7e-6 s * 1e6).

I looked at the code and I'm not sure how to optimize this with Numpy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh snap. 12 seconds is long.

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and we can look at improving the perf of Point later if we need.

@mistercrunch
Copy link
Member Author

Let's optimize Point later on.

@mistercrunch mistercrunch merged commit 5c35a2d into apache:master Feb 20, 2018
@mistercrunch mistercrunch deleted the geofixes branch February 20, 2018 22:41
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Apr 1, 2018
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.24.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels v0.23 🚢 0.24.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants