carlospadron/spatial_knn

spatial_knn

A compilation of solutions to the k-nearest-neighbours (KNN) problem on spatial data.

See the notebook for a speed comparison.

When to use what:

  • Geopandas
    • the data fits in memory
    • you don't plan to write complex queries
  • scikit-learn
    • all the reasons from the Geopandas point
    • your data consists only of points and Geopandas wasn't fast enough
  • SQL
    • the data does not fit in memory
    • you plan to use many other tables already present in the database
    • you plan to write complex queries
  • Snowflake (PAID SERVICE)
    • all the reasons from the SQL point
    • you want a very fast result or your data is massive
    • you are happy to rewrite your query to fit Snowflake's SQL dialect
  • Redshift (PAID SERVICE)
    • all the reasons from the SQL point
    • you want a very fast result or your data is massive
    • you are happy to rewrite your query to fit Redshift's SQL dialect
  • BigQuery (PAID SERVICE)
    • all the reasons from the SQL point
    • you want a very fast result or your data is massive
    • you are happy to use geographic (longitude/latitude) coordinates
    • you are happy to rewrite your query to fit BigQuery's SQL dialect
  • Athena (PAID SERVICE)
    • all the reasons from the SQL point
    • your data is massive
    • you are happy to rewrite your query to fit Athena's SQL dialect
    • you are OK with execution times similar to PostgreSQL's when the data would also fit in PostgreSQL
  • Shapely
    • you don't want to use Geopandas but you want to stay within Python
  • Scala/Kotlin
    • the data fits in memory
    • you are writing a larger program in which KNN is only one part, and it needs to run fast
  • Rust
    • all the reasons from the Scala/Kotlin point
    • Scala/Kotlin wasn't fast enough
    • you are happy to work with a smaller ecosystem of libraries
  • PySpark
    • your data is massive and needs to be distributed across a cluster
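
For reference, the problem that every option above solves can be sketched as a brute-force search in plain Python. This is only a minimal illustration, not code from this repository: the function name `knn_brute_force`, the dict-based input format and the sample points are all assumptions made for the example, and it handles points in a planar coordinate system only. It compares every origin with every candidate, which is the cost that the libraries and databases listed above are meant to reduce.

```python
import heapq
import math


def knn_brute_force(origins, candidates, k=1):
    """Return, for each origin point, its k nearest candidates.

    origins, candidates: dicts mapping an id to an (x, y) tuple in a
    projected (planar) coordinate system. The result maps each origin id
    to a list of (distance, candidate_id) pairs, nearest first.
    """
    result = {}
    for origin_id, origin_xy in origins.items():
        # Distance from this origin to every candidate (a full linear scan).
        scored = (
            (math.dist(origin_xy, cand_xy), cand_id)
            for cand_id, cand_xy in candidates.items()
        )
        # Keep only the k smallest distances.
        result[origin_id] = heapq.nsmallest(k, scored)
    return result


# Hypothetical sample data, for illustration only.
origins = {"a": (0.0, 0.0), "b": (10.0, 10.0)}
candidates = {"p": (1.0, 0.0), "q": (0.0, 3.0), "r": (9.0, 10.0)}

neighbours = knn_brute_force(origins, candidates, k=2)
```

With the sample data, `neighbours["a"]` lists candidate `p` first and then `q`. The work grows with the number of origins multiplied by the number of candidates, so a sketch like this is only practical for small inputs that fit in memory.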