How to use set_geometry from Dask DF and export as geopackage? #16
Thanks for trying it out, and for the feedback!

Hmm, yes, it seems we didn't add this yet.

No, that's not yet implemented; only writing to Parquet files is currently supported.
Thanks for your quick reply and the explanation. Until now, I used to chunk huge files (mostly CSV) and stream them into files (mode="a") or a database:

```python
import dask.dataframe as dd
import dask_geopandas

# columns_definition (name -> dtype) is defined elsewhere
chunked_df = dd.read_csv(
    "really_huge_geocsv.csv",
    usecols=list(columns_definition.keys()),
    dtype=columns_definition,
)
chunked_geodf = dask_geopandas.from_dask_df(chunked_df)

for chunk in chunked_geodf:
    chunk.to_gpkg(
        layername="Table",
        if_exists="append",
        geometry_type="point",
        x=chunked_df.long,
        y=chunked_df.lat,
        crs=32640,
    )
```

I'm still too new to these tools to really help you. Maybe using PyGEOS abilities? There is a small GPKG writer using it: https://github.com/brendan-ward/pgpkg, but I haven't tried it myself.
Thanks for that link! I was just experimenting with a custom `to_file` function:

```python
import dask
from dask.delayed import delayed, tokenize


@delayed
def _extra_deps(func, *args, extras=None, **kwargs):
    # `extras` is never used; it only injects a dependency edge into the graph
    return func(*args, **kwargs)


def to_file(df, path, driver="GPKG", parallel=False, compute=True, **kwargs):
    """
    Write to a single file.

    Parameters
    ----------
    df : dask_geopandas.GeoDataFrame
    path : str
        Filename.
    driver : str, default "GPKG"
        OGR driver passed through to ``GeoDataFrame.to_file``.
    parallel : bool, default False
        When true, have each partition append itself to the file concurrently.
        This can result in rows being written in a different order than in the
        source DataFrame. When false, append the partitions one after another.
    compute : bool, default True
        When true, call dask.compute and perform the write; otherwise, return
        a Dask object (or a list of per-partition objects when parallel=True).
    """
    # based on dask.dataframe's to_sql
    def make_meta(meta):
        return meta.to_file(path, driver=driver, mode="w", **kwargs)

    make_meta = delayed(make_meta)

    # Create the (empty) file from the meta GeoDataFrame
    meta_task = make_meta(df._meta)

    # Partitions should always append to the empty file created from `meta` above
    worker_kwargs = dict(kwargs, driver=driver, mode="a")

    if parallel:
        # Perform the meta write, then one task per partition, all appending concurrently
        result = [
            _extra_deps(
                d.to_file,
                path,
                extras=meta_task,
                **worker_kwargs,
                dask_key_name="to_file-%s" % tokenize(d, **worker_kwargs),
            )
            for d in df.to_delayed()
        ]
    else:
        # Chain the "meta" write and each partition's append, so they run in order
        result = []
        last = meta_task
        for d in df.to_delayed():
            result.append(
                _extra_deps(
                    d.to_file,
                    path,
                    extras=last,
                    **worker_kwargs,
                    dask_key_name="to_file-%s" % tokenize(d, **worker_kwargs),
                )
            )
            last = result[-1]

    result = delayed(result)

    if compute:
        dask.compute(result, scheduler="processes")
    else:
        return result
```

And then you can use it like this:
|
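The ordering trick in the sequential branch, passing the previous task as an unused `extras` argument, can be checked in isolation with plain Dask. Here `append_line` and the temporary path are hypothetical stand-ins for the per-partition `to_file(..., mode="a")` calls:

```python
import os
import tempfile

import dask
from dask.delayed import delayed


@delayed
def _extra_deps(func, *args, extras=None, **kwargs):
    # `extras` is never touched; it only adds a dependency edge to the graph
    return func(*args, **kwargs)


def append_line(path, line):
    # Stand-in for a partition appending itself to the output file
    with open(path, "a") as f:
        f.write(line + "\n")


out = os.path.join(tempfile.mkdtemp(), "out.txt")

last = None
tasks = []
for part in ["a", "b", "c"]:
    # each task depends on the previous one, forcing sequential appends
    last = _extra_deps(append_line, out, part, extras=last)
    tasks.append(last)

dask.compute(tasks, scheduler="threads")

with open(out) as f:
    lines = f.read().split()
```

Because each task names its predecessor as an input, the scheduler cannot reorder the appends even with a multi-threaded scheduler, so `lines` comes out in the original partition order.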
Ah, and what I forgot to mention: you need to change one line in your Fiona install (https://github.com/Toblerity/Fiona/pull/858/files) to enable "append" mode for GPKG, because that fix is not yet released.
You're welcome!
Yes, it's well documented on GeoPandas, great tip!

Nice! I'll give it a try if needed.

Too bad a new version is not released; it's not always possible to install from GitHub in a professional context.
I understand this code is most definitely still experimental, so I tried to modify it slightly to work with ESRI Shapefiles (please see the comments in the code above), but I got these errors for both the GPKG and SHP versions of the code above:

I just started looking into Dask recently, so I don't really understand this error message. Is there an "easy" fix/workaround to make the `to_file` function work with ESRI Shapefiles? Please let me know.
First of all, thanks for this work 👍.
Then, I'm trying to convert a Dask DataFrame into a GeoPandas one, setting the geometry from two columns (x, y), and I think there is a problem in the README instructions:

In that case, `df` is a Dask DataFrame and doesn't have an attribute/method like `set_geometry`, raising an `AttributeError: 'DataFrame' object has no attribute 'set_geometry'`.

Digging into the source code, I finally did:

But then, I'm unable to export it as a geopackage. Am I missing something?