groupby aggregation does not scale well with number of groups #4001
My guess is that this isn't a bug with grouping, but rather an issue where you're running out of RAM and so getting heavily degraded performance. Strings are unfortunately expensive to store in memory in Python, and Pandas' lack of a text type kills us. Some things you could try to test this theory:
Also, just to clarify things: persist and compute are orthogonal to the local and distributed schedulers. There is no reason to change one just because you change the other.
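(A minimal sketch of that orthogonality; the file name and column names here are illustrative, not from the thread:)

```python
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.read_parquet('data.parquet')   # illustrative input
agg = ddf.groupby('id3').v1.sum()

res = agg.compute()        # works under the default local scheduler...

client = Client()          # ...and, unchanged, under the distributed one
res = agg.compute()        # compute() still returns a concrete pandas object
persisted = agg.persist()  # persist() instead keeps the result on the workers
```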
Any update on this, @jangorecki? Were any of those suggestions helpful?
I ran those queries with parquet on 1e7 and 1e8 only. I couldn't create it from dask, as reading the csv was running out of memory.
@martindurant please see the parquet error above. It looks like the current experience of writing from spark and reading from dask still isn't smooth for novice users. @jangorecki you might find this notebook helpful for your benchmarks: https://gist.github.com/c0b84b689238ea46cf9aa1c79155fe34
My timings for 1e7 are as follows:
Creating all of the columns and storing them in memory is something like 200 MB on disk as parquet, or 5 GB in RAM using Python object dtype. You really shouldn't underestimate the performance cost of using Python object dtypes in Pandas. They operate at Python speeds (rather than C, like the rest of Pandas) and take up a ton of memory. In this situation I would strongly recommend using categorical dtypes.
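(A minimal sketch of that conversion; the file and column names are assumptions based on the benchmark data described in this thread:)

```python
import dask.dataframe as dd

ddf = dd.read_parquet('G1_1e8_1e2.parquet')  # assumed file name

# Store each distinct string once and replace rows with small integer
# codes; this can shrink object columns by an order of magnitude.
ddf = ddf.categorize(columns=['id1', 'id2', 'id3'])
```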
If we do this then I can easily run the 1e8 system on my laptop (16 GB of RAM). Here is a run-through with the threaded scheduler:
1e8, fewer groups: 0.57 s. Still not great for many groups, as you said above.
@TomAugspurger you might want to run the notebook above and look at the
Anything in particular that stands out duration-wise? (I only tried the default values in the notebook; haven't played around with others.) One thing that looked somewhat odd was that in the `.agg({'v1': 'sum', 'v3': 'mean'})` case, pandas is spending time in `_factorize_array` on both the `_get_compress_labels` side (the actual determination of groups, which isn't surprising) and on the actual agg side. I haven't looked at whether there's actually duplicative computation yet, though.
I'm seeing us spend 35% of our time in `pandas/core/arrays/categorical.py::is_dtype_equal`, in particular calling `return hash(self.dtype) == hash(other.dtype)`. Generally though, things do seem surprisingly slow relative to lower-cardinality groupbys.
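(For reference, a profile like the one above can be produced with the standard library's cProfile; a sketch, assuming `ddf` is the benchmark dataframe in the interpreter's globals:)

```python
import cProfile
import pstats

# Profile the aggregation on the threaded scheduler and print the hot spots.
cProfile.run(
    "ddf.groupby('id3').agg({'v1': 'sum', 'v3': 'mean'}).compute(scheduler='threads')",
    'groupby.prof',
)
pstats.Stats('groupby.prof').sort_stats('cumulative').print_stats(15)
```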
Just to highlight this issue, I can point to this report: https://h2oai.github.io/db-benchmark/ (click the "5 GB" tab).
By default, Dask combines all groupby-apply chunks into a single Pandas dataframe output, no matter the resulting size, as shown in the "Many groups" example at https://examples.dask.org/dataframes/02-groupby.html. By using the parameter `split_out` you should be able to control this size. Dask also automatically combines the resulting aggregation chunks (every 8 chunks by default; FYI, this is an arbitrary value set in dask/dataframe/core.py line 3595). This value is fine for most cases, but if the aggregation chunks are large it could take a substantial amount of time to combine them (especially if chunks need to be transferred from another worker). I think you can ensure the output of the groupby chunks doesn't get too big by setting `split_out=4` and `split_every=False`.
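(A sketch of those two knobs; the input path and column names are illustrative:)

```python
import dask.dataframe as dd

ddf = dd.read_parquet('data.parquet')  # illustrative input

res = ddf.groupby('id3').agg(
    {'v1': 'sum', 'v3': 'mean'},
    split_out=4,        # spread the aggregated result over 4 partitions
    split_every=False,  # combine all intermediate chunks in one step
).compute()
```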
@alex959595 thanks for the suggestion. Any idea why this is not optimised internally, so that the aggregation API could be data agnostic, relying only on metadata (schema)?
So that Dask doesn't have to look at all of your data before coming up with
a plan. One of the costs of laziness is poor planning.
I have a huge csv file (~400 GB) and I'm reading it through dask. I want to group the data by a column and apply a function to the grouped dataframes. It works perfectly fine with smaller csv files. Is there a way to use split_out with custom functions, or any other workaround for this?
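(One workaround that does compose with `split_out` is dask's custom `Aggregation` interface, shown here with a hand-rolled mean in the style of the dask docs; the path and column names are illustrative:)

```python
import dask.dataframe as dd

# A custom aggregation expressed as per-partition "chunk" and
# cross-partition "agg"/"finalize" steps, so the groupby never has to
# collect whole groups in one place.
custom_mean = dd.Aggregation(
    'custom_mean',
    chunk=lambda s: (s.count(), s.sum()),
    agg=lambda counts, sums: (counts.sum(), sums.sum()),
    finalize=lambda counts, sums: sums / counts,
)

ddf = dd.read_csv('huge.csv')  # illustrative path
res = ddf.groupby('key').agg(custom_mean, split_out=16)
```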
@PalakHarwani I haven't used
That translates to:

```python
import logging

import dask
from dask import distributed

# Local "distributed" cluster: separate worker processes instead of threads.
client = distributed.Client(processes=True, silence_logs=logging.ERROR)

# Allow wider task fusion when optimizing the graph.
dask.config.set({"optimization.fuse.ave-width": 20})
```

So just by using `distributed` on a single machine you can deal with high-cardinality queries much faster.
It seems that there is a performance bug when doing grouping: the time and memory consumed by dask do not seem to scale well with the number of output rows.
Please find below a script to produce the example data; replace `N` to produce bigger input data. And the following code to perform the grouping.
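(The referenced scripts didn't survive in this copy; a minimal sketch consistent with the description, with column names taken from the aggregation quoted later in the thread:)

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

N = int(1e7)   # replace with 1e8 for the bigger run
K = N // 100   # cardinality of the grouping key -> number of output rows

df = pd.DataFrame({
    'id3': np.random.choice(K, N),        # grouping key
    'v1': np.random.randint(1, 6, N),
    'v3': np.random.rand(N),
})
ddf = dd.from_pandas(df, npartitions=20)

# the grouping being benchmarked
res = ddf.groupby('id3').agg({'v1': 'sum', 'v3': 'mean'}).compute()
```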
Running on Python 3.6.5, pandas 0.23.4, dask 0.19.2. Single machine, 20 CPUs, 125 GB memory.
| input rows | output rows | timing |
| --- | --- | --- |
| 1e7 | 100 | 0.4032 s |
| 1e7 | 1e5 | 2.1272 s |
| 1e8 | 100 | 3.2559 s |
| 1e8 | 1e6 | 149.8847 s |
Additionally, I checked an alternative approach: instead of `.compute`, using `Client` and `.persist` (adding `print(len(.))` to ensure persist has kicked in). In both cases the time was not acceptable (see table below; units are seconds).
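(A sketch of that alternative, using the same assumed names as the data-generation sketch above:)

```python
from dask.distributed import Client

client = Client()  # distributed scheduler on the same machine

res = ddf.groupby('id3').agg({'v1': 'sum', 'v3': 'mean'}).persist()
print(len(res))    # blocks until the persisted result is materialized
```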