parsing 13m x 89 table is quite slow. Any suggestion for improvements? #16
Comments
Hi mmistroni, Thanks for raising this point. Essentially, the Tableau SDK adds observations one at a time, row by row, into the final Tableau file. So while multithreading options in pandas can speed up the Python side (e.g. using lambda functions, etc.), there seems to be a ceiling imposed by the Tableau SDK itself. I've thought it might be a good idea to add observations concurrently, or to assign a row number to input values during SDK execution, but that is feedback for the Tableau development team (there's also a certain level of encryption in the tableausdk package; it would be nice to go from .tde back to a pandas DataFrame, but this is restricted). It would be great if you brought these up with the Tableau team! Best, |
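The row-at-a-time bottleneck described above can be illustrated with a small pure-Python sketch. Note this is illustrative only: `FakeExtractTable` is a hypothetical stand-in for the Tableau SDK's table object, not the real tableausdk API; it just mimics the pattern pandleau is forced into, one `insert()` call per DataFrame row.

```python
# Illustrative sketch only: FakeExtractTable is a hypothetical stand-in for
# the Tableau SDK's table object, not the real tableausdk API.
class FakeExtractTable:
    def __init__(self):
        self.rows = []

    def insert(self, row):
        # The real SDK writes a single observation per call; there is no
        # bulk/batched insert, which is why throughput hits a ceiling
        # regardless of how the Python side is parallelized.
        self.rows.append(tuple(row))

def write_frame(records, table):
    """Write an iterable of (col1, col2, ...) tuples row by row."""
    count = 0
    for row in records:
        table.insert(row)   # one SDK call per row
        count += 1
    return count

# A toy 1000-row "table" with mixed string/numeric columns.
data = [(i, str(i), i * 0.5) for i in range(1000)]
table = FakeExtractTable()
n = write_frame(data, table)
```

However fast the loop that produces `data` becomes, every row still funnels through the single `insert()` call, which is the ceiling described above.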
Thank you for getting back to me. So, to give you some figures: it takes approx. 7 hrs to create a hyper file for a 12m x 89 column table in Python/pandas.
I have attempted to use PySpark to speed up the process, and I have reduced it down to 1.5 hrs, but then again it is not a perfect process like pandas: sometimes Spark fails, raises an OOM, restarts the task, and that results in duplicates.
Thanks for pointing out the issue with Tableau; I am going to ask Tableau support for advice, as tableausdk is a black box to me.
I will surely keep you posted on the outcome, but as you said, there's not much that can be done on our side.
kind regards
Marco.
|
Hey, I got OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found. I was using pandas 0.24.something; downgrading to 0.20.3 didn't fix it |
Hey @mmistroni! It's been a long time :) So several other users have contributed to pandleau, and in later versions the speedup has been considerably improved. Let me know if the module runs faster now on this example, thanks! |
Hello
I cannot test it on RHEL 7... I got issues with pandas dependencies.
However, the Extract API is still slow when extracting 14m x 370 cols, and that is due to the way the Extract API works: after reading other posts and Tableau forums, it does not support concurrency.
As long as your Tableau data is under 2m rows, you probably get decent results. Anything bigger, and it does not really scale; that is at least my experience.
Will post if I have any further updates.
kind regards
|
Hi @mmistroni - that's a good point... You're right, the tableausdk doesn't currently support concurrency, so there seems to be a limit on speedup through workarounds in python alone. If tableau makes an update to their sdk in the future, I'll be sure to incorporate this into pandleau. Thanks! |
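Since the SDK itself is single-threaded, one workaround sometimes used with single-writer APIs is to partition the data, write one extract per chunk in parallel worker processes, and union the resulting files afterwards in Tableau. The sketch below is not pandleau's actual behavior: `write_chunk` is hypothetical and writes CSV as a stand-in for a per-chunk extract writer, and the fork-based `Pool` assumes a Unix-like platform.

```python
import csv
import os
import tempfile
from multiprocessing import Pool

# write_chunk is a hypothetical per-chunk writer; CSV output stands in for
# a real per-chunk .tde/.hyper extract, to be unioned afterwards in Tableau.
def write_chunk(args):
    chunk_id, rows, out_dir = args
    path = os.path.join(out_dir, f"extract_part_{chunk_id}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)   # still row-at-a-time, but one writer per process
    return path

def parallel_extract(rows, n_chunks, out_dir):
    """Split rows into n_chunks slices and write them in parallel processes."""
    size = (len(rows) + n_chunks - 1) // n_chunks
    chunks = [(i, rows[i * size:(i + 1) * size], out_dir)
              for i in range(n_chunks)]
    with Pool(n_chunks) as pool:
        return pool.map(write_chunk, chunks)

rows = [(i, str(i)) for i in range(100)]
out_dir = tempfile.mkdtemp()
paths = parallel_extract(rows, 4, out_dir)
```

This sidesteps the SDK's per-process ceiling rather than removing it, at the cost of an extra union step on the Tableau side.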
@mmistroni @bwiley1 , interestingly I've run into this now as well -- my data is extremely large (around 500m rows, 300 cols) and I'm estimating it will take around 10 hours to run. However, there have to be some workarounds. A commercial product, Alteryx, can create TDEs, calls the same DLLs as the SDK, and runs the same file in ~30 minutes, which is obscenely fast, all things considered. I'll harass Tableau support to see if they have any guidance for working with larger datasets, or any insight as to how this one vendor was able to implement it so efficiently. @mmistroni , any chance you'd be able to share some snippets of how you utilized PySpark for this? |
Hello,
sure... I found this project on the net - it's in Scala - and rewrote it in Python, as the two APIs are similar. My use case is that I have a massive DataFrame in Spark and I need to create a hyper file out of it:
https://github.com/werneckpaiva/spark-to-tableau/blob/master/src/main/scala/tableau/TableauDataFrame.scala
By running Spark locally and extracting a 14m x 389 dataframe, it takes 6 hrs. If you go below 1m rows then you get decent times - a matter of minutes.
Alteryx will not do for me, as I need to generate hyper files...
kind regards
Marco
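The Spark approach in the linked Scala project boils down to a per-partition writer. A rough pure-Python sketch of that handler follows; `write_partition` is hypothetical (in PySpark it would be hooked in via `df.rdd.mapPartitionsWithIndex`, which supplies the partition index), and it writes CSV here as a stand-in for the Tableau SDK calls so it can run without Spark installed.

```python
import csv
import os
import tempfile

# Hypothetical per-partition writer in the style of spark-to-tableau.
# In PySpark it would be wired up roughly as:
#     df.rdd.mapPartitionsWithIndex(
#         lambda i, rows: iter([write_partition(i, rows, out_dir)]))
def write_partition(part_id, rows_iter, out_dir):
    # Each executor consumes its own partition, so the row-at-a-time
    # writes at least proceed in parallel across partitions.
    path = os.path.join(out_dir, f"part_{part_id}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows_iter:        # one write per observation
            writer.writerow(row)
    return path

# Simulate two partitions of a small dataframe.
out_dir = tempfile.mkdtemp()
p1 = write_partition(0, iter([(1, "a"), (2, "b")]), out_dir)
p2 = write_partition(1, iter([(3, "c")]), out_dir)
```

As Marco notes above, this speeds things up but is not failure-proof: a retried Spark task can re-run its partition writer and produce duplicates unless the output is made idempotent.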
|
Hi,
not really an issue, but I am using pandleau to create a hyper file out of a 13m x 89 table.
The columns are half strings and half numbers.
The process takes quite a while (7 hours on a 16 GB desktop).
Was wondering if you could suggest potential improvements?
I saw notes regarding Unicode slowing down Python; any tricks to get around the issue?
kind regards