
Parsing a 13m x 89 table is quite slow. Any suggestions for improvements? #16

Closed

mmistroni opened this issue Mar 26, 2019 · 8 comments

@mmistroni

Hi,
Not really an issue, but I am using pandleau to create a hyper file out of a 13m x 89 table.
Columns are half strings and half numbers.
The process takes quite a while (7 hours on a 16GB desktop).
Was wondering if you could suggest potential improvements?
I saw notes about Unicode slowing down Python. Any tricks to get around that issue?
Kind regards

@bwiley1
Owner

bwiley1 commented Mar 28, 2019

Hi mmistroni,

Thanks for raising this point. Essentially, the Tableau SDK adds observations one at a time, row by row, to the final Tableau file. So while multithreading options in pandas can speed things up on the Python side (e.g. using lambda functions, etc.), there seems to be a ceiling imposed by the Tableau SDK itself. I've thought it might be a good idea to add observations concurrently, or to assign a row number to input values during SDK execution, but that is feedback for the Tableau development team (there's also a certain level of encryption in the tableausdk package; it would be nice to go from .tde to a pandas df, but this is restricted). It would be great if you brought these up to the Tableau team!
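For context, here is a minimal sketch of that row-at-a-time pattern, assuming the module layout of the Extract API 2.0 Python samples; the file name and columns are made up for illustration, and this is not pandleau's exact code:

```python
import pandas as pd
from tableausdk.Types import Type
from tableausdk.HyperExtract import ExtractAPI, Extract, TableDefinition, Row

df = pd.DataFrame({'name': ['a', 'b'], 'value': [1.0, 2.0]})

ExtractAPI.initialize()
extract = Extract('example.hyper')

# Define the schema once...
schema = TableDefinition()
schema.addColumn('name', Type.UNICODE_STRING)
schema.addColumn('value', Type.DOUBLE)
table = extract.addTable('Extract', schema)

# ...then set every cell and insert every row individually. This serial,
# per-row loop is the ceiling described above: no batch or concurrent
# insert is exposed by the SDK.
for _, rec in df.iterrows():
    row = Row(schema)
    row.setString(0, rec['name'])
    row.setDouble(1, rec['value'])
    table.insert(row)

extract.close()
ExtractAPI.cleanup()
```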

Best,
Ben

@mmistroni
Author

mmistroni commented Mar 28, 2019 via email

@mmistroni
Author

Hey,
So this is part of the same workflow I am running.
Running on RHEL7, using Python 2.7 + pandleau, I am getting a massive exception in pandleau.py:

OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found

I was using pandas 0.24.something. Downgrading to 0.20.3 didn't fix it.
What did fix it was doing a local import of pandas rather than an import pandas at the top of the file.
I don't know what triggers it. It could be the fact that pandleau tries to be smart about detecting whether you are using the old tableausdk API or the new one. It seems that pandas combined with the import
from tableausdk import *
somehow causes this problem.
Have you ever seen it? Would you know what to do to address it?
Right now I had to copy-paste the code and remove the global import.
I am pretty sure this has to do with pandas, as when I run the extractAPI Python samples I don't get any error, while when I edit a sample and add an import pandas, the extractAPI sample blows up too.
Any chance you can reproduce and help?
Thanks
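For readers hitting the same error: a `GLIBCXX_...' not found` OSError generally means the libstdc++ picked up at load time is older than the one a compiled extension was built against, and which copy wins can depend on import order. The mechanism isn't confirmed in this thread, but the workaround described above amounts to deferring the pandas import so that tableausdk's native libraries load first. A rough sketch (the function and file names are hypothetical):

```python
# The module-level `import pandas` is removed; tableausdk's native
# libraries are loaded first, and pandas is only imported locally,
# inside the function that needs it.
from tableausdk import *
from tableausdk.HyperExtract import *

def frame_from_csv(csv_path):
    # Deferred (local) import: this is the change that avoided the
    # `GLIBCXX_3.4.20' not found` OSError in the report above.
    import pandas as pd
    return pd.read_csv(csv_path)
```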

@bwiley1
Owner

bwiley1 commented Aug 1, 2019

Hey @mmistroni! It's been a long time :) Several other users have contributed to pandleau since, and performance has improved considerably in later versions. Let me know if the module runs faster now on this example, thanks!

bwiley1 closed this as completed Aug 1, 2019
@mmistroni
Author

mmistroni commented Aug 4, 2019 via email

@bwiley1
Owner

bwiley1 commented Aug 5, 2019

Hi @mmistroni - that's a good point... You're right, the tableausdk doesn't currently support concurrency, so there seems to be a limit on what can be gained through workarounds in Python alone. If Tableau updates their SDK in the future, I'll be sure to incorporate this into pandleau. Thanks!

@ghost

ghost commented Aug 5, 2019

@mmistroni @bwiley1, interestingly I've run into this now as well; my data is extremely large (around 500m rows, 300 cols) and I'm estimating it will take around 10 hours to run.

However, there have to be some workarounds. A commercial product, Alteryx, can create TDEs and calls the same DLLs as the SDK, yet it runs the same file in ~30 minutes, which is obscenely fast, all things considered. I'll harass Tableau support to see if they have any guidance for working with larger datasets, or any insight into how this one vendor was able to implement it so efficiently.

@mmistroni, any chance you'd be able to share some snippets of how you used pyspark for this?

@mmistroni
Author

mmistroni commented Aug 5, 2019 via email
