
Cleaning takes too long on multi-core CPUs #40

Closed
a1a2y3 opened this issue Aug 27, 2017 · 8 comments
Comments

@a1a2y3

a1a2y3 commented Aug 27, 2017

Cleaning takes 276s for the house price dataset on an Intel E5-2683 v3.
Since the E5-2683 v3 has 14 cores and 28 threads, I guess the problem may be caused by n_jobs=-1 here:
```python
if self.verbose:
    print("cleaning data ...")

df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_list)(df[col]) for col in df.columns),
               axis=1)

df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns),
               axis=1)
```

I don't know how to fix it; maybe add an n_jobs argument to the Reader class?
Looking forward to your response. Thank you.
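
For illustration, a minimal sketch of what such an n_jobs argument could look like. The class body below is a simplified assumption, not MLBox's actual Reader, and the two converter functions are passed in only to keep the sketch self-contained:

```python
import pandas as pd
from joblib import Parallel, delayed


class Reader:
    """Sketch: expose n_jobs instead of hard-coding n_jobs=-1."""

    def __init__(self, n_jobs=-1, verbose=True):
        # n_jobs=1 disables multiprocessing entirely, which is the
        # safe choice on Windows; -1 keeps the current behaviour.
        self.n_jobs = n_jobs
        self.verbose = verbose

    def clean(self, df, convert_list, convert_float_and_dates):
        if self.verbose:
            print("cleaning data ...")

        # Each column is cleaned in a separate joblib worker, then
        # the cleaned columns are reassembled into one DataFrame.
        df = pd.concat(
            Parallel(n_jobs=self.n_jobs)(
                delayed(convert_list)(df[col]) for col in df.columns),
            axis=1)

        df = pd.concat(
            Parallel(n_jobs=self.n_jobs)(
                delayed(convert_float_and_dates)(df[col]) for col in df.columns),
            axis=1)
        return df
```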

@a1a2y3
Author

a1a2y3 commented Aug 27, 2017

Drift_thresholder() has the same problem.
It takes 1.38s on a Kaggle kernel and 176s on my PC with the E5-2683 v3 CPU.

@AxeldeRomblay
Owner

Hum... sounds very weird! It takes only 2 seconds on my computer (7 cores). Have you tried setting n_jobs = 1 and running it again?

@a1a2y3
Author

a1a2y3 commented Aug 28, 2017

Thank you for the reply. I think joblib or multiprocessing is causing this problem, and I am trying to solve it.
I use Windows 10 + Anaconda + Python 3.6 + VS2015; could there be a conflict with joblib?

With n_jobs=1, it seems OK:

```
reading csv : train.csv ...
cleaning data ...
CPU time: 0.22528505325317383 seconds
reading csv : test.csv ...
cleaning data ...
CPU time: 0.1932668685913086 seconds
```

With n_jobs=2, it dies.

@a1a2y3
Author

a1a2y3 commented Aug 29, 2017

From http://pythonhosted.org/joblib/parallel.html#common-usage I found this:
"Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel. [...] No code should run outside of the `if __name__ == '__main__'` blocks, only imports and definitions."
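
In other words, any script that calls MLBox on Windows has to keep its calls under a `__main__` guard. A minimal sketch (the file paths, target column, and Reader arguments are placeholders):

```python
# run_mlbox.py
# On Windows, joblib spawns child processes that re-import this
# module; without the guard below, the top-level code would be
# re-executed in every child, spawning subprocesses recursively.
from mlbox.preprocessing import Reader


def main():
    reader = Reader(sep=",")
    # "train.csv", "test.csv" and "SalePrice" are placeholders
    data = reader.train_test_split(["train.csv", "test.csv"],
                                   target_name="SalePrice")


if __name__ == "__main__":
    main()
```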

Problem solved.

@AxeldeRomblay
Owner

Yes, this is what I was wondering. At the moment, MLBox does not support Windows, but it will soon :)
Thank you very much for reporting this issue!!

@DarquesM

I've got the same issue. Where should I set n_jobs=1?
mlbox.preprocessing.Reader does not have an "n_jobs" parameter.

@AxeldeRomblay
Owner

AxeldeRomblay commented Nov 21, 2017

Hello @DarquesM!
The problem is due to Windows... At the moment, what you can do is set n_jobs=1 in the source code, in these two calls:

```python
df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_list)(df[col]) for col in df.columns), axis=1)

df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns), axis=1)
```

Otherwise, I will soon release a new version with separate reading and cleaning classes...
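
For reference, the same two lines with the workaround applied. With n_jobs=1, joblib runs everything sequentially in the parent process, so no subprocesses are spawned on Windows:

```python
df = pd.concat(Parallel(n_jobs=1)(delayed(convert_list)(df[col]) for col in df.columns), axis=1)

df = pd.concat(Parallel(n_jobs=1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns), axis=1)
```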

@AxeldeRomblay
Owner

Hello, thanks for reporting this issue. I will close it since this will be fixed in an upcoming release (probably MLBox 0.7.1).
