
Cleaning takes too long on multi-core CPUs #40

Closed
a1a2y3 opened this issue Aug 27, 2017 · 8 comments
Comments

@a1a2y3

a1a2y3 commented Aug 27, 2017

Cleaning takes 276s for the house price dataset on an Intel E5-2683 v3.
Since the E5-2683 v3 has 14 cores and 28 threads, I guess the problem may be caused by n_jobs=-1 here:
```python
if self.verbose:
    print("cleaning data ...")

df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_list)(df[col]) for col in df.columns),
               axis=1)

df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns),
               axis=1)
```

I don't know how to fix it; maybe add an n_jobs argument to the Reader class?
Looking forward to your response. Thank you.
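
For illustration, a minimal sketch of what such an n_jobs argument could look like. The class body below is a simplified assumption, not MLBox's actual Reader, and the two converter functions are passed in only to keep the sketch self-contained:

```python
import pandas as pd
from joblib import Parallel, delayed


class Reader:
    """Sketch: expose n_jobs instead of hard-coding n_jobs=-1."""

    def __init__(self, n_jobs=-1, verbose=True):
        # n_jobs=1 disables multiprocessing entirely, which is the
        # safe choice on Windows; -1 keeps the current behaviour.
        self.n_jobs = n_jobs
        self.verbose = verbose

    def clean(self, df, convert_list, convert_float_and_dates):
        if self.verbose:
            print("cleaning data ...")

        # Each column is cleaned in a separate joblib worker, then
        # the cleaned columns are reassembled into one DataFrame.
        df = pd.concat(
            Parallel(n_jobs=self.n_jobs)(
                delayed(convert_list)(df[col]) for col in df.columns),
            axis=1)

        df = pd.concat(
            Parallel(n_jobs=self.n_jobs)(
                delayed(convert_float_and_dates)(df[col]) for col in df.columns),
            axis=1)
        return df
```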

@a1a2y3
Author

a1a2y3 commented Aug 27, 2017

Drift_thresholder() has the same problem.
It takes 1.38s on a Kaggle kernel and 176s on my PC with the E5-2683 v3 CPU.

@AxeldeRomblay
Owner

Hum... sounds very weird! It takes only 2 seconds on my computer (7 cores). Have you tried setting n_jobs = 1 and running it again?

@a1a2y3
Author

a1a2y3 commented Aug 28, 2017

Thank you for the reply. I think joblib or multiprocessing is causing this problem, and I am trying to solve it.
I use Windows 10 + Anaconda + Python 3.6 + VS2015; could there be a conflict with joblib?

With n_jobs=1, it seems OK:

```
reading csv : train.csv ...
cleaning data ...
CPU time: 0.22528505325317383 seconds
reading csv : test.csv ...
cleaning data ...
CPU time: 0.1932668685913086 seconds
```

With n_jobs=2, it dies.

@a1a2y3
Author

a1a2y3 commented Aug 29, 2017

From http://pythonhosted.org/joblib/parallel.html#common-usage I found this:
"Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel. [...] No code should run outside of the `if __name__ == '__main__'` blocks, only imports and definitions."
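
In other words, any script that calls MLBox on Windows has to keep its calls under a `__main__` guard. A minimal sketch (the file paths, target column, and Reader arguments are placeholders):

```python
# run_mlbox.py
# On Windows, joblib spawns child processes that re-import this
# module; without the guard below, the top-level code would be
# re-executed in every child, spawning subprocesses recursively.
from mlbox.preprocessing import Reader


def main():
    reader = Reader(sep=",")
    # "train.csv", "test.csv" and "SalePrice" are placeholders
    data = reader.train_test_split(["train.csv", "test.csv"],
                                   target_name="SalePrice")


if __name__ == "__main__":
    main()
```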

Problem solved.

@AxeldeRomblay
Owner

Yes, this is what I was wondering. At the moment, MLBox does not support Windows, but it will soon :)
Thank you very much for reporting this issue!!

@DarquesM

I've got the same issue. Where should I set n_jobs=1?
mlbox.preprocessing.Reader does not have an "n_jobs" parameter.

@AxeldeRomblay
Owner

AxeldeRomblay commented Nov 21, 2017

Hello @DarquesM!
The problem is due to Windows... At the moment, what you can do is set n_jobs=1 in the source code, in these two calls:

```python
df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_list)(df[col]) for col in df.columns), axis=1)

df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns), axis=1)
```

Otherwise, I will soon release a new version with separate reading and cleaning classes...
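
For reference, the same two lines with the workaround applied. With n_jobs=1, joblib runs everything sequentially in the parent process, so no subprocesses are spawned on Windows:

```python
df = pd.concat(Parallel(n_jobs=1)(delayed(convert_list)(df[col]) for col in df.columns), axis=1)

df = pd.concat(Parallel(n_jobs=1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns), axis=1)
```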

@AxeldeRomblay
Owner

Hello, thanks for reporting this issue. I will close it since this will be fixed in an upcoming release (probably MLBox 0.7.1).
