Threads are restarted over and over in tsfresh.extract_features if using multiprocessing #364

Closed
jscheithe opened this issue Feb 28, 2018 · 13 comments


@jscheithe

Hi there,

first of all, thanks for this package, I'm using it very happily!

Since yesterday, I can't run tsfresh.extract_features and tsfresh.select_features with n_jobs > 1:

  • When using IPython, the command line status bar stays at 0% forever. When I look in the Task Manager, I see that the processes are started, run for about 1 second, then die and are restarted over and over.

  • When I execute the script with python from the command line I get a RuntimeError as pasted below.

The DataFrame I'm passing looks like this:

            LOAD         AX    ...    trial_id
597.0   8.894621  -1.000000    ...         302
598.0   8.546521   0.000000    ...         302
.
.
.
234.0   8.123546   1.234567    ...         303
.
.
.
[57570 rows x 25 columns]

I'm calling tsfresh.extract_features(timeseries, column_id='trial_id', n_jobs=2) and each sub-frame with the same trial_id has the same shape: [57 rows x 25 columns]

I wish I knew what caused the error; this worked on my machine (Windows 7 x64) until yesterday.
Also, other scripts that use the multiprocessing package work fine.

I'd be grateful for any help. Thanks!

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "T:\AR\Studenten\Studenten_2017\Jakob_Scheithe\python-projects\bl3data-package\bl3data\trialdb.py", line 489, in <module>
    TDB = TrialDB(data_dict=full_data_dict)
  File "T:\AR\Studenten\Studenten_2017\Jakob_Scheithe\python-projects\bl3data-package\bl3data\trialdb.py", line 117, in __init__
    self.add_data_dict(data_dict)
  File "T:\AR\Studenten\Studenten_2017\Jakob_Scheithe\python-projects\bl3data-package\bl3data\trialdb.py", line 152, in add_data_dict
    self.add_tsfresh_features()
  File "T:\AR\Studenten\Studenten_2017\Jakob_Scheithe\python-projects\bl3data-package\bl3data\trialdb.py", line 239, in add_tsfresh_features
    tsff = fs.tsfresh_features(self)
  File "T:\AR\Studenten\Studenten_2017\Jakob_Scheithe\python-projects\bl3data-package\bl3data\feature_selection.py", line 87, in tsfresh_features
    extracted_features = tsfresh.extract_features(timeseries, column_id='trial_id', n_jobs=2)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\site-packages\tsfresh\feature_extraction\extraction.py", line 152, in extract_features
    distributor=distributor)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\site-packages\tsfresh\feature_extraction\extraction.py", line 226, in _do_extraction
    progressbar_title="Feature Extraction")
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\site-packages\tsfresh\utilities\distribution.py", line 341, in __init__
    self.pool = Pool(processes=n_workers)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\SCHEITHE\AppData\Local\Continuum\miniconda3\envs\p36\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
@jscheithe
Author

I just encountered this problem unrelated to tsfresh.

So most likely this is not a tsfresh problem. Sorry for this!

@TheYoxy

TheYoxy commented Mar 29, 2018

Did you find out what was causing this issue?
I'm having the same one :/

@MaxBenChrist
Collaborator

MaxBenChrist commented Mar 31, 2018

It is hard for us to judge what causes this problem. Maybe @jscheithe can clarify how he fixed the issue.

Just a guess: did you try to start the feature extraction inside a pool worker?

@jscheithe
Author

jscheithe commented Apr 6, 2018

Yes, I did! Sorry I didn't update...

This is the same as #185. There I wrote:

My solution: Put everything in the file you're running inside an if __name__ == '__main__': check (including all imports).
And maybe add a call to multiprocessing.freeze_support() right after the check, too (whether you need this seems to depend on your actual machine).

Also, even with the guard, the problem occurs when I start the script from an IPython console. On my machine, I actually have to start all scripts that involve multiprocessing from the command line ...
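
For readers landing here, the guard described above can be sketched with the standard library alone. This is a minimal illustration, not tsfresh code: square and main are made-up names, and the pool stands in for whatever tsfresh starts internally when n_jobs > 0.

```python
import multiprocessing


def square(x):
    # Worker functions live at module level so child processes can
    # find them after re-importing this file.
    return x * x


def main():
    # The pool is created inside a function, so merely importing
    # (or re-importing) this file never starts new processes.
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(square, range(5))


if __name__ == "__main__":
    # Per the RuntimeError text, freeze_support() can be omitted unless
    # the program is going to be frozen into an executable.
    multiprocessing.freeze_support()
    print(main())
```

Saved as a script and run with python from the command line, only the parent process enters the guarded block; the workers re-import the file and stop after the function definitions.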

@directnirvana

directnirvana commented Apr 7, 2018

I am getting the same error, and it appears to be fixed with the if __name__ == "__main__": guard as well. However, I am trying to write a module for other programs to import, and in that case, of course, __name__ does not equal "__main__". I've tried __name__ == moduleName: but that doesn't seem to have the same protective effect. Are there any other suggestions on how to avoid this? Perhaps disabling multiprocessing to avoid spawning the child processes?
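
On the question of disabling multiprocessing: as far as I know, the tsfresh documentation describes n_jobs=0 as "no parallelization", so passing n_jobs=0 to extract_features should avoid child processes altogether (please check this against the docs for your installed version). The general pattern looks roughly like the sketch below; extract_all and _mean are hypothetical stand-ins, not tsfresh functions.

```python
from multiprocessing import Pool


def _mean(series):
    # Stand-in for a real feature calculator (hypothetical).
    return sum(series) / len(series)


def extract_all(chunks, n_jobs=0):
    """Apply _mean to every chunk; n_jobs=0 means no child processes."""
    if n_jobs == 0:
        # Serial path: safe to call from anywhere, including at import time.
        return [_mean(chunk) for chunk in chunks]
    # Parallel path: on Windows the caller needs the __main__ guard.
    with Pool(processes=n_jobs) as pool:
        return pool.map(_mean, chunks)
```

With the serial path, a module can be imported and used by other programs without any guard, at the cost of running on a single core.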

@jscheithe
Author

jscheithe commented Apr 9, 2018

This should only happen if your module contains direct calls to a function that uses multiprocessing (or imports a module that does).

If so, you can consider rewriting something like this:

import tsfresh
...
tsfresh.extract_features(...)

to something like this:

import tsfresh
...
def my_extraction():
    return tsfresh.extract_features(...)

I don't exactly understand why, but for me it also helped to put

if __name__ == "__main__":
    import mymodule

in the script I was running.

I'm not sure this is correct, so maybe someone who understands this better can be of more help.
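
A possible explanation, going by the traceback in the first post: when a worker is spawned, multiprocessing re-runs the main script through runpy.run_path with run_name="__mp_main__". Everything at the top level of the script runs again in the worker, while code under the guard does not. The sketch below imitates that step without starting any processes; SCRIPT and run_as are made up for the demonstration.

```python
import os
import runpy
import tempfile

# A tiny script with one unguarded and one guarded statement.
SCRIPT = '''
events = ["top-level code ran"]
if __name__ == "__main__":
    events.append("guarded code ran")
'''


def run_as(run_name):
    """Execute SCRIPT from a temporary file under the given module name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.py")
        with open(path, "w") as fh:
            fh.write(SCRIPT)
        # run_path returns the globals of the executed script as a dict.
        return runpy.run_path(path, run_name=run_name)["events"]


as_parent = run_as("__main__")    # how `python script.py` runs the file
as_child = run_as("__mp_main__")  # how a spawned worker re-runs it

print(as_parent)
print(as_child)
```

If the unguarded part contains a call that starts a pool, each worker tries to start its own pool while it is still being set up, which is the situation the RuntimeError complains about.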

@MaxBenChrist
Collaborator

MaxBenChrist commented Apr 9, 2018

Unfortunately, this is a Windows-specific issue within the multiprocessing library, see #185.

From what I understand, there is nothing we can do inside the tsfresh library to catch this.

@jscheithe
Author

However, I think it is important to mention this in the documentation and provide some workaround(s).

This can be a really frustrating one, especially if you're new to multiprocessing (on Windows).

@MaxBenChrist
Collaborator

We touch on this topic in the FAQ: http://tsfresh.readthedocs.io/en/latest/text/faq.html

Where would you expect this information as a user?

@jscheithe
Author

jscheithe commented Apr 12, 2018

E.g. for tsfresh.feature_extraction.extraction.extract_features, the default value of the parameter n_jobs is 2.

So I think there are two options:

I think I'd prefer the second one.
However, this is just my viewpoint as a rather inexperienced user.

@directnirvana

Thanks for the responses. I do see that the FAQ mentions the issue now, and I suppose that is where I would mention the problem as well. I think it is worth considering @jscheithe's suggestions, or perhaps mentioning the problem explicitly in the FAQ as a Windows problem, as it may not be obvious to novice users such as myself that we are using any multiprocessing features with tsfresh.

@tccf1109

Hello guys, I am a novice using tsfresh. Nonetheless, I got the same problem as described before, and I can indeed resolve it with if __name__ == "__main__":.

For my use case, I want to call a method that takes a dataframe as input, extracts features, and returns a dataframe with the extracted features.

I've tried the following: in a .py file, I put if __name__ == "__main__": followed by all the imports and then the function I want to use. But in this case I cannot call the method: "AttributeError: module 'feature_extraction.FirstTest' has no attribute 'my_extraction'"

if __name__ == "__main__":
    import pandas as pd
    from tsfresh import extract_features
    from tsfresh.examples.robot_execution_failures import (
        download_robot_execution_failures,
        load_robot_execution_failures,
    )
    from tsfresh.feature_extraction import MinimalFCParameters

    def my_extraction():
        # Download the dataset
        download_robot_execution_failures()

        # Load the dataset
        timeseries, y = load_robot_execution_failures()

        # Extract just the first time series from the dataset
        pattern = timeseries.loc[timeseries["id"] == 1]

        # Remove all variables other than the minimum
        pattern = pattern.filter(items=["id", "time", "F_z"])

        # Only use minimal features
        settings = MinimalFCParameters()
        # Extract the features from the single pattern
        extracted_features = extract_features(
            pattern, column_id="id", column_sort="time", default_fc_parameters=settings
        )

        print(extracted_features)
        return extracted_features

@kempa-liehr
Collaborator

Hi @tccf1109,
All imports and the definition of function my_extraction() have to be located before the if __name__ == "__main__": clause. Then, you can call my_extraction() from within the if-clause.

Cheers,
Andreas

import pandas as pd
from tsfresh import extract_features
from tsfresh.examples.robot_execution_failures import (
    download_robot_execution_failures,
    load_robot_execution_failures,
)
from tsfresh.feature_extraction import MinimalFCParameters

def my_extraction():
    # Download the dataset
    download_robot_execution_failures()

    # Load the dataset
    timeseries, y = load_robot_execution_failures()

    # Extract just the first time series from the dataset
    pattern = timeseries.loc[timeseries["id"] == 1]

    # Remove all variables other than the minimum
    pattern = pattern.filter(items=["id", "time", "F_z"])

    # Only use minimal features
    settings = MinimalFCParameters()
    # Extract the features from the single pattern
    extracted_features = extract_features(
        pattern, column_id="id", column_sort="time", default_fc_parameters=settings
    )

    print(extracted_features)
    return extracted_features
    
if __name__ == "__main__":
    my_extraction()
