should implement restartable, regressible data pipelining #88

Closed
toliwaga opened this issue Apr 29, 2016 · 7 comments

@toliwaga

This would be useful for a number of reasons:

  1. The ability to resume jobs at a known point would make it easier to debug problems that occur late in the pipeline with big datasets.
  2. Checkpoints would allow regression testing of results at every point in the data pipeline.
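
To make the two points above concrete, here is a minimal sketch of a checkpointed pipeline. It is not ActivitySim's implementation: the `resume_after` name is borrowed from the call visible in the traceback later in this thread, while `run_pipeline`, `load_checkpoint`, the pickle-per-step storage, and the toy steps are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def run_pipeline(steps, checkpoint_dir, resume_after=None):
    """Run (name, func) steps in order, saving the state after each one.

    Each func takes the state dict and returns the updated state dict.
    If resume_after is given, load that step's checkpoint and run only
    the steps that follow it.
    """
    names = [name for name, _ in steps]
    state = {}
    start = 0
    if resume_after is not None:
        if resume_after not in names:
            raise ValueError("unknown step: %r" % resume_after)
        state = load_checkpoint(checkpoint_dir, resume_after)
        start = names.index(resume_after) + 1
    for name, func in steps[start:]:
        state = func(state)
        # One checkpoint file per step (hypothetical layout).
        with open(os.path.join(checkpoint_dir, name + ".pkl"), "wb") as f:
            pickle.dump(state, f)
    return state


def load_checkpoint(checkpoint_dir, name):
    """Load the state saved right after the named step."""
    with open(os.path.join(checkpoint_dir, name + ".pkl"), "rb") as f:
        return pickle.load(f)


# Hypothetical toy steps standing in for real model steps.
def load_data(state):
    return dict(state, households=[3, 1, 2])


def sort_households(state):
    return dict(state, households=sorted(state["households"]))


def count_households(state):
    return dict(state, n_households=len(state["households"]))


STEPS = [
    ("load_data", load_data),
    ("sort_households", sort_households),
    ("count_households", count_households),
]

checkpoint_dir = tempfile.mkdtemp()
full = run_pipeline(STEPS, checkpoint_dir)
# Resume: skip the first two steps and rerun only the last one.
resumed = run_pipeline(STEPS, checkpoint_dir, resume_after="sort_households")
```

Resuming covers the debugging case, and comparing `load_checkpoint(checkpoint_dir, name)` against a stored reference for each step is one way the regression-testing case could be handled.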
bstabler added this to the Phase 3 milestone Dec 21, 2016
@bstabler

bstabler commented Dec 21, 2016

@e-lo

  • Write out results, document how to access them, and include it in the example.
  • In order to easily ascertain what is going on, I need a consistent summary function for the dependent variables of a simulation and a way to access results after the model run is done.
  • Make an obvious setting to save results to a folder.
  • Make an obvious method to load results to memory from a specific directory and perform any other needed startup activities.
  • Create obvious errors when running models out of order; have a defined order in a config and warn if not correct.
  • It would be nice to be able to run a SQL-like search on the results in order to flag things like walk distances greater than 2 miles, or people who had a license and a car but didn’t have driving available, etc.
  • Provide explicit error handling for running models out of order.
  • Implement a “debug mode” that has more verbose output for diagnosing either run errors or odd output.
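
The out-of-order items in the list above could be handled with a small check along these lines. This is only a sketch under stated assumptions: `check_model_order`, `ModelOrderError`, and the `strict` flag are hypothetical names, and the configured order is written as a plain list rather than read from a config file. The model names are taken from the timing log later in this thread.

```python
import warnings


class ModelOrderError(ValueError):
    """Raised when models are requested in an order the config does not allow."""


def check_model_order(requested, configured, strict=True):
    """Check that requested models appear in the same relative order as configured.

    Raises ModelOrderError for unknown models. For out-of-order (or repeated)
    models it raises when strict is True, otherwise warns and returns False.
    """
    unknown = [m for m in requested if m not in configured]
    if unknown:
        raise ModelOrderError("unknown models: %s" % ", ".join(unknown))
    positions = [configured.index(m) for m in requested]
    for prev, cur, name in zip(positions, positions[1:], requested[1:]):
        if cur <= prev:
            msg = "model '%s' is requested out of order; configured order is: %s" % (
                name, ", ".join(configured))
            if strict:
                raise ModelOrderError(msg)
            warnings.warn(msg)
            return False
    return True


CONFIGURED = [
    "school_location_simulate",
    "workplace_location_simulate",
    "auto_ownership_simulate",
]

# Skipping a step is allowed here; only the relative order is checked.
ok = check_model_order(
    ["school_location_simulate", "auto_ownership_simulate"], CONFIGURED)
```

Whether skipping steps should also be an error depends on how checkpoints are loaded; this sketch deliberately checks relative order only.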

bstabler added a commit that referenced this issue Jan 5, 2017
@bstabler

bstabler commented Jan 5, 2017

I added a very simple example of writing an output table:

# write the households table to a CSV file to review results
orca.get_table('households').to_frame().to_csv(orca.get_injectable("output_dir") + "/households_table.csv")

and updated the getting started guide

@bstabler

bstabler commented Mar 5, 2017

See the draft Design

@toliwaga

#165

@bstabler

The full example test did not finish. It ran for ~6 hours and then crashed, possibly because of the memory requirements of the PRNG: the Python process was using the maximum RAM (160 GB) for the last hour or so.

######### reseeding persons
channel.offset 4 channel.max_offset 5
Traceback (most recent call last):
  File "simulation.py", line 40, in <module>
    pipeline.run(models=_MODELS, resume_after=resume_after)
  File "e:\projects\asim\activitysim\activitysim\pipeline.py", line 315, in run
    run_model(model)
  File "e:\projects\asim\activitysim\activitysim\pipeline.py", line 269, in run_model
    orca.run([model_name])
  File "C:\Anaconda2\lib\site-packages\orca\orca.py", line 1876, in run
    step()
  File "C:\Anaconda2\lib\site-packages\orca\orca.py", line 780, in __call__
    return self._func(**kwargs)
  File "e:\projects\asim\activitysim\activitysim\defaults\models\non_mandatory_tour_frequency.py", line 105, in non_mandatory_tour_frequency
    create_non_mandatory_tours_table()
  File "e:\projects\asim\activitysim\activitysim\defaults\models\non_mandatory_tour_frequency.py", line 142, in create_non_mandatory_tours_table
    pipeline.get_rn_generator().add_channel(df, 'tours')
  File "e:\projects\asim\activitysim\activitysim\prng.py", line 265, in add_channel
    prngs = self.create_prngs_for_tour_channels(df, max_seed_offset, offset)
  File "e:\projects\asim\activitysim\activitysim\prng.py", line 229, in create_prngs_for_tour_channels
    prngs['generator'] = [np.random.RandomState(seed + offset) for seed in prngs['seed']]
  File "mtrand.pyx", line 643, in mtrand.RandomState.__init__ (numpy\random\mtrand\mtrand.c:13272)
thread.error: can't allocate lock
Closing remaining open files:data\mtc_asim.h5...donedata\skims.omx...doneoutput\pipeline.h5...done
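
The failing line in the traceback builds one `np.random.RandomState` per row of the tours table. Each such generator holds roughly 2.5 KB of Mersenne Twister state plus its own lock, which adds up quickly for millions of rows. One possible alternative, sketched below, keeps a single generator and reseeds it per row just before drawing. This is an assumption about how the memory pressure could be reduced, not a description of how ActivitySim resolved it; `random_for_rows` is a hypothetical helper.

```python
import numpy as np


def random_for_rows(seeds, offset=0, n=1):
    """Return an array of shape (len(seeds), n) of reproducible draws.

    A single RandomState is reseeded with seed + offset for every row, so
    only one generator exists at a time instead of one per row.
    """
    rng = np.random.RandomState()
    out = np.empty((len(seeds), n))
    for i, seed in enumerate(seeds):
        rng.seed(int(seed) + offset)  # per-row stream, as in seed + offset above
        out[i] = rng.rand(n)
    return out


# Rows keep the same draws regardless of which other rows are present.
draws = random_for_rows([101, 102, 103], offset=4)
subset = random_for_rows([102], offset=4)
```

The trade-off is time for memory: reseeding on every call is slower than holding live generators, and a row that needs several draws within one step must request them in a single call (`n > 1`) so they come from the same stream.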

bstabler reopened this Mar 21, 2017
@bstabler

bstabler commented Apr 4, 2017

The full example test finished. It ran in 7.5 hours, which means pipelining and random number seeding didn't slow it down all that much. The run time is 1.5 hours less than the previous full run since we changed trip mode choice. The previous 'trip mode choice', which wasn't really trip mode choice anyway, ran for 1.5 hours. The final pipeline file is 20.7 GB. Here are the key timing statements for the model steps:

03/04/2017 11:18:23 - INFO - activitysim - Read logging configuration from: configs\logging.yaml
03/04/2017 11:18:23 - INFO - activitysim.pipeline - start_pipeline...
03/04/2017 11:20:00 - INFO - activitysim.tracing - Time to execute load skim_dict : 96.742 seconds (2.0 minutes)
03/04/2017 11:22:13 - INFO - activitysim.tracing - Time to execute run_model 'compute_accessibility' : 127.819 seconds (2.0 minutes)
03/04/2017 11:32:49 - INFO - activitysim.tracing - Time to execute run_model 'school_location_simulate' : 556.287 seconds (9.0 minutes)
03/04/2017 11:56:42 - INFO - activitysim.tracing - Time to execute run_model 'workplace_location_simulate' : 1419.465 seconds (24.0 minutes)
03/04/2017 12:00:12 - INFO - activitysim.tracing - Time to execute run_model 'auto_ownership_simulate' : 192.766 seconds (3.0 minutes)
03/04/2017 13:26:03 - INFO - activitysim.tracing - Time to execute run_model 'cdap_simulate' : 5142.071 seconds (86.0 minutes)
03/04/2017 13:43:00 - INFO - activitysim.tracing - Time to execute run_model 'mandatory_tour_frequency' : 990.459 seconds (17.0 minutes)
03/04/2017 14:42:31 - INFO - activitysim.tracing - Time to execute run_model 'mandatory_scheduling' : 3550.293 seconds (59.0 minutes)
03/04/2017 16:18:52 - INFO - activitysim.tracing - Time to execute run_model 'non_mandatory_tour_frequency' : 5779.362 seconds (96.0 minutes)
03/04/2017 16:32:22 - INFO - activitysim.tracing - Time to execute run_model 'destination_choice' : 787.011 seconds (13.0 minutes)
03/04/2017 17:50:31 - INFO - activitysim.tracing - Time to execute run_model 'non_mandatory_scheduling' : 4687.389 seconds (78.0 minutes)
03/04/2017 18:18:49 - INFO - activitysim.tracing - Time to execute run_model 'tour_mode_choice_simulate' : 1696.034 seconds (28.0 minutes)
03/04/2017 18:20:17 - INFO - activitysim.tracing - Time to execute run_model 'create_simple_trips' : 48.623 seconds (1.0 minutes)
03/04/2017 18:45:57 - INFO - activitysim.tracing - Time to execute run_model 'trip_mode_choice_simulate' : 1516.503 seconds (25.0 minutes)
03/04/2017 18:54:50 - INFO - activitysim.pipeline - close_pipeline
03/04/2017 18:54:50 - INFO - activitysim.tracing - Time to execute all models : 27386.795 seconds (456.0 minutes)
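
For anyone wanting to rank the steps by run time, the timing lines above are regular enough to parse with a short script. The sketch below assumes the log format stays exactly as shown; `step_times` and `slowest` are hypothetical helper names, and the sample text is three lines copied from the log.

```python
import re

# Matches lines such as:
#   ... Time to execute run_model 'cdap_simulate' : 5142.071 seconds (86.0 minutes)
STEP_TIME = re.compile(r"Time to execute run_model '([^']+)' : ([0-9.]+) seconds")


def step_times(log_text):
    """Return a list of (model_name, seconds) in the order they were logged."""
    return [(name, float(seconds)) for name, seconds in STEP_TIME.findall(log_text)]


def slowest(log_text, n=3):
    """Return the n longest-running model steps, slowest first."""
    return sorted(step_times(log_text), key=lambda item: item[1], reverse=True)[:n]


SAMPLE = """\
03/04/2017 13:26:03 - INFO - activitysim.tracing - Time to execute run_model 'cdap_simulate' : 5142.071 seconds (86.0 minutes)
03/04/2017 13:43:00 - INFO - activitysim.tracing - Time to execute run_model 'mandatory_tour_frequency' : 990.459 seconds (17.0 minutes)
03/04/2017 18:20:17 - INFO - activitysim.tracing - Time to execute run_model 'create_simple_trips' : 48.623 seconds (1.0 minutes)
"""

for name, seconds in slowest(SAMPLE, n=2):
    print("%-30s %8.1f min" % (name, seconds / 60.0))
```

Lines that do not describe a `run_model` step, such as the skim loading and the overall total, are ignored by the pattern.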

@bstabler

I'm closing this task since the restartable, regressible data pipeline works well now.
