Make tests twice as fast using mock #466

irskep · 2012-05-31T01:43:25Z

This branch introduces a dependency on the mock library, plus one other change. Together they cut the test suite's running time from 64 seconds to 30 seconds.

- Mock time.sleep(), MRJobRunner._create_mrjob_tar_gz(), EMRJobRunner._wait_for_s3_eventual_consistency(), and EMRJobRunner._wait_for_job_flow_termination() in test_emr for a ~20 second speedup.
- MRJob.make_runner() sets the _steps attribute of the runner it creates, so the runner doesn't need to start a subprocess to get the job's steps. The machinery for getting a job's steps is completely internal and does not expose any APIs that could break as a result of this change, so this is an optimization that we have complete control over. It could also be implemented by adding a keyword arg to MRJobRunner.__init__().
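
As a rough illustration of the mocking approach, the sketch below patches time.sleep so that a polling loop returns immediately. It uses unittest.mock from the Python 3 standard library (the branch itself depends on the standalone mock package), and wait_for_termination is a hypothetical stand-in for a polling helper, not mrjob code.

```python
import time
from unittest import mock  # the branch uses the standalone `mock` package


def wait_for_termination(poll, interval=30):
    """Hypothetical polling helper: sleep until poll() returns True.

    Returns the number of times it had to wait.
    """
    attempts = 0
    while not poll():
        attempts += 1
        time.sleep(interval)
    return attempts


def run_without_sleeping():
    """Run the polling loop with time.sleep replaced by a mock."""
    states = iter([False, False, True])
    with mock.patch('time.sleep') as fake_sleep:
        attempts = wait_for_termination(lambda: next(states))
    # The mock records each call, so the test can still assert on them.
    return attempts, fake_sleep.call_count
```

Because the helper looks up time.sleep on the module at call time, patching 'time.sleep' is enough here; code that does `from time import sleep` would need the patch applied to the importing module instead.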

coyotemarin · 2012-06-04T21:07:26Z

update setup.py?

coyotemarin · 2012-06-04T21:09:03Z

tests/test_local.py

@@ -387,6 +387,10 @@ def test_echo_as_steps_python_bin(self):
        with mr_job.make_runner() as runner:
            assert isinstance(runner, LocalMRJobRunner)
            try:
+                # make_runner() populates _steps in the runner, so un-populate


"The mocked version of make_runner()..."

make_runner() isn't mocked. To do that I would have to have all test cases inherit from a new superclass.

Or write MRTestingJob() which does what the mrjob.job code above does, and have all the test case jobs inherit from it. That's not too bad.

Oh, whoa, didn't catch that patching runner._steps happens outside of tests. Don't do that!

The whole point of the roundabout way of fetching step definitions by running the script with --steps is to make it simple to write mrjob scripts in other languages. In theory, all we have to do is give mrjob a way to infer the interpreter to run from the script's file extension, and it'll be possible to write a mrjob script in whatever language, and launch it from Python. The performance hit of starting a subprocess is small compared to the overhead of starting up Hadoop (not to mention running the MapReduce job), and if you just want to test quickly, that's what inline mode is for.
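
A minimal sketch of that idea, under stated assumptions: the job script, the space-separated output format, and the extension-to-interpreter table are all hypothetical and only illustrate how a launcher could ask a script for its steps without importing it.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical job script: prints its step layout when run with --steps.
JOB_SOURCE = '''
import sys
if '--steps' in sys.argv:
    print('MR R')
'''

# Hypothetical mapping from file extension to interpreter command.
INTERPRETERS = {'.py': [sys.executable]}


def get_steps(script_path):
    """Run the script with --steps and parse its stdout into a list."""
    ext = os.path.splitext(script_path)[1]
    args = INTERPRETERS[ext] + [script_path, '--steps']
    output = subprocess.check_output(args)
    return output.decode('ascii').split()


def demo():
    """Write the hypothetical job to a temp file and fetch its steps."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'job.py')
        with open(path, 'w') as f:
            f.write(JOB_SOURCE)
        return get_steps(path)
```

The launcher only depends on the script's command-line behavior, which is why the same approach could in principle work for a script written in another language.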

It sucks that this slows down tests, but really, that's the tests' problem. If all the tests that run jobs need to inherit from a common base class to patch runner._steps so that they can run faster, they should do that, or they should use inline mode.

Or MRTestingJob. That's not a bad option either. :)

> The whole point of the roundabout way of fetching step definitions by running the script with --steps is to make it simple to write mrjob scripts in other languages.

And you can still do that. I didn't remove that ability. It works the same as it always did except some information is slipped in early when possible. _steps() even already accounted for it.

Anyway, I went the MRTestingJob route.

To clarify: I didn't consider the _steps() thing a crazy testing-only hack. Just populating data when it was available rather than waiting and then unnecessarily wasting resources. Anyway it's gone.
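
For readers following along, here is one way a test-only base class along these lines could look. This is a sketch with stand-in classes, not the MRTestingJob from this branch: Runner, Job, TestingJob, and every method name except make_runner() and _steps are assumptions made for illustration.

```python
class Runner:
    """Stand-in for a job runner; not mrjob's real class."""

    def __init__(self):
        self._steps = None
        self.subprocess_calls = 0

    def _fetch_steps_via_subprocess(self):
        # Placeholder for the slow path (running the script with --steps).
        self.subprocess_calls += 1
        return ['MR']

    def get_steps(self):
        # Only take the slow path if nobody populated _steps already.
        if self._steps is None:
            self._steps = self._fetch_steps_via_subprocess()
        return self._steps


class Job:
    """Stand-in for the library's job class, left unmodified."""

    def steps(self):
        return ['MR']

    def make_runner(self):
        return Runner()


class TestingJob(Job):
    """Test-only base class that pre-populates the runner's steps."""

    def make_runner(self):
        runner = super().make_runner()
        runner._steps = self.steps()
        return runner
```

With this layout the shortcut lives entirely in test code: jobs that inherit from the library class still exercise the subprocess path, while test jobs skip it.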

Yeah, I can see that. I guess I'm saying that I want to keep this code path heavily travelled for now.

I think the clean way to do this would be to pass the steps in as an argument to the runner class's constructor rather than patching the runner's attributes. I'd feel better about this if we had a well-documented dictionary-like format for representing steps. It's not bad for the runner to represent steps internally like ['MR', 'R'] but it seems a little weird for this to be the canonical way of passing that information around.

I guess what I'm saying is that I'd like to do the steps optimization, support for other languages, and deeper support for the Java side of Hadoop all together. Not that we need to have all the code written before we release, but that we at least think through all these things and spec out the steps format.
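
To make the constructor-argument idea concrete, here is a purely hypothetical sketch. No steps format has been specced in this thread, so the dictionary keys, the describe_steps() helper, and the steps= keyword are all invented for illustration.

```python
def describe_steps(compact):
    """Expand a compact form like ['MR', 'R'] into one dict per step.

    The dict keys are hypothetical, not a format mrjob defines.
    """
    return [{'mapper': 'M' in s, 'reducer': 'R' in s} for s in compact]


class Runner:
    """Stand-in runner that accepts step descriptions in its constructor."""

    def __init__(self, steps=None):
        self._steps = steps

    def get_steps(self):
        if self._steps is None:
            # A real runner would fall back to running the script with --steps.
            raise NotImplementedError('no steps were passed in')
        return self._steps
```

Passing the steps in explicitly keeps the optimization visible in the runner's interface instead of relying on callers to set a private attribute after construction.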

Makes sense, thanks!

Steve Johnson added 5 commits May 30, 2012 17:43

test_emr mocks _create_mrjob_tar_gz for a 20 second speedup

bacc351

Looks like some file is trying to get removed twice, stop that nonsense

4e23558

make_runner() sets _steps on runner to avoid a subprocess call

fe001b7

Mock time.sleep(). tests now only take 30s.

afab8c9

Simplify patching of create_mrjob_tar_gz

de10dc2

coyotemarin reviewed Jun 4, 2012

Steve Johnson added 3 commits June 4, 2012 16:07

Finally patch _create_mrjob_tar_gz correctly.

1ece66f

Undo some funky patching that isn't needed

ad49792

Test jobs all inherit from MRTestingJob to make tests faster without …

a08b691

…modifying mrjob.job

coyotemarin pushed a commit that referenced this pull request Jun 4, 2012

Merge pull request #466 from irskep/mock_ftw

fdf8ad4

Make tests twice as fast using mock

coyotemarin merged commit fdf8ad4 into Yelp:master Jun 4, 2012
