Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make tests twice as fast using mock #466

Merged
merged 8 commits into from
Jun 4, 2012
Merged

Conversation

irskep
Copy link
Contributor

@irskep irskep commented May 31, 2012

This branch introduces a dependency on the mock library and another change which take the test suite running time from 64 seconds to 30 seconds.

  1. Mock time.sleep(), MRJobRunner._create_mrjob_tar_gz(), EMRJobRunner._wait_for_s3_eventual_consistency(), and EMRJobRunner._wait_for_job_flow_termination() in test_emr for ~20 second speedup
  2. MRJobRunner.make_runner() sets the _steps attribute of the runner it creates so that the runner doesn't need to make a subprocess to get the job's steps. The machinery for getting a job's steps is completely internal and does not expose any APIs that could break as a result of this change. This is an optimization that we have complete control over. It could also be implemented by adding a keyword arg to MRJobRunner.__init__().

@coyotemarin
Copy link
Collaborator

update setup.py?

@@ -387,6 +387,10 @@ def test_echo_as_steps_python_bin(self):
with mr_job.make_runner() as runner:
assert isinstance(runner, LocalMRJobRunner)
try:
# make_runner() populates _steps in the runner, so un-populate
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The mocked version of make_runner()..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_runner() isn't mocked. To do that I would have to have all test cases inherit from a new superclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or write MRTestingJob() which does what the mrjob.job code above does, and have all the test case jobs inherit from it. That's not too bad.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, whoa, didn't catch that patching runner._steps happens outside of tests. Don't do that!

The whole point of the roundabout way of fetching step definitions by running the script with --steps is to make it simple to write mrjob scripts in other languages. In theory, all we have to do is give mrjob a way to infer the interpreter to run from the script's file extension, and it'll be possible to write a mrjob script in whatever language, and launch it from Python. The performance hit of starting a subprocess is small compared to the overhead of starting up Hadoop (not to mention running the MapReduce job), and if you just want to test quickly, that's what inline mode is for.

It sucks that this slows down tests, but really, that's the tests' problem. If all the tests that run jobs need to inherit from a common base class to patch runner._steps so that they can run faster, they should do that, or they should use inline mode.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or MRTestingJob. That's not a bad option either. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of the roundabout way of fetching step definitions by running the script with --steps is to make it simple to write mrjob scripts in other languages.

And you can still do that. I didn't remove that ability. It works the same as it always did except some information is slipped in early when possible. _steps() even already accounted for it.

Anyway, I went the MRTestingJob route.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: I didn't consider the _steps() thing a crazy testing-only hack. Just populating data when it was available rather than waiting and then unnecessarily wasting resources. Anyway it's gone.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can see that. I guess I'm saying that I want to keep this code path heavily travelled for now.

I think the clean way to do this would be to pass the steps in as an argument to the runner class's constructor rather than patching the runner's attributes. I'd feel better about this if we had a well-documented dictionary-like format for representing steps. It's not bad for the runner to represent steps internally like ['MR', 'R'] but it seems a little weird for this to be the canonical way of passing that information around.

I guess what I'm saying is that I'd like to do the steps optimization, support for other languages, and deeper support for the Java side of Hadoop all together. Not that we need to have all the code written before we release, but that we at least think through all these things and spec out the steps format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

coyotemarin pushed a commit that referenced this pull request Jun 4, 2012
Make tests twice as fast using mock
@coyotemarin coyotemarin merged commit fdf8ad4 into Yelp:master Jun 4, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants