-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tests twice as fast using mock #466
Conversation
update |
@@ -387,6 +387,10 @@ def test_echo_as_steps_python_bin(self): | |||
with mr_job.make_runner() as runner: | |||
assert isinstance(runner, LocalMRJobRunner) | |||
try: | |||
# make_runner() populates _steps in the runner, so un-populate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The mocked version of make_runner()
..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_runner()
isn't mocked. To do that I would have to have all test cases inherit from a new superclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or write MRTestingJob()
which does what the mrjob.job
code above does, and have all the test case jobs inherit from it. That's not too bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, whoa, didn't catch that patching runner._steps
happens outside of tests. Don't do that!
The whole point of the roundabout way of fetching step definitions by running the script with --steps
is to make it simple to write mrjob scripts in other languages. In theory, all we have to do is give mrjob
a way to infer the interpreter to run from the script's file extension, and it'll be possible to write a mrjob
script in whatever language, and launch it from Python. The performance hit of starting a subprocess is small compared to the overhead of starting up Hadoop (not to mention running the MapReduce job), and if you just want to test quickly, that's what inline mode is for.
It sucks that this slows down tests, but really, that's the tests' problem. If all the tests that run jobs need to inherit from a common base class to patch runner._steps
so that they can run faster, they should do that, or they should use inline
mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or MRTestingJob
. That's not a bad option either. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point of the roundabout way of fetching step definitions by running the script with --steps is to make it simple to write mrjob scripts in other languages.
And you can still do that. I didn't remove that ability. It works the same as it always did except some information is slipped in early when possible. _steps()
even already accounted for it.
Anyway, I went the MRTestingJob
route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify: I didn't consider the _steps()
thing a crazy testing-only hack. Just populating data when it was available rather than waiting and then unnecessarily wasting resources. Anyway it's gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can see that. I guess I'm saying that I want to keep this code path heavily travelled for now.
I think the clean way to do this would be to pass the steps in as an argument to the runner class's constructor rather than patching the runner's attributes. I'd feel better about this if we had a well-documented dictionary-like format for representing steps. It's not bad for the runner to represent steps internally like ['MR', 'R']
but it seems a little weird for this to be the canonical way of passing that information around.
I guess what I'm saying is that I'd like to do the steps optimization, support for other languages, and deeper support for the Java side of Hadoop all together. Not that we need to have all the code written before we release, but that we at least think through all these things and spec out the steps format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks!
Make tests twice as fast using mock
This branch introduces a dependency on the
mock
library and another change which take the test suite running time from 64 seconds to 30 seconds.time.sleep()
,MRJobRunner._create_mrjob_tar_gz()
,EMRJobRunner._wait_for_s3_eventual_consistency()
, andEMRJobRunner._wait_for_job_flow_termination()
intest_emr
for ~20 second speedupMRJobRunner.make_runner()
sets the_steps
attribute of the runner it creates so that the runner doesn't need to make a subprocess to get the job's steps. The machinery for getting a job's steps is completely internal and does not expose any APIs that could break as a result of this change. This is an optimization that we have complete control over. It could also be implemented by adding a keyword arg toMRJobRunner.__init__()
.