Filesystem functionality refactor #443

wants to merge 65 commits into


None yet

2 participants

irskep commented Apr 15, 2012

This branch moves all filesystem-related code into mrjob.fs.local, mrjob.fs.s3, mrjob.fs.ssh, and mrjob.fs.multi. There are many new tests under tests.fs.

I'll test this live some time this week or next, whenever I get a chance.


  • MRJobRunner needs a custom __getattr__ implementation for backward compatibility
  • mockssh and mockhadoop are uglier than they were before to support the new tests' monkey-patching of Popen (however this method is MUCH faster than actually copying files and calling Popen)

Possible future work:

  • HadoopJobRunner still contains a manual call to invoke_hadoop(['-fs', 'mkdir', path]) which should be pulled out at some point
  • More unit tests for MultiFilesystem, though it's currently covered pretty well by the existing integration tests
  • Add mrjob.fs to the public documentation
Steve Johnson and others added some commits Apr 12, 2012
Steve Johnson Remove old filesystem methods, fix attribute forwarding 0869232
Steve Johnson Populate mrjob.fs.hadoop with fs code aa8af70
Steve Johnson Converted HadoopJobRunner to fs objects, still one failing test with …
Steve Johnson Fixed hadoop du() implementation to match mrjob trunk 55d4946
Steve Johnson EMRJobRunner uses S3Filesystem (temporarily killed SSH) 6896aa5
Steve Johnson Added missing socket import to mrjob.fs.s3 20f56ee
Steve Johnson Fixed TestNoBoto to inject None as boto import ba0063a
Steve Johnson Cleaned up imports and definitions e5fb5d8
Steve Johnson Pretended to add ssh fs, not very convincingly 21d280b
@irskep irskep Removed S3 cruft from mrjob.fs.ssh 17683be
@irskep irskep SSHFilesystem almost works 20688b6
@irskep irskep SSH log fetch searches correct directory, test inits _fs properly 2f5ca78
@irskep irskep More SSHFilesystem fixin' c7d16c5
@irskep irskep Fix TestSSHLS 79082e3
@irskep irskep Fixed test_fetch_logs and sped up fetch_logs 677b174
@irskep irskep Fixes for pyflakes 556b9d8
@irskep irskep Comments explaining emr test cases a0505c2
@irskep irskep Helpful comments, remove dead code a1d1fbc
@irskep irskep Copyright and Licenses: For the Good of Humanity™ 1359a6d
@irskep irskep Started writing tests for local fs 6ae5bc9
@irskep irskep More local fs tests 8876792
@irskep irskep Ported over old local fs tests 16ff18b
@irskep irskep Separated bz2 and gz test cases 50aad7f
@irskep irskep Test remaining methods of local fs df3ea4e
@irskep irskep Slight mockhadoop refactoring to make monkey-patching subprocess easier bb786c8
@irskep irskep A little more mockhadoop refactoroing c741ed5
@irskep irskep Ported all the important hadoop fs tests to tests.fs.test_hadoop 4af1dff
@irskep irskep More hadoop fs tests, fix mockhadoop -rmr behavior, fix hadoop fs pat…
…h_exists behavior
@irskep irskep More hadoop fs test fixes b288fb3
@irskep irskep _fs starts as None instead of nonexistent to fix infinite recursion i…
…n __getattr__
@irskep irskep Re-fixed TestSSHLs 2707709
@irskep irskep Refactored mockssh to take mocked pipes/environ fa688a4
@irskep irskep Refactoring fs test cases to generalize mocked popen 7f572c5
@irskep irskep Pass custom env to mock popen 6d64153
@irskep irskep Finished first attempt at mocking subprocess for ssh d1713da
@irskep irskep More testing bug fixes/refactoring, basic SSH ls test case works bf003c2
@irskep irskep All ssh fs ls tests pass c2977ff
@irskep irskep First ssh slave tests, almost working 4bcd8f5
@irskep irskep Fixed slave cat test case 40c41f8
@irskep irskep Test case for ssh ls with slaves 78fdceb
@irskep irskep Small ssh test fix 4363a89
@irskep irskep S3 tests c227a18
@irskep irskep Remove old tests ad2005a
@irskep irskep Pyflakes 69b7c32
@irskep irskep Fix some errors and test failures dcf7c15
@irskep irskep Quiet some fetch_logs test cases f2a11bb
@irskep irskep Fixed hadoopy tests d6b49ee
@irskep irskep Fixed testing regression in tool tests 722f462
@irskep irskep Remove cat() from local fs, wrap local in multi for regular runner da81a74
@irskep irskep Canonical docstrings moved to MultiFilesystem 3ad4f9e
@irskep irskep Fix s3 fs docstring 6cd99c5
@irskep irskep Remove _invoke_hadoop() from HadoopJobRunner, it's already in the fs a30d285
@irskep irskep Pyflakes 14e9d9f
Steve Johnson Merge branch 'master' of into fs_refactor 75c9e56
Steve Johnson Merge branch 'fs_refactor' of into fs_refactor 9dd1971
Steve Johnson Merge from master 48af5a3
Steve Johnson Fix a bad tool test case 4b290df
Steve Johnson Cleaned up SSH key copying state bfdb5c4
Steve Johnson _ssh_key_name is a property now. Also fixed a test case. c920044
irskep commented May 31, 2012

Updated with master and cleaned up a bit. Ready for review.

No doc updates because I didn't want to change the public interface yet.

@davidmarin davidmarin commented on an outdated diff Jun 4, 2012
@@ -476,6 +476,22 @@ def get_default_opts(cls):
return cls.OPTION_STORE_CLASS(cls.alias, {}, False).default_options()
+ ### Filesystem object ###
+ @property
+ def fs(self):
+ if self._fs is None:
+ # wrap in MultiFilesystem so we get cat()
davidmarin Jun 4, 2012 Collaborator

This seems a bit elaborate to me.

I think it'd be less surprising to have all the FileSystem classes inherit from mrjob.fs.base.FileSystemBase, and put the cat() method in FileSystemBase.

irskep commented Jun 4, 2012

Another to-do: write MockFilesystem and replace a host of disparate filesystem mocking techniques with just one.

irskep commented Jun 4, 2012

Tweaks from review. Added BaseFilesystem, and renamed MultiFilesystem to CompositeFilesystem on dnephin's suggestion.

@davidmarin davidmarin and 1 other commented on an outdated diff Jun 4, 2012
@@ -0,0 +1,27 @@
+# Copyright 2009-2012 Yelp and Contributors
davidmarin Jun 4, 2012 Collaborator

Don't put it in! A blank is a beautiful thing.

irskep Jun 4, 2012 Contributor

I always thought that was personal preference, and mrjob hadn't expressed an opinion on it. Moved it to mrjob.fs.base because hey why not.

davidmarin Jun 4, 2012 Collaborator

It's just, once you put something in, it's really hard to move it back out.

@davidmarin davidmarin and 1 other commented on an outdated diff Jun 4, 2012
+# You may obtain a copy of the License at
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+log = logging.getLogger('mrjob.fs')
+class BaseFilesystem(object):
davidmarin Jun 4, 2012 Collaborator

Sorry, I expected to see all the methods supported by a filesystem, with docstrings and raise NotImplementedError. It's verbose, but it makes it a lot easier for someone reading the code to understand what these filesystem objects do.

irskep Jun 4, 2012 Contributor

Done, and moved all docstrings in as well instead of burying them in CompositeFilesystem. Might affect the docs though.

irskep commented Jun 4, 2012

I neglected to check the documentation earlier. Several methods are now missing:


I didn't even realize the EMRJobRunner methods were documented. I guess I could stub them out in EMRJobRunner, or add the fs modules to the documentation and mention that the methods are all forwarded.

irskep commented Jun 4, 2012

(By 'missing' I mean 'missing from the documentation'. The code all still works as expected.)


I think the fs modules should be added to the documentation, and you should explain that you can use runner.fs.method() to manipulate the filesystem.

You might note somewhere in the docs that the methods are also forwarded to the runner itself, but that it's not the preferred way of doing things (and will probably be deprecated in 0.4).

irskep commented Jun 5, 2012


irskep commented Jun 9, 2012

Merged into release-v0.4.

@irskep irskep closed this Jun 9, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment