irskep commented Apr 15, 2012

This branch moves all filesystem-related code into mrjob.fs.local, mrjob.fs.s3, mrjob.fs.ssh, and mrjob.fs.multi. There are many new tests under tests.fs.

I'll test this live some time this week or next, whenever I get a chance.


  • MRJobRunner needs a custom __getattr__ implementation for backward compatibility
  • mockssh and mockhadoop are uglier than before, because they now have to support the new tests' monkey-patching of Popen (that said, this approach is MUCH faster than actually copying files and calling Popen)
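
For anyone reading along, the Popen patching can be sketched roughly as below. This is not the actual mockssh/mockhadoop code: `FakePopen` and `run` are made-up names, and it uses `unittest.mock` from modern Python rather than whatever the tests in this branch use.

```python
import subprocess
from unittest import mock


class FakePopen(object):
    """Hypothetical stand-in for subprocess.Popen; not mrjob's mockssh."""

    def __init__(self, args, stdout=None, stderr=None, **kwargs):
        self.args = args
        self.returncode = 0

    def communicate(self, input=None):
        # Pretend the command printed its own arguments.
        return (' '.join(self.args).encode('utf-8'), b'')


def run(args):
    """Code under test: looks up subprocess.Popen at call time."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    stdout, _ = proc.communicate()
    return stdout


# While the patch is active, no real process is started.
with mock.patch('subprocess.Popen', FakePopen):
    output = run(['ssh', 'host', 'ls', '/tmp'])
```

Because nothing is forked and no files are copied, a test built this way only pays for a couple of Python calls.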

Possible future work:

  • HadoopJobRunner still contains a manual call to invoke_hadoop(['fs', '-mkdir', path]), which should be pulled out at some point
  • More unit tests for MultiFilesystem, though it's currently covered pretty well by the existing integration tests
  • Add mrjob.fs to the public documentation
irskep commented May 31, 2012

Updated with master and cleaned up a bit. Ready for review.

No doc updates because I didn't want to change the public interface yet.

davidmarin commented on an outdated diff Jun 4, 2012
@@ -476,6 +476,22 @@ def get_default_opts(cls):
         return cls.OPTION_STORE_CLASS(cls.alias, {}, False).default_options()
+    ### Filesystem object ###
+    @property
+    def fs(self):
+        if self._fs is None:
+            # wrap in MultiFilesystem so we get cat()
davidmarin Jun 4, 2012 Collaborator

This seems a bit elaborate to me.

I think it'd be less surprising to have all the FileSystem classes inherit from mrjob.fs.base.FileSystemBase, and put the cat() method in FileSystemBase.

irskep commented Jun 4, 2012

Another to-do: write MockFilesystem and replace a host of disparate filesystem mocking techniques with just one.

irskep commented Jun 4, 2012

Tweaks from review. Added BaseFilesystem, and renamed MultiFilesystem to CompositeFilesystem on dnephin's suggestion.
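
A rough sketch of the composite idea, for context. Everything here is illustrative rather than the code in this branch: `LocalFS`, `S3FS`, and the `can_handle_path` check are assumptions made for the example.

```python
class LocalFS(object):
    """Hypothetical filesystem handling plain paths."""

    def can_handle_path(self, path):
        return '://' not in path

    def ls(self, path):
        return ['local:' + path]


class S3FS(object):
    """Hypothetical filesystem handling s3:// URIs."""

    def can_handle_path(self, path):
        return path.startswith('s3://')

    def ls(self, path):
        return ['s3:' + path]


class CompositeFilesystem(object):
    """Forward each call to the first filesystem that accepts the path."""

    def __init__(self, *filesystems):
        self._filesystems = filesystems

    def __getattr__(self, name):
        # Only reached when normal lookup fails; never forward private names.
        if name.startswith('_'):
            raise AttributeError(name)

        def dispatch(path, *args, **kwargs):
            for fs in self._filesystems:
                if fs.can_handle_path(path):
                    return getattr(fs, name)(path, *args, **kwargs)
            raise IOError('no filesystem can handle %r' % (path,))
        return dispatch


fs = CompositeFilesystem(S3FS(), LocalFS())
```

With this shape, `fs.ls('s3://bucket/key')` goes to the S3 object and `fs.ls('/tmp/x')` falls through to the local one.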

davidmarin and 1 other commented on an outdated diff Jun 4, 2012
@@ -0,0 +1,27 @@
+# Copyright 2009-2012 Yelp and Contributors
davidmarin Jun 4, 2012 Collaborator

Don't put it in! A blank is a beautiful thing.

irskep Jun 4, 2012 Contributor

I always thought that was personal preference, and mrjob hadn't expressed an opinion on it. Moved it to mrjob.fs.base because hey why not.

davidmarin Jun 4, 2012 Collaborator

It's just, once you put something in, it's really hard to move it back out.

davidmarin and 1 other commented on an outdated diff Jun 4, 2012
+# You may obtain a copy of the License at
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+log = logging.getLogger('mrjob.fs')
+class BaseFilesystem(object):
davidmarin Jun 4, 2012 Collaborator

Sorry, I expected to see all the methods supported by a filesystem, with docstrings and raise NotImplementedError. It's verbose, but it makes it a lot easier for someone reading the code to understand what these filesystem objects do.
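
Something along these lines, as a minimal sketch. The method names (`ls`, `_cat_file`, `cat`) and the toy `DictFilesystem` subclass are assumptions for illustration, not the final interface.

```python
class BaseFilesystem(object):
    """Sketch of a base class; method names here are assumptions."""

    def ls(self, path_glob):
        """Yield every path matching *path_glob*."""
        raise NotImplementedError

    def _cat_file(self, path):
        """Yield the lines of the single file at *path*."""
        raise NotImplementedError

    def cat(self, path_glob):
        """Yield lines from every file matching *path_glob*, in ls() order."""
        for path in self.ls(path_glob):
            for line in self._cat_file(path):
                yield line


class DictFilesystem(BaseFilesystem):
    """Toy in-memory subclass, for illustration only."""

    def __init__(self, files):
        self._files = files

    def ls(self, path_glob):
        # No real globbing: treat path_glob as a prefix.
        for path in sorted(self._files):
            if path.startswith(path_glob):
                yield path

    def _cat_file(self, path):
        for line in self._files[path]:
            yield line
```

A subclass only has to fill in the stubs; `cat()` comes for free from the base class.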

irskep Jun 4, 2012 Contributor

Done, and moved all docstrings in as well instead of burying them in CompositeFilesystem. Might affect the docs though.

irskep commented Jun 4, 2012

I neglected to check the documentation earlier. Several methods are now missing:


I didn't even realize the EMRJobRunner methods were documented. I guess I could stub them out in EMRJobRunner, or add the fs modules to the documentation and mention that the methods are all forwarded.

irskep commented Jun 4, 2012

(By 'missing' I mean 'missing from the documentation'. The code all still works as expected.)


davidmarin commented Jun 5, 2012

You might note somewhere in the docs that the methods are also forwarded to the runner itself, but that it's not the preferred way of doing things (and will probably be deprecated in 0.4).
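
If the forwarding does get deprecated, it could look something like the sketch below. `Runner`, `Filesystem`, and `_FORWARDED` are hypothetical names, not the real MRJobRunner, and the warning text is made up.

```python
import warnings


class Filesystem(object):
    """Hypothetical filesystem object with one method."""

    def ls(self, path):
        return [path]


class Runner(object):
    """Hypothetical runner; not the real MRJobRunner."""

    # Names forwarded to self.fs for backward compatibility.
    _FORWARDED = frozenset(['ls'])

    def __init__(self):
        self.fs = Filesystem()

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in self._FORWARDED:
            warnings.warn('runner.%s() is deprecated; use runner.fs.%s()'
                          % (name, name), DeprecationWarning, stacklevel=2)
            return getattr(self.fs, name)
        raise AttributeError(name)
```

Old code calling `runner.ls(...)` keeps working, while `runner.fs.ls(...)` stays silent and is the preferred spelling.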

irskep commented Jun 9, 2012

Merged into release-v0.4.

irskep closed this Jun 9, 2012
