New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add directory spidering #1115

Merged
merged 47 commits into from Jun 25, 2015

Conversation

Projects
None yet
4 participants
@cpcloud
Member

cpcloud commented Jun 4, 2015

closes #1108

  • test the executable on windows py2
  • test on py3 on mac and windows
  • add yaml file format to the command line as per an internal discussion with @chdoig

@cpcloud cpcloud self-assigned this Jun 4, 2015

@cpcloud cpcloud added this to the 0.8.1 milestone Jun 4, 2015

@cpcloud cpcloud added the enhancement label Jun 4, 2015

def spider(resource_path, ignore=(ValueError, NotImplementedError)):
return {

This comment has been minimized.

@llllllllll

llllllllll Jun 5, 2015

Member

You might want to wrap this in:

try:
    ....
except RuntimeError:
    raise ValueError('{dir} is too deeply nested'.format(dir=resource_path))

So that users are not confused by a recursion limit error.

This comment has been minimized.

@cpcloud

cpcloud Jun 5, 2015

Member

an os.walk based solution would probably eliminate the need for this restriction entirely

This comment has been minimized.

@mrocklin

mrocklin Jun 5, 2015

Member

What is the limitation here? Is this about the case of a very very nested directory? How likely is this?

This comment has been minimized.

@llllllllll

llllllllll Jun 5, 2015

Member

It is unlikely unless there is a symlink back above the parent. also, I agree with @cpcloud about the use of walk.

This comment has been minimized.

@cpcloud

cpcloud Jun 5, 2015

Member

i'll add a test case for a directory cycle

This comment has been minimized.

@cpcloud

cpcloud Jun 5, 2015

Member

i'm probably going to leave the implementation as is (modulo the runtimeerror catching) until someone comes to us with a 1000 level deep directory

This comment has been minimized.

@llllllllll

llllllllll Jun 5, 2015

Member

Okay, maybe we should add a flag for follow_symlinks or something? Also, it doesn't hurt to provide a nicer error than RuntimeError: Maximum recursion depth exceeded. We could even suggest altering the follow_symlinks flag.

edit: Sorry, didn't see your prior comments

This comment has been minimized.

@cpcloud

cpcloud Jun 5, 2015

Member

It looks like the number of symlinks is limited by the OS. In OS X for example:

$ ag -C 5 'MAXSYMLINKS' /usr/include/
/usr/include/sys/param.h
186-
187-/*
188- * MAXPATHLEN defines the longest permissable path length after expanding
189- * symbolic links. It is used to allocate a temporary buffer from the buffer
190- * pool in which to do the name expansion, hence should be a power of two,
191: * and must be less than or equal to MAXBSIZE.  MAXSYMLINKS defines the
192- * maximum number of symbolic links that may be expanded in a path name.
193- * It should be set high enough to allow all legitimate uses, but halt
194- * infinite loops reasonably quickly.
195- */
196-#define     MAXPATHLEN      PATH_MAX
197:#define MAXSYMLINKS 32

This comment has been minimized.

@cpcloud

cpcloud Jun 5, 2015

Member

linux is similarly a small number.

@llllllllll @mrocklin what do you think about symlinks? always follow? not follow, but have an option? i'm thinking option is the way to go

This comment has been minimized.

@llllllllll

llllllllll Jun 5, 2015

Member

I would prefer an option that defaults to True

@mrocklin

This comment has been minimized.

Member

mrocklin commented Jun 6, 2015

what do you think about symlinks?

My somewhat lazy tactic would probably be just not to care until someone ran into and reported the problem.

@chdoig

This comment has been minimized.

Member

chdoig commented Jun 16, 2015

Hey @cpcloud sorry for the delay reviewing this PR.

Could we have the blaze-server runnable script accessible from the blaze top directory? There are different ways to do that in my experience, I'll be happy to write the option that you prefer, didn't want to choose one without your opinion:

Options:

e.g.

  entry_points:
    - blaze-server = blaze.server.server:main

e.g.

      entry_points= {
          'console_scripts': ['blaze-server = blaze.server.server:main']
      },

e.g.

#!/usr/bin/env python

import sys
from blaze.server.server import main
sys.exit(main())
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 16, 2015

i'm +1 on the entry_points (in conda and setup.py) method

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 16, 2015

@chdoig mostly i'm concerned that might've gone a little hog wild with options in the spider function

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 17, 2015

@chdoig I've added an entry point in conda and in setup.py

@chdoig

This comment has been minimized.

Member

chdoig commented Jun 17, 2015

Cool, what do you mean by a little hog wild? Like in the arguments options? Any specific feedback you need from me?

This might not going off a tangent a little, but just to through it here: it might also be nice to have a file with Blaze URIs, name and some metadata (maybe a yaml file) or a master table / catalog:

data.file

name: accounts-data
uri: 'blaze/examples/data/accounts_*.csv'
----
name: iris-dataset
uri: 'sqlite:///blaze/examples/data/iris.db::iris'

and that you could do:

blaze-server --from_file data.file
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 17, 2015

what about

accounts_data: 'blaze/examples/data/accounts_*.csv'
iris_dataset: 'sqlite:///blaze/examples/data/iris.db::iris'

?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 17, 2015

by hog wild i was really referring to my glob argument which i think i'm going to remove ... it's kind of fragile and hard to reason about

@chdoig

This comment has been minimized.

Member

chdoig commented Jun 17, 2015

Works, I was just thinking that you might wanted to also keep some description, the datashape...

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 17, 2015

hm okay what about this:

accounts_data:
  uri: path/to/file
iris_dataset:
  uri: sqlite:///path/to/file.db
  dshape: "var * {...}"

We should take advantage of the fact that yaml already has a way to name things without having to explicitly say "here is the name"

@chdoig

This comment has been minimized.

Member

chdoig commented Jun 17, 2015

Yes, that was silly. This looks better.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

merging this soon

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

just tested a simple version of this on windows and it worked. merging on next pass

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

hm, nevermind there's a strange python34 unicode error in json discovery

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

turns out it was an untested aspect of odo

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

odo PR on the way

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

fixed in blaze/odo#243

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 25, 2015

merging on pass for real this time :)

cpcloud added a commit that referenced this pull request Jun 25, 2015

Merge pull request #1115 from cpcloud/spider
Add directory spidering

@cpcloud cpcloud merged commit 82bc752 into blaze:master Jun 25, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@cpcloud cpcloud deleted the cpcloud:spider branch Jun 25, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment