Exclude parameter for scan command #4

Open
almereyda opened this issue Oct 30, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@almereyda

For some cases, we will want to exclude certain directories from scanning, like node_modules or .git.

Excluding this kind of directory could be implemented similarly to rsync's --exclude and --exclude-from switches.
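
To make the idea concrete, here is a minimal sketch of an rsync-like exclude applied while walking a tree. It is not Periscope's implementation; the function name `walk_excluding` and its parameters are hypothetical, and it only matches directory base names against shell-style patterns.

```python
import fnmatch
import os
from typing import Iterator, Sequence


def walk_excluding(root: str, exclude: Sequence[str]) -> Iterator[str]:
    """Yield file paths under root, skipping directories whose name matches a pattern.

    Hypothetical helper for illustration; not part of Periscope.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them,
        # so excluded trees are never traversed at all.
        dirnames[:] = sorted(
            d for d in dirnames
            if not any(fnmatch.fnmatch(d, pat) for pat in exclude)
        )
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Calling `walk_excluding(".", ["node_modules", ".git"])` would skip every directory with one of those names, at any depth. rsync's own pattern rules are richer than this (for example anchored patterns), so this is only an approximation of that behaviour.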

An --exclude-file could then be called .pscignore and, if found, be applied transparently and recursively to all subdirectories.
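
One possible reading of "applied recursively" is that patterns from a .pscignore file affect the directory containing it and everything below. The sketch below assumes that reading. The file format (one shell-style pattern per line, `#` for comments) and the names `read_patterns` and `scan` are assumptions for illustration, not anything Periscope defines.

```python
import fnmatch
import os
from typing import Iterator, List, Tuple

IGNORE_FILE = ".pscignore"  # name taken from the proposal; format below is assumed


def read_patterns(path: str) -> List[str]:
    """Read one pattern per line, skipping blank lines and '#' comments (assumed format)."""
    with open(path, encoding="utf-8") as f:
        lines = (line.strip() for line in f)
        return [line for line in lines if line and not line.startswith("#")]


def scan(directory: str, inherited: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield files under directory, honouring ignore files found along the way."""
    patterns = inherited
    ignore_path = os.path.join(directory, IGNORE_FILE)
    if os.path.isfile(ignore_path):
        # Patterns accumulate: a subdirectory sees its parents' patterns plus its own.
        patterns = inherited + tuple(read_patterns(ignore_path))
    with os.scandir(directory) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.name == IGNORE_FILE:
            continue  # the ignore file itself is not reported
        if any(fnmatch.fnmatch(entry.name, pat) for pat in patterns):
            continue  # matches both files and directories by base name
        if entry.is_dir(follow_symlinks=False):
            yield from scan(entry.path, patterns)
        elif entry.is_file(follow_symlinks=False):
            yield entry.path
```

With this design, an ignore file placed in one project only affects that project's subtree, while sibling directories are still scanned in full.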

@anishathalye
Owner

That's an interesting proposal. I'm trying to keep the Periscope software as simple as possible, so I want to understand whether there is a need for such a feature / how badly it's needed, and whether there's a reasonable workaround etc. How would you use such a feature? E.g. Is it for performance? Is there some usability aspect to it?

I've been using Periscope with pretty large data sizes (250 GB -- 4 TB), with a lot of data in Git repos and folders like node_modules, and I haven't really had a problem with it; I just scan everything, and it doesn't really matter that some stuff in there gets scanned too.

As a workaround, if I wanted to avoid scanning those directories: at least on my computer, all my code is under a src directory, so I could e.g. run psc scan <list directories except src> instead of psc scan . (the current directory). Does such an approach work in your situation?
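
The workaround above amounts to listing everything at the top level except one directory. A minimal sketch, assuming a hypothetical helper named `scan_targets` and that the resulting paths are then passed to psc scan as arguments:

```python
import os
from typing import List


def scan_targets(root: str, skip: str = "src") -> List[str]:
    """List the entries directly under root, leaving out the one named skip.

    Hypothetical helper for illustration; not part of Periscope.
    """
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if name != skip
    )
```

This only helps when the directory to skip sits directly under the starting point. Note that it returns top-level files as well as directories.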

@anishathalye anishathalye added the enhancement New feature or request label Nov 6, 2020
@almereyda
Author

almereyda commented Nov 7, 2020

There are deeply nested directory structures in which multiple .git and node_modules directories can occur anywhere. So there is no guarantee that the directory I would like to omit is visible to me from the point where I start a scan. My hypothesis is that we can tell in advance that some directories don't need deduplication, which would also limit the list of possible candidates beforehand.

I'm also using a ~/src directory and can indeed imagine choosing the directories to scan more carefully before invocation.

--exclude, --exclude-file and .*ignore remain nice patterns one gets used to quickly, when they are around.

@anishathalye
Owner

Ok, makes sense. So it's about limiting the size of the output and filtering out noise, not about performance?
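
If the goal is only to cut noise from a list of candidate paths, the filtering could in principle happen after the fact rather than during traversal. The sketch below assumes the candidates are available as plain path strings; `drop_excluded` is a hypothetical name and does not correspond to any Periscope command or API.

```python
from pathlib import PurePath
from typing import Iterable, List, Set


def drop_excluded(paths: Iterable[str], excluded_names: Set[str]) -> List[str]:
    """Keep only the paths that have no component named in excluded_names.

    Hypothetical helper for illustration; not part of Periscope.
    """
    return [
        p for p in paths
        if excluded_names.isdisjoint(PurePath(p).parts)
    ]
```

Because the check looks at every path component, it catches excluded directories at any nesting depth, but it does nothing to reduce the time spent traversing them.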

(At some point I might do a quick performance test on my machine, I'm kind of curious what is the impact of traversing my src directory.)

I'm open to adding this feature, but just as an FYI, I am not sure exactly when I will have time to implement it.

@almereyda
Author

almereyda commented Nov 10, 2020 via email

@anishathalye
Owner

Hmm my intuition is that this might be a somewhat involved change, so perhaps not an ideal first issue.
