Skip to content

Conversation

@alerman
Copy link
Contributor

@alerman alerman commented Apr 17, 2024

Add visibility as one of the critera that the grep searches

@alerman alerman changed the base branch from main to 2.1 April 17, 2024 18:38
@dlmarion
Copy link
Contributor

I think the description of the GrepCommand needs to be updated to reflect the change.

@Override
public String description() {
return "searches each row, column family, column qualifier and value in a"
+ " table for a substring (not a regular expression), in parallel, on the server side";
}

@ddanielr ddanielr merged commit 3c5326c into apache:2.1 Apr 22, 2024
@ctubbsii
Copy link
Member

ctubbsii commented Apr 23, 2024

This is a change in behavior to an existing user-facing class, that will change the data returned. This was likely not done initially because visibilities are typically used as access control, not searchable user data. While there may be some applications for treating the content of visibilities as searchable user data, I don't think it's a typical use case.

This will very likely break existing applications. I think this needs to be reverted as-is. To protect existing use cases, an extra option can be added (off by default) to include visibilities in the matching.

@alerman
Copy link
Contributor Author

alerman commented Apr 24, 2024

@ctubbsii Thanks for the feedback. I have to say though that I disagree. While it may have been done as that isn't typically "searchable user data", I would argue you could also call this a bug as the grep command (and iterator) search every part of the key which is a byte array EXCEPT the visibility. Can you point me to documentation which specifies that the visibility is a special part of the key that should be treated differently in some cases? The data model page makes no such note.

On top of that, this PR came out of observing a user of accumulo frustrated as the grep they were running was not matching, when they expected that it would. Seems to me that the logical behavior is that it greps for everything, not that it excludes some parts.

@ctubbsii
Copy link
Member

I would argue you could also call this a bug as the grep command (and iterator) search every part of the key which is a byte array EXCEPT the visibility. Can you point me to documentation which specifies that the visibility is a special part of the key that should be treated differently in some cases? The data model page makes no such note.

Each piece of the key are inherently unique and special in the data model, but it's also flexible and you can certainly treat it like other fields, such as in your use case. My point wasn't that visibility must be treated special, but rather that it's also a valid use case if it is. The basis for my objection wasn't to preserve the existing use case, but to show that it's not necessarily a bug because both use cases are valid.

Treating both use cases as valid, then this change will break some valid existing workflows. The severity of that breakage could range anywhere from "harmless surprise", to "security-compromise", depending on the additional data returned. I think we can satisfy both use cases, while avoiding making it a breaking change.

On top of that, this PR came out of observing a user of accumulo frustrated as the grep they were running was not matching, when they expected that it would. Seems to me that the logical behavior is that it greps for everything, not that it excludes some parts.

Is the main frustration the use of grep in the shell, or is it the GrepIterator itself? I'm okay changing the behavior in the shell, which primarily is intended to support user-interactive ad-hoc workflows anyway. My concern is changing the default behavior of the GrepIterator when used directly in table configuration or as a scan-time iterator.

A few things we could do to avoid breaking changes to the GrepIterator:

  1. Create a new iterator with new behavior (this gives the possibility for deprecating the old one, if we want to only have one, and wish to signal the change in behavior), or
  2. Add a matchOnVisibility option to existing iterator that is off by default (my preference).

For the shell, where we don't care about changing behavior as much:

  1. Update the grep command in the shell to use the new iterator or new option and update the documentation (also, highlight the change in behavior in the release notes), and
  2. Maybe also add an option to toggle between the behaviors in the shell (I don't think this is necessary... changing the behavior in the shell is less a concern for me than breaking direct uses of the GrepIterator).

Changing the behavior of the shell's grep command to make it more convenient can be done without changing the behavior of GrepIterator for existing workflows.

@ctubbsii ctubbsii modified the milestones: 4.0.0, 3.1.0, 2.1.3 Jul 12, 2024
@ivakegg
Copy link
Contributor

ivakegg commented Jul 26, 2024

Changing the grep iterator to use an option for whether to match against the visibility is fine with me. I think the shell should default that option to true.

asfgit pushed a commit that referenced this pull request Jul 29, 2024
Update the previously merged feature to match on column visibilities
using the GrepIterator, which was added in PR #4468

Now, GrepIterator supports a number of fine-grained options to choose
exactly which fields to match on. It preserves the behavior prior to the
changes in PR #4468 by default for the GrepIterator, but retains the
change to the GrepCommand for the shell by enabling the new option to
match on column visibilities.
@ctubbsii
Copy link
Member

I pushed the changes to this in commit bdb8ce9 to add options, keeping the behavior of GrepCommand, but defaulting to the previous behavior for any current user of GrepIterator directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants