Skip to content

Avoid checking Accumulo table exists before creation#74

Merged
jmark99 merged 28 commits intoapache:mainfrom
jmark99:ae13
May 7, 2021
Merged

Avoid checking Accumulo table exists before creation#74
jmark99 merged 28 commits intoapache:mainfrom
jmark99:ae13

Conversation

@jmark99
Copy link
Contributor

@jmark99 jmark99 commented Apr 5, 2021

Rather than checking if a table exists prior to creation, just create, catch and ignore TableExistsException instead.

This ticket removes the explicit check where appropriate. During the process several other small changes were made to the affected classes as well.

For the Bloom related examples, the table names were updated to use contants. Several methods were created to remove some redundant code.

For the CharacterHistogram.java class, an unneeded import was also removed.

Closes #13

Rather than checking if a table exists prior to creation, create,
catch and ignore TableExistsException instead.

This ticket removes the explicit check where appropriate. During the
process several other small changes were made to the affected classes
as well.

For the Bloom related examples, the table names were updated to use
contants. Several methods were created to remove some redundant code.

For the CharacterHistogram.java class, an unneeded import was also
removed.

Closes apache#13
try {
client.tableOperations().create(inputTable);
} catch (TableExistsException e) {
// ignore

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not represent an actual error condition - this implies that the proceeding table delete failed? It shouldn't happen, but if it does then it seems something needs investigation.

client.tableOperations().setProperty("bloom_test4", "table.bloom.enabled", "true");
client.tableOperations().create(BLOOM_TEST3);
} catch (TableExistsException e) {
// ignore

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run the test twice, do you get double the data? Does that matter?

@EdColeman
Copy link

A general question - could ignoring a table exists cause confusion if the results are examined by the user? Say for an ingest, running the example twice might lead to doubling the ingest - if I expected 10 rows but instead had 20 after a second run would that be worse than having the second fail because the table was there? Or are the scanners that are pulling the expect number of rows ingested and would only return partial results on multiple runs? Would it help if the exception was logged instead of silently ignored?

Another thought is if the examples are to serve as "recommended coding practices" is there benefit of showing the handling of table exists exceptions something we would recommend?

Overall I have no issues with the changes, just mainly curious as to why TableExists exceptions would be ignored.

@jmark99
Copy link
Contributor Author

jmark99 commented Apr 5, 2021

@EdColeman those are good questions to ask. I assume the initial intent of the ticket was to prevent checking for the existence of the table prior to every creation. The current state of the change keeps the examples working as before. That is, previously the code would check for the existence of the table and if it already existed would just keep chugging along without performing the actual creation. If errors arose or if the data was incorrect after a second consecuteve run, then the example would fail somewhere down the line or just have incorrect output (at least for most of the ones that I see).

Ideally, prior to running the examples you would want the table to not even exist. That way we would hopefully see consistent results between each run. I briefly considered updating the examples to delete tables if they already existed prior to execution, but I was worried if someone ran an example and had a 'real' table with a name from the example that could cause a problem. I guess that could still occur if even if the examples run and names were identical.

Another possiblity is to update them to provide a message to the user to remove any existing tables before running and exit. That would allow the user to clear up existing tables and then run the example. I would probably prefer that path at the moment. What are your thoughts on that approach?

I did run the integration tests back to back and they all still passed. I did not examine each example individually to see what would happen if run back to back.

So in summary, the current change keeps all of the examples running in the same manner as they were already running, but without the explicit table existence check. But I am open to updating to one of the aforementioned methods or to another route if you have other suggestions.

@jmark99 jmark99 marked this pull request as draft April 5, 2021 20:55
@jmark99
Copy link
Contributor Author

jmark99 commented Apr 5, 2021

@EdColeman and I had some additional Slack conversations and came up with a few ideas to hopefully improve the examples a bit more. In regards to table creation a couple of additional changes will be made. Firstly, the table names are going to receive a prefix so that there will be no conflict with any existing tables that a user might already be using. Additionally, If a table already exists, this will be logged so that a user can make an informed decision as to whether they would like to clean up the tables prior to running or re-running the examples. The log messages will indicate whether any additional configuration has taken place in cases where the table already existed. Finally, if situations exist where the example should definitely not proceed if a table already exists, the example will be updated to reflect this.

jmark99 added 24 commits April 12, 2021 09:27
Created Constants class to hold tables names and other constants used
throughout the examples.

Updated RandomBatchScanner to print log warning if table already exists.

Updated SequentialBatchWriter to exit if the required table for
scanning does not exist.
Updated to exit with error if table creation fails.
Created common Enum class for table names and constants.

Updated table creation code to log warning message if table already
exists.
Updated table name in batch.md documentation.
Updated to use examples namespace and to warn user if table already
exists.
Updated bloom.md docs to represent updated use of namespace with table
name.
Updated table creation to warn users if table already exists.
Created common enum class to contain table names and other values.
Updated classpath.md documentation to include creation of examples
namespace.

Also, update config command to use table.class.loader.context rather
than deprecated table.classpat.context.
Updated .gitignore to ignore log4j properties in conf directory.
Updated ReadWriteExample and RowOperation classes to use
TableExistsException and warning messages to user if table already
exists.
Minor style change to array creation in StatsCombiner.java.

Updated combiner.md documentation to make use of examples namespace.
Updated compacationStrategy.md documentation to make use of namespaces.
Update log message constants in Constants Enum class.
* Create common enum to contain namespace and table names.
* Small modifications to several small methods to read clearer.
* Extract common creation of constraints table to common location.
* Replace use of System.out.println with logging.
Updated deleteKeyValuePair.md documentation to use example namespaces.
Update dirlist example to use namespaces and to warn users of the prior
existence of tables.

Create common class to hold table names.
Update export.md documentation to utilize namespaces.
Updated documentation for filter, helloworld, and isolation examples.

Updated associated java classes for the aforementioned exmples.
Update the constraint related examples to work with the recent change to
the Constraints package. The Constraints were moved to publie API and
the location of the imports had to be updated to point to the right
location.
Updated docs for isolation, regex, reservaction, rgbalancer, and
rowhash.

Updated the Constraint related classes.
Update docs for sample and shard example.

Update the SampleExample class to handle examples namespace.
Update documentation for wordCount example.

Update WordCount class to create namespace and table.

Added MRCommon class to contains common mapreduce package variables.
Update Visibility example documentation to reflect the use of the
examples namespace.

Update contraints path to use new public API path for Constraints.
Update documentation for tabletofile and terasoft examples.
Update batch.md documentation.

Modify access to table name Strings in BloomCommon.

Remove default names for tables in Shard example.
jmark99 added 2 commits May 6, 2021 12:35
Updates several documentation files to add namespace info.

Updates various classes to use the common createTableWithNamespace
method.

Updates the declaration of table names in various classes.
Update filedata documentation to note that the ChunkCombiner class must
be avaiable in the accumulo lib directory or on the classpath somewhere
in order to scan the create examples.dataTable.
@jmark99 jmark99 marked this pull request as ready for review May 6, 2021 17:45
try {
client.tableOperations().create(inputTable);
} catch (TableExistsException e) {
low.error("Something went wrong. Table '{}' should have been deleted prior to creation "

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be log?

try {
client.tableOperations().create(outputTable);
} catch (TableExistsException e) {
low.error("Something went wrong. Table '{}' should have been deleted prior to creation "

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be log (instead of low)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :) Thanks for the catch. It was not caught by compile since this class is not in the src path of the examples.

Replace misspelled word - low -> log
Copy link

@EdColeman EdColeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look okay, except for the two minor questions.

I tried to build, but I think I need additional set-up for that to work.

@jmark99
Copy link
Contributor Author

jmark99 commented May 7, 2021

@EdColeman, yes, you need to have a running instance and update the examples conf files accordingly. For a couple of the examples the examples jar file must be in the Accumulo lib directory or somewhere on the classpath. I use fluo-uno for running the examples. I've gone through them all and except for the couple of issues that are still listed in GitHub, they are working as expected. Are you good with me pushing the changes at this time?

@jmark99 jmark99 merged commit ef2e04e into apache:main May 7, 2021
@jmark99 jmark99 deleted the ae13 branch May 12, 2021 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid checking if Accumulo table exists before creating it

2 participants