Skip to content

NIFI-12382: Add DatabaseSchemaRegistry service#8042

Closed
mattyb149 wants to merge 4 commits intoapache:mainfrom
mattyb149:NIFI-12382
Closed

NIFI-12382: Add DatabaseSchemaRegistry service#8042
mattyb149 wants to merge 4 commits intoapache:mainfrom
mattyb149:NIFI-12382

Conversation

@mattyb149
Copy link
Contributor

Summary

NIFI-12382 This PR adds a DatabaseSchemaRegistry controller service which gets the metadata for a specified database table and creates a RecordSchema from it.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for proposing this new feature @mattyb149.

The concept of a database-backed Schema Registry makes sense, but this initial implementation appears to be very specific in terms of table definition, and does not include any documentation on those expectations. With that background, it seems like this could make supporting such a component potentially difficult with concerns around the potential need for schema migration.

For these reasons, it seems like a generalized implementation needs to more configuration options and documentation. Even then, some table definition decisions raise other questions, like IS_NULLABLE as a string versus a tiny integer or something similar. Perhaps it would be helpful to address some of these design considerations before going further.

@mattyb149
Copy link
Contributor Author

I'll add more documentation around the use case and what this service does, but I didn't see any use for a DB-backed registry that just stores schema text (such as a String column containing an Avro schema) since we have other implementations that basically do that. I can rename the service DatabaseTableSchemaRegistry if that helps make it clearer what this service is for, but the idea is to generate the schema from metadata rather than store and retrieve schemas. This can be used by things like ValidateRecord in order to ensure that records will successfully go into a database with a downstream PutDatabaseRecord for example.

@exceptionfactory
Copy link
Contributor

Thanks for the reply @mattyb149. Renaming the class could be helpful.

Another potential concern is using the schema name as the table name. Aside from schema naming conventions potentially conflicting with database table naming requirements, this would also require a table for every schema, which seems like it could be difficult to maintain. Although this could make sense in some specific environments, it seems like this might not be a good fit for inclusion as a component for general usage.

@pvillard31
Copy link
Contributor

I'm jumping in the conversation here but I don't think this is a very specific use case. The idea is that, in many cases, data would be sent into a database by NiFi and some users may not want to deal with schemas in NiFi or introduce an external Schema Registry. The idea here is to be able to retrieve the schema of the destination table in the NiFi flow so that it can be leveraged instead of the Infer Schema option which comes with the limitations and risks that we know.

@exceptionfactory
Copy link
Contributor

Thanks for the additional input @pvillard31, that is helpful, and makes more sense now.

I realize that I was not evaluating all the implementation details. Now I see that the implementation uses the Database Metadata results, and that is what defines the information. Sorry for missing that detail earlier @mattyb149.

With that in mind, I can now see how this makes more sense as a way to use database table metadata for the schema definition. Perhaps including Metadata in Service class name would also help describe how this works.

With that background, are there any concerns about JDBC driver support across vendors? I would expect that more popular database vendors should work with this approach, but some may not, so that may be worth highlighting under additional details.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the capability description @mattyb149, that definitely helps clarify how things work now. I noted a couple more detail recommendations.


// If no columns are found, check that the table exists
if (recordFields.isEmpty()) {
try (final ResultSet tblrs = dmd.getTables(dbCatalogName, dbSchemaName, tableName, null)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend adjusting the variable name for readability:

Suggested change
try (final ResultSet tblrs = dmd.getTables(dbCatalogName, dbSchemaName, tableName, null)) {
try (final ResultSet tablesResultSet = dmd.getTables(dbCatalogName, dbSchemaName, tableName, null)) {

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the code adjustments @mattyb149. The latest version works well in runtime testing with MySQL.

I noted a couple minor convention recommendations. The only concern remaining is the location of the class itself. Although it is convenient to have it with the Connection Pools, it seems out of place. As mentioned in the detail comments, moving it to a separate NAR seems like a better long-term solution, with minimal impact on the size. Thoughts?

}

private void checkTableExists(final DatabaseMetaData databaseMetaData, final String tableName) throws SchemaNotFoundException, SQLException {
try (final ResultSet tblrs = databaseMetaData.getTables(dbCatalogName, dbSchemaName, tableName, null)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend renaming this variable for readability.

Suggested change
try (final ResultSet tblrs = databaseMetaData.getTables(dbCatalogName, dbSchemaName, tableName, null)) {
try (final ResultSet tables = databaseMetaData.getTables(dbCatalogName, dbSchemaName, tableName, null)) {

<modules>
<module>nifi-dbcp-service</module>
<module>nifi-hikari-dbcp-service</module>
<module>nifi-db-schema-registry-service</module>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this uses Database Connection Pools, it does not seem like it fits in this bundle and NAR. As it relies only on standard service interfaces, it should be a small addition as a separate NAR. What do you think?

}

@AfterAll
public static void shutdownDatabase() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Exception is not thrown and can be removed:

Suggested change
public static void shutdownDatabase() throws Exception {
public static void shutdownDatabase() {

assertThrows(SchemaNotFoundException.class, () -> dbSchemaRegistry.retrieveSchema(schemaIdentifier));
}

private static class DBCPServiceSimpleImpl extends AbstractControllerService implements DBCPService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend avoiding the use of Impl on class names, even for testing.

Suggested change
private static class DBCPServiceSimpleImpl extends AbstractControllerService implements DBCPService {
private static class SimpleDBCPService extends AbstractControllerService implements DBCPService {

@Override
public Connection getConnection() throws ProcessException {
try {
Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend declaring this class name once and reusing throughout the test class.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for restructuring the bundle @mattyb149, the latest version looks good! +1 merging

exceptionfactory pushed a commit that referenced this pull request Jan 4, 2024
- Added nifi-db-schema-registry-service-nar

This closes #8042

Signed-off-by: David Handermann <exceptionfactory@apache.org>
(cherry picked from commit 99c843f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants