
NIFI-7159 - Add support for some data types in various processors#4104

Closed
Xeinn wants to merge 7 commits into apache:master from Xeinn:big-decimal

Conversation

@Xeinn

@Xeinn Xeinn commented Mar 2, 2020

Aligning the implementation of DECIMAL RecordFieldType with other types

Introduce RecordFieldType of DECIMAL for BigDecimal values

Adding RecordFieldType DECIMAL to ResultSetRecordSet

Introduce RecordFieldType DECIMAL into DataTypeUtils

Added missing BIGINT type to convertSimpleIfPossible method

Added missing type BIGINT to convertField method

Added missing type BIGINT to writeFieldForType

Added missing type BIGINT to parseFieldForType

Added missing type BIGINT to parseStringForType

Potentially breaking change of behaviour for Kudu Service.
Switched decimal from treatment as string to treatment as decimal

Added big decimal support for ElasticSearch

Added big decimal support for hbase processors

Added big decimal support for prometheus reporting

Added big decimal support to solr

Big decimal support in site to site reporting

Core support for big decimal

Fix issue with new toBigDecimal method

Issue where the method was not correctly handling the situation where a
BigDecimal value was received as input.

Update test to reflect use of BigDecimal

Changes to unit tests to support DECIMAL type

Fix bug where NaN Double reported from jvm metrics not handled

BigDecimal implementation for mongo processor

Thank you for submitting a contribution to Apache NiFi.

Please provide a short description of the PR here:

Description of PR

Enables X functionality; fixes bug NIFI-7159.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • [Y] Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • [Y] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • [Y] Has your PR been rebased against the latest commit within the target branch (typically master)?

  • [Y] Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • [Y] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • [Y] Have you written or updated unit tests to verify your changes?
  • [Y] Have you verified that the full build is successful on both JDK 8 and JDK 11?
  • [N/A] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [N/A] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • [N/A] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • [N/A] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • [N/A] Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@MikeThomsen
Contributor

Will start reviewing tonight.

@MikeThomsen
Contributor

You have checkstyle violations in your changes to nifi-record. See the output for why the build failed:

 [WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[291] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[292] (javadoc) NonEmptyAtclauseDescription: At-clause should have a non-empty description.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[294] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[298] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[300,5] (whitespace) FileTabCharacter: Line contains a tab character.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[301,5] (whitespace) FileTabCharacter: Line contains a tab character.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[302,5] (whitespace) FileTabCharacter: Line contains a tab character.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[303] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[303,5] (whitespace) FileTabCharacter: Line contains a tab character.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[304,5] (whitespace) FileTabCharacter: Line contains a tab character.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[306] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[WARNING] src/main/java/org/apache/nifi/serialization/record/RecordFieldType.java:[307] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[WARNING] src/main/java/org/apache/nifi/serialization/record/util/DataTypeUtils.java:[464,17] (whitespace) FileTabCharacter: Line contains a tab character.
[WARNING] src/main/java/org/apache/nifi/serialization/record/type/DecimalDataType.java:[16] (regexp) RegexpSinglelineJava: Line has trailing whitespace.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.0:check (check-style) on project nifi-record: You have 15 Checkstyle violations. -> [Help 1]

@MikeThomsen
Contributor

Getting build errors now related to Decimal128.

@Xeinn
Author

Xeinn commented Mar 2, 2020 via email

@MikeThomsen
Contributor

I think we can use this and a little reflection to get around that one explicit reference to Decimal128

https://stackoverflow.com/questions/52870627/get-mongodb-driver-version-programmatically

@Xeinn
Author

Xeinn commented Mar 3, 2020 via email

@MikeThomsen
Contributor

Yeah, the javadocs are sketchy on that class, but there's gotta be some way we can look at the driver, find out if it's >= 3.8, and then use reflection so that older driver support will still work.
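That reflective check could look something like the sketch below. org.bson.types.Decimal128 and its BigDecimal constructor are real parts of the Mongo Java driver; the wrapper class itself is illustrative, not NiFi code, and the double fallback is assumed behaviour for older drivers.

```java
import java.math.BigDecimal;

// Illustrative sketch: probe the classpath once for the driver's Decimal128
// class and use it reflectively when present, falling back to a (lossy)
// double on older drivers that predate Decimal128.
public class Decimal128Support {

    private static final Class<?> DECIMAL128_CLASS = lookupDecimal128();

    private static Class<?> lookupDecimal128() {
        try {
            return Class.forName("org.bson.types.Decimal128");
        } catch (ClassNotFoundException e) {
            return null; // older Mongo driver without Decimal128 support
        }
    }

    public static boolean isDecimal128Available() {
        return DECIMAL128_CLASS != null;
    }

    public static Object wrap(final BigDecimal value) {
        if (DECIMAL128_CLASS != null) {
            try {
                // Decimal128 exposes a public Decimal128(BigDecimal) constructor
                return DECIMAL128_CLASS.getConstructor(BigDecimal.class).newInstance(value);
            } catch (ReflectiveOperationException e) {
                // fall through to the double fallback
            }
        }
        return value.doubleValue();
    }
}
```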

@Xeinn
Author

Xeinn commented Mar 3, 2020 via email

Contributor

@MikeThomsen MikeThomsen left a comment

Looks pretty good, but there is definitely some scope creep that will require others' input.

}
break;
}
case DECIMAL:
Contributor

I think this should actually be a double because that's the maximum that Elasticsearch supports. See here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

retVal = clientService.toBytes(record.getAsBoolean(field));
break;
case DECIMAL:
// Decimal to be treated as the same as double
Contributor

It should be broken down into a byte array like the other types.
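For illustration, one self-contained way to break a BigDecimal down into a byte array and back is to store the scale followed by the unscaled value's two's-complement bytes. This mirrors the general layout HBase's Bytes.toBytes(BigDecimal) uses, but the helper class here is a sketch of ours, not NiFi or HBase code.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Illustrative helper: serialize a BigDecimal as [4-byte scale][unscaled bytes]
// so the value round-trips exactly, with no floating-point loss.
public class BigDecimalBytes {

    public static byte[] toBytes(final BigDecimal value) {
        final byte[] unscaled = value.unscaledValue().toByteArray();
        final ByteBuffer buf = ByteBuffer.allocate(4 + unscaled.length);
        buf.putInt(value.scale());
        buf.put(unscaled);
        return buf.array();
    }

    public static BigDecimal fromBytes(final byte[] bytes) {
        final ByteBuffer buf = ByteBuffer.wrap(bytes);
        final int scale = buf.getInt();
        final byte[] unscaled = new byte[bytes.length - 4];
        buf.get(unscaled);
        return new BigDecimal(new BigInteger(unscaled), scale);
    }
}
```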

return new DoubleWritable((double) o);
}
// Map BigDecimal to a Double type - this should be improved to map to Hive Decimal type
if (o instanceof BigDecimal) {
Contributor

There were some runtime issues with unit tests related to Orc. @mattyb149 @bbende @ijokarumawak do y'all have any insight into this ORC conversion?

case STRING:
case DECIMAL:
fields.add(new RecordField(cs.getName(), RecordFieldType.STRING.getDataType()));
fields.add(new RecordField(cs.getName(), RecordFieldType.DECIMAL.getDataType()));
Contributor

I don't know anything about Kudu, so I may put this ticket on hold and ask for input from nifi-dev on a few of these other components (ex Hive)

@Xeinn
Author

Xeinn commented Mar 5, 2020 via email

@Xeinn
Author

Xeinn commented Mar 5, 2020 via email


@Override
public String toString() {
return "DECIMAL" + Integer.toString(precision) + Integer.toString(scale);
Contributor

Might be a good idea to add underscores (for example) so it looks like this:
DECIMAL_11_11
instead of this:
DECIMAL1111

Btw Integer.toString is redundant with the string literal plus '+' operator (makes it a bit less readable).
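A sketch of the method with both suggestions applied; the surrounding class is a minimal stand-in for NiFi's DecimalDataType, whose other members aren't shown here.

```java
// Minimal stand-in for DecimalDataType showing the suggested toString:
// underscores separate the parts, and plain concatenation replaces the
// redundant Integer.toString calls.
public class DecimalDataTypeToString {
    private final int precision;
    private final int scale;

    public DecimalDataTypeToString(final int precision, final int scale) {
        this.precision = precision;
        this.scale = scale;
    }

    @Override
    public String toString() {
        return "DECIMAL_" + precision + "_" + scale;
    }
}
```

With this form, precision 11 and scale 11 render as DECIMAL_11_11 rather than DECIMAL1111.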

}

public static BigDecimal toBigDecimal(final Object value, final String fieldName) {
if (value == null) {
Contributor

Shouldn't we cover all the Number cases?
I think new BigDecimal(value.toString()); might be enough.
(Can't think of an example where the string representation of a Number would not be good.)
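A sketch of that suggestion follows. The class name and IllegalArgumentException are stand-ins; the real method lives in DataTypeUtils and throws NiFi's own conversion exception.

```java
import java.math.BigDecimal;

// Illustrative sketch: route every Number (and String) through its string
// form, which covers all Number cases with one branch.
public class DecimalConversion {

    public static BigDecimal toBigDecimal(final Object value, final String fieldName) {
        if (value == null) {
            return null;
        }
        if (value instanceof BigDecimal) {
            return (BigDecimal) value; // already the target type, pass through
        }
        if (value instanceof Number || value instanceof String) {
            // toString of a Double/Float yields the shortest decimal form,
            // avoiding the binary artifacts of new BigDecimal(double)
            return new BigDecimal(value.toString());
        }
        throw new IllegalArgumentException("Cannot convert field " + fieldName
                + " of type " + value.getClass() + " to BigDecimal");
    }
}
```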

/**
* Returns a data type that represents a DECIMAL value with given precision and scale
*
* @param precision indicates the expected number of digits before the decimal point
Contributor

Precision is the number of all digits.
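A tiny standalone illustration of the distinction (the class is ours, not NiFi code): BigDecimal's precision() counts all significant digits, while scale() counts only the fractional ones.

```java
import java.math.BigDecimal;

// Demo: report the precision and scale BigDecimal itself assigns a literal.
public class PrecisionDemo {
    public static int[] precisionAndScale(final String literal) {
        final BigDecimal d = new BigDecimal(literal);
        return new int[] { d.precision(), d.scale() };
    }
}
```

So "123.45" has precision 5 (all five digits) and scale 2, not precision 3.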

import java.io.InputStream;
import java.io.Reader;
import java.lang.reflect.Array;
import java.math.BigDecimal;
Contributor

@tpalfy tpalfy Mar 6, 2020

Currently

DataTypeUtils.findMostSuitableType(
        new BigDecimal("11.110"),
        Arrays.asList(RecordFieldType.DECIMAL.getDecimalDataType(5, 3)),
        Function.identity()
);

correctly returns Optional.of(RecordFieldType.DECIMAL.getDecimalDataType(5, 3));, but

DataTypeUtils.findMostSuitableType(
        new BigDecimal("1.110"),
        Arrays.asList(RecordFieldType.DECIMAL.getDecimalDataType(5, 3)),
        Function.identity()
);

("1.110" instead of "11.110") returns Optional.empty();

This is because the inferred type DECIMAL(4,3) is not equal to the provided DECIMAL(5,3), so the getWiderType method is called, which doesn't handle cases where the two types are the same in general and only differ in their details.

Probably the public static Optional<DataType> getWiderType(final DataType thisDataType, final DataType otherDataType) should handle two different DECIMAL types by returning another DECIMAL that is the superset of the two in terms of scale and precision.
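A sketch of that superset rule, operating on plain ints rather than NiFi's DecimalDataType (the class is illustrative): the wider type needs the larger scale plus enough integer digits to cover either operand.

```java
// Illustrative DECIMAL-widening rule: keep the larger scale and the larger
// integer-digit count (precision minus scale) of the two operands.
public class DecimalWidening {

    // returns { precision, scale } of the superset of DECIMAL(p1,s1) and DECIMAL(p2,s2)
    public static int[] widen(final int p1, final int s1, final int p2, final int s2) {
        final int integerDigits = Math.max(p1 - s1, p2 - s2); // digits before the point
        final int scale = Math.max(s1, s2);                   // digits after the point
        return new int[] { integerDigits + scale, scale };
    }
}
```

Under this rule DECIMAL(4,3) and DECIMAL(5,3) widen to DECIMAL(5,3), so the findMostSuitableType case above would match.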

@Xeinn
Author

Xeinn commented Mar 10, 2020 via email

@pvillard31 pvillard31 changed the title from NIFI-7159 to NIFI-7159 - Add support for some data types in various processors Mar 10, 2020
@MikeThomsen
Contributor

@Xeinn there's a merge conflict now. Instructions follow if you're newish to git. Otherwise ignore.

To resolve do the following:

  1. Pull apache/nifi master into your master.
  2. Rebase (git rebase master) your branch off of master again.
  3. Follow the prompts that git gives you. Be careful to read what it says.
  4. git push origin --force BRANCH.

Don't git pull apache/nifi master into your branch.

@Xeinn
Author

Xeinn commented Mar 16, 2020 via email

Xeinn added 6 commits March 16, 2020 19:28
@Xeinn
Author

Xeinn commented Mar 17, 2020

@MikeThomsen @tpalfy

Seem to have some issues reported from the Windows build. Not sure if they are related to something I have done or are the result of commits I've rebased from master.

However I've pushed the changes as suggested above. Would appreciate you checking them again, especially the hbase changes.

Question: should I add properties to the affected processors that let users choose to turn on handling of the Decimal type, and maintain the current behaviour where it is not enabled?

@MikeThomsen
Contributor

As a rule of thumb, if the Linux build passes we can merge. Will try to find some time to review.

Contributor

@MikeThomsen MikeThomsen left a comment

I added a few unit tests for DataTypeUtils and I'm not sure if my expectations are off or the patch isn't behaving correctly; see these examples:

    public static final DecimalDataType DECIMAL_DATA_TYPE = new DecimalDataType(10, 5);

    @Test
    public void testStringToBigDecimal() {
        String input = "10000.0005";
        Object converted = DataTypeUtils.convertType(input, DECIMAL_DATA_TYPE, "testField");
        assertNotNull(converted);
        assertTrue(converted instanceof BigDecimal);
        BigDecimal bd = (BigDecimal)converted;
        assertEquals(10, bd.precision());
        assertEquals(5, bd.scale());
    }

    @Test
    public void testFloatToBigDecimal() {
        float input = 100.0f;
        Object converted = DataTypeUtils.convertType(input, DECIMAL_DATA_TYPE, "testField");
        assertNotNull(converted);
        assertTrue(converted instanceof BigDecimal);
        BigDecimal bd = (BigDecimal)converted;
        assertEquals(10, bd.precision());
        assertEquals(5, bd.scale());
    }

    @Test
    public void testDoubleToBigDecimal() {
        double input = 100.0;
        Object converted = DataTypeUtils.convertType(input, DECIMAL_DATA_TYPE, "testField");
        assertNotNull(converted);
        assertTrue(converted instanceof BigDecimal);
        BigDecimal bd = (BigDecimal)converted;
        assertEquals(10, bd.precision());
        assertEquals(5, bd.scale());
    }

    @Test
    public void testBigDecimalDatatypeSearch() {
        Optional<DataType> matcher = DataTypeUtils.findMostSuitableType(
                new BigDecimal("11.110"),
                Arrays.asList(RecordFieldType.DECIMAL.getDecimalDataType(5, 3)),
                Function.identity()
        );
        assertTrue(matcher.isPresent());
        DataType dt = matcher.get();
        assertTrue(dt instanceof DecimalDataType);
        DecimalDataType ddt = (DecimalDataType)dt;
        assertEquals(5, ddt.getPrecision());
        assertEquals(3, ddt.getScale());

        matcher = DataTypeUtils.findMostSuitableType(
                new BigDecimal("1.110"),
                Arrays.asList(RecordFieldType.DECIMAL.getDecimalDataType(9, 5)),
                Function.identity()
        );
        assertTrue(matcher.isPresent());
        ddt = (DecimalDataType)matcher.get();
        assertEquals(9, ddt.getPrecision());
        assertEquals(5, ddt.getScale());
    }

Thoughts?

@MikeThomsen MikeThomsen requested a review from markap14 March 21, 2020 20:07
@MikeThomsen
Contributor

@markap14 could you take a look at the precision part and give me your opinion?

@Xeinn
Author

Xeinn commented Mar 22, 2020

@MikeThomsen
Hi Mike,

Had a quick look at the unit tests and can see what you mean. Recall having a bit of debate with myself over how to handle this when initially implementing - there may be some scope for improvement.

Basically we have two different areas of "precision & scale"

A schema can define a type with a precision and scale, but the values manipulated from the records will also have a precision and scale of their own. The DataType represented by DecimalDataType stores the schema-defined precision and scale, while the values themselves, internally represented by the BigDecimal type, take on the precision and scale of the value held.

So a string of "100.00" assigned to a BigDecimal would have precision of 5 and scale of 2

I guess the question is do we want to enforce the schema defined precision and scale throughout the handling of values and where necessary truncate/round values so they adhere to the schema defined precision and scale?
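If enforcement were the chosen route, a minimal sketch (names are illustrative, not NiFi API) might coerce each value to the schema-declared scale and reject any value whose precision still doesn't fit:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative sketch: round the value to the schema scale, then verify the
// resulting precision fits within the schema-declared precision.
public class SchemaDecimalCoercion {

    public static BigDecimal coerce(final BigDecimal value, final int precision, final int scale) {
        final BigDecimal adjusted = value.setScale(scale, RoundingMode.HALF_UP);
        if (adjusted.precision() > precision) {
            throw new IllegalArgumentException("Value " + value
                    + " does not fit DECIMAL(" + precision + "," + scale + ")");
        }
        return adjusted;
    }
}
```

The choice of RoundingMode is itself part of the question; HALF_UP here is just one plausible default.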

@MikeThomsen
Contributor

I'll try to find some time today to play around with the APIs to see what validation is even possible. The first thing that comes to mind is whether the Avro precision and scale values are just advice or hard limits that are validated.

@joewitt
Contributor

joewitt commented Mar 22, 2020

All the builds including windows should pass and failures need to be looked into.

@MikeThomsen
Contributor

@joewitt ok, but we may need to reevaluate some of the tests because some of them have had timing issues on Windows that don't appear on *NIX

@joewitt
Contributor

joewitt commented Mar 22, 2020

Please be specific about which tests on Windows have timing issues. A ton of effort has gone into the tests on all platforms. A large number of tests were not written to work on Windows, but none of those that I've found are related to timing issues but instead are due to actual 'test wont work on windows' issues. The issue that occurred in this case appears to be a corrupt zip file. Reran the jobs.

@Xeinn
Author

Xeinn commented Mar 22, 2020 via email

@MikeThomsen
Contributor

but instead are due to actual 'test wont work on windows' issues.

That's pretty much what I was getting at.

@github-actions

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.

@github-actions github-actions bot added the Stale label Apr 25, 2021
@github-actions github-actions bot closed this May 11, 2021