
[CARBONDATA-3083] Fixed data mismatch issue after update #2902

Closed
wants to merge 1 commit into from

Conversation

kunal642
Contributor

@kunal642 kunal642 commented Nov 6, 2018

Problem: When filling a columnPage directly into a vector, we skip deleted rows based on the delete-delta BitSet. Now consider a page where the 6th row is null (i.e. BitSet(6)) and the 3rd row is marked as deleted (i.e. BitSet(3)).
While filling the vector we skip the 3rd row, so the final vector has one row fewer (5 rows in total) than the columnPage.
While reading we therefore read only 5 rows, and when we try to set the 6th row to null we end up nulling the wrong row.
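The off-by-one described above can be reproduced with a small standalone sketch. Plain Java collections are used here for illustration; the real code operates on CarbonData's ColumnPage and vector classes, and the class and method names below are hypothetical:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Minimal sketch of the problem: a 6-row page where row 2 (0-based) is
// deleted and row 5 is null. Skipping the deleted row while filling
// shifts every later row, so a page-relative null index hits the wrong row.
public class NullShiftDemo {

  // Fill the vector by skipping deleted rows, as the buggy path did.
  static List<Integer> fillSkippingDeleted(int[] page, BitSet deletedRows) {
    List<Integer> vector = new ArrayList<>();
    for (int i = 0; i < page.length; i++) {
      if (!deletedRows.get(i)) {
        vector.add(page[i]);
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    int[] page = {10, 11, 12, 13, 14, 15};
    BitSet deletedRows = new BitSet();
    deletedRows.set(2);                       // 3rd row is deleted
    List<Integer> vector = fillSkippingDeleted(page, deletedRows);
    // The vector has 5 rows instead of 6; page row 5 (value 15) has
    // shifted down to vector index 4, so applying the page-relative
    // null bit for row 5 would target the wrong position.
    System.out.println(vector.size());        // prints 5
    System.out.println(vector.get(4));        // prints 15
  }
}
```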

Solution: Check whether the vector has an inverted index or deleted rows. If it does, do not blindly copy the array with System.arraycopy; instead, iterate over the values, check for nulls, and insert the appropriate values.
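The solution described above can be sketched as a conditional fill: bulk-copy only when row positions are guaranteed stable, otherwise fall back to a row-by-row copy. All names here (`fillVector`, `deletedRows`, the plain `short[]` standing in for the vector) are illustrative assumptions, not CarbonData's actual API:

```java
import java.util.BitSet;

// Sketch of the fix: use the fast bulk copy only when there are no
// deleted rows and no inverted index; otherwise copy row by row so
// each surviving value lands at its delete-adjusted position.
public class ConditionalFill {

  static short[] fillVector(short[] pageData, BitSet deletedRows,
                            boolean hasInvertedIndex) {
    if ((deletedRows == null || deletedRows.isEmpty()) && !hasInvertedIndex) {
      // Fast path: positions are stable, so System.arraycopy is safe.
      short[] out = new short[pageData.length];
      System.arraycopy(pageData, 0, out, 0, pageData.length);
      return out;
    }
    // Slow path: iterate and skip deleted rows explicitly.
    int deletedCount = (deletedRows == null) ? 0 : deletedRows.cardinality();
    short[] out = new short[pageData.length - deletedCount];
    int rowId = 0;
    for (int i = 0; i < pageData.length; i++) {
      if (deletedRows == null || !deletedRows.get(i)) {
        out[rowId++] = pageData[i];
      }
    }
    return out;
  }
}
```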

Be sure to complete the following checklist to help us incorporate your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on:
    - Whether new unit test cases have been added, or why no new tests are required.
    - How was it tested? Please attach the test report.
    - Is it a performance-related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@kunal642 kunal642 changed the title [WIP] Fixed data mismatch issue after update [CARBONDATA-3083] Fixed data mismatch issue after update Nov 6, 2018
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1298/

@@ -1734,7 +1734,7 @@ private CarbonCommonConstants() {
public static final String CARBON_PUSH_ROW_FILTERS_FOR_VECTOR =
"carbon.push.rowfilters.for.vector";

- public static final String CARBON_PUSH_ROW_FILTERS_FOR_VECTOR_DEFAULT = "false";
+ public static final String CARBON_PUSH_ROW_FILTERS_FOR_VECTOR_DEFAULT = "true";

Any specific reason for changing the default value?

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1512/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9559/

override def afterAll {
sql("use default")
sql("drop database if exists iud cascade")
CarbonProperties.getInstance()
.addProperty(CarbonCommonConstants.isHorizontalCompactionEnabled , "true")
CarbonProperties.getInstance()
.addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER , "true")
CarbonProperties.getInstance().addProperty(CarbonCommonConstants
.CARBON_PUSH_ROW_FILTERS_FOR_VECTOR, "false")

Instead of hard-coding "false", use the default value from the constants class.

, Row(100), Row(-100), Row(null)))
sql("""drop table if exists iud.dest33_part""")
CarbonProperties.getInstance().addProperty(CarbonCommonConstants
.CARBON_PUSH_ROW_FILTERS_FOR_VECTOR, "true")

After the test case completes, we should restore the default value for CARBON_PUSH_ROW_FILTERS_FOR_VECTOR. Since the default property value is false, I think there is no need to modify the property at the start of the test case.

vector.putShort(i, shortData[i]);
}
} else {
vector.putShorts(0, pageSize, shortData, 0);

I think using putShorts/putFloats is common and unavoidable. In the future, any new encoding class may also use these methods, and the same problem could occur again. Is it feasible to modify the vector classes' implementation methods themselves? For example:

public void putShorts(int rowId, int count, short[] src, int srcIndex) {
  for (int i = srcIndex; i < srcIndex + count; i++) {
    putShort(rowId++, src[i]);
  }
}

This way it would be better.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1303/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1516/

@ravipesala
Contributor

@kunal642 Please check PR #2863. This issue should not happen there. Please verify once.

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9564/

@kunal642 kunal642 closed this Nov 15, 2018