Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28224: Upgrade Orc version in Hive to 1.9.3 #5218

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

difin
Copy link
Contributor

@difin difin commented Apr 26, 2024

What changes were proposed in this pull request?

Upgrading Orc version in Hive to the latest version that supports Java 8 - 1.9.3

Why are the changes needed?

To keep the Orc version in Hive updated.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

How was this patch tested?

Pre-commit testing

Copy link
Member

@ayushtkn ayushtkn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general looks good, but I am not sure why the query results are changing, same query returning different results pre & post upgrade doesn't look f9 to me in general

Statistics: Num rows: 69 Data size: 13710 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 70 Data size: #Masked# Basic stats: COMPLETE Column stats: NONE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did number of rows changed with ORC upgrade?

Copy link
Contributor Author

@difin difin May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "Num rows" is calculated estimate by dividing data size by average row size.
Previously data size was 13,710 and now it is 13,720.
Average row size = 196
Before Orc upgrade it was: 13,710 / 196 = 69.??? - hence the value of "Num rows" was 69.
Now: 13720 / 196 = 70
So basically the reason for num rows change is that Data size value increased.

NULL 3072 9318.4351351351 -4298.1513513514 5018444.1081079808 1633.60810810806667 5695.483082135323 5696.410307714464 3072 11160.71538461538500 -5147.90769230769300 6010604.30769230735360 1956.576923076922966667 6821.495748565151 6822.606289190915
NULL 3072 9318.4351351351 -4298.1513513514 5018444.1081079808 1633.60810810806667 5695.483082135325 5696.4103077144655 3072 11160.71538461538500 -5147.90769230769300 6010604.30769230735360 1956.576923076922966667 6821.4957485651385 6822.606289190904
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query result is changing here?

@@ -847,4 +847,4 @@ FROM (SELECT cint,
POSTHOOK: type: QUERY
POSTHOOK: Input: default@decimal_vgby_small
#### A masked pattern was here ####
95165244160
95767761728
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also the result of the query is changing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This difference is caused by the previous differences in STDDEV_POP(cdecimal1) and STDDEV_SAMP(cdecimal1) fields

@@ -180,8 +180,8 @@
<mysql.version>8.0.31</mysql.version>
<postgres.version>42.7.3</postgres.version>
<oracle.version>21.3.0.0</oracle.version>
<opencsv.version>2.3</opencsv.version>
<orc.version>1.8.5</orc.version>
<opencsv.version>5.9</opencsv.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why do you change opencsv under the PR title, HIVE-28224: Upgrade Orc version in Hive to 1.9.3, @difin ?

Could you proceed to upgrade opencsv library independently before upgrading ORC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added opencsv changes because Hive wasn't able to compile without them. Hive has a code that depends on opencsv version that was coming from Orc 1.8.5 and it is not in 1.9.3.

I created a separate PR for upgrading opencsv #5240
Once it gets merged, I will update this PR and remove opencsv changes from it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @difin .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are welcome, @dongjoon-hyun

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dongjoon-hyun, opencsv changes were removed from this PR and were done separately. The differences in the results of tests are still present.
I did a few additional checks:

  • The same differences are present when Hive statistics are enabled and when they are disabled.
  • With the decimal_vgby table used as a source in the test in Parquet format instead of Orc, the differences are absent (so it is an additional confirmation that it is something related to Orc).
  • I dumped the content of the decimal_vgby table with and without Orc upgrade and they were the same, which is odd

Copy link

sonarcloud bot commented May 9, 2024

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants