HIVE-24827. Hive aggregation query returns incorrect results for non text files. #2018

Merged
merged 3 commits into from Mar 8, 2021

ayushtkn
Member

No description provided.

Member

@kgyrtkirk kgyrtkirk left a comment

Will this also work in cases where the input data is compressed?
I think the existing code worked for that as well... The tests passed, so that case is either not covered or still working fine :)

Comment on lines 4038 to 4044
if (footerCount > 0 && table.getInputFileFormatClass() != null
    && !TextInputFormat.class
        .isAssignableFrom(table.getInputFileFormatClass())) {
  LOG.warn("skip.footer.line.count is only valid for TextInputFormat "
      + "files, ignoring the value.");
  footerCount = 0;
}
Member

This seems to be a duplicate block; you could move it into a method.
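
One way the suggested extraction might look is sketched below. This is only an illustration, not the code that was merged: the helper name `effectiveFooterCount`, the class `FooterCountSketch`, and the stand-in classes `InputFormat`, `TextInputFormat`, and `OrcLikeInputFormat` are hypothetical, introduced so the example compiles without Hadoop or Hive on the classpath. It also uses `java.util.logging` in place of the project's logger.

```java
import java.util.logging.Logger;

public class FooterCountSketch {
    private static final Logger LOG =
        Logger.getLogger(FooterCountSketch.class.getName());

    // Hypothetical stand-ins for the real input format classes (assumption:
    // the real check is made against Hadoop's TextInputFormat).
    static class InputFormat {}
    static class TextInputFormat extends InputFormat {}
    static class OrcLikeInputFormat extends InputFormat {}

    /**
     * Returns the footer count to apply: the configured value for text-based
     * formats (or when the format class is unknown), and 0 for anything else.
     */
    static int effectiveFooterCount(int footerCount, Class<?> inputFormatClass) {
        if (footerCount > 0 && inputFormatClass != null
                && !TextInputFormat.class.isAssignableFrom(inputFormatClass)) {
            LOG.warning("skip.footer.line.count is only valid for TextInputFormat "
                + "files, ignoring the value.");
            return 0;
        }
        return footerCount;
    }

    public static void main(String[] args) {
        // Text format keeps the configured value.
        System.out.println(effectiveFooterCount(1, TextInputFormat.class));
        // Non-text format: the value is ignored and a warning is logged.
        System.out.println(effectiveFooterCount(1, OrcLikeInputFormat.class));
        // Unknown format class: the value is left untouched.
        System.out.println(effectiveFooterCount(1, null));
    }
}
```

With a helper like this, both call sites could reduce to a single assignment such as `footerCount = effectiveFooterCount(footerCount, table.getInputFileFormatClass());`, assuming the surrounding variables are the same in both places.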

@ayushtkn
Member Author

Thanks @kgyrtkirk for the review. I tried the scenario from HIVE-24224, and I think it worked as expected:

+-------------------+--------------------+----------------+
| bz2tst2.sequence  |     bz2tst2.id     | bz2tst2.other  |
+-------------------+--------------------+----------------+
| 9                 | 20200315 X00 1356  | 123            |
| 17                | 20200315 X00 1357  | 123            |
+-------------------+--------------------+----------------+

The input file was:

 printf "offset,id,other\n9,\"20200315 X00 1356\",123\n17,\"20200315 X00 1357\",123\nrst,rst,rst" > data.csv

bzip2 -f data.csv 

hdfs dfs -put data.csv.bz2 hdfs://hostname:8020/warehouse/tablespace/external/hive/bz2tst2

The table is:

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `bz2tst2`(                   |
|   `sequence` int,                                  |
|   `id` string,                                     |
|   `other` string)                                  |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.OpenCSVSerde'     |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://ayushsaxena-3.ayushsaxena.root.hwx.site:8020/warehouse/tablespace/external/hive/bz2tst2' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'skip.footer.line.count'='1',                    |
|   'skip.header.line.count'='1',                    |
|   'transient_lastDdlTime'='1614334965')            |
+----------------------------------------------------+

It seems to be working, and the UT was working too. :-)
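
For readers who want to check the expected result by hand, the line-count semantics of `skip.header.line.count` and `skip.footer.line.count` can be approximated with a minimal sketch. It only models dropping lines from the start and end of the decompressed file contents shown above; it does not reflect how Hive's record readers or bzip2 split handling are implemented. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class SkipHeaderFooterSketch {

    /** Drops the first headerCount and the last footerCount lines. */
    static List<String> skipHeaderAndFooter(List<String> lines,
                                            int headerCount, int footerCount) {
        int from = Math.min(headerCount, lines.size());
        int to = Math.max(from, lines.size() - footerCount);
        return lines.subList(from, to);
    }

    public static void main(String[] args) {
        // The four lines written by the printf command in the comment above.
        List<String> lines = Arrays.asList(
            "offset,id,other",
            "9,\"20200315 X00 1356\",123",
            "17,\"20200315 X00 1357\",123",
            "rst,rst,rst");

        // With header count 1 and footer count 1, only the two data rows remain.
        for (String row : skipHeaderAndFooter(lines, 1, 1)) {
            System.out.println(row);
        }
    }
}
```

The two remaining lines correspond to the two rows in the query output above (sequence 9 and 17).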

@kgyrtkirk kgyrtkirk merged commit 8b0542f into apache:master Mar 8, 2021
aihuaxu pushed a commit to aihuaxu/hive that referenced this pull request Mar 17, 2021
…text files. (apache#2018) (Ayush Saxena reviewed by Zoltan Haindrich)