HIVE-24706: add the HiveHBaseTableInputFormatV2 to fix the compatible issue with spark #4063

Closed
wants to merge 2 commits into from

alexdongli0829

What changes were proposed in this pull request?

For HIVE-24706, the main issue is that HiveHBaseTableInputFormat implements two versions of InputFormat, so Spark cannot determine which version to use. The current implementation is also not very clear.

So in this request, instead of directly extending TableInputFormatBase, I use it as a delegate that does exactly the same work as before. This avoids the confusion, because HBaseStorageHandler only needs the old-version InputFormat.

In the long term, I think Hive should update the storage handler instead of continuing to mix these two API versions.
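
To illustrate the difference between inheriting from the new-API base class and delegating to it, here is a minimal, dependency-free sketch. All type names in it (`OldInputFormat`, `NewInputFormat`, `TableFormatBase`, `MixedFormat`, `DelegatingFormat`, `detectApi`) are hypothetical stand-ins, not the real Hadoop, HBase, or Hive classes, and the type check in `detectApi` is only an assumption about how a caller might choose between the two APIs.

```java
import java.util.Arrays;
import java.util.List;

public class DelegationSketch {
    /** Stand-in for the old (mapred-style) InputFormat API. Hypothetical. */
    public interface OldInputFormat {
        List<String> getSplits(int numSplits);
    }

    /** Stand-in for the new (mapreduce-style) InputFormat API. Hypothetical. */
    public abstract static class NewInputFormat {
        public abstract List<String> getSplits();
    }

    /** Stand-in for a new-API base class such as TableInputFormatBase. */
    public static class TableFormatBase extends NewInputFormat {
        @Override
        public List<String> getSplits() {
            return Arrays.asList("region-1", "region-2");
        }
    }

    /** Inheritance: the class is both an old-API and a new-API input format. */
    public static class MixedFormat extends TableFormatBase implements OldInputFormat {
        @Override
        public List<String> getSplits(int numSplits) {
            return getSplits();
        }
    }

    /** Delegation: only the old API is visible; the base class is an internal helper. */
    public static class DelegatingFormat implements OldInputFormat {
        private final TableFormatBase delegate = new TableFormatBase();

        @Override
        public List<String> getSplits(int numSplits) {
            return delegate.getSplits();
        }
    }

    /** Assumed caller-side check: prefer the new API whenever the class is assignable to it. */
    public static String detectApi(Class<?> cls) {
        return NewInputFormat.class.isAssignableFrom(cls) ? "new" : "old";
    }

    public static void main(String[] args) {
        System.out.println(detectApi(MixedFormat.class));      // new
        System.out.println(detectApi(DelegatingFormat.class)); // old
        System.out.println(new DelegatingFormat().getSplits(1));
    }
}
```

With delegation, a type check of this kind can only ever see the old API, while the split logic is still shared with the base class.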

Why are the changes needed?

It affects compatibility between Spark and Hive, and has been reported by different users of both Hive and Spark.

Does this PR introduce any user-facing change?

A configuration parameter, hive.hbase.inputformat.v2, is added, so the docs may need an update to keep end users informed.

How was this patch tested?

Create HBase table

echo "create 'students','account','address'" | sudo -u hbase hbase shell -n
echo "put 'students','student1','account:name','Alice'" | sudo -u hbase hbase shell -n

Create Hive table

hive -e "create external table test1 (key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ('hbase.columns.mapping' = ':key,account:name')
tblproperties ('hbase.table.name' = 'students')"


SLF4J: Class path contains multiple SLF4J bindings.
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
Hive Session ID = 05b4ec22-d15f-4614-9bc3-6c183e868728
OK
Time taken: 2.913 seconds

Spark test:

spark-sql --jars /usr/lib/hive/lib/hive-hbase-handler.jar,/usr/lib/hbase/hbase-common-2.4.4.jar,/usr/lib/hbase/hbase-client-2.4.4.jar,/usr/lib/hbase/lib/hbase-mapreduce-2.4.4.jar,/usr/lib/hbase/lib/shaded-clients/hbase-shaded-client-2.4.4.jar --conf spark.hive.hbase.inputformat.v2=true

spark-sql> select * from test1;

student1    Alice

Unit Test

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hive.hbase.TestHiveHBaseTableInputFormatV2
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.069 s - in org.apache.hadoop.hive.hbase.TestHiveHBaseTableInputFormatV2
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

@sonarcloud

sonarcloud bot commented Feb 15, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 7 (rating E)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 26 (rating A)

No coverage information
No duplication information

@alexdongli0829
Author

alexdongli0829 commented Feb 20, 2023

Hey guys,

I am looking into the bugs that caused the test failure:

  1. Regarding the 2 "Job is not closed" issues and the 4 InterruptedException issues, I think I understand what the report is saying, but is it really necessary to fix them? Most of the code comes from HiveHBaseTableInputFormat, so why is there no test failure for HiveHBaseTableInputFormat?

  2. For the RecordReader, I need to return the RecordReader from this function, so how can I close it inside getRecordReader?
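
One possible approach to the second point, sketched without any Hadoop or HBase dependency: return a wrapper whose close() also releases the resources opened in getRecordReader, so ownership passes to the caller. `Reader` and `Resource` are hypothetical stand-ins for the real RecordReader and connection types, and nothing in this thread confirms that such a wrapper would satisfy the SonarCloud rule.

```java
import java.util.Arrays;
import java.util.Iterator;

public class ClosingReaderSketch {
    /** Stand-in for an old-API RecordReader, reduced to two methods. Hypothetical. */
    public interface Reader extends AutoCloseable {
        String next(); // returns null when there are no more rows

        @Override
        void close();
    }

    /** Stand-in for a resource opened while building the reader, e.g. a connection. */
    public static class Resource implements AutoCloseable {
        private boolean closed;

        @Override
        public void close() {
            closed = true;
        }

        public boolean isClosed() {
            return closed;
        }
    }

    /**
     * Returns a reader that owns the resource: the method itself cannot close it,
     * because the caller still needs it, so closing is deferred to Reader.close().
     */
    public static Reader getRecordReader(Iterator<String> rows, Resource connection) {
        return new Reader() {
            @Override
            public String next() {
                return rows.hasNext() ? rows.next() : null;
            }

            @Override
            public void close() {
                connection.close();
            }
        };
    }

    public static void main(String[] args) {
        Resource connection = new Resource();
        try (Reader reader = getRecordReader(Arrays.asList("student1").iterator(), connection)) {
            System.out.println(reader.next());
        }
        System.out.println(connection.isClosed());
    }
}
```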

CC @kgyrtkirk

@aturoczy

aturoczy commented Apr 4, 2023

I don't get it. As far as I can see, there is a test failure in the javadoc part, which seems related. So I would say yes, it should be fixed.

@alexdongli0829 alexdongli0829 deleted the apache/master branch April 5, 2023 05:49