[SUPPORT] MOR trigger compaction from Hudi CLI #1823
Comments
Another issue is that I am getting the below error during inline compaction. Please help.
com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$46/188376151
@RajasekarSribalan For your first question, unfortunately only inline compaction is currently supported for Spark Streaming writes, so you have to enable that config. The good news is that this PR (#1752) is working on enabling async compaction for Spark Streaming and is a priority. For the second question, a couple of clarifications:
- Hudi moved to Spark 2.4. I see that you are using Spark 2.2.0. Could you try on Spark 2.4.x?
- Also, in your spark-submit command, are you passing in these jars and confs - https://hudi.apache.org/docs/quick-start-guide.html#setup
1. The conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
2. In addition to hudi-spark-bundle, you need to pass org.apache.spark:spark-avro_2.11:2.4.4. Note that spark-avro must match your Spark version, which would be 2.4.4. This applies if you are using spark-shell, as it does not ship with spark-avro explicitly.
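For reference, a minimal spark-shell invocation along the lines of the quick-start setup linked above would look roughly like the sketch below. The bundle and spark-avro coordinates shown assume Spark 2.4.4 with Scala 2.11 and Hudi 0.5.2; adjust them to your build.
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'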
Thank you for your response, Bhavani @bhasudha.
1. May I know the purpose of the compaction schedule and compaction run commands in the Hudi CLI?
2. If only inline compaction is possible from Spark Streaming for MOR tables, is it then similar to a CopyOnWrite table? Is there no difference in using them? Please correct me if I am wrong.
Kind regards,
Rajasekar
@RajasekarSribalan: Compactions can be scheduled and executed from the CLI if there are delta files and things are configured correctly. Can you post the logs from "compaction schedule" after you disabled inline compaction and delta files got added? The only constraint is that ingestion must not be running when scheduling compactions. If you are running a structured streaming job, please try out the PR suggested by @bhasudha, which will execute compaction asynchronously. In batch mode, you can also enable inline compaction and set a proper value for hoodie.compact.inline.max.delta.commits depending on how frequently you want compactions to run. Regarding (2): compaction is only relevant for MOR tables. Compaction is the process of taking the delta files and creating columnar files; CopyOnWrite tables are implicitly compacted. Regarding the KryoException, are you still seeing the error?
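For anyone following the thread, the CLI flow being described is roughly the sketch below. The connect path and the flags shown are illustrative; compaction run takes additional options (schema file, memory, retries) that vary by release, so check help compaction run in your CLI version.
connect --path /user/xyz/hudi/<tablename>
compactions show all
compaction schedule
compaction run --compactionInstant <instantTime> --parallelism 2
Scheduling writes a requested compaction instant (<instantTime>.compaction.requested) under .hoodie, which is what compactions show all lists. If nothing appears there after compaction schedule, the scheduling step itself did not complete.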
Thanks @bvaradar @bhasudha. One more problem I can see is: how should compaction and the cleaner be configured? Should both have the same values? What if I configure clean commits as 3, so that I reclaim more space, and compaction to happen after 24 commits? Since I am cleaning frequently, will the delta commits be cleaned/deleted before compaction? Please shed some light on this, because I can see tons of files in HDFS for a single table. For example, when I ran a bulk insert for a table to store it in Hudi, 7000+ parquet files were created, which was fine. After running the streaming pipeline doing upserts on the same table for 2 days, I could see 90,000+ files in HDFS. I haven't changed the default cleaner configuration, so I believe cleaning happens after 24 commits? So that's the reason I have this many files. Please correct me if I am wrong.
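For context, the cleaner and compaction cadences are controlled by separate configs. A minimal sketch of the two keys being discussed, using the values from the question above (not recommendations):
// Sketch only: string keys as documented by Hudi; pass them as write options.
val compactionAndCleanerOpts = Map(
  // MOR: schedule/execute inline compaction after this many delta commits
  "hoodie.compact.inline.max.delta.commits" -> "24",
  // cleaner: roughly, how many recent commits' file versions are retained before space is reclaimed
  "hoodie.cleaner.commits.retained" -> "3"
)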
Please find the CLI output for a MOR table's clean info:
20/07/18 16:05:31 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200716082419__clean__COMPLETED], [20200716102509__clean__COMPLETED], [20200716103921__clean__COMPLETED], [20200716134933__clean__COMPLETED], [20200716135749__clean__COMPLETED], [20200716163408__clean__COMPLETED], [20200716164519__clean__COMPLETED], [20200716192304__clean__COMPLETED], [20200716192304__deltacommit__COMPLETED], [20200716193103__deltacommit__COMPLETED], [20200717034005__commit__COMPLETED], [20200717080741__clean__COMPLETED], [20200717080741__deltacommit__COMPLETED], [20200717100758__clean__COMPLETED], [20200717100758__deltacommit__COMPLETED], [20200717101709__clean__COMPLETED], [20200717101709__deltacommit__COMPLETED], [20200717120702__clean__COMPLETED], [20200717120702__deltacommit__COMPLETED], [20200717121648__clean__COMPLETED], [20200717121648__deltacommit__COMPLETED], [20200717141621__clean__COMPLETED], [20200717141621__deltacommit__COMPLETED], [20200717142837__clean__COMPLETED], [20200717142837__deltacommit__COMPLETED], [20200717161843__clean__COMPLETED], [20200717161843__deltacommit__COMPLETED], [20200717162524__clean__COMPLETED], [20200717162524__deltacommit__COMPLETED], [20200717180202__clean__COMPLETED], [20200717180202__deltacommit__COMPLETED], [20200717182211__deltacommit__COMPLETED], [20200717203440__deltacommit__COMPLETED], [20200718040640__clean__COMPLETED], [20200718040640__deltacommit__COMPLETED], [20200718055600__commit__COMPLETED], [20200718062014__clean__COMPLETED], [20200718062014__deltacommit__COMPLETED], [20200718062721__clean__COMPLETED], [20200718062721__deltacommit__COMPLETED], [20200718082117__clean__COMPLETED], [20200718082117__deltacommit__COMPLETED], [20200718082800__clean__COMPLETED], [20200718082800__deltacommit__COMPLETED], [20200718102800__clean__COMPLETED], [20200718102800__deltacommit__COMPLETED], [20200718104348__deltacommit__COMPLETED]]
Hi @RajasekarSribalan, I am seeing the same issue as you did.
Have you solved this problem?
@garyli1019: Trying to catch up on this. Is this still an issue?
@bvaradar I think this issue is reproducible when the log file group is larger than 2GB.
@bvaradar I created a ticket to track this. I think we can close this issue and #1890.
Thanks Gary.
Can we disable compaction in the sink and run compaction manually using hudi-cli, i.e., disable both inline and async compaction?
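A hedged sketch of what that would look like as write options. hoodie.compact.inline is the same switch used in the snippet below; the async flag only exists in releases that include the async-compaction support referenced earlier, so treat the key and its availability as version-dependent.
// Sketch only; assumes a Hudi release where the datasource sink exposes an async compaction switch.
val disableCompactionOpts = Map(
  "hoodie.compact.inline" -> "false",                    // no inline compaction after each write
  "hoodie.datasource.compaction.async.enable" -> "false" // no async compaction in the streaming sink
)
With both disabled, compaction has to be scheduled and run out of band, for example via hudi-cli as in the flow shown earlier in the thread.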
Describe the problem you faced
We are writing to a Hudi MOR table via Spark Streaming. We read data from Kafka and write to the Hudi MOR table. We get huge inserts/upserts and want good performance, so we chose MOR tables. We have disabled inline compaction to avoid blocking ingestion, and we want compaction to run asynchronously via the Hudi CLI. The issue is that we are unable to see any COMPACTION instant in the DFS, so we get an error saying "No Pending compaction", even though we do see a lot of delta logs getting created/appended; compaction is never requested.
We want to understand when a compaction request is triggered when inline compaction is switched OFF, so that we can run compaction via hudi-cli. Please assist, @vinothchandar @bhasudha. There is not much information about async compaction in the Hudi documentation.
upsertDf.write
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(OPERATION_OPT_KEY, "upsert")
.option(PRECOMBINE_FIELD_OPT_KEY, hudi_precombine_key)
.option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
.option(PARTITIONPATH_FIELD_OPT_KEY, "")
.option(KEYGENERATOR_CLASS_OPT_KEY, classOf[NonpartitionedKeyGenerator].getName)
.option(TABLE_NAME, tablename)
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
.option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
.option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(HIVE_URL_OPT_KEY, "XXXXXXX")
.option(HIVE_DATABASE_OPT_KEY, hudi_db)
.option(HIVE_TABLE_OPT_KEY, tablename)
.option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
.option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
.option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "false")
.option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "24")
.mode(Append)
.save("/user/xyz/hudi/" + tablename)
Environment Description
Hudi version : 0.5.2
Spark version : 2.2.0
Hive version : 1.0
Hadoop version : 2.7
Storage (HDFS/S3/GCS..) : HDFS
Running on Docker? (yes/no) :
Stacktrace
hudi:user_emails->compactions show all
╔═════════════════════════╤═══════╤═══════════════════════════════╗
║ Compaction Instant Time │ State │ Total FileIds to be Compacted ║
╠═════════════════════════╧═══════╧═══════════════════════════════╣
║ (empty) ║
╚═════════════════════════════════════════════════════════════════╝