
[CARBONDATA-4081] Fix multiple issues with clean files command #4051

Closed · wants to merge 5 commits

Conversation

@vikramahuja1001 (Contributor) commented Dec 10, 2020

Why is this PR needed?

  1. While collecting stale segments, we list the segments directory and take every entry. If any folder or file other than a ".segment" file is present, it causes failures later while copying data to the trash folder.

  2. When an AbstractDFSCarbonFile is created with a path and a Hadoop configuration instead of a FileStatus, and the file does not exist, fileStatus is empty: listDirs returns an empty result and getAbsolutePath throws a file-does-not-exist exception.

  3. Clean files is not allowed while a concurrent insert overwrite is in progress. Since concurrent loading to an MV is by default an insert overwrite operation, a clean files operation on that MV would fail and throw an exception.

What changes were proposed in this PR?

  1. Added a filter so that only files ending with ".segment" are considered.
  2. Use listFiles instead of listDirs on the trash folder.
  3. Do not throw an exception; log an info message for such MV tables and continue blocking clean files for them (sketched below).
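
A minimal sketch of change (3), assuming names from the diffs later in this thread (CarbonTable.isMV, SegmentStatusManager.isOverwriteInProgressInTable, the MV info log) and the carbondata-core package layout; the guard helper itself is illustrative, not the exact patch:

```scala
import org.apache.carbondata.common.logging.LogServiceFactory
import org.apache.carbondata.core.exception.ConcurrentOperationException
import org.apache.carbondata.core.metadata.schema.table.CarbonTable
import org.apache.carbondata.core.statusmanager.SegmentStatusManager

object CleanFilesGuard {
  private val LOGGER = LogServiceFactory.getLogService(this.getClass.getName)

  // Returns true when clean files may proceed. For an MV with a concurrent
  // insert overwrite in progress (the default mode for MV loads), log and
  // skip instead of throwing, so concurrent MV loading no longer fails.
  def canRunCleanFiles(carbonTable: CarbonTable): Boolean = {
    if (!SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
      true
    } else if (carbonTable.isMV) {
      LOGGER.info(s"Can not do clean files operation for the MV: ${carbonTable.getTableName}")
      false
    } else {
      throw new ConcurrentOperationException(carbonTable, "insert overwrite", "clean file")
    }
  }
}
```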

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • No

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5137/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3375/

@vikramahuja1001 vikramahuja1001 changed the title [WIP] Only consider .segment files for stale segments [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments dir for cleaning stale segments Dec 10, 2020
@vikramahuja1001 vikramahuja1001 changed the title [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments dir for cleaning stale segments [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments folder for cleaning stale segments Dec 10, 2020
@QiangCai (Contributor)

can you add a test case for fault testing?

@vikramahuja1001 (Contributor, Author)

@QiangCai, I have modified the existing test cases instead. Please check.

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5166/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3404/

@vikramahuja1001 vikramahuja1001 changed the title [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments folder for cleaning stale segments [CARBONDATA-4081] Fix multiple issues with clean files command Dec 15, 2020
@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5168/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3406/

@ajantha-bhat (Member)

@vikramahuja1001: when the issue was found, what other file was present in the segment metadata directory? How does it impact things?

@vikramahuja1001 (Contributor, Author)

@ajantha-bhat, in the case of a partition table there is a tmp directory inside the segment folder. While checking for stale segments we were blindly reading every entry in the segment folder, collecting its name, and then reading it as a segment file; since the tmp entry is not a segment file, it cannot be read and throws an exception. The fix is to filter for entries whose names end with ".segment".
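
A minimal sketch of that filter, assuming the CarbonFile / CarbonFileFilter shapes used elsewhere in this PR; segmentsDirPath is a hypothetical variable for the table's segments directory:

```scala
import org.apache.carbondata.core.datastore.filesystem.{CarbonFile, CarbonFileFilter}
import org.apache.carbondata.core.datastore.impl.FileFactory

object StaleSegmentScan {
  // List only real segment metadata files, skipping stray entries such as
  // the partition table's "tmp" directory inside the segments folder.
  def listSegmentMetadataFiles(segmentsDirPath: String): Array[CarbonFile] = {
    FileFactory.getCarbonFile(segmentsDirPath).listFiles(new CarbonFileFilter {
      override def accept(file: CarbonFile): Boolean =
        file.getName.endsWith(".segment")
    })
  }
}
```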

@@ -157,11 +157,12 @@ public static void deleteExpiredDataFromTrash(String tablePath) {
// Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
try {
if (FileFactory.isFileExist(trashPath)) {
Contributor:

Suggested change:
- if (FileFactory.isFileExist(trashPath)) {
+ CarbonFile trashFile = FileFactory.getCarbonFile(trashPath);
+ if (trashFile.exists()) {
+   CarbonFile[] timestampFolderList = trashFile.listFiles();

Member:

I think we should deprecate listFiles, isFileExist, and the other APIs in FileFactory that take a path; they should take a CarbonFile instead. That way we force the caller to think about reusing the CarbonFile object. We can raise a JIRA and assign it to some new contributors.
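
A hypothetical sketch of that direction (not an existing API): helpers accept the already-resolved CarbonFile instead of a raw path, so callers create the object once and reuse it. The object and method names here are made up, and deleteAllCarbonFilesOfDir is assumed from FileFactory:

```scala
import org.apache.carbondata.core.datastore.filesystem.CarbonFile
import org.apache.carbondata.core.datastore.impl.FileFactory

object TrashOps {
  // Hypothetical signature: takes a CarbonFile, not a String path,
  // forcing callers to resolve the path exactly once.
  def emptyTrash(trashDir: CarbonFile): Unit = {
    if (trashDir.exists()) {
      trashDir.listFiles().foreach(f => FileFactory.deleteAllCarbonFilesOfDir(f))
    }
  }
}

// Usage: resolve once, then reuse the same object for every operation.
// val trashDir = FileFactory.getCarbonFile(trashPath)
// TrashOps.emptyTrash(trashDir)
```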

Member:

Currently Vikram can handle it for this PR by accepting the changes suggested by Indhu, but in many places we are not reusing the CarbonFile object; that can be handled by my point above.

Contributor (Author):

done

Contributor (Author):

@ajantha-bhat, I can raise a PR for that as well.

@@ -181,7 +182,7 @@ public static void emptyTrash(String tablePath) {
// if the trash folder exists delete the contents of the trash folder
try {
if (FileFactory.isFileExist(trashPath)) {
Contributor:

Handle this the same way as the comment above.

Contributor (Author):

done

CarbonFilters.getPartitions(Seq.empty[Expression], sparkSession, carbonTable))
}
} else {
LOGGER.info(s"Can not do clean files operation for the MV: ${carbonTable.getTableName}")
Contributor:

Clean files for MV is not supported?

Contributor (Author):

changed code

Member:

please handle based on isInternalCleanCall

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5175/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3413/

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5182/

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3420/

- // if insert overwrite in progress, do not allow delete segment
- if (SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
+ // if insert overwrite in progress and table not a MV, do not allow delete segment
+ if (!carbonTable.isMV && SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
Member:

Not here. Handle it at the place where we call clean files for the MV when clean files is called on the main table. Otherwise, if the user calls clean files on the MV table while an insert overwrite is happening concurrently, no exception is thrown anymore, which is out of sync with the main table's clean files behavior.

Member:

please handle based on isInternalCleanCall
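
A hypothetical sketch of the isInternalCleanCall handling suggested above (the flag name comes from these comments; the exception mirrors the pre-existing behavior): a user-issued clean files still fails fast during insert overwrite, and only the internal MV clean-up triggered from the main table skips the check:

```scala
import org.apache.carbondata.core.exception.ConcurrentOperationException
import org.apache.carbondata.core.metadata.schema.table.CarbonTable
import org.apache.carbondata.core.statusmanager.SegmentStatusManager

object CleanFilesChecks {
  def checkConcurrentOverwrite(carbonTable: CarbonTable, isInternalCleanCall: Boolean): Unit = {
    // Skip the check only for the internal call made on an MV while cleaning
    // the main table; direct user calls keep the original exception.
    if (!isInternalCleanCall &&
        SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
      throw new ConcurrentOperationException(carbonTable, "insert overwrite", "clean file")
    }
  }
}
```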

override def processData(sparkSession: SparkSession): Seq[Row] = {
Checker.validateTableExists(databaseNameOp, tableName, sparkSession)
val carbonTable = CarbonEnv.getCarbonTable(databaseNameOp, tableName)(sparkSession)
setAuditTable(carbonTable)
// if insert overwrite in progress, do not allow delete segment
if (SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
// if insert overwrite in progress and table not a MV, do not allow delete segment
Member:

Suggested change:
- // if insert overwrite in progress and table not a MV, do not allow delete segment
+ // if insert overwrite in progress and table is not a MV, do not allow delete segment

@CarbonDataQA2

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3434/

@CarbonDataQA2

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5194/

@ajantha-bhat (Member)

LGTM

@asfgit asfgit closed this in 7aafb6b Dec 17, 2020