Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improvement] Should match from pathToStorages when appId does not exist in appIdToStorages #168

Merged
merged 2 commits into from
Aug 22, 2022

Conversation

smallzhongfeng
Copy link
Contributor

@smallzhongfeng smallzhongfeng commented Aug 19, 2022

What changes were proposed in this pull request?

createlog
deletelog
From the audit log of HDFS, it can be seen that when the HDFS path of this app was last deleted at 18:00:55, the log in the shuffleServer found that the error about file could not be found, and the file would continue to be written. At last we found that when some appId cache was removed in appIdToStorages, and then HdfsStorageManager calls removeResources will cause storagePath to not be deleted.

Why are the changes needed?

When the cache of appIdToStorages removed, the remote path can be deleted normally.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added ut.

@smallzhongfeng smallzhongfeng changed the title [Improvement] When appId does not exist in appIdToStorages, match from pathToStorages [Improvement] Should match from pathToStorages when appId does not exist in appIdToStorages Aug 19, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 19, 2022

Codecov Report

Merging #168 (8e4c077) into master (74cd40b) will increase coverage by 0.02%.
The diff coverage is 83.33%.

@@             Coverage Diff              @@
##             master     #168      +/-   ##
============================================
+ Coverage     58.29%   58.31%   +0.02%     
- Complexity     1262     1266       +4     
============================================
  Files           158      158              
  Lines          8397     8409      +12     
  Branches        779      782       +3     
============================================
+ Hits           4895     4904       +9     
- Misses         3251     3253       +2     
- Partials        251      252       +1     
Impacted Files Coverage Δ
...che/uniffle/server/storage/HdfsStorageManager.java 90.00% <83.33%> (-2.69%) ⬇️
...e/spark/shuffle/reader/RssShuffleDataIterator.java 88.73% <0.00%> (-0.98%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

}
} catch (Exception e) {
LOG.error("Some error happened when fileSystem got the file status.");
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use printStackTrace, it print the error stack in the stdout, could we print the error stack in the log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jerqi
Copy link
Contributor

jerqi commented Aug 22, 2022

It seems that we only call the method removeResources once, if appIdToStorages was removed because of method removeResources, we won't call the method removeResources any more. It seems that pr can't achieve our aim.

if (!appIdToStorages.containsKey(appId)) {
String msg = "Can't find HDFS storage for appId[" + appId + "]";
LOG.error(msg);
// outside should deal with null situation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we remove the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add back.

@smallzhongfeng
Copy link
Contributor Author

It seems that we only call the method removeResources once, if appIdToStorages was removed because of method removeResources, we won't call the method removeResources any more. It seems that pr can't achieve our aim.

Because we found the log can't find HDFS storage for appid, it means that appIdToStorages has removed the cache of this app. Because removereSources was called only once, but from the HDFS audit log, it can be seen that the shuffle file was deleted at 18:00:55, but was rewritten at 18:00:57, resulting in the loss of the appId of the cache of appIdToStorages, However, the cached appId of shuffleTaskInfos is still there and cannot be removed. Therefore, this exception occurs when removeResources is scheduled again. Therefore, this PR is mainly used to solve the problem that the path cannot be deleted after the cache of appid by appidtostorages expires.

@jerqi
Copy link
Contributor

jerqi commented Aug 22, 2022

It seems that we only call the method removeResources once, if appIdToStorages was removed because of method removeResources, we won't call the method removeResources any more. It seems that pr can't achieve our aim.

Because we found the log can't find HDFS storage for appid, it means that appIdToStorages has removed the cache of this app. Because removereSources was called only once, but from the HDFS audit log, it can be seen that the shuffle file was deleted at 18:00:55, but was rewritten at 18:00:57, resulting in the loss of the appId of the cache of appIdToStorages, However, the cached appId of shuffleTaskInfos is still there and cannot be removed. Therefore, this exception occurs when removeResources is scheduled again. Therefore, this PR is mainly used to solve the problem that the path cannot be deleted after the cache of appid by appidtostorages expires.

Why will the method removeResources be scheduled again? It seems that the method removeResources will be called once.

String msg = "Can't find HDFS storage for appId[" + appId + "]";
LOG.error(msg);
// outside should deal with null situation
// todo: it's better to have a fake storage for null situation
Copy link
Contributor

@jerqi jerqi Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we remove this todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add back.

@smallzhongfeng
Copy link
Contributor Author

smallzhongfeng commented Aug 22, 2022

Why will the method removeResources be scheduled again? It seems that the method removeResources will be called once.

You can take a look at this thread clearResourceThread in ShuffleTaskManager. As long as the expiredAppIdQueue still has the app, it will call the logic of removeResources, and then call the removeResources of the HdfsStorageManager.

@jerqi
Copy link
Contributor

jerqi commented Aug 22, 2022

Why will the method removeResources be scheduled again? It seems that the method removeResources will be called once.

You can take a look at this thread clearResourceThread in ShuffleTaskManager. As long as the expiredAppIdQueue still has the app, it will call the logic of removeResources, and then call the removeResources of the HdfsStorageManager.

I know clearResourceThread's logic. I means that we will call method removeResources once for the one app. Right?

@smallzhongfeng
Copy link
Contributor Author

I know clearResourceThread's logic. I means that we will call method removeResources once for the one app. Right?

Yes, normally, an app calls the logic of removeResources once. However, an exception occurs, which causes the shuffle file to be deleted at 18:00:55 and the file to be recreated at 18:00:57. The expiredAppIdQueue adds the appId again, and then the path of the appId is not deleted.

@jerqi
Copy link
Contributor

jerqi commented Aug 22, 2022

I know clearResourceThread's logic. I means that we will call method removeResources once for the one app. Right?

Yes, normally, an app calls the logic of removeResources once. However, an exception occurs, which causes the shuffle file to be deleted at 18:00:55 and the file to be recreated at 18:00:57. The expiredAppIdQueue adds the appId again, and then the path of the appId is not deleted.

Why does the expiredAppIdQueue add the appId again?

@smallzhongfeng
Copy link
Contributor Author

This problem was fixed immediately after it appeared last week. Unfortunately, there was no backup log during the redeployment. However, according to the results of last week, we think that the appId was rewritten into the buffer of shuffleTaskInfos at 18:00:57, and the expiredAppIdQueue stores the appId. At present, this HDFS path is also deleted normally.

@jerqi
Copy link
Contributor

jerqi commented Aug 22, 2022

I just doubt that it's not a root cause. Maybe we can merge this pr first. Please resolve the comment https://github.com/apache/incubator-uniffle/pull/168/files#r951078297

@smallzhongfeng
Copy link
Contributor Author

At present, this is an occasional problem, and the impact is relatively controllable.We will observe it first, and any subsequent progress will be returned to the community. @jerqi

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this pr first. Thanks @smallzhongfeng

@jerqi jerqi merged commit 805a372 into apache:master Aug 22, 2022
@smallzhongfeng
Copy link
Contributor Author

smallzhongfeng commented Aug 22, 2022

OK, thanks for your patient review. @jerqi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants