hsm_remove not removing files #74
The purpose of the lhsm_remove policy is to delete files from the backend after they have been archived and then deleted from Lustre. Do you have files in this situation? If so, you can run the following commands to troubleshoot:
Thanks for the reply! I definitely have files that should be removed. For example:
So the file looks to be archived and released. Checking the backend:
So the file seems to be found in the hsm backend.
@tl-cea Running the undelete and lhsm_remove commands you mentioned gave me this:
So it looks like it's not seeing anything, even though there are about 200 files that are all listed as released and archived (and have been for a couple of days). And as you can see in my config given earlier, files should be removed from the hsm backend pretty much immediately due to the
It's quite possible I'm simply not understanding something very basic here, but I'm at a loss nonetheless.
Something that might also be relevant is that I'm getting a ton of these messages in my robinhood.log:
The archive and release commands still seem to go through OK, so I'm assuming this has something to do with hsm_remove? Do these messages mean that I have to specify certain files for archiving? I haven't set up any fileclasses because I want the rules to apply to all files. Is this something I need to do? I found this code in your
#define CHECK_ATTR(_pset_, _attr_, _no_trace) do { \
    if (!ATTR_MASK_TEST(_pset_, _attr_)) { \
        if (!(_no_trace)) \
            DisplayLog(LVL_MAJOR, POLICY_TAG, \
                       "Missing attribute '%s' for evaluating " \
                       "boolean expression on " \
                       DFID, (#_attr_), PFID(p_entry_id)); \
        return POLICY_MISSING_ATTR; \
    } \
} while (0)
I just learned something interesting from someone at Intel. I opened an issue with their Lustre repo (LU-9255) regarding the problems I'm having with the hsm_remove function. Here's the response I got:
However, in your v3_lhsm_tutorial you specifically state the following about the hsm remove feature:
So Intel is saying you can't do hsm_remove on a released file, while the robinhood documentation says that hsm_remove is supposed to be used after a file has been released from Lustre. There seems to be a disagreement about what hsm_remove is supposed to achieve. Can anyone shed any light on this?
OK, so it appears that I was mistaking "releasing" for "removing". It would be helpful if you put in your documentation that the
So that brings me to a question: How can you delete files that have been released and are archived? In my case, that is exactly what I want to do. I want to back up files to the hsm, release/remove them from Lustre, and then, after a period of time has passed, remove them from the hsm as well. At that point all remnants of the file would be gone. I don't understand why
The purpose of Lustre-HSM is to move data away from Lustre, for example to put it on cheaper hardware. You can check out this document for more information on Lustre-HSM. I don't think there is any valid use case for wanting to remove data from the backend relating to a
From what you explained, it is my understanding that you wish to use HSM as some sort of trash bin: unused data is first moved away from Lustre (
Another way to achieve a similar goal would be to use
I updated the doc by changing "removed" to "deleted". I guess that makes it clearer.
@qb-cea Thank you for the very helpful feedback! I am definitely understanding the purpose of the hsm system a lot better now. You are correct that we are wanting the hsm system to be a type of trash bin. I'm thinking of possibly doing this to achieve our goal:
Between steps 3 and 4, we could use the
In my initial testing on small data (80 MB), I'm noticing that the
Is this normal or am I missing something? Thank you again for all the help!
@mkgilbert You're welcome! Could you check the output of
Could you also check whether or not files in the backend are deleted after
find /backend -type f > /dev/shm/backend.filelist
# run hsm_remove
diff <(find /backend -type f | sort) <(sort /dev/shm/backend.filelist)
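The snapshot-and-compare check above can also be wrapped in two small POSIX shell functions that print only the files that disappeared from the backend. This is a sketch under assumptions: snapshot_backend and removed_since_snapshot are hypothetical helper names (not part of robinhood or Lustre), and /backend and /dev/shm/backend.filelist are simply the paths used in the commands above.

```shell
# Hypothetical helpers (not part of robinhood or Lustre).

# Record every regular file under the backend directory, sorted.
# $1 = backend directory, $2 = snapshot file to write
snapshot_backend() {
    find "$1" -type f | LC_ALL=C sort > "$2"
}

# Print the files listed in the snapshot that are no longer in the backend.
# $1 = backend directory, $2 = snapshot file written earlier
removed_since_snapshot() {
    find "$1" -type f | LC_ALL=C sort | LC_ALL=C comm -23 "$2" -
}

# Intended use (paths taken from the commands above):
#   snapshot_backend /backend /dev/shm/backend.filelist
#   ... run hsm_remove ...
#   removed_since_snapshot /backend /dev/shm/backend.filelist
```

An empty result means nothing has been removed from the backend since the snapshot was taken; files added in the meantime are ignored on purpose.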
I think what happens is that step 3 (cleanup) drops all information about the file from the robinhood DB. So when robinhood then processes the UNLINK changelog record, it doesn't know what path the entry had. I think the entry is correctly added to the SOFTRM table, but it only appears there with a fid, not a path, so "rbh-undelete -L" should report such entries with fids only, not paths. A possible workaround is to modify the action of the cleanup policy so that the entry is preserved in the DB until robinhood gets the UNLINK record. To achieve that, change the action of the cleanup policy (which is common.unlink by default) to cmd("rm -f {path}");
Regards
@qb-cea I apologize for the delay... I will get you that information as soon as I can. Unfortunately, my test Lustre system went haywire and stopped working, so I'm trying to figure out how to fix it.
@qb-cea So I did the steps you suggested (find all hsm files, remove, then do a diff) and at first, nothing happened; the file was not deleted. Literally 41 minutes went by and then, boom, the copytool got the message and removed the file from the backend. It's worth noting that I did the remove command via lfs (
The funny thing is that this is an extremely small test system with no one using it but me. I hadn't added any files or been doing a lot of Lustre operations during this period. The only major change I made was to the robinhood config, where I added rules for lhsm_remove. But the odd thing is, the rules didn't appear to be working because I did an
This type of behavior is what I've been troubleshooting... I do an action and nothing happens. Then, some random amount of time later, the action finally goes through. There is nothing of significance happening in the system logs of the MDS or the client node during these "waiting" periods.
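To put a number on delays like the 41 minutes described above, a small polling loop can time how long a backend copy survives after the remove request has been sent. This is only a sketch: wait_until_gone is a hypothetical helper, not something shipped with Lustre or robinhood, and the path you pass is whatever file your copytool wrote under the backend mount.

```shell
# Hypothetical helper (not part of Lustre or robinhood).
# Poll until a backend file disappears or a timeout is reached.
# $1 = path of the backend copy
# $2 = timeout in seconds
# $3 = poll interval in seconds (defaults to 5)
wait_until_gone() {
    elapsed=0
    interval=${3:-5}
    while [ -e "$1" ]; do
        if [ "$elapsed" -ge "$2" ]; then
            echo "still present after ${elapsed}s: $1" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "removed after about ${elapsed}s: $1"
}
```

Running it right after issuing the remove request, with a generous timeout, gives a rough measure of how long the request took to reach the copytool. The reported time is only accurate to within one poll interval.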
I believe I resolved my issues. Since the copytool wasn't seeming to get any of the commands I was sending it, I purged the actions from the MDS using the following command:
where
Apparently the coordinator somehow got hung up and would no longer communicate with the copytool. Even restarting all of the nodes, including the MDS and OSS nodes, didn't fix the problem; only running this purge did the trick. I've now been able to verify that files are indeed getting removed from the hsm backend when doing an
Thank you everyone for your help!
We are having trouble getting hsm_remove to work at all on our cluster. Here is our setup:
robinhood 3.0.1
CentOS 6.8
Lustre 2.8
In our robinhood config we are mainly just attempting to use the lhsm functionality. Running the posix copytool seems to work for archiving and releasing, but not for removing. We have a test cluster set up to try to get this working but haven't had any luck.
In our /etc/sysconfig/robinhood file we have:
We added the --scan command because the robinhood log was saying that the policies were not seeing changes in the file system. Adding an FS_scan block to the config seemed to help with this, but didn't fix our problem with hsm_remove.
We are currently just running our copytool from a prompt (for testing) with the following command:
where /hsm is an ext4 file system mount and /lustre-scratch/ is our Lustre file system mount.
Our robinhood config is as follows:
These settings are just for testing. I'm trying to release and remove files as quickly as possible just in the interest of seeing that it works. In the robinhood logs, I'm frequently seeing this:
and the hsm_remove command always shows that there is nothing to remove:
Does anyone have any clue why this might be happening? Please forgive the excess of information in a GitHub issue, but I haven't heard back from robinhood support for several days.
Thank you!