Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hsm_remove not removing files #74

Closed
mkgilbert opened this issue Mar 23, 2017 · 13 comments
Closed

hsm_remove not removing files #74

mkgilbert opened this issue Mar 23, 2017 · 13 comments

Comments

@mkgilbert
Copy link

We are having trouble getting hsm_remove to work at all on our cluster. Here is our setup:
robinhood 3.0.1
CentOS 6.8
Lustre 2.8

In our robinhood config we are mainly just attempting to use the lhsm funcionality. Running the posix copytool seems to work with archiving and releasing, but not removing. We have a test cluster setup to try to get this working but haven't had any luck.

In our /etc/sysconfig/robinhood file we have:

RBH_OPT="--readlog --scan --run=all"

We added the --scan command because the robinhood log was saying that the policies were not seeing changes in the file system. Adding a FS_scan block to the config seemed to help with this, but didn't fix our problem with hsm_remove.

We are currently just running our copytool from a prompt (for testing) with the following command:

lhsmtool_posix --verbose --hsm-root /hsm --archive=1 /lustre-scratch/

where /hsm is an ext4 file system mount and /lustre-scratch/ is our lustre file system mount. Our robinhood config is as follows:

%include "includes/tmpfs.inc"
%include "includes/lhsm.inc"

General {
    fs_path = "/lustre-scratch";
    # filesystem type, as displayed by 'mount' (e.g. ext4, xfs, lustre, ...)
    fs_type = lustre;
}

FS_Scan {
    # daily scan
    scan_interval = 10s;
}

EntryProcessor {
    nb_threads = 4;
    STAGE_GET_INFO_FS_threads_max = 2;
    STAGE_GET_FID_threads_max = 2;
}

Log {
    debug_level = DEBUG;
    log_file = "/var/log/robinhood/robinhood.log";
    report_file = "/var/log/robinhood/actions.log";
    alert_file = "/var/log/robinhood/alerts.log";
    alert_show_attrs = yes;
}

ListManager {
    MySQL {
        server = localhost;
        db = robinhood_lustre;
        user = robinhood;
        password_file = /etc/robinhood.d/.dbpassword;
    }
}

# Lustre 2.x only
ChangeLog {
    MDT {
        mdt_name = "MDT0000";
        reader_id = "cl1";
    }
    polling_interval = 1s;
}

define_policy global {
     status_manager = none;
     scope = all;
     default_action = none;
     default_lru_sort_attr = none;
}

global_rules {
    rule default {
        condition { path == "/scratch/**" }
    }
}

global_trigger {
    trigger_on = global_usage;
    high_threshold_pct = 70%;
    low_threshold_pct = 5%;
    check_interval     = 1s;
    alert_high = yes;
}

lhsm_config {
    # copytool-specific
    rebind_cmd = "lhsmtool_posix --archive={archive_id} --rebind {oldfid} {newfid} {fsroot}";
}

lhsm_archive_rules {
   rule default {
       condition {last_mod > 15s}
   }
}

lhsm_archive_trigger {
    trigger_on = scheduled;
    check_interval = 15s;
}

lhsm_release_rules {
    rule default {
        condition { last_access > 30s }
    }    
}

lhsm_release_trigger {
    # this quota applies to the OSTs
    #trigger_on         = global_usage;
    #high_threshold_pct = 4%;
    #low_threshold_pct = 1%;
    trigger_on = scheduled;
    check_interval     = 30s;
    alert_high = yes;
}

lhsm_remove_rules {
    rule default {
        condition { rm_time > 10s }
    }
}

lhsm_remove_trigger {
    trigger_on = scheduled;
    check_interval = 20s;
    alert_high = yes;
}

These settings are just for testing. I'm trying to release and remove files as quickly as possible just in the interest of seeing that it works. In the robinhood logs, I'm frequently seeing this:

2017/03/23 13:15:50 [26801/7] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a2:0xe934:0x0]

and the hsm_remove command always shows that there is nothing to remove:

2017/03/23 13:17:19 [26801/4] lhsm_remove | Checking trigger #0 (scheduled), last check 10s ago
2017/03/23 13:17:19 [26801/4] lhsm_remove | Checking policy rules for all
2017/03/23 13:17:19 [26801/4] lhsm_remove | Building policy list - last full FS Scan: 2017/03/23 13:16:50
2017/03/23 13:17:19 [26801/4] lhsm_remove | Starting policy run
2017/03/23 13:17:19 [26801/4] lhsm_remove | End of current pass
2017/03/23 13:17:19 [26801/4] lhsm_remove | target: none
2017/03/23 13:17:19 [26801/4] lhsm_remove | done: none
2017/03/23 13:17:19 [26801/4] lhsm_remove | Policy run summary: time=01s; target=all; 0 successful actions (0.00/sec); volume: 0 (0/sec); 0 entries skipped; 0 errors.

Does anyone have any clue why this might be happening? Please forgive the excess of information in a GitHub issue, but I haven't heard back from robinhood support for several days.

Thank you!

@tl-cea
Copy link
Member

tl-cea commented Mar 24, 2017

The purpose of lhsm_remove policy is to delete files from the backend after they were archived, and then deleted from Lustre. Do you have files in this case?

If so, to troubleshoot you can run the following commands:

  • This will list all entries that were archived and then removed:
    rbh-undelete -L
  • This can give clues to determine why no hsm_remove is done:
    robinhood --run=lhsm_remove --once -l FULL

@mkgilbert
Copy link
Author

Thanks for the reply! I definitely have files that should be removed. For example:

[root@head src]# lfs hsm_state /lustre-scratch/ricky/file99
/lustre-scratch/ricky/file99: (0x0000000d) released exists archived, archive_id:1

So the file looks to be archived and released. checking the backend /hsm directory for that file gives me this:

[root@head src]# ls -al /hsm/shadow/ricky/file99
lrwxrwxrwx 1 root root 58 Mar 20 12:54 /hsm/shadow/ricky/file99 -> ../../eb64/0000/13a2/0000/0002/0000/0x2000013a2:0xeb64:0x0

so the file seems to be found in the hsm backend.

@tl-cea Running the undelete and lshm_remove commands you mentioned gave me this:

[root@head ricky]# rbh-undelete --list /hsm
Using config file '/etc/robinhood.d/lustre.conf'.
2017/03/24 10:51:03 [28401/1] CfgLoader | Loading updt params config
2017/03/24 10:51:03 [28401/1] CfgLoader | Loading list manager config
2017/03/24 10:51:03 [28401/1] CfgLoader | Loading policies config
2017/03/24 10:51:03 [28401/1] mod_mgt | Successfully loaded module common
2017/03/24 10:51:03 [28401/1] mod_mgt | Successfully loaded module lhsm
2017/03/24 10:51:03 [28401/1] CfgLoader | Loading status managers config
2017/03/24 10:51:03 [28401/1] smi_cfg | Loading status manager 'lhsm' config
2017/03/24 10:51:03 [28401/1] CheckFS | Root mountpoint is NOT allowed for matching /lustre-scratch, type=ext4, fs=/dev/mapper/vg_head-lv_root
2017/03/24 10:51:03 [28401/1] CheckFS | /lustre-scratch is under mountpoint /lustre-scratch, type=lustre, fs=mds@tcp:/scratch
2017/03/24 10:51:03 [28401/1] CheckFS | '/lustre-scratch' matches mount point '/lustre-scratch', type=lustre, fs=mds@tcp:/scratch
2017/03/24 10:51:03 [28401/1] FSInfo | fs_key: hash(fsname=scratch)=DAB1E06F
2017/03/24 10:51:03 [28401/1] smi_init | Status manager lhsm successfully initialized
2017/03/24 10:51:03 [28401/1] ListMgr | Checking database schema
2017/03/24 10:51:03 [28401/1] Undelete | ListManager successfully initialized
2017/03/24 10:51:03 [28401/1] CheckFS | /lustre-scratch matches database content.
             rm_time,                        id,     type,       user,      group,       size,             last_mod,     lhsm.status,                                     path
[root@head ricky]# robinhood --run=lhsm_remove --once -l FULL
Using config file '/etc/robinhood.d/lustre.conf'.
2017/03/24 10:51:36 [28454/1] CfgLoader | Loading updt params config
2017/03/24 10:51:36 [28454/1] CfgLoader | Loading list manager config
2017/03/24 10:51:36 [28454/1] CfgLoader | Loading policies config
2017/03/24 10:51:36 [28454/1] mod_mgt | Successfully loaded module common
2017/03/24 10:51:36 [28454/1] mod_mgt | Successfully loaded module lhsm
2017/03/24 10:51:36 [28454/1] CfgLoader | Loading status managers config
2017/03/24 10:51:36 [28454/1] smi_cfg | Loading status manager 'lhsm' config
2017/03/24 10:51:36 [28454/1] CfgLoader | Loading policy run config
2017/03/24 10:51:36 [28454/1] CheckFS | Root mountpoint is NOT allowed for matching /lustre-scratch, type=ext4, fs=/dev/mapper/vg_head-lv_root
2017/03/24 10:51:36 [28454/1] CheckFS | /lustre-scratch is under mountpoint /lustre-scratch, type=lustre, fs=mds@tcp:/scratch
2017/03/24 10:51:36 [28454/1] CheckFS | '/lustre-scratch' matches mount point '/lustre-scratch', type=lustre, fs=mds@tcp:/scratch
2017/03/24 10:51:36 [28454/1] FSInfo | fs_key: hash(fsname=scratch)=DAB1E06F
2017/03/24 10:51:36 [28454/1] smi_init | Status manager lhsm successfully initialized
2017/03/24 10:51:36 [28454/1] Main | Signal handler thread started successfully
2017/03/24 10:51:36 [28454/2] SigHdlr | Signals SIGTERM and SIGINT (daemon shutdown) are ready to be used
2017/03/24 10:51:36 [28454/2] SigHdlr | Signal SIGHUP (config reloading) is ready to be used
2017/03/24 10:51:36 [28454/2] SigHdlr | Signal SIGUSR1 (stats dump) is ready to be used
2017/03/24 10:51:36 [28454/1] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/1] ListMgr | Checking database schema
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SELECT value FROM VARS WHERE varname='VersionFunctionSet'
2017/03/24 10:51:36 [28454/1] ListMgr | Functions version is up-to-date (1.5)
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SELECT value FROM VARS WHERE varname='VersionTriggerSet'
2017/03/24 10:51:36 [28454/1] ListMgr | Triggers version is up-to-date (1.4)
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM VARS
2017/03/24 10:51:36 [28454/1] ListMgr | VARS.varname field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | VARS.value field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM ENTRIES
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.uid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | uid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | uid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.gid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | gid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | gid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.size field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | size field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | size field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.blocks field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | blocks field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | blocks field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.creation_time field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | creation_time field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.last_access field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | last_access field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.last_mod field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | last_mod field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.last_mdchange field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | last_mdchange field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.type field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | type field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | type field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.mode field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | mode field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.nlink field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | nlink field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.md_update field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | md_update field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.invalid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | invalid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.fileclass field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | fileclass field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.class_update field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | class_update field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_status field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_status field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_status field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_archid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_archid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_archid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_norels field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_norels field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_norels field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_noarch field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_noarch field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_noarch field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_lstarc field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_lstarc field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_lstarc field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_lstrst field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_lstrst field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_lstrst field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ENTRIES.lhsm_uuid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_uuid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM NAMES
2017/03/24 10:51:36 [28454/1] ListMgr | NAMES.id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | NAMES.pkn field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | NAMES.parent_id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | NAMES.name field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | NAMES.path_update field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM ANNEX_INFO
2017/03/24 10:51:36 [28454/1] ListMgr | ANNEX_INFO.id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ANNEX_INFO.link field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | link field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW FUNCTION STATUS WHERE DB='robinhood_lustre' AND NAME='sz_range'
2017/03/24 10:51:36 [28454/1] ListMgr | Function sz_range exists
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM ACCT_STAT
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.uid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | uid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | uid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.gid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | gid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | gid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.type field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | type field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | type field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.lhsm_status field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_status field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_status field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.size field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | size field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | size field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.blocks field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | blocks field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | blocks field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.count field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz0 field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz1 field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz32 field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz1K field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz32K field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz1M field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz32M field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz1G field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz32G field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | ACCT_STAT.sz1T field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM STRIPE_INFO
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_INFO.id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_INFO.validator field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_INFO.stripe_count field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_INFO.stripe_size field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_INFO.pool_name field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM STRIPE_ITEMS
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_ITEMS.id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_ITEMS.stripe_index field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_ITEMS.ostidx field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | STRIPE_ITEMS.details field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW COLUMNS FROM SOFT_RM
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.id field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.fullpath field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | fullpath field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.uid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | uid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | uid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.gid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | gid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | gid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.size field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | size field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | size field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.blocks field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | blocks field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | blocks field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.last_access field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | last_access field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.last_mod field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | last_mod field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.last_mdchange field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | last_mdchange field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.type field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | type field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | type field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.mode field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | mode field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.nlink field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | nlink field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.rm_time field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | rm_time field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.lhsm_status field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_status field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_status field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.lhsm_archid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_archid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_archid field default OK
2017/03/24 10:51:36 [28454/1] ListMgr | SOFT_RM.lhsm_uuid field name OK
2017/03/24 10:51:36 [28454/1] ListMgr | lhsm_uuid field type OK
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SELECT EVENT_OBJECT_TABLE FROM INFORMATION_SCHEMA.TRIGGERS WHERE TRIGGER_SCHEMA='robinhood_lustre'AND TRIGGER_NAME='ACCT_ENTRY_INSERT'
2017/03/24 10:51:36 [28454/1] ListMgr | Trigger ACCT_ENTRY_INSERT exists and is defined on ENTRIES
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SELECT EVENT_OBJECT_TABLE FROM INFORMATION_SCHEMA.TRIGGERS WHERE TRIGGER_SCHEMA='robinhood_lustre'AND TRIGGER_NAME='ACCT_ENTRY_DELETE'
2017/03/24 10:51:36 [28454/1] ListMgr | Trigger ACCT_ENTRY_DELETE exists and is defined on ENTRIES
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SELECT EVENT_OBJECT_TABLE FROM INFORMATION_SCHEMA.TRIGGERS WHERE TRIGGER_SCHEMA='robinhood_lustre'AND TRIGGER_NAME='ACCT_ENTRY_UPDATE'
2017/03/24 10:51:36 [28454/1] ListMgr | Trigger ACCT_ENTRY_UPDATE exists and is defined on ENTRIES
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW FUNCTION STATUS WHERE DB='robinhood_lustre' AND NAME='one_path'
2017/03/24 10:51:36 [28454/1] ListMgr | Function one_path exists
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SHOW FUNCTION STATUS WHERE DB='robinhood_lustre' AND NAME='this_path'
2017/03/24 10:51:36 [28454/1] ListMgr | Function this_path exists
2017/03/24 10:51:36 [28454/1] ListMgr | Database connection closed
2017/03/24 10:51:36 [28454/1] Main | ListManager successfully initialized
2017/03/24 10:51:36 [28454/1] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/1] ListMgr | SQL query: SELECT value FROM VARS WHERE varname='FS_Path'
2017/03/24 10:51:36 [28454/1] CheckFS | /lustre-scratch matches database content.
2017/03/24 10:51:36 [28454/1] ListMgr | Database connection closed
2017/03/24 10:51:36 [28454/1] Main | Policy lhsm_remove successfully initialized
2017/03/24 10:51:36 [28454/3] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/3] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/4] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/4] lhsm_remove | Checking trigger #0 (scheduled), never checked
2017/03/24 10:51:36 [28454/4] lhsm_remove | Checking policy rules for all
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: INSERT INTO VARS (varname,value) VALUES ('lhsm_remove_start_current','1490377896') ON DUPLICATE KEY UPDATE value='1490377896'
2017/03/24 10:51:36 [28454/5] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/5] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/6] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/6] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/7] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/7] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/8] ListMgr | Logged on to database 'robinhood_lustre' successfully
2017/03/24 10:51:36 [28454/8] ListMgr | SQL query: SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED
2017/03/24 10:51:36 [28454/8] Main | Statistics thread started
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: INSERT INTO VARS (varname,value) VALUES ('lhsm_remove_trigger_current','trigger: scheduled (one-shot command), target: all') ON DUPLICATE KEY UPDATE value='trigger: scheduled (one-shot command), target: all'
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: SELECT value FROM VARS WHERE varname='LastScanEndTime'
2017/03/24 10:51:36 [28454/4] lhsm_remove | Building policy list - last full FS Scan: 2017/03/24 10:50:48
2017/03/24 10:51:36 [28454/4] lhsm_remove | Converting scope to DB filter...
2017/03/24 10:51:36 [28454/4] ListMgr | Appending filter on "type", flags=0X10
2017/03/24 10:51:36 [28454/4] lhsm_remove | Starting policy run
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: SELECT id,fullpath,uid,gid,size,blocks,last_access,last_mod,last_mdchange,type,mode,nlink,rm_time,lhsm_status,lhsm_archid,lhsm_uuid FROM SOFT_RM WHERE (type='file' OR type IS NULL) ORDER BY rm_time ASC
2017/03/24 10:51:36 [28454/4] lhsm_remove | End of list (0 entries returned)
2017/03/24 10:51:36 [28454/4] lhsm_remove | End of current pass
2017/03/24 10:51:36 [28454/4] lhsm_remove | target: none
2017/03/24 10:51:36 [28454/4] lhsm_remove | done: none
2017/03/24 10:51:36 [28454/4] lhsm_remove | Policy run summary: time=01s; target=all; 0 successful actions (0.00/sec); volume: 0 (0/sec); 0 entries skipped; 0 errors.
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: DELETE FROM VARS WHERE varname ='lhsm_remove_start_current'
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: DELETE FROM VARS WHERE varname ='lhsm_remove_trigger_current'
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: INSERT INTO VARS (varname,value) VALUES ('lhsm_remove_start','1490377896') ON DUPLICATE KEY UPDATE value='1490377896'
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: INSERT INTO VARS (varname,value) VALUES ('lhsm_remove_end','1490377896') ON DUPLICATE KEY UPDATE value='1490377896'
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: INSERT INTO VARS (varname,value) VALUES ('lhsm_remove_trigger','trigger: scheduled (one-shot command), target: all') ON DUPLICATE KEY UPDATE value='trigger: scheduled (one-shot command), target: all'
2017/03/24 10:51:36 [28454/4] ListMgr | SQL query: INSERT INTO VARS (varname,value) VALUES ('lhsm_remove_status','0 successful actions, volume: 0; 0 entries skipped; 0 errors') ON DUPLICATE KEY UPDATE value='0 successful actions, volume: 0; 0 entries skipped; 0 errors'
2017/03/24 10:51:36 [28454/4] ListMgr | Database connection closed
2017/03/24 10:51:36 [28454/1] Main | lhsm_remove: policy run terminated (rc = 0).
2017/03/24 10:51:36 [28454/1] Main | All tasks done! Exiting.
[root@head ricky]# 

So it looks like it's not seeing anything even though there are about 200 files that are all listed as released and archived (and have been for a couple of days). And as you can see in my config given earlier, files should be removed from the hsm backend pretty much immediately due to the rm_time > 10s parameter.

It's quite possible I'm simply not understanding something very basic here, but I'm at a loss nonetheless.

@mkgilbert
Copy link
Author

Something that might also be relevant is that I'm getting a ton of these messages in my robinhood.log:

2017/03/24 11:48:28 [9121/15] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a3:0xd9f4:0x0]
2017/03/24 11:48:28 [9121/15] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a3:0xd9f4:0x0]
2017/03/24 11:48:28 [9121/8] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a2:0xe892:0x0]
2017/03/24 11:48:28 [9121/8] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a2:0xe892:0x0]
2017/03/24 11:48:28 [9121/10] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a2:0xea85:0x0]
2017/03/24 11:48:28 [9121/10] Policy | Missing attribute 'no_archive' for evaluating boolean expression on [0x2000013a2:0xea85:0x0]

The archive and release commands still seem to go through OK, so I'm assuming this has something to do with hsm_remove? Do these messages mean that I have to specify certain files for archiving? I haven't set up any fileclasses because I want the rules to apply to all files. Is this something I need to do?

I found this code in your policy_matching.c file that seems to be sending that message.

#define CHECK_ATTR(_pset_, _attr_, _no_trace) do {            \
                    if (!ATTR_MASK_TEST(_pset_, _attr_)) {     \
                        if (!(_no_trace))                      \
                            DisplayLog(LVL_MAJOR, POLICY_TAG,  \
                                "Missing attribute '%s' for evaluating "   \
                                "boolean expression on "       \
                                 DFID, (#_attr_), PFID(p_entry_id)); \
                        return POLICY_MISSING_ATTR;            \
                    }                                          \
             } while (0)

@mkgilbert
Copy link
Author

I just learned something interesting from someone at Intel. I opened an issue with their Lustre repo (LU-9255) regarding the problems I'm having with the hsm_remove function. Here's the response I got:

HSM remove on a released file would result in data loss and is therefore prohibited. If the file is restored then HSM remove will be allowed.

However, in your v3_lhsm_tutorial you specifically state the following about the hsm remove feature:

remove: remove a copy from the backend (allow cleaning a file from the backend after it has been removed from Lustre).

So Intel is saying you can't do hsm_remove on a released file, while robinhood documentation is saying that hsm_remove is supposed to be used after being released from lustre. There seems to be disagreement regarding what hsm_remove is supposed to be able to achieve. Can anyone shed any light on this?

@mkgilbert
Copy link
Author

OK so it appears that I was mistaking "releasing" for "removing". It would be helpful if you put in your documentation that the hsm_remove command CANNOT be done on a file that is in the released state. The wording of "released" vs "removed" in Lustre is very confusing.

So that brings me to a question: How can you delete files that have been released and are archived? In my case, that is exactly what I want to do. I want to backup files to the hsm, release/remove them from Lustre, and then after a period of time has passed, remove them from the hsm as well. At that point all remnants of the file would be gone.

I don't understand why hsm_remove only works if you do the rm command on a file, while putting it in a "released" state does not.

@ypsah
Copy link
Contributor

ypsah commented Mar 27, 2017

The purpose of Lustre-HSM is to move data away from Lustre for example to put it on cheaper hardware, you can check out this document for more information on Lustre-HSM.

I don't think there is any valid use case for wanting to remove data from the backend relating to a released file that still exists in Lustre (feel free to disagree).

From what you explained it is my understanding that you wish to use HSM as some sort of trashbin: unused data is first moved away from Lustre (archived + released) and then after a certain amount of time completely deleted (file is deleted from Lustre + archived data is deleted from the backend). If so, you need to perform both actions (rm on Lustre + hsm_remove on the backend).

Another way to achieve a similar goal would be to use rbh-undelete: archive and remove unused data and after a certain amount of time perform hsm_remove backed up files.
If between the time the file was removed from Lustre and removed from the backend you want to restore it you can run rbh-undelete -R <filepath> which will re-create the file in the released state.

@tl-cea
Copy link
Member

tl-cea commented Mar 27, 2017

I was mistaking "releasing" for "removing".

I updated the doc by changing "removed" to "deleted". I guess it make it more clear.

The purpose of this policy is to clean entries in the archive backend after they have been deleted from Lustre

@mkgilbert
Copy link
Author

@qb-cea Thank you for the very helpful feedback! I am definitely understanding the purpose of the hsm system a lot better now. You are correct that we are wanting the hsm system to be a type of trash bin.

I'm thinking of possibly doing this to achieve our goal:

  1. an archive on all files frequently
  2. a periodic release on old files
  3. a less frequent cleanup policy on even older files
  4. a remove policy from the HSM on really old files.

Between steps 3 and 4, we could use the rbh-undelete tool as needed if users need their files back.

In my initial testing on small data (80MB), I'm noticing that the hsm backend doesn't really seem to be shrinking in size when the files are being removed with hsm_remove. For example, I saw success messages from robinhood that about 100 files totaling around 30MB were being removed but the size of the backend hasn't changed at all. rbh-undelete -L doesn't show these files, so they appear to be gone, but the space they are taken up in the backend isn't being freed.

Is this normal or am I missing something?

Thank you again for all the help!

@ypsah
Copy link
Contributor

ypsah commented Mar 28, 2017

@mkgilbert You're welcome!

Could you check the output of rbh-undelete -L before running hsm_remove?

Could you also check whether or not files in the backend are deleted after hsm_remove (not just that it takes space)? For this you can run something like:

find /backend -type f > /dev/shm/backend.filelist
# run hsm_remove
diff <(find /backend -type f | sort) <(sort /dev/shm/backend.filelist)

@tl-cea
Copy link
Member

tl-cea commented Mar 29, 2017

I think what happens is the step 3 (cleanup) drops any information about the file from robinhood DB. So when it then processes the UNLINK changelog record, it doesn't know was path the entry had. So I think the entry is correctly added to SOFTRM table, but it only appears there with a fid, not a path. So "rbh-undelete -L" should report them with fids only, not paths.

A possible workaround is to modify the action of the cleanup policy, to preserve the entry in DB until robinhood gets the UNLINK record. To achieve that you, change the action of cleanup policy (which is common.unlink by default) to cmd("rm -f {path}");

Regards
Thomas

@mkgilbert
Copy link
Author

@qb-cea I apologize for the delay...I will get you that information as soon as I can. Unfortunately, my test lustre system went haywire and stopped working so I'm trying to figure out how to fix it.

@mkgilbert
Copy link
Author

@qb-cea So I did the steps you suggested (find all hsm files, remove, then do a diff) and at first, nothing happened; the file was not deleted. Literally 41 minutes went by and then boom the copytool got the message and removed the file from the backend. It's worth noting that I did the remove command via lfs (lfs hsm_remove ...)

The funny thing is that this is an extremely small test system with no one using it but me. I hadn't added any files or been doing a lot of Lustre operations during this period. The only major change I made was to the robinhood config, where I added rules for lhsm_remove. But the odd thing is, the rules didn't appear to be working because I did an rm on all the lustre files and the remove tool did nothing. The robinhood logs also don't show that any files were removed. But the copytool did get the original lfs message that I sent 40 minutes prior.

This type of behavior is what I've been troubleshooting...I do an action and nothing happens. Then at some random amount of time later, the action finally goes through. There is nothing of significance happening in the system logs of the MDS or the client node during these "waiting" periods.

@mkgilbert
Copy link
Author

mkgilbert commented Apr 7, 2017

I believe I resolved my issues. Since the copytool wasn't seeming to get any of the commands I was sending it, I purged the actions from the MDS using the following command:

lctl set_param mdt.<fsname>-MDT0000.hsm_control=purge

where <fsname> for my system was "scratch". Immediately after running this command, actions from robinhood started working again without any problem. (note that according to the Lustre docs this command doesn't actually change the hsm_control setting in the config, it only purges actions)

Apparently the coordinator somehow got hung up and would no longer communicate with the copytool. Even restarting all of the nodes, including the mds and oss nodes didn't fix the problem, only running this purge did the trick.

I've now been able to verify that indeed files definitely are getting removed from the hsm backend when doing an hsm_remove command.

Thank you everyone for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants