
Investigate feasibility of creating Cleanup jobs from a central microservice


We are considering implementing a new microservice, to be hosted on CMSWEB, which will be responsible for fetching a dump of all the unmerged files in a given RSE (from the Consistency Checking tool), figuring out which files are no longer needed in the system, and providing those unneeded files as input to cleanup jobs submitted to one of our production schedds. Those cleanup jobs should be very similar (if not identical) to our standard production Cleanup jobs, thus running locally on the site CPU resources and deleting files "locally".
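Purely as an illustration of the decision the service has to make (none of this code exists yet and all names below are placeholders), selecting the deletable files boils down to a set difference between the unmerged dump and the LFNs still protected by active workflows:

def selectDeletableLFNs(unmergedLFNs, protectedLFNs):
    """
    Given the dump of unmerged LFNs for one RSE (from the Consistency
    Checking tool) and the LFNs still needed by active workflows, return
    the LFNs that the cleanup jobs are allowed to delete.
    """
    return sorted(set(unmergedLFNs) - set(protectedLFNs))

# e.g. selectDeletableLFNs(["/store/unmerged/a.root", "/store/unmerged/b.root"],
#                          ["/store/unmerged/b.root"]) -> ["/store/unmerged/a.root"]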

Proposed solution

This is meant to be a somewhat high-level investigation of how feasible it would be to reuse the WMCore/WMSpec/Steps package to create the Cleanup jobs. We also have to explore how we could use the WMCore/Storage/Backends package to pick the correct tool for the file deletions. A third point of investigation is the job post processing, such that we can identify what succeeded or failed, and whether files were actually deleted or not. One could argue that, as a first alpha product, we simply trust that part of the files will be deleted and there is no need to know how many jobs succeed or how many files are successfully deleted.
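As a rough sketch of what the deletion payload could look like if the Storage package can indeed be reused, something along these lines might work. It assumes WMCore is available on the worker node and that DeleteMgr can be instantiated standalone and called with a {"LFN": ...} dictionary, as the existing DeleteFiles step executor appears to do; verifying exactly that is part of this investigation:

import logging

from WMCore.Storage.DeleteMgr import DeleteMgr


def deleteLFNs(lfns):
    """Try to delete every LFN locally; return the list of LFNs that failed."""
    manager = DeleteMgr()
    failed = []
    for lfn in lfns:
        try:
            # DeleteMgr is expected to resolve the site-local-config and pick
            # the proper Storage backend for the actual deletion (to be verified)
            manager(fileToDelete={"LFN": lfn})
        except Exception as ex:
            logging.error("Failed to delete %s: %s", lfn, ex)
            failed.append(lfn)
    return failed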

The outcome will likely be a document explaining how to plug all these pieces together, what is required, etc. There is no need for any implementation at this stage.

Investigation

In order to build a microservice such as this, creating cleanup jobs that run at the sites themselves, we need to be able to:

  • Submit jobs from the microservice node/pod to a remote production schedd
  • Create the cleanup jobs, with all the dependencies they need
  • Read the site local config and figure out the right mechanism to delete the files "locally" (see the sketch after this list)
  • Check whether files were indeed deleted or not (post job processing)
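As for reading the site local config, a minimal sketch of how a cleanup job could discover the local deletion mechanism is shown below. It assumes the traditional SITECONF layout ($CMS_PATH/SITECONF/local/JobConfig/site-local-config.xml with a <local-stage-out><command value="..."/> element), which still needs to be confirmed against the sites we target:

import os
import xml.etree.ElementTree as ET


def localStageOutCommand():
    """Return the stage-out/deletion command configured in site-local-config.xml."""
    slcPath = os.path.join(os.environ.get("CMS_PATH", "/cvmfs/cms.cern.ch"),
                           "SITECONF", "local", "JobConfig", "site-local-config.xml")
    commandNode = ET.parse(slcPath).getroot().find(".//local-stage-out/command")
    return commandNode.get("value") if commandNode is not None else None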

Submitting from the microservice to a remote production schedd

Authentication mechanism between service and production schedd:

  • This can be done at the host level, by including the proper subnet as part of:
ALLOW_DAEMON
ALLOW_NEGOTIATOR
(see the example below)
  • Or possibly by using token authentication (see this HTCondor Week 2021 talk: https://agenda.hep.wisc.edu/event/1579/contributions/23053/attachments/7870/8965/BockelmanHTCondorWeek2021.mp4)
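For the host-based option, the schedd-side configuration could look roughly like the snippet below. The subnet is just a placeholder for wherever the CMSWEB pods actually live, and the exact security knobs would need to be agreed with the operators of the production schedds:

# Illustrative condor configuration on the production schedd (placeholder subnet)
ALLOW_DAEMON = $(ALLOW_DAEMON), 188.184.*
ALLOW_NEGOTIATOR = $(ALLOW_NEGOTIATOR), 188.184.*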

Python Submission example

The following example submits a job from an external machine to login-el7.uscms.org; the same approach can be used to submit to a production schedd:

  1. From the schedd, query the MyAddress classad:
$ condor_status -schedd -af MyAddress
<192.170.227.182:9618?addrs=192.170.227.182-9618&alias=login-el7.uscms.org&noUDP&sock=schedd_1934190_0a19>
  2. From the external client (the microservice), submit a job to the remote schedd:
#!/usr/bin/env python

# Requires: a valid VO CMS proxy certificate (in this example: /tmp/x509up_$(id -u))

import os
import htcondor
import classad

import logging


def submit(schedd, sub):
    """Submit a condor job to the given schedd.
    :param schedd: htcondor.Schedd object pointing at the (remote) schedd
    :param sub: htcondor.Submit object describing the job
    """
    try:
        with schedd.transaction() as txn:
            clusterid = sub.queue(txn)
    except Exception as e:
        logging.error("Error during submission: {0}".format(e))
        raise

    return clusterid


# Build a ClassAd pointing at the remote schedd, using the MyAddress value
# obtained in step 1
schedd_ad = classad.ClassAd()
schedd_ad["MyAddress"] = "<192.170.227.182:9618?addrs=192.170.227.182-9618&alias=login-el7.uscms.org&noUDP&sock=schedd_1934190_0a19>"
schedd = htcondor.Schedd(schedd_ad)

# Describe the job; note that all paths refer to the remote schedd's filesystem
sub = htcondor.Submit()
sub['executable'] = '/tmp/hello.sh'
sub['Output'] = '/tmp/result.$(Cluster)-$(Process).out'
sub['Error'] = '/tmp/result.$(Cluster)-$(Process).err'
sub['My.x509userproxy'] = classad.quote('/tmp/x509up_%s' % os.getuid())

clusterid = submit(schedd, sub)

Note that the remote submission assumes paths to be available on the remote schedd (e.g. /tmp/hello.sh is expected to exist on the remote schedd).
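For the post job processing point raised earlier, the same Python bindings can be used to poll the remote schedd about the outcome of a cleanup job. A minimal sketch is below; the cleanup job itself would still have to report which individual files could not be deleted, e.g. through its output/error files:

import htcondor


def jobOutcome(schedd, clusterid):
    """Return JobStatus/ExitCode ads for all procs of the given cluster."""
    constraint = "ClusterId == {0}".format(clusterid)
    projection = ["ClusterId", "ProcId", "JobStatus", "ExitCode"]
    ads = schedd.query(constraint, projection)
    if not ads:
        # the job already left the queue, so look it up in the schedd history
        ads = list(schedd.history(constraint, projection, 10))
    return ads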
