
Kubernetes job runner integration test and enhancements #6958

Merged
merged 31 commits into galaxyproject:dev from mvdbeek:kubernetes_integration_test on Feb 22, 2019

Conversation

@mvdbeek
Member

commented Nov 1, 2018

  • pykube is now a conditional requirement and will be installed if a kubernetes runner is set up
  • Submits ajs.job_file instead of tool_script.sh, which enables standard Galaxy features such as setting env variables per destination, reading stdout/stderr from files, and collecting job metrics
  • Fixes state changes
  • Allows selecting the shell to use for all job runners
  • Python 3 fixes for container resolver
  • Writes stdout and stderr to standard file locations
  • Passes GALAXY_SLOTS, GALAXY_MEMORY_MB, and GALAXY_MEMORY_MB_PER_SLOT to wrappers (see the sketch after this list)
  • Includes integration tests for running jobs, cancelling jobs, executing jobs in mulled containers, detecting exit codes, capturing stdout/stderr, and passing GALAXY_SLOTS and GALAXY_MEMORY_MB.
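
As a rough sketch of how a tool wrapper can consume the variables from the last two points (only the exported variable names come from the runner; everything else here is illustrative):

#!/bin/sh
# hypothetical wrapper snippet: the runner exports GALAXY_SLOTS and
# GALAXY_MEMORY_MB, so the command can size its thread count and memory
THREADS="${GALAXY_SLOTS:-1}"        # fall back to 1 if the runner set nothing
MEM_MB="${GALAXY_MEMORY_MB:-1024}"  # illustrative default only
echo "running with $THREADS threads and ${MEM_MB} MB of memory"
# a real tool invocation would look something like (tool name is made up):
# some_tool --threads "$THREADS" --max-memory-mb "$MEM_MB" input.fastq > output.txt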

@galaxybot galaxybot added the status/WIP label Nov 1, 2018

@mvdbeek mvdbeek changed the title from "WIP: Kubernetes job runner integration test and enhancements" to "Kubernetes job runner integration test and enhancements" Nov 4, 2018

@mvdbeek mvdbeek added status/review and removed status/WIP labels Nov 4, 2018

@galaxybot galaxybot added this to the 19.01 milestone Nov 4, 2018

@mvdbeek mvdbeek force-pushed the mvdbeek:kubernetes_integration_test branch from c1435ad to 0c2e3d3 Nov 4, 2018

@mvdbeek

Member Author

commented Nov 4, 2018

Alright, this is ready now. @pcm32 @ilveroluca, can you give this a try? I only have a toy Kubernetes cluster to play with.

@jmchilton

Member

commented Nov 5, 2018

It looks good to me; I was going to merge before I saw the comment above. If merging would make testing easier, let me know and I'll do it; this looks ready to go.

@mvdbeek

Member Author

commented Nov 5, 2018

Yeah, the next step is a set of reusable runner tests. I want to make sure all runners know about Galaxy-isms like GALAXY_SLOTS, stderr handling, etc. If bugs arise, I'm happy to fix them and in a good position to do so.

@pcm32

Member

commented Nov 5, 2018

I have set up some testing for Galaxy-Kubernetes on Travis here. We could add variations of this to the Galaxy tests if needed. It is still missing a job execution test (so it is currently more a test that Galaxy spins up, but it should be easy to extend to submit a job and wait for it). It does take a little while, though, in case you have alternatives to Travis.

@mvdbeek

Member Author

commented Nov 5, 2018

I think with this Travis setup you should be able to run the integration tests here, which run a couple of jobs that exercise important Galaxy features. In the best case you'd just have to run sh run_tests.sh -integration test/integration/test_kubernetes_runner.py

@pcm32
Member

left a comment

This is a great improvement, many thanks! I have some questions about how certain things are handled now, most notably the tool's logs (stdout and stderr, which you would normally retrieve through kubectl logs). I also found a typo, so I thought I'd make it a proper review; unfortunately I made most of my comments outside the review. My concerns are not blocking, of course (though I'd appreciate some clarification), but the typo at least should be addressed, I guess.

@mvdbeek mvdbeek force-pushed the mvdbeek:kubernetes_integration_test branch 2 times, most recently from 9f605ac to 676bdb5 Nov 5, 2018

@mvdbeek

Member Author

commented Nov 5, 2018

Oh, and if you want to test this using Docker for Mac, you have to set the temp directory to something other than /var/private, since that path can't be mounted by Docker. I'm just doing TMPDIR=~/temp sh run_tests.sh -integration test/integration/test_kubernetes_runner.py
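
For reference, the two invocations mentioned in this thread, collected in one place (~/temp is just an example location; any directory Docker for Mac is allowed to mount will work):

# run the Kubernetes runner integration tests
sh run_tests.sh -integration test/integration/test_kubernetes_runner.py

# on Docker for Mac, point TMPDIR at a directory Docker can mount
TMPDIR=~/temp sh run_tests.sh -integration test/integration/test_kubernetes_runner.py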

@pcm32

pcm32 approved these changes Nov 5, 2018

Member

left a comment

LGTM! Thanks again! I should be able to test this after Thursday if that is fine.

@mvdbeek

Member Author

commented Nov 5, 2018

Sure, I'm not in a hurry for this.

@pcm32

Member

commented Nov 6, 2018

On second thought, I think GALAXY_SLOTS should be equal to the floor of the CPU limit instead of its ceiling; otherwise we risk having the job killed by Kubernetes for exceeding the maximum amount of CPU allocated.

@pcm32

Member

commented Nov 6, 2018

We would of course need to handle the case where this leads to zero slots, which might produce weird behaviour in the tool.
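
A minimal sketch of the suggested floor-with-minimum-one calculation (the cpu_limit value and variable names are made up for illustration; Kubernetes CPU limits can be fractional):

# illustrative only: take the floor of the CPU limit, clamped to at least 1
cpu_limit="1.5"                                   # example destination limit
slots=$(awk -v c="$cpu_limit" 'BEGIN { s = int(c); if (s < 1) s = 1; print s }')
export GALAXY_SLOTS="$slots"                      # here: 1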

@mvdbeek

Member Author

commented Nov 6, 2018

Is that how Kubernetes handles this? I'd assume it would just throttle; otherwise I can't really see how fractions of a CPU would work.

@pcm32

Member

commented Nov 6, 2018

Yes, you're right: it throttles for CPU and kills for memory, so no change is needed there!

@mvdbeek mvdbeek force-pushed the mvdbeek:kubernetes_integration_test branch from aa0a448 to e52bb5d Feb 19, 2019

@mvdbeek mvdbeek referenced this pull request Feb 19, 2019

Merged

make job script actually return tool exit code #7147

Revert "job script returns tool exit code in any case"
This breaks exit code reporting and leads to pod re-scheduling.

This reverts commit 6beeec8.
@mvdbeek

Member Author

commented Feb 19, 2019

Actually, the single mount point makes it difficult to test tools with a wrapper living next to the tool ... :/

@mvdbeek

Member Author

commented Feb 19, 2019

Alright @pcm32, 7e32977 should fix your issue with GALAXY_VIRTUAL_ENV bleeding into the container. I didn't have any problem even without that commit, but that's because I only mount the test directory and the tool directory, so the virtualenv couldn't be activated. That's fixed now; I hope it works for you as well. Note that I changed the syntax for mounting PVCs in 65f33df, let me know what you think about that.

@mvdbeek mvdbeek force-pushed the mvdbeek:kubernetes_integration_test branch from 78e804f to 9bfc7b4 Feb 19, 2019

@mvdbeek mvdbeek added status/review and removed status/WIP labels Feb 19, 2019

mvdbeek added some commits Feb 19, 2019

Set galaxy_virtual_env to None for kubernetes runner
This prevents attempts to activate Galaxy's virtualenv
inside the container.
Drop GALAXY_VIRTUAL_ENV env hack
GALAXY_VIRTUAL_ENV is injected through job_wrapper.galaxy_virtual_env,
which does `os.environ.get(VIRTUAL_ENV, None)`, so this didn't work.
The previous commit addresses that.

@mvdbeek mvdbeek force-pushed the mvdbeek:kubernetes_integration_test branch from 9bfc7b4 to db387ab Feb 19, 2019

@pcm32

Member

commented Feb 19, 2019

Thanks, I'll try it as soon as I can. I have some concerns about the multiple mount points (I made a comment on that commit), but it's great that we can get this working!

@pcm32

Member

commented Feb 21, 2019

Unfortunately, after the last rebase the images don't manage to spin up Galaxy correctly; we must have picked up a bug in dev. The images build correctly (see docker-galaxy-stable/compose build-orchestration-images.sh), but they fail to start with this error:

galaxy.jobs DEBUG 2019-02-21 14:36:18,901 [p:1953,w:0,m:0] [MainThread] Done loading job configuration
Traceback (most recent call last):
  File "lib/galaxy/webapps/galaxy/buildapp.py", line 49, in app_factory
    app = galaxy.app.UniverseApplication(global_conf=global_conf, **kwargs)
  File "lib/galaxy/app.py", line 128, in __init__
    self.watchers = ConfigWatchers(self)
  File "lib/galaxy/webapps/galaxy/config_watchers.py", line 34, in __init__
    self.tool_data_watcher = get_tool_data_dir_watcher(self.app.tool_data_tables, config=self.app.config)
  File "lib/galaxy/tools/toolbox/watcher.py", line 57, in get_tool_data_dir_watcher
    observer_class = get_observer_class(config_value, default="False", monitor_what_str="tool-data directory")
  File "lib/galaxy/tools/toolbox/watcher.py", line 46, in get_observer_class
    raise Exception(message)
Exception: Watchdog library unavailable, cannot monitor tool-data directory.

so I cannot test this yet :-( . I will investigate what the problem is next week and try to fix it.

I'm happy to go with the additional mount points, but I will probably stay without them for a while, in a branch, until we move to 19.05, to avoid changing the Helm charts (I maintain a number of instances at different Galaxy versions and don't have the manpower to deal with that right now).

@mvdbeek

Member Author

commented Feb 21, 2019

This seems to be missing the watchdog library, so you can either run through common_startup.sh or disable watch_tool_data_dir (https://github.com/galaxyproject/galaxy/blob/dev/config/galaxy.yml.sample#L402). I don't think this has changed on the Galaxy side.
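
As a sketch, the two options look roughly like this (paths assume a standard Galaxy checkout; the galaxy.yml setting is shown as a comment because it is a config edit, not a command):

# option 1 (from the comment above): run Galaxy's standard setup script
sh ./common_startup.sh

# option 2: disable the tool-data directory watcher in config/galaxy.yml:
#   watch_tool_data_dir: false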

@pcm32

Member

commented Feb 21, 2019

Yes, strange, as I haven't changed the default setting either, so it (the watchdog) should be turned off. I tried a completely clean start with a fresh PV, but still the same.

@pcm32

Member

commented Feb 21, 2019

I had to go inside the running container and install watchdog... unfortunately the tool container still fails with the same error as before:

$ kubectl logs galaxy-mk-1809-int-test-2-dwxbm
python: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory

If I inspect the script file generated for the failed job, I see:

#!/bin/bash



_galaxy_setup_environment() {
    local _use_framework_galaxy="$1"
    _GALAXY_JOB_HOME_DIR="/export/galaxy-central/database/job_working_directory/000/2/home"
    _GALAXY_JOB_TMP_DIR=$([ ! -e '/export/galaxy-central/database/job_working_directory/000/2/tmp' ] || mv '/export/galaxy-central/database/job_working_directory/000/2/tmp' '/export/galaxy-central/database/job_working_directory/000/2'/tmp.$(date +%Y%m%d-%H%M%S) ; mkdir '/export/galaxy-central/database/job_working_directory/000/2/tmp'; echo '/export/galaxy-central/database/job_working_directory/000/2/tmp')
    
. "/export/venv/bin/activate"
    if [ "$GALAXY_LIB" != "None" -a "$_use_framework_galaxy" = "True" ]; then
        if [ -n "$PYTHONPATH" ]; then
            PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
        else
            PYTHONPATH="$GALAXY_LIB"
        fi
        export PYTHONPATH
    fi
    # These don't get cleaned on a re-run but may in the future.
    [ -z "$_GALAXY_JOB_TMP_DIR" -a ! -f "$_GALAXY_JOB_TMP_DIR" ] || mkdir -p "$_GALAXY_JOB_TMP_DIR"
    [ -z "$_GALAXY_JOB_HOME_DIR" -a ! -f "$_GALAXY_JOB_HOME_DIR" ] || mkdir -p "$_GALAXY_JOB_HOME_DIR"
    if [ "$GALAXY_VIRTUAL_ENV" != "None" -a -f "$GALAXY_VIRTUAL_ENV/bin/activate" \
         -a "`command -v python`" != "$GALAXY_VIRTUAL_ENV/bin/python" ]; then
        . "$GALAXY_VIRTUAL_ENV/bin/activate"
    fi
}


# The following block can be used by the job system
# to ensure this script is runnable before actually attempting
# to run it.
if [ -n "$ABC_TEST_JOB_SCRIPT_INTEGRITY_XYZ" ]; then
    exit 42
fi

export GALAXY_SLOTS_CONFIGURED="1"
if [ -n "$SLURM_CPUS_ON_NODE" ]; then
    # This should be valid on SLURM except in the case that srun is used to
    # submit additional job steps under an existing allocation, which we do not
    # currently do.
    GALAXY_SLOTS="$SLURM_CPUS_ON_NODE"
elif [ -n "$SLURM_NTASKS" ] || [ -n "$SLURM_CPUS_PER_TASK" ]; then
    # $SLURM_CPUS_ON_NODE should be set correctly on SLURM (even on old
    # installations), but keep the $SLURM_NTASKS logic as a backup since this
    # was the previous method under SLURM.
    #
    # Multiply these values since SLURM_NTASKS is total tasks over all nodes.
    # GALAXY_SLOTS maps to CPUS on a single node and shouldn't be used for
    # multi-node requests.
    GALAXY_SLOTS=`expr "${SLURM_NTASKS:-1}" \* "${SLURM_CPUS_PER_TASK:-1}"`
elif [ -n "$NSLOTS" ]; then
    GALAXY_SLOTS="$NSLOTS"
elif [ -n "$NCPUS" ]; then
    GALAXY_SLOTS="$NCPUS"
elif [ -n "$PBS_NCPUS" ]; then
    GALAXY_SLOTS="$PBS_NCPUS"
elif [ -f "$PBS_NODEFILE" ]; then
    GALAXY_SLOTS=`wc -l < $PBS_NODEFILE`
elif [ -n "$LSB_DJOB_NUMPROC" ]; then
    GALAXY_SLOTS="$LSB_DJOB_NUMPROC"
elif [ -n "$GALAXY_SLOTS" ]; then
    # kubernetes runner injects GALAXY_SLOTS into environment
    GALAXY_SLOTS=$GALAXY_SLOTS
else
    GALAXY_SLOTS="1"
    unset GALAXY_SLOTS_CONFIGURED
fi

export GALAXY_SLOTS
GALAXY_VIRTUAL_ENV="None"
_GALAXY_VIRTUAL_ENV="None"
PRESERVE_GALAXY_ENVIRONMENT="False"
GALAXY_LIB="/export/galaxy-central/lib"
_galaxy_setup_environment "$PRESERVE_GALAXY_ENVIRONMENT"

So that means that, while GALAXY_VIRTUAL_ENV is correctly set to None, there is still a direct call to . "/export/venv/bin/activate" that is not controlled by any environment variable, messing up the tool's Python environment. :-(

@mvdbeek

Member Author

commented Feb 21, 2019

I am not seeing this; any chance it is caused by a non-default setting in your job_conf.xml file?

@mvdbeek

Member Author

commented Feb 21, 2019

So this is what I see:

#!/bin/sh



_galaxy_setup_environment() {
    local _use_framework_galaxy="$1"
    _GALAXY_JOB_HOME_DIR="/Users/mvandenb/temp/tmpw_906fen/000/2/home"
    _GALAXY_JOB_TMP_DIR=$([ ! -e '/Users/mvandenb/temp/tmpw_906fen/000/2/tmp' ] || mv '/Users/mvandenb/temp/tmpw_906fen/000/2/tmp' '/Users/mvandenb/temp/tmpw_906fen/000/2'/tmp.$(date +%Y%m%d-%H%M%S) ; mkdir '/Users/mvandenb/temp/tmpw_906fen/000/2/tmp'; echo '/Users/mvandenb/temp/tmpw_906fen/000/2/tmp')
    
SOME_ENV_VAR="42"; export SOME_ENV_VAR
    if [ "$GALAXY_LIB" != "None" -a "$_use_framework_galaxy" = "True" ]; then
        if [ -n "$PYTHONPATH" ]; then
            PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
        else
            PYTHONPATH="$GALAXY_LIB"
        fi
        export PYTHONPATH
    fi
    # These don't get cleaned on a re-run but may in the future.
    [ -z "$_GALAXY_JOB_TMP_DIR" -a ! -f "$_GALAXY_JOB_TMP_DIR" ] || mkdir -p "$_GALAXY_JOB_TMP_DIR"
    [ -z "$_GALAXY_JOB_HOME_DIR" -a ! -f "$_GALAXY_JOB_HOME_DIR" ] || mkdir -p "$_GALAXY_JOB_HOME_DIR"
    if [ "$GALAXY_VIRTUAL_ENV" != "None" -a -f "$GALAXY_VIRTUAL_ENV/bin/activate" \
         -a "`command -v python`" != "$GALAXY_VIRTUAL_ENV/bin/python" ]; then
        . "$GALAXY_VIRTUAL_ENV/bin/activate"
    fi
}


# The following block can be used by the job system
# to ensure this script is runnable before actually attempting
# to run it.
if [ -n "$ABC_TEST_JOB_SCRIPT_INTEGRITY_XYZ" ]; then
    exit 42
fi

export GALAXY_SLOTS_CONFIGURED="1"
if [ -n "$SLURM_CPUS_ON_NODE" ]; then
    # This should be valid on SLURM except in the case that srun is used to
    # submit additional job steps under an existing allocation, which we do not
    # currently do.
    GALAXY_SLOTS="$SLURM_CPUS_ON_NODE"
elif [ -n "$SLURM_NTASKS" ] || [ -n "$SLURM_CPUS_PER_TASK" ]; then
    # $SLURM_CPUS_ON_NODE should be set correctly on SLURM (even on old
    # installations), but keep the $SLURM_NTASKS logic as a backup since this
    # was the previous method under SLURM.
    #
    # Multiply these values since SLURM_NTASKS is total tasks over all nodes.
    # GALAXY_SLOTS maps to CPUS on a single node and shouldn't be used for
    # multi-node requests.
    GALAXY_SLOTS=`expr "${SLURM_NTASKS:-1}" \* "${SLURM_CPUS_PER_TASK:-1}"`
elif [ -n "$NSLOTS" ]; then
    GALAXY_SLOTS="$NSLOTS"
elif [ -n "$NCPUS" ]; then
    GALAXY_SLOTS="$NCPUS"
elif [ -n "$PBS_NCPUS" ]; then
    GALAXY_SLOTS="$PBS_NCPUS"
elif [ -f "$PBS_NODEFILE" ]; then
    GALAXY_SLOTS=`wc -l < $PBS_NODEFILE`
elif [ -n "$LSB_DJOB_NUMPROC" ]; then
    GALAXY_SLOTS="$LSB_DJOB_NUMPROC"
elif [ -n "$GALAXY_SLOTS" ]; then
    # kubernetes runner injects GALAXY_SLOTS into environment
    GALAXY_SLOTS=$GALAXY_SLOTS
else
    GALAXY_SLOTS="1"
    unset GALAXY_SLOTS_CONFIGURED
fi

export GALAXY_SLOTS
GALAXY_VIRTUAL_ENV="None"
_GALAXY_VIRTUAL_ENV="None"
PRESERVE_GALAXY_ENVIRONMENT="False"
GALAXY_LIB="/Users/mvandenb/src/galaxy/lib"
_galaxy_setup_environment "$PRESERVE_GALAXY_ENVIRONMENT"
GALAXY_PYTHON=`command -v python`
cd /Users/mvandenb/temp/tmpw_906fen/000/2
if [ -n "$SLURM_JOB_ID" ]; then
    GALAXY_MEMORY_MB=`scontrol -do show job "$SLURM_JOB_ID" | sed 's/.*\( \|^\)Mem=\([0-9][0-9]*\)\( \|$\).*/\2/p;d'` 2>memory_statement.log
fi
if [ -n "$SGE_HGR_h_vmem" ]; then
    GALAXY_MEMORY_MB_PER_SLOT=`echo "$SGE_HGR_h_vmem" | sed 's/G$/ * 1024/' | bc | cut -d"." -f1` 2>memory_statement.log
fi

if [ -z "$GALAXY_MEMORY_MB_PER_SLOT" -a -n "$GALAXY_MEMORY_MB" ]; then
    GALAXY_MEMORY_MB_PER_SLOT=$(($GALAXY_MEMORY_MB / $GALAXY_SLOTS))
elif [ -z "$GALAXY_MEMORY_MB" -a -n "$GALAXY_MEMORY_MB_PER_SLOT" ]; then
    GALAXY_MEMORY_MB=$(($GALAXY_MEMORY_MB_PER_SLOT * $GALAXY_SLOTS))
fi
[ "${GALAXY_MEMORY_MB--1}" -gt 0 ] 2>>memory_statement.log && export GALAXY_MEMORY_MB || unset GALAXY_MEMORY_MB
[ "${GALAXY_MEMORY_MB_PER_SLOT--1}" -gt 0 ] 2>>memory_statement.log && export GALAXY_MEMORY_MB_PER_SLOT || unset GALAXY_MEMORY_MB_PER_SLOT

echo "$GALAXY_SLOTS" > '/Users/mvandenb/temp/tmpw_906fen/000/2/__instrument_core_galaxy_slots' 
echo "$GALAXY_MEMORY_MB" > '/Users/mvandenb/temp/tmpw_906fen/000/2/__instrument_core_galaxy_memory_mb' 
date +"%s" > /Users/mvandenb/temp/tmpw_906fen/000/2/__instrument_core_epoch_start
out="${TMPDIR:-/tmp}/out.$$" err="${TMPDIR:-/tmp}/err.$$"
mkfifo "$out" "$err"
trap 'rm "$out" "$err"' EXIT
tee -a stdout.log < "$out" &
tee -a stderr.log < "$err" >&2 & rm -rf working; mkdir -p working; cd working; /Users/mvandenb/temp/tmpw_906fen/000/2/tool_script.sh > '/Users/mvandenb/temp/tmpw_906fen/000/2/galaxy_2.o' 2> '/Users/mvandenb/temp/tmpw_906fen/000/2/galaxy_2.e'; return_code=$?; sh -c "exit $return_code"
echo $? > /Users/mvandenb/temp/tmpw_906fen/000/2/galaxy_2.ec
date +"%s" > /Users/mvandenb/temp/tmpw_906fen/000/2/__instrument_core_epoch_end
@pcm32

Member

commented Feb 21, 2019

You rock @mvdbeek! It must be this line:
https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/blob/develop/config/job_conf.xml#L39

I'll rebuild removing that and try again as soon as I can.

@pcm32

Member

commented Feb 22, 2019

Yes, after that it works. Thanks @mvdbeek for all the follow-ups and for patiently going through all my concerns. This PR is fine by me.

@jmchilton jmchilton merged commit c53fb62 into galaxyproject:dev Feb 22, 2019

5 of 6 checks passed

selenium test: Build finished. 151 tests run, 3 skipped, 1 failed.
api test: Build finished. 453 tests run, 1 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed.
framework test: Build finished. 197 tests run, 0 skipped, 0 failed.
integration test: Build finished. 294 tests run, 23 skipped, 0 failed.
toolshed test: Build finished. 578 tests run, 0 skipped, 0 failed.

Remote Execution of Workflows/Jobs automation moved this from In progress to Done Feb 22, 2019
