Retire launch script #62

XkunW · 2025-03-12T19:47:04Z

PR Type

[Feature | Fix]

Short Description

Retired the launch_server script; the launch command now sets the environment variables and runs the sbatch command directly
Updated the default behaviour for launching a model that is absent from the config: if the model weights exist, only a warning message is shown instead of raising an exception
Replaced all os.path methods with pathlib
Moved singularity containers to model-weights and added a cache config
Added a JSON file to keep track of the launch params and server address for each job
Updated logging file structure
Updated CLI integration tests to accommodate the changes; split the launch_command_model_not_found test into two
Minor updates to various variable names
Fixed Qwen2.5-Math model config issues
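
For readers who have not looked at the diff, the first and fifth items above can be illustrated with a minimal sketch. This is not the code from this PR: `build_env`, `launch`, the `LAUNCH_` variable prefix, the `launch.json` file name, and the record keys are all hypothetical, and the injectable `run` argument exists only so the sketch can be exercised without a Slurm cluster. It assumes `sbatch` prints its usual `Submitted batch job <id>` line.

```python
import json
import os
import re
import subprocess
from pathlib import Path


def build_env(params: dict) -> dict:
    """Copy the current environment and add one variable per launch param."""
    env = os.environ.copy()
    for key, value in params.items():
        # Hypothetical naming scheme, e.g. model_name -> LAUNCH_MODEL_NAME.
        env[f"LAUNCH_{key.upper()}"] = str(value)
    return env


def launch(params: dict, slurm_script, log_dir, run=subprocess.run) -> Path:
    """Submit the job with sbatch and record the launch params as JSON."""
    result = run(
        ["sbatch", str(slurm_script)],
        env=build_env(params),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"sbatch failed: {result.stderr.strip()}")

    # By default sbatch reports "Submitted batch job <id>" on stdout.
    match = re.search(r"Submitted batch job (\d+)", result.stdout)
    if match is None:
        raise RuntimeError(f"Unexpected sbatch output: {result.stdout!r}")
    job_id = match.group(1)

    # Hypothetical layout: one directory per job, holding a JSON record.
    job_dir = Path(log_dir) / f"{params['model_name']}.{job_id}"
    job_dir.mkdir(parents=True, exist_ok=True)
    json_path = job_dir / "launch.json"

    # The server address is unknown at submit time; the running job would
    # fill it in later (assumption made for this sketch).
    record = {"slurm_job_id": job_id, **params, "server_address": None}
    json_path.write_text(json.dumps(record, indent=2))
    return json_path
```

Passing the parameters through `env=` rather than through a wrapper shell script is the point of the sketch: the Python side owns both the environment and the submission, so no separate launch script is needed.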

Tests Added

test_launch_command_model_not_in_config_with_weights
test_launch_command_model_not_found
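
The behaviour these two tests distinguish can be sketched as follows. This is an illustrative assumption, not the implementation in this PR: `resolve_model_params`, its arguments, the empty-dict return value, and the message wording are hypothetical.

```python
import warnings
from pathlib import Path


def resolve_model_params(model_name: str, config: dict, weights_root) -> dict:
    """Return launch params for a model, warning or raising if it is unknown."""
    if model_name in config:
        return dict(config[model_name])

    # Not in the config: fall back to checking for weights on disk.
    weights_dir = Path(weights_root) / model_name
    if weights_dir.is_dir():
        # Weights exist, so warn and let the caller supply the params
        # (returning an empty dict is an assumption of this sketch).
        warnings.warn(
            f"'{model_name}' is not in the config; launch params must be given explicitly",
            UserWarning,
            stacklevel=2,
        )
        return {}

    # Neither config entry nor weights: nothing can be launched.
    raise ValueError(f"'{model_name}' not found in the config or in {weights_root}")
```

Under these assumptions, the first test corresponds to the warning branch and the second to the `ValueError` branch.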

amrit110

Looks great. I assume it has been tested on the cluster too, I'm going to do that as well. Great refactor!

XkunW and others added 29 commits February 28, 2025 16:59

Rename num_gpus to gpus_per_node to avoid confusion

72618f2

Move launch server logic to LaunchHelper, display a warning message instead of raising an exception if the model weights exist but missing relevant config

6e6617f

Point singularity path to new location, dump server addr to server json file, update env var name

86e7026

Update os.path methods to pathlib, add second returned var for run bash command, update CACHED CONFIG location

3ad0ea1

Add json handling for read slurm log, update get_base_url to read from launch json file

c2f17c9

Remove SERVER_ADDRESS_SIGNATURE

057d683

Update launch logic to match launch helper, add error check for running bash command

38d4245

ruff format

b9e166f

Typing fixes

4568019

Change the default value for log dir, update num_gpus to gpus_per_node

95be110

Remove redundant cuda loading commands (legacy from venv), fix server addr writing, update binding to the specific model weights dir

54dda51

Replace NUM_GPUS with SLURM_GPUS_PER_NODE

43324ba

Replace NUM_NODES with SLURM_JOB_NUM_NODES

3e0393c

Various small fixes for launch command, removed unnecessary env vars, updated logging structure

3fc6120

Update log file path to match changes in _helper

0fa5d20

Update server address key for launch json

3d6a64e

Update err messages, update util tests to match changes in _utils

c6c08f6

Move convert bool val to utils

8aa76cf

Fix max_model_len for Qwen2.5-Math models, added Qwen2.5-Math-PRM-7B model

65258c9

Rename cached config

0bae6fd

Update warning msg, remove model name exclusion from params, update private method

3f32196

Fix & update launch command integration tests

5193157

ruff format

0e4fc09

Add helper for list command

4dda406

Retire launch script

0dbf73c

[pre-commit.ci] Add auto fixes from pre-commit.com hooks

8005f95

for more information, see https://pre-commit.ci

Merge branch 'retire_launch_script' of https://github.com/VectorInstitute/vector-inference into retire_launch_script

aa7325b

Fix mypy and ruff errors for test_cli.py

da39919

mypy fixes

4cb0552

jwilles self-assigned this Mar 12, 2025

jwilles requested review from amrit110, fcogidi and jwilles March 12, 2025 21:04

jwilles removed their assignment Mar 12, 2025

amrit110 approved these changes Mar 13, 2025

XkunW merged commit 4a9b41c into develop Mar 13, 2025
4 checks passed

XkunW deleted the retire_launch_script branch March 13, 2025 15:07
