feat: Add a dockerfile argument to enable aimstack #261

dushyantbehl · 2024-07-23T07:22:39Z

Description of the change

Add a dockerfile argument to enable aimstack

Related issue number

NA

How to verify the PR

Build the image from the dockerfile with and without the ENABLE_AIM argument, then run each image.

Was the PR tested

Tested the PR in both scenarios: ENABLE_AIM not set and ENABLE_AIM=true

Tested without ENABLE_AIM argument

[dushyantbehl@experimentsplatform fms-hf-tuning]$ docker build -t fms-hf-tuning:dev . -f build/Dockerfile
[+] Building 562.1s (30/30) FINISHED                                                                                                                                                                                                           docker:default
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                     0.1s
 => => transferring dockerfile: 6.51kB                                                                                                                                                                                                                   0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 24)                                                                                                                                                                          0.1s
 => [internal] load metadata for registry.access.redhat.com/ubi9/ubi:latest                                                                                                                                                                              1.3s
 => [internal] load .dockerignore                                                                                                                                                                                                                        0.0s
 => => transferring context: 2B                                                                                                                                                                                                                          0.0s
 => CACHED [internal] settings cache mount permissions                                                                                                                                                                                                   0.0s
 => [base 1/3] FROM registry.access.redhat.com/ubi9/ubi:latest@sha256:1ee4d8c50d14d9c9e9229d9a039d793fcbc9aa803806d194c957a397cf1d2b17                                                                                                                  13.8s
 => => resolve registry.access.redhat.com/ubi9/ubi:latest@sha256:1ee4d8c50d14d9c9e9229d9a039d793fcbc9aa803806d194c957a397cf1d2b17                                                                                                                        0.0s
 => => sha256:1ee4d8c50d14d9c9e9229d9a039d793fcbc9aa803806d194c957a397cf1d2b17 1.47kB / 1.47kB                                                                                                                                                           0.0s
 => => sha256:763f30167f92ec2af02bf7f09e75529de66e98f05373b88bef3c631cdcc39ad8 429B / 429B                                                                                                                                                               0.0s
 => => sha256:159a1e67312ef50059357047ebe2a365afea904504fca9561abb385ecd942d62 6.43kB / 6.43kB                                                                                                                                                           0.0s
 => => sha256:cc296d75b61273dcb0db7527435a4c3bd03f7723d89a94d446d3d52849970460 79.43MB / 79.43MB                                                                                                                                                         1.5s
 => => extracting sha256:cc296d75b61273dcb0db7527435a4c3bd03f7723d89a94d446d3d52849970460                                                                                                                                                               11.9s
 => [internal] load build context                                                                                                                                                                                                                        0.4s
 => => transferring context: 1.18MB                                                                                                                                                                                                                      0.4s
 => [base 2/3] RUN dnf remove -y --disableplugin=subscription-manager         subscription-manager     && dnf install -y python3.11 procps     && ln -s /usr/bin/python3.11 /bin/python     && python -m ensurepip --upgrade     && python -m pip inst  16.4s
 => [base 3/3] RUN useradd -u 1000 tuning -m -g 0 --system &&     chmod g+rx /home/tuning                                                                                                                                                                0.6s 
 => [release-base 1/1] RUN rpm -e $(dnf repoquery python3-* -q --installed) dnf python3 yum crypto-policies-scripts                                                                                                                                      2.1s 
 => [cuda-base 1/1] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-cudart-12-1-12.1.55-1         cuda-compat-12-1-530.30.02-1     &&  9.3s 
 => [release  1/10] RUN mkdir -p /licenses                                                                                                                                                                                                               0.5s 
 => [release  2/10] COPY LICENSE /licenses/                                                                                                                                                                                                              0.1s 
 => [release  3/10] RUN mkdir /app &&     chown -R tuning:0 /app /tmp &&     chmod -R g+rwX /app /tmp                                                                                                                                                    0.5s 
 => [release  4/10] RUN if [[ "${ENABLE_AIM}" == "true" ]] ; then         touch /.aim_profile &&         chmod -R 777 /.aim_profile;     fi                                                                                                              0.5s 
 => [release  5/10] RUN mkdir /.cache &&     chmod -R 777 /.cache                                                                                                                                                                                        0.7s
 => [release  6/10] COPY build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/                                                                                                                                                         0.2s
 => [release  7/10] COPY build/utils.py /app/build/                                                                                                                                                                                                      0.1s
 => [release  8/10] RUN chmod +x /app/accelerate_launch.py                                                                                                                                                                                               0.5s
 => [release  9/10] WORKDIR /app                                                                                                                                                                                                                         0.1s
 => [cuda-devel 1/1] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-command-line-tools-12-1-12.1.0-1         cuda-libraries-devel-  197.8s
 => [python-installations 1/8] RUN dnf install -y git &&     rm -f /usr/share/doc/perl-Net-SSLeay/examples/server_key.pem &&     dnf clean all                                                                                                          11.7s 
 => [python-installations 2/8] WORKDIR /tmp                                                                                                                                                                                                              0.1s 
 => [python-installations 3/8] RUN --mount=type=cache,target=/home/tuning/.cache/pip,uid=1000     python -m pip install --user build                                                                                                                     1.5s 
 => [python-installations 4/8] COPY --chown=tuning:root tuning tuning                                                                                                                                                                                    0.1s 
 => [python-installations 5/8] COPY .git .git                                                                                                                                                                                                            0.1s 
 => [python-installations 6/8] COPY pyproject.toml pyproject.toml                                                                                                                                                                                        0.1s 
 => [python-installations 7/8] RUN if [[ -z "" ]];     then python -m build --wheel --outdir /tmp;     else pip download fms-hf-tuning== --dest /tmp --only-binary=:all: --no-deps;     fi &&     ls /tmp/*.whl >/tmp/bdist_name                         3.8s 
 => [python-installations 8/8] RUN --mount=type=cache,target=/home/tuning/.cache/pip,uid=1000     python -m pip install --user wheel &&     python -m pip install --user "$(head bdist_name)" &&     python -m pip install --user "$(head bdist_name)  154.4s 
 => [release 10/10] COPY --from=python-installations /home/tuning/.local /home/tuning/.local                                                                                                                                                            39.8s 
 => exporting to image                                                                                                                                                                                                                                  85.9s 
 => => exporting layers                                                                                                                                                                                                                                 85.8s 
 => => writing image sha256:91d6aeae2b7391243cff36eee99fa61b81cb478f61726591bcfcf8343a308737                                                                                                                                                             0.0s 
 => => naming to docker.io/library/fms-hf-tuning:dev                                                                                                                                                                                                     0.0s 
                                                                                                                                                                                                                                                              
 9 warnings found (use --debug to expand):
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 24)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 46)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 77)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 101)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 144)
 - UndefinedVar: Usage of undefined variable '$CUDA_HOME' (line 72)
 - UndefinedVar: Usage of undefined variable '$CUDA_HOME' (line 72)
 - UndefinedVar: Usage of undefined variable '$LD_LIBRARY_PATH' (line 72)

Verification with pip freeze

[dushyantbehl@experimentsplatform fms-hf-tuning]$ docker run -it fms-hf-tuning:dev /bin/bash
[tuning@0c6af3eccb7f app]$ pip freeze | grep aim
[tuning@0c6af3eccb7f app]$ exit
exit

Tested With ENABLE_AIM=true build arg

[dushyantbehl@experimentsplatform fms-hf-tuning]$ docker build -t fms-hf-tuning-aim:dev . -f build/Dockerfile --build-arg ENABLE_AIM=true
[+] Building 312.5s (30/30) FINISHED                                                                                                                                                                                                           docker:default
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                     0.0s
 => => transferring dockerfile: 6.51kB                                                                                                                                                                                                                   0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 24)                                                                                                                                                                          0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 46)                                                                                                                                                                          0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)                                                                                                                                                                          0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 77)                                                                                                                                                                          0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 101)                                                                                                                                                                         0.0s
 => WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 144)                                                                                                                                                                         0.0s
 => [internal] load metadata for registry.access.redhat.com/ubi9/ubi:latest                                                                                                                                                                              0.3s
 => [internal] load .dockerignore                                                                                                                                                                                                                        0.0s
 => => transferring context: 2B                                                                                                                                                                                                                          0.0s
 => CACHED [internal] settings cache mount permissions                                                                                                                                                                                                   0.0s
 => [internal] load build context                                                                                                                                                                                                                        0.0s
 => => transferring context: 11.14kB                                                                                                                                                                                                                     0.0s
 => [base 1/3] FROM registry.access.redhat.com/ubi9/ubi:latest@sha256:1ee4d8c50d14d9c9e9229d9a039d793fcbc9aa803806d194c957a397cf1d2b17                                                                                                                   0.0s
 => CACHED [base 2/3] RUN dnf remove -y --disableplugin=subscription-manager         subscription-manager     && dnf install -y python3.11 procps     && ln -s /usr/bin/python3.11 /bin/python     && python -m ensurepip --upgrade     && python -m pi  0.0s
 => CACHED [base 3/3] RUN useradd -u 1000 tuning -m -g 0 --system &&     chmod g+rx /home/tuning                                                                                                                                                         0.0s
 => CACHED [cuda-base 1/1] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-cudart-12-1-12.1.55-1         cuda-compat-12-1-530.30.02-1  0.0s
 => CACHED [cuda-devel 1/1] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-command-line-tools-12-1-12.1.0-1         cuda-libraries-d  0.0s
 => [python-installations 1/8] RUN dnf install -y git &&     rm -f /usr/share/doc/perl-Net-SSLeay/examples/server_key.pem &&     dnf clean all                                                                                                          11.2s
 => [python-installations 2/8] WORKDIR /tmp                                                                                                                                                                                                              0.2s
 => [python-installations 3/8] RUN --mount=type=cache,target=/home/tuning/.cache/pip,uid=1000     python -m pip install --user build                                                                                                                     1.2s 
 => [python-installations 4/8] COPY --chown=tuning:root tuning tuning                                                                                                                                                                                    0.1s 
 => [python-installations 5/8] COPY .git .git                                                                                                                                                                                                            0.1s 
 => [python-installations 6/8] COPY pyproject.toml pyproject.toml                                                                                                                                                                                        0.1s 
 => [python-installations 7/8] RUN if [[ -z "" ]];     then python -m build --wheel --outdir /tmp;     else pip download fms-hf-tuning== --dest /tmp --only-binary=:all: --no-deps;     fi &&     ls /tmp/*.whl >/tmp/bdist_name                         3.8s 
 => [python-installations 8/8] RUN --mount=type=cache,target=/home/tuning/.cache/pip,uid=1000     python -m pip install --user wheel &&     python -m pip install --user "$(head bdist_name)" &&     python -m pip install --user "$(head bdist_name)  136.9s 
 => CACHED [release-base 1/1] RUN rpm -e $(dnf repoquery python3-* -q --installed) dnf python3 yum crypto-policies-scripts                                                                                                                               0.0s 
 => CACHED [release  1/10] RUN mkdir -p /licenses                                                                                                                                                                                                        0.0s 
 => CACHED [release  2/10] COPY LICENSE /licenses/                                                                                                                                                                                                       0.0s 
 => CACHED [release  3/10] RUN mkdir /app &&     chown -R tuning:0 /app /tmp &&     chmod -R g+rwX /app /tmp                                                                                                                                             0.0s 
 => CACHED [release  4/10] RUN if [[ "${ENABLE_AIM}" == "true" ]] ; then         touch /.aim_profile &&         chmod -R 777 /.aim_profile;     fi                                                                                                       0.0s 
 => CACHED [release  5/10] RUN mkdir /.cache &&     chmod -R 777 /.cache                                                                                                                                                                                 0.0s 
 => CACHED [release  6/10] COPY build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/                                                                                                                                                  0.0s
 => CACHED [release  7/10] COPY build/utils.py /app/build/                                                                                                                                                                                               0.0s
 => CACHED [release  8/10] RUN chmod +x /app/accelerate_launch.py                                                                                                                                                                                        0.0s
 => CACHED [release  9/10] WORKDIR /app                                                                                                                                                                                                                  0.0s
 => [release 10/10] COPY --from=python-installations /home/tuning/.local /home/tuning/.local                                                                                                                                                            42.9s
 => exporting to image                                                                                                                                                                                                                                  90.4s
 => => exporting layers                                                                                                                                                                                                                                 90.4s
 => => writing image sha256:40e46b9deee9bf1b7cf9e7feb093e57e0b7105f81589ba4cadb5f509c0ab24b0                                                                                                                                                             0.0s
 => => naming to docker.io/library/fms-hf-tuning-aim:dev                                                                                                                                                                                                 0.0s

 9 warnings found (use --debug to expand):
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 24)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 46)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 77)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 101)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 144)
 - UndefinedVar: Usage of undefined variable '$CUDA_HOME' (line 72)
 - UndefinedVar: Usage of undefined variable '$CUDA_HOME' (line 72)
 - UndefinedVar: Usage of undefined variable '$LD_LIBRARY_PATH' (line 72)

Verification with pip freeze

[dushyantbehl@experimentsplatform fms-hf-tuning]$ docker run -it fms-hf-tuning-aim:dev /bin/bash
[tuning@737ad308731f app]$ pip freeze | grep aim
aim==3.23.0
aim-ui==3.23.0
aimrecords==0.0.7
aimrocks==0.5.2
[tuning@737ad308731f app]$ exit
exit
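
The manual `pip freeze | grep aim` check above could be scripted roughly as follows. This is only a sketch: `has_aim` is a hypothetical helper, not part of the repository, and it assumes the freeze listing arrives on stdin. Anchoring the pattern to `aim==` keeps related packages such as `aim-ui` from satisfying the check on their own.

```shell
# Hypothetical helper (not part of the repo): reads `pip freeze` output on
# stdin and succeeds only if a line for the aim package itself is present.
has_aim() {
    grep -q '^aim=='
}

# Sample listing copied from the ENABLE_AIM=true image above.
printf 'aim==3.23.0\naim-ui==3.23.0\naimrecords==0.0.7\naimrocks==0.5.2\n' \
    | has_aim && echo "aim installed"

# An empty listing mirrors the image built without ENABLE_AIM.
printf '' | has_aim || echo "aim not installed"
```

Against a real image, the listing would come from running `pip freeze` inside the container, as in the session above.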

dushyantbehl · 2024-07-23T07:23:31Z

@anhuong @Ssukriti @ashokponkumar

This change is in line with our request to have the dockerfile contain an argument to enable Aim.

ashokponkumar · 2024-07-23T07:29:14Z

Why are we not using the optional dependency to install aim?

anhuong · 2024-07-23T16:13:38Z

build/Dockerfile

+ARG ENABLE_AIM
+
+# Need a way to keep this aim version in sync with pyproject.toml
+RUN if [ "$ENABLE_AIM" ] ; then \


Also, this formatting is incorrect, which is why you see the image build failure. We could expect ENABLE_AIM to be set to the string true by doing

RUN if [[ "$ENABLE_AIM" == "true" ]]; then \

Or we could treat ENABLE_AIM as enabled whenever it is set to any value

RUN if [[ -n "$ENABLE_AIM" ]]; then \
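
The two options differ in how they treat a value such as ENABLE_AIM=false. A minimal sketch of that difference follows, using hypothetical helper functions that are not part of the Dockerfile. POSIX `[ ... ]` is used so the snippet runs under any sh; for these inputs the bash `[[ ... ]]` forms should behave the same way.

```shell
# Hypothetical helpers: each mirrors one way the Dockerfile could test
# the ENABLE_AIM build argument.

# Enabled only when the value is exactly the string "true".
aim_enabled_exact() {
    [ "$1" = "true" ]
}

# Enabled when the value is any non-empty string (including "false").
aim_enabled_nonempty() {
    [ -n "$1" ]
}

# Compare both checks for an unset/empty value, "true", and "false".
for value in "" "true" "false"; do
    if aim_enabled_exact "$value"; then exact=yes; else exact=no; fi
    if aim_enabled_nonempty "$value"; then nonempty=yes; else nonempty=no; fi
    printf 'ENABLE_AIM=%s exact=%s nonempty=%s\n' "'$value'" "$exact" "$nonempty"
done
```

The exact-match check is the stricter of the two: passing `--build-arg ENABLE_AIM=false` leaves aim disabled, whereas the non-empty check would enable it.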

Hey @anhuong, sure. I noticed the bug but was busy for the last two days, so I did not get to it. It is fixed now.

Currently guarded by a dockerfile argument. Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

dushyantbehl · 2024-07-25T11:18:39Z

@anhuong @ashokponkumar changed the way aim is being installed and updated the arg check. Please let me know if this is okay to merge.

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

dushyantbehl · 2024-07-25T16:04:41Z

@anhuong @ashokponkumar please see the updated issue description for verification

ashokponkumar · 2024-07-29T12:17:08Z

@anhuong Can you review the PR and merge? Also, let us know the outlook for getting the images with aim and image from main branch in the image repo.

anhuong · 2024-07-29T17:22:28Z

Thanks @dushyantbehl, the changes look good, and I appreciate the notes on testing the image build in both scenarios. Did you verify that running tuning on both images works as expected as well?

dushyantbehl · 2024-07-29T17:37:00Z

We have not, but we can certainly do that @anhuong

HarikrishnanBalagopal · 2024-08-01T09:46:42Z

@dushyantbehl Tested with --build-arg ENABLE_AIM=true

Commit 00c33b6d04781e8cf063df376269c4622e979afc

HarikrishnanBalagopal · 2024-08-01T09:55:38Z

Also tested the same commit 00c33b6d04781e8cf063df376269c4622e979afc without AIM (the image was built without providing the ENABLE_AIM build arg).

It gives an error as expected since the AIM package is not installed.

ValueError: Requested tracker aim is not installed. Please install before proceeding

Full error logs:


running accelerate launch...
The following values were not passed to `accelerate launch` and had defaults used instead:
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Requested tracker aim is not installed. Please install before proceeding
Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 529, in main
    train(
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 152, in train
    t = get_tracker(name, tracker_configs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/trackers/tracker_factory.py", line 146, in get_tracker
    raise ValueError(e)
ValueError: Requested tracker aim is not installed. Please install before proceeding

Requested tracker aim is not installed. Please install before proceeding
Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 529, in main
    train(
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 152, in train
    t = get_tracker(name, tracker_configs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/trackers/tracker_factory.py", line 146, in get_tracker
    raise ValueError(e)
ValueError: Requested tracker aim is not installed. Please install before proceeding

[rank0]:[W801 09:53:44.881254409 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0801 09:53:45.066000 140372392019776 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 73 closing signal SIGTERM
E0801 09:53:45.280000 140372392019776 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 72) of binary: /bin/python
Traceback (most recent call last):
  File "/home/tuning/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tuning.sft_trainer FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-01_09:53:45
  host      : p-tests-exp7-test-pr-without-aim-will-error-master-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 72)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

dushyantbehl · 2024-08-01T09:59:27Z

Thanks a lot @HarikrishnanBalagopal

@anhuong both scenarios have been tested so can we proceed with merge?

ashokponkumar · 2024-08-01T11:29:47Z

@anhuong can we please merge this PR and make the images available?

anhuong

LGTM, thank you for testing building and running the image

…stack#261)

* Add a dockerfile argument at the end of final layer to enable aimstack. Currently guarded by a dockerfile argument.

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Set the default value of ENABLE_AIM to false

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

---------

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Set default value of target_modules to be None in LoraConfig Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Removal of transformers logger and addition of python logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT and lint check: Removal of transformers logger and addition of python logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix: remove lm_head for granite with llama arch models (#258) * initial code for deleting lm_head Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix logic for copying checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix check that embed_tokens and lm_head weights are the same Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix warning assertion Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix lm_head check, remove test Signed-off-by: Anh-Uong <anh.uong@ibm.com> * small fixes from code review Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fmt Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Anh-Uong <anh.uong@ibm.com> Co-authored-by: Anh-Uong <anh.uong@ibm.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add config_utils tests Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Fix fmt Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Separate tests out and use docstrings Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Update more field/value checks from HF defaults Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Fix: Addition of env var TRANSFORMERS_VERBOSITY check Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT Fix: Addition of env var TRANSFORMERS_VERBOSITY check Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add test for tokenizer in lora config (should be ignored) Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Adding logging support to accelerate launch Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT_FIX: Adding logging support to accelerate launch Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * bug: On save event added to callback (#256) * feat: 
On save event added to callback Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed additional bracket Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed additional bracket Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Format issues resolved Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase with upstream and add new line Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: All metric handling changes (#263) * feat: All metric handling changes Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Format issues Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Configuration to set logging level for trigger log (#241) * feat: Added the triggered login in the operation Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Formatting issues Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Added default config Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Moved the variable to right scope Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Checked added to validate config log level Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed some unwanted log file Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * limit peft deps until investigate (#274) Signed-off-by: Anh-Uong <anh.uong@ibm.com> * Data custom collator (#260) * refactor code to preprocess datasets Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix formatting Co-authored-by: Alex-Brooks 
<Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * allow input/output in validate args Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * format input/output JSON and mask Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * function to return suitable collator Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add tests for SFT Trainer input/output format Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * remove unused functions Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add eos token to input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix tests Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * improve docstrings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * keeping JSON keys constant Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * support for input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting fixes Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * update rEADME formats Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting README Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> * Revert "limit peft deps until investigate (#274)" (#275) This reverts commit f57ff63. 
Signed-off-by: Anh-Uong <anh.uong@ibm.com> * feat: per process state metric (#239) Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com> * Modify test to pass with target_modules: None Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Logging changes and unit tests added Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Add a dockerfile argument to enable aimstack (#261) * Add a dockerfile argument at the end of final layer to enable aimstack. Currenlty guarded by a dockerfile argument. Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Set the default value of ENABLE_AIM to false Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> --------- Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Solved conflict with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT:Fix Solved conflict with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * enabling tests for prompt tuning Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Support pretokenized (#272) * feat: support pretokenized datasets Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * consolidate collator code Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add valuerrors for incorrect args Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and 
format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> * Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212) Updates the requirements on [packaging](https://github.com/pypa/packaging) to permit the latest version. - [Release notes](https://github.com/pypa/packaging/releases) - [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst) - [Commits](pypa/packaging@23.2...24.1) --- updated-dependencies: - dependency-name: packaging dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * enabling tests for prompt tuning (#278) Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * fix: do not add special tokens for custom tokenizer (#279) Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * PR changes for changing logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix: bug where the logger was not being used properly (#286) Signed-off-by: Hari <harikrishmenon@gmail.com> * Unit Tests changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add functionality to free disk space from Github Actions (#287) * Add functionality to free disk space from Github Actions Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add functionality to free disk space from Github Actions, relocate from build-and-publish.yaml to image.yaml Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Move freeing space step to before building image 
Signed-off-by: Will Johnson <mwjohnson728@gmail.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * commented os.environ[LOG_LEVEL] in accelerate.py for testing Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FIX:FMT Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add unit test to verify target_modules defaults correctly (#281) * Add unit test to verify target_modules defaults correctly Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add sft_trainer.main test to ensure target modules properly default for LoRA when set to None from CLI Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * fmt Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Use model_args instead of importing, fix nits Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add test to ensure target_modules defaults to None in job config Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add additional check, fix nits Signed-off-by: Will Johnson <mwjohnson728@gmail.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * docs: Add documentation on experiment tracking. (#257) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Ensure additional metadata to trackers don't throw error in happy case. (#290) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix multiple runid creation bug with accelerate. 
(#268) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * feat: logging control operation (#264) Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * Metrics file epoch indexing from 0 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Revert last commit Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix run evaluation to get base model path (#273) Signed-off-by: Anh-Uong <anh.uong@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293) Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * Always update setuptools to latest (#288) Signed-off-by: James Busche <jbusche@us.ibm.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * Rename all fixtures with correct .jsonl extension (#295) Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * feat: add save_model_dir flag where final checkpoint saved (#291) * add save_model_dir flag for final checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * remove output_dir logic, add save method Signed-off-by: Anh-Uong <anh.uong@ibm.com> * update accelerate_launch, remove save tokenizer Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix: put back creation of .complete file Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix failing tests and add new ones Signed-off-by: Anh-Uong <anh.uong@ibm.com> * tests: add sft_trainer test to train and save - small refactor of tests Signed-off-by: Anh-Uong <anh.uong@ibm.com> * add docs on saving checkpoints and fix help msg Signed-off-by: Anh-Uong <anh.uong@ibm.com> * update example and note best checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * changes based on PR review Signed-off-by: Anh-Uong <anh.uong@ibm.com> * add logging to save, fix error out properly Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Anh-Uong 
<anh.uong@ibm.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Signed-off-by: Anh-Uong <anh.uong@ibm.com> Signed-off-by: Angel Luu <angel.luu@us.ibm.com> Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com> Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Hari <harikrishmenon@gmail.com> Signed-off-by: James Busche <jbusche@us.ibm.com> Co-authored-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com> Co-authored-by: Anh-Uong <anh.uong@ibm.com> Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com> Co-authored-by: Angel Luu <angel.luu@us.ibm.com> Co-authored-by: Angel Luu <an317gel@gmail.com> Co-authored-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Co-authored-by: Hari <harikrishmenon@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: James Busche <jbusche@us.ibm.com>

dushyantbehl requested review from anhuong, Ssukriti and alex-jw-brooks as code owners July 23, 2024 07:22

dushyantbehl marked this pull request as draft July 23, 2024 07:22

anhuong reviewed Jul 23, 2024


dushyantbehl force-pushed the dockerfile-arg branch 5 times, most recently from 0575ccc to dc5992f Compare July 25, 2024 10:16

Add a dockerfile argument at the end of final layer to enable aimstack.

c6ad231

Currently guarded by a dockerfile argument. Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

dushyantbehl force-pushed the dockerfile-arg branch from dc5992f to c6ad231 Compare July 25, 2024 10:42

dushyantbehl marked this pull request as ready for review July 25, 2024 11:18

Set the default value of ENABLE_AIM to false

f39ccab

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
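
For readers skimming the timeline, the two commits above amount to a build-time switch that defaults to off. The fragment below is only a rough sketch of that pattern, not the PR's actual diff: the `ENABLE_AIM` name and its `false` default come from the commit messages, while the install command (the `aim` package from PyPI, installed with `pip --user`) and the availability of `python` in the final stage are assumptions made for illustration.

```dockerfile
# Build-time switch; defaults to "false" so Aim is not installed unless requested.
ARG ENABLE_AIM=false

# Assumption: Aim is installed from PyPI as the `aim` package and python/pip are
# available in the final stage. The real build/Dockerfile may install it differently.
RUN if [ "${ENABLE_AIM}" = "true" ]; then \
        python -m pip install --user aim; \
    fi
```

With a guard like this, `docker build -t fms-hf-tuning:dev . -f build/Dockerfile` would produce an image without Aim, and adding `--build-arg ENABLE_AIM=true` to the same command would include it, which matches the two scenarios tested in this PR.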

dushyantbehl force-pushed the dockerfile-arg branch from f140552 to f39ccab Compare July 25, 2024 16:01

dushyantbehl added 2 commits July 29, 2024 11:35

Merge branch 'main' into dockerfile-arg

f64c613

Merge branch 'main' into dockerfile-arg

2302f23

ashokponkumar approved these changes Jul 29, 2024


dushyantbehl added 2 commits July 31, 2024 15:34

Merge branch 'main' into dockerfile-arg

2859747

Merge branch 'main' into dockerfile-arg

00c33b6

Merge branch 'main' into dockerfile-arg

021f5a8

anhuong approved these changes Aug 1, 2024


anhuong merged commit 003feb5 into foundation-model-stack:main Aug 1, 2024
7 checks passed

dushyantbehl deleted the dockerfile-arg branch August 2, 2024 10:33
