[AIR] LightningTrainer Dolly V2 FSDP Fine-tuning Example (ray-project#34990)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
architkulkarni · May 16, 2023 · 8492f80
1 parent 3d7b2ff

Showing 14 changed files with 1,130 additions and 1 deletion.
diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
@@ -82,6 +82,7 @@ parts:
               - file: ray-air/examples/gptj_batch_prediction
               - file: ray-air/examples/gptj_serving
               - file: ray-air/examples/dreambooth_finetuning
+              - file: ray-air/examples/dolly_lightning_fsdp_finetuning
           - file: ray-air/api/api
           - file: ray-air/benchmarks
 

diff --git a/doc/source/ray-air/examples/BUILD b/doc/source/ray-air/examples/BUILD
@@ -52,6 +52,7 @@ py_test_run_all_notebooks(
         "stablediffusion_batch_prediction.ipynb",  # Requires GPUs
         "gptj_deepspeed_fine_tuning.ipynb",  # Requires release test
         "opt_deepspeed_batch_inference.ipynb", # Requires release test
+        "dolly_lightning_fsdp_finetuning.ipynb", # Requires release test
     ],
     data = ["//doc/source/ray-air/examples:air_examples"],
     tags = ["exclusive", "team:ml", "ray_air"],

diff --git a/doc/source/ray-air/examples/dolly_lightning_fsdp_finetuning.ipynb b/doc/source/ray-air/examples/dolly_lightning_fsdp_finetuning.ipynb
diff --git a/doc/source/ray-air/examples/gptj_deepspeed_fine_tuning.ipynb b/doc/source/ray-air/examples/gptj_deepspeed_fine_tuning.ipynb
@@ -5,6 +5,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "(gptj_deepspeed_finetune)=\n",
+    "\n",
     "# GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed\n",
     "\n",
     "In this example, we will showcase how to use the Ray AIR for **GPT-J fine-tuning**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj).\n",

diff --git a/doc/source/ray-air/examples/index.rst b/doc/source/ray-air/examples/index.rst
@@ -30,6 +30,7 @@ Text/NLP
 - :doc:`/ray-air/examples/gptj_serving`: How to use Ray AIR to do online serving with the Hugging Face Transformers GPT-J model.
 - :doc:`/ray-air/examples/dreambooth_finetuning`: How to fine-tune a DreamBooth text-to-image model with your own images.
 - :doc:`/ray-air/examples/opt_deepspeed_batch_inference`: How to run batch inference on a dataset of texts with a 30B OPT model.
+- :doc:`/ray-air/examples/dolly_lightning_fsdp_finetuning`: How to fine-tune a dolly-v2-7b model with Ray AIR LightningTrainer and FSDP.
 
 Image/CV
 --------

diff --git a/doc/source/train/examples.rst b/doc/source/train/examples.rst
@@ -83,6 +83,14 @@ Distributed Training Examples using Ray Train
 
             Use LightningTrainer with Ray Data and Batch Predictor
 
+    .. grid-item-card::
+        :img-top: /images/pytorch_lightning_small.png
+        :class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img
+
+        .. button-ref:: dolly_lightning_fsdp_finetuning
+
+            Fine-tune LLM with AIR LightningTrainer and FSDP
+
 
 Ray Train Examples Using Loggers & Callbacks
 --------------------------------------------

diff --git a/doc/source/train/examples/lightning/lightning_cola_advanced.ipynb b/doc/source/train/examples/lightning/lightning_cola_advanced.ipynb
@@ -1483,6 +1483,17 @@
     "print(results.head(10))\n",
     "print(matthews_corr)"
    ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What's next?\n",
+    "\n",
+    "- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
+    "- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
+   ]
   }
  ],
  "metadata": {

diff --git a/doc/source/train/examples/lightning/lightning_mnist_example.ipynb b/doc/source/train/examples/lightning/lightning_mnist_example.ipynb
@@ -741,6 +741,7 @@
     "## What's next?\n",
     "\n",
     "- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
+    "- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
     "- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
    ]
   }

diff --git a/doc/source/tune/examples/tune-pytorch-lightning.ipynb b/doc/source/tune/examples/tune-pytorch-lightning.ipynb
@@ -582,6 +582,7 @@
     "\n",
     "- {ref}`Use LightningTrainer for Image Classification <lightning_mnist_example>`.\n",
     "- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
+    "- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
     "- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)\n",
     "  and [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune.\n",
     "- {doc}`/tune/examples/includes/mnist_ptl_mini`:\n",
@@ -607,7 +608,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.15"
+   "version": "3.8.16"
   }
  },
  "nbformat": 4,

diff --git a/release/air_examples/dolly_v2_lightning_fsdp_finetuning/dolly_v2_fsdp_compute_aws.yaml b/release/air_examples/dolly_v2_lightning_fsdp_finetuning/dolly_v2_fsdp_compute_aws.yaml
@@ -0,0 +1,20 @@
+cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
+region: us-west-2
+
+head_node_type:
+    name: head_node
+    instance_type: g4dn.8xlarge
+
+worker_node_types:
+    - name: worker_node
+      instance_type: g4dn.4xlarge
+      min_workers: 15
+      max_workers: 15
+      use_spot: false
+
+aws:
+  TagSpecifications:
+    - ResourceType: "instance"
+      Tags:
+        - Key: ttl-hours
+          Value: '24'
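
Review note: a rough capacity check for the cluster this compute config describes. The sketch below is not part of the commit. It assumes that each `g4dn.4xlarge` and `g4dn.8xlarge` instance carries a single 16 GiB NVIDIA T4, and that dolly-v2-7b has roughly 6.9 billion parameters; neither figure appears in the diff. Under those assumptions, the fp16 weights alone would fill most of one T4, leaving little room for gradients, optimizer state, and activations, which is presumably why the example shards the model with FSDP across the head node and all 15 workers.

```python
# Back-of-the-envelope sizing for the compute config above.
# ASSUMPTIONS (not stated in the diff):
#   - g4dn.4xlarge and g4dn.8xlarge each have one 16 GiB NVIDIA T4 GPU
#   - dolly-v2-7b has about 6.9e9 parameters
GPUS_PER_INSTANCE = {"g4dn.4xlarge": 1, "g4dn.8xlarge": 1}
GPU_MEMORY_GIB = 16
NUM_PARAMS = 6.9e9


def total_gpus(head_type, worker_type, num_workers):
    """Count GPUs across one head node plus `num_workers` worker nodes."""
    return GPUS_PER_INSTANCE[head_type] + GPUS_PER_INSTANCE[worker_type] * num_workers


def weights_gib(num_params, bytes_per_param):
    """Memory needed to hold the raw model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Values taken from the YAML: head g4dn.8xlarge, 15 x g4dn.4xlarge workers.
gpus = total_gpus("g4dn.8xlarge", "g4dn.4xlarge", 15)
fp16_weights = weights_gib(NUM_PARAMS, bytes_per_param=2)
per_gpu_shard = fp16_weights / gpus  # weights only, if sharded evenly

print(f"GPUs in cluster:        {gpus}")
print(f"fp16 weights (total):   {fp16_weights:.1f} GiB of {GPU_MEMORY_GIB} GiB per GPU")
print(f"fp16 weights (sharded): {per_gpu_shard:.2f} GiB per GPU")
```

The per-GPU figure covers weights only. Real memory use during training also depends on the optimizer, precision settings, batch size, and sequence length, none of which are visible in this part of the diff.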
diff --git a/release/air_examples/dolly_v2_lightning_fsdp_finetuning/dolly_v2_fsdp_env.yaml b/release/air_examples/dolly_v2_lightning_fsdp_finetuning/dolly_v2_fsdp_env.yaml
@@ -0,0 +1,21 @@
+base_image: {{ env["RAY_IMAGE_ML_NIGHTLY_GPU"] | default("anyscale/ray:nightly-py38-cu118") }}
+env_vars: {}
+debian_packages:
+  - curl
+
+python:
+  pip_packages:
+    - "datasets"
+    - "evaluate"
+    - "scikit-learn"
+    - "boto3"
+    - myst-parser==0.15.2
+    - myst-nb==0.13.1
+    - jupytext==1.13.6
+  conda_packages: []
+
+post_build_cmds:
+  - pip uninstall -y ray || true && pip3 install -U {{ env["RAY_WHEELS"] | default("ray") }}
+  - {{ env["RAY_WHEELS_SANITY_CHECK"] | default("echo No Ray wheels sanity check") }}
+  - pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+  - pip3 install "pytorch_lightning>=2.0.0" "transformers>=4.28.0" "accelerate>=0.18.0"
diff --git a/release/air_examples/dolly_v2_lightning_fsdp_finetuning/lightning-llm-finetuning-7b.ipynb b/release/air_examples/dolly_v2_lightning_fsdp_finetuning/lightning-llm-finetuning-7b.ipynb
@@ -0,0 +1 @@
+../../../doc/source/ray-air/examples/dolly_lightning_fsdp_finetuning.ipynb
diff --git a/release/air_examples/dolly_v2_lightning_fsdp_finetuning/test_myst_doc.py b/release/air_examples/dolly_v2_lightning_fsdp_finetuning/test_myst_doc.py
@@ -0,0 +1 @@
+../../../doc/test_myst_doc.py
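
Review note: the two one-line files above look like symlinks, since git renders a symlink's target path as the file's only line. This is an inference from the diff, not something the commit states. If so, the release test runs the documentation notebook directly rather than a copy. The sketch below is not part of the commit; it only checks that the relative targets resolve to the paths touched elsewhere in this diff. `resolve_link` is a hypothetical helper written for this note.

```python
import posixpath


def resolve_link(link_path, target):
    """Resolve a relative symlink target against the directory holding the link."""
    link_dir = posixpath.dirname(link_path)
    return posixpath.normpath(posixpath.join(link_dir, target))


release_dir = "release/air_examples/dolly_v2_lightning_fsdp_finetuning"

# Paths and targets copied from the two hunks above.
notebook = resolve_link(
    release_dir + "/lightning-llm-finetuning-7b.ipynb",
    "../../../doc/source/ray-air/examples/dolly_lightning_fsdp_finetuning.ipynb",
)
test_script = resolve_link(
    release_dir + "/test_myst_doc.py",
    "../../../doc/test_myst_doc.py",
)

print(notebook)
print(test_script)
```

Both targets climb three directories, which takes them from the release test folder back to the repository root before descending into `doc/`.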
diff --git a/release/release_tests.yaml b/release/release_tests.yaml
@@ -827,6 +827,23 @@
         cluster_compute: gptj_deepspeed_compute_gce.yaml
 
 
+- name: air_example_dolly_v2_lightning_fsdp_finetuning
+  group: AIR examples
+  working_dir: air_examples/dolly_v2_lightning_fsdp_finetuning
+
+  python: "3.8"
+
+  frequency: weekly
+  team: ml
+  cluster:
+    cluster_env: dolly_v2_fsdp_env.yaml
+    cluster_compute: dolly_v2_fsdp_compute_aws.yaml
+
+  run:
+    timeout: 4700
+    script: python test_myst_doc.py --path lightning-llm-finetuning-7b.ipynb
+
+
 - name: air_example_opt_deepspeed_batch_inference
   group: AIR examples
   working_dir: air_examples/opt_deepspeed_batch_inference