# Pre-train BERT using Hugging Face Accelerate library

This example illustrates how to [pre-train BERT on the GLUE MRPC dataset](https://github.com/huggingface/accelerate/blob/main/examples/complete_nlp_example.py) with the [Hugging Face Accelerate](https://github.com/huggingface/accelerate) library, using a [Kubeflow Pipeline (v2)](https://www.kubeflow.org/docs/components/pipelines/v2/introduction/).

## Launch a JupyterLab notebook server

This notebook must be run in a [Kubeflow Notebooks](https://www.kubeflow.org/docs/components/notebooks/overview/) JupyterLab notebook server. [Access Kubeflow Central Dashboard](https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow/tree/master/README.md#access-kubeflow-central-dashboard-optional), and follow the [Quickstart Guide](https://www.kubeflow.org/docs/components/notebooks/quickstart-guide/) to create an instance of the default JupyterLab notebook server. Connect to the launched notebook server from within the Kubeflow Central Dashboard. Clone this Git repository under the home directory on the notebook server, and open this notebook.

## Persistent volumes

Amazon EFS and FSx for Lustre persistent volumes are mounted at `~/pv-efs` and `~/pv-fsx`, respectively, within the notebook server, and at `/efs` and `/fsx` within the pre-training job runtime environment.
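
Because the same volumes appear under different paths on the two sides, it can help to translate a notebook-side path into the path the job sees. The following is a minimal sketch, not part of this repository: `to_job_path` is a hypothetical helper, and it assumes that both mount points expose the same volume root, so the sub-path below the mount point is identical on both sides.

```python
from pathlib import PurePosixPath

# Notebook-side directory name -> job-side mount point, as described above.
MOUNTS = {
    "pv-efs": "/efs",
    "pv-fsx": "/fsx",
}


def to_job_path(notebook_path: str, home: str = "~") -> str:
    """Translate a path under ~/pv-efs or ~/pv-fsx to the job-side path.

    Hypothetical helper; assumes the sub-path below each mount point
    is the same in the notebook server and in the job.
    """
    path = PurePosixPath(notebook_path)
    for name, job_root in MOUNTS.items():
        notebook_root = PurePosixPath(home) / name
        try:
            relative = path.relative_to(notebook_root)
        except ValueError:
            continue  # not under this mount, try the next one
        return str(PurePosixPath(job_root) / relative)
    raise ValueError(f"{notebook_path} is not under a known persistent volume")


print(to_job_path("~/pv-efs/home/accel-bert/logs"))
```

Pass the notebook-side path exactly as you would type it in a cell; paths outside the two volumes raise a `ValueError`.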

## Implicitly defined environment variables

The following variables are implicitly defined by the [PyTorch Elastic](https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow/tree/master/charts/machine-learning/training/pytorchjob-elastic/) Helm chart for use with [Torchrun elastic launch](https://pytorch.org/docs/stable/elastic/run.html):

1. `PET_NNODES`: Torchrun elastic launch `nnodes`
2. `PET_NPROC_PER_NODE`: Torchrun elastic launch `nproc_per_node`
3. `PET_RDZV_ID`: Torchrun elastic launch `rdzv_id`
4. `PET_RDZV_ENDPOINT`: Torchrun elastic launch `rdzv_endpoint`
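
Torchrun picks these variables up on its own, so the job script does not need to pass them explicitly. For debugging, the mapping in the list above can be made visible with a small sketch like the one below. The helper name `torchrun_flags` and the sample values are illustrative assumptions, not part of the Helm chart.

```python
import os

# Environment variable -> equivalent torchrun flag, following the list above.
PET_TO_FLAG = {
    "PET_NNODES": "--nnodes",
    "PET_NPROC_PER_NODE": "--nproc_per_node",
    "PET_RDZV_ID": "--rdzv_id",
    "PET_RDZV_ENDPOINT": "--rdzv_endpoint",
}


def torchrun_flags(env=None):
    """Return the torchrun flags implied by the PET_* variables that are set.

    Hypothetical helper for inspection only; torchrun reads the
    variables itself, so these flags never need to be passed by hand.
    """
    env = os.environ if env is None else env
    return [f"{flag}={env[name]}" for name, flag in PET_TO_FLAG.items() if name in env]


# Sample values chosen for illustration only.
print(torchrun_flags({"PET_NNODES": "2", "PET_NPROC_PER_NODE": "8"}))
```

Calling `torchrun_flags()` with no argument inside the job container reports whatever the chart actually set.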

## Create Kubeflow Pipelines Client

We start by creating a client for Kubeflow Pipelines. Since we are running this notebook in a JupyterLab notebook server inside the Kubeflow platform, we can discover the Kubeflow Pipelines endpoint automatically, as shown below.

In [None]:
import kfp

client = kfp.Client()
client

Next, we get our Kubernetes namespace.

In [None]:
ns = client.get_user_namespace()
print(f"user namespace: {ns}")

## Define pre-training configuration

In this step, we define the configuration for the pre-training job. The configuration identifies the Helm chart to install and supplies the input values for the chart. The input values are loaded from the [pretrain.yaml](./pretrain.yaml) file.

The Helm chart `release_name` must be unique among the Helm charts installed within the user namespace. The `repo_url` below specifies the Git repository URL for the Helm chart, and the `path` specifies the relative path within the Git repository. 


In [None]:
import yaml


# Release name must be unique within the namespace
release_name = "accel-bert"

pretrain_config_yaml = yaml.safe_load(f"""
release_name: {release_name}
namespace: {ns}
repo_url: "https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow.git"
path: "charts/machine-learning/training/pytorchjob-elastic"
""")

with open("pretrain.yaml") as file:
    pretrain_config_yaml["values"] = yaml.safe_load(file)
    
pretrain_config_yaml
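
Before submitting the configuration, a quick sanity check can catch an empty or missing field early. The sketch below is an assumption-laden illustration: `missing_config_keys` is a hypothetical helper, and the list of required keys simply mirrors the keys used in this notebook; the pipeline itself may require a different set.

```python
# Keys used in this notebook's configuration; treating all of them as
# required is an assumption, not a documented contract of the pipeline.
REQUIRED_KEYS = ("release_name", "namespace", "repo_url", "path", "values")


def missing_config_keys(config: dict) -> list:
    """Return the required keys that are absent, None, or an empty string."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]


# Illustrative configuration with placeholder values.
sample_config = {
    "release_name": "accel-bert",
    "namespace": "example-namespace",
    "repo_url": "https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow.git",
    "path": "charts/machine-learning/training/pytorchjob-elastic",
    "values": {},
}
print(missing_config_keys(sample_config))
```

In the notebook, `missing_config_keys(pretrain_config_yaml)` should return an empty list before the run is created.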

## Create a New Kubeflow Experiment 

Next, we create a new [Kubeflow Experiment](https://www.kubeflow.org/docs/components/pipelines/v1/concepts/experiment/).

In [None]:
exp_name = "accel-bert-experiment-1"
exp_desc = "Pre-train BERT on GLUE MRPC dataset using Hugging Face Accelerate"
exp = client.create_experiment(name=exp_name, description=exp_desc, namespace=ns)
exp

## Run the Pipeline in the Experiment

To run this pipeline, we must pass `arguments` containing a `chart_configs` list, as shown below.

In [None]:
from datetime import datetime

ts = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
run_name = f"{exp_name}-run-{ts}"

pipeline_package = "../../../../kfp/pipelines/packages/helm_charts_pipeline.yaml"

pipeline_run = client.create_run_from_pipeline_package(
    pipeline_file=pipeline_package,
    arguments={"chart_configs": [pretrain_config_yaml]},
    run_name=run_name,
    experiment_name=exp_name,
    namespace=ns,
    enable_caching=False,
    service_account="default",
)
pipeline_run

## What happens during the Pipeline Run

You can check the Kubeflow Pipeline Run logs using the link in the output above.

During the Pipeline Run, the Helm charts in the `chart_configs` list are installed sequentially. Each installed Helm chart is monitored until it either completes successfully or fails. If any chart in the list fails, the Pipeline Run ends in failure. Otherwise, when all the charts in the list complete successfully, the Pipeline Run concludes successfully.
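
The control flow described above can be sketched in a few lines. This is only an illustration of the semantics, not the pipeline's actual implementation: `run_charts` and the `install_and_wait` callback are hypothetical names, and the callback stands in for whatever installs a chart and monitors it to completion.

```python
from typing import Callable, Iterable


def run_charts(
    chart_configs: Iterable[dict],
    install_and_wait: Callable[[dict], bool],
) -> bool:
    """Process charts in list order and stop at the first failure.

    `install_and_wait` is a hypothetical callback that installs one chart,
    waits for it to finish, and returns True on success, False on failure.
    """
    for config in chart_configs:
        if not install_and_wait(config):
            return False  # one failed chart fails the whole run
    return True  # every chart completed successfully


# Stand-in installer for illustration: fails only for a release named "bad".
def fake_install(config: dict) -> bool:
    return config["release_name"] != "bad"


print(run_charts([{"release_name": "a"}, {"release_name": "b"}], fake_install))
```

In this sketch, charts listed after a failing chart are never attempted; whether the real pipeline behaves the same way in that detail is an assumption.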

## Tail log output

While the job is running, the pre-training job log output is available on the EFS file system, as shown below. Expect a delay of about 5 minutes before the log output becomes available.

In [None]:
!tail -f ~/pv-efs/home/{release_name}/logs/pytorchjob-{release_name}-worker-0/pretrain.log
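
Since `tail -f` exits immediately if the file does not exist yet, it can be convenient to wait for the log file to appear before tailing it. The following is a minimal polling sketch; `wait_for_file` is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions that you may need to adjust.

```python
import time
from pathlib import Path


def wait_for_file(path: str, timeout_s: float = 600.0, poll_s: float = 10.0) -> bool:
    """Poll until `path` exists; return False if `timeout_s` elapses first.

    Hypothetical helper; the default timeout and interval are guesses.
    """
    target = Path(path).expanduser()  # resolves a leading "~"
    deadline = time.monotonic() + timeout_s
    while True:
        if target.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, call `wait_for_file(f"~/pv-efs/home/{release_name}/logs/pytorchjob-{release_name}-worker-0/pretrain.log")` and run the `tail` cell once it returns `True`.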