### 2. Prepare Data for LLM Training

In this notebook, we use a dataset from the Hugging Face Datasets hub to train the Llama model. Data preparation consists of the following steps:

1. Download the dataset from the Hugging Face Hub. The notebook uses the provided dataset repo name to download it to the instance.
2. Load and tokenize the dataset.
3. Save the tokenized data in NeMo format so that it can be used for training.

Note: To use your own dataset, you can provide it directly as a JSONL file, which NeMo Megatron supports.
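
As a minimal sketch of that input format, the helpers below write and read a JSON Lines file: one JSON object per line, with the document text stored under a `text` key (the same key passed as `json-keys` in the hyperparameters of this notebook). The function names `write_jsonl` and `read_jsonl` are hypothetical and not part of NeMo or SageMaker.

```python
import json
from pathlib import Path


def write_jsonl(texts, path, text_key="text"):
    """Write one JSON object per line, storing each text under text_key."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({text_key: text}, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl(["first document", "second document"], "my_dataset.jsonl")` produces a two-line file (the file name is a placeholder) whose records each have a single `text` field.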

#### Run the processing job

We will use a SageMaker training job to run the data processing. We start by importing the necessary modules from the SageMaker Python SDK.

In [None]:
# Retrieve the Docker image URL stored in step 1
%store -r docker_image

use_fsx = False  # Set to True and check the other FSx parameters to use FSx for Lustre (FSxL) for the job

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs
# SageMaker creates this bucket automatically if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:  
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


### Run the data processing job with the NeMo Neuron container

We will run the processing job with the custom Docker image created in step 1, which uses the Neuron image as its base. We set some of the hyperparameters according to the dataset we use.

In [None]:
hyperparameters = {}

hyperparameters['hf_dataset_name'] = "wikitext" # we will use wikitext to run this example
hyperparameters['hf_subset'] = "wikitext-103-v1" # Change this depending on the dataset used.
hyperparameters['dataset_split'] = "train" # Change this depending on the dataset used.
hyperparameters['token'] = "hf_XXXX" # Add your Hugging Face token; it is needed to download the gated Llama 2 model.
hyperparameters['model_id'] = "meta-llama/Llama-2-7b-hf"

hyperparameters['json-keys'] = 'text' # key in the dataset JSON that contains the text.
hyperparameters['tokenizer-library'] = 'huggingface'
hyperparameters['dataset-impl'] = 'mmap'
hyperparameters['need-pad-id'] = ""
hyperparameters['append-eod'] = ""
hyperparameters['workers'] = 48
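
SageMaker script mode generally hands hyperparameters to the entry point as `--name value` command-line arguments. The sketch below illustrates, under that assumption, how a script could parse a few of the keys set above with `argparse`. It is not the actual entry-point script: the `to_cli_args` helper and the parser are hypothetical, and treating empty-string values as bare boolean flags is an assumption about intent rather than verified SageMaker behavior.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a dict into ['--key', 'value', ...].

    Assumption: an empty-string value means a bare flag with no value.
    """
    args = []
    for key, value in hyperparameters.items():
        args.append(f"--{key}")
        if value != "":
            args.append(str(value))
    return args


def build_parser():
    """Hypothetical parser covering a subset of the keys used in this notebook."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--json-keys", nargs="+", default=["text"])
    parser.add_argument("--tokenizer-library", type=str, default="huggingface")
    parser.add_argument("--dataset-impl", type=str, default="mmap")
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--need-pad-id", action="store_true")
    parser.add_argument("--append-eod", action="store_true")
    return parser
```

Parsing `to_cli_args({"json-keys": "text", "workers": 48, "append-eod": ""})` with this parser yields `json_keys=["text"]`, `workers=48`, and `append_eod=True`, while `need_pad_id` stays `False` because the flag was not passed.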

In [None]:
# Retrieve the FSx details saved with the %store magic

if use_fsx:
    # Retrieve FSx details
    %store -r fsx_id
    %store -r sec_group
    %store -r private_subnet_id     
    %store -r fsx_mount
    %store -r fsx_file_system_path
else:
    use_fsx = False

In [None]:
# Set up the FSx config for the data channels
from sagemaker.inputs import FileSystemInput
if use_fsx:
    FS_ID = fsx_id # FSX ID
    FS_BASE_PATH = "/" + fsx_mount + "/" + fsx_file_system_path # Path in the filesystem that needs to be mounted
    SUBNET_ID = private_subnet_id # Subnet to launch SM jobs in
    SEC_GRP = [sec_group]

    fsx_train_input = FileSystemInput(
        file_system_id=FS_ID,
        file_system_type='FSxLustre',
        directory_path=FS_BASE_PATH + "/nemo_llama",
        file_system_access_mode="rw"
    )
    hyperparameters['input'] = "/opt/ml/input/data/train/wiki.jsonl"
    hyperparameters['tokenizer-type'] = '/opt/ml/input/data/train/llama7b-hf'
    hyperparameters['output-prefix'] = '/opt/ml/input/data/train/wiki'
    data_channels = {"train": fsx_train_input}

else:
    checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/nemo_llama_experiment"
    # We use the SageMaker S3 checkpoint mechanism because we need read/write access to the paths.
    hyperparameters['input'] = "/opt/ml/checkpoints/wiki.jsonl"
    hyperparameters['tokenizer-type'] = '/opt/ml/checkpoints/llama7b-hf'
    hyperparameters['output-prefix'] = '/opt/ml/checkpoints/wiki'
    hyperparameters["checkpoint-dir"] = '/opt/ml/checkpoints'
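
In the non-FSx branch the job reads and writes under `/opt/ml/checkpoints`, and SageMaker syncs that directory with `checkpoint_s3_uri`, so a file written there should appear under the S3 prefix with the same relative path. The helper below is a small sketch of that mapping; `local_to_s3` is a hypothetical name, and the bucket in the usage note is a placeholder.

```python
from pathlib import PurePosixPath


def local_to_s3(local_path, checkpoint_dir, checkpoint_s3_uri):
    """Map a path inside the local checkpoint directory to its expected S3 URI.

    Raises ValueError if local_path is not inside checkpoint_dir.
    """
    relative = PurePosixPath(local_path).relative_to(checkpoint_dir)
    return checkpoint_s3_uri.rstrip("/") + "/" + relative.as_posix()
```

For instance, `local_to_s3("/opt/ml/checkpoints/wiki.jsonl", "/opt/ml/checkpoints", "s3://my-bucket/nemo_llama_experiment")` returns `s3://my-bucket/nemo_llama_experiment/wiki.jsonl`.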


In [None]:
from sagemaker.pytorch import PyTorch
# Need to check if this works on multinode with torchrun.
estimator = PyTorch(
    base_job_name="nemo-megatron-data-prep",
    source_dir="./scripts",
    entry_point="process_data_for_megatron.py",
    role=role,
    image_uri=docker_image,
    # For multi-node distributed runs, increase this count. Example: 2
    instance_count=1,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sess,
    volume_size=2048,
    hyperparameters=hyperparameters,
    checkpoint_s3_uri=checkpoint_s3_uri if not use_fsx else None,
    checkpoint_local_path=hyperparameters["checkpoint-dir"] if not use_fsx else None,
    debugger_hook_config=False,
    keep_alive_period_in_seconds=600,

    subnets=[SUBNET_ID] if use_fsx else None,  # Give SageMaker training jobs access to FSx resources in your Amazon VPC
    security_group_ids=SEC_GRP if use_fsx else None,
)

### Start the training Job

In [None]:
if use_fsx:
    estimator.fit(data_channels)
else:
    estimator.fit()

### Terminate the warm pool    

In [None]:
# Setting KeepAlivePeriodInSeconds to 0 terminates the warm pool
sess.update_training_job(estimator.latest_training_job.job_name, resource_config={"KeepAlivePeriodInSeconds": 0})
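
To check the state of the warm pool after the update, you can inspect the `WarmPoolStatus` field of the `DescribeTrainingJob` response. The helper below is a minimal sketch that reads the status from such a response dictionary; `warm_pool_status` is a hypothetical name, and it returns `None` when the response carries no warm pool information.

```python
def warm_pool_status(describe_response):
    """Return the warm pool status string, or None if the field is absent."""
    return describe_response.get("WarmPoolStatus", {}).get("Status")
```

In this notebook the response could be obtained with `sess.sagemaker_client.describe_training_job(TrainingJobName=estimator.latest_training_job.job_name)` and passed to the helper.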