# Train EleutherAI GPT-J with PyTorch 1.10.2 and Pipeline Parallelism Using the SageMaker Model Parallelism Library

**This training job completes successfully on 1x ml.p4d.24xlarge instance or a cluster of 4x p4.16xlarge instances.**

*Please run this notebook with the Data Science -> Python 3 kernel on a SageMaker Studio notebook or the conda_pytorch_p38 kernel on a SageMaker notebook instance.*

This notebook walks you through how to train [EleutherAI's](https://www.eleuther.ai/) [GPT-J](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) model with SageMaker's model parallelism.
EleutherAI released GPT-J 6B, an open-source alternative to [OpenAI's GPT-3](https://openai.com/blog/gpt-3-apps/). [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) is the 6-billion-parameter successor to EleutherAI's GPT-Neo family, a family of transformer-based language models based on the GPT architecture for text generation.

EleutherAI's primary goal is to train a model that is equivalent in size to GPT-3 and make it available to the public under an open license.
Over the last few months, GPT-J has attracted a lot of interest from researchers, data scientists, and even software developers, but it has remained very challenging to fine-tune.

The weights of the 6-billion-parameter model represent a ~24 GB memory footprint in float32 (4 bytes per parameter). To load the model in float32, you need at least 2x the model size in CPU RAM: 1x for the initial weights and another 1x to load the checkpoint. Apart from the model parameters, the gradients, optimizer states, and activations also take memory, so the actual memory usage can be significantly higher than 48 GB. For example, with the Adam optimizer and FP32 training, the parameters, gradients, and optimizer states can take 96 GB or more, and the activation memory footprint can be even larger than this, so the total memory usage can easily exceed 200 GB.
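
The arithmetic behind these figures can be reproduced with a few lines of Python. This is a rough sketch under stated assumptions: exactly 6 billion parameters, 4 bytes per FP32 value, and two Adam state tensors (first and second moments) per parameter. `estimate_training_memory_gb` is a hypothetical helper, not part of any library, and activation memory is left out because it depends on batch size and sequence length.

```python
def estimate_training_memory_gb(num_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough FP32 memory estimate in GB (1 GB = 1e9 bytes), excluding activations."""
    weights = num_params * bytes_per_value / 1e9
    gradients = weights  # one gradient value per parameter
    optimizer_states = weights * optimizer_states_per_param  # Adam: two moments per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "total": weights + gradients + optimizer_states,
    }


# Assumption: GPT-J 6B is treated as exactly 6e9 parameters
estimate = estimate_training_memory_gb(6_000_000_000)
for name, gigabytes in estimate.items():
    print(f"{name}: {gigabytes:.0f} GB")
```

The total of 96 GB already exceeds the memory of any single GPU on the instances used here, which is why the model has to be split across devices.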

In this notebook, you will learn how to fine-tune GPT-J using Amazon SageMaker and Hugging Face on NVIDIA GPU instances.

This notebook depends on the following files and folders:

1. `train_gptj_smp_script.py`: This is an entrypoint script that is passed to the PyTorch estimator in the notebook instructions. This script is responsible for end-to-end training of the GPT-J model with SMP. The script has additional comments at places where the SMP API is used.
2. `fp16`: This folder is used for 16-bit float training and contains an fp16 optimizer and various fp16 utilities.
3. `learning_rates.py`: This contains the functions for the learning rate schedule.
4. `requirements.txt`: This installs the dependencies, such as the right version of Hugging Face Transformers.
5. `preprocess.py`: This downloads and preprocesses the sst2/glue dataset.
6. `args.py`: A collection of different arguments, such as training, data, and SageMaker Model Parallel related arguments.
7. `smp_trainer.py`: Defines the SageMaker Model Parallel Trainer class.


## SageMaker Distributed Training 

SageMaker provides distributed training libraries for data parallelism and model parallelism. The libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.

### Approaches

![SageMaker Distributed Training Approaches](img/TypesOfDistributedTraining.png)


### SageMaker Model Parallel

Model parallelism is the process of splitting a model up between multiple devices or nodes (such as GPU-equipped instances) and creating an efficient pipeline to train the model across these devices to maximize GPU utilization.

Increasing deep learning model size (layers and parameters) can result in better accuracy. However, there is a limit to the maximum model size you can fit in a single GPU. When training deep learning models, GPU memory limitations can be a bottleneck in the following ways:

1. They can limit the size of the model you train. Given that larger models tend to achieve higher accuracy, this directly translates to trained model accuracy.

2. They can limit the batch size you train with, leading to lower GPU utilization and slower training.

To overcome the limitations associated with training a model on a single GPU, you can use model parallelism to distribute and train your model on multiple computing devices.

### Core features of SageMaker Model Parallel 

1. [Automated Model Splitting](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): When you use SageMaker's model parallel library, you can take advantage of automated model splitting, also referred to as automated model partitioning. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance. You can configure the automated partitioning algorithm to optimize for speed or memory.

2. [Pipeline Execution Schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): A core feature of SageMaker's distributed model parallel library is pipelined execution, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism, by having the GPUs compute simultaneously on different data samples, and to overcome the performance loss due to sequential computation.

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one-by-one and follow an execution schedule defined by the library runtime. A microbatch is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device for every time slot.
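
To make the idea concrete, here is a toy sketch of microbatching and a simple forward-only pipeline schedule. It is an illustration only: the real schedule (including backward passes) is decided by the library runtime, and both `split_into_microbatches` and `toy_forward_schedule` are hypothetical helpers written for this example.

```python
def split_into_microbatches(mini_batch, num_microbatches):
    """Split a mini-batch (a list of samples) into equally sized microbatches."""
    if len(mini_batch) % num_microbatches != 0:
        raise ValueError("mini-batch size must be divisible by num_microbatches")
    size = len(mini_batch) // num_microbatches
    return [mini_batch[i * size:(i + 1) * size] for i in range(num_microbatches)]


def toy_forward_schedule(num_devices, num_microbatches):
    """Return, for every time slot, a dict {device: microbatch index} (forward pass only)."""
    schedule = []
    for slot in range(num_devices + num_microbatches - 1):
        active = {}
        for device in range(num_devices):
            microbatch = slot - device  # device d starts one slot after device d-1
            if 0 <= microbatch < num_microbatches:
                active[device] = microbatch
        schedule.append(active)
    return schedule


microbatches = split_into_microbatches(list(range(8)), num_microbatches=2)
print(microbatches)

schedule = toy_forward_schedule(num_devices=2, num_microbatches=2)
for slot, active in enumerate(schedule):
    print(f"slot {slot}: {active}")
```

In the middle time slot both devices are busy with different microbatches, which is the overlap that pipelining is designed to create.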

In addition to its [core features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html), the SageMaker distributed model parallel library offers [memory-saving features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html) for training deep learning models with PyTorch: [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html), [optimizer state sharding](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html), [activation checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html), and [activation offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html). 

### SageMaker Model Parallel configuration

Please refer to all the [configuration parameters](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) related to SageMaker Distributed Training.

As we are going to use PyTorch and Hugging Face for training GPT-J, it is important to understand all the SageMaker Distributed configuration parameters specific to PyTorch [here](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).

#### Important

`processes_per_host` must not be greater than the number of GPUs per instance and typically will be equal to the number of GPUs per instance.

For example, if you use one instance with 4-way pipeline parallelism and 2-way data parallelism, then `processes_per_host` should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs, such as an ml.p3.16xlarge.
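
The rule above can be written as a small check. This is a sketch for a single instance; `required_processes_per_host` is a hypothetical helper, and the GPU count must be supplied by you for the instance type you choose.

```python
def required_processes_per_host(pipeline_parallel_degree, data_parallel_degree, gpus_per_instance):
    """Return the process count for one instance, or raise if the instance has too few GPUs."""
    required = pipeline_parallel_degree * data_parallel_degree
    if required > gpus_per_instance:
        raise ValueError(
            f"{required} processes are needed but the instance has only {gpus_per_instance} GPUs"
        )
    return required


# 4-way pipeline parallelism x 2-way data parallelism on an 8-GPU instance
processes = required_processes_per_host(4, 2, gpus_per_instance=8)
print(processes)
```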

The following image illustrates how 4-way data parallelism and 2-way pipeline parallelism are distributed across 8 GPUs: the model is partitioned across 2 GPUs, and each partition is replicated on 4 GPUs.

It is also important to understand how the [ranking mechanism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-ranking-mechanism.html) of model parallelism works with tensor parallelism. This is extended from the Ranking Basics for Core Features of the SageMaker Model Parallel Library.

![SageMaker Distributed Training Approaches](img/SMP-Pipeline-Parallel-DDP.png)


#### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with PyTorch.

1. To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

2. To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

3. To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

### Using this notebook

*Note:* You have two options for running this notebook.  
    1. You can use the _Amazon SageMaker local mode_ to train on your notebook instance  
    2. You can use the _managed Amazon SageMaker training_ environment to train your model in a separate cluster of EC2 instances

The cells below check the notebook instance you are using and set the `local_training` flag to `True` if you are on a multi-GPU SageMaker notebook instance and can therefore train the model locally. If you have a CPU-based SageMaker notebook instance, the `local_training` flag is set to `False` and you will use SageMaker managed training.

In [None]:
import boto3

def get_notebook_name():
    import json
    log_path = '/opt/ml/metadata/resource-metadata.json'
    with open(log_path, 'r') as logs:
        _logs = json.load(logs)
    return _logs['ResourceName']

client = boto3.client('sagemaker')
response = client.describe_notebook_instance(
    NotebookInstanceName=get_notebook_name())
# record the instance type of this notebook instance
instance_type = response['InstanceType']

In [None]:
if instance_type in ['ml.p4d.24xlarge']:
    local_training = True
    print("You are running this notebook on a multi GPU SageMaker notebook instance. For the purpose of this workshop you will use SageMaker local training.")
else:
    local_training = False
    print("You are running this notebook on a CPU based SageMaker notebook instance. For the purpose of this workshop you will use SageMaker remote/managed training.")
%store local_training

## Install and Upgrade Libraries

The SageMaker model parallelism library's tensor parallelism feature requires the SageMaker Python SDK and the SageMaker Experiments library. Run the following cell to install or upgrade the libraries.

**Note:** To finish applying the changes, you must restart the kernel.

In [None]:
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_mxnet_p36/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p36/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p27
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p36
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p27/
!rm -rf /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p36/
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p27/
!rm -rf /home/ec2-user/anaconda3/envs/chainer_p36/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_latest_p37/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_p27/
!rm -rf /home/ec2-user/anaconda3/envs/mxnet_p36/
!rm -rf /home/ec2-user/anaconda3/envs/python2/
!rm -rf /home/ec2-user/anaconda3/envs/python3/
!rm -rf /home/ec2-user/anaconda3/envs/pytorch_p27/
!rm -rf /home/ec2-user/anaconda3/envs/pytorch_p36/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow2_p36/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow_p27/
!rm -rf /home/ec2-user/anaconda3/envs/tensorflow_p36/
!rm -rf /home/ec2-user/anaconda3/envs/R/

In [None]:
# run once, restart kernel, then comment out this cell
# update sagemaker to the latest 2.x version
! pip install -qU pip
! pip install -qU "sagemaker>=2,<3"
! pip install -q 'sagemaker[local]' --upgrade
! pip install -qU sagemaker-experiments
! pip install -q setuptools==59.5.0
! pip install -qr requirements.txt
! pip install -q transformers
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

<b>Important</b> After you run the above cell, comment it out for future runs.

Now restart the kernel, then continue with the cells that follow.

Import the SageMaker Python SDK and check that it has been updated to the latest version.

In [None]:
import sagemaker

print(sagemaker.__version__)

## Amazon SageMaker Initialization

Throughout this example, you'll use a training script for the GPT-J model and a dataset from Hugging Face.

Based on the `local_training` flag, you need to set up SageMaker differently. The cells below check this flag and initialize SageMaker appropriately.

#### Considerations when setting up SageMaker local training 

If you want to train locally on this notebook the following changes are needed. Running the cells in this notebook will take care of these steps.

1. Install `pip install 'sagemaker[local]' --upgrade` (installed above)
2. Create `config.yml` in the same folder as the notebook  
    2.1 Make a directory `!mkdir ./.sagemaker`  
    2.2 save this to config.yml  
    
    `%%writefile -a ./.sagemaker/config.yml`
  
    ` 
    local:  
    container_root: /home/ec2-user/SageMaker/hf-gptj-remars-workshop/training/distributed_training/pytorch/model_parallel/gpt-j/tmp/  
    `
3. Create a local session and comment out `sagemaker_session`

    `
    from sagemaker.local.local_session import LocalSession
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'container_root': '/home/ec2-user/SageMaker/hf-gptj-remars-workshop/training/distributed_training/pytorch/model_parallel/gpt-j/tmp/'}}
    `
  
4. Comment out `checkpoint_s3_uri`, as it is not supported in local mode
5. Replace `instance_type` with `"local_gpu"`

In [None]:
#retrieve value of local_training
%store -r local_training
local_training

In [None]:
if local_training:
    # clear all images
    !docker system prune -f

In [None]:
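# Create the directories unconditionally (and idempotently): the %%writefile cell
# below always runs and fails if ./.sagemaker does not exist
!mkdir -p ./.sagemaker ./tmp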

In [None]:
%%writefile ./.sagemaker/config.yml

local:
#     local_code: true
    container_root: /home/ec2-user/SageMaker/hf-gptj-remars-workshop/training/distributed_training/pytorch/model_parallel/gpt-j/tmp/

In [None]:
import os
if local_training:
    from sagemaker.local.local_session import LocalSession
    sagemaker_session = LocalSession()
    tmp_dir = os.getcwd() + "/tmp/"
    sagemaker_session.config = {'local': {'container_root': tmp_dir,'local_code': True}}

Run the following cell to import SageMaker modules and retrieve information of your current SageMaker work environment: your AWS account ID, the AWS Region you are using to run the notebook, and the ARN of your Amazon SageMaker execution role.

In [None]:
%%time
import os

import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import boto3

def get_notebook_name():
    import json
    log_path = '/opt/ml/metadata/resource-metadata.json'
    with open(log_path, 'r') as logs:
        _logs = json.load(logs)
    return _logs['ResourceName']

role = get_execution_role()  # alternatively, provide a pre-existing role ARN
print(f"SageMaker Execution Role: {role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

sm_boto_client = boto3.client("sagemaker")
if not local_training:
    sagemaker_session = sagemaker.session.Session(boto_session=session)

# get default bucket
default_bucket = sagemaker_session.default_bucket()
print()
print("Default bucket for this session: ", default_bucket)

response = sm_boto_client.describe_notebook_instance(
    NotebookInstanceName=get_notebook_name())
# record the instance type of this notebook instance
instance_type = response['InstanceType']

In [None]:
s3_output_bucket = f"s3://sagemaker-{region}-{account}/smp-model-parallel-outputdir/"

## Training Dataset
The training script fine-tunes GPT-J on the `sst2` dataset. The DataLoader and Sampler are defined in `smp_trainer.py`.

## Setup Hyperparameters
The following `hyperparameters` dictionary passes arguments to the training script and sets the model parallel configuration when creating the training job.

You can also add custom MPI flags. By default, we have `--mca btl_vader_single_copy_mechanism none` to remove unnecessary logs.

Next, we add a base metric definition to enable the metric upload in SageMaker. You can add any further metric definitions.

We can train on multi-node clusters too, but the default below uses a single node so that there are no capacity concerns. The `ddp` flag in the hyperparameters integrates with PyTorch's Distributed Data Parallel for data parallelism. Note the following relationship when using pipeline parallelism:

`(pipeline parallelism degree) x (data parallelism degree) = total number of GPUs`
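
As a quick illustration of this relationship, the sketch below lists every (pipeline degree, data degree) pair whose product equals a given GPU count. It assumes a tensor parallel degree of 1, as in this notebook, and `valid_parallel_configs` is a hypothetical helper written for this example.

```python
def valid_parallel_configs(total_gpus):
    """List (pipeline_parallel_degree, data_parallel_degree) pairs whose product is total_gpus."""
    return [
        (pp, total_gpus // pp)
        for pp in range(1, total_gpus + 1)
        if total_gpus % pp == 0
    ]


configs = valid_parallel_configs(8)
for pp, dp in configs:
    print(f"pipeline parallel degree {pp} x data parallel degree {dp} = 8 GPUs")
```

With the pipeline parallel degree of 8 used below on a single 8-GPU instance, the data parallel degree is 1.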

In [None]:
model_config = "gpt-j-6B"
model_name_or_path = "EleutherAI/gpt-j-6B"

if model_config == "gpt-j-6B":
    model_params = {
        "tensor_parallel_degree": 1,
        "pipeline_parallel_degree": 8,
        "prescaled_batch": 0,
    }

In [None]:
hyperparameters = {
    "dataset_name": "glue",
    "dataset_config_name": "sst2",
    "do_train": True,
    "do_eval": False,
    "load_from_s3": False,
    "per_device_train_batch_size": 2,
    "output_dir": "./temp",
    "model_name_or_path": model_name_or_path,
    "max_steps": 100,
    "seed": 12345,
    "lr": 2.0e-4,
    "lr_decay_iters": 125000,
    "min_lr": 0.00001,
    "warmup": 0.01,
    "shard_optimizer_state": 1,
    "activation_checkpointing": 1,
    "activation_strategy": "each",
    "optimize": "memory",
    "ddp": True,
    "cache_dir": "/tmp",
    "save_final_full_model": 1,
}

for k, v in model_params.items():
    hyperparameters[k] = v
    
mpioptions = "-x NCCL_DEBUG=WARN -x SMDEBUG_LOG_LEVEL=ERROR "
mpioptions += "-x SMP_NCCL_THROTTLE_LIMIT=1 "
mpioptions += "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1"

metric_definitions = [
    {"Name": "base_metric", "Regex": "<><><><><><>"}
]  # Add your custom metric definitions
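
The `Regex` placeholder above does not match any real log line. As a sketch of what a working definition could look like, assume the training script prints lines such as `train_loss: 2.3456`. This log format is hypothetical, so check the actual output of `train_gptj_smp_script.py` before using it. SageMaker applies the regex to the training logs and records the captured group as the metric value; the snippet below tests such a pattern locally with Python's `re` module.

```python
import re

# Hypothetical metric definition; the log format is an assumption
example_metric_definitions = [
    {"Name": "train_loss", "Regex": r"train_loss: ([0-9\.]+)"},
]

# Made-up log line used only to test the pattern
sample_log_line = "step 10 train_loss: 2.3456 lr: 0.0002"

match = re.search(example_metric_definitions[0]["Regex"], sample_log_line)
loss = float(match.group(1))
print(loss)
```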

## Set Up SageMaker Studio Experiment
Create or load a [SageMaker Experiment](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) for the example training job. This will create an experiment trial object in SageMaker Studio.

In [None]:
from time import gmtime, strftime

# Specify your experiment name
experiment_name = "smp-gptj-pipeline-parallel"
# Specify your trial name
trial_name = f"{experiment_name}-trial1"

all_experiment_names = [exp.experiment_name for exp in Experiment.list()]
# Load the experiment if it exists, otherwise create
if experiment_name not in all_experiment_names:
    experiment = Experiment.create(
        experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client
    )
else:
    experiment = Experiment.load(
        experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client
    )

# Create the trial
trial = Trial.create(
    trial_name="smp-{}-{}".format(trial_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime())),
    experiment_name=experiment.experiment_name,
    sagemaker_boto_client=sm_boto_client,
)

## Specify Essential Parameters for a SageMaker Training Job

Next, you will use the [`SageMaker Estimator API`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define a SageMaker Training Job, passing values through the following parameters for training job name, the number of EC2 instances, the instance type, and the size of the volume attached to the instances. 

* `instance_count`
* `instance_type`
* `volume_size`
* `base_job_name`

### Update the Type and Number of EC2 Instance to Use

The instance type and the number of instances you specify for the `instance_type` and `instance_count` parameters, respectively, determine the total number of GPUs (world size).

$$ \text{(world size)} = \text{(number of GPUs on a single instance)} \times \text{(number of instances)} $$

In [None]:
# define instance type for a remote training
if not local_training:
    instance_type = "ml.p4d.24xlarge"
instance_count = 1

In [None]:
if instance_type in ['ml.p3.16xlarge', 'ml.p3dn.24xlarge', 'ml.g5.48xlarge', 'ml.p4d.24xlarge']:
    processes_per_host = 8
elif instance_type == 'ml.p2.16xlarge':
    processes_per_host = 16
else:
    processes_per_host = 4

To look up the number of GPUs of different instance types, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/). Use the section **Accelerated Computing** to see general purpose GPU instances. Note that, for example, a given instance type `p4d.24xlarge` has a corresponding instance type `ml.p4d.24xlarge` in SageMaker.
For SageMaker supported `ml` instances and cost information, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/). 
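
The GPU counts used in the cell above can also be kept in a lookup table, which makes the world size formula easy to evaluate. This is a sketch: the table only covers the instance types named in this notebook, and `GPUS_PER_INSTANCE` and `world_size` are hypothetical names chosen for this example.

```python
# GPU counts for the instance types referenced in this notebook
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.g5.48xlarge": 8,
    "ml.p4d.24xlarge": 8,
    "ml.p2.16xlarge": 16,
}


def world_size(instance_type, instance_count):
    """Total number of GPUs = GPUs per instance x number of instances."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


print(world_size("ml.p4d.24xlarge", 1))
print(world_size("ml.p3.16xlarge", 4))
```

An unknown instance type raises a `KeyError` here instead of silently falling back to a default, which makes a typo in the instance name easier to spot.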

### Attach an EBS Volume to the Training Instance
The volume size you specify in `volume_size` must be larger than your input data size. In this example, the volume size is set to 900GB.

In [None]:
volume_size = 900

### Specify a Base Job Name

In [None]:
machine_str = instance_type.split(".")[1] + instance_type.split(".")[2][:3]
pp_degree = hyperparameters["pipeline_parallel_degree"]
tp_degree = hyperparameters["tensor_parallel_degree"]
base_job_name = f'smp-{model_config}-{machine_str}-tp{tp_degree}-pp{pp_degree}-bs{hyperparameters["per_device_train_batch_size"]}'

### Create a SageMaker HuggingFace 🤗 Estimator

The following cell constructs a Hugging Face estimator using the parameters defined above. To see how the SageMaker pipeline parallelism is enabled in the script, see the `train_gptj_smp_script.py` file and the SageMaker Model Parallel documentation.

In [None]:
if local_training:
    instance_type = 'local_gpu'

mpi = {
    "enabled": True,
    "processes_per_host": processes_per_host,
    "custom_mpi_options": mpioptions,
}

smdistributed = {
    "modelparallel": {
        "enabled": True,
        "parameters": {
            "ddp": hyperparameters["ddp"],
            "microbatches": 2,
            # `partitions` is a required parameter in the current SageMaker SDK, so it must be passed;
            # it maps to the same setting as `pipeline_parallel_degree`
            "partitions": hyperparameters["pipeline_parallel_degree"],
            "shard_optimizer_state": hyperparameters["shard_optimizer_state"] > 0,
            "prescaled_batch": hyperparameters["prescaled_batch"] > 0,
            "optimize": hyperparameters["optimize"],
            "auto_partition": True,
            "default_partition": 0,
            "offload_activations": True,
            "active_microbatches": 2,
        },
    }
}


distribution = {"mpi": mpi, "smdistributed": smdistributed}

In [None]:
smp_estimator = HuggingFace(
    entry_point="train_gptj_smp_script.py",
    source_dir=os.getcwd(),
    role=role,
    instance_type=instance_type,
    volume_size=volume_size,
    instance_count=instance_count,
    sagemaker_session=sagemaker_session,
    distribution=distribution,
    pytorch_version="1.10.2",
    transformers_version="4.17.0",
    py_version="py38",
    output_path=s3_output_bucket,
    hyperparameters=hyperparameters,
    debugger_hook_config=False,
    disable_profiler=True,
    base_job_name=base_job_name,
)

Finally, run the estimator to launch the SageMaker training job for the GPT-J model with pipeline parallelism.


If you receive a `ResourceLimitExceeded` error message when running the following cell, you can request an increase on the default quota by contacting [AWS support](https://console.aws.amazon.com/support). Open the [AWS Support Center](https://console.aws.amazon.com/support), and then choose Create case. Choose Service limit increase. For Limit Type choose SageMaker Training Jobs. Complete the rest of the form and submit.

In [None]:
smp_estimator.fit(
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    logs=True,
)

In [None]:
model_location = smp_estimator.model_data

In [None]:
model_location

## Accessing the Training Logs

You can access the training logs from [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). Make sure to look at the logs of **algo-1** because that is the main node whose output stream will have the training job logs.

You can use CloudWatch to track SageMaker GPU and memory utilization during training and inference. To view the metrics and logs that SageMaker writes to CloudWatch, see [SageMaker Jobs and Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs) in the Amazon SageMaker Developer Guide.

If you are a new user of CloudWatch, see [Getting Started with Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/GettingStarted.html). 

For additional information on monitoring and analyzing Amazon SageMaker training jobs, see [Monitor and Analyze Training Jobs Using Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).

## Deploying Trained Model for Inference

In most cases, a trained model can be deployed on a single device for inference because inference requires far less memory than training: there are no gradients or optimizer states to hold. You can use the SMP API to create a single, unified model after training. See the [smp.save()](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#apis-for-saving-and-loading) function for PyTorch.

After you build and train your models, you can deploy them to get predictions in one of two ways:

* To set up a persistent endpoint to get predictions from your models, use SageMaker hosting services. For an overview on deploying a single model or multiple models with SageMaker hosting services, see [Deploy a Model on SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html#how-it-works-hosting).
* To get predictions for an entire dataset, use SageMaker batch transform. For an overview on deploying a model with SageMaker Batch Transform, see [Get Inferences for an Entire Dataset with Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html).

To learn more about deploying models for inference using SageMaker, see [Deploy Models for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html). 


### Deploy the model using `model_data`

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker 

role = sagemaker.get_execution_role()
model_data = model_location

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data = model_data,  # path to your trained sagemaker model
    role = role, # iam role with permissions to create an Endpoint
    transformers_version = "4.17", # transformers version used
    pytorch_version = "1.10", # pytorch version used
    py_version = "py38", # python version of the DLC
)

In [None]:
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count = 1,
   instance_type = "ml.m5.4xlarge"
)

In [None]:
# example request, you always need to define "inputs"
data = {
   "inputs": "The new Hugging Face SageMaker DLC makes it super easy to deploy models in production. It is great!"
}

# request
predictor.predict(data)

In [None]:
predictor.predict({
    'inputs': "Can you please let us know more details about your "
})

Parameterized request

In [None]:
%%time
predictor.predict({
    "inputs": "Can you please let us know more ",
    "parameters": {
        "min_length": 220,
        "temperature": 0.6,
    }
})

Custom end of sequence token

In [None]:
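%%time
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

end_sequence = "."
temperature = 40
max_generated_token_length = 100
prompt = "Can you please let us know more details about your "

# count the prompt length in tokens, not characters
prompt_token_count = len(tokenizer(prompt)["input_ids"])

predictor.predict({
    "inputs": prompt,
    "parameters": {
        # upper bound on the total length (prompt tokens + generated tokens)
        "max_length": prompt_token_count + max_generated_token_length,
        "temperature": temperature,
        # stop generating at the first "." token
        "eos_token_id": tokenizer.convert_tokens_to_ids(end_sequence),
    }
})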

In [None]:
predictor.delete_endpoint()