# Fine-tune a T5 transformer for text summarization
# Register the project and results in Weights & Biases (W&B)

## Logging into W&B
Go to your W&B profile page and copy your API key. Add a new cell below and run the following:

!wandb login PASTE_API_KEY_HERE

You only need to run this once, because it writes your credentials to your home directory.

W&B looks for a file named `secrets.env` relative to the training script and loads its contents into the environment when `wandb.init()` is called. You can generate a `secrets.env` file by calling `wandb.sagemaker_auth(path="source_dir")` in the script you use to launch your experiments. Be sure to add this file to your `.gitignore`!
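
Because `secrets.env` holds your API key, it is worth making sure git ignores it before you commit. The following is a minimal sketch using only the standard library; `ensure_gitignored` is a hypothetical helper name (not part of W&B), and it assumes the `.gitignore` sits in the current working directory unless another path is given.

```python
from pathlib import Path


def ensure_gitignored(entry: str, gitignore_path: str = ".gitignore") -> bool:
    """Append `entry` to the .gitignore file unless it is already listed.

    Returns True if the file was modified, False if the entry was present.
    """
    path = Path(gitignore_path)
    lines = path.read_text().splitlines() if path.exists() else []
    if entry in (line.strip() for line in lines):
        return False
    lines.append(entry)
    path.write_text("\n".join(lines) + "\n")
    return True


# Hypothetical usage from the repository root:
#   ensure_gitignored("secrets.env")
```

A bare `secrets.env` pattern matches the file in any subdirectory, so it also covers the copy written into the source directory.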

In [1]:
!pip install wandb

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/python -m pip install --upgrade pip' command.


In [2]:
#!wandb login <API_KEY>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ec2-user/.netrc


In [17]:
#!wandb login --relogin

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter: 
Aborted!


## Load the libraries and initialize SageMaker session

In [3]:
import os
import time
import sagemaker
from sagemaker import get_execution_role

import wandb

In [4]:
# Create a SageMaker session to work with
sagemaker_session = sagemaker.Session()
# Get the role of our user and the region
role = get_execution_role()
region = sagemaker_session.boto_session.region_name
print(role)
print(region)

arn:aws:iam::223817798831:role/service-role/AmazonSageMaker-ExecutionRole-20200708T194212
us-east-1


In [5]:
wandb.sagemaker_auth(path="source")

Failed to query for notebook name, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable


## Step 3: Upload the data to S3

Next, we upload the training dataset to the SageMaker S3 bucket so that the training job can read it.

In [6]:
# Set the variables for data locations
data_folder_name='data'
train_filename = 'news_summary.csv'
# Set the absolute path of the train data 
train_file = os.path.abspath(os.path.join(data_folder_name, train_filename))

In [7]:
# Specify your bucket name
bucket_name = 'edumunozsala-ml-sagemaker'
# Set the training data folder in S3
training_folder = r't5-summarization/train'
# Set the output folder in S3
output_folder = r't5-summarization'
# Set the checkpoint in S3 folder for our model 
#ckpt_folder = r't5-summarization/ckpt'

training_data_uri = r's3://' + bucket_name + r'/' + training_folder
output_data_uri = r's3://' + bucket_name + r'/' + output_folder
#ckpt_data_uri = r's3://' + bucket_name + r'/' + ckpt_folder

In [8]:
training_data_uri,output_data_uri #,ckpt_data_uri

('s3://edumunozsala-ml-sagemaker/t5-summarization/train',
 's3://edumunozsala-ml-sagemaker/t5-summarization')

In [11]:
inputs = sagemaker_session.upload_data(train_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)


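**NOTE:** The cell above uploads only the single file `news_summary.csv`, because `train_file` points to a file. Passing the path of the `data` directory instead would upload its entire contents.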
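
To see where the file should land in S3, the object URI can be derived from the bucket, the key prefix, and the file name. This is a minimal sketch using only the standard library; `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and it assumes that a single uploaded file is stored under `<key_prefix>/<file name>`.

```python
import os
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_file: str) -> str:
    """Build the S3 URI that a single uploaded file is expected to have."""
    file_name = os.path.basename(local_file)
    # S3 keys always use forward slashes, so join with posixpath
    key = posixpath.join(key_prefix.strip("/"), file_name)
    return "s3://{}/{}".format(bucket, key)


print(expected_s3_uri("edumunozsala-ml-sagemaker",
                      "t5-summarization/train",
                      "data/news_summary.csv"))
# s3://edumunozsala-ml-sagemaker/t5-summarization/train/news_summary.csv
```

Comparing this value with the `inputs` value returned by `upload_data` is a quick sanity check that the file went where the training job will look for it.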

## Step 4: Build and Train the PyTorch Model

A model in the SageMaker framework, in particular, comprises three objects:

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interacts with the others.

We will start with the PyTorch code that fine-tunes a pretrained T5 model, along with its training script. For this project, the code lives in the `source` folder, the directory passed as `source_dir` to the estimator later in this notebook. You can view the training script by running the cell below.

In [None]:
# Show the code of train.py
!cat source/train.py


In order to construct a PyTorch model using SageMaker we must provide SageMaker with a training script. We may optionally include a directory which will be copied to the container and from which our training code will be run. When the training container is executed it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

In this example, we require the packages: numpy, pandas, transformers, wandb.

### Training the model in a SageMaker training job

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained. The `source` directory contains a file called `train.py`, which holds most of the code needed to train our model.

SageMaker passes hyperparameters to the training script as command-line arguments. These arguments can then be parsed and used in the training script. To see how this is done, take a look at the provided `source/train.py` file.
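
As an illustration of that mechanism, the sketch below parses the two hyperparameters set in this notebook (`train_epochs` and `datafile`) with `argparse`. It is not the actual `train.py`: the `parse_args` helper, the `--data_dir` and `--model_dir` flags, and all default values are assumptions made for the example. `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR` are environment variables that SageMaker training containers set for a channel named `training` and for the model output directory.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, e.g. --train_epochs 1
    parser.add_argument("--train_epochs", type=int, default=1)
    parser.add_argument("--datafile", type=str, default="news_summary.csv")
    # Locations inside the container, read from SageMaker environment
    # variables; the fallbacks are assumptions for running outside SageMaker
    parser.add_argument("--data_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    # parse_known_args ignores any extra flags this sketch does not declare
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--train_epochs", "1", "--datafile", "news_summary.csv"])
# Full path of the training file as the script would see it
print(os.path.join(args.data_dir, args.datafile))
```

Using `parse_known_args` rather than `parse_args` keeps the script from failing if the estimator passes a flag the script does not declare.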

First, we need to set which type of instance will run our training:
- Local: We do not launch a real compute instance, just a container where our scripts will run. This is very useful for checking that the training script works, because starting a container is faster than starting a compute instance. Once we have confirmed that everything works, we must switch to a "real" training instance.
- ml.m4.4xlarge: A CPU instance.
- ml.p2.xlarge: A GPU instance, useful when training on a large volume of data.


In [9]:
# Select the type of instance to use for training
#instance_type='ml.m4.4xlarge' # CPU instance
instance_type='ml.p2.xlarge' # GPU instance
#instance_type='local'

In [10]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train.py',
                    source_dir='source',
                    role=role,
                    framework_version='1.4.0',
                    py_version='py3',
                    instance_count=1,
                    instance_type=instance_type,
                    output_path=output_data_uri,
                    code_location=output_data_uri,
                    base_job_name='t5-summarization',
                    hyperparameters={
                        'train_epochs': 1,
                        'datafile': 'news_summary.csv'
                    })

In [11]:
# Set the job name and show it
base_job_name='t5-summarization'
job_name = '{}-{}'.format(base_job_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)

t5-summarization-2020-11-14-19-24-45


In [12]:
# Call the fit method to launch the training job
estimator.fit({'training':training_data_uri}, job_name = job_name) 
#              experiment_config = experiment_config)

2020-11-14 19:24:47 Starting - Starting the training job...
2020-11-14 19:24:50 Starting - Launching requested ML instances......
2020-11-14 19:26:04 Starting - Preparing the instances for training......
2020-11-14 19:27:18 Downloading - Downloading input data......
2020-11-14 19:28:15 Training - Downloading the training image............
2020-11-14 19:30:05 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-11-14 19:30:07,254 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-11-14 19:30:07,278 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-11-14 19:30:07,283 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-11-14 19:30:07,676 sagemaker-containers INFO     Module default_user_module_name d

Using device cuda.
Get train and validation dataloaders.
FULL Dataset: (500, 2)
TRAIN Dataset: (450, 2)
TEST Dataset: (50, 2)
Creating the pretrained model
Activating WandB tracking
Initiating Fine-Tuning for the model on our dataset
[2020-11-14 19:30:55.596 algo-1:72 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-11-14 19:30:55.596 algo-1:72 INFO hook.py:191] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2020-11-14 19:30:55.596 algo-1:72 INFO hook.py:236] Saving to /opt/ml/output/tensors
[2020-11-14 19:30:55.596 algo-1:72 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2020-11-14 19:30:55.624 algo-1:72 INFO hook.py:376] Monitoring the collections: losses
[2020-11-14 19:30:55.624 algo-1:72 I


2020-11-14 19:36:21 Uploading - Uploading generated training model
Output Files generated for review
[2020-11-14 19:36:16.835 algo-1:72 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.
2020-11-14 19:36:20,729 sagemaker-containers INFO     Reporting training SUCCESS

2020-11-14 19:39:09 Completed - Training job completed
Training seconds: 711
Billable seconds: 711
