# Model Training

In this notebook we will use the original data and the privatized data we just created to train models and compare their accuracy. We'll be demonstrating what is commonly referred to in the literature as the privacy/utility tradeoff: we usually have to sacrifice some model accuracy in exchange for better privacy. The epsilon value is usually the knob used to manage this tradeoff.

## Get location of data and vectors

Our first step will be to get the locations of the privatized dataset we created in the data privatization notebook, along with the pre-trained vector embeddings we'll be using. 

In [None]:
import sys
sys.path.append('./src/')

In [None]:
import sagemaker
from botocore.config import Config as BotocoreConfig
from package import config
import boto3

# We create a SageMaker session and get the IAM role we'll be using
sm_boto = boto3.client('sagemaker', config=BotocoreConfig(connect_timeout=5, read_timeout=60, retries={'max_attempts': 30}))
sagemaker_session = sagemaker.Session(sagemaker_client = sm_boto)
role = config.PRIVACY_SAGEMAKER_IAM_ROLE

# Note the input and output buckets
solution_bucket = f"{config.SOLUTIONS_S3_BUCKET}-{config.AWS_REGION}"
bucket = config.S3_BUCKET
solution_prefix = config.SOLUTION_NAME
prefix = solution_prefix

# These are the embeddings that we'll use for the model, same as with the privatization step.
s3_vectors = "s3://{}/{}/vectors/glove.6B.300d.txt.gz".format(solution_bucket, solution_prefix)

In the pre-processing notebook we created two training files, one with the original data, and one with the privatized version of the same reviews.

In [None]:
privatized_train_data = 's3://{}/{}/processed-data/reviews-privatized'.format(bucket, prefix)
sensitive_train_data = "s3://{}/{}/processed-data/reviews-sensitive".format(bucket, prefix)
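
Before launching training, it can be worth confirming that both prefixes actually contain objects. The following is a minimal sketch, assuming a boto3 S3 client is passed in; `split_s3_uri` and `prefix_has_objects` are hypothetical helper names and are not part of the solution code.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def prefix_has_objects(s3_client, uri):
    """Return True if at least one object exists under the given S3 prefix."""
    bucket, key_prefix = split_s3_uri(uri)
    # MaxKeys=1 keeps the call cheap: we only need to know whether anything exists.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key_prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0
```

For example, `prefix_has_objects(boto3.client("s3"), privatized_train_data)` should return `True` once the privatization notebook has finished writing its output.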

### Build the models

Our training entry point is the `train.py` file under `./src/package/model/`. The same directory contains a `requirements.txt` file, which Amazon SageMaker uses to install the required libraries in the container for our training instances.

We want to train one model on the original data and one on the privatized data. The same training script supports both, so we only change the input dataset for each estimator.

In [None]:
# Create an estimator for the original data.
from sagemaker.pytorch import PyTorch

sensitive_train_output = 's3://{}/{}/sensitive-output'.format(bucket, prefix)
sensitive_estimator = PyTorch(entry_point='train.py',
                    source_dir='./src/package/model/',
                    sagemaker_session=sagemaker_session,
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=1,
                    train_instance_type=config.TRAINING_INSTANCE_TYPE,
                    base_job_name=f"{config.SOLUTION_NAME}",
                    output_path=sensitive_train_output)

In [None]:
# Create an estimator for the privatized data.
privatized_train_output = 's3://{}/{}/privatized-output'.format(bucket, prefix)
privatized_estimator = PyTorch(entry_point='train.py',
                    source_dir='./src/package/model/',
                    sagemaker_session=sagemaker_session,
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=1,
                    train_instance_type=config.TRAINING_INSTANCE_TYPE,
                    base_job_name=f"{config.SOLUTION_NAME}",
                    output_path=privatized_train_output)

Amazon SageMaker lets us launch a training job in the background and continue working, by passing `wait=False` to `fit`. We use this here to launch the training job on the original data and, immediately afterwards, the one on the privatized data. The two jobs then run in parallel, so we don't have to wait for them to finish in sequence.

In [None]:
sensitive_estimator.fit({"train": sensitive_train_data, "vectors": s3_vectors}, wait=False)

In [None]:
privatized_estimator.fit({"train": privatized_train_data, "vectors": s3_vectors}, wait=False)
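
The dictionary keys passed to `fit` are channel names. SageMaker downloads each channel into the training container and exposes its local path through an environment variable named `SM_CHANNEL_<NAME>`, which is why one script can serve both estimators. The sketch below shows how an entry point could pick up those paths. It illustrates that convention only and is not the contents of the actual `train.py`; `parse_channels` is a hypothetical name.

```python
import argparse
import os


def parse_channels(argv=None, environ=None):
    """Resolve local input/output directories from SageMaker's environment."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Each channel named in fit({...}) appears as SM_CHANNEL_<UPPERCASED NAME>.
    parser.add_argument("--train", default=environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--vectors", default=environ.get("SM_CHANNEL_VECTORS"))
    # SM_MODEL_DIR is where the script should write the model artifacts.
    parser.add_argument("--model-dir", default=environ.get("SM_MODEL_DIR"))
    # Ignore any other arguments (for example hyperparameters) in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Because only the contents of the `train` channel differ between the two jobs, the script itself never needs to know whether it is reading sensitive or privatized reviews.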

We have now started both training jobs, and they are running in the background. Next, we'll attach to those jobs to wait until both are finished and to get the estimators' output.

In [None]:
privatized_estimator = PyTorch.attach(training_job_name=privatized_estimator.latest_training_job.name)

In [None]:
sensitive_estimator = PyTorch.attach(training_job_name=sensitive_estimator.latest_training_job.name)
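
If you would rather check on both jobs without blocking on each in turn, one option is to poll their status through the SageMaker API. This is a sketch under stated assumptions: `wait_for_training_jobs` is a hypothetical helper, and it relies only on the `describe_training_job` call of the boto3 SageMaker client and its `TrainingJobStatus` field.

```python
import time

# Statuses after which a training job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_jobs(sm_client, job_names, poll_seconds=30, sleep=time.sleep):
    """Poll until every named training job reaches a terminal status."""
    pending = set(job_names)
    final = {}
    while pending:
        for name in sorted(pending):
            response = sm_client.describe_training_job(TrainingJobName=name)
            status = response["TrainingJobStatus"]
            if status in TERMINAL_STATUSES:
                final[name] = status
        pending -= set(final)
        if pending:
            sleep(poll_seconds)
    return final
```

With the objects defined earlier in this notebook, this could be called as `wait_for_training_jobs(sm_boto, [sensitive_estimator.latest_training_job.name, privatized_estimator.latest_training_job.name])`. The returned dictionary maps each job name to its final status, so a `Failed` job is visible before moving on to evaluation.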

## Accuracy evaluation

Now that we have both models trained, we can evaluate their performance on a test set to see how the perturbation has affected model accuracy.

Since we only want to evaluate the two different models on an existing test dataset, we can use an Amazon SageMaker Processing job to make predictions for all our test data and output the accuracy. We will use the same Docker container we used for our privatization job to run our predictions, but this time we're making use of GPU P3 instances to speed up model inference. 

In [None]:
account_id = boto3.client('sts').get_caller_identity().get('Account')

ecr_repository = config.SAGEMAKER_PROCESSING_JOB_CONTAINER_NAME
ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, config.AWS_REGION, ecr_repository)


As in the previous notebook, we set up a script processor, only switching over to a P3 instance.

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   sagemaker_session=sagemaker_session,
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type=config.PROCESSING_INSTANCE_TYPE)

As with our training jobs, we can start both evaluation jobs asynchronously and let them run in parallel. We change the source and destination locations to match the model trained on sensitive and privatized data respectively, and run the jobs, giving each a job name that we can refer to later:

In [None]:
import time
test_data = "s3://{}/{}/data".format(solution_bucket, solution_prefix)

In [None]:
sensitive_model_evaluation = 's3://{}/{}/sensitive-model-evaluation'.format(bucket, prefix)
sensitive_job_name = "sensitive-model-evaluation-{}".format(int(time.time()))

script_processor.run(code='src/package/model/inference.py',
                     inputs=[ProcessingInput(source=test_data,
                                             destination='/opt/ml/processing/data'),
                            ProcessingInput(source=sensitive_estimator.model_data,
                                             destination='/opt/ml/processing/model')],
                     outputs=[ProcessingOutput(destination=sensitive_model_evaluation,
                                               source='/opt/ml/processing/output')],
                     job_name=sensitive_job_name,
                     wait=False)

In [None]:
privatized_model_evaluation = 's3://{}/{}/privatized-model-evaluation'.format(bucket, prefix)
privatized_job_name = "privatized-model-evaluation-{}".format(int(time.time()))

script_processor.run(code='src/package/model/inference.py',
                     inputs=[ProcessingInput(source=test_data,
                                             destination='/opt/ml/processing/data'),
                            ProcessingInput(source=privatized_estimator.model_data,
                                             destination='/opt/ml/processing/model')],
                     outputs=[ProcessingOutput(destination=privatized_model_evaluation,
                                               source='/opt/ml/processing/output')],
                     job_name=privatized_job_name,
                     wait=False)

We can now wait until the jobs are finished. When the evaluation of the model trained on the original data is finished, the evaluation of the model trained on privatized data should be done soon after.

In [None]:
from sagemaker.processing import ProcessingJob

sensitive_job = ProcessingJob.from_processing_name(
    sagemaker_session, processing_job_name=sensitive_job_name)
sensitive_job.wait()

In [None]:
privatized_job = ProcessingJob.from_processing_name(
    sagemaker_session, processing_job_name=privatized_job_name)
privatized_job.wait()

In [None]:
from sagemaker.s3 import S3Downloader
from pathlib import Path

Path('./sensitive-model-evaluation').mkdir(exist_ok=True)
S3Downloader.download(sensitive_model_evaluation, "./sensitive-model-evaluation")

### Accuracy on original data

In [None]:
!cat ./sensitive-model-evaluation/accuracy-metrics.txt


In [None]:
from IPython.display import Image

# Sensitive Data - ROC Curve
Image(url= "./sensitive-model-evaluation/accuracy-ROC.png")

### Accuracy on privatized data

In [None]:
Path('./privatized-model-evaluation').mkdir(exist_ok=True)
S3Downloader.download(privatized_model_evaluation, "./privatized-model-evaluation")

In [None]:
!cat ./privatized-model-evaluation/accuracy-metrics.txt


In [None]:
# Privatized Data - ROC Curve
Image(url= "./privatized-model-evaluation/accuracy-ROC.png")
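
To compare the two metrics files side by side rather than by eye, a small parser can help. The format of `accuracy-metrics.txt` is not shown in this notebook, so the sketch below assumes one `name: value` pair per line; that format, and the helper names `parse_metrics` and `metric_deltas`, are assumptions for illustration and may need adjusting to the real file.

```python
def parse_metrics(text):
    """Parse 'name: value' lines into a dict of floats, skipping other lines."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # not a numeric metric line
    return metrics


def metric_deltas(sensitive, privatized):
    """Return privatized minus sensitive for every metric present in both."""
    return {k: privatized[k] - sensitive[k] for k in sensitive if k in privatized}
```

Under that assumption, the two files could be loaded with `parse_metrics(Path('./sensitive-model-evaluation/accuracy-metrics.txt').read_text())` and the equivalent call for the privatized directory, then passed to `metric_deltas`. Negative values indicate metrics where the privatized model scored lower.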

While the exact numbers for accuracy might vary slightly between the two models, we should see that their performance is very similar.

As we saw in the pre-processing examples, we modified the exact words in the reviews quite heavily, making it harder to identify the individuals who wrote them. Even so, the privatized model lost very little accuracy compared to the one trained on the original data.

Using the proposed algorithm, customers can provide better privacy for their users, while maintaining accurate models that help meet their business needs.