# Data Privatization

The privacy of users is one of the most important aspects of maintaining customer trust. Recently, researchers have demonstrated that it is [sometimes](https://arxiv.org/abs/1610.05820) [possible](https://arxiv.org/abs/1811.00513) to extract user data from machine learning models. In this solution we will demonstrate that it's possible to train NLP models that help protect the privacy of users, while maintaining high accuracy.

In this first notebook we will create a privatized dataset from Amazon product reviews, using Amazon SageMaker Processing. This process involves gathering some statistics from the original dataset, then applying a privatization mechanism that helps protect the privacy of users when the privatized data are used to train downstream models.

We start by explaining how the privatization algorithm we will use works. The algorithm was developed by [Amazon scientists](https://www.amazon.science/blog/preserving-privacy-in-analyses-of-textual-data).

## Demonstration of the privatization mechanism

The main idea of the privatization algorithm we'll be using is to replace the sensitive text in a sentence with other words that are semantically similar. We do this by moving in embedding space, from the original word, towards a carefully crafted noise vector, and obtaining a new word. In the example below, we replace the word "phone" with the word "mobile". The technique is a form of _differential privacy_ that perturbs data in such a way that attackers cannot claim with certainty whether a particular sentence originated from a user, or was the result of a perturbation.

![Example of noise injection](./images/privatization-example.png)
Image Source: https://www.amazon.science/blog/preserving-privacy-in-analyses-of-textual-data

The steps of the algorithm are the following:

* For each word in the dataset:
    * Retrieve the word's embedding vector $w$. In this example we use [GloVe](https://nlp.stanford.edu/projects/glove/) 300-dimensional embeddings.
    * Generate a noisy vector $\delta$, using Laplacian noise. The parameter `epsilon` determines the amount of noise added.
        * We use this noise to find embeddings of similar words that are close to the original we're trying to replace.
    * Retrieve the embedding vector that is closest to the noisy vector $w + \delta$.
    * Get the word that corresponds to that closest vector.
        * For example, the closest word to Germany + `noise_vector` might end up being France.
    * Replace the original word with the retrieved word that was closest to the noisy embedding.
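
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the solution's `replace_word` implementation: the toy 3-dimensional embeddings, the `sample_noise` and `privatize_word` helpers, and the choice to draw noise as a uniformly random direction scaled by a Gamma-distributed magnitude are all assumptions made for the example. A brute-force search stands in for the approximate nearest neighbor index.

```python
import math
import random

# Toy 3-dimensional embeddings, made up for illustration
# (the notebook itself uses 300-dimensional GloVe vectors).
EMBEDDINGS = {
    "germany": [0.9, 0.1, 0.0],
    "france": [0.8, 0.2, 0.1],
    "phone": [0.0, 0.9, 0.3],
    "mobile": [0.1, 0.8, 0.4],
}


def sample_noise(dims, epsilon, rng):
    """Assumed noise model: random direction, magnitude ~ Gamma(dims, 1/epsilon)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in direction))
    magnitude = rng.gammavariate(dims, 1.0 / epsilon)
    return [magnitude * x / norm for x in direction]


def privatize_word(word, epsilon, rng):
    """Perturb the word's embedding and return the nearest vocabulary word."""
    w = EMBEDDINGS[word]
    delta = sample_noise(len(w), epsilon, rng)
    noisy = [wi + di for wi, di in zip(w, delta)]
    # Brute-force nearest neighbor by Euclidean distance.
    return min(EMBEDDINGS, key=lambda cand: math.dist(EMBEDDINGS[cand], noisy))


rng = random.Random(0)
print([privatize_word("germany", epsilon=10, rng=rng) for _ in range(5)])
```

With a very large `epsilon` the noise becomes negligible and the sketch returns the original word; with a small `epsilon` the output varies across the toy vocabulary.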

The mechanism we just described requires the following artifacts to work:

* A word-index mapping. This allows us to map word strings to vectors and back. Here we use a `torchtext.vocab.Vocab` object.
* An approximate nearest neighbor index. This allows us to quickly find the words that are close to the noisy vector in our embeddings.
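
As a rough picture of the first artifact, a word-index mapping can be thought of as a pair of lookups. The `stoi` name mirrors the attribute used on the vocabulary object later in this notebook, and `itos` is its usual counterpart; the tiny vocabulary and the convention that index 0 stands for unknown words are assumptions for this sketch, chosen to match the out-of-vocabulary check in `count_replacements`.

```python
from collections import defaultdict

# Index 0 is reserved for unknown words in this sketch.
itos = ["<unk>", "germany", "france", "phone"]  # index -> string
# string -> index; missing words fall back to index 0
stoi = defaultdict(int, {word: idx for idx, word in enumerate(itos)})

print(stoi["france"])        # index of a known word
print(itos[stoi["france"]])  # round trip back to the string
print(stoi["zeppelin"])      # unknown words map to the reserved index
```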


## Interactive Examples

To quickly illustrate the algorithm at work, we include pre-trained files in the solution. We will use these files 
to create interactive privatization examples, then perform the pre-processing ourselves on the Amazon Review data, using Amazon SageMaker Processing.


In [None]:
# Set up the notebook's dependencies
import sys
sys.path.append('./src/')

In [None]:
from package import config
solutions_bucket = f"{config.SOLUTIONS_S3_BUCKET}-{config.AWS_REGION}"
solution_name = config.SOLUTION_NAME

In [None]:
# Download the required artifacts
!aws s3 sync s3://$solutions_bucket/$solution_name/artifacts/ ./artifacts

In [None]:
# This function will replace a single word with a privatized version
from package.data_privatization.data_privatization import replace_word

# The torchtext vocab contains mappings from vectors to words and back
from torchtext import vocab
import torch
from os.path import join

artifact_prefix = "./artifacts"
train_vocab = torch.load(join(artifact_prefix, "vocab.pt"))
embedding_dims = 300 # Because we use the 300-dim GloVe embeddings.

We use the [annoy](https://github.com/spotify/annoy/) library to find the nearest vectors quickly.

In [None]:
from annoy import AnnoyIndex
ann_index = AnnoyIndex(embedding_dims, 'euclidean')
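# Load the index outside the assert so the call still runs under `python -O`
index_loaded = ann_index.load(join(artifact_prefix, "index.ann"))
assert index_loaded, "Failed to load the Annoy index"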

Now we have everything in place to see the privatization mechanism in action. The `count_replacements` function below will take a word as input with a couple of parameters and use the mechanism to return a privatized version of the word.

The `epsilon` value determines the amount of noise introduced. Smaller values of `epsilon` mean more noise is added,
making it less likely we'll get the original word back as output.

In [None]:
# Function to count the replacements for one word
import pandas as pd
from collections import Counter


def count_replacements(word, epsilon, num_replacements=100):
    if train_vocab.stoi[word] == 0:
        print("WARNING: You chose an out-of-vocabulary word, the returned words will be unrelated.")
    replacement_counter = Counter()
    for i in range(num_replacements):
        replacement_counter[replace_word(word, train_vocab, epsilon, ann_index, embedding_dims)] += 1
    percentages = [count/num_replacements for count in sorted(replacement_counter.values(), reverse=True)]
    counts = pd.DataFrame(replacement_counter.most_common(), columns=["Word", "Count"])
    counts['Ratio'] = percentages

    return counts

### Privatization example


Let us use an example to demonstrate the process of privatization for a single word.
You can choose a word to replace, and play around with the `epsilon` value, which determines the amount
of noise introduced to the original embedding vector.
You will observe that as you change the `epsilon` value to smaller values you will get a larger variety of words back,
while changing it to larger values will generally tend to return the original word. Feel free to replace the original word with your own, but note that the vocabulary only has 25,000 words that are present in the Amazon review data, so you'll have more luck using common words.
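
One way to build intuition for how `epsilon` scales the noise: suppose the noise magnitude is drawn from a Gamma distribution with shape equal to the embedding dimension and scale `1/epsilon`. This is an assumption for this sketch; the solution's `replace_word` may sample noise differently. The average noise length is then `dims / epsilon`, so halving `epsilon` doubles the typical distance moved in embedding space. The helper name `expected_noise_norm` is made up for this example.

```python
def expected_noise_norm(dims, epsilon):
    """Mean of a Gamma(shape=dims, scale=1/epsilon) distribution."""
    return dims / epsilon


for eps in (5, 25, 100):
    print(f"epsilon={eps}: expected noise norm = {expected_noise_norm(300, eps):.1f}")
```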

In [None]:
count_replacements(word="germany", epsilon=25)

Next, we will perform the steps that are necessary to produce the privatization artifacts we just used above, and privatize a dataset of
Amazon reviews. In the next notebook, we will use these reviews to create two sentiment classification models, one
trained on the privatized data and one on the original data, and compare their performance in terms of utility and
privacy.

## Perform privatization process on review data

Our next step is to perform the privatization process demonstrated above to every review in the Amazon reviews dataset.
Let's first set up our environment with the input and output buckets on S3.

In [None]:
import sagemaker

from package import config, container_build

# We create a SageMaker session and get the IAM role we'll be using
sagemaker_session = sagemaker.Session()
role = config.PRIVACY_SAGEMAKER_IAM_ROLE

# Get the input and output buckets
output_bucket = config.S3_BUCKET
solution_prefix = config.SOLUTION_NAME
prefix = solution_prefix

# These are the embeddings that we'll use for our privatization mechanism.
s3_vectors = "s3://{}/{}/vectors/glove.6B.300d.txt.gz".format(solutions_bucket, solution_prefix)
# The input training data lies on S3
sensitive_train_data = "s3://{}/{}/data/train_examples.csv".format(solutions_bucket, solution_prefix)

Next, we'll set up our output destinations for the outcomes of the privatization process.

In [None]:
# Here is where our processed data will go
processed_data = 's3://{}/{}/processed-data'.format(output_bucket, prefix)
# The artifacts created by the privatization mechanism go here; we can use them to privatize new inputs
artifacts = 's3://{}/{}/artifacts'.format(output_bucket, prefix)

Let's download a sample of the data and take a look at one example:

In [None]:
from sagemaker.s3 import S3Downloader
data_sample = "s3://{}/{}/data/train_1k.csv".format(solutions_bucket, solution_prefix)
S3Downloader.download(s3_uri=data_sample, local_path='.')

In [None]:
!head -2 train_1k.csv

We can see an example review above, with the sentiment at the end, 1 for negative, 2 for positive.
Let's move on to performing the data privatization now.

## Build container for data privatization

For our next step we'll prepare the Docker container that we will use to run the privatization on our data.
To ensure that our approach is scalable we are using Apache Spark to parallelize the privatization process.
You can view the complete script under `src/package/data_privatization/data_privatization.py` that applies the steps we did above for every sentence in the Amazon reviews
dataset.

The training dataset we use is only 25,000 records, so we run Spark locally on a single node, making use of all its cores.
If your dataset is very large (billions of records) you might want to use distributed processing to process the complete dataset quickly. For more information on using a Spark container with Amazon SageMaker see [here](https://docs.aws.amazon.com/sagemaker/latest/dg/use-spark-processing-container.html). The process of building the container should take around **5 minutes** to complete.


In [None]:
import boto3
import os

region = config.AWS_REGION
account_id = config.AWS_ACCOUNT_ID

ecr_repository = config.SAGEMAKER_PROCESSING_JOB_CONTAINER_NAME

if config.SAGEMAKER_PROCESSING_JOB_CONTAINER_BUILD == "local":
    old_cwd = os.getcwd()
    os.chdir("./src/package/data_privatization/")
    !bash container/build_and_push.sh $ecr_repository $region $account_id
    os.chdir(old_cwd)
else:
    container_build.build(config.SAGEMAKER_PROCESSING_JOB_CONTAINER_BUILD)

ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)

## Run data privatization job with Amazon SageMaker Processing

Now that our container is built, we can run the privatization job. The following cell will launch an instance using the
container we just created, and execute the `data_privatization.py` script. 

This will create a new privatized dataset based
on the original data, as well as a set of output artifacts we can use to privatize words and sentences on the fly, replicating the results of the demo above.
It will also lightly pre-process the original data and create new output that we will use to train a model in the next notebook.

Note: The cell below should take around **15 minutes** to run.

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   sagemaker_session=sagemaker_session,
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.c5.4xlarge')

script_processor.run(code='src/package/data_privatization/data_privatization.py',
                     inputs=[ProcessingInput(source=sensitive_train_data,
                                             destination='/opt/ml/processing/input'),
                            ProcessingInput(source=s3_vectors,
                                             destination='/opt/ml/processing/vectors')],
                     outputs=[ProcessingOutput(destination=processed_data,
                                               source='/opt/ml/processing/output'),
                             ProcessingOutput(destination=artifacts,
                                               source='/opt/ml/processing/artifacts')],
                    arguments=['--epsilon', '23'])
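
The `arguments` list above is passed to the script as command-line arguments. A script could read such a flag with `argparse`, as in the sketch below. This shows the general pattern only and is not copied from `data_privatization.py`; the default value and help text are hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(description="Sketch of reading a privacy parameter.")
parser.add_argument("--epsilon", type=float, default=23.0,
                    help="Privacy parameter; smaller values add more noise.")

# Parse the same argument list that the processing job passes to the script.
args = parser.parse_args(["--epsilon", "23"])
print(args.epsilon)
```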

## View results of privatization

Once the above process has finished we have access to two new CSV files: one with original reviews and one with the privatized version of each review. We will use these two files to train separate models in the next notebook. For now, let's take a look at one example output from each of the created training files on S3.


In [None]:
sensitive_sample = processed_data + "/reviews-sensitive/part-00000"
privatized_sample = processed_data + "/reviews-privatized/part-00000"

In [None]:
!aws s3 cp $sensitive_sample sensitive_sample.csv
!aws s3 cp $privatized_sample privatized_sample.csv

In [None]:
!tail -1 sensitive_sample.csv

In [None]:
!tail -1 privatized_sample.csv

We see in the above comparison that many words will remain the same, and many will change. While individual reviews no longer make grammatical sense after the privatization, in aggregate they should maintain the original review's sentiment. We put this to the test in the next notebook where we train two sentiment classification models, one on the original data and one on the privatized data.
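
To put a number on how much a review changed, one option is to compare the two versions token by token. The sketch below assumes the privatized review has the same number of whitespace-separated tokens as the original, which fits a mechanism that replaces words one for one; the `changed_fraction` helper and the sentence pair are made up for illustration and are not taken from the dataset.

```python
def changed_fraction(original, privatized):
    """Fraction of whitespace-separated tokens that differ, position by position."""
    orig_tokens = original.split()
    priv_tokens = privatized.split()
    if len(orig_tokens) != len(priv_tokens):
        raise ValueError("Expected the same number of tokens in both sentences.")
    if not orig_tokens:
        return 0.0
    changed = sum(o != p for o, p in zip(orig_tokens, priv_tokens))
    return changed / len(orig_tokens)


# Made-up sentence pair, not taken from the dataset.
print(changed_fraction("the phone works great", "the mobile works fine"))
```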

### Setting the epsilon parameter

One of the most important decisions one has to make when using differential privacy is setting the `epsilon` value that determines the amount of noise added to the data. This is largely an open research problem, and even harder for privacy in metric spaces, such as embeddings, because the effect of epsilon will depend on the density of the embeddings being used (see the [linked article](https://www.amazon.science/blog/preserving-privacy-in-analyses-of-textual-data) for more information). 

In this notebook we follow the suggestions of the original publication for the 300-dimensional GloVe embeddings and set an epsilon value of 23. A good rule of thumb is to set a business goal for how much utility loss is acceptable in the downstream task you're interested in, for example a 2% absolute accuracy loss on the test set, and then set epsilon to the lowest value that maintains the desired accuracy level.
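
The rule of thumb can be written as a small selection step. In the sketch below, the `pick_epsilon` helper is hypothetical and the accuracy figures are placeholder numbers for illustration only, not measured results; accuracies are given in whole percentage points to keep the comparison exact.

```python
def pick_epsilon(accuracy_by_epsilon, baseline_accuracy, max_loss):
    """Return the smallest epsilon whose accuracy is within max_loss of the baseline."""
    acceptable = [eps for eps, acc in accuracy_by_epsilon.items()
                  if baseline_accuracy - acc <= max_loss]
    return min(acceptable) if acceptable else None


# Placeholder test accuracies (percentage points) per epsilon; not measured results.
accuracy_by_epsilon = {10: 70, 15: 80, 23: 88, 30: 89}
print(pick_epsilon(accuracy_by_epsilon, baseline_accuracy=90, max_loss=2))
```

In practice each entry would come from training and evaluating a model on data privatized with that epsilon value.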


## Next up: Training and comparing original and privatized models

In the next notebook we will use the two datasets we have created here to train two separate models to predict the sentiment of the reviews.
We will then investigate how the privatization mechanism affects the accuracy of the models. You can move directly to [Notebook 2](./2.Model_Training.ipynb).