<a href="https://colab.research.google.com/github/arminmirrezai/text_privatization/blob/main/Mechanism_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Setup

## 1.1. Use Colab GPU for Training

In [None]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device.

In [None]:
import torch

# set global seed
torch.manual_seed(42)

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla V100-SXM2-16GB


## 1.2. Install annoy

Next we install the annoy library. The annoy library helps us find nearest vectors quickly.

In [None]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.17.0.tar.gz (646 kB)

## 1.3. Setting up PySpark in Colab

Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java. Spark will help us run the nearest-neighbour computations in parallel. For more information on why PySpark is useful, click [here](https://moviecultists.com/why-we-use-parallelize-in-spark).

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Next, we will install Apache Spark 3.0.0 with Hadoop 3.2.

In [None]:
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

Now, we just need to extract that archive.

In [None]:
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

There is one last thing that we need to install and that is the findspark library. It will locate Spark on the system and import it as a regular library.

In [None]:
!pip install -q findspark

Now that we have installed all the necessary dependencies in Colab, it is time to set the environment path. This will enable us to run PySpark in the Colab environment.

In [None]:
# set your spark folder to your system path environment. 
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

We need to locate Spark in the system. For that, we import findspark and call the findspark.init() method.

In [None]:
import findspark
findspark.init()

## 1.4. Install the Hugging Face Library

Next, we install the transformers package from Hugging Face, which gives us a PyTorch interface for working with BERT. We've selected the PyTorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but don't provide insight into how things work) and TensorFlow code (which contains lots of details but often sidetracks us into lessons about TensorFlow, when the purpose here is BERT).

At the moment, the Hugging Face library seems to be the most widely accepted and powerful PyTorch interface for working with BERT. In addition to supporting a variety of different pre-trained transformer models, the library also includes pre-built modifications of these models suited to a specific task. For example, in this notebook we will use ```BertForSequenceClassification```.



In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for

In [None]:
!pip install torchtext==0.9.0



# 2. Loading Data

## 2.1. Dutch Book Review Dataset

We'll use the [Dutch Book Review Dataset (DBRD)](https://github.com/benjaminvdb/DBRD) to be privatized. It's a set of reviews labeled as positive or negative.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Here we make a TabularDataset of our data. This way we can create a vocabulary and add the fine-tuned Dutch fastText model.

In [None]:
import torch
from torchtext.legacy import data

# give filepath
filepath = 'gdrive/My Drive/Colab Data/MSc thesis/dbrd_preprocessed/complete_data.csv'

# prepare sensitive data
TEXT = data.Field(sequential=True, include_lengths=True)

LABEL = data.LabelField(dtype=torch.float)

fields = [('title', None),
          ('sentiment', LABEL),
          ('review', TEXT)]

# make tabular dataset
reviews = data.TabularDataset(
    path=filepath, format='csv',
    fields=fields,
    skip_header=True)

Next we collect a vocabulary of the words that occur in the dataset. The phrase counts created for word2vec and fastText differ, because word2vec is trained only on uncased words whereas fastText has been trained on cased words.

The meaning of the word 'Parijs' is sensitive to capitalization in fastText but not in word2vec.

In [None]:
from collections import Counter
import re

def create_counter(input_data, add_space_split=False, w2v=False):
    phrase_count = Counter()
    for example in input_data:
      review = example.review
      original_text = " ".join(review)
      text = original_text.replace(
          " ' ", "").replace("'", "").replace("/", " ").replace("  ", " ").replace('"', '')
      if add_space_split:
        # raw strings avoid invalid-escape warnings for the regex pattern
        text = re.split(r'\!|\,|\n|\.|\?|\-|\;|\:|\(|\)|\s', text)
      else:
        text = re.split(r'\!|\,|\n|\.|\?|\-|\;|\:|\(|\)', text)
      sentences = [x.strip() for x in text if x.strip()]
      if w2v:
        for sentence in sentences:
          phrase_count[sentence.lower()] += 1
      else:
        for sentence in sentences:
          phrase_count[sentence] += 1
    return phrase_count

Next we create counts and compare the size for word2vec and fastText.

In [None]:
phrase_count = create_counter(reviews, add_space_split=True)
phrase_count_w2v = create_counter(reviews, add_space_split=True, w2v=True)

The size of the vocabulary, when sensitive to capitalization (fastText):

In [None]:
len(phrase_count)

144695

The size of the vocabulary, when not sensitive to capitalization (word2vec):

In [None]:
len(phrase_count_w2v)

129461
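
To see why the two counts differ, here is a minimal sketch on a made-up token list (the tokens are illustrative and not taken from DBRD). Counting cased tokens keeps 'Parijs' and 'parijs' apart, while lowercasing first merges them into one entry.

```python
from collections import Counter

# hypothetical tokens, for illustration only
tokens = ["Parijs", "parijs", "De", "de", "boek"]

# cased count (fastText-style): every spelling is its own entry
cased_count = Counter(tokens)

# uncased count (word2vec-style): spellings are merged after lowercasing
uncased_count = Counter(token.lower() for token in tokens)

print(len(cased_count), len(uncased_count))
```

The cased vocabulary is therefore always at least as large as the uncased one, which matches the two sizes reported above.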

## 2.2. Embedding Models

### 2.2.1. word2vec (Option 1)

Load the Dutch fine-tuned word2vec embedding model.

In [None]:
from torchtext.vocab import Vectors

# load embeddings using torchtext
# vectors = Vectors('gdrive/My Drive/Colab Data/MSc thesis/word2vec/word2vec_coosto') # file created by gensim
vectors_w2v = Vectors('gdrive/My Drive/Colab Data/MSc thesis/word2vec/combined-320.txt')

  0%|          | 0/1442950 [00:00<?, ?it/s]Skipping token b'1442950' with 1-dimensional vector [b'320']; likely a header
100%|██████████| 1442950/1442950 [02:03<00:00, 11704.72it/s]


In [None]:
# compute list of all words in word embedding model
words_list_w2v = list(vectors_w2v.itos)

In [None]:
len(words_list_w2v)

1442950

### 2.2.2. fastText (Option 2)

Load the Dutch fine-tuned fastText embedding model.

In [None]:
from torchtext.vocab import Vectors

# attach fastText embeddings
vectors_ft = Vectors('gdrive/My Drive/Colab Data/MSc thesis/fastText/cc.nl.300.vec.gz')

  0%|          | 0/2000000 [00:00<?, ?it/s]Skipping token b'2000000' with 1-dimensional vector [b'300']; likely a header
100%|██████████| 2000000/2000000 [03:13<00:00, 10323.77it/s]


In [None]:
# compute list of all words in word embedding model
words_list_ft = list(vectors_ft.itos)
len(words_list_ft)

2000000

### 2.2.3. BERT (Option 3)

In order to obtain contextual embeddings, we have to load the BERTje tokenizer and model.

In [None]:
from transformers import AutoTokenizer

# Load the BERTje tokenizer
print('Loading BERTje tokenizer...')
BERT_tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

Loading BERTje tokenizer...


Downloading:   0%|          | 0.00/254 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/608 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/236k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
from transformers import AutoModel

# load pre-trained model (weights)
BERT_model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased",
                                  output_hidden_states = True, # whether the model returns all hidden-states
                                  )

Downloading:   0%|          | 0.00/417M [00:00<?, ?B/s]

Some weights of the model checkpoint at GroNLP/bert-base-dutch-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at GroNLP/bert-base-dutch-cased and are newly initialized: ['bert.pooler.dense.weight', 'bert.poole

Note that the BERTje model is cased, which means that it is sensitive to capitalization of words. The word 'parijs' does not exist, whereas the word 'Parijs' does. Just as with fastText this is again in contrast with the word2vec model, where everything is lowercase (uncased).

In [None]:
# collect all tokens in the BERTje vocabulary
bertje_tokens = list(BERT_tokenizer.vocab.keys())

len(bertje_tokens)

30073

## 2.3. Build Vocab for Embedding Models



In order to gain a somewhat fair comparison between the static embedding models, word2vec and fastText, we make a list of words that occur in both the embedding model and the vocabulary. Otherwise a performance advantage could be inherent to fastText as it has 2 million words and word2vec has 1.44 million words.

In [None]:
vocab_words = list(phrase_count.keys())
vocab_words_w2v = list(phrase_count_w2v.keys())

The number of words that occur in the dataset and are in the word2vec embeddings model:

In [None]:
intersect_w2v = list(set(words_list_w2v).intersection(vocab_words_w2v))
len(intersect_w2v)

94400

The number of words that occur in the dataset and are in the fastText embeddings model:

In [None]:
intersect_ft = list(set(words_list_ft).intersection(vocab_words))
len(intersect_ft)

107708

Now we finally make the vocabularies for the word embedding models.

In [None]:
from collections import defaultdict, Counter
from torchtext.vocab import Vocab

def vocab_counter(words):
  """
  Create a counter over the given list of words, e.g. the words that occur
  in both the dataset and an embedding model.
  """
  counts = Counter()
  for word in words:
    counts[word] += 1
  return counts

# create counter for the list of intersecting words
w2v_count = vocab_counter(intersect_w2v)

# create word2vec vocab
w2v_vocab = Vocab(counter=w2v_count)
w2v_vocab.load_vectors(vectors_w2v)

# create counter for fastText
ft_count = vocab_counter(intersect_ft)

# create fasttext vocab
ft_vocab = Vocab(counter=ft_count)
ft_vocab.load_vectors(vectors_ft)

# create counter for BERT tokens
bert_count = vocab_counter(bertje_tokens)

# create BERT vocab
BERT_vocab = Vocab(counter=bert_count)
BERT_vocab.set_vectors(BERT_tokenizer.vocab, BERT_model.embeddings.word_embeddings.weight.data, dim = 768)


## 2.4. Build Annoy Index (Mechanism 2)

We use the annoy library to build an Annoy index in order to find the nearest vectors quickly. We use Annoy because we only need to build the index once and can then pass it to a parallel computing environment such as PySpark.

The main difference with the script for Mechanism 1 occurs here. We first randomly project the high-dimensional vectors to a smaller dimension and then build an Annoy Index with these smaller vectors. To randomly project the vectors, we first have to determine the dimension we reduce the vectors to (dependent on β) and define a function that performs the projection.

### 2.4.1. Mechanism 2 Properties


In the work of [Feyisetan](https://aclanthology.org/2021.trustnlp-1.3/), the dimension $m$ to which a vector of original dimension $d$ has to be reduced in order to keep $(\epsilon, \delta)$-privacy is given by the following expression:

$$
 m = \Omega\left(\left[\omega(\mathrm{Ran}(M)) + \sqrt{\log(1/\delta)}\right]^2 / \beta^2\right),
$$

where $\Omega$ denotes an asymptotic lower bound, $\omega$ is the Gaussian width, $\mathrm{Ran}(M)$ is the range of the embedding model $M$, $\delta$ is a very small value and $\beta$ is the dimension reduction parameter.

In line with the recommendation of the authors, we set

*   $\omega(\mathrm{Ran}(M)) = \sqrt{\log(d)}$,
*   $\delta = 10^{-6}$,
*   $\beta \in \{0.7, 0.8, 0.9\}$.

In [None]:
# set beta
beta = 0.7

In [None]:
import numpy as np

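def calculate_new_dim(embedding_dims, beta, delta=1e-6):
  """
  Calculates new smaller embedding dimension, given a specified embedding dimension and beta.
  """
  # note: the Gaussian width term is implemented as log(d) here, not sqrt(log(d));
  # the reduced dimensions printed in the cells below (183, 181, 219) follow this implementation
  return int(np.square(np.log(embedding_dims) + np.sqrt(np.log(1 / delta))) / np.square(beta))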
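
To get a feeling for how β drives the reduced dimension, the sketch below re-implements the computation of `calculate_new_dim` above with the standard library only and evaluates it for the three β values we consider. The helper name `new_dim` is ours and only serves this illustration.

```python
import math

def new_dim(d, beta, delta=1e-6):
    # same arithmetic as calculate_new_dim: (log(d) + sqrt(log(1/delta)))^2 / beta^2
    return int((math.log(d) + math.sqrt(math.log(1 / delta))) ** 2 / beta ** 2)

# reduced dimension of the 320-dimensional word2vec vectors for each beta
dims = {beta: new_dim(320, beta) for beta in (0.7, 0.8, 0.9)}
print(dims)
```

A larger β allows more distortion in the projection and therefore yields a smaller reduced dimension.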

In [None]:
from sklearn import random_projection

def reduce_vectors_dim(vocab, new_dim):
  """
  Reduces the dimension of all the original vectors in the vocab object to corresponding new dimension.
  """
  projecter = random_projection.GaussianRandomProjection(n_components=new_dim, random_state=42)
  reduced_vectors = torch.tensor(projecter.fit_transform(vocab.vectors))
  vocab.vectors = reduced_vectors

  return vocab.vectors.shape

In [None]:
def reduce_single_vector_dim(vector_tensor, new_dim):
  """
  Reduce the dimension of a single vector to corresponding new dimension.
  """
  projecter = random_projection.GaussianRandomProjection(n_components=new_dim, random_state=42)
  vector = vector_tensor.numpy().reshape(1, -1)
  reduced_vector = projecter.fit_transform(vector)

  return reduced_vector

In [None]:
def reduce_multi_vector_dim(vectors, new_dim):
  """
  Reduce the dimension of multiple vectors to corresponding new dimension.
  """
  projector = random_projection.GaussianRandomProjection(n_components=new_dim, random_state=42)
  reduced_vectors = projector.fit_transform(np.array(vectors))

  return reduced_vectors
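
Each helper above creates a fresh `GaussianRandomProjection` with `random_state=42`. The sketch below is a pure-Python stand-in (not scikit-learn's implementation) that illustrates why the fixed seed matters, assuming, as in scikit-learn, that the random matrix depends only on the seed and on the input and output dimensions. Under that assumption, vectors projected one at a time land in the same reduced space as vectors projected in a batch. The function names are hypothetical.

```python
import random

def gaussian_projection_matrix(n_components, n_features, seed=42):
    # entries drawn from N(0, 1/n_components); a fixed seed gives the same matrix every call
    rng = random.Random(seed)
    std = (1.0 / n_components) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_features)]
            for _ in range(n_components)]

def project(vectors, n_components, seed=42):
    # rebuild the matrix from the seed, then multiply every vector with it
    matrix = gaussian_projection_matrix(n_components, len(vectors[0]), seed)
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix]
            for vec in vectors]

batch = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]

# project the whole batch at once, and each vector separately
together = project(batch, n_components=2)
one_by_one = [project([vec], n_components=2)[0] for vec in batch]

print(together == one_by_one)
```

With a different seed per call, the separately projected vectors would no longer be comparable to the vectors stored in the index.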

### 2.4.2. Annoy Index


We define a function that creates an AnnoyIndex and saves it for later use, for a specified embedding model.

In [None]:
from os.path import join
from annoy import AnnoyIndex

def build_AnnoyIndex(emb_vocab, emb_model, embedding_dims, num_trees=50):
  """
  Build AnnoyIndex for a specified embedding model and a vocabulary
  """
  # create approximate nearest neighbor index
  ann_index = AnnoyIndex(embedding_dims, 'euclidean')

  # initialize annoy index file name
  ann_title = 'M2_index_' + emb_model + '.ann'
  ann_filename = join('gdrive/My Drive/Colab Data/MSc thesis/Annoy Index/', ann_title)

  # add all word vectors in pretrained emb model
  for vector_num, vector in enumerate(emb_vocab.vectors):
      ann_index.add_item(vector_num, vector)

  print("Building annoy index...")
  # num_trees affects the build time and the index size:
  # a larger value gives more accurate results, but a larger index
  # (no assert here, so the build also runs when assertions are disabled)
  ann_index.build(num_trees)
  ann_index.save(ann_filename)
  print("Annoy index built")

  return ann_filename, ann_index

#### Annoy Index word2vec

In [None]:
# initialize model params
emb_model = 'word2vec'
embedding_dims = 320

# calculate new dimension
w2v_new_dim = calculate_new_dim(embedding_dims, beta)
print("Old dimension: ", embedding_dims)
print("New dimension: ", w2v_new_dim)

# reduce the vector dimension
reduce_vectors_dim(w2v_vocab, w2v_new_dim)

# create annoy index
w2v_ann_filename, w2v_ann_index = build_AnnoyIndex(w2v_vocab, emb_model, w2v_new_dim)

# print number of vectors in annoy index
w2v_ann_index.get_n_items()

Old dimension:  320
New dimension:  183
Building annoy index...
Annoy index built


94402

In the word2vec model the 10 nearest neighbours of the word 'parijs' are:

In [None]:
word = 'parijs'

word_index = w2v_vocab.stoi[word]
indices = w2v_ann_index.get_nns_by_item(word_index, 10)

for i in indices:
  print(w2v_vocab.itos[i])

parijs
straatsburg
lyon
bordeaux
marseille
montpellier
berlijn
antibes
frankrijk
genève


#### Annoy Index fastText

In [None]:
# initialize model params
emb_model = 'fastText'
embedding_dims = 300

# calculate new dimension
ft_new_dim = calculate_new_dim(embedding_dims, beta)
print("Old dimension: ", embedding_dims)
print("New dimension: ", ft_new_dim)

# reduce the vector dimension
reduce_vectors_dim(ft_vocab, ft_new_dim)

# create annoy index
ft_ann_filename, ft_ann_index = build_AnnoyIndex(ft_vocab, emb_model, ft_new_dim)

# print number of vectors in annoy index
ft_ann_index.get_n_items()

Old dimension:  300
New dimension:  181
Building annoy index...
Annoy index built


107710

In the fastText model the 10 nearest neighbours of the word 'Parijs' are:

In [None]:
word = 'Parijs'

word_index = ft_vocab.stoi[word]
indices = ft_ann_index.get_nns_by_item(word_index, 10)

for i in indices:
  print(ft_vocab.itos[i])

Parijs
Londen
Straatsburg
Brussel
Stockholm
Berlijn
Milaan
Antwerpen
Boekarest
Bordeaux


#### Annoy Index BERT

In [None]:
# initialize model params
emb_model = 'BERT'
embedding_dims = 768

# calculate new dimension
BERT_new_dim = calculate_new_dim(embedding_dims, beta)
print("Old dimension: ", embedding_dims)
print("New dimension: ", BERT_new_dim)

# reduce the vector dimension
reduce_vectors_dim(BERT_vocab, BERT_new_dim)

# create annoy index
BERT_ann_filename, BERT_ann_index = build_AnnoyIndex(BERT_vocab, emb_model, BERT_new_dim)

# print number of vectors in annoy index
BERT_ann_index.get_n_items()

Old dimension:  768
New dimension:  219
Building annoy index...
Annoy index built


30075

In [None]:
word = 'Parijs'

word_index = BERT_vocab.stoi[word]
indices = BERT_ann_index.get_nns_by_item(word_index, 10)

for i in indices:
  print(BERT_vocab.itos[i])

Parijs
Londen
Franse
Amsterdam
Brussel
Berlijn
Nederlandse
Italiaanse
Gent
Duitse


# 3. Privatization algorithm implementation


The steps of the algorithm are as follows:
* For each word in the dataset:
  * Obtain word's embedding vector μ.
  * Generate a noisy vector $N$. The parameter ```epsilon```  determines the amount of noise added.
  * Retrieve the embedding closest to the noisy vector μ + $N$.
  * Get the word that corresponds to the closest vector.
  * Replace the original word with the retrieved word closest to the noisy vector.
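
The steps above can be sketched end to end on a toy example. This is a minimal illustration under stated assumptions, not the implementation used in this notebook: the 2-dimensional embeddings are made up, the nearest neighbour is found by exact search instead of an Annoy index, and the names `noise_vector` and `privatize` are hypothetical.

```python
import math
import random

rng = random.Random(0)

# hypothetical 2-dimensional embeddings, for illustration only
embeddings = {
    "parijs": [1.0, 1.0],
    "lyon": [1.2, 0.9],
    "boek": [-1.0, 0.5],
}

def noise_vector(dim, scale):
    # random direction of unit length, magnitude drawn from a gamma distribution
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    magnitude = rng.gammavariate(dim, scale)
    return [magnitude * x / norm for x in direction]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def privatize(word, epsilon, sensitivity=1.0):
    mu = embeddings[word]                                  # embedding vector of the word
    noise = noise_vector(len(mu), sensitivity / epsilon)   # noisy vector N
    noisy = [m + n for m, n in zip(mu, noise)]             # mu + N
    # return the word whose embedding is closest to mu + N
    return min(embeddings, key=lambda w: squared_distance(embeddings[w], noisy))

print([privatize("parijs", epsilon=5.0) for _ in range(5)])
```

A smaller ```epsilon``` increases the noise magnitude, so the original word is replaced more often and by more distant words.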

## 3.1. Utility functions

### 3.1.1. Generate noise

In order to generate a noisy vector $N$, we define the following function.

In [None]:
import numpy as np

# generate noise vector
def generate_laplacian_noise_vector(dimension, sensitivity, epsilon):
  """
  Generates a noise vector for the provided vector dimension, sensitivity and epsilon value.
  """
  # sample a random direction: a standard normal vector scaled to unit length
  rand_vec = np.random.normal(size=dimension)
  normalized_vec = rand_vec / np.linalg.norm(rand_vec)

  # sample magnitude from gamma distribution
  magnitude = np.random.gamma(shape=dimension, scale=sensitivity / epsilon)
  
  return normalized_vec * magnitude
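
Because the direction has unit length, the norm of the noise vector equals the gamma draw, whose mean is shape times scale, i.e. dimension × sensitivity / epsilon. The sketch below checks this with the standard library for the reduced word2vec dimension (183), β = 0.7 and ε = 150, assuming the sensitivity 1 + β that `replace_word` passes in; the expected norm is then 183 × 1.7 / 150 ≈ 2.074.

```python
import random

dimension, beta, epsilon = 183, 0.7, 150   # word2vec values used in this notebook
scale = (1 + beta) / epsilon               # sensitivity / epsilon

# mean of a gamma distribution is shape * scale
expected_norm = dimension * scale

# empirical mean over many gamma draws
rng = random.Random(0)
samples = [rng.gammavariate(dimension, scale) for _ in range(10_000)]
empirical_norm = sum(samples) / len(samples)

print(round(expected_norm, 3), round(empirical_norm, 3))
```

The expected norm grows linearly with the dimension, so for a fixed ε a higher-dimensional embedding receives noise of larger magnitude.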

### 3.1.2. Replace word by nearest to noise

To retrieve the embedding closest to the noisy vector μ + $N$ and the word corresponding to that embedding, we define the following functions.

#### 3.1.2.1. Static Replace



In [None]:
def replace_word(sensitive_word, vocab, ann_index, epsilon, embedding_dims, beta):
    """
    Replace a word by injecting noise according to the provided epsilon value 
    and return a perturbed word.
    """
    # turn word into lowercase for word2vec
    if embedding_dims == w2v_new_dim:
      sensitive_word = sensitive_word.lower()

    # generate a noise vector
    sensitivity = 1 + beta
    noise = generate_laplacian_noise_vector(embedding_dims, sensitivity, epsilon)

    # obtain vector of sensitive word
    original_vec = vocab.vectors[vocab.stoi[sensitive_word]]

    # obtain perturbed vector
    noisy_vector = original_vec + noise

    # obtain item closest to noisy vector
    closest_item = ann_index.get_nns_by_vector(noisy_vector, 1)[0]

    # check if word is out of vocab (index 0 is the <unk> token)
    # if word is out of vocab return the original word
    if vocab[sensitive_word] != 0:
      privatized_word = vocab.itos[closest_item]
    else:
      privatized_word = sensitive_word

    return privatized_word

#### word2vec: small example how the word replace mechanism works

In [None]:
word = 'Parijs'
epsilon = 150

for i in range(10):
  print(replace_word(word, w2v_vocab, w2v_ann_index, epsilon, w2v_new_dim, beta))

schotse
parijs
parijs
parijs
wekenlang
lyon
hosson
parijs
kostschool
lucien


In [None]:
sensitive_sent = 'Ik ga wel eens op vakantie naar Parijs en soms naar Engeland'
epsilon = 150

print("Original: ", sensitive_sent)

for i in range(10):
  privatized_sent = []
  for word in sensitive_sent.split():
    privatized_sent.append(replace_word(word, w2v_vocab, w2v_ann_index, epsilon, w2v_new_dim, beta))

  print("Privatized: ", " ".join(privatized_sent))

Original:  Ik ga wel eens op vakantie naar Parijs en soms naar Engeland
Privatized:  puzzelstukken welter kia volsta anaesthesie madre jingū tentoonstellingshallen vronski lokaal onderzoekschip voedselvoorraad
Privatized:  kuper paginaatjes wel apparaatje onderschriften heuvelts klimt parijs boekverkoopster geschaafd veiligheidsmaatregelen stokes
Privatized:  hebt zeg schizofrene asymmetrisch vulnerable quicksilver cuypmarkt lebrun terugneemt schuilplekken kortst melbourne
Privatized:  je vranckx nieuwenhuijsen onderschat overkomelijk zoek restverschijnselen valette via panisch krioelt wilkie
Privatized:  scratchy cursusleider schaal lichtkegel doodgeërgerd pyramiden onderdoor noopt beweren hoogstens meerzicht recupereren
Privatized:  gehuild hagel maatvoeringen flink ilorin conferentie achter parijs geboortegrond losgerukte windsor brighton
Privatized:  raaskallen dullens foutief aanwezige modris begeleidster rillanon dagenlange geologiestudent stemgebruik zeiltocht raketkoppen
Privat

#### fastText: small example how the word replace mechanism works

In [None]:
word = 'Parijs'
epsilon = 150

for i in range(10):
  print(replace_word(word, ft_vocab, ft_ann_index, epsilon, ft_new_dim, beta)) 

Zirkzee
Boedapest
Rousseau
zwaarte
Weelen
Parijs
naderen
ignite
Rotterdam
afgekeurd


In [None]:
sensitive_sent = 'Ik ga wel eens op vakantie naar Parijs en soms naar Engeland'
epsilon = 150

print("Original: ", sensitive_sent)

for i in range(10):
  privatized_sent = []
  for word in sensitive_sent.split():
    privatized_sent.append(replace_word(word, ft_vocab, ft_ann_index, epsilon, ft_new_dim, beta))

  print("Privatized: ", " ".join(privatized_sent))


Original:  Ik ga wel eens op vakantie naar Parijs en soms naar Engeland
Privatized:  Ik ga wel nooit op portugues naar Wenen en Soms daarmede Missis
Privatized:  Ik ga wel ding op Ardennen naar Carlan en maak zicht Columbia
Privatized:  Ik ga wel eens op Dieulafoy graag Carnaval en net vanuit Hoboken
Privatized:  Ik ga wel even op Hauger naar Lille en soms Tsjechië oktober
Privatized:  Ik ga wel ergens op luchtrace naar Carriere en dagen heden Palmers
Privatized:  Ik ga wel maatje op Huysentruyt Schuyler 1981 en haast terug tienerfilm
Privatized:  Ik ga wel eens op vijftigste naar Parijs en soms mijn morning
Privatized:  Ik ga wel effe op Lohmark naar Billancourt en soms naar ballingsoord
Privatized:  Ik ga niet eens op vakantietrip langs Bonnart en boedel naar uitgestuurd
Privatized:  Ik ga wel eens op Toestanden naar Parijs en soms meeneem Fouquet


#### BERT: small example how the word replace mechanism works

In [None]:
word = 'Parijs'
epsilon = 150

for i in range(10):
  print(replace_word(word, BERT_vocab, BERT_ann_index, epsilon, BERT_new_dim, beta)) 

Rachel
dr
Unilever
##o
German
Zwolle
vlinder
Parijs
was
huis


In [None]:
sensitive_sent = 'Ik ga wel eens op vakantie naar Parijs en soms naar Engeland'
epsilon = 150

print("Original: ", sensitive_sent)

for i in range(10):
  privatized_sent = []
  for word in sensitive_sent.split():
    privatized_sent.append(replace_word(word, BERT_vocab, BERT_ann_index, epsilon, BERT_new_dim, beta))

  print("Privatized: ", " ".join(privatized_sent))


Original:  Ik ga wel eens op vakantie naar Parijs en soms naar Engeland
Privatized:  ##H ga wel in op vakantie ##heden meldde tijd soms In Engeland
Privatized:  ik Bovendien weer Leiden op vakantie naar gaat en verdwijnen naar Europa
Privatized:  Ik ##er wel af op toeristen 43 Wenen en even EK Duitsland
Privatized:  We ga wel groot op reis zogenaamd Brabant en gedenkteken naar religie
Privatized:  uitgevoerd stoot wel eens van vaak naar Julia en soms wekt België
Privatized:  ##centra organisator wel nieuws op vrijheid naar 16 en bedrijven opgenomen Howard
Privatized:  ik ga wel plaatsen tot lunch uit Parijs en ##blijft stapje Duitsland
Privatized:  Ik FNV wel eens op internet naar Parijs en soms naar achtste
Privatized:  1986 <unk> wel eens op interview naar Parijs en eigenlijk naar Oslo
Privatized:  ik juli wel eens op vakantie Europa ##mar en soms naar Engeland


#### 3.1.2.2. Contextual Replace

Here we have to put the input text into a specific format that BERT can read: we add ```[CLS]``` to the beginning and ```[SEP]``` to the end of the input. Then we convert the tokenized BERT input to tensors.

In [None]:
def bert_text_preparation(text, tokenizer):
  """
  Preprocesses text input in a way that BERT can interpret.
  """
  marked_text = "[CLS] " + text + " [SEP]"
  tokenized_text = tokenizer.tokenize(marked_text, truncation=True)
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  segments_ids = [1]*len(indexed_tokens)

  # convert inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensor = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensor

Now we can compute the BERT token embeddings. Here we use the first hidden state returned by the BERT model (index 0, i.e. the output of the embedding layer) as the contextual representation. We chose this layer by trial and error. According to the authors of BERT, which layer or combination of layers works best is highly task-specific.

In our case, we would like the contextual embeddings to look somewhat like the non-contextual embeddings, because we have built an Annoy Index based on the non-contextual embeddings. If the embeddings keep a similar 'shape/direction', the performance is best. We hypothesise that this is due to the fact that if you take a 'very contextual' embedding, for instance the last layer, it won't be near the non-contextual embedding in our Annoy Index. This will cause our mechanism not to find suitable nearest neighbours.

In [None]:
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens, in context of the given sentence.
    """
    # gradient calculation is disabled
    with torch.no_grad():
      # obtain hidden states
      outputs = model(tokens_tensor, segments_tensor)
      hidden_states = outputs[2]

    # concatenate the tensors for all layers
    # use "stack" to create new dimension in tensor
    token_embeddings = torch.stack(hidden_states, dim=0)

    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)

    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)

    # initialize list to store embeddings
    token_vecs_first = []

    # "token_embeddings" is a [Y x 13 x 768] tensor for BERT-base
    # (embedding output + 12 encoder layers),
    # where Y is the number of tokens in the sentence

    # loop over tokens in sentence
    for token in token_embeddings:

        # "token" is a [13 x 768] tensor

        # Sum the vectors from the last four layers.
        # sum_vec = torch.sum(token[-4:], dim=0)

        # take the first hidden state (index 0, the embedding-layer output)
        first_vec = token[0]
        token_vecs_first.append(first_vec.numpy())

    return token_vecs_first

In order to privatize the contextual token embeddings, we first need to create an Annoy Index filled with contextual embeddings. We create contextual embeddings for a sample of the dataset only, because otherwise we would have too many embeddings (in the contextual setting, words like 'de' and 'een' get a different embedding each time they occur).

The sample is a 5 percent split of the original dataset (`test_size=0.05` in the split below). We make a list of tuples, where each tuple holds a token and its corresponding embedding, and build an Annoy Index for this list.

First we create the 5 percent split.

In [None]:
import pandas as pd

# get filepath
filepath = 'gdrive/My Drive/Colab Data/MSc thesis/dbrd_preprocessed/complete_data.csv'

# get dataframe
df = pd.read_csv(filepath, index_col = 0)
df = df.dropna()

# Report the number of sentences.
print('Number of sentences: {:,}\n'.format(df.shape[0]))

Number of sentences: 22,226



In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.05, random_state=42)

In [None]:
sentences = df_test.sentence.values

# Report the number of sentences.
print('Number of sentences: {:,}\n'.format(df_test.shape[0]))

Number of sentences: 1,112



Then we create contextual embeddings for each token in each sentence in our sample.

In [None]:
from collections import OrderedDict

context_emb = []
context_emb_tokens = []

for sentence in sentences:
  # obtain contextual BERT embeddings
  tokenized_text, tokens_tensor, segments_tensor = bert_text_preparation(sentence, BERT_tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensor, BERT_model)

  # reduce dimension of all token embeddings
  list_red_token_embeddings = reduce_multi_vector_dim(list_token_embeddings, BERT_new_dim)
    
  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()

  # loop over tokens in sensitive sentence
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1

    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]

    # get the corresponding embedding
    token_vec = list_red_token_embeddings[current_index]

    # reduced dimension of embedding
    # new_token_vec = reduce_single_vector_dim(token_vec, BERT_new_dim)[0]

    context_emb.append((token, token_vec))
    context_emb_tokens.append(token)
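
The occurrence counter in the loop above recovers, for each token, its index in the full token list. As a minimal sketch (not the notebook's code), and assuming the first and last entries of `tokenized_text` are special tokens that never occur inside the sentence, the same index can be obtained by enumerating the inner tokens starting from 1. The helper names `positions_by_counting` and `positions_by_enumerate` are hypothetical.

```python
from collections import OrderedDict


def positions_by_counting(tokenized_text):
    """Mirror the loop above: find the index of the n-th occurrence of each token."""
    seen = OrderedDict()
    positions = []
    for token in tokenized_text[1:-1]:
        seen[token] = seen.get(token, 0) + 1
        indices = [i for i, t in enumerate(tokenized_text) if t == token]
        positions.append(indices[seen[token] - 1])
    return positions


def positions_by_enumerate(tokenized_text):
    """Shortcut: inner token number j sits at index j + 1 of the full list."""
    return [i for i, _ in enumerate(tokenized_text[1:-1], start=1)]


# Toy token list with a repeated token ("naar"); the markers are assumed special tokens.
tokens = ["[CLS]", "ik", "ga", "naar", "huis", "en", "naar", "bed", "[SEP]"]
counted = positions_by_counting(tokens)
enumerated = positions_by_enumerate(tokens)
print(counted)
print(enumerated)
```

Under that assumption both helpers return the same indices, and the enumerate version avoids rescanning the token list for every token.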

Now we make the Annoy Index.

In [None]:
from os.path import join
from annoy import AnnoyIndex

def build_AnnoyIndex_context(context_emb, emb_model, embedding_dims, num_trees=50):
  """
  Build AnnoyIndex for a specified embedding model and a vocabulary
  """
  # create approximate nearest neighbor index
  ann_index = AnnoyIndex(embedding_dims, 'euclidean')

  # initialize annoy index file name
  ann_title = 'M2_index_' + emb_model + '.ann'
  ann_filename = join('gdrive/My Drive/Colab Data/MSc thesis/Annoy Index/', ann_title)

  # add all word vectors in list of contextual embeddings
  for vector_num, vector in enumerate(context_emb):
      ann_index.add_item(vector_num, vector[1])

  print("Building annoy index...")
  # num_trees affects the build time and the index size
  # larger value will give more accurate results, but larger indexes
  assert ann_index.build(num_trees)
  ann_index.save(ann_filename)
  print("Annoy index built")

  return ann_filename, ann_index

In [None]:
# initialize model params
emb_model = 'BERT_context'

# create annoy index
BERT_context_ann_filename, BERT_context_ann_index = build_AnnoyIndex_context(context_emb, emb_model, BERT_new_dim)

# print number of vectors in annoy index
BERT_context_ann_index.get_n_items()

Building annoy index...
Annoy index built


289533

Now we delete the ```context_emb``` variable: it was only needed to build the Annoy index and takes up a lot of RAM.

In [None]:
del context_emb

The following function privatizes a contextual token embedding. 

In [None]:
def replace_word_context(sensitive_vec, context_emb_tokens, ann_index, epsilon, embedding_dims, beta):
    """
    Replace a word by injecting noise according to the provided epsilon value 
    and return a perturbed word.
    """
    # generate a noise vector
    sensitivity = 1 + beta
    noise = generate_laplacian_noise_vector(embedding_dims, sensitivity, epsilon)

    # obtain perturbed vector
    noisy_vector = sensitive_vec + noise

    # obtain item closest to noisy vector
    closest_item = ann_index.get_nns_by_vector(noisy_vector, 1)[0]

    # get word from item
    privatized_word = context_emb_tokens[closest_item]
    
    return privatized_word
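
The mechanism above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the notebook's implementation: `laplace_noise` is a hypothetical stand-in for `generate_laplacian_noise_vector` (its actual definition lies outside this section) that draws independent Laplace noise per coordinate, and the Annoy lookup is swapped for an exact brute-force nearest-neighbour search in NumPy. The tokens and vectors are made up.

```python
import numpy as np


def laplace_noise(dims, sensitivity, epsilon, rng):
    """Hypothetical stand-in for generate_laplacian_noise_vector: i.i.d. Laplace noise."""
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=dims)


def nearest_item(vectors, query):
    """Exact nearest neighbour by Euclidean distance (stand-in for the Annoy lookup)."""
    return int(np.argmin(np.linalg.norm(vectors - query, axis=1)))


def replace_token(vec, tokens, vectors, epsilon, beta, rng):
    """Perturb one embedding and return the token of the closest stored embedding."""
    noisy = vec + laplace_noise(len(vec), 1 + beta, epsilon, rng)
    return tokens[nearest_item(vectors, noisy)]


rng = np.random.default_rng(0)
tokens = ["Parijs", "Londen", "Rotterdam", "vakantie"]
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# With a very large epsilon the noise is negligible and the token is unchanged.
unchanged = replace_token(vectors[0], tokens, vectors, epsilon=1e9, beta=0.0, rng=rng)

# With a small epsilon the noise is large, so other tokens are returned as well.
samples = [replace_token(vectors[0], tokens, vectors, epsilon=1.0, beta=0.0, rng=rng)
           for _ in range(200)]
print(unchanged, sorted(set(samples)))
```

The position returned by the lookup is used directly as an index into the token list, which is why `context_emb_tokens` must keep the same order as the items added to the index.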

A small example of how the word replacement mechanism works for the BERT model:

In [None]:
from collections import OrderedDict

# settings
sensitive_sent = 'Ik ga wel eens op vakantie naar Parijs en soms naar Engeland'
epsilon = 15

print("Original sentence: ", sensitive_sent)

# obtain contextual BERT embeddings
tokenized_text, tokens_tensor, segments_tensor = bert_text_preparation(sensitive_sent, BERT_tokenizer)
list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensor, BERT_model)

# reduce dimension of all token embeddings
list_red_token_embeddings = reduce_multi_vector_dim(list_token_embeddings, BERT_new_dim)

for j in range(10):
  
  # make ordered dictionary to keep track of the position of each word
  sensitive_tokens = OrderedDict()

  # initialize privatized sentence
  private_sent = []

  # loop over tokens in sensitive sentence
  for sensitive_token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if sensitive_token in sensitive_tokens:
      sensitive_tokens[sensitive_token] += 1
    else:
      sensitive_tokens[sensitive_token] = 1

    # compute the position of the current token
    token_indices = [i for i, token in enumerate(tokenized_text) if token == sensitive_token]
    current_index = token_indices[sensitive_tokens[sensitive_token]-1]

    # get the corresponding embedding
    sensitive_vec = list_red_token_embeddings[current_index]

    # privatize word
    privatized_word = replace_word_context(sensitive_vec, context_emb_tokens, BERT_context_ann_index, epsilon, BERT_new_dim, beta)
    private_sent.append(privatized_word)

  print("Privatized sentence: ", " ".join(private_sent))

Original sentence:  Ik ga wel eens op vakantie naar Parijs en soms naar Engeland
Privatized sentence:  Ik ga wel eens op vakantie naar universiteit en soms naar Engeland
Privatized sentence:  Ik ga wel ##s op vakantie naar Naar en soms naar Londen
Privatized sentence:  Ik ga wel eens op meemaken naar vage en soms naar Frankrijk
Privatized sentence:  Ik regelen wel eens op ##ga naar Londen en soms naar Gi
Privatized sentence:  Ik ging wel eens op vakantie naar Parijs en soms naar Engeland
Privatized sentence:  Ik ga wel eens op vakantie naar mooie en soms naar Engeland
Privatized sentence:  Ik ga wel eens op familie naar Edwards en soms naar En
Privatized sentence:  Ik maand wel eens op vakantie naar Rotterdam en soms naar omdat
Privatized sentence:  Ik ga wel eens op vakantie naar Parijs en soms naar Italië
Privatized sentence:  Ik ga wel eens op vakantie Miss Spaanse en soms naar Engeland


### 3.1.3. Privatize the dataset

To replace the original words of a sensitive review with privatized words, we define the following functions for the static case.

In [None]:
import re

def obtain_phrases(example):
  """
  Remove special characters from the review and chop the review into smaller phrases.
  """
  original_text = " ".join(example.review)
  clean_text = original_text.replace(
      "'", " ").replace("/", " ").replace("  ", " ").replace('"', '')
  # raw string so the regex escapes are not interpreted as string escapes
  text = re.split(r'\!|\,|\n|\.|\?|\-|\;|\:|\(|\)', clean_text)
  
  return text
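
As a small usage sketch of the splitting step, assuming a dataset row exposes its words as a list in `review` (the `Example` tuple and the function name `split_into_phrases` are hypothetical stand-ins), the same cleaning and splitting can be written with a single character class:

```python
import re
from collections import namedtuple

# Hypothetical stand-in for one dataset row: a list of words plus a label.
Example = namedtuple("Example", ["review", "sentiment"])


def split_into_phrases(example):
    """Same cleaning and splitting steps as obtain_phrases, using a character class."""
    text = " ".join(example.review)
    text = text.replace("'", " ").replace("/", " ").replace("  ", " ").replace('"', "")
    return re.split(r"[!,\n.?\-;:()]", text)


example = Example(review=["Mooi", "boek,", "maar", "te", "lang.", "Aanrader!"], sentiment=1)
phrases = split_into_phrases(example)

# Strip whitespace and drop empty phrases, as the privatization function does next.
cleaned = [phrase.strip() for phrase in phrases if phrase.strip()]
print(phrases)
print(cleaned)
```

The raw split keeps leading spaces and a trailing empty string, which is why the phrases are stripped and filtered before being privatized.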

In [None]:
from pyspark import SparkFiles
import itertools

def privatize_example_static(example, emb_model, local_vocab, local_epsilon, local_embedding_dims, local_beta):
  """
  Privatize every word of a review with the static word replacement mechanism
  and return the privatized review with its sentiment label as one row.
  """
  from annoy import AnnoyIndex
  # load the annoy index to find nearest neighbours
  local_index = AnnoyIndex(local_embedding_dims, 'euclidean') 
  if "word2vec" in emb_model:
    local_index.load(SparkFiles.get("M2_index_word2vec.ann"))
  elif "fastText" in emb_model:
    local_index.load(SparkFiles.get("M2_index_fastText.ann"))
  elif "BERT" in emb_model:
    local_index.load(SparkFiles.get("M2_index_BERT.ann"))

  # strip surrounding whitespace from the cleaned phrases and drop empty ones
  sensitive_phrases = [phrase.strip() for phrase in obtain_phrases(example) if phrase.strip()]

  # make list privatized phrases
  privatized_phrases = []
  for sensitive_phrase in sensitive_phrases:
    privatized_words = []
    for sensitive_word in sensitive_phrase.split(' '):
      privatized_word = replace_word(sensitive_word, local_vocab, local_index, local_epsilon, local_embedding_dims, local_beta)
      if privatized_word == '"' or privatized_word == "'" or privatized_word ==",":
        continue
      else: 
        privatized_words.append(privatized_word)

    # flatten nested list of words
    privatized_phrases.append(itertools.chain(*[privatized_words]))

  # reconstruct review
  privatized_review = " ".join(list(itertools.chain(*privatized_phrases)))

  # reconstruct row
  privatized_row = "\"{}\",{}".format(privatized_review, example.sentiment)

  return privatized_row

For the contextual case, we use a slightly different function:

In [None]:
from pyspark import SparkFiles
import itertools

def privatize_example_context(example, emb_model, context_emb_tokens, local_epsilon, local_embedding_dims, local_beta, model, tokenizer):
  """
  Privatize every token of a review using its contextual BERT embedding
  and return the privatized review with its sentiment label as one row.
  """
  from annoy import AnnoyIndex
  # load the annoy index to find nearest neighbours
  local_index = AnnoyIndex(local_embedding_dims, 'euclidean') 
  if "BERT_context" in emb_model:
    local_index.load(SparkFiles.get("M2_index_BERT_context.ann"))

  # make ordered dictionary to keep track of the position of each word
  sensitive_tokens = OrderedDict()

  # initialize private sentence
  privatized_sent = []

  # initialize review
  review = " ".join(example.review)

  # obtain contextual BERT embeddings
  tokenized_text, tokens_tensor, segments_tensor = bert_text_preparation(review, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensor, model)

  # reduce dimension of all token embeddings
  list_red_token_embeddings = reduce_multi_vector_dim(list_token_embeddings, local_embedding_dims)

  # loop over tokens in sensitive sentence
  for sensitive_token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if sensitive_token in sensitive_tokens:
      sensitive_tokens[sensitive_token] += 1
    else:
      sensitive_tokens[sensitive_token] = 1

    # compute the position of the current token
    token_indices = [i for i, token in enumerate(tokenized_text) if token == sensitive_token]
    current_index = token_indices[sensitive_tokens[sensitive_token]-1]

    # get the corresponding embedding
    sensitive_vec = list_red_token_embeddings[current_index]

    # privatize word
    privatized_word = replace_word_context(sensitive_vec, context_emb_tokens, local_index, local_epsilon, local_embedding_dims, local_beta)

    if privatized_word == '"' or privatized_word == "'":
      continue
    else: 
      privatized_sent.append(privatized_word)

  # reconstruct review
  privatized_review = " ".join(privatized_sent)

  # reconstruct row
  privatized_row = "\"{}\",{}".format(privatized_review, example.sentiment)

  return privatized_row
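
Both privatization functions return a row of the form `"review",label`. As a minimal sketch (the helper name `format_row` and the sample rows are made up), such rows can be read back with Python's `csv` module:

```python
import csv
import io


def format_row(review, sentiment):
    """Same row layout as above: quoted review text, a comma, then the label."""
    return "\"{}\",{}".format(review, sentiment)


rows = [format_row("Ik ga wel eens op vakantie", 1),
        format_row("mooi, maar te lang", 0)]

# csv.reader treats the quoted field as one column, even if it contains commas.
parsed = list(csv.reader(io.StringIO("\n".join(rows))))
print(parsed)
```

Because the review is wrapped in double quotes without escaping, a double quote inside the text would break this layout, which is presumably why quote tokens are skipped before the review is reconstructed.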


### 3.1.4. Miscellaneous utility functions 

A function that renames the various output files so they have an extension ".txt". PySpark does not do this automatically.

In [None]:
import os

# rename files to append '.txt' to filename
def rename_files(file_directory):
  """
  PySpark output files are saved without an extension, so we rename each
  extensionless file in the directory to add the .txt extension.
  """
  # rename files to add .txt extension
  for f in os.listdir(file_directory):
    path = os.path.join(file_directory, f)
    if not os.path.isfile(path):
      continue  # A directory or some other weird object
    if not os.path.splitext(f)[1]:
      os.rename(path, path + '.txt')
  return None
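
To see which files this rule touches, here is a small sketch on a temporary directory. The file names imitate what a Spark text output directory typically contains (part files, a `_SUCCESS` marker and hidden checksum files); the exact set depends on the Spark and filesystem setup, so treat the names as assumptions. The helper `add_txt_extension` is a hypothetical condensed copy of the renaming rule.

```python
import os
import tempfile


def add_txt_extension(directory):
    """Same renaming rule as rename_files: extensionless regular files get '.txt'."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not os.path.splitext(name)[1]:
            os.rename(path, path + ".txt")


with tempfile.TemporaryDirectory() as tmp:
    # Empty files with names resembling a Spark output directory.
    for name in ["part-00000", "part-00001", "_SUCCESS", ".part-00000.crc"]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("")
    add_txt_extension(tmp)
    renamed = sorted(os.listdir(tmp))

print(renamed)
```

Note that an extensionless marker file such as `_SUCCESS` is renamed as well, so a later `*.txt` glob over the directory also matches it.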

Functions that visualize the distance results of our privacy experiments.

In [None]:
import matplotlib.pyplot as plt

def get_cmap(n, name='hsv'):
  '''
  Returns a function that maps each index in 0, 1, ..., n-1 to a distinct 
  RGB color; the keyword argument name must be a standard mpl colormap name.
  '''
  return plt.get_cmap(name, n)

def plot_pertubations(dist, epsilons, title):
  """
  Plot the average distance between each word and its perturbation for the specified epsilons.
  """
  cmap = get_cmap(len(dist))

  # specifying the plot size
  plt.figure(figsize = (10, 5))
  
  # only one line may be specified; full height
  i = 0
  for value in dist: 
    plt.axvline(x = value, color = cmap(i), label = 'epsilon: ' + str(epsilons[i]))
    i += 1

  # place legend outside
  plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper left')

  # set labels
  plt.xlabel('Euclidean distance')
  plt.title(title)
  plt.grid()
  
  # rendering plot
  plt.show()

  return None

def plot_nn(dist, k_list, title):
  """
  Plot the average distance between each word and its k nearest neighbours.
  """
  # create function to make colour in figure
  cmap = get_cmap(len(dist))

  # specifying the plot size
  plt.figure(figsize = (10, 10))
  
  # plot one marker per value of k
  for i in range(len(dist)): 
    plt.plot(dist[i], k_list[i], 'o', color = cmap(i), label = 'k: ' + str(k_list[i]))

  # plt.scatter(avg_distances, k_list)
  # place legend outside
  plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper left')

  # set labels
  plt.xlabel('Euclidean distance')
  plt.ylabel('k nearest neighbours')
  plt.title(title)
  plt.grid()
  
  # rendering plot
  plt.show()

  return None

# 4. Privatize Reviews

The next step is to apply the privatization process to every review in our dataset. For this we first initialize a [SparkSession](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html) so that the reviews can be privatized in parallel. For background on why parallelizing with PySpark is useful, see [here](https://moviecultists.com/why-we-use-parallelize-in-spark).

In [None]:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

def privatize_reviews(reviews, emb_model, ann_filename, vocab, epsilon, embedding_dims, beta=beta, tokens=context_emb_tokens, model=BERT_model, tokenizer=BERT_tokenizer):
  """
  Privatize each review using Mechanism 2.
  """
  # start sparksession
  spark = SparkSession.builder.config("spark.driver.memory", "15g").appName("review-privatization").getOrCreate()

  # initialize title of experiment
  title = emb_model + "_epsilon_" + str(epsilon)

  # parallelize data and obtain distances
  with spark.sparkContext as sc:
    sc.addFile(ann_filename)
    examples = sc.parallelize(reviews, numSlices=500)

    if "BERT_context" in emb_model:
      # privatize each example in the dataset with context
      privatized_examples = examples.map(
        lambda example: privatize_example_context(example, emb_model, tokens, epsilon, embedding_dims, beta, model, tokenizer))   
    else:
      # privatize each example in the dataset statically
      privatized_examples = examples.map(
        lambda example: privatize_example_static(example, emb_model, vocab, epsilon, embedding_dims, beta)) 

    # save privatized data
    privatized_dir = join('gdrive/My Drive/Colab Data/MSc thesis/output/reviews/M2/' + "beta_" + str(beta) + '/', title)
    privatized_examples.saveAsTextFile(privatized_dir)

    # we also save the sensitive examples, to ensure we train on the same source data later
    sensitive_dir = join('gdrive/My Drive/Colab Data/MSc thesis/input/reviews/M2/' + "beta_" + str(beta) + '/', title)
    examples.map(lambda example: "\"{}\",{}".format(
        " ".join(obtain_phrases(example)), example.sentiment)).saveAsTextFile(
        sensitive_dir
        )

  print("Privatization " + title + " Done!")

  return privatized_dir, sensitive_dir

We privatize the reviews for a set of chosen ϵ values to compare the results.

In [None]:
# set epsilons
# epsilons = [50, 75, 100, 125, 150, 200, 300, 500]
epsilons = [5, 10, 15, 25, 50]

# perform privatization for set of epsilons
for epsilon in epsilons:

  # # word2vec ________________________
  # # set embedding model
  # emb_model = "word2vec"

  # # privatized reviews
  # w2v_privatized_dir, w2v_sensitive_dir = privatize_reviews(reviews, emb_model, w2v_ann_filename, w2v_vocab, epsilon, w2v_new_dim)

  # # rename files
  # rename_files(w2v_privatized_dir)
  # rename_files(w2v_sensitive_dir)

  # # fastText ________________________
  # # set embedding model
  # emb_model = "fastText"

  # # privatized reviews
  # ft_privatized_dir, ft_sensitive_dir = privatize_reviews(reviews, emb_model, ft_ann_filename, ft_vocab, epsilon, ft_new_dim)

  # # rename files
  # rename_files(ft_privatized_dir)
  # rename_files(ft_sensitive_dir)

  # # BERTje static ________________________
  # # set embedding model
  # emb_model = "BERT_static"

  # # privatized reviews
  # BERT_static_privatized_dir, BERT_static_sensitive_dir = privatize_reviews(reviews, emb_model, BERT_ann_filename, BERT_vocab, epsilon, BERT_new_dim)

  # # rename files
  # rename_files(BERT_static_privatized_dir)
  # rename_files(BERT_static_sensitive_dir)

  # BERTje context ________________________
  # set embedding model
  emb_model = "BERT_context"

  # privatized reviews
  BERT_context_privatized_dir, BERT_context_sensitive_dir = privatize_reviews(reviews, emb_model, BERT_context_ann_filename, BERT_vocab, epsilon, BERT_new_dim)

  # rename files
  rename_files(BERT_context_privatized_dir)
  rename_files(BERT_context_sensitive_dir)

# 5. Geometry of Word Embedding Spaces

The privacy protection given by this algorithm depends on our choice of the Euclidean distance as the metric between word embeddings. To understand the privacy protection and the noise injection of this algorithm, we therefore analyze the geometric properties of the embedding space. We run two main experiments and combine their results into a single interpretation.

## 5.1. Distances between original word vector and its perturbation

The first experiment computes the Euclidean distance between each word embedding and its perturbed vector, that is, the embedding after the noise of the algorithm described above has been added.

We compute this distance for each word in the embedding model and average it over the vocabulary of that model. We calculate this average for a set of chosen ϵ values to compare the results.

In [None]:
import numpy as np
from pyspark import SparkFiles

def calculate_dist(word, emb_model, local_vocab, local_epsilon, local_embedding_dims, local_beta):
  """
  Calculates the distance between a word and its perturbed word.
  """
  from annoy import AnnoyIndex
  # load the annoy index to find nearest neighbours
  local_index = AnnoyIndex(local_embedding_dims, 'euclidean')
  if "word2vec" in emb_model:
    local_index.load(SparkFiles.get("M2_index_word2vec.ann"))
  elif "fastText" in emb_model:
    local_index.load(SparkFiles.get("M2_index_fastText.ann"))
  elif "BERT" in emb_model:
    local_index.load(SparkFiles.get("M2_index_BERT.ann"))

  # obtain word vector
  index_word = local_vocab.stoi[word]
  vec_word = np.array(local_vocab.vectors[index_word])

  # generate a noise vector
  sensitivity = 1 + local_beta
  noise = generate_laplacian_noise_vector(local_embedding_dims, sensitivity, local_epsilon)

  # obtain perturbed vector
  noisy_vec = vec_word + noise

  # calculate distance
  dist = np.linalg.norm(vec_word-noisy_vec)

  return dist
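
Since the noisy vector is the word vector plus the noise, the distance computed above is simply the norm of the noise and does not depend on the word itself. The sketch below checks this numerically and compares the average distance for two ϵ values. It assumes, as a hypothetical stand-in for `generate_laplacian_noise_vector`, independent Laplace noise per coordinate with scale `sensitivity / epsilon`; the dimension and the sample size are arbitrary.

```python
import numpy as np


def laplace_noise(dims, sensitivity, epsilon, rng):
    """Hypothetical stand-in for generate_laplacian_noise_vector: i.i.d. Laplace noise."""
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=dims)


rng = np.random.default_rng(0)
dims, beta = 50, 0.0
vec_word = rng.normal(size=dims)

# The distance to the noisy vector equals the norm of the noise itself.
noise = laplace_noise(dims, 1 + beta, 50, rng)
dist = np.linalg.norm(vec_word - (vec_word + noise))
same_as_noise_norm = bool(np.isclose(dist, np.linalg.norm(noise)))

# Average distance over many draws for two epsilon values.
avg_dist = {
    eps: float(np.mean([np.linalg.norm(laplace_noise(dims, 1 + beta, eps, rng))
                        for _ in range(1000)]))
    for eps in (50, 500)
}
print(same_as_noise_norm, avg_dist)
```

Under this noise model the average distance scales with `1 / epsilon`, so a tenfold larger ϵ gives roughly a tenfold smaller average distance.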

In [None]:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

def all_distances(words_list, emb_model, ann_filename, vocab, epsilon, embedding_dims, beta=beta):
  """
  Computes the distance between each word and its perturbed word and save this in a file.
  """
  # start sparksession
  spark = SparkSession.builder.config("spark.driver.memory", "15g").appName("privacy-experiment-1a").getOrCreate()

  # initialize title of experiment 
  title = emb_model + "_epsilon_" + str(epsilon)

  # parallelize data and obtain distances
  with spark.sparkContext as sc:
    sc.addFile(ann_filename)
    words = sc.parallelize(words_list, numSlices=500)

    # obtain the perturbation distance for every word in emb model
    distances = words.map(
        lambda word: calculate_dist(word, emb_model, vocab, epsilon, embedding_dims, beta))  
    
    distances_dir = join('gdrive/My Drive/Colab Data/MSc thesis/output/distances/M2/' + "beta_" + str(beta) + '/', title)
    distances.saveAsTextFile(distances_dir)

  print("Experiment " + title + " Done!")

  return distances_dir

We calculate these average distances for a set of chosen ϵ values to compare the results.

In [None]:
import glob
import os
import pandas as pd 

# set epsilon and embedding model 
epsilons = [50, 75, 100, 125, 150, 200, 300, 500]
avg_w2v_distances = []
avg_ft_distances = []
avg_BERT_distances = []

# compute the distances for each epsilon
for epsilon in epsilons:

  # word2vec ________________________
  # set embedding model
  emb_model = "word2vec"

  # create the distances 
  w2v_distances_dir = all_distances(intersect_w2v, emb_model, w2v_ann_filename, w2v_vocab, epsilon, w2v_new_dim)

  # rename files
  rename_files(w2v_distances_dir)

  # return dataframe list by using a list comprehension
  files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(w2v_distances_dir ,"*.txt"))]
  w2v_distances = pd.concat(files).sum()[0]
  avg_w2v_distances.append(w2v_distances)

  # fastText ________________________
  # set embedding model
  emb_model = "fastText"

  # create the distances 
  ft_distances_dir = all_distances(intersect_ft, emb_model, ft_ann_filename, ft_vocab, epsilon, ft_new_dim)

  # rename files
  rename_files(ft_distances_dir)
  
  # return dataframe list by using a list comprehension
  files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(ft_distances_dir ,"*.txt"))]
  ft_distances = pd.concat(files).sum()[0]
  avg_ft_distances.append(ft_distances)

  # BERT ________________________
  # set embedding model
  emb_model = "BERT_static"
  embedding_dims = 768

  # create the distances 
  BERT_distances_dir = all_distances(bertje_tokens, emb_model, BERT_ann_filename, BERT_vocab, epsilon, BERT_new_dim)

  # rename files
  rename_files(BERT_distances_dir)
  
  # return dataframe list by using a list comprehension
  files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(BERT_distances_dir ,"*.txt"))]
  BERT_distances = pd.concat(files).sum()[0]
  avg_BERT_distances.append(BERT_distances)

# compute average
avg_w2v_distances = np.array(avg_w2v_distances) / len(intersect_w2v)
avg_ft_distances = np.array(avg_ft_distances) / len(intersect_ft)
avg_BERT_distances = np.array(avg_BERT_distances) / len(bertje_tokens)

In [None]:
avg_w2v_distances

In [None]:
avg_ft_distances

In [None]:
avg_BERT_distances

## 5.1.2. Visualize results

### word2vec

In [None]:
plot_pertubations(avg_w2v_distances, epsilons, 'word2vec: avg distance between each word and its perturbation')

### fastText

In [None]:
plot_pertubations(avg_ft_distances, epsilons, 'fastText: avg distance between each word and its perturbation')

### BERT

In [None]:
plot_pertubations(avg_BERT_distances, epsilons, 'BERT: avg distance between each word and its perturbation')

## 5.2. Distance between original word and $k$ nearest neighbours

The second experiment computes the Euclidean distance between each word embedding and its $k$ nearest neighbours (without any perturbation). This gives us a baseline for the average distance between words in the embedding space. We calculate this average for a set of chosen $k$ values to compare the results. We consider $k \in \{1,2,3,4,5,10,20,50,200,500,1000\}$.

In [None]:
from pyspark import SparkFiles

def calculate_nn_dist(word, emb_model, local_vocab, local_k, local_embedding_dims, local_beta=None):
  """
  Calculates the distance between a word and its k nearest neighbours.
  """
  from annoy import AnnoyIndex
  # load the annoy index to find nearest neighbours
  local_index = AnnoyIndex(local_embedding_dims, 'euclidean')
  if "word2vec" in emb_model:
    local_index.load(SparkFiles.get("M2_index_word2vec.ann"))
  elif "fastText" in emb_model:
    local_index.load(SparkFiles.get("M2_index_fastText.ann"))
  elif "BERT_static" in emb_model:
    local_index.load(SparkFiles.get("M2_index_BERT.ann"))
  elif "BERT_context" in emb_model:
    local_index.load(SparkFiles.get("M2_index_BERT_context.ann"))

  # obtain word index
  if "BERT_context" in emb_model:
    i = word
  else:
    i = local_vocab.stoi[word]

  # obtain nearest neighbours
  # use k+1, because the first neighbour returned is the item itself
  nns = local_index.get_nns_by_item(i, local_k+1, include_distances=True)

  # obtain total distance between the word and its nns
  total_dist = sum(nns[1])

  # obtain avg distance between the word and its nns
  avg_dist = total_dist / local_k

  return avg_dist
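
The averaging step can be illustrated on toy data. In this sketch the approximate Annoy query is replaced by an exact brute-force search in NumPy, and `avg_knn_distance` is a hypothetical helper name; the one-dimensional "embeddings" are made up so the result is easy to check by hand.

```python
import numpy as np


def avg_knn_distance(vectors, item, k):
    """Average Euclidean distance from one item to its k nearest other items."""
    dists = np.linalg.norm(vectors - vectors[item], axis=1)
    # Sort ascending; position 0 is the item itself at distance 0, so skip it.
    nearest = np.sort(dists)[1:k + 1]
    return float(nearest.sum() / k)


# Toy one-dimensional "embeddings".
vectors = np.array([[0.0], [1.0], [3.0], [10.0]])
avg_1 = avg_knn_distance(vectors, item=0, k=1)
avg_2 = avg_knn_distance(vectors, item=0, k=2)
print(avg_1, avg_2)
```

The function above reaches the same average differently: it sums all `k + 1` returned distances, where the item's own distance of 0 contributes nothing, and divides by `k`.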

The following function runs this computation in parallel for every word in the list.

In [None]:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

def all_nn_distances(words_list, emb_model, ann_filename, vocab, k, embedding_dims, beta=beta):
  """
  Computes the distance between each word and its k nearest neighbours and save this in a file.
  """
  # start sparksession
  spark = SparkSession.builder.config("spark.driver.memory", "15g").appName("privacy-experiment-1b").getOrCreate()

  # initialize title of experiment
  title = emb_model + "_k_" + str(k)

  # parallelize data and obtain nn distances
  with spark.sparkContext as sc:
    sc.addFile(ann_filename)
    words = sc.parallelize(words_list, numSlices=500)

    # obtain nn distances for every word in emb model
    nn_distances = words.map(
        lambda word: calculate_nn_dist(word, emb_model, vocab, k, embedding_dims))  
    
    nn_distances_dir = join('gdrive/My Drive/Colab Data/MSc thesis/output/nn_distances/M2/' + "beta_" + str(beta) + '/', title)
    nn_distances.saveAsTextFile(nn_distances_dir)
  
  print("Experiment " + title + " Done!")

  return nn_distances_dir

Compute the average distances for each $k$.

In [None]:
import glob
import os
import pandas as pd 

# set k values
# k_list = [1,2,3,4,5,10,20,50,200,500,1000]
k_list = [2,3,4,5,10,20,50,200,500,1000]

avg_w2v_nn_distances = []
avg_ft_nn_distances = []
avg_BERT_nn_distances = []
avg_BERT_context_nn_distances = []

# compute the nn distances for each k
for k in k_list:

  # # word2vec ________________________
  # # set embedding model
  # emb_model = "word2vec"

  # # create the distances
  # w2v_nn_distances_dir = all_nn_distances(intersect_w2v, emb_model, w2v_ann_filename, w2v_vocab, k, w2v_new_dim)

  # # rename files 
  # rename_files(w2v_nn_distances_dir)

  # # return dataframe list by using a list comprehension
  # files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(w2v_nn_distances_dir ,"*.txt"))]
  # w2v_nn_distances = pd.concat(files).sum()[0]
  # avg_w2v_nn_distances.append(w2v_nn_distances)

  # # fastText ________________________
  # # set embedding model
  # emb_model = "fastText"

  # # create the distances
  # ft_nn_distances_dir = all_nn_distances(intersect_ft, emb_model, ft_ann_filename, ft_vocab, k, ft_new_dim)

  # # rename files 
  # rename_files(ft_nn_distances_dir)

  # # return dataframe list by using a list comprehension
  # files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(ft_nn_distances_dir ,"*.txt"))]
  # ft_nn_distances = pd.concat(files).sum()[0]
  # avg_ft_nn_distances.append(ft_nn_distances)

  # BERT static ________________________
  # set embedding model
  emb_model = "BERT_static"
  embedding_dims = 768

  # create the distances
  BERT_nn_distances_dir = all_nn_distances(bertje_tokens, emb_model, BERT_ann_filename, BERT_vocab, k, BERT_new_dim)

  # rename files 
  rename_files(BERT_nn_distances_dir)

  # return dataframe list by using a list comprehension
  files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(BERT_nn_distances_dir ,"*.txt"))]
  BERT_nn_distances = pd.concat(files).sum()[0]
  avg_BERT_nn_distances.append(BERT_nn_distances)

  # BERT context ________________________
  # set embedding model
  emb_model = "BERT_context"
  embedding_dims = 768

  # make list of indices
  list_indices = list(range(len(context_emb_tokens)))

  # create the distances
  BERT_context_nn_distances_dir = all_nn_distances(list_indices, emb_model, BERT_context_ann_filename, context_emb_tokens, k, BERT_new_dim)

  # rename files 
  rename_files(BERT_context_nn_distances_dir)

  # return dataframe list by using a list comprehension
  files = [pd.read_csv(file, names =['distance'] ) for file in glob.glob(os.path.join(BERT_context_nn_distances_dir ,"*.txt"))]
  BERT_context_nn_distances = pd.concat(files).sum()[0]
  avg_BERT_context_nn_distances.append(BERT_context_nn_distances)

# compute average
avg_w2v_nn_distances = np.array(avg_w2v_nn_distances) / len(intersect_w2v)
avg_ft_nn_distances = np.array(avg_ft_nn_distances) / len(intersect_ft)
avg_BERT_nn_distances = np.array(avg_BERT_nn_distances) / len(bertje_tokens)
avg_BERT_context_nn_distances = np.array(avg_BERT_context_nn_distances) / len(context_emb_tokens)

In [None]:
avg_w2v_nn_distances

In [None]:
avg_ft_nn_distances

In [None]:
avg_BERT_nn_distances

In [None]:
avg_BERT_context_nn_distances

## 5.2.2. Visualize results

### word2vec

In [None]:
plot_nn(avg_w2v_nn_distances, k_list, 'word2vec: avg distance between every word and k nearest neighbours')

### fastText

In [None]:
plot_nn(avg_ft_nn_distances, k_list, 'fastText: avg distance between every word and k nearest neighbours')

### BERT

In [None]:
plot_nn(avg_BERT_nn_distances, k_list, 'BERT: avg distance between every word and k nearest neighbours')

In [None]:
plot_nn(avg_BERT_context_nn_distances, k_list, 'BERT context: avg distance between every word and k nearest neighbours')

# Random snippets

An earlier version of `replace_word_context`, kept for reference, which reduced the dimension of the sensitive vector inside the function:

In [None]:
# def replace_word_context(sensitive_vec, vocab, ann_index, epsilon, embedding_dims, beta):
#     """
#     Replace a word by injecting noise according to the provided epsilon value 
#     and return a perturbed word.
#     """
#     # generate a noise vector
#     sensitivity = 1 + beta
#     noise = generate_laplacian_noise_vector(embedding_dims, sensitivity, epsilon)

#     # reduced dimension of sensitive vector
#     new_sensitive_vec = reduce_single_vector_dim(sensitive_vec, embedding_dims)[0]

#     # obtain perturbed vector
#     noisy_vector = new_sensitive_vec + noise

#     # obtain item closest to noisy vector
#     closest_item = ann_index.get_nns_by_vector(noisy_vector, 1)[0]

#     # get word from item
#     privatized_word = vocab.itos[closest_item]
    
#     return privatized_word
