# Word Level Federated Text Generation with Stack Overflow
- Joel Stremmel
- 01-28-20

**About:**

This notebook loads the Stack Overflow data available through `tff.simulation.datasets` and trains an LSTM model with Federated Averaging by following the Federated Learning for Text Generation [example notebook](https://github.com/tensorflow/federated/blob/master/docs/tutorials/federated_learning_for_text_generation.ipynb).

**Notes:**

- This notebook prepares the Stack Overflow dataset for word level language modeling using this [module](https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/research/baselines/stackoverflow/dataset.py).
- The metrics for model training come from this [module](https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/research/baselines/stackoverflow/metrics.py). 


**Data:** 
- https://www.kaggle.com/stackoverflow/stackoverflow

**License:** 
- https://creativecommons.org/licenses/by-sa/3.0/

**Data and Model References:**
- https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/stackoverflow/load_data
- https://github.com/tensorflow/federated/blob/master/docs/tutorials/federated_learning_for_text_generation.ipynb
- https://github.com/tensorflow/federated/
- https://github.com/tensorflow/federated/tree/master/tensorflow_federated/python/research/baselines/stackoverflow
- https://www.tensorflow.org/tutorials/text/text_generation
- https://ruder.io/deep-learning-nlp-best-practices/

**Environment Setup References:**
- https://www.tensorflow.org/install/gpu
- https://gist.github.com/matheustguimaraes/43e0b65aa534db4df2918f835b9b361d
- https://www.tensorflow.org/install/source#tested_build_configurations
- https://anbasile.github.io/programming/2017/06/25/jupyter-venv/

### Environment Setup
Pip install these packages in the order listed.

In [21]:
# !pip install --upgrade pip
# !pip install --upgrade tensorflow-federated
# !pip uninstall tensorflow -y
# !pip install --upgrade tensorflow-gpu==2.0
# !pip install --upgrade nltk
# !pip install matplotlib
# !pip install nest_asyncio

### Imports

In [22]:
import nest_asyncio
nest_asyncio.apply()

In [23]:
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.getcwd())))

In [24]:
# from https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/research/baselines/stackoverflow/dataset.py
from utils.dataset import construct_word_level_datasets, get_vocab, get_special_tokens

# from https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/research/baselines/stackoverflow/metrics.py
import utils.metrics as metrics

In [25]:
import collections
import functools
import six
import time
import string

import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

import tensorflow as tf
import tensorflow_federated as tff

### Set Compatibility Behavior

In [26]:
tf.compat.v1.enable_v2_behavior()

### Check TensorFlow Install

In [27]:
print('Built with Cuda: {}'.format(tf.test.is_built_with_cuda()))
print('Build with GPU support: {}'.format(tf.test.is_built_with_gpu_support()))
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Built with Cuda: True
Build with GPU support: True
Num GPUs Available:  1


### Set TensorFlow to Use GPU

In [28]:
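physical_devices = tf.config.experimental.list_physical_devices(device_type=None)
# Enable memory growth on GPU devices only, rather than assuming the last listed device is a GPU.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, enable=True)
for device in physical_devices:
    print(device)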

PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


### Test TFF

In [29]:
tff.federated_computation(lambda: 'Hello, World!')()

'Hello, World!'

### Set Some Parameters for Preprocessing the Data and Training the Model
**Note:** Ask Keith how he's been setting these for internal experiments.

In [30]:
VOCAB_SIZE = 5000
BATCH_SIZE = 16
CLIENTS_EPOCHS_PER_ROUND = 1
MAX_SEQ_LENGTH = 100
MAX_ELEMENTS_PER_USER = 100
CENTRALIZED_TRAIN = False
SHUFFLE_BUFFER_SIZE = 5000
NUM_VALIDATION_EXAMPLES = 200
NUM_TEST_EXAMPLES = 200

NUM_ROUNDS = 10
NUM_TRAIN_CLIENTS = 10
UNIFORM_WEIGHTING = False

### Load and Preprocess Word Level Datasets

In [31]:
train_data, val_data, test_data = construct_word_level_datasets(
    vocab_size=VOCAB_SIZE,
    batch_size=BATCH_SIZE,
    client_epochs_per_round=CLIENTS_EPOCHS_PER_ROUND,
    max_seq_len=MAX_SEQ_LENGTH,
    max_elements_per_user=MAX_ELEMENTS_PER_USER,
    centralized_train=CENTRALIZED_TRAIN,
    shuffle_buffer_size=SHUFFLE_BUFFER_SIZE,
    num_validation_examples=NUM_VALIDATION_EXAMPLES,
    num_test_examples=NUM_TEST_EXAMPLES)



### Retrieve the Dataset Vocab

In [32]:
vocab = get_vocab(VOCAB_SIZE)

### Retrieve the Special Characters Created During Preprocessing
The four special tokens are:
- pad: padding token
- oov: out of vocabulary
- bos: beginning of sentence
- eos: end of sentence

In [33]:
pad, oov, bos, eos = get_special_tokens(VOCAB_SIZE)
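As a rough illustration of how the four special tokens shape a training example, the sketch below encodes one sentence, wraps it in bos and eos, truncates and pads it, and splits it into an input and a shifted target. The toy vocabulary, the `TOY_*` names, the helper `to_input_and_target`, and the id layout (pad at 0, the other tokens after the vocabulary) are assumptions made for this example; the preprocessing module linked above may differ in its details.

```python
# Hypothetical toy vocabulary; the notebook itself uses the most frequent Stack Overflow words.
TOY_VOCAB = ['how', 'do', 'i', 'sort', 'a', 'list']
TOY_WORD2IDX = {word: i + 1 for i, word in enumerate(TOY_VOCAB)}  # id 0 is reserved for pad

# Assumed id layout: pad is 0 and the other special tokens follow the vocabulary.
TOY_PAD = 0
TOY_OOV = len(TOY_VOCAB) + 1
TOY_BOS = len(TOY_VOCAB) + 2
TOY_EOS = len(TOY_VOCAB) + 3


def to_input_and_target(sentence, max_seq_len):
    """Encode a sentence as padded (input, target) id lists of length max_seq_len."""
    # Words missing from the vocabulary map to the oov id.
    ids = [TOY_WORD2IDX.get(word, TOY_OOV) for word in sentence.lower().split()]
    # Prepend bos, truncate, then append eos.
    ids = ([TOY_BOS] + ids)[:max_seq_len] + [TOY_EOS]
    # Pad up to max_seq_len + 1 so that input and target both have max_seq_len ids.
    ids = ids + [TOY_PAD] * (max_seq_len + 1 - len(ids))
    # The target is the input shifted left by one position (next-word prediction).
    return ids[:-1], ids[1:]


toy_input, toy_target = to_input_and_target('How do I sort a dict', max_seq_len=8)
print(toy_input)
print(toy_target)
```

Here "dict" is not in the toy vocabulary, so it is encoded with the oov id, and the final target position is padding that the masked metrics later ignore.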

In [34]:
special2idx = dict(zip(['pad', 'oov', 'bos', 'eos'], [pad, oov, bos, eos]))
idx2special = {v:k for k, v in special2idx.items()}

### Set Vocabulary
Add one to account for the pad token which has idx 0.

In [35]:
word2idx = {word:i+1 for i, word in enumerate(vocab)}
idx2word = {i+1:word for i, word in enumerate(vocab)}

### Add Special Characters

In [36]:
word2idx = {**word2idx, **special2idx}
idx2word = {**idx2word, **idx2special}

### Reset Vocab Size
This accounts for having added the special characters.

In [37]:
VOCAB_SIZE = VOCAB_SIZE + len(special2idx)
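The index bookkeeping in the last few cells can be checked on a toy vocabulary. The sketch below is illustrative only: `toy_special_tokens` is a hypothetical stand-in for `get_special_tokens`, and its id layout is an assumption. It shows why the extended vocabulary size is the number of words plus four, and how a lookup can fall back to the oov id instead of raising a `KeyError`.

```python
def toy_special_tokens(vocab_size):
    """Hypothetical stand-in; assumed layout: pad is 0, the rest follow the vocabulary."""
    return 0, vocab_size + 1, vocab_size + 2, vocab_size + 3


toy_vocab = ['python', 'java', 'error']
toy_pad, toy_oov, toy_bos, toy_eos = toy_special_tokens(len(toy_vocab))

toy_special2idx = dict(zip(['pad', 'oov', 'bos', 'eos'],
                           [toy_pad, toy_oov, toy_bos, toy_eos]))
# Words start at id 1 because id 0 belongs to the pad token.
toy_word2idx = {word: i + 1 for i, word in enumerate(toy_vocab)}
toy_word2idx = {**toy_word2idx, **toy_special2idx}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

extended_vocab_size = len(toy_vocab) + len(toy_special2idx)

# Every id in [0, extended_vocab_size) is used exactly once, so an embedding
# layer with input_dim=extended_vocab_size covers all of them.
assert sorted(toy_idx2word) == list(range(extended_vocab_size))


def encode(words):
    """Map words to ids, sending unknown words to the oov id."""
    return [toy_word2idx.get(word, toy_oov) for word in words]


print(encode(['python', 'segfault', 'error']))
```

One caveat of merging the dictionaries this way: if a vocabulary word were literally `pad`, `oov`, `bos`, or `eos`, its entry in the word-to-index mapping would be overwritten by the special token of the same name.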

### Define Function to Build Model

In [38]:
def build_model(embedding_dim=256, rnn_units=512):
    """
    Build model with architecture from: https://www.tensorflow.org/tutorials/text/text_generation.
    """

    model1_input = tf.keras.Input(shape=(MAX_SEQ_LENGTH, ),
                                  name='model1_input')
    
    model1_embedding = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                                 output_dim=embedding_dim,
                                                 input_length=MAX_SEQ_LENGTH,
                                                 batch_input_shape=[BATCH_SIZE, None],
                                                 name='model1_embedding')(model1_input)
    
    model1_lstm = tf.keras.layers.LSTM(units=rnn_units,
                                       return_sequences=True,
                                       recurrent_initializer='glorot_uniform',
                                       name='model1_lstm')(model1_embedding)
    
    model1_dense = tf.keras.layers.Dense(units=VOCAB_SIZE)(model1_lstm)
    
    final_model = tf.keras.Model(inputs=model1_input, outputs=model1_dense)
                 
    return final_model

### Define the Text Generation Strategy

In [39]:
def generate_text(model, start_string):
    """
    Generate text by sampling from the model output distribution,
    as in https://www.tensorflow.org/tutorials/sequences/text_generation.
    """
    
    start_words = [word.lower() for word in start_string.split(' ')]

    num_generate = 50
    input_eval = [word2idx[word] for word in start_words]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 1.0

    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2word[predicted_id])

    # Join the prompt and the generated words with single spaces.
    return ' '.join(start_words + text_generated)

### Load or Build the Model

In [41]:
keras_model_batch1 = build_model()
print(generate_text(keras_model_batch1, "How are you today"))

how are you today looked repeating interpretation wondering year usual relation cities keys ts concept ^ cleanest limiting ? operations assistance exports stored authorize essential yahoo compile www positions explanation proceed recall migrate msdn broke confuse into java covered datasource primary determines trouble bottleneck relative styled selected concise messaging natural universal price over row
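The `temperature` parameter in `generate_text` divides the logits before sampling. A minimal standard-library sketch of that idea follows; the function names and the toy logits are made up for illustration, and `random.choices` stands in for `tf.random.categorical`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_id(logits, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Temperatures below 1.0 concentrate probability on the highest-scoring word, which gives more predictable text; temperatures above 1.0 flatten the distribution and give more varied text. With an untrained model, as in the cell above, the output is close to random at any temperature.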


### Create TFF Version of the Model to be Trained with Federated Averaging
TFF uses a sample batch so it knows the types and shapes that your model expects.

In [42]:
sample_batch = tf.nest.map_structure(lambda x: x.numpy(), next(iter(val_data)))



In [43]:
def model_fn():
    """
    Create TFF model from compiled Keras model and a sample batch.
    """
    
    keras_model = build_model()
    
    train_metrics = [
        metrics.NumTokensCounter(name='num_tokens', masked_tokens=[pad]),
        metrics.NumTokensCounter(name='num_tokens_no_oov', masked_tokens=[pad, oov]),
        metrics.NumBatchesCounter(),
        metrics.NumExamplesCounter(),
        metrics.MaskedCategoricalAccuracy(name='accuracy', masked_tokens=[pad]),
        metrics.MaskedCategoricalAccuracy(name='accuracy_no_oov', masked_tokens=[pad, oov]),
        metrics.MaskedCategoricalAccuracy(name='accuracy_no_oov_no_eos', masked_tokens=[pad, oov, eos])
    ]
    
    keras_model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(),
        metrics=train_metrics)
    
    return tff.learning.from_compiled_keras_model(keras_model, sample_batch)
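The metric names above differ only in which label tokens are masked. As a hedged sketch of what masking means here (this is not the code in the metrics module, and `masked_accuracy` is a hypothetical helper), positions whose true label is a masked token are dropped before accuracy is computed:

```python
def masked_accuracy(y_true, y_pred, masked_tokens):
    """Fraction of correct predictions among positions whose label is not masked."""
    kept = [(true_id, pred_id) for true_id, pred_id in zip(y_true, y_pred)
            if true_id not in masked_tokens]
    if not kept:
        return 0.0
    return sum(true_id == pred_id for true_id, pred_id in kept) / len(kept)


# Toy ids for illustration only: 0 plays the role of pad and 7 the role of oov.
toy_true = [5, 7, 2, 9, 0, 0]
toy_pred = [5, 7, 3, 9, 0, 1]

print(masked_accuracy(toy_true, toy_pred, masked_tokens=[]))      # no masking
print(masked_accuracy(toy_true, toy_pred, masked_tokens=[0]))     # like 'accuracy'
print(masked_accuracy(toy_true, toy_pred, masked_tokens=[0, 7]))  # like 'accuracy_no_oov'
```

Without masking, correctly predicted padding would count toward accuracy; masking pad and oov restricts the score to in-vocabulary words.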

### Define Lists to Track Loss and Accuracy at Each Training Round

In [44]:
train_loss = []
train_accuracy = []
val_loss = []
val_accuracy = []

### Define Function to Evaluate Model Performance on Validation Data

In [45]:
def keras_evaluate(keras_model, state, val_dataset):
    
    tff.learning.assign_weights_to_keras_model(keras_model, state.model)
    loss, accuracy = keras_model.evaluate(val_dataset, steps=2)
    
    val_loss.append(loss)
    val_accuracy.append(accuracy)

### Define Function to Weight Clients Uniformly or by Number of Tokens

In [46]:
def client_weight_fn(local_outputs):
    
    num_tokens = tf.cast(tf.squeeze(local_outputs['num_tokens']), tf.float32)
    
    return 1.0 if UNIFORM_WEIGHTING else num_tokens
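To make the effect of this weighting choice concrete, the sketch below averages two made-up client updates, once uniformly and once weighted by token counts. `weighted_average` and the toy numbers are illustrative assumptions, not part of TFF.

```python
def weighted_average(client_updates, client_weights):
    """Coordinate-wise weighted mean of per-client update vectors."""
    total_weight = sum(client_weights)
    num_coordinates = len(client_updates[0])
    return [
        sum(weight * update[i] for update, weight in zip(client_updates, client_weights))
        / total_weight
        for i in range(num_coordinates)
    ]


# Two hypothetical clients: the first saw 300 tokens, the second 100.
toy_updates = [[1.0, 0.0], [0.0, 1.0]]
toy_num_tokens = [300, 100]

uniform = weighted_average(toy_updates, [1.0, 1.0])
by_tokens = weighted_average(toy_updates, toy_num_tokens)
print(uniform)
print(by_tokens)
```

Weighting by token count lets clients with more text move the global model further, while uniform weighting treats every sampled client equally regardless of how much data it holds.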

### Define Function to Supply Server Optimizer

In [50]:
def server_optimizer_fn():
    
    return tf.keras.optimizers.Adam()

### Define Function to Create Training Datasets from Randomly Sampled Clients

In [47]:
def get_sample_clients(dataset, num_clients):
    
    random_indices = np.random.choice(len(dataset.client_ids), size=num_clients, replace=False)
    
    return np.array(dataset.client_ids)[random_indices]
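The sampling above is unseeded, so each run draws different clients. If reproducible rounds are wanted, one option is to seed the sampler per round. The sketch below uses the standard library's `random.Random` in place of `np.random.choice`; `sample_clients_seeded` and the client ids are hypothetical.

```python
import random


def sample_clients_seeded(client_ids, num_clients, seed):
    """Sample client ids without replacement using a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(list(client_ids), num_clients)


toy_client_ids = ['client_{}'.format(i) for i in range(100)]

# Using the round number as the seed gives the same clients for the same round on every run.
round_3_clients = sample_clients_seeded(toy_client_ids, num_clients=10, seed=3)
print(round_3_clients)
```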

### Train Model Across Many Randomly Sampled Clients with Federated Averaging
- Set the default executor
- Create and initialize an iterative process
- Apply federated training rounds
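The steps above follow the usual Federated Averaging pattern: each sampled client starts from the server model, trains locally, and sends back a model delta that the server averages. The toy sketch below shows that pattern on a one-parameter model in plain Python. It is a simplification under stated assumptions: the function names and data are hypothetical, clients are weighted by example count, and the averaged delta is applied directly, whereas this notebook passes the update through a server-side Adam optimizer.

```python
def client_update(server_weight, data, learning_rate=0.1):
    """Run one epoch of local SGD and return (weight delta, number of examples)."""
    weight = server_weight
    for x in data:
        weight -= learning_rate * (weight - x)  # gradient of 0.5 * (weight - x) ** 2
    return weight - server_weight, len(data)


def federated_round(server_weight, client_datasets):
    """Average the client deltas, weighted by example count, and apply them."""
    results = [client_update(server_weight, data) for data in client_datasets]
    total_examples = sum(num_examples for _, num_examples in results)
    average_delta = sum(delta * num_examples for delta, num_examples in results) / total_examples
    return server_weight + average_delta


# Three hypothetical clients with different local data distributions.
toy_client_datasets = [[1.0, 1.2, 0.8], [3.0, 3.1], [2.0]]

toy_weight = 0.0  # plays the role of iterative_process.initialize()
for toy_round in range(1, 21):
    # Plays the role of iterative_process.next(server_state, train_datasets).
    toy_weight = federated_round(toy_weight, toy_client_datasets)

print(toy_weight)
```

The server never sees the raw client data, only the deltas and example counts, which is the property that motivates training this way.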

In [48]:
tff.framework.set_default_executor(tff.framework.create_local_executor(max_fanout=10))

In [52]:
iterative_process = (
      tff.learning.federated_averaging.build_federated_averaging_process(
          model_fn=model_fn,
          server_optimizer_fn=server_optimizer_fn,
          client_weight_fn=client_weight_fn))

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [53]:
server_state = iterative_process.initialize()

In [54]:
for round_num in range(1, NUM_ROUNDS + 1):  # run exactly NUM_ROUNDS rounds, numbered from 1
    
    # Examine validation metrics
#     print(f'Evaluating before training round #{round_num} on {NUM_VALIDATION_EXAMPLES} clients.')
#     keras_evaluate(keras_model, state, val_data)
    
    # Sample train clients to create a train dataset
    print(f'Sampling {NUM_TRAIN_CLIENTS} new clients.')
    train_clients = get_sample_clients(train_data, num_clients=NUM_TRAIN_CLIENTS)
    train_datasets = [train_data.create_tf_dataset_for_client(client) for client in train_clients]
    
    # Apply federated training round
    server_state, server_metrics = iterative_process.next(server_state, train_datasets)
    
    # Examine training metrics
    print('Round: {}'.format(round_num))
    print('   Loss: {:.8f}'.format(server_metrics.loss))
    print('   num_batches: {}'.format(server_metrics.num_batches))
    print('   num_examples: {}'.format(server_metrics.num_examples))
    print('   num_tokens: {}'.format(server_metrics.num_tokens))
    print('   num_tokens_no_oov: {}'.format(server_metrics.num_tokens_no_oov))
    print('   accuracy: {:.5f}'.format(server_metrics.accuracy))
    print('   accuracy_no_oov: {:.5f}'.format(server_metrics.accuracy_no_oov))
    
    train_loss.append(server_metrics.loss)
    train_accuracy.append(server_metrics.accuracy)

Sampling 10 new clients.




Round: 1
   Loss: 8.15574360
   num_batches: 66
   num_examples: 945
   num_tokens: 14835
   num_tokens_no_oov: 14307
   accuracy: 0.00054
   accuracy_no_oov: 0.00056
Sampling 10 new clients.
Round: 2
   Loss: 5.83773422
   num_batches: 68
   num_examples: 973
   num_tokens: 14345
   num_tokens_no_oov: 13656
   accuracy: 0.00014
   accuracy_no_oov: 0.00015
Sampling 10 new clients.
Round: 3
   Loss: 3.78267336
   num_batches: 70
   num_examples: 997
   num_tokens: 17429
   num_tokens_no_oov: 16627
   accuracy: 0.00000
   accuracy_no_oov: 0.00000
Sampling 10 new clients.
Round: 4
   Loss: 2.38870931
   num_batches: 58
   num_examples: 825
   num_tokens: 12416
   num_tokens_no_oov: 11904
   accuracy: 0.00000
   accuracy_no_oov: 0.00000
Sampling 10 new clients.
Round: 5
   Loss: 1.67560375
   num_batches: 68
   num_examples: 979
   num_tokens: 14771
   num_tokens_no_oov: 14073
   accuracy: 0.00000
   accuracy_no_oov: 0.00000
Sampling 10 new clients.
Round: 6
   Loss: 1.45487618
   num_batc

### Plot Model Objective Function

In [None]:
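fig, ax = plt.subplots()
# Number rounds from 1 and size the x-axis from the values actually recorded.
ax.plot(range(1, len(train_loss) + 1), train_loss, label='Train')
# Validation values exist only if keras_evaluate was run during training.
if val_loss:
    ax.plot(range(1, len(val_loss) + 1), val_loss, label='Validation')
ax.legend(loc='best')
plt.xlabel('Training Round')
plt.ylabel('Value of Objective Function')
plt.title('Model Objective Function at Each Training Round')
plt.show()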

### Plot Model Accuracy

In [None]:
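fig, ax = plt.subplots()
# Number rounds from 1 and size the x-axis from the values actually recorded.
ax.plot(range(1, len(train_accuracy) + 1), train_accuracy, label='Train')
# Validation values exist only if keras_evaluate was run during training.
if val_accuracy:
    ax.plot(range(1, len(val_accuracy) + 1), val_accuracy, label='Validation')
ax.legend(loc='best')
plt.xlabel('Training Round')
plt.ylabel('Accuracy')
plt.title('Model Accuracy at Each Training Round')
plt.show()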

### Get Final Evaluation

In [None]:
# keras_evaluate(keras_model, state, val_dataset)

### Generate Text
Text generation requires batch_size=1.

In [None]:
# keras_model_batch1.set_weights([v.numpy() for v in keras_model.weights])
# print(generate_text(keras_model_batch1, "How are you today? "))