# Word Level Federated Text Generation with Stack Overflow (Work in Progress)
- Joel Stremmel, Arjun Singh
- 01-20-20

**About:**

This notebook loads the Stack Overflow data available through `tff.simulation.datasets` and trains an LSTM model with Federated Averaging by following the Federated Learning for Text Generation [example notebook](https://github.com/tensorflow/federated/blob/master/docs/tutorials/federated_learning_for_text_generation.ipynb).

**Notes:**

This notebook prepares the Stack Overflow dataset for word level language modeling using this [module](https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/research/baselines/stackoverflow/dataset.py).


**Data:** 
- https://www.kaggle.com/stackoverflow/stackoverflow

**License:** 
- https://creativecommons.org/licenses/by-sa/3.0/

**Data and Model References:**
- https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/stackoverflow/load_data
- https://github.com/tensorflow/federated/blob/master/docs/tutorials/federated_learning_for_text_generation.ipynb
- https://github.com/tensorflow/federated/
- https://www.tensorflow.org/tutorials/text/text_generation
- https://ruder.io/deep-learning-nlp-best-practices/

**Environment Setup References:**
- https://www.tensorflow.org/install/gpu
- https://gist.github.com/matheustguimaraes/43e0b65aa534db4df2918f835b9b361d
- https://www.tensorflow.org/install/source#tested_build_configurations
- https://anbasile.github.io/programming/2017/06/25/jupyter-venv/

### Environment Setup
Pip install these packages in the order listed.

In [0]:
!pip install --upgrade pip
!pip install --upgrade tensorflow-federated
!pip install tensorflow
!pip install --upgrade tensorflow-gpu==2.0
!pip install --upgrade nltk
!pip install matplotlib
!pip install nest_asyncio

Collecting pip
  Downloading https://files.pythonhosted.org/packages/54/0c/d01aa759fdc501a58f431eb594a17495f15b88da142ce14b5845662c13f3/pip-20.0.2-py2.py3-none-any.whl (1.4MB)
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-20.0.2
Collecting tensorflow-federated
  Downloading tensorflow_federated-0.11.0-py2.py3-none-any.whl (385 kB)
Collecting tensorflow-model-optimization~=0.1.3
  Downloading tensorflow_model_optimization-0.1.3-py2.py3-none-any.whl (81 kB)
Collecting tensorflow~=2.0.0
  Downloading tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (86.3 MB)
Collecting tensorflow-addons~=0.6.0
  

Requirement already up-to-date: tensorflow-gpu==2.0 in /usr/local/lib/python3.6/dist-packages (2.0.0)
Collecting nltk
  Downloading nltk-3.4.5.zip (1.5 MB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4.5-py3-none-any.whl size=1449905 sha256=f7a2ff5ef650f1721108b456790977831aa33c0f9e9e53325018e79b80db0f07
  Stored in directory: /root/.cache/pip/wheels/e3/c9/b0/ed26a73ef75a53145820825afa8e2d2c9b30fe9f6c10cd3202
Successfully built nltk
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.4.5


Collecting nest_asyncio
  Downloading nest_asyncio-1.2.2-py3-none-any.whl (4.5 kB)
Installing collected packages: nest-asyncio
Successfully installed nest-asyncio-1.2.2


### Imports

In [0]:
import nest_asyncio
nest_asyncio.apply()

In [0]:
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.getcwd())))

In [0]:
# from https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/research/baselines/stackoverflow/dataset.py
from utils.dataset import construct_word_level_datasets

In [0]:
import collections
import functools
import six
import time
import string

import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

import tensorflow as tf
import tensorflow_federated as tff

### Mount Google Drive to Import Local Files into the Colab Session

In [0]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
!ls /gdrive

Mounted at /gdrive
'My Drive'  'Shared drives'


In [0]:
import os
BASE_PATH = '/gdrive/My Drive/colab_files/'
os.chdir(BASE_PATH)

In [0]:
!pwd

/gdrive/My Drive/colab_files


In [0]:
import dataset
from dataset import construct_word_level_datasets

### Set Compatibility Behavior

In [0]:
tf.compat.v1.enable_v2_behavior()

### Check TensorFlow Install

In [0]:
print('Built with Cuda: {}'.format(tf.test.is_built_with_cuda()))
print('Built with GPU support: {}'.format(tf.test.is_built_with_gpu_support()))
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Built with Cuda: True
Built with GPU support: True
Num GPUs Available:  1


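### Set TensorFlow to Use GPU

In [0]:
physical_devices = tf.config.experimental.list_physical_devices(device_type=None)
# Enable memory growth on each GPU instead of assuming the GPU is the last device listed.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, enable=True)
for device in physical_devices:
    print(device)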

PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


### Test TFF

In [0]:
tff.federated_computation(lambda: 'Hello, World!')()

'Hello, World!'

### Preprocessing for Next Word Prediction (NWP) with a Word-Level Vocabulary

In [0]:
  # path = tf.keras.utils.get_file(
  #     'stackoverflow.tar.bz2',
  #     origin='https://storage.googleapis.com/tff-datasets-public/stackoverflow.tar.bz2',
  #     file_hash='99eca2f8b8327a09e5fc123979df2d237acbc5e52322f6d86bf523ee47b961a2',
  #     hash_algorithm='sha256',
  #     extract=True,
  #     archive_format='tar',
  #     cache_dir=None)

Downloading data from https://storage.googleapis.com/tff-datasets-public/stackoverflow.tar.bz2


In [0]:
path = tf.keras.utils.get_file(
    'shakespeare.tar.bz2',
    origin='https://storage.googleapis.com/tff-datasets-public/shakespeare.tar.bz2',
    file_hash='0285be9906cb5f268092eee4edeeacfc2af4574f2941f7cc2f08a321d7f5c707',
    hash_algorithm='sha256',
    extract=True,
    archive_format='tar',
    cache_dir=None)

Downloading data from https://storage.googleapis.com/tff-datasets-public/shakespeare.tar.bz2


In [0]:
path

'/root/.keras/datasets/shakespeare.tar.bz2'

In [0]:
!pip install Unidecode

Collecting Unidecode
  Downloading Unidecode-1.1.1-py2.py3-none-any.whl (238 kB)

In [0]:
import os
import time
 
import numpy as np
import tensorflow as tf
import unidecode
from keras_preprocessing.text import Tokenizer
 
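# tf.enable_eager_execution()  # not needed: eager execution is the default in TF 2.x

# file_path = path

# unidecode expects a str. The original call encoded the text to bytes first,
# which is the likely cause of the AttributeError seen when this cell was run.
with open('raw_shakespeare_data.txt') as f:
    text = unidecode.unidecode(f.read().strip())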

In [0]:
with open('raw_shakespeare_data.txt') as f:
    # This reads all the data from the file, but does not do any processing on it.
    data = f.read()

# preprocessing to replace all the whitespace characters (space, \n, \t, etc.) in the file with the space character.
data = " ".join(data.split())

In [0]:
data[:100]

'The Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare Thi'

In [0]:
text = data
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
 
encoded = tokenizer.texts_to_sequences([text])[0]
 
vocab_size = len(tokenizer.word_index) + 1
 
word2idx = tokenizer.word_index
idx2word = tokenizer.index_word

In [0]:
{k: word2idx[k] for k in list(word2idx)[:5]}

{'and': 2, 'i': 3, 'of': 5, 'the': 1, 'to': 4}

In [0]:
{k: idx2word[k] for k in list(idx2word)[:5]}

{1: 'the', 2: 'and', 3: 'i', 4: 'to', 5: 'of'}

In [0]:
vocab_size

27433
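For intuition, the index built by `Tokenizer` above can be approximated in plain Python. This is a minimal sketch, not the Keras implementation: `build_word_index` and `encode_words` are hypothetical helpers, and the simple lowercase/whitespace split ignores the punctuation filtering that `Tokenizer` applies by default. Words are ranked by frequency starting at 1, leaving index 0 unassigned (Keras reserves it), which is why the notebook uses `len(tokenizer.word_index) + 1` as the vocabulary size.

```python
from collections import Counter

def build_word_index(text):
    """Hypothetical helper: rank words by frequency, most frequent first, starting at 1."""
    counts = Counter(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    word_to_id = {word: i + 1 for i, word in enumerate(ranked)}
    id_to_word = {i: word for word, i in word_to_id.items()}
    return word_to_id, id_to_word

def encode_words(text, word_to_id):
    """Map each known word to its integer id, skipping unknown words."""
    return [word_to_id[w] for w in text.lower().split() if w in word_to_id]

demo_text = "the cat and the dog and the bird"
demo_word2idx, demo_idx2word = build_word_index(demo_text)
demo_vocab_size = len(demo_word2idx) + 1  # +1 accounts for the unused index 0
demo_encoded = encode_words("the dog and the cat", demo_word2idx)
```

The `demo_` prefix keeps these names from overwriting `word2idx`, `idx2word`, and `vocab_size` if the sketch is run inside this notebook.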

### Set Some Parameters for Preprocessing the Data and Training the Model
**Note:** Ask Keith how he's been setting these for internal experiments.

In [0]:
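VOCAB_SIZE = vocab_size
BATCH_SIZE = 16
CLIENTS_EPOCHS_PER_ROUND = 1
MAX_SEQ_LENGTH = 100
MAX_ELEMENTS_PER_USER = 100
CENTRALIZED_TRAIN = False
SHUFFLE_BUFFER_SIZE = 5000
NUM_VALIDATION_EXAMPLES = 200
NUM_TEST_EXAMPLES = 200

NUM_ROUNDS = 20

# Later cells reference the names below, which were never defined in this notebook.
# SEQ_LENGTH and BUFFER_SIZE are aliased to the values above (an assumption).
SEQ_LENGTH = MAX_SEQ_LENGTH
BUFFER_SIZE = SHUFFLE_BUFFER_SIZE
# Assumed placeholder client counts; the original notebook does not specify them.
NUM_TRAIN_CLIENTS = 10
NUM_VAL_CLIENTS = 10
NUM_TEST_CLIENTS = 10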

In [0]:
train_data, val_data, test_data = construct_word_level_datasets(
    vocab_size=VOCAB_SIZE,
    batch_size=BATCH_SIZE,
    client_epochs_per_round=CLIENTS_EPOCHS_PER_ROUND,
    max_seq_len=MAX_SEQ_LENGTH,
    max_elements_per_user=MAX_ELEMENTS_PER_USER,
    centralized_train=CENTRALIZED_TRAIN,
    shuffle_buffer_size=SHUFFLE_BUFFER_SIZE,
    num_validation_examples=NUM_VALIDATION_EXAMPLES,
    num_test_examples=NUM_TEST_EXAMPLES)

In [0]:
train_data

### Count Number of Clients

In [0]:
print('{} train clients.'.format(len(train_data.client_ids)))
print('{} val clients.'.format(len(val_data.client_ids)))
print('{} test clients.'.format(len(test_data.client_ids)))

### Set Vocabulary
- Currently using the fixed vocabulary of ASCII chars that occur in the works of Shakespeare and Dickens
- **Is there a good way to get the distinct characters from a TF dataset?**

In [0]:
vocab = list('dhlptx@DHLPTX $(,048cgkoswCGKOSW[_#\'/37;?bfjnrvzBFJNRVZ"&*.26:\naeimquyAEIMQUY]!%)-159\r')
vocab_size = len(vocab)

### Creating a Mapping from Unique Characters to Indices

In [0]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
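
As a small illustration of how such a mapping is used, the sketch below encodes a string to ids and decodes it back using only the standard library. The toy vocabulary and the helpers `encode_chars` and `decode_ids` are hypothetical and not part of this notebook. The sketch sends unknown characters to id 0, mirroring the `default_value=0` of the lookup table defined later in this notebook, which means unknown characters decode as whatever character sits at index 0.

```python
# Toy vocabulary: [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
demo_vocab = sorted(set("hello world"))
demo_char2idx = {ch: i for i, ch in enumerate(demo_vocab)}
demo_idx2char = demo_vocab  # the list position doubles as the id

def encode_chars(text, char_to_id, default_id=0):
    """Map each character to its id; unknown characters fall back to default_id."""
    return [char_to_id.get(ch, default_id) for ch in text]

def decode_ids(ids, id_to_char):
    """Map ids back to characters and join them into a string."""
    return "".join(id_to_char[i] for i in ids)

demo_ids = encode_chars("hello", demo_char2idx)
demo_roundtrip = decode_ids(demo_ids, demo_idx2char)
# 'p' is not in the toy vocabulary, so it is encoded as 0 and decoded as ' '.
demo_lossy = decode_ids(encode_chars("help", demo_char2idx), demo_idx2char)
```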

### Build Model

In [0]:
def build_model(batch_size, vocab_size, seq_length, embedding_dim=256, rnn_units=512):
    """
    Build model with architecture from: https://www.tensorflow.org/tutorials/text/text_generation.
    """

    model1_input = tf.keras.Input(shape=(seq_length, ),
                                  name='model1_input')
    
    model1_embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                                 output_dim=embedding_dim,
                                                 input_length=seq_length,
                                                 batch_input_shape=[batch_size, None],
                                                 name='model1_embedding')(model1_input)
    
    model1_lstm = tf.keras.layers.LSTM(units=rnn_units,
                                       return_sequences=True,
                                       recurrent_initializer='glorot_uniform',
                                       name='model1_lstm')(model1_embedding)
    
    model1_dense = tf.keras.layers.Dense(units=vocab_size)(model1_lstm)
    
    final_model = tf.keras.Model(inputs=model1_input, outputs=model1_dense)
                 
    return final_model

### Define the Text Generation Strategy

In [0]:
def generate_text(model, start_string):
    """
    Generate text by sampling from the model output distribution,
    as in https://www.tensorflow.org/tutorials/sequences/text_generation.
    """

    num_generate = 200
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 1.0

    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
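
To make the role of `temperature` concrete, here is a framework-free sketch of temperature-scaled sampling. It is an illustration under stated assumptions rather than what `tf.random.categorical` does internally: `softmax_with_temperature` and `sample_id` are hypothetical helpers, and the logits are made up. Dividing the logits by a temperature below 1 sharpens the distribution toward the most likely id, while a temperature above 1 flattens it.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(logits, temperature=1.0, rng=random):
    """Draw one id at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]
sharp_probs = softmax_with_temperature(toy_logits, temperature=0.1)
flat_probs = softmax_with_temperature(toy_logits, temperature=10.0)
sampled = sample_id(toy_logits, temperature=1.0, rng=random.Random(0))
```

With `temperature=0.1` nearly all of the probability mass lands on the first id, whereas `temperature=10.0` leaves the three ids close to equally likely.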

### Load or Build the Model
Text generation requires a batch_size=1 model.

In [0]:
keras_model_batch1 = build_model(batch_size=1, vocab_size=vocab_size, seq_length=SEQ_LENGTH)
print(generate_text(keras_model_batch1, "How's the water today? "))

### Define Functions to Preprocess Federated Stack Overflow
- Using a namedtuple with keys x and y as the output type of the dataset keeps both TFF and Keras happy.
- Construct a lookup table to map string chars to indexes, using the vocab loaded above.
- Write functions for:
    - ID lookup
    - Splitting inputs and targets
    - Applying preprocessing steps to dataset
    - Taking clients and client records and applying preprocessing

In [0]:
BatchType = collections.namedtuple('BatchType', ['x', 'y'])

In [0]:
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocab,
        values=tf.constant(list(range(len(vocab))), dtype=tf.int64)),
    default_value=0)

In [0]:
def to_ids(x):
    
    s = tf.reshape(x['tokens'], shape=[1])
    chars = tf.strings.bytes_split(s).values
    ids = table.lookup(chars)
    
    return ids

In [0]:
def split_input_target(chunk):
    
    input_text = tf.map_fn(lambda x: x[:-1], chunk)
    target_text = tf.map_fn(lambda x: x[1:], chunk)
    
    return BatchType(input_text, target_text)

In [0]:
def preprocess(dataset):
    
    return (
        # Map ASCII chars to int64 indexes using the vocab
        dataset.map(to_ids)
        # Split into individual chars
        .unbatch()
        # Form example sequences of SEQ_LENGTH +1
        .batch(SEQ_LENGTH + 1, drop_remainder=True)
        # Shuffle and form minibatches
        .shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
        # And finally split into (input, target) tuples,
        # each of length SEQ_LENGTH.
        .map(split_input_target))
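
The chunking and input/target split performed by `preprocess` can be sketched on a plain Python list. This is a simplified illustration that skips shuffling and minibatching; `make_examples` and `split_input_target_sketch` are hypothetical names. Each chunk holds `seq_length + 1` ids so that the input and the target, offset by one position, both have length `seq_length`.

```python
def make_examples(ids, seq_length):
    """Chunk ids into windows of seq_length + 1, dropping any incomplete remainder."""
    window = seq_length + 1
    n_full = len(ids) // window
    return [ids[i * window:(i + 1) * window] for i in range(n_full)]

def split_input_target_sketch(chunk):
    """The input is every id but the last; the target is every id but the first."""
    return chunk[:-1], chunk[1:]

toy_ids = list(range(10))
toy_chunks = make_examples(toy_ids, seq_length=3)
toy_pairs = [split_input_target_sketch(chunk) for chunk in toy_chunks]
```

Here the ten ids yield two full chunks of four, and the trailing ids 8 and 9 are dropped, just as `drop_remainder=True` discards an incomplete batch.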

In [0]:
def preprocess_data_for_client(client, data_source):
    
    return preprocess(data_source.create_tf_dataset_for_client(client))

### Get Sample Clients for Validation and Testing
Note that the train clients will be sampled within the training loop.

In [0]:
def get_sample_clients(dataset, num_clients):
    
    random_indices = np.random.choice(len(dataset.client_ids), size=num_clients, replace=False)
    
    return np.array(dataset.client_ids)[random_indices]
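
A seeded variant can help when comparing runs, since `np.random.choice` above draws a different client set on every call unless the global seed is fixed. The following is a standard-library sketch with a hypothetical helper `sample_clients` and made-up client ids; it is not used elsewhere in this notebook.

```python
import random

def sample_clients(client_ids, num_clients, seed=None):
    """Pick num_clients distinct ids; passing a seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(client_ids), num_clients)

toy_client_ids = [f"client_{i}" for i in range(20)]
round_a = sample_clients(toy_client_ids, num_clients=5, seed=42)
round_b = sample_clients(toy_client_ids, num_clients=5, seed=42)
```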

In [0]:
val_clients = get_sample_clients(val_data, num_clients=NUM_VAL_CLIENTS)
test_clients = get_sample_clients(test_data, num_clients=NUM_TEST_CLIENTS)

### Build and Preprocess the Validation and Test Datasets
Concatenate the per-client datasets into a single validation dataset and a single test dataset for evaluation with Keras.

In [0]:
val_dataset = functools.reduce(lambda d1, d2: d1.concatenate(d2), 
                               [preprocess_data_for_client(client, val_data) for client in val_clients])

test_dataset = functools.reduce(lambda d1, d2: d1.concatenate(d2), 
                                [preprocess_data_for_client(client, test_data) for client in test_clients])

### Define Lists to Track Loss and Accuracy at Each Training Round

In [0]:
train_loss = []
train_accuracy = []
val_loss = []
val_accuracy = []

### Define the Evaluation Function to Use During Training

In [0]:
def keras_evaluate(keras_model, state, val_dataset):
    
    tff.learning.assign_weights_to_keras_model(keras_model, state.model)
    loss, accuracy = keras_model.evaluate(val_dataset, steps=2)
    
    val_loss.append(loss)
    val_accuracy.append(accuracy)

### Define Loss Function and Metrics

In [0]:
class FlattenedCategoricalAccuracy(tf.keras.metrics.SparseCategoricalAccuracy):

    def __init__(self, name='accuracy', dtype=None):
        super(FlattenedCategoricalAccuracy, self).__init__(name, dtype=dtype)

    def update_state(self, y_true, y_pred, sample_weight=None):
        
        y_true = tf.reshape(y_true, [-1, 1])
        y_pred = tf.reshape(y_pred, [-1, len(vocab), 1])
        
        return super(FlattenedCategoricalAccuracy, self).update_state(y_true, y_pred, sample_weight)
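
The idea behind flattening is that every position of every sequence in a batch counts as one prediction. The pure-Python sketch below illustrates that computation on made-up data; `flattened_accuracy` is a hypothetical function and not a substitute for the Keras metric above.

```python
def flattened_accuracy(y_true, y_pred_logits):
    """Fraction of positions, across all sequences, where argmax(logits) equals the label."""
    correct = 0
    total = 0
    for true_seq, logit_seq in zip(y_true, y_pred_logits):
        for true_id, logits in zip(true_seq, logit_seq):
            predicted_id = max(range(len(logits)), key=logits.__getitem__)
            correct += int(predicted_id == true_id)
            total += 1
    return correct / total

# Two sequences of length two over a three-symbol vocabulary.
acc_true = [[0, 2], [1, 1]]
acc_logits = [
    [[3.0, 0.1, 0.2], [0.1, 0.2, 3.0]],  # predicts 0, then 2
    [[0.1, 3.0, 0.2], [3.0, 0.1, 0.2]],  # predicts 1, then 0
]
toy_accuracy = flattened_accuracy(acc_true, acc_logits)
```

Three of the four positions are predicted correctly, so the flattened accuracy is 0.75.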

In [0]:
def compile(keras_model):
    
    keras_model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[FlattenedCategoricalAccuracy()]
    )
    
    return keras_model

### Load and Compile the Model
The Keras model is accessed as a global variable: TFF clones it to create its own copy, and the training loop below updates it with the latest federated weights.

In [0]:
keras_model = build_model(batch_size=BATCH_SIZE,
                          vocab_size=vocab_size,
                          seq_length=SEQ_LENGTH)
compile(keras_model)

In [0]:
keras_model.summary()

### Create TFF Version of the Model to be Trained with Federated Averaging
- Clone the keras_model inside `create_tff_model()`, which TFF will call to produce a new copy of the model inside the graph that it will serialize.
- TFF uses a `dummy_batch` so it knows the types and shapes that your model expects.
- Build and serialize the TensorFlow graph with `build_federated_averaging_process`.

In [0]:
def create_tff_model():
    
    x = tf.constant(np.random.randint(1, len(vocab), size=[BATCH_SIZE, SEQ_LENGTH]))
    dummy_batch = collections.OrderedDict([('x', x), ('y', x)]) 
    keras_model_clone = compile(tf.keras.models.clone_model(keras_model))
    
    return tff.learning.from_compiled_keras_model(keras_model_clone, dummy_batch=dummy_batch)

In [0]:
fed_avg = tff.learning.build_federated_averaging_process(model_fn=create_tff_model)

### Initialize the Federated Averaging Process and the Starting Model State

In [0]:
# NOTE: If the statement below fails, it means that you are
# using an older version of TFF without the high-performance
# executor stack. Call `tff.framework.set_default_executor()`
# instead to use the default reference runtime.
if six.PY3:
    tff.framework.set_default_executor(tff.framework.create_local_executor())

In [0]:
# The state of the FL server, containing the model and optimization state.
state = fed_avg.initialize()

state = tff.learning.state_with_new_model_weights(
    state,
    trainable_weights=[v.numpy() for v in keras_model.trainable_weights],
    non_trainable_weights=[v.numpy() for v in keras_model.non_trainable_weights]
)

### Train Model Across Many Randomly Sampled Clients with Federated Averaging

In [0]:
for round_num in range(NUM_ROUNDS):
    
    # Examine validation metrics
    print(f'Evaluating before training round #{round_num} on {NUM_VAL_CLIENTS} clients.')
    keras_evaluate(keras_model, state, val_dataset)
    
    # Sample train clients to create a train dataset
    print(f'Sampling {NUM_TRAIN_CLIENTS} new clients.')
    train_clients = get_sample_clients(train_data, num_clients=NUM_TRAIN_CLIENTS)
    train_datasets = [preprocess_data_for_client(client, train_data) for client in train_clients]
    
    # Apply federated training round
    print('Applying federated training round.')
    state, metrics = fed_avg.next(state, train_datasets)
    
    # Examine training metrics
    print(f'Training metrics - loss: {metrics[1]:4.4f}; accuracy: {metrics[0]:4.4f}')
    train_loss.append(metrics[1])
    train_accuracy.append(metrics[0])

### Plot Model Objective Function

In [0]:
fig, ax = plt.subplots()
x_axis = range(0, NUM_ROUNDS)
ax.plot(x_axis, train_loss, label='Train')
ax.plot(x_axis, val_loss, label='Validation')
ax.legend(loc='best')
plt.ylabel('Value of Objective Function')
plt.title('Model Objective Function at Each Training Round')
plt.show()

### Plot Model Accuracy

In [0]:
fig, ax = plt.subplots()
x_axis = range(0, NUM_ROUNDS)
ax.plot(x_axis, train_accuracy, label='Train')
ax.plot(x_axis, val_accuracy, label='Validation')
ax.legend(loc='best')
plt.ylabel('Accuracy')
plt.title('Model Accuracy at Each Training Round')
plt.show()

### Get Final Evaluation

In [0]:
keras_evaluate(keras_model, state, val_dataset)

### Generate Text
Text generation requires batch_size=1.

In [0]:
keras_model_batch1.set_weights([v.numpy() for v in keras_model.weights])
print(generate_text(keras_model_batch1, "How's the water today? "))

**Suggested extensions:**

- Use ".repeat(NUM_EPOCHS)" on the client datasets to try multiple epochs of local training (e.g., as in McMahan et al.). See also Federated Learning for Image Classification which does this.
- Change the compile() command to experiment with using different optimization algorithms on the client.
- Try the server_optimizer argument to build_federated_averaging_process to try different algorithms for applying the model updates on the server.
- Try the client_weight_fn argument to build_federated_averaging_process to try different weightings of the clients. The default weights client updates by the number of examples on the client, but you can do e.g. client_weight_fn=lambda _: tf.constant(1.0).
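
To illustrate the last point, here is a framework-free sketch contrasting example-count weighting with uniform weighting. It simplifies in two labeled ways: it averages raw weight vectors, whereas TFF aggregates model updates, and its `client_weight_fn` receives only the example count, whereas TFF passes the client's local training outputs. `federated_average` and the toy numbers are hypothetical.

```python
def federated_average(client_weights, client_num_examples, client_weight_fn=None):
    """Weighted average of per-client weight vectors (simplified sketch)."""
    if client_weight_fn is None:
        # Default in this sketch: weight each client by its number of examples.
        coeffs = [float(n) for n in client_num_examples]
    else:
        coeffs = [float(client_weight_fn(n)) for n in client_num_examples]
    total = sum(coeffs)
    dim = len(client_weights[0])
    return [
        sum(c * w[j] for c, w in zip(coeffs, client_weights)) / total
        for j in range(dim)
    ]

toy_client_weights = [[1.0, 0.0], [3.0, 4.0]]
toy_num_examples = [30, 10]
avg_by_examples = federated_average(toy_client_weights, toy_num_examples)
avg_uniform = federated_average(
    toy_client_weights, toy_num_examples, client_weight_fn=lambda _: 1.0)
```

With example-count weighting the first client, which holds three times as many examples, pulls the average toward its weights; uniform weighting treats both clients equally.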