Logistic regression on the epsilon dataset
==========================================

_Chuan-Zheng Lee <<czlee@stanford.edu>>_ <br />
_July 2021_

This is a "getting started" exercise. Simple logistic regression on the [epsilon dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon), which contains 400,000 training data points with 2,000 features, and 100,000 test data points.

This notebook is mostly to try things out. The "real" script is in ../logistic.py. To run this locally, I used a smaller version of the epsilon dataset, constructed by taking the first 1000 lines of the test set as the "smaller training set", and the last 200 lines of the test set as the "smaller test set", as follows (in bash, replace `~/jadeite/data/sources` with wherever your data directory is):

``` bash
mkdir -p ~/jadeite/data/sources/epsilon
cd ~/jadeite/data/sources/epsilon
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
bunzip2 epsilon_normalized.t.bz2
mv epsilon_normalized.t epsilon_normalized.t.full
head epsilon_normalized.t.full -n 1000 > epsilon_normalized
tail epsilon_normalized.t.full -n 200 > epsilon_normalized.t
```

In [1]:
import tensorflow as tf
import nest_asyncio
nest_asyncio.apply()

import data.epsilon_tf as epsilon
import utils

2021-07-04 17:38:42.433108: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-04 17:38:42.433159: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 1)                 2001      
Total params: 2,001
Trainable params: 2,001
Non-trainable params: 0
_________________________________________________________________


2021-07-04 17:38:43.382481: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-04 17:38:43.382644: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-04 17:38:43.382676: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (zinfandel): /proc/driver/nvidia/version does not exist
2021-07-04 17:38:43.382945: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
model.compile(loss="binary_crossentropy", optimizer='sgd', metrics=['accuracy'])
nepochs = 5
batch_size = 64
dataset = epsilon.train_dataset().repeat(nepochs).batch(batch_size)
model.fit(dataset, epochs=nepochs, steps_per_epoch = epsilon.ntrain // batch_size)

Loading data from /home/czlee/jadeite/data/sources/epsilon/epsilon_normalized, up to line 800... done.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


2021-07-04 17:38:44.382453: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-04 17:38:44.383664: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 1992005000 Hz


<tensorflow.python.keras.callbacks.History at 0x7fa510deda30>

In [4]:
test_dataset = epsilon.test_dataset().batch(batch_size)
evaluation = model.evaluate(test_dataset, return_dict=True)
evaluation



{'loss': 0.6956179141998291, 'accuracy': 0.4350000023841858}

In [5]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD()

nepochs = 5
batch_size = 64
nbatches = epsilon.ntest // batch_size

for epoch in range(nepochs):
    dataset = epsilon.test_dataset().batch(batch_size)
    for i, (x, y) in dataset.enumerate():
        with tf.GradientTape() as tape:
            ŷ = model(x)
            loss = loss_fn(y, ŷ)
        gradients = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    
        if i % 10 == 0:
            print(f"epoch {epoch} of {nepochs}, {i} of {nbatches}, loss: {loss:f}", end='\r')

epoch 4 of 5, 0 of 3, loss: 0.695436

In [6]:
batch_size = 1000
dataset = epsilon.test_dataset().batch(batch_size)
nbatches = epsilon.ntest // batch_size
accuracy_fn = tf.keras.metrics.BinaryAccuracy()
for i, (x, y) in dataset.enumerate():
    ŷ = model(x)
    accuracy_fn.update_state(y, ŷ)
    print(f"{i} of {nbatches}...", end='\r')
accuracy = accuracy_fn.result().numpy()
print(f"\nAccuracy: {accuracy}")

0 of 0...
Accuracy: 0.5350000262260437


# Simple federated averaging

Again, mostly an exercise, this is an attempt to use the tensorflow-federated framework with federated averaging to achieve the same thing.

In [7]:
import tensorflow_federated as tff

In [8]:
%load_ext tensorboard

The `Dataset.shard()` method divides a dataset into several shards. Originally I had something like this:

``` python
def client_data_by_shard(client_id):
    return train_dataset.shard(nclients, client_id)

client_data = tff.simulation.datasets.ClientData.from_clients_and_fn(range(nclients), client_data_by_shard)
```

but we don't actually need a `ClientData` object, since TFF just takes in lists of `tf.data.Dataset` objects.

In [9]:
nclients = 10
nrounds = 8
batch_size = 64
train_dataset = epsilon.train_dataset().batch(batch_size)
client_shards = [train_dataset.shard(nclients, i) for i in range(nclients)]

Loading data from /home/czlee/jadeite/data/sources/epsilon/epsilon_normalized, up to line 800... done.


In [10]:
train_dataset.element_spec

(TensorSpec(shape=(None, 2000), dtype=tf.float32, name=None),
 TensorSpec(shape=(None,), dtype=tf.float32, name=None))

In [11]:
def create_keras_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
    ])

def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=train_dataset.element_spec,
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.BinaryAccuracy()],
    )

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(),
)

In [12]:
results_dir = utils.create_results_directory()
log_dir = results_dir / 'logs'
summary_writer = tf.summary.create_file_writer(str(log_dir))  # doesn't support Path objects

state = iterative_process.initialize()
state

Saving results to: /home/czlee/jadeite/results/20210704-173849
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`


Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`


ServerState(model=ModelWeights(trainable=[array([[-0.03082578],
       [-0.04259921],
       [-0.01149698],
       ...,
       [ 0.03364855],
       [-0.00396271],
       [-0.04661773]], dtype=float32), array([0.], dtype=float32)], non_trainable=[]), optimizer_state=[0], delta_aggregate_state=OrderedDict([('value_sum_process', ()), ('weight_sum_process', ())]), model_broadcast_state=())

In [13]:
with summary_writer.as_default():
    for r in range(nrounds):
        print(f"round {r} of {nrounds}...")
        state, metrics = iterative_process.next(state, client_shards)
        for name, value in metrics['train'].items():
            tf.summary.scalar(name, value, step=r)

round 0 of 8...
round 1 of 8...


2021-07-04 17:38:55.108429: W tensorflow/core/kernels/data/model_dataset_op.cc:205] Optimization loop failed: Cancelled: Operation was cancelled


round 2 of 8...


2021-07-04 17:38:55.771739: W tensorflow/core/kernels/data/model_dataset_op.cc:205] Optimization loop failed: Cancelled: Operation was cancelled


round 3 of 8...


2021-07-04 17:38:56.576080: W tensorflow/core/kernels/data/model_dataset_op.cc:205] Optimization loop failed: Cancelled: Operation was cancelled


round 4 of 8...


2021-07-04 17:38:57.240788: W tensorflow/core/kernels/data/model_dataset_op.cc:205] Optimization loop failed: Cancelled: Operation was cancelled


round 5 of 8...


2021-07-04 17:38:57.878441: W tensorflow/core/kernels/data/model_dataset_op.cc:205] Optimization loop failed: Cancelled: Operation was cancelled


round 6 of 8...


2021-07-04 17:38:58.508447: W tensorflow/core/kernels/data/model_dataset_op.cc:205] Optimization loop failed: Cancelled: Operation was cancelled


round 7 of 8...


In [14]:
metrics['train']

OrderedDict([('binary_accuracy', 0.48710936), ('loss', 0.69323754)])

In [15]:
%tensorboard --logdir {log_dir}

Evaluation:

In [16]:
test_model = create_keras_model()
test_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.BinaryAccuracy()],
)
state.model.assign_weights_to(test_model)
test_dataset = epsilon.test_dataset().batch(batch_size)
test_model.evaluate(test_dataset)



[0.693757176399231, 0.5]

# Federated averaging done "manually"

In [17]:
nrounds = 5
nclients = 10
batch_size = 64


def create_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
    ])

def client_train(dataset, model, loss_fn, optimizer):
    for x, y in dataset:
        with tf.GradientTape() as tape:
            pred = model(x)
            loss = loss_fn(y, pred)
        gradients = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))

    return loss.numpy()


def server_aggregate(global_model, client_models):
    """Aggregates client models by just taking the mean, a.k.a.
    federated averaging."""
    client_weights = [model.get_weights() for model in client_models]
    new_weights = [
        tf.math.reduce_mean(tf.stack(weights, axis=0), axis=0)
        for weights in zip(*client_weights)
    ]
    for model in client_models:
        model.set_weights(new_weights)


def test(dataset, model, loss_fn, accuracy_fn):
    test_losses = []
    accuracy_fn.reset_state()
    for x, y in dataset:
        pred = model(x)
        accuracy_fn.update_state(y, pred)
        test_losses.append(loss_fn(y, pred))

    test_loss = tf.math.reduce_mean(test_losses).numpy()
    accuracy = accuracy_fn.result().numpy()
    return test_loss, accuracy

In [18]:
global_model = create_model()
client_models = [create_model() for i in range(nclients)]
for model in client_models:
    model.set_weights(global_model.get_weights())
model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 1)                 2001      
Total params: 2,001
Trainable params: 2,001
Non-trainable params: 0
_________________________________________________________________


In [19]:
loss_fn = tf.keras.losses.BinaryCrossentropy()
accuracy_fn = tf.keras.metrics.BinaryAccuracy()
client_optimizers = [tf.keras.optimizers.SGD(learning_rate=1e-2) for i in range(nclients)]

In [20]:
# train clients, just one epoch
clients = zip(client_shards, client_models, client_optimizers)
for i, (dataset, model, optimizer) in enumerate(clients):
    loss = client_train(dataset, model, loss_fn, optimizer)
    print(f"Client {i} of {nclients}: loss {loss}")

Client 0 of 10: loss 0.6925334930419922
Client 1 of 10: loss 0.6869416832923889
Client 2 of 10: loss 0.6913622617721558
Client 3 of 10: loss 0.692907452583313
Client 4 of 10: loss 0.6913371086120605
Client 5 of 10: loss 0.6915407180786133
Client 6 of 10: loss 0.6922076940536499
Client 7 of 10: loss 0.6917317509651184
Client 8 of 10: loss 0.6906509399414062
Client 9 of 10: loss 0.6940387487411499


In [21]:
client_models[0].get_weights()

[array([[ 0.05350032],
        [-0.05351546],
        [ 0.04530976],
        ...,
        [-0.04561194],
        [ 0.02014246],
        [ 0.04896324]], dtype=float32),
 array([-0.00151702], dtype=float32)]

In [22]:
client_models[1].get_weights()

[array([[ 0.05349647],
        [-0.05352294],
        [ 0.04531534],
        ...,
        [-0.04560898],
        [ 0.02012952],
        [ 0.04897283]], dtype=float32),
 array([-0.00135763], dtype=float32)]

In [23]:
client_weights = [model.get_weights() for model in client_models]
[tf.stack(weights, axis=0) for weights in zip(*client_weights)]

[<tf.Tensor: shape=(10, 2000, 1), dtype=float32, numpy=
 array([[[ 0.05350032],
         [-0.05351546],
         [ 0.04530976],
         ...,
         [-0.04561194],
         [ 0.02014246],
         [ 0.04896324]],
 
        [[ 0.05349647],
         [-0.05352294],
         [ 0.04531534],
         ...,
         [-0.04560898],
         [ 0.02012952],
         [ 0.04897283]],
 
        [[ 0.05347595],
         [-0.05350532],
         [ 0.04529609],
         ...,
         [-0.04559762],
         [ 0.02013243],
         [ 0.04894402]],
 
        ...,
 
        [[ 0.05348852],
         [-0.05352173],
         [ 0.04530226],
         ...,
         [-0.04562178],
         [ 0.02014896],
         [ 0.04893569]],
 
        [[ 0.05350902],
         [-0.05348571],
         [ 0.04531093],
         ...,
         [-0.0456634 ],
         [ 0.02016379],
         [ 0.04889783]],
 
        [[ 0.05346051],
         [-0.05350388],
         [ 0.04530965],
         ...,
         [-0.04565586],
         [ 0.0

In [24]:
server_aggregate(global_model, client_models)

In [25]:
global_model.get_weights()

[array([[ 0.05347797],
        [-0.05350733],
        [ 0.04529819],
        ...,
        [-0.0456288 ],
        [ 0.02014136],
        [ 0.04891669]], dtype=float32),
 array([0.], dtype=float32)]

In [26]:
test_loss, accuracy = test(test_dataset, global_model, loss_fn, accuracy_fn)
print(f"test loss {test_loss}, accuracy {accuracy}")

test loss 0.6929768323898315, accuracy 0.51953125
