# Logistic regression on the epsilon dataset

This is a "getting started" exercise: simple logistic regression on the [epsilon dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon), which contains 400,000 training data points with 2,000 features each, and 100,000 test data points.
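For reference, the model fitted in this notebook is a single sigmoid unit, p = σ(w·x + b), trained with binary cross-entropy. The sketch below is a minimal pure-Python illustration, not the code used later (which relies on Keras). The function names `predict` and `binary_crossentropy` are made up for this example, and labels are assumed to be 0 or 1. It also shows a useful reference point: a model that always outputs 0.5 has a loss of ln 2 ≈ 0.693, so a loss near that value indicates chance-level predictions.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(weights, bias, x):
    """Predicted probability of the positive class for one example."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; labels are assumed to be 0 or 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# With all-zero parameters every prediction is 0.5 and the loss is ln 2.
x_batch = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]
probs = [predict([0.0, 0.0, 0.0], 0.0, x) for x in x_batch]
chance_loss = binary_crossentropy([1, 0], probs)
print(probs, chance_loss, math.log(2))
```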

This notebook is mostly for trying things out; the "real" script is in `../logistic.py`. To run it locally, I used a smaller version of the epsilon dataset, built by taking the first 1000 lines of the test set as the "smaller training set" and the last 200 lines of the test set as the "smaller test set", as follows (in bash; replace `~/jadeite/data/sources` with wherever your data directory is):

``` bash
mkdir -p ~/jadeite/data/sources/epsilon
cd ~/jadeite/data/sources/epsilon
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
bunzip2 epsilon_normalized.t.bz2
mv epsilon_normalized.t epsilon_normalized.t.full
head -n 1000 epsilon_normalized.t.full > epsilon_normalized
tail -n 200 epsilon_normalized.t.full > epsilon_normalized.t
```
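The downloaded files are in LIBSVM text format: each line is a label followed by `index:value` pairs with 1-based feature indices. The loader used below (`data.epsilon`) is not shown in this notebook, so the following is only a sketch of how one such line could be turned into a dense feature vector. `parse_libsvm_line` is a hypothetical helper, and it assumes the labels in the file are -1 and +1, mapping them to 0 and 1 to suit a sigmoid output.

```python
def parse_libsvm_line(line, nfeatures=2000):
    """Parse one 'label idx:val idx:val ...' line into (features, label)."""
    fields = line.split()
    label = 1 if float(fields[0]) > 0 else 0  # assumption: map -1/+1 to 0/1
    features = [0.0] * nfeatures
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index) - 1] = float(value)  # LIBSVM indices start at 1
    return features, label

# Toy line with 4 features (the real files have 2,000 per line).
features, label = parse_libsvm_line("-1 1:0.5 3:-0.25 4:1e-2", nfeatures=4)
print(features, label)
```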

In [1]:
import tensorflow as tf
import nest_asyncio
nest_asyncio.apply()

import data.epsilon as epsilon
import results

2021-07-03 20:24:28.574937: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-03 20:24:28.574985: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 1)                 2001      
Total params: 2,001
Trainable params: 2,001
Non-trainable params: 0
_________________________________________________________________


2021-07-03 20:24:29.690028: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-03 20:24:29.690078: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-03 20:24:29.690099: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (zinfandel): /proc/driver/nvidia/version does not exist
2021-07-03 20:24:29.690322: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
nepochs = 5
batch_size = 64
# Repeat the data up front so fit() can draw nepochs * steps_per_epoch batches from it.
dataset = epsilon.train_dataset().repeat(nepochs).batch(batch_size)
model.fit(dataset, epochs=nepochs, steps_per_epoch=epsilon.ntrain // batch_size)

2021-07-03 20:24:29.841419: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-03 20:24:29.841975: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 1992005000 Hz


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f624a9ded90>

In [4]:
test_dataset = epsilon.test_dataset().batch(batch_size)
evaluation = model.evaluate(test_dataset, return_dict=True)
evaluation



{'loss': 0.6930496692657471, 'accuracy': 0.5149999856948853}

In [5]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD()

nepochs = 5
batch_size = 64
nbatches = epsilon.ntest // batch_size

for epoch in range(nepochs):
    dataset = epsilon.test_dataset().batch(batch_size)
    for i, (x, y) in dataset.enumerate():
        with tf.GradientTape() as tape:
            ŷ = model(x)
            loss = loss_fn(y, ŷ)
        gradients = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    
        if i % 10 == 0:
            print(f"epoch {epoch} of {nepochs}, {i} of {nbatches}, loss: {loss:f}", end='\r')

epoch 4 of 5, 0 of 3, loss: 0.693686

In [6]:
batch_size = 1000
dataset = epsilon.test_dataset().batch(batch_size)
nbatches = epsilon.ntest // batch_size
accuracy_fn = tf.keras.metrics.BinaryAccuracy()
for i, (x, y) in dataset.enumerate():
    ŷ = model(x)
    accuracy_fn.update_state(y, ŷ)
    print(f"{i} of {nbatches}...", end='\r')
accuracy = accuracy_fn.result().numpy()
print(f"\nAccuracy: {accuracy}")

0 of 0...
Accuracy: 0.5099999904632568


# Simple federated averaging

Again mostly an exercise: this is an attempt to use the TensorFlow Federated (TFF) framework with federated averaging to achieve the same thing.
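In broad strokes, one round of federated averaging has each client train locally starting from the current server weights, after which the server combines the clients' weight changes, weighting each by its number of examples. The sketch below illustrates only that server-side step in pure Python. `federated_average_round` and its arguments are invented for this illustration and are not part of the TFF API. The `server_lr` parameter plays the role of a server SGD step: with `server_lr=1.0` the result is the plain weighted average of the client models.

```python
def federated_average_round(server_weights, client_weights, client_sizes, server_lr=1.0):
    """Combine client models into new server weights (server-side step only).

    server_weights: list of floats, the weights sent to every client.
    client_weights: one list of floats per client, after local training.
    client_sizes:   number of training examples each client used.
    """
    total = sum(client_sizes)
    new_weights = []
    for j, w in enumerate(server_weights):
        # Example-weighted mean of each client's change to weight j.
        delta = sum(n * (cw[j] - w) for cw, n in zip(client_weights, client_sizes)) / total
        new_weights.append(w + server_lr * delta)
    return new_weights

server = [0.0, 1.0]
clients = [[1.0, 1.0], [4.0, 3.0]]  # weights after local training
sizes = [300, 100]                  # examples per client
averaged = federated_average_round(server, clients, sizes)
damped = federated_average_round(server, clients, sizes, server_lr=0.5)
print(averaged, damped)
```

A `server_lr` below 1 moves the server model only part of the way towards that average.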

In [7]:
import tensorflow_federated as tff

In [8]:
%load_ext tensorboard

The `Dataset.shard()` method divides a dataset into several shards. Originally I had something like this:

``` python
def client_data_by_shard(client_id):
    return train_dataset.shard(nclients, client_id)

client_data = tff.simulation.datasets.ClientData.from_clients_and_fn(range(nclients), client_data_by_shard)
```

but we don't actually need a `ClientData` object, since TFF just takes in lists of `tf.data.Dataset` objects.
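To make the sharding concrete: `Dataset.shard(num_shards, index)` keeps every `num_shards`-th element, starting at position `index`. The pure-Python sketch below mimics that on a list (the `shard` function here is a stand-in written for illustration, not the TensorFlow method). Since the dataset is batched before it is sharded in the next cell, the elements being dealt out are whole batches. The numbers assume the 1000-example training subset described at the top, which gives 16 batches of up to 64 examples.

```python
def shard(elements, num_shards, index):
    """Keep the elements whose position p satisfies p % num_shards == index."""
    return elements[index::num_shards]

nclients = 10
batch_ids = list(range(16))  # assumption: 1000 examples in batches of 64
client_batches = [shard(batch_ids, nclients, i) for i in range(nclients)]
for i, batches in enumerate(client_batches):
    print(f"client {i}: batches {batches}")
```

With these numbers the first six clients receive two batches each and the remaining four receive one, so the shards are not equal in size.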

In [9]:
nclients = 10
nrounds = 8
batch_size = 64
train_dataset = epsilon.train_dataset().batch(batch_size)
client_shards = [train_dataset.shard(nclients, i) for i in range(nclients)]

In [10]:
train_dataset.element_spec

(TensorSpec(shape=(None, 2000), dtype=tf.float64, name=None),
 TensorSpec(shape=(None,), dtype=tf.int64, name=None))

In [11]:
def create_keras_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
    ])

def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=train_dataset.element_spec,
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.BinaryAccuracy()],
    )

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(),
)

In [12]:
results_dir = results.create_results_directory()
log_dir = results_dir / 'logs'
summary_writer = tf.summary.create_file_writer(str(log_dir))  # doesn't support Path objects

state = iterative_process.initialize()
state

Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`



ServerState(model=ModelWeights(trainable=[array([[ 0.04946871],
       [ 0.018157  ],
       [ 0.03626599],
       ...,
       [ 0.0412962 ],
       [-0.01557492],
       [ 0.04785355]], dtype=float32), array([0.], dtype=float32)], non_trainable=[]), optimizer_state=[0], delta_aggregate_state=OrderedDict([('value_sum_process', ()), ('weight_sum_process', ())]), model_broadcast_state=())

In [13]:
with summary_writer.as_default():
    for r in range(nrounds):
        print(f"round {r} of {nrounds}...")
        state, metrics = iterative_process.next(state, client_shards)
        for name, value in metrics['train'].items():
            tf.summary.scalar(name, value, step=r)

round 0 of 8...
round 1 of 8...
round 2 of 8...
round 3 of 8...
round 4 of 8...
round 5 of 8...
round 6 of 8...
round 7 of 8...


In [14]:
metrics['train']

OrderedDict([('binary_accuracy', 0.4873047), ('loss', 0.6933331)])

In [15]:
%tensorboard --logdir {log_dir}

Evaluation:

In [16]:
test_model = create_keras_model()
test_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.BinaryAccuracy()],
)
state.model.assign_weights_to(test_model)
test_dataset = epsilon.test_dataset().batch(batch_size)
test_model.evaluate(test_dataset)



[0.6929421424865723, 0.5099999904632568]