# Logistic regression on the epsilon dataset

This is a "getting started" exercise: a simple logistic regression on the [epsilon dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon), which contains 400,000 training examples and 100,000 test examples, each with 2,000 features.

The training data file is about 12 GB uncompressed, so this notebook trains on the test data (about 3 GB) to get things going. As a result, the accuracy reported at the end is measured on the same data the model was fitted to and is not a held-out estimate. For a real experiment, train on the training set and keep the test set for evaluation only.
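The `epsilon` module imported below is not shown in this notebook. As a rough idea of what such a loader might have to do, here is a minimal sketch of parsing the LIBSVM text format (`label index:value ...`) one line at a time, so the file never has to fit in memory. The function names are hypothetical, and the sketch assumes labels are stored as +1 / -1 and need mapping to 1 / 0 for a sigmoid output.

```python
from typing import Iterable, Iterator, List, Tuple

NUM_FEATURES = 2000  # feature count stated in the text above


def parse_libsvm_line(line: str, num_features: int = NUM_FEATURES) -> Tuple[List[float], float]:
    """Parse one 'label index:value ...' line into (dense features, 0/1 label)."""
    fields = line.split()
    # Assumption: labels are +1 / -1; map them to 1.0 / 0.0 for binary cross-entropy.
    label = 1.0 if float(fields[0]) > 0 else 0.0
    features = [0.0] * num_features
    for item in fields[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # LIBSVM indices are 1-based
    return features, label


def read_examples(lines: Iterable[str], num_features: int = NUM_FEATURES) -> Iterator[Tuple[List[float], float]]:
    """Lazily yield parsed examples, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_libsvm_line(line, num_features)


# Tiny in-memory example with 4 features instead of 2000.
sample = ["-1 1:0.5 3:-0.25", "+1 2:1.0"]
examples = list(read_examples(sample, num_features=4))
print(examples)
```

A generator like `read_examples` could plausibly be wrapped with `tf.data.Dataset.from_generator` to obtain something that supports `.batch()` and `.repeat()` as used below, though the actual `epsilon.test_dataset()` may be implemented quite differently.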

In [1]:
import tensorflow as tf
import epsilon

In [2]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 1)                 2001      
Total params: 2,001
Trainable params: 2,001
Non-trainable params: 0
_________________________________________________________________
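
The parameter count in the summary follows directly from the model definition. A single dense unit with a sigmoid activation is logistic regression: one weight per input feature plus one bias.

$$
\hat{y} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad w \in \mathbb{R}^{2000},\; b \in \mathbb{R}
$$

That gives $2000 + 1 = 2001$ trainable parameters. The binary cross-entropy loss used for training, written for a single example with label $y \in \{0, 1\}$, is

$$
\ell(y, \hat{y}) = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr].
$$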


In [3]:
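model.compile(loss="binary_crossentropy", optimizer='sgd', metrics=['accuracy'])
nepochs = 5
batch_size = 64
# Repeat the data once per epoch, since fit() draws a fixed number of batches per epoch.
dataset = epsilon.test_dataset().repeat(nepochs).batch(batch_size)
# steps_per_epoch should be a whole number of batches, so use integer division.
model.fit(dataset, epochs=nepochs, steps_per_epoch=epsilon.ntest // batch_size)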

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f59c41158b0>

In [4]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2000,)),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD()

nepochs = 5
batch_size = 64
nbatches = epsilon.ntest // batch_size

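for epoch in range(nepochs):
    # Rebuild the dataset each epoch so every pass starts from the first example.
    dataset = epsilon.test_dataset().batch(batch_size)
    for i, (x, y) in dataset.enumerate():
        # Record the forward pass so the loss can be differentiated.
        with tf.GradientTape() as tape:
            ŷ = model(x)
            loss = loss_fn(y, ŷ)
        # Take one SGD step on the weights and the bias.
        gradients = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))

        # Report the loss of the current batch every 10 batches.
        if i % 10 == 0:
            print(f"epoch {epoch} of {nepochs}, {i} of {nbatches}, loss: {loss:f}", end='\r')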

epoch 4 of 5, 1560 of 1562, loss: 0.677202
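
To make explicit what the tape and the optimizer do in the loop above, here is a minimal pure-Python sketch of one mini-batch SGD step for logistic regression. It assumes the loss is averaged over the batch and uses a learning rate of 0.01, which matches the Keras SGD default as far as I know. The function names are illustrative only, and Keras additionally clips predictions for numerical stability, which this sketch omits.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(
    w: Sequence[float],
    b: float,
    xs: Sequence[Sequence[float]],
    ys: Sequence[float],
    lr: float = 0.01,
) -> Tuple[List[float], float]:
    """One SGD step on the batch-averaged binary cross-entropy loss."""
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        # For sigmoid + cross-entropy, d(loss)/d(logit) simplifies to (y_hat - y).
        error = y_hat - y
        for j, xj in enumerate(x):
            grad_w[j] += error * xj / n
        grad_b += error / n
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b


# Two examples with two features each, starting from zero weights.
new_w, new_b = sgd_step([0.0, 0.0], 0.0, xs=[[1.0, 2.0], [3.0, 4.0]], ys=[1.0, 0.0], lr=0.1)
print(new_w, new_b)
```

With zero initial weights both predictions are 0.5, so the update is driven purely by the difference between the two examples' features.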

In [5]:
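# Evaluate in larger batches; no gradient computation is needed here.
# Note: this is the same data the model was trained on.
batch_size = 1000
dataset = epsilon.test_dataset().batch(batch_size)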
nbatches = epsilon.ntest // batch_size
accuracy_fn = tf.keras.metrics.BinaryAccuracy()
for i, (x, y) in dataset.enumerate():
    ŷ = model(x)
    accuracy_fn.update_state(y, ŷ)
    print(f"{i} of {nbatches}...", end='\r')
accuracy = accuracy_fn.result().numpy()
print(f"\nAccuracy: {accuracy}")

99 of 100...
Accuracy: 0.6976001262664795
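
The evaluation loop relies on the metric accumulating counts across batches and only dividing at the end. A minimal pure-Python sketch of that idea follows. The class is a hypothetical stand-in, not the Keras implementation, and it assumes a decision threshold of 0.5.

```python
from typing import Sequence


class StreamingBinaryAccuracy:
    """Accumulates correct/total counts across batches."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.correct = 0
        self.total = 0

    def update_state(self, y_true: Sequence[float], y_pred: Sequence[float]) -> None:
        for y, p in zip(y_true, y_pred):
            predicted = 1.0 if p > self.threshold else 0.0
            self.correct += int(predicted == y)
            self.total += 1

    def result(self) -> float:
        return self.correct / self.total if self.total else 0.0


metric = StreamingBinaryAccuracy()
metric.update_state([1.0, 0.0], [0.9, 0.2])            # both correct
metric.update_state([1.0, 1.0, 0.0], [0.4, 0.7, 0.6])  # one of three correct
print(metric.result())
```

Because the counts are summed before dividing, the result is the accuracy over all examples seen, even when the last batch is smaller than the others.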
