# Keras: the Sequential API

[Keras](https://keras.io/) is arguably the easiest library to use for building and training neural networks.  It has two major ways to build models:
1. The Sequential API (in this notebook), which looks and feels a lot like a `sklearn.pipeline.Pipeline()`.
2. The functional API (in the next notebook), which looks and feels very different, but which allows a lot more flexibility.

Keras once existed as a standalone library, but a few years back it was absorbed into [Tensorflow](https://www.tensorflow.org/), and the Keras API is now available as a Tensorflow module (`tensorflow.keras`).  We will not be covering Tensorflow itself, but it's one of the biggest libraries out there for doing neural networks and general GPU-accelerated matrix operations.  (It's also a pretty clunky library.)

## A note on GPUs

Welcome to a world of confusion, frustration, and borderline broken installers.

Graphics Processing Units (GPUs; also called "video cards") are a perfect tool for speeding up matrix multiplication--and neural networks are mostly matrix multiplication.  I'm not going to address the details here, but if you can make a GPU (rather than a CPU) run matrix multiplication, it can often run it _much_ faster, at least for large matrices.  But not everyone has a GPU, and not all GPU manufacturers use the same tools to run stuff on their GPUs.  E.g.: NVidia uses CUDA to write GPU code; AMD uses ROCm; Intel has its own toolkit too (oneAPI).  Almost every library for neural networks supports CUDA, a growing number support ROCm, and Intel's support is still pretty far behind (as are their dedicated GPUs).

So if you want to use your GPU to run code faster--and for neural networks, we're talking 10x faster or more--you need to go through the hassle of installing the right driver and software toolchain for your card.  That can be very non-trivial to do.  It's getting better, but it's still a bit of a mess.  (Mostly because some libraries pin their dependency requirements to old versions of e.g. CUDA, which causes all sorts of conflicts when trying to do an actual project.)

Installing Tensorflow specifically is an exercise in pain, at least on Windows.  It can't be installed with conda if you want GPU support, because Google seemingly hasn't bothered.  It has to be installed through `pip` (the standard package installer for Python).  It doesn't install all the dependencies it needs.  It doesn't always see your GPU.  It just generally doesn't always work.  So why are we using it?  Simple: if you can just get it to install, Keras is the easiest way to build neural networks.

All the code in this notebook can be run without a GPU.  But it will be slower.  Not so slow it can't be run--we're not using very large networks--but for any non-trivial neural network, a GPU can literally be the difference between minutes and days of runtime.

# Installation

If you have an NVidia GPU, get ready for a few more commands than normal.  At least on Windows (which I'm running this notebook from), we have to install `cudatoolkit` and `cudnn` to use GPU acceleration, then we have to use `pip`--not `conda`--to install tensorflow with GPU support.  For whatever reason, the Tensorflow devs just don't bother to maintain the `conda` version of tensorflow very well (if at all).

```bash
conda install cudatoolkit cudnn
python -m pip install -U pip
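# The separate `tensorflow-gpu` package has been retired; the plain `tensorflow`
# package includes GPU support.  On native Windows, 2.10 was the last release
# with GPU support, hence the version pin.
python -m pip install "tensorflow<2.11"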
```

This should install the correct versions of the CUDA tools.

If you have no GPU, or you run into issues with the GPU installation:
```bash
conda install tensorflow
```

If you have an AMD GPU: Tensorflow does seem to support AMD's ROCm (which is their competitor to NVidia's CUDA), but I don't know how stable it is or how to install it.

If you don't have a GPU at all, you'll be running code on CPU, which in practice limits you to fairly simple models.  (Fortunately, the models in these notebooks are simple enough that you should have no real problems, other than the models running a bit slower.)

Then test your installation.  If you see a non-empty list when you run the command below, Tensorflow sees your GPU and will use it.  Otherwise, if you see an empty list (`[]`), it'll run on CPU.  The networks we're going to use in this notebook are small enough that they can run on CPU, but it'll just take a bit longer.

In [1]:
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


# Quickstart

Most of the Sequential API looks pretty similar no matter what kind of model you're building; just swap out some layers or parameters here and there.  The cells below show a minimal working example of a model doing some simple regression on the [`diamonds`](https://www.tensorflow.org/datasets/catalog/diamonds) dataset.

In [2]:
import os
import pandas as pd
from tensorflow import keras

if os.path.isfile("diamonds.csv"):
    diamonds = pd.read_csv("diamonds.csv")
else:
    diamonds = pd.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv")
    diamonds.to_csv("diamonds.csv", index=False)

We want to predict the `price` value for each diamond.  Let's do a bit of quick re-shaping--mostly, we want to one-hot encode the `cut`, `color`, and `clarity` columns.

In [3]:
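diamonds = pd.get_dummies(
    diamonds,
    # Pass these as `columns=`; the second positional argument is `prefix`.
    columns=["cut", "color", "clarity"],
    # Plain floats, so `.values` below gives a single numeric array.
    dtype=float,
)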

# Train-test split; ~18.5% of the data (10,000 rows) for testing.
diamonds = diamonds.sample(frac=1, replace=False).reset_index(drop=True)
# Use positional slices so the two sets don't share a row (label-based
# `.loc[:9999]` and `.loc[9999:]` would both include row 9999).
test = diamonds.iloc[:10000]
train = diamonds.iloc[10000:]

print(test.shape)
print(train.shape)

train_x = train.drop(columns=["price"]).values
test_x = test.drop(columns=["price"]).values
train_y = train["price"].values
test_y = test["price"].values

(10000, 27)
(43940, 27)


Now, let's build a basic neural network.  We'll do three hidden layers of a moderate size, and configure it to use the squared error loss function.

In [4]:
%%time
# Define a simple model
model = keras.Sequential([
    keras.layers.Dense(32),
    keras.layers.Dense(32),
    keras.layers.Dense(32),
    keras.layers.Dense(1), # linear = no activation = identity
])

# Compile the model--this gets it ready to run and does some behind-the-scenes
# optimizations.
model.compile(
    # ADAM is a very standard optimizer; it's a solid go-to for most problems.
    optimizer=keras.optimizers.Adam(),
    
    # squared error loss --> this network is analogous to least-squares regression
    loss=keras.losses.MeanSquaredError(),
    
    # Metrics to monitor during training--these will be printed out as the model
    # trains.  The loss always gets printed out, so we're not going to specify
    # this right now.
    # metrics=[keras.metrics.MeanSquaredError()],
    
    # install the XLA library with conda and uncomment this line for extra speed,
    # both on CPU and on GPU.
    # jit_compile=True,
)

# Fit the model.
# model.fit() updates the model in-place and returns some data
# about the fit history, which can sometimes be useful.
fit_history = model.fit(
    train_x,
    train_y,
    
    # how many samples to use at once.  Higher --> more GPU VRAM used, model
    # iterates faster, but the model might actually converge slower.
    batch_size=256,
    
    # how many passes to do over the data.  This is a pretty low value just
    # for demonstration purposes.
    epochs=25,
    
    # set aside this fraction of the training data for "validation"--used to
    # monitor the progress of the model against a held-out dataset, kind of like
    # cross-validation.
    validation_split=0.1,
    
    # Callbacks = things to run at set points during training (e.g. after each
    # batch or epoch).  There are a lot of options for these.
    callbacks=[
        # this callback can end the training process early if some score/monitored
        # quantity reaches some criterion, e.g. changes only very little.
        keras.callbacks.EarlyStopping(
            # Monitor the validation split loss...
            monitor="val_loss",
            
            # ...and stop training when the loss hasn't decreased by more than
            # `min_delta` for `patience` epochs.
            min_delta=0,
            patience=5,
        ),
    ],
    
    # Print messages as training happens.
    #  0 = silent, 1 = progress bar, 2 = one line per epoch.
    # "auto" = usually use 1.
    verbose="auto",
)

predictions = model.predict(test_x)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
CPU times: total: 26 s
Wall time: 21.2 s
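
The `EarlyStopping` callback configured above boils down to a small amount of bookkeeping.  Here is a minimal pure-Python sketch of that rule, assuming a loss-like quantity where lower is better.  The function `early_stopping_epoch` and the example losses are made up for illustration, and this is a simplification rather than Keras's actual implementation.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs have passed without
    the monitored value improving on the best one by more than `min_delta`.
    If that never happens, stop_epoch is simply the last epoch.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best value: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.6, 3.8, 3.0]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With `restore_best_weights=True` (used later in this notebook), Keras also rolls the model back to the weights from the best epoch instead of keeping the ones from the last epoch.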


Stripping away a lot of the comments and filler, here's the basic outline of a Keras model:

```python
model = keras.Sequential([layers])
model.compile(options)
fit_history = model.fit(x, y, other_options)
predictions = model.predict(new_x)
```

We can also print out a summary of our model, which will show the layers, how many "neurons" they have (a "neuron" is just an entry in the vector that comes out of each layer), and a few other things.  In order to get the summary, though, the model has to know how many features are in the input observations.  Keras will figure this out automatically when we call `.fit()`, but we can also call `.build()` and pass it the input shape: `None` for the (variable) batch dimension, followed by the shape of one observation.  In our case, that's just the number of feature columns, i.e. the length of one row of `train_x` (not of `diamonds`, which still contains `price`).  It would look like this:

```python
model = keras.Sequential([...])
model.build((None, train_x.shape[1]))
print(model.summary())
```

_Or,_ we can tell the `Sequential()` model what the input shape is when we construct it, by giving it a `keras.layers.Input(shape=(train_x.shape[1],))` as the very first layer.

Printing out the model summary can be a nice way to get a feel for the model:

In [5]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                864       
                                                                 
 dense_1 (Dense)             (None, 32)                1056      
                                                                 
 dense_2 (Dense)             (None, 32)                1056      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
Total params: 3,009
Trainable params: 3,009
Non-trainable params: 0
_________________________________________________________________
None
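
The `Param #` column can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has an `n_in` by `n_out` weight matrix plus `n_out` biases.  A short sketch, assuming the 26 feature columns left after dropping `price`; `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


n_features = 26  # 27 columns after one-hot encoding, minus `price`
layer_sizes = [32, 32, 32, 1]

counts = []
n_in = n_features
for n_out in layer_sizes:
    counts.append(dense_params(n_in, n_out))
    n_in = n_out  # this layer's output feeds the next layer

print(counts)       # [864, 1056, 1056, 33]
print(sum(counts))  # 3009
```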


Anyways, let's see how well our model does on the test set, and just for kicks, let's compare it to a simple linear regression from scikit-learn:

In [6]:
# Predict on our test set and print out R2 score.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

%time lr_preds = LinearRegression().fit(train_x, train_y).predict(test_x)

print(f"Neural network R2 score:     {r2_score(test_y, predictions):.4f}")
print(f"Neural network MSE score:    {mean_squared_error(test_y, predictions):,.0f}")
print(f"Linear Regression R2 score:  {r2_score(test_y, lr_preds):.4f}")
print(f"Linear Regression MSE score: {mean_squared_error(test_y, lr_preds):,.0f}")

CPU times: total: 125 ms
Wall time: 43 ms
Neural network R2 score:     0.8863
Neural network MSE score:    1,845,888
Linear Regression R2 score:  0.9197
Linear Regression MSE score: 1,303,996


Oof.  That's...not looking too great for neural networks.  It was slow, it required a lot of finicky installation of libraries, and it did worse!

This is not the end of the story, though.  If our linear regression had done poorly, we might be out of luck; we could try doing some feature engineering, but we can only do so much, because the model itself is very simple.  A neural network, on the other hand, has a _lot_ more parameters we can tweak.  We could do this through cross-validation, but I'm going to skip all that and just show you a model that more or less matches our simple linear regression for performance.

Here are the changes I made:
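- I set each layer's activation to linear/identity explicitly.  (A `Dense` layer with no `activation` argument is already linear, so the first model as written above was linear too.)
- I made the layers bigger.
- I let the model train for more epochs.
- I set the learning rate for the ADAM optimizer explicitly (`1e-3`, which is also the Keras default).
- I decreased the batch size.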

In [7]:
# Define a simple model
model = keras.Sequential([
    keras.layers.Dense(128, activation="linear"),
    keras.layers.Dense(128, activation="linear"),
    keras.layers.Dense(128, activation="linear"),
    keras.layers.Dense(1, activation="linear"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.MeanSquaredError(),
)

fit_history = model.fit(
    train_x,
    train_y,
    batch_size=64,
    epochs=500,
    validation_split=0.1,
    callbacks=[
        keras.callbacks.EarlyStopping(
            monitor="val_loss",
            min_delta=0,
            patience=5,
            restore_best_weights=True,
        ),
    ],
    verbose=1,
)

predictions = model.predict(test_x)
print(f"Better model score: {r2_score(test_y, predictions)}")

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Better model score: 0.9162553035862872
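
It is no coincidence that this network lands so close to the linear regression: with identity activations, a stack of `Dense` layers is itself just one linear (affine) map.  A small NumPy sketch of that fact, using random stand-in weights rather than the trained model's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights and biases for Dense(128) x 3 -> Dense(1), all with
# identity activation, on 26 input features.
sizes = [26, 128, 128, 128, 1]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=b) for b in sizes[1:]]

x = rng.normal(size=(5, 26))  # five made-up observations

# Forward pass, layer by layer.
out = x
for w, b in zip(weights, biases):
    out = out @ w + b

# Collapse the whole stack into a single weight matrix and bias vector.
w_total = weights[0]
b_total = biases[0]
for w, b in zip(weights[1:], biases[1:]):
    b_total = b_total @ w + b
    w_total = w_total @ w

print(w_total.shape)                            # (26, 1)
print(np.allclose(out, x @ w_total + b_total))  # True
```

So the extra layers add parameters but no extra expressive power; it takes a nonlinear activation such as ReLU between layers to go beyond what a linear regression can fit.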


This example, admittedly, is not a great one for neural networks; when a simple linear regression can get an $R^2$ above 0.9 with basically no work, there's no reason to use a big complex neural network.  However, for some kinds of problems, the neural network is the better solution:
- Very sparse data, like text.
- Classification tasks.
- Anything where you need to optimize some more specialized quantity than the standard regression metrics.
- _Extremely_ large datasets.

It's not unheard of, in some cases, to end up with enormous networks, too.  If your network is doing poorly, you can always scale it up by adding way more layers, and making existing layers bigger.  Let's see another example, this time using the 20 Newsgroups dataset from scikit-learn.  We'll treat this like a 20-class classification problem: given the words of a post, identify the newsgroup it was posted to.  Our contenders:
- Another neural network!
- Naive Bayes
- Random Forest

In [8]:
# Load the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

x, y = fetch_20newsgroups(subset="all", return_X_y=True)
x = CountVectorizer(
    min_df=5,
    max_df=0.5,
    max_features=15_000,
).fit_transform(x)

# we'll need this later for the neural network--its
# target variables need to be structured differently.
# (`sparse_output` needs scikit-learn >= 1.2; older versions call it `sparse`.)
y_onehot = OneHotEncoder(sparse_output=False).fit_transform(y.reshape(-1,1))

train_x, test_x, train_y, test_y, train_y_onehot, test_y_onehot = train_test_split(
    x, y, y_onehot,
    train_size=0.9, stratify=y, random_state=0
)

print(train_x.shape, test_x.shape)

(16961, 15000) (1885, 15000)
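
The "structured differently" comment above refers to one-hot encoding: for a softmax output layer trained with categorical cross-entropy, each integer label becomes a vector with a single 1 in the position of its class.  A small NumPy sketch with made-up labels and scores (four classes rather than twenty, to keep it readable):

```python
import numpy as np

labels = np.array([2, 0, 3, 1])  # made-up integer class labels
n_classes = 4

# One-hot encode: row i has a 1 in column labels[i].
onehot = np.eye(n_classes)[labels]

# Made-up raw scores from a final layer, one row per observation.
logits = np.array([
    [0.1, 0.2, 2.0, 0.3],
    [1.5, 0.1, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.9],
    [0.2, 0.1, 1.1, 0.4],  # wrong on purpose: the true class is 1
])

# Softmax turns each row of scores into probabilities that sum to 1.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

# Categorical cross-entropy: minus the log-probability of the true class.
loss = -np.sum(onehot * np.log(probs), axis=1).mean()

# argmax undoes the one-hot structure and gives back integer labels.
preds = probs.argmax(axis=1)
print(preds)                     # [2 0 3 2]
print((preds == labels).mean())  # 0.75
```

This is also why the neural network cell below runs `np.argmax` on the predictions before scoring them against the integer labels in `test_y`.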


In [9]:
%%time
# Naive Bayes
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB, GaussianNB

nb = GridSearchCV(
    ComplementNB(),
    param_grid={"alpha": [10**i for i in range(-5, 0)] + [0]},
    scoring=make_scorer(accuracy_score),
    error_score=0,
    cv=5,
    n_jobs=4,
    verbose=1,
)
# ComplementNB requires non-negative feature values.
minval = min(train_x.min(), test_x.min())
nb.fit(train_x - minval, train_y)
print(f"Complement Naive Bayes: {accuracy_score(test_y, nb.predict(test_x - minval)):.4f}")

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Complement Naive Bayes: 0.8531
CPU times: total: 344 ms
Wall time: 4.04 s


In [10]:
%%time
# Random forest--default settings are usually pretty good.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=4).fit(train_x, train_y)
print(f"Random Forest: {accuracy_score(test_y, rf.predict(test_x)):.4f}")

Random Forest: 0.8446
CPU times: total: 1min 11s
Wall time: 18.3 s


Hm, usually a random forest will outperform something like a Naive Bayes model.  We can just throw a bunch more estimators into the random forest (the default is 100).  More estimators rarely hurt the accuracy of a random forest; the gains just start to taper off after a bit, and of course it runs slower.

In [11]:
%%time
# Random forest again, this time with 10x the default number of trees.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000, n_jobs=4).fit(train_x, train_y)
print(f"Random Forest: {accuracy_score(test_y, rf.predict(test_x)):.4f}")

Random Forest: 0.8536
CPU times: total: 10min 54s
Wall time: 2min 45s


That's better!  Now let's throw a pretty simple neural network at the same dataset.

In [12]:
%%time

# Neural network
import numpy as np
from tensorflow import keras

# Can't use the `validation_split` argument with sparse arrays (not sure why)--
# so manually do the split here.
nn_train_x, val_x, nn_train_y, val_y = train_test_split(
    train_x.astype(np.int32).toarray(),
    train_y_onehot,
    random_state=0,
    train_size=0.9,
)

# Here's a very simple network: two hidden layers of 128 neurons, with
# ReLU activation and a bit of regularization and dropout to prevent
# overfitting.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", activity_regularizer=keras.regularizers.L2()),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(128, activation="relu", activity_regularizer=keras.regularizers.L2()),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(20, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[keras.metrics.CategoricalAccuracy()],
)

fit_history = model.fit(
    nn_train_x,
    nn_train_y,
    batch_size=32,
    epochs=500,
    validation_data=(val_x, val_y),
    callbacks=[
        keras.callbacks.EarlyStopping(
            monitor="val_categorical_accuracy",
            min_delta=0,
            patience=5,
            restore_best_weights=True,
        ),
    ],
    verbose=2,
)

preds = model.predict(
    test_x.astype(np.int32).toarray()
)
preds = np.argmax(preds, axis=1).ravel()
print(preds)
print(f"Neural network: {accuracy_score(test_y, preds):.4f}")

Epoch 1/500
477/477 - 4s - loss: 5.6430 - categorical_accuracy: 0.4013 - val_loss: 2.3499 - val_categorical_accuracy: 0.7354 - 4s/epoch - 7ms/step
Epoch 2/500
477/477 - 3s - loss: 2.9842 - categorical_accuracy: 0.8302 - val_loss: 1.5073 - val_categorical_accuracy: 0.8680 - 3s/epoch - 6ms/step
Epoch 3/500
477/477 - 3s - loss: 1.5737 - categorical_accuracy: 0.9226 - val_loss: 1.1506 - val_categorical_accuracy: 0.9010 - 3s/epoch - 6ms/step
Epoch 4/500
477/477 - 3s - loss: 0.9924 - categorical_accuracy: 0.9567 - val_loss: 0.9739 - val_categorical_accuracy: 0.9098 - 3s/epoch - 6ms/step
Epoch 5/500
477/477 - 4s - loss: 0.6946 - categorical_accuracy: 0.9758 - val_loss: 0.8384 - val_categorical_accuracy: 0.9187 - 4s/epoch - 7ms/step
Epoch 6/500
477/477 - 3s - loss: 0.5313 - categorical_accuracy: 0.9865 - val_loss: 0.7691 - val_categorical_accuracy: 0.9175 - 3s/epoch - 7ms/step
Epoch 7/500
477/477 - 3s - loss: 0.4318 - categorical_accuracy: 0.9920 - val_loss: 0.7084 - val_categorical_accuracy: 

There we go, the neural network--a pretty simple one--beat the other two models _handily._  This dataset is much more in line with something neural networks are good at: very sparse datasets with _lots_ of features.
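
As an aside on reading the training log: the `477/477` at the start of each line counts batches, not observations.  A quick sketch of where that number comes from, assuming the 90/10 validation split rounds the training portion down:

```python
import math

n_train_rows = 16961  # rows in train_x, from the shape printed earlier
nn_rows = math.floor(0.9 * n_train_rows)  # rows left after the validation split
batch_size = 32

# One "step" is one batch, so an epoch takes this many steps.
steps_per_epoch = math.ceil(nn_rows / batch_size)
print(nn_rows, steps_per_epoch)  # 15264 477
```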

# TensorBoard: Better logging

Reading all those outputs above is a bit of a mess.  Fortunately, there's a better way.  Tensorflow has an absolutely killer feature, which makes it well worth using in spite of the mess that is installing it: TensorBoard.

TensorBoard is basically a fancy logging tool that gives you a browser-based dashboard that runs locally (so no internet connection is needed).  TensorBoard gets installed alongside Tensorflow, but it's become such a standard tool for monitoring neural networks that most other neural net libraries can interface with it very easily.

To use TensorBoard, add the TensorBoard callback to your model during fitting:

```python
from tensorflow import keras
model = keras.Sequential([...])
model.compile(...)
model.fit(
    ...,
    callbacks=[
        keras.callbacks.TensorBoard("tensorboard_log_dir"),
        ...
    ]
)
```

This will create a folder "tensorboard_log_dir" in the folder the code is being run from.  As your model is fitting, it will update files in this directory in real-time.

From your command line, with your Conda environment activated, navigate to the folder where "tensorboard_log_dir" got created, and run:

```bash
tensorboard --logdir=tensorboard_log_dir
```

TensorBoard will print a local URL (by default `http://localhost:6006/`); open that in your browser.
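
One habit that may help once you start re-fitting models: give each run its own subfolder inside the log directory, so TensorBoard shows the runs as separate curves instead of piling them on top of each other.  A minimal sketch; `make_run_dir` and its naming scheme are my own convention, not something Keras requires.

```python
import os
from datetime import datetime


def make_run_dir(base_dir="tensorboard_log_dir", label="run", now=None):
    """Build a per-run log path like tensorboard_log_dir/run-20240131-154500."""
    now = now or datetime.now()
    return os.path.join(base_dir, f"{label}-{now:%Y%m%d-%H%M%S}")


run_dir = make_run_dir(label="newsgroups")
print(run_dir)
# Pass `run_dir` to keras.callbacks.TensorBoard(...) in place of the bare
# folder name, and keep pointing `tensorboard --logdir` at the base folder.
```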

TensorBoard also integrates with Jupyter Notebooks (through the `%load_ext tensorboard` and `%tensorboard --logdir ...` magics), which I'll use here to show what it looks like.  First, I need to re-fit the above model and add the TensorBoard callback.

In [None]:
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", activity_regularizer=keras.regularizers.L2()),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(128, activation="relu", activity_regularizer=keras.regularizers.L2()),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(20, activation="softmax"),
])

# Changing some settings to make this run longer and slower--makes the graphs
# easier to see.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[keras.metrics.CategoricalAccuracy()],
)
fit_history = model.fit(
    nn_train_x,
    nn_train_y,
    batch_size=8,
    epochs=500,
    validation_data=(val_x, val_y),
    callbacks=[
        keras.callbacks.EarlyStopping(
            monitor="val_categorical_accuracy",
            min_delta=0,
            patience=5,
            restore_best_weights=True,
        ),
        keras.callbacks.TensorBoard(
            "tensorboard_log_dir",
            update_freq="batch",
        ),
    ],
    # We don't *need* this if we're going to use TensorBoard for monitoring,
    # since TensorBoard can update in real-time, but it never hurts to have
    # more outputs to monitor.
    verbose=1,
)

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500

And that's about where we're gonna leave things with Keras' Sequential API.

I will say a few more words about TensorBoard, though.  TensorBoard is an amazing tool, with a _huge_ range of capabilities.  It's more helpful than you might expect when it comes to really big models that need to run for a really long time: if you have a model that's training overnight or over a weekend, TensorBoard is a great way to check up on its progress.  (For anything that runs "over lunch" rather than "overnight," though, TensorBoard is probably overkill.)  There are a lot of other really cool features we won't go into, like uploading subsets of your data and other cool logging stuff, but they are absolutely worth getting familiar with if you're considering a career in AI/ML/neural networks.