In [14]:
import gc
import os

import keras.optimizer_v2.gradient_descent
import numpy as np
import skimage
from matplotlib import pyplot as plt
from tabulate import tabulate

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

In [15]:
import tensorflow as tf
import tensorflow_similarity as tfsim

In [16]:
tfsim.utils.tf_cap_memory()

In [17]:
# Clear out any old model state.
gc.collect()
tf.keras.backend.clear_session()

In [18]:
print("TensorFlow:", tf.__version__)
print("TensorFlow Similarity", tfsim.__version__)

TensorFlow: 2.8.2
TensorFlow Similarity 0.16.7


In [19]:
from PIL import Image
import pathlib
from skimage import io
from skimage.transform import resize


# def prepare(path, array):
#     for folder in os.listdir(path):
#         sub_path = path + "/" + folder
#         for img in os.listdir(sub_path):
#             image_path = sub_path + "/" + img
#             try:
#                 img_arr = io.imread(image_path)
#             except:
#                 print("pass")
#                 pass
#             img_arr = resize(img_arr, output_shape=(224, 224))
#             array.append(img_arr)
def get_images_and_labels(data_root_dir):
    # get all images' paths (format: string)
    data_root = pathlib.Path(data_root_dir)
    all_image_path = [str(path) for path in list(data_root.glob('*/*'))]

    # get labels' names
    label_names = sorted(item.name for item in data_root.glob('*/'))
    # dict: {label : index}
    label_to_index = dict((label, index) for index, label in enumerate(label_names))
    # get all images' labels
    all_image_label = [label_to_index[pathlib.Path(single_image_path).parent.name] for single_image_path in all_image_path]

    return all_image_path, all_image_label


x_train_paths , y_train = get_images_and_labels("./traindata/val")
x_test_paths , y_test = get_images_and_labels("./traindata/test")
print(x_train_paths)

['traindata/val/Tom Hanks/086_e7072dbb.jpg', 'traindata/val/Tom Hanks/091_da40c2a6.jpg', 'traindata/val/Tom Hanks/075_0268d466.jpg', 'traindata/val/Tom Hanks/027_82e30afe.jpg', 'traindata/val/Tom Hanks/062_41f18a02.jpg', 'traindata/val/Tom Hanks/058_450266ba.jpg', 'traindata/val/Tom Hanks/031_1d57bbbf.jpg', 'traindata/val/Tom Hanks/090_3c8ed08d.jpg', 'traindata/val/Tom Hanks/096_eb059e84.jpg', 'traindata/val/Tom Hanks/018_b7231fad.jpg', 'traindata/val/Tom Hanks/012_39efc245.jpg', 'traindata/val/Tom Hanks/070_316ffc8a.jpg', 'traindata/val/Tom Hanks/029_169c07b0.jpg', 'traindata/val/Tom Hanks/097_0c9b7ced.jpg', 'traindata/val/Tom Hanks/022_df2ce089.jpg', 'traindata/val/Tom Hanks/059_c7c906d9.jpg', 'traindata/val/Tom Hanks/001_986d6c22.jpg', 'traindata/val/Tom Hanks/081_ff5f08ea.jpg', 'traindata/val/Tom Hanks/088_78e9691e.jpg', 'traindata/val/Tom Hanks/094_ea1110a3.jpg', 'traindata/val/Tom Hanks/005_dac94cfe.jpg', 'traindata/val/Tom Hanks/077_351b17d6.jpg', 'traindata/val/Tom Hanks/014_ff

In [20]:
from tqdm import tqdm
from keras.preprocessing.image import ImageDataGenerator
#
# datagen = ImageDataGenerator()
# train_gen = datagen.flow_from_directory(directory=train_path, target_size=(224, 224), class_mode="categorical")
# valid_gen = datagen.flow_from_directory(directory=val_path, target_size=(224, 224), class_mode="categorical")
# print(train_x)

For a similarity model to learn efficiently, each batch must contain at least 2 examples of each class.

To make this easy, TensorFlow Similarity offers `Samplers()` that enable you to set both the number of classes and the minimum number of examples of each class per batch. Here we are creating a `MultiShotMemorySampler()`, which allows you to sample an in-memory dataset and provides multiple examples per class.

TensorFlow Similarity provides various samplers to accommodate different requirements, including a `SingleShotMemorySampler()` for single-shot learning, a `TFDatasetMultiShotMemorySampler()` that integrates directly with the TensorFlow datasets catalogue, and a `TFRecordDatasetSampler()` that allows you to sample from very large datasets stored on disk as TFRecord shards.
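To make the batch requirement concrete, here is a minimal sketch in plain Python of what a class-balanced batch looks like. It is not the `MultiShotMemorySampler()` implementation; the function name `sample_balanced_batch` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_balanced_batch(labels, classes_per_batch, examples_per_class=2, seed=None):
    """Return dataset indices for one batch with a fixed number of examples per class."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough examples can contribute a full group.
    eligible = sorted(c for c, idxs in by_class.items() if len(idxs) >= examples_per_class)
    if len(eligible) < classes_per_batch:
        raise ValueError("not enough classes with the requested number of examples")

    batch = []
    for cls in rng.sample(eligible, classes_per_batch):
        batch.extend(rng.sample(by_class[cls], examples_per_class))
    return batch


# Hypothetical toy labels: 4 classes with 3 examples each.
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
batch_indices = sample_balanced_batch(toy_labels, classes_per_batch=3, seed=0)
batch_labels = [toy_labels[i] for i in batch_indices]
print(batch_labels)
```

Every class that appears in the batch appears exactly `examples_per_class` times, so each anchor has at least one positive to be compared against.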

In [21]:
import random

x_test = []
x_train=[]
for img_path in tqdm(x_test_paths):
    x_test.append(io.imread(img_path))
for img_path in tqdm(x_train_paths):
    x_train.append(io.imread(img_path))
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)
print(y_test)
sampler = tfsim.samplers.MultiShotMemorySampler(
    x=x_train,
    y=y_train,
    classes_per_batch=17,
)




100%|██████████| 105/105 [00:00<00:00, 325.15it/s]
100%|██████████| 419/419 [00:01<00:00, 317.82it/s]


[15 15 15 15 15 15 14 14 14 14 14 14  8  8  8  8  8  8  2  2  2  2  2  2
  1  1  1  1  1  1  5  5  5  5  5  5 11 11 11 11 11 13 13 13 13 13 13 13
 13 13 13 13  0  0  0  0  0  0 16 16 16 16 16 16  9  9  9  9  9  9  3  3
  3  3  3  3  7  7  7  7  7  7  4  4  4  4  4 12 12 12 12 12 12 10 10 10
 10 10 10  6  6  6  6  6  6]

The initial batch size is 34 (17 classes * 2 examples per class) with 0 augmenters


filtering examples:   0%|          | 0/419 [00:00<?, ?it/s]

selecting classes:   0%|          | 0/17 [00:00<?, ?it/s]

gather examples:   0%|          | 0/419 [00:00<?, ?it/s]

indexing classes:   0%|          | 0/419 [00:00<?, ?it/s]

## Model setup

### Model definition

`SimilarityModel()` models extend `tensorflow.keras.Model` with additional features and functionality that allow you to index and search for similar-looking examples.

As visible in the model definition below, the similarity model outputs a 512-dimensional float embedding using the `MetricEmbedding()` layer. This layer is a Dense layer with L2 normalization. Thanks to the loss, the model learns to minimize the distance between similar examples and maximize the distance between dissimilar examples. As a result, the distance between examples in the embedding space is meaningful; the smaller the distance, the more similar the examples are.

Being able to use a distance as a meaningful proxy for how similar two examples are is what enables the fast ANN (approximate nearest neighbor) search. Using a sub-linear ANN search instead of a standard quadratic NN search is what allows deep similarity search to scale to millions of items. The built-in memory index used in this notebook scales to a million indexed examples very easily... if you have enough RAM :)
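As a rough illustration of why L2-normalized embeddings make distances easy to interpret, the sketch below normalizes a few hand-written vectors and compares their cosine distances. The vectors and helper names are made up for this example; a real model would produce the embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors: 1 - dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings: two that point the same way, one that does not.
anchor = l2_normalize([1.0, 0.9, 0.1])
similar = l2_normalize([0.9, 1.0, 0.2])
different = l2_normalize([-1.0, 0.1, 0.9])

d_similar = cosine_distance(anchor, similar)
d_different = cosine_distance(anchor, different)
print(f"similar: {d_similar:.3f}  different: {d_different:.3f}")
```

Because every embedding has unit length, the cosine distance depends only on the angle between vectors and always falls between 0 and 2.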

In [22]:
from keras.applications.efficientnet import EfficientNetB0
from tensorflow_similarity.losses import MultiSimilarityLoss
from keras import layers, Sequential
from keras.applications import inception_v3
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.models import SimilarityModel

input_shape = (380, 380, 3)


# def get_inception():
#     inception_model = inception_v3.InceptionV3(weights="imagenet", include_top=False,input_shape=input_shape)
#     for layer in inception_model.layers[:240]:
#         layer.trainable = False
#     for layer in inception_model.layers[240:]:
#         layer.trainable = True
#     return inception_model
#

# def get_model():
#     # inception = get_inception()
#     backbone = EfficientNetB0(include_top=False, weights='imagenet')
#
#     for layer in backbone.layers[:190]:
#             layer.trainable = False
#     inputs = layers.Input(shape=input_shape)
#     x = backbone(inputs)
#     x = layers.Rescaling(1 / 255)(x)
#     x = layers.Dense(512)(x)
#     x = layers.BatchNormalization()(x)
#     x = layers.Activation("relu")(x)
#     x = layers.Dense(256)(x)
#     x = layers.BatchNormalization()(x)
#     x = layers.Activation("relu")(x)
#     x = layers.Flatten()(x)
#     # smaller embeddings will have faster lookup times while a larger embedding will improve the accuracy up to a point.
#     outputs = MetricEmbedding(512)(x)
#     return SimilarityModel(inputs, outputs)
#
#
#
# model = get_model()
model = tfsim.architectures.EfficientNetSim(
    input_shape=input_shape,
    variant="B4",
    embedding_size=512,
)
model.summary()

Model: "similarity_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 380, 380, 3)]     0         
                                                                 
 efficientnetb4 (Functional)  (None, None, None, 1792)  17673823 
                                                                 
 gem_pool (GeneralizedMeanPo  (None, 1792)             0         
 oling2D)                                                        
                                                                 
 metric_embedding (MetricEmb  (None, 512)              918016    
 edding)                                                         
                                                                 
Total params: 18,591,839
Trainable params: 918,016
Non-trainable params: 17,673,823
_________________________________________________________________


### Loss definition

Overall, what makes metric losses different from traditional losses is that:
- **They expect different inputs.** Instead of having the prediction equal the true values, they expect embeddings as `y_preds` and the id (as an int32) of the class as `y_true`. 
- **They require a distance.** You need to specify which `distance` function to use to compute the distance between embeddings. `cosine` is usually a great starting point and the default.

In this example we are using the `MultiSimilarityLoss()`. This loss takes a weighted combination of all valid positive and negative pairs, making it one of the best losses that you can use for similarity training.
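To give an intuition for how positive and negative pairs are weighted, here is a simplified sketch of a multi-similarity style loss in plain Python. It follows the general form described in the multi-similarity loss paper, skips the pair-mining step, and is not the `MultiSimilarityLoss()` implementation. The parameter values (`alpha`, `beta`, `lmda`) and the toy embeddings are assumptions chosen for illustration.

```python
import math


def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=40.0, lmda=0.5):
    """Simplified multi-similarity loss over cosine similarities (no pair mining)."""
    total = 0.0
    for i, (emb_i, label_i) in enumerate(zip(embeddings, labels)):
        pos_sum, neg_sum = 0.0, 0.0
        for k, (emb_k, label_k) in enumerate(zip(embeddings, labels)):
            if i == k:
                continue
            # Dot product equals cosine similarity for unit-length vectors.
            sim = sum(a * b for a, b in zip(emb_i, emb_k))
            if label_k == label_i:
                # Positives far from the anchor (low similarity) weigh more.
                pos_sum += math.exp(-alpha * (sim - lmda))
            else:
                # Negatives close to the anchor (high similarity) weigh more.
                neg_sum += math.exp(beta * (sim - lmda))
        total += math.log(1.0 + pos_sum) / alpha + math.log(1.0 + neg_sum) / beta
    return total / len(embeddings)


toy_labels = [0, 0, 1, 1]
clustered = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # classes well separated
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]      # classes overlap

good_loss = multi_similarity_loss(clustered, toy_labels)
bad_loss = multi_similarity_loss(mixed, toy_labels)
print(f"clustered: {good_loss:.3f}  mixed: {bad_loss:.3f}")
```

The loss is lower when same-class embeddings coincide and different-class embeddings are orthogonal, which is the arrangement training pushes toward.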

In [23]:
loss = tfsim.losses.MultiSimilarityLoss()

### Compilation

TensorFlow Similarity uses an extended `compile()` method that allows you to optionally specify `distance_metrics` (metrics that are computed over the distance between the embeddings), and the distance to use for the indexer.

By default the `compile()` method tries to infer what type of distance you are using by looking at the first loss specified. If you use multiple losses, and the distance loss is not the first one, then you need to specify the distance function used via the `distance=` parameter of the compile function.

In [24]:
from keras.optimizer_v2.adam import Adam

opt = Adam()
model.compile(optimizer=opt, loss=loss)

Distance metric automatically set to cosine use the distance arg to override.


## Training

Similarity models are trained like normal models. 

**NOTE**: don't expect the validation loss to decrease too much here because we only use a subset of the classes within the train data but include all classes in the validation data.

In [25]:
from keras.callbacks import EarlyStopping

EPOCHS = 2  # @param {type:"integer"}

history = model.fit(sampler, epochs=EPOCHS, verbose=1,validation_data=(x_test, y_test))

Epoch 1/2


KeyboardInterrupt: 

In [None]:
# expect loss: 0.14 / val_loss: 0.33
plt.plot(history.history["loss"])
plt.legend(["loss"])
plt.title(f"Loss: {loss.name} ")
plt.show()

save_path = "models/23.09:58"  # @param {type:"string"}
model.save(save_path, save_index=True)

## Indexing

Indexing is where things get different from traditional classification models. Because the model learned to output an embedding that represents the example position within the learned metric space, we need a way to find which known example(s) are the closest to determine the class of the query example (aka nearest neighbors classification).

To do so, **we are creating an index of known examples from all the classes present in the dataset**. We do this by taking a total of **51 examples from the train dataset, which amounts to 3 examples for each of the 17 classes**, and we use the `index()` method of the model to build the index.

We store the images (x_index) as data in the index `(data=x_index)` so that we can display them later. Here the images are small so it's not an issue, but in general, be careful when storing a lot of data in the index to avoid blowing up your memory. You might consider using a different `Store()` backend if you have to store and serve very large indexes.

Indexing more examples per class will help increase the accuracy/generalization, as having more variations improves the classifier's "knowledge" of what variations to expect.

Resetting the index is not needed for the first run; however, we always call it to ensure we start the evaluation with a clean index in case of a partial re-run.
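A quick back-of-the-envelope calculation shows why storing raw images in the index deserves some care. The sketch below assumes 380x380 RGB images stored as 8-bit values and 512-dimensional float32 embeddings; the helper names are made up for this example and the figures ignore any overhead or compression.

```python
def index_data_bytes(num_examples, height, width, channels, bytes_per_value=1):
    """Raw size of the images stored alongside the embeddings, in bytes."""
    return num_examples * height * width * channels * bytes_per_value


def embedding_bytes(num_examples, embedding_size, bytes_per_value=4):
    """Raw size of float32 embeddings, in bytes."""
    return num_examples * embedding_size * bytes_per_value


small = index_data_bytes(51, 380, 380, 3)
large = index_data_bytes(1_000_000, 380, 380, 3)
print(f"51 images: {small / 1e6:.1f} MB, embeddings: {embedding_bytes(51, 512) / 1e3:.1f} KB")
print(f"1M images: {large / 1e9:.1f} GB, embeddings: {embedding_bytes(1_000_000, 512) / 1e9:.2f} GB")
```

Under these assumptions the stored images, not the embeddings, dominate memory use by roughly two orders of magnitude.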

In [None]:
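# `class_list` is not defined earlier in this notebook; derive it from the training labels.
class_list = sorted(set(y_train.tolist()))
x_index, y_index = tfsim.samplers.select_examples(x_train, y_train, class_list[:17], 3)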
model.reset_index()
model.index(x_index, y_index, data=x_index)

filtering examples:   0%|          | 0/1187 [00:00<?, ?it/s]

selecting classes:   0%|          | 0/17 [00:00<?, ?it/s]

gather examples:   0%|          | 0/51 [00:00<?, ?it/s]

[Indexing 51 points]
|-Computing embeddings


## Calibration

To be able to tell if an example matches a given class, we first need to `calibrate()` the model to find the optimal cut point. This cut point is the maximum distance below which returned neighbors are of the same class. Increasing the threshold improves the recall at the expense of the precision.

By default, the calibration uses the F-score classification metric to optimally balance out the precision and recall; however, you can specify your own target and change the calibration metric to better suit your use case.
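The idea behind choosing a cut point can be sketched as a threshold sweep that keeps the distance with the best F1 score. This is a simplified illustration in plain Python, not the algorithm used by `calibrate()`; the function name and the toy distances are assumptions.

```python
def calibrate_cutpoint(distances, is_same_class):
    """Sweep candidate thresholds and return (best_threshold, best_f1).

    distances[i] is the distance from query i to its nearest indexed neighbor;
    is_same_class[i] says whether that neighbor has the query's true label.
    """
    total_positives = sum(is_same_class)
    if total_positives == 0:
        raise ValueError("need at least one correct nearest neighbor to calibrate")

    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(distances)):
        # Neighbors at or below the threshold are accepted as matches.
        accepted = [same for d, same in zip(distances, is_same_class) if d <= threshold]
        true_pos = sum(accepted)
        precision = true_pos / len(accepted)
        recall = true_pos / total_positives
        f1 = 0.0 if true_pos == 0 else 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical nearest-neighbor distances and whether each neighbor was correct.
toy_distances = [0.05, 0.10, 0.15, 0.20, 0.40, 0.45, 0.60]
toy_is_same = [True, True, True, False, True, False, False]
cutpoint, f1 = calibrate_cutpoint(toy_distances, toy_is_same)
print(cutpoint, round(f1, 3))
```

In this toy data the F1 score dips at 0.20 and peaks at 0.40, which shows that the best cut point is not necessarily the first threshold where a wrong neighbor is accepted.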

In [None]:
x_train, y_train = sampler.get_slice()
calibration = model.calibrate(
    x_train,
    y_train
)



## Querying

To "classify" examples, we need to lookup their *k* [nearest neighbors](https://scikit-learn.org/stable/modules/neighbors.html) in the index.

Here we are going to query a single random example for each selected class using `select_examples()` and then find their nearest neighbors using the `lookup()` function.

**NOTE** By default the classes 8, 5, 0, and 4 were not seen during training, but we still get reasonable matches as visible in the image below.
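Conceptually, a lookup ranks the indexed embeddings by distance to the query and returns the closest *k*. The sketch below does this with an exact linear scan in plain Python, whereas the model's index uses approximate search; the `lookup` helper and the toy index are assumptions for illustration and do not mirror the return type of `model.lookup()`.

```python
import heapq


def lookup(query, index, k):
    """Return the k (distance, label) pairs closest to `query`.

    `index` is a list of (embedding, label) pairs with unit-length embeddings.
    """
    scored = (
        (1.0 - sum(q * e for q, e in zip(query, emb)), label)
        for emb, label in index
    )
    return heapq.nsmallest(k, scored)


# Hypothetical index of four unit-length embeddings with string labels.
toy_index = [
    ([1.0, 0.0], "a"),
    ([0.8, 0.6], "a"),
    ([0.0, 1.0], "b"),
    ([-1.0, 0.0], "c"),
]
neighbors = lookup([0.6, 0.8], toy_index, k=2)
for distance, label in neighbors:
    print(f"{label}: {distance:.2f}")
```

The scan visits every indexed item per query, which is fine for a toy index but is exactly the cost that approximate nearest neighbor search avoids at scale.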

In [None]:
# re-run to test on other examples
num_neighbors = 5

# select
x_display, y_display = tfsim.samplers.select_examples(x_train, y_train, class_list[:3], 1)

# lookup nearest neighbors in the index
nns = model.lookup(x_display, k=num_neighbors)

# # display

# img = io.imread("./traindata/test/Angelina Jolie/004_f61e7d0c.jpg")
# img = resize(img,(299,299))
# nns = model.single_lookup(img,3)



#
#
for idx in np.argsort(y_display):
    tfsim.visualization.viz_neigbors_imgs(x_display[idx], y_display[idx], nns[idx], fig_size=(16, 2))

## Metrics plotting

Let's plot the performance metrics to see how they evolve as the distance threshold increases. 

We clearly see an inflection point where the precision and recall intersect; however, this is not the `optimal_cutpoint` because the recall continues to increase faster than the precision decreases. Different use cases will have different performance profiles, which is why each model needs to be calibrated.

In [None]:
fig, ax = plt.subplots()
x = calibration.thresholds["distance"]
ax.plot(x, calibration.thresholds["precision"], label="precision")
ax.plot(x, calibration.thresholds["recall"], label="recall")
ax.plot(x, calibration.thresholds["f1"], label="f1 score")
ax.legend()
ax.set_title("Metric evolution as distance increases")
ax.set_xlabel("Distance")
plt.show()

### Precision/Recall curve

We can see in the precision/recall curve below that the curve is not smooth.
This is because the recall can improve independently of the precision, causing a 
seesaw pattern.

Additionally, the model does extremely well on known classes and less well on 
the unseen ones, which contributes to the flat curve at the beginning followed 
by a sharp decline as the distance threshold increases and 
examples are further away from the indexed examples.

In [None]:
fig, ax = plt.subplots()
ax.plot(calibration.thresholds["recall"], calibration.thresholds["precision"])
ax.set_title("Precision recall curve")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
plt.show()

## Matching

The purpose of `match()` is to allow you to use your similarity models to make 
classification predictions. It accomplishes this by finding the nearest neighbors
for a set of query examples and returning an inferred label based on the neighbors' 
labels and the matching strategy used (`MatchNearest` by default).

Note: unlike traditional models, the `match()` method potentially returns -1 
when there are no indexed examples below the cutpoint threshold. The -1 class
should be treated as "unknown".
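The nearest-neighbor matching rule can be sketched in a couple of lines: accept the closest neighbor's label if it is within the cut point, otherwise report "unknown". This is an illustrative plain-Python version, not the library's matcher; the function name and toy values are assumptions.

```python
def match_nearest(nearest_distance, nearest_label, cutpoint, no_match_label=-1):
    """Return the nearest neighbor's label, or `no_match_label` if it is too far away."""
    return nearest_label if nearest_distance <= cutpoint else no_match_label


# Hypothetical (distance to nearest indexed example, label of that example) per query.
toy_nearest = [(0.08, 3), (0.31, 7), (0.12, 0)]
predictions = [match_nearest(d, label, cutpoint=0.2) for d, label in toy_nearest]
print(predictions)  # [3, -1, 0]
```

The second query is labeled -1 because its closest neighbor lies beyond the 0.2 cut point, even though that neighbor has a label.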

### Matching in practice
Let's now match 10 examples to see how you can use the model's `match()` method 
in practice. 

In [None]:

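num_matches = 10  # @param {type:"integer"}

# `val_sampler` is not defined in this notebook, so draw `num_matches`
# random examples from the test arrays loaded earlier instead.
match_ids = np.random.choice(len(x_test), size=num_matches, replace=False)
x_match, y_match = x_test[match_ids], y_test[match_ids]

matches = model.match(x_match, cutpoint="optimal")
rows = []
for idx, match in enumerate(matches):
    rows.append([match, y_match[idx], match == y_match[idx]])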
print(tabulate(rows, headers=["Predicted", "Expected", "Correct"]))

### Confusion matrix
Now that we have a better sense of what the `match()` method does, let's scale up 
to a few examples per class and evaluate how good our model is at 
predicting the correct classes.

As expected, while the model prediction performance is very good, it's not 
competitive with a classification model. However, this lower accuracy comes with 
the unique advantage that the model is able to classify classes 
that were not seen during training.


**NOTE** `tf.math.confusion_matrix` doesn't support negative classes, so we are going to use **class 17 as our unknown class**, since the 17 real classes use ids 0 to 16. As mentioned earlier, unknown examples are 
any testing example for which the closest neighbor distance is greater than the cutpoint threshold.

In [None]:
# used to label the classes in the confusion matrix
# note we add an 18th class for unknown
# `train_gen`, `val_x`, `val_y` and `class_list` are not defined in this notebook,
# so rebuild the names from the test folders and reuse the test arrays loaded earlier.
labels = sorted(item.name for item in pathlib.Path("./traindata/test").glob("*/")) + ["Unknown"]
class_list = list(range(17))
num_examples_per_class = 3
cutpoint = "optimal"

x_confusion, y_confusion = tfsim.samplers.select_examples(x_test, y_test, class_list, num_examples_per_class)

matches = model.match(x_confusion, cutpoint=cutpoint, no_match_label=17)
cm = tfsim.visualization.confusion_matrix(
    matches,
    y_confusion,
    labels=labels,
    title="Confusion matrix for cutpoint:%s" % cutpoint,
)
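The remapping of unmatched queries can be illustrated without any plotting. The sketch below counts (true, predicted) pairs in plain Python and sends every "no match" prediction to an extra last column, whose id is the first one not used by a real class. The function name and the toy labels are assumptions for illustration; this is not how `tfsim.visualization.confusion_matrix` is implemented.

```python
def confusion_counts(y_true, y_pred, num_classes, no_match_label=-1):
    """Count (true, predicted) pairs, mapping unmatched queries to an extra last column."""
    unknown = num_classes  # first id not used by a real class
    matrix = [[0] * (num_classes + 1) for _ in range(num_classes + 1)]
    for true, pred in zip(y_true, y_pred):
        pred = unknown if pred == no_match_label else pred
        matrix[true][pred] += 1
    return matrix


# Hypothetical labels for 3 classes; -1 marks a query with no match.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, -1, 1, 0, 2, 2]
cm_counts = confusion_counts(toy_true, toy_pred, num_classes=3)
for row in cm_counts:
    print(row)
```

With classes numbered 0 to N-1, using N as the unknown id keeps the extra column from colliding with a real class.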

## Index information

Following `model.summary()` you can get information about the index configuration and its performance using `index_summary()`.

In [None]:
model.index_summary()

## Saving and reloading
Saving and reloading the model works as you would expect: 

- `model.save(path, save_index=True)`: save the model and the index on disk. By default the index is compressed but this can be disabled by setting `compression=False`

- `model = tf.keras.models.load_model(path, custom_objects={"SimilarityModel": tfsim.models.SimilarityModel})` reload the model. 

- **NOTE**: We need to pass `SimilarityModel` as a custom object to ensure that Keras knows about the index methods.

- `model.load_index(path)` is required to reload the index. 

- `model.save_index(path)` and `model.load_index(path)` allow you to save/reload an index independently of saving/loading a model.


### Saving

In [None]:
# save the model and the index
save_path = "models/17clas-5sample-3 epoch 4000spe"  # @param {type:"string"}
model.save(save_path, save_index=True)

### Reloading

In [None]:
# reload the model
reloaded_model = tf.keras.models.load_model(
    save_path,
    custom_objects={"SimilarityModel": tfsim.models.SimilarityModel},
)
# reload the index
reloaded_model.load_index(save_path)

In [None]:
# check the index is back
reloaded_model.index_summary()

## Query reloaded model
Querying the reloaded model with its reloaded index works as expected.

In [None]:
# re-run to test on other examples
num_neighbors = 5

# select
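# `test_x`, `test_y` and `CLASSES` are not defined in this notebook; use the test arrays loaded earlier.
x_display, y_display = tfsim.samplers.select_examples(x_test, y_test, sorted(set(y_test.tolist())), 1)

# lookup the nearest neighbors with the reloaded model
nns = reloaded_model.lookup(x_display, k=num_neighbors)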

# display
for idx in np.argsort(y_display):
    tfsim.visualization.viz_neigbors_imgs(x_display[idx], y_display[idx], nns[idx], fig_size=(16, 2))

Thank you for following this tutorial till the end. If you are interested in learning about TensorFlow Similarity advanced features, you can check out our other notebooks.