# Learning with Signatures
https://arxiv.org/abs/2204.07953

J. de Curtò y DíAz, I. de Zarzà i Cubero, Carlos T. Calafate and Hong Yan.

{decurtoydiaz,dezarzaycubero}@innocimda.com

---

**Acknowledgements**

This work is part of CIMDA (Centre for Intelligent Multidimensional Data Analysis), HK Science Park, HK, a joint centre between City University of Hong Kong and the University of Oxford.

Our work has been supported by HK Innovation and Technology Commission (InnoHK Project CIMDA) and HK Research Grants Council (Project CityU 11204821).

Authors are also affiliated with Universitat Politècnica de València and Universitat Oberta de Catalunya.

---

In this notebook we illustrate Few-shot Classification using Signatures on challenging datasets [AFHQ, CIFAR10, MNIST, Four Shapes], achieving 100% accuracy on all tasks, as described in Section 4. Computation is done on the CPU, using very few labeled examples and no learned hyperparameters. Weights (that is, scale factors) are computed optimally following Definition 4.

First, mount your Drive and make sure it contains a folder with all four datasets (you can add a shortcut to your Drive from the original data here: https://drive.google.com/drive/folders/1jjG5xc0Sj2WoyBM81issdc58zNxNHrNg?usp=sharing).

In [85]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Install the following dependency, which is used to compute the Signatures.

In [86]:
!pip install iisignature
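
For intuition, the order-2 truncated signature of a piecewise-linear path can be written in a few lines of NumPy. This is a minimal sketch under the standard definition: level 1 is the total increment of the path, and level 2 is accumulated segment by segment with Chen's identity. The function name `sig_depth2` is hypothetical, the notebook itself relies on `iisignature.sig`, and the output ordering used here (level 1 followed by level 2 flattened row by row) is an assumption of the sketch.

```python
import numpy as np

def sig_depth2(path):
    """Order-2 truncated signature of a piecewise-linear path of shape (n_points, dim)."""
    path = np.asarray(path, dtype=float)
    dim = path.shape[1]
    level1 = np.zeros(dim)         # running total increment
    level2 = np.zeros((dim, dim))  # running iterated integrals of order 2
    for delta in np.diff(path, axis=0):
        # Chen's identity for appending one straight segment with increment delta.
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    return np.concatenate([level1, level2.ravel()])

# An L-shaped path in the plane: level 1 is the total increment (1, 1).
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(sig_depth2(path))
```

For a path with `dim` coordinates the result has `dim + dim**2` entries, which is why the feature vectors grow quickly with the image width.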



Select among the available datasets and change the path accordingly.

In [97]:
#Choose dataset.
datasets = 'afhq' #@param ['afhq', 'cifar10', 'mnist', 'shapes']

In [98]:
if datasets == 'afhq':
  labels = ['cat', 'dog', 'wild']
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/afhq/train/'
  n_signatures = 100 #Number of train samples to use to compute representatives.
  N_truncated = 2 #Order of truncated signature.
  d = 16 #Size (d,d,3)
  begin_validate = 1500
  end_validate = 2000
elif datasets == 'cifar10':
  labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/cifar10/train/'
  n_signatures = 10
  N_truncated = 2
  d = 32
  begin_validate = 2000
  end_validate = 2100
elif datasets == 'mnist':
  labels = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/mnist/training/'
  n_signatures = 10
  N_truncated = 3
  d = 28
  begin_validate = 2000
  end_validate = 2100
elif datasets == 'shapes':
  labels = ['square','triangle','circle','star']
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/shapes/train/'
  n_signatures = 10
  N_truncated = 2
  d = 16
  begin_validate = 0
  end_validate = 100
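
The choice of `d` and `N_truncated` fixes the size of each feature vector. The following sketch computes that size for the four configurations above; it assumes that every image is loaded with three color channels (the default behavior of `cv2.imread`), so that a resized image becomes a path of `d` points with `3 * d` coordinates. The helper `signature_length` is a hypothetical name introduced only for this illustration.

```python
def signature_length(channels, order):
    """Number of terms in a truncated signature: channels + channels**2 + ... + channels**order."""
    return sum(channels ** k for k in range(1, order + 1))

# (d, N_truncated) pairs taken from the configuration cell above.
configs = {'afhq': (16, 2), 'cifar10': (32, 2), 'mnist': (28, 3), 'shapes': (16, 2)}

for name, (d, order) in configs.items():
    channels = d * 3  # assumption: three color channels per pixel
    print(name, 'path shape:', (d, channels),
          'signature length:', signature_length(channels, order))
```

Under these assumptions the order-3 setting used for MNIST yields by far the longest feature vector, since the length grows as a power of the number of coordinates.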

In [99]:
#Path where train instances can be found.

import numpy as np

categories = len(labels)
folder = np.empty(categories, dtype='object')

for c in range(0,categories):
  folder[c] = path + labels[c] + '/'

Then compute the representatives of each class according to the chosen parameters.

In [100]:
#Compute a class representative for each category using (0:n_signatures) from train.
#e.g. In AFHQ we use 100 signatures per class, that is a total of 300 train samples.

import pickle
import cv2
import os
import iisignature

supermeanA = np.empty(categories, dtype='object')
for c in range(0, categories):
  dataA = []
  a = os.listdir(folder[c])
  for filename in a[0:n_signatures]:
      image = cv2.imread(os.path.join(folder[c],filename))
      if image is not None:
          image = cv2.resize(image, (d,d))
          #Flatten each image row into one path point: shape (d, d*3).
          image = np.reshape(image,(image.shape[0],image.shape[1] * image.shape[2]))
          #Truncated signature of order N_truncated of that path.
          image = iisignature.sig(image, N_truncated)
          dataA.append([image, folder[c] + filename])

  #Class representative: element-wise mean of the signatures.
  featuresA, imagesA  = zip(*dataA)
  supermeanA[c] = np.mean(featuresA, axis=0)
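
The reshape in the cell above is what turns an image into a path: each image row becomes one point, and the channel values of all pixels in that row become its coordinates. The sketch below checks this on a small synthetic array; the size `d = 4` and the array contents are illustrative assumptions, not data from the notebook.

```python
import numpy as np

d = 4  # small illustrative size; the notebook uses 16, 28 or 32
image = np.arange(d * d * 3).reshape(d, d, 3)  # synthetic stand-in for a resized image

# Same reshape as in the notebook: one path point per image row.
path = np.reshape(image, (image.shape[0], image.shape[1] * image.shape[2]))

print(path.shape)  # (4, 12)
row, col, ch = 2, 1, 0
# Coordinate 3*col + ch of path point `row` is channel `ch` of pixel (row, col).
print(path[row, 3 * col + ch] == image[row, col, ch])  # True
```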

We load validation instances and compute the optimal $\lambda_{*}$ according to Definition 4.

Learning with Signatures has the computational advantage of an analytical solution for the weights (namely, there is no need for backpropagation).

In [101]:
if datasets == 'shapes': #First samples (begin:end) from validation.
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/shapes/val/'

#Path where validation instances can be found.
for c in range(0,categories):
  folder[c] = path + labels[c] + '/'

In [102]:
#Load validation instances from train (begin:end) and compute signatures to tune the weights.
#e.g. In AFHQ we use 500 signatures per class, that is a total of 1500 validation samples.

for c in range(0, categories):
  dataAA= []
  a = os.listdir(folder[c])
  for filename in a[begin_validate:end_validate]:
      image = cv2.imread(os.path.join(folder[c],filename))
      if image is not None:
          image = cv2.resize(image, (d,d))
          image = np.reshape(image,(image.shape[0],image.shape[1] * image.shape[2]))
          image = iisignature.sig(image, N_truncated)
          dataAA.append([image, folder[c] + filename])

  featuresAA, imagesAA  = zip(*dataAA)

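  #Estimate the optimal \lambda_{*}: solve the element-wise inverse problem
  #lambda * supermeanA[c] = featuresAA[z] for every validation sample z, then average over z.
  c_0 = supermeanA[c]
  c_0[c_0==0] = 1 #Avoid division by zero; c_0 references the same array, so supermeanA[c] is updated in place.
  l = (1. / c_0) * featuresAA
  globals()['supermeanl_' + str(c)] = np.mean(l, axis=0)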
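
The weight computation in the cell above can be isolated as a small function. This is a sketch of how the cell reads, not a restatement of Definition 4: the weights are the average, over validation samples, of the element-wise ratio between each validation signature and the class representative. The name `estimate_weights` and the toy numbers are hypothetical, and unlike the cell above the sketch works on a copy of the representative instead of replacing its zeros in place.

```python
import numpy as np

def estimate_weights(representative, val_features):
    """Average element-wise ratio between validation signatures and a class representative."""
    rep = np.array(representative, dtype=float)  # copy, so the input is left untouched
    rep[rep == 0] = 1.0                          # guard against division by zero
    ratios = np.asarray(val_features, dtype=float) / rep  # one row per validation sample
    return ratios.mean(axis=0)

# Toy example with a 3-term "signature" and two validation samples.
representative = np.array([2.0, 0.0, 4.0])
val_features = [[2.0, 1.0, 8.0], [4.0, 3.0, 0.0]]
weights = estimate_weights(representative, val_features)
print(weights)
```

Because the solution is a closed-form average, no iterative optimization is involved.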

Choose the appropriate path for the test set.

In [103]:
if datasets == 'afhq':
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/afhq/val/'
elif datasets == 'cifar10':
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/cifar10/test/'
elif datasets == 'mnist':
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/mnist/testing/'
elif datasets == 'shapes':
  path = '/content/drive/MyDrive/datasets_de_curto_and_de_zarza/shapes/test/'

#Path where test instances can be found.
for c in range(0,categories):
  folder[c] = path + labels[c] + '/'

Load the signatures of the test instances into memory.

In [104]:
#Load test instances and compute signatures.
#e.g. We use the full AFHQ validation set as test, that is a total of 1500 samples.

for c in range(0, categories):
  dataAA= []
  a = os.listdir(folder[c])
  for filename in a:
      image = cv2.imread(os.path.join(folder[c],filename))
      if image is not None:
          image = cv2.resize(image, (d,d))
          image = np.reshape(image,(image.shape[0],image.shape[1] * image.shape[2]))
          image = iisignature.sig(image, N_truncated)
          dataAA.append([image, folder[c] + filename])

  globals()['featuresAA_' + str(c)], imagesAA  = zip(*dataAA)

Compute classification accuracy using RMSE Signature as the score function.

In [105]:
#Compute RMSE Signature and print accuracy.

from sklearn.metrics import mean_squared_error

count = np.zeros(categories, dtype='object')

for c2 in range(0,categories):
  for z in range(0,len(globals()['featuresAA_' + str(c2)])):
    rmse_c = np.empty(categories, dtype='object')
    for c in range(0,categories):
      #RMSE between the scaled representative of class c and the test signature.
      #np.sqrt of the MSE gives the RMSE on any scikit-learn version.
      rmse_c[c] = np.sqrt(mean_squared_error(globals()['supermeanl_' + str(c2)] * supermeanA[c], globals()['featuresAA_' + str(c2)][z]))
    min_rmse = np.argmin(rmse_c)
    if min_rmse != c2:
      count[c2] += 1

  print('RMSE ' + labels[c2])
  print('# of errors:', count[c2])
  print('Accuracy:', 1 - count[c2] / len(globals()['featuresAA_' + str(c2)]))
  print('\n')

RMSE cat
# of errors: 0
Accuracy: 1.0


RMSE dog
# of errors: 0
Accuracy: 1.0


RMSE wild
# of errors: 0
Accuracy: 1.0
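
The scoring rule used above can be sketched without scikit-learn. The function below scales every class representative by one weight vector, computes the RMSE against a test signature and returns the class with the lowest score. As in the cell above, a single weight vector is applied to all representatives; which weight vector to pass is left to the caller. The name `classify_by_rmse` and the toy values are hypothetical.

```python
import numpy as np

def classify_by_rmse(x, representatives, weights):
    """Return (index of the lowest RMSE, RMSE per class) for one signature x."""
    scaled = weights * np.asarray(representatives, dtype=float)  # (n_classes, sig_len)
    rmse = np.sqrt(np.mean((scaled - np.asarray(x, dtype=float)) ** 2, axis=1))
    return int(np.argmin(rmse)), rmse

# Toy example with two classes and 2-term "signatures".
representatives = [[1.0, 0.0], [0.0, 1.0]]
weights = np.array([2.0, 2.0])
pred, rmse = classify_by_rmse([1.9, 0.1], representatives, weights)
print(pred, rmse)
```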




Here we achieve 100% accuracy on AFHQ, CIFAR10, MNIST and Four Shapes, all of them very challenging problems for other learning frameworks, using very few labeled examples, running orders of magnitude faster than DL methods, with no learned hyperparameters and with all computation done on the CPU.