#### STOCHASTIC OPTIMIZATION OF SORTING NETWORKS VIA CONTINUOUS RELAXATIONS

This paper deals with the problem of sorting objects, which arises in many machine learning pipelines: for instance, top-k multi-class classification, ranking documents for information retrieval, and multi-object target tracking in computer vision. Algorithms for these problems typically need to learn informative representations of complex, high-dimensional data, such as images, before sorting and subsequent downstream processing.

However, a pipeline with a downstream sorting step cannot be optimized end to end with gradient-based methods, because the sort operator is not differentiable with respect to its input. The goal of the paper is to propose a method that makes the sort operator differentiable almost everywhere with respect to its inputs. The proposed method is $\textbf{NeuralSort}$. This report covers the scientific aspects of NeuralSort. It is organized as follows:

- $\textbf{Present a concise summary of the NeuralSort method}$;
- $\textbf{Apply the method to data}$.

#### NeuralSort method

$\textbf{How to understand it}$: in the sorting problem, the output can be viewed as a permutation matrix, which is a square matrix with entries in $\{0,1\}$ such that every row and every column sums to 1. NeuralSort instead considers a larger class of matrices, called unimodal row-stochastic matrices. Such a matrix is square with non-negative real entries, each row sums to 1, and each row has a distinct arg max, so that the row-wise arg maxes form a permutation. All permutation matrices are unimodal row-stochastic matrices.
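The definition above can be checked mechanically. The following is a minimal sketch in plain Python; the function name `is_unimodal_row_stochastic` and the tolerance `tol` are illustrative choices, not part of NeuralSort. It assumes that "distinct arg max" means each row attains its maximum in exactly one column and that different rows attain it in different columns.

```python
def is_unimodal_row_stochastic(matrix, tol=1e-9):
    """Check the three conditions on a matrix given as a list of rows:
    non-negative entries, rows summing to 1, and row-wise arg maxes
    that form a permutation of the column indices."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # not square
    argmaxes = []
    for row in matrix:
        if any(x < 0 for x in row):
            return False  # negative entry
        if abs(sum(row) - 1.0) > tol:
            return False  # row does not sum to 1
        top = max(row)
        if sum(1 for x in row if x == top) != 1:
            return False  # the row maximum is tied
        argmaxes.append(row.index(top))
    # Different rows must peak in different columns.
    return sorted(argmaxes) == list(range(n))


permutation = [[0, 1], [1, 0]]        # a permutation matrix
relaxed = [[0.7, 0.3], [0.2, 0.8]]    # soft, but the rows peak in different columns
clash = [[0.7, 0.3], [0.6, 0.4]]      # both rows peak in column 0

print(is_unimodal_row_stochastic(permutation))  # True
print(is_unimodal_row_stochastic(relaxed))      # True
print(is_unimodal_row_stochastic(clash))        # False
```

Taking the arg max of each row of such a matrix recovers a hard permutation, which is what makes this class a natural relaxation target.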


$\textbf{How is NeuralSort trained?}$: the goal is to optimize training objectives involving a sort operator with gradient-based methods.

The problem can be written in the following form:

$\mathcal{L}(\theta,s)= f(P_z;\theta)$, where $z = \mathrm{sort}(s)$

Here, 

- $s\in \mathbb{R}^n$ denotes a vector of $n$ real-valued scores. In the stochastic setting, the scores (assumed positive) parameterize a Plackett-Luce distribution over the set $\mathcal{Z}_n$ of permutations of $\{1,\dots,n\}$, whose probability mass function for any $z \in \mathcal{Z}_n$ is given by:
$q(z|s)=\dfrac{s_{z_1}}{Z} \dfrac{s_{z_2}}{Z-s_{z_1}}\cdots \dfrac{s_{z_n}}{Z-\sum_{i=1}^{n-1}s_{z_i}}$, where $Z=\sum_{i=1}^{n}s_i$ is the normalization constant.

- $z$ is the permutation that (deterministically) sorts the scores $s$ in decreasing order. Every permutation $z$ is associated with a permutation matrix $P_z \in \{0,1\}^{n\times n}$ with $P_z[i,j]=\mathbb{1}(j=z_i)$.
$\textbf{Example}$: let $s = [9, 10, 5, 2]^T$; then $\mathrm{sort}(s) = [2, 1, 3, 4]^T$, since the largest element is at the second index, the second largest element is at the first index, and so on. In case of ties, elements are assigned indices in the order they appear. The sorted vector is obtained simply via $P_{\mathrm{sort}(s)}s$.

- $f(\cdot)$ is an arbitrary function of interest, assumed to be differentiable w.r.t. a set of parameters $\theta$ and $z$.
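The objects defined in the list above can be reproduced in a few lines. The sketch below uses plain Python lists and 1-based indices to match the notation; the helper names (`sort_permutation`, `permutation_matrix`, `matvec`, `plackett_luce_pmf`) are illustrative and do not come from the NeuralSort code base. The Plackett-Luce function assumes strictly positive scores.

```python
def sort_permutation(scores):
    """1-based indices ordering `scores` from largest to smallest.
    Python's sort is stable, so ties keep their order of appearance."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [i + 1 for i in order]


def permutation_matrix(z):
    """P_z[i][j] = 1 if j == z_i else 0, for a 1-based permutation z."""
    n = len(z)
    return [[1 if j + 1 == z[i] else 0 for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def plackett_luce_pmf(z, scores):
    """q(z | s): items are drawn one at a time, each with probability
    proportional to its score among the items not yet drawn."""
    remaining = sum(scores)  # starts at the normalization constant Z
    prob = 1.0
    for idx in z:
        prob *= scores[idx - 1] / remaining
        remaining -= scores[idx - 1]
    return prob


s = [9, 10, 5, 2]
z = sort_permutation(s)          # [2, 1, 3, 4]
P_z = permutation_matrix(z)
sorted_s = matvec(P_z, s)        # [10, 9, 5, 2]
q_z = plackett_luce_pmf(z, s)    # (10/26) * (9/16) * (5/7) * (2/2)
print(z, sorted_s, q_z)
```

Note that the denominator of each Plackett-Luce factor removes the scores already drawn, so the second factor divides by $Z - s_{z_1}$.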

Since the sort operator itself is not differentiable, the authors propose to derive a relaxation of the sort operator that leads to a surrogate objective with well-defined gradients. In particular, the relaxation replaces the permutation matrix $P_z$ in the objective function above with an approximation $\hat{P}_z$ such that the surrogate objective $f(\hat{P}_z; \theta)$ is differentiable w.r.t. the scores $s$.
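The summary above does not spell out the form of $\hat{P}_z$. As a sketch, assume the relaxation given in the NeuralSort paper, where row $i$ of $\hat{P}_{\mathrm{sort}(s)}$ is $\mathrm{softmax}\left[((n+1-2i)\,s - A_s\mathbb{1})/\tau\right]$, with $A_s[j,k]=|s_j-s_k|$ and a temperature $\tau>0$. The plain-Python version below is only an illustration of that formula, not the authors' implementation; the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def neuralsort_relaxation(scores, tau=1.0):
    """Relaxed permutation matrix for sorting `scores` in decreasing order.
    Row i (1-based) is softmax(((n + 1 - 2i) * s - A_s 1) / tau)."""
    n = len(scores)
    # abs_diff_sums[j] = sum_k |s_j - s_k|, i.e. entry j of the vector A_s 1
    abs_diff_sums = [sum(abs(sj - sk) for sk in scores) for sj in scores]
    rows = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * sj - dj) / tau
                  for sj, dj in zip(scores, abs_diff_sums)]
        rows.append(softmax(logits))
    return rows


s = [9.0, 10.0, 5.0, 2.0]
p_hat = neuralsort_relaxation(s, tau=1.0)
# Row-wise arg maxes (1-based) recover sort(s) = [2, 1, 3, 4].
print([row.index(max(row)) + 1 for row in p_hat])
# A small temperature pushes every row towards a one-hot vector.
p_sharp = neuralsort_relaxation(s, tau=0.1)
print([[round(x, 3) for x in row] for row in p_sharp])
```

Every entry is a composition of absolute values, affine maps and a softmax, so the matrix is differentiable almost everywhere in $s$. The temperature trades off closeness to the hard permutation matrix against how informative the gradients are.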



#### Our implementation

In [12]:
# run the baseline (run_baseline.py) on CIFAR-10
!python run_baseline.py --dataset=cifar10 --nloglr=3


Files already downloaded and verified
Namespace(k=None, tau=None, nloglr=3.0, method=None, resume=False, dataset='cifar10')
Beginning epoch 0:  baseline-resnet-cifar10-b3
train -1.3764500617980957
val 0.47740000739693644
Saving...
Beginning epoch 1:  baseline-resnet-cifar10-b3
train -1.1762725114822388
val 0.5796000073254108
Saving...
Beginning epoch 2:  baseline-resnet-cifar10-b3
train -1.079352855682373
val 0.612600006505847
Saving...
Beginning epoch 3:  baseline-resnet-cifar10-b3
train -1.168888807296753
val 0.6956000089943409
Saving...
Beginning epoch 4:  baseline-resnet-cifar10-b3
train -0.785447359085083
val 0.7190000087618827
Saving...
Beginning epoch 5:  baseline-resnet-cifar10-b3
train -0.5875318646430969
val 0.7298000105619431
Saving...
Beginning epoch 6:  baseline-resnet-cifar10-b3
train -0.7145028114318848
val 0.7648000116348267
Saving...
Beginning epoch 7:  baseline-resnet-cifar10-b3
train -0.5096418857574463
val 0.7790000113844872
Saving...
Beginning epoch 8:  baseline-re

In [13]:
# run the baseline (run_baseline.py) on CIFAR-100
!python run_baseline.py --dataset=cifar100 --nloglr=3

Files already downloaded and verified
Namespace(k=None, tau=None, nloglr=3.0, method=None, resume=False, dataset='cifar100')
Beginning epoch 0:  baseline-resnet-cifar100-b3
train -3.960906982421875
val 0.07840000142157078
Saving...
Beginning epoch 1:  baseline-resnet-cifar100-b3
train -3.988917827606201
val 0.10760000208020211
Saving...
Beginning epoch 2:  baseline-resnet-cifar100-b3
train -3.692495822906494
val 0.14000000289082526
Saving...
Beginning epoch 3:  baseline-resnet-cifar100-b3
train -3.466834306716919
val 0.15580000327527524
Saving...
Beginning epoch 4:  baseline-resnet-cifar100-b3
train -3.287175178527832
val 0.17640000376105308
Saving...
Beginning epoch 5:  baseline-resnet-cifar100-b3
train -3.1015334129333496
val 0.20120000429451465
Saving...
Beginning epoch 6:  baseline-resnet-cifar100-b3
train -3.1988484859466553
val 0.23920000521838666
Saving...
Beginning epoch 7:  baseline-resnet-cifar100-b3
train -2.9799928665161133
val 0.265800005748868
Saving...
Beginning epoch 8:

In [14]:
# run the KNN (run_dknn.py) on CIFAR-10: deterministic method
!python run_dknn.py --k=9 --tau=80 --nloglr=3 --method=deterministic --dataset=cifar10 --num_epochs=30


Files already downloaded and verified
Namespace(k=9, tau=80.0, nloglr=3.0, method='deterministic', resume=False, dataset='cifar10', num_train_queries=100, num_test_queries=10, num_train_neighbors=100, num_samples=5, num_epochs=30)
Beginning epoch 0:  dknn-resnet-cifar10-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.2119560883663321
Avg. val acc: 0.422
Saving...
Beginning epoch 1:  dknn-resnet-cifar10-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.2829061142309212
Avg. val acc: 0.5036
Saving...
Beginning epoch 2:  dknn-resnet-cifar10-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.35303497814837775
Avg. val acc: 0.564
Saving...
Beginning epoch 3:  dknn-resnet-cifar10-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.41915429450847475
Avg. val acc: 0.6206
Saving...
Beginning epoch 4:  dknn-resnet-cifar10-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.46447598327825124
Avg. val acc: 0.649
Saving...
Beginning epoch 5:  dknn

In [15]:
# run the KNN (run_dknn.py) on CIFAR-100: deterministic method
!python run_dknn.py --k=9 --tau=80 --nloglr=3 --method=deterministic --dataset=cifar100 --num_epochs=30

Files already downloaded and verified
Namespace(k=9, tau=80.0, nloglr=3.0, method='deterministic', resume=False, dataset='cifar100', num_train_queries=100, num_test_queries=10, num_train_neighbors=100, num_samples=5, num_epochs=30)
Beginning epoch 0:  dknn-resnet-cifar100-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.017833197801201434
Avg. val acc: 0.08
Saving...
Beginning epoch 1:  dknn-resnet-cifar100-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.021412775761183382
Avg. val acc: 0.0796
Beginning epoch 2:  dknn-resnet-cifar100-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.023427657186985
Avg. val acc: 0.0838
Saving...
Beginning epoch 3:  dknn-resnet-cifar100-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.024757881361393286
Avg. val acc: 0.0908
Saving...
Beginning epoch 4:  dknn-resnet-cifar100-deterministic-k9-t8000-b3
Avg. train correctness of top k: 0.02557592941471089
Avg. val acc: 0.0912
Saving...
Beginning epoch 5:  dknn

In [16]:
# run the KNN (run_dknn.py) on CIFAR-10: stochastic method
!python run_dknn.py --k=9 --tau=80 --nloglr=3 --method=stochastic --dataset=cifar10 --num_epochs=30

Files already downloaded and verified
Namespace(k=9, tau=80.0, nloglr=3.0, method='stochastic', resume=False, dataset='cifar10', num_train_queries=100, num_test_queries=10, num_train_neighbors=100, num_samples=5, num_epochs=30)
Beginning epoch 0:  dknn-resnet-cifar10-stochastic-k9-t8000-b3
Traceback (most recent call last):
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/run_dknn.py", line 244, in <module>
    train(t)
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/run_dknn.py", line 155, in train
    losses = dknn_loss(query_e, neighbor_e, query_y_oh, neighbor_y_oh)
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/run_dknn.py", line 81, in dknn_loss
    top_k_ness = dknn_layer(query, neighbors)
  File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/dknn_layer.py", line 40, in forward
    pl_s

In [17]:
# run the KNN (run_dknn.py) on CIFAR-100: stochastic method
!python run_dknn.py --k=9 --tau=80 --nloglr=3 --method=stochastic --dataset=cifar100 --num_epochs=30

Files already downloaded and verified
Namespace(k=9, tau=80.0, nloglr=3.0, method='stochastic', resume=False, dataset='cifar100', num_train_queries=100, num_test_queries=10, num_train_neighbors=100, num_samples=5, num_epochs=30)
Beginning epoch 0:  dknn-resnet-cifar100-stochastic-k9-t8000-b3
Traceback (most recent call last):
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/run_dknn.py", line 244, in <module>
    train(t)
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/run_dknn.py", line 155, in train
    losses = dknn_loss(query_e, neighbor_e, query_y_oh, neighbor_y_oh)
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/run_dknn.py", line 81, in dknn_loss
    top_k_ness = dknn_layer(query, neighbors)
  File "/opt/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/onyxia/work/automatic-differentiation/Neuralsort/dknn_layer.py", line 40, in forward
    pl