# Eccentricity Dependent Neural Networks
## Manish Reddy Vuyyuru (AC299R)

## Table of Contents

1. [Eccentricity Dependent Neural Networks](#Introduction)
    
2. [Previous Results](#PreviousResults)

3. [Sampling For Retinal Eccentricity](#DataSampling)

4. [Data Augmentation](#Augmentation)

5. [Network Design](#NetworkDesign)

6. [Network Training And Evaluation](#NetworkTraining)

7. [Scale, Translation and Clutter Invariance](#Invariance) 

8. [Future Work](#FutureWork)
    
9. [References](#References)

### Eccentricity Dependent Neural Networks<a name="Introduction"></a>

Modern convolutional neural networks have demonstrated remarkable performance in visual recognition on diverse datasets. Yet these computer vision models lag behind human vision in several ways. Human vision's simultaneous robustness to scale, translation and clutter remains unmatched by computer vision models[1][2]. We propose to consider a convolutional neural network architecture modelled more closely on the ventral stream to address these issues. In particular, we are interested in incorporating retinal eccentricity, where the parts of an image falling directly onto the fovea are the sharpest and the image gets progressively blurrier farther out. Previous works have argued for the computational role of retinal eccentricity, both from a theoretical and experimental perspective[3] and by direct evaluation on MNIST-derived datasets[1][2].

**Executive Summary: We demonstrate that a neural network incorporating retinal eccentricity can be trained to a level approaching state-of-the-art models on ImageNet. We note key problems with the current models and ways to mitigate them going forward. We provide very rough, preliminary evaluations of the network's robustness to scale, clutter and translation transformations.**

### Previous Results<a name="PreviousResults"></a>

The computational role of retinal eccentricity has been studied by various members of the group in the past. In [3], Poggio et al. argue that retinal eccentricity could be viewed as an extension of M-theory. Notice that modern convolutional neural networks require a large amount of image augmentation during training to be more robust to image transformations. Humans, on the other hand, do not need to be shown an object from many different angles before being able to recognize it under transformations. M-theory argues that the brain accomplishes this by computing a transformation-invariant representation of an image before visual recognition. Poggio et al. argue that an extension to M-theory focused on the translation-scale invariance tradeoff typical of primates/humans agrees remarkably well with the image sampling that occurs as a result of retinal eccentricity. This was further verified via comparisons against biological measurements of macaque monkeys. The argument goes as follows. Consider the region in scale-translation space in which all possible scale transformations of an object must lie:

<img src="http://drive.google.com/uc?export=view&id=1b7qJ-UvcJZRBsmJsBmU7NafittZRFH7z">
Figure 1. Taken from [3], the region in scale (S)-translation (X) space that all possible scale transformations can take form a truncated pyramid.

Without going into the details, you could then intuitively generate an invariant representation by projecting an image to all points in this region and sampling over the space as shown below:

<img src="http://drive.google.com/uc?export=view&id=184QLWsywwpJDbZZ_bNdmg-3NJIvCD0_S">
Figure 2. Taken from [3], sampling (represented by red dots) in the scale (S) - translation (X) space that all possible scale transformations can take.

Additionally, previous works in the group [1][2] have demonstrated on MNIST-derived datasets that by incorporating this retinal eccentricity and pooling over the scales, neural networks could be made more robust to changes in scale and perform better under crowding. Further studies on a more challenging natural image dataset could be more informative.

<img src="http://drive.google.com/uc?export=view&id=1-Uppeuwpotzl_RWmvVLZx5KhsfbN-Cti">
Figure 3. Taken from [2], an example of some of the MNIST-derived datasets used by previous studies.

### Sampling For Retinal Eccentricity<a name="DataSampling"></a>

To incorporate retinal eccentricity into our networks, we pre-sample the images appropriately before feeding them into our neural network. We use a simple approximation of the full invariant representation generated by eccentricity. Consider an image from ImageNet[4] (a challenging natural image dataset):

<img src="http://drive.google.com/uc?export=view&id=1vmkD04W3KCcKBHN4j3CA-BdMPrpO9yl1">
Figure 4. Sample image of ducks from ImageNet train split.

We aim to generate a computationally cheap approximation of retinal eccentricity, not replicate it exactly. Given this, we are looking to transform the image in several ways.

1. Change the scale of the image.
2. Change the sharpness of the image.
3. Vary the fixation points.

We ultimately chose to focus on 1. and 2., as the best way to pad the image for 3. was unclear and we were content to start from a baseline with a single central fixation point.

<img src="http://drive.google.com/uc?export=view&id=1Y8u74m2zDaotUZhOl4cUyAG4L7-3N93T">
Figure 5. Sample image of ducks with a single central fixation point indicated by a red dot.

To vary the scale of the image, we took crops of varying sizes and downscaled all the crops to the size of the smallest crop.

<img src="http://drive.google.com/uc?export=view&id=1UDJfRBYuyKwcuCI4hoIgGD6Jf-oPD-Of">
Figure 6. Sample image of ducks with red bounding boxes for location of crops to be downscaled.

We initially experimented with generating transformations of the image that only changed the scale, relying on the bicubic interpolation performed during downscaling to vary the sharpness of the image.

<img src="http://drive.google.com/uc?export=view&id=1a6ckFuUO7wUonnYiAmIIWYvIqJtk8ksZ">
Figure 7. Sample image of ducks with only scale transformations.

However, as shown above, this leads to artifacts on the edges of objects in the images. Instead, we tried applying Gaussian blurs of various radii and sub-sampling the original image. This also follows more closely the argument for the computational role of retinal eccentricity, as a Gaussian blur acts as a low-pass filter.

<img src="http://drive.google.com/uc?export=view&id=1ufa9rOy9BS-ZqBskuXpKZK3tS_Q8JlFo">
Figure 8. Sample image of ducks with gaussian blurs and scale transformations.

Notice now the absence of artifacts on the edges of objects in the transformed images. 

Specifically, in our final version of the image sampling process, we took central crops of 40x40, 80x80, 160x160 and 320x320 pixels. We then convolved the 4 crops with Gaussian kernels of radius 0, 1, 2 and 4 respectively. Finally, we subsampled the x- and y-axes of the crops evenly to downsample all crops to 40x40.
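The sampling steps above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used for the experiments: the function names are hypothetical, the Gaussian standard deviation (`radius / 2`) is an assumption since only kernel radii are given, and edge padding during the blur is likewise an assumed choice.

```python
import numpy as np

CROP_SIZES = (40, 80, 160, 320)  # crop side lengths from the text
BLUR_RADII = (0, 1, 2, 4)        # Gaussian kernel radii from the text
OUT_SIZE = 40                    # every crop is subsampled to 40x40


def gaussian_kernel_1d(radius):
    """Normalized 1-D Gaussian with 2*radius+1 taps; radius 0 is the identity."""
    if radius == 0:
        return np.array([1.0])
    sigma = radius / 2.0  # assumption: the text gives radii, not sigmas
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def blur(image, radius):
    """Separable Gaussian blur over the spatial axes of an HxWxC image."""
    img = image.astype(np.float64)
    if radius == 0:
        return img
    k = gaussian_kernel_1d(radius)
    h, w = img.shape[:2]
    # Filter along the vertical axis, then the horizontal axis.
    pad = np.pad(img, ((radius, radius), (0, 0), (0, 0)), mode="edge")
    img = sum(wt * pad[i:i + h] for i, wt in enumerate(k))
    pad = np.pad(img, ((0, 0), (radius, radius), (0, 0)), mode="edge")
    return sum(wt * pad[:, i:i + w] for i, wt in enumerate(k))


def retinal_sample(image):
    """Return a (4, 40, 40, C) stack of crops around a central fixation."""
    h, w = image.shape[:2]
    cy, cx = h // 2, w // 2
    crops = []
    for size, radius in zip(CROP_SIZES, BLUR_RADII):
        top, left = cy - size // 2, cx - size // 2
        if top < 0 or left < 0 or top + size > h or left + size > w:
            raise ValueError("image too small for a %dx%d crop" % (size, size))
        crop = image[top:top + size, left:left + size]
        step = size // OUT_SIZE  # 1, 2, 4, 8: even subsampling
        crops.append(blur(crop, radius)[::step, ::step])
    return np.stack(crops)


image = np.random.default_rng(0).random((320, 320, 3))
crops = retinal_sample(image)
print(crops.shape)  # (4, 40, 40, 3)
```

The smallest crop passes through unchanged, while the largest covers the full 320x320 field of view at one eighth of the resolution.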

### Data Augmentation<a name="Augmentation"></a>

We follow the data augmentation typically employed for training neural networks on ImageNet with a few notable differences.

1. We do not perform any scale transformations.
2. We do not perform any translational transformations.
3. We do not perform random image cropping.

The reason for 1. and 2. is that we are arguing that the data sampling procedure above generates a scale-translation invariant representation of the image. We would not be able to show this if we also employed scale and translation transformations during training. The reason for 3. is that taking random image crops would effectively correspond to multiple fixation points (see 3. in sampling for retinal eccentricity above). We stick to the simplest case of a single central fixation. We used the following augmentations:

1. Left-Right Flip (applied 50% of the time)
2. Brightness (uniform distribution between 0, 32./255.)
3. Saturation (uniform distribution between 0.5, 1.5)
4. Hue (uniform distribution between -0.2, 0.2)
5. Contrast (uniform distribution between 0.5, 1.5)
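As a rough sketch, the parameters listed above could be drawn per image as shown below. The function names are hypothetical. Only the flip, brightness and contrast steps are applied here, since hue and saturation need a colour-space conversion that is left out for brevity; the contrast definition (scaling around the per-channel mean) is one common convention and is an assumption about the original pipeline.

```python
import random

import numpy as np


def sample_augmentation(rng):
    """Draw one set of augmentation parameters using the ranges in the list."""
    return {
        "flip": rng.random() < 0.5,                    # left-right flip, 50%
        "brightness": rng.uniform(0.0, 32.0 / 255.0),  # additive offset
        "saturation": rng.uniform(0.5, 1.5),           # sampled, not applied here
        "hue": rng.uniform(-0.2, 0.2),                 # sampled, not applied here
        "contrast": rng.uniform(0.5, 1.5),             # multiplicative factor
    }


def apply_simple_augmentations(image, params):
    """Apply flip, brightness and contrast to a float HxWx3 image in [0, 1]."""
    out = image[:, ::-1] if params["flip"] else image
    out = out + params["brightness"]
    # Assumed contrast convention: scale deviations from the per-channel mean.
    mean = out.mean(axis=(0, 1), keepdims=True)
    out = (out - mean) * params["contrast"] + mean
    return np.clip(out, 0.0, 1.0)


rng = random.Random(0)
image = np.random.default_rng(0).random((40, 40, 3))
augmented = apply_simple_augmentations(image, sample_augmentation(rng))
print(augmented.shape)  # (40, 40, 3)
```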

<img src="http://drive.google.com/uc?export=view&id=1z5vL87XaDbCCo_Wvm-d-mnhK4TUCOfjo">
Figure 9. Sample of typical augmentations. Top: Original Image, Middle: Flip, Bottom: Minimum Contrast Augmentation

### Network Design<a name="NetworkDesign"></a>

We experimented with several classes of network designs. The general requirement for any design was that:

1. Accept image sampled to replicate retinal eccentricity (4 crops)
2. Incorporate information across the crops at some stage before classification

Previous works by Chen et al. [1] and Volokitin et al. [2] used similar approaches to their network design. The networks accepted the 4 crops concatenated along the color channel axis. For example, for 4 crops of the shape 40 x 40 x 3, the networks accepted a tensor of shape 40 x 40 x 12 as input.
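The shape arithmetic for this channel-wise concatenation is easy to check with a small NumPy sketch (the placeholder crops here are constant arrays, used only to show where each crop ends up):

```python
import numpy as np

# Four placeholder crops of shape 40 x 40 x 3; crop i is filled with the value i.
crops = [np.full((40, 40, 3), float(i), dtype=np.float32) for i in range(4)]

# Concatenate along the last (color channel) axis.
stacked = np.concatenate(crops, axis=-1)
print(stacked.shape)  # (40, 40, 12)

# Channels 3*i to 3*i + 2 of the stacked tensor hold crop i.
second_crop = stacked[..., 3:6]
```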

Chen et al. and Volokitin et al. both incorporated information across the crops by pooling over the channels corresponding to each crop. They each investigated the effect that various pooling configurations have on scale, translation and clutter invariance.

Specifically, Volokitin et al. considered 3 different pooling configurations. The evaluated configurations employed different extents of pooling at different locations along the network. Volokitin et al. then investigated the robustness of the networks against crowding under different pooling configurations on several MNIST-derived datasets.


<img src="http://drive.google.com/uc?export=view&id=19LQHaZy8kRlNo6q4vwWJEF1YM-RJttJ3">
Figure 10. Taken from [2]. Various pooling configurations considered. Volokitin et al. demonstrate the effects on crowding.

Chen et al. also considered several different pooling configurations. Notably, Chen et al. used a pooling configuration that attempted to closely model the relationship between receptive field size and eccentricity observed in biological measurements in macaque monkeys. They investigated how various pooling configurations affected the robustness of the networks to scale, translational and clutter transformations.

<img src="http://drive.google.com/uc?export=view&id=1f7B_Yitov6aE7gnH4-QVppYy_qnj5oy8">
Figure 11. Taken from [1]. A pooling configuration considered (left) that closely modelled biological measurements in macaque monkeys (right).

In these networks, an implicit design choice is weight sharing. The same weights are convolved over channels corresponding to each crop.

We investigated networks whose design deviated from these previous works in several ways. Notably, we consider networks that process the 4 crops (i.e. the different scales) using entirely separate neural networks. We only incorporate information across the scales at a final layer using the hidden states from the individual networks. We evaluated several different ways of incorporating the information and the effect each has on the classification performance of the models. The models can be broadly grouped into 3 classes based on how they collate information across scale. Below, a single 'resnet18' block refers to the full ResNet18 architecture without the final softmax layer.

1. No Scheme

Models of this class directly combine the output from the 4 networks in a softmax layer. 

<img src="http://drive.google.com/uc?export=view&id=14yZvPm-yXOXYBO9i1OcLWCogW_B0Uzd4">
Figure 12. High-level layout of networks implementing 'no scheme'.

2. Voting Scheme

Models of this class implement a voting scheme (average, max, etc.) between the 4 networks.

<img src="http://drive.google.com/uc?export=view&id=1yIsb6cV2XW60ijmPclX4kOyNglvNp_Up">
Figure 13. High-level layout of networks implementing 'voting scheme'.

3. Dense Scheme

Models of this class combine the output from the 4 networks through one or several dense fully connected layers which feed into a final softmax layer.

<img src="http://drive.google.com/uc?export=view&id=1eH9kqAsuDWlKXnW0CFpYvttA46kOG5iV">
Figure 14. High-level layout of networks implementing 'dense scheme'.
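The three ways of collating information across scale can be sketched with NumPy on the tower outputs alone. This is an illustration under stated assumptions rather than the trained models: the weights are random, the function names are hypothetical, the 512-dimensional feature size is that of a standard ResNet18 after global average pooling, dropout is omitted, and applying the vote to per-tower class probabilities is one possible reading of the voting scheme.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def no_scheme(features, w, b):
    """Concatenate the 4 tower outputs and feed one softmax layer."""
    return softmax(features.reshape(-1) @ w + b)


def voting_scheme(features, w, b, vote="average"):
    """Give each tower its own softmax layer, then combine the 4 votes."""
    probs = softmax(np.einsum("td,tdc->tc", features, w) + b)
    if vote == "average":
        return probs.mean(axis=0)
    pooled = probs.max(axis=0)    # max vote
    return pooled / pooled.sum()  # renormalize (assumed)


def dense_scheme(features, w1, b1, w2, b2):
    """Concatenate, pass through a dense ReLU layer, then a softmax layer."""
    hidden = np.maximum(features.reshape(-1) @ w1 + b1, 0.0)
    return softmax(hidden @ w2 + b2)


rng = np.random.default_rng(0)
d, n_classes, n_hidden = 512, 1000, 256
features = rng.normal(size=(4, d))  # stand-in for the 4 tower outputs

p_none = no_scheme(features, 0.01 * rng.normal(size=(4 * d, n_classes)),
                   np.zeros(n_classes))
p_vote = voting_scheme(features, 0.01 * rng.normal(size=(4, d, n_classes)),
                       np.zeros((4, n_classes)))
p_dense = dense_scheme(features, 0.01 * rng.normal(size=(4 * d, n_hidden)),
                       np.zeros(n_hidden),
                       0.01 * rng.normal(size=(n_hidden, n_classes)),
                       np.zeros(n_classes))
print(p_none.shape, p_vote.shape, p_dense.shape)
```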


### Network Training And Evaluation<a name="NetworkTraining"></a>

**Note: Unless otherwise stated, results here are from models trained from _scratch_ on ImageNet.**

We trained our networks on 4 GPUs using a mirrored strategy with one replica per GPU and synchronous replication. This helped reduce training times. Unfortunately, we were unable to scale up to 8 GPUs. It is still unclear why this was the case, but we suspect it has to do with the overhead from synchronous replication.

<img src="http://drive.google.com/uc?export=view&id=1Lvwh-7xh43n5FZMIRuTTlzJDzcZXgnmR">
Figure 15. Taken from https://jhui.github.io/assets/tensorflow/gscale.png. Mirrored multi-GPU training of networks.

It would be cumbersome to cover results from all experiments on network training, so we present a few of the more interesting tests. Unless specified, models use 'no scheme' as shown above. We use a shorthand to refer to the models. The term 'AlexNet x4' refers to a model with the 'ResNet18' blocks in the schematic above replaced with the full architecture of AlexNet without the final softmax layer. Likewise for other versions of the term such as 'ResNet50 x4'.

We considered starting from pre-trained models instead of training our networks from scratch. However, we saw that pre-trained networks generally performed noticeably worse. For example, consider the experiment below. We compared the performance of a pre-trained ResNet50 x4 model with frozen weights, a pre-trained ResNet50 x4 model without frozen weights, and a ResNet18 x4 model trained from scratch. Pre-trained ImageNet weights for ResNet50 were taken from Keras[5]. During this experiment, we were using weight sharing across the scales (i.e. the crops).

<img src="http://drive.google.com/uc?export=view&id=1HKRNFxx3sXlKbk2_b6oTOX4bjX46Ey-y">
Figure 16. Classification performance for pre-trained networks and networks trained from scratch.

We observed that the pre-trained networks performed considerably worse than the network trained from scratch. This is a little surprising since this is a typical transfer learning setup. Perhaps the pre-trained ResNet50 x4 model with frozen weights did not perform well because the data distribution after retinal sampling is too different from the original ImageNet distribution. It is possible that the pre-trained ResNet50 x4 model with unfrozen weights did not perform well for the same reason.

We also considered the effect of weight sharing in the networks. We trained two ResNet18 x4 models from scratch. One implemented weight sharing across the scales (i.e. across the crops) while the other used separate weights for each scale.

<img src="http://drive.google.com/uc?export=view&id=1gavWBdH1ipgzr-SILH0i8i2Bq2pCVpl7">
Figure 17. Classification performance for networks with and without weight sharing.

We observed that the network without weight sharing performed slightly better. Both networks have the same number of floating point operations for a forward pass, but the network without weight sharing has far more parameters. We expect this to have a non-trivial effect on the ability of the network to draw decision boundaries and generalise. We did not consider in more depth how this relates back to the theory of scale-translation invariance from retinal eccentricity. We see that networks without weight sharing have a greater capacity to learn (higher accuracy on the training set) but generalize only about as well as networks with weight sharing (marginally better accuracy on the validation set). We proceeded with the network without weight sharing because of its marginally better classification performance.

We considered different network designs from each of the 3 schemes described above. Below, we show the difference in classification performance between equivalent models from 'no scheme' and 'dense scheme'. Specifically, both implement ResNet18 x4. In 'no scheme', we feed the outputs from the terminal global average pool in the 4 ResNet18 networks directly to a dense softmax layer. In 'dense scheme', we feed the outputs from the terminal global average pool in the 4 ResNet18 networks through a series of dense ReLU layers and dropouts before a final dense softmax layer.

<img src="http://drive.google.com/uc?export=view&id=1dmHDjU_pw3aDpRWfKI1i3UK0rLvbGWUO">
Figure 18. Classification performance for networks of 'no scheme' vs. 'dense scheme'.

We saw quite consistently that 'no scheme' models performed better than 'dense scheme' models. 'No scheme' models plateaued at a higher training and validation accuracy, and reached it faster, than 'dense scheme' models. We had anticipated that 'dense scheme' models would allow for more complicated decision boundaries to incorporate information across scale. We also expected 'dense scheme' models to take longer to train because of the reintroduction of dropout into the networks. However, since 'no scheme' models generally performed better, we stuck with 'no scheme' models for the majority of experiments.

We considered working with deeper networks. ResNet networks are known for scaling to significantly greater depths, with better performance on datasets like ImageNet[6]. We experimented with the next deeper network to judge whether it was worth going far deeper. Results are from ResNet18 x4 vs ResNet34 x4 models.

<img src="http://drive.google.com/uc?export=view&id=1GuRST7-zGfcis5pBGZq8zGpYc2tOeqdS">
Figure 19. Classification performance for networks with 18 layer vs 34 layer depth intermediate modules.

We saw marginally better validation accuracy and noticeably better training accuracy for the ResNet34 x4 models. However, this came at a significant computational cost. Given our computational resources, we elected to stick with ResNet18 x4 models because they offered the best tradeoff between validation performance and compute for a forward pass.


During training, we considered several different batch sizes. We were training across 4 GPUs and expected to use 256 or 512 images per batch to stick closely to the original ResNet papers. We briefly considered using a much larger batch size of 1024 images to speed up training. The results below are from batch sizes of 256 and 1024 for ResNet18 x4 models.

<img src="http://drive.google.com/uc?export=view&id=1FCb0LIHc0diK3XPZ-K9uvEbxAebsVB7Q">
Figure 20. Classification performance for networks with 256 vs 1024 batch size.

We observed marginally better validation performance with a batch size of 256 and much better training performance. This is not entirely unexpected given the current understanding of the tradeoff between batch size, network training time and network quality. We elected to stick with a batch size of 256 despite the slightly higher computational cost due to the network's better performance.

We implemented both a standard learning rate decay with every epoch and a learning rate decay triggered by validation loss stagnation. The hyperparameters we used for the learning rate decay followed closely from the original ResNet papers. We investigated shifting the hyperparameters slightly and saw almost no change in network performance. Results below from ResNet18 x4 models highlight the role of learning rate decay.

<img src="http://drive.google.com/uc?export=view&id=18YbMlzsUyo7U6ObXxvTmvjoDZEGcbULg">
Figure 21. Classification performance for a ResNet18 x4 model with learning rate decay at plateau.

In general, we observed that training and validation accuracy would stagnate after a few epochs of training at a given learning rate. By dropping the learning rate when the validation loss plateaus for several epochs, we allow the networks to navigate the loss landscape at increasing resolution. The effect of this on training is quite clear. For example, consider the typical run shown above, where the training and validation performance jump after the learning rate is dropped following a brief plateau.
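The plateau rule can be written down in a few lines of plain Python. This is a minimal sketch with a hypothetical class name; the `factor`, `patience` and `min_delta` values are illustrative assumptions, not the hyperparameters used in the experiments. It is similar in spirit to the `ReduceLROnPlateau` callback available in Keras.

```python
class PlateauDecay:
    """Drop the learning rate when validation loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=3, min_delta=0.0):
        self.lr = lr
        self.factor = factor        # multiplier applied at each drop (assumed)
        self.patience = patience    # epochs without improvement before a drop
        self.min_delta = min_delta  # smallest change that counts as improvement
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauDecay(lr=0.1, factor=0.1, patience=2)
val_losses = [2.0, 1.5, 1.5, 1.5, 1.2, 1.2, 1.2]
history = [schedule.step(loss) for loss in val_losses]
print(history)
```

In this toy run the learning rate is cut once after the two stagnant epochs at 1.5 and again after the two stagnant epochs at 1.2.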
 

Note that we have always shown good crops generated from retinal sampling. This is not always the case since we're simply taking a central fixation. In either the training or validation set, this could result in cases where the generated crops do not capture the object to be identified at all. Below is a relatively tame example of this.

<img src="http://drive.google.com/uc?export=view&id=11bgjGDjIknckWZzjPbogMG0Z0XHNelJS">
Figure 22. Above: an example of good crops from retinal sampling. Below: an example of bad crops from retinal sampling.

Notice that in the top panel, the crops from retinal sampling clearly capture details of the object of interest at multiple scales. However in the bottom panel, the crops from retinal sampling capture only some parts of the object of interest visible in the original image. This is a limitation of the current setup.

We then went back and compared the performance of the models against corresponding state-of-the-art models on a top-1 and top-5 accuracy basis. Top-5 accuracy is an interesting metric to consider on ImageNet as the dataset has a considerable amount of both breadth and depth (e.g. different species of turtles). Below, we compare the best ResNet18 x4 model against the related ResNet architectures.

<img src="http://drive.google.com/uc?export=view&id=1W3iEzsrYknOqMVSW7MdxzODPBaI1wyPI">
Figure 23. Accuracy of ResNet18 x4 model compared to state-of-the-art models.

| Model             | Top-1 Accuracy   | Top-5 Accuracy  |
| :------------------- |:--------------------:|:--------------------:|
| ResNet 18         | 68.24 | 88.49 |
| ResNet 18 x4      | 47.25 | 71.28 |
| ResNet 50         | 74.81 | 92.38 |

We used the top-1 and top-5 accuracies reported for pre-trained versions of the reference models. Our network's performance approaches that of the state-of-the-art models, trailing ResNet 18 by about 21 points in top-1 accuracy and about 17 points in top-5 accuracy. This is a promising result considering that this is still preliminary work.
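For reference, top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. A small NumPy sketch on made-up scores (the function name and the toy data are illustrative only):

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()


# Toy example: 3 images, 3 classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])

top1 = top_k_accuracy(scores, labels, k=1)  # only the first image is correct
top2 = top_k_accuracy(scores, labels, k=2)  # the first two images are correct
print(top1, top2)
```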

How the computational cost of a forward pass in our models compares to that of standard models is not obvious. Our models employ 4 entirely separate networks to process each crop before aggregating over the terminal hidden states for prediction. At the same time, the models use a much smaller input than is typical. We calculated the number of floating point operations for a forward pass for implementations of the models for comparison.

<img src="http://drive.google.com/uc?export=view&id=1TSivXinR72wWhes5dOQK_ZSA8UqHszYl">
Figure 24. FLOPS for forward pass for ResNet18 x4 model compared to state-of-the-art models.

| Model             | Floating-Point Operations per Forward Pass   |
| :------------------- |:--------------------:|
| ResNet 18         | 3.6e9 |
| ResNet 18 x4      | 7.1e9 |
| ResNet 50         | 7.7e9 |

We see that our models use significantly more compute than the corresponding state-of-the-art model. This is a result of the design of ResNet18. In the typical architecture, an early strided convolution and max pool almost immediately bring the input down from ~200x200 to ~50x50. This greatly reduces the compute for a forward pass in ResNet18. However, to avoid the dimension of the hidden state collapsing to 1x1 in our ResNet18 x4 models, this max pool was removed. This, combined with our use of 4 separate networks to process each crop, results in the floating-point operations per forward pass increasing greatly. For example, notice above that our ResNet18 x4 model, with its four sister ResNet18 networks, uses about as many floating point operations as a single typical ResNet50 architecture.
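To make the scaling argument concrete, the rough arithmetic below counts FLOPs for a single 3x3 convolution at different feature-map sizes. The 56x56 map with 64 channels matches the first residual stage of a standard ResNet18 on a 224x224 input. The two tower settings (40x40 and 20x20 feature maps) are hypothetical scenarios chosen for illustration, not measurements of the models above, and the helper name is an assumption.

```python
def conv_flops(out_h, out_w, kernel, c_in, c_out):
    """FLOPs for one convolution, counting each multiply-add as 2 FLOPs."""
    return 2 * out_h * out_w * kernel * kernel * c_in * c_out


# One 3x3, 64 -> 64 convolution on a 56x56 feature map.
standard = conv_flops(56, 56, 3, 64, 64)

# Hypothetical: four towers with no spatial downsampling before this layer.
towers_full_res = 4 * conv_flops(40, 40, 3, 64, 64)

# Hypothetical: four towers with one stride-2 step before this layer.
towers_half_res = 4 * conv_flops(20, 20, 3, 64, 64)

print(towers_full_res / standard)  # about 2.04
print(towers_half_res / standard)  # about 0.51
```

The cost of a convolution grows with the spatial area of its output, so whether four small-input towers are cheaper or more expensive than one large-input network depends on how much downsampling happens early on.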

To understand the performance of our model and inform additional changes to the design, we considered more closely how the networks were misclassifying objects in the validation set. For example, one of the things we considered was the number of examples in the validation set (out of 50 examples per class) that were misclassified for each object category.
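Counting misclassifications per class is straightforward once true and predicted labels are available. A small sketch with made-up labels (the function name and the example data are illustrative only):

```python
from collections import Counter


def misclassifications_per_class(true_labels, predicted_labels):
    """Count, for each true class, how many examples were misclassified."""
    errors = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        # Adding 0 still registers the class, so error-free classes appear too.
        errors[truth] += int(truth != predicted)
    return errors


true_labels = ["plastic bag", "plastic bag", "buckle", "velvet"]
predicted_labels = ["plastic bag", "sack", "buckle", "buckle"]

errors = misclassifications_per_class(true_labels, predicted_labels)
worst_first = errors.most_common()  # classes sorted by error count
print(worst_first)
```

A histogram of the counts in `errors` gives the distribution of misclassifications per class.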

<img src="http://drive.google.com/uc?export=view&id=1qVCviH1-If1Og7feFkzGEhxyZ87iahv6">
Figure 25. Distribution of number of misclassifications in the validation set.

<img src="http://drive.google.com/uc?export=view&id=1dRUgZ_VO2Riq0qkvs-A02VRO10nHZGlC">
Figure 26. Classes that were most misclassified in the validation set. Top: Plastic Bag (misclassified 47/50 times), Middle: Buckle (misclassified >40/50 times), Bottom: Velvet (misclassified >40/50 times). 

We observe an approximately normal distribution for the number of misclassifications per class in the validation set. This does not provide a clear target to pursue (as it would if, for example, we saw a bimodal distribution with a second peak around ~40-50). We also looked at the classes that were most often misclassified. A possible reason for the misclassification of these classes is that, in the validation set images, the objects often do not appear in the central part of the image (especially true for the 'Plastic Bag' class).

We also considered to what extent each crop in the retinal sampling was contributing to the final classification decision. We set up a simple test where we zeroed out different crops in a forward pass and noted the final classification prediction.
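Such an ablation test can be sketched as below. The `predict_fn` argument stands in for a trained model and is hypothetical, as is the toy model used to exercise it; for simplicity this version zeroes one crop at a time, which is only one of the possible combinations.

```python
import numpy as np


def crop_ablation(predict_fn, crops):
    """Predicted class with no crop zeroed, then with each crop zeroed in turn.

    crops: array of shape (4, 40, 40, 3); predict_fn maps it to class scores.
    """
    results = {"none zeroed": int(np.argmax(predict_fn(crops)))}
    for i in range(crops.shape[0]):
        ablated = crops.copy()
        ablated[i] = 0.0  # remove the information carried by crop i
        results["crop %d zeroed" % i] = int(np.argmax(predict_fn(ablated)))
    return results


def toy_model(crops):
    """Hypothetical stand-in that only looks at the largest (last) crop."""
    return np.array([crops[3].mean(), 0.5])


results = crop_ablation(toy_model, np.ones((4, 40, 40, 3)))
print(results)
```

For this toy model the prediction only changes when the last crop is zeroed, which is the kind of signature that indicates reliance on a single scale.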

<img src="http://drive.google.com/uc?export=view&id=1FDJjCFU09tM1O1qFKJiLPuaDKVRX8GFI">
Figure 27. Prediction from ResNet18 x4 model with various crops from retinal sampling zeroed out.

We noticed that the network was learning to use almost entirely the largest crop (320x320, with the Gaussian blur of largest radius, downsampled to 40x40) in making its prediction. We believe that going forward, we might alleviate this by training each tower in the models separately and combining the individual towers in a final learning stage. This way, we might ensure a more proportionate contribution to the final prediction from each retinal crop. Alternatively, we might also be able to accomplish this by performing a global average pooling over the scales (i.e. the crops) at the final layer before feeding into the decision softmax layer.

### Scale, Translation and Clutter Invariance <a name="Invariance"></a>

We intend to first tune the design of the network, or how it is trained, to ensure more proportionate contributions from each scale level in the retinal sampling towards the final model prediction. We present below several sanity checks on the scale, translation and clutter invariance of the model. Note that our models did not employ scale or translational augmentation during training. We compare our ResNet18 x4 model against a pre-trained ResNet18 network.

We first considered a simple test for invariance to scale by scaling up an image at two different positions. We then compared the models' classifications to assess their robustness to this scaling.

<img src="http://drive.google.com/uc?export=view&id=1wWZYjRB9CcTnLk5is5_R_AeHuTwMVasK">
Figure 28. Simple test for ResNet18 x4 to consider robustness to changes in image scale.
<img src="http://drive.google.com/uc?export=view&id=1V5_I9DSeQwsJRg2VH0aAPbqxsfxJn-qd">
Figure 29. Simple test for ResNet18 to consider robustness to changes in image scale.

We see that the vanilla ResNet18 appears to be more robust to changes in image scale in this simple test. For both models, we see a significant drop in confidence in the correct prediction when the scale is adjusted with fixation at the head. ResNet18 still shows high confidence when the scale is adjusted with fixation at the feet, while ResNet18 x4 breaks down here.

We performed a simple test for invariance to translation by moving the object of interest to different positions in the image. We then compared the models' classifications to assess their robustness to translation.
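One simple way to generate translated test images is to shift the whole image and fill the vacated pixels with a constant. The sketch below is an assumption about how such test images could be produced, not a description of how the figures here were made; the function name and fill value are hypothetical.

```python
import numpy as np


def translate(image, dx, dy, fill=0.0):
    """Shift an HxWxC image by (dx, dy) pixels; vacated pixels become `fill`."""
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    if abs(dx) >= w or abs(dy) >= h:
        return out  # shifted entirely out of frame
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out


image = np.arange(3 * 4, dtype=np.float64).reshape(3, 4, 1)
shifted_right = translate(image, dx=1, dy=0)
shifted_up = translate(image, dx=0, dy=-1)
print(shifted_right[..., 0])
```

Positive `dx` moves content to the right and positive `dy` moves it down, so a sweep over `dx` gives a series of test images for comparing model confidence across positions.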

<img src="http://drive.google.com/uc?export=view&id=1IYtBUwEQTDq29bnfgnrKHbLaoIz8oNDJ">
Figure 30. Simple test for ResNet18 x4 to consider robustness to image translation.
<img src="http://drive.google.com/uc?export=view&id=1BbHzw89BIRowxsJcoyfiAXa-xr_LJR97">
Figure 31. Simple test for ResNet18 to consider robustness to image translation.

We see that ResNet18 x4 appears to be more robust to translations of the object. For both objects, we retrieve near perfect confidence for translations to the right. For translations to the left, ResNet18 x4 performed proportionately better than ResNet18.

We performed a simple test for invariance to noise flanking the object on either side. We compared the models' classifications to assess their robustness to flanking noisy clutter.

<img src="http://drive.google.com/uc?export=view&id=1xZrIIIKNSTEy35jxSGRaowUxx2JKALC7">
Figure 32. Simple test for ResNet18 x4 to consider robustness to flanking noise.
<img src="http://drive.google.com/uc?export=view&id=143bwWT4RpFvg7sKLmpsFdcofMjJWxJ30">
Figure 33. Simple test for ResNet18 to consider robustness to flanking noise.

We see that both ResNet18 x4 and ResNet18 are comparably robust to the introduced noise. In fact, the confidence went up slightly from the baseline for ResNet18.


None of the tests presented above are very thorough. These were simply intended as preliminary checks to establish the current state of the networks. The current simple tests do not differentiate the networks well.

### Future Work<a name="FutureWork"></a>

There are many future directions for the work. We list here the four most immediate:

1. Redesign network or change learning strategy to ensure more proportionate contribution to final prediction from the different crops from retinal sampling.

2. Implement multiple fixation points at training/validation time instead of a single central fixation.

3. Reimplement retinal sampling using only tensorflow/keras functions to automate gradient propagation through retinal sampling (currently done manually when necessary).

4. Automate generation and evaluation of test images for scale, translation and clutter invariance for the networks.

### References<a name="References"></a>

[1] Eccentricity Dependent Deep Neural Networks: Modeling Invariance in Human Vision (AAAI 2017, Chen et al.)

[2] Do Deep Neural Networks Suffer from Crowding? (NIPS 2017, Volokitin et al.)

[3] Computational role of eccentricity dependent cortical magnification (CBMM Memos 2014, Poggio et al.)

[4] ImageNet: A Large-Scale Hierarchical Image Database (CVPR 2009, Deng et al.)

[5] Keras Pre-Trained Models https://keras.io/applications/

[6] Deep Residual Learning for Image Recognition (CVPR 2016, He et al.)

[7] Tensorflow & Keras Model Zoo https://github.com/qubvel/classification_models