
Surgeon Action Detection for endoscopic images/videos

This is a baseline model developed for the SARAS-ESAD 2020 challenge. To download the dataset and participate in the challenge, please register at the SARAS-ESAD website.

The code is adapted from a RetinaNet implementation in PyTorch 1.x.

Features of this baseline

  • Data preparation instructions for the SARAS-ESAD 2020 challenge
  • Dataloader for the SARAS-ESAD dataset
  • PyTorch 1.x implementation
  • Feature pyramid network (FPN) architecture with different ResNet backbones
  • Three types of loss functions on top of FPN: OHEM loss, focal loss, and YOLO loss


Here, we implement basic data-handling tools for the SARAS-ESAD dataset along with an FPN training process. The training code is pure PyTorch and supports training FPN with focal loss or OHEM/multi-box loss.

We hope this will help more teams get up to speed quickly and leave time for more innovative solutions. We want to eliminate the pain of building the data-handling and training process from scratch. Our final aim is to bring this repository to the level of realtime-action-detection.

At the moment we support the latest PyTorch on Ubuntu with the Anaconda distribution of Python. Tested on a single machine with 2/4/8 GPUs.

You can find out more about the architecture and loss functions in the parent repository, i.e. the RetinaNet implementation in PyTorch 1.x.

ResNet is used as the backbone network (a) to build the pyramid features (b). The classification (c) and regression (d) subnets each consist of 4 convolutional layers followed by a final convolutional layer that predicts the class scores and bounding-box coordinates, respectively.

Similar to the original paper, we freeze the batch normalisation layers of the ResNet-based backbone networks. A few initial layers are also frozen; see the fbn flag in the training arguments.
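The subnet structure described above can be sketched in PyTorch roughly as follows. The 256-channel width follows RetinaNet's convention, while the output channels (9 anchors × 21 classes = 189) are an illustrative assumption rather than this repository's exact configuration:

```python
import torch
import torch.nn as nn

def make_subnet(in_ch=256, out_ch=189):
    # Four 3x3 conv + ReLU layers, then a final 3x3 conv producing
    # per-anchor predictions. out_ch = anchors * classes (189 = 9 * 21
    # here is a hypothetical choice, not this repository's exact config).
    layers = []
    ch = in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, 256, 3, padding=1), nn.ReLU(inplace=True)]
        ch = 256
    layers.append(nn.Conv2d(256, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

head = make_subnet()
x = torch.randn(1, 256, 25, 25)   # one FPN level's feature map
out = head(x)
print(tuple(out.shape))           # spatial size is preserved: (1, 189, 25, 25)
```

The same head is applied to every pyramid level, which is why the final conv keeps the spatial resolution of its input.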

Loss functions

  • OHEM with multi-box loss function: We use the multi-box loss function with online hard example mining (OHEM), similar to SSD. A huge thanks to Max DeGroot and Ellis Brown for their PyTorch implementation of SSD and its loss function.

  • Focal loss: As in the original paper, we use the sigmoid focal loss; see RetinaNet. We use a pure PyTorch implementation of it.

  • YOLO loss: The multi-part loss function from YOLO is also implemented here.
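For intuition, the sigmoid focal loss down-weights easy examples through the (1 - p_t)^gamma factor, so training concentrates on hard examples. A minimal NumPy sketch (illustrative, not this repository's implementation):

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-element sigmoid focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p_t = np.where(targets == 1, p, 1.0 - p)             # prob of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)

logits = np.array([4.0, -4.0, 4.0])   # confident positive, confident negative, ...
targets = np.array([1, 0, 0])         # ... and a confident false positive
loss = sigmoid_focal_loss(logits, targets)
print(loss)  # the two correct, easy examples get tiny loss; the wrong one dominates
```

With gamma = 0 and alpha = 0.5 this reduces to (half of) the usual binary cross-entropy, which makes the focusing effect of gamma easy to see by experiment.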


You will need the following to run this code successfully:

  • Anaconda Python
  • Latest PyTorch
  • Visualisation (optional)
    • If you want to visualise training, set the tensorboard flag to true while training
    • TensorboardX
    • Tensorflow (for tensorboard)

Datasets and other downloads

  • Please visit SARAS-ESAD website to download the dataset for surgeon action detection.

  • Extract all the sets (train and val) from the zip files and put them under a single directory. Provide the path of that directory as data_root in the training script. The data preprocessing and feeding pipeline is in the dataloader file.

  • Rename the data directory to esad.

  • Your directory structure should look like:

    • esad
      • train
        • set1
          • file.txt
          • file.jpg
          • ..
      • val
        • obj
          • file.txt
          • file.jpg
          • ..
  • Now your dataset is ready; it is time to download the ImageNet-pretrained weights for the ResNet backbone models.

  • Weights are initialised with ImageNet-pretrained models. Specify the path of the pre-saved models as model_dir in the training arguments, and download them from the torchvision models. After you have downloaded the weights, rename them appropriately under model_dir, e.g. resnet50, resnet101, etc. This is a requirement of the training process.


Once you have pre-processed the dataset, you are ready to train your networks. The following arguments must be set correctly:

  • data_root is the base path up to the esad directory, e.g. /home/gurkirt/
  • save_root is a base path where you want to store the checkpoints, training logs, tensorboard logs etc.
  • model_dir is a path where ResNet backbone model weights are stored
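These arguments would typically be exposed via argparse; a hypothetical sketch (the defaults, and any flag details beyond the names listed above, are assumptions):

```python
import argparse

parser = argparse.ArgumentParser("SARAS-ESAD baseline training (sketch)")
parser.add_argument("--data_root", default="/home/gurkirt/",
                    help="base path up to the esad directory")
parser.add_argument("--save_root", default="/home/gurkirt/save/",
                    help="where checkpoints, training logs and tensorboard logs go")
parser.add_argument("--model_dir", default="/home/gurkirt/weights/",
                    help="where the ImageNet-pretrained backbone weights are stored")
parser.add_argument("--loss_type", default="mbox",
                    choices=["mbox", "focal", "yolo"])
parser.add_argument("--tensorboard", default="false",
                    help="set to true to write tensorboard summaries")

args = parser.parse_args([])   # empty list -> use the defaults
print(args.loss_type)
```

Passing an empty list to parse_args keeps the sketch runnable outside a command line; in the real script the flags come from sys.argv.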

To train, run the following command:

python --loss_type=mbox --data_root=/home/gurkirt/ --tensorboard=true

It will use all the visible GPUs. You can prepend CUDA_VISIBLE_DEVICES=<gpuids-comma-separated> to the above command to mask certain GPUs. We used a 2-GPU machine to run our experiments.

Please check the training arguments to adjust the training process to your liking.

Some useful flags


The model is evaluated and saved after every 1000 iterations.

mAP@0.25 is computed after every 500 iterations and at the end of training. You can change this interval by specifying it in the arguments.
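For intuition on the IoU threshold behind mAP@0.25: a detection counts as a match for a ground-truth box only when their intersection-over-union exceeds the threshold. A minimal sketch:

```python
def iou(a, b):
    # Boxes are [x1, y1, x2, y2]; returns intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# Overlap of 25 px^2 over a union of 175 px^2 -> IoU ~ 0.143,
# below the 0.25 threshold, so this pair would not count as a match.
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))
```

Lower thresholds like 0.25 are forgiving of loose localisation, which is why AP_10 is consistently higher than AP_50 in the tables below.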

You can evaluate and save the results in a text file using the evaluation script, which follows the same arguments. By default it evaluates using the model stored at max_iters, but you can change this to any other snapshot/checkpoint.

python --loss_type=focal

This will dump a log file with results (mAP) on the validation set, as well as a submission file.


Here are the results on the esad dataset.

Results of the baseline models with different loss functions and input image sizes, with the backbone network fixed to ResNet50. AP_10, AP_30, AP_50, and AP_mean are computed on the validation set, while Test-AP_mean is computed on the test set in the same way as AP_mean.

Loss    min dim   AP_10   AP_30   AP_50   AP_mean   Test-AP_mean
Focal   200       33.8    17.7    6.6     19.4      15.7
Focal   400       35.9    19.4    8.0     21.1      16.1
Focal   600       29.2    17.6    8.7     18.5      14.0
Focal   800       31.9    20.1    8.7     20.2      12.4
OHEM    200       35.1    18.7    6.3     20.0      11.3
OHEM    400       33.9    19.2    7.4     20.2      13.6
OHEM    600       37.6    23.4    11.2    24.1      12.5
OHEM    800       36.8    24.3    12.2    24.4      12.3

Results of the baseline models with different loss functions and backbone networks, with the input image size fixed to 400. AP_10, AP_30, AP_50, and AP_mean are computed on the validation set, while Test-AP_mean is computed on the test set in the same way as AP_mean.

Loss    Network     AP_10   AP_30   AP_50   AP_mean   Test-AP_mean
Focal   ResNet18    35.1    18.9    8.1     20.7      15.3
OHEM    ResNet18    36.0    20.7    7.7     21.5      13.8
Focal   ResNet34    34.6    18.9    6.4     19.9      14.3
OHEM    ResNet34    36.7    20.4    7.1     21.4      13.8
Focal   ResNet50    35.9    19.4    8.0     21.1      16.1
OHEM    ResNet50    33.9    19.2    7.4     20.2      13.6
Focal   ResNet101   32.5    17.2    6.1     18.6      14.0
OHEM    ResNet101   36.6    20.1    7.4     21.3      12.3

Outputs from the latest model (800 OHEM) are uploaded in the sample folder. See the flag at line 114 to select the validation or testing set (the test set will be available on 10th June).


  • Input image size (height × width) is 600×1067 or 800×1422.
  • Batch size is set to 16.
  • Weights for the initial layers are frozen; see the freezeupto flag in the training arguments.
  • The maximum number of iterations is set to 6000.
  • SGD is used for optimisation, with an initial learning rate of 0.01.
  • The learning rate is dropped by a factor of 10 after 5000 iterations.
  • Different training settings might result in better/same/worse performance.
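The schedule above can be expressed with PyTorch's optimiser and scheduler APIs; a minimal sketch with a dummy parameter, not the repository's actual training loop:

```python
import torch

# A single dummy parameter stands in for the network weights in this sketch.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)
# Drop the learning rate by a factor of 10 after 5000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[5000], gamma=0.1)

for it in range(6000):          # max_iters = 6000
    optimizer.step()            # loss/backward omitted in this sketch
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # 0.001 after the drop
```

MultiStepLR keeps the initial rate of 0.01 until the 5000-iteration milestone, then multiplies it by gamma=0.1 for the remaining 1000 iterations.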

