Surgeon Action Detection for endoscopic images/videos
The code is adapted from a RetinaNet implementation in PyTorch 1.x.
Features of this baseline
- Data preparation instructions for SARAS-ESAD 2020 challenge
- Dataloader for SARAS-ESAD dataset
- PyTorch 1.x implementation
- Feature pyramid network (FPN) architecture with different ResNet backbones
- Three types of loss functions, i.e. OHEM loss, focal loss, and YOLO loss, on top of FPN
We hope this baseline helps more teams get up to speed quickly and frees up time for more innovative solutions. We want to eliminate the pain of building the data handling and training pipeline from scratch. Our final aim is to bring this repository to the level of realtime-action-detection.
At the moment we support the latest PyTorch and Ubuntu with the Anaconda distribution of Python. Tested on a single machine with 2/4/8 GPUs.
You can find out more about the architecture and loss functions in the parent repository, i.e. the RetinaNet implementation in PyTorch 1.x.
ResNet is used as a backbone network (a) to build the pyramid features (b). Each classification (c) and regression (d) subnet is made of 4 convolutional layers, followed by a final convolutional layer that predicts the class scores and bounding box coordinates respectively.
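A minimal sketch of such a subnet in PyTorch (illustrative only; `num_anchors` and `num_classes` below are placeholder values, not the challenge's settings):

```python
import torch
import torch.nn as nn

def make_head(in_channels: int, num_outputs: int) -> nn.Sequential:
    """Four 3x3 conv + ReLU layers followed by a prediction conv,
    mirroring the RetinaNet classification/regression subnets."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(in_channels, num_outputs, 3, padding=1))
    return nn.Sequential(*layers)

num_anchors, num_classes = 9, 21                 # placeholder values
cls_head = make_head(256, num_anchors * num_classes)
reg_head = make_head(256, num_anchors * 4)

feat = torch.randn(1, 256, 50, 50)               # one FPN pyramid level
cls_out = cls_head(feat)                         # [1, 9*21, 50, 50]
reg_out = reg_head(feat)                         # [1, 9*4, 50, 50]
```

In RetinaNet the same pair of heads is shared across all pyramid levels.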
Similar to the original paper, we freeze the batch normalisation layers of the ResNet-based backbone networks. A few initial layers are also frozen; see the fbn flag in the training arguments.
OHEM with multi-box loss function: We use the multi-box loss function with online hard example mining (OHEM), similar to SSD. A huge thanks to Max DeGroot and Ellis Brown for their PyTorch implementation of SSD and its loss function.
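As a rough illustration of the OHEM selection step (a sketch, not the repository's exact code): given per-anchor classification losses, all positive anchors are kept and only the hardest negatives are retained, at a fixed negative:positive ratio (3:1 in SSD).

```python
import torch

def ohem_select(conf_loss: torch.Tensor, pos_mask: torch.Tensor,
                neg_pos_ratio: int = 3) -> torch.Tensor:
    """Return a mask over anchors that contribute to the loss:
    all positives plus the hardest negatives at neg_pos_ratio : 1."""
    num_pos = int(pos_mask.sum())
    neg_loss = conf_loss.clone()
    neg_loss[pos_mask] = 0.0                   # ignore positives when ranking negatives
    num_neg = min(neg_pos_ratio * num_pos, int((~pos_mask).sum()))
    _, idx = neg_loss.sort(descending=True)    # hardest negatives first
    neg_mask = torch.zeros_like(pos_mask)
    neg_mask[idx[:num_neg]] = True
    return pos_mask | neg_mask

losses = torch.tensor([0.1, 2.0, 0.05, 1.5, 0.3, 0.9])
pos = torch.tensor([True, False, False, False, False, False])
keep = ohem_select(losses, pos)   # the positive + the 3 hardest negatives
```

Restricting the loss to this subset keeps the easy background anchors from overwhelming the gradient.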
Focal loss: As in the original paper, we use the sigmoid focal loss; see RetinaNet. We use a pure-PyTorch implementation of it.
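A plain-PyTorch sketch of the sigmoid focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with the paper's defaults alpha=0.25 and gamma=2 (illustrative, not necessarily this repository's exact implementation):

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                       alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Sigmoid focal loss: cross-entropy down-weighted for well-classified
    examples by the modulating factor (1 - p_t)^gamma."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

logits = torch.tensor([3.0, -2.0, 0.1])    # confident pos, confident neg, uncertain pos
targets = torch.tensor([1.0, 0.0, 1.0])
loss = sigmoid_focal_loss(logits, targets)
```

The confident examples contribute almost nothing, so training focuses on the hard, uncertain anchors.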
YOLO loss: The multi-part loss function from YOLO is also implemented here.
You will need the following to run this code successfully:
- Anaconda Python
- Latest PyTorch
- If you want to visualise training, set the tensorboard flag to true
- TensorFlow (for TensorBoard)
Datasets and other downloads
Please visit SARAS-ESAD website to download the dataset for surgeon action detection.
Extract all the sets (train and val) from the zip files and put them under a single directory. Provide the path of that directory as data_root in the train file. The data preprocessing and feeding pipeline is in the detectionDatasets.py file.
Rename the data directory as required by the dataloader. Your directory should then contain the train and val sets directly under data_root.
Now your dataset is ready; it is time to download the ImageNet-pretrained weights for the ResNet backbone models.
Weights are initialised from ImageNet-pretrained models. Download them from the torchvision models, then specify the path of the pre-saved models via model_dir in train.py. After you have downloaded the weights, rename them appropriately under model_dir, e.g. resnet50, resnet101, etc. This is a requirement of the training process.
Once you have preprocessed the dataset, you are ready to train your networks. The following arguments must be set correctly:
data_root is the base path up to the dataset directory
save_root is the base path where you want to store the checkpoints, training logs, TensorBoard logs, etc.
model_dir is the path where the ResNet backbone model weights are stored
To train, run the following command:
python train.py --loss_type=mbox --data_root=/home/gurkirt/ --tensorboard=true
It will use all the visible GPUs.
You can prepend CUDA_VISIBLE_DEVICES=<gpuids-comma-separated> to the above command to mask certain GPUs. We used a 2-GPU machine to run our experiments.
Please check the arguments in train.py to adjust the training process to your liking.
Some useful flags
The model is evaluated and saved, and mAP@0.25 is computed, after every 500 iterations and at the end of training. You can change this interval via the training arguments.
You can evaluate and save the results in a text file using evaluate.py, which follows the same arguments as train.py. By default it evaluates using the model stored at max_iters, but you can change this to any other snapshot/checkpoint.
python evaluate.py --loss_type=focal
This will dump a log file with the results (mAP) on the validation set, as well as a submission file.
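For intuition, here is a sketch of how AP at a fixed IoU threshold (as in mAP@0.25) can be computed for one class, using greedy matching and 11-point interpolation. This is illustrative only, not the challenge's official evaluation code.

```python
import numpy as np

def iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(dets, gts, iou_thresh=0.25):
    """AP for one class: dets = [(score, box)], gts = [box].
    Greedy matching at iou_thresh, 11-point interpolated AP."""
    dets = sorted(dets, key=lambda d: -d[0])       # highest score first
    matched = [False] * len(gts)
    tp, fp = np.zeros(len(dets)), np.zeros(len(dets))
    for i, (_, box) in enumerate(dets):
        best, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            ov = iou(box, gt)
            if ov > best:
                best, best_j = ov, j
        if best >= iou_thresh and not matched[best_j]:
            tp[i], matched[best_j] = 1.0, True
        else:
            fp[i] = 1.0                            # duplicate or unmatched
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):            # 11-point interpolation
        p = prec[rec >= t].max() if np.any(rec >= t) else 0.0
        ap += p / 11
    return ap

dets = [(0.9, [20, 20, 30, 30]), (0.6, [0, 0, 10, 10])]
gts = [[0, 0, 10, 10]]
ap = average_precision(dets, gts)   # 0.5: the high-scoring false positive halves precision
```

The per-class APs are then averaged over classes to get mAP at that threshold.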
Here are the results of the baseline models.
Results of the baseline models with different loss functions and input image sizes, with the backbone network fixed to ResNet50. AP_10, AP_30, AP_50, and AP_mean are reported on the validation set, while Test-AP_mean is computed on the test set in the same way as AP_mean.
Results of the baseline models with different loss functions and backbone networks, with the input image size fixed to 400. AP_10, AP_30, AP_50, and AP_mean are reported on the validation set, while Test-AP_mean is computed on the test set in the same way as AP_mean.
Outputs from the latest model (800 OHEM) are uploaded in the sample folder. See the flag at line 114 of evaluate.py to select the validation or test set (the test set will be available on 10th June).
- Input image size is specified as height x width
- Batch size is set to 16
- Weights for the initial layers are frozen; see the fbn flag in the training arguments
- The maximum number of iterations is set to 6000
- SGD is used for optimisation
- The learning rate is dropped by a factor of 10 after 5000 iterations
- Different training setting might result in better/same/worse performance