# Understanding the ViT Tracker from OpenCV

The ViT Tracker was implemented by Pengyu Liu as part of a GSoC 2023 project. \
This tracker is based on this paper.

This document walks through the code of the Vision Transformer tracker, `vittrack`, in the following order:
1. Figuring out which model they used.
   - VOTS 2023 results paper.
   - OpenCV Zoo repo and comments
   - The model elements from the trained ONNX file.
2. Explaining the code parts
   - ROI selector using `selectROI()` function of OpenCV
   - Custom-made ROI selector
   - Integrating the model for gimbal control via laptop
   - Integrating the model for gimbal control via Jetson
   - Training the same model on custom dataset
   - Training a new model on the custom dataset
3. Writing the paper on the model


### 1. Figuring out which model they used.

#### Visual Object Tracking and Segmentation Challenge 2023:
The authors describe their entry in the VOTS 2023 results paper: \
Kristan, M., et al., "The First Visual Object Tracking Segmentation VOTS2023 Challenge Results," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 1796-1818. [Link](https://openaccess.thecvf.com/content/ICCV2023W/VOTS/html/Kristan_The_First_Visual_Object_Tracking_Segmentation_VOTS2023_Challenge_Results_ICCVW_2023_paper.htm)

Steps:
1. We fine-tuned the weights generated using the MAE [19] method on the tracking dataset.
2. We used the ViT Large model. First, both the template and search regions were patch embedded, then concatenated together for feature extraction and fusion through the transformer block structure. Finally, the fused features are output to the classification and regression heads to complete the generation of bounding boxes.
3. We apply a Hanning window on the output of the classification head to utilize the motion information of the object.
4. After that, we retrieve the output of the regression head at the position with the highest confidence and output the bounding box. We used the Segment Anything Model (SAM) [25] as the model for outputting masks. When the confidence value output by the tracker is very low, it is considered that the target is no longer in the image, and an empty mask is output.
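Steps 3 and 4 can be illustrated with a small sketch. It assumes the search region is centered on the previous target position, so a Hann (Hanning) window that peaks at the center favors candidates close to where the object was last seen. The helper names `hann1d` and `pick_peak` are hypothetical, and the exact window used by the tracker is not given in the paper excerpt; this version drops the zero-valued endpoints so that border cells keep a small non-zero weight.

```python
import math

def hann1d(n):
    """Hann window of length n with the zero endpoints dropped (illustrative choice)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / (n + 1))) for k in range(1, n + 1)]

def pick_peak(scores, window):
    """Weight an n x n score map by the separable 2-D window and return the best cell."""
    best_score, best_cell = -1.0, (0, 0)
    for r, row in enumerate(scores):
        for c, s in enumerate(row):
            weighted = s * window[r] * window[c]
            if weighted > best_score:
                best_score, best_cell = weighted, (r, c)
    return best_cell, best_score

# Toy 5 x 5 classification map: a strong distractor in the corner (0.9)
# and a slightly weaker response at the center (0.8).
scores = [[0.0] * 5 for _ in range(5)]
scores[0][0] = 0.9
scores[2][2] = 0.8

cell, score = pick_peak(scores, hann1d(5))
print(cell)  # (2, 2): the window suppresses the far-away distractor
```

In a real tracker, the regression head output would then be read at `cell` to obtain the bounding box, and a low `score` would be treated as "target lost".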


#### OpenCV Zoo issue comment
*@arielkantorovich* asked: \
Hi, I want to ask about the weights [object_tracking_vittrack_2023sep.onnx](https://github.com/opencv/opencv_zoo/blob/main/models/object_tracking_vittrack/object_tracking_vittrack_2023sep.onnx). Is the VitTracker an implementation of the paper "Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework"? \
Did you train on your own data, or did you take the weights from the OSTrack GitHub repository and convert them to ONNX? I am asking because I may want to try other weights. \
Thank you for your help.

*@lpylpy0514* replied: \
I used the same training dataset as OSTrack, but without using pre-trained weights. In terms of model implementation, I replaced the patch embedding for small models. For this, please refer to "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference". :)
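To get a feel for what replacing the patch embedding means for the token count, here is a rough arithmetic sketch. It rests on several assumptions that are not stated in the reply: a 128 x 128 template and a 256 x 256 search region, and a LeViT-style stem made of four 3 x 3 convolutions with stride 2 and padding 1. Under those assumptions the stem reduces the resolution by 16, the same factor as a standard 16 x 16 patch embedding.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_out(size, layers=4):
    """Spatial size after a stack of identical stride-2 convolutions."""
    for _ in range(layers):
        size = conv_out(size)
    return size

template, search = 128, 256          # assumed input sizes, not taken from the source
t_side, s_side = stem_out(template), stem_out(search)
tokens = t_side ** 2 + s_side ** 2   # template and search tokens are concatenated
print(t_side, s_side, tokens)        # 8 16 320
```

The transformer blocks would then operate on the concatenated sequence of 320 tokens, which is where the template and search features are fused.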

### 2. Understanding the code

In [1]:
# how to obtain the region of interest in a frame
import cv2
import numpy as np

original = cv2.imread('test.jpg')
img = original.copy()
img = cv2.resize(img, None, fx=0.5, fy=0.5)  # halve the size; fx/fy must be passed by keyword
roi = cv2.selectROI('ROI', img)
cv2.destroyWindow('ROI')
print("Selected region: ",roi)

# display the region selected
# roi is in (x,y,w,h) format
cv2.rectangle(img, (roi[0], roi[1]), (roi[0] + roi[2], roi[1] + roi[3]), (255,255,255), 1)
cv2.imshow('Selected Region', img)
cv2.waitKey(0)  # wait for any key, then always close the windows
cv2.destroyAllWindows()

Selected region:  (190, 249, 104, 100)
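Note that the ROI printed above is expressed in the coordinates of the resized image, because the selection was made after scaling by 0.5. If the box is later needed on the original image, it has to be scaled back. A minimal sketch, with `scale_roi` as a hypothetical helper name:

```python
def scale_roi(roi, scale):
    """Map an (x, y, w, h) box selected on a resized image back to the original image."""
    return tuple(int(round(v / scale)) for v in roi)

# ROI from the cell above, selected on the half-size image
print(scale_roi((190, 249, 104, 100), 0.5))  # (380, 498, 208, 200)
```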


In [24]:
import cv2

def select_roi(img_path):
    roi_data = {'roi_start': (0, 0), 'roi_end': (0, 0), 'selecting_roi': False, 'display_img': None}

    def mouse_callback(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:
            roi_data['roi_start'] = (x, y)
            roi_data['roi_end'] = (x, y)
            roi_data['selecting_roi'] = True

        elif event == cv2.EVENT_LBUTTONUP:
            roi_data['roi_end'] = (x, y)
            roi_data['selecting_roi'] = False
            cv2.rectangle(roi_data['display_img'], roi_data['roi_start'], roi_data['roi_end'], (0, 255, 0), 2)
            cv2.imshow("Select ROI", roi_data['display_img'])

        elif event == cv2.EVENT_MOUSEMOVE and roi_data['selecting_roi']:
            roi_data['roi_end'] = (x, y)
            roi_data['display_img'] = img.copy()
            cv2.rectangle(roi_data['display_img'], roi_data['roi_start'], roi_data['roi_end'], (0, 255, 0), 2)
            cv2.imshow("Select ROI", roi_data['display_img'])

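    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None instead of raising on a bad path
        raise FileNotFoundError(f"Could not read image: {img_path}")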
    roi_data['display_img'] = img.copy()

    cv2.namedWindow("Select ROI")
    cv2.setMouseCallback("Select ROI", mouse_callback)

    while True:
        cv2.imshow("Select ROI", roi_data['display_img'])
        key = cv2.waitKey(1) & 0xFF

        if key == ord('c') and not roi_data['selecting_roi']:
            # Normalize the corners so the crop works for drags in any direction
            x0, x1 = sorted((roi_data['roi_start'][0], roi_data['roi_end'][0]))
            y0, y1 = sorted((roi_data['roi_start'][1], roi_data['roi_end'][1]))
            if x1 == x0 or y1 == y0:
                continue  # nothing selected yet; imshow would fail on an empty crop
            roi = img[y0:y1, x0:x1]
            cv2.imshow("Cropped ROI", roi)
            cv2.waitKey(0)
            break

        elif key == 27:
            break
    cv2.destroyAllWindows()
    # Return the selection as (x, y, w, h)
    return (
        min(roi_data['roi_start'][0], roi_data['roi_end'][0]),
        min(roi_data['roi_start'][1], roi_data['roi_end'][1]),
        abs(roi_data['roi_start'][0] - roi_data['roi_end'][0]),
        abs(roi_data['roi_start'][1] - roi_data['roi_end'][1])
    )

# Example usage
roi = select_roi('test.jpg')
print(roi)

(1137, 211, 175, 184)
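The custom selector returns whatever the mouse reported, so the box may extend past the image border if the drag ends outside the window, and it is `(0, 0, 0, 0)` if Esc is pressed before anything is drawn. A small guard before handing the box to a tracker can catch both cases. This is only a sketch: `clamp_roi` is a hypothetical helper, and the 1280 x 720 image size is made up for illustration.

```python
def clamp_roi(roi, img_w, img_h):
    """Clip an (x, y, w, h) box to the image; return None if nothing is left."""
    x, y, w, h = roi
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(img_w, x + w), min(img_h, y + h)
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1 - x0, y1 - y0)

# Hypothetical 1280 x 720 image
print(clamp_roi((1137, 211, 175, 184), 1280, 720))  # (1137, 211, 143, 184)
print(clamp_roi((0, 0, 0, 0), 1280, 720))           # None (no selection was made)
```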


In [21]:
cv2.destroyAllWindows()
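Once an ROI is available from either selector, it can be passed to the tracker. The sketch below assumes an OpenCV build that includes `TrackerVit`, a local copy of the ONNX weights, and an ROI selected on the first frame of the video. The function name `track_video`, its parameters, and the `min_score` threshold of 0.3 are illustrative choices, not values taken from this notebook.

```python
def track_video(video_path, model_path, roi, min_score=0.3):
    """Initialise TrackerVit with an (x, y, w, h) ROI and follow it through a video."""
    import cv2  # imported here so the function can be defined without OpenCV installed

    params = cv2.TrackerVit_Params()
    params.net = model_path  # path to the ONNX weights
    tracker = cv2.TrackerVit_create(params)

    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError(f"Could not read the first frame of {video_path}")
    tracker.init(frame, roi)  # roi must refer to this first frame

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, box = tracker.update(frame)
        score = tracker.getTrackingScore()
        # Draw the box only when the tracker is reasonably confident
        if found and score >= min_score:
            x, y, w, h = box
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("ViT tracker", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc stops playback
            break

    cap.release()
    cv2.destroyAllWindows()
```

A call would look like `track_video('video.mp4', 'object_tracking_vittrack_2023sep.onnx', roi)`, where `video.mp4` is a placeholder file name.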