### Object classes

The Stanford Drone Dataset has the following object classes: pedestrians, bikers, skateboarders, cars, buses, and golf carts.
For the drone navigation task it is useful to reduce the number of classes using the following rules:
pedestrians -> pedestrians; bikers and skateboarders -> bikers; buses, cars and golf carts -> cars. Each aggregated class groups objects with similar physical dimensions and motion behavior, which is what matters for drone navigation planning.

So now we have 3 classes: pedestrians, bikers and cars.
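
The aggregation rules above can be written as a small lookup table. This is only a sketch: the label spellings (`Pedestrian`, `Biker`, `Skater`, `Car`, `Bus`, `Cart`) are an assumption about how the annotation files name the classes, and `merge_class` is a hypothetical helper name.

```python
# Mapping from original dataset labels to the 3 aggregated classes.
# NOTE: the label spellings on the left are assumed; adjust them to
# whatever the annotation files actually contain.
CLASS_MAP = {
    "pedestrian": "pedestrian",
    "biker": "biker",
    "skater": "biker",
    "car": "car",
    "bus": "car",
    "cart": "car",
}


def merge_class(label: str) -> str:
    """Return the aggregated class for an original label (hypothetical helper)."""
    # Labels may come quoted and in mixed case, so normalize first.
    key = label.strip().strip('"').lower()
    return CLASS_MAP[key]
```

An unknown label raises `KeyError`, which makes unexpected classes visible instead of silently dropping them.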

### Train/val splits

Some data must be held out for validation.
I suggest splitting the dataset into train/val as follows:
* In every scene, move one video to the val split;
* Also move one whole scene to val,
because the dataset has a small number of scenes that can differ a lot
from each other, so we need to check the model on completely unseen data.

For each val video, annotations were produced only for the last 20 frames, using Supervisely.
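
The split rule from this section could look roughly like the sketch below. Picking the per-scene val video at random, the `seed` parameter, and the function name `split_train_val` are assumptions for illustration; the text does not say how the val videos or the held-out scene were chosen.

```python
import random


def split_train_val(videos_by_scene, held_out_scene, seed=0):
    """Split (scene, video) pairs into train and val lists.

    videos_by_scene: dict mapping scene name -> list of video names.
    held_out_scene:  scene that goes to val as a whole.
    The per-scene val video is chosen at random (an assumption).
    """
    rng = random.Random(seed)
    train, val = [], []
    for scene in sorted(videos_by_scene):
        videos = sorted(videos_by_scene[scene])
        if scene == held_out_scene:
            # The whole scene is unseen during training.
            val.extend((scene, v) for v in videos)
            continue
        val_video = rng.choice(videos)
        for v in videos:
            (val if v == val_video else train).append((scene, v))
    return train, val
```

Note that a scene with a single video contributes nothing to train under this rule.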

### Stanford dataset analysis
The analysis is provided in stanford_dataset_overview.ipynb.

### Segmentation approaches

Because the Stanford Drone Dataset has no segmentation annotations, we need either to create segmentation annotations in an unsupervised way, for training or inference purposes,
or to apply a model pretrained on a similar dataset to our data.

Overall, I think the following 3 segmentation approaches are worth considering:

1. <b>Unsupervised segmentation based on background image. </b>
Given a background image for each clip, it is easy to get a segmentation by subtracting the background from each frame; classes can then be assigned using the classes of the bounding boxes.
If the camera position were static, it would be possible to get background images by aggregating the areas that fall outside all bounding boxes.
But in most videos the camera shakes noticeably due to wind, so a clean background image cannot be obtained by simple aggregation of video frames.
In that case, warping several frames with the help of optical flow could be useful for background estimation.


2. <b>Unsupervised segmentation based on CNNs and classical methods such as superpixels, GrabCut and others. </b>
There are unsupervised CNN approaches (https://paperswithcode.com/task/unsupervised-semantic-segmentation),
semi-supervised ones (https://paperswithcode.com/task/semi-supervised-semantic-segmentation) and weakly supervised ones (https://www.youtube.com/watch?v=jM1T1HwbY5s&ab_channel=RodrigoBenenson).
These methods need deeper investigation and time for experiments. Also, because the bounding box labels are loose rather than tight, they may not perform well.
E.g. this method (https://josephkj.in/projects/MASON/) uses GrabCut and CNN features for weakly supervised segmentation, and the result is not very clean (https://www.youtube.com/watch?v=GG_Pr8hdZhY&t=26s&ab_channel=JosephKJ).

3. <b>Pretraining on another dataset.</b>
An overview of similar datasets is provided in similar_datasets_overview.ipynb.
The most suitable datasets are marked with the [Good] tag.
Although there are a number of similar datasets, all of them are rather small and none of them has a 'biker' class.

The 1st and 2nd approaches could be used either for annotating training data or for direct inference.
These two approaches share one important drawback: they suffer badly from strong shadows, which can end up in the segmented area.
Also, if the retrieved segmentations are used as training data, a segmentation model still has to be implemented, as for approach #3.
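
To make approach 1 concrete, here is a minimal sketch of the static-camera case: the background is aggregated from pixels outside all bounding boxes, and each frame is then segmented by subtraction inside its boxes. The per-pixel median, the threshold value, grayscale input, the `(x0, y0, x1, y1, class_id)` box format and all function names are assumptions, not taken from the notebooks. It ignores camera shake, which is exactly the limitation described above.

```python
import numpy as np


def boxes_mask(shape, boxes):
    """Boolean mask that is True inside any (x0, y0, x1, y1, class_id) box."""
    mask = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1, _cls in boxes:
        mask[y0:y1, x0:x1] = True
    return mask


def estimate_background(frames, boxes_per_frame):
    """Per-pixel median over frames, ignoring pixels covered by boxes.

    frames: list of (H, W) grayscale arrays from a static camera (assumed).
    Pixels covered by a box in every frame come out as NaN.
    """
    stack = np.stack(frames).astype(np.float32)  # (T, H, W)
    for t, boxes in enumerate(boxes_per_frame):
        stack[t][boxes_mask(stack.shape[1:], boxes)] = np.nan
    return np.nanmedian(stack, axis=0)


def segment(frame, background, boxes, thresh=25.0):
    """Label pixels inside each box that differ from the background.

    Returns an (H, W) uint8 map: 0 for background, class_id for objects.
    """
    diff = np.abs(frame.astype(np.float32) - background)
    seg = np.zeros(frame.shape, dtype=np.uint8)
    for x0, y0, x1, y1, cls in boxes:
        changed = diff[y0:y1, x0:x1] > thresh
        seg[y0:y1, x0:x1][changed] = cls
    return seg
```

Because the difference is thresholded only inside the boxes, a shadow that falls inside a box is labeled together with the object, which matches the drawback noted above.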

### Approaches implementation

1.  <b>Unsupervised segmentation based on background image. </b>
 * Vanilla background-based segmentation.
Because the drone position is unstable, we create an individual background image for every frame. For every detection (GT bbox) we find the closest frame in which no bounding box overlaps ours, and then simply replace the bbox area in our frame with the same area from the found frame.
The implementation is in vanilla_background_based_segmentation.ipynb and in vanilla_background.py.
Segmentation results are stored at https://drive.google.com/drive/folders/1lFO-v0sCCegPW9Je4qo2OTmJSmvMiWFr?usp=sharing (segmentation based on GT bboxes; classes are not included).
Pros: unsupervised, rather fast, low resource consumption. Cons: only objects that were detected (by the detector) and that moved at some point can be segmented; all shadows inside the bounding box get into the segmentation; shadows from other frames that were not covered by a bbox can also corrupt the current segmentation; segmentation is sometimes rather poor due to low contrast.
 * Approach based on optical flow. For every frame we estimate several optical flows from this frame to a set of other frames at different offsets ([-8,-4,-2,-1,1,2,3,4,8]).
 Then we aggregate the magnitudes of these optical flows into one segmentation mask. Only the crops under the bboxes are used, because elsewhere there can be motion of other objects such as trees, or apparent motion caused by drone instability.
 The implementation is in opt_flow_background.py. Segmentation results are stored
 at https://drive.google.com/drive/folders/1qojNzP26oMT-MSoyEhOCeXC9Yd47V0rc?usp=sharing (segmentation based on GT bboxes; classes are not included).
 Pros: unsupervised, works better than the background-based approach on images with low contrast. Cons: rather slow, because optical flow has to be computed several times for each frame;
 only objects in motion can be segmented; all shadows inside the bounding box get into the segmentation.
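
The optical-flow approach can be sketched as follows. This is not the code from opt_flow_background.py: using OpenCV's Farneback flow, averaging the magnitudes, the threshold value, the `(x0, y0, x1, y1)` box format and the function names are all assumptions. Only the frame offsets come from the description above.

```python
import numpy as np

# Frame offsets listed in the description above.
OFFSETS = (-8, -4, -2, -1, 1, 2, 3, 4, 8)


def farneback_flow(src, dst):
    """Dense flow between two grayscale uint8 frames, shape (H, W, 2).

    Farneback is an assumed choice; the flow method is not specified.
    """
    import cv2  # imported lazily so the aggregation works without OpenCV

    return cv2.calcOpticalFlowFarneback(src, dst, None, 0.5, 3, 15, 3, 5, 1.2, 0)


def flow_mask(frames, t, boxes, flow_fn=farneback_flow,
              offsets=OFFSETS, thresh=1.0):
    """Segment moving pixels inside the boxes of frame t.

    Flow magnitudes towards frames t + d are averaged (assumed aggregation),
    thresholded, and kept only inside the (x0, y0, x1, y1) boxes.
    """
    h, w = frames[t].shape
    total = np.zeros((h, w), dtype=np.float32)
    used = 0
    for d in offsets:
        if 0 <= t + d < len(frames):  # skip offsets outside the clip
            flow = flow_fn(frames[t], frames[t + d])
            total += np.linalg.norm(flow, axis=2)
            used += 1
    mean_mag = total / max(used, 1)

    mask = np.zeros((h, w), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        # Motion outside the boxes (trees, camera shake) is discarded.
        mask[y0:y1, x0:x1] |= mean_mag[y0:y1, x0:x1] > thresh
    return mask
```

Passing `flow_fn` as a parameter keeps the aggregation independent of the flow estimator, so a different method can be swapped in without touching the masking logic.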
