In Sanskrit, ikṣ (ईक्ष्) is a verbal root meaning "to see" or "to look". This repo is for experimenting with and exploring AI models in computer vision for the purpose of dry waste sorting.

-
Computer vision has three main categories: classification, detection, and segmentation. Classification is about: given an image (with a single object), map it to a single class. Detection is about: given an image with several objects, detect a bounding box around each object and identify its class. Segmentation is about: given an image with several objects, detect the exact border around each object and identify its class.
-
Types of models available: CNN-based (YOLO etc.), Transformer-based (e.g. DETR, ViT).
-
Started with a basic program to train and detect. Chose YOLO-11 because it's the latest and perhaps the easiest model for real-time tracking.
-
Next, figured we can use Roboflow to upload images and label them using a GUI, i.e. annotate the images. Once annotation is done, we can download the training set in YOLO format.
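A YOLO-format download is organized around a small dataset YAML that points the trainer at the images and class names. A minimal sketch of what such a file looks like (the paths and class name here are placeholders, not this project's actual values):

```yaml
# dataset.yaml — hypothetical layout; the Roboflow export generates one like this
path: datasets/dry-waste     # dataset root (placeholder)
train: train/images          # training images, with labels/ alongside
val: valid/images            # validation images
names:
  0: pet-bottle              # example class; actual names come from the annotations
```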
-
Detecting an object in dry waste is an extremely hard problem; consider objects being twisted or crumpled into different shapes. Therefore we need a lot of images, like a LOT. So I figured we could instead take videos and extract lots of images (frames) from those.
-
Did exactly as above: took videos (primarily of plastic bottles in dry waste), extracted all the images (using VLC player), uploaded them to Roboflow, annotated them, downloaded the YOLO-format training set, and trained the model.
-
Everything works well, but takes a TON of manual effort. For example, I extracted 1000 images; each annotation takes ~20 seconds, so overall it took 20,000 seconds (~6 hours) to annotate the images for one category (plastic PET bottle). The training took an additional ~6 hours (Mac M3 Pro) for 1000 images. Therefore it's not scalable.
-
Next, figured we can annotate in the video itself. Since things don't change much from one frame to the next in a video, SORT-style tracking techniques can keep track of objects and directly generate YOLO training data. I am going to try this next.
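Turning tracked boxes into YOLO training data boils down to a small conversion step; a sketch, assuming the tracker's output can be flattened to (frame index, class id, pixel box) tuples (that input format is my assumption for illustration, not a specific tool's export):

```python
from collections import defaultdict
from pathlib import Path


def tracks_to_yolo(tracks, img_w, img_h, out_dir):
    """Convert per-frame track boxes to YOLO label files.

    tracks: iterable of (frame_idx, class_id, x, y, w, h), where the box is
    in pixels and (x, y) is the top-left corner. Writes one .txt per frame
    containing normalized center-format lines: "class cx cy w h".
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    per_frame = defaultdict(list)
    for frame_idx, cls, x, y, w, h in tracks:
        cx = (x + w / 2) / img_w        # normalized box center
        cy = (y + h / 2) / img_h
        per_frame[frame_idx].append(
            f"{cls} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"
        )
    for frame_idx, lines in per_frame.items():
        Path(out_dir, f"frame_{frame_idx:06d}.txt").write_text("\n".join(lines) + "\n")
    return dict(per_frame)
```

The label filenames must match the extracted frame image filenames for the trainer to pair them up.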
-
Yes, everything worked fine. Used https://app.cvat.ai/ to annotate the video using the tracker, then downloaded the dataset in YOLO format (had to buy a subscription for $33/month). Uploaded the same data to Roboflow to keep all datasets in a centralized place.
-
The latest run of YOLO model training took ~12 hrs (2x the time) with ~2k images. There are two clear goals: 1. make annotation faster, and 2. make model training faster. Training time is clearly a function of resources (i.e. GPUs). For faster annotation: CVAT auto-annotation with our own trained models, and only make manual corrections where the inference differs.
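The auto-annotate-then-correct idea reduces to keeping confident predictions as pre-labels and routing everything else to a human. A sketch, assuming detections arrive as (class id, confidence, normalized center box) tuples and with an arbitrarily chosen confidence threshold (both are my assumptions, not settings from CVAT):

```python
def split_for_review(detections_per_image, conf_threshold=0.8):
    """Split model predictions into accepted pre-labels and images needing review.

    detections_per_image: dict mapping image name -> list of
    (class_id, confidence, cx, cy, w, h) with normalized coordinates.
    Returns (prelabels, review): prelabels maps image -> YOLO label lines
    for confident detections; review lists images with any uncertain box.
    """
    prelabels, review = {}, []
    for image, dets in detections_per_image.items():
        lines = [
            f"{c} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
            for c, conf, cx, cy, w, h in dets
            if conf >= conf_threshold
        ]
        if len(lines) < len(dets):      # some boxes fell below the threshold
            review.append(image)
        prelabels[image] = lines
    return prelabels, review
```

Only the images in the review list would need manual correction, which is where the time saving over fully manual annotation comes from.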
Roboflow project link - https://app.roboflow.com/plastics-sxfqi/prototype-uty5w/5