### Phase 1: Simplify the coding scheme

I'm preparing to train our first model, and I've been looking into relevant materials in the computer vision field.

For our first coding scheme, we started with a very detailed list of over 40 specific categories, or classes (e.g., broken windows, cracked walls, graffiti, trash). After going through all the pictures, I think one issue with training models on this coding scheme is that we do not have enough examples for a model to learn what every single category looks like. Some categories are rare in our pictures, so for those we can only provide 10 examples for the model to learn from.

Note: It's a tough job for a model to learn 40 separate visual detection tasks with the pictures we have, which average only 20-25 training images per category.
Note: If one class has 400 examples and another has only 30, the model may become heavily biased toward the larger class. Models learn from repeated examples, so rare child classes with too few examples mostly add noise and should be removed.

The Solution: 
I grouped everything into three big parent categories. This gives us more data per category, and the model only has to learn a few broad visual concepts, which is easier than learning the very fine-grained ones. For now, we're just teaching the model to detect: 1) structural blight; 2) environmental blight; 3) infrastructure blight. I propose we start with a model that focuses on these super-categories. If it works, we can then train a second model (only on images containing blight indicators) to classify the specific types (e.g., distinguish broken windows from broken doors).

Next step: 
Go through every picture we took, draw a bounding box around each instance we see, and tag it as structural, environmental, or infrastructure. Then count how many instances (boxes) we have drawn for each of the three categories. That count matters more than the number of pictures we feed the model, because some pictures will have nothing to code. We still keep those uncoded pictures in the training data, so the model learns that it does not need to mark anything when there are no blighted items.
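Once the boxes are drawn, the per-category tally can be computed from the exported annotations. The following is a minimal sketch, assuming the export is a COCO-style JSON with `images`, `annotations`, and `categories` keys; the small `coco` dictionary is made-up sample data, not our real counts, and `summarize` is a hypothetical helper name.

```python
from collections import Counter

def summarize(coco):
    """Count boxes per category and list images that have no boxes at all."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    boxes_per_class = Counter(
        id_to_name[a["category_id"]] for a in coco["annotations"]
    )
    annotated_ids = {a["image_id"] for a in coco["annotations"]}
    empty_images = [
        img["file_name"] for img in coco["images"] if img["id"] not in annotated_ids
    ]
    return boxes_per_class, empty_images

# Made-up sample data in COCO layout (bbox = [x, y, width, height]).
coco = {
    "images": [
        {"id": 1, "file_name": "house_01.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "house_02.jpg", "width": 640, "height": 480},
        {"id": 3, "file_name": "street_01.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "structural"},
        {"id": 2, "name": "environmental"},
        {"id": 3, "name": "infrastructure"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 80]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [150, 20, 100, 80]},
        {"id": 3, "image_id": 2, "category_id": 2, "bbox": [0, 300, 640, 180]},
    ],
}

boxes_per_class, empty_images = summarize(coco)
print(dict(boxes_per_class))  # boxes drawn per category
print(empty_images)           # pictures kept as "nothing to code" examples
```

Counting boxes rather than pictures is the point here: one picture can contribute several instances to a category, and a picture with no boxes contributes none.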

After we finish coding, we will train our first model on a web-based computer vision platform called Roboflow, which makes these tasks fast and intuitive.

Goal here: 
1) Group (most) classes into 3-5 high-level categories;
2) Our goal with the first model is to let it separate these broad categories well;

For now, try to group visually similar items together into three blight classes plus an "Other" class:
1) Structural Blight (broken building components, boarded/covered doors/windows/walls, broken/missing doors, broken/missing windows, faded/mismatched surfaces); these are elements that appear on the building structure
2) Environmental Blight (overgrown vegetation, abandoned items, debris on the ground, cracked sidewalk/road/driveway, unfit item in context); these are elements that appear outside the building
3) Infrastructure Blight
4) Other (everything else: we try to provide examples so the model learns what blight is not)
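The grouping above amounts to a lookup from child class to parent class. Here is a small sketch of that idea; the dictionary is deliberately partial (the notes do not list infrastructure children yet, so none are mapped), the label spellings are illustrative, and `PARENT_OF` and `to_parent` are hypothetical names.

```python
# Partial child-to-parent mapping; spellings are illustrative, not final.
PARENT_OF = {
    "broken building components": "structural",
    "boarded/covered doors/windows/walls": "structural",
    "broken/missing doors": "structural",
    "broken/missing windows": "structural",
    "faded/mismatched surface": "structural",
    "overgrown vegetation": "environmental",
    "abandoned items": "environmental",
    "debris on the ground": "environmental",
    "cracked sidewalk/road/driveway": "environmental",
    # Infrastructure children are not listed in the notes yet.
}

def to_parent(child_label):
    """Return the parent category for a child label, ignoring case and padding."""
    key = child_label.strip().lower()
    if key not in PARENT_OF:
        # Fail loudly so typos and unmapped classes are noticed, not silently merged.
        raise ValueError(f"no parent category defined for {child_label!r}")
    return PARENT_OF[key]

print(to_parent("Broken/Missing Windows"))
```

Raising an error for unknown labels is a design choice in this sketch: it keeps a misspelled tag from quietly ending up in the wrong bucket.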

To-do:
1) See if we can also combine some child classes that belong to a broader category;
2) Revisit all pictures, check how often each child class appears, and decide which to keep and which to drop; the parent classes remain, but within each parent we only keep child classes that have enough examples;
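The keep-or-drop decision in the to-do list can be made mechanical once the child labels are counted. A minimal sketch follows; the threshold of 30 is a hypothetical placeholder that we would still need to agree on, and the label list is made-up sample data.

```python
from collections import Counter

MIN_EXAMPLES = 30  # hypothetical cut-off, not a value from our data

def split_by_frequency(child_labels, min_examples=MIN_EXAMPLES):
    """Separate child classes into those with enough examples and those without."""
    counts = Counter(child_labels)
    keep = {name: n for name, n in counts.items() if n >= min_examples}
    drop = {name: n for name, n in counts.items() if n < min_examples}
    return keep, drop

# Made-up sample: one label per drawn box.
labels = (
    ["broken/missing windows"] * 45
    + ["overgrown vegetation"] * 32
    + ["graffiti"] * 10
)

keep, drop = split_by_frequency(labels)
print("keep:", keep)
print("drop:", drop)
```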

### Phase 2: Annotation in CVAT

Start with bounding boxes, not polygons:
1) Apply the SAME rule every time for the same type of object;
2) Be consistent in how you draw boxes: draw each box as tight as possible, leaving only a very small margin of a few pixels;
3) For doors and windows, always include the full frame; if there are five broken windows on a building, draw five separate bounding boxes instead of one giant box around all of them;
4) For faded/mismatched surfaces, only box the largest part of the discolored or affected area; if the spots are close together and clearly part of the same “problem area”, draw one bounding box around the whole affected region; if the spots are separate and visually distinct (e.g., faded paint patches on different sides of a wall or building), draw separate bounding boxes;
5) For debris and trash, only code large piles; leave single pieces uncoded;
6) An area can have two bounding boxes assigned to it, so you can code one area with two classes;
7) A single picture can have multiple bounding boxes, so the model learns that multiple instances can exist in one image;
8) No need to code the background or any elements that are not in the scheme;
9) If there is nothing to code in a picture, leave it as it is; when Roboflow sees an image with no annotations, it understands that the image is not "blighted";


When saving data, split it 70% for training and 30% for validation.
1) Do not OVERLAP: never use the same image in both training and validation. If the model has already seen a validation image during training, it can simply remember it, so the validation scores look perfect and hide overfitting;
2) For training data, half of the images should have blight annotations, and the rest should be clean, well-kept, non-blighted scenes (with no annotations at all);
3) Save data as COCO JSON or Pascal VOC XML; annotation files should contain the image dimensions and, for each bounding box, the coordinates and the class label.
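The 70/30 split with no overlap can be sketched in a few lines. This is only an illustration: the file names are made up, `split_images` is a hypothetical helper, and the fixed seed is there so the same split can be reproduced later.

```python
import random

def split_images(image_names, train_fraction=0.7, seed=42):
    """Shuffle image names and split them into disjoint train/validation lists."""
    names = sorted(set(image_names))  # de-duplicate so no image lands in both sets
    rng = random.Random(seed)         # fixed seed keeps the split reproducible
    rng.shuffle(names)
    cut = round(len(names) * train_fraction)
    return names[:cut], names[cut:]

# Made-up file names standing in for our pictures.
images = [f"img_{i:03d}.jpg" for i in range(100)]
train, val = split_images(images)

# Safety check for rule 1: the two sets must not share any image.
assert not set(train) & set(val)
print(len(train), len(val))
```

Splitting by file name before any augmentation is applied also helps keep augmented copies of a training image out of the validation set.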

### Phase 3: Use Roboflow, a no-code tool

Use their tools to: 
1) Preprocess: resize all images to a standard size
2) Augment: generate new training images by changing brightness, etc., to make the dataset more effective
3) Train a model (use a modern architecture offered there, such as YOLO)
4) Evaluate using charts and metrics

How Roboflow works:
1) When it takes an image, it reads the annotations to find each box's coordinates and class label;
2) It then uses the pixel data inside those bounding box coordinates to learn the visual patterns of each class;
3) The rest of the image provides context, but the learning signal comes from the labeled boxes;

### Phase 4: If we want to write code

Environment: use Google Colab, which provides free GPUs that are essential for training
Framework: learn PyTorch with the torchvision library; it has pre-built models
Model choice: do NOT build a model from scratch; instead, use transfer learning
Take a model, a CNN, that has been trained on millions of general images (ImageNet) and fine-tune it for our task. A good starter model is "Faster R-CNN" or "SSD" from torchvision.models.detection.

Find a tutorial: search for "Fine tuning a torchvision model for object detection on custom dataset google colab"
Learn:
1) load data from VOC/COCO format
2) load a pre-trained model
3) replace the last layer to predict my own classes
4) set up a simple training loop
5) evaluate the results
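Steps 1 to 3 of the list above can be sketched as follows. This is a rough outline under some assumptions, not a finished pipeline: it assumes COCO-format boxes (`[x, y, width, height]`), a recent torchvision release that accepts the `weights=` argument, and our three parent classes; `coco_to_xyxy` and `build_detector` are hypothetical helper names. The training loop and evaluation (steps 4 and 5) are left to the tutorial.

```python
def coco_to_xyxy(bbox):
    """Convert a COCO box [x, y, width, height] to [x1, y1, x2, y2].

    torchvision detection models expect the corner format.
    """
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

def build_detector(class_names, pretrained=True):
    """Faster R-CNN with its box-classification layer replaced for our classes."""
    # Imported here so the helper above can be used without PyTorch installed.
    from torchvision.models.detection import fasterrcnn_resnet50_fpn
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    if pretrained:
        # Downloads pre-trained weights on first use.
        model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    else:
        model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None)

    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # +1 because label 0 is reserved for background in torchvision detectors.
    model.roi_heads.box_predictor = FastRCNNPredictor(
        in_features, len(class_names) + 1
    )
    return model

CLASSES = ["structural", "environmental", "infrastructure"]

print(coco_to_xyxy([10, 20, 100, 80]))
# model = build_detector(CLASSES)  # uncomment in Colab, where torchvision is available
```

Only the final prediction layer is swapped here; the rest of the network keeps what it learned from general images, which is the transfer-learning idea described above.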
   