# Automatic Vision Object Tracking
[Link to Tutorial](https://medium.com/mjrobot-org/automatic-vision-object-tracking-2dc6b4acaff5)

In [2]:
import cv2
from ultralytics import YOLO
import os

Specs: YOLOv8 Nano, image size = 640, epochs = 5, batch size = 3, cpu-only

### Arguments Explained
- imgsz = 640
    - Defines the size of the input images (in pixels) that the model will process during training.

    - A value of 640 means that images will be resized to 640x640 pixels. Larger image sizes typically lead to better accuracy but increase computational cost.

- epochs = 5
    - Specifies how many epochs the training runs for.

    - An epoch means the model goes through the entire training dataset once.

- batch = 3

    - Sets the batch size, meaning the model will process 3 images at a time during training.

    - Batch size impacts memory usage (GPU/CPU) and training speed:
        - Smaller batch size: Lower memory requirement but slower convergence.
        - Larger batch size: Faster convergence but higher memory requirement.

- device = 'cpu'
    - A GPU could be used instead, but my local setup does not have CUDA enabled (I have an AMD graphics card).
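
To make the effect of the batch size concrete, here is a minimal sketch that counts how many batches are processed per epoch. The dataset size of 1000 images and the helper name `batches_per_epoch` are assumptions for illustration only; they do not come from the actual dataset.

```python
import math


def batches_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of batches the model processes in one epoch (the last one may be smaller)."""
    if num_images <= 0 or batch_size <= 0:
        raise ValueError("num_images and batch_size must be positive")
    return math.ceil(num_images / batch_size)


# Hypothetical dataset size, chosen only for illustration.
num_images = 1000
for batch in (3, 8, 16):
    print(f"batch={batch}: {batches_per_epoch(num_images, batch)} batches per epoch")
```

A smaller batch means more, but cheaper, steps per epoch; a larger batch means fewer steps that each need more memory.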

```
# Load the pre-trained YOLOv8 Nano weights
model = YOLO('yolov8n.pt')

# Train on the Rock Paper Scissors dataset described by data.yaml.
# `name` sets the output folder under runs/detect/.
results = model.train(
    data='C:/Users/fycce/Documents/GitHub/Online_tutorials/Machine Learning/OpenCV Applications/Object Tracking/Rock Paper Scissors SXSW.v14i.yolov8/data.yaml',
    imgsz=640,
    epochs=5,
    batch=3,
    name='yolov8n_v8_50e',
    device='cpu'
)
```

## Training the Model
Training the neural network was very slow because I could only use my PC's CPU. I therefore had to limit the training settings, since a single epoch already took 21-24 minutes to complete.


### What Happens During Training?

- The dataset is loaded and split into batches (size = 3).
- Each image is resized to 640 x 640 pixels.
- The model fine-tunes its weights over 5 epochs:
    - Updates weights to minimize loss using gradient descent.
    - Computes validation metrics after each epoch to monitor performance.
- At the end of training, the model saves the best weights based on validation metrics.
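
The weight-update step in the list above can be illustrated with a toy example. This is only a sketch of plain gradient descent on a one-parameter model, assuming a mean squared error loss; YOLOv8 trains far more parameters on mini-batches with its own optimizer and loss functions, so the function `train_linear` and its data are purely hypothetical.

```python
def train_linear(xs, ys, lr=0.01, epochs=5):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    history = []
    for _ in range(epochs):
        # Gradient of mean((w * x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Step against the gradient to reduce the loss.
        w -= lr * grad
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(loss)
    return w, history


# Toy data that follows y = 2 * x exactly.
w, history = train_linear([1, 2, 3, 4], [2, 4, 6, 8])
print(f"w after 5 epochs: {w:.3f}")
print("loss per epoch:", [round(loss, 3) for loss in history])
```

The loss shrinks with every epoch, but after only 5 epochs the weight is still well short of its ideal value of 2, which mirrors why so few epochs limit the quality of the trained model.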

Below is the actual output of the script after the neural network was done training.
In hindsight, training without any image resizing and with a slightly larger batch size, while keeping all other settings the same, would probably have shortened the epochs.

## Testing the Model on Webcam

In [4]:
os.chdir(r'C:\Users\fycce\Documents\GitHub\Online_tutorials\Machine Learning\OpenCV Applications\Object Tracking with YOLOv8\runs\detect\yolov8n_v8_50e\weights')
model = YOLO('best.pt')

# laptop/pc main camera
capt = cv2.VideoCapture(0)

while True:
    ret, frame = capt.read()
    # stop if the camera did not deliver a frame
    if not ret:
        break

    result = model.predict(frame)

    # draw every detection found in the current frame
    boxes = result[0].boxes
    for box in boxes:
        # corner coordinates (x1, y1, x2, y2) of this box
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())

        confidence = box.conf[0].item()

        # predicted class of this box
        class_id = int(box.cls[0].item())
        class_name = model.names[class_id]

        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
        label = f"{class_name} {confidence:.2f}"

        cv2.putText(frame, label,
                    (x1 - 10, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.7,
                    (0, 255, 255), 2)

    cv2.imshow("YOLOv8 Webcam Predictions", frame)

    # wait up to 1 s for a key press; esc key exits
    k = cv2.waitKey(1000)
    if k == 27:
        break

capt.release()
cv2.destroyAllWindows()


0: 480x640 1 Rock, 46.0ms
Speed: 2.1ms preprocess, 46.0ms inference, 1.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 Rock, 52.6ms
Speed: 0.5ms preprocess, 52.6ms inference, 1.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 Rock, 43.9ms
Speed: 1.0ms preprocess, 43.9ms inference, 1.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 2 Rocks, 43.1ms
Speed: 1.0ms preprocess, 43.1ms inference, 1.0ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 Paper, 39.8ms
Speed: 1.0ms preprocess, 39.8ms inference, 1.3ms postprocess per image at shape (1, 3, 480, 640)

0: 480x640 1 Paper, 42.4ms
Speed: 1.5ms preprocess, 42.4ms inference, 1.0ms postprocess per image at shape (1, 3, 480, 640)
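
The timings in the log above can be turned into a rough per-frame cost. The sketch below assumes the `Speed:` line format shown in this output and only sums the three reported stages; it ignores the time spent reading the camera, drawing, and waiting in `cv2.waitKey`. The helper name `frame_time_ms` is hypothetical.

```python
import re

# Matches the three stage timings of a 'Speed:' line as printed above.
SPEED_RE = re.compile(
    r"Speed: ([\d.]+)ms preprocess, ([\d.]+)ms inference, ([\d.]+)ms postprocess"
)


def frame_time_ms(log_line: str) -> float:
    """Sum the preprocess, inference and postprocess times of one 'Speed:' line."""
    match = SPEED_RE.search(log_line)
    if match is None:
        raise ValueError("not a 'Speed:' log line")
    return sum(float(value) for value in match.groups())


line = ("Speed: 2.1ms preprocess, 46.0ms inference, 1.0ms postprocess "
        "per image at shape (1, 3, 480, 640)")
total = frame_time_ms(line)
print(f"{total:.1f} ms per frame, about {1000 / total:.1f} FPS for the model alone")
```

For the first logged frame this gives about 49.1 ms, so the model itself could handle roughly 20 frames per second on this CPU.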


The model seems to predict accurately enough, given that a modest image size was used and the model was trained for only 5 epochs.
Rock had the highest prediction accuracy for me (probably because it looks starkly different from paper and scissors).

More conclusions will be made soon based on the graphs provided by YOLOv8.

## Outputs After Training
1. Trained Model Weights:
    - The fine-tuned weights are saved in the runs/detect/... directory.
    - Its weights folder contains **best.pt** (model with the best validation performance) and **last.pt** (final model after all epochs).

2. Training Logs:
    - Training metrics such as loss, precision, recall, mAP (mean Average Precision), etc., will be logged.
    - A results.csv file may also be generated with detailed metrics.

3. Visualized Metrics:
    - Training and validation curves (e.g., loss, mAP) will be saved as .png files in the same directory runs/detect/...

4. Evaluation Results:
    - After training, the model's performance is evaluated on the validation dataset. Metrics include:
        - mAP50: Mean Average Precision at IoU=0.5.
        - mAP50:95: mAP averaged across IoUs from 0.5 to 0.95.
        - Precision and Recall.
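
The IoU (Intersection over Union) behind mAP50 and mAP50:95 measures how well a predicted box overlaps a ground-truth box. The following is a minimal sketch for boxes in `(x1, y1, x2, y2)` format; the function name `iou_xyxy` and the example boxes are assumptions for illustration, not part of the YOLOv8 evaluation code.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes.
truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
score = iou_xyxy(truth, prediction)
print(f"IoU = {score:.3f}, counts as a match at IoU=0.5: {score >= 0.5}")
```

At the mAP50 threshold, a prediction of the right class only counts as correct when its IoU with a ground-truth box is at least 0.5; mAP50:95 repeats this check at stricter thresholds and averages the results.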

Next, I use version 14 of the model hosted on the Roboflow website through its API:
[Rock-Paper-Scissors Annotated Dataset](https://universe.roboflow.com/roboflow-58fyf/rock-paper-scissors-sxsw/dataset/14)

In [None]:
from roboflow import Roboflow


capt = cv2.VideoCapture(0)
rf = Roboflow(api_key="insert-your-own-key")
project = rf.workspace().project("rock-paper-scissors-sxsw")
model = project.version(14).model

while True:
    ret, frame = capt.read()
    # stop if the camera did not deliver a frame
    if not ret:
        break

    result = model.predict(frame).json()
    predictions = result["predictions"]

    # draw every prediction, not only the last one
    for prediction in predictions:
        x = prediction["x"]  # center x coordinate
        y = prediction["y"]  # center y coordinate
        width = prediction["width"]
        height = prediction["height"]
        confidence = prediction["confidence"]
        clas = prediction["class"]

        # bounding box (red) and a small marker at the box center (green)
        cv2.rectangle(frame, (int(x - width/2), int(y - height/2)), (int(x + width/2), int(y + height/2)), (0, 0, 255), 2)
        cv2.rectangle(frame, (int(x - 10), int(y - 10)), (int(x + 10), int(y + 10)), (0, 255, 0), 2)

        label = f"{clas} {confidence:.2f}"

        cv2.putText(frame,
                    label,
                    (int(x), int(y - height/2 - 10)),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.7,
                    (0, 255, 0), 2)

    cv2.imshow("YOLOv8 Webcam Predictions", frame)

    # press esc key to exit
    k = cv2.waitKey(10)
    if k == 27:
        break

capt.release()
cv2.destroyAllWindows()

loading Roboflow workspace...
loading Roboflow project...
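
The Roboflow predictions describe each box by its center and size, while `cv2.rectangle` expects two corner points. The conversion used in the loop above can be isolated as a small helper. This is a sketch: the function name `center_to_corners` and the sample prediction are hypothetical, and only the keys `x`, `y`, `width` and `height` are taken from the code above.

```python
def center_to_corners(prediction: dict) -> tuple:
    """Convert a center-format box into integer (x1, y1, x2, y2) corners."""
    x, y = prediction["x"], prediction["y"]
    half_w = prediction["width"] / 2
    half_h = prediction["height"] / 2
    return (int(x - half_w), int(y - half_h), int(x + half_w), int(y + half_h))


# Hypothetical prediction, shaped like the dictionaries used in the loop above.
sample = {"x": 100, "y": 50, "width": 40, "height": 20,
          "confidence": 0.9, "class": "Rock"}
x1, y1, x2, y2 = center_to_corners(sample)
print((x1, y1), (x2, y2))
```

The two corner tuples can be passed directly to `cv2.rectangle`, and the result is in the same corner format that the YOLOv8 `box.xyxy` values use.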


## Conclusions
The Roboflow pre-trained model was far more accurate than mine, which is understandable given the very limited training settings I used. Rock is much easier to distinguish than the other classes and is often predicted with a high level of accuracy. The images below give examples of the predictions made by my model and by the pre-trained model.

### My Model

![MyModel:Rock](./image_results/rock1.png)
![MyModel:Scissors](./image_results/scissors1.png)
![MyModel:Paper](./image_results/paper1.png)

### Roboflow Pre-Trained Model

![Roboflow:Rock](./image_results/rock2.png)
![Roboflow:Scissors](./image_results/scissors2.png)
![Roboflow:Paper](./image_results/paper2.png)

### Trolling the AI (my model)

![Dud:Rock](./image_results/dud_rock.png)
![Dud:Scissors](./image_results/dud_scissors.png)
![Dud:Paper](./image_results/dud_paper.png)