In [79]:
# Importing libraries
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

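try:
    import cv2
except ImportError:
    install("opencv-python")
    import cv2  # retry the import now that the package is installed

try:
    import mediapipe as mp
except ImportError:
    install("mediapipe")
    import mediapipe as mp  # retry the import now that the package is installed

# time is part of the Python standard library, so it never needs installing
import time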

from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Introduction
This project builds a real-time object recognition system that detects and classifies common objects in a live camera feed. It is developed using Python, MediaPipe, and OpenCV. The main workflow captures video from a camera, processes each frame to detect objects, and overlays the detection results (bounding boxes and labels) onto the live video display.



### About the project

The primary objective of this project is to develop a computer vision application capable of identifying a small, predefined set of common objects (e.g., chair, phone) within real-time video streams. This system uses pre-trained models of MediaPipe for object detection and classification, complemented by OpenCV for camera access, video processing and display.


### Objectives

1. **Object Detection with MediaPipe:** The system uses the MediaPipe library to detect and recognize objects (e.g., phone, chair) in real time from a camera feed. An OpenCV-based interface captures video, processes frames, and displays the detected objects with on-screen annotations.

2. **Object Classification:** A pre-trained MediaPipe model classifies the detected objects, and each recognized object is shown with its label on the screen.

In [80]:
#Importing libraries
import cv2
import mediapipe as mp
import time

from mediapipe.tasks import python
from mediapipe.tasks.python import vision

Next, we define the path to the pre-trained MediaPipe object detection model (`efficientdet_lite0.tflite`) and initialize a global variable, `detection_results`. This variable stores the latest object detection results delivered to the MediaPipe asynchronous callback. It is initially `None` and is updated by the `results_callback` function.

In [81]:
model_path = 'efficientdet_lite0.tflite'

In [82]:
detection_results = None

**MediaPipe Callback Function (`results_callback`)**

This function serves as the callback for MediaPipe's asynchronous object detector. Whenever MediaPipe finishes processing a frame and generates detection results, this function is called. It updates the global `detection_results` variable with the most recent detection outcomes, making them available for the main loop to retrieve and display. A result callback is required in the `LIVE_STREAM` running mode.



In [83]:
def results_callback(result: vision.ObjectDetectorResult, output_image: mp.Image, timestamp_ms: int):
    global detection_results
    detection_results = result
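Because the callback runs on a different thread from the main loop, one option is to wrap the shared state in a small lock-protected holder instead of a bare global. The sketch below is an illustration only: `LatestResult` and its methods are hypothetical names, not part of MediaPipe. It also records the frame timestamp so the reader can tell how old a result is, and it drops results that arrive out of order.

```python
import threading
from typing import Any, Optional, Tuple


class LatestResult:
    """Hypothetical holder for the most recent detection result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._result: Optional[Any] = None
        self._timestamp_ms: int = -1

    def update(self, result: Any, output_image: Any, timestamp_ms: int) -> None:
        # Same three-argument shape as results_callback, so a bound method
        # could be passed as result_callback (an assumption, not tested here).
        with self._lock:
            # Ignore results that are older than the one already stored.
            if timestamp_ms >= self._timestamp_ms:
                self._result = result
                self._timestamp_ms = timestamp_ms

    def get(self) -> Tuple[Optional[Any], int]:
        # Return the stored result together with its timestamp.
        with self._lock:
            return self._result, self._timestamp_ms


latest = LatestResult()
latest.update("frame-1 detections", None, 100)
latest.update("stale detections", None, 50)  # older timestamp, ignored
print(latest.get())
```

The main loop would then call `latest.get()` instead of reading the global directly.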

**Initialization of Object Detector (`create_live_detector`)**

This function is responsible for initializing and configuring the MediaPipe Object Detector. It sets up the detector with specific options tailored for real-time, live stream processing, including the model path, running mode, confidence thresholds, and the callback function for results.

In [84]:
def create_live_detector():
    base_options = python.BaseOptions(model_asset_path=model_path)
    options = vision.ObjectDetectorOptions(
        base_options=base_options,
        running_mode=vision.RunningMode.LIVE_STREAM,
        score_threshold=0.5,
        max_results=3,
        result_callback=results_callback
    )
    detector = vision.ObjectDetector.create_from_options(options)
    print("MediaPipe Object Detector initialized successfully.")
    return detector
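`create_live_detector()` relies on `efficientdet_lite0.tflite` being reachable at the relative path stored in `model_path`. A minimal sketch of a pre-flight check is shown below. `require_model_file` is a hypothetical helper, not a MediaPipe function; it only verifies that the file exists and is non-empty before the path is handed to `BaseOptions`.

```python
from pathlib import Path


def require_model_file(path: str) -> Path:
    """Return the resolved model path, or raise a descriptive error."""
    model_file = Path(path).expanduser().resolve()
    if not model_file.is_file():
        raise FileNotFoundError(
            f"Model file not found: {model_file}. "
            "Place the .tflite file there or update model_path."
        )
    if model_file.stat().st_size == 0:
        raise ValueError(f"Model file is empty: {model_file}")
    return model_file
```

It could be used as `model_path = str(require_model_file('efficientdet_lite0.tflite'))` before the detector is created, so that a missing file produces a clear message.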

The main function initializes the webcam, establishes a continuous loop to capture video frames, sends these frames to the MediaPipe detector for asynchronous processing, and then displays the detected objects with their labels and the current Frames Per Second (FPS) on a live video feed.

1. **Webcam Initialization:** `cv2.VideoCapture(0)` attempts to open the default webcam.

2. **Detector Initialization:** Calls `create_live_detector()` to set up the MediaPipe object detector.

3. **FPS Calculation Setup:** `prev_time` is initialized to calculate Frames Per Second (FPS) for performance monitoring. <br> <br>

The `while cap.isOpened()` loop continuously captures frames from the webcam.

* **Frame Capture:** `cap.read()` reads a frame. If reading fails (e.g., end of stream, camera error), the loop continues to the next iteration. <br> <br>

4. **Data Preprocessing for MediaPipe:**
          
* **Color Conversion:** OpenCV captures frames in BGR format, but MediaPipe expects RGB. `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)` performs this conversion.
* **MediaPipe Image Creation:** The `rgb_frame` is converted into an `mp.Image` object, which is the required input format for MediaPipe's detection API. <br> <br>

5. **Timestamp Generation:** A timestamp in milliseconds (`int(time.time() * 1000)`) is generated for each frame. `LIVE_STREAM` mode expects these timestamps to increase monotonically so that frames are processed, and results delivered, in order.

6. **Asynchronous Detection:** `detector.detect_async(mp_image, timestamp)` sends the image to the MediaPipe detector for processing. This call is non-blocking, allowing the main loop to continue capturing frames while detection happens in the background.

7. **Brief Pause:** `time.sleep(0.01)` introduces a small delay. This allows the `results_callback` function (running on a different thread) sufficient time to update `detection_results`. Without this, there's a higher chance of the main loop trying to access `detection_results` before it's updated, leading to flickering or missed detections. <br> <br>

8. **Process and Display Detection Results:** If `detection_results` is not `None` (meaning detections have been received from the callback), the code iterates through each detected object:

* It extracts the `bounding_box` coordinates, `category_name`, and `score`

* A rectangle (`cv2.rectangle`) is drawn around the detected object using its bounding box coordinates: red for a `person`, green for every other class

* A label string is created, combining the class name and the confidence score (formatted as a percentage)
   
* The label text is placed above the bounding box (`cv2.putText`) <br> <br>  

9. **FPS Display:** Calculates and displays the current FPS on the top-left corner of the frame.

10. **Frame Display:** `cv2.imshow("Async Object Detection", frame)` displays the processed frame with bounding boxes, labels and FPS.

11. **Quit Condition:** Pressing the 'q' key breaks the loop, releasing the webcam resources and destroying all OpenCV windows.
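For the timestamp step, note that `time.time()` follows the system clock, which can be adjusted backwards, and that two frames captured within the same millisecond would receive the same value. The sketch below shows one possible way to produce strictly increasing millisecond timestamps. `FrameClock` is a hypothetical helper, not part of MediaPipe or of the code in this notebook.

```python
import time


class FrameClock:
    """Hypothetical helper that yields strictly increasing millisecond timestamps."""

    def __init__(self) -> None:
        # time.monotonic() never goes backwards, unlike the wall clock.
        self._start = time.monotonic()
        self._last_ms = -1

    def next_ms(self) -> int:
        now_ms = int((time.monotonic() - self._start) * 1000)
        # If two frames land in the same millisecond, bump the value by 1 ms.
        self._last_ms = max(now_ms, self._last_ms + 1)
        return self._last_ms


clock = FrameClock()
stamps = [clock.next_ms() for _ in range(5)]
print(stamps)  # values depend on timing, but each is larger than the previous one
```

In the main loop, `clock.next_ms()` would take the place of `int(time.time() * 1000)`.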

In [85]:
def main():
    global detection_results
    cap = cv2.VideoCapture(0)
    detector = create_live_detector()

    prev_time = 0
    print(" Press 'q' to quit.")

    while cap.isOpened():
        success, frame = cap.read()
        if not success:
            continue

        # Data Preprocessing for MediaPipe
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

        timestamp = int(time.time() * 1000)
        detector.detect_async(mp_image, timestamp)

        time.sleep(0.01)

        # Process and Display Detection Results
        if detection_results:
            for detection in detection_results.detections:
                bbox = detection.bounding_box
                category = detection.categories[0]
                class_name = category.category_name
                score = category.score

                x, y, w, h = bbox.origin_x, bbox.origin_y, bbox.width, bbox.height

                # Red box for a person, green for every other class (BGR format)
                if class_name.lower() == 'person':
                    box_color = (0, 0, 255)
                else:
                    box_color = (0, 255, 0)

                cv2.rectangle(frame, (x, y), (x + w, y + h), box_color, 2)
                label = f"{class_name} ({int(score * 100)}%)"
                cv2.putText(frame, label, (x, y - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, box_color, 2)

        # FPS display
        curr_time = time.time()
        fps = 1 / (curr_time - prev_time) if prev_time else 0
        prev_time = curr_time
        cv2.putText(frame, f'FPS: {int(fps)}', (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)

        # Show the Processed frame in a window named "Async Object Detection"
        cv2.imshow("Async Object Detection", frame)

        # Quit Condition
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    detector.close()  # release the MediaPipe detector's resources
    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()

MediaPipe Object Detector initialized successfully.
 Press 'q' to quit.
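In `main()`, the label is drawn at `(x, y - 10)`, which falls outside the frame when a bounding box touches the top edge. The sketch below shows one way to keep the text visible. `label_origin` is a hypothetical helper, and the `text_height` of 20 pixels is an assumed rough value for the font scale used above, not a measured one.

```python
from typing import Tuple


def label_origin(x: int, y: int, frame_height: int,
                 offset: int = 10, text_height: int = 20) -> Tuple[int, int]:
    """Pick a text origin that stays inside the frame.

    cv2.putText treats the origin as the bottom-left corner of the text,
    so the label needs roughly text_height pixels of room above it.
    """
    x = max(x, 0)
    label_y = y - offset
    if label_y < text_height:
        # Not enough room above the box: draw just inside its top edge.
        label_y = y + text_height
    # Keep the baseline between the top margin and the bottom of the frame.
    label_y = min(max(label_y, text_height), frame_height - 1)
    return x, label_y


print(label_origin(50, 100, 480))  # box well inside the frame
print(label_origin(50, 5, 480))    # box touching the top edge
```

The returned tuple could be passed as the origin argument of `cv2.putText` in place of `(x, y - 10)`.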


## Technical Choices and Algorithms

1. **Technologies used:**
    
   * **MediaPipe:** It's selected for its pre-trained machine learning models specifically designed for various vision tasks, including object detection. MediaPipe provides a streamlined API for integrating ML models into real-time applications.

   * **OpenCV (Open Source Computer Vision Library):** It is used for camera interaction (capturing video frames), image preprocessing (color conversion), and visualizing the detection results (drawing bounding boxes and labels, and displaying the video stream).

2. **Object Detection Model:** The system uses MediaPipe's pre-trained `efficientdet_lite0.tflite` model to detect a small set of common objects (e.g., phone, person).

3. **Asynchronous Processing:** MediaPipe's `LIVE_STREAM` running mode is utilized for asynchronous detection, which helps in maintaining a smooth frame rate by processing frames in the background while the main thread continues to display the video feed.
    
4. **Performance Monitoring:** Frames Per Second (FPS) is displayed on the video feed to give an indication of the system's performance.

5. **Data Preprocessing:** Frames captured in BGR format by OpenCV are converted to RGB for MediaPipe processing. A unique timestamp is generated for each frame to ensure proper processing order in `LIVE_STREAM` mode.

6. **Visualization:** OpenCV's drawing utilities (`cv2.rectangle`, `cv2.putText`) are used to overlay the detection results onto the original video frames. This includes drawing bounding boxes around detected objects and displaying their predicted class labels along with confidence scores.
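On the performance-monitoring point, an FPS value computed from a single frame interval tends to jump around from frame to frame. The sketch below smooths it with an exponential moving average. `FpsMeter` and its `alpha` default are hypothetical choices for illustration, not part of the code above.

```python
import time
from typing import Optional


class FpsMeter:
    """Hypothetical exponentially smoothed FPS counter."""

    def __init__(self, alpha: float = 0.1) -> None:
        self.alpha = alpha  # weight of the newest sample (assumed value)
        self.fps = 0.0
        self._prev: Optional[float] = None

    def tick(self, now: Optional[float] = None) -> float:
        # Allow an explicit timestamp so the meter can be tested deterministically.
        now = time.perf_counter() if now is None else now
        if self._prev is not None:
            dt = now - self._prev
            if dt > 0:  # skip zero intervals to avoid dividing by zero
                instant = 1.0 / dt
                # The first sample seeds the average; later ones are blended in.
                if self.fps == 0.0:
                    self.fps = instant
                else:
                    self.fps = self.alpha * instant + (1 - self.alpha) * self.fps
        self._prev = now
        return self.fps


meter = FpsMeter()
for t in (0.0, 0.1, 0.2, 0.3):
    current = meter.tick(t)
print(round(current, 1))
```

Calling `meter.tick()` once per loop iteration would give a steadier number to draw with `cv2.putText`.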

## Observation

The `efficientdet_lite0.tflite` model was primarily trained on the COCO dataset, which includes 80 common object categories. Objects such as person, cell phone and chair are well represented and have distinct features. Pens and pencils, by contrast, are not among the COCO categories, so the model has no label for them and can at best assign them to a visually similar class. Small objects in general are also difficult for a lightweight model trained on a general dataset, especially when they occupy few pixels in the input image or lack features that set them apart from the background.

* A higher threshold (e.g., 0.5) prioritizes precision. The system detects large, distinct, and common objects such as person, cell phone, chair and cup, which typically yield high confidence scores from the `efficientdet_lite0` model. A higher threshold filters out low-confidence predictions, leading to fewer false positives but potentially missing objects that the model is less confident about.

* A lower threshold (e.g., 0.3) prioritizes recall. It lets more detections through, including those with marginal confidence. This can recover some previously missed objects, but it also admits incorrect predictions, which accounts for the observed false positives. Being lightweight, the `efficientdet_lite0` model is more prone to such ambiguities at lower confidence levels, especially for objects such as a pen or pencil that resemble other classes or have few distinctive features in the training data.
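The trade-off between the two thresholds can be illustrated with a small filtering sketch. The labels and scores below are made up for illustration and are not measured output of the model. `filter_by_score` is a hypothetical helper that mimics what `score_threshold` does conceptually; whether the real detector treats the boundary value as inclusive is not something this sketch settles.

```python
from typing import List, Tuple

Detection = Tuple[str, float]  # (label, confidence score)


def filter_by_score(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep detections whose score is at least `threshold`."""
    return [d for d in detections if d[1] >= threshold]


# Made-up scores for illustration only; not real model output.
sample = [
    ("person", 0.91),
    ("chair", 0.62),
    ("cell phone", 0.48),
    ("toothbrush", 0.34),  # imagine a pen mislabelled as a similar class
    ("book", 0.22),
]

for threshold in (0.5, 0.3):
    kept = filter_by_score(sample, threshold)
    print(threshold, [label for label, _ in kept])
```

With these invented numbers, the 0.5 threshold keeps only the two confident detections, while 0.3 also admits the marginal cell phone and the mislabelled object.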

