
### MMDetection Installation

* In the lecture video, `mmcv` was installed using `pip install mmcv-full` (this took about 10 minutes).
* In the practice code, it was changed to `pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu116/torch1.13/index.html` (installation took about 12 seconds, as of September 2022).
* As of April 6, 2023, `mmdetection` was upgraded to version 3.0. Since the practice code is based on `mmdetection` 2.x, you need to install the 2.x source code.
* In September 2024, Colab’s NumPy version was upgraded to 1.24, which caused some code execution errors. Downgrading to NumPy 1.23 fixed the issue.
* On January 17, 2025, Colab upgraded Python from 3.10 to 3.11, along with PyTorch 2.0 and TorchVision 0.15. Accordingly, the `mmcv` installation command changed to `!pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.0.0/index.html`.

* On August 25, 2025, Colab upgraded Python again, from 3.11 to 3.12, and `mmcv-full` could no longer be installed properly.
* Because of this, the practice environment was moved from Colab to Kaggle. Kaggle still uses Python 3.11.
* In Colab, the working directory was based on `/content`. In Kaggle, it is `/kaggle/working`, but the practice code was adjusted to use the current directory (`.`) automatically.
* On August 25, 2025, due to SSL issues with `download.openmmlab.com`, the `--trusted-host` option had to be added to `pip install`, and the `--no-check-certificate` option added to `wget`.
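
The install commands in the notes above all follow the same `-f` URL pattern, with only the CUDA and PyTorch tags changing. A minimal sketch of that pattern is shown below. The helper names `mmcv_find_links_url` and `pip_command` are hypothetical (they are not part of `mmcv` or `pip`), and the sketch does not check whether a wheel index actually exists for a given combination.

```python
def mmcv_find_links_url(cuda_tag: str, torch_tag: str) -> str:
    """Build the `-f` index URL in the pattern used by the commands above.

    Hypothetical helper: it only formats a string and does not verify
    that the index exists on download.openmmlab.com.
    """
    return f"https://download.openmmlab.com/mmcv/dist/{cuda_tag}/{torch_tag}/index.html"


def pip_command(cuda_tag: str, torch_tag: str, trusted_host: bool = True) -> str:
    """Assemble the pip command, optionally adding --trusted-host for the SSL issue."""
    parts = ["pip", "install", "mmcv-full"]
    if trusted_host:
        parts += ["--trusted-host", "download.openmmlab.com"]
    parts += ["-f", mmcv_find_links_url(cuda_tag, torch_tag)]
    return " ".join(parts)


# Tags taken from the notes above.
print(pip_command("cu118", "torch2.0"))
print(pip_command("cu116", "torch1.13", trusted_host=False))
```

In a notebook cell the printed string would be run with a leading `!`.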

In [None]:
import torch
print(torch.__version__)

In [None]:
# downgrade pytorch version to 2.0
!pip install torch==2.0.0 torchvision==0.15.1 --index-url https://download.pytorch.org/whl/cu118

In [None]:
!pip install mmcv-full --trusted-host download.openmmlab.com -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.0/index.html

In [None]:
!git clone --branch 2.x https://github.com/open-mmlab/mmdetection.git
!cd mmdetection; python setup.py install

In [None]:
!pip install numpy==1.23

In [None]:
# You must restart the kernel before running the code below.
from mmdet.apis import init_detector, inference_detector
import mmcv


### Performing Inference Using a Faster R-CNN Pretrained Model Based on the MS-COCO Dataset

* Download the Faster R-CNN pretrained model
* Set the config file for Faster R-CNN
* Create the inference model and apply inference
* Due to SSL issues with the `download.openmmlab.com` site, add the `--no-check-certificate` option to `wget`.


In [None]:
!cd mmdetection; mkdir checkpoints

In [None]:
!wget --no-check-certificate -O /kaggle/working/mmdetection/checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth

In [None]:
!ls -lia /kaggle/working/mmdetection/checkpoints

In [None]:
config_file = '/kaggle/working/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = '/kaggle/working/mmdetection/checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
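
The two paths above are hardcoded for Kaggle. As a sketch of how the same paths could be built from a base directory (so that `/content` on Colab or `.` would also work), the hypothetical helper `mmdet_paths` below joins the pieces with `pathlib`. It assumes the `mmdetection` repository was cloned directly under the base directory, as in the cells above.

```python
from pathlib import Path


def mmdet_paths(base_dir="."):
    """Return (config_file, checkpoint_file) under base_dir/mmdetection.

    Hypothetical helper; the file names are the ones used in this notebook.
    """
    root = Path(base_dir) / "mmdetection"
    config = root / "configs" / "faster_rcnn" / "faster_rcnn_r50_fpn_1x_coco.py"
    checkpoint = root / "checkpoints" / "faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth"
    return config.as_posix(), checkpoint.as_posix()


config_file, checkpoint_file = mmdet_paths("/kaggle/working")
print(config_file)
print(checkpoint_file)
```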

In [None]:
# Create a Detector model based on the config file and pretrained model.
from mmdet.apis import init_detector, inference_detector

model = init_detector(config_file, checkpoint_file, device='cuda:0')


In [None]:
# Relative paths are resolved against the current working directory,
# so change into the mmdetection directory before passing them.
%cd mmdetection

from mmdet.apis import init_detector, inference_detector

# Pass the config and checkpoint as arguments to init_detector().
model = init_detector(
    config='configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py',
    checkpoint='checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
)


In [None]:
%cd /kaggle/working

In [None]:
import cv2
import matplotlib.pyplot as plt
img = '/kaggle/working/mmdetection/demo/demo.jpg'

img_arr  = cv2.cvtColor(cv2.imread(img), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(12, 12))
plt.imshow(img_arr)

In [None]:
img = '/kaggle/working/mmdetection/demo/demo.jpg'
# The argument for inference_detector can be a string (file path), 
# a single ndarray, or a list of ndarrays.
results = inference_detector(model, img)

In [None]:
type(results), len(results)

In [None]:
# results is a list of 80 arrays, one per COCO class id from 0 to 79.
# Each array has shape (number of detected objects, 5): one row per detection.
# The 5 values in a row are the box coordinates and the confidence score:
# top-left (xmin, ymin), bottom-right (xmax, ymax), then the score.
# A class with no detections has an empty array of shape (0, 5).
results
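
To make the per-class layout concrete without running a model, the sketch below builds a synthetic stand-in for `results` (the box values are made up for illustration) and shows two common ways to read it: counting detections per class, and stacking everything into one array with a matching label vector.

```python
import numpy as np

# Synthetic stand-in for the list returned by inference_detector() in mmdetection 2.x:
# 80 arrays of shape (N, 5); the values here are invented for illustration.
num_classes = 80
results = [np.zeros((0, 5), dtype=np.float32) for _ in range(num_classes)]
results[0] = np.array([[10, 20, 50, 80, 0.95],
                       [60, 25, 90, 85, 0.40]], dtype=np.float32)  # two 'person' rows
results[2] = np.array([[100, 120, 220, 200, 0.88]], dtype=np.float32)  # one 'car' row

# Number of detections per class id, skipping classes with empty arrays.
counts = {class_id: len(arr) for class_id, arr in enumerate(results) if len(arr) > 0}

# Stack all rows into one (total, 5) array and build the matching class-id vector.
bboxes = np.vstack(results)
labels = np.concatenate(
    [np.full(len(arr), class_id, dtype=np.int32) for class_id, arr in enumerate(results)]
)

print(counts)
print(bboxes.shape, labels)
```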


In [None]:
results[0].shape, results[1].shape, results[2].shape, results[3].shape

In [None]:
from mmdet.apis import show_result_pyplot
# Apply the inference results to the original image to generate a new image (with bounding boxes drawn).
# By default, only objects with a score of 0.3 or higher are visualized.
# show_result_pyplot internally calls model.show_result().
show_result_pyplot(model, img, results)


### Checking the Model’s Config Settings


In [None]:
model.__dict__

In [None]:
#print(model.cfg)
print(model.cfg.pretty_text)

#### When passing an array to `inference_detector()`, provide it in BGR format (conversion to RGB is handled internally)



In [None]:
import cv2

# must be provided in BGR format
img_arr = cv2.imread('/kaggle/working/mmdetection/demo/demo.jpg')
results = inference_detector(model, img_arr)

show_result_pyplot(model, img_arr, results)
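
For reference, BGR and RGB differ only in the order of the last axis. The sketch below uses a tiny hand-made array (not a real image) to show the reversal. For a 3-channel `uint8` image this should give the same values as `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)`.

```python
import numpy as np

# A 1x2 "image" in BGR order, as cv2.imread() would return it:
# first pixel is pure blue, second pixel is pure red.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the channel axis converts BGR to RGB (and RGB back to BGR).
# ascontiguousarray makes a regular copy, since the reversed slice is only a view.
rgb = np.ascontiguousarray(bgr[..., ::-1])
back_to_bgr = np.ascontiguousarray(rgb[..., ::-1])

print(rgb[0, 0], rgb[0, 1])
```

An array loaded with `cv2.imread()` can therefore be passed to `inference_detector()` as is, and only needs reversing for display with `matplotlib`.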

### Visualizing Inference Results as an Image Without Using `show_result_pyplot()`

* Create a `get_detected_img()` function that takes the model and image array as input, detects objects in the image, and draws bounding boxes.
* COCO class mapping is applied sequentially starting from 0.
* If an array in `results` is empty, it means no object was detected for the class corresponding to that list index (class id).
* Detections with scores below the threshold are excluded.


In [None]:
labels_to_names_seq = {0:'person',1:'bicycle',2:'car',3:'motorbike',4:'aeroplane',5:'bus',6:'train',7:'truck',8:'boat',9:'traffic light',10:'fire hydrant',
                        11:'stop sign',12:'parking meter',13:'bench',14:'bird',15:'cat',16:'dog',17:'horse',18:'sheep',19:'cow',20:'elephant',
                        21:'bear',22:'zebra',23:'giraffe',24:'backpack',25:'umbrella',26:'handbag',27:'tie',28:'suitcase',29:'frisbee',30:'skis',
                        31:'snowboard',32:'sports ball',33:'kite',34:'baseball bat',35:'baseball glove',36:'skateboard',37:'surfboard',38:'tennis racket',39:'bottle',40:'wine glass',
                        41:'cup',42:'fork',43:'knife',44:'spoon',45:'bowl',46:'banana',47:'apple',48:'sandwich',49:'orange',50:'broccoli',
                        51:'carrot',52:'hot dog',53:'pizza',54:'donut',55:'cake',56:'chair',57:'sofa',58:'pottedplant',59:'bed',60:'diningtable',
                        61:'toilet',62:'tvmonitor',63:'laptop',64:'mouse',65:'remote',66:'keyboard',67:'cell phone',68:'microwave',69:'oven',70:'toaster',
                        71:'sink',72:'refrigerator',73:'book',74:'clock',75:'vase',76:'scissors',77:'teddy bear',78:'hair drier',79:'toothbrush' }

labels_to_names = {1:'person',2:'bicycle',3:'car',4:'motorcycle',5:'airplane',6:'bus',7:'train',8:'truck',9:'boat',10:'traffic light',
                    11:'fire hydrant',12:'street sign',13:'stop sign',14:'parking meter',15:'bench',16:'bird',17:'cat',18:'dog',19:'horse',20:'sheep',
                    21:'cow',22:'elephant',23:'bear',24:'zebra',25:'giraffe',26:'hat',27:'backpack',28:'umbrella',29:'shoe',30:'eye glasses',
                    31:'handbag',32:'tie',33:'suitcase',34:'frisbee',35:'skis',36:'snowboard',37:'sports ball',38:'kite',39:'baseball bat',40:'baseball glove',
                    41:'skateboard',42:'surfboard',43:'tennis racket',44:'bottle',45:'plate',46:'wine glass',47:'cup',48:'fork',49:'knife',50:'spoon',
                    51:'bowl',52:'banana',53:'apple',54:'sandwich',55:'orange',56:'broccoli',57:'carrot',58:'hot dog',59:'pizza',60:'donut',
                    61:'cake',62:'chair',63:'couch',64:'potted plant',65:'bed',66:'mirror',67:'dining table',68:'window',69:'desk',70:'toilet',
                    71:'door',72:'tv',73:'laptop',74:'mouse',75:'remote',76:'keyboard',77:'cell phone',78:'microwave',79:'oven',80:'toaster',
                    81:'sink',82:'refrigerator',83:'blender',84:'book',85:'clock',86:'vase',87:'scissors',88:'teddy bear',89:'hair drier',90:'toothbrush',
                    91:'hair brush'}

In [None]:
import numpy as np

# example of np.where
arr1 = np.array([[3.75348572e+02, 1.19171005e+02, 3.81950867e+02, 1.34460617e+02,
         1.35454759e-01],
        [5.32362000e+02, 1.09554726e+02, 5.40526550e+02, 1.25222633e+02,
         8.88786465e-01],
        [3.61124298e+02, 1.09049202e+02, 3.68625610e+02, 1.22483063e+02,
         7.20717013e-02]], dtype=np.float32)
print(arr1.shape)

arr1_filtered = arr1[np.where(arr1[:, 4] > 0.1)]
print('### arr1_filtered:', arr1_filtered, arr1_filtered.shape)

In [None]:
np.where(arr1[:, 4] > 0.1)
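
As a side note to the `np.where` example, indexing with the boolean mask directly selects the same rows. The sketch below uses rounded copies of the values above to compare the two forms.

```python
import numpy as np

# Rounded copies of the three detections used in the np.where example.
arr = np.array([[375.3, 119.2, 381.9, 134.5, 0.135],
                [532.4, 109.6, 540.5, 125.2, 0.889],
                [361.1, 109.0, 368.6, 122.5, 0.072]], dtype=np.float32)

mask = arr[:, 4] > 0.1           # one boolean per row
via_where = arr[np.where(mask)]  # np.where(mask) returns a tuple of row indices
via_mask = arr[mask]             # boolean indexing selects the same rows

print(mask)
print(via_mask.shape)
```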

In [None]:
# Create a visualization function for inference that takes the model, 
# the original image array, and the confidence score threshold for filtering as arguments.
def get_detected_img(model, img_array, score_threshold=0.3, is_print=True):
  # Copy the input image array.
  draw_img = img_array.copy()
  bbox_color = (0, 255, 0)  # green
  text_color = (0, 0, 255)  # red

  # Perform inference detection using the model and image array as inputs,
  # and store the results in 'results'.
  # 'results' is a list containing 80 two-dimensional arrays (shape=(number of objects, 5)).
  results = inference_detector(model, img_array)

  # Iterate over the results list, which contains 80 arrays,
  # and extract each 2D array to visualize objects on the image.
  # The index in the results list corresponds directly to the COCO-mapped class id.
  # Each 2D array contains object coordinates and class confidence scores.
  for result_ind, result in enumerate(results):
    # If the row size of the 2D array is 0, it means there are no detections for that class id.
    if len(result) == 0:
      continue

    # In the 2D array, the 5th column represents the confidence score.
    # Exclude any detections with scores lower than the threshold passed as a function argument.
    result_filtered = result[np.where(result[:, 4] > score_threshold)]

    # Each 2D array may contain multiple detected objects for the class.
    # Iterate through the rows to extract the coordinates of each detected object.
    for i in range(len(result_filtered)):
      # Extract top-left and bottom-right coordinates.
      left = int(result_filtered[i, 0])
      top = int(result_filtered[i, 1])
      right = int(result_filtered[i, 2])
      bottom = int(result_filtered[i, 3])
      caption = "{}: {:.4f}".format(labels_to_names_seq[result_ind], result_filtered[i, 4])
      cv2.rectangle(draw_img, (left, top), (right, bottom), color=bbox_color, thickness=2)
      cv2.putText(draw_img, caption, (int(left), int(top - 7)), cv2.FONT_HERSHEY_SIMPLEX, 0.37, text_color, 1)
      if is_print:
        print(caption)

  return draw_img
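
One detail in the function above: the caption is drawn at `top - 7`, which falls outside the image when a box touches the top edge. A possible adjustment is sketched below. The helper `caption_origin` and its `min_y` value are hypothetical and are not used by the function above. The sketch relies on `cv2.putText` taking the bottom-left corner of the text as its origin.

```python
def caption_origin(left, top, offset=7, min_y=12):
    """Return an (x, y) text origin for a box caption, kept inside the image.

    Hypothetical helper: offset matches the 'top - 7' used above,
    min_y is an assumed minimum baseline for the small font size.
    """
    y = top - offset
    if y < min_y:
        # The box is too close to the top edge: put the caption just inside the box.
        y = top + min_y
    return int(max(left, 0)), int(y)


print(caption_origin(100, 50))  # caption above the box
print(caption_origin(5, 3))     # caption moved inside the box
```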


In [None]:
import matplotlib.pyplot as plt

img_arr = cv2.imread('/kaggle/working/mmdetection/demo/demo.jpg')
detected_img = get_detected_img(model, img_arr, score_threshold=0.3, is_print=True)
# The input image for detection is in BGR format. 
# Convert it to RGB for the final output.
detected_img = cv2.cvtColor(detected_img, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(12, 12))
plt.imshow(detected_img)


In [None]:
!mkdir /kaggle/working/data

In [None]:
!wget -O /kaggle/working/data/beatles01.jpg https://raw.githubusercontent.com/gayoung-k/object-detection-learning-notes/image/beatles01.jpg
!ls -lia /kaggle/working/data/beatles01.jpg

In [None]:
img_arr = cv2.imread('/kaggle/working/data/beatles01.jpg')
detected_img = get_detected_img(model, img_arr,  score_threshold=0.5, is_print=True)

detected_img = cv2.cvtColor(detected_img, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(12, 12))
plt.imshow(detected_img)

### Performing Video Inference

* Running video inference with `mmdetection`’s `video_demo.py` spends a relatively long time on image processing.
* The frame processing logic is therefore modified and applied directly in the notebook.
* Change the Colab path `/content` in the lecture video to `/kaggle/working`.


In [None]:
!wget -O /kaggle/working/data/John_Wick_small.mp4 https://github.com/gayoung-k/object-detection-learning-notes/video/John_Wick_small.mp4?raw=true

In [None]:
from mmdet.apis import init_detector, inference_detector
import mmcv

config_file = '/kaggle/working/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = '/kaggle/working/mmdetection/checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')

In [None]:

import cv2

video_reader = mmcv.VideoReader('/kaggle/working/data/John_Wick_small.mp4')
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
video_writer = cv2.VideoWriter('/kaggle/working/data/John_Wick_small_out1.mp4', fourcc,
                               video_reader.fps, (video_reader.width, video_reader.height))

# Run inference on each frame, draw the detections, and write the frame to the output video.
for frame in mmcv.track_iter_progress(video_reader):
  result = inference_detector(model, frame)
  frame = model.show_result(frame, result, score_thr=0.4)
  video_writer.write(frame)

# Release the writer so the output file is finalized.
video_writer.release()

### Perform Video Inference by Applying a Customized Frame Processing Logic

* Reuse the previously implemented `get_detected_img()` function as is.


In [None]:
import numpy as np

labels_to_names_seq = {0:'person',1:'bicycle',2:'car',3:'motorbike',4:'aeroplane',5:'bus',6:'train',7:'truck',8:'boat',9:'traffic light',10:'fire hydrant',
                        11:'stop sign',12:'parking meter',13:'bench',14:'bird',15:'cat',16:'dog',17:'horse',18:'sheep',19:'cow',20:'elephant',
                        21:'bear',22:'zebra',23:'giraffe',24:'backpack',25:'umbrella',26:'handbag',27:'tie',28:'suitcase',29:'frisbee',30:'skis',
                        31:'snowboard',32:'sports ball',33:'kite',34:'baseball bat',35:'baseball glove',36:'skateboard',37:'surfboard',38:'tennis racket',39:'bottle',40:'wine glass',
                        41:'cup',42:'fork',43:'knife',44:'spoon',45:'bowl',46:'banana',47:'apple',48:'sandwich',49:'orange',50:'broccoli',
                        51:'carrot',52:'hot dog',53:'pizza',54:'donut',55:'cake',56:'chair',57:'sofa',58:'pottedplant',59:'bed',60:'diningtable',
                        61:'toilet',62:'tvmonitor',63:'laptop',64:'mouse',65:'remote',66:'keyboard',67:'cell phone',68:'microwave',69:'oven',70:'toaster',
                        71:'sink',72:'refrigerator',73:'book',74:'clock',75:'vase',76:'scissors',77:'teddy bear',78:'hair drier',79:'toothbrush' }

# Create a visualization function for inference that takes the model, 
# the original image array, and the confidence score threshold for filtering as arguments.
def get_detected_img(model, img_array, score_threshold=0.3, is_print=True):
  # Copy the input image array.
  draw_img = img_array.copy()
  bbox_color = (0, 255, 0)  # green
  text_color = (0, 0, 255)  # red

  # Perform inference detection using the model and image array as inputs,
  # and store the results in 'results'.
  # 'results' is a list containing 80 two-dimensional arrays (shape=(number of objects, 5)).
  results = inference_detector(model, img_array)

  # Iterate over the results list, which contains 80 arrays,
  # and extract each 2D array to visualize objects on the image.
  # The index in the results list corresponds directly to the COCO-mapped class id.
  # Each 2D array contains object coordinates and class confidence scores.
  for result_ind, result in enumerate(results):
    # If the row size of the 2D array is 0, it means there are no detections for that class id.
    if len(result) == 0:
      continue

    # In the 2D array, the 5th column represents the confidence score.
    # Exclude any detections with scores lower than the threshold passed as a function argument.
    result_filtered = result[np.where(result[:, 4] > score_threshold)]

    # Each 2D array may contain multiple detected objects for the class.
    # Iterate through the rows to extract the coordinates of each detected object.
    for i in range(len(result_filtered)):
      # Extract top-left and bottom-right coordinates.
      left = int(result_filtered[i, 0])
      top = int(result_filtered[i, 1])
      right = int(result_filtered[i, 2])
      bottom = int(result_filtered[i, 3])
      caption = "{}: {:.4f}".format(labels_to_names_seq[result_ind], result_filtered[i, 4])
      cv2.rectangle(draw_img, (left, top), (right, bottom), color=bbox_color, thickness=2)
      cv2.putText(draw_img, caption, (int(left), int(top - 7)), cv2.FONT_HERSHEY_SIMPLEX, 0.37, text_color, 1)
      if is_print:
        print(caption)

  return draw_img

In [None]:
import time

def do_detected_video(model, input_path, output_path, score_threshold, do_print=True):

    cap = cv2.VideoCapture(input_path)

    # 'mp4v' matches the .mp4 output files written in this notebook.
    codec = cv2.VideoWriter_fourcc(*'mp4v')

    vid_size = (round(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), round(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    vid_fps = cap.get(cv2.CAP_PROP_FPS)

    vid_writer = cv2.VideoWriter(output_path, codec, vid_fps, vid_size)

    frame_cnt = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    print('Total number of frames:', frame_cnt)
    btime = time.time()
    while True:
        hasFrame, img_frame = cap.read()
        if not hasFrame:
            print('No more frames to process.')
            break
        stime = time.time()
        img_frame = get_detected_img(model, img_frame, score_threshold=score_threshold, is_print=False)
        if do_print:
            print('Detection time per frame:', round(time.time() - stime, 4))
        vid_writer.write(img_frame)
    # end of while loop

    vid_writer.release()
    cap.release()

    print('Total detection processing time:', round(time.time() - btime, 4))
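
The function above prints one timing per frame. If those timings were collected in a list instead, they could be condensed into a mean and a frames-per-second figure. The sketch below assumes such a list; `summarize_frame_times` is a hypothetical helper, and the sample times are invented.

```python
def summarize_frame_times(frame_times):
    """Summarize per-frame detection times (in seconds) into totals and throughput."""
    n = len(frame_times)
    total = sum(frame_times)
    if n == 0 or total == 0:
        # No frames (or no measurable time): avoid dividing by zero.
        return {"frames": n, "total_sec": total, "mean_sec": 0.0, "fps": 0.0}
    return {"frames": n, "total_sec": total, "mean_sec": total / n, "fps": n / total}


# Invented sample: three frames taking 0.1 s, 0.1 s and 0.2 s.
summary = summarize_frame_times([0.1, 0.1, 0.2])
print(summary)
```

Note that this measures only detection and drawing time per frame, so it excludes video decoding and writing.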


In [None]:
do_detected_video(model, '/kaggle/working/data/John_Wick_small.mp4', '/kaggle/working/data/John_Wick_small_out2.mp4', score_threshold=0.4, do_print=True)