
## Mathematical Explanation of Game 1: "find_this_mii"

1. **Read Frame and Select ROI (Region of Interest)**:
   $$
   \text{cap.set}(\text{CAP\_PROP\_POS\_FRAMES}, 4820), \qquad \text{frame} = \text{cap.read}()
   $$
   $$
   \text{ROI} = \text{selectROI}(\text{frame})
   $$
   Here, the user selects a rectangular region (ROI) in the frame. `selectROI` returns it as $(x, y, w, h)$, where $x$ and $y$ are the coordinates of the top-left corner and $w$ and $h$ are the width and height of the region.

   The template image $T$ is then (NumPy indexes rows, i.e. $y$, first):
   $$
   T = \text{frame}[y:y+h, x:x+w]
   $$


2. **Frame Processing Loop**:
   For each subsequent frame $F_i$ in the video (where $i$ is the frame index), the loop keeps reading frames up to frame number 5000:
   $$
   F_i = \text{cap.read}(), \qquad i = 4821, 4822, \ldots
   $$
   If $i > 5000$, or no frame can be read, exit the loop.

3. **Apply Gaussian Blur**:
   $$
   F_i^{\text{blur}} = \text{GaussianBlur}(F_i, (5, 5), 0)
   $$
   This step applies a $5 \times 5$ Gaussian filter to the frame to reduce noise; because $\sigma = 0$ is passed, OpenCV derives $\sigma$ from the kernel size.

4. **Template Matching**:
   $$
   \text{res} = \text{matchTemplate}(F_i^{\text{blur}}, T, \text{TM\_CCOEFF\_NORMED})
   $$
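   The `matchTemplate` function slides the template $T$ over the blurred frame $F_i^{\text{blur}}$ and, with `TM_CCOEFF_NORMED`, computes the normalized correlation coefficient (zero-mean normalized cross-correlation) at every position. For a $W \times H$ frame and a $w \times h$ template, the result `res` is a $(H - h + 1) \times (W - w + 1)$ matrix whose element at $(x, y)$ is the score, between $-1$ and $1$, of the template placed with its top-left corner at that point.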
   
   $$
   (\text{min\_val}, \text{max\_val}, \text{min\_loc}, \text{max\_loc}) = \text{minMaxLoc}(\text{res})
   $$
   
   This function finds the minimum and maximum values and their locations in the result matrix. With `TM_CCOEFF_NORMED` a higher score means a better match, so we are interested in `max_loc`, the $(x, y)$ location of the highest correlation.

   Let:
   $$
   \text{top\_left} = \text{max\_loc}
   $$
   $$
   \text{bottom\_right} = (\text{top\_left}[0] + w, \text{top\_left}[1] + h)
   $$

In summary, the mathematical operations involve:
- Extracting a template image from a specific region in a frame.
- Applying Gaussian blur to reduce noise.
- Using normalized cross-correlation to find the best match of the template in subsequent frames.
- Identifying the location of the best match and drawing a rectangle around it.
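To make step 4 and the `max_loc` lookup concrete, the following is a minimal NumPy sketch of the score that `TM_CCOEFF_NORMED` computes, for a single-channel image only. With $T' = T - \bar{T}$ and $I'$ the image window minus its own mean, the score at $(x, y)$ is

$$
R(x, y) = \frac{\sum_{x', y'} T'(x', y')\, I'(x + x', y + y')}{\sqrt{\sum_{x', y'} T'(x', y')^2 \cdot \sum_{x', y'} I'(x + x', y + y')^2}}
$$

The helper names `ccoeff_normed` and `best_match` are hypothetical, the brute-force loops are far slower than OpenCV, and scoring a zero-variance window as 0 is an assumption of this sketch rather than documented OpenCV behaviour.

```python
import numpy as np

def ccoeff_normed(image, template):
    """Brute-force zero-mean normalized cross-correlation (TM_CCOEFF_NORMED-style).

    Returns a map of shape (H - h + 1, W - w + 1); entry [y, x] scores the
    template placed with its top-left corner at (x, y).
    """
    image = image.astype(np.float64)
    t = template.astype(np.float64)
    h, w = t.shape
    H, W = image.shape
    t_zero = t - t.mean()
    t_norm = np.sqrt((t_zero ** 2).sum())
    res = np.zeros((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            win = image[y:y + h, x:x + w]
            win_zero = win - win.mean()
            denom = t_norm * np.sqrt((win_zero ** 2).sum())
            # Assumption of this sketch: flat windows (zero variance) score 0
            res[y, x] = (t_zero * win_zero).sum() / denom if denom > 0 else 0.0
    return res

def best_match(res, w, h):
    """Mimic minMaxLoc + the bottom_right formula: return (x, y) points."""
    y, x = np.unravel_index(np.argmax(res), res.shape)
    top_left = (int(x), int(y))  # OpenCV points are (x, y), not (row, col)
    bottom_right = (top_left[0] + w, top_left[1] + h)
    return top_left, bottom_right

# Usage: cut a template out of a random grayscale "frame" and find it again
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(40, 60)).astype(np.uint8)
x, y, w, h = 25, 10, 8, 6             # an ROI in selectROI's (x, y, w, h) order
template = frame[y:y + h, x:x + w]    # T = frame[y:y+h, x:x+w]
res = ccoeff_normed(frame, template)
top_left, bottom_right = best_match(res, w, h)
print(res.shape, top_left, bottom_right)
```

Because an exact copy of the template is still in the image, the maximum score is 1.0 at the ROI's own top-left corner, `(25, 10)`.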

### Requirements

1A. Input images from video file WiiPlay.mp4 with level 15 (frame number between 4820 and 5000).<br> \
1B. (5pts) Acquire a <b>face template</b> from the first frame (frame number = 4820).<br>\
1C. (10pts) Try to detect the face the same as the template on subsequent frames, draw a <b>red</b> rectangle around the detected face, and show the output images in the <b>"find_this_mii"</b> window.<br>

```python
#game_1 : "find_this_mii"

import cv2
import numpy as np

# Load the video
cap = cv2.VideoCapture('WiiPlay.mp4')

# Set the current video frame to 4820
cap.set(cv2.CAP_PROP_POS_FRAMES, 4820)

# Capture a single frame from the video
ret, frame = cap.read()
if not ret:
    raise RuntimeError('Failed to read frame 4820 from WiiPlay.mp4')

# Select a region of interest (ROI) manually from the captured frame for template matching.
# selectROI returns (x, y, w, h); an all-zero rectangle means the selection was cancelled.
x, y, w, h = [int(v) for v in cv2.selectROI(frame)]
if w == 0 or h == 0:
    raise RuntimeError('No ROI was selected')
template = frame[y:y + h, x:x + w]

while cap.isOpened():
    ret, frame = cap.read()

    # Stop if the frame cannot be read or the frame just read is past 5000
    # (after read(), CAP_PROP_POS_FRAMES is the index of the *next* frame)
    if not ret or cap.get(cv2.CAP_PROP_POS_FRAMES) - 1 > 5000:
        break

    # Blur a copy of the frame to reduce noise for template matching;
    # the rectangle is drawn on the original, unblurred frame
    frame_blur = cv2.GaussianBlur(frame, (5, 5), 0)

    # Normalized correlation coefficient of the template at every position
    res = cv2.matchTemplate(frame_blur, template, cv2.TM_CCOEFF_NORMED)

    # With TM_CCOEFF_NORMED the best match is the location of the maximum
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)

    top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)

    # Draw a red (BGR) rectangle around the best match
    cv2.rectangle(frame, top_left, bottom_right, (0, 0, 255), 2)

    # Display the frame with the rectangle around the template
    cv2.imshow('find_this_mii', frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```

```
Select a ROI and then press SPACE or ENTER button!
Cancel the selection process by pressing c button!
```


## Mathematical Explanation of the Face Detection Process

### Inputs:
- Let `frame` be the image frame in which to detect faces.
- Let `lower_skin` be the lower bound of the HSV color range for skin detection.
- Let `upper_skin` be the upper bound of the HSV color range for skin detection.
- Let `min_area` be the minimum area of a detected face.
- Let `max_area` be the maximum area of a detected face.
- Let `aspect_ratio_range` be the range of acceptable aspect ratios for detected faces.

### Steps and Mathematical Operations

1. **Convert Frame to HSV Color Space**:
   Convert the frame from BGR to HSV color space:
   $$
   \text{hsv} = \text{cvtColor}(\text{frame}, \text{COLOR\_BGR2HSV})
   $$

2. **Create Binary Mask for Skin Detection**:
   Create a binary mask where pixels within the skin color range are set to 255 (white), and all other pixels are set to 0 (black):
   $$
   \text{mask} = \text{inRange}(\text{hsv}, \text{lower\_skin}, \text{upper\_skin})
   $$

3. **Morphological Operations**:
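   - **Erosion**: Reduce noise by shrinking the white regions (`kernel` is an $11 \times 11$ elliptical structuring element):
     $$
     \text{mask} = \text{erode}(\text{mask}, \text{kernel}, \text{iterations} = 2)
     $$
   - **Dilation**: Expand the white regions to restore the eroded parts. Erosion followed by dilation with the same kernel is a morphological *opening*: it removes white specks smaller than the kernel while roughly preserving larger regions:
     $$
     \text{mask} = \text{dilate}(\text{mask}, \text{kernel}, \text{iterations} = 2)
     $$
   - **Gaussian Blur**: Smooth the edges of the mask. The result is no longer strictly binary, but `findContours` treats every non-zero pixel as foreground: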
     $$
     \text{mask} = \text{GaussianBlur}(\text{mask}, (3, 3), 0)
     $$

4. **Find Contours in the Mask**:
   Identify the boundaries of connected white regions (potential faces) in the mask:
   $$
   \text{contours}, \_ = \text{findContours}(\text{mask}, \text{RETR\_EXTERNAL}, \text{CHAIN\_APPROX\_SIMPLE})
   $$

5. **Filter Contours Based on Geometric Properties**:
   For each contour:
   - Compute the bounding rectangle:
     $$
     (x, y, w, h) = \text{boundingRect}(\text{contour})
     $$
     Here, $x, y$ are the coordinates of the top-left corner, and $w, h$ are the width and height of the rectangle.
   
   - Calculate the aspect ratio ($AR$) and the contour area ($A$, the area enclosed by the contour, which is at most $w \cdot h$):
     $$
     AR = \frac{w}{h}
     $$
     $$
     A = \text{contourArea}(\text{contour})
     $$

   - Check if the contour meets the aspect ratio and area criteria:
     $$
     \text{if} \quad \text{aspect\_ratio\_range}[0] < AR < \text{aspect\_ratio\_range}[1] \quad \text{and} \quad \text{min\_area} < A < \text{max\_area}
     $$
     If true, add the bounding rectangle coordinates to the list of detected faces:
     $$
     \text{faces.append}((x, y, w, h))
     $$
     Draw a blue rectangle (BGR $(255, 0, 0)$) around the detected face in the frame:
     $$
     \text{rectangle}(\text{frame}, (x, y), (x+w, y+h), (255, 0, 0), 2)
     $$
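Steps 2 and 3 can be reproduced on a toy array to see why the opening removes isolated skin-colored specks. This is a NumPy sketch under simplifying assumptions: `in_range`, `erode` and `dilate` are hypothetical stand-ins for `cv2.inRange`, `cv2.erode` and `cv2.dilate`, they use a $3 \times 3$ square kernel and one iteration instead of the $11 \times 11$ ellipse and two iterations used in `detect_faces`, and pixels outside the image count as black (OpenCV's default border handling differs).

```python
import numpy as np

def in_range(hsv, lower, upper):
    """NumPy version of cv2.inRange: 255 where every channel is in [lower, upper], else 0."""
    inside = np.all((hsv >= lower) & (hsv <= upper), axis=-1)
    return inside.astype(np.uint8) * 255

def _shifted_windows(mask, k):
    """Stack every k x k neighbourhood shift of the mask (zero padding at the borders)."""
    r = k // 2
    padded = np.pad(mask, r, mode='constant')
    H, W = mask.shape
    return np.stack([padded[dy:dy + H, dx:dx + W] for dy in range(k) for dx in range(k)])

def erode(mask, k=3):
    # A pixel stays white only if its whole k x k neighbourhood is white
    return _shifted_windows(mask, k).min(axis=0)

def dilate(mask, k=3):
    # A pixel becomes white if any pixel in its k x k neighbourhood is white
    return _shifted_windows(mask, k).max(axis=0)

# A 7x7 skin-colored patch plus one isolated skin-colored pixel
hsv = np.zeros((12, 12, 3), dtype=np.uint8)
hsv[2:9, 2:9] = (10, 120, 200)
hsv[10, 10] = (10, 120, 200)

mask = in_range(hsv, np.array([0, 48, 80]), np.array([20, 255, 255]))
opened = dilate(erode(mask))  # opening = erosion, then dilation
print(mask.sum() // 255, opened.sum() // 255)
```

The $7 \times 7$ patch survives the opening unchanged, while the single-pixel speck disappears.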

### Outputs
- **faces**: A list of tuples, each containing the coordinates and dimensions of a detected face.
- **frame**: The processed frame with rectangles drawn around detected faces.
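The filtering in step 5 reduces to a simple predicate on each bounding box, which produces the `faces` list described above. In this pure-Python sketch, `filter_faces` is a hypothetical helper, and each candidate carries a precomputed area standing in for `cv2.boundingRect(contour)` plus `cv2.contourArea(contour)`; the default thresholds are the ones used in Game 2.

```python
def filter_faces(candidates, min_area=300, max_area=1000, aspect_ratio_range=(0.5, 2)):
    """Keep (x, y, w, h) boxes whose aspect ratio and area fall strictly inside the ranges.

    `candidates` is a list of (x, y, w, h, area) tuples, a stand-in for
    cv2.boundingRect(contour) combined with cv2.contourArea(contour).
    """
    lo, hi = aspect_ratio_range
    faces = []
    for x, y, w, h, area in candidates:
        ar = w / float(h)
        if lo < ar < hi and min_area < area < max_area:
            faces.append((x, y, w, h))
    return faces

candidates = [
    (10, 20, 25, 30, 600),   # AR 0.83, area 600  -> kept
    (50, 20, 80, 20, 900),   # AR 4.0             -> too wide
    (90, 40, 15, 15, 150),   # area 150           -> too small
    (120, 5, 40, 35, 1200),  # area 1200          -> too large
]
print(filter_faces(candidates))
```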

## Mathematical Explanation of the Face Comparison Process

### Inputs:
- Let `face1` be the first face image.
- Let `face2` be the second face image.

### Steps and Mathematical Operations

1. **Resize Faces to a Fixed Size**:
   - Resize both face images to $100 \times 100$ pixels.
   $$
   \text{face1\_resized} = \text{resize}(\text{face1}, (100, 100))
   $$
   $$
   \text{face2\_resized} = \text{resize}(\text{face2}, (100, 100))
   $$

2. **Calculate the Absolute Difference**:
   - Compute the absolute difference between the corresponding pixels of the two resized face images.
   $$
   \text{difference} = \text{absdiff}(\text{face1\_resized}, \text{face2\_resized})
   $$
   This operation results in a matrix where each element represents the absolute difference between the corresponding pixels of the two images.

3. **Sum the Differences**:
   - Sum all the elements in the difference matrix (over every row $i$, column $j$, and color channel) to obtain a similarity score.
   $$
   \text{similarity\_score} = \sum_{i,j} \text{difference}(i,j)
   $$

### Outputs
- **similarity_score**: The sum of absolute differences (SAD) between the two resized images. Despite its name, it behaves like a distance: a lower score indicates higher similarity, and two identical images score 0.
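A small NumPy sketch of the sum of absolute differences, mainly to show one pitfall: `cv2.absdiff` returns $|a - b|$ directly, but subtracting two `uint8` arrays in NumPy wraps around, so the arrays must be widened first. `resize_nearest` and `sad_score` are hypothetical helpers, and nearest-neighbour resizing is a simplification (`cv2.resize` defaults to bilinear interpolation), so the scores will not match `compare_faces` exactly.

```python
import numpy as np

def resize_nearest(img, size=(100, 100)):
    """Nearest-neighbour resize to size=(width, height), a simple stand-in for cv2.resize."""
    w, h = size
    rows = (np.arange(h) * img.shape[0] / h).astype(int)
    cols = (np.arange(w) * img.shape[1] / w).astype(int)
    return img[rows][:, cols]

def sad_score(face1, face2, size=(100, 100)):
    """Sum of absolute differences after resizing both crops to the same size."""
    a = resize_nearest(face1, size).astype(np.int32)  # widen first: uint8 subtraction wraps
    b = resize_nearest(face2, size).astype(np.int32)
    return int(np.abs(a - b).sum())

face = np.full((40, 30, 3), 100, dtype=np.uint8)
brighter = np.full((50, 35, 3), 110, dtype=np.uint8)
print(sad_score(face, face), sad_score(face, brighter))

# Why the widening matters: plain uint8 subtraction wraps around instead of going negative
naive_wrapped = int((np.full(2, 100, np.uint8) - np.full(2, 110, np.uint8)).sum())
print(naive_wrapped)
```

Two uniform crops that differ by 10 gray levels score $100 \cdot 100 \cdot 3 \cdot 10 = 300000$, while the naive `uint8` subtraction yields 246 per pixel instead of 10.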


## Mathematical Explanation of Game 2: "find_two_look_alike"

### Inputs:
- **Video Capture**: 
  - Let `cap` be the video capture object for the video file 'WiiPlay.mp4'.
- **HOG Descriptor for People Detection**:
  - Let `hog` be the HOG descriptor initialized for people detection.
- **Frame Range**:
  - Let `start_frame` = 2180 be the starting frame.
  - Let `end_frame` = 2380 be the ending frame.
- **HSV Color Range for Skin Detection**:
  - Let `lower_skin` = [0, 48, 80] be the lower bound of the HSV color range for skin detection.
  - Let `upper_skin` = [20, 255, 255] be the upper bound of the HSV color range for skin detection.
- **Area and Aspect Ratio Range**:
  - Let `min_area` = 300 be the minimum area of a detected face.
  - Let `max_area` = 1000 be the maximum area of a detected face. 
  - Let `aspect_ratio_range` = (0.5, 2) be the range of acceptable aspect ratios for detected faces.

### Steps and Mathematical Operations

1. **Video Capture Initialization**:
   - Initialize video capture and set the starting frame.
   $$
   \text{cap.set}(\text{cv2.CAP\_PROP\_POS\_FRAMES}, \text{start\_frame})
   $$

2. **People Detection Using HOG Descriptor**:
   - For each frame, detect people using the HOG descriptor.
   $$
   \text{boxes}, \text{weights} = \text{hog.detectMultiScale}(\text{frame}, \text{winStride}=(8, 8))
   $$
   - Convert the bounding boxes from $(x, y, w, h)$ into corner format $[x, y, x + w, y + h]$ (top-left and bottom-right corners).

3. **Face Detection Using HSV Color Range**:
   - Convert the frame to HSV color space.
   $$
   \text{hsv} = \text{cvtColor}(\text{frame}, \text{COLOR\_BGR2HSV})
   $$
   - Create a binary mask based on the skin color range.
   $$
   \text{mask} = \text{inRange}(\text{hsv}, \text{lower\_skin}, \text{upper\_skin})
   $$
   - Apply morphological operations (erosion, dilation) and Gaussian blur to refine the mask.
   $$
   \text{mask} = \text{erode}(\text{mask}, \text{kernel}, \text{iterations} = 2)
   $$
   $$
   \text{mask} = \text{dilate}(\text{mask}, \text{kernel}, \text{iterations} = 2)
   $$
   $$
   \text{mask} = \text{GaussianBlur}(\text{mask}, (3, 3), 0)
   $$
   - Find contours in the mask and filter based on area and aspect ratio.
   $$
   \text{contours}, \_ = \text{findContours}(\text{mask}, \text{RETR\_EXTERNAL}, \text{CHAIN\_APPROX\_SIMPLE})
   $$

4. **Face Comparison**:
   - For each pair of detected faces, calculate the similarity score using the sum of absolute differences.
   $$
   \text{difference} = \text{absdiff}(\text{face1\_resized}, \text{face2\_resized})
   $$
   $$
   \text{similarity\_score} = \sum_{i,j} \text{difference}(i,j)
   $$
   - Identify the most similar pair of faces based on the lowest similarity score.

5. **Draw Rectangles Around Detected Faces**:
   - Detected faces are drawn in blue by `detect_faces`; the most similar pair is drawn in red:
   $$
   \text{rectangle}(\text{frame}, (x1, y1), (x1+w1, y1+h1), (0, 0, 255), 2)
   $$
   $$
   \text{rectangle}(\text{frame}, (x2, y2), (x2+w2, y2+h2), (0, 0, 255), 2)
   $$

6. **Display the Frame**:
   - Display the frame with detected faces.
   $$
   \text{cv2.imshow}('find\_two\_look\_alike', \text{frame})
   $$
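Steps 2 and 4 can be sketched in pure Python. `to_corners` performs the $(x, y, w, h) \to [x, y, x + w, y + h]$ conversion, and `most_similar_pair` visits each of the $n(n-1)/2$ unordered pairs once, like the nested `i < j` loops in the Game 2 code. Both names are hypothetical, and the toy "faces" are plain numbers scored with `abs(a - b)`, standing in for image crops scored with `compare_faces`.

```python
from itertools import combinations

def to_corners(boxes):
    """Convert (x, y, w, h) boxes to (x1, y1, x2, y2) corner form."""
    return [(x, y, x + w, y + h) for (x, y, w, h) in boxes]

def most_similar_pair(faces, score):
    """Return (i, j, best_score) for the pair with the lowest score, or None if < 2 faces."""
    best = None
    for i, j in combinations(range(len(faces)), 2):  # every unordered pair once, i < j
        s = score(faces[i], faces[j])
        if best is None or s < best[2]:
            best = (i, j, s)
    return best

print(to_corners([(10, 20, 30, 40)]))
faces = [10, 50, 12, 90]
print(most_similar_pair(faces, lambda a, b: abs(a - b)))
```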

### Outputs
- **Detected Faces**: Bounding boxes of detected faces.
- **Most Similar Pair**: Bounding boxes of the most similar pair of faces.
- **Displayed Frame**: Frame with rectangles drawn around detected faces and the most similar pair.

### Requirements

2A. Input images from video file WiiPlay.mp4 with level 8 (frame number between 2180 and 2380).<br>\
2B. (5pts) Detect <b>pedestrians</b> on each frame and draw a <b>green</b> rectangle around your detection.<br>\
2C. (5pts) Detect <b>faces</b> on each frame and draw a <b>blue</b> rectangle around your detection.<br>\
2D. (10pts) Try to find two faces that look like each other, draw a <b>red</b> rectangle around each of the two faces, and show the output images in the <b>"find_two_look_alike"</b> window.<br><br>

```python
#game_2 : "find_two_look_alike"

import cv2
import numpy as np

def detect_faces(frame, lower_skin, upper_skin, min_area, max_area, aspect_ratio_range):
    """
    Detects faces in the given frame based on skin color and filters them by area and aspect ratio.

    Args:
        frame (numpy.ndarray): The image frame in which to detect faces.
        lower_skin (numpy.ndarray): The lower bound of the HSV color range for skin detection.
        upper_skin (numpy.ndarray): The upper bound of the HSV color range for skin detection.
        min_area (int): The minimum area of a detected face.
        max_area (int): The maximum area of a detected face.
        aspect_ratio_range (tuple): The range of acceptable aspect ratios for detected faces.

    Returns:
        tuple: A tuple containing a list of detected faces and the processed frame.
    """

    # Convert the frame to the HSV color space
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # Create a binary mask of the skin
    mask = cv2.inRange(hsv, lower_skin, upper_skin)

    # Create a structuring element for morphological operations
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))

    # Erode and dilate the mask (an opening) to remove noise
    mask = cv2.erode(mask, kernel, iterations=2)
    mask = cv2.dilate(mask, kernel, iterations=2)

    # Apply Gaussian blur to smooth the mask
    mask = cv2.GaussianBlur(mask, (3, 3), 0)

    # Find contours in the mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    faces = []
    for contour in contours:
        # Get the bounding rectangle for each contour
        x, y, w, h = cv2.boundingRect(contour)
        aspect_ratio = w / float(h)
        area = cv2.contourArea(contour)

        # Filter contours based on aspect ratio and area
        if aspect_ratio_range[0] < aspect_ratio < aspect_ratio_range[1] and min_area < area < max_area:
            faces.append((x, y, w, h))

            # Draw a blue rectangle around the detected face
            cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

    return faces, frame

def compare_faces(face1, face2):
    """
    Compares two face images and returns the sum of absolute differences as a similarity score.

    Args:
        face1 (numpy.ndarray): The first face image.
        face2 (numpy.ndarray): The second face image.

    Returns:
        int: The sum of absolute differences between the two images (lower = more similar).
    """

    # Resize faces to a fixed size
    face1_resized = cv2.resize(face1, (100, 100))
    face2_resized = cv2.resize(face2, (100, 100))

    # Calculate the absolute difference between the two faces
    difference = cv2.absdiff(face1_resized, face2_resized)

    # Sum the differences to get a similarity score
    return np.sum(difference)

cap = cv2.VideoCapture('WiiPlay.mp4')

# Initialize HOG descriptor for people detection
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cv2.startWindowThread()

# Set the range of frames to process
start_frame = 2180
end_frame = 2380

cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
current_frame = start_frame

# Define HSV color range for skin detection
lower_skin = np.array([0, 48, 80], dtype='uint8')
upper_skin = np.array([20, 255, 255], dtype='uint8')
min_area = 300
max_area = 1000
aspect_ratio_range = (0.5, 2)

while True:
    ret, frame = cap.read()

    # Break the loop if the frame is not read correctly or the end frame is reached
    if not ret or current_frame > end_frame:
        break

    # Unannotated copy: detections and face crops must not include drawn rectangles
    clean = frame.copy()

    # 2C: detect faces on *this* frame (draws blue rectangles on `frame`)
    faces, frame = detect_faces(frame, lower_skin, upper_skin, min_area, max_area, aspect_ratio_range)

    # 2B: detect people with HOG and draw green rectangles
    boxes, weights = hog.detectMultiScale(clean, winStride=(8, 8))
    boxes = [(x, y, x + w, y + h) for (x, y, w, h) in boxes]
    for (xA, yA, xB, yB) in boxes:
        cv2.rectangle(frame, (int(xA), int(yA)), (int(xB), int(yB)), (0, 255, 0), 2)

    # 2D: compare every pair of faces once per frame (not once per pedestrian box)
    if len(faces) >= 2:
        min_difference = float('inf')
        best_pair = None

        for i in range(len(faces)):
            for j in range(i + 1, len(faces)):
                x1, y1, w1, h1 = faces[i]
                x2, y2, w2, h2 = faces[j]
                face1 = clean[y1:y1 + h1, x1:x1 + w1]
                face2 = clean[y2:y2 + h2, x2:x2 + w2]

                # Lower score = more similar
                difference = compare_faces(face1, face2)
                if difference < min_difference:
                    min_difference = difference
                    best_pair = (i, j)

        # Draw red rectangles around the most similar pair of faces
        if best_pair is not None:
            for idx in best_pair:
                x, y, w, h = faces[idx]
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)

    # Display the frame
    cv2.imshow('find_two_look_alike', frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    current_frame += 1

cap.release()
cv2.destroyAllWindows()
```

### Requirements

3A. Input images from video file WiiPlay.mp4 with level 9 (frame number between 2480 and 2600).<br>\
3B. (5pts) <b>Detect</b> faces (or pedestrians) on the first frame and draw a <b>blue</b> rectangle around your detection.<br>\
3C. (10pts) <b>Track</b> faces (or pedestrians) on subsequent frames and draw a <b>green</b> rectangle around your tracking.<br>\
3D. (5pts) Try to find out the fastest character, draw a <b>red</b> rectangle around the fastest character, and show the output images in the <b>"find_the_fastest_character"</b> window.<br><br>
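The Game 3 code measures speed as the per-frame displacement of each tracked box and keeps the largest single value, which one tracker jump can dominate. As an alternative criterion (an assumption, not the method used in this notebook), the pure-Python sketch below ranks characters by the *mean* displacement of the box center per frame. `fastest_track` is a hypothetical helper and the tracks are made-up toy data.

```python
import math

def fastest_track(tracks):
    """tracks: one list of (x, y, w, h) boxes per character, one box per frame.

    Returns (index, mean_speed) of the character whose box center moves the
    most pixels per frame on average.
    """
    best_idx, best_speed = None, -1.0
    for idx, boxes in enumerate(tracks):
        centers = [(x + w / 2, y + h / 2) for x, y, w, h in boxes]
        steps = [math.dist(p, q) for p, q in zip(centers, centers[1:])]
        if not steps:  # a single box gives no speed estimate
            continue
        speed = sum(steps) / len(steps)
        if speed > best_speed:
            best_idx, best_speed = idx, speed
    return best_idx, best_speed

tracks = [
    [(0, 0, 10, 20), (2, 0, 10, 20), (4, 0, 10, 20)],        # 2 px/frame
    [(50, 50, 10, 20), (50, 55, 10, 20), (50, 60, 10, 20)],  # 5 px/frame
    [(80, 10, 10, 20), (86, 18, 10, 20)],                    # 10 px in one step
]
print(fastest_track(tracks))
```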

```python
#game_3 : "find_the_fastest_character"
import cv2
import numpy as np

cap = cv2.VideoCapture('WiiPlay.mp4')

start_frame = 2480
end_frame = 2600

cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

ret, frame = cap.read()
if not ret:
    raise RuntimeError("Failed to read the video")

# 3B: detect pedestrians on the first frame only
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
boxes = [tuple(int(v) for v in box) for box in boxes]

# One MIL tracker per detection (cv2.legacy requires opencv-contrib-python).
# Trackers are initialised on the clean frame, before any rectangle is drawn on it.
trackers = cv2.legacy.MultiTracker_create()
for box in boxes:
    trackers.add(cv2.legacy.TrackerMIL_create(), frame, box)

# Show the first-frame detections in blue
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
cv2.imshow('find_the_fastest_character', frame)
cv2.waitKey(1)

current_frame = start_frame + 1
fastest_speed = 0
fastest_index = None  # index of the tracker with the highest per-frame speed so far
prev_position = list(boxes)

while current_frame <= end_frame:
    ret, frame = cap.read()
    if not ret:
        break

    success, tracked_boxes = trackers.update(frame)
    if success and len(tracked_boxes) > 0:
        for i, box in enumerate(tracked_boxes):
            x, y, w, h = [int(v) for v in box]
            prev_x, prev_y, _, _ = prev_position[i]

            # Speed = Euclidean displacement of the box's top-left corner (pixels per frame)
            speed = ((x - prev_x) ** 2 + (y - prev_y) ** 2) ** 0.5
            prev_position[i] = (x, y, w, h)
            if speed > fastest_speed:
                fastest_speed = speed
                fastest_index = i

            # 3C: tracked characters in green
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

        # 3D: the fastest character in red, drawn at its *current* tracked position
        if fastest_index is not None:
            x, y, w, h = [int(v) for v in tracked_boxes[fastest_index]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)

    # Frame counter in the top-right corner
    text = f'Current_Frame: {current_frame}'
    text_size = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 1, 2)[0]
    text_x = frame.shape[1] - text_size[0] - 10
    text_y = 30
    cv2.putText(frame, text, (text_x, text_y), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)

    cv2.imshow('find_the_fastest_character', frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    current_frame += 1

cap.release()
cv2.destroyAllWindows()
```

### Requirements
4A. Input images from video file WiiPlay.mp4 with level 6 (frame number between 1650 and 1800).<br>\
4B. (10pts) Compute and show <b>optical flows</b> on each frame using <b>blue</b> arrows.<br>\
4C. (5pts) Try to detect the two odd characters who face the opposite direction from everyone else, draw a <b>red</b> rectangle around each of the two characters, and show the output images in the <b>"find_two_odds"</b> window.<br><br>
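Direction outliers are easiest to find on the circle of angles. Flow vectors pointing left have angles near $+\pi$ or $-\pi$, so a plain median or absolute difference of raw `arctan2` angles can make the majority look like outliers. A minimal NumPy sketch, assuming the flow has already been sampled on a grid as $(dx, dy)$ vectors: `odd_flow_points` (a hypothetical helper) compares each angle with the circular mean and wraps the difference into $[0, \pi]$.

```python
import numpy as np

def odd_flow_points(points, flow_vectors, n_odd=2):
    """Return the n_odd points whose flow direction deviates most from the crowd.

    points: (N, 2) array of (x, y); flow_vectors: (N, 2) array of (dx, dy).
    """
    angles = np.arctan2(flow_vectors[:, 1], flow_vectors[:, 0])
    # Circular mean: average the unit vectors, not the raw angles
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    # Wrap each difference into (-pi, pi] before taking its magnitude
    deviation = np.abs(np.angle(np.exp(1j * (angles - mean_angle))))
    order = np.argsort(deviation)[::-1][:n_odd]
    return points[order], deviation[order]

# Five points drift left with slight vertical jitter (angles straddle +/-pi), two move right
points = np.array([[10, 10], [20, 10], [30, 10], [40, 10], [50, 10], [60, 10], [70, 10]])
flow = np.array([[-3, 0.1], [-3, -0.1], [-3, 0.1], [-3, -0.1], [-3, 0.1], [3, 0.0], [3, 0.2]])
odd_pts, odd_dev = odd_flow_points(points, flow)
print(odd_pts.tolist(), np.round(odd_dev, 2))
```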

```python
#game_4 : "find_two_odds"
import cv2
import numpy as np

def display_flow(img, flow, stride=10, min_sep=30):
    height, width = img.shape[:2]
    flow_points = []  # (x, y, direction) of every sampled point with significant motion

    for y in range(0, height, stride):
        for x in range(0, width, stride):
            fx, fy = flow[y, x]  # Farneback flow is stored as (dx, dy)
            if np.hypot(fx, fy) > 2:  # Consider only significant flows
                flow_points.append((x, y, np.arctan2(fy, fx)))
                pt2 = (int(x + 2 * fx), int(y + 2 * fy))  # arrow scaled 2x for visibility
                cv2.arrowedLine(img, (x, y), pt2, (255, 0, 0), 1, cv2.LINE_AA, 0, 0.1)

    if len(flow_points) > 2:
        directions = np.array([p[2] for p in flow_points])
        # Circular mean direction; a plain median/abs difference breaks where angles wrap at +/-pi
        mean_dir = np.arctan2(np.sin(directions).mean(), np.cos(directions).mean())
        deviations = np.abs(np.angle(np.exp(1j * (directions - mean_dir))))

        # Pick the two most deviating points, skipping points within min_sep pixels
        # of one already picked so both boxes do not land on the same character
        picked = []
        for idx in np.argsort(deviations)[::-1]:
            x, y, _ = flow_points[idx]
            if all(np.hypot(x - px, y - py) >= min_sep for px, py in picked):
                picked.append((x, y))
            if len(picked) == 2:
                break

        for x, y in picked:
            cv2.rectangle(img, (x - 15, y - 15), (x + 15, y + 15), (0, 0, 255), 2)  # red rectangle

    norm_opt_flow = np.linalg.norm(flow, axis=2)
    norm_opt_flow = cv2.normalize(norm_opt_flow, None, 0, 1, cv2.NORM_MINMAX)
    cv2.imshow('find_two_odds', img)
    cv2.imshow('optical flow magnitude', norm_opt_flow)

    # Return 1 when the user presses q so the caller can stop
    return 1 if cv2.waitKey(10) & 0xFF == ord('q') else 0

cap = cv2.VideoCapture("WiiPlay.mp4")
start_frame = 1650
end_frame = 1800

cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
status_cap, prev_frame = cap.read()
if not status_cap:
    raise RuntimeError("Failed to read the video")
# Work at half resolution to speed up dense optical flow
prev_frame = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
prev_frame = cv2.resize(prev_frame, (0, 0), None, 0.5, 0.5)
first_frame = True

current_frame = start_frame + 1
fps = 0

while current_frame <= end_frame:
    status_cap, frame = cap.read()
    # Check the read status before touching the frame (it is None at the end of the video)
    if not status_cap:
        break
    frame = cv2.resize(frame, (0, 0), None, 0.5, 0.5)

    timer = cv2.getTickCount()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Farneback: pyr_scale=0.5, levels=5, winsize=13, iterations=10, poly_n=5, poly_sigma=1.1
    if first_frame:
        opt_flow = cv2.calcOpticalFlowFarneback(prev_frame, gray, None, 0.5, 5, 13, 10, 5, 1.1, cv2.OPTFLOW_FARNEBACK_GAUSSIAN)
        first_frame = False
    else:
        # Reuse the previous flow as the initial estimate
        opt_flow = cv2.calcOpticalFlowFarneback(prev_frame, gray, opt_flow, 0.5, 5, 13, 10, 5, 1.1, cv2.OPTFLOW_USE_INITIAL_FLOW)

    fps = cv2.getTickFrequency() / (cv2.getTickCount() - timer)
    cv2.putText(frame, f'FPS: {int(fps)}', (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
    prev_frame = np.copy(gray)

    # display_flow also handles the key press, so no second waitKey is needed here
    if display_flow(frame, opt_flow):
        break

    current_frame += 1

cap.release()
cv2.destroyAllWindows()
```

### Requirements

5A. Input continuous images from 'car.mp4'.<br>\
5B. (6pts) For each frame, detect every car using YOLOv8 trained data 'yolov8n.pt'. (mark with red rectangles)<br>\
5C. (6pts) For each car, detect a licence plate using 'license_plate_detector.pt'. (mark with blue rectangle)<br>\
5D. (6pts) For each licence plate, OCR using Tesseract. Print the recognized licence plate number above each detected licence plate. (putText() in green color)<br>\
5E. (12pts) Use whatever you learned this semester to improve the result. Write a simple report on your method and observations.<br><br>
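Three post-processing ideas that fit requirements 5C and 5E, sketched in pure Python under assumptions: plates are matched to cars by checking whether the plate's center lies inside a car box (the notebook runs the plate detector on the whole frame rather than per car), OCR output is reduced to upper-case letters and digits, and readings of the same plate across frames are combined by majority vote. `clean_plate_text`, `assign_plates_to_cars` and `vote` are hypothetical helpers, and the example readings are taken from the log below. Voting only picks the most frequent reading; it cannot recover the true plate if Tesseract is consistently wrong.

```python
import re
from collections import Counter

def clean_plate_text(raw):
    """Keep only upper-case letters and digits (assumes plates use just these)."""
    return re.sub(r'[^A-Z0-9]', '', raw.upper())

def assign_plates_to_cars(car_boxes, plate_boxes):
    """Map each plate index to the first car box (x1, y1, x2, y2) containing the plate's center."""
    assignment = {}
    for p_idx, (px1, py1, px2, py2) in enumerate(plate_boxes):
        cx, cy = (px1 + px2) / 2, (py1 + py2) / 2
        for c_idx, (x1, y1, x2, y2) in enumerate(car_boxes):
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                assignment[p_idx] = c_idx
                break  # plates whose center is in no car box are left unassigned
    return assignment

def vote(readings):
    """Majority vote over the cleaned OCR readings of one plate across frames."""
    cleaned = [c for c in (clean_plate_text(r) for r in readings) if c]
    return Counter(cleaned).most_common(1)[0][0] if cleaned else ''

print(clean_plate_text('“RNAI NRU'))
print(vote(['Pnesivsu', 'Pes vsu)', 'Pnesivsu']))
print(assign_plates_to_cars([(0, 0, 100, 100), (200, 0, 300, 100)],
                            [(210, 70, 260, 90), (400, 400, 420, 410)]))
```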

```python
import cv2
import numpy as np
from ultralytics import YOLO
import pytesseract

def draw_annotations(frame, annotations, color, font_scale=0.9):
    """Draw (x1, y1, x2, y2, text) boxes in `color`, with the text in green above each box."""
    for (x1, y1, x2, y2, text) in annotations:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
        if text:
            cv2.putText(frame, text, (int(x1), int(y1) - 10), cv2.FONT_HERSHEY_SIMPLEX, font_scale, (0, 255, 0), 2)

def preprocess_image(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Apply GaussianBlur to remove noise
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Apply Otsu thresholding to get a binary image
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Morphological closing fills small gaps in the characters
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return binary

pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# Load models
car_detector = YOLO('yolov8n.pt')
license_plate_detector = YOLO('license_plate_detector.pt')

# Load video
cap = cv2.VideoCapture('car.mp4')

# COCO class IDs of the vehicle classes in yolov8n.pt: 2 = car, 3 = motorcycle, 5 = bus, 7 = truck
vehicles = [2, 3, 5, 7]

# Tesseract configuration: default OCR engine mode (--oem 3), treat the crop as a single word (--psm 8)
custom_config = r'--oem 3 --psm 8'

# Read frames
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Detect vehicles
    car_detections = car_detector(frame)[0]
    car_annotations = []
    for detection in car_detections.boxes.data.tolist():
        x1, y1, x2, y2, score, class_id = detection
        if int(class_id) in vehicles:
            car_annotations.append((x1, y1, x2, y2, f'Car: {int(score * 100)}%'))

    # Detect license plates (on the whole frame) and read each one with Tesseract
    license_plate_detections = license_plate_detector(frame)[0]
    license_plate_annotations = []
    for license_plate in license_plate_detections.boxes.data.tolist():
        x1, y1, x2, y2, score, class_id = license_plate
        license_plate_crop = frame[int(y1):int(y2), int(x1):int(x2), :]
        if license_plate_crop.size == 0:  # skip degenerate boxes
            continue
        processed_image = preprocess_image(license_plate_crop)

        # Run OCR once per plate and reuse the result for drawing and printing
        license_plate_text = pytesseract.image_to_string(processed_image, config=custom_config).strip()
        license_plate_annotations.append((x1, y1, x2, y2, license_plate_text))
        print(f'License Plate: {license_plate_text}')

    draw_annotations(frame, car_annotations, (0, 0, 255), font_scale=0.9)  # Red rectangles for cars
    draw_annotations(frame, license_plate_annotations, (255, 0, 0), font_scale=2)  # Blue rectangles for license plates

    frame_resized = cv2.resize(frame, (800, 450))
    cv2.imshow('License Plate Detection', frame_resized)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```


```
0: 384x640 21 cars, 1 bus, 2 trucks, 94.6ms
Speed: 5.9ms preprocess, 94.6ms inference, 2.6ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plates, 90.9ms
Speed: 4.3ms preprocess, 90.9ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)
License Plate: Pnesivsu
License Plate: “RNAI NRU

0: 384x640 21 cars, 1 bus, 2 trucks, 106.1ms
Speed: 4.9ms preprocess, 106.1ms inference, 1.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plates, 108.2ms
Speed: 3.1ms preprocess, 108.2ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)
License Plate: Pnesivsu
License Plate: “RNAI NRU

0: 384x640 22 cars, 1 bus, 2 trucks, 97.5ms
Speed: 3.6ms preprocess, 97.5ms inference, 1.0ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plates, 81.6ms
Speed: 3.5ms preprocess, 81.6ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)
License Plate: Pes vsu)
License Plate: JENA NR

0: 384x640 21 cars, 1 
```

6. (5pts) Any comments regarding the final exam? Which steps do you believe you have completed? Which steps bother you?<br> 
7. (5pts) Any suggestion to teaching assistants to improve this class? Any suggestion to teacher to improve this class?<br>


### My Answer

6. In the final exam, I completed all of the first question, and I was left with the most difficult functions, which have not yet been optimized.

   There are five questions in total. I think the most difficult part of questions 2 to 5 is dealing with the video frames themselves: in addition to basic computer vision techniques such as filtering, noise reduction, edge detection, and morphology, we also need to overcome problems in the footage itself, for example by improving saturation and using related techniques, so that we can present the best results.

7. After a semester, I think the TA system of Advanced Computer Vision works very well; basically every TA has helped me grasp the key knowledge of this class.

   As a suggestion for this class, the final exam could have students team up to participate in the CVPR Data CV Challenge, which is organized by a well-known computer vision workshop at CVPR, to test whether they can produce good computer vision work in a limited time.

   
   - [CVPR](https://sites.google.com/view/vdu-cvpr24/home)

## Reference
- [OpenCV Tutorial](https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html)

- [Template Matching function reference code and theory](https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_imgproc/py_template_matching/py_template_matching.html)

- [OpenCV selectROI function reference code and theory](https://www.geeksforgeeks.org/python-opencv-selectroi-function/)

- [detect_faces function reference code and theory in GitHub](https://github.com/ageitgey/face_recognition/blob/master/examples/facerec_from_video_file.py)

- [compare_faces function reference code and theory](https://stackoverflow.com/questions/23195522/opencv-fastest-method-to-check-if-two-images-are-100-same-or-not)

```python
import cv2
import pytesseract
import numpy as np
from ultralytics import YOLO

# Function to draw bounding boxes and text
def draw_annotations(frame, annotations, color):
    for (x1, y1, x2, y2, text) in annotations:
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        if text:
            # Filled black background sized to the text height
            (w, h), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 1.5, 3)
            cv2.rectangle(frame, (x1, y1 - h - 15), (x1 + w, y1), (0, 0, 0), -1)
            # Put the green text on top of the background
            cv2.putText(frame, text, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 3)

# Function to preprocess image for OCR
def preprocess_image(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    enhanced = cv2.equalizeHist(gray)                    # stretch contrast
    filtered = cv2.bilateralFilter(enhanced, 9, 75, 75)  # smooth noise while keeping character edges
    _, binary = cv2.threshold(filtered, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=1)    # dilate then erode = closing
    binary = cv2.erode(binary, kernel, iterations=1)
    return binary

# Specify the Tesseract executable path
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# Load models
car_detector = YOLO('yolov8n.pt')
license_plate_detector = YOLO('license_plate_detector.pt')

# Load video
cap = cv2.VideoCapture('car.mp4')

# COCO class IDs of the vehicle classes in yolov8n.pt: 2 = car, 3 = motorcycle, 5 = bus, 7 = truck
vehicles = [2, 3, 5, 7]

frame_count = 0

# Read frames
while True:
    ret, frame = cap.read()
    if not ret:
        break

    frame_count += 1

    # Process every 5th frame to reduce computation
    if frame_count % 5 != 0:
        continue

    # Detect vehicles
    car_detections = car_detector(frame)[0]
    car_annotations = []
    for detection in car_detections.boxes.data.tolist():
        x1, y1, x2, y2, score, class_id = detection
        if int(class_id) in vehicles:
            car_annotations.append((x1, y1, x2, y2, f'Car: {int(score * 100)}%'))

    # Detect license plates
    license_plate_detections = license_plate_detector(frame)[0]
    license_plate_annotations = []
    processed_image = None  # binarized crop of the last plate read on this frame
    for license_plate in license_plate_detections.boxes.data.tolist():
        x1, y1, x2, y2, score, class_id = license_plate
        license_plate_crop = frame[int(y1):int(y2), int(x1):int(x2), :]
        if license_plate_crop.size == 0:  # skip degenerate boxes
            continue
        processed_image = preprocess_image(license_plate_crop)
        license_plate_text = pytesseract.image_to_string(processed_image, config='--psm 8').strip()
        license_plate_annotations.append((x1, y1, x2, y2, license_plate_text))

    draw_annotations(frame, car_annotations, (0, 0, 255))  # Red rectangles for cars
    draw_annotations(frame, license_plate_annotations, (255, 0, 0))  # Blue rectangles for license plates

    frame_resized = cv2.resize(frame, (800, 450))
    cv2.imshow('Frame', frame_resized)

    # Show the OCR input of the last plate, only if a plate was found on this frame
    if processed_image is not None:
        cv2.imshow('License Plate Detection', processed_image)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```



```
0: 384x640 22 cars, 1 bus, 2 trucks, 93.4ms
Speed: 3.3ms preprocess, 93.4ms inference, 1.1ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plates, 84.6ms
Speed: 2.5ms preprocess, 84.6ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 22 cars, 1 bus, 2 trucks, 134.5ms
Speed: 9.1ms preprocess, 134.5ms inference, 1.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plates, 89.1ms
Speed: 3.4ms preprocess, 89.1ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 24 cars, 1 bus, 2 trucks, 100.4ms
Speed: 6.5ms preprocess, 100.4ms inference, 1.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plates, 72.7ms
Speed: 2.5ms preprocess, 72.7ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 23 cars, 1 bus, 2 trucks, 95.1ms
Speed: 4.4ms preprocess, 95.1ms inference, 1.0ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 license_plat
```