
# Multiple Object Tracking

---

Multi-Object Tracking (MOT) is a core visual ability that humans possess and rely on to perform kinetic tasks and to coordinate other tasks. The AI community has recognized the importance of MOT through a series of competitions.

The ability to reason even in the absence of perception input was highlighted in Lecture 1 using a document camera and a canopy type of occlusion, where an object moves below the canopy. In this assignment, the object class is `ball`, and the ability to reason over time will be demonstrated using Kalman filters. There will be two cases of occlusion: occlusion by a different object and occlusion by the same object (a typical case of the latter is tracking people in crowds).

---


## Task 1: Understand the problem and set up the environment (20 points)

The problem is best described by the videos below, which are the raw source files for this assignment:

1. Single object Tracking - ball.mp4
2. Multiple object Tracking - multiple_ball.avi


I downloaded both files and watched both videos.

---

## Task 2: Object Detector (40 points)

In this task you will use a CNN-based object detector to draw a bounding box around every `ball` instance in each frame. Because the educational value is not in object detection itself, you are allowed to use an object detector of your choice trained to distinguish the `ball` class. You are free to use a pre-trained model (e.g., one trained on MS COCO, which contains the class `sports ball`) or to train a model yourself. Ensure that you explain the code thoroughly.

#### testing

In [None]:
# initialize models
import torch
import cv2
import pandas as pd
import numpy as np
model = torch.hub.load('ultralytics/yolov5','yolov5s')


Make sure the model works.

In [3]:
im = "https://ultralytics.com/images/zidane.jpg"
results = model(im)
results.pandas().xyxy[0]

|   | xmin | ymin | xmax | ymax | confidence | class | name |
|---|------|------|------|------|------------|-------|------|
| 0 | 743.290405 | 48.343658 | 1141.756592 | 720.0 | 0.879861 | 0 | person |
| 1 | 441.989624 | 437.336731 | 496.585083 | 710.036194 | 0.675119 | 27 | tie |
| 2 | 123.051117 | 193.238068 | 714.690796 | 719.77124 | 0.666693 | 0 | person |
| 3 | 978.989807 | 313.579468 | 1025.302856 | 415.526184 | 0.261517 | 27 | tie |
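
Each row of the output above is one detection. As a minimal sketch of how such rows could be turned into tracker inputs, the snippet below filters detections by class name and confidence and computes each box centre. The rows are copied (rounded) from the output above and stored as plain dictionaries; the helper names `select_detections` and `box_centre` and the 0.5 threshold are illustrative assumptions, not part of YOLOv5. For this assignment the class name would be `sports ball` rather than `person`.

```python
# Detections copied (rounded) from the zidane.jpg output above.
rows = [
    {"xmin": 743.29, "ymin": 48.34, "xmax": 1141.76, "ymax": 720.00, "confidence": 0.880, "name": "person"},
    {"xmin": 441.99, "ymin": 437.34, "xmax": 496.59, "ymax": 710.04, "confidence": 0.675, "name": "tie"},
    {"xmin": 123.05, "ymin": 193.24, "xmax": 714.69, "ymax": 719.77, "confidence": 0.667, "name": "person"},
    {"xmin": 978.99, "ymin": 313.58, "xmax": 1025.30, "ymax": 415.53, "confidence": 0.262, "name": "tie"},
]


def select_detections(rows, name, min_conf=0.5):
    """Keep detections of one class whose confidence is at least min_conf (hypothetical helper)."""
    return [r for r in rows if r["name"] == name and r["confidence"] >= min_conf]


def box_centre(row):
    """Return the (x, y) centre of a box given by xmin/ymin/xmax/ymax (hypothetical helper)."""
    return ((row["xmin"] + row["xmax"]) / 2.0, (row["ymin"] + row["ymax"]) / 2.0)


people = select_detections(rows, "person")
centres = [box_centre(r) for r in people]
print(centres)
```

The centres, rather than the full boxes, are what a tracker would typically consume as position measurements.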


--------------

#### try 1

In [None]:
vidcap = cv2.VideoCapture('ball.mp4')  # open the video with OpenCV
while vidcap.isOpened():
  ok, frame = vidcap.read()  # read the video one frame at a time
  if not ok:
    break  # end of the video (or a read error)

  # YOLOv5 expects RGB arrays, while OpenCV decodes frames as BGR
  results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
  # results.xyxy[0] is an (n, 6) tensor: x1, y1, x2, y2, confidence, class
  boxes = results.xyxy[0][:, :4]

  # draw every detected box on the original frame
  for box in boxes:
    x1, y1, x2, y2 = (int(v) for v in box.detach().cpu().numpy())
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)

  cv2.imshow('frame', frame)
  if cv2.waitKey(1) == ord('q'):  # press q to stop early
    break

vidcap.release()
cv2.destroyAllWindows()

#### try 2


In [114]:
# initialize models
import torch
import cv2
import pandas as pd
import numpy as np
model = torch.hub.load('ultralytics/yolov5','yolov5s')

cap = cv2.VideoCapture('ball.mp4')

# Create the VideoWriter with the source video's frame rate and size;
# frames whose size differs from the writer's size are not written.
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('output.avi', cv2.VideoWriter_fourcc(*'MJPG'), fps, (width, height))

def printall(boxes, frame):
    # draw the centre of every box recorded so far to show the ball's path
    for x1, y1, x2, y2 in boxes.values():
        cv2.circle(frame, ((x1 + x2) // 2, (y1 + y2) // 2), 5, (0, 0, 255), -1)

loc = {}       # frame index -> (x1, y1, x2, y2) of the ball
frame_ind = 0

model.classes = [32]  # keep only COCO class 32, "sports ball"

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # YOLOv5 expects RGB arrays, while OpenCV decodes frames as BGR
    detections = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    pred = detections.pandas().xyxy[0]  # one row per detection
    balls = pred[pred['name'] == "sports ball"]

    if not balls.empty:
        row = balls.loc[balls['confidence'].idxmax()]  # most confident ball
        loc[frame_ind] = (int(row['xmin']), int(row['ymin']),
                          int(row['xmax']), int(row['ymax']))
    elif frame_ind - 1 in loc:
        loc[frame_ind] = loc[frame_ind - 1]  # no detection: reuse the last known box

    if frame_ind in loc:
        x1, y1, x2, y2 = loc[frame_ind]
        cv2.rectangle(frame, (x1, y1), (x2, y2), color=(0, 0, 255))
        printall(loc, frame)
    out.write(frame)
    frame_ind += 1

    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release everything when the job is finished
cap.release()
out.release()
cv2.destroyAllWindows()

Using cache found in C:\Users\chang/.cache\torch\hub\ultralytics_yolov5_master
YOLOv5  2023-4-14 Python-3.10.11 torch-2.0.0+cpu CPU

Fusing layers... 

requirements: C:\Users\chang\.cache\torch\hub\requirements.txt not found, check failed.


YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
Adding AutoShape... 


## Task 3: Tracker (40 points)
---

The detector outputs can be used to obtain the centroid(s) of the `ball` instances across time. You can assign a suitable starting state in the first frame of the video and obtain the predicted trajectory of the object during both visible and occluded frames. You need to superimpose your predicted position of the object on the raw frame, for every frame, and store the resulting sequence of frames as a video. Ensure that you explain the code thoroughly.

Please note that you can use the filterpy library to implement the Kalman filter.
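
For reference, a constant-velocity motion model is a common choice for this kind of tracker; it is an assumption here, not something the assignment prescribes. With state $\mathbf{x}_k = [p_x, v_x, p_y, v_y]^\top$, time step $\Delta t$, and the detected box centre as measurement $\mathbf{z}_k$, the standard Kalman predict and update steps can be written as:

```latex
\begin{aligned}
\mathbf{F} &= \begin{bmatrix} 1 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \Delta t \\ 0 & 0 & 0 & 1 \end{bmatrix},
\qquad
\mathbf{H} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \\[4pt]
\text{predict:}\quad
\hat{\mathbf{x}}_{k|k-1} &= \mathbf{F}\,\hat{\mathbf{x}}_{k-1|k-1},
\qquad
\mathbf{P}_{k|k-1} = \mathbf{F}\,\mathbf{P}_{k-1|k-1}\,\mathbf{F}^\top + \mathbf{Q} \\[4pt]
\text{update:}\quad
\mathbf{K}_k &= \mathbf{P}_{k|k-1}\,\mathbf{H}^\top \left(\mathbf{H}\,\mathbf{P}_{k|k-1}\,\mathbf{H}^\top + \mathbf{R}\right)^{-1} \\
\hat{\mathbf{x}}_{k|k} &= \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k \left(\mathbf{z}_k - \mathbf{H}\,\hat{\mathbf{x}}_{k|k-1}\right) \\
\mathbf{P}_{k|k} &= \left(\mathbf{I} - \mathbf{K}_k\,\mathbf{H}\right)\mathbf{P}_{k|k-1}
\end{aligned}
```

Here $\mathbf{Q}$ and $\mathbf{R}$ are the process and measurement noise covariances. When the ball is occluded there is no $\mathbf{z}_k$, so only the predict step is applied for that frame.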

### Methodology
1. Initialize the state with position and velocity.
2. Define the state transition matrix (describes how the state evolves over time).
3. Define the measurement matrix.
4. Define the process noise covariance matrix.
5. For each time step, predict the next state, then update the estimate with the new measurement when one is available.
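
The five steps above can be sketched without any dependencies for a single image axis. This is a hand-written constant-velocity filter meant only to illustrate the predict/update cycle; it is not filterpy's implementation, and the class name `AxisKalman`, the noise values, and the synthetic measurements are assumptions chosen for the illustration. A `None` measurement stands for an occluded frame, where only the predict step runs.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one axis: state = [position, velocity]."""

    def __init__(self, pos, vel=0.0, dt=1.0, p0=1000.0, q=0.01, r=0.1):
        self.pos, self.vel, self.dt = float(pos), float(vel), dt  # step 1: initial state
        self.P = [[p0, 0.0], [0.0, p0]]                           # initial uncertainty
        self.q, self.r = q, r                                     # process / measurement noise

    def predict(self):
        """Steps 2 and 4: apply F = [[1, dt], [0, 1]] and add process noise Q = q * I."""
        dt, P = self.dt, self.P
        self.pos += dt * self.vel
        p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]

    def update(self, z):
        """Steps 3 and 5: correct the state with a position measurement, H = [1, 0]."""
        P = self.P
        s = P[0][0] + self.r               # innovation covariance
        k0, k1 = P[0][0] / s, P[1][0] / s  # Kalman gain
        y = z - self.pos                   # innovation (measurement residual)
        self.pos += k0 * y
        self.vel += k1 * y
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]


# Synthetic track: the ball moves 5 px per frame, then is hidden for 3 frames.
measurements = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, None, None, None]
kf = AxisKalman(pos=measurements[0])
track = []
for z in measurements[1:]:
    kf.predict()
    if z is not None:     # visible frame: fuse the detection
        kf.update(z)
    track.append(kf.pos)  # occluded frame: the prediction carries the track
print([round(p, 1) for p in track])
```

Two such filters, one for x and one for y, are equivalent to a four-state filter whose matrices are block-diagonal, because the two axes do not interact in a constant-velocity model.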


---
#### try 1

In [None]:

from filterpy.kalman import KalmanFilter

cap = cv2.VideoCapture('ball.mp4')

# Reuse the source video's frame rate and size for the output file
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('output.avi', fourcc, fps, (width, height))
frame_ind = 0
loc = {}  # frame index -> (x1, y1, x2, y2) of the detected ball
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # YOLOv5 expects RGB arrays, while OpenCV decodes frames as BGR
    detections = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    pred = detections.pandas().xyxy[0]
    for ind, row in pred.iterrows():
        if row['name'] == "sports ball":
            x1 = int(row['xmin'])
            y1 = int(row['ymin'])
            x2 = int(row['xmax'])
            y2 = int(row['ymax'])
            loc[frame_ind] = (x1, y1, x2, y2)

    # draw the box only on frames where a ball was detected
    if frame_ind in loc:
        x1, y1, x2, y2 = loc[frame_ind]
        cv2.rectangle(frame, (x1, y1), (x2, y2), color=(0, 0, 255))
    out.write(frame)
    cv2.imshow("image", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # imshow needs waitKey to refresh the window
        break
    frame_ind += 1

cap.release()
out.release()
cv2.destroyAllWindows()


#### try 2


In [36]:
import cv2
import numpy as np
from filterpy.kalman import KalmanFilter

model = torch.hub.load('ultralytics/yolov5','yolov5s')
model.classes=[32]

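cap = cv2.VideoCapture('ball.mp4')

# Create the VideoWriter with the source video's frame rate and size
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
out = cv2.VideoWriter('output.avi', cv2.VideoWriter_fourcc(*'MJPG'), fps, size)

# 1-D filter on the ball's x coordinate: state = [x, dx], measurement = x
kf = KalmanFilter(dim_x=2, dim_z=1)
kf.x = np.array([[0.], [0.]])  # initial state x, dx
kf.P = np.eye(2) * 1000        # initial uncertainty
kf.R = np.array([[0.1]])       # measurement noise
kf.Q = np.eye(2) * 0.01        # process noise

dt = 1.0                       # one time step per frame
kf.F = np.array([[1, dt],
                 [0, 1]])      # constant-velocity transition

kf.H = np.array([[1, 0]])      # only the position is measured

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # YOLOv5 expects RGB arrays, while OpenCV decodes frames as BGR
    detections = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    pred = detections.pandas().xyxy[0]

    kf.predict()               # advance the state by one frame
    if not pred.empty:         # ball visible: correct with the box centre along x
        kf.update((pred.iloc[0]['xmin'] + pred.iloc[0]['xmax']) / 2.0)
    out.write(frame)

cap.release()
out.release()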
    



In [34]:
# |---------------------------------|
# |                                 |
# |            TESTING              |
# |                                 |
# |---------------------------------|


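# Note: filterpy's constructor keyword for the measurement dimension is dim_z,
# not dim_y, so the call below raises the TypeError shown in the output.
f = KalmanFilter(dim_x=2, dim_y=1)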

f.x = np.array([[2.],    # position
                [0.]])   # velocity

f.F = np.array([[1.,1.],
                [0.,1.]])

f.H = np.array([[1.,0.]])

f.P = np.array([[1000.,    0.],
                [   0., 1000.] ])

f.R = np.array([[5.]])

from filterpy.common import Q_discrete_white_noise
f.Q = Q_discrete_white_noise(dim=2, dt=0.1, var=0.13)

z = get_sensor_reading()
f.predict()
f.update(z)
do_something_with_estimate (f.x)

TypeError: KalmanFilter.__init__() got an unexpected keyword argument 'dim_y'

#### try 3

In [28]:
import cv2
import numpy as np
from filterpy.kalman import KalmanFilter
from filterpy.common import Q_discrete_white_noise
from scipy.linalg import block_diag
import torch

model = torch.hub.load('ultralytics/yolov5','yolov5s')
model.classes=[32]

cap = cv2.VideoCapture('ball.mp4')

# Define the codec and create VideoWriter object
#fourcc = cv2.cv.CV_FOURCC(*'DIVX')
#out = cv2.VideoWriter('output.avi',fourcc, 20.0, (640,480))

w = int(cap.get(3))
h = int(cap.get(4))

# the writer's frame size must match the frames passed to out.write()
out = cv2.VideoWriter('output.avi', cv2.VideoWriter_fourcc(*'MJPG'), 30, (w, h))



# |---------------------------------|
# |                                 |
# |            Filter               |
# |                                 |
# |---------------------------------|
kalman = KalmanFilter(dim_x=4, dim_z=2)
# kalman.x = np.array([ # 
#     [0],
#     [0],
#     [0],
#     [0]
# ])
uncertaintyInit = 500
# a covariance matrix must be symmetric, so start from a scaled identity
kalman.P = np.eye(4) * uncertaintyInit
processVar = 30
q = Q_discrete_white_noise(dim=2, dt=1.0/20.0, var=processVar)
kalman.Q = block_diag(q,q)
#print(kalman.Q)
kalman.R = np.array([
    [0.5, 0],
    [0, 0.5]
])
kalman.H = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0]
])
kalman.F = np.array([
    [1, 1.0/20.0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1.0/20.0],
    [0, 0, 0, 1]
])

frame_no = 0
loc = []
preds = []

def midpoint(point1, point2):
    x1, y1 = point1
    x2, y2 = point2
    return (int((x1 + x2) / 2), int((y1 + y2) / 2))

def printall(dict, frame, ind):
    for i in range(ind,0,-1):
        cv2.circle(frame, (dict[i][0]+50, dict[i][1]+50), 5, (0,0,255),-1)

i = 0
while cap.isOpened():
    ret, frame = cap.read()
    if ret:
        detections = model(frame)
        pred = detections.pandas().xyxy
        # print(pred)
        
        try:
            # get the values
            xmin = float(pred[0]['xmin'])
            ymin = float(pred[0]['ymin'])
            xmax = float(pred[0]['xmax'])
            ymax = float(pred[0]['ymax'])
            #print(f'{(xmin, ymin, xmax, ymax)}')
            mid = midpoint((xmin, ymin),(xmax, ymax))
            loc.append(mid) # add the midpoint to the vector
            
            frame_no+=1
        except:
            frame_no+= 1
        print(frame_no)


        
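        # Note: frame_no has already been incremented above and loc only grows on
        # frames with a detection, so loc[frame_no] points one past the last stored
        # midpoint. This is what raises the IndexError shown in the output below.
        measurement = np.array([
                [loc[frame_no][0]],
                [loc[frame_no][1]]
            ])
        kalman.update(measurement)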

        # print(loc)
        kalman.predict()
        pred_x = int(kalman.x[0])
        pred_y = int(kalman.x[2])

        preds.append((pred_x, pred_y))

        # print("predx:",pred_x)
        # print("predy:",pred_y)
        # print("vel:",kalman.x[1])


        #draw circle
        for i in range(frame_no, 0, -1):
            cv2.circle(frame, preds[i], 2, (0,255,0), -1)
        #draw locations
        for i in range(frame_no, 0, -1):
            print("frame:",frame_no)
            print("drawing:",(loc[i][0], loc[i][1]))
            cv2.circle(frame, (loc[i][0], loc[i][1]), 5, (0,0,255), -1)


        #cv2.imshow('frame',frame)
        cv2.imshow("frame",frame)
        out.write(frame)
        # draw previous points

        if cv2.waitKey(1) & 0xFF == ord('s'):  # mask the key code with &, not %
            break
    else:
        break

    









reached
1


IndexError: list index out of range

#### try 4

In [51]:
import cv2
import numpy as np
from filterpy.kalman import KalmanFilter
from filterpy.common import Q_discrete_white_noise
from scipy.linalg import block_diag
import torch

model = torch.hub.load('ultralytics/yolov5','yolov5s')
model.classes=[32]

cap = cv2.VideoCapture('ball.mp4')

# Define the codec and create VideoWriter object
#fourcc = cv2.cv.CV_FOURCC(*'DIVX')
#out = cv2.VideoWriter('output.avi',fourcc, 20.0, (640,480))

w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back to 30 if the container reports 0
dt = 1.0 / fps                           # time between consecutive frames

# the writer's frame size must match the frames passed to out.write()
out = cv2.VideoWriter('output.avi', cv2.VideoWriter_fourcc(*'MJPG'), fps, (w, h))


# |---------------------------------|
# |                                 |
# |            Filter               |
# |                                 |
# |---------------------------------|
# state: [x, vx, y, vy]; measurement: [x, y] (centre of the detected box)
kalman = KalmanFilter(dim_x=4, dim_z=2)
uncertaintyInit = 500
kalman.P = np.eye(4) * uncertaintyInit  # initial uncertainty (covariance must be symmetric)
processVar = 30
q = Q_discrete_white_noise(dim=2, dt=dt, var=processVar)
kalman.Q = block_diag(q, q)  # independent process noise for the x and y axes
kalman.R = np.array([
    [0.5, 0],
    [0, 0.5]
])  # measurement noise
kalman.H = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0]
])  # only the positions are observed
kalman.F = np.array([
    [1, dt, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, dt],
    [0, 0, 0, 1]
])  # constant-velocity motion


locs = []   # measured ball centres (visible frames only)
preds = []  # Kalman estimates (one per frame once the track has started)
initialized = False

def midpoint(point1, point2):
    x1, y1 = point1
    x2, y2 = point2
    return (int((x1 + x2) / 2), int((y1 + y2) / 2))


while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # YOLOv5 expects RGB arrays, while OpenCV decodes frames as BGR
    detections = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    pred = detections.pandas().xyxy[0]

    mid = None
    if not pred.empty:
        row = pred.loc[pred['confidence'].idxmax()]  # most confident ball
        mid = midpoint((row['xmin'], row['ymin']), (row['xmax'], row['ymax']))
        locs.append(mid)

    if not initialized:
        if mid is not None:
            # start the track at the first detection with zero velocity
            kalman.x = np.array([[mid[0]], [0.], [mid[1]], [0.]])
            initialized = True
    else:
        kalman.predict()  # advance the state by one frame
        if mid is not None:
            kalman.update(np.array([[mid[0]], [mid[1]]]))  # correct with the detection
        # occluded frame: no update, so the prediction alone carries the track

    if initialized:
        preds.append((int(kalman.x[0, 0]), int(kalman.x[2, 0])))

    for m in locs:
        cv2.circle(frame, m, 5, (0, 0, 255), -1)  # red: detector measurements
    for p in preds:
        cv2.circle(frame, p, 2, (0, 255, 0), -1)  # green: Kalman estimates

    cv2.imshow("frame", frame)
    out.write(frame)
    if cv2.waitKey(1) & 0xFF == ord('s'):  # press s to stop early
        break

cap.release()
out.release()
cv2.destroyAllWindows()

        


