# Creating Video Clips
This notebook demonstrates how to run a YOLOv4 model on individual video frames and use its output to generate information about the video.
We use detections of labels over sequential frames to generate Clips, which describe the existence of those objects within a specific portion of the video.
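
As a rough illustration of the idea, the sketch below groups the frame numbers in which a label was seen into contiguous ranges. It is a simplified stand-in, not the algorithm used later in this notebook (which also takes confidence into account); `frames_to_ranges` and `max_gap` are names made up for this example.

```python
def frames_to_ranges(frames, max_gap=0):
    """Group frame numbers into (start, end) ranges.

    A new range starts whenever more than `max_gap` frames are missing
    between two consecutive sightings of the label.
    """
    ranges = []
    for frame in sorted(set(frames)):
        if ranges and frame - ranges[-1][1] <= max_gap + 1:
            ranges[-1][1] = frame  # extend the current range
        else:
            ranges.append([frame, frame])  # open a new range
    return [tuple(r) for r in ranges]


# Hypothetical frames in which a "dog" label was detected
dog_frames = [3, 4, 5, 7, 8, 20, 21]
print(frames_to_ranges(dog_frames, max_gap=1))  # [(3, 8), (20, 21)]
```

Allowing a small gap keeps a single missed detection from splitting one appearance of an object into two clips.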

# Install ApertureDB
First we install the ApertureDB Python module and the other modules we need to run our model. Then we verify our connection.

In [None]:
!pip install aperturedb tqdm
from aperturedb import Utils
c = Utils.create_connector()

In [None]:
u = Utils.Utils(c)
u.summary()

# Download resources
Now we need to download the Python code that runs the model, `yolo4.py`, and the video we are going to use, `norman.mp4`, a video about a dog riding a bicycle.

This was chosen because it includes several labels but also because it has detections which overlap - dogs and bikes.

In [None]:
# Now we retrieve the items we are working with:
# Retrieve the YOLOv4 interface
!rm -f yolo4.py
!wget https://raw.githubusercontent.com/drewaogle/YOLOv4-OpenCV-CUDA-DNN/refs/heads/main/yolo4.py
# Retrieve the video
!wget https://aperturedata-public.s3.us-west-2.amazonaws.com/aperturedb_applications/norman.mp4


# Run The Detector
Now that we've downloaded our YOLOv4 code, let's run it.
It first needs to download the weights and some configuration, about 300 MB in total, which it does automatically.

After downloading or verifying the files, it processes the video. With `no_squash_detections` set to `True`, it won't overwrite an existing output directory, so delete that directory to rerun. This code can support hardware acceleration, but is designed so it won't be unwieldy without it. Detection should run at about 3-10 fps without hardware acceleration and take less than 5 minutes.

If a download fails partway through, rerunning the loader will fail because the SHA-2 sums won't match.
`!rm ~/models` will reset the downloads, though.
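
If you want to check a downloaded file yourself, a checksum comparison can be sketched as follows. This assumes SHA-256 (one member of the SHA-2 family); the variant the loader actually uses, and the expected hash values, are not given here, so `sha256_of` and `is_complete` are hypothetical helpers and the expected hash must come from wherever the model files are published.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_complete(path, expected_sha256):
    """True if the file exists and its digest matches the expected value."""
    p = Path(path)
    return p.is_file() and sha256_of(p) == expected_sha256.lower()
```

A file that fails this check is most likely a partial download and should be removed before rerunning.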

In [None]:

from importlib import reload
import yolo4
reload(yolo4)
from yolo4 import RemoteYOLOv4
class DetectorOptions:
    image='' # path for images
    stream='' # path for stream
    cfg="models/yolov4.cfg" # path to config
    weights="models/yolov4.weights" # path to weights
    namesfile="models/coco.names" # path for output to name mapping
    input_size=416
    use_gpu=False # use GPU or not
    outdir="output/norman"
    no_squash_detections=True # if detections exist, don't rerun.
    def __init__(self, image='',stream=''):
        self.image = image
        self.stream=stream # 'webcam' to open webcam w/ OpenCV

# now we pull data
dopts = DetectorOptions( stream="norman.mp4")
yolo = RemoteYOLOv4(dopts)


## Now Check Detections
The YOLOv4 code we use writes detections sequentially into a CSV file, so let's load the file and see what the output looks like.

In [None]:
#Now let's check detections
import pandas as pd
df = pd.read_csv("output/norman/detections.csv")
print(df)

# Process Into Clips
Now that we've verified that we have the data from the model, we will take the output and process it into Clips.

We will define a few classes and functions to process our model output.

- `ClipOptions` - an options class that defines how the processing works
- `preprocess` - converts the dataframe into a form the clip detection can use
- `Clip` - a class that defines our output
- `ClipStorage` - a class for maintaining state between the functions
- `process_new_frame` - a function to run when we see a new frame
- `process_row` - a function to run when we see a new detection
- `process` - the function to process the whole video/csv

These could be in a single class, but I've left them apart to allow people to take them piece by piece.

In [None]:
import logging
# First we'll define the options we're going to use.
class ClipOptions:
    offset_frame=0 # starting offset in frames
    end_frame=-1 # ending offset in frames ( -1 for no limit )
    initconf=50 # minimum confidence to start a clip ( 0-100 )
    initlen=5 # minimum detection duration in frames to start a clip
    dropconf=25 # minimum confidence to keep a clip going ( 0-100 )
    droplen=5 # number of frames without a detection after which a clip ends
    detections=None  # path to the detections csv
    video=None # video that the detections are from
    verbose=logging.INFO # moderate amount of info
    flush=False # remove old uuids
    nosave=False # don't add data to db
    label="" # label for video
    def __init__(self,video,detections):
        self.video=video # video file to add
        self.detections=detections



## Spot Check Detections
Let's take a look at the detections now that we have defined what the output columns mean, and verify that the labeling is correct.

In [None]:


# Prepare the dataframe for processing: name the columns and trim frames we don't want.
def preprocess(df, args):
    processed = df.copy()  # work on a copy so the raw detections stay untouched
    processed.columns = ["frame", "label", "confidence", "left", "top", "width", "height"]
    processed = processed[processed.frame >= args.offset_frame]
    if args.end_frame > -1:
        processed = processed[processed.frame <= args.end_frame]
    return processed

# `opts` is only configured later in the notebook, so use default options for the spot check.
norman_detects = preprocess(df, ClipOptions("norman.mp4", "output/norman/detections.csv"))
print(norman_detects)


In [None]:
# Draw the detections for a single frame by hand.

from IPython.display import display as ds
import cv2
from PIL import Image

def display_image_and_bb(num, df):
    # The model code also writes out the video frames, so we can spot check them.
    path = f"output/norman/video{num}.jpg"
    cv_image = cv2.imread(path)
    if cv_image is None:
        raise FileNotFoundError(f"Could not read frame image {path}")

    # Draw a rectangle and a label for each detection in this frame.
    for _, coords in df[df["frame"] == num].iterrows():
        # Cast to plain ints; OpenCV expects integer pixel coordinates.
        left   = int(coords["left"])
        top    = int(coords["top"])
        right  = left + int(coords["width"])
        bottom = top + int(coords["height"])
        cv2.rectangle(cv_image, (left, top), (right, bottom), (0, 255, 0), 2)
        y = top - 15 if top - 15 > 15 else top + 15
        cv2.putText(cv_image, str(coords["label"]), (left, y),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0), 2)

    # OpenCV loads images as BGR; convert to RGB for display.
    cv_image_rgb = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)
    ds(Image.fromarray(cv_image_rgb))



In [None]:
# Now we will select a frame, 150, and see how it looks.
display_image_and_bb( 150, norman_detects)

### Detection Verification
This is pretty much what we would expect: a bike, a dog, a person, and even a car detected in the background, which is a nice find.

That house is *not* a stop sign, though. I guess it is seeing the sharp edges and deciding that is the bottom of a stop sign?

In [None]:
# simple Clip class for data storage
class Clip:
    def __init__(self, label, start,conf):
        self.label = label
        self.start_frame = start
        self.total_frames = 0 # don't include start in total.
        self.missed_frames = 0
        self.max_confidence = conf
        self.min_confidence = conf
        self.last_frame_seen = None
    def is_active( self, current_frame, drop_len ):
        last_seen = self.start_frame + self.total_frames
        # The clip stays active while fewer than drop_len frames have passed since it was last seen.
        # With drop_len of 2 it must have been seen in the previous frame; 3 tolerates one missed frame.
        return last_seen + drop_len > current_frame
    def __str__(self):
        return f"(Clip [{self.label}] @ {self.start_frame} + {self.total_frames})"
    def __repr__(self):
        return f"C[{self.label} @ {self.start_frame} + {self.total_frames}]"
    def describe(self):
        return f"Clip is of {self.label}. First seen at {self.start_frame}, seen until {self.start_frame+self.total_frames}"
    def as_finished(self):
        return f"{self.label}_{self.start_frame+self.total_frames}"
    def add_confidence(self,new_confidence, frame_num):
        self.max_confidence = max(self.max_confidence,new_confidence)
        self.min_confidence = min(self.min_confidence,new_confidence)
        # there could be multiple detections of a given type per frame; we don't track multiple here
        # and are merely looking to know how many frames a label occurs in
        if frame_num == self.last_frame_seen:
            return
        self.last_frame_seen = frame_num
        self.total_frames = self.total_frames + 1 + self.missed_frames
        # when a frame is a hit, we add the missed frames to what is considered the total length.
        self.missed_frames = 0 
    # frames where confidence was below threshold but kept to avoid drop out.
    def add_missed(self,missed_confidence):
        self.missed_frames = self.missed_frames + 1

class ClipStorage:
    def __init__(self):
        self.active = {} # clips that have been seen, but have not yet passed the initialization count ( suppresses misidentifications )
        self.registered = {} # clips that are 'valid', and currently being seen
        self.finished = {} # clips that were valid, but dropped off.

In [None]:
# Process events which trigger on a new frame.
def process_new_frame(verbose, drop_len, cur_frame, last_frame, storage):
    # verbose is a logging level, so lower values print more.
    # Drop any clips which are no longer active.
    new_active = {}
    new_registered = {}
    for clip in storage.active.values():
        if not clip.is_active(cur_frame, drop_len):
            if verbose <= logging.INFO:
                print(f'New Frame: frame {cur_frame}, Dropped {clip}')
        else:
            if verbose <= logging.DEBUG:
                print(f"New Frame: frame {cur_frame}, kept {clip} in active")
            new_active[clip.label] = clip
    for clip in storage.registered.values():
        if not clip.is_active(cur_frame, drop_len):
            if verbose <= logging.INFO:
                print(f'New Frame: frame {cur_frame}, Retired {clip}')
            storage.finished[clip.as_finished()] = clip
        else:
            new_registered[clip.label] = clip
    if verbose <= logging.DEBUG:
        print(f"Active dict: {storage.active}")
    storage.registered = new_registered
    storage.active = new_active

In [None]:
# Process a row in the detections.
# YOLOv4 can detect multiple objects in a frame - this is a single detection in a given frame.
def process_row(verbose, initconf, initlen, dropconf, cur_frame, label, label_confidence, storage):
    # verbose is a logging level, so lower values print more.
    if label in storage.active:
        clip = storage.active[label]
        if label_confidence * 100 > initconf:
            clip.add_confidence(label_confidence, cur_frame)
            # total_frames doesn't include the first frame, so add 1.
            if clip.total_frames + 1 >= initlen:
                if verbose <= logging.INFO:
                    print(f"At {cur_frame}, moved {clip} to registered")
                storage.registered[label] = clip
                del storage.active[label]
            else:
                if verbose <= logging.INFO:
                    print(f"At frame {cur_frame}, saw {clip}")
        else:
            if verbose <= logging.INFO:
                print(f"{clip} seen at frame {cur_frame}, but confidence [ {label_confidence*100} < {initconf} ]")
    elif label in storage.registered:
        clip = storage.registered[label]
        # If above the confidence for dropping, count this as a new registered frame.
        if label_confidence * 100 > dropconf:
            # Lets a clip miss frames and resume; duration is calculated from start to current.
            clip.add_confidence(label_confidence, cur_frame)
        else:
            clip.add_missed(label_confidence)
    else:
        # The label is neither in the active list nor registered.
        if label_confidence * 100 > initconf:
            clip = Clip(label, cur_frame, label_confidence)
            if verbose <= logging.INFO:
                print(f"* Added {clip} to active")
            storage.active[label] = clip
        

In [None]:
# Main loop over the detections.
def process(args, pf):
    clip_store = ClipStorage()
    last_frame = 0
    cur_frame = 0
    for idx, row in pf.iterrows():
        cur_frame = int(row['frame'])
        label = row['label']
        # Hard-coded limit: only frames up to 155 are processed.
        # Remove this check (or rely on end_frame) to cover the whole video.
        if cur_frame > 155:
            break
        if cur_frame != last_frame:
            if args.verbose <= logging.DEBUG:
                print(f"Processing switch from {last_frame} to {cur_frame}")
            process_new_frame(args.verbose, args.droplen, cur_frame, last_frame, clip_store)

        # All stale active and registered clips are dropped prior to this.
        process_row(args.verbose, args.initconf, args.initlen, args.dropconf, cur_frame, label, row['confidence'], clip_store)

        last_frame = cur_frame
    # Move all registered clips to finished, regardless of verbosity.
    if args.verbose <= logging.INFO:
        print("Video complete, finishing clips")
    for clip in clip_store.registered.values():
        clip_store.finished[clip.as_finished()] = clip
    return clip_store.finished
       

    

## Run the Clip Processing
Now that we've defined all our functions, let's run them.
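
To get a feel for why there are two confidence thresholds, here is a toy sketch of the underlying idea: a span opens only above a higher threshold and stays open until confidence falls to a lower one. It deliberately ignores the frame-count settings (`initlen`, `droplen`) and is not the notebook's implementation; `hysteresis_spans` and the sample confidences are invented for illustration.

```python
def hysteresis_spans(confidences, start_conf, drop_conf):
    """Return (start, end) index spans using two thresholds.

    A span opens when confidence exceeds `start_conf` and stays open
    until confidence falls to `drop_conf` or below.
    """
    spans = []
    start = None
    for i, conf in enumerate(confidences):
        if start is None:
            if conf > start_conf:
                start = i
        elif conf <= drop_conf:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(confidences) - 1))
    return spans


# Made-up per-frame confidences (0-100) for a single label
confs = [10, 30, 50, 40, 30, 22, 15, 60, 55]
print(hysteresis_spans(confs, start_conf=45, drop_conf=20))  # [(2, 5), (7, 8)]
print(hysteresis_spans(confs, start_conf=45, drop_conf=45))  # [(2, 2), (7, 8)]
```

With a single threshold the first span ends after one frame; the lower drop threshold lets it ride out a few weaker detections.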

In [None]:
# options
opts = ClipOptions( "norman.mp4","output/norman/detections.csv" )
opts.label="Norman_Bike"
opts.initconf=45
opts.initlen=3
opts.dropconf=20
opts.droplen=3

norman_finished = process(opts,norman_detects)
#print(norman_finished)
for clip in norman_finished.values():
    print(clip.describe())

# Adding the Results to ApertureDB
Now we have some data that we can put into the database.
We'll make some functions to handle the different types of data we're adding.
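
One way to give each item a stable identifier is a name-based UUID: `uuid.uuid5` always returns the same value for the same namespace and name, so rerunning the notebook produces the same ids. The sketch below shows that property using placeholder `example://` URLs, which are assumptions for illustration rather than the URLs used in this notebook.

```python
import uuid

# A video id derived from a URL-like name (placeholder value).
video_id = uuid.uuid5(uuid.NAMESPACE_URL, "example://video/norman.mp4")

# Child ids use the video id as their namespace, so they are unique per video.
clip_id_a = uuid.uuid5(video_id, "example://clip/dog_136")
clip_id_b = uuid.uuid5(video_id, "example://clip/dog_136")
other_clip = uuid.uuid5(video_id, "example://clip/bicycle_10")

print(clip_id_a == clip_id_b)   # True: same namespace and name
print(clip_id_a == other_clip)  # False: different name
```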

In [None]:
# Add Detections to Database
import uuid

video_url = "aperturedb://demos/video_clip/video/{0}"
frame_url = "video_clips://frame/{0}"
clip_url = "video_clips://clip/{0}"

u.remove_all_objects()

def run_query(db, query,blobs,action_desc):
    blobs = [] if blobs is None else blobs
    #print(query)
    result,_ = db.query(query,blobs)
    if not db.last_query_ok():
        raise Exception(f"Failed Running Query for {action_desc}: {result}")
    return result

def add_detections( db, video_id, detections ):
    det_df = pd.read_csv( detections )
    det_df.columns = ["frame","label","confidence","left","top","width","height" ]
    def format_detections( detections ):
        return "".join( f"[{det.frame},{det.label},{det.confidence},{det.left},{det.top},{det.width},{det.height}]" for det in detections )

    def on_end_frame( frame_number, detections ):
        if not detections:
            return # nothing collected yet, e.g. before the first frame with detections
        # Assumption: frame ids are derived like clip ids, as a UUID5 of the frame URL under the video id.
        frame_id = uuid.uuid5( video_id, frame_url.format( frame_number ) )
        add_frame_query=[{
            "FindVideo": {
                "constraints": {
                    "id": ["==",str(video_id)]
                },
                "_ref":1
            }
        },{
            "AddFrame": {
                "video_ref":1,
                "frame_number": frame_number,
                "properties": {
                    "detections": format_detections( detections ),
                    "frame_number": frame_number,
                    "id": str(frame_id)
                }
            }
        }]
        run_query(db,add_frame_query,None, "Adding Frame")
        # add_bboxes is not defined in this notebook; re-enable this call once it is available.
        # add_bboxes(db,frame_id,detections)

    frame_number = 0
    frame_detections = []
    for row,data in det_df.iterrows():
        if data['frame'] != frame_number:
            on_end_frame(frame_number, frame_detections)
            frame_detections = []
        frame_number = int(data['frame']) # plain int so the query serializes to JSON
        frame_detections.append( data )
    # output last frame
    on_end_frame(frame_number, frame_detections)
        

def add_clips(db,video_id,clips):
    for clip in clips:
        add_clip_query=[{
            "FindVideo": {
                "constraints": {
                    "id": ["==",str(video_id)]
                },
                "_ref":1
            }
        },{
            "AddClip": {
                "video_ref":1,
                "frame_number_range":{
                    # cast to plain ints so the query serializes to JSON
                    "start": int(clip.start_frame),
                    "stop": int(clip.start_frame + clip.total_frames)
                },
                "properties": {
                    "label": clip.label,
                    "id": str(uuid.uuid5( video_id, clip_url.format( f"{clip.label}_{clip.start_frame}" )))
                },
            }
        }]
        print(run_query(db,add_clip_query,None, "Add Clip"))
        print(add_clip_query)
        # Debug check: confirm the video can still be found by its id.
        acq2=[add_clip_query[0]]
        acq2[0]["FindVideo"]["results"] = {"count":True }
        print(run_query(db,acq2,None, "AC Test"))
        
        



def add_video( db, video_path, detections_path, video_description ):
    video_id = uuid.uuid5( uuid.NAMESPACE_URL, video_url.format( video_path ))
    add_video_query=[{
        "AddVideo": {
            "properties": {
                "source": video_path,
                "description": video_description,
                "id": str( video_id )
            }
        }
        }]
    # read the whole video file so it can be sent as a blob
    with open( video_path, 'rb') as fd:
        video_data = fd.read()

    info_video_query=[{
        "FindVideo":{
            "constraints": {
                "id": ["==",str(video_id)]
            },
            "results": {
                "all_properties":True
            }
        }
    }]

    
    
    run_query(db,add_video_query,[video_data],"Video Adding")
    print(run_query(db,info_video_query,None,"Video Find"))
    
    add_detections( db, video_id, detections_path )
    add_clips( db, video_id,  norman_finished.values() )
try:
    add_video( c, "norman.mp4", "output/norman/detections.csv" , "Norman the dog rides a bike with some help")
    u.summary()
except Exception as e:
    print("Failed adding: ",e)
            
        

# Clip Verification
It looks like we found a lot of the things we wanted to find: a dog, a bicycle, and the lady who was helping the dog.

It's odd that we don't see the dog until frame 136 though, what gives?

Also, our output shows it first registering the dog at frame 120, but then losing it.
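
One way to investigate is to list the best confidence for a label in each frame of a range and compare it with the thresholds. The helper below is a hypothetical sketch that works on plain `(frame, label, confidence)` rows; the sample rows are made up for illustration and are not taken from the real detections.

```python
def label_confidences(rows, label, first_frame, last_frame):
    """Return {frame: best confidence} for `label` within [first_frame, last_frame]."""
    best = {}
    for frame, row_label, confidence in rows:
        if row_label == label and first_frame <= frame <= last_frame:
            best[frame] = max(confidence, best.get(frame, 0.0))
    return best


# Made-up rows (frame, label, confidence), for illustration only
rows = [
    (119, "dog", 0.30),
    (120, "dog", 0.52),
    (120, "bicycle", 0.80),
    (121, "dog", 0.18),
    (122, "dog", 0.47),
]
for frame, conf in sorted(label_confidences(rows, "dog", 119, 122).items()):
    print(frame, conf)
```

With the real data, the rows could be taken from the dataframe, for example with `norman_detects[["frame", "label", "confidence"]].itertuples(index=False)`.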

In [None]:
# It's odd that the dog is not seen until frame 136; let's see why.

## at this point we will assume you've figured this out after 