## **1. Calculate annotation time from the start time of the video and the seconds of the annotation frame**

@author: beatriz vinha
If you encounter any problems, please contact beatrizmouravinha@ub.edu

To run the code below, you will need two files:

•	BIIGLE video annotation report file (.csv), exported from BIIGLE, with:
  - an added "start_time" column containing the start time of the annotated video, and
  - the square brackets ("[  ]") removed from every row of the "frames" column
•	Video metadata file (.csv), based on the USBL navigation, with date and time in separate columns and time in "HHMMSS" format.
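
The two manual edits to the BIIGLE report can also be made in pandas. The sketch below is only an illustration on made-up data: the rows, the file layout and the start time `16:50:00` are assumptions, and it presumes each "frames" entry holds a single value (one time point per annotation).

```python
import io
import pandas as pd

# Toy stand-in for a BIIGLE video annotation report (made-up rows).
raw_csv = io.StringIO(
    'label_name,video_filename,frames\n'
    'Coral,dive01.mp4,"[12.5]"\n'
    'Sponge,dive01.mp4,"[340.0]"\n'
)
report = pd.read_csv(raw_csv)

# Remove the square brackets so that "frames" holds plain seconds.
report['frames'] = report['frames'].str.strip('[]').astype(float)

# Add the start time of the annotated video (hypothetical value).
report['start_time'] = '16:50:00'

print(report)
```

Entries with more than one value inside the brackets would need to be split first, so this sketch would not handle them as written.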


In [None]:
##Run ONLY if your files are stored in Google Drive and you want to connect to it

#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter() #to view df as interactive tables

In [None]:
import pandas as pd

#import biigle annotations file
#to connect to folder in google drive "/content/drive/MyDrive/PATHTOFOLDER"
biigle_raw = pd.read_csv('/biigle_annot.csv')


#import video metadata file based on the USBL navigation
rov_nav = pd.read_csv('/rov_navigation.csv', sep = ",",
                      dtype={'time':float}) #moving average clean nav


In [None]:
#convert start_time and frames (seconds from the start of the video) to timedelta
biigle_raw['start_time'] = pd.to_timedelta(pd.to_datetime(biigle_raw['start_time']).dt.strftime('%H:%M:%S'))
biigle_raw['frames'] = pd.to_timedelta(biigle_raw['frames'], unit = 'seconds')

#check - start_time and frames must both be timedelta64[ns]
biigle_raw.dtypes

In [None]:
#add frames to the start time of the video to obtain a new column with the annotation time
biigle_raw['annotation_time'] = biigle_raw['start_time']+biigle_raw['frames']

To run the rest of the code, the annotation time must be in the same format as the navigation time: a number in HHMMSS form, stored as float64. For example, "16:53:47" is represented as 165347. The cells below build this value from the hours, minutes and seconds of 'annotation_time' and store it in a new column, 'time'.
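
As a worked example of this conversion, the sketch below uses a made-up start time of 16:50:00 and an annotation 227 seconds into the video (both values are assumptions chosen for illustration).

```python
import pandas as pd

# Hypothetical values: video starts at 16:50:00, annotation at 227 s (3 min 47 s).
start = pd.to_timedelta('16:50:00')
offset = pd.to_timedelta(227, unit='s')
annotation_time = pd.Series([start + offset])   # 16:53:47 as a timedelta

# Split into components and rebuild the time as an HHMMSS number.
comp = annotation_time.dt.components
hhmmss = (comp['hours'] * 10000 + comp['minutes'] * 100 + comp['seconds']).astype(float)

print(hhmmss.iloc[0])   # 165347.0
```

Note that `dt.components` also has a `days` field, which this calculation does not use. If the start time plus the frame offset passes midnight, the hours restart at 0.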

In [None]:
# Split annotation_time into hours, minutes and seconds (used below to build the HHMMSS value)
biigle_raw['hours'] = biigle_raw['annotation_time'].dt.components['hours']
biigle_raw['minutes'] = biigle_raw['annotation_time'].dt.components['minutes']
biigle_raw['seconds'] = biigle_raw['annotation_time'].dt.components['seconds']

In [None]:
# Create 'time' column in HHMMSS format
biigle_raw['time'] = (biigle_raw['hours'] * 10000 + biigle_raw['minutes'] * 100 + biigle_raw['seconds']).astype(float)

## **2. Merge timestamped annotations with ROV navigation**

In [None]:
#time must be float64 in both dataframes
#print both, because a notebook cell only displays its last expression
print(rov_nav.dtypes)
print(biigle_raw.dtypes) #column 'time' is the same as 'annotation_time' but in HHMMSS format

Unnamed: 0,0
label_name,object
label_hierarchy,object
video_filename,object
shape_name,object
date_video,object
start_time,timedelta64[ns]
frames,timedelta64[ns]
annotation_time,timedelta64[ns]
hours,int64
minutes,int64


In [None]:
# Georeference all biigle annotations by merging df based on time
allannotations_georef = pd.merge_asof(biigle_raw.sort_values('time'), rov_nav.sort_values('time'),
                                      on="time", direction="nearest")
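
To illustrate what a nearest-time merge does, here is a minimal sketch on made-up data. The labels, times and coordinates are all invented, and the column names `lat` and `lng` are assumed to match the navigation file.

```python
import pandas as pd

# Made-up annotations and navigation fixes, with time as HHMMSS floats.
annotations = pd.DataFrame({'time': [165347.0, 165902.0],
                            'label_name': ['Coral', 'Sponge']})
nav = pd.DataFrame({'time': [165345.0, 165350.0, 165900.0, 165905.0],
                    'lat': [41.10, 41.11, 41.12, 41.13],
                    'lng': [2.10, 2.11, 2.12, 2.13]})

# Each annotation receives the navigation row whose 'time' is numerically closest.
georef = pd.merge_asof(annotations.sort_values('time'), nav.sort_values('time'),
                       on='time', direction='nearest')
print(georef)
```

Both dataframes must be sorted by the key, and "nearest" means the smallest numeric difference between the HHMMSS values. These values are not evenly spaced in time: 165959 and 170000 are one second apart but differ by 4041. Matches close to a change of minute or hour may therefore be worth checking when the navigation fixes are sparse.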

## **3. Substrate Type Annotations**

In [None]:
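# Extract WholeFrame annotations with START/END markers (substrate type, parts of the video to remove, etc.)
# .copy() makes an independent dataframe, so the in-place drop below does not act on a slice of allannotations_georef
wholeframe_annotations = allannotations_georef[allannotations_georef['shape_name'] == 'WholeFrame'].copy()

# Delete navigation columns, which are added again when merging with the navigation data
wholeframe_annotations.drop(['lat','lng', 'gps_altitude'], axis=1, inplace=True)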

In [None]:
# Extract the START and END rows, identified by "START" and "END" in label_hierarchy (rows with a missing label are skipped)
start_annotations = wholeframe_annotations[wholeframe_annotations['label_hierarchy'].str.contains('START', case=False, na=False)]
end_annotations = wholeframe_annotations[wholeframe_annotations['label_hierarchy'].str.contains('END', case=False, na=False)]

In [None]:

# Create a list of intervals between START and END
intervals = []
for _, start_row in start_annotations.iterrows():
    # Find the corresponding END for each START
    #category = ' > '.join(start_row['label_hierarchy'].split('>')[:-1]).strip()  # Extract category from label_hierarchy (excluding START/END)
    category = start_row['label_hierarchy'].split('>')[0].strip()
    start_time = start_row['time']

    # Find the corresponding END time for the same category (ensure it's after the START time)
    # regex=False matches the category as plain text, even if it contains special characters
    matching_end = end_annotations[(end_annotations['time'] > start_time) &
                                   (end_annotations['label_hierarchy'].str.contains(category, case=False, regex=False))]

    if not matching_end.empty:
        end_time = matching_end.iloc[0]['time']
        # Append the interval (start_time, end_time, category)
        intervals.append((start_time, end_time, category))

# Assign WholeFrame labels to the navigation data based on intervals
def assign_wholeframe_labels(rov_nav, intervals):
    # Adding a new column to store the WholeFrame labels (e.g., substrate type/Not considered for analysis)
    rov_nav['WholeFrame'] = None

    # Iterate over each interval (start_time, end_time, category)
    for start_time, end_time, category in intervals:
        # Assign the category to the corresponding rows in the navigation data
        mask = (rov_nav['time'] >= start_time) & (rov_nav['time'] <= end_time)
        rov_nav.loc[mask, 'WholeFrame'] = category

    return rov_nav
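
The START/END pairing and the interval labelling can be checked on a small made-up example. The sketch below repeats the same logic in compact form. The label names and times are invented, and it assumes one START/END pair per category.

```python
import pandas as pd

# Hypothetical WholeFrame annotations: one START/END pair per category.
wholeframe = pd.DataFrame({
    'time': [100001.0, 100010.0, 100020.0, 100030.0],
    'label_hierarchy': ['Substrate Type > START', 'Substrate Type > END',
                        'Not considered > START', 'Not considered > END'],
})
# Hypothetical navigation fixes.
nav = pd.DataFrame({'time': [100000.0, 100005.0, 100015.0, 100025.0, 100035.0]})

starts = wholeframe[wholeframe['label_hierarchy'].str.contains('START', case=False)]
ends = wholeframe[wholeframe['label_hierarchy'].str.contains('END', case=False)]

# Pair each START with the first later END of the same category.
intervals = []
for _, row in starts.iterrows():
    category = row['label_hierarchy'].split('>')[0].strip()
    later_ends = ends[(ends['time'] > row['time']) &
                      ends['label_hierarchy'].str.contains(category, case=False, regex=False)]
    if not later_ends.empty:
        intervals.append((row['time'], later_ends.iloc[0]['time'], category))

# Label every navigation row that falls inside an interval (bounds included).
nav['WholeFrame'] = None
for start_time, end_time, category in intervals:
    mask = nav['time'].between(start_time, end_time)
    nav.loc[mask, 'WholeFrame'] = category

print(nav)
```

Navigation rows outside every interval keep an empty 'WholeFrame' value. Taking the first later END relies on the END rows being sorted by time.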

In [None]:
# Add the continuous annotation sequences to the ROV navigation (this also adds the 'WholeFrame' column to rov_nav itself)
sequences_nav = assign_wholeframe_labels(rov_nav, intervals)
sequences_nav_cleaned = sequences_nav.dropna(subset=['WholeFrame']) #remove empty rows

In [None]:
# Merge the WholeFrame annotations with the labelled navigation data (merge_asof needs both sorted by 'time')
sequenced_annotations = pd.merge_asof(sequences_nav_cleaned.sort_values('time'),
                                      wholeframe_annotations.sort_values('time'),
                                      on='time', direction='nearest')

In [None]:
sequenced_annotations

## **4. Clean and Export Final Files**

In [None]:
# Delete non-useful columns
allannotations_georef.drop(['frames','hours', 'minutes', 'seconds', 'time'], axis=1, inplace=True)
sequenced_annotations.drop(['frames', 'hours', 'minutes', 'seconds'], axis=1, inplace=True)

In [None]:
# Split the annotations into separate dataframes: species, substrate type, and parts of the transect to discard
species_annotations = allannotations_georef[allannotations_georef['shape_name'] != 'WholeFrame']
substrate_type_annotations = sequenced_annotations[sequenced_annotations['WholeFrame'] == 'Substrate Type']
transect_to_discard = sequenced_annotations[sequenced_annotations['WholeFrame'] != 'Substrate Type']

In [None]:
##Save all files as .csv
allannotations_georef.to_csv('/allannotations_georef.csv', index=False)
species_annotations.to_csv('/species_annotations.csv', index=False)
substrate_type_annotations.to_csv('/substrate_type_annotations.csv', index=False)
transect_to_discard.to_csv('/transect_to_discard.csv', index=False)