## Starter Code from Kaggle Notebook
This is just being stored here for historical reasons. Will remove later. It said this:

```
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import sys

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf

# Input data files are available in the read-only "../input/" directory
# For example, RUN THIS (by clicking run or pressing Shift+Enter) to list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```

## Imports

In [3]:
import os, sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

### Double-check Versions

In [4]:
print(sys.version)  # Python version
print(tf.__version__) # on Kaggle, this will be 2.6

## Exploratory Data Analysis

*Abbreviations*:     
- COTS = "Crown-of-thorns starfish"

Note: to locate where the dataset is located (on this Kaggle kernel) - I clicked on the "copy" button icon next to the dataset folder icon in the "Data" tab.

In [5]:
data_path = '../input/tensorflow-great-barrier-reef'
df = pd.read_csv(f"{data_path}/train.csv")

In [6]:
df.head()

### How to Be an Observant Surveyor

My goal in this analysis is to build an object-detection system that can scale up the efforts of manual surveyors in the Great Barrier Reef. With that in mind... what makes a human surveyor great at spotting COTS in the first place?

**Question 1**: Do the COTS tend to lump close together?

*Part 1:* On average, how many COTS are seen together in a single video frame?
To do this, let's start by first adding a column with the counts of COTS seen in each particular frame:

In [7]:
type(df['annotations'][43])  # although the data type visually looks like a list, the CSV is all text

We know there may be multiple COTS spotted in a single frame, so let's count up each that is spotted in a new column. We'll using the `{` to know how many COTS are in each: 

In [8]:
count_func = lambda string: string.count('{')
spotted = df['annotations'].apply(count_func)

In [9]:
df = df.assign(starfish_spotted=spotted)
df.head()

Cool beans! Now we can calculate the average of COTS spotted in a given frame:

In [10]:
mean_starfish_spotted_in_a_frame = round(df["starfish_spotted"].mean(), 4)
print(f"On average, {mean_starfish_spotted_in_a_frame} COTS are seen together in a single video frame.")

Wowza, that seems very low. Let's also visualize the distribution of the `starfish_spotted` column using a histogram and PDF:

In [41]:
def plot_histogram_from_df(df, column, title, x_axis, y_axis):
    """
    Plots the PDF of a column in a given DataFrame, using Matplotlib.
    
    Credit for the equation used for plotting the PDF goes to the NumPy documentation:
        https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
    
    Args:
        df(pandas.DataFrame)
        column(str): name of the column being plotted
        title(str), x_axis(str), y_axis(str): will be added to the plot
        
    Returns: None
    """
    # A: calculate the mean and std dev of the column
    mu, sigma = df[column].mean(), df[column].std()
    # B: init the histogram
    bin_edges, bins_probabilites, ignored = plt.hist(df[column], density=True)
    # C: plot the PDF 
    plt.plot(bins_probabilites, 1/(sigma * np.sqrt(2 * np.pi)) *
               np.exp(-(bins_probabilites - mu)**2 / (2 * sigma**2)),
             linewidth=2, color='r')
    # D: make the plot more presentable
    plt.title(title)
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.show()

In [42]:
plot_histogram_from_df(df, 'starfish_spotted', "PDF of COTS Spotted in Video Frames", "No. of COTS", "Probability")

One takeaway on this: the distribution of COTS per video frame is heavily skewed, and the majority of them have none at all. This reinforces the idea that we'll want to weigh the `recall` highly in evaluating the eventual model we build, so we can detect the relatively low number of COTS that exist per image.

*Part 2:* On average, how many video frames do we go without seeing any COTS in the provided videos?

With this question I am trying to get another measurement of how closely the groups of COTS are to one another. The approach which I'll take here is to gather a distribution of the numbers of frames that happen sequentially in the training data, in which there are zero COTS spotted.

Note that one limitation of this approach is that certain frames might be of the same location on the Great Barrier Reef (since we don't know if the camera-person is always moving). Regardless, I think we'll go ahead with this approach anyway, since I believe it's reasonable to assume the camera is moving for most of the time in the giving videos; therefore, the amount of frames in between the time we spot any COTS is like a "proxy" for how close they are together.

In [36]:
def zero_sequence_lengths(a):
    """Compute the lengths of the sequences of consecutive zeros in an array.
    
    This is a modification of code by Warren Weckesser, originally posted on Stack Overflow:
    https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array
    
    Example:
        >>> a = np.array([[1, 2, 3, 0, 0, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0, 0, 9, 8, 7, 0, 10, 11]])
        >>> zero_sequence_lengths(a)
            array([6, 4, 1])
            
    Args:
        a(array-like object): 1-dimensional. Can have positive or negative numbers.
        
    Returns: 1D array-like object.
    """
    # A: Create an array that is 1 where a is 0, and pad each end with an extra 0
    is_zero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    # B: Zero out any of the"in between" 1's - only 1's at edges remain
    ones_at_edges = np.abs(np.diff(is_zero))
    # C: Get the indices of all the remaining 1's (the starts and ends)
    sequences = np.where(ones_at_edges == 1)[0].reshape(-1, 2)
    # D: Compute a 1D array with just the lengths of these sequences
    return np.squeeze(np.diff(sequences, axis=1))

I will define an "empty frame" as one having no COTS, and use that to make the variable names more brief:

In [40]:
# Compute the lengths of these consecutive frames with zero COTS
consecutive_empty_frames = zero_sequence_lengths(a)
# print the mean
avg_empty = round(consecutive_empty_frames.mean(), 1)
print(f"There is average of {avg_empty} 'empty' frames between the ones that do have COTS.")

So there's our answer! Out of curiosity, let's make a PDF from the distribution of these lengths:

In [55]:
def plot_histogram_from_arr(array, title, x_axis, y_axis):
    """
    Plots the PDF of a column in a given DataFrame, using Matplotlib.
    
    Credit for the equation used for plotting the PDF goes to the NumPy documentation:
        https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
    
    Args:
        array(array-like object): 1-dimensional, has numerical values
        title(str), x_axis(str), y_axis(str): will be added to the plot
        
    Returns: None
    """
    # A: calculate the mean and std dev of the column
    mu, sigma = array.mean(), array.std()
    # B: init the histogram
    bin_edges, bins_probabilites, ignored = plt.hist(array, density=True)
    # C: plot the PDF 
    plt.plot(bins_probabilites, 1/(sigma * np.sqrt(2 * np.pi)) *
               np.exp(-(bins_probabilites - mu)**2 / (2 * sigma**2)),
             linewidth=2, color='r')
    # D: make the plot more presentable
    plt.title(title)
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.show()

In [56]:
plot_histogram_from_arr(consecutive_empty_frames, "PDF of Empty Video Frames", "No. of Frames", "Probability")

Hmm... interestingly enough, there's only 3 unqiue values in this distribution, and they are not  very high.

In [53]:
np.unique(consecutive_empty_frames)  

Given these plots, it might suggest that the COTS tend to grow near to each other? To brainstorm, this could mean a sequential model would be useful for this problem. That way, once we know we have seen 1 COTS in a video, we can then somehow perhaps "alert" the model, to increase the probability of detecting more COTS nearby.

**Question 2:** What are some "giveaways" that a certain object in an video frame is that of a COTS?

- *Part 1:* What is the distibution of the observed colors of COTS in the videos?

In [None]:
pass

- *Part 2:* What is the distibution of the observed sizes of COTS in the videos?

In [None]:
pass

## Modelling

Step 1: Colab-specific: Mount Google Drive (we can do this by clicking the appropiate icons on the screen, please see here for more info)

In [None]:
try:
  from google.colab import drive
  drive.mount('/content/drive')
except ModuleNotFoundError:
  pass

In [None]:
# Colab-specific - find a folder to store YOLOv5 stuff
%cd ./drive/MyDrive

**Step 2:** Now let's load the YOLOv5 model from the PyTorch Hub:

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install

In [None]:
!ls

In [None]:
import utils
display = utils.notebook_init()  # running the checks

**Step 3:** Now, let export to a `SavedModel` (this is using the "nano" version of YOLOv5, the smallest). Note that only need to do this *once*:

In [None]:
!python export.py --weights yolov5n.pt --include saved_model

**Step 4:** Now, let's test loading the model back in via Tensorflow 2:

In [None]:
tf_model = tf.keras.models.load_model('./yolov5n_saved_model')

Woohoo! We got a warning that the model has no training configuration, but that is to be expected (since we want to do all the training code henceforth ourselves, using Tensorflow code). 

Let's get started!