# DeepSeaAI
#### Pipeline used for the cleaning and analysis of citizen science data with AI.

This notebook explains how to clean your citizen science data and train a YOLOv8 model on it. It was developed to handle DeepSeaSpy data (available at https://zenodo.org/records/13759095; see also https://ocean-spy.ifremer.fr/).

Note: some functions may copy your images (specifically **vision**, **catalog** and **prepare_yolo**). Check the space available on your computer, deployment or cloud storage before running them.

### Contents
You should have a Jupyter notebook file and a Python file; both are required:

```
deep-sea-lab
├── DeepSeaLab.ipynb        <- You are here
├── Functions.py            <- Where functions used for the cleaning/analysis are stored.
```

**Functions** contains all the functions needed for the cleaning and analysis of citizen science datasets. Detailed explanations of each function can be found in that file. You can modify them and use them as the basis for your work. It also contains functions that are useful only for exploring your dataset or for advanced, specific modifications.

**DeepSeaLab** is a ready-to-use notebook that you can adapt to your needs and dataset. It explains how to proceed at each step.

In [1]:
import sys

print("Python version:", sys.version_info)

Python version: sys.version_info(major=3, minor=10, micro=16, releaselevel='final', serial=0)


### Requirements
Import the functions needed for the cleaning and analysis.

In [2]:
import os, csv, json, collections, random, shutil
import pandas as pd
from pathlib import Path
import cv2
from Functions import polygones2bb, points2bb, lines2bb, convert_yolo, prepare_yolo, vision, SaveCSV, create_yaml
from Functions import unite, catalog, get_df
import matplotlib.pyplot as plt
os.getcwd()

'C:\\imagine\\github\\deep-species-detection\\deep-sea-lab'

### Access to data/files
Three main paths must be defined:

```
path_csv
```
Location of your dataset.


```
path_img
```
Path to the folder where your images are stored.
Images should be in .jpg or .png format.

```
images
├── image 1.jpg           
├── image 2.jpg          
├── ... 
```

Finally:
```
path_save
```
Where to store the cleaned dataset, catalogs, etc...

In [2]:
# csv access :
path_csv=r'/storage/export.csv'
# images :
path_img=Path(r'/storage/Image_dsp/') 
# save
path_save=r'/storage/save'

## Conversion of bounding boxes

Our pipeline can convert 3 types of annotations (polygons, lines and points) into regular bounding boxes.
If your data already follows the format below, you can skip this part.

|xmin |ymin |xmax |ymax |
|-----|-----|-----|-----|
|972  |549  |982  |559  |


#### Polygons
On DeepSeaSpy, polygons are in the json format :

```
[{\"x\":282,\"y\":115},{\"x\":15,\"y\":538},{\"x\":50,\"y\":679},{\"x\":285,\"y\":497}]
```

The column containing the polygon values should be named polygon_values:

|polygon_values                                                                          |
|----------------------------------------------------------------------------------------|
|[{\"x\":282,\"y\":115},{\"x\":15,\"y\":538},{\"x\":50,\"y\":679},{\"x\":285,\"y\":497}] |

Our pipeline was therefore built to handle this format. If you need to convert your own type of polygons, you can modify the way points are read in the **polygones2bb** function.
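
The conversion described above amounts to taking the extreme coordinates of the polygon. The following is only a minimal sketch, assuming the escaped quotes have already been resolved into plain JSON when the CSV is read. `polygon_to_bbox` is a hypothetical helper written for illustration, not the function stored in Functions.py:

```python
import json

def polygon_to_bbox(polygon_json):
    """Return (xmin, ymin, xmax, ymax) enclosing a polygon.

    polygon_json is a JSON string holding a list of {"x": ..., "y": ...} points.
    Hypothetical helper for illustration only.
    """
    points = json.loads(polygon_json)
    xs = [p["x"] for p in points]
    ys = [p["y"] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

example = '[{"x":282,"y":115},{"x":15,"y":538},{"x":50,"y":679},{"x":285,"y":497}]'
bbox = polygon_to_bbox(example)
print(bbox)  # (15, 115, 285, 679)
```

If your polygons are stored differently (for instance as a flat list of coordinates), only the two list comprehensions would need to change.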

#### Lines
Lines are to be in the following format, with two points defined by (x1,y1) and (x2,y2).

|x1 |y1 |x2 |y2 |length|
|---|---|---|---|------|
|761|451|859|364|131   |

Length is used to correct the converted bounding box, depending on the line's angle with the x axis. If the line is too close to vertical or horizontal, the **lines2bb** function automatically corrects the converted bounding box. By default, the correction is applied when the angle is within ±5 degrees of the horizontal or the vertical. You can modify this behaviour and find more information in the Functions.py file.
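
To illustrate why a correction is needed, here is a rough sketch of one possible approach. The correction rule used here (widening the flat side of the box to the line's length) is an assumption made for this example, and `line_to_bbox` and `angle_tol` are hypothetical names; the actual rule is the one implemented in Functions.py:

```python
import math

def line_to_bbox(x1, y1, x2, y2, length, angle_tol=5.0):
    """Box enclosing a line, widened when the line is nearly axis-aligned.

    Hypothetical sketch: the widening rule is an assumption, not the
    behaviour of lines2bb.
    """
    xmin, xmax = min(x1, x2), max(x1, x2)
    ymin, ymax = min(y1, y2), max(y1, y2)
    # Angle between the line and the x axis, folded into [0, 90] degrees
    angle = math.degrees(math.atan2(abs(y2 - y1), abs(x2 - x1)))
    if angle <= angle_tol:
        # Nearly horizontal: the box would be almost flat, so widen it vertically
        cy = (ymin + ymax) / 2
        ymin, ymax = cy - length / 2, cy + length / 2
    elif angle >= 90 - angle_tol:
        # Nearly vertical: the box would be almost a sliver, so widen it horizontally
        cx = (xmin + xmax) / 2
        xmin, xmax = cx - length / 2, cx + length / 2
    return xmin, ymin, xmax, ymax

print(line_to_bbox(761, 451, 859, 364, 131))  # diagonal line, no correction
print(line_to_bbox(100, 200, 200, 200, 100))  # horizontal line, corrected
```

Without such a step, a perfectly horizontal line would produce a bounding box of zero height, which cannot be used for training.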

#### Points
You can manually set a padding on the x and y axis in the Functions.py file.

|x1 |y1 |
|---|---|
|761|451|
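
As a rough illustration of what a fixed padding does, the sketch below builds a box around a point. `point_to_bbox`, `pad_x` and `pad_y` are hypothetical names chosen for this example, and the clamping at 0 is an assumption; the actual padding values live in Functions.py:

```python
def point_to_bbox(x, y, pad_x=50, pad_y=50):
    """Return (xmin, ymin, xmax, ymax) centred on a point.

    Hypothetical helper: lower bounds are clamped at 0 so the box
    does not start outside the image.
    """
    return max(0, x - pad_x), max(0, y - pad_y), x + pad_x, y + pad_y

print(point_to_bbox(761, 451, pad_x=20, pad_y=10))  # (741, 441, 781, 461)
```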

The padding is the same for every point in your dataset. If you wish to use a different padding for different species or uses, we recommend splitting your dataset and running each part with its own padding. You can then concatenate all of your subsets with:

```
pd.concat([polybb,lignesbb,pointsbb])
```

We also recommend renaming your image and species columns as follows, so that our functions can run properly.

|name_img         |name_sp     |
|-----------------|------------|
|'MOMAR_90095.jpg'|'Buccinidae'|

In [None]:
# Import your dataset
data=pd.read_csv(path_csv, sep=None, engine='python')
data.head()

In [None]:
# Rename data imported from DeepSeaSpy
#data.rename(columns={'pos1x': 'x1', 'pos1y': 'y1','pos2x': 'x2', 'pos2y': 'y2','name_fr':'name_sp','name':'name_img'}, inplace=True)

Split your data according to your dataset. The conditions used to split the data were chosen based on the DeepSeaSpy format.
You can comment out the lines of code that you do not want to execute.

In [None]:
# Subset of polygon labels
poly=data.dropna(subset=['polygon_values'])
# Subset of lines labels
lines=data.dropna(subset=['x2'])
# Subset of points labels
points=data[data['polygon_values'].isna() & data['x2'].isna()]

In [None]:
# Polygons
polybb=polygones2bb(poly)
# Lines
lignesbb=lines2bb(lines)
# Points
pointsbb=points2bb(points)

In [None]:
# Concatenate your split dataset into a single one
bb=pd.concat([polybb,lignesbb,pointsbb])

In [7]:
# Save the converted dataset
SaveCSV(bb,path_save,'export_bb')

In [3]:
bb=pd.read_csv(os.path.join(path_save,'export_bb.csv'), sep=None, engine='python')

## Vision

We encourage you to use the vision function, which draws your bounding boxes onto your images so that you can inspect them.
First, you have to define the object 'colors', which is a dictionary mapping each species to its color, coded in BGR.
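
As an example, a 'colors' dictionary could look like the sketch below. It assumes that colors are given as OpenCV-style (blue, green, red) tuples with values from 0 to 255, which is not confirmed here; 'Species_B' is a hypothetical placeholder name:

```python
# Colors in BGR order: (blue, green, red), each from 0 to 255
colors = {
    "Buccinidae": (0, 0, 255),   # red
    "Species_B": (0, 255, 0),    # green (hypothetical species name)
}

for species, bgr in colors.items():
    print(species, bgr)
```

The keys should match the values found in your name_sp column.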

If you wish to, you can limit the number of saved images by adding 'nb_img' as an argument. 
```
vision(bb,colors,path_img,path_save=None,nb_img=None)
```
When path_save is not specified, the images are saved in the parent directory of path_img.

This function copies the images from path_img, so it may generate a lot of data if you do not limit the number of images.

In [None]:
# Colors are in BGR
vision(bb, path_img, path_save, nb_img=10) #nb_img is optional, if you want to plot all of your data use "nb_img=None"

In [None]:
# Draw every bounding box in red
vision(bb, path_img, path_save, nb_img=10,colors='red')

## Unification of overlapping bounding boxes

This step ensures that there is no redundancy in your dataset. You can skip this part if you are not dealing with this kind of problem.

By default (iou=None), bounding boxes are unified as soon as they overlap, whatever the extent of the overlap.
If you wish to limit the unification to boxes that overlap beyond a certain threshold, you can set the iou argument.

```
unite(dataframe, iou=None, grouper_0=False)
```

iou corresponds to the minimum Intersection over Union (IoU) value between two bounding boxes for them to be considered overlapping.

Bounding boxes that do not overlap any other box are automatically discarded. If you want to keep them, set grouper_0=True.

The function keeps track of how many bounding boxes the final ones are made of in the column "occurrences".
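
For reference, the IoU of two boxes is the area of their intersection divided by the area of their union. The sketch below shows the standard computation on (xmin, ymin, xmax, ymax) boxes; `box_iou` is a hypothetical helper written for this example and is not necessarily how **unite** computes it:

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection (0 when the boxes are disjoint)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(round(box_iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```

In this example the two boxes overlap, but their IoU is below 0.2, so a threshold of 0.2 would leave them separate while the default setting would merge them.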



In [None]:
# Unification of all overlapping bounding boxes
ubb=unite(bb)

The following cells are examples of what you can do with the unite function. Experiment to find the parameters that work best for your dataset.

In [None]:
# Keeping bounding boxes only if they are made of at least 3 overlapping ones
ubb=ubb[ubb['occurences']>=3]

In [None]:
# Unification only when bounding boxes overlap with an IoU of at least 0.2 (20%)
ubb=unite(bb,0.2) 

In [None]:
# Unify all of your bounding boxes, without discarding isolated ones
ubb=unite(bb,grouper_0=True) 

In [5]:
# Save your dataframe
SaveCSV(ubb,path_save,'ubb')

In [3]:
# Read your saved dataset
ubb=pd.read_csv(os.path.join(path_save,'ubb.csv'), sep=',')

## Catalog

Unite only unifies overlapping bounding boxes; it does not verify that the object you want to label is actually inside the bounding box. If you want to check carefully which bounding boxes are kept in your dataset, you can use the following two functions:

```
catalog(df, path_img, path_save=None)
```
When path_save is not specified, the snapshots are saved in the parent directory of path_img.

This function creates a catalog of snapshots from all the bounding boxes in your dataframe (df). You can then delete the snapshots of the bounding boxes you want to discard.

```
get_df(df,path_save)
```

Given path_save (where your remaining snapshots are) and your unified dataframe (df), this function returns a dataframe listing the bounding boxes that remain after your manual cleaning.

In [None]:
#Create snapshots from images and the dataframe
catalog(ubb, path_img, path_save)

In [None]:
#After discarding snapshots, get the remaining rows
ubb=get_df(ubb,path_save)
ubb.head()

## Prepare for Yolo

You can then use prepare_yolo to split your dataset into 3 subsets (train, val, test) and train YOLOv8 on it.

```
prepare_yolo(df,path_save,path_img,prop=[.8,.1])
```

The **prop** parameter stands for proportion. It takes the sizes of 2 subsets, in order: **train** and **validation**. The remaining fraction is the size of the **test** subset. The test subset is not mandatory for training a model.
For example, prop=[.8,.1] means that the training subset is 80% of the dataset and the validation subset is 10%; the remaining 10% is the test subset.
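
The arithmetic behind **prop** can be sketched as follows. This is only an illustration, assuming the split is done per image after a random shuffle; `split_images` and `seed` are hypothetical names, and prepare_yolo may proceed differently:

```python
import random

def split_images(image_names, prop=(0.8, 0.1), seed=0):
    """Split unique image names into train/val/test according to prop.

    Hypothetical helper: prop gives the train and validation fractions,
    and whatever is left goes to test.
    """
    names = sorted(set(image_names))
    random.Random(seed).shuffle(names)
    n_train = round(len(names) * prop[0])
    n_val = round(len(names) * prop[1])
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }

subsets = split_images([f"img_{i}.jpg" for i in range(100)])
print({name: len(items) for name, items in subsets.items()})
# {'train': 80, 'val': 10, 'test': 10}
```

Setting prop=[.9,.1] in this sketch would leave the test subset empty.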

Yolov8 takes bounding boxes in the following format :

```
class x y w h
```

Here x and y are the coordinates of the center of the bounding box, and w and h are its width and height. These 4 values are normalised between 0 and 1, relative to the image width and height. prepare_yolo generates 1 txt file per image, and each line in this file describes 1 bounding box.

Example:
```
7 0.5713542 0.6847222 0.0359375 0.0731481
```
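
The conversion from pixel coordinates to this format can be sketched as below. `bbox_to_yolo` is a hypothetical helper written for this example, and the image size and class index are made-up values; it is not the code of convert_yolo:

```python
def bbox_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel bounding box to normalised (x_center, y_center, w, h)."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height

# Hypothetical box in a 1920x1080 image, with class index 7
x, y, w, h = bbox_to_yolo(480, 270, 960, 810, img_w=1920, img_h=1080)
print(f"7 {x:.6f} {y:.6f} {w:.6f} {h:.6f}")
# 7 0.375000 0.500000 0.250000 0.500000
```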


In [None]:
convert_yolo(ubb,path_img)

In [None]:
prepare_yolo(ubb,path_save,path_img,prop=[.8,.1])

## Yolo training
Now you can train a yolo model with your dataset.

Your yolo training folder path should look like this :

```
yolo_training
├── images        <- Where your images are
|   ├── train
|   ├── val
|   ├── test
├── labels        <- Where your bounding boxes/labels for each image are
|   ├── train
|   ├── val
|   ├── test
```
Yolo needs a yaml file to understand where the data is and what the classes are.
You can create the file yourself or use create_yaml.
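
If you write the file yourself, the sketch below shows the usual layout of an Ultralytics dataset file (path, train, val, test, names). `build_dataset_yaml` is a hypothetical helper, the root path is a placeholder, and the file produced by create_yaml may differ in its details:

```python
def build_dataset_yaml(root, class_names):
    """Return the text of a dataset yaml file for a YOLO training folder.

    Hypothetical helper: root is the folder holding images/ and labels/.
    """
    lines = [
        f"path: {root}",
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    # Class indices must match the first column of the label files
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"

yaml_text = build_dataset_yaml("/path/to/yolo_training", ["Buccinidae"])
print(yaml_text)
```

The returned text can then be written to disk with a regular `open(..., "w")` call.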

In [None]:
# Creates a yaml file containing all of the information necessary for running Yolov8 on your data
create_yaml(ubb,path_save,'output')

Everything is ready for Yolov8.


Once the requirements are installed, you can train your YOLOv8 model in Python:

In [None]:
from ultralytics import YOLO
# Load a new YOLO model from scratch
model = YOLO('yolov8n.yaml')  # build a new model from YAML

In [None]:
from ultralytics import YOLO
# Or load a pretrained YOLO model (recommended for training)
model = YOLO('yolov8n.pt')

In [None]:
# Path to the .yaml file created with create_yaml
output='output.yaml'
yaml_path=os.path.join(path_save,output)

# Train the model
results = model.train(data=yaml_path, epochs=1, imgsz=640)

In [None]:
# Run the model to identify objects
# Load your trained YOLOv8n model
model = YOLO(os.path.join(os.getcwd(), 'runs', 'detect', 'train', 'weights', 'best.pt'))

In [None]:
# Build the list of images to run inference on
# Here, they are all stored in a single folder
img_to_predict=Path(r'path\to\images\to\be\predicted')  # Path object, so that .glob() is available
list_img=list(img_to_predict.glob('**/*.jpg'))

# Run inference on your list of images
results = model.predict(list_img, save=True, save_txt=False, save_conf=False, show_conf=False, project=path_save, name='inference_results')

You can also launch the training from a command terminal (the command is the same on Windows and Linux). This can be useful if you do not want to open a Python interactive window, or if you are working remotely.

In [None]:
# For the training
!yolo task=detect mode=train model=yolov8n.yaml imgsz=640 data=absolute/path/to/output.yaml show_labels=False epochs=10 batch=8 name=run1

In [None]:
# Once your model is trained, you can run 'predict' to detect objects
!yolo predict model=path/to/yolo/runs/detect/run1/weights/best.pt source=path/to/data show_labels=False

### Steps to go further

If you wish to tune the hyperparameters of your trained model, you can do so.
Only do this step if you have sufficient computing resources, as hyperparameter tuning takes a long time.

In [None]:
model = YOLO('/runs/detect/buccin_cit_07/weights/best.pt')

model.tune(data='output.yaml',epochs=200, iterations=300, optimizer='AdamW', plots=True, save=True, val=False)