# How to prepare a dataset for training

1. Clone the repository with the command below, then place this notebook in the yolov3 directory
    - ```git clone -b archive https://github.com/ultralytics/yolov3```
2. In the path cell below (```PATHS_TO_TRAIN```, ```PATH_TO_VALID```), enter the paths to the training and validation images in your dataset
3. Run the entire notebook

In [1]:
!dir "../../Dataset/Oresundsbron/02_non_processed_additionalsamples/02_non_processed_additionalsamples/"

 Volume in drive C is SENOR0LUNLT0524
 Volume Serial Number is D2E4-FDEC

 Directory of C:\Users\SEFuDA\OneDrive - Sony\Documents\Dataset\Oresundsbron\02_non_processed_additionalsamples\02_non_processed_additionalsamples

2021-04-25  16:03    <DIR>          .
2021-04-25  16:03    <DIR>          ..
2021-04-25  15:59    <DIR>          images
2021-04-25  16:01    <DIR>          labels
               0 File(s)              0 bytes
               4 Dir(s)  283 694 907 392 bytes free


In [2]:
# Pipeline 1
PATHS_TO_TRAIN = [
    "../../Dataset/Oresundsbron/03_annotated_train_and_val/03_annotated_train_and_val/images/train",
    "../../Dataset/Oresundsbron/02_non_processed_additionalsamples/02_non_processed_additionalsamples/images",
]
PATH_TO_VALID = "../../Dataset/Oresundsbron/03_annotated_train_and_val/03_annotated_train_and_val/images/val"

#PATH_TO_TRAIN = "../../Dataset/Oresundsbron/OneDrive_1_08-04-2021/03_dataset_pipeline_1/02_top_image/images/train"
#PATH_TO_VALID = "../../Dataset/Oresundsbron/OneDrive_1_08-04-2021/03_dataset_pipeline_1/02_top_image/images/val"

### Put all image paths into a list

In [6]:
import os, glob
valid_paths = glob.glob(PATH_TO_VALID + "/*.jpg")
valid_imgs = [os.path.basename(img_path) for img_path in valid_paths]

In [7]:
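train_paths = list()
valid_img_names = set(valid_imgs)  # a set gives fast membership tests
for path_to_train in PATHS_TO_TRAIN:
    # Skip any training image whose file name also appears in the validation set
    train_paths += [img_path for img_path in glob.glob(path_to_train + "/*.jpg")
                    if os.path.basename(img_path) not in valid_img_names]
train_imgs = [os.path.basename(train_path) for train_path in train_paths]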

In [8]:
intersections = set(train_imgs).intersection(set(valid_imgs))
assert len(intersections) == 0 # Make sure no intersection between valid and train set

### Create dataset files compatible with ultralytics/yolov3

In [9]:
import os

In [10]:
data_file = """
classes=5
train=data/oresund_train.txt
valid=data/oresund_valid.txt
names=data/oresund.names
"""

with open("data/oresundsbron.data", 'w') as f:
    f.write(data_file)

In [11]:
with open('data/oresund_train.txt', 'w') as f:
    print(f"Number of img files: {len(train_paths)}")
    num_txt = 0
    for img_path in train_paths:

        assert os.path.isfile(img_path)
        txt_path = img_path.replace("images", "labels").replace(".jpg", ".txt")
        if os.path.isfile(txt_path):
            num_txt += 1

        f.write(img_path + '\n')

    print(f"Number of txt files: {num_txt}")

        
with open('data/oresund_valid.txt', 'w') as f:
    print(f"Number of img files: {len(valid_paths)}")
    num_txt = 0
    for img_path in valid_paths:
        assert os.path.isfile(img_path)
        txt_path = img_path.replace("images", "labels").replace(".jpg", ".txt")
        if os.path.isfile(txt_path):
            num_txt += 1
                          
        f.write(img_path + '\n')

    print(f"Number of txt files: {num_txt}")
        

Number of img files: 2961
Number of txt files: 2860
Number of img files: 204
Number of txt files: 204
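
The counts above show 2961 training images but only 2860 label files, so 101 training images have no label file. Whether these are intentional or missing annotations is worth checking. A minimal sketch for listing them, assuming the same `images` to `labels` and `.jpg` to `.txt` substitution as the cell above; `images_without_labels` is a hypothetical helper name, not part of ultralytics/yolov3:

```python
import os


def images_without_labels(img_paths):
    """Return the image paths that have no matching label file.

    Assumption: labels are found with the same string substitution used
    when writing oresund_train.txt (images -> labels, .jpg -> .txt).
    """
    missing = []
    for img_path in img_paths:
        txt_path = img_path.replace("images", "labels").replace(".jpg", ".txt")
        if not os.path.isfile(txt_path):
            missing.append(img_path)
    return missing
```

Calling `images_without_labels(train_paths)` in this notebook should return the unlabelled training images, which can then be inspected by hand.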


In [12]:
classes = [
    "Class_1",
    "Class_2",
    "Class_3",
    "Class_4",
    "Class_5_6"
]

with open('data/oresund.names', 'w') as f:
    for name in classes:
        f.write(name + '\n')

## Configure yolov3.cfg to match number of classes

For each ```[yolo]``` layer,
- Change ```classes=80``` to ```classes=num_classes```
- In the ```[convolutional]``` layer right before, change ```filters=255``` to ```filters=(5+num_classes)*3```

e.g. for the 5 classes used here: ```classes=5``` and ```filters=(5+5)*3=30```
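
The rule above can also be applied programmatically. The following is a minimal sketch, not part of ultralytics/yolov3: `yolo_filters` and `patch_cfg_text` are hypothetical helpers, and the code assumes a Darknet-style cfg where each section starts with a bracketed header and each `[yolo]` section is directly preceded by the `[convolutional]` section that feeds it.

```python
import re


def yolo_filters(num_classes, anchors_per_layer=3):
    """filters=(5+num_classes)*3, as stated in the rule above."""
    return (5 + num_classes) * anchors_per_layer


def patch_cfg_text(cfg_text, num_classes):
    """Return cfg_text with classes= and filters= updated for every [yolo] layer."""
    lines = cfg_text.splitlines()
    # Indices of the lines that open a section, e.g. "[convolutional]".
    starts = [i for i, ln in enumerate(lines) if ln.strip().startswith("[")]
    bounds = starts + [len(lines)]
    for k, start in enumerate(starts):
        if lines[start].strip() != "[yolo]":
            continue
        # Update classes= inside this [yolo] section.
        for i in range(start + 1, bounds[k + 1]):
            if re.match(r"\s*classes\s*=", lines[i]):
                lines[i] = f"classes={num_classes}"
        # Update filters= inside the [convolutional] section right before it.
        if k > 0 and lines[starts[k - 1]].strip() == "[convolutional]":
            for i in range(starts[k - 1] + 1, start):
                if re.match(r"\s*filters\s*=", lines[i]):
                    lines[i] = f"filters={yolo_filters(num_classes)}"
    return "\n".join(lines) + "\n"
```

One way to use it is to read `cfg/yolov3.cfg`, pass its text through `patch_cfg_text(text, 5)`, and write the result to a new file so the original cfg stays untouched.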

# Start training

Run one of the following commands in a terminal, from the yolov3 directory, to start training.

**NOTE:**
You may need to manually download the weights file 'yolov3.weights' and place it in the 'yolov3/weights' directory. The download link appears in the output logs of the cell.

In [13]:
# python train.py --data data/oresundsbron.data --weight weights/yolov3.weights --cfg cfg/yolov3.cfg --batch-size 4

# python train.py --data data/oresundsbron_top.data --weight weights/yolov3.weights --cfg cfg\yolov3_5_classes.cfg --batch-size 4 --multi-scale

# Installing Apex (can speed up training by about 2x)

1. First make sure you are in the same conda environment as your project (e.g. ```conda activate <environment_name>```)
2. ```git clone https://github.com/NVIDIA/apex```
3. ```cd apex```
4. ```pip install -v --no-cache-dir .```

# Verify no data leakage

In [15]:
import os, glob

In [19]:
train_imgs = [os.path.basename(img_path) for img_path in train_paths]
valid_imgs = [os.path.basename(img_path) for img_path in valid_paths]

In [21]:
print(len(set(train_paths)))
print(len(set(valid_imgs)))
print(len(set(valid_imgs).intersection(train_imgs)))

2961
204
0
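
The check above compares file names only, so a validation image copied into a training folder under a different name would go unnoticed. A minimal sketch of a content-based check, assuming byte-identical copies are the concern (re-encoded or resized duplicates would still be missed); `file_digest` and `content_overlap` are hypothetical helper names:

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def content_overlap(train_files, valid_files):
    """Return the validation files whose bytes match some training file."""
    train_digests = {file_digest(p) for p in train_files}
    return [p for p in valid_files if file_digest(p) in train_digests]
```

With the variables from this notebook, `content_overlap(train_paths, valid_paths)` should return an empty list if no validation image is duplicated byte for byte in the training set.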
