The VDAS Data Manipulation Notebook is primarily used for data manipulation (dataset preparation and cleaning) before training, validating and testing YOLOv5 models for VDAS.

The key terms:

*   Originals &ndash; original code (DO NOT MODIFY - follow these to know the original flow)
*   Playground &ndash; custom or temporary code unrelated to the section (EDIT THIS)
*   S-Class &ndash; the best code available for a specific task; use it as often as possible
<br/><br/>


NOTES:
*   Use a CPU-only runtime (no GPU) for more storage and a more lenient timeout
*   Unzip into the Colab VM's local disk (DO NOT unzip into the shared Drive folder)
*   DO NOT modify code in the 'Originals' sections
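
The unzip note can also be enforced in code. The following is a minimal sketch, not part of the original flow: `unzip_to_local` is a hypothetical helper, and it assumes Drive is mounted at `/content/drive`. It refuses to extract into any path under the mount point.

```python
from os.path import abspath, commonpath
from zipfile import ZipFile

DRIVE_MOUNT = '/content/drive'  # assumption: Drive is mounted here


def unzip_to_local(zip_path, dest_dir, drive_mount=DRIVE_MOUNT):
    """Extract zip_path into dest_dir, refusing destinations inside the Drive mount."""
    dest = abspath(dest_dir)
    mount = abspath(drive_mount)
    # commonpath equals the mount point only when dest lies inside it
    if commonpath([dest, mount]) == mount:
        raise ValueError(f'{dest} is inside {mount}; extract to the VM disk instead')
    with ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return len(zf.namelist())  # number of archive members extracted
```

A call such as `unzip_to_local(image_dir_drive, '/content/images')` would read the archive from Drive but write only to the VM disk.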

# Mount Google Drive

In [None]:
!sudo add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!sudo apt update && sudo apt install -y google-drive-ocamlfuse > /dev/null 2>&1
!google-drive-ocamlfuse

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [Waiting for headers] [C[0m[33m0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (185.125.190.36[0m                                                                               Hit:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:7 http://ppa.launchpad.net/alessandro-strada/ppa/ubuntu bionic InRelease
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:9 http://archive.ubuntu.com/ubuntu bio

In [None]:
!sudo apt-get install w3m # to act as web browser 
!xdg-settings set default-web-browser w3m.desktop # to set default browser 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libgpm2
Suggested packages:
  gpm cmigemo dict dict-wn dictd libsixel-bin mpv w3m-el w3m-img
The following NEW packages will be installed:
  libgpm2 w3m
0 upgraded, 2 newly installed, 0 to remove and 46 not upgraded.
Need to get 939 kB of archives.
After this operation, 2,629 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgpm2 amd64 1.20.7-5 [15.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 w3m amd64 0.5.3-36build1 [924 kB]
Fetched 939 kB in 0s (2,494 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf:

In [None]:
!mkdir -p "/content/drive/MyDrive"   # -p: no error if the folders already exist
!google-drive-ocamlfuse "/content/drive/MyDrive"

Access token retrieved correctly.


In [None]:
### Imports

import os, shutil   # later cells call os.* and shutil.* directly
import numpy as np, random
from collections import Counter # Check dups
from cv2 import imread, rectangle, getTextSize, putText, FONT_HERSHEY_DUPLEX, imwrite   # to draw bboxes
from glob import glob, glob1
from IPython.display import Image, clear_output   # to display images
from os import listdir, walk, makedirs, scandir
from os.path import isfile, isdir, join, splitext, basename, abspath, getsize, getmtime
from shutil import copy, move, rmtree
from zipfile import ZipFile, ZIP_STORED
from numpy.random import RandomState
!wget -q https://raw.githubusercontent.com/Nagidrop/fast-copy/master/fast-copy.py   # fast-copy

### Utility functions

# Count & print total files in folder
def total_files(input_dir):
    all_files = [join(root, f) for root, _, files in walk(input_dir) for f in files]
    print(f'{input_dir}: {len(all_files)}')

# Count total files only (NO print)
def total_files2(input_dir):
    all_files = [join(root, f) for root, _, files in walk(input_dir) for f in files]
    return len(all_files)

# Count total files with listdir (no recursive - only top-level files)
def total_files3(input_dir):
    all_files = listdir(input_dir)
    print(f'{input_dir}: {len(all_files)}')

# Get the most recently modified subdirectory of a directory (None if there is none)
def latest_modified_subdir(parent_dir):
    if not isdir(parent_dir):
        return None
    subdirs = [join(parent_dir, d) for d in listdir(parent_dir) if isdir(join(parent_dir, d))]
    return max(subdirs, key=getmtime, default=None)

# Print folder size in bytes (top-level entries only, not recursive)
def folder_size(folder_path):
    print(sum(entry.stat().st_size for entry in scandir(folder_path)))

# Check GPU
!nvidia-smi -L

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [None]:
import numpy as np, random
from collections import Counter # Check dups
from cv2 import imread, rectangle, getTextSize, putText, FONT_HERSHEY_DUPLEX, imwrite   # to draw bboxes
from glob import glob, glob1
from IPython.display import Image, clear_output   # to display images
from os import listdir, walk, makedirs, scandir
from os.path import isfile, isdir, join, splitext, basename, abspath, getsize, getmtime
from shutil import copy, move, rmtree
from zipfile import ZipFile, ZIP_STORED
from numpy.random import RandomState

# I. Clean Datasets

## Step 1: Import datasets to clean

### a. Original imports

In [None]:
# Local & Drive directories

image_dir_local  = '/content/images'        # change according to needs
label_dir_local  = '/content/labels'        # change according to needs

image_dir_drive  = '/content/drive/MyDrive/yolov5/org-datasets/full-dataset-images.zip'      # CHANGE THIS EVERY RUN
label_dir_drive  = '/content/drive/MyDrive/yolov5/org-datasets/full-dataset-labels.zip'      # CHANGE THIS EVERY RUN

# Create local directories

os.makedirs(image_dir_local,  exist_ok=True)
os.makedirs(label_dir_local,  exist_ok=True)

# Unzip images & labels from Drive to local directories

!unzip $image_dir_drive -d $image_dir_local
!unzip $label_dir_drive -d $label_dir_local

In [None]:
# Download file from Google Drive given ID

!gdown --id 17-FCstm8Fz3bDzFgTmOWHa_c39lTR_1P

Downloading...
From: https://drive.google.com/uc?id=17-FCstm8Fz3bDzFgTmOWHa_c39lTR_1P
To: /content/CMFD.zip
100% 10.3G/10.3G [01:15<00:00, 136MB/s]


In [None]:
# Validate the number of files

# Image & label directories

image_dir = '/content/images'
label_dir = '/content/labels'

# Count number of images

print(f'Images count: {len(os.listdir(image_dir))}')

# Count number of labels

print(f'Labels count: {len(os.listdir(label_dir))}')

Images count: 98378
Labels count: 98371


In [None]:
# Download from Google Drive (limited sharing files)
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
file_id = '1gjltyD_MnNWcnd56NnjUOizdi39CUEPF' # URL id. 
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('IMFD.zip')

KeyboardInterrupt: ignored

### b. Playground

In [None]:
makedirs('8th-VDAS-images')
makedirs('8th-VDAS-labels')

In [None]:
makedirs('7th-VDAS-labels')

In [None]:
!unzip -qq /content/drive/MyDrive/yolo/4-datasets/7th-VDAS/0420-masks-labels.zip -d /content/8th-VDAS-labels
!unzip -qq /content/drive/MyDrive/yolo/4-datasets/7th-VDAS/0420-masks.zip -d /content/8th-VDAS-images

In [None]:
!unzip -qq /content/drive/MyDrive/yolo/4-datasets/7th-VDAS/CrowdHuman-merged-labels-13887.zip -d /content/8th-VDAS-labels
!unzip -qq /content/drive/MyDrive/yolo/4-datasets/7th-VDAS/CrowdHuman-merged-labels-13887-lbp.zip -d /content/8th-VDAS-labels


!unzip -qq /content/drive/MyDrive/yolo/4-datasets/7th-VDAS/CrowdHuman-optimized-13887-lbp.zip -d /content/8th-VDAS-images
!unzip -qq /content/drive/MyDrive/yolo/4-datasets/7th-VDAS/CrowdHuman-optimized-13887.zip -d /content/8th-VDAS-images


In [None]:
total_files3('/content/8th-VDAS-images')
total_files3('/content/8th-VDAS-labels')

/content/8th-VDAS-images: 44510
/content/8th-VDAS-labels: 44510


In [None]:
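# Remove all files in the current folder whose names contain the given text
!find . -name '*cuuamtest*' # lists the matches before removing
!rm *cuuamtest* # fails with "Argument list too long" when the glob matches too many files; `find . -maxdepth 1 -name '*cuuamtest*' -delete` avoids the shell expansion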

/bin/bash: /bin/rm: Argument list too long
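
The error above comes from the shell expanding the glob into more arguments than `rm` can accept. As a sketch of one way around it, the hypothetical helper below matches names in Python instead of in the shell. It only looks at top-level files and defaults to a dry run, which mirrors the "check before removing" step.

```python
from fnmatch import fnmatch
from os import remove, scandir


def remove_matching(folder, pattern, dry_run=True):
    """Delete top-level files in folder whose names match a glob pattern.

    Returns the number of matching files; nothing is deleted while dry_run is True.
    """
    matched = [e.path for e in scandir(folder)
               if e.is_file() and fnmatch(e.name, pattern)]
    if not dry_run:
        for path in matched:
            remove(path)
    return len(matched)
```

Calling `remove_matching('.', '*cuuamtest*')` first reports the count, and repeating the call with `dry_run=False` performs the deletion.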


## Step 2: Clean the dataset



> Keep only specified classes and delete others
1. Delete labels and coordinates that belong to other classes
    * Keep "person" classes only 
    * Keep "person" and "car" classes
2. Delete the odd ones
    * Delete images without corresponding labels
    * Delete labels without corresponding images
    * Count only without deletion
3. Split to train/val/test datasets
    * Split to halves (2 sets)
    * Split to thirds (3 sets)
4. Check train/val/test integrity after splitting 
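
For step 3, a reproducible split can be sketched as below. This is an illustration only: `split_stems`, the 80/10/10 ratios and the seed are assumptions, not values taken from this notebook. It works on file stems (names without extensions) so that each image and its label land in the same set.

```python
import random


def split_stems(stems, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle file stems reproducibly and split them into train/val/test lists."""
    stems = sorted(stems)                 # fixed order, so the seed alone decides the split
    random.Random(seed).shuffle(stems)    # private RNG, global random state is untouched
    n_train = int(len(stems) * ratios[0])
    n_val = int(len(stems) * ratios[1])
    train = stems[:n_train]
    val = stems[n_train:n_train + n_val]
    test = stems[n_train + n_val:]        # remainder, so no stem is lost to rounding
    return train, val, test
```

Because the test set takes the remainder, the three lists always cover every stem exactly once, which is the property step 4 checks.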


### a. Delete labels and coordinates that belong to other classes

#### Originals

&ndash;&nbsp;&nbsp;Keep only  "person"  classes




In [None]:
# Label directory to clean
label_dir = '/content/labels2017/coco/labels/train2017'

# Delete labels without 'person' classes 
labels = os.listdir(label_dir)
deleted_labels_count = 0

for label in labels:
    if not label.endswith('.txt'):
        print(f'Unexpected non-.txt file, stopping: {label}')
        break

    try:
        isPerson = False  # becomes True once a 'person' line is found

        # Each label line starts with its class ID, and class 0 is 'person',
        # so a line starting with '0 ' marks a 'person' box
        with open(os.path.join(label_dir, label), 'r') as fr:
            for line in fr.readlines():
                if line.startswith('0 '):
                    isPerson = True
                    break

        # delete labels that don't have 'person' class
        if not isPerson:
            os.remove(os.path.join(label_dir, label))
            deleted_labels_count += 1

    except Exception as e: 
        print(e)
        exit(0)

print('Deleted ' + str(deleted_labels_count) + ' labels')

# Delete non-person coordinates (labels that are not of 'person' class within same label file)
labels = os.listdir(label_dir)
deleted_coordinates_count = 0

for label in labels:
    if not label.endswith('.txt'):
        print(f'Unexpected non-.txt file, stopping: {label}')
        break

    try:
        # Read all lines and write back the lines if that line contains 'person' class
        with open(os.path.join(label_dir, label), 'r') as fr:
            lines = fr.readlines()

        with open(os.path.join(label_dir, label), 'w') as fw:
            for line in lines:
                if line.strip().startswith('0 '):
                    fw.write(line)
                else:
                    deleted_coordinates_count += 1
                    
    except Exception as e: 
        print(e)
        exit(0)

print(f'Deleted {deleted_coordinates_count} coordinates')

Deleted 53151 labels
Deleted 344392 coordinates
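
The two passes above can also be done in one. The sketch below is an alternative, not the original flow: `keep_classes` is a hypothetical helper that keeps only lines whose class ID is in `keep` and deletes files left with no lines. Its line count includes lines from deleted files, so it will not match the two-pass numbers printed above.

```python
import os


def keep_classes(label_dir, keep=('0',)):
    """Keep only label lines whose class ID is in keep; delete files left empty.

    Returns (deleted_files, deleted_lines).
    """
    deleted_files = deleted_lines = 0
    for name in os.listdir(label_dir):
        if not name.endswith('.txt'):
            continue                      # skip anything that is not a label file
        path = os.path.join(label_dir, name)
        with open(path, 'r') as fr:
            lines = fr.readlines()
        # the class ID is the first whitespace-separated token of each line
        kept = [ln for ln in lines if ln.split() and ln.split()[0] in keep]
        deleted_lines += len(lines) - len(kept)
        if kept:
            with open(path, 'w') as fw:
                fw.writelines(kept)
        else:
            os.remove(path)
            deleted_files += 1
    return deleted_files, deleted_lines
```

Comparing the first token exactly, rather than testing a prefix, avoids matching on leading whitespace differences between lines.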


In [None]:
# Different purpose (testing only, don't run this)

import os

# Label directory to clean
label_dir = "/content/CrowdHuman-labels"

# Delete labels without 'person' classes
labels = os.listdir(label_dir)
deleted_list = []

for label in labels:
    if not label.endswith('.txt'):
        print(f'Unexpected non-.txt file, stopping: {label}')
        break

    try:
        isValid = False  # init flag

        with open(os.path.join(label_dir, label), 'r', encoding='UTF-8') as fr:
            for line in fr:
                if line.startswith('0 '):
                    isValid = True
                    break

        # delete labels that don't have 'person' class
        if not isValid:
            deleted_list.append(os.path.join(label_dir, label))

    except Exception as e:
        print(e)
        exit(0)

if len(deleted_list) == len(labels):
    print(f'Something is wrong with the code. All {len(labels)} labels marked as deleted')
    exit(0)
else:
    for deleted in deleted_list:
        os.remove(deleted)

print('Deleted ' + str(len(deleted_list)) + ' labels')

# Delete non-person coordinates (labels that are not of 'person' class within same label file)
labels = os.listdir(label_dir)
deleted_coordinates_count = 0

for label in labels:
    if not label.endswith('.txt'):
        print(f'Unexpected non-.txt file, stopping: {label}')
        break

    try:
        # Read all lines and write back the lines if that line contains 'person' class
        with open(os.path.join(label_dir, label), 'r', encoding='UTF-8') as fr:
            lines = list(fr)

        with open(os.path.join(label_dir, label), 'w', encoding='UTF-8') as fw:
            for line in lines:
                if line.strip().startswith('0 '):
                    fw.write(line)
                else:
                    deleted_coordinates_count += 1

    except Exception as e:
        print(e)
        exit(0)

print('Deleted ' + str(deleted_coordinates_count) + ' coordinates')

Deleted 0 labels
Deleted 439046 coordinates




&ndash;&nbsp;&nbsp;Keep only "person" and "car" classes



In [None]:
# Label directory to clean
label_dir = '/content/obj-labels-org'   

# Delete labels without 'person' or 'car' classes 
labels = os.listdir(label_dir)
count = 0

for label in labels:
    if label.endswith('.txt'):
        try:
            isPerson = False 
            isCar = False

            with open(os.path.join(label_dir, label), 'r') as f:
                for line in f.readlines():
                    data = line.split(' ')
                    if '6' in data:
                        isPerson = True
                        break
                    elif '8' in data:
                        isCar = True
                        break

            if not isPerson and not isCar:
                os.remove(os.path.join(label_dir, label))
                count += 1

        except Exception as e: 
            print(e)

print('Deleted ' + str(count) + ' labels')

# Delete non-person and non-car coordinates, then convert classes to their correct numbers
labels = os.listdir(label_dir)
count = 0

for label in labels:
    if label.endswith('.txt'):
        try:
            with open(os.path.join(label_dir, label), 'r') as fr:
                lines = fr.readlines()

            with open(os.path.join(label_dir, label), 'w') as fw:
                for line in lines:
                    if line.strip().startswith('6 '):
                        line = line.replace('6 ', '0 ', 1)   # replace only the first occurrence (the class ID)
                        fw.write(line)
                    elif line.strip().startswith('8 '):
                        line = line.replace('8 ', '1 ', 1)   # replace only the first occurrence (the class ID)
                        fw.write(line)
                    else:
                        count += 1
        except Exception as e: 
            print(e)

print('Deleted ' + str(count) + ' coordinates')

Deleted 1372 labels
Deleted 99051 coordinates
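
The remapping in the cell above (6 to 0, 8 to 1) generalises to a lookup table. The sketch below is an illustration with hypothetical names (`remap_line`, `PERSON_CAR`). It rewrites the first token only, so coordinates that happen to contain the same digits are never touched, and it normalises the separators to single spaces.

```python
def remap_line(line, class_map):
    """Return the label line with its class ID remapped, or None if the class is not kept."""
    parts = line.split()
    if not parts or parts[0] not in class_map:
        return None                       # blank line or a class that should be dropped
    return ' '.join([class_map[parts[0]]] + parts[1:]) + '\n'


# source class ID -> target class ID, mirroring the person/car cell
PERSON_CAR = {'6': '0', '8': '1'}
```

A file can then be rewritten with `[out for out in (remap_line(ln, PERSON_CAR) for ln in lines) if out is not None]`.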


#### Playground

In [None]:
import math

# Label directory to clean
label_dir = '/content/labels'   

# Delete labels without 'person' or 'car' classes 
labels = os.listdir(label_dir)
count = 0

for label in labels:
    if label.endswith('.txt'):
        try:
            isValid = False

            with open(os.path.join(label_dir, label), 'r') as f:
                for line in f.readlines():
                    data = line.split(' ')
                    if '1' in data or '2' in data:
                        isValid = True
                        break
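                    # the original line referenced an undefined `d` and discarded the result;
                    # assumed intent: flag label lines that contain NaN values
                    if any(val.strip().lower() == 'nan' for val in data[1:]):
                        print(f'NaN value in {label}: {line.strip()}')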

            if not isValid:
                os.remove(os.path.join(label_dir, label))
                count += 1

        except Exception as e: 
            print(e)
    else:
        # stop the cell: a non-.txt file in the label folder is unexpected
        raise SystemExit(f'Unexpected non-.txt file in label folder: {label}')

print(f'Deleted {count} labels')

In [None]:
# Label directory to clean
label_dir = '/content/labels'   

# Delete labels without 'person' or 'car' classes 
labels = os.listdir(label_dir)
count = 0

for label in labels:
    if label.endswith('.txt'):
        try:
            isValid = False

            with open(os.path.join(label_dir, label), 'r') as f:
                for line in f.readlines():
                    data = line.split(' ')
                    if '1' in data or '2' in data:
                        isValid = True
                        break

            if not isValid:
                os.remove(os.path.join(label_dir, label))
                count += 1

        except Exception as e: 
            print(e)
    else:
        # stop the cell: a non-.txt file in the label folder is unexpected
        raise SystemExit(f'Unexpected non-.txt file in label folder: {label}')

print(f'Deleted {count} labels')

Deleted 194 labels


In [None]:
# Label directory to clean
label_dir = '/content/obj-labels-org'   

# Delete labels without 'person' or 'car' classes 
labels = os.listdir(label_dir)
count = 0

for label in labels:
    if label.endswith('.txt'):
        try:
            isValid = False

            with open(os.path.join(label_dir, label), 'r') as f:
                for line in f.readlines():
                    data = line.split(' ')
                    if '1' in data or '2' in data or '3' in data or '5' in data:
                        isValid = True
                        break
                    elif '6' in data or '8' in data or '9' in data or '10' in data or '11' in data:
                        isValid = True
                        break

            if not isValid:
                os.remove(os.path.join(label_dir, label))
                count += 1

        except Exception as e: 
            print(e)
    else:
        # stop the cell: a non-.txt file in the label folder is unexpected
        raise SystemExit(f'Unexpected non-.txt file in label folder: {label}')

print(f'Deleted {count} labels')

# Delete coordinates of unlisted classes, then convert the remaining classes to their correct numbers
labels = os.listdir(label_dir)
count = 0

# old class ID -> new class ID; lines of any other class are dropped
class_map = {'1': '3', '2': '4', '3': '5', '5': '6', '6': '0',
             '8': '1', '9': '2', '10': '7', '11': '8'}

for label in labels:
    if label.endswith('.txt'):
        try:
            with open(os.path.join(label_dir, label), 'r') as fr:
                lines = fr.readlines()

            with open(os.path.join(label_dir, label), 'w') as fw:
                for line in lines:
                    old_cls = line.strip().split(' ', 1)[0]
                    if old_cls in class_map and line.strip().startswith(old_cls + ' '):
                        # replace only the first occurrence, i.e. the class ID
                        fw.write(line.replace(old_cls + ' ', class_map[old_cls] + ' ', 1))
                    else:
                        count += 1
        except Exception as e: 
            print(e)

print(f'Deleted {count} coordinates')

Deleted 263 labels
Deleted 7858 coordinates


In [None]:
# Label directory to clean
label_dir = '/content/gen-labels-full'   

# Convert classes to their correct numbers
labels = os.listdir(label_dir)
count = 0

# old class ID -> new class ID; lines of any other class are dropped
class_map = {'0': '0', '2': '1', '3': '2', '5': '1', '7': '1', '9': '4'}

for label in labels:
    if label.endswith('.txt'):
        try:
            with open(os.path.join(label_dir, label), 'r') as fr:
                lines = fr.readlines()

            with open(os.path.join(label_dir, label), 'w') as fw:
                for line in lines:
                    old_cls = line.strip().split(' ', 1)[0]
                    if old_cls in class_map and line.strip().startswith(old_cls + ' '):
                        # replace only the first occurrence, i.e. the class ID
                        fw.write(line.replace(old_cls + ' ', class_map[old_cls] + ' ', 1))
                    else:
                        count += 1
        except Exception as e: 
            print(e)

print(f'Deleted {count} coordinates')

Deleted 0 coordinates


In [None]:
# Label directory to clean
label_dir = '/content/obj-labels'   

# Delete labels without 'person' or 'car' classes 
labels = os.listdir(label_dir)
count = 0

for label in labels:
    if label.endswith('.txt'):
        try:
            isValid = False

            with open(os.path.join(label_dir, label), 'r') as f:
                for line in f.readlines():
                    data = line.split(' ')
                    if '1' in data or '2' in data or '3' in data or '5' in data:
                        isValid = True
                        break
                    elif '6' in data or '8' in data or '9' in data or '10' in data or '11' in data:
                        isValid = True
                        break

            if not isValid:
                os.remove(os.path.join(label_dir, label))
                count += 1

        except Exception as e: 
            print(e)
    else:
        # stop the cell: a non-.txt file in the label folder is unexpected
        raise SystemExit(f'Unexpected non-.txt file in label folder: {label}')

print(f'Deleted {count} labels')

# Delete coordinates of unlisted classes, then convert the remaining classes to their correct numbers
labels = os.listdir(label_dir)
count = 0

# old class ID -> new class ID; lines of any other class are dropped
class_map = {'1': '3', '2': '4', '3': '5', '5': '6', '6': '0',
             '8': '1', '9': '2', '10': '7', '11': '8'}

for label in labels:
    if label.endswith('.txt'):
        try:
            with open(os.path.join(label_dir, label), 'r') as fr:
                lines = fr.readlines()

            with open(os.path.join(label_dir, label), 'w') as fw:
                for line in lines:
                    old_cls = line.strip().split(' ', 1)[0]
                    if old_cls in class_map and line.strip().startswith(old_cls + ' '):
                        # replace only the first occurrence, i.e. the class ID
                        fw.write(line.replace(old_cls + ' ', class_map[old_cls] + ' ', 1))
                    else:
                        count += 1
        except Exception as e: 
            print(e)

print(f'Deleted {count} coordinates')

Deleted 1 labels
Deleted 3260 coordinates


In [None]:
# Directories
data_dir = '/content/obj-org/obj'
dest_dir_labels = '/content/obj-org-labels'
dest_dir_images_jpg = '/content/obj-org-images_jpg'
dest_dir_images_png = '/content/obj-org-images_png'

# Create destination dirs if they do not exist
os.makedirs(dest_dir_labels, exist_ok=True)
os.makedirs(dest_dir_images_jpg, exist_ok=True)
os.makedirs(dest_dir_images_png, exist_ok=True)

# Copy files into separate folders by extension (.txt labels, .jpg and .png images)
files = os.listdir(data_dir)
label_count = 0
jpgimg_count = 0
pngimg_count = 0

for file in files:
    if file.endswith('.txt'):
        try:
            shutil.copy(os.path.join(data_dir, file), dest_dir_labels)
            label_count += 1
        except Exception as e: 
            print(e)
    elif file.endswith('.jpg'):
        try:
            shutil.copy(os.path.join(data_dir, file), dest_dir_images_jpg)
            jpgimg_count += 1
        except Exception as e: 
            print(e)
    elif file.endswith('.png'):
        try:
            shutil.copy(os.path.join(data_dir, file), dest_dir_images_png)
            pngimg_count += 1
        except Exception as e: 
            print(e)

print('Label count (.TXT) in obj.rar: ' + str(label_count))
print('Image count (.JPG) in obj.rar: ' + str(jpgimg_count))
print('Image count (.PNG) in obj.rar: ' + str(pngimg_count))
print('###     Transfer completed     ###')
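
The cell above copies only names ending in lowercase `.txt`, `.jpg` or `.png`; anything else (for example `.JPG` or `.jpeg`) is skipped silently. A quick tally before copying shows whether that matters. The sketch below uses a hypothetical helper, `count_by_extension`, and lowercases extensions so that `.JPG` and `.jpg` are counted together.

```python
from collections import Counter
from os import listdir
from os.path import splitext


def count_by_extension(folder):
    """Count top-level entries in folder by lowercased file extension."""
    return Counter(splitext(name)[1].lower() for name in listdir(folder))
```

If the tally contains keys other than `.txt`, `.jpg` and `.png`, or if the copied `.jpg` count is lower than the tally, some files were left behind by the `endswith` checks.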

### b. Delete the odd ones


#### Originals



&ndash;&nbsp;&nbsp;Delete images without labels - JPG only (1)

In [None]:
# Directories
label_dir = '/content/labels2017/coco/labels/train2017'
image_dir = '/content/train2017/train2017'

# Labels' and images' names only (without extensions)
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]

print('Labels count: ' + str(len(labels)))
print('Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('Images without labels count: ' + str(len(no_label_images)))

# Delete images that don't have labels (.JPG images only)
for filename in no_label_images:
    os.remove(os.path.join(image_dir, filename + '.jpg'))

# Check after deletion
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]
no_label_images = list(set(images) - set(labels))

print('Images without labels count (After delete): ' + str(len(no_label_images)))

if not no_label_images:
    print('SUCCESSFULLY deleted all images without labels')

Labels count: 64115
Images count: 118287
Images without labels count: 54172
Images without labels count (After delete): 0
SUCCESSFULLY deleted all images without labels
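
The cell above rebuilds each path as `filename + '.jpg'`, so an orphan image with any other extension would raise `FileNotFoundError`. A sketch that keeps the real file names, using a hypothetical helper and a dry-run default, could look like this:

```python
import os


def delete_images_without_labels(image_dir, label_dir, dry_run=True):
    """Find images whose stem has no matching file stem in label_dir.

    Returns the sorted orphan file names; they are deleted only when dry_run is False.
    """
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(label_dir)}
    orphans = [f for f in os.listdir(image_dir)
               if os.path.splitext(f)[0] not in label_stems]
    if not dry_run:
        for f in orphans:
            os.remove(os.path.join(image_dir, f))   # real name, so any extension works
    return sorted(orphans)
```

Running it once with the default `dry_run=True` gives the same count as the "Images without labels" line before anything is removed.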


&ndash;&nbsp;&nbsp;Delete labels without images instead (2)

In [None]:
# Directories
label_dir = '/content/obj-labels'
image_dir = '/content/obj-images'

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]

print('Labels count: ' + str(len(labels)))
print('Images count: ' + str(len(images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('Labels without images count: ' + str(len(no_image_labels)))

# Delete labels that don't have corresponding images
for filename in no_image_labels:
    os.remove(os.path.join(label_dir, filename + '.txt'))

# Check after deletion
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]
no_image_labels = list(set(labels) - set(images))

print('Labels without images count (After delete): ' + str(len(no_image_labels)))

if not no_image_labels:
    print('SUCCESSFULLY deleted all labels without images')

Labels count: 67242
Images count: 67242
Labels without images count: 0
Labels without images count (After delete): 0
SUCCESSFULLY deleted all labels without images


&ndash;&nbsp;&nbsp;Count the number of files only (3)

In [None]:
# Directories
label_dir = '/content/labels'
image_dir = '/content/images'

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]

print('Labels count: ' + str(len(labels)))
print('Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('Labels without images count: ' + str(len(no_image_labels)))

Labels count: 131999
Images count: 131999
Images without labels count: 0
Labels without images count: 0
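
The count-only check can be wrapped in a function that returns the orphan names instead of printing counts. The sketch below is an illustration with a hypothetical name, `pairing_report`. It also reports image stems that occur more than once (for example `a.jpg` and `a.png`), a case the set differences above cannot reveal.

```python
import os
from collections import Counter


def pairing_report(image_dir, label_dir):
    """Compare file stems in two folders; report orphans and duplicate image stems."""
    image_stems = [os.path.splitext(f)[0] for f in os.listdir(image_dir)]
    label_stems = [os.path.splitext(f)[0] for f in os.listdir(label_dir)]
    images, labels = set(image_stems), set(label_stems)
    return {
        'images_without_labels': sorted(images - labels),
        'labels_without_images': sorted(labels - images),
        # stems shared by several image files would all map to the same label file
        'duplicate_image_stems': sorted(s for s, n in Counter(image_stems).items() if n > 1),
    }
```

All three lists being empty corresponds to the clean result printed above.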


In [None]:
# Directories
label_dir = '/content/obj-new-lbp-labels'
image_dir = '/content/obj-new-lbp-images'

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]

print('Labels count: ' + str(len(labels)))
print('Images count: ' + str(len(images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('Labels without images count: ' + str(len(no_image_labels)))

# # Delete labels that don't have corresponding images
# for filename in no_image_labels:
#     os.remove(os.path.join(label_dir, filename + '.txt'))

# # Check after deletion
# labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]
# images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]
# no_image_labels = list(set(labels) - set(images))

# print('Labels without images count (After delete): ' + str(len(no_image_labels)))

# if not no_image_labels:
#     print('SUCCESSFULLY deleted all labels without images')

Labels count: 157003
Images count: 157003
Labels without images count: 0


&ndash;&nbsp;&nbsp;Check Train / Val / Test integrity after splitting (4)

In [None]:
# Directories
train, val, test = 'train', 'val', 'test'

image_dir = '/content/WiderPerson-images-splits'
label_dir = '/content/WiderPerson-labels-splits'

image_dirs = [os.path.join(image_dir, x) for x in os.listdir(image_dir) if os.path.isdir(os.path.join(image_dir, x))]
label_dirs = [os.path.join(label_dir, x) for x in os.listdir(label_dir) if os.path.isdir(os.path.join(label_dir, x))]

if len(image_dirs) != len(label_dirs):
    print('Folder amount is different!')
    print(f'\t+ Image dir: {len(image_dirs)} folders')
    print(f'\t+ Label dir: {len(label_dirs)} folders')
else:
    # for i in range(0, len(image_dirs)):
    image_dir_train = os.path.join(image_dir, train)
    label_dir_train = os.path.join(label_dir, train)

    image_dir_val   = os.path.join(image_dir, val)
    label_dir_val   = os.path.join(label_dir, val)

    image_dir_test  = os.path.join(image_dir, test)
    label_dir_test  = os.path.join(label_dir, test)

    # List labels and images without extensions
    labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_train)]
    images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_train)]

    print('- Train:')
    print('\t+ Labels count: ' + str(len(labels)))
    print('\t+ Images count: ' + str(len(images)))

    # Find images that don't have labels and vice versa
    no_label_images = list(set(images) - set(labels))
    no_image_labels = list(set(labels) - set(images))

    # if ()
    print('\t+ Images without labels count: ' + str(len(no_label_images)))

    print('\t+ Labels without images count: ' + str(len(no_image_labels)))

    # List labels and images without extensions
    labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_val)]
    images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_val)]

    print('- Val:')
    print('\t+ Labels count: ' + str(len(labels)))
    print('\t+ Images count: ' + str(len(images)))

    # Find images that don't have labels
    no_label_images = list(set(images) - set(labels))

    print('\t+ Images without labels count: ' + str(len(no_label_images)))

    # Find labels that don't have corresponding images
    no_image_labels = list(set(labels) - set(images))

    print('\t+ Labels without images count: ' + str(len(no_image_labels)))

    # List labels and images without extensions
    labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_test)]
    images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_test)]

    print('- Test:')
    print('\t+ Labels count: ' + str(len(labels)))
    print('\t+ Images count: ' + str(len(images)))

    # Find images that don't have labels
    no_label_images = list(set(images) - set(labels))

    print('\t+ Images without labels count: ' + str(len(no_label_images)))

    # Find labels that don't have corresponding images
    no_image_labels = list(set(labels) - set(images))

    print('\t+ Labels without images count: ' + str(len(no_image_labels)))

- Train:
	+ Labels count: 7200
	+ Images count: 7200
	+ Images without labels count: 0
	+ Labels without images count: 0
- Val:
	+ Labels count: 900
	+ Images count: 900
	+ Images without labels count: 0
	+ Labels without images count: 0
- Test:
	+ Labels count: 900
	+ Images count: 900
	+ Images without labels count: 0
	+ Labels without images count: 0
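
The train, val and test checks above repeat the same set-difference logic three times. A minimal sketch of the same check as a reusable helper follows; `find_orphans` and `check_splits` are hypothetical names that do not exist elsewhere in this notebook, and the sketch assumes both roots contain the same split sub-folders.

```python
import os


def find_orphans(image_dir, label_dir):
    """Return (images without labels, labels without images) as sorted stem lists."""
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(image_dir)}
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(label_dir)}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


def check_splits(image_root, label_root, splits=('train', 'val', 'test')):
    """Compare each split folder under the two roots and print a short report."""
    report = {}
    for split in splits:
        no_label, no_image = find_orphans(os.path.join(image_root, split),
                                          os.path.join(label_root, split))
        report[split] = (no_label, no_image)
        print(f'- {split}: {len(no_label)} images without labels, '
              f'{len(no_image)} labels without images')
    return report
```

Because each split is read through the same function, an image folder can no longer be compared against itself by mistake.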


#### Playground

In [None]:
!find . -name "*(1)*" -type f -delete

In [None]:
# Directories
data_dir = '/content/converted'
dest_dir = '/content/converted-200'

# Create destination dir if not exists
os.makedirs(dest_dir, exist_ok=True)

# Copy up to `total_images` JPG/PNG images into the destination folder
images = os.listdir(data_dir)
total_images = 200
count = 0

for img in images:
    if count == total_images:
        break
    elif img.endswith('.jpg') or img.endswith('.png'):
        try:
            shutil.copy(os.path.join(data_dir, img), dest_dir)
            count += 1
        except Exception as e:
            print(e)

print('###     Transferred completed     ###')

In [None]:
%cd /content/obj-labels-org

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-89-20820bf99216>", line 1, in <module>
    get_ipython().magic('cd /content/obj-labels-org')
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2160, in magic
    return self.run_line_magic(magic_name, magic_arg_s)
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2081, in run_line_magic
    result = fn(*args,**kwargs)
  File "<decorator-gen-84>", line 2, in cd
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py", line 188, in <lambda>
    call = lambda f, *a, **k: f(*a, **k)
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/magics/osm.py", line 288, in cd
    oldcwd = py3compat.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exc

FileNotFoundError: ignored
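
The traceback above fails inside `py3compat.getcwd()`, that is, while reading the current working directory rather than while entering the target. `os.getcwd()` raises `FileNotFoundError` when the directory the kernel is sitting in has been deleted. A minimal sketch of a guard follows, assuming `/content` is a safe fallback; `safe_chdir` is a hypothetical helper, not part of IPython.

```python
import os


def safe_chdir(target, fallback='/content'):
    """Change to target, first leaving a deleted working directory if needed."""
    try:
        os.getcwd()
    except FileNotFoundError:
        # The current directory no longer exists; move somewhere valid first
        os.chdir(fallback)
    os.chdir(target)  # still raises FileNotFoundError if target itself is missing
    return os.getcwd()
```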

In [None]:
# Directories
data_dir = '/content/obj-org/obj'
dest_dir_labels = '/content/obj-org-labels'
dest_dir_images_jpg = '/content/obj-org-images_jpg'
dest_dir_images_png = '/content/obj-org-images_png'

# Create destination dirs if they do not exist
os.makedirs(dest_dir_labels, exist_ok=True)
os.makedirs(dest_dir_images_jpg, exist_ok=True)
os.makedirs(dest_dir_images_png, exist_ok=True)

# Sort files from data_dir into label / JPG / PNG folders by extension
files = os.listdir(data_dir)
label_count = 0
jpgimg_count = 0
pngimg_count = 0

for file in files:
    if file.endswith('.txt'):
        try:
            shutil.copy(os.path.join(data_dir, file), dest_dir_labels)
            label_count += 1
        except Exception as e: 
            print(e)
    elif file.endswith('.jpg'):
        try:
            shutil.copy(os.path.join(data_dir, file), dest_dir_images_jpg)
            jpgimg_count += 1
        except Exception as e: 
            print(e)
    elif file.endswith('.png'):
        try:
            shutil.copy(os.path.join(data_dir, file), dest_dir_images_png)
            pngimg_count += 1
        except Exception as e: 
            print(e)

print('Label count (.TXT) in obj.rar: ' + str(label_count))
print('Image count (.JPG) in obj.rar: ' + str(jpgimg_count))
print('Image count (.PNG) in obj.rar: ' + str(pngimg_count))
print('###     Transferred completed     ###')

Label count (.TXT) in obj.rar: 0
Image count (.JPG) in obj.rar: 0
Image count (.PNG) in obj.rar: 0
###     Transferred completed     ###


### c. Split to train/val/test datasets

#### Originals

&ndash;&nbsp; Split to Train/Val/Test (3 Sets)

In [None]:
# Mark the start of code
print("###  Split COCO Dataset to Train/Val/Test by Ratio  ###")

new_dir = '/content/WiderPerson-labels-splits'  # new dir to create

data_dir = '/content/WiderPerson-labels'  # dir with data to copy

# Ratios
val_ratio = 0.1
test_ratio = 0.1

all_file_names = os.listdir(data_dir)

all_file_names.sort()  # os.listdir order is arbitrary; sort so the seeded shuffle is reproducible

# Randomize dataset
seeded_np = RandomState(2022)  # Create instance with seed 2022
seeded_np.shuffle(all_file_names)  # Shuffle dataset

# Splitting by ratio
train_file_names, val_file_names, test_file_names = np.split(np.array(all_file_names),
                                                             [int(len(all_file_names) * (1 - (val_ratio + test_ratio))),
                                                              int(len(all_file_names) * (1 - test_ratio))])

# File names for Train/Val/Test
train_file_names = [os.path.join(data_dir, '') + name for name in train_file_names.tolist()]
val_file_names = [os.path.join(data_dir, '') + name for name in val_file_names.tolist()]
test_file_names = [os.path.join(data_dir, '') + name for name in test_file_names.tolist()]

# Notify image amount
print('Total images: \t'  + str(len(all_file_names)))
print('Training: \t'      + str(len(train_file_names)))
print('Validation: \t'    + str(len(val_file_names)))
print('Testing: \t'       + str(len(test_file_names)))

# Creating Train/Val/Test folders (One time use)
os.makedirs(os.path.join(new_dir, 'train'))
os.makedirs(os.path.join(new_dir, 'val'))
os.makedirs(os.path.join(new_dir, 'test'))

# Copy-pasting images
for name in train_file_names:
    shutil.copy(name, os.path.join(new_dir, 'train'))

for name in val_file_names:
    shutil.copy(name, os.path.join(new_dir, 'val'))

for name in test_file_names:
    shutil.copy(name, os.path.join(new_dir, 'test'))

print("###     Script completed     ###")

###  Split COCO Dataset to Train/Val/Test by Ratio  ###
Total images: 	9000
Training: 	7200
Validation: 	900
Testing: 	900
###     Script completed     ###
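
The cell above is run once for labels and once for images, so both runs must produce the same split. That only holds if the names are sorted before the seeded shuffle and both folders contain the same stems. A minimal sketch that splits file stems once and reuses the result for both folders follows. It swaps in the standard-library `random.Random` for NumPy's `RandomState`, so it will not reproduce the exact split above; `split_stems` is a hypothetical helper.

```python
import random


def split_stems(stems, val_ratio=0.1, test_ratio=0.1, seed=2022):
    """Shuffle stems reproducibly and cut them into train/val/test lists."""
    ordered = sorted(stems)                # fixed starting order, independent of os.listdir
    random.Random(seed).shuffle(ordered)   # local RNG; the global random state is untouched
    n = len(ordered)
    train_end = int(n * (1 - (val_ratio + test_ratio)))
    val_end = int(n * (1 - test_ratio))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Each stem can then be turned into `stem + '.txt'` for the label copy and into the matching image name for the image copy, which keeps every pair in the same split.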


&ndash;&nbsp; Split to halves (2 sets)

In [None]:
# Mark the start of code
print("###     Halve Dataset     ###")

root_dir = '/content/train-1410-halves'  # new dir to create

data_dir = '/content/datasets/coco/images'  # dir with data to copy

all_file_names = os.listdir(data_dir)

all_file_names.sort()  # os.listdir order is arbitrary; sort so images and labels are halved identically

# Split into two halves
train_file_names = all_file_names[:int(len(all_file_names) / 2)]
val_file_names = all_file_names[int(len(all_file_names) / 2):]

# Full paths for each half
train_file_names = [os.path.join(data_dir, '') + name for name in train_file_names]
val_file_names = [os.path.join(data_dir, '') + name for name in val_file_names]

# Notify image amount
print('Total images: \t'  + str(len(all_file_names)))
print('First half: \t'    + str(len(train_file_names)))
print('Second half: \t'   + str(len(val_file_names)))

# Create the output folders (one-time use)
os.makedirs(os.path.join(root_dir, 'first-half'))
os.makedirs(os.path.join(root_dir, 'second-half'))

# Copy files into each half
for name in train_file_names:
    shutil.copy(name, os.path.join(root_dir, 'first-half'))

for name in val_file_names:
    shutil.copy(name, os.path.join(root_dir, 'second-half'))

print("###     Script completed     ###")

### Split COCO Dataset to Train/Val/Test by Ratio ###
Total images: 	64115
Training: 	32057
Validation: 	32058
###      Script completed      ###


&ndash;&nbsp; Split to thirds (3 sets)

In [None]:
# Mark the start of code
print("###  Split to thirds  ###")

new_dir = '/content/obj-labels-new-lbp-thirds'  # new dir to create

data_dir = '/content/obj-labels'  # dir with data to copy

all_file_names = os.listdir(data_dir)

all_file_names.sort()  # os.listdir order is arbitrary; sort so the seeded shuffle is reproducible

# Randomize dataset
seeded_np = RandomState(25102021)  # Create instance with seed 25/10/2021
seeded_np.shuffle(all_file_names)  # Shuffle dataset

# Splitting by ratio
train_file_names = all_file_names[:int(len(all_file_names) * 1 / 3)]
val_file_names = all_file_names[int(len(all_file_names) * 1 / 3):int(len(all_file_names) * 2 / 3)]
test_file_names = all_file_names[int(len(all_file_names) * 2 / 3):]

# File names for Train/Val/Test
train_file_names = [os.path.join(data_dir, '') + name for name in train_file_names]
val_file_names = [os.path.join(data_dir, '') + name for name in val_file_names]
test_file_names = [os.path.join(data_dir, '') + name for name in test_file_names]

# Notify image amount
print('Total images: \t'  + str(len(all_file_names)))
print('First third: \t'   + str(len(train_file_names)))
print('Second third: \t'  + str(len(val_file_names)))
print('Third third: \t'   + str(len(test_file_names)))

# Creating Train/Val/Test folders (One time use)
os.makedirs(os.path.join(new_dir, 'first-third'))
os.makedirs(os.path.join(new_dir, 'second-third'))
os.makedirs(os.path.join(new_dir, 'third-third'))

# Copy-pasting images
for name in train_file_names:
    shutil.copy(name, os.path.join(new_dir, 'first-third'))

for name in val_file_names:
    shutil.copy(name, os.path.join(new_dir, 'second-third'))

for name in test_file_names:
    shutil.copy(name, os.path.join(new_dir, 'third-third'))

print("###     Script completed     ###")

###  Split to thirds  ###
Total images: 	65593
First third: 	21864
Second third: 	21864
Third third: 	21865
###     Script completed     ###


&ndash;&nbsp; [S-Class] Split to arbitrary parts, then split to train/val/test

In [None]:
# Mark the start of code
print("###  Split to arbitrary parts, then to train/val/test  ###")

new_dir = '/content/VDAS-FM-parts'   # new dir to create

data_dir = '/content/images'    # dir with data to copy

number_of_parts = 8

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

all_file_names = os.listdir(data_dir)

all_file_names.sort()  # os.listdir order is arbitrary; sort so the seeded shuffle is reproducible

# Randomize dataset
random.seed(2021)  # Seed the global random generator with 2021
random.shuffle(all_file_names)  # Shuffle dataset


def split_to_equal_parts(file_names, num_of_parts):
    if num_of_parts < 2:
        print('ONLY split to 2 or more parts. Check params')
        return

    # First part: [0, 1/n)
    split_parts = [file_names[:int(len(file_names) * (1 / num_of_parts))]]

    # Middle parts: [k/n, (k+1)/n) for k = 1 .. n-2, so every slice is taken exactly once
    for part_num in range(1, num_of_parts - 1):
        split_parts.append(file_names[int(len(file_names) * (part_num / num_of_parts)):int(
            len(file_names) * ((part_num + 1) / num_of_parts))])

    # Last part: [(n-1)/n, end]
    split_parts.append(file_names[int(len(file_names) * ((num_of_parts - 1) / num_of_parts)):])

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([os.path.join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


def split_to_train_val_test(file_names, train_ratio, val_ratio, test_ratio):
    file_names = [os.path.basename(fn) for fn in file_names]
    split_parts = [file_names[:int(len(file_names) * train_ratio)],
                   file_names[int(len(file_names) * train_ratio):int(len(file_names) * (train_ratio + val_ratio))],
                   file_names[int(len(file_names) * (train_ratio + val_ratio)):]]

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([os.path.join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


split_file_names_result = split_to_equal_parts(all_file_names, number_of_parts)
print(f'Total images: \t{len(all_file_names)}')
for i in range(0, number_of_parts):
    print(f'Part {i + 1}: \t\t{len(split_file_names_result[i])}')
    os.makedirs(os.path.join(new_dir, f'part-{i + 1}'))

    train_val_test_file_names = split_to_train_val_test(split_file_names_result[i], train_ratio, val_ratio, test_ratio)

    # Notify image amount (Train/Val/Test)
    print(f'\t + Train:\t{len(train_val_test_file_names[0])}')
    print(f'\t + Val:\t\t{len(train_val_test_file_names[1])}')
    print(f'\t + Test:\t{len(train_val_test_file_names[2])}')

    # Creating Train/Val/Test folders
    os.makedirs(os.path.join(new_dir, f'part-{i + 1}', 'train'))
    os.makedirs(os.path.join(new_dir, f'part-{i + 1}', 'val'))
    os.makedirs(os.path.join(new_dir, f'part-{i + 1}', 'test'))

    # Copy-pasting images
    for name in train_val_test_file_names[0]:
        shutil.copy(name, os.path.join(new_dir, f'part-{i + 1}', 'train'))

    for name in train_val_test_file_names[1]:
        shutil.copy(name, os.path.join(new_dir, f'part-{i + 1}', 'val'))

    for name in train_val_test_file_names[2]:
        shutil.copy(name, os.path.join(new_dir, f'part-{i + 1}', 'test'))


###  Split to arbitrary parts, then to train/val/test  ###
Total images: 	64115
Part 1: 		12823
	 + Train:	10258
	 + Val:		1282
	 + Test:	1283
Part 2: 		12823
	 + Train:	10258
	 + Val:		1282
	 + Test:	1283
Part 3: 		12823
	 + Train:	10258
	 + Val:		1282
	 + Test:	1283
Part 4: 		12823
	 + Train:	10258
	 + Val:		1282
	 + Test:	1283
Part 5: 		12823
	 + Train:	10258
	 + Val:		1282
	 + Test:	1283


In [None]:
# Split to arbitrary parts only, no train val test
# Mark the start of code
print("###  Split to arbitrary parts only  ###")

new_dir = '/content/gen-label-splits'     # new dir to create

data_dir = '/content/gen-labels-full'    # dir with data to copy

number_of_parts = 10

all_file_names = os.listdir(data_dir)

all_file_names.sort()  # os.listdir order is arbitrary; sort so images and labels are split identically


def split_to_equal_parts(file_names, num_of_parts):
    if num_of_parts < 2:
        print('Split to 2 parts or above')
        return

    # First part: [0, 1/n)
    split_parts = [file_names[:int(len(file_names) * (1 / num_of_parts))]]

    # Middle parts: [k/n, (k+1)/n) for k = 1 .. n-2, so every slice is taken exactly once
    for part_num in range(1, num_of_parts - 1):
        split_parts.append(file_names[int(len(file_names) * (part_num / num_of_parts)):int(
            len(file_names) * ((part_num + 1) / num_of_parts))])

    # Last part: [(n-1)/n, end]
    split_parts.append(file_names[int(len(file_names) * ((num_of_parts - 1) / num_of_parts)):])

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([os.path.join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


split_file_names_result = split_to_equal_parts(all_file_names, number_of_parts)
print(f'Total images: \t{len(all_file_names)}')
for i in range(0, number_of_parts):
    print(f'Part {i + 1}: \t\t{len(split_file_names_result[i])}')
    os.makedirs(os.path.join(new_dir, f'part-{i + 1}'))

    # Copy-pasting images
    for name in split_file_names_result[i]:
        shutil.copy(name, os.path.join(new_dir, f'part-{i + 1}'))


###  Split to arbitrary parts only  ###
Total images: 	17552
Part 1: 		1755
Part 2: 		1755
Part 3: 		1755
Part 4: 		1756
Part 5: 		1755
Part 6: 		1755
Part 7: 		1755
Part 8: 		1755
Part 9: 		1756
Part 10: 		1756
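
A quick sanity check for any part splitter is that the part sizes add up to the input size. The ten sizes printed above add up to 17553 for 17552 files, which points to one slice being copied twice. A minimal sketch of a splitter built on integer boundaries follows; `split_into_parts` is a hypothetical helper and is not the function used in the cells above.

```python
def split_into_parts(items, num_parts):
    """Cut items into num_parts contiguous slices that cover the list exactly once."""
    if num_parts < 2:
        raise ValueError('num_parts must be 2 or more')
    n = len(items)
    # Integer boundaries 0, n*1//k, ..., n: each slice starts where the previous one ends
    bounds = [n * k // num_parts for k in range(num_parts + 1)]
    return [items[bounds[k]:bounds[k + 1]] for k in range(num_parts)]
```

Since consecutive slices share a boundary, no item can be skipped or duplicated, and part sizes differ by at most one.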


In [None]:
!cd /content/VDAS-FM-images && zip -r -0 /content/VDAS-FM-images.zip .
!cp -r /content/VDAS-FM-images.zip /content/drive/MyDrive/yolo/4-datasets

!cd /content/VDAS-FM-labels && zip -r -0 /content/VDAS-FM-labels.zip .
!cp -r /content/VDAS-FM-labels.zip /content/drive/MyDrive/yolo/4-datasets

Streaming output truncated to the last 5000 lines.
  adding: 16_Award_Ceremony_Awards_Ceremony_16_401.txt (stored 0%)
  adding: crawler2_03402.txt (stored 0%)
  adding: train_00017994.txt (stored 0%)
  adding: 12_Group_Group_12_Group_Group_12_61.txt (stored 0%)
  adding: 35_Basketball_Basketball_35_188.txt (stored 0%)
  adding: pexels_pexels-samson-katt-5225236_face_0.txt (stored 0%)
  adding: train_00007615.txt (stored 0%)
  adding: 3040.txt (stored 0%)
  adding: video19_frame_633_face_2.txt (stored 0%)
  adding: train_00018165.txt (stored 0%)
  adding: train_00015225.txt (stored 0%)
  adding: test_00004874.txt (stored 0%)
  adding: train_00016155.txt (stored 0%)
  adding: unsplash_gabriella-clare-marino-tAPq5wEnPRw-unsplash_face_3.txt (stored 0%)
  adding: FFHQ_12352.txt (stored 0%)
  adding: 61_Street_Battle_streetfight_61_708.txt (stored 0%)
  adding: train_00011815.txt (stored 0%)
  adding: train_00002482.txt (stored 0%)
  adding: train_00005632.txt (stored 0%)
  add
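
`zip -r -0` prints one line per file, which floods the cell output on large datasets. A minimal sketch of an uncompressed archive built with Python's `zipfile` module follows, assuming the archive is written outside the folder being zipped; `zip_folder_stored` is a hypothetical helper.

```python
import os
import zipfile


def zip_folder_stored(src_dir, zip_path):
    """Archive src_dir into zip_path without compression and return the file count."""
    count = 0
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_STORED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to src_dir, like `cd src_dir && zip -r -0 ... .`
                zf.write(full_path, arcname=os.path.relpath(full_path, src_dir))
                count += 1
    return count
```

`ZIP_STORED` corresponds to the `-0` flag: files are stored as-is, which is usually faster for images that are already compressed.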

#### Playground

In [None]:
# Mark the start of code
print("###     Custom Split     ###")

root_dir = '/content/obj-processed-labels'  # new dir to create

data_dir = '/content/obj-labels-2'  # dir with data to copy

all_file_names = os.listdir(data_dir)

all_file_names.sort()  # sort to avoid data loss

# Splitting by ratio
train_file_names = all_file_names[:int(len(all_file_names) / 2)]
val_file_names = all_file_names[int(len(all_file_names) / 2):]

# File names for Train/Val/Test
train_file_names = [os.path.join(data_dir, '') + name for name in train_file_names]
val_file_names = [os.path.join(data_dir, '') + name for name in val_file_names]

# Notify image amount
print('Total images: \t'  + str(len(all_file_names)))
print('First half: \t'    + str(len(train_file_names)))
print('Second half: \t'   + str(len(val_file_names)))

# Creating Train/Val/Test folders (One time use)
os.makedirs(os.path.join(root_dir, 'first-half'))
os.makedirs(os.path.join(root_dir, 'second-half'))
# Copy-pasting images
for name in train_file_names:
    shutil.copy(name, os.path.join(root_dir, 'first-half'))

for name in val_file_names:
    shutil.copy(name, os.path.join(root_dir, 'second-half'))

print("###     Script completed     ###")

###     Halve Dataset     ###
Total images: 	65293
First half: 	32646
Second half: 	32647
###      Script completed      ###


In [None]:
print(os.listdir('/content/nightowl-org/nightowls_validation'))

['58c580debc26013448bf2aeb.png', '58c58321bc260137001584d2.png', '58c5821dbc26013700130a47.png', '58c580adbc260137e095705b.png', '58c580e1bc26013448bf31f3.png', '58c5835fbc26013700160be3.png', '58c58360bc26013700160cea.png', '58c58338bc2601370015b9b5.png', '58c5821dbc26013700130ae4.png', '58c5832bbc26013700159af7.png', '58c58339bc2601370015bcb6.png', '58c58222bc2601370013180a.png', '58c58335bc2601370015b4b8.png', '58c5832fbc2601370015a510.png', '58c58371bc26013700163785.png', '58c580d5bc26013448bf1614.png', '58c580e2bc26013448bf35aa.png', '58c580d3bc26013448bf115d.png', '58c580d4bc26013448bf14f7.png', '58c580b1bc260137e0957933.png', '58c580b4bc260137e095809f.png', '58c5832fbc2601370015a588.png', '58c58320bc260137001580a7.png', '58c5832dbc26013700159fd1.png', '58c580e3bc26013448bf375a.png', '58c580d7bc26013448bf1a61.png', '58c580adbc260137e0956fbd.png', '58c580d9bc26013448bf1e21.png', '58c58224bc26013700131da9.png', '58c58371bc260137001637ac.png', '58c58223bc260137001319fa.png', '58c582

### d. Check train/val/test integrity after splitting 

#### Originals

In [None]:
# Directories
train, val, test = 'train', 'val', 'test'
image_dir = '/content/WiderPerson-images-splits'
label_dir = '/content/WiderPerson-labels-splits'

image_dir_train = os.path.join(image_dir, train)
label_dir_train = os.path.join(label_dir, train)

image_dir_val   = os.path.join(image_dir, val)
label_dir_val   = os.path.join(label_dir, val)

image_dir_test  = os.path.join(image_dir, test)
label_dir_test  = os.path.join(label_dir, test)

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_train)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_train)]

print('- Train:')
print('\t+ Labels count: ' + str(len(labels)))
print('\t+ Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('\t+ Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('\t+ Labels without images count: ' + str(len(no_image_labels)))

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_val)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_val)]

print('- Val:')
print('\t+ Labels count: ' + str(len(labels)))
print('\t+ Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('\t+ Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('\t+ Labels without images count: ' + str(len(no_image_labels)))

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_test)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_test)]

print('- Test:')
print('\t+ Labels count: ' + str(len(labels)))
print('\t+ Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('\t+ Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('\t+ Labels without images count: ' + str(len(no_image_labels)))

- Train:
	+ Labels count: 7200
	+ Images count: 7200
	+ Images without labels count: 0
	+ Labels without images count: 0
- Val:
	+ Labels count: 900
	+ Images count: 900
	+ Images without labels count: 0
	+ Labels without images count: 0
- Test:
	+ Labels count: 900
	+ Images count: 900
	+ Images without labels count: 0
	+ Labels without images count: 0


In [None]:
# Count number of images and labels in halves

files = os.listdir('/content/obj-images/first-half')

print('Images count (first half): ' + str(len(files)))

files = os.listdir('/content/obj-labels/first-half')

print('Labels count (first half): ' + str(len(files)))

files = os.listdir('/content/obj-images/second-half')

print('Images count (second half): ' + str(len(files)))

files = os.listdir('/content/obj-labels/second-half')

print('Labels count (second half): ' + str(len(files)))

Images count (first half): 33332
Labels count (first half): 32646
Images count (second half): 33333
Labels count (second half): 32647


In [None]:
# Count number of images and labels in thirds

print('Images count (first third): '  + str(len(os.listdir('/content/obj-images-new-lbp-first-third'))))

print('Labels count (first third): '  + str(len(os.listdir('/content/obj-labels-new-lbp-first-third'))))

print('Images count (second third): ' + str(len(os.listdir('/content/obj-images-new-lbp-second-third'))))

print('Labels count (second third): ' + str(len(os.listdir('/content/obj-labels-new-lbp-second-third'))))

print('Images count (third third): '  + str(len(os.listdir('/content/obj-images-new-lbp-third-third'))))

print('Labels count (third third): '  + str(len(os.listdir('/content/obj-labels-new-lbp-third-third'))))

#### Playground

In [None]:
os.makedirs('/content/CrowdHuman-images')
os.makedirs('/content/CrowdHuman-labels')

In [None]:
total_files('/content/CrowdHuman-images')
total_files('/content/CrowdHuman-labels')

19370
19370


In [None]:
!cd /content/WiderPerson-labels-splits && zip -r -0 /content/WiderPerson-labels.zip . && cp /content/WiderPerson-labels.zip /content/drive/MyDrive/yolo/4-datasets/person-datasets/WiderPerson

In [None]:
# Directories
train, val, test = 'train', 'val', 'test'
image_dir = '/content/masks-dataset-1203-images'
label_dir = '/content/masks-dataset-1203-labels'

image_dir_train = os.path.join(image_dir, train)
label_dir_train = os.path.join(label_dir, train)

image_dir_val   = os.path.join(image_dir, val)
label_dir_val   = os.path.join(label_dir, val)

image_dir_test  = os.path.join(image_dir, test)
label_dir_test  = os.path.join(label_dir, test)

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_train)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_train)]

print('- Train:')
print('\t+ Labels count: ' + str(len(labels)))
print('\t+ Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('\t+ Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('\t+ Labels without images count: ' + str(len(no_image_labels)))

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_val)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_val)]

print('- Val:')
print('\t+ Labels count: ' + str(len(labels)))
print('\t+ Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('\t+ Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('\t+ Labels without images count: ' + str(len(no_image_labels)))

# List labels and images without extensions
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir_test)]
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir_test)]

print('- Test:')
print('\t+ Labels count: ' + str(len(labels)))
print('\t+ Images count: ' + str(len(images)))

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print('\t+ Images without labels count: ' + str(len(no_label_images)))

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('\t+ Labels without images count: ' + str(len(no_image_labels)))

- Train:
	+ Labels count: 1816
	+ Images count: 1816
	+ Images without labels count: 0
	+ Labels without images count: 0
- Val:
	+ Labels count: 519
	+ Images count: 519
	+ Images without labels count: 0
	+ Labels without images count: 0
- Test:
	+ Labels count: 260
	+ Images count: 260
	+ Images without labels count: 0
	+ Labels without images count: 0


In [None]:
# Edit this for other kinds of counting

files = os.listdir('/content/obj-org/obj')

print('All files count in obj.rar:\t\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-labels-org')

print('Labels count (original):\t\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-org-images_jpg') + os.listdir('/content/obj-org-images_png')

print('Images count (original):\t\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-org-images_jpg')

print('Images count (original - JPG only):\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-org-images_png')

print('Images count (original - PNG only):\t' + str(len(files)))

# # Edit this for other kinds of counting

# files = os.listdir('/content/obj-images-org')

# print('Images count (original - JPG only):\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-images/obj')

print('Images count (new LBP):\t\t\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-org-labels')

print('Labels count (original - person, car):\t' + str(len(files)))

# Edit this for other kinds of counting

files = os.listdir('/content/obj-labels')

print('Labels count (old - person, car):\t' + str(len(files)))

All files count in obj.rar:		135014
Labels count (original):		67507
Images count (original):		67507
Images count (original - JPG only):	66667
Images count (original - PNG only):	840
Images count (new LBP):			66967
Labels count (original - person, car):	66135
Labels count (old - person, car):	65293


In [None]:
# Edit this for other kinds of counting

files = os.listdir('/content/obj-images')

print('Images count: ' + str(len(files)))

import glob
print('JPG count: ' + str(len(glob.glob('/content/obj-images/*.jpg'))))
print('PNG count: ' + str(len(glob.glob('/content/obj-images/*.png'))))
print('(JPG + PNG) count: ' + str(len(glob.glob('/content/obj-images/*.jpg') + glob.glob('/content/obj-images/*.png'))))

Images count: 66667
Images (JPG) count: 66667
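
The `glob` patterns above are case-sensitive on Linux, so `*.jpg` misses files named `.JPG` or `.jpeg`. A minimal sketch that counts every extension in one pass follows; `count_by_extension` is a hypothetical helper.

```python
import os
from collections import Counter


def count_by_extension(folder):
    """Count files in folder grouped by lower-cased extension (e.g. '.jpg')."""
    return Counter(os.path.splitext(name)[1].lower()
                   for name in os.listdir(folder)
                   if os.path.isfile(os.path.join(folder, name)))
```

Calling `count_by_extension('/content/obj-images')` returns a `Counter` keyed by extension, so unexpected file types show up without a separate pattern for each.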


In [None]:
import os
import cv2

image_dir = 'images'
label_dir = 'labels'
name_list = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]

os.makedirs('converted', exist_ok=True)  # cv2.imwrite does not create missing folders

for name in name_list:
    if os.path.isfile(os.path.join(image_dir, name + '.jpg')):
        img = cv2.imread(os.path.join(image_dir, name + '.jpg'))
    else:
        img = cv2.imread(os.path.join(image_dir, name + '.png'))

    dh, dw, _ = img.shape

    label = os.path.join(label_dir, name + '.txt')
    fl = open(label, 'r')
    data = fl.readlines()
    fl.close()

    index = 0  # Line no.

    for dt in data:
        index += 1
        _, x, y, w, h = map(float, dt.split(' '))

        l = int((x - w / 2) * dw)
        r = int((x + w / 2) * dw)
        t = int((y - h / 2) * dh)
        b = int((y + h / 2) * dh)

        if l < 0:
            l = 0
        if r > dw - 1:
            r = dw - 1
        if t < 0:
            t = 0
        if b > dh - 1:
            b = dh - 1

        label_v = 'vehicle' + str(index)  # Drop the prefix and keep only str(index) if the label is too large
        label_m = 'motorbike' + str(index)  # Drop the prefix and keep only str(index) if the label is too large
        label_v_color = (62, 43, 255)  # red
        label_m_color = (44, 104, 255)  # orange

        if _ == 1.0:
            img = cv2.rectangle(img, (l, t), (r, b), label_v_color, 2)

            (w, h), _2 = cv2.getTextSize(label_v, cv2.FONT_HERSHEY_DUPLEX, 0.6, 1)

            img = cv2.rectangle(img, (l, t - 20), (l + w, t), label_v_color, -1)
            img = cv2.putText(img, label_v, (l, t - 5),
                              cv2.FONT_HERSHEY_DUPLEX, 0.5, (255, 255, 255), 1)
        elif _ == 2.0:
            img = cv2.rectangle(img, (l, t), (r, b), label_m_color, 2)

            (w, h), _2 = cv2.getTextSize(label_m, cv2.FONT_HERSHEY_DUPLEX, 0.6, 1)

            img = cv2.rectangle(img, (l, t - 20), (l + w, t), label_m_color, -1)
            img = cv2.putText(img, label_m, (l, t - 5),
                              cv2.FONT_HERSHEY_DUPLEX, 0.5, (255, 255, 255), 1)

    cv2.imwrite(os.path.join('converted', name + '-bbox.jpg'), img)
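The coordinate arithmetic in the cell above can be isolated into a small pure function, which makes it easier to check without loading any image. This is a minimal sketch: the names `yolo_to_pixel_box` and `parse_yolo_line` are illustrative and do not appear elsewhere in the notebook, and the class id is parsed as an `int` here, whereas the cell above keeps it as a float.

```python
def parse_yolo_line(line):
    """Split one YOLO label line into (class_id, x, y, w, h)."""
    class_id, x, y, w, h = line.split()
    return int(class_id), float(x), float(y), float(w), float(h)


def yolo_to_pixel_box(x, y, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height) to pixel corners."""
    left = int((x - w / 2) * img_w)
    right = int((x + w / 2) * img_w)
    top = int((y - h / 2) * img_h)
    bottom = int((y + h / 2) * img_h)
    # Clamp to the valid pixel range, mirroring the if-chain in the cell above
    left = max(left, 0)
    top = max(top, 0)
    right = min(right, img_w - 1)
    bottom = min(bottom, img_h - 1)
    return left, top, right, bottom


class_id, x, y, w, h = parse_yolo_line('1 0.5 0.5 0.5 0.5')
print(class_id, yolo_to_pixel_box(x, y, w, h, img_w=640, img_h=480))
```

Keeping the conversion separate from the drawing calls also avoids reusing `w` and `h` for both the box size and the text size, as the loop above does.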


## Step 3: Save results to Google Drive

### a. Zip and copy images/labels

In [None]:
# Legacy improved
cd_dir = '/content/images'
zip_name = '/content/VDAS-FM-images.zip'
target_dir = '/content/drive/MyDrive/yolo/4-datasets'
!cd $cd_dir && zip -r -0 $zip_name .
!cp $zip_name $target_dir

In [None]:
print("###  Batch zip folders without parent  ###")

output_dir = '/content/'     # new dir to create

data_dir = '/content/labels2017-parts'    # dir with data to copy

zipname_prefix = 'labels2017'

os.makedirs(output_dir, exist_ok=True)


def zipdir(path, ziph):
    length = len(path)

    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        folder = root[length:]  # path without "parent"
        for file in files:
            ziph.write(os.path.join(root, file), os.path.join(folder, file))


for root, dirs, files in os.walk(data_dir):
    if root == data_dir:
        for dirname in dirs:
            zip_name = output_dir + '/' + zipname_prefix + '-' + dirname + '.zip'
            zip_dir = root + '/' + dirname
            zipf = zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_STORED)
            zipdir(zip_dir, zipf)
            zipf.close()
            print(f'Created {os.path.basename(zip_name)} with {len(zipf.namelist())} files')

###  Batch zip folders without parent  ###
Created labels2017-part-3.zip with 12823 files
Created labels2017-part-4.zip with 12823 files
Created labels2017-part-5.zip with 12823 files
Created labels2017-part-1.zip with 12823 files
Created labels2017-part-2.zip with 12823 files
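The "without parent" behaviour comes from the archive name passed to `ZipFile.write`. As a hedged alternative to slicing the root path by length, the same effect can be sketched with `os.path.relpath`. The function name `zip_dir_contents` and the demo file names are assumptions made for illustration.

```python
import os
import tempfile
import zipfile


def zip_dir_contents(src_dir, zip_path):
    """Store every file under src_dir in zip_path, using paths relative to src_dir."""
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_STORED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for file_name in files:
                full_path = os.path.join(root, file_name)
                # arcname is relative to src_dir, so the parent folder is not stored
                zf.write(full_path, os.path.relpath(full_path, src_dir))
        return len(zf.namelist())


# Small self-contained demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    part_dir = os.path.join(tmp, 'part-1')
    os.makedirs(os.path.join(part_dir, 'sub'))
    for rel_path in ('a.txt', os.path.join('sub', 'b.txt')):
        with open(os.path.join(part_dir, rel_path), 'w') as f:
            f.write('0 0.5 0.5 0.1 0.1\n')
    demo_zip = os.path.join(tmp, 'demo-part-1.zip')  # kept outside part_dir
    demo_count = zip_dir_contents(part_dir, demo_zip)
    with zipfile.ZipFile(demo_zip) as zf:
        demo_names = sorted(zf.namelist())

print(demo_count, demo_names)
```

The zip file should be written outside `src_dir`; otherwise the walk may pick up the archive while it is being created.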


&ndash;&nbsp; If only split to train/val/test

In [None]:
%cd /content/labels2017/coco/labels/train2017

!zip -r -0 /content/labels2017-person.zip .

!cp -r /content/labels2017-person.zip /content/drive/MyDrive/yolo/4-datasets/coco-datasets

In [None]:
%cd /content/train2017/train2017

!zip -r -0 /content/train2017/train2017-person.zip .

!cp -r /content/train2017/train2017-person.zip /content/drive/MyDrive/yolo/4-datasets/coco-datasets

In [None]:
%cd /content/train-1410-4

!zip -r /content/train-1410-sorted.zip .

!cp -r /content/train-1410-sorted.zip /content/drive/MyDrive/coco/

&ndash;&nbsp; If split to 2 halves, and then train/val/test

In [None]:
# First half
%cd /content/train-1410-first-half

!zip -r -0 /content/train-1410-first-half.zip .

!cp -r /content/train-1410-first-half.zip /content/drive/MyDrive/coco/

# Second half
%cd /content/train-1410-second-half

!zip -r -0 /content/train-1410-second-half.zip .

!cp -r /content/train-1410-second-half.zip /content/drive/MyDrive/coco/

### b. Playground

In [None]:
# %cd /content/mixed-dataset-lbp-8-images

# !zip -r -0 /content/mixed-dataset-lbp-8-images.zip .

# !cp -r /content/mixed-dataset-lbp-8-images.zip /content/drive/MyDrive/yolov5/

# %cd /content/mixed-dataset-lbp-9-images

# !zip -r -0 /content/mixed-dataset-lbp-9-images.zip .

# !cp -r /content/mixed-dataset-lbp-9-images.zip /content/drive/MyDrive/yolov5/

# %cd /content/mixed-dataset-lbp-10-images

# !zip -r -0 /content/mixed-dataset-lbp-10-images.zip .

# !cp -r /content/mixed-dataset-lbp-10-images.zip /content/drive/MyDrive/yolov5/

### c. Force sync data to Google Drive

In [None]:
# Sync data (of local VM's disk cache) of Drive-mounted folder back to Google Drive

drive.flush_and_unmount()

In [None]:
# Mount back if needed

drive.mount('/content/drive')

# S-Class cells

## [S-Class] Count class instances

In [None]:
label_dir = '/content/8th-VDAS-labels'
labels = listdir(label_dir)

person, no_mask, mask, wrong_mask, oh_no, background, empty = 0, 0, 0, 0, 0, 0, 0

for label in labels:
    with open(join(label_dir, label), 'r') as fr:
        fr.seek(0)  # Ensure pointer is at beginning of file
        # Not empty label
        if getsize(join(label_dir, label)) > 0:
            for line in fr:
                # Skip blank lines (every raw line still ends with '\n', so strip before testing)
                if line.strip():
                    if line.startswith('0 '):
                        person += 1
                    elif line.startswith('1 '):
                        no_mask += 1
                    elif line.startswith('2 '):
                        mask += 1
                    elif line.startswith('3 '):
                        wrong_mask += 1
                    else:
                        # print(line)
                        oh_no += 1
        # Empty label
        else:
            background += 1

print(
    f'''
Dir: {label_dir}
    Total files:   {len(labels)}
    0. Person:     {person}
    1. No mask:    {no_mask}
    2. Mask:       {mask}
    3. Wrong mask: {wrong_mask}
    Background:    {background}
    Empty:         {empty}
    Oh no:         {oh_no}
    '''
)


Dir: /content/8th-VDAS-labels
    Total files:   44549
    0. Person:     714580
    1. No mask:    249729
    2. Mask:       99801
    3. Wrong mask: 0
    Background:    0
    Empty:         0
    Oh no:         0
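The same tally can be sketched with `collections.Counter`, which avoids one variable per class and copes with class ids that were not anticipated. This is only an illustration under the assumption that each non-blank line starts with the class id; `count_class_instances` is a hypothetical name, not a helper from this notebook.

```python
import os
import tempfile
from collections import Counter


def count_class_instances(label_dir):
    """Return (per-class instance counts, number of label files with no boxes)."""
    counts = Counter()
    background = 0
    for file_name in os.listdir(label_dir):
        with open(os.path.join(label_dir, file_name)) as f:
            rows = [line.split() for line in f if line.strip()]
        if not rows:
            background += 1  # empty label file = background image
        for row in rows:
            counts[row[0]] += 1  # first token is the class id
    return counts, background


# Demo on three tiny label files
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        'a.txt': '0 0.5 0.5 0.2 0.2\n1 0.4 0.4 0.1 0.1\n',
        'b.txt': '',                       # background image
        'c.txt': '0 0.3 0.3 0.2 0.2\n\n',  # trailing blank line is ignored
    }
    for file_name, text in samples.items():
        with open(os.path.join(demo_dir, file_name), 'w') as f:
            f.write(text)
    demo_counts, demo_background = count_class_instances(demo_dir)

print(dict(demo_counts), demo_background)
```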
    


## [S-Class] Pseudo-labeling

In [None]:
# 1. Clone YOLOv5

%cd /content/
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt        # install dependencies

import torch
from yolov5 import utils
display = utils.notebook_init()          # setup checks

print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

YOLOv5 🚀 v6.1-119-gaa542ce torch 1.10.0+cu111 CUDA:0 (Tesla P100-PCIE-16GB, 16281MiB)


Setup complete ✅ (2 CPUs, 12.7 GB RAM, 40.0/166.8 GB disk)
Setup complete. Using torch 1.10.0+cu111 (Tesla P100-PCIE-16GB)


In [None]:
# 2. Extract images

# !unzip -qq /content/drive/MyDrive/yolo/4-datasets/coco-datasets/train2017-person.zip -d /content/images
# !unzip -qq /content/drive/MyDrive/yolo/4-datasets/VDAS-FM-full-30k_optimized.zip -d /content/images
# !unzip -qq /content/drive/MyDrive/yolo/4-datasets/5th-VDAS/Kaggle-12k-images.zip -d /content/images
!unzip -qq /content/drive/MyDrive/yolo/4-datasets/5th-VDAS/Kaggle-1st-920-images.zip -d /content/images

In [None]:
# 3. Copy weights & check size (SKIP THIS if pseudo-labeling for person)
!cp /content/drive/MyDrive/yolo/1-weights/best-0314-20.pt /content
os.path.getsize('/content/best-0314-20.pt')     # 25099669

In [None]:
# Pseudo-labeling on remote
!python3.7 detect.py --weights yolov5x6.pt --img 1280 --source ../MaskedFace-Net-org --classes 0 --conf 0.5 --iou 0.6 --save-txt --nosave

In [None]:
# 4. Pseudo-labeling
%cd /content/yolov5
!python detect.py --weights yolov5x6.pt --img 1280 --source /content/images --classes 0 --conf 0.25 --iou 0.5 --save-txt

/content/yolov5
detect: weights=['yolov5x6.pt'], source=/content/images, data=data/coco128.yaml, imgsz=[1280, 1280], conf_thres=0.25, iou_thres=0.5, max_det=1000, device=, view_img=False, save_txt=True, save_conf=False, save_crop=False, nosave=False, classes=[0], agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
YOLOv5 🚀 v6.1-119-gaa542ce torch 1.10.0+cu111 CUDA:0 (Tesla P100-PCIE-16GB, 16281MiB)

Fusing layers... 
YOLOv5x6 summary: 574 layers, 140730220 parameters, 0 gradients, 209.8 GFLOPs
image 1/920 /content/images/920--1x-1.jpg: 896x1280 18 persons, Done. (0.092s)
image 2/920 /content/images/920--I1-MS09uaqsLdGTFkgnS0Rcg1mmPyAj95ySg_eckoM.jpeg: 896x1280 5 persons, Done. (0.090s)
image 3/920 /content/images/920-0002526673.jpg: 896x1280 5 persons, Done. (0.090s)
image 4/920 /content/images/920-0009S6815V3PEU1N-C123-F4.jpg: 960x1280 1

In [None]:
# 5. Check num of labels generated
latest_detect_dir = latest_modified_subdir(join('/content/yolov5', 'runs/detect'))
latest_labels   = join(latest_detect_dir, 'labels')
total_files(latest_labels)

/content/yolov5/runs/detect/exp4/labels: 920
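`latest_modified_subdir` and `total_files` are helpers defined elsewhere in the notebook and are not shown in this section. The sketch below is an assumption about what they might look like, inferred only from how they are called and from the `path: count` output above; the real implementations may differ.

```python
import os
import tempfile


def latest_modified_subdir(parent_dir):
    """Return the most recently modified direct subdirectory of parent_dir."""
    subdirs = [entry.path for entry in os.scandir(parent_dir) if entry.is_dir()]
    return max(subdirs, key=os.path.getmtime)


def total_files(path):
    """Print 'path: count' and return the number of files directly inside path."""
    count = sum(1 for entry in os.scandir(path) if entry.is_file())
    print(f'{path}: {count}')
    return count


# Demo: two fake run folders, with exp2 marked as the newer one
with tempfile.TemporaryDirectory() as runs_dir:
    for run_name in ('exp', 'exp2'):
        os.makedirs(os.path.join(runs_dir, run_name, 'labels'))
    for i in range(3):
        open(os.path.join(runs_dir, 'exp2', 'labels', f'{i}.txt'), 'w').close()
    # Set modification times explicitly so the demo does not depend on timing
    os.utime(os.path.join(runs_dir, 'exp'), (1000, 1000))
    os.utime(os.path.join(runs_dir, 'exp2'), (2000, 2000))
    demo_latest = latest_modified_subdir(runs_dir)
    demo_count = total_files(os.path.join(demo_latest, 'labels'))
```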


In [None]:
# 6. Copy to Drive
dir_to_zip = latest_detect_dir
zip_name = 'Kaggle-1st-920-c25-i5-full.zip'    # Change this
drive_location = '/content/drive/MyDrive/yolo/2-detects/5th-VDAS'   # And possibly this
!cd $dir_to_zip && zip -r -0 $zip_name . && cp $zip_name $drive_location

dir_to_zip = latest_labels
zip_name = 'Kaggle-1st-920-c25-i5.zip'    # Change this
drive_location = '/content/drive/MyDrive/yolo/2-detects/5th-VDAS'   # And possibly this
!cd $dir_to_zip && zip -r -0 $zip_name . && cp $zip_name $drive_location

  adding: 920-6d4afcbc-d223-4602-a8e4-22c156ede18f-coronavirus005.jpeg (stored 0%)
  adding: 920-20190919FL13.jpeg (stored 0%)
  adding: 920-3050308_1_1.jpg (stored 0%)
  adding: 920-5865559_012020-cc-cnn-coronavirus-monday-update-vid.jpg (stored 0%)
  adding: 920-merlin_167450115_211c72f4-b732-4b8f-aac3-de17afbab5cc-superJumbo.jpg (stored 0%)
  adding: 920-file77fstl7kf2pn88rslff.jpg (stored 0%)
  adding: 920-46db30ff1d5a845cbe4e481ccc2626cd-149819.jpg (stored 0%)
  adding: 920-20200129001153.jpg (stored 0%)
  adding: 920-RTX7CCFN.jpg (stored 0%)
  adding: 920-virus_protection123.jpg (stored 0%)
  adding: 920-Face-Mask-vs-Surgical-Mask.jpg (stored 0%)
  adding: 920-1200x818.jpg (stored 0%)
  adding: 920-0_Concern-In-China-As-Mystery-Virus-Spreads.jpg (stored 0%)
  adding: 920-b83abacd915ca1914599d05b4d62592a.jpg (stored 0%)
  adding: 920-12240386-3x2-xlarge.jpg (stored 0%)
  adding: 920-rally-against-an-anti-mask-law-meant-to-deter-anti-government-protesters-in-hong-kong-china-shutter

## [S-Class] Pseudo-labeling (YOLOv5-face)

In [None]:
# 0. Clone YOLOv5 (to install requirements)

%cd /content/
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt        # install dependencies

import torch
from yolov5 import utils
display = utils.notebook_init()          # setup checks

print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

In [None]:
# 1. Clone YOLOv5-face
%cd /content
!git clone https://github.com/deepcam-cn/yolov5-face

/content
Cloning into 'yolov5-face'...
remote: Enumerating objects: 494, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 494 (delta 35), reused 39 (delta 15), pack-reused 417
Receiving objects: 100% (494/494), 6.63 MiB | 18.26 MiB/s, done.
Resolving deltas: 100% (239/239), done.


In [None]:
# 2. Extract images
makedirs('/content/yolov5-face/images/test')
!unzip /content/drive/MyDrive/yolo/4-datasets/coco-datasets/train2017-person.zip -d /content/yolov5-face/images/test

In [None]:
# 3. Copy face.pt weights & check size
!cp /content/drive/MyDrive/yolo/1-weights/face.pt /content/yolov5-face
os.path.getsize('/content/yolov5-face/face.pt')     # 373762519

In [None]:
# Pseudo-labeling on remote
!python3.7 test.py --weights face.pt --img-size 800 --conf-thres 0.7 --iou-thres 0.6 --task test --data data/data-colab.yaml --save-txt

In [None]:
# 4. Pseudo-labeling (YOLOv5-face - no mask)
%cd /content/yolov5-face
!python test.py --weights face.pt --img-size 800 --conf-thres 0.5 --task test --data data/data-colab.yaml --save-txt

/content/yolov5-face
Namespace(augment=False, batch_size=32, conf_thres=0.5, data='data/data-colab.yaml', device='', exist_ok=False, img_size=800, iou_thres=0.6, name='exp', project='runs/test', save_conf=False, save_hybrid=False, save_json=False, save_txt=True, single_cls=False, task='test', verbose=False, weights=['face.pt'])
YOLOv5 aaa233a torch 1.10.0+cu111 CUDA:0 (Tesla T4, 15109.75MB)

Fusing layers... 
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 399 layers, 46596016 parameters, 0 gradients, 110.8 GFLOPS
Scanning 'images/test' for images and labels... 0 found, 64115 missing, 0 empty, 0 corrupted: 100% 64115/64115 [01:19<00:00, 807.15it/s]
New cache created: images/test.cache
Scanning 'images/test.cache' for images and labels... 0 found, 64115 missing, 0 empty, 0 corrupted: 100% 64115/64115 [00:00<?, ?it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 2004/2004 [22:51<00:00,  1.46it/s]


In [None]:
# 5. Check num of labels generated
latest_test_dir = latest_modified_subdir(join('/content/yolov5-face', 'runs/test'))
latest_labels   = join(latest_test_dir, 'labels')
total_files(latest_labels)

/content/yolov5-face/runs/test/exp2/labels: 49544


In [None]:
# 6. Copy to Drive
dir_to_zip = latest_test_dir
zip_name = 'train2017-labels-face-c5-i6-full.zip'    # Change this
drive_location = '/content/drive/MyDrive/yolo/2-detects/3rd-VDAS'   # And possibly this
!cd $dir_to_zip && zip -r -0 $zip_name . && cp $zip_name $drive_location

Streaming output truncated to the last 5000 lines.
  adding: labels/000000389644.txt (stored 0%)
  adding: labels/000000120179.txt (stored 0%)
  adding: labels/000000058868.txt (stored 0%)
  adding: labels/000000456608.txt (stored 0%)
  adding: labels/000000467800.txt (stored 0%)
  adding: labels/000000001270.txt (stored 0%)
  adding: labels/000000182723.txt (stored 0%)
  adding: labels/000000001757.txt (stored 0%)
  adding: labels/000000066959.txt (stored 0%)
  adding: labels/000000238963.txt (stored 0%)
  adding: labels/000000140087.txt (stored 0%)
  adding: labels/000000052543.txt (stored 0%)
  adding: labels/000000348838.txt (stored 0%)
  adding: labels/000000484166.txt (stored 0%)
  adding: labels/000000545088.txt (stored 0%)
  adding: labels/000000276197.txt (stored 0%)
  adding: labels/000000534224.txt (stored 0%)
  adding: labels/000000522612.txt (stored 0%)
  adding: labels/000000476722.txt (stored 0%)
  adding: labels/000000329654.txt (stored 0%)
  adding: label

## [S-Class] Draw bounding boxes

In [None]:
print(len(listdir('/content/IMFD-CMFD')))

24623


In [None]:
makedirs('/content/MaskedFace-Net-CMFD')
makedirs('/content/MaskedFace-Net-IMFD')

In [None]:
import os
path = "/content"  # /content is effectively the root of the Colab workspace; choose another path if needed
os.chdir(path)

In [None]:
total_files3('/content/MaskedFace-Net-CMFD')
total_files3('/content/MaskedFace-Net-IMFD')

/content/MaskedFace-Net-CMFD: 67048
/content/MaskedFace-Net-IMFD: 66734


In [None]:
!unzip -j /content/drive/MyDrive/yolo/4-datasets/MaskedFace-Net/CMFD.zip -d /content/MaskedFace-Net-CMFD
!unzip -j /content/drive/MyDrive/yolo/4-datasets/MaskedFace-Net/CMFD1.zip -d /content/MaskedFace-Net-CMFD
!unzip -j /content/drive/MyDrive/yolo/4-datasets/MaskedFace-Net/IMFD.zip -d /content/MaskedFace-Net-IMFD
!unzip -j /content/drive/MyDrive/yolo/4-datasets/MaskedFace-Net/IMFD1.zip -d /content/MaskedFace-Net-IMFD

Streaming output truncated to the last 5000 lines.
  inflating: /content/MaskedFace-Net-IMFD/64732_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64733_Mask_Mouth_Chin.jpg  
 extracting: /content/MaskedFace-Net-IMFD/64734_Mask_Mouth_Chin.jpg  
 extracting: /content/MaskedFace-Net-IMFD/64735_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64736_Mask_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64737_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64738_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64739_Mask_Chin.jpg  
 extracting: /content/MaskedFace-Net-IMFD/64740_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64741_Mask_Nose_Mouth.jpg  
 extracting: /content/MaskedFace-Net-IMFD/64742_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64743_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-IMFD/64744_Mask_Mouth_Chin.jpg  
  inflating: /content/MaskedFace-Net-

In [None]:
# Dirs
image_dir = "C:/Users/Nagidrop/Downloads/C/FM-30k/VDAS-FM-full-30k"
label_dir = "C:/Users/Nagidrop/Downloads/C/FM-30k/VDAS-FM-labels_with_person"
result_dir = "C:/Users/Nagidrop/Downloads/C/FM-30k/VDAS-FM-labels-bbox"

name_list = [splitext(filename)[0] for filename in listdir(image_dir)]

makedirs(result_dir, exist_ok=True)

# jpg, png = 0, 0
for name in name_list:
    if isfile(join(image_dir, name + '.jpg')):
        img = imread(join(image_dir, name + '.jpg'))
        # jpg += 1
    elif isfile(join(image_dir, name + '.png')):
        img = imread(join(image_dir, name + '.png'))
        # png += 1
    elif isfile(join(image_dir, name + '.jpeg')):
        img = imread(join(image_dir, name + '.jpeg'))
    else:
        print('File not JPG, PNG or JPEG')
        break

    dh, dw, _ = img.shape

    label = join(label_dir, name + '.txt')
    with open(label, 'r') as f:
        data = f.read().splitlines()

    index = 0  # Line no.

    for dt in data:
        index += 1
        _, x, y, w, h = map(float, dt.split())

        l = int((x - w / 2) * dw)
        r = int((x + w / 2) * dw)
        t = int((y - h / 2) * dh)
        b = int((y + h / 2) * dh)

        if l < 0:
            l = 0
        if r > dw - 1:
            r = dw - 1
        if t < 0:
            t = 0
        if b > dh - 1:
            b = dh - 1

        label_1 = 'person'
        label_2 = 'no mask'
        label_3 = 'mask'
        label_4 = 'wrong mask'
        label_1_color = (62, 43, 255)  # red
        label_2_color = (44, 104, 255)  # orange
        label_3_color = (17, 103, 177)  # blue (1167B1)
        label_4_color = (230, 0, 126)  # pink (E6007E)
        white = (255, 255, 255)

        if _ == 0.0:
            img = rectangle(img, (l, t), (r, b), label_1_color, 2)

            (w, h), _2 = getTextSize(label_1, FONT_HERSHEY_DUPLEX, 0.6, 1)

            img = rectangle(img, (l, t - 20), (l + w, t), label_1_color, -1)
            img = putText(img, label_1, (l, t - 5),
                          FONT_HERSHEY_DUPLEX, 0.5, white, 1)
        elif _ == 1.0:
            img = rectangle(img, (l, t), (r, b), label_2_color, 2)

            (w, h), _2 = getTextSize(label_2, FONT_HERSHEY_DUPLEX, 0.6, 1)

            img = rectangle(img, (l, t - 20), (l + w, t), label_2_color, -1)
            img = putText(img, label_2, (l, t - 5),
                          FONT_HERSHEY_DUPLEX, 0.5, white, 1)
        elif _ == 2.0:
            img = rectangle(img, (l, t), (r, b), label_3_color, 2)

            (w, h), _2 = getTextSize(label_3, FONT_HERSHEY_DUPLEX, 0.6, 1)

            img = rectangle(img, (l, t - 20), (l + w, t), label_3_color, -1)
            img = putText(img, label_3, (l, t - 5),
                          FONT_HERSHEY_DUPLEX, 0.5, white, 1)
        elif _ == 3.0:
            img = rectangle(img, (l, t), (r, b), label_4_color, 2)

            (w, h), _2 = getTextSize(label_4, FONT_HERSHEY_DUPLEX, 0.6, 1)

            img = rectangle(img, (l, t - 20), (l + w, t), label_4_color, -1)
            img = putText(img, label_4, (l, t - 5),
                          FONT_HERSHEY_DUPLEX, 0.5, white, 1)

    imwrite(join(result_dir, name + '-bbox.jpg'), img)
    # break


## [S-Class] Check integrity (optimized)

In [None]:
from os import listdir, remove
from os.path import isfile, join, splitext

# Directories
dir_1 = "/content/7th-VDAS-images"  # base
dir_2 = "/content/7th-VDAS-labels"  # extract

# List dir_1_files and dir_2_files (extensions are kept here, so image and label names never match)
dir_1_files = [filename for filename in listdir(dir_1)]
dir_2_files = [filename for filename in listdir(dir_2)]

print(f'Dir 1 file count: {len(dir_1_files)}')
print(f'Dir 2 file count: {len(dir_2_files)}')

# Find dir_2_files that don't have dir_1_files
dir_2_no_1 = list(set(dir_2_files) - set(dir_1_files))

print(f"Dir 2 files without dir 1's count: {len(dir_2_no_1)}")

# Find dir_1_files that don't have corresponding dir_2_files
dir_1_no_2 = list(set(dir_1_files) - set(dir_2_files))

print(f"Dir 1 files without dir 2's count: {len(dir_1_no_2)}")

# dest_dir = '/content/IFMD-part-1-remain'
# makedirs(dest_dir)
# for file in dir_1_no_2:
#     copy(join(dir_1, file), dest_dir)

# print(len(listdir(dest_dir)))

# jpg, png, jpeg, txt = 0, 0, 0, 0
# for name in dir_2_no_1:
#     if isfile(join(dir_2, name + '.jpg')):
#         remove(join(dir_2, name + '.jpg'))
#         jpg += 1
#     elif isfile(join(dir_2, name + '.png')):
#         remove(join(dir_2, name + '.png'))
#         png += 1
#     elif isfile(join(dir_2, name + '.jpeg')):
#         remove(join(dir_2, name + '.jpeg'))
#         jpeg += 1
#     elif isfile(join(dir_2, name + '.txt')):
#         remove(join(dir_2, name + '.txt'))
#         txt += 1
#     else:
#         print('File not JPG, PNG or JPEG')
#         exit(0)

# print(
#     f'''
#     Num of JPGs:  {jpg}
#     Num of PNGs:  {png}
#     Num of JPEGs: {jpeg}
#     Num of TXTs:  {txt}
#     '''
# )

# # Checks
# dir_1_files = [splitext(filename)[0] for filename in listdir(dir_1)]
# dir_2_files = [splitext(filename)[0] for filename in listdir(dir_2)]
# dir_2_no_1 = list(set(dir_2_files) - set(dir_1_files))

# print(f"Dir 2 files without dir 1's count: {len(dir_2_no_1)}")

Dir 1 file count: 86552
Dir 2 file count: 85439
Dir 2 files without dir 1's count: 85439
Dir 1 files without dir 2's count: 86552


In [None]:
# NEWEST (22-04-15)

# Directories
dir_1 = "/content/7th-VDAS-images"  # new
dir_2 = "/content/7th-VDAS-labels" # org

# List dir_1_files and dir_2_files without extensions
dir_1_files = [splitext(filename)[0] for filename in listdir(dir_1)]
dir_2_files = [splitext(filename)[0] for filename in listdir(dir_2)]

# print(dir_2_files)
print(f'Dir 1 file count: {len(dir_1_files)}')
print(f'Dir 2 file count: {len(dir_2_files)}')

# Find dir_1_files that don't have corresponding dir_2_files
dir_1_no_2 = list(set(dir_1_files) - set(dir_2_files))

print(f"Dir 1 files without dir 2's count: {len(dir_1_no_2)}")

# Find dir_2_files that don't have dir_1_files
dir_2_no_1 = list(set(dir_2_files) - set(dir_1_files))

print(f"Dir 2 files without dir 1's count: {len(dir_2_no_1)}")

# Delete dir 2 no 1
jpg, png, jpeg, txt = 0, 0, 0, 0
for name in dir_2_no_1:
    if isfile(join(dir_2, name + '.jpg')):
        remove(join(dir_2, name + '.jpg'))
        jpg += 1
    elif isfile(join(dir_2, name + '.png')):
        remove(join(dir_2, name + '.png'))
        png += 1
    elif isfile(join(dir_2, name + '.jpeg')):
        remove(join(dir_2, name + '.jpeg'))
        jpeg += 1
    elif isfile(join(dir_2, name + '.txt')):
        remove(join(dir_2, name + '.txt'))
        txt += 1
    else:
        print('File not JPG, PNG or JPEG')
        exit(0)

# Delete dir 1 no 2
jpg, png, jpeg, txt = 0, 0, 0, 0
for name in dir_1_no_2:
    if isfile(join(dir_1, name + '.jpg')):
        remove(join(dir_1, name + '.jpg'))
        jpg += 1
    elif isfile(join(dir_1, name + '.png')):
        remove(join(dir_1, name + '.png'))
        png += 1
    elif isfile(join(dir_1, name + '.jpeg')):
        remove(join(dir_1, name + '.jpeg'))
        jpeg += 1
    elif isfile(join(dir_1, name + '.txt')):
        remove(join(dir_1, name + '.txt'))
        txt += 1
    else:
        print('File not JPG, PNG or JPEG')
        exit(0)

print(
    f'''
    Num of JPGs:  {jpg}
    Num of PNGs:  {png}
    Num of JPEGs: {jpeg}
    Num of TXTs:  {txt}
    '''
)
# Checks: re-list both dirs (without extensions) and recompute the differences
dir_1_files = [splitext(filename)[0] for filename in listdir(dir_1)]
dir_2_files = [splitext(filename)[0] for filename in listdir(dir_2)]
dir_1_no_2 = list(set(dir_1_files) - set(dir_2_files))
dir_2_no_1 = list(set(dir_2_files) - set(dir_1_files))

print(f'Dir 1 file count: {len(dir_1_files)}')
print(f'Dir 2 file count: {len(dir_2_files)}')
print(f"Dir 1 files without dir 2's count: {len(dir_1_no_2)}")
print(f"Dir 2 files without dir 1's count: {len(dir_2_no_1)}")

Dir 1 file count: 15000
Dir 2 file count: 100863
Dir 1 files without dir 2's count: 1113
Dir 2 files without dir 1's count: 86976

    Num of JPGs:  1113
    Num of PNGs:  0
    Num of JPEGs: 0
    Num of TXTs:  0
    
Dir 1 file count: 13887
Dir 2 file count: 13887
Dir 1 files without dir 2's count: 0
Dir 2 files without dir 1's count: 0
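The first cell in this section compares raw file names, which is why both "without" counts equal the full directory sizes: an image `x.jpg` and its label `x.txt` never share a name. Comparing stems, as the newest cell does, fixes this. A minimal sketch on in-memory lists (the function name `unmatched_stems` and the sample names are illustrative assumptions):

```python
from os.path import splitext


def unmatched_stems(image_names, label_names):
    """Return (image stems without a label, label stems without an image)."""
    image_stems = {splitext(name)[0] for name in image_names}
    label_stems = {splitext(name)[0] for name in label_names}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


demo_images = ['a.jpg', 'b.png', 'c.jpeg']
demo_labels = ['a.txt', 'b.txt', 'd.txt']

# Raw names never overlap because the extensions differ
demo_raw_overlap = set(demo_images) & set(demo_labels)
demo_unmatched = unmatched_stems(demo_images, demo_labels)

print(len(demo_raw_overlap), demo_unmatched)
```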


In [None]:
from os import listdir, remove
from os.path import isfile, join, splitext

# Directories
dir_1 = "/content/MaskedFace-Net-org"  # base
dir_2 = "/content/IMFD-9000"  # extract

# List dir_1_files and dir_2_files (with extensions)
dir_1_files = [filename for filename in listdir(dir_1)]
dir_2_files = [filename for filename in listdir(dir_2)]

print(f'Dir 1 file count: {len(dir_1_files)}')
print(f'Dir 2 file count: {len(dir_2_files)}')

# Find common files
common_files = list(set(dir_1_files) & set(dir_2_files))

print(f"Common files count: {len(common_files)}")

# Delete common files

for file in common_files:
    remove(join(dir_1, file))

print(len(listdir(dir_1)))

# jpg, png, jpeg, txt = 0, 0, 0, 0
# for name in dir_2_no_1:
#     if isfile(join(dir_2, name + '.jpg')):
#         remove(join(dir_2, name + '.jpg'))
#         jpg += 1
#     elif isfile(join(dir_2, name + '.png')):
#         remove(join(dir_2, name + '.png'))
#         png += 1
#     elif isfile(join(dir_2, name + '.jpeg')):
#         remove(join(dir_2, name + '.jpeg'))
#         jpeg += 1
#     elif isfile(join(dir_2, name + '.txt')):
#         remove(join(dir_2, name + '.txt'))
#         txt += 1
#     else:
#         print('File not JPG, PNG or JPEG')
#         exit(0)

# print(
#     f'''
#     Num of JPGs:  {jpg}
#     Num of PNGs:  {png}
#     Num of JPEGs: {jpeg}
#     Num of TXTs:  {txt}
#     '''
# )

# # Checks
# dir_1_files = [splitext(filename)[0] for filename in listdir(dir_1)]
# dir_2_files = [splitext(filename)[0] for filename in listdir(dir_2)]
# dir_2_no_1 = list(set(dir_2_files) - set(dir_1_files))

# print(f"Dir 2 files without dir 1's count: {len(dir_2_no_1)}")

Dir 1 file count: 124782
Dir 2 file count: 9000
Common files count: 0
124782


## [S-Class] Copy files from flattened folder structure to another

In [None]:
!unzip /content/drive/MyDrive/yolo/4-datasets/face-mask-datasets/IMFD-9000_optimized.zip -d /content/IMFD-9000

In [None]:
import random
from os import listdir, makedirs, walk
from os.path import join, basename, abspath, isfile
from zipfile import ZipFile, ZIP_STORED
from shutil import copy
from glob import glob1

src_dir = '/content/IMFD-CMFD'
dest_dir = '/content/IMFD-CMFD-full-remain'

all_files = []

makedirs(dest_dir, exist_ok=True)
if len(listdir(dest_dir)) != 0:
    # Stop the cell here rather than copying into a non-empty directory
    raise RuntimeError(f"'{dest_dir}' exists with {len(listdir(dest_dir))} files/folders")

# Build list of files with absolute path
for root, _, files in walk(src_dir):
    for file in files:
        all_files.append(abspath(join(root, file)))

# Copy the files
for file in all_files:
    copy(file, dest_dir)

if len(all_files) == len(listdir(dest_dir)):
    print(f"All {len(all_files)} are copied to '{dest_dir}'")
else:
    print(f'Total {len(all_files)}; But only {len(listdir(dest_dir))} copied')

All 124782 are copied to '/content/IMFD-CMFD-full-remain'
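Flattening a directory tree silently overwrites files that share a base name in different subfolders, which is the case the "But only ... copied" branch above would catch after the fact. A pre-check can be sketched as follows; `find_name_collisions` and the demo paths are hypothetical and serve only to illustrate the idea.

```python
import os
import tempfile
from collections import Counter


def find_name_collisions(src_dir):
    """Return base names that occur more than once anywhere under src_dir."""
    names = Counter()
    for _root, _dirs, files in os.walk(src_dir):
        names.update(files)
    return sorted(name for name, n in names.items() if n > 1)


# Demo: the same file name appears in two subfolders
with tempfile.TemporaryDirectory() as demo_src:
    for rel_path in ('CMFD/00000_Mask.jpg', 'IMFD/00000_Mask.jpg', 'IMFD/00001_Mask.jpg'):
        full_path = os.path.join(demo_src, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        open(full_path, 'w').close()
    demo_collisions = find_name_collisions(demo_src)

print(demo_collisions)
```

Running such a check before the copy loop shows which names would be lost, rather than only how many.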


## [S-Class] Count files by type (JPG, PNG, TXT, others)

In [None]:
from os import walk

data_dir = '/content/IFMD-part-1'

note = '/content/IFMD-part-1'
print(note)

jpg = 0
png = 0
jpeg = 0
txt = 0
others = 0

# Walk all files
for root, _, filenames in walk(data_dir):
    for f in filenames:
        if f.endswith('.jpg'):
            jpg += 1
        elif f.endswith('.png'):
            png += 1
        elif f.endswith('.jpeg'):
            jpeg += 1
        elif f.endswith('.txt'):
            txt += 1
        else:
            others += 1

print(
    f'''
    Total files: {jpg + png + jpeg + txt + others}
    Num of JPGs:  {jpg}
    Num of PNGs:  {png}
    Num of JPEGs: {jpeg}
    Num of TXTs:  {txt}
    Num of other files: {others} 
    '''
)

/content/IFMD-part-1

    Total files: 33553
    Num of JPGs:  33553
    Num of PNGs:  0
    Num of JPEGs: 0
    Num of TXTs:  0
    Num of other files: 0 
    


## [S-Class][WIP] Diff & dup checker

In [None]:
!python duplicates.py $image_dir

In [None]:
import os
from collections import Counter
from pathlib import Path

image_dir = '/content/images'
label_dir = '/content/labels'

images_no_ext = [Path(f).stem for f in os.listdir(image_dir)]
labels_no_ext = [Path(f).stem for f in os.listdir(label_dir)]

total_files(image_dir)
total_files(label_dir)
print(list(set(images_no_ext) - set(labels_no_ext)))
print(list(set(labels_no_ext) - set(images_no_ext)))
print(list(set(images_no_ext).symmetric_difference(set(labels_no_ext))))
print(list(set(images_no_ext).difference(set(labels_no_ext))))
print(list(set(images_no_ext) ^ set(labels_no_ext)))
print(list((Counter(images_no_ext) - Counter(labels_no_ext)).elements()))
print(len(images_no_ext) != len(set(images_no_ext)))

98378
98371
[]
[]
[]
[]
[]
['6101', '6200', '6137', '6166', '6100', '6108', '6107']
True


In [None]:
# Directories
image_dir = "/content/images"
label_dir = "/content/labels"

# List images and labels without extensions
images = [os.path.splitext(filename)[0] for filename in os.listdir(image_dir)]
labels = [os.path.splitext(filename)[0] for filename in os.listdir(label_dir)]

print(f'Images count (w/ dups): {len(images)}')
print(f'Labels count (w/ dups): {len(labels)}')

# Find images that don't have labels
no_label_images = list(set(images) - set(labels))

print(f'Images without labels count: {len(no_label_images)}')

# Find labels that don't have corresponding images
no_image_labels = list(set(labels) - set(images))

print('Labels without images count: ' + str(len(no_image_labels)))

# Find images that don't have labels (with duplicates)
no_label_images_dups = list((Counter(images) - Counter(labels)).elements())

print(f'Images without labels count (w/ dups): {len(no_label_images_dups)}')

# Find labels that don't have corresponding images (with duplicates)
no_image_labels_dups = list((Counter(labels) - Counter(images)).elements())

print(f'Labels without images count (w/ dups): {len(no_image_labels_dups)}')

Images count (w/ dups): 98178
Labels count (w/ dups): 98178
Images without labels count: 0
Labels without images count: 0
Images without labels count (w/ dups): 0
Labels without images count (w/ dups): 0
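The earlier output in this section shows empty set differences but a non-empty `Counter` difference. That pattern appears when several files share a stem, for example `6100.jpg` and `6100.png`: sets collapse the repeats, while `Counter` keeps them. A small sketch of both checks, with hypothetical function names and sample data:

```python
from collections import Counter
from pathlib import Path


def duplicate_stems(file_names):
    """Return stems that appear more than once (e.g. same name, different extension)."""
    stems = Counter(Path(name).stem for name in file_names)
    return sorted(stem for stem, n in stems.items() if n > 1)


def multiset_difference(left_names, right_names):
    """Return stems left over in left_names after pairing one-to-one with right_names."""
    left = Counter(Path(name).stem for name in left_names)
    right = Counter(Path(name).stem for name in right_names)
    return sorted((left - right).elements())


demo_images = ['6100.jpg', '6100.png', '6101.jpg', '7000.jpg']
demo_labels = ['6100.txt', '6101.txt', '7000.txt']

demo_set_diff = {Path(n).stem for n in demo_images} - {Path(n).stem for n in demo_labels}
demo_dups = duplicate_stems(demo_images)
demo_extra = multiset_difference(demo_images, demo_labels)

print(demo_set_diff, demo_dups, demo_extra)
```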


## [S-Class] Split to arbitrary parts, then split to train/val/test

In [None]:
shutil.rmtree('/content/VDAS-images-parts')
shutil.rmtree('/content/VDAS-labels-parts')
shutil.rmtree('/content/VDAS-images-zips')
shutil.rmtree('/content/VDAS-labels-zips')

In [None]:
rmtree('/content/7th-VDAS-labels-7-parts')

In [None]:
# Mark the start of code
print("###  Split to arbitrary parts, then to train/val/test  ###")

new_dir = '/content/8th-VDAS-labels-3-parts'   # new dir to create

data_dir = '/content/8th-VDAS-labels'    # dir with data to copy

number_of_parts = 3

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

all_file_names = listdir(data_dir)

all_file_names.sort()  # listdir order is arbitrary; sort so the seeded shuffle is repeatable

# Randomize dataset
random.seed(2022)  # Seed the global RNG (fixed seed gives a repeatable shuffle)
random.shuffle(all_file_names)  # Shuffle dataset


def split_to_equal_parts(file_names, num_of_parts):
    if num_of_parts < 2:
        print(f'num_of_parts must be 2 or more, got {num_of_parts}')
        return

    split_parts = [file_names[:int(len(file_names) * (1 / num_of_parts))]]

    for part_num in range(1, num_of_parts - 1):
        split_parts.append(file_names[int(len(file_names) * (part_num / num_of_parts)):int(
            len(file_names) * ((part_num + 1) / num_of_parts))])

    split_parts.append(file_names[int(len(file_names) * ((num_of_parts - 1) / num_of_parts)):])

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


def split_to_train_val_test(file_names, train_ratio, val_ratio, test_ratio):
    file_names = [basename(fn) for fn in file_names]
    split_parts = [file_names[:int(len(file_names) * train_ratio)],
                   file_names[int(len(file_names) * train_ratio):int(len(file_names) * (train_ratio + val_ratio))],
                   file_names[int(len(file_names) * (train_ratio + val_ratio)):]]

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


print(f'Data dir: {data_dir}')
split_file_names_result = split_to_equal_parts(all_file_names, number_of_parts)
print(f'Total files: {len(all_file_names)}')
for i in range(0, number_of_parts):
    print(f'Part {i + 1}: \t\t{len(split_file_names_result[i])}')
    makedirs(join(new_dir, f'part-{i + 1}'))

    train_val_test_file_names = split_to_train_val_test(split_file_names_result[i], 
                                                        train_ratio, val_ratio, test_ratio)

    # Notify image amount (Train/Val/Test)
    print(f'\t + Train:\t{len(train_val_test_file_names[0])}')
    print(f'\t + Val:\t\t{len(train_val_test_file_names[1])}')
    print(f'\t + Test:\t{len(train_val_test_file_names[2])}')

    # Creating Train/Val/Test folders
    makedirs(join(new_dir, f'part-{i + 1}', 'train'))
    makedirs(join(new_dir, f'part-{i + 1}', 'val'))
    makedirs(join(new_dir, f'part-{i + 1}', 'test'))

    # Copy-pasting images
    for name in train_val_test_file_names[0]:
        copy(name, join(new_dir, f'part-{i + 1}', 'train'))

    for name in train_val_test_file_names[1]:
        copy(name, join(new_dir, f'part-{i + 1}', 'val'))

    for name in train_val_test_file_names[2]:
        copy(name, join(new_dir, f'part-{i + 1}', 'test'))

###  Split to arbitrary parts, then to train/val/test  ###
Data dir: /content/8th-VDAS-labels
Total files: 44510
Part 1: 		14836
	 + Train:	11868
	 + Val:		1484
	 + Test:	1484
Part 2: 		14837
	 + Train:	11869
	 + Val:		1484
	 + Test:	1484
Part 3: 		14837
	 + Train:	11869
	 + Val:		1484
	 + Test:	1484


In [None]:
# Mark the start of code
print("###  Split to arbitrary parts, then to train/val/test  ###")

new_dir = '/content/8th-VDAS-images-3-parts'   # new dir to create

data_dir = '/content/8th-VDAS-images'    # dir with data to copy

number_of_parts = 3

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

all_file_names = listdir(data_dir)

all_file_names.sort()  # listdir order is arbitrary; sort so the seeded shuffle is repeatable

# Randomize dataset
random.seed(2022)  # Seed the global RNG (fixed seed gives a repeatable shuffle)
random.shuffle(all_file_names)  # Shuffle dataset


def split_to_equal_parts(file_names, num_of_parts):
    if num_of_parts < 2:
        print(f'num_of_parts must be 2 or more, got {num_of_parts}')
        return

    split_parts = [file_names[:int(len(file_names) * (1 / num_of_parts))]]

    for part_num in range(1, num_of_parts - 1):
        split_parts.append(file_names[int(len(file_names) * (part_num / num_of_parts)):int(
            len(file_names) * ((part_num + 1) / num_of_parts))])

    split_parts.append(file_names[int(len(file_names) * ((num_of_parts - 1) / num_of_parts)):])

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


def split_to_train_val_test(file_names, train_ratio, val_ratio, test_ratio):
    file_names = [basename(fn) for fn in file_names]
    split_parts = [file_names[:int(len(file_names) * train_ratio)],
                   file_names[int(len(file_names) * train_ratio):int(len(file_names) * (train_ratio + val_ratio))],
                   file_names[int(len(file_names) * (train_ratio + val_ratio)):]]

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


print(f'Data dir: {data_dir}')
split_file_names_result = split_to_equal_parts(all_file_names, number_of_parts)
print(f'Total files: {len(all_file_names)}')
for i in range(0, number_of_parts):
    print(f'Part {i + 1}: \t\t{len(split_file_names_result[i])}')
    makedirs(join(new_dir, f'part-{i + 1}'))

    train_val_test_file_names = split_to_train_val_test(split_file_names_result[i], 
                                                        train_ratio, val_ratio, test_ratio)

    # Notify image amount (Train/Val/Test)
    print(f'\t + Train:\t{len(train_val_test_file_names[0])}')
    print(f'\t + Val:\t\t{len(train_val_test_file_names[1])}')
    print(f'\t + Test:\t{len(train_val_test_file_names[2])}')

    # Creating Train/Val/Test folders
    makedirs(join(new_dir, f'part-{i + 1}', 'train'))
    makedirs(join(new_dir, f'part-{i + 1}', 'val'))
    makedirs(join(new_dir, f'part-{i + 1}', 'test'))

    # Copy-pasting images
    for name in train_val_test_file_names[0]:
        copy(name, join(new_dir, f'part-{i + 1}', 'train'))

    for name in train_val_test_file_names[1]:
        copy(name, join(new_dir, f'part-{i + 1}', 'val'))

    for name in train_val_test_file_names[2]:
        copy(name, join(new_dir, f'part-{i + 1}', 'test'))

###  Split to arbitrary parts, then to train/val/test  ###
Data dir: /content/8th-VDAS-images
Total files: 44510
Part 1: 		14836
	 + Train:	11868
	 + Val:		1484
	 + Test:	1484
Part 2: 		14837
	 + Train:	11869
	 + Val:		1484
	 + Test:	1484
Part 3: 		14837
	 + Train:	11869
	 + Val:		1484
	 + Test:	1484
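
The labels cell and the images cell produce matching splits only because both lists are sorted and then shuffled with the same seed. The sketch below illustrates that property with made-up stems; `shuffled` is a hypothetical helper, and it uses a private `random.Random` instance rather than the global seed used in the cells above.

```python
import random


def shuffled(names, seed=2022):
    """Return a sorted-then-shuffled copy of names using a private RNG."""
    ordered = sorted(names)
    random.Random(seed).shuffle(ordered)  # same seed + same length -> same permutation
    return ordered


# Hypothetical stems; in the notebook these would come from listdir().
stems = [f'img_{i:03d}' for i in range(10)]
images = shuffled([s + '.jpg' for s in stems])
labels = shuffled([s + '.txt' for s in reversed(stems)])

# Compare stems position by position after the shuffle
aligned = all(i.rsplit('.', 1)[0] == l.rsplit('.', 1)[0] for i, l in zip(images, labels))
print(aligned)
```

The alignment depends on both folders holding the same stems in the same sorted order, so an integrity check after splitting is still worth running.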


In [None]:
# Mark the start of code
print("###  Split to arbitrary parts, then to train/val/test  ###")

new_dir = '/content/7th-VDAS-images-7-parts'   # new dir to create

data_dir = '/content/7th-VDAS-images'    # dir with data to copy

number_of_parts = 7

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

all_file_names = listdir(data_dir)

all_file_names.sort()  # listdir order is arbitrary; sort so the seeded shuffle is repeatable

# Randomize dataset
random.seed(2022)  # Seed the global RNG (fixed seed gives a repeatable shuffle)
random.shuffle(all_file_names)  # Shuffle dataset


def split_to_equal_parts(file_names, num_of_parts):
    if num_of_parts < 2:
        print(f'num_of_parts must be 2 or more, got {num_of_parts}')
        return

    split_parts = [file_names[:int(len(file_names) * (1 / num_of_parts))]]

    for part_num in range(1, num_of_parts - 1):
        split_parts.append(file_names[int(len(file_names) * (part_num / num_of_parts)):int(
            len(file_names) * ((part_num + 1) / num_of_parts))])

    split_parts.append(file_names[int(len(file_names) * ((num_of_parts - 1) / num_of_parts)):])

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


def split_to_train_val_test(file_names, train_ratio, val_ratio, test_ratio):
    file_names = [basename(fn) for fn in file_names]
    split_parts = [file_names[:int(len(file_names) * train_ratio)],
                   file_names[int(len(file_names) * train_ratio):int(len(file_names) * (train_ratio + val_ratio))],
                   file_names[int(len(file_names) * (train_ratio + val_ratio)):]]

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


print(f'Data dir: {data_dir}')
split_file_names_result = split_to_equal_parts(all_file_names, number_of_parts)
print(f'Total files: {len(all_file_names)}')
for i in range(0, number_of_parts):
    print(f'Part {i + 1}: \t\t{len(split_file_names_result[i])}')
    makedirs(join(new_dir, f'part-{i + 1}'))

    train_val_test_file_names = split_to_train_val_test(split_file_names_result[i], 
                                                        train_ratio, val_ratio, test_ratio)

    # Notify image amount (Train/Val/Test)
    print(f'\t + Train:\t{len(train_val_test_file_names[0])}')
    print(f'\t + Val:\t\t{len(train_val_test_file_names[1])}')
    print(f'\t + Test:\t{len(train_val_test_file_names[2])}')

    # Creating Train/Val/Test folders
    makedirs(join(new_dir, f'part-{i + 1}', 'train'))
    makedirs(join(new_dir, f'part-{i + 1}', 'val'))
    makedirs(join(new_dir, f'part-{i + 1}', 'test'))

    # Copy-pasting images
    for name in train_val_test_file_names[0]:
        copy(name, join(new_dir, f'part-{i + 1}', 'train'))

    for name in train_val_test_file_names[1]:
        copy(name, join(new_dir, f'part-{i + 1}', 'val'))

    for name in train_val_test_file_names[2]:
        copy(name, join(new_dir, f'part-{i + 1}', 'test'))

###  Split to arbitrary parts, then to train/val/test  ###
Data dir: /content/7th-VDAS-images
Total files: 100863
Part 1: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
Part 2: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
Part 3: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
Part 4: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
Part 5: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
Part 6: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
Part 7: 		14409
	 + Train:	11527
	 + Val:		1441
	 + Test:	1441
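
The part and train/val/test counts printed above follow directly from the slice boundaries. Here is a small sketch of that arithmetic; `part_sizes` and `ratio_sizes` are hypothetical names, and `part_sizes` uses integer division for the boundaries where the notebook cells use `int(total * (i / num_parts))`.

```python
def part_sizes(total, num_parts):
    """Sizes of num_parts contiguous slices covering `total` items.

    Boundaries use integer arithmetic (total * i // num_parts); the notebook
    cells compute them with float multiplication instead.
    """
    bounds = [total * i // num_parts for i in range(num_parts + 1)]
    return [bounds[i + 1] - bounds[i] for i in range(num_parts)]


def ratio_sizes(total, train_ratio, val_ratio):
    """Train/val/test sizes; test takes whatever remains after train and val."""
    train_end = int(total * train_ratio)
    val_end = int(total * (train_ratio + val_ratio))
    return train_end, val_end - train_end, total - val_end


print(part_sizes(44510, 3))
print(ratio_sizes(14836, 0.8, 0.1))
```

Because the last slice always runs to the end of the list, no file is dropped when the total is not evenly divisible.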


## [S-Class] Split to arbitrary parts only

In [None]:
# Split to arbitrary parts only, no train val test
# Mark the start of code
print("###  Split to arbitrary parts only  ###")

new_dir = '/content/IMFD-CMFD-full-remain-parts'     # new dir to create

data_dir = '/content/IMFD-CMFD-full-remain'    # dir with data to copy

number_of_parts = 24

all_file_names = listdir(data_dir)

all_file_names.sort()  # listdir order is arbitrary; sort so the seeded shuffle is repeatable

# Randomize dataset
random.seed(2022)  # Seed the global RNG (fixed seed gives a repeatable shuffle)
random.shuffle(all_file_names)  # Shuffle dataset

def split_to_equal_parts(file_names, num_of_parts):
    if num_of_parts < 2:
        print(f'num_of_parts must be 2 or more, got {num_of_parts}')
        return

    split_parts = [file_names[:int(len(file_names) * (1 / num_of_parts))]]

    for part_num in range(1, num_of_parts - 1):
        split_parts.append(file_names[int(len(file_names) * (part_num / num_of_parts)):int(
            len(file_names) * ((part_num + 1) / num_of_parts))])

    split_parts.append(file_names[int(len(file_names) * ((num_of_parts - 1) / num_of_parts)):])

    split_file_names = []
    for part_num in range(0, len(split_parts)):
        split_file_names.append([join(data_dir, '') + name for name in split_parts[part_num]])

    return split_file_names


split_file_names_result = split_to_equal_parts(all_file_names, number_of_parts)

print(f'Total files: \t{len(all_file_names)}')
for i in range(0, number_of_parts):
    print(f'Part {i + 1}: \t\t{len(split_file_names_result[i])}')
    makedirs(join(new_dir, f'part-{i + 1}'))

    # Copy-pasting images
    for name in split_file_names_result[i]:
        copy(name, join(new_dir, f'part-{i + 1}'))


###  Split to arbitrary parts only  ###
Total files: 	124782
Part 1: 		5199
Part 2: 		5199
Part 3: 		5199
Part 4: 		5200
Part 5: 		5199
Part 6: 		5199
Part 7: 		5199
Part 8: 		5200
Part 9: 		5199
Part 10: 		5199
Part 11: 		5199
Part 12: 		5200
Part 13: 		5199
Part 14: 		5199
Part 15: 		5199
Part 16: 		5200
Part 17: 		5199
Part 18: 		5199
Part 19: 		5199
Part 20: 		5200
Part 21: 		5199
Part 22: 		5199
Part 23: 		5199
Part 24: 		5200


## [S-Class] Check integrity of train/val/test in multiple-parts folder

In [None]:
# Directories
train, val, test = 'train', 'val', 'test'

image_dir = '/content/8th-VDAS-images-3-parts'
label_dir = '/content/8th-VDAS-labels-3-parts'

# Sort so that part folders pair up by index (listdir order is arbitrary)
image_dirs = sorted(join(image_dir, x) for x in listdir(image_dir) if isdir(join(image_dir, x)))
label_dirs = sorted(join(label_dir, x) for x in listdir(label_dir) if isdir(join(label_dir, x)))

all_images = [join(root, f) for root, _, files in walk(image_dir) for f in files]
all_labels = [join(root, f) for root, _, files in walk(label_dir) for f in files]

if len(image_dirs) != len(label_dirs):
    print('Total folders is different!')
    print(f'\t+ Image dir: {len(image_dirs)} folders')
    print(f'\t+ Label dir: {len(label_dirs)} folders')
elif len(all_images) != len(all_labels):
    print('Total files is different!')
    print(f'\t+ Total images: {len(all_images)} images')
    print(f'\t+ Total labels: {len(all_labels)} labels')
else:
    diff_count = 0
    for i in range(0, len(image_dirs)):
        image_dir_to_check = image_dirs[i]
        label_dir_to_check = label_dirs[i]

        print(f'Checking {basename(image_dir_to_check)}')

        image_dir_train = join(image_dir_to_check, train)
        label_dir_train = join(label_dir_to_check, train)

        image_dir_val   = join(image_dir_to_check, val)
        label_dir_val   = join(label_dir_to_check, val)

        image_dir_test  = join(image_dir_to_check, test)
        label_dir_test  = join(label_dir_to_check, test)

        # List labels and images without extensions
        labels = [splitext(filename)[0] for filename in listdir(label_dir_train)]
        images = [splitext(filename)[0] for filename in listdir(image_dir_train)]

        print(f'- Train: {len(images)} files')

        # Find images without labels & labels without images
        no_label_images = list(set(images) - set(labels))
        no_image_labels = list(set(labels) - set(images))

        if no_label_images or no_image_labels or len(images) != len(labels):
            diff_count += 1
            print(f'Gotcha {basename(image_dir_to_check)} (train)')
            print(f'\t+ Images count: {len(images)}')
            print(f'\t+ Labels count: {len(labels)}')
            print(f'\t+ Images without labels count: {len(no_label_images)}')
            print(f'\t+ Labels without images count: {len(no_image_labels)}')

        # List labels and images without extensions
        labels = [splitext(filename)[0] for filename in listdir(label_dir_val)]
        images = [splitext(filename)[0] for filename in listdir(image_dir_val)]

        print(f'- Val: {len(images)} files')

        # Find images without labels & labels without images
        no_label_images = list(set(images) - set(labels))
        no_image_labels = list(set(labels) - set(images))

        if no_label_images or no_image_labels or len(images) != len(labels):
            diff_count += 1
            print(f'Gotcha {basename(image_dir_to_check)} (val)')
            print(f'\t+ Images count: {len(images)}')
            print(f'\t+ Labels count: {len(labels)}')
            print(f'\t+ Images without labels count: {len(no_label_images)}')
            print(f'\t+ Labels without images count: {len(no_image_labels)}')

        # List labels and images without extensions
        labels = [splitext(filename)[0] for filename in listdir(label_dir_test)]
        images = [splitext(filename)[0] for filename in listdir(image_dir_test)]

        print(f'- Test: {len(images)} files')

        # Find images without labels & labels without images
        no_label_images = list(set(images) - set(labels))
        no_image_labels = list(set(labels) - set(images))

        if no_label_images or no_image_labels or len(images) != len(labels):
            diff_count += 1
            print(f'Gotcha {basename(image_dir_to_check)} (test)')
            print(f'\t+ Images count: {len(images)}')
            print(f'\t+ Labels count: {len(labels)}')
            print(f'\t+ Images without labels count: {len(no_label_images)}')
            print(f'\t+ Labels without images count: {len(no_image_labels)}')

        # This line to separate each checking iteration
        print('')

    if diff_count != 0:
        print(f'Please check, {diff_count} diff(s) found')
    else:
        print('Good to go!')

Checking part-1
- Train: 11868 files
- Val: 1484 files
- Test: 1484 files

Checking part-2
- Train: 11869 files
- Val: 1484 files
- Test: 1484 files

Checking part-3
- Train: 11869 files
- Val: 1484 files
- Test: 1484 files

Good to go!
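
The three near-identical train/val/test blocks in the cell above could be collapsed into one loop. This is a sketch only, assuming each part folder contains `train`, `val` and `test` subfolders as created earlier; `compare_split` is a hypothetical helper, and because it compares sets it does not detect duplicate stems.

```python
from os import listdir
from os.path import join, splitext


def compare_split(image_dir, label_dir, splits=('train', 'val', 'test')):
    """Return {split: (images_without_labels, labels_without_images)} for mismatches.

    compare_split is a hypothetical helper; an empty dict means every split pairs up.
    """
    problems = {}
    for split in splits:
        # Compare file stems so that 0001.jpg matches 0001.txt
        images = {splitext(f)[0] for f in listdir(join(image_dir, split))}
        labels = {splitext(f)[0] for f in listdir(join(label_dir, split))}
        if images != labels:
            problems[split] = (sorted(images - labels), sorted(labels - images))
    return problems
```

It would be called once per part, for example `compare_split(image_dirs[i], label_dirs[i])`, with any non-empty result treated as a failure.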


## [S-Class] Batch zip folders & Move to Google Drive

In [None]:
# Org
new_dir = '/content/8th-VDAS-labels-3-zips'     # new dir to create
data_dir = '/content/8th-VDAS-labels-3-parts'    # dir with data to copy
zipname_prefix = '8th-VDAS-labels'

makedirs(new_dir, exist_ok=True)

def zipdir(path, ziph):
    length = len(path)

    # ziph is zipfile handle
    for root, dirs, files in walk(path):
        folder = root[length:]  # path without "parent"
        for file in files:
            ziph.write(join(root, file), join(folder, file))

i = 0   # Keep track of the ordinals when there's many
for root, dirs, files in walk(data_dir):
    if root == data_dir:
        for dirname in dirs:
            i += 1
            zip_name = new_dir + '/' + zipname_prefix + '-' + dirname + '.zip'
            zip_dir = root + '/' + dirname
            zipf = ZipFile(zip_name, 'w', ZIP_STORED)
            zipdir(zip_dir, zipf)
            zipf.close()
            print(f'{i}. Created {basename(zip_name)} with {len(zipf.namelist())} files')

1. Created 8th-VDAS-labels-part-1.zip with 14836 files
2. Created 8th-VDAS-labels-part-2.zip with 14837 files
3. Created 8th-VDAS-labels-part-3.zip with 14837 files


In [None]:
# Org
new_dir = '/content/8th-VDAS-images-3-zips'     # new dir to create
data_dir = '/content/8th-VDAS-images-3-parts'    # dir with data to copy
zipname_prefix = '8th-VDAS-images'

makedirs(new_dir, exist_ok=True)

def zipdir(path, ziph):
    length = len(path)

    # ziph is zipfile handle
    for root, dirs, files in walk(path):
        folder = root[length:]  # path without "parent"
        for file in files:
            ziph.write(join(root, file), join(folder, file))

i = 0   # Keep track of the ordinals when there's many
for root, dirs, files in walk(data_dir):
    if root == data_dir:
        for dirname in dirs:
            i += 1
            zip_name = new_dir + '/' + zipname_prefix + '-' + dirname + '.zip'
            zip_dir = root + '/' + dirname
            zipf = ZipFile(zip_name, 'w', ZIP_STORED)
            zipdir(zip_dir, zipf)
            zipf.close()
            print(f'{i}. Created {basename(zip_name)} with {len(zipf.namelist())} files')

1. Created 8th-VDAS-images-part-1.zip with 14836 files
2. Created 8th-VDAS-images-part-2.zip with 14837 files
3. Created 8th-VDAS-images-part-3.zip with 14837 files
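
As an alternative to closing the archive by hand, the zipping step can use a context manager. The sketch below is an illustration rather than the notebook's own code: `zip_folder` is a hypothetical name, and it builds archive names with `os.path.relpath` instead of slicing the path string.

```python
from os import walk
from os.path import join, relpath
from zipfile import ZIP_STORED, ZipFile


def zip_folder(src_dir, zip_path):
    """Store (no compression) every file under src_dir; return the file count."""
    with ZipFile(zip_path, 'w', ZIP_STORED) as zipf:  # closed even if a write fails
        for root, _, files in walk(src_dir):
            for name in files:
                full_path = join(root, name)
                # Archive names are relative to src_dir, e.g. 'train/0001.txt'
                zipf.write(full_path, relpath(full_path, src_dir))
        return len(zipf.namelist())
```

The archive should be written outside `src_dir`; otherwise the walk may pick up the partially written zip file.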


In [None]:
# Copy all zips to the desired Drive location
zip_dir = '/content/8th-VDAS-labels-3-zips'
drive_location = '/content/drive/MyDrive/yolo/4-datasets/8th-VDAS-3-2'

i = 0   # Keep track of the ordinals when there's many
for zip_file in listdir(zip_dir):
    i += 1
    zip_file_path = join(zip_dir, zip_file)
    !cp $zip_file_path $drive_location
    print(f'{i}. Copied {zip_file} to Drive')

1. Copied 8th-VDAS-labels-part-1.zip to Drive
2. Copied 8th-VDAS-labels-part-3.zip to Drive
3. Copied 8th-VDAS-labels-part-2.zip to Drive


In [None]:
# Copy all zips to the desired Drive location
zip_dir = '/content/8th-VDAS-images-3-zips'
drive_location = '/content/drive/MyDrive/yolo/4-datasets/8th-VDAS-3-2'

i = 0   # Keep track of the ordinals when there's many
for zip_file in listdir(zip_dir):
    i += 1
    zip_file_path = join(zip_dir, zip_file)
    !cp $zip_file_path $drive_location
    print(f'{i}. Copied {zip_file} to Drive')

1. Copied 8th-VDAS-images-part-3.zip to Drive
2. Copied 8th-VDAS-images-part-1.zip to Drive
3. Copied 8th-VDAS-images-part-2.zip to Drive
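
The same copy can be done in pure Python, which also allows a quick check after each file. This is a hedged sketch: `copy_zips` is a hypothetical helper, the size comparison is only a cheap sanity check (not a checksum), and sorting is lexicographic, so `part-10` would come before `part-2`.

```python
import shutil
from os import listdir, makedirs
from os.path import getsize, join


def copy_zips(zip_dir, dest_dir):
    """Copy every .zip in zip_dir to dest_dir; return the names copied, in sorted order."""
    makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(listdir(zip_dir)):  # sorted: ordinals follow the file names
        if not name.endswith('.zip'):
            continue
        src, dst = join(zip_dir, name), join(dest_dir, name)
        shutil.copy2(src, dst)
        if getsize(src) != getsize(dst):  # cheap sanity check, not a checksum
            raise IOError(f'Size mismatch after copying {name}')
        copied.append(name)
    return copied
```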


## [S-Class] Batch unzip

In [None]:
# Batch unzip
# !unzip $eachfile -d $new_dir

new_dir = '/content/MaskedFace-Net'     # new dir to create
data_dir = '/content/drive/MyDrive/yolo/4-datasets/MaskedFace-Net-20-parts-lbp'    # dir with data to copy

makedirs(new_dir, exist_ok=True)

i = 0   # Keep track of the ordinals when there's many
for root, dirs, files in walk(data_dir):
    for eachfile in files:
        i += 1
        eachfile = join(root, eachfile)  # join with root so files in subfolders resolve too
        !unzip -j $eachfile -d $new_dir
        total_files(new_dir)
total_files(new_dir)
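
Where shelling out to `unzip` is not wanted, a flat extraction can be sketched with the standard `zipfile` module. `extract_flat` is a hypothetical helper that behaves roughly like `unzip -j`: folder paths inside the archive are dropped, so entries sharing a base name overwrite each other.

```python
import shutil
from os import makedirs
from os.path import basename, join
from zipfile import ZipFile


def extract_flat(zip_path, dest_dir):
    """Extract all files from zip_path into dest_dir, dropping folder paths."""
    makedirs(dest_dir, exist_ok=True)
    extracted = 0
    with ZipFile(zip_path) as zipf:
        for info in zipf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            target = join(dest_dir, basename(info.filename))
            with zipf.open(info) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            extracted += 1
    return extracted
```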

## [S-Class] One-liner zip copy to Drive

In [None]:
dir_to_zip = '/content/7th-VDAS-labels'
zip_name = '/content/7th-VDAS-labels.zip'
drive_location = '/content/drive/MyDrive/yolo/4-datasets/7th-VDAS'
!cd $dir_to_zip && zip -r -0 $zip_name . && cp $zip_name $drive_location

Streaming output truncated to the last 5000 lines.
  adding: 284193,17791000d03014ec.txt (stored 0%)
  adding: 1180.txt (stored 0%)
  adding: 282555,11a967000ae27fc9a.txt (stored 0%)
  adding: crawl_234_6412_lbp.txt (stored 0%)
  adding: crawl_234_2447.txt (stored 0%)
  adding: 273278,bdefe000c3dd1516_lbp.txt (stored 0%)
  adding: 282555,61f73000d05d8111.txt (stored 0%)
  adding: crawl_234_7235.txt (stored 0%)
  adding: 283992,4484000a872eb12_lbp.txt (stored 0%)
  adding: crawl_234_599.txt (stored 0%)
  adding: WMAugmented_768_2887588.txt (stored 0%)
  adding: crawl_234_1763_lbp.txt (stored 0%)
  adding: unsplash_jacob-boavista-U3HQImjgJ1s-unsplash_face_0.txt (stored 0%)
  adding: crawl_234_5138_lbp.txt (stored 0%)
  adding: 273271,1ed14000c461b66c_lbp.txt (stored 0%)
  adding: crawl_234_4252.txt (stored 0%)
  adding: crawler2_02873.txt (stored 0%)
  adding: pexels_pexels-photo-4098787_face_0.txt (stored 0%)
  adding: 5305_lbp.txt (stored 0%)
  adding: 5675_lbp.txt (store

In [None]:
dir_to_zip = '/content/7th-VDAS-images'
zip_name = '/content/7th-VDAS-images.zip'
drive_location = '/content/drive/MyDrive/yolo/4-datasets/7th-VDAS'
!cd $dir_to_zip && zip -r -0 $zip_name . && cp $zip_name $drive_location

Streaming output truncated to the last 5000 lines.
  adding: WMAugmented_568_8983923.png (stored 0%)
  adding: 1298_lbp.jpg (stored 0%)
  adding: 0644_lbp.jpg (stored 0%)
  adding: 6423.png (stored 0%)
  adding: unsplash_jason-leung-YiHmRbEQx20-unsplash_face_0_lbp.jpg (stored 0%)
  adding: crawler2_02605_lbp.jpg (stored 0%)
  adding: 273278,e75b6000d1e66210.jpg (stored 0%)
  adding: 273278,12279c000ec07fe3a_lbp.jpg (stored 0%)
  adding: WMAugmented_332_5586599_lbp.png (stored 0%)
  adding: crawl_234_4258.jpg (stored 0%)
  adding: WMAugmented_605_2156800_lbp.png (stored 0%)
  adding: 283554,266f3000ced6e504.jpg (stored 0%)
  adding: WMAugmented_811_7007721_lbp.png (stored 0%)
  adding: 283554,2ca1900090b57840_lbp.jpg (stored 0%)
  adding: maksssksksss18.png (stored 0%)
  adding: FFHQ_30881.png (stored 0%)
  adding: 273278,fe9bc0007eb0c1fe.jpg (stored 0%)
  adding: WMAugmented_829_3136223.png (stored 0%)
  adding: crawler_00293.jpg (stored 0%)
  adding: 282555,cf0300009a95f

# II. Perform inference on images (a portion of large datasets)

## Step 1: Copy an amount of images to test

&ndash;&nbsp; Original copying (100 images)

In [None]:
# Directories
data_dir = '/content/images'
dest_dir = '/content/images-jpg'

# Create destination dir if not exists
makedirs(dest_dir, exist_ok=True)

# Copy up to copy_amount images to the destination folder
images = listdir(data_dir)
copy_amount = 30000
count = 0

for image in images:
    if count == copy_amount:
        break
    elif image.endswith(('.png', '.jpg', '.jpeg')):
        try:
            copy(join(data_dir, image), dest_dir)
            count += 1
        except Exception as e:
            print(e)
            break

actual_count = total_files2(dest_dir)   # actual num of files in destination dir

print(f'''
        Copied {count}/{copy_amount} images
        {actual_count} images in destination folder
''')

if count == copy_amount and count == actual_count:
    print('###     Transferred completed     ###')

24080
###     Transferred completed     ###
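
The filter-and-limit logic of the copy loop can be separated from the file copying itself. This is a small sketch under stated assumptions: `pick_images` and `IMAGE_EXTENSIONS` are hypothetical names, and the match is case-insensitive, which the loop above is not.

```python
from itertools import islice

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def pick_images(names, limit):
    """Return at most `limit` names that end with an image extension."""
    matches = (n for n in names if n.lower().endswith(IMAGE_EXTENSIONS))
    return list(islice(matches, limit))


print(pick_images(['a.jpg', 'notes.txt', 'b.PNG', 'c.jpeg', 'd.png'], 3))
```

The returned list can then be passed to the copy step, so the count check compares against `len()` of that list rather than a running counter.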


&ndash;&nbsp; Playground

In [None]:
print(glob.glob('/content/images/*.jpg'))

['/content/images/pexels_pexels-photo-4429411_face_0.jpg', '/content/images/pexels_pexels-charlotte-may-5965765_face_0.jpg', '/content/images/unsplash_katerina-kerdi-swIfqUbmu0o-unsplash_face_0.jpg', '/content/images/crawler_00447.jpg', '/content/images/crawler2_00577.jpg', '/content/images/16233_Mask_Mouth_Chin.jpg', '/content/images/unsplash_marina-abrosimova-kfviTQzoGYA-unsplash_face_0.jpg', '/content/images/unsplash_gabriella-clare-marino-2PEeh8fHeCk-unsplash_face_1.jpg', '/content/images/01296_Mask_Mouth_Chin.jpg', '/content/images/13462_Mask_Mouth_Chin.jpg', '/content/images/20527_Mask_Mouth_Chin.jpg', '/content/images/17267_Mask_Mouth_Chin.jpg', '/content/images/5053.jpg', '/content/images/27979_Mask_Mouth_Chin.jpg', '/content/images/unsplash_chands-nowland-IMV4W4dgYaE-unsplash_face_0.jpg', '/content/images/09494_Mask_Mouth_Chin.jpg', '/content/images/crawler2_02368.jpg', '/content/images/crawler2_01208.jpg', '/content/images/video1_frame_4459_face_2.jpg', '/content/images/1974.

In [None]:
print(str(len(glob.glob('/content/images/*.jpg'))))

24080


## Step 2: Clone YOLOv5

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt        # install dependencies

import torch
import glob
from IPython.display import Image, clear_output   # to display images

from yolov5 import utils
display = utils.notebook_init()          # setup checks

print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

YOLOv5 🚀 v6.1-32-gc13d4ce torch 1.10.0+cu111 CUDA:0 (Tesla T4, 15110MiB)


Setup complete ✅ (2 CPUs, 12.7 GB RAM, 39.9/166.8 GB disk)
Setup complete. Using torch 1.10.0+cu111 (Tesla T4)


## Step 3: Inference from pretrained weight


## Optional Step: View all inference results

In [None]:
import cv2
import glob

imdir = detect_folder          # assumes detect_folder (with trailing '/') is already defined
ext = ['png', 'jpg', 'gif']    # Add image formats here

files = []
for e in ext:
    files.extend(glob.glob(imdir + '*.' + e))

images = [cv2.imread(file) for file in files]

In [None]:
# Display the inference result images (first 100)

import glob
from IPython.display import Image, display

detect_folder = '/content/yolov5/runs/detect/exp3/'

count = 0

for imageName in os.listdir(detect_folder):  # edit the path as needed
    file_name = detect_folder + imageName
    display(Image(filename=file_name))
    print("\n")
    count += 1

    if count == 100: 
        break

## Step 4: Save detection results to Google Drive

In [None]:
%cd /content/obj-labels
!zip -r -0 /content/obj-labels-2510.zip .
!cp -r /content/obj-labels-2510.zip /content/drive/MyDrive/yolov5

Streaming output truncated to the last 5000 lines.
  adding: COCO_train2014_000000413115.txt (stored 0%)
  adding: COCO_train2014_000000491665.txt (stored 0%)
  adding: COCO_train2014_000000184315.txt (stored 0%)
  adding: COCO_train2014_000000188965.txt (stored 0%)
  adding: d_2509.txt (stored 0%)
  adding: rgbLineStop_1036.txt (stored 0%)
  adding: COCO_train2014_000000208838.txt (stored 0%)
  adding: COCO_train2014_000000271469.txt (stored 0%)
  adding: COCO_train2014_000000404004.txt (stored 0%)
  adding: COCO_train2014_000000147250.txt (stored 0%)
  adding: COCO_train2014_000000394518.txt (stored 0%)
  adding: COCO_train2014_000000217054.txt (stored 0%)
  adding: iara_2615.txt (stored 0%)
  adding: COCO_train2014_000000455874.txt (stored 0%)
  adding: rgbLineStop_1880.txt (stored 0%)
  adding: COCO_train2014_000000406508.txt (stored 0%)
  adding: COCO_train2014_000000572724.txt (stored 0%)
  adding: COCO_train2014_000000035110.txt (stored 0%)
  adding: COCO_train2014

In [None]:
# Zip 'detect' folder results and copy to Drive

# Folder to zip (change this to the correct path). Keep comments off the %cd line:
# the magic treats trailing text as part of the path.
%cd /content/yolov5/runs/detect/exp
!zip -r -0 /content/yolov5-detect-2210-200img.zip .
!cp -r /content/yolov5-detect-2210-200img.zip /content/drive/MyDrive/yolov5/detect

[Errno 2] No such file or directory: '/content/yolov5/runs/detect/exp # folder for zipping (change this to correct folder path)'
/content/yolov5
  adding: export.py (stored 0%)
  adding: detect.py (stored 0%)
  adding: runs/ (stored 0%)
  adding: runs/detect/ (stored 0%)
  adding: runs/detect/exp/ (stored 0%)
  adding: runs/detect/exp/img_130.png (stored 0%)
  adding: runs/detect/exp/img_95.png (stored 0%)
  adding: runs/detect/exp/img_71.png (stored 0%)
  adding: runs/detect/exp/img_69.png (stored 0%)
  adding: runs/detect/exp/img_84.png (stored 0%)
  adding: runs/detect/exp/img_180.png (stored 0%)
  adding: runs/detect/exp/img_68.png (stored 0%)
  adding: runs/detect/exp/img_46.png (stored 0%)
  adding: runs/detect/exp/img_122.png (stored 0%)
  adding: runs/detect/exp/img_166.png (stored 0%)
  adding: runs/detect/exp/img_36.png (stored 0%)
  adding: runs/detect/exp/img_187.png (stored 0%)
  adding: runs/detect/exp/img_18.png (stored 0%)
  adding: runs/detect/exp/img_35.png (stored 0%

In [None]:
# Flush pending writes from the VM's local cache to Google Drive, then unmount

from google.colab import drive
drive.flush_and_unmount()

# Playground

In [None]:
### Drive Mount workaround (min.)

!apt-get install -y -qq software-properties-common module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1 && apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
!mkdir -p /content/drive/MyDrive && google-drive-ocamlfuse /content/drive/MyDrive
%cd /content


In [None]:
# Experimental Code Cells

In [None]:
for root, dirs, files in os.walk(new_dir):
    if root == new_dir:
        for each_dir in dirs:
            total_files(os.path.join(new_dir, each_dir))

In [None]:
# Mount Google Drive

from google.colab import drive
drive.mount('/content/drive') # the '_' is intentional

# Imports

import os, random, shutil, timeit, zipfile, glob
import numpy as np
from numpy.random import RandomState
!wget -q https://raw.githubusercontent.com/ikonikon/fast-copy/master/fast-copy.py

In [None]:
a = 0
if 1 == 1:
    %%script stop_code_here
    a += 1
else:
    print('nope')

print(a)

UsageError: Line magic function `%script` not found (But cell magic `%%script` exists, did you mean that instead?).


In [None]:
import os
a = '/content/sample_data'

if (os.path.exists(a)):
    !cd $a && zip -r /content/output.zip .
else:
    print('nope')

updating: README.md (deflated 42%)
updating: anscombe.json (deflated 83%)
updating: california_housing_test.csv (deflated 76%)
updating: mnist_train_small.csv (deflated 88%)
updating: california_housing_train.csv (deflated 79%)
updating: mnist_test.csv (deflated 88%)


In [None]:
!df -h
!cat /proc/cpuinfo
!cat /proc/meminfo

In [None]:
import multiprocessing

cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

2