# Dataset creator Notebook

>This notebook downloads a chosen number of images for specific classes from the Open Images dataset. Then, it transforms this set into `.tfrecord` format so that it can be used with the TensorFlow Object Detection API.
>
>The downloaded set can be saved into the mounted Drive folder.

## 1. Prepare the environment

### Clone MISC repos with Git

In [2]:
!git clone https://github.com/EscVM/OIDv4_ToolKit.git

Cloning into 'OIDv4_ToolKit'...
remote: Enumerating objects: 422, done.
remote: Total 422 (delta 0), reused 0 (delta 0), pack-reused 422
Receiving objects: 100% (422/422), 34.08 MiB | 33.30 MiB/s, done.
Resolving deltas: 100% (146/146), done.


### Install Object Detection module

In [3]:
!pip3 install tensorflow-object-detection-api

Collecting tensorflow-object-detection-api
  Downloading https://files.pythonhosted.org/packages/4e/11/7f6d3c5c4b603cc40b2813059779afb641bd5eb68045c62ca520bfce0359/tensorflow_object_detection_api-0.1.1.tar.gz (577kB)
     |████████████████████████████████| 583kB 2.7MB/s 
Collecting twine
  Downloading https://files.pythonhosted.org/packages/ad/db/b2c65078b783c6694bdfa0911bbbe0e2be7fcbc98ff23a99b8be544906b6/twine-3.2.0-py3-none-any.whl
Collecting pkginfo>=1.4.2
  Downloading https://files.pythonhosted.org/packages/e6/d5/451b913307b478c49eb29084916639dc53a88489b993530fed0a66bab8b9/pkginfo-1.5.0.1-py2.py3-none-any.whl
Collecting keyring>=15.1
  Downloading https://files.pythonhosted.org/packages/e4/ed/7be20815f248b0d6aae406783c2bee392640924623c4e17b50ca90c7f74d/keyring-21.4.0-py3-none-any.whl
Collecting colorama>=0.4.3
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collec

### Import required dependencies

In [4]:
import imageio
import zipfile
import shutil
import os
import csv
import tensorflow as tf
import pandas as pd
import contextlib2
import gc
from PIL import Image
from google.colab import drive
from pathlib import Path
from datetime import datetime
from object_detection.utils import dataset_util

### Define paths

In [71]:
rootPath = '/content/drive/My Drive/Machine Learning/Stolen vehicle detection'
dataPath = rootPath + '/data'

localPath = '/content'
rootOIDv4Path = localPath + '/OIDv4_ToolKit'
datasetRootPath = rootOIDv4Path + '/OID'
generatedDatasetPath = localPath + '/generated'
recordsPath = generatedDatasetPath + '/records'
csvPath = generatedDatasetPath + '/csv'
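
>The same local layout can also be composed with `pathlib`, which the notebook already imports. This is only a sketch, assuming POSIX-style paths as on Colab; the lowercase names are illustrative and are not used elsewhere in the notebook.

```python
from pathlib import PurePosixPath

# Same layout as above, built with the `/` operator instead of string concatenation
local_path = PurePosixPath('/content')
root_oidv4_path = local_path / 'OIDv4_ToolKit'
dataset_root_path = root_oidv4_path / 'OID'
generated_dataset_path = local_path / 'generated'
records_path = generated_dataset_path / 'records'
csv_path = generated_dataset_path / 'csv'

print(records_path)  # /content/generated/records
```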

### Install OIDv4 requirements

In [6]:
os.chdir(rootOIDv4Path)
!pip3 install -r requirements.txt

Collecting awscli
  Downloading https://files.pythonhosted.org/packages/d1/68/511b344c5d0a4ca99477c1c1aaf7a3c20fd65c16fc76e2602c90b5d758fe/awscli-1.18.153.tar.gz (1.3MB)
     |████████████████████████████████| 1.3MB 2.8MB/s 
Collecting botocore==1.18.12
  Downloading https://files.pythonhosted.org/packages/14/d5/9d0db656de5bb8c431232d5976526ca5b6ac1e4563f95ce6cd842bbccff4/botocore-1.18.12-py2.py3-none-any.whl (6.7MB)
     |████████████████████████████████| 6.7MB 12.8MB/s 
Collecting docutils<0.16,>=0.10
  Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
     |████████████████████████████████| 552kB 40.0MB/s 
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl (69kB)
     |█████████████████

### Mount Google Drive to this Notebook instance
>As the dataset was prepared earlier and uploaded to Google Drive, the model building and training process will be done there, not locally.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

os.chdir(localPath)
# Show current directory
!pwd

Mounted at /content/drive
/content


## 2. Get the data
>There are three ways to get the data.

#### A: extract a previously downloaded raw dataset from Drive

In [8]:
# Run it only when the dataset is not yet extracted
'''
zipRef = zipfile.ZipFile(dataPath + "/dataset.zip", 'r')
zipRef.extractall(localPath + "/data")
zipRef.close()
print('Dataset extraction successful')
'''

Dataset extraction successful


#### B: extract a previously downloaded raw dataset placed under `/content`

In [None]:
# Run it only when the dataset is not yet extracted 
'''
zipRef = zipfile.ZipFile("/content/dataset.zip", 'r')
zipRef.extractall(localPath + "/data")
zipRef.close()
print('Dataset extraction successful')
'''

Dataset extraction successful
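
>Instead of commenting the extraction cell in and out, the "run only once" rule can be checked in code. The helper below is a hypothetical sketch (`extract_if_missing` is not part of the notebook); it assumes that a non-empty destination directory means the dataset has already been extracted.

```python
import zipfile
from pathlib import Path

def extract_if_missing(zip_path, dest_dir):
    """Extract zip_path into dest_dir unless dest_dir already has content.

    Returns True if the archive was extracted, False if the step was skipped.
    """
    dest = Path(dest_dir)
    if dest.is_dir() and any(dest.iterdir()):
        return False  # assumed to be extracted already, nothing to do
    dest.mkdir(parents=True, exist_ok=True)
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(dest)
    return True
```

>For option A this would be called as `extract_if_missing(dataPath + "/dataset.zip", localPath + "/data")`.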


#### C: download from Open Image Dataset

##### Show available OIDv4 commands

In [7]:
os.chdir(rootOIDv4Path)
!python3 main.py -h

usage: main.py [-h] [--Dataset /path/to/OID/csv/] [-y]
               [--classes list of classes [list of classes ...]]
               [--type_csv 'train' or 'validation' or 'test' or 'all']
               [--sub Subset of human verified images or machine generated h or m)]
               [--image_IsOccluded 1 or 0] [--image_IsTruncated 1 or 0]
               [--image_IsGroupOf 1 or 0] [--image_IsDepiction 1 or 0]
               [--image_IsInside 1 or 0] [--multiclasses 0 (default or 1]
               [--n_threads [default 20]] [--noLabels]
               [--limit integer number]
               <command> 'downloader', 'visualizer' or 'ill_downloader'.

Open Image Dataset Downloader

positional arguments:
  <command> 'downloader', 'visualizer' or 'ill_downloader'.
                        'downloader', 'visualizer' or 'ill_downloader'.

optional arguments:
  -h, --help            show this help message and exit
  --Dataset /path/to/OID/csv/
                        Directory of the OID da

##### Download train, test, validation sets

In [None]:
!python3 main.py downloader --classes "Vehicle registration plate" --type_csv train --multiclasses 0 --limit 5368

		   ___   _____  ______            _    _    
		 .'   `.|_   _||_   _ `.         | |  | |   
		/  .-.  \ | |    | | `. \ _   __ | |__| |_  
		| |   | | | |    | |  | |[ \ [  ]|____   _| 
		\  `-'  /_| |_  _| |_.' / \ \/ /     _| |_  
		 `.___.'|_____||______.'   \__/     |_____|

             _____                    _                 _             
            (____ \                  | |               | |            
             _   \ \ ___  _ _ _ ____ | | ___   ____  _ | | ____  ____ 
            | |   | / _ \| | | |  _ \| |/ _ \ / _  |/ || |/ _  )/ ___)
            | |__/ / |_| | | | | | | | | |_| ( ( | ( (_| ( (/ /| |    
            |_____/ \___/ \____|_| |_|_|\___/ \_||_|\____|\____)_|    
    [INFO] | Downloading Vehicle registration plate.
   [ERROR] | Missing the class-descriptions-boxable.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...

In [None]:
!python3 main.py downloader --classes "Vehicle registration plate" --type_csv test --multiclasses 0 --limit 1113

    [INFO] | Downloading Vehicle registration plate.
   [ERROR] | Missing the test-annotations-bbox.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...100%,

In [None]:
!python3 main.py downloader --classes "Vehicle registration plate" --type_csv validation --multiclasses 0 --limit 386

    [INFO] | Downloading Vehicle registration plate.
   [ERROR] | Missing the validation-annotations-bbox.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
..

##### Save the downloaded set to Drive

In [None]:
shutil.make_archive("/content/dataset", 'zip', datasetRootPath)
shutil.move("/content/dataset.zip", dataPath + "/dataset.zip")
print(f'Dataset successfully zipped & moved to: {dataPath}')

Dataset successfully zipped & moved to: /content/drive/My Drive/Machine Learning/Stolen vehicle detection/data


## 3. Create supplementary files
>Create a `txt` file containing the class names, and `tfrecord`s containing the images and their annotations.

### Define class names to process
>The order of the classes matters: each class ID is derived from its position in this list, starting at 1.

In [74]:
classNames = ["Vehicle registration plate", "Vehicle"]
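
>The effect of the ordering can be seen in a small sketch. `class_ids` is a hypothetical helper that mirrors the `idx + 1` numbering used later in this notebook; swapping the two names swaps their IDs, so the class list used for training has to match the one used here.

```python
def class_ids(class_names):
    """Map each class name to a 1-based ID that follows the list order."""
    return {name: idx + 1 for idx, name in enumerate(class_names)}

original = class_ids(["Vehicle registration plate", "Vehicle"])
swapped = class_ids(["Vehicle", "Vehicle registration plate"])

print(original)  # {'Vehicle registration plate': 1, 'Vehicle': 2}
print(swapped)   # {'Vehicle': 1, 'Vehicle registration plate': 2}
```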

### Create class names file

#### Create class names `txt` file

In [75]:
def createClassTxt(destinationDir, classNames):

  # Build the full path instead of changing the working directory
  classesPath = os.path.join(destinationDir, "classes.txt")

  # One class name per line, in the given order, without a trailing newline
  content = "\n".join(classNames)

  with open(classesPath, "w") as f:
      f.write(content)

  print("Class names txt file creation successful. Destination: %s" % classesPath)

#### Create class names `pbtxt` file

In [76]:
def createClassPbtxt(destinationDir, classNames):

  # Build the full path instead of changing the working directory
  classesPath = os.path.join(destinationDir, "classes.pbtxt")

  # One `item` block per class; IDs start at 1 and follow the list order
  content = "\n\n".join(
      f"item {{\n    id: {i + 1}\n    name: '{className}'\n }}"
      for i, className in enumerate(classNames)
  )

  with open(classesPath, "w") as f:
      f.write(content)

  print("Class names pbtxt file creation successful. Destination: %s" % classesPath)
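
>To check that a generated label map agrees with the class order, its text can be parsed back into a name-to-ID mapping. Both helpers below are hypothetical and only illustrate the idea: `render_label_map` reproduces the item layout written above, and `parse_label_map` relies on a simple regular expression that assumes exactly this layout, not the full `pbtxt` syntax.

```python
import re

def render_label_map(class_names):
    """Render label map text in the same item layout as createClassPbtxt."""
    items = [
        f"item {{\n    id: {idx + 1}\n    name: '{name}'\n }}"
        for idx, name in enumerate(class_names)
    ]
    return "\n\n".join(items)

def parse_label_map(text):
    """Recover {name: id} pairs from label map text written in that layout."""
    pairs = re.findall(r"id:\s*(\d+)\s*name:\s*'([^']*)'", text)
    return {name: int(class_id) for class_id, name in pairs}

label_map_text = render_label_map(["Vehicle registration plate", "Vehicle"])
print(parse_label_map(label_map_text))
# {'Vehicle registration plate': 1, 'Vehicle': 2}
```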

### Create the `tfrecord` files
>These files are the inputs of the Tensorflow Object Detection API models. 

#### Create one `tfrecord` from the input parameters

In [77]:
def createRecord(path, imageId, mustPresentClasses, classes, imageAnnotations):
  '''
  Creates the tfrecord element from the input.

  Args:
    path: path of the image
    imageId: ID of the current image to create the record for
    mustPresentClasses: list of class names that must all be present (returns None otherwise)
    classes: dict mapping class names to label IDs; matching annotations are added to the record
    imageAnnotations: dataframe containing the bounding boxes of the image

  Returns:
    A tf.train.Example, or None if a required class is missing
  '''

  # Skip the image before reading it if any required class is missing
  for mustClass in mustPresentClasses:

    if(mustClass not in imageAnnotations['LabelName'].values):
      return None

  # Close the image and file handles as soon as the data has been read
  with Image.open(path) as image:
    imgWidth, imgHeight = image.size
  with tf.io.gfile.GFile(path, 'rb') as imgFile:
    imgData = imgFile.read()

  xmins = []
  xmaxs = []
  ymins = []
  ymaxs = []
  classesText = []
  classesInt = []

  for _, row in imageAnnotations.loc[imageAnnotations['LabelName'].isin(classes.keys())].iterrows():

      xmins.append(row['XMin'])
      xmaxs.append(row['XMax'])
      ymins.append(row['YMin'])
      ymaxs.append(row['YMax'])
      classesText.append(row['LabelName'].encode('utf8'))
      classesInt.append(classes[row['LabelName']])

  tFRecord = tf.train.Example(features=tf.train.Features(feature={
      'image/height': dataset_util.int64_feature(imgHeight),
      'image/width': dataset_util.int64_feature(imgWidth),
      'image/filename': dataset_util.bytes_feature(imageId.encode('utf8')),
      'image/source_id': dataset_util.bytes_feature(imageId.encode('utf8')),
      'image/encoded': dataset_util.bytes_feature(imgData),
      'image/format': dataset_util.bytes_feature(b'jpg'),
      'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
      'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
      'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
      'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
      'image/object/class/text': dataset_util.bytes_list_feature(classesText),
      'image/object/class/label': dataset_util.int64_list_feature(classesInt)
  }))

  return tFRecord
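
>The box coordinates in the Open Images annotations are normalized to the [0, 1] range, and `createRecord` writes them to the record unchanged. For a visual sanity check they can be converted to pixel coordinates. `to_pixel_box` is a hypothetical helper, not part of the notebook, and the image size in the example is made up.

```python
def to_pixel_box(xmin, xmax, ymin, ymax, img_width, img_height):
    """Convert normalized [0, 1] box coordinates to integer pixel coordinates."""
    for value in (xmin, xmax, ymin, ymax):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"coordinate {value} is not normalized")
    return (
        round(xmin * img_width),
        round(xmax * img_width),
        round(ymin * img_height),
        round(ymax * img_height),
    )

print(to_pixel_box(0.25, 0.75, 0.5, 1.0, 1024, 768))  # (256, 768, 384, 768)
```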

#### Create a `csv` row from the input parameters

##### Encode list elements as a row

In [78]:
def appendListAsRow(fileName, row):

    # Open file in append mode
    with open(fileName, 'a+', newline='') as writeObj:

      # Create a writer object from csv module
      writer = csv.writer(writeObj)

      # Add contents of the list as the last row in the csv file
      writer.writerow(row)
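
>Appending row by row reopens the file for every row. When all rows of a file are known up front, one alternative is to write them in a single pass. The sketch below is hypothetical (`write_rows` is not used by the notebook) and, unlike the append approach, it overwrites an existing file.

```python
import csv

def write_rows(file_name, header, rows):
    """Write a header and all rows with a single open() call."""
    with open(file_name, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(rows)
```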

##### Create `csv` rows of an image

In [79]:
def createCsvRowsOfImage(imgPath, imageId, classes, annotations):

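  # Only the image size is needed here; close the file right after reading it
  with Image.open(imgPath) as image:
    imgWidth, imgHeight = image.size
  imgData = 1  # placeholder: the encoded image bytes are not written to the csv
  extension = "jpg"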

  imageAnnotations = annotations.loc[annotations['ImageID'] == imageId]

  CsvRows = []

  for _, row in imageAnnotations.loc[imageAnnotations['LabelName'].isin(classes.keys())].iterrows():

    CsvRow = []

    CsvRow.append(imgHeight)
    CsvRow.append(imgWidth)
    CsvRow.append(imageId)
    CsvRow.append(imageId)
    CsvRow.append(imgData)
    CsvRow.append(extension)
    CsvRow.append(row['XMin'])
    CsvRow.append(row['XMax'])
    CsvRow.append(row['YMin'])
    CsvRow.append(row['YMax'])
    CsvRow.append(row['LabelName'])  # keep as str: csv would write bytes as "b'...'"
    CsvRow.append(classes[row['LabelName']])

    CsvRows.append(CsvRow)

  return CsvRows

#### Add `csv` file header

In [80]:
def csvAddHeader(fileName):

  headerRow = []

  headerRow.append('imgHeight')
  headerRow.append('imgWidth')
  # Distinct names for the two ID columns, mirroring 'image/filename' and 'image/source_id'
  headerRow.append('filename')
  headerRow.append('sourceId')
  headerRow.append('imgData')
  headerRow.append('extension')
  headerRow.append('XMin')
  headerRow.append('XMax')
  headerRow.append('YMin')
  headerRow.append('YMax')
  headerRow.append('LabelName')
  headerRow.append('LabelId')

  appendListAsRow(fileName, headerRow)

#### Generate the `tfrecord` files from the provided images & annotations
>`numShards` determines how many shard files each subset is split into; splitting speeds up processing.

In [81]:
def open_sharded_tfrecord_output(exit_stack, base_path, num_shards):
  """Opens all TFRecord shards for writing and adds them to an exit stack.

  Args:
    exit_stack: A contextlib2.ExitStack used to automatically close the TFRecords
      opened in this function.
    base_path: The base path for all shards
    num_shards: The number of shards

  Returns:
    The list of opened TFRecords. Position k in the list corresponds to shard k.
  """
  tf_record_output_filenames = [
      '{}-{:05d}-of-{:05d}'.format(base_path, idx, num_shards)
      for idx in range(num_shards)
  ]

  tfrecords = [
      exit_stack.enter_context(tf.io.TFRecordWriter(file_name))
      for file_name in tf_record_output_filenames
  ]

  return tfrecords

In [82]:
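def open_sharded_csv_output(csvDir, numShards):
  """Returns the csv shard file names; the files are only opened later, when rows are appended.

  Despite its name, `csvDir` is the base path (directory plus subset name) of the shards.
  """
  return [
      '{}-{:05d}-of-{:05d}.csv'.format(csvDir, i, numShards)
      for i in range(numShards)
  ]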
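
>Both functions above use the same `base-00000-of-0000N` naming scheme. The sketch below shows the resulting names; `shard_names` and its `suffix` parameter are hypothetical and only combine the two format strings for illustration.

```python
def shard_names(base_path, num_shards, suffix=''):
    """Return shard file names in the `base-00000-of-0000N` pattern."""
    return [
        '{}-{:05d}-of-{:05d}{}'.format(base_path, idx, num_shards, suffix)
        for idx in range(num_shards)
    ]

print(shard_names('trainDataset.tfrecord', 3))
# ['trainDataset.tfrecord-00000-of-00003', 'trainDataset.tfrecord-00001-of-00003', 'trainDataset.tfrecord-00002-of-00003']
print(shard_names('train', 1, suffix='.csv'))
# ['train-00000-of-00001.csv']
```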

In [83]:
def tfrecordGenerator(
    classesFile,
    classDescriptionsFile,
    annotationsFile,
    imagesDir,
    outputFile,
    numShards,
    csvNeeded,
    csvDir
):

    # Read the class names with a context manager so the file handle is closed
    with open(classesFile) as f:
      classes = list(filter(None, f.read().split('\n')))
    classes = {name: idx + 1 for idx, name in enumerate(classes)}
    nameMustPresentClasses = ["Vehicle registration plate"]

    classDescriptions = {row[0]: row[1] for _, row in pd.read_csv(classDescriptionsFile, header=None).iterrows()}

    annotations = pd.read_csv(annotationsFile)
    annotations['ImageID'] = annotations['ImageID'].astype(str)
    annotations['LabelName'] = annotations['LabelName'].map(lambda n: classDescriptions[n])

    #list of labels to rename to Vehicle
    toReplace = ["Car", "Airplane", "Helicopter", "Boat", "Motorcycle", "Bus", "Taxi", "Truck", "Ambulance"]

    #generate list of labels to keep
    toKeep = toReplace.copy()
    for item in nameMustPresentClasses:
      toKeep.append(item)
    #keep only labels in toKeep (copy so that the later assignment does not act on a view)
    annotations = annotations[annotations['LabelName'].isin(toKeep)].copy()

    #change toReplace labels to Vehicle
    annotations.loc[annotations['LabelName'].isin(toReplace), 'LabelName'] = "Vehicle"

    images = tf.io.gfile.glob(imagesDir + '/*/*.jpg')
    images = map(lambda i: (os.path.basename(i).split('.jpg')[0], i), images)
    images = dict(images)
    print(f'{len(images)} images found')

    with contextlib2.ExitStack() as tfRecordCloseStack:

      outputRecords = open_sharded_tfrecord_output(tfRecordCloseStack, outputFile, numShards)
      outputCsvs = open_sharded_csv_output(csvDir, numShards)

      index = 0

      if csvNeeded:
        for filePath in outputCsvs: 
          csvAddHeader(filePath)

      for imageId, path in images.items():

        imageAnnotations = annotations.loc[annotations['ImageID'] == imageId]
        TFRecord = createRecord(path, imageId, nameMustPresentClasses, classes, imageAnnotations)

        if(TFRecord is None):
          continue

        outputShardIndex = index % numShards
        outputRecords[outputShardIndex].write(TFRecord.SerializeToString())

        if csvNeeded:
          imageRows = createCsvRowsOfImage(path, imageId, classes, annotations)
          for row in imageRows:
            appendListAsRow(outputCsvs[outputShardIndex], row)

        index += 1

    print('TFRecords has been successfully created')
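
>`tfrecordGenerator` filters the full annotations dataframe once per image, which becomes expensive as the number of images grows. One possible alternative is to group the annotations by image ID once and then look each image up directly. The sketch below illustrates the idea with plain dictionaries and made-up rows; it is not the notebook's implementation, and with pandas a `groupby('ImageID')` plays a similar role.

```python
from collections import defaultdict

def index_by_image(rows):
    """Group annotation rows by their 'ImageID' in a single pass."""
    by_image = defaultdict(list)
    for row in rows:
        by_image[row['ImageID']].append(row)
    return by_image

# Made-up annotation rows, for illustration only
rows = [
    {'ImageID': 'a1', 'LabelName': 'Vehicle'},
    {'ImageID': 'a1', 'LabelName': 'Vehicle registration plate'},
    {'ImageID': 'b2', 'LabelName': 'Vehicle'},
]
by_image = index_by_image(rows)
print(len(by_image['a1']))  # 2
```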

## 4. Generate dataset
>The generated `tfrecord` files need to be zipped and saved to the Drive directory.

#### Create output directories

In [84]:
# exist_ok=True makes the calls safe to re-run when the directories already exist
os.makedirs(recordsPath, exist_ok=True)
os.makedirs(csvPath, exist_ok=True)

#### Generate train, test, validation `tfrecord` files
>`subsetTypes` contains the subset names, and `numShards` contains the number of shards each subset is split into, in the same order. More than a few thousand records in one `tfrecord` file dramatically slows down the process. `numShards=1` means the subset is not sharded at all.

In [85]:
# if the data is newly downloaded, use this:
#dataSourcePath = datasetRootPath
# if the data has been extracted, use this:
dataSourcePath = localPath + "/data"

createClassTxt(dataSourcePath, classNames)
createClassPbtxt(recordsPath, classNames)

# train, test, validation subsets
subsetTypes = ['train', 'test', 'validation']
numShards = [10, 3, 1]

csvNeeded = True
csvDirName = ""

for subsetType, numShard in zip(subsetTypes, numShards):

  print(f'***Generating {subsetType} tfrecord files [sharded into {numShard} piece(s)]***')

  classesFile = dataSourcePath + '/classes.txt'
  classDescriptionsFile = dataSourcePath + '/csv_folder/class-descriptions-boxable.csv'
  annotationsFile = dataSourcePath + '/csv_folder/' + subsetType + '-annotations-bbox.csv'
  imagesDir = dataSourcePath + '/Dataset/' + subsetType
  outputFile = recordsPath + '/' + subsetType + 'Dataset.tfrecord'
  csvDirName = csvPath + '/' + subsetType

  tfrecordGenerator(
      classesFile,
      classDescriptionsFile,
      annotationsFile,
      imagesDir,
      outputFile,
      numShard,
      csvNeeded,
      csvDirName
  )

  print('')

Class names txt file creation successful. Destination: /content/data/classes.txt
Class names pbtxt file creation successful. Destination: /content/generated/records/classes.pbtxt
***Generating train tfrecord files [sharded into 10 piece(s)]***
5368 images found
TFRecords has been successfully created

***Generating test tfrecord files [sharded into 3 piece(s)]***
1113 images found
TFRecords has been successfully created

***Generating validation tfrecord files [sharded into 1 piece(s)]***
386 images found
TFRecords has been successfully created
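
>Records are assigned to shards round-robin through `index % numShards`, so shard sizes differ by at most one record. The sketch below counts the records per shard for a made-up record count; `shard_sizes` is a hypothetical helper. The number of records actually written can be lower than the number of images found, because images without a required class are skipped.

```python
def shard_sizes(num_records, num_shards):
    """Count how many records each shard receives under `index % num_shards`."""
    sizes = [0] * num_shards
    for index in range(num_records):
        sizes[index % num_shards] += 1
    return sizes

print(shard_sizes(25, 10))  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```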



### Zip & save `tfrecord` files to Drive

In [86]:
outputFileName = "datasetRecords"
os.chdir(localPath)
shutil.make_archive(outputFileName, 'zip', recordsPath)
shutil.move(localPath + '/' + outputFileName + '.zip', dataPath)

'/content/drive/My Drive/Machine Learning/Stolen vehicle detection/data/datasetRecords.zip'

### Zip & save `csv` files to Drive

In [87]:
outputFileName = "datasetCsvs"
os.chdir(localPath)
shutil.make_archive(outputFileName, 'zip', csvPath)
shutil.move(localPath + '/' + outputFileName + '.zip', dataPath)

'/content/drive/My Drive/Machine Learning/Stolen vehicle detection/data/datasetCsvs.zip'
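
>`shutil.make_archive` returns the path of the archive it creates, so the contents can be listed before the file is moved to Drive. A minimal sketch, assuming the archive is written outside the directory being zipped; `zip_and_list` is a hypothetical helper.

```python
import shutil
import zipfile

def zip_and_list(source_dir, archive_base):
    """Zip source_dir into archive_base + '.zip' and return the sorted member names."""
    archive_path = shutil.make_archive(archive_base, 'zip', source_dir)
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())
```

>For the record files this would be called as `zip_and_list(recordsPath, localPath + '/datasetRecords')`.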

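### How to use the sharded dataset
>In the model's pipeline configuration, `input_path` can reference all shards at once by replacing the five-digit shard index with `?` wildcards.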

In [None]:
tf_record_input_reader {
  input_path: "/path/to/trainDataset.tfrecord-?????-of-00010"
}
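
>The wildcard pattern can be checked against the generated shard names before training. The sketch below uses Python's `fnmatch` as a stand-in; that the input reader resolves `?` with the same single-character rule is an assumption here.

```python
from fnmatch import fnmatch

pattern = 'trainDataset.tfrecord-?????-of-00010'
names = ['trainDataset.tfrecord-{:05d}-of-{:05d}'.format(i, 10) for i in range(10)]

# Every shard written with numShards=10 matches the pattern
print(all(fnmatch(name, pattern) for name in names))  # True
# Shards written with a different shard count do not match
print(fnmatch('trainDataset.tfrecord-00000-of-00003', pattern))  # False
```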

## Force garbage collection

In [None]:
gc.collect()

759