# Dataset creator Notebook

>This notebook downloads a chosen number of images for specific classes from the Open Images dataset, then converts them to `.tfrecord` format for use with the TensorFlow Object Detection API.
>
>The downloaded set can be saved into the mounted Drive folder.

## 1. Prepare the environment

### Clone MISC repos with Git

In [1]:
!git clone https://github.com/EscVM/OIDv4_ToolKit.git

Cloning into 'OIDv4_ToolKit'...
remote: Enumerating objects: 422, done.
remote: Total 422 (delta 0), reused 0 (delta 0), pack-reused 422
Receiving objects: 100% (422/422), 34.08 MiB | 16.45 MiB/s, done.
Resolving deltas: 100% (146/146), done.


In [0]:
!git clone --q https://github.com/tensorflow/models.git

### Prepare and test Object Detection module

In [0]:
import os
os.chdir('/content/models/research')

# compiling the proto buffers - more about them here: https://developers.google.com/protocol-buffers/
!protoc object_detection/protos/*.proto --python_out=.

# extend the PYTHONPATH environment variable with the research and slim folders' paths
os.environ['PYTHONPATH'] += ':/content/models/research/:/content/models/research/slim/'
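
>The `+=` above raises a `KeyError` when `PYTHONPATH` is not already set, which can happen outside Colab. Below is a minimal sketch of a more defensive variant. The helper name `extendPythonPath` and its `env` parameter are illustrative assumptions, not part of the original notebook.

```python
import os

def extendPythonPath(newPaths, env=None):
    """Append paths to PYTHONPATH without failing when the variable is unset."""
    env = os.environ if env is None else env
    # Split the current value and drop empty entries
    current = [p for p in env.get('PYTHONPATH', '').split(os.pathsep) if p]
    for path in newPaths:
        if path not in current:
            current.append(path)
    env['PYTHONPATH'] = os.pathsep.join(current)
    return env['PYTHONPATH']

# Demonstration on a plain dict so the real environment is left untouched
demoEnv = {}
print(extendPythonPath(
    ['/content/models/research/', '/content/models/research/slim/'],
    demoEnv
))
```

>Calling `extendPythonPath([...])` without the second argument would modify `os.environ` itself.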

In [4]:
# test the model builder
!python3 object_detection/builders/model_builder_test.py

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Running tests under Python 3.6.9: /usr/bin/python3
[ RUN      ] ModelBuilderTest.test_create_experimental_model
[       OK ] ModelBuilderTest.test_create_experimental_model
[ RUN      ] ModelBuilderTest.test_create_faster_rcnn_model_from_config_with_example_miner
[       OK ] ModelBuilderTest.test_create_faster_rcnn_model_from_config_with_example_miner
[ RUN      ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_with_matmul
[       OK ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_with_matmul
[ RUN      ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_wi

### Import required dependencies

In [0]:
import imageio
import zipfile
import shutil
import csv
import tensorflow as tf
import pandas as pd
import contextlib2
import gc
from PIL import Image
from google.colab import drive
from pathlib import Path
from datetime import datetime
from object_detection.utils import dataset_util
from object_detection.dataset_tools import tf_record_creation_util

### Define paths

In [0]:
rootPath = '/content/drive/My Drive/Machine Learning/License plate detection'
dataPath = rootPath + '/data'

localPath = '/content'
rootOIDv4Path = localPath + '/OIDv4_ToolKit'
datasetRootPath = rootOIDv4Path + '/OID'
generatedDatasetPath = localPath + '/generated'
recordsPath = generatedDatasetPath + '/records'
csvPath = generatedDatasetPath + '/csv'

if not os.path.exists(recordsPath):
  os.makedirs(recordsPath)

if not os.path.exists(csvPath):
  os.makedirs(csvPath)

### Mount Google Drive to this Notebook instance
>The dataset was prepared earlier and uploaded to Google Drive, so model building and training will be done there, not locally.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

os.chdir(localPath)
# Show current directory
!pwd

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
/content


### Install OIDv4 requirements

In [8]:
os.chdir(rootOIDv4Path)
!pip3 install -r requirements.txt

Collecting awscli
  Downloading https://files.pythonhosted.org/packages/58/ad/891570946642748fab392d770197ac7c2cfdec93e87a27bf3bbfa864694f/awscli-1.18.16-py2.py3-none-any.whl (3.0MB)
     |████████████████████████████████| 3.0MB 8.9MB/s
Collecting botocore==1.15.16
  Downloading https://files.pythonhosted.org/packages/0d/fb/6b3e135e09ec38051d903f1ff6509309a5731e57c95dae2daeb0b70cf52c/botocore-1.15.16-py2.py3-none-any.whl (5.9MB)
     |████████████████████████████████| 6.0MB 59.1MB/s
Collecting colorama<0.4.4,>=0.2.5; python_version != "3.4"
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting rsa<=3.5.0,>=3.1.2
  Downloading https://files.pythonhosted.org/packages/e1/ae/baedc9cb175552e95f3395c43055a6a5e125ae4d48a1d7a924baca83e92e/rsa-3.4.2-py2.py3-none-any.whl (46kB)
     |████████████████████████████████| 51kB 9.0MB/s
ERROR: boto3 1.11.1

### Extract a previously downloaded raw dataset from Drive

In [0]:
# Run it only when the dataset is not yet extracted 
zipRef = zipfile.ZipFile(dataPath + "/dummyDataset.zip", 'r')
zipRef.extractall(localPath + "/data")
zipRef.close()
print('Dataset extraction successful')

Dataset extraction successful
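
>The extraction cell is meant to run only once. A minimal sketch of how that check could be automated is shown below, using a throwaway archive so it runs anywhere. The helper name `extractOnce` and the demo file names are assumptions made for illustration.

```python
import os
import tempfile
import zipfile

def extractOnce(zipPath, targetDir):
    """Extract zipPath into targetDir unless targetDir already has content."""
    if os.path.isdir(targetDir) and os.listdir(targetDir):
        return False  # already extracted, nothing to do
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile(zipPath, 'r') as zipRef:
        zipRef.extractall(targetDir)
    return True

# Self-contained demonstration with a throwaway archive
workDir = tempfile.mkdtemp()
demoZip = os.path.join(workDir, 'demo.zip')
with zipfile.ZipFile(demoZip, 'w') as z:
    z.writestr('images/a.txt', 'placeholder')

target = os.path.join(workDir, 'data')
firstRun = extractOnce(demoZip, target)   # extracts the archive
secondRun = extractOnce(demoZip, target)  # skipped, target is not empty
print(firstRun, secondRun)
```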


## 2. Use OIDv4 downloader to get the desired images
>This part downloads the images and their annotations as the raw dataset.

### Show available OIDv4 commands

In [9]:
!python3 main.py -h

usage: main.py [-h] [--Dataset /path/to/OID/csv/] [-y]
               [--classes list of classes [list of classes ...]]
               [--type_csv 'train' or 'validation' or 'test' or 'all']
               [--sub Subset of human verified images or machine generated h or m)]
               [--image_IsOccluded 1 or 0] [--image_IsTruncated 1 or 0]
               [--image_IsGroupOf 1 or 0] [--image_IsDepiction 1 or 0]
               [--image_IsInside 1 or 0] [--multiclasses 0 (default or 1]
               [--n_threads [default 20]] [--noLabels]
               [--limit integer number]
               <command> 'downloader', 'visualizer' or 'ill_downloader'.

Open Image Dataset Downloader

positional arguments:
  <command> 'downloader', 'visualizer' or 'ill_downloader'.
                        'downloader', 'visualizer' or 'ill_downloader'.

optional arguments:
  -h, --help            show this help message and exit
  --Dataset /path/to/OID/csv/
                        Directory of the OID da

### Define the class names for later reference
>The order of the classes matters: class ids are assigned by position in this list.

In [0]:
classNames = ["Vehicle registration plate", "Car", "Person"]
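
>A short sketch of why the order matters: ids start at 1 and follow the list order, the same scheme the notebook uses later when it builds the `classes` dictionary. The `sorted` call is only there to illustrate a reordering.

```python
classNames = ["Vehicle registration plate", "Car", "Person"]

# Ids start at 1 and follow the list order
classIds = {name: idx + 1 for idx, name in enumerate(classNames)}
print(classIds)

# Reordering the list silently changes every id
reordered = {name: idx + 1 for idx, name in enumerate(sorted(classNames))}
print(reordered)
```

>A label map and a set of `tfrecord` files generated from differently ordered lists would disagree about which id means which class.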

### Download train, test, validation classes

In [11]:
!python3 main.py downloader --classes "Vehicle registration plate" "Car" "Person" --type_csv train --multiclasses 0 --limit 5500

		   ___   _____  ______            _    _    
		 .'   `.|_   _||_   _ `.         | |  | |   
		/  .-.  \ | |    | | `. \ _   __ | |__| |_  
		| |   | | | |    | |  | |[ \ [  ]|____   _| 
		\  `-'  /_| |_  _| |_.' / \ \/ /     _| |_  
		 `.___.'|_____||______.'   \__/     |_____|
             _____                    _                 _             
            (____ \                  | |               | |            
             _   \ \ ___  _ _ _ ____ | | ___   ____  _ | | ____  ____ 
            | |   | / _ \| | | |  _ \| |/ _ \ / _  |/ || |/ _  )/ ___)
            | |__/ / |_| | | | | | | | | |_| ( ( | ( (_| ( (/ /| |    
            |_____/ \___/ \____|_| |_|_|\___/ \_||_|\____|\____)_|    
                                                          
    [INFO] | Downloading Vehicle registration plate.
   [ERROR] | Missing the class-descriptions-boxable.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...

In [12]:
!python3 main.py downloader --classes "Vehicle registration plate" "Car" "Person" --type_csv test --multiclasses 0 --limit 500

    [INFO] | Downloading Vehicle registration plate.
   [ERROR] | Missing the test-annotations-bbox.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
...100%,

In [13]:
!python3 main.py downloader --classes "Vehicle registration plate" "Car" "Person" --type_csv validation --multiclasses 0 --limit 500

    [INFO] | Downloading Vehicle registration plate.
   [ERROR] | Missing the validation-annotations-bbox.csv file.
[DOWNLOAD] | Do you want to download the missing file? [Y/n] Y
..
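
>The three download cells differ only in the subset name and the image limit. The sketch below builds the same argument lists in a loop; it only prints the commands and does not run the downloader. The helper name `buildDownloadCommand` and the `limits` dictionary are illustrative assumptions; the flags are the ones used in the cells above.

```python
import shlex

classNames = ["Vehicle registration plate", "Car", "Person"]
# Image limits per subset, taken from the three cells above
limits = {'train': 5500, 'test': 500, 'validation': 500}

def buildDownloadCommand(subsetType, limit):
    """Return the argument list for one OIDv4 downloader call."""
    return (
        ['python3', 'main.py', 'downloader', '--classes'] + classNames +
        ['--type_csv', subsetType, '--multiclasses', '0', '--limit', str(limit)]
    )

for subsetType, limit in limits.items():
    args = buildDownloadCommand(subsetType, limit)
    # Quote each argument so class names with spaces stay intact
    print(' '.join(shlex.quote(a) for a in args))
```

>The argument lists could be passed to `subprocess.run(args)` from inside the `OIDv4_ToolKit` directory instead of being printed.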

## 3. Create supplementary files
>Create a `txt` file containing the class names, and `tfrecord`s containing the images and their annotations.

### Create class names file

#### Create class names `txt` file

In [0]:
def createClassTxt(destinationDir, classNames):

  # Build the full path instead of changing the working directory
  classesPath = os.path.join(destinationDir, "classes.txt")

  # One class name per line; the line order defines the class ids used later
  content = "\n".join(classNames)

  with open(classesPath, "w") as f:
      f.write(content)

  print("Class names txt file creation successful. Destination: %s" % classesPath)

#### Create class names `pbtxt` file

In [0]:
def createClassPbtxt(destinationDir, classNames):

  # Build the full path instead of changing the working directory
  classesPath = os.path.join(destinationDir, "classes.pbtxt")

  # One `item` entry per class; label map ids start at 1
  items = [
      "item {{\n    id: {0}\n    name: '{1}'\n}}".format(i + 1, className)
      for i, className in enumerate(classNames)
  ]
  content = "\n\n".join(items)

  with open(classesPath, "w") as f:
      f.write(content)

  print("Class names pbtxt file creation successful. Destination: %s" % classesPath)

In [35]:
createClassPbtxt(recordsPath, classNames)

Class names pbtxt file creation successful. Destination: /content/generated/records/classes.pbtxt
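
>To make the label map format concrete, the sketch below builds the same `item { id, name }` text in memory and reads it back. `buildLabelMap` and `parseLabelMap` are hypothetical helpers; the regex-based parser is only a quick sanity check, not the Object Detection API's own label map loader.

```python
import re

def buildLabelMap(classNames):
    """Return pbtxt text with one item per class, ids starting at 1."""
    items = [
        "item {{\n    id: {0}\n    name: '{1}'\n}}".format(i + 1, name)
        for i, name in enumerate(classNames)
    ]
    return "\n\n".join(items)

def parseLabelMap(text):
    """Extract a name -> id mapping from pbtxt text (illustrative only)."""
    pairs = re.findall(r"id:\s*(\d+)\s*name:\s*'([^']*)'", text)
    return {name: int(idx) for idx, name in pairs}

labelMap = buildLabelMap(["Vehicle registration plate", "Car", "Person"])
print(labelMap)
print(parseLabelMap(labelMap))
```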


### Create the `tfrecord` files
>These files are the inputs of the TensorFlow Object Detection API models.

#### Create one `tfrecord` from the input parameters

In [0]:
def createRecord(path, imageId, classes, annotations):

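  # Read the image size, then close the file handle
  with Image.open(path) as image:
    imgWidth, imgHeight = image.size
  # Read the encoded image bytes (tf.io.gfile is available from TensorFlow 1.13 onwards)
  with tf.io.gfile.GFile(path, 'rb') as f:
    imgData = f.read()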

  xmins = []
  xmaxs = []
  ymins = []
  ymaxs = []
  classesText = []
  classesInt = []

  imageAnnotations = annotations.get_group(imageId)

  for _, row in imageAnnotations.loc[imageAnnotations['LabelName'].isin(classes.keys())].iterrows():

      xmins.append(row['XMin'])
      xmaxs.append(row['XMax'])
      ymins.append(row['YMin'])
      ymaxs.append(row['YMax'])
      classesText.append(row['LabelName'].encode('utf8'))
      classesInt.append(classes[row['LabelName']])

  tfRecord = tf.train.Example(features=tf.train.Features(feature={
      'image/height': dataset_util.int64_feature(imgHeight),
      'image/width': dataset_util.int64_feature(imgWidth),
      'image/filename': dataset_util.bytes_feature(imageId.encode('utf8')),
      'image/source_id': dataset_util.bytes_feature(imageId.encode('utf8')),
      'image/encoded': dataset_util.bytes_feature(imgData),
      'image/format': dataset_util.bytes_feature(b'jpg'),
      'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
      'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
      'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
      'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
      'image/object/class/text': dataset_util.bytes_list_feature(classesText),
      'image/object/class/label': dataset_util.int64_list_feature(classesInt)
  }))

  return tfRecord
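
>`createRecord` stores `XMin`, `XMax`, `YMin` and `YMax` unchanged. Open Images annotations are normalized to the range 0..1, which is also what the bounding box features above are expected to hold. The sketch below checks a box and converts it to pixels for visual inspection. Both helper names and the sample values are assumptions made for illustration.

```python
def isValidNormalizedBox(xmin, xmax, ymin, ymax):
    """True when the box lies inside the image and has a positive area."""
    return 0.0 <= xmin < xmax <= 1.0 and 0.0 <= ymin < ymax <= 1.0

def toPixelBox(xmin, xmax, ymin, ymax, imgWidth, imgHeight):
    """Convert a normalized box (0..1) to integer pixel coordinates."""
    return (
        int(round(xmin * imgWidth)),
        int(round(xmax * imgWidth)),
        int(round(ymin * imgHeight)),
        int(round(ymax * imgHeight)),
    )

# Hypothetical annotation values for a 1024x768 image
box = (0.25, 0.75, 0.5, 1.0)
print(isValidNormalizedBox(*box))
print(toPixelBox(*box, imgWidth=1024, imgHeight=768))
```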

#### Create a `csv` row from the input parameters

##### Encode list elements as a row

In [0]:
def appendListAsRow(fileName, row):

    # Open the file in append mode; newline='' lets the csv module control line endings
    with open(fileName, 'a', newline='') as writeObj:

        # Create a writer object from the csv module
        writer = csv.writer(writeObj)

        # Add the contents of the list as the last row of the csv file
        writer.writerow(row)

##### Create `csv` rows of an image

In [0]:
def createCsvRowsOfImage(path, imageId, classes, annotations):

  # Only the image size is needed here, so close the file right away
  with Image.open(path) as image:
    imgWidth, imgHeight = image.size
  imgData = 1        # placeholder: the encoded image bytes are not written to the csv
  extension = 'jpg'  # plain string, so the csv holds jpg instead of b'jpg'

  imageAnnotations = annotations.get_group(imageId)

  csvRows = []

  # One csv row per bounding box that belongs to a requested class
  for _, row in imageAnnotations.loc[imageAnnotations['LabelName'].isin(classes.keys())].iterrows():

    csvRows.append([
        imgHeight,
        imgWidth,
        imageId,  # mirrors image/filename in the tfrecord
        imageId,  # mirrors image/source_id in the tfrecord
        imgData,
        extension,
        row['XMin'],
        row['XMax'],
        row['YMin'],
        row['YMax'],
        row['LabelName'],
        classes[row['LabelName']],
    ])

  return csvRows

#### Add `csv` file header

In [0]:
def csvAddHeader(fileName):

  # Column order must match the rows built by createCsvRowsOfImage.
  # The image id is written twice, mirroring image/filename and image/source_id,
  # so the two columns get distinct names.
  headerRow = [
      'imgHeight', 'imgWidth', 'filename', 'sourceId', 'imgData', 'extension',
      'XMin', 'XMax', 'YMin', 'YMax', 'LabelName', 'LabelId'
  ]

  appendListAsRow(fileName, headerRow)
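
>Because the csv file is opened in append mode, re-running the generation cell adds a second header and duplicate rows to an existing file. A minimal sketch of one way to guard against the repeated header is shown below. The helper `appendRowWithHeader`, the shortened `HEADER` and the sample rows are hypothetical.

```python
import csv
import os
import tempfile

# Shortened, hypothetical header used only for this demonstration
HEADER = ['imgHeight', 'imgWidth', 'XMin', 'XMax', 'YMin', 'YMax', 'LabelName', 'LabelId']

def appendRowWithHeader(fileName, row, header=HEADER):
    """Append a row, writing the header first if the file is new or empty."""
    needsHeader = not os.path.exists(fileName) or os.path.getsize(fileName) == 0
    with open(fileName, 'a', newline='') as f:
        writer = csv.writer(f)
        if needsHeader:
            writer.writerow(header)
        writer.writerow(row)

demoCsv = os.path.join(tempfile.mkdtemp(), 'demo.csv')
appendRowWithHeader(demoCsv, [768, 1024, 0.25, 0.75, 0.5, 1.0, 'Car', 2])
appendRowWithHeader(demoCsv, [768, 1024, 0.1, 0.2, 0.1, 0.3, 'Person', 3])

# Read the file back: one header line followed by the two data rows
with open(demoCsv, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```

>This does not remove duplicate data rows; deleting the old csv before regenerating it is the simpler fix for that.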

#### Generate the `tfrecord` files from the provided images & annotations
>`numShards` sets how many output files (shards) the records are split across. Records are assigned to the shards in round-robin order, which keeps each file small and speeds up processing.

In [0]:
def tfrecordGenerator(
    classesFile,
    classDescriptionsFile,
    annotationsFile,
    imagesDir,
    outputFile,
    numShards,
    csvNeeded,
    csvFileName
):

    # Read one class name per line, skipping blank lines; ids start at 1
    with open(classesFile) as f:
        classes = [line for line in f.read().split('\n') if line]
    classes = {name: idx + 1 for idx, name in enumerate(classes)}
    print(f'Classes: {classes}')

    classDescriptions = {row[0]: row[1] for _, row in pd.read_csv(classDescriptionsFile, header=None).iterrows()}

    annotations = pd.read_csv(annotationsFile)
    annotations['LabelName'] = annotations['LabelName'].map(lambda n: classDescriptions[n])
    annotations = annotations.groupby('ImageID')

    images = tf.gfile.Glob(imagesDir + '/*/*.jpg')
    images = map(lambda i: (os.path.basename(i).split('.jpg')[0], i), images)
    images = dict(images)
    print(f'{len(images)} images found')

    with contextlib2.ExitStack() as tfRecordCloseStack:

      outputRecords = tf_record_creation_util.open_sharded_output_tfrecords(
      tfRecordCloseStack, outputFile, numShards)

      index = 0

      if csvNeeded:
        csvAddHeader(csvFileName)

      for imageId, path in images.items():

        tfRecord = createRecord(path, imageId, classes, annotations)
        outputShardIndex = index % numShards
        outputRecords[outputShardIndex].write(tfRecord.SerializeToString())

        if csvNeeded:
          imageRows = createCsvRowsOfImage(path, imageId, classes, annotations)
          for row in imageRows:
            appendListAsRow(csvFileName, row)

        index += 1

    print('TFRecords has been successfully created')
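
>The round-robin assignment (`index % numShards`) can be sketched without TensorFlow. The file name format below is an assumption based on the `-?????-of-00010` pattern used in the reader config later in this notebook; `shardFileNames` and `assignToShards` are hypothetical helpers.

```python
def shardFileNames(basePath, numShards):
    """Return the assumed shard file names for a base path."""
    return ['{}-{:05d}-of-{:05d}'.format(basePath, i, numShards) for i in range(numShards)]

def assignToShards(imageIds, numShards):
    """Distribute ids over shards in round-robin order, as the generator does."""
    shards = [[] for _ in range(numShards)]
    for index, imageId in enumerate(imageIds):
        shards[index % numShards].append(imageId)
    return shards

imageIds = ['img{}'.format(i) for i in range(7)]
print(shardFileNames('trainDataset.tfrecord', 3))
print(assignToShards(imageIds, 3))
```

>With 7 images and 3 shards, the first shard receives three records and the other two receive two each, so shard sizes differ by at most one.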

#### Generate train/test/validation `tfrecord` files
>`subsetTypes` lists the subset names and `numShards` gives the shard count for each subset, in the same order. A single file holding more than a few thousand records dramatically slows down processing, so the large train subset is split into 10 shards. `numShards=1` writes the whole subset into a single file.

In [21]:
os.chdir(datasetRootPath)

# train, test, validation subsets
subsetTypes = ['train', 'test', 'validation']
numShards = [10, 1, 1]
csvNeeded = True
csvFileName = ""

for subsetType, numShard in zip(subsetTypes, numShards):

  print(f'***Generating {subsetType} tfrecord files [sharded into {numShard} piece(s)]***')

  classesFile = datasetRootPath + '/classes.txt'
  classDescriptionsFile = datasetRootPath + '/csv_folder/class-descriptions-boxable.csv'
  annotationsFile = datasetRootPath + '/csv_folder/' + subsetType + '-annotations-bbox.csv'
  imagesDir = datasetRootPath + '/Dataset/' + subsetType
  outputFile = recordsPath + '/' + subsetType + 'Dataset.tfrecord'
  csvFileName = csvPath + '/' + subsetType + 'Dataset.csv'

  tfrecordGenerator(
      classesFile,
      classDescriptionsFile,
      annotationsFile,
      imagesDir,
      outputFile,
      numShard,
      csvNeeded,
      csvFileName
  )

  print('')

***Generating train tfrecord files [sharded into 10 piece(s)]***
Classes: {'Vehicle registration plate': 1, 'Car': 2, 'Person': 3}
15972 images found

TFRecords has been successfully created

***Generating test tfrecord files [sharded into 1 piece(s)]***
Classes: {'Vehicle registration plate': 1, 'Car': 2, 'Person': 3}
1440 images found
TFRecords has been successfully created

***Generating validation tfrecord files [sharded into 1 piece(s)]***
Classes: {'Vehicle registration plate': 1, 'Car': 2, 'Person': 3}
1342 images found
TFRecords has been successfully created



## 4. Generate dataset
>The generated `tfrecord` files need to be zipped and saved to the Drive directory.

### Zip & save `tfrecord` files to Drive

In [38]:
outputFileName = "datasetRecords"
os.chdir(localPath)
shutil.make_archive(outputFileName, 'zip', recordsPath)
shutil.move(localPath + '/' + outputFileName + '.zip', dataPath)

'/content/drive/My Drive/Machine Learning/License plate detection/data/datasetRecords.zip'

### Zip & save `csv` files to Drive

In [27]:
outputFileName = "datasetCsvs"
os.chdir(localPath)
shutil.make_archive(outputFileName, 'zip', csvPath)
shutil.move(localPath + '/' + outputFileName + '.zip', dataPath)

'/content/drive/My Drive/Machine Learning/License plate detection/data/datasetCsvs.zip'
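
>`shutil.move` raises an error when the destination directory already contains a file with the same name, so re-running either zip cell fails once the archive exists on Drive. The sketch below shows one way to make the step repeatable, using temporary directories instead of Drive. The helper name `archiveAndMove` and the demo file are assumptions made for illustration.

```python
import os
import shutil
import tempfile
import zipfile

def archiveAndMove(sourceDir, archiveName, destinationDir):
    """Zip sourceDir and move the archive to destinationDir, replacing an old copy."""
    workDir = tempfile.mkdtemp()
    archivePath = shutil.make_archive(os.path.join(workDir, archiveName), 'zip', sourceDir)
    os.makedirs(destinationDir, exist_ok=True)
    target = os.path.join(destinationDir, os.path.basename(archivePath))
    if os.path.exists(target):
        os.remove(target)  # drop the archive left by a previous run
    return shutil.move(archivePath, target)

# Demonstration with temporary directories
sourceDir = tempfile.mkdtemp()
with open(os.path.join(sourceDir, 'classes.pbtxt'), 'w') as f:
    f.write("item {\n    id: 1\n    name: 'Car'\n}")

destinationDir = tempfile.mkdtemp()
first = archiveAndMove(sourceDir, 'datasetRecords', destinationDir)
second = archiveAndMove(sourceDir, 'datasetRecords', destinationDir)  # re-run does not fail

with zipfile.ZipFile(second) as z:
    archivedNames = z.namelist()
print(second, archivedNames)
```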

### How to use the sharded dataset

In [0]:
tf_record_input_reader {
  input_path: "/path/to/trainDataset.tfrecord-?????-of-00010"
}
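
>Each `?` in the `input_path` pattern stands for exactly one character, so `?????` covers the five-digit shard index. The sketch below uses Python's `fnmatch` to show which names such a pattern selects. It assumes the reader's glob treats `?` the same way, and the file list is made up for the demonstration.

```python
import fnmatch

# Hypothetical directory listing: ten train shards plus two unrelated files
fileNames = ['trainDataset.tfrecord-{:05d}-of-{:05d}'.format(i, 10) for i in range(10)]
fileNames += ['testDataset.tfrecord-00000-of-00001', 'classes.pbtxt']

pattern = 'trainDataset.tfrecord-?????-of-00010'
matched = fnmatch.filter(fileNames, pattern)

print(len(matched))
print(matched[0], matched[-1])
```

>Only the ten train shards match; the test file and the label map are left out.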

## Force garbage collection

In [0]:
gc.collect()

759