# Manual review of dataset

On viewing the images in the various brand folders, I noticed that many of them are unusable, such as images of car interiors or of people.

The dataset therefore needs substantial pruning.

# Code to remove unusable images (bad data types)

Some images in the dataset cannot be read by the code or are in unsupported formats, so these images are removed.

In [4]:
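import os
import cv2
import imghdr  # note: deprecated since Python 3.11 and removed in Python 3.13

# Format names that imghdr.what() may report for the files we keep
formats = ['jpeg', 'jpg', 'bmp', 'png', 'webp']
data_dir = "./photos"  # forward slash avoids the "\p" escape and works on every OS
count = 0
for image_class in os.listdir(data_dir):
    for image in os.listdir(os.path.join(data_dir, image_class)):
        image_path = os.path.join(data_dir, image_class, image)
        try:
            # cv2.imread returns None (it does not raise) when a file cannot be decoded
            img = cv2.imread(image_path)
            img_type = imghdr.what(image_path)
            if img is None or img_type not in formats:
                os.remove(image_path)
                count = count + 1
        except Exception as e:
            # Any unexpected error while inspecting the file also marks it as unusable
            os.remove(image_path)
            count = count + 1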




In [5]:
print(f"Deleted {count} images that were in unusable formats.")

Deleted 11 images that were in unusable formats.
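
Because `imghdr` is no longer available from Python 3.13 onwards, the format check could also be done by reading the first bytes of each file. The following is only a minimal sketch of that idea, not the code used above: the helper name `sniff_image_format` is hypothetical, and it recognises only the four formats kept in this notebook.

```python
# Leading bytes ("magic numbers") of three of the formats kept in this notebook.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "bmp": b"BM",
}

def sniff_image_format(path):
    """Return 'jpeg', 'png', 'bmp', 'webp', or None, based on the file header.

    Hypothetical helper: it only inspects the header and does not decode the image.
    """
    with open(path, "rb") as f:
        header = f.read(12)
    for name, magic in SIGNATURES.items():
        if header.startswith(magic):
            return name
    # WebP is a RIFF container: bytes 0-3 are 'RIFF' and bytes 8-11 are 'WEBP'
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

A header check like this only says what a file claims to be, so it would still be paired with an actual decode (for example the `cv2.imread` call above) to catch truncated or corrupt files.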


# Count number of images in each brand's folder

In [6]:
import os

photos_dir = './photos'
for class_dir in os.listdir(photos_dir):
    class_path = os.path.join(photos_dir, class_dir)
    image_count = len(os.listdir(class_path))
    print(f"{class_dir}: {image_count} images")


Ford: 597 images
Honday: 50 images
Hyundai: 553 images
Nissan: 538 images
Renault: 583 images
Suzuki: 577 images
Tata: 499 images
Toyota: 620 images
Volkswagen: 252 images


# Data imbalance
As the counts above show, Honda (50 images) and Volkswagen (252 images) have far fewer images than the other brands (499 to 620 each), so the dataset is imbalanced.

In later code we shall perform data augmentation on these folders to increase their size.
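
As a rough way to plan that augmentation, the number of extra images each folder needs can be computed from the counts. This is a minimal sketch under one assumption: every class should be brought up to the size of the largest class. The function name `extra_images_needed` and the example class names are hypothetical.

```python
def extra_images_needed(counts, target=None):
    """Map each class to the number of augmented images needed to reach `target`.

    `counts` maps class name -> current image count. If `target` is None,
    the size of the largest class is used (an assumption, not a fixed rule).
    """
    if target is None:
        target = max(counts.values())
    return {name: max(0, target - n) for name, n in counts.items()}

# Example with three of the folder sizes printed above (names are placeholders)
counts = {"small": 50, "medium": 252, "large": 620}
print(extra_images_needed(counts))  # {'small': 570, 'medium': 368, 'large': 0}
```

With the largest folder (620 images) as the target, the 50-image folder would need 570 augmented images and the Volkswagen folder 368. Generating that many variants from only 50 originals may add little real variety, so a lower target could be more sensible.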

# Remove images that are not cars

1. This dataset is very mixed and contains a large number of images that are unrelated to the project.

2. We need to remove these images from the dataset.

3. We shall achieve this by using a pre-trained model to detect cars in the images.

4. If no cars are detected in an image, that image is useless and needs to be removed.

5. A function 'detect_cars(img)' will be needed to achieve this.

In [None]:
def detect_cars(img):
    # TODO: Add logic to detect cars in the image and add them to list

    # code to return an empty list of car images
    car_images = []

    return car_images
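
Until a concrete model is chosen, the filtering and cropping part of 'detect_cars' can be sketched independently of it. In the sketch below, `detector` is a hypothetical callable that wraps whatever pre-trained model is eventually used and returns `(label, score, (x1, y1, x2, y2))` tuples in pixel coordinates. That interface, the `"car"` label, and the 0.5 score threshold are all assumptions, not part of any specific library.

```python
def detect_cars(img, detector, min_score=0.5, car_labels=("car",)):
    """Return one cropped sub-image per detected car.

    `detector` is a hypothetical callable: detector(img) yields
    (label, score, (x1, y1, x2, y2)) tuples in pixel coordinates.
    `img` is expected to support NumPy-style 2D slicing, as OpenCV images do.
    """
    car_images = []
    for label, score, (x1, y1, x2, y2) in detector(img):
        if label in car_labels and score >= min_score:
            # Rows are indexed by y and columns by x
            car_images.append(img[int(y1):int(y2), int(x1):int(x2)])
    return car_images
```

Note that this version takes the detector as an extra argument. To keep the one-argument form 'detect_cars(img)' used elsewhere in this notebook, the detector could be bound once with `functools.partial`.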


A function named 'is_usable(image_path)' will check every image in the dataset: it should confirm that the image has a valid size and that it contains at least one car.

If an image fails either check, the following code deletes it.

In [None]:
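def is_usable(image_path):
    # Load the image using OpenCV (returns None if the file cannot be read)
    img = cv2.imread(image_path)

    # Reject unreadable or empty images before running detection
    if img is None or img.size == 0:
        return False

    # Use the provided detect_cars function to detect cars in the image
    car_images = detect_cars(img)

    # Check if any car images were detected
    return len(car_images) > 0

# WARNING: while detect_cars is still a stub that returns an empty list,
# is_usable returns False for every image and this loop deletes the whole dataset.
photos_dir = 'photos'
for class_dir in os.listdir(photos_dir):  # iterate through folders
    class_path = os.path.join(photos_dir, class_dir)
    for image_name in os.listdir(class_path):  # iterate through images in a folder
        image_path = os.path.join(class_path, image_name)
        if not is_usable(image_path):  # check if image is usable
            os.remove(image_path)  # remove if not usable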
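
Since deletion cannot be undone, a more cautious variant of the loop above would first list the rejected files and only then move them aside. The following is a minimal sketch of that idea, not a replacement the notebook depends on: the function name `prune_dataset`, the `quarantine_dir` parameter, and the `dry_run` flag are all hypothetical.

```python
import os
import shutil

def prune_dataset(photos_dir, is_usable, quarantine_dir=None, dry_run=True):
    """Return the paths that fail `is_usable`; optionally move them aside.

    With dry_run=True (the default) nothing on disk changes.
    `quarantine_dir` should be a folder outside `photos_dir`.
    """
    rejected = []
    for class_dir in sorted(os.listdir(photos_dir)):
        class_path = os.path.join(photos_dir, class_dir)
        if not os.path.isdir(class_path):
            continue  # skip stray files next to the brand folders
        for image_name in sorted(os.listdir(class_path)):
            image_path = os.path.join(class_path, image_name)
            if is_usable(image_path):
                continue
            rejected.append(image_path)
            if not dry_run and quarantine_dir is not None:
                # Keep the brand sub-folder so files can be restored later
                target_dir = os.path.join(quarantine_dir, class_dir)
                os.makedirs(target_dir, exist_ok=True)
                shutil.move(image_path, os.path.join(target_dir, image_name))
    return rejected
```

A first call such as `prune_dataset('photos', is_usable)` would only report what would be removed. If the list looks right, a second call with `dry_run=False` and a `quarantine_dir` moves those files out of the dataset.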

# Checking image quality and applying sharpening if required.

First, we count the images that are low resolution (smaller than 256x256 in either dimension) and the images that are not exactly 256x256 and therefore require cropping.

In [7]:
import PIL
import os
from PIL import Image

def is_low_resolution(img, min_size):
    # Check if the image is smaller than the minimum size in either dimension
    return img.size[0] < min_size[0] or img.size[1] < min_size[1]

def requires_cropping(img, size):
    # Check if the image is not already the given size
    return img.size != size

photos_dir = './photos'
min_size = (256, 256)
size = (256, 256)
low_res_count = 0
crop_count = 0

for class_dir in os.listdir(photos_dir):
    class_path = os.path.join(photos_dir, class_dir)
    for image_name in os.listdir(class_path):
        image_path = os.path.join(class_path, image_name)
        # Open inside a context manager so each file handle is closed promptly
        with Image.open(image_path) as img:
            # Check if the image is low resolution
            if is_low_resolution(img, min_size):
                low_res_count += 1

            # Check if the image requires cropping
            if requires_cropping(img, size):
                crop_count += 1

print(f"Number of low resolution images: {low_res_count}")
print(f"Number of images requiring cropping: {crop_count}")


Number of low resolution images: 230
Number of images requiring cropping: 4269


# The output above shows that 230 images are low resolution and may need sharpening, and that all 4269 images require cropping

In [None]:
def is_sharp_enough(img):
    # TODO: Add logic to check if the image is sharp enough
    return True

def sharpen_image(img):
    # TODO: Add logic to sharpen the image
    return img

def crop_image(img, size):
    # TODO: Add logic to crop the image to the given size
    return img

def augment_image(img):
    # TODO: Add logic to perform image augmentation
    return img
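
One possible way to fill in these stubs with Pillow is sketched below. Everything in it is an assumption rather than the method this project settled on: sharpness is estimated from the variance of an edge-filtered grayscale copy, the threshold of 100 and the unsharp-mask parameters are illustrative and untuned, cropping is a center crop via `ImageOps.fit`, and augmentation is limited to a horizontal flip.

```python
from PIL import ImageFilter, ImageOps, ImageStat

def is_sharp_enough(img, threshold=100.0):
    # Blurry images have weak edges, so the edge map has low variance.
    edges = img.convert("L").filter(ImageFilter.FIND_EDGES)
    # Drop the 1-pixel border, which the 3x3 filter cannot compute properly
    w, h = edges.size
    edges = edges.crop((1, 1, w - 1, h - 1))
    return ImageStat.Stat(edges).var[0] >= threshold

def sharpen_image(img):
    # Unsharp masking; radius/percent/threshold are illustrative values
    return img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))

def crop_image(img, size):
    # Resize and center-crop to exactly `size`, preserving the aspect ratio
    return ImageOps.fit(img, size)

def augment_image(img):
    # Simplest possible augmentation: a left-right mirror of the image
    return ImageOps.mirror(img)
```

For example, `crop_image(img, (256, 256))` would bring an image to the 256x256 size used in the count above. The sharpness threshold in particular would need to be calibrated on a few known sharp and blurry images from this dataset.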


# PROBLEM FACED: Extract only front-facing cars

The major problem I faced with this dataset is extracting only the front-facing vehicles from all the vehicle images, since the major publicly available models detect cars in general and not specifically front-facing cars.

There are two options:

1. Ideally, an improved dataset is provided, so that there is no need to extract front-facing cars.
2. Otherwise, the front-facing cars need to be extracted from this dataset.

# Manual inspection of the dataset

After all of the above steps, the data should be cleaner than before.

We can now manually go through the dataset and remove any leftover unusable images.