# Calculate Image Counts for Each Vehicle Class

## *Abstract*

> This experiment determines how much data augmentation each vehicle class needs; some classes require more than others. We aim to have at least [`1000 images`](http://image-net.org/about-overview) per vehicle class for `training`, plus an additional `200 images` for `testing`. We use the [Pandas](https://pandas.pydata.org/) Python library to build and manipulate tables of image counts, and to generate a profile of how many images of each class and direction (front, back, side, mix) we have or still need. This profile lets us determine which vehicle classes are ready for use in our machine learning models, and which must be supplemented with additional image data (using traditional image data augmentation techniques, or by scraping other used car platforms). The image count profile can be thought of as a checkpoint for our image collection/generation efforts: it shows how close we are to a diverse, evenly distributed image dataset covering many vehicle classes in many directions.

## *Introduction*

> A robust image dataset is important to any image classification project. The greater the quantity and diversity of the images we provide as training data for our machine learning system, the higher its chances of accurately predicting a wide array of vehicle classes in various conditions. We take inspiration from the [ImageNet](http://image-net.org/about-stats) project, in which the image classifier is divided into `high level categories` and then `subcategories`, with each `subcategory` having approximately 1000 images in the training dataset. The purpose of this Python notebook is to help us generate an image data profile similar to the one created by the ImageNet project in the link above. In our case, we have decided to divide our dataset in the following way:

```
Vehicle Class (High Level Category)
│   * BMW_1시리즈 _1시리즈 (F20) (12년~현재)
│   * 기아_K3_K3(12~15년)
|   * 렉서스_IS_뉴 IS250(13년~현재)
│
└───Direction (Subcategories)
    │   * Front
    │   * Back
    │   * Side
    │   * Mix
```

## *Method*

> Please refer to the Python code below to see my methods for achieving the goals set out in the `Introduction`.

In [60]:
# import all dependencies here
import os
import pandas as pd
import shutil
import random
import sys
from pprint import pprint
import typing as T

In [53]:
ROOT = '/Volumes/TriveStorage/code/trive-image-recognition/complete_manual/directional'
directions = ['front', 'back', 'side', 'mix']
processed_folders = [x for x in os.listdir(ROOT) if '.DS_Store' not in x]


In [54]:
data = []
for processed_folder in processed_folders:
    # each row: class name, one count per direction, then the total
    image_counts = [processed_folder]
    for direction in directions:
        direction_path = os.path.join(ROOT, processed_folder, direction)
        # skip macOS metadata files so they are not counted as images
        image_names = [s for s in os.listdir(direction_path) if s != '.DS_Store']
        image_counts.append(len(image_names))
    image_counts.append(sum(image_counts[1:]))
    data.append(image_counts)
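
The counting step can be tried out without access to the storage volume. The sketch below is a minimal, self-contained illustration, not the notebook's own code: it builds a throwaway directory tree in a temporary folder and counts files per direction while skipping `.DS_Store`. The class names `class_a` and `class_b`, the file names, and the helper `count_images` are all hypothetical.

```python
import os
import tempfile

DIRECTIONS = ['front', 'back', 'side', 'mix']


def count_images(root, directions=DIRECTIONS):
    """Return rows of [class_name, count per direction..., total]."""
    rows = []
    for class_name in sorted(os.listdir(root)):
        class_path = os.path.join(root, class_name)
        if not os.path.isdir(class_path):
            continue  # skips stray files such as .DS_Store
        counts = []
        for direction in directions:
            direction_path = os.path.join(class_path, direction)
            files = [f for f in os.listdir(direction_path) if f != '.DS_Store']
            counts.append(len(files))
        rows.append([class_name] + counts + [sum(counts)])
    return rows


# Made-up layout: {class: {direction: number of images}}
layout = {
    'class_a': {'front': 2, 'back': 1, 'side': 3, 'mix': 0},
    'class_b': {'front': 0, 'back': 4, 'side': 1, 'mix': 2},
}

with tempfile.TemporaryDirectory() as tmp:
    for class_name, per_direction in layout.items():
        for direction, n in per_direction.items():
            d = os.path.join(tmp, class_name, direction)
            os.makedirs(d)
            # a metadata file that must not be counted as an image
            open(os.path.join(d, '.DS_Store'), 'w').close()
            for i in range(n):
                open(os.path.join(d, f'{i}.jpg'), 'w').close()
    rows = count_images(tmp)

print(rows)
```

Each row has the same shape as the rows appended to `data` above, so it could be passed to `pd.DataFrame` with the same headers.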

In [55]:
headers = ['name', 'front', 'back', 'side', 'mix', 'total']
df = pd.DataFrame(data, columns=headers)
df['side'] = df['side'] + df['mix']
df = df.drop(columns='mix')

In [56]:
# figure out how much data augmentation is required for each
# vehicle class
# * xxxAug = 0, indicates no augmentation is required
# * yyyAug = 100, indicates that 100 additional images are required

training_image_req = 334 # 1000 / 3
test_image_req = 67 # 200 / 3

image_requirement = training_image_req + test_image_req

df['frontAug'] = image_requirement - df['front']
df.loc[df['frontAug']<0,'frontAug'] = 0

df['backAug'] = image_requirement - df['back']
df.loc[df['backAug']<0,'backAug'] = 0

df['sideAug'] = image_requirement - df['side']
df.loc[df['sideAug']<0,'sideAug'] = 0
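
As a sanity check of the shortfall logic, the sketch below reproduces it on a small made-up table. The class names and counts are invented for illustration. `Series.clip(lower=0)` is used here as a compact alternative to the subtract-then-zero pattern in the cell above; it is a swapped-in technique that should give the same result, not the notebook's original code.

```python
import math

import pandas as pd

# Per-direction quotas: the 1000 training and 200 test images per class
# are split across three directions and rounded up.
training_image_req = math.ceil(1000 / 3)  # 334
test_image_req = math.ceil(200 / 3)       # 67
image_requirement = training_image_req + test_image_req  # 401

# Made-up counts for two hypothetical classes.
toy = pd.DataFrame({
    'name': ['class_a', 'class_b'],
    'front': [500, 120],
    'back': [401, 0],
    'side': [300, 50],
    'mix': [150, 25],
})

# Fold `mix` into `side`, as the notebook does.
toy['side'] = toy['side'] + toy['mix']
toy = toy.drop(columns='mix')

# Shortfall per direction, floored at zero.
for direction in ['front', 'back', 'side']:
    toy[f'{direction}Aug'] = (image_requirement - toy[direction]).clip(lower=0)

print(toy)
```

In this toy table `class_a` already meets the quota of 401 images in every direction, while `class_b` is short in all three.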

In [57]:
df['augTotal'] = df['frontAug'] + df['backAug'] + df['sideAug']
df = df.sort_values(by='augTotal', ascending=True)

In [58]:
df.to_csv('image_counts.csv', sep=',', index=False)
df['name'].to_csv('class_names.csv', sep=',', index=False)
print('number of classes:', len(df.index))
top_100 = df.head(100)
top_100.to_csv('top_100.csv', index=False, header=True)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(top_100)

number of classes: 638


| index | name | front | back | side | total | frontAug | backAug | sideAug | augTotal |
|---|---|---|---|---|---|---|---|---|---|
| 117 | 기아_카니발_올 뉴 카니발(14~18년) | 1179 | 842 | 1552 | 3573 | 0 | 0 | 0 | 0 |
| 64 | 기아_레이_레이(11~17년) | 1096 | 1337 | 1558 | 3991 | 0 | 0 | 0 | 0 |
| 74 | 기아_모닝_올 뉴 모닝 (JA)(17년~현재) | 550 | 548 | 805 | 1903 | 0 | 0 | 0 | 0 |
| 75 | 기아_모닝_올 뉴 모닝(11~15년) | 723 | 1064 | 1928 | 3715 | 0 | 0 | 0 | 0 |
| 77 | 기아_모하비_모하비(07~16년) | 603 | 575 | 905 | 2083 | 0 | 0 | 0 | 0 |
| 42 | 기아_K5_K5 2세대(15~18년) | 450 | 414 | 697 | 1561 | 0 | 0 | 0 | 0 |
| 57 | 기아_K7_올 뉴 K7(16년~현재) | 657 | 536 | 1008 | 2201 | 0 | 0 | 0 | 0 |
| 87 | 기아_스포티지_더 뉴 스포티지 R(13~15년) | 495 | 453 | 958 | 1906 | 0 | 0 | 0 | 0 |
| 54 | 기아_K7_더 뉴 K7(12~16년) | 694 | 600 | 1066 | 2360 | 0 | 0 | 0 | 0 |
| 89 | 기아_스포티지_스포티지 R(10~13년) | 613 | 552 | 1109 | 2274 | 0 | 0 | 0 | 0 |


## *Results and Analysis*

> The columns of the table should be interpreted in the following way:

* `name`: the high level category (vehicle class)
* `front`: the number of images showing the front of the vehicle
* `back`: the number of images showing the back of the vehicle
* `side`: the number of images showing the side of the vehicle; the `mix` images are folded into this count
* `total`: the sum of `front`, `back`, and `side`
* `frontAug`: the number of additional images required for the `front` subcategory (`0` means no augmentation is required)
* `backAug`: the number of additional images required for the `back` subcategory
* `sideAug`: the number of additional images required for the `side` subcategory
* `augTotal`: the sum of `frontAug`, `backAug`, and `sideAug`; the table is sorted by this column in ascending order

> Some vehicle classes, such as those at the top of the sorted table, require no augmentation at all, whereas others require a significant amount of augmentation before they are suitable inputs for our machine learning models.
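
One way to quantify how many classes fall on each side of that divide is to count the rows with a non-zero `augTotal`. The sketch below does this on a made-up table; the class names and values are illustrative and are not the notebook's results. The same expressions could be applied to `df`.

```python
import pandas as pd

# Made-up shortfalls for four hypothetical classes.
toy = pd.DataFrame({
    'name': ['class_a', 'class_b', 'class_c', 'class_d'],
    'frontAug': [0, 281, 0, 401],
    'backAug': [0, 401, 10, 401],
    'sideAug': [0, 326, 0, 350],
})
toy['augTotal'] = toy['frontAug'] + toy['backAug'] + toy['sideAug']

# Boolean mask: True where a class needs at least one extra image.
needs_aug = toy['augTotal'] > 0
n_ready = int((~needs_aug).sum())
share_needing_aug = float(needs_aug.mean())

print(f'{n_ready} of {len(toy)} classes need no augmentation')
print(f'{share_needing_aug:.0%} of classes need some augmentation')

# How many classes are short in each direction?
short_by_direction = (toy[['frontAug', 'backAug', 'sideAug']] > 0).sum()
print(short_by_direction)
```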

## *Discussion*

> Judging by the significant amount of data augmentation required by a large percentage of the vehicle classes in the table above, I believe it is best to continue our data gathering efforts by scraping more used car websites. Data augmentation techniques should only be used as a last resort; scraping a few more major used car platforms should give us a more robust and more even distribution of image data.

In [None]:
names = top_100['name'].tolist()
with open('top_100.txt', 'w') as filehandle:
    filehandle.writelines(f"{x}\n" for x in names)

In [None]:
DEST = '/Volumes/TriveStorage/code/trive-image-recognition/complete_manual/cars_refined'

# create the destination folders if they do not exist yet
TEST_PATH = os.path.join(DEST, 'test')
TRAIN_PATH = os.path.join(DEST, 'train')
os.makedirs(TEST_PATH, exist_ok=True)
os.makedirs(TRAIN_PATH, exist_ok=True)

augmentations = [
    'hflip',
    'smartcrop',
    'saltpepper0d01',
    'saltpepper0d02',
    'saltpepper0d03',
    'saltpepper0d04',
    'saltpepper0d05',
    'saltpepper0d06',
    'saltpepper0d07',
    'saltpepper0d08',
    'saltpepper0d09',
    'saltpepper0d10',
    'translatex25y0',
    'translatex0y25',
    'translatex0ym25',
    'translatexm25y0',
    'gaussian',
]

front_augs = [f'front_{x}' for x in augmentations]
back_augs = [f'back_{x}' for x in augmentations]
side_augs = [f'side_{x}' for x in augmentations] + [f'mix_{x}' for x in augmentations]
    
limit = sys.maxsize
for index, row in top_100.iterrows():
    print(row['name'])
    vehicle_test_path = os.path.join(TEST_PATH, row['name'])
    os.makedirs(vehicle_test_path, exist_ok=True)

    vehicle_train_path = os.path.join(TRAIN_PATH, row['name'])
    os.makedirs(vehicle_train_path, exist_ok=True)
        
    origin = os.path.join(ROOT, row['name'])
    imgs_front = [os.path.join(origin, 'front', x) for x in os.listdir(os.path.join(origin, 'front')) if x != '.DS_Store']
    imgs_back = [os.path.join(origin, 'back', x) for x in os.listdir(os.path.join(origin, 'back')) if x != '.DS_Store']
    imgs_side = [os.path.join(origin, 'side', x) for x in os.listdir(os.path.join(origin, 'side')) if x != '.DS_Store']
    imgs_mix = [os.path.join(origin, 'mix', x) for x in os.listdir(os.path.join(origin, 'mix')) if x != '.DS_Store']
    imgs_side += imgs_mix
    
    if row['frontAug'] != 0:
        curr_augs = [os.path.join(origin, x) for x in front_augs if os.path.exists(os.path.join(origin, x))]
        extra_data = [[os.path.join(x, y) for y in os.listdir(x) if '.DS_Store' not in y] for x in curr_augs]
        extra_data = [item for sublist in extra_data for item in sublist]
        random.shuffle(extra_data)
        print(f'frontAug: {row["frontAug"]} out of {len(extra_data)}')
        imgs_front += extra_data[:row['frontAug']]
        
    if row['backAug'] != 0:
        curr_augs = [os.path.join(origin, x) for x in back_augs if os.path.exists(os.path.join(origin, x))]
        extra_data = [[os.path.join(x, y) for y in os.listdir(x) if '.DS_Store' not in y] for x in curr_augs]
        extra_data = [item for sublist in extra_data for item in sublist]
        random.shuffle(extra_data)
        print(f'backAug: {row["backAug"]} out of {len(extra_data)}')
        imgs_back += extra_data[:row['backAug']]
        
    if row['sideAug'] != 0:
        curr_augs = [os.path.join(origin, x) for x in side_augs if os.path.exists(os.path.join(origin, x))]
        extra_data = [[os.path.join(x, y) for y in os.listdir(x) if '.DS_Store' not in y] for x in curr_augs]
        extra_data = [item for sublist in extra_data for item in sublist]
        random.shuffle(extra_data)
        print(f'sideAug: {row["sideAug"]} out of {len(extra_data)}')
        imgs_side += extra_data[:row['sideAug']]
        
    random.shuffle(imgs_front)
    random.shuffle(imgs_back)
    random.shuffle(imgs_side)
    
    # For each direction, copy the first `training_image_req` images to the
    # train folder and the next `test_image_req` images to the test folder.
    for imgs in (imgs_front, imgs_back, imgs_side):
        splits = (
            (imgs[:training_image_req], vehicle_train_path),
            (imgs[training_image_req:training_image_req + test_image_req], vehicle_test_path),
        )
        for subset, dest_dir in splits:
            for im_path in subset:
                try:
                    # files are renamed to a random integer to avoid name clashes
                    rand_name = f'{random.randint(1, limit)}.jpg'
                    shutil.copy2(im_path, os.path.join(dest_dir, rand_name))
                except Exception as e:
                    print(e)
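
The top-up and split logic in the cell above can be isolated into a small pure function, which makes it easier to check without copying any files. This is a sketch under stated assumptions rather than the notebook's implementation: `plan_split` and its arguments are hypothetical names, and the paths are placeholder strings rather than real files.

```python
import random


def plan_split(originals, augmented, n_train, n_test, rng=random):
    """Return (train, test) path lists for one direction.

    Augmented images are only used to cover the shortfall between the
    number of originals and n_train + n_test.
    """
    shortfall = max(0, n_train + n_test - len(originals))
    pool = list(augmented)
    rng.shuffle(pool)
    images = list(originals) + pool[:shortfall]
    rng.shuffle(images)
    return images[:n_train], images[n_train:n_train + n_test]


# Placeholder paths: 5 originals, 10 augmented candidates.
originals = [f'front/{i}.jpg' for i in range(5)]
augmented = [f'front_hflip/{i}.jpg' for i in range(10)]

train, test = plan_split(originals, augmented, n_train=6, n_test=2,
                         rng=random.Random(0))
print(len(train), len(test))
```

Because the two slices do not overlap, no path ends up in both the train and the test list.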
        
    
    
    