# Split Vehicle Images by Company


## *Abstract*

> The purpose of this notebook is to split vehicle images by the web platform they were scraped from. We found that a significant portion of our scraped image data was unreliable (i.e., the vehicle was wrongly labeled on the source platform). To isolate the problem (i.e., to find out which platforms and which vehicle models are the most unreliable), we again consolidated all vehicle image data into the encar categories, but this time split each data set by web platform (e.g., spcarz, chacourt, kauctioncar, etc.).

## *Introduction*

> To maximize the accuracy of our vehicle image classifier, we need to make sure that the label on each vehicle image is correct. Late in the process we realized that our image scraper was pulling in advertisement images that resided on the same page as the vehicle images we were trying to scrape. Directly below is an example of the image gallery we are trying to extract from a vehicle details page.

![desired](images/desired-vehicle-images.png)

> Here is an example of some advertisement images we pull in by mistake.

![undesired](images/undesired-ad-images.png)

> This problem does not seem to show up frequently, and it is not clear why it arises for some vehicle classes and not for others. Nevertheless, because proper labeling of our data set is so important, we decided to err on the side of caution and identify exactly which company websites were the most problematic (data organization example directly below).

```
Used Car Company Name
│   * chacha
│   * kauctioncar
│   * jchere
│   * chacourt
│   * usedcarmall
│   * jcpremium
│   * encar
│   * bobaedream
│
└───Vehicle Class
    │   * 푸조_2008_2008(13년~현재)
    │   * 푸조_206_206CC(00~08년)
    │   * 푸조_207_207CC(06~13년)
    │   * 푸조_208_208(13년~현재)
    │   * ...
```
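
Once images are organized in the company/vehicle-class layout above, a per-class image count can help show where to look first. The following is a minimal sketch under stated assumptions: `count_images` is a hypothetical helper that is not part of the original notebook, and the demo uses throwaway directories with made-up class names rather than the real data.

```python
import os
import tempfile
import typing as T

def count_images(root: str) -> T.Dict[str, T.Dict[str, int]]:
    """Count files per vehicle class, grouped by company directory."""
    counts: T.Dict[str, T.Dict[str, int]] = {}
    for company in sorted(os.listdir(root)):
        company_dir = os.path.join(root, company)
        if not os.path.isdir(company_dir):
            continue  # ignore stray files such as .DS_Store
        counts[company] = {}
        for vehicle_class in sorted(os.listdir(company_dir)):
            class_dir = os.path.join(company_dir, vehicle_class)
            if not os.path.isdir(class_dir):
                continue
            counts[company][vehicle_class] = sum(
                1 for fn in os.listdir(class_dir)
                if fn != '.DS_Store' and os.path.isfile(os.path.join(class_dir, fn))
            )
    return counts

# demo on a temporary tree; company and class names are placeholders
with tempfile.TemporaryDirectory() as root:
    for company, vehicle_class, n_files in [
        ('encar', 'class-a', 2),
        ('encar', 'class-b', 1),
        ('chacha', 'class-a', 3),
    ]:
        class_dir = os.path.join(root, company, vehicle_class)
        os.makedirs(class_dir)
        for i in range(n_files):
            with open(os.path.join(class_dir, f'{i}.jpg'), 'w') as f:
                f.write('x')
    counts = count_images(root)

print(counts)
```

On the real data, the `root` argument would presumably be the `images-by-company` directory.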

## *Method*

> The Python code below shows the methods used to achieve the goals set out in the `Introduction`.

In [1]:
# import all dependencies here
import csv
import os
import random
import shutil
from pprint import pprint
import typing as T

In [2]:
# create a mapping of all data sources and all data destinations
# absolute path to root directory for this project
cwd = os.getcwd()
dst = os.path.join(cwd, 'images-by-company')

def make_src(path_name: str) -> str:
    """Convenience method for generating image source paths"""
    return os.path.join(cwd, path_name, 'images')

def make_dst(path_name: str) -> str:
    """Convenience method for generating image destination paths"""
    return os.path.join(dst, path_name)

src_and_dst: T.List[T.Dict] = [
    {'src': make_src('archived-images-chacha'), 'dst': make_dst('chacha')},
    {'src': make_src('archived-images-chacourt'), 'dst': make_dst('chacourt')},
    {'src': make_src('archived-images-jchere'), 'dst': make_dst('jchere')},
    {'src': make_src('archived-images-jcpremium'), 'dst': make_dst('jcpremium')},
    {'src': make_src('archived-images-kauctioncar'), 'dst': make_dst('kauctioncar')},
    {'src': make_src('archived-images-usedcarmall'), 'dst': make_dst('usedcarmall')},
    {'src': make_src('archived-images-encar'), 'dst': make_dst('encar')},
]
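
Before copying anything, it may be worth checking that every `src` path in a mapping like the one above actually exists, since a missing archive directory would otherwise only surface later as an `os.listdir` error. A minimal sketch, assuming the same list-of-dicts shape; `missing_sources` is a hypothetical helper and the demo paths are temporary stand-ins.

```python
import os
import tempfile
import typing as T

def missing_sources(pairs: T.List[T.Dict[str, str]]) -> T.List[str]:
    """Return the 'src' paths that are not existing directories."""
    return [p['src'] for p in pairs if not os.path.isdir(p['src'])]

# demo: one source directory exists, the other does not
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, 'archived-images-encar', 'images')
    absent = os.path.join(tmp, 'archived-images-chacha', 'images')
    os.makedirs(present)
    pairs = [
        {'src': present, 'dst': os.path.join(tmp, 'out', 'encar')},
        {'src': absent, 'dst': os.path.join(tmp, 'out', 'chacha')},
    ]
    missing = missing_sources(pairs)
    missing_names = [os.path.basename(os.path.dirname(m)) for m in missing]

print(missing_names)
```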

In [3]:
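# map each scraped vehicle name (column 0) to its encar category (column 1)
to_encar_categories = {}
with open('vehicle_names_refined.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        # skip blank rows and rows that lack a category column
        if len(row) >= 2 and row[0]:
            to_encar_categories[row[0]] = row[1]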
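
To illustrate how the name-to-category mapping is built, here is a small self-contained sketch that reads CSV rows from memory instead of from `vehicle_names_refined.csv`. The helper `load_category_mapping` and the sample scraped names are hypothetical. The two-column layout (scraped name, encar category) is inferred from the cell above.

```python
import csv
import io
import typing as T

def load_category_mapping(lines: T.Iterable[str]) -> T.Dict[str, str]:
    """Build {scraped folder name: encar category} from CSV rows."""
    mapping: T.Dict[str, str] = {}
    for row in csv.reader(lines):
        # skip blank rows and rows without both columns filled in
        if len(row) < 2 or not row[0] or not row[1]:
            continue
        mapping[row[0]] = row[1]
    return mapping

# hypothetical sample rows, not taken from vehicle_names_refined.csv
sample = io.StringIO(
    "peugeot-2008,푸조_2008_2008(13년~현재)\n"
    "\n"
    "peugeot-206cc,푸조_206_206CC(00~08년)\n"
    "unmapped-name\n"
)
mapping = load_category_mapping(sample)
print(mapping)
```

Rows with no category are dropped here, so the corresponding folders would simply be skipped by a later `if n in mapping` check.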

In [None]:
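all_sources: T.List[T.Dict] = []

for obj in src_and_dst:
    # remove the macOS metadata file so it is not treated as a vehicle class
    ds_store = os.path.join(obj['src'], '.DS_Store')
    if os.path.exists(ds_store):
        os.remove(ds_store)

    # create the company directory (and images-by-company itself) if missing
    os.makedirs(obj['dst'], exist_ok=True)

    # one entry per vehicle-class directory found under this company's source
    all_sources += [{
        'src': os.path.join(obj['src'], n),
        'dst': obj['dst'],
        'name': n,
    } for n in os.listdir(obj['src'])]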
    


In [None]:
for obj in all_sources:
    n = obj['name']
    if n in to_encar_categories:
        # create a new path that is mapped to the encar category
        # (named class_dst so it does not overwrite the global dst used by make_dst)
        class_dst = os.path.join(obj['dst'], to_encar_categories[n])
        # create a directory with that path if it does not yet exist
        os.makedirs(class_dst, exist_ok=True)
        paths = [{'path': os.path.join(obj['src'], fn), 'name': fn} for fn in os.listdir(obj['src'])]
        random.shuffle(paths)
        for img_obj in paths:
            # delete macOS metadata files instead of copying them
            if img_obj['name'] == '.DS_Store':
                os.remove(img_obj['path'])
                continue
            # skip images that were already copied on a previous run
            if os.path.exists(os.path.join(class_dst, img_obj['name'])):
                continue
            try:
                shutil.copy2(img_obj['path'], class_dst)
            except OSError as e:
                print(e)

[Errno 5] Input/output error
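
The `[Errno 5] Input/output error` output above does not say which file failed to copy. One way to make such failures traceable is to collect the failing path alongside the error message. The sketch below is an illustration under assumptions, not the notebook's actual method: `copy_images` is a hypothetical helper, and the demo runs on temporary directories.

```python
import os
import shutil
import tempfile
import typing as T

def copy_images(src_dir: str, dst_dir: str) -> T.Tuple[int, T.List[T.Tuple[str, str]]]:
    """Copy files from src_dir into dst_dir, recording failures by path."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    failed: T.List[T.Tuple[str, str]] = []
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        # skip macOS metadata and anything that is not a regular file
        if name == '.DS_Store' or not os.path.isfile(src_path):
            continue
        # skip files that were already copied on a previous run
        if os.path.exists(os.path.join(dst_dir, name)):
            continue
        try:
            shutil.copy2(src_path, dst_dir)
            copied += 1
        except OSError as e:
            # keep the path so the failing image can be found later
            failed.append((src_path, str(e)))
    return copied, failed

# demo on throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, 'src')
    dst_dir = os.path.join(tmp, 'dst')
    os.makedirs(src_dir)
    for name in ('a.jpg', 'b.jpg', '.DS_Store'):
        with open(os.path.join(src_dir, name), 'w') as f:
            f.write('x')
    first_run = copy_images(src_dir, dst_dir)
    second_run = copy_images(src_dir, dst_dir)
    copied_names = sorted(os.listdir(dst_dir))

print(first_run, second_run, copied_names)
```

Because already-copied files are skipped, the function can be re-run after a failure and will only retry what is still missing.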


## *Discussion*

> It turns out that the greatest source of erroneous labeling was usedcarmall/Foreign vehicles. Our used car expert was able to manually relabel many of the problematic photos from this source. As a result, we are more confident that our image data set is clean and correctly labeled.