# Step 1: Data Load from Kaggle

The data load was performed locally in two notebooks:

* Car Data/CUHK_CompCars_Data_Load.ipynb
* Lyft Data/Lyft_Data_Load.ipynb

Two datasets were loaded from Kaggle:

* [CUHK CompCars](https://www.kaggle.com/datasets/renancostaalencar/compcars/data)
* [Lyft autonomous vehicles](https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles/data)
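
The notebooks that performed the download are not reproduced here, so the following is only a sketch of one way to fetch the two datasets, assuming the official `kaggle` command-line client is installed and an API token is configured. The helper name `kaggle_download_command` is hypothetical, and the destination folders are taken from the notebook paths listed above. It only builds the commands; the competition download additionally requires having accepted the competition rules on Kaggle.

```python
from urllib.parse import urlparse


def kaggle_download_command(url: str, dest: str = ".") -> list[str]:
    """Build the Kaggle CLI command that downloads the data behind a Kaggle URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 3 and parts[0] == "datasets":
        # /datasets/<owner>/<name>/... -> dataset slug "<owner>/<name>"
        slug = f"{parts[1]}/{parts[2]}"
        return ["kaggle", "datasets", "download", "-d", slug, "-p", dest]
    if len(parts) >= 2 and parts[0] == "c":
        # /c/<competition>/... -> competition name
        return ["kaggle", "competitions", "download", "-c", parts[1], "-p", dest]
    raise ValueError(f"Unrecognized Kaggle URL: {url}")


compcars_cmd = kaggle_download_command(
    "https://www.kaggle.com/datasets/renancostaalencar/compcars/data",
    dest="Car Data",
)
lyft_cmd = kaggle_download_command(
    "https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles/data",
    dest="Lyft Data",
)
print(" ".join(compcars_cmd))
print(" ".join(lyft_cmd))

# To actually run a download (needs the kaggle package and an API token):
# import subprocess
# subprocess.run(compcars_cmd, check=True)
```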

Other relevant datasets:

* [Highway Traffic - US](https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset)
* [Traffic Detection - Europe](https://www.kaggle.com/datasets/boukraailyesali/traffic-road-object-detection-dataset-using-yolo)
* [Traffic Detection](https://www.kaggle.com/datasets/saumyapatel/traffic-vehicles-object-detection)


In [1]:
# Mount google drive and set wd to shared drive

from google.colab import drive
import os

drive.mount("/content/drive", force_remount=True)
root_dir = "/content/drive/Shareddrives/ADSP Computer Vision Summer '24 Group 2"
os.chdir(root_dir)

Mounted at /content/drive


# Step 2: Unzip and handle `CompCars`

In [2]:
# Change directory to the CompCars labels folder

# os.chdir("Car Data")
os.chdir("Car Data/labels_balanced_updated")

In [None]:
! unzip -q sample_balanced.zip

In [None]:
! unzip -q labels_balanced.zip

In [3]:
! unzip -q labels_balanced_updated.zip
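
The shell `unzip` cells above are meant to be run once. As an alternative, a minimal sketch using the standard-library `zipfile` module can make the step safe to re-run by skipping files that already exist. The helper name `unzip_once` is hypothetical and is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def unzip_once(archive: str, dest: str = ".") -> int:
    """Extract `archive` into `dest`, skipping members that already exist.

    Returns the number of files written.
    """
    dest_path = Path(dest)
    written = 0
    with zipfile.ZipFile(archive) as zf:
        for member in zf.infolist():
            target = dest_path / member.filename
            # Skip directory entries and files extracted on an earlier run.
            if member.is_dir() or target.exists():
                continue
            zf.extract(member, str(dest_path))
            written += 1
    return written


# Example call, assuming the archive sits in the current directory:
# unzip_once("labels_balanced_updated.zip")
```

Because existing files are never overwritten, a partially extracted archive can be resumed, but a corrupted file from an interrupted run would need to be deleted by hand first.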

In [6]:
# List out the directory and relevant folders

from glob import glob

os.chdir(f"{root_dir}/Car Data")

images = glob("sample_balanced/**/*.*", recursive=True)
labels = glob("labels_balanced_updated/**/*.*", recursive=True)
print(f"We have {len(images):,.0f} images")
print(f"We have {len(labels):,.0f} labels")

We have 9,601 images
We have 9,601 labels


In [None]:
# Image counts by car type

for class_dir in os.listdir("sample_balanced"):
  subdir = f"sample_balanced/{class_dir}"
  print(f"{class_dir} has {len(os.listdir(subdir)):,.0f} images")

sports_convertible has 1,200 images
hatchback has 1,200 images
mini_van has 1,200 images
pickup has 1,200 images
minibus has 1,200 images
station_wagon has 1,200 images
sedan has 1,200 images
suv_crossover has 1,200 images
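
The eight class folders sum to 8 × 1,200 = 9,600 images, while the recursive glob earlier reported 9,601 matches, so one matched path is presumably something other than a class image. A small sketch for locating it is to tally the matched paths by top-level folder and file extension. The helper name `tally_files` and the example paths are hypothetical, and POSIX-style separators are assumed, as on Colab.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_files(paths):
    """Count paths per (top-level subfolder, lowercase extension)."""
    counts = Counter()
    for p in paths:
        path = PurePosixPath(p)
        # parts[0] is the root folder (e.g. "sample_balanced"); a path sitting
        # directly in the root has no class subfolder.
        folder = path.parts[1] if len(path.parts) > 2 else "<root>"
        counts[(folder, path.suffix.lower())] += 1
    return counts


# Hypothetical paths for illustration; in the notebook this would be `images`.
example = [
    "sample_balanced/sedan/a.jpg",
    "sample_balanced/sedan/b.jpg",
    "sample_balanced/notes.txt",
]
for key, n in sorted(tally_files(example).items()):
    print(key, n)
```

Any key whose folder is `<root>` or whose extension is not an image type points at the extra match.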


In [7]:
# Flush pending writes and unmount Drive; remount before running any later cells that read from it
drive.flush_and_unmount()

# Step 3: Unzip and handle `Lyft` data

In [None]:
# Change to Lyft directory

os.chdir(f"{root_dir}/Lyft Data")

In [None]:
# ! unzip -q sample.zip

In [None]:
# Load training data

import pandas as pd

df_lyft = pd.read_csv("train_filtered.csv", index_col=0)
print(f"{len(df_lyft):,.0f} objects in filtered df_lyft")
distinct_samples = df_lyft["sample_id"].drop_duplicates().tolist()
print(f"{len(distinct_samples):,.0f} distinct samples df_lyft")
df_lyft.head()

2,980 objects in filtered df_lyft
100 distinct samples df_lyft


| Unnamed: 0 | sample_id | object_id | center_x | center_y | center_z | width | length | height | yaw | class_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 4781 | 67f629809f849cf7eaaabae7478c04fba2c4c28d71c20b... | 0 | 1819.467099 | 1179.239315 | -20.038317 | 1.876 | 4.5 | 1.409 | 0.95011 | car |
| 4782 | 67f629809f849cf7eaaabae7478c04fba2c4c28d71c20b... | 1 | 1828.50223 | 1210.501083 | -18.972204 | 1.974 | 4.405 | 1.714 | -0.541565 | car |
| 4783 | 67f629809f849cf7eaaabae7478c04fba2c4c28d71c20b... | 2 | 1896.705341 | 1162.124012 | -18.485319 | 2.318 | 5.895 | 2.416 | -0.562509 | car |
| 4784 | 67f629809f849cf7eaaabae7478c04fba2c4c28d71c20b... | 3 | 1825.324182 | 1207.739823 | -19.23555 | 1.818 | 4.599 | 1.479 | -0.541565 | car |
| 4785 | 67f629809f849cf7eaaabae7478c04fba2c4c28d71c20b... | 4 | 1915.460199 | 1157.880226 | -18.565417 | 2.041 | 5.081 | 1.822 | -0.541565 | car |


In [None]:
# List out the directory and relevant folders

from glob import glob

images = glob("sample/**/*.*", recursive=True)
print(f"We have {len(images):,.0f} images")

folders = os.listdir("sample")
print(f"We have {len(folders):,.0f} folders")

assert len(distinct_samples) == len(folders)

We have 700 images
We have 100 folders
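
The assertion above only confirms that the number of folders equals the number of distinct sample IDs. A stricter check, sketched below, compares the two as sets so that a missing folder and an unrelated extra folder cannot cancel each other out. The helper name `compare_ids` and the short IDs are hypothetical.

```python
def compare_ids(expected_ids, found_names):
    """Return (missing, unexpected) between expected sample IDs and folder names."""
    expected, found = set(expected_ids), set(found_names)
    missing = sorted(expected - found)      # listed in the CSV but no folder on disk
    unexpected = sorted(found - expected)   # folder on disk but not in the CSV
    return missing, unexpected


# Hypothetical short IDs for illustration; in the notebook these would be
# distinct_samples and os.listdir("sample").
missing, unexpected = compare_ids(["a1", "b2", "c3"], ["a1", "b2", "d4"])
print(missing, unexpected)  # ['c3'] ['d4']
```

In the notebook, asserting that both returned lists are empty would replace the length comparison.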


In [None]:
# Image counts by sample

for sample_dir in os.listdir("sample"):
  subdir = f"sample/{sample_dir}"
  print(f"{sample_dir} has {len(os.listdir(subdir)):,.0f} images")

8f8b289f90de6bb46c938e727f23699bdcd4ad655ea7ee2f96c9b081e27fccb6 has 7 images
0032f9d03002a3bc68c6dea503558bd3f5851e7d0cab353a30f36049e322d401 has 7 images
fff07bea08ec6f72e156facb62f9d6c5e17634fd3bdffeb3c57e4438e161ff30 has 7 images
c869ac00bad3b9d3af35f93533b8523dd52ad136c39612ea23ccd995746433c1 has 7 images
68f3e48571ffc48d83abd26f1b7a903fe25be8d89871e463a972b0fe1f3c1384 has 7 images
257e5f8388b998cc5ddc1f92d0cd6f89db6343074b4dbc8209023ffa3369fad1 has 7 images
52822c940b62464148b4b4194bdadc1546414bd0760d2e1538a2e4aca7ad0bba has 7 images
d869a327bf0e78226e68c8afbd0739926b2f066e5beef391379a90785ccdac02 has 7 images
bd50a094c87df9c80798f44b24adabd6f53cf4a7dca0790a18b9fdb4bf99869a has 7 images
d020d5e9f62b044680c4e6db79fa44844565042a10e838cc14aa83e0e3f97338 has 7 images
c48a8a58cebb7112abe27021fc4ec22487317d7a15b4bc3b9714d1b38895cb90 has 7 images
4743979a732346be68e7175a155d3d99bbdd821c3280e71586696e74276b97a0 has 7 images
5fb3095293ab34ae67f689025f836020c1d9d80aba49eee7c28f937b1376e7f3

In [None]:
# Map each object's sample_id to its image directory and check that the directory exists

df_lyft.loc[:, "image_dir"] = [os.path.join(root_dir, "Lyft Data/sample", sample_id) for sample_id in df_lyft["sample_id"]]
for path in df_lyft["image_dir"]:
  assert os.path.isdir(path)
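
The check above confirms that each sample directory exists but not that it is complete. Since the per-sample listing shows 7 images in each folder, a further sketch could flag any folder whose file count differs from that. The helper name `folders_with_unexpected_count` is hypothetical, and the default of 7 is taken from the listing rather than from any dataset specification.

```python
import os


def folders_with_unexpected_count(root: str, expected: int = 7) -> dict[str, int]:
    """Map each subfolder of `root` whose file count differs from `expected` to its count."""
    odd = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if not os.path.isdir(path):
            continue  # ignore stray files sitting next to the sample folders
        n_files = sum(
            os.path.isfile(os.path.join(path, name)) for name in os.listdir(path)
        )
        if n_files != expected:
            odd[entry] = n_files
    return odd


# Example call from the Lyft data directory:
# assert not folders_with_unexpected_count("sample")
```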