# ImageNet Dataset Setup Notebook

This notebook sets up and prepares the ImageNet dataset for machine learning tasks. It covers the following steps:

1. **Downloading the Dataset**: Automates the download of the ImageNet dataset (both training and validation parts) from the official source.

2. **Extracting the Dataset**: Extracts the downloaded tar archives into a directory structure suitable for machine learning models. This includes creating a separate directory for each class in the training set.

3. **File Path Extraction**: Walks the extracted directories and compiles a list of file paths for all images. Having this list ready lets data loading during model training skip a fresh scan of the directory tree.

4. **Saving File Paths**: Saves the generated list of image file paths to a file on Google Drive. This enables easy and quick access to the dataset in future sessions or in other notebooks, particularly in model training and validation stages.

The notebook handles the data preparation for the large-scale ImageNet dataset up front, so that later stages of the project, such as model training and evaluation, can work directly from the extracted images and the saved path lists.


Mount Google Drive to current Colab session

In [None]:
from google.colab import drive
# Will provide you with an authentication link
drive.mount('/content/drive')

Run this cell only if Drive needs to be remounted

In [None]:
drive.mount('/content/drive', force_remount=True)

Set target directory for download of ImageNet

In [None]:
import os

target_dir = '/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/TrainValTar'
os.makedirs(target_dir, exist_ok=True)

Download train tar file from ImageNet

In [None]:
!wget -c https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar -P {target_dir}

Download validation tar file from ImageNet

In [None]:
!wget -c https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar -P {target_dir}

Run the script that extracts the downloaded ImageNet tar files

In [None]:
%cd scripts/
!chmod +x extract_imagenet.sh
!./extract_imagenet.sh
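
The contents of `extract_imagenet.sh` are not shown in this notebook. As a rough illustration of what the extraction step involves, the sketch below assumes the usual ILSVRC2012 layout, in which the training archive contains one inner tar per class and each inner tar holds that class's images. The function name `extract_train_archive` and its arguments are hypothetical and are not taken from the script.

```python
import os
import shutil
import tarfile


def extract_train_archive(train_tar, out_dir):
    """Unpack an outer tar of per-class tars into out_dir/<class_name>/.

    Hypothetical helper: assumes every inner archive is an uncompressed
    tar named <class_name>.tar that contains only image files.
    """
    os.makedirs(out_dir, exist_ok=True)
    class_names = []
    with tarfile.open(train_tar, mode='r:') as outer:
        for member in outer.getmembers():
            if not (member.isfile() and member.name.endswith('.tar')):
                continue
            # The inner archive name (without .tar) becomes the class directory.
            class_name = os.path.splitext(os.path.basename(member.name))[0]
            class_dir = os.path.join(out_dir, class_name)
            os.makedirs(class_dir, exist_ok=True)
            inner_file = outer.extractfile(member)
            with tarfile.open(fileobj=inner_file, mode='r:') as inner:
                for image in inner:
                    if not image.isfile():
                        continue
                    # Keep only the base name so files cannot escape class_dir.
                    target = os.path.join(class_dir, os.path.basename(image.name))
                    with inner.extractfile(image) as src, open(target, 'wb') as dst:
                        shutil.copyfileobj(src, dst)
            class_names.append(class_name)
    return sorted(class_names)
```

The inner archives are read straight from the outer tar, so no intermediate per-class tar files need to be written to Drive.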

Run the script that counts the total number of training images (due to Google Colab/Drive sync problems, not all ImageNet training images have been unpacked)

In [None]:
!sed -i 's/\r$//' count_train_images_imagenet.sh
!chmod +x count_train_images_imagenet.sh
!./count_train_images_imagenet.sh

Run the script that counts the total number of validation images (due to Google Colab/Drive sync problems, not all ImageNet validation images have been unpacked)

In [None]:
!sed -i 's/\r$//' count_val_images_imagenet.sh
!chmod +x count_val_images_imagenet.sh
!./count_val_images_imagenet.sh
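
The two counting scripts are not reproduced here. The same check could be sketched in Python as follows, comparing the number of unpacked images against the published ILSVRC2012 totals (1,281,167 training and 50,000 validation images). The helper names `count_images` and `missing_images` are hypothetical.

```python
import os

# Published ILSVRC2012 image counts.
EXPECTED_TRAIN_IMAGES = 1_281_167
EXPECTED_VAL_IMAGES = 50_000


def count_images(root_dir, extensions=('.jpg', '.jpeg')):
    """Count image files under root_dir, matching extensions case-insensitively."""
    total = 0
    for _, _, files in os.walk(root_dir):
        total += sum(1 for name in files if name.lower().endswith(extensions))
    return total


def missing_images(found, expected):
    """Return how many images are still missing (never negative)."""
    return max(expected - found, 0)
```

Calling `missing_images(count_images(...), EXPECTED_TRAIN_IMAGES)` on the extracted training directory would then give the number of images that still have to be unpacked.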

Get a list of all image paths in the training set

In [None]:
import os

def get_image_paths_1(root_dir):
    image_paths = []
    for subdir, dirs, files in os.walk(root_dir):
        for file in files:
            if file.lower().endswith(('.jpg', '.jpeg')):  # match the extension case-insensitively
                full_path = os.path.join(subdir, file)
                image_paths.append(full_path)
    return image_paths

train_dir = '/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/train'
train_image_paths = get_image_paths_1(train_dir)


Save training image paths to Google Drive

In [None]:
import json

# Save the list of paths as a JSON file
with open('/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/train_image_paths.json', 'w') as f:
    json.dump(train_image_paths, f)


Load training image paths from Google Drive

In [None]:
import json  # needed when this cell is run in a fresh session

# Load the list of paths from the JSON file
with open('/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/train_image_paths.json', 'r') as f:
    train_image_paths = json.load(f)
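
Because each training class is extracted into its own directory, the class of an image can be read from the name of its parent folder. The sketch below shows one possible way to turn the loaded path list into integer labels. `build_label_index` is a hypothetical helper, and it assumes every path has the form `<root>/<class_name>/<file>`.

```python
import os


def build_label_index(image_paths):
    """Map each path's parent directory name to an integer class index.

    Returns (class_to_idx, labels), where labels[i] belongs to image_paths[i].
    """
    parents = [os.path.basename(os.path.dirname(p)) for p in image_paths]
    # Sort the class names so the mapping is the same on every run.
    class_to_idx = {name: i for i, name in enumerate(sorted(set(parents)))}
    labels = [class_to_idx[name] for name in parents]
    return class_to_idx, labels
```

Applied to `train_image_paths`, this would yield one label per image without touching the files on Drive again.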


Get a list of all image paths in the validation set

In [None]:
import os

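def get_image_paths_2(directory):
    # ILSVRC2012 validation files use an upper-case .JPEG extension, so match case-insensitively
    return [os.path.join(directory, f) for f in os.listdir(directory) if f.lower().endswith(('.jpg', '.jpeg'))]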

val_dir = '/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/val'
val_image_paths = get_image_paths_2(val_dir)


Save validation image paths to Google Drive

In [None]:
import json

# Save the list of validation image paths as a JSON file
with open('/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/val_image_paths.json', 'w') as f:
    json.dump(val_image_paths, f)


Load validation image paths from Google Drive

In [None]:
import json  # needed when this cell is run in a fresh session

# Load the list of validation image paths from the JSON file
with open('/content/drive/MyDrive/AnomalyDetection/Datasets/ImageNet/val_image_paths.json', 'r') as f:
    val_image_paths = json.load(f)
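
The save and load cells above repeat the same pattern for the training and validation lists. As a minimal sketch, that pattern could be wrapped in a pair of helpers, together with a check for paths that no longer resolve (for example after a Drive sync problem). The names `save_paths`, `load_paths` and `find_missing` are hypothetical.

```python
import json
import os


def save_paths(paths, json_file):
    """Write a list of image paths to json_file."""
    with open(json_file, 'w') as f:
        json.dump(list(paths), f)


def load_paths(json_file):
    """Read a list of image paths back from json_file."""
    with open(json_file, 'r') as f:
        return json.load(f)


def find_missing(paths):
    """Return the paths that do not exist on disk right now."""
    return [p for p in paths if not os.path.exists(p)]
```

On Drive, checking every path of a list with more than a million entries is likely to be slow, so running `find_missing` on a sample of the list may be the more practical option.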
