## **Image Data Preprocessing for AlexNet Paper**

This script performs the following tasks:
1. Takes the data folder `my_dataset/train`, which contains 50,000 images across 100 classes.
2. Reads the folder contents and builds a DataFrame with one row per image (50,000 rows).
3. Splits the DataFrame into train, test, and validation DataFrames.

In [2]:
# Importing Libraries
import os
import pandas as pd
from sklearn.model_selection import train_test_split

**The split data are stored in separate folders. A directory is created for each set.**

In [3]:
# Source and destination folder paths

source_folder = 'my_dataset/train'  
output_folder = 'alexnet_dataset'    

# Create one output directory per split under output_folder
train_dir = os.path.join(output_folder, 'train')
test_dir = os.path.join(output_folder, 'test')
val_dir = os.path.join(output_folder, 'val')

for directory in [train_dir, test_dir, val_dir]:
    os.makedirs(directory, exist_ok=True)

**A DataFrame is created to store each image's file name, class name, and path.**

In [4]:
# Walk the class directories in the source folder and collect information about each image file

image_data = []


for class_dir in os.listdir(source_folder):
    class_path = os.path.join(source_folder, class_dir)
    if os.path.isdir(class_path):
        for img_file in os.listdir(class_path):
            if img_file.lower().endswith('.jpeg'):   # case-insensitive match; the source contains only .jpeg files
                image_data.append({
                    'path': os.path.join(class_path, img_file),
                    'class': class_dir,
                    'filename': img_file
                })

# Converting to DataFrame
image_df = pd.DataFrame(image_data)
print(f"Total images found: {len(image_df)}")

Total images found: 50000
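
If a source folder ever mixed several image formats or stray files, the extension check could be pulled into a small helper. The sketch below is illustrative only: the name `is_image_file` and the extension list are assumptions, not part of the original notebook, which only needs `.jpeg`.

```python
# Hypothetical helper: the extension list is an assumption for illustration.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")

def is_image_file(filename, extensions=IMAGE_EXTENSIONS):
    """Return True if the file name ends with one of the extensions (case-insensitive)."""
    return filename.lower().endswith(tuple(extensions))

# Made-up file names, used only to exercise the helper
names = ["n01530575_10018.JPEG", "notes.txt", "photo.JPG", ".DS_Store"]
kept = [name for name in names if is_image_file(name)]
print(kept)
```

`str.endswith` accepts a tuple of suffixes, so a single call covers every allowed extension.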


In [5]:
image_df.head(5)

| | path | class | filename |
|---|---|---|---|
| 0 | my_dataset/train\n01530575\n01530575_10018.JPEG | n01530575 | n01530575_10018.JPEG |
| 1 | my_dataset/train\n01530575\n01530575_10021.JPEG | n01530575 | n01530575_10021.JPEG |
| 2 | my_dataset/train\n01530575\n01530575_10023.JPEG | n01530575 | n01530575_10023.JPEG |
| 3 | my_dataset/train\n01530575\n01530575_10024.JPEG | n01530575 | n01530575_10024.JPEG |
| 4 | my_dataset/train\n01530575\n01530575_10039.JPEG | n01530575 | n01530575_10039.JPEG |
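
Before splitting, it can help to confirm how many classes were found and how evenly the images are spread across them, since stratified sampling depends on the `class` column. The following is a minimal sketch on a toy DataFrame; `summarize_classes` and the toy rows are hypothetical and stand in for `image_df`.

```python
import pandas as pd

def summarize_classes(df: pd.DataFrame, class_col: str = "class") -> dict:
    """Return the number of classes and the smallest/largest class size."""
    counts = df[class_col].value_counts()
    return {
        "num_classes": int(counts.size),
        "min_per_class": int(counts.min()),
        "max_per_class": int(counts.max()),
    }

# Toy stand-in for image_df (made-up rows, not the real dataset)
toy_df = pd.DataFrame({
    "class": ["a", "a", "a", "b", "b", "b"],
    "filename": [f"img_{i}.JPEG" for i in range(6)],
})
summary = summarize_classes(toy_df)
print(summary)
```

Calling `summarize_classes(image_df)` on the real data would report the class count and whether every class has the same number of images.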


In [8]:
# Split the 50,000 images into train, test, and validation sets

total_images = 50000  
train_size = 30000
test_size = 10000
val_size = 10000

train_df, temp_df = train_test_split(
    image_df, 
    train_size=train_size,
    test_size=total_images-train_size, 
    stratify=image_df['class'],    # stratified sampling keeps the class proportions the same in both parts
    random_state=42
)

In [9]:
# Split temp_df into test and validation sets
test_df, val_df = train_test_split(
    temp_df, 
    train_size=test_size,
    test_size=val_size, 
    stratify=temp_df['class'],    # stratified sampling keeps the class proportions the same in both parts
    random_state=42
)
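
After a two-stage split like this, one possible sanity check is to confirm that no image appears in more than one set and to look at the per-class counts in each set. The sketch below uses a hypothetical helper, `check_splits`, and a hand-built toy split. With the real data it would be called as `check_splits({"train": train_df, "test": test_df, "val": val_df})`.

```python
import pandas as pd

def check_splits(splits: dict, path_col: str = "path", class_col: str = "class") -> pd.DataFrame:
    """Raise if any file appears in two splits; return per-class counts for each split."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = set(splits[first][path_col]) & set(splits[second][path_col])
            if overlap:
                raise ValueError(f"{len(overlap)} files appear in both {first} and {second}")
    counts = pd.DataFrame({name: df[class_col].value_counts() for name, df in splits.items()})
    # A class missing from a split shows up as NaN; report it as 0 instead
    return counts.fillna(0).astype(int)

# Toy data: two classes with five images each (made-up paths)
toy = pd.DataFrame({
    "path": [f"{c}/{i}.JPEG" for c in "ab" for i in range(5)],
    "class": [c for c in "ab" for _ in range(5)],
})
toy_splits = {
    "train": toy.iloc[[0, 1, 2, 5, 6, 7]],
    "test": toy.iloc[[3, 8]],
    "val": toy.iloc[[4, 9]],
}
counts = check_splits(toy_splits)
print(counts)
```

In a stratified split, each row of the resulting table should show the same train/test/val ratio, up to rounding.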

In [10]:
print(f"Training set: {len(train_df)} images")
print(f"Testing set: {len(test_df)} images")
print(f"Validation set: {len(val_df)} images")

Training set: 30000 images
Testing set: 10000 images
Validation set: 10000 images
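
This section does not show how the images are written into the `train`, `test`, and `val` directories created earlier. One way to do it, sketched here under assumptions, is to copy each file listed in a split DataFrame into a per-class subfolder of the destination. The helper name `copy_split` and the `<split>/<class>/<filename>` layout are assumptions, not the notebook's confirmed method.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

def copy_split(df: pd.DataFrame, dest_dir) -> int:
    """Copy every image listed in df to dest_dir/<class>/<filename>; return the number copied."""
    dest_dir = Path(dest_dir)
    copied = 0
    # Iterate over the columns directly, since "class" is a Python keyword
    for src, cls, name in zip(df["path"], df["class"], df["filename"]):
        target_dir = dest_dir / cls
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target_dir / name)  # copy2 also preserves file metadata
        copied += 1
    return copied

# Demo in a temporary directory with one placeholder file (not a real image)
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "source" / "n000"
    src_dir.mkdir(parents=True)
    src_file = src_dir / "n000_1.JPEG"
    src_file.write_bytes(b"placeholder")
    demo_df = pd.DataFrame(
        [{"path": str(src_file), "class": "n000", "filename": "n000_1.JPEG"}]
    )
    train_target = Path(tmp) / "alexnet_dataset" / "train"
    n_copied = copy_split(demo_df, train_target)
    copied_ok = (train_target / "n000" / "n000_1.JPEG").exists()
print(n_copied, copied_ok)
```

With the real data, the calls would be along the lines of `copy_split(train_df, train_dir)`, `copy_split(test_df, test_dir)`, and `copy_split(val_df, val_dir)`. Copying rather than moving leaves `my_dataset/train` intact, so the split can be regenerated.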
