# 1.0 Data & Preliminary Analysis

This Jupyter notebook contains the data and preliminary analysis for a deep learning image classification project. It imports the required libraries (TensorFlow, Pandas, Matplotlib, and Seaborn) and maps ImageNet class identifiers to human-readable labels. It then analyses and visualises class imbalance in the training and validation datasets using class counts and percentages, and summarises the distribution of image file sizes within each class of the training dataset. Each step is commented in the code, and the visualisations are produced with Seaborn and Matplotlib.

The notebook is structured into three main sections:
1. **Importing Required Libraries:** This section imports necessary libraries such as TensorFlow, Pandas, Matplotlib, and Seaborn.
2. **Mapping ImageNet Class Index to Human-Readable Labels:** This section maps the ImageNet class index to human-readable labels using a JSON file obtained from the internet.
3. **Class Distribution Analysis:** This section analyses and visualises class imbalances in the training and validation datasets using class counts and percentages. It also analyses the distribution of image file sizes within each class in the training dataset.

This notebook serves as a starting point for further investigation and modelling in this project: it provides essential insights into the class distribution and file sizes of the images that will be used to train the image classification model. These insights will help guide later decisions in the development and deployment of the model.

## 1.1 Importing Required Libraries

In [None]:
import os
import json
import glob
import pandas as pd
import seaborn as sns
import tensorflow as tf
from packaging import version
import matplotlib.pyplot as plt

In [None]:
# Check if the TensorFlow version is 2.10.0; if not, raise an AssertionError with a message indicating 
# the required version
assert version.parse(tf.__version__) == version.parse('2.10.0'), "Please install TF version 2.10.0. Current version: " + str(tf.__version__)

## 1.2 Mapping ImageNet Class Index to Human-Readable Labels

> The ImageNet metadata JSON file was obtained from: https://www.kaggle.com/keras/resnet50

In [None]:
# Load 'imagenet_class_index.json', which maps each class index (as a string)
# to a [WordNet ID, human-readable label] pair
with open('imagenet_class_index.json') as f:
    idx2imagenet = json.load(f)

# Display the contents
idx2imagenet

In [None]:
# Invert the mapping so that it is keyed by WordNet ID (the first element of each
# value in 'idx2imagenet'). Each new value is a list of the original class index
# and the human-readable label (the second element of each value in 'idx2imagenet')
imagenet2idx = {v[0]: [k, v[1]] for k, v in idx2imagenet.items()}

# Display the contents
imagenet2idx
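
To make the shape of the two mappings concrete, the following minimal sketch repeats the inversion on a two-entry stand-in for the JSON file. The two entries follow the layout of the Keras `imagenet_class_index.json` file (class index to `[WordNet ID, label]`); the names `toy_idx2imagenet` and `toy_imagenet2idx` are illustrative and do not appear elsewhere in this notebook.

```python
# Toy stand-in for the loaded JSON: class index -> [WordNet ID, human-readable label]
toy_idx2imagenet = {
    "0": ["n01440764", "tench"],
    "1": ["n01443537", "goldfish"],
}

# Invert it: WordNet ID (used as the dataset folder name) -> [class index, label]
toy_imagenet2idx = {v[0]: [k, v[1]] for k, v in toy_idx2imagenet.items()}

# Look up a folder name to recover both the class index and the label
index, label = toy_imagenet2idx["n01440764"]
print(index, label)
```

Note that element `[0]` of each inverted value is the class index (still a string, as in the JSON) and element `[1]` is the human-readable label.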

## 1.3 Class Distribution Analysis

### 1.3.1 Class Counts

In [None]:
# Define file paths for training and validation datasets
TRAIN_PATH = "imageset/train"
VAL_PATH = "imageset/val"

# Create empty dictionaries for storing class counts of the datasets
TRAIN_CLASS_COUNTS = {}
VAL_CLASS_COUNTS = {}

# Loop through directories within the training path
for folder_name in os.listdir(TRAIN_PATH):
    # Retrieve the human-readable class label (element [1] of the mapping value),
    # using the folder name (WordNet ID) as the key
    class_name = imagenet2idx[folder_name][1]
    # Count the number of image files within the folder that have a .JPEG extension
    image_count = len(glob.glob(os.path.join(TRAIN_PATH, folder_name, "*.JPEG")))
    # Add the count to the TRAIN_CLASS_COUNTS dictionary
    TRAIN_CLASS_COUNTS[class_name] = image_count

# Loop through directories within the validation path
for folder_name in os.listdir(VAL_PATH):
    # Retrieve the human-readable class label (element [1] of the mapping value),
    # using the folder name (WordNet ID) as the key
    class_name = imagenet2idx[folder_name][1]
    # Count the number of image files within the folder that have a .JPEG extension
    image_count = len(glob.glob(os.path.join(VAL_PATH, folder_name, "*.JPEG")))
    # Add the count to the VAL_CLASS_COUNTS dictionary
    VAL_CLASS_COUNTS[class_name] = image_count

In [None]:
# Print the number of images per class in the training dataset
print("Number of images per class in the training dataset:")
TRAIN_CLASS_COUNTS

In [None]:
# Print the number of images per class in the validation dataset
print("Number of images per class in the validation dataset:")
VAL_CLASS_COUNTS
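
As a complement to reading the raw counts, class imbalance can be summarised in a single number: the largest class count divided by the smallest. The sketch below is one possible way to compute it; the helper `imbalance_ratio` and the counts in `example_counts` are hypothetical and are not results from this dataset.

```python
def imbalance_ratio(class_counts):
    """Return the largest class count divided by the smallest (1.0 means balanced)."""
    counts = list(class_counts.values())
    if not counts or min(counts) == 0:
        raise ValueError("need at least one class and no empty classes")
    return max(counts) / min(counts)


# Made-up counts for illustration only
example_counts = {"class_a": 900, "class_b": 950, "class_c": 300}
print(f"Imbalance ratio: {imbalance_ratio(example_counts):.2f}")
```

In this notebook, the same helper could be applied to `TRAIN_CLASS_COUNTS` and `VAL_CLASS_COUNTS` once they have been populated.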

In [None]:
# Set the default seaborn theme
sns.set_theme()

# Set the context for plotting
sns.set_context("paper")

# Create a figure with a size of 9x5 inches
fig = plt.figure(figsize=(9, 5))

# Create the subplot for the training split class counts
ax1 = fig.add_subplot(1, 2, 1)
ax1.set_xlabel('Class Names')
ax1.set_ylabel('Class Counts')
ax1.set_title('Training Split Class Counts')
ax1.bar(range(len(TRAIN_CLASS_COUNTS)), list(TRAIN_CLASS_COUNTS.values()), tick_label=list(TRAIN_CLASS_COUNTS.keys()))
ax1.tick_params(axis='x', rotation=90)

# Create the subplot for the validation split class counts
ax2 = fig.add_subplot(1, 2, 2)
ax2.set_xlabel('Class Names')
ax2.set_ylabel('Class Counts')
ax2.set_title('Validation Split Class Counts')
ax2.bar(range(len(VAL_CLASS_COUNTS)), list(VAL_CLASS_COUNTS.values()), tick_label=list(VAL_CLASS_COUNTS.keys()))
ax2.tick_params(axis='x', rotation=90)

# Adjust the subplot layout for better spacing
fig.tight_layout()

# Save the figure to a PNG file, creating the output directory if it does not exist
os.makedirs('figures', exist_ok=True)
plt.savefig('figures/class_count.png', dpi=300)

# Display the plot
plt.show()

### 1.3.2 Class Percentages

In [None]:
# Calculate total number of training and validation images
total_train_images = sum(TRAIN_CLASS_COUNTS.values())
total_val_images = sum(VAL_CLASS_COUNTS.values())

# Calculate the percentage of each class in the training and validation sets
train_percentages = [(count / total_train_images) * 100 for count in TRAIN_CLASS_COUNTS.values()]
val_percentages = [(count / total_val_images) * 100 for count in VAL_CLASS_COUNTS.values()]

In [None]:
# Print the percentage of each class in the training dataset
print("Training dataset class percentages:")
for class_name, percentage in zip(TRAIN_CLASS_COUNTS.keys(), train_percentages):
    print(f"{class_name}: {percentage:.2f}%")

In [None]:
# Print the percentage of each class in the validation dataset
print("Validation dataset class percentages:")
for class_name, percentage in zip(VAL_CLASS_COUNTS.keys(), val_percentages):
    print(f"{class_name}: {percentage:.2f}%")

In [None]:
# Combine the training and validation class percentages into a single DataFrame
class_percentages_df = pd.DataFrame({
    'Class': list(TRAIN_CLASS_COUNTS.keys()) + list(VAL_CLASS_COUNTS.keys()),
    'Percentage': train_percentages + val_percentages,
    'Dataset': ['Training'] * len(TRAIN_CLASS_COUNTS) + ['Validation'] * len(VAL_CLASS_COUNTS)
})

# Set the default seaborn theme
sns.set_theme()

# Set the context for plotting
sns.set_context("paper")

# Create a figure with a size of 9x5 inches
fig = plt.figure(figsize=(9, 5))

# Create the subplot for the training and validation dataset class percentages
ax = fig.add_subplot(1, 1, 1)
sns.barplot(x='Class', y='Percentage', hue='Dataset', data=class_percentages_df, ax=ax)
ax.set_xlabel('Class Names')
ax.set_ylabel('Class Percentages')
ax.set_title('Class Percentages in Training and Validation Datasets')
ax.tick_params(axis='x', rotation=90)

# Adjust the subplot layout for better spacing
fig.tight_layout()

# Save the figure to a PNG file, creating the output directory if it does not exist
os.makedirs('figures', exist_ok=True)
plt.savefig('figures/class_percentage.png', dpi=300)

# Display the plot
plt.show()

### 1.3.3 Image File Size

In [None]:
# Loop through directories within the training path
for folder_name in os.listdir(TRAIN_PATH):
    # Retrieve the human-readable class label (element [1] of the mapping value),
    # using the folder name (WordNet ID) as the key
    class_name = imagenet2idx[folder_name][1]
    # Get a list of the image file paths within the folder that have a .JPEG extension
    image_paths = glob.glob(os.path.join(TRAIN_PATH, folder_name, "*.JPEG"))
    # Collect the on-disk size of each image file in bytes
    image_sizes = [os.path.getsize(path) for path in image_paths]
    # Summarise the distribution (count, mean, std, min, quartiles, max)
    print(f"Class {class_name} image file size statistics (bytes):")
    print(pd.Series(image_sizes).describe())
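
The per-class printouts above are hard to compare side by side. One alternative, sketched here as an assumption rather than as part of the original analysis, is to collect every file size into a single DataFrame and summarise it with one `groupby`. The helper name `summarise_file_sizes`, the `size_kib` column, and the choice of KiB as the unit are all illustrative; the result is keyed by folder name rather than by human-readable label to keep the sketch self-contained.

```python
import glob
import os

import pandas as pd


def summarise_file_sizes(root, pattern="*.JPEG"):
    """Return per-folder file-size statistics (in KiB) for files matching `pattern`."""
    rows = []
    for folder_name in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder_name)
        if not os.path.isdir(folder_path):
            continue  # skip stray files that are not class folders
        for path in glob.glob(os.path.join(folder_path, pattern)):
            rows.append({"folder": folder_name, "size_kib": os.path.getsize(path) / 1024})
    sizes = pd.DataFrame(rows, columns=["folder", "size_kib"])
    # One row per folder with count, mean, std, min, quartiles, and max
    return sizes.groupby("folder")["size_kib"].describe()
```

In this notebook the helper would be called as `summarise_file_sizes(TRAIN_PATH)`, which returns one table with a row per class folder.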