# How To Use Mock Dataset Object


This notebook gives an overview of how to use the following classes:

1. [`MockImageClassificationGenerator`](#mockimageclassificationgenerator)
2. [`MockImageClassificationDataset`](#mockimageclassificationdataset)
3. [`MockCifar10`](#mockcifar10)

It also includes the following sections:

1. [Additional Information](#additional-information)
2. [Warnings](#warnings)
3. [Future Additions](#future-additions)


## `MockImageClassificationGenerator`

---


The `MockImageClassificationGenerator` has 4 input parameters:

- Limit
  - The total number of images in the dataset
- Labels
  - A list of labels to be used
- Image Dimensions
  - The height and width of the images
- Channels
  - The number of channels in the images


### Setting up a `MockImageClassificationGenerator`


**Import dependencies**


In [18]:
import numpy as np
from utils.MockGenerators import MockImageClassificationGenerator

**Set up generator parameters**  
These should be changed to fit **your** testing needs.  
Parameters _limit_, _labels_, and _image dimensions_ are **required**.


In [19]:
default_limit = 100
default_labels = [1, 2, 3, 4, 5]
default_image_dimensions = [32, 32]
default_channels = 3

**Create the dataset by instantiating the `MockImageClassificationGenerator`**


In [20]:
generator = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=default_image_dimensions,
    channels=default_channels,
)

**Retrieve the newly created `MockImageClassificationDataset`**

Now we confirm that the dataset is of type `MockImageClassificationDataset`.


In [21]:
mock_dataset = generator.dataset
print(type(mock_dataset))

<class 'utils.MockDatasets.MockImageClassificationDataset'>


## `MockImageClassificationDataset`

---


The `MockImageClassificationDataset` contains all of your images and labels.  
This class should not be created on its own, and should instead be created by a `MockImageClassificationGenerator` to ensure reproducibility.  
This class outputs a dictionary containing keys _image_ and _label_.


### **Accessing the data**


The data can be accessed in 3 ways:

1. Manually
2. Individually
3. Iteratively


**Manually**  
To access the entire image and label data, you can call the `images` and `labels` attributes directly from the dataset.


In [22]:
mds_images = mock_dataset.images
mds_labels = mock_dataset.labels

Now we check that `images` and `labels` have the same length as the dataset.


In [23]:
mds_len = len(mock_dataset)
images_len = len(mds_images)
labels_len = len(mds_labels)

print("Dataset length:", mds_len)
print("Images length:", images_len)
print("Labels length:", labels_len)

Dataset length: 100
Images length: 100
Labels length: 100


As you can see, the dataset, `images`, and `labels` all have a length of 100, so the two attributes expose the full dataset.


**Individually**  
We can also access individual images and labels directly. This can be done in two ways:


In [24]:
# Example 1: Grab a dictionary at a specific index containing the image-label pair
index = 0
data_0 = mock_dataset[index]
image_pair_0 = data_0["image"]
label_pair_0 = data_0["label"]
print(f"Image shape from pair at index {index}:", image_pair_0.shape)
print(f"Label from pair at index {index}:", label_pair_0)

Image shape from pair at index 0: (32, 32, 3)
Label from pair at index 0: 1


In [25]:
# Example 2: Grab the image and label individually at a specific index
index = 0
image_index_0 = mock_dataset.images[index]
label_index_0 = mock_dataset.labels[index]
print(f"Image shape from images array at index {index}:", image_index_0.shape)
print(f"Label from labels array at index {index}:", label_index_0)

Image shape from images array at index 0: (32, 32, 3)
Label from labels array at index 0: 1


Based on the shape of the image above, we can see that it is a single image from the dataset with the dimensions specified by the `img_dims` and `channels` parameters, and that the label is a single value.  
For further proof, see below.


In [26]:
# First we get the image and label at index 0
index = 0
single_image = mock_dataset.images[index]
single_label = mock_dataset.labels[index]

# Then we compare the shape of a single image to our individually accessed image
print("Single image shape:", single_image.shape)
print("Individual image shape:", image_index_0.shape)

# And do the same for the label
print("Single label:", single_label)
print("Individual label:", label_index_0)

Single image shape: (32, 32, 3)
Individual image shape: (32, 32, 3)
Single label: 1
Individual label: 1


We have now confirmed that grabbing an image or label individually gives us a single image or label.


**Iteratively**

Lastly, the data can be accessed through iterative methods (loops).  
This is possible because the dataset supports Python's iteration protocol.


In [27]:
# Example 1:
# Similar to a ML training cycle (without batching),
# we can use a for loop to iterate over the data.
for i, data in enumerate(mock_dataset):
    image = data["image"]
    label = data["label"]
    print(f"Image shape at loop iteration {i}:", image.shape)
    print(f"Label at loop iteration {i}:", label)
    break

Image shape at loop iteration 0: (32, 32, 3)
Label at loop iteration 0: 1


We have now shown that iterating over the dataset gives the same result as indexing it individually,  
and that the `enumerate` function can be used to get the index if needed.
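
The notebook does not show how the dataset implements iteration. As a minimal sketch, assuming a hypothetical `TinyDataset` class (not part of `utils`), defining `__len__` and `__getitem__` is one common way to make an object support both indexing and `for` loops:

```python
class TinyDataset:
    """Hypothetical stand-in for a mock dataset; not the library's class."""

    def __init__(self, images, labels):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Indexing past the end raises IndexError, which is what ends a for loop.
        return {"image": self.images[index], "label": self.labels[index]}


# Strings stand in for image arrays to keep the sketch dependency-free.
tiny = TinyDataset(images=["img_a", "img_b", "img_c"], labels=[1, 2, 3])

first_by_index = tiny[0]
first_by_loop = next(iter(tiny))
looped_labels = [item["label"] for item in tiny]

print(first_by_index == first_by_loop)  # True
print(looped_labels)  # [1, 2, 3]
```

The real class may instead define `__iter__`; either approach gives the looping behavior shown above.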


**Confirm image parameters equal the generator inputs**  
Now we want to confirm that the image shape is the same as our inputs to the generator.
We also want to confirm that all the labels exist in the dataset.


In [28]:
# First let's get our entire array of images
all_images = mock_dataset.images

# Then get the shape, which has the format (limit, height, width, channels)
ishape = all_images.shape

# Now we compare each dimension to their corresponding input parameter for the generator
assert ishape[0] == default_limit  # Image count equals limit parameter
assert ishape[1] == default_image_dimensions[0]  # Image height equals height parameter
assert ishape[2] == default_image_dimensions[1]  # Image width equals width parameter
assert ishape[3] == default_channels  # Image channels equals parameter channels
print("Passed")

Passed


In [29]:
# First let's get our entire array of labels
all_labels = mock_dataset.labels

# Then we can use numpy.unique to get all the unique label values and their counts
label_counts = np.unique(all_labels, return_counts=True)

# Now we check that all the labels passed to the generator are in the dataset
print("Generator Labels:", default_labels)
print("Unique labels:", label_counts[0])

# Only pass if they are the same
assert list(label_counts[0]) == default_labels
print("Passed")

# And lastly check the counts of the labels
print("Label counts", label_counts[1])

Generator Labels: [1, 2, 3, 4, 5]
Unique labels: [1 2 3 4 5]
Passed
Label counts [20 20 20 20 20]


Since these checks have passed, we can see that the data was generated as specified.


## `MockCifar10`

---


The `MockCifar10` class creates two `MockImageClassificationDataset` objects using `MockImageClassificationGenerator` objects.  
These two datasets represent the training and testing datasets used for `CIFAR10`.

This class takes no parameters and has 2 attributes (`train_dataset`, `test_dataset`).

To access these datasets, instantiate the `MockCifar10` class and read the `train_dataset` and `test_dataset` attributes of the instance.


In [30]:
from utils.MockObjects import MockCifar10

In [31]:
# Here we create a MockCifar10 object containing a mock train and test dataset
mc10 = MockCifar10()
train_cifar10 = mc10.train_dataset
test_cifar10 = mc10.test_dataset

# Let us see that each dataset is of type MockImageClassificationDataset
print("Train dataset type:", type(train_cifar10))
print("Test dataset type:", type(test_cifar10))

Train dataset type: <class 'utils.MockDatasets.MockImageClassificationDataset'>
Test dataset type: <class 'utils.MockDatasets.MockImageClassificationDataset'>


The real CIFAR-10 dataset contains 60,000 total images, typically split into 50,000 training images and 10,000 testing images.  
We can show here that the mock version is split the same way.


In [32]:
print("Training data count:", len(train_cifar10))
print("Training size:", train_cifar10.images.shape)
print("Testing data count:", len(test_cifar10))
print("Testing size:", test_cifar10.images.shape)

Training data count: 50000
Training size: (50000, 32, 32, 3)
Testing data count: 10000
Testing size: (10000, 32, 32, 3)


Each image in the dataset has a shape of (32, 32, 3).  
This means each image has a height of 32px, a width of 32px, and 3 channels (Red, Green, Blue).  
We can show that the mock has the same size.


In [33]:
for data in train_cifar10:
    img = data["image"]
    print("Image height:", img.shape[0])
    print("Image width:", img.shape[1])
    print("Image channels:", img.shape[2])
    break

Image height: 32
Image width: 32
Image channels: 3


## Additional Information

---


**Variations in the `MockImageClassificationGenerator` parameters**  
In this section, we will explain other ways to specify parameters and cases where you might do so.


In [34]:
# These were the previous parameters for the generated dataset
default_limit = 100
default_labels = [1, 2, 3, 4, 5]
default_image_dimensions = [32, 32]
default_channels = 3

**A Single Label**  
If labels are not important, or the data only has a single label, there are two ways to give that to the generator:

1. A single item list
2. An integer


In [35]:
# In this first example, we give the generator a single item list
new_label = [1]
gen_one_label = MockImageClassificationGenerator(
    limit=default_limit,
    labels=new_label,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
# We then confirm that there is only one label in the dataset
labels = gen_one_label.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Unique Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])
print("First 10 labels:", labels[:10])

Unique Labels: [1]
Label counts: [100]
First 10 labels: [1 1 1 1 1 1 1 1 1 1]


`gen_one_label.dataset.labels` contains only one unique label, and its count equals the size of our dataset, so this works as expected.  
The second way is to specify an integer. We will now show an example of that.


In [36]:
int_label = 1
gen_int_label = MockImageClassificationGenerator(
    limit=default_limit,
    labels=int_label,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
labels = gen_int_label.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Unique Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])
print("First 10 labels:", labels[:10])

Unique Labels: [1]
Label counts: [100]
First 10 labels: [1 1 1 1 1 1 1 1 1 1]


Again we can see that there is only 1 unique label.  
The label does not have to be 1; any integer works, as the next example shows.


In [37]:
int_label_42 = 42
gen_int_label = MockImageClassificationGenerator(
    limit=default_limit,
    labels=int_label_42,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
labels = gen_int_label.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])
print("First 10 labels:", labels[:10])

Labels: [42]
Label counts: [100]
First 10 labels: [42 42 42 42 42 42 42 42 42 42]


**Repeated labels**  
Repeated labels are allowed, but each repetition counts towards the number of splits. This can be used to create imbalances in the dataset.


In [38]:
repeat_labels = [1, 1, 2, 3]
gen_repeat_label = MockImageClassificationGenerator(
    limit=default_limit,
    labels=repeat_labels,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
labels = gen_repeat_label.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Unique Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])

Unique Labels: [1 2 3]
Label counts: [50 25 25]


The unique labels correctly show that there were only 3 (1, 2, 3), but _why_ does label 1 receive twice as many images?  
During generation, the limit is simply divided by the length of the labels list.  
So each of the 4 entries in the list is given 100 / 4 = 25 images, and because label 1 appears twice it receives 50.
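
The arithmetic can be reproduced without the generator. This is only a sketch of the behavior observed in the output, under the assumption that every list entry receives the same share; it is not the generator's actual code:

```python
from collections import Counter

limit = 100
labels = [1, 1, 2, 3]

# Assumption: every list entry receives the same share of the limit.
per_entry = limit // len(labels)

# Duplicate entries add up under the same label value.
counts = Counter()
for label in labels:
    counts[label] += per_entry

print(dict(counts))  # {1: 50, 2: 25, 3: 25}
```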


**Uneven Splits**  
Based on the explanation above, it would make sense to wonder what happens if the limit divided by the length of the labels is not a whole number.  
In the current implementation, the generator attempts to evenly split the labels. Let's look at a couple of examples.


In [39]:
# Example 1
# Set the label length to 3
# 100 / 3 = 33r1
labels_3 = [1, 2, 3]
gen_3_labels = MockImageClassificationGenerator(
    limit=default_limit,
    labels=labels_3,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
labels = gen_3_labels.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Unique Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])

Unique Labels: [1 2 3]
Label counts: [34 33 33]


So we can see that each label is given the same base amount, and the one leftover image is given to the first label.  
Let's now do a more extreme version, where there is a larger remainder for the division.


In [40]:
# Example 2
# Set the label length to 6
# 100 / 6 = 16r4
labels_6 = [0, 1, 2, 3, 4, 5]
gen_6_labels = MockImageClassificationGenerator(
    limit=default_limit,
    labels=labels_6,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
labels = gen_6_labels.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Unique Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])

Unique Labels: [0 1 2 3 4 5]
Label counts: [17 17 17 17 16 16]


Similar to the first example, there is a remainder after the division.  
With this more extreme example, you can see that the remainder is spread one image at a time over the first labels  
rather than all of it being given to one label.
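
One way to reproduce the counts seen in both examples is with `divmod`. The helper `split_counts` below is hypothetical and only mirrors the printed outputs; the generator may compute its split differently:

```python
def split_counts(limit, num_labels):
    """Return how many images each label slot receives (illustrative only)."""
    if num_labels <= 0:
        return []
    base, remainder = divmod(limit, num_labels)
    # The first `remainder` slots each receive one extra image.
    return [base + 1 if i < remainder else base for i in range(num_labels)]


print(split_counts(100, 3))  # [34, 33, 33]
print(split_counts(100, 6))  # [17, 17, 17, 17, 16, 16]
```

Whatever the remainder, the counts always sum back to the limit, so no images are dropped.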


**Using range instead of a list**  
One last thing you can do for the `labels` parameter is use a range instead of writing out the entire list.


In [41]:
# This will give us 5 labels [0, 1, 2, 3, 4]
range_labels = range(0, 5)
gen_range_labels = MockImageClassificationGenerator(
    limit=default_limit,
    labels=range_labels,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
labels = gen_range_labels.dataset.labels
unique_labels = np.unique(labels, return_counts=True)
print("Unique Labels:", unique_labels[0])
print("Label counts:", unique_labels[1])

Unique Labels: [0 1 2 3 4]
Label counts: [20 20 20 20 20]


As we can see, there are 5 labels, each with a count of 20, the same as if we had explicitly used a list.
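
The three accepted forms of `labels` (an integer, a list, and a range) can all be reduced to a plain list. The helper `normalize_labels` below is a hypothetical sketch of that idea, not a function from `utils`:

```python
def normalize_labels(labels):
    """Hypothetical helper: turn an int, list, tuple, or range into a list."""
    if isinstance(labels, int):
        # A single integer is treated as a one-item list.
        return [labels]
    return list(labels)


print(normalize_labels(42))  # [42]
print(normalize_labels([1, 1, 2, 3]))  # [1, 1, 2, 3]
print(normalize_labels(range(0, 5)))  # [0, 1, 2, 3, 4]
```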


**Image Dimensions**  
Next we will give examples for the `img_dims` parameter.  
Image dimensions can be given as an integer, height-width, or height-width-channel.


In [42]:
# Example 1: Using an integer for img_dims will create a square image (height == width)
image_dimensions_int = 32
gen_int_dim = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=image_dimensions_int,
    channels=default_channels,
)
print("Image dimensions using an integer:", gen_int_dim.dataset.images.shape)

# Example 2: Using a list or tuple gives the same results
image_dimensions_list = [32, 32]
gen_list_dim = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=image_dimensions_list,
    channels=default_channels,
)
image_dimensions_tuple = (32, 32)
gen_tuple_dim = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=image_dimensions_tuple,
    channels=default_channels,
)
print("Image dimensions as list:", gen_list_dim.dataset.images.shape)
print("Image dimensions from tuple:", gen_tuple_dim.dataset.images.shape)

Image dimensions using an integer: (100, 32, 32, 3)
Image dimensions as list: (100, 32, 32, 3)
Image dimensions from tuple: (100, 32, 32, 3)


Even though each method was different, the results for the image shapes were the same.  
Now we will show that supplying the channel within `img_dims` will override the `channels` parameter.


In [43]:
image_dimensions_channel = [32, 32, 2]
channels = 15
gen_channel_dim = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=image_dimensions_channel,
    channels=channels,
)
print("Image dimensions with channel:", gen_channel_dim.dataset.images.shape)

Image dimensions with channel: (100, 32, 32, 2)


The number of channels is equal to 2 even though we specified it to be 15 with the `channels` parameter. This works as intended, so be careful when supplying 3 dimensions.
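
The precedence rules for `img_dims` and `channels` can be summarized in a few lines. The helper `resolve_image_shape` below is hypothetical; it only mirrors the shapes printed in this notebook and is not the generator's own logic:

```python
def resolve_image_shape(img_dims, channels):
    """Hypothetical helper returning (height, width, channels)."""
    if isinstance(img_dims, int):
        # A single integer describes a square image.
        return (img_dims, img_dims, channels)
    dims = tuple(img_dims)
    if len(dims) == 2:
        return (dims[0], dims[1], channels)
    if len(dims) == 3:
        # A third entry takes precedence over the channels argument.
        return dims
    raise ValueError("img_dims must be an int or have 2 or 3 entries")


print(resolve_image_shape(32, 3))  # (32, 32, 3)
print(resolve_image_shape((32, 32), 3))  # (32, 32, 3)
print(resolve_image_shape([32, 32, 2], 15))  # (32, 32, 2)
```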


In all of the previous examples, the image has been square. This is not a requirement and can be easily set to any size.


In [44]:
# Create non-square dimensions
image_dimensions_nonsqr = [32, 256]
gen_nonsquare_dim = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=image_dimensions_nonsqr,
    channels=default_channels,
)
print("Non square dimensions:", gen_nonsquare_dim.dataset.images.shape)

Non square dimensions: (100, 32, 256, 3)


## Warnings

---

Note that most parameters can be set to 0 without raising a warning, which produces an empty dataset.


In [45]:
gen_0_limit = MockImageClassificationGenerator(
    limit=0,
    labels=default_labels,
    img_dims=default_image_dimensions,
    channels=default_channels,
)
print("Dataset with no images: ", gen_0_limit.dataset.images.shape)

gen_0_dims = MockImageClassificationGenerator(
    limit=default_limit, labels=default_labels, img_dims=0, channels=default_channels
)
print("Dataset with no image dimensions: ", gen_0_dims.dataset.images.shape)

gen_0_channels = MockImageClassificationGenerator(
    limit=default_limit,
    labels=default_labels,
    img_dims=default_image_dimensions,
    channels=0,
)
print("Dataset with no channels: ", gen_0_channels.dataset.images.shape)

gen_0 = MockImageClassificationGenerator(limit=0, labels=0, img_dims=0, channels=0)
print("Dataset with no data", gen_0.dataset.images.shape)

Dataset with no images:  (0, 32, 32, 3)
Dataset with no image dimensions:  (100, 0, 0, 3)
Dataset with no channels:  (100, 32, 32, 0)
Dataset with no data (0, 0, 0, 0)
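
Until the generator emits its own warnings, a caller could check the parameters before passing them in. The guard `warn_on_zero` below is a hypothetical sketch using the standard `warnings` module; it is not part of `utils`:

```python
import warnings


def warn_on_zero(**params):
    """Hypothetical guard: warn about any parameter equal to 0."""
    zero_names = [name for name, value in params.items() if value == 0]
    for name in zero_names:
        warnings.warn(f"{name} is 0; the dataset will contain no data", UserWarning)
    return zero_names


# Record the warnings so they can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_zero(limit=0, img_dims=[32, 32], channels=3)

print(flagged)  # ['limit']
print(len(caught))  # 1
```

Note that this sketch would also flag `labels=0`, which is a valid single label, so `labels` is best left out of the check.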


## Future Additions

---

- Return the dataset when calling the generator (instead of the generator object)
- Warnings when parameters are set to 0
- Function to recreate a dataset from an existing generator instead of creating a new generator
