# Image Prediction - Properly load any image dataset as ImageDataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/new/docs/tutorials/image_prediction/dataset.ipynb)
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/new/docs/tutorials/image_prediction/dataset.ipynb)

:label:`sec_imgdataset`


Preparing a dataset for `ImagePredictor` is not difficult. This tutorial introduces the
recommended ways to initialize the dataset, so that you have a smoother experience using `autogluon.vision.ImagePredictor`.

There are generally three ways to load a dataset for ImagePredictor:

- Load a csv file or construct your own pandas `DataFrame` with `image` and `label` columns

- Load an image folder directly with `ImageDataset`

- Convert a list of images into a dataset directly with `ImageDataset`

We will go through these three methods one by one. First of all, let's import AutoGluon:

In [None]:
%matplotlib inline
import autogluon.core as ag
from autogluon.vision import ImageDataset
import pandas as pd

## Load a csv file or construct a DataFrame object

We use a csv file from the PetFinder competition as an example. You may use any tabular data as long as you can
create an `image` column (absolute or relative paths to images) and a `label` column (category for each image).

In [None]:
csv_file = ag.utils.download('https://autogluon.s3-us-west-2.amazonaws.com/datasets/petfinder_example.csv')
df = pd.read_csv(csv_file)
df.head()
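If you do not have a csv file yet, the sketch below shows one way to write a file with the two required columns using only the standard library. The record list and the in-memory buffer are illustrative assumptions; in practice you would write to a real file and fill in your own paths and labels.

```python
import csv
import io

# Hypothetical (image path, label) pairs; replace with your own data
records = [('images/cat_001.jpg', 'cat'), ('images/dog_001.jpg', 'dog')]

# An in-memory buffer stands in for open('my_dataset.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['image', 'label'])  # header row with the two expected column names
writer.writerows(records)            # one row per image

csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way can then be read with `pd.read_csv`, just like the PetFinder example above.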

If the image paths are not relative to the current working directory, you may use the `root` argument of `ImageDataset.from_csv` to prepend a prefix to each image path. Using absolute paths reduces the chance of an `OSError` during file access:

In [None]:
df = ImageDataset.from_csv(csv_file, root='/home/ubuntu')
df.head()

Or you can apply the prefix yourself:

In [None]:
import os
df['image'] = df['image'].apply(lambda x: os.path.join('/home/ubuntu', x))
df.head()
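When a column mixes relative and absolute paths, it can help to make the intent explicit. The helper below is a minimal sketch, not part of AutoGluon: `prepend_root` is a hypothetical name, and the example paths are made up for illustration.

```python
import os

def prepend_root(path, root):
    """Return `path` unchanged if it is already absolute, else join it onto `root`."""
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(root, path))

# Hypothetical example paths, not taken from the PetFinder csv
paths = ['images/a.jpg', '/data/pets/b.jpg']
fixed = [prepend_root(p, '/home/ubuntu') for p in paths]
print(fixed)
```

The same function can be passed to `df['image'].apply(...)` through a lambda, in place of the bare `os.path.join` call shown above.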

Otherwise you may use the `DataFrame` as-is. `ImagePredictor` will apply automatic conversion during `fit` to ensure the other metadata is available for training. The `DataFrame` may contain additional columns; `ImagePredictor` only uses the `image` and `label` columns during training.

## Load an image directory

It is common to have only a folder of images, organized into subfolders named after the categories. Looping through such a folder recursively is tedious, so you can use `ImageDataset.from_folders` or `ImageDataset.from_folder` instead of implementing the search yourself.

The difference between `from_folders` and `from_folder` is the folder structure each one expects.
If you have a folder with splits, e.g., `train`, `val`, `test`, like:

- root/train/car/0001.jpg
- root/train/car/xxxa.jpg
- root/val/bus/123.png
- root/test/bus/023.jpg

Then you can load the splits with `from_folders`:

In [None]:
train_data, _, test_data = ImageDataset.from_folders('https://autogluon.s3.amazonaws.com/datasets/shopee-iet.zip', train='train', test='test')
print('train #', len(train_data), 'test #', len(test_data))
train_data.head()

If you have a folder without `train` or `test` root folders, like:

- root/car/0001.jpg
- root/car/xxxa.jpg
- root/bus/123.png
- root/bus/023.jpg

Then you can load the whole folder as a single dataset with `from_folder`:

In [None]:
# use the train from shopee-iet as new root
root = os.path.join(os.path.dirname(train_data.iloc[0]['image']), '..')
all_data = ImageDataset.from_folder(root)
all_data.head()
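To make the expected layout concrete, the sketch below walks a `root/<label>/<image>` tree by hand and collects `(image path, label)` pairs. It only illustrates the idea and is not AutoGluon's implementation; `scan_image_folder` and the extension list are assumptions introduced for this example.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set of image extensions

def scan_image_folder(root):
    """Collect (image_path, label) pairs from a root/<label>/<image> layout."""
    records = []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        for dirpath, _, filenames in os.walk(class_dir):
            for name in sorted(filenames):
                if name.lower().endswith(IMAGE_EXTENSIONS):
                    records.append((os.path.join(dirpath, name), label))
    return records

# Build a tiny throwaway tree with empty files to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    for rel in ('car/0001.jpg', 'car/0002.jpg', 'bus/123.png', 'bus/notes.txt'):
        path = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'wb').close()
    demo_records = scan_image_folder(tmp)
    demo_labels = [label for _, label in demo_records]

print(demo_labels)
```

Each top-level subfolder name becomes the label of the images beneath it, and non-image files such as `notes.txt` are ignored.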

In [None]:
# you can manually split the dataset or use `random_split`
train, val, test = all_data.random_split(val_size=0.1, test_size=0.1)
print('train #:', len(train), 'test #:', len(test))
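For the manual route, a split can be sketched with the standard library alone. This is an illustrative assumption about how one might do it, not how `random_split` works internally; `manual_split` and the sample file names are hypothetical.

```python
import random

def manual_split(items, val_size=0.1, test_size=0.1, seed=0):
    """Shuffle `items` and cut them into train, validation and test lists."""
    items = list(items)              # copy so the caller's list is not modified
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    rng.shuffle(items)
    n_val = int(len(items) * val_size)
    n_test = int(len(items) * test_size)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test

# Hypothetical file names standing in for the rows of a dataset
samples = [f'img_{i:03d}.jpg' for i in range(100)]
train_part, val_part, test_part = manual_split(samples)
print(len(train_part), len(val_part), len(test_part))
```

Fixing the seed keeps the three subsets stable across runs, which makes later comparisons between models easier.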

## Convert a list of images to dataset

You can create a dataset from a list of images together with a function that determines the label of each image. We use the Oxford-IIIT Pet Dataset mini pack as an example, where all images sit in a single `images` directory but follow a naming pattern: filenames of cats start with a capital letter, and all other filenames are dogs. We can therefore use a function to assign a label to each image:

In [None]:
pets = ag.utils.download('https://autogluon.s3-us-west-2.amazonaws.com/datasets/oxford-iiit-pet-mini.zip')
pets = ag.utils.unzip(pets)
image_list = [x for x in os.listdir(os.path.join(pets, 'images')) if x.endswith('jpg')]
def label_fn(x):
    return 'cat' if os.path.basename(x)[0].isupper() else 'dog'
new_data = ImageDataset.from_name_func(image_list, label_fn, root=os.path.join(os.getcwd(), pets, 'images'))
new_data
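Before building the dataset, it can be worth checking the labeling function on a few names and counting the resulting labels. The sketch below repeats `label_fn` so that it runs on its own; the file names are illustrative, written in the style of the dataset rather than read from the downloaded archive.

```python
import os
from collections import Counter

def label_fn(x):
    # Cat breeds are capitalized in the file names; everything else is a dog
    return 'cat' if os.path.basename(x)[0].isupper() else 'dog'

# Illustrative names only; the last one includes a directory prefix
sample_names = [
    'Abyssinian_1.jpg',
    'Bengal_12.jpg',
    'american_bulldog_10.jpg',
    'images/beagle_3.jpg',
]
label_counts = Counter(label_fn(name) for name in sample_names)
print(label_counts)
```

Because `label_fn` looks only at the base name, a directory prefix does not affect the result.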

## Visualize images

You can use `show_images` to visualize the images, as well as the corresponding labels:

In [None]:
new_data.show_images()

For raw `DataFrame` objects, convert them to an `ImageDataset` first in order to use `show_images`.

Congratulations, you can now proceed to :ref:`sec_imgquick` to start training the `ImagePredictor`.