# Data preprocessing

The dataset is National Flowers, which contains 9 flower classes:

- daisy
- dandelion
- lavender
- lilly
- lotus
- orchid
- rose
- sunflower
- tulip
---------

## Loading data

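The dataset ships with separate `train` and `test` folders. The `test` folder is used as-is, and the `train` folder is split 80/20 into training and validation data. This gives 3 parts: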
- 20% test
- 16% validation
- 64% train
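
The percentages follow from the two-stage split: 20% is held out for testing, and the remaining 80% is divided 80/20, so 0.8 × 0.8 = 64% and 0.8 × 0.2 = 16%. A minimal sketch of that arithmetic, assuming the test folder really holds 20% of all images. The helper `split_sizes` and the total of 1000 images are illustrative and not taken from the dataset.

```python
def split_sizes(total, test_pct=20, val_pct_of_rest=20):
    """Return (train, val, test) counts for a two-stage split.

    Integer percentages are used to avoid floating-point rounding surprises.
    """
    test = total * test_pct // 100          # held-out test portion
    rest = total - test                     # what is left for train + val
    val = rest * val_pct_of_rest // 100     # validation share of the rest
    train = rest - val                      # everything else is training data
    return train, val, test


train_n, val_n, test_n = split_sizes(1000)
print(train_n, val_n, test_n)
```

The cells below compute the training size first (`int(0.8 * n)`) and give the remainder to validation, so counts can differ by one from this sketch when the total is not evenly divisible.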

In [1]:
import os
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split, Subset

In [2]:
# Train data normalizing and augmentation settings:
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(20),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

In [3]:
# Val and test data: resizing and normalizing only, no augmentation:
val_test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
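
`Normalize` applies `(x - mean) / std` to each channel of the `[0, 1]` values produced by `ToTensor`. The mean and std used in both transforms are the widely used ImageNet channel statistics. A pure-Python sketch of the per-pixel arithmetic, where `normalize_pixel` is a hypothetical helper written for illustration and not part of torchvision:

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel means used in the transforms above
STD = [0.229, 0.224, 0.225]   # per-channel standard deviations


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose values are floats in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, MEAN, STD)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(MEAN))
# A pure white pixel ends up a little above 2 in every channel.
print(normalize_pixel([1.0, 1.0, 1.0]))
```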

In [4]:
dataset_path = "/kaggle/input/national-flowers/flowerdataset/"

In [5]:
# 1. Load the whole train folder with the training transform (augmentation on)
full_train_dataset = datasets.ImageFolder(
    root=os.path.join(dataset_path, 'train'), transform=train_transform)

In [6]:
# 2. Split train data into train and validation data
train_size = int(0.8 * len(full_train_dataset))
val_size = len(full_train_dataset) - train_size
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])

In [7]:
# 3. Rebuild the validation subset from the same indices, using the val/test transform (no augmentation)
val_dataset = Subset(datasets.ImageFolder(
    root=os.path.join(dataset_path, 'train'), transform=val_test_transform), val_dataset.indices)
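
Steps 1 to 3 rely on one set of shuffled indices being shared by two views of the same folder: the training view applies augmentation and the validation view does not. A minimal pure-Python sketch of that idea, with `random.Random` standing in for `random_split` and a hypothetical `split_indices` helper. The fixed seed is an assumption added here to make the split repeatable; the cells above do not set one.

```python
import random


def split_indices(n, train_frac=0.8, seed=0):
    """Shuffle range(n) with a fixed seed and cut it into train/val index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    train_size = int(train_frac * n)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(10)

# The two index lists never overlap, so no image is in both train and val.
print(set(train_idx).isdisjoint(val_idx))
print(len(train_idx), len(val_idx))
```

In PyTorch, passing `generator=torch.Generator().manual_seed(...)` to `random_split` gives the same repeatability.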

In [8]:
# 4. Load the test folder with the val/test transform
test_dataset = datasets.ImageFolder(
    root=os.path.join(dataset_path, 'test'), transform=val_test_transform)

## Generating data batches

In [9]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
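
With `batch_size=32` and the default `drop_last=False`, each loader yields `ceil(N / 32)` batches, and the last batch may be smaller than the others. A quick sketch of that arithmetic, assuming a non-empty dataset. The helper names and sample counts are illustrative and are not the dataset's real sizes.

```python
import math


def num_batches(n_samples, batch_size=32):
    """Number of batches a loader yields when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


def last_batch_size(n_samples, batch_size=32):
    """Size of the final batch; equals batch_size when n_samples divides evenly."""
    return n_samples % batch_size or batch_size


print(num_batches(640), last_batch_size(640))
print(num_batches(650), last_batch_size(650))
```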