# Torch Data API Lab

In this lab, you will create your own dataset and dataloader.
This is a key part of the ML workflow with torch.
You'll have two chances to practice creating datasets and dataloaders with different types of data.
Be prepared to do some external research on things like `glob` syntax, `pathlib`, and working with `PIL.Image` objects.

Take time to inspect the objects you're working with (their types, shapes, and values) as you go through these exercises.

## Setup

In the cell below, import the `Dataset` and `DataLoader` classes from `torch`.
It's useful to remember where these are, since you'll be importing them a lot!

In [1]:
# Your work here
from torch.utils.data import Dataset, DataLoader

## Wine Dataset

This section contains no work.
Run this section to load the wine dataset into a `pd.DataFrame`.
In the first section of this lab, you'll be working with this dataframe to create datasets and dataloaders.

In [2]:
import requests
import pandas as pd
import numpy as np
import torch

In [3]:
URLS = [
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv',
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names'
]
for url in URLS:
    !wget {url}

--2021-10-03 18:23:48--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84199 (82K) [application/x-httpd-php]
Saving to: ‘winequality-red.csv’


2021-10-03 18:23:48 (825 KB/s) - ‘winequality-red.csv’ saved [84199/84199]

--2021-10-03 18:23:48--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264426 (258K) [application/x-httpd-php]
Saving to: ‘winequality-white.csv’


2021-10-03 18:23:49 (1.25 MB/s) - ‘winequality-white.csv’ saved [264426/264426]

--2021-10-03 18:23:49--  http

In [4]:
red = pd.read_csv('winequality-red.csv', delimiter=';')
red['is_red'] = 1
white = pd.read_csv('winequality-white.csv', delimiter=';')
white['is_red'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
wine = pd.concat([red, white]).reset_index(drop=True)

### Split data into training and validation sets.

Question: 
What are train and validation datasets used for in machine learning?

Answer:
* Train: the train set is used to find optimal parameters for your model.
* Validation: the validation dataset is used to detect overfitting as you train and tune hyperparameters.

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
train, valid = train_test_split(wine, random_state=42, stratify=wine.is_red)

### Create a class for your Dataset

The only argument to the dataset should be a `pd.DataFrame`.
The `WineDataset` object should implement the following methods:
* a `__len__` method, that returns the number of rows in the dataset
* a `__getitem__` method, that returns an (X, y) tuple at an index in the dataset.

In this case, your X value should be the row's feature values from the DataFrame, and the y value should be its quality rating.
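
The two-method protocol described above does not depend on pandas or torch. As a rough sketch, here is the same idea on a plain list of dicts. The `RowDataset` class, the column subset, and the two toy rows are made up for illustration and are not part of the lab's solution.

```python
class RowDataset:
    """Map-style dataset over a list of dict rows (a stand-in for a DataFrame)."""

    def __init__(self, rows, x_cols, y_col):
        self.rows = rows
        self.x_cols = x_cols
        self.y_col = y_col

    def __len__(self):
        # Number of rows in the dataset.
        return len(self.rows)

    def __getitem__(self, idx):
        # Return an (X, y) tuple for the row at position idx.
        row = self.rows[idx]
        x = [row[col] for col in self.x_cols]
        return x, row[self.y_col]


# Hypothetical toy rows, not real wine data.
toy_rows = [
    {"alcohol": 9.4, "pH": 3.51, "quality": 5},
    {"alcohol": 9.8, "pH": 3.20, "quality": 6},
]
toy_row_ds = RowDataset(toy_rows, x_cols=["alcohol", "pH"], y_col="quality")

print(len(toy_row_ds))  # 2
print(toy_row_ds[1])    # ([9.8, 3.2], 6)
```

Subclassing `torch.utils.data.Dataset` keeps the same two methods; the main difference in the lab is that the rows come from a `pd.DataFrame`.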

In [9]:
class WineDataset(Dataset):
    def __init__(self, df):
        super().__init__()
        self.df = df
        self.x_cols = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
            'pH', 'sulphates', 'alcohol']
        self.y_col = 'quality'

    def __len__(self):
        return self.df.shape[0]
    
    def __getitem__(self, idx):
        return self.df[self.x_cols].iloc[idx].values, self.df[self.y_col].iloc[idx]

In [10]:
train_ds = WineDataset(train)
valid_ds = WineDataset(valid)

In [11]:
# Sanity checks
x, y = train_ds[0]
assert isinstance(x, (np.ndarray, torch.Tensor)), """
x should be an array or tensor.
"""
assert isinstance(y, (int, float, torch.Tensor, np.number)), """
y should be some numeric type.
"""

Take a look at `x` and `y` to make sure they are in line with your expectations.

In [12]:
x

array([7.4000e+00, 2.7000e-01, 3.1000e-01, 2.4000e+00, 1.4000e-02,
       1.5000e+01, 1.4300e+02, 9.9094e-01, 3.0300e+00, 6.5000e-01,
       1.2000e+01])

In [13]:
y

4

### Create your train and validation DataLoaders from your datasets.

By now, you should have train and validation datasets instantiated.

In [14]:
train_dl = DataLoader(train_ds, batch_size=16)
valid_dl = DataLoader(valid_ds, batch_size=16)
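
To see what a `DataLoader` does with `batch_size=16`, it can help to run one over a tiny dataset whose size is not a multiple of 16. The sketch below uses `TensorDataset` and made-up toy tensors (20 samples, 2 features) rather than the wine data; all `toy_*` names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (hypothetical): 20 samples with 2 features each, plus integer targets.
toy_x = torch.arange(40, dtype=torch.float32).reshape(20, 2)
toy_y = torch.arange(20)
toy_tensor_ds = TensorDataset(toy_x, toy_y)

# Without shuffling, the loader walks the dataset in index order.
toy_tensor_dl = DataLoader(toy_tensor_ds, batch_size=16, shuffle=False)

batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in toy_tensor_dl]
print(len(toy_tensor_dl))  # 2 batches
print(batch_shapes)        # [((16, 2), (16,)), ((4, 2), (4,))]
```

The last batch holds the 4 leftover samples. For training it is common to pass `shuffle=True`, and `drop_last=True` discards an incomplete final batch if you need every batch to be the same size.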

In [15]:
# Sanity checks
for x, y in train_dl:
    break
assert isinstance(x, torch.Tensor)
assert isinstance(y, torch.Tensor)
assert x.shape[0] == y.shape[0]

## More practice with images: Dogs!

In [16]:
!pip install -Uqq fastai

### Download the dataset

In [17]:
from fastai.vision.all import untar_data, URLs, Path
from PIL import Image
import typing

In [18]:
path = untar_data(URLs.IMAGEWOOF_320)

In [19]:
list(path.ls())

[Path('/root/.fastai/data/imagewoof2-320/noisy_imagewoof.csv'),
 Path('/root/.fastai/data/imagewoof2-320/val'),
 Path('/root/.fastai/data/imagewoof2-320/train')]

In [20]:
one_image_path = (path/'train/n02099601').ls()[0]

In [21]:
train_path = path/'train'
valid_path = path/'val'

### Create some utility functions & objects we'll use later.

In [22]:
# Create a function that lists all the image files in a path.
# This function should return a list of image file paths.
# HINT: Use path.glob with glob syntax
# https://docs.python.org/3/library/glob.html
def list_image_files(path:Path):
    return list(path.glob('**/*.JPEG'))

In [23]:
# Make sure list_image_files returns only .JPEG files
assert set(i.suffix for i in list_image_files(train_path)) == {'.JPEG'}
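
If the `**` pattern is unfamiliar, this sketch shows how `Path.glob` behaves on a small throwaway directory tree. The folder and file names are invented for the example; only the standard library is used.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a tiny tree: two class folders with images, plus one non-image file.
    for rel in ["train/n01/a.JPEG", "train/n02/b.JPEG", "train/n02/notes.txt"]:
        f = root / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()

    # '**/' matches any number of nested directories, so this search is recursive.
    found = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.JPEG"))
    # Without '**', only files directly inside root are considered.
    top_level_only = list(root.glob("*.JPEG"))

print(found)           # ['train/n01/a.JPEG', 'train/n02/b.JPEG']
print(top_level_only)  # []
```

`root.rglob("*.JPEG")` is an equivalent spelling of the recursive search.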

In [24]:
# Create a function that opens an image from a path
# This function should return a PIL.Image
def open_image(path):
    return Image.open(path)

In [25]:
# Sanity check: open_image returns a PIL.Image
img = open_image(one_image_path)
assert isinstance(img, Image.Image)

In [26]:
# Create a function to label the image by the path. 
# The label in this case is the parent directory.
def label_from_image_path(path):
    return path.parent.name

In [27]:
# Sanity check: function correctly labels a single image
assert label_from_image_path(one_image_path) == 'n02099601'

In [28]:
# Create a function that resizes a PIL Image to (244, 244)
# We resize images so they're all the same shape when
# we put them into a batch. We'll learn about more
# sophisticated ways to do this when we cover CNNs.
def resize_image(img):
    return img.resize((244, 244))

In [42]:
# Sanity check: image resized to (244, 244)
img = open_image(one_image_path)
# PIL images expose .size as (width, height); they have no .shape attribute.
assert img.size != (244, 244)
assert resize_image(img).size == (244, 244)
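
One detail worth checking when you inspect images: a `PIL.Image` reports its dimensions through `.size` as `(width, height)`, while the NumPy array built from it reports `.shape` as `(height, width, channels)`. The sketch below uses a blank image created with `Image.new`, not a file from the dataset.

```python
import numpy as np
from PIL import Image

# A blank RGB image, 320 pixels wide and 200 pixels tall (toy example).
toy_img = Image.new("RGB", (320, 200))
toy_resized = toy_img.resize((244, 244))

print(toy_img.size)                 # (320, 200)    -> (width, height)
print(np.array(toy_img).shape)      # (200, 320, 3) -> (height, width, channels)
print(toy_resized.size)             # (244, 244)
print(np.array(toy_resized).shape)  # (244, 244, 3)
```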

In [30]:
# Find all the unique labels in the train set.
# We will use these to encode our labels in our Dataset.
# The labels variable should be a list of all the classes 
# (for example, 'n02115641')
labels = list(set(label_from_image_path(path) for path in list_image_files(train_path)))
labels, len(labels)

(['n02115641',
  'n02105641',
  'n02088364',
  'n02096294',
  'n02086240',
  'n02087394',
  'n02089973',
  'n02099601',
  'n02093754',
  'n02111889'],
 10)

In [44]:
# Label sanity checks
assert isinstance(labels, list)
assert len(labels) == 10
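
A note on label encoding: the iteration order of a `set` of strings can differ from one Python process to the next, so the integer assigned to each class may change between sessions. One possible way to keep the encoding stable, sketched here with a handful of made-up folder names, is to sort the class names and build a lookup dict. The names `toy_folder_names`, `toy_classes`, and `class_to_idx` are illustrative.

```python
# Hypothetical parent-folder names, one per image file.
toy_folder_names = ["n02099601", "n02086240", "n02099601", "n02115641"]

# Sorting makes the class order, and therefore each index, reproducible.
toy_classes = sorted(set(toy_folder_names))
class_to_idx = {name: i for i, name in enumerate(toy_classes)}

print(toy_classes)                # ['n02086240', 'n02099601', 'n02115641']
print(class_to_idx["n02099601"])  # 1
```

A dict lookup also avoids scanning the list on every call, which `list.index` does.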

### Create your DogDataset class.

This dataset should be instantiated with a `Path` object pointing to the train or validation directory.
The `__getitem__` method should return a tuple of `(PIL.Image.Image, int)`, where the int is the index of the label in the `labels` object we defined above.

In [31]:
class DogDataset(Dataset):
    def __init__(self, path):
        self.path = Path(path)
        self.files = list_image_files(self.path)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return open_image(self.files[idx]), labels.index(label_from_image_path(self.files[idx]))

In [32]:
train_ds = DogDataset(train_path)
valid_ds = DogDataset(valid_path)

In [41]:
# Sanity checks for the DogDataset
x, y = train_ds[0]
assert isinstance(x, Image.Image)
assert isinstance(y, int)

In [35]:
# Create a collate_fn that resizes the images and returns a batch of images and labels as tensors.
# The function should resize all the images to (244, 244) and
# normalize pixel values to the range [0, 1] by dividing by 255.
def collate_fn(batch):
    imgs, labels = tuple(zip(*batch))
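    # Stack into one (batch, height, width, channels) array first;
    # building a tensor from a list of arrays is very slow.
    imgs = np.stack([np.array(resize_image(img)) for img in imgs])
    imgs = torch.tensor(imgs) / 255.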
    labels = torch.tensor(labels)
    return imgs, labels
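
The `zip(*batch)` idiom is the step that turns a list of `(image, label)` pairs into one tuple of images and one tuple of labels. A minimal illustration with placeholder strings standing in for images:

```python
# A batch as the DataLoader would hand it to collate_fn: a list of (x, y) pairs.
toy_batch = [("img_a", 3), ("img_b", 0), ("img_c", 7)]

# Unpacking the list into zip regroups the pairs by position.
toy_imgs, toy_labels = zip(*toy_batch)

print(toy_imgs)    # ('img_a', 'img_b', 'img_c')
print(toy_labels)  # (3, 0, 7)
```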

In [36]:
# Sanity checks for collate function
one_batch = [train_ds[i] for i in range(8)]
x, y = collate_fn(one_batch)
assert isinstance(x, torch.Tensor)
assert x.max() <= 1
assert x.ndim == 4
assert isinstance(y, torch.Tensor)
assert x.shape[0] == y.shape[0]

### Create your train and validation DataLoaders

At this point, you should have everything you need to create your `DataLoader` objects for your train and validation datasets.
Your `DogDataset` object should load objects from disk into memory, and your `collate_fn` should be able to take a batch of these raw objects and convert them into tensors.
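
The division of labor between dataset and `collate_fn` can be tried out without any images. In this sketch the "dataset" is just a Python list of made-up `(features, label)` pairs, and `toy_collate` is a hypothetical collate function that converts each batch into tensors.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical raw samples: (list of floats, int label) pairs.
raw_samples = [([float(i), float(i) + 0.5], i % 3) for i in range(10)]


def toy_collate(batch):
    # batch is a list of (x, y) pairs drawn from the dataset.
    xs, ys = zip(*batch)
    return torch.tensor(xs, dtype=torch.float32), torch.tensor(ys)


# A plain list works here because it supports len() and integer indexing.
toy_raw_dl = DataLoader(raw_samples, batch_size=4, collate_fn=toy_collate)

xb, yb = next(iter(toy_raw_dl))
print(tuple(xb.shape), tuple(yb.shape))  # (4, 2) (4,)
print(yb.tolist())                       # [0, 1, 2, 0]
```

The dataset returns raw Python objects and the collate function decides how a batch becomes tensors, which is the same split used for the dog images.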

In [37]:
# Create a train and valid DataLoader where batch_size = 16.
# Don't forget to pass your collate_fn!
train_dl = DataLoader(train_ds, batch_size=16, collate_fn=collate_fn)
valid_dl = DataLoader(valid_ds, batch_size=16, collate_fn=collate_fn)

In [38]:
# Sanity checks for the dataloader
for x, y in train_dl:
    break
assert isinstance(x, torch.Tensor)
assert x.max() <= 1
assert x.ndim == 4
assert isinstance(y, torch.Tensor)
assert x.shape[0] == y.shape[0]

## Review

In this lab, we created two different dataloaders that return batches of data we could use to train an ML model.
One works with in-memory data in the form of a `pd.DataFrame`, and the other works with out-of-memory data in the form of image files.
We'll get plenty more practice with other methods of creating datasets and dataloaders throughout the course!
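
As a recap of the out-of-memory pattern, here is a small standard-library sketch: the dataset keeps only file paths in memory and reads a file when an item is requested. `LazyTextDataset` and the temporary text files are invented for illustration; the image version in this lab follows the same shape with `Image.open` in place of `read_text`.

```python
import tempfile
from pathlib import Path


class LazyTextDataset:
    """Stores file paths up front and reads file contents only on access."""

    def __init__(self, root):
        self.files = sorted(Path(root).glob("*.txt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        path = self.files[idx]
        # The file is read here, not in __init__.
        return path.read_text(), path.stem


with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a", "first"), ("b", "second")]:
        (Path(tmp) / f"{name}.txt").write_text(body)

    lazy_ds = LazyTextDataset(tmp)
    n_items = len(lazy_ds)
    second_item = lazy_ds[1]

print(n_items)      # 2
print(second_item)  # ('second', 'b')
```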