## Datasets & DataLoaders

Code for preprocessing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity.

PyTorch provides two data primitives, <i><b>torch.utils.data.DataLoader</b></i> and <i><b>torch.utils.data.Dataset</b></i>, that allow you to use pre-loaded datasets as well as your own data.

<i><b>Dataset</b></i> stores the samples and their corresponding labels, and <b><i>DataLoader</i></b> wraps an iterable around the <i><b>Dataset</b></i> to enable easy access to the samples.


PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that subclass <b><i>torch.utils.data.Dataset</i></b> and implement functions specific to the particular data. They can be used to prototype and benchmark your model.

To learn how to load our own data, we will use the California Housing dataset, fetched through scikit-learn.

In [5]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

In [6]:
X.head()

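|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |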


In [7]:
y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

We are now in the deep-learning realm, so let's cast everything into PyTorch tensors!

In [8]:
import torch

torch.set_printoptions(sci_mode=False, linewidth=300)


In [9]:
X = torch.from_numpy(X.to_numpy()).float()
y = torch.from_numpy(y.to_numpy()).float()

In [10]:
# 20640 samples and 8 features
X.size(), y.size()

(torch.Size([20640, 8]), torch.Size([20640]))

A note on dtype: in NumPy, the default <i><b>dtype</b></i> is usually <i><b>np.float64</b></i> (also known as the <i><b>double</b></i> type). But in PyTorch the default is <i><b>torch.float32</b></i> (also known as the <i><b>float</b></i> type).

Computations in double precision are often much slower and more memory-intensive than those in single precision, even on a GPU. Moreover, the additional precision offered by <i><b>double</b></i> usually doesn't matter (i.e., it won't affect the evaluation metrics). That's why we cast <i><b>X</b></i> and <i><b>y</b></i> to <i><b>torch.float32</b></i> using <i><b>.float()</b></i>.
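The dtype difference is easy to check on a small array. The sketch below uses a made-up two-element array (an assumption for illustration, not the housing data) to show that `torch.from_numpy` keeps NumPy's `float64`, and that `.float()` halves the storage per element.

```python
import numpy as np
import torch

# Hypothetical toy values, used only to illustrate the dtype behaviour
arr = np.array([4.526, 3.585])   # NumPy defaults to float64 for Python floats
t64 = torch.from_numpy(arr)      # from_numpy preserves the NumPy dtype
t32 = t64.float()                # cast to torch.float32

print(arr.dtype, t64.dtype, t32.dtype)          # float64 torch.float64 torch.float32
print(t64.element_size(), t32.element_size())   # bytes per element: 8 4
```

Half the bytes per element means half the memory for the same tensor shape, which is usually the first practical benefit you notice from the cast.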

### Creating a custom Dataset

When your data is already stored as tensors, TensorDataset comes in handy.

In [11]:
from torch.utils.data import TensorDataset

cal_housing = TensorDataset(X,y)

Why do we need a Dataset? Because it allows us to index into our tensors and retrieve (features, label) pairs.

In [12]:
cal_housing[0] 

(tensor([   8.3252,   41.0000,    6.9841,    1.0238,  322.0000,    2.5556,   37.8800, -122.2300]),
 tensor(4.5260))

In [13]:
cal_housing[:5] 

(tensor([[     8.3252,     41.0000,      6.9841,      1.0238,    322.0000,      2.5556,     37.8800,   -122.2300],
         [     8.3014,     21.0000,      6.2381,      0.9719,   2401.0000,      2.1098,     37.8600,   -122.2200],
         [     7.2574,     52.0000,      8.2881,      1.0734,    496.0000,      2.8023,     37.8500,   -122.2400],
         [     5.6431,     52.0000,      5.8174,      1.0731,    558.0000,      2.5479,     37.8500,   -122.2500],
         [     3.8462,     52.0000,      6.2819,      1.0811,    565.0000,      2.1815,     37.8500,   -122.2500]]),
 tensor([4.5260, 3.5850, 3.5210, 3.4130, 3.4220]))

In [14]:
len(cal_housing)

20640
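When your data does not fit the `TensorDataset` mould, you can subclass `Dataset` yourself. A map-style dataset only needs `__len__` and `__getitem__`. The sketch below is a minimal illustration of that pattern; the class name `HousingDataset` and the toy tensors are hypothetical and stand in for the real `X` and `y`.

```python
import torch
from torch.utils.data import Dataset

class HousingDataset(Dataset):
    """Hypothetical map-style dataset wrapping a feature tensor and a label tensor."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.features)

    def __getitem__(self, idx):
        # Return one (features, label) pair for the given index
        return self.features[idx], self.labels[idx]

# Toy tensors with made-up values: 4 samples, 3 features each
toy_X = torch.arange(12, dtype=torch.float32).reshape(4, 3)
toy_y = torch.tensor([0.5, 1.5, 2.5, 3.5])

toy_ds = HousingDataset(toy_X, toy_y)
print(len(toy_ds))   # 4
print(toy_ds[1])     # the second sample as a (features, label) tuple
```

A hand-written class like this becomes worthwhile once `__getitem__` has to do real work, such as reading a file from disk or applying a transform per sample.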

We can split our dataset into training and testing sets using PyTorch's <i><b>random_split</b></i> function.

In [15]:
import math
from torch.utils.data import random_split

train_frac = 0.73 
train_size = math.floor(train_frac * len(cal_housing)) # floor because the size has to be an int
test_size = len(cal_housing) - train_size

cal_housing_train, cal_housing_test = random_split(cal_housing, (train_size, test_size))

In [17]:
len(cal_housing_train), len(cal_housing_test)

(15067, 5573)
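Note that `random_split` draws a new random permutation on every call, so the split above changes between runs. If you need the same split each time, one option is to pass a seeded `torch.Generator`. The sketch below demonstrates this on a toy 10-sample dataset; the helper name `split` and the seed value are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 10 samples (made-up values, not the housing data)
toy_ds = TensorDataset(torch.arange(10).float().unsqueeze(1),
                       torch.arange(10).float())

def split(seed):
    # A dedicated generator keeps the split independent of the global RNG state
    gen = torch.Generator().manual_seed(seed)
    return random_split(toy_ds, [7, 3], generator=gen)

train_a, test_a = split(seed=0)
train_b, test_b = split(seed=0)

print(len(train_a), len(test_a))             # 7 3
print(train_a.indices == train_b.indices)    # True: same seed, same split
```

Each returned object is a `Subset` that stores indices into the original dataset, so no data is copied by the split.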

### Preparing your data for training with DataLoaders

The <i><b>Dataset</b></i> retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's <i><b>multiprocessing</b></i> to speed up data retrieval.

<i><b>DataLoader</b></i> is an iterable that abstracts this complexity for us in an easy API.
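To make the minibatching and reshuffling idea concrete, here is a plain-Python sketch of the core loop. It is a conceptual illustration only, not how `DataLoader` is implemented: the real class also collates samples into tensors and can fetch them with worker processes. The function name `iterate_minibatches` and the toy samples are hypothetical.

```python
import random

def iterate_minibatches(samples, batch_size, shuffle=True, rng=None):
    """Yield lists of samples, batch_size at a time (the last batch may be smaller)."""
    if rng is None:
        rng = random.Random()
    indices = list(range(len(samples)))
    if shuffle:
        rng.shuffle(indices)   # a fresh order on every call, i.e. every epoch
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield [samples[i] for i in batch_idx]

# Toy (features, label) pairs, made up for illustration
toy_samples = [([float(i), float(i) * 2], float(i)) for i in range(10)]

batches = list(iterate_minibatches(toy_samples, batch_size=4, rng=random.Random(0)))
print([len(b) for b in batches])   # [4, 4, 2]
```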


In [18]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(cal_housing_train, batch_size=64, shuffle=True)
test_dataloader = DataLoader(cal_housing_test, batch_size=64, shuffle=True)
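The number of batches a `DataLoader` yields per pass is the dataset size divided by `batch_size`, rounded up, so the final batch is usually smaller than the rest. The sketch below checks this on a toy dataset with made-up values and shows the `drop_last` option, which discards the incomplete final batch.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples with 3 features each (made-up values)
toy_ds = TensorDataset(torch.randn(10, 3), torch.randn(10))

loader = DataLoader(toy_ds, batch_size=4, shuffle=True)
print(len(loader) == math.ceil(len(toy_ds) / 4))    # True: 3 batches
print([len(labels) for _, labels in loader])         # [4, 4, 2]

# drop_last=True discards the incomplete final batch
loader_full = DataLoader(toy_ds, batch_size=4, shuffle=True, drop_last=True)
print([len(labels) for _, labels in loader_full])    # [4, 4]
```

Applying the same arithmetic to the split above, 15067 training samples with a batch size of 64 give 236 batches, the last holding 27 samples, and 5573 test samples give 88 batches, the last holding 5.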

#### Iterate through the DataLoader

We have loaded the dataset into the <i><b>DataLoader</b></i> and can iterate through it as needed. Each iteration below returns a batch of <i><b>train_features</b></i> and <i><b>train_labels</b></i> (containing <i><b>batch_size = 64</b></i> features and labels, respectively). Because we specified <i><b>shuffle=True</b></i>, the data is reshuffled after we have iterated over all batches.

In [26]:
train_features, train_labels = next(iter(train_dataloader))


print(train_features)
print(train_labels)

print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")

tensor([[     4.0223,     33.0000,      5.0475,      1.0289,   1601.0000,      3.3079,     34.3000,   -118.4300],
        [     4.1717,     33.0000,      4.9853,      1.1187,   2701.0000,      1.8007,     34.1100,   -118.3500],
        [     9.3092,     39.0000,      7.2343,      1.1162,    975.0000,      1.8571,     34.1000,   -118.3800],
        [     2.5132,     44.0000,      4.9402,      1.0370,   1133.0000,      3.2279,     34.0600,   -118.1900],
        [     2.1513,      8.0000,      8.1178,      1.5589,    786.0000,      2.6465,     40.6100,   -121.6700],
        [     1.6685,     28.0000,      3.7382,      0.9717,   1362.0000,      3.2123,     38.4300,   -122.7000],
        [     2.3382,     19.0000,      4.0599,      1.0526,   1438.0000,      2.6098,     37.9700,   -122.3400],
        [     6.6204,     16.0000,      6.7293,      0.9658,   2464.0000,      3.2378,     37.7500,   -121.9400],
        [     4.1449,     37.0000,      5.7692,      1.0533,   1183.0000,      3.5000,  

<i><b>next</b></i> + <i><b>iter</b></i> is useful when you want to examine a batch of data (e.g., to check if you have implemented your data-loading functions correctly).

But a more common paradigm is to use your <i><b>DataLoader</b></i> with a <i><b>for</b></i> loop, so that you can continuously step through your batches.

In [29]:
for features, labels in test_dataloader:
    print(f"Feature batch shape: {features.size()}")
    print(f"Labels batch shape: {labels.size()}")

Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: torch.Size([64])
Feature batch shape: torch.Size([64, 8])
Labels batch shape: tor