# Making a PyTorch Dataset


[Video - YouTube](https://youtu.be/6KSzXare_iU)


In [1]:
# Run this cell to download the dataset used in this practical.
# You don't need to worry about the details of this setup code.

# pip install -q torch
# pip install -q pandas
import os
import urllib.request
from typing import Tuple

if not os.path.exists('data'):
    os.makedirs('data')

if not os.path.exists('data/BostonHousing.csv'):
    print('Downloading BostonHousing.csv...')
    urllib.request.urlretrieve(
        'https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/practicals_files/de5ba7fa-5835-4f4b-87c9-58d0d0bf086a/BostonHousing.csv', 'data/BostonHousing.csv')

print('Files are downloaded.')


Files are downloaded.


#### 1. Import `torch` and `pandas`.


In [2]:
import torch
import pandas as pd


#### 2. Define a class called `BostonHousingDataset` that inherits from `torch.utils.data.Dataset`.


In [3]:
data = pd.read_csv("data/BostonHousing.csv")

columns = data.columns
features_dataframe = data.loc[:, [col for col in columns if col != "medv"]]
tensor = torch.tensor(features_dataframe.to_numpy(), dtype=torch.float32)
print(tensor.shape)
tensor[10, :].shape


torch.Size([506, 13])


torch.Size([13])

In [4]:
class BostonHousingDataset(torch.utils.data.Dataset):

    data: pd.DataFrame = None
    X: torch.Tensor = None  # Features
    Y: torch.Tensor = None  # Labels

    def __init__(self) -> None:
        super().__init__()
        self.data = pd.read_csv("data/BostonHousing.csv")

        columns = self.data.columns
        features_dataframe = self.data.loc[:, [
            col for col in columns if col != "medv"]]
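        # Build float32 tensors explicitly, going through NumPy so the pandas index is not involved.
        self.X = torch.tensor(features_dataframe.to_numpy(), dtype=torch.float32)

        self.Y = torch.tensor(self.data["medv"].to_numpy(), dtype=torch.float32)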

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor]:
        return self.X[index, :], self.Y[index]

    def __len__(self) -> int:
        return len(self.X)


#### 3. Inside the class constructor, read in the dataset csv file using `pd.read_csv`.

#### 4. Create two attributes, `self.X` and `self.Y`, and assign your features and labels to them.

The labels are in the column called `"medv"`; all the other columns are features. Convert the data to torch tensors as you assign them, and set the datatype to `float32`. You can look at the docs for `torch.tensor()` for help.
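
As a minimal sketch of the conversion described here, the snippet below splits a small DataFrame into features and labels and builds `float32` tensors from them. The DataFrame `toy` and its values are made up for illustration; only the column name `medv` comes from the practical.

```python
import pandas as pd
import torch

# Hypothetical three-row stand-in for the housing data; the values are invented.
toy = pd.DataFrame({
    "rm": [6.5, 6.4, 7.1],
    "age": [65.2, 78.9, 61.1],
    "medv": [24.0, 21.6, 34.7],
})

# Every column except the label column "medv" is treated as a feature.
features = torch.tensor(toy.drop(columns="medv").to_numpy(), dtype=torch.float32)
labels = torch.tensor(toy["medv"].to_numpy(), dtype=torch.float32)

print(features.shape, features.dtype)  # torch.Size([3, 2]) torch.float32
print(labels.shape, labels.dtype)      # torch.Size([3]) torch.float32
```

Passing `dtype=torch.float32` explicitly matters because pandas and NumPy store these columns as `float64` by default, and `torch.tensor` would otherwise keep that precision.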

#### 5. Now define the second crucial method of the dataset class: `__getitem__`.

This needs to take in an index of your dataset and return the features and label corresponding to that index.
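
To make the indexing concrete, here is a small sketch of what `__getitem__` would return for one index. The tensors `X` and `Y` are toy values chosen for illustration, not the housing data.

```python
import torch

# Toy data: 4 examples with 3 features each, plus one label per example.
X = torch.arange(12, dtype=torch.float32).reshape(4, 3)
Y = torch.tensor([10.0, 20.0, 30.0, 40.0])

index = 2
features, label = X[index, :], Y[index]  # the same expressions a __getitem__ could return

print(features)        # tensor([6., 7., 8.])
print(features.shape)  # torch.Size([3])
print(label.shape)     # torch.Size([])
```

The label comes back as a zero-dimensional tensor; the `DataLoader` later stacks these into a one-dimensional batch of labels.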

#### 6. Then, define the `__len__` method, which determines how your dataset responds to Python's built-in `len()` function.

It should return the number of rows in your dataset when called.
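
The sketch below shows `__len__` on a tiny in-memory dataset. `ToyDataset` and its default sizes are hypothetical stand-ins, used so the example runs without the CSV file.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the housing dataset, built from in-memory tensors."""

    def __init__(self, n_rows: int = 5, n_features: int = 3) -> None:
        super().__init__()
        self.X = torch.zeros(n_rows, n_features)  # features
        self.Y = torch.zeros(n_rows)              # labels

    def __getitem__(self, index: int):
        return self.X[index, :], self.Y[index]

    def __len__(self) -> int:
        # len() of a 2-D tensor is the size of its first dimension, i.e. the row count.
        return len(self.X)


toy_dataset = ToyDataset()
print(len(toy_dataset))  # 5
```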


#### 7. Finally, let's load our data into a dataloader as if we were going to perform minibatch optimisation.

Create an instance of your `BostonHousingDataset` class and pass it as an argument to an instance of the `DataLoader` class (found in `torch.utils.data`). Specify a batch size of 4, set `shuffle` to `True`, and call the instance `train_loader`.
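
To illustrate how a batch size of 4 splits a dataset, here is a small sketch on random data. It uses `TensorDataset` as a convenient stand-in for a custom dataset class; the names `toy_loader` and `batch_sizes` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 random examples with 13 features each, standing in for the real dataset.
features = torch.randn(10, 13)
labels = torch.randn(10)

toy_loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

# Collect the number of examples in each batch over one pass through the data.
batch_sizes = [len(batch_labels) for _, batch_labels in toy_loader]
print(batch_sizes)      # [4, 4, 2]
print(len(toy_loader))  # 3
```

Because `drop_last` defaults to `False`, the final batch holds whatever is left over. By the same arithmetic, the 506-row housing data with a batch size of 4 would give 127 batches, the last containing 2 examples.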


In [6]:
dataset = BostonHousingDataset()
train_loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)


#### 8. We can now test our dataloader by running the command `next(iter(train_loader))`

Print the result.
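
As a sketch of what this command does, the snippet below pulls a single batch from a loader built on toy data and unpacks it into features and labels. The tensors and the name `toy_loader` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 8 toy examples with 2 features each.
features = torch.arange(16, dtype=torch.float32).reshape(8, 2)
labels = torch.arange(8, dtype=torch.float32)
toy_loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

# iter() creates a fresh iterator over the loader; next() pulls its first batch.
batch_features, batch_labels = next(iter(toy_loader))

print(batch_features.shape)  # torch.Size([4, 2])
print(batch_labels.shape)    # torch.Size([4])
```

Each call to `iter()` starts a new pass over the data, so with `shuffle=True` repeated calls will generally return different examples.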


In [16]:
next(iter(train_loader))


[tensor([[2.9916e-01, 2.0000e+01, 6.9600e+00, 0.0000e+00, 4.6400e-01, 5.8560e+00,
          4.2100e+01, 4.4290e+00, 3.0000e+00, 2.2300e+02, 1.8600e+01, 3.8865e+02,
          1.3000e+01],
         [6.4440e+00, 0.0000e+00, 1.8100e+01, 0.0000e+00, 5.8400e-01, 6.4250e+00,
          7.4800e+01, 2.2004e+00, 2.4000e+01, 6.6600e+02, 2.0200e+01, 9.7950e+01,
          1.2030e+01],
         [5.6460e-02, 0.0000e+00, 1.2830e+01, 0.0000e+00, 4.3700e-01, 6.2320e+00,
          5.3700e+01, 5.0141e+00, 5.0000e+00, 3.9800e+02, 1.8700e+01, 3.8640e+02,
          1.2340e+01],
         [1.3522e+01, 0.0000e+00, 1.8100e+01, 0.0000e+00, 6.3100e-01, 3.8630e+00,
          1.0000e+02, 1.5106e+00, 2.4000e+01, 6.6600e+02, 2.0200e+01, 1.3142e+02,
          1.3330e+01]]),
 tensor([21.1000, 16.1000, 21.2000, 23.1000])]