# 3.3. Synthetic Regression Data

$ \newcommand{\mb}{\mathbf}$

In [1]:
import random
import torch
from d2l import torch as d2l

## 3.3.1. Generating the Dataset

We will generate a synthetic dataset $\mb{X}$ and corresponding labels $\mb{y}$ from the following linear regression model:

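$\mb{y} = \mb{X} \mb{w} + b + \mb{\epsilon}$

where $\mb{X} \in \mathbb{R}^{n \times d}$ holds the features, $\mb{w} \in \mathbb{R}^{d}$ is the weight vector, $b \in \mathbb{R}$ is a scalar bias added to every example, and $\mb{\epsilon} \in \mathbb{R}^{n}$ is additive noise. In the code below, each entry of $\mb{\epsilon}$ is drawn from a Gaussian with mean 0 and standard deviation `noise` (0.01 by default).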

In [3]:
class SyntheticRegressionData(d2l.DataModule):
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000, batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise
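
The `w.reshape((-1, 1))` call in the class above matters because of broadcasting. The following is a minimal sketch, with illustrative variable names (`y_column`, `y_broadcast`) that are not part of the original code, showing what the label tensor looks like with and without the reshape:

```python
import torch

n, d = 5, 2
X = torch.randn(n, d)
w = torch.tensor([2.0, -3.4])
b = 4.2
noise = torch.randn(n, 1) * 0.01  # column vector of shape (n, 1)

# With w reshaped to a column, X @ w has shape (n, 1), matching the noise.
y_column = torch.matmul(X, w.reshape((-1, 1))) + b + noise

# Without the reshape, X @ w has shape (n,). Adding the (n, 1) noise then
# broadcasts the result to an (n, n) matrix rather than n labels.
y_broadcast = torch.matmul(X, w) + b + noise

print(y_column.shape)     # torch.Size([5, 1])
print(y_broadcast.shape)  # torch.Size([5, 5])
```

Keeping every term as an `(n, 1)` column (or a scalar) avoids this silent shape change.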

As an example, let's set the true parameters to
$\mb{w} = [2,-3.4]^T$ and $b = 4.2$.

Our job is to estimate $\mb{w}$ and $b$ from the dataset $\mb{X}$ and $\mb{y}$, and compare the estimated values with the true values.
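
As a quick sanity check of what "estimate and compare" means here, the sketch below recovers the parameters with ordinary least squares using `torch.linalg.lstsq`. This is an assumption made for illustration, not the training procedure of this section; the names `X_aug`, `w_hat`, and `b_hat` are hypothetical, and the data is regenerated inline so that the block runs on its own.

```python
import torch

torch.manual_seed(0)
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
n = 1000

# Same generative model as above: y = Xw + b + noise
X = torch.randn(n, 2)
y = X @ true_w.reshape(-1, 1) + true_b + 0.01 * torch.randn(n, 1)

# Append a column of ones so the bias is estimated as one extra weight.
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)
solution = torch.linalg.lstsq(X_aug, y).solution  # shape (3, 1)
w_hat = solution[:2].flatten()
b_hat = solution[2].item()

print('error in estimating w:', true_w - w_hat)
print('error in estimating b:', true_b - b_hat)
```

Because the noise is small relative to the signal, both errors should be close to zero.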


In [16]:
data = SyntheticRegressionData(
    w=torch.tensor([2, -3.4]),
    b=4.2,
    num_train=1000,
    num_val=1000,
    batch_size=32,
)

In [17]:
# print first 4 rows
for i in range(4):
    print(f'[Row {i}] features: {data.X[i]}, label: {data.y[i]}')

[Row 0] features: tensor([-1.3575, -0.6905]), label: tensor([3.8496])
[Row 1] features: tensor([ 2.5720, -0.1056]), label: tensor([9.7022])
[Row 2] features: tensor([ 0.0956, -0.9048]), label: tensor([7.4566])
[Row 3] features: tensor([1.6608, 0.1641]), label: tensor([6.9709])


## 3.3.2. Reading the Dataset

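- `SyntheticRegressionData` inherits from the `d2l.DataModule` class.
- We need to implement the `get_dataloader` method so that it returns a data loader for the synthetic regression dataset: shuffled minibatches of training examples when `train` is true, and the validation examples in their original order otherwise.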

In [7]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train + self.num_val))
    
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i + self.batch_size])
        yield (self.X[batch_indices], self.y[batch_indices])

In [18]:
# First batch of training data (batch size 32)
X, y = next(iter(data.get_dataloader(train=True)))
print(f'X.shape: {X.shape}\ny.shape: {y.shape}')

X.shape: torch.Size([32, 2])
y.shape: torch.Size([32, 1])


- The example above illustrates the usefulness of object-oriented programming.
- We can define a generic class such as `DataModule` once and let other classes inherit from it to implement dataset-specific functionality.
- This way, we can reuse code and keep it modular.
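
The pattern in the list above can be sketched in plain Python. Everything here is hypothetical and only for illustration: `attach_to_class`, `BaseDataModule`, and `ToyData` are stand-ins, not d2l's actual code, and the assumption is that the base class's `train_dataloader` simply delegates to the subclass's `get_dataloader`.

```python
def attach_to_class(cls):
    """Return a decorator that registers a function as a method of cls."""
    def wrapper(func):
        setattr(cls, func.__name__, func)
        return func
    return wrapper


class BaseDataModule:
    """Hypothetical base class: the generic logic is written once here."""

    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        return self.get_dataloader(train=True)

    def val_dataloader(self):
        return self.get_dataloader(train=False)


class ToyData(BaseDataModule):
    """Hypothetical subclass: ten integers, the first six used for training."""

    def __init__(self):
        self.items = list(range(10))
        self.num_train = 6


# Only the dataset-specific part is added to the subclass after the fact.
@attach_to_class(ToyData)
def get_dataloader(self, train):
    if train:
        return self.items[:self.num_train]
    return self.items[self.num_train:]


toy = ToyData()
print(toy.train_dataloader())  # [0, 1, 2, 3, 4, 5]
print(toy.val_dataloader())    # [6, 7, 8, 9]
```

The subclass supplies a single method, and the inherited `train_dataloader` and `val_dataloader` work without being rewritten.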

However, the implementation above is inefficient in several ways. For example, it requires all the data to be loaded into memory, and it performs many random memory accesses.

Deep learning frameworks such as PyTorch provide optimized built-in data loaders that can handle large datasets efficiently.

## 3.3.3. Concise Implementation of the Data Loader 

`d2l.DataModule` implements a `get_tensorloader` method that takes a tuple of tensors (features and targets) and returns a PyTorch `DataLoader`.

```python
class DataModule(d2l.HyperParameters):
    ...

    def get_tensorloader(self, tensors, train, indices=slice(0, None)):
        """Defined in :numref:`sec_synthetic-regression-data`"""
        tensors = tuple(a[indices] for a in tensors)
        dataset = torch.utils.data.TensorDataset(*tensors)
        return torch.utils.data.DataLoader(dataset, self.batch_size,
                                           shuffle=train)

    ...
```

We can use the `get_tensorloader` method to build the data loader for the synthetic regression dataset.

In [19]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    # if train is True, return the first num_train data points
    # if train is False, return the rest of the data points
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

The new data loader behaves just like the previous one, except that it is more efficient and has some added functionality. For example, it supports `len`, which returns the number of batches.

In [21]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
print(f'len(data.train_dataloader()): {len(data.train_dataloader())}')

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])
len(data.train_dataloader()): 32
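
Note that the length printed above is the number of batches, which only happens to equal the batch size here: 1000 training examples in batches of 32 give $\lceil 1000 / 32 \rceil = 32$ batches, the last one holding the remaining 8 examples. The sketch below checks this with a plain `DataLoader` built outside of d2l; the random tensors and the `batch_generator` helper are placeholders introduced only for illustration.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

num_train, batch_size = 1000, 32
X = torch.randn(num_train, 2)   # placeholder features
y = torch.randn(num_train, 1)   # placeholder labels

loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

num_batches = len(loader)
batch_sizes = [len(xb) for xb, _ in loader]
print(num_batches, math.ceil(num_train / batch_size))  # 32 32
print(batch_sizes[-1])                                 # 8

# A generator-based loader, like the hand-written one earlier, has no length.
def batch_generator():
    for i in range(0, num_train, batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

try:
    len(batch_generator())
except TypeError as err:
    print('generator has no len():', err)
```

With the default `drop_last=False`, the final partial batch is kept, which is why the count rounds up.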
