In [None]:
!pip install d2l==0.16.2

# The `Dataset` class

A lot of the effort in solving any machine learning problem goes into preparing the data. In PyTorch, the central tool for this is the `Dataset` class. It lets your training code iterate over your data and apply transformations or filters to it.

For more information, see the [documentation](https://pytorch.org/docs/stable/data.html) of the `Dataset` class or its [source code](https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataset.py).

To load your custom dataset, you need to create your own `CustomDataset` class, which should inherit from the `torch.utils.data.Dataset` class. You need to override three methods:
* `__init__` How your dataset should be initialized/created.
* `__len__` To calculate the length of your dataset. This method allows you to call `len(dataset)` and get the size of the dataset.
* `__getitem__` To get one sample from your dataset based on its index. It supports indexing, so `dataset[i]` returns the ith sample.
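
As a minimal sketch of these three methods, the toy class below wraps in-memory tensors instead of files. The class name `SquaresDataset` and its contents are hypothetical and chosen only for illustration.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy dataset (hypothetical): sample i is the pair (i, i**2)."""

    def __init__(self, n_samples):
        # Build all samples up front; a real dataset would usually store file paths here
        self.inputs = torch.arange(n_samples, dtype=torch.float32)
        self.targets = self.inputs ** 2

    def __len__(self):
        # Enables len(toy_dataset)
        return len(self.inputs)

    def __getitem__(self, idx):
        # Enables toy_dataset[idx]
        return self.inputs[idx], self.targets[idx]


toy_dataset = SquaresDataset(5)
print(len(toy_dataset))    # 5
x, y = toy_dataset[3]
print(x.item(), y.item())  # 3.0 9.0
```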




# Building a custom dataset for leaf counting
In this tutorial, you'll build a custom dataset class that can be used to train a deep learning algorithm to count the number of leaves in an image of a plant.

First, let's download the dataset.

In [None]:
# Get dataset from git.wur.nl
!git clone https://git.wur.nl/deep-learning-course/leaf-dataset

In [None]:
# Have a look at what the dataset contains
!ls leaf-dataset/detection

## The Leaf Detection and Counting Dataset
This dataset contains labels for two different tasks: counting and detecting leaves.

Concretely, it contains:
* An RGB image per plant (e.g. *ara2012_plant117_rgb.png*)
* A csv file per plant containing the bounding boxes of the leaves (e.g. *ara2012_plant117_bbox.csv*)
* A csv file that stores the total number of leaves for every plant in the dataset: *Leaf_counts.csv*
### Regression task
Since we want to create a dataset class that allows us to train a deep learning model to count leaves, we need two things from the dataset:
  1. The RGB image of every plant (both ara2012 and ara2013 sub-sets)
  2. A csv file with the leaf counts per image.

In [None]:
import os
from d2l import torch as d2l
import glob
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torchvision
from PIL.Image import Image


In [None]:
root = 'leaf-dataset/detection/' # Path to dataset

# Take a look at one image
image = d2l.Image.open(os.path.join(root, 'ara2012_plant001_rgb.png'))
d2l.plt.imshow(image);

In [None]:
# Check what the csv files contain
with open(os.path.join(root, 'Leaf_counts.csv'), 'r') as f:
    for line in f:
        filename, n_leafs = (line.rstrip().split(', '))
        print(filename, n_leafs)

## `LeafDataset`
Let's start creating our custom dataset.

To do so, first we need to create the class and the `__init__` method.
```python
class LeafDataset(torch.utils.data.Dataset):
    def __init__(self, ...):
        ...
```

**Exercise 1:** complete the `__init__` method so that the `self.images` and `self.labels` attributes contain lists with all the image paths and leaf counts. Do it in such a way that `self.labels[i]` contains the number of leaves of the ith image, `self.images[i]`. For instance:
* `self.images[3] = 'leaf-dataset/detection/ara2012_plant004_rgb.png'`
* `self.labels[3] = 13`

**Exercise 2:** now that the `__init__` method is complete, let's move on to the `__getitem__` method. Try to complete the code in this method. To read an image given a path, you might want to use the `d2l.Image.open` function. Additionally, the returned `labels` should be a `torch.Tensor` with dtype `torch.float32`.


In [None]:
class LeafDataset(torch.utils.data.Dataset):
    def __init__(self, directory, is_train=True, transforms=None):
        self.images = []
        self.labels = []
        self.transforms = transforms
  
        # Read the counts file from the directory passed to the constructor
        with open(os.path.join(directory, 'Leaf_counts.csv'), 'r') as f:
            for line in f:
                filename, n_leafs = (line.rstrip().split(', '))
                img_path = os.path.join(directory, filename)
                # TODO: add your code here (~2 lines). Fill the corresponding lists with the images and labels

      
        X_train, X_test, y_train, y_test = train_test_split(
            self.images, self.labels, test_size=0.25, random_state=42)
        if is_train:
            self.images = X_train
            self.labels = y_train
        else:
            self.images = X_test
            self.labels = y_test
    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # TODO: complete the code
        # img = ...
        # labels = ...
        
        # Assertions to check that everything is correct
        assert isinstance(img, Image), "Image variable should be a PIL Image"
        assert isinstance(labels, torch.Tensor), "Labels variable should be a torch tensor"
        assert labels.dtype == torch.float32, "Labels variable datatype should be float32"
        
        if self.transforms is not None:
            img = self.transforms(img)
            
        return img, labels

Let's check if our code works as it should. If you run the following block, you should see an image of a plant and a message saying how many leaves it has.

In [None]:
# Let's check if the dataset class works properly
dataset = LeafDataset(root, is_train=True)
image, label = dataset[3]
d2l.plt.imshow(image)
print('This plant contains', int(label.item()), 'leaves')

Congratulations, you have built your first custom dataset class!
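
To train a network, a dataset is usually wrapped in a `torch.utils.data.DataLoader`, which groups samples into batches. The sketch below uses a stand-in `TensorDataset` with random tensors rather than `LeafDataset`, because PIL images cannot be stacked into a batch directly. For the leaf dataset, one would presumably pass a transform such as `torchvision.transforms.ToTensor()` (plus a resize if the image sizes differ) through the `transforms` argument. The shapes and the batch size here are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 10 fake RGB "images" of 8x8 pixels with float32 counts
fake_images = torch.rand(10, 3, 8, 8)
fake_counts = torch.randint(1, 20, (10,)).float()
toy_set = TensorDataset(fake_images, fake_counts)

# The DataLoader calls __len__ and __getitem__ on the dataset and stacks samples into batches
loader = DataLoader(toy_set, batch_size=4, shuffle=True)
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in loader]
for X_shape, y_shape in batch_shapes:
    print(X_shape, y_shape)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples.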

## Performance metrics
When you studied LeNet, you used this function to evaluate the performance of a network on a dataset:
```python
def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
    """Compute the accuracy for a model on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval()  # Set the model to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)

    with torch.no_grad():
        for features, labels in data_iter:        
            if isinstance(features, list):
               features = [feature.to(device) for feature in features]
            else:
               features = features.to(device)
            labels = labels.to(device)
            metric.add(d2l.accuracy(net(features), labels), labels.numel())
    return metric[0] / metric[1]
```

However, this function was built for classification, not for regression. We need a corresponding one for regression that we can use in our training loops.

During the MLP lecture, you learned about the Pearson correlation coefficient (also known as *r*). You can learn more about it [here](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). In summary, it measures the linear correlation between two sets of values. The closer the value is to 1 or -1, the more strongly correlated the values are. In our case, with predictions and ground truths, *r* tells you how well your network's predictions follow the ground truth. Note that *r* ignores constant offsets and scaling, so it complements the loss rather than replacing it.
```python
def pearson_correlation(x1, x2, eps=1e-8):
    """Returns Pearson coefficient between 1D-tensors x1 and x2
    Args:
        x1 (torch.Tensor): First input (1D).
        x2 (torch.Tensor): Second input (of size matching x1).
        eps (float, optional): Small value to avoid division by zero.
            Default: 1e-8
    Example:
        >>> input1 = torch.randn(128)
        >>> input2 = torch.randn(128)
        >>> output = pearson_correlation(input1, input2)
        >>> print(output)
    """
    assert x1.dim() == 1, "Input must be 1D matrix / vector."
    assert x1.size() == x2.size(), "Input sizes must be equal."
    x1_bar = x1 - x1.mean()
    x2_bar = x2 - x2.mean()
    dot_prod = x1_bar.dot(x2_bar)
    norm_prod = x1_bar.norm(2) * x2_bar.norm(2)
    return dot_prod / norm_prod.clamp(min=eps)
```
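
As a quick sanity check of this computation, the sketch below recomputes *r* with plain tensor operations on made-up numbers and compares the result with `numpy.corrcoef`. It also shows that a constant offset between predictions and ground truth leaves *r* at 1. The helper name `pearson_r` and all values are assumptions for illustration.

```python
import numpy as np
import torch


def pearson_r(x1, x2, eps=1e-8):
    # Same computation as above: cosine similarity of the mean-centred vectors
    x1_c = x1 - x1.mean()
    x2_c = x2 - x2.mean()
    return (x1_c.dot(x2_c) / (x1_c.norm(2) * x2_c.norm(2)).clamp(min=eps)).item()


truth = torch.tensor([5.0, 8.0, 11.0, 13.0, 9.0])   # made-up leaf counts
preds = torch.tensor([6.0, 7.0, 12.0, 12.0, 10.0])  # made-up predictions

r_torch = pearson_r(preds, truth)
r_numpy = np.corrcoef(preds.numpy(), truth.numpy())[0, 1]
print(abs(r_torch - r_numpy) < 1e-5)  # True if the two implementations agree

# Predictions that are all 3 leaves too high are still perfectly correlated
r_offset = pearson_r(truth + 3.0, truth)
print(round(r_offset, 4))
```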

The loss itself can be used as a metric for our predictions. Therefore, we would like to calculate it together with the Pearson coefficient (*r*).

Complete the code in the next block to develop this function.



In [None]:
# We saw this function in the MLP notebooks
def pearson_correlation(x1, x2, eps=1e-8):
    """Returns Pearson coefficient between 1D-tensors x1 and x2
    Args:
        x1 (torch.Tensor): First input (1D).
        x2 (torch.Tensor): Second input (of size matching x1).
        eps (float, optional): Small value to avoid division by zero.
            Default: 1e-8
    Example:
        >>> input1 = torch.randn(128)
        >>> input2 = torch.randn(128)
        >>> output = pearson_correlation(input1, input2)
        >>> print(output)
    """
    assert x1.dim() == 1, "Input must be 1D matrix / vector."
    assert x1.size() == x2.size(), "Input sizes must be equal."
    x1_bar = x1 - x1.mean()
    x2_bar = x2 - x2.mean()
    dot_prod = x1_bar.dot(x2_bar)
    norm_prod = x1_bar.norm(2) * x2_bar.norm(2)
    return dot_prod / norm_prod.clamp(min=eps)


def evaluate_loss_pearson_gpus(net, data_iter, loss, device):
    '''
    Function to evaluate the loss and Pearson coefficient of a CNN on a data iterator
    Args:
        net: network
        data_iter (torch.utils.data.DataLoader): dataloader to iterate over
        loss: loss used to train the model
        device: device where the model is loaded. For example, the first entry of the list returned by d2l.try_all_gpus()
    
    Returns:
        average loss of the model in the data
        average pearson coefficient in the data
    
    '''
    if isinstance(net, torch.nn.Module):
        net.eval()  # Set the model to evaluation mode
    metric = d2l.Accumulator(4)  # loss sum, pearson sum, num_examples, num_batches (loop iteration counter)
    for features, labels in data_iter:        
        if isinstance(features, list):
            features = [feature.to(device) for feature in features]
        else:
            features = features.to(device)
        labels = labels.to(device)
        pred = net(features)
        # TODO: add your code here (~2 lines). 
        # Expected output: variable called loss_sum which contains the sum of the losses per sample
        # ...
        # loss_sum = ...

        # Check that loss_sum is a float
        assert isinstance(loss_sum, float), "loss_sum variable should be a float type"
        
        # TODO: add your code here (~2 lines). 
        # Expected output: variable called pr which contains pearson coefficient of the samples and GT
        # ...
        # pr = ...
        
        # Check that pr is a float
        assert(isinstance(pr, float)), "Pearson coefficient variable should be a float type"
        metric.add(loss_sum, pr, labels.shape[0], 1)
    return metric[0] / metric[2], metric[1] / metric[3]

Now you can start with the first part of the project!