In [None]:
%pip install -q --upgrade pip
%pip install -q fastai

Download the Pets dataset

In [None]:
from fastai.vision.all import *

In [None]:
path = untar_data(URLs.PETS)

We want to extract the breed of each pet from each image. For this, we need to understand how the data is laid out in the dataset.

List the contents of the dataset directory.

In [None]:
path.ls()

The dataset provides *images* and *annotations* directories. The source website states that the *annotations* directory contains information about where the pets are rather than what they are.

This project focuses on classification, not localization. This *annotation* information is **not useful for this project**.

Let's focus on the *images* directory.

In [None]:
(path/"images").ls()

The structure of the filenames appears to be:
- pet breed
- underscore
- number
- file extension

Our project needs to extract the breed from the filename.

We can't make too many assumptions. Some breeds have multiple words, so we cannot assume that the breed is located before the first underscore.

Let's pick one of these filenames to test our code.

In [None]:
fname = (path/"images").ls()[0]
fname

The best way to extract the breed is to use a *regular expression*, also known as *regex*. We need a regex that extracts the breed from the filename.

Use the ```findall``` function to try a regex against the filename of the ```fname``` object.

In [None]:
re.findall(r'(.+)_\d+\.jpg$', fname.name)
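
Multi-word breeds are the case that a naive split on the first underscore would get wrong, so it is worth trying the pattern on a few of them. The following is a minimal sketch using only the standard library; the helper ```breed_from_filename``` and the sample filenames are hypothetical and written in the style of the dataset. It uses the pattern with the dot before `jpg` escaped, so that only a literal dot matches.

```python
import re

# Greedy (.+) keeps everything up to the LAST "_<digits>.jpg",
# so underscores inside the breed name are preserved.
BREED_PATTERN = re.compile(r'(.+)_\d+\.jpg$')

def breed_from_filename(name):
    """Return the breed part of a Pets-style filename, or None if it does not match."""
    match = BREED_PATTERN.search(name)
    return match.group(1) if match else None

# Hypothetical filenames for illustration only
samples = [
    "Bengal_12.jpg",
    "great_pyrenees_173.jpg",
    "american_pit_bull_terrier_5.jpg",
    "notes.txt",
]
for name in samples:
    print(name, "->", breed_from_filename(name))
```

Returning `None` for a non-matching name makes unexpected files easy to spot before they reach the labelling step.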

Now that we confirmed that the regex works, let's use it to label the entire dataset.

The ```RegexLabeller``` class is used for labeling with regex.

In [None]:
pets = DataBlock(
    blocks = (ImageBlock, CategoryBlock),
    get_items = get_image_files, 
    splitter = RandomSplitter(seed=42), 
    get_y = using_attr(RegexLabeller(r'(.+)_\d+\.jpg$'), 'name'),
    item_tfms = Resize(460), 
    batch_tfms = aug_transforms(size=224, min_scale=0.75)
)

dls = pets.dataloaders(path/"images")

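These two lines of the ```DataBlock``` implement the fastai data augmentation strategy called *presizing*.

```python
    item_tfms = Resize(460),
    batch_tfms = aug_transforms(size=224, min_scale=0.75)
```

Presizing first resizes each image to a relatively large size (460 pixels here), then applies all augmentations together on the batch and resizes to the final, smaller size (224 pixels). Doing it this way minimizes data destruction from repeated interpolation while maintaining good performance.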
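
The geometry behind those two numbers can be sketched without any image library. The code below is a rough, simplified illustration and not fastai's implementation: the helper ```sample_crop_box```, its aspect-ratio range, and its sampling scheme are assumptions made for this example. It picks a random region covering at least `min_scale` of a 460 px square (before clipping to the image bounds); that region is what would then be resized to 224 px.

```python
import math
import random

def sample_crop_box(side=460, min_scale=0.75, ratio=(3 / 4, 4 / 3), rng=random):
    """Pick a random crop box (left, top, width, height) inside a side x side image."""
    # Target area: a random fraction of the full image, at least min_scale
    area = rng.uniform(min_scale, 1.0) * side * side
    # Aspect ratio sampled uniformly in log space (assumed range)
    aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
    # Width and height for that area and aspect, clipped to the image
    w = min(side, round(math.sqrt(area * aspect)))
    h = min(side, round(math.sqrt(area / aspect)))
    # Random position that keeps the box inside the image
    left = rng.randint(0, side - w)
    top = rng.randint(0, side - h)
    return left, top, w, h

random.seed(0)
left, top, w, h = sample_crop_box()
print(f"crop {w}x{h} at ({left}, {top}) from 460x460, then resize to 224x224")
```

Because the crop is taken from the larger 460 px image, the final 224 px image is produced by downscaling rather than by stretching a small region.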

## Checking and Debugging a DataBlock

- Never assume that the code is working perfectly.
- Even if the code works, there's no guarantee that the template will work with the data source as intended.
- Always check your data.

With fastai the data can be checked using the ```show_batch``` method.

In [None]:
dls.show_batch(nrows=1, ncols=3)

The ```summary``` method prints a step-by-step summary of how the data is processed. This is useful for checking that each stage behaves as expected. For instance, one common mistake is to forget the ```Resize``` transform, which leaves the images at different sizes so that they cannot be collated into a batch.

The example below builds such a ```DataBlock``` without ```Resize``` and calls ```summary``` on it. It is commented out because it is expected to fail at the collation step; uncomment it to see the error.

In [None]:
# pets1 = DataBlock(
#     blocks = (ImageBlock, CategoryBlock),
#     get_items = get_image_files,
#     splitter = RandomSplitter(seed=42),
#     get_y = using_attr(RegexLabeller(r'(.+)_\d+\.jpg$'), 'name'),
# )

# pets1.summary(path/"images")
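
To see why a missing ```Resize``` breaks batching, here is a toy stand-in for the collation step. It is a sketch only: ```batch_shape``` is a hypothetical helper, and the image shapes are made up for illustration. A batch is a single stacked array, so every item must have the same shape.

```python
def batch_shape(image_shapes):
    """Toy stand-in for batch collation: all items must share one shape."""
    unique = set(image_shapes)
    if len(unique) != 1:
        raise ValueError(f"cannot stack images of different sizes: {sorted(unique)}")
    # Batch dimension first, then the shared (channels, height, width)
    return (len(image_shapes), *image_shapes[0])

# After a resize step every image has the same shape, so stacking works
resized = [(3, 460, 460)] * 4
print(batch_shape(resized))

# Without a resize step the raw images differ in size (made-up shapes)
raw = [(3, 500, 375), (3, 333, 500), (3, 400, 400)]
try:
    batch_shape(raw)
except ValueError as err:
    print("collation failed:", err)
```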

Once the data looks right, use it to train a **simple** model. An early baseline matters because it shows what to expect from the problem:
- Perhaps the problem doesn't require a lot of domain-specific engineering.
- Perhaps the model doesn't seem to learn from the data at all.

These are things you want to know **as soon as possible**.

Initial test:

In [None]:
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2)
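
The ```error_rate``` metric reported during training is the fraction of validation images that were classified incorrectly, that is, 1 minus accuracy. The sketch below computes the same quantity on plain Python lists; ```toy_error_rate``` and the sample labels are hypothetical and are not part of fastai.

```python
def toy_error_rate(predictions, targets):
    """Fraction of predictions that differ from the targets (1 - accuracy)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one example")
    wrong = sum(p != t for p, t in zip(predictions, targets))
    return wrong / len(targets)

# Made-up labels: one of the four predictions is wrong
preds = ["Bengal", "pug", "pug", "beagle"]
truth = ["Bengal", "pug", "beagle", "beagle"]
print(toy_error_rate(preds, truth))  # 0.25
```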