In [1]:
%pip install -q fastai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


Download the Pets dataset

In [2]:
from fastai.vision.all import *

In [3]:
path = untar_data(URLs.PETS)

We want to extract the breed of each pet from each image. For this, we need to understand how the data is laid out in the dataset.

Let's look at the contents of the dataset.

In [4]:
path.ls()

(#2) [Path('/home/abraham/.fastai/data/oxford-iiit-pet/images'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/annotations')]

The dataset provides *images* and *annotations* directories. According to the source website, the *annotations* directory contains information about where the pets are in each image rather than what they are.

This project focuses on classification, not localization, so the *annotations* information is **not useful for this project**.

Let's focus on the *images* directory.

In [5]:
(path/"images").ls()

(#7393) [Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/wheaten_terrier_73.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/newfoundland_19.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/newfoundland_75.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/beagle_40.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/japanese_chin_132.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/Sphynx_41.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/yorkshire_terrier_40.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/Sphynx_67.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/keeshond_81.jpg'),Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/american_bulldog_174.jpg')...]

The structure of the filenames appears to be:
- pet breed
- underscore
- number
- file extension

Our project requires extracting the breed from the filename.

We can't make too many assumptions. Some breed names contain several words separated by underscores, so we cannot assume that the breed is everything before the first underscore.
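
To make the pitfall concrete, here is a minimal sketch that uses plain string splitting on names taken from the listing above. The helpers `breed_by_first_underscore` and `breed_by_last_underscore` are hypothetical and exist only for illustration.

```python
def breed_by_first_underscore(filename: str) -> str:
    # Naive assumption: the breed is everything before the first underscore.
    return filename.split("_")[0]


def breed_by_last_underscore(filename: str) -> str:
    # Safer assumption: the breed is everything before the last underscore,
    # because the trailing "<number>.<extension>" part never contains one.
    return filename.rsplit("_", 1)[0]


names = ["beagle_40.jpg", "wheaten_terrier_73.jpg", "american_bulldog_174.jpg"]
naive = [breed_by_first_underscore(n) for n in names]
safe = [breed_by_last_underscore(n) for n in names]
print(naive)  # multi-word breeds are truncated to their first word
print(safe)   # multi-word breeds are kept whole
```

Splitting on the last underscore handles these names, but it does not check that the rest of the name really is a number followed by an extension.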

Let's pick one of these filenames to test our code.

In [6]:
fname = (path/"images").ls()[0]
fname

Path('/home/abraham/.fastai/data/oxford-iiit-pet/images/wheaten_terrier_73.jpg')

The best way to extract the breed is to use a *regular expression*, also known as *regex*. We need a regex that extracts the breed from the filename.

Use the ```findall``` method to try a regex against the filename of the ```fname``` object.

In [7]:
re.findall(r'(.+)_\d+\.jpg$', fname.name)

['wheaten_terrier']
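
One filename is a small sample, so it may be worth checking the pattern against a few more names from the listing, including a capitalized breed. The sketch below uses only the standard ```re``` module. The ```sample_names``` list and the ```extract_breed``` helper are hypothetical, and `notes.txt` is a made-up name added to show a non-match. The dot before `jpg` is escaped here so that it matches a literal period.

```python
import re

# (.+)    greedy capture: takes as much as possible, so the breed ends at the
#         last underscore that is followed by digits and the extension
# _\d+    underscore followed by the image number
# \.jpg$  a literal ".jpg" at the very end of the name
pattern = re.compile(r'(.+)_\d+\.jpg$')


def extract_breed(filename):
    # Return the captured breed, or None when the name does not fit the pattern.
    match = pattern.match(filename)
    return match.group(1) if match else None


sample_names = ["wheaten_terrier_73.jpg", "Sphynx_41.jpg",
                "american_bulldog_174.jpg", "notes.txt"]
breeds = [extract_breed(n) for n in sample_names]
print(breeds)
```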

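Now that we have confirmed that the regex works, let's use it to label the entire dataset.

The ```RegexLabeller``` class labels each item by applying a regex to it. Wrapping it in ```using_attr``` makes it operate on the ```name``` attribute of each path, that is, on the filename alone.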

In [8]:
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                get_items = get_image_files,
                splitter = RandomSplitter(seed=42),
                get_y = using_attr(RegexLabeller(r'(.+)_\d+\.jpg$'), 'name'),
                item_tfms = Resize(460),
                batch_tfms = aug_transforms(size=224, min_scale=0.75))

dls = pets.dataloaders(path/"images")
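
To see what the ```get_y``` argument does conceptually, here is a rough sketch in plain Python. It is not fastai's actual implementation. The ```make_name_labeller``` function is hypothetical: it mimics the combined effect of reading each item's filename and applying the regex to it, and it assumes that a filename that fails to match should raise an error.

```python
import re
from pathlib import Path


def make_name_labeller(pattern):
    compiled = re.compile(pattern)

    def label(item):
        # Use only the final path component, e.g. "beagle_40.jpg".
        name = Path(item).name
        match = compiled.search(name)
        if match is None:
            raise ValueError(f"Pattern did not match: {name!r}")
        # The first capture group holds the breed.
        return match.group(1)

    return label


# Dot escaped so that it matches a literal period.
label_breed = make_name_labeller(r'(.+)_\d+\.jpg$')
example_label = label_breed(Path("images/japanese_chin_132.jpg"))
print(example_label)
```

Failing loudly on an unexpected filename is a design choice in this sketch: a silent wrong label would be much harder to spot later during training.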

The following two arguments of the ```DataBlock``` implement the fastai data augmentation strategy called *presizing*.

```python
    item_tfms = Resize(460),
    batch_tfms = aug_transforms(size=224, min_scale=0.75)
```

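Presizing first resizes each image to a relatively large size (460 pixels here), then applies all augmentations together with the final resize to 224 pixels in a single step at the batch level. This minimizes data destruction, such as the empty zones and interpolation artifacts that repeated transformations introduce, while maintaining good performance.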
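
A back-of-the-envelope calculation suggests why the intermediate size of 460 leaves room for augmentation. The sketch below assumes that ```min_scale=0.75``` means the random crop covers at least 75% of the presized image's area and that the crop is square. Both are assumptions about the fastai defaults rather than something stated in this notebook, and ```min_crop_side``` is a hypothetical helper.

```python
import math


def min_crop_side(presize, min_scale):
    # Side length of a square crop that covers a `min_scale` fraction of the
    # area of a presize x presize image (assumed behaviour, see lead-in).
    return math.sqrt(min_scale) * presize


side = min_crop_side(460, 0.75)
print(round(side))   # smallest crop side, in pixels
print(side >= 224)   # True means the crop is shrunk, not enlarged, to reach 224
```

Under these assumptions the smallest crop is about 398 pixels on a side, so reaching the final 224 pixels always means downscaling, which avoids the blur that upsampling a small crop would introduce.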