# Example: Cats vs. Dogs

This notebook demonstrates the usage of ``image_featurizer`` using the Kaggle Cats vs. Dogs dataset.

We will look at the usage of the ``ImageFeaturizer()`` class, which provides a convenient pipeline to quickly tackle image problems with DataRobot's platform. 

It allows users to load image data into the featurizer, and then featurizes the images into a maximum of 2048 features. It appends these features to the CSV as extra columns in line with the image rows. If no CSV was passed in with an image directory, the featurizer generates a new CSV automatically and performs the same function.


In [1]:
import pandas as pd
import numpy as np
from sklearn import svm
from image_featurizer.image_featurizer import ImageFeaturizer

Using TensorFlow backend.


## Formatting the Data

``ImageFeaturizer`` accepts any one of the following as input:
1. An image directory,
2. A CSV with URL pointers to image downloads, or
3. A combined image directory + CSV with pointers to the included images.

For this example, we will load in the Kaggle Cats vs. Dogs dataset of 25,000 images, along with a CSV that includes each image's class label.

In [2]:
pd.options.display.max_rows = 10

image_path = 'cat_vs_dogs_images/'
csv_path = 'cat_vs_dog_classes.csv'

pd.read_csv(csv_path)

Unnamed: 0,images,animal
0,cat.0.jpg,0
1,cat.1.jpg,0
2,cat.10.jpg,0
3,cat.100.jpg,0
4,cat.1000.jpg,0
...,...,...
24995,dog.9995.jpg,1
24996,dog.9996.jpg,1
24997,dog.9997.jpg,1
24998,dog.9998.jpg,1


The image directory contains 12,500 images of cats and 12,500 images of dogs. The CSV contains pointers to each image in the directory, along with a class label (0 for cats, 1 for dogs).
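If you only have the image directory, a label CSV of this shape can be generated from the filenames. The sketch below is one possible way to do it, not part of the package: it assumes Kaggle-style names such as `cat.0.jpg`, and the helper names `build_label_rows` and `rows_to_csv_text` are hypothetical.

```python
import csv
import io


def build_label_rows(filenames):
    """Return (filename, label) pairs, with 0 for cats and 1 for dogs.

    Assumes the text before the first dot in each filename is the class name.
    """
    class_to_label = {'cat': 0, 'dog': 1}
    rows = []
    for name in sorted(filenames):
        prefix = name.split('.')[0]
        if prefix in class_to_label:  # skip anything that is not a cat/dog file
            rows.append((name, class_to_label[prefix]))
    return rows


def rows_to_csv_text(rows):
    """Serialize the rows as CSV text with 'images' and 'animal' columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['images', 'animal'])
    writer.writerows(rows)
    return buffer.getvalue()


# Toy file list standing in for the contents of the image directory
example_files = ['dog.1.jpg', 'cat.10.jpg', 'cat.0.jpg', 'notes.txt']
label_rows = build_label_rows(example_files)
csv_text = rows_to_csv_text(label_rows)
print(csv_text)
```

Sorting the names as strings gives the same ordering seen in the table above (`cat.0.jpg`, `cat.1.jpg`, `cat.10.jpg`, ...).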

## Initializing the Featurizer

We will now initialize the ``ImageFeaturizer()`` class with a few parameters that define the model. If in doubt, we can call the featurizer with no parameters, and it will initialize itself with default settings. The defaults are a depth of 1 and automatic downsampling, so no parameters are strictly necessary for this build; we pass them explicitly here to demonstrate the functionality.

Because we have not downloaded the pre-trained weights, the featurizer will automatically download the weights through the Keras backend.

The depth indicates how far down we should cut the model to draw abstract features. The further down we cut, the less complex the representations will be, but they may also be less specialized to the specific classes in the ImageNet dataset that the model was trained on.

Automatic downsampling means that this model will downsample the final layer from 2048 features to 1024 features, which is a more compact representation. With large datasets, a wider feature set can cause memory problems or make downstream models harder to optimize, so it may be worth downsampling to a smaller feature space.
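To make the idea of downsampling concrete, here is a minimal NumPy sketch that halves a 2048-wide feature matrix by averaging consecutive pairs of columns. This notebook does not state how the package pools features internally, so average pooling is an assumption here, and `pool_features` is a hypothetical helper rather than the package's API.

```python
import numpy as np


def pool_features(features, num_pooled_features):
    """Average consecutive groups of columns down to num_pooled_features columns."""
    num_features = features.shape[1]
    if num_features % num_pooled_features != 0:
        raise ValueError('num_pooled_features must divide the feature count evenly')
    group_size = num_features // num_pooled_features
    # Reshape to (rows, pooled, group) and average within each group
    grouped = features.reshape(features.shape[0], num_pooled_features, group_size)
    return grouped.mean(axis=2)


# Two fake "images" with 2048 features each
full = np.arange(2 * 2048, dtype=np.float32).reshape(2, 2048)
pooled = pool_features(full, 1024)
print(pooled.shape)   # (2, 1024)
print(pooled[0, :3])  # [0.5 2.5 4.5]
```

The divisibility check reflects the requirement that a custom pooled size be an integer divisor of 2048.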

In [3]:
featurizer = ImageFeaturizer(depth=1, automatic_downsample = True)


Building the featurizer!

Can't find weight file. Need to download weights from Keras!

Model successfully initialized.
Model decapitated!
Automatic downsampling to 1024. If you would like to set custom downsampling, pass in an integer divisor of 2048 to num_pooled_features!
Model downsampled!
Full featurizer is built!
Final layer feature space downsampled to 1024


This featurizer was 'decapitated' to the first layer below the prediction layer, which will produce complex representations. Because it is so close to the final prediction layer, it will create more specialized feature representations, and therefore will be better suited for image datasets that are similar to classes within the original ImageNet dataset. Cats and dogs are present within ImageNet, so a depth of 1 should perform well. 

## Loading the Data

Now that the featurizer is built, we can load our data into the network. This will parse through the images in the order given by the csv, rescale them to a target size with a default of (299, 299), and build a 4D tensor containing the vectorized representations of the images. This tensor will later be fed into the network in order to be featurized.

The tensor will have the dimensions: [number of images, height, width, color channels]. In this case, the image tensor will have size [25000, 299, 299, 3].
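A tensor of this shape is large, which is worth checking before loading. The estimate below assumes each value is stored as a 4-byte float32; the notebook does not state the dtype of the loaded tensor, so treat the result as a rough guide. `tensor_size_bytes` is a hypothetical helper.

```python
def tensor_size_bytes(num_images, height, width, channels, bytes_per_value=4):
    """Estimate the memory footprint of a dense image tensor."""
    return num_images * height * width * channels * bytes_per_value


size = tensor_size_bytes(25000, 299, 299, 3)
print('{:.1f} GB'.format(size / 1e9))  # 26.8 GB
```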

We have to pass in the name of the column in the CSV that contains pointers to the images, as well as the path to the image directory and the path to the CSV itself, which are both saved from earlier. 

If there are images in the directory that aren't in the CSV, image names in the CSV that aren't in the directory, or files in the directory that aren't valid images, the featurizer handles it: it only tries to vectorize valid images that appear in both the CSV and the directory. Any images present in the CSV but not the directory will be given zero vectors, and the order of the CSV is considered the canonical order for the images.
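The matching rule described above can be sketched in a few lines of NumPy. This is an illustration of the behavior, not the package's code: `build_image_tensor` is a hypothetical helper, and the `loaded_images` dictionary stands in for images that have already been decoded and rescaled.

```python
import numpy as np


def build_image_tensor(csv_names, loaded_images, target_size=(299, 299)):
    """Stack images in CSV order, leaving zero vectors for missing images."""
    height, width = target_size
    tensor = np.zeros((len(csv_names), height, width, 3), dtype=np.float32)
    for row, name in enumerate(csv_names):
        image = loaded_images.get(name)
        if image is not None:  # images absent from the directory stay all-zero
            tensor[row] = image
    return tensor


# Tiny 4x4 stand-ins: one CSV entry has no image, one image has no CSV entry
csv_names = ['cat.0.jpg', 'missing.jpg', 'dog.0.jpg']
loaded = {
    'dog.0.jpg': np.full((4, 4, 3), 2.0, dtype=np.float32),
    'cat.0.jpg': np.full((4, 4, 3), 1.0, dtype=np.float32),
    'extra.jpg': np.full((4, 4, 3), 9.0, dtype=np.float32),
}
tensor = build_image_tensor(csv_names, loaded, target_size=(4, 4))
print(tensor.shape)          # (3, 4, 4, 3)
print(tensor[:, 0, 0, 0])    # [1. 0. 2.]
```

The row order follows the CSV, the missing image becomes a zero block, and the extra image in the directory is ignored.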

In [4]:
featurizer.load_data('images', image_path = image_path, csv_path = csv_path)

Found image paths that overlap between both the directory and the csv!
Converting images!
Converted 0 images! Only 25000 images left to go!
Converted 1000 images! Only 24000 images left to go!
Converted 2000 images! Only 23000 images left to go!
Converted 3000 images! Only 22000 images left to go!
Converted 4000 images! Only 21000 images left to go!
Converted 5000 images! Only 20000 images left to go!
Converted 6000 images! Only 19000 images left to go!
Converted 7000 images! Only 18000 images left to go!
Converted 8000 images! Only 17000 images left to go!
Converted 9000 images! Only 16000 images left to go!
Converted 10000 images! Only 15000 images left to go!
Converted 11000 images! Only 14000 images left to go!
Converted 12000 images! Only 13000 images left to go!
Converted 13000 images! Only 12000 images left to go!
Converted 14000 images! Only 11000 images left to go!
Converted 15000 images! Only 10000 images left to go!
Converted 16000 images! Only 9000 images left to go!
...

The image data has now been loaded into the featurizer and vectorized, and is ready for featurization. We can check the size, format, and other stored information about the data by calling the featurizer object attributes:

In [5]:
print('Vectorized data shape: {}'.format(featurizer.data.shape))

print('CSV path: \'{}\''.format(featurizer.csv_path))

print('Image directory path: \'{}\''.format(featurizer.image_path))

Vectorized data shape: (25000, 299, 299, 3)
CSV path: 'cat_vs_dog_classes.csv'
Image directory path: 'cat_vs_dogs_images/'


For a full list of attributes, call:

In [6]:
featurizer.__dict__.keys()

['downsample_size',
 'visualize',
 'image_path',
 'image_column_header',
 'csv_path',
 'number_crops',
 'image_list',
 'scaled_size',
 'depth',
 'automatic_downsample',
 'featurized_data',
 'model',
 'isotropic_scaling',
 'data',
 'crop_size']

## Featurizing the Data

Now that the data is loaded, we're ready to featurize it. This will push the vectorized images through the network and save the 2D matrix output, with each row representing a single image and each column storing a different feature.

It will then create and save a new CSV by appending these features to the end of the given CSV in line with each image's row. The features themselves will also be saved in a separate CSV file without the image names or other data. Both generated CSVs will be saved to the same path as the original CSV, with the features-only CSV appending '_features_only' and the combined CSV appending '_full' to the end of their respective filenames.
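As a quick illustration of the naming scheme, the sketch below derives the two output paths and the feature column names from the original CSV path. The helpers `output_csv_paths` and `feature_column_names` are hypothetical, and placing the suffix before the file extension is an assumption based on the `cat_vs_dog_classes_full.csv` file read later in this notebook.

```python
import os


def output_csv_paths(csv_path):
    """Return the (combined, features-only) CSV paths for a given input CSV."""
    root, extension = os.path.splitext(csv_path)
    return root + '_full' + extension, root + '_features_only' + extension


def feature_column_names(num_features):
    """Column headers in the style image_feature_0, image_feature_1, ..."""
    return ['image_feature_{}'.format(i) for i in range(num_features)]


full_path, features_only_path = output_csv_paths('cat_vs_dog_classes.csv')
print(full_path)                        # cat_vs_dog_classes_full.csv
print(features_only_path)               # cat_vs_dog_classes_features_only.csv
print(feature_column_names(1024)[-1])   # image_feature_1023
```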

The ``featurize()`` method requires no parameters, as it uses the data we just loaded into the network. It pushes every image through the deep InceptionV3 network, so relatively large datasets will need a GPU to finish in a reasonable amount of time. Using a mid-range GPU, it can take about 30 minutes to process the full 25,000 photos in the Cats vs. Dogs dataset.

In [7]:
featurizer.featurize()

Checking array initialized.
Trying to featurize data!
Creating feature array!
Feature array created successfully.
Adding image features to csv!


Unnamed: 0,images,animal,image_feature_0,image_feature_1,image_feature_2,image_feature_3,image_feature_4,image_feature_5,image_feature_6,image_feature_7,...,image_feature_1014,image_feature_1015,image_feature_1016,image_feature_1017,image_feature_1018,image_feature_1019,image_feature_1020,image_feature_1021,image_feature_1022,image_feature_1023
0,cat.0.jpg,0,0.197623,0.153129,0.237384,0.325404,0.174551,0.093449,0.397525,0.536594,...,0.346164,0.676448,0.316065,0.374114,0.615561,0.902570,0.520825,0.560737,1.129250,0.372229
1,cat.1.jpg,0,0.077591,0.144663,0.107953,0.211797,0.072236,0.121741,0.129945,0.131091,...,0.245134,1.225246,0.399110,0.129122,0.189429,0.235852,0.220862,0.278146,0.134976,0.535692
2,cat.10.jpg,0,0.247032,0.181568,0.097621,0.154228,0.247492,0.178228,0.175542,0.306296,...,0.263852,0.940070,0.358609,0.506933,0.753087,0.233318,0.655843,0.799059,0.681655,0.579095
3,cat.100.jpg,0,0.259990,0.129949,0.165196,0.130249,0.156296,0.108147,0.477028,0.563983,...,0.541575,0.772276,0.498641,0.867951,0.543547,0.722114,0.631598,0.749955,0.552881,0.813209
4,cat.1000.jpg,0,0.446480,0.201302,0.126474,0.079862,0.215835,0.399131,0.636886,0.109699,...,0.308573,0.947029,0.174811,0.453257,0.792167,0.285224,0.642199,0.629148,0.613605,0.903375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,dog.9995.jpg,1,0.119224,0.098877,0.158328,0.201316,0.271268,0.393014,0.268696,0.374308,...,0.272386,0.190257,0.244063,0.040812,0.220867,0.548893,0.155369,0.188319,0.014220,0.469891
24996,dog.9996.jpg,1,0.218130,0.122711,0.497641,0.232362,0.282693,0.115563,0.222415,0.178117,...,0.374624,0.298325,0.131216,0.221744,0.285206,0.168957,0.454489,0.197727,0.098618,0.380426
24997,dog.9997.jpg,1,0.214751,0.287097,0.240857,0.236063,0.811545,0.395226,0.139665,0.351027,...,0.271042,0.196889,0.410896,0.236834,0.291153,0.828539,0.511775,0.504615,0.496241,0.379007
24998,dog.9998.jpg,1,0.296174,0.281312,0.426309,0.146541,0.766958,0.184113,0.174594,0.280894,...,0.160002,0.192043,0.224755,0.141034,0.426096,0.439477,0.307610,0.533777,0.197895,0.528016


## Results

The dataset has now been fully featurized! The features are saved under the featurized_data attribute:

In [8]:
featurizer.featurized_data

array([[ 0.19762258,  0.15312853,  0.23738424, ...,  0.56073749,
         1.12924969,  0.3722288 ],
       [ 0.07759078,  0.14466347,  0.10795289, ...,  0.27814645,
         0.13497572,  0.53569233],
       [ 0.24703167,  0.18156832,  0.09762136, ...,  0.79905927,
         0.68165481,  0.57909489],
       ..., 
       [ 0.21475141,  0.28709683,  0.24085702, ...,  0.50461513,
         0.49624115,  0.37900704],
       [ 0.29617372,  0.28131226,  0.42630851, ...,  0.53377748,
         0.19789484,  0.52801621],
       [ 0.49579096,  0.22766027,  0.10793425, ...,  0.13083567,
         0.59388638,  0.51173502]], dtype=float32)

The full data has also been successfully saved in CSV form, which allows it to be dropped directly into the DataRobot app:

In [9]:
pd.read_csv('cat_vs_dog_classes_full.csv')

Unnamed: 0,images,animal,image_feature_0,image_feature_1,image_feature_2,image_feature_3,image_feature_4,image_feature_5,image_feature_6,image_feature_7,...,image_feature_1014,image_feature_1015,image_feature_1016,image_feature_1017,image_feature_1018,image_feature_1019,image_feature_1020,image_feature_1021,image_feature_1022,image_feature_1023
0,cat.0.jpg,0,0.197623,0.153129,0.237384,0.325404,0.174551,0.093449,0.397525,0.536594,...,0.346164,0.676448,0.316065,0.374114,0.615561,0.902570,0.520825,0.560737,1.129250,0.372229
1,cat.1.jpg,0,0.077591,0.144663,0.107953,0.211797,0.072236,0.121741,0.129945,0.131091,...,0.245134,1.225246,0.399110,0.129122,0.189429,0.235852,0.220862,0.278146,0.134976,0.535692
2,cat.10.jpg,0,0.247032,0.181568,0.097621,0.154228,0.247492,0.178228,0.175542,0.306296,...,0.263852,0.940070,0.358609,0.506933,0.753087,0.233318,0.655843,0.799059,0.681655,0.579095
3,cat.100.jpg,0,0.259990,0.129949,0.165196,0.130249,0.156296,0.108147,0.477028,0.563983,...,0.541575,0.772276,0.498641,0.867951,0.543547,0.722114,0.631598,0.749955,0.552881,0.813209
4,cat.1000.jpg,0,0.446480,0.201302,0.126474,0.079862,0.215835,0.399131,0.636886,0.109699,...,0.308573,0.947029,0.174811,0.453257,0.792167,0.285224,0.642199,0.629148,0.613605,0.903375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,dog.9995.jpg,1,0.119224,0.098877,0.158328,0.201316,0.271268,0.393014,0.268696,0.374308,...,0.272386,0.190257,0.244063,0.040812,0.220867,0.548893,0.155369,0.188319,0.014220,0.469891
24996,dog.9996.jpg,1,0.218130,0.122711,0.497641,0.232362,0.282693,0.115563,0.222415,0.178117,...,0.374624,0.298325,0.131216,0.221744,0.285206,0.168957,0.454489,0.197727,0.098618,0.380426
24997,dog.9997.jpg,1,0.214751,0.287097,0.240857,0.236063,0.811545,0.395226,0.139665,0.351027,...,0.271042,0.196889,0.410896,0.236834,0.291153,0.828539,0.511775,0.504615,0.496241,0.379007
24998,dog.9998.jpg,1,0.296174,0.281312,0.426309,0.146541,0.766958,0.184113,0.174594,0.280894,...,0.160002,0.192043,0.224755,0.141034,0.426096,0.439477,0.307610,0.533777,0.197895,0.528016


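For the purposes of this demo, however, we can simply test the performance of a linear classifier on the featurized data. First, we'll build the training and test sets. The slices below rely on the CSV order: the first 12,500 rows are cats and the last 12,500 rows are dogs.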

In [10]:
# Creating a training set of 10,000 for each class
train_cats = featurizer.featurized_data[:10000, :]
train_dogs = featurizer.featurized_data[12500:22500, :]

# Creating a test set from the remaining 2,500 of each class
test_cats = featurizer.featurized_data[10000:12500, :]
test_dogs = featurizer.featurized_data[22500:, :]

# Combining the training data, and creating the class labels
train_combined = np.concatenate((train_cats, train_dogs))
labels_train = np.concatenate((np.zeros((10000,)), np.ones((10000,))))

# Combining the test data, and creating the class labels to check predictions
test_combined = np.concatenate((test_cats, test_dogs))
labels_test = np.concatenate((np.zeros((2500,)), np.ones((2500,))))

Then, we'll train the linear classifier:

In [11]:
# Initialize the linear SVC
clf = svm.LinearSVC()

# Fit it on the training data
clf.fit(train_combined, labels_train)

# Check the accuracy of the linear classifier on the held-out test set
clf.score(test_combined, labels_test)

0.99139999999999995

After featurizing the Cats vs. Dogs dataset, we find that a simple linear classifier trained on the features achieves over 99% accuracy (0.9914 on the 5,000 held-out test images) at distinguishing cats from dogs.
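The split above depends on the cats and dogs sitting in contiguous blocks of rows. If the rows were shuffled, a label-driven split would be safer. The following is a minimal NumPy sketch of a per-class (stratified) split; `stratified_split` is a hypothetical helper, and the random toy data stands in for `featurizer.featurized_data` and the `animal` column of the CSV.

```python
import numpy as np


def stratified_split(features, labels, test_fraction=0.2, seed=0):
    """Hold out the same fraction of each class for testing."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        rng.shuffle(idx)  # shuffle within the class before splitting
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    train_idx = np.array(train_idx)
    test_idx = np.array(test_idx)
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])


# Toy data: 10 "cats" (label 0) and 10 "dogs" (label 1), 4 features each
features = np.random.RandomState(1).rand(20, 4).astype(np.float32)
labels = np.array([0] * 10 + [1] * 10)
x_train, y_train, x_test, y_test = stratified_split(features, labels)
print(x_train.shape, x_test.shape)  # (16, 4) (4, 4)
```

A test fraction of 0.2 matches the proportions used above, where 2,500 of each class's 12,500 images are held out.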

## Summary

That's it! We've looked at the following:

1. What data formats can be passed into the featurizer
2. How to initialize a simple featurizer
3. How to load data into the featurizer
4. How to featurize the loaded data

And as a bonus, we looked at how we might use the featurized data to perform predictions without dropping the CSV into the DataRobot app.

Unless you would like to examine the loaded data before featurizing it, steps 3 and 4 can actually be combined into a single step with the load_and_featurize_data( ) method.

## Next Steps

We have not covered using only a CSV with URL pointers, or a more complex dataset. Those will be the subject of another notebook.

To have more control over the options in the featurizer, or to understand its internal functioning more fully, check out the full package documentation.