# Example: Cats vs. Dogs With SqueezeNet

This notebook demonstrates the usage of ``pic2vec`` using the Kaggle Cats vs. Dogs dataset.

We will look at the usage of the ``ImageFeaturizer()`` class, which provides a convenient pipeline to quickly tackle image problems with DataRobot's platform. 

It allows users to load image data into the featurizer, and then featurizes the images into a maximum of 2048 features. It appends these features to the CSV as extra columns in line with the image rows. If no CSV was passed in with an image directory, the featurizer generates a new CSV automatically and performs the same function.


In [1]:
# Importing the dependencies for this example
import os
import pandas as pd
import numpy as np
from sklearn import svm
from pic2vec import ImageFeaturizer

Using TensorFlow backend.


In [2]:
# Setting up stdout logging
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.INFO)

ch = logging.StreamHandler(sys.stdout)
ch.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))
root.addHandler(ch)

# Setting pandas display options
pd.options.display.max_rows = 10


## Formatting the Data

`ImageFeaturizer` accepts as input either:
1. An image directory
2. A CSV with URL pointers to image downloads, or 
3. A combined image directory + CSV with pointers to the included images. 

For this example, we will load in the Kaggle Cats vs. Dogs dataset of 25,000 images, along with a CSV that includes each image's class label. Our working directory is at `~/pic2vec_demo/`. The `cats_vs_dogs.csv` file can be found in the same `cats_vs_dogs/` example folder as this notebook.

In [3]:
WORKING_DIRECTORY = os.path.expanduser('~/pic2vec_demo/')

csv_path = WORKING_DIRECTORY + 'cats_vs_dogs.csv'
image_path = WORKING_DIRECTORY + 'cats_vs_dogs_images/'

Let's take a look at the csv before featurizing the images:

In [4]:
pd.read_csv(csv_path)

Unnamed: 0,images,label
0,cat.0.jpg,0
1,cat.1.jpg,0
2,cat.2.jpg,0
3,cat.3.jpg,0
4,cat.4.jpg,0
...,...,...
24995,dog.12495.jpg,1
24996,dog.12496.jpg,1
24997,dog.12497.jpg,1
24998,dog.12498.jpg,1


The image directory contains 12,500 images of cats and 12,500 images of dogs. The CSV contains pointers to each image in the directory, along with a class label (0 for cats, 1 for dogs).
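If you only have the image directory, a label CSV in this shape can be derived from the file names. The sketch below is a minimal illustration, assuming the `cat.N.jpg` / `dog.N.jpg` naming seen in the table above. The helper names are hypothetical and not part of pic2vec.

```python
import csv
import io

# Class labels used in this notebook: 0 for cats, 1 for dogs.
LABELS = {'cat': 0, 'dog': 1}


def label_from_filename(filename):
    """Read the class from the file name prefix, e.g. 'cat.0.jpg' -> 0."""
    prefix = filename.split('.', 1)[0]
    return LABELS[prefix]


def build_label_rows(filenames):
    """Pair each image file name with its class label."""
    return [(name, label_from_filename(name)) for name in filenames]


def rows_to_csv_text(rows):
    """Serialize the rows with the same headers as cats_vs_dogs.csv."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['images', 'label'])
    writer.writerows(rows)
    return buffer.getvalue()


example_rows = build_label_rows(['cat.0.jpg', 'cat.1.jpg', 'dog.12495.jpg'])
csv_text = rows_to_csv_text(example_rows)
print(csv_text)
```

In practice the file names would come from `os.listdir(image_path)` rather than a hard-coded list.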

## Initializing the Featurizer

We will now initialize the `ImageFeaturizer()` class with a few parameters that define the model. If in doubt, we can always call the featurizer with no parameters, and it will initialize itself to a default build. Here, we pass the parameters explicitly to demonstrate functionality. These are generally the default values, so for this build we could just call `featurizer = ImageFeaturizer()`.

SqueezeNet is the featurizer's built-in default model, so leaving out the `model` argument would select it as well. Its weights come prepackaged. If you initialize another model, pic2vec will automatically download the model weights through the Keras backend.

The depth indicates how far down we should cut the model to draw abstract features. The further down we cut, the less complex the representations will be, but they may also be less specialized to the specific classes in the ImageNet dataset that the model was trained on, so they may perform better on data that is further from those classes.
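To make the idea of cutting at a given depth concrete, here is a toy sketch. It is not pic2vec's implementation: the "model" is just a list of named stages built from plain Python functions, and the assumption that a depth of `d` removes exactly the last `d` stages is a simplification for illustration.

```python
def make_dense(weights):
    """Return a layer that multiplies its input vector by a weight matrix."""
    def layer(values):
        return [sum(w * v for w, v in zip(row, values)) for row in weights]
    return layer


def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


# A toy network: 3 inputs -> 4 values -> 2 values -> 1 prediction.
TOY_MODEL = [
    ('low_level', make_dense([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [0.0, 0.0, 1.0],
                              [1.0, 1.0, 1.0]])),
    ('activation', relu),
    ('high_level', make_dense([[1.0, -1.0, 0.0, 0.0],
                               [0.0, 0.0, 1.0, 1.0]])),
    ('prediction', make_dense([[0.5, 0.5]])),
]


def decapitate(model, depth):
    """Drop the last `depth` stages so an earlier layer becomes the output."""
    if not 1 <= depth < len(model):
        raise ValueError('depth must be between 1 and len(model) - 1')
    return model[:-depth]


def run(model, values):
    """Feed the input through every remaining stage in order."""
    for _name, layer in model:
        values = layer(values)
    return values


pixels = [1.0, 2.0, 3.0]
full_output = run(TOY_MODEL, pixels)
depth_one_features = run(decapitate(TOY_MODEL, 1), pixels)
depth_two_features = run(decapitate(TOY_MODEL, 2), pixels)
```

With `depth=1` only the prediction stage is removed, so the features come from the layer closest to the original task. A larger depth returns the output of an earlier, more generic stage.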

Automatic downsampling (`autosample=True`) would downsample the final layer from 512 features to 256 features, which is a more compact representation. With large datasets and bigger models (such as InceptionV3), more features may run into memory problems or difficulty optimizing, so it may be worth downsampling to a smaller feature space. Here we leave it off and keep all 512 features.
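One simple way to halve a feature vector is to average neighbouring features. The sketch below shows that idea on a plain Python list. It is an assumption for illustration only, and pic2vec's own downsampling may work differently.

```python
def downsample_by_averaging(features, target_size):
    """Shrink a feature vector by averaging equally sized groups of neighbours."""
    if target_size <= 0 or len(features) % target_size != 0:
        raise ValueError('target_size must evenly divide the number of features')
    group = len(features) // target_size
    return [
        sum(features[start:start + group]) / group
        for start in range(0, len(features), group)
    ]


# A stand-in for one image's 512 features.
full_vector = [float(i) for i in range(512)]
compact_vector = downsample_by_averaging(full_vector, 256)
print(len(full_vector), '->', len(compact_vector))
```

Averaging keeps some information from every original feature, whereas simply dropping half of the columns would discard it.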

In [5]:
featurizer = ImageFeaturizer(depth=1, autosample=False, model='squeezenet')

INFO - Building the featurizer.
INFO - Loading/downloading SqueezeNet model weights. This may take a minute first time.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
INFO - Model successfully initialized.
INFO - Model decapitated.
INFO - Model downsampled.
INFO - Full featurizer is built.
INFO - No downsampling. Final layer feature space has size 512


This featurizer was 'decapitated' to the first layer below the prediction layer, which will produce complex representations. Because it is so close to the final prediction layer, it will create more specialized feature representations, and therefore will be better suited for image datasets that are similar to classes within the original ImageNet dataset. Cats and dogs are present within ImageNet, so a depth of 1 should perform well. 

## Loading and Featurizing Images Simultaneously

Now that the featurizer is built, we can actually load our data into the network and featurize the images all at the same time, using a single method:  

In [6]:
featurized_df = featurizer.featurize(image_columns='images',
                                     image_path=image_path,
                                     csv_path=csv_path)

INFO - Found image paths that overlap between both the directory and the csv.

INFO - Loading image batch.
INFO - Converting images.
INFO - Converted 0 images in batch. Only 1000 images left to go.
INFO - Converted 500 images in batch. Only 500 images left to go.
INFO - 
Featurizing image batch.
INFO - Trying to featurize data.
INFO - Creating feature array.
INFO - Feature array created successfully.
INFO - Combining image features with original dataframe.
INFO - Number of missing photos: 1000
INFO - Featurized batch #1. Number of images left: 24000
Estimated total time left: 1187 seconds

INFO - Loading image batch.
INFO - Converting images.
INFO - Converted 0 images in batch. Only 1000 images left to go.
INFO - Converted 500 images in batch. Only 500 images left to go.
INFO - 
Featurizing image batch.
INFO - Trying to featurize data.
INFO - Creating feature array.
INFO - Feature array created successfully.
INFO - Combining image features with original dataframe.
INFO - Number of miss

INFO - Converted 0 images in batch. Only 1000 images left to go.
INFO - Converted 500 images in batch. Only 500 images left to go.
INFO - 
Featurizing image batch.
INFO - Trying to featurize data.
INFO - Creating feature array.
INFO - Feature array created successfully.
INFO - Combining image features with original dataframe.
INFO - Number of missing photos: 1000
INFO - Featurized batch #15. Number of images left: 10000
Estimated total time left: 498 seconds

INFO - Loading image batch.
INFO - Converting images.
INFO - Converted 0 images in batch. Only 1000 images left to go.
INFO - Converted 500 images in batch. Only 500 images left to go.
INFO - 
Featurizing image batch.
INFO - Trying to featurize data.
INFO - Creating feature array.
INFO - Feature array created successfully.
INFO - Combining image features with original dataframe.
INFO - Number of missing photos: 1000
INFO - Featurized batch #16. Number of images left: 9000
Estimated total time left: 437 seconds

INFO - Loading imag

The images have now been featurized. The featurized dataframe contains the original csv, along with the generated features appended to the appropriate row, corresponding to each image.
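The log above shows the images being processed in batches of 1,000, with a running estimate of the time left. A minimal sketch of that pattern follows. The function names are hypothetical and the linear time estimate is an assumption, not pic2vec's actual code.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def estimate_seconds_left(elapsed_seconds, items_done, items_total):
    """Extrapolate the remaining time from the average time per item so far."""
    if items_done == 0:
        return None  # nothing to extrapolate from yet
    return elapsed_seconds / items_done * (items_total - items_done)


image_names = ['image_{}.jpg'.format(i) for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(image_names, 1000)]
print(batch_sizes)
print(estimate_seconds_left(50.0, 1000, 25000))
```

Working in batches bounds how many decoded images are held in memory at once, which matters for a dataset of 25,000 images.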

There is also an `images_missing` column that tracks which images were missing. The features for a missing image are filled with zeros.

If there are images in the directory that aren't contained in the CSV, image names in the CSV that aren't in the directory, or even files in the directory that aren't valid images, have no fear: the featurizer will only try to vectorize valid images that are present in both the CSV and the directory. Any images present in the CSV but not the directory will be given zero vectors, and the order of the image column from the CSV is considered the canonical order for the images.
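The matching rules described above can be sketched with the standard library. This is an illustration under stated assumptions, not pic2vec's code: the helper names are hypothetical, a file counts as a valid image purely by its extension, and the feature values are made up.

```python
import os
import tempfile

# Assumption: validity is judged by extension only in this sketch.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.bmp')


def find_missing_flags(csv_image_names, image_directory):
    """Return one flag per CSV row: True if the image is not in the directory."""
    on_disk = {
        name for name in os.listdir(image_directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    }
    return [name not in on_disk for name in csv_image_names]


def assemble_rows(csv_image_names, missing_flags, features_by_name, num_features):
    """Keep the CSV order and give missing images an all-zero feature vector."""
    rows = []
    for name, missing in zip(csv_image_names, missing_flags):
        vector = [0.0] * num_features if missing else features_by_name[name]
        rows.append({'images': name, 'images_missing': missing, 'features': vector})
    return rows


with tempfile.TemporaryDirectory() as image_directory:
    # 'notes.txt' is not an image and 'extra.jpg' is not listed in the CSV.
    for filename in ('cat.0.jpg', 'dog.0.jpg', 'notes.txt', 'extra.jpg'):
        open(os.path.join(image_directory, filename), 'wb').close()

    csv_names = ['cat.0.jpg', 'cat.1.jpg', 'dog.0.jpg']
    flags = find_missing_flags(csv_names, image_directory)

# Made-up feature vectors standing in for the network's output.
fake_features = {'cat.0.jpg': [0.4, 4.1], 'dog.0.jpg': [0.3, 10.4]}
rows = assemble_rows(csv_names, flags, fake_features, num_features=2)
```

Here `cat.1.jpg` is listed in the CSV but absent from the directory, so it is flagged as missing and receives zeros, while `extra.jpg` and `notes.txt` are ignored.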

In [7]:
featurized_df

Unnamed: 0,images,label,images_missing,images_feat_0,images_feat_1,images_feat_2,images_feat_3,images_feat_4,images_feat_5,images_feat_6,...,images_feat_502,images_feat_503,images_feat_504,images_feat_505,images_feat_506,images_feat_507,images_feat_508,images_feat_509,images_feat_510,images_feat_511
0,cat.0.jpg,0,False,0.422239,4.080936,14.894337,2.842498,0.967452,10.851055,0.164090,...,5.716428,0.227580,1.512349,1.838279,6.923377,2.754216,1.599615,0.942032,8.596214,0.195745
1,cat.1.jpg,0,False,2.235883,1.766027,0.489503,1.077848,3.744066,3.900755,0.678774,...,0.466049,0.456763,0.000000,8.796008,8.920897,2.318893,3.206552,5.324099,25.885130,0.000000
2,cat.2.jpg,0,False,0.804545,0.685238,0.411905,3.651519,7.440580,1.365789,1.759454,...,0.991104,0.015178,0.018916,7.745066,0.000000,0.187744,0.248889,7.293088,8.606462,0.000000
3,cat.3.jpg,0,False,0.481214,0.229483,5.039218,0.669226,3.988109,2.878755,0.642729,...,0.000000,0.469958,0.532943,3.121966,0.095707,3.489891,0.262518,1.729952,5.988695,0.080222
4,cat.4.jpg,0,False,0.000000,3.258759,9.666997,6.237058,1.160069,0.055264,0.394765,...,0.670079,0.755777,0.076195,7.925221,0.149376,5.640311,0.217993,1.215899,12.723279,0.856007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,dog.12495.jpg,1,False,0.319323,10.394070,2.253926,17.211163,10.175515,0.754048,0.000000,...,6.547691,0.542316,0.000000,2.855716,0.984909,0.789532,1.463116,7.819314,0.194761,2.515571
24996,dog.12496.jpg,1,False,4.812177,2.173984,3.127878,7.731273,2.563596,0.855450,1.506930,...,0.805109,0.897770,0.067206,1.332061,1.023679,2.697655,2.661853,0.294248,11.114500,0.605132
24997,dog.12497.jpg,1,False,0.603782,10.804434,5.988084,11.258055,1.711970,0.172748,0.839580,...,9.690125,0.000000,0.681894,4.975673,0.593460,12.891261,3.147193,2.841281,2.273726,0.726203
24998,dog.12498.jpg,1,False,0.012421,0.779886,3.646794,1.577259,0.314669,1.035333,0.140289,...,6.654152,3.092728,1.509475,2.738165,0.000000,3.335813,2.847281,1.110609,3.183074,1.643685


As you can see, the `featurize()` function loads the images as tensors, featurizes them using deep learning, and then appends these features to the dataframe in the same row as the corresponding image.

This can be used with both an image directory and a CSV with a column containing the image filepaths (as it is in this case). However, it can also be used with just an image directory, in which case it will construct a brand new DataFrame with the image column header specified. Finally, it can be used with just a CSV, as long as the image column contains the URL of each image.

This is the simplest way to use pic2vec, but it is also possible to perform the function in multiple steps. There are actually two processes happening behind the scenes in the above code block: 
1. The images are loaded into the network, and then 
2. The images are featurized and these features are appended to the csv.


## Results

The dataset has now been fully featurized! The features are saved under the `features` attribute if the `save_features` argument was set to True in either the `featurize()` or `featurize_preloaded_data()` functions:

In [10]:
featurizer.features

Unnamed: 0,images_missing,images_feat_0,images_feat_1,images_feat_2,images_feat_3,images_feat_4,images_feat_5,images_feat_6,images_feat_7,images_feat_8,...,images_feat_502,images_feat_503,images_feat_504,images_feat_505,images_feat_506,images_feat_507,images_feat_508,images_feat_509,images_feat_510,images_feat_511
0,False,0.422239,4.080936,14.894337,2.842498,0.967452,10.851055,0.164090,3.244710,0.000000,...,5.716428,0.227580,1.512349,1.838279,6.923377,2.754216,1.599615,0.942032,8.596214,0.195745
1,False,2.235883,1.766027,0.489503,1.077848,3.744066,3.900755,0.678774,0.039899,2.031924,...,0.466049,0.456763,0.000000,8.796008,8.920897,2.318893,3.206552,5.324099,25.885130,0.000000
2,False,0.804545,0.685238,0.411905,3.651519,7.440580,1.365789,1.759454,0.458921,1.410855,...,0.991104,0.015178,0.018916,7.745066,0.000000,0.187744,0.248889,7.293088,8.606462,0.000000
3,False,0.481214,0.229483,5.039218,0.669226,3.988109,2.878755,0.642729,0.366843,0.000000,...,0.000000,0.469958,0.532943,3.121966,0.095707,3.489891,0.262518,1.729952,5.988695,0.080222
4,False,0.000000,3.258759,9.666997,6.237058,1.160069,0.055264,0.394765,0.476731,0.611324,...,0.670079,0.755777,0.076195,7.925221,0.149376,5.640311,0.217993,1.215899,12.723279,0.856007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,False,0.319323,10.394070,2.253926,17.211163,10.175515,0.754048,0.000000,1.031596,1.111096,...,6.547691,0.542316,0.000000,2.855716,0.984909,0.789532,1.463116,7.819314,0.194761,2.515571
24996,False,4.812177,2.173984,3.127878,7.731273,2.563596,0.855450,1.506930,2.816796,0.108556,...,0.805109,0.897770,0.067206,1.332061,1.023679,2.697655,2.661853,0.294248,11.114500,0.605132
24997,False,0.603782,10.804434,5.988084,11.258055,1.711970,0.172748,0.839580,1.017261,0.428469,...,9.690125,0.000000,0.681894,4.975673,0.593460,12.891261,3.147193,2.841281,2.273726,0.726203
24998,False,0.012421,0.779886,3.646794,1.577259,0.314669,1.035333,0.140289,0.526032,1.808931,...,6.654152,3.092728,1.509475,2.738165,0.000000,3.335813,2.847281,1.110609,3.183074,1.643685


The dataframe can be saved in CSV form either by calling the pandas `DataFrame.to_csv()` method, or by using the `ImageFeaturizer.save_csv()` method on the featurizer itself. This will allow the features to be used directly in the DataRobot app:

In [11]:
featurizer.save_csv(omit_time=True)



In [12]:
pd.read_csv(os.path.expanduser('~/pic2vec_demo/cats_vs_dogs_featurized_squeezenet_depth-1_output-512.csv'))

Unnamed: 0,images,label,images_missing,images_feat_0,images_feat_1,images_feat_2,images_feat_3,images_feat_4,images_feat_5,images_feat_6,...,images_feat_502,images_feat_503,images_feat_504,images_feat_505,images_feat_506,images_feat_507,images_feat_508,images_feat_509,images_feat_510,images_feat_511
0,cat.0.jpg,0,False,0.422239,4.080936,14.894337,2.842498,0.967452,10.851055,0.164090,...,5.716428,0.227580,1.512349,1.838279,6.923377,2.754216,1.599615,0.942032,8.596214,0.195745
1,cat.1.jpg,0,False,2.235883,1.766027,0.489503,1.077848,3.744067,3.900755,0.678774,...,0.466049,0.456763,0.000000,8.796008,8.920897,2.318893,3.206552,5.324099,25.885130,0.000000
2,cat.2.jpg,0,False,0.804545,0.685238,0.411905,3.651519,7.440580,1.365789,1.759454,...,0.991105,0.015178,0.018916,7.745066,0.000000,0.187744,0.248889,7.293088,8.606462,0.000000
3,cat.3.jpg,0,False,0.481215,0.229483,5.039218,0.669226,3.988109,2.878755,0.642729,...,0.000000,0.469958,0.532943,3.121966,0.095707,3.489891,0.262518,1.729952,5.988695,0.080222
4,cat.4.jpg,0,False,0.000000,3.258759,9.666997,6.237058,1.160069,0.055264,0.394765,...,0.670079,0.755777,0.076195,7.925221,0.149376,5.640311,0.217993,1.215900,12.723279,0.856007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,dog.12495.jpg,1,False,0.319323,10.394070,2.253926,17.211163,10.175515,0.754048,0.000000,...,6.547691,0.542316,0.000000,2.855716,0.984909,0.789532,1.463116,7.819313,0.194761,2.515571
24996,dog.12496.jpg,1,False,4.812177,2.173984,3.127878,7.731273,2.563596,0.855450,1.506930,...,0.805109,0.897770,0.067206,1.332061,1.023680,2.697655,2.661853,0.294249,11.114500,0.605132
24997,dog.12497.jpg,1,False,0.603782,10.804434,5.988084,11.258055,1.711970,0.172748,0.839580,...,9.690125,0.000000,0.681894,4.975673,0.593460,12.891261,3.147193,2.841281,2.273726,0.726203
24998,dog.12498.jpg,1,False,0.012421,0.779886,3.646794,1.577259,0.314669,1.035333,0.140289,...,6.654152,3.092728,1.509475,2.738165,0.000000,3.335813,2.847281,1.110609,3.183074,1.643685


The `save_csv()` function can be called with no arguments to generate the CSV name automatically; above, we passed only `omit_time=True` to leave the time out of the generated name. It can also be called with the `new_csv_path='{insert_new_csv_path_here}'` argument.

Alternatively, you can omit certain parts of the automatic name generation with `omit_model=True`, `omit_depth=True`, `omit_output=True`, or `omit_time=True` arguments. 
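The file name read back above, `cats_vs_dogs_featurized_squeezenet_depth-1_output-512.csv`, suggests how the automatic name is assembled. The sketch below reproduces that pattern with a hypothetical helper. It is not the library's code, and the position and format of the time component are assumptions, since the example above omits it.

```python
import os
from datetime import datetime


def build_output_csv_path(csv_path, model, depth, output_size,
                          omit_model=False, omit_depth=False,
                          omit_output=False, omit_time=False, now=None):
    """Append the featurizer settings to the original CSV name."""
    base, extension = os.path.splitext(csv_path)
    parts = [base, 'featurized']
    if not omit_model:
        parts.append(model)
    if not omit_depth:
        parts.append('depth-{}'.format(depth))
    if not omit_output:
        parts.append('output-{}'.format(output_size))
    if not omit_time:
        # Assumption: the timestamp comes last, in this format.
        now = now or datetime.now()
        parts.append(now.strftime('%Y%m%d-%H%M%S'))
    return '_'.join(parts) + extension


name_without_time = build_output_csv_path(
    'cats_vs_dogs.csv', 'squeezenet', 1, 512, omit_time=True)
print(name_without_time)
```

Encoding the model, depth, and output size in the name makes it easier to tell apart CSVs produced by different featurizer settings.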

For the purposes of this demo, we can simply test the performance of a linear SVM classifier on the featurized data. First, we'll build the training and test sets.

In [13]:
# Creating a training set of 10,000 for each class
train_cats = featurized_df.iloc[:10000, :]
train_dogs = featurized_df.iloc[12500:22500, :]

# Separating the features from the class labels for the 10,000 training images of each class
train_cats, labels_cats = train_cats.drop(['label', 'images'], axis=1), train_cats['label']
train_dogs, labels_dogs = train_dogs.drop(['label', 'images'], axis=1), train_dogs['label']

# Combining the train data and the class labels to train on
train_combined = pd.concat((train_cats, train_dogs), axis=0)
labels_train = pd.concat((labels_cats, labels_dogs), axis=0)

# Creating a test set from the remaining 2,500 of each class
test_cats = featurized_df.iloc[10000:12500, :]
test_dogs = featurized_df.iloc[22500:, :]

test_cats, test_labels_cats = test_cats.drop(['label', 'images'], axis=1), test_cats['label']
test_dogs, test_labels_dogs = test_dogs.drop(['label', 'images'], axis=1), test_dogs['label']

# Combining the test data and the class labels to check predictions
labels_test = pd.concat((test_labels_cats, test_labels_dogs), axis=0)
test_combined = pd.concat((test_cats, test_dogs), axis=0)


Then, we'll train the linear SVM:

In [14]:
# Initialize the linear SVC
clf = svm.LinearSVC()

# Fit it on the training data
clf.fit(train_combined, labels_train)

# Check the accuracy of the linear classifier on the held-out test set
clf.score(test_combined, labels_test)



0.955

After running the Cats vs. Dogs dataset through the lightest-weight pic2vec model, we find that a simple linear SVM trained on the featurized data achieves 95.5% accuracy on the held-out test images, out of the box.
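The split above drops only `label` and `images`, so the `images_missing` flag stays in the training matrix alongside the features. If you prefer to train on the generated features alone, one option is to select columns by name. The sketch below uses a tiny hand-made DataFrame with made-up values and assumes the `images_feat_N` naming shown in the output above.

```python
import pandas as pd

# A toy stand-in for featurized_df with two feature columns.
toy_df = pd.DataFrame({
    'images': ['cat.0.jpg', 'dog.0.jpg'],
    'label': [0, 1],
    'images_missing': [False, False],
    'images_feat_0': [0.42, 0.32],
    'images_feat_1': [4.08, 10.39],
})

# Keep only the columns whose name contains '_feat_'.
feature_matrix = toy_df.filter(like='_feat_')
labels = toy_df['label']

print(list(feature_matrix.columns))
```

Selecting by name also keeps working if extra metadata columns are added to the CSV later.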

## Summary

That's it! We've looked at the following:

1. What data formats can be passed into the featurizer
2. How to initialize a simple featurizer
3. How to load and featurize the data simultaneously (preferred method)
4. How to load data into the featurizer independently
5. How to featurize the loaded data independently
6. How to save the featurized dataframe as a CSV

And as a bonus, we looked at how we might use the featurized data to perform predictions without dropping the CSV into the DataRobot app.

Unless you would like to examine the loaded data before featurizing it, it is recommended to use the `ImageFeaturizer.featurize()` method to perform both functions at once and allow batch processing.

## Next Steps

We have not covered using only a CSV with URL pointers, or a more complex dataset. That will be the subject of another notebook.

To have more control over the options in the featurizer, or to understand its internal functioning more fully, check out the full package documentation.