<a href="https://colab.research.google.com/github/mlelarge/dataflowr/blob/master/Notebooks/04_dogscast_features_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Preparations

In [0]:
!pip install -U bcolz

In [0]:
#!pip install Pillow==4.0.0

In [0]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import os
import torch
import torch.nn as nn
import torchvision
from torchvision import models,transforms,datasets
import bcolz
import time
%matplotlib inline

We will first precompute the outputs of the VGG16 model on our dataset and store these values.

In [0]:
use_gpu = torch.cuda.is_available()
print('Using gpu: %s ' % use_gpu)

dtype = torch.FloatTensor
if use_gpu:
    dtype = torch.cuda.FloatTensor

The following commands will download the dataset and need to be run only once.

In [0]:
%mkdir data

In [0]:
%cd data/
!wget http://files.fast.ai/data/dogscats.zip

In [0]:
!unzip dogscats.zip

In [0]:
def save_array(fname, arr):
    c=bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()
def load_array(fname):
    return bcolz.open(fname)[:]

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

prep1 = transforms.Compose([
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                normalize,
            ])
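
To make the effect of `transforms.Normalize` concrete, here is a minimal sketch in plain Python of the per-channel arithmetic, `(value - mean) / std`, applied to a single RGB pixel whose values are already in [0, 1] (which is what `ToTensor` produces). The helper name `normalize_pixel` is hypothetical and exists only for illustration.

```python
# ImageNet channel statistics used by the notebook's Normalize transform.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

# A pixel equal to the channel means maps to zero in every channel.
mean_pixel = normalize_pixel(MEAN)
# A pure white pixel maps to a different positive value in each channel.
white_pixel = normalize_pixel([1.0, 1.0, 1.0])
```

The pretrained weights were trained on inputs normalized this way, which is presumably why the same statistics are reused here.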

In [0]:
data_dir = '/content/data/dogscats'

Build the training and validation datasets from the image folders.

In [0]:
dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), prep1)
         for x in ['train', 'valid']}
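
`ImageFolder` derives integer labels from the class subfolder names, sorted alphabetically. The sketch below reproduces that mapping in plain Python; the function name `class_to_idx` here is a hypothetical stand-in, and the folder names `cats` and `dogs` are an assumption about the layout of the unzipped dataset.

```python
def class_to_idx(class_dirs):
    """Map sorted class folder names to consecutive integer labels."""
    classes = sorted(class_dirs)
    return {name: idx for idx, name in enumerate(classes)}

# Assumed subfolder names; check your own data_dir/train directory.
mapping = class_to_idx(['dogs', 'cats'])
```

On the real datasets, the same information should be available as `dsets['train'].class_to_idx`.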

If you are running on CPU, you will probably need to lower the batch size. The same goes for Colab :(

In [0]:
batch_size = 4
#batch_size = 64

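Initialize the data loaders that will fetch images from disk. ```num_workers``` sets the number of worker processes used for loading; with ```num_workers=0```, as here, the images are loaded in the main process. ```shuffle=False``` keeps the images in a fixed order.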

In [0]:
dset_loaders = {x: torch.utils.data.DataLoader(dsets[x], batch_size=batch_size,
                                               shuffle=False, num_workers=0)
                for x in ['train', 'valid']}
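
As a rough illustration of what an unshuffled loader does, the sketch below cuts a list of sample indices into consecutive batches. It mirrors the default `drop_last=False` behavior, where the last batch may be smaller. The helper `iter_batches` and the dataset size of 10 are hypothetical.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of items, in order, like an unshuffled loader."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Ten dummy sample indices standing in for a dataset (hypothetical size).
samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))

# The number of batches is the dataset size divided by the batch size, rounded up.
num_batches = math.ceil(len(samples) / 4)
```

Because the order is fixed, the i-th stored feature vector corresponds to the i-th image of the dataset.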

In [0]:
dset_sizes = {x: len(dsets[x]) for x in ['train', 'valid']}
dset_sizes

Instantiate a VGG16 model pretrained on ImageNet from the ```torchvision``` model zoo.

In [0]:
model_vgg = models.vgg16(pretrained=True)

In [0]:
if use_gpu:
    model_vgg = model_vgg.cuda()

By default, all modules are initialized in train mode (```self.training = True```). Some layers, such as _Dropout_ and _BatchNorm_, behave differently during training and evaluation, so setting the mode matters. VGG16 has _Dropout_ layers in its classifier, and they must be switched off to get deterministic features.

As a general rule of thumb, state your intent explicitly by calling ```model.train()``` or ```model.eval()``` when necessary.

In [0]:
model_vgg.eval()
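
The difference between the two modes can be seen on a single `nn.Dropout` layer. This is a minimal sketch on a dummy tensor, not on the VGG16 model itself.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()           # training mode: elements are randomly zeroed
out_train = drop(x)    # surviving elements are scaled by 1 / (1 - p) = 2

drop.eval()            # evaluation mode: dropout is the identity
out_eval = drop(x)
```

In train mode every call gives a different random mask, whereas in eval mode the output always equals the input.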

## 2. Feature extraction

Function for extracting and storing CNN features, i.e. the output of the VGG16 model in this case.

In [0]:
def prefeat(dataset):
    features = []
    labels_list = []
    # No gradients are needed for feature extraction, so skip building the graph.
    with torch.no_grad():
        for inputs, labels in dataset:
            if use_gpu:
                inputs = inputs.cuda()

            x = model_vgg(inputs)
            # Move each batch of outputs to the CPU and keep it as a NumPy array.
            features.append(x.cpu().numpy())
            labels_list.extend(labels.numpy())
    # Stack the per-batch arrays into one array with one row per image.
    features = np.concatenate(features)
    return (features, labels_list)
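
The features returned by `prefeat` form a single 2-D array with one row per image. Since the ImageNet-pretrained VGG16 outputs 1000 class scores per image, the expected shape is `(number of images, 1000)`. The sketch below shows the stacking step on fake batches; the batch sizes and values are made up.

```python
import numpy as np

# Three fake batches of model outputs: batch sizes 4, 4 and 2, with 1000 values each.
fake_batches = [np.zeros((4, 1000)), np.ones((4, 1000)), np.full((2, 1000), 2.0)]

# Concatenating along the first axis gives one row per image, in loader order.
features = np.concatenate(fake_batches, axis=0)
```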

In [0]:
%%time
feat_train,lbs_train = prefeat(dset_loaders['train'])

In [0]:
feat_train.shape

Recomputing these features every time we want to use them isn't necessary; instead we should save the computed arrays. A fast way to save and load numpy arrays is bcolz, which also compresses the arrays, so we save disk space. We use the ```save_array``` and ```load_array``` functions defined above.

In [0]:
%cd /content/data/dogscats/

In [0]:
%mkdir vgg16

In [0]:
save_array(os.path.join(data_dir,'vgg16','feat_train.bc'),feat_train)
save_array(os.path.join(data_dir,'vgg16','lbs_train.bc'),lbs_train)
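
If `bcolz` is not available in your environment, a similar save/load round trip can be sketched with NumPy's own `np.save` and `np.load`. This is a swapped-in alternative, not what the notebook uses. The helper names `save_array_npy` and `load_array_npy`, the dummy array and the temporary path are all hypothetical.

```python
import os
import tempfile
import numpy as np

def save_array_npy(fname, arr):
    """Write an array to a .npy file (uncompressed, unlike bcolz)."""
    np.save(fname, np.asarray(arr))

def load_array_npy(fname):
    """Read an array back from a .npy file."""
    return np.load(fname)

# Round trip with a small dummy feature array in a temporary directory.
dummy = np.arange(12, dtype=np.float32).reshape(3, 4)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'feat_dummy.npy')
    save_array_npy(path, dummy)
    restored = load_array_npy(path)
```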

In [0]:
%%time
feat_val,lbs_val = prefeat(dset_loaders['valid'])

In [0]:
feat_val.shape

In [0]:
save_array(os.path.join(data_dir,'vgg16','feat_val.bc'),feat_val)
save_array(os.path.join(data_dir,'vgg16','lbs_val.bc'),lbs_val)

In [0]:
%cd /content/data/dogscats/
!zip -r vgg16 vgg16/*

In [0]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# 2. Upload the zipped features to Google Drive.
upload = drive.CreateFile({'title': 'vgg16_drive.zip'})
upload.SetContentFile('vgg16.zip')
upload.Upload()