# [Scene Recognition with Bag-of-Words](https://www.cc.gatech.edu/~hays/compvision/proj4/)
For this project, you will need to report performance for three
combinations of features and classifiers. We suggest implementing them in
this order:
1. Tiny image features and nearest neighbor classifier
2. Bag of SIFT features and nearest neighbor classifier
3. Bag of SIFT features and linear SVM classifier

The starter code is initialized to 'placeholder' just so that the starter
code does not crash when run unmodified and you can get a preview of how
results are presented.

## Setup

In [1]:
# Set up parameters, image paths and category list
%matplotlib notebook
%load_ext autoreload
%autoreload 2

import cv2
import numpy as np
import os.path as osp
import pickle
from random import shuffle
import matplotlib.pyplot as plt
from utils import *
import student_code as sc


# This is the list of categories / directories to use. The categories are
# somewhat sorted by similarity so that the confusion matrix looks more
# structured (indoor and then urban and then rural).
categories = ['Kitchen', 'Store', 'Bedroom', 'LivingRoom', 'Office', 'Industrial', 'Suburb',
              'InsideCity', 'TallBuilding', 'Street', 'Highway', 'OpenCountry', 'Coast',
              'Mountain', 'Forest']
# This list of shortened category names is used later for visualization
abbr_categories = ['Kit', 'Sto', 'Bed', 'Liv', 'Off', 'Ind', 'Sub',
                   'Cty', 'Bld', 'St', 'HW', 'OC', 'Cst',
                   'Mnt', 'For']

# Number of training examples per category to use. Max is 100. For
# simplicity, we assume this is the number of test cases per category, as
# well.
num_train_per_cat = 100

# This function returns lists containing the file path for each train
# and test image, as well as lists with the label of each train and
# test image. By default all four of these lists will have 1500 elements
# where each element is a string.
data_path = osp.join('..', 'data')
train_image_paths, test_image_paths, train_labels, test_labels = get_image_paths(
    data_path, categories, num_train_per_cat)

## Section 1: Tiny Image features with Nearest Neighbor classifier

### Section 1a: Represent each image with the Tiny Image feature

Each function to construct features should return an N x d numpy array, where N is the number of paths passed to the function and d is the dimensionality of each image representation. See the starter code for each function for more details.
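
As a rough illustration of the N x d contract, the sketch below builds tiny-image features from already-loaded grayscale arrays using NumPy only. The function name `tiny_image_features`, the 16 x 16 thumbnail size, and the block-averaging downsample are assumptions made for illustration; the actual `sc.get_tiny_images` takes file paths and may resize differently (for example with `cv2.resize`).

```python
import numpy as np

def tiny_image_features(images, size=16, standardize=True):
    """Hypothetical helper: one flattened size*size thumbnail per image."""
    feats = np.zeros((len(images), size * size), dtype=np.float64)
    for i, img in enumerate(images):
        img = np.asarray(img, dtype=np.float64)
        # Crop so both sides divide evenly by `size`, then block-average.
        h = (img.shape[0] // size) * size
        w = (img.shape[1] // size) * size
        blocks = img[:h, :w].reshape(size, h // size, size, w // size)
        tiny = blocks.mean(axis=(1, 3)).ravel()
        if standardize:
            # Zero mean, unit variance per image (epsilon guards flat images).
            tiny = (tiny - tiny.mean()) / (tiny.std() + 1e-8)
        feats[i] = tiny
    return feats

# Random arrays stand in for two grayscale images of different sizes.
rng = np.random.default_rng(0)
demo_images = [rng.random((200, 256)), rng.random((128, 300))]
demo_feats = tiny_image_features(demo_images)
print(demo_feats.shape)  # (2, 256)
```

Here N = 2 and d = 256. Standardizing each thumbnail makes the feature less sensitive to overall brightness and contrast, which is presumably what the `standardize_pixels=True` flag in the cell below controls.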

In [2]:
print('Using the TINY IMAGE representation for images')

train_image_feats = sc.get_tiny_images(train_image_paths, standardize_pixels=True)
test_image_feats = sc.get_tiny_images(test_image_paths, standardize_pixels=True)

Using the TINY IMAGE representation for images
Standardizing
Standardizing


### Section 1b: Classify each test image by training and using the Nearest Neighbor classifier

Each function to classify test features will return an N element list, where N is the number of test cases and each entry is a string indicating the predicted category for each test image. Each entry in 'predicted_categories' must be one of the 15 strings in 'categories', 'train_labels', and 'test_labels'. See the starter code for each function for more details.
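
A minimal sketch of that interface, assuming Euclidean distance and a simple majority vote among the k nearest training examples, is shown below. The name `knn_classify` and the toy data are hypothetical; `sc.nearest_neighbor_classify` may use a different distance metric or tie-breaking rule.

```python
import numpy as np
from collections import Counter

def knn_classify(train_feats, train_labels, test_feats, k=1):
    """Hypothetical k-NN sketch using Euclidean distance and majority vote."""
    train_feats = np.asarray(train_feats, dtype=np.float64)
    test_feats = np.asarray(test_feats, dtype=np.float64)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (
        (test_feats ** 2).sum(axis=1)[:, None]
        + (train_feats ** 2).sum(axis=1)[None, :]
        - 2.0 * test_feats @ train_feats.T
    )
    # Indices of the k closest training examples for each test example.
    nearest = np.argsort(d2, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        votes = Counter(train_labels[j] for j in row)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy 2-D features; real features would be tiny images or SIFT histograms.
toy_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
toy_labels = ['Kitchen', 'Kitchen', 'Forest', 'Forest']
toy_test = np.array([[0.05, 0.1], [4.9, 5.2]])
toy_pred = knn_classify(toy_train, toy_labels, toy_test, k=3)
print(toy_pred)  # ['Kitchen', 'Forest']
```

The return value is a plain list of category strings, one per test example, matching the contract described above.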

In [3]:
print('Using NEAREST NEIGHBOR classifier to predict test set categories')

predicted_categories = sc.nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats,\
                                                    perform_kNN=True, k=11)

Using NEAREST NEIGHBOR classifier to predict test set categories


### Section 1c: Build a confusion matrix and score the recognition system

(You do not need to code anything in this section.)

If we wanted to evaluate our recognition method properly we would train
and test on many random splits of the data. You are not required to do so
for this project.

This function will create a confusion matrix and various image
thumbnails each time it is called. View the confusion matrix to help interpret
your classifier performance. Where is it making mistakes? Are the
confusions reasonable?

Interpreting your performance with 100 training examples per category:
- accuracy  =   0 -> Your code is broken (probably not the classifier's fault! A classifier would have to be amazing to perform this badly).
- accuracy ~= .07 -> Your performance is chance. Something is broken or you ran the starter code unchanged.
- accuracy ~= .20 -> Rough performance with tiny images and nearest neighbor classifier. Performance goes up a few percentage points with K-NN instead of 1-NN.
- accuracy ~= .20 -> Rough performance with tiny images and linear SVM classifier. The linear classifiers will have a lot of trouble trying to separate the classes and may be unstable (e.g. everything classified to one category)
- accuracy ~= .50 -> Rough performance with bag of SIFT and nearest neighbor classifier. Can reach .60 with K-NN and different distance metrics.
- accuracy ~= .60 -> You've gotten things roughly correct with bag of SIFT and a linear SVM classifier.
- accuracy >= .70 -> You've also tuned your parameters well. E.g. number of clusters, SVM regularization, number of patches sampled when building vocabulary, size and step for dense SIFT features.
- accuracy >= .80 -> You've added in spatial information somehow or you've added additional, complementary image features. This represents state of the art in Lazebnik et al. 2006.
- accuracy >= .85 -> You've done extremely well. This is the state of the art in the 2010 SUN database paper from fusing many features. Don't trust this number unless you actually measure many random splits.
- accuracy >= .90 -> You used modern deep features trained on much larger image databases.
- accuracy >= .96 -> You can beat a human at this task. This isn't a realistic number. Some accuracy calculation is broken or your classifier is cheating and seeing the test labels.
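
For reference, the cross-validation cells later in this notebook compute accuracy as the mean of the diagonal of the row-normalized confusion matrix, i.e. the mean per-class recall. The NumPy-only sketch below reproduces that calculation; the helper name `mean_per_class_accuracy` and the toy labels are hypothetical, and whether `show_results` computes its score in exactly this way is an assumption.

```python
import numpy as np

def mean_per_class_accuracy(true_labels, predicted_labels, categories):
    """Hypothetical helper: mean diagonal of the row-normalized confusion matrix."""
    cat2idx = {cat: idx for idx, cat in enumerate(categories)}
    n = len(categories)
    cm = np.zeros((n, n), dtype=float)
    for t, p in zip(true_labels, predicted_labels):
        cm[cat2idx[t], cat2idx[p]] += 1
    # Normalize each row by the number of true examples in that class.
    # Classes with no true examples keep an all-zero row and count as 0.
    row_sums = cm.sum(axis=1, keepdims=True)
    cm = cm / np.maximum(row_sums, 1)
    return float(np.mean(np.diag(cm)))

demo_cats = ['Kitchen', 'Store', 'Bedroom']
demo_true = ['Kitchen', 'Kitchen', 'Store', 'Store', 'Bedroom', 'Bedroom']
demo_pred = ['Kitchen', 'Store', 'Store', 'Store', 'Kitchen', 'Bedroom']
demo_acc = mean_per_class_accuracy(demo_true, demo_pred, demo_cats)
print(demo_acc)  # (0.5 + 1.0 + 0.5) / 3

# Chance level for 15 balanced categories, matching the ~.07 entry above.
chance = 1.0 / 15
```

With 15 balanced categories, a classifier that guesses uniformly at random scores about 1/15, which is where the "accuracy ~= .07" row comes from.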

In [4]:
show_results(train_image_paths, test_image_paths, train_labels, test_labels, categories, abbr_categories,
             predicted_categories)


## Section 2: Bag of SIFT features with Nearest Neighbor classifier

### Section 2a: Represent each image with the Bag of SIFT feature

To create a new vocabulary, make sure `vocab_filename` differs from the filename of the old vocabulary, or delete the old file. Otherwise the cell below finds the existing file and skips building a new vocabulary.
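
To make the role of the vocabulary concrete, the sketch below turns a set of local descriptors into a bag-of-words histogram by assigning each descriptor to its nearest visual word. It assumes the vocabulary is an array of cluster centers with one row per word; the name `bag_of_words_histogram`, the L1 normalization, and the 2-D toy descriptors (real SIFT descriptors are 128-D) are illustrative assumptions, not the behavior of `sc.get_bags_of_sifts`.

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocab):
    """Hypothetical sketch: normalized histogram of nearest visual words."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    vocab = np.asarray(vocab, dtype=np.float64)
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=vocab.shape[0]).astype(np.float64)
    # L1-normalize so images with different descriptor counts are comparable.
    return hist / max(hist.sum(), 1.0)

toy_vocab = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_desc = np.array([[0.5, -0.2], [9.0, 11.0], [0.1, 0.3], [-9.5, 9.0]])
toy_hist = bag_of_words_histogram(toy_desc, toy_vocab)
print(toy_hist)  # word 0 receives two of the four descriptors
```

The histogram length equals the vocabulary size, so with `vocab_size = 400` each image would be represented by a 400-dimensional vector under this scheme.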

In [None]:
print('Using the BAG-OF-SIFT representation for images')

vocab_size = 400
vocab_filename = 'vocab.pkl'
if not osp.isfile(vocab_filename):
    # Construct the vocabulary
    print('No existing visual word vocabulary found. Computing one from training images')
    vocab = sc.build_vocabulary(train_image_paths, vocab_size)
    with open(vocab_filename, 'wb') as f:
        pickle.dump(vocab, f)
        print('{:s} saved'.format(vocab_filename))

train_image_feats = sc.get_bags_of_sifts(train_image_paths, vocab_filename)
test_image_feats = sc.get_bags_of_sifts(test_image_paths, vocab_filename)

### Section 2b: Classify each test image by training and using the Nearest Neighbor classifier

In [None]:
print('Using NEAREST NEIGHBOR classifier to predict test set categories')
predicted_categories = sc.nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats)

### Section 2c: Build a confusion matrix and score the recognition system

In [None]:
show_results(train_image_paths, test_image_paths, train_labels, test_labels, categories, abbr_categories,
             predicted_categories)

## Cross-Validation section
In this section, cross-validation is used to tune the vocabulary size hyperparameter. My approach is to split the training data 60/40 and use the 40% part as a validation set. For each candidate vocabulary size, the following code performs 10 iterations, drawing a different random 60/40 split each time. Within each iteration it tries every candidate SVM lambda and keeps the best one, then it reports the mean and standard deviation of the resulting validation accuracies along with the most frequently selected lambda.
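
The splitting step on its own can be sketched as follows. The helper name `random_split` and the file names are made up for illustration; the cell below inlines the same idea with `np.random.choice(..., replace=False)`.

```python
import numpy as np

def random_split(paths, labels, train_fraction=0.6, rng=None):
    """Hypothetical helper: shuffle once and cut into train / validation parts."""
    rng = np.random.default_rng() if rng is None else rng
    paths = np.asarray(paths)
    labels = np.asarray(labels)
    order = rng.permutation(len(paths))
    cut = int(train_fraction * len(paths))
    train_idx, val_idx = order[:cut], order[cut:]
    return paths[train_idx], labels[train_idx], paths[val_idx], labels[val_idx]

toy_paths = ['img_%02d.jpg' % i for i in range(10)]  # made-up file names
toy_labels = ['Kitchen'] * 5 + ['Forest'] * 5
tr_p, tr_l, va_p, va_l = random_split(
    toy_paths, toy_labels, rng=np.random.default_rng(0))
print(len(tr_p), len(va_p))  # 6 4
```

Note that a plain random split like this is not stratified, so the number of examples per category can differ between the training and validation parts from one iteration to the next.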

In [None]:
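import os
from sklearn.metrics import confusion_matrix

# Directory for the cached vocabularies written inside the loop below
os.makedirs('cross_validation_data', exist_ok=True)

num_train_per_cat_cross_val = 25
data_path = osp.join('..', 'data')
train_image_paths_cross_val, _, train_labels_cross_val, _ = get_image_paths(
    data_path, categories, num_train_per_cat_cross_val)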

iters_per_param_value    =  10
candidate_vocab_sizes    =  [10, 20, 50, 100, 150, 200, 400, 600, 1000, 5000, 10000]
candidate_lambda_values  =  [0.1, 1, 10, 100, 200, 500, 750, 900, 1000, 1200, 1500, 2000]
num_training_images      =  len(train_image_paths_cross_val)
train_percentage         =  0.6 # validation_percentage = 1.0 - train_percentage
num_training_split       =  int(train_percentage*num_training_images)

np_paths_array           =  np.array(train_image_paths_cross_val) # for being able to randomly access indices
np_labels_array          =  np.array(train_labels_cross_val)
cat2idx                  =  {cat: idx for idx, cat in enumerate(categories)}

mean_acc_list          = [] # for plotting error bars
std_acc_list           = [] # for plotting error bars
best_cand_lambdas_list = []

for i in range(len(candidate_vocab_sizes)):
    accuracy_list = [] # store accuracy values to compute mean and std acc over iterations 
    lambdas_vote = np.zeros((len(candidate_lambda_values),1))
    print("trying candidate vocab size:",candidate_vocab_sizes[i])
    for j in range(iters_per_param_value):
        
        # Generate random indices for training images:
        rnd_indxs = np.random.choice(num_training_images, num_training_images, replace=False)
        
        # Depending on train/validation percentage split, separate training and validation images:
        train_image_paths_split       =  np_paths_array[rnd_indxs[:num_training_split]]
        validation_image_paths_split  =  np_paths_array[rnd_indxs[num_training_split:]]
        train_labels_split            =  np_labels_array[rnd_indxs[:num_training_split]]
        validation_labels_split       =  np_labels_array[rnd_indxs[num_training_split:]]
        
        # Create vocabulary for current train/validation split:
        vocab_filename = 'cross_validation_data/vocab_' + \
                         str(candidate_vocab_sizes[i]) + '_iter_' + str(j+1) + '.pkl'
        if not osp.isfile(vocab_filename):
            vocab = sc.build_vocabulary(train_image_paths_split, candidate_vocab_sizes[i])
            with open(vocab_filename, 'wb') as f:
                pickle.dump(vocab, f)

        # Use the generated vocabulary to extract bags-of-sift features from train & validation images:
        train_features        =  sc.get_bags_of_sifts(train_image_paths_split, vocab_filename)
        validation_features   =  sc.get_bags_of_sifts(validation_image_paths_split, vocab_filename)  
        
        y_true                =  [cat2idx[cat] for cat in validation_labels_split]
        
        # for current vocabulary size test possible candidate lambdas:
        testing_lambdas_list  =  []
        for k in range(len(candidate_lambda_values)):
            predicted_categories  =  sc.svm_classify(train_features, train_labels_split, \
                                                     validation_features, lambda_value=candidate_lambda_values[k])
        
            # Create a confusion matrix, compute accuracy and store in a list:
            y_pred  =  [cat2idx[cat] for cat in predicted_categories]
            cm      =  confusion_matrix(y_true, y_pred)
            cm      =  cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]  # np.float was removed in NumPy 1.24
            acc     =  np.mean(np.diag(cm))
            testing_lambdas_list.append(acc)
        best_lambda_indx = np.argmax(np.array(testing_lambdas_list))
        lambdas_vote[best_lambda_indx,0] += 1
        
        accuracy_list.append(testing_lambdas_list[best_lambda_indx]) # Store accuracy corresponding to best lambda
        print("Iteration:", j+1, "best acc:", accuracy_list[-1],\
              "best lambda:", candidate_lambda_values[best_lambda_indx])
        
    acc_list_array = np.array(accuracy_list)    
    mean_acc_list.append(np.mean(acc_list_array))
    std_acc_list.append(np.std(acc_list_array))
        
    highest_voted_lambda_indx = np.argmax(lambdas_vote[:,0])
    best_cand_lambdas_list.append(candidate_lambda_values[highest_voted_lambda_indx])
    
    print("Stats: mean acc = ", mean_acc_list[-1], "std acc = ", std_acc_list[-1], \
          "highest voted lambda = ", best_cand_lambdas_list[-1])

cross_val_means   = np.array(mean_acc_list)
cross_val_stds    = np.array(std_acc_list)
cross_val_lambdas = np.array(best_cand_lambdas_list) 
np.savez('cross_validation_results', means=cross_val_means, stds=cross_val_stds, lambdas=cross_val_lambdas)

In [None]:
sc.plot_cross_validation_results_vocab_size()

## Coarse-to-fine cross-validation for picking lambda 
Based on the results of the cross-validation above, the tuned vocabulary size is now held fixed and the lambda parameter, which controls how strongly the model is regularized, is fine-tuned. The search range is chosen around the highest-voted lambda from the cross-validation section above.
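
One way to derive a fine grid from a coarse search is sketched below: take the coarse value with the best score and lay a linear grid between its two neighbors. The helper `refine_grid` and the accuracy values are hypothetical and purely illustrative; the cell below instead uses a fixed range, `np.arange(1, 1000, 10)`.

```python
import numpy as np

def refine_grid(coarse_values, coarse_scores, num=11):
    """Hypothetical helper: linear grid between the neighbors of the best coarse value."""
    coarse_values = np.asarray(coarse_values, dtype=np.float64)
    best = int(np.argmax(coarse_scores))
    # Clamp at the ends so the best value may sit on the grid boundary.
    lo = coarse_values[max(best - 1, 0)]
    hi = coarse_values[min(best + 1, len(coarse_values) - 1)]
    return np.linspace(lo, hi, num)

coarse_lambdas = [0.1, 1, 10, 100, 1000]
coarse_acc = [0.41, 0.48, 0.55, 0.60, 0.58]  # made-up validation accuracies
fine_lambdas = refine_grid(coarse_lambdas, coarse_acc)
print(fine_lambdas[0], fine_lambdas[-1])  # 10.0 1000.0
```
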
#### NOTE: RUN THE CELL IN SECTION 2a (WHICH COMPUTES `train_image_feats`) BEFORE RUNNING THE CELL BELOW

In [None]:
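import sys
from sklearn.metrics import confusion_matrix  # needed if the cross-validation cell above was skipped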
candidate_lambda_values       =  np.arange(1,1000,10)
train_labels_array            =  np.array(train_labels)
iters                         =  20

# split training data into train and validation:
train_split                   =  0.7
total_images                  =  train_image_feats.shape[0]
num_train_images_split        =  int(train_split*total_images)
cat2idx                       =  {cat: idx for idx, cat in enumerate(categories)}

mean_acc_list          = [] # for plotting error bars
std_acc_list           = [] # for plotting error bars

for i in range(candidate_lambda_values.shape[0]):
    temp_acc_list=[]
    for j in range(iters):
        rnd_indxs                     =  np.random.choice(total_images, total_images, replace=False)
        train_image_feats_split       =  train_image_feats[rnd_indxs[:num_train_images_split], :]
        validation_image_feats_split  =  train_image_feats[rnd_indxs[num_train_images_split:], :]
        train_labels_split            =  train_labels_array[rnd_indxs[:num_train_images_split]]
        validation_labels_split       =  train_labels_array[rnd_indxs[num_train_images_split:]]
    
        y_true                        =  [cat2idx[cat] for cat in validation_labels_split]
        predicted_categories          =  sc.svm_classify(train_image_feats_split, train_labels_split, \
                                           validation_image_feats_split, \
                                           lambda_value=candidate_lambda_values[i])
        
        # build confusion matrix and compute prediction accuracy on validation data:
        y_pred  =  [cat2idx[cat] for cat in predicted_categories]
        cm      =  confusion_matrix(y_true, y_pred)
        cm      =  cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]  # np.float was removed in NumPy 1.24
        acc     =  np.mean(np.diag(cm))
        temp_acc_list.append(acc)
        sys.stdout.write("trial:%d/%d, lambda:%f, iter:%d, acc:%f\r" % (i+1, candidate_lambda_values.shape[0], \
                                                             candidate_lambda_values[i], j+1, temp_acc_list[-1]))
        sys.stdout.flush()
    mean_acc_list.append(np.mean(np.array(temp_acc_list)))
    std_acc_list.append(np.std(np.array(temp_acc_list)))
    
cross_val_lambda_means = np.array(mean_acc_list)
cross_val_lambda_stds  = np.array(std_acc_list)
np.savez('cross_validation_lambda_results', means=cross_val_lambda_means, stds=cross_val_lambda_stds)

In [None]:
cross_val_results_data = np.load('cross_validation_lambda_results.npz')

X    = np.arange(1,1000,10)
Y    = cross_val_results_data['means']
Yerr = cross_val_results_data['stds']

print("best mean acc:",Y[np.argmax(Y)],"for lambda:", X[np.argmax(Y)])

plt.figure()
plt.errorbar(X,Y,Yerr)
plt.xlabel('lambda')
plt.ylabel('accuracy')
plt.show()

## Section 3: Bag of SIFT features and SVM classifier
We will reuse the bag of SIFT features from Section 2a.

The difference is that this time we will classify them with a support vector machine (SVM).

### Section 3a: Classify each test image by training and using the SVM classifiers

In [None]:
print('Using SVM classifier to predict test set categories')
predicted_categories = sc.svm_classify(train_image_feats, train_labels, test_image_feats, lambda_value=585)

### Section 3b: Build a confusion matrix and score the recognition system

In [None]:
show_results(train_image_paths, test_image_paths, train_labels, test_labels, categories, abbr_categories,
             predicted_categories)