# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

#### To be done in Lab

The objective of this experiment is to retrieve, for a query image drawn from the test set, the top “relevant” images from the training set using the K-Nearest Neighbour method we learned earlier.

The algorithm is simple: we rank the images in the training set by their distance from the given query image. ***A retrieved image is considered “relevant” if its class is the same as that of the query image.***
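
The ranking and relevance rule can be sketched on toy data as follows. This is a minimal illustration that assumes Euclidean distance; the helper names `rank_by_distance` and `is_relevant` and the toy arrays are hypothetical and not part of the lab code.

```python
import numpy as np

def rank_by_distance(query, train_features):
    """Return training indices sorted from nearest to farthest from the query."""
    # Euclidean distance between the query and every training row
    distances = np.linalg.norm(train_features - query, axis=1)
    return np.argsort(distances, kind="stable")

def is_relevant(ranked_indices, train_labels, query_label):
    """Boolean array: True where the retrieved image shares the query's class."""
    return train_labels[ranked_indices] == query_label

# Toy data (hypothetical): four 2-D "images" with class labels
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
toy_labels = np.array([0, 1, 1, 0])
toy_query = np.array([0.1, 0.0])

ranking = rank_by_distance(toy_query, toy_train)
relevance = is_relevant(ranking, toy_labels, query_label=0)
print(ranking, relevance)
```

Here the two class-0 points are nearest to the query, so the first two retrievals are relevant and the last two are not.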

In this experiment we will use the CIFAR-10 dataset. The dataset contains 60,000 32x32 colour images in 10 classes, with 6,000 images per class.

There are 50,000 training images and 10,000 test images.

#### Data Source

https://www.cs.toronto.edu/~kriz/cifar.html


The images have been downloaded and unzipped for you in the directory Datasets/AIML_DS_CIFAR-10_STD

They are stored in a Python-specific format called pickle. You need not worry about the format's internals, as the dataset site provides the code needed to read such files. That code is given in the `unpickle` function below.

**The code returns the contents of each data file as a dictionary**.

#### Data set Information

There are 8 files in the cifar-10 directory.

- batches.meta
- data_batch_1
- data_batch_2
- data_batch_3
- data_batch_4
- data_batch_5
- readme.html
- test_batch

We will take a peek at these files. Each batch file unpickles to a dictionary that contains, among others, the following entries:

**data**: a 10,000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of each row are the red channel values of the first row of the image.

**labels**: a list of 10,000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
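
The layout described above can be turned back into an image with a reshape and a transpose. The sketch below uses a synthetic row rather than real CIFAR-10 data; the helper name `row_to_image` is an assumption made for illustration.

```python
import numpy as np

def row_to_image(row):
    """Convert one 3072-value CIFAR-10 row into a (32, 32, 3) RGB image."""
    # The row is laid out as [R plane | G plane | B plane], each plane row-major
    channels_first = np.asarray(row).reshape(3, 32, 32)
    # Move channels last: (3, 32, 32) -> (32, 32, 3)
    return channels_first.transpose(1, 2, 0)

# Synthetic row (not real CIFAR data): red = 10, green = 20, blue = 30
fake_row = np.concatenate([
    np.full(1024, 10, dtype=np.uint8),
    np.full(1024, 20, dtype=np.uint8),
    np.full(1024, 30, dtype=np.uint8),
])
image = row_to_image(fake_row)
print(image.shape, image[0, 0])
```

The resulting array has the height x width x channel shape that most plotting functions expect for RGB images.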

### Setup Steps

In [0]:
#@title Please enter your registration id to start: (e.g. P181900101) { run: "auto", display-mode: "form" }
Id = "P181902118" #@param {type:"string"}


In [0]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "8860303743" #@param {type:"string"}


In [13]:
#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()
  
notebook="BLR_M1W2_SAT_EXP_3" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
#    ipython.magic("sx wget https://www.dropbox.com/s/yqsekuw2epfd6m1/AIML_DS_CIFAR-10_STD.zip?dl=1")
   ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/week3/Exp3/AIML_DS_CIFAR-10_STD.zip")
   ipython.magic("sx unzip AIML_DS_CIFAR-10_STD.zip")
#     ipython.magic("sx unzip AIML_DS_CIFAR-10_STD.zip?dl=1")
   print ("Setup completed successfully")
   return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "id" : Id, "file_hash" : file_hash, "notebook" : notebook}

      r = requests.post(url, data = data)
      print("Your submission is successful. Ref:", submission_id)
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
  
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


In [0]:
## Importing Required Packages
import scipy.io as sio
import numpy as np
import math
import collections

In [0]:
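def unpickle(file):
    """Load one CIFAR-10 batch file and return its contents as a dictionary."""
    import pickle
    with open(file, 'rb') as fo:
        # encoding='bytes' keeps the dictionary keys as bytes, e.g. b'data'
        batch = pickle.load(fo, encoding='bytes')
    return batch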

In [14]:
list(unpickle("AIML_DS_CIFAR-10_STD/data_batch_1").keys())

[b'batch_label', b'labels', b'data', b'filenames']

In [16]:
unpickle("AIML_DS_CIFAR-10_STD/data_batch_1")[b'filenames'][:10]

[b'leptodactylus_pentadactylus_s_000004.png',
 b'camion_s_000148.png',
 b'tipper_truck_s_001250.png',
 b'american_elk_s_001521.png',
 b'station_wagon_s_000293.png',
 b'coupe_s_001735.png',
 b'cassowary_s_001300.png',
 b'cow_pony_s_001168.png',
 b'sea_boat_s_001584.png',
 b'tabby_s_001355.png']

In [15]:
unpickle("AIML_DS_CIFAR-10_STD/data_batch_1")[b'labels'][:10]

[6, 9, 9, 4, 1, 1, 2, 7, 8, 3]

In [17]:
unpickle("AIML_DS_CIFAR-10_STD/data_batch_1")[b'data'].shape

(10000, 3072)

In [18]:
test_data = unpickle("AIML_DS_CIFAR-10_STD/test_batch")
print(list(test_data.keys())) 
print(len(test_data), len(test_data[b'labels']), test_data[b'data'].shape)

[b'batch_label', b'labels', b'data', b'filenames']
4 10000 (10000, 3072)


In [19]:
type(test_data[b'data'])

numpy.ndarray

In [21]:
unpickle("AIML_DS_CIFAR-10_STD/batches.meta")

{b'label_names': [b'airplane',
  b'automobile',
  b'bird',
  b'cat',
  b'deer',
  b'dog',
  b'frog',
  b'horse',
  b'ship',
  b'truck'],
 b'num_cases_per_batch': 10000,
 b'num_vis': 3072}
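
The `label_names` list gives a way to read the integer labels as class names. As a small sketch, the values below are copied by hand from the outputs shown earlier instead of being loaded from disk; the variable names are chosen for illustration.

```python
# Class names copied from the batches.meta output above
label_names = [b'airplane', b'automobile', b'bird', b'cat', b'deer',
               b'dog', b'frog', b'horse', b'ship', b'truck']
# The names are stored as bytes; decode them to ordinary strings
class_names = [name.decode('utf-8') for name in label_names]

# First ten labels of data_batch_1, as printed earlier
sample_labels = [6, 9, 9, 4, 1, 1, 2, 7, 8, 3]
sample_names = [class_names[label] for label in sample_labels]
print(sample_names)
```

The decoded names line up with the filenames printed earlier: the first image is a frog, the second and third are trucks.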

### Loading image features

We can use the 3072 pixel values as the columns of the data and compute the distance between two points in 3072-dimensional space.

However, that is often not the best approach in terms of either effectiveness or computational efficiency. For example, instead of merely looking at the individual pixels in an image, we may find it more useful to figure out whether both images contain similar colors or similar shapes. The extraction of such higher-order information is termed *Feature Extraction*. This has already been done for you for the CIFAR-10 images.

Each image has been converted to 512 relevant features, and the features are saved in the file "AIML_DS_CIFAR-10_STD/cifar10features.mat". Let us load them.
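
To make the idea of a feature concrete, here is a toy colour-histogram extractor. It is only an illustration of summarising "similar colors" in a short vector. It is not the method used to produce the 512 features in the file, which the notebook does not specify, and the function name `color_histogram` is hypothetical.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Concatenate per-channel intensity histograms into one feature vector."""
    features = []
    for channel in range(3):
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(counts / counts.sum())  # normalise so each channel sums to 1
    return np.concatenate(features)

# Synthetic 32x32 RGB image (hypothetical): red = 100, green = blue = 0
toy_image = np.zeros((32, 32, 3), dtype=np.uint8)
toy_image[:, :, 0] = 100
hist = color_histogram(toy_image)
print(hist.shape, hist)
```

With 4 bins per channel, the 3072 pixel values collapse into 12 numbers, and two images with similar colour distributions end up close together regardless of where the colours appear.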

In [0]:
# Load the features of images
features = sio.loadmat('AIML_DS_CIFAR-10_STD/cifar10features.mat')

In [0]:
train_features = features['x_train']
train_labels = np.transpose(features['y_train'])
test_features = features['x_test']
test_labels = np.transpose(features['y_test'])

In [24]:
print(train_features.shape, train_labels.shape, test_features.shape, test_labels.shape)

(50000, 512) (50000, 1) (10000, 512) (10000, 1)


### k-NN:

Remember the kNN code:

In [0]:
from sklearn.neighbors import KNeighborsClassifier

In [0]:
neigh = KNeighborsClassifier(n_neighbors=3)

In [27]:
neigh.fit(train_features, train_labels.ravel())


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [0]:
pred = neigh.predict(test_features)

In [0]:
from sklearn.metrics import accuracy_score

In [31]:
accuracy_score(test_labels, pred)

0.9347

In [37]:
### Hint: Recall the definition of relevance and use it to calculate precision and recall
### Hint: Number of relevant images is equal to the number of images of that class in the training set 
### Your Code here

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test_labels, pred)
precision = np.diag(cm) / np.sum(cm, axis = 0)
recall = np.diag(cm) / np.sum(cm, axis = 1)

print("cm = " ,cm, "\n\n")
print("precision = ", precision, "\n\n")
print("recall =", recall)


cm =  [[938   4  16   6   4   0   1   3  23   5]
 [  4 965   0   0   0   2   0   0   3  26]
 [ 15   0 922  18  17   9  10   6   3   0]
 [  7   0  19 864  15  76  11   4   2   2]
 [  3   0  12  15 946   9   6   9   0   0]
 [  1   2   9  85  12 876   2   9   1   3]
 [  4   1  14  17   3   5 955   0   1   0]
 [  4   1   8   6   9  11   0 960   0   1]
 [ 14   3   2   4   1   1   1   0 968   6]
 [  9  22   0   2   1   2   0   1  10 953]] 


precision =  [0.93893894 0.96693387 0.92015968 0.84955752 0.93849206 0.8839556
 0.96855984 0.96774194 0.95746785 0.95682731] 


recall = [0.938 0.965 0.922 0.864 0.946 0.876 0.955 0.96  0.968 0.953]


Ideally, the top retrieved samples (the best 10, or best 100, and so on) should all have the same label as the given test sample, but this is not always the case. To check how well the retrieval performed, we shall look at metrics such as precision@k and recall@k, in addition to accuracy.

**Exercise 1**  :: Do you think accuracy is a valid metric to evaluate our search engine performance? If yes, explain.

Information Retrieval experts usually evaluate their search engine models with two closely related metrics, Precision@k and Recall@k, where k corresponds to the top-k retrievals. Let q be the query, U the number of images in the training set, R the set of “relevant” images in the training set, and T(k) the set of top-k images retrieved by our algorithm.

$$ p@k = \frac{|T(k) \cap R|}{|T(k)|} $$

$$ r@k = \frac{|T(k) \cap R|}{|R|} $$
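
The two definitions can be checked on a toy example. The sketch below assumes Euclidean distance for the ranking and uses made-up 1-D features; the function name `precision_recall_at_k` and the toy arrays are hypothetical.

```python
import numpy as np

def precision_recall_at_k(query, query_label, train_features, train_labels, k):
    """Return (p@k, r@k) for one query under the same-class relevance rule."""
    labels = np.asarray(train_labels).ravel()
    # Rank all training images by Euclidean distance to the query
    distances = np.linalg.norm(train_features - query, axis=1)
    top_k = np.argsort(distances, kind="stable")[:k]   # T(k)
    hits = np.sum(labels[top_k] == query_label)         # |T(k) ∩ R|
    num_relevant = np.sum(labels == query_label)        # |R|, assumed non-zero
    return hits / len(top_k), hits / num_relevant

# Toy data (hypothetical): six 1-D training points, three of them in class 0
toy_x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 1, 0, 0, 1, 1])
toy_q = np.array([0.4])

for k in (1, 2, 4, 6):
    p, r = precision_recall_at_k(toy_q, 0, toy_x, toy_y, k)
    print(k, round(p, 3), round(r, 3))
```

In this toy case the ranking by distance is simply the index order, so the top 4 retrievals contain all three class-0 points: p@4 is 0.75 and r@4 is 1.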

**Exercise 2**  :: Compute the precision@k and recall@k for k = 10, 100, 500, 1000, 2000, 3000, 4000, 5000, 5500, 6000. (Note how these differ from the per-class precision and recall computed earlier from the confusion matrix.)


In [0]:
## Your code

**Exercise 3**  ::  Plot the Precision-Recall Curve.

In [0]:
### Your code

**Exercise 4**  ::  Do you expect precision to increase or decrease as we increase k?

In [0]:
## Your Answer

**Exercise 5**  ::  Is there a way to make recall@k = 1 for every query for some k? What is that value of k?

In [0]:
## Your Answer

**Exercise 6**  ::  For real search engines, is finding recall@k feasible? Why or why not? Is finding precision@k feasible?

In [0]:
## Your Answer Here



**Exercise 7** :: Do you think the feature transformation is good? Try the same set of experiments directly with the image pixels (that is, treating each 32 x 32 colour image as a 3072 x 1 vector).

In [0]:
## Your Answer Here



### Please answer the questions below to complete the experiment:

In [0]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging me", "Was Tough, but I did it", "Too Difficult for me"]


In [0]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "text" #@param {type:"string"}

In [0]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["Yes", "No"]

In [0]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")