## k-nearest neighbors regression - English Dataset
This notebook does the following:
  * Find the embeddings for the questions using a pre-trained BERT model
  * Find the k-nearest neighbors of a given query question
  * Predict the output for the query question using the k-nearest neighbors
  * Choose the best value of k using a validation set

### Imports needed

In [35]:
#Imports for model
from os import path
import numpy as np
import pandas as pd
import tensorflow as tf
import os
import csv
import re
from tensorflow import keras
from sentence_transformers import SentenceTransformer

#Import GloVe model
from glove import Glove

print("TensorFlow Version: "+tf.__version__)
print("Numpy Version: "+np.__version__)
if tf.test.gpu_device_name(): 
    print('Default GPU Device:{}'.format(tf.test.gpu_device_name()))
else:
    print("GPU not found. Please install GPU version of TF if needed")

TensorFlow Version: 2.3.1
Numpy Version: 1.18.5
Default GPU Device:/device:GPU:0


### Load Data

In [36]:
#Download and unzip the pre-trained GloVe word embeddings into ../processed_files (skipped if already present)
path_to_glove_zipfile = "../processed_files/glove.42B.300d.zip"
path_to_glove_file = "../processed_files/glove.42B.300d.txt"

if not path.exists(path_to_glove_file):
    if not path.exists(path_to_glove_zipfile):
        print("downloading glove .zip file...")
        !wget -P ../processed_files http://nlp.stanford.edu/data/glove.42B.300d.zip
    print("unzipping glove .zip file...")
    !unzip -q ../processed_files/glove.42B.300d.zip -d ../processed_files

In [37]:
#Create instance Sentence Embedding
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [38]:
#Check the shape of some sentence embeddings
query1 = "let me code!"
query2 = "yes I want to launch"
query_vec = model.encode([query1, query2])
print(query_vec.shape)

(2, 768)


In [39]:
#Sentence to sequence vectors
def convert_to_vec(sentence):
    vec = model.encode([sentence])[0]
    return vec

#Read CSV and fill up given matrix for embeddings and labels
questions = []
def read_and_parse(file_path, input_size=float("inf"), read_embeddings=True, read_labels=True, training=False):
    matrix = []
    labels = []
    with open(file_path, "r", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=",")
        i = 0
        for row in reader:
            question = row["title"]
            if training:
                questions.append(question)
            if read_embeddings:
                vec = convert_to_vec(question)
                matrix.append(vec)
            if read_labels:
                labels.append(float(row["stars"]))
            i += 1
            if i % 1000 == 0:
                print(str(i//1000)+"k, ", end="")
                if i % 20000 == 0:
                    print()
            if i == input_size:
                break
    print()
    return np.array(matrix), np.array(labels)

In [40]:
#Parameters for inputs
VEC_DIM = 768
#INPUT_SIZE_TRAIN = 97528
#INPUT_SIZE_TEST = 5418
INPUT_SIZE_TRAIN = 80000
INPUT_SIZE_TEST = 4000
INPUT_SIZE_VAL = 4000
INPUT_FILE_TRAIN = "../processed_files/english_train.csv"
INPUT_FILE_TEST = "../processed_files/english_test.csv"
INPUT_FILE_VAL = "../processed_files/english_val.csv"
PATH_EMBED_TRAIN = "../processed_files/bert/bert_embed_train.txt"
PATH_EMBED_TEST = "../processed_files/bert/bert_embed_test.txt"
PATH_EMBED_VAL = "../processed_files/bert/bert_embed_val.txt"

In [43]:
#Read and create all sentence Embeddings!

if not path.exists(PATH_EMBED_TRAIN) or not path.exists(PATH_EMBED_TEST) or not path.exists(PATH_EMBED_VAL):
    #Read CSV for training
    print("Parsing train file...")
    input_sequences_train, labels_train = read_and_parse(INPUT_FILE_TRAIN, INPUT_SIZE_TRAIN, training=True)

    #Read CSV and create input matrix for testing
    print("Parsing test file...")
    input_sequences_test, labels_test = read_and_parse(INPUT_FILE_TEST, INPUT_SIZE_TEST)

    #Read CSV and create input matrix for validation
    print("Parsing validation file...")
    input_sequences_val, labels_val = read_and_parse(INPUT_FILE_VAL, INPUT_SIZE_VAL)
    
else:
    print("Embeddings already exist. Importing from files...")
    #Read Train Embeddings
    print("Importing train embeddings...")
    input_sequences_train = np.loadtxt(PATH_EMBED_TRAIN).reshape(INPUT_SIZE_TRAIN, VEC_DIM)
    _, labels_train = read_and_parse(INPUT_FILE_TRAIN, INPUT_SIZE_TRAIN, read_embeddings=False, training=True)
    
    #Read Test Embeddings
    print("Importing test embeddings...")
    input_sequences_test = np.loadtxt(PATH_EMBED_TEST).reshape(INPUT_SIZE_TEST, VEC_DIM)
    _, labels_test = read_and_parse(INPUT_FILE_TEST, INPUT_SIZE_TEST, read_embeddings=False)
    
    #Read Val Embeddings
    print("Importing val embeddings...")
    input_sequences_val = np.loadtxt(PATH_EMBED_VAL).reshape(INPUT_SIZE_VAL, VEC_DIM)
    _, labels_val = read_and_parse(INPUT_FILE_VAL, INPUT_SIZE_VAL, read_embeddings=False)

#Convert questions into numpy array
questions = np.array(questions)
    

Embeddings already exist. Importing from files...
Importing train embeddings...
1k, 2k, 3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k, 12k, 13k, 14k, 15k, 16k, 17k, 18k, 19k, 20k, 
21k, 22k, 23k, 24k, 25k, 26k, 27k, 28k, 29k, 30k, 31k, 32k, 33k, 34k, 35k, 36k, 37k, 38k, 39k, 40k, 
41k, 42k, 43k, 44k, 45k, 46k, 47k, 48k, 49k, 50k, 51k, 52k, 53k, 54k, 55k, 56k, 57k, 58k, 59k, 60k, 
61k, 62k, 63k, 64k, 65k, 66k, 67k, 68k, 69k, 70k, 71k, 72k, 73k, 74k, 75k, 76k, 77k, 78k, 79k, 80k, 

Importing test embeddings...
1k, 2k, 3k, 4k, 
Importing val embeddings...
1k, 2k, 3k, 4k, 


In [19]:
#Export embeddings into txt files

def export_numpy(filename, data):
    with open(filename, "w") as f:
        for i,row in enumerate(data,1):
            np.savetxt(f, row)
            if i % 1000 == 0:
                print(str(i//1000)+"k, ", end="")
                if i % 20000 == 0:
                    print()
        print()

#Export train data
print("Exporting train...")
export_numpy(PATH_EMBED_TRAIN, input_sequences_train)

#Export test data
print("Exporting test...")
export_numpy(PATH_EMBED_TEST, input_sequences_test)

#Export val data
print("Exporting val...")
export_numpy(PATH_EMBED_VAL, input_sequences_val)

Exporting train...
1k, 2k, 3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k, 12k, 13k, 14k, 15k, 16k, 17k, 18k, 19k, 20k, 
21k, 22k, 23k, 24k, 25k, 26k, 27k, 28k, 29k, 30k, 31k, 32k, 33k, 34k, 35k, 36k, 37k, 38k, 39k, 40k, 
41k, 42k, 43k, 44k, 45k, 46k, 47k, 48k, 49k, 50k, 51k, 52k, 53k, 54k, 55k, 56k, 57k, 58k, 59k, 60k, 
61k, 62k, 63k, 64k, 65k, 66k, 67k, 68k, 69k, 70k, 71k, 72k, 73k, 74k, 75k, 76k, 77k, 78k, 79k, 80k, 

Exporting test...
1k, 2k, 3k, 4k, 
Exporting val...
1k, 2k, 3k, 4k, 
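
The export above calls `np.savetxt` on one 1-D row at a time, which writes one value per line; that is why the loading cell has to `reshape` the result. A minimal sketch of a possible alternative, assuming binary files are acceptable: `np.save` and `np.load` keep the shape and dtype. The array and the file name `embed_demo.npy` below are illustrative stand-ins, not the notebook's data or paths.

```python
import os
import tempfile

import numpy as np

# Stand-in for an embedding matrix such as input_sequences_train (4 rows, 3 dims).
embeddings = np.arange(12, dtype=np.float64).reshape(4, 3)

# Hypothetical output location, used only for this demonstration.
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "embed_demo.npy")

np.save(npy_path, embeddings)   # binary .npy file, stores shape and dtype
restored = np.load(npy_path)    # comes back with the original shape, no reshape needed

print(restored.shape)
```

Switching formats would also mean changing the `np.loadtxt(...).reshape(...)` calls in the loading cell, so this is a design option rather than a drop-in replacement.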


In [44]:
#Some prints to check expected results
print("input shape:",input_sequences_train.shape)
print("labels train:",labels_train.shape)
#print("input shape:",input_sequences_test.shape)
#print("labels train:",labels_test.shape)
#labels_train[0]
print("trues over all on training:", sum(labels_train)/len(labels_train))

input shape: (80000, 768)
labels train: (80000,)
trues over all on training: 0.50835


In [45]:
#Shuffle train data
idx = np.random.choice(range(INPUT_SIZE_TRAIN), INPUT_SIZE_TRAIN, replace=False)
input_sequences_train = input_sequences_train[idx]
labels_train = labels_train[idx]
questions = questions[idx]

#Shuffle test data
idx = np.random.choice(range(INPUT_SIZE_TEST), INPUT_SIZE_TEST, replace=False)
input_sequences_test = input_sequences_test[idx]
labels_test = labels_test[idx]

#Shuffle val data
idx = np.random.choice(range(INPUT_SIZE_VAL), INPUT_SIZE_VAL, replace=False)
input_sequences_val = input_sequences_val[idx]
labels_val = labels_val[idx]

print("trues over all on training:", sum(labels_train)/len(labels_train)) #to validate the result is the same

trues over all on training: 0.50835


### Functions and k-NN model

In [46]:
#Function that computes the cosine similarity between every embedding and a single query vector
#and returns the indices of the k most similar embeddings
def k_nearest_neighbors(sentences, query_vec, k=5):
    #Normalize input query vector
    query_vec = query_vec / np.linalg.norm(query_vec)
    #Find the cosine similarity between all sentences and query sentence
    norm_sentences = np.linalg.norm(sentences, axis=1)
    dot_product = np.dot(sentences, query_vec)
    cosine_sims = dot_product / norm_sentences
    #Sort the similarities and return the top K of them
    return np.argsort(cosine_sims)[::-1][:k]
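
To make the ranking concrete, here is a small self-contained sketch of the same cosine-similarity idea on hand-made 2-D vectors. The helper name `top_k_cosine` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def top_k_cosine(sentences, query_vec, k=2):
    # Same idea as k_nearest_neighbors above: rank rows by cosine similarity to the query.
    query_vec = query_vec / np.linalg.norm(query_vec)
    sims = sentences @ query_vec / np.linalg.norm(sentences, axis=1)
    return np.argsort(sims)[::-1][:k]

toy = np.array([
    [1.0, 0.0],    # same direction as the query
    [0.0, 1.0],    # orthogonal to the query
    [-1.0, 0.0],   # opposite direction
    [3.0, 1.0],    # close in direction, different length
])
query = np.array([2.0, 0.0])

neighbors = top_k_cosine(toy, query, k=2)
print(neighbors)  # rows 0 and 3, in that order
```

Row 3 ranks second even though it is much longer than the query, because cosine similarity compares directions only.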

In [48]:
#Test for some distances between vectors
query_sent = "every bode knows that everybody"
query_vec =  model.encode([query_sent])[0]
knn = k_nearest_neighbors(input_sequences_train, query_vec)
print(knn)
print(questions[knn[0]])

[48602 45801 35191 14065 37333]
everybody knows that  vs everyone knows that 


In [49]:
#Clean sentence (remove non alpha chars)
def clean_sentence(sentence):
    p = re.compile(r'<.*?>')
    sentence = p.sub('', sentence) 
    sentence = ''.join([(i.lower() if i.isalpha() else " ") for i in sentence if i.isalpha() or i == " " or i == "-"])
    return sentence

### Predictions of new questions and scores!

In [50]:
#Function that estimates the final score and gives recommendations for a query question and the input sentences
def predict(query_sent, encoded=False, k=5, sentences_vecs=input_sequences_train, labels=labels_train, questions=questions):
    #We first clean the given sentence and encode it (if needed)
    if not encoded:
        query_sent = clean_sentence(query_sent)
        query =  model.encode([query_sent])[0]
    else:
        query = query_sent
    #Then, we get the k-NN for query
    knn = k_nearest_neighbors(sentences_vecs, query, k)
    #With these indices, we find the mean score and we use it as our final prediction. As a special case, if the exact
    #question was entered, then the first index should match it, so we consider its score directly
    if (not encoded and questions[knn[0]] == query_sent) or (encoded and np.all(sentences_vecs[knn[0]] == query)):
        score = labels[knn[0]]
    else:
        score = np.mean(labels[knn])
    #Now, the final prediction is "Good question" if score >= 0.5, "Bad question" otherwise
    pred = "Good Question (above the median)" if score >= 0.5 else "Bad Question (below the median)"
    #Now, to recommend similar questions, we take the first two most similar questions (non equal)
    if (not encoded and questions[knn[0]] == query_sent) or (encoded and np.all(sentences_vecs[knn[0]] == query)):
        knn = knn[1:]
    recommendations = (questions[knn[0]], questions[knn[1]])
    return score, pred, recommendations

In [51]:
#Test for some predictions
print(predict("what's the best time of the year to ski"))
print(predict("yall or yall"))
print(predict(input_sequences_train[0], encoded=True))
print(questions[0])

(0.4, 'Bad Question (below the median)', ('best term in worldwide english for a monthly cost', 'which season do you like better  best spring or winter'))
(1.0, 'Good Question (above the median)', ('the case of yall', 'seemed or seems'))
(1.0, 'Good Question (above the median)', ('that represents vs representing', 'describe with vs describe by'))
as follows vs as the following


In [58]:
#Test accuracy on test set
def test_accuracy(sentences_vecs, labels):
    correctly_predicted = 0
    #Loop through all input sentences to predict each one of them
    for i, sentence in enumerate(sentences_vecs):
        #Find prediction for sentence and check if correctly predicted
        score, _, sim = predict(sentence, encoded=True)
        score = 1 if score >= 0.5 else 0
        correctly_predicted += 1 if score == labels[i] else 0
        if i != 0 and i % 500 == 0:
            print(str(i/1000)+"k, ",end="")
            if i != 0 and i % 2000 == 0:
                print()
    print()
    #The final accuracy would be the correctly predicted over the total
    return correctly_predicted / sentences_vecs.shape[0]

In [59]:
test_accuracy(input_sequences_test, labels_test)

0.5k1.0k1.5k2.0k
2.5k3.0k3.5k


0.56075
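
The accuracy above uses the default `k=5`. The introduction also lists choosing `k` on a validation set; one way that step could look is sketched below on synthetic data. The helper `knn_accuracy`, the candidate values, and the random arrays are assumptions for illustration. The sketch also leaves out the exact-match special case that `predict` handles.

```python
import numpy as np

def knn_accuracy(train_vecs, train_labels, query_vecs, query_labels, k):
    # Cosine similarity between every query row and every training row.
    train_unit = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    query_unit = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = query_unit @ train_unit.T
    # Indices of the k most similar training rows for each query.
    top_k = np.argsort(sims, axis=1)[:, ::-1][:, :k]
    # Mean label of the neighbors, thresholded at 0.5 as in predict().
    preds = (train_labels[top_k].mean(axis=1) >= 0.5).astype(float)
    return float(np.mean(preds == query_labels))

# Synthetic stand-ins for the train and validation embeddings and labels.
rng = np.random.default_rng(0)
train_vecs = rng.normal(size=(200, 8))
train_labels = (train_vecs[:, 0] > 0).astype(float)
val_vecs = rng.normal(size=(50, 8))
val_labels = (val_vecs[:, 0] > 0).astype(float)

candidate_ks = [1, 3, 5, 7, 9]
val_acc = {k: knn_accuracy(train_vecs, train_labels, val_vecs, val_labels, k)
           for k in candidate_ks}
best_k = max(val_acc, key=val_acc.get)
print(val_acc, best_k)
```

With the real data, the validation arrays would be `input_sequences_val` and `labels_val`, and the test set would be scored only once, using the selected `k`.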

To efficiently compute pairwise distances among data points, we will convert the SFrame into a 2D Numpy array. First import the numpy library and then copy and paste `get_numpy_data()` from the second notebook of Week 2.

In [None]:
import numpy as np # note this allows us to refer to numpy as np instead

In [None]:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    features_sframe = data_sframe[features]
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    output_sarray = data_sframe[output]
    # the following will convert the SArray into a numpy array
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

We will also need the `normalize_features()` function from Week 5 that normalizes all feature columns to unit norm. Paste this function below.

In [None]:
def normalize_features(feature_matrix):
    norms = np.linalg.norm(feature_matrix, axis=0) # gives [norm(X[:,0]), norm(X[:,1]), norm(X[:,2])]
    normalized_features = feature_matrix / norms
    return (normalized_features, norms)

# Split data into training, test, and validation sets

In [None]:
(train_and_validation, test) = sales.random_split(.8, seed=1) # initial train/test split
(train, validation) = train_and_validation.random_split(.8, seed=1) # split training set into training and validation sets

# Extract features and normalize

Using all of the numerical inputs listed in `feature_list`, transform the training, test, and validation SFrames into Numpy arrays:

In [None]:
feature_list = ['bedrooms',  
                'bathrooms',  
                'sqft_living',  
                'sqft_lot',  
                'floors',
                'waterfront',  
                'view',  
                'condition',  
                'grade',  
                'sqft_above',  
                'sqft_basement',
                'yr_built',  
                'yr_renovated',  
                'lat',  
                'long',  
                'sqft_living15',  
                'sqft_lot15']
features_train, output_train = get_numpy_data(train, feature_list, 'price')
features_test, output_test = get_numpy_data(test, feature_list, 'price')
features_valid, output_valid = get_numpy_data(validation, feature_list, 'price')

In computing distances, it is crucial to normalize features. Otherwise, for example, the `sqft_living` feature (typically on the order of thousands) would exert a much larger influence on distance than the `bedrooms` feature (typically on the order of ones). We divide each column of the training feature matrix by its 2-norm, so that the transformed column has unit norm.

IMPORTANT: Make sure to store the norms of the features in the training set. The features in the test and validation sets must be divided by these same norms, so that the training, test, and validation sets are normalized consistently.

In [None]:
features_train, norms = normalize_features(features_train) # normalize training set features (columns)
features_test = features_test / norms # normalize test set by training set norms
features_valid = features_valid / norms # normalize validation set by training set norms

# Compute a single distance

To start, let's just explore computing the "distance" between two given houses.  We will take our **query house** to be the first house of the test set and look at the distance between this house and the 10th house of the training set.

To see the features associated with the query house, print the first row (index 0) of the test feature matrix. You should get an 18-dimensional vector whose components are between 0 and 1.

In [None]:
print (features_test[0])

Now print the 10th row (index 9) of the training feature matrix. Again, you get an 18-dimensional vector with components between 0 and 1.

In [None]:
print (features_train[9])

***QUIZ QUESTION***

What is the Euclidean distance between the query house and the 10th house of the training set? 

Note: Do not use the `np.linalg.norm` function; use `np.sqrt`, `np.sum`, and the power operator (`**`) instead. The latter approach is more easily adapted to computing multiple distances at once.
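
For reference, the quantity being asked for can be written out as follows. The notation is an assumption introduced here: $\mathbf{x}_q$ is the query feature vector, $\mathbf{x}_i$ is a training feature vector, and $d$ is the number of features (18 in this notebook, counting the constant).

$$
\text{distance}(\mathbf{x}_q, \mathbf{x}_i) = \sqrt{\sum_{j=1}^{d} \left(x_q[j] - x_i[j]\right)^2}
$$

The squared differences map to the power operator, the sum to `np.sum`, and the root to `np.sqrt`.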

In [None]:
dist = np.sqrt(np.sum((features_test[0]-features_train[9])**2))
print("Distance = "+str(dist))

# Compute multiple distances

Of course, to do nearest neighbor regression, we need to compute the distance between our query house and *all* houses in the training set.  

To visualize this nearest-neighbor search, let's first compute the distance from our query house (`features_test[0]`) to the first 10 houses of the training set (`features_train[0:10]`) and then search for the nearest neighbor within this small set of houses.  By restricting ourselves to a small set of houses to begin with, we can visually scan the list of 10 distances to verify that our code for finding the nearest neighbor is working.

Write a loop to compute the Euclidean distance from the query house to each of the first 10 houses in the training set.

In [None]:
test_dists = []
for i in range(10):
    test_dists.append(np.sqrt(np.sum((features_test[0]-features_train[i])**2)))

***QUIZ QUESTION***

Among the first 10 training houses, which house is the closest to the query house?

In [None]:
print("Closest house (index) = "+str(test_dists.index(min(test_dists))))
print("Closest distance      = "+str(min(test_dists)))

Looping over every house in the training set to compute the distances one at a time is computationally inefficient. Fortunately, many NumPy operations are **vectorized**: they apply the same operation over multiple values or vectors at once.  We now walk through this process.

Consider the following loop that computes the element-wise difference between the features of the query house (`features_test[0]`) and the first 3 training houses (`features_train[0:3]`):

In [None]:
for i in range(3):
    print (features_train[i]-features_test[0])
    # should print 3 vectors of length 18

The subtraction operator (`-`) in Numpy is vectorized as follows:

In [None]:
print (features_train[0:3] - features_test[0])

Note that the output of this vectorized operation is identical to that of the loop above, which can be verified below:

In [None]:
# verify that vectorization works
results = features_train[0:3] - features_test[0]
print (results[0] - (features_train[0]-features_test[0]))
# should print all 0's if results[0] == (features_train[0]-features_test[0])
print (results[1] - (features_train[1]-features_test[0]))
# should print all 0's if results[1] == (features_train[1]-features_test[0])
print (results[2] - (features_train[2]-features_test[0]))
# should print all 0's if results[2] == (features_train[2]-features_test[0])

Aside: it is a good idea to write tests like this cell whenever you are vectorizing a complicated operation.
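
A compact way to write such a check is `np.allclose`, which compares two arrays element by element within a small tolerance. The sketch below uses random stand-in arrays (the names `train` and `query` are illustrative) so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((5, 18))   # synthetic stand-in for features_train
query = rng.random(18)        # synthetic stand-in for features_test[0]

# Reference result computed row by row.
looped = np.array([train[i] - query for i in range(train.shape[0])])
# Broadcasting subtracts the query from every row in one step.
vectorized = train - query

matches = np.allclose(looped, vectorized)
print(matches)
```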

# Perform 1-nearest neighbor regression

Now that we have the element-wise differences, it is not too hard to compute the Euclidean distances between our query house and all of the training houses. First, write a single-line expression to define a variable `diff` such that `diff[i]` gives the element-wise difference between the features of the query house and the `i`-th training house.

In [None]:
diff = features_train-features_test[0]

To test the code above, run the following cell, which should output a value -0.0934339605842:

In [None]:
print (diff[-1].sum()) # sum of the feature differences between the query and last training house
# should print -0.0934339605842

The next step in computing the Euclidean distances is to take these feature-by-feature differences in `diff`, square each, and take the sum over feature indices.  That is, compute the sum of square feature differences for each training house (row in `diff`).

By default, `np.sum` sums up everything in the matrix and returns a single number. To instead sum only over a row or column, we need to specify the `axis` parameter described in the `np.sum` [documentation](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.sum.html). In particular, `axis=1` computes the sum across each row.
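
A tiny illustration of the `axis` parameter on a made-up 2x3 matrix:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(m)             # 21: every entry
row_sums = np.sum(m, axis=1)  # [6, 15]: one sum per row
col_sums = np.sum(m, axis=0)  # [5, 7, 9]: one sum per column
print(total, row_sums, col_sums)
```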

Below, we compute this sum of square feature differences for all training houses and verify that the output for the 16th house in the training set is equivalent to having examined only the 16th row of `diff` and computing the sum of squares on that row alone.

In [None]:
print (np.sum(diff**2, axis=1)[15]) # take sum of squares across each row, and print the 16th sum
print (np.sum(diff[15]**2)) # print the sum of squares for the 16th row -- should be same as above

With this result in mind, write a single-line expression to compute the Euclidean distances between the query house and all houses in the training set. Assign the result to a variable `distances`.

**Hint**: Do not forget to take the square root of the sum of squares.

In [None]:
distances = np.sqrt(np.sum(diff**2, axis=1))

To test the code above, run the following cell, which should output a value 0.0237082324496:

In [None]:
print (distances[100]) # Euclidean distance between the query house and the 101st training house
# should print 0.0237082324496

Now you are ready to write a function that computes the distances from a query house to all training houses. The function should take two parameters: (i) the matrix of training features and (ii) the single feature vector associated with the query.

In [None]:
def compute_distances(features_instances, features_query):
    diff = features_instances-features_query
    distances = np.sqrt(np.sum(diff**2, axis=1))
    return distances

***QUIZ QUESTIONS***

1.  Take the query house to be the third house of the test set (`features_test[2]`).  What is the index of the house in the training set that is closest to this query house?
2.  What is the predicted value of the query house based on 1-nearest neighbor regression?

In [None]:
distances = compute_distances(features_train, features_test[2])
min_distance = np.amin(distances) # smallest distance to the query house
indices = np.where(distances == min_distance) # index (or indices) of the closest training house
print(indices[0])

In [None]:
print(output_train[indices][0])

# Perform k-nearest neighbor regression

For k-nearest neighbors, we need to find a *set* of k houses in the training set closest to a given query house. We then make predictions based on these k nearest neighbors.

## Fetch k-nearest neighbors

Using the functions above, implement a function that takes in
 * the value of k;
 * the feature matrix for the training houses; and
 * the feature vector of the query house
 
and returns the indices of the k closest training houses. For instance, with 2-nearest neighbor, a return value of [5, 10] would indicate that the 6th and 11th training houses are closest to the query house.

**Hint**: Look at the [documentation for `np.argsort`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html).

In [None]:
def k_nearest_neighbors(k, feature_train, features_query):
    #First, compute distances for all houses
    distances = compute_distances(feature_train, features_query)
    #Now, return top k indices
    neighbors = np.argsort(distances)[0:k]
    return neighbors
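
Sorting every distance does more work than needed when only the `k` smallest matter. A possible variant, sketched below under the assumption that the distances are already computed, uses `np.argpartition` and then sorts only the `k` selected entries. The function name `k_nearest_neighbors_partition` and the sample distances are hypothetical, and `k` must not exceed the number of distances.

```python
import numpy as np

def k_nearest_neighbors_partition(k, distances):
    # Hypothetical variant: takes precomputed distances instead of feature matrices.
    # argpartition places the k smallest distances in the first k slots (unordered).
    candidates = np.argpartition(distances, k - 1)[:k]
    # Sort just those k candidates so the result is ordered nearest-first.
    return candidates[np.argsort(distances[candidates])]

distances = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
nearest = k_nearest_neighbors_partition(3, distances)
print(nearest)  # indices 1, 3, 2
```

For this input the result matches `np.argsort(distances)[0:3]`; the difference is only in how much sorting is done.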

***QUIZ QUESTION***

Take the query house to be the third house of the test set (`features_test[2]`).  What are the indices of the 4 training houses closest to the query house?

In [None]:
print(k_nearest_neighbors(4, features_train, features_test[2]))

## Make a single prediction by averaging k nearest neighbor outputs

Now that we know how to find the k-nearest neighbors, write a function that predicts the value of a given query house. **For simplicity, take the average of the prices of the k nearest neighbors in the training set**. The function should have the following parameters:
 * the value of k;
 * the feature matrix for the training houses;
 * the output values (prices) of the training houses; and
 * the feature vector of the query house, whose price we are predicting.
 
The function should return a predicted value of the query house.

**Hint**: You can extract multiple items from a Numpy array using a list of indices. For instance, `output_train[[6, 10]]` returns the prices of the 7th and 11th training houses.
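
A short illustration of this kind of indexing, with made-up prices and neighbor indices:

```python
import numpy as np

prices = np.array([100.0, 250.0, 175.0, 300.0, 225.0])  # made-up prices
neighbor_idx = np.array([1, 4])                          # made-up neighbor indices

selected = prices[neighbor_idx]   # array([250., 225.])
prediction = selected.mean()      # 237.5
print(selected, prediction)
```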

In [None]:
def predict_output_of_query(k, features_train, output_train, features_query):
    #Get indices of top k closest houses
    indices = k_nearest_neighbors(k, features_train, features_query)
    #Now, average prices for those indices
    prices = output_train[indices]
    prediction = np.sum(prices) / k
    return prediction

***QUIZ QUESTION***

Again taking the query house to be the third house of the test set (`features_test[2]`), predict the value of the query house using k-nearest neighbors with `k=4` and the simple averaging method described and implemented above.

In [None]:
print(predict_output_of_query(4, features_train, output_train, features_test[2]))

Compare this predicted value using 4-nearest neighbors to the predicted value using 1-nearest neighbor computed earlier.

## Make multiple predictions

Write a function to predict the value of *each and every* house in a query set. (The query set can be any subset of the dataset, be it the test set or validation set.) The idea is to have a loop where we take each house in the query set as the query house and make a prediction for that specific house. The new function should take the following parameters:
 * the value of k;
 * the feature matrix for the training houses;
 * the output values (prices) of the training houses; and
 * the feature matrix for the query set.
 
The function should return a set of predicted values, one for each house in the query set.

**Hint**: To get the number of houses in the query set, use the `.shape` field of the query features matrix. See [the documentation](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ndarray.shape.html).

In [None]:
def predict_output(k, features_train, output_train, features_query):
    #Init list of predictions and loop through all houses
    predictions = []
    for i in range(features_query.shape[0]):
        #Get value and append to list
        predictions.append(predict_output_of_query(k, features_train, output_train, features_query[i]))
    return predictions
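
The loop above computes one query at a time. As a hedged sketch of how the whole query set could be handled at once, the function below builds the full matrix of squared distances with the identity $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a\cdot b$. The name `predict_output_vectorized` and the random data are assumptions for illustration. The distance matrix has one entry per (query, training) pair, so large sets may need to be processed in chunks.

```python
import numpy as np

def predict_output_vectorized(k, features_train, output_train, features_query):
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_train = np.sum(features_train ** 2, axis=1)
    sq_query = np.sum(features_query ** 2, axis=1)
    sq_dists = sq_query[:, None] + sq_train[None, :] - 2.0 * features_query @ features_train.T
    # Ranking by squared distance gives the same neighbors as ranking by distance.
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    return output_train[nearest].mean(axis=1)

# Synthetic data, only to compare against a plain loop.
rng = np.random.default_rng(1)
X_train = rng.random((50, 4))
y_train = rng.random(50)
X_query = rng.random((6, 4))

fast = predict_output_vectorized(3, X_train, y_train, X_query)
slow = np.array([
    y_train[np.argsort(np.sqrt(np.sum((X_train - q) ** 2, axis=1)))[:3]].mean()
    for q in X_query
])
agree = np.allclose(fast, slow)
print(agree)
```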

***QUIZ QUESTION***

Make predictions for the first 10 houses in the test set using k-nearest neighbors with `k=10`. 

1. What is the index of the house in this query set that has the lowest predicted value? 
2. What is the predicted value of this house?

In [None]:
values = predict_output(10, features_train, output_train, features_test[0:10])
min_house = np.amin(values)
indices = np.where(values == min_house)
print(indices[0])
print(min_house)

## Choosing the best value of k using a validation set

There remains a question of choosing the value of k to use in making predictions. Here, we use a validation set to choose this value. Write a loop that does the following:

* For `k` in [1, 2, ..., 15]:
    * Makes predictions for each house in the VALIDATION set using the k-nearest neighbors from the TRAINING set.
    * Computes the RSS for these predictions on the VALIDATION set
    * Stores the RSS computed above in `all_RSS`
* Report which `k` produced the lowest RSS on VALIDATION set.

(Depending on your computing environment, this computation may take 10-15 minutes.)
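
For reference, the residual sum of squares for a given `k` can be written as below. The notation is introduced here for illustration: $y_i$ is the true price of validation house $i$, and $\hat{y}_i^{(k)}$ is its prediction, the average price of its $k$ nearest training houses.

$$
\mathrm{RSS}(k) = \sum_{i \in \text{VALIDATION}} \left(y_i - \hat{y}_i^{(k)}\right)^2
$$

The value of `k` with the smallest $\mathrm{RSS}(k)$ is the one reported.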

In [None]:
all_RSS = []
for k in range(1,16):
    #Find prices
    prices_validation = predict_output(k, features_train, output_train, features_valid)
    #Compute RSS
    all_RSS.append(np.sum(np.power(output_valid - prices_validation,2)))
print("Best k value = "+str((all_RSS.index(min(all_RSS))+1)))

To visualize the performance as a function of `k`, plot the RSS on the VALIDATION set for each considered `k` value:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

kvals = range(1, 16)
plt.plot(kvals, all_RSS,'bo-')

***QUIZ QUESTION***

What is the RSS on the TEST data using the value of k found above?  To be clear, sum over all houses in the TEST set.

In [None]:
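# k=8 is hard-coded here; substitute the best k reported by the validation loop above if it differs
prices_test = predict_output(8, features_train, output_train, features_test)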
print(np.sum(np.power(output_test - prices_test,2)))