## Make multiple predictions

Write a function to predict the value of *each and every* house in a query set. (The query set can be any subset of the dataset, be it the test set or validation set.) The idea is to have a loop where we take each house in the query set as the query house and make a prediction for that specific house. The new function should take the following parameters:
 * the value of k;
 * the feature matrix for the training houses;
 * the output values (prices) of the training houses; and
 * the feature matrix for the query set.
 
The function should return a set of predicted values, one for each house in the query set.

**Hint**: To get the number of houses in the query set, use the `.shape` field of the query features matrix. See [the documentation](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ndarray.shape.html).
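
As a quick illustration of the hint, the sketch below reads the row and column counts from `.shape`. The array `example_query` is a made-up stand-in for a query feature matrix, not data from the assignment.

```python
import numpy as np

# Hypothetical query matrix: 3 houses (rows), 2 features (columns).
example_query = np.array([[1.0, 2.0],
                          [3.0, 4.0],
                          [5.0, 6.0]])

num_query = example_query.shape[0]     # number of rows = number of houses
num_features = example_query.shape[1]  # number of columns = number of features
print(num_query)
```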

In [None]:
def KNN_predict(k, train_features, train_output, query_features):
    # One prediction per row (house) of the query matrix.
    num_query = query_features.shape[0]
    predict_list = np.zeros(num_query)
    for i in range(num_query):
        # knn_search gives the k-NN predicted price for a single query house.
        predict_list[i] = knn_search(k, train_features, query_features[i], train_output)
    return predict_list
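
The same loop can be tried on toy data with the self-contained sketch below. It assumes the single-house predictor averages the prices of the k nearest training houses, which is the usual choice for k-NN regression. The names `knn_search_sketch`, `knn_predict_sketch`, and the toy arrays are hypothetical stand-ins, not the notebook's own definitions.

```python
import numpy as np

def knn_search_sketch(k, train_features, query_feature, train_output):
    # Hypothetical stand-in: average the prices of the k training houses
    # closest (in Euclidean distance) to the query house.
    distances = np.sqrt(np.sum((train_features - query_feature) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return train_output[nearest].mean()

def knn_predict_sketch(k, train_features, train_output, query_features):
    # One prediction per row of the query matrix.
    return np.array([knn_search_sketch(k, train_features, q, train_output)
                     for q in query_features])

# Toy data (made up for illustration): one feature, four training houses.
toy_train = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_prices = np.array([100.0, 200.0, 300.0, 1000.0])
toy_query = np.array([[0.4], [9.0]])

toy_predictions = knn_predict_sketch(2, toy_train, toy_prices, toy_query)
print(toy_predictions)
```

With `k=2`, the first query house is closest to the training houses at 0.0 and 1.0, and the second to those at 10.0 and 2.0, so each prediction is the mean of the corresponding two prices.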

*** QUIZ QUESTION ***

Make predictions for the first 10 houses in the test set using k-nearest neighbors with `k=10`. 

1. What is the index of the house in this query set that has the lowest predicted value? 6
2. What is the predicted value of this house? 350032.0

In [None]:
predict_all = KNN_predict(10, features_train, output_train, features_test[0:10, :])
print(predict_all.shape)
print(np.argmin(predict_all))  # index of the house with the lowest predicted value
print(np.min(predict_all))     # the lowest predicted value

## Choosing the best value of k using a validation set

There remains a question of choosing the value of k to use in making predictions. Here, we use a validation set to choose this value. Write a loop that does the following:

* For `k` in [1, 2, ..., 15]:
    * Makes predictions for each house in the VALIDATION set using the k-nearest neighbors from the TRAINING set.
    * Computes the RSS for these predictions on the VALIDATION set
    * Stores the RSS computed above in `RSS_all`
* Report which `k` produced the lowest RSS on the VALIDATION set. 8

(Depending on your computing environment, this computation may take 10-15 minutes.)

In [None]:
def RSS(predict, output):
    # Residual sum of squares between predictions and true values.
    return ((predict - output) ** 2).sum()

RSS_all = np.zeros(15)
for k in range(1, 16):
    predict_all = KNN_predict(k, features_train, output_train, features_valid)
    RSS_all[k - 1] = RSS(predict_all, output_valid)  # slot k-1 holds the RSS for k
print(RSS_all)
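
To report the best `k` rather than reading it off the printed array, one option is `np.argmin`. The sketch below uses made-up RSS values (`rss_example` is hypothetical); the only point is the off-by-one shift, since position 0 of the array corresponds to `k = 1`.

```python
import numpy as np

# Hypothetical RSS values for k = 1, 2, ..., 5 (made up for illustration).
rss_example = np.array([9.0, 7.5, 6.0, 6.4, 8.1])

# Position 0 corresponds to k = 1, so add 1 to the argmin.
best_k = int(np.argmin(rss_example)) + 1
print(best_k)
```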

To visualize the performance as a function of `k`, plot the RSS on the VALIDATION set for each considered `k` value:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

kvals = range(1, 16)
plt.plot(kvals, RSS_all,'bo-')

*** QUIZ QUESTION ***

What is the RSS on the TEST data using the value of k found above? To be clear, sum over all houses in the TEST set.

1.33118823552e+14

To test the code above, run the following cell, which should output a value 0.0237082324496:

In [None]:
print(distances[100])  # Euclidean distance between the query house and the 101st training house
# should print 0.0237082324496

Now you are ready to write a function that computes the distances from a query house to all training houses. The function should take two parameters: (i) the matrix of training features and (ii) the single feature vector associated with the query.
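
The computation relies on NumPy broadcasting: subtracting a feature vector from a feature matrix subtracts it from every row. A minimal check on made-up data (the `toy_*` names are hypothetical) might look like this:

```python
import numpy as np

# Toy data (made up): three training points with two features each.
toy_train = np.array([[0.0, 0.0],
                      [3.0, 4.0],
                      [6.0, 8.0]])
toy_query = np.array([0.0, 0.0])

# Broadcasting subtracts the query vector from every row of the matrix.
toy_diff = toy_train - toy_query
toy_distances = np.sqrt(np.sum(toy_diff ** 2, axis=1))
print(toy_distances)
```

The rows were chosen as multiples of a 3-4-5 right triangle so the expected distances (0, 5, and 10) are easy to verify by hand.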

In [None]:
def house_distance(train_features, query_features):
    # Euclidean distance from the query house to every training house (one value per row).
    distance = np.sqrt(np.sum((train_features - query_features) ** 2, axis=1))
    return distance
    

*** QUIZ QUESTIONS ***

1.  Take the query house to be the third house of the test set (`features_test[2]`).  What is the index of the house in the training set that is closest to this query house? 382
2.  What is the predicted value of the query house based on 1-nearest neighbor regression? 600000

In [None]:
np.argmin(house_distance(features_train, features_test[2]))


In [None]:
output_train[382]

# Perform k-nearest neighbor regression

For k-nearest neighbors, we need to find a *set* of k houses in the training set closest to a given query house. We then make predictions based on these k nearest neighbors.

## Fetch k-nearest neighbors

Using the functions above, implement a function that takes in
 * the value of k;
 * the feature matrix for the training houses; and
 * the feature vector of the query house
 
and returns the indices of the k closest training houses. For instance, with 2-nearest neighbor, a return value of [5, 10] would indicate that the 6th and 11th training houses are closest to the query house.

**Hint**: Look at the [documentation for `np.argsort`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html).
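
As a small illustration of the hint, `np.argsort` returns the indices that would sort an array in ascending order, so the first k entries are the indices of the k smallest distances. The distances below are made up for the example.

```python
import numpy as np

# Hypothetical distances from one query house to five training houses.
example_distances = np.array([0.9, 0.2, 0.5, 0.1, 0.7])

order = np.argsort(example_distances)  # indices from closest to farthest
two_nearest = order[:2]                # indices of the 2 closest houses
print(two_nearest)
```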

In [None]:
def k_nearest_neighbors(k,train_features, query_feature):
    all_distance = house_distance(train_features,query_feature)
    sort_index = np.argsort(all_distance)
    index_list = sort_index[0:k]
    return index_list

*** QUIZ QUESTION ***

Take the query house to be the third house of the test set (`features_test[2]`).  What are the indices of the 4 training houses closest to the query house? [382, 1149, 4087, 3142]

In [None]:
diff = features_train - features_test[0]  # feature-by-feature differences, one row per training house
#diff1 = np.sqrt((features_train[:] - features_test[0])**2)
print(diff)


To test the code above, run the following cell, which should output a value -0.0934339605842:

In [None]:
print(diff[-1].sum())  # sum of the feature differences between the query and last training house
# should print -0.0934339605842

The next step in computing the Euclidean distances is to take these feature-by-feature differences in `diff`, square each, and take the sum over feature indices.  That is, compute the sum of squared feature differences for each training house (row in `diff`).

By default, `np.sum` sums up everything in the matrix and returns a single number. To instead sum only over a row or column, we need to specify the `axis` parameter described in the `np.sum` [documentation](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.sum.html). In particular, `axis=1` computes the sum across each row.
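
The effect of the `axis` parameter is easy to see on a small made-up matrix (`m` is hypothetical, chosen only for illustration):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(m)             # sums every entry: 21
row_sums = np.sum(m, axis=1)  # one sum per row: [6, 15]
col_sums = np.sum(m, axis=0)  # one sum per column: [5, 7, 9]
print(row_sums)
```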

Below, we compute this sum of squared feature differences for all training houses and verify that the output for the 16th house in the training set matches the sum of squares computed on the 16th row of `diff` alone.

In [None]:
print(np.sum(diff**2, axis=1)[15])  # take sum of squares across each row, and print the 16th sum
print(np.sum(diff[15]**2))  # print the sum of squares for the 16th row -- should be same as above

With this result in mind, write a single-line expression to compute the Euclidean distances between the query house and all houses in the training set. Assign the result to a variable `distances`.

**Hint**: Do not forget to take the square root of the sum of squares.

In [None]:
distances = np.sqrt(np.sum((features_train[:] - features_test[0])**2,axis = 1))