# Image Scene Classification
The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number.

|Number|Class                               |
|------|------------------------------------|
|1     |red soil                            |
|2     |cotton crop                         |
|3     |grey soil                           |
|4     |damp grey soil                      |
|5     |soil with vegetation stubble        |
|6     |mixture class (all types present)   |
|7     |very damp grey soil                 |
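
The class coding in the table can be kept as a small lookup so that results are reported by name rather than by number. This is a minimal sketch: `CLASS_NAMES`, `class_counts` and `example_labels` are illustrative names that do not appear in the notebook, and the example labels are made up rather than taken from the dataset.

```python
import numpy as np

# Class codes copied from the table above.
CLASS_NAMES = {
    1: "red soil",
    2: "cotton crop",
    3: "grey soil",
    4: "damp grey soil",
    5: "soil with vegetation stubble",
    6: "mixture class (all types present)",
    7: "very damp grey soil",
}


def class_counts(labels):
    """Return {class name: number of rows} for an array of numeric class codes."""
    codes, counts = np.unique(np.asarray(labels).astype(int), return_counts=True)
    return {CLASS_NAMES[int(c)]: int(n) for c, n in zip(codes, counts)}


# Hypothetical labels, for illustration only (not taken from the dataset).
example_labels = np.array([1.0, 2.0, 2.0, 7.0, 1.0, 1.0])
print(class_counts(example_labels))
```

The cast to `int` is there because labels stored in a floating-point array would otherwise not match the integer keys reliably.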

The rows are given in random order, and some rows have been removed, so the original image cannot be reconstructed from this dataset.

In each line of data, the four spectral values for the top-left pixel are given first, followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on, with the pixels read out left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are attributes 17, 18, 19 and 20 (counting from 1; columns 16 to 19 in zero-based NumPy indexing). You can use only these four attributes and ignore the others, which avoids the problem that arises when a 3x3 neighbourhood straddles a boundary.
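
The attribute layout described above can be written as a small index calculation. This is a sketch under the stated ordering (pixels left-to-right, top-to-bottom, four bands per pixel); `pixel_columns`, `N_BANDS`, `GRID` and `X_demo` are illustrative names, and `X_demo` is a stand-in array rather than the real data.

```python
import numpy as np

N_BANDS = 4  # spectral values per pixel
GRID = 3     # 3x3 neighbourhood


def pixel_columns(row, col):
    """Zero-based column indices of the four bands for the pixel at (row, col).

    row and col run from 0 to 2; (1, 1) is the central pixel.
    """
    start = (row * GRID + col) * N_BANDS
    return np.arange(start, start + N_BANDS)


central = pixel_columns(1, 1)  # columns of the central pixel
print(central + 1)             # the same columns as 1-based attribute numbers

# Hypothetical feature matrix standing in for the 36 spectral columns.
X_demo = np.arange(2 * 36).reshape(2, 36)
X_central = X_demo[:, central]  # keep only the central pixel's four bands
```

Applying the same column selection to the training and test feature matrices would give the four-attribute variant of the problem mentioned above.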

The 36 spectral attributes are followed by the class label. The last column is a split flag: 1 for training rows, 0 for test rows.

## Load and Prepare Data

In [1]:
import numpy as np

In [2]:
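# Columns 0-35 hold the spectral values, column 36 the class code,
# and the last column the train/test flag (1 = train, 0 = test).
data = np.load('../data/sat.npy')
is_train = data[:, -1] == 1
X_train, X_test = data[is_train, :36], data[~is_train, :36]
y_train, y_test = data[is_train, 36], data[~is_train, 36]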

## k-Nearest-Neighbors

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [4]:
# PAGE 471. Then five-nearest-neighbors classification was carried out in this
#           36-dimensional feature space. The resulting test error rate was
#           about 9.5% (see Figure 13.8). Of all the methods used in the
#           STATLOG project, including LVQ, CART, neural networks, linear
#           discriminant analysis and many others, k-nearest-neighbors
#           performed best on this task. Hence it is likely that the decision
#           boundaries in R^36 are quite irregular.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_test_hat = model.predict(X_test)
error_rate = 1 - accuracy_score(y_test, y_test_hat)
print(f'Test Error rate: {100*error_rate:.1f}%')

Test Error rate: 9.6%
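
To make explicit what the five-nearest-neighbors classifier computes, the sketch below implements a plain majority vote over Euclidean distances in NumPy. It is not scikit-learn's implementation: `knn_predict` is an illustrative name, ties are broken here in favour of the smallest class label, and the toy data is made up rather than taken from the satellite dataset, so the printed error rate says nothing about the result above.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train).ravel()
    classes, y_idx = np.unique(y_train, return_inverse=True)

    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )

    # Indices of the k smallest distances per test row (unordered within the k).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]

    # Count one vote per neighbour; argmax picks the smallest label on ties.
    rows = np.arange(X_test.shape[0])
    votes = np.zeros((X_test.shape[0], classes.size), dtype=int)
    for j in range(k):
        votes[rows, y_idx[nearest[:, j]]] += 1
    return classes[votes.argmax(axis=1)]


# Hypothetical toy data: two well-separated clusters (not the satellite data).
X_tr = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_tr = np.array([1, 1, 1, 2, 2, 2])
X_te = np.array([[0.05, 0.05], [4.9, 5.0]])
y_te = np.array([1, 2])

pred = knn_predict(X_tr, y_tr, X_te, k=3)
error_rate = np.mean(pred != y_te)
print(f'Toy error rate: {100 * error_rate:.1f}%')
```

The full distance matrix is held in memory at once, which is simple but scales with the product of the test and training set sizes; tree-based or chunked searches avoid that cost.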
