# Introduction
The goal of this project is to classify the KDD-Cup99 data set with the k-nearest neighbors (KNN) algorithm. It uses the KNN implementation developed in the previous exercise; the code is available at https://github.com/alileino/ml_study. The project depends on numpy, pandas and scikit-learn. The KNN implementation was fast enough to produce the plots in this report in less than 30 seconds, so the full data set could be used.

# Preprocessing

The following class handles data loading and preprocessing. The test set contained rows with missing values; since there were only 16 of them, they were dropped without further analysis of their meaning. The training and test sets are combined so that they can be encoded and normalized together. Features 1-3 are categorical and are one-hot encoded in the `__encode` method. Columns that contain only a single value are removed, since they carry no information and would cause a division by zero during normalization. The label column is removed before normalization, which centers each feature on its mean and scales it by its range, and finally the data is split back into the original training and test matrices.
```python
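import numpy as np
import pandas as pd

# TRAIN_FILE and TEST_FILE are path constants that are not defined in this
# snippet; they are assumed to come from the project code.


class KddDataProvider: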
    '''
    Loads and preprocesses KddCup99 data.
    '''
    label_columns = [1,2,3]
    def __init__(self, debug=False):
        '''
        :param debug: If true, it uses a small part of the entire training data set
        '''
        trainX, trainY, testX, testY = self.__load_data(debug)
        self.trainX = trainX
        self.trainY = trainY
        self.testX = testX
        self.testY = testY

    def get_train(self):

        return self.trainX, self.trainY

    def get_test(self):
        return self.testX, self.testY

    def __load_data(self, debug):
        '''
        Loads the training and test data and preprocesses them.
        :param debug: If true, only a 30% sample of the training data is used
        :return: trainX, trainY, testX, testY
        '''
        dftrain = pd.read_csv(TRAIN_FILE, header=None, names=np.arange(0,42))

        if debug:
            dftrain = dftrain.sample(frac=0.3, random_state=1)

        train_size = len(dftrain)
        dftest = pd.read_csv(TEST_FILE, header=None, names=np.arange(0,42))

        # Drop rows with NA's
        dftest = dftest.dropna(axis=0, how="any")
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df = pd.concat([dftrain, dftest], ignore_index=True)

        for c in df.columns: # Drop columns with only a single value,
            if len(df[c].unique()) <= 1:
                df = df.drop(c, axis=1)
        df = self.__encode(df)

        trainY = df.values[:train_size,-1]

        testY = df.values[train_size:, -1]
        df = df.drop(df.columns[-1], axis=1) # drop the last column

        # Mean-normalize the data: center on the mean, scale by the range
        df = (df - df.mean()) / (df.max() - df.min())

        # The label column was already dropped above, so keep every remaining column
        trainX = df.values[:train_size, :]
        testX = df.values[train_size:, :]
        return trainX, trainY, testX, testY

    def __encode(self, df):
        newdf = df[df.columns[0]]

        for label_column in KddDataProvider.label_columns:
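            # One-hot encode the categorical column as 0/1 indicator columns
            dummies = pd.get_dummies(df[label_column], prefix='', prefix_sep='').astype(int)
            newdf = pd.concat((newdf, dummies), axis=1, ignore_index=True)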

        newdf = pd.concat((newdf, df[df.columns[4:]]), axis=1)
        newdf.columns = np.arange(0, len(newdf.columns))

        return newdf
```
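The preprocessing steps described above can be tried out on a toy table. The following is a minimal sketch, not the project's code: the function name `preprocess`, the column names and the data are hypothetical and chosen only for illustration. It drops constant columns, one-hot encodes a categorical column, and applies the same mean-and-range scaling as the class above.

```python
import numpy as np
import pandas as pd


def preprocess(df, categorical_columns, label_column):
    """Toy pipeline: drop constant columns, one-hot encode, normalize."""
    # Separate the label so that it is never encoded or normalized.
    y = df[label_column].to_numpy()
    X = df.drop(columns=[label_column])

    # Drop columns holding a single value; their range is 0, which would
    # cause a division by zero during normalization.
    constant = [c for c in X.columns if X[c].nunique() <= 1]
    X = X.drop(columns=constant)

    # One-hot encode the categorical columns that survived.
    categorical = [c for c in categorical_columns if c in X.columns]
    X = pd.get_dummies(X, columns=categorical, dtype=int)

    # Mean normalization: center on the mean, scale by the range.
    X = (X - X.mean()) / (X.max() - X.min())
    return X.to_numpy(dtype=float), y


# Hypothetical toy data, not taken from KDD-Cup99.
toy = pd.DataFrame({
    "duration": [0, 2, 4, 6],
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "unused": [0, 0, 0, 0],  # constant column, gets dropped
    "label": [1, 1, 2, 4],
})
X, y = preprocess(toy, categorical_columns=["protocol"], label_column="label")
print(X.shape)  # (4, 4): duration plus three protocol indicator columns
```

After scaling, every column has zero mean and a range of exactly 1, which keeps any single feature from dominating the distance computation in KNN.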


# Plotting implementation
The following code plots the 10-fold cross-validated accuracy for each neighbor count k from 1 to 10.
``` python
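from os import path

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold

# KNN, cv_accuracy_score and OUTPUT_PATH are not defined in this snippet;
# they are assumed to come from the project code.


def plot_model_selection():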
    neighbors = np.arange(1,11)

    data = KddDataProvider()
    X, y = data.get_train()

    scores = []

    for k in neighbors:

        knn = KNN(n_neighbors=k)
        s = cv_accuracy_score(knn, X, y, cv=KFold(n_splits=10))
        scores.append(s)


    plt.figure()
    mean_scores = np.mean(scores, axis=1)

    # Uncomment to plot std error bars
    # std_scores = np.std(scores, axis=1)
    # plt.errorbar(neighbors, mean_scores, yerr=std_scores, lw=2, label="Accuracy")
    plt.plot(neighbors, mean_scores, label="Accuracy")
    plt.legend()

    plt.suptitle("10-fold average CV accuracy for KNN")
    plt.xlabel("k (neighbors)")
    plt.ylabel("Score")
    plt.xticks(neighbors)
    plt.savefig(path.join(OUTPUT_PATH, "kdd_model_selection.png"), format="PNG")
```
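The plotting code relies on `cv_accuracy_score`, which is not shown in this report. Since its result is averaged with `np.mean(scores, axis=1)`, it presumably returns one accuracy value per fold. The sketch below shows what such a helper might look like; `cv_accuracy_sketch` is a hypothetical name, scikit-learn's `KNeighborsClassifier` stands in for the project's `KNN` class, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def cv_accuracy_sketch(model, X, y, cv):
    """Return one accuracy value per fold (hypothetical stand-in helper)."""
    fold_scores = []
    for train_idx, val_idx in cv.split(X):
        # Fit on the training folds, score on the held-out fold.
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        fold_scores.append(accuracy_score(y[val_idx], predictions))
    return np.array(fold_scores)


# Synthetic two-cluster data, used only to exercise the function.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# The synthetic rows are sorted by class, so the folds are shuffled here.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = [cv_accuracy_sketch(KNeighborsClassifier(n_neighbors=k), X, y, cv)
          for k in (1, 3, 5)]
mean_scores = np.mean(scores, axis=1)  # one mean accuracy per k
print(mean_scores)
```

Each row of `scores` holds the ten fold accuracies for one value of k, so averaging along axis 1 gives the curve that is plotted against k.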

# Confusion matrix and results implementation
The following code plots the confusion matrix and prints the accuracy and weighted F1-score on the test data for k = 8.
```python
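from os import path

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# plot_confusion_matrix(cm, classes), KNN and OUTPUT_PATH are not defined in
# this snippet; they are assumed to come from the project code.


def plot_confusion(truey, predy):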
    cm = confusion_matrix(truey, predy)
    plot_confusion_matrix(cm, np.unique(truey))
    plt.savefig(path.join(OUTPUT_PATH, "kdd_confusion_matrix.png"), format="PNG")


def print_results():
    data = KddDataProvider()
    trainX, trainY = data.get_train()
    testX, testY = data.get_test()
    knn = KNN(n_neighbors=8)
    knn.fit(trainX, trainY)
    predY = knn.predict(testX)
    accuracy = accuracy_score(testY, predY)
    fscore = f1_score(testY, predY, average="weighted")
    print("Accuracy:", accuracy, "F1-score", fscore)
    plot_confusion(testY, predY)
```

# Results
For debugging purposes, the results were compared to the scikit-learn implementation of KNN. They were identical, so they are not included in this document. A neighbor value of 1 gave the best cross-validated accuracy, 0.968, but the difference between that and a neighbor value of 8 was less than 0.01, so 8 was chosen as the final hyperparameter value. A larger k averages over more neighbors and is less sensitive to noisy individual training points, so 8 is likely to generalize better to unseen data than 1.
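The report does not state the exact rule by which 8 was picked among the candidates. One possible way to formalize a choice of this kind is to take the largest k whose mean score lies within a tolerance of the best score. The sketch below illustrates that idea; the function name `largest_k_within_tolerance` and the scores are made up for illustration and are not the project's results.

```python
import numpy as np


def largest_k_within_tolerance(neighbors, mean_scores, tolerance=0.01):
    """Return the largest k whose mean CV score is within `tolerance` of the best."""
    neighbors = np.asarray(neighbors)
    mean_scores = np.asarray(mean_scores)
    good_enough = mean_scores >= mean_scores.max() - tolerance
    return int(neighbors[good_enough].max())


# Made-up scores for illustration only; they are not the project's results.
neighbors = np.arange(1, 6)
mean_scores = np.array([0.95, 0.947, 0.944, 0.93, 0.92])
chosen_k = largest_k_within_tolerance(neighbors, mean_scores)
print(chosen_k)  # 3
```

Here k = 1 scores best, but k = 2 and k = 3 are within 0.01 of it, so the rule prefers the smoother model with k = 3.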


![](img/kdd_model_selection.png)
![](img/kdd_confusion_matrix.png)


The confusion matrix shows that classes 1, 2 and 3 were predicted very well. Class 4, however, was predicted very poorly, with only one correct prediction and 500 incorrect ones. This contributes most to the lower accuracy on the test set.

The final accuracy on the test set was 0.772, and the weighted F1-score was 0.715.
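To see how a single badly predicted class affects these metrics, it can help to look at per-class recall next to accuracy and weighted F1. The following is a small sketch on made-up labels, not the project's predictions: one class is almost always mistaken for another, while the remaining classes are predicted perfectly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels for illustration only: class 4 is almost always
# mistaken for class 1, while classes 1 and 2 are predicted perfectly.
true_y = np.array([1] * 50 + [2] * 30 + [4] * 20)
pred_y = np.array([1] * 50 + [2] * 30 + [1] * 19 + [4] * 1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_y, pred_y)

# Recall per class: correct predictions divided by the true count of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

accuracy = accuracy_score(true_y, pred_y)
weighted_f1 = f1_score(true_y, pred_y, average="weighted")

print(cm)
print("Recall per class:", per_class_recall)
print("Accuracy:", accuracy, "Weighted F1:", weighted_f1)
```

In this toy case the weighted F1-score ends up below the accuracy, because the poorly recalled class contributes a very low per-class F1 and also lowers the precision of the class it is confused with.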