## KNN Model for Costa Rican Poverty Level Prediction

### Outline
**1. Project Setup** \
\
**2. What is a KNN Model?** \
\
**3. Models** \
*3.1 Basic Models* \
*3.2 Radius Neighbors Models* \
*3.3 Oversampled Models* \
\
**4. Results and Next Steps**

## 1. Project Setup

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import VarianceThreshold
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.utils import compute_class_weight
import load_data as ld
import evaluate_classification as ec


df, X_valid, y_valid = ld.load_train_data()

**X and y using oversampling**

In [None]:
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_res, y_train_res = smote.fit_resample(df.drop(columns='Target'), df.loc[:,'Target'])


## 2. What is a KNN Model?

K-nearest neighbors (KNN) is a common supervised learning algorithm for classification. Each point receives the class chosen by plurality vote among its K nearest neighbors, hence the name. The vote can be uniform, where every neighbor counts equally, or distance-weighted, where neighbors closer to the target point count for more. KNN is a good fit for smaller datasets like ours, but it treats every column equally when computing the distance between points, which makes it vulnerable to the curse of dimensionality and prone to overfitting.
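
To make the two voting schemes concrete, here is a minimal from-scratch sketch. It is not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np
from collections import defaultdict


def knn_predict(X_train, y_train, x, k=3, weights="uniform"):
    """Predict the label of one point x by voting among its k nearest neighbors."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if weights == "uniform":
            # One neighbor, one vote
            votes[int(y_train[i])] += 1.0
        else:
            # Inverse-distance vote; the small epsilon avoids division by zero
            votes[int(y_train[i])] += 1.0 / (dists[i] + 1e-12)
    # Label with the largest total vote
    return max(votes, key=votes.get)


# Toy 1-D data (hypothetical): one class-1 point near the query, three class-4 points farther away
X_toy = np.array([[0.0], [2.0], [2.2], [3.0]])
y_toy = np.array([1, 4, 4, 4])
x_new = np.array([0.5])

uniform_label = knn_predict(X_toy, y_toy, x_new, k=3, weights="uniform")
distance_label = knn_predict(X_toy, y_toy, x_new, k=3, weights="distance")
print(uniform_label, distance_label)
```

With uniform voting the two class-4 neighbors outvote the single class-1 neighbor, while with distance weighting the much closer class-1 point wins. The same query can therefore get different labels under the two schemes.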

KNN is a relatively simple model with only two core hyperparameters: k and the distance metric. Below we also vary a third, the weighting scheme (uniform or distance-weighted).
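
As a rough illustration of how these three hyperparameters could be searched together, the sketch below runs scikit-learn's `GridSearchCV` on synthetic data. This is not the procedure used in this notebook, which scores each setting on a held-out validation set. The synthetic dataset, the two candidate metrics, the 3-fold cross-validation, and the macro-averaged F1 scoring are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the real training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=42,
)

# The three hyperparameters discussed above
param_grid = {
    "n_neighbors": [3, 5, 10, 15, 20],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# 3-fold cross-validation scored by macro-averaged F1 (both choices are assumptions)
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1_macro", cv=3)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print(best_params)
```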

Here, the KNN models use a household's K nearest neighbors to predict which of these categories best describes it:

- 1 = extreme poverty
- 2 = moderate poverty
- 3 = vulnerable households
- 4 = non-vulnerable households

## 3. Models

### 3.1 Basic KNN Models

Below we run our first set of KNN models, looping over k values of 3, 5, 10, 15, and 20 and alternating between uniform and distance-based weighting.

In [None]:
preds = {}   # model name -> predictions on the validation set
scores = {}  # model name -> evaluation metrics
for num in [3, 5, 10, 15, 20]:
    for choice in ['uniform', 'distance']:
        name = f'k_{num}_{choice}'
        neigh = KNeighborsClassifier(n_neighbors=num, weights=choice)
        neigh.fit(df.drop(columns='Target'), df.loc[:, 'Target'])
        preds[name] = neigh.predict(X_valid)
        print(name)
        # Project helper: returns accuracy, F1, and recall when return_vals=True
        accuracy, f1, recall = ec.evaluate_classification(preds[name], y_valid, cm=True, return_vals=True)
        scores[name] = {'accuracy': accuracy, 'f1': f1, 'recall': recall}
        print('\n')

### 3.2 Radius Neighbors Models

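Below we run a Radius Neighbors model across a set of radius sizes and weighting options. It is similar to KNN, except that rather than specifying the number of neighbors, we consider all neighbors within a given radius. Points with no neighbors inside the radius are assigned the most frequent training class (`outlier_label='most_frequent'`).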
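
The difference between the two classifiers can be seen on a small example. The sketch below uses hypothetical 1-D toy data and a radius of 1.5 chosen purely for illustration. It shows that KNN always finds k neighbors, however far away they are, while the radius model falls back to the outlier label when nothing lies inside the radius.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Toy data (hypothetical): three class-1 points near 0 and two class-4 points near 10
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([1, 1, 1, 4, 4])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
rad = RadiusNeighborsClassifier(radius=1.5, outlier_label="most_frequent").fit(X_toy, y_toy)

# The first query sits between the class-4 points; the second is far from everything
queries = np.array([[10.5], [50.0]])

knn_preds = knn.predict(queries)  # votes among the 3 closest points, regardless of distance
rad_preds = rad.predict(queries)  # votes only among points within 1.5 of the query
print(knn_preds, rad_preds)
```

For the distant query, KNN still votes among the three closest points and predicts class 4. The radius model finds no neighbors within 1.5 and returns class 1, the most frequent class in the toy training labels.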

In [None]:
for num in [10, 50, 100, 150, 200]:
    for choice in ['uniform', 'distance']:
        name = f'rad_{num}_{choice}'
        # Points with no neighbors inside the radius get the most frequent training class
        neigh = RadiusNeighborsClassifier(radius=num, weights=choice, outlier_label='most_frequent')
        neigh.fit(df.drop(columns='Target'), df.loc[:, 'Target'])
        preds[name] = neigh.predict(X_valid)
        print(name)
        accuracy, f1, recall = ec.evaluate_classification(preds[name], y_valid, cm=True, return_vals=True)
        scores[name] = {'accuracy': accuracy, 'f1': f1, 'recall': recall}
        print('\n')


### 3.3 Oversampled Models

Finally, we return to the standard KNN classifier, but instead of training on the original dataset, we train on the oversampled version produced by the SMOTE step at the top of the notebook.
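
For intuition about what SMOTE does, the sketch below generates synthetic minority-class points by interpolating between a sample and one of its nearest minority-class neighbors. It is a simplified illustration in plain NumPy, not the `imblearn` implementation used above. The function name `smote_like_oversample` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        # Pick a random minority-class sample
        i = rng.integers(len(X_min))
        # Find its k nearest minority-class neighbors; position 0 of the sort is
        # the sample itself (assuming no duplicate rows), so skip it
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        nb = rng.choice(neighbors)
        # Place the new point at a random position on the segment between the two
        gap = rng.random()
        synthetic[j] = X_min[i] + gap * (X_min[nb] - X_min[i])
    return synthetic


# Three minority-class points (hypothetical) at the corners of a triangle
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=5)
print(X_new)
```

Every synthetic point lies on a line segment between two existing minority points, so oversampling fills in the region the minority class already occupies rather than adding new information.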

In [None]:
for num in [3, 5, 10, 15, 20]:
    for choice in ['uniform', 'distance']:
        name = f're_{num}_{choice}'
        neigh = KNeighborsClassifier(n_neighbors=num, weights=choice)
        # Train on the SMOTE-resampled data; validate on the untouched validation set
        neigh.fit(X_train_res, y_train_res)
        preds[name] = neigh.predict(X_valid)
        print(name)
        accuracy, f1, recall = ec.evaluate_classification(preds[name], y_valid, cm=True, return_vals=True)
        scores[name] = {'accuracy': accuracy, 'f1': f1, 'recall': recall}
        print('\n')

## 4. Results and Next Steps

In [None]:
# create a DataFrame from the dictionary of dictionaries
df3 = pd.DataFrame.from_dict(scores, orient='index')
df3.sort_values(by = 'f1', ascending=False,inplace=True)

print(df3.to_latex())

\begin{array}{c|ccc}
\text{model} & \text{accuracy} & \text{f1} & \text{recall (classes 1-4)} \\
\hline
k\_15\_distance    &      0.60 &  0.53 &     [0.04, 0.1, 0.1, 0.9] \\
k\_20\_distance    &      0.61 &  0.53 &  [0.04, 0.09, 0.12, 0.91] \\
k\_3\_uniform      &      0.53 &  0.52 &  [0.22, 0.23, 0.07, 0.74] \\
rad\_150\_uniform  &      0.56 &  0.52 &  [0.08, 0.14, 0.07, 0.83] \\
k\_5\_uniform      &      0.55 &  0.52 &  [0.06, 0.19, 0.11, 0.79] \\
k\_10\_uniform     &      0.60 &  0.52 &  [0.02, 0.11, 0.04, 0.91] \\
k\_10\_distance    &      0.58 &  0.52 &   [0.04, 0.1, 0.12, 0.86] \\
k\_15\_uniform     &      0.62 &  0.52 &  [0.06, 0.07, 0.04, 0.94] \\
k\_20\_uniform     &      0.62 &  0.52 &  [0.02, 0.06, 0.04, 0.96] \\
rad\_150\_distance &      0.57 &  0.52 &   [0.04, 0.1, 0.08, 0.86] \\
k\_5\_distance     &      0.54 &  0.51 &  [0.06, 0.12, 0.14, 0.79] \\
rad\_200\_distance &      0.56 &  0.51 &  [0.02, 0.12, 0.08, 0.83] \\
rad\_200\_uniform  &      0.55 &  0.51 &  [0.04, 0.16, 0.07, 0.82] \\
rad\_50\_uniform   &      0.60 &  0.51 &  [0.04, 0.07, 0.04, 0.93] \\
rad\_100\_uniform  &      0.57 &  0.51 &  [0.12, 0.08, 0.05, 0.86] \\
k\_3\_distance     &      0.52 &  0.51 &  [0.08, 0.14, 0.18, 0.75] \\
rad\_100\_distance &      0.57 &  0.50 &  [0.04, 0.07, 0.05, 0.88] \\
rad\_50\_distance  &      0.60 &  0.50 &  [0.02, 0.05, 0.04, 0.93] \\
re\_20\_distance   &      0.46 &  0.50 &  [0.29, 0.33, 0.34, 0.53] \\
re\_5\_distance    &      0.45 &  0.49 &   [0.24, 0.25, 0.3, 0.56] \\
re\_10\_distance   &      0.45 &  0.49 &  [0.37, 0.28, 0.33, 0.52] \\
re\_15\_distance   &      0.45 &  0.49 &  [0.31, 0.32, 0.32, 0.53] \\
rad\_10\_distance  &      0.63 &  0.48 &     [0.0, 0.0, 0.0, 0.99] \\
rad\_10\_uniform   &      0.63 &  0.48 &     [0.0, 0.0, 0.0, 0.99] \\
re\_3\_distance    &      0.44 &  0.48 &  [0.24, 0.24, 0.27, 0.55] \\
re\_3\_uniform     &      0.43 &  0.47 &  [0.37, 0.32, 0.16, 0.51] \\
re\_5\_uniform     &      0.41 &  0.46 &  [0.35, 0.34, 0.25, 0.47] \\
re\_15\_uniform    &      0.40 &  0.45 &  [0.35, 0.34, 0.26, 0.45] \\
re\_20\_uniform    &      0.40 &  0.45 &  [0.39, 0.36, 0.27, 0.43] \\
re\_10\_uniform    &      0.39 &  0.44 &  [0.43, 0.31, 0.29, 0.42] \\
\end{array}



As the table above shows, the highest-performing models by F1 are the distance-weighted KNN models with k = 15 and k = 20 (rows prefixed with `k`), although some of the larger-radius models (prefixed with `rad`) perform comparably. The resampled models (prefixed with `re`) had the lowest accuracy of any group and F1 scores at or near the bottom. Oversampling raised recall on classes 1 to 3 but sharply cut recall on class 4, and combined with KNN's curse-of-dimensionality problem, the resulting models predicted all four classes poorly.
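
The `rad_10` rows in the table (accuracy 0.63, recall of 0.0 on classes 1 to 3) show why accuracy alone can mislead on imbalanced classes. The sketch below reproduces that pattern on hypothetical labels with a classifier that always predicts class 4. The class counts are invented for illustration, and macro-averaged F1 is an assumption, since this notebook does not show how `ec.evaluate_classification` averages F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced labels: 10 households each in classes 1-3, 70 in class 4
y_true = np.array([1] * 10 + [2] * 10 + [3] * 10 + [4] * 70)
# A degenerate classifier that always predicts the majority class
y_pred = np.full_like(y_true, 4)

accuracy = accuracy_score(y_true, y_pred)
# One recall value per class, in label order 1, 2, 3, 4
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[1, 2, 3, 4])
# Macro F1 weights every class equally; zero_division=0 scores never-predicted classes as 0
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(accuracy, per_class_recall, macro_f1)
```

Accuracy here equals the share of class 4 in the labels, even though the classifier never identifies a single household in classes 1 to 3. Per-class recall and a class-balanced F1 expose that failure.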

Based on these results, especially when compared with the other model types our group built, KNN was clearly the wrong approach to this problem. As next steps, we will use alternative modeling approaches.