<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Marine-life-forecast---KNNImputer" data-toc-modified-id="Marine-life-forecast---KNNImputer-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Marine life forecast - KNNImputer</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li></ul></li><li><span><a href="#Importing-libraries" data-toc-modified-id="Importing-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing libraries</a></span></li><li><span><a href="#Importing-data" data-toc-modified-id="Importing-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Importing data</a></span></li><li><span><a href="#KNNImputer" data-toc-modified-id="KNNImputer-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>KNNImputer</a></span><ul class="toc-item"><li><span><a href="#Creating-knn-imputer" data-toc-modified-id="Creating-knn-imputer-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Creating knn imputer</a></span></li><li><span><a href="#Testing-the-imputer" data-toc-modified-id="Testing-the-imputer-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Testing the imputer</a></span></li><li><span><a href="#Saving-the-imputer" data-toc-modified-id="Saving-the-imputer-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Saving the imputer</a></span></li></ul></li></ul></div>

# Marine life forecast - KNNImputer

## Introduction
In this notebook I'll create a transformer that preprocesses coordinates and assigns a locality, a water body and a country code to them

# Importing libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
import pickle
pd.options.display.max_columns = None

# Importing data

In [2]:
data_folder = './data/'
input_file = 'occurrence_aggregated.txt'
df = pd.read_csv(data_folder + input_file)

In [3]:
columns = ['decimalLatitude', 'decimalLongitude', 'waterBody',
          'locality', 'countryCode', 'hour', 'day', 'month', 'depth']
df = df[columns]
df.head()

Unnamed: 0,decimalLatitude,decimalLongitude,waterBody,locality,countryCode,hour,day,month,depth
0,-13.9387,-171.553,South Pacific Ocean,Taliga,WS,10,29,5,13.5
1,-13.9387,-171.553,South Pacific Ocean,Cowabunga,WS,12,29,5,10.2
2,-13.9387,-171.553,South Pacific Ocean,Cowabunga,WS,12,30,5,9.75
3,-13.9387,-171.553,South Pacific Ocean,Cowabunga,WS,10,30,5,10.5
4,53.8635,-166.049,Hidden Lake,Sunrise Grand Select Montemare Resort House Reef,EG,18,22,7,10.0


# KNNImputer

## Creating knn imputer

Convert the text columns to uppercase so that differently cased spellings of the same name end up as a single label

In [4]:
df['locality'] = df['locality'].str.upper()
df['waterBody'] = df['waterBody'].str.upper()
df['countryCode'] = df['countryCode'].str.upper()
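
As a small illustration of why the casing matters for the label encoder used later, the sketch below encodes a made-up series of localities before and after upper-casing. The values are hypothetical and only chosen to show the effect.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values: the same locality typed with different casing.
localities = pd.Series(['Cowabunga', 'COWABUNGA', 'cowabunga', 'Taliga'])

# Without normalization every spelling becomes its own class.
raw_classes = LabelEncoder().fit(localities).classes_

# After upper-casing, the three spellings collapse into one class.
upper_classes = LabelEncoder().fit(localities.str.upper()).classes_

print(len(raw_classes), len(upper_classes))
```

Fewer, cleaner classes also mean the nearest neighbor returns one canonical name instead of whichever spelling happened to be closest.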

This function receives a dataframe and a column name as parameters:
- Creates a label encoder and a KNN imputer with 1 neighbor
- Fits the label encoder on the column received as parameter
- Creates a new `_aux` column holding the column's values converted to numbers by the label encoder
- Fits the KNN imputer on the coordinates and the labeled column from the previous step
- Returns the label encoder object and the KNN imputer object so the next function can reverse the process

In [5]:
def knn_fit(data, column):
    # Encode the text labels as integers so the imputer can work with them
    le = LabelEncoder()
    knn = KNNImputer(n_neighbors=1)
    le.fit(data[column])
    data_aux = data[['decimalLatitude', 'decimalLongitude', column]].copy()
    data_aux[column+'_aux'] = le.transform(data_aux[column])
    # Fit on coordinates plus the encoded label; these rows act as donors later
    knn.fit(data_aux[['decimalLatitude', 'decimalLongitude', column+'_aux']])
    return le, knn
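
The choice of a single neighbor matters because the imputer averages the values of the neighbors it finds, and an average of label codes is not a meaningful label. The sketch below shows this on a made-up array whose columns stand for latitude, longitude and label code; the numbers are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: columns are latitude, longitude, label code.
# Codes 0 and 2 sit close to the query point; code 1 is far away.
fit_X = np.array([
    [0.0, 0.0, 0.0],
    [50.0, 50.0, 1.0],
    [0.0, 2.0, 2.0],
])
query = np.array([[0.0, 0.9, np.nan]])

# One neighbor: the code of the closest fitted row is copied as is.
one = KNNImputer(n_neighbors=1).fit(fit_X).transform(query)[0, 2]

# Two neighbors: the codes 0 and 2 are averaged.
two = KNNImputer(n_neighbors=2).fit(fit_X).transform(query)[0, 2]

print(one, two)
```

With one neighbor the result is code 0, the closest row. With two neighbors the average of codes 0 and 2 is 1, which happens to be the code of the distant row, so the decoded name would be wrong.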

This function receives a dataframe, a label encoder object, a KNN imputer object and a column name:
- Creates a new column named after the column received as parameter, with `_aux` appended
- Fills the new column with NaNs
- Applies the KNN imputer to fill the NaN values of the column; the imputer returns floating-point values
- Rounds the values and casts them to integers, which the next step requires
- Applies the inverse transform of the label encoder to recover the original names

This function is only for testing in this notebook. The same function will be recreated in the application

In [6]:
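def knn_transform(data, le, knn, column):
    data_aux = data[['decimalLatitude', 'decimalLongitude']].copy()
    # All-NaN label column: the imputer fills it from the nearest fitted row
    data_aux[column+'_aux'] = np.nan
    # Column 2 of the imputer output is the label code, returned as a float
    data_aux[column+'_aux'] = np.around(knn.transform(data_aux)[:,2]).astype(int)
    return le.inverse_transform(data_aux[column+'_aux'])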
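
The fit and transform steps can be traced end to end on a tiny example. The sketch below repeats the same steps inline on a made-up dataframe; the site names and coordinates are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: three known sites with made-up coordinates.
train = pd.DataFrame({
    'decimalLatitude':  [0.0, 10.0, 20.0],
    'decimalLongitude': [0.0, 10.0, 20.0],
    'locality':         ['SITE A', 'SITE B', 'SITE C'],
})

# Fit step: encode the names, then fit the imputer on coordinates + codes.
le = LabelEncoder().fit(train['locality'])
train['locality_aux'] = le.transform(train['locality'])
knn = KNNImputer(n_neighbors=1)
knn.fit(train[['decimalLatitude', 'decimalLongitude', 'locality_aux']])

# Transform step: new points carry coordinates only, the code column is NaN.
query = pd.DataFrame({
    'decimalLatitude':  [1.0, 19.0],
    'decimalLongitude': [1.0, 18.0],
})
query['locality_aux'] = np.nan

# Impute the code, round to an integer and decode it back to a name.
codes = np.around(knn.transform(query)[:, 2]).astype(int)
labels = le.inverse_transform(codes)
print(labels)
```

The first query point is closest to `SITE A` and the second to `SITE C`, so those are the names that come back.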

Create a list that will contain the label encoder and KNN imputer pairs for locality, water body and country code, in that order. This list will be saved to disk and loaded in the application

In [7]:
knn_parts = []

knn_parts.append(knn_fit(df, 'locality'))
knn_parts.append(knn_fit(df, 'waterBody'))
knn_parts.append(knn_fit(df, 'countryCode'))

## Testing the imputer

In [8]:
X_test = df[8000:]
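
Note that `X_test` is a slice of the same `df` the imputers were fitted on, so every test row already has an exact coordinate match among the fitted rows and the test mainly confirms that the pipeline runs. A stricter check would hold rows out before fitting. Below is a minimal sketch of that idea on synthetic data; the site names, coordinates, jitter and split sizes are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Hypothetical stand-in for df: two well-separated sites with jittered coordinates.
centers = {'SITE A': (0.0, 0.0), 'SITE B': (30.0, 30.0)}
rows = []
for name, (lat, lon) in centers.items():
    for _ in range(50):
        rows.append((lat + rng.normal(scale=0.5), lon + rng.normal(scale=0.5), name))
toy = pd.DataFrame(rows, columns=['decimalLatitude', 'decimalLongitude', 'locality'])

# Shuffle, then keep the last 20 rows out of the fit.
toy = toy.sample(frac=1.0, random_state=0).reset_index(drop=True)
train, test = toy.iloc[:80].copy(), toy.iloc[80:].copy()

# Fit the encoder and the imputer on the training rows only.
le = LabelEncoder().fit(train['locality'])
train['locality_aux'] = le.transform(train['locality'])
knn = KNNImputer(n_neighbors=1)
knn.fit(train[['decimalLatitude', 'decimalLongitude', 'locality_aux']])

# Predict the held-out rows from their coordinates alone.
query = test[['decimalLatitude', 'decimalLongitude']].copy()
query['locality_aux'] = np.nan
pred = le.inverse_transform(np.around(knn.transform(query)[:, 2]).astype(int))

# Fraction of held-out rows whose name was recovered.
accuracy = (pred == test['locality'].to_numpy()).mean()
print(accuracy)
```

On real data the same comparison would show how often nearby coordinates actually share a locality, which the in-sample test cannot reveal.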

In [9]:
knn_transform(X_test, knn_parts[0][0], knn_parts[0][1], 'locality')

array(['AQUARIUM', 'SIRA ISLAND', 'SIRA ISLAND', ..., "ANNIE'S BOMMIE",
       "ANNIE'S BOMMIE", 'SAND CAY'], dtype=object)

## Saving the imputer

In [10]:
models_folder = './models/'
filename = 'knn.sav'
with open(models_folder + filename, 'wb') as f:
    pickle.dump(knn_parts, f)
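
On the application side the list can be restored with `pickle.load` and unpacked into its encoder and imputer pairs. The sketch below shows the round trip with a made-up single-pair list written to a temporary directory, so the path and data are hypothetical. Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, and only from trusted files.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for knn_parts: one fitted (encoder, imputer) pair.
train = pd.DataFrame({
    'decimalLatitude':  [0.0, 10.0],
    'decimalLongitude': [0.0, 10.0],
    'locality':         ['SITE A', 'SITE B'],
})
le = LabelEncoder().fit(train['locality'])
train['locality_aux'] = le.transform(train['locality'])
knn = KNNImputer(n_neighbors=1)
knn.fit(train[['decimalLatitude', 'decimalLongitude', 'locality_aux']])
knn_parts = [(le, knn)]

# Save and reload through a temporary file instead of ./models/.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'knn.sav')
    with open(path, 'wb') as f:
        pickle.dump(knn_parts, f)
    with open(path, 'rb') as f:
        loaded_parts = pickle.load(f)

# Unpack the pair and use it exactly like the original objects.
loaded_le, loaded_knn = loaded_parts[0]
query = pd.DataFrame({'decimalLatitude': [9.0], 'decimalLongitude': [9.5]})
query['locality_aux'] = np.nan
code = np.around(loaded_knn.transform(query)[:, 2]).astype(int)
restored = loaded_le.inverse_transform(code)
print(restored)
```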