# Predicting categories with K-Nearest Neighbors

**Aim**: Predict whether a mobile transaction is fraudulent using the k-NN algorithm with scikit-learn.

## Table of contents

1. Data preparation
2. Implementing the k-NN algorithm
3. Fine-tuning parameters using GridSearchCV
4. Scaling

## Package Requirements

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

## Data preparation

In [None]:
#Reading in the dataset

df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

In [None]:
#Viewing the data

df.head()

**Dropping the redundant features**

In [None]:
#Dropping the redundant features

df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

In [None]:
#Inspecting the data

df.info()

**Reducing the size of the data**

In [None]:
#Storing the fraudulent data into a dataframe

df_fraud = df[df['isFraud'] == 1]

In [None]:
#Storing the non-fraudulent data into a dataframe 

df_nofraud = df[df['isFraud'] == 0]

In [None]:
#Storing 12,000 rows of non-fraudulent data

df_nofraud = df_nofraud.head(12000)

In [None]:
#Joining both datasets together 

df = pd.concat([df_fraud, df_nofraud], axis = 0)

In [None]:
df.info()

**Encoding the categorical feature**

In [None]:
#Converting the type column to categorical

df['type'] = df['type'].astype('category')

In [None]:
#Initializing the label encoder for the 'type' column

type_encode = LabelEncoder()

In [None]:
#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df.type)

In [None]:
df['type'].value_counts()

In [None]:
#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()
type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

In [None]:
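#Adding the one hot encoded variables to the dataset
#Reusing df.index keeps the new columns aligned row by row; with a default
#0..n-1 index, pd.concat would match on labels and introduce NaN rows

ohe_variable = pd.DataFrame(type_one_hot_encode, columns = [f"type_{i}" for i in range(type_one_hot_encode.shape[1])], index = df.index)
df = pd.concat([df, ohe_variable], axis=1)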
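
**A note on index alignment**

`pd.concat(..., axis=1)` matches rows on index labels, not on position. That matters here because the dataset was assembled from filtered subsets and keeps its original row labels. The following is a minimal sketch on toy data; the frames `left` and `right` and their values are hypothetical and only illustrate the behaviour.

```python
import pandas as pd

# Toy frames (hypothetical data). `left` keeps non-contiguous index labels,
# as a frame built from filtered subsets would.
left = pd.DataFrame({"amount": [10.0, 20.0, 30.0]}, index=[5, 9, 0])
right = pd.DataFrame({"type_0": [1.0, 0.0, 1.0]})  # default index 0, 1, 2

# The labels differ, so the result holds the union of labels with NaN gaps.
misaligned = pd.concat([left, right], axis=1)

# Giving `right` the same index as `left` lines the rows up one-to-one.
aligned = pd.concat([left, right.set_index(left.index)], axis=1)

print(misaligned.shape, int(misaligned.isnull().sum().sum()))
print(aligned.shape, int(aligned.isnull().sum().sum()))
```

If the shapes differ between the two results, the inputs were not aligned, and any later `fillna` would be hiding mismatched rows rather than fixing genuinely missing values.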

In [None]:
#Dropping the original type variable 

df = df.drop('type', axis = 1)

**Checking for missing values**

In [None]:
#Checking every column for missing values

df.isnull().any()

In [None]:
#Imputing the missing values with a 0

df = df.fillna(0)

In [None]:
#Checking if there are missing values left

df.isnull().any()

**Exporting the dataset**

In [None]:
#Exporting without the index so it is not written as an extra column

df.to_csv('fraud_prediction.csv', index = False)

## Implementing the k-NN algorithm

In [None]:
#Creating the feature matrix and the target vector

features = df.drop('isFraud', axis = 1).values
target = df['isFraud'].values

**Splitting the data into training and test sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42, stratify = target)

**Building the k-NN classifier**

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors=3)

In [None]:
knn_classifier.fit(X_train, y_train)

In [None]:
knn_classifier.score(X_test, y_test)
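
**Looking beyond accuracy**

`score` returns plain accuracy, which can look high even when the rarer class is often missed. The sketch below shows how `confusion_matrix` and `classification_report` could be used to inspect per-class performance. It runs on synthetic data from `make_classification` rather than the fraud dataset, and all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced two-class data (an assumption, not the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Same classifier settings as above, fitted on the synthetic data
knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_train_demo, y_train_demo)
y_pred_demo = knn_demo.predict(X_test_demo)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_test_demo, y_pred_demo)
# Precision, recall and F1 for each class
report_demo = classification_report(y_test_demo, y_pred_demo)

print(cm_demo)
print(report_demo)
```

On the notebook's own data, the same two calls would take `y_test` and `knn_classifier.predict(X_test)`.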

## Fine-tuning parameters using GridSearchCV

In [None]:
#Initializing a grid with possible number of neighbors from 1 to 24

grid = {'n_neighbors' : np.arange(1, 25)}

#Initializing a k-NN classifier 

knn_classifier = KNeighborsClassifier()

#Using cross validation to find optimal number of neighbors 

knn = GridSearchCV(knn_classifier, grid, cv = 10)

knn.fit(X_train, y_train)

In [None]:
#Extracting the optimal number of neighbors 

knn.best_params_

In [None]:
#Extracting the mean cross-validated accuracy for the optimal number of neighbors

knn.best_score_

## Scaling

In [None]:
#Setting up the scaling pipeline 

pipeline_order = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 1))]

pipeline = Pipeline(pipeline_order)

#Fitting the classifier to the scaled dataset

knn_classifier_scaled = pipeline.fit(X_train, y_train)

#Extracting the score 

knn_classifier_scaled.score(X_test, y_test)
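
**Tuning the number of neighbors together with scaling**

Scaling changes the distances between points, so the best number of neighbors for scaled features may differ from the one found on unscaled features. One possible approach is to run the grid search over the whole pipeline, addressing the classifier's parameter as `knn__n_neighbors`. This is a minimal sketch on synthetic data; `X_toy`, `y_toy`, the grid range and `cv=5` are assumptions chosen to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features (an assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is refitted inside every cross-validation fold, so the
# validation fold never influences the scaling statistics
scaled_knn = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# '<step name>__<parameter>' targets a parameter of a pipeline step
param_grid = {'knn__n_neighbors': np.arange(1, 11)}

search = GridSearchCV(scaled_knn, param_grid, cv=5)
search.fit(X_toy, y_toy)

best_k = search.best_params_['knn__n_neighbors']
print(best_k, search.best_score_)
```

With the notebook's data, `search.fit(X_train, y_train)` would replace the synthetic call, and `search.score(X_test, y_test)` would report test accuracy for the refitted best pipeline.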