# Using a K-nearest Neighbours Classifier To Detect Bank Account Opening Fraud
### 2022W2 DSCI 100 Group Project

Team Members:
- Aqil Faizal
- Andrei Chepakovich
- Jiaying Ong
- Lucy Liu

## Introduction
Fraud is a major problem in Canada: according to the Canadian Anti-Fraud Centre, $531M was lost to fraud in 2022. Although overall rates of digital financial fraud in Canada have fallen in recent years, identity theft rates have been on the rise. According to Statista, an online platform for consumer data, the rate of identity fraud in Canada grew consistently from 2008 to 2021, reaching 61.95 per 100,000 residents in 2021. Identity thieves commonly use stolen personal information to open new bank accounts, which they then use for fraudulent activities such as credit card fraud, fraudulent loan applications, and money laundering. This practice is known as account opening fraud, and it is the focus of this group project.

In this project, we aim to use the R tidymodels framework to build a K-nearest neighbours classifier that can classify a bank account as fraudulent or non-fraudulent using data from the Bank Account Fraud suite of datasets published at NeurIPS 2022, a machine learning conference. This synthetic dataset is based on real-world bank account data. The main questions that we would like to answer in this project are: 1) given this data, can we use the K-nearest neighbours algorithm to classify a bank account as fraudulent or non-fraudulent with reasonable accuracy, and 2) which variables are good indicators of account opening fraud?

### Description of the dataset
We will use the Base.csv dataset from the aforementioned Bank Account Fraud suite, which has 32 variables and 1M observations. As the K-nearest neighbours algorithm will be very computationally costly with such a large number of observations, we will first extract the first 100,000 rows of data and then filter it further to obtain only observations with certain attributes.
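
The row-limiting step described above could be sketched roughly as follows. This is a minimal illustration only: `toy_path`, `subset_path`, and `n_keep` are hypothetical names, and a small generated file stands in for the real Base.csv so that the snippet runs on its own. In the project itself, `n_keep` would be 100,000.

```r
library(readr)

# Hypothetical stand-in for the locally downloaded Base.csv:
# a 20-row toy file, so the sketch runs without the real 1M-row dataset.
toy_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    fraud_bool = rep(c(0, 1), 10),
    income = seq(0.05, 1, by = 0.05)
  ),
  toy_path
)

# n_max caps the number of data rows that read_csv() reads from the file.
n_keep <- 5
subset_data <- read_csv(toy_path, n_max = n_keep, show_col_types = FALSE)

# Write the smaller subset back out so that it can be hosted elsewhere.
subset_path <- tempfile(fileext = ".csv")
write_csv(subset_data, subset_path)
```

Reading only the first rows keeps memory use low, although it assumes that the first rows are representative of the full file.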

We also plan to use the forward selection algorithm to select the predictor variables used by the KNN model. Forward selection trains many models, and the more candidate variables it considers, the greater the chance of finding a model with a high cross-validation accuracy estimate but a low true accuracy on the test data. To limit this risk, we first chose the 6 variables that we felt were most likely to be indicators of fraud.
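
To make the forward selection idea concrete, here is a rough sketch using tidymodels. It is not the project's final code: the data is simulated (`toy`, with predictors `x1`, `x2`, `x3` and a `label` column, all hypothetical), the number of neighbours is fixed at 5 rather than tuned, and it assumes the kknn package is installed.

```r
library(tidymodels)

set.seed(2022)

# Simulated data (an assumption, not the fraud dataset):
# x1 is informative, while x2 and x3 are pure noise.
n <- 200
toy <- tibble(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)) |>
  mutate(label = factor(if_else(x1 + rnorm(n, sd = 0.5) > 0, "fraud", "legit")))

# KNN specification with a fixed K to keep the sketch short.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

folds <- vfold_cv(toy, v = 5, strata = label)

remaining <- c("x1", "x2", "x3")
selected <- c()
results <- tibble(size = integer(), model_string = character(), accuracy = numeric())

# At each step, add the single predictor that gives the best
# cross-validation accuracy when combined with those already selected.
while (length(remaining) > 0) {
  accs <- numeric(length(remaining))
  formulas <- character(length(remaining))
  for (j in seq_along(remaining)) {
    formulas[j] <- paste("label ~", paste(c(selected, remaining[j]), collapse = " + "))
    rec <- recipe(as.formula(formulas[j]), data = toy) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())
    accs[j] <- workflow() |>
      add_recipe(rec) |>
      add_model(knn_spec) |>
      fit_resamples(resamples = folds, metrics = metric_set(accuracy)) |>
      collect_metrics() |>
      pull(mean)
  }
  best <- which.max(accs)
  results <- results |>
    add_row(size = length(selected) + 1L, model_string = formulas[best], accuracy = accs[best])
  selected <- c(selected, remaining[best])
  remaining <- remaining[-best]
}

results
```

With p candidate predictors, this loop fits on the order of p(p + 1)/2 cross-validated models, which is why limiting the candidate set in advance keeps both the run time and the risk of an over-optimistic accuracy estimate down.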

## Preliminary exploratory data analysis

The required libraries for running the code in this proposal are:

- tidyverse
- repr
- tidymodels

In [11]:
### Setup
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.3     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.4     ✔ yardstick    1.1.0
✔ recipes      1.0.5     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()

Since the dataset has 1M rows, we first downloaded Base.csv locally and extracted the first 100,000 rows, which we then wrote to a .csv file and hosted on Google Sheets. We then read the data in again from Google Sheets:

In [16]:
fraud_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRL6hq0R8gYhL5VhHlCOGGSKTea3N54QsEC8zcr5pHwuY2Ly5Zc6G83araSMkzwQF7suK_o6tr-4MEP/pub?output=csv")
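# Add a factor copy of the label. Note that the new column is named fraud_data
# and that fraud_bool itself remains numeric.
fraud_data <- fraud_data |>
    mutate(fraud_data = as_factor(fraud_bool))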

fraud_data

Rows: 100000 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): payment_type, employment_status, housing_status, source, device_os
dbl (27): fraud_bool, income, name_email_similarity, prev_address_months_cou...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | ⋯ | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month | fraud_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 0.9 | 0.16682773 | -1 | 88 | 50 | 0.020925173 | -1.3313450 | AA | 769 | ⋯ | 500 | 0 | INTERNET | 3.888115 | windows | 0 | 1 | 0 | 7 | 1 |
| 1 | 0.9 | 0.29628601 | -1 | 144 | 50 | 0.005417538 | -0.8162238 | AB | 366 | ⋯ | 1500 | 0 | INTERNET | 31.798819 | windows | 0 | 1 | 0 | 7 | 1 |
| 1 | 0.9 | 0.04498549 | -1 | 132 | 40 | 3.108548793 | -0.7557277 | AC | 870 | ⋯ | 200 | 0 | INTERNET | 4.728705 | other | 0 | 1 | 0 | 7 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 0.6 | 0.8435506 | -1 | 18 | 40 | 0.01090603 | -0.8543749 | AB | 602 | ⋯ | 1500 | 0 | INTERNET | 5.939641 | other | 1 | 1 | 0 | 3 | 0 |
| 0 | 0.4 | 0.6882371 | -1 | 50 | 20 | 0.02786849 | 50.3059704 | AA | 691 | ⋯ | 500 | 0 | INTERNET | 3.273092 | linux | 1 | 1 | 0 | 3 | 0 |
| 0 | 0.7 | 0.7537215 | -1 | 50 | 30 | 0.02482313 | 98.5380625 | AA | 878 | ⋯ | 500 | 0 | INTERNET | 6.452148 | windows | 0 | 1 | 0 | 3 | 0 |


We then narrowed the dataset down to the 6 variables which we felt were most likely to be indicators of fraudulence, as well as the label to be predicted, which was whether the bank account was fraudulent: 

- Credit risk score
- Income
- Months lived at current address
- Name-email similarity
- Session length in minutes
- Proposed credit limit

In addition, we filtered for observations from the month of July with an employment status of 'CA' and a housing status of 'BA'. Since the dataset is synthetic, the employment and housing status codes are arbitrary labels.



In [24]:
fraud_narrowed <- fraud_data |>
    filter(month == 7, employment_status == "CA", housing_status == "BA") |>
    select(fraud_bool, credit_risk_score, income, current_address_months_count, name_email_similarity, session_length_in_minutes, proposed_credit_limit)    
fraud_narrowed

number_of_rows <- nrow(fraud_narrowed)
number_of_rows

| fraud_bool | credit_risk_score | income | current_address_months_count | name_email_similarity | session_length_in_minutes | proposed_credit_limit |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 185 | 0.9 | 88 | 0.1668277 | 3.888115 | 500 |
| 1 | 259 | 0.9 | 144 | 0.2962860 | 31.798819 | 1500 |
| 1 | 110 | 0.9 | 22 | 0.1595112 | 2.047904 | 200 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 179 | 0.9 | 134 | 0.6891395 | 2.310369 | 500 |
| 0 | 234 | 0.9 | 3 | 0.5094120 | 2.943703 | 1000 |
| 0 | 244 | 0.8 | 131 | 0.8623505 | 4.555719 | 1500 |


We were left with 15,369 rows from our original dataset. We then performed more preliminary exploration:
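
One exploration step that could be useful here is checking how balanced the two classes are, since KNN accuracy can be misleading when one class is rare. The sketch below is only an illustration: `toy_narrowed` is a small made-up tibble standing in for `fraud_narrowed`, and its values are not results from the real data.

```r
library(dplyr)

# Made-up stand-in for fraud_narrowed (hypothetical values, not real results).
toy_narrowed <- tibble(
  fraud_bool = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0),
  income = c(0.9, 0.6, 0.4, 0.7, 0.9, 0.8, 0.1, 0.3, 0.5, 0.2)
)

# Count the observations in each class and express them as percentages.
class_balance <- toy_narrowed |>
  group_by(fraud_bool) |>
  summarize(count = n()) |>
  mutate(percentage = 100 * count / sum(count))

class_balance
```

Applying the same pipeline to `fraud_narrowed` would show whether the fraudulent class is rare enough to need special handling before training.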

## Methods


## Expected outcomes and significance