# Predicting Fraudulent Activity through Data Analysis


### Introduction
    
The dataset used is Audit Data, which was collected by the Comptroller and Auditor General (CAG) of India for the 2015-2016 year from the Auditor General Office of the CAG. The data covers 777 firms across 46 cities and 14 industry sectors. Auditing is the practice of examining a business's financial records against its financial statements to ensure compliance with India's accounting laws. The purpose of this project is to help prevent fraudulent activity. In the long term, such predictions could help reduce the impact that fraud has on the economy and on individuals in that society.
The question we will try to answer in this data analysis is: *“What’s the likelihood of an Indian firm being fraudulent?”*

### Reading the data

In [1]:
library(tidyverse)
library(tidymodels)
library(repr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

In [9]:
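# Load the audit data, keep the two predictors and the label,
# and convert the label to a factor for classification.
audit_trial <- read_csv("audit_data/trial.csv") %>%
    select(Money_Value, TOTAL, Risk) %>%
    mutate(Risk = as.factor(Risk))
audit_trial

# Split into 75% training and 25% testing data, stratified by Risk.
audit_split <- initial_split(audit_trial, prop = 0.75, strata = Risk)
audit_training <- training(audit_split)
audit_testing <- testing(audit_split)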

Parsed with column specification:
cols(
  Sector_score = col_double(),
  LOCATION_ID = col_character(),
  PARA_A = col_double(),
  SCORE_A = col_double(),
  PARA_B = col_double(),
  SCORE_B = col_double(),
  TOTAL = col_double(),
  numbers = col_double(),
  Marks = col_double(),
  Money_Value = col_double(),
  MONEY_Marks = col_double(),
  District = col_double(),
  Loss = col_double(),
  LOSS_SCORE = col_double(),
  History = col_double(),
  History_score = col_double(),
  Score = col_double(),
  Risk = col_double()
)



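| Money_Value (`<dbl>`) | TOTAL (`<dbl>`) | Risk (`<fct>`) |
|---|---|---|
| 3.380 | 6.68 | 1 |
| 0.940 | 4.83 | 0 |
| 0.000 | 0.74 | 0 |
| 11.750 | 10.80 | 1 |
| 0.000 | 0.08 | 0 |
| 2.950 | 0.83 | 0 |
| 44.950 | 8.51 | 1 |
| 7.790 | 20.53 | 1 |
| 7.340 | 19.45 | 1 |
| 1.930 | 4.97 | 1 |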
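
Before modeling, it can help to check the class balance and look for missing values, since rows with missing predictor values are a common cause of row-count mismatches at prediction time. The following is a minimal sketch that uses only the ten rows printed above as a stand-in for the full data; the name `audit_sample` is made up for this illustration.

```r
library(tidyverse)

# Stand-in data: the ten rows shown in the notebook output, not the full dataset.
audit_sample <- tibble(
  Money_Value = c(3.38, 0.94, 0, 11.75, 0, 2.95, 44.95, 7.79, 7.34, 1.93),
  TOTAL       = c(6.68, 4.83, 0.74, 10.80, 0.08, 0.83, 8.51, 20.53, 19.45, 4.97),
  Risk        = factor(c(1, 0, 0, 1, 0, 0, 1, 1, 1, 1))
)

# Number of firms in each Risk class.
class_counts <- audit_sample %>%
  count(Risk)

# Number of missing values in each column.
missing_counts <- audit_sample %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

class_counts
missing_counts
```

Running the same two checks on `audit_trial` would show whether any rows need to be dropped or imputed before the split.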


In [4]:
audit_recipe <- recipe(Risk ~., data = audit_training) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

audit_recipe

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          2

Operations:

Scaling for all_predictors()
Centering for all_predictors()
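
To make the effect of the two recipe steps concrete, the sketch below applies the same arithmetic by hand: divide each predictor by its standard deviation, then subtract the mean of the scaled values. It uses only the ten rows printed earlier as stand-in data (`audit_sample` is a made-up name), whereas the recipe estimates these statistics from `audit_training`.

```r
library(tidyverse)

# Stand-in data: the ten predictor rows shown in the notebook output.
audit_sample <- tibble(
  Money_Value = c(3.38, 0.94, 0, 11.75, 0, 2.95, 44.95, 7.79, 7.34, 1.93),
  TOTAL       = c(6.68, 4.83, 0.74, 10.80, 0.08, 0.83, 8.51, 20.53, 19.45, 4.97)
)

audit_standardized <- audit_sample %>%
  # Scaling: divide each predictor by its standard deviation.
  mutate(across(c(Money_Value, TOTAL), ~ .x / sd(.x))) %>%
  # Centering: subtract the mean of the (already scaled) predictor.
  mutate(across(c(Money_Value, TOTAL), ~ .x - mean(.x)))

# After both steps each predictor has mean 0 and standard deviation 1
# (up to floating-point error).
standardized_summary <- audit_standardized %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

standardized_summary
```

Standardizing matters for nearest-neighbor models because distances would otherwise be dominated by whichever predictor has the larger spread.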

In [5]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
       set_engine("kknn") %>%
       set_mode("classification")
knn_spec


K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 3
  weight_func = rectangular

Computational engine: kknn 
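
As a rough illustration of what this specification asks for, the sketch below classifies one firm by hand: it finds the three closest labelled firms by Euclidean distance on standardized predictors and takes an unweighted majority vote, which is what the `"rectangular"` weight function corresponds to. The ten labelled rows are the ones printed earlier; `new_firm` and its values are hypothetical, and this is not the kknn implementation itself.

```r
library(tidyverse)

# Stand-in labelled data: the ten rows shown in the notebook output.
audit_sample <- tibble(
  Money_Value = c(3.38, 0.94, 0, 11.75, 0, 2.95, 44.95, 7.79, 7.34, 1.93),
  TOTAL       = c(6.68, 4.83, 0.74, 10.80, 0.08, 0.83, 8.51, 20.53, 19.45, 4.97),
  Risk        = factor(c(1, 0, 0, 1, 0, 0, 1, 1, 1, 1))
)

# Hypothetical firm to classify (values chosen only for illustration).
new_firm <- tibble(Money_Value = 5, TOTAL = 7)

# Standard deviations used to put both predictors on the same scale.
# Centering cancels out when taking differences, so only scaling is needed here.
money_sd <- sd(audit_sample$Money_Value)
total_sd <- sd(audit_sample$TOTAL)

# The three labelled firms closest to the new firm.
nearest <- audit_sample %>%
  mutate(
    distance = sqrt(
      ((Money_Value - new_firm$Money_Value) / money_sd)^2 +
      ((TOTAL - new_firm$TOTAL) / total_sd)^2
    )
  ) %>%
  slice_min(distance, n = 3, with_ties = FALSE)

# Unweighted majority vote among the three neighbors.
predicted_class <- nearest %>%
  count(Risk, name = "votes") %>%
  slice_max(votes, n = 1, with_ties = FALSE) %>%
  pull(Risk) %>%
  as.character()

nearest
predicted_class
```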


In [12]:
audit_workflow <- workflow() %>%
       add_recipe(audit_recipe) %>%
       add_model(knn_spec)%>%
       fit(data= audit_training)

audit_workflow

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

● step_scale()
● step_center()

── Model ───────────────────────────────────────────────────────────────────────

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~3, kernel = ~"rectangular")

Type of response variable: nominal
Minimal misclassification: 0.1423671
Best kernel: rectangular
Best k: 3

In [16]:
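# predict() returned 192 rows for the 193-row test set, so bind_cols() failed with
# "Can't recycle `..1` (size 192) to match `..2` (size 193)". The likely cause
# (an assumption, not verified here) is a test row with a missing predictor value,
# so incomplete rows are dropped before predicting.
audit_testing_complete <- audit_testing %>%
       drop_na()

audit_predictions <- predict(audit_workflow, audit_testing_complete) %>%
       bind_cols(audit_testing_complete)

audit_predictions


In [25]:
# The truth column must be the class label (Risk), not the predictor Money_Value.
audit_prediction_accuracy <- audit_predictions %>%
         metrics(truth = Risk, estimate = .pred_class)

audit_prediction_accuracy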


## Methods

After a preliminary investigation into the paper that uses this data, we determined that the most important factors were Score A, Score B, Money Value, Total, Loss, and History. However, to simplify the final data visualization, we narrowed these down to two predictors: the amount of money involved in misstatements in past audits (Money_Value) and the total amount of discrepancy found in past audit reports (TOTAL). To visualize the data, we plan to use a scatter plot with the two predictors on the axes and the points color-coded by whether the firm is fraudulent.
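
The planned scatter plot could look roughly like the sketch below. It uses only the ten rows printed in the notebook output as stand-in data (`audit_sample` is a made-up name), and it assumes that `Risk = 1` marks a fraudulent firm and `Risk = 0` a non-fraudulent one.

```r
library(tidyverse)

# Stand-in data: the ten rows shown in the notebook output, not the full dataset.
audit_sample <- tibble(
  Money_Value = c(3.38, 0.94, 0, 11.75, 0, 2.95, 44.95, 7.79, 7.34, 1.93),
  TOTAL       = c(6.68, 4.83, 0.74, 10.80, 0.08, 0.83, 8.51, 20.53, 19.45, 4.97),
  Risk        = factor(c(1, 0, 0, 1, 0, 0, 1, 1, 1, 1))
)

# Scatter plot of the two predictors, colored by class.
# Assumption: Risk = 1 means fraudulent, Risk = 0 means non-fraudulent.
audit_plot <- ggplot(audit_sample, aes(x = Money_Value, y = TOTAL, colour = Risk)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(
    x = "Money involved in past misstatements (Money_Value)",
    y = "Total discrepancy in past audit reports (TOTAL)",
    colour = "Risk class"
  )

audit_plot
```

Swapping `audit_sample` for `audit_training` would give the exploratory plot for the real training data.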


## Expected outcomes and significance
Through this project, we expect to find out whether the amount of money involved in misstatements and the total amount of discrepancy found in past audit reports can be used to predict whether a firm is fraudulent.
<br>Our findings could help auditors conduct risk assessments by identifying anomalies and trends and flagging the items that need further investigation; as a result, they could improve their auditing models. In addition, our findings could provide evidence and statistical support for legal policies and actions imposed on firms that are predicted to be fraudulent.
<br>This project could lead to the question of whether other variables would provide a more accurate prediction. Whether fraudulent firms show common trends, and how such patterns can be detected more easily, remain to be explored in future studies.