## Technical Report: Finding Fraudulent Activity  
Project Name: Fraudulent or Not?  
Team Members: Cristal Meza and Luke Nguyen  
CPSC 322, Spring 2022

### Introduction  
For this project, we chose a dataset of bank transactions in which some transactions were flagged as fraudulent and some were actually fraudulent. We saw that this would make a good dataset for classification, since a classifier can be trained to determine from several attributes whether a bank transaction is fraudulent or not.  
The "isFraud" attribute is the class label, so its values are used as y_train for our dataset. 

### Data Analysis  

In [None]:
# read the CSV file to a table
import myutils
import plot_utils
header, table = myutils.read_csv_to_table("Fraud.csv")

#### Information about the dataset  
Attributes (taken directly from Kaggle):  
* step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).  

* type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.  

* amount - amount of the transaction in local currency.  

* nameOrig - customer who started the transaction  

* oldbalanceOrg - initial balance before the transaction  

* newbalanceOrig - new balance after the transaction  

* nameDest - customer who is the recipient of the transaction  

* oldbalanceDest - initial balance recipient before the transaction. Note that there is not information for customers that start with M (Merchants).  

* newbalanceDest - new balance recipient after the transaction. Note that there is not information for customers that start with M (Merchants).  

* isFraud - These are the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control of customers' accounts and trying to empty the funds by transferring to another account and then cashing out of the system.  

* isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.  
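
The "isFlaggedFraud" rule described above can be written as a single predicate. The following is a minimal sketch, assuming that "200.000" means 200,000 units of local currency and that only TRANSFER transactions are flagged. The constant and function names are hypothetical and are not part of the dataset or of our code.

```python
# Assumed reading of "200.000" in the Kaggle description.
FLAG_THRESHOLD = 200_000


def would_be_flagged(tx_type, amount, threshold=FLAG_THRESHOLD):
    """Return 1 if a transfer moves more than the threshold, else 0.

    Hypothetical helper that mirrors the rule as described, not the
    simulator's actual implementation.
    """
    return int(tx_type == "TRANSFER" and amount > threshold)


print(would_be_flagged("TRANSFER", 250_000.0))
print(would_be_flagged("PAYMENT", 250_000.0))
```

Because the rule depends only on the transaction type and amount, it cannot account for fraud that happens below the threshold, which is one reason "isFlaggedFraud" and "isFraud" can disagree.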

Class label:  

* isFraud: We try to predict whether a particular transaction is fraudulent, and this attribute holds the answer.

### Relevant Summary Statistics

#### Data visualizations  
Below are some data visualizations that highlight important and interesting aspects of the dataset.

In [None]:
# Pie Chart - Number of transactions flagged as fraudulent
flagged_values, flagged_count = myutils.get_frequencies(table, header, "isFlaggedFraud")
plot_utils.pie_chart(flagged_values, flagged_count, "Figure 1 - Number of transactions flagged as fraudulent")

In [None]:
# Pie Chart - Number of transactions determined as fraudulent
fraud_values, fraud_count = myutils.get_frequencies(table, header, "isFraud")
plot_utils.pie_chart(fraud_values, fraud_count, "Figure 2 - Number of transactions determined as fraudulent")

In [None]:
# Bar Graph - Number of transactions of each type
type_values, type_count = myutils.get_frequencies(table, header, "type")
plot_utils.bar_chart(type_values, type_count, "Figure 3 - Number of transactions by type")
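
The three charts above all rely on counting how often each value occurs in one column. The code for `myutils.get_frequencies` is not shown in this report, so the following is only a sketch of what such a helper might do, assuming the table is a list of rows and the header is a list of column names. The function name and the toy rows are hypothetical.

```python
def get_frequencies_sketch(table, header, col_name):
    """Count occurrences of each distinct value in one column.

    Returns two parallel lists: the sorted distinct values and their counts.
    """
    col_index = header.index(col_name)
    counts = {}
    for row in table:
        value = row[col_index]
        counts[value] = counts.get(value, 0) + 1
    values = sorted(counts)
    return values, [counts[value] for value in values]


# toy rows for illustration only (not taken from Fraud.csv)
toy_header = ["type", "isFraud"]
toy_table = [["PAYMENT", 0], ["TRANSFER", 1], ["PAYMENT", 0], ["CASH_OUT", 0]]
print(get_frequencies_sketch(toy_table, toy_header, "type"))
```

Returning parallel lists keeps the result easy to pass straight into a pie or bar chart function that takes labels and heights separately.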

#### Data Cleaning  
The CSV file did not have any missing data, but we decided to remove two attributes prior to classification. The removed attributes are:  
* "step" - We did not deem this attribute relevant enough to be included in X_train.
* "isFlaggedFraud" - We deemed this attribute too similar to "isFraud," which is the class label that is used for y_train.  

Some attributes were changed for better classification and for the reader to be able to better understand the results.  
* "isFraud" - The values were changed from "0" and "1" to "no" and "yes"

In [None]:
# remove "step" and "isFlaggedFraud" from the dataset
myutils.drop_cols(table, header, "step")
myutils.drop_cols(table, header, "isFlaggedFraud")

# change attributes
myutils.change_isFraud(table, header, "isFraud")
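
The cleaning helpers called above live in `myutils` and are not reproduced in this report. As a rough sketch of the two operations (dropping a column in place and relabeling 0/1 as "no"/"yes"), assuming the table is a list of row lists, something like the following would work. All function names and the toy rows are hypothetical.

```python
def drop_col_sketch(table, header, col_name):
    """Remove one column from the header and from every row, in place."""
    col_index = header.index(col_name)
    header.pop(col_index)
    for row in table:
        row.pop(col_index)


def relabel_binary_sketch(table, header, col_name, mapping):
    """Replace 0/1 style values in one column using mapping, in place."""
    col_index = header.index(col_name)
    for row in table:
        key = int(float(row[col_index]))  # accepts 0, 0.0, or "0"
        row[col_index] = mapping[key]


# toy rows for illustration only (not taken from Fraud.csv)
toy_header = ["step", "amount", "isFraud", "isFlaggedFraud"]
toy_table = [[1, 100.0, 0, 0], [1, 250000.0, 1, 1]]
drop_col_sketch(toy_table, toy_header, "step")
drop_col_sketch(toy_table, toy_header, "isFlaggedFraud")
relabel_binary_sketch(toy_table, toy_header, "isFraud", {0: "no", 1: "yes"})
print(toy_header)
print(toy_table)
```

Looking up the column index by name each time, rather than hard-coding positions, keeps the calls correct even after earlier columns have been removed.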

### Classification Results

In [None]:
import myclassifiers
import myevaluation

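# create X and y, then split; "isFraud" is removed from the table before the
# split so the class label is not also present as a feature in X_train / X_test
y_actual = myutils.get_column(table, header, "isFraud")
myutils.drop_cols(table, header, "isFraud")
X_train, X_test, y_train, y_test = myevaluation.train_test_split(table, y_actual, random_state=20)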
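
The hold-out split comes from our `myevaluation` module, whose code is not shown here. The sketch below illustrates the general idea under stated assumptions: a seeded shuffle of row indexes followed by a cut, with a default `test_size` of 0.33 chosen only for illustration. The function name is hypothetical.

```python
import math
import random


def train_test_split_sketch(X, y, test_size=0.33, random_state=None):
    """Shuffle row indexes with a seeded generator, then cut into test/train.

    X and y stay paired because the same shuffled indexes select from both.
    """
    rng = random.Random(random_state)
    indexes = list(range(len(X)))
    rng.shuffle(indexes)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indexes[:n_test], indexes[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# toy data for illustration only: the label is the parity of the feature
toy_X = [[i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = train_test_split_sketch(toy_X, toy_y, random_state=20)
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split repeatable, so the decision tree and the dummy classifier are evaluated on exactly the same test instances.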

In [None]:
# fit data into Decision Tree
print("Decision Tree")
tree_clf = myclassifiers.MyDecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
y_predicted = tree_clf.predict(X_test)

# results 
accuracy = myevaluation.accuracy_score(y_test, y_predicted, normalize=True)
error_score = 1.0 - accuracy
print("Error score:", round(error_score * 100, 2), "%")

binary_score = myevaluation.binary_precision_score(y_test, y_predicted, labels=["no", "yes"], pos_label="no")
print("Binary Precision Score:", round(binary_score, 3))

recall = myevaluation.binary_recall_score(y_test, y_predicted, labels=["no", "yes"], pos_label="no")
print("Binary Recall Score:", round(recall, 3))

f1 = myevaluation.binary_f1_score(y_test, y_predicted, labels=["no", "yes"], pos_label="no")
print("Binary f1 score:", round(f1, 3))
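
Precision, recall, and F1 are all defined relative to the positive label, so the choice of `pos_label` changes what the scores mean. The sketch below computes the three scores from true positive, false positive, and false negative counts. It is an illustration on made-up labels, not the code in `myevaluation`, and the function name is hypothetical.

```python
def binary_scores_sketch(y_true, y_pred, pos_label):
    """Return (precision, recall, f1) with respect to pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# made-up labels for illustration: 8 legitimate, 2 fraudulent,
# and a classifier that predicts "no" for everything
toy_true = ["no"] * 8 + ["yes"] * 2
toy_pred = ["no"] * 10
print(binary_scores_sketch(toy_true, toy_pred, pos_label="no"))
print(binary_scores_sketch(toy_true, toy_pred, pos_label="yes"))
```

In this toy case the scores look strong when "no" is the positive label and drop to zero when "yes" is, even though the predictions are identical. For that reason it is worth reading the scores reported here alongside the same scores computed with `pos_label="yes"`.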

In [None]:
# fit data into Dummy classifier
print("Dummy Classifier")
dummy_clf = myclassifiers.MyDummyClassifier()
dummy_clf.fit(X_train, y_train)
y_predicted = dummy_clf.predict(X_test)

# results 
accuracy = myevaluation.accuracy_score(y_test, y_predicted, normalize=True)
error_score = 1.0 - accuracy
print("Error score:", round(error_score * 100, 2), "%")

binary_score = myevaluation.binary_precision_score(y_test, y_predicted, labels=["no", "yes"], pos_label="no")
print("Binary Precision Score:", round(binary_score, 3))

recall = myevaluation.binary_recall_score(y_test, y_predicted, labels=["no", "yes"], pos_label="no")
print("Binary Recall Score:", round(recall, 3))

f1 = myevaluation.binary_f1_score(y_test, y_predicted, labels=["no", "yes"], pos_label="no")
print("Binary f1 score:", round(f1, 3))
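
The dummy classifier serves as a baseline for the decision tree. The code for `MyDummyClassifier` is not shown in this report. The sketch below assumes a "most frequent label" strategy, which is a common choice for such a baseline; the class name is hypothetical.

```python
from collections import Counter


class DummyClassifierSketch:
    """Baseline that ignores the features and always predicts the most
    frequent label seen during fit (assumed strategy)."""

    def __init__(self):
        self.most_common_label = None

    def fit(self, X_train, y_train):
        # Counter.most_common(1) returns [(label, count)] for the top label
        self.most_common_label = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, X_test):
        return [self.most_common_label for _ in X_test]


# made-up data for illustration only
toy_clf = DummyClassifierSketch().fit([[1], [2], [3]], ["no", "no", "yes"])
print(toy_clf.predict([[10], [20]]))
```

On a dataset where one label dominates, a baseline like this can reach a low error score without detecting any fraud, so the decision tree should be judged by how much it improves on these numbers.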