
# Project Title
**Real-time Anomaly Detection in Financial Transactions**



# Authors and Team

- **Author 1**: Haozhen Guo
- **Author 2**: Yang Liu
- **Team**: Black Myth

# Abstract
The main objective of this project, Real-Time Anomaly Detection in Financial Transactions, is to develop and optimize machine learning algorithms for anomaly detection, with a focus on fraud detection in the financial market. By improving the performance of anomaly detection methods, we aim to reduce financial losses for individuals, improve risk management, and strengthen anti-money laundering systems.

We will implement both supervised and unsupervised machine learning algorithms. For supervised learning, we will apply LightGBM, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), training the models on labeled data and evaluating their performance on held-out test sets. For unsupervised learning, we will use outlier detection and clustering techniques, including Isolation Forest, Local Outlier Factor (LOF), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). These models are fitted without labels, and their output is then evaluated against the original labels in the data. Comparing the two approaches will show which performs better on this problem.

Ultimately, we expect to build an efficient and accurate anomaly detection system that can identify suspicious activity in the financial market. We hope such a system will not only protect financial institutions but also help prevent individuals from losing assets to fraud.


# **Project Details**

## **Project Description**

### **Motivation**

According to a Statista report (December 2023) \ref{statista2023}, global losses due to card fraud in 2023 amounted to \$33.45 billion. Although year-over-year growth has decelerated from 13.8\% to 3.4\%, total losses are still rising, which underlines how serious personal financial loss has become. In the United States alone, the Consumer Sentinel Network Data Book (2023) from the Federal Trade Commission reported over \$10 billion in losses across 2.5 million fraud reports \ref{ftc2023}. Considering the relatively low reporting rates among individuals under 19 and over 80, actual losses are likely much higher. Together, these figures point to one significant problem: the public faces a growing risk of fraud.

While it is essential to improve public awareness of common types of fraud, there is also a growing need for advanced anomaly detection in financial transactions. Given the massive volume of daily transactions, traditional analytical methods face significant challenges in keeping pace with technological change and complying with data privacy policies. As Fathe Jeribi (2024) states, Machine Learning (ML) has emerged as a pioneering tool for fraud detection because it learns from data and extracts hidden information without explicit human instructions, which supports later decision-making \ref{jeribi2024}. More recently, with the development of big data technologies, researchers have put more effort into combining them with existing anomaly detection methods. In their study on clustering-based real-time anomaly detection, Habeeb et al. (2022) summarize a range of anomaly detection algorithms, including clustering, Naïve Bayes, SVM, K-means, and other big data-driven approaches \ref{habeeb2022}.

In our study, we aim to explore anomaly detection in financial transactions using different methods and to optimize the algorithms through research and experiments.


### **Potential Impact on Decision Making**

Based on our research, anomaly detection algorithms can significantly influence business decisions in three main aspects:

- **Personal Savings Protection**: Detecting abnormal transactions allows banks and other financial institutions to track and block potentially suspicious or high-risk transactions, which helps protect personal savings.

- **Risk Management**: Financial institutions can use anomaly detection to identify unusual patterns in portfolio management and make appropriate adjustments in investment strategies.

- **Anti-Money Laundering**: One key application of anomaly detection is to discover irregularities in transactions that might be related to money laundering activities.

### **Social Significance**

Our research holds societal relevance in two primary areas:

- **Protection of Vulnerable Populations**: Among the people who suffer from financial fraud, vulnerable groups such as the elderly, low-income families, and individuals in debt are the most likely targets, yet they are often the least able to defend themselves. When anomalies related to financial fraud go undetected, these populations are exposed to considerable financial risk. By developing advanced and effective detection methods, we hope to help protect vulnerable individuals from fraud and the loss of personal property.

- **Enhancing Trust in the Digital Age**: As financial systems become more digitized, public attitudes toward automated systems are becoming increasingly polarized. Some people fully trust algorithms without understanding them, while others distrust them entirely and rely solely on personal judgment. If both views are taken to extremes, the development of the financial market will face significant challenges. By establishing a robust anomaly detection system, we want to help restore and strengthen public trust in financial institutions.

### **Personal and Academic Motivation**

Our interest in this research comes from both personal and academic motivations. As individuals, we have seen how often fraud affects daily life. We are even more concerned about the risks to our family members than to ourselves, especially those who are elderly or have had less education. Through this project, we hope to expand our knowledge of anomaly detection and to help those around us recognize fraud and its risks.

As students majoring in data analysis and statistics, we also recognize the significance of machine learning (ML) and artificial intelligence (AI) in the era of big data. We believe that growth in the number of dimensions or the volume of data alone does not capture the "true" meaning of big data. Time, as the scale of change, plays an important role in what ultimately defines big data. We hope to deepen our understanding of time-series data through this project, enabling us to address complex, real-time problems in the future.


## **Methods**


### **Data Preprocessing**:

Large raw datasets usually contain missing, inconsistent, and abnormal values, which can strongly affect the efficiency of model training and prediction and can even bias the model's results. Data preprocessing is therefore crucial in machine learning.

#### **Filling Missing Values**:

Missing values can distort data analysis. One way to handle them is to estimate each missing value from the surrounding observations with a suitable interpolation or regression model.

   - **Time-Based Interpolation**: For time series data, missing values can be approximated from their timestamps. This method assumes that the data changes smoothly over time and weights the neighboring observations by how close they are in time.

   - **Linear Interpolation**: A straight line is drawn between the two known data points on either side of the gap, and the missing value is read off that line. This method assumes that the data changes approximately linearly between neighboring points.

   - **Spline Interpolation**: For periodic or polynomial-like data, a smooth piecewise-polynomial curve is fitted through the known points and used to estimate the missing values. It is effective when the data is not linear.
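To make the difference between time-based and plain linear interpolation concrete, the following is a minimal sketch in pure Python. The function name `interpolate_by_time` and the sample amounts are hypothetical, and the sketch assumes the timestamps are strictly increasing; in practice a library routine would normally be used instead.

```python
from datetime import datetime


def interpolate_by_time(timestamps, values):
    """Fill None entries by linear interpolation weighted by elapsed time.

    Gaps at the very start or end are left as None because there is no
    known neighbour on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = (timestamps[right] - timestamps[left]).total_seconds()
        for i in range(left + 1, right):
            elapsed = (timestamps[i] - timestamps[left]).total_seconds()
            weight = elapsed / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical transaction amounts observed at irregular times.
times = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 4, 0),
]
amounts = [10.0, None, 50.0]
filled_amounts = interpolate_by_time(times, amounts)
```

Here the missing observation lies one hour into a four-hour gap, so it is filled with 20.0. Interpolating by position alone would place it halfway between its neighbors and give 30.0.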


### **Data Transformation**:

Features on very different scales can reduce both the efficiency and the accuracy of model training. Data transformation brings the data into a suitable format or scale.

   - **Differencing**: This technique computes the differences between consecutive data points. It can make a non-stationary series more suitable for analysis by removing trends.

   - **Log Transformation**: A log transformation is useful for time series with an exponentially increasing trend or a very wide range of values. Taking the logarithm turns exponential growth into linear growth, which makes the series more stable and easier to analyze and model.
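The two transformations can be sketched in a few lines of pure Python. The helper names `difference` and `log_transform` and the sample series are illustrative assumptions, and the log transform assumes strictly positive values.

```python
import math


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def log_transform(series):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(x) for x in series]


# Hypothetical series that doubles at every step (exponential growth).
volume = [100.0, 200.0, 400.0, 800.0]

raw_diff = difference(volume)                 # differences grow with the level
log_diff = difference(log_transform(volume))  # differences are roughly constant
```

On the raw series the differences keep growing (100, 200, 400), while after the log transform every difference is approximately ln 2, which is the kind of stable behavior that later modeling steps tend to prefer.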


### **Correlation Analysis**:

Correlation analysis quantifies the relationships between variables, assessing whether and how two variables change together.

   - **Pearson Correlation**:

This is a measure of the linear relationship between two variables. It ranges from -1 to +1, where +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

   - **MIC (Maximal Information Coefficient)**:

MIC measures both linear and non-linear associations between two variables. Unlike Pearson correlation, it can capture complex dependencies that are not necessarily linear, which makes it useful for identifying intricate relationships between features and fraudulent transactions.
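The limitation of Pearson correlation that motivates a measure such as MIC can be illustrated with a small, self-contained sketch. The `pearson` helper and the toy data are assumptions made for illustration; MIC itself is not computed here because it requires a dedicated library.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]   # perfectly linear in x
quadratic = [v * v for v in x]    # fully determined by x, but not linear

r_linear = pearson(x, linear)
r_quadratic = pearson(x, quadratic)
```

The linear series has a coefficient of 1, while the quadratic series has a coefficient of 0 even though it is completely determined by `x`. A feature related to fraud in this non-linear way would be missed by Pearson correlation alone.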


### **Dimension Reduction and Feature Extraction**:

Our dataset contains 590,540 records, each with 434 features, which increases the computational complexity and the model training time. Dimension reduction is therefore necessary to simplify the dataset and improve the efficiency, performance, and robustness of the models.

   - **t-SNE (t-distributed Stochastic Neighbor Embedding)**:

t-SNE is a nonlinear dimensionality reduction technique that preserves local relationships between data points. It can help us better visualize the local structure of nonlinear data.

   - **PCA (Principal Component Analysis)**:

PCA is a common linear dimensionality reduction method that reduces the dimension of the data by finding its principal components, the directions of greatest variance. Compared with t-SNE, it focuses more on global structure and is best suited to data with approximately linear structure.
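As a compact restatement of how PCA finds its components, the following formulation may help. The symbols $X$, $C$, $v_j$, $\lambda_j$, $V_k$, and $k$ are notation introduced here for illustration, and $X$ is assumed to be a mean-centered data matrix with $n$ rows (records) and $d$ columns (features).

$$
C = \frac{1}{n-1} X^\top X, \qquad C v_j = \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
$$

Keeping the first $k$ eigenvectors as the columns of $V_k$ gives the reduced representation and the share of variance it retains:

$$
Z = X V_k, \qquad \text{Explained variance} = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}
$$

A common heuristic is to pick the smallest $k$ whose explained variance exceeds a chosen level such as 95\%. That level is an assumption to be tuned, not something fixed by the method.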


### **Model Selection**:

#### **Supervised Learning--Classification**:

A technique where models are trained using labeled data, with the goal of predicting categorical labels.

   - **LightGBM**: LightGBM is a gradient boosting decision tree (GBDT) framework. Compared with traditional GBDT implementations, it is designed to handle very large datasets. Our dataset contains more than 500,000 records, so we chose LightGBM for its efficient parallel training, faster training speed, lower memory consumption, and strong accuracy.

   - **KNN (K-Nearest Neighbors)**: KNN classifies a data point based on the labels of its K nearest neighbors. Because predictions depend only on the closest data points, the model works well on low-dimensional data but suffers from the curse of dimensionality. We will therefore decide whether to use KNN after dimensionality reduction.

   - **SVM (Support Vector Machine)**: SVM finds a hyperplane that maximizes the margin between classes of data points. With kernel functions, it is suitable for binary classification of both linearly and nonlinearly separable data. Fraud detection is a binary classification problem, and maximizing the margin can help reduce both Type I and Type II errors.

   - **LSTM (Long Short-Term Memory)**: LSTM is a recurrent neural network architecture that can retain both long-term and short-term information. It keeps information over longer sequences than standard RNNs, which suits time series prediction problems.
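Of the supervised models above, KNN is simple enough to sketch directly. The following is a minimal pure-Python illustration of the majority-vote rule, not the implementation planned for the project. The function name `knn_predict`, the two-dimensional toy features, and the label convention (1 for fraud, 0 for legitimate) are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set: 0 = legitimate, 1 = fraud.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = [0, 0, 0, 1, 1]

near_fraud = knn_predict(train_x, train_y, (5.1, 5.0), k=3)
near_legit = knn_predict(train_x, train_y, (1.0, 0.9), k=3)
```

Every prediction requires a distance to every training point, and distances become less informative as the number of features grows, which is why KNN is considered here only after dimensionality reduction.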


#### **Unsupervised Learning**:

Unsupervised learning groups or scores data points based on the structure of the dataset itself and does not need a target feature. In our dataset, only 2% of transactions are fraudulent, which means fraud is an abnormal event. Outlier detection can therefore help us reach our goal.

   - **Isolation Forest**: Isolation Forest is a tree-based outlier detection algorithm. It builds an ensemble of random binary trees, called isolation trees, and flags points that can be isolated with only a few splits as anomalies. Compared with traditional algorithms such as K-means, Isolation Forest is more robust on high-dimensional datasets.

   - **LOF (Local Outlier Factor)**: LOF is a classic density-based outlier detection algorithm. It computes the ratio of the average local density of a point's neighbors to the point's own local density. A ratio close to 1 means the point is about as dense as its neighbors, while a ratio well above 1 marks it as an outlier.

   - **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**: DBSCAN is a density-based clustering algorithm that can also be used for outlier detection. It takes two parameters, a radius epsilon and minPoints. Epsilon is the radius of the neighborhood drawn around each data point, and minPoints is the minimum number of data points required within that neighborhood for the point to count as a core point. A point that is not a core point and is not within epsilon of any core point is labeled as noise, that is, an outlier.
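The way DBSCAN separates noise from clustered points can be sketched as follows. This is a simplified pure-Python illustration that only identifies noise points and does not assign cluster labels. The function name `dbscan_noise`, the toy points, and the convention that a point counts as its own neighbor are assumptions.

```python
import math


def dbscan_noise(points, eps, min_points):
    """Return the indices of points that DBSCAN would label as noise.

    A point is a core point when at least `min_points` points (itself
    included) lie within distance `eps`. A point is noise when it is not a
    core point and has no core point within `eps`.
    """
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbours) if len(nbrs) >= min_points}
    return [
        i for i, nbrs in enumerate(neighbours)
        if i not in core and not any(j in core for j in nbrs)
    ]


# Hypothetical points: a dense square, one point on its edge, one far away.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (2.0, 1.0),
    (10.0, 10.0),
]
noise = dbscan_noise(points, eps=1.1, min_points=3)
```

Only the distant point (index 5) is reported as noise. The point at (2.0, 1.0) has just two points in its neighborhood, but it lies within epsilon of a core point, so it is treated as a border point of the cluster and not as an outlier.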


### **Model Evaluation Metrics**:

#### **TPR (True Positive Rate)**:

TPR (also called recall or sensitivity) measures the proportion of actual positive cases that are correctly identified by the model. It is crucial in fraud detection, where missing a fraud case (false negative) can be costly.
$$
\text{TPR} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

#### **AUC-ROC (Area Under the Receiver Operating Characteristic Curve)**:

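AUC-ROC summarizes the overall performance of a classification model across all decision thresholds. The ROC curve plots the TPR against the false positive rate (FPR), the proportion of actual negative cases that are incorrectly flagged as positive, and the area under the curve reflects the model’s ability to distinguish between classes.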
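One way to read the AUC is as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below computes it directly from that definition. The function name `roc_auc` and the sample scores are illustrative assumptions, and the pairwise loop is meant for clarity, not for datasets of realistic size.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fraud, legitimate) pairs in which the fraud
    case receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical model scores: 1 = fraud, 0 = legitimate.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.05]
auc = roc_auc(labels, scores)
```

In this example, 7 of the 8 fraud and legitimate pairs are ranked correctly, so the AUC is 0.875. A value of 0.5 would correspond to random ranking and 1.0 to perfect separation.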


#### **Accuracy Score**:

Accuracy is the proportion of correctly classified instances out of the total instances.

$$
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples (TP + TN + FP + FN)}}
$$
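The following sketch shows why accuracy alone can be misleading when fraud is rare. It evaluates a trivial baseline that never flags fraud on hypothetical labels with a 2% fraud rate. All names and data are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def tpr(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 2 fraud cases in 100 transactions.
y_true = [1] * 2 + [0] * 98
always_legit = [0] * 100  # a baseline that never flags fraud

tp, tn, fp, fn = confusion_counts(y_true, always_legit)
baseline_accuracy = accuracy(tp, tn, fp, fn)
baseline_tpr = tpr(tp, fn)
```

The baseline reaches 98% accuracy while catching no fraud at all (TPR of 0), which is why TPR and AUC-ROC are reported alongside accuracy.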


### **Dataset Information**

We consulted multiple dataset sources, such as the ASOS e-commerce dataset on Hugging Face, a synthetic credit card transaction dataset on GitHub, and IEEE-CIS Fraud Detection on Kaggle. We reviewed the datasets along several dimensions, such as whether the dataset includes fraud labels, the number of features, the number of records, and whether the features are suitable for model training. After comparison, we chose IEEE-CIS Fraud Detection on Kaggle as our dataset. This dataset includes 590,540 records and 434 features.

After some simple data visualization, we found that many features carry key information, such as the operating system of the mobile phone used at the time of the transaction, the type of browser, and the type of payment card, which gives us rich options for feature engineering. According to the dataset statistics, 2% of transactions are labeled as fraud, so the dataset is imbalanced. We will handle the imbalance in two ways. For the supervised models, we will assign a higher weight to fraudulent transactions and adjust the decision threshold during model training. For the unsupervised models, we will remove the target column “Is Fraud” and then apply outlier detection models. Given the large number of records and features, and in order to avoid the curse of dimensionality, we will choose SVM, LightGBM, LSTM, and Isolation Forest as the models for initial training.
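The two imbalance-handling steps for supervised models, class weighting and threshold adjustment, can be sketched as follows. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count). It is shown as one possible choice, not as the scheme fixed for this project, and the function names, scores, and thresholds are hypothetical.

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def classify(scores, threshold):
    """Flag a transaction as fraud (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical labels with a 2% fraud rate.
labels = [1] * 2 + [0] * 98
weights = balanced_class_weights(labels)

# Hypothetical fraud scores from a trained model.
scores = [0.45, 0.30, 0.10, 0.55]
default_flags = classify(scores, threshold=0.5)
lowered_flags = classify(scores, threshold=0.3)
```

With these labels a fraud record is weighted 25.0 against roughly 0.51 for a legitimate one, so errors on fraud cost far more during training. Lowering the threshold from 0.5 to 0.3 flags three transactions instead of one, trading more false positives for a higher TPR.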




# Bibliography


\begin{thebibliography}{9}

\bibitem{statista2023}
\label{statista2023}
Statista. (2023). Global card fraud losses in 2023. \textit{Statista}. Retrieved from \url{https://www.statista.com/statistics/1394119/global-card-fraud-losses/}

\bibitem{ftc2023}
\label{ftc2023}
Federal Trade Commission. (2023). \textit{Consumer Sentinel Network Data Book 2023}. Federal Trade Commission. Retrieved from \url{https://www.ftc.gov/system/files/ftc_gov/pdf/CSN-Annual-Data-Book-2023.pdf}

\bibitem{jeribi2024}
\label{jeribi2024}
Fathe Jeribi. (2024). “A Comprehensive Machine Learning Framework for Anomaly Detection in Credit Card Transactions.” \textit{International Journal of Advanced Computer Science and Applications (IJACSA)}, 15(6). \url{http://dx.doi.org/10.14569/IJACSA.2024.0150688}

\bibitem{habeeb2022}
\label{habeeb2022}
Ariyaluran Habeeb, Riyaz Ahamed, et al. (2022). “Clustering‐based Real‐time Anomaly Detection—A Breakthrough in Big Data Technologies.” \textit{Transactions on Emerging Telecommunications Technologies}, 33(8). \url{https://doi.org/10.1002/ett.3647}

\end{thebibliography}
