# Real-time Anomaly Detection in Financial Transactions


## Authors and Team

- **Author 1**: Ferris Atassi, Developer
- **Author 2**: Charles Hang, Developer

# Executive Summary

### Decisions to be impacted

Our project will impact decisions to accept or reject financial transactions based on suspicion of fraud. More broadly, it will help fraud detection specialists at financial institutions develop tools to detect fraudulent financial transactions.

### Business Value

According to global card industry research company Nilson Report, $33 billion was lost to credit card fraud in 2022. Reducing that fraud will lead to savings for companies, which will sell fewer goods and services to fraudulent buyers and thus incur fewer of the costs associated with such sales (such as chargebacks). For example, the dataset used in this project was provided by Vesta Corporation, which guarantees that credit card transactions will go through in exchange for a cut of the revenues. Whenever a transaction turns out to be fraudulent, Vesta is responsible for compensating the seller for the lost volume. Detecting credit card fraud when it happens will reduce costs for transaction guarantee companies such as Vesta.

Similarly, reducing the number of fraudulent transactions will help consumers by protecting them from accidentally paying for fraudulent transactions and by lowering how much they pay for goods and services (since providers will not need to raise prices as much to account for fraud costs).

### Data Assets

This project uses the IEEE-CIS dataset, which was used in the IEEE-CIS Fraud Detection competition on Kaggle in 2019. The dataset consists of 590,540 actual credit card transactions spanning a little over six months. The transactions were provided by Vesta Corporation, a leader in the credit card payment guarantee industry.

### Questions and Answers

Question: 

You mainly focus on known fraud patterns in financial transaction anomaly detection. Do you take measures to identify those unseen abnormal patterns? For example, have you considered introducing adaptive models or online learning to detect new types of fraud in a timely manner?

Answer:

We have not considered introducing adaptive models or online learning because, in the short term, we expect new fraudulent transactions to be similar to the ones in our existing dataset. In an industry application, it would make sense to retrain the model continuously as new batches of transaction data become available, which would allow the model to adapt to new types of fraud. This approach is necessarily reactive to a degree, whereas adaptive or online-learning models would allow for more proactive fraud detection. However, the industry generally prefers false negatives to false positives: for customer service reasons, it is usually better to allow a few fraudulent transactions through than to deny large numbers of legitimate transactions. More aggressive, proactive fraud detection might therefore be more trouble than it's worth.
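
The trade-off between false positives and false negatives described above can be illustrated with a small sketch. The scores, labels, and thresholds below are made-up toy values, not results from our model; the helper `confusion_counts` is a hypothetical name used only for illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given decision threshold.

    A transaction is flagged as fraud when its score is >= threshold.
    """
    false_positives = sum(
        1 for s, y in zip(scores, labels) if s >= threshold and y == 0
    )
    false_negatives = sum(
        1 for s, y in zip(scores, labels) if s < threshold and y == 1
    )
    return false_positives, false_negatives


# Toy fraud scores and true labels (1 = fraud), for illustration only.
scores = [0.05, 0.20, 0.35, 0.55, 0.60, 0.90]
labels = [0, 0, 1, 0, 1, 1]

aggressive = confusion_counts(scores, labels, threshold=0.3)
conservative = confusion_counts(scores, labels, threshold=0.7)
print(aggressive)    # (false positives, false negatives) at a low threshold
print(conservative)  # (false positives, false negatives) at a high threshold
```

Raising the threshold in this toy data denies fewer legitimate transactions at the cost of letting more fraud through, which is the direction the industry preference points.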

Question:

When addressing class imbalance, have you evaluated the impact of different features, particularly outliers and high-variance features, on the model’s performance?

Answer:

Yes, we used Principal Component Analysis (PCA) to analyze the impact of different features on the variance in the data. The PCA output showed that fraudulent and non-fraudulent transactions had similar values for the principal components, indicating that the variance in feature values was not a major contributor to fraud detection. In other words, there wasn't a high-variance feature with one set of values for fraudulent transactions and another set of values for non-fraudulent transactions. To address outliers, including outliers introduced by high-variance features, we used Mahalanobis distance and Isolation Forest anomaly detection to remove outlier values, but due to the unstructured and high-variability factors of the dataset, onl



Question:

Can you share more details on how the outlier detection is done? How do you eliminate the outliers?

Answer:

Outliers were detected using two methods: Isolation Forest and Mahalanobis distance. Training data that had been cleaned of outliers with Mahalanobis distance detection proved more effective for training an accurate fraud detection model, so that method was ultimately used to remove outliers.
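
As a rough illustration of distance-based outlier removal, the sketch below computes each row's Mahalanobis distance from the feature means and drops rows above a cutoff. It is a minimal sketch, not our project code: the 99th-percentile cutoff, the function names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def mahalanobis_distances(X):
    """Return the Mahalanobis distance of each row of X from the column means."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse tolerates a singular covariance matrix
    # (for example, when some features are collinear).
    cov_inv = np.linalg.pinv(cov)
    centered = X - mean
    # d_i^2 = (x_i - mean)^T @ cov_inv @ (x_i - mean), computed for every row i.
    squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(squared, 0.0))


def remove_outliers(X, quantile=0.99):
    """Keep rows whose distance is at or below the given quantile (assumed cutoff)."""
    distances = mahalanobis_distances(X)
    keep = distances <= np.quantile(distances, quantile)
    return X[keep], keep


# Synthetic data with one planted extreme row, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [25.0, -25.0, 25.0]

X_clean, keep = remove_outliers(X, quantile=0.99)
print(X.shape[0], "rows before,", X_clean.shape[0], "rows after")
```

The boolean mask `keep` can also be applied to the label column, so that features and labels stay aligned after removal.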


Question:

How do you know whether an outlier is a real outlier or a fraud? How do you decide on the rule for removing outliers?

Answer:

Outliers were detected using two methods: Isolation Forest and Mahalanobis distance. Training data that had been cleaned of outliers with Mahalanobis distance detection proved more effective for training an accurate fraud detection model, so that method was ultimately used to remove outliers.

Based on our analysis of the data using PCA and simpler techniques, outliers were not found to be more likely to be fraudulent than normal transactions, so removing outliers did not disproportionately remove fraudulent transactions. Because of that, our expectation was that outlier removal would not reduce our model's ability to detect fraudulent transactions.

Question:

Why did you choose to left join the two datasets? What are the differences between the identity dataset and the transaction dataset?

Answer:

The transaction dataset had all of our transactions, while the identity dataset had identifying information about the parties involved in each transaction. The transaction dataset was left joined to the identity dataset so that all transaction data would be preserved, even if it meant that some transactions lacked identity data.
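
The effect of the left join can be sketched with pandas. The tiny frames below are invented for illustration; the join key `TransactionID` and the other column names are assumptions about the schema and are not stated in this document.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [50.0, 120.5, 9.99],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# how="left" keeps every row of `transactions`; identity columns are NaN
# for transactions with no matching identity record.
merged = transactions.merge(identity, on="TransactionID", how="left")
print(merged)
```

Transaction 2 has no identity record, so it survives the join with a missing `DeviceType`, whereas an inner join would have dropped it.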



Question:

In your project, outlier detection is addressed as a supervised learning problem, but some research suggests that it should be approached as an unsupervised learning task, since the proportion of fraudulent transactions is extremely low in most cases. Have you considered handling this problem in that way?

Answer:

We actually addressed outlier detection as an unsupervised learning problem, not a supervised learning problem. Both of the outlier detection techniques we used, Isolation Forest and Mahalanobis distance, are unsupervised techniques.

Question:

I wonder if your project can be extended from classification to prediction. In other words, can we predict roughly when and where the next frauds are going to happen given the history of frauds? That would be great for practical purposes : ) 

Answer:

We expect that our model could be used to detect future fraudulent transactions by predicting which transactions in future batches of transaction data are fraudulent.


Question:

Can you explain how you validated your model and ensured it would perform well on new, unseen data?

Answer:

We divided our transaction data into training data to train our model and test data for validation. For the model, the test data was "new, unseen" data.
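
A holdout split of this kind can be sketched as follows. The sketch stratifies by label so the rare fraud class appears in both parts; stratification, the 20% test fraction, and the function name `stratified_split` are assumptions for illustration, not details confirmed above.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train and test sets, preserving the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 fraudulent (1) and 90 legitimate (0) transactions.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

For transaction data that spans several months, splitting by time (training on earlier transactions and testing on later ones) is another common option, since it more closely mimics scoring future batches.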



Question:

Have you ever considered how the dominant value of a feature influences the model training?

Answer:

We had some features with one dominant value (for example, the large majority of our card transactions were for Visa cards, and essentially all of our transactions took place in the U.S.). Our understanding of the literature is that a feature having a single dominant value does not, by itself, mean the feature has to be transformed in any particular way.



Question:

While dealing with categorical features, do you use a sparse encoding like one-hot encoding? I noticed you have mail locations in the dataset, and one-hot encoding will lead to very sparse data.

Answer:

No, we did not use any encoding methods to transform our categorical feature data. Some of the data in the original dataset may have already been encoded using one-hot encoding, however.



Question:

Outliers are an essential component in fraud detection, since transactions that are outliers could potentially be fraudulent. So my question is: would you still do outlier detection and remove them?

Answer:

Based on our analysis of the dataset, outlier transactions were not disproportionately likely to be fraudulent, and thus removing them did not reduce our ability to detect fraudulent transactions. It would be much easier to detect fraudulent transactions if they were all outliers!

Question:

Why did you choose the metrics you did that lead you to make an XGBoost model?

Answer:

Our assessment of the literature indicated that XGBoost is one of the most commonly used methods for fraud detection using machine learning, since it has some of the highest accuracy scores for similar problems.

Question:

So which of the outlier detection methods did you think were more accurate?

Answer:

Mahalanobis distance outlier detection resulted in significantly stronger model performance than Isolation Forest.

Question:

Can you elaborate on your decision to focus on tree-based models? You mentioned that they performed better in the literature with regard to accuracy, but do you think there might be other advantages of non-tree classifiers?

Answer:

We used more than one kind of tree-based model in our project: a single CART decision tree and XGBoost, which is a gradient-boosted ensemble of decision trees. We used tree-based models because they have an established history of being effective in detecting fraudulent transactions in industry.


Question:

It looks like your dataset is extremely imbalanced. How do you deal with this problem?

Answer:

We used methods (XGBoost, Isolation Forest, and SMOTE oversampling) that have a proven history of being effective on imbalanced datasets.

Question:

Are there more creative sampling techniques that can be used to make the data set more balanced in training?

Answer:

Yes, oversampling is a common method of making an imbalanced dataset more balanced in training. SMOTE is one oversampling method we used to make our dataset more balanced for our CART decision tree model.
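
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors, can be sketched in NumPy. This is a simplified illustration and not the library implementation we ran; the function name, `k=3`, and the toy points are assumptions, and the pairwise-distance step is only practical for small inputs.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    n = X_minority.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority rows.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row and one of its k nearest neighbors per new sample.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place each synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# Toy minority-class points, for illustration only.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_min, n_new=10)
print(synthetic.shape)
```

Oversampling of this kind is normally applied to the training split only, so that the test data keeps the original class ratio.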

Question:

Why did the XGBoost model do better? How did it better handle the class imbalance?

Answer:

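XGBoost is an ensemble model, meaning that it combines multiple decision trees rather than relying on a single one. It is also an iterative boosting model: trees are built sequentially, and each new tree is fit to the errors of the trees before it. In an imbalanced dataset, those errors often fall on the minority (fraud) class, so this iterative focus helps XGBoost handle imbalanced data.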
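
Beyond boosting itself, XGBoost exposes a `scale_pos_weight` parameter that upweights the positive class, and a common starting value is the ratio of negative to positive examples. The sketch below only computes that ratio and places it in a parameter dictionary; this document does not state whether the parameter was tuned in our project, and the label counts are invented.

```python
def scale_pos_weight(labels):
    """Return the negative-to-positive ratio, a common starting weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return negatives / positives


# Toy labels: 35 fraudulent and 965 legitimate transactions (made-up counts).
labels = [1] * 35 + [0] * 965
weight = scale_pos_weight(labels)

# Parameters that could be passed to an XGBoost classifier (assumed setup).
params = {"objective": "binary:logistic", "scale_pos_weight": weight}
print(params)
```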



Question:

Does the label distribution shift after removing outliers? Intuitively, fraud transactions may be outliers in the feature space.

Answer:

PCA indicates that fraudulent and non-fraudulent transactions have similar principal component distributions, which implies that outliers are not disproportionately fraudulent.
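
A direct way to check for a label shift is to compare the fraud rate among all rows, kept rows, and removed rows. The sketch below does this on invented labels and outlier flags; it is an illustrative check, not a result from our dataset.

```python
def fraud_rate(labels):
    """Fraction of rows labeled as fraud (1); 0.0 for an empty list."""
    return sum(labels) / len(labels) if labels else 0.0


def compare_fraud_rates(labels, is_outlier):
    """Compare fraud rates across all rows, kept rows, and removed rows."""
    kept = [y for y, out in zip(labels, is_outlier) if not out]
    removed = [y for y, out in zip(labels, is_outlier) if out]
    return {
        "all": fraud_rate(labels),
        "kept": fraud_rate(kept),
        "removed": fraud_rate(removed),
    }


# Toy labels (1 = fraud) and outlier flags, for illustration only.
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
is_outlier = [True, False, False, True, False, True, False, True, False, True]

rates = compare_fraud_rates(labels, is_outlier)
print(rates)
```

If the "kept" and "removed" rates are close to the overall rate, as in this toy example, outlier removal has not shifted the label distribution.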

# Sources

Castillo, Michelle. “Why Credit Card Fraud Alerts Are Rising, and How Worried You Should Be about Them.” CNBC, 12 Sept. 2024, www.cnbc.com/2024/09/12/why-credit-card-fraud-alerts-are-rising.html.

“IEEE-CIS Fraud Detection.” Kaggle, 2024, www.kaggle.com/competitions/ieee-fraud-detection/leaderboard. Accessed 28 Oct. 2024.