Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Fraud Ring Identification

Within the financial industry, an organization can expect to lose 3-6%, and up to 10%, of its [business to fraudulent activities](http://www.crowe.ie/wp-content/uploads/2019/08/The-Financial-Cost-of-Fraud-2019.pdf).  Fraud does more than cause direct financial losses: victims often develop a negative view of the company, which leads to negative market sentiment.  Overall, fraudulent activities have a significant impact on a business, in terms of both consumer confidence and bottom-line revenue.  Because of this impact, companies expend significant time and money to detect and prevent fraud.

When dealing with fraud, there are two main components to a robust fraud system: fraud detection and fraud prevention.  In the fraud detection component, the main goal is to develop a system and methodology that allow for the rapid discovery of fraudulent activities.  This usually involves an after-the-fact evaluation of data, such as transactions, users, and credit cards, to determine which patterns or combinations represent actual fraud.  This process usually relies on a human-in-the-loop system in which automated processes flag likely or potential fraudulent activities, which are then evaluated by a domain expert to determine the legitimacy, or illegitimacy, of the flagged activities.  The output of this process is a set of known and evolving patterns of fraud that are fed into a fraud prevention system.  In this notebook we will focus on finding one type of pattern: fraud rings.

## What is a Fraud Ring

A fraud ring refers to a group of people who work together to commit various types of fraud or scams, often targeting businesses or individuals for financial gain. Some key characteristics of a fraud ring include:

* Organization - The group has a structure and hierarchy, with different members having specific roles in carrying out the fraud. There is often a ringleader organizing the operations.

* Collaboration - Members of the ring work together and share information/resources to facilitate the fraud. Their combined skills and knowledge make the fraud harder to detect.

* Specialization - Each member specializes in a specific type of fraud or scam, or handles a certain part of the operation. This division of roles makes the ring more efficient.

* Inside contacts - Fraud rings often recruit or bribe insiders who can provide sensitive information or access to facilitate the fraud.

* Multiple schemes - Rings typically run various interconnected fraud schemes simultaneously to maximize financial gain. 

* Sophistication - Rings use increasingly advanced techniques to avoid detection, such as identity theft, forgery, money laundering, hacking, and more.

* Large scale - Fraud rings are able to operate fraud at a much larger scale than individuals. The financial damage can amount to millions.

To summarize, a fraud ring is a structured criminal enterprise that works in a coordinated manner to pull off major frauds against businesses, government programs, or unsuspecting individuals. Breaking up these rings requires following the money trail and connections between members.

A fraud ring can operate under one of several pretexts. One common premise involves forgery, in which the fraud ring will create fake claims, steal identities and even print counterfeit checks and currency. Some rings target individuals, committing identity theft and the like, but many will focus on targeting e-commerce websites, businesses, charities or government agencies. The fraud rings will often test their software against the business’s payment solutions by trying to make purchases through bogus gift cards or by using fake credit cards. If the fraud ring can get past the first line of defense the business has in place, it will move on to more severe crimes against the business, including larger purchases paid for via fraudulent means, hacking into the company’s database and stealing the personal details of their customers and more. 


## Challenges of Detecting Fraud Rings

When dealing with fraud, it is often helpful to understand some challenges of finding fraudulent activities when looking into data.  Often this is aided by first understanding the definition and nature of fraud:

[Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms.](https://www.amazon.com/Analytics-Descriptive-Predictive-Network-Techniques/dp/1119133122)


This definition highlights the complex nature of the problems we must address when working on fraud systems.  First, fraud is *uncommon*.  Within any system of recorded transactions, only a small fraction of these transactions consist of fraudulent or illicit activities.  The sparse nature of these illicit activities complicates the nature of identifying these activities.  Second, fraud is *well-considered* and *imperceptibly concealed,* meaning that fraudulent activities are rarely impulsive activities.  Most fraudulent activities, at least at scale, involve multiple parties colluding together to perform actions specifically designed to exploit weaknesses in the system and elude detection.  Finally, fraud is *time-evolving*.  Fraudsters are continuously evolving and adapting their techniques as detection and prevention improve in an endless game of hide and seek.

With these challenges in mind, many fraud detection systems take a multi-faceted approach to identifying illicit activities.   In this notebook, we will focus on identifying fraud rings through a guilt-by-association approach. 

## Creating a fraud graph

In this section we'll load a sample fraud graph and set some visualization options. We'll then use algorithms and openCypher queries to inspect the data model to look for patterns that indicate fraud ring activity.

### Load data
The two cells below load the example fraud dataset into your graph.  The load takes less than 1 minute.

The first cell sets up a few Python variables using the configuration parameters of this Neptune notebook.  The second cell uses the Neptune Analytics bulk load feature to load the data from the provided S3 bucket.

**Note:** You only need to do this once. If you have already loaded the data previously you do not need to load it again.

In [None]:
import graph_notebook as gn
config = gn.configuration.get_config.get_config()

s3_bucket = f"s3://aws-neptune-customer-samples-{config.aws_region}/sample-datasets/gremlin/Fraud/"
region = config.aws_region
load_arn = config.load_from_s3_arn

In [None]:
%%oc

CALL neptune.load({format: "csv", 
                   source: "${s3_bucket}", 
                   region : "${region}"})

### Set visualization and configuration options

The cell below configures the visualization to use specific colors and icons for the different parts of the data model.

In [None]:
%%graph_notebook_vis_options

{
  "groups": {
    "Account": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf2bb",
        "color": "red"
      }
    },
    "Transaction": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf155",
        "color": "green"
      }
    },
    "Merchant": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf290",
        "color": "orange"
      }
    },
    "DateOfBirth": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf1fd",
        "color": "blue"
      }
    },
    "EmailAddress": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf1fa",
        "color": "blue"
      }
    },
    "Address": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf015",
        "color": "blue"
      }
    },
    "IpAddress": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf109",
        "color": "blue"
      }
    },
    "PhoneNumber": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf095",
        "color": "blue"
      }
    }
  },
  "edges": {
    "color": {
      "inherit": false
    },
    "smooth": {
      "enabled": true,
      "type": "straightCross"
    },
    "arrows": {
      "to": {
        "enabled": false,
        "type": "arrow"
      }
    },
    "font": {
      "face": "courier new"
    }
  }
}

### Data model
The fraud graph included in this example contains synthetic data that models credit card accounts, account holder information, merchants, and the transactions performed when an account holder purchases goods or services from a merchant.

**Account and features**

An Account has a number of features, including physical Address, IpAddress, DateOfBirth of the account holder, EmailAddress, and contact PhoneNumber. An account holder can have multiple email addresses and phone numbers.

In many graph data models, these features of the account holder would be modelled as properties of the account. With fraud detection, it's important to be able to link accounts based on shared features, and to find related accounts at query time based on one or more shared features. Hence, our fraud detection application graph data model stores each feature as a separate vertex. Multiple accounts that share the same feature value - the same physical address, for example - are connected to the single vertex representing that feature value. 
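To make this modelling choice concrete, here is a minimal sketch in plain Python (not Neptune code). The account IDs and feature values are made up for illustration and do not come from the sample dataset. Each distinct feature value becomes a single "vertex", and any two accounts attached to the same vertex are treated as linked.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (account, feature type, feature value) records; not from the sample dataset.
records = [
    ("account-1", "Address", "12 Main St"),
    ("account-2", "Address", "12 Main St"),
    ("account-2", "PhoneNumber", "555-0100"),
    ("account-3", "PhoneNumber", "555-0100"),
    ("account-4", "Address", "9 Elm Rd"),
]

# Each distinct (type, value) pair acts as one feature vertex;
# the set holds the accounts connected to that vertex.
feature_vertices = defaultdict(set)
for account, feature_type, value in records:
    feature_vertices[(feature_type, value)].add(account)

# Two accounts are linked when they attach to the same feature vertex.
linked_pairs = set()
for accounts in feature_vertices.values():
    for a, b in combinations(sorted(accounts), 2):
        linked_pairs.add((a, b))

print(sorted(linked_pairs))
```

Because the shared address and phone number are vertices rather than properties, the link from `account-1` to `account-3` through `account-2` becomes a path that a graph query can follow directly.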

The following query shows a single account and its associated features. After running the query, click the Graph tab to see a visualization of the results.

### What does a fraud graph look like for an account?

In [None]:
%%oc -d value -l 20
MATCH p=(n)-[]-()
WHERE id(n)='account-4398046519460' 
RETURN p
LIMIT 10

## Finding Fraud Rings in your graph

Now that we have an idea of what the data in our fraud graph looks like for an account, let's examine how to look for fraud rings within the data.

Detecting fraud rings involves identifying unusual or suspicious patterns in data. These patterns can vary depending on the type of fraud and the context in which it occurs. Here are some common patterns that analysts and machine learning models might look for:

* Unusual Behavior Patterns:
    * Frequency: Unusually high or low transaction frequencies for certain accounts.
    * Time of Activity: Transactions occurring at unusual times or outside regular business hours.
    * Location: Transactions from unexpected or geographically distant locations.

* Transaction Specifics:
    * Transaction Amounts: Unusually large or small transactions compared to historical behavior.
    * Transaction Types: Identifying unusual types of transactions for a specific user.
    
* Social Network Analysis:
    * Connections: Identifying networks of accounts that frequently transact with each other.
    * Topology Analysis: Examining the structure of connections between accounts.
    
These are just a few of the patterns you can look for to find fraud rings.  For this notebook we will detect anomalous behavior using Social Network Analysis to find groups of accounts that are disproportionately highly connected with one another.  We will then use these groups to perform a topological analysis of these accounts by looking at the structure of the connections between the accounts.

To begin this process we will run a graph algorithm that finds groups of highly connected nodes. Algorithms that accomplish this belong to a category of algorithms called `Community Detection`.  Community detection algorithms calculate meaningful groups or clusters of nodes within a network, revealing hidden patterns and structures that can provide insights into the organization and dynamics of complex systems.

Neptune Analytics supports a variety of community detection algorithms.  For this demonstration we will use one known as **Label Propagation**.

The label propagation algorithm is a semi-supervised machine learning algorithm that assigns labels to nodes based on the consensus of their neighboring nodes.  The algorithm starts by assigning a label to a small subset of nodes.  These labels are then propagated to each node's neighbors, with every node adopting the label held by the largest number of its neighbors.

Label propagation can be an advantage when beginning fraud ring analysis as it does not require prior labeling of the communities.  However, it does have a disadvantage in that multiple runs of the same algorithm may yield different results, due to random assignment of initial starting nodes/labels.
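The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the Neptune Analytics implementation: it assumes the common unsupervised formulation in which every node starts with its own unique label, and the small graph (two tightly knit groups joined by a single edge) is hypothetical. The `seed` parameter is there to show why different runs may produce different communities.

```python
import random
from collections import Counter

def label_propagation(adjacency, max_iterations=20, seed=0):
    """Simplified label propagation: each node adopts its neighbors' most common label."""
    rng = random.Random(seed)
    # Assumption: every node starts in its own community.
    labels = {node: node for node in adjacency}
    nodes = list(adjacency)
    for _ in range(max_iterations):
        rng.shuffle(nodes)  # visiting order is random, so results can vary by seed
        changed = False
        for node in nodes:
            neighbors = adjacency[node]
            if not neighbors:
                continue
            counts = Counter(labels[n] for n in neighbors)
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Keep the current label when it is already one of the most common.
            if labels[node] in candidates:
                continue
            labels[node] = rng.choice(candidates)  # break ties at random
            changed = True
        if not changed:
            break
    return labels

# Hypothetical graph: two groups of four nodes joined by the edge d-e.
adjacency = {
    "a": ["b", "c", "d"], "b": ["a", "c", "d"], "c": ["a", "b", "d"],
    "d": ["a", "b", "c", "e"],
    "e": ["d", "f", "g", "h"],
    "f": ["e", "g", "h"], "g": ["e", "f", "h"], "h": ["e", "f", "g"],
}

labels = label_propagation(adjacency, seed=42)
communities = {}
for node, label in labels.items():
    communities.setdefault(label, []).append(node)
print(communities)
```

Re-running with a different `seed` may change which label wins or where the boundary between groups falls, which is the run-to-run variability described above.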

Let's run our label propagation algorithm and find out the size of the largest community in our graph.  We will also be storing the output of this algorithm for later analysis into a Python variable in the notebook named `community_data` using the `--store-to` switch on the `%%oc` magic.

In [None]:
%%oc --store-to community_data

MATCH (n)
CALL neptune.algo.labelPropagation(n)
YIELD community
RETURN community, count(n) as size
ORDER BY size DESC

As we can see from the data above we have several communities that are around 300 members in size.  

When looking for fraud rings within data, the most common thing to look for is groupings that are anomalous enough to investigate.  What constitutes an anomaly varies across domains, and even across datasets within the same domain, so a common way to determine what is "normal" in your particular data is to look at the distribution of community sizes.


To accomplish this we will use some Python libraries with the data we stored in the `community_data` variable.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

# Create a numpy histogram with the number of bins being the max size
df = pd.DataFrame(community_data['results'])
hist = np.histogram(df.get('size'), bins=community_data['results'][0]['size'])

# Plot the histogram using Plotly
fig = px.bar(hist[0].tolist(), title = "Community Size Distribution")
fig.update_layout(xaxis_title='Community Size', yaxis_title='Occurrences', title_x=0.5)
fig.update_traces(showlegend=False)
fig.show()

From the histogram chart above we can see that most of the communities in our graph are on the small side, with fewer than ~50 members.  We also see that a relatively small number of communities have a large number of members, >300.
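One way to turn this visual impression into a number is a percentile cutoff. The sketch below is only an illustration: the community sizes are made up, and the 90th percentile is an assumed threshold that you would tune for your own data. In the notebook, the sizes would come from the `size` values stored in `community_data`.

```python
import statistics

# Hypothetical community sizes; in the notebook these would come from
# the `size` values in the `community_data` results.
sizes = [3, 5, 8, 12, 7, 4, 310, 6, 9, 295, 11, 5]

# Split the data into deciles; the last cut point is the 90th percentile.
# The 90% cutoff is an assumption, not a rule.
threshold = statistics.quantiles(sizes, n=10, method="inclusive")[-1]

# Communities larger than the cutoff are candidates for investigation.
anomalous = sorted(size for size in sizes if size > threshold)
print(f"threshold={threshold:.1f}, anomalous sizes={anomalous}")
```

A percentile adapts to whatever "normal" looks like in a given dataset, which matters because the definition of an anomaly differs between domains.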

Before we continue on though, let's take a moment and store the community value for each node back into our graph to simplify further analysis using a variation of the same algorithm.  This algorithm works the same except that this version stores the community value into a property specified, in this case the property name is `community`.

In [None]:
%%oc

CALL neptune.algo.labelPropagation.mutate({writeProperty: 'community'})

Now that we have identified large communities as the anomalies in our dataset, let's take a look at one of them.  In this case we will look at the largest community in our graph.

In [None]:
%%oc -g community

MATCH (n) 
WITH n.community as community, count(n.community) as community_size 
ORDER BY community_size DESC LIMIT 1
MATCH (n) WHERE n.community = community
MATCH p=(n)-[]->()
RETURN p

Looking at the graph visualization above, we see that the community here is quite connected, as expected.  What we also see is that a few specific vertices are much more connected than the others.  This is what is known as "centrality" or "influence" in a graph.  

Centrality in a graph refers to measures that identify the most important or central nodes in a network graph. Some common centrality measures include:

- Degree centrality - Counts the number of edges connected to a node. Nodes with higher degree are more central or "connected" in the graph.

- Closeness centrality - Calculates how close a node is to all other nodes by finding the shortest paths. Nodes with high closeness can spread information more quickly.

- PageRank - A variant of eigenvector centrality used by Google Search to rank website importance. Important pages are those linked to by other important pages.

In general, nodes with high centrality values are considered influential, visible, and critical to efficient network flow. Centrality helps identify the most important nodes to target in a network.

Neptune Analytics supports a variety of centrality algorithms.  For this demonstration we will use one known as **Closeness Centrality**.

Closeness centrality is a measure based on the average shortest-path distance between a node and all other nodes in a network. It indicates how close a node is to the rest of the nodes in the network.

The key points about closeness centrality are:

- It measures how close a node is to all other nodes. Nodes with high closeness centrality can spread information faster as they have shorter paths to all others.

- It is calculated as the inverse of the sum of the shortest paths between a node and all other nodes. Nodes with lower total distances to others have higher closeness centrality.

- Nodes with high closeness centrality are able to communicate faster with the entire network. They are often influential nodes.

- It is useful for identifying nodes that can distribute or receive information fastest to/from all others.

- Differences in closeness centrality are more meaningful in larger networks. In small networks most nodes have similar closeness values.
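The "inverse of the sum of the shortest paths" definition above can be sketched with a breadth-first search in plain Python. This is an illustration on a hypothetical five-node path graph, not the Neptune Analytics implementation, which may normalize or approximate the score differently (for example through the `numSources` parameter used later in this notebook).

```python
from collections import deque

def closeness_centrality(adjacency, node):
    """Inverse of the summed shortest-path distances from `node` (unweighted graph)."""
    distances = {node: 0}
    queue = deque([node])
    # Breadth-first search yields shortest hop counts in an unweighted graph.
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return 1 / total if total else 0.0

# Hypothetical path graph: a - b - c - d - e
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

scores = {node: closeness_centrality(adjacency, node) for node in adjacency}
most_central = max(scores, key=scores.get)
print(most_central, scores)
```

The middle node `c` has the smallest total distance to the others (1 + 1 + 2 + 2 = 6), so it receives the highest score, while the end nodes `a` and `e` receive the lowest.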

In the case of our fraud graph, we use high closeness centrality to find the influential, well-positioned nodes through which fraudulent activity can most easily propagate.

Let's run our closeness centrality algorithm and find the most important nodes within our graph.

In [None]:
%%oc

MATCH (n) 
CALL neptune.algo.closenessCentrality(n, {numSources: 8192})
YIELD score
RETURN n, score 
ORDER BY score DESC LIMIT 1

As with our earlier algorithms, we will run this again to store the centrality value back into our graph, this time in a property called `centrality`.

In [None]:
%%oc
CALL neptune.algo.closenessCentrality.mutate({numSources: 8192, writeProperty: "centrality"})

## Examining a Fraud Ring

Now that we have stored both the community value of our nodes as well as the relative importance, let's use this information to begin looking at a potential fraud ring.

A common workflow for fraud ring investigation is to look at the most important nodes inside an anomalous community.  In the case of our data, we have already determined that the largest communities are anomalous.  We can use this information combined with our centrality measurements to find a list of the 5 most important nodes to begin our investigation.

In [None]:
%%oc -g community

MATCH (n) 
WITH n.community as community, count(n.community) as community_size 
ORDER BY community_size DESC LIMIT 1
MATCH (n) 
WHERE n.community = community
RETURN n
ORDER BY n.centrality DESC LIMIT 5

### What if I want to look at not only the nodes but also their connections?

While the node information is very important, determining what does and does not constitute fraud often requires looking at not just the data on the node, but how that node is connected to other entities in the graph.  To do that we can use graph traversals to see the connections to a node, which is known as a neighborhood in graphs.  

Running the query below retrieves the top 5 most important nodes in the largest community, as well as the entities within two hops of those nodes.

In [None]:
%%oc -g community -sd 30000

MATCH (n) 
WITH n.community as community, count(n.community) as community_size 
ORDER BY community_size DESC LIMIT 1
MATCH (n) 
WHERE n.community = community
WITH n ORDER BY n.centrality DESC LIMIT 5
MATCH p=(n)-[]-()-[]-()
RETURN p

## Analyzing the results


A critical and ongoing part of any fraud workflow is to have a mechanism to enable analysts to investigate and prove/disprove that a potentially fraudulent activity exists.  

Visual inspection, combined with the domain expertise of a fraud analyst, is a critical factor in being able to determine if anomalous patterns in a graph represent actual fraud or legitimate activity. Expert analysts are skilled in looking at the patterns of transactions and connections and the structural connections between items to determine the legitimacy of an account/transaction. Once they have made this determination, they will often flag these accounts/transactions as fraudulent in the graph to aid in future investigations.

### Mark as Fraud/Not Fraud

In our scenario here, let's assume that a domain expert has determined that the `merchant-48` node in our graph is fraudulent.

Let's mark the merchant above as fraudulent by setting the `isFraud` property to `True`.

In [None]:
%%oc -d value -l 20
MATCH (a)
WHERE id(a)='merchant-48'
SET a.isFraud=True
RETURN a

### Find all items within three hops of our fraudulent merchant

Now that we have flagged `merchant-48` as fraudulent, let's look at what it is connected to.  In addition to looking at the connections, as shown above, another common use of graphs when analyzing anomalous activity is to look at how closely an account is connected to a known fraudulent account.

Let's take a look at this newly discovered fraudulent merchant and see what other items of interest we may want to investigate to look for other possible collusion.

In [None]:
%%oc -d value -l 20

MATCH p=(a)-[*1..3]-()
WHERE a.isFraud=True
RETURN p

Wow, there are a lot of shared connections and attributes to a known fraudster so it definitely looks suspicious, and is something we should continue to investigate.
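The "within three hops" idea behind this kind of query can be sketched as a depth-limited breadth-first search in plain Python. The graph and the node names below are hypothetical, and the sketch returns the set of reachable nodes with their hop distance, whereas the openCypher query above returns every matching path.

```python
from collections import deque

def within_hops(adjacency, start, max_hops):
    """Return {node: hop distance} for nodes reachable from `start` in at most `max_hops`."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    del hops[start]  # report only the other nodes
    return hops

# Hypothetical chain: merchant-x - txn-1 - account-1 - phone-1 - account-2 - txn-2
adjacency = {
    "merchant-x": ["txn-1"],
    "txn-1": ["merchant-x", "account-1"],
    "account-1": ["txn-1", "phone-1"],
    "phone-1": ["account-1", "account-2"],
    "account-2": ["phone-1", "txn-2"],
    "txn-2": ["account-2"],
}

nearby = within_hops(adjacency, "merchant-x", max_hops=3)
print(nearby)
```

Here `account-2` sits four hops from the flagged merchant, so it falls just outside the result. Raising the hop limit widens the net at the cost of pulling in more legitimate activity.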

## Conclusion

This notebook has shown how you can use Amazon Neptune Analytics to run analytics on your data to detect fraud rings. We've used a credit card dataset with account- and transaction-centric queries to perform a graph-based fraud ring analysis using a guilt-by-association approach.  We first identified the groups in our data.  We then identified the most influential nodes within these groups and stored this information within our graph.  Using this information we were able to explore the connections around the most influential entities to identify other potentially fraudulent accounts.

Combating fraud is an ongoing challenge for any organization.  The faster a team can identify fraud and the more they do, the more efficient anti-fraud systems become, preventing significant financial losses.  Finding and understanding fraud rings is a problem that requires the ability to query, analyze, and explore the connections between accounts, transactions, and account features.  Combining the ability to query a graph with the ability to run network analysis and graph algorithms on top of that data enables us to derive novel insights from this data. 