## Investigate GNN embeddings

After importing the new graph into Neptune Analytics, you can explore it further, using the GNN's predictions to focus on potentially fraudulent transactions.

Neptune Analytics has created a vector index for the embeddings on which you can run similarity and topk queries. You can also make use of the GraphStorm predictions included for each transaction node.

## Upload this notebook to your graph notebook instance

In notebook `1-SageMaker-Setup` you created a SageMaker AI notebook instance that comes pre-configured as a [Graph Notebook](https://github.com/aws/graph-notebook). This allows you to run OpenCypher queries against your graph directly from the notebook.

<!-- TODO: Remove when we include the notebook download in the LCC -->
To use this notebook you will need to upload it to that instance (a [Graph Notebook](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/notebooks.html) that can access your graph endpoint and submit queries) and run the cells there using the default `python3` kernel.

### Risk Score Validation using Known Labels

Fraud detection systems often rely on risk scores, but their effectiveness needs
constant validation. In this section, you will analyze how well the prediction model
performs against known fraud cases. By segmenting transactions into risk bands
(Very Low to Very High), you can see the actual fraud rates in each band and
validate the model's ability to identify suspicious transactions.

In [None]:
%%oc
// Detailed risk band analysis
MATCH (t:Transaction)
WITH CASE
    WHEN t.pred < 0.2 THEN "Very Low Risk (0-0.2)"
    WHEN t.pred < 0.4 THEN "Low Risk (0.2-0.4)"
    WHEN t.pred < 0.6 THEN "Medium Risk (0.4-0.6)"
    WHEN t.pred < 0.8 THEN "High Risk (0.6-0.8)"
    ELSE "Very High Risk (0.8-1.0)"
END as risk_category,
    t.pred as risk_score,
    t.isFraud as is_fraud
WITH risk_category,
     count(*) as total_cases,
     sum(is_fraud) as confirmed_fraud,
     sum(risk_score) as risk_sum
WHERE total_cases > 5
RETURN
    risk_category,
    total_cases,
    confirmed_fraud,
    round(1000.0 * confirmed_fraud / total_cases) / 1000.0 as fraud_rate,
    round(1000.0 * risk_sum / total_cases) / 1000.0 as avg_risk_score
ORDER BY avg_risk_score


From the above analysis you can see that the model has good alignment with the actual fraud rates. You can now proceed with a deeper investigation of the fraud patterns in the data.
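
As a rough illustration of what the risk band query computes, the following Python sketch applies the same thresholds to a handful of made-up `(pred, is_fraud)` pairs. The names `RISK_BANDS`, `risk_band`, `fraud_rate_by_band`, and the sample data are hypothetical and not part of the notebook; only the band boundaries and labels come from the query above.

```python
from collections import defaultdict

# Upper bounds and labels mirror the CASE expression in the query above.
RISK_BANDS = [
    (0.2, "Very Low Risk (0-0.2)"),
    (0.4, "Low Risk (0.2-0.4)"),
    (0.6, "Medium Risk (0.4-0.6)"),
    (0.8, "High Risk (0.6-0.8)"),
]


def risk_band(pred):
    """Return the risk band label for a prediction score in [0, 1]."""
    for upper, label in RISK_BANDS:
        if pred < upper:
            return label
    return "Very High Risk (0.8-1.0)"


def fraud_rate_by_band(transactions):
    """Return {band: (total_cases, fraud_rate)} from (pred, is_fraud) pairs."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for pred, is_fraud in transactions:
        band = risk_band(pred)
        totals[band] += 1
        frauds[band] += is_fraud
    return {band: (totals[band], frauds[band] / totals[band]) for band in totals}


# Hypothetical sample data, for illustration only.
sample = [(0.05, 0), (0.10, 0), (0.55, 0), (0.65, 1), (0.95, 1), (0.90, 1)]
summary = fraud_rate_by_band(sample)
```

A well-calibrated model should show fraud rates that rise from the lowest band to the highest, which is the pattern to look for in the query output.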

### Community Detection

One advantage of graph-based analysis is the ability to find natural clusters in the transaction network. Using Neptune's community detection capabilities, you can identify groups of transactions that form natural communities based on the way they connect to other nodes in the graph. These communities might represent normal business patterns, but they could also reveal coordinated fraudulent activities. The GNN predictions can help you differentiate between benign and potentially harmful groups.

This step transforms the transaction network from a collection of individual events into meaningful groups for fraud analysis using the [Louvain](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/louvain.html) community detection algorithm.


In [None]:
%%oc

CALL neptune.algo.louvain.mutate(
  {
    writeProperty: "louvainCommunity",
    maxLevels: 3,
    maxIterations: 10
  }
)
YIELD success
RETURN success

#### Uncover suspicious communities

With communities identified, you can analyze their characteristics to spot potentially fraudulent patterns. You can examine the average risk scores, fraud rates (when present), and transaction patterns within each community. 

This analysis helps you prioritize which communities to investigate first, based on factors like high average risk scores, unusual transaction amounts, or suspicious patterns of shared features. The graph structure helps us see connections that might be missed in traditional tabular analysis.

**Here you use the predictions of the GNN to rank communities by average risk score**, and since in this example all the data are annotated, you can also list the actual fraud rates for each community, as a way to verify the GNN predictions.

In [None]:
%%oc --store-to suspicious_communities

MATCH (t:Transaction)
WITH t.louvainCommunity as community_id,
     count(t) as tx_count,
     avg(t.pred) as avg_risk_score,
     avg(t.isFraud) as avg_fraud
WHERE tx_count > 10
RETURN community_id, tx_count, avg_risk_score, avg_fraud
ORDER BY avg_risk_score DESC LIMIT 10
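
The aggregation in this query can be sketched in plain Python to make the ranking logic explicit. This is a minimal sketch on made-up data: `rank_communities` and the sample tuples are hypothetical, while the size filter (`tx_count > 10`), the ordering, and the limit of 10 follow the query above.

```python
from collections import defaultdict


def rank_communities(transactions, min_size=10, limit=10):
    """Rank communities by mean predicted risk.

    transactions: iterable of (community_id, pred, is_fraud) tuples.
    Only communities with more than min_size transactions are kept.
    """
    groups = defaultdict(list)
    for community_id, pred, is_fraud in transactions:
        groups[community_id].append((pred, is_fraud))

    ranked = []
    for community_id, members in groups.items():
        tx_count = len(members)
        if tx_count <= min_size:
            continue
        ranked.append({
            "community_id": community_id,
            "tx_count": tx_count,
            "avg_risk_score": sum(p for p, _ in members) / tx_count,
            "avg_fraud": sum(f for _, f in members) / tx_count,
        })
    ranked.sort(key=lambda row: row["avg_risk_score"], reverse=True)
    return ranked[:limit]


# Hypothetical data: two large communities and one that is too small to rank.
sample = (
    [(1, 0.25, 0)] * 11
    + [(2, 0.75, 1)] * 12
    + [(3, 0.75, 1)] * 3
)
ranked = rank_communities(sample)
```

The size filter matters because a community of two or three transactions can reach a very high average risk by chance.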


#### Investigate Most Suspicious Community

For communities flagged as high-risk, you can conduct a detailed examination
of their transaction patterns. This includes analyzing the **types of features
shared between transactions**, and the network of connections
between high-risk transactions.

The graph structure makes it easy to visualize and
understand these relationships, revealing patterns that might indicate coordinated
fraud attempts or compromised features being used across multiple transactions.

In [None]:
# Extract the id of the community with the highest average risk
high_risk_community = suspicious_communities['results'][0]['community_id']

Here you can visualize the community ranked as the most suspicious and analyze the features that its transactions have by selecting the **Graph** tab from the output widget.

In [None]:
%%oc
MATCH (n) WHERE n.louvainCommunity = ${high_risk_community}
MATCH p=(n)-[]->()
RETURN p
LIMIT 100

You can use this kind of analysis to detect common elements in high-risk communities. For example, you might find that most of the transactions connect to the same value for one of the Card node types, or to a particular Address.

### Analyzing Feature Combinations in High-risk Transactions

Moving beyond individual communities, you can analyze which combinations of features appear frequently in high-risk transactions. This helps identify "suspicious patterns", combinations of attributes that might indicate fraudulent activity, and helps you decide which types of features are actually useful to have in the graph.

By leveraging the graph structure, you can easily find transactions that share the same features, which would require complex joins in traditional SQL analysis. Here, following the paths in the graph is enough to refine your understanding of fraud indicators.


In [None]:
%%oc
// This query identifies which types of features most commonly
// connect high-risk transactions, helping to spot patterns
// that might indicate fraudulent behavior

// Start with a few high-risk transaction seeds
MATCH (t1:Transaction)
WHERE t1.pred > 0.6
WITH t1 LIMIT 5

// Transactions have 2-hop connections through feature rel,
// named "Transaction,identified_by,<feature_type>".
// Here we find feature nodes 'f' that connect our seed transactions
// to other high-risk transactions
MATCH path = (t1)-[r1]->(f)<-[r2]-(t2:Transaction)
WHERE t1 <> t2
  AND t2.pred > 0.6  // Both transactions are high risk
  AND type(r1) CONTAINS 'identified_by'
  AND type(r2) CONTAINS 'identified_by'
//  AND split(type(r1), ',')[2] IN ['Card4', 'Card6']  // Optional: focus on specific features

// Extract feature type from the edge type,
// we split edge type names on the ',' character
// and retain the <feature_type>  element
WITH split(type(r1), ',')[2] as feature_type,
     f,
     t1,
     t2
LIMIT 50000 // Limits number of edges we review to help with result explosion
// Analyze feature types and their connection patterns
WITH feature_type,
     count(DISTINCT f) as unique_feature_values,
     count(DISTINCT t1) + count(DISTINCT t2) as transactions_connected,
     avg(t1.pred + t2.pred)/2 as avg_risk_score,
     avg(t1.isFraud + t2.isFraud)/2 as avg_actual_fraud

RETURN
    feature_type,
    unique_feature_values,
    transactions_connected,
    round(1000.0 * avg_risk_score) / 1000.0 as avg_risk,
    round(1000.0 * avg_actual_fraud) / 1000.0 as avg_fraud
ORDER BY transactions_connected DESC
LIMIT 20


During a fraud analysis you could use such a query to identify a number of risk indicators:

1. **Identify Compromised Features:**
   - If a `Card[4,6]` value connects many high-risk transactions with just a few (1-2) unique feature values, these specific card types might be compromised
   - Few unique email domains connecting many transactions suggests specific email providers are frequently used in fraud

2. **Pattern Recognition:**
   - High `avg_fraud` rates across features confirm these connections are reliable fraud indicators
   - Features with high `transactions_connected` but few `unique_feature_values` (like Card6, Card4) might indicate reused fraudulent credentials

3. **Investigation Prioritization:**
   - Recipient (R) and Purchaser (P) EmailDomain features with few unique values, each connecting thousands of suspicious transactions, might warrant investigation of specific suspicious domains

4. **Risk Assessment:**
   - The correlation between `avg_risk` (prediction) and `avg_fraud` (actual) validates the model's performance
   - Features like `Address[1,2]` with few unique values might indicate specific locations being used for fraud
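
The feature combination query derives `feature_type` by splitting the edge type name on commas and keeping the third element. The Python sketch below shows the same extraction, assuming edge types shaped like `Transaction,identified_by,<feature_type>` as described in the query comments. The helper name `feature_type_from_edge` is hypothetical, and it is stricter than the query, which only checks that the type contains `identified_by`.

```python
def feature_type_from_edge(edge_type):
    """Return <feature_type> from 'Transaction,identified_by,<feature_type>'.

    Returns None when the edge type does not have that three-part shape.
    """
    parts = edge_type.split(",")
    if len(parts) != 3 or parts[1] != "identified_by":
        return None
    return parts[2]


# Example edge type names following the naming scheme in the query comments.
card_feature = feature_type_from_edge("Transaction,identified_by,Card4")
not_a_feature_edge = feature_type_from_edge("Transaction,other_relation")
```

Indexing starts at zero in both openCypher lists and Python lists, so `[2]` selects the third element in each case.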


### Analyze feature "bridge" values

Finally, you can look for **specific** feature values that act as "bridges" connecting multiple high-risk transactions. These bridges might represent suspicious locations, compromised cards, suspicious email domains, or other attributes being reused across fraudulent transactions. 

This type of analysis drills down on the results from the previous query and is particularly powerful in graph form, as it naturally reveals fraud connection patterns. Understanding these bridges helps identify potential fraud enablers and improve detection mechanisms.


In [None]:
%%oc
// This query identifies unique feature values that act as "bridges"
// connecting multiple high-risk transactions
// These features might be indicators of organized fraud

// Start with high risk transactions and their unique feature values
MATCH (f)<-[r]-(t:Transaction)
WHERE t.pred > 0.6
AND type(r) CONTAINS 'identified_by'
WITH DISTINCT f, t

// Analyze their connectivity patterns
WITH f,
     labels(f)[0] as feature_type,
     count(DISTINCT t) as connected_tx,
     sum(t.isFraud) as fraudulent_tx,
     avg(t.pred) as avg_risk_score,
     avg(t.isFraud) as avg_fraud_score
WHERE connected_tx >= 100  // Unique feature values connecting at least 100 suspicious transactions

RETURN
    feature_type,
    id(f) as feature_id,
    connected_tx,
    fraudulent_tx,
    round(1000.0 * avg_risk_score) / 1000.0 as avg_risk,
    round(1000.0 * avg_fraud_score) / 1000.0 as avg_fraud
ORDER BY avg_risk DESC
LIMIT 20
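
To make the bridge logic concrete, here is a small Python sketch of the same grouping on made-up edges. The function `find_bridges`, the feature identifiers, and the sample data are hypothetical; the deduplication of feature and transaction pairs, the minimum connection threshold, and the ordering by average risk follow the query above.

```python
from collections import defaultdict


def find_bridges(edges, min_connected=100):
    """Find feature values shared by many high-risk transactions.

    edges: iterable of (feature_id, tx_id, pred, is_fraud) tuples.
    Repeated (feature_id, tx_id) pairs are counted once, like the
    WITH DISTINCT f, t step in the query.
    """
    by_feature = defaultdict(dict)
    for feature_id, tx_id, pred, is_fraud in edges:
        by_feature[feature_id][tx_id] = (pred, is_fraud)

    bridges = []
    for feature_id, txs in by_feature.items():
        connected = len(txs)
        if connected < min_connected:
            continue
        bridges.append({
            "feature_id": feature_id,
            "connected_tx": connected,
            "fraudulent_tx": sum(f for _, f in txs.values()),
            "avg_risk": sum(p for p, _ in txs.values()) / connected,
        })
    bridges.sort(key=lambda row: row["avg_risk"], reverse=True)
    return bridges


# Hypothetical edges; a low threshold keeps the example small.
sample_edges = [
    ("card3:A", "t1", 0.75, 1),
    ("card3:A", "t2", 0.75, 1),
    ("card3:A", "t2", 0.75, 1),  # duplicate edge, counted once
    ("addr1:B", "t3", 0.5, 0),
]
bridges = find_bridges(sample_edges, min_connected=2)
```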

Note that while the specific values for this dataset are anonymized, you can still draw some conclusions. 
In a real-world scenario you would have access to the original values for features like Address and the various card features, allowing you to focus your investigation on the particular values.

1. **High-Volume Bridges:**
   - Anonymized Card3 value `card3:150` connects thousands of risky transactions, the vast majority of which are fraudulent
   - This single card feature value is a very strong fraud indicator

2. **Payment Card Patterns:**
   - Card1 feature `card1:15063` and `card2:nan` demonstrate very high risk and corresponding fraud rate.

3. **Email Domain Analysis:**
   - aol.com and anonymous.com email domains have very high predicted risk and should be flagged.

Potential actions that you could take after combining the GNN prediction with your graph analytics:
 
1. **Immediate Action Items:**
   - Block or heavily scrutinize aol.com and anonymous.com email domains.
   - Investigate all transactions with feature value card1:15063
   - Block or heavily scrutinize transactions for `Address[1,2]` values listed

## Use graph embeddings to discover transactions similar to high-risk ones

The node embeddings that the GNN model has created contain semantic information about the characteristics of each transaction. Using the predicted risk scores, you can isolate high-risk transactions, then expand your search to similar transactions to find characteristics that join them. For example, you can get the top-k most similar transactions to each high-risk transaction, and investigate the resulting graph.

In the following query we start with 5 very high risk transactions, and collect the 3 most similar transactions to each one.

In [None]:
%%oc --store-to high_risk_neighbors
MATCH (t:Transaction)
WHERE t.pred >= 0.8
WITH t LIMIT 5
CALL neptune.algo.vectors.topKByNode(t, {topK: 4})
YIELD node as similar_node, score
WHERE id(t) <> id(similar_node) // Remove source node from query results
RETURN id(t), id(similar_node), score as distance, similar_node.pred, similar_node.isFraud
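
Conceptually, a top-k query ranks all other nodes by how close their embedding is to the source node's embedding. The sketch below illustrates the idea in plain Python, assuming Euclidean distance as the measure; the metric that Neptune Analytics actually uses is not stated in this notebook. The function `top_k_similar` and the two-dimensional embeddings are hypothetical.

```python
import math


def top_k_similar(embeddings, source_id, k):
    """Return the k nearest nodes to source_id as (node_id, distance) pairs.

    embeddings: dict mapping node id to a list of floats.
    The source node itself is excluded from the results.
    """
    source = embeddings[source_id]
    distances = [
        (node_id, math.dist(source, vector))
        for node_id, vector in embeddings.items()
        if node_id != source_id
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Hypothetical 2-dimensional embeddings; real GNN embeddings are much wider.
embeddings = {
    "t1": [0.0, 0.0],
    "t2": [3.0, 4.0],
    "t3": [1.0, 0.0],
    "t4": [0.0, 2.0],
}
nearest = top_k_similar(embeddings, "t1", k=2)
```

The query above requests `topK: 4` and then filters out the source node, which leaves the 3 most similar transactions per seed.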

You can then investigate the connections between these transactions by retrieving the paths connecting them. First extract the high-risk neighbor identifiers in Python:

In [None]:
high_risk_neighbor_ids = [
    entry['id(similar_node)'] for entry in high_risk_neighbors['results']]
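
Because several seed transactions can share the same nearest neighbor, the extracted list may contain repeated identifiers. A minimal sketch for dropping repeats while preserving order is shown below; the helper `unique_neighbor_ids` and the sample rows are hypothetical, and only the `id(similar_node)` key comes from the stored query results.

```python
def unique_neighbor_ids(results):
    """Collect neighbor ids from result rows, dropping repeats but keeping order."""
    ids = (row["id(similar_node)"] for row in results)
    return list(dict.fromkeys(ids))


# Hypothetical rows shaped like the stored query output.
sample_results = [
    {"id(t)": "t1", "id(similar_node)": "t7"},
    {"id(t)": "t1", "id(similar_node)": "t9"},
    {"id(t)": "t2", "id(similar_node)": "t7"},
]
neighbor_ids = unique_neighbor_ids(sample_results)
```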

Then submit a new OpenCypher query to extract the sub-graph that contains the suspicious similar transactions and their neighbors (features/identifiers). Run the following query and select the **Graph** tab for a graph view of the results.

In [None]:
%%oc
MATCH (n) WHERE id(n) IN ${high_risk_neighbor_ids}
MATCH p=(n)-[]->()
RETURN p
LIMIT 1000


In this example, you can see that the transaction components are connected by bridge nodes of types Address2, Card6, and Card4. You can use the values of these nodes to find other transactions that share the same characteristics, as another way to uncover potentially risky transactions.

## Conclusion

This notebook demonstrates how to leverage both machine learning predictions 
and graph analytics in Amazon Neptune to enhance fraud detection capabilities. 
By combining GNN-generated risk scores with graph-based analysis, you've learned how to:

1. **Validate and Enhance Risk Assessment**
   - Used known fraud labels to validate GNN prediction accuracy across risk bands
   - Identified cases where graph patterns reveal higher risk than individual predictions
   - Demonstrated how network context improves fraud detection accuracy

2. **Uncover Fraud Patterns**
   - Identified specific feature values highly associated with fraud (>90% fraud rates)
   - Discovered bridge features connecting thousands of suspicious transactions
   - Found communities of related transactions showing coordinated fraud patterns

3. **Enable Actionable Intelligence**
   - Pinpointed specific features for immediate investigation (e.g., aol.com and anonymous.com email domains)
   - Identified locations and card characteristics frequently involved in fraud
   - Revealed patterns of feature sharing that indicate organized fraud

The power of this approach lies in its unique combination of machine learning predictions, graph structure analysis, and known fraud labels. By integrating these three elements, you can create a comprehensive system that goes beyond simple risk scoring. 

While the GNN provides initial risk assessments, the graph structure reveals complex relationships and patterns that might be invisible when looking at transactions in isolation.

This hybrid approach **significantly advances traditional fraud detection capabilities** by enabling the identification of entire fraud rings and coordinated activities, rather than just flagging individual suspicious transactions. 

For fraud analysts and investigators, this means being able to prioritize cases based on both risk scores and network context, while clearly visualizing how fraudulent activities are connected.