# Fraud Detection with DataRobot & Neo4j Knowledge Graph

This notebook demonstrates an end-to-end pipeline for:
1) Flattening a Neo4j graph of (Client–Loan) pairs into train + holdout CSV files.
2) Creating a DataRobot project from the training data, with 'Fraud' as the target label.
3) Retrieving detailed model attributes from the best model on the Leaderboard using the DataRobot Python SDK.
4) Scoring (predicting on) the holdout (pending) dataset, collecting predictions & explanations.
5) (Optional) Updating Neo4j with these predictions for deeper analysis.

Dependencies:
 - `pip install -r requirements.txt`
 - A local module `FraudGraphFeatureExtractor` that provides the class `ClientLoanFeatureExtractor`, which extracts a DataFrame from Neo4j,
   and the function `update_neo4j_predictions`, which writes predictions back into Neo4j (used only in the optional last step).

Remember to adapt or remove code as needed for your environment.

### 0. Imports, Setup, and Explanation

In [None]:
import datetime

from FraudGraphFeatureExtractor import (
    ClientLoanFeatureExtractor,
    update_neo4j_predictions,
)
import datarobot as dr
from neo4j import GraphDatabase
import pandas as pd

### 1. DataRobot Credentials

In [None]:
# Example:
# DR_TOKEN = "YOUR_DATAROBOT_API_TOKEN"
# DR_ENDPOINT = "https://app.datarobot.com/api/v2"  # or your DR cluster
#
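# To pass credentials explicitly, uncomment:
# dr.Client(token=DR_TOKEN, endpoint=DR_ENDPOINT)
#
# With no arguments, dr.Client() reads credentials from the DATAROBOT_API_TOKEN and
# DATAROBOT_ENDPOINT environment variables or from a drconfig.yaml file.
dr.Client()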

### 2. Neo4j Credentials

In [None]:
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "password"
NEO4J_DATABASE = "neo4j"  # or None if single db

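
Hardcoding the Neo4j password in a notebook is convenient for a demo but easy to leak. A minimal sketch of reading the same four settings from environment variables, falling back to the demo values above, might look like this. The helper `load_neo4j_settings` and the environment variable names are assumptions made for illustration; they are not part of the `neo4j` driver or of this project.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Read Neo4j connection settings, falling back to the notebook's demo defaults.

    The variable names below are hypothetical; use whatever your deployment defines.
    """
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "password"),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
    }


settings = load_neo4j_settings()
# Mask the password before printing so it does not end up in notebook output.
print({k: ("***" if k == "password" else v) for k, v in settings.items()})
```

The resulting dictionary keys match the keyword arguments that `ClientLoanFeatureExtractor` is called with in the next step, so it could be passed as `ClientLoanFeatureExtractor(**settings)`.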

### 3. Create Two CSVs from Neo4j

In this step, we:
1. Use 'ClientLoanFeatureExtractor' to flatten the Neo4j graph into a single DataFrame 'df'.
2. Split 'df' into:
   - df_train: all rows with loan_status != 'pending' (i.e., closed loans, labeled Fraud=0/1)
   - df_holdout: all rows with loan_status == 'pending' (unlabeled)
3. Save these to 'train.csv' (for modeling) and 'holdout.csv' (for scoring).


In [None]:
print("Extracting data from Neo4j -> dataframes...")

extractor = ClientLoanFeatureExtractor(
    uri=NEO4J_URI, user=NEO4J_USER, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)
df = extractor.extract_client_loan_rows()
extractor.close()

print("Full DataFrame shape:", df.shape)
print(df.head(5))

# Separate training vs. holdout
df_train = df[df["loan_status"] != "pending"].copy()
df_holdout = df[df["loan_status"] == "pending"].copy()

# Save to CSV
df_train.to_csv("train.csv", index=False)
df_holdout.to_csv("holdout.csv", index=False)

print("train.csv shape:", df_train.shape)
print("holdout.csv shape:", df_holdout.shape)
print("Saved train.csv, holdout.csv.")
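
Before uploading anything, it can be worth confirming that the split behaves as intended: every row lands in exactly one file, and every training row carries a label. The sketch below reproduces the split on a small made-up DataFrame. The `loan_id` column and the toy values are hypothetical; only `loan_status` and `Fraud` come from the notebook.

```python
import pandas as pd


def split_by_status(df, status_col="loan_status", pending_value="pending"):
    """Return (train, holdout): non-pending rows vs. pending rows."""
    is_pending = df[status_col] == pending_value
    return df[~is_pending].copy(), df[is_pending].copy()


# Toy data standing in for the extractor output (values are made up).
toy = pd.DataFrame(
    {
        "loan_id": [1, 2, 3, 4],
        "loan_status": ["closed", "closed", "pending", "closed"],
        "Fraud": [0, 1, None, 0],
    }
)

train, holdout = split_by_status(toy)

# Every row ends up in exactly one of the two frames.
assert len(train) + len(holdout) == len(toy)
# Training rows must all be labeled, otherwise the target column is incomplete.
assert train["Fraud"].notna().all()

print("Fraud rate in training data:", train["Fraud"].mean())
```

Note that a row with a missing `loan_status` is treated as non-pending and therefore goes to the training set, which is one reason the label check is useful.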

### 4. Upload the Training Dataset to DataRobot and Start Autopilot

Next, we:
1. Create a UseCase object (optional) in DataRobot to categorize our project. 
2. Upload 'train.csv' as a DataRobot dataset.
3. (Optionally) create a feature list if we have a known subset of features.
4. Create a Project using 'Project.create_from_dataset(...)'.
5. Set 'Fraud' as our target label. DataRobot runs Autopilot to train multiple models.

In [None]:
use_case_name = "AI Accelerator: Fraud Detection with Knowledge Graphs"

use_case = dr.UseCase.create(
    name=use_case_name, 
    description="Fraud Detection with Knowledge Graphs and DataRobot"
)

train_ds = dr.Dataset.create_from_file("./train.csv", categories=["TRAINING"], use_cases=[use_case])

# Load a separate CSV for selected features and create feature list
features = pd.read_csv("./Selected Features.csv", header=None)[0].to_list()
train_ds.create_featurelist("Selected Features", features)

project_name = f"Fraud_Loan_Demo_{datetime.datetime.now().strftime('%Y%m%d_%H%M')}"
print(f"\nCreating new DataRobot project: {project_name}")

project = dr.Project.create_from_dataset(
    dataset_id=train_ds.id,
    project_name=project_name,
    use_case=use_case
)

fl = project.get_featurelist_by_name("Selected Features")

project.set_options(shap_only_mode=True)
project.analyze_and_model(
    target="Fraud",
    mode=dr.AUTOPILOT_MODE.FULL_AUTO,
    worker_count=-1,  # -1 requests all available workers
    featurelist_id=fl.id,
)
print("Autopilot started. Building models...")
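
The cell above reads `Selected Features.csv` with `header=None` and takes column 0, which implies a file with one feature name per line and no header row. The sketch below shows that format with an in-memory file and checks the names against the training columns before a feature list is created. The feature names used here are invented for illustration and are not the ones in the real file.

```python
import io

import pandas as pd

# Stand-in for "Selected Features.csv": one feature name per line, no header.
csv_text = "client_age\nloan_amount\nshared_phone_count\n"
features = pd.read_csv(io.StringIO(csv_text), header=None)[0].to_list()

# Stand-in for the columns of train.csv (hypothetical names).
available = {"client_age", "loan_amount", "shared_phone_count", "Fraud"}

# Names that are not columns of the training data would not be usable in a feature list.
missing = [name for name in features if name not in available]
print("Features:", features)
print("Missing from training data:", missing)
```

Running a check like this first turns a typo in the feature file into a readable list of names rather than an API error.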

### 5. Choose the Best Model & Retrieve Detailed Info from the Python SDK

We wait for Autopilot to finish, then pick the top model from the Leaderboard
and display its attributes, e.g., blueprint_id, metrics, and training details.

In [None]:
print("\nWaiting for Autopilot to finish (this may take a while).")
project.wait_for_autopilot()

# Retrieve the Leaderboard models; the first entry is treated as the best model
models = project.get_models()
best_model = models[0]
print(f"Best model = {best_model.model_type}, id={best_model.id}")

print("\n--- Detailed Model Info ---")
# Let's print each relevant attribute from the Model class
print("Model ID:", best_model.id)
print("Project ID:", best_model.project_id)
print("Processes:", best_model.processes)
print("Featurelist Name:", best_model.featurelist_name)
print("Featurelist ID:", best_model.featurelist_id)
print("Sample pct (if non-datetime partition):", best_model.sample_pct)
print("Training row count:", best_model.training_row_count)
print("Training duration (datetime partition):", best_model.training_duration)
print("Training start date:", best_model.training_start_date)
print("Training end date:", best_model.training_end_date)
print("Model Type:", best_model.model_type)
print("Model Category:", best_model.model_category)
print("Is Frozen?:", best_model.is_frozen)
print("Blueprint ID:", best_model.blueprint_id)
print("Metrics:", best_model.metrics)
print("N Clusters:", best_model.n_clusters)
print("Has Empty Clusters?:", best_model.has_empty_clusters)
print("Is starred?:", best_model.is_starred)
print("Prediction Threshold:", best_model.prediction_threshold)
print("Model Number:", best_model.model_number)
print("Parent Model ID:", best_model.parent_model_id)
print("Supports composable ml?:", best_model.supports_composable_ml)
if hasattr(best_model, "use_project_settings"):
    print("Use project settings:", best_model.use_project_settings)
print("------------------------------------------")
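
Taking `models[0]` relies on the order in which the Leaderboard is returned. If you would rather choose by a specific metric and partition, the ranking can be made explicit. The sketch below works on plain dictionaries so it runs without a DataRobot connection. It assumes each model's `metrics` attribute is a nested dict of the form `{metric_name: {partition: score}}` with partitions such as `validation`, `crossValidation`, and `holdout`; verify that against the output of the cell above. The model IDs and scores are made up.

```python
def rank_by_metric(metrics_by_model, metric="AUC", partition="validation",
                   higher_is_better=True):
    """Return (model_id, score) pairs sorted best-first, skipping missing scores."""
    scored = []
    for model_id, metrics in metrics_by_model.items():
        score = metrics.get(metric, {}).get(partition)
        if score is not None:
            scored.append((model_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)


# Hypothetical metrics, shaped like the assumed `metrics` dict.
toy_metrics = {
    "model_a": {"AUC": {"validation": 0.91, "crossValidation": 0.90, "holdout": None}},
    "model_b": {"AUC": {"validation": 0.94, "crossValidation": None, "holdout": None}},
    "model_c": {"LogLoss": {"validation": 0.21}},  # no AUC, so it is skipped
}

ranking = rank_by_metric(toy_metrics)
print(ranking)
```

With real models, the input could be built as `{m.id: m.metrics for m in models}`. For error metrics such as LogLoss, pass `higher_is_better=False`.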

### 6. Predict on the Holdout (Pending) Dataset & Optionally Request Explanations
We upload 'holdout.csv' to DataRobot as a separate dataset, 
then request predictions using the best model. 
We optionally also request prediction explanations to see top feature drivers.

In [None]:
print("\nScoring holdout.csv with best model...")
holdout_ds = project.upload_dataset("holdout.csv")

# Request predictions on the holdout dataset
pred_job = best_model.request_predictions(holdout_ds.id)
pred_df = pred_job.get_result_when_complete(max_wait=600)

print("Predictions shape:", pred_df.shape)
print("Sample predictions:")
print(pred_df.head())

# Request prediction explanations (step 7 reads explanations_df, so skip step 7's
# explanation columns if you leave this out)
explanations_job = dr.PredictionExplanations.create(
    project_id=project.id,
    model_id=best_model.id,
    dataset_id=holdout_ds.id,
    max_explanations=5,
)
explanations_df = explanations_job.get_result_when_complete(max_wait=999).get_all_as_dataframe()
print("\nExplanations sample:\n", explanations_df.head())

### 7. Combine predictions + top explanation with the original holdout

In [None]:
df_scored = df_holdout.copy()

# The output column might be 'positive_probability' or 'prediction'
pred_col = "positive_probability" if "positive_probability" in pred_df.columns else "prediction"
df_scored["pred_fraud_probability"] = pred_df[pred_col].values

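# Minimal approach for top explanation.
# Use .values so rows are matched by position: df_scored keeps the original row labels
# from df, while explanations_df is numbered from 0, so assigning the Series directly
# would align on labels and leave most rows empty or mismatched.
df_scored["top_feature"] = explanations_df["explanation_0_feature"].values
df_scored["top_feature_value"] = explanations_df["explanation_0_feature_value"].values
df_scored["top_feat_qual_strgth"] = explanations_df["explanation_0_qualitative_strength"].values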

df_scored.to_csv("holdout_scored.csv", index=False)
print("holdout_scored.csv saved with predictions + top explanation.")
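
When copying columns from one DataFrame into another, pandas aligns a Series on index labels rather than on row position. That matters here because `df_holdout` is a filtered slice of `df` and keeps its original row labels, while the result frames returned by DataRobot are numbered from 0. The toy example below, with made-up values, shows the difference between label-based and positional assignment.

```python
import pandas as pd

# A filtered frame keeps its original (non-contiguous) row labels.
holdout = pd.DataFrame({"loan_id": [10, 11, 12]}, index=[2, 5, 7])
# A freshly built result frame is labeled 0, 1, 2.
expl = pd.DataFrame({"explanation_0_feature": ["a", "b", "c"]})

# Label-based: only label 2 exists in both frames, so two rows become NaN
# and the one match is the wrong row.
misaligned = holdout.copy()
misaligned["top_feature"] = expl["explanation_0_feature"]

# Positional: converting to a NumPy array drops the index, so row i gets value i.
aligned = holdout.copy()
aligned["top_feature"] = expl["explanation_0_feature"].to_numpy()

print(misaligned)
print(aligned)
```

Positional assignment is only correct if both frames are in the same row order, so an alternative is to call `reset_index(drop=True)` on the holdout frame before scoring and join on an explicit row identifier.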

### 8. Post-Processing & Update Neo4j with DataRobot Predictions

We flag a loan as likely fraud when its predicted probability exceeds a threshold (0.45 in this example; adjust it to your tolerance for false positives).
We can then write these predictions back to Neo4j if we choose.

In [None]:
df_scored["flagged_as_fraud"] = (df_scored["pred_fraud_probability"] > 0.45).astype(int)
print("\nHigh-level summary of flagged loans:")
print(df_scored["flagged_as_fraud"].value_counts())


print("\nUpdating Neo4j with predictions from best model...")
update_neo4j_predictions(df_scored, best_model)
print("All done!")
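
The 0.45 cutoff is a judgment call, and the number of flagged loans can change sharply with small shifts in it. A quick way to see that sensitivity is to count flags at several candidate thresholds before committing to one. The sketch below uses made-up probabilities; the helper `flag_counts` is hypothetical and not part of any library.

```python
import pandas as pd


def flag_counts(probabilities, thresholds):
    """Map each threshold to the number of probabilities strictly above it."""
    probs = pd.Series(probabilities, dtype="float64")
    return {t: int((probs > t).sum()) for t in thresholds}


# Toy fraud probabilities (made up) and three candidate cutoffs.
counts = flag_counts([0.05, 0.40, 0.46, 0.90], [0.30, 0.45, 0.60])
print(counts)
```

On the real data, the same function could be called with `df_scored["pred_fraud_probability"]` to compare how many pending loans each cutoff would send for review.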