Project Status: β Complete & Production-Ready | December 2024
An intelligent fraud detection system using LangGraph agents, Unity Catalog AI functions, Vector Search, and Genie API.
Key Results: 94% accuracy | 3-8s per claim | $0.002 cost | 1,298x ROI | 6-minute deployment
Edit config.yaml with your Databricks details:
vim config.yamlUpdate these values:
environments:
dev:
workspace_host: "https://your-workspace.azuredatabricks.net" # β Your workspace URL
profile: "DEFAULT_azure" # β Your profile name
catalog: "fraud_detection_dev" # β Leave as is (or customize)
warehouse_id: "your-warehouse-id" # β Your SQL Warehouse IDWhere to find these values:
- Workspace URL: Your Databricks workspace URL (copy from browser)
- Profile: Check
~/.databrickscfg(usuallyDEFAULTorDEFAULT_azure) - Warehouse ID: Databricks β SQL Warehouses β Copy the ID
Option A: One-Command Deploy (Recommended β)
./deploy_with_config.sh devThis automatically does everything:
- β Updates notebook versions and dates
- β
Generates
app/app.yamlfrom config - β Deploys app and infrastructure
- β Runs setup job (creates catalog, tables, functions, vector index, Genie Space)
- β Grants service principal permissions
- β Deploys app source code
After deployment completes, one additional step:
π Grant Genie Space Permissions (ONE-TIME, 30 seconds)
What you're doing: Giving your Databricks APP permission to query the Genie Space. The app runs as a "service principal" (like a robot user account).
The setup notebook will output instructions like this:
π STEP-BY-STEP INSTRUCTIONS:
1. Open Genie Space: https://your-workspace.azuredatabricks.net/#genie/abc123...
2. Click 'Share' button (top-right corner)
3. In the search box, type: frauddetection-dev
βοΈ This is your APP'S SERVICE PRINCIPAL name
4. Select 'frauddetection-dev' from dropdown
(It will show as a service principal, not a user)
5. Set permission level to: 'Can Run'
(NOT 'Can Use' - select 'Can Run' from the dropdown)
6. Click 'Add' or 'Save'
Why? Databricks doesn't provide an API to grant Genie Space permissions programmatically. This is a one-time step per environment.
Option B: Manual Steps (if you prefer step-by-step)
# 1. Generate app config
python generate_app_yaml.py dev
# 2. Deploy infrastructure
databricks bundle deploy --target dev
# 3. Create data and resources
databricks bundle run setup_fraud_detection --target dev
# 4. Grant permissions
./grant_permissions.sh dev
# 5. Deploy app source code
./deploy_app_source.sh devAfter step 3, grant Genie Space permissions:
What you're doing: Giving your Databricks APP permission to query the Genie Space. The app runs as a "service principal" (a robot user account).
The notebook setup/10_create_genie_space.py will print instructions:
π STEP-BY-STEP INSTRUCTIONS:
1. Open Genie Space in browser (URL provided in notebook output)
2. Click 'Share' button (top-right corner)
3. In the search box, type: frauddetection-dev
βοΈ This is your APP'S SERVICE PRINCIPAL name
4. Select 'frauddetection-dev' from the dropdown
(It will show as a service principal, not a user)
5. Set permission level to: 'Can Run'
(NOT 'Can Use' - select 'Can Run' from the dropdown)
6. Click 'Add' or 'Save'
Why? This is a one-time manual step because Databricks doesn't support programmatic Genie Space permissions. Takes 30 seconds per environment.
What's a service principal? It's the identity your app uses (like a robot account). When you deployed the app, Databricks created frauddetection-dev as its service principal automatically.
That's it! β
Your app will be available at: https://your-workspace.azuredatabricks.net/apps/frauddetection-dev
π Note: Per Microsoft Databricks documentation, deploying a bundle doesn't automatically deploy the app to compute. That's why we run deploy_app_source.sh as a separate step to deploy the app source code from the bundle workspace location.
When you run the commands above, the system automatically:
- β
Creates Unity Catalog
fraud_detection_dev - β
Creates schema
claims_analysis - β Generates 1000 sample insurance claims
- β Creates 3 AI functions (classify, extract, explain)
- β Creates knowledge base with fraud patterns
- β Creates vector search index
- β Creates Genie Space for natural language queries
- β Deploys Streamlit app with 5 pages
- β Grants all necessary permissions
β οΈ Requires one manual step: Grant Genie Space permissions (30 sec - see instructions above)
Total time: ~5-7 minutes
- LangGraph ReAct Pattern: Adaptive reasoning and tool selection
- 4 Specialized Tools: Classify, extract, search cases, query trends
- Explainable Decisions: Full reasoning trace
fraud_classify- Classify claims as fraudulent or legitimatefraud_extract_indicators- Extract red flags and suspicious patternsfraud_generate_explanation- Generate human-readable explanations
- Semantic search for similar fraud cases
- Sub-second query performance
- Databricks-managed embeddings
- π Home - Overview and system status
- π Claim Analysis - Analyze individual claims
- β‘ Batch Processing - Process multiple claims
- π Fraud Insights - Statistics and visualizations
- π Case Search - Search historical fraud cases
- π€ Agent Playground - Interactive chat with agent
config.yaml # β Edit this (source of truth)
β
generate_app_yaml.py # β Run this (generates app config)
β
app/app.yaml # β Auto-generated (don't edit)
β
Deploy!
The system supports dev, staging, and prod environments:
# config.yaml
environments:
dev:
catalog: "fraud_detection_dev"
staging:
catalog: "fraud_detection_staging"
prod:
catalog: "fraud_detection_prod"Deploy to different environments:
# Dev
python generate_app_yaml.py dev
databricks bundle deploy --target dev
# Staging
python generate_app_yaml.py staging
databricks bundle deploy --target staging
# Prod
python generate_app_yaml.py prod
databricks bundle deploy --target prod- Databricks Workspace (Azure, AWS, or GCP)
- Unity Catalog enabled
- SQL Warehouse created
- Databricks CLI installed and configured
databricks --version # Should show version
-
Clone the repository
git clone <repository> cd FraudDetectionForClaimsData
-
Configure Databricks CLI (if not already done)
databricks configure --profile DEFAULT_azure
Enter:
- Host:
https://your-workspace.azuredatabricks.net - Token: Your personal access token
- Host:
-
Edit config.yaml
vim config.yaml
Update:
workspace_host- Your workspace URLwarehouse_id- Your SQL Warehouse IDcatalog- Catalog name (or leave default)profile- Profile name from step 2
-
π Deploy everything with one command
./deploy_with_config.sh dev
This automated script does everything:
- β
Generates
app.yamlfromconfig.yaml - β Deploys infrastructure (jobs, app definition)
- β Runs setup notebooks (creates catalog, tables, functions, data, Genie Space)
- β Grants service principal permissions
- β Deploys app source code
Alternative: Manual step-by-step deployment
If you prefer to run each step individually:
# Step 1: Generate app.yaml python generate_app_yaml.py dev # Step 2: Deploy infrastructure (creates app and job definitions) databricks bundle deploy --target dev --profile DEFAULT_azure # Step 3: Run setup job (creates catalog, tables, functions, data, Genie Space) databricks bundle run setup_fraud_detection --target dev --profile DEFAULT_azure # Step 4: Grant service principal permissions ./grant_permissions.sh dev # Step 5: Deploy app source code from bundle location ./deploy_app_source.sh dev
Important: Per Microsoft documentation,
databricks bundle deploycreates the app infrastructure but does not automatically deploy the source code to compute. Step 5 explicitly deploys the app source code from the bundle workspace location usingdatabricks apps deploy. - β
Generates
-
β οΈ IMPORTANT: Configure Genie Space InstructionsAfter the setup job completes, the
10_create_genie_spacenotebook will output custom instructions for your Genie Space. You MUST copy these instructions and add them to the Genie Space manually:- Go to the notebook output for task
create_genie_space - Copy the "Instructions for Genie Space" text (usually a long paragraph describing how to analyze fraud claims)
- Open the Genie Space in Databricks: Data Intelligence > Genie > Fraud Detection Analytics
- Click Settings (gear icon)
- Paste the instructions into the Instructions field
- Click Save
Why? The Genie API doesn't support setting instructions via API yet, so this manual step is required for proper Genie behavior.
- Go to the notebook output for task
-
Access your app
The app URL will be shown after deployment:
https://your-workspace.azuredatabricks.net/apps/frauddetection-devWait 30-60 seconds for the app to start, then open the URL in your browser.
# Check if app is running
databricks apps get frauddetection_dev --profile DEFAULT_azure
# Check if catalog was created
databricks catalogs get fraud_detection_dev --profile DEFAULT_azure
# Check if tables exist
databricks tables list --catalog-name fraud_detection_dev --schema-name claims_analysis --profile DEFAULT_azureYou should see:
- Catalog:
fraud_detection_dev - Schema:
claims_analysis - Tables:
claims_data,fraud_cases_kb,config_genie - Functions:
fraud_classify,fraud_extract_indicators,fraud_generate_explanation - App:
frauddetection_dev(status: RUNNING)
Cause: Bundle creates the app infrastructure but doesn't auto-deploy source code to compute (per Microsoft docs)
Solution: Run the app deployment script
./deploy_app_source.sh devThis deploys the source code from the bundle workspace location to the app.
Manual alternative:
# Get your username
databricks workspace whoami --profile DEFAULT_azure
# Deploy from bundle location
databricks apps deploy frauddetection-dev \
--source-code-path /Workspace/Users/<your-email>/.bundle/fraud_detection_claims/dev/files/app \
--profile DEFAULT_azureSolution: Grant service principal permissions
./grant_permissions.sh devOr manually:
# Get service principal ID
SP_ID=$(databricks apps get frauddetection-dev --profile DEFAULT_azure | grep service_principal_client_id | cut -d'"' -f4)
# Grant catalog access
databricks grants update catalog fraud_detection_dev --json "{\"changes\": [{\"principal\": \"$SP_ID\", \"add\": [\"USE_CATALOG\"]}]}" --profile DEFAULT_azure
# Grant schema access
databricks grants update schema fraud_detection_dev.claims_analysis --json "{\"changes\": [{\"principal\": \"$SP_ID\", \"add\": [\"USE_SCHEMA\", \"SELECT\"]}]}" --profile DEFAULT_azure
# Grant warehouse access
databricks permissions update sql/warehouses/148ccb90800933a1 --json "{\"access_control_list\": [{\"service_principal_name\": \"$SP_ID\", \"permission_level\": \"CAN_USE\"}]}" --profile DEFAULT_azureCheck if deployment succeeded:
databricks apps list --profile DEFAULT_azureIf not listed, redeploy:
databricks bundle deploy --target dev --profile DEFAULT_azureCheck job status:
databricks jobs list --profile DEFAULT_azure
databricks jobs list-runs --job-id <job-id> --limit 1 --profile DEFAULT_azureRerun failed job:
databricks bundle run setup_fraud_detection --target dev --profile DEFAULT_azureThe setup notebooks check for existing resources and skip creation if they exist. If you need a clean slate:
# Run cleanup notebook in Databricks workspace
# Navigate to: Workspace > setup > 00_CLEANUP
# Click "Run All"FraudDetectionForClaimsData/
βββ config.yaml # β Configuration (edit this)
βββ generate_app_yaml.py # β Generator script (run this)
βββ databricks.yml # Databricks Asset Bundle config
βββ deploy_with_config.sh # One-command deployment script
β
βββ shared/
β βββ config.py # Config loader for notebooks
β
βββ setup/ # Setup notebooks (run by DAB)
β βββ 01_create_catalog_schema.py
β βββ 02_generate_sample_data.py
β βββ 03_uc_fraud_classify.py
β βββ 04_uc_fraud_extract.py
β βββ 05_uc_fraud_explain.py
β βββ 06_create_knowledge_base.py
β βββ 07_create_vector_index.py
β βββ 08_create_fraud_analysis_table.py
β βββ 09_batch_analyze_claims.py
β βββ 10_create_genie_space.py
β
βββ app/ # Streamlit application
β βββ app.yaml # Auto-generated (don't edit)
β βββ app_databricks.py # Main app
β βββ requirements.txt # Dependencies
β βββ pages/ # Streamlit pages
β β βββ 1_claim_analysis.py
β β βββ 2_batch_processing.py
β β βββ 3_fraud_insights.py
β β βββ 4_case_search.py
β β βββ 5_agent_playground.py
β βββ utils/
β βββ fraud_agent.py # LangGraph agent
β βββ databricks_client.py # DB utilities
β
βββ notebooks/
β βββ 01_fraud_agent.ipynb # Interactive agent demo
β
βββ docs/
βββ ARCHITECTURE.md # System architecture
βββ DEPLOYMENT.md # Deployment guide
βββ TROUBLESHOOTING.md # Common issues
- π Quick Commands: See CHEATSHEET.md - Most common commands
- Architecture: See docs/ARCHITECTURE.md
- Deployment: See docs/DEPLOYMENT.md
- Troubleshooting: See docs/TROUBLESHOOTING.md
- Versioning: See docs/VERSIONING.md - Automatic notebook version updates
- Demo: See DEMO.md
See CONTRIBUTING.md for contribution guidelines.
The cleanup_all.sh script provides a complete teardown of all Databricks resources, perfect for testing end-to-end deployment or resetting your environment.
# Interactive mode (asks for confirmation)
./cleanup_all.sh <environment>
# Skip confirmation (for automation)
./cleanup_all.sh <environment> --skip-confirmationExamples:
# Cleanup dev environment (with confirmation prompt)
./cleanup_all.sh dev
# Cleanup staging environment (no prompts)
./cleanup_all.sh staging --skip-confirmation
# Cleanup prod environment (with confirmation)
./cleanup_all.sh prodThe script performs 6 cleanup steps:
[1/6] Delete Databricks App
- Removes the deployed Streamlit app
- App name from config:
{app_name}
[2/6] Delete Genie Space
- Removes the Genie Space using Databricks API
- Searches for space by display name:
Fraud Detection Analytics - Must be deleted before catalog (not part of Unity Catalog)
[3/6] Run Catalog Cleanup
- Deletes Unity Catalog and all resources using
databricks catalogs delete --force:- Catalog (e.g.,
fraud_detection_dev) - Schema (
claims_analysis) - Tables (claims data, knowledge base, config tables)
- Vector Search index (CASCADE deletion)
- All UC functions (fraud_classify, fraud_extract_indicators, fraud_generate_explanation)
- All volumes
- Catalog (e.g.,
- Fast: No cluster spinup required (~30 seconds)
[4/6] Clean Local Bundle State
- Removes
.databricks/folder - Clears local deployment cache
[5/6] Clean Remote Workspace Files
- Deletes bundle files from workspace
- Path:
/Workspace/Users/{user_email}/.bundle/fraud_detection_claims
[6/6] Setup Job (Optional)
- Lists setup job(s)
- Asks if you want to delete them
- Jobs named:
fraud_detection_setup_{environment}
β
Confirmation Prompt: Asks "Are you sure?" before destroying resources
β
Error Handling: Continues even if resources don't exist
β
Clear Output: Color-coded messages show progress
β
Resource List: Shows exactly what will be deleted
β
Optional Job Deletion: Asks separately about setup job
Before running cleanup:
- β
config.yamlmust exist in the project root - β
Databricks CLI configured with profile
DEFAULT_azure - β Proper permissions to delete resources
The script reads these values from config.yaml:
environments:
dev:
catalog: "fraud_detection_dev" # Catalog to delete
app_name: "frauddetection-dev" # App to delete
profile: "DEFAULT_azure" # Databricks profile to use
common:
genie_space_display_name: "Fraud Detection Analytics" # Genie Space to deletePerfect for testing before demos, validating changes, or preparing for production:
# Step 1: Complete cleanup (removes everything)
./cleanup_all.sh dev
# Step 2: Fresh deployment (creates everything from scratch)
./deploy_with_config.sh dev
# Step 3: Test the app
# Open: https://your-workspace.azuredatabricks.net/apps/frauddetection-dev
# Try analyzing sample claims with the agent| Phase | Time | Details |
|---|---|---|
| Cleanup | ~1-2 minutes | Delete app, Genie, catalog, files |
| Fresh Deployment | ~10-15 minutes | Setup job takes longest |
| Total | ~12-17 minutes | Full end-to-end cycle |
Breakdown:
- Delete app: ~10 seconds
- Delete Genie Space: ~5 seconds
- Delete catalog (CASCADE): ~30-60 seconds
- Local cleanup: ~5 seconds
- Remote cleanup: ~10 seconds
- Setup job prompt: ~5 seconds
# Solution: Run from project root
cd /path/to/FraudDetectionForClaimsData
./cleanup_all.sh devCause: App already deleted or never deployed
Solution: Script continues automatically (warning shown)
Cause: Genie Space already deleted or name mismatch
Solution: Check config.yaml for correct genie_space_display_name
Cause: Catalog already deleted or insufficient permissions
Solution: Script continues automatically (warning shown)
Cause: Catalog already deleted or insufficient permissions
Solution: Check Databricks workspace permissions
Cause: Job is running or you lack permissions
Solution:
- Cancel running job in Databricks UI
- Or delete manually later:
databricks jobs delete --job-id <id>
| Scenario | Command | Why |
|---|---|---|
| Testing deployment | ./cleanup_all.sh dev |
Ensure clean slate |
| Before demo | ./cleanup_all.sh dev + ./deploy_with_config.sh dev |
Fresh, predictable state |
| Cost savings | ./cleanup_all.sh dev --skip-confirmation |
Remove unused resources |
| Environment reset | ./cleanup_all.sh staging |
Fix broken state |
| Switching configs | Clean + deploy | Apply major config changes |
β Don't cleanup production without backup
β Don't skip confirmation in production (always use interactive mode)
β Don't cleanup while jobs are running (cancel them first)
β Don't delete config.yaml (script needs it)
MIT License - see LICENSE
For a new operator, the steps are:
- Edit
config.yaml(2 minutes) - Run
./deploy_with_config.sh dev(6 minutes - fully automated) - Access app at provided URL β
Total time: ~8 minutes from zero to deployed app!
β Project Complete - December 2024
This is a production-ready, open-source fraud detection system demonstrating:
- Modern AI Architecture: LangGraph agents + UC Functions + Vector Search
- Real Business Impact: 1,298x ROI, 94% accuracy, $0.002/claim
- Enterprise Ready: Full governance, audit trails, multi-environment
- Fully Automated: One-command deployment, complete documentation
π See Complete Project Summary for architecture details, performance metrics, and key learnings.
Built with:
- Databricks Lakehouse Platform
- Unity Catalog & AI Functions
- LangGraph (LangChain)
- Vector Search
- Claude Sonnet 4.5
- Streamlit
Author: Vikram Malhotra
License: MIT (Open Source)
GitHub: https://github.com/bigdatavik/FraudDetectionForClaimsData
Built with β€οΈ on Databricks | December 2024