insurance_claim_counterfactual_simulator is a local-first insurance analytics project built around one business question:
What would need to change in a claim so that it would no longer be considered fraudulent?
The project simulates a realistic company workflow: a claim package arrives as JSON + PDF + image, an extraction layer consolidates the package into structured fields, PostgreSQL becomes the system of record, analysts work from SQL, and a FastAPI service serves production-like scoring and counterfactual explanations.
The goal is not only to classify fraud. The main value is decision support:
- explain why a claim looks risky
- help analysts understand which factors push a file toward fraud
- propose plausible counterfactual changes that would reduce the risk score
- make the modeling layer more governable and easier to discuss with business stakeholders
The end-to-end workflow:
- A client package arrives with CRM JSON, a PDF attachment, and an image attachment.
- A local extraction layer parses and consolidates the package.
- Structured records are inserted directly into PostgreSQL.
- Data scientists and analysts work from PostgreSQL through SQL, notebooks, and reporting scripts.
- The fraud model is trained only on historical extracted claims.
- After the development boundary is frozen, new packages are scored through FastAPI.
- The API returns a fraud probability, a model decision, and a counterfactual explanation when requested.
The active pipeline is centered on PostgreSQL, not on intermediate CSV files.
Data levels:
- Synthetic internal ground truth used only to generate realistic claim packages and historical labels.
- Raw client packages under `data/raw/.../clients/`.
- Canonical structured analytical records stored directly in PostgreSQL.
Historical schema:
- `customers`
- `policies`
- `claims`

Production schema:
- `production_claim_intake`
- `production_claim_decisions`

Reusable SQL analytics views:
- `vw_historical_claims_enriched`
- `vw_production_claim_intake_enriched`
- `vw_production_claim_decisions_enriched`
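As a sketch of how an analyst might pull one of these views into pandas, the snippet below builds a connection from the project's `POSTGRES_*` environment variables (described later); SQLAlchemy and a PostgreSQL driver are assumed to be installed, and the defaults are illustrative:

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Build the connection from the project's POSTGRES_* environment variables.
engine = create_engine(
    "postgresql://{u}:{p}@{h}:{port}/{db}".format(
        u=os.getenv("POSTGRES_USER", "postgres"),
        p=os.getenv("POSTGRES_PASSWORD", ""),
        h=os.getenv("POSTGRES_HOST", "localhost"),
        port=os.getenv("POSTGRES_PORT", "5434"),
        db=os.getenv("POSTGRES_DB", "insurance_claims"),
    )
)

# Sample the enriched historical view for ad hoc analysis.
df = pd.read_sql("SELECT * FROM vw_historical_claims_enriched LIMIT 1000", engine)
print(df.shape)
```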
The project enforces a hard temporal split:
- historical development cutoff: 2024-10-02
- production start date: 2024-10-03
This means:
- historical data is the only source used for training
- historical data is the only source used for evaluation
- production-like data is reserved for post-development scoring
- the API rejects production packages dated before 2024-10-03
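A minimal sketch of the date guard this implies; the function name and error handling are illustrative, not the project's actual API code:

```python
from datetime import date

PRODUCTION_START = date(2024, 10, 3)  # first date accepted for production packages

def validate_production_claim_date(claim_date: date) -> None:
    """Reject production packages dated inside the historical window."""
    if claim_date < PRODUCTION_START:
        raise ValueError(
            f"Claim dated {claim_date} falls inside the historical window; "
            f"production scoring starts on {PRODUCTION_START}."
        )

validate_production_claim_date(date(2024, 10, 3))    # accepted
# validate_production_claim_date(date(2024, 10, 2)) # would raise ValueError
```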
Current generated dataset:
- total claims: 60,274
- historical claims: 54,248
- production-like claims: 6,026
- active locations: 37
Main claim columns:
`claim_id`, `customer_id`, `claim_date`, `policy_start_date`, `policy_age_days`, `claim_amount`, `claim_type`, `customer_age`, `num_previous_claims`, `time_since_last_claim_days`, `service_provider_id`, `location`, `weather_condition`, `is_fraud`
Fraud labels are generated probabilistically, as sketched after this list. Fraud propensity increases when:
- the claim amount is high
- the claim occurs shortly after policy start
- the previous claim was recent
- a suspicious provider is involved
- the claim type, weather, and amount form an unusual combination
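The sketch below illustrates this style of probabilistic labeling by combining the listed risk factors into a fraud propensity; the weights and cutoffs are invented for the example and are not the generator's real parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

def fraud_propensity(claim_amount, policy_age_days, time_since_last_claim_days,
                     suspicious_provider):
    """Toy propensity: each listed risk factor nudges the fraud logit upward."""
    logit = -3.0                                      # low base fraud rate
    logit += 1.2 * (claim_amount > 10_000)            # high claim amount
    logit += 1.0 * (policy_age_days < 30)             # claim shortly after policy start
    logit += 0.8 * (time_since_last_claim_days < 60)  # recent previous claim
    logit += 1.5 * suspicious_provider                # suspicious provider involved
    # (interaction effects such as type x weather x amount are omitted for brevity)
    return 1.0 / (1.0 + np.exp(-logit))

p = fraud_propensity(claim_amount=15_000, policy_age_days=10,
                     time_since_last_claim_days=30, suspicious_provider=True)
is_fraud = rng.random() < p  # the label is drawn, not thresholded deterministically
```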
The extraction engine is implemented in `document_extraction.py`.
It currently:
- parses CRM JSON fields
- reads claim details from PDF text
- recovers identifiers and metadata from image files
- consolidates the three sources into one structured claim
- records source mismatches
This is a real local extraction layer, but it remains rule-based. It is not a full OCR or computer vision stack.
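A minimal sketch of the consolidation idea, assuming a simple source-priority rule (CRM first, then PDF, then image) and a mismatch log; the names are illustrative, and the real logic lives in `document_extraction.py`:

```python
def consolidate(crm: dict, pdf: dict, image: dict) -> tuple[dict, list[str]]:
    """Merge three extracted views of one claim and log source disagreements."""
    consolidated, mismatches = {}, []
    for field in {*crm, *pdf, *image}:
        values = {src: d[field]
                  for src, d in (("crm", crm), ("pdf", pdf), ("image", image))
                  if field in d}
        if len(set(values.values())) > 1:
            mismatches.append(f"{field}: {values}")
        # Source priority: CRM JSON, then PDF text, then image metadata.
        consolidated[field] = values.get("crm", values.get("pdf", values.get("image")))
    return consolidated, mismatches

claim, issues = consolidate(
    crm={"claim_id": "C1", "claim_amount": 1200},
    pdf={"claim_id": "C1", "claim_amount": 1250},
    image={"claim_id": "C1"},
)
# issues -> ["claim_amount: {'crm': 1200, 'pdf': 1250}"]
```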
The project includes analyst-oriented SQL assets. The exploration script answers 10 business questions, ranging from portfolio profiling to suspicious-provider analysis and counterfactual-oriented fraud patterns, and writes its output as a processed SQL report artifact (see the `--write-report` command below).
The active training pipeline no longer relies on a fixed 0.5 threshold.
Current modeling process:
- load historical extracted claims from PostgreSQL
- benchmark multiple classifiers on the same historical split
- benchmark multiple imbalance strategies per classifier
- tune the decision threshold on a validation split
- select the deployment champion using a deployment-oriented objective
The current objective is:
- maximize precision
- while enforcing a minimum recall of 0.20
- use F1 and ROC AUC as secondary ranking signals
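A sketch of this selection rule, assuming validation labels and predicted probabilities are already in hand; scikit-learn's `precision_recall_curve` does the heavy lifting, and the function name is illustrative:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def tune_threshold(y_val, proba_val, min_recall=0.20):
    """Pick the threshold maximizing precision subject to recall >= min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
    precision, recall = precision[:-1], recall[:-1]  # align with thresholds
    feasible = recall >= min_recall
    if not feasible.any():
        raise ValueError(f"no threshold reaches recall >= {min_recall}")
    best = int(np.argmax(np.where(feasible, precision, -np.inf)))
    return thresholds[best], precision[best], recall[best]

# e.g. threshold, p, r = tune_threshold(y_val, model.predict_proba(X_val)[:, 1])
```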
Current candidate comparison:
- `rf` with `weighted` imbalance handling
- `xgb` with `oversample` imbalance handling

Current deployment champion:
- classifier: `xgb`
- imbalance strategy: `oversample`
- decision threshold: 0.83
Current historical holdout metrics for the deployment champion:
- ROC AUC: 0.764572
- Precision: 0.525140
- Recall: 0.247043
- F1: 0.336014
- Precision at top 1%: 0.777778
- Precision at top 5%: 0.431734
- Precision at top 10%: 0.302304
This is materially stronger than the previously presented results, which used a default 0.5 threshold.
The current pipeline keeps only the active artifacts needed for deployment and benchmarking:
- champion deployment model: deployment_champion.pkl
- champion deployment metadata: deployment_champion.json
- candidate benchmark model: candidate_rf.pkl
- candidate benchmark metadata: candidate_rf.json
- candidate benchmark model: candidate_xgb.pkl
- candidate benchmark metadata: candidate_xgb.json
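Loading the champion for ad hoc scoring then pairs the pickle with its metadata. This is a sketch: the `models/` location matches the repository layout below, but the `decision_threshold` key inside the JSON is an assumption:

```python
import json
import pickle
from pathlib import Path

models_dir = Path("models")

with open(models_dir / "deployment_champion.pkl", "rb") as f:
    champion = pickle.load(f)

with open(models_dir / "deployment_champion.json") as f:
    meta = json.load(f)

# The tuned threshold (0.83 for the current champion) should come from the
# metadata rather than being hard-coded; "decision_threshold" is an assumed key.
threshold = meta.get("decision_threshold", 0.83)
```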
Evaluation and diagnostics run against the same historical holdout (see the evaluate command below).
PowerShell:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Environment variables:
`POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_TABLE`, `POSTGRES_CUSTOMERS_TABLE`, `POSTGRES_POLICIES_TABLE`, `POSTGRES_PRODUCTION_INTAKE_TABLE`, `POSTGRES_PRODUCTION_DECISIONS_TABLE`, `POSTGRES_HISTORICAL_VIEW`, `POSTGRES_PRODUCTION_INTAKE_VIEW`, `POSTGRES_PRODUCTION_DECISIONS_VIEW`, `API_HOST`, `API_PORT`, `MODEL_DECISION_THRESHOLD`, `MODEL_MIN_RECALL_FOR_THRESHOLD`
For simple values such as ports and thresholds, quotes are optional.
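For example, a minimal local configuration might look like the following; the values are illustrative (the port, database name, threshold, and recall floor match the defaults described in this document), while the variable names are the ones listed above:

```
POSTGRES_HOST=localhost
POSTGRES_PORT=5434
POSTGRES_DB=insurance_claims
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
API_HOST=0.0.0.0
API_PORT=8000
MODEL_DECISION_THRESHOLD=0.83
MODEL_MIN_RECALL_FOR_THRESHOLD=0.20
```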
Example with Docker:
```powershell
docker compose -f docker/docker-compose.yml up -d postgres
```

The local Docker setup exposes PostgreSQL on 5434 by default.
Generate synthetic truth and raw packages:

```powershell
.\.venv\Scripts\python -m src.data_generation.generate_synthetic_data
.\.venv\Scripts\python -m src.data_generation.generate_pdfs_and_images
```

Extract and load directly into PostgreSQL:

```powershell
.\.venv\Scripts\python -m src.data_generation.extract_claim_packages --split all --reset-db
```

Train the current benchmark candidates:

```powershell
.\.venv\Scripts\python -m src.ml.train_model --source postgres --classifier rf
.\.venv\Scripts\python -m src.ml.train_model --source postgres --classifier xgb
```

Or benchmark both in one run from Python:

```powershell
.\.venv\Scripts\python -c "from src.ml.train_model import train_model; train_model(source='postgres', classifier_name=['rf','xgb'])"
```

Evaluate:

```powershell
.\.venv\Scripts\python -m src.ml.evaluate_model --source postgres
```

Score the reserved production-like intake:

```powershell
.\.venv\Scripts\python -m src.ml.score_production
```

Generate the SQL analyst report:

```powershell
.\.venv\Scripts\python -m src.db.query_from_postgres --write-report
```

Recreate analytics views explicitly:
```powershell
psql -d insurance_claims -f sql/02_create_analytics_views.sql
```

Run the API:

```powershell
.\.venv\Scripts\python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
```

Swagger docs are served at http://localhost:8000/docs.
Main endpoints:
- `POST /predict`
- `POST /intake/package`
- `POST /predict/package`
- `GET /claims`
- `GET /claims/export`
- `POST /counterfactual`
The production scoring path uses the champion model and its tuned decision threshold.
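A quick way to exercise the scoring path once the API is running; the payload fields below are guesses based on the main claim columns listed earlier, not a documented request schema:

```python
import requests

# Hypothetical payload: field names mirror the main claim columns above.
claim = {
    "claim_id": "PROD-000123",
    "claim_date": "2024-10-05",
    "claim_amount": 15200.0,
    "claim_type": "collision",
    "policy_age_days": 12,
    "num_previous_claims": 3,
}

resp = requests.post("http://localhost:8000/predict", json=claim, timeout=10)
resp.raise_for_status()
print(resp.json())  # expected: fraud probability plus the thresholded decision
```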
```
Insurance Claim Counterfactual Simulator/
├── README.md
├── details.txt
├── report/
│   └── insurance_counterfactual_study_report.docx
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── 01_exploration_and_eda.ipynb
│   └── 02_modeling_and_counterfactuals.ipynb
├── models/
├── src/
│   ├── data_generation/
│   ├── db/
│   ├── ml/
│   └── api/
├── scripts/
├── sql/
└── docker/
```
Current limitations:
- all data is synthetic
- extraction is rule-based rather than full OCR or CV
- images are not yet used as learned visual features
- counterfactuals are heuristic rather than causal guarantees
- `svm` remains available in code but is not part of the current deployment-ready benchmark loop because it is too costly for this dataset size
Planned extensions:
- probability calibration
- more formal business cost optimization for threshold selection
- SHAP or LIME explanations alongside counterfactuals
- DiCE or constrained optimization-based counterfactual generation
- OCR and multimodal image features
- monitoring and drift reporting

