## Auto Insurance Claims Fraud - Unsupervised Learning

Nearly one of 10 Americans would commit insurance fraud if they knew they could get away with it. Nearly one of four Americans say it’s OK to defraud insurers. About one in 10 people agree it’s OK to submit claims for items that aren’t lost or damaged, or for personal injuries that didn’t occur. Two of five people are “not very likely” or “not likely at all” to report someone who ripped off an insurer.  
\- Accenture Ltd. (2003)  

Nearly three of 10 Americans (29 percent) wouldn't report insurance scams committed by someone they know.  
\- Progressive Insurance (2001)

This notebook shows how to "flag" anomalous insurance claims using an unsupervised learning algorithm (1-Class Support Vector Machine). The notebook first builds a 1-Class SVM model and then applies the model to flag unusual or suspicious auto insurance claims. The anomaly detection model can also be applied to “score” new records. The entire machine learning methodology runs inside Oracle Autonomous Data Warehouse (ADW).

Copyright (c) 2021 Oracle Corporation 
###### <a href="https://oss.oracle.com/licenses/upl/" onclick="return ! window.open('https://oss.oracle.com/licenses/upl/');">The Universal Permissive License (UPL), Version 1.0</a>
---


 
![Auto insurance claims](http://www.oracle.com/technetwork/database/options/advanced-analytics/autoinsurancepic60-5434493.jpg "Auto insurance claims")


## Business Problem

---
We want to sift through our recent automobile insurance claims looking for anomalies, that is, suspicious or potentially fraudulent insurance claims. We will use the unsupervised Oracle Machine Learning 1-Class Support Vector Machine algorithm. The goal is to build the model on our *unlabeled* data and apply it to identify the most suspicious claims (e.g., the top 1-2%) for further investigation. Auto insurance claims investigators can then label each reviewed claim **Yes** or **No** in the column **FraudFound**, and those labels can later be used for supervised learning.
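
The "top 1-2%" selection described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `top_fraction` is hypothetical, the scores are made up, and in the notebook itself the scoring and ranking are done in SQL inside the database.

```python
import heapq
import math

def top_fraction(scores, fraction):
    """Return the (claim_id, score) pairs with the highest scores.

    scores   -- dict mapping a claim id to an anomaly score (higher = more suspicious)
    fraction -- share of claims to keep, e.g. 0.02 for the top 2%
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round up and keep at least one claim so small data sets are not emptied.
    k = max(1, math.ceil(len(scores) * fraction))
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up scores for illustration only; real scores would come from the model.
example_scores = {1001: 0.12, 1002: 0.91, 1003: 0.35, 1004: 0.77}
flagged = top_fraction(example_scores, 0.5)
print(flagged)
```

Rounding the cutoff up is a design choice: with only a few hundred claims, a 1% cutoff would otherwise round down to nothing.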

In [3]:
%python

import oml
import pandas as pd 

# URL of the location of the data in CSV format
url="https://raw.githubusercontent.com/oracle/oracle-db-examples/master/machine-learning/datasets/CLAIMS.csv"

# Create a local Pandas Dataframe
claims_pd = pd.read_csv(url)

# Check the number of rows and columns of the DataFrame
print(claims_pd.shape)


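# Ensure a table with that name does not exist
try:
    oml.drop(table='CLAIMS')
except Exception:
    # Most likely the table did not exist yet, so there is nothing to drop
    pass

# Remove ':' and '-' from the column names so they can be used as unquoted SQL identifiers
claims_pd = claims_pd.rename(columns = {"DAYS:POLICY-ACCIDENT": 'DAYSPOLICYACCIDENT', 'DAYS:POLICY-CLAIM': 'DAYSPOLICYCLAIM', 'ADDRESSCHANGE-CLAIM': 'ADDRESSCHANGECLAIM'})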


# Create the table CLAIMS and get back a proxy object CLAIMS_DF    
CLAIMS_DF = oml.create(claims_pd, table = 'CLAIMS')
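
The cell above renames three columns by hand. As a rough sketch, the same mapping could be derived programmatically. The helpers `sanitize_column_name` and `build_rename_map` are hypothetical names introduced here for illustration, and the rule (drop everything except letters, digits, and underscores) is an assumption that happens to reproduce the three renames used in this notebook.

```python
import re

def sanitize_column_name(name):
    """Drop every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "", name)

def build_rename_map(columns):
    """Map only the columns whose names actually change."""
    mapping = {}
    for col in columns:
        clean = sanitize_column_name(col)
        if clean != col:
            mapping[col] = clean
    return mapping

columns = ["POLICYNUMBER", "DAYS:POLICY-ACCIDENT", "DAYS:POLICY-CLAIM", "ADDRESSCHANGE-CLAIM"]
rename_map = build_rename_map(columns)
print(rename_map)
```

The resulting dictionary could be passed to `claims_pd.rename(columns=rename_map)`. Note that stripping characters can make two different names collide, so it is worth checking the cleaned names for duplicates on other data sets.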

In [4]:
%python

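# Remove the label column so the model is trained on unlabeled data
CLAIMS_UNSPV_DF = CLAIMS_DF.drop('FRAUDFOUND')

# Ensure a table with that name does not exist
try:
    oml.drop(table='CLAIMS_UNSPV')
except Exception:
    # Most likely the table did not exist yet, so there is nothing to drop
    pass

# Persist the unlabeled data as the table CLAIMS_UNSPV
_ = CLAIMS_UNSPV_DF.materialize(table = 'CLAIMS_UNSPV')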

## Data Exploration

---

In [6]:
%sql 

select * from CLAIMS_UNSPV;

In [7]:
%sql 

select AGEOFPOLICYHOLDER, DEDUCTIBLE, POLICYNUMBER from CLAIMS_UNSPV SAMPLE(50);

In [8]:
%sql
SELECT AGEOFPOLICYHOLDER, VEHICLEPRICE, POLICYNUMBER FROM CLAIMS_UNSPV;

In [9]:
%sql

SELECT DRIVERRATING, VEHICLECATEGORY, POLICYNUMBER from CLAIMS_UNSPV;

In [10]:
%sql

SELECT MONTHCLAIMED, NUMBEROFCARS, POLICYNUMBER from CLAIMS_UNSPV;

## Modeling

---

In [12]:
%script

-- Drop any existing model with the same name; ignore the error if it does not exist.
BEGIN DBMS_DATA_MINING.DROP_MODEL('CLAIMSMODEL');
EXCEPTION WHEN OTHERS THEN NULL; END;
/
DECLARE
    v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
    v_setlst('ALGO_NAME')   := 'ALGO_SUPPORT_VECTOR_MACHINES';
    v_setlst('PREP_AUTO')   := 'ON';

    -- A CLASSIFICATION model with a NULL target builds a 1-Class SVM (anomaly detection).
    DBMS_DATA_MINING.CREATE_MODEL2(
        MODEL_NAME          => 'CLAIMSMODEL',
        MINING_FUNCTION     => 'CLASSIFICATION',
        DATA_QUERY          => 'select * from CLAIMS_UNSPV',
        SET_LIST            => v_setlst,
        CASE_ID_COLUMN_NAME => 'POLICYNUMBER',
        TARGET_COLUMN_NAME  => NULL);
END;
/

### Examples of possible setting overrides for SVM

If the user does not override the default settings, the algorithm determines the relevant settings.

A complete list of settings can be found in the documentation:

-- Algorithm Settings: <a href="https://docs.oracle.com/en/database/oracle/oracle-database/21/arpls/DBMS_DATA_MINING.html#GUID-12408982-E738-4D0F-A2BC-84D895E07ABB" onclick="return ! window.open('https://docs.oracle.com/en/database/oracle/oracle-database/21/arpls/DBMS_DATA_MINING.html#GUID-12408982-E738-4D0F-A2BC-84D895E07ABB');">Support Vector Machine</a> 

-- Complexity factor for Support Vector Machine (both classification and regression).
   Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. By default, SVM uses a data-driven approach to estimate the complexity factor.
    'SVMS_COMPLEXITY_FACTOR' : '1'

-- Convergence tolerance for the SVM algorithm. The default is 0.0001.
    'SVMS_CONV_TOLERANCE' : '0.005'

-- Epsilon factor for SVM regression. Regularization setting similar to the complexity factor. Epsilon specifies the allowable residuals, or noise, in the data. The default is 0.1.
    'SVMS_EPSILON' : '0.2'

-- Kernel for Support Vector Machine: linear or Gaussian. The default value is SVMS_LINEAR.
    'SVMS_KERNEL_FUNCTION' : 'SVMS_GAUSSIAN'

-- The desired rate of outliers in the training data. Valid for One-Class SVM models only (anomaly detection). The default is 0.01.
    'SVMS_OUTLIER_RATE' : '0.05'

-- Standard deviation that controls the spread of the Gaussian kernel function. Applicable only to the Gaussian kernel. By default, SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases.
    'SVMS_STD_DEV' : '2'

-- Upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is 200.
    'SVMS_NUM_PIVOTS' : '220'

-- Size of the batch for the SGD solver. Applies to SVM models with a linear kernel. An input of 0 triggers a data-driven batch size estimate. The default is 20000.
    'SVMS_BATCH_ROWS' : '21000'

-- Type of regularization that the SGD SVM solver uses. The setting can be used only for linear SVM models. The default is system determined because it depends on the potential model size. Values: SVMS_REGULARIZER_L1 or SVMS_REGULARIZER_L2.
    'SVMS_REGULARIZER' : 'SVMS_REGULARIZER_L1'


-- SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined. Values: SVMS_SOLVER_SGD (Sub-Gradient Descent) or SVMS_SOLVER_IPM (Interior Point Method).
    'SVMS_SOLVER' : 'SVMS_SOLVER_SGD'
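
Several of the settings above are valid only for one kernel type, so the example overrides are not meant to be applied all at once. The following is a minimal sketch of a pre-flight check that encodes only the compatibility rules stated in the list above. The function `check_svm_settings` is a hypothetical helper, not part of any Oracle API, and the database performs its own validation when the model is built.

```python
# Settings that, per the descriptions above, apply only to one kernel type.
NONLINEAR_ONLY = {"SVMS_STD_DEV", "SVMS_NUM_PIVOTS"}
LINEAR_ONLY = {"SVMS_BATCH_ROWS", "SVMS_REGULARIZER"}

def check_svm_settings(settings):
    """Return a list of human-readable conflicts found in a settings dict."""
    problems = []
    kernel = settings.get("SVMS_KERNEL_FUNCTION")
    if kernel == "SVMS_LINEAR":
        for name in sorted(settings.keys() & NONLINEAR_ONLY):
            problems.append(f"{name} applies only to a non-linear (Gaussian) kernel")
    elif kernel == "SVMS_GAUSSIAN":
        for name in sorted(settings.keys() & LINEAR_ONLY):
            problems.append(f"{name} applies only to a linear kernel")
        if settings.get("SVMS_SOLVER") == "SVMS_SOLVER_SGD":
            problems.append("SVMS_SOLVER_SGD cannot be used with a non-linear kernel")
    # If no kernel is given explicitly, nothing is checked in this sketch.
    return problems

conflicting = {
    "SVMS_KERNEL_FUNCTION": "SVMS_GAUSSIAN",
    "SVMS_OUTLIER_RATE": "0.05",
    "SVMS_SOLVER": "SVMS_SOLVER_SGD",
    "SVMS_BATCH_ROWS": "21000",
}
consistent = {
    "SVMS_KERNEL_FUNCTION": "SVMS_GAUSSIAN",
    "SVMS_OUTLIER_RATE": "0.05",
    "SVMS_STD_DEV": "2",
}
for problem in check_svm_settings(conflicting):
    print(problem)
print(check_svm_settings(consistent))
```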
    

### Apply model to flag anomalies

Note that anomalies are predicted with a value of 0, and normal cases with a value of 1. Below, we get the prediction probability of being an anomaly. 

In [15]:
%sql

-- Obtain the anomaly prediction probability using the 1-Class SVM model.

select  round((prediction_probability(CLAIMSMODEL, '0' using *))*100,2) prob_fraud,
        POLICYNUMBER, AGEOFPOLICYHOLDER, SEX, MARITALSTATUS, NUMBEROFCARS, WITNESSPRESENT
from CLAIMS_UNSPV order by prob_fraud desc;



In [16]:
%sql
SELECT NUMBEROFCARS, AGEOFPOLICYHOLDER, POLICYNUMBER FROM
(select POLICYNUMBER, AGEOFPOLICYHOLDER, SEX, MARITALSTATUS, NUMBEROFCARS, WITNESSPRESENT, round(prob_fraud*100,2) percent_fraud,
      rank() over (order by prob_fraud desc) rnk from
(select POLICYNUMBER, AGEOFPOLICYHOLDER, SEX, MARITALSTATUS, NUMBEROFCARS, WITNESSPRESENT, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud
from CLAIMS_UNSPV
))
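
The query above ranks claims with `rank() over (order by prob_fraud desc)`. As a small illustration of how that ranking behaves, the sketch below mimics it in plain Python: tied probabilities share a rank and the next rank is skipped. The function `sql_rank` and the sample rows are made up for this example.

```python
def sql_rank(rows, key):
    """Mimic SQL RANK() OVER (ORDER BY key DESC): ties share a rank, then ranks skip."""
    ordered = sorted(rows, key=key, reverse=True)
    ranked = []
    previous_value = None
    current_rank = 0
    for position, row in enumerate(ordered, start=1):
        value = key(row)
        # A new rank starts only when the value differs from the previous row.
        if position == 1 or value != previous_value:
            current_rank = position
        ranked.append((current_rank, row))
        previous_value = value
    return ranked

# Made-up anomaly probabilities, for illustration only.
claims = [
    {"POLICYNUMBER": 1, "prob_fraud": 0.62},
    {"POLICYNUMBER": 2, "prob_fraud": 0.91},
    {"POLICYNUMBER": 3, "prob_fraud": 0.62},
    {"POLICYNUMBER": 4, "prob_fraud": 0.10},
]
ranked = sql_rank(claims, key=lambda row: row["prob_fraud"])

# Keep only the claims ranked first or second, with the probability as a percentage.
top_two = [
    (rnk, row["POLICYNUMBER"], round(row["prob_fraud"] * 100, 2))
    for rnk, row in ranked
    if rnk <= 2
]
print(top_two)
```

Because ties share a rank, a filter such as `rnk <= 2` can return more than two claims, which is usually the desired behavior when investigators should see every claim at or above a given suspicion level.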

### Display Prediction Details

By default, prediction details are produced as an XML string. Using SQL, however, we can easily convert this to a multi-column format, as shown below.

In [18]:
%script

BEGIN EXECUTE IMMEDIATE 'DROP TABLE SUSPICIOUS_CLAIMS';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
CREATE TABLE SUSPICIOUS_CLAIMS AS 
  SELECT * FROM
  (select POLICYNUMBER, 
  AGEOFPOLICYHOLDER, 
  SEX, 
  MARITALSTATUS,
  DEDUCTIBLE, 
  AGEOFVEHICLE, 
  WEEKOFMONTHCLAIMED, 
  BASEPOLICY, 
  ADDRESSCHANGECLAIM, 
  DAYSPOLICYCLAIM, 
  DRIVERRATING, 
  POLICEREPORTFILED, 
  PASTNUMBEROFCLAIMS, 
  WEEKOFMONTH, 
  WITNESSPRESENT, 
  DAYOFWEEKCLAIMED, 
  MONTHCLAIMED, 
  MAKE, 
  REPNUMBER, 
  NUMBEROFCARS, 
  DAYSPOLICYACCIDENT, 
  FAULT, 
  NUMBEROFSUPPLIMENTS, 
  ACCIDENTAREA, 
  VEHICLEPRICE, 
  VEHICLECATEGORY, 
  PREDICTIONDETAILS,
  AGENTTYPE, 
  round(prob_fraud*100,2) percent_fraud,
      rank() over (order by prob_fraud desc) rnk 
      from
  (select POLICYNUMBER, 
  AGEOFPOLICYHOLDER, 
  SEX, MARITALSTATUS, 
  DEDUCTIBLE, 
  AGEOFVEHICLE, 
  WEEKOFMONTHCLAIMED,
  BASEPOLICY, ADDRESSCHANGECLAIM, DAYSPOLICYCLAIM, DRIVERRATING, POLICEREPORTFILED, PASTNUMBEROFCLAIMS, WEEKOFMONTH, WITNESSPRESENT, DAYOFWEEKCLAIMED, MONTHCLAIMED, MAKE, REPNUMBER, NUMBEROFCARS, DAYSPOLICYACCIDENT, FAULT, NUMBEROFSUPPLIMENTS, ACCIDENTAREA, VEHICLEPRICE, VEHICLECATEGORY, AGENTTYPE, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud, PREDICTION_DETAILS("CLAIMSMODEL", '0', 5 ABS USING *) "PREDICTIONDETAILS" 
   from CLAIMS_UNSPV
  ))
order by percent_fraud desc;

In [19]:
%sql

SELECT RNK,PERCENT_FRAUD,PREDICTIONDETAILS,POLICYNUMBER FROM SUSPICIOUS_CLAIMS;

In [20]:
%sql

-- Parse the XML prediction details to show, for each policy number, the percent_fraud value and the top three attributes.

SELECT POLICYNUMBER,
    round(percent_fraud*100,2) percent_fraud,
    RTRIM(TRIM(SUBSTR(OUTPRED."Attribute1",17,100)),'rank="1"/>') FIRST_ATTRIBUTE,
    RTRIM(TRIM(SUBSTR(OUTPRED."Attribute2",17,100)),'rank="2"/>') SECOND_ATTRIBUTE,
    RTRIM(TRIM(SUBSTR(OUTPRED."Attribute3",17,100)),'rank="3"/>') THIRD_ATTRIBUTE
FROM (SELECT POLICYNUMBER,
     PREDICTION_PROBABILITY(CLAIMSMODEL, '0' USING *) percent_fraud,
     PREDICTION_DETAILS(CLAIMSMODEL USING *) PD
    FROM SUSPICIOUS_CLAIMS
    WHERE POLICYNUMBER < 100000
    ORDER BY POLICYNUMBER) OUT,
    XMLTABLE('/Details'
    PASSING OUT.PD
    COLUMNS 
    "Attribute1" XMLType PATH 'Attribute[1]',
    "Attribute2" XMLType PATH 'Attribute[2]',
    "Attribute3" XMLType PATH 'Attribute[3]') 
    OUTPRED
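
The query above extracts the top-ranked attributes from the prediction details with string functions. For comparison, here is a minimal Python sketch that parses the same kind of XML with the standard library. It assumes the details look like a `Details` element containing `Attribute` children with `name`, `actualValue`, `weight`, and `rank` attributes. The helper `parse_prediction_details` is hypothetical, and the sample XML values are invented for illustration, not taken from the model.

```python
import xml.etree.ElementTree as ET

def parse_prediction_details(xml_text, top_n=3):
    """Return the top_n attributes of a prediction-details XML string, ordered by rank."""
    root = ET.fromstring(xml_text)
    attributes = []
    for node in root.findall("Attribute"):
        attributes.append({
            "name": node.get("name"),
            "actual_value": node.get("actualValue"),
            "weight": float(node.get("weight")),
            "rank": int(node.get("rank")),
        })
    attributes.sort(key=lambda attribute: attribute["rank"])
    return attributes[:top_n]

# Invented sample; real strings come from PREDICTION_DETAILS in the database.
sample = (
    '<Details algorithm="Support Vector Machines" class="0">'
    '<Attribute name="DEDUCTIBLE" actualValue="700" weight=".25" rank="1"/>'
    '<Attribute name="WITNESSPRESENT" actualValue="Yes" weight=".18" rank="2"/>'
    '<Attribute name="NUMBEROFCARS" actualValue="5 to 8" weight=".11" rank="3"/>'
    '<Attribute name="SEX" actualValue="Female" weight=".02" rank="4"/>'
    '</Details>'
)
details = parse_prediction_details(sample)
for attribute in details:
    print(attribute["rank"], attribute["name"], attribute["actual_value"], attribute["weight"])
```

Parsing the XML by attribute name avoids depending on fixed character offsets, which can break when attribute names or values differ in length.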

# End of script
