# Vehicle Insurance Claim Fraud Detection
### The goals of this notebook are to:
1. Predict whether a vehicle insurance claim is fraudulent by training three algorithms (decision tree, random forest and logistic regression).
2. Calculate and compare the performance of the three algorithms.

## 1. Data Gathering

This notebook uses a vehicle insurance claim fraud dataset from [Kaggle](https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection) with 33 features and 15,420 entries. The features fall into the following groups:

| **Type** | **Features** |
|:---:|:---:|
| Vehicle | make, category, price, age |
| Customer | sex, age, marital-status, driver-rating, number-of-vehicles |
| Policy | policy-number, fault, type, deductible, days between events, police-report filed, witness present, agent-type, accident-area, address-change, base-policy, supplements |
| Label (binary: 1 if the claim was fraudulent, 0 otherwise) | fraud-found |


The dataset is large enough for this use case: it has more than 10 times as many rows as degrees of freedom. The share of null or missing values is acceptable (about 2%). The data is imbalanced, with only about 6% of claims labeled as fraudulent, but this is in line with insurance industry trends.
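As a rough sanity check of the figures quoted above, the following sketch redoes the arithmetic in plain Python. It uses only the numbers stated in the text (15,420 rows, 33 features, roughly 6% fraud). Treating the raw feature count as the degrees of freedom is a simplifying assumption, since one-hot encoding later expands the categorical columns.

```python
# Figures quoted in the text above
n_rows = 15420
n_features = 33
fraud_share = 0.06  # approximate share of fraudulent claims

# Rows available per raw feature (assumption: degrees of freedom ~ number of raw features)
rows_per_feature = n_rows / n_features

# Approximate class sizes implied by the 6% figure
approx_fraud_rows = round(n_rows * fraud_share)
approx_legit_rows = n_rows - approx_fraud_rows

print(f"rows per feature: {rows_per_feature:.0f}")
print(f"approx. fraudulent rows: {approx_fraud_rows}")
print(f"approx. non-fraudulent rows: {approx_legit_rows}")
```

Even with a much larger effective feature count after encoding, the ratio stays well above the 10x rule of thumb mentioned above.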

In [31]:
from pyspark.sql import SparkSession

# Initializing Spark session
spark = SparkSession\
        .builder\
        .appName("Vehicle Insurance Claim Fraud Detection")\
        .getOrCreate()
print('Spark session is initialized.')

# Load the vehicle insurance claim fraud dataset from the csv file and create a dataframe
df = spark.read.format("csv").option("header", "true").load("../dataset/vehicle_insurance_claim_fraud_data.csv")
print('Spark dataframe is created from the dataset.')

Spark session is initialized.
Spark dataframe is created from the dataset.


## 2. Data Preparation

For data preparation, the following steps were performed to improve the quality of the data:
1. The `Age` column was inspected: it has the value 0 in 2.07% of the rows.
2. To impute these rows, a new column `Imputed_Age` was added to the dataframe. It copies `Age` when the value is not 0 and otherwise uses the average of the age range given in the `AgeOfPolicyHolder` column.
3. All numerical features were converted to double type for uniformity.
4. The dataset was split into training and test sets using a random split (80/20 ratio).
5. One-hot encoding was performed for the categorical features.
6. The features were extracted and assembled into feature vectors for the training and test sets.
7. A balanced training set was created by undersampling the majority (non-fraud) class.
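The age imputation in step 2 can be illustrated in plain Python. This is a minimal sketch, not the notebook's Spark code: the helper names `range_midpoint` and `impute_age` are hypothetical, and it assumes the range labels have the form `"<low> to <high>"` (for example `"16 to 17"`).

```python
def range_midpoint(label):
    """Return the midpoint of an age-range label such as '16 to 17'.

    Assumption: ranges are written '<low> to <high>'. Labels that do not
    match this pattern (for example an open-ended 'over 65') return None.
    """
    parts = label.split(" ")
    if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
        return (int(parts[0]) + int(parts[2])) / 2
    return None


def impute_age(age, age_range_label):
    """Keep a non-zero age; otherwise fall back to the range midpoint."""
    return float(age) if age != 0 else range_midpoint(age_range_label)


print(impute_age(0, "16 to 17"))
print(impute_age(34, "31 to 35"))
```

Returning `None` for labels that are not a closed range keeps the fallback explicit instead of silently guessing a value.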

In [34]:
from pyspark.sql.functions import col, rand, split, when
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.types import DoubleType

# Add an 'Imputed_Age' column that copies 'Age' when it is not 0, or otherwise takes the average of the
# range in 'AgeOfPolicyHolder' (assumes labels of the form "<low> to <high>", e.g. "16 to 17")
age_range = split(col("AgeOfPolicyHolder"), " ")
df = df.withColumn("Imputed_Age", when(col("Age") == 0, (age_range[0].cast("int") + 
    age_range[2].cast("int")) / 2).otherwise(col("Age")))

# Convert the columns having numerical values to Double type
numerical_cols = ['WeekOfMonth', 'WeekOfMonthClaimed', 'Age', 'Imputed_Age', 'FraudFound_P', 'RepNumber', 'Deductible', 
                  'DriverRating', 'Year']
# The loop variable is named 'c' so that it does not shadow pyspark.sql.functions.col
for c in numerical_cols:
    df = df.withColumn(c, df[c].cast(DoubleType()))

# Divide the dataset into training and test sets
training, test = df.randomSplit([0.8, 0.2], seed=42)

# Create a list of the columns having categorical values
categorical_cols = ['Month', 'DayOfWeek', 'Make', 'AccidentArea', 'DayOfWeekClaimed', 'MonthClaimed', 'Sex', 'MaritalStatus', 
                    'Fault', 'PolicyType', 'VehicleCategory', 'VehiclePrice', 'Days_Policy_Accident', 'Days_Policy_Claim', 
                    'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder', 'PoliceReportFiled', 'WitnessPresent', 
                    'AgentType', 'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'BasePolicy']

# Perform one-hot encoding for categorical features
indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed", handleInvalid="keep") for col in categorical_cols]
encoders = [OneHotEncoder(inputCol=col+"_indexed", outputCol=col+"_encoded") for col in categorical_cols]
pipeline = Pipeline(stages=indexers + encoders)
pipelineModel = pipeline.fit(training)
training_encoded = pipelineModel.transform(training).drop(*categorical_cols, *["PolicyNumber"])
test_encoded = pipelineModel.transform(test).drop(*categorical_cols, *["PolicyNumber"])

# Extracting the features and removing the 'FraudFound_P' column
features = training_encoded.columns
features.remove('FraudFound_P')

# Assembles the columns into a feature vector for the training and test sets
assembler = VectorAssembler(inputCols=features, outputCol="features", handleInvalid="keep")
train_df = assembler.transform(training_encoded).select("features", "FraudFound_P")
test_df = assembler.transform(test_encoded).select("features", "FraudFound_P")

# Count the number of records with FraudFound_P value as 0 and 1 in the training set
count_0 = train_df.filter(train_df["FraudFound_P"] == 0).count()
count_1 = train_df.filter(train_df["FraudFound_P"] == 1).count()

# Calculate the ratio of FraudFound_P values 0 and 1
fraud_ratio = count_1 / count_0

# Create separate DataFrames for minority and majority class
minority_df = train_df.filter(train_df["FraudFound_P"] == 1)
majority_df = train_df.filter(train_df["FraudFound_P"] == 0)

# Sample the majority class DataFrame to balance the dataset
balanced_majority_df = majority_df.sample(fraction=fraud_ratio, seed=42)

# Combine the minority class DataFrame and the balanced majority class DataFrame
balanced_train_df = minority_df.unionAll(balanced_majority_df)

# Shuffle the rows of the balanced DataFrame
balanced_train_df = balanced_train_df.orderBy(rand())

print('Data is prepared.')

Data is prepared.
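The balancing step at the end of the cell above undersamples the majority class. The idea can be sketched in plain Python on hypothetical toy rows; the function name `undersample_majority` and the `label` key are assumptions made for this illustration, and label 1 is assumed to be the minority (fraud) class.

```python
import random


def undersample_majority(rows, label_key="label", seed=42):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumption: label 1 is the minority class and label 0 the majority class.
    """
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    rng = random.Random(seed)
    # Draw exactly len(minority) majority rows without replacement
    kept_majority = rng.sample(majority, k=min(len(minority), len(majority)))
    balanced = minority + kept_majority
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 6 fraudulent rows out of 100
rows = [{"id": i, "label": 1 if i < 6 else 0} for i in range(100)]
balanced = undersample_majority(rows)
print(len(balanced), sum(r["label"] for r in balanced))
```

Unlike this exact draw, Spark's `DataFrame.sample(fraction=...)` decides row by row, so the two class sizes in the notebook's balanced set are only approximately equal.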


## 3. Model Training

For this use case, model understandability is a major concern, as it is in any competitive market such as insurance. Hence, the models were chosen to be white-box, simple to understand and able to generate workable insights for upselling. For example, if the model shows that individuals who have a voluntary deductible of \$1000 do not commit fraud, customers willing to pay that deductible can be rewarded with better insurance premiums.

### 3.1 Decision Tree Classifier
Decision trees are the simplest of the three models and the easiest to understand. To prevent overfitting, the model was trained using techniques such as stratified sampling and balanced sampling. Fitting was performed over multiple hyperparameter settings for `maxDepth` and `minInstancesPerNode`.

In [35]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder

# Define the decision tree classifier
dt = DecisionTreeClassifier(labelCol="FraudFound_P", featuresCol="features")

# Define the parameter grid for cross-validation
paramGrid_dt = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [2, 5, 10]) \
    .addGrid(dt.minInstancesPerNode, [1, 5, 10]) \
    .build()

# Define the evaluator as weighted precision for the classification
evaluator_dt = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="FraudFound_P",
                                              metricName="weightedPrecision")

print('Decision Tree Classifier, ParamGrid and Precision Evaluator is defined.')

Decision Tree Classifier, ParamGrid and Precision Evaluator is defined.


#### 3.1.1 Data Sampling
1. Simple Random Sampling
2. Stratified Sampling
3. Balanced Sampling
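To make the difference between simple random and stratified folds concrete, here is a minimal plain-Python sketch of stratified fold assignment. It is an illustration of the idea only and not the implementation of the `StratifiedCrossValidator` imported in the next cell; the function name `stratified_fold_ids` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_fold_ids(labels, num_folds):
    """Assign each row a fold id so that every fold keeps roughly the overall class mix."""
    fold_ids = [None] * len(labels)
    seen_per_class = defaultdict(int)
    for i, label in enumerate(labels):
        # Deal the rows of each class round-robin across the folds
        fold_ids[i] = seen_per_class[label] % num_folds
        seen_per_class[label] += 1
    return fold_ids


# Hypothetical labels: 10 fraudulent rows among 100
labels = [1] * 10 + [0] * 90
folds = stratified_fold_ids(labels, num_folds=5)
positives_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
print(sorted(positives_per_fold.items()))
```

With a rare positive class, purely random folds can end up with very few fraudulent rows in some folds, which makes per-fold metrics noisy. In practice the rows would also be shuffled before being dealt into folds.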

In [36]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from stratifier import StratifiedCrossValidator

# Define the k-fold cross-validator
cv_dt = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid_dt, evaluator=evaluator_dt, numFolds=5)

# Define the stratified k-fold cross-validator
scv_dt = StratifiedCrossValidator(estimator=dt, estimatorParamMaps=paramGrid_dt, evaluator=evaluator_dt, numFolds=5)

# Run the Decision Tree model on the training set
random_model_dt = cv_dt.fit(train_df)

# Run the Decision Tree model using stratified k-fold cross-validation on the training set
stratified_model_dt = scv_dt.fit(train_df)

# Run the Decision Tree model on the balanced training set
balanced_model_dt = cv_dt.fit(balanced_train_df)

#### 3.1.2 Decision Tree Model Comparison

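The following comparison metrics were used for model evaluation. Precision, recall and F1 are computed by `MulticlassClassificationEvaluator` as averages over both classes, weighted by class frequency:
1. Precision (the primary metric, since it is most important to identify fraudsters correctly)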
2. Recall
3. F1 Score
4. Accuracy
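For reference, the four metrics can be written out for the fraud class from the counts of a binary confusion matrix. This is a plain-Python sketch with hypothetical counts; the function name `fraud_class_metrics` is an assumption. It reports the unweighted, fraud-class values, which differ from the class-weighted averages that the notebook's evaluator returns.

```python
def fraud_class_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (for the fraud class) and overall accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical confusion counts for 1000 test claims, 60 of them fraudulent
precision, recall, f1, accuracy = fraud_class_metrics(tp=30, fp=10, fn=30, tn=930)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In this toy case accuracy is high even though only half of the fraudulent claims are caught, which is why accuracy alone is a weak guide on imbalanced data.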

In [48]:
# Get the predictions on the test set from Random Sampling
predictions_random_model = random_model_dt.transform(test_df)

# Get the predictions on the test set from Stratified Sampling
predictions_stratified_model = stratified_model_dt.transform(test_df)

# Get the predictions on the test set from Balanced Sampling
predictions_balanced_model = balanced_model_dt.transform(test_df)

# Set the evaluator metric to Precision
evaluator_dt.setMetricName('weightedPrecision')

# Precision for the decision tree model on Random Sampling
precision_random_model = evaluator_dt.evaluate(predictions_random_model)

# Precision for the decision tree model on Stratified Sampling
precision_stratified_model = evaluator_dt.evaluate(predictions_stratified_model)

# Precision for the decision tree model on Balanced Sampling
precision_balanced_model = evaluator_dt.evaluate(predictions_balanced_model)

# Set the evaluator metric to Recall
evaluator_dt.setMetricName('weightedRecall')

# Recall for the decision tree model on Random Sampling
recall_random_model = evaluator_dt.evaluate(predictions_random_model)

# Recall for the decision tree model on Stratified Sampling
recall_stratified_model = evaluator_dt.evaluate(predictions_stratified_model)

# Recall for the decision tree model on Balanced Sampling
recall_balanced_model = evaluator_dt.evaluate(predictions_balanced_model)

# Set the evaluator metric to F1
evaluator_dt.setMetricName('f1')

# F1 Score for the decision tree model on Random Sampling
f1_score_random_model = evaluator_dt.evaluate(predictions_random_model)

# F1 Score for the decision tree model on Stratified Sampling
f1_score_stratified_model = evaluator_dt.evaluate(predictions_stratified_model)

# F1 Score for the decision tree model on Balanced Sampling
f1_score_balanced_model = evaluator_dt.evaluate(predictions_balanced_model)

# Set the evaluator metric to Accuracy
evaluator_dt.setMetricName('accuracy')

# Accuracy for the decision tree model on Random Sampling
accuracy_random_model = evaluator_dt.evaluate(predictions_random_model)

# Accuracy for the decision tree model on Stratified Sampling
accuracy_stratified_model = evaluator_dt.evaluate(predictions_stratified_model)

# Accuracy for the decision tree model on Balanced Sampling
accuracy_balanced_model = evaluator_dt.evaluate(predictions_balanced_model)

print('Metrics are evaluated.')

Metrics are evaluated.


In [49]:
from IPython.display import Markdown, display

table = "| **Type of sampling** | **Precision** | **Recall** | **F1 Score** | **Accuracy** |\n" + \
                 "|:---:|:---:|:---:|:---:|:---:|\n" + \
                 "| Random | {:.2f} | {:.2f} | {:.2f} | {:.2f} |\n" + \
                 "| Stratified | {:.2f} | {:.2f} | {:.2f} | {:.2f} |\n" + \
                 "| Balanced | {:.2f} | {:.2f} | {:.2f} | {:.2f} |"

display(Markdown(table.format(precision_random_model, recall_random_model, 
                 f1_score_random_model, accuracy_random_model, precision_stratified_model, recall_stratified_model,
                 f1_score_stratified_model, accuracy_stratified_model, precision_balanced_model, recall_balanced_model,
                 f1_score_balanced_model, accuracy_balanced_model)))

| **Type of sampling** | **Precision** | **Recall** | **F1 Score** | **Accuracy** |
|:---:|:---:|:---:|:---:|:---:|
| Random | 0.94 | 0.95 | 0.92 | 0.95 |
| Stratified | 0.94 | 0.95 | 0.92 | 0.95 |
| Balanced | 0.95 | 0.59 | 0.70 | 0.59 |
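In the table above, the Recall and Accuracy columns coincide in every row. That is expected when recall is averaged over classes with weights equal to the class frequencies, as `weightedRecall` does: the weighted sum reduces to the share of correctly classified rows. The plain-Python sketch below shows this on hypothetical counts for a model that never predicts fraud; the function name `weighted_metrics` is an assumption.

```python
def weighted_metrics(tp, fp, fn, tn):
    """Support-weighted precision and recall over both classes, plus accuracy."""
    total = tp + fp + fn + tn
    n_pos, n_neg = tp + fn, tn + fp  # class supports (true label counts)
    # Per-class precision and recall; a class that is never predicted gets precision 0
    prec_pos = tp / (tp + fp) if (tp + fp) else 0.0
    prec_neg = tn / (tn + fn) if (tn + fn) else 0.0
    rec_pos = tp / n_pos if n_pos else 0.0
    rec_neg = tn / n_neg if n_neg else 0.0
    w_precision = (n_pos * prec_pos + n_neg * prec_neg) / total
    w_recall = (n_pos * rec_pos + n_neg * rec_neg) / total
    accuracy = (tp + tn) / total
    return w_precision, w_recall, accuracy


# Hypothetical model that never predicts fraud on 1000 claims, 60 of them fraudulent
w_precision, w_recall, accuracy = weighted_metrics(tp=0, fp=0, fn=60, tn=940)
print(f"weighted precision={w_precision:.2f} weighted recall={w_recall:.2f} accuracy={accuracy:.2f}")
```

Because the weighted scores are dominated by the majority class, a model can score highly on them while detecting no fraud at all. Reporting the fraud-class precision and recall alongside the weighted values gives a clearer picture.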

### 3.2 Random Forest Classifier

The random forest model was trained using the same techniques, stratified sampling and balanced sampling. Fitting was performed over multiple hyperparameter settings for `numTrees`, `maxDepth`, `minInstancesPerNode` and `minInfoGain`.

In [None]:
from pyspark.ml.classification import RandomForestClassifier

# Define the random forest classifier
rf = RandomForestClassifier(labelCol="FraudFound_P", featuresCol="features")
print('Random forest is defined.')

# Define the parameter grid for cross-validation
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100]) \
    .addGrid(rf.maxDepth, [2, 5, 10]) \
    .addGrid(rf.minInstancesPerNode, [1, 5, 10]) \
    .addGrid(rf.minInfoGain, [0.0, 0.05]) \
    .build()
print('Parameter grid for cross-validation is defined.')

# Define the evaluator as weighted precision for the classification
evaluator_rf1 = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="FraudFound_P", 
                      metricName="weightedPrecision")
print('Precision evaluator is defined.')

# Define the k-fold cross-validator and run the Random Forest model on the training set
cv_rf1 = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid_rf, evaluator=evaluator_rf1, numFolds=5, seed=42)
cvModel_rf1 = cv_rf1.fit(train_df)
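The cost of the grid search above grows quickly with the number of grid values. The plain-Python sketch below counts the candidate models and fits for the random forest grid, using the values from the cell above. It assumes the usual k-fold procedure in which every candidate is fitted once per fold and the best candidate is then refitted on the full training set.

```python
from itertools import product

# Grid values copied from the random forest cell above
grid = {
    "numTrees": [50, 100],
    "maxDepth": [2, 5, 10],
    "minInstancesPerNode": [1, 5, 10],
    "minInfoGain": [0.0, 0.05],
}
num_folds = 5

# Every combination of grid values is one candidate model
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each candidate is fitted once per fold, then the best one is refitted on all training data
total_fits = len(param_maps) * num_folds + 1

print(len(param_maps), total_fits)
```

The same count applies to the decision tree and logistic regression grids if their values are substituted, which helps when estimating how long a cell will run.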

#### 3.2.1 Data Sampling
1. Simple Random Sampling
2. Stratified Sampling
3. Balanced Sampling

#### Performance of the model is evaluated.

#### 3.2.2 Random Forest Classifier with Stratified k-fold Cross-Validation

The model is trained with the random forest classifier on the training dataset, using precision as the evaluation metric and stratified k-fold cross-validation.

#### The important features are obtained from the trained model.

#### Performance of the model is evaluated.

#### 3.2.3 Random Forest Classifier with Stratified k-fold Cross-Validation on the Balanced Dataset

The model is trained with the random forest classifier, using precision as the evaluation metric and stratified k-fold cross-validation. The training data is the balanced dataset obtained by undersampling the majority class.

#### The important features are obtained from the trained model.

#### Performance of the model is evaluated.

### 3.3 Logistic Regression with k-fold Cross-Validation

The model is trained with logistic regression on the training dataset using k-fold cross-validation. The evaluator is a `BinaryClassificationEvaluator`, whose default metric is the area under the ROC curve rather than precision.
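The grid in the logistic regression cell tunes `regParam` and `elasticNetParam`. In Spark ML's linear methods these correspond to the overall regularization strength and the L1/L2 mixing weight of the elastic net penalty. The plain-Python sketch below evaluates that penalty for a hypothetical coefficient vector; the function name `elastic_net_penalty` is an assumption made for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic net penalty: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared)


weights = [0.5, -1.0, 0.0, 2.0]  # hypothetical coefficients

# elastic_net_param = 0 gives a pure L2 (ridge) penalty, 1 gives a pure L1 (lasso) penalty
ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)
print(ridge, lasso, mixed)
```

Larger `elasticNetParam` values push more coefficients to exactly zero, which can help interpretability when many one-hot encoded features are present.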

In [None]:
import numpy as np
from pyspark.ml.classification import LogisticRegression

# Define the logistic regression classifier
lr = LogisticRegression(labelCol="FraudFound_P", featuresCol="features", maxIter=10)

# Define the parameter grid for cross-validation
paramGrid_lr = ParamGridBuilder() \
    .addGrid(lr.regParam, np.linspace(0.3, 0.01, 10)) \
    .addGrid(lr.elasticNetParam, np.linspace(0.3, 0.8, 6)) \
    .build()

# Define the evaluator; BinaryClassificationEvaluator defaults to areaUnderROC, not precision
evaluator = BinaryClassificationEvaluator(labelCol="FraudFound_P")
print('Area under ROC evaluator is defined.')

crossval_lr = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid_lr,
                          evaluator=evaluator,
                          numFolds= 5)  
cvModel_lr = crossval_lr.fit(train_df)
predictions = cvModel_lr.transform(test_df)
metricValue = evaluator.evaluate(predictions)
print("Metric value = %s" % metricValue)
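The metric printed by the cell above is the area under the ROC curve, the default of `BinaryClassificationEvaluator`. One way to read it is as the probability that a randomly chosen fraudulent claim receives a higher score than a randomly chosen legitimate one. The plain-Python sketch below computes it that way on hypothetical labels and scores. It is a conceptual illustration, not how Spark computes the value internally.

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted fraud scores
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(area_under_roc(labels, scores))
```

Because this value depends only on how the scores rank the claims, it is not directly comparable with the weighted precision used for the tree-based models.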

#### The important features are obtained from the trained model.