# Fraudulent Claim on Car Insurance

## 1. Business Understanding

**Overview:** Our primary objective is to gain a comprehensive understanding of the business problem at hand, the associated objectives, and the specific requirements involved in predicting fraudulent claims related to car physical damage. It is essential to establish a clear context for our data mining efforts.

**Business Problem:** Our organization operates within the car insurance industry, offering coverage for various types of car damages, including physical damage incurred due to accidents, collisions, or other incidents. However, a significant concern that the insurance industry faces is the submission of fraudulent claims. These fraudulent claims result in substantial financial losses and undermine the trust of our honest policyholders.

**Objectives:** Our main objectives for this project are as follows:
- Develop a predictive model to identify claims that are likely to be fraudulent.
- Minimize financial losses caused by processing fraudulent claims.
- Enhance the efficiency of claim processing by directing resources towards legitimate claims.

**Definition of a "Fraudulent" Claim:** To move forward, it's crucial to establish a clear definition of what we consider a "fraudulent" claim in the context of car physical damage. A claim can be considered fraudulent when it involves any of the following activities:
- Deliberate misrepresentation of facts related to the incident leading to the damage.
- Staging accidents or collisions to make a claim.
- Falsifying documentation or evidence to support the claim.
- Engaging in any form of deceit or illegal activity to obtain compensation.

**Specific Requirements:** In order to address the business problem and objectives effectively, we need to define the specific requirements for this project. These requirements may include the following:
- Access to historical data of car insurance claims, including both legitimate and potentially fraudulent cases.
- Data quality assessment, including data cleaning and handling of missing values.
- Feature engineering to create relevant features that can help in fraud detection.
- Building and evaluating predictive models to identify fraudulent claims.
- Interpretation of model results and identification of important features.
- Recommendations for actions and preventive measures based on the model findings.
- Collaboration with the claim processing department to integrate the model into their workflow for real-time fraud detection and claim verification.

## 2. Data Understanding

### 1.Import the required packages 

The provided dataset appears to contain information related to insurance claims. Here is a summary of the columns in the dataset and their potential meaning:

- **claim_number:** A unique identifier for each insurance claim.
- **age_of_driver:** The age of the driver involved in the claim.
- **gender:** The gender of the driver (M for male, F for female).
- **marital_status:** The marital status of the driver (0 for unmarried, 1 for married).
- **safety_rating:** A safety rating associated with the driver.
- **annual_income:** The annual income of the driver.
- **high_education_ind:** Indicates whether the driver has a high level of education (1 for yes, 0 for no).
- **address_change_ind:** Indicates whether the driver changed their address (1 for yes, 0 for no).
- **living_status:** The living status of the driver (e.g., "Own," "Rent").
- **zip_code:** The ZIP code of the driver's residence.
- **claim_date:** The date on which the insurance claim was filed.
- **claim_day_of_week:** The day of the week when the claim was filed.
- **accident_site:** The location where the accident occurred (e.g., "Local," "Highway").
- **past_num_of_claims:** The number of past claims filed by the driver.
- **witness_present_ind:** Indicates whether there were witnesses present at the time of the accident (1 for yes, 0 for no).
- **liab_prct:** A percentage related to liability (e.g., insurance coverage).
- **channel:** The channel through which the claim was reported (e.g., "Broker," "Phone").
- **policy_report_filed_ind:** Indicates whether a policy report was filed (1 for yes, 0 for no).
- **claim_est_payout:** The estimated payout amount for the claim.
- **age_of_vehicle:** The age of the vehicle involved in the claim.
- **vehicle_category:** The category of the vehicle (e.g., "Compact," "Large").
- **vehicle_price:** The price of the vehicle.
- **vehicle_color:** The color of the vehicle.
- **vehicle_weight:** The weight of the vehicle.
- **fraud:** Indicates whether the claim is related to fraud (1 for yes, 0 for no).

In [126]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

### 2. Import the available data

In [106]:
df = pd.read_csv("Data/fraudulentclaim.csv")

In [107]:
#Checking the first five records of the dataset
df.head()

Unnamed: 0,claim_number,age_of_driver,gender,marital_status,safty_rating,annual_income,high_education_ind,address_change_ind,living_status,zip_code,...,liab_prct,channel,policy_report_filed_ind,claim_est_payout,age_of_vehicle,vehicle_category,vehicle_price,vehicle_color,vehicle_weight,fraud
0,1,46,M,1.0,85,38301,1,1,Rent,80006,...,74,Broker,0,7530.940993,9.0,Compact,12885.45235,white,16161.33381,0
1,3,21,F,0.0,75,30445,0,1,Rent,15021,...,79,Online,0,2966.024895,4.0,Large,29429.45218,white,28691.96422,0
2,4,49,F,0.0,87,38923,0,1,Own,20158,...,0,Broker,0,6283.888333,3.0,Compact,21701.18195,white,22090.94758,1
3,5,58,F,1.0,58,40605,1,0,Own,15024,...,99,Broker,1,6169.747994,4.0,Medium,13198.27344,other,38329.58106,1
4,6,38,M,1.0,95,36380,1,0,Rent,50034,...,7,Broker,0,4541.38715,7.0,Medium,38060.21122,gray,25876.56319,0


In [108]:
# To ascertain the number roows and columns
df.shape

(17998, 25)

There are 17,998 records across 25 columns or fields or variables.

In [109]:
#The snapshot of the dataset is as follows
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17998 entries, 0 to 17997
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   claim_number             17998 non-null  int64  
 1   age_of_driver            17998 non-null  int64  
 2   gender                   17998 non-null  object 
 3   marital_status           17993 non-null  float64
 4   safty_rating             17998 non-null  int64  
 5   annual_income            17998 non-null  int64  
 6   high_education_ind       17998 non-null  int64  
 7   address_change_ind       17998 non-null  int64  
 8   living_status            17998 non-null  object 
 9   zip_code                 17998 non-null  int64  
 10  claim_date               17998 non-null  object 
 11  claim_day_of_week        17998 non-null  object 
 12  accident_site            17998 non-null  object 
 13  past_num_of_claims       17998 non-null  int64  
 14  witness_present_ind   

## 3. Data Preparation

Prepare the data for model building. This involves data cleaning, feature engineering, and splitting the data into training and testing sets.

### 1. Data Cleaning

In [110]:
# Handle missing values if any
df.dropna(inplace=True)

In [111]:
print(df.columns[df.columns.duplicated()])

Index([], dtype='object')


In [112]:
df['claim_date'] = pd.to_datetime(df['claim_date'])

In [113]:
# Convert the datetime column to separate numerical features
df['claim_year'] = df['claim_date'].dt.year
df['claim_month'] = df['claim_date'].dt.month
df['claim_day'] = df['claim_date'].dt.day
df['claim_hour'] = df['claim_date'].dt.hour
df['claim_minute'] = df['claim_date'].dt.minute
df['claim_second'] = df['claim_date'].dt.second

# Drop the original 'claim_date' column
df.drop('claim_date', axis=1, inplace=True)

In [114]:
# Convert categorical variables to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['gender', 'marital_status','address_change_ind', 'living_status', 'address_change_ind', 'living_status', 'zip_code', 'claim_day_of_week', 'accident_site', 'witness_present_ind', 'channel', 'policy_report_filed_ind', 'vehicle_color', 'vehicle_category'])

In [115]:
scaler = StandardScaler()
columns_to_scale = ['age_of_driver', 'safty_rating', 'annual_income', 'claim_year', 'claim_month', 'claim_day', 'claim_hour', 'claim_minute', 'claim_second', 'past_num_of_claims', 'liab_prct', 'claim_est_payout', 'age_of_vehicle', 'vehicle_price', 'vehicle_weight']
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

In [116]:
df.rename(columns={'address_change_ind_0': 'address_change_indusA'}, inplace=True)

In [117]:
df.rename(columns={'address_change_ind_1': 'address_change_ind_b'}, inplace=True)

In [118]:
df.rename(columns={'living_status_Own': 'living_status_Own_a'}, inplace=True)

In [119]:
df.rename(columns={'living_status_Rent': 'living_status_Rent_a'}, inplace=True)

### 2. Split the data into training and testing sets

You might want to create additional features that can help the model detect fraud more effectively. For example, you can calculate the age of the car from its manufacturing date.

In [120]:
X = df.drop('fraud', axis=1)
y = df['fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [121]:
print(df.columns[df.columns.duplicated()])

Index(['address_change_indusA', 'address_change_ind_b', 'living_status_Own_a',
       'living_status_Rent_a'],
      dtype='object')


### Model Development

To select the best model for predicting fraudulent claims on car physical damage and identify important features, you can train multiple machine learning models and evaluate their performance. We'll train six different models and use feature importance techniques to identify key features. Finally, we'll advise the data owner based on the selected features. Here are the steps:

We will train six different classification models and evaluate their performance. The models we'll consider are:

1. Random Forest Classifier
2. Gradient Boosting Classifier
3. Logistic Regression
4. Support Vector Machine (SVM)
5. K-Nearest Neighbors (KNN)
6. XGBoost Classifier

In [127]:
# Create models
rf_model = RandomForestClassifier()
gb_model = GradientBoostingClassifier()
lr_model = LogisticRegression()
svm_model = SVC()
knn_model = KNeighborsClassifier()

In [128]:
# Train the models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)

In [129]:
# Make predictions
rf_preds = rf_model.predict(X_test)
gb_preds = gb_model.predict(X_test)
lr_preds = lr_model.predict(X_test)
svm_preds = svm_model.predict(X_test)
knn_preds = knn_model.predict(X_test)

In [130]:
# Evaluate models
rf_accuracy = accuracy_score(y_test, rf_preds)
gb_accuracy = accuracy_score(y_test, gb_preds)
lr_accuracy = accuracy_score(y_test, lr_preds)
svm_accuracy = accuracy_score(y_test, svm_preds)
knn_accuracy = accuracy_score(y_test, knn_preds)

### Model Evaluation

- Evaluate model performance using accuracy or other relevant metrics.
- Perform model tuning to improve performance.

In [131]:
# Model tuning for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
}

In [132]:
rf_grid_search = GridSearchCV(rf_model, param_grid, cv=3)
rf_grid_search.fit(X_train, y_train)
best_rf_model = rf_grid_search.best_estimator_
best_rf_accuracy = accuracy_score(y_test, best_rf_model.predict(X_test))

In [133]:
gb_grid_search = GridSearchCV(gb_model, param_grid, cv=3)
gb_grid_search.fit(X_train, y_train)
best_gb_model = gb_grid_search.best_estimator_
best_gb_accuracy = accuracy_score(y_test, best_gb_model.predict(X_test))

In [134]:
lr_grid_search = GridSearchCV(lr_model, param_grid, cv=3)
lr_grid_search.fit(X_train, y_train)
best_lr_model = lr_grid_search.best_estimator_
best_lr_accuracy = accuracy_score(y_test, best_lr_model.predict(X_test))

ValueError: Invalid parameter 'max_depth' for estimator LogisticRegression(). Valid parameters are: ['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'].

In [None]:
svm_grid_search = GridSearchCV(svm_model, param_grid, cv=3)
svm_grid_search.fit(X_train, y_train)
best_svm_model = svm_grid_search.best_estimator_
best_svm_accuracy = accuracy_score(y_test, best_svm_model.predict(X_test))

In [None]:
knn_grid_search = GridSearchCV(knn_model, param_grid, cv=3)
knn_grid_search.fit(X_train, y_train)
best_knn_model = knn_grid_search.best_estimator_
best_knn_accuracy = accuracy_score(y_test, best_knn_model.predict(X_test))