# Fraud Detection Project

## Previous Notebooks

- [EDA](1-EDA.ipynb)
- [Network Analysis](2.1-Network.ipynb)
- [Lawyers' Network Analysis](2.2-Network-Lawyers.ipynb)
- [Witnesses' Network Analysis](2.3-Network-Witnesses.ipynb)

In [1]:
import numpy as np
import pandas as pd

## Dataset Construction

In this notebook I build the dataset that I will feed into the random forest. It combines the claim scores obtained from the network analyses with the following features:

- claim filing month
- claim profile
- province of the accident
- difference between accident date and filing date
- outgoing money provisions.

As possible target variables I use the company score and a binary fraud indicator derived from it: a claim is flagged as fraud if its company score is greater than 50.
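
As a minimal sketch of that thresholding rule, assuming a toy `company_score` column (the values below are invented for illustration), the comparison is strict, so a score of exactly 50 is not flagged:

```python
import pandas as pd

# Toy scores, invented for illustration only
toy = pd.DataFrame({'company_score': [10.0, 50.0, 50.5, 99.0]})

# Strictly greater than 50 counts as fraud; exactly 50 does not
toy['fraud_indicator'] = (toy['company_score'] > 50).astype(int)
print(toy)
```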

In [2]:
# claims data
claims = pd.read_csv('../data/raw/claims.csv', sep=';', na_values='', keep_default_na=False)
claims['filing_date'] = pd.to_datetime(claims['filing_date'])
claims['accident_date'] = pd.to_datetime(claims['accident_date'])
claims['policy_start_date'] = pd.to_datetime(claims['policy_start_date'])
claims['policy_end_date'] = pd.to_datetime(claims['policy_end_date'])
# setting index and dropping undesired columns
claims.set_index('claim_code', inplace=True)
claims.drop(['crash'], inplace=True, axis=1)
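
The `na_values=''` and `keep_default_na=False` arguments restrict missing-value detection to empty fields. A small sketch of the difference on an invented CSV follows; that literal codes such as `NA` occur in the real data is an assumption here, not something shown in this notebook:

```python
import io
import pandas as pd

# Toy CSV, invented for illustration: row 2 holds the literal string 'NA',
# row 3 has an empty field
raw = "claim_code;accident_province\n1;MI\n2;NA\n3;\n"

# Default parsing treats both 'NA' and the empty field as missing
default = pd.read_csv(io.StringIO(raw), sep=';')

# Same options as the cell above: only the empty field is missing
strict = pd.read_csv(io.StringIO(raw), sep=';', na_values='', keep_default_na=False)

print(default['accident_province'].isna().tolist())
print(strict['accident_province'].isna().tolist())
```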

In [3]:
# dropping rows with any missing value
claims.dropna(inplace=True)

In [4]:
# columns derived from claim_profile
claims['card'] = claims['claim_profile'].apply(lambda x: 1 if 'CARD' in x else 0)
claims['injury'] = claims['claim_profile'].apply(lambda x: 1 if 'L' in x else 0)
claims['rca'] = claims['claim_profile'].apply(lambda x: 1 if 'RCA' in x else 0)
claims['n_signatures'] = claims['claim_profile'].apply(lambda x: 1 if '1' in x else 2 if '2' in x else 0)
# difference between accident and filing + filing month
claims['filing_diff'] = (claims['filing_date'] - claims['accident_date']).dt.days
claims['filing_month'] = claims['filing_date'].dt.month
claims.drop(['filing_date', 'accident_date', 'policy_start_date', 'policy_end_date', 'claim_profile'], axis=1, inplace=True)
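
To make the two date-derived features concrete, here is a minimal sketch on invented dates (the `toy` frame is hypothetical; only the two expressions mirror the cell above):

```python
import pandas as pd

# Toy dates, invented for illustration only
toy = pd.DataFrame({
    'accident_date': pd.to_datetime(['2016-12-28', '2017-03-01']),
    'filing_date': pd.to_datetime(['2017-01-02', '2017-03-01']),
})

# Whole days elapsed between the accident and the filing
toy['filing_diff'] = (toy['filing_date'] - toy['accident_date']).dt.days
# Calendar month (1-12) in which the claim was filed
toy['filing_month'] = toy['filing_date'].dt.month
print(toy[['filing_diff', 'filing_month']])
```

A claim filed on the day of the accident gets `filing_diff = 0`, and a filing that crosses the year boundary still yields a positive day count.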

In [5]:
# adding company scoring
assessments = pd.read_csv('../data/raw/antifraud_assessments.csv', sep=';')
claims = claims.merge(assessments, how='inner', left_index=True, right_on='claim_code').drop(['fraud_evaluation', 'ivass_score'], axis=1).set_index('claim_code')

In [6]:
# adding network scores
# I'm using 0 to fill missing values because if a claim isn't in the network then it's fair that its score is 0
scores = pd.read_pickle('../data/interim/network_scores.pkl')
claims = claims.merge(scores, how='left', left_index=True, right_on='claim_code').set_index('claim_code')
claims['score'] = claims['score'].fillna(0)
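
The merge-then-fill pattern used for the network scores can be sketched on toy data as follows. The frames `claims_toy` and `scores_toy` and their values are invented; only the merge and `fillna` calls mirror the cell above:

```python
import pandas as pd

# Toy data, invented for illustration: claim 102 has no network score
claims_toy = pd.DataFrame(
    {'n_people': [2, 1, 3]},
    index=pd.Index([101, 102, 103], name='claim_code'),
)
scores_toy = pd.DataFrame({'claim_code': [101, 103], 'score': [0.7, 0.2]})

# A left merge keeps every claim; claims absent from the network get NaN
merged = (
    claims_toy
    .merge(scores_toy, how='left', left_index=True, right_on='claim_code')
    .set_index('claim_code')
)
# A claim outside the network is given a score of 0
merged['score'] = merged['score'].fillna(0)
print(merged)
```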

In [7]:
# adding lawyers' network scores
# I'm using 0 to fill missing values because if a claim isn't in the network then it's fair that its score is 0
lawyer_scores = pd.read_pickle('../data/interim/lawyers_network_scores.pkl')
claims = claims.merge(lawyer_scores, how='left', left_index=True, right_on='claim_code').set_index('claim_code')
claims['lawyer_score'] = claims['lawyer_score'].fillna(0)

In [8]:
# adding witnesses' network scores
# I'm using 0 to fill missing values because if a claim isn't in the network then it's fair that its score is 0
witness_scores = pd.read_pickle('../data/interim/witnesses_network_scores.pkl')
claims = claims.merge(witness_scores, how='left', left_index=True, right_on='claim_code').set_index('claim_code')
claims['witness_score'] = claims['witness_score'].fillna(0)

In [9]:
# adding outgoing provisions
provisions = pd.read_csv('../data/raw/provisions.csv', sep=';')
provisions['total_outgoing_provision'] = provisions['outgoing_forfait_provision'] + provisions['outgoing_provision']
claims = claims.merge(provisions, how='left', left_index=True, right_on='claim_code').drop(['incoming_forfait_provision', 'incoming_provision', 'outgoing_forfait_provision', 'outgoing_provision'], axis=1).set_index('claim_code')
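
One thing to keep in mind with the `+` above is that a missing value in either component makes the total missing. Whether the raw provisions actually contain such gaps is not shown here, so the following is only a sketch on invented numbers, with `Series.add(..., fill_value=0)` as a possible alternative rather than what this notebook does:

```python
import numpy as np
import pandas as pd

# Toy provisions, invented for illustration only
toy = pd.DataFrame({
    'outgoing_forfait_provision': [100.0, np.nan, np.nan],
    'outgoing_provision': [50.0, 30.0, np.nan],
})

# Plain `+` (as in the cell above): NaN in either operand gives NaN
toy['plain_sum'] = toy['outgoing_forfait_provision'] + toy['outgoing_provision']

# Alternative: a single missing component counts as 0,
# but the total stays NaN when both components are missing
toy['filled_sum'] = toy['outgoing_forfait_provision'].add(
    toy['outgoing_provision'], fill_value=0
)
print(toy)
```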

In [10]:
# note to self: should I account for doubled edges? As of now the network cycles algorithm doesn't check them...
claims.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92583 entries, 2010010000083100 to 2017079430026300
Data columns (total 16 columns):
black_box                   92583 non-null int64
black_box_active            92583 non-null int64
accident_province           92583 non-null object
n_vehicles                  92583 non-null int64
n_people                    92583 non-null int64
card                        92583 non-null int64
injury                      92583 non-null int64
rca                         92583 non-null int64
n_signatures                92583 non-null int64
filing_diff                 92583 non-null int64
filing_month                92583 non-null int64
company_score               92511 non-null float64
score                       92583 non-null float64
lawyer_score                92583 non-null float64
witness_score               92583 non-null float64
total_outgoing_provision    92316 non-null float64
dtypes: float64(5), int64(10), object(1)
memory usage: 12.0+ MB


In [11]:
# dropping missing values and adding fraud indicator
claims.dropna(inplace=True)
claims['fraud_indicator'] = (claims['company_score'] > 50).astype(int)
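
Before dropping rows, it can help to see how many each column would cost. A minimal sketch on an invented frame (the column names match the notebook, the values do not):

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'company_score': [10.0, np.nan, 80.0, 55.0],
    'total_outgoing_provision': [500.0, 200.0, np.nan, 0.0],
})

# Missing values per column
missing_per_column = toy.isna().sum()
# Rows removed by dropna(): any row with at least one missing value
rows_before = len(toy)
rows_after = len(toy.dropna())
print(missing_per_column)
print(f'dropped {rows_before - rows_after} of {rows_before} rows')
```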

In [12]:
claims.to_pickle('../data/processed/claims.pkl')

## Following Notebooks

- [Random Forest Prediction](4-Model.ipynb)