# Exploratory Data Analysis

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
df = pd.read_excel('../Data/Insurance Claims Data - Cleaned.xlsx')

In [4]:
df.head()

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,...,witnesses,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,...,2,YES,71610,6510,13020,52080,Saab,92x,2004,Y
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,...,0,?,5070,780,780,3510,Mercedes,E400,2007,Y
2,134,29,687698,2000-09-06,OH,100/300,2000,1413.14,5000000,430632,...,3,NO,34650,7700,3850,23100,Dodge,RAM,2007,N
3,256,41,227811,1990-05-25,IL,250/500,2000,1415.74,6000000,608117,...,2,NO,63400,6340,6340,50720,Chevrolet,Tahoe,2014,Y
4,228,44,367455,2014-06-06,IL,500/1000,1000,1583.91,6000000,610706,...,1,NO,6500,1300,650,4550,Accura,RSX,2009,N


#### Basic Analysis

In [8]:
print(f'The dataset consists of {df.shape[0]} rows and {df.shape[1]} columns.')

The dataset consists of 1000 rows and 39 columns.


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   months_as_customer           1000 non-null   int64         
 1   age                          1000 non-null   int64         
 2   policy_number                1000 non-null   int64         
 3   policy_bind_date             1000 non-null   datetime64[ns]
 4   policy_state                 1000 non-null   object        
 5   policy_csl                   1000 non-null   object        
 6   policy_deductable            1000 non-null   int64         
 7   policy_annual_premium        1000 non-null   float64       
 8   umbrella_limit               1000 non-null   int64         
 9   insured_zip                  1000 non-null   int64         
 10  insured_sex                  1000 non-null   object        
 11  insured_education_level      1000 non-null  

The dataset comprises the following feature types:
- Numeric: 18
- Object: 19
- Datetime: 2
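
The tally above can be derived programmatically rather than read off the `df.info()` output. The following is a minimal sketch on a toy frame (the values are hypothetical stand-ins, not rows from the claims file); it counts numeric and datetime columns with `select_dtypes` and treats everything else as object/string.

```python
import pandas as pd

# Toy frame standing in for the claims data (hypothetical values).
toy = pd.DataFrame({
    "age": [48, 42],
    "policy_annual_premium": [1406.91, 1197.22],
    "policy_state": ["OH", "IN"],
    "policy_bind_date": pd.to_datetime(["2014-10-17", "2006-06-27"]),
})

# Count columns per broad dtype family.
n_numeric = toy.select_dtypes(include="number").shape[1]
n_datetime = toy.select_dtypes(include="datetime").shape[1]
# Whatever is neither numeric nor datetime is treated as object/string here.
n_other = toy.shape[1] - n_numeric - n_datetime

print(f"Numeric: {n_numeric}, Object: {n_other}, Datetime: {n_datetime}")
```

Run against the real `df`, the same three lines should reproduce the counts listed above, assuming no other dtype families (for example booleans or categoricals) are present.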

#### Redundant Features

In [6]:
df['policy_number'].nunique()

1000

In [7]:
df['incident_location'].nunique()

1000

The feature `policy_number` is a unique customer identifier, and the dataset contains no more than one claim per customer. This feature provides no additional information that might help us evaluate claims, and will therefore be removed.
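
Columns whose values are all distinct can also be found in one pass instead of checking them one at a time. This is a rough sketch on a toy frame with made-up values; a fully unique column is only a candidate identifier, and whether to drop it remains a judgment call (in the outputs above, `incident_location` is fully unique as well).

```python
import pandas as pd

# Toy frame with hypothetical values; two columns are unique per row.
toy = pd.DataFrame({
    "policy_number": [521585, 342868, 687698],
    "incident_location": ["loc_a", "loc_b", "loc_c"],
    "policy_state": ["OH", "IN", "OH"],
})

# Number of distinct values per column.
n_unique = toy.nunique()

# Columns where every row carries a different value.
all_unique = n_unique[n_unique == len(toy)].index.tolist()
print(all_unique)
```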

In [8]:
df = df.drop('policy_number', axis=1)

#### Missing Values

In [5]:
df.isna().any()

months_as_customer             False
age                            False
policy_number                  False
policy_bind_date               False
policy_state                   False
policy_csl                     False
policy_deductable              False
policy_annual_premium          False
umbrella_limit                 False
insured_zip                    False
insured_sex                    False
insured_education_level        False
insured_occupation             False
insured_hobbies                False
insured_relationship           False
capital-gains                  False
capital-loss                   False
incident_date                  False
incident_type                  False
collision_type                 False
incident_severity              False
authorities_contacted           True
incident_state                 False
incident_city                  False
incident_location              False
incident_hour_of_the_day       False
number_of_vehicles_involved    False
p

Although missing values were addressed and imputed before the dataset was loaded, the DataFrame still reports missing values. This is because the 'None' category in the `authorities_contacted` feature is read back as NaN. We'll fill these entries with the string 'None' to address this problem.
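
A likely cause, stated here as an assumption, is that the loader treats the text 'None' as a missing-value marker. The sketch below illustrates the mechanism with `pd.read_csv` on an in-memory string so that it is self-contained; the notebook itself uses `pd.read_excel`, which accepts the same `na_values` and `keep_default_na` parameters. The three rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-row file with a literal 'None' category.
csv_text = "authorities_contacted\nPolice\nNone\nFire\n"

# Treating 'None' as a missing-value marker reproduces the symptom.
as_missing = pd.read_csv(io.StringIO(csv_text), na_values=["None"])

# Disabling the default markers keeps 'None' as an ordinary string.
as_string = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(as_missing["authorities_contacted"].isna().sum())
print(as_string["authorities_contacted"].tolist())
```

Note that `keep_default_na=False` switches off every default marker, so genuinely empty cells would no longer be converted to NaN either. Filling with `fillna('None')` after loading, as done here, avoids that side effect.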

In [10]:
df['authorities_contacted'] = df['authorities_contacted'].fillna('None')

In [11]:
df['authorities_contacted']

0      Police
1      Police
2      Police
3      Police
4        None
        ...  
995      Fire
996      Fire
997    Police
998     Other
999    Police
Name: authorities_contacted, Length: 1000, dtype: object

The features `collision_type`, `property_damage` and `police_report_available` contain '?' as entries. We'll replace these with the string 'Unknown'.
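
One way to confirm which columns carry the '?' placeholder, and how often, is to count matches per column before replacing them. This is a minimal sketch on a toy frame with hypothetical values, not the author's own check.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", "Side Collision"],
    "property_damage": ["?", "NO", "?"],
    "witnesses": [2, 0, 3],
})

# Count '?' entries per column and keep only the affected columns.
placeholder_counts = toy.eq("?").sum()
affected = placeholder_counts[placeholder_counts > 0]
print(affected)

# Replace the placeholder across the whole frame.
cleaned = toy.replace("?", "Unknown")
```

Counting first also shows how large the 'Unknown' category will be in each feature once the replacement is applied.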

In [12]:
df = df.replace('?','Unknown')

In [13]:
df[['collision_type','property_damage','police_report_available']].tail(10)

Unnamed: 0,collision_type,property_damage,police_report_available
990,Rear Collision,Unknown,YES
991,Rear Collision,NO,NO
992,Front Collision,YES,YES
993,Side Collision,Unknown,Unknown
994,Unknown,Unknown,YES
995,Front Collision,YES,Unknown
996,Rear Collision,YES,Unknown
997,Side Collision,Unknown,YES
998,Rear Collision,Unknown,YES
999,Unknown,Unknown,Unknown


In [15]:
df['incident_severity'].value_counts()

incident_severity
Minor Damage      354
Total Loss        280
Major Damage      276
Trivial Damage     90
Name: count, dtype: int64

In [16]:
df['fraud_reported'].value_counts()

fraud_reported
N    753
Y    247
Name: count, dtype: int64

In [None]:
# Change everything to lower case to prevent ambiguity
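
The cell above is still a placeholder. One possible way to do this, sketched on a toy frame with hypothetical values, is to lower-case every string value and leave other types untouched; the helper name `lower_if_str` is made up for this illustration.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "police_report_available": ["YES", "NO", "Unknown"],
    "auto_make": ["Saab", "Mercedes", "Dodge"],
    "witnesses": [2, 0, 3],
})

def lower_if_str(value):
    """Lower-case strings; return any other value unchanged."""
    return value.lower() if isinstance(value, str) else value

# Apply the helper element-wise, one column at a time.
lowered = toy.apply(lambda col: col.map(lower_if_str))
print(lowered)
```

This changes cell values only; column names are left as they are.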

### Summary Statistics

In [None]:
# Split into the four datasets and get the summary statistics for each class

Relevant features:
- Incident date
- Incident type
- Collision type
- Authorities contacted
- Incident state
- Incident city
- Incident location
- Incident hour of the day
- Number of vehicles involved
- Property damage
- Bodily injuries
- Police report available
- Total claim amount
- Injury claim
- Property claim
- Auto make
- Auto model
- Auto year

Target variable:
- Incident severity
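
The planned split could look roughly like the sketch below. It assumes that the "four datasets" are the four `incident_severity` classes shown earlier; the toy rows are hypothetical and only two numeric features are included.

```python
import pandas as pd

# Toy frame with hypothetical values covering the four severity classes.
toy = pd.DataFrame({
    "incident_severity": ["Minor Damage", "Total Loss", "Major Damage",
                          "Trivial Damage", "Minor Damage", "Total Loss"],
    "total_claim_amount": [5070, 71610, 63400, 6500, 7000, 60000],
    "witnesses": [0, 2, 2, 1, 3, 1],
})

# One sub-DataFrame per class, with the target column removed.
subsets = {
    name: group.drop(columns="incident_severity")
    for name, group in toy.groupby("incident_severity")
}

# Summary statistics of the numeric features for each class.
summary = {name: subset.describe() for name, subset in subsets.items()}
print(summary["Minor Damage"])
```

If separate DataFrames are not needed, `toy.groupby("incident_severity").describe()` gives the same statistics in a single table.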