# Car Insurance Data Analysis

## Exploratory Data Analysis


**Exploratory Data Analysis (EDA)** is a crucial step in any data engineering and analytics project: it builds an understanding of the dataset before further analysis is performed. It involves summarizing key characteristics and visualizing patterns, outliers, and trends in the data.

**Step 1. Understand the Data Structure**
- **Data loading**: First, load the dataset and examine its structure.

- **Check the dimensions**: Find the number of rows and columns.

- **Look at the first few records**: Get a sample of the dataset.


In [1]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
import numpy as np
import os, sys
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))

Load the car insurance dataset

In [2]:
# Import load_data function from scripts
from load_data import load_data

# read the dataset 

data = load_data('../data.zip', filename='MachineLearningRating_v3.txt')
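
The `load_data` helper lives in the `scripts` directory and its body is not shown in this notebook. As a rough sketch only, a function with the same call shape could read a delimited text file straight out of a zip archive with pandas. The name `load_data_sketch`, the `|` separator, and the in-memory demo archive are all assumptions made for illustration, not the project's actual code.

```python
import io
import zipfile

import pandas as pd


def load_data_sketch(zip_path, filename, sep="|"):
    """Read one delimited text file from a zip archive into a DataFrame.

    Hypothetical stand-in for the project's load_data; the "|" separator
    is an assumption and should be adjusted to the real file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(filename) as handle:
            return pd.read_csv(handle, sep=sep, low_memory=False)


# Self-contained demo: build a tiny archive in memory and read it back.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", "PolicyID|TotalPremium\n12827|21.93\n12827|0.0\n")
buffer.seek(0)

demo = load_data_sketch(buffer, "sample.txt")
print(demo.shape)
```

Reading directly from the archive avoids extracting a large file to disk before parsing it.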

In [3]:
# Explore the first few rows
data.head()

Unnamed: 0,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,AccountType,...,ExcessSelected,CoverCategory,CoverType,CoverGroup,Section,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims
0,145249,12827,2015-03-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
1,145249,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
2,145249,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0
3,145255,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,512.84807,0.0
4,145255,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0


In [4]:
# Find the number of rows and columns
data.shape

(1000098, 52)

The dataset has 1,000,098 rows and 52 columns.

**Data Cleaning:**

- Handle missing or incomplete data: Since the dataset spans several categories, some fields may have missing values. 


In [5]:
# Import the DataProcessing class

from data_processing import DataProcessing
# Create instance of the class
data_processing = DataProcessing(data)

# Summary of Missing data
missing_summary = data_processing.missing_data_summary()

# Display results
missing_summary

Unnamed: 0,Missing Count,Percentage (%)
NumberOfVehiclesInFleet,1000098,100.0
CrossBorder,999400,99.930207
CustomValueEstimate,779642,77.95656
Converted,641901,64.18381
Rebuilt,641901,64.18381
WrittenOff,641901,64.18381
NewVehicle,153295,15.327998
Bank,145961,14.59467
AccountType,40232,4.022806
Gender,9536,0.953507
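
The internals of `DataProcessing.missing_data_summary()` are not shown in this notebook. A minimal sketch of how a comparable summary could be computed with plain pandas follows; the function name `missing_summary_sketch` and the toy frame are hypothetical, and only the output column names are taken from the table above.

```python
import numpy as np
import pandas as pd


def missing_summary_sketch(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Percentage (%)": counts / len(df) * 100,
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["Missing Count"] > 0]
    return summary.sort_values("Missing Count", ascending=False)


# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "CrossBorder": [np.nan, np.nan, np.nan, "No"],
    "TotalPremium": [21.9, 0.0, 512.8, 0.0],
})
toy_summary = missing_summary_sketch(toy)
print(toy_summary)
```

Sorting by missing count puts the candidates for dropping at the top and the candidates for imputation further down.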


**Drop Columns with High Missing Data:**
- Columns with high missing values offer little analytical value.

**Dropped Columns**

- `NumberOfVehiclesInFleet` (100% missing)
- `CrossBorder` (~99.93%)
- `CustomValueEstimate` (~77.96%)
- `Converted`, `Rebuilt`, `WrittenOff` (~64.18% each)


In [None]:
cols_to_drop = ['NumberOfVehiclesInFleet', 
                'CrossBorder', 
                'CustomValueEstimate', 
                'Converted', 'Rebuilt', 
                'WrittenOff']

# Drop these columns
data = data_processing.handle_missing_data('high', cols_to_drop)
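
What `handle_missing_data('high', ...)` does internally is not shown here. As a sketch of the general idea, columns can be dropped by their share of missing values instead of by a hand-written list. The helper name `drop_sparse_columns` and the 50% threshold are assumptions chosen for illustration; the columns dropped above all sit at roughly 64% missing or more.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.5):
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()
    sparse = missing_share[missing_share > threshold].index.tolist()
    return df.drop(columns=sparse), sparse


# Toy data for illustration only.
toy = pd.DataFrame({
    "CrossBorder": [np.nan, np.nan, np.nan, "No"],      # 75% missing
    "Bank": ["Absa", None, "Nedbank", "Absa"],          # 25% missing
    "TotalPremium": [21.9, 0.0, 512.8, 0.0],            # complete
})
reduced, dropped = drop_sparse_columns(toy, threshold=0.5)
print(dropped)
print(reduced.columns.tolist())
```

A threshold keeps the rule reproducible if the data is refreshed, but an explicit list, as used in this notebook, makes the decision easier to review.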


**Impute Moderate Missing Data:**
- Imputation preserves useful information, using the mode for categorical and median for numerical columns.

**Imputed Columns:**

- `NewVehicle` (~15.33%)
- `Bank` (~14.59%)
- `AccountType` (~4.02%)

In [7]:
# Impute columns with moderate missing data
missing_cols = ['NewVehicle', 'Bank', 'AccountType']
data = data_processing.handle_missing_data('moderate', missing_cols)
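
The mode/median rule described above can be sketched in a few lines of pandas. This is not the body of `handle_missing_data`, which is not shown in this notebook; `impute_columns` and the toy values are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd


def impute_columns(df, columns):
    """Fill gaps with the median (numeric columns) or the mode (all others)."""
    result = df.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(result[col]):
            fill_value = result[col].median()
        else:
            # mode() ignores missing values; take the first if there are ties.
            fill_value = result[col].mode(dropna=True).iloc[0]
        result[col] = result[col].fillna(fill_value)
    return result


# Toy data for illustration only.
toy = pd.DataFrame({
    "Bank": ["First National Bank", None, "First National Bank", "Absa"],
    "kilowatts": [75.0, np.nan, 111.0, 111.0],
})
imputed = impute_columns(toy, ["Bank", "kilowatts"])
print(imputed)
```

The median is preferred over the mean for numeric columns because it is less sensitive to extreme values. Note that this sketch would fail on a column that is entirely missing, since such a column has no mode.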

**Handle Low Missing Data - Standard Imputation**

- These columns can be reasonably imputed without affecting data quality.

**Imputed Columns:**

- `Gender` (~0.95%)
- `MaritalStatus` (~0.83%)
- Various vehicle-related columns (~0.055% each): `Cylinders`, `cubiccapacity`, `kilowatts`, etc.
 


In [8]:
# Handle low missing data (standard imputation)
missing_cols = ['Gender', 'MaritalStatus', 'Cylinders', 'cubiccapacity', 
                'kilowatts', 'NumberOfDoors', 'VehicleIntroDate', 'Model', 
                'make', 'VehicleType', 'mmcode', 'bodytype', 'CapitalOutstanding']

data = data_processing.handle_missing_data('low', missing_cols)


**Overall Decision Summary:**

- High missing data: Dropped.

- Moderate missing data: Imputed with mode (categorical) or median (numerical).

- Low missing data: Imputed to avoid unnecessary data loss.

Choose the method (imputation or removal) based on the importance of each feature.

For analysis, convert categorical columns (e.g., `MaritalStatus`, `VehicleType`, `Bank`) into appropriate formats, such as the pandas `category` dtype or dummy variables.
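
As a brief illustration of the conversion, the sketch below casts two columns to the `category` dtype and then one-hot encodes them with `pd.get_dummies`. The toy values are made up for the example; on the full dataset the same calls would be applied to `data`.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "MaritalStatus": ["Married", "Single", "Single"],
    "VehicleType": ["Passenger Vehicle", "Bus", "Passenger Vehicle"],
    "TotalPremium": [21.9, 0.0, 512.8],
})

categorical_cols = ["MaritalStatus", "VehicleType"]

# The category dtype stores each distinct label once, which saves memory
# on columns with few unique values.
toy = toy.astype({col: "category" for col in categorical_cols})

# One-hot encode; drop_first=True removes one redundant level per column.
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())
```

The `category` dtype is usually enough for grouping and plotting, while dummy variables are mainly needed when the columns feed a model that expects numeric input.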

**Summarize Key Statistics**

- **Descriptive statistics:**
    - Calculate descriptive statistics and examine the variability of numerical features such as `TotalPremium` and `TotalClaims`.

- **Distribution of numerical features:** Understand the range, variance, and skewness of features.

In [9]:
# Statistic summary of numerical features
display(data.describe())

Unnamed: 0,UnderwrittenCoverID,PolicyID,PostalCode,mmcode,RegistrationYear,Cylinders,cubiccapacity,kilowatts,NumberOfDoors,SumInsured,CalculatedPremiumPerTerm,TotalPremium,TotalClaims
count,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0,1000098.0
mean,104817.5,7956.682,3020.601,54880560.0,2010.225,4.046616,2466.869,97.21553,4.019239,604172.7,117.8757,61.9055,64.86119
std,63293.71,5290.039,2649.854,13600590.0,3.261391,0.293941,442.7106,19.39061,0.4681854,1508332.0,399.7017,230.2845,2384.075
min,1.0,14.0,1.0,4041200.0,1987.0,0.0,0.0,0.0,0.0,0.01,0.0,-782.5768,-12002.41
25%,55143.0,4500.0,827.0,60056920.0,2008.0,4.0,2237.0,75.0,4.0,5000.0,3.2248,0.0,0.0
50%,94083.0,7071.0,2000.0,60058420.0,2011.0,4.0,2694.0,111.0,4.0,7500.0,8.4369,2.178333,0.0
75%,139190.0,11077.0,4180.0,60058420.0,2013.0,4.0,2694.0,111.0,4.0,250000.0,90.0,21.92982,0.0
max,301175.0,23246.0,9870.0,65065350.0,2015.0,10.0,12880.0,309.0,6.0,12636200.0,74422.17,65282.6,393092.1
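
`describe()` does not report skewness, and the summary above shows negative minimum values for `TotalPremium` and `TotalClaims` that may deserve a closer look. A minimal sketch of how variance, skewness, and a count of negative values could be gathered in one table follows; the toy numbers are invented for illustration and are not drawn from the dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "TotalPremium": [21.93, 21.93, 0.0, 512.85, -5.0],
    "TotalClaims": [0.0, 0.0, 0.0, 2500.0, 0.0],
})

numeric = toy.select_dtypes(include="number")
shape_summary = pd.DataFrame({
    "variance": numeric.var(),
    "skewness": numeric.skew(),
    # Negative monetary amounts may be refunds or data errors.
    "negative_count": (numeric < 0).sum(),
})
print(shape_summary)
```

A large positive skewness, which is common for claim amounts where most rows are zero, suggests inspecting the distribution on a log scale or separately for non-zero values.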
