# Risk Analyst Python Workflow: Analyzing Car Insurance Claims

Tasks:
1. Load and explore the dataset
2. Preprocess the data (handle missing values, encode categorical variables)
3. Perform basic analysis (EDA)
4. Visualize findings
5. Analyze the relationship between variables
6. Create a simple risk score (optional)
7. Create a summary, and save the preprocessed data into a CSV
8. Extra: Save the encoded DataFrame and pickle the dictionaries

## Some questions to spark ideas

### Data Exploration and Preprocessing

1. How many unique categories are there in the 'driving_experience' column? What is the distribution of these categories?
2. Is there a significant difference in outcomes between different age groups?
3. Calculate the percentage of missing values in each column. Which columns have the highest percentage of missing values?
4. Create a new feature 'claims_per_year' by dividing the 'past_accidents' by the number of years of driving experience.
5. Find the distribution of the claim outcomes. What does this tell you about the nature of insurance claims?
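
As a starting point for questions 1, 3, and 4, here is a minimal sketch on a toy DataFrame. The values, the experience labels (`'0-9y'`, `'10-19y'`, ...), and the midpoint mapping `experience_years` are assumptions made for illustration, not taken from the real file; adjust them to whatever `df['driving_experience'].unique()` actually returns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the claims data; all values and labels are invented.
toy = pd.DataFrame({
    "driving_experience": ["0-9y", "10-19y", "20-29y", "0-9y", "30y+"],
    "credit_score": [0.62, np.nan, 0.48, np.nan, 0.71],
    "annual_mileage": [12000.0, 16000.0, np.nan, 11000.0, 9000.0],
    "past_accidents": [0, 2, 1, 3, 0],
})

# Question 1: unique categories and how often each occurs.
experience_counts = toy["driving_experience"].value_counts()

# Question 3: percentage of missing values per column, highest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Question 4: assumed midpoint (in years) for each experience band.
experience_years = {"0-9y": 5, "10-19y": 15, "20-29y": 25, "30y+": 35}
toy["claims_per_year"] = (
    toy["past_accidents"] / toy["driving_experience"].map(experience_years)
)

print(experience_counts)
print(missing_pct)
print(toy[["driving_experience", "past_accidents", "claims_per_year"]])
```

Using a band midpoint is only one way to turn a categorical experience band into years; the lower bound or a separate numeric column, if one exists, would work as well.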

### Data Visualization

1. Create a heatmap of the correlation matrix for all numerical variables. What insights can you draw from this visualization?
2. Generate a pair plot for age, credit_score, past_accidents, and DUIs. What patterns or relationships do you observe?
3. Create a visualization showing the proportion of claims by driving experience categories for different age groups (you may need to bin the age variable).
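
For question 3, one possible approach is to bin age with `pd.cut` and then compute the mean of `outcome` per age group and experience band. The sketch below uses a toy DataFrame and assumes a numeric `age` column and a 0/1 `outcome`; the bin edges and labels are illustrative. If `age` is already stored as bands in the real data, the `pd.cut` step can be skipped.

```python
import pandas as pd

# Toy data; ages, experience labels, and outcomes are invented.
toy = pd.DataFrame({
    "age": [19, 23, 34, 41, 58, 67, 29, 45],
    "driving_experience": ["0-9y", "0-9y", "10-19y", "20-29y",
                           "30y+", "30y+", "0-9y", "20-29y"],
    "outcome": [1, 0, 0, 1, 0, 0, 1, 0],
})

# Bin age into labelled groups; intervals are (15, 25], (25, 40], (40, 65], (65, 120].
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[15, 25, 40, 65, 120],
    labels=["16-25", "26-40", "41-65", "66+"],
)

# The mean of a 0/1 outcome is the proportion of claims in each cell.
claim_rate = (
    toy.groupby(["age_group", "driving_experience"], observed=True)["outcome"]
    .mean()
    .unstack()
)

print(claim_rate)
```

Calling `claim_rate.plot(kind="bar")` on the result would draw one group of bars per age group, with one bar per experience band.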

### Risk Assessment

1. Create a calculation to derive a new metric 'risk score'.
2. Create a function that categorizes policyholders into 'Low', 'Medium', and 'High' risk based on your risk score. What percentage of policyholders fall into each category?
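
A risk score can be as simple as a weighted sum of driving-history counts. The sketch below is one hypothetical scheme: the weights in `WEIGHTS` and the `low`/`high` thresholds are arbitrary assumptions chosen for illustration, not values derived from the data.

```python
import pandas as pd

# Toy data; the counts are invented.
toy = pd.DataFrame({
    "speeding_violations": [0, 2, 5, 1],
    "duis": [0, 0, 2, 1],
    "past_accidents": [0, 1, 4, 0],
})

# Hypothetical weights: DUIs count most, speeding violations least.
WEIGHTS = {"speeding_violations": 1.0, "duis": 3.0, "past_accidents": 2.0}


def risk_score(frame):
    """Weighted sum of the columns listed in WEIGHTS."""
    return sum(frame[col] * weight for col, weight in WEIGHTS.items())


def categorize(score, low=2.0, high=6.0):
    """Map a score to 'Low', 'Medium', or 'High' using assumed thresholds."""
    if score < low:
        return "Low"
    if score < high:
        return "Medium"
    return "High"


toy["risk_score"] = risk_score(toy)
toy["risk_category"] = toy["risk_score"].apply(categorize)

# Percentage of policyholders in each category.
category_pct = toy["risk_category"].value_counts(normalize=True).mul(100)
print(category_pct)
```

On real data, thresholds taken from quantiles of the score (for example with `pd.qcut`) may be a more defensible choice than fixed cut-offs.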




## 1. Load and explore the dataset


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:

# Load the dataset
df = pd.read_csv('data/car_insurance_claim.csv')


In [None]:
# Display basic information about the dataset
df.info()


In [None]:
# Show the first few rows
df.head()


In [None]:
# Display summary statistics
df.describe()


In [None]:
# Normalize column names to lowercase so later cells can refer to them consistently
df.columns = df.columns.str.lower()


## 2. Preprocess the data


In [None]:
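# Handle missing values by imputing the column mean.
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which is deprecated and may not modify df in recent pandas.
df['credit_score'] = df['credit_score'].fillna(df['credit_score'].mean())

df['annual_mileage'] = df['annual_mileage'].fillna(df['annual_mileage'].mean())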
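
The task list also calls for encoding categorical variables and pickling the dictionaries. A minimal sketch of one way to do that is shown here on a toy DataFrame: the column names, labels, and the `encoders` dictionary are illustrative assumptions, and the integer codes simply follow alphabetical order of the labels.

```python
import pickle

import pandas as pd

# Toy data; column names and labels are invented.
toy = pd.DataFrame({
    "driving_experience": ["0-9y", "20-29y", "10-19y", "0-9y"],
    "vehicle_type": ["sedan", "sports car", "sedan", "sedan"],
})

encoders = {}          # column name -> {label: integer code}
encoded = toy.copy()   # keep the original labels untouched

for col in ["driving_experience", "vehicle_type"]:
    categories = sorted(toy[col].dropna().unique())
    encoders[col] = {label: code for code, label in enumerate(categories)}
    encoded[col] = toy[col].map(encoders[col])

# Round-trip the dictionaries through pickle to show they can be restored.
restored = pickle.loads(pickle.dumps(encoders))

print(encoded)
print(restored)
```

To persist the mappings, `pickle.dump(encoders, f)` with a file opened in `'wb'` mode writes them to disk. Keeping the dictionaries makes it possible to decode the integers back to labels later.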



## 3. Perform basic statistical analysis


In [None]:
df.describe()

In [None]:
cols = [
    'credit_score',
    'vehicle_ownership',
    'married',
    'children',
    'annual_mileage',
    'speeding_violations',
    'duis',
    'past_accidents',
    'outcome'
]

df[cols].corr()

In [None]:
sns.heatmap(df[cols].corr(), annot=True)


## 4. Visualize claim distributions


In [None]:

plt.figure(figsize=(10, 6))
plt.hist(df['annual_mileage'])
plt.title('Annual Mileage Distribution')
plt.xlabel('Annual Mileage')
plt.ylabel('Frequency')
plt.show()



## 5. Analyze the relationship between variables


In [None]:
fig, ax1 = plt.subplots()

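# Sort counts by value so each bar height matches its x position
# (unique() and value_counts() return values in different orders)
accident_counts = df['past_accidents'].value_counts().sort_index()
ax1.bar(accident_counts.index, accident_counts.values)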
ax1.set_xlabel('Past Accidents')
ax1.set_ylabel('Count')

ax2 = ax1.twinx()
ax2.plot(df.groupby('past_accidents')['outcome'].mean(), color='red')
ax2.set_ylabel('Mean Outcome')

In [None]:
fig, ax1 = plt.subplots()

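# Sort counts by value so each bar height matches its x position
dui_counts = df['duis'].value_counts().sort_index()
ax1.bar(dui_counts.index, dui_counts.values)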
ax1.set_xlabel('DUIs')
ax1.set_ylabel('Count')

ax2 = ax1.twinx()
ax2.plot(df.groupby('duis')['outcome'].mean(), color='red')
ax2.set_ylabel('Mean Outcome')

In [None]:
fig, ax1 = plt.subplots()

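# Sort counts by value so each bar height matches its x position
speeding_counts = df['speeding_violations'].value_counts().sort_index()
ax1.bar(speeding_counts.index, speeding_counts.values)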
ax1.set_xlabel('Speeding Violations')
ax1.set_ylabel('Count')

ax2 = ax1.twinx()
ax2.plot(df.groupby('speeding_violations')['outcome'].mean(), color='red')
ax2.set_ylabel('Mean Outcome')
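
The remaining task, creating a summary and saving the preprocessed data to CSV, could look roughly like the sketch below. It runs on a toy DataFrame and writes to a temporary directory so that it is safe to execute anywhere; the file name `car_insurance_preprocessed.csv` is a hypothetical choice.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data; the values are invented.
toy = pd.DataFrame({
    "past_accidents": [0, 0, 1, 2, 2],
    "outcome": [0, 1, 0, 1, 1],
})

# Summary: number of policyholders and claim rate per accident count.
summary = toy.groupby("past_accidents")["outcome"].agg(
    policyholders="count",
    claim_rate="mean",
)

# Save without the index column, then read back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "car_insurance_preprocessed.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(summary)
```

In the notebook itself, the same `to_csv(..., index=False)` call on `df` with a path under `data/` would save the preprocessed frame.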