# Exploratory Data Analysis of Auto Insurance Policies 

### Agenda

    1. Quick Data Summary
    2. Facets Dive
    3. Conclusion / Takeaways

# Quick Data Summary

In [4]:
import pandas as pd
import pprint as pp

# read data
df = pd.read_csv('../data/auto_policies.csv')
# sort rows chronologically by year and month
df.sort_values(['year', 'month'], inplace=True)
print("Columns")
pp.pprint(df.columns.values.tolist())
print()
print(F"There are {df.shape[0]} rows and {df.shape[1]} columns")

Columns
['year',
 'month',
 'driver_age',
 'driver_gender',
 'driver_employment',
 'driver_marital',
 'driver_location',
 'vehicle_age',
 'vehicle_model',
 'insurance_premium',
 'insurance_claims',
 'insurance_losses']

There are 200000 rows and 12 columns


### Set indexes for the dataframe
A datetime index is useful for grouping later. Building it can take a few seconds.

In [None]:
# Build a datetime column from year and month (day fixed to the 1st).
# The original format string used %M (minutes) instead of %m (month),
# and the converted values were never assigned back to the column.
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))
# Set the index
df.set_index('date', inplace=True)
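
As a rough illustration of why a datetime index helps with grouping, here is a minimal sketch on a hypothetical toy frame. It reuses the dataset's column names, but the values and the `toy`, `monthly`, and `yearly` names are made up for the example.

```python
import pandas as pd

# Hypothetical toy data that reuses the dataset's column names.
toy = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019],
    "month": [1, 1, 2, 1],
    "insurance_premium": [100.0, 200.0, 150.0, 120.0],
    "insurance_losses": [0.0, 50.0, 300.0, 10.0],
})

# Assemble a first-of-month timestamp from the year and month columns.
toy["date"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))
toy = toy.set_index("date").sort_index()

# Totals per calendar month, grouped on the index itself.
monthly = toy.groupby(level="date")[["insurance_premium", "insurance_losses"]].sum()

# Mean premium per calendar year, derived from the DatetimeIndex.
yearly = toy.groupby(toy.index.year)["insurance_premium"].mean()

print(monthly)
print(yearly)
```

The same two `groupby` calls should carry over to the full dataframe once its index is a proper `DatetimeIndex`.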

### Separate Categorical values from Numerical Values

In [None]:
print("Here are the categorical columns")
cat_df_auto = df.select_dtypes(include='object')
display(cat_df_auto.head())
print()
num_df_auto = df.select_dtypes(exclude='object')
print("Here are the numerical columns")
display(num_df_auto.head())

### How many null values? Memory Size?

In [None]:
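print("General DF Info described below")
# info() prints its report directly and returns None, so it is not wrapped in print().
# memory_usage='deep' reports the real size of object (string) columns.
df.info(memory_usage='deep')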
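
If explicit numbers are preferred over the `info()` report, null counts and memory use can also be computed directly. The sketch below uses a hypothetical toy frame; the missing values are invented for illustration and do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values injected on purpose.
toy = pd.DataFrame({
    "driver_age": [25, 40, np.nan],
    "driver_gender": ["F", None, "M"],
})

# Number of missing values per column.
null_counts = toy.isna().sum()

# Fraction of missing values per column.
null_share = toy.isna().mean()

# Total memory footprint in bytes, including the contents of string columns.
memory_bytes = toy.memory_usage(deep=True).sum()

print(null_counts)
print(null_share)
print(f"{memory_bytes} bytes")
```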

### Describe the general statistics of each column

In [None]:
import json
import pprint as pp


desc = num_df_auto.describe()
for col in desc:
    stats = desc[col]
    print(F"Stats for the column: {col}")
    pp.pprint(json.loads(stats.to_json()))
    print()

### Extract unique values for the categorical columns

In [None]:
for col in cat_df_auto:
    print(F"Unique values for column {col} are:")
    print(cat_df_auto[col].unique())
    print()


### Analyze relationships in the data - via Distributions
Focus on the premium, claims, and losses in the dataset and on their probability distributions across the other features, so that trends can be identified quickly.

### Distributions to Study
- Insurance premium distribution by other features
- Insurance losses distribution by other features
- Insurance claims distribution by other features
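
One possible starting point for these comparisons is a grouped summary of the three insurance columns against a single feature. This is a minimal sketch on hypothetical toy data: the `"F"`/`"M"` category values and all numbers are assumptions, since the real category values are not shown here.

```python
import pandas as pd

# Hypothetical toy data that reuses the dataset's column names.
toy = pd.DataFrame({
    "driver_gender": ["F", "M", "F", "M"],
    "insurance_premium": [100.0, 200.0, 300.0, 400.0],
    "insurance_claims": [0, 1, 2, 1],
    "insurance_losses": [0.0, 50.0, 500.0, 100.0],
})

targets = ["insurance_premium", "insurance_claims", "insurance_losses"]

# Mean and median of each target column within each category of the feature.
summary = toy.groupby("driver_gender")[targets].agg(["mean", "median"])
print(summary)
```

Swapping `"driver_gender"` for another feature such as `driver_location` or `vehicle_model` gives the corresponding breakdown.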

In [None]:

# Policies where losses exceed the premium, i.e. the insurance company loses money
df[df.insurance_losses > df.insurance_premium]
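
The filter above lists individual loss-making policies. A possible follow-up is to summarize them per group, as in the sketch below. The toy data, the `is_unprofitable` flag, and the `loss_ratio` column (total losses divided by total premium) are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data that reuses the dataset's column names.
toy = pd.DataFrame({
    "vehicle_model": ["A", "A", "B", "B"],
    "insurance_premium": [100.0, 200.0, 100.0, 300.0],
    "insurance_losses": [150.0, 0.0, 50.0, 100.0],
})

# Flag policies where losses exceed the premium collected.
toy["is_unprofitable"] = toy["insurance_losses"] > toy["insurance_premium"]

# Per-model share of loss-making policies plus premium and loss totals.
by_model = toy.groupby("vehicle_model").agg(
    unprofitable_share=("is_unprofitable", "mean"),
    total_premium=("insurance_premium", "sum"),
    total_losses=("insurance_losses", "sum"),
)

# Loss ratio: losses paid out per unit of premium collected.
by_model["loss_ratio"] = by_model["total_losses"] / by_model["total_premium"]
print(by_model)
```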

## Quick Data Summary
1. There are 5 categorical columns
    - driver_gender, driver_employment, driver_marital, driver_location, vehicle_model
2. There are 5 numerical columns
    - driver_age, vehicle_age, insurance_premium, insurance_claims, insurance_losses
3. There are 2 date columns
    - year, month
4. 200,000 records, all non-null

  

## Facets Dive - Interactive Data Visualization
### Pros
- Good for initial data discovery. 
- Easy to share with others. 

### Cons
- Does not work well for large datasets (more than 100k records); sample the data first.
- Not well suited to answering specific questions about nullity, sparseness, etc.
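
The sampling advice could be wrapped in a small helper that only downsamples when a frame exceeds the size limit. This is a sketch under stated assumptions: `sample_for_dive`, `MAX_RECORDS`, and the fixed seed are hypothetical names and choices, with the 100k threshold taken from the note above.

```python
import pandas as pd

MAX_RECORDS = 100_000  # assumed cap, based on the 100k figure mentioned above


def sample_for_dive(frame, max_records=MAX_RECORDS, seed=0):
    """Return the frame unchanged if it is small enough, else a random sample."""
    if len(frame) <= max_records:
        return frame
    # A fixed seed keeps the sample reproducible between runs.
    return frame.sample(n=max_records, random_state=seed)


# Hypothetical toy frame that is larger than the cap.
toy = pd.DataFrame({"driver_age": range(250_000)})
subset = sample_for_dive(toy)
print(len(toy), "->", len(subset))
```

Capping by row count rather than by a fixed fraction keeps the output size predictable as the dataset grows.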

In [2]:
"""
Leverage the PAIR-code Facets library (Facets Dive) to visualize data quickly,
then save the results into a static web page for easy sharing.
"""
from IPython.display import display, HTML

# Convert a 20% random sample of df into a JSON string of records.
# (The original referenced an undefined name, auto_policies, and passed
# the fraction as a row count.)
jsonstr = df.sample(frac=0.20).to_json(orient='records')
# boilerplate template for facets dive
HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""

# Inject into the above HTML template
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
# Leveraging display() will work, but it can be slow for large datasets (> 100k records),
# so I will save it instead.
# display(HTML(html))
with open("auto_policy_dive.html", "w") as f:
    f.write(html)