# Introduction

This competition is hosted by Porto Seguro, the third-largest insurance company in Brazil. The task is to predict the probability that a driver will initiate an insurance claim in the next year.

This notebook aims to provide interactive charts and analysis of the competition data using the Python visualization library Plot.ly, and hopefully to offer some insights and attractive plots that others can take and replicate. Plot.ly is one of the main products offered by the software company Plotly, which specializes in online graphical and statistical visualizations (charts and dashboards) and provides an API for a rich suite of programming languages and tools such as Python, R, Matlab and Node.js.

Listed below for convenience are the various Plotly plots in this notebook:

- Simple horizontal bar plot - Used to inspect the Target variable distribution
- Correlation Heatmap plot - Inspect the correlation between the different features
- Scatter plot - Compare the feature importances generated by the Random Forest and Gradient Boosted models
- Vertical bar plot - List the importance of the various features in descending order
- 3D Scatter plot

The themes in this notebook can be briefly summarized as follows:

1. Data quality checks - Visualizing and evaluating all missing/null values (values that are -1)
2. Feature inspection and filtering - Correlation and mutual information plots of the features against the target variable. Inspection of the binary, categorical and other variables.
3. Feature importance ranking via learning models - Building a Random Forest and a Gradient Boosted model to help us rank features based on what the models learn.
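
As a rough illustration of the first theme, the sketch below counts values equal to -1 per column. It runs on a small made-up frame (the values are invented, not taken from the competition data), and `count_missing` is a hypothetical helper name introduced only for this example.

```python
import numpy as np
import pandas as pd


def count_missing(df: pd.DataFrame, sentinel: int = -1) -> pd.Series:
    """Count sentinel-coded missing values per column (hypothetical helper)."""
    counts = (df == sentinel).sum()
    # Keep only columns that actually contain the sentinel, largest first
    return counts[counts > 0].sort_values(ascending=False)


# Toy data for illustration only; the values are invented
toy = pd.DataFrame(
    {
        "ps_ind_02_cat": [2, -1, 4, 1],
        "ps_ind_03": [5, 7, 9, 2],
        "ps_ind_05_cat": [-1, -1, 1, -1],
    }
)

missing_counts = count_missing(toy)
print(missing_counts)

# Replacing the sentinel with NaN lets pandas' own null-handling tools take over
toy_nan = toy.replace(-1, np.nan)
print(toy_nan.isnull().sum())
```

Converting -1 to NaN is one possible design choice; it makes methods such as `isnull()` usable directly, at the cost of turning integer columns into floats.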

Let's Go

In [1]:
# Let us load in the relevant Python modules
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.offline as py
import plotly.tools as tls
import seaborn as sns
from sklearn.feature_selection import mutual_info_classif

# Enable Plotly's offline mode so that charts render inside the notebook
py.init_notebook_mode(connected=True)

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

Let us load in the training data provided using Pandas:

In [2]:
train = pd.read_csv("../input/train.csv")
train.head()

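| | id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | ... | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | ... | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |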


In [3]:
# Taking a look at how many rows and columns the train dataset contains
rows = train.shape[0]
columns = train.shape[1]
print(f"The train dataset contains {rows} rows and {columns} columns")

The train dataset contains 595212 rows and 59 columns
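
The column names shown by `train.head()` carry suffixes such as `_bin` and `_cat`. A minimal sketch of splitting columns by suffix follows, on the assumption (inferred from the names alone) that `_bin` marks binary and `_cat` marks categorical features. The helper `group_columns_by_suffix` is hypothetical, and the sample list is a handful of names copied from the header above.

```python
from typing import Dict, List


def group_columns_by_suffix(columns: List[str]) -> Dict[str, List[str]]:
    """Split column names into binary, categorical and other groups (hypothetical helper)."""
    groups: Dict[str, List[str]] = {"binary": [], "categorical": [], "other": []}
    for name in columns:
        if name.endswith("_bin"):
            groups["binary"].append(name)
        elif name.endswith("_cat"):
            groups["categorical"].append(name)
        else:
            groups["other"].append(name)
    return groups


# A few column names taken from the train.head() output above
sample_columns = [
    "ps_ind_01",
    "ps_ind_02_cat",
    "ps_ind_06_bin",
    "ps_calc_11",
    "ps_calc_15_bin",
]

groups = group_columns_by_suffix(sample_columns)
print(groups)
```

On the full frame the same helper could be called as `group_columns_by_suffix(list(train.columns))`; note that `id` and `target` would then land in the "other" group and would need to be excluded separately.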
