# 1. Introduction

This is my first kernel on Kaggle, and my goal is to take it to completion through the stages of data exploration, data preprocessing, prediction and refinement. In the end I hope to earn a rank on the leaderboard. 
I believe this is the best way to get started, and I am thankful to everyone who has shared their work on Kaggle. 

In [None]:
# Loading the relevant Python modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
from collections import Counter
from sklearn.feature_selection import mutual_info_classif
from sklearn.utils import shuffle
from sklearn.preprocessing import Imputer
warnings.filterwarnings('ignore')

# 2. Analysis

## Data Exploration

Making use of EDA done by [Anisotropic](https://www.kaggle.com/arthurtok/interactive-porto-insights-a-plot-ly-tutorial)

In [None]:
# Importing the training and testing data
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
# Printing first 5 rows of training data
train.head()

In [None]:
# Taking a look at rows and columns of the train and test dataset
rows_train = train.shape[0]
columns_train = train.shape[1]
print("The train dataset contains {0} rows and {1} columns".format(rows_train, columns_train))
rows_test = test.shape[0]
columns_test = test.shape[1]
print("The test dataset contains {0} rows and {1} columns".format(rows_test, columns_test))

So, we can see that there are 595212 rows and 59 columns in the training data. As one column is the ID and one is the target value, there are 57 features in the training data. Along the same lines, the test dataset contains 892816 rows and 58 columns; since we have to predict the target value, it has one column fewer.

Two important points that I observed in the data description of the competition were:
1. Null values have been replaced by -1.
2. Column names only hint at the type of variable; not much information is provided about what each feature actually means.

So, even though checking for null values in the data would return nothing, we still have to take care of them. Also, we need to make use of various statistical and computational methods in order to derive meaning from the data.
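To make the first point concrete, here is a minimal sketch on a toy dataframe. The values are made up for illustration; only the -1 convention comes from the competition description. It shows why `isnull()` finds nothing while the missing values are still there.

```python
import pandas as pd

# Toy frame (hypothetical values) that mimics the competition's convention:
# missing entries are encoded as -1 rather than NaN.
toy = pd.DataFrame({
    "ps_reg_03": [0.71, -1.0, 0.58, -1.0],
    "ps_car_03_cat": [-1, 1, -1, -1],
    "ps_reg_01": [0.7, 0.8, 0.0, 0.9],
})

# isnull() sees no missing values because -1 is an ordinary number
has_nan = toy.isnull().any().any()

# Counting the sentinel directly reveals the hidden missing values
sentinel_counts = (toy == -1).sum()

print(has_nan)
print(sentinel_counts)
```

Counting the sentinel like this is only safe when -1 is not a legitimate value of the feature, which is what the data description implies.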

In [None]:
# Checking for null values
train.isnull().any().any()

### **Null or missing values**

In [None]:
import missingno as msno
# Creating a copy of the training data with -1 replaced by NaN
# (replace returns a new DataFrame, so train itself is left unchanged)
train_null = train.replace(-1, np.nan)

msno.matrix(df=train_null.iloc[:,2:39], figsize=(20, 14), color=(0.42, 0.1, 0.05))

In this visualization, missing data is shown as white bands superimposed on the dark red bands that represent values that are present. It gives a quick visual estimate of how much data is missing and makes it easy to see which features have the most missing values. However, the plot can only fit roughly 40 features, so some features with null values are left out. From the visualization we can see that ps_reg_03, ps_car_03_cat and ps_car_05_cat have the most missing values, and we can get the full list of features with missing values as below.

In [None]:
# Copy of the test data with -1 replaced by NaN
test_null = test.replace(-1, np.nan)
# Extract columns with null data
train_null = train_null.loc[:, train_null.isnull().any()]
test_null = test_null.loc[:, test_null.isnull().any()]

print(train_null.columns)
print(test_null.columns)

In [None]:
print('Columns \t Number of NaN')
for column in train_null.columns:
    print('{}:\t {}'.format(column, train_null[column].isnull().sum()))

**ps_car_03_cat**, **ps_car_05_cat**, **ps_reg_03**, **ps_car_14** and **ps_car_07_cat** have missing values in more than 10,000 rows of the training data. Many null values can cause errors when training models, which in turn leads to wrong predictions. We will need to do feature analysis in order to decide on a strategy for treating these missing values.
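As a rough illustration of what such a strategy could look like, the sketch below fills a continuous feature with its median and a categorical feature with its most frequent value. This is an assumption for illustration, not necessarily the treatment chosen later; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with NaN already substituted for -1
toy = pd.DataFrame({
    "ps_reg_03": [0.5, np.nan, 0.7, 0.9],       # continuous
    "ps_car_07_cat": [1.0, 1.0, np.nan, 0.0],   # categorical
})

imputed = toy.copy()
# Continuous feature: fill with the column median
imputed["ps_reg_03"] = imputed["ps_reg_03"].fillna(imputed["ps_reg_03"].median())
# Categorical feature: fill with the most frequent category
imputed["ps_car_07_cat"] = imputed["ps_car_07_cat"].fillna(
    imputed["ps_car_07_cat"].mode()[0]
)

print(imputed)
```

For features where most rows are missing, other options are to drop the column or to keep -1 as a category of its own.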

### **Target Exploration**

In [None]:
targets = train['target'].values
sns.set(style="darkgrid")
ax = sns.countplot(x = targets)
for p in ax.patches:
    ax.annotate('{:.2f}%'.format(100*p.get_height()/len(targets)), (p.get_x()+ 0.3, p.get_height()+10000))
plt.title('Distribution of Target', fontsize=20)
plt.xlabel('Claim', fontsize=20)
plt.ylabel('Frequency', fontsize=20)  # bar heights are counts; percentages are annotated on the bars
ax.set_ylim(top=700000)

As we can see, the target is highly imbalanced: it is 1 for only 3.64% of the records in the training data. So, a naive classifier that simply predicts 0 for every row would reach an accuracy of about 96.36% without identifying a single claim. Likewise, a model trained on such imbalanced data can achieve high accuracy simply by being biased towards labelling records as 0, the majority class, regardless of the data it is asked to predict. We will have to use a [strategy](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/) to overcome this problem.
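The effect can be sketched with synthetic labels that have roughly the same imbalance. The labels are randomly generated, and random undersampling is shown as just one possible remedy, not necessarily the one used later.

```python
import numpy as np

# Hypothetical labels with roughly the same imbalance as the training data
rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.0364).astype(int)

# A naive "classifier" that predicts 0 for every row
y_pred = np.zeros_like(y)
accuracy = (y_pred == y).mean()
recall = y_pred[y == 1].mean()  # share of true claims that were caught

# One simple remedy (an assumption for illustration): randomly undersample
# the majority class until both classes have the same number of rows
pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.choice(np.flatnonzero(y == 0), size=len(pos_idx), replace=False)
balanced_idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
balanced_share = y[balanced_idx].mean()

print(f"accuracy={accuracy:.4f}, recall={recall:.1f}")
print(f"positive share after undersampling={balanced_share:.2f}")
```

The naive predictor scores high on accuracy but catches none of the claims, which is why accuracy alone is a poor yardstick here.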

## Exploratory Visualization

### Datatype check

To visualize the data, it is a good idea to check which datatypes the training data is made up of. Based on a trick shared by a fellow Kaggler, we can get the count of each datatype as follows:

In [None]:
Counter(train.dtypes.values)

So, there are only 2 datatypes, int and float, in the training data. We can therefore divide the training data into 2 parts:

In [None]:
train_float = train.select_dtypes(include=['float64'])
train_int = train.select_dtypes(include=['int64'])

Column names in the data provided by Porto Seguro contain abbreviations such as "bin", "cat" and "reg": bin indicates binary features and cat indicates categorical features, while the rest are either continuous or ordinal features. So, train_float holds the continuous features, whereas train_int holds the binary, categorical and ordinal features.
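The naming convention can be turned into a small helper. The sketch below uses a handful of column names from the dataset; the function `feature_kind` is a hypothetical illustration and not part of any library.

```python
# A few column names from the dataset
columns = ["ps_ind_10_bin", "ps_car_03_cat", "ps_reg_03", "ps_car_14", "ps_reg_01"]

def feature_kind(name):
    """Classify a column by its suffix, following the naming convention above."""
    if name.endswith("_bin"):
        return "binary"
    if name.endswith("_cat"):
        return "categorical"
    return "continuous or ordinal"

kinds = {name: feature_kind(name) for name in columns}
for name, kind in kinds.items():
    print(f"{name:15s} {kind}")
```

Telling continuous from ordinal features still requires looking at the dtype, since the name alone does not distinguish them.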

### Metadata

I really liked the idea of metadata shared by a fellow Kaggler. So, we can create a dataframe to store metadata for the training data, such as:
* **role**: input, ID, target
* **level**: nominal, interval, ordinal, binary
* **keep**: True or False
* **dtype**: int, float, str

In [None]:
data = []
for f in train.columns:
    # Defining the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    # Defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
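    # pandas dtype helpers avoid platform-dependent comparisons with int/float
    elif pd.api.types.is_float_dtype(train[f]):
        level = 'interval'
    elif pd.api.types.is_integer_dtype(train[f]):
        level = 'ordinal'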
        
    # Initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    # Defining the data type 
    dtype = train[f].dtype
    
    # Creating a Dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)

So, we can get a nice summary of the metadata as below:

In [None]:
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
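Besides summarizing, the metadata makes it easy to select groups of variables or to flag one for removal. The sketch below uses a small hypothetical stand-in (`meta_demo`) with the same columns as `meta`, so that it runs on its own.

```python
import pandas as pd

# Small hypothetical stand-in for the `meta` dataframe built above
meta_demo = pd.DataFrame(
    {
        "varname": ["id", "target", "ps_ind_10_bin", "ps_car_03_cat", "ps_reg_03"],
        "role": ["id", "target", "input", "input", "input"],
        "level": ["nominal", "binary", "binary", "nominal", "interval"],
        "keep": [False, True, True, True, True],
    }
).set_index("varname")

# Select the names of all nominal variables that are still marked as kept
nominal_vars = meta_demo[(meta_demo.level == "nominal") & (meta_demo.keep)].index.tolist()

# Flag a variable for removal without touching the data itself
meta_demo.loc["ps_car_03_cat", "keep"] = False
kept_inputs = meta_demo[(meta_demo.role == "input") & (meta_demo.keep)].index.tolist()

print(nominal_vars)
print(kept_inputs)
```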

### **Correlation Plots**

Correlation plots are useful for checking whether there is high correlation between variables. This will help in deciding whether PCA would be worthwhile.

#### **Correlation of float features**

In [None]:
colormap = plt.cm.magma
plt.figure(figsize=(16,12))
plt.title('Pearson correlation of continuous features', y=1.05, size=15)
sns.heatmap(train_float.corr(),linewidths=0.1,vmax=1.0, square=True, 
            cmap=colormap, linecolor='white', annot=True)

From the correlation plot, we can see that the majority of the features display little or no correlation with one another. Only the pairs (ps_reg_01, ps_reg_03), (ps_reg_02, ps_reg_03), (ps_car_12, ps_car_13) and (ps_car_13, ps_car_15) show high correlation. So, PCA might not make much difference on the training data, as the number of correlated variables is low.
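Reading pairs off a heatmap by eye can be complemented by listing them programmatically. The sketch below does this on synthetic data; the column names `a`, `b`, `c` and the 0.5 threshold are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): b is built from a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),
    "c": rng.normal(size=1000),
})

corr = demo.corr()
# Keep only the upper triangle so each pair appears once, without the diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.5]  # 0.5 is an arbitrary threshold

print(strong_pairs)
```

The same pattern should carry over to `train_float.corr()`, although the threshold would need to be chosen with the data in mind.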

#### **Correlation of integer features**

In [None]:
data = [
    go.Heatmap(
        z= train_int.corr().values,
        x=train_int.columns.values,
        y=train_int.columns.values,
        colorscale='Viridis',
        reversescale = False,
        opacity = 1.0 )
]

layout = go.Layout(
    title='Pearson Correlation of Integer-type features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')

The heat map shown above uses plot.ly, following an approach shared on Kaggle. The x and y axes take the column names, while the correlation value is provided on the z-axis. It is quite evident that most of the columns with integer datatype are also not linearly correlated with one another.

### **Binary features inspection**

In [None]:
bin_col = [col for col in train.columns if '_bin' in col]
zero_list = []
one_list = []
for col in bin_col:
    zero_list.append((train[col]==0).sum())
    one_list.append((train[col]==1).sum())
    
trace1 = go.Bar(
    x=bin_col,
    y=zero_list ,
    name='Zero count'
)
trace2 = go.Bar(
    x=bin_col,
    y=one_list,
    name='One count'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title='Count of 1 and 0 in binary variables'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='stacked-bar')

From this plot we can observe that **ps_ind_10_bin, ps_ind_11_bin, ps_ind_12_bin, ps_ind_13_bin** are almost completely dominated by zeros, so they are unlikely to be of much use in our prediction model.
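The same check can be done numerically. The sketch below uses toy columns with made-up names and values; the 1% cut-off is an assumption for illustration.

```python
import pandas as pd

# Toy binary columns (hypothetical names and values)
toy_bin = pd.DataFrame({
    "dominated_bin": [0] * 999 + [1],
    "balanced_bin": [0, 1] * 500,
})

# For 0/1 columns the mean equals the share of ones
share_of_ones = toy_bin.mean()

# Flag columns where fewer than 1% of rows are 1 (threshold is an assumption)
near_constant = share_of_ones[share_of_ones < 0.01].index.tolist()

print(share_of_ones)
print(near_constant)
```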

### **Categorical and Ordinal Feature Inspection**

#### **Checking the cardinality of the categorical variables**

Cardinality refers to the number of different values in a variable. As we will create dummy variables from the categorical variables later on, we need to check whether there are variables with many distinct values. We should handle these variables differently as they would result in many dummy variables.

In [None]:
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))

Only ps_car_11_cat has many distinct values, although it is still reasonable.
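To see why cardinality matters for dummy variables, here is a minimal sketch on toy data. The column names and values are hypothetical. Each distinct value becomes its own dummy column.

```python
import pandas as pd

# Toy categorical columns (hypothetical names and values)
toy_cat = pd.DataFrame({
    "low_card_cat": [0, 1, 0, 1, 0, 1],
    "high_card_cat": [10, 11, 12, 13, 14, 15],
})

cardinality = toy_cat.nunique()
# One dummy column is created per distinct value of each variable
dummies = pd.get_dummies(toy_cat, columns=["low_card_cat", "high_card_cat"])

print(cardinality)
print(dummies.shape)
```

The number of dummy columns equals the sum of the cardinalities, so a single high-cardinality variable can dominate the width of the encoded dataset.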

#### **Feature Importance**

We can obtain feature importances using the RandomForestClassifier of sklearn. Having trained the Random Forest, we can read the list of feature importances from the attribute `feature_importances_` and display all the features ranked by importance, from highest to lowest, in a plotly barplot as follows:

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=150, max_depth=8, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=0)
rf.fit(train.drop(['id', 'target'],axis=1), train.target)
features = train.drop(['id', 'target'],axis=1).columns.values
print("----- Training Done -----")

In [None]:
x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features), 
                                                            reverse = False)))
trace2 = go.Bar(
    x=x ,
    y=y,
    marker=dict(
        color=x,
        colorscale = 'Viridis',
        reversescale = True
    ),
    name='Random Forest Feature importance',
    orientation='h',
)

layout = go.Layout(
    width = 900, height = 2000,
    title='Barplot of Feature importances',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
    ))

fig1 = go.Figure(data=[trace2], layout = layout)

py.iplot(fig1, filename='plots')