# Kaggle Competiton | BNP Paribas Cardif Claims Management

>We need to build a model to quickly and efficiently classify BNP Paribas Cardif claims. In order to learn from the train dataset we need first to process the data. We will include the steps detailed in the previous notebook (check_input_data.ipynb) and the classes are described in data_modifier.py. 

>Thus, in the current notebook we will set up a protocol for processing the data and explore possible errors.

Go to the official page of the [Kaggle Competition.](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management)

### Goal for this Notebook:
* Develope a protocol for processing data to include it into a pipeline
* Develope new classes and methods (data_modifier.py) 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from scipy import stats
from data_modifier import *

%load_ext autoreload
%autoreload 2

  if 'order' in inspect.getargspec(np.copy)[0]:


### 1. Process Data

Just for giving a try using given train and test datasets.

In [2]:
train = pd.read_csv("../../../github_data/bnp_paribas_cardif_data/train.csv")
test = pd.read_csv("../../../github_data/bnp_paribas_cardif_data/test.csv")

In [3]:
y = train.target
columns = train.columns
x_train = train[columns[2:]]
x_test = train[columns[1:]]

#### 1. Transform Null to NaN

In [4]:
nton = NulltoNanTrans()
nton = nton.fit(x_train)
x_tr_ntont = nton.transform(x_train)
x_te_ntont = nton.transform(x_test)

NullToNaNTrans fit done.
NullToNaNTrans transform done.
NullToNaNTrans transform done.


#### 2. Process Continous Float Data
* Select Continous data

In [5]:
dat = DataSpliterTrans(dtype=np.float64)
dat = dat.fit(x_tr_ntont)
x_tr_datct = dat.transform(x_tr_ntont)
x_te_datct = dat.transform(x_te_ntont)

DataSpliterTrans fit done.
DataSpliterTrans transform done.
DataSpliterTrans transform done.


* Substitute NaN values by the median using Imputer from Sklearn

In [6]:
## if nan change nan to median
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
imp = imp.fit(x_tr_datct)
x_tr_ntovt = imp.transform(x_tr_datct)
x_te_ntovt = imp.transform(x_te_datct)

In [7]:
x_tr_ntovt

array([[ 1.33573942,  8.72747444,  3.92102575, ...,  2.02428538,
         0.63636451,  2.85714374],
       [ 1.4695499 ,  7.02380312,  4.20599079, ...,  1.95782501,
         1.56013756,  1.58940327],
       [ 0.94387691,  5.3100792 ,  4.41096869, ...,  1.12046842,
         0.88311753,  1.1764715 ],
       ..., 
       [ 1.4695499 ,  7.02380312,  4.20599079, ...,  2.41760583,
         1.56013756,  1.58940327],
       [ 1.4695499 ,  7.02380312,  4.20599079, ...,  3.52664991,
         1.56013756,  1.58940327],
       [ 1.61976313,  7.93297797,  4.6400847 , ...,  1.60449252,
         1.78761032,  1.38613767]])

The continous data is already processed and given as a matrix and will be merge with the next two.

#### 3. Process Categorical Integer Data
* Select data type integers from the data since in this case all are categorical

In [8]:
dat = DataSpliterTrans(dtype=np.int)
dat = dat.fit(x_tr_ntont)
x_tr_datit = dat.transform(x_tr_ntont)
x_te_datit = dat.transform(x_te_ntont)

DataSpliterTrans fit done.
DataSpliterTrans transform done.
DataSpliterTrans transform done.


* Substitute NaN values by the most frequent using Imputer from Sklearn (despite of not NaN found in first check exploration)

In [9]:
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp = imp.fit(x_tr_datit)
x_tr_ntovt2 = imp.transform(x_tr_datit)
x_te_ntovt2 = imp.transform(x_te_datit)

* Transform categories into boolean features to easyly learn from them with OneHotEncoder 

In [10]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc = enc.fit(x_tr_ntovt2)
x_tr_catobit = enc.transform(x_tr_ntovt2)
x_te_catobit = enc.transform(x_te_ntovt2)

In [11]:
x_tr_catobit

<114321x43 sparse matrix of type '<class 'numpy.float64'>'
	with 457284 stored elements in Compressed Sparse Row format>

The integer categorical data is already processed and given as a matrix and will be merge with the other two. The matrix contain final 43 features.

#### 4. Process Categorical Integer Data
* Select data type objects from the data

In [12]:
dat = DataSpliterTrans(dtype=np.object)
dat = dat.fit(x_tr_ntont)
x_tr_datot = dat.transform(x_tr_ntont)
x_te_datot = dat.transform(x_te_ntont)

DataSpliterTrans fit done.
DataSpliterTrans transform done.
DataSpliterTrans transform done.


* Change the string categories to integer categories through the new class created in data_modifier.py

In [13]:
cat = ObjtoCatStrtoIntTrans()
cat = cat.fit(x_tr_datot)
x_tr_catot = cat.transform(x_tr_datot)
x_tr_catot = cat.transform(x_te_datot)

ObjtoCatStrtoIntTrans fit done.
(114321, 19)
ObjtoCatStrtoIntTrans transform done.
(114321, 19)
ObjtoCatStrtoIntTrans transform done.


* Substitute NaN values by the most frequent using Imputer from Sklearn

In [14]:
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp = imp.fit(x_tr_catot)
x_tr_ntovt3 = imp.transform(x_tr_catot)
x_te_ntovt3 = imp.transform(x_tr_catot)

* Transform categories into boolean features to easyly learn from them with OneHotEncoder

In [15]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc = enc.fit(x_tr_ntovt3)
x_tr_catobit = enc.transform(x_tr_ntovt3)
x_te_catobit = enc.transform(x_te_ntovt3)

In [16]:
x_tr_catobit

<114321x18574 sparse matrix of type '<class 'numpy.float64'>'
	with 2172099 stored elements in Compressed Sparse Row format>

The string categorical data is already processed and given as a matrix and will be merge with the other two. The matrix contain final 18574 features.

Now that we have three matrices with the processed data and all methods are working fine, we will build a pipeline in data_modifier.py that will perform all this steps automatically.

In the next notebook (cv_roc.ipynb) we will apply the pipeline, perform the first prediction and evaluate it.