# BNP-Paribas Kaggle Challenge 
Adam Li, adam2392

Submission of the data should be 2 columns:
'id' and 'predictedprob'.


https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/details/evaluation

Evaluation:
The evaluation metric for this competition is Log Loss

$$logloss = −1/N \sum_{i=1}^{N} (y_i*log(p_i) + (1−y_i)*log(1−p_i))$$

where 
* N is the number of observations, 
* loglog is the natural logarithm, 
* yiyi is the binary target, and 
* pipi is the predicted probability that yiyi equals 1.

Note: the actual submitted predicted probabilities are replaced with max(min(p,1−10−15),10−15)



In [2]:
# Import Necessary Libraries
import numpy as np
import os, csv, json

from matplotlib import *
from matplotlib import pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
import scipy
from sklearn.decomposition import PCA
import skimage.measure

# pretty charting
import seaborn as sns
sns.set_palette('muted')
sns.set_style('darkgrid')

%matplotlib inline

## Problems With Data?
Just looking at the dataset, there are a couple of problems:

1. Are there missing entries in the data?
- Yes there are missing data points within the dataset, just by opening train.csv, you can see that.
2. Are there different types of data?
- Yes, there is both categorical and numerical data. So we should try to answer how to handle each type of data. There is also binary data, it seems; some fields 0, some are 1. There are no ordinal variables at least.

Things to Note/Remember:
column['target'] is the vector for which prediction we want. It is equal to 1 for claims suitable for an accelerated approval.
column['id'] is the vector for which person it is?

In [12]:
#### RUN AT BEGINNING AND TRY NOT TO RUN AGAIN - TAKES WAY TOO LONG ####
# load in the feature data
column = {}
list_of_features = []
with open('train.csv', 'rb') as file:
    reader = csv.reader(file)
    headers = reader.next()
    
    # get headers
    for h in headers:
        column[h] = []
    
    # append data
    for row in reader:
        # just create a matrix of all features
        list_of_features.append(row)
        
        for h, v in zip(headers, row):
            # create a dict to access specific columns
            column[h].append(v)
            
# conver to a numpy matrix
list_of_features = np.array(list_of_features)

In [13]:
# how many variables do we have?
print len(column.keys())
print column.keys()
print len(column['v18'])

print list_of_features.shape

133
['v18', 'v19', 'v12', 'v13', 'v10', 'v11', 'v16', 'v17', 'v14', 'v15', 'v118', 'v119', 'v114', 'v115', 'v116', 'v117', 'v110', 'v111', 'v112', 'v113', 'v89', 'v88', 'v85', 'v84', 'v87', 'v86', 'v81', 'v80', 'v83', 'v82', 'v69', 'v68', 'v67', 'v66', 'v65', 'v64', 'v63', 'v62', 'v61', 'v60', 'v92', 'v93', 'v90', 'v91', 'v96', 'v97', 'v94', 'v95', 'v107', 'v106', 'v98', 'v99', 'v103', 'v102', 'v101', 'v100', 'v105', 'v104', 'v78', 'v79', 'v74', 'v75', 'v76', 'v77', 'v70', 'v71', 'v72', 'v73', 'v130', 'v131', 'v125', 'v124', 'v127', 'v126', 'v121', 'v120', 'v123', 'v122', 'v129', 'v128', 'v41', 'v40', 'v43', 'v42', 'v45', 'v44', 'v47', 'v46', 'v49', 'v48', 'v23', 'v22', 'v21', 'v20', 'v27', 'v26', 'v25', 'v24', 'v29', 'v28', 'target', 'v56', 'v57', 'v54', 'v55', 'v52', 'v53', 'v50', 'v51', 'v109', 'v58', 'v59', 'v30', 'v31', 'v32', 'v33', 'v34', 'v35', 'v36', 'v37', 'v38', 'v39', 'v108', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'ID']
114321
(114321, 133)


# Imputation
Need to fill in data with imputed values because there are too many problems.

1. What package can we use?
2. What imputation method should we use?

Answers:
1. http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
2. replace with mean for now... but need better method


In [62]:
####################### NORMALIZE DATA AND SAVE #######################
# write new list_of_features to new txt file
csvfile = "data_normalized/data_numerical_normalized.txt"

numcols = list_of_features.shape[1]
numrows = list_of_features.shape[0]

new_list_of_features = []
categorical_cols = []
numerical_cols = []

# extract data that is only numerical
for i in range(0, numcols):
    col = list_of_features[:,i]
    
    try:
        float(col[0])
        new_list_of_features.append(col.T)
        numerical_cols.append(i)
    except:
        # do nothing
        categorical_cols.append(i)

# convert to np matrix
new_list_of_features = np.array(new_list_of_features).T
print new_list_of_features.shape
print new_list_of_features.shape[1]

## Normalize
for i in range(0, new_list_of_features.shape[1]):
    col = new_list_of_features[:,i]
    new_col = []
    for j in col:
        try:
            new_col.append(float(j))
        except:
            new_col.append(j)
    
    max_col = max(new_col)
    (new_col) = np.asarray(new_col)/np.asarray(max_col)
    new_list_of_features[:,i] = new_col


#Assuming res is a flat list
# with open(csvfile, "w") as output:
#     # write to new file the data
#     writer = csv.writer(output, lineterminator='\n')
#     for row in range(0, len(locations)):
#         writer.writerow(locations[row,:])

(114321, 129)
129


TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [7]:
letters = ['a','b','c','d','e','f','g']
>>> [ord(x) for x in letters]
[97, 98, 99, 100, 101, 102, 103]

AttributeError: 'dict' object has no attribute 'shape'