# Problem Definition

## Processing data headers

Here we will load and develop initial expectations from the training data. The first step is to clean and add sensible variable headers to be able to describe features.

In [1]:
import pandas as pd
import numpy as np
import re
# Load the tab-delimited training data
train_data = pd.read_csv("./data/ticdata2000.txt",
                         sep = "\t", header= None)

test_data = pd.read_csv("./data/ticeval2000.txt",
                        sep = "\t", header = None)
# Load data dict, clean to add headers
with open("./data/dictionary.txt","r") as f:
    line = f.readlines()
line_clean = [re.sub("\n|\...","",x) for x in line[3:89]]
line_clean = [re.sub("\<","less_than_",x) for x in line_clean]
line_clean = [re.sub("\>","higher_than_",x) for x in line_clean]
line_clean = [re.sub("\(|\)|\/","",x) for x in line_clean]
line_clean = [re.sub("\-","_",x) for x in line_clean]
line_clean = [re.sub("^[0-9]{1} |^[0-9]{2} ","",x) for x in line_clean]
line_clean = [re.sub(" ","_",x) for x in line_clean]
line_clean = [re.sub("___|__","_",x) for x in line_clean]
line_clean = [re.sub("\_$","",x) for x in line_clean]

train_data.columns = line_clean
test_data.columns = line_clean[0:-1]

## One-hot encoding categorical features

Here our goal is to perform a first pass examination of raw features and determine whether we need to one-hot encode certain categorical variables to be able to interpret them. As we explore these features, we will implement a utility which can process any training/testing data set consistently.



In [2]:
# Categorical features to one-hot encode
categorical_features = ['MOSTYPE_Customer_Subtype_see_L0',
                        'MGEMLEEF_Avg_age_see_L1',
                        'MOSHOOFD_Customer_main_type_see_L2',
                        'MGODRK_Roman_catholic_see_L3',
                        'PWAPART_Contribution_private_third_party_insurance_see_L4'
                       ]


In [3]:
# Look at the subset of training data without categorical features
train_subset = train_data[[x for x in train_data.columns if x not in categorical_features]]

In [4]:
# Unique values of each variable
train_subset.apply(axis=0,func= lambda x: [np.unique(x)])

MAANTHUI_Number_of_houses_1_–_10                              [[1, 2, 3, 4, 5, 6, 7, 8, 10]]
MGEMOMV_Avg_size_household_1_–_6                                           [[1, 2, 3, 4, 5]]
MGODPR_Protestant                                           [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MGODOV_Other_religion                                                   [[0, 1, 2, 3, 4, 5]]
MGODGE_No_religion                                          [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MRELGE_Married                                              [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MRELSA_Living_together                                            [[0, 1, 2, 3, 4, 5, 6, 7]]
MRELOV_Other_relation                                       [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MFALLEEN_Singles                                            [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MFGEKIND_Household_without_children                         [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MFWEKIND_Household_with_children                            [[0, 1, 2,

Since we don't have explicit interpretation of any other variables, we will assume they are quantitative (or ordinal) variables (e.g: MBERBOER_Farmer, it is not clear what exactly 0-9 means from the data description).

In [5]:
# Note we need to convert an otherwise numeric variable to object to be able to use get_dummies
train_data_cat = pd.get_dummies(train_data[categorical_features].astype("object"))
test_data_cat = pd.get_dummies(test_data[categorical_features].astype("object"))

In [6]:
# Keep category levels that exist in both data sets
keep = [x for x in np.intersect1d(train_data_cat.columns,test_data_cat.columns)]
train_data_cat = train_data_cat[keep]
test_data_cat = test_data_cat[keep]

In [7]:
# Replace categorical features with original ones
train_data = pd.concat([train_data.drop(labels=categorical_features, axis=1),train_data_cat],axis = 1)
test_data = pd.concat([test_data.drop(labels=categorical_features, axis=1),test_data_cat],axis = 1)

In [10]:
# Dimensions of the training and test data sets
print(train_data.shape)
print(test_data.shape)

(5822, 150)
(4000, 149)


In [12]:
# Serialize and store data sets
import pickle as pkl
with open("./data/train_data.pkl","wb") as f:
    pkl.dump(train_data,f)
with open("./data/test_data.pkl","wb") as f:
    pkl.dump(test_data,f)    

## Exploring the target variable along with features


In [1]:
import pandas as pd
import numpy as np
import pickle as pkl
with open("./data/train_data.pkl",'rb') as f:
    train_data = pkl.load(f)

In [8]:
train_data.CARAVAN_Number_of_mobile_home_policies_0_1.value_counts()

0    5474
1     348
Name: CARAVAN_Number_of_mobile_home_policies_0_1, dtype: int64

This turns out to be a highly skewed or imbalanced target variable! Are there any features that are strongly associated with having an insuarance policy?

In [22]:
# Goal: build marginal logistic regression models with each feature and target
from sklearn.linear_model import LogisticRegression
models = {} # A dict to hold models that are built
y_train = train_data.CARAVAN_Number_of_mobile_home_policies_0_1
x_train = train_data.drop('CARAVAN_Number_of_mobile_home_policies_0_1', axis = 1)
glm = LogisticRegression(solver='lbfgs')
for i in range(1,x_train.shape[1]):
    model_name = x_train.columns[i]
    models[model_name] = glm.fit(X = x_train.iloc[:,i].values.reshape(-1,1), 
                                 y = y_train)

In [27]:
models['AAANHANG_Number_of_trailer_policies'].coef_

array([[0.72537909]])