# Problem Definition

## Processing data headers

Here we load the training data and form initial expectations about it. The first step is to clean the data dictionary and attach descriptive column headers, so that each feature can be referred to by name.

In [1]:
import pandas as pd
import numpy as np
import re
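# Load the tab-delimited training data (the file has no header row)
train_data = pd.read_csv("./data/ticdata2000.txt",
                         sep="\t", header=None)
# Load the data dictionary and clean its lines into column headers
with open("./data/dictionary.txt", "r") as f:
    line = f.readlines()
# line[3:89] selects the 86 dictionary lines used as column headers.
# Raw strings keep the regex escapes from triggering invalid-escape warnings.
# Strip newlines (and, per the second alternative, a literal "." plus the next two characters)
line_clean = [re.sub(r"\n|\...", "", x) for x in line[3:89]]
# Spell out comparison signs so they survive as valid identifier text
line_clean = [re.sub("<", "less_than_", x) for x in line_clean]
line_clean = [re.sub(">", "higher_than_", x) for x in line_clean]
# Drop parentheses and slashes, and turn hyphens into underscores
line_clean = [re.sub(r"[()/]", "", x) for x in line_clean]
line_clean = [re.sub("-", "_", x) for x in line_clean]
# Remove the leading one- or two-digit variable number
line_clean = [re.sub(r"^[0-9]{1,2} ", "", x) for x in line_clean]
# Replace spaces with underscores, collapse repeats, trim a trailing underscore
line_clean = [re.sub(" ", "_", x) for x in line_clean]
line_clean = [re.sub("___|__", "_", x) for x in line_clean]
line_clean = [re.sub(r"_$", "", x) for x in line_clean]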

train_data.columns = line_clean
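
The substitutions above can also be folded into one small helper, which makes the cleaning easier to reuse and to test on a single line. The following is a minimal sketch, not the notebook's own code: the name `clean_header` is hypothetical, the sample line is an assumed dictionary format reconstructed from the resulting column names, and the `\...` alternative of the first pattern is left out because its intent is not clear from the source.

```python
import re


def clean_header(raw_line):
    """Turn one data-dictionary line into a column name (hypothetical helper)."""
    name = raw_line.strip()                        # drop the trailing newline
    name = re.sub(r"^[0-9]{1,2} ", "", name)       # drop the leading variable number
    name = name.replace("<", "less_than_")         # spell out comparison signs
    name = name.replace(">", "higher_than_")
    name = re.sub(r"[()/]", "", name)              # remove parentheses and slashes
    name = re.sub(r"[-\s]", "_", name)             # hyphens and whitespace -> underscore
    name = re.sub(r"_+", "_", name)                # collapse repeated underscores
    return name.rstrip("_")                        # trim a trailing underscore


# Assumed example of a dictionary line (format inferred, not copied from the file)
print(clean_header("1 MOSTYPE Customer Subtype see L0\n"))
# MOSTYPE_Customer_Subtype_see_L0
```

With such a helper, the headers could be built in one pass as `[clean_header(x) for x in line[3:89]]`.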

## One-hot encoding categorical features

Our goal here is a first-pass examination of the raw features to determine which categorical variables need to be one-hot encoded before they can be interpreted. As we explore these features, we will build a utility that processes any training or test data set consistently.



In [8]:
# Categorical features to one-hot encode
categorical_features = ['MOSTYPE_Customer_Subtype_see_L0',
                        'MGEMLEEF_Avg_age_see_L1',
                        'MOSHOOFD_Customer_main_type_see_L2',
                        'MGODRK_Roman_catholic_see_L3',
                        'PWAPART_Contribution_private_third_party_insurance_see_L4'
                       ]


In [12]:
# Look at the subset of training data without categorical features
train_subset = train_data[[x for x in train_data.columns if x not in categorical_features]]

In [20]:
# Unique values of each variable
train_subset.apply(lambda x: [np.unique(x)], axis=0)

MAANTHUI_Number_of_houses_1_–_10                              [[1, 2, 3, 4, 5, 6, 7, 8, 10]]
MGEMOMV_Avg_size_household_1_–_6                                           [[1, 2, 3, 4, 5]]
MGODPR_Protestant                                           [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MGODOV_Other_religion                                                   [[0, 1, 2, 3, 4, 5]]
MGODGE_No_religion                                          [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MRELGE_Married                                              [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MRELSA_Living_together                                            [[0, 1, 2, 3, 4, 5, 6, 7]]
MRELOV_Other_relation                                       [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MFALLEEN_Singles                                            [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MFGEKIND_Household_without_children                         [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
MFWEKIND_Household_with_children                            [[0, 1, 2,

Since the data description gives no explicit interpretation for the remaining variables, we will assume they are quantitative (or ordinal). For example, it is not clear from the description what the values 0-9 of MBERBOER_Farmer actually mean.
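
A compact way to sanity-check that assumption is to summarize each remaining column by its number of distinct values and its range, rather than reading the full lists of unique values. The sketch below uses a small made-up frame in place of `train_subset`; the column names and values are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_subset; names and values are made up for illustration
toy_subset = pd.DataFrame({
    "MGODPR_Protestant": [0, 4, 9, 2],
    "MGODOV_Other_religion": [0, 1, 1, 5],
})

# One row per variable: how many distinct levels it has and the range they span
summary = pd.DataFrame({
    "n_unique": toy_subset.nunique(),
    "min": toy_subset.min(),
    "max": toy_subset.max(),
})
print(summary)
```

Columns with a small, contiguous integer range are at least consistent with an ordinal reading, although the summary alone cannot confirm what the levels mean.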

In [24]:
from sklearn.preprocessing import OneHotEncoder 
train_data_proc = OneHotEncoder.fit_transform(train_data, categorical_features = categorical_features)

TypeError: fit_transform() got an unexpected keyword argument 'categorical_features'
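
The error arises because `fit_transform` accepts no `categorical_features` argument. The call is also made on the `OneHotEncoder` class itself rather than on an instance. One possible way to encode only selected columns, sketched here under the assumption of a reasonably recent scikit-learn, is to wrap the encoder in a `ColumnTransformer`. The toy frames and the names `toy_train`, `toy_test` and `toy_categorical` are made up for illustration and stand in for the real data and `categorical_features`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the training and test data (values are made up)
toy_train = pd.DataFrame({
    "MOSTYPE": [1, 2, 1, 3],
    "MGODPR": [0, 4, 9, 2],
})
toy_test = pd.DataFrame({
    "MOSTYPE": [2, 4],   # level 4 never appears in toy_train
    "MGODPR": [5, 1],
})
toy_categorical = ["MOSTYPE"]

encoder = ColumnTransformer(
    transformers=[
        # One-hot encode the listed columns; unseen levels become all-zero rows
        ("onehot", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ],
    remainder="passthrough",  # keep the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# Fit on the training data only, then reuse the fitted encoder on the test data
train_encoded = encoder.fit_transform(toy_train)
test_encoded = encoder.transform(toy_test)

print(train_encoded.shape)
print(test_encoded)
```

Fitting once and calling `transform` on every later data set keeps the output columns identical across training and test data, which is the consistency the utility described above is meant to provide.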