## BOI Challenge: Feature Engineering

In the following cells, we pre-process the variables of the master data set. The engineering steps address:

1. Missing values
2. Zero values
3. Grouping variables such as Age into meaningful bins
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Standardise the values of the variables to the same range

In [23]:
#==============================================================================
# Import Packages
#==============================================================================
import os
import numpy as np
import pandas as pd
import seaborn as sns
import string

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

In [289]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

#Label Encoding
from sklearn.preprocessing import LabelEncoder

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [291]:
# Set Directory
os.chdir('C:/Users/aksha/Desktop/BOI_Challenge/')


# load dataset
master_data = pd.read_csv('Master_Data.csv')
print(master_data.shape)
master_data.head()

(12000, 16)


Unnamed: 0,Client_ID,Age,Gender,County,Income_Grp,Population,Density_km_sq,Province,Prod_Held_Count,Prv_Loan_Flag,Avg_amt_CA_txn,Num_txns,Lst_txn_amt,Mrchnt_cde,Lst_txn_Nrtve,Loan_Flag
0,1,36,M,Cork,10001-40000,519032.0,69.0,Munster,4.0,1.0,58,0,,,,0.0
1,2,43,M,Cavan,0-10000,73183.0,37.7,Ulster,4.0,0.0,2663,17,83.66,7211.0,THE BRIDGE LAUNDRY WICKLOW TOWN,0.0
2,3,32,F,Dublin,10001-40000,1273069.0,1380.8,Leinster,2.0,0.0,46,25,526.18,3667.0,LUXOR HOTEL/CASINO LAS VEGAS NV,0.0
3,4,52,M,Louth,40001-60000,122897.0,148.7,Leinster,2.0,1.0,0,13,70.68,5712.0,HARVEY NORMAN CARRICKMINES,0.0
4,5,63,F,Kilkenny,60001-100000,95419.0,46.0,Leinster,1.0,0.0,126,39,259.07,5999.0,PAYPAL *PETEWOODWAR 35314369001,0.0


In [292]:
master_data["Mrchnt_cde"] = master_data["Mrchnt_cde"].astype(object)
print(master_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 16 columns):
Client_ID          12000 non-null int64
Age                12000 non-null int64
Gender             12000 non-null object
County             11991 non-null object
Income_Grp         12000 non-null object
Population         11978 non-null float64
Density_km_sq      11978 non-null float64
Province           11978 non-null object
Prod_Held_Count    11999 non-null float64
Prv_Loan_Flag      11995 non-null float64
Avg_amt_CA_txn     12000 non-null int64
Num_txns           12000 non-null int64
Lst_txn_amt        8722 non-null float64
Mrchnt_cde         8722 non-null object
Lst_txn_Nrtve      8722 non-null object
Loan_Flag          10000 non-null float64
dtypes: float64(6), int64(4), object(6)
memory usage: 1.5+ MB
None


**Remove Insignificant variables**

We remove the variables that the previous step, Exploratory Data Analysis, found to be insignificant.

In [293]:
master_data = master_data.drop(['County','Population','Density_km_sq','Lst_txn_Nrtve','Mrchnt_cde'], axis = 1) 

In [294]:
print(master_data.info())
#len(master_data["Num_txns"].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 11 columns):
Client_ID          12000 non-null int64
Age                12000 non-null int64
Gender             12000 non-null object
Income_Grp         12000 non-null object
Province           11978 non-null object
Prod_Held_Count    11999 non-null float64
Prv_Loan_Flag      11995 non-null float64
Avg_amt_CA_txn     12000 non-null int64
Num_txns           12000 non-null int64
Lst_txn_amt        8722 non-null float64
Loan_Flag          10000 non-null float64
dtypes: float64(4), int64(4), object(3)
memory usage: 1.0+ MB
None


**Create categories for Ages and Transaction Amount**

In [295]:
master_data['Age_Grp'] = pd.cut(
    x=master_data['Age'],
    bins=[19, 29, 45, 60, 70, 200],
    labels=['<30', '30-45', '46-60', '61-70', '>70'])
master_data['Avg_CA_txn_amt_grp'] = pd.cut(
    x=master_data['Avg_amt_CA_txn'],
    bins=[-1, 1000, 2000, 3000, 4000, 5000, 6000, 10000],
    labels=['0-1000', '1000-2000', '2000-3000', '3000-4000', '4000-5000', '5000-6000', '6000+'])

master_data['Age_Grp'] = master_data['Age_Grp'].astype(object)
master_data['Avg_CA_txn_amt_grp'] = master_data['Avg_CA_txn_amt_grp'].astype(object)
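
The way `pd.cut` treats the bin edges matters for the groups created above. The following is a minimal sketch on made-up ages (the `ages` series is an illustration, not data from the challenge): with the default settings each bin is open on the left and closed on the right, so an age equal to the lowest edge (19) or below it falls outside every bin and becomes `NaN`.

```python
import pandas as pd

# Toy ages chosen to sit on and around the bin edges (illustrative values only)
ages = pd.Series([18, 19, 20, 29, 30, 45, 46, 70, 71])

# Same bins and labels as the Age_Grp cell above
age_grp = pd.cut(x=ages,
                 bins=[19, 29, 45, 60, 70, 200],
                 labels=['<30', '30-45', '46-60', '61-70', '>70'])

# Bins are (19, 29], (29, 45], (45, 60], (60, 70], (70, 200]:
# 18 and 19 match no bin and are returned as NaN
summary = pd.DataFrame({'Age': ages, 'Age_Grp': age_grp})
print(summary)
print('Ages outside every bin:', ages[age_grp.isna()].tolist())
```

Values that fall outside the bins stay missing after the cast to `object`, so it is worth checking how many rows are affected before moving on.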

In [296]:
#Dropping 'Age' and 'Avg CA Txn Amt' column
master_data = master_data.drop(['Age','Avg_amt_CA_txn'], axis = 1)

**Separate dataset into train and test**

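Before beginning to engineer our features, it is important to separate our data into training and testing sets. This avoids data leakage: every statistic used for engineering (mode, mean, frequent labels, scaling range) is learned from the training set only and then applied to the test set. The split below is positional rather than random (shuffle=False): the first 10,000 rows form the training set and the last 2,000 rows (the ones without a Loan_Flag, as the test set summary further down shows) form the test set, so the random_state seed has no effect here.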

In [297]:
# Let's separate into train and test set
# shuffle=False keeps the row order: the last 2000 rows become the test set,
# so random_state has no effect on this split

X_train, X_test, y_train, y_test = train_test_split(master_data, master_data.Loan_Flag,
                                                    test_size=2000,
                                                    random_state=0,shuffle=False)
X_train.shape, X_test.shape


((10000, 11), (2000, 11))
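
A small sketch of how `train_test_split` behaves with `shuffle=False`, using a made-up 12-row frame (the `toy` data is an assumption for illustration): the test set is simply the last `test_size` rows, and changing `random_state` does not change the result.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: the last two rows have no target, mimicking unlabelled records
toy = pd.DataFrame({'Client_ID': range(1, 13),
                    'Loan_Flag': [0.0, 1.0] * 5 + [np.nan, np.nan]})

# Two positional splits with different seeds
a_train, a_test = train_test_split(toy, test_size=2, shuffle=False, random_state=0)
b_train, b_test = train_test_split(toy, test_size=2, shuffle=False, random_state=99)

print(a_test['Client_ID'].tolist())   # the last two clients
print(a_train.equals(b_train))        # the seed makes no difference without shuffling
```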

**Missing values**

For categorical variables, we will fill missing information by adding an additional category: "missing"

In [298]:
# make a list of the categorical variables that contain missing values
vars_with_na = [var for var in master_data.columns if X_train[var].isnull().sum()>1 and X_train[var].dtypes=='O']

# print the variable name and the share of missing values (a proportion, not a percentage)
for var in vars_with_na:
    print(var, np.round(X_train[var].isnull().mean(), 3),  ' % missing values')

Province 0.002  % missing values


In [299]:
# function to replace NA in categorical variables
def fill_categorical_na(df, var_list):
    X = df.copy()
    X[var_list] = df[var_list].fillna('Missing')
    return X

In [301]:
# replace missing values with new label: "Missing"
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)

# check that we have no missing information in the engineered variables
X_train[vars_with_na].isnull().sum()

Province    0
dtype: int64

In [302]:
# check that test set does not contain null values in the engineered variables
tst_vars_with_na = [var for var in vars_with_na if X_test[var].isnull().sum()>0]

# print the variable name and the share of missing values (a proportion, not a percentage)
for var in tst_vars_with_na:
    print(var, np.round(X_test[var].isnull().mean(), 3),  ' % missing values')

**For numerical variables, we add an indicator variable that records which values were missing, and then replace the missing values in the original variable with the mean (continuous variables) or the mode, i.e. the most frequent value (discrete variables):**
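
The idea can be sketched on toy data as follows. The helper `impute_with_indicator` and the toy frames are hypothetical names introduced for illustration; the point is that the fill value is computed on the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd

def impute_with_indicator(train, test, var, strategy):
    """Hypothetical helper: flag missing rows, then fill with a train-only statistic."""
    if strategy == 'mode':
        fill_val = train[var].mode()[0]
    else:
        fill_val = train[var].mean()

    filled = []
    for df in (train, test):
        df = df.copy()
        # 1 where the original value was missing, 0 otherwise
        df[var + '_na'] = df[var].isnull().astype(int)
        df[var] = df[var].fillna(fill_val)
        filled.append(df)
    return filled[0], filled[1], fill_val

# Toy data (illustrative values only)
toy_train = pd.DataFrame({'Lst_txn_amt': [10.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({'Lst_txn_amt': [np.nan, 100.0]})

toy_train, toy_test, fill_val = impute_with_indicator(toy_train, toy_test, 'Lst_txn_amt', 'mean')
print(fill_val)
print(toy_test)
```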

In [303]:
# make a list of the numerical variables that contain missing values
vars_with_na = [var for var in master_data.columns if X_train[var].isnull().sum()>1 and X_train[var].dtypes!='O']

# print the variable name and the share of missing values (a proportion, not a percentage)
for var in vars_with_na:
    print(var, np.round(X_train[var].isnull().mean(), 3),  ' % missing values')

Prv_Loan_Flag 0.0  % missing values
Lst_txn_amt 0.328  % missing values


**'Prv_Loan_Flag' is a discrete variable, so we impute it with the mode; 'Lst_txn_amt' is a continuous variable, so we impute it with the mean.**

In [304]:
# list of numerical variables
num_vars = [var for var in master_data.columns if master_data[var].dtypes != 'O']

print('Number of numerical variables: ', len(num_vars))

# visualise the numerical variables
master_data[num_vars].head()

Number of numerical variables:  6


Unnamed: 0,Client_ID,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag
0,1,4.0,1.0,0,,0.0
1,2,4.0,0.0,17,83.66,0.0
2,3,2.0,0.0,25,526.18,0.0
3,4,2.0,1.0,13,70.68,0.0
4,5,1.0,0.0,39,259.07,0.0


In [305]:
#  list of discrete variables
discrete_vars = [var for var in num_vars if len(master_data[var].unique())<20 and var not in ['Client_ID','Loan_Flag']]

print('Number of discrete variables: ', len(discrete_vars))

# let's visualise the discrete variables
master_data[discrete_vars].head()

Number of discrete variables:  2


Unnamed: 0,Prod_Held_Count,Prv_Loan_Flag
0,4.0,1.0
1,4.0,0.0
2,2.0,0.0
3,2.0,1.0
4,1.0,0.0


In [306]:
# list of continuous variables
cont_vars = [var for var in num_vars if var not in discrete_vars+['Client_ID','Loan_Flag']]

print('Number of continuous variables: ', len(cont_vars))

# let's visualise the continuous variables
master_data[cont_vars].head()

Number of continuous variables:  2


Unnamed: 0,Num_txns,Lst_txn_amt
0,0,
1,17,83.66
2,25,526.18
3,13,70.68
4,39,259.07


In [307]:
# replace the missing values
for var in vars_with_na:

    # discrete variables get the mode, continuous ones the mean;
    # both statistics are computed on the train set only
    if var in discrete_vars:
        fill_val = X_train[var].mode()[0]
    else:
        fill_val = np.round(X_train[var].mean(),3)

    # train: flag the missing rows, then impute
    X_train[var+'_na'] = np.where(X_train[var].isnull(), 1, 0)
    X_train[var] = X_train[var].fillna(fill_val)

    # test: same flag, same fill value as the train set
    X_test[var+'_na'] = np.where(X_test[var].isnull(), 1, 0)
    X_test[var] = X_test[var].fillna(fill_val)

# check that we have no more missing values in the engineered variables
X_train[vars_with_na].isnull().sum()

Prv_Loan_Flag    0
Lst_txn_amt      0
dtype: int64

In [309]:
# check that test set does not contain null values in the engineered variables
tst_vars_with_na = [var for var in vars_with_na if X_test[var].isnull().sum()>0]

# print the variable name and the share of missing values (a proportion, not a percentage)
for var in tst_vars_with_na:
    print(var, np.round(X_test[var].isnull().mean(), 3),  ' % missing values')

**Categorical variables**

First, we group the categories that are present in less than 1% of the observations under a single label, 'Rare':

In [311]:
# let's capture the categorical variables first
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

In [312]:
def find_frequent_labels(df, var, rare_perc):
    # finds the labels that occur in more than a given fraction (rare_perc) of the observations
    df = df.copy()
    tmp = df.groupby(var)['Loan_Flag'].count() / len(df)
    return tmp[tmp>rare_perc].index

for var in cat_vars:
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')
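
As a minimal sketch of the same idea on toy data (the helper `group_rare_labels`, the toy series, and the 15% threshold are assumptions chosen so the effect is visible on ten rows): the frequent labels are learned from the training column, and any label outside that list, including one that only appears in the test column, is replaced by 'Rare'.

```python
import pandas as pd

def group_rare_labels(train_col, test_col, rare_perc):
    """Hypothetical helper: keep labels frequent in train, replace the rest with 'Rare'."""
    freq = train_col.value_counts(normalize=True)
    frequent = freq[freq > rare_perc].index
    # Series.where keeps values where the condition holds and substitutes 'Rare' elsewhere
    train_out = train_col.where(train_col.isin(frequent), 'Rare')
    test_out = test_col.where(test_col.isin(frequent), 'Rare')
    return train_out, test_out

# Toy provinces (illustrative values only): Ulster covers 10% of the train rows
toy_train = pd.Series(['Leinster'] * 6 + ['Munster'] * 3 + ['Ulster'])
toy_test = pd.Series(['Leinster', 'Ulster', 'Connacht'])

train_grouped, test_grouped = group_rare_labels(toy_train, toy_test, rare_perc=0.15)
print(train_grouped.value_counts())
print(test_grouped.tolist())
```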

**Next, we need to transform the strings of these variables into numbers. Nominal variables (Gender, Province) are label-encoded, and ordinal variables (Income_Grp, Age_Grp, Avg_CA_txn_amt_grp) are mapped to integers that follow their natural order:**

In [315]:
#Transforming Nominal Attributes
lbl_encde = LabelEncoder()
Gendr_label = lbl_encde.fit_transform(X_train["Gender"])
X_train["Gender_lbl"] = Gendr_label
X_train.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl
0,1,M,10001-40000,Munster,4.0,1.0,0,451.707,0.0,30-45,0-1000,0,1,1
1,2,M,0-10000,Ulster,4.0,0.0,17,83.66,0.0,30-45,2000-3000,0,0,1
2,3,F,10001-40000,Leinster,2.0,0.0,25,526.18,0.0,30-45,0-1000,0,0,0
3,4,M,40001-60000,Leinster,2.0,1.0,13,70.68,0.0,46-60,0-1000,0,0,1
4,5,F,60001-100000,Leinster,1.0,0.0,39,259.07,0.0,61-70,0-1000,0,0,0


In [316]:
#Transforming Nominal Attributes
# reuse the encoder fitted on the train set so that both sets share the same codes
Gendr_label = lbl_encde.transform(X_test["Gender"])
X_test["Gender_lbl"] = Gendr_label
X_test.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl
10000,10001,M,10001-40000,Munster,4.0,0.0,2,12.59,,46-60,0-1000,0,0,1
10001,10002,M,10001-40000,Munster,4.0,0.0,0,30.0,,<30,0-1000,0,0,1
10002,10003,F,10001-40000,Leinster,2.0,0.0,28,1003.01,,46-60,0-1000,0,0,0
10003,10004,M,60001-100000,Leinster,2.0,0.0,31,873.25,,30-45,0-1000,0,0,1
10004,10005,F,40001-60000,Leinster,1.0,0.0,12,926.75,,<30,0-1000,0,0,0


In [317]:
#Transforming Nominal Attributes
Province_label = lbl_encde.fit_transform(X_train["Province"])
X_train["Province_lbl"] = Province_label
X_train.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl,Province_lbl
0,1,M,10001-40000,Munster,4.0,1.0,0,451.707,0.0,30-45,0-1000,0,1,1,2
1,2,M,0-10000,Ulster,4.0,0.0,17,83.66,0.0,30-45,2000-3000,0,0,1,4
2,3,F,10001-40000,Leinster,2.0,0.0,25,526.18,0.0,30-45,0-1000,0,0,0,1
3,4,M,40001-60000,Leinster,2.0,1.0,13,70.68,0.0,46-60,0-1000,0,0,1,1
4,5,F,60001-100000,Leinster,1.0,0.0,39,259.07,0.0,61-70,0-1000,0,0,0,1


In [318]:
#Transforming Nominal Attributes
# reuse the encoder fitted on the train set so that both sets share the same codes
Province_label = lbl_encde.transform(X_test["Province"])
X_test["Province_lbl"] = Province_label
X_test.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl,Province_lbl
10000,10001,M,10001-40000,Munster,4.0,0.0,2,12.59,,46-60,0-1000,0,0,1,2
10001,10002,M,10001-40000,Munster,4.0,0.0,0,30.0,,<30,0-1000,0,0,1,2
10002,10003,F,10001-40000,Leinster,2.0,0.0,28,1003.01,,46-60,0-1000,0,0,0,1
10003,10004,M,60001-100000,Leinster,2.0,0.0,31,873.25,,30-45,0-1000,0,0,1,1
10004,10005,F,40001-60000,Leinster,1.0,0.0,12,926.75,,<30,0-1000,0,0,0,1
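
A label encoder assigns codes in sorted order of the labels it was fitted on, so the train and test codes only agree when the encoder is fitted once and reused. The sketch below uses made-up province lists (an assumption for illustration) to show how the codes drift when a second encoder is fitted on a test set that lacks some labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (illustrative): the test set lacks 'Connacht' and 'Rare'
train_prov = pd.Series(['Munster', 'Ulster', 'Leinster', 'Rare', 'Connacht'])
test_prov = pd.Series(['Munster', 'Leinster', 'Ulster'])

# Fit once on the train labels, then reuse the mapping on the test labels
encoder = LabelEncoder().fit(train_prov)
consistent_codes = encoder.transform(test_prov)

# Fitting a fresh encoder on the test labels yields a different mapping
refit_codes = LabelEncoder().fit_transform(test_prov)

print(list(encoder.classes_))
print(consistent_codes.tolist())
print(refit_codes.tolist())
```

Note that `transform` raises an error for a label the encoder has not seen, which is one reason to group rare and unseen labels before encoding.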


In [319]:
#Transforming Ordinal Attributes
# the mapping below assigns integers to the income bands in increasing order,
# so that a smaller value corresponds to a lower income band

Incme_ord_map = {'0-10000': 1, '10001-40000': 2, '40001-60000': 3, 
                 '60001-100000': 4, '100000+': 5,}
X_train["Income_Grp_lbl"] = X_train["Income_Grp"].map(Incme_ord_map)
X_test["Income_Grp_lbl"] = X_test["Income_Grp"].map(Incme_ord_map)
X_test.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl,Province_lbl,Income_Grp_lbl
10000,10001,M,10001-40000,Munster,4.0,0.0,2,12.59,,46-60,0-1000,0,0,1,2,2
10001,10002,M,10001-40000,Munster,4.0,0.0,0,30.0,,<30,0-1000,0,0,1,2,2
10002,10003,F,10001-40000,Leinster,2.0,0.0,28,1003.01,,46-60,0-1000,0,0,0,1,2
10003,10004,M,60001-100000,Leinster,2.0,0.0,31,873.25,,30-45,0-1000,0,0,1,1,4
10004,10005,F,40001-60000,Leinster,1.0,0.0,12,926.75,,<30,0-1000,0,0,0,1,3


In [320]:
#Transforming Ordinal Attributes
Age_grp_map = {'<30': 1, '30-45': 2, '46-60': 3, 
                 '61-70': 4, '>70': 5,}
X_train["Age_Grp_lbl"] = X_train["Age_Grp"].map(Age_grp_map)
X_test["Age_Grp_lbl"] = X_test["Age_Grp"].map(Age_grp_map)
X_test.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl,Province_lbl,Income_Grp_lbl,Age_Grp_lbl
10000,10001,M,10001-40000,Munster,4.0,0.0,2,12.59,,46-60,0-1000,0,0,1,2,2,3.0
10001,10002,M,10001-40000,Munster,4.0,0.0,0,30.0,,<30,0-1000,0,0,1,2,2,1.0
10002,10003,F,10001-40000,Leinster,2.0,0.0,28,1003.01,,46-60,0-1000,0,0,0,1,2,3.0
10003,10004,M,60001-100000,Leinster,2.0,0.0,31,873.25,,30-45,0-1000,0,0,1,1,4,2.0
10004,10005,F,40001-60000,Leinster,1.0,0.0,12,926.75,,<30,0-1000,0,0,0,1,3,1.0


In [321]:
#Transforming Ordinal Attributes
Avg_CA_txn_amt_grp_map = {'0-1000': 1, '1000-2000': 2, '2000-3000': 3, 
                 '3000-4000': 4, '4000-5000': 5,'5000-6000':6,'6000+':7}
X_train["Avg_CA_txn_amt_grp_lbl"] = X_train["Avg_CA_txn_amt_grp"].map(Avg_CA_txn_amt_grp_map)
X_test["Avg_CA_txn_amt_grp_lbl"] = X_test["Avg_CA_txn_amt_grp"].map(Avg_CA_txn_amt_grp_map)
X_test.head()

Unnamed: 0,Client_ID,Gender,Income_Grp,Province,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Loan_Flag,Age_Grp,Avg_CA_txn_amt_grp,Prv_Loan_Flag_na,Lst_txn_amt_na,Gender_lbl,Province_lbl,Income_Grp_lbl,Age_Grp_lbl,Avg_CA_txn_amt_grp_lbl
10000,10001,M,10001-40000,Munster,4.0,0.0,2,12.59,,46-60,0-1000,0,0,1,2,2,3.0,1.0
10001,10002,M,10001-40000,Munster,4.0,0.0,0,30.0,,<30,0-1000,0,0,1,2,2,1.0,1.0
10002,10003,F,10001-40000,Leinster,2.0,0.0,28,1003.01,,46-60,0-1000,0,0,0,1,2,3.0,1.0
10003,10004,M,60001-100000,Leinster,2.0,0.0,31,873.25,,30-45,0-1000,0,0,1,1,4,2.0,1.0
10004,10005,F,40001-60000,Leinster,1.0,0.0,12,926.75,,<30,0-1000,0,0,0,1,3,1.0,1.0
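
One thing to watch with dictionary-based ordinal encoding is that `Series.map` returns `NaN` for any label missing from the dictionary, such as 'Rare', and that a single `NaN` turns the whole column into floats. This would be consistent with the float values seen in Age_Grp_lbl above, although that is an inference rather than something the notebook states. A minimal sketch on a made-up series:

```python
import pandas as pd

# Same mapping as the Age_Grp cell above
Age_grp_map = {'<30': 1, '30-45': 2, '46-60': 3, '61-70': 4, '>70': 5}

# Toy labels (illustrative): 'Rare' has no entry in the mapping
age_grp = pd.Series(['<30', '30-45', 'Rare', '>70'])
age_lbl = age_grp.map(Age_grp_map)

# Labels that the mapping could not translate
unmapped = age_grp[age_lbl.isnull()].unique().tolist()

print(age_lbl.tolist())
print(age_lbl.dtype)
print('Unmapped labels:', unmapped)
```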


In [330]:
# check absence of na
[var for var in X_train.columns if X_train[var].isnull().sum()>0]

[]

**Removing Rare Labels**

In [327]:
#Removing rows with Rare labels and rows with a null Product Held Count
X_train = X_train[X_train['Age_Grp'] != 'Rare']
X_test = X_test[X_test['Age_Grp'] != 'Rare']
X_train = X_train[X_train["Avg_CA_txn_amt_grp"] != 'Rare']
X_test = X_test[X_test["Avg_CA_txn_amt_grp"] != 'Rare']
X_train = X_train[X_train['Prod_Held_Count'].notnull()]

**Correcting negative values**

In [328]:
#Keeping only rows with a positive Product Held Count (drops zero and negative values)
X_train = X_train[X_train['Prod_Held_Count']>0]
X_train['Prod_Held_Count'].unique()

array([4., 2., 1., 5., 3.])

In [329]:
# check absence of na
[var for var in X_test.columns if X_test[var].isnull().sum()>0]

['Loan_Flag']

In [335]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1711 entries, 10000 to 11999
Data columns (total 11 columns):
Client_ID                 1711 non-null int64
Prod_Held_Count           1711 non-null float64
Prv_Loan_Flag             1711 non-null float64
Num_txns                  1711 non-null int64
Lst_txn_amt               1711 non-null float64
Loan_Flag                 0 non-null float64
Gender_lbl                1711 non-null int32
Province_lbl              1711 non-null int32
Income_Grp_lbl            1711 non-null int64
Age_Grp_lbl               1711 non-null float64
Avg_CA_txn_amt_grp_lbl    1711 non-null float64
dtypes: float64(6), int32(2), int64(3)
memory usage: 147.0 KB


In [337]:
#Drop unnecessary columns 
X_train = X_train.drop(['Gender','Income_Grp','Province','Age_Grp','Avg_CA_txn_amt_grp','Prv_Loan_Flag_na','Lst_txn_amt_na'], axis = 1)
X_test = X_test.drop(['Gender','Income_Grp','Province','Age_Grp','Avg_CA_txn_amt_grp','Prv_Loan_Flag_na','Lst_txn_amt_na'], axis = 1)
X_test = X_test.drop(['Loan_Flag'], axis = 1)

In [338]:
#Cast the encoded and discrete columns to integer type
int_cols = ['Gender_lbl', 'Province_lbl', 'Income_Grp_lbl', 'Age_Grp_lbl',
            'Avg_CA_txn_amt_grp_lbl', 'Prod_Held_Count', 'Prv_Loan_Flag']

# the train set still carries the target, so cast it as well
X_train[int_cols + ['Loan_Flag']] = X_train[int_cols + ['Loan_Flag']].astype(int)
X_test[int_cols] = X_test[int_cols].astype(int)

### Feature Scaling

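Many models are sensitive to the scale of their inputs, so features should be scaled or normalised before modelling. In the next cells, we rescale the two continuous features to the [0, 1] range using the minimum and maximum of the training set; test values outside the training range can therefore fall below 0 or above 1: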

In [339]:
train_vars = [var for var in X_train.columns if var in ['Num_txns', 'Lst_txn_amt']]
train_vars

['Num_txns', 'Lst_txn_amt']

In [340]:
# fit scaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X_train[train_vars]) #  fit  the scaler to the train set for later use

X_train[train_vars] = scaler.transform(X_train[train_vars])
X_test[train_vars] = scaler.transform(X_test[train_vars])


In [341]:
X_test.head()

Unnamed: 0,Client_ID,Prod_Held_Count,Prv_Loan_Flag,Num_txns,Lst_txn_amt,Gender_lbl,Province_lbl,Income_Grp_lbl,Age_Grp_lbl,Avg_CA_txn_amt_grp_lbl
10000,10001,4,0,0.02,0.008049,1,2,2,3,1
10001,10002,4,0,0.0,0.027513,1,2,2,1,1
10002,10003,2,0,0.28,1.11532,0,1,2,3,1
10003,10004,2,0,0.31,0.970251,1,1,4,2,1
10004,10005,1,0,0.12,1.030062,0,1,3,1,1
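
Some scaled Lst_txn_amt values in the test set are greater than 1. That is expected when a test value exceeds the training maximum, because the scaler applies (x - train_min) / (train_max - train_min) with the statistics of the training set. A minimal sketch on made-up numbers (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (illustrative): 120 lies above the training maximum
train_vals = np.array([[0.0], [50.0], [100.0]])
test_vals = np.array([[25.0], [120.0]])

# Fit on the train values only, then apply the same transformation to the test values
toy_scaler = MinMaxScaler().fit(train_vals)
scaled_test = toy_scaler.transform(test_vals)

# The same result computed by hand from the training minimum and maximum
train_min, train_max = train_vals.min(axis=0), train_vals.max(axis=0)
manual = (test_vals - train_min) / (train_max - train_min)

print(scaled_test.ravel().tolist())
print(np.allclose(scaled_test, manual))
```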


In [342]:
# let's now save the train and test sets for the next notebook!

X_train.to_csv('xtrain.csv', index=False)
X_test.to_csv('xtest.csv', index=False)
#y_train.to_csv('ytrain.csv', index=False)

**Please refer to the next section BOI_Challenge_Featre_Selectn.ipynb**