# **Feature Engineering**
* Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of our machine learning algorithms.
* Some Feature Engineering techniques we will be applying are: Imputation,
Log Transformations, One-Hot Label Encoding, Grouping Operations and Scaling.

#### **Loading in and merging the datasets**

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_identity = pd.read_csv('train_identity.csv')
train_transaction = pd.read_csv('train_transaction.csv')

test_identity = pd.read_csv('test_identity.csv')
test_transaction = pd.read_csv('test_transaction.csv')

In [3]:
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True, on='TransactionID')
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True, on='TransactionID')

print(train.shape)
print(test.shape)

del train_transaction, train_identity, test_transaction, test_identity

(590540, 434)
(506691, 433)


### **Reducing Memory Usage**

In [4]:
from memory_reduction import reduce_mem_usage

train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

Memory usage of dataframe is 1955.37 MB
Memory usage after optimization is: 525.57 MB
Decreased by 73.1%
Memory usage of dataframe is 1673.87 MB
Memory usage after optimization is: 458.22 MB
Decreased by 72.6%


# **Feature Engineering**

In [5]:
#Change the columns in the test dataset to match those in the training set
test.columns = train.drop('isFraud',axis=1).columns

#### **Mapping email domains**

In [6]:
emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 
          'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft',
          'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',
          'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 
          'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink',
          'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other',
          'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 
          'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 
          'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo',
          'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
          'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft',
          'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 
          'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 
          'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 
          'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 
          'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 
          'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 
          'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
          'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}

us_emails = ['gmail', 'net', 'edu']


for c in ['P_emaildomain', 'R_emaildomain']:
    train[c + '_bin'] = train[c].map(emails)
    test[c + '_bin'] = test[c].map(emails)
    
    train[c + '_suffix'] = train[c].map(lambda x: str(x).split('.')[-1])
    test[c + '_suffix'] = test[c].map(lambda x: str(x).split('.')[-1])
    
    train[c + '_suffix'] = train[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    test[c + '_suffix'] = test[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')

### **Feature grouping and scaling**

* Here we are simply performing aggregations on what we determined to be the top numerical features in this data from the EDA stage of the project. We will be scaling the numerical features: **TransactionAmt,  id_02,  D10, dist1, C1, & C2.**

In [7]:
from feat_engineering_functions import feat_scale, comb_feat_scaling

#### **Train and Test dataset feature grouping & scaling**

In [8]:
train_encoded = train.copy()
test_encoded = test.copy()

train_encoded, test_encoded = feat_scale(train_encoded, test_encoded, 'TransactionAmt')

train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'TransactionAmt','card1')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'TransactionAmt','card4')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'TransactionAmt','card5')

train_encoded['TransactionAmt'] = np.log(train_encoded['TransactionAmt'])
test_encoded['TransactionAmt'] = np.log(test_encoded['TransactionAmt'])

train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'id_02','card1')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'id_02','card4')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'id_02','card5')

train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'D1','card1')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'D1','card4')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'D1','card5')

train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'D10','card1')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'D10','card4')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'D10','card5')

train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'C1','addr1')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'C1','addr2')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'C1','dist1')

train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'C2','addr1')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'C2','addr2')
train_encoded, test_encoded = comb_feat_scaling(train_encoded, test_encoded, 'C2','dist1')

### **Categorical Feature Label Encoding**

In [9]:
from feat_engineering_functions import label_encoding

#### **Train and Test dataset categorical feature label encoding**

In [10]:
categorical = ['ProductCD','card1','card2','card3','card4','card5','card6','addr1','addr2','P_emaildomain_bin'\
              ,'P_emaildomain_suffix','R_emaildomain_bin','R_emaildomain_suffix','M1','M2','M3','M4','M5','M6','M7','M8','M9'\
              ,'id_12','id_13','id_14','id_15','id_16','id_17','id_18','id_19','id_20','id_21','id_22'\
              ,'id_23','id_24','id_25','id_26','id_27','id_28','id_29','id_30','id_31','id_32','id_33'\
              ,'id_34','id_35','id_36','id_37','id_38','DeviceType','DeviceInfo']

train_encoded, test_encoded = label_encoding(train_encoded, test_encoded, categorical)             

In [11]:
train_encoded.drop(['P_emaildomain','R_emaildomain'], axis=1, inplace = True)
test_encoded.drop(['P_emaildomain','R_emaildomain'], axis=1, inplace = True)

In [12]:
print(train_encoded['isFraud'].value_counts(),'\n')
train_encoded.head(10)

0    569877
1     20663
Name: isFraud, dtype: int64 



Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,...,R_emaildomain_bin,R_emaildomain_suffix,TransactionAmt_min_mean,TransactionAmt_min_std,TransactionAmt_to_mean_card1,TransactionAmt_to_std_card1,TransactionAmt_to_mean_card4,TransactionAmt_to_std_card4,TransactionAmt_to_mean_card5,TransactionAmt_to_std_card5,id_02_to_mean_card1,id_02_to_std_card1,id_02_to_mean_card4,id_02_to_std_card4,id_02_to_mean_card5,id_02_to_std_card5,D1_to_mean_card1,D1_to_std_card1,D1_to_mean_card4,D1_to_std_card4,D1_to_mean_card5,D1_to_std_card5,D10_to_mean_card1,D10_to_std_card1,D10_to_mean_card4,D10_to_std_card4,D10_to_mean_card5,D10_to_std_card5,C1_to_mean_addr1,C1_to_std_addr1,C1_to_mean_addr2,C1_to_std_addr2,C1_to_mean_dist1,C1_to_std_dist1,C2_to_mean_addr1,C2_to_std_addr1,C2_to_mean_addr2,C2_to_std_addr2,C2_to_mean_dist1,C2_to_std_dist1
0,2987000,0,86400,4.226562,4,3417,500,42,1,38,1,166,65,19.0,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,13.0,13.0,...,6,6,-1.0,-1.0,0.19458,0.18456,0.257812,0.170241,0.357666,0.205331,0.42253,0.673947,0.413871,0.461706,0.346766,0.337136,2.664062,1.026518,0.269531,0.125694,0.277588,0.140884,2.021484,0.881627,0.12207,0.078395,0.190186,0.104673,0.126099,0.029831,0.104919,0.019378,0.120422,0.036014,0.130859,0.029246,0.110535,0.018442,0.134644,0.040731
1,2987001,0,86401,3.367188,4,7922,303,42,2,2,1,173,65,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,...,6,6,-1.0,-1.0,0.123779,0.063004,0.219116,0.114214,0.135376,0.067027,0.573879,0.591021,0.57046,0.622199,0.568846,0.631865,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.118713,0.022376,0.104919,0.019378,-1.0,-1.0,0.12439,0.021449,0.110535,0.018442,-1.0,-1.0
2,2987002,0,86469,4.078125,4,9383,389,42,4,58,2,178,65,287.0,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,315.0,...,6,6,-1.0,-1.0,0.608398,0.589226,0.443115,0.25855,0.603027,0.436054,1.057784,1.186661,1.095867,1.200162,1.1249,1.218236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110046,0.018915,0.104919,0.019378,0.456299,0.753371,0.116455,0.017686,0.110535,0.018442,0.567871,0.957617
3,2987003,0,86499,3.912109,4,6991,466,42,2,14,2,282,65,,,2.0,5.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,112.0,112.0,0.0,94.0,0.0,,,,,84.0,,...,6,6,-1.0,-1.0,0.405029,0.25946,0.377686,0.196921,0.399902,0.257159,1.304678,1.360737,1.278956,1.394954,1.292045,1.399203,0.783691,0.590902,1.233398,0.729629,0.944824,0.668494,0.543945,0.434038,0.718262,0.471344,0.538574,0.427838,0.262939,0.042893,0.209839,0.038755,-1.0,-1.0,0.685547,0.101281,0.552734,0.092209,-1.0,-1.0
4,2987004,0,86506,3.912109,1,9262,413,42,2,2,1,241,65,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,,,...,6,6,-1.0,-1.0,0.515625,0.882898,0.377686,0.196921,0.233398,0.115564,0.094114,0.073463,0.04301,0.046911,0.042888,0.04764,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.117126,0.01426,0.104919,0.019378,-1.0,-1.0,0.116882,0.013089,0.110535,0.018442,-1.0,-1.0
5,2987005,0,86510,3.892578,4,10366,454,42,4,108,2,132,65,36.0,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,0.0,...,6,6,-1.0,-1.0,0.365234,0.491192,0.368164,0.214728,0.347168,0.208137,0.73708,1.588894,0.349643,0.382919,0.349946,0.382569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.128418,0.026933,0.104919,0.019378,0.223145,0.057093,0.137939,0.026453,0.110535,0.018442,0.237305,0.063425
6,2987006,0,86522,5.070312,4,2009,259,42,4,58,2,17,65,0.0,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,0.0,...,6,6,-1.0,-1.0,1.560547,0.899948,1.194336,0.69677,1.625,1.175127,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.126587,0.016887,0.104919,0.019378,0.101746,0.032208,0.129517,0.015654,0.110535,0.018442,0.110901,0.035151
7,2987007,0,86529,6.046875,4,2360,389,42,4,108,2,173,65,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,...,6,6,-1.0,-1.0,2.994141,1.962651,3.173828,1.851479,2.994141,1.79465,0.180105,0.196428,0.18279,0.200187,0.182949,0.200004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.118713,0.022376,0.104919,0.019378,-1.0,-1.0,0.12439,0.021449,0.110535,0.018442,-1.0,-1.0
8,2987008,0,86535,2.708984,1,7962,0,42,4,108,2,183,65,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,,,...,6,6,-1.0,-1.0,0.105164,0.057256,0.112671,0.065733,0.106262,0.063715,0.657979,0.696879,0.663922,0.727108,0.664498,0.726444,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.112366,0.019056,0.104919,0.019378,-1.0,-1.0,0.117676,0.01794,0.110535,0.018442,-1.0,-1.0
9,2987009,0,86536,4.761719,4,6370,10,42,2,106,2,78,65,19.0,,2.0,2.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,1.0,0.0,12.0,2.0,61.0,61.0,30.0,318.0,30.0,,,,,40.0,302.0,...,6,6,-1.0,-1.0,0.958984,0.587465,0.883789,0.460794,1.041992,0.594646,1.429816,1.490656,1.481928,1.616335,1.471004,1.605648,0.730469,0.476714,0.671875,0.397387,0.587402,0.368665,0.333984,0.241096,0.342041,0.224449,0.346191,0.220714,0.224365,0.03717,0.209839,0.038755,0.240845,0.072029,0.234863,0.035297,0.221069,0.036884,0.269287,0.081461


In [13]:
test_encoded.head(10)

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,...,R_emaildomain_bin,R_emaildomain_suffix,TransactionAmt_min_mean,TransactionAmt_min_std,TransactionAmt_to_mean_card1,TransactionAmt_to_std_card1,TransactionAmt_to_mean_card4,TransactionAmt_to_std_card4,TransactionAmt_to_mean_card5,TransactionAmt_to_std_card5,id_02_to_mean_card1,id_02_to_std_card1,id_02_to_mean_card4,id_02_to_std_card4,id_02_to_mean_card5,id_02_to_std_card5,D1_to_mean_card1,D1_to_std_card1,D1_to_mean_card4,D1_to_std_card4,D1_to_mean_card5,D1_to_std_card5,D10_to_mean_card1,D10_to_std_card1,D10_to_mean_card4,D10_to_std_card4,D10_to_mean_card5,D10_to_std_card5,C1_to_mean_addr1,C1_to_std_addr1,C1_to_mean_addr2,C1_to_std_addr2,C1_to_mean_dist1,C1_to_std_dist1,C2_to_mean_addr1,C2_to_std_addr1,C2_to_mean_addr2,C2_to_std_addr2,C2_to_mean_dist1,C2_to_std_dist1
0,3663549,18403224,3.464844,4,353,11,45,4,94,2,45,62,1.0,,6.0,6.0,0.0,0.0,3.0,4.0,0.0,0.0,6.0,0.0,5.0,1.0,115.0,6.0,419.0,419.0,27.0,398.0,27.0,,,,,418.0,203.0,,...,6,6,-1.0,-1.0,0.339355,0.260375,0.237305,0.129845,0.223877,0.12765,1.340721,1.566553,1.453882,1.530158,1.454355,1.527538,1.305664,2.261447,3.738281,2.325825,3.355469,2.222939,1.438477,2.07947,2.537109,1.717111,2.248047,1.650887,0.678223,0.182833,0.76416,0.151788,0.649414,0.194325,0.69873,0.178441,0.788574,0.145724,0.696777,0.20198
1,3663550,18403263,3.892578,4,8860,11,45,4,94,2,129,62,4.0,,3.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0,1.0,12.0,2.0,149.0,149.0,7.0,634.0,7.0,,,,,231.0,634.0,,...,6,6,-1.0,-1.0,0.333496,0.134278,0.364014,0.199117,0.343262,0.195751,0.019308,0.01936,0.018565,0.019538,0.018571,0.019505,0.918945,0.707987,1.330078,0.827084,1.193359,0.790496,1.108398,0.891555,1.402344,0.94893,1.242188,0.912332,0.383301,0.063863,0.38208,0.075894,0.22937,0.081199,0.263672,0.040994,0.262939,0.048575,0.163452,0.058039
2,3663551,18403310,5.140625,4,9014,470,45,4,94,2,243,62,2636.0,,2.0,2.0,0.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,2.0,0.0,22.0,2.0,137.0,137.0,10.0,97.0,10.0,,,,,136.0,136.0,,...,6,6,-1.0,-1.0,1.485352,1.895643,1.270508,0.694877,1.198242,0.683131,1.094332,1.9191,0.960696,1.011098,0.961009,1.009366,1.009766,0.617242,1.222656,0.760473,1.09668,0.726832,0.644043,0.602116,0.825684,0.558677,0.730957,0.537131,0.242432,0.043048,0.254639,0.050596,0.571289,2.645751,0.245483,0.039954,0.262939,0.048575,0.842285,3.864367
3,3663552,18403310,5.652344,4,838,258,45,4,53,2,68,62,17.0,,5.0,2.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,0.0,2.0,0.0,7.0,4.0,42.0,42.0,41.0,242.0,41.0,,,,,242.0,242.0,,...,6,6,-1.0,-1.0,2.970703,1.914712,2.117188,1.158128,2.851562,1.926732,1.229895,1.281581,1.312037,1.380871,1.334273,1.411645,0.693359,0.349426,0.374756,0.233138,0.449219,0.273644,1.867188,1.167725,1.46875,0.994117,1.576172,1.104285,1.082031,0.209791,0.636719,0.12649,0.577637,0.172899,0.46167,0.085461,0.262939,0.048575,0.249512,0.072573
4,3663553,18403317,4.21875,4,6719,350,45,2,12,2,103,62,6.0,,6.0,6.0,0.0,0.0,2.0,5.0,0.0,0.0,5.0,0.0,6.0,0.0,14.0,6.0,22.0,22.0,0.0,22.0,0.0,,,,,22.0,22.0,,...,6,6,-1.0,-1.0,0.567383,0.310075,0.517578,0.277417,0.550293,0.360051,1.781765,1.923796,1.709331,1.811612,1.685348,1.77826,0.188477,0.123553,0.209839,0.127592,0.151123,0.11624,0.117432,0.086124,0.143433,0.093855,0.101135,0.084889,0.920898,0.208656,0.76416,0.151788,0.617188,0.187468,0.974609,0.211744,0.788574,0.145724,0.661621,0.193813
5,3663554,18403323,4.058594,4,2364,219,45,4,94,2,275,62,,,5.0,5.0,0.0,0.0,2.0,3.0,0.0,0.0,2.0,0.0,4.0,0.0,10.0,4.0,36.0,36.0,35.0,0.0,,,,,,0.0,0.0,,...,6,6,-1.0,-1.0,0.459473,0.327832,0.43042,0.235435,0.405762,0.231455,0.191403,0.203028,0.190811,0.200822,0.190873,0.200478,0.341797,0.221353,0.321289,0.199832,0.28833,0.190992,0.0,0.0,0.0,0.0,0.0,0.0,0.962891,0.168893,0.636719,0.12649,-1.0,-1.0,0.975098,0.159918,0.657227,0.121437,-1.0,-1.0
6,3663555,18403350,4.464844,4,5511,374,45,4,20,2,8,62,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,2.0,1.0,,,0.0,0.0,,,,,,0.0,0.0,,...,6,6,-1.0,-1.0,0.967285,0.819857,0.645996,0.353534,0.890625,0.561712,0.064809,0.07041,0.065238,0.06866,0.066197,0.06994,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102966,0.023418,0.127319,0.025298,-1.0,-1.0,0.104492,0.022322,0.13147,0.024287,-1.0,-1.0
7,3663556,18403387,5.964844,4,4247,70,45,2,2,1,62,62,303.0,,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,1.0,1.0,11.0,1.0,,,,126.0,4.0,,,,,126.0,126.0,,...,6,6,-1.0,-1.0,1.682617,1.089329,2.970703,1.592534,1.850586,1.154221,0.229556,0.245405,0.232602,0.24652,0.233006,0.246595,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.071289,0.610012,0.821289,0.537533,0.989258,0.571911,0.38916,0.053651,0.38208,0.075894,1.730469,3.121261,0.127197,0.016355,0.13147,0.024287,0.75,2.04939
8,3663557,18403405,4.644531,4,7758,0,45,4,94,2,261,62,3.0,,152.0,148.0,0.0,0.0,135.0,95.0,0.0,0.0,77.0,0.0,122.0,0.0,407.0,108.0,128.0,128.0,13.0,644.0,13.0,,,,,106.0,631.0,,...,6,6,-1.0,-1.0,0.762695,0.409356,0.771973,0.422361,0.728027,0.415222,0.667521,0.699089,0.653554,0.687842,0.653767,0.686664,1.176758,0.726365,1.142578,0.710515,1.025391,0.679084,0.615723,0.435826,0.643555,0.43544,0.569824,0.418646,6.5,2.648948,19.359375,3.84529,13.953125,4.138383,6.707031,2.551387,19.453125,3.594531,14.695312,4.199774
9,3663558,18403416,4.761719,4,2106,219,45,4,94,2,246,62,8.0,,2.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,2.0,1.0,8.0,2.0,69.0,69.0,35.0,,,,,,,68.0,68.0,,...,6,6,-1.0,-1.0,1.035156,0.739257,0.869141,0.475442,0.819824,0.467405,0.501414,0.526087,0.512176,0.539046,0.512343,0.538123,0.518555,0.367094,0.615723,0.383012,0.552734,0.366069,0.327881,0.266251,0.412842,0.279339,0.365479,0.268565,0.320801,0.053377,0.254639,0.050596,0.167969,0.049357,0.334961,0.051012,0.262939,0.048575,0.179688,0.052043


### **Principal Cluster Analysis on the Vesta Engineered Features (V1:V339)**

In [25]:
from pca_change import PCA_change
from sklearn.preprocessing import minmax_scale

In [26]:
test_encoded['isFraud'] = 'test'
df = pd.concat([train_encoded, test_encoded], axis=0, sort=False )
df = df.reset_index()
df = df.drop('index', axis=1)

In [27]:
v_total = train_encoded.columns[55:394]

**Performing PCA**

In [28]:
for col in v_total:
    df[col] = df[col].fillna((df[col].min() - 2))
    df[col] = (minmax_scale(df[col], feature_range=(0,1)))

df = PCA_change(df, v_total, prefix='PCA_V_', n_components=30)

In [32]:
df = reduce_mem_usage(df)

Memory usage of dataframe is 858.05 MB
Memory usage after optimization is: 305.55 MB
Decreased by 64.4%


## **Preparing the data for modeling**

In [33]:
def clean_inf_nan(df):
    return df.replace([np.inf, -np.inf], np.nan) 

# Cleaning infinite values to NaN
df = clean_inf_nan(df)

In [34]:
def missing_data_finder(df):

    df_missing = df.isnull().sum().reset_index().rename(columns={'index': 'column_name', 0: 'missing_row_count'}).copy()
    df_missing_rows = df_missing[df_missing['missing_row_count'] > 0].sort_values(by='missing_row_count',ascending=False)
    df_missing_rows['missing_row_percent'] = (df_missing_rows['missing_row_count'] / df.shape[0]).round(4)
    return df_missing_rows

In [35]:
print(len(missing_data_finder(df)))
missing_data_finder(df).head()

52


Unnamed: 0,column_name,missing_row_count,missing_row_percent
60,id_08,1087017,0.9907
59,id_07,1087017,0.9907
14,dist2,1023168,0.9325
35,D7,998181,0.9097
56,id_04,964426,0.879


### **Dropping the columns with a missing_row_percent of over 90%**

In [39]:
df = df.drop(columns=['id_08','id_07','dist2','D7','D13'], axis = 1)

**Resorting df to updated train and test datasets prepared for modeling**

In [42]:
#FOR CHECKING
train, test = df[df['isFraud'] != 'test'], df[df['isFraud'] == 'test'].drop('isFraud', axis=1)
print(train.shape)
print(test.shape)

(590540, 160)
(506691, 159)


**Sorting the values by TransactionDT, timedelta feature**

In [46]:
#FOR CHECKING
X_train = train.sort_values('TransactionDT').drop(['isFraud', 'TransactionDT', ],axis=1)
y_train = train.sort_values('TransactionDT')['isFraud'].astype(bool)
X_test = test.sort_values('TransactionDT').drop(['TransactionDT', ], axis=1)
#test_encoded = test_encoded[["TransactionDT"]]

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(590540, 158)
(590540,)
(506691, 158)


# **Saving Data for Modeling (Next Notebook)**

In [47]:
df.to_csv('full_data.csv')