# Step 4: Preprocessing and Feature Engineering


**Introduction**

The IEEE Computational Intelligence Society (IEEE-CIS) works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. It has partnered with Vesta, the world’s leading payment service company, to seek the best solutions for the fraud detection industry. Vesta's fraud prevention system saves consumers millions of dollars per year. Researchers from IEEE-CIS want to improve not only fraud detection accuracy but also the customer experience.

**Data Source**

The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features, from device type to product features. It is available from the Kaggle competition page (https://www.kaggle.com/c/ieee-fraud-detection/data). Only the train_identity and train_transaction datasets are used in this project.

**The Data Science Method**  

1.   Problem Identification 

2.   Data Wrangling 
 
3.   Exploratory Data Analysis

4.   **Pre-processing and Training Data Development**
    - Create new features
    - Standardize numeric features
    - Split into testing and training datasets
    - Resample the training dataset

5.   Modeling 

6.   Documentation

In [1]:
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#os.getcwd()

In [2]:
path=".../data"
os.chdir(path) 

df = pd.read_csv('step3_output.csv')
print(df.head())

   isFraud  TransactionDT  TransactionAmt ProductCD  card1  card2  card3  \
0        0          86401            29.0         W   2755  404.0  150.0   
1        0          86469            59.0         W   4663  490.0  150.0   
2        0          86499            50.0         W  18132  567.0  150.0   
3        0          86510            49.0         W   5937  555.0  150.0   
4        0          86522           159.0         W  12308  360.0  150.0   

        card4  card5   card6  ...  V281  V282 V283  V284  V286  V291  V297  \
0  mastercard  102.0  credit  ...   0.0   1.0  1.0   0.0   0.0   1.0   0.0   
1        visa  166.0   debit  ...   0.0   1.0  1.0   0.0   0.0   1.0   0.0   
2  mastercard  117.0   debit  ...   0.0   0.0  0.0   0.0   0.0   1.0   0.0   
3        visa  226.0   debit  ...   0.0   1.0  1.0   0.0   0.0   1.0   0.0   
4        visa  166.0   debit  ...   0.0   1.0  1.0   0.0   0.0   1.0   0.0   

   V299 V305  V311  
0   0.0  1.0   0.0  
1   0.0  1.0   0.0  
2   0.0  1.

In [3]:
print(df.columns.tolist())

['isFraud', 'TransactionDT', 'TransactionAmt', 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'addr1', 'addr2', 'P_emaildomain', 'C3', 'D1', 'D4', 'D10', 'D15', 'M6', 'V25', 'V26', 'V46', 'V47', 'V55', 'V56', 'V61', 'V62', 'V66', 'V67', 'V77', 'V78', 'V82', 'V83', 'V98', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V114', 'V115', 'V116', 'V118', 'V120', 'V121', 'V122', 'V124', 'V281', 'V282', 'V283', 'V284', 'V286', 'V291', 'V297', 'V299', 'V305', 'V311']


Check for NAs

In [5]:
nas = pd.DataFrame(df.isnull().sum().sort_values(ascending=False)/len(df),columns = ['percent'])
pos = nas['percent'] > 0
nas[pos]

                percent
M6             0.182761
addr2          0.139652
addr1          0.139652
card2          0.017010
card5          0.007059
card3          0.003258
card4          0.003258
card6          0.003258
P_emaildomain  0.003107
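
The same missing-value summary can be written more compactly, because the mean of a boolean mask is the share of `True` values. The sketch below uses a small made-up frame (`toy` and its columns are hypothetical), not the project data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: 'a' has 2 of 4 missing, 'b' has 1 of 4, 'c' has none
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing values per column
na_share = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually contain missing values
na_share = na_share[na_share > 0]
print(na_share)
```

Note that the values are fractions, not percentages: 0.182761 for M6 means about 18.3% of rows are missing.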


### Feature Engineering
- Create new features:
    - TransactionAmt_log - natural log of the transaction amount
    - Transaction_day - day-of-week index (0-6) derived from TransactionDT
    - Transaction_hour - hour of the day (0-23) derived from TransactionDT
- Card features: frequency encoding (add a `<col>_freq` column holding how often each value occurs)
- P_emaildomain:
    - Fill NAs with "email_not_provided"
    - Split the email domain and keep the part before the first dot as P_prefix
- Encode object (string) columns as integer labels
- V features: standardize each column to zero mean and unit variance
- Fill the remaining NAs with -1

In [6]:
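# New features
# Log-transform the amount to compress its range (requires TransactionAmt > 0)
df['TransactionAmt_log'] = np.log(df['TransactionAmt'])
# TransactionDT is treated as seconds from a reference time, so the day index is relative to that reference
df['Transaction_day'] = np.floor((df['TransactionDT'] / (3600 * 24) - 1) % 7)
df['Transaction_hour'] = np.floor(df['TransactionDT'] / 3600) % 24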
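
As a quick check of the arithmetic, the sketch below applies the same two formulas to a few `TransactionDT` values, two of which appear in the `df.head()` output. It assumes `TransactionDT` counts seconds from an unspecified reference time, which is what the formulas imply; the helper name `time_features` is hypothetical.

```python
import math

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def time_features(transaction_dt):
    """Return (day_index, hour) using the same formulas as the notebook cell."""
    day = math.floor((transaction_dt / SECONDS_PER_DAY - 1) % 7)
    hour = math.floor(transaction_dt / SECONDS_PER_HOUR) % 24
    return day, hour

for dt in (86401, 86522, 90000, 172800):
    print(dt, time_features(dt))
```

With these formulas, 86401 (one second past the first full day) maps to day 0, hour 0, and 172800 (exactly two days) maps to day 1, hour 0. The day index is relative to the reference time, so it identifies a repeating 7-day position and not necessarily a named weekday.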

In [7]:
# Card features encoding
for col in ['card1', 'card2', 'card3', 'card4', 'card5', 'card6']:
    freq = df[col].value_counts(dropna=False).to_dict()
    df[col+'_freq'] = df[col].map(freq)
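
To make the frequency encoding concrete, here is a minimal sketch on a made-up column (the frame `toy` and its values are hypothetical). Each row receives the number of times its value occurs in the column.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one missing entry
toy = pd.DataFrame({'card4': ['visa', 'visa', 'mastercard', np.nan, 'visa']})

# Count every distinct value; dropna=False also counts NaN as its own value
freq = toy['card4'].value_counts(dropna=False).to_dict()

# Attach the count of each row's value as a new feature
toy['card4_freq'] = toy['card4'].map(freq)
print(toy)
```

One design point to keep in mind: the counts here are computed over the full dataset before the train/test split, so the encoded feature carries some information from rows that later land in the test set. Computing the counts on the training rows only would be a stricter alternative.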

In [8]:
# email feature
df['P_emaildomain'] = df['P_emaildomain'].fillna('email_not_provided')
df['P_prefix'] = df['P_emaildomain'].apply(lambda x: x.split('.')[0])
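
The prefix extraction can be illustrated on a few example domains. The sketch below uses made-up values and the vectorized `.str` accessor as an alternative to `apply` with a lambda; it should give the same result for string inputs.

```python
import pandas as pd

# Made-up example domains, including one missing value
emails = pd.Series(['gmail.com', 'yahoo.com.mx', None, 'anonymous.com'])

# Replace missing values with a placeholder, as in the notebook cell
emails = emails.fillna('email_not_provided')

# Keep the part before the first dot
prefix = emails.str.split('.').str[0]
print(prefix.tolist())
```

Domains that differ only in their suffix, such as `yahoo.com` and `yahoo.com.mx`, end up with the same prefix, which groups providers together.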

In [9]:
# Label-encode object (string) columns
from sklearn.preprocessing import LabelEncoder
for col in df.drop('isFraud', axis=1).columns:
    if df[col].dtype == 'object':
        # astype(str) turns NaN into the string 'nan', so missing values get their own label
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
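
The sketch below shows what label encoding does to a small made-up column. As a variation on the notebook cell, missing values are filled with an explicit placeholder (`'missing'`, a hypothetical choice) before encoding, which makes the handling of NAs visible instead of relying on the string cast.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column with one missing value
col = pd.Series(['visa', 'mastercard', np.nan, 'visa'], dtype='object')

# Make the missing category explicit, then encode
le = LabelEncoder()
codes = le.fit_transform(col.fillna('missing').astype(str))

print(list(le.classes_))  # classes are stored in sorted order
print(codes.tolist())     # each value is replaced by its index in classes_
```

The integer codes follow the alphabetical order of the classes, so they carry no meaningful ranking. Tree-based models usually tolerate this; for linear models, one-hot encoding is a common alternative.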

In [10]:
# V features - standardize each column to zero mean and unit variance
v_cols = ['V25', 'V26', 'V46', 'V47', 'V55', 'V56', 'V61', 'V62', 'V66', 'V67', 'V77', 'V78', 'V82', 'V83', 'V98', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V114', 'V115', 'V116', 'V118', 'V120', 'V121', 'V122', 'V124', 'V281', 'V282', 'V283', 'V284', 'V286', 'V291', 'V297', 'V299', 'V305', 'V311']
for v in v_cols:
    df[v] = (df[v] - df[v].mean()) / df[v].std()
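
The standardization is a z-score: subtract the column mean and divide by the column standard deviation. The cell above computes both statistics on the full dataset. The sketch below shows a stricter variant, offered as an assumption about good practice and not as what the notebook does: the statistics come from the training rows only and are then reused for the test rows. The column name `V_toy` and all values are made up.

```python
import pandas as pd

# Made-up column; the first four rows play the role of the training set
toy = pd.DataFrame({'V_toy': [0.0, 1.0, 1.0, 2.0, 5.0, 3.0]})
train, test = toy.iloc[:4], toy.iloc[4:]

# Statistics from the training rows only (pandas std() uses ddof=1)
mu = train['V_toy'].mean()
sigma = train['V_toy'].std()

# Apply the same shift and scale to both parts
train_z = (train['V_toy'] - mu) / sigma
test_z = (test['V_toy'] - mu) / sigma

print(train_z.mean(), train_z.std())
print(test_z.tolist())
```

After the transform, the training column has mean 0 and standard deviation 1, while the test values are expressed on the same scale without having influenced it.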

In [11]:
nas = pd.DataFrame(df.isnull().sum().sort_values(ascending=False)/len(df),columns = ['percent'])
pos = nas['percent'] > 0
nas[pos]

        percent
addr1  0.139652
addr2  0.139652
card2  0.017010
card5  0.007059
card3  0.003258


In [12]:
# Fill the remaining NAs with -1; assign back instead of calling inplace on a column selection
for col in ['addr1', 'addr2', 'card2', 'card5', 'card3']:
    df[col] = df[col].fillna(-1)

In [13]:
print(df.isnull().sum().tolist())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [14]:
df['isFraud'].dtypes

dtype('int64')

### Split into training and testing datasets

Note that the dataset is imbalanced. As seen in Step 3, 96.76% of transactions are non-fraudulent and only 3.24% are fraudulent. We will apply a resampling technique to balance the training set.

In [36]:
# Sort once by transaction time, then separate the features from the target
df_sorted = df.sort_values('TransactionDT')
X = df_sorted.drop(['isFraud', 'TransactionDT'], axis=1)
y = df_sorted['isFraud']

In [30]:
#from sklearn import preprocessing
#scaler = preprocessing.StandardScaler().fit(X)
#X_scaled = scaler.transform(X)
#y = y.ravel() # get 1-dim flattened array

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

In [38]:
print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))
print("Number transactions X_train: ", X_train.shape)
print("Number transactions y_train: ", y_train.shape)
print("Number transactions X_test: ", X_test.shape)
print("Number transactions y_test: ", y_test.shape)

Shape of X: (411937, 67)
Shape of y: (411937,)
Number transactions X_train:  (308952, 67)
Number transactions y_train:  (308952,)
Number transactions X_test:  (102985, 67)
Number transactions y_test:  (102985,)
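
`train_test_split` shuffles rows by default, so sorting by `TransactionDT` beforehand does not by itself produce a time-based split. The sketch below, on made-up data, shows two options that may be worth considering: a stratified random split that preserves the class ratio, and a chronological split with `shuffle=False`. The names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows in time order, 10% positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1 if i % 10 == 0 else 0 for i in range(100)])

# Option 1: random split that keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123, stratify=y_toy)

# Option 2: chronological split; the last 25% of rows become the test set
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X_toy, y_toy, test_size=0.25, shuffle=False)

print(y_tr.mean(), y_te.mean())    # positive rate in each part
print(X_tr_c.max(), X_te_c.min())  # last training row, first test row
```

With a rare positive class, stratification avoids a test set that happens to contain too few fraud cases. A chronological split is closer to how a fraud model would be used in practice, where it scores transactions that occur after the training period.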


### Oversampling: SMOTE

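One approach to dealing with imbalanced datasets is to oversample the minority class. A widely used method is the Synthetic Minority Oversampling Technique (SMOTE), which creates synthetic minority-class samples by interpolating between existing minority samples and their nearest minority-class neighbors. It is applied to the training set only, so the test set keeps the original class distribution.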

In [39]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=123)
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

In [40]:
print('Shape of X: {}'.format(X_train_sm.shape))
print('Shape of y: {}'.format(y_train_sm.shape))

Shape of X: (598006, 67)
Shape of y: (598006,)
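
With its default settings, SMOTE adds minority samples until both classes have the same size, which is consistent with the resampled training set having 598006 = 2 × 299003 rows. The core interpolation step can be sketched in plain NumPy as below. This is a simplified illustration on made-up points, not the imbalanced-learn implementation; the function name `smote_like_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(123)

# Toy minority-class points (made-up values, no duplicates)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbors."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so the new samples stay inside the region already occupied by the minority class. Note that in this notebook the label-encoded categorical columns are interpolated as if they were continuous, which can yield non-integer codes in the synthetic rows.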
