# Feature Engineering

In this notebook I look at the predictive capability of the features. I would normally generate new features here, but I found little to no relationship between the variables, and the use of domain knowledge is limited by the anonymized nature of the data set. Instead, I will use tools to resample the data and determine the most useful features. I will also make sure the features are free of problems such as multicollinearity.

### Road Map
1) Resample Data <br>
2) See Which Features Are Worth Using <br>
3) Deal With Multicollinearity <br>

In [8]:
# import libraries

# standard scientific libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# plotting
import seaborn as sns

# resampling
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import NearMiss

In [5]:
# import data
# Note: only use training data
train = pd.read_csv("train.csv")

train.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,2.238954,-1.724499,-2.151484,-2.577803,0.993668,3.565492,-1.785957,0.860122,-1.264003,1.567867,...,-0.149574,-0.049333,0.278442,0.684735,-0.219028,-0.159167,0.03792,-0.049932,3.466048,0
1,-1.315062,1.630783,0.597001,-0.038359,-0.40458,-0.965712,0.212249,0.735381,-1.267926,-0.482635,...,-0.238898,-0.946773,0.323904,0.515632,-0.713,-0.266503,-0.017794,0.051058,1.94591,0
2,1.908801,0.021184,-2.087997,0.12931,1.161468,0.605244,-0.022371,0.180296,0.283819,-0.497766,...,0.293609,1.095842,-0.044874,-1.689517,0.106098,0.007758,0.045164,-0.053068,2.70538,0
3,1.811257,0.316556,0.316751,3.880231,0.048454,1.020163,-0.734868,0.233651,0.681423,1.146705,...,0.138869,0.700422,0.174064,0.702997,-0.212523,-0.010018,-0.01774,-0.038006,2.851284,0
4,1.358817,-1.120881,0.550266,-1.547659,-1.19495,0.275448,-1.201843,0.212889,-2.094285,1.492821,...,-0.340972,-0.636442,0.252758,-0.34416,-0.064282,-0.439622,0.062524,0.013095,3.17847,0


# 1. Resample Data

Unbalanced data is a problem for machine learning algorithms. A model is updated on every row it is trained on, and because non-fraud rows vastly outnumber fraud rows, a fraud prediction is penalized far more often than it is rewarded. This biases the model toward the majority class and ultimately makes it perform sub-optimally. Resampling the data makes it balanced. I will use the "NearMiss" algorithm, which removes non-fraud data points until the data contains equal numbers of fraud and non-fraud transactions.

The NearMiss algorithm is distance based. It first computes the distances between the points of the two classes. It then finds the majority-class points that are, on average, farthest from the points in the minority class and removes them until the two classes have the same number of points. Only the majority-class members closest to the minority class are kept, which preserves the points near the class boundary, where the most information for separating the classes lies.

Since NearMiss uses distance, it is usually a good idea to scale the data first. Another important point is that resampled data should never be used for testing, only for training. Once the data is resampled, it is also better suited to testing the features and finding multicollinearity.
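To make the selection rule concrete, here is a minimal pure-Python sketch of the idea usually called NearMiss-1: keep the majority points with the smallest average distance to their `k` nearest minority points. This is an illustration only, not imblearn's implementation; the function name `near_miss_1`, the parameter `k`, and the toy points are all made up for this example.

```python
import math


def near_miss_1(majority, minority, k=3):
    """Keep len(minority) majority points that are closest, on average,
    to their k nearest minority points (illustrative sketch only)."""
    k = min(k, len(minority))

    def mean_dist_to_nearest_minority(point):
        # distances from this majority point to every minority point
        dists = sorted(math.dist(point, m) for m in minority)
        # average over the k smallest distances
        return sum(dists[:k]) / k

    # rank majority points from closest to farthest, keep the closest ones
    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:len(minority)]


# hypothetical toy data: 2 "fraud" points and 5 "non-fraud" points
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (-4.0, 3.0), (0.0, 2.0)]

kept = near_miss_1(majority, minority, k=2)
print(kept)
```

On this toy data the two majority points nearest the minority cluster are kept and the three distant ones are dropped, so both classes end up with two points.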

In [16]:
# scale and resample data

# separate X and y
X = train.drop("Class", axis=1)
y = train.Class

# instantiate scaler
sc = StandardScaler()

# fit the scaler and transform data
X_scaled = sc.fit_transform(X)


# instantiate NearMiss object
nm = NearMiss()

# fit the NearMiss object and resample the scaled data
train_resampled = nm.fit_resample(X_scaled, y)

# preview
train_resampled[0]

array([[ 9.06583638e-01, -2.31479655e-02, -1.42225657e+00, ...,
        -3.70579662e-03,  3.84898439e-02,  9.94475305e-01],
       [ 9.87335718e-01,  8.03721766e-01, -1.72646059e+00, ...,
        -6.92472447e-02,  2.18182979e-03, -8.37583563e-01],
       [ 9.26321788e-01,  3.64243906e-02, -1.45734077e+00, ...,
         3.91191346e-03,  4.13041761e-02,  9.43200783e-01],
       ...,
       [-7.08390690e-01,  1.69659208e+00, -2.88514212e+00, ...,
         1.16056739e+00,  2.93970994e-01,  1.48268843e+00],
       [-8.69029581e+00,  5.86526802e+00, -1.57493442e+01, ...,
         5.39396967e+00, -4.36590545e+00, -1.51598035e+00],
       [-7.19575462e-01,  1.59074915e+00, -4.00650342e+00, ...,
         1.56098912e-01, -7.41925960e-01, -1.51598035e+00]])
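The scaling step in the cell above can be sketched by hand to show what standardization does to a single column: subtract the column mean and divide by the population standard deviation, which is my understanding of what `StandardScaler` does by default. The helper `standardize` and the sample values are hypothetical and use only the standard library.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = statistics.fmean(column)
    # population standard deviation (divides by n, not n - 1);
    # fall back to 1.0 so a constant column does not divide by zero
    std = statistics.pstdev(column) or 1.0
    return [(x - mean) / std for x in column]


# hypothetical feature values, not taken from the data set
values = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(values)
print(scaled)
```

After this transformation every feature is on the same scale, so no single column dominates the distances that NearMiss relies on.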

In [17]:
# place resampled data back into a dataframe
X = train_resampled[0]
y = train_resampled[1]
train_resampled = pd.DataFrame(X, columns=train.drop("Class", axis=1).columns)
y = pd.Series(y, name="Class")
train_resampled = pd.concat([train_resampled, y], axis=1)

# preview
train_resampled.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.906584,-0.023148,-1.422257,0.65586,0.236239,-0.955042,0.437714,-0.327925,0.677833,-1.031722,...,-0.116523,-0.250734,-0.087141,-0.369715,0.297917,-0.629323,-0.003706,0.03849,0.994475,0
1,0.987336,0.803722,-1.726461,2.777141,1.432907,0.027513,0.565542,-0.052569,-1.477013,0.308132,...,-0.228136,-0.51241,0.021478,-0.082168,0.54245,-0.002035,-0.069247,0.002182,-0.837584,0
2,0.926322,0.036424,-1.457341,0.649868,0.291778,-0.996993,0.46067,-0.361564,0.630039,-1.072727,...,-0.145436,-0.275812,-0.105116,-0.414568,0.379638,-0.629662,0.003912,0.041304,0.943201,0
3,0.71571,-0.427951,-1.632028,0.779803,0.056732,-0.961689,0.624765,-0.348329,0.778868,-1.130716,...,0.012528,-0.570361,-0.37971,-0.452673,0.150096,-0.667958,-0.119314,0.146349,1.459682,0
4,0.998509,0.003701,-1.29162,0.247384,0.359023,-0.525482,0.181924,-0.217894,0.778759,-0.89271,...,-0.326531,-0.704315,0.193914,0.624652,0.02867,-0.240005,-0.037145,-0.010047,0.606205,0
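The preview only shows rows of class 0, so it does not confirm that the classes are now balanced. A small check along these lines could do that; the helper names `class_counts` and `is_balanced` and the sample labels are hypothetical.

```python
from collections import Counter


def class_counts(labels):
    """Return a {label: count} dictionary for any iterable of labels."""
    return dict(Counter(labels))


def is_balanced(labels):
    """True when every class appears the same number of times.
    An empty input is treated as not balanced."""
    counts = Counter(labels)
    return len(set(counts.values())) == 1


# hypothetical labels standing in for the resampled "Class" column
labels = [0, 0, 0, 1, 1, 1]
print(class_counts(labels), is_balanced(labels))
```

In this notebook the same check could be run as `is_balanced(train_resampled["Class"])`, or the counts could be inspected directly with `train_resampled["Class"].value_counts()`.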
