# MACHINE LEARNING FOR FINANCIAL SERVICES

Welcome to MACHINE LEARNING! Below is a simple introductory example showing how easily you can run machine learning algorithms to select significant features.


## FEATURE SELECTION with KNN and Random Forest
Often in machine learning, we have hundreds or even thousands of features, and we want a model that includes only the most relevant ones. This is a common challenge across all industries, including financial services, and improved machine learning (and deep learning) libraries have made it much easier to address. Feature selection is the process of identifying and selecting a subset of the original variables (i.e. features or attributes) in order to:

1. Make our model simpler to interpret. 
1. Reduce the variance of the model, and therefore overfitting. 
1. Reduce the computational cost and time. 

A model built on the selected features may or may not be more accurate than the full model, so it is important to strike a balance between the above benefits and accuracy.

The purpose of this notebook is to show simple examples of feature selection using KNN and Random Forest, two common and popular machine learning algorithms. Random Forests in particular are often used for feature selection in a data science workflow, because the tree-based strategies they use naturally rank features by how well they improve the purity of the nodes. Each feature's importance is its decrease in impurity (Gini impurity by default) averaged over all trees, often called the mean decrease in impurity. A couple of additional scikit-learn feature selection methods are included at the end for reference.
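To make the impurity idea concrete, here is a minimal sketch in plain Python of how Gini impurity and the impurity decrease of a single split could be computed. The helper names `gini_impurity` and `impurity_decrease` and the toy labels are illustrative assumptions, not part of scikit-learn or of this notebook's dataset. A random forest's reported importances additionally weight each split by the share of samples reaching the node and average the result over all trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical node with 4 legitimate (0) and 4 fraudulent (1) transactions
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini_impurity(parent))                   # 0.5
print(impurity_decrease(parent, left, right))  # 0.125
```

A feature whose splits produce larger decreases, across many nodes and trees, ends up with a higher importance score.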

> DIMENSIONALITY REDUCTION: in machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction, so dimensionality reduction is the umbrella term for both.

> FEATURE EXTRACTION transforms the data from a high-dimensional space to a lower-dimensional one, whereas feature selection keeps a subset of the original attributes. The transformation may be linear or non-linear. PCA (Principal Component Analysis) is one common feature extraction technique.
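For contrast with the feature selection examples in the rest of this notebook, the following is a small sketch of feature extraction with PCA. It runs on synthetic data generated here purely for illustration (the array `X_demo` is an assumption, not the fraud dataset), and it standardizes the columns first because PCA is sensitive to scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 5 columns, where 3 columns are noisy
# linear combinations of the first 2 (so about 2 dimensions carry the signal)
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 2))
mixed = base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(100, 3))
X_demo = np.hstack([base, mixed])

# Standardize, then project onto the first two principal components
X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

Note that the two resulting columns are new combinations of the inputs, not a subset of the original attributes, which is why they are harder to interpret than selected features.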

In [1]:
# read a dataset of interest

import pandas as pd

url = 'https://raw.githubusercontent.com/YLEE200/MLFS/master/testdata/FRAUD_SAMPLE1.csv'

df = pd.read_csv(url)
#df.shape
df.head()

Unnamed: 0,TXN_ID,TXN_DT,ACCT_NO,TXN_AMT,ACCT_BAL,TENURE,ACCT_TYPE,ATM_IND,TXN_ST,ONLINE_IND,NOLNK_ACCT,FRD_IND
0,123A112,9/9/11,xxxx8350,84.6,1057.46,4.1,REG,0,NV,0,0,0
1,123A485,9/9/11,xxxx8379,59.75,1194.94,5.6,PRM,1,OH,0,0,0
2,123A417,9/9/11,xxxx8402,179.08,3581.58,5.6,PRM,0,OH,0,0,0
3,123A377,9/9/11,xxxx8406,199.01,3980.22,5.6,PRM,0,OH,0,0,0
4,123A661,9/9/11,xxxx8409,218.14,4362.75,5.6,REG,0,OH,0,0,0


In [2]:
# map the categorical columns TXN_ST and ACCT_TYPE to numeric codes

df['ST_NO'] = df.TXN_ST.map({'OH': 0, 'NV': 1})
df['ACCT_TIER'] = df.ACCT_TYPE.map({'PRM': 0,'REG': 1})

df.head(5)

Unnamed: 0,TXN_ID,TXN_DT,ACCT_NO,TXN_AMT,ACCT_BAL,TENURE,ACCT_TYPE,ATM_IND,TXN_ST,ONLINE_IND,NOLNK_ACCT,FRD_IND,ST_NO,ACCT_TIER
0,123A112,9/9/11,xxxx8350,84.6,1057.46,4.1,REG,0,NV,0,0,0,1,1
1,123A485,9/9/11,xxxx8379,59.75,1194.94,5.6,PRM,1,OH,0,0,0,0,0
2,123A417,9/9/11,xxxx8402,179.08,3581.58,5.6,PRM,0,OH,0,0,0,0,0
3,123A377,9/9/11,xxxx8406,199.01,3980.22,5.6,PRM,0,OH,0,0,0,0,0
4,123A661,9/9/11,xxxx8409,218.14,4362.75,5.6,REG,0,OH,0,0,0,0,1
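The hand-written dictionaries above only cover the values listed in them; with `Series.map`, any value missing from the dictionary becomes `NaN`. As a hedged sketch of a more general approach, the snippet below uses a tiny made-up frame (`demo`, with an extra hypothetical state `CA`) to show the `NaN` behaviour and two alternatives built into pandas: integer category codes and one-hot columns.

```python
import pandas as pd

# Hypothetical frame; 'CA' is not covered by the hand-written mapping
demo = pd.DataFrame({
    "TXN_ST": ["OH", "NV", "CA"],
    "ACCT_TYPE": ["PRM", "REG", "PRM"],
})

# Values absent from the dictionary are mapped to NaN
mapped = demo.TXN_ST.map({"OH": 0, "NV": 1})
print(mapped.isna().sum())  # 1

# Alternative 1: integer codes assigned from the sorted category labels
demo["ST_NO"] = demo["TXN_ST"].astype("category").cat.codes

# Alternative 2: one indicator column per category
encoded = pd.get_dummies(demo, columns=["ACCT_TYPE"])
print(encoded.columns.tolist())
```

Integer codes impose an arbitrary ordering on the categories, which distance-based models such as KNN will treat as meaningful, so one-hot columns are often the safer choice for such models.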


In [95]:
# numeric feature variables 

feature_cols = [
    'TXN_AMT',
    'ACCT_BAL',
    'TENURE',
    'ATM_IND',
    'ONLINE_IND',
    'NOLNK_ACCT',
    'ST_NO',
    'ACCT_TIER'
]

In [96]:
# selecting a few numerical feature attributes for this demo (feature variables should be numeric)

X = df[feature_cols]
X.head(10)

Unnamed: 0,TXN_AMT,ACCT_BAL,TENURE,ATM_IND,ONLINE_IND,NOLNK_ACCT,ST_NO,ACCT_TIER
0,84.6,1057.46,4.1,0,0,0,1,1
1,59.75,1194.94,5.6,1,0,0,0,0
2,179.08,3581.58,5.6,0,0,0,0,0
3,199.01,3980.22,5.6,0,0,0,0,0
4,218.14,4362.75,5.6,0,0,0,0,1
5,254.52,5090.46,5.6,1,0,0,0,0
6,275.18,5503.63,5.6,1,0,0,0,0
7,294.96,5899.26,5.6,1,0,0,0,1
8,299.55,5991.01,2.5,1,0,0,1,1
9,316.49,6329.73,5.6,1,0,0,0,0


In [97]:
# create a response vector 'y' by selecting a Series
y = df.FRD_IND
y.shape

(610,)

## KNN Model

> ### KNN Model Training with all numerical features

In [98]:
# split the data, then instantiate and fit a KNN model on all of the above features

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [99]:
# making prediction on test data
y_pred = knn.predict(X_test)

> ### KNN Feature Selection with KBest

In [100]:
# selecting important feature variables with KBest
# you can change k value to choose different number of features

from sklearn.feature_selection import SelectKBest 

select = SelectKBest(k=3)
select_features = select.fit(X_train, y_train)

indices_selected = select.get_support(indices = True)
colnames_selected = [X.columns[i] for i in indices_selected]

print(colnames_selected)

X_sel = X.loc[:, colnames_selected]

['TENURE', 'ONLINE_IND', 'ST_NO']
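`SelectKBest` ranks features with a univariate scoring function and keeps the top `k`; when no `score_func` is passed, scikit-learn uses `f_classif` (the ANOVA F-value). The sketch below makes that explicit and shows how the per-feature scores could be inspected. It uses synthetic data from `make_classification` with made-up column names `f0` to `f5`, not the fraud dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 6 features, 3 of them informative
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Same selector as above, with the default scoring function spelled out
selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# One score per feature; higher means stronger univariate association with y
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = list(X_demo.columns[selector.get_support()])

print(scores)
print(selected)
```

Because each feature is scored on its own, this approach does not account for interactions between features, and it is independent of the KNN model that is trained afterwards.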


In [101]:
# split X and y into training and testing sets (test size is 40%)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.4, random_state=4)

In [102]:
# train the model on the training set

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [103]:
# make predictions on the testing set

y_pred_sel = knn.predict(X_test)

> ### Accuracy Comparison

In [104]:
from sklearn import metrics

# compare actual response values (y_test) with predicted response values (y_pred)
print(metrics.accuracy_score(y_test, y_pred))

# compare actual response values (y_test) with predicted response values from selected features (y_pred_sel)
print(metrics.accuracy_score(y_test, y_pred_sel))

0.901639344262
0.909836065574


The accuracies of the model with the full feature set (about 0.90) and the model with the three selected features (about 0.91) differ only minimally. Matching the full model with fewer features is especially worthwhile when you have to deal with a large number of feature dimensions.
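A difference this small on a single train/test split can easily be due to chance. One way to get a steadier comparison, sketched here under the assumption of synthetic data rather than the fraud dataset, is to cross-validate both variants and to wrap the selector and the classifier in a pipeline so that feature selection is re-fitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical data standing in for X and y: 8 features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# KNN on all features vs. KNN on the 3 best features chosen within each fold
full_model = KNeighborsClassifier()
selected_model = make_pipeline(SelectKBest(k=3), KNeighborsClassifier())

full_scores = cross_val_score(full_model, X_demo, y_demo, cv=5, scoring="accuracy")
selected_scores = cross_val_score(selected_model, X_demo, y_demo, cv=5, scoring="accuracy")

print(full_scores.mean(), full_scores.std())
print(selected_scores.mean(), selected_scores.std())
```

Comparing the means together with the spread across folds gives a better sense of whether the two models really differ.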

## Random Forest

> ### Random Forest Model Training with all numerical features

In [105]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

In [106]:
# Split the data into 40% test and 60% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

In [107]:
# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

# Train the classifier
clf.fit(X_train, y_train)

# Print the name and gini importance of each feature
for feature in zip(feature_cols, clf.feature_importances_):
    print(feature)

('TXN_AMT', 0.25079387077627319)
('ACCT_BAL', 0.26357109949620766)
('TENURE', 0.24299377592804486)
('ATM_IND', 0.034699731221416293)
('ONLINE_IND', 0.033992949227551965)
('NOLNK_ACCT', 0.080703091196977675)
('ST_NO', 0.077348075629225271)
('ACCT_TIER', 0.015897406524303024)


> ### Random Forest Feature Selection

In [108]:
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.2
sfm = SelectFromModel(clf, threshold=0.2)

# Train the selector
sfm.fit(X_train, y_train)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=0.2)

In [109]:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(feature_cols[feature_list_index])

TXN_AMT
ACCT_BAL
TENURE
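The fixed cutoff of 0.2 works here because three features clearly dominate, but importances always sum to 1, so a sensible absolute cutoff depends on how many features there are. `SelectFromModel` also accepts relative thresholds such as `"mean"` or `"median"`. The following sketch, on synthetic data that is an assumption for illustration only, keeps every feature whose importance is at least the mean importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical data: 8 features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Fit the selector; it fits a clone of the forest and stores it as estimator_
sfm_mean = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
).fit(X_demo, y_demo)

importances = sfm_mean.estimator_.feature_importances_
mask = sfm_mean.get_support()

print(importances.sum())  # importances are normalized to sum to 1
print(sfm_mean.threshold_)  # the numeric cutoff actually used
print(mask)  # True for features at or above the cutoff
```

With a relative threshold the number of selected features is not fixed in advance, so it is worth printing the mask or the selected names before training the reduced model.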


In [110]:
# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

In [111]:
# Create a new random forest classifier for the most important features
clf_important = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

# Train the new classifier on the new dataset containing the most important features
clf_important.fit(X_important_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

> ### Accuracy Comparison

In [112]:
# Apply The Full Featured Classifier To The Test Data
y_pred = clf.predict(X_test)

# View The Accuracy Of Our Full Feature (8 Features) Model
accuracy_score(y_test, y_pred)

0.92213114754098358

In [113]:
# Apply The Limited Feature Classifier To The Test Data
y_important_pred = clf_important.predict(X_important_test)

# View The Accuracy Of Our Limited Feature (3 Features) Model
accuracy_score(y_test, y_important_pred)

0.88934426229508201

As the accuracy scores show, our original model, which contained all 8 features, is about 92% accurate, while our 'limited' model, which contained only three features, is about 89% accurate. Thus, for a small cost in accuracy (about 3 percentage points), we reduced the number of features from 8 to 3.

## Something Extra
There are a couple more scikit-learn feature selection methods worth knowing...

#### The Recursive Feature Elimination (RFE) method ...
is a feature selection approach. It works by recursively removing attributes and rebuilding the model on the attributes that remain. At each step it uses the fitted model's coefficients or feature importances to rank the attributes, and it drops the weakest ones until the requested number is left.

In [114]:
# Recursive Feature Elimination

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()

# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)

indices_selected = rfe.get_support(indices = True)
colnames_selected = [X.columns[i] for i in indices_selected]

print(colnames_selected)

X_sel = X.loc[:, colnames_selected]

['ONLINE_IND', 'NOLNK_ACCT', 'ST_NO']
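Besides the selected subset, a fitted `RFE` object also records the order in which features were eliminated. The sketch below, which again relies on synthetic data as an assumption, prints the boolean `support_` mask and the `ranking_` array, where rank 1 marks the kept features and larger ranks mark features that were dropped earlier.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 8 features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Eliminate one feature per round (the default step) until 3 remain;
# max_iter is raised to reduce the chance of convergence warnings
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)  # True for the 3 kept features
print(rfe_demo.ranking_)  # 1 = kept, 2 = last one dropped, and so on
```

Because logistic regression coefficients depend on feature scale, standardizing the inputs first is usually advisable when the columns have very different ranges, as `TXN_AMT` and `ATM_IND` do in this notebook.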


#### Scikit Learn also has...
methods that use ensembles of decision trees (like Random Forest or Extra Trees), which can also compute the relative importance of each attribute. These importance values can be used to inform a feature selection process.

In [115]:
# Feature Importance

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(X,y)

# display the relative importance of each attribute
#print(model.feature_importances_)

# Print the name and gini importance of each feature
for feature in zip(feature_cols, model.feature_importances_):
    print(feature)

('TXN_AMT', 0.27760386284717614)
('ACCT_BAL', 0.26035591122344121)
('TENURE', 0.22769266122335349)
('ATM_IND', 0.03283368711922259)
('ONLINE_IND', 0.046818290225637674)
('NOLNK_ACCT', 0.0717727440717046)
('ST_NO', 0.063003159915584819)
('ACCT_TIER', 0.019919683373879393)


## SUMMARY
This illustrative Python notebook shows how to do feature selection with KNN and Random Forest. I hope it shows how easily you can adopt machine learning for your data analytics and modeling needs.