<a href="https://colab.research.google.com/github/VisionLogic-AI/Unsupervised_learning/blob/master/Unsupervised_Learning_(Anamoly_Detection).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Credit Card Fraud Detection (Anomaly Detection)
Let's build an applied machine learning solution using the dimensionality reduction methods we created in our previous notebook.

In this notebook, we will build a fraud detection system using unsupervised learning....no labels!

In the real world, fraud often goes undiscovered, and only the fraud that is caught provides labels for the dataset.
Moreover, fraud patterns change over time, so "supervised" systems that are built using fraud labels become stale, capturing historical patterns of fraud but failing to adapt to newly emerging patterns.

#Prepare the Data
Let's load the credit card transactions dataset, generate the features matrix and labels array, and then split the data into training and test sets.
*We will not use the labels to perform anomaly detection, but we will use the labels to help evaluate the fraud detection systems we build.

In [1]:
!git clone https://github.com/aapatel09/handson-unsupervised-learning.git

Cloning into 'handson-unsupervised-learning'...
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 459 (delta 14), reused 21 (delta 5), pack-reused 410
Receiving objects: 100% (459/459), 93.79 MiB | 30.20 MiB/s, done.
Resolving deltas: 100% (74/74), done.


In [2]:
#use the correct version of tensorflow
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

TensorFlow 1.x selected.
1.15.2


In [3]:
!pip install keras
!pip install xgboost



In [4]:
!pip install lightgbm
!pip install fastcluster
!pip install tslearn

Collecting fastcluster
  Downloading https://files.pythonhosted.org/packages/1e/9d/3d7525a4722ee4a11ad969762d1de53b6dac326b5ac1366221e06958e1d7/fastcluster-1.1.26-cp36-cp36m-manylinux1_x86_64.whl (154kB)
Installing collected packages: fastcluster
Successfully installed fastcluster-1.1.26
Collecting tslearn
  Downloading https://files.pythonhosted.org/packages/07/2c/e1bbc15e38e5fa8ffc5b402a3bd91fb1d3e44fc35e8e491eb540bc3aaa38/tslearn-0.4.0-cp36-cp36m-manylinux2010_x86_64.whl (770kB)
Installing collected packages: tslearn
Successfully installed tslearn-0.4.0


In [10]:
#load dataset
import pandas as pd
import numpy as np
import os, time
import pickle, gzip
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn import preprocessing as pp
from scipy.stats import pearsonr
from numpy.testing import assert_array_almost_equal
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report

#unsupervised learning algos
from sklearn.decomposition import PCA
from sklearn.decomposition import IncrementalPCA
from sklearn.decomposition import SparsePCA, KernelPCA, TruncatedSVD
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.manifold import Isomap, MDS, LocallyLinearEmbedding, TSNE
from sklearn.decomposition import MiniBatchDictionaryLearning, FastICA



file = '/content/handson-unsupervised-learning/datasets/credit_card_data/credit_card.csv'
data= pd.read_csv(file)
#separate the features from the labels
datax= data.copy().drop(['Class'],axis= 1)
datay= data['Class'].copy()

#scale the features only; the labels must stay 0/1 for stratification and evaluation
features_to_scale= datax.columns
sx= pp.StandardScaler(copy= True)
datax.loc[:, features_to_scale]= sx.fit_transform(datax[features_to_scale])

x_train, x_test, y_train, y_test= train_test_split(datax, datay, test_size= 0.33,
                                                   random_state= 2018, stratify= datay)

#Define Anomaly Score Function
Next, we need to define a function that calculates how anomalous each transaction is.
*The more anomalous the transaction is, the more likely it is to be fraudulent.

The score is the sum of squared differences between the original and the reconstructed features, which we then rescale to lie between 0 and 1.

In [0]:
#function
def anomaly_scores(originaldf, reduceddf):
  #reconstruction error: sum of squared differences per transaction
  loss= np.sum((np.array(originaldf)- np.array(reduceddf))**2, axis= 1)
  loss= pd.Series(data= loss, index= originaldf.index)
  #min-max scale the error to the range [0, 1]
  loss= (loss- np.min(loss))/(np.max(loss)- np.min(loss))
  return loss
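
To make the scoring concrete, here is a minimal sketch on a hand-made example. The function name `toy_anomaly_scores` and the two tiny DataFrames are hypothetical and exist only for illustration; the arithmetic is the same sum of squared differences followed by min-max scaling described above.

```python
import numpy as np
import pandas as pd

def toy_anomaly_scores(original_df, reconstructed_df):
    # Sum of squared differences per row, then min-max scale to [0, 1]
    loss = np.sum((np.array(original_df) - np.array(reconstructed_df)) ** 2, axis=1)
    loss = pd.Series(data=loss, index=original_df.index)
    return (loss - loss.min()) / (loss.max() - loss.min())

# Hypothetical originals and reconstructions for three rows
original = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], index=["a", "b", "c"])
reconstructed = pd.DataFrame([[1.0, 2.0], [3.0, 5.0], [7.0, 6.0]], index=["a", "b", "c"])

toy_scores = toy_anomaly_scores(original, reconstructed)
print(toy_scores)
```

Row `a` is reconstructed exactly and gets the lowest score, while row `c` has the largest squared error (4) and gets the highest score. Because of the min-max scaling, the scores are only meaningful relative to the other rows in the same batch.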

#Define Evaluation Metrics
Although we will not use the fraud detection labels in order to build the unsupervised fraud detection system, we will use the labels to evaluate the unsupervised model we develop.
*The labels will help us understand just how well these solutions are catching known patterns of fraud.

In [0]:
#function
def plot_results(truelabels, anomaly_scores, return_preds= False):
  preds= pd.concat([truelabels, anomaly_scores], axis= 1)
  preds.columns= ['truelabel', 'anomalyscore']
  #precision-recall curve and its summary statistic
  precision, recall, thresholds= precision_recall_curve(preds['truelabel'], preds['anomalyscore'])
  average_precision= average_precision_score(preds['truelabel'], preds['anomalyscore'])
  plt.step(recall, precision, color= 'k', alpha= 0.7,
           where= 'post')
  plt.fill_between(recall, precision, step= 'post', alpha= 0.3, color= 'k')
  plt.xlabel('Recall')
  plt.ylabel('Precision')
  plt.ylim([0.0, 1.05])
  plt.xlim([0.0, 1.0])
  plt.title('Precision Recall Curve: Average Precision= {0:0.2f}'.format(average_precision))
  #ROC curve and the area under it
  fpr, tpr, thresholds= roc_curve(preds['truelabel'], preds['anomalyscore'])
  area_under_roc= auc(fpr, tpr)
  plt.figure()
  plt.plot(fpr, tpr, color= 'r', lw=2, label= "ROC Curve")
  plt.plot([0,1], [0,1], color='k',lw=2, linestyle= '--')
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('Receiver operating characteristic: Area Under Curve= {0:0.2f}'.format(area_under_roc))
  plt.legend(loc= 'lower right')
  plt.show()
  if return_preds== True:
    return preds
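
As a rough illustration of the two summary numbers these plots report, the sketch below computes them without plotting. The labels and scores are made up for the example and are not taken from the credit card data.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

# Hypothetical labels (1 = fraud) and anomaly scores for eight transactions
toy_labels = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.90])

# Points of the precision-recall curve, one per score threshold
precision, recall, thresholds = precision_recall_curve(toy_labels, toy_scores)

# Single-number summaries of the precision-recall and ROC curves
avg_precision = average_precision_score(toy_labels, toy_scores)
roc_auc = roc_auc_score(toy_labels, toy_scores)
print("average precision: {:.3f}".format(avg_precision))
print("ROC AUC: {:.3f}".format(roc_auc))
```

With heavily imbalanced data such as fraud, average precision tends to be the more informative of the two, since it focuses on how clean the top of the ranking is.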

The fraud labels and the evaluation metrics will help us assess just how good the unsupervised fraud detection systems are at catching known patterns of fraud: fraud that we have caught in the past and have labels for.

However, we will not be able to assess how good the unsupervised detection systems are at catching unknown patterns of fraud.
*In other words, there may be fraud in the dataset that is incorrectly labeled as not fraud because the financial company **never discovered it**.

#Define Plotting Function
We will reuse the scatterplot function from the earlier notebook to display the separation of points that the dimensionality reduction algorithm achieves in just the first two dimensions:


In [0]:
def scatterPlot(xdf, ydf, algoname):
  #keep the first two components and attach the labels
  tempdf= pd.DataFrame(data= xdf.loc[:,0:1], index= xdf.index)
  tempdf= pd.concat((tempdf, ydf), axis= 1, join= 'inner')
  tempdf.columns= ['First Vector', 'Second Vector', 'Label']
  sns.lmplot(x= 'First Vector', y= 'Second Vector', hue= 'Label', data= tempdf,  fit_reg= False)
  ax= plt.gca()
  ax.set_title('Separation of Observations Using '+algoname)

#Normal PCA Anomaly Detection
We will now use PCA to learn the underlying structure of the credit card transactions dataset.
Once we learn this structure, we will use the learned model to reconstruct the credit card transactions, and then calculate "how different" the reconstructed transactions are from the original transactions.
*Those transactions that PCA does the poorest job of reconstructing are the most anomalous and most likely to be fraud.

Deeper Understanding:<br>
***Anomaly detection relies on reconstruction error. We want the reconstruction error for rare transactions (the ones that are most likely to be fraudulent) to be as high as possible and the reconstruction error for the rest to be as low as possible.***


For PCA, the reconstruction error will depend largely on the number of principal components we keep and use to reconstruct the original transactions.
*The more principal components we keep, the better PCA will be at learning the underlying structure of the original transactions.

HOWEVER....THERE IS A BALANCE!

If we keep too many principal components, PCA may too easily reconstruct the original transactions, fraudulent ones included. If we keep too few, PCA may not be able to reconstruct any of the original transactions well enough, not even the normal, non-fraudulent transactions.

Let's search for the right number of principal components to keep in order to build a good fraud detection system.

#PCA Components Equal Number of Original Dimensions
If we use PCA to generate the same number of principal components as the number of original features, will we be able to perform anomaly detection?

The answer should be obvious.

When the number of principal components equals the number of original dimensions, PCA captures nearly 100% of the variance/information in the data as it generates the principal components.
Therefore, when PCA reconstructs transactions from the principal components, it will have too little reconstruction error for all the transactions, fraudulent or otherwise. In other words, anomaly detection would be poor.
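
A small numerical check of this point, using random data rather than the credit card transactions (the array `X_demo` and the helper `mean_reconstruction_error` are hypothetical names introduced for this sketch):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2018)
# Hypothetical data: 500 rows, 5 correlated features
X_demo = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))

def mean_reconstruction_error(X, n_components):
    # Project onto n_components principal components and map back
    pca = PCA(n_components=n_components, random_state=2018)
    X_back = pca.inverse_transform(pca.fit_transform(X))
    return float(np.mean(np.sum((X - X_back) ** 2, axis=1)))

full_error = mean_reconstruction_error(X_demo, 5)     # as many components as features
reduced_error = mean_reconstruction_error(X_demo, 3)  # fewer components than features
print("error with 5 components:", full_error)
print("error with 3 components:", reduced_error)
```

With as many components as features, the error is zero up to floating-point precision for every row, so there is nothing left to rank transactions by.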

To highlight this point, let's apply PCA to generate the same number of principal components as the number of original features (30 for our credit card transaction dataset). <br>
*This is accomplished with the fit_transform function in sklearn. To reconstruct the original transactions from the principal components we generate, we will use the inverse_transform function.

In [0]:
#30 principal components
from sklearn.decomposition import PCA

n_components= 30
whiten = False
random_state= 2018

pca= PCA(n_components=n_components, whiten= whiten, random_state= random_state)

x_train_pca= pca.fit_transform(x_train)
x_train_pca= pd.DataFrame(data= x_train_pca, index= x_train.index)
x_train_pca_inverse= pca.inverse_transform(x_train_pca)
x_train_pca_inverse= pd.DataFrame(data= x_train_pca_inverse, index= x_train.index)

scatterPlot(x_train_pca, y_train, 'PCA')

Let's calculate the precision-recall curve and the ROC curve.

In [0]:
anomaly_scores_pca= anomaly_scores(x_train, x_train_pca_inverse)
preds= plot_results(y_train, anomaly_scores_pca, True)

With an average precision of 0.11, this is a poor fraud detection solution. It catches very little fraud.

#Search for the Optimal Number of Principal Components
Now let's conduct a few experiments by reducing the number of principal components PCA generates and evaluate the fraud detection results.<br>

*We need the PCA-based fraud detection solution to have enough error on the rare cases that it can meaningfully separate fraud cases from the normal ones. But the error cannot be so low or so high for all transactions that the rare and the normal transactions are virtually indistinguishable.<br>

After running PCA with a reduced number of principal components, we can see that we are able to catch 80% of the fraud with 75% precision.<br>
*This is very impressive seeing that we did not use ANY LABELS!!!

Using PCA, we calculated the reconstruction error for each of these 190,820 transactions. If we sort these transactions by highest reconstruction error (also referred to as anomaly score) in descending order and extract the top 350 transactions from the list, we can see that 264 of these transactions are fraudulent.




In [0]:
#precision and recall among the 350 highest anomaly scores
preds.sort_values(by= 'anomalyscore', ascending= False, inplace= True)
cutoff= 350
preds_top= preds[:cutoff]
print('Precision: ', np.round(preds_top.anomalyscore[preds_top.truelabel==1].count()/cutoff, 2))
print('Recall: ', np.round(preds_top.anomalyscore[preds_top.truelabel==1].count()/y_train.sum(), 2))

#Sparse PCA Anomaly Detection
Let's try applying the SparsePCA algorithm to our fraud detection dataset.<br>
*Remember that Sparse PCA is very similar to standard PCA; the only difference is that it produces a sparser (less dense) set of components.

We will need to specify the number of principal components we desire, but we must also set the *alpha parameter*, which controls the degree of sparsity.

We will experiment with different values for the number of principal components and the alpha parameter as we search for the optimal sparse PCA fraud detection solution.

Note that the normal PCA within sklearn used a ***fit_transform*** function to generate the principal components and an ***inverse_transform*** function to reconstruct the original dimensions from the principal components. <br>
Using these two functions allowed us to calculate the reconstruction error between the original feature set and the reconstructed feature set derived from the PCA.

In [0]:
#Sparse PCA
from sklearn.decomposition import SparsePCA
n_components= 27
alpha= 0.0001
random_state= 2018
n_jobs= -1

sparse_pca= SparsePCA(n_components= n_components, alpha= alpha,
                      random_state= random_state, n_jobs= n_jobs)
sparse_pca.fit(x_train.loc[:,:])
x_train_sparse_pca= sparse_pca.transform(x_train)
x_train_sparse_pca= pd.DataFrame(data= x_train_sparse_pca, index= x_train.index)

scatterPlot(x_train_sparse_pca, y_train, 'SparsePCA')

Now let's generate the original dimensions from the sparse PCA matrix by simple matrix multiplication of the sparse PCA matrix (with 190,820 samples and 27 dimensions) and the sparse PCA components (a 27 x 30 matrix).<br>
This creates a matrix that is the original size (a 190,820 x 30 matrix).<br>
We also need to ***add the mean*** of each original feature to this new matrix, but then we are done.<br>

*From this newly derived inverse matrix, we can calculate the reconstruction errors (anomaly scores) as we did with normal PCA:

In [0]:
#applying the inverse pca function
x_train_sparse_pca_inverse= np.array(x_train_sparse_pca).dot(sparse_pca.components_) + np.array(
    x_train.mean(axis = 0))
x_train_sparse_pca_inverse= pd.DataFrame(data= x_train_sparse_pca_inverse, index= x_train.index)
anomaly_scores_sparse_pca= anomaly_scores(x_train, x_train_sparse_pca_inverse)
preds= plot_results(y_train, anomaly_scores_sparse_pca, True)
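
The "multiply by the components, then add the mean" recipe used above can be sanity-checked on standard PCA, where sklearn does provide an inverse_transform to compare against. This is a sketch on random data with hypothetical variable names; it verifies the identity for PCA with whiten=False only and does not check how close the same recipe comes for sparse PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2018)
X_demo = rng.normal(size=(200, 6))  # hypothetical data: 200 rows, 6 features

pca_demo = PCA(n_components=4, whiten=False, random_state=2018)
X_reduced = pca_demo.fit_transform(X_demo)  # shape (200, 4)

# Manual inverse: reduced matrix times components (4 x 6), plus the feature means
manual_inverse = X_reduced.dot(pca_demo.components_) + pca_demo.mean_
builtin_inverse = pca_demo.inverse_transform(X_reduced)

print(np.allclose(manual_inverse, builtin_inverse))
```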


#Kernel PCA Anomaly Detection
In this example, we will create a fraud detection system using the kernel PCA algorithm, which is a **non-linear form of PCA** and is useful when fraudulent transactions are not linearly separable from the non-fraudulent transactions.

We need to specify the number of components we would like to generate, the kernel (we will use the RBF kernel as we did in the previous notebook), and the gamma (which is set to 1/n_features by default, so 1/30 in our case).<br>
We will also need to set fit_inverse_transform to True in order to use the built-in inverse_transform function provided by sklearn.

Finally, because kernel PCA is so expensive to train, we will train on just the first two thousand samples in the transactions dataset. **(this is not ideal, but it is necessary in order to perform experiments quickly)**


In [0]:
#Kernel pca
from sklearn.decomposition import KernelPCA

n_components = 27
kernel= 'rbf'
gamma= None
fit_inverse_transform= True
random_state= 2018
n_jobs= 1

kernel_pca= KernelPCA(n_components= n_components, kernel= kernel,
                      gamma= gamma, fit_inverse_transform= fit_inverse_transform,
                      random_state= random_state, n_jobs= n_jobs)
#fit on the first 2,000 rows by position (the index is shuffled after the split)
kernel_pca.fit(x_train.iloc[:2000])

x_train_kernelpca= kernel_pca.transform(x_train)
x_train_kernelpca= pd.DataFrame(data= x_train_kernelpca, index= x_train.index)

x_train_kernelpca_inverse= kernel_pca.inverse_transform(x_train_kernelpca)
x_train_kernelpca_inverse= pd.DataFrame(data= x_train_kernelpca_inverse, index= x_train.index)

scatterPlot(x_train_kernelpca, y_train, 'Kernel PCA')

As you can see, the results are far worse than those using normal PCA or Sparse PCA.<br>
While it was worth experimenting with Kernel PCA, we will not use this solution for fraud detection given that we have better performing solutions from other algorithms.

*We will not create any fraud detection systems using SVD, seeing that it is very similar to normal PCA.

#Gaussian Random Projection Anomaly Detection
Remember that we can set either the number of components we want or the eps parameter, which controls the quality of the embedding derived from the Johnson-Lindenstrauss lemma.<br>
We will choose to explicitly set the number of components.
(Gaussian random projection trains very quickly, so we can train on the entire training set.)<br>
*As with Sparse PCA, we will need to derive our own inverse_transform function because none is provided by sklearn.

In [0]:
#Gaussian Random Projection
from sklearn.random_projection import GaussianRandomProjection

n_components= 27
eps= None
random_state= 2018

grp= GaussianRandomProjection(n_components= n_components, eps= eps, random_state= random_state)

x_train_grp= grp.fit_transform(x_train)
x_train_grp= pd.DataFrame(data= x_train_grp, index = x_train.index)

scatterPlot(x_train_grp, y_train, 'Gaussian Random Projection')

As we can see, these results are poor, so we won't be using this algorithm.
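
For reference, one possible way to write the hand-rolled inverse mentioned above is to map the projected data back with the pseudo-inverse of the projection matrix. This is an assumption made for this sketch, not necessarily the inverse this notebook used, and it runs on random data with hypothetical names.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(2018)
X_demo = rng.normal(size=(300, 30))  # hypothetical data: 300 rows, 30 features

grp_demo = GaussianRandomProjection(n_components=27, random_state=2018)
X_proj = grp_demo.fit_transform(X_demo)  # shape (300, 27)

# components_ has shape (n_components, n_features); its pseudo-inverse maps back
components_pinv = np.linalg.pinv(grp_demo.components_)  # shape (30, 27)
X_back = X_proj.dot(components_pinv.T)                  # shape (300, 30)

# Reconstruction error per row, usable as an unscaled anomaly score
errors = np.sum((X_demo - X_back) ** 2, axis=1)
print(X_back.shape, float(errors.mean()))
```

The pseudo-inverse returns the point in the original space that is consistent with the projection and has the smallest norm. Projecting `X_back` again reproduces `X_proj`, so the error measures only what the random projection discarded.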

#Sparse Random Projection Anomaly Detection
Here, we will designate the number of components we want (instead of setting the eps parameter). As with Gaussian random projection, we will use our own inverse_transform function to create the original dimensions from the components derived by the sparse random projection.

In [0]:
#Sparse Random Projection
from sklearn.random_projection import SparseRandomProjection

n_components= 27
density = 'auto'
eps= .01
dense_output= True
random_state= 2018

srp= SparseRandomProjection(n_components= n_components, density= density, eps= eps,
                            dense_output= dense_output, random_state= random_state)

x_train_srp= srp.fit_transform(x_train)
x_train_srp= pd.DataFrame(data= x_train_srp, index = x_train.index)

scatterPlot(x_train_srp, y_train, 'Sparse Random Projection')

As with Gaussian random projection, these results are poor.

#Nonlinear Anomaly Detection
At this point, we can see that PCA is the best solution thus far for creating a fraud detection system using unsupervised learning algorithms.

Now, we could turn to nonlinear dimensionality reduction techniques, BUT the open source versions of these algorithms run very slowly and are not viable for fast fraud detection.
For these reasons, we will skip them and go directly to "non-distance-based" dimensionality reduction methods:
- dictionary learning and independent component analysis

#Dictionary Learning Anomaly Detection
In dictionary learning, the algorithm learns a sparse representation of the original data.

For anomaly detection, we want to learn an "undercomplete" dictionary, so that the vectors in the dictionary are fewer in number than the original dimensions.

In our case, we will generate 28 vectors (or components). To learn the dictionary, we will feed in 10 batches, where each batch has 200 samples.<br>
*We will need to use our own inverse_transform function as well.

In [0]:
from sklearn.decomposition import MiniBatchDictionaryLearning

n_components= 28
alpha= 1
batch_size= 200
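n_iter= 10 #accepted by the sklearn version used here; newer releases use max_iter instead
random_state= 2018

mbl= MiniBatchDictionaryLearning(n_components= n_components, alpha= alpha,
                                 batch_size= batch_size, n_iter= n_iter, random_state= random_state)
mbl.fit(x_train)

#encode the training set with the learned dictionary
x_train_mbl= mbl.transform(x_train)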
x_train_mbl= pd.DataFrame(data= x_train_mbl, index = x_train.index)

scatterPlot(x_train_mbl, y_train, 'MiniBatch Dictionary Learning')

These results are better than Kernel PCA, Gaussian random projection, and sparse random projection, BUT they are no match for normal PCA.
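
The hand-written inverse for dictionary learning mentioned earlier can be sketched as the sparse codes multiplied by the learned dictionary. The example below uses random data, hypothetical names, and the default number of training iterations; it is meant to show the shapes involved, not to reproduce the results above.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(2018)
X_demo = rng.normal(size=(400, 8))  # hypothetical data: 400 rows, 8 features

# Undercomplete dictionary: 6 atoms for 8 original dimensions
dict_demo = MiniBatchDictionaryLearning(n_components=6, alpha=1,
                                        batch_size=200, random_state=2018)
codes = dict_demo.fit(X_demo).transform(X_demo)  # sparse codes, shape (400, 6)

# Manual inverse: codes (400 x 6) times dictionary atoms (6 x 8)
X_back = codes.dot(dict_demo.components_)        # shape (400, 8)

# Reconstruction error per row, usable as an unscaled anomaly score
errors = np.sum((X_demo - X_back) ** 2, axis=1)
print(X_back.shape, float(errors.mean()))
```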

#ICA Anomaly Detection
Lastly, let's use ICA to design our final fraud detection solution.<br>
We first need to specify the number of components, which we will set to 27. To reconstruct the original dimensions, we will use the inverse_transform function provided by sklearn.

In [0]:
#ICA
from sklearn.decomposition import FastICA

n_components= 27
algorithm= 'parallel'
whiten= True
max_iter= 200
random_state= 2018

fast_ica= FastICA(n_components= n_components, algorithm= algorithm,
                  whiten= whiten, max_iter= max_iter, random_state= random_state)

x_train_fastica= fast_ica.fit_transform(x_train)
x_train_fastica= pd.DataFrame(data= x_train_fastica, index= x_train.index)

x_train_fastica_inverse= fast_ica.inverse_transform(x_train_fastica)
x_train_fastica_inverse= pd.DataFrame(data= x_train_fastica_inverse, index= x_train.index)

scatterPlot(x_train_fastica, y_train, 'Independent Component Analysis')

As we can see these results match those of normal PCA.

#Fraud Detection on Test Set
Now, to evaluate our fraud detection solutions, let's apply them to the never-before-seen test set.<br>
*We will do this for the top three solutions:
- PCA
- ICA
- Dictionary Learning

#PCA (test set)
We will take the PCA embedding learned from the training set and use it to transform the test set.<br>
We will then use the sklearn inverse_transform function to recreate the original dimensions from the principal components matrix of the test set.


In [0]:
#PCA on Test Set
x_test_pca= pca.transform(x_test)
x_test_pca= pd.DataFrame(data= x_test_pca, index= x_test.index)

x_test_pca_inverse= pca.inverse_transform(x_test_pca)
x_test_pca_inverse= pd.DataFrame(data= x_test_pca_inverse, index= x_test.index)

scatterPlot(x_test_pca, y_test, 'PCA')


These are impressive results. We are able to catch 80% of the known fraud in the test set with 80% precision, all without using any labels.
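
A statement of the form "X% of the fraud caught at Y% precision" corresponds to one point on the precision-recall curve. The sketch below shows one way to read such a point off the curve; the helper `precision_at_recall` and the toy labels and scores are hypothetical and unrelated to the test-set results quoted above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_at_recall(true_labels, scores, target_recall):
    precision, recall, _ = precision_recall_curve(true_labels, scores)
    # Among thresholds that reach the target recall, report the best precision
    reachable = recall >= target_recall
    return float(precision[reachable].max())

# Hypothetical labels (1 = fraud) and anomaly scores for ten transactions
toy_labels = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90, 0.95])

p80 = precision_at_recall(toy_labels, toy_scores, 0.80)
p100 = precision_at_recall(toy_labels, toy_scores, 1.00)
print("precision at 80% recall:", p80)
print("precision at 100% recall:", p100)
```

In this toy ranking the four highest scores are all fraud, so 80% recall is reached with no false positives. Catching the fifth fraud case requires lowering the threshold past one normal transaction, which reduces precision.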

#ICA Anomaly Detection (On The Test Set)


In [0]:
#Independent Component Analysis

x_test_fastica= fast_ica.transform(x_test)
x_test_fastica= pd.DataFrame(data= x_test_fastica, index= x_test.index)

x_test_fastica_inverse= fast_ica.inverse_transform(x_test_fastica)
x_test_fastica_inverse= pd.DataFrame(data= x_test_fastica_inverse, index = x_test.index)

#plot against the test labels, which share the test index
scatterPlot(x_test_fastica, y_test, 'Independent Component Analysis')

The results are the same as normal PCA and thus quite impressive.

#Dictionary Learning (On Test Set)
Let's now use dictionary learning, which did not perform as well as normal PCA and ICA, but it's still worth a shot.

In [0]:
#Dictionary Learning on test set
x_test_mbl= mbl.transform(x_test)
x_test_mbl= pd.DataFrame(data= x_test_mbl, index= x_test.index)

scatterPlot(x_test_mbl, y_test, 'Mini-batch Dictionary Learning')

While the results are not terrible (we can catch 80% of the fraud with 20% precision), they fall far short of the results from normal PCA and ICA.