In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.manifold import SpectralEmbedding
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# A first look at the data

In [None]:
data = pd.read_csv("../input/creditcard.csv")
data.describe()

In [None]:
data.head(20)

In [None]:
data.info()

Fraud detection problems usually fall under anomaly detection and come with a huge class imbalance. Let's verify this:

In [None]:
plt.hist(data["Class"], bins=2)

In [None]:
data["Class"].value_counts()

Observations:
    - There are 284807 data samples, each with 28 obfuscated features plus the time and amount of the transaction (30 features overall)
    - We only have numerical features with no missing data
    - There is a huge class imbalance, as is the case with any outlier/anomaly/fraud detection problem
    
In the rest of this notebook I will under-sample the negative examples (non-fraudulent transactions) so that the problem can be treated as classification. We could also
consider anomaly detection, but for the sake of EDA the classes have to be somewhat balanced.

# Data sub-sampling and cleaning

Goals :
    - Balance the classes (almost balanced is also ok)
    - Look for constant and duplicated features
    - We already know there are no missing data points so not checking that

Separating fraudulent and non-fraudulent data so that the negative examples can be under-sampled

In [None]:
# Keep every fraudulent transaction and randomly draw 5x as many non-fraudulent ones
# (no normalization is applied at this stage)
fraud = data[data['Class'] == 1].reset_index(drop=True)
non_fraud = data[data['Class'] == 0].sample(len(fraud) * 5).reset_index(drop=True)
new_data = pd.concat([non_fraud, fraud]).sample(frac=1).reset_index(drop=True)
new_data.describe()

Verifying there are no missing data points (this is already apparent from the information above, but doing it as practice):

In [None]:
null_count = new_data.isnull().sum(axis=0).sort_values(ascending=False)
null_count.head(30)

Looking for constant features (features taking only one value)

In [None]:
values_count = new_data.nunique().sort_values()
np.sum(values_count == 1)

OK, looks like we don't have any constant features. Let's look for duplicates now:

In [None]:
duplicates = []
for i, ref in enumerate(new_data.columns[:-1]):
    for other in new_data.columns[i + 1:-1]:
        if other not in duplicates and np.all(new_data[ref] == new_data[other]):
            duplicates.append(other)    
len(duplicates)

Alright, looks like we have a very clean data set! Let's go through the data in the next step.
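
The constant-feature and duplicate-feature checks above can also be written without explicit loops. The following is a minimal sketch on a made-up frame (`toy` and its columns are hypothetical, chosen so that one column is constant and one is a copy of another); it is an alternative formulation, not the code this notebook ran.

```python
import pandas as pd

# Hypothetical toy frame: "b" duplicates "a", and "c" is constant.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [7.0, 7.0, 7.0, 7.0],
    "d": [0.5, 0.1, 0.9, 0.3],
})

# Constant columns have exactly one distinct value.
constant_cols = toy.columns[(toy.nunique() == 1).to_numpy()].tolist()

# Transposing turns columns into rows, so DataFrame.duplicated() flags
# every column that repeats an earlier one.
duplicate_cols = toy.columns[toy.T.duplicated().to_numpy()].tolist()

print(constant_cols)
print(duplicate_cols)
```

Transposing copies the data, so on a very wide or very large frame the explicit loop may still be the cheaper option.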

# Going through data
In this part I will:
- Find the correlation between the features and the target (to identify the more important features)
- Find the number of highly correlated features (to see if dimensionality reduction such as PCA can be helpful)
- Look at the distribution of features to see if outlier detection is needed

In [None]:
corrmat = new_data.corr()
corrmat_orig = data.corr()
f, ax = plt.subplots(figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.title('Correlation matrix of sub-sampled data')
sns.heatmap(corrmat, vmax=1, square=True)
plt.subplot(1, 2, 2)
plt.title('Correlation matrix of original data')
sns.heatmap(corrmat_orig, vmax=1, square=True)

Observations
- Some features show a very high positive correlation with the class (V2, V4 and V11 for instance)
- Many more exhibit a large negative correlation (V1, V3, V7, V10, V12, V14, V16-V18)
- Features themselves might be highly correlated.
- The fact that a smaller region of the heatmap seems to be separate from the rest suggests there is some intrinsic organization to the data (clusters of data); features V1-V19 seem to be highly correlated with the target and with the other features in this group.
- The correlation matrix of the original data is quite different from that of the sub-sampled data. This is because the sheer number of negative examples is so large that the class "looks to be" independent of any feature and is almost always 0. Notice that even in this case some features show strong negative correlation.
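
The last observation can be reproduced on synthetic data. The sketch below is only an illustration: the two Gaussian distributions and the sample sizes are assumptions made for the demo and are not taken from creditcard.csv. The same feature separates the classes equally well in both cases, yet its Pearson correlation with the label shrinks as the positive class becomes rarer.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_correlation(n_neg, n_pos):
    """Pearson correlation between a single feature and a 0/1 label."""
    # Assumed toy distributions: negatives ~ N(0, 1), positives ~ N(-3, 1).
    x = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(-3.0, 1.0, n_pos)])
    y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
    return np.corrcoef(x, y)[0, 1]

r_subsampled = class_correlation(n_neg=2500, n_pos=500)    # 5:1, like the sub-sample
r_original = class_correlation(n_neg=250_000, n_pos=500)   # 500:1, heavily imbalanced

print(f"5:1 ratio   -> r = {r_subsampled:.2f}")
print(f"500:1 ratio -> r = {r_original:.2f}")
```

So a pale row for Class in the heatmap of the original data does not by itself mean the features are uninformative.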

Let's look at some of these features individually, starting with the distributions of time and amount since these two are not obfuscated.

In [None]:
plt.figure(figsize=(16,8))
plt.subplot(1, 2, 1)
plt.title('Histogram of Time for non-fraudulent samples')
sns.distplot(non_fraud["Time"])
plt.subplot(1, 2, 2)
plt.title('Histogram of Time for fraudulent samples')
sns.distplot(fraud["Time"])

We can see that the time distributions of fraudulent and non-fraudulent transactions are pretty similar,
but there are still slight differences, so Time can provide some useful information. Let's look at the same for amounts now.

In [None]:
plt.figure(figsize=(16,8))
plt.subplot(1, 2, 1)
plt.title('Histogram of Amount for non-fraudulent samples, mean = %f' % (non_fraud["Amount"].mean()))
sns.distplot(non_fraud["Amount"])
plt.subplot(1, 2, 2)
plt.title('Histogram of Amount for fraudulent samples, mean = %f' % (fraud["Amount"].mean()))
sns.distplot(fraud["Amount"])

Observations:
    - Distribution of amounts of transactions for fraudulent samples has a shorter tail 
    - Fraudulent transactions, on average, are larger than non-fraudulent ones, so amount is an important feature!

Let's look at the features with a high (positive or negative) correlation with the class:

In [None]:
important_feats = new_data.columns[np.abs(corrmat["Class"]) > 0.5]
important_feats

First we will look at the distributions of these 12 features (Class is also included in the list, but it's irrelevant here!)

In [None]:
f, ax = plt.subplots(figsize=(24, 32))
for i in range(len(important_feats) - 1):
    plt.subplot(3, 4, i + 1)
    plt.title(important_feats[i])
    sns.distplot(new_data[important_feats[i]])

Most of the distributions look Gaussian with a one-sided tail, so some outlier detection and removal is necessary. Now let's see how Class changes with these features.

In [None]:
f, ax = plt.subplots(figsize=(24, 32))
for i in range(len(important_feats) - 1):
    plt.subplot(3, 4, i + 1)
    plt.title(important_feats[i])
    sns.boxplot(x='Class', y=important_feats[i], data=new_data)

As intuitively expected, for these features the fraudulent samples (Class == 1) exhibit a larger dynamic range. We can add this information (i.e., the interquartile range: the 75th percentile minus the 25th percentile of each feature) as an additional feature to the dataset!

I won't do this here since the dataset is small and we can add these features just before we train any models. Just making a note of this!
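
For reference, the dynamic range described above can be computed per class in a few lines. This is a minimal sketch on synthetic data: the column name `V_demo` and both distributions are hypothetical, and the positive class is deliberately given a wider spread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: the positive class is given a wider spread on purpose.
toy = pd.DataFrame({
    "V_demo": np.concatenate([rng.normal(0, 1, 1000), rng.normal(-4, 3, 200)]),
    "Class": np.r_[np.zeros(1000, dtype=int), np.ones(200, dtype=int)],
})

# Interquartile range (75th minus 25th percentile) of each feature, per class.
grouped = toy.groupby("Class")
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
print(iqr)
```

Computed this way, the range is one number per class and feature, so it summarises the data rather than an individual transaction; how to turn it into a per-row feature is left open here.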

# Dimensionality reduction and visualization
In this section I will perform dimensionality reduction to embed the data in a 2D space. This gives us an idea of how separable the data is and whether embedding into a lower dimension helps.
I try the following four methods:
    - Principal component analysis (and kernel PCA)
    - Independent component analysis (ICA)
    - tSNE
    - Diffusion mapping (with two different choices for the affinity matrix)

In [None]:
from sklearn.manifold import TSNE
from sklearn.manifold import SpectralEmbedding
from sklearn.decomposition import PCA, KernelPCA, FastICA

First we can do some mild outlier detection to get rid of irrelevant data

In [None]:
lb = new_data.quantile(0.1)
ub = new_data.quantile(0.9)
rang = ub - lb
reduced_data = new_data[~((new_data < (lb - 2 * rang)) |(new_data > (ub + 2 * rang))).any(axis=1)]
features = reduced_data.drop(['Class'], axis=1, inplace=False)
# Standardize each column with its own mean and (population) standard deviation
features = (features - features.mean()) / (features.std(ddof=0) + 1e-8)
labels = reduced_data['Class']
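
To make the filtering rule in the cell above concrete, here is the same fence (10th and 90th percentiles, widened by twice their distance) applied to a tiny made-up frame. The `toy` data is hypothetical and contains a single obvious outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "x" has one extreme value (50.0), "y" has none.
toy = pd.DataFrame({
    "x": np.r_[np.linspace(0.0, 1.0, 99), 50.0],
    "y": np.linspace(-1.0, 1.0, 100),
})

lb = toy.quantile(0.1)
ub = toy.quantile(0.9)
spread = ub - lb

# A row is dropped if any of its columns falls outside [lb - 2*spread, ub + 2*spread].
outside = (toy < (lb - 2 * spread)) | (toy > (ub + 2 * spread))
kept = toy[~outside.any(axis=1)]
print(len(toy), "->", len(kept))
```

Because a row is removed as soon as one column is out of range, the number of dropped rows grows with the number of features, so it is worth checking how many samples of each class survive.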

**First method : The humble PCA**

In [None]:
pca_embedding =  PCA(n_components=2) 
pca_emb_data = pca_embedding.fit_transform(features.values)
plt.figure(figsize=(10,10))
plt.scatter(pca_emb_data[labels == 1, 0], pca_emb_data[labels == 1, 1], color='red', label='positive samples')
plt.scatter(pca_emb_data[labels == 0, 0], pca_emb_data[labels == 0, 1], color='blue', label='negative samples')
plt.legend()

Very interesting! It looks like just applying PCA turns the data into (almost) linearly separable data! Let's look at another one.
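
One rough way to put a number on "almost linearly separable" is to fit a linear classifier on the 2D embedding and look at its cross-validated accuracy. The sketch below does this on synthetic data (the feature matrix, the class shift and the 5-fold setting are all assumptions for illustration); it is not a result from this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical features: 10 dimensions, positives shifted along 3 of them.
X_neg = rng.normal(0.0, 1.0, size=(500, 10))
X_pos = rng.normal(0.0, 1.0, size=(100, 10))
X_pos[:, :3] -= 4.0
X = np.vstack([X_neg, X_pos])
y = np.r_[np.zeros(500, dtype=int), np.ones(100, dtype=int)]

# Project onto the first two principal components.
emb = PCA(n_components=2).fit_transform(X)

# A linear classifier on the 2D embedding; high cross-validated accuracy
# suggests the classes are close to linearly separable in that plane.
scores = cross_val_score(LogisticRegression(), emb, y, cv=5)
print(scores.mean())
```

With imbalanced classes accuracy can be flattering, so recall or precision on the positive class would be a stricter check.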

**First method (b) : Kernel PCA with RBF kernel**

In [None]:
kpca_embedding =  KernelPCA(n_components=2, kernel='rbf')
kpca_emb_data = kpca_embedding.fit_transform(features.values)
plt.figure(figsize=(10,10))
plt.title('Reduced data with kernel PCA (RBF kernel)')
plt.scatter(kpca_emb_data[labels == 1, 0], kpca_emb_data[labels == 1, 1], color='red', label='positive samples')
plt.scatter(kpca_emb_data[labels == 0, 0], kpca_emb_data[labels == 0, 1], color='blue', label='negative samples')
plt.legend()

Regular PCA looks better! Let's try ICA:

**Second method : (Fast) ICA**

In [None]:
ica_embedding =  FastICA(n_components=2) 
ica_emb_data = ica_embedding.fit_transform(features.values)
plt.figure(figsize=(10,10))
plt.scatter(ica_emb_data[labels == 1, 0], ica_emb_data[labels == 1, 1], color='red', label='positive samples')
plt.scatter(ica_emb_data[labels == 0, 0], ica_emb_data[labels == 0, 1], color='blue', label='negative samples')
plt.legend()

This looks like just a rotation of the PCA results, so let's ignore ICA and try tSNE.

**Third method: tSNE**

In [None]:
tsne_embedding =  TSNE(n_components=2) 
tsne_emb_data = tsne_embedding.fit_transform(features.values)
plt.figure(figsize=(10,10))
plt.title('Reduced data with tSNE')
plt.scatter(tsne_emb_data[labels == 1, 0], tsne_emb_data[labels == 1, 1], color='red', label='positive samples')
plt.scatter(tsne_emb_data[labels == 0, 0], tsne_emb_data[labels == 0, 1], color='blue', label='negative samples')
plt.legend()

**Fourth method : diffusion mapping**

In [None]:
spec_embedding = SpectralEmbedding(n_components=2, affinity='rbf')
transformed_data2 = spec_embedding.fit_transform(features.values)
fig = plt.figure(figsize=(8,24))
plt.subplot(3, 1, 1)
plt.scatter(transformed_data2[labels == 1, 0], transformed_data2[labels == 1, 1], color='red', label='positive samples')
plt.legend()
plt.subplot(3, 1, 2)
plt.scatter(transformed_data2[labels == 0, 0], transformed_data2[labels == 0, 1], color='blue', label='negative samples')
plt.legend()
plt.subplot(3, 1, 3)
plt.scatter(transformed_data2[labels == 1, 0], transformed_data2[labels == 1, 1], color='red', label='positive samples')
plt.scatter(transformed_data2[labels == 0, 0], transformed_data2[labels == 0, 1], color='blue', label='negative samples')
plt.legend()

This looks pretty interesting! It looks like all non-fraudulent data is mapped very close to (0, 0), so we may use spectral embedding to augment the given features.
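
Augmenting the features with the embedding could look like the sketch below. It assumes a generic standardized feature matrix (`X` here is random, hypothetical data) and simply appends the two embedding coordinates as extra columns.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # hypothetical standardized feature matrix

# SpectralEmbedding only offers fit_transform, so the embedding is computed
# for exactly the rows passed in and cannot be applied to unseen rows.
emb = SpectralEmbedding(n_components=2, affinity="rbf", random_state=0).fit_transform(X)

# Append the two embedding coordinates as extra columns.
X_aug = np.hstack([X, emb])
print(X_aug.shape)
```

Since there is no separate `transform` step, using these columns in a model means the embedding has to be computed on training and test rows together, or replaced by a method that supports out-of-sample points.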
Let's try spectral embedding now with a nearest-neighbours affinity instead.

In [None]:
spec_embedding2 = SpectralEmbedding(n_components=2, affinity='nearest_neighbors', n_neighbors=30)
transformed_data2 = spec_embedding2.fit_transform(features.values)
fig = plt.figure(figsize=(8,24))
plt.subplot(3, 1, 1)
plt.scatter(transformed_data2[labels == 1, 0], transformed_data2[labels == 1, 1], color='red', label='positive samples')
plt.legend()
plt.subplot(3, 1, 2)
plt.scatter(transformed_data2[labels == 0, 0], transformed_data2[labels == 0, 1], color='blue', label='negative samples')
plt.legend()
plt.subplot(3, 1, 3)
plt.scatter(transformed_data2[labels == 1, 0], transformed_data2[labels == 1, 1], color='red', label='positive samples')
plt.scatter(transformed_data2[labels == 0, 0], transformed_data2[labels == 0, 1], color='blue', label='negative samples')
plt.legend()

In the embedded space there's quite a lot of overlap between fraudulent and non-fraudulent data, but a very large portion of the fraudulent points is completely separable from the negative samples! We can use this too!

# Conclusions 

   -  The dimension of the feature space is not very large
   -  There are no missing entries and no constant or duplicate features; this is a clean data set!
   -  The dynamic range of features is itself a good feature to add to the existing ones
   -  Outlier detection and removal may need to be carried out (not in the first version of the model we build, though)
   -  Data visualisation in 2D space shows that the classes may be easily separable if the features are embedded into a different feature space, so we have to try this

