#### Data preprocessing
* Exploratory Data Analysis
* Feature transformation
* Feature selection

#### Algorithms used
* Isolation Forest
* Local Outlier Factor

# Data Preprocessing

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn.manifold import TSNE

#for data preprocessing
from sklearn.decomposition import PCA

#for modeling
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

#filter warnings
import warnings
warnings.filterwarnings("ignore")

import os
print(os.listdir("../input"))


### Exploratory Data Analysis

In [None]:
df = pd.read_csv("../input/creditcard.csv")
df.head()

In [None]:
sns.countplot(x=df.Class)
plt.show()
print(df.Class.value_counts())

***Plotting the Data***

Using a technique called t-SNE, we can reduce the dimensionality of the data and create a 2D plot. The objective is to show that distance-based anomaly detection methods might not work as well as other techniques on this dataset, because the positive (fraud) cases do not lie far from the normal cases.

In [None]:
# Sample 1000 normal and 20 fraud transactions to keep t-SNE fast
df_plt = df[df['Class'] == 0].sample(1000)
df_plt_pos = df[df['Class'] == 1].sample(20)
df_plt = pd.concat([df_plt, df_plt_pos])
y_plt = df_plt['Class']
X_plt = df_plt.drop('Class', axis=1)  # axis must be passed by keyword in recent pandas
X_embedded = TSNE(n_components=2).fit_transform(X_plt)
plt.figure(figsize=(12, 8))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y_plt, cmap=plt.get_cmap("Paired", 2))
plt.colorbar(ticks=range(2))
plt.show()

Fraud cases make up 0.17% of the dataset. These anomalies are shown as red dots, and the normal (non-fraudulent) transactions are shown as blue dots.

In [None]:
timedelta = pd.to_timedelta(df['Time'], unit='s')
df['Time_hour'] = (timedelta.dt.components.hours).astype(int)

plt.figure(figsize=(12,5))
sns.distplot(df[df['Class'] == 0]["Time_hour"], color='g')
sns.distplot(df[df['Class'] == 1]["Time_hour"], color='r')
plt.title('Fraud and Normal Transactions by Hours', fontsize=17)
plt.xlim([-1,25])
plt.show()

It seems that the hour of the day has some impact on the number of fraud cases. Let's move on to transforming the remaining features.

### Feature Transformation

Let's transform the remaining features, `Time` and `Amount`, using PCA.

In [None]:
cols= df[['Time', 'Amount']]

pca = PCA()
pca.fit(cols)
X_PCA = pca.transform(cols)

df['V29']=X_PCA[:,0]
df['V30']=X_PCA[:,1]

df.drop(['Time','Time_hour', 'Amount'], axis=1, inplace=True)

df.columns

Now let's look at the distribution of each feature.

In [None]:
columns = df.drop('Class', axis=1).columns
grid = gridspec.GridSpec(6, 5)

plt.figure(figsize=(20,10*2))

for n, col in enumerate(df[columns]):
    ax = plt.subplot(grid[n])
    # Same colour convention as the hourly plot above: normal in green, fraud in red
    sns.distplot(df[df.Class==0][col], bins=50, color='g')
    sns.distplot(df[df.Class==1][col], bins=50, color='r')
    ax.set_ylabel('Density')
    ax.set_title(str(col))
    ax.set_xlabel('')
    
plt.show()

## Feature Selection using Z-test

Let's do some hypothesis testing to find the statistically significant features. We will perform a `Z-test` with the valid transactions as our population.

For every feature, we test whether the values of the fraud transactions differ significantly from those of the normal transactions. The level of significance is 0.01 and it is a two-tailed test.

#### Scenario:
* Valid transactions as our population
* Fraud transactions as sample
* Two tailed Z-test
* Level of significance 0.01
* Corresponding critical value is 2.58
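
As a quick check of the critical value listed above, the following sketch derives it from the level of significance using only the Python standard library. The variable names are illustrative and not part of the notebook's later cells.

```python
from statistics import NormalDist

alpha = 0.01  # level of significance from the scenario above

# Two-tailed test: alpha is split evenly between the two tails of the
# standard normal distribution, so we look up the (1 - alpha/2) quantile.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)

print(round(critical_value, 2))  # 2.58
```

Any feature whose absolute z-score reaches this value leads to rejecting H0 at the 0.01 level.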

#### Hypothesis:
* H0: There is no difference (insignificant)
* H1: There is a difference  (significant)

#### Formula for z-score:

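$$ z = \frac{\bar{x} - \mu}{SE}, \qquad SE = \frac{s}{\sqrt{n}} $$

Here $\bar{x}$ is the mean of the fraud sample, $\mu$ is the mean of the valid transactions, $s$ is the standard deviation of the fraud sample, and $n$ is the sample size.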
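
To make the formula concrete, here is a minimal sketch on made-up numbers. The arrays `population` and `sample` are hypothetical toy values, not taken from the credit card data, and the standard error is assumed to be the sample standard deviation divided by the square root of the sample size, as in the `ztest` function below.

```python
import numpy as np

# Hypothetical toy values, not taken from the credit card data
population = np.array([0.0, 1.0, -1.0, 0.5, -0.5, 0.2, -0.2, 0.0])
sample = np.array([2.0, 3.0, 2.5, 3.5])

mu = population.mean()       # population mean
x_bar = sample.mean()        # sample mean
s = sample.std(ddof=1)       # sample standard deviation (ddof=1, the pandas default)
n = len(sample)              # sample size

standard_error = s / np.sqrt(n)
z_score = (x_bar - mu) / standard_error

print(z_score)
print(abs(z_score) >= 2.58)  # compare against the two-tailed critical value
```

Because the toy sample mean lies far from the toy population mean relative to the standard error, the absolute z-score exceeds the critical value and H0 would be rejected for this made-up feature.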

In [None]:
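def ztest(feature):
    """Z-score of the fraud sample mean against the normal-transaction mean for one feature.

    Relies on the globals `normal`, `fraud` and `sample_size` defined in the next cell.
    """
    mean = normal[feature].mean()   # population mean (valid transactions)
    std = fraud[feature].std()      # sample standard deviation (fraud transactions)
    zScore = (fraud[feature].mean() - mean) / (std / np.sqrt(sample_size))

    return zScore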

In [None]:
columns= df.drop('Class', axis=1).columns
normal= df[df.Class==0]
fraud= df[df.Class==1]
sample_size=len(fraud)
significant_features=[]
critical_value=2.58

for i in columns:

    z_value = ztest(i)

    if abs(z_value) >= critical_value:
        print(i, " is statistically significant")  # Reject the null hypothesis, i.e. H0
        significant_features.append(i)

The distribution plots already suggested that the normal and fraud distributions of features V13, V15, V22, V23, V25 and V26 are almost the same; the hypothesis test now supports this. We will eliminate these features from our dataset, as they contribute little to separating the two classes.

#### Split data into Inliers and Outliers

`Inliers` are values that are normal. `Outliers` are values that don't belong to the normal data; they are the anomalies.

In [None]:
significant_features.append('Class')
df= df[significant_features]

inliers = df[df.Class==0]
ins = inliers.drop(['Class'], axis=1)

outliers = df[df.Class==1]
outs = outliers.drop(['Class'], axis=1)

ins.shape, outs.shape

# Modeling

#### 1. Isolation Forest

Isolation Forest is an unsupervised anomaly detection algorithm that uses two properties of anomalies, “few” and “different”, to detect them. Because anomalies are few and different, they are more susceptible to isolation. The algorithm tries to isolate each point in the data with random splits and classifies the point as an outlier or an inlier depending on how many splits it takes to separate it from the rest. A point that is clearly not an outlier has many other points around it, so it is difficult to isolate. An outlier, on the other hand, sits alone and is isolated very quickly.
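
The idea can be illustrated with a minimal sketch on synthetic data. The 2-D cluster, the far-away point, and all variable names here are assumptions made for illustration; they are unrelated to the credit card features used in this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Hypothetical data: a dense cluster around the origin and one distant point
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_point = np.array([[8.0, 8.0]])

# Fit on the cluster only, mirroring how the model is fitted on inliers below
forest = IsolationForest(random_state=0).fit(cluster)

# score_samples is lower for points that are easier to isolate
center_score = forest.score_samples(np.array([[0.0, 0.0]]))[0]
far_score = forest.score_samples(far_point)[0]

print("score at cluster centre:", center_score)
print("score at distant point:", far_score)
print("prediction for distant point:", forest.predict(far_point)[0])  # -1 marks an outlier
```

The distant point needs fewer random splits to be separated, so it receives the lower score.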

#### 2. Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method that computes the local density deviation of a given data point with respect to its neighbors. It estimates the density around a point from its nearest neighbors and compares it with the densities around those neighbors. In short, the density around an outlier is significantly different from the density around its neighbors: LOF treats as outliers the samples that have a substantially lower density than their neighbors.
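
A minimal sketch of this density comparison on synthetic data follows. The cluster, the distant point, and `n_neighbors=20` are illustrative assumptions. Note that this sketch uses the default outlier-detection mode with `fit_predict`, whereas the modeling cell later in this notebook uses `novelty=True` so that it can call `predict` on separate data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Hypothetical data: a dense cluster plus one distant point appended at the end
cluster = rng.normal(size=(200, 2))
far_point = np.array([[8.0, 8.0]])
X = np.vstack([cluster, far_point])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # +1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_    # more negative = lower relative density

print("label of distant point:", labels[-1])
print("score of distant point:", scores[-1])
print("median score in data:", np.median(scores))
```

The distant point has a much lower local density than its nearest neighbors inside the cluster, which is what the strongly negative score reflects.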

In [None]:
def normal_accuracy(values):
    """Fraction of predictions labelled as inliers (+1)."""
    return np.round(np.mean(np.asarray(values) == 1), 4)

def fraud_accuracy(values):
    """Fraction of predictions labelled as outliers (-1)."""
    return np.round(np.mean(np.asarray(values) == -1), 4)

###  Isolation Forest

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Set the random state for reproducibility
state = 42

# Fit the Isolation Forest model on the normal transactions only
ISF = IsolationForest(random_state=state)
ISF.fit(ins)

# Predict normal and fraud cases (+1 = inlier, -1 = outlier)
normal_isf = ISF.predict(ins)
fraud_isf = ISF.predict(outs)

# Calculate accuracies
in_accuracy_isf = normal_accuracy(normal_isf)
out_accuracy_isf = fraud_accuracy(fraud_isf)

# Print accuracies
print("Accuracy in Detecting Normal Cases:", in_accuracy_isf)
print("Accuracy in Detecting Fraud Cases:", out_accuracy_isf)

# True and predicted labels for all transactions (+1 = normal, -1 = fraud)
y_true = np.concatenate((np.ones(len(ins), dtype=int), -np.ones(len(outs), dtype=int)))
y_pred_isf = np.concatenate((normal_isf, fraud_isf))

# Calculate confusion matrix; labels=[1, -1] orders rows and columns as Normal, Fraud
# so that they match the tick labels of the heatmap below
conf_matrix = confusion_matrix(y_true, y_pred_isf, labels=[1, -1])

# Calculate precision and recall for the fraud class
precision = precision_score(y_true, y_pred_isf, pos_label=-1)
recall = recall_score(y_true, y_pred_isf, pos_label=-1)

# Print confusion matrix, precision, and recall
print("Confusion Matrix:\n", conf_matrix)
print("Precision (fraud):", precision)
print("Recall (fraud):", recall)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

###  Local Outlier Factor

In [None]:
from sklearn.metrics import confusion_matrix

# Fit the Local Outlier Factor model with novelty detection enabled
LOF = LocalOutlierFactor(novelty=True)
LOF.fit(ins)

# Predict normal and fraud cases (+1 = inlier, -1 = outlier)
normal_lof = LOF.predict(ins)
fraud_lof = LOF.predict(outs)

# Calculate accuracies
in_accuracy_lof = normal_accuracy(normal_lof)
out_accuracy_lof = fraud_accuracy(fraud_lof)

# Print accuracies
print("Accuracy in Detecting Normal Cases:", in_accuracy_lof)
print("Accuracy in Detecting Fraud Cases:", out_accuracy_lof)

# True and predicted labels for all transactions (+1 = normal, -1 = fraud)
y_true = np.concatenate((np.ones(len(ins), dtype=int), -np.ones(len(outs), dtype=int)))
y_pred_lof = np.concatenate((normal_lof, fraud_lof))

# Calculate confusion matrix; labels=[1, -1] orders rows and columns as Normal, Fraud
# so that they match the tick labels of the heatmap below
conf_matrix = confusion_matrix(y_true, y_pred_lof, labels=[1, -1])

# Print confusion matrix
print("Confusion Matrix:\n", conf_matrix)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


In [None]:
fig, (ax1,ax2)= plt.subplots(1,2, figsize=[10,2])

ax1.set_title("Accuracy of Isolation Forest",fontsize=10)
sns.barplot(x=[in_accuracy_isf,out_accuracy_isf], 
            y=['normal', 'fraud'],
            label="classifiers", 
            color="lightblue", 
            ax=ax1)
ax1.set(xlim=(0,1))

ax2.set_title("Accuracy of Local Outlier Factor",fontsize=10)
sns.barplot(x=[in_accuracy_lof,out_accuracy_lof], 
            y=['normal', 'fraud'], 
            label="classifiers", 
            color="lightgreen", 
            ax=ax2)
ax2.set(xlim=(0,1))
plt.show()

### CONCLUSION

Isolation Forest and Local Outlier Factor performed about the same in predicting normal cases, but Isolation Forest performed far better in detecting fraud cases.