## Data Wrangling

**Problem Statement:** What opportunities exist for CC Bank to reduce its annual cyber security breaches by detecting credit card fraud with an accuracy of 90%?

**Context:** Billions of dollars are lost due to fraudulent credit card transactions.
Advanced machine learning techniques are key to building efficient algorithms that detect fraudulent activity.

**Criteria for Success:** The system will be adopted and used to detect credit card fraud with an accuracy of 80-90%.

**Scope of Solution Space:** The scope will focus on the alert and feedback system.

**Constraints:** Fraud makes up only a small percentage of transactions, so recall and precision need to be high. 
People’s spending habits also differ from one another and change over time.
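
To make the constraint above concrete, here is a minimal sketch of why accuracy alone can mislead on imbalanced data. The counts are hypothetical (9,980 legitimate and 20 fraudulent transactions) and the "model" is a deliberately useless one that never flags fraud. It is an illustration under those assumptions, not part of the analysis itself.

```python
# Hypothetical counts, chosen only to illustrate the imbalance problem.
n_legit, n_fraud = 9_980, 20

y_true = [0] * n_legit + [1] * n_fraud
# A useless "model" that labels every transaction as non-fraudulent.
y_pred = [0] * len(y_true)

# Confusion-matrix cells, counted by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when nothing is predicted as fraud.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.998 precision=0.000 recall=0.000
```

The model scores 99.8% accuracy while catching no fraud at all, which is why precision and recall are tracked alongside accuracy.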

**Stakeholders:** CC Bank CEO, CC Bank VP, Head of Finance department, Tech Lead

**Data Sources:** https://data.world/vlad/credit-card-fraud-detection

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

In [2]:
df = pd.read_csv('CC.csv')

In [3]:
df.head()

| | Unnamed: 0 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 2 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 3 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 4 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 5 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


‘Time’ and ‘Amount’ denote the time and amount of each transaction. ‘Class’ denotes whether the transaction is fraudulent: Class = 0 means non-fraudulent and Class = 1 means fraudulent. ‘V1’ to ‘V28’ are dimension-reduced features of transaction details that can’t be disclosed. I will delete the first column, since it is just another index.

In [4]:
# Drop the first column, which only duplicates the row index
df = df.iloc[:, 1:]
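
As an alternative to slicing the column away afterwards, the duplicate index can be absorbed at load time. The sketch below uses a tiny made-up CSV string in place of CC.csv (the real file is not reproduced here) and the `index_col` parameter of `pd.read_csv`. Whether this suits the real file depends on its first column really being a plain row counter, which is an assumption.

```python
import io
import pandas as pd

# Tiny stand-in for CC.csv; values are abbreviated and for illustration only.
csv_text = ",Time,V1,Amount,Class\n1,0.0,-1.36,149.62,0\n2,0.0,1.19,2.69,0\n"

# index_col=0 tells pandas to use the first CSV column as the row index,
# so no separate column-dropping step is needed.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.columns.tolist())
print(toy.shape)
```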

In [6]:
# Checking the data types
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

In [7]:
df.shape

(284807, 31)

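This dataframe is very large (284,807 rows), so we take a random sample of the rows for the purpose of speed. Note that the shape reported below, 2,848 rows, is about 1/100 of the original rather than 1/10, which is consistent with the frac=0.1 sampling cell having been run twice.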

In [9]:
df = df.sample(frac=0.1)
df.shape

(2848, 31)
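
Because fraud is rare, a plain random sample can end up with very few fraudulent rows, and the draw changes on every run. One possible refinement, sketched here on a made-up toy frame rather than the real data, is to sample within each class and fix a seed. The seed value 42 and the toy counts are arbitrary assumptions.

```python
import pandas as pd

# Toy frame standing in for the real data: 980 legitimate rows, 20 fraudulent.
toy = pd.DataFrame({
    "Amount": range(1000),
    "Class": [0] * 980 + [1] * 20,
})

# Sample 10% within each class so the fraud rate is preserved;
# random_state (an arbitrary seed) makes the draw repeatable.
sampled = toy.groupby("Class").sample(frac=0.1, random_state=42)

print(sampled.shape)
print(sampled["Class"].value_counts().to_dict())
```

With this approach the sampled frame keeps the same 2% fraud share as the toy input, instead of leaving it to chance.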

In [10]:
df.nunique()

Time      2819
V1        2842
V2        2842
V3        2842
V4        2842
V5        2842
V6        2842
V7        2842
V8        2842
V9        2842
V10       2842
V11       2842
V12       2842
V13       2842
V14       2842
V15       2842
V16       2842
V17       2842
V18       2842
V19       2842
V20       2842
V21       2842
V22       2842
V23       2842
V24       2842
V25       2842
V26       2842
V27       2842
V28       2842
Amount    1574
Class        2
dtype: int64

It's good that there are only 2 unique values in the Class column, since we only want 0 and 1. Let's look at the summary statistics and then check whether there are any null values.

In [12]:
df.describe()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | ... | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 | 2848.0 |
| mean | 94871.58427 | -0.063274 | 0.054175 | -0.0327 | 0.006162 | 0.045724 | 0.010807 | -0.015816 | -0.031364 | 0.013589 | ... | 0.023538 | 0.010632 | -0.001514 | 0.011602 | 0.010894 | -0.008892 | -0.003576 | -0.006752 | 78.410246 | 0.001756 |
| std | 47951.401117 | 2.084583 | 1.774971 | 1.576606 | 1.423705 | 1.421903 | 1.387213 | 1.261413 | 1.36195 | 1.115875 | ... | 0.925166 | 0.764477 | 0.534403 | 0.601002 | 0.514686 | 0.492577 | 0.443371 | 0.361624 | 207.055479 | 0.041871 |
| min | 237.0 | -33.669917 | -47.429676 | -22.338591 | -4.365023 | -23.611865 | -20.054615 | -21.234463 | -23.17964 | -8.73967 | ... | -11.102491 | -8.483441 | -6.674813 | -2.132702 | -2.705961 | -1.420358 | -6.156626 | -5.048979 | 0.0 | 0.0 |
| 25% | 54518.5 | -0.943994 | -0.567482 | -0.922681 | -0.817736 | -0.64731 | -0.767298 | -0.560832 | -0.207437 | -0.630637 | ... | -0.224034 | -0.527283 | -0.166926 | -0.347649 | -0.326571 | -0.338261 | -0.071399 | -0.05376 | 5.265 | 0.0 |
| 50% | 84728.0 | -0.020203 | 0.099413 | 0.165283 | -0.047724 | 0.000429 | -0.282194 | 0.050991 | 0.034616 | -0.051596 | ... | -0.017579 | 0.041432 | -0.015784 | 0.045381 | 0.049498 | -0.064534 | 0.002952 | 0.011224 | 20.51 | 0.0 |
| 75% | 139457.0 | 1.305654 | 0.860643 | 1.005527 | 0.791383 | 0.652474 | 0.385614 | 0.590934 | 0.333155 | 0.603517 | ... | 0.187222 | 0.546772 | 0.140556 | 0.440699 | 0.365841 | 0.23873 | 0.094413 | 0.077147 | 72.5025 | 0.0 |
| max | 172676.0 | 2.379609 | 9.843153 | 4.101716 | 11.885313 | 29.016124 | 16.493227 | 21.437514 | 7.500621 | 7.932015 | ... | 22.579714 | 4.359627 | 5.620972 | 3.104385 | 1.815601 | 3.517346 | 6.21123 | 7.262727 | 5239.5 | 1.0 |
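
The Class mean of 0.001756 in the summary above is the share of fraudulent rows, which works out to roughly 5 of the 2,848 sampled transactions (0.001756 × 2848 ≈ 5). A short sketch of how that balance could be inspected directly follows. It runs on a made-up label series standing in for the Class column, so the counts are assumptions rather than results from the real data.

```python
import pandas as pd

# Toy labels standing in for df["Class"]; counts are made up.
labels = pd.Series([0] * 995 + [1] * 5, name="Class")

# Confirm the column holds only the two expected labels.
assert set(labels.unique()) <= {0, 1}

counts = labels.value_counts()   # absolute count per class
fraud_rate = labels.mean()       # mean of a 0/1 column = share of 1s

print(counts.to_dict())
print(f"fraud rate: {fraud_rate:.4f}")
```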


In [13]:
# Count null values per column; there are none in this dataframe
df.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
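
The checks done by eye in the cells above can also be written as assertions, so they fail loudly if the data changes. This is a minimal sketch on a toy frame. The helper name `check_clean` is hypothetical, and the non-negative Amount rule is an assumption based on the minimum of 0.0 seen in the summary statistics.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame, label_col: str = "Class") -> None:
    """Raise AssertionError if any basic cleanliness check fails."""
    # No missing values anywhere in the frame.
    assert not frame.isna().any().any(), "found null values"
    # The label column contains only 0 and 1.
    assert frame[label_col].isin([0, 1]).all(), "unexpected label values"
    # Transaction amounts are never negative (assumed rule).
    assert (frame["Amount"] >= 0).all(), "negative amounts"

# Toy rows for illustration only.
toy = pd.DataFrame({"Amount": [149.62, 2.69, 0.0], "Class": [0, 0, 1]})
check_clean(toy)
print("all checks passed")
```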

**Summary:**
The dataset is now clean. I deleted the first column (a duplicate index), checked the data types, checked for null values and for values that should not be there, and downsized the dataset by random sampling to 2,848 of its original 284,807 rows.