# Credit Card Fraud Analysis and Detection

---

## Problem Identification

### Problem statement

How can credit card companies detect at least 90% of fraudulent transactions?

### Context

Fraudulent charges make up only a small percentage of credit card transactions, but they cost companies many hours of work and lost revenue. The companies want a model that detects fraud before the charges are processed.

### Criteria for success

Detect 90% of fraudulent transactions.
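
The 90% target corresponds to recall on the fraud class: the share of actual fraudulent transactions that the model flags. The sketch below shows that calculation, assuming labels follow the dataset's convention of 1 for fraud and 0 for valid. The function name `fraud_recall` and the sample labels are hypothetical and only for illustration.

```python
def fraud_recall(y_true, y_pred):
    """Return the fraction of actual fraud cases (label 1) that were predicted as fraud."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    actual_fraud = true_positives + false_negatives
    if actual_fraud == 0:
        raise ValueError("y_true contains no fraud cases, so recall is undefined")
    return true_positives / actual_fraud


# Hypothetical labels: 10 fraud cases (9 of them caught) and 5 valid transactions
y_true = [1] * 10 + [0] * 5
y_pred = [1] * 9 + [0] + [0] * 4 + [1]

recall = fraud_recall(y_true, y_pred)
print(f"Recall: {recall:.0%}")
print("Meets the 90% criterion:", recall >= 0.90)
```

Recall says nothing about false alarms: a model that flags every transaction scores 100%. Precision would normally be tracked alongside it.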

### Scope of solution space

This project only applies to credit card charges by XYZ Credit Union (i.e., the Kaggle dataset).

### Constraints within solution space

For privacy reasons, the dataset was anonymized using Principal Component Analysis (PCA), so the original feature names and values are not available.
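
To illustrate what this constraint means in practice, the sketch below applies PCA to a small synthetic matrix with NumPy. The data, the seed, and the feature count are made up for illustration, and this is not the transformation that was actually applied to the dataset. It shows that each component is a weighted mix of the original features, which is why PCA columns cannot be read as named fields.

```python
import numpy as np

# Hypothetical raw features: 200 rows, 4 columns, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # make two of the features correlated

# Center the data, then take the SVD; rows of Vt are the principal directions
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the principal directions to get the transformed components
components = X_centered @ Vt.T

# The weights show how the first component combines the original features
print("Weights of component 1 on the original features:", Vt[0].round(3))

# Components are uncorrelated: off-diagonal covariances are numerically zero
cov = np.cov(components, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print("Largest off-diagonal covariance:", np.abs(off_diag).max())
```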

### Stakeholders

* Executives from XYZ Credit Union
* Chief Technology Officer

### Key data sources

* [Credit Card Fraud Detection Kaggle dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)



---

## Environment setup

In [1]:
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

# Apply seaborn's default theme with the 'notebook' plotting context
sns.set(context='notebook')

---

## Data Wrangling

### Data Collection

In [2]:
# Read the raw CSV into a DataFrame
df = pd.read_csv('../data/raw/creditcard.csv')

In [3]:
# Inspect the data
df.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


### Data Organization

This project is based on the [cookiecutter data science project template](https://drivendata.github.io/cookiecutter-data-science/).

### Data Definition

In [4]:
# Review the columns
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')
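
The column layout can also be verified in code rather than by eye. The sketch below assumes the schema is exactly the 31 columns listed above. The helper names `expected_columns` and `check_columns` are hypothetical.

```python
def expected_columns():
    """Build the expected schema: Time, V1 through V28, Amount, Class."""
    return ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_columns(columns):
    """Return (missing, unexpected) column names relative to the expected schema."""
    expected = expected_columns()
    actual = list(columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# In the notebook this would be called as check_columns(df.columns).
# Here a made-up extra column stands in for a stray index column.
missing, unexpected = check_columns(expected_columns() + ["Unnamed: 0"])
print("Missing:", missing)
print("Unexpected:", unexpected)
```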

In [5]:
# Review the types
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

All columns have the expected types: every feature is float64 and Class is int64. No type cleaning is needed here.
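
A type check like this can be automated so that a column read as text is caught immediately. The sketch below runs on a small made-up frame in which Amount was deliberately stored as strings. The helper name `non_numeric_columns` is hypothetical, and in the notebook it would be called on `df`.

```python
import pandas as pd


def non_numeric_columns(frame):
    """Return the names of columns whose dtype is not numeric."""
    return [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]


# Made-up sample: Amount is stored as text to simulate a parsing problem
sample = pd.DataFrame({
    "Time": [0.0, 1.0],
    "V1": [-1.36, 1.19],
    "Amount": ["149.62", "2.69"],
    "Class": [0, 0],
})

print("Non-numeric columns:", non_numeric_columns(sample))
```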

In [6]:
# Summarize the DataFrame: row count, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

No column has missing values, so no cleanup is needed.
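
Reading non-null counts from `df.info()` works, but a direct count of missing values is easier to assert on. The sketch below uses a small made-up frame with one gap. The helper name `missing_counts` is hypothetical, and in the notebook it would be called on `df`.

```python
import pandas as pd


def missing_counts(frame):
    """Return missing-value counts for the columns that have at least one."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Made-up sample with a single missing Amount
sample = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [149.62, None, 69.99],
    "Class": [0, 0, 1],
})

gaps = missing_counts(sample)
print(gaps)
print("Any missing values:", not gaps.empty)
```

An empty result means the frame is complete, which is the situation described above.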

In [7]:
# Inspect the percent of unique values per column
df.nunique() / len(df) * 100

Time      43.746116
V1        96.789405
V2        96.789405
V3        96.789405
V4        96.789405
V5        96.789405
V6        96.789405
V7        96.789405
V8        96.789405
V9        96.789405
V10       96.789405
V11       96.789405
V12       96.789405
V13       96.789405
V14       96.789405
V15       96.789405
V16       96.789405
V17       96.789405
V18       96.789405
V19       96.789405
V20       96.789405
V21       96.789405
V22       96.789405
V23       96.789405
V24       96.789405
V25       96.789405
V26       96.789405
V27       96.789405
V28       96.789405
Amount    11.504984
Class      0.000702
dtype: float64

Again, the data looks as expected. Some observations:
* Time is about 44% unique. Overlap is expected because the data covers two days of transactions, so many of them share a timestamp.
* The PCA variables (V1 to V28) are about 97% unique.
* Amount has few unique values (about 11.5%), as expected, since many transactions share the same amount.
* Class takes only two values, 0 (valid) or 1 (fraud), so its unique percentage is near zero.
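
Because Class has only two values, its unique-value percentage says little. The share of rows in each class is more informative. The sketch below uses a made-up 100-row sample, so the printed proportions are not the dataset's. The real figures come from running the same calls on `df`.

```python
import pandas as pd

# Made-up sample: 98 valid rows (0) and 2 fraud rows (1)
sample = pd.DataFrame({"Class": [0] * 98 + [1] * 2})

# Absolute counts and percentage share per class
counts = sample["Class"].value_counts()
shares = sample["Class"].value_counts(normalize=True) * 100

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```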