# Predict Credit Card Fraud


Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being Logistic Regression.

In this project, you have access to a dataset (based on a synthetic financial dataset) that represents a typical set of credit card transactions. *transactions.csv* is the original dataset containing 200k transactions. For starters, we’re going to be working with a small portion of this dataset, *transactions_modified.csv*, which contains one thousand transactions. Your task is to use Logistic Regression to create a predictive model that determines whether a transaction is fraudulent or not.

In [36]:
#import libraries 
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Load the data
transactions = pd.read_csv('transactions_modified.csv')
print(transactions.head())
transactions.info()  # info() prints its summary directly and returns None

   step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2           1    818679.85  


How many fraudulent transactions are in the existing dataset?

In [37]:
print("Fraudulent transactions in this dataset: ", transactions.isFraud.sum())

Fraudulent transactions in this dataset:  282
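
The fraud count also tells us how unbalanced the classes are, which matters when reading accuracy scores later. The following is a minimal sketch on a small made-up label column (`is_fraud` is a hypothetical stand-in, not the real `isFraud` data) showing how the fraud rate and the "always predict the majority class" baseline could be computed:

```python
import pandas as pd

# Hypothetical stand-in for transactions.isFraud (NOT the real data)
is_fraud = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 0, 0], name="isFraud")

fraud_count = int(is_fraud.sum())   # number of rows labelled 1
fraud_rate = is_fraud.mean()        # share of rows labelled 1

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(fraud_rate, 1 - fraud_rate)

print(fraud_count, fraud_rate, majority_baseline)
```

Applied to the figures above (282 fraudulent rows out of 1,000), the same arithmetic gives a fraud rate of 0.282 and a majority-class baseline of 0.718.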


#### Clean the Data

In [38]:
print(transactions.amount.describe())

count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64


In [39]:
print(transactions.type.unique())

['CASH_OUT' 'PAYMENT' 'CASH_IN' 'TRANSFER' 'DEBIT']


We have a lot of information about the type of transaction we are looking at. Let’s create a new column called *isPayment* that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [40]:
#create dictionary
ispayment = {'PAYMENT':1,'DEBIT':1, 'CASH_OUT':0, 'CASH_IN':0, 'TRANSFER':0}
#transform column
transactions['isPayment'] = transactions.type.map(ispayment)

print(transactions.head()) 

   step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2           1    818679.85  


Similarly, create a column called *isMovement*, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [41]:
#create dictionary
ismovement = {'PAYMENT':0,'DEBIT':0, 'CASH_OUT':1, 'CASH_IN':0, 'TRANSFER':1}
#transform column
transactions['isMovement'] = transactions.type.map(ismovement)

print(transactions.head()) 

   step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2           1    818679.85  
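
As an aside, the *isPayment* and *isMovement* flags can also be derived with `Series.isin`, which only needs the list of types that map to 1. This is an alternative sketch, not the approach used in this notebook, and it runs on a small hypothetical frame (`df`) rather than the real data:

```python
import pandas as pd

# Hypothetical mini-frame containing the five transaction types seen above
df = pd.DataFrame({"type": ["CASH_OUT", "PAYMENT", "CASH_IN", "TRANSFER", "DEBIT"]})

# isin() returns booleans; astype(int) turns them into 1/0 flags
df["isPayment"] = df["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
df["isMovement"] = df["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)

print(df)
```

One practical difference: `map` with a dictionary yields `NaN` for any type missing from the dictionary, while `isin` yields 0 for it.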


With financial fraud, another key factor to investigate is the difference in value between the origin and destination accounts. Our theory, in this case, is that destination accounts with a significantly different value could be suspected of fraud. Let’s create a column called *accountDiff* with the absolute difference of the *oldbalanceOrg* and *oldbalanceDest* columns.

In [42]:
# Create accountDiff field
transactions['accountDiff'] = abs(transactions['oldbalanceDest'] - transactions['oldbalanceOrg'])
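
To make the calculation concrete, here is a small worked sketch using the balances of the first three rows printed earlier. The frame `rows` is a hand-built stand-in for illustration only:

```python
import pandas as pd

# Balances copied from rows 0-2 of the head() output above
rows = pd.DataFrame({
    "oldbalanceOrg":  [0.00, 0.00, 1131750.38],
    "oldbalanceDest": [649420.67, 0.00, 313070.53],
})

# Absolute difference between destination and origin balances
rows["accountDiff"] = (rows["oldbalanceDest"] - rows["oldbalanceOrg"]).abs()

print(rows)
```

The resulting values (649420.67, 0.00 and 818679.85, up to floating-point rounding) match the *accountDiff* column shown in the printed output.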

#### Select and Split the Data into Test and Train

In [43]:
# Separate out X (features: amount,isPayment,isMovement,accountDiff) and y (label: isFraud)
X = transactions[['amount', 'isPayment','isMovement','accountDiff']]
y = transactions[['isFraud']]

In [44]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=6)
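
Because fraud is the minority class, a purely random split can leave the train and test sets with slightly different fraud rates. One optional variation, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below uses made-up arrays (`X_demo`, `y_demo`, with an assumed 28% positive rate) to show its effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 280 labelled as fraud (an assumed ratio)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 280 + [0] * 720)

# stratify=y_demo keeps the class ratio (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=6, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```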

#### Normalize the Data

Since sklearn's Logistic Regression implementation applies regularization by default, we need to scale our feature data so that no feature is penalized more heavily simply because of its units.

In [45]:
# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
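
Note that the scaler is fitted on the training set only and then reused for the test set. As a minimal sketch of what that means, the example below uses tiny made-up arrays (`X_tr`, `X_new`, `scaler_demo` are illustrative names) and checks that `transform` standardizes new rows with the mean and standard deviation learned from the training rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training rows and one unseen row (two features each)
X_tr = np.array([[100.0, 0.0], [200.0, 1.0], [300.0, 1.0], [400.0, 0.0]])
X_new = np.array([[250.0, 1.0]])

scaler_demo = StandardScaler().fit(X_tr)   # learns per-column mean and std

# Same computation by hand: z = (x - training mean) / training std
manual = (X_new - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(scaler_demo.transform(X_new))
print(manual)
```

Both prints show the same values, because `StandardScaler` uses the population standard deviation, which is also NumPy's default for `std`.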

#### Create and Evaluate the Model

In [51]:
# Create and fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())

LogisticRegression()

Let's look at the model scores, first on the training data and then on the test data.

Scoring the model on the training data runs the training data through the trained model and predicts which transactions are fraudulent. The score returned is the fraction of correct classifications, i.e. the accuracy.

In [55]:
print('Model score on the training data:', model.score(X_train,y_train))

Model score on the training data: 0.8414285714285714


In [56]:
print('Model score on the test data:', model.score(X_test, y_test))

Model score on the test data: 0.85
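
Accuracy alone can hide how well the fraudulent class is detected, since most transactions are legitimate. The sketch below runs on synthetic data generated with `make_classification`, not on the transactions dataset, and all variable names are illustrative. It shows that `score` equals the share of matching predictions, and how a confusion matrix could be used to look at the fraud class separately:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 30% positives (NOT the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=6
)

demo_model = LogisticRegression().fit(X_tr, y_tr)

score = demo_model.score(X_te, y_te)
manual_accuracy = np.mean(demo_model.predict(X_te) == y_te)  # same quantity

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_model.predict(X_te))
recall_positive = cm[1, 1] / cm[1].sum()   # share of true positives caught

print(score, manual_accuracy)
print(cm)
print(recall_positive)
```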


Let's print the coefficients of our model to see how important each feature column was for prediction.

In [54]:
print(model.coef_)

[[ 2.76728882 -0.61054026  2.06030391 -1.29953811]]
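
The coefficients come back in the same order as the feature columns. As a small sketch, the values printed above can be paired with the feature names and ranked by magnitude (the names `coefs` and `ranked` are illustrative):

```python
import pandas as pd

# Feature order used when building X, and the coefficients printed above
features = ["amount", "isPayment", "isMovement", "accountDiff"]
values = [2.76728882, -0.61054026, 2.06030391, -1.29953811]

coefs = pd.Series(values, index=features)

# Rank by absolute size; the sign gives the direction of the effect
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked)
```

Because the features were standardized, the magnitudes are roughly comparable: *amount* and *isMovement* push the prediction towards fraud, while *accountDiff* and *isPayment* push it away.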


#### Predict with the Model

Let’s use our model to process more transactions that have gone through our systems.
We create several made-up transactions and run them through the model.

In [90]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
your_transaction = np.array([123500.0, 0.0, 1.0, 62350.0])

In [91]:
# Combining new transactions into single array.
sample_transactions = np.stack((transaction1,transaction2,transaction3,your_transaction))

Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on.

In [92]:
# Normalizing our new sample transactions, for our model.
sample_transactions = scaler.transform(sample_transactions)

Now we can apply our model on new sample transactions.

In [93]:
# Predict fraud on the new transactions
print(model.predict(sample_transactions))

[0 0 0 0]


Let's see the probability of each transaction being fraudulent. The 1st column is the probability of a transaction not being fraudulent, and the 2nd column is the probability of a transaction being fraudulent.

In [94]:
# Show probabilities on the new transactions
print(model.predict_proba(sample_transactions))

[[0.59993568 0.40006432]
 [0.99794715 0.00205285]
 [0.99576952 0.00423048]
 [0.60055607 0.39944393]]
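
For a binary model, `predict` labels a transaction as fraudulent only when its fraud probability exceeds 0.5, which is why all four samples were classified as 0. The sketch below reuses the second-column probabilities printed above and applies a lower cut-off. The value 0.35 is a hypothetical choice for illustration, not a recommendation:

```python
import numpy as np

# Fraud probabilities (2nd column) copied from the output above
fraud_proba = np.array([0.40006432, 0.00205285, 0.00423048, 0.39944393])

# Default behaviour: flag as fraud only above 0.5
default_labels = (fraud_proba > 0.5).astype(int)

# Hypothetical stricter policy: flag anything above 0.35 for review
stricter_labels = (fraud_proba > 0.35).astype(int)

print(default_labels)
print(stricter_labels)
```

With the lower threshold, the first and last transactions would be flagged, trading more false alarms for fewer missed frauds.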
