# Logistic Regression

Using a Transactions dataset, we are going to create a model that will predict if a transaction is a fraudulent or legitimate.

In [98]:
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import our data and standarize it. Our data set has been simplified.

In [99]:
transactions = pd.read_csv("transactions.csv")
fradTrans = (transactions['isFraud'] == 1)
transactions.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isPayment,isMovement,accountDiff
0,206,CASH_OUT,62927.08,C473782114,0.0,0.0,C2096898696,649420.67,712347.75,0,0,1,649420.67
1,380,PAYMENT,32851.57,C1915112886,0.0,0.0,M916879292,0.0,0.0,0,1,0,0.0
2,570,CASH_OUT,1131750.38,C1396198422,1131750.38,0.0,C1612235515,313070.53,1444820.92,1,0,1,818679.85
3,184,CASH_OUT,60519.74,C982551468,60519.74,0.0,C1378644910,54295.32,182654.5,1,0,1,6224.42
4,162,CASH_IN,46716.01,C1759889425,7668050.6,7714766.61,C2059152908,2125468.75,2078752.75,0,0,0,5542581.85


We need to use more than the amount to see if a trasaction is fraudulent or not. We have more legitimate transactions than fraudulent transactions overall. This is important since we will need to balance our data. To do so, we are going to use more features.

Since we have multiple catergories in our data such as type of transaction. We are going to sort it into two types, "PAYMENT"= 1 and "DEBIT"= 0. We are going to considered anything else that is not payment as debit.

We are going to do the same if money is going out or going in. Categorize "CASH_OUT" and "TRANSFER" as 1 and "CASH_IN" as 0.

In [101]:
transactions["isPayment"] = transactions["type"].apply(lambda x: 1 if x == "PAYMENT" else 0)
transactions["isMovement"] = transactions["type"].apply(lambda x: 1 if x == "TRANSFER" or "CASH_OUT" else 0)

Lastly, we are going to keep in mind the account difference. We are going to take the absolute value of oldbalanceOrg and oldbalanceDest. Our assumption is that a grater difference in balance will be more likely to be a fraudulent transaction.

In [102]:
transactions["accountDiff"] = transactions.apply(lambda row: abs(row['oldbalanceOrg'] - row['oldbalanceDest']), axis=1)
transactions.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isPayment,isMovement,accountDiff
0,206,CASH_OUT,62927.08,C473782114,0.0,0.0,C2096898696,649420.67,712347.75,0,0,1,649420.67
1,380,PAYMENT,32851.57,C1915112886,0.0,0.0,M916879292,0.0,0.0,0,1,1,0.0
2,570,CASH_OUT,1131750.38,C1396198422,1131750.38,0.0,C1612235515,313070.53,1444820.92,1,0,1,818679.85
3,184,CASH_OUT,60519.74,C982551468,60519.74,0.0,C1378644910,54295.32,182654.5,1,0,1,6224.42
4,162,CASH_IN,46716.01,C1759889425,7668050.6,7714766.61,C2059152908,2125468.75,2078752.75,0,0,1,5542581.85


Now that we have our data ready, we are going to split it into training and testing data. We are going to use 80% of our data for training and 20% for testing.

In [105]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(transactions[['amount', 'isPayment', 'isMovement', 'accountDiff']], transactions['isFraud'], test_size=0.3, random_state=0)

Since we are working with Logical Regression which uses regularization, we are going to need to standarize our data. We are going to use the StandardScaler from sklearn.

In [131]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)


Now that our data is ready we can create our model and fit it to our training data. We are going to start with an initial threshold of 0.5.

In [120]:
from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression()
logisticRegr.fit(x_train, y_train)

logisticRegr.score(x_train, y_train) # We are trying to get a score > 0.80

0.8442857142857143

We fit our test data and score the model.

In [119]:
predictions = logisticRegr.predict(x_test)
logisticRegr.score(x_test, y_test)

0.6833333333333333

Predictions: we testing data to predict the outcome of our model.

In [133]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
transaction4 = np.array([123456.78, 0.0, 1.0, 0.0])

sampleTransactions = np.array([transaction1, transaction2, transaction3, transaction4])
sampleTransactions = scaler.fit_transform(sampleTransactions)


In [136]:
predictions = logisticRegr.predict(sampleTransactions)
print(logisticRegr.predict_proba(sampleTransactions))

[[0.70354664 0.29645336]
 [0.9876293  0.0123707 ]
 [0.73549818 0.26450182]
 [0.61596297 0.38403703]]
