## About this project

Credit card fraud is any fraud or theft that involves a credit card. Its aim is to purchase goods or steal money using someone else's credit account. Advances in modern technology have created more room for credit card fraud: since the advent of online purchases, perpetrators no longer need a physical card to make an unauthorized purchase. Additionally, electronic databases containing credit card data can be hacked or can fail in ways that expose customers' credit card information. Such breaches put the security of many accounts at risk at once.

Common forms of credit card fraud include:

- Lost or stolen cards are used without their owner's permission.

- Credit cards are 'skimmed,' where the card is cloned or copied using a special swipe machine to create a duplicate.

- Card details, such as the card number, cardholder name, date of birth, and address, are stolen from online databases or through email scams. These stolen details are then sold and utilized for fraud on the internet or over the phone. This type of fraud is often referred to as 'card-not-present' fraud.

- Fraudulent applications are made in someone else's name for a new credit card, without the person's knowledge or consent.



The goal of this project is to build a fraud detection system using machine learning. The system classifies an online credit card transaction as fraudulent or not fraudulent based on the transaction details. The model learns from past credit card transaction data and uses the patterns it finds to decide whether a new transaction is fraudulent. Furthermore, the model will be hosted as a web service that accepts transaction details in JSON format and returns a prediction. Deploying the model as a web service benefits financial institutions, banks, and online stores, since they can send transaction details to the service and get the model's prediction back. This can help detect potential fraud so that the transaction is flagged and declined. By identifying potential fraud early, companies can take steps to prevent it, which can save them significant financial losses.

### About the dataset

The dataset used in this machine learning project was obtained from [Kaggle](https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud).

#### Data columns/features explained
The data contains 8 columns (7 features plus the `fraud` target) and 1,000,000 observations. Below is an explanation of the columns.

- distance_from_home - how far from home the transaction took place.
- distance_from_last_transaction - the distance from where the last transaction took place.
- ratio_to_median_purchase_price - ratio of the transaction's purchase price to the median purchase price.
- repeat_retailer - whether the transaction was made at the same retailer.
- used_chip - whether the transaction was made with the card's chip.
- used_pin_number - whether the transaction was made using a PIN.
- online_order - whether the transaction was an online order.
- fraud (target) - whether the transaction is fraudulent.

### Challenges in credit card fraud detection
- It's not always easy to agree on the ground truth for what "fraud" means.
- Most transactions are labeled as not fraudulent, which leads to a class imbalance in the dataset.
- Fraud detection datasets contain sensitive information, so many of the data representations are usually hidden.
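
The class imbalance point is worth a small illustration. The sketch below uses made-up labels with a 10% fraud rate and a "model" that never flags anything; it is not taken from the notebook's data, and is only meant to show why accuracy alone can look good while no fraud is caught.

```python
# Toy labels with a 10% fraud rate (made-up data, not the Kaggle dataset).
y_true = [1] * 10 + [0] * 90

# A useless "model" that labels every transaction as not fraud.
y_pred = [0] * len(y_true)

# Accuracy: share of labels that match.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual fraud cases that were flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # 0.90
print(f"recall   = {recall:.2f}")    # 0.00
```

This is one reason the evaluation later in the notebook relies on ROC AUC and the confusion table rather than accuracy alone.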


### Import libraries and load data 

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
import pickle


In [None]:
data=pd.read_csv("/Users/victoroshimua/Downloads/card_transdata.csv")

In [None]:
data.head().T

### EDA

In [None]:
data.shape

In [None]:
col=data.columns
list(col)

In [None]:
## Summary statistics; also useful for spotting missing values stored as large sentinel numbers (e.g. 99999999)
data.describe()

In [None]:
## Check for missing data
data.isnull().any()


##### Observation: No missing data

In [None]:
## Exploring the target "Fraud"
data["fraud"].value_counts(normalize=True).round(2)

In [None]:
data["fraud"].value_counts(normalize=True).plot(kind="bar")
plt.show()

#### Observation: Fraud (1) makes up about 10% of the transactions and not fraud (0) about 90%, so the classes are imbalanced

## Setting up validation framework
#### The dataset will be split into train (60%), validation (20%), and test (20%) sets

In [None]:
data_full_train,data_test=train_test_split(data,test_size=0.2,random_state=1)
data_train,data_val=train_test_split(data_full_train,test_size=0.25,random_state=1)

In [None]:
len(data_train),len(data_test),len(data_val)
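
The two `test_size` values can look inconsistent with a 60/20/20 split, so here is the arithmetic spelled out. This is a plain-Python sketch of the expected sizes, assuming the 1,000,000 rows stated earlier; it does not call scikit-learn.

```python
n_rows = 1_000_000        # dataset size stated in the dataset description
test_frac = 0.2           # first split: share held out as the test set
val_frac_of_rest = 0.25   # second split: share of the remaining 80%

n_test = round(n_rows * test_frac)
n_full_train = n_rows - n_test
n_val = round(n_full_train * val_frac_of_rest)
n_train = n_full_train - n_val

# 25% of the remaining 80% is 20% of the whole dataset.
print(n_train, n_val, n_test)  # 600000 200000 200000
```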

In [None]:
data_train=data_train.reset_index(drop=True)
data_test=data_test.reset_index(drop=True)
data_val=data_val.reset_index(drop=True)


In [None]:
y_train=data_train.fraud
y_test=data_test.fraud
y_val=data_val.fraud

In [None]:
del data_train['fraud']
del data_test['fraud']
del data_val['fraud']


## Feature Importance (Correlation)

In [None]:
data_full_train.corrwith(data_full_train.fraud).to_frame('correlation')

### Observations

- distance_from_home: Positive correlation (0.185024) with the fraud column, indicating a weak positive relationship.

- distance_from_last_transaction: Positive correlation (0.088343) with the fraud column, indicating a weak positive relationship.

- ratio_to_median_purchase_price: Positive correlation (0.463126) with the fraud column, indicating a moderate positive relationship.

- repeat_retailer: Almost no correlation (-0.000997) with the fraud column, suggesting no significant relationship.

- used_chip: Negative correlation (-0.060728) with the fraud column, indicating a weak negative relationship.

- used_pin_number: Negative correlation (-0.100202) with the fraud column, indicating a weak negative relationship.

- online_order: Positive correlation (0.192077) with the fraud column, indicating a weak positive relationship.

- fraud: The correlation of 1.000000 represents the correlation of the fraud column with itself, which is always perfect.
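
For reference, the values above are Pearson correlation coefficients, which is what pandas computes by default. The sketch below computes the same quantity by hand on made-up toy values (the numbers are illustrative and not from the dataset).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy values for illustration only.
fraud = [0, 0, 0, 0, 1, 1]
ratio = [0.5, 0.8, 1.0, 1.2, 4.0, 6.0]   # higher for the fraud rows
online = [1, 0, 1, 0, 1, 0]              # unrelated to the fraud rows

r_ratio = pearson(ratio, fraud)
r_online = pearson(online, fraud)
print(round(r_ratio, 3), round(r_online, 3))
```

Correlation only measures linear association with the target, so a feature with a weak coefficient can still be useful to a tree-based model.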


## Modeling



In [None]:
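# Logistic regression
from sklearn.linear_model import LogisticRegression
model2=LogisticRegression()
model2.fit(data_train,y_train)
pred2=model2.predict(data_val)  # hard 0/1 labels, used for the confusion matrix
# ROC AUC should be computed from scores, so use the predicted probability of fraud
pred2_proba=model2.predict_proba(data_val)[:,1]
roc_auc_score(y_val,pred2_proba)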

In [None]:
# Xgboost
dtrain=xgb.DMatrix(data_train ,label= y_train)
dval=xgb.DMatrix(data_val,label=y_val)
xgb_params={
    'eta': 0.1, 
    'max_depth': 3,
    'min_child_weight': 1,

    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=175)


In [None]:
# Confusion matrix for the logistic regression baseline
confusion_matrix(y_val,pred2)

## Evaluating model performance
I will use two evaluation metrics:
- ROC AUC (area under the receiver operating characteristic curve)
- Confusion table
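
To make the two metrics concrete, here is a small pure-Python sketch on made-up toy scores. It uses the pairwise definition of ROC AUC (the share of fraud/non-fraud pairs where the fraud case gets the higher score, ties counting half) and counts the four confusion table cells at a 0.5 threshold. The helper names are hypothetical; the notebook itself uses scikit-learn's `roc_auc_score` and `confusion_matrix`.

```python
def roc_auc(y_true, scores):
    """Share of (fraud, non-fraud) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) with fraud (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up toy labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.9]
labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
print(confusion_counts(y_true, labels))  # (2, 1, 1, 1)
```

Passing thresholded labels instead of probabilities throws away the ranking information, which is why the two AUC values differ here.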

In [None]:
prediction = model.predict(dval)
roc_auc_score(y_val, prediction)

#### Observation: The ROC AUC score is 0.9999999623312275. This indicates excellent performance on the validation set: the model almost perfectly ranks fraudulent transactions above non-fraudulent ones.

In [None]:
prediction_binary = (prediction >= 0.5).astype(int)
confusion_matrix(y_val, prediction_binary)


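### Observation: 
scikit-learn's `confusion_matrix` returns `[[TN, FP], [FN, TP]]`, with fraud (1) as the positive class. Since fraud is the minority class, the large count belongs to the true negatives:
- True Negatives (TN): 182,547 instances correctly predicted as Not fraud.
- False Positives (FP): 0 instances wrongly predicted as Fraud.
- False Negatives (FN): 2 instances wrongly predicted as Not fraud.
- True Positives (TP): 17,449 instances correctly predicted as Fraud.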
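
The four cells can be condensed into precision and recall for the fraud class. The sketch below plugs in the four counts reported above, read in scikit-learn's `[[TN, FP], [FN, TP]]` layout with fraud (1) as the positive class; that reading is an assumption based on fraud being the minority class.

```python
# Validation confusion matrix cells in scikit-learn's layout [[TN, FP], [FN, TP]],
# assuming fraud (1) is the positive class.
tn, fp, fn, tp = 182_547, 0, 2, 17_449

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual fraud cases, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.6f}")
print(f"f1        = {f1:.6f}")
```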


## Training final model 
For the final model I will use XGBoost, since gradient-boosted trees tend to perform well on imbalanced tabular data and it scored better than logistic regression on the validation set.

In [None]:
data_full_train.head()

In [None]:
Y_train=data_full_train.fraud
# drop() returns a copy, so data_full_train keeps its fraud column
X_train=data_full_train.drop(columns="fraud")

In [None]:
X_train

In [None]:
dtrain=xgb.DMatrix(X_train, label= Y_train)
xgb_params={
    'eta': 0.1, 
    'max_depth': 3,
    'min_child_weight': 1,

    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
final_model = xgb.train(xgb_params, dtrain, num_boost_round=175)

In [None]:
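# xgb.train returns a Booster, which has no predict_proba;
# Booster.predict on a DMatrix returns the predicted fraud probabilities
dtest = xgb.DMatrix(data_test)
prediction = final_model.predict(dtest)
print(prediction)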

In [None]:
prediction_binary = (prediction >= 0.5).astype(int)
# Count how many transactions were predicted as not fraud (0) and fraud (1)
count0, count1 = np.bincount(prediction_binary, minlength=2)
print(count0,count1)



In [None]:
print(prediction_binary)

#### roc_auc_score

In [None]:
roc_auc_score(y_test, prediction)

#### confusion table

In [None]:
confusion_matrix(y_test, prediction_binary)

#### Accuracy

In [None]:
(y_test==prediction_binary).mean()

## Feature importance Xgboost
Weight and Gain 
- Weight focuses on the frequency of feature usage
- Gain emphasizes the contribution of a feature to the model's performance.
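
The difference between the two importance types can be shown with a toy list of tree splits. The sketch below uses hypothetical `(feature, gain)` pairs, not values from the trained model: weight counts how often a feature is used to split, while gain, as XGBoost reports it, is the average gain over the splits that use the feature.

```python
from collections import defaultdict

# Hypothetical splits from a tiny ensemble: (feature used, gain of that split).
splits = [
    ("ratio_to_median_purchase_price", 900.0),
    ("distance_from_home", 100.0),
    ("distance_from_home", 50.0),
    ("distance_from_home", 150.0),
    ("online_order", 400.0),
]

weight = defaultdict(int)        # number of splits per feature
total_gain = defaultdict(float)  # summed gain per feature
for feature, gain in splits:
    weight[feature] += 1
    total_gain[feature] += gain

# Average gain per split for each feature.
avg_gain = {feature: total_gain[feature] / weight[feature] for feature in weight}

top_by_weight = max(weight, key=weight.get)
top_by_gain = max(avg_gain, key=avg_gain.get)
print(top_by_weight)  # distance_from_home
print(top_by_gain)    # ratio_to_median_purchase_price
```

A feature that is split on often but with small improvements ranks high by weight and low by gain, so the two rankings need not agree.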

In [None]:
scores = final_model.get_score(importance_type='gain')
scores = sorted(scores.items(), key=lambda x: x[1])
list(reversed(scores))

In [None]:
# Gain scores copied from the output of the previous cell
# (named gain_scores so the `data` DataFrame is not overwritten)
gain_scores = [('ratio_to_median_purchase_price', 2899.43701171875),
 ('online_order', 2386.5302734375),
 ('distance_from_home', 1214.861328125),
 ('used_pin_number', 1117.2647705078125),
 ('used_chip', 1028.78564453125),
 ('distance_from_last_transaction', 818.9542236328125)]


# Extract feature names and corresponding gains
feature_names = [item[0] for item in gain_scores]
gain = [item[1] for item in gain_scores]

# Create bar plot
plt.barh(feature_names, gain)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance based on gain')
plt.show()


#### Observations: the most important feature by gain is `ratio_to_median_purchase_price`

In [None]:

# Note: these scores come from `model` (fit on the training split only), not `final_model`
scores= model.get_score(importance_type='weight')
scores = sorted(scores.items(), key=lambda x: x[1])
list(reversed(scores))

In [None]:
# Weight scores copied from the output of the previous cell
# (named weight_scores so the `data` DataFrame is not overwritten)
weight_scores = [('distance_from_home', 297.0),
        ('ratio_to_median_purchase_price', 280.0),
        ('online_order', 197.0),
        ('distance_from_last_transaction', 178.0),
        ('used_pin_number', 149.0),
        ('used_chip', 93.0)]

# Extract feature names and corresponding weights
feature_names = [item[0] for item in weight_scores]
weights = [item[1] for item in weight_scores]

# Create bar plot
plt.barh(feature_names, weights)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance based on Weight')
plt.show()


#### Observations: the most important feature by weight is `distance_from_home`

## Using the model
Let's see how the model predicts fraud for a single transaction from the test set

In [None]:
transaction = data_test.iloc[[4]]  # double brackets keep a one-row DataFrame
transaction_dmatrix = xgb.DMatrix(transaction)
prediction_ = final_model.predict(transaction_dmatrix)
print(prediction_)
threshold=0.5
# predict returns a length-1 array, so compare its single element to the threshold
prediction_binary = 1 if prediction_[0] >= threshold else 0
print(prediction_binary)


## Saving the model to disk
For this I will use pickle.

In [None]:
model_output_file = 'xgb_model.bin'


In [None]:
with open(model_output_file, 'wb') as f_out:
    pickle.dump(final_model, f_out)

## Loading the model

In [None]:
import pickle 

In [None]:
model_file='xgb_model.bin'

In [None]:
with open(model_file,'rb') as f_in:
    model=pickle.load(f_in)

In [None]:
model

In [None]:
import json
request = data_test.iloc[3].to_dict()
print(json.dumps(request, indent=2))
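
A payload like the one printed above is what the planned web service would receive. The sketch below shows one way the request handling could look: parse the JSON, check that all seven features are present, put them in a fixed order, score the row, and return a JSON response. The names `handle_request`, `stub_score`, and the response fields are hypothetical, and `stub_score` is a made-up rule standing in for the trained model so that the example runs on its own.

```python
import json

# Feature order expected by the model (the dataset columns without the target).
FEATURES = [
    "distance_from_home",
    "distance_from_last_transaction",
    "ratio_to_median_purchase_price",
    "repeat_retailer",
    "used_chip",
    "used_pin_number",
    "online_order",
]

def stub_score(row):
    """Hypothetical stand-in for the trained model: a made-up rule, not learned."""
    values = dict(zip(FEATURES, row))
    risky = values["ratio_to_median_purchase_price"] > 4 and values["online_order"] == 1
    return 0.9 if risky else 0.1

def handle_request(body, score_fn=stub_score, threshold=0.5):
    """Turn a JSON request body into a JSON prediction response."""
    payload = json.loads(body)
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})
    row = [float(payload[name]) for name in FEATURES]
    probability = float(score_fn(row))
    return json.dumps({"fraud_probability": probability, "fraud": probability >= threshold})

sample_request = {
    "distance_from_home": 42.29620833829316,
    "distance_from_last_transaction": 0.3205190486789607,
    "ratio_to_median_purchase_price": 4.080047993541461,
    "repeat_retailer": 1.0,
    "used_chip": 1.0,
    "used_pin_number": 0.0,
    "online_order": 1.0,
}
print(handle_request(json.dumps(sample_request)))
```

In the real service, `score_fn` would wrap the loaded model, for example by building a one-row DataFrame from `row` and calling `model.predict` on an `xgb.DMatrix`, as in the "Using the model" cell.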


In [None]:
# Example request for a transaction expected to be flagged as fraud
fraud = {
  "distance_from_home": 42.29620833829316,
  "distance_from_last_transaction": 0.3205190486789607,
  "ratio_to_median_purchase_price": 4.080047993541461,
  "repeat_retailer": 1.0,
  "used_chip": 1.0,
  "used_pin_number": 0.0,
  "online_order": 1.0
}

In [None]:
# Example request for a transaction expected to be classified as not fraud
not_fraud = {
  "distance_from_home": 1.9808574511514092,
  "distance_from_last_transaction": 5.04247186357492,
  "ratio_to_median_purchase_price": 0.5889080718859896,
  "repeat_retailer": 1.0,
  "used_chip": 0.0,
  "used_pin_number": 0.0,
  "online_order": 0.0
}