# Credit Card Fraud Detection - Machine Learning Project
#### Austin Irwin

For this project, I built a machine learning model that predicts fraudulent credit card transactions. The training and test data come from the Kaggle dataset 'Credit Card Transactions Fraud Detection Dataset', which contains simulated transactions for 1,000 customers and 800 merchants.

I used a random forest classifier to make fraud predictions.

In [5]:
# import required packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from math import radians, sin, cos, sqrt, atan2

## Data Retrieval & Inspection

In [2]:
# load in the data
fraud_train = pd.read_csv('data/fraudTrain.csv')
fraud_test = pd.read_csv('data/fraudTest.csv')

In [3]:
# check all columns for data types and null values
print(fraud_train.info())
print(fraud_test.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [4]:
# get a quick look at the DataFrame
print(fraud_train.head())

   Unnamed: 0 trans_date_trans_time            cc_num  \
0           0   2019-01-01 00:00:18  2703186189652095   
1           1   2019-01-01 00:00:44      630423337322   
2           2   2019-01-01 00:00:51    38859492057661   
3           3   2019-01-01 00:01:16  3534093764340240   
4           4   2019-01-01 00:03:06   375534208663984   

                             merchant       category     amt      first  \
0          fraud_Rippin, Kub and Mann       misc_net    4.97   Jennifer   
1     fraud_Heller, Gutmann and Zieme    grocery_pos  107.23  Stephanie   
2                fraud_Lind-Buckridge  entertainment  220.11     Edward   
3  fraud_Kutch, Hermiston and Farrell  gas_transport   45.00     Jeremy   
4                 fraud_Keeling-Crist       misc_pos   41.96      Tyler   

      last gender                        street  ...      lat      long  \
0    Banks      F                561 Perry Cove  ...  36.0788  -81.1781   
1     Gill      F  43039 Riley Greens Suite 393  ...  48

In [5]:
# look at the number of unique values for some predictors
# (double-quoted f-strings so the inner single quotes also parse on Python versions before 3.12)
print(f"the merchant variable has {fraud_train['merchant'].nunique()} unique values")
print(f"the city variable has {fraud_train['city'].nunique()} unique values")
print(f"the category variable has {fraud_train['category'].nunique()} unique values")
print(f"the job variable has {fraud_train['job'].nunique()} unique values")



the merchant variable has 693 unique values
the city variable has 894 unique values
the category variable has 14 unique values
the job variable has 494 unique values


## Data Cleaning & Preparation

In [6]:
# remove the unnamed column from the train and test DataFrames
fraud_train.drop('Unnamed: 0', axis=1, inplace=True)
fraud_test.drop('Unnamed: 0', axis=1, inplace=True)

fraud_train.columns

Index(['trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt',
       'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat',
       'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time', 'merch_lat',
       'merch_long', 'is_fraud'],
      dtype='object')

In [7]:
# create a transaction time variable, converting the time to cyclic sin and cos values 
fraud_train['trans_date_trans_time'] = pd.to_datetime(fraud_train['trans_date_trans_time'])
seconds_train = fraud_train['trans_date_trans_time'].dt.hour * 3600 + fraud_train["trans_date_trans_time"].dt.minute * 60 + fraud_train["trans_date_trans_time"].dt.second

fraud_test['trans_date_trans_time'] = pd.to_datetime(fraud_test['trans_date_trans_time'])
seconds_test = fraud_test['trans_date_trans_time'].dt.hour * 3600 + fraud_test['trans_date_trans_time'].dt.minute * 60 + fraud_test['trans_date_trans_time'].dt.second

# encode the time as a cyclic feature (i.e., 23:59 is next to 00:00)
fraud_train['time_sin'] = np.sin(2 * np.pi * seconds_train / 86400)
fraud_train['time_cos'] = np.cos(2 * np.pi * seconds_train / 86400)

fraud_test['time_sin'] = np.sin(2 * np.pi * seconds_test / 86400)
fraud_test['time_cos'] = np.cos(2 * np.pi * seconds_test / 86400)

print(fraud_train[['trans_date_trans_time', 'time_sin', 'time_cos']])
print(fraud_test[['trans_date_trans_time', 'time_sin', 'time_cos']])


        trans_date_trans_time  time_sin  time_cos
0         2019-01-01 00:00:18  0.001309  0.999999
1         2019-01-01 00:00:44  0.003200  0.999995
2         2019-01-01 00:00:51  0.003709  0.999993
3         2019-01-01 00:01:16  0.005527  0.999985
4         2019-01-01 00:03:06  0.013526  0.999909
...                       ...       ...       ...
1296670   2020-06-21 12:12:08 -0.052917 -0.998599
1296671   2020-06-21 12:12:19 -0.053716 -0.998556
1296672   2020-06-21 12:12:32 -0.054660 -0.998505
1296673   2020-06-21 12:13:36 -0.059306 -0.998240
1296674   2020-06-21 12:13:37 -0.059379 -0.998236

[1296675 rows x 3 columns]
       trans_date_trans_time  time_sin  time_cos
0        2020-06-21 12:14:25 -0.062863 -0.998022
1        2020-06-21 12:14:33 -0.063444 -0.997985
2        2020-06-21 12:14:53 -0.064895 -0.997892
3        2020-06-21 12:15:15 -0.066492 -0.997787
4        2020-06-21 12:15:17 -0.066637 -0.997777
...                      ...       ...       ...
555714   2020-12-31 23:59:07 
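
To illustrate why the time of day is encoded as a sine/cosine pair, the following minimal sketch compares distances in the encoded space. The helper name `encode_time_of_day` and the three sample times are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

SECONDS_PER_DAY = 86400

def encode_time_of_day(seconds):
    """Map seconds since midnight to (sin, cos) points on the unit circle."""
    angle = 2 * np.pi * np.asarray(seconds, dtype=float) / SECONDS_PER_DAY
    return np.column_stack([np.sin(angle), np.cos(angle)])

# hypothetical sample times: 23:59:00, 00:00:00 and 12:00:00
just_before_midnight = 23 * 3600 + 59 * 60
midnight = 0
noon = 12 * 3600

points = encode_time_of_day([just_before_midnight, midnight, noon])

# 23:59 and 00:00 are far apart as raw seconds but adjacent on the circle
dist_wraparound = np.linalg.norm(points[0] - points[1])
# midnight and noon sit on opposite sides of the circle
dist_to_noon = np.linalg.norm(points[1] - points[2])

print(dist_wraparound, dist_to_noon)
```

With raw seconds, 23:59 and 00:00 would differ by 86,340; in the encoded space they are almost the same point, while midnight and noon are as far apart as the encoding allows.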

In [8]:
# create an age variable 
fraud_train['dob'] = pd.to_datetime(fraud_train['dob'])
fraud_train['age'] = (fraud_train['trans_date_trans_time'] - fraud_train['dob']).dt.days // 365

fraud_test['dob'] = pd.to_datetime(fraud_test['dob'])
fraud_test['age'] = (fraud_test['trans_date_trans_time'] - fraud_test['dob']).dt.days // 365

print(fraud_train[['age']])
print(fraud_test[['age']])

         age
0         30
1         40
2         56
3         52
4         32
...      ...
1296670   58
1296671   40
1296672   52
1296673   39
1296674   24

[1296675 rows x 1 columns]
        age
0        52
1        30
2        49
3        32
4        65
...     ...
555714   54
555715   21
555716   39
555717   55
555718   27

[555719 rows x 1 columns]
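
Dividing the day count by 365 ignores leap days, so the age above can come out one year too high shortly before a birthday. The sketch below shows one possible alternative that compares calendar fields instead. The function name `exact_age` and the sample dates are assumptions for illustration, and the notebook's results were produced with the `// 365` version.

```python
import pandas as pd

def exact_age(trans_time: pd.Series, dob: pd.Series) -> pd.Series:
    """Age in completed years, based on calendar month and day."""
    years = trans_time.dt.year - dob.dt.year
    # subtract one year when the birthday has not yet occurred in the transaction year
    before_birthday = (trans_time.dt.month < dob.dt.month) | (
        (trans_time.dt.month == dob.dt.month) & (trans_time.dt.day < dob.dt.day)
    )
    return years - before_birthday.astype(int)

# hypothetical customer born 2000-01-01, transacting four days before turning 20
dob = pd.Series(pd.to_datetime(['2000-01-01']))
trans_time = pd.Series(pd.to_datetime(['2019-12-28']))

approx_age = (trans_time - dob).dt.days // 365
calendar_age = exact_age(trans_time, dob)

print(approx_age.tolist(), calendar_age.tolist())
```

For a tree-based model the one-year discrepancy is unlikely to matter much, but the calendar-based version avoids it at little cost.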


In [9]:
# turn gender into a binary variable (M = 0, F = 1)
# .map avoids the downcasting FutureWarning that .replace raises in recent pandas versions
fraud_train['gender'] = fraud_train['gender'].map({'M': 0, 'F': 1})
fraud_test['gender'] = fraud_test['gender'].map({'M': 0, 'F': 1})

fraud_train[['gender']].head()


   gender
0       1
1       1
2       0
3       0
4       0


In [11]:
# compute the haversine (great-circle) distance between the cardholder's address and the merchant's location

# create a function that computes the haversine distance
def haversine_distance(lat1, long1, lat2, long2):
    R = 6378  # equatorial radius of Earth in kilometres (the mean radius is about 6371)
    lat1_rad, long1_rad = radians(lat1), radians(long1)
    lat2_rad, long2_rad = radians(lat2), radians(long2)
    dlat = lat2_rad - lat1_rad
    dlong = long2_rad - long1_rad
    a = sin(dlat / 2)**2 + cos(lat1_rad) * cos(lat2_rad) * sin(dlong / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

# create the distance variable and include it in the train and test set
fraud_train['distance'] = fraud_train.apply(lambda row: haversine_distance(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)
fraud_test['distance'] = fraud_test.apply(lambda row: haversine_distance(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

fraud_train['distance'].head()

0     78.683926
1     30.245371
2    108.324972
3     95.778350
4     77.641957
Name: distance, dtype: float64
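
Applying a Python function row by row over more than a million rows is slow. As a hedged alternative, the same formula can be written with NumPy so that it operates on whole columns at once. The function name `haversine_distance_vectorized` and the `radius_km` parameter are assumptions for illustration; the default of 6378 km simply mirrors the value used in the cell above.

```python
import numpy as np

def haversine_distance_vectorized(lat1, long1, lat2, long2, radius_km=6378):
    """Haversine distance in kilometres for arrays of coordinates given in degrees."""
    lat1, long1, lat2, long2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, long1, lat2, long2)
    )
    dlat = lat2 - lat1
    dlong = long2 - long1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlong / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return radius_km * c

# two hypothetical pairs: one degree of latitude apart, and two identical points
distances = haversine_distance_vectorized(
    [0.0, 36.0788], [0.0, -81.1781], [1.0, 36.0788], [0.0, -81.1781]
)
print(distances)
```

In the notebook this could be called as `haversine_distance_vectorized(fraud_train['lat'], fraud_train['long'], fraud_train['merch_lat'], fraud_train['merch_long'])` and assigned to the `distance` column; it should give the same values as the row-wise version.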

In [12]:
# create a frequency encoding for the merchant column and map it to the train and test data
merchant_freq_map = fraud_train['merchant'].value_counts(normalize=True).to_dict()
fraud_train['merchant_encoded'] = fraud_train['merchant'].map(merchant_freq_map)
fraud_test['merchant_encoded'] = fraud_test['merchant'].map(merchant_freq_map)

# create a frequency encoding for the city column and map it to the train and test data
city_freq_map = fraud_train['city'].value_counts(normalize=True).to_dict()
fraud_train['city_encoded'] = fraud_train['city'].map(city_freq_map)
fraud_test['city_encoded'] = fraud_test['city'].map(city_freq_map)

# create a frequency encoding for the job column and map it to the train and test data
job_freq_map = fraud_train['job'].value_counts(normalize=True).to_dict()
fraud_train['job_encoded'] = fraud_train['job'].map(job_freq_map)
fraud_test['job_encoded'] = fraud_test['job'].map(job_freq_map)

# create a one-hot encoding for the category column, and join the encodings to the train and test data
oh_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
oh_encoder.fit(fraud_train[['category']])

oh_encoded_train = oh_encoder.transform(fraud_train[['category']]) 
oh_encoded_test = oh_encoder.transform(fraud_test[['category']])

oh_encoded_train_df = pd.DataFrame(oh_encoded_train, columns=oh_encoder.get_feature_names_out(['category']), index=fraud_train.index)
oh_encoded_test_df = pd.DataFrame(oh_encoded_test, columns=oh_encoder.get_feature_names_out(['category']), index=fraud_test.index)

fraud_train = pd.concat([fraud_train.drop(columns=["category"]), oh_encoded_train_df], axis=1)
fraud_test = pd.concat([fraud_test.drop(columns=["category"]), oh_encoded_test_df], axis=1)

# view all of the columns in the updated DataFrames
fraud_train.columns
fraud_test.columns

Index(['trans_date_trans_time', 'cc_num', 'merchant', 'amt', 'first', 'last',
       'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'city_pop',
       'job', 'dob', 'trans_num', 'unix_time', 'merch_lat', 'merch_long',
       'is_fraud', 'time_sin', 'time_cos', 'age', 'distance',
       'merchant_encoded', 'city_encoded', 'job_encoded',
       'category_entertainment', 'category_food_dining',
       'category_gas_transport', 'category_grocery_net',
       'category_grocery_pos', 'category_health_fitness', 'category_home',
       'category_kids_pets', 'category_misc_net', 'category_misc_pos',
       'category_personal_care', 'category_shopping_net',
       'category_shopping_pos', 'category_travel'],
      dtype='object')
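
One caveat of building the frequency maps from the training data only: a merchant, city or job that appears in the test set but not in the training set maps to `NaN`. The sketch below uses toy data (the merchant labels `a` to `d` are made up) to show the effect and one possible fallback, filling unseen values with a frequency of 0. Whether this matters for the actual dataset is not checked here.

```python
import pandas as pd

# toy data: merchant 'd' appears only in the test set
train = pd.DataFrame({'merchant': ['a', 'a', 'b', 'c']})
test = pd.DataFrame({'merchant': ['a', 'd']})

# frequency map learned from the training data only
freq_map = train['merchant'].value_counts(normalize=True).to_dict()

# unseen values become NaN after mapping
test['merchant_encoded'] = test['merchant'].map(freq_map)
n_unseen = int(test['merchant_encoded'].isna().sum())

# assumed fallback: treat unseen values as having zero training frequency
test['merchant_encoded'] = test['merchant_encoded'].fillna(0.0)

print(n_unseen)
print(test)
```

A quick `isna().sum()` on the encoded test columns would show whether any such values exist before the model is fitted.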

## Model Training and Testing

In [13]:
# select the relevant input variables and output variable for both train and test sets
features = ['amt', 'gender', 'city_pop', 'time_sin', 'time_cos', 'age', 'distance', 'merchant_encoded', 'city_encoded', 'job_encoded',
'category_entertainment', 'category_food_dining', 'category_gas_transport', 'category_grocery_net', 'category_grocery_pos',
'category_health_fitness', 'category_home', 'category_kids_pets', 'category_misc_net', 'category_misc_pos', 'category_personal_care',
'category_shopping_net', 'category_shopping_pos', 'category_travel']

x_train = fraud_train[features]
y_train = fraud_train['is_fraud']

x_test = fraud_test[features]
y_test = fraud_test['is_fraud']

# fit the random forest classifier
rf = RandomForestClassifier(random_state=123)
rf.fit(x_train, y_train)

# generate predictions
y_pred = rf.predict(x_test)

# evaluate the model's performance
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.95      0.73      0.83      2145

    accuracy                           1.00    555719
   macro avg       0.98      0.86      0.91    555719
weighted avg       1.00      1.00      1.00    555719
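
Because only 2,145 of the 555,719 test transactions are fraudulent, overall accuracy says little about this model; the precision and recall of class 1 are the numbers to watch. The short calculation below, using only figures copied from the report above, shows the accuracy a trivial "never fraud" baseline would reach and how the fraud-class F1 score follows from precision and recall. The variable names are chosen for illustration.

```python
# supports and fraud-class scores copied from the classification report
n_legit, n_fraud = 553574, 2145
precision, recall = 0.95, 0.73

# accuracy of a baseline that labels every transaction as legitimate
baseline_accuracy = n_legit / (n_legit + n_fraud)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f'baseline accuracy: {baseline_accuracy:.4f}')
print(f'fraud-class F1:    {f1:.2f}')
```

The baseline already scores about 99.6% accuracy without detecting a single fraud, so the reported accuracy of 1.00 mainly reflects the class imbalance. The recall of 0.73 means that roughly a quarter of fraudulent transactions in the test set go undetected.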

