# Detecting Credit Card Fraud

# INTRODUCTION (text)

# UNDERSTANDING AND EXPLAINING MACHINE LEARNING ALGORITHMS (text)

# UNDERSTANDING, CLEANING, AND VISUALIZING THE DATA (text and code)

In [2]:
#importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np 
%matplotlib inline

In [3]:
# reading in files 
df = pd.read_csv('datasets/bs140513_032310.csv')

### Cleaning the Data

In [4]:
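# function that strips the surrounding single quotes from a raw string value
def remove_quotations(x):
    return x.strip("'")

# function that replaces the unknown-age marker 'U' with -1
# (known age codes are returned unchanged, i.e. still as strings)
def age_null(x):
    if x == 'U':
        return -1
    return x

# function that encodes gender into a number
# (any value other than 'M', 'F', or 'E', including 'U', becomes -1)
def gender_switch(x):
    if x == 'M':
        return 0
    if x == 'F':
        return 1
    if x == 'E':
        return 2
    return -1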

# function that cleans the DataFrame and returns a cleaned copy
def clean_transaction_data(df):
    transaction_dataset = df.copy()
    transaction_dataset['customer'] = transaction_dataset['customer'].apply(remove_quotations) # remove quotations
    transaction_dataset['age'] = transaction_dataset['age'].apply(remove_quotations).apply(age_null) # remove quotations and replace unknown age
    transaction_dataset['gender'] = transaction_dataset['gender'].apply(remove_quotations).apply(gender_switch) # remove quotations and encode gender
    transaction_dataset['zipcodeOri'] = transaction_dataset['zipcodeOri'].apply(remove_quotations).astype(int) # remove quotations
    transaction_dataset['merchant'] = transaction_dataset['merchant'].apply(remove_quotations)
    transaction_dataset['zipMerchant'] = transaction_dataset['zipMerchant'].apply(remove_quotations).astype(int)
    transaction_dataset['category'] = transaction_dataset['category'].apply(remove_quotations)
    # transaction_dataset['category'] = transaction_dataset['category'].apply(encoder_categories)
    return transaction_dataset

In [5]:
# the function has a different name than the result, so re-running this cell does not overwrite the function
clean_transaction = clean_transaction_data(df)

In [6]:
different_categories = clean_transaction['category'].unique().tolist()
# map each category label to an integer code, in order of first appearance
category_dictionary = {category: code for code, category in enumerate(different_categories)}

# method that encodes the transaction categories into numbers 
def encoder_categories(x):
    return category_dictionary[x]

In [7]:
clean_transaction['category'] = clean_transaction['category'].apply(encoder_categories)

In [8]:
clean_transaction.head()

Unnamed: 0,step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud
0,0,C1093826151,4,0,28007,M348934600,28007,0,4.55,0
1,0,C352968107,2,0,28007,M348934600,28007,0,39.68,0
2,0,C2054744914,4,1,28007,M1823072687,28007,0,26.89,0
3,0,C1760612790,3,0,28007,M348934600,28007,0,17.25,0
4,0,C757503768,5,0,28007,M348934600,28007,0,35.72,0


Categories have now been encoded to numbers for easier manipulation. The following dictionary explains what each value means:

In [9]:
category_dictionary

{'es_transportation': 0,
 'es_health': 1,
 'es_otherservices': 2,
 'es_food': 3,
 'es_hotelservices': 4,
 'es_barsandrestaurants': 5,
 'es_tech': 6,
 'es_sportsandtoys': 7,
 'es_wellnessandbeauty': 8,
 'es_hyper': 9,
 'es_fashion': 10,
 'es_home': 11,
 'es_contents': 12,
 'es_travel': 13,
 'es_leisure': 14}
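
As a point of comparison, the same kind of label-to-integer encoding can be sketched with `pd.factorize`, which also assigns codes in order of first appearance. The toy series and the name `category_lookup` below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the 'category' column (values chosen for illustration)
categories = pd.Series(
    ["es_transportation", "es_health", "es_transportation", "es_food", "es_health"]
)

# factorize returns integer codes plus the unique labels,
# numbered in order of first appearance
codes, uniques = pd.factorize(categories)

# Rebuild a label -> code dictionary comparable to category_dictionary
category_lookup = {label: code for code, label in enumerate(uniques)}

print(codes.tolist())
print(category_lookup)
```

Keeping the lookup dictionary around makes it possible to translate the integer codes back into readable category names later.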

In [10]:
clean_transaction['zipcodeOri'].describe()

count    594643.0
mean      28007.0
std           0.0
min       28007.0
25%       28007.0
50%       28007.0
75%       28007.0
max       28007.0
Name: zipcodeOri, dtype: float64

In [11]:
clean_transaction['zipMerchant'].describe()

count    594643.0
mean      28007.0
std           0.0
min       28007.0
25%       28007.0
50%       28007.0
75%       28007.0
max       28007.0
Name: zipMerchant, dtype: float64

There are 10 columns in the dataset, but the two zipcode columns each contain
a single value (28007), so they carry no information and should be dropped.

In [58]:
clean_transaction = clean_transaction.drop(['zipcodeOri', 'zipMerchant'], axis=1)
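
The check for single-valued columns can also be done programmatically rather than by reading `describe()` output. The following is a minimal sketch on a toy DataFrame; the names `toy`, `constant_columns`, and `reduced` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with two constant columns and one informative column
toy = pd.DataFrame({
    "zipcodeOri": [28007, 28007, 28007],
    "zipMerchant": [28007, 28007, 28007],
    "amount": [4.55, 39.68, 26.89],
})

# A column with exactly one distinct value cannot help separate fraud from non-fraud
constant_columns = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

reduced = toy.drop(columns=constant_columns)

print(constant_columns)
print(reduced.columns.tolist())
```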

There are 8 columns remaining. Step appears to represent the time at which a transaction occurred, measured from the first transaction. According to the paper published on researchgate.net, the Age column refers to a categorized age:

**0 --> <=18**

**1 --> 19-25**

**2 --> 26-35**

**3 --> 36-45**

**4 --> 46-55**

**5 --> 56-65**

**6 --> >65**

**U --> Unknown**

The Customer column is a unique ID assigned to each customer. There are customers in the dataset who have made multiple purchases, so this column may be useful to us. Gender represents the gender of a customer: 

**'M' -> Male**

**'F' -> Female**

**'E' -> Enterprise**

**'U' -> Unknown**

The Merchant column refers to a unique ID assigned to each merchant. There are merchants who have facilitated more than one transaction in the dataset, so this column may be useful to us.

The Category and Amount columns are largely self-explanatory: category is the consumer category to which the purchased item belongs, and amount is the numeric value of the purchase (likely in euros).

The fraud column is coded as follows: 0 means the transaction was not fraudulent, and 1 means it was.

Additionally, there are no null values in the dataset, although unknown ages and genders are recorded as 'U' rather than as nulls. The absence of true nulls makes sense, as the dataset we are dealing with is simulated.
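
A quick way to confirm both points is to count true nulls and, separately, the 'U' placeholder. The sketch below uses a toy DataFrame with unencoded values; in the cleaned notebook data the placeholder has already been mapped to -1, so the comparison would be against -1 instead. The names `toy`, `null_counts`, and `unknown_counts` are illustrative.

```python
import pandas as pd

# Toy frame using the raw codes, where 'U' marks an unknown value
toy = pd.DataFrame({
    "age": ["4", "U", "2", "3"],
    "gender": ["M", "F", "U", "E"],
    "amount": [4.55, 39.68, 26.89, 17.25],
})

# True missing values (NaN/None) per column
null_counts = toy.isnull().sum()

# Placeholder "unknown" values per column, which isnull() does not detect
unknown_counts = (toy[["age", "gender"]] == "U").sum()

print(null_counts)
print(unknown_counts)
```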

### Understanding the Data

In [61]:
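# numeric_only=True (pandas 1.5+) restricts the matrix to numeric columns;
# recent pandas versions raise an error on string columns such as customer and merchant
corr_matrix = clean_transaction.corr(numeric_only=True)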
corr_matrix.style.background_gradient(cmap='coolwarm')

Unnamed: 0,step,gender,category,amount,fraud
step,1.0,0.00110703,-0.0245922,-0.00796142,-0.0118981
gender,0.00110703,1.0,0.010147,0.0128877,0.0250473
category,-0.0245922,0.010147,1.0,0.268715,0.278354
amount,-0.00796142,0.0128877,0.268715,1.0,0.489967
fraud,-0.0118981,0.0250473,0.278354,0.489967,1.0


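Fraud has basically no linear correlation with step or gender. Amount has the strongest correlation with fraud (about 0.49), so it is a crucial feature. Category also shows some correlation (about 0.28), but its integer codes are arbitrary, so that value is hard to interpret.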
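
Because the category codes are arbitrary integers, a per-category fraud rate is often easier to read than a correlation coefficient. The sketch below shows that idea, along with the mean amount for each fraud class, on invented toy values; none of the numbers come from the real dataset.

```python
import pandas as pd

# Toy transactions (values invented for illustration)
toy = pd.DataFrame({
    "category": ["es_transportation", "es_transportation", "es_travel",
                 "es_travel", "es_health", "es_health"],
    "amount": [4.55, 39.68, 900.0, 1500.0, 44.26, 20.0],
    "fraud": [0, 0, 1, 1, 1, 0],
})

# Share of fraudulent transactions within each category
fraud_rate_by_category = (
    toy.groupby("category")["fraud"].mean().sort_values(ascending=False)
)

# Average purchase amount for non-fraudulent (0) and fraudulent (1) transactions
mean_amount_by_fraud = toy.groupby("fraud")["amount"].mean()

print(fraud_rate_by_category)
print(mean_amount_by_fraud)
```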

### Manipulating the Data (text and code)

In [73]:
# a function to get the reliability score (share of fraudulent transactions) for each value of the given column
# note: it reads the global clean_fraud DataFrame, which must be defined before the function is called
def get_fraud_scores(category): 
    fraud_scores = { i : 0 for i in clean_transaction[category].unique() }
    fraud_dict = clean_fraud[category].value_counts().to_dict() # fraudulent purchases per ID
    total_dict = clean_transaction[category].value_counts().to_dict() # total purchases per ID

    for key in fraud_scores: 
        if key in fraud_dict:
            fraud_scores[key] = fraud_dict[key]/total_dict[key] # reliability = fraudulent purchases / total purchases
    return fraud_scores

In [75]:
# DataFrame containing only fraudulent transactions
clean_fraud = clean_transaction[clean_transaction['fraud'] ==1] 
clean_fraud.head()

Unnamed: 0,step,customer,age,gender,merchant,category,amount,fraud,customer_reliability,merchant_reliability
88,0,C583110837,3,0,M480139044,1,44.26,1,0.092857,0.465792
89,0,C1332295774,3,0,M480139044,1,324.5,1,0.264151,0.465792
434,0,C1160421902,3,0,M857378720,4,176.32,1,0.033058,0.754098
435,0,C966214713,3,0,M857378720,4,337.41,1,0.04,0.754098
553,0,C1450140987,4,1,M1198415165,8,220.11,1,0.120805,0.226582


#### Attempting to assign each customer and merchant a fraud score

In [76]:
customer_fraud_dict = get_fraud_scores('customer') #dictionary with customer ID as key, customer reliability as value
merchant_fraud_dict = get_fraud_scores('merchant') #dictionary with merchant ID as key, merchant reliability as value

Now that we have a fraud percentage for each customer and each merchant, we must insert these new features into our dataset so we can use them for future analysis.

In [77]:
def return_cust_fraud_percentage(x): # function to look up a customer in the customer_fraud dictionary
    return customer_fraud_dict[x]

def return_merch_fraud_percentage(x): # function to look up a merchant in the merchant_fraud dictionary
    return merchant_fraud_dict[x]

clean_transaction['customer_reliability'] = clean_transaction['customer'].apply(return_cust_fraud_percentage)
clean_transaction['merchant_reliability'] = clean_transaction['merchant'].apply(return_merch_fraud_percentage)

clean_fraud = clean_transaction[clean_transaction['fraud'] ==1]  # redefine clean_fraud so it includes the new columns
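
Since `fraud` is coded as 0/1, the ratio of fraudulent purchases to total purchases for an ID equals the mean of `fraud` within that ID's group. A more compact alternative to the dictionary lookups is therefore a grouped mean, sketched below on toy data (the DataFrame `toy` is an assumption made for illustration).

```python
import pandas as pd

# Toy transactions: C1 has 1 fraud out of 2, C2 has 0 of 3, C3 has 1 of 1
toy = pd.DataFrame({
    "customer": ["C1", "C1", "C2", "C2", "C2", "C3"],
    "fraud": [1, 0, 0, 0, 0, 1],
})

# transform("mean") computes the per-customer fraud share and
# broadcasts it back onto every row belonging to that customer
toy["customer_reliability"] = toy.groupby("customer")["fraud"].transform("mean")

print(toy)
```

The same pattern with `groupby("merchant")` would give the merchant score.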

In [78]:
clean_transaction['customer_reliability'].describe()

count    594643.000000
mean          0.012108
std           0.055508
min           0.000000
25%           0.000000
50%           0.000000
75%           0.006135
max           0.945652
Name: customer_reliability, dtype: float64

In [79]:
clean_fraud['customer_reliability'].describe()

count    7200.000000
mean        0.266578
std         0.297518
min         0.005348
25%         0.031746
50%         0.092349
75%         0.477273
max         0.945652
Name: customer_reliability, dtype: float64

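Fraudulent transactions have a much higher average customer reliability score (about 0.27) than the dataset as a whole (about 0.012), suggesting some customers are simply more likely to be involved in fraud than others. Note that, despite its name, a higher score means a larger share of that customer's transactions were fraudulent, and that part of this gap is expected by construction because the score is computed from the fraud labels themselves.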
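
One caveat is that a score computed from every transaction's fraud label, including the label of the row being scored, may overstate how useful the feature is for prediction. A possible way around this, sketched below under the assumption that `step` orders transactions in time, is to compute the rates only from earlier steps and apply them to later ones. The split point `cutoff_step` and the fallback to the overall historical rate for unseen customers are hypothetical choices, not something the notebook does.

```python
import pandas as pd

# Toy transactions ordered by step (values invented for illustration)
toy = pd.DataFrame({
    "step": [0, 0, 1, 1, 2, 2, 3, 3],
    "customer": ["C1", "C2", "C1", "C2", "C1", "C3", "C2", "C1"],
    "fraud": [1, 0, 1, 0, 0, 1, 0, 1],
})

cutoff_step = 2  # hypothetical split point between "history" and "later" data
history = toy[toy["step"] < cutoff_step]
later = toy[toy["step"] >= cutoff_step].copy()

# Fraud share per customer, using historical rows only
rate_from_history = history.groupby("customer")["fraud"].mean()

# Customers never seen in the history fall back to the overall historical rate
global_rate = history["fraud"].mean()

later["customer_reliability"] = (
    later["customer"].map(rate_from_history).fillna(global_rate)
)

print(later)
```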

In [80]:
clean_transaction['merchant_reliability'].describe()

count    594643.000000
mean          0.012108
std           0.080055
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           0.963351
Name: merchant_reliability, dtype: float64

In [81]:
clean_fraud['merchant_reliability'].describe()

count    7200.000000
mean        0.541401
std         0.298606
min         0.017778
25%         0.250000
50%         0.465792
75%         0.832109
max         0.963351
Name: merchant_reliability, dtype: float64

In [82]:
clean_transaction.head()

Unnamed: 0,step,customer,age,gender,merchant,category,amount,fraud,customer_reliability,merchant_reliability
0,0,C1093826151,4,0,M348934600,0,4.55,0,0.0,0.0
1,0,C352968107,2,0,M348934600,0,39.68,0,0.0,0.0
2,0,C2054744914,4,1,M1823072687,0,26.89,0,0.0,0.0
3,0,C1760612790,3,0,M348934600,0,17.25,0,0.0,0.0
4,0,C757503768,5,0,M348934600,0,35.72,0,0.0,0.0


Fraudulent transactions also have a much higher average merchant reliability score (about 0.54) than the dataset as a whole (about 0.012). This suggests some merchants are easier to take advantage of (possibly due to poor security practices), and thus are involved in more fraudulent transactions.
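
The comparisons made with separate `describe()` calls can also be collected into one table by grouping on the fraud label. This is a minimal sketch on invented toy values; `mean_scores_by_class` is an illustrative name.

```python
import pandas as pd

# Toy rows carrying the two engineered scores (values invented for illustration)
toy = pd.DataFrame({
    "fraud": [0, 0, 0, 1, 1],
    "customer_reliability": [0.0, 0.0, 0.03, 0.26, 0.48],
    "merchant_reliability": [0.0, 0.0, 0.0, 0.47, 0.75],
})

# One row per class (0 = not fraud, 1 = fraud), one column per score
mean_scores_by_class = toy.groupby("fraud")[
    ["customer_reliability", "merchant_reliability"]
].mean()

print(mean_scores_by_class)
```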

### Visualizing the Data

# APPLYING MACHINE LEARNING (text and code) 

In [17]:
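# assumption: a linear classifier from scikit-learn was intended; LogisticRegression is used here as one option
from sklearn.linear_model import LogisticRegression

# TrainX, TrainY, and Validate_or_test_X are placeholders that must be defined before this cell runs
model = LogisticRegression()
model.fit(TrainX, TrainY) # fit takes features and labels as two separate arguments
predictions = model.predict(Validate_or_test_X)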
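
To make the intended workflow concrete, here is a minimal end-to-end sketch: split the data, fit a linear classifier, and predict on the held-out part. It assumes scikit-learn is installed and uses synthetic data in which larger amounts are more likely to be labeled fraudulent. The choice of `LogisticRegression`, the single `amount` feature, and the variable names are all assumptions, not the notebook's final model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature (amount) and a 0/1 fraud label
rng = np.random.default_rng(0)
n = 400
amount = rng.exponential(scale=40.0, size=n)
fraud = (amount + rng.normal(0, 20, size=n) > 90).astype(int)
X = amount.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# Hold out 25% of the rows; stratify keeps the fraud share similar in both parts
TrainX, TestX, TrainY, TestY = train_test_split(
    X, fraud, test_size=0.25, random_state=0, stratify=fraud
)

model = LogisticRegression(max_iter=1000)
model.fit(TrainX, TrainY)          # features and labels are separate arguments
predictions = model.predict(TestX)

print(predictions[:10])
```

With a class as rare as fraud, plain accuracy can look high even for a model that never predicts fraud, so metrics such as precision and recall are usually worth checking as well.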