# Business Case - Fraud detection model

Author: Emilio Macias

Note: the dataset can be downloaded from www.kaggle.com/ealtman2019/credit-card-transactions/download

This notebook expects the dataset to be located at: data/credit_card_transactions-ibm_v2.csv

## Table of Contents

* [Data exploration](#exploration)
* [Feature engineering](#feature_eng)
    * [Handling of missing values](#missing_val)
    * [Transforming categorical features](#cat_features)
* [Data balancing](#data_balancing)
* [Modelling](#modelling)
    * [Model preparation](#model_prep)
    * [Training and evaluation](#model_train)
* [Region analysis](#region_analysis)
* [Merchant analysis](#merchant_analysis)

## Data exploration <a class="anchor" id="exploration"></a>

We will start with an exploratory analysis of the data.

In [None]:
import pandas as pd
import numpy as np

import category_encoders as ce

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_recall_curve, auc,
                             roc_curve, recall_score, classification_report, f1_score,
                             precision_recall_fscore_support, accuracy_score)

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline
from pylab import rcParams

import seaborn as sns
sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 14, 8

In [None]:
df = pd.read_csv("data/credit_card_transactions-ibm_v2.csv")
#df = pd.read_csv("data/medium_dataset.csv")
#df = df.sample(n=100000, random_state=111)
#df.to_csv('data/medium_dataset.csv')

The following line displays the number of rows (transactions) and columns (variables) from the dataset:

In [None]:
df.shape

Below we can see what some of the transactions look like:

In [None]:
df.head()

The following table shows the descriptive statistics of the different fields:

In [None]:
df.describe(include='all')

As we can see above, some of the features contain null values (represented as NaN), which will be handled in the following section.
From the previous table, it is worth noting a few facts about certain features:
- There are 2,000 different users and each of them can use up to 9 cards.
- The transactions were recorded from the year 1991 to 2020.
- There are 3 types of transactions: Swipe Transaction, Online Transaction and Chip Transaction.
- The transactions took place in 223 different "states", a field that mixes US states and foreign countries.
- There are 23 different errors that occurred during the transactions.

Below are the cities with the highest number of transactions. As we can see, all of them are US cities except for the most frequent value, "ONLINE", which groups the online transactions.

In [None]:
df['Merchant City'].value_counts().head(10).to_frame('Number of transactions')

Below we list the categorical features that will have to be transformed into numerical values, since the machine learning algorithms used later operate on numbers:

In [None]:
df_types = df.dtypes.to_frame('Type')
print([feature for feature in df_types[df_types['Type'] == 'object'].index])

How different are the transaction amounts in the two classes (fraudulent and legitimate)?

In [None]:
# remove $ character to make Amount a numeric field
df['Amount'] = df['Amount'].apply(lambda x: float(x.strip('$')))

In [None]:
df.Amount.describe()

In [None]:
# drop the transactions with negative amount
index_negatives = df[df['Amount'] < 0 ].index
df.drop(index_negatives, inplace = True)

In [None]:
df.shape

In [None]:
df.Amount.max()

In [None]:
frauds = df[df['Is Fraud?'] == 'Yes']
legitimates = df[df['Is Fraud?'] == 'No']

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Amount per transaction by class')

bins = 50

ax1.hist(frauds.Amount, bins = bins)
ax1.set_title('Fraud')

ax2.hist(legitimates.Amount, bins = bins)
ax2.set_title('Legitimate')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.xlim([0, 10000])
plt.yscale('log')
plt.show();

## Feature Engineering <a class="anchor" id="feature_eng"></a>

The following sections cover the different steps needed to convert our data into a format useable by machine learning algorithms.

### Handling of missing values <a class="anchor" id="missing_val"></a>

As we discovered in our descriptive analysis earlier, our dataset contains null values that have to be treated. These are the features that contain null values:

In [None]:
[column for column in list(df) if df[column].isnull().values.any()]

It turns out that when a transaction is online, the merchant state and zip are filled with null values. In order for the machine learning algorithms to work correctly we should fill these values, and one option is to use the same "ONLINE" string as for the Merchant City.

In [None]:
df.loc[df['Merchant City'] == 'ONLINE', 'Merchant State'] = 'ONLINE'

In [None]:
df.loc[df['Merchant City'] == 'ONLINE', 'Zip'] = 0

We found there are over 157K in-person transactions without a Zip code. We are going to fill them following this approach:
1. Get the most frequent Zip from the transactions for the same merchant name and city.
2. Get the most frequent Zip from the transactions for the same merchant city only.
3. Get the most frequent Zip from the transactions for the same merchant state.
4. Get the most frequent Zip from all the transactions.
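
The steps above can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: the helper `fill_with_group_mode` and the toy values are hypothetical, and `Series.mode()` is used as a stand-in for "most frequent value" (on ties it returns the smallest value, which may differ from other ways of picking the mode).

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(frame, column, keys):
    """Fill NaNs in `column` with the most frequent value within each `keys` group."""
    def _fill(values):
        modes = values.mode()  # mode() ignores NaN; it is empty if the group is all NaN
        return values.fillna(modes.iloc[0]) if not modes.empty else values
    return frame.groupby(keys)[column].transform(_fill)


# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'Merchant Name':  [1, 1, 2, 3, 3, 4, 5],
    'Merchant City':  ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
    'Merchant State': ['X', 'X', 'X', 'X', 'X', 'X', 'Y'],
    'Zip':            [111.0, np.nan, np.nan, 222.0, 222.0, np.nan, np.nan],
})

# Steps 1-3: progressively coarser groupings
for keys in (['Merchant Name', 'Merchant City'], ['Merchant City'], ['Merchant State']):
    toy['Zip'] = fill_with_group_mode(toy, 'Zip', keys)

# Step 4: global fallback for anything still missing
toy['Zip'] = toy['Zip'].fillna(toy['Zip'].mode().iloc[0])

print(toy)
```

Each pass only touches rows that are still missing, so a value found at a finer level (same merchant and city) is never overwritten by a coarser one (same state).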

In [None]:
len(df[df['Zip'].isnull()])

In [None]:
df['Zip'].isnull().values.any()

In [None]:
value_counts = df.groupby(['Merchant Name', 'Merchant City'])['Zip'].value_counts().to_frame('count')
value_counts = value_counts.reset_index()
value_counts.columns
#value_counts

In [None]:
value_counts.head()

In [None]:
# Fill missing Zip with most frequent for given Merchant name and city
if df['Zip'].isnull().values.any():
    df['Zip'] = df.groupby(['Merchant Name', 'Merchant City'])['Zip'].transform(lambda x: x.fillna(x.value_counts().idxmax() if x.value_counts().max() > 0 else np.nan))

# If there are still missing Zips, fill them with most frequent for given Merchant city only
if df['Zip'].isnull().values.any():
    df['Zip'] = df.groupby('Merchant City')['Zip'].transform(lambda x: x.fillna(x.value_counts().idxmax() if x.value_counts().max() > 0 else np.nan))

# If there are still missing Zips, fill them with most frequent for given Merchant state
if df['Zip'].isnull().values.any():
    df['Zip'] = df.groupby('Merchant State')['Zip'].transform(lambda x: x.fillna(x.value_counts().idxmax() if x.value_counts().max() > 0 else np.nan))

# If there are still missing Zips, fill them with most frequent for any merchant
if df['Zip'].isnull().values.any():
    freq_zip = df['Zip'].value_counts().idxmax()
    df['Zip'] = df['Zip'].fillna(freq_zip)

Finally, when a transaction completes without errors, the field "Errors?" is null. We'll fill these values with an "OK" string.

In [None]:
df['Errors?'] = df['Errors?'].fillna('OK')

In [None]:
df.isnull().values.any()

### Transforming categorical features <a class="anchor" id="cat_features"></a>

As mentioned earlier there is a set of features that are categorical instead of numeric:

- Nominal: Use Chip, Merchant City, Merchant State, Errors?, Is Fraud?
- Ordinal: Time

The time can be converted into an hour-only integer feature, since the minutes are likely to have little or no impact on the fraud detection.

In [None]:
df['Time'] = df['Time'].apply(lambda x: int(x.split(':')[0]))

The "Use Chip" variable can be easily converted with one-hot encoding since there are only 3 types of transaction: swipe, online and chip.

In [None]:
use_chip_dummies = pd.get_dummies(df['Use Chip'])
df = pd.concat([df, use_chip_dummies], axis='columns')
df.drop('Use Chip', axis='columns', inplace=True)

We are going to apply the same technique to the "Errors?" feature since there is a limited number of error types. However, as shown below, some transactions have multiple errors combined into one single error string. Applying one-hot encoding directly to these strings would treat every combination as its own category and lose potentially valuable information, so we will split the errors into single-error boolean features manually.

In [None]:
df['Errors?'].value_counts().to_frame('Number transactions')

It turns out that there are only 7 different errors that can occur during a transaction. We'll create a different dummy column for each of them:

In [None]:
error_types = list(set([single_error for error in df['Errors?'].unique().tolist() for single_error in error.split(',') if single_error != 'OK']))
error_types

In [None]:
# create a zero-filled column for each type of error
for error in error_types:
    df[error] = 0

In [None]:
# function to set the error columns for the given transaction row
def set_error(row):
    if row['Errors?'] != 'OK':
        for error_type in error_types:
            if error_type in row['Errors?']:
                row[error_type] = 1
    return row

df = df.apply(lambda row: set_error(row), axis=1)
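
A row-wise `apply` can be slow on millions of rows. As a possible alternative, the sketch below builds the same kind of single-error columns in a vectorized way with `Series.str.get_dummies`. The toy error strings are hypothetical examples, and the sketch assumes errors are separated by commas, as in the cells above.

```python
import pandas as pd

# Toy data (hypothetical error strings)
toy = pd.DataFrame({'Errors?': ['OK', 'Bad PIN', 'Bad PIN,Insufficient Balance', 'OK']})

# Split on commas and build one 0/1 column per distinct token
error_dummies = toy['Errors?'].str.get_dummies(sep=',')

# 'OK' marks the absence of errors, so it does not need its own column
error_dummies = error_dummies.drop(columns='OK', errors='ignore')

toy = pd.concat([toy, error_dummies], axis='columns')
print(toy)
```

Note that this matches whole comma-separated tokens, whereas the `set_error` function above uses a substring test; the two agree as long as no error name is contained in another.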

Below we can see some transactions with the new error columns:

In [None]:
df.head()

In [None]:
# Remove Errors categorical column
df.drop('Errors?', axis='columns', inplace=True)

The next two categorical variables are going to be treated in the same way. Both Merchant City and Merchant State have a high cardinality and applying one-hot encoding on them would create far too many columns which might lead to overfitting of the future tree-based model. Therefore we are going to convert them into numeric features via binary encoding, by which each category is converted into binary digits and each digit creates one feature column.

We can make use of the library category_encoders for the binary encoding: https://contrib.scikit-learn.org/category_encoders/
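
To make the idea concrete, here is a minimal sketch of binary encoding written with plain pandas and NumPy. It is only an illustration of the principle: the city names are hypothetical, and the integer codes and column names produced by category_encoders may differ from the ones chosen here.

```python
import pandas as pd

# Toy categorical feature (hypothetical values)
cities = pd.Series(['Rome', 'Paris', 'Rome', 'Lima', 'Oslo', 'Cairo'], name='city')

# 1. Map each category to an integer code (starting at 1, in order of appearance)
codes = pd.factorize(cities)[0] + 1

# 2. Number of bits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# 3. One column per binary digit, most significant bit first
bits = {f'city_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
encoded = pd.DataFrame(bits, index=cities.index)

print(encoded)
```

Five distinct categories fit into three columns here, whereas one-hot encoding would need five; the saving grows quickly with the number of categories.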

In [None]:
print ('Merchant cities: ' + str(len(df['Merchant City'].unique())))
print ('Merchant states: ' + str(len(df['Merchant State'].unique())))

In [None]:
ce_be = ce.BinaryEncoder(cols=['Merchant City', 'Merchant State']);
df = ce_be.fit_transform(df);

The new columns resulting from the binary encoding can be seen below:

In [None]:
df.head()

Finally, for the target variable "Is Fraud?", we can just map it to an integer where 0 represents a legitimate transaction and 1, a fraudulent transaction.

In [None]:
df['Is Fraud?'] = df['Is Fraud?'].apply(lambda x: 0 if x == 'No' else 1)

In [None]:
df.head(5)

With the above transformations, all our features are now numerical and can be used to train a machine learning model. However, we should still filter out some of the features that might not be relevant enough to predict the target "Is Fraud" value.

## Data balancing <a class="anchor" id="data_balancing"></a>

- The dataset is highly imbalanced, which needs to be addressed:
    * Check the ROC AUC (the ability to distinguish between classes): it should be close to 1, whereas a value of 0.5 means the model does no better than random guessing.
    * Oversample the minority class with the imbalanced-learn library (ADASYN or SMOTE).
    * Inspect the confusion matrix, in particular the false positives and false negatives.

In [None]:
def plot_balance(data):
    LABELS = ["Ok", "Fraud"]
    count_classes = data['Is Fraud?'].value_counts().sort_index()
    count_classes.plot(kind = 'bar', rot=0)
    plt.title("Fraud distribution")
    plt.xticks(range(2), LABELS)
    plt.xlabel("Is Fraud")
    plt.ylabel("Frequency");
    
plot_balance(df)

In [None]:
frauds = df[df['Is Fraud?'] == 1]
normal = df[df['Is Fraud?'] == 0]
print(f'Num fraudulent transactions:{len(frauds)}')
print(f'Num legitimate transactions:{len(normal)}')

In [None]:
print(f'Only {round(len(frauds)/len(df) * 100, 2)}% of transactions are fraudulent')

It seems that we have a highly imbalanced dataset on our hands since legitimate transactions overwhelm the fraudulent ones by a large margin. This imbalance of the target class might decrease the performance of the classification algorithm (the model might fail to identify fraud) so we'll need to create more samples of the minority class i.e. the fraudulent transactions.

In particular, we are going to apply an oversampling method called Synthetic Minority Oversampling Technique (SMOTE), which creates new synthetic fraud samples by interpolating between existing fraudulent transactions and their nearest neighbours in feature space. The objective is to raise the number of fraudulent transactions to 10% of the number of legitimate ones (`sampling_strategy=0.1`), i.e. roughly 9% of all transactions, up from 0.6%.

We can use the library imbalanced-learn for this task: https://pypi.org/project/imbalanced-learn/
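
The interpolation idea behind SMOTE can be illustrated with a few lines of NumPy. This is a simplified sketch, not the imbalanced-learn implementation: the function `smote_like_sample`, the toy points and the choice of `k` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2-D (hypothetical values)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a segment between two real minority points, so the new samples stay inside the region already occupied by the fraudulent class instead of being exact copies.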

In [None]:
X = df.loc[:, df.columns != 'Is Fraud?']
y = df['Is Fraud?']

In [None]:
method = SMOTE(sampling_strategy=0.1)
X_resampled, y_resampled = method.fit_resample(X, y)

In [None]:
X_resampled = pd.DataFrame(X_resampled, columns=X.columns)

y_resampled = pd.DataFrame(y_resampled, columns=['Is Fraud?'])

In [None]:
df_resampled = pd.concat([X_resampled, y_resampled], axis='columns')

Below we can see that the distribution of legitimate and fraudulent transactions has now changed to a ratio of roughly 90-10:

In [None]:
plot_balance(df_resampled)

In [None]:
frauds = df_resampled[df_resampled['Is Fraud?'] == 1]
normal = df_resampled[df_resampled['Is Fraud?'] == 0]
print(f'Num fraudulent transactions:{len(frauds)}')
print(f'Num legitimate transactions:{len(normal)}')

## Modelling <a class="anchor" id="modelling"></a>

### Model preparation <a class="anchor" id="model_prep"></a>

We are going to try 3 different classification algorithms (logistic regression, decision tree and random forest), but before that we need to split our data into train (80%) and test (20%) subsets and standardize it with the standard scaler.

In [None]:
y = df_resampled['Is Fraud?']
X = df_resampled.drop(['Is Fraud?'], axis=1)

In [None]:
RANDOM_SEED = 15

X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

In [None]:
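# check that the test set contains fraudulent transactions
# (`1 in y_test` would test the index labels, not the values)
(y_test == 1).any()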

In [None]:
scaler = StandardScaler()

# fit on the training set only, then apply the same scaling to both subsets
X_train = scaler.fit_transform(X_train.values)
X_test = scaler.transform(X_test.values)

y_test = y_test.values
y_train = y_train.values

Below are the sizes of the 2 subsets:

In [None]:
print(f'Train size: {str(len(X_train))}')
print(f'Test size: {str(len(X_test))}')

### Model training and evaluation <a class="anchor" id="model_train"></a>

Below we can see the parameters we are using for our 3 classification models:

In [None]:
names = ['Logistic Regression', 'Decision Tree', 'Random Forest']

classifiers = [
    linear_model.LogisticRegression(),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=50)]

classifiers

Now let's run the different algorithms and generate their confusion matrices. The confusion matrix is an essential evaluation tool for classification problems such as our fraud detection system. It lets us easily visualise the true/false negatives and positives, which will help us understand the precision and recall of our models.
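
As a small illustration of how precision and recall follow from the four cells of a confusion matrix, consider the sketch below. The labels are hypothetical toy values, not results from the models in this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions (hypothetical): 1 = fraud, 0 = legitimate
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted frauds that really are fraud
recall = tp / (tp + fn)     # share of real frauds that were detected

print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
print(f'precision={precision:.2f} recall={recall:.2f}')

# The same values computed directly by scikit-learn
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

In a fraud setting, a false negative (a missed fraud) and a false positive (a blocked legitimate payment) carry different costs, which is why both numbers are worth reading alongside the overall score.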

In [None]:
LABELS = ['Legitimate', 'Fraud']

i = 1
figure = plt.figure(figsize=(27, 7))

half_point = int(len(classifiers)/2)

if len(classifiers) % 2 == 1:
    half_point += 1

probs =[]
for name, clf in zip(names, classifiers):
        ax = plt.subplot(1, len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        y_pred = clf.predict(X_test) 
                
        Z = clf.predict_proba(X_test)[:, 1]
        
        probs.append(Z)
        
        conf_matrix = confusion_matrix(y_test, y_pred)

        show_bar = False
        if i == len(classifiers):
            show_bar = True
            
        sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, 
                    cbar = show_bar, fmt="d");

        ax.set_title(name + ": " + str("{0:.4f}".format(score)))
        
        if i == 1:
            plt.ylabel('True class')
            
        if i == half_point:
            plt.xlabel('Predicted class')
            
        i += 1


plt.tight_layout()
plt.show()        

Even though the plots above give us a good summary of our models' performance, we should dive deeper, especially taking into account how imbalanced our dataset was originally.

One way to investigate further is by using ROC curves to understand the performance of our binary classifier.
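
The sketch below shows, on a hypothetical four-sample example, what the area under the ROC curve measures: the fraction of (fraud, legitimate) pairs in which the fraudulent transaction receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted fraud probabilities (hypothetical values)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_value = roc_auc_score(y_true, scores)

# Equivalent pairwise view: how often does a fraud outrank a legitimate sample?
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(f'AUC = {auc_value:.2f}, pairwise ranking rate = {pairwise:.2f}')
```

A value of 1 means every fraud is ranked above every legitimate transaction, while 0.5 corresponds to random ordering.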

In [None]:
figure = plt.figure(figsize=(27, 5))
i = 1

for name, clf in zip(names, classifiers):
    fpr, tpr, thresholds = roc_curve(y_test, probs[i-1])
    roc_auc = auc(fpr, tpr)
    ax = plt.subplot(1, len(classifiers) + 1, i)

    ax.plot(fpr, tpr, label='AUC = %0.4f' % roc_auc)
    ax.legend(loc='lower right')
    ax.plot([0, 1], [0, 1], 'r--')  # diagonal: a classifier that guesses at random
    ax.set_xlim([-0.01, 1])
    ax.set_ylim([0, 1.05])
    ax.set_title(name)
    if i == 1:
        ax.set_ylabel('True Positive Rate')
    if i == half_point:
        ax.set_xlabel('False Positive Rate')
    i += 1

plt.show();

## Region analysis <a class="anchor" id="region_analysis"></a>

In [None]:
import pandas as pd
import numpy as np
import re
import folium

In [None]:
df = pd.read_csv("data/credit_card_transactions-ibm_v2.csv")

In [None]:
# group by merchant country and fraud class, and rename columns
df_fraud_per_state = df.groupby(['Merchant State', 'Is Fraud?']).size().unstack(fill_value=0).reset_index()
df_fraud_per_state.rename(columns = {'No':'Legitimate', 'Yes':'Fraudulent'}, inplace = True)
df_fraud_per_state.head()

In [None]:
df_fraud_per_state.shape

In [None]:
# calculate fraud ratio per country
df_fraud_per_state['Fraud Ratio'] = df_fraud_per_state.apply(lambda x: x['Fraudulent'] / (x['Fraudulent'] + x['Legitimate']), axis='columns')
df_fraud_per_state.head()

Since we will be performing an analysis at country level, we are going to treat all the transactions from a US state as part of the United States. Therefore we will convert all 2-letter US state codes into "United States of America".
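
One possible way to express this aggregation is to map every row to a country first and then compute the ratio per group. The sketch below uses hypothetical toy rows and a hypothetical helper `to_country`; like the cells in this section, it assumes that any value made of exactly two uppercase letters is a US state code.

```python
import re
import pandas as pd

# Toy transactions (hypothetical values)
toy = pd.DataFrame({
    'Merchant State': ['CA', 'CA', 'NY', 'Italy', 'Italy'],
    'Is Fraud?':      ['No', 'Yes', 'No', 'Yes', 'No'],
})


def to_country(state):
    """Treat any two-uppercase-letter value as a US state code (assumption)."""
    return 'United States of America' if re.fullmatch(r'[A-Z]{2}', state) else state


toy['Country'] = toy['Merchant State'].apply(to_country)
toy['Fraudulent'] = toy['Is Fraud?'] == 'Yes'

# The mean of a boolean column is the share of True values, i.e. the fraud ratio
fraud_ratio = toy.groupby('Country')['Fraudulent'].mean()
print(fraud_ratio)
```

Mapping before grouping avoids having to sum the per-state rows and recompute the ratio for the combined US row afterwards.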

In [None]:
# update country name for all the US states
df_fraud_per_state['Merchant State'] = df_fraud_per_state['Merchant State'].apply(lambda x: 'United States of America' if re.search(r'\b[A-Z]{2}\b', x) else x)

In [None]:
# combine the amounts for all the US states into one
us_transactions = df_fraud_per_state[df_fraud_per_state['Merchant State'] == 'United States of America'].sum(numeric_only=True).to_frame().transpose()
us_transactions['Merchant State'] = 'United States of America'
us_transactions['Fraud Ratio'] = us_transactions.apply(lambda x: x['Fraudulent'] / (x['Fraudulent'] + x['Legitimate']), axis='columns')
us_transactions

In [None]:
# drop US-states rows from dataframe (they'll be replaced by 1 US-row)
index_us_states = df_fraud_per_state[df_fraud_per_state['Merchant State'] == 'United States of America'].index
df_fraud_per_state.drop(index_us_states, inplace = True)

# append US-row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_fraud_per_state = pd.concat([df_fraud_per_state, us_transactions], ignore_index=True)

In [None]:
df_fraud_per_state.shape

Below is the list of countries with the highest fraud rate:

In [None]:
df_fraud_per_state[df_fraud_per_state['Merchant State'] == 'Italy']

In [None]:
# show countries with the highest ratio of fraud
df_fraud_per_state[['Merchant State', 'Fraud Ratio']].sort_values('Fraud Ratio', ascending=False).head(10)

In [None]:
# load file with world country coords
world_geo = r'data/world-countries.json' # geojson file

# bin edges: 5 evenly spaced values between 0 and 1, the possible range of a ratio
threshold_scale = np.linspace(start=0.0, stop=1.0, num=5, endpoint=True, dtype=float)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 0.001 # ensure last value of list is greater than max fraud ratio

# create a plain world map (default tiles, since 'Mapbox Bright' is not available in recent folium releases)
world_map = folium.Map(location=[0, 0], zoom_start=2)

# recent folium releases replace Map.choropleth() with the Choropleth class,
# where the bin edges are passed as `bins`
folium.Choropleth(
    geo_data=world_geo,
    data=df_fraud_per_state,
    columns=['Merchant State', 'Fraud Ratio'],
    key_on='feature.properties.name',
    bins=threshold_scale,
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Fraud ratio'
).add_to(world_map)
world_map

Apart from discovering the ratio of fraud in each country, it is also interesting to find out the amounts (in USD) involved in the fraud. A country may have only a few fraudulent transactions, but if those transactions involve large amounts of money they are still relevant for our analysis.

In [None]:
# drop legitimate and online transactions since we're only interested in the total fraudulent amount per merchant country
drop_transactions = df[(df['Is Fraud?'] == 'No') | (df['Merchant City'] == 'ONLINE')].index
fraudulent_df = df.drop(drop_transactions)

In [None]:
# keep only columns Merchant State and Amount
fraudulent_df.drop(fraudulent_df.columns.difference(['Merchant State','Amount']), axis='columns', inplace=True)
fraudulent_df['Amount'] = fraudulent_df['Amount'].apply(lambda x: float(x.strip('$')))
fraudulent_df.head()

In [None]:
# group by merchant country and calculate the sum of amounts
df_fraud_amount_per_state = fraudulent_df.groupby(['Merchant State']).sum().reset_index()
df_fraud_amount_per_state.head()

Since we will be performing an analysis at country level, we are going to treat all the fraudulent amounts from a US state as part of the United States. Therefore we will convert all 2-letter US state codes into "United States of America".

In [None]:
# merge all US states into 1 United States row
df_fraud_amount_per_state[df_fraud_amount_per_state['Merchant State'].apply(lambda x: True if re.search(r'\b[A-Z]{2}\b', x) else False)]['Amount'].sum()

df_fraud_amount_per_state['Merchant State'] = df_fraud_amount_per_state['Merchant State'].apply(lambda x: 'United States of America' if re.search(r'\b[A-Z]{2}\b', x) else x)

us_fraud_amount = df_fraud_amount_per_state[df_fraud_amount_per_state['Merchant State'] == 'United States of America'].sum(numeric_only=True).to_frame().transpose()
us_fraud_amount['Merchant State'] = 'United States of America'
us_fraud_amount

In [None]:
# drop US-states rows from dataframe (they'll be replaced by 1 US-row)
index_us_states = df_fraud_amount_per_state[df_fraud_amount_per_state['Merchant State'] == 'United States of America'].index
df_fraud_amount_per_state.drop(index_us_states, inplace = True)

# append US-row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_fraud_amount_per_state = pd.concat([df_fraud_amount_per_state, us_fraud_amount], ignore_index=True)

Below is the list of countries with the highest fraud amount:

In [None]:
df_fraud_amount_per_state.sort_values('Amount', ascending=False).head(10)

In [None]:
df_fraud_amount_per_state['Amount'].max()

In [None]:
# load world countries with coords
world_geo = r'data/world-countries.json' # geojson file

# bin edges: 6 evenly spaced values between 0 and 500,000 USD
threshold_scale = np.linspace(start=0, stop=500000, num=6, endpoint=True, dtype=float)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 0.001 # last edge must be greater than the max fraudulent amount shown above

# create a plain world map (default tiles, since 'Mapbox Bright' is not available in recent folium releases)
world_map = folium.Map(location=[0, 0], zoom_start=2)

# recent folium releases replace Map.choropleth() with the Choropleth class,
# where the bin edges are passed as `bins`
folium.Choropleth(
    geo_data=world_geo,
    data=df_fraud_amount_per_state,
    columns=['Merchant State', 'Amount'],
    key_on='feature.properties.name',
    bins=threshold_scale,
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Fraudulent amount (in USD)'
).add_to(world_map)
world_map

## Merchant analysis <a class="anchor" id="merchant_analysis"></a>

In [None]:
# for HTTP requests
import requests  

# for HTML scraping
from bs4 import BeautifulSoup 

In [None]:
df = pd.read_csv("data/credit_card_transactions-ibm_v2.csv")

In [None]:
# URL of the website from which to scrape tabular data.
mcc_url = "https://docs.checkout.com/resources/codes/merchant-category-codes"

# if the request was successful, the response status code should be 200.
response = requests.get(mcc_url)
assert response.status_code == 200

# parse response content to HTML
soup = BeautifulSoup(response.content, 'html.parser')

# title of website
title = soup.title.string
print(f'Page title: {title}') 

# find the table to scrape (the first table on the page)
mcc_table=soup.find('table')

# get the 1st row of the table i.e. the header
row0 = mcc_table.findAll("tr")[0]

# show the column names
header = [th.text.rstrip() for th in row0.find_all('th')]
print(f'Column names: {header}') 

In [None]:
# construct dictionary of MCCs
merchant_category_codes = {}

# iterate through the rows of the table
for row in mcc_table.findAll("tr"):    
    cells = row.findAll('td')
    if len(cells)==2:
        code = int(cells[0].find(text=True))
        desc =  cells[1].find(text=True)
        merchant_category_codes[code] = desc
    
print(f'Number of merchant codes: {len(merchant_category_codes)}')

In [None]:
# group by MCC and fraud class, and rename columns
df_fraud_per_mcc = df.groupby(['MCC', 'Is Fraud?']).size().unstack(fill_value=0).reset_index()
df_fraud_per_mcc.rename(columns = {'No':'Legitimate', 'Yes':'Fraudulent'}, inplace = True)
df_fraud_per_mcc.head()

In [None]:
df_fraud_per_mcc.shape

Below is the total number of fraudulent transactions:

In [None]:
# calculate total number of fraudulent transactions
total_fraud_transactions = df_fraud_per_mcc['Fraudulent'].sum()
total_fraud_transactions

In [None]:
# get the categories of merchant with the highest number of fraudulent transactions
top_merchant_fraud = df_fraud_per_mcc.sort_values('Fraudulent', ascending=False).head(9)
top_merchant_fraud

In [None]:
rest_fraud_transactions = total_fraud_transactions - top_merchant_fraud['Fraudulent'].sum()
# Series.append was removed in pandas 2.0, so build the array with NumPy
y = np.append(top_merchant_fraud['Fraudulent'].to_numpy(), rest_fraud_transactions)
top_mcc_labels = [merchant_category_codes[mcc] for mcc in top_merchant_fraud['MCC']]
top_mcc_labels.append('Other')

title = plt.title('Fraud by type of merchant')
pie = plt.pie(y)
plt.axis('equal')
plt.legend(pie[0],top_mcc_labels, bbox_to_anchor=(1.5,0.5), loc="center right",
            fontsize=14, bbox_transform=plt.gcf().transFigure)
plt.subplots_adjust(left=0.0, bottom=0.1, right=0.5)