## Task 1: Credit Card Routing for Online Purchase via Predictive Modelling

### Problem statement
* Over the past year, the online payment department at a large retail company has encountered a high failure rate for online credit card payments made via payment service providers, which the business stakeholders refer to as PSPs.
* The company loses a lot of money to failed transactions, and customers have become increasingly dissatisfied with the online shop.
* The current routing logic is manual and rule-based. Business decision makers hope that predictive modelling will allow a smarter way of routing each transaction to a PSP.

### Data Science Task
* Help the business automate credit card routing with a predictive model.
* Such a model should increase the payment success rate by finding the best possible PSP for each transaction while keeping transaction fees low.
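
The trade-off between success rate and fees can be made concrete with a small sketch. It assumes that a model supplies a success probability for each PSP and that a per-transaction fee is known for each PSP. The probabilities, the fees, the `tolerance` parameter, and the function name `choose_psp` are all hypothetical and chosen for illustration; none of them come from the dataset.

```python
def choose_psp(success_prob, fee, tolerance=0.05):
    """Pick the cheapest PSP whose predicted success probability is
    within `tolerance` of the best available probability."""
    best = max(success_prob.values())
    # Keep only PSPs that are almost as likely to succeed as the best one.
    candidates = [psp for psp, p in success_prob.items() if best - p <= tolerance]
    # Among those, prefer the lowest fee; break ties by higher probability.
    return min(candidates, key=lambda psp: (fee[psp], -success_prob[psp]))


# Hypothetical model output and fee table for a single transaction.
predicted = {"Goldcard": 0.42, "Moneycard": 0.25, "Simplecard": 0.18, "UK_Card": 0.39}
fees = {"Goldcard": 10.0, "Moneycard": 5.0, "Simplecard": 1.0, "UK_Card": 3.0}

print(choose_psp(predicted, fees))
```

With these made-up numbers, a `tolerance` of 0 always picks the most likely PSP regardless of fee, while a larger `tolerance` trades a little success probability for a lower fee.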

# PART 2b: Bivariate Analysis

### Import Key Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# import visualization libraries
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from bokeh.plotting import figure, show, output_notebook
from bokeh.palettes import Spectral
from bokeh.models import ColumnDataSource

### Read Dataset and update index

In [3]:
dataset = pd.read_excel("PSP_Jan_Feb_2019.xlsx")

In [4]:
dataset.head()

,Unnamed: 0,tmsp,country,amount,success,PSP,3D_secured,card
0,0,2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa
1,1,2019-01-01 00:01:17,Germany,89,1,UK_Card,0,Visa
2,2,2019-01-01 00:02:49,Germany,238,0,UK_Card,1,Diners
3,3,2019-01-01 00:03:13,Germany,238,1,UK_Card,1,Diners
4,4,2019-01-01 00:04:33,Austria,124,0,Simplecard,0,Diners


In [5]:
dataset = dataset.drop('Unnamed: 0', axis=1)

In [6]:
# make timestamp the index for easier analysis
dataset = dataset.set_index(dataset.columns[0])

In [7]:
dataset.head()

tmsp,country,amount,success,PSP,3D_secured,card
2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa
2019-01-01 00:01:17,Germany,89,1,UK_Card,0,Visa
2019-01-01 00:02:49,Germany,238,0,UK_Card,1,Diners
2019-01-01 00:03:13,Germany,238,1,UK_Card,1,Diners
2019-01-01 00:04:33,Austria,124,0,Simplecard,0,Diners
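
With the timestamp as the index, time-based aggregation becomes short to write. The sketch below uses a small hand-made frame named `toy` in place of `dataset` (its rows are illustrative, not taken from the file) and computes a daily success rate with `resample`.

```python
import pandas as pd

# Toy stand-in for `dataset`: a DatetimeIndex named tmsp and a binary success flag.
toy = pd.DataFrame(
    {"success": [0, 1, 0, 1, 1, 0]},
    index=pd.to_datetime(
        [
            "2019-01-01 00:01:11",
            "2019-01-01 00:01:17",
            "2019-01-01 12:30:00",
            "2019-01-01 18:45:00",
            "2019-01-02 09:00:00",
            "2019-01-02 09:05:00",
        ]
    ).rename("tmsp"),
)

# The mean of a 0/1 column per calendar day is the share of successful attempts.
daily_success_rate = toy["success"].resample("D").mean()
print(daily_success_rate)
```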


In [8]:
# add a feature field to hold the order of the dates - for the base model
dataset['date_order'] = np.arange(len(dataset.index))

#### Identify fields with missing data

In [9]:
# Print the number of missing entries in each column
print(dataset.isna().sum())

country       0
amount        0
success       0
PSP           0
3D_secured    0
card          0
date_order    0
dtype: int64


## 2. Building the Actual Refined Models (based on CRISP-DM)

### CRISP-DM (2) - Data Understanding (Exploratory Data Analysis)
* Bivariate analysis
* What are the relationships between the different independent variables?
* Are there any correlations between the independent variables?
* What is the relationship between the independent variables and the dependent variable, and how strongly are they correlated?
* Which PSP has the most successes in the data?
* Which PSP has the lowest cost in the data?
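
The question about PSP successes can be answered with a grouped aggregate. The sketch below is one possible way to do it, using a toy frame in place of the real data (the rows are illustrative). The cost question needs a fee per PSP, which is not among the columns shown in this notebook, so it is left out here.

```python
import pandas as pd

# Toy stand-in for the transaction data: one row per payment attempt.
toy = pd.DataFrame(
    {
        "PSP": ["UK_Card", "UK_Card", "Simplecard", "Goldcard", "Goldcard", "Moneycard"],
        "success": [0, 1, 0, 1, 1, 0],
    }
)

# Attempts, successes and success rate per PSP, best rate first.
per_psp = (
    toy.groupby("PSP")["success"]
    .agg(attempts="count", successes="sum", success_rate="mean")
    .sort_values("success_rate", ascending=False)
)
print(per_psp)
```

Reporting the number of attempts next to the rate helps to spot PSPs whose rate rests on very few transactions.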

In [11]:
# reset_dataset
dataset_time = dataset.copy().reset_index()
dataset_time.head()

,tmsp,country,amount,success,PSP,3D_secured,card,date_order
0,2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa,0
1,2019-01-01 00:01:17,Germany,89,1,UK_Card,0,Visa,1
2,2019-01-01 00:02:49,Germany,238,0,UK_Card,1,Diners,2
3,2019-01-01 00:03:13,Germany,238,1,UK_Card,1,Diners,3
4,2019-01-01 00:04:33,Austria,124,0,Simplecard,0,Diners,4


In [14]:
# Integer codes for the categorical columns; values not listed map to None
COUNTRY_CODES = {"Austria": 0, "Germany": 1, "Switzerland": 2}
CARD_CODES = {"Diners": 0, "Master": 1, "Visa": 2}
PSP_CODES = {"Goldcard": 0, "Moneycard": 1, "Simplecard": 2, "UK_Card": 3}

# create a numeric value for country
def encode_country(country):
    return COUNTRY_CODES.get(country)

# create a numeric value for card
def encode_card(card):
    return CARD_CODES.get(card)

# create a numeric value for PSP
def encode_PSP(psp):
    return PSP_CODES.get(psp)

In [15]:
dataset_time['country_num'] = dataset_time['country'].apply(encode_country)
dataset_time['card_num'] = dataset_time['card'].apply(encode_card)
dataset_time['PSP_num'] = dataset_time['PSP'].apply(encode_PSP)
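
The integer codes above impose an order (for example Austria < Germany < Switzerland) that the categories do not really have, so a correlation coefficient computed on them can be hard to interpret. As one possible alternative for relating two categorical columns, the sketch below builds a row-normalised contingency table with `pd.crosstab`. The frame `toy` is an illustrative stand-in for `dataset_time`, not real data.

```python
import pandas as pd

# Toy stand-in for two categorical columns of `dataset_time`.
toy = pd.DataFrame(
    {
        "country": ["Germany", "Germany", "Germany", "Germany", "Austria", "Austria"],
        "card": ["Visa", "Visa", "Diners", "Master", "Diners", "Visa"],
    }
)

# Share of each card type within each country (every row sums to 1).
card_share = pd.crosstab(toy["country"], toy["card"], normalize="index")
print(card_share)
```

If the rows of such a table look alike across countries, the two variables are probably only weakly associated; large differences between rows suggest a dependency worth a closer look.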

## Bivariate analysis of the independent variables

In [127]:
def bivariate_analysis(df, feature1, feature2, colors_dict, number_of_categories=3):
    # number_of_categories is kept so existing calls still work; bar labels are
    # added per container, so it does not have to match the number of hue levels.

    # Histograms of amount split by feature1 (coarse and fine bins)
    print('Histogram - {} and amount'.format(feature1))
    f, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,3))
    sns.histplot(data=df, x='amount', bins=20, hue=feature1, ax=ax[0], palette='tab10')
    sns.histplot(data=df, x='amount', bins=200, hue=feature1, ax=ax[1], palette='tab10')
    plt.show()

    # Legend patches: one per feature1 category, using the colours in colors_dict
    colors = colors_dict
    patch_list = [mpatches.Patch(color=value, label=key) for key, value in colors_dict.items()]

    # Scatter and strip plots - amount and feature1
    print('Scatter and strip plot - {} and amount'.format(feature1))
    f, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,3))
    ax[0].scatter(df.index, df['amount'], c=df[feature1].map(colors))
    ax[0].set_title('Amount against index', fontsize=10)
    ax[0].set_xlabel('Index')
    ax[0].set_ylabel('Amount')
    ax[0].legend(handles=patch_list, loc='upper right', fontsize=8)

    # Number of transactions for each (feature1, amount) pair; df needs a 'tmsp' column
    amount_count = df.groupby([feature1, 'amount'])['tmsp'].count().reset_index().rename(columns={'tmsp':'count'})
    ax[1].scatter(amount_count['amount'], amount_count['count'], c=amount_count[feature1].map(colors))
    ax[1].set_title('Count against amount', fontsize=10)
    ax[1].set_xlabel('Amount')
    ax[1].set_ylabel('Count')
    ax[1].legend(handles=patch_list, loc='upper right', fontsize=8)

    # Draw from df (not the global dataset_time) and onto the third axis
    sns.stripplot(data=df, y='amount', hue=feature1, ax=ax[2])
    ax[2].set_title('Amount strip plot', fontsize=10)
    ax[2].legend(loc='upper right', fontsize=8)
    plt.tight_layout()
    plt.show()

    # Bar plots: count and sum of amount per feature1, split by feature2
    print('Compare {} against {} based on transactions count and values'.format(feature2,feature1))
    f, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,3))
    df_count = df.groupby([feature1,feature2])['amount'].count().reset_index()
    sns.barplot(data=df_count, x=feature1, y='amount', hue=feature2, ax=ax[0])
    for container in ax[0].containers:
        ax[0].bar_label(container)
    ax[0].set_ylabel('Count of Amount')
    ax[0].set_title("Count of amount")

    df_sum = df.groupby([feature1,feature2])['amount'].sum().reset_index()
    sns.barplot(data=df_sum, x=feature1, y='amount', hue=feature2, ax=ax[1])
    for container in ax[1].containers:
        ax[1].bar_label(container)
    ax[1].set_ylabel('Sum of Amount')
    ax[1].set_title("Sum of amount")
    plt.tight_layout()
    plt.show()

In [None]:
bivariate_analysis(dataset_time, 'country', 'success', {'Austria':'red', 'Germany':'orange', 'Switzerland':'blue'})

Histogram - amount and amount


KeyboardInterrupt: 
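
The run above was interrupted before the plots finished. One possible cause, offered here as an assumption rather than a diagnosis, is the cost of drawing one marker per row in the scatter and strip plots. A common workaround is to plot a random sample of the rows. In the sketch below, `sample_for_plot` is a hypothetical helper and the frame `big` is synthetic.

```python
import numpy as np
import pandas as pd


def sample_for_plot(df, max_rows=5000, seed=0):
    """Return df unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    # Sort by index so the sampled rows keep their original order.
    return df.sample(n=max_rows, random_state=seed).sort_index()


# Synthetic frame with roughly the shape of a transaction table.
rng = np.random.default_rng(0)
big = pd.DataFrame(
    {
        "amount": rng.integers(5, 600, size=50_000),
        "country": rng.choice(["Austria", "Germany", "Switzerland"], size=50_000),
    }
)

small = sample_for_plot(big, max_rows=5000)
print(len(big), len(small))
```

The sampled frame can then be passed to the plotting function in place of the full one, for example `bivariate_analysis(sample_for_plot(dataset_time), ...)` with the same remaining arguments as before. Aggregates such as counts and sums are better computed on the full data, since sampling changes them.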

In [120]:
#bivariate_analysis(dataset_time, 'amount', 'success', {0:'red', 1:'orange'})

In [124]:
#bivariate_analysis(dataset_time, 'amount', 'PSP', {'UK_Card':'red', 'Simplecard':'blue', 'Moneycard':'orange','Goldcard':'green'})

In [122]:
#bivariate_analysis(dataset_time, 'amount', '3D_secured', {0:'red', 1:'orange'})

In [125]:
#bivariate_analysis(dataset_time, 'amount', 'card', {'Visa':'red', 'Diners':'orange', 'Master':'blue'})

In [None]:
# Scratch cell: count and sum bar plots per time period, split by another feature.
# Assumes df, time_series_dummy, feature, additional_feature and ordered_list
# are already defined; it raises a NameError otherwise.
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
df_count = df.groupby([time_series_dummy,feature])[additional_feature].count().reset_index()
sns.barplot(x=df_count[time_series_dummy], y=df_count[additional_feature], hue=df_count[feature], ax=ax[0], order=ordered_list)
for container in ax[0].containers:
    ax[0].bar_label(container)
ax[0].set_ylabel('Count of Amount')
ax[0].set_title("Count of amount")

df_sum = df.groupby([time_series_dummy,feature])[additional_feature].sum().reset_index()
sns.barplot(x=df_sum[time_series_dummy], y=df_sum[additional_feature], hue=df_sum[feature], ax=ax[1], order=ordered_list)
for container in ax[1].containers:
    ax[1].bar_label(container)
ax[1].set_ylabel('Sum of Amount')
ax[1].set_title("Sum of amount")
plt.tight_layout()
plt.show()


## Bivariate analysis of independent variables against dependent variable (PSP)