## Task 1: Credit Card Routing for Online Purchase via Predictive Modelling

### Problem statement
* Over the past year, the online payment department at a large retail company has encountered a high failure rate for online credit card payments processed via payment service providers, which the business stakeholders refer to as PSPs.
* The company loses a lot of money due to failed transactions, and customers have become increasingly dissatisfied with the online shop.
* The current routing logic is manual and rule-based. Business decision makers hope that predictive modelling will enable a smarter way of routing each transaction to a PSP.

### Data Science Task
* Help the business automate credit card routing via a predictive model.
* Such a model should increase the payment success rate by finding the best possible PSP for each transaction while keeping transaction fees low.
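
The trade-off between success rate and fees can be made concrete with a small scoring rule. The sketch below is illustrative only: the PSP names, success probabilities, fees, and the `fee_weight` parameter are all hypothetical and do not come from the dataset or from the model built in this notebook.

```python
# Hypothetical inputs: none of these names, probabilities, or fees come from the dataset.
def choose_psp(success_prob, fee, fee_weight):
    """Return the PSP with the highest score = P(success) - fee_weight * fee."""
    scores = {psp: success_prob[psp] - fee_weight * fee[psp] for psp in success_prob}
    return max(scores, key=scores.get)

success_prob = {"psp_a": 0.40, "psp_b": 0.25, "psp_c": 0.15, "psp_d": 0.20}
fee = {"psp_a": 10.0, "psp_b": 5.0, "psp_c": 1.0, "psp_d": 3.0}

# With fee_weight=0 only the success probability matters
best_ignoring_fees = choose_psp(success_prob, fee, fee_weight=0.0)
# A positive fee_weight penalises expensive PSPs
best_fee_aware = choose_psp(success_prob, fee, fee_weight=0.05)

print(best_ignoring_fees, best_fee_aware)
```

How strongly fees should be weighted is a business decision; the single `fee_weight` knob is just one simple way to express it.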

# PART 1: Building the base model

### Import Key Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# import visualization libraries
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from bokeh.plotting import figure, show, output_notebook
from bokeh.palettes import Spectral
from bokeh.models import ColumnDataSource

### Read Dataset and update index

In [3]:
dataset = pd.read_excel("PSP_Jan_Feb_2019.xlsx")

In [4]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,tmsp,country,amount,success,PSP,3D_secured,card
0,0,2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa
1,1,2019-01-01 00:01:17,Germany,89,1,UK_Card,0,Visa
2,2,2019-01-01 00:02:49,Germany,238,0,UK_Card,1,Diners
3,3,2019-01-01 00:03:13,Germany,238,1,UK_Card,1,Diners
4,4,2019-01-01 00:04:33,Austria,124,0,Simplecard,0,Diners


In [5]:
dataset = dataset.drop('Unnamed: 0', axis=1)

In [6]:
# make timestamp the index for easier analysis
dataset = dataset.set_index(dataset.columns[0])

In [7]:
dataset.head()

tmsp,country,amount,success,PSP,3D_secured,card
2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa
2019-01-01 00:01:17,Germany,89,1,UK_Card,0,Visa
2019-01-01 00:02:49,Germany,238,0,UK_Card,1,Diners
2019-01-01 00:03:13,Germany,238,1,UK_Card,1,Diners
2019-01-01 00:04:33,Austria,124,0,Simplecard,0,Diners


In [8]:
# add a feature field to hold the order of the dates - for the base model
dataset['date_order'] = np.arange(len(dataset.index))

#### Identify fields with missing data

In [9]:
# Print the number of missing entries in each column
print(dataset.isna().sum())

country       0
amount        0
success       0
PSP           0
3D_secured    0
card          0
date_order    0
dtype: int64


## 1. Creation of a Base Model

### 1a. Base model data preparation

In [10]:
# create a copy of the dataset
base_dataset = dataset.copy()

In [11]:
base_dataset.head()

tmsp,country,amount,success,PSP,3D_secured,card,date_order
2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa,0
2019-01-01 00:01:17,Germany,89,1,UK_Card,0,Visa,1
2019-01-01 00:02:49,Germany,238,0,UK_Card,1,Diners,2
2019-01-01 00:03:13,Germany,238,1,UK_Card,1,Diners,3
2019-01-01 00:04:33,Austria,124,0,Simplecard,0,Diners,4


#### 1ai. Check the unique values in the categorical variables and the label

In [12]:
base_dataset['country'].nunique()

3

In [13]:
base_dataset['card'].nunique()

3

In [14]:
base_dataset['PSP'].nunique()

4

#### 1aii. Deal with missing data in the feature variables vector matrix
* As noted above, there is no missing data

#### 1aiii. Encoding of categorical feature variables and label and defining feature variable and dependent variable vector matrices for the base model

In [15]:
base_dataset.head(1)

tmsp,country,amount,success,PSP,3D_secured,card,date_order
2019-01-01 00:01:11,Germany,89,0,UK_Card,0,Visa,0


In [16]:
#define categorical features
cat_features = ['country', 'card']

In [17]:
# Encode the categorical feature variables with OneHotEncoder;
# the remaining columns are passed through unchanged, after the encoded ones
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), cat_features)], remainder='passthrough')
X_base = np.array(ct.fit_transform(base_dataset.drop('PSP', axis=1)))

In [18]:
print(X_base[2])

[  0.   1.   0.   1.   0.   0. 238.   0.   1.   2.]


In [19]:
#encoding the label using LabelEncoder
le = LabelEncoder()
y_base = le.fit_transform(base_dataset['PSP'])

In [20]:
print(y_base[2])

3
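
`LabelEncoder` assigns integer codes in sorted label order, so a code such as the one printed above only becomes meaningful next to `classes_`. Below is a small sketch with a toy label list. `UK_Card` and `Simplecard` appear in the rows shown earlier, while `Other_PSP` is a placeholder, so the resulting codes differ from those in the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Two PSP names appear in the rows shown earlier; "Other_PSP" is a placeholder
psp_toy = ["UK_Card", "Simplecard", "UK_Card", "Other_PSP"]
le_toy = LabelEncoder()
codes = le_toy.fit_transform(psp_toy)

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le_toy.classes_)}
print(mapping)

# inverse_transform turns codes (for example, predictions) back into PSP names
print(le_toy.inverse_transform(codes[:2]))
```

In the notebook, `le.inverse_transform(y_base_pred)` would convert predicted codes back to PSP names in the same way.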


#### 1aiv. Split the data into training set and the test set

In [21]:
X_base_train, X_base_test, y_base_train, y_base_test = train_test_split(X_base, y_base, test_size=0.2, random_state=30)

In [22]:
print(X_base_train[2])

[0.0000e+00 1.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00
 2.6100e+02 0.0000e+00 1.0000e+00 1.7732e+04]
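
The split above is purely random. If the PSP classes are unevenly represented, one optional variation is to pass `stratify` so that train and test sets keep the same class proportions. The sketch below uses synthetic labels; the class counts are made up for illustration, and this is an alternative to the split used in this notebook, not a description of it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for four classes with an uneven split (illustrative only)
y_toy = np.array([0] * 10 + [1] * 20 + [2] * 20 + [3] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=30, stratify=y_toy
)

# Class counts in each part keep the original 10/20/20/50 proportions
print(np.bincount(y_tr), np.bincount(y_te))
```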


#### 1av. Feature scaling

In [23]:
# Scale the non-encoded columns (index 6 onward: amount, success, 3D_secured, date_order)
sc = StandardScaler()
X_base_train[:, 6:] = sc.fit_transform(X_base_train[:, 6:])  # fit on the training set only
X_base_test[:, 6:] = sc.transform(X_base_test[:, 6:])  # reuse the fitted scaler on the test set

In [24]:
print(X_base_train[1])

[ 0.          1.          0.          0.          1.          0.
  1.43373055  1.98884824  1.79991589 -0.68218692]


In [25]:
print(X_base_test[1])

[ 1.          0.          0.          0.          1.          0.
  2.04710735 -0.50280357 -0.55558152 -0.02736259]
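
A quick sanity check confirms that the scaler was fitted on the training data only. The sketch below repeats the same slicing pattern on synthetic arrays (6 one-hot style columns followed by 4 numeric columns, all values made up) and shows that the training columns end up with zero mean and unit variance, while the test columns are transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(30)

def make_rows(n):
    """Synthetic stand-in: 6 one-hot style columns followed by 4 numeric columns."""
    one_hot = rng.integers(0, 2, size=(n, 6)).astype(float)
    numeric = rng.normal(loc=100.0, scale=20.0, size=(n, 4))
    return np.hstack([one_hot, numeric])

X_train_toy, X_test_toy = make_rows(200), make_rows(50)
X_test_raw = X_test_toy.copy()  # kept only to verify the transformation afterwards

sc_toy = StandardScaler()
X_train_toy[:, 6:] = sc_toy.fit_transform(X_train_toy[:, 6:])  # statistics come from train only
X_test_toy[:, 6:] = sc_toy.transform(X_test_toy[:, 6:])        # reuse the train statistics

# Train columns are centred with unit variance by construction
print(X_train_toy[:, 6:].mean(axis=0).round(6), X_train_toy[:, 6:].std(axis=0).round(6))
# Test columns are only approximately standardised, which is expected
print(X_test_toy[:, 6:].mean(axis=0).round(3))
```

Test-set means that are close to zero but not exactly zero are the expected outcome; exact zeros would suggest the scaler had been refitted on the test data.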


### 1b. Creating the base model

#### 1bi. Define the type and parameters of the base model
* Type: Logistic Regression model
* Parameters: `random_state=30` is used for all models so that results are reproducible

In [26]:
# Train LogisticRegression model
logR = LogisticRegression(random_state=30)
logR.fit(X_base_train, y_base_train)

LogisticRegression(random_state=30)

In [27]:
# Predict y given X_base_test
y_base_pred = logR.predict(X_base_test)

#### 1bii. Evaluation of the base model
* Accuracy: about 52.5% on the test set

In [28]:
accuracy = accuracy_score(y_base_test, y_base_pred)
print("Logistic Regression model accuracy (in %):", accuracy*100)

Logistic Regression model accuracy (in %): 52.50942273358461
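
An accuracy figure for a four-class problem is easier to judge against a trivial reference, such as always predicting the most frequent class. The sketch below shows how such a reference could be computed with `DummyClassifier`. The labels are synthetic and the class balance is invented, so the printed number says nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels for four classes with an uneven split (illustrative only)
y_train_toy = np.array([0] * 10 + [1] * 20 + [2] * 20 + [3] * 50)
y_test_toy = np.array([0] * 2 + [1] * 4 + [2] * 4 + [3] * 10)
X_train_toy = np.zeros((len(y_train_toy), 1))  # features are ignored by the dummy model
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
y_pred_toy = baseline.predict(X_test_toy)

baseline_accuracy = accuracy_score(y_test_toy, y_pred_toy)
print("Majority-class accuracy (in %):", baseline_accuracy * 100)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test_toy, y_pred_toy, labels=[0, 1, 2, 3])
print(cm)
```

Running the same two steps on `X_base_train`, `y_base_train`, `X_base_test`, and `y_base_test` would show how far the logistic regression result sits above a constant prediction, and the confusion matrix would show which PSPs it confuses.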
