# Using the original Zindi test data


### Description

In this challenge, Zindi provided training and test data. However, the provided test data contain no target values (i.e. fraudulent or not fraudulent). Thus, in order to test the model, a train-test split of the train data was used so far. This notebook describes how the actual test data provided by Zindi can nonetheless be used to test the model.

_Note: In the following, all variable names containing "test" are written in uppercase ("TEST") to highlight the difference between the Zindi test data and the test data from the train-test split of the Zindi train data._

### Import modules

In [None]:
# Basic modules and plotting tools
import pandas as pd
import numpy as np
from datetime import datetime, date, time, timedelta
from collections import Counter
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

# Modelling
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Resampling (SMOTE is part of the imbalanced-learn package)
from imblearn.over_sampling import SMOTE

# Scikit-learn model modules
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Scikit-learn metrics
from sklearn.datasets import make_classification
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, f1_score, fbeta_score, accuracy_score
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import matthews_corrcoef

# Random seed used in this notebook:
RSEED = 42

### Loading test data

In [None]:
# Load the test data provided by Zindi
data_TEST = pd.read_csv("data/test.csv")

### Data cleaning

* Stripping non-integer characters from the ID columns and converting them to integers
* Separating TransactionStartTime into transactiontime and transactiondate
* Dropping redundant columns

In [None]:
# Function for stripping the text prefix from an ID value, e.g. "TransactionId_123" -> 123
def remove_letters(string):
    return int(string.split('_')[1])

# Applying the function above to all ID-related columns
id_columns = ["TransactionId", "BatchId", "AccountId", "SubscriptionId", "CustomerId", "ProviderId", "ProductId", "ChannelId"]
for i in id_columns:
    data_TEST[i] = data_TEST[i].apply(remove_letters)
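
The same stripping can also be expressed with the pandas string accessor instead of a row-wise `apply`. The following is only a minimal sketch on a made-up frame (the sample values are illustrative and not taken from the Zindi data), assuming every ID has the form `<Name>_<integer>`:

```python
import pandas as pd

# Hypothetical sample rows mimicking the ID format of the Zindi data
sample = pd.DataFrame({
    "TransactionId": ["TransactionId_50600", "TransactionId_95109", "TransactionId_47357"],
    "ProviderId": ["ProviderId_5", "ProviderId_5", "ProviderId_4"],
})

# Keep only the part after the underscore and cast it to integer
for col in ["TransactionId", "ProviderId"]:
    sample[col] = sample[col].str.split("_").str[1].astype(int)

print(sample)
```

If an ID does not follow this pattern, the integer cast fails, which makes malformed rows visible early.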

### Unknotting TransactionStartTime

In [None]:
# Functions for separating `TransactionStartTime` into date and time, respectively
def convert_to_date(date):
    # convert field into datetime format
    d = datetime.strptime(date,'%Y-%m-%dT%H:%M:%SZ')
    # extract date
    return d.date()

def convert_to_time(date):
    # convert field into datetime format
    d = datetime.strptime(date,'%Y-%m-%dT%H:%M:%SZ')
    # extract time
    return d.time()

Consolidate times into separate blocks:

1. 00:00 - 05:59 (night)
2. 06:00 - 09:59 (morning)
3. 10:00 - 13:59 (midday)
4. 14:00 - 17:59 (afternoon)
5. 18:00 - 23:59 (evening)

In [None]:
# Day time consolidation function
def consolidate_time(time):
    if time.hour < 6:
        return 'night'
    elif time.hour < 10:
        return 'morning'
    elif time.hour < 14:
        return 'midday'
    elif time.hour < 18:
        return 'afternoon'
    else:
        return 'evening'
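
As a cross-check of the block boundaries listed above, the same binning can be sketched in vectorized form with `pd.to_datetime` and `pd.cut`. The timestamps below are invented for illustration and chosen to sit near the block edges; this is an alternative sketch, not the function used in this notebook:

```python
import pandas as pd

# Hypothetical timestamps in the same format as TransactionStartTime
stamps = pd.Series([
    "2019-02-13T05:59:59Z",
    "2019-02-13T06:00:00Z",
    "2019-02-13T13:30:00Z",
    "2019-02-13T17:59:00Z",
    "2019-02-13T23:15:00Z",
])

# Parse all timestamps at once
parsed = pd.to_datetime(stamps, format="%Y-%m-%dT%H:%M:%SZ")

# Bin the hour (0-23) into the five blocks; right=False makes each bin [lower, upper)
day_time = pd.cut(
    parsed.dt.hour,
    bins=[0, 6, 10, 14, 18, 24],
    labels=["night", "morning", "midday", "afternoon", "evening"],
    right=False,
)

print(day_time.astype(str).tolist())
```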

In [None]:
# Create new columns `TransactionTime` and `TransactionDate` and derive the new column `DayTime` from `TransactionTime`
data_TEST['TransactionTime'] = data_TEST.TransactionStartTime.apply(lambda x: convert_to_time(x))
data_TEST['TransactionDate'] = data_TEST.TransactionStartTime.apply(lambda x: convert_to_date(x))
data_TEST['DayTime'] = data_TEST.TransactionTime.apply(lambda x: consolidate_time(x))

In [None]:
# Create new column by extracting weekdays from `TransactionDate`
data_TEST['TransactionWeekday'] = data_TEST.TransactionDate.apply(lambda x: x.isoweekday())

In [None]:
def convert_to_isoweek(date):
    return date.isocalendar()[1]

data_TEST['ISOWeek'] = data_TEST.TransactionDate.apply(lambda x: convert_to_isoweek(x))
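
To make the two calendar features concrete, here is a small sketch with two hand-picked dates (chosen for illustration only). It also shows that the ISO week number carries no year information: the last day of 2018 already belongs to ISO week 1.

```python
from datetime import date

# A Wednesday in the middle of February 2019
d = date(2019, 2, 13)
weekday = d.isoweekday()       # Monday = 1 ... Sunday = 7
iso_week = d.isocalendar()[1]  # second element of (ISO year, ISO week, ISO weekday)

# The last day of 2018 is a Monday that belongs to ISO week 1 of 2019
year_end = date(2018, 12, 31)
year_end_week = year_end.isocalendar()[1]

print(weekday, iso_week, year_end_week)
```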

### Further feature engineering

In [None]:
# Create new feature to distinguish between Debit (0) and Credit (1)
data_TEST['DebitCredit'] = data_TEST.Amount.apply(lambda x: 0 if x > 0 else 1)

In [None]:
# New column: number of transactions per batch
a = data_TEST.groupby('BatchId', as_index=False)['TransactionId'].count()
a.rename(columns= {'TransactionId': 'TransactionInBatch' }, inplace=True)
data_TEST = data_TEST.merge(a, on='BatchId')
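
The count-and-merge step can also be written as a single `groupby(...).transform("count")`, which broadcasts the batch size back to every row. This is a sketch on an invented five-row frame, not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical transactions spread over three batches
sample = pd.DataFrame({
    "TransactionId": [1, 2, 3, 4, 5],
    "BatchId": [10, 10, 20, 10, 30],
})

# Count the transactions of each batch and write the count to every row of that batch
sample["TransactionInBatch"] = sample.groupby("BatchId")["TransactionId"].transform("count")

print(sample)
```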

In [None]:
# New column: difference between Value and Amount
data_TEST['value_amount_diff'] = abs(data_TEST["Value"] - data_TEST["Amount"])

In [None]:
# Function to determine the number of previous transactions of the same account ID:
def transactions_toDate(df, transaction_id, account_id):
    """
    Numbers the transactions of one account in the order given.

    `df` is accepted for compatibility with the call below but is not used.
    Returns a DataFrame with the columns 't_id', 'a_id' and
    'TransactionsToDate' (0 for the first transaction of the account).
    """
    # create dictionary with empty lists
    TTD = {'t_id': [], 'a_id': [],
           'TransactionsToDate': []}
    count = 0
    # iterate through all transaction ids for one account id and assign counts
    for t in transaction_id:
        TTD['t_id'] += [t]
        TTD['a_id'] += [account_id]
        TTD['TransactionsToDate'] += [count]
        count += 1
    # return counts in data frame format
    return pd.DataFrame.from_dict(TTD)

In [None]:
temp = pd.DataFrame()
for i in data_TEST.AccountId.unique():
    df = data_TEST.query('AccountId == @i')
    # count separately for every subset of account ids
    TTD = transactions_toDate(df, df.TransactionId,i)
    # add counts vertically to temporary data frame
    temp = pd.concat([temp, TTD], axis=0)

In [None]:
data_TEST = data_TEST.merge(temp, left_on='TransactionId', right_on='t_id')
data_TEST.drop(['t_id', 'a_id'], inplace=True, axis=1)
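
For larger data, the per-account loop can become slow. A possible shortcut, sketched here on invented rows, is `groupby(...).cumcount()`, which numbers the rows of each group starting at 0. Like the loop above, it counts in the row order of the frame, so it assumes the rows are already in chronological order:

```python
import pandas as pd

# Hypothetical transactions of two accounts, in chronological row order
sample = pd.DataFrame({
    "TransactionId": [101, 102, 103, 104, 105],
    "AccountId": [7, 8, 7, 7, 8],
})

# Number of earlier transactions of the same account for every row
sample["TransactionsToDate"] = sample.groupby("AccountId").cumcount()

print(sample)
```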

In [None]:
# Drop columns that do not convey additional meaning
cols_to_drop = ['SubscriptionId','CurrencyCode', 'CountryCode', 'TransactionStartTime', 'BatchId','TransactionTime','Amount','TransactionDate']
data_TEST_clean = data_TEST.drop(columns=cols_to_drop, inplace=False)

### Transform Data

Because 'Value' is skewed, it is transformed into its natural logarithm and stored as 'ValueLog'. The original 'Value' column can therefore be dropped.

In [None]:
data_TEST_clean['ValueLog']=np.log(data_TEST_clean.Value)
data_TEST_clean.drop(columns='Value', inplace=True)
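
The effect of the transformation can be illustrated on a few made-up amounts with one very large value. The sketch also includes a guard, since `np.log` returns `-inf` for zero; whether zeros occur in the Zindi data is not checked here and is treated as an open assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical transaction values with one extreme amount
values = pd.Series([1000, 20, 500, 9_880_000])

# The logarithm is only finite for strictly positive values
if not (values > 0).all():
    raise ValueError("Non-positive values would produce -inf or NaN in the log transform")

value_log = np.log(values)

# Compare the sample skewness before and after the transformation
print(values.skew(), value_log.skew())
```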

### Preparing for model input

Now, the data can be used to feed the model. To simplify, data_TEST_clean is now used as df_TEST.

In [None]:
df_TEST = data_TEST_clean

Dummy variables have to be created to use categorical features:

In [None]:
df_TEST = pd.get_dummies(df_TEST, columns = ['ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'PricingStrategy', 'DayTime','TransactionWeekday','DebitCredit'], drop_first=True)

Loading the previously prepared train data as df:

In [None]:
df = pd.read_csv("data/data_train_clean_withdummies.csv")

Some columns that exist in df are not present in df_TEST and vice versa. First, these columns are identified by comparing the column names of both DataFrames.

In [None]:
df_columns = set(df.columns)
df_TEST_columns = set(df_TEST.columns)

In [None]:
# The following features are missing in df_TEST
missing_in_df_TEST = df_columns.difference(df_TEST_columns)

In [None]:
# The following features are missing in df
missing_in_df = df_TEST_columns.difference(df_columns)

Adding missing columns except for the target variable 'FraudResult':

In [None]:
for i in missing_in_df_TEST:
    if i == "FraudResult":
        continue
    df_TEST[i] = 0
    print(i)
for i in missing_in_df:
    print(i)
    df[i] = 0
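
A related way to bring a test frame into the shape of the train frame is `DataFrame.reindex`, sketched below on invented dummy columns. Note the difference to the loop above: `reindex` adds missing columns filled with 0 and fixes the column order, but it drops test columns that are unknown to the train data instead of adding them to the train data.

```python
import pandas as pd

# Hypothetical train and test frames with partly different dummy columns
train = pd.DataFrame({
    "ValueLog": [1.0, 2.0],
    "ProductId_2": [0, 1],
    "ProductId_3": [1, 0],
    "FraudResult": [0, 1],
})
test = pd.DataFrame({
    "ProductId_2": [1, 0],
    "ValueLog": [3.0, 4.0],
    "ProductId_7": [0, 1],
})

# All train columns except the target, in train order
feature_cols = [c for c in train.columns if c != "FraudResult"]

# Add missing columns as 0, drop unknown columns, and match the train column order
test_aligned = test.reindex(columns=feature_cols, fill_value=0)

print(test_aligned)
```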

### Model data input: Resampling and scaling

In [None]:
# Rename test data for clarification
X_TEST = df_TEST

In [None]:
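# Separate predictor variables
X_train = df.drop('FraudResult', axis =1)

# Separate target variable
y_train = df['FraudResult']

# Align the column order of the test data with the train data,
# since the scaler and the model expect the features in the order seen during fitting
X_TEST = X_TEST[X_train.columns]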

In [None]:
# Apply resampling ONLY to train data
X_train_res, y_train_res = SMOTE(random_state=RSEED).fit_resample(X_train, y_train)

In [None]:
# Scale train and test data
scaler = StandardScaler()

# Standardization of train set (fit_transform)
X_train_res_stand = scaler.fit_transform(X_train_res)

# Change array to dataframe
scaled_df_train_resampled = pd.DataFrame(X_train_res_stand, columns=X_train.columns)

# Standardization of test set (transform only, using the statistics of the train set)
X_TEST_stand = scaler.transform(X_TEST)

# Change array to dataframe
scaled_df_TEST = pd.DataFrame(X_TEST_stand, columns=X_TEST.columns)
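
The reason for calling `fit_transform` on the train set but only `transform` on the test set can be seen in a small sketch with invented numbers: the test data are shifted and scaled with the mean and standard deviation of the train data, so they do not end up centred at zero themselves.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test features
train = pd.DataFrame({"a": [0.0, 2.0, 4.0], "b": [10.0, 10.0, 40.0]})
test = pd.DataFrame({"a": [2.0, 6.0], "b": [20.0, 20.0]})

scaler = StandardScaler()

# Learn mean and standard deviation from the train data and scale it
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Reuse the train statistics for the test data
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)

print(train_scaled.mean())
print(test_scaled)
```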

### Model: Stacked model

In [None]:
# Sub-models:
estimators = [
    ('dt', DecisionTreeClassifier(random_state = RSEED)),
    # AdaBoost uses a decision tree with max_depth=1 as its default base learner, so the argument is omitted here
    # (it is called `base_estimator` before scikit-learn 1.2 and `estimator` from 1.2 on)
    ('ada', AdaBoostClassifier(n_estimators=200)),
    ('rf', RandomForestClassifier(n_estimators=1000,criterion = 'entropy',max_depth = None,random_state = RSEED,max_features = 'sqrt',n_jobs=-1, verbose = 1))
    #('rf', RandomForestClassifier(n_estimators=100,random_state = RSEED,max_features = 'sqrt',n_jobs=-1, verbose = 1))
    ]

# Meta-model
clf = StackingClassifier(estimators = estimators, final_estimator = LogisticRegression(),cv=10)

# Fit the training data
clf.fit(scaled_df_train_resampled, y_train_res)

In [None]:
# Predicted y values by this model
stack_y_pred = clf.predict(scaled_df_TEST)
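
To illustrate how the stacked model works in principle, the following sketch fits a much smaller stack on synthetic data from `make_classification`. The sub-models, their parameters and the data are chosen for speed and are assumptions for illustration, not the configuration used above. The `cv` argument controls the cross-validated predictions of the sub-models on which the final estimator is trained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not related to the Zindi data)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Two simple sub-models
toy_estimators = [
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The final estimator is trained on cross-validated predictions of the sub-models
toy_clf = StackingClassifier(estimators=toy_estimators, final_estimator=LogisticRegression(), cv=3)
toy_clf.fit(X_tr, y_tr)

toy_pred = toy_clf.predict(X_te)
print(toy_pred[:10])
```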

### Data upload to Zindi

The predictions can be evaluated by opening [this link](https://zindi.africa/competitions/xente-fraud-detection-challenge) and clicking on the "Get a score" button. The true target values are hidden, so only a final score is returned without any further details. For the upload, the data must be prepared as shown in the template "sample_submission.csv", which consists of a table with two columns: TransactionId and FraudResult. This DataFrame is prepared with the following code:

In [None]:
# Convert the predicted target values into a pandas DataFrame
y_pred = pd.DataFrame(stack_y_pred)
# Select the Transaction IDs (a pandas Series) from the unscaled test data
final_test_table = X_TEST["TransactionId"]
# Concatenate both column-wise, resulting in the required two-column DataFrame
final_test_table = pd.concat([final_test_table,y_pred],axis=1)
# Rename the prediction column
final_test_table.rename(columns={0:"FraudResult"}, inplace=True)
# Add the prefix "TransactionId_" to each ID to restore the original format
final_test_table["TransactionId"] = final_test_table["TransactionId"].apply(lambda x: "TransactionId_"+str(x))
# Save the data as csv
final_test_table.to_csv('data/model_output.csv',index=False)
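
The shape of the submission table can be checked on a few invented rows before uploading. The sketch below builds the two columns directly from a Series of integer IDs and an array of predictions (both hypothetical) and renders the CSV as text instead of writing a file:

```python
import numpy as np
import pandas as pd

# Hypothetical integer IDs and predicted labels
transaction_ids = pd.Series([50600, 95109, 47357], name="TransactionId")
predictions = np.array([0, 1, 0])

# One prediction per transaction is required
assert len(transaction_ids) == len(predictions)

# Restore the ID prefix and combine both columns
submission = pd.DataFrame({
    "TransactionId": "TransactionId_" + transaction_ids.astype(str),
    "FraudResult": predictions,
})

# Render as CSV text without the index column
csv_text = submission.to_csv(index=False)
print(csv_text)
```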