## Description
The objective of this competition is to create a machine learning model to detect fraudulent transactions.

Fraud detection is an important application of machine learning in the financial services sector. This solution will help Xente provide improved and safer service to its customers.

This competition is sponsored by Xente, Innovation Village, and insight2impact.

## Data
Xente is an e-commerce and financial service app serving 10,000+ customers in Uganda.

This dataset includes a sample of approximately 140,000 transactions that occurred between 15 November 2018 and 15 March 2019.

One of the challenges of fraud detection problems is that the data is highly imbalanced. 

* `Xente_variable_definitions.csv`: Definitions of the features recorded per transaction.
* `Training.csv`: Transactions from 15 November 2018 to 13 February 2019, including whether or not each transaction is fraudulent. You will use this file to train your model.
* `Test.csv`: Transactions from 13 February 2019 to 14 March 2019, not including whether or not each transaction is fraudulent. You will test your model on this file.
* `sample_submission.csv`: An example of what your submission file should look like. The order of the rows does not matter, but the TransactionId values must be correct. FraudResult is 1 for a fraudulent transaction and 0 for a non-fraudulent one.

## Evaluation
The error metric for this competition is the F1 score, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer your score is to 1, the better your model.

F1 Score: A performance score that combines precision and recall. It is the harmonic mean of the two: 2*Precision*Recall/(Precision + Recall)

Precision: The fraction of items identified as positive that are actually positive: TP/(TP+FP)

Recall / Sensitivity / True Positive Rate (TPR): The fraction of actual positives that are correctly identified as positive: TP/(TP+FN)

Where:

TP=True Positive
FP=False Positive
TN=True Negative
FN=False Negative
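
To make the formulas above concrete, here is a minimal sketch that computes precision, recall and F1 from raw counts. The function name `f1_from_counts` and the example counts are made up for illustration, and returning 0 when a denominator is zero is a convention chosen here, not something the competition text specifies.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) computed from confusion-matrix counts."""
    # Convention assumed here: an undefined ratio (zero denominator) counts as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 frauds caught, 2 false alarms, 8 frauds missed.
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=8)

# A model that never predicts fraud has no true positives, so its F1 is 0
# even though its accuracy on highly imbalanced data would look high.
never_fraud = f1_from_counts(tp=0, fp=0, fn=10)
```

TN does not appear in any of the three formulas, which is why F1 is not inflated by the large number of correctly classified non-fraud transactions.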

Info from Leaderboard: score to beat: 0.89

In [None]:
import pandas as pd
from datetime import datetime, date, time, timedelta
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Load the variable names
variable_meanings = pd.read_csv("data/variable meanings.csv")
# show long definitions without truncating the column text
pd.set_option('display.max_colwidth', 800)
variable_meanings

## Results from Pandas Profile
1. There is only one `CurrencyCode`, so this feature provides no additional information ==> drop `CurrencyCode`
2. There is only one `CountryCode`, so this feature provides no additional information ==> drop `CountryCode`
3. All `TransactionId` values are distinct, so they provide no additional information ==> drop `TransactionId`
4. `TransactionStartTime` consists of timestamps ==> group them into time frames for further analysis (use `datetime`)
5. `Amount` contains positive and negative values (due to debit/credit) ==> create a debit/credit column and transform `Amount` to absolute values
6. Extremely imbalanced target value ==> oversampling? ==> read further information / use the links on Zindi
7. The definitions for the columns `CustomerId` and `AccountId` seem to be mixed up

8. Transform variable `Amount` (log) due to skewness ==> the log is not defined for negative values ==> we won't use this variable in the model
9. Transform variable `Value` or `ValueUSD` (log) due to skewness ==> we choose `ValueUSD` (smaller values)
(10. Transform variable `ProviderId` (log) due to skewness)
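
Findings 1 to 3 can also be checked directly with pandas, without the profiling report. The following is a minimal sketch on a made-up toy frame (the values are illustrative and are not taken from the competition data); the same two filters could be applied to the real training frame.

```python
import pandas as pd

# Toy frame with made-up values, mimicking the shape of the findings above.
toy = pd.DataFrame({
    "TransactionId": [1, 2, 3, 4],
    "CurrencyCode": ["UGX"] * 4,
    "CountryCode": [256] * 4,
    "Amount": [1000.0, -20.0, 500.0, 1000.0],
})

n_unique = toy.nunique()

# Columns with a single value carry no information for a model.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns where every row is different behave like identifiers.
all_distinct_cols = n_unique[n_unique == len(toy)].index.tolist()
```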


In [None]:
# Load the data
data_test = pd.read_csv("data/test.csv")
data_train = pd.read_csv("data/training.csv")

In [None]:
data_train.head()

---

## Data cleaning

* Stripped non-integer characters from the ID columns and converted them to integers
* Separated `TransactionStartTime` into `TransactionTime` and `TransactionDate`
* Dropped redundant columns

In [None]:
def remove_letters(string):
    # keep only the integer that follows the underscore in the ID string
    return int(string.split('_')[1])

id_columns = ["TransactionId", "BatchId", "AccountId", "SubscriptionId",
              "CustomerId", "ProviderId", "ProductId", "ChannelId"]
for col in id_columns:
    data_train[col] = data_train[col].apply(remove_letters)

In [None]:
data_train.head()

In [None]:
# separate `TransactionStartTime` into time and date
def convert_to_date(timestamp):
    # parse the timestamp string into a datetime object
    d = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')
    # extract date
    return d.date()

def convert_to_time(timestamp):
    # parse the timestamp string into a datetime object
    d = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')
    # extract time
    return d.time()

# create new columns with separate information for `TransactionTime` and `TransactionDate`
data_train['TransactionTime'] = data_train.TransactionStartTime.apply(convert_to_time)
data_train['TransactionDate'] = data_train.TransactionStartTime.apply(convert_to_date)
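
As an alternative to parsing row by row, the same split could be done on the whole column at once with `pd.to_datetime` and the `.dt` accessor. This is only a sketch on two made-up timestamp strings that follow the format used above; it is not what the notebook runs.

```python
import pandas as pd

# Two made-up timestamps in the same format as TransactionStartTime.
raw = pd.Series(["2018-11-15T02:18:49Z", "2019-02-13T10:01:28Z"])

# Parse the whole column in one call.
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%SZ")

dates = parsed.dt.date             # datetime.date objects
times = parsed.dt.time             # datetime.time objects
hours = parsed.dt.hour             # integers 0-23
weekdays = parsed.dt.dayofweek + 1  # Monday=1 ... Sunday=7, like isoweekday()
```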

Consolidate times into separate blocks:

1. 00:00 - 05:59 (night)
2. 06:00 - 09:59 (morning)
3. 10:00 - 13:59 (midday)
4. 14:00 - 17:59 (afternoon)
5. 18:00 - 23:59 (evening)
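
The five blocks above can also be expressed as bin edges for `pd.cut`. The sketch below assumes the hour has already been extracted as an integer between 0 and 23; the variable names and sample hours are made up for illustration.

```python
import pandas as pd

# Sample hours covering both ends of every block.
hours = pd.Series([0, 5, 6, 9, 10, 13, 14, 17, 18, 23])

# Left-closed bins: [0, 6), [6, 10), [10, 14), [14, 18), [18, 24)
day_time = pd.cut(
    hours,
    bins=[0, 6, 10, 14, 18, 24],
    right=False,
    labels=["night", "morning", "midday", "afternoon", "evening"],
)

blocks = day_time.astype(str).tolist()
```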

In [None]:
# verify that time scale is 0-23
data_train.TransactionTime.apply(lambda x: x.hour).value_counts()

In [None]:
# apply day time consolidation
def consolidate_time(t):
    # map the hour of a datetime.time object to one of the five blocks
    if t.hour < 6:
        return 'night'
    elif t.hour < 10:
        return 'morning'
    elif t.hour < 14:
        return 'midday'
    elif t.hour < 18:
        return 'afternoon'
    else:
        return 'evening'

data_train['DayTime'] = data_train.TransactionTime.apply(consolidate_time)

In [None]:
# extract weekdays from `TransactionDate`
data_train['TransactionWeekday'] = data_train.TransactionDate.apply(lambda x: x.isoweekday())

In [None]:
# create new feature to distinguish between Debit (0) and Credit (1)
data_train['DebitCredit'] = data_train.Amount.apply(lambda x: 0 if x > 0 else 1)

In [None]:
# add column for value in USD (1 UGX = 0.00028 USD, rate as of 12 January 2022, 12:10 UTC)
data_train['ValueUSD'] = data_train.Value * 0.00028

In [None]:
def convert_to_datetime(timestamp):
    # parse the timestamp string into a full datetime (date and time)
    return datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')

data_train['DT'] = data_train.TransactionStartTime.apply(convert_to_datetime)

In [None]:
# count, for each given transaction, the earlier transactions of the same account
def transactions_toDate(df, transaction_id, account_id):
    """
    Return a dataframe with one row per (transaction_id, account_id) pair,
    holding the number of transactions of that account that occurred before
    the given transaction, together with the transaction's datetime.

    Assumes df is sorted by DT in ascending order: the scan stops at the
    first row that is not earlier than the target transaction.
    """
    TTD = {'t_id': [], 'a_id': [], 'count': [], 'date': []}
    for t, a in zip(transaction_id, account_id):
        count = 0
        # datetime of the transaction we are counting up to
        target_date = df.loc[df.TransactionId == t, 'DT'].iloc[0]
        for idx, row in df.iterrows():
            if row.DT >= target_date:
                break
            if row.AccountId == a:
                count += 1
        TTD['t_id'].append(t)
        TTD['a_id'].append(a)
        TTD['count'].append(count)
        TTD['date'].append(target_date)
    return pd.DataFrame.from_dict(TTD)
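
Looping over every row for every fraud case is slow on a frame of this size. A possible alternative is sketched below: it counts the earlier transactions of the same account with a boolean mask, so it does not rely on the rows being sorted by time. The function name `count_prior_transactions` and the toy data are made up for illustration, and the sketch assumes that `TransactionId` is unique.

```python
import pandas as pd


def count_prior_transactions(df, transaction_ids):
    """For each TransactionId, count strictly earlier transactions of its account."""
    indexed = df.set_index("TransactionId")  # assumes unique TransactionId
    rows = []
    for t in transaction_ids:
        target_date = indexed.at[t, "DT"]
        account = indexed.at[t, "AccountId"]
        # same account and strictly earlier than the target transaction
        mask = (df["AccountId"] == account) & (df["DT"] < target_date)
        rows.append({"t_id": t, "a_id": account,
                     "count": int(mask.sum()), "date": target_date})
    return pd.DataFrame(rows, columns=["t_id", "a_id", "count", "date"])


# Toy data with made-up values.
toy = pd.DataFrame({
    "TransactionId": [1, 2, 3, 4, 5],
    "AccountId": [10, 10, 20, 10, 20],
    "DT": pd.to_datetime([
        "2018-11-15 08:00", "2018-11-15 09:00", "2018-11-16 10:00",
        "2018-11-17 11:00", "2018-11-18 12:00",
    ]),
})

result = count_prior_transactions(toy, [4, 5])
```

Transaction 4 belongs to account 10, which has two earlier transactions (1 and 2); transaction 5 belongs to account 20, which has one (3).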

In [None]:
# keep only the fraudulent transactions with their IDs and timestamps
temp = (
    data_train[['TransactionId', 'AccountId', 'CustomerId', 'FraudResult', 'TransactionDate', 'TransactionTime']]
    .query('FraudResult == 1')
)

In [None]:
TTD = transactions_toDate(data_train, temp.TransactionId, temp.AccountId)

In [None]:
data_train

In [None]:
# drop columns that do not convey additional meaning
# ('TransactionId' is kept for now and removed later, before modelling)
cols_to_drop = ['CurrencyCode', 'CountryCode', 'TransactionStartTime']
data_train_clean = data_train.drop(columns=cols_to_drop, inplace=False)
data_train_clean.to_csv('data/data_train_clean.csv')

In [None]:
data_train_clean.head()

### Data Visualization

In [None]:
# number of transactions per DebitCredit class
debit_credit_counts = data_train_clean.groupby('DebitCredit').count()[['TransactionId']].reset_index()
px.pie(debit_credit_counts, values='TransactionId', names='DebitCredit',
       title='Percentage of Debit and Credit')

In [None]:
g = sns.stripplot(data=data_train_clean, x="PricingStrategy", y="Value",hue="DebitCredit")

In [None]:
g = sns.stripplot(data=data_train_clean, x="ProductCategory", y="Value",hue="DebitCredit")

In [None]:
g = sns.stripplot(data=data_train_clean, x="ProviderId", y="Value",hue="DebitCredit")

### Transform Data


9. Transform variable `Value` or `ValueUSD` (log) due to skewness
(10. Transform variable `ProviderId` (log) due to skewness)

In [None]:
data_train_clean['ValueUSDLog']=np.log(data_train_clean.ValueUSD)
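
The effect of the log transform on skewness can be illustrated on a small made-up series. This is only a sketch and the numbers are not from the competition data. Note that `np.log` returns `-inf` for a value of 0, so if the column could contain zeros, `np.log1p` might be a safer choice (an assumption to check against the data).

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed values spanning several orders of magnitude.
values = pd.Series([1.0, 10.0, 100.0, 1000.0, 10000.0])
log_values = np.log(values)

raw_skew = values.skew()      # large positive skew
log_skew = log_values.skew()  # close to zero: the log values are evenly spaced
```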

### Dataframe for model

In [None]:
data_train_clean.columns

In [None]:
redundant = ['Value', 'Amount', 'TransactionId','DT','TransactionTime','TransactionDate','ValueUSD']
df = data_train_clean.drop(columns=redundant)
df.to_csv('data/data_temp.csv')
df.head()

In [None]:
# features: ProviderId, ProductId, ProductCategory, ChannelId, PricingStrategy, DayTime, TransactionWeekday, DebitCredit