Module Imports

In [1]:
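import os
import sys
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

# Add the parent of the current working directory to the import path
# so that the local `scripts` package can be imported.
curr_path = str(Path(os.getcwd()).parent)
sys.path.append(curr_path)

from scripts.constants import *  # project constants (paths, column names)
from scripts.read import *  # project reading helpers

spark = create_spark()  # provided by the star imports above
PREFIX = "."  # prepended to the path constants when reading files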


# Buy Now, Pay Later (BNPL) Industry Project Summary

With many merchants interested in joining and only a limited onboarding quota, we needed a way to filter them and select only the best. Hence, the main focus of our research was:
> How can we select the best merchants based on their credibility and customer demographics?

---

## Approach

To start the project, we identified the key components of a transaction and of the nature of a business. These factors included:
* Customer loyalty to merchant
* Revenue of a merchant to measure performance
* Industry growth for the listed industries in our dataset
* Customer demographics of each merchant

Ultimately, we narrowed these down to features that relate to merchant performance (such as revenue) and to customer demographics (such as location by SA2 code, age, and area earnings) to analyse which merchants are desirable to onboard.

## Step 1: Building an Extract-Transform-Load (ETL) Framework
For data retrieval and management, we applied the following skeleton to keep the process streamlined and efficient. After the ETL run, we delete all redundant scripts to avoid data leakage. All steps are run via a shell script so that the source code does not need to be touched.

![ETL Flowchart](../plots/etl_flowchart.png)

## Step 2: Conducting Preliminary Analysis

After retrieval and transformation, we analysed all of our internal and external datasets to understand their nature and characteristics. Below are some visualisations of the features we ended up measuring, with the corresponding notebooks indicated for more in-depth analysis.

We made several assumptions along the way. First, we assumed that each industry tag is a distinct representation of its own industry. We also assumed that consumers are 15 years or older, as it is unlikely for consumers to be below this age.

In terms of project limitations, we faced challenges in precisely matching locations between SA2 codes and postcodes. Postcodes are not a standard ABS measure of locality and region in Australia, so we had to retrieve this information from public sources. We also had limited depth in our demographic parameters: the earnings of a consumer can only be identified by a median earnings figure per SA2 code. We acknowledge that exploring additional factors, such as the diversity and occupation of consumers, could provide a more comprehensive view. Lastly, we had minimal insight into fraud probability, which underlines the need for more extensive historical data on both merchants and consumers.
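The postcode-to-SA2 matching and the per-SA2 median earnings lookup described above can be sketched roughly as follows. The column names, the tiny lookup tables, the code values, and the rule of keeping the first SA2 listed for each postcode are all illustrative assumptions, not the project's actual data or logic.

```python
import pandas as pd

# Hypothetical consumer records keyed by postcode (illustrative values only).
consumers = pd.DataFrame({
    "consumer_id": [1, 2, 3],
    "postcode": ["3000", "3052", "9999"],
})

# Hypothetical postcode -> SA2 correspondence; one postcode can span several SA2s.
postcode_sa2 = pd.DataFrame({
    "postcode": ["3000", "3000", "3052"],
    "sa2_code": ["206041122", "206041123", "206041124"],
})

# Hypothetical median earnings per SA2.
sa2_earnings = pd.DataFrame({
    "sa2_code": ["206041122", "206041123", "206041124"],
    "median_earnings": [52000, 61000, 48000],
})

# Assumption: keep the first SA2 listed for each postcode so that every
# consumer maps to at most one SA2.
one_sa2_per_postcode = postcode_sa2.drop_duplicates(subset="postcode", keep="first")

# Left joins keep every consumer; unmatched postcodes surface as missing values.
enriched = (
    consumers
    .merge(one_sa2_per_postcode, on="postcode", how="left")
    .merge(sa2_earnings, on="sa2_code", how="left")
)

unmatched = enriched["sa2_code"].isna().sum()
print(enriched)
print(f"Consumers without an SA2 match: {unmatched}")
```

Counting the unmatched rows after the join is one way to quantify how imprecise the postcode matching is before relying on the earnings figures.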


#### Merchants Data

For the following visualisations, more findings can be found in the `notebooks` folder, in the `merchants_geo_visualization` file.

Revenue level distribution for the lowest revenue band and the corresponding industries with highest counts:


![Industry Distribution for the Lowest Revenue Band](../plots/industry_dist_revenue_a.png)


Revenue level distribution for the highest revenue band and the corresponding industries with highest counts:

![Industry Distribution for the Highest Revenue Band](../plots/industry_dist_revenue_e.png)

#### External + Consumer Data

For the following visualisations, more findings can be found in the `notebooks` folder, in the `consumers_geo_visualisation` file.

![Consumer Count by State](../plots/consumer_count_by_state.png)

![Customer Loyalty by SA2](../plots/customer_loyalty_by_sa2.png)

![Dollars Spent by SA2](../plots/spent_dollar_by_sa2.png)

![Average Dollar Value by State](../plots/average_dollar_value_by_state.png)

## Step 2.5: Building Models

For features such as merchant fraud probability and the dollar value of transactions, we wanted to predict future values to get an idea of how our merchants are likely to perform and what to anticipate from them.

#### Dollar Value Modelling

For dollar value modelling, we used a linear regression model to predict transaction values because of its interpretability and our time constraints. Features such as industry, order date, and the SA2 code of each transaction were used. Unfortunately, on our large dataset, the model coefficients suggested a non-linear relationship for every feature except the SA2 code of a transaction. Accordingly, the residual plot showed a clear trend, which suggests that a linear model is unsuitable for predicting these values. More details can be found in the `notebooks/base_linear_model.ipynb` file.

![Residual Plot for Linear Regression](../plots/updated_residuals.png)
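To illustrate why a trend in the residuals points to an unsuitable linear model, here is a minimal sketch on synthetic data. It fits a straight line to a deliberately quadratic relationship and compares the mean residual across the range of fitted values. The data, seed, and three-way split are assumptions made for illustration and are unrelated to the project's transactions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a deliberately non-linear (quadratic) relationship.
x = np.linspace(0, 10, 200)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# Ordinary least squares fit of a straight line y ~ a + b * x.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# For a well-specified model the mean residual is close to zero in every
# region of the fitted values. Here the outer thirds sit above zero and the
# middle third below, which is the kind of trend a residual plot reveals.
order = np.argsort(fitted)
low, mid, high = np.array_split(residuals[order], 3)
print(
    "mean residual (low / mid / high fitted): "
    f"{low.mean():.2f} / {mid.mean():.2f} / {high.mean():.2f}"
)
```

When the regional means differ in sign like this, adding non-linear terms or switching to a non-linear model is usually worth considering.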


#### Merchant Fraud Probability Modelling

To model merchant fraud probability, we used a random forest regressor. A combination of merchant details, such as take rate, revenue level, and order datetimes, was used as features. Transactions were filtered based on the distribution of fraud probability: those with a fraud probability exceeding 50% were removed.
 
![Merchant Fraud Probability Distribution](merchant_fraud_dist.png)
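The 50% cut-off described above can be sketched as a simple filter. The column names, the sample rows, the percentage scale (0 to 100), and the choice to keep transactions with no recorded fraud probability are all assumptions for illustration.

```python
import pandas as pd

# Assumption: fraud probabilities are stored as percentages (0 to 100).
FRAUD_THRESHOLD = 50.0

# Hypothetical transactions (illustrative values only).
transactions = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "merchant_abn": [111, 111, 222, 333],
    "fraud_probability": [12.5, 50.0, 63.1, None],
})

# Assumption: a missing fraud probability is treated as 0, so the row is kept.
probability = transactions["fraud_probability"].fillna(0.0)

# Remove only transactions whose probability exceeds the threshold.
kept = transactions[probability <= FRAUD_THRESHOLD]
removed = transactions[probability > FRAUD_THRESHOLD]

print(f"kept {len(kept)} transactions, removed {len(removed)}")
```

Keeping the removed rows in their own frame makes it easy to check how much data the threshold discards.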

## Step 3: Ranking Metrics

We derived the following metrics to rank our merchants by desirability*. We also assumed that all factors used in the ranking metrics are equally important, so they all carry equal weight.

*More information regarding these metrics can be found in the `notebooks` folder, in files such as `revenue_takerate`, `selecting_industry`, and `selecting_merchants`.

First, industries were filtered on the ranking metrics to retain the best-performing ones. The following ranking metrics were used:
* Average and median transaction value per industry tag
* Average BNPL earnings per merchant
* Number of transactions per merchant
* BNPL’s take rate per merchant transaction
* Average and median fraud probability

Then, the three industries that performed best on the above metrics were selected. The following visualisation is an example of one of the metrics used.

![Take Rate Distribution for the Top Three Industries](../plots/takerate_dist_top_three_industry.png)

Lastly, the top 100 merchants were ranked on the same metrics, with the calculations tailored to the merchant level.
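An equal-weight ranking of this kind can be sketched as follows. The source only states that all metrics carry equal weight; the min-max scaling, the inversion of fraud probability so that lower values score higher, the metric names, and the sample values are assumptions made for this illustration, not the calculations used in the notebooks.

```python
import pandas as pd

# Hypothetical per-merchant metrics (illustrative values only).
merchants = pd.DataFrame({
    "merchant": ["A", "B", "C"],
    "avg_transaction_value": [120.0, 80.0, 100.0],
    "n_transactions": [1500, 2000, 1000],
    "take_rate": [6.0, 2.0, 4.0],
    "avg_fraud_probability": [30.0, 10.0, 20.0],
}).set_index("merchant")

# Assumption: fraud probability is the only metric where lower is better.
LOWER_IS_BETTER = {"avg_fraud_probability"}


def min_max(col: pd.Series) -> pd.Series:
    """Rescale a metric to [0, 1]; a constant column maps to 0."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span else col * 0.0


scaled = merchants.apply(min_max)
for name in LOWER_IS_BETTER:
    scaled[name] = 1.0 - scaled[name]

# Equal weights: the score is the plain mean of the scaled metrics.
scaled["score"] = scaled.mean(axis=1)
ranking = scaled.sort_values("score", ascending=False)
print(ranking["score"])
```

Scaling first keeps a metric with large raw values, such as the transaction count, from dominating the equally weighted mean.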

## Output: Ranking of Top 100 merchants

Running the following chunk of code will produce a table of all 100 merchants ranked based on our ranking metrics. 

More details regarding the ranking of these merchants can be found in the `selecting_merchants` notebook in the notebooks folder.

In [4]:
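# Load the ranked merchants and the industry-tag lookup.
top100_merchants = pd.read_csv(PREFIX + TOP_100_MERCHANTS_PATH)
industry_tags = pd.read_csv(PREFIX + INDUSTRY_MAPPED)
# Attach industry details to each merchant and drop the leftover CSV index column.
top100_merchants = top100_merchants.merge(
    industry_tags, on=[MERCHANT_ABN, MAPPED_INDUSTRY], how=LEFT_JOIN
).drop("Unnamed: 0", axis=1)
top100_merchants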

| | merchant_abn | name_x | Score | MappedIndustry | name_y | revenue_level | take_rate | industry_tags |
|---|---|---|---|---|---|---|---|---|
| 0 | 64203420245 | Pede Nonummy Corp. | 28558.399000 | X | Pede Nonummy Corp. | c | 2.86 | tent awning |
| 1 | 49891706470 | Non Vestibulum Industries | 27117.817000 | X | Non Vestibulum Industries | a | 5.80 | tent awning |
| 2 | 45629217853 | Lacus Consulting | 25027.771429 | P | Lacus Consulting | a | 6.98 | gift card novelty souvenir |
| 3 | 89726005175 | Est Nunc Consulting | 23631.994000 | X | Est Nunc Consulting | a | 6.01 | tent awning |
| 4 | 80324045558 | Ipsum Dolor Sit Corporation | 21589.514286 | P | Ipsum Dolor Sit Corporation | c | 1.47 | gift card novelty souvenir |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 66567150019 | Cursus In Foundation | 1311.619143 | F | Cursus In Foundation | b | 3.52 | digital goods books movies music |
| 96 | 16570599421 | Non Magna Nam PC | 1305.449857 | F | Non Magna Nam PC | a | 6.74 | digital goods books movies music |
| 97 | 81333423763 | Molestie Dapibus Ligula Company | 1303.218571 | X | Molestie Dapibus Ligula Company | c | 2.21 | tent awning |
| 98 | 40507108040 | Lacinia At Institute | 1299.049000 | X | Lacinia At Institute | a | 6.24 | tent awning |
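
The `name_x` and `name_y` columns in the output above are what pandas produces when both sides of a merge share a column that is not a join key: it appends the default `_x` and `_y` suffixes. One way to avoid the duplicate is sketched below with made-up frames; the sample values and the choice to drop the overlapping columns from the right-hand frame are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames that both carry a `name` column (illustrative values only).
ranked = pd.DataFrame({
    "merchant_abn": [1, 2],
    "name": ["Alpha", "Beta"],
    "MappedIndustry": ["X", "P"],
    "Score": [9.5, 7.1],
})
tags = pd.DataFrame({
    "merchant_abn": [1, 2],
    "name": ["Alpha", "Beta"],
    "MappedIndustry": ["X", "P"],
    "take_rate": [2.86, 6.98],
})

keys = ["merchant_abn", "MappedIndustry"]

# Both frames carry `name`, so a plain merge yields name_x and name_y.
with_duplicates = ranked.merge(tags, on=keys, how="left")

# Dropping the overlapping non-key columns from the right-hand frame avoids that.
overlap = [c for c in tags.columns if c in ranked.columns and c not in keys]
deduplicated = ranked.merge(tags.drop(columns=overlap), on=keys, how="left")

print(list(with_duplicates.columns))
print(list(deduplicated.columns))
```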
