# Fund Flow Analysis 
Predicting the clearing date of SAP invoices using advanced feature engineering in **SAP S/4HANA Accounts Receivable processing**

We present a custom feature engineering and machine learning prediction scenario that runs on a dataset extracted from the S/4HANA CDS view on accounting line items.
 
With the integrated version of SAP Databricks in SAP Business Data Cloud, it may well be possible to use an existing data product and share it with the Databricks environment.

Using this notebook, you will learn the basics of data processing on Databricks, along with an explanation of fundamental Databricks features such as:

-  Delta Lake
-  Working on a Delta Lake table
-  Lakehouse Architecture
-  Difference between a pandas DataFrame on a VM/local machine and a PySpark DataFrame on a Databricks cluster
-  PySpark DataFrame API for data modelling and feature engineering
-  Databricks Unity Catalog
-  ML Model training and batch inference



# **DATASET**
-  The dataset can be **extracted from an S/4HANA system** using the **SAP CDS view I_OperationalAcctgDocItem**, which provides the "Clearing Date" for each accounting line item.
- For the customer POC, the customer extracted the dataset using a **custom SAP report**.
- In future this may well come from a **standard SAP data product**, with dataset sharing enabled directly within the SAP Business Data Cloud framework, so that the Databricks notebook opens with the shared data product built in.

Existing Dataset columns : 

       'Branch', 'Branch Name', 'Customer Code', 'Customer Name Sold T',
       'Document Type', 'Document Number', 'Document Date', 'Due Date',
       'Document Amount', 'Outstanding Amount', 'Business', 'Payment Terms',
       'Transaction Cheque', 'Cover Cheque', 'Overdue', 'TDS', 'Division',
       'Customer Reference', 'Group', 'SBU', 'Sub Sbu Desc',
       'Business Description', 'Customer Profile', 'Payment Method',
       'POD Available', 'Text', 'ECB', 'Bill To Party Code',
       'Bill To Party Name', 'Ship to Party Code', 'Ship to Party Name',
       'End Customer Details', 'OEM/NOC', 'Note Attachment', 'Clearing Date',
       'Plant', 'Plant Description', 'Invoice Ageing(Days)'

# **Architecture and Technical Flow**

##  **SAP BDC**

<br />

<img src="/Workspace/Shared/Screenshot 2025-05-06 at 11.50.03 AM.png" alt="Alt Text" width="1000" height="1200">

<br />
<br />

##  **Databricks**

<br />

<img src="/Workspace/Shared/Screenshot 2025-04-29 at 10.53.24 AM.png" alt="Alt Text" width="1000" height="1200">


- As we are working on a standalone Databricks instance here, we rely on a Datasphere replication flow.

  The **replication flow** is triggered from **SAP Datasphere**. It extracts data from S/4HANA and persists it in **Azure Data Lake Storage (ADLS)** as a **Parquet file**.

- A **Delta table**: the Databricks Lakehouse architecture works on top of the data lake and enhances it with reliability, performance, and governance by implementing a transaction layer on top of the open Parquet data format.

  Here we create a Delta table on top of the data replicated from SAP (Parquet format).

<img src="/Workspace/Shared/Screenshot 2025-04-29 at 11.12.20 AM.png" alt="Alt Text" width="500" height="600">
<br />
<br />
<img src="/Workspace/Shared/Screenshot 2025-04-29 at 11.12.35 AM.png"  width="1000" height="1200">

## **Delta Lake Tables**

The shared data from SAP BDC will be stored in the SAP BDC data lake as a Delta Lake table. This data will then be shared with Databricks using the Delta Sharing protocol, ensuring zero copy of data.

<img src="/Workspace/Shared/Screenshot 2025-05-06 at 11.15.46 AM.png"  width="1000" height="1200">
<br />
<br />

## **Delta Sharing** 

<img src="/Workspace/Shared/Screenshot 2025-04-29 at 11.23.12 AM.png"  width="500" height="600">
<br />
<br />
<img src="/Workspace/Shared/Screenshot 2025-04-29 at 11.26.47 AM.png"  width="1000" height="1200">
<br />
<br />
<img src="/Workspace/Shared/Screenshot 2025-05-06 at 11.39.00 AM.png"  width="1000" height="1200">

 
#  **Data Engineering** 

#  **What is the difference between a PySpark DataFrame on a Databricks compute cluster and a pandas DataFrame?**

# **Pandas DataFrame** running on a VM/laptop/Mac/Windows:

- **Eager execution**: when you load data from a CSV file, all of it is loaded into memory, so you can only load as much data as fits in the memory available on the machine.
- Most pandas computations run on a single core
- No query optimizer

# **PYSPARK Dataframe on Databricks with Attached Compute cluster**

- **Lazy execution**: transformations (filtering, adding a column, type conversion, etc.) are only recorded when they are declared and are executed only when an action (for example `count()`, `show()`, `collect()`, or a write) is called. A PySpark DataFrame is also immutable: once created it cannot be changed, and each requested change produces a new DataFrame internally. This differs from pandas DataFrames, and the reason is parallelism: immutability is what lets PySpark use the full power of parallel distributed computing.
  Further reading: https://docs.databricks.com/aws/en/pyspark/
- PySpark (including the pandas API on Spark) scales to multiple machines in a cluster and can process big data. Even on a single machine it can use all cores.
- Spark queries can run on datasets that are larger than memory.
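The eager versus lazy distinction above can be illustrated with a small analogy in plain Python. This is not Spark code: a list comprehension stands in for eager, pandas-style execution, and chained generators stand in for lazy transformations that run only when an action consumes them. The rows and the `to_amount` helper are hypothetical and exist only for this sketch.

```python
# Analogy only: plain Python, not Spark.
calls = []  # records every row the hypothetical transformation touches


def to_amount(row):
    """Hypothetical type-conversion step: parse the amount string into a float."""
    calls.append(row["doc"])
    return {**row, "amount": float(row["amount"])}


rows = [
    {"doc": "9001", "amount": "1200.50"},
    {"doc": "9002", "amount": "80.00"},
    {"doc": "9003", "amount": "4300.00"},
]

# Eager: every row is converted immediately and the full result is held in memory.
eager = [to_amount(r) for r in rows]
eager_calls = len(calls)

# Lazy: declaring the pipeline does no work at all.
calls.clear()
converted = (to_amount(r) for r in rows)              # "transformation"
large = (r for r in converted if r["amount"] > 1000)  # "transformation"
calls_before_action = len(calls)

# The "action" pulls rows through the pipeline; only the first match is needed,
# so only the rows required to produce it are processed.
first_large = next(large)
calls_after_action = len(calls)
```

Spark goes further than this analogy, since deferring the work also lets its query optimizer reorder and combine steps before anything runs, but the core idea is the same: nothing is computed until a result is requested.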

We calculate several features under the following assumptions:

-   **Overdue**: An invoice is marked as historically overdue if Clearing Date > Due Date.
-   **Days_to_Pay_mean**: Historical mean days to pay for the customer code, based on the 1 year of shared data (we group by customer code).
-   **Payment_cv**: Coefficient of variation per customer, which captures the relative variability in their payment behaviour based on the 1 year of data.
-   **Actual_Overdue_Sum**: Sum of the historical overdue column per customer code (also based on the 1 year of shared data).
-   **Late_payment_ratio**: Mean of the overdue column grouped by customer code.
-   **Payment_std_deviation**: Standard deviation of historical days to pay grouped by customer code.
-   **Time_Since_Invoice_date**: This feature assumes a current date, which is the date on which the prediction is run. For the attached files the current date is assumed to be 28 April 2025. During training it was taken as the maximum available date in the dataset.
-   **Z_score**: A higher absolute z-score indicates that the payment behaviour is more unusual (either much earlier or much later than the due date) for that particular customer (to be researched more).
-   **Timeliness_score**: A custom rating for each customer out of 40. A higher rating indicates better on-time historical payment behaviour (to be researched more, and again based on the 1 year of data).
-   **Invoice_amount_ratio**: Ratio of the invoice amount to the customer's average invoice amount.
-   **Reliability_score**: A custom score that is the sum of the timeliness score, consistency score, and amount score, intended to rate payment reliability (to be researched more).
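A few of the features above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's PySpark implementation: the sample invoices and field names are hypothetical, days to pay is assumed to be Clearing Date minus Document Date, the overdue column is assumed to be a 0/1 flag, the sample standard deviation is used, and the z-score is assumed to be the deviation of an invoice's days to pay from the customer mean.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

# Hypothetical sample invoices; field names are illustrative only.
invoices = [
    {"customer": "C100", "doc_date": date(2025, 1, 1),
     "due_date": date(2025, 1, 31), "clearing_date": date(2025, 1, 21)},
    {"customer": "C100", "doc_date": date(2025, 2, 1),
     "due_date": date(2025, 3, 3), "clearing_date": date(2025, 3, 13)},
    {"customer": "C100", "doc_date": date(2025, 3, 1),
     "due_date": date(2025, 3, 31), "clearing_date": date(2025, 3, 31)},
    {"customer": "C200", "doc_date": date(2025, 1, 10),
     "due_date": date(2025, 2, 9), "clearing_date": date(2025, 2, 19)},
    {"customer": "C200", "doc_date": date(2025, 2, 10),
     "due_date": date(2025, 3, 12), "clearing_date": date(2025, 4, 1)},
]

CURRENT_DATE = date(2025, 4, 28)  # prediction run date assumed in the text

# Invoice-level features.
for inv in invoices:
    # Assumption: days to pay = Clearing Date - Document Date.
    inv["days_to_pay"] = (inv["clearing_date"] - inv["doc_date"]).days
    # Overdue flag: 1 if Clearing Date > Due Date, else 0.
    inv["overdue"] = int(inv["clearing_date"] > inv["due_date"])
    inv["time_since_invoice_date"] = (CURRENT_DATE - inv["doc_date"]).days

# Group invoices by customer code.
by_customer = defaultdict(list)
for inv in invoices:
    by_customer[inv["customer"]].append(inv)

# Customer-level aggregates.
customer_features = {}
for customer, rows in by_customer.items():
    days = [r["days_to_pay"] for r in rows]
    overdue = [r["overdue"] for r in rows]
    days_mean = mean(days)
    # Sample standard deviation needs at least two invoices.
    days_std = stdev(days) if len(days) > 1 else 0.0
    customer_features[customer] = {
        "Days_to_Pay_mean": days_mean,
        "Payment_std_deviation": days_std,
        "Payment_cv": days_std / days_mean if days_mean else 0.0,
        "Actual_Overdue_Sum": sum(overdue),
        "Late_payment_ratio": mean(overdue),
    }

# Assumed z-score: deviation of this invoice from the customer's mean days to pay.
for inv in invoices:
    feats = customer_features[inv["customer"]]
    std = feats["Payment_std_deviation"]
    inv["z_score"] = (inv["days_to_pay"] - feats["Days_to_Pay_mean"]) / std if std else 0.0
```

On a Databricks cluster the same grouping would more naturally be expressed as a group-by on the customer code followed by a join back to the invoice rows. The custom scores (timeliness, reliability) are left out because their formulas are not given here.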



## Dataset from S/4HANA system