Notebook purpose:

- Check whether we can calculate balances

Conclusion:

- We cannot reconstruct balances.

In [1]:
import sys

import pandas as pd

sys.path.append("/Users/fgu/dev/projects/entropy")
import entropy.helpers.data as hd

Load small data sample

In [2]:
df = hd.read_raw_sample("777")
df.shape

Time for read_raw_sample               : 5.04 seconds


(682656, 27)

Construct table containing dates of first and last txn for each account as well as the date of the last account refresh.

In [16]:
table = df.groupby("Account Reference").agg(
    first_txn=("Transaction Date", "min"),
    last_txn=("Transaction Date", "max"),
    last_refresh_date=("Account Last Refreshed", "first"),
)
table.head(5)

Account Reference,first_txn,last_txn,last_refresh_date
671,2014-06-09,2014-08-18,2014-08-22 11:25:00
672,2014-04-23,2014-07-21,2014-08-21 09:50:00
674,2014-04-22,2014-08-21,2014-08-22 11:25:00
4149,2013-02-28,2014-09-17,2015-01-03 06:52:00
4150,2013-05-19,2014-08-27,2014-09-04 15:46:00


For nearly all accounts, the last refresh date is later than the date of the last transaction.

In [5]:
(table.last_refresh_date > table.last_txn).value_counts()

True     1273
False       9
dtype: int64

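The difference in days is often substantial: the median gap is 10 days and the mean is about 80 days, although for at least 20% of accounts it is one day or less.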

In [15]:
(table.last_refresh_date - table.last_txn).dt.days.describe(
    percentiles=[0.025, 0.05, 0.1, 0.2]
)

count    1282.000000
mean       80.038222
std       203.442030
min       -96.000000
2.5%        0.000000
5%          0.000000
10%         1.000000
20%         1.000000
50%        10.000000
max      1944.000000
dtype: float64

Given that, according to the data dictionary, `Latest available balance` refers to the balance at the `Account Last Refreshed` date, this means **we cannot reconstruct balances**.

Reconstructing balances would require cumulatively summing daily transaction totals and shifting that sequence by an offset, calculated as the difference between the latest available balance and the cumulative sum on the account's last refresh date. This does not work here: the cumulative sum on the last refresh date would be incorrect, because we are missing the transactions leading up to that date.
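
The offset approach described above can be sketched as follows. This is a minimal illustration on made-up data, not code from this project: the function name `reconstruct_balances`, the column names `date` and `amount`, and the sign convention (positive amounts increase the balance) are all assumptions. It only yields correct balances when every transaction up to the refresh date is observed, which is the condition that fails in our data.

```python
import pandas as pd


def reconstruct_balances(txns, latest_balance, refresh_date):
    """Return end-of-day balances implied by txns and one known balance.

    Assumes (hypothetically) that `txns` has columns `date` and `amount`,
    that positive amounts increase the balance, and that no transactions
    up to `refresh_date` are missing.
    """
    # Total transaction amount per day, then the running total over time.
    daily = txns.groupby("date")["amount"].sum().sort_index()
    cumulative = daily.cumsum()

    # Running total on the last observed day on or before the refresh date.
    observed = cumulative.loc[:refresh_date]
    if observed.empty:
        raise ValueError("no transactions on or before the refresh date")

    # Offset that aligns the running total with the known balance.
    offset = latest_balance - observed.iloc[-1]
    return cumulative + offset


# Toy data, for illustration only.
txns = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2014-08-01", "2014-08-01", "2014-08-03", "2014-08-05"]
        ),
        "amount": [100.0, -30.0, -20.0, 50.0],
    }
)
balances = reconstruct_balances(
    txns, latest_balance=600.0, refresh_date=pd.Timestamp("2014-08-05")
)
print(balances)
```

If transactions between the last observed transaction and the refresh date were missing, `observed.iloc[-1]` would not be the true running total on the refresh date, and every reconstructed balance would be off by the sum of the missing amounts.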

## Checks

During users' observation periods, the gaps between consecutive transaction dates are typically much shorter than the gaps between the last transaction date and the last refresh date found above.

In [18]:
txn_gaps = (
    df.groupby(["User Reference", "Transaction Date"])["Transaction Date"]
    .first()
    .groupby("User Reference")
    .diff()
)
txn_gaps.head(5)

User Reference  Transaction Date
777             2012-01-03            NaT
                2012-01-04         1 days
                2012-01-06         2 days
                2012-01-09         3 days
                2012-01-11         2 days
Name: Transaction Date, dtype: timedelta64[ns]

In [19]:
txn_gaps.describe()

count                       148357
mean     1 days 16:24:06.241161522
std      5 days 22:21:33.744561507
min                1 days 00:00:00
25%                1 days 00:00:00
50%                1 days 00:00:00
75%                2 days 00:00:00
max              766 days 00:00:00
Name: Transaction Date, dtype: object
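
To make the comparison between the two kinds of gaps explicit, their quantiles can be placed side by side. The sketch below uses made-up numbers rather than the sample above, and the names `refresh_gaps` and `txn_gap_days` are hypothetical; in this notebook they would roughly correspond to `(table.last_refresh_date - table.last_txn).dt.days` and `txn_gaps.dt.days`.

```python
import pandas as pd

# Toy values in days, for illustration only (not taken from the data).
refresh_gaps = pd.Series([0, 1, 4, 10, 35, 400])
txn_gap_days = pd.Series([1, 1, 1, 1, 2, 3, 7])

# Quantiles of both distributions, aligned on the quantile level.
quantiles = [0.25, 0.5, 0.75]
comparison = pd.DataFrame(
    {
        "refresh_gap_days": refresh_gaps.quantile(quantiles),
        "txn_gap_days": txn_gap_days.quantile(quantiles),
    }
)
print(comparison)
```

Reading the two columns row by row shows at a glance whether refresh gaps exceed typical transaction gaps across the distribution, rather than only at the median.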