Notebook purpose:

- Validate balances

Approach:

1. Identify od fee txns
2. For each user-month, create dummies indicating whether the user was in overdraft and whether they paid od fees
3. Explore relationship between above dummies

In [1]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sys.path.append('/Users/fgu/dev/projects/entropy')
import entropy.helpers.aws as aws
import entropy.data.cleaners as cl
import entropy.data.creators as cr

sns.set_style('whitegrid')
pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)
pd.set_option('max_colwidth', None)
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [2]:
SAMPLE = 'X77'
fp = f'~/tmp/entropy_{SAMPLE}.parquet'

df = aws.read_parquet(fp)
print('Rows: {:,.0f}\nUsers: {}'.format(df.shape[0], df.user_id.nunique()))
df.head(1)

Rows: 1,275,582
Users: 431


,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,desc_old,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings
0,688261,2012-01-03,777,400.0,mdbremoved,,transfers,transfers,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-11-13,True,<mdbremoved> - s/o,364.220001,non merchant mbl,transfers,other account,other account,u,201201,-1542.99646,24319.220881,False


## Identify overdraft fee txns

Approach: inspect txn descriptions for relevant tags and note down od-related ones, then inspect common tags for these od-descriptions, then check for additional od-related txns in these tags. Based on that, design regex pattern and gauge precision and recall. 

Hypothesised pattern based on data inspection (omits the word boundary after the second group because in some txn descriptions *interest* is part of a larger word, like *interestto*).

In [3]:
pattern = r'\b(?:od|o d|overdraft|o draft)\b.*(?:fee|usage|interest)'
mask = df.desc.str.contains(pattern) & df.debit
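
As a quick sanity check of how the word boundaries behave, the pattern can be run against a handful of descriptions. A minimal sketch using the standard-library `re` module (which performs the same kind of regex search as `str.contains`); the first five descriptions appear in outputs in this notebook, the last one is made up for illustration.

```python
import re

pattern = r'\b(?:od|o d|overdraft|o draft)\b.*(?:fee|usage|interest)'

descs = [
    'daily od fee',
    'overdraft interestto 13jul 2019',
    'unarrnged od usage 22oct a c 25 charge',
    'amzn mktp uk od 0us luxembourg on 12 jan bcc',
    'paypal c o d e x cd 7038 deb',
    'food delivery fee',  # hypothetical description, not from the data
]

# True if the pattern is found anywhere in the description.
matches = {d: bool(re.search(pattern, d)) for d in descs}
for d, m in matches.items():
    print(f'{m!s:<6}{d}')
```

The first three descriptions match. The next two contain a standalone *od* or *o d* but no fee-related word, and *food* fails the leading word boundary.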

Matches only txns with relevant tags

In [4]:
od_fees = df[mask]
tag_counts = od_fees.tag_auto.value_counts()
tag_counts[:5]

bank charges        16681
interest charges      777
banking charges       373
interest income       173
accessories             0
Name: tag_auto, dtype: int64

Seems to have high precision (i.e. very few, possibly no, false positives in a random sample)

In [5]:
od_fees.desc.sample(n=10)

1157221           overdraft interestto 13jul 2019
1268831                              daily od fee
1064751                              daily od fee
1216424                              daily od fee
941464                           daily od fee chg
1265819                              daily od fee
490809                               daily od fee
36483      unarrnged od usage 22oct a c 25 charge
1061924                              daily od fee
930555                           daily od fee chg
Name: desc, dtype: object

Seems to have high recall (i.e. finds most od-related txns). The cells below show that a less restrictive pattern finds no more true positives but a few false positives. (Being less restrictive still and omitting the word boundary around the first group finds thousands of irrelevant txns, like `desc`s containing *food*, and only a handful of additional bank charges txns.)

In [6]:
pattern_alt = r'\b(?:od|o d|overdraft|o draft)\b'
mask_alt = df.desc.str.contains(pattern_alt) & df.debit
df[mask_alt].tag_auto.value_counts()[:9]

bank charges                16994
interest charges              777
banking charges               373
interest income               173
enjoyment                       2
cash                            1
tv / movies package             1
entertainment, tv, media        1
paypal account                  0
Name: tag_auto, dtype: int64

In [8]:
mask_alt = df.desc.str.contains(pattern_alt) & df.debit & df.tag_auto.str.contains('enjoyment|cash|tv')
df[mask_alt].desc

453077                                                     paypal c o d e x cd 7038 deb
995919                        visa cash withdrawal trg od oruzja eur 30 00000 at 1 1274
1018334                                    amzn mktp uk od 0us luxembourg on 12 jan bcc
1067066    card payment to kindle svcs od 3553gv 5 3 99 gbp rate 1 00 gbp on 20 01 2020
1123131                               3395 24jan 20 prime video od 8k67ny 5 353 7661 lu
Name: desc, dtype: object
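
The three levels of restrictiveness discussed above can be compared side by side. A minimal sketch using the standard-library `re` module: `pattern_no_boundary` is an assumed form of the variant without word boundaries (it is not defined in this notebook), and the *food* description is made up for illustration.

```python
import re

pattern = r'\b(?:od|o d|overdraft|o draft)\b.*(?:fee|usage|interest)'
pattern_alt = r'\b(?:od|o d|overdraft|o draft)\b'
# Assumed form of the least restrictive variant (no word boundaries).
pattern_no_boundary = r'(?:od|o d|overdraft|o draft)'

descs = [
    'daily od fee chg',
    'paypal c o d e x cd 7038 deb',
    '3395 24jan 20 prime video od 8k67ny 5 353 7661 lu',
    'tesco food store',  # hypothetical description, not from the data
]

def matched(p):
    """Return the set of descriptions in which pattern p is found."""
    return {d for d in descs if re.search(p, d)}

# Descriptions picked up only when the pattern is loosened.
only_alt = matched(pattern_alt) - matched(pattern)
only_no_boundary = matched(pattern_no_boundary) - matched(pattern_alt)
print(sorted(only_alt))
print(sorted(only_no_boundary))
```

Each loosening step adds only irrelevant descriptions in this toy sample, in line with the counts above.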

## Create dummies for in overdraft and paid od fee

For each user-month, create a dummy indicating whether the user's current account was in overdraft and another dummy indicating whether they paid overdraft fees.

In [64]:
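def make_overdraft_data(df):
    """Return user-month dummies for being in overdraft and paying od fees."""
    # Row and column selectors for .loc: current accounts, required columns only.
    mask = (df.account_type.eq('current'), ['user_id', 'date', 'balance'])

    # Flag debits whose description matches the od fee pattern.
    pattern = r'\b(?:od|o d|overdraft|o draft)\b.*(?:fee|usage|interest)'
    is_od_fee_txn = df.desc.str.contains(pattern) & df.debit

    month = pd.Grouper(key='date', freq='M')
    fee_paid = lambda s: s.max() == 1  # at least one od fee txn in the month
    in_od = lambda s: s.min() < 0      # minimum balance across the month's txns is negative

    return (df.loc[mask]
            .assign(od_fee_txn=is_od_fee_txn)  # aligned on index with the selected rows
            .groupby(['user_id', month])
            .agg(od_fees_paid=('od_fee_txn', fee_paid),
                 in_od=('balance', in_od))
            # Add rows for months without txns; these end up as False.
            .groupby('user_id')
            .resample('M', level='date').first()
            .fillna(0).astype(bool))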
      
od = make_overdraft_data(df)
od.head()

user_id,date,od_fees_paid,in_od
777,2012-01-31,False,True
777,2012-02-29,False,True
777,2012-03-31,False,False
777,2012-04-30,False,True
777,2012-05-31,False,True
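
The aggregation logic can be illustrated without pandas on a toy example. A user-month counts as in overdraft if the minimum balance across its txns is negative, and as fee-paying if at least one of its txns is an od fee debit. A minimal sketch with made-up txns; unlike `make_overdraft_data`, it does not add rows for months without txns.

```python
from collections import defaultdict
from datetime import date

# Made-up txns: (user_id, date, balance, is_od_fee_txn)
txns = [
    (1, date(2020, 1, 3), 120.0, False),
    (1, date(2020, 1, 20), -35.5, False),
    (1, date(2020, 2, 1), -10.0, True),
    (1, date(2020, 2, 15), 200.0, False),
    (2, date(2020, 1, 9), 50.0, False),
]

# Collect balances and fee flags per (user_id, year, month).
groups = defaultdict(lambda: {'balances': [], 'fees': []})
for user_id, d, balance, is_fee in txns:
    key = (user_id, d.year, d.month)
    groups[key]['balances'].append(balance)
    groups[key]['fees'].append(is_fee)

# One pair of dummies per user-month.
od = {
    key: {'od_fees_paid': any(g['fees']), 'in_od': min(g['balances']) < 0}
    for key, g in groups.items()
}
for key in sorted(od):
    print(key, od[key])
```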


## Explore relationship between being in od and paying od fees

### Relationship between od fees paid this month and in od this month

In [86]:
m = (od.groupby(['od_fees_paid', 'in_od']).size()
 .sort_values(ascending=False)
 .reset_index()
 .pivot(index='in_od', columns='od_fees_paid')
 .droplevel(level=0, axis=1))

correct = np.diag(m).sum()
share_correct = correct / m.sum().sum()
print(f'Total (and share) correct: {correct} ({share_correct:.1%})')
m

Total (and share) correct: 8086 (60.4%)


in_od,od_fees_paid=False,od_fees_paid=True
False,5945,1319
True,3972,2141

### Relationship between od fees paid this month and in od this or previous month

In [87]:
od['in_od2'] = np.maximum(od.in_od, od.in_od.shift()) == 1
od[18:21]

user_id,date,od_fees_paid,in_od,in_od2
777,2013-07-31,False,True,True
777,2013-08-31,False,False,True
777,2013-09-30,False,False,False
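
Note that `shift()` on the stacked frame moves values by row position, so the first month of each user picks up the last month of the previous user. How much this affects the counts is not checked here. A per-user version (in pandas, presumably something like `od.groupby(level='user_id').in_od.shift()`) can be sketched in plain Python with made-up data:

```python
# Made-up monthly in_od flags per user, in chronological order.
in_od = {
    1: [False, True, True],
    2: [False, False, True],
}

def in_od_this_or_prev(flags):
    """True if in overdraft in the current or the previous month."""
    prev = [False] + flags[:-1]  # shift by one position, padding with False
    return [cur or p for cur, p in zip(flags, prev)]

# Per-user version: the shift never crosses from one user to the next.
in_od2 = {user: in_od_this_or_prev(flags) for user, flags in in_od.items()}

# Naive version: shift across the concatenated rows of all users.
stacked = [f for user in sorted(in_od) for f in in_od[user]]
naive = in_od_this_or_prev(stacked)

print(in_od2)
print(naive)
```

In the naive version, the first month of user 2 (position 3 in the stacked list) is flagged because user 1 ended in overdraft.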


In [88]:
m = (od.groupby(['od_fees_paid', 'in_od2']).size()
 .sort_values(ascending=False)
 .reset_index()
 .pivot(index='in_od2', columns='od_fees_paid')
 .droplevel(level=0, axis=1))

correct = np.diag(m).sum()
share_correct = correct / m.sum().sum()
print(f'Total (and share) correct: {correct} ({share_correct:.1%})')
m

Total (and share) correct: 7628 (57.0%)


in_od2,od_fees_paid=False,od_fees_paid=True
False,5404,1236
True,4513,2224


### Distribution of lag between end/beginning of od and next od fee