Notebook purpose

- Validate balances

In [1]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sys.path.append('/Users/fgu/dev/projects/entropy')
import entropy.helpers.aws as aws
import entropy.data.cleaners as cl
import entropy.data.creators as cr

sns.set_style('whitegrid')
pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)
pd.set_option('max_colwidth', None)
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [2]:
SAMPLE = 'X77'
fp = f'~/tmp/entropy_{SAMPLE}.parquet'

df = aws.read_parquet(fp)
print('Rows: {:,.0f}\nUsers: {}'.format(df.shape[0], df.user_id.nunique()))
df.head(1)

Rows: 1,275,582
Users: 431


Unnamed: 0,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,desc_old,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings
0,688261,2012-01-03,777,400.0,mdbremoved,,transfers,transfers,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-11-13,True,<mdbremoved> - s/o,364.220001,non merchant mbl,transfers,other account,other account,u,201201,-1542.99646,24319.220881,False


## How do balance overdrafts correlate with od fees?


Approach:
1. Identify od fee txns
2. For each user-month, create dummies indicating whether the user was in overdraft and whether they paid od fees
3. Explore relationship between above dummies


### Identify overdraft fee txns

Approach: inspect txn descriptions for relevant tags and note down od-related ones, then inspect common tags for these od-descriptions, then check for additional od-related txns in these tags. Based on that, design regex pattern and gauge precision and recall. 

Hypothesised pattern based on data inspection (omits the word boundary after the second group because there are txn descriptions where *interest* is part of a larger word, like *interestto*).

In [3]:
pattern = r'\b(?:od|o d|overdraft|o draft)\b.*(?:fee|usage|interest)'
mask = df.desc.str.contains(pattern) & df.debit
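A quick sanity check of how the pattern behaves, as a minimal sketch. The descriptions in `toy` are hypothetical and only modelled on ones seen in the data; they are not drawn from `df`.

```python
import pandas as pd

# Same pattern as in the cell above.
pattern = r'\b(?:od|o d|overdraft|o draft)\b.*(?:fee|usage|interest)'

# Hypothetical descriptions (assumption: modelled on the data, not taken from it).
toy = pd.Series([
    'daily od fee 31 12',        # od term followed by 'fee'
    'unplanned o d fees',        # 'fee' matches inside 'fees'
    'o draft interestto 01jan',  # 'interest' inside a larger word
    'food store fee',            # 'od' occurs only inside 'food'
    'overdraft limit change',    # od term without fee/usage/interest
])

matches = toy.str.contains(pattern)
print(pd.DataFrame({'desc': toy, 'match': matches}))
```

The first three descriptions match and the last two don't: the word boundaries around the first group exclude *food*, while the missing boundary after the second group lets *fees* and *interestto* through.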

Matches only txns with relevant tags

In [4]:
od_fees = df[mask]
tag_counts = od_fees.tag_auto.value_counts()
tag_counts[:5]

bank charges        16681
interest charges      777
banking charges       373
interest income       173
accessories             0
Name: tag_auto, dtype: int64

Has high precision (i.e. very few, possibly no, false positives)

In [5]:
od_fees.desc.sample(n=10)

1250978    arranged od usage 03sep a c 6747
1266370                        daily od fee
1263913                  daily od fee 31 12
1161760                  daily od fee 23 06
402473                        o d usage fee
385288                     daily od fee chg
480445                     o draft interest
37225                    unplanned o d fees
1264473                  daily od fee 01 04
1267430                  daily od fee 06 07
Name: desc, dtype: object

Seems to have high recall (i.e. finds most od-related txns). Below shows that using a less restrictive pattern finds no more true positives but a few false positives (being less restrictive still and omitting the word boundaries around the first group finds thousands of irrelevant txns, like `desc`s containing *food*, and only a handful of additional bank charges txns).

In [6]:
pattern_alt = r'\b(?:od|o d|overdraft|o draft)\b'
mask_alt = df.desc.str.contains(pattern_alt) & df.debit
df[mask_alt].tag_auto.value_counts()[:9]

bank charges                16994
interest charges              777
banking charges               373
interest income               173
enjoyment                       2
cash                            1
tv / movies package             1
entertainment, tv, media        1
paypal account                  0
Name: tag_auto, dtype: int64

In [7]:
mask_alt = df.desc.str.contains(pattern) & df.debit & df.tag_auto.str.contains('enjoyment|cash|tv')
df[mask_alt].desc

Series([], Name: desc, dtype: object)
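To illustrate why the word boundaries around the first group matter, a small sketch comparing `pattern_alt` with a boundary-free variant. `pattern_loose` and the descriptions in `toy` are hypothetical and for illustration only.

```python
import pandas as pd

pattern_alt = r'\b(?:od|o d|overdraft|o draft)\b'
# Hypothetical variant without word boundaries (assumption, not used elsewhere).
pattern_loose = r'(?:od|o d|overdraft|o draft)'

# Hypothetical descriptions (assumption: not taken from the data).
toy = pd.Series([
    'daily od fee',      # genuine od txn
    'food hall london',  # 'od' inside 'food'
    'tesco direct',      # 'o d' spans the end of one word and start of the next
])

with_boundary = toy.str.contains(pattern_alt)
without_boundary = toy.str.contains(pattern_loose)
print(pd.DataFrame({'desc': toy,
                    'with_boundary': with_boundary,
                    'without_boundary': without_boundary}))
```

Only the first description matches with boundaries, while all three match without them.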

### Create dummies for in overdraft and paid od fee

For each user-month, create a dummy indicating whether the user's current account was in overdraft and another dummy indicating whether they paid overdraft fees.

In [33]:
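# keep only current account txns
ca = df[df.account_type.eq('current')].copy()

# flag od fee txns (mask is aligned to ca on the index)
ca['od_fee_txn'] = mask
g = ca.groupby(['user_id', 'ym'])
# True for all txns in a user-month with at least one od fee txn
ca['paid_od_fees'] = g.od_fee_txn.transform('max')
# True for all txns in a user-month in which the balance fell below zero
ca['in_overdraft'] = g.balance.transform(lambda x: x.min() < 0)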
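The same logic on a toy frame, to make the user-month broadcasting explicit. The data in `toy` is hypothetical, and `transform('min').lt(0)` is used here as a vectorised alternative to the lambda above, which should give the same result.

```python
import pandas as pd

# Hypothetical txns (assumption: same column names as in ca).
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 2, 2],
    'ym':         [201201, 201201, 201202, 201202, 201201, 201201],
    'balance':    [50.0, -20.0, 10.0, 30.0, 5.0, 15.0],
    'od_fee_txn': [False, True, False, False, False, True],
})

g = toy.groupby(['user_id', 'ym'])
# Broadcast the user-month maximum back to every txn in that user-month.
toy['paid_od_fees'] = g.od_fee_txn.transform('max')
# User-month minimum balance below zero means the account was overdrawn.
toy['in_overdraft'] = g.balance.transform('min').lt(0)
print(toy)
```

User 1 is in overdraft and pays a fee in 201201 but not in 201202; user 2 pays a fee without ever being in overdraft.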

### Explore relationship between being in od and paying od fees

In [68]:
(ca.groupby(['user_id', 'ym'])
 [['paid_od_fees', 'in_overdraft']].first()
 .groupby(['paid_od_fees', 'in_overdraft']).size()
 .sort_values(ascending=False))

paid_od_fees  in_overdraft
False         False           5823
              True            3972
True          True            2141
              False           1319
dtype: int64
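The counts above can be turned into conditional shares, which are easier to read. A minimal sketch that re-enters the four counts by hand; the names `cells`, `p_fee_given_od`, and `p_od_given_fee` are introduced here for illustration.

```python
import pandas as pd

# User-month counts copied from the output above.
cells = pd.DataFrame({
    'paid_od_fees': [False, False, True, True],
    'in_overdraft': [False, True, True, False],
    'n':            [5823, 3972, 2141, 1319],
})

both = cells.loc[cells.paid_od_fees & cells.in_overdraft, 'n'].sum()
# Share of overdraft user-months in which od fees were paid.
p_fee_given_od = both / cells.loc[cells.in_overdraft, 'n'].sum()
# Share of fee-paying user-months in which the account was in overdraft.
p_od_given_fee = both / cells.loc[cells.paid_od_fees, 'n'].sum()
print(f'P(fee | od) = {p_fee_given_od:.3f}')
print(f'P(od | fee) = {p_od_given_fee:.3f}')
```

This gives about 0.350 and 0.619: fees are paid in roughly a third of overdraft months, and roughly 4 in 10 fee-paying months show no overdraft in the same month.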

Todo
- Repeat the above, but check whether od fees were paid in the month of the overdraft or the month after
- Check distribution of lag between end/beginning of od and next od fee
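One possible way to approach the first item, sketched on a toy user-month table. The data is hypothetical, and the sketch assumes each user has one row per consecutive month, so that shifting by one row means shifting by one month; that would need checking on the real data.

```python
import pandas as pd

# Hypothetical user-month table (assumption: no missing months per user).
um = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2, 2],
    'ym':           [201201, 201202, 201203, 201201, 201202, 201203],
    'in_overdraft': [True, False, False, False, False, True],
    'paid_od_fees': [False, True, False, False, False, False],
}).sort_values(['user_id', 'ym'])

# Overdraft status in the previous month, within user.
prev_od = (um.groupby('user_id').in_overdraft
           .shift(1, fill_value=False).astype(bool))
um['od_now_or_prev'] = um.in_overdraft | prev_od

# Fee-paying user-months with no overdraft in the same or previous month.
unexplained = um[um.paid_od_fees & ~um.od_now_or_prev]
print(um)
print('Unexplained fee months:', len(unexplained))
```

In the toy data, user 1's fee in 201202 is attributed to the overdraft in 201201, so no fee month remains unexplained.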