Notebook purpose

- Document problems in MDB raw data

In [2]:
import os
import sys
import numpy as np
import pandas as pd
sys.path.append('/Users/fgu/dev/projects/entropy')
import entropy.helpers.aws as aws
import entropy.data.cleaners as cl

pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)
pd.set_option('max_colwidth', None)
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [2]:
m = aws.S3BucketManager('3di-project-entropy')
m.list()

['3di-project-entropy/entropy_000.parquet',
 '3di-project-entropy/entropy_777.parquet',
 '3di-project-entropy/entropy_X77.parquet']

## User precedence tag inconsistency

The user precedence tag should equal the manual tag if the manual tag is not missing, and the auto purpose tag otherwise. There are many cases where this does not hold.
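A minimal sketch of the intended rule (precedence tag = manual tag if present, otherwise auto tag), run on made-up rows. It assumes, as in the raw file shown below, that missing tags are stored as the string `'No Tag'`; the `expected` and `consistent` column names are illustrative and not part of the dataset.

```python
import pandas as pd

NO_TAG = 'No Tag'  # assumption: placeholder the raw data uses for missing tags

# Made-up rows for illustration only.
tags = pd.DataFrame({
    'User Precedence Tag Name': ['Cash', 'No Tag', 'Water', 'No Tag'],
    'Manual Tag Name': ['Cash', 'No Tag', 'No Tag', 'No Tag'],
    'Auto Purpose Tag Name': ['Bills', 'Cash', 'No Tag', 'No Tag'],
})

manual = tags['Manual Tag Name']
auto = tags['Auto Purpose Tag Name']

# Expected precedence tag: the manual tag if present, otherwise the auto tag.
tags['expected'] = manual.where(manual != NO_TAG, auto)
tags['consistent'] = tags['User Precedence Tag Name'] == tags['expected']
print(tags)
```

Rows 1 and 2 of the toy frame correspond to the two cases documented below: an empty precedence tag despite a non-empty auto tag, and a non-empty precedence tag while both other tags are empty.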

### Case 1: incorrectly empty user precedence tag

In [3]:
df = aws.s3read_parquet('s3://3di-data-mdb/raw/mdb_777.parquet')
df.head(1)

Unnamed: 0,Transaction Reference,User Reference,User Registration Date,Year of Birth,Salary Range,Postcode,LSOA,MSOA,Derived Gender,Transaction Date,Account Reference,Provider Group Name,Account Type,Latest Recorded Balance,Transaction Description,Credit Debit,Amount,User Precedence Tag Name,Manual Tag Name,Auto Purpose Tag Name,Merchant Name,Merchant Business Line,Account Created Date,Account Last Refreshed,Data Warehouse Date Created,Data Warehouse Date Last Updated,Transaction Updated Flag
0,688293,777,2011-07-20,1969.0,20K to 30K,WA1 4,E01012553,E02002603,M,2012-01-25,262916,NatWest Bank,Current,364.220001,"9572 24jan12 , tcs bowdon , bowdon gb - pos",Debit,25.030001,No Tag,No Tag,No Tag,No Merchant,Unknown Merchant,2011-07-20,2020-07-21 20:32:00,2014-07-18,2017-10-24,U


In [73]:
tag_names = ['User Precedence Tag Name', 'Manual Tag Name', 'Auto Purpose Tag Name']
tags = df[tag_names]

mask = ((tags['User Precedence Tag Name'] == 'No Tag')
        & ((tags['Auto Purpose Tag Name'] != 'No Tag') 
           | (tags['Manual Tag Name'] != 'No Tag')))
errors = tags[mask]
errors.head(3)

Unnamed: 0,User Precedence Tag Name,Manual Tag Name,Auto Purpose Tag Name
33,No Tag,No Tag,Cash
36,No Tag,No Tag,Interest charges
37,No Tag,No Tag,Lunch or Snacks


In [74]:
print(f'Tags are incorrect in {len(errors) / len(df):.1%} of observations.')

Tags are incorrect in 8.9% of observations.


### Case 2: incorrectly empty manual and auto purpose tag

In [76]:
mask = ((tags['User Precedence Tag Name'] != 'No Tag')
        & (tags['Auto Purpose Tag Name'] == 'No Tag') 
        & (tags['Manual Tag Name'] == 'No Tag'))
errors = tags[mask]
errors.head(2)

Unnamed: 0,User Precedence Tag Name,Manual Tag Name,Auto Purpose Tag Name
507,Financial - other,No Tag,No Tag
590,Water,No Tag,No Tag


In [77]:
print(f'Tags are incorrect in {len(errors) / len(df):.1%} of observations.')

Tags are incorrect in 0.4% of observations.


### Correction

In [6]:
def correct_tag_up(df):
    """Fill missing tag_up with tag_manual if available, else with tag_auto.

    By definition, tag_up should equal tag_manual if tag_manual is not
    missing and tag_auto otherwise. The data violate this in two ways:
    sometimes tag_up is missing while one of the other two tags isn't,
    and sometimes tag_up is not missing while both other tags are. In the
    latter case, we leave tag_up unchanged.
    """
    # Value tag_up should take according to its definition.
    correct_up_value = df.tag_manual.fillna(df.tag_auto)
    # Only fill missing tag_up values; existing values are kept as they are.
    df['tag_up'] = df.tag_up.fillna(correct_up_value)
    return df
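A small usage sketch for the correction, on made-up rows. It assumes that the cleaning step has already renamed the tag columns to `tag_up`, `tag_manual` and `tag_auto` and converted `'No Tag'` to `NaN`, which is what the function above relies on. The function is repeated so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def correct_tag_up(df):
    """Fill missing tag_up with tag_manual if available, else with tag_auto."""
    correct_up_value = df.tag_manual.fillna(df.tag_auto)
    df['tag_up'] = df.tag_up.fillna(correct_up_value)
    return df


# Made-up rows: one per situation the function has to handle.
toy = pd.DataFrame({
    'tag_up': [np.nan, np.nan, 'water', np.nan],
    'tag_manual': [np.nan, 'rent', np.nan, np.nan],
    'tag_auto': ['cash', 'bills', np.nan, np.nan],
})

# Pass a copy because the function modifies its argument in place.
fixed = correct_tag_up(toy.copy())
print(fixed)
```

In this toy frame, the first row is filled from `tag_auto`, the second from `tag_manual`, the third keeps its existing `tag_up` (Case 2), and the last stays missing because no tag is available.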

## Duplicate transactions

In [3]:
df = aws.s3read_parquet('s3://3di-project-entropy/entropy_X77.parquet')

### Case studies

Most transactions below seem to be duplicated. How is this possible? How can we check/correct for it?

In [107]:
df.loc[df.user_id == 35177].set_index('date').loc['1 Jan 2020'][:20]

date,id,user_id,amount,desc,merchant,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,ym,balance,income
2020-01-01,672667040,35177,866.0,<mdbremoved>,,,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2017-05-25,724235,2020-08-14 20:59:00,hsbc,current,2020-01-03,1900-01-01,True,844.299988,non merchant mbl,,,,202001,-721.219604,37692.274554
2020-01-01,681780726,35177,3.2,tfl travel ch tfl.gov.uk/cp gb - tfl travel ch,tfl,public transport,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2016-04-24,558493,2020-03-11 23:17:00,hsbc,credit card,2020-01-12,1900-01-01,True,-6698.25,tfl,public transport,,public transport,202001,-6967.069824,37692.274554
2020-01-01,681780725,35177,1.5,co-op group food london w12 gb - co-op group food,co-op,"food, groceries, household",False,xxxx 0,2014-02-14,20k to 30k,1990.0,2016-04-24,558493,2020-03-11 23:17:00,hsbc,credit card,2020-01-12,1900-01-01,True,-6698.25,co-op supermarket,"food, groceries, household",,"food, groceries, household",202001,-6967.069824,37692.274554
2020-01-01,681780723,35177,10.75,eagle london gb - eagle,,,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2016-04-24,558493,2020-03-11 23:17:00,hsbc,credit card,2020-01-12,1900-01-01,True,-6698.25,,,,,202001,-6967.069824,37692.274554
2020-01-01,681780724,35177,11.15,eagle london gb - eagle,,,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2016-04-24,558493,2020-03-11 23:17:00,hsbc,credit card,2020-01-12,1900-01-01,True,-6698.25,,,,,202001,-6967.069824,37692.274554
2020-01-01,806077736,35177,11.15,eagle <mdbremoved>,,,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2020-08-11,1731987,2020-08-14 20:59:00,hsbc,credit card,2020-08-12,1900-01-01,True,-6909.620117,,,,,202001,-6930.949707,37692.274554
2020-01-01,806077737,35177,10.75,eagle <mdbremoved>,,,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2020-08-11,1731987,2020-08-14 20:59:00,hsbc,credit card,2020-08-12,1900-01-01,True,-6909.620117,,,,,202001,-6930.949707,37692.274554
2020-01-01,806081025,35177,900.0,balance transfer,,transfers,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2020-08-11,1731991,2020-08-14 21:00:00,marks & spencer,credit card,2020-08-12,1900-01-01,True,-3213.429932,personal,transfers,,transfers,202001,-3812.719482,37692.274554
2020-01-01,806081026,35177,26.1,balance transfer <mdbremoved>,,transfers,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2020-08-11,1731991,2020-08-14 21:00:00,marks & spencer,credit card,2020-08-12,1900-01-01,True,-3213.429932,personal,transfers,,transfers,202001,-3812.719482,37692.274554
2020-01-01,806077738,35177,3.2,tfl travel ch <mdbremoved>,tfl,public transport,False,xxxx 0,2014-02-14,20k to 30k,1990.0,2020-08-11,1731987,2020-08-14 20:59:00,hsbc,credit card,2020-08-12,1900-01-01,True,-6909.620117,tfl,public transport,,public transport,202001,-6930.949707,37692.274554


In [108]:
df.loc[df.user_id == 362977].set_index('date').loc['1 Jan 2020'][:20]

date,id,user_id,amount,desc,merchant,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,ym,balance,income
2020-01-01,672900827,362977,1.28,"card payment to paybyphone re: barnet on,1.28 gbp, rate 1.00/gbp on xx-xx-2019",vwfs,parking,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624558,2020-08-16 12:06:00,santander,current,2020-01-03,1900-01-01,True,16629.380859,vwfs,parking,,parking,202001,33337.257812,76201.198661
2020-01-01,672415668,362977,10.1,"card payment to willows activity farm,10.10 gbp, rate 1.00/gbp on xx-xx-2019",,,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624558,2020-08-16 12:06:00,santander,current,2020-01-03,1900-01-01,True,16629.380859,,,,,202001,33337.257812,76201.198661
2020-01-01,672487189,362977,3.5,"card payment to balady,3.50 gbp, rate 1.00/gbp on xx-xx-2019",,,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624558,2020-08-16 12:06:00,santander,current,2020-01-03,1900-01-01,True,16629.380859,,,,,202001,33337.257812,76201.198661
2020-01-01,672965985,362977,7.2,"card payment to willows activity farm,7.20 gbp, rate 1.00/gbp on xx-xx-2019",,,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624558,2020-08-16 12:06:00,santander,current,2020-01-03,1900-01-01,True,16629.380859,,,,,202001,33337.257812,76201.198661
2020-01-01,673263612,362977,23.200001,"card payment to willows activity farm,23.20 gbp, rate 1.00/gbp on xx-xx-2019",,,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624558,2020-08-16 12:06:00,santander,current,2020-01-03,1900-01-01,True,16629.380859,,,,,202001,33337.257812,76201.198661
2020-01-01,673378076,362977,30.35,"card payment to balady,30.35 gbp, rate 1.00/gbp on xx-xx-2019",,,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624558,2020-08-16 12:06:00,santander,current,2020-01-03,1900-01-01,True,16629.380859,,,,,202001,33337.257812,76201.198661
2020-01-01,673825826,362977,9.75,which?-moto-recurring t london,which,books / magazines / newspapers,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624759,2020-03-12 09:48:00,american express,credit card,2020-01-04,1900-01-01,True,-3445.27002,which,books / magazines / newspapers,,books / magazines / newspapers,202001,-11784.491211,76201.198661
2020-01-01,674374257,362977,7.99,audible uk adbl.co/pymt,audible,books / magazines / newspapers,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624759,2020-03-12 09:48:00,american express,credit card,2020-01-04,1900-01-01,True,-3445.27002,audible,books / magazines / newspapers,,books / magazines / newspapers,202001,-11784.491211,76201.198661
2020-01-01,674573429,362977,107.949997,paypal *salmadreamw eba xxxxxx7733,paypal,enjoyment,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624759,2020-03-12 09:48:00,american express,credit card,2020-01-04,1900-01-01,True,-3445.27002,paypal,enjoyment,,enjoyment,202001,-11784.491211,76201.198661
2020-01-01,676797740,362977,374.0,amazon.co.uk*fc2ug3e05 amazon.co.uk,amazon,enjoyment,False,nw2 2,2016-11-14,20k to 30k,1985.0,2016-11-14,624759,2020-03-12 09:48:00,american express,credit card,2020-01-07,1900-01-01,True,-3445.27002,amazon,enjoyment,,enjoyment,202001,-11784.491211,76201.198661


In [109]:
df.loc[df.user_id == 467877].set_index('date').loc['1 Jan 2020']

date,id,user_id,amount,desc,merchant,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,ym,balance,income
2020-01-01,674929767,467877,91.75,myprotein.com xxxxxxx9889,myprotein.com,"hairdressing, health, other",False,bs4 3,2018-10-21,30k to 40k,1988.0,2019-07-14,1356546,2020-03-11 17:45:00,american express,credit card,2020-01-05,1900-01-01,True,-1547.310059,myprotein.com,"hairdressing, health, other",,"hairdressing, health, other",202001,-2695.730225,25255.063058
2020-01-01,674929766,467877,11.86,mike guerin - <mdbremoved> ' bristol,,transfers,False,bs4 3,2018-10-21,30k to 40k,1988.0,2019-07-14,1356546,2020-03-11 17:45:00,american express,credit card,2020-01-05,1900-01-01,True,-1547.310059,,transfers,,transfers,202001,-2695.730225,25255.063058
2020-01-01,675722354,467877,25.0,bristol city council bristol gbr,,council tax,False,bs4 3,2018-10-21,30k to 40k,1988.0,2018-10-21,1081233,2020-06-10 09:10:00,sainsburys,credit card,2020-01-06,1900-01-01,True,-318.26001,,council tax,,council tax,202001,-518.460938,25255.063058
2020-01-01,675722353,467877,282.5,"www.ralphlauren.co.uk chadderton, old",ralph lauren,clothes - designer or other,False,bs4 3,2018-10-21,30k to 40k,1988.0,2019-07-14,1356546,2020-03-11 17:45:00,american express,credit card,2020-01-06,2020-01-18,True,-1547.310059,ralph lauren,clothes - designer or other,,clothes - designer or other,202001,-2695.730225,25255.063058
2020-01-01,760914427,467877,11.86,mike guerin - <mdbremoved> ' bristol,,transfers,False,bs4 3,2018-10-21,30k to 40k,1988.0,2020-04-19,1671779,2020-08-16 10:20:00,american express,credit card,2020-04-20,1900-01-01,True,-1107.72998,,transfers,,transfers,202001,-2962.080322,25255.063058
2020-01-01,760910751,467877,91.75,myprotein.com xxxxxxx9889,myprotein.com,"hairdressing, health, other",False,bs4 3,2018-10-21,30k to 40k,1988.0,2020-04-19,1671779,2020-08-16 10:20:00,american express,credit card,2020-04-20,1900-01-01,True,-1107.72998,myprotein.com,"hairdressing, health, other",,"hairdressing, health, other",202001,-2962.080322,25255.063058
2020-01-01,760916217,467877,282.5,"www.ralphlauren.co.uk chadderton, old",ralph lauren,clothes - designer or other,False,bs4 3,2018-10-21,30k to 40k,1988.0,2020-04-19,1671779,2020-08-16 10:20:00,american express,credit card,2020-04-20,2020-04-22,True,-1107.72998,ralph lauren,clothes - designer or other,,clothes - designer or other,202001,-2962.080322,25255.063058
2020-01-01,788887525,467877,5.45,portwell place( <mdbremoved> bristol gbr,,transfers,False,bs4 3,2018-10-21,30k to 40k,1988.0,2020-06-14,1708333,2020-08-16 09:37:00,sainsburys,credit card,2020-06-15,2020-06-16,True,-828.23999,,transfers,,transfers,202001,-192.179916,25255.063058


### Exploration

#### Prevalence

How prevalent are duplicates?

In [4]:
tfl_txn = df.desc.str.contains('tfl')
df['dup'] = df.duplicated(['date', 'user_id', 'account_id', 'amount']) & ~tfl_txn
print('About {:.1%} of transactions are potential duplicates'.format(len(df[df.dup]) / len(df)))

About 3.4% of transactions are potential duplicates
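A note on what this share counts: with its default `keep='first'`, `DataFrame.duplicated` flags only the second and later occurrences of each key, so the figure above counts surplus copies rather than every row that belongs to a duplicate group. A minimal sketch on made-up transactions:

```python
import pandas as pd

# Made-up transactions: three identical rows and one distinct row.
txns = pd.DataFrame({
    'date': ['2020-01-01'] * 4,
    'user_id': [1, 1, 1, 1],
    'account_id': [10, 10, 10, 10],
    'amount': [9.2, 9.2, 9.2, 20.0],
})
key = ['date', 'user_id', 'account_id', 'amount']

# Default keep='first': the first occurrence is not flagged.
extra_copies = txns.duplicated(key)
# keep=False: every member of a duplicate group is flagged.
all_members = txns.duplicated(key, keep=False)

print(extra_copies.tolist())  # [False, True, True, False]
print(all_members.tolist())   # [True, True, True, False]
```

The default is the relevant quantity if the aim is to know how many rows would be dropped by deduplication; `keep=False` is more convenient for inspecting duplicate groups side by side.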


In [5]:
d = df[df.duplicated(['date', 'user_id', 'amount']) & ~tfl_txn]
print('If we don\'t require the txns to be on the same account, about {:.1%} of transactions are potential duplicates'.format(len(d) / len(df)))

If we don't require the txns to be on the same account, about 6.3% of transactions are potential duplicates
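To separate the two kinds of potential duplicates, one option is to count the distinct accounts within each group of matching transactions. The sketch below uses made-up rows; the `cross_account` and `same_account` column names are illustrative, and a group that spans several accounts is treated as cross-account even if it also repeats within one of them.

```python
import pandas as pd

# Made-up transactions for illustration only.
txns = pd.DataFrame({
    'date': ['2020-01-01'] * 5,
    'user_id': [1, 1, 1, 1, 2],
    'account_id': [10, 20, 10, 10, 30],
    'amount': [11.15, 11.15, 3.2, 3.2, 11.15],
})
key = ['date', 'user_id', 'amount']

# Number of distinct accounts among rows sharing date, user and amount.
n_accounts = txns.groupby(key).account_id.transform('nunique')
# Rows that share their key with at least one other row.
in_group = txns.duplicated(key, keep=False)

txns['cross_account'] = in_group & (n_accounts > 1)
txns['same_account'] = in_group & (n_accounts == 1)
print(txns)
```

Cross-account matches are the pattern visible in the case studies above, where the same transactions reappear under an account id that was created later.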


For now, I focus on the first case: potential duplicates on the same account. What percentage of users is affected?

In [6]:
print('{:.1%} of users have potential dups'.format(df[df.dup].user_id.nunique() / df.user_id.nunique()))

99.5% of users have potential dups


This suggests that the problem is not limited to a small number of users or banks but is MDB-wide. Which account types are affected?

In [7]:
df[df.dup].account_type.value_counts() / df.account_type.value_counts() * 100

current        3.648721
credit card    1.690014
savings        2.708652
other          3.352484
Name: account_type, dtype: float64
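The ratio of the two `value_counts` results works because pandas aligns both Series on the account type before dividing. A sketch on made-up data, showing that a grouped mean of the boolean `dup` flag gives the same percentages:

```python
import pandas as pd

# Made-up flags for illustration only.
txns = pd.DataFrame({
    'account_type': ['current'] * 4 + ['savings'] * 2,
    'dup': [True, False, False, False, True, True],
})

# Approach used above: duplicate counts divided by total counts, aligned on index.
by_counts = (txns[txns.dup].account_type.value_counts()
             / txns.account_type.value_counts() * 100)

# Equivalent: the mean of a boolean column is the share of True values.
by_mean = txns.groupby('account_type').dup.mean() * 100

print(by_mean)
```

One difference is worth keeping in mind: an account type without any flagged rows shows up as `NaN` in the ratio of counts but as `0` in the grouped mean.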

#### Transaction value

In [8]:
pcts = [.01, .05, .1, .25, .50, .75, .90, .95, .99]
df[df.dup].amount.describe(percentiles=pcts)

count    44629.000000
mean         8.479360
std        493.575256
min     -50000.000000
1%        -659.000000
5%         -49.000000
10%         -6.490000
25%          2.000000
50%          9.200000
75%         20.000000
90%         50.000000
95%        100.000000
99%        500.000000
max      22000.000000
Name: amount, dtype: float64

Amounts are small but roughly in line with the distribution of all amounts (not shown), and too large for it to be likely that they would often be made multiple times on the same day (i.e. they are not mainly coffee purchases).

What's the distribution of the percentage of txns per user that are potential dups?

In [9]:
df.groupby('user_id').dup.mean().mul(100).describe(percentiles=pcts)

count    429.000000
mean       3.052654
std        3.485481
min        0.000000
1%         0.284972
5%         0.599143
10%        0.934519
25%        1.396530
50%        2.177602
75%        3.342618
90%        5.406710
95%        7.924490
99%       16.752279
max       35.239600
Name: dup, dtype: float64

Most users have very few potential duplicates. What's the distribution of the (net) value of txns per user that are potential dups?

In [10]:
df[df.dup].groupby('user_id').amount.sum().describe(percentiles=pcts)

count      427.000000
mean       886.242004
std       5870.280273
min     -53773.460938
1%      -23134.605781
5%       -3377.933032
10%       -676.339978
25%         72.094997
50%        570.539978
75%       1816.944946
90%       4472.633984
95%       7254.045068
99%      19637.889648
max      25055.359375
Name: amount, dtype: float64

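This means that for 25% of users, duplicates skew their net financial position by more than £1,800, which is substantial. Since net duplicate amounts are positive for roughly three quarters of users (the 25th percentile is already above zero) and debits appear with positive amounts in this data, most people appear poorer than they probably are.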

What's the distribution of the (absolute) value of txns per user that are potential dups?

In [11]:
df[df.dup].set_index('user_id').amount.abs().groupby('user_id').sum().describe(percentiles=pcts)

count       427.000000
mean       7028.406250
std       20731.789062
min           3.390000
1%           36.133001
5%           88.571999
10%         205.177997
25%         702.040009
50%        1904.050049
75%        5477.185059
90%       13655.356055
95%       21536.695117
99%       82549.418750
max      231779.453125
Name: amount, dtype: float64
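Net and absolute values answer different questions: the net sum shows how far duplicates shift a user's overall position, while the absolute sum shows how much transaction volume is affected even when duplicated credits and debits cancel out. A minimal sketch on made-up amounts, with illustrative column names `net` and `gross`, that computes both in one pass:

```python
import pandas as pd

# Made-up duplicate transactions for illustration only.
dups = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'amount': [50.0, -50.0, 20.0, 5.0],
})

summary = dups.groupby('user_id').amount.agg(
    net='sum',                      # signed amounts can cancel out
    gross=lambda s: s.abs().sum(),  # total volume regardless of sign
)
print(summary)
```

For user 1 in this toy frame the net value is zero although 100 worth of transactions is duplicated, which is why the two distributions above differ so much in scale.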

What transactions tend to be duplicates?

In [12]:
df[df.dup].desc.value_counts(dropna=False)[:30]

<mdbremoved>                                                       2719
<mdbremoved>                                                        729
<mdbremoved> ft                                                     419
b365 moto                                                           338
daily od fee                                                        330
<mdbremoved> - s/o                                                  248
<mdbremoved> so                                                     222
paypal payment                                                      172
www.skybet.com cd 9317                                              171
<mdbremoved> atm                                                    162
betfair.-purchase                                                   150
paypal payment - d/d                                                148
bank giro credit ref <mdbremoved>                                   147
b365 cd 0159                                                    