Notebook purpose

- Understand the nature of duplicate transactions, explore solutions, and document decisions about which duplicates to drop

### Summary

Types of duplicates and how we handle them:

1. `['user_id', 'date', 'amount', 'account_id', 'desc']` are identical* -> drop in main analysis.

2. `['user_id', 'date', 'amount', 'account_id']` are identical and `desc` is similar**, *** -> drop from main analysis.

3. `['user_id', 'date', 'amount', 'account_id']` are identical and `desc` not similar -> keep in main analysis.

4. `['user_id', 'date', 'amount']` are identical and `account_id` differs (`desc` may or may not differ). This matters if there are (many) duplicated accounts, in which case a different account number does not guarantee a different account. -> ignore for now, ask MDB to share list of duplicated accounts.

\* This includes pairs where description for both txns is `<mdbremoved>`, in which case we assume that the same description is being masked.

\** "similar" is defined below.

\*** This category includes cases where the desc of one txn is `<mdbremoved>` while the other isn't: even though the descriptions are not similar, we assume that what is masked by `<mdbremoved>` in one description is similar to what is visible in the other.
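A minimal sketch of how the four candidate types can be told apart with `DataFrame.duplicated`. The toy rows are made up for illustration; only the column names come from the notebook. Types 2 and 3 can't be separated at this stage because that requires the similarity check on `desc`.

```python
import pandas as pd

# Toy data; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 1, 1],
    'date':       ['2020-01-01'] * 6,
    'amount':     [5.0, 5.0, 9.0, 9.0, 3.0, 3.0],
    'account_id': [10, 10, 10, 10, 10, 11],
    'desc':       ['costa', 'costa', 'tfl travel', 'tfl travel charge',
                   'aldi', 'aldi'],
})

with_desc = ['user_id', 'date', 'amount', 'account_id', 'desc']
without_desc = ['user_id', 'date', 'amount', 'account_id']
without_account = ['user_id', 'date', 'amount']

# keep=False marks every member of a duplicate group, not only the repeats.
same_all = toy.duplicated(subset=with_desc, keep=False)
same_but_desc = toy.duplicated(subset=without_desc, keep=False) & ~same_all
same_but_account = (toy.duplicated(subset=without_account, keep=False)
                    & ~same_all & ~same_but_desc)

toy['candidate'] = 'none'
toy.loc[same_all, 'candidate'] = 'type 1'
toy.loc[same_but_desc, 'candidate'] = 'type 2 or 3'
toy.loc[same_but_account, 'candidate'] = 'type 4'
print(toy[['amount', 'account_id', 'desc', 'candidate']])
```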

Solution steps:

- preprocess using cleaning func to eliminate extraneous chars
- drop type 1 dups
- of remaining dups, drop those for which all WORDS of short desc are contained in long desc
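The last step can be sketched as a standalone check. The function name `words_contained` is hypothetical; the logic mirrors the substring-based matching used further down in the notebook, with an added guard for empty descriptions that is an assumption of this sketch.

```python
def words_contained(short: str, long: str) -> bool:
    """True if every word of `short` can be matched, once each, in `long`."""
    words = short.split()
    if not words:
        # Assumption: treat empty descriptions conservatively as non-dups.
        return False
    remainder = long
    for word in words:
        if word not in remainder:
            return False
        # Remove the matched text so it cannot be matched a second time.
        remainder = remainder.replace(word, '', 1)
    return True


print(words_contained('tfl travel', 'tfl travel charge'))
print(words_contained('costa coffee', 'tesco stores'))
print(words_contained('10 10', 'fee 10'))
```

Note that matching is on substrings rather than whole words, so a short word such as `14` would also match inside `4914`.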


Todo:
- Make regex [fast](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3)
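One possible way to speed up cleaning, not necessarily the approach in the linked answer, is to run the regex once per distinct description and map the result back. This is a sketch only: `clean_unique` and `collapse_spaces` are hypothetical names, and the gain depends on how often descriptions repeat.

```python
import re

import pandas as pd

MULTIPLE_SPACES = re.compile(r'\s{2,}')


def collapse_spaces(s: str) -> str:
    """Stand-in for the full cleaning logic; only collapses whitespace."""
    return MULTIPLE_SPACES.sub(' ', s).strip()


def clean_unique(desc: pd.Series, clean_one) -> pd.Series:
    """Applies clean_one once per distinct description, then maps back."""
    lookup = {d: clean_one(d) for d in desc.dropna().unique()}
    # Missing descriptions are not in the lookup and stay missing.
    return desc.map(lookup)


desc = pd.Series(['tfl   travel', 'tfl   travel', 'aldi  cd', None])
cleaned = clean_unique(desc, collapse_spaces)
print(cleaned)
```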

In [1]:
import os
import sys

import numpy as np
import pandas as pd
import seaborn as sns

sys.path.append('/Users/fgu/dev/projects/entropy')
import entropy.helpers.aws as aws
import entropy.data.cleaners as cl

sns.set_style('whitegrid')
pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)
pd.set_option('max_colwidth', None)
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [2]:
df = pd.read_parquet('~/tmp/entropy_X77.parquet')

In [3]:
def distr(x):
    pcts = [.01, .05, .1, .25, .50, .75, .90, .95, .99]
    return x.describe(percentiles=pcts).round(2)

def duplicates_sample(df, col_subset, n=100, seed=2312):
    """Draws sample of up to n groups of duplicate txns as defined by col_subset."""
    dups = df[df.duplicated(subset=col_subset, keep=False)].copy()
    dups['group'] = dups.groupby(col_subset).ngroup()
    unique_groups = np.unique(dups.group)
    rng = np.random.default_rng(seed=seed)
    # sample without replacement so that no group is drawn twice
    size = min(n, len(unique_groups))
    sample = rng.choice(unique_groups, size=size, replace=False)
    return dups[dups.group.isin(sample)]

## Case studies

Below are three case studies of duplicates.

In [None]:
dh.user_date_data(df, 35177, '1 Jan 2020')

In [None]:
dh.user_date_data(df, 362977, '1 Jan 2020')

In [None]:
dh.user_date_data(df, 467877, '1 Jan 2020')

## Type 1 duplicates

### Definition
- `['user_id', 'date', 'amount', 'account_id', 'desc']` are identical.
 
- This includes transactions where desc for both is `<mdbremoved>`, where we have to make a call whether or not to assume they are the same.

- Reasons for false positives (FP): user makes two identical transactions on the same day (or on subsequent days for txns that appear with a delay). Plausible cases are coffee and betting shop txns. However, inspection suggests that the vast majority of cases are duplicates, as they are txns that are unlikely to result from multiple purchases on the same day.


### Decision

- We delete dups for the main analysis and do a robustness check without deleting them
- We treat cases where both descriptions are `<mdbremoved>` no differently from others, even though it's somewhat more likely that they are genuinely different transactions
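A minimal sketch of how flagging rather than dropping keeps both samples available. The toy rows and the names `main_sample` and `robustness_sample` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2],
    'date':       ['2020-01-01'] * 4,
    'amount':     [2.5, 2.5, 2.5, 2.5],
    'account_id': [10, 10, 10, 20],
    'desc':       ['costa', 'costa', 'costa', 'costa'],
})
cols = ['user_id', 'date', 'amount', 'account_id', 'desc']

# Default keep='first' flags only the repeats, so one txn per group survives.
toy['dup1'] = toy.duplicated(subset=cols)

main_sample = toy[~toy['dup1']]   # duplicates removed
robustness_sample = toy           # duplicates retained
print(len(main_sample), len(robustness_sample))
```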

In [4]:
col_subset = ['user_id', 'date', 'amount', 'account_id', 'desc']
dup_var = 'dup1'

df[dup_var] = df.duplicated(subset=col_subset)

### Prevalence and value

How prevalent are duplicates?

In [5]:
n_df = len(df)
n_dups = len(df[df[dup_var]])
n_users_dups = df[df[dup_var]].user_id.nunique()
n_users_df = df.user_id.nunique()
txt = 'About {:.1%} of transactions across {:.0%} of users are potential dups.'
print(txt.format(n_dups / n_df, n_users_dups / n_users_df))

About 1.7% of transactions across 97% of users are potential dups.


Gross value of duplicated txns

In [6]:
gross_value = df[df[dup_var]].set_index('user_id').amount.abs().groupby('user_id').sum()
distr(gross_value)

count       417.00
mean       4622.48
std       15167.33
min           0.17
1%            4.07
5%           19.30
10%          61.06
25%         236.30
50%         836.96
75%        2736.70
90%        9293.31
95%       17393.81
99%       59012.20
max      183754.34
Name: amount, dtype: float64

Most frequent txn descriptions

In [7]:
df[df[dup_var]].desc.value_counts(dropna=False)[:10]

<mdbremoved>                         2050
<mdbremoved>                          517
<mdbremoved> ft                       357
b365 moto                             263
tfl travel charge tfl.gov.uk/cp       167
www.skybet.com cd 9317                165
<mdbremoved> - s/o                    158
<mdbremoved> so                       156
bank giro credit ref <mdbremoved>     147
betfair.-purchase                     146
Name: desc, dtype: int64

Most frequent auto tag

In [8]:
df[df[dup_var]].tag_auto.value_counts(dropna=False)[:10]

NaN                           6240
transfers                     2961
gambling                      2185
enjoyment                     1535
public transport              1076
lunch or snacks               1048
bank charges                   823
entertainment, tv, media       562
cash                           516
food, groceries, household     506
Name: tag_auto, dtype: int64

Proportion of txns per auto tag that are duplicated

In [11]:
txns_per_tag_overall = df.tag_auto.value_counts(dropna=False)
txns_per_tag_duplicated = df[df[dup_var]].tag_auto.value_counts(dropna=False) 
p_dup_per_tag = (txns_per_tag_duplicated / txns_per_tag_overall)
p_dup_per_tag.sort_values(ascending=False)[:40]

investment - other               0.214953
gambling                         0.159303
mobile app                       0.151954
isa                              0.088095
tradesmen fees                   0.076923
flights                          0.049423
parking                          0.046014
payment protection insurance     0.044776
bills                            0.040082
home appliance insurance         0.039448
games and gaming                 0.038053
supermarket                      0.035669
road charges                     0.032078
pension or investments           0.030481
pet insurance                    0.027899
bank charges                     0.026923
refunded purchase                0.026788
public transport                 0.026075
child - everyday or childcare    0.024896
fines                            0.024390
NaN                              0.024055
gym membership                   0.023636
postage / shipping               0.023256
entertainment, tv, media         0

### Inspect dups

In [12]:
duplicates_sample(df, col_subset, n=3, seed=None)

Unnamed: 0,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings,dup1,group
432753,152648286,2016-07-11,309377,3.3,afs google - d/d,google,spend,services,True,tn22 1,2015-06-27,,1983.0,2015-06-30,445695,2018-02-09 15:34:00,natwest bank,current,2016-08-06,2018-11-05,True,,google,enjoyment,,administration - other,u,201607,,26350.679688,False,False,5113
432754,152648287,2016-07-11,309377,3.3,afs google - d/d,google,spend,services,True,tn22 1,2015-06-27,,1983.0,2015-06-30,445695,2018-02-09 15:34:00,natwest bank,current,2016-08-06,2018-11-05,True,,google,enjoyment,,administration - other,u,201607,,26350.679688,False,True,5113
648903,267648222,2017-09-23,389477,1.87,"card payment to harrods ltd,1.87 gbp, rate 1.00/gbp on 21-09-2017",harrods,spend,household,,se14 5,2017-08-01,,1995.0,2017-08-01,761689,2019-03-20 01:57:00,santander,current,2017-09-24,2017-11-13,True,119.720001,harrods,home,,,u,201709,-408.230286,17807.490234,False,False,8993
648904,267648225,2017-09-23,389477,1.87,"card payment to harrods ltd,1.87 gbp, rate 1.00/gbp on 21-09-2017",harrods,spend,household,,se14 5,2017-08-01,,1995.0,2017-08-01,761689,2019-03-20 01:57:00,santander,current,2017-09-24,2017-11-13,True,119.720001,harrods,home,,,u,201709,-408.230286,17807.490234,False,True,8993
1167455,727870871,2018-12-10,561177,50.0,<mdbremoved> pocket money via mobile - lvp fp 09/12/18 10 xxxxxxxxxxxxxx000n,,spend,retail,True,le12 8,2020-01-06,50k to 60k,1986.0,2020-03-06,1630041,2020-08-16 10:00:00,natwest bank,current,2020-03-07,1900-01-01,True,1232.430054,,"child - toys, clubs or other",,"child - toys, clubs or other",c,201812,-2230.909668,20388.599609,False,False,16038
1167456,727870872,2018-12-10,561177,50.0,<mdbremoved> pocket money via mobile - lvp fp 09/12/18 10 xxxxxxxxxxxxxx000n,,spend,retail,True,le12 8,2020-01-06,50k to 60k,1986.0,2020-03-06,1630041,2020-08-16 10:00:00,natwest bank,current,2020-03-07,1900-01-01,True,1232.430054,,"child - toys, clubs or other",,"child - toys, clubs or other",c,201812,-2230.909668,20388.599609,False,True,16038


In [16]:
df = df.drop_duplicates(subset=col_subset)

## Type 2 dups

### Definition

- `['user_id', 'date', 'amount', 'account_id']` are identical, `desc` is different but similar, or one `desc` contains `<mdbremoved>` and the other one doesn't.

### Decision

- Inspection suggests that in most cases, similar but different desc strings result from slight editing of the string, e.g. by removing unnecessary punctuation characters or (as discussed in MDB documentation) by revealing additional information that a new algorithm has classified as non-sensitive. We therefore drop these from the main analysis.
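To get a feel for how close such edited strings are, one option is a similarity ratio from the standard library's `difflib`. This is only an illustration: the rule the notebook ends up using is word containment, not a ratio, and the edited variant below is made up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Returns a score in [0, 1]; 1 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Second string is a made-up variant with punctuation removed.
edited = similarity('afs google - d/d', 'afs google d d')
unrelated = similarity('afs google - d/d', 'tesco stores 3345')
print(round(edited, 2), round(unrelated, 2))
```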

In [4]:
col_subset = ['user_id', 'date', 'amount', 'account_id']
dup_var = 'dup2'

df[dup_var] = df.duplicated(subset=col_subset)

### Prevalence and value

How prevalent are duplicates?

In [18]:
n_df = len(df)
n_dups = len(df[df[dup_var]])
n_users_dups = df[df[dup_var]].user_id.nunique()
n_users_df = df.user_id.nunique()
txt = 'About {:.1%} of transactions across {:.0%} of users are potential dups.'
print(txt.format(n_dups / n_df, n_users_dups / n_users_df))

About 1.9% of transactions across 99% of users are potential dups.


Gross value of duplicated txns

In [19]:
gross_value = df[df[dup_var]].set_index('user_id').amount.abs().groupby('user_id').sum()
distr(gross_value)

count       426.00
mean       2543.32
std        8323.64
min           2.50
1%            7.72
5%           41.88
10%          91.53
25%         295.48
50%         881.08
75%        2102.15
90%        4924.51
95%        6928.04
99%       25746.48
max      106598.39
Name: amount, dtype: float64

Most frequent txn descriptions (first 12 characters)

In [20]:
df[df[dup_var]].desc.str[:12].value_counts(dropna=False)[:10]

<mdbremoved>    3460
daily od fee    1894
int'l xxxxxx     944
card payment     490
direct debit     439
contactless      410
visa purchas     346
tfl travel c     343
tfl.gov.uk/c     288
call ref.no.     273
Name: desc, dtype: int64

### Similarity score

Helper functions to clean descriptions and to identify and drop type 1 and type 2 duplicates

In [4]:
import difflib
import functools
import collections
from fuzzywuzzy import fuzz


def clean_desc(df):
    """Removes extraneous characters that often create duplicates."""
    import re
    import string
    df = df.copy()
    number_mask = re.compile(r'[x]{2,}')   # masked card or account numbers
    common_suffixes = re.compile(r' -( .)? .{2,3}$')   # e.g. - vis, - p/p, - e gbp
    # escape punctuation so that chars like ] and \ are literals inside the class
    punctuation = re.compile('[{}]+'.format(re.escape(string.punctuation)))
    multiple_spaces = re.compile(r'\s{2,}')
    separate_word_digits = re.compile(r'(?<=[a-zA-Z])(?=\d+)')   # no14 -> no 14

    df['desc'] = (df.desc
                  .str.replace(common_suffixes, ' ', regex=True)
                  .str.replace(punctuation, ' ', regex=True)
                  .str.replace(number_mask, ' ', regex=True)
                  .str.replace(separate_word_digits, ' ', regex=True)
                  .str.replace(multiple_spaces, ' ', regex=True)
                  .str.strip())
    return df


def drop_dup1(df):
    return df.drop_duplicates(subset=['user_id', 'account_id', 'date', 'amount', 'desc'])


def _potential_dup2_dups(df):
    cols = ['user_id', 'account_id', 'date', 'amount']
    dups = df[df.duplicated(subset=cols, keep=False)].copy()
    # group on local cols rather than on the global col_subset
    dups['group'] = dups.groupby(cols).ngroup()
    return dups


def _identify_dup2(group):
    group = group.copy()
    group['dup'] = False

    DescAndId = collections.namedtuple('DescAndId', ['desc', 'id'])
    shortest_first = functools.partial(sorted, key=lambda x: len(x.desc))

    items = [DescAndId(*item) for item in zip(group.desc, group.id)]
    shortest, *others = shortest_first(items)

    # if txn ids are unique, this only holds for groups of two
    others_are_equal = len(set(others)) == 1
    others_ids = [o.id for o in others]

    # default to non-dup; also covers empty descriptions
    answer = False
    if others_are_equal:
        # are all words of shortest desc contained in other desc?
        remainder = others[0].desc
        for w in shortest.desc.split():
            if w in remainder:
                # remove match so that each element matches only once
                remainder = remainder.replace(w, '', 1)
            else:
                answer = False
                break
            answer = True

        if not answer:
            # or are all words of other desc contained in shortest desc?
            remainder = shortest.desc
            for w in others[0].desc.split():
                if w in remainder:
                    remainder = remainder.replace(w, '', 1)
                else:
                    answer = False
                    break
                answer = True

    # keep shortest txn, flag the others
    group.loc[group.id.isin(others_ids), 'dup'] = answer
    return group

def _drop_dups2(df):
    return df.drop(df[df.dup].index)
    
def drop_dup2(df):
    df = df.copy()
    dups = _potential_dup2_dups(df)
    # group_keys=False keeps the original index so that labels match df
    dups = dups.groupby('group', group_keys=False).apply(_identify_dup2)
    dups = dups[dups.dup].index
    return df.drop(dups)

def counter(df):
    print(df.shape)
    return df

In [170]:
clean = clean_desc(df)

In [194]:
u = (clean
     .pipe(counter)
     .pipe(drop_dup1)
     .pipe(counter)
     .pipe(drop_dup2)
     .pipe(counter))
u

(1301806, 32)
(1276922, 32)
(1273255, 32)


Unnamed: 0,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings,dup2
0,688261,2012-01-03,777,400.000000,mdbremoved,,transfers,tsransfer,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-11-13,True,364.220001,non merchant mbl,transfers,other account,other account,u,201201,-1451.075562,24319.220881,False,False
1,688264,2012-01-03,777,10.270000,9572 30dec 11 mcdonalds restaurant winwick road gb,mcdonalds,spend,services,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2015-03-19,True,364.220001,mcdonalds,dining and drinking,,dining and drinking,u,201201,-1451.075562,24319.220881,False,False
2,688263,2012-01-03,777,6.680000,9572 31dec 11 tesco stores 3345 warrington gb,tesco,spend,household,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-08-15,True,364.220001,tesco supermarket,"food, groceries, household",,supermarket,u,201201,-1451.075562,24319.220881,False,False
3,688265,2012-01-03,777,12.000000,9572 31dec 11 tesco stores 3345 warrington gb,tesco,spend,household,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-08-15,True,364.220001,tesco supermarket,"food, groceries, household",,supermarket,u,201201,-1451.075562,24319.220881,False,False
4,688262,2012-01-03,777,3.030000,aviva pa,aviva,spend,finance,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-08-15,True,364.220001,aviva,health insurance,life insurance,life insurance,u,201201,-1451.075562,24319.220881,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301801,802761344,2020-07-29,587177,67.559998,aldi cd 4914,aldi,spend,household,False,cm12 0,2020-06-13,30k to 40k,1964.0,2020-06-13,1708060,2020-08-16 11:46:00,tsb,current,2020-07-31,1900-01-01,True,3263.830078,aldi,"food, groceries, household",,"food, groceries, household",c,202007,3763.830078,37865.039062,False,False
1301802,802761345,2020-07-29,587177,414.670013,mdbremoved,,,,False,cm12 0,2020-06-13,30k to 40k,1964.0,2020-06-13,1708060,2020-08-16 11:46:00,tsb,current,2020-07-31,1900-01-01,True,3263.830078,,,repayments,repayments,c,202007,3763.830078,37865.039062,False,False
1301803,802761325,2020-07-30,587177,200.000000,regular transfer payment to mdbremoved mandate no 14,,,transfers,False,cm12 0,2020-06-13,30k to 40k,1964.0,2020-06-14,1708319,2020-08-16 08:35:00,santander,current,2020-07-31,1900-01-01,True,18552.080078,,,,,c,202007,18052.080078,37865.039062,False,False
1301804,803229644,2020-07-31,587177,500.000000,santander 123 a c mdbremoved,santander bank,transfers,tsransfer,False,cm12 0,2020-06-13,30k to 40k,1964.0,2020-06-13,1708060,2020-08-16 11:46:00,tsb,current,2020-08-01,1900-01-01,True,3263.830078,santander bank,transfers,,transfers,c,202007,3263.830078,37865.039062,False,False


In [151]:
# compare desc before and after cleaning

k = df.iloc[785793].to_frame().T
display(k.desc.values[0])
clean_desc(k).desc.values[0]

'midgleys cd 5714 deb'

'midgleys cd 5714 deb'

Features:
- In groups larger than two, if the others are related to each other but the shortest isn't related to them, we're unable to identify the others as dups (e.g.: `df.iloc[[1267567, 1267576, 1267577]]`).
- If the shorter desc is mdbremoved only, we conservatively classify the pair as non-dup.
- A number can be matched to the wrong equivalent (rows [559240, 559242]) -> each element in the other desc should only match once.
- In groups of daily od charges where the shortest desc has no date, the dups are not identified.

Limitations:
- If a group contains two separate sets of duplicates, they are not identified
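A hypothetical way around this limitation would be to compare each description against all shorter descriptions already kept, instead of only against the single shortest one. The sketch below is not part of the notebook; `flag_dups` and `words_contained` are assumed names, and the sample descriptions are made up.

```python
def words_contained(short: str, long: str) -> bool:
    """True if every word of `short` can be matched, once each, in `long`."""
    words = short.split()
    if not words:
        return False
    remainder = long
    for word in words:
        if word not in remainder:
            return False
        remainder = remainder.replace(word, '', 1)
    return True


def flag_dups(descs):
    """Returns one boolean per desc, True if it duplicates a shorter kept desc."""
    order = sorted(range(len(descs)), key=lambda i: len(descs[i]))
    kept = []
    flags = [False] * len(descs)
    for i in order:
        if any(words_contained(descs[k], descs[i]) for k in kept):
            flags[i] = True    # longer variant of a desc we already keep
        else:
            kept.append(i)     # first (shortest) member of a new set
    return flags


# Made-up group with two separate sets of duplicates.
group = ['tfl travel', 'tfl travel charge', 'aldi', 'aldi cd 4914']
print(flag_dups(group))
```

As in the notebook, the shortest description of each set is kept and the longer ones are flagged.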




## Improvements

### Clean

Original

In [242]:
%%time

def clean_desc(df):
    """Removes extraneous characters that hinder duplicates detection.
    
    Removes common suffixes such as -vis, -p/p, and - e gbp; all
    punctuation; multiple x characters, which are used to mask card
    or account numbers; and extra whitespace. Also splits digit
    suffixes -- but not prefixes, as these are usually dates -- from
    words (e.g. 'no14' becomes 'no 14', '14jan' remains unchanged).
    """
    import re
    import string
    df = df.copy()
    kwargs = dict(repl=' ', regex=True)
    df['desc'] = (
        df.desc
        .str.replace(r'-\s(\w\s)?.{2,3}$', **kwargs)
        # escape punctuation so that chars like ] and \ are literals
        .str.replace(fr'[{re.escape(string.punctuation)}]+', **kwargs)
        .str.replace(r'[x]{2,}', **kwargs)
        .str.replace(r'(?<=[a-zA-Z])(?=\d)', **kwargs)
        .str.replace(r'\s{2,}', **kwargs)
        .str.strip()
    )
    return df

k1 = clean_desc(df)

CPU times: user 860 ms, sys: 9.91 ms, total: 869 ms
Wall time: 869 ms
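To check what each cleaning step does, the same patterns can be applied to a few strings with plain `re`. This is a sketch: `clean_one` is a hypothetical helper, the sample strings are made up in the style of the data, and punctuation is escaped with `re.escape`.

```python
import re
import string

# Patterns follow the steps listed in the clean_desc docstring, in order.
STEPS = [
    re.compile(r'-\s(\w\s)?.{2,3}$'),                           # common suffixes
    re.compile('[{}]+'.format(re.escape(string.punctuation))),  # punctuation
    re.compile(r'x{2,}'),                                       # masked numbers
    re.compile(r'(?<=[a-zA-Z])(?=\d)'),                         # no14 -> no 14
    re.compile(r'\s{2,}'),                                      # extra whitespace
]


def clean_one(desc: str) -> str:
    """Applies each cleaning step in turn to a single description."""
    for pattern in STEPS:
        desc = pattern.sub(' ', desc)
    return desc.strip()


samples = [
    'afs google - d/d',
    'regular transfer payment no14',
    "int'l xxxxxx1234 costa",
    '14jan tesco',
]
for s in samples:
    print(repr(s), '->', repr(clean_one(s)))
```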
