Notebook purpose

- Understand the nature of duplicate transactions, explore solutions, and document decisions about which duplicates to drop

Types of duplicates

Type 1 duplicates:  `['user_id', 'date', 'amount', 'account_id', 'desc']` are identical

Type 2 duplicates: `['user_id', 'date', 'amount', 'account_id']` are identical and one `desc` is a "loose subset" of the other (i.e. each word in one desc appears somewhere in the other, possibly out of order, and each substring of the other desc can be matched only once).
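
As a rough illustration of the "loose subset" rule, the sketch below checks whether every word of one description can be found in the other, removing each match so that it cannot be reused. The function name `is_loose_subset` and the sample descriptions are made up for this example; the matching logic mirrors the `each_word_in_string` helper defined further down in this notebook.

```python
def is_loose_subset(short_desc, long_desc):
    """Return True if each word of short_desc occurs in long_desc.

    Matching is by substring, order is ignored, and each matched
    substring is removed so that it can be matched only once.
    """
    unmatched = long_desc
    for word in short_desc.split():
        if word not in unmatched:
            return False
        unmatched = unmatched.replace(word, '', 1)
    return True


# Hypothetical descriptions, for illustration only.
long_desc = '9572 31dec 11 tesco stores 3345 warrington gb'
print(is_loose_subset('tesco warrington', long_desc))  # True
print(is_loose_subset('warrington tesco', long_desc))  # True, order is ignored
print(is_loose_subset('20 20', 'fee 20 jul'))          # False, '20' matches only once
```

Note that the check is directional, which is why the code below tries both orderings of each pair.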

Approach taken:

- Clean description string to remove extraneous characters that obfuscate type 1 duplicates. 
- Remove type 1 duplicates
- Identify and remove type 2 duplicates
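
To make the first step concrete, here is a minimal sketch of how cleaning can expose Type 1 duplicates. The function `clean_desc` and its rules (lowercase, turn every run of non-alphanumeric characters into a single space) are assumptions for illustration; they are not the rules implemented in `entropy.data.cleaners`, and the sample strings are made up.

```python
import re


def clean_desc(desc):
    """Lowercase desc and replace each run of non-alphanumeric
    characters with a single space (hypothetical cleaning rules)."""
    desc = desc.lower()
    desc = re.sub(r'[^a-z0-9]+', ' ', desc)
    return desc.strip()


# Two made-up raw descriptions that differ only in punctuation and case.
raw_a = 'Tesco Stores, 3345 - POS'
raw_b = 'tesco stores 3345 pos'

print(raw_a == raw_b)                          # False
print(clean_desc(raw_a) == clean_desc(raw_b))  # True
```

Once descriptions are normalised like this, an exact comparison on `desc` catches pairs that would otherwise only surface as Type 2 duplicates.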

In [1]:
import os
import sys

import numpy as np
import pandas as pd
import seaborn as sns

sys.path.append('/Users/fgu/dev/projects/entropy')
import entropy.helpers.aws as aws
import entropy.data.cleaners as cl

sns.set_style('whitegrid')
pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)
pd.set_option('max_colwidth', None)
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [2]:
df = aws.read_parquet('~/tmp/entropy_777.parquet')

## Remove duplicates

In [4]:
import collections

def counter(df):
    print(df.shape)
    return df


def _get_potential_type2_dups(df):
    """Returns txns with identical user and account ids, dates, and amounts."""
    cols=['user_id', 'account_id', 'date', 'amount']
    dups = df[df.duplicated(subset=cols, keep=False)].copy()
    dups['group'] = dups.groupby(cols).ngroup()
    return dups

def _identify_type2_dups(df):
    """Returns index of Type2 duplicates."""

    def each_word_in_string(words, string):
        """Tests whether each word from words appears in string.
        Allows each substring in string to be matched only once.
        """
        unmatched = string
        for w in words:
            if w not in unmatched:
                return False
            unmatched = unmatched.replace(w, '', 1)
        return True
    
    def identifier(g):
        descriptions = [DescId(*i) for i in zip(g.desc, g.id)]
        shortest, *others = sorted(descriptions, key=lambda x: len(x.desc))

        for other in others:            
            words, string = shortest.desc.split(), other.desc
            answer = each_word_in_string(words, string)
            if not answer:
                words, string = other.desc.split(), shortest.desc
                answer = each_word_in_string(words, string)
            g.loc[g.id.eq(other.id), 'dup'] = answer

        return g
    
    DescId = collections.namedtuple('DescId', ('desc', 'id'))
    df['dup'] = False

    df = df.groupby('group').apply(identifier)
    return df[df.dup].index

def drop_type2_dups(df):
    """Drops Type 2 duplicates.
    
    A Type 2 duplicate is one of two txns with identical user ids, account
    ids, dates, and amounts, as well as similar txn descriptions,
    where "similar" means that each word in the description of one txn appears
    in the description of the other.
    """
    potential_dups = _get_potential_type2_dups(df)
    dups_idx = _identify_type2_dups(potential_dups)
    return df.drop(dups_idx)


In [31]:
k1 = (df.pipe(counter)
         .pipe(drop_type1_dups)
         .pipe(counter)
         .pipe(drop_type2_dups)
         .pipe(counter))
k1.desc

(123764, 31)
(121188, 31)
(121039, 31)


0                                                 mdbremoved
1         9572 30dec 11 mcdonalds restaurant winwick road gb
2              9572 31dec 11 tesco stores 3345 warrington gb
3              9572 31dec 11 tesco stores 3345 warrington gb
4                                                   aviva pa
                                 ...                        
123759                                        mdbremoved sto
123760                             arthur lane on 29 jul clp
123761                             wickes bury on 30 jul clp
123762                        pets at home ltd on 30 jul clp
123763                    9582 30jul 20 lidl gb bury bury gb
Name: desc, Length: 121039, dtype: object

In [14]:
# benchmark
k0 = (df.pipe(counter)
         .pipe(drop_type1_dups)
         .pipe(counter)
         .pipe(drop_type2_dups)
         .pipe(counter))
k0.head(3)

(123764, 31)
(121188, 31)
(121093, 31)


Unnamed: 0,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings
0,688261,2012-01-03,777,400.0,mdbremoved,,transfers,transfers,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-11-13,True,364.220001,non merchant mbl,transfers,other account,other account,u,201201,-1451.075562,24319.220881,False
1,688264,2012-01-03,777,10.27,9572 30dec 11 mcdonalds restaurant winwick road gb,mcdonalds,spend,services,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2015-03-19,True,364.220001,mcdonalds,dining and drinking,,dining and drinking,u,201201,-1451.075562,24319.220881,False
2,688263,2012-01-03,777,6.68,9572 31dec 11 tesco stores 3345 warrington gb,tesco,spend,household,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2014-07-18,2017-08-15,True,364.220001,tesco supermarket,"food, groceries, household",,supermarket,u,201201,-1451.075562,24319.220881,False


Features:
- In groups larger than two, if others are related and shortest isn't, then we're unable to identify others as dups. (e.g.: `df.iloc[[1267567, 1267576, 1267577]]`)
- If shorter is mdbremoved only, then conservatively classify as non-dup
- A number can be matched to the wrong equivalent (e.g. `[559240, 559242]`), so each element in the other desc should match only once
- Groups with daily od charges, shortest without date not identified as dup

Limitations:
- If group contains two groups of duplicates, they are not identified
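
One further property of the helper above is that `w not in unmatched` tests substrings of the raw string, so a short word can match inside a longer one (for instance `'co'` inside `'tesco'`). The sketch below shows one possible stricter variant that compares whole words as multisets using `collections.Counter`. It is an alternative for comparison only, not the approach used in this notebook, and the name `each_token_in_tokens` and the sample strings are hypothetical.

```python
from collections import Counter


def each_token_in_tokens(short_desc, long_desc):
    """Return True if the words of short_desc form a sub-multiset of
    the words of long_desc (whole-word matches, each usable once)."""
    # Counter subtraction keeps only words with a positive remaining count,
    # i.e. words of short_desc that long_desc cannot cover.
    missing = Counter(short_desc.split()) - Counter(long_desc.split())
    return not missing


print(each_token_in_tokens('lidl bury', 'lidl gb bury on 19 jul clp'))  # True
print(each_token_in_tokens('co', 'tesco stores'))                       # False
print(each_token_in_tokens('o2 o2', 'o2 d d'))                          # False
```

The trade-off is that whole-word matching is more conservative: it would miss cases where one description contains a truncated or concatenated form of a word in the other.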




In [5]:
# get potential dups
# for each pairing in each group, check whether each word in string
    # if yes, second is duplicate
    # if no, second isn't duplicate

# decisions:
# - use combinations and if not answer pattern as above or 
#   simply check each word in string for each permutation in group
# approach below uses latter, which is clean, but relies on there not being any type 1 dups in the data. do 
# i want to rely on this? probably not. want to be able to filter out type 2 dups even when there are type 1 dups.
# for this, though, I need a version of the if not answer pattern and combinations. think of nice way to do this. 

import itertools



def drop_type1_dups(df):
    """Drops Type 1 duplicates.
    
    A Type 1 duplicate is one of two txns with identical user and
    account ids, dates, amounts, and txn descriptions.
    """
    df = df.copy()
    cols = ['user_id', 'account_id', 'date', 'amount', 'desc']
    return df.drop_duplicates(subset=cols)


def _potential_type2_dups(df):
    """Returns desc and duplicate group id for potential Type 2 duplicates."""
    cols = ['date', 'user_id', 'account_id', 'amount']
    return (df.loc[df.duplicated(subset=cols, keep=False)]
            .assign(group=lambda df: df.groupby(cols).ngroup())
            .loc[:,['desc', 'group']])


def _each_word_in_string(words, string):
    """Tests whether each word from words appears in string.
    Allows each substring in string to be matched only once.
    """
    unmatched = string
    for w in words:
        if w not in unmatched:
            return False
        unmatched = unmatched.replace(w, '', 1)
    return True


def _type2_dups_indices(g):
    """Checks for each txn pair in a group whether one txn is a Type 2
    duplicate of the other, and returns idx of all duplicates.
    """
    dups = []
    pairs = list(itertools.combinations(g.index, 2))
    for first, second in pairs:
        words = g.loc[first].desc.split()
        string = g.loc[second].desc
        if _each_word_in_string(words, string):
            dups.append(second)
            continue  # second is a dup; skip the reverse check for this pair
        words = g.loc[second].desc.split()
        string = g.loc[first].desc
        if _each_word_in_string(words, string):
            dups.append(first)
    return dups


def drop_type2_dups(df):
    """Drops Type 2 duplicates.
    
    A Type 2 duplicate is a txn whose user id, account id, date, and amount
    are identical to another txn, and whose txn description is similar to that
    other txn, where "similar" means that each word in the txn description 
    appears in the description of the other txn.
    """   
    potential_dups = _potential_type2_dups(df)    
    dups = potential_dups.groupby('group').apply(_type2_dups_indices).sum()
    return df.drop(dups)


k = df.pipe(drop_type1_dups)[3300:3500].pipe(counter).pipe(drop_type2_dups).pipe(counter)
k


(200, 32)
(198, 32)


Unnamed: 0,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,desc_old,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings
3304,109371611,2015-12-03,777,15.000000,mdbremoved fp 03 12 15 30 1 3000n,,,,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2015-12-08,2017-10-23,True,"<mdbremoved> fp 03/12/15 30 , 1xxxxxxxxxxxx3000n - s/o",364.220001,non merchant mbl,,,,u,201512,366.253174,27638.970703,False
3305,109371621,2015-12-03,777,16.500000,8892 02dec 15 co op group 8054 warrington gb,co-op,spend,household,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2015-12-08,2017-08-12,True,"8892 02dec15 , co-op group xx8054, warrington gb - pos",364.220001,co-op supermarket,"food, groceries, household",,groceries,u,201512,366.253174,27638.970703,False
3306,109371614,2015-12-03,777,3.990000,8892 02dec 15 itunes com bill itunes com lu,apple,spend,services,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2015-12-08,2017-08-12,True,"8892 02dec15 , itunes.com/bill , itunes.com lu - pos",364.220001,apple,"entertainment, tv, media",lifestyle - other,lifestyle - other,u,201512,366.253174,27638.970703,False
3307,109551675,2015-12-04,777,14.970000,8892 03dec 15 amazon uk marketplace 6620 lu,amazon,spend,services,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2015-12-09,2017-08-12,True,"8892 03dec15 , amazon uk , marketplace , xxx-xxx-6620 lu - pos",364.220001,amazon,enjoyment,,,u,201512,312.183228,27638.970703,False
3308,109551674,2015-12-04,777,39.099998,8892 03dec 15 applegreen warrington gb,applegreen,spend,motor,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2015-12-09,2017-10-02,True,"8892 03dec15 , applegreen , warrington gb - pos",364.220001,applegreen,fuel,fuel,fuel,u,201512,312.183228,27638.970703,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3500,118238242,2016-01-18,777,25.000000,8892 16jan 16 tesco pfs 4010 warrington gb,tesco,spend,motor,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2016-01-23,1900-01-01,True,"8892 16jan16 , tesco pfs 4010 , warrington gb - pos",364.220001,tesco fuel,fuel,groceries,groceries,c,201601,-804.166687,24264.291016,False
3501,118238239,2016-01-18,777,21.240000,o2,o2,spend,communication,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2016-01-23,1900-01-01,True,o2 - d/d,364.220001,o2,mobile,,mobile,c,201601,-804.166687,24264.291016,False
3502,118238238,2016-01-18,777,25.000000,o2,o2,spend,communication,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2016-01-23,1900-01-01,True,o2 - d/d,364.220001,o2,mobile,,mobile,c,201601,-804.166687,24264.291016,False
3503,118238241,2016-01-18,777,-25.990000,8892 15jan 16 amazon uk marketplace 6620 lu refund,amazon,spend,retail,False,wa1 4,2011-07-20,20k to 30k,1969.0,2011-07-20,262916,2020-07-21 20:32:00,natwest bank,current,2016-01-23,1900-01-01,False,"8892 15jan16 , amazon uk , marketplace , xxx-xxx-6620 lu , refund - pos",364.220001,amazon,refunded purchase,,refunded purchase,c,201601,-804.166687,24264.291016,False


In [75]:
s = k

In [78]:
s.drop([3346, 3407])

Unnamed: 0,desc,group
3347,amazon prime 2454 79 00 pound sterling luxembourg,0
3376,8892 15dec 15 itunes com bill itunes com lu,1
3377,dgi sky protect,1
3406,interest on your standard balance interest 1 5230,2
3500,8892 16jan 16 tesco pfs 4010 warrington gb,3
3502,o2,3


In [40]:
# df.loc[122315:122319].pipe(drop_type1_dups)



Unnamed: 0,id,date,user_id,amount,desc,merchant,tag_group,tag,user_female,user_postcode,user_registration_date,user_salary_range,user_yob,account_created,account_id,account_last_refreshed,account_provider,account_type,data_warehouse_date_created,data_warehouse_date_last_updated,debit,latest_balance,merchant_business_line,tag_auto,tag_manual,tag_up,updated_flag,ym,balance,income,savings
122315,747676411,2019-07-19,578777,5.0,8450 18jul 19 c asda superstore farnworth gb,asda,spend,household,True,bl8 2,2020-03-29,30k to 40k,1988.0,2020-03-29,1648915,2020-08-16 20:04:00,natwest bank,current,2020-03-30,1900-01-01,True,709.48999,asda supermarket,"food, groceries, household",,"food, groceries, household",c,201907,-1052.890137,22296.589844,False
122316,747726214,2019-07-20,578777,50.0,bluedot on 19 jul bcc,,,,True,bl8 2,2020-03-29,30k to 40k,1988.0,2020-03-30,1649664,2020-08-16 20:04:00,barclays,current,2020-03-31,1900-01-01,True,266.179993,,,,,c,201907,-1547.049316,22296.589844,False
122317,747726215,2019-07-20,578777,20.0,bluedot on 19 jul bcc,,,,True,bl8 2,2020-03-29,30k to 40k,1988.0,2020-03-30,1649664,2020-08-16 20:04:00,barclays,current,2020-03-31,1900-01-01,True,266.179993,,,,,c,201907,-1547.049316,22296.589844,False
122318,747726216,2019-07-20,578777,20.0,bluedot on 19 jul bcc,,,,True,bl8 2,2020-03-29,30k to 40k,1988.0,2020-03-30,1649664,2020-08-16 20:04:00,barclays,current,2020-03-31,1900-01-01,True,266.179993,,,,,c,201907,-1547.049316,22296.589844,False
122319,747726213,2019-07-20,578777,25.33,lidl gb bury on 19 jul clp,lidl,spend,household,True,bl8 2,2020-03-29,30k to 40k,1988.0,2020-03-30,1649664,2020-08-16 20:04:00,barclays,current,2020-03-31,1900-01-01,True,266.179993,lidl,"food, groceries, household",,"food, groceries, household",c,201907,-1547.049316,22296.589844,False


In [85]:
import itertools

pairs = list(itertools.permutations(k.index, 2))[:3]
for p in pairs:
    print(k.loc[p,:])

                   desc  group
437         bmach 23dec      0
438  co operative 22dec      0
                 desc  group
437       bmach 23dec      0
458  sainsburys 06jan      1
                 desc  group
437       bmach 23dec      0
459  sainsburys 05jan      1


In [12]:
# old

import functools

def _potential_dup2_dups(df):
    cols=['user_id', 'account_id', 'date', 'amount']
    dups = df[df.duplicated(subset=cols, keep=False)].copy()
    dups['group'] = dups.groupby(cols).ngroup()
    return dups

def _identify_dup2(df):
    
    def helper(group):
    
        group['dup'] = False

        DescAndId = collections.namedtuple('DescAndID', ['desc', 'id'])
        shortest_first = functools.partial(sorted, key=lambda x: len(x.desc))

        items = [DescAndId(*item) for item in zip(group.desc, group.id)]
        shortest, *others = shortest_first(items)

        others_are_equal = len(set(others)) == 1
        others_ids = [o.id for o in others]

        if not others_are_equal:
            answer = False
        else:
            remainder = others[0].desc
            for w in shortest.desc.split():            
                if w in remainder:
                    remainder = remainder.replace(w, '', 1)
                else:
                    answer = False
                    break
                answer = True

            if not answer:
                remainder = shortest.desc
                for w in others[0].desc.split():
                    if w in remainder:                    
                        remainder = remainder.replace(w, '', 1)
                    else:
                        answer = False
                        break
                    answer = True

        group.loc[group.id.isin(others_ids), 'dup'] = answer
        return group
    
    return df.groupby('group').apply(helper)



def drop_dup2_old(df):
    df = df.copy()
    dups = _potential_dup2_dups(df)
    dups = _identify_dup2(dups)
    dups = dups[dups.dup].index
    return df.drop(dups)


## Case studies

Below are three case studies of duplicates.

In [None]:
dh.user_date_data(df, 35177, '1 Jan 2020')

In [None]:
dh.user_date_data(df, 362977, '1 Jan 2020')

In [None]:
dh.user_date_data(df, 467877, '1 Jan 2020')

## Type 1 duplicates

In [91]:
def distr(x):
    pcts = [.01, .05, .1, .25, .50, .75, .90, .95, .99]
    return x.describe(percentiles=pcts).round(2)

def duplicates_sample(df, col_subset, n=100, seed=2312):
    """Draws sample of size n of duplicate txns as defined by col_subset."""
    dups = df[df.duplicated(subset=col_subset, keep=False)].copy()
    dups['group'] = dups.groupby(col_subset).ngroup()
    unique_groups = np.unique(dups.group)
    rng = np.random.default_rng(seed=seed)
    sample = rng.choice(unique_groups, size=n)
    return dups[dups.group.isin(sample)]

### Definition
- `['user_id', 'date', 'amount', 'account_id', 'desc']` are identical.
 
- This includes transactions where desc for both is `<mdbremoved>`, where we assume that they mask the same transaction description.

- Reasons for false positives (FP): user makes two identical transactions on the same day (or on subsequent days for txns that appear with a delay). Plausible cases are coffee and betting shop txns. However, inspection suggests that the vast majority of cases are genuine duplicates, as they are txns that are unlikely to result from multiple purchases on the same day.
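
The cells below rely on `DataFrame.duplicated`, whose `keep` argument determines which members of a duplicate group are flagged. The toy frame here uses made-up values to show the difference: the default `keep='first'` flags only the repeats (the rows that would be dropped), while `keep=False` flags every member of the group, which is what `duplicates_sample` uses for inspection.

```python
import pandas as pd

# Toy transactions (made-up values) with one Type 1 duplicate pair.
toy = pd.DataFrame({
    'user_id': [1, 1, 1],
    'date': ['2020-01-01', '2020-01-01', '2020-01-01'],
    'amount': [2.5, 2.5, 9.0],
    'account_id': [10, 10, 10],
    'desc': ['coffee shop', 'coffee shop', 'tesco stores'],
})
cols = ['user_id', 'date', 'amount', 'account_id', 'desc']

# Default keep='first': only the second and later occurrences are flagged.
to_drop = toy.duplicated(subset=cols)
# keep=False: every member of a duplicate group is flagged.
all_members = toy.duplicated(subset=cols, keep=False)

print(to_drop.tolist())      # [False, True, False]
print(all_members.tolist())  # [True, True, False]
```

Prevalence figures computed from the default flag therefore count the surplus rows only, not the first occurrence in each group.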

In [92]:
col_subset = ['user_id', 'date', 'amount', 'account_id', 'desc']
dup_var = 'dup1'

df[dup_var] = df.duplicated(subset=col_subset)

### Prevalence and value

How prevalent are duplicates?

In [93]:
n_df = len(df)
n_dups = len(df[df[dup_var]])
n_users_dups = df[df[dup_var]].user_id.nunique()
n_users_df = df.user_id.nunique()
txt = 'About {:.1%} of transactions across {:.0%} of users are potential dups.'
print(txt.format(n_dups / n_df, n_users_dups / n_users_df))

About 1.9% of transactions across 97% of users are potential dups.


Gross value of duplicated txns

In [94]:
gross_value = df[df[dup_var]].set_index('user_id').amount.abs().groupby('user_id').sum()
distr(gross_value)

count       418.00
mean       5042.84
std       17571.44
min           0.17
1%            4.56
5%           20.70
10%          61.59
25%         239.15
50%         861.25
75%        2763.97
90%        9491.04
95%       17296.48
99%       66390.75
max      190906.02
Name: amount, dtype: float64

Most frequent txn descriptions

In [95]:
df[df[dup_var]].desc.value_counts(dropna=False)[:10]

mdbremoved                                3059
mdbremoved ft                              357
tfl travel ch tfl gov uk cp                298
tfl gov uk cp tfl travel ch                290
paypal payment                             271
b 365 moto                                 263
tfl travel charge tfl gov uk cp            202
betfair purchase                           196
faster payments receipt ref mdbremoved     186
www skybet com cd 9317                     165
Name: desc, dtype: int64

Most frequent auto tag

In [96]:
df[df[dup_var]].tag_auto.value_counts(dropna=False)[:10]

NaN                           7237
transfers                     3066
gambling                      2205
enjoyment                     1609
public transport              1557
lunch or snacks               1157
food, groceries, household     906
bank charges                   873
dining or going out            688
entertainment, tv, media       568
Name: tag_auto, dtype: int64

Proportion of txns per auto tag that are duplicated

In [97]:
txns_per_tag_overall = df.tag_auto.value_counts(dropna=False)
txns_per_tag_duplicated = df[df[dup_var]].tag_auto.value_counts(dropna=False) 
p_dup_per_tag = (txns_per_tag_duplicated / txns_per_tag_overall)
p_dup_per_tag.sort_values(ascending=False)[:10]

investment - other               0.214953
gambling                         0.160761
mobile app                       0.152677
isa                              0.088095
tradesmen fees                   0.076923
vehicle                          0.066667
supermarket                      0.064968
flights                          0.050346
repayments                       0.047753
child - everyday or childcare    0.047026
Name: tag_auto, dtype: float64

### Inspect dups

In [98]:
duplicates_sample(df, col_subset, n=2, seed=None).desc

642141     transfer from mdbremoved
642142     transfer from mdbremoved
1076408                  gocardless
1076409                  gocardless
Name: desc, dtype: object

## Type 2 dups

### Definition

- `['user_id', 'date', 'amount', 'account_id']` are identical, and one `desc` is a "loose subset" of the other.

Remove type 1 dups

In [99]:
df = df.drop_duplicates(subset=col_subset)

In [100]:
col_subset = ['user_id', 'date', 'amount', 'account_id']
dup_var = 'dup2'

df[dup_var] = df.duplicated(subset=col_subset)

### Prevalence and value

How prevalent are duplicates?

In [101]:
n_df = len(df)
n_dups = len(df[df[dup_var]])
n_users_dups = df[df[dup_var]].user_id.nunique()
n_users_df = df.user_id.nunique()
txt = 'About {:.1%} of transactions across {:.0%} of users are potential dups.'
print(txt.format(n_dups / n_df, n_users_dups / n_users_df))

About 1.7% of transactions across 99% of users are potential dups.


Gross value of duplicated txns

In [32]:
gross_value = df[df[dup_var]].set_index('user_id').amount.abs().groupby('user_id').sum()
distr(gross_value)

count       424.00
mean       2497.45
std        8311.57
min           3.00
1%           11.08
5%           48.28
10%         104.04
25%         298.47
50%         880.35
75%        2097.92
90%        4584.54
95%        6842.71
99%       25811.31
max      106598.39
Name: amount, dtype: float64

Most frequent txn descriptions

In [33]:
df[df[dup_var]].desc.str[:12].value_counts(dropna=False)[:10]

<mdbremoved>    3523
daily od fee    1894
int'l xxxxxx     941
card payment     463
tfl travel c     336
direct debit     319
call ref.no.     308
tfl.gov.uk/c     288
contactless      281
tesco stores     275
Name: desc, dtype: int64

Most frequent auto tag