Notebook purpose

- Understand the nature of duplicate transactions, explore solutions, and document decisions about which duplicates to drop.

Types of duplicates

Type 1 duplicates:  `['user_id', 'date', 'amount', 'account_id', 'desc']` are identical

Type 2 duplicates: `['user_id', 'date', 'amount', 'account_id']` are identical and one `desc` is a "loose subset" of the other (i.e. each word in one desc appears somewhere in the other, though not necessarily in the same order).
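The "loose subset" criterion can be sketched as a set comparison on whitespace-separated words. This is a minimal illustration, not the project's implementation: the helper name `is_loose_subset` is hypothetical, and the sketch ignores how often a word is repeated.

```python
def is_loose_subset(desc_a, desc_b):
    """Return True if every word of one description appears in the other.

    Hypothetical helper: word order and repeated words are ignored.
    """
    words_a = set(desc_a.split())
    words_b = set(desc_b.split())
    return words_a <= words_b or words_b <= words_a


# Word order does not matter, only membership.
print(is_loose_subset("tfl travel ch", "tfl gov uk cp tfl travel ch"))
print(is_loose_subset("paypal payment", "betfair purchase"))
```

Note that under this sketch an empty description is a loose subset of any other description, which may or may not be desirable.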

Approach taken:

- Clean description strings to remove extraneous characters that obscure type 1 duplicates.
- Remove type 1 duplicates
- Identify and remove type 2 duplicates
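The cleaning step might look roughly like the sketch below. The function `clean_desc` and its rules (lowercase, replace runs of non-alphanumeric characters with a single space, trim) are assumptions for illustration; the cleaners in `entropy.data.cleaners` may work differently.

```python
import pandas as pd


def clean_desc(desc):
    """Normalise description strings (hypothetical rules, for illustration)."""
    return (
        desc.str.lower()
        # Collapse any run of non-alphanumeric characters into one space.
        .str.replace(r"[^a-z0-9]+", " ", regex=True)
        .str.strip()
    )


raw = pd.Series(["TFL.gov.uk/CP", "  PayPal  *Payment "])
cleaned = clean_desc(raw)
print(cleaned.tolist())
```

After cleaning, descriptions that differ only in punctuation, case, or spacing compare as equal, so they are caught by the type 1 criterion.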

In [1]:
import os
import sys

import numpy as np
import pandas as pd
import seaborn as sns

sys.path.append("/Users/fgu/dev/projects/entropy")
import entropy.data.cleaners as cl
import entropy.helpers.aws as aws

sns.set_style("whitegrid")
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
pd.set_option("max_colwidth", None)
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [None]:
def user_date_data(df, user_id, date):
    """Returns data for specified user and date sorted by amount."""
    return (
        df.loc[df.user_id == user_id]
        .set_index("date")
        .loc[date]
        .sort_values("amount", ascending=False)
    )

In [7]:
df = aws.read_parquet("~/tmp/entropy_777.parquet")

## Case studies

Below are three case studies of duplicates.

In [None]:
user_date_data(df, 35177, "1 Jan 2020")

In [None]:
user_date_data(df, 362977, "1 Jan 2020")

In [None]:
user_date_data(df, 467877, "1 Jan 2020")

## Type 1 duplicates

In [91]:
def distr(x):
    """Returns summary statistics of x with extended percentiles, rounded to 2 decimals."""
    pcts = [0.01, 0.05, 0.1, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]
    return x.describe(percentiles=pcts).round(2)


def duplicates_sample(df, col_subset, n=100, seed=2312):
    """Draws sample of up to n groups of duplicate txns as defined by col_subset."""
    dups = df[df.duplicated(subset=col_subset, keep=False)].copy()
    dups["group"] = dups.groupby(col_subset).ngroup()
    unique_groups = np.unique(dups.group)
    rng = np.random.default_rng(seed=seed)
    # Sample without replacement so that n distinct groups are returned
    # (capped at the number of available groups).
    size = min(n, len(unique_groups))
    sample = rng.choice(unique_groups, size=size, replace=False)
    return dups[dups.group.isin(sample)]

### Definition
- `['user_id', 'date', 'amount', 'account_id', 'desc']` are identical.
 
- This includes transactions where `desc` is `<mdbremoved>` for both, where we assume that the masks hide the same transaction description.

- Reasons for false positives (FP): user makes two identical transactions on the same day (or on subsequent days for txns that appear with a delay). Plausible cases are coffee and betting shop txns. However, inspection suggests that the vast majority of cases are genuine duplicates, as they are txns that are unlikely to result from multiple purchases on the same day.
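The definition can be illustrated on a toy frame (the data below is made up). With the default `keep="first"`, `duplicated` flags only the surplus copies, which is what a prevalence count needs; with `keep=False`, it flags every member of a duplicate group, which is more useful for inspection.

```python
import pandas as pd

# Made-up transactions: user 1 has three identical rows, user 2 has one.
txns = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "date": ["2020-01-01"] * 4,
        "amount": [2.5, 2.5, 2.5, 9.0],
        "account_id": [10, 10, 10, 20],
        "desc": ["coffee shop"] * 3 + ["gocardless"],
    }
)
cols = ["user_id", "date", "amount", "account_id", "desc"]

# Flags all but the first occurrence in each group.
flag_extra = txns.duplicated(subset=cols)
# Flags every row that belongs to a group with more than one member.
flag_all = txns.duplicated(subset=cols, keep=False)

print(flag_extra.sum(), flag_all.sum())
```

If the three coffee shop rows were genuine separate purchases, dropping the flagged rows would be a false positive of the kind described above.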

In [92]:
col_subset = ["user_id", "date", "amount", "account_id", "desc"]
dup_var = "dup1"

df[dup_var] = df.duplicated(subset=col_subset)

### Prevalence and value

How prevalent are duplicates?

In [93]:
n_df = len(df)
n_dups = len(df[df[dup_var]])
n_users_dups = df[df[dup_var]].user_id.nunique()
n_users_df = df.user_id.nunique()
txt = "About {:.1%} of transactions across {:.0%} of users are potential dups."
print(txt.format(n_dups / n_df, n_users_dups / n_users_df))

About 1.9% of transactions across 97% of users are potential dups.


Gross value of duplicated txns

In [94]:
gross_value = df[df[dup_var]].set_index("user_id").amount.abs().groupby("user_id").sum()
distr(gross_value)

count       418.00
mean       5042.84
std       17571.44
min           0.17
1%            4.56
5%           20.70
10%          61.59
25%         239.15
50%         861.25
75%        2763.97
90%        9491.04
95%       17296.48
99%       66390.75
max      190906.02
Name: amount, dtype: float64

Most frequent txn descriptions

In [95]:
df[df[dup_var]].desc.value_counts(dropna=False)[:10]

mdbremoved                                3059
mdbremoved ft                              357
tfl travel ch tfl gov uk cp                298
tfl gov uk cp tfl travel ch                290
paypal payment                             271
b 365 moto                                 263
tfl travel charge tfl gov uk cp            202
betfair purchase                           196
faster payments receipt ref mdbremoved     186
www skybet com cd 9317                     165
Name: desc, dtype: int64

Most frequent auto tag

In [96]:
df[df[dup_var]].tag_auto.value_counts(dropna=False)[:10]

NaN                           7237
transfers                     3066
gambling                      2205
enjoyment                     1609
public transport              1557
lunch or snacks               1157
food, groceries, household     906
bank charges                   873
dining or going out            688
entertainment, tv, media       568
Name: tag_auto, dtype: int64

Proportion of txns per auto tag that are duplicated

In [97]:
txns_per_tag_overall = df.tag_auto.value_counts(dropna=False)
txns_per_tag_duplicated = df[df[dup_var]].tag_auto.value_counts(dropna=False)
p_dup_per_tag = txns_per_tag_duplicated / txns_per_tag_overall
p_dup_per_tag.sort_values(ascending=False)[:10]

investment - other               0.214953
gambling                         0.160761
mobile app                       0.152677
isa                              0.088095
tradesmen fees                   0.076923
vehicle                          0.066667
supermarket                      0.064968
flights                          0.050346
repayments                       0.047753
child - everyday or childcare    0.047026
Name: tag_auto, dtype: float64

### Inspect dups

In [98]:
duplicates_sample(df, col_subset, n=2, seed=None).desc

642141     transfer from mdbremoved
642142     transfer from mdbremoved
1076408                  gocardless
1076409                  gocardless
Name: desc, dtype: object

## Type 2 duplicates

### Definition

- `['user_id', 'date', 'amount', 'account_id']` are identical, and one `desc` is a loose subset of the other.

The stats below are a generous upper bound on the type 2 duplicate problem: only a minority of potential type 2 duplicates (txns flagged as duplicates under the type 2 `col_subset` criterion) are actual type 2 duplicates.
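To see why the column-based flag overstates the problem, consider the toy example below (made-up data). Both pairs share the four key columns, but only one pair has descriptions where one is a loose subset of the other. The helper `is_loose_subset` is hypothetical, and comparing each row only with the first row of its group is a simplification.

```python
import pandas as pd


def is_loose_subset(desc_a, desc_b):
    """Hypothetical helper: every word of one desc appears in the other."""
    words_a, words_b = set(desc_a.split()), set(desc_b.split())
    return words_a <= words_b or words_b <= words_a


txns = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "date": ["2020-01-01"] * 4,
        "amount": [-2.4, -2.4, -15.0, -15.0],
        "account_id": [10, 10, 20, 20],
        "desc": [
            "tfl travel ch",
            "tfl gov uk cp tfl travel ch",
            "daily od fee",
            "card payment tesco",
        ],
    }
)
key = ["user_id", "date", "amount", "account_id"]

# Potential type 2 dups: same key columns, desc not considered.
potential = txns.duplicated(subset=key)

# Actual type 2 dups: additionally require a loose-subset match with the
# first description in the same key group.
first_desc = txns.groupby(key).desc.transform("first")
match = pd.Series(
    [is_loose_subset(a, b) for a, b in zip(txns.desc, first_desc)],
    index=txns.index,
)
actual = potential & match

print(potential.sum(), actual.sum())
```

Here two rows are flagged as potential duplicates but only one survives the description check, so counts based on `col_subset` alone overstate the number of type 2 duplicates.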

Remove type 1 dups

In [99]:
df = df.drop_duplicates(subset=col_subset)

In [100]:
col_subset = ["user_id", "date", "amount", "account_id"]
dup_var = "dup2"

df[dup_var] = df.duplicated(subset=col_subset)

### Prevalence and value

How prevalent are duplicates?

In [101]:
n_df = len(df)
n_dups = len(df[df[dup_var]])
n_users_dups = df[df[dup_var]].user_id.nunique()
n_users_df = df.user_id.nunique()
txt = "About {:.1%} of transactions across {:.0%} of users are potential dups."
print(txt.format(n_dups / n_df, n_users_dups / n_users_df))

About 1.7% of transactions across 99% of users are potential dups.


Gross value of duplicated txns

In [32]:
gross_value = df[df[dup_var]].set_index("user_id").amount.abs().groupby("user_id").sum()
distr(gross_value)

count       424.00
mean       2497.45
std        8311.57
min           3.00
1%           11.08
5%           48.28
10%         104.04
25%         298.47
50%         880.35
75%        2097.92
90%        4584.54
95%        6842.71
99%       25811.31
max      106598.39
Name: amount, dtype: float64

Most frequent txn descriptions

In [33]:
df[df[dup_var]].desc.str[:12].value_counts(dropna=False)[:10]

<mdbremoved>    3523
daily od fee    1894
int'l xxxxxx     941
card payment     463
tfl travel c     336
direct debit     319
call ref.no.     308
tfl.gov.uk/c     288
contactless      281
tesco stores     275
Name: desc, dtype: int64

Most frequent auto tag