# Fuzzy duplicate detection
fingerprinting

Two strings can easily be compared using fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy):

In [1]:
import pandas as pd
from fuzzywuzzy import fuzz

In [2]:
d = pd.DataFrame({'one': ['fuzz', 'wuzz'], 'two': ['fizz', 'woo']})

d.apply(lambda s: fuzz.partial_ratio(s['one'], s['two']), axis=1)

0    75
1    33
dtype: int64

But what if I want to perform such an operation on each row of a dataframe vs. all the other rows, e.g. to find duplicates?

In [3]:
d = pd.DataFrame({'id': ['1', '2', '3'], 'email': ['first@email.com', '1@email.com', 'iTrickYouAndStealIBAN1sBank@other.com'], 'bank': ['IBAN1', 'IBAN2', 'IBAN1'], 'name': ['name1', 'name1', 'name3'], 'date': ['2016-01-01', '2016-01-02', '2016-01-02']})
d

| | bank | date | email | id | name |
|---|---|---|---|---|---|
| 0 | IBAN1 | 2016-01-01 | first@email.com | 1 | name1 |
| 1 | IBAN2 | 2016-01-02 | 1@email.com | 2 | name1 |
| 2 | IBAN1 | 2016-01-02 | iTrickYouAndStealIBAN1sBank@other.com | 3 | name3 |


Get each row as a string

In [4]:
x = d.to_string(header=False,
                  index=False,
                  index_names=False).split('\n')
vals = pd.DataFrame([','.join(ele.split()) for ele in x])
vals

| | 0 |
|---|---|
| 0 | IBAN1,2016-01-01,first@email.com,1,name1 |
| 1 | IBAN2,2016-01-02,1@email.com,2,name1 |
| 2 | IBAN1,2016-01-02,iTrickYouAndStealIBAN1sBank@o... |
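Splitting the `to_string` output on whitespace would also split any value that itself contains a space. A possible alternative, sketched here under the assumption that every column can safely be cast to `str`, joins the cells of each row directly. The column sort is only there to mirror the column order of the output above, and the name `row_strings` is made up for this example.

```python
import pandas as pd

d = pd.DataFrame({
    'id': ['1', '2', '3'],
    'email': ['first@email.com', '1@email.com',
              'iTrickYouAndStealIBAN1sBank@other.com'],
    'bank': ['IBAN1', 'IBAN2', 'IBAN1'],
    'name': ['name1', 'name1', 'name3'],
    'date': ['2016-01-01', '2016-01-02', '2016-01-02'],
})

# Sort the columns alphabetically (bank, date, email, id, name), cast every
# cell to str, then join the cells of each row with commas.
row_strings = d[sorted(d.columns)].astype(str).apply(','.join, axis=1)
```

`row_strings` is a Series with one string per row and the same index as `d`.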


In [5]:
vals.apply(lambda s: fuzz.partial_ratio(s, vals), axis=1)

0    72
1    69
2    68
dtype: int64

### Question 1 (the basics)
How can I find duplicates of one row vs. all the other rows without a gigantic for loop that converts row_i to a string and then compares it to all the others?
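To make the intended comparison concrete, here is a minimal sketch that scores every pair of row strings exactly once. It deliberately uses the standard library's `difflib.SequenceMatcher` instead of fuzzywuzzy so that it runs without extra dependencies; `fuzz.ratio` or `fuzz.partial_ratio` could be swapped in for `similarity`. The helper names and the `THRESHOLD` value are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score for two strings."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def pairwise_scores(rows):
    """Score every unordered pair of rows once, keyed by (i, j) with i < j."""
    return {(i, j): similarity(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}


# The three row strings built from the example frame.
rows = [
    'IBAN1,2016-01-01,first@email.com,1,name1',
    'IBAN2,2016-01-02,1@email.com,2,name1',
    'IBAN1,2016-01-02,iTrickYouAndStealIBAN1sBank@other.com,3,name3',
]

scores = pairwise_scores(rows)

# Hypothetical cutoff: pairs scoring at or above it are duplicate candidates.
THRESHOLD = 70
candidates = [pair for pair, score in scores.items() if score >= THRESHOLD]
```

This still performs n(n-1)/2 comparisons, so it only hides the loop rather than removing the quadratic cost. For larger frames, comparing only rows that share some cheap key (for example the same `bank` value) would reduce the number of pairs.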

### Question 2 (what I want to achieve)
Use the date column and add an additional column to `d` that holds *daysSinceLastPurchase*. For *id 3*, for example, this should be `today - 2016-01-01`, because the "real last purchase" occurred just a day before.
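One way to read this requirement is: for each row, look up the previous purchase made by the same (deduplicated) customer and subtract that date from today. The sketch below assumes that rows sharing the same `bank` value belong to one customer, which stands in for whatever key the fuzzy matching would produce. It also uses a fixed, hypothetical reference date instead of the real current day so that the result is reproducible.

```python
import pandas as pd

d = pd.DataFrame({
    'id': ['1', '2', '3'],
    'bank': ['IBAN1', 'IBAN2', 'IBAN1'],
    'date': ['2016-01-01', '2016-01-02', '2016-01-02'],
})
d['date'] = pd.to_datetime(d['date'])

# Assumption: identical `bank` values mark the same customer. In practice
# this grouping key would come out of the fuzzy duplicate detection.
d = d.sort_values('date')
previous = d.groupby('bank')['date'].shift(1)  # NaT if no earlier purchase

# Hypothetical reference date; pd.Timestamp.today().normalize() would give
# the actual current day.
today = pd.Timestamp('2016-01-10')
d['daysSinceLastPurchase'] = (today - previous).dt.days
```

With this reference date, *id 3* gets 9 days (counted from 2016-01-01), while *id 1* and *id 2* have no earlier purchase in their group and end up with NaN.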
