#### 1. Finding Similar Customers

Companies nowadays implement product suggestions to show users things they are likely to buy. The process often starts by finding similar behaviours among consumers; this task focuses specifically on that step.

Here you will implement an algorithm to find the most similar match to a consumer given their bank account information. In particular, you will implement your own version of the LSH algorithm, which takes information about a consumer as input and finds people similar to the one under study.

1.1 Set up the data

1. To start, download the banking dataset from Kaggle.

2. For this first part, not all columns are necessary, since comparing every field one by one can be quite time-consuming. Carefully read the linked guide above and try to understand which features are appropriate for this task (a heads-up: some users have more than one transaction record, so make sure to handle them all). Once you have finished, project a version of the dataset to work with.

In [25]:
from random import randint
import random

First, all rows with null values were dropped. The CustomerDOB and TransactionDate columns were converted to datetime.

All rows with a CustomerDOB year later than 1998 were removed: assuming the data was extracted in 2016, only people aged 18 or over can carry out bank transactions.

Likewise, transactions from customers with a birth year of 1800 were removed because those records are assumed to be inconsistent.

Finally, although deleting data is not ideal, we did so because the MinHash functions cannot be applied to NaN values, and filling them with 0 could distort the data and the calculations. We also assume that this kind of incomplete or outdated data is less common in practice, since companies have mature tooling to extract their data or subcontract third parties (AWS, Microsoft, etc.) to carry out this process.

In [26]:
import pandas as pd

data = pd.read_csv('bank_transactions.csv')

data.dropna(inplace=True)

data.CustomerDOB = pd.to_datetime(data.CustomerDOB)

data.TransactionDate = pd.to_datetime(data.TransactionDate)

data.drop(data[data.CustomerDOB.dt.year > 1998].index, axis=0, inplace=True)

data.drop(data[data.CustomerDOB.dt.year == 1800].index, axis=0, inplace=True)

data = data.reset_index()

In [27]:
data

Unnamed: 0,index,TransactionID,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,0,T1,C5841053,1994-10-01,F,JAMSHEDPUR,17819.05,2016-02-08,143207,25.0
1,2,T3,C4417068,1996-11-26,F,MUMBAI,17874.44,2016-02-08,142712,459.0
2,3,T4,C5342380,1973-09-14,F,MUMBAI,866503.21,2016-02-08,142714,2060.0
3,4,T5,C9031234,1988-03-24,F,NAVI MUMBAI,6714.43,2016-02-08,181156,1762.5
4,5,T6,C1536588,1972-08-10,F,ITANAGAR,53609.20,2016-02-08,173940,676.0
...,...,...,...,...,...,...,...,...,...,...
908979,1048562,T1048563,C8020229,1990-08-04,M,NEW DELHI,7635.19,2016-09-18,184824,799.0
908980,1048563,T1048564,C6459278,1992-02-20,M,NASHIK,27311.42,2016-09-18,183734,460.0
908981,1048564,T1048565,C6412354,1989-05-18,M,HYDERABAD,221757.06,2016-09-18,183313,770.0
908982,1048565,T1048566,C6420483,1978-08-30,M,VISAKHAPATNAM,10117.87,2016-09-18,184706,1000.0


We keep only the columns that also appear in the query file (query_user.csv).
So, we drop TransactionID and CustomerID.

In [28]:

data = data[['CustomerDOB','CustGender','CustLocation','CustAccountBalance','TransactionDate','TransactionTime','TransactionAmount (INR)']]

1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

1. Implement your minhash function from scratch. No ready-made hash functions are allowed. Read the class material and search the internet if you need to. For reference, it may be practical to look at the description of hash functions in the book.

2. Process the dataset and add each record to the MinHash. The subtask's goal is to try and map each consumer to its bin; to ensure this works well, be sure you understand how MinHash works and choose a matching threshold to use. Before moving on, experiment with different thresholds, explaining your choice.

Using the MinHash function:

 h(x) = (ax + b) % c

- x: the given number

- a, b: randomly chosen integers less than the maximum value of x

- c: prime number slightly bigger than the maximum value of x

Source: https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
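As a small worked illustration of the formula above, the sketch below evaluates h(x) for a few toy inputs. The name `toy_hash` and the constants a = 3, b = 5, c = 13 are assumptions chosen only for readability; they are not the coefficients used later in this notebook.

```python
def toy_hash(x, a, b, c):
    # Same form as h(x) = (ax + b) % c
    return (a * x + b) % c

# Toy parameters (illustrative only): c = 13 is a prime larger than every x below
a, b, c = 3, 5, 13
values = [0, 1, 2, 10, 12]
hashed = [toy_hash(x, a, b, c) for x in values]
print(hashed)  # [5, 8, 11, 9, 2]
```

Because c is prime and a is not a multiple of c, distinct inputs below c map to distinct outputs, so the function behaves like a shuffle of the values 0 to c - 1.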

In [29]:
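#All values must be integers before the hashing function can be applied

def date_to_int(value):
    #Seconds since the Unix epoch
    return int(pd.Timestamp(value).timestamp())

def string_to_int(value):
    #Sum of the character codes (note: anagrams map to the same integer)
    return sum(ord(x) for x in value)

def float_to_int(value):
    #Concatenate the digits before and after the decimal point, e.g. 17819.05 -> 1781905
    whole, decimal = str(value).split('.')
    return int(whole + decimal)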
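Note that `float_to_int` concatenates the printed digits, so the scale of the result depends on how many decimals Python prints (53609.20 becomes 536092, while 17819.05 becomes 1781905). One possible alternative, shown only as a sketch, is to convert every amount to a fixed number of hundredths. The helper name `float_to_cents` is hypothetical and is not used in the rest of this notebook, whose outputs were produced with `float_to_int`.

```python
def float_to_cents(value, decimals=2):
    # Hypothetical helper: scale by 10**decimals and round, so every
    # value is expressed in the same unit regardless of how it prints.
    return int(round(value * 10 ** decimals))

print(float_to_cents(17819.05))  # 1781905
print(float_to_cents(53609.20))  # 5360920
print(float_to_cents(25.0))      # 2500
```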

In [30]:
data.CustomerDOB = data['CustomerDOB'].apply(date_to_int)

data.CustGender = data['CustGender'].apply(string_to_int)

data.CustLocation = data['CustLocation'].apply(string_to_int)

data.CustAccountBalance = data['CustAccountBalance'].apply(float_to_int)

data.TransactionDate = data['TransactionDate'].apply(date_to_int)

data['TransactionAmount (INR)'] = data['TransactionAmount (INR)'].apply(float_to_int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.CustomerDOB = data['CustomerDOB'].apply(date_to_int)
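The warning above typically appears when a DataFrame obtained by selecting columns from another one is modified afterwards. One common way to avoid it, sketched here on a toy frame (the frame `raw` and its values are made up for illustration), is to take an explicit copy with `.copy()` when selecting the columns.

```python
import pandas as pd

# Toy data standing in for the bank dataset (values are illustrative)
raw = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "CustGender": ["F", "M"],
    "TransactionTime": [143207, 142712],
})

# .copy() makes `subset` an independent DataFrame, so later column
# assignments modify it directly instead of a possible view of `raw`.
subset = raw[["CustGender", "TransactionTime"]].copy()
subset["CustGender"] = subset["CustGender"].apply(lambda s: sum(ord(ch) for ch in s))

print(subset["CustGender"].tolist())  # [70, 77]
print(raw["CustGender"].tolist())     # ['F', 'M']
```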

With every column converted to integers, we can now apply the hash function.

In [31]:
data

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,780969600,70,755,1781905,1454889600,143207,250
1,848966400,70,443,1787444,1454889600,142712,4590
2,116812800,70,443,86650321,1454889600,142714,20600
3,575164800,70,777,671443,1454889600,181156,17625
4,82252800,70,583,536092,1454889600,173940,6760
...,...,...,...,...,...,...,...
908979,649728000,77,624,763519,1474156800,184824,7990
908980,698544000,77,446,2731142,1474156800,183734,4600
908981,611452800,77,644,22175706,1474156800,183313,7700
908982,273283200,77,968,1011787,1474156800,184706,10000


Following the source we rely on for MinHash, we first find the maximum value of x and then take the next prime number after it.

In [32]:
data.max()

CustomerDOB                 913507200
CustGender                         77
CustLocation                     2609
CustAccountBalance         4316555553
TransactionDate            1481241600
TransactionTime                235959
TransactionAmount (INR)     156003499
dtype: int64

In [33]:
from sympy import nextprime
#next prime number after the maximum value of x:
prime_after_max = nextprime(4316555553)
prime_after_max

4316555557

The maximum value of x is 4316555553 and the next prime after it is 4316555557.
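The notebook relies on sympy's `nextprime` for this step. If a dependency-free version is preferred, a trial-division sketch such as the one below should be fast enough here, since the square root of the maximum value is only about 65,700. The names `is_prime_td` and `next_prime_td` are hypothetical helpers, not part of the original notebook.

```python
def is_prime_td(n):
    # Trial division by 2 and then by odd numbers up to sqrt(n)
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def next_prime_td(n):
    # Smallest prime strictly greater than n
    candidate = n + 1
    while not is_prime_td(candidate):
        candidate += 1
    return candidate

print(next_prime_td(4316555553))  # 4316555557, the value reported above
```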

In [34]:
#Creating the Hash Function
def HashFunction(x,a,b,C):   
    return (a*x + b) % C    

In [35]:
#Applying hash function and getting min of the Hash value:
def mhash(info, C, list_of_tuples):
    #For each field value x of the record, evaluate every (a, b) hash function
    #and keep the smallest result, so the signature has one entry per field.
    signatures = []

    for x in info:
        hash_values = [HashFunction(x, a, b, C) for a, b in list_of_tuples]
        signatures.append(int(min(hash_values)))

    return signatures
    

In [36]:
import numpy as np
from tqdm import tqdm

We draw random integers for the coefficients a and b. Following the source, we set M equal to 2^32-1 and draw the coefficients between 0 and M.

Finally, we apply the MinHash function to the entire dataset.

In [37]:
#Random coefficients to the Hash function
def ran_coef(n_hash, M):
    
    a=[]
    b=[]
    for _ in range(n_hash):
       
        r_i = random.randint(0,M)
        a.append(r_i)
        r_i = random.randint(0,M)
        b.append(r_i)
    return a,b    

C  = 4316555557 #prime number
M = 2**32 - 1
n_hash = 10
a,b = ran_coef(n_hash,M) 

#Pair each a with its b: one (a, b) tuple per hash function
list_of_tuples = list(zip(a, b))

data['min-hash']  = [mhash([*x[1]],C,list_of_tuples) for x in (data.iterrows())]



In [38]:
data

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),min-hash
0,780969600,70,755,1781905,1454889600,143207,250,"[86559837, 400864551, 32408621, 2989391, 11048..."
1,848966400,70,443,1787444,1454889600,142712,4590,"[201678146, 400864551, 316595587, 189432663, 1..."
2,116812800,70,443,86650321,1454889600,142714,20600,"[134206891, 400864551, 316595587, 173303209, 1..."
3,575164800,70,777,671443,1454889600,181156,17625,"[330636252, 400864551, 64435310, 518418355, 11..."
4,82252800,70,583,536092,1454889600,173940,6760,"[68488861, 400864551, 21152619, 15033643, 1104..."
...,...,...,...,...,...,...,...,...
908979,649728000,77,624,763519,1474156800,184824,7990,"[183212031, 170275674, 429786670, 447513355, 2..."
908980,698544000,77,446,2731142,1474156800,183734,4600,"[40295660, 170275674, 63811494, 228398316, 205..."
908981,611452800,77,644,22175706,1474156800,183313,7700,"[710255379, 170275674, 15754056, 471195290, 20..."
908982,273283200,77,968,1011787,1474156800,184706,10000,"[453792297, 170275674, 597843560, 989050512, 2..."


As the output shows, identical field values receive identical hash values. For example, every row with gender F has 400864551 as its second signature entry, and every row with gender M has 170275674.
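For comparison, in the usual MinHash formulation each hash function contributes one signature entry: the minimum hash over all elements of a record's set. The fraction of matching entries between two signatures then approximates the Jaccard similarity of the two sets. The sketch below illustrates that formulation; it differs from the `mhash` function above, which takes the minimum across hash functions for each field. The names `minhash_signature`, `estimate_jaccard`, and `jaccard`, the fixed seed, and the toy sets are assumptions made for illustration.

```python
import random

def minhash_signature(items, coeffs, C):
    # One entry per hash function: the minimum of (a*x + b) % C over the set
    return [min((a * x + b) % C for x in items) for a, b in coeffs]

def estimate_jaccard(sig1, sig2):
    # Fraction of positions where the two signatures agree
    matches = sum(1 for s, t in zip(sig1, sig2) if s == t)
    return matches / len(sig1)

def jaccard(set1, set2):
    # Exact Jaccard similarity, for comparison
    return len(set1 & set2) / len(set1 | set2)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
C = 4316555557           # prime used in this notebook
coeffs = [(rng.randint(1, C - 1), rng.randint(0, C - 1)) for _ in range(200)]

# Toy sets: 100 distinct random integers, with 60 of them shared
pool = rng.sample(range(10**9), 100)
A = set(pool[:80])
B = set(pool[20:])

sig_a = minhash_signature(A, coeffs, C)
sig_b = minhash_signature(B, coeffs, C)

print(jaccard(A, B))                   # 0.6
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the exact value
```

Using more hash functions generally tightens the estimate, at the cost of longer signatures.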

In [None]:
#TODO: 1. Apply MinHash to the query dataset (same function as defined before) - 1.3
#2. Implement Jaccard or cosine similarity (or any other score) to obtain the LSH. With this score, answer question 1.2 (experiment with different thresholds) and also find the similarity with the query.
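One common way to carry out the LSH step mentioned in the TODO is banding: each signature is split into bands, each band is used as a bucket key, and records that share at least one bucket become candidate matches. The sketch below is only an outline under assumed names (`lsh_buckets`, `candidates_for`) and made-up toy signatures; it is not the notebook's final implementation.

```python
from collections import defaultdict

def lsh_buckets(signatures, n_bands):
    # signatures: dict mapping a record id to its signature (list of ints).
    # Each signature is cut into n_bands equal slices; records whose slice
    # is identical in a given band land in the same bucket.
    buckets = defaultdict(set)
    for rec_id, sig in signatures.items():
        rows = len(sig) // n_bands
        for band in range(n_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(rec_id)
    return buckets

def candidates_for(query_sig, buckets, n_bands):
    # Collect every record that shares at least one band with the query
    rows = len(query_sig) // n_bands
    found = set()
    for band in range(n_bands):
        key = (band, tuple(query_sig[band * rows:(band + 1) * rows]))
        found |= buckets.get(key, set())
    return found

# Toy signatures (made-up numbers): 4 entries each, 2 bands of 2 rows
signatures = {
    "u1": [5, 8, 11, 9],
    "u2": [5, 8, 3, 4],   # shares band 0 with u1
    "u3": [1, 2, 6, 7],   # shares no band with u1 or u2
}
buckets = lsh_buckets(signatures, n_bands=2)
print(sorted(candidates_for([5, 8, 0, 0], buckets, n_bands=2)))  # ['u1', 'u2']
```

With b bands of r rows each, pairs start to become candidates at a similarity of roughly (1/b)^(1/r), so varying the number of bands is one way to experiment with different thresholds.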