# **1. Finding Similar Customers**

Companies nowadays implement product suggestions to show users items they are likely to buy. The process often starts by finding similar behaviours among consumers; this task focuses specifically on that step.
Here you will implement an algorithm that finds the most similar match to a consumer given their bank account information. In particular, you will implement your own version of the LSH algorithm, which takes information about a consumer as input and finds people similar to that consumer.

**1.1 Set up the data**

For this first part, not all columns are necessary, since comparing every single field can be quite time-expensive. Carefully read the linked guide above and work out which features are appropriate for this task (a heads-up: some users have more than one transaction record, so make sure to handle them all). Once you have finished, project a version of the dataset to work with.
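
Whichever features are kept, the date columns need some care: `CustomerDOB` uses two-digit years, and Python's `%y` directive maps 00 to 68 onto 2000 to 2068, so a birth date such as `4/4/57` would be parsed as 2057. The sketch below is one possible way to handle this, assuming day/month/year order and that no birth date can be later than 2016 (the year of the transactions shown). `parse_dob` and `latest_year` are illustrative names, not part of the assignment.

```python
from datetime import datetime

def parse_dob(text, latest_year=2016):
    """Parse a 'd/m/yy' birth date, moving impossible future years back a century."""
    parsed = datetime.strptime(text, "%d/%m/%y")
    # Assumption: nobody in the dataset is born after `latest_year`.
    if parsed.year > latest_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

print(parse_dob("10/1/94").date())  # 1994-01-10
print(parse_dob("4/4/57").date())   # 1957-04-04
```

On a DataFrame, a function like this could be applied to the column with `Series.map`, after deciding how missing values should be treated.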

In [None]:
import csv
import time
from datetime import date, datetime, timedelta
from random import randint

import numpy as np
import pandas as pd
from numpy import dot
from numpy.linalg import norm

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# import data
data =pd.read_csv('/content/drive/MyDrive/ADM_HW4/bank_transactions.csv',sep=',')


In [None]:
data

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,T1,C5841053,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0
1,T2,C2142763,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0
2,T3,C4417068,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0
3,T4,C5342380,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0
4,T5,C9031234,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5
...,...,...,...,...,...,...,...,...,...
1048562,T1048563,C8020229,8/4/90,M,NEW DELHI,7635.19,18/9/16,184824,799.0
1048563,T1048564,C6459278,20/2/92,M,NASHIK,27311.42,18/9/16,183734,460.0
1048564,T1048565,C6412354,18/5/89,M,HYDERABAD,221757.06,18/9/16,183313,770.0
1048565,T1048566,C6420483,30/8/78,M,VISAKHAPATNAM,10117.87,18/9/16,184706,1000.0


In [None]:
# drop the columns that are not needed
data = data.drop(['CustomerID','TransactionID'],axis=1)
data


Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0
1,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0
2,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0
3,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0
4,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5
...,...,...,...,...,...,...,...
1048562,8/4/90,M,NEW DELHI,7635.19,18/9/16,184824,799.0
1048563,20/2/92,M,NASHIK,27311.42,18/9/16,183734,460.0
1048564,18/5/89,M,HYDERABAD,221757.06,18/9/16,183313,770.0
1048565,30/8/78,M,VISAKHAPATNAM,10117.87,18/9/16,184706,1000.0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048567 entries, 0 to 1048566
Data columns (total 7 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   CustomerDOB              1045170 non-null  object 
 1   CustGender               1047467 non-null  object 
 2   CustLocation             1048416 non-null  object 
 3   CustAccountBalance       1046198 non-null  float64
 4   TransactionDate          1048567 non-null  object 
 5   TransactionTime          1048567 non-null  int64  
 6   TransactionAmount (INR)  1048567 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 56.0+ MB


In [None]:
# save the new csv file (to_csv returns None when a path is given, so its result is not assigned back to data)
data.to_csv("/content/drive/MyDrive/ADM_HW4/bank_transactions_new_version.csv", index=False)

**1.2 Fingerprint hashing**

We implement our MinHash function from scratch.

In [None]:
# Polynomial hash of a string: sum of ord(char) * w**position, reduced modulo n.
def Hash(Number,w):
    n = 2**32 - 1  # `**` is exponentiation; `^` is bitwise XOR in Python
    return sum([ord(x)*(w**i) for i,x in enumerate(Number)])%n

In [None]:
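def MinHash(ListfElements,w,perm):
    # one output value per field of the record
    list_ = [0 for i in range(len(ListfElements))]
    for i, Number in enumerate(ListfElements):
        # strings are first mapped to an integer; numeric fields are used as they are
        if isinstance(Number,str):
            Number = Hash(Number,w)
        # keep the minimum over all (a, b) pairs in perm
        l = float('inf')
        for a, b in perm:
            z = (a*Number+b)%w
            if l > z:
                l = z
        list_[i] = l  # store the minimum, not the value from the last pair
    return list_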
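
For comparison, the textbook form of MinHash works on a *set* of items per record and keeps, for each hash function, the minimum hash value over that set; the fraction of positions where two signatures agree then estimates the Jaccard similarity of the two sets. The following is a minimal sketch under stated assumptions: each record is turned into tokens of the form `column=value`, tokens are mapped to integers with `zlib.crc32`, and the names `record_to_set`, `make_hash_params`, `minhash_signature`, `estimated_jaccard` and the modulus `PRIME` are illustrative choices rather than the notebook's own code.

```python
import random
import zlib

PRIME = 2**61 - 1  # Mersenne prime used as modulus (an assumed choice)

def record_to_set(record):
    """Turn a dict {column: value} into a set of integer tokens."""
    return {zlib.crc32(f"{col}={val}".encode()) for col, val in record.items()}

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(items, params):
    """For every hash function, keep the minimum value over the whole set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two records agree."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

params = make_hash_params(num_hashes=128)
u = record_to_set({"CustGender": "F", "CustLocation": "MUMBAI", "TransactionDate": "2/8/16"})
v = record_to_set({"CustGender": "F", "CustLocation": "MUMBAI", "TransactionDate": "7/8/16"})
print(estimated_jaccard(minhash_signature(u, params), minhash_signature(v, params)))
```

The two example records share two of four distinct tokens, so their true Jaccard similarity is 0.5 and the printed estimate should land near that value; more hash functions give a tighter estimate at a higher cost per record.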

In [None]:
data =pd.read_csv('/content/drive/MyDrive/ADM_HW4/bank_transactions_new_version.csv')


In [None]:
data.head()

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0
1,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0
2,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0
3,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0
4,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5


In [None]:
w  = 256
MaximumValue = 2**32 - 1  # `**` is exponentiation; `^` is bitwise XOR in Python
n  = 12
perm = [(randint(1,MaximumValue),randint(1,MaximumValue)) for i in range(n)]

Process the dataset and add each record to the MinHash.

In [None]:
# compute the hash of every row and store it in the new "hashed" column
start = time.time()
data['hashed']  = [MinHash([*x[1]],w,perm) for x in data.iterrows()]
end = time.time()
print(end - start, 's')


102.22004580497742 s


In [None]:
data.head()

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),hashed
0,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0,"[79, 125, 62, 254.14999999996508, 131, 81, 79.0]"
1,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0,"[85, 243, 10, 17.87000000000262, 131, 30, 153.0]"
2,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0,"[220, 125, 177, 248.11999999999534, 131, 216, ..."
3,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0,"[243, 125, 177, 245.82999999821186, 131, 6, 36.0]"
4,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5,"[108, 125, 131, 79.89000000001397, 131, 204, 1..."


**1.3 Locality Sensitive Hashing**

After importing the query dataset, find the most similar user for each query by comparing it against the bank_transactions dataset, and report the matches.

In [None]:
# import data 
data1 =pd.read_csv('/content/drive/MyDrive/ADM_HW4/query_users.csv',sep=',')


In [None]:
data1.head()

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0
1,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0
2,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5
3,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0
4,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0


In [None]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CustomerDOB              50 non-null     object 
 1   CustGender               50 non-null     object 
 2   CustLocation             50 non-null     object 
 3   CustAccountBalance       50 non-null     float64
 4   TransactionDate          50 non-null     object 
 5   TransactionTime          50 non-null     int64  
 6   TransactionAmount (INR)  50 non-null     float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.9+ KB


In [None]:
# compute the hash of every query row and store it in the new "hashed" column
start = time.time()
data1['hashed']  = [MinHash([*x[1]],w,perm) for x in data1.iterrows()]
end = time.time()
print(end - start, 's')


0.006717681884765625 s


In [None]:
data1.head()

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),hashed
0,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0,"[59, 243, 36, 223.0299999997951, 62, 10, 231.0]"
1,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0,"[79, 243, 148, 114.07000000000698, 62, 138, 95.0]"
2,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5,"[243, 243, 125, 49.80000000001746, 59, 76, 182.5]"
3,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0,"[82, 243, 56, 122.25, 39, 246, 232.0]"
4,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0,"[200, 243, 85, 76.99000000022352, 105, 107, 64.0]"


In [None]:
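def cosine_similarity(vec1,vec2):
    # cosine of the angle between the two vectors
    return dot(vec1,vec2)/(norm(vec1)*norm(vec2))

def best_match(Queries):
    # scale the query signature by its largest entry
    query_hash = np.array(Queries.hashed)/max(Queries.hashed)
    # compare against every row of data and return (best similarity, row position)
    return max([(cosine_similarity(query_hash,(np.array(x)/max(x))),i) for i,x in enumerate(data.hashed)])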


In [None]:
# positions in data of the best match for each query user
dic = []
for i in range(len(data1)):
    x = best_match(data1.loc[i])
    dic.append(x[1])
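
The loop above compares each query with every row of `data`. The usual LSH step that avoids this full scan is banding: each signature is cut into `bands` groups of `rows` values, each band is used as a bucket key, and only records that share at least one bucket with the query are kept as candidates. A minimal sketch, assuming integer signatures of length `bands * rows`; `build_lsh_index` and `query_candidates` are hypothetical helper names:

```python
from collections import defaultdict

def build_lsh_index(signatures, bands, rows):
    """Build one bucket table per band, keyed by that band's slice of the signature."""
    index = [defaultdict(list) for _ in range(bands)]
    for item_id, sig in enumerate(signatures):
        for b in range(bands):
            key = tuple(sig[b * rows:(b + 1) * rows])
            index[b][key].append(item_id)
    return index

def query_candidates(index, sig, rows):
    """Return ids of all items that collide with `sig` in at least one band."""
    candidates = set()
    for b, buckets in enumerate(index):
        key = tuple(sig[b * rows:(b + 1) * rows])
        candidates.update(buckets.get(key, []))
    return candidates

signatures = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 7, 7]]
index = build_lsh_index(signatures, bands=2, rows=2)
print(query_candidates(index, [1, 2, 0, 0], rows=2))  # items 0 and 1 share the first band
```

Only the candidates returned for a query would then be scored with an exact similarity measure. More bands with fewer rows each make collisions more likely, which trades a larger candidate set for fewer missed neighbours.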

In [None]:
data1

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),hashed
0,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0,"[59, 243, 36, 223.0299999997951, 62, 10, 231.0]"
1,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0,"[79, 243, 148, 114.07000000000698, 62, 138, 95.0]"
2,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5,"[243, 243, 125, 49.80000000001746, 59, 76, 182.5]"
3,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0,"[82, 243, 56, 122.25, 39, 246, 232.0]"
4,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0,"[200, 243, 85, 76.99000000022352, 105, 107, 64.0]"
5,10/1/81,M,WORLD TRADE CENTRE BANGALORE,23143.95,11/9/16,192906,303.0,"[10, 243, 62, 102.84999999997672, 177, 118, 73.0]"
6,20/9/76,F,CHITTOOR,15397.8,28/8/16,92633,20.0,"[154, 125, 197, 117.39999999996508, 131, 143, ..."
7,10/4/91,M,MOHALI,426.3,2/8/16,203754,50.0,"[128, 243, 200, 92.89999999999964, 131, 22, 14..."
8,19/3/90,M,MOHALI,4609.34,26/8/16,184015,300.0,"[246, 243, 200, 46.820000000006985, 59, 169, 4.0]"
9,19/12/70,M,SERAMPORE,6695988.46,27/8/16,144030,299.0,"[125, 243, 62, 198.58000001311302, 223, 66, 23..."


In [None]:
data.loc[dic]

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),hashed
747502,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0,"[59, 243, 36, 223.0299999997951, 62, 10, 231.0]"
729233,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0,"[79, 243, 148, 114.07000000000698, 62, 138, 95.0]"
315449,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5,"[243, 243, 125, 49.80000000001746, 59, 76, 182.5]"
640748,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0,"[82, 243, 56, 122.25, 39, 246, 232.0]"
11626,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0,"[200, 243, 85, 76.99000000022352, 105, 107, 64.0]"
937293,10/1/81,M,WORLD TRADE CENTRE BANGALORE,23143.95,11/9/16,192906,303.0,"[10, 243, 62, 102.84999999997672, 177, 118, 73.0]"
619274,20/9/76,F,CHITTOOR,15397.8,28/8/16,92633,20.0,"[154, 125, 197, 117.39999999996508, 131, 143, ..."
79635,10/4/91,M,MOHALI,426.3,2/8/16,203754,50.0,"[128, 243, 200, 92.89999999999964, 131, 22, 14..."
600251,19/3/90,M,MOHALI,4609.34,26/8/16,184015,300.0,"[246, 243, 200, 46.820000000006985, 59, 169, 4.0]"
582929,19/12/70,M,SERAMPORE,6695988.46,27/8/16,144030,299.0,"[125, 243, 62, 198.58000001311302, 223, 66, 23..."


# **2. Grouping customers together!**



**2.1 Getting your data + feature engineering**

In [None]:
data =pd.read_csv('/content/drive/MyDrive/ADM_HW4/bank_transactions.csv',sep=',')


In [None]:
# collect every column into a per-customer list (one row per CustomerID)
data7 = data.groupby('CustomerID').agg(list).reset_index()

In [None]:
data7

Unnamed: 0,CustomerID,TransactionID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,C1010011,"[T33671, T173509]","[19/8/92, 5/8/83]","[F, M]","[NOIDA, NEW DELHI]","[32500.73, 120180.54]","[26/9/16, 9/8/16]","[123813, 11229]","[4750.0, 356.0]"
1,C1010012,[T363022],[28/7/94],[M],[MUMBAI],[24204.49],[14/8/16],[204409],[1499.0]
2,C1010014,"[T89544, T251648]","[4/6/92, 19/8/84]","[F, M]","[MUMBAI, MUMBAI]","[38377.14, 161848.76]","[1/8/16, 7/8/16]","[154451, 220305]","[1205.0, 250.0]"
3,C1010018,[T971994],[29/5/90],[F],[CHAMPARAN],[496.18],[15/9/16],[170254],[30.0]
4,C1010024,[T401396],[21/6/65],[M],[KOLKATA],[87058.65],[18/8/16],[141103],[5000.0]
...,...,...,...,...,...,...,...,...,...
884260,C9099836,[T226394],[24/12/90],[M],[BHIWANDI],[133067.23],[7/8/16],[5122],[691.0]
884261,C9099877,[T980444],[9/6/96],[M],[BANGALORE],[96063.46],[15/9/16],[120255],[222.0]
884262,C9099919,[T401232],[21/10/93],[M],[GUNTUR],[5559.75],[18/8/16],[122533],[126.0]
884263,C9099941,[T659168],[22/4/95],[M],[CHENNAI],[35295.92],[28/8/16],[213722],[50.0]


# **Algorithmic Question**

An imaginary university wants to restrict its students' entrance to the campus. Suppose that there are N entrances, M students and G guards. Due to the security measures, each student is assigned a gate through which they should enter the university.

Assume that the university's head of the guards knows the order in which the students arrive at the university (yeah, they know you more than you know about yourself!). He wants you to help him determine whether having only G guards is enough to enforce the restrictions they wish to apply (in other words, whether there will be a moment when more than G entrances have to be open).

In [None]:
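# read N (entrances), M (students) and G (guards)
n, m, g = map(int, input().split())
# a[i] is the entrance assigned to the i-th arriving student
a = list(map(int, input().split()))

# assumption: an entrance stays open from its first student until its last one
last = {gate: i for i, gate in enumerate(a)}

open_gates = set()  # entrances currently open
max_open = 0        # largest number of entrances open at the same moment
for i, gate in enumerate(a):
    open_gates.add(gate)
    max_open = max(max_open, len(open_gates))
    # the entrance closes right after its last student has passed
    if last[gate] == i:
        open_gates.remove(gate)

# G guards are enough only if at most G entrances are ever open at once
if max_open <= g:
    print("YES")
else:
    print("NO")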



4 5 1
1 1 3 3 3
YES
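
One way to sanity-check an answer is to reason in terms of intervals, assuming (as in the usual reading of this problem) that an entrance has to be open from the arrival of its first student until the arrival of its last one. The sketch below computes that interval for each entrance and counts how many intervals cover each moment. `max_open_entrances` and `guards_are_enough` are illustrative names, and the quadratic scan is meant for small checks rather than large inputs.

```python
def max_open_entrances(order):
    """Largest number of entrances that must be open at the same moment."""
    first, last = {}, {}
    for i, gate in enumerate(order):
        first.setdefault(gate, i)  # position of the first student for this gate
        last[gate] = i             # position of the last student seen so far
    # For every moment t, count the gates whose [first, last] interval covers t.
    return max(
        (sum(first[g] <= t <= last[g] for g in first) for t in range(len(order))),
        default=0,
    )

def guards_are_enough(order, guards):
    return max_open_entrances(order) <= guards

print("YES" if guards_are_enough([1, 1, 3, 3, 3], 1) else "NO")  # YES, as in the run above
```

For the arrival order `1 2 1 3 3 1`, entrance 1 stays open throughout while entrances 2 and 3 are never open together, so two guards suffice under this reading.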
