In [465]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import spacy 
import seaborn as sns
from datetime import datetime

# Elliot Linsey CMA Technical Test

Importing the data

In [466]:
df = pd.read_csv('searchdata.csv')
df.head()

Unnamed: 0,UserId,SearchTerm,SearchTime,SearchId
0,4d55f67bd1b7564b56e3f983b18f67df,piano,1648036174,9024b97a667be98dba7cc21f0f7a94ef
1,72f7cd70f17b00e7c7363936a8446dac,forks recipes,1647985710,2a7411a124810a7bb04276d91f18fd05
2,a93054a319fdb50b6ecebb316c4d54ac,bills reminder tracker,1647990762,4f5805a61cf45fc3b5b75eb2f4735443
3,789e68b59f0cafa6aea5fb8038f20b9b,travel planner,1647996107,944751607b79903d51b7e80bb770b8a1
4,7ac484989ecbe822b8e80b86d41f5d97,stream,1648048356,9ce5645ea6f10a8eb9f8f1f59ea42839


Checking the datatypes shows that SearchTime is Unix time stored as an int64. I define a session as follows: any searches by the same user within half an hour of each other belong to the same session. Unix time is measured in seconds, so two searches belong to the same session if they are at most 1800 seconds apart. To check this, subtract one search time from the other and take the absolute value; if the result is 1800 or less, the two searches are in the same session.

In [467]:
df.dtypes

UserId        object
SearchTerm    object
SearchTime     int64
SearchId      object
dtype: object
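
As a minimal sketch of the half-hour rule described above, the check for a single pair of searches could look like the following. The names `SESSION_GAP_SECONDS` and `same_session` are illustrative and not part of the notebook; the timestamps are taken from the first user in the data.

```python
from datetime import datetime, timezone

SESSION_GAP_SECONDS = 30 * 60  # half an hour expressed in seconds (1800)

def same_session(t1, t2, gap=SESSION_GAP_SECONDS):
    """Return True if two Unix timestamps are at most `gap` seconds apart."""
    return abs(t1 - t2) <= gap

# Two searches 22 seconds apart fall in the same session
print(same_session(1647994210, 1647994232))  # True

# Two searches roughly three hours apart do not
print(same_session(1647994269, 1648004801))  # False

# Unix time can be converted to a readable UTC timestamp for inspection
print(datetime.fromtimestamp(1647994210, tz=timezone.utc))
```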

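I first use a test dataset of just one user to test my method. In chronological order, this user has 3 distinct sessions: the first three searches, the fourth search, and the last two searches. Note that the rows below are not yet sorted by time: row 0 is chronologically the last search.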

In [469]:
test = df[df.UserId=='4d55f67bd1b7564b56e3f983b18f67df'].reset_index(drop=True)
test

Unnamed: 0,UserId,SearchTerm,SearchTime,SearchId
0,4d55f67bd1b7564b56e3f983b18f67df,piano,1648036174,9024b97a667be98dba7cc21f0f7a94ef
1,4d55f67bd1b7564b56e3f983b18f67df,detector fingerprint,1647994210,3c5b9c1af23a22125ad34591a8176d43
2,4d55f67bd1b7564b56e3f983b18f67df,shape meter camera,1647994232,0ded87a446f1179aa58ce89078bea215
3,4d55f67bd1b7564b56e3f983b18f67df,shake flashlight compass,1647994269,ee1e5bdcb5c16b7f950884d1e2d857b6
4,4d55f67bd1b7564b56e3f983b18f67df,flight tracker,1648004801,17482fc9d78d745fc471825b555a6d87
5,4d55f67bd1b7564b56e3f983b18f67df,piano crush keyboard games,1648036137,8a188d6431abf0ad4e210e677b1e8bc9


The bins function computes the distance between every pair of search times for a user and counts, for each search, how many searches (itself included) lie within 1800 seconds of it. For this user, sorted by time, the result is the array [3,3,3,1,2,2]; searches that share a value are treated as one session.

In [481]:
def bins(series):
    # Build the matrix of pairwise differences between all search times
    pairwise = []
    for y in series:
        row = []
        for x in series:
            row.append(y - x)
        pairwise.append(row)
    distances = np.abs(np.array(pairwise))
    # For each search, count the searches (itself included) within 1800 seconds
    counts = np.bincount(np.where(distances <= 1800)[1])
    return counts
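
To make the counting step concrete, here is a sketch of the same calculation written with NumPy broadcasting instead of nested loops. The name `neighbour_counts` is hypothetical; it is intended to produce the same counts as the bins function above, using the test user's six search times.

```python
import numpy as np

def neighbour_counts(times, gap=1800):
    """For each timestamp, count the timestamps (itself included) within `gap` seconds."""
    times = np.asarray(times)
    # times[:, None] - times[None, :] is the full matrix of pairwise differences
    distances = np.abs(times[:, None] - times[None, :])
    # Summing the boolean matrix down each column gives the per-search count
    return (distances <= gap).sum(axis=0)

# The test user's search times in chronological order
sorted_times = [1647994210, 1647994232, 1647994269,
                1648004801, 1648036137, 1648036174]
print(neighbour_counts(sorted_times))  # [3 3 3 1 2 2]

# The same times in the original, unsorted row order
unsorted_times = [1648036174, 1647994210, 1647994232,
                  1647994269, 1648004801, 1648036137]
print(neighbour_counts(unsorted_times))  # [2 3 3 3 1 2]
```

The counts follow the order of the input, which is why the rows need to be in a known order before the result is attached to the dataframe.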

In [482]:
from numpy import random

For the method to work, the dataset has to be sorted by both UserId and SearchTime.

In [485]:
df = df.sort_values(['UserId','SearchTime'])
df

Unnamed: 0,UserId,SearchTerm,SearchTime,SearchId
24388,0007521478fa064fb17a3e4513aae615,tracer,1648003187,2aa8736f9b557b97a6a57753c328e3f7
4746,0007521478fa064fb17a3e4513aae615,tracker,1648003218,4daf2f6458df338afd2dbead4e1786f8
9591,0007521478fa064fb17a3e4513aae615,fantasy racing,1648014030,07a99d380cff22644c09abef3ebdcd6f
45275,0007521478fa064fb17a3e4513aae615,trivia puzzle fortune games,1648014065,94d8ced5a7adcfc5b02f10666fe3bc1c
30186,0007521478fa064fb17a3e4513aae615,social games bonds,1648014092,62ad884ab808b6a6343c141ef4b037ac
...,...,...,...,...
9195,fffd74244a44e5f627745f8f1d063139,financial times,1648002767,2eab1a7c6f52b09e86018f69b7e95517
33345,fffd74244a44e5f627745f8f1d063139,train times,1648002799,0ec4bb2ce37d07dba2b08082bb454928
42499,fffd74244a44e5f627745f8f1d063139,verify,1648044276,1f5d692ddad91a9fc020a6f886068379
7294,fffd74244a44e5f627745f8f1d063139,order,1648044306,7b566745050f39a77e190fb4c7409e60


The sessionID function loops over each UserId in the sorted dataframe and computes that user's session bins. To make duplicate session IDs between users unlikely (a random offset does not guarantee uniqueness), it adds a random integer to that user's bins. For example, with the previous array of [3,3,3,1,2,2] it may add 250,000 to each value, so the final session IDs are [250003,250003,250003,250001,250002,250002].

In [486]:
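def sessionID(df):
    usernames = pd.unique(df.UserId)
    sessionids = []
    for x in usernames:
        # Select this user's searches in chronological order
        one = df[df.UserId==x].reset_index(drop=True)
        one = one.sort_values(by='SearchTime')
        # Per-search neighbour counts act as this user's session labels
        one['SessionID'] = bins(one.SearchTime.to_list())
        # Offset the labels by a random integer so different users rarely share IDs
        sessionids.append(one['SessionID']+random.randint(0,400000))
    return sessionids

sessionids = sessionID(df)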

In [489]:
list2 = []
for x in sessionids:
    for y in x:
        list2.append(y)


The function above returns a list of pandas Series, one per user. The nested for loops flatten these into a single list, whose length equals the number of rows in the original dataframe.

In [490]:
print(len(list2))

46967
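
As an aside, the flattening step could probably also be written with `pd.concat`. The sketch below uses a small hypothetical stand-in for the list of per-user Series rather than the real output of sessionID; the values are the illustrative IDs used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the list of per-user Series returned by sessionID
per_user = [
    pd.Series([250003, 250003, 250003, 250001, 250002, 250002]),
    pd.Series([92125, 92125, 92127, 92127, 92127]),
]

# Concatenate the Series end to end and convert to a plain list
flat = pd.concat(per_user, ignore_index=True).to_list()
print(len(flat))  # 11
```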


We add this list as a new column, 'SessionIDs'. Returning to the test user, the sessions have been assigned correctly.

In [491]:
df['SessionIDs'] = list2
df[df.UserId == '4d55f67bd1b7564b56e3f983b18f67df']

Unnamed: 0,UserId,SearchTerm,SearchTime,SearchId,SessionIDs
5551,4d55f67bd1b7564b56e3f983b18f67df,detector fingerprint,1647994210,3c5b9c1af23a22125ad34591a8176d43,108510
26878,4d55f67bd1b7564b56e3f983b18f67df,shape meter camera,1647994232,0ded87a446f1179aa58ce89078bea215,108510
26953,4d55f67bd1b7564b56e3f983b18f67df,shake flashlight compass,1647994269,ee1e5bdcb5c16b7f950884d1e2d857b6,108510
31679,4d55f67bd1b7564b56e3f983b18f67df,flight tracker,1648004801,17482fc9d78d745fc471825b555a6d87,108508
45741,4d55f67bd1b7564b56e3f983b18f67df,piano crush keyboard games,1648036137,8a188d6431abf0ad4e210e677b1e8bc9,108509
0,4d55f67bd1b7564b56e3f983b18f67df,piano,1648036174,9024b97a667be98dba7cc21f0f7a94ef,108509


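I have realised that if a user has two sessions with the same number of searches (for example, two sessions of 3 searches each), both sessions end up with the same session ID. The last user in the full dataframe shows this: their two pairs of searches are more than eleven hours apart, yet all four rows receive 296244. This could be fixed by separating the bins and adding a separate random number to each; unfortunately, I ran out of time to implement this.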
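
One possible way to avoid the duplicate-ID problem, sketched here as an alternative and not the method used in this notebook, is to start a new session whenever the gap to the same user's previous search exceeds 1800 seconds and then number the sessions cumulatively. The function name `assign_session_ids` and the `gap_seconds` parameter are assumptions for illustration. Note that this chains consecutive searches, so a long run of closely spaced searches stays in one session even if it spans more than half an hour in total.

```python
import pandas as pd

def assign_session_ids(df, gap_seconds=1800):
    """Label sessions by gaps between consecutive searches of the same user."""
    out = df.sort_values(['UserId', 'SearchTime']).copy()
    # Seconds since the same user's previous search (NaN for a user's first search)
    gap = out.groupby('UserId')['SearchTime'].diff()
    # A session starts at a user's first search or after a gap longer than the limit
    new_session = gap.isna() | (gap > gap_seconds)
    # Running count of session starts gives an ID that is unique across all users
    out['SessionID'] = new_session.cumsum()
    return out

# Two users taken from the data above, in unsorted order
sample = pd.DataFrame({
    'UserId': ['4d55f67bd1b7564b56e3f983b18f67df'] * 6
              + ['fffd74244a44e5f627745f8f1d063139'] * 4,
    'SearchTime': [1648036174, 1647994210, 1647994232, 1647994269,
                   1648004801, 1648036137,
                   1648002767, 1648002799, 1648044276, 1648044306],
})
result = assign_session_ids(sample)
print(result['SessionID'].tolist())  # [1, 1, 1, 2, 3, 3, 4, 4, 5, 5]
```

With this approach the second user's two sessions get different IDs (4 and 5) even though both contain two searches.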

Below is the full dataframe.

In [492]:
df

Unnamed: 0,UserId,SearchTerm,SearchTime,SearchId,SessionIDs
24388,0007521478fa064fb17a3e4513aae615,tracer,1648003187,2aa8736f9b557b97a6a57753c328e3f7,92125
4746,0007521478fa064fb17a3e4513aae615,tracker,1648003218,4daf2f6458df338afd2dbead4e1786f8,92125
9591,0007521478fa064fb17a3e4513aae615,fantasy racing,1648014030,07a99d380cff22644c09abef3ebdcd6f,92127
45275,0007521478fa064fb17a3e4513aae615,trivia puzzle fortune games,1648014065,94d8ced5a7adcfc5b02f10666fe3bc1c,92127
30186,0007521478fa064fb17a3e4513aae615,social games bonds,1648014092,62ad884ab808b6a6343c141ef4b037ac,92127
...,...,...,...,...,...
9195,fffd74244a44e5f627745f8f1d063139,financial times,1648002767,2eab1a7c6f52b09e86018f69b7e95517,296244
33345,fffd74244a44e5f627745f8f1d063139,train times,1648002799,0ec4bb2ce37d07dba2b08082bb454928,296244
42499,fffd74244a44e5f627745f8f1d063139,verify,1648044276,1f5d692ddad91a9fc020a6f886068379,296244
7294,fffd74244a44e5f627745f8f1d063139,order,1648044306,7b566745050f39a77e190fb4c7409e60,296244


I ran out of time to implement task 2, but I would use spaCy to build a vocabulary of all the tokens in the search terms, then use similarity measures to recommend similar terms per session ID.
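
To illustrate the general idea of similarity-based recommendation, here is a rough sketch that uses plain token overlap (Jaccard similarity) as a stand-in for the spaCy vector similarity I had in mind. It is not a spaCy implementation; the function names `jaccard` and `recommend` are hypothetical, and the search terms come from the data above.

```python
def jaccard(a, b):
    """Share of distinct tokens that two search terms have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def recommend(query, candidates, top_n=3):
    """Return up to top_n candidate terms that overlap with the query, best first."""
    scored = [(term, jaccard(query, term)) for term in candidates if term != query]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

terms = ['piano', 'piano crush keyboard games', 'flight tracker',
         'bills reminder tracker', 'travel planner']

print(recommend('piano', terms))           # [('piano crush keyboard games', 0.25)]
print(recommend('flight tracker', terms))  # [('bills reminder tracker', 0.25)]
```

Token overlap only matches identical words, so it would miss related terms such as 'tracer' and 'tracker'; word-vector similarity is the kind of measure that could address that.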

Thanks for reading.

Elliot Linsey