# How to Generate Implicit Judgment Lists from User Click Data

This notebook outlines a step-by-step process for generating a judgment list that can be used to calculate search engine relevance metrics like nDCG. A judgment list is a collection of query/document pairs, each given a relevance rating. For example, for the query "red car", the document named "Lightning McQueen is a Red Racing Car" might be rated as highly relevant, or 3 on a 0-3 scale.

Once a judgment list is created, it can be used to calculate nDCG (and other relevance metrics) in a data analytics pipeline or in an open-source tool like Quepid.

### Notebook Outline

1. Install & Import Dependencies
2. Load the Data
3. Calculate Simple Click Through Rate (CTR)
4. Overcome Position Bias with Simplified Dynamic Bayesian Network
5. Overcome Confidence Bias with Beta Distribution
6. Format the Judgment List

This whole process pulls heavily from Chapter 11 of the book ["AI-Powered Search"](https://www.manning.com/books/ai-powered-search) by Trey Grainger, Doug Turnbull, and Max Irwin. For a deeper explanation of the process, and for information on how to use a judgment list for AI features like Learning to Rank, please support the authors and buy the book!

# Install & Import Dependencies
All you'll need for this notebook is Python (of course) and pandas to manipulate the data. You can provide your own data, or choose to load our full or small dataset.

In [None]:
%pip install pandas
import pandas as pd

# Load the Data

The sample data is stored in a compressed pickle file which can be read directly into a pandas dataframe.

While the compressed file `implicit_judgments_data.pkl` is only a couple of gigabytes, the original CSV was over 24 GB. The dataframe ends up being a little over 31 GB in memory with the index, so it could take a few minutes to load.

The `implicit_judgments_data_small.pkl` file is a smaller dataset that loads much quicker (a few seconds) and might be better for demo purposes. The original CSV was only 500 MB, and the dataframe is a little over 700 MB when loaded in memory.

In [66]:
# uncomment the dataset you want to use, or load your own!

# dataset = 'implicit_judgments_data_small.pkl'
# dataset = 'implicit_judgments_data.pkl'

# load the dataset
df = pd.read_pickle(dataset, compression='bz2')

### Inspect the dataframe

In [199]:
df[:10]

Unnamed: 0,session_id,user_id,query_id,doc_id,position,clicked
0,34573630,15,10509813,34175267,1,1
1,34573630,15,10509813,34171511,2,0
2,34573630,15,10509813,35444452,3,0
3,34573630,15,10509813,15370141,4,0
4,34573630,15,10509813,31342884,5,0
5,34573630,15,10509813,43630531,6,0
6,34573630,15,10509813,26065978,7,0
7,34573630,15,10509813,29902424,8,0
8,34573630,15,10509813,39016998,9,0
9,34573630,15,10509813,62861215,10,0


The first column on the left is the pandas dataframe's built-in index (no label). Next come the columns `session_id`, `user_id`, `query_id`, `doc_id`, `position`, and `clicked`. A `session_id` identifies a single search session by a user (`user_id`) for one query (`query_id`, the identifier given to a query, aka search term) and the documents it returned (identified by `doc_id`). For a given search session, the `position` column gives each result document's rank, and `clicked` records whether the document link was clicked (`1` for clicked, `0` for not clicked).

In the `implicit_judgments_data_small` dataset, we can see the first search session has `session_id` `34573630`, made by user `15`. The `query_id` is `10509813` and we can see there were 10 documents returned, but only the first document in the result set was clicked.
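If you bring your own click logs instead, the later cells only rely on the `session_id`, `query_id`, `doc_id`, `position`, and `clicked` columns. As a minimal sketch (the values and the `own_df` name below are made up for illustration, not taken from the sample data), a compatible dataframe could look like this:

```python
import pandas as pd

# Hypothetical click log: two sessions for the same query, one row per displayed result.
own_df = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2, 2],
    "user_id":    [7, 7, 7, 8, 8, 8],
    "query_id":   [100, 100, 100, 100, 100, 100],
    "doc_id":     [11, 12, 13, 11, 12, 13],
    "position":   [1, 2, 3, 1, 2, 3],   # 1-based rank in the result list
    "clicked":    [1, 0, 0, 0, 1, 0],   # 1 = clicked, 0 = not clicked
})

# Basic sanity checks before running the rest of the notebook on your own data.
required = ["session_id", "query_id", "doc_id", "position", "clicked"]
missing = [col for col in required if col not in own_df.columns]
clicked_is_binary = bool(own_df["clicked"].isin([0, 1]).all())
print(missing, clicked_is_binary)
```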

> Note: the sample data only contains `query_id`, but ideally, your end judgment list would use the actual query string.

# Calculate Simple Click Through Rate (CTR)

To calculate CTR, we'll count all the clicks a document received for a query, and divide that by the total number of search sessions in which that query/doc pair occurred.

In [200]:
def calculate_ctr(sessions):
  num_clicks = sessions.groupby(["query_id", "doc_id"])["clicked"].sum()
  num_sessions = sessions.groupby(["query_id", "doc_id"])["session_id"].nunique()
  ctr = num_clicks / num_sessions
  return ctr

ctr = calculate_ctr(df)
ctr.sort_values(ascending=False)[:50]

query_id  doc_id  
2452707   58541183    2.0
7795064   42876266    2.0
15762397  56197529    2.0
4915918   1882964     2.0
669859    1235044     2.0
11979214  41935282    2.0
3507033   27638557    2.0
2963766   60197483    2.0
2099316   41490361    2.0
9928578   60635459    2.0
104584    70496331    2.0
11131380  7238661     2.0
20303545  16842490    2.0
104584    15660452    2.0
          11084634    2.0
669859    19537820    2.0
4100451   34033472    2.0
17185468  28830094    2.0
739545    10058203    2.0
21235243  35794486    2.0
609346    37038217    2.0
17185468  3367745     2.0
19413606  66134228    2.0
2704555   53914790    1.5
5302328   3272635     1.0
5302330   18618276    1.0
12048414  8829567     1.0
2913884   7210064     1.0
          26469811    1.0
6844000   26072969    1.0
5302330   39793713    1.0
2913884   46152681    1.0
8883434   52278349    1.0
14305044  63493627    1.0
15293149  30760668    1.0
14305044  53608280    1.0
15293149  15378814    1.0
          13713521 
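
Several values in this output are above 1, which a true rate cannot be. Given how `calculate_ctr` is written, that can only happen when a query/doc pair has a larger click total than it has distinct sessions, for example when the same document is logged as clicked more than once within one session. The toy data below is made up to reproduce the effect; whether duplicate rows are what drives the values in the sample data is an assumption worth checking on your own logs. The capped variant is one possible adjustment, not part of the original notebook.

```python
import pandas as pd

# Hypothetical log: in session 1 the same query/doc pair appears twice, both rows clicked.
toy = pd.DataFrame({
    "session_id": [1, 1, 2],
    "query_id":   [100, 100, 100],
    "doc_id":     [11, 11, 11],
    "clicked":    [1, 1, 1],
})

# Same computation as calculate_ctr: total clicks / distinct sessions.
group = toy.groupby(["query_id", "doc_id"])
raw_ctr = group["clicked"].sum() / group["session_id"].nunique()   # 3 clicks / 2 sessions

# One possible variant: count a session at most once, however many clicked rows it has.
clicked_once = toy.groupby(["query_id", "doc_id", "session_id"])["clicked"].max()
capped_ctr = clicked_once.groupby(level=["query_id", "doc_id"]).mean()

print(raw_ctr.loc[(100, 11)], capped_ctr.loc[(100, 11)])
```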

# Overcome Position Bias with Simplified Dynamic Bayesian Network

CTR can fall victim to biases in both our users and our application. One such bias is Position Bias. Simply put, Position Bias occurs because a user is more likely to click on results towards the top of the search result page, regardless of how relevant those results actually are.

One way to combat this is to use a Simplified Dynamic Bayesian Network (SDBN), which uses the idea of "examines" to essentially reward clicks that occur lower in the result list.

### Add examines to the dataframe
Every document at or above the last clicked result in a session is thought to have been "examined", or observed/scanned by the user. In a session with no clicks, no document counts as examined.
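
To make the rule concrete, here is a small sketch on made-up data. The toy rows and the `transform`-based shortcut are illustrative assumptions; the notebook's own implementation follows in the next cell.

```python
import pandas as pd

# Hypothetical data: session 1 has clicks at positions 2 and 4, session 2 has no clicks.
toy = pd.DataFrame({
    "session_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "position":   [1, 2, 3, 4, 5, 1, 2, 3],
    "clicked":    [0, 1, 0, 1, 0, 0, 0, 0],
})

# position * clicked is the position for clicked rows and 0 otherwise, so the
# per-session maximum is the last clicked position (0 when nothing was clicked).
last_click = (toy["position"] * toy["clicked"]).groupby(toy["session_id"]).transform("max")

# A row is examined when it sits at or above the last click of its session.
toy["examined"] = toy["position"] <= last_click
print(toy)
```

In session 1 the documents at positions 1 through 4 are marked as examined and position 5 is not; in session 2 nothing is examined.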


In [201]:
def add_examines(sessions):
  # filter to only clicked events
  clicked_sessions = sessions[sessions["clicked"] == 1]
  # Group by 'session_id' and 'query_id' and get the maximum 'position' for each session/query
  last_click_per_session = clicked_sessions.groupby(["session_id", "query_id"])["position"].max()

  # Reset index to make it easier to map
  last_click_per_session = last_click_per_session.reset_index()
  last_click_per_session = last_click_per_session.rename(columns={"position": "last_click_position"})

  # Merge the last_click_per_session with the original sessions DataFrame
  sessions = pd.merge(sessions, last_click_per_session, on=["session_id", "query_id"], how="left")
    
  # fill NaN last clicks (sessions with no clicks) with 0 value
  sessions['last_click_position'] = sessions["last_click_position"].fillna(0)
  
  # add the examined column, set to true if the position is less than or equal to the last clicked position
  sessions["examined"] = sessions["position"] <= sessions["last_click_position"] 
  return sessions

df_with_examines = add_examines(df)
df_with_examines

Unnamed: 0,session_id,user_id,query_id,doc_id,position,clicked,last_click_position,examined
0,34573630,15,10509813,34175267,1,1,1.0,True
1,34573630,15,10509813,34171511,2,0,1.0,False
2,34573630,15,10509813,35444452,3,0,1.0,False
3,34573630,15,10509813,15370141,4,0,1.0,False
4,34573630,15,10509813,31342884,5,0,1.0,False
...,...,...,...,...,...,...,...,...
15430955,35970414,5794351,11695257,61490336,6,0,0.0,False
15430956,35970414,5794351,11695257,11153727,7,0,0.0,False
15430957,35970414,5794351,11695257,31693680,8,0,0.0,False
15430958,35970414,5794351,11695257,31603391,9,0,0.0,False


In [214]:
# for every query/doc pair, sum the clicks and examines for the doc in that query

def calculate_clicked_examined(sessions):
  return sessions[sessions["examined"]].groupby(["query_id","doc_id"])[["clicked", "examined"]].sum()

calculate_clicked_examined(df_with_examines)

query_id,doc_id,clicked,examined
250,56587578,1,1
517,70880027,1,1
622,13698417,1,1
622,42550710,0,1
955,15534388,1,1
...,...,...,...
21432624,64097494,1,1
21432624,65801311,0,1
21433077,41732620,1,2
21433466,26188386,1,1


### Calculate Grade
Here we divide the total number of clicks by the total number of examines for each query/doc pair across all of our sessions.
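
A quick worked example shows how this differs from plain CTR. The counts below are hypothetical and chosen only to illustrate the arithmetic for a document that usually appears low in the result list.

```python
# Hypothetical counts for one query/doc pair that is usually shown at position 5.
sessions_shown = 10   # sessions in which the pair appeared
examined = 3          # sessions whose last click was at position 5 or further down
clicked = 2           # sessions in which this document was clicked

ctr = clicked / sessions_shown   # penalized for sessions where the user never got that far
grade = clicked / examined       # judged only on sessions where the document was examined
print(f"CTR={ctr:.2f}, grade={grade:.2f}")
```

The same two clicks count for far more once the sessions in which the user never reached the document are left out of the denominator.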

In [215]:
def calculate_grade(sessions):
  sessions = calculate_clicked_examined(sessions)
  sessions["grade"] = sessions["clicked"] / sessions["examined"]
  return sessions

search_df = calculate_grade(df_with_examines)
# flatten the df multi-index so all columns are same level
search_df.reset_index(inplace=True)
search_df[search_df['grade'] > 0][:50].sort_values(by='grade', ascending=True)


Unnamed: 0,query_id,doc_id,clicked,examined,grade
14,1343,15131650,1,2,0.5
27,1437,64481928,1,2,0.5
34,1791,2624065,1,2,0.5
87,6522,44239139,1,2,0.5
18,1343,67976152,1,2,0.5
12,1343,406395,1,2,0.5
13,1343,13171393,1,2,0.5
17,1343,55674579,1,2,0.5
73,5882,69886391,1,1,1.0
0,250,56587578,1,1,1.0


# Overcome Confidence Bias with Beta Distribution
Should a document with 1 examine and 1 click have a grade of 1 (100%)? Probably not: there's not enough information to tell whether that was just a random click. A Beta distribution is one way to handle this. We estimate what a typical grade would be, and then build a probability distribution around it that shifts only slightly for query/doc pairs with few occurrences of click/examine events.

In other words, it assumes what a typical grade should be and then acts as if 1 examine and 1 click are just one drop in the bucket of all the clicks and examines that led to the typical grade. The 1 examine and 1 click will "pull" the distribution a little toward a higher probability, according to a weight we'll assign.

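We'll set our typical grade (`prior_grade`) to the median grade across all of our query/doc pairs, and give it a weight (`prior_weight`) of `100`. These values can be played around with. The larger the weight, the more click and examine events it takes to move a grade away from the prior; the smaller the weight, the more the probability distribution will be affected by click events.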
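
The effect of the prior is easiest to see with a few numbers. The sketch below is a standalone helper (the `beta_grade` function name is hypothetical) that applies the same prior/posterior arithmetic as the cells that follow, using a prior grade of `0.3` and a weight of `100`; the 340-click/408-examine case is taken from the output further down.

```python
def beta_grade(clicked, examined, prior_grade, prior_weight):
    """Posterior mean of a Beta prior updated with click/examine counts."""
    prior_a = prior_grade * prior_weight         # pseudo-count of clicks
    prior_b = (1 - prior_grade) * prior_weight   # pseudo-count of examines without a click
    posterior_a = prior_a + clicked
    posterior_b = prior_b + (examined - clicked)
    return posterior_a / (posterior_a + posterior_b)

# One click out of one examine barely moves the grade away from the prior of 0.3.
sparse = beta_grade(clicked=1, examined=1, prior_grade=0.3, prior_weight=100)

# Hundreds of observations pull the grade most of the way to the raw ratio (340/408).
busy = beta_grade(clicked=340, examined=408, prior_grade=0.3, prior_weight=100)

# The same single click under a lighter and a heavier prior.
light_prior = beta_grade(1, 1, prior_grade=0.3, prior_weight=10)
heavy_prior = beta_grade(1, 1, prior_grade=0.3, prior_weight=1000)

print(round(sparse, 4), round(busy, 4), round(light_prior, 4), round(heavy_prior, 4))
```

With a weight of 10 the single click lifts the grade noticeably above 0.3, while with a weight of 1000 it hardly registers.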

### Assign prior_a and prior_b values

In [216]:
def calculate_prior(sessions, prior_grade, prior_weight):
  sessions["prior_a"] = prior_grade * prior_weight
  sessions["prior_b"] = (1 - prior_grade) * prior_weight
  return sessions

median_grade = search_df['grade'].median()

prior_grade = median_grade
prior_weight = 100
calculate_prior(search_df, prior_grade, prior_weight)

Unnamed: 0,query_id,doc_id,clicked,examined,grade,prior_a,prior_b
0,250,56587578,1,1,1.0,16.666667,83.333333
1,517,70880027,1,1,1.0,16.666667,83.333333
2,622,13698417,1,1,1.0,16.666667,83.333333
3,622,42550710,0,1,0.0,16.666667,83.333333
4,955,15534388,1,1,1.0,16.666667,83.333333
...,...,...,...,...,...,...,...
978116,21432624,64097494,1,1,1.0,16.666667,83.333333
978117,21432624,65801311,0,1,0.0,16.666667,83.333333
978118,21433077,41732620,1,2,0.5,16.666667,83.333333
978119,21433466,26188386,1,1,1.0,16.666667,83.333333


### Calculate posterior_a, posterior_b values, and final beta_grade value

In [217]:
def calculate_sdbn(sessions, prior_grade=0.3, prior_weight=100):
  sessions = calculate_prior(sessions, prior_grade, prior_weight)
  sessions["posterior_a"] = (sessions["prior_a"] + sessions["clicked"])
  sessions["posterior_b"] = (sessions["prior_b"] + sessions["examined"] - sessions["clicked"])
  sessions["beta_grade"] = (sessions["posterior_a"] / (sessions["posterior_a"] + sessions["posterior_b"]))
  return sessions.sort_values("beta_grade", ascending=False)

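sessions = search_df
# Note: this call uses the function defaults (prior_grade=0.3, prior_weight=100),
# which is why prior_a/prior_b are 30.0/70.0 in the output below. To use the median
# computed above, call calculate_sdbn(sessions, prior_grade, prior_weight) instead.
calculate_sdbn(sessions)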

Unnamed: 0,query_id,doc_id,clicked,examined,grade,prior_a,prior_b,posterior_a,posterior_b,beta_grade
915961,20100007,46063140,340,408,0.833333,30.0,70.0,370.0,138.0,0.728346
150830,4605457,39061378,442,570,0.775439,30.0,70.0,472.0,198.0,0.704478
132876,4102451,34175267,1247,1866,0.668274,30.0,70.0,1277.0,689.0,0.649542
489988,11483526,30581891,237,317,0.747634,30.0,70.0,267.0,150.0,0.640288
797870,17670982,28406892,131,157,0.834395,30.0,70.0,161.0,96.0,0.626459
...,...,...,...,...,...,...,...,...,...,...
149297,4575281,39002266,1,38,0.026316,30.0,70.0,31.0,107.0,0.224638
293974,7444613,6746234,3,49,0.061224,30.0,70.0,33.0,116.0,0.221477
149299,4575281,48039998,0,36,0.000000,30.0,70.0,30.0,106.0,0.220588
485541,11401789,63399546,11,86,0.127907,30.0,70.0,41.0,145.0,0.220430


# Format the Judgment List
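We now have all the data we need, but we can clean it up a bit to make calculating nDCG a bit easier. First we will bucket our `beta_grade` probabilistic float values into four equal-width bins (`pd.cut` splits the range of values evenly, so these are not quantiles), which will get assigned a value from 0-3 in the `grade` column, overwriting the click/examine ratio stored there earlier.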

Then we'll drop the columns we don't need, et voilà: an implicit judgment list.
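
The choice of binning function matters for how grades are spread. As a hedged illustration on made-up values, the sketch below compares `pd.cut` (equal-width bins, used in this notebook) with `pd.qcut` (equal-sized quantile bins, shown here only as an alternative you might consider).

```python
import pandas as pd

# Hypothetical beta grades, skewed toward the low end as click data often is.
toy_grades = pd.Series([0.22, 0.25, 0.30, 0.31, 0.45, 0.70])

# Equal-width bins: the value range 0.22-0.70 is split into four even intervals,
# so most of the low values land in bin 0 and one bin stays empty.
equal_width = pd.cut(toy_grades, bins=4, labels=False).tolist()

# Quantile bins: the cut points are chosen so each bin holds a similar number of rows.
quantiles = pd.qcut(toy_grades, q=4, labels=False).tolist()

print(equal_width)
print(quantiles)
```

Equal-width bins keep a high grade rare and tied to the actual `beta_grade` value, while quantile bins force roughly a quarter of all pairs into each grade regardless of how close their values are.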

### Quantize

In [194]:
search_df['grade'] = pd.cut(search_df['beta_grade'], bins=4, labels=False)
search_df.sort_values(by='grade', ascending=False)[:50]

Unnamed: 0,query_id,doc_id,clicked,examined,grade,prior_a,prior_b,posterior_a,posterior_b,beta_grade
328250,8117795,8539216,115,138,3,30.0,70.0,145.0,93.0,0.609244
132876,4102451,34175267,1247,1866,3,30.0,70.0,1277.0,689.0,0.649542
489988,11483526,30581891,237,317,3,30.0,70.0,267.0,150.0,0.640288
110065,3446059,26785689,180,243,3,30.0,70.0,210.0,133.0,0.612245
453269,10751928,51205217,202,271,3,30.0,70.0,232.0,139.0,0.625337
150830,4605457,39061378,442,570,3,30.0,70.0,472.0,198.0,0.704478
441089,10509813,34175267,227,327,3,30.0,70.0,257.0,170.0,0.601874
797870,17670982,28406892,131,157,3,30.0,70.0,161.0,96.0,0.626459
603726,13838503,31747249,110,125,3,30.0,70.0,140.0,85.0,0.622222
915923,20098967,46063140,114,131,3,30.0,70.0,144.0,87.0,0.623377


### Format and drop unused columns

> Again, note that in a real judgment list we'd want the _actual_ query strings instead of a `query_id`.

In [219]:
columns_to_drop = ['clicked', 'examined', 'prior_a', 'prior_b', 'posterior_a', 'posterior_b', 'beta_grade']
final_judgment_list = search_df.drop(columns_to_drop, axis='columns')
final_judgment_list.set_index("query_id", inplace=True)
final_judgment_list.sort_values(by='grade', ascending=False)[:50]

query_id,doc_id,grade
250,56587578,1.0
11710437,26135987,1.0
11711488,50559245,1.0
11711438,62274753,1.0
11711438,58375232,1.0
11711438,57372829,1.0
11711438,28074867,1.0
11711438,23398553,1.0
11711431,37564552,1.0
11711430,62832733,1.0


### Export to CSV

In [220]:
final_judgment_list.to_csv("final_judgment_list.csv")
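
As a rough illustration of how the exported grades can feed an nDCG calculation, here is a minimal sketch. It uses the exponential-gain formulation of DCG, which is one common variant among several. The judgments and ranking are hypothetical, unjudged documents are treated as grade 0, and the ideal ranking is built only from the documents that were returned; all three are simplifying assumptions rather than something this notebook prescribes.

```python
import math

def dcg(grades):
    """Discounted cumulative gain: sum of (2^grade - 1) / log2(rank + 1), rank starting at 1."""
    return sum((2 ** g - 1) / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg(grades):
    """DCG of the given ranking divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Hypothetical judgments for one query: doc_id -> grade on the 0-3 scale.
judgments = {11: 3, 12: 0, 13: 2}
# Hypothetical ranking returned by the search engine for that query.
ranking = [11, 12, 13]

# Look up each returned document's grade; unjudged documents default to 0 (an assumption).
grades_in_rank_order = [judgments.get(doc_id, 0) for doc_id in ranking]
score = ndcg(grades_in_rank_order)
print(round(score, 4))
```

A score of 1.0 would mean the documents were returned in the best possible order according to the judgment list.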