## III. Baseline #2: Nearest neighbours questions

We apply the embeddings to questions title and feed them into a ANN model that will help us retrieve the nearest, say, 100 questions. With a simple ranking on users associated to those questions, we might obtain a decent baseline.

We will use [Spotify ANNOY](https://github.com/spotify/annoy) model for fast indexing and retrieval of candidates questions. On benchmarks, ANNOY is faster than Facebook FAISS, and the high-dimensionality of our embeddings forbid us from using KDTree, prone to the curse of dimensionality.

Pros of the model:
- Simple to implement
- Fast in production and adapted to batch, precompute inference settings
- Intuitive results
- Possibility to build a more complex ranking system afterwards

Cons:
- Two models instead of one
- ANN is an Inductive model, we need to rebuild the indexing for each new entry, so it might not be ideal in a real-time setting.

In [80]:
import os
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import sys
sys.path.append("..")
from dataclasses import dataclass

from annoy import AnnoyIndex

from src.data import load_data, split_questions, save_results_csv, DATA_PATH
from src.embedder import BertEmbedder
from src.score import precision_k, recall_k

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
USE_PRECOMPUTE = True

In [12]:
df_answers, df_questions, df_users = load_data()

df_answers.shape: (95709, 6)
df_questions.shape: (100000, 6)
df_users.shape: (138698, 12)


In [13]:
Q_train, Q_val, Q_test = split_questions(df_questions, df_answers)

Q_train.shape: (65036, 8)
Q_val.shape: (5000, 8)
Q_test.shape: (1000, 8)


We begin by computing embeddings of both train and validation questions title.

In [32]:
def compute_questions_embeddings(df_questions, embedder, file_name):
    results = []
    for (question_id, title) in tqdm(df_questions[["question_id", "title"]].values):
        embeddings = embedder.get_embeddings(title)
        results.append({"question_id": question_id, "title": title, "embeddings": embeddings})
    file_path = os.path.join(DATA_PATH, file_name)
    df_q_embeddings = pd.DataFrame(results)
    df_q_embeddings.to_pickle(file_path)
    print(f"{file_name} written")
    return df_q_embeddings

def load_questions_embeddings(file_name):
    file_path = os.path.join(data_path, file_name)
    return pd.read_pickle(file_name)

In [33]:
embedder = BertEmbedder()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [34]:
file_name = "q_embeddings.pkl"
if USE_PRECOMPUTE:
    df_q_embeddings = load_questions_embeddings(file_name)
else:
    Q_train_val = pd.concat([Q_train, Q_val])
    df_q_embeddings = compute_questions_embeddings(Q_train_val, embedder, file_name)

  0%|          | 0/70036 [00:00<?, ?it/s]

q_embeddings.pkl written


Quick demo using the first question indexed (index 0), we query the 10 neareast questions.

In [37]:
ann = AnnoyIndex(768, "angular")
embeddings = df_q_embeddings.embeddings.values
for idx, v in tqdm(enumerate(embeddings), total=len(embeddings)):
    ann.add_item(idx, v)
ann.build(10)

  0%|          | 0/70036 [00:00<?, ?it/s]

True

In [39]:
idxs, distances = ann.get_nns_by_item(0, 10, include_distances=True)
question_ids = df_q_embeddings.iloc[idxs].question_id
df_results = df_questions.loc[df_questions.question_id.isin(question_ids)][["question_id", "title"]]
df_results["distances"] = distances
df_results

Unnamed: 0,question_id,title,distances
2171,63142034,How to create an automation for tplink pharos ...,0.0
4682,63051247,selenium cannot import name webdriver in ubuntu,0.337811
15578,63695568,Is there any embedded database for Node.js tha...,0.347284
32458,63849070,Authentication Exception when trying to connec...,0.367351
39417,61494103,Neovim plugin Fugitive isn't using the ssh key...,0.368468
52724,63230711,Configuration domain name on nginx on linux?,0.369126
54445,63841827,How do I use synology domain name for azure,0.371456
55446,60364031,git username visible to nginx,0.372364
84514,62677039,Displaying data from mySQL database to vue.js ...,0.372692
95698,60544448,selenium with firefox close tab by javascript ...,0.373452


In [40]:
df_results.title.values

array(['How to create an automation for tplink pharos cpe520 using xpath with selenium and python for log in?',
       'selenium cannot import name webdriver in ubuntu',
       'Is there any embedded database for Node.js that allows to use mongoose driver API?',
       'Authentication Exception when trying to connect to Amazon keyspace using .net core and cassandra csharp driver from linux',
       "Neovim plugin Fugitive isn't using the ssh key agent, so I can't Gpush/Git push",
       'Configuration domain name on nginx on linux?',
       'How do I use synology domain name for azure',
       'git username visible to nginx',
       'Displaying data from mySQL database to vue.js front end using PHP',
       'selenium with firefox close tab by javascript but SetTimeout() not ok'],
      dtype=object)

In the results above, we have returned the 9 closest titles to the first one (notice the increasing distances between the first row and the rest). There is room for improvement in those results: we find some similarity based on API, proxy and selenium but questions linked to ssh seems to be a bit far-off.

We now define the dataset that our model will use.

In [54]:
@dataclass
class Dataset:
    question_ids_to_predict: list[int]
    embeddings: list
    questions_idxs_mapping: pd.DataFrame
    df_answers: pd.DataFrame
    df_users: pd.DataFrame
    
    def __len__(self):
        return len(self.question_ids_to_predict)

In [55]:
ds = Dataset(
    question_ids_to_predict=Q_val.question_id.values,
    embeddings=df_q_embeddings.embeddings.tolist(),
    questions_idxs_mapping=Q_train_val,
    df_answers=df_answers,
    df_users=df_users,
)

In [72]:
class ANN_Ranker:    
    
    def predict_users(self, ds, n_top_users=20, k_nearest_questions=40):
        self.ann = self.build_ann(ds)
        results = []
        print(" # [ANN_Ranker] Make users predictions")
        for idx, question_id in enumerate(tqdm(ds.question_ids_to_predict, total=len(ds))):
            question_ids = self.get_nearest_questions(idx, ds, k_nearest_questions)
            top_user_ids = self.get_top_users(question_ids, ds, n_top=20)
            results.append(np.hstack([question_id, top_user_ids]))
        save_results_csv("baseline_2_results.csv", results)
    
    def build_ann(self, ds, distance="angular", size=768):
        print(" # [ANN_Ranker] Build ANN")
        ann = AnnoyIndex(size, distance)
        for idx, v in tqdm(enumerate(ds.embeddings), total=len(ds.embeddings)):
            ann.add_item(idx, v)
        ann.build(10)
        return ann

    def get_nearest_questions(self, idx, ds, k_nearest_questions):
        idxs = ann.get_nns_by_item(idx, k_nearest_questions)
        idxs = np.array(idxs)
        question_ids = ds.questions_idxs_mapping.iloc[idxs].question_id.values
        return question_ids

    def get_top_users(self, question_ids, ds, n_top=20):
        df_answers_nn = ds.df_answers.loc[ds.df_answers.question_id.isin(question_ids)]
        df_top_users = df_answers_nn.groupby("user_id").score.sum().reset_index()
        df_top_users = df_top_users.merge(
            ds.df_users[["id", "reputation"]], left_on="user_id", right_on="id", how="left"
        )
        df_top_users.sort_values(["score", "reputation"], ascending=[False, False], inplace=True)
        return df_top_users.user_id[:n_top].values

In [73]:
ann_ranker = ANN_Ranker()
ann_ranker.predict_users(ds)

 # [ANN_Ranker] Build ANN


  0%|          | 0/70036 [00:00<?, ?it/s]

 # [ANN_Ranker] Make users predictions


  0%|          | 0/5000 [00:00<?, ?it/s]

../data/results/baseline_2_results written


In [77]:
file_path = os.path.join(DATA_PATH, "results", "baseline_2_results")
df_results = pd.read_csv(file_path, index_col="question_id")
df_results.head(3)

Unnamed: 0_level_0,user1_id,user2_id,user3_id,user4_id,user5_id,user6_id,user7_id,user8_id,user9_id,user10_id,user11_id,user12_id,user13_id,user14_id,user15_id,user16_id,user17_id,user18_id,user19_id,user20_id
question_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
61821197,11329890,58456,6309,1658906,4399281,70465,6797509,8813644,7535379,14620948,12821415,209103,235648,10185816,5577076,1023597,11061080,2834065,7706936,10304821
60874696,1059372,1113486,6869836,4665755,440558,2877241,4238408,10008173,765226,9959152,5820814,8367626,9518890,5105949,11227781,5028841,6079412,10490683,6066528,470214
64374269,3732271,12323248,1144035,5089204,3385827,11897007,3219613,10959940,1548468,7299782,13808319,10305477,6635033,3825777,4785185,4117728,568283,132438,8198946,8805315


In [81]:
R = df_results.values
# order the actual answers base on the prediction
df_actual = pd.DataFrame(df_answers.groupby("question_id").user_id.apply(list))
A = list(df_actual.loc[df_results.index].values[:, 0])

print(f"Precision @20: {precision_k(Y=A, Y_pred=R)}")
print(f"Recall @20: {recall_k(Y=A, Y_pred=R)}")

Precision @20: 0.0005300000000000001
Recall @20: 0.007945238095238095


Compare it to dummy prediction where we simply select the top 20 users with the most answers.

In [82]:
top_20_users = df_answers.user_id.value_counts().head(20)
R_dummy = [top_20_users.index] * len(A)

print(f"Precision @20: {precision_k(Y=A, Y_pred=R_dummy):.4f}")
print(f"Recall @20: {recall_k(Y=A, Y_pred=R_dummy):.4f}")

Precision @20: 0.0032
Recall @20: 0.0466


We improved from the last baseline but we still performs poorly compared to the dummy prediction. Our embeddings might not be adapted for titles, since there are a lot of tech-specific words.

What is more, a 768 embeddings size might be too much for small text like titles. We need to lower embedding sizes to a more reasonable one like 64.

As a follow-up we can try another embedding method like word2vec, were we would create embeddings specific to our titles instead of our current generic ones from Huggingface BERT.

An even simpler approach would be to create labels on questions by running TF-IDF, so that for a new question we would simply look for users having already answered questions with shared labels —like "kubernetes" or "node.js".