# IV. Baseline #3 

**Embeddings**

With BERT embeddings, we tried to capture complex relationships between tokens and infer tech concepts, but this performed quite poorly on our task. Here we take a simpler approach: we compute the TF-IDF embedding of each title and run a cosine similarity search to find the top-n most similar questions.
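To make the mechanism concrete, here is a minimal from-scratch sketch of TF-IDF weighting followed by cosine similarity. It assumes plain whitespace tokenisation and fits on all titles at once, which is simpler than what the notebook does with `TfidfVectorizer`; the helper names `tfidf_vectors` and `cosine` are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # smoothed idf: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    # both vectors are unit length, so the dot product is the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())


titles = [
    "return an array in c",
    "return array of urls in django",
    "center a div with css",
]
vecs = tfidf_vectors(titles)
sims = [cosine(vecs[0], v) for v in vecs]
print(sims)
```

The first title is identical to itself, shares three terms with the second title, and shares none with the third, so the similarities decrease in that order.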

**Model**

As before, we fetch the users who answered this subset of neighbor questions and rank them with basic heuristics such as their total answer score on the subset and their reputation.
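The ranking heuristic can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's pandas pipeline: `rank_users` and the tuples below are made up for the example.

```python
from collections import defaultdict


def rank_users(answers, reputation, neighbour_ids, top_k=20):
    """answers: iterable of (question_id, user_id, score); reputation: {user_id: int}."""
    neighbour_ids = set(neighbour_ids)
    totals = defaultdict(int)
    for question_id, user_id, score in answers:
        # only answers given to the neighbour questions count
        if question_id in neighbour_ids:
            totals[user_id] += score
    # sort by cumulative score, then by reputation, both descending;
    # users with no known reputation default to 0 in this sketch
    ranked = sorted(totals, key=lambda u: (totals[u], reputation.get(u, 0)), reverse=True)
    return ranked[:top_k]


# toy data, not taken from the dataset
answers = [(1, "ann", 3), (1, "bob", 5), (2, "ann", 2), (2, "cy", 7), (3, "dee", 9)]
reputation = {"ann": 100, "bob": 40, "cy": 900, "dee": 10}
ranked = rank_users(answers, reputation, neighbour_ids=[1, 2])
print(ranked)
```

Here `cy` has the highest cumulative score, `ann` and `bob` tie at 5 and are separated by reputation, and `dee` is excluded because question 3 is not a neighbour.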

Pros of this model:
- Embeddings are faster to compute and easier to understand
- Embeddings are more specific to our tech-oriented text
- Simple ranking heuristics can give sensible solutions

Cons:
- Brute-force cosine similarity is compute-intensive
- kNN is a decent approximation for lower-dimensional vectors but can quickly become inaccurate due to the curse of dimensionality.

## Load data and packages

In [60]:
import os
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import seaborn as sns
from matplotlib import pyplot as plt
import sys
sys.path.append("..")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

from src.data import load_data, split_questions, save_results_csv, DATA_PATH
from src.embedder import stop_words
from src.score import precision_k, recall_k

%load_ext autoreload
%autoreload 2



In [2]:
df_answers, df_questions, df_users = load_data()

df_answers.shape: (95709, 6)
df_questions.shape: (100000, 6)
df_users.shape: (138698, 12)


In [3]:
Q_train, Q_val, Q_test = split_questions(df_questions, df_answers)

Q_train.shape: (65036, 8)
Q_val.shape: (5000, 8)
Q_test.shape: (1000, 8)


Credits to [Ciprian Borodescu](https://www.algolia.com/blog/ai/the-anatomy-of-high-performance-recommender-systems-part-iv/) for this neat implementation of TF-IDF below!

In [33]:
# creating the tf-idf Vectorizer to analyze, at word level, unigrams and bigrams
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words=stop_words) 
 
# applying the vectorizer on the 'title' column
tfidf_matrix = tf.fit_transform(Q_train["title"])

# getting the embedding for a validation question
query_title = [Q_val.iloc[0].title]
tfidf_vector_query = tf.transform(query_title)

# compute the cosine similarity between the training matrix and the query vector
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_vector_query)

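# return 10 of the closest titles; the [1:11] slice skips the single best match
# (the query comes from Q_val, so the skipped hit is a training title, not the query itself)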
top_idxs = np.argsort(cosine_sim, axis=0)[::-1][1:11].ravel()
print("query title:", query_title)
print("most similar titles:", Q_train.iloc[top_idxs].title.values)

query title: ['Return an array with all negatives in a matrix']
most similar titles: ['Return an array after loop async'
 'How can code return an array of characters in C?'
 'How to return an array which is an element of an multidimensional array?'
 'How to return array of urls in django'
 'How do you return an array of structs from a function in C?'
 'Why does it return 1?'
 'Return Array of Objects in JSON in django rest_framework'
 'Return an array of items that satisfies a specific rule'
 'How can I return an array of all items in the provided array that include a singular item'
 'Return array value from produce function | immer.js']


## Brute force cosine sim

Let's compute the closest questions for our validation set!

In [64]:
def get_top_questions(tf, tfidf_matrix, Q_val, Q_train, df_answers, df_users, n_top_questions=20):
    """
    For a given question (or dataset of questions), find the top users by:
    1. Computing the cosine similarity between Q_train and Q_val titles
    2. Fetching the `n_top_questions` nearest neighbours of each question
    3. Choosing the 20 users with the highest cumulative score on these neighbour questions,
       breaking ties by user reputation.
    """
    # embed and get similarity of our validation questions
    tfidf_matrix_query = tf.transform(Q_val["title"])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix_query)
    
    # get n_top_questions close training questions for each query
    # (the [1:...] slice skips the single best match)
    top_idxs_matrix = np.argsort(cosine_sim, axis=0)[::-1][1:n_top_questions+1].T
    
    # get the top users for each validation question
    question_ids_val = Q_val.question_id.values
    results = []
    for question_id, top_idxs in tqdm(zip(question_ids_val, top_idxs_matrix), total=len(question_ids_val)):
        neighbour_question_ids = Q_train.iloc[top_idxs].question_id.values
        # apply rules to get our top user ids for this question
        top_user_ids = df_answers.loc[df_answers.question_id.isin(neighbour_question_ids)] \
                                  .groupby("user_id").score.sum().reset_index() \
                                  .merge(df_users, left_on="user_id", right_on="id", how="inner") \
                                  .sort_values(["score", "reputation"], ascending=[False, False]) \
                                  .head(20).user_id.values
        results.append(np.hstack([question_id, top_user_ids]))
    
    # saving results to compute score
    save_results_csv("baseline_3_results.csv", results)
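The full `np.argsort` above sorts every training row for every query, even though only a handful of neighbours are needed. A possible alternative, sketched here under the assumption that the similarity matrix is a dense array of shape `(n_train, n_query)`, is to select the top candidates with `np.argpartition` and sort only those. The helper name `top_n_per_query` is hypothetical, and unlike the slice used above it keeps the best match.

```python
import numpy as np


def top_n_per_query(sim, n):
    """sim: (n_train, n_query) similarities. Returns (n_query, n) row indices, most similar first."""
    n = min(n, sim.shape[0])
    # argpartition moves the n largest entries of each column into the last n rows (unordered)
    part = np.argpartition(sim, -n, axis=0)[-n:]
    # sort only those n candidates per column by descending similarity
    part_scores = np.take_along_axis(sim, part, axis=0)
    order = np.argsort(-part_scores, axis=0)
    return np.take_along_axis(part, order, axis=0).T


# toy similarity matrix: 4 training rows, 2 queries
sim = np.array([[0.1, 0.9],
                [0.8, 0.2],
                [0.5, 0.4],
                [0.3, 0.7]])
top = top_n_per_query(sim, n=2)
print(top)
```

For the toy matrix, the first query is closest to rows 1 and 2, and the second query to rows 0 and 3.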

In [65]:
get_top_questions(tf, tfidf_matrix, Q_val, Q_train, df_answers, df_users, n_top_questions=50)

../data/results/baseline_3_results.csv written


In [66]:
file_path = os.path.join(DATA_PATH, "results", "baseline_3_results.csv")
df_results = pd.read_csv(file_path, index_col="question_id")
df_results.head(3)

question_id,user1_id,user2_id,user3_id,user4_id,user5_id,user6_id,user7_id,user8_id,user9_id,user10_id,user11_id,user12_id,user13_id,user14_id,user15_id,user16_id,user17_id,user18_id,user19_id,user20_id
60698269,12158757,6574038,3732271,10429793,2586922,10248678,5030014,1447675,3293881,6243352,1411277,9901261,9661744,3962914,2877241,5459839,4620771,12500315,6110094,1822379
61476685,8583393,12299030,7687641,9698684,6950186,12299000,5238915,3732271,2372064,797495,10618540,2901002,7964527,6622587,1639625,3874623,10035985,4552295,10201580,10068985
59551270,2887218,7098650,7085197,1491895,1447675,11654,1100248,6950238,3867033,2035262,5249621,650475,159388,13552470,1225617,988260,8891224,4934937,5215131,6510523


In [61]:
R = df_results.values
# order the actual answers based on the prediction
df_actual = pd.DataFrame(df_answers.groupby("question_id").user_id.apply(list))
A = list(df_actual.loc[df_results.index].values[:, 0])

print(f"Precision @20: {precision_k(Y=A, Y_pred=R)}")
print(f"Recall @20: {recall_k(Y=A, Y_pred=R)}")

Precision @20: 0.0067
Recall @20: 0.10945999999999999
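The implementation of `precision_k` and `recall_k` in `src.score` is not shown in this notebook. A minimal sketch of the usual definitions, assuming each score is averaged over questions, could look like this (the function names `precision_at_k` and `recall_at_k` are illustrative):

```python
def precision_at_k(actual, predicted, k=20):
    """Mean over questions of (hits in the top-k predictions) / k."""
    total = 0.0
    for answerers, ranked in zip(actual, predicted):
        hits = len(set(answerers) & set(list(ranked)[:k]))
        total += hits / k
    return total / len(actual)


def recall_at_k(actual, predicted, k=20):
    """Mean over questions of (hits in the top-k predictions) / (number of actual answerers)."""
    total = 0.0
    for answerers, ranked in zip(actual, predicted):
        hits = len(set(answerers) & set(list(ranked)[:k]))
        # assumes every question has at least one answerer
        total += hits / len(set(answerers))
    return total / len(actual)


# toy example: two questions, three recommendations each
actual = [[1, 2], [3]]
predicted = [[1, 9, 8], [7, 6, 5]]
p = precision_at_k(actual, predicted, k=3)
r = recall_at_k(actual, predicted, k=3)
print(p, r)
```

Because most questions have only one or two answerers, precision @20 is bounded well below 1 even for a perfect recommender, which is why recall is the more informative of the two numbers here.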


For comparison, let's compute the performance of the dummy recommender, which always suggests the 20 users with the most answers.

In [62]:
top_20_users = df_answers.user_id.value_counts().head(20)
R_dummy = [top_20_users.index] * len(A)

print(f"Precision @20: {precision_k(Y=A, Y_pred=R_dummy):.4f}")
print(f"Recall @20: {recall_k(Y=A, Y_pred=R_dummy):.4f}")

Precision @20: 0.0031
Recall @20: 0.0451


At last, we beat the dummy recommender! What happens if we increase `n_top_questions`?

In [68]:
def get_score(n_top_questions):
    
    get_top_questions(tf, tfidf_matrix, Q_val, Q_train, df_answers, df_users, n_top_questions)
    df_results = pd.read_csv(file_path, index_col="question_id")

    R = df_results.values
    
    # order the actual answers based on the prediction
    df_actual = pd.DataFrame(df_answers.groupby("question_id").user_id.apply(list))
    A = list(df_actual.loc[df_results.index].values[:, 0])

    print(f"Precision @20: {precision_k(Y=A, Y_pred=R)}")
    print(f"Recall @20: {recall_k(Y=A, Y_pred=R)}")

In [70]:
get_score(n_top_questions=50)

../data/results/baseline_3_results.csv written
Precision @20: 0.00704
Recall @20: 0.11435333333333333


In [69]:
get_score(n_top_questions=100)

../data/results/baseline_3_results.csv written
Precision @20: 0.006810000000000001
Recall @20: 0.10883


We are slightly better off with 50 top questions, but the differences are small and come from a single validation split. To get a more reliable estimate, we should run cross-validation on this parameter.
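A cross-validated search over `n_top_questions` could be organised as in the sketch below. Everything here is hypothetical scaffolding: `k_fold_indices` and `select_n_top_questions` are illustrative helpers, and `evaluate` stands for a function that would refit TF-IDF on the training fold and return recall @20 on the held-out fold.

```python
from statistics import mean


def k_fold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds contiguous (train_idx, val_idx) pairs."""
    sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0) for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        folds.append((train, val))
        start += size
    return folds


def select_n_top_questions(candidates, n_items, evaluate, n_folds=5):
    """Return (best_candidate, mean_scores), where evaluate(n, train_idx, val_idx) -> float."""
    folds = k_fold_indices(n_items, n_folds)
    mean_scores = {n: mean(evaluate(n, tr, va) for tr, va in folds) for n in candidates}
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


def fake_evaluate(n, train_idx, val_idx):
    # stand-in scorer for the example only; it does not reflect real results
    return -abs(n - 50)


best, scores = select_n_top_questions([20, 50, 100], n_items=10, evaluate=fake_evaluate, n_folds=3)
print(best, scores)
```

In practice the question indices would be shuffled before splitting, so that each fold covers a comparable mix of topics.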

## kNN 

Finally, let's try to use a kNN model instead of brute-force cosine similarity to improve our speed and memory efficiency.
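The cell that builds and fits `knn` is not shown in this notebook. A minimal sketch of how such a model could be set up with scikit-learn's `NearestNeighbors`, assuming a cosine metric and toy titles in place of `Q_train`, is:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# toy stand-ins for the training titles
train_titles = [
    "How to return an array from a function in C",
    "Return array of urls in django",
    "Center a div with CSS",
    "Sort a list of dicts in Python",
]
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
train_matrix = vectorizer.fit_transform(train_titles)

# n_neighbors set here is only the default; it can be overridden per query
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(train_matrix)

query = vectorizer.transform(["Return an array with all negatives in a matrix"])
# cosine distance = 1 - cosine similarity, so smaller means closer
distances, indices = knn.kneighbors(query, n_neighbors=2)
print(indices[0], distances[0])
```

The metric and `n_neighbors` chosen when the model is built are assumptions here; the settings actually used for the results below are not recorded in the notebook.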

In [77]:
def get_top_questions_knn(tf, knn, tfidf_matrix, Q_val, Q_train, df_answers, df_users, n_top_questions=20):
    """
    For a given question (or dataset of questions), find the top users by:
    1. Running a kNN search between Q_train and Q_val titles
    2. Fetching the `n_top_questions` nearest neighbours of each question
    3. Choosing the 20 users with the highest cumulative score on these neighbour questions,
       breaking ties by user reputation.
    """
    # embed our validation questions
    tfidf_matrix_query = tf.transform(Q_val["title"])
    
    # get the n_top_questions closest neighbours; without n_neighbors, kneighbors
    # falls back to the value the model was built with and ignores n_top_questions
    top_idxs_matrix = knn.kneighbors(tfidf_matrix_query, n_neighbors=n_top_questions, return_distance=False)
        
    # get the top users for each validation question
    question_ids_val = Q_val.question_id.values
    results = []
    for question_id, top_idxs in tqdm(zip(question_ids_val, top_idxs_matrix), total=len(question_ids_val)):
        neighbour_question_ids = Q_train.iloc[top_idxs].question_id.values
        # apply rules to get our top user ids for this question
        top_user_ids = df_answers.loc[df_answers.question_id.isin(neighbour_question_ids)] \
                                  .groupby("user_id").score.sum().reset_index() \
                                  .merge(df_users, left_on="user_id", right_on="id", how="inner") \
                                  .sort_values(["score", "reputation"], ascending=[False, False]) \
                                  .head(20).user_id.values
        results.append(np.hstack([question_id, top_user_ids]))
    
    # saving results to compute score
    save_results_csv("baseline_3_results.csv", results)

In [79]:
def get_score_knn(n_top_questions):
    
    get_top_questions_knn(tf, knn, tfidf_matrix, Q_val, Q_train, df_answers, df_users, n_top_questions)
    df_results = pd.read_csv(file_path, index_col="question_id")

    R = df_results.values
    
    # order the actual answers based on the prediction
    df_actual = pd.DataFrame(df_answers.groupby("question_id").user_id.apply(list))
    A = list(df_actual.loc[df_results.index].values[:, 0])

    print(f"Precision @20: {precision_k(Y=A, Y_pred=R)}")
    print(f"Recall @20: {recall_k(Y=A, Y_pred=R)}")

In [80]:
get_score_knn(n_top_questions=50)

../data/results/baseline_3_results.csv written
Precision @20: 0.00468
Recall @20: 0.07585


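Here we get a practical sense of the trade-off between memory and compute efficiency on one side and metric quality on the other. Our kNN is faster, but both precision @20 and recall @20 are lower: 0.00468 and 0.07585, versus 0.00704 and 0.11435 for brute-force cosine similarity with `n_top_questions=50`.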

## Notes

To improve performance further, we could replace the simple heuristics with a classifier that estimates, for each candidate user, the probability that they answer a given question. This would let us leverage more user attributes, such as whether a profile photo is missing, the text (or labels) in their bio, and their number of replies.