# II. Baseline #1

**Questions and users' answers featurization**

We write $f(q) \in \mathbb{R}^{768}$ for the embedding of a question $q$, assuming a token embedding size of 768, and
$g(u) \in \mathbb{R}^{768}$ for the embedding of a user $u$.

For these embeddings, we will consider only the `title` of questions and the `text` of users' answers.
- The question embedding is the average of the embeddings of the words in its title.
- The user embedding is the average of the embeddings of all answers by this user (one answer embedding being the average of its word embeddings).
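
As a rough sketch of the two averaging rules above, the snippet below uses random arrays as stand-ins for BERT token embeddings. The helper names `mean_pool`, `question_embedding` and `user_embedding` are hypothetical and are not functions from this repository.

```python
import numpy as np

EMBED_DIM = 768  # token embedding size assumed in the text


def mean_pool(vectors: np.ndarray) -> np.ndarray:
    """Average a (n, EMBED_DIM) stack of vectors into one (EMBED_DIM,) vector."""
    return vectors.mean(axis=0)


def question_embedding(title_token_vectors: np.ndarray) -> np.ndarray:
    # f(q): average of the token embeddings of the title
    return mean_pool(title_token_vectors)


def user_embedding(answers_token_vectors: list) -> np.ndarray:
    # g(u): average, over the user's answers, of each answer's mean token embedding
    answer_vectors = np.stack([mean_pool(a) for a in answers_token_vectors])
    return mean_pool(answer_vectors)


rng = np.random.default_rng(0)
# Random stand-ins for the token embeddings a BERT model would produce
title_tokens = rng.normal(size=(7, EMBED_DIM))
user_answers = [rng.normal(size=(n, EMBED_DIM)) for n in (12, 30, 5)]

f_q = question_embedding(title_tokens)
g_u = user_embedding(user_answers)
print(f_q.shape, g_u.shape)
```

Averaging per answer first gives every answer the same weight in $g(u)$, whatever its length.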

We will leverage a pre-trained BERT model from HuggingFace to create these embeddings.

<br>

**Model**

For a given new question, our baseline algorithm returns the 20 users whose embeddings have the largest dot product with the question embedding:

$$\max_u f(q)^T g(u)$$
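
The retrieval rule above can be sketched in a few lines of NumPy. This is a toy illustration with 3-dimensional vectors and made-up user ids, not the notebook's actual implementation (which appears further down and relies on `torch`).

```python
import numpy as np


def top_k_users(q: np.ndarray, U: np.ndarray, user_ids: np.ndarray, k: int = 20) -> np.ndarray:
    """Return the ids of the k users whose embedding has the largest dot product with q."""
    scores = U @ q                     # one score per user, shape (n_users,)
    order = np.argsort(-scores)[:k]    # user indices sorted by decreasing score
    return user_ids[order]


# Toy data: 4 users in a 3-dimensional space (the notebook uses 768 dimensions)
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])
user_ids = np.array([101, 102, 103, 104])  # made-up ids
q = np.array([0.9, 0.2, 0.0])

print(top_k_users(q, U, user_ids, k=2))  # [101 103]
```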

<br>

Pros of this model
- a sensible baseline
- no training needed and fast to try out


Cons
- not flexible
- doesn't leverage explicit ratings
- not sure NLP semantics will be relevant enough to suggest the most likely users to answer a question

## Loading data and packages

In [1]:
import os
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import pickle
import sys
sys.path.append("..")

import re
import torch

from src.data import load_data, split_questions, save_results_csv, DATA_PATH
from src.embedder import BertEmbedder
from src.score import precision_k, recall_k

%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vincentmaladiere/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Set `USE_PRECOMPUTE` to `True` once you have computed embeddings. This will result in a significant speed-up.

In [2]:
USE_PRECOMPUTE = True

In [3]:
df_answers, df_questions, df_users = load_data()

df_answers.shape: (95709, 6)
df_questions.shape: (100000, 6)
df_users.shape: (138698, 12)


In [4]:
Q_train, Q_val, Q_test = split_questions(df_questions, df_answers)

Q_train.shape: (65036, 8)
Q_val.shape: (5000, 8)
Q_test.shape: (1000, 8)


A quick demo of our embeddings: here, two sentences from the Wikipedia article on "cosmos" are compared with a sentence from the introduction of the Wikipedia article on "cheeseburger".

In [5]:
text_1 = "The cosmos, and our understanding of the reasons for its existence and significance, are studied in cosmology – a broad discipline covering scientific, religious or philosophical aspects of the cosmos and its nature."
text_2 = "Religious and philosophical approaches may include the cosmos among spiritual entities or other matters deemed to exist outside our physical universe."
text_3 = "The cheese is usually added to the cooking hamburger patty shortly before serving, which allows the cheese to melt."

embedder = BertEmbedder()
out_1 = embedder.get_embeddings(text_1)
out_2 = embedder.get_embeddings(text_2)
out_3 = embedder.get_embeddings(text_3)

print(f"text_1 x text_1: {out_1[None] @ out_1[None].T}")
print(f"text_1 x text_2: {out_1[None] @ out_2[None].T}")
print(f"text_2 x text_3: {out_2[None] @ out_3[None].T}")
print(f"text_1 x text_3: {out_1[None] @ out_3[None].T}")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


text_1 x text_1: tensor([[6593.2627]])
text_1 x text_2: tensor([[6189.0732]])
text_2 x text_3: tensor([[5542.9907]])
text_1 x text_3: tensor([[5702.2930]])


The product of the first vector with itself is, unsurprisingly, the highest. The pair of cosmos sentences then ranks second, as we expected. However, all scores seem rather close to each other, which might indicate that this scoring method is quite noisy.
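
To put a number on "rather close", the short sketch below reuses the four scores printed by the demo cell and expresses them relative to the self dot product of `text_1`. It is only arithmetic on the reported values and does not recompute any embedding.

```python
# Scores printed by the demo cell above
scores = {
    ("text_1", "text_1"): 6593.2627,
    ("text_1", "text_2"): 6189.0732,
    ("text_2", "text_3"): 5542.9907,
    ("text_1", "text_3"): 5702.2930,
}

self_score = scores[("text_1", "text_1")]
# Each score as a fraction of the self dot product of text_1
relative = {pair: value / self_score for pair, value in scores.items()}
# Gap between the best and the worst score, relative to the best one
spread = (max(scores.values()) - min(scores.values())) / max(scores.values())

for pair, ratio in relative.items():
    print(pair, round(ratio, 3))
print("relative spread:", round(spread, 3))
```

Even the unrelated cheeseburger sentence reaches roughly 84% to 86% of the self-similarity score, so the whole range of scores spans only about 16%.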

## Preprocessing

**Clean users' answers**

Users' answers are stored as raw HTML, so we need to extract the text that the BERT model will use. Users also often include code snippets in their answers, and we need to remove those as well.

In [6]:
CLEANR = re.compile('<.*?>') 

def clean_answer(raw_html):
    raw_html = clean_code_snippets(raw_html)
    txt = clean_html(raw_html)
    txt = clean_punctuation(txt)
    return txt

def clean_code_snippets(raw_html):
    # select content outside of <code> </code>
    chunks = raw_html.split("<code>")
    clean_txt = ""
    for chunk in chunks:
        sub_chunks = chunk.split("</code>")
        if len(sub_chunks) > 1:
            text = sub_chunks[1]
        else:
            text = sub_chunks[0]
        clean_txt = " ".join([clean_txt, text])
    return clean_txt

def clean_html(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

def clean_punctuation(txt):
    return txt.replace("\n", " ").strip()

Example of answer including a code snippet:

In [7]:
answer_idx = 5000
txt = df_answers.iloc[answer_idx].text
print(f"original answer is: \n{repr(txt)}")
print("="*18)
txt = clean_answer(txt)
print(f"cleaned answer is: \n{repr(txt)}")

original answer is: 
'<p>You should try this plugin:</p>\n<p><a href="https://pub.dev/packages/custom_splash" rel="nofollow noreferrer">customSplash</a></p>\n<p>And there are still more on pub.dev.</p>\n<p>The code for custom splash:</p>\n<pre><code>runApp(MaterialApp(\n    home: CustomSplash(\n        imagePath: \'assets/flutter_icon.png\',\n        backGroundColor: Colors.deepOrange,\n        animationEffect: \'zoom-in\',\n        logoSize: 200,\n        home: MyApp(),\n        customFunction: duringSplash,\n        duration: 2500,\n        type: CustomSplashType.StaticDuration,\n        outputAndHome: op,\n    ),\n));\n</code></pre>'
cleaned answer is: 
'You should try this plugin: customSplash And there are still more on pub.dev. The code for custom splash:'
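
For comparison, here is a regex-based variant of the same cleaning steps, offered as a sketch rather than a drop-in replacement. The name `clean_answer_regex` is hypothetical and, unlike `clean_answer`, it collapses all runs of whitespace into single spaces.

```python
import re

# Non-greedy match of a <code>...</code> block, including newlines inside it
CODE_RE = re.compile(r"<code>.*?</code>", flags=re.DOTALL)
# Any remaining HTML tag
TAG_RE = re.compile(r"<.*?>")


def clean_answer_regex(raw_html: str) -> str:
    no_code = CODE_RE.sub(" ", raw_html)   # drop code snippets
    no_tags = TAG_RE.sub("", no_code)      # drop HTML tags
    return " ".join(no_tags.split())       # normalize whitespace and newlines


sample = "<p>Try this:</p>\n<pre><code>x = 1\n</code></pre><p>Done.</p>"
print(clean_answer_regex(sample))  # Try this: Done.
```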


**Let's embed top users' answers**

First, we need to define the subset of users who have answered at least `k_answers` questions (3 by default; the run below uses `k_answers=1`, i.e. every user with at least one answer). We will then compute one embedding for each of these users.

In [11]:
def load_users_embeddings(file_name):
    file_path = os.path.join(DATA_PATH, "intermediary", file_name)
    return pd.read_pickle(file_path)
    

def compute_users_embedding(df_answers, user_ids, embedder):
    max_text_size = 500
    user_2_embeddings = {}
    n_skip_user = 0
    for user_id in tqdm(user_ids):
        texts = df_answers.loc[df_answers.user_id == user_id].text.values
        user_embeddings = []
        for text in texts:
            text = clean_answer(text)[:max_text_size]
            if text:
                text_embeddings = embedder.get_embeddings(text)
                if text_embeddings is not None:
                    user_embeddings.append(text_embeddings)
        if user_embeddings:
            user_embeddings = torch.stack(user_embeddings).mean(dim=0)
            user_2_embeddings[user_id] = user_embeddings
        else:
            n_skip_user += 1

    print(f"{n_skip_user} users were skipped")
    
    save_pickle("user_embeddings.pkl", user_2_embeddings)
    
    return user_2_embeddings


def filter_top_users(df_users, df_answers, k_answers=3):
    group_user = df_answers.groupby("user_id").answer_id.count() \
                           .reset_index().rename(columns={"answer_id": "total"})
    df_merge = df_users.merge(group_user, left_on="id", right_on="user_id", how="inner")
    df_merge = df_merge.loc[df_merge.total >= k_answers]
    
    return df_merge.user_id.values


def save_pickle(file_name, content):
    file_path = os.path.join(DATA_PATH, "intermediary", file_name)
    # use a context manager so the file handle is always closed
    with open(file_path, "wb") as f:
        pickle.dump(content, f)
    print(f"{file_path} written")

In [12]:
top_user_ids = filter_top_users(df_users, df_answers, k_answers=1)
len(top_user_ids)

45471

In [13]:
if USE_PRECOMPUTE:
    user_2_embeddings = load_users_embeddings("user_embeddings.pkl")
else:
    user_2_embeddings = compute_users_embedding(df_answers, top_user_ids, embedder)
len(user_2_embeddings)

44880

**Let's now define our model and encode some test questions**

In [14]:
def get_questions_top_users_precompute(Q_test, user_2_embeddings, df_q_embedding):
    user_ids, U = get_u_matrix(user_2_embeddings)
    results = []
    for question_id in tqdm(Q_test.question_id.values):
        if question_id in df_q_embedding.index:
            q_embedding = df_q_embedding.loc[question_id].embeddings
            top_user_ids = get_top_k_users_precompute(q_embedding, U, user_ids)
            results.append(np.hstack([question_id, top_user_ids]))
    save_results_csv("baseline_1_results.csv", results)

    
def get_top_k_users_precompute(q, U, user_ids, k=20):
    r = U @ q[None].T
    top_idxs = torch.argsort(r, dim=0, descending=True)[:k].numpy().reshape(1, -1)
    top_user_ids = user_ids[top_idxs]
    return top_user_ids.ravel()
    

def compute_questions_top_users(Q_test, user_2_embeddings, embedder):
    user_ids, U = get_u_matrix(user_2_embeddings)
    cols = Q_test.columns
    col2idx = dict(zip(cols, range(len(cols))))
    results = []
    for row in tqdm(Q_test.values):
        title = row[col2idx["title"]]
        question_id = row[col2idx["question_id"]]
        top_user_ids = get_top_k_users(title, U, user_ids, embedder)
        results.append(np.hstack([question_id, top_user_ids]))
    # same file name as the precompute path, so the evaluation cell reads the right file
    save_results_csv("baseline_1_results.csv", results)

    
def get_top_k_users(title, U, user_ids, embedder, k=20):
    q = embedder.get_embeddings(title)
    r = U @ q[None].T
    top_idxs = torch.argsort(r, dim=0, descending=True)[:k].numpy().reshape(1, -1)
    top_user_ids = user_ids[top_idxs]
    return top_user_ids.ravel()


def get_u_matrix(user_2_embeddings):
    user_ids = np.array(list(user_2_embeddings.keys()))
    U = torch.stack(list(user_2_embeddings.values()))
    return user_ids, U
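
The functions above score one question at a time. As an alternative sketch, all questions can be scored with a single matrix product; `batched_top_k` is a hypothetical helper written with NumPy on random toy data, not part of this repository.

```python
import numpy as np


def batched_top_k(Q: np.ndarray, U: np.ndarray, user_ids: np.ndarray, k: int = 20) -> np.ndarray:
    """For each row of Q, return the ids of the k highest-scoring users, best first."""
    scores = Q @ U.T                                   # (n_questions, n_users)
    k = min(k, scores.shape[1])
    # argpartition finds the k best columns per row without a full sort
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # sort only those k candidates by decreasing score
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)
    top_idx = np.take_along_axis(part, order, axis=1)
    return user_ids[top_idx]


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))       # 3 toy question embeddings
U = rng.normal(size=(10, 8))      # 10 toy user embeddings
ids = np.arange(100, 110)         # made-up user ids

print(batched_top_k(Q, U, ids, k=4).shape)  # (3, 4)
```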

In [15]:
if USE_PRECOMPUTE:
    file_path = os.path.join(DATA_PATH, "intermediary", "q_embeddings.pkl")
    df_q_embeddings = pd.read_pickle(file_path).set_index("question_id")
    get_questions_top_users_precompute(Q_val, user_2_embeddings, df_q_embeddings)
else:
    compute_questions_top_users(Q_val, user_2_embeddings, embedder)


../data/results/baseline_results.csv written


In [16]:
file_path = os.path.join(DATA_PATH, "results", "baseline_1_results.csv")
df_results = pd.read_csv(file_path, index_col="question_id")
df_results.head(3)

| question_id | user1_id | user2_id | user3_id | user4_id | user5_id | user6_id | user7_id | user8_id | user9_id | user10_id | user11_id | user12_id | user13_id | user14_id | user15_id | user16_id | user17_id | user18_id | user19_id | user20_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59758505 | 12073046 | 12000088 | 13619217 | 9701547 | 1394729 | 7202022 | 12423397 | 8854888 | 6387370 | 13396042 | 12788763 | 3384658 | 7394739 | 12992796 | 3618301 | 9119669 | 3641900 | 13949605 | 12531994 | 1404347 |
| 63187776 | 9701547 | 12423397 | 7394739 | 8804776 | 9982227 | 1394729 | 6387370 | 13619217 | 12073046 | 4626254 | 10152069 | 6587870 | 11480455 | 3802507 | 13076470 | 5588286 | 14446599 | 12531994 | 13983739 | 3618301 |
| 61603799 | 9701547 | 7248771 | 12073046 | 70918 | 7357999 | 2859017 | 5787139 | 1394729 | 12446721 | 12423397 | 7394739 | 3464777 | 13619217 | 12531994 | 12406503 | 11462013 | 13949605 | 9982227 | 976470 | 1404347 |


**Get our precision score**

In [17]:
R = df_results.values
# order the actual answers based on the predictions
df_actual = pd.DataFrame(df_answers.groupby("question_id").user_id.apply(list))
A = list(df_actual.loc[df_results.index].values[:, 0])

print(f"Precision @20: {precision_k(Y=A, Y_pred=R)}")
print(f"Recall @20: {recall_k(Y=A, Y_pred=R)}")

Precision @20: 7.100831811726518e-05
Recall @20: 0.0011158449989855954
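
The implementations of `precision_k` and `recall_k` in `src.score` are not shown in this notebook. The sketch below gives one common definition of these metrics, under the assumption that each question has at least one actual answerer; the function names `precision_at_k` and `recall_at_k` are hypothetical and the repository's versions may differ.

```python
def precision_at_k(actual, predicted, k=20):
    """Mean over questions of (number of actual answerers found in the top k) / k."""
    total = 0.0
    for y, y_pred in zip(actual, predicted):
        hits = len(set(y) & set(list(y_pred)[:k]))
        total += hits / k
    return total / len(actual)


def recall_at_k(actual, predicted, k=20):
    """Mean over questions of (number of actual answerers found in the top k) / (number of actual answerers)."""
    total = 0.0
    for y, y_pred in zip(actual, predicted):
        hits = len(set(y) & set(list(y_pred)[:k]))
        total += hits / len(set(y))
    return total / len(actual)


# Two toy questions, evaluated at k=2
actual = [[1, 2], [3]]          # users who really answered each question
predicted = [[1, 9], [8, 7]]    # top-2 predicted users for each question
print(precision_at_k(actual, predicted, k=2))  # 0.25
print(recall_at_k(actual, predicted, k=2))     # 0.25
```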


Compare it to a dummy prediction where we simply select the 20 users with the most answers.

In [18]:
top_20_users = df_answers.user_id.value_counts().head(20)
R_dummy = [top_20_users.index] * len(A)

print(f"Precision @20: {precision_k(Y=A, Y_pred=R_dummy):.4f}")
print(f"Recall @20: {recall_k(Y=A, Y_pred=R_dummy):.4f}")

Precision @20: 0.0032
Recall @20: 0.0478


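Our first baseline performs very poorly: its precision@20 (about 7.1e-05) and recall@20 (about 0.0011) are far below those of the dummy predictor (0.0032 and 0.0478). Let's propose a new baseline method, where we will fetch users from similar questions.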

## Notes

Had this first baseline shown some sign of success, we could have tried to learn user embeddings instead of deriving them from NLP features.

### V2 approach: learning user embeddings

For this new approach, we keep the same embedding strategy for questions, but we want to learn an embedding for users. The algorithm doesn't change: we still keep the 20 users with the largest dot product with the question.

$$\max_u f(q)^T g(u)$$

**Learning algorithm**
- Let's denote by $x_u \in \mathbb{R}^{768}$ the user embedding to learn.
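
As a sketch, the objective implied by the pseudo-code further down can be written as a regularized squared loss. Here $y_{qu}$ and $\lambda$ are notation introduced for this illustration: $y_{qu} = 1$ when user $u$ answered question $q$, $y_{qu} = 0$ for a randomly sampled pair, and $\lambda$ plays the role of the weight-decay value (up to a constant factor).

$$\mathcal{L} = \sum_{(q, u)} \left(y_{qu} - x_u^T f(q)\right)^2 + \lambda \sum_u \lVert x_u \rVert^2$$

Only the $x_u$ are updated; $f(q)$ stays fixed.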

**Notes**

One drawback to keep in mind is that learning embeddings for users with few answers is prone to overfitting, so we may need to limit our user list. Here is some pseudo-code for this approach.

```python
import torch

# Pseudo-code: `dataset`, `embed_nlp_fixed_params` and the `some_chosen_*` values are placeholders.
# User ids are assumed to be remapped to contiguous indices in [0, n_users).
n_users = len(dataset.users.unique())
embed_user = torch.nn.Embedding(num_embeddings=n_users, embedding_dim=768)

opt = torch.optim.SGD(
    embed_user.parameters(),
    lr=some_chosen_learning_rate,
    weight_decay=some_chosen_regularization_value,
)

# everything below is batched
# the random user may or may not have answered the question sampled with them
for question, random_user in zip(dataset.questions, dataset.users):
    # fixed (non-trainable) NLP embedding of the question title
    f_q = embed_nlp_fixed_params(question.title)

    # users who actually answered the question should map to a 1.0 dot product
    for user_who_answered in question.users:
        x_u = embed_user(torch.tensor(user_who_answered.user_id))
        loss = (1.0 - x_u.dot(f_q)) ** 2

        opt.zero_grad()
        loss.backward()
        opt.step()

    # the random user should map to a 0.0 dot product for this question
    x_u = embed_user(torch.tensor(random_user))
    loss = (0.0 - x_u.dot(f_q)) ** 2

    opt.zero_grad()
    loss.backward()
    opt.step()
```
 
At eval time nothing changes, except that we use the learned user embeddings. We still take the 20 users with the largest dot product:

$$\max_u x_u^T f(q)$$
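
To make the update rule concrete without PyTorch, here is a minimal NumPy sketch of the same idea on synthetic data, with the gradient of the squared loss written by hand. The sizes, learning rate, weight decay and the helper `sgd_step` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, dim = 5, 8                               # toy sizes; the notebook uses 768 dimensions
X = rng.normal(scale=0.1, size=(n_users, dim))    # learnable user embeddings x_u
lr, weight_decay = 0.05, 1e-3                     # hypothetical hyperparameters


def sgd_step(X, user_idx, f_q, target):
    """One SGD step on (target - x_u . f_q)^2 for one user; returns the loss before the step."""
    x_u = X[user_idx]
    error = target - x_u @ f_q
    # gradient of the squared error plus the weight-decay term
    grad = -2.0 * error * f_q + weight_decay * x_u
    X[user_idx] = x_u - lr * grad
    return error ** 2


f_q = rng.normal(size=dim)
f_q /= np.linalg.norm(f_q)    # unit-norm stand-in for a fixed question embedding

losses = []
for _ in range(200):
    loss_pos = sgd_step(X, user_idx=0, f_q=f_q, target=1.0)  # user 0 answered the question
    loss_neg = sgd_step(X, user_idx=1, f_q=f_q, target=0.0)  # user 1 is a random negative
    losses.append(loss_pos + loss_neg)

scores = X @ f_q
print(losses[0], losses[-1])
print(int(np.argmax(scores)))  # index of the best-scoring user for this question
```

After training, the user who answered the question gets a score close to 1 and the sampled negative a score close to 0, so the top-k ranking at eval time favors the former.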