# Search feature weighting

This notebook computes feature weights for a search ranking model.

As of early 2020, search ranking was done by an ad-hoc pairwise comparison function that may not even be transitive. We want to replace it with a more structured and analyzable approach that can incorporate additional search features besides corpus frequency: cosine vector distance for now, with room for more features later.

The basic pieces are:
  - Use occurrence of a search result in the search survey as a 0-or-1 relevance variable
  - Create a relevance score from that using some fairly basic linear modelling techniques to compute best-fit feature weights
  - Measure success by the three10 score: the mean, across queries, of the percentage of top-3 results from the linguist survey that appear in the top 10 search results
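
As a rough illustration of these metrics, the sketch below computes top3 and three10 for a single query from a ranked result list. It is a simplification under stated assumptions: the function name `top3_and_three10` and its inputs are hypothetical, and the real logic in `weighting_nb_code.py` may treat duplicates and homonyms differently.

```python
def top3_and_three10(survey_words, ranked_results, cutoff=10):
    """Return (top3, three10) percentages for one query.

    survey_words: up to three expected wordforms from the survey.
    ranked_results: wordforms returned by search, best first.
    """
    if not survey_words:
        return 0.0, 0.0
    # 1-based rank of the first occurrence of each result.
    ranks = {}
    for position, word in enumerate(ranked_results, start=1):
        ranks.setdefault(word, position)
    found = [ranks[w] for w in survey_words if w in ranks]
    # top3: share of survey words that appear anywhere in the results.
    top3 = 100.0 * len(found) / len(survey_words)
    # three10: share of survey words that appear within the cutoff.
    three10 = 100.0 * sum(r <= cutoff for r in found) / len(survey_words)
    return top3, three10


# Hypothetical query: two survey words, one ranked 4th and one 14th.
results = [f"w{i}" for i in range(1, 21)]
print(top3_and_three10(["w4", "w14"], results))  # (100.0, 50.0)
```

The summary numbers reported later in this notebook would then be the mean of these per-query percentages.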

There are many interesting possible future improvements here, such as:
  - Use occurrence anywhere in sample, instead of top3 results only, for training
  - More precise training data, e.g., relevance rankings of 1-5
  - Handle homonyms in training data instead of matching purely on wordform text
  - More training data, specifically more human-scored results per query
  - More features, e.g., tf-idf
  - Higher quality features, e.g., better stopword filtering in vector computations
  - Map features to similar ranges and distributions so the regression can compare them more effectively
  - Separate training and test sets
  - Fancier models
  - Better evaluation functions, such as discounted cumulative gain
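
Of the ideas in this list, discounted cumulative gain is easy to sketch. The following is a minimal illustration using the standard DCG formula with binary relevance; the function names are hypothetical and nothing like this exists in the notebook code yet.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) for 1-based rank i."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the best possible ordering of the same relevances."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# 1 = result is in the survey, 0 = not; listed in displayed order.
print(ndcg([1, 1, 0, 0]))  # 1.0: the relevant results are already on top
print(ndcg([0, 0, 1, 1]) < ndcg([1, 0, 1, 0]))  # True
```

Unlike three10, a measure like this rewards moving a relevant result from rank 9 to rank 1, not just getting it into the top 10.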

That said, having all the pieces together, even in a very basic form, is already an improvement over the existing search, so let’s start with that.

## Preliminaries

Load some libraries. `weighting_nb_code.py` contains more Pythonic code that was extracted from exploratory Jupyter notebooks once it was working.

In [1]:
import importlib
import os

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import weighting_nb_code

# Reload the code in `weighting_nb_code.py` by re-running this cell, or
# by copying the next line into other cells. If this reload mechanism
# proves insufficient, there is also `IPython.lib.deepreload`.
importlib.reload(weighting_nb_code);

First, if the JSON output file doesn’t already exist, we’ll run the `featuredump` management command to get our raw data. Cosine vector distance (CVD) search is not yet on by default, so we prefix each query with `cvd:retrieval` to enable it.

In [2]:
![ -f sample-features.json ] || \
    {weighting_nb_code.BASE_DIR}/manage.py featuredump \
        --prefix-queries-with 'cvd:retrieval' \
        > sample-features.json

100%|█████████████████████████████████████████| 548/548 [00:58<00:00,  9.44it/s]


The loaded feature data looks like this:

In [3]:
data = weighting_nb_code.dataframe_from_featuredump('sample-features.json')
data

Unnamed: 0,target_language_affix_match,definitions,webapp_sort_rank,is_preverb_match,is_lemma,lemma_wordform_text,source_language_affix_match,source_language_match,morpheme_ranking,cosine_vector_distance,query,wordform_text,wordform_length,query_wordform_edit_distance,target_language_keyword_match
0,True,"[[That's what he says to him., MD], [s/he says...",1,,True,itêw,,,3.86950,,about,itêw,4,,[about]
1,True,"[[He speaks of it as so., MD], [s/he says thus...",2,,True,itam,,,4.52311,,about,itam,4,,[about]
2,True,"[[I suppose., MD], [apparently, I guess, I sup...",3,,True,êtikwê,,,5.32741,,about,êtikwê,6,,[about]
3,True,"[[Maybe; perhaps; I guess so., MD], [maybe, I ...",4,,True,êtokwê,,,5.76791,,about,êtokwê,6,,[about]
4,True,"[[He tells about it., MD], [s/he tells about s...",5,,True,wîhtam,,,6.75698,0.339667,about,wîhtam,6,,[about]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123275,,"[[s/he goes to see s.o., s/he fetches s.o., CW]]",503,,True,nâtawâpamêw,,,30.29770,,they see us,nâtawâpamêw,11,,[see]
123276,,[[They are one group of them. E.g. Tribe or na...,504,,True,pêyakôskânêsiwak,,,30.38920,,they see us,pêyakôskânêsiwak,16,,[they]
123277,,"[[s/he goes to see s.t., s/he fetches s.t., CW]]",505,,True,nâtawâpahtam,,,30.48390,,they see us,nâtawâpahtam,12,,[see]
123278,,"[[they are in vast numbers, CW]]",506,,True,kakwâhyakêyatinwa,,,,,they see us,kakwâhyakêyatinwa,17,,[they]


Here’s the current combined result survey sample.

In [4]:
weighting_nb_code.survey()

Unnamed: 0,Query,Nêhiyawêwin 1,Nêhiyawêwin 2,Nêhiyawêwin 3
0,about,wayês,ohci,papâ
1,all,kahkiyaw,kapê,mâwaci
2,also,mîna,êkwa,kisik
3,and,êkwa,mîna,kisik
4,as,kisik,wiya,tâpiskôt
...,...,...,...,...
543,she sees him,wâpamêw,,
544,starblanket,atâhkakohp,acâhkosa kâ-otakohpit,
545,star blanket,atâhkakohp,acâhkosa kâ-otakohpit,
546,being taught,kiskinwahamâkosiw,,


And `weighting_nb_code.py` contains a function to annotate the `featuredump` results with the top3/three10 metrics.

In [5]:
weighting_nb_code.top3_and_310_stats(data, rank_column="webapp_sort_rank")[
    ["query", "wordform_text", "definitions", "actual_result_ranks", "top3", "three10"]
]

Unnamed: 0,query,wordform_text,definitions,actual_result_ranks,top3,three10
0,"""horse""",misatim,"[[horse, CW]]","[4.0, 14.0]",100.000000,50.000000
1,'horse',misatim,"[[horse, CW]]","[4.0, 13.0]",100.000000,50.000000
2,Calgary,otôskwanihk,"[[Calgary, AB; literally: at the elbow; at his...",[21.0],100.000000,0.000000
3,Cree,nêhiyaw,[[A Cree Indian man. A native of the Cree nati...,"[1.0, 8.0]",100.000000,100.000000
4,Cree language,nêhiyawêwin,"[[The Cree language., MD], [the Cree language;...",[8.0],100.000000,100.000000
...,...,...,...,...,...,...
543,yellow hat,osâwastotin,"[[yellow hat, CW]]",[21.0],100.000000,0.000000
544,you,kiya,"[[You., MD], [you, CW]]","[3.0, 12.0, 27.0]",100.000000,33.333333
545,young,oski,,[],0.000000,0.000000
546,younger sibling,nisîmis,[[My younger brother or sister (Among children...,[8.0],100.000000,100.000000


## Initial results from dictionary code

Without any cosine vector distance features, here are the current search stats we want to beat: 81.3% for top3 and 59.4% for three10.

In [6]:
if os.path.isfile('sample-features-orig.json'):
    data_orig = weighting_nb_code.dataframe_from_featuredump('sample-features-orig.json')
    display(weighting_nb_code.top3_and_310_stats_summary(data_orig, rank_column="webapp_sort_rank"))

top3       81.326034
three10    59.367397
dtype: float64

Note: this won’t exactly match what the Django `/search-quality` pages report, because the two differ in how they determine rank. In the Django code, if the results are `(non-lemma1, non-lemma2)`, we count the ranks as `(1, 3)` because the UI display of `non-lemma1` includes its lemma definition at rank 2. Here we skip that for now, but the results should be close enough.
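
To make the difference concrete, here is a hedged sketch of the Django-style rank counting described in the note. It assumes that every non-lemma result takes up one extra display slot for its lemma definition; `display_ranks` is a hypothetical helper, not a function from the webapp.

```python
def display_ranks(is_lemma_flags):
    """Return the 1-based display rank of each result.

    is_lemma_flags: one boolean per result, in result order. A non-lemma
    result is assumed to be followed by its lemma definition, which uses
    up the next rank.
    """
    ranks = []
    next_rank = 1
    for is_lemma in is_lemma_flags:
        ranks.append(next_rank)
        next_rank += 1 if is_lemma else 2
    return ranks


print(display_ranks([False, False]))  # [1, 3]: the lemma definition sits at rank 2
print(display_ranks([True, True]))    # [1, 2]: lemmas take one slot each
```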

And, for comparison, here are the stats when we added a very basic cosine vector distance model to the search:

In [7]:
weighting_nb_code.top3_and_310_stats_summary(data, rank_column="webapp_sort_rank")

top3       83.424574
three10    41.271290
dtype: float64

The top3 score (the percentage of desired search results we see anywhere in the list) has gone up. That is, the vector model’s ability to resolve synonyms has improved recall. But the three10 score (the percentage of desired search results that appear in the top 10) has gone down, since we don’t yet have a good ranking mechanism.

## Modelling

At this point we have all the definition and feature data from the webapp loaded, and we could experiment by adding more data columns with additional features. Those features could be computed by Python code here, or loaded from data files.

For this first version, let’s stick with what we have:

In [8]:
def prep_results_for_regression(df):
    # The default value used for `fillna()` doesn’t matter if we
    # also have an indicator variable, but things get trickier
    # with logarithms.
    return df.assign(
        morpheme_ranking=df["morpheme_ranking"].fillna(1),
        has_morpheme_ranking=weighting_nb_code.has_col_as_int(df, "morpheme_ranking"),
        has_cosine_vector_distance=weighting_nb_code.has_col_as_int(df, "cosine_vector_distance"),
        cosine_vector_distance=df["cosine_vector_distance"].fillna(1.1),
        is_in_survey=df.apply(weighting_nb_code.is_in_survey, axis=1),
        keyword_match_len=df['target_language_keyword_match'].apply(len)
    )
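
To see why the fill value is harmless when an indicator variable is present, consider a sketch with a single feature $x$, its indicator $h$, and a fill value $c$ (these symbols are introduced here only for illustration):

$$
\hat{y} = \beta_0 + \beta_1 h + \beta_2 \tilde{x},
\qquad
\tilde{x} =
\begin{cases}
x & \text{if } h = 1 \\
c & \text{if } h = 0
\end{cases}
$$

$$
\hat{y} =
\begin{cases}
(\beta_0 + \beta_1) + \beta_2 x & \text{if } h = 1 \\
\beta_0 + \beta_2 c & \text{if } h = 0
\end{cases}
$$

Rows with a value get their own intercept and slope, and rows without one get a single constant. Changing $c$ shifts $\beta_0$ and $\beta_1$ but leaves the fitted values unchanged. The same reasoning should apply to a transformed feature such as $\log(1 + x)$, provided $1 + c > 0$ so that the logarithm is defined. Note that the formula in the next cell includes `has_morpheme_ranking` but no indicator for the cosine vector distance, so there the fill value of 1.1 can influence the fit.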

In [9]:
df = prep_results_for_regression(data)
results = smf.ols(
    """
    is_in_survey ~
        wordform_length
        + keyword_match_len
        + has_morpheme_ranking
        + morpheme_ranking
        + np.log(1 + cosine_vector_distance)
    """,
    data=df,
).fit()
display(results.summary())
sorted_results = weighting_nb_code.rank_by_predictor(df, results)
weighting_nb_code.top3_and_310_stats_summary(sorted_results, rank_column="result_rank")

Dep. Variable:,is_in_survey,R-squared:,0.057
Model:,OLS,Adj. R-squared:,0.057
Method:,Least Squares,F-statistic:,1498.0
Date:,"Mon, 10 May 2021",Prob (F-statistic):,0.0
Time:,11:07:32,Log-Likelihood:,134990.0
No. Observations:,123280,AIC:,-270000.0
Df Residuals:,123274,BIC:,-269900.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0559,0.003,19.795,0.000,0.050,0.061
wordform_length,-0.0006,9.36e-05,-6.076,0.000,-0.001,-0.000
keyword_match_len,0.0326,0.001,45.689,0.000,0.031,0.034
has_morpheme_ranking,0.0228,0.003,8.354,0.000,0.017,0.028
morpheme_ranking,-0.0010,6.77e-05,-14.756,0.000,-0.001,-0.001
np.log(1 + cosine_vector_distance),-0.1191,0.002,-78.804,0.000,-0.122,-0.116

Omnibus:,197247.555,Durbin-Watson:,1.695
Prob(Omnibus):,0.0,Jarque-Bera (JB):,80165249.192
Skew:,10.819,Prob(JB):,0.0
Kurtosis:,126.038,Cond. No.,347.0


top3       83.424574
three10    70.042579
dtype: float64

Objectively, compared with the original search, we have increased how many of the top survey results we display at all from 81.3% to 83.4%, and we have increased the mean percentage of them that appear in the top 10 results per query from 59.4% to 70.0%. That’s great!
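
`rank_by_predictor` lives in `weighting_nb_code.py` and is not shown here. Presumably it scores each row with the fitted model and ranks rows within each query by descending score; the sketch below shows that idea on plain dictionaries. The function name `rank_within_query` is hypothetical, and the `misatim` score is made up for the example.

```python
from collections import defaultdict


def rank_within_query(rows):
    """Add a 1-based `result_rank` to each row, best score first per query.

    rows: list of dicts with at least `query` and `score` keys.
    Returns new dicts; the input rows are left untouched.
    """
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["query"]].append(row)

    ranked = []
    for query_rows in by_query.values():
        ordered = sorted(query_rows, key=lambda r: r["score"], reverse=True)
        for rank, row in enumerate(ordered, start=1):
            ranked.append({**row, "result_rank": rank})
    return ranked


rows = [
    {"query": "counts", "wordform_text": "akihtam", "score": 0.053815},
    {"query": "counts", "wordform_text": "akihcikêw", "score": 0.062559},
    {"query": "horse", "wordform_text": "misatim", "score": 0.05},
]
for row in rank_within_query(rows):
    print(row["query"], row["result_rank"], row["wordform_text"])
```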

Let’s take a look at a sample query. Before:

In [10]:
(weighting_nb_code.top3_and_310_stats(data, rank_column='webapp_sort_rank')
     .query('query == "counts"'))[['query', 'actual_result_ranks', 'top3', 'three10']]

Unnamed: 0,query,actual_result_ranks,top3,three10
126,counts,"[12.0, 19.0, 30.0]",100.0,0.0


Previously, when searching for ‘counts’, all the good results showed up somewhere in the results, but none of them were in the top 10.

Now, with this new ranking model, the top results from the survey show up at the top of the search results, and even in 1, 2, 3 order:

In [11]:
(weighting_nb_code.top3_and_310_stats(sorted_results, rank_column='result_rank')
     .query('query == "counts"'))[['query', 'actual_result_ranks', 'top3', 'three10']]

Unnamed: 0,query,actual_result_ranks,top3,three10
126,counts,"[1.0, 2.0, 3.0]",100.0,100.0


If we look in more detail at the results, we can see that the cosine vector distance (CVD) and the morpheme ranking are being combined, but one doesn’t overrule the other. Rarer words generally appear later in the list, but a strong CVD score can move them earlier, and vice versa.

In [12]:
sorted_results.query("query == 'counts'").sort_values('score', ascending=False)[
    ['wordform_text', 'definitions', 'morpheme_ranking',
     'has_cosine_vector_distance',
     'cosine_vector_distance', 'is_in_survey', 'score']
 ].head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,has_cosine_vector_distance,cosine_vector_distance,is_in_survey,score
91548,akihcikêw,"[[s/he counts, CW]]",17.2326,1,0.248058,1,0.062559
91530,akihtam,"[[He counts them. Inanimate., MD], [s/he count...",14.3906,1,0.388727,1,0.053815
91537,akimêw,"[[He counts them (those people)., MD], [s/he c...",15.6314,1,0.39185,1,0.052877
91554,akihtâw,"[[s/he counts s.t., CW]]",17.9402,1,0.370059,0,0.051883
91574,akihtâsow,"[[s/he counts, CW]]",30.4859,1,0.248058,0,0.049326
91523,itakihtam,"[[He sets a price on it., MD], [s/he charges s...",12.4932,1,0.467548,0,0.047998
91531,têpakihtam,"[[He has counted enough., MD], [s/he counts s....",14.9248,1,0.431836,0,0.047936
91565,têpakimêw,"[[He counts enough of them., MD], [s/he counts...",20.3262,1,0.400263,0,0.045767
91546,akihtamawêw,"[[He counts those things for him., MD], [s/he ...",17.0718,1,0.448443,0,0.04385
91553,mâmawôkihtam,"[[s/he counts s.t. all together, CW]]",17.6431,1,0.460465,0,0.041727
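
Two rows from this table illustrate the trade-off. The sketch below compares only the contributions of `morpheme_ranking` and `cosine_vector_distance`, using the rounded coefficients from the regression summary and assuming the remaining terms are equal for both rows (both wordforms are nine characters long). `partial_score` is a hypothetical helper.

```python
import math

# Rounded coefficients copied from the regression summary.
MORPHEME_RANKING_COEF = -0.0010
CVD_COEF = -0.1191


def partial_score(morpheme_ranking, cosine_vector_distance):
    """Contribution of the two features to the score, ignoring the other terms."""
    return (MORPHEME_RANKING_COEF * morpheme_ranking
            + CVD_COEF * math.log(1 + cosine_vector_distance))


# akihtâsow: higher morpheme_ranking, but a smaller vector distance.
akihtasow = partial_score(30.4859, 0.248058)
# itakihtam: lower morpheme_ranking, but a larger vector distance.
itakihtam = partial_score(12.4932, 0.467548)
print(akihtasow > itakihtam)  # True
```

So akihtâsow, despite its much higher morpheme ranking, ends up ahead of itakihtam because its definition is closer to the query.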


The combined ranking for ‘counts’ is quite a bit better than ranking by morpheme ranking alone:

In [13]:
if os.path.isfile('sample-features-orig.json'):
    display((data_orig.assign(is_in_survey=data_orig.apply(weighting_nb_code.is_in_survey, axis=1))
     .query("query == 'counts'").sort_values('webapp_sort_rank')[
        ['wordform_text', 'definitions', 'morpheme_ranking', 'is_in_survey']
     ]).head(10))

Unnamed: 0,wordform_text,definitions,morpheme_ranking,is_in_survey
82070,itakihtam,"[[He sets a price on it., MD], [s/he charges s...",12.4932,0
82071,itakisow,[[s/he is counted thus; it is held in such est...,12.9382,0
82072,itakihtêw,"[[It is priced at so much., MD], [it is counte...",13.6279,0
82073,kîsakimêw,[[s/he finishes counting s.o.; s/he finishes g...,14.3744,0
82074,akihtam,"[[He counts them. Inanimate., MD], [s/he count...",14.3906,1
82075,têpakihtam,"[[He has counted enough., MD], [s/he counts s....",14.9248,0
82076,wiyakihtam,"[[He puts a price on it., MD], [s/he counts s....",15.1186,0
82077,akisow,"[[s/he is counted, listed; s/he is accountable...",15.3652,0
82078,mâtakihtam,"[[He starts counting it., MD], [s/he starts co...",15.3697,0
82079,mâtakimâw,"[[it (i.e. the moon, month) is beginning, it s...",15.5938,0


## Model export

The model generated by the `statsmodels` library is `pickle`able, but since it’s a fairly basic linear model, for now we will just print the parameters to use in the webapp.

In [14]:
print(results.params.to_json(indent=2))

{
  "Intercept":0.0559011609,
  "wordform_length":-0.0005685605,
  "keyword_match_len":0.0325909057,
  "has_morpheme_ranking":0.022778805,
  "morpheme_ranking":-0.0009984537,
  "np.log(1 + cosine_vector_distance)":-0.1190890019
}
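
As a sketch of how these parameters might be applied outside the notebook, the function below recomputes the score with only the standard library. It assumes the same fill values as `prep_results_for_regression` (1 for a missing `morpheme_ranking`, 1.1 for a missing `cosine_vector_distance`) and that the indicator is 1 whenever a morpheme ranking is present; `relevance_score` is a hypothetical name, not the webapp’s actual function.

```python
import math

# Parameters copied from the JSON printed above.
PARAMS = {
    "Intercept": 0.0559011609,
    "wordform_length": -0.0005685605,
    "keyword_match_len": 0.0325909057,
    "has_morpheme_ranking": 0.022778805,
    "morpheme_ranking": -0.0009984537,
    "np.log(1 + cosine_vector_distance)": -0.1190890019,
}


def relevance_score(wordform_length, keyword_match_len,
                    morpheme_ranking=None, cosine_vector_distance=None):
    """Linear relevance score; None marks a missing feature value."""
    has_morpheme_ranking = 0 if morpheme_ranking is None else 1
    if morpheme_ranking is None:
        morpheme_ranking = 1  # same fill value as in the notebook
    if cosine_vector_distance is None:
        cosine_vector_distance = 1.1  # same fill value as in the notebook
    return (
        PARAMS["Intercept"]
        + PARAMS["wordform_length"] * wordform_length
        + PARAMS["keyword_match_len"] * keyword_match_len
        + PARAMS["has_morpheme_ranking"] * has_morpheme_ranking
        + PARAMS["morpheme_ranking"] * morpheme_ranking
        + PARAMS["np.log(1 + cosine_vector_distance)"]
        * math.log(1 + cosine_vector_distance)
    )


print(f"{relevance_score(0, 0):.6f}")  # -0.033454
```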


And here are some test vectors for ensuring the implementation is working correctly.

In [15]:
import re

def print_test_vector(**kwargs):
    df = prep_results_for_regression(pd.DataFrame([{
        "query": "counts",
        "wordform_text": "",
        "target_language_keyword_match": [],
        "wordform_length": 0,
        "keyword_match_len": 0,
        "morpheme_ranking": np.nan,
        "cosine_vector_distance": np.nan,
        **kwargs
    }]))
    ret = results.predict(df)[0]
    # A future Python feature, “underscore as a decimal separator”
    # (https://bugs.python.org/issue43624), would be handy here. Until
    # then, group the fractional digits in threes by hand.
    ret = f'{ret:_f}'
    if '.' in ret:
        whole, frac = ret.split('.')
        frac = re.sub(r'(...)(?=.)', r'\1_', frac)
        ret = f'{whole}.{frac}'
    print(ret)
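
The digit-grouping step at the end of `print_test_vector` can be tried on its own. The sketch below isolates it in a hypothetical helper, `group_fraction_digits`: the `_f` format groups the integer part, and the regular expression inserts an underscore after every three fractional digits that are followed by at least one more digit.

```python
import re


def group_fraction_digits(value):
    """Format a float with six decimals and underscores every three digits."""
    text = f"{value:_f}"
    whole, frac = text.split(".")
    # The lookahead keeps a trailing underscore from being added.
    frac = re.sub(r"(...)(?=.)", r"\1_", frac)
    return f"{whole}.{frac}"


print(group_fraction_digits(0.002708))   # 0.002_708
print(group_fraction_digits(1234567.5))  # 1_234_567.500_000
```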

In [16]:
print_test_vector()

-0.033_454


In [17]:
print_test_vector(cosine_vector_distance=0.7)

-0.008_289


In [18]:
print_test_vector(morpheme_ranking=12.8)

-0.022_457


In [19]:
print_test_vector(cosine_vector_distance=0.7, morpheme_ranking=12.8)

0.002_708
