# Search feature weighting

This notebook computes feature weights for a search ranking model.

As of early 2020, search ranking was being done by an ad-hoc pairwise comparison function that may not even be transitive. We want to replace it with a more structured and analyzable approach that can incorporate additional search features besides corpus frequency: cosine vector distance for now, with room for more features later.

The basic pieces are:
  - Use occurrence of a search result in the search survey as a 0-or-1 relevance variable
  - Create a relevance score from that using some fairly basic linear modelling techniques to compute best-fit feature weights
  - Measure success by the three10 score: the mean percentage of top-3 results from the linguist survey that appear in the top 10 search results
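
As a rough illustration of the three10 metric for a single query, here is a minimal sketch. The function names are hypothetical (they are not the ones in `weighting_nb_code.py`), and the sketch assumes the percentage is taken relative to the number of survey answers the query actually has, since the survey lists up to three.

```python
def three10_for_query(survey_answers, ranked_results, cutoff=10):
    """Percentage of a query's survey answers found in the top `cutoff` results.

    Returns None when the survey has no answers for the query.
    """
    # Missing survey cells (None, NaN, empty string) are skipped.
    wanted = [w for w in survey_answers if isinstance(w, str) and w]
    if not wanted:
        return None
    top = set(ranked_results[:cutoff])
    hits = sum(1 for w in wanted if w in top)
    return 100.0 * hits / len(wanted)


def mean_three10(per_query_scores):
    """Mean over queries, ignoring queries without survey answers."""
    scores = [s for s in per_query_scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Survey answers for "about" and a made-up result ordering.
score = three10_for_query(
    ["wayês", "ohci", "papâ"],
    ["ohci", "âcimêw", "pimitâcimow", "tânisi"],
)
print(round(score, 1))  # 33.3
```

Only one of the three survey answers is in the made-up result list, so the query scores one third.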

There are many interesting possible future improvements here, such as:
  - Use occurrence anywhere in sample, instead of top3 results only, for training
  - More precise training data, e.g., relevance rankings of 1-5
  - Handle homonyms in training data instead of matching purely on wordform text
  - More training data, specifically human scores for more results per query
  - More features, e.g., tf-idf
  - Higher quality features, e.g., better stopword filtering in vector computations
  - Map features to similar ranges and distributions so the regression can compare them more effectively
  - Separate training and test sets
  - Fancier models
  - Better evaluation functions, such as discounted cumulative gain
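
For reference, discounted cumulative gain could be sketched as follows. This is the common formulation with a log2 position discount, written here for illustration only; neither function exists in the notebook code, and the graded relevance values are assumed inputs (for example the 1-5 rankings mentioned above).

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values listed in rank order."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# A relevant result at rank 1 scores higher than the same result at rank 2.
print(ndcg([1, 0, 0]), ndcg([0, 1, 0]))
```

Unlike three10, this rewards moving a good result from rank 9 to rank 1, not just getting it into the top 10.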

That said, having all the pieces together, even in a very basic form, is already an improvement over the existing search, so let’s start with that.

## Preliminaries

Load some libraries. `weighting_nb_code.py` contains some more python-y code that was extracted from some exploratory Jupyter notebooks once it was working OK.

In [45]:
import importlib


import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import weighting_nb_code

# Reload the code in `weighting_nb_code.py` by re-running this cell, or
# by copying the next line into other cells. If this reload mechanism
# proves insufficient, there is also `IPython.lib.deepreload`.
importlib.reload(weighting_nb_code);

First, if the JSON output file doesn’t already exist, we’ll run the `featuredump` management command to get our raw data. CVD search is not yet on by default, so we add a fancy query to enable it.

In [65]:
![ -f sample-features.json ] || \
    {weighting_nb_code.BASE_DIR.parent.parent}/crkeng-manage featuredump \
        --prefix-queries-with 'espt:1' \
        > sample-features.json

100%|█████████████████████████████████████████| 548/548 [00:32<00:00, 17.04it/s]


The loaded feature data looks like this:

In [66]:
data = weighting_nb_code.dataframe_from_featuredump('sample-features.json')
data

Unnamed: 0,morpheme_ranking,relevance_score,wordform_length,target_language_affix_match,pos_match,target_language_keyword_match,is_espt_result,query_wordform_edit_distance,lemma_freq,cosine_vector_distance,query,wordform_text,definitions,source_language_match,webapp_sort_rank,lemma_wordform_text,is_lemma,word_list_freq,source_language_keyword_match
0,,0.344752,4,True,0.0,[about],0,,0.020288,0.000000,about,ohci,"[[from there, thence, out of, CW], [with, by m...",,1,ohci,True,3,[]
1,,0.293062,6,True,0.0,[about],0,,0.001052,0.525657,about,âcimêw,"[[s/he tells about s.o., s/he talks about s.o....",,2,âcimêw,True,1,[]
2,,0.291421,11,True,0.0,[about],0,,0.000124,0.522532,about,pimitâcimow,"[[s/he crawls about, s/he crawls around, s/he ...",,3,pimitâcimow,True,0,[]
3,,0.069804,6,,0.0,[],0,,0.014659,0.564088,about,tânisi,"[[how, in what way, CW]]",,4,tânisi,True,1,[]
4,,0.066577,10,,0.0,[],0,,0.005536,0.663220,about,misi-mîciw,"[[s/he eats a lot of s.t., s/he eats much of s...",,5,mîciw,False,3,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18568,,0.062512,4,,0.0,[],0,,0.000000,0.640580,they see us,kâh-,"[[grammatical preverb: intermittent, repeatedl...",,32,kâh-,True,0,[]
18569,,0.062481,13,,0.0,[],0,,-3.000000,0.641013,they see us,katawatêyimêw,"[[s/he thinks s.o. beautiful, CW]]",,33,katawatêyimêw,True,0,[]
18570,,0.061861,18,,0.0,[],0,,0.000000,0.677476,they see us,oskana kâ-asastêki,"[[Regina, SK, CW], [""Pile-o'-Bones"", CW]]",,34,oskana kâ-asastêki,True,1,[]
18571,,0.059863,10,,0.0,[],0,,-21.000000,0.707219,they see us,nôhtêpayiw,"[[s/he lacks (s.t.), CW], [s/he falls short, s...",,35,nôhtêpayiw,True,1,[]


Here’s the current combined result survey sample.

In [67]:
weighting_nb_code.survey()

Unnamed: 0,Query,Nêhiyawêwin 1,Nêhiyawêwin 2,Nêhiyawêwin 3
0,about,wayês,ohci,papâ
1,all,kahkiyaw,kapê,mâwaci
2,also,mîna,êkwa,kisik
3,and,êkwa,mîna,kisik
4,as,kisik,wiya,tâpiskôt
...,...,...,...,...
543,she sees him,wâpamêw,,
544,starblanket,atâhkakohp,acâhkosa kâ-otakohpit,
545,star blanket,atâhkakohp,acâhkosa kâ-otakohpit,
546,being taught,kiskinwahamâkosiw,,


And `weighting_nb_code.py` contains a function to annotate the `featuredump` results with the top3/three10 metrics.

In [68]:
weighting_nb_code.top3_and_310_stats(data, rank_column="webapp_sort_rank")[
    ["query", "wordform_text", "definitions", "actual_result_ranks", "top3", "three10"]
]



Unnamed: 0,query,wordform_text,definitions,actual_result_ranks,top3,three10
0,"""horse""",misatim,,[],0.0,0.0
1,'horse',misatim,,[],0.0,0.0
2,Calgary,otôskwanihk,"[[Calgary, AB, CW]]",[1.0],100.0,100.0
3,Cree,nêhiyaw,,[],0.0,0.0
4,Cree language,nêhiyawêwin,,[],0.0,0.0
...,...,...,...,...,...,...
543,yellow hat,osâwastotin,,[],0.0,0.0
544,you,kiya,,[],0.0,0.0
545,young,oski,,[],0.0,0.0
546,younger sibling,nisîmis,,[],0.0,0.0


## Initial results from dictionary code

Without any cosine vector features, these are the current search stats we want to beat: 81.3% for top3 and 59.4% for three10.

In [69]:
import os
if os.path.isfile('sample-features-orig.json'):
    data_orig = weighting_nb_code.dataframe_from_featuredump('sample-features-orig.json')
    display(weighting_nb_code.top3_and_310_stats_summary(data_orig, rank_column="webapp_sort_rank"))

Note: this won’t exactly match what the Django `/search-quality` pages report, because of some differences in determining exactly what the rank is. In the Django code, if the results are `(non-lemma1, non-lemma2)`, we count the ranks as `(1, 3)` because the UI display of `non-lemma1` includes its lemma definition at rank 2. Here we skip that for now, but the results should be close enough.
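
To make the rank difference concrete, here is a small sketch of the counting rule described above. `displayed_ranks` is a hypothetical helper, not the webapp's code, and it assumes every non-lemma result takes up exactly one extra rank for its lemma definition.

```python
def displayed_ranks(is_lemma_flags):
    """Rank of each result as the UI would show it.

    `is_lemma_flags` holds one boolean per result, in result order.
    A non-lemma result is assumed to be followed by one extra row
    for its lemma definition, which uses up the next rank.
    """
    ranks = []
    next_rank = 1
    for is_lemma in is_lemma_flags:
        ranks.append(next_rank)
        next_rank += 1 if is_lemma else 2
    return ranks


print(displayed_ranks([False, False]))  # [1, 3]
```

The notebook simply numbers results 1, 2, 3, and so on, which is why its scores can differ slightly from the webapp's.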

And, for comparison, here are the stats when we added a very basic cosine vector distance model to the search:

In [70]:
weighting_nb_code.top3_and_310_stats_summary(data, rank_column="webapp_sort_rank")



top3       9.671533
three10    9.215328
dtype: float64

The top3 score (the percentage of desired search results that appear anywhere in the list) has gone up: the vector model’s ability to resolve synonyms has improved recall. But the three10 score (the percentage of desired search results that appear near the top) has gone down, since we don’t yet have a good ranking mechanism.

## Modelling

At this point we have all the definition and feature data from the webapp loaded, and we could experiment by adding more data columns with additional features. Those features could be computed by Python code here, or loaded from data files.

For this first version, let’s stick with what we have:

In [71]:
def prep_results_for_regression(df):
    # The default value used for `fillna()` doesn’t matter if we
    # also have an indicator variable, but things get trickier
    # with logarithms.
    return df.assign(
        morpheme_ranking=df["morpheme_ranking"].fillna(1),
        # has_morpheme_ranking=weighting_nb_code.has_col_as_int(df, "morpheme_ranking"),
        has_cosine_vector_distance=weighting_nb_code.has_col_as_int(df, "cosine_vector_distance"),
        cosine_vector_distance=df["cosine_vector_distance"].fillna(1.1),
        is_in_survey=df.apply(weighting_nb_code.is_in_survey, axis=1),
        keyword_match_len=df['target_language_keyword_match'].apply(len),
        pos_match=df["pos_match"].fillna(0),
        word_list_freq=df["word_list_freq"].fillna(0),
        lemma_freq=df["lemma_freq"].fillna(0),
        is_espt_result=df["is_espt_result"].fillna(1)
    )
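
The comment about `fillna()` can be checked on toy data. The sketch below uses plain NumPy least squares rather than `statsmodels`, and the data is made up for illustration. It fits the same target twice with different fill values, each time alongside a has-value indicator column, and compares the fitted values.

```python
import numpy as np

# Toy data (made up for illustration): a feature with missing values
# and a 0-or-1 target.
x = np.array([0.1, 0.5, np.nan, 0.9, np.nan, 0.3])
y = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 1.0])


def fitted_values(x, y, fill_value):
    """Least-squares fit of y on [1, filled x, has-x indicator]."""
    missing = np.isnan(x)
    has_x = (~missing).astype(float)
    x_filled = np.where(missing, fill_value, x)
    design = np.column_stack([np.ones_like(y), x_filled, has_x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef


# The indicator column absorbs the choice of fill value, so both fits
# should produce the same fitted values.
same = np.allclose(fitted_values(x, y, 1.1), fitted_values(x, y, -5.0))
print(same)
```

Without the indicator column the fill value does change the fit, and with a term like `np.log(1 + x)` the fill value must also keep the logarithm's argument positive.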

All the options above *can* be used as parameters for search, but they aren't all useful. Using the three10 and top3 scores generated by the cell below, I have determined that the current set of options is the minimal set necessary to achieve the desired results.

In [72]:
df = prep_results_for_regression(data)
results = smf.ols(
    """
    is_in_survey ~
        word_list_freq
        + lemma_freq
        + morpheme_ranking
        + np.log(1 + cosine_vector_distance)
        + keyword_match_len
        + is_espt_result
        + pos_match
    """,
    data=df,
).fit()
display(results.summary())
sorted_results = weighting_nb_code.rank_by_predictor(df, results)
weighting_nb_code.top3_and_310_stats_summary(sorted_results, rank_column="result_rank")

0,1,2,3
Dep. Variable:,is_in_survey,R-squared:,0.196
Model:,OLS,Adj. R-squared:,0.196
Method:,Least Squares,F-statistic:,754.9
Date:,"Wed, 01 Jun 2022",Prob (F-statistic):,0.0
Time:,15:00:27,Log-Likelihood:,23443.0
No. Observations:,18573,AIC:,-46870.0
Df Residuals:,18566,BIC:,-46820.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0315,0.002,13.479,0.000,0.027,0.036
word_list_freq,0.0019,0.001,3.458,0.001,0.001,0.003
lemma_freq,1.589e-07,4.71e-07,0.337,0.736,-7.65e-07,1.08e-06
morpheme_ranking,0.0315,0.002,13.479,0.000,0.027,0.036
np.log(1 + cosine_vector_distance),-0.1135,0.008,-14.029,0.000,-0.129,-0.098
keyword_match_len,0.2204,0.004,51.825,0.000,0.212,0.229
is_espt_result,-0.0081,0.006,-1.474,0.141,-0.019,0.003
pos_match,0.0104,0.004,2.583,0.010,0.002,0.018

0,1,2,3
Omnibus:,28146.259,Durbin-Watson:,1.882
Prob(Omnibus):,0.0,Jarque-Bera (JB):,12988689.787
Skew:,9.527,Prob(JB):,0.0
Kurtosis:,131.144,Cond. No.,2.69e+17




top3       9.671533
three10    9.276156
dtype: float64

Objectively, we have increased how many of the top survey results we display at all from 81.3% to 83.4%, and we have increased the mean number of them that appear in the top 10 results per query from ~60% to ~70%. That’s great!

Let’s take a look at a sample query. Before:

In [73]:
(weighting_nb_code.top3_and_310_stats(data, rank_column='webapp_sort_rank')
     .query('query == "counts"'))[['query', 'actual_result_ranks', 'top3', 'three10']]



Unnamed: 0,query,actual_result_ranks,top3,three10
126,counts,[],0.0,0.0


Before, when searching for ‘counts’, all the good results were showing up somewhere in the results, but none of them were in the top 10.

Now, with this new ranking model, the top results from the survey show up at the top of the search results, and even in 1, 2, 3 order:

In [74]:
(weighting_nb_code.top3_and_310_stats(sorted_results, rank_column='result_rank')
     .query('query == "counts"'))[['query', 'actual_result_ranks', 'top3', 'three10']]



Unnamed: 0,query,actual_result_ranks,top3,three10
126,counts,[],0.0,0.0


If we look in more detail at the results, we can see that the cosine vector distance and the morpheme ranking are being combined, but one doesn’t overrule the other. Rarer words generally appear later in the list, but a strong CVD score can move a word earlier, and vice versa.
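
`rank_by_predictor` lives in `weighting_nb_code.py` and is not shown here. One way such a per-query ranking could work is sketched below, assuming a data frame with a `query` column and a `score` column holding the model prediction. `rank_within_query` is a hypothetical name, and the example scores are made up.

```python
import pandas as pd


def rank_within_query(df, score_column="score"):
    """Add a 1-based `result_rank` per query, highest score first."""
    ranked = df.copy()
    ranked["result_rank"] = (
        ranked.groupby("query")[score_column]
        # method="first" breaks ties by original row order.
        .rank(method="first", ascending=False)
        .astype(int)
    )
    return ranked.sort_values(["query", "result_rank"])


example = pd.DataFrame(
    {
        "query": ["counts", "counts", "about", "counts"],
        "wordform_text": ["awa", "pêyak", "ohci", "kîsikâw"],
        "score": [0.0035, 0.0073, 0.3448, 0.0014],
    }
)
print(rank_within_query(example))
```

Because the score is a single weighted sum, no one feature decides the order by itself.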

In [75]:
sorted_results.query("query == 'counts'").sort_values('score', ascending=False)[
    ['wordform_text', 'definitions', 'morpheme_ranking',
     'has_cosine_vector_distance',
     'cosine_vector_distance', 'is_in_survey', 'score']
 ].head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,has_cosine_vector_distance,cosine_vector_distance,is_in_survey,score
14401,pêyak,"[[one, 1, CW], [alone, single, a single one, C...",1,1,0.686167,0,0.007299
14399,âcimowina,[],1,1,0.768318,0,0.004249
14400,kîsikâwa,[],1,1,0.807228,0,0.00364
14402,awa,"[[this, this one, CW]]",1,1,0.708275,0,0.003547
14403,misi-mîciw,"[[s/he eats a lot of s.t., s/he eats much of s...",1,1,0.787861,0,0.002608
14404,nitawi-,"[[go and, go to, CW], [engaged in, CW]]",1,1,0.772883,0,0.001718
14405,kîsikâw,"[[it is day, it is daylight, CW]]",1,1,0.807228,0,0.001393
14406,nîso-kîsikâw,"[[it is Tuesday, CW]]",1,1,0.753839,0,0.001068
14407,mîcisow,"[[s/he eats, s/he has a meal, CW], [it feeds (...",1,1,0.785188,0,0.000901
14408,tipiskâw,"[[it is night, it is night time, CW], [it is d...",1,1,0.820307,0,0.000558


This is quite a bit better than only using the morpheme ranking:

In [76]:
if os.path.isfile('sample-features-orig.json'):
    display((data_orig.assign(is_in_survey=data_orig.apply(weighting_nb_code.is_in_survey, axis=1))
     .query("query == 'counts'").sort_values('webapp_sort_rank')[
        ['wordform_text', 'definitions', 'morpheme_ranking', 'is_in_survey']
     ]).head(10))

## Model export

The model generated by the `statsmodels` library is `pickle`able, but since it’s a fairly basic linear model, for now we will just print the parameters to use in the webapp.

In [78]:
print(results.params.to_json(indent=2))

{
  "Intercept":0.0314788206,
  "word_list_freq":0.0018736602,
  "lemma_freq":0.0000001589,
  "morpheme_ranking":0.0314788206,
  "np.log(1 + cosine_vector_distance)":-0.1134934198,
  "keyword_match_len":0.2204347216,
  "is_espt_result":-0.0081121758,
  "pos_match":0.0103591827
}
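
Applying these parameters outside of `statsmodels` amounts to a weighted sum. The following is a minimal sketch of how a consumer might do that, not the webapp's actual implementation: `relevance_score`, the `features` dict format, and the `DEFAULTS` table are assumptions, with the fill values taken from `prep_results_for_regression` above.

```python
import math

# Weights copied from the JSON output above.
WEIGHTS = {
    "Intercept": 0.0314788206,
    "word_list_freq": 0.0018736602,
    "lemma_freq": 0.0000001589,
    "morpheme_ranking": 0.0314788206,
    "np.log(1 + cosine_vector_distance)": -0.1134934198,
    "keyword_match_len": 0.2204347216,
    "is_espt_result": -0.0081121758,
    "pos_match": 0.0103591827,
}

# Fill values for missing features, mirroring prep_results_for_regression.
DEFAULTS = {
    "word_list_freq": 0,
    "lemma_freq": 0,
    "morpheme_ranking": 1,
    "cosine_vector_distance": 1.1,
    "is_espt_result": 1,
    "pos_match": 0,
}


def relevance_score(features, weights=WEIGHTS):
    """Linear score for one search result given as a plain dict."""

    def value(name):
        v = features.get(name)
        return DEFAULTS[name] if v is None else v

    terms = {
        "Intercept": 1.0,
        "word_list_freq": value("word_list_freq"),
        "lemma_freq": value("lemma_freq"),
        "morpheme_ranking": value("morpheme_ranking"),
        "np.log(1 + cosine_vector_distance)": math.log(
            1 + value("cosine_vector_distance")
        ),
        "keyword_match_len": len(
            features.get("target_language_keyword_match") or []
        ),
        "is_espt_result": value("is_espt_result"),
        "pos_match": value("pos_match"),
    }
    return sum(weights[name] * term for name, term in terms.items())


score = relevance_score({"cosine_vector_distance": 0.7, "is_espt_result": 0})
print(f"{score:.6f}")
```

Keeping the term names identical to the exported keys means a new export can be pasted in without touching the scoring function.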


And here are some test vectors for ensuring the implementation is working correctly.

In [18]:
import re

def print_test_vector(**kwargs):
    df = prep_results_for_regression(pd.DataFrame([{
        "query": "counts",
        "wordform_text": "",
        "target_language_keyword_match": [],
        "wordform_length": 0,
        "keyword_match_len": 0,
        "morpheme_ranking": np.nan,
        "cosine_vector_distance": np.nan,
        "pos_match": 0,
        "is_espt_result": 0,
        "word_list_freq": 0,
        "lemma_freq": 0,
        **kwargs
    }]))
    ret = results.predict(df)[0]
    # future python feature “underscore as a decimal separator”
    # https://bugs.python.org/issue43624 would be handy here
    ret = f'{ret:_f}'
    if '.' in ret:
        l, r = ret.split('.')
        r = re.sub(r'(...)(?=.)', r'\1_', r)
        ret = f'{l}.{r}'
    print(ret)

In [19]:
print_test_vector()

-0.021_606


In [20]:
print_test_vector(cosine_vector_distance=0.7)

0.002_865


In [21]:
print_test_vector(morpheme_ranking=12.8)

0.357_854


In [22]:
print_test_vector(cosine_vector_distance=0.7, morpheme_ranking=12.8)

0.382_325


In [23]:
print_test_vector(cosine_vector_distance=0.7, morpheme_ranking=12.8, wordform_length=9, target_language_keyword_match_len=1)

0.382_325


In [24]:
print_test_vector(cosine_vector_distance=0.7, morpheme_ranking=12.8, pos_match=5)

0.455_524
