# Search feature weighting

This notebook computes feature weights for a search ranking model.

As of early 2020, search ranking was being done by an ad-hoc pairwise comparison function that may not even be transitive. We want to replace it with a more structured and analyzable approach that can incorporate additional search features besides corpus frequency: cosine vector distance for now, with room for more features later.

The basic pieces are:
  - Use occurrence of a search result in the search survey as a 0-or-1 relevance variable
  - Create a relevance score from that using some fairly basic linear modelling techniques to compute best-fit feature weights
  - Measure success by the three10 score: for each query, the percentage of the top-3 results from the linguist survey that appear in the top 10 search results, averaged over all queries
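
As a rough illustration of the metrics, here is a minimal sketch that scores a single query. The function name `top3_and_three10`, the placeholder wordforms, and the choice to score a query with no survey answers as zero are assumptions for illustration; the notebook's actual logic lives in `weighting_nb_code.py`. The per-query percentages would then be averaged over all queries.

```python
def top3_and_three10(survey_answers, ranked_results):
    """Return (top3, three10) percentages for a single query.

    survey_answers: up to three wordforms the survey gave for the query.
    ranked_results: wordforms returned by search, best first.
    """
    wanted = [w for w in survey_answers if w]
    if not wanted:
        return 0.0, 0.0  # assumption: nothing to find scores zero
    anywhere = set(ranked_results)
    top10 = set(ranked_results[:10])
    # top3: share of survey answers that appear anywhere in the results.
    top3 = 100.0 * sum(w in anywhere for w in wanted) / len(wanted)
    # three10: share of survey answers that appear in the first 10 results.
    three10 = 100.0 * sum(w in top10 for w in wanted) / len(wanted)
    return top3, three10


# Two survey answers: one found at rank 3, the other at rank 11.
results = ["a", "b", "wanted1"] + [f"filler{i}" for i in range(7)] + ["wanted2"]
print(top3_and_three10(["wanted1", "wanted2", ""], results))  # (100.0, 50.0)
```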

There are many interesting possible future improvements here, such as:
  - Use occurrence anywhere in the sample, instead of top3 results only, for training
  - More precise training data, e.g., relevance rankings of 1-5
  - Handle homonyms in training data instead of matching purely on wordform text
  - More training data, specifically more human-scored results per query
  - More features, e.g., tf-idf
  - Higher quality features, e.g., better stopword filtering in vector computations
  - Map features to have similar ranges and distributions so the regression can compare them more effectively
  - Separate training and test sets
  - Fancier models
  - Better evaluation functions, such as discounted cumulative gain
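
To give a feel for the last item, here is a minimal sketch of discounted cumulative gain (DCG) and its normalized form, using the same 0-or-1 relevance values as the survey. The function names are hypothetical, and this is only one common formulation of DCG, not something the notebook currently computes.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the best achievable DCG for the same relevances."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# 1 = the result appears in the survey, 0 = it does not.
print(ndcg([1, 1, 0, 0]))  # survey results ranked first
print(ndcg([0, 0, 1, 1]))  # the same results ranked lower score less
```

Unlike three10, which treats ranks 1 and 10 the same, this kind of metric rewards moving a relevant result from rank 4 to rank 1.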

That said, having all the pieces together, even in a very basic form, is already an improvement over the existing search, so let’s start with that.

## Preliminaries

Load some libraries. `weighting_nb_code.py` contains more Python-y code that was extracted from exploratory Jupyter notebooks once it was working OK.

In [1]:
import importlib


import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import weighting_nb_code

# Reload the code in `weighting_nb_code.py` by re-running this cell, or
# by copying the next line into other cells. If this reload mechanism
# proves insufficient, there is also `IPython.lib.deepreload`.
importlib.reload(weighting_nb_code);


First, if the JSON output file doesn’t already exist, we’ll run the `featuredump` management command to get our raw data. Cosine vector distance (CVD) search is not yet on by default, so we prefix every query with `cvd:retrieval` to enable it.

In [2]:
![ -f sample-features.json ] || \
    {weighting_nb_code.BASE_DIR}/manage.py featuredump \
        --prefix-queries-with 'cvd:retrieval' \
        > sample-features.json

The loaded feature data looks like this:

In [3]:
data = weighting_nb_code.dataframe_from_featuredump('sample-features.json')
data

Unnamed: 0,morpheme_ranking,relevance_score,is_lemma,wordform_text,source_language_match,query_wordform_edit_distance,target_language_affix_match,analyzable_inflection_match,cosine_vector_distance,lemma_wordform_text,definitions,webapp_sort_rank,is_espt_result,wordform_length,target_language_keyword_match,source_language_keyword_match,query
0,13.11080,0.072803,True,kêkât,,,True,,0.208314,kêkât,"[[just about, almost, CW]]",1,,5,[about],[],about
1,8.98018,0.068610,True,nânitaw,,,True,,0.283407,nânitaw,"[[Somewhere about., MD], [simply; something; s...",2,,7,[about],[],about
2,11.74090,0.067212,True,papâ-,,,True,,0.281026,papâ-,"[[go around, about, all over, CW]]",3,,5,[about],[],about
3,11.74090,0.066460,True,akâwâc,,,True,,0.282995,akâwâc,"[[Just about or just barely., MD], [hardly, ba...",4,,6,[about],[],about
4,6.75698,0.066289,True,wîhtam,,,True,,0.339667,wîhtam,"[[He tells about it., MD], [s/he tells about s...",5,,6,[about],[],about
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84691,22.67860,-0.010532,False,nititisâpamikonânak,,,,,,itisâpamêw,[],139,True,19,[see],[],they see us
84692,24.48040,-0.012331,False,nipêtisâpamikonânak,,,,,,pêtisâpamêw,[],140,True,19,[see],[],they see us
84693,30.48390,-0.014345,True,nâtawâpahtam,,,,,,nâtawâpahtam,"[[s/he goes to see s.t., s/he fetches s.t., CW]]",141,,12,[see],[],they see us
84694,26.13030,-0.015684,False,nikîhkâyâpamohikonânak,,,,,,kîhkâyâpamohêw,[],142,True,22,[see],[],they see us


Here’s the current combined result survey sample.

In [4]:
weighting_nb_code.survey()

Unnamed: 0,Query,Nêhiyawêwin 1,Nêhiyawêwin 2,Nêhiyawêwin 3
0,about,wayês,ohci,papâ
1,all,kahkiyaw,kapê,mâwaci
2,also,mîna,êkwa,kisik
3,and,êkwa,mîna,kisik
4,as,kisik,wiya,tâpiskôt
...,...,...,...,...
543,she sees him,wâpamêw,,
544,starblanket,atâhkakohp,acâhkosa kâ-otakohpit,
545,star blanket,atâhkakohp,acâhkosa kâ-otakohpit,
546,being taught,kiskinwahamâkosiw,,


And `weighting_nb_code.py` contains a function to annotate the `featuredump` results with the top3/three10 metrics.

In [5]:
weighting_nb_code.top3_and_310_stats(data, rank_column="webapp_sort_rank")[
    ["query", "wordform_text", "definitions", "actual_result_ranks", "top3", "three10"]
]


Unnamed: 0,query,wordform_text,definitions,actual_result_ranks,top3,three10
0,"""horse""",misatim,"[[horse, CW]]","[1.0, 2.0]",100.000000,100.000000
1,'horse',misatim,"[[horse, CW]]","[3.0, 11.0]",100.000000,50.000000
2,Calgary,otôskwanihk,"[[Calgary, AB; literally: at the elbow; at his...",[34.0],100.000000,0.000000
3,Cree,nêhiyaw,[[A Cree Indian man. A native of the Cree nati...,"[5.0, 8.0]",100.000000,100.000000
4,Cree language,nêhiyawêwin,"[[The Cree language., MD], [the Cree language;...",[1.0],100.000000,100.000000
...,...,...,...,...,...,...
543,yellow hat,osâwastotin,"[[yellow hat, CW]]",[1.0],100.000000,100.000000
544,you,kiya,,[],0.000000,0.000000
545,young,oski,,[],0.000000,0.000000
546,younger sibling,nisîmis,[[My younger brother or sister (Among children...,[2.0],100.000000,100.000000


## Initial results from dictionary code

Without any cosine-vector stuff, here are the current search stats we want to beat: 81.3% for top3 and 59.4% for three10.

In [7]:
import os
if os.path.isfile('sample-features-orig.json'):
    data_orig = weighting_nb_code.dataframe_from_featuredump('sample-features-orig.json')
    display(weighting_nb_code.top3_and_310_stats_summary(data_orig, rank_column="webapp_sort_rank"))

Note: this won’t exactly match what the Django `/search-quality` pages report, because of some differences in determining exactly what the rank is. In the Django code, if the results are `(non-lemma1, non-lemma2)`, we count the ranks as `(1, 3)` because the UI display of `non-lemma1` includes its lemma definition at rank 2. Here we skip that for now, but the results should be close enough.
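
The rank adjustment described in this note can be sketched as follows. Only the `(non-lemma1, non-lemma2)` to `(1, 3)` case comes from the description above; the function name `display_ranks` and the assumption that a lemma result occupies exactly one display slot are hypothetical.

```python
def display_ranks(is_lemma_flags):
    """Return the 1-based UI rank of each result, in result order.

    Assumption: a lemma result takes one display slot, while a
    non-lemma result takes two (itself, then its lemma definition).
    """
    ranks = []
    next_rank = 1
    for is_lemma in is_lemma_flags:
        ranks.append(next_rank)
        next_rank += 1 if is_lemma else 2
    return ranks


print(display_ranks([False, False]))  # [1, 3], the case described above
```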

And, for comparison, here are the stats when we added a very basic cosine vector distance model to the search:

In [8]:
weighting_nb_code.top3_and_310_stats_summary(data, rank_column="webapp_sort_rank")


top3       79.714112
three10    65.328467
dtype: float64

The top3 score—what percent of desired search results we see anywhere in the list—has gone up. That is, the vector model’s ability to resolve synonyms has improved recall. But the three10 score—what percent of desired search results are near the top—has gone down since we don’t have a good ranking mechanism.

## Modelling

At this point we have all the definition and feature data from the webapp loaded, and we could experiment by adding more data columns with additional features. Those features could be computed by Python code here, or loaded from data files.

For this first version, let’s stick with what we have:

In [9]:
def prep_results_for_regression(df):
    # The default value used for `fillna()` doesn’t matter if we
    # also have an indicator variable, but things get trickier
    # with logarithms.
    return df.assign(
        morpheme_ranking=df["morpheme_ranking"].fillna(1),
        has_morpheme_ranking=weighting_nb_code.has_col_as_int(df, "morpheme_ranking"),
        has_cosine_vector_distance=weighting_nb_code.has_col_as_int(df, "cosine_vector_distance"),
        cosine_vector_distance=df["cosine_vector_distance"].fillna(1.1),
        is_in_survey=df.apply(weighting_nb_code.is_in_survey, axis=1),
        keyword_match_len=df['target_language_keyword_match'].apply(len)
    )

In [10]:
df = prep_results_for_regression(data)
results = smf.ols(
    """
    is_in_survey ~
        wordform_length
        + keyword_match_len
        + has_morpheme_ranking
        + morpheme_ranking
        + np.log(1 + cosine_vector_distance)
    """,
    data=df,
).fit()
display(results.summary())
sorted_results = weighting_nb_code.rank_by_predictor(df, results)
weighting_nb_code.top3_and_310_stats_summary(sorted_results, rank_column="result_rank")

0,1,2,3
Dep. Variable:,is_in_survey,R-squared:,0.058
Model:,OLS,Adj. R-squared:,0.058
Method:,Least Squares,F-statistic:,1036.0
Date:,"Thu, 17 Mar 2022",Prob (F-statistic):,0.0
Time:,10:30:39,Log-Likelihood:,79964.0
No. Observations:,84696,AIC:,-159900.0
Df Residuals:,84690,BIC:,-159900.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0542,0.004,13.828,0.000,0.046,0.062
wordform_length,-0.0006,0.000,-5.041,0.000,-0.001,-0.000
keyword_match_len,0.0400,0.001,42.546,0.000,0.038,0.042
has_morpheme_ranking,0.0290,0.004,7.528,0.000,0.021,0.037
morpheme_ranking,-0.0013,9.38e-05,-13.807,0.000,-0.001,-0.001
np.log(1 + cosine_vector_distance),-0.1242,0.002,-63.109,0.000,-0.128,-0.120

0,1,2,3
Omnibus:,124043.652,Durbin-Watson:,1.693
Prob(Omnibus):,0.0,Jarque-Bera (JB):,29509247.487
Skew:,9.254,Prob(JB):,0.0
Kurtosis:,92.551,Cond. No.,352.0



top3       79.714112
three10    66.088808
dtype: float64

Objectively, we have increased how many of the top survey results we display at all from 81.3% to 83.4%, and we have increased the mean number of them that appear in the top 10 results per query from ~60% to ~70%. That’s great!
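
For readers without access to `weighting_nb_code.py`, the re-ranking step can be pictured roughly like this: within each query, sort the results by predicted score, highest first, and number them from 1. The function name `rank_by_score` and the plain-dict row format are assumptions for illustration, not the actual `rank_by_predictor` implementation.

```python
from collections import defaultdict


def rank_by_score(rows):
    """Add a 1-based `result_rank` to each row, ranking within its query."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["query"]].append(row)

    ranked = []
    for query_rows in by_query.values():
        # A higher predicted relevance score means an earlier rank.
        ordered = sorted(query_rows, key=lambda r: r["score"], reverse=True)
        for rank, row in enumerate(ordered, start=1):
            ranked.append({**row, "result_rank": rank})
    return ranked


rows = [
    {"query": "counts", "wordform_text": "akimêw", "score": 0.046021},
    {"query": "counts", "wordform_text": "akihtam", "score": 0.050568},
    {"query": "counts", "wordform_text": "akihcikêw", "score": 0.048435},
]
for row in rank_by_score(rows):
    print(row["result_rank"], row["wordform_text"])
```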

Let’s take a look at a sample query. Before:

In [11]:
(weighting_nb_code.top3_and_310_stats(data, rank_column='webapp_sort_rank')
     .query('query == "counts"'))[['query', 'actual_result_ranks', 'top3', 'three10']]


Unnamed: 0,query,actual_result_ranks,top3,three10
126,counts,"[1.0, 2.0, 4.0]",100.0,100.0


Before, when searching for ‘counts’, all the good results showed up somewhere in the results, but none of them were in the top 10.

Now, with this new ranking model, the top results from the survey show up at the top of the search results, and even in 1, 2, 3 order:

In [12]:
(weighting_nb_code.top3_and_310_stats(sorted_results, rank_column='result_rank')
     .query('query == "counts"'))[['query', 'actual_result_ranks', 'top3', 'three10']]


Unnamed: 0,query,actual_result_ranks,top3,three10
126,counts,"[1.0, 2.0, 3.0]",100.0,100.0


If we look in more detail at the results, we can see that the cosine vector distance and the morpheme ranking are being combined, and neither overrules the other. Rarer words generally appear later in the list, but a strong CVD score can move a rare word earlier, and vice versa.

In [26]:
sorted_results.query("query == 'counts'").sort_values('score', ascending=False)[
    ['wordform_text', 'definitions', 'morpheme_ranking',
     'has_cosine_vector_distance',
     'cosine_vector_distance', 'is_in_survey', 'score']
 ].head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,has_cosine_vector_distance,cosine_vector_distance,is_in_survey,score
61662,akihtam,"[[He counts them. Inanimate., MD], [s/he count...",14.3906,1,0.490174,1,0.050568
61663,akihcikêw,"[[s/he counts, CW]]",17.2326,1,0.456675,1,0.048435
61665,akimêw,"[[He counts them (those people)., MD], [s/he c...",15.6314,1,0.533738,1,0.046021
61664,akihtêw,"[[it is counted, CW]]",16.3776,1,0.516414,0,0.045827
61666,itakihtam,"[[He sets a price on it., MD], [s/he charges s...",12.4932,1,0.580796,0,0.044412
61667,akihcikêwina,[],19.9242,1,0.467025,0,0.042153
61670,têpakihtam,"[[He has counted enough., MD], [s/he counts s....",14.9248,1,0.565385,0,0.041842
61669,akihtâw,"[[s/he counts s.t., CW]]",17.9402,1,0.547529,0,0.041281
61668,têpakimêw,"[[He counts enough of them., MD], [s/he counts...",20.3262,1,0.496446,0,0.041084
61671,akihtamawêw,"[[He counts those things for him., MD], [s/he ...",17.0718,1,0.548999,0,0.03973


This is quite a bit better than only using the morpheme ranking:

In [14]:
if os.path.isfile('sample-features-orig.json'):
    display((data_orig.assign(is_in_survey=data_orig.apply(weighting_nb_code.is_in_survey, axis=1))
     .query("query == 'counts'").sort_values('webapp_sort_rank')[
        ['wordform_text', 'definitions', 'morpheme_ranking', 'is_in_survey']
     ]).head(10))

## Model export

The model generated by the `statsmodels` library is `pickle`able, but since it’s a fairly basic linear model, for now we will just print the parameters for use in the webapp.

In [15]:
print(results.params.to_json(indent=2))

{
  "Intercept":0.0541748268,
  "wordform_length":-0.0006392647,
  "keyword_match_len":0.0400435941,
  "has_morpheme_ranking":0.0290126923,
  "morpheme_ranking":-0.0012946508,
  "np.log(1 + cosine_vector_distance)":-0.1242364288
}


And here are some test vectors for ensuring the implementation is working correctly.

In [21]:
import re

def print_test_vector(**kwargs):
    df = prep_results_for_regression(pd.DataFrame([{
        "query": "counts",
        "wordform_text": "",
        "target_language_keyword_match": [],
        "wordform_length": 0,
        "keyword_match_len": 0,
        "morpheme_ranking": np.nan,
        "cosine_vector_distance": np.nan,
        **kwargs
    }]))
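    # `prep_results_for_regression` recomputes `keyword_match_len` from
    # `target_language_keyword_match` and fills in the missing values.
    ret = results.predict(df)[0]
    # Format with six decimal places, then group the fractional digits
    # in threes with underscores, e.g. 0.000_693.
    # future python feature “underscore as a decimal separator”
    # https://bugs.python.org/issue43624 would be handy here
    ret = f'{ret:_f}'
    if '.' in ret:
        whole, frac = ret.split('.')
        frac = re.sub(r'(...)(?=.)', r'\1_', frac)
        ret = f'{whole}.{frac}'
    print(ret)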

In [17]:
print_test_vector()

-0.039_295


In [18]:
print_test_vector(cosine_vector_distance=0.7)

-0.013_043


In [19]:
print_test_vector(morpheme_ranking=12.8)

-0.025_560


In [20]:
print_test_vector(cosine_vector_distance=0.7, morpheme_ranking=12.8)

0.000_693
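
To show how these test vectors relate to the exported parameters, here is a minimal standard-library sketch of the score computation as it might be reimplemented in the webapp. The parameter values are copied from the JSON printed earlier, and the fill values (1 for a missing morpheme ranking, 1.1 for a missing cosine vector distance) mirror `prep_results_for_regression`. The function name `relevance_score` and its signature are assumptions, not the webapp's actual API.

```python
import math

# Parameters copied from the printed `results.params` JSON.
PARAMS = {
    "Intercept": 0.0541748268,
    "wordform_length": -0.0006392647,
    "keyword_match_len": 0.0400435941,
    "has_morpheme_ranking": 0.0290126923,
    "morpheme_ranking": -0.0012946508,
    "np.log(1 + cosine_vector_distance)": -0.1242364288,
}


def relevance_score(wordform_length=0, keyword_match_len=0,
                    morpheme_ranking=None, cosine_vector_distance=None):
    """Linear relevance score; `None` marks a missing feature value."""
    has_morpheme_ranking = 0 if morpheme_ranking is None else 1
    if morpheme_ranking is None:
        morpheme_ranking = 1.0  # same fill value as the notebook
    if cosine_vector_distance is None:
        cosine_vector_distance = 1.1  # same fill value as the notebook
    return (
        PARAMS["Intercept"]
        + PARAMS["wordform_length"] * wordform_length
        + PARAMS["keyword_match_len"] * keyword_match_len
        + PARAMS["has_morpheme_ranking"] * has_morpheme_ranking
        + PARAMS["morpheme_ranking"] * morpheme_ranking
        + PARAMS["np.log(1 + cosine_vector_distance)"]
        * math.log(1 + cosine_vector_distance)
    )


# These reproduce the four test vectors above.
print(f"{relevance_score():.6f}")                            # -0.039295
print(f"{relevance_score(cosine_vector_distance=0.7):.6f}")  # -0.013043
print(f"{relevance_score(morpheme_ranking=12.8):.6f}")       # -0.025560
print(f"{relevance_score(cosine_vector_distance=0.7, morpheme_ranking=12.8):.6f}")  # 0.000693
```

Because the exported parameters are rounded to ten decimal places, comparing against the test vectors with a small tolerance is safer than testing for exact equality.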
