# Search feature weighting

As of early 2020, search ranking is done by an ad-hoc pairwise comparison function that may not even be transitive. We want to replace it with a more structured and analyzable approach that can accommodate multiple search features such as cosine vector distance and corpus frequency, and weight those features in some manner to produce a best fit for a relevance score to use for ranking results.

The basic pieces are:
  - Use occurrence of a search result in the search survey as a 0-or-1 relevance variable
  - Model a relevance score from that using some fairly basic linear techniques
  - Measure success by the three10 score: the mean, across queries, of the percentage of top-3 survey results that appear in the top 10 search results

There are many possible improvements here, such as:
  - Use occurrence anywhere in the sample, instead of top-3 results only, for training
  - More precise training data, e.g., relevance rankings of 1-5
  - Handle homonyms in training data instead of matching purely on wordform text
  - More training data in terms of scored results per query
  - More features, e.g., tf-idf
  - Higher quality features, e.g., better stopword filtering in vector computations
  - Map features to have similar ranges and distributions to better allow comparison between them
  - Separate training and test sets
  - Fancier models
  - Better evaluation functions, such as discounted cumulative gain

*but having all the pieces together, even in this basic form, is already an improvement over the current search so let’s start there.*

## Preliminaries

Load some libraries. `analysis.py` contains some more python-y code that was extracted from some scratch jupyter notebooks once it was working ok.

In [1]:
import importlib

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import analysis

# If this is insufficient, there is also IPython.lib.deepreload
importlib.reload(analysis);

Here’s what the output of the `featuredump` management command looks like.

In [2]:
data = analysis.dataframe_from_featuredump('sample-features.json')
data

Unnamed: 0,target_language_keyword_match,query_wordform_edit_distance,lemma_wordform_text,is_preverb_match,source_language_match,definitions,query,wordform_text,webapp_sort_rank,wordform_length,source_language_affix_match,cosine_vector_distance,is_lemma,target_language_affix_match,morpheme_ranking
0,[about],,itêw,,,"[[That's what he says to him., MD], [s/he says...",about,itêw,1,4,,,True,True,3.86950
1,[about],,itam,,,"[[He speaks of it as so., MD], [s/he says thus...",about,itam,2,4,,,True,True,4.52311
2,[about],,êtikwê,,,"[[I suppose., MD], [apparently, I guess, I sup...",about,êtikwê,3,6,,,True,True,5.32741
3,[about],,êtokwê,,,"[[Maybe; perhaps; I guess so., MD], [maybe, I ...",about,êtokwê,4,6,,,True,True,5.76791
4,[about],,wîhtam,,,"[[He tells about it., MD], [s/he tells about s...",about,wîhtam,5,6,,0.339667,True,True,6.75698
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123319,[see],,nâtawâpamêw,,,"[[s/he goes to see s.o., s/he fetches s.o., CW]]",they see us,nâtawâpamêw,503,11,,,True,,30.29770
123320,[they],,pêyakôskânêsiwak,,,[[They are one group of them. E.g. Tribe or na...,they see us,pêyakôskânêsiwak,504,16,,,True,,30.38920
123321,[see],,nâtawâpahtam,,,"[[s/he goes to see s.t., s/he fetches s.t., CW]]",they see us,nâtawâpahtam,505,12,,,True,,30.48390
123322,[they],,kakwâhyakêyatinwa,,,"[[they are in vast numbers, CW]]",they see us,kakwâhyakêyatinwa,506,17,,,True,,


Also load the `featuredump` output from a run without cosine vector distance (CVD) search, for comparison.

In [3]:
data_orig = analysis.dataframe_from_featuredump('sample-features-orig.json')

Here’s the current combined result survey sample.

In [4]:
analysis.survey()

Unnamed: 0,Query,Nêhiyawêwin 1,Nêhiyawêwin 2,Nêhiyawêwin 3
0,about,wayês,ohci,papâ
1,all,kahkiyaw,kapê,mâwaci
2,also,mîna,êkwa,kisik
3,and,êkwa,mîna,kisik
4,as,kisik,wiya,tâpiskôt
...,...,...,...,...
543,she sees him,wâpamêw,,
544,starblanket,atâhkakohp,acâhkosa kâ-otakohpit,
545,star blanket,atâhkakohp,acâhkosa kâ-otakohpit,
546,being taught,kiskinwahamâkosiw,,


And `analysis.py` contains a function to annotate the `featuredump` results with the top3/three10 metrics.

In [5]:
analysis.top3_and_310_stats(data, rank_column="webapp_sort_rank")[
    ["query", "wordform_text", "definitions", "actual_result_ranks", "top3", "three10"]
]

Unnamed: 0,query,wordform_text,definitions,actual_result_ranks,top3,three10
0,"""horse""",misatim,"[[horse, CW]]","[4.0, 12.0]",100.000000,50.000000
1,'horse',misatim,"[[horse, CW]]","[4.0, 13.0]",100.000000,50.000000
2,Calgary,otôskwanihk,"[[Calgary, AB; literally: at the elbow; at his...",[1.0],100.000000,100.000000
3,Cree,nêhiyaw,[[A Cree Indian man. A native of the Cree nati...,"[4.0, 5.0]",100.000000,100.000000
4,Cree language,nêhiyawêwin,"[[The Cree language., MD], [the Cree language;...",[4.0],100.000000,100.000000
...,...,...,...,...,...,...
543,yellow hat,osâwastotin,"[[yellow hat, CW]]",[2.0],100.000000,100.000000
544,you,kiya,"[[You., MD], [you, CW]]","[1.0, 4.0, 51.0]",100.000000,66.666667
545,young,oski,,[],0.000000,0.000000
546,younger sibling,nisîmis,[[My younger brother or sister (Among children...,[3.0],100.000000,100.000000
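
Judging from the table above, the two metric columns can be reproduced per query from `actual_result_ranks` and the number of survey entries for that query. The sketch below is a reading of the output, not the code in `analysis.py`: the function name and the `n_survey_entries` parameter are made up for illustration, and the survey counts (2 for "horse", 3 for "you") are inferred from the 100% top3 values.

```python
def top3_and_three10(actual_result_ranks, n_survey_entries):
    """Return (top3, three10) percentages for a single query.

    top3:    share of survey entries found anywhere in the search results.
    three10: share of survey entries found at rank 10 or better.
    """
    if n_survey_entries == 0:
        return 0.0, 0.0
    found_anywhere = len(actual_result_ranks)
    found_in_top_10 = sum(1 for rank in actual_result_ranks if rank <= 10)
    return (
        100 * found_anywhere / n_survey_entries,
        100 * found_in_top_10 / n_survey_entries,
    )


# Rank lists copied from the table; survey entry counts are assumptions.
horse = top3_and_three10([4.0, 12.0], n_survey_entries=2)
you = top3_and_three10([1.0, 4.0, 51.0], n_survey_entries=3)
print(horse, you)
```

The summary numbers reported later would then be the mean of these per-query values.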


## Initial results from dictionary code

Without any cosine vector distance features, these are the current search stats we want to beat: 81.3% for top3 and 59.4% for three10.

In [6]:
analysis.top3_and_310_stats_summary(data_orig, rank_column="webapp_sort_rank")

top3       81.326034
three10    59.367397
dtype: float64

Note: this won’t exactly match what the django `/search-quality` pages report, because of some differences in determining exactly what the rank is. In the django code, if the results are `(non-lemma1, non-lemma2)`, we count the ranks as `(1, 3)` because the UI display of `non-lemma1` includes its lemma definition at rank 2. Here we skip that for now, but the results should be close enough.
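
To make the rank difference concrete, here is a small sketch of the counting rule described in the note. It is not the Django code; `display_ranks` is a hypothetical helper that only assumes each non-lemma result occupies two display slots (itself plus its lemma definition) while a lemma occupies one.

```python
def display_ranks(is_lemma_flags):
    """Return the UI display rank of each result, in result order."""
    ranks = []
    next_rank = 1
    for is_lemma in is_lemma_flags:
        ranks.append(next_rank)
        # A non-lemma result is shown together with its lemma definition,
        # so it pushes the following result down by two slots.
        next_rank += 1 if is_lemma else 2
    return ranks


# The (non-lemma1, non-lemma2) case from the note
two_non_lemmas = display_ranks([False, False])
print(two_non_lemmas)
```

The notebook simply counts results in order instead, so the two sets of ranks diverge only when non-lemma results appear ahead of a survey match.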

And, the stats when we add a very basic cosine vector distance model to the search:

In [7]:
analysis.top3_and_310_stats_summary(data, rank_column="webapp_sort_rank")

top3       83.424574
three10    55.474453
dtype: float64

The top3 score (the percentage of desired search results that appear anywhere in the result list) has gone up, from 81.3% to 83.4%. That is, the vector model’s ability to resolve synonyms has improved recall. But the three10 score (the percentage of desired search results that appear near the top) has gone down, from 59.4% to 55.5%, since we don’t have a good ranking mechanism.

## Modelling

In [8]:
def prep_results_for_regression(df):
    # In a straightforward linear combination, the default values won’t matter because
    # they get multiplied by the has_ variable which will be zero. But things could get
    # trickier if we have logarithms.
    return df.assign(
        morpheme_ranking=df["morpheme_ranking"].fillna(0),
        has_morpheme_ranking=analysis.has_col_as_int(df, "morpheme_ranking"),
        cosine_vector_distance=df["cosine_vector_distance"].fillna(1.1),
        has_cosine_vector_distance=analysis.has_col_as_int(df, "cosine_vector_distance"),
        is_in_survey=df.apply(analysis.is_in_survey, axis=1),
        # Use the df argument here, not the notebook-level `data` variable
        keyword_match_len=df["target_language_keyword_match"].apply(len),
    )
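
The fill-plus-indicator pattern in the cell above can be tried on a toy frame. This sketch assumes `analysis.has_col_as_int` behaves like `notna().astype(int)`; that is a guess about a helper whose source is not shown here, and the toy values are made up apart from the 1.1 default.

```python
import numpy as np
import pandas as pd

# Three results: two with a cosine vector distance, one without.
toy = pd.DataFrame({"cosine_vector_distance": [0.0, 0.34, np.nan]})

prepared = toy.assign(
    # 1 where the feature was present, 0 where it was missing
    has_cosine_vector_distance=toy["cosine_vector_distance"].notna().astype(int),
    # Missing distances get a default just past the largest expected value
    cosine_vector_distance=toy["cosine_vector_distance"].fillna(1.1),
)
print(prepared)
```

Keeping the indicator column alongside the filled value lets a model tell "feature missing" apart from "feature happens to equal the default".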

In [9]:
df = prep_results_for_regression(data)
results = smf.ols(
    """
    is_in_survey ~
        wordform_length
        + keyword_match_len
        + has_morpheme_ranking
        + morpheme_ranking
        + np.log(1 + cosine_vector_distance)
    """,
    data=df,
).fit()
display(results.summary())
sorted_results = analysis.rank_by_predictor(df, results)
analysis.top3_and_310_stats_summary(sorted_results, rank_column="result_rank")

0,1,2,3
Dep. Variable:,is_in_survey,R-squared:,0.058
Model:,OLS,Adj. R-squared:,0.058
Method:,Least Squares,F-statistic:,1507.0
Date:,"Tue, 04 May 2021",Prob (F-statistic):,0.0
Time:,09:42:06,Log-Likelihood:,135080.0
No. Observations:,123324,AIC:,-270100.0
Df Residuals:,123318,BIC:,-270100.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0554,0.003,19.471,0.000,0.050,0.061
wordform_length,-0.0006,9.35e-05,-6.066,0.000,-0.001,-0.000
keyword_match_len,0.0326,0.001,45.826,0.000,0.031,0.034
has_morpheme_ranking,0.0235,0.003,8.535,0.000,0.018,0.029
morpheme_ranking,-0.0010,6.76e-05,-14.739,0.000,-0.001,-0.001
np.log(1 + cosine_vector_distance),-0.1196,0.002,-79.101,0.000,-0.123,-0.117

0,1,2,3
Omnibus:,197281.757,Durbin-Watson:,1.737
Prob(Omnibus):,0.0,Jarque-Bera (JB):,80139050.18
Skew:,10.815,Prob(JB):,0.0
Kurtosis:,125.996,Cond. No.,350.0


top3       83.424574
three10    70.225061
dtype: float64

In [31]:
x

Unnamed: 0,wordform_length,keyword_match_len,has_morpheme_ranking,morpheme_ranking,cosine_vector_distance
0,1,0,0,0,0


In [37]:
x = pd.DataFrame([
    {
        "wordform_length": 1,
        "keyword_match_len": 0,
        "has_morpheme_ranking": 0,
        "morpheme_ranking": 0,
        "cosine_vector_distance": 1.1,
    }
])
results.predict(x)

0   -0.033908
dtype: float64

In [45]:
x = pd.DataFrame([
    {
        "wordform_length": 1,
        "keyword_match_len": 0,
        "has_morpheme_ranking": 0,
        "morpheme_ranking": 0,
        "cosine_vector_distance": 0.7,
    }
])
results.predict(x)

0   -0.00864
dtype: float64

In [39]:
# This test is currently failing.
x = pd.DataFrame([
    {
        "wordform_length": 1,
        "keyword_match_len": 0,
        "has_morpheme_ranking": 1,
        "morpheme_ranking": 12.8,
        "cosine_vector_distance": 0.7,
    }
])
results.predict(x)

0    0.002137
dtype: float64

In [46]:
x = pd.DataFrame([
    {
        "wordform_length": 1,
        "keyword_match_len": 0,
        "has_morpheme_ranking": 1,
        "morpheme_ranking": 12.8,
        "cosine_vector_distance": 1.1,
    }
])
results.predict(x)[0]

-0.023130453155353087

In [43]:
results.params.values

array([ 0.05537813, -0.00056731,  0.03264485,  0.023535  , -0.00099665,
       -0.11957764])

In [44]:
results.params.index

Index(['Intercept', 'wordform_length', 'keyword_match_len',
       'has_morpheme_ranking', 'morpheme_ranking',
       'np.log(1 + cosine_vector_distance)'],
      dtype='object')

In [28]:
results.params.round(10)

Intercept                             0.055378
wordform_length                      -0.000567
keyword_match_len                     0.032645
has_morpheme_ranking                  0.023535
morpheme_ranking                     -0.000997
np.log(1 + cosine_vector_distance)   -0.119578
dtype: float64
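
Since the model is a plain linear combination, the score can be recomputed by hand from the fitted parameters, which is handy when porting the weights out of the notebook. The sketch below copies the coefficients from the `results.params.values` output above; the `score` function and its argument names are illustrative, not part of `analysis.py`.

```python
import math

# Coefficients copied from the fitted model output above
PARAMS = {
    "Intercept": 0.05537813,
    "wordform_length": -0.00056731,
    "keyword_match_len": 0.03264485,
    "has_morpheme_ranking": 0.023535,
    "morpheme_ranking": -0.00099665,
    "log1p_cosine_vector_distance": -0.11957764,
}


def score(wordform_length, keyword_match_len, has_morpheme_ranking,
          morpheme_ranking, cosine_vector_distance):
    """Recompute the OLS prediction as a weighted sum of the features."""
    return (
        PARAMS["Intercept"]
        + PARAMS["wordform_length"] * wordform_length
        + PARAMS["keyword_match_len"] * keyword_match_len
        + PARAMS["has_morpheme_ranking"] * has_morpheme_ranking
        + PARAMS["morpheme_ranking"] * morpheme_ranking
        # The formula uses np.log(1 + x); math.log is equivalent for scalars
        + PARAMS["log1p_cosine_vector_distance"]
        * math.log(1 + cosine_vector_distance)
    )


# Same inputs as the results.predict(x) cells above
no_features = score(1, 0, 0, 0, 1.1)
closer_vector = score(1, 0, 0, 0, 0.7)
print(no_features, closer_vector)
```

Up to rounding of the copied coefficients, these should agree with the `results.predict(x)` values of about -0.033908 and -0.00864 shown earlier.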

In [10]:
sorted_results

Unnamed: 0,target_language_keyword_match,query_wordform_edit_distance,lemma_wordform_text,is_preverb_match,source_language_match,definitions,query,wordform_text,webapp_sort_rank,wordform_length,...,cosine_vector_distance,is_lemma,target_language_affix_match,morpheme_ranking,has_morpheme_ranking,has_cosine_vector_distance,is_in_survey,keyword_match_len,score,result_rank
107652,[hors],,misatim,,,"[[horse, CW]]","""horse""",misatim,4,7,...,0.000000,True,,9.93681,1,1,1,1,0.097683,1.0
107660,[hors],,mistatim,,,"[[horse, CW, MD]]","""horse""",mistatim,12,8,...,0.000000,True,,12.11330,1,1,1,1,0.094947,2.0
107676,[hors],,misacimosis,,,"[[little horse, pony, foal, colt, CW]]","""horse""",misacimosis,28,11,...,0.138156,True,,16.00270,1,1,0,1,0.073894,3.0
107668,[hors],,nâpêstim,,,"[[stallion, male horse; male dog, CW, MD]]","""horse""",nâpêstim,20,8,...,0.151773,True,,16.36890,1,1,0,1,0.073809,4.0
107662,[hors],,takahkatim,,,"[[good horse; good dog, CW]]","""horse""",takahkatim,14,10,...,0.183081,True,,12.80170,1,1,0,1,0.073022,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36080,[],,kêtasâkêw,,,"[[He takes his coat off., MD], [s/he takes his...",your,kêtasâkêw,20,9,...,0.459749,True,,18.27890,1,1,0,0,0.010358,43.0
36082,[],,kâsîhkwâkêw,,,"[[s/he washes his/her own face with something,...",your,kâsîhkwâkêw,22,11,...,0.464270,True,,18.03480,1,1,0,0,0.009097,44.0
36079,[],,kisîpêkistikwânâkêw,,,"[[s/he washes his/her own head with something,...",your,kisîpêkistikwânâkêw,19,19,...,0.456856,True,,17.33790,1,1,0,0,0.005860,45.0
36085,[your],,mamisîtotamowin,,,[[The state oaf being confident in your faith ...,your,mamisîtotamowin,25,15,...,1.100000,True,,22.04670,1,0,0,1,-0.007644,46.0
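
The `score` and `result_rank` columns above suggest that results are ranked within each query by descending predicted score. A minimal sketch of that step with pandas follows; it is an assumption about what `analysis.rank_by_predictor` does, and the tie-breaking choice (`method="first"`) is a guess. The scores are copied from rows shown in this notebook.

```python
import pandas as pd

# A few (query, wordform, score) rows taken from the notebook output
toy = pd.DataFrame(
    {
        "query": ["hair", "your", "hair", "hair", "your"],
        "wordform_text": [
            "tipwêham", "mamisîtotamowin", "mêstakay", "miyahciwâna", "kêtasâkêw",
        ],
        "score": [0.073984, -0.007644, 0.093513, 0.076363, 0.010358],
    }
)

ranked = toy.assign(
    # Rank 1.0 is the highest score within each query
    result_rank=toy.groupby("query")["score"].rank(ascending=False, method="first")
).sort_values(["query", "result_rank"])
print(ranked)
```

`rank` returns floats, which matches the `1.0`, `2.0`, ... values seen in `result_rank`.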


In [11]:
sorted_results.query("query == 'hair'")[
    ['wordform_text', 'definitions', 'morpheme_ranking',
     'has_cosine_vector_distance',
     'cosine_vector_distance', 'is_in_survey', 'score']
 ].head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,has_cosine_vector_distance,cosine_vector_distance,is_in_survey,score
38211,mêstakay,"[[hair, CW]]",13.5515,1,5.960464e-08,0,0.093513
38215,miyahciwâna,"[[pubic hair, CW]]",11.7409,1,0.1552082,0,0.076363
38224,miyahciwânâna,"[[pubic hair, CW]]",11.7409,1,0.1552082,0,0.075229
38225,tipwêham,"[[s/he curls hair, CW]]",14.3799,1,0.1693116,0,0.073984
38212,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1,0.073712
38213,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1,0.073712
38214,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1,0.073712
38223,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1,0.073712
38226,nêstakaya,"[[my hair, CW]]",12.1174,1,0.1924281,0,0.073331
38227,nêstakaya,"[[my hair, CW]]",12.1174,1,0.1924281,0,0.073331


In [47]:
sorted_results.query("query == 'yellow hat'")[
    ['wordform_text', 'definitions', 'morpheme_ranking',
     'has_cosine_vector_distance',
     'cosine_vector_distance', 'is_in_survey', 'score']
 ].head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,has_cosine_vector_distance,cosine_vector_distance,is_in_survey,score
110654,osâwastotin,"[[yellow hat, CW]]",15.39,1,0.0,1,0.122624
110705,nîpâmâyâtastotin,"[[purple hat, CW]]",19.3989,1,0.131871,0,0.068335
110661,astotin,"[[hat, cap, headgear, CW]]",8.77879,1,0.319382,0,0.065695
110660,osâwêkin,"[[Yellow cloth., MD], [yellow material; yellow...",14.4942,1,0.29373,0,0.061779
110655,osâwasâkay,"[[yellow dress, coat, CW]]",15.4204,1,0.276213,0,0.061351
110657,osâwi-,"[[yellow, brown, CW]]",12.5092,1,0.360646,0,0.058862
110662,kayâs-astotin,"[[old hat, CW]]",14.9523,1,0.325223,0,0.05561
110663,piponastotin,"[[winter hat, CW]]",14.0294,1,0.342403,0,0.055557
110669,osâwâw,"[[It is yellow., MD], [it is yellow; it is ora...",14.4383,1,0.377142,0,0.055498
110664,iskwêwastotin,"[[woman's hat, CW]]",13.3395,1,0.347781,0,0.055199


In [12]:
(data_orig.assign(is_in_survey=data_orig.apply(analysis.is_in_survey, axis=1))
 .query("query == 'hair'").sort_values('webapp_sort_rank')[
    ['wordform_text', 'definitions', 'morpheme_ranking', 'is_in_survey']
 ]).head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,is_in_survey
35464,pâhpakowêwayân,"[[hide with thick hair on it, CW]]",9.73725,0
35465,pîway,"[[hair from a hide, fur, bristles; feathers, p...",11.3943,1
35466,mihyawê-,"[[with fur or body hair, CW]]",11.7409,0
35467,mihyawê-,"[[with fur or body hair, CW]]",11.7409,0
35468,miyahciwâna,"[[pubic hair, CW]]",11.7409,0
35469,wîcisiw,"[[s/he has a good head of hair, CW]]",11.7409,0
35470,miyahciwânâna,"[[pubic hair, CW]]",11.7409,0
35471,apihkêw,[[He knits. Or he braids. Could also mean he w...,11.8835,0
35472,nêstakaya,"[[my hair, CW]]",12.1174,0
35473,nêstakaya,"[[my hair, CW]]",12.1174,0
