# Search feature weighting

As of early 2020, search ranking is done by an ad-hoc pairwise comparison function that may not even be transitive. We want to replace it with a more structured and analyzable approach that can accommodate multiple search features such as cosine vector distance and corpus frequency, and weight those features in some manner to produce a best fit for a relevance score to use for ranking results.

The basic pieces are:
  - Use occurrence of a search result in the search survey as a 0-or-1 relevance variable
  - Model a relevance score from that using some fairly basic linear techniques
  - Measure success by the three10 score: for each query, the percentage of the top-3 survey results that appear in the top 10 search results, averaged over all queries
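
As a rough illustration of the metric in the last bullet, here is a minimal sketch for a single query. The function name and its arguments are hypothetical and are not taken from `analysis.py`; it assumes empty survey slots are `None` or empty strings, and that a wordform appearing more than once counts at its best rank.

```python
def top3_and_three10(survey_words, ranked_results):
    """Return (top3, three10) percentages for a single query.

    survey_words: up to three desired wordforms from the survey.
    ranked_results: wordforms returned by search, best first.
    """
    wanted = [w for w in survey_words if w]  # drop empty survey slots
    if not wanted:
        return 0.0, 0.0
    # Best (lowest) 1-based rank at which each wordform appears.
    best_rank = {}
    for rank, wordform in enumerate(ranked_results, start=1):
        best_rank.setdefault(wordform, rank)
    found = [best_rank[w] for w in wanted if w in best_rank]
    top3 = 100 * len(found) / len(wanted)            # found anywhere in the results
    three10 = 100 * sum(r <= 10 for r in found) / len(wanted)  # found in the top 10
    return top3, three10


# Two desired results, found at ranks 4 and 12: both are present, one is in the top 10.
results = ["w%d" % i for i in range(1, 13)]
print(top3_and_three10(["w4", "w12"], results))  # (100.0, 50.0)
```

The overall score would then be the mean of the per-query percentages.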

There are many possible improvements here, such as:
  - Use occurrence anywhere in the sample, instead of top-3 results only, for training
  - More precise training data, e.g., relevance rankings of 1-5
  - Handle homonyms in training data instead of matching purely on wordform text
  - More training data in terms of scored results per query
  - More features, e.g., tf-idf
  - Higher quality features, e.g., better stopword filtering in vector computations
  - Map features to have similar ranges and distributions to better allow comparison between them
  - Separate training and test sets
  - Fancier models
  - Better evaluation functions, such as discounted cumulative gain

*But having all the pieces together, even in this basic form, is already an improvement over the current search, so let’s start there.*
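
For reference, discounted cumulative gain, mentioned in the list of possible improvements, could be computed roughly as follows. This is a sketch of the standard textbook definition rather than anything currently in `analysis.py`; the function names are made up for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance values listed in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the best achievable DCG for the same relevance values."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# With 0-or-1 relevance, a relevant result at rank 1 scores higher than at rank 2.
print(ndcg([1, 0]), ndcg([0, 1]))
```

Unlike three10, this rewards moving a relevant result from rank 9 to rank 1, and it would extend naturally to graded relevance such as rankings of 1-5.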

## Preliminaries

Load some libraries. `analysis.py` contains more Python code that was extracted from scratch Jupyter notebooks once it was working.

In [1]:
import importlib

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import analysis

# If this is insufficient, there is also IPython.lib.deepreload
importlib.reload(analysis);

Here’s what the output of the `featuredump` management command looks like.

In [2]:
data = analysis.dataframe_from_featuredump('sample-features.json')
data

Unnamed: 0,wordform_length,webapp_sort_rank,source_language_match,is_preverb_match,cosine_vector_distance,definitions,morpheme_ranking,query,query_wordform_edit_distance,source_language_affix_match,lemma_wordform_text,target_language_affix_match,wordform_text,target_language_keyword_match,is_lemma
0,4,1,,,,"[[That's what he says to him., MD], [s/he says...",3.86950,about,,,itêw,True,itêw,[about],True
1,4,2,,,,"[[He speaks of it as so., MD], [s/he says thus...",4.52311,about,,,itam,True,itam,[about],True
2,6,3,,,,"[[I suppose., MD], [apparently, I guess, I sup...",5.32741,about,,,êtikwê,True,êtikwê,[about],True
3,6,4,,,,"[[Maybe; perhaps; I guess so., MD], [maybe, I ...",5.76791,about,,,êtokwê,True,êtokwê,[about],True
4,6,5,,,0.339667,"[[He tells about it., MD], [s/he tells about s...",6.75698,about,,,wîhtam,True,wîhtam,[about],True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123319,11,503,,,,"[[s/he goes to see s.o., s/he fetches s.o., CW]]",30.29770,they see us,,,nâtawâpamêw,,nâtawâpamêw,[see],True
123320,16,504,,,,[[They are one group of them. E.g. Tribe or na...,30.38920,they see us,,,pêyakôskânêsiwak,,pêyakôskânêsiwak,[they],True
123321,12,505,,,,"[[s/he goes to see s.t., s/he fetches s.t., CW]]",30.48390,they see us,,,nâtawâpahtam,,nâtawâpahtam,[see],True
123322,17,506,,,,"[[they are in vast numbers, CW]]",,they see us,,,kakwâhyakêyatinwa,,kakwâhyakêyatinwa,[they],True


And here is the `featuredump` output without cosine vector distance (cvd) search, for comparison.

In [3]:
data_orig = analysis.dataframe_from_featuredump('sample-features-orig.json')

Here’s the current combined result survey sample.

In [4]:
analysis.survey()

Unnamed: 0,Query,Nêhiyawêwin 1,Nêhiyawêwin 2,Nêhiyawêwin 3
0,about,wayês,ohci,papâ
1,all,kahkiyaw,kapê,mâwaci
2,also,mîna,êkwa,kisik
3,and,êkwa,mîna,kisik
4,as,kisik,wiya,tâpiskôt
...,...,...,...,...
543,she sees him,wâpamêw,,
544,starblanket,atâhkakohp,acâhkosa kâ-otakohpit,
545,star blanket,atâhkakohp,acâhkosa kâ-otakohpit,
546,being taught,kiskinwahamâkosiw,,


And `analysis.py` contains a function to annotate the `featuredump` results with the top3/three10 metrics.

In [5]:
analysis.top3_and_310_stats(data, rank_column="webapp_sort_rank")[
    ["query", "wordform_text", "definitions", "actual_result_ranks", "top3", "three10"]
]

Unnamed: 0,query,wordform_text,definitions,actual_result_ranks,top3,three10
0,"""horse""",misatim,"[[horse, CW]]","[4.0, 12.0]",100.000000,50.000000
1,'horse',misatim,"[[horse, CW]]","[4.0, 13.0]",100.000000,50.000000
2,Calgary,otôskwanihk,"[[Calgary, AB; literally: at the elbow; at his...",[1.0],100.000000,100.000000
3,Cree,nêhiyaw,[[A Cree Indian man. A native of the Cree nati...,"[4.0, 5.0]",100.000000,100.000000
4,Cree language,nêhiyawêwin,"[[The Cree language., MD], [the Cree language;...",[4.0],100.000000,100.000000
...,...,...,...,...,...,...
543,yellow hat,osâwastotin,"[[yellow hat, CW]]",[2.0],100.000000,100.000000
544,you,kiya,"[[You., MD], [you, CW]]","[1.0, 4.0, 51.0]",100.000000,66.666667
545,young,oski,,[],0.000000,0.000000
546,younger sibling,nisîmis,[[My younger brother or sister (Among children...,[3.0],100.000000,100.000000


## Initial results from dictionary code

Without any cosine vector distance features, these are the current search stats we want to beat: 81.3% for top3 and 59.4% for three10.

In [6]:
analysis.top3_and_310_stats_summary(data_orig, rank_column="webapp_sort_rank")

top3       81.326034
three10    59.367397
dtype: float64

Note: this won’t exactly match what the Django `/search-quality` pages report, because the rank is determined slightly differently there. In the Django code, if the results are `(non-lemma1, non-lemma2)`, we count the ranks as `(1, 3)` because the UI display of `non-lemma1` includes its lemma definition at rank 2. Here we skip that adjustment for now, but the results should be close enough.
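
To make the rank difference concrete, here is a minimal sketch of the display-rank rule described in the note. It is an assumption about the behaviour, not the actual Django code: the function name is hypothetical, and it assumes every non-lemma result takes two display slots (the wordform, then its lemma definition).

```python
def display_ranks(is_lemma_flags):
    """Return the 1-based display rank of each result, in order.

    Assumes a non-lemma result occupies two display slots: one for the
    wordform itself and one for its lemma definition shown beneath it.
    """
    ranks = []
    next_rank = 1
    for is_lemma in is_lemma_flags:
        ranks.append(next_rank)
        next_rank += 1 if is_lemma else 2
    return ranks


print(display_ranks([False, False]))  # [1, 3]
```

The notebook's ranks correspond to treating every result as occupying a single slot.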

And here are the stats when we add a very basic cosine vector distance model to the search:

In [7]:
analysis.top3_and_310_stats_summary(data, rank_column="webapp_sort_rank")

top3       83.424574
three10    55.474453
dtype: float64

The top3 score (the percentage of desired search results that appear anywhere in the list) has gone up, from 81.3% to 83.4%. That is, the vector model’s ability to resolve synonyms has improved recall. But the three10 score (the percentage of desired search results that appear near the top) has gone down, from 59.4% to 55.5%, since we don’t yet have a good ranking mechanism.

## Modelling

In [28]:
def prep_results_for_regression(df):
    # In a straightforward linear combination, the default values won’t matter because
    # they get multiplied by the has_ variable which will be zero. But things could get
    # trickier if we have logarithms.
    return df.assign(
        morpheme_ranking=df["morpheme_ranking"].fillna(1),
        has_morpheme_ranking=analysis.has_col_as_int(df, "morpheme_ranking"),
        cosine_vector_distance=df["cosine_vector_distance"].fillna(1),
        has_cosine_vector_distance=analysis.has_col_as_int(df, "cosine_vector_distance"),
        is_in_survey=df.apply(analysis.is_in_survey, axis=1),
        # Use the df argument here, not the global `data`
        keyword_match_len=df["target_language_keyword_match"].apply(len),
    )
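
The indicator-plus-fill pattern used in `prep_results_for_regression` can be illustrated on a toy frame. The `has_col_as_int` below is a guess at what `analysis.has_col_as_int` does (1 where the value is present, 0 where it is missing); the real helper may differ.

```python
import numpy as np
import pandas as pd


def has_col_as_int(df, col):
    # Hypothetical stand-in for analysis.has_col_as_int:
    # 1 where the value is present, 0 where it is missing.
    return df[col].notna().astype(int)


toy = pd.DataFrame({"cosine_vector_distance": [0.34, np.nan, 0.15]})

# Both keyword arguments are evaluated against the original `toy` before
# assign() runs, so the indicator still sees the NaN that fillna replaces.
prepped = toy.assign(
    cosine_vector_distance=toy["cosine_vector_distance"].fillna(1),
    has_cosine_vector_distance=has_col_as_int(toy, "cosine_vector_distance"),
)
print(prepped)
```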

In [29]:
df = prep_results_for_regression(data)
results = smf.ols(
    """
    is_in_survey ~
        wordform_length
        + keyword_match_len
        + has_morpheme_ranking * morpheme_ranking
        + has_cosine_vector_distance * np.log(1 + cosine_vector_distance)
    """,
    data=df,
).fit()
display(results.summary())
sorted_results = analysis.rank_by_predictor(df, results)
analysis.top3_and_310_stats_summary(sorted_results, rank_column="result_rank")

0,1,2,3
Dep. Variable:,is_in_survey,R-squared:,0.068
Model:,OLS,Adj. R-squared:,0.068
Method:,Least Squares,F-statistic:,1511.0
Date:,"Mon, 03 May 2021",Prob (F-statistic):,0.0
Time:,19:47:37,Log-Likelihood:,135790.0
No. Observations:,123324,AIC:,-271600.0
Df Residuals:,123317,BIC:,-271500.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0707,0.002,37.037,0.000,0.067,0.074
wordform_length,-0.0004,9.31e-05,-4.123,0.000,-0.001,-0.000
keyword_match_len,0.0162,0.001,19.487,0.000,0.015,0.018
has_morpheme_ranking,0.0386,0.002,24.818,0.000,0.036,0.042
morpheme_ranking,0.0155,0.001,11.234,0.000,0.013,0.018
has_morpheme_ranking:morpheme_ranking,-0.0165,0.001,-12.158,0.000,-0.019,-0.014
has_cosine_vector_distance,0.0410,0.001,73.267,0.000,0.040,0.042
np.log(1 + cosine_vector_distance),-0.1492,0.002,-63.958,0.000,-0.154,-0.145
has_cosine_vector_distance:np.log(1 + cosine_vector_distance),-0.1697,0.003,-52.994,0.000,-0.176,-0.163

0,1,2,3
Omnibus:,195941.753,Durbin-Watson:,1.735
Prob(Omnibus):,0.0,Jarque-Bera (JB):,78125041.98
Skew:,10.668,Prob(JB):,0.0
Kurtosis:,124.444,Cond. No.,2.47e+16


top3       83.424574
three10    66.271290
dtype: float64
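
One plausible reading of the summary above: it reports `Df Model: 6` for eight non-intercept terms, along with a very large condition number. In a formula, `a * b` expands to `a + b + a:b`, and when missing values are filled with a constant, the filled column is an exact linear combination of the intercept, the indicator, and the interaction. The sketch below checks this on made-up data; the variable names and value ranges are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
has = rng.integers(0, 2, size=n)         # 1 = feature present, 0 = missing
raw = rng.uniform(3, 30, size=n)         # made-up feature values
filled = np.where(has == 1, raw, 1.0)    # same effect as fillna(1)

# Intercept plus the three columns produced by `has * filled` in a formula.
X = np.column_stack([np.ones(n), has, filled, has * filled])

# Four columns, but only three are linearly independent.
print(np.linalg.matrix_rank(X))

# The exact dependence: filled = has*filled + 1 - has
print(np.allclose(filled, has * filled + 1 - has))
```

The same argument applies to the `np.log(1 + cosine_vector_distance)` term, where the filled value contributes the constant `log(2)`. Each of the two `has_* * feature` groups loses one degree of freedom, which would account for 6 rather than 8.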

In [25]:
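def prep_results_for_regression(df):
    # This variant fills missing values with fixed defaults. The has_ indicator
    # columns are kept for inspecting results; the regression below does not use them.
    return df.assign(
        morpheme_ranking=df["morpheme_ranking"].fillna(20),
        has_morpheme_ranking=analysis.has_col_as_int(df, "morpheme_ranking"),
        cosine_vector_distance=df["cosine_vector_distance"].fillna(1.1),
        has_cosine_vector_distance=analysis.has_col_as_int(df, "cosine_vector_distance"),
        is_in_survey=df.apply(analysis.is_in_survey, axis=1),
        keyword_match_len=df["target_language_keyword_match"].apply(len),
    )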

In [26]:
df = prep_results_for_regression(data)
results = smf.ols(
    """
    is_in_survey ~
        wordform_length
        + keyword_match_len
        + morpheme_ranking
        + np.log(1 + cosine_vector_distance)
    """,
    data=df,
).fit()
display(results.summary())
sorted_results = analysis.rank_by_predictor(df, results)
analysis.top3_and_310_stats_summary(sorted_results, rank_column="result_rank")

0,1,2,3
Dep. Variable:,is_in_survey,R-squared:,0.058
Model:,OLS,Adj. R-squared:,0.058
Method:,Least Squares,F-statistic:,1884.0
Date:,"Mon, 03 May 2021",Prob (F-statistic):,0.0
Time:,19:44:36,Log-Likelihood:,135080.0
No. Observations:,123324,AIC:,-270100.0
Df Residuals:,123319,BIC:,-270100.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0789,0.001,62.040,0.000,0.076,0.081
wordform_length,-0.0006,9.35e-05,-6.073,0.000,-0.001,-0.000
keyword_match_len,0.0327,0.001,45.840,0.000,0.031,0.034
morpheme_ranking,-0.0010,6.76e-05,-14.813,0.000,-0.001,-0.001
np.log(1 + cosine_vector_distance),-0.1196,0.002,-79.091,0.000,-0.123,-0.117

0,1,2,3
Omnibus:,197284.503,Durbin-Watson:,1.737
Prob(Omnibus):,0.0,Jarque-Bera (JB):,80144678.557
Skew:,10.815,Prob(JB):,0.0
Kurtosis:,126.0,Cond. No.,158.0


top3       83.424574
three10    70.133820
dtype: float64
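
The re-ranking step that produces `result_rank` is not shown in this notebook. A minimal sketch of what such a step might look like follows; `rank_by_score` is a hypothetical stand-in and not the actual `analysis.rank_by_predictor`. It ranks results within each query by predicted relevance, highest first.

```python
import pandas as pd


def rank_by_score(df, scores):
    """Rank results within each query by descending score (hypothetical helper)."""
    scored = df.assign(predicted_relevance=scores)
    scored["result_rank"] = (
        scored.groupby("query")["predicted_relevance"]
        .rank(method="first", ascending=False)  # ties keep their original order
        .astype(int)
    )
    return scored.sort_values(["query", "result_rank"])


toy = pd.DataFrame({
    "query": ["hair", "hair", "hair", "you", "you"],
    "wordform_text": ["a", "b", "c", "d", "e"],
})
ranked = rank_by_score(toy, [0.1, 0.7, 0.3, 0.2, 0.9])
print(ranked[["query", "wordform_text", "result_rank"]])
```

With a fitted statsmodels formula model, the scores would presumably come from something like `results.predict(df)`.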

In [10]:
sorted_results.query("query == 'hair'").sort_values('webapp_sort_rank')[
    ['wordform_text', 'definitions', 'morpheme_ranking',
     'has_cosine_vector_distance',
     'cosine_vector_distance', 'is_in_survey']
 ].head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,has_cosine_vector_distance,cosine_vector_distance,is_in_survey
38210,pâhpakowêwayân,"[[hide with thick hair on it, CW]]",9.73725,0,1.0,0
38211,mêstakay,"[[hair, CW]]",13.5515,1,5.960464e-08,0
38212,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1
38213,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1
38214,mêscakâsa,"[[hair; short hair, CW]]",16.2734,1,0.1481664,1
38215,miyahciwâna,"[[pubic hair, CW]]",11.7409,1,0.1552082,0
38216,wîcisiw,"[[s/he has a good head of hair, CW]]",11.7409,0,1.0,0
38217,apihkêw,[[He knits. Or he braids. Could also mean he w...,11.8835,0,1.0,0
38218,sîkaham,"[[s/he pours s.t., CW], [s/he combs s.t. (e.g....",12.7319,0,1.0,0
38219,wâpâsow,"[[He is fair or light skinned., MD], [s/he is ...",12.8398,0,1.0,0


In [11]:
(data_orig.assign(is_in_survey=data_orig.apply(analysis.is_in_survey, axis=1))
 .query("query == 'hair'").sort_values('webapp_sort_rank')[
    ['wordform_text', 'definitions', 'morpheme_ranking', 'is_in_survey']
 ]).head(10)

Unnamed: 0,wordform_text,definitions,morpheme_ranking,is_in_survey
35464,pâhpakowêwayân,"[[hide with thick hair on it, CW]]",9.73725,0
35465,pîway,"[[hair from a hide, fur, bristles; feathers, p...",11.3943,1
35466,mihyawê-,"[[with fur or body hair, CW]]",11.7409,0
35467,mihyawê-,"[[with fur or body hair, CW]]",11.7409,0
35468,miyahciwâna,"[[pubic hair, CW]]",11.7409,0
35469,wîcisiw,"[[s/he has a good head of hair, CW]]",11.7409,0
35470,miyahciwânâna,"[[pubic hair, CW]]",11.7409,0
35471,apihkêw,[[He knits. Or he braids. Could also mean he w...,11.8835,0
35472,nêstakaya,"[[my hair, CW]]",12.1174,0
35473,nêstakaya,"[[my hair, CW]]",12.1174,0
