# Queries

- "Apple pie"
- "Chicken" in the African category
- "Easy bread" less than 2h
- "Pasta bolognese"
- "Oatmeal"

Start with three of the queries and train test models on them, then add more queries if time allows.

# Scoring

Human classification of the top 100 results returned by the standard system (without LTR). Each result is scored on a numeric scale from 0 to 5.

## Criteria

Assigning a score is somewhat subjective, but it follows these guidelines:

0. A document that does not match the query.
1. A document that vaguely matches the query, is very incomplete (missing important fields, such as instructions) and has no reviews, or a document with very negative reviews.
2. A document that partially matches the query, is incomplete and has no reviews, or a document with negative reviews.
3. A document that matches the query semantically, is reasonably complete (may be missing up to two fields) and has at least one positive review.
4. A document that perfectly or almost perfectly matches the query semantically, is complete or missing just one of the fields and has a good number of positive reviews (5 to 20).
5. A document that perfectly matches the query semantically, is complete (the recipe has a full ingredient list, steps and cook time/nutritional information) and has a lot of positive reviews (more than 20).
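
The review thresholds are the only fully numeric part of these guidelines, so they can be written down as a small check. A minimal sketch, assuming the count of positive reviews is known for a recipe; the helper name `max_score_for_reviews` is hypothetical, and match quality, completeness and negative reviews still need human judgment:

```python
def max_score_for_reviews(positive_reviews):
    """Highest score whose review requirement is met (hypothetical helper).

    Covers only the positive-review thresholds from the guidelines above.
    """
    if positive_reviews < 0:
        raise ValueError("review count cannot be negative")
    if positive_reviews > 20:   # "more than 20" -> eligible for 5
        return 5
    if positive_reviews >= 5:   # "5 to 20" -> eligible for 4
        return 4
    if positive_reviews >= 1:   # "at least one positive review" -> eligible for 3
        return 3
    return 2                    # no reviews -> at most 2


caps = [max_score_for_reviews(n) for n in (0, 1, 4, 5, 20, 21)]
print(caps)
```

A scorer could use the returned value as an upper bound and then lower it according to how well the document matches the query and how complete it is.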

In [6]:
import urllib.parse as urlp

URL = "http://localhost:8983/solr/recipes/select"
URL += "?rows=100"
URL += "&q.op=AND"
URL += "&q={q}"
URL += "&qf=" + "Name^5 Description Ingredients^2 Keywords^2 Instructions Reviews^0.5 AuthorName^0.2"
URL += "&wt=json"
URL += "&defType=edismax"
URL += "&fl=id,RecipeId,score,[features]"
URL += "&rq=%7B!ltr%20model%3DmyModel%20efi.text%3D%27%24%7Bquery%7D%27%7D"
URL += "&fq={fq}"

query = ["apple pie", "chicken", "easy bread", "pasta bolognese", "oatmeal"]
facet = ["", "Category_Facet:African", "", "", ""]
urls = [URL.format(q=q, fq=fq) for q, fq in zip(query, facet)]

# Show the first request URL as a sanity check
print(urls[0])


http://localhost:8983/solr/recipes/select?rows=100&q.op=AND&q=apple pie&qf=Name^5 Description Ingredients^2 Keywords^2 Instructions Reviews^0.5 AuthorName^0.2&wt=json&defType=edismax&fl=id,RecipeId,score,[features]&rq=%7B!ltr%20model%3DmyModel%20efi.text%3D%27%24%7Bquery%7D%27%7D&fq=
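
String concatenation leaves the query and filter values unescaped (note the literal spaces and `^` characters in the URL above). A minimal sketch of building the same request with `urllib.parse.urlencode`, assuming the parameter values used above; the helper name `build_select_url` is hypothetical, and unlike the original it omits `fq` when no facet is given:

```python
import urllib.parse as urlp

SOLR_SELECT = "http://localhost:8983/solr/recipes/select"
QF = "Name^5 Description Ingredients^2 Keywords^2 Instructions Reviews^0.5 AuthorName^0.2"


def build_select_url(q, fq="", rows=100):
    """Return a percent-encoded Solr select URL (hypothetical helper)."""
    params = {
        "rows": rows,
        "q.op": "AND",
        "q": q,
        "qf": QF,
        "wt": "json",
        "defType": "edismax",
        "fl": "id,RecipeId,score,[features]",
        # Same re-rank query as the pre-encoded string used above
        "rq": "{!ltr model=myModel efi.text='${query}'}",
    }
    if fq:
        params["fq"] = fq
    # urlencode escapes every key and value (spaces become '+')
    return SOLR_SELECT + "?" + urlp.urlencode(params)


url = build_select_url("apple pie")
print(url)
```

Passing a `params` dictionary to `requests.get` would achieve the same escaping without building the URL by hand.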


In [7]:
import requests
import simplejson
import pandas as pd

for idx, url in enumerate(urls):
    response = requests.get(url)
    # Use a name other than `json` to avoid shadowing the standard module
    data = simplejson.loads(response.text)
    docs = data["response"]["docs"]

    # Annotate each hit with a browsable link and the query that produced it
    for doc in docs:
        doc["URL"] = "http://localhost:3000/recipe/{0}".format(doc["RecipeId"])
        doc["query"] = query[idx]
        doc["facet"] = facet[idx]

    # One CSV per query, to be scored by hand
    df = pd.DataFrame(docs)
    df.to_csv("queries/query{0}_results.csv".format(idx + 1), index=False)

# Modelling

Solr's LTR implementation supports several kinds of models, including linear, tree-based (multiple additive trees) and neural network models.

Various algorithms can be used to train these models. We will demonstrate two:

- A Linear Model using Support Vector Machines
- A Neural Network model built using RankNet

We will use scikit-learn's SVM implementation and a RankNet implementation built with Keras.
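
For context, Solr accepts a linear model as a JSON document that lists the features and assigns one weight to each. A minimal sketch of assembling that document, assuming a single weight vector is already available; the helper `to_solr_linear_model`, the feature names and the weights are hypothetical placeholders, not values from this project:

```python
import json


def to_solr_linear_model(name, feature_names, weights):
    """Build the JSON body for a Solr LTR LinearModel (hypothetical helper)."""
    if len(feature_names) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return {
        "class": "org.apache.solr.ltr.model.LinearModel",
        "name": name,
        "features": [{"name": feature} for feature in feature_names],
        "params": {
            "weights": {f: float(w) for f, w in zip(feature_names, weights)}
        },
    }


# Placeholder feature names and weights, for illustration only
model = to_solr_linear_model("myModel", ["nameMatch", "reviewCount"], [1.5, 0.25])
print(json.dumps(model, indent=2))
```

The feature names in the model must match features already defined in the feature store. How a single weight vector is obtained from the trained classifier is outside the scope of this sketch.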

In [62]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import requests
import simplejson
import glob

# glob returns files in arbitrary order; sort both lists so that
# results and scores are concatenated in the same query order
result_files = sorted(glob.glob("queries/*_results.csv"))
scores_files = sorted(glob.glob("queries/*_scores.csv"))

inputs = pd.concat((pd.read_csv(file) for file in result_files), ignore_index=True)
scores = pd.concat((pd.read_csv(file) for file in scores_files), ignore_index=True)


X = []
Y = scores["score"].tolist()


In [63]:

# Request template; the fq restricts each response to the single scored recipe
REQ_URL = "http://localhost:8983/solr/recipes/select?rows=100&q.op=AND&q={q}&qf=Name^5%20Description%20Ingredients^2%20Keywords^2%20Instructions%20Reviews^0.5%20AuthorName^0.2&wt=json&defType=edismax&fl=[features]&fq={fq}&rq={rq}"

for entry in inputs.itertuples():
    # Rows without a facet are read back from the CSV as NaN
    facet = "" if pd.isna(entry.facet) else entry.facet
    response = requests.get(REQ_URL.format(
        q=entry.query,
        fq=f"RecipeId:{entry.RecipeId} {facet}",
        rq=f"{{!ltr model=myModel efi.text='{entry.query}'}}",
    ))
    data = simplejson.loads(response.text)
    # [features] is a "name=value,name=value,..." string; keep only the values
    features = data["response"]["docs"][0]["[features]"]
    X.append([float(feature.split("=")[1]) for feature in features.split(",")])
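
The loop above relies on `[features]` being a comma-separated list of `name=value` pairs. A minimal sketch of that parsing step in isolation, which also keeps the feature names so the columns of `X` can be labelled; the helper `parse_features` and the sample feature names are hypothetical:

```python
def parse_features(raw):
    """Split a 'name=value,name=value' string into names and float values."""
    names, values = [], []
    for pair in raw.split(","):
        # partition splits on the first '=' only
        name, _, value = pair.partition("=")
        names.append(name.strip())
        values.append(float(value))
    return names, values


# Sample string with made-up feature names, for illustration only
names, values = parse_features("nameMatch=4.2,reviewCount=17.0,hasInstructions=1.0")
print(names)
print(values)
```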


In [64]:
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.25, random_state=1, stratify=Y
)

 

In [65]:
linearSVM = svm.LinearSVC()

linearSVM.fit(train_x, train_y)

pred = linearSVM.predict(test_x)

print(classification_report(test_y, pred))

              precision    recall  f1-score   support

           0       1.00      0.11      0.20         9
           1       0.00      0.00      0.00        10
           2       0.37      0.86      0.52        42
           3       0.39      0.20      0.26        35
           4       0.00      0.00      0.00        14
           5       0.00      0.00      0.00         6

    accuracy                           0.38       116
   macro avg       0.29      0.19      0.16       116
weighted avg       0.33      0.38      0.28       116
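
The two averaged rows can be recomputed from the per-class rows: the macro average is the unweighted mean over the six classes, while the weighted average weights each class by its support. A minimal sketch for the precision column, using the rounded values printed above, so agreement is only expected to two decimal places:

```python
# class -> (precision, support), copied from the report above
rows = {
    0: (1.00, 9),
    1: (0.00, 10),
    2: (0.37, 42),
    3: (0.39, 35),
    4: (0.00, 14),
    5: (0.00, 6),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every class counts equally
macro_precision = sum(p for p, _ in rows.values()) / len(rows)

# Weighted average: classes count in proportion to their support
weighted_precision = sum(p * s for p, s in rows.values()) / total_support

print(total_support, round(macro_precision, 2), round(weighted_precision, 2))
```

The gap between the two averages comes from class 2, which has the largest support and therefore dominates the weighted figure.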



