# 2. Searching with Solr

## a. Technicalities and Syntax

**<u>Querying a Solr server</u>**

A Solr index is available at the following address: [http://solrserver](http://solrserver). A Solr server can be queried in several ways, but the simplest is an HTTP GET request with a query string. In plain terms, we build a URL that points to the Solr server and append some parameters to it. Parameters can be added to any URL this way: `https://www.example.com/page?param1=val1&param2=val2`.
If a parameter should carry an array of values, repeat the parameter once per value: `https://www.example.com/page?param1=val1&param1=val2`. Be aware that passing arrays as query parameters is not standardised, and other services may not work the same way (but Solr does!).
Solr then processes the query and returns a list of matching documents, along with other information.
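
As a rough sketch of the URL construction described above, Python's standard `urllib.parse.urlencode` can build the query string; with `doseq=True`, a list value is expanded into one repeated parameter per item. The collection name `my_collection` and the `language` field are placeholders chosen for illustration, not part of any real setup.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_query_url(base_url, params):
    """Append URL-encoded parameters to base_url; list values are repeated."""
    return base_url + "?" + urlencode(params, doseq=True)

# Hypothetical endpoint and field names, used for illustration only
example_url = build_query_url(
    "http://solrserver/solr/my_collection/select",
    {"q": "*:*", "fq": ["year:1950", "language:fr"]},
)
print(example_url)

# Decoding the query string gives back the original values,
# with the repeated fq parameter collected into a list
print(parse_qs(urlsplit(example_url).query))
```

Letting the library do the encoding avoids having to escape characters such as spaces, colons or brackets by hand.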

In [34]:
import urllib.request  # importing the submodule explicitly makes urllib.request.urlopen available
import json
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import itertools
import uuid
from IPython.display import display_javascript, display_html, display, clear_output, JSON
import IPython

In [35]:
class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, (dict, list)):
            self.json_str = json.dumps(json_data)
        else:
            # Assume the caller already passed a JSON string
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())
        
    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;"></div>'.format(self.uuid),
            raw=True
        )
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
          document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

In [36]:
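from urllib.parse import quote_plus
from urllib.request import urlopen

def query_solr(url, param_widgets):
    # Each entry maps a parameter name to one widget or to a list of widgets;
    # a list yields one key=value pair per widget (repeated parameter).
    pairs = []
    for key, value in param_widgets.items():
        widgets_for_key = value if isinstance(value, list) else [value]
        for widget in widgets_for_key:
            if widget.value != '':  # skip empty fields such as unused fq boxes
                pairs.append(key + '=' + quote_plus(widget.value))
    # echoParams=all asks Solr to echo the parameters it used in the response header
    pairs.append("echoParams=all")
    url += '&'.join(pairs)

    print("Querying ", url)
    connection = urlopen(url)
    response = json.load(connection)
    return response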

def generate_widgets(params_list):
    global params_widgets
    params_widgets = {}
    for param in params_list:
        params_widgets[param[0]] = [widgets.Text(
            value=param[1],
            placeholder='query',
            description=param[0] + ':',
            disabled=False
        )]
    params_widgets["defType"] = [widgets.Text(
        value='lucene',
        placeholder='deftype',
        description='defType:',
        disabled=False
    )]
    params_widgets["sort"] = [widgets.Text(
        value='score desc',
        placeholder='sort',
        description='sort:',
        disabled=False
    )]
    params_widgets["rows"] = [widgets.Text(
        value='10',
        placeholder='rows',
        description='rows:',
        disabled=False
    )]
    params_widgets["start"] = [widgets.Text(
        value='0',
        placeholder='start',
        description='start:',
        disabled=False
    )]
    params_widgets["fl"] = [widgets.Text(
        value='*,score',
        placeholder='fl',
        description='fl:',
        disabled=False
    )]
    params_widgets["fq"] = []
    for i in range(0,4):
        params_widgets["fq"].append(widgets.Text(
            value='',
            placeholder='filter query',
            description='fq:',
            disabled=False
        ))
    
    for w in list(itertools.chain.from_iterable(params_widgets.values())):
        display(w)
        
    def on_button_click(button):
        global response
        response = query_solr(url, params_widgets)
        print("done.")

    button = widgets.Button(
        description='Send',
        disabled=False,
        button_style='success',
        tooltip='Click me',
        icon='check'
    )
    button.on_click(on_button_click)
    display(button)

In [None]:
url = 'http://localhost:8080/solr/newseye_collection/select?'
solr_parameters = [("q","*:*")]
generate_widgets(solr_parameters)


In [None]:
RenderJSON(response)

**<u>Common parameters</u>**

Solr supports several sets of parameters, each corresponding to a different way of expressing queries. The following parameters are common to all available query parsers.
- **defType:** *(default: lucene)* selects which query parser Solr uses to process your query. You can choose between *lucene*, *dismax* or *edismax*.
- **sort:** *(default: score desc)* controls how Solr sorts the results of your query. The syntax is `<field> <order>`, where `<field>` is the field to sort on and `<order>` is *asc* for ascending or *desc* for descending. The default value presents documents in descending order of their relevancy score.
- **rows:** *(default: 10)* how many results Solr returns. A search can match many documents and it is not always desirable to display all of them at once. This parameter limits the number of documents returned by the search.
- **start:** *(default: 0)* specifies an offset in the results list. Solr returns documents from this offset onwards. This parameter can be used together with the **rows** parameter to set up the paging of results. If the **rows** parameter is set to 10 (the default value), the first page of results is obtained by setting **start** to 0. You can then get the second page by issuing the same request with **start** set to 10, the third page by setting it to 20, *etc*.
- **fl:** *(default: &ast;)* tells Solr which fields should be returned for every document. The default value returns every field. You can set it to a space-separated list of fields.
- **fq:** can be used to filter the documents returned after a query. The value of this parameter has the form `<field>:<value or range>`. For example, setting it to `year:1950` only returns documents from this specific year; a range is written like this: `year:[1900 TO 1910]`.

For a complete list of the common parameters, see [this page](https://solr.apache.org/guide/8_0/common-query-parameters.html#common-query-parameters).
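
The paging arithmetic described for **start** and **rows** can be sketched as follows. The helper name `page_params` is hypothetical, and pages are counted from zero here, which is a choice made for this example only.

```python
def page_params(page, rows=10):
    """Return the start/rows parameters for a zero-based page index."""
    if page < 0 or rows <= 0:
        raise ValueError("page must be >= 0 and rows must be > 0")
    return {"start": page * rows, "rows": rows}

# First three pages with the default page size of 10
for page in range(3):
    print(page, page_params(page))
```

The returned dictionary can be merged with the other query parameters before the URL is built.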

**<u>Querying text fields</u>**

The main parameter used to search inside text fields is the **q** parameter. It can contain terms and operators. Terms are either single words like `fox` or `president`, or phrases, which are groups of words surrounded by double quotes like `"green plant"`. Several term modifiers can be used to refine a search.
- **Wildcard:** these can only be applied to single terms, not to phrases. They add flexibility to the search by matching one character (using `?`) or several characters (using `*`). For example, the query `te?t` will match both `test` and `text`. The query `presi*` will match `president`, `presidential`, `presiding`, *etc*.
- **Fuzzy search:** this is based on edit distance. The system will match words that are similar to the query term, even if they are not an exact match. For example, the query `roam~` will match words like `foam` or `road`. By default, all words within an edit distance of 2 from the query word will match.
- **Proximity search:** this type of search looks for words that appear within a given distance of each other. For example, the query `"battle france"~10` will match documents containing those two words separated by at most 10 other words.
- **Boosting terms:** it is possible to control the relevancy of resulting documents by assigning a boost factor to the terms of your query. For example, the query `green^3 apple` will match documents containing those terms, but documents containing the term `green` will have a higher relevancy. The boost factor can also be set between 0 and 1 to lessen the relevancy of a particular term.

Boolean operators (AND, OR, NOT) can be used to formulate more complex queries. More details are available [here](https://solr.apache.org/guide/8_0/the-standard-query-parser.html#boolean-operators-supported-by-the-standard-query-parser).
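
The term modifiers above are plain text conventions inside the **q** value, so they can be assembled with simple string helpers. The sketch below is only an illustration; the helper names are hypothetical and no escaping of special characters is attempted.

```python
def fuzzy(term, max_edits=None):
    """Fuzzy term: 'roam~', or 'roam~1' when an edit distance is given."""
    return term + "~" + ("" if max_edits is None else str(max_edits))

def proximity(words, distance):
    """Proximity search: the words as a quoted phrase followed by ~distance."""
    return '"' + " ".join(words) + '"~' + str(distance)

def boost(term, factor):
    """Boosted term: 'green^3'."""
    return term + "^" + str(factor)

print(fuzzy("roam"))                       # roam~
print(proximity(["battle", "france"], 10)) # "battle france"~10
print(boost("green", 3) + " apple")        # green^3 apple
print("presi* AND NOT " + fuzzy("roam", 1))
```

The resulting strings still need to be URL-encoded before they are sent to Solr.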

## b. Advanced parameters

Solr provides different ways to parse queries, selected with the defType parameter (see the *Common parameters* section above). The default one, `lucene`, supports all the term modifiers described previously. However, other query parsers can be used, for example `dismax`. This parser does not support the wildcard term modifier, but gives more control through additional parameters.
- **qf:** stands for "query fields". It allows you to specify which fields will be searched and how important they are. This is useful if you have several textual fields and one of them should be given more importance. Imagine our documents contain a title, a content and a comment. In this case, the qf parameter could be set like this: `title^2 content comment^0.5`, giving matches in the title field twice the importance of matches in the content of a document, while matches in a comment are half as important as matches in the content. Note that changing the boost factors does not change which documents are returned, only their relevancy score and the order in which they are returned.
- **mm:** stands for "minimum should match". By default, all terms in a query are optional (even though at least one term should be present for a document to match). You can modify this value using integers or percentages. For example, setting it to `3` means that at least 3 terms of the query should be present in a document for it to match. Another example: `50%` means that at least half of the terms of the query should be present in a document for it to match. You can also set this value to a negative integer or percentage. For example, setting this parameter to `-25%` means that up to a quarter of the terms of the query may be missing in resulting documents.
- **pf and ps:** correspond to *phrase fields* and *phrase slop*. They are very similar to the proximity search, but allow for more flexibility. In a sense, they treat the entire content of the `q` parameter as a phrase without using quotes, and boost documents in which the query terms appear close together. The fields concerned need to be set in the *pf* parameter, while the *ps* parameter is set to an integer indicating by how many positions tokens may be moved to match the phrase.
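- **tie:** stands for "tie breaker". When a term matches in several of the fields listed in *qf*, this value (a float between 0.0 and 1.0) controls how much the lower-scoring fields contribute to the final score compared to the highest-scoring field. With `0.0`, only the highest-scoring field counts; with `1.0`, the scores of all matching fields are summed.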
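
To make the **mm** arithmetic concrete, here is a minimal sketch of how a simple value translates into a number of required terms. It assumes the rounding documented for the dismax parser (percentages are rounded down, and the result is kept between 1 and the number of optional terms) and ignores the conditional forms such as `2<-25%`. The function name and the example query are hypothetical; the field names reuse the imaginary title/content/comment documents from above.

```python
def min_should_match(mm, optional_terms):
    """Number of optional terms that must match for a simple mm value.

    Handles plain integers and percentages only ("3", "-1", "50%", "-25%").
    """
    mm = mm.strip()
    if mm.endswith("%"):
        percent = int(mm[:-1])
        # The number computed from a percentage is rounded down
        amount = abs(percent) * optional_terms // 100
        required = amount if percent >= 0 else optional_terms - amount
    else:
        value = int(mm)
        # A negative integer is the number of terms allowed to be missing
        required = value if value >= 0 else optional_terms + value
    # Never more than the number of optional terms, never less than 1
    return max(1, min(required, optional_terms))

print(min_should_match("50%", 4))   # 2
print(min_should_match("-25%", 4))  # 3

# Example parameter set for a dismax query (illustrative values)
dismax_params = {
    "defType": "dismax",
    "q": "battle france",
    "qf": "title^2 content comment^0.5",
    "mm": "50%",
    "pf": "content",
    "ps": 2,
}
```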

## c. Facetting

The goal of facetting is to provide users with a way to navigate through the search results. After a query, a list of matching documents is returned. If facetting is activated on some fields, the search will also return the number of matching documents for each value of those fields. This information can be used to build a user interface that lets users filter the list of results and see how many documents each filter would keep. You can activate facets in a Solr query by setting the `facet` parameter to `true`.

**<u>Field facetting</u>**

One option is then to choose which fields should be facetted using the `facet.field` parameter. You can request several facet fields by setting the `facet.field` parameter multiple times. Several parameters configure the returned facets. With a single `facet.field` parameter, you can use the following parameters directly. With several `facet.field` parameters, each field can be configured individually with this syntax: `f.<fieldname>.facet.<parameter>`.
- **facet.sort:** this parameter determines the ordering of the terms returned. Possible values are `count`, where the terms with the highest counts come first, and `index`, where terms are ordered alphabetically regardless of their count.
- **facet.limit:** this parameter determines the maximum number of terms to be returned.
- **facet.offset:** this parameter determines an offset in the returned terms, which can be used for paging.

Many more options are available, see [here](https://solr.apache.org/guide/8_0/faceting.html#field-value-faceting-parameters).
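
As an illustration of field facetting, the sketch below builds a parameter set and reads the counts back from a response. It assumes Solr's default JSON layout, in which each facet field is returned as a flat list alternating values and counts; the `language` field and the sample response are made up for the example.

```python
# Facet on two fields, each configured individually (hypothetical field names)
facet_params = {
    "q": "*:*",
    "rows": 0,  # only the facet counts are of interest here
    "facet": "true",
    "facet.field": ["language", "year"],
    "f.language.facet.sort": "index",
    "f.year.facet.limit": 5,
}

def field_facet_pairs(response, field):
    """Turn the flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

# Hand-written response fragment, shaped like the default JSON output
sample_response = {
    "facet_counts": {
        "facet_fields": {"language": ["fr", 120, "de", 80]}
    }
}
print(field_facet_pairs(sample_response, "language"))
```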

**<u>Range facetting</u>**

This type of facetting can be used on date or numerical fields to create buckets of values, for which you will get the number of matching documents. The syntax is similar to field facetting, and you can activate range facetting on specific fields by setting the `facet.range` parameter. If you set this parameter multiple times, you can configure each field individually by using the syntax `f.<fieldname>.facet.range.<parameter>`. The following parameters are mandatory when using range facetting.
- **facet.range.start:** specifies the lower bound of the range.
- **facet.range.end:** specifies the upper bound of the range.
- **facet.range.gap:** specifies the span of each bucket. This value is added to the *start* value repeatedly to build each successive bucket until *end* is reached.

Other options are available to further configure resulting facets, see [here](https://solr.apache.org/guide/8_0/faceting.html#range-faceting).
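
The three mandatory parameters can be pictured with a small sketch that builds the parameter set and lists the buckets they describe. It assumes an integer field and a gap that divides the range evenly; the helper names are hypothetical, and the `year` field is borrowed from the filter examples above.

```python
def range_facet_params(field, start, end, gap):
    """Parameters activating range facetting on a single field."""
    return {
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }

def bucket_bounds(start, end, gap):
    """Bounds of each bucket: the gap is added to start until end is reached."""
    return [(low, low + gap) for low in range(start, end, gap)]

print(range_facet_params("year", 1900, 1950, 10))
print(bucket_bounds(1900, 1950, 10))
```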

**<u>Pivot facetting</u>**

In a sense, this functionality allows you to extract sub-facets from the index. To use it, set the `facet.pivot` parameter to a comma-separated list of fields. The first field is on level 1, the second on level 2, *etc*.
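
The nesting of pivot levels can be flattened with a short recursive helper. This sketch assumes that each pivot entry in the JSON response carries `field`, `value` and `count` keys plus an optional nested `pivot` list; the field names and the sample data are invented for the example.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) for every node of a nested pivot facet."""
    for entry in entries:
        current = path + ((entry["field"], entry["value"]),)
        yield current, entry["count"]
        # Descend into the next level, if any
        yield from flatten_pivot(entry.get("pivot", []), current)

# Hand-written fragment for facet.pivot=language,year
sample_response = {"facet_counts": {"facet_pivot": {"language,year": [
    {"field": "language", "value": "fr", "count": 3, "pivot": [
        {"field": "year", "value": 1900, "count": 2},
        {"field": "year", "value": 1901, "count": 1},
    ]},
]}}}

pivot = sample_response["facet_counts"]["facet_pivot"]["language,year"]
for path, count in flatten_pivot(pivot):
    print(path, count)
```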

## d. Other useful features

**<u>More like this</u>**

Solr offers a way to identify documents similar to the ones returned after a query by setting the `mlt` parameter to `true`. Two parameters are then important.
- **mlt.count:** how many similar documents should be returned for each document in the query response.
- **mlt.fl:** which fields should be used to compute the similarity.

More advanced parameters can be used to further refine how the similarity is computed. You can for example set the minimum/maximum term frequency from which terms are taken into account, the minimum/maximum document frequency, *etc*. See [here](https://solr.apache.org/guide/7_5/morelikethis.html#common-parameters-for-morelikethis) for more details.
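
A parameter set enabling this feature might look like the sketch below. The query, the field names and the threshold values are placeholders; `mlt.mintf` and `mlt.mindf` are the minimum term frequency and minimum document frequency settings mentioned above.

```python
from urllib.parse import urlencode

# Illustrative values only; field names depend on the index schema
mlt_params = {
    "q": "title:president",
    "mlt": "true",
    "mlt.fl": "title,content",  # fields used to compute the similarity
    "mlt.count": 3,             # similar documents returned per result
    "mlt.mintf": 2,             # minimum term frequency
    "mlt.mindf": 5,             # minimum document frequency
}

query_string = urlencode(mlt_params)
print(query_string)
```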

**<u>Hit highlighting</u>**

Enabling highlighting makes Solr return snippets of the resulting documents in which the query terms appear. This is very useful to get a quick view of the context in which those terms are used. To enable it, set the `hl` parameter to `true`. From there, several options are available.
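
As a minimal sketch, highlighting can be requested with `hl`, `hl.fl` (the fields to highlight) and `hl.snippets` (the maximum number of snippets per field). The helper below assumes that the JSON response contains a `highlighting` section keyed by document identifier, then by field name; the sample response, the document identifiers and the field names are invented.

```python
highlight_params = {
    "q": "content:president",
    "hl": "true",
    "hl.fl": "content",
    "hl.snippets": 2,
}

def snippets_by_document(response):
    """Collect all snippets of each document, whatever the field."""
    result = {}
    for doc_id, fields in response.get("highlighting", {}).items():
        result[doc_id] = [s for snippets in fields.values() for s in snippets]
    return result

# Hand-written fragment shaped like the assumed response layout
sample_response = {
    "highlighting": {
        "doc1": {"content": ["the <em>president</em> said", "a new <em>president</em>"]},
        "doc2": {},
    }
}
print(snippets_by_document(sample_response))
```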