<a href="https://colab.research.google.com/github/Yunyi-W/interactive-widget-for-PyTerrier-on-Colab-notebook/blob/main/fig_display_ver0_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# User Evaluation for a display method of IR experiment results using PyTerrier

This notebook shows how to use our proposed `fig_display()` method to simplify IR experiments with PyTerrier.

Please read the API description and then follow the guidance to try each functionality of the method.



# **API**

`fig_display(pipelines, topics=None, qrels=None, eval_metrics=[MAP], names=None, perquery=False, baseline=None, **kwargs)`

**Introduction**

fig_display() is a UI tool for PyTerrier based on ipywidgets, i.e. a user interface for IR experiments that can be used within Jupyter notebooks.

With this method, users can view the results of IR experiments and compare them across pipelines and queries through tables and figures, by passing parameters rather than writing long pieces of PyTerrier code.

The tool offers various *functionalities*. A functionality becomes available when its preconditions are satisfied; if the preconditions of several functionalities are satisfied, all of them are available.
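
The availability rules can be summarised in a few lines of plain Python. The following is a minimal sketch, not part of `fig_display()` itself: the helper name `available_tabs` and its arguments are hypothetical, and it only mirrors the conditions checked in the `fig_display()` definition later in this notebook.

```python
def available_tabs(num_pipelines, topics_is_dataframe, has_qrels):
    """Return the tab names expected for a given combination of inputs.

    Hypothetical helper for illustration only; it mirrors the documented
    preconditions and does not call PyTerrier or ipywidgets.
    """
    tabs = ["SINGLE QUERY"]  # always shown
    if num_pipelines > 1:
        tabs.append("COMPARE")  # needs two pipelines to put side by side
        if topics_is_dataframe and has_qrels:
            # the experiment needs both a topics DataFrame and qrels
            tabs.append("AVERAGE PERFORMANCE")
    return tabs


print(available_tabs(1, False, False))
print(available_tabs(3, True, False))
print(available_tabs(3, True, True))
```

Note that, under these rules, the "AVERAGE PERFORMANCE" tab never appears without the "COMPARE" tab.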

**Functionalities**

1. Display results for a query ("SINGLE QUERY" tab)

  If the parameter 'topics' is None, the result of the retrieval transformation is shown after the user enters a text query.

  If the parameter 'topics' contains a DataFrame of topics (e.g. obtained from a dataset, cf. `dataset.get_topics()`), users can select a system and a query from those they passed in to view the result. If qrels are also provided, the relevance label of each document is displayed as well (i.e. relevant or not).

2. Comparison of two systems side by side ("COMPARE" tab)

  If there is more than one transformer pipeline, a comparison of the results of two transformer pipelines can be shown side by side.

3. Average performance ("AVERAGE PERFORMANCE" tab)

  This tab requires more than one transformer pipeline, 'topics' given as a DataFrame, and qrels. If the parameter 'perquery' is True, a bar chart shows the per-query difference in a `pt.Experiment()` measure between two transformer pipelines. Users can select any two pipelines and any one of the evaluation measures they passed in, and can set a threshold on the absolute difference to filter the queries shown. If 'perquery' is False, the result table of `pt.Experiment()` is shown instead.

  **Note**: the chart is saved as fig.jpg. In Colab, download it by clicking the file icon on the left side to open "Files", hovering over fig.jpg, clicking "..." and choosing "Download".
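
The per-query comparison behind the "AVERAGE PERFORMANCE" chart can be illustrated without PyTerrier. The sketch below is an assumption-laden simplification: the function name `per_query_differences` and the score values are made up, and plain dictionaries stand in for the per-query output of `pt.Experiment()`.

```python
def per_query_differences(scores_a, scores_b, threshold=0.0):
    """Return (qid, difference) pairs with |difference| >= threshold.

    scores_a and scores_b map a query id to one measure value (e.g. AP)
    for two pipelines. Results are sorted from largest gain to largest
    loss, which is the order used for the bar chart.
    """
    shared = scores_a.keys() & scores_b.keys()  # queries scored by both
    diffs = [(qid, scores_a[qid] - scores_b[qid]) for qid in shared]
    kept = [(qid, d) for qid, d in diffs if abs(d) >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up per-query values, for illustration only.
bm25 = {"1": 0.50, "2": 0.25, "3": 0.75}
pl2 = {"1": 0.25, "2": 0.25, "3": 1.00}
print(per_query_differences(bm25, pl2, threshold=0.1))
```

Raising the threshold hides queries on which the two pipelines perform almost identically, so the chart focuses on the queries that differ most.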

**Parameters:**

In general, the parameters to `fig_display()` correspond to the parameters supported by `pt.Experiment()`, and should be familiar to any experienced PyTerrier user.

* **pipelines**(list, pyterrier.Transformer) - A list of transformers or a single transformer to evaluate. If you already have the results for one (or more) of your systems, a results DataFrame can also be used here. Results produced by the transformers must have "qid", "docno", "score" and "rank" columns.

* **topics**(str, pandas.DataFrame)(default: None) - Either a query string, a path to a topics file, or a pandas.DataFrame with columns=['qid', 'query']. If 'topics' is None, users can enter a query string on the "SINGLE QUERY" page for further operations.

* **qrels**(str, pandas.DataFrame)(default: None) - Either a path to a qrels file or a pandas.DataFrame with columns=['qid', 'docno', 'label'].

* **eval_metrics**(list)(default: [MAP]) - Which evaluation metrics to use, e.g. [MAP, nDCG@10].

* **names**(list)(default: None) - List of names for each retrieval system when presenting the results. If None, the str() representation of each transformer is used as its name.

* **perquery**(bool)(default: False) - If True, returns each metric for each query; otherwise returns the mean of each metric across all queries.

* **baseline**(int)(default: None) - If set to the index of an item of the `pipelines` list, calculates the number of queries improved and degraded, and the statistical significance (paired t-test p value), for each measure. If None, no additional columns are added for each measure.
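
To make the expected column layout concrete, here is a minimal sketch of how relevance labels can be attached to ranked results by matching on `qid` and `docno`. It uses plain lists of dictionaries rather than DataFrames; the function name `attach_labels` and all rows are hypothetical. In the notebook itself the same join is done with a pandas left merge, which marks unjudged documents with NaN rather than `None`.

```python
def attach_labels(results, qrels):
    """Left-join relevance labels onto ranked results by (qid, docno).

    Both arguments are lists of dicts that use the column names expected
    by fig_display(). Documents without a judgement get label None.
    """
    labels = {(q["qid"], q["docno"]): q["label"] for q in qrels}
    return [
        dict(row, label=labels.get((row["qid"], row["docno"])))
        for row in results
    ]


# Made-up rows, for illustration only.
results = [
    {"qid": "1", "docno": "d7", "score": 12.3, "rank": 0},
    {"qid": "1", "docno": "d2", "score": 9.8, "rank": 1},
]
qrels = [{"qid": "1", "docno": "d7", "label": 1}]
labelled = attach_labels(results, qrels)
print([(row["docno"], row["label"]) for row in labelled])
```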


In [None]:
%pip install python-terrier

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-terrier
  Downloading python-terrier-0.9.2.tar.gz (104 kB)
  Preparing metadata (setup.py) ... done
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pyjnius>=1.4.2
  Downloading pyjnius-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting matchpy
  Downloading matchpy-0.5.5-py3-none-any.whl (69 kB)
Collecting deprecated
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting chest
  

In [None]:
import pyterrier as pt
if not pt.started():
    pt.init()

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



The following cell contains the definition of the `fig_display()` function. You do not need to read the code, but you must execute the cell.

In [None]:
# from google.colab import output
# output.enable_custom_widget_manager()
from matplotlib.ticker import MultipleLocator
import matplotlib.pyplot as plt
import pandas
import ipywidgets as widgets
import pyterrier as pt
from pyterrier.measures import *
from IPython.display import display

page_layout = widgets.Layout(display='flex',flex_flow='column',align_content='flex-start')
style = {'description_width': 'initial'}

def fig_display(
    pipelines,
    topics = None,
    qrels = None,
    eval_metrics = [MAP],
    names = None,
    perquery = False,
    baseline = None,
    **kwargs):

  #The interactive pages
  class SingleQueryPage:
    #Page1 for dynamically showing a table for a selected/entered topic
    #If no topics are passed in, users enter a query and submit it.
    #Otherwise, users select a topic from the topic list
    def __init__(self,pipeline_dic,query_dic):
      self.name = 'SINGLE QUERY'
      #Needed widgets
      self.title = widgets.Label()
      self.pipeline = widgets.Dropdown(description='Select Pipeline:',options=pipeline_dic.keys(),disabled=True,style=style)
      self.search_button = widgets.Button(description='Search')
      self.r_table = widgets.Output()#The widget for outputing the table

      self.topics_layout = widgets.GridspecLayout(3,2)
      self.topics_layout.layout=page_layout

      self.topics_layout[0,0] = self.title
      self.topics_layout[1,0] = self.pipeline

      #lay out the page based on parameter 'topics'
      self.layout1()
      if len(query_dic)>=1:
        self.layout2(query_dic)

      self.pageLayout = widgets.VBox([self.topics_layout,self.r_table])

    #The method for the layout when the parameter 'topics' is none
    def layout1(self):
      self.title.value = 'Please enter a query'
      self.topics = widgets.Text(description='Input search:',disabled=False)
      self.topics_layout[2,0] = self.topics
      self.topics_layout[2,1] = self.search_button
    #The method for the layout when the 'topics' is not none
    def layout2(self,query_dic):
      self.title.value = 'Please select a query'
      self.topics = widgets.Dropdown(description='Query:',options=query_dic.keys(),disabled=False,style=style)
      self.topics_layout[2,:] = self.topics


  class ComparePage:
    #Page2 for comparing two tables from different pipelines side by side
    #if there is more than one pipeline, the page is available.
    def __init__(self,singleQueryPage):
      self.name = 'COMPARE'
      #Needed widgets
      self.pipeline_left = singleQueryPage.pipeline
      self.pipeline_right = widgets.Dropdown(description='Select Another Pipeline:',options=singleQueryPage.pipeline.options,disabled=True,style=style)
      if len(singleQueryPage.pipeline.options)>1:
        #In the initial state, different tables are displayed on the left and right sides.
        self.pipeline_right.value=singleQueryPage.pipeline.options[1]

      self.r_table_left = widgets.Output()
      self.r_table_right = widgets.Output()

      self.pipe_result_left = widgets.VBox([self.pipeline_left,self.r_table_left],layout={'border':'1px solid gray'})
      self.pipe_result_right = widgets.VBox([self.pipeline_right,self.r_table_right],layout={'border':'1px solid gray'})
      self.compare_layout = widgets.HBox([self.pipe_result_left,self.pipe_result_right])

      self.pageLayout = self.compare_layout

  class ExpPage:
    #Page3 for generating a chart to display the difference of the results between different pipelines with the selected measure,
    #or the result table of experiment()
    def __init__(self,comparePage,eval_metrics):
      self.name = 'AVERAGE PERFORMANCE'

      self.title = widgets.Label(value='Average Performance:')
      self.pipeline_left = widgets.Dropdown(description='Pipeline 1:',style=style)
      self.pipeline_right = widgets.Dropdown(description='Pipeline 2:',style=style)
      self.measure = widgets.Dropdown(description='Measure:',style=style)
      self.threshold = widgets.FloatSlider(description='Threshold:',min=0.0,disabled=False)
      self.select_layout = widgets.GridspecLayout(3,2)
      #self.select_layout.layout=page_layout
      self.select_layout[0,0] = self.title
      self.select_layout[1,0] = self.pipeline_left
      self.select_layout[2,0] = self.pipeline_right
      self.select_layout[1,1] = self.measure
      self.select_layout[2,1] = self.threshold

      self.e_table = widgets.Output()

      self.pageLayout = self.e_table

  #Methods for data processing
  def single_calculate(pipeline,topics):
    #Call PyTerrier methods to get results with different single topics
    if not isinstance(topics,pandas.DataFrame):
      #If topics is not a list of topics
      result = pipeline_dic[pipeline].search(topics)
    else:
      result = pipeline_dic[pipeline].transform(topics)
    return result

  def attach_revelant_label(result):
    #Attach a relevance label to each result row
    #Only possible when qrels are given and the result has qid and docno columns
    if qrels is not None:
      if set(['qid','docno']).issubset(result.columns):
        #Left-join the result with the qrels on qid and docno
        result = pandas.DataFrame.merge(result,qrels,how='left',on=['qid','docno'])
    return result

  def exp_calculate():
    exp_result = pt.Experiment(
    retr_systems=pipelines,
    topics=topics,
    qrels=qrels,
    eval_metrics=eval_metrics,
    names=names,
    perquery=perquery,
    baseline=baseline,
    **kwargs)
    if perquery is True:
      result_joined = pandas.merge(exp_result,exp_result,on=['qid','measure'],how='inner')
      exp_result = result_joined.drop(result_joined[result_joined['name_x']==result_joined['name_y']].index)
      exp_result['difference'] = exp_result['value_x']-exp_result['value_y']
    return exp_result


  # def exp_thres_filter(result,pipeline1,pipeline2,measure,threshold):
  #   result = result[(result.name_x==pipeline1)&(result.name_y==pipeline2)]
  #   result = result[(result['difference']>=threshold)|(result['difference']<=-(threshold))]
  #   return result

  def output_display(output,dataframe):
    with output:
      output.clear_output(wait=True)
      display(dataframe)

  def getPipeline(pipelines):
    #pipelines is a list
    if isinstance(pipelines,list):
      for p in pipelines:
        pipeline_dic[str(p)] = p
    else:
      pipeline_dic[str(pipelines)] = pipelines
    return pipeline_dic

  def getQuery(topics):
    if isinstance(topics,pandas.DataFrame):
      for row in topics.itertuples():
        #topics = pd.DataFrame([["2","mathematical"]],columns=['qid','query'])
        df = pandas.DataFrame([[getattr(row,'qid'),getattr(row,'query')]],columns=['qid','query'])
        name = str(getattr(row,'qid'))+'-'+getattr(row,'query')
        query_dic[name] = df
    elif not topics is None:
      query_dic[topics] = topics
    return query_dic

  #initialize
  pipeline_dic = {}
  query_dic = {}
  getPipeline(pipelines)
  getQuery(topics)
  pages = []
  singleQuery = SingleQueryPage(pipeline_dic,query_dic)
  pages.append(singleQuery)
  compare = ComparePage(singleQuery)
  average = ExpPage(compare,eval_metrics)
  if isinstance(pipelines,list) and len(pipelines)>1:
    pages.append(compare)
    if isinstance(topics,pandas.DataFrame) and qrels is not None:
      pages.append(average)

  #search button clicked
  def on_button_clicked(b):
    #if users input text in box
    if singleQuery.topics.value != "":
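      #A name that evaluates to a DataFrame is used as a topic set;
      #anything else (including multi-word text) is treated as a query string
      try:
        entered = eval(singleQuery.topics.value)
      except Exception:
        entered = None
      if isinstance(entered,pandas.DataFrame):
        query_dic[singleQuery.topics.value] = entered
        topics = entered
      else:
        topics = singleQuery.topics.value
      getQuery(topics)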
      # print(isinstance(singleQuery.topics.value,str))
      #print(query_dic[singleQuery.topics.value])
      output_display(singleQuery.r_table,single_calculate(singleQuery.pipeline.value,query_dic[singleQuery.topics.value]))
      if isinstance(pipelines,list) and len(pipelines)>1:
        singleQuery.pipeline.disabled=False
        compare.pipeline_right.disabled=False
        output_display(compare.r_table_right,single_calculate(compare.pipeline_right.value,query_dic[singleQuery.topics.value]))
        output_display(compare.r_table_left,single_calculate(compare.pipeline_left.value,query_dic[singleQuery.topics.value]))
        # if isinstance(topics,pandas.DataFrame):
        #   output_display(average.e_table,exp_thres_filter(exp_calculate(average.pipeline_left.value,average.pipeline_right.value,average.measure.value),average.threshold.value))
  #observe method
  def on_pipevalue_left_change(change):
    # if not topics is None:
    #   topic = query_dic[singleQuery.topics.value]
    # else:
    #   topic = singleQuery.topics.value
    dataframe = single_calculate(change['new'],query_dic[singleQuery.topics.value])
    output_display(singleQuery.r_table,attach_revelant_label(dataframe))
    output_display(compare.r_table_left,single_calculate(change['new'],query_dic[singleQuery.topics.value]))
  def on_pipevalue_right_change(change):
    # if not topics is None:
    #   topic = query_dic[singleQuery.topics.value]
    # else:
    #   topic = singleQuery.topics.value
    dataframe = single_calculate(change['new'],query_dic[singleQuery.topics.value])
    output_display(compare.r_table_right,dataframe)

  def on_queryValue_change(change):
    dataframe = single_calculate(singleQuery.pipeline.value,query_dic[change['new']])
    output_display(singleQuery.r_table,attach_revelant_label(dataframe))
    # dataframe_right = single_calculate(pipeline_d_right.value,query_dic[change['new']])
    # output_display(r_table_right,dataframe_right)

  def on_thresMax_change(change):
    average.threshold.step=change['new']/100.0

  #layout in different conditions
  if topics is None:
    singleQuery.search_button.on_click(on_button_clicked)
    singleQuery.pipeline.observe(on_pipevalue_left_change,names='value')
    compare.pipeline_right.observe(on_pipevalue_right_change,names='value')
  else:
    singleQuery.topics.observe(on_queryValue_change,names='value')
    r_result = single_calculate(singleQuery.pipeline.value, query_dic[singleQuery.topics.value])
    output_display(singleQuery.r_table,attach_revelant_label(r_result))
    if isinstance(pipelines,list) and len(pipelines)>1:
      singleQuery.pipeline.disabled=False
      compare.pipeline_right.disabled=False
      singleQuery.pipeline.observe(on_pipevalue_left_change,names='value')
      output_display(compare.r_table_right,single_calculate(compare.pipeline_right.value,topics))
      output_display(compare.r_table_left,single_calculate(compare.pipeline_left.value,topics))
      compare.pipeline_right.observe(on_pipevalue_right_change,names='value')
      if isinstance(topics,pandas.DataFrame) and qrels is not None:

        exp_df = exp_calculate()
        if perquery:
          #The pairwise columns (name_x, name_y, measure) only exist in the per-query result
          average.pipeline_left.options=exp_df['name_x'].unique()
          average.pipeline_right.options=exp_df['name_y'].unique()
          average.measure.options=exp_df['measure'].unique()
          average.threshold.observe(on_thresMax_change,names='max')
          def threshold_func(pipe1,pipe2,measures,threshold):
            plt.figure(figsize=(20,10))
            exp_df_temp = pandas.DataFrame.merge(exp_df,topics,how='left',on=['qid',])
            df = exp_df_temp[(exp_df_temp.name_x==pipe1)&(exp_df_temp.name_y==pipe2)]
            df = df[df['measure']==measures]
            df = df[(df['difference']>=threshold)|(df['difference']<=-threshold)]
            df['qid:query'] = df['qid']+': '+df['query']
            df = df.sort_values(by='difference',ascending=False)
            if len(df['difference'])>0:
              #Scale the slider to the largest absolute difference that is shown
              average.threshold.max = float(df['difference'].abs().max())
            #display(result1)

            plt.bar(df['qid:query'],df['difference'])
            y_major_locator = MultipleLocator(0.05)
            ax = plt.gca()
            ax.yaxis.set_major_locator(y_major_locator)
            plt.xticks(rotation=90)
            plt.tick_params(labelsize=16,pad=5)
            plt.xlabel('Qid:Query',fontsize=25)
            plt.ylabel('Delta '+measures,fontsize=25)
            #Save the chart so it can be downloaded from the Colab file browser
            plt.savefig('fig.jpg')
            plt.show()

          inter_plot = widgets.interactive_output(
              threshold_func,
              {'pipe1':average.pipeline_left,'pipe2':average.pipeline_right,'measures':average.measure,'threshold':average.threshold})
          with average.e_table:
            display(average.select_layout,inter_plot)
        else:
          output_display(average.e_table,exp_df)


  #Tab container: assign the children first, then set the tab titles
  process_bar = widgets.Tab()
  process_bar.children = [page.pageLayout for page in pages]
  for i, page in enumerate(pages):
    process_bar.set_title(i,page.name)
  display(process_bar)

## Functionality 1

Pass pipelines and topics, then view the results for a selected pipeline and query on the "SINGLE QUERY" page. Try changing the pipeline or the query. If you pass `None` as topics, as the cell below does, you can type a query as free text instead.

If the query comes from a test collection and qrels are provided, you are also shown which documents are relevant.

Some pipelines and topics are defined below; you can also use your own.


In [None]:
vaswani_dataset = pt.get_dataset("vaswani")
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
index = pt.IndexFactory.of(vaswani_dataset.get_index())
br = pt.BatchRetrieve(index, wmodel="BM25")
pl = pt.BatchRetrieve(index, wmodel="PL2")
tf = pt.BatchRetrieve(index, wmodel="Tf")
pipelines1 = [br,pl,tf]
pipelines2 = None
topics1 = vaswani_dataset.get_topics()
topics2 = cord19.get_topics(variant='title')

fig_display(pipelines1, None)

Tab(children=(VBox(children=(GridspecLayout(children=(Label(value='Please enter a query', layout=Layout(grid_a…

## Functionality 2

Enter more than one pipeline and then compare any two of them side by side in page "COMPARE".

In [None]:
fig_display(pipelines1, topics1)

Tab(children=(VBox(children=(GridspecLayout(children=(Label(value='Please select a query', layout=Layout(grid_…

## Functionality 3

Pass pipelines, topics, qrels, eval_metrics and perquery, then compare the `pt.Experiment()` results of any two pipelines for a selected evaluation metric and threshold on the "AVERAGE PERFORMANCE" page.

In [None]:
qrels1 = vaswani_dataset.get_qrels()
qrels2 = cord19.get_qrels()

eval_metrics = [MAP, nDCG, nDCG@10]
perquery = True
fig_display(pipelines1,topics1,qrels1,eval_metrics=eval_metrics,perquery=perquery)

Tab(children=(VBox(children=(GridspecLayout(children=(Label(value='Please select a query', layout=Layout(grid_…

Thank you for taking the time to complete this notebook. Please return to the questionnaire and complete the form.