# Benchmark

## *Table of contents*
* *Setup Phase*
	1. Import libraries and queries
	2. Showing all the available queries
	3. Query selection

* *Evaluation Phase*
	1. Precision at Standard Recall Levels for query Q
	2. Interpolated Average Precision (IAP) at Standard Recall Levels
	3. R-Precision
	4. Mean Average Precision (MAP)
	5. F-Measure & E-Measure 

## *Setup phase*

### Import libraries and queries

### Showing all the available queries

### Query selection

In [None]:
try:
    examined_q = 3
    print("User Information Need: " + queries[examined_q]["UIN"])
except IndexError as e:
    print("index not valid")

In [None]:
from evaluation.functions import Benchmark

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # suppress FutureWarning messages

b = Benchmark(queries[examined_q])


### Precision at Standard Recall Levels for query Q

In [None]:
# define axes' names
axes = ["recall", "precision"]

# create a dataframe for Seaborn
df = pd.DataFrame()
for model, model_name in models:
    result = b.getResults(20, model)
    # get precision at the standard recall levels for the result list
    SRLValues = b.getSRLValues(
        b.getPrecisionValues(result),
        b.getRecallValues(result)
    )
    
    # tmp dataframe concatenated to the main one
    dfB = pd.DataFrame(SRLValues, columns = axes)
    dfB["Version"] = f'{model_name}'
    
    df = pd.concat([df, dfB])

sns.set_theme()


# plot the line graph
pltP = sns.lineplot(data = df, x = 'recall', y = 'precision', marker='o', markersize=4, hue="Version", palette="colorblind")
pltP.legend(title='Version', bbox_to_anchor=(1.05, 1), loc='upper left')

# set fixed axes; the trailing semicolon suppresses the cell output
pltP.set_xlim([-0.1, 1.1]);
pltP.set_ylim([-0.1, 1.1]);
    

In [None]:
IAPatSRL = {}

for q in queries:
    tmpB = Benchmark(q)
    for model, model_name in models:
        result = tmpB.getResults(20, model)
        SRLValues = tmpB.getSRLValues(
            tmpB.getPrecisionValues(result),
            tmpB.getRecallValues(result)
        )
        IAPatSRL.setdefault(model_name, []).append(SRLValues)
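The interpolation step behind these values can be illustrated with a minimal sketch. It is not the code of `Benchmark.getSRLValues`, whose implementation is not shown in this notebook. The function name `precision_at_standard_recall_levels`, its arguments, and the `(recall, precision)` output shape are assumptions made for illustration only.

```python
def precision_at_standard_recall_levels(relevance, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 (hypothetical helper).

    relevance: list of booleans, one per ranked result (True = relevant).
    total_relevant: number of relevant documents for the query (> 0).
    """
    precisions, recalls = [], []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / total_relevant)

    points = []
    for level in (i / 10 for i in range(11)):
        # Interpolation rule: take the best precision observed
        # at any recall greater than or equal to this level.
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append((level, max(candidates, default=0.0)))
    return points


# Toy ranking: relevant documents at ranks 1 and 3, two relevant in total.
ranking = [True, False, True, False, False]
srl_points = precision_at_standard_recall_levels(ranking, total_relevant=2)
```

Levels that the ranking never reaches get a precision of 0, which is why curves for weaker models drop to the x-axis at high recall.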

### Interpolated Average Precision (IAP) at Standard Recall Levels

In [None]:
from matplotlib.ticker import MultipleLocator
import textwrap

versions = [] 
AvPr_values=[]

for model, model_name in models:
    result = b.getResults(20, model)
    SRLValues = b.getSRLValues(
        b.getPrecisionValues(result),
        b.getRecallValues(result)
    )
    
    AvPr_values.append(b.getIapAvgPrecision(SRLValues))
    
    versions.append(textwrap.fill(model_name, width=10,
                    break_long_words=True))
    
# plot the average precisions
# apply the default theme
sns.set_theme()


# create a dataframe for Seaborn
df = pd.DataFrame({"Search-engine version": versions, "IAP at SRL": AvPr_values})

# plot the bar graph
pltAvPr = sns.barplot(data = df, x = "Search-engine version", y = 'IAP at SRL',palette="colorblind")


# set the axis range; the trailing semicolon suppresses the cell output
pltAvPr.set_ylim([0.0, max(AvPr_values)+0.20]); # set y-axis    
pltAvPr.yaxis.set_major_locator(MultipleLocator(0.05))

R-Precision measures how many relevant documents are returned among the first $n$ results, compared with the ideal case in which the first $n$ documents are all relevant.
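As a minimal sketch of this measure, assuming $n$ is the number of relevant documents for the query (the usual definition of R-Precision): the function name `r_precision` and its arguments are hypothetical, and this is not the code of `Benchmark.getRPrecision`.

```python
def r_precision(relevance, total_relevant):
    """Fraction of the first `total_relevant` results that are relevant.

    relevance: list of booleans, one per ranked result (True = relevant).
    total_relevant: number of relevant documents for the query (assumed to be n).
    """
    if total_relevant == 0:
        return 0.0
    top = relevance[:total_relevant]
    # Divide by n even if fewer than n results were returned:
    # missing results count as non-relevant.
    return sum(top) / total_relevant


# Three relevant documents exist; two of them appear in the top three results.
score = r_precision([True, False, True, True, False], total_relevant=3)
```

A score of 1.0 corresponds to the ideal case, where every one of the first $n$ results is relevant.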

### R-Precision Comparison between two models

In [None]:
'''
Available models: 
- "BM25F"
- "Doc2Vec"
- "Sentiment Weighting Model"
- "Sentiment Weighting Model - Amount Reviews Based"
'''

model1 = "BM25F"
model2 = "Doc2Vec"

for model, model_name in models:
    if model1 == model_name:
        model1 = model
    if model2 == model_name:
        model2 = model

In [None]:
RP_comparison = []
for q in queries:
    tmpB = Benchmark(q)
    model1Res = tmpB.getResults(20, model1)
    model2Res = tmpB.getResults(20, model2)
    
    RP_comparison.append(
        tmpB.getRPrecision(model1Res) - tmpB.getRPrecision(model2Res)
    )

Plot the graph

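### Mean Average Precision (MAP)

First, compute the Non-Interpolated Average Precision (NIAP) for each query $q$ using $\sum_{r=1}^{|R_q|} \frac{P_q(r/|R_q|)}{|R_q|}$, where $|R_q|$ is the number of relevant documents for $q$ and $P_q(r/|R_q|)$ is the precision at the point where the $r$-th relevant document is retrieved.

Second, compute the Mean Average Precision (MAP) for each model, which is the mean of the NIAP values over all queries.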
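The two steps can be sketched as follows. This is an illustration under assumptions, not the notebook's own implementation: the names `average_precision` and `mean_average_precision` and the `(relevance, total_relevant)` input shape are hypothetical. Relevant documents that are never retrieved contribute a precision of 0.

```python
def average_precision(relevance, total_relevant):
    """Non-interpolated average precision for one query.

    relevance: list of booleans, one per ranked result (True = relevant).
    total_relevant: number of relevant documents for the query, |R_q|.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank of the hits-th relevant document.
            precision_sum += hits / rank
    return precision_sum / total_relevant


def mean_average_precision(runs):
    """Mean of the per-query average precisions for one model."""
    scores = [average_precision(rel, total) for rel, total in runs]
    return sum(scores) / len(scores)


# Two toy queries for a single model: (ranked relevance flags, |R_q|).
runs = [
    ([True, False, True], 2),
    ([False, True, False], 1),
]
map_score = mean_average_precision(runs)
```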

### F-Measure & E-Measure 

E-measure is a variant of the harmonic mean which allows us to emphasize the value of recall or precision based on what we are interested in:

- $b=1$: the E-measure equals $1-\text{F-Measure}$
- $b>1$ emphasizes precision
- $b<1$ emphasizes recall

*Precision or Recall?*

- High recall: most relevant documents are returned, but along with many irrelevant ones.
- High precision: few results, but each with a greater probability of being relevant.

The value of *b* can be customized.
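The effect of *b* can be seen in a small sketch. It assumes the form $E = 1 - \frac{1+b^2}{b^2/P + 1/R}$, chosen so that $b>1$ weights precision as stated above. Some sources place $b^2$ on the recall term instead, which reverses the roles. The function names are hypothetical and do not come from the `Benchmark` class.

```python
def e_measure(precision, recall, b=1.0):
    """E-measure under the assumed convention: b > 1 favours precision."""
    if precision == 0 or recall == 0:
        # The weighted harmonic mean is 0 when either value is 0.
        return 1.0
    weighted_mean = (1 + b**2) / (b**2 / precision + 1 / recall)
    return 1.0 - weighted_mean


def f_measure(precision, recall):
    """F-measure: the harmonic mean, i.e. 1 - E with b = 1."""
    return 1.0 - e_measure(precision, recall, b=1.0)


# A result list with low precision but full recall.
p, r = 0.5, 1.0
balanced = e_measure(p, r, b=1.0)
precision_heavy = e_measure(p, r, b=2.0)  # penalises the low precision more
recall_heavy = e_measure(p, r, b=0.5)     # rewards the high recall more
```

Because the E-measure is an error-style score, lower values are better: in this example the recall-heavy setting gives the lowest value and the precision-heavy setting the highest.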