# Evaluation in Information Retrieval Review




So far we have seen many IR techniques for improving the accuracy of a system. However, we have not yet discussed how to evaluate the effect of a method. To evaluate, we need a measure of success for our system. Since the training documents have already been observed by the system, it is not a good idea to measure success on the training data. That is why we need a new dataset that the algorithm has not seen. This new data is called "test data." Like the training data, test data should also be labeled as relevant or nonrelevant.

A document that contains all the terms in the query is not necessarily relevant to the query. What matters is whether the document meets the information need. The test document set should be constructed with that in mind.

It is common to see people tune their algorithm's parameters on the test dataset. However, it is not appropriate to expose any test data to the algorithm, because the reported performance then overestimates how the system behaves on unseen data. We need <i>validation data</i>, also called a <i>development test collection</i>, to tune the parameters without using the test data.

There are many available datasets for evaluating algorithms. Two of the most popular are the Text Retrieval Conference (TREC) collections (for evaluating ranking algorithms) and the Reuters collections (for evaluating classification algorithms).

## Evaluation Techniques

The most popular evaluation measures in IR are precision and recall. Precision (P) is the fraction of retrieved documents that are relevant. It is calculated by this formula:

\begin{align}
Precision = \frac {\text{#relevant items retrieved}}{\text{#retrieved items}}
\end{align}

Recall (R) is the fraction of all relevant documents that are retrieved. It is calculated by this formula:

\begin{align}
Recall = \frac {\text{#relevant items in retrieved set}}{\text{#all relevant items}}
\end{align}
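
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `precision_recall` and the document ids are hypothetical, and returning 0.0 for an empty set is a convention assumed here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # number of relevant items retrieved
    # Assumed convention: an empty set yields 0.0 instead of dividing by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 0.75 0.5
```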


Another evaluation measure is accuracy, the fraction of predictions that are correct on the evaluated data. Using accuracy as the evaluation criterion is not always a good idea. If the data is heavily skewed towards one class, a classifier can reach 99.9% accuracy by always predicting that class. For instance, if a classifier for cancer prediction always predicts non-malignant, it would have very good accuracy, but it would mislead people who have cancer and delay early treatment. In IR the skew is usually extreme, since almost all documents are nonrelevant to any given query, so researchers generally prefer precision and recall over accuracy.
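
To make the skew problem concrete, here is a small sketch with made-up labels: one relevant document among 1000, and a classifier that always predicts "nonrelevant". The helper names `accuracy` and `recall_of_positive` are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def recall_of_positive(predicted, actual, positive=1):
    """Fraction of truly positive items that were predicted positive."""
    preds_for_positives = [p for p, a in zip(predicted, actual) if a == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical skewed collection: 1 relevant document (label 1) among 1000.
actual = [1] + [0] * 999
always_negative = [0] * 1000  # classifier that never predicts "relevant"

print(accuracy(always_negative, actual))            # 0.999
print(recall_of_positive(always_negative, actual))  # 0.0
```

The accuracy looks excellent even though the classifier never finds the one document that matters, which is exactly what recall exposes.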

In some situations precision is more important than recall, and in others recall is the priority. For instance, when an ordinary person searches for information on the internet, finding the best results on the first page is the priority. Here precision is more important, since we do not want nonrelevant documents on the first page as the price of comprehensive search results. For an expert, however, the relevant documents on the first page are most likely already known, and a more thorough set of results is needed for research. In this case we favor recall, since we want to fetch all the relevant documents. There is a constraint, however: we usually cannot have both very high precision and very high recall. As recall increases, precision tends to decrease, so there is a trade-off. To tune this trade-off, precision and recall are combined into a single measure called the F-measure. It is the weighted harmonic mean of precision and recall.

\begin{align}
F &= \frac {1}{\alpha\frac{1}{P} + (1-\alpha) \frac{1}{R}} \\
&= \frac {(\beta^2 + 1)PR}{\beta^2P + R}
\text{  , where } \beta^2 = \frac {1-\alpha} {\alpha}
\end{align}

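By tuning $\alpha$ within the range $[0,1]$, we can decide which one to emphasize: values of $\alpha$ above $\frac{1}{2}$ (that is, $\beta < 1$) emphasize precision, and values below $\frac{1}{2}$ ($\beta > 1$) emphasize recall. When $\beta=1$, or equivalently $\alpha=\frac{1}{2}$, the F-measure is called the $F_1$ measure.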
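
The $\beta$ form of the F-measure translates directly into code. The sketch below is illustrative only: the name `f_measure` is hypothetical, and returning 0.0 when both precision and recall are zero is an assumed convention.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily, beta > 1 weights recall more.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention to avoid division by zero
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical values: P = 0.75, R = 0.5.
print(f_measure(0.75, 0.5))            # F1 = 0.6
print(f_measure(0.75, 0.5, beta=0.5))  # closer to precision
print(f_measure(0.75, 0.5, beta=2.0))  # closer to recall
```

Because precision is the larger value in this example, lowering $\beta$ raises the score and raising $\beta$ lowers it.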



<img src="precision_recall_curve.png" width="40%">

\begin{align}
    \text {Fig. 1. An example of precision-recall curve.}
\end{align}

Even though these measures (precision, recall, $F_1$) are well suited to evaluating unordered sets of documents, they are not convenient for evaluating ranked results. We can still gain some information by drawing the precision-recall curve, as shown in Fig. 1.

<img src="prec-recall-avg.png" width="40%">

\begin{align}
    \text {Fig. 2. An example of averaged precision-recall.}
\end{align}


Using this graph we can find the <i>interpolated precision</i> for a recall value: the highest precision found at that recall level or any higher one. In Fig. 1 this corresponds to drawing a horizontal line to the left from each peak of the curve. Taking the interpolated precision at the recall levels from 0 to 1 in steps of 0.1 and averaging each of them over the queries creates another graph (Fig. 2), which is easier to read. However, this is still not enough for evaluation, since we generally want a single number that summarizes the quality of our algorithm.
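
As a rough sketch of how the points behind such curves could be computed for one query, the code below takes the interpolated precision at recall level $r$ to be the highest precision observed at any recall of at least $r$ (the usual definition, stated here as an assumption). All function names and the sample ranking are hypothetical.

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) after each rank of a ranked result list.

    Assumes at least one relevant document exists for the query.
    """
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level (0.0 if none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


def eleven_point_curve(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return [interpolated_precision(points, i / 10) for i in range(11)]


# Hypothetical ranking with relevant documents at ranks 1, 3 and 5.
points = precision_recall_points(["d1", "d5", "d2", "d6", "d3"],
                                 {"d1", "d2", "d3"})
print([round(v, 2) for v in eleven_point_curve(points)])
# [1.0, 1.0, 1.0, 1.0, 0.67, 0.67, 0.67, 0.6, 0.6, 0.6, 0.6]
```

Averaging these eleven values position by position over many queries would give the kind of averaged curve shown in Fig. 2.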

More and more evaluation measures have been proposed over the years. The most prevalent one is <i>Mean Average Precision</i> (MAP). For a single query, average precision is the average of the precision values obtained for the top $k$ documents each time a relevant document is retrieved; MAP is the mean of this value over all queries. It is calculated by this formula:

\begin{align}
    MAP(Q) = \frac {1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j}\sum_{k=1}^{m_j}P(R_{jk})
\end{align}

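where $Q$ is the set of queries used to test ranking relevance, $m_j$ is the number of relevant documents for query $j$, and $R_{jk}$ is the set of ranked results from the top result down to the $k$-th relevant document. In MAP we only use binary decisions: 1 for relevant, 0 for nonrelevant. But how should we evaluate if relevance is given as a continuous value in $[0,1]$? Applying a threshold might help, but it would not take advantage of the continuous values. For this type of problem, researchers use <i>normalized discounted cumulative gain</i> (NDCG). With $R(j,m)$ denoting the relevance score of the document at rank $m$ for query $j$, and $Z_{kj}$ a normalization factor chosen so that a perfect ranking has an NDCG of 1 at $k$, NDCG is calculated by this formula: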

\begin{align}
    NDCG(Q) = \frac {1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj}\sum_{m=1}^{k} \frac{2^{R(j,m)}-1}{\log_{2}(1+m)}
\end{align}
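
Both ranking measures above, MAP and NDCG, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a standard implementation: the function names are hypothetical, relevant documents that are never retrieved are counted with precision 0, and the normalizer $Z_{kj}$ is taken to be one over the DCG of the ideal ordering of the same judged scores.

```python
import math


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Assumption: relevant documents that are never retrieved contribute 0.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k  # precision at this relevant document's rank
    return total / len(relevant)


def mean_average_precision(rankings, relevant_sets):
    """Average of average_precision over all queries."""
    scores = [average_precision(r, rel)
              for r, rel in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores)


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k graded relevance scores."""
    return sum((2 ** g - 1) / math.log2(1 + m)
               for m, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the ideal (descending) ordering.

    Assumption: the ideal ordering is built from the same judged gains,
    so 1 / ideal DCG plays the role of the normalizer Z_kj.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Two hypothetical queries with binary judgments.
rankings = [["d1", "d5", "d2", "d6", "d3"], ["d4", "d9"]]
relevant_sets = [{"d1", "d2", "d3"}, {"d9"}]
print(round(mean_average_precision(rankings, relevant_sets), 3))  # 0.628

# Hypothetical graded scores in ranked order for one query.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))
```

`ndcg_at_k` scores a single query; averaging it over all queries in $Q$ gives the value in the formula.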


Another problem in evaluation is the integrity of our test data. How can we be sure that the test data is unbiased when even two experts can differ on what is relevant? To answer that, we need a measure of how much the experts agree. One such measure is the kappa statistic.

\begin{align}
    \kappa = \frac{P(A) - P(E)}{1-P(E)}
\end{align}

where $P(A)$ is the fraction of judgments on which the experts agree and $P(E)$ is the probability that they would agree by chance. As a rule of thumb, a kappa value above 0.8 indicates good agreement and a value between 0.67 and 0.8 indicates fair agreement. If it is less than 0.67, we should not trust the test data.
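
A minimal sketch of the kappa computation for two judges with binary labels is shown below. The function name and the judgment counts are hypothetical. $P(E)$ is estimated here from marginals pooled over both judges, which is one common choice and an assumption of this sketch; other variants use each judge's own marginals.

```python
def kappa(judge_a, judge_b):
    """Kappa statistic for two judges giving binary relevance labels (0 or 1)."""
    n = len(judge_a)
    # P(A): observed fraction of documents on which the judges agree.
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # P(E): chance agreement from marginals pooled over both judges
    # (an assumption; other variants use each judge's own marginals).
    p_rel = (sum(judge_a) + sum(judge_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    if p_chance == 1.0:
        return 1.0  # all labels identical; treated here as perfect agreement
    return (p_agree - p_chance) / (1 - p_chance)


# Hypothetical judgments on 400 documents:
# 300 both relevant, 20 only judge A, 10 only judge B, 70 both nonrelevant.
judge_a = [1] * 300 + [1] * 20 + [0] * 10 + [0] * 70
judge_b = [1] * 300 + [0] * 20 + [1] * 10 + [0] * 70
print(round(kappa(judge_a, judge_b), 3))  # 0.776
```

With these made-up counts the judges agree on 92.5% of the documents, yet kappa is noticeably lower because much of that agreement would be expected by chance.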

The challenging task in IR is not merely finding relevant data. We can find many documents relevant to a query, but most of the documents on the first page of results may be redundant for the user. Returning results that are both relevant and non-redundant is the most challenging part of finding relevant documents for a query.