Information Retrieval (IR) Evaluation Metrics
----

<center><img src="http://math.sfsu.edu/beck/images/dilbert.accurate.numbers.gif" width="80%"/></center>

By The End Of This Session You Should Be Able To:
----

- Evaluate an IR system with the following concepts:  
    - Precision and recall
    - Precision at k
    - Mean average precision

Confusion Matrix
----

<center><img src="images/cont.png" width="75%"/></center>

<center><img src="images/p_tp.png" width="75%"/></center>

<center><img src="images/recall_tp.png" width="75%"/></center>

<center><img src="images/items.png" width="45%"/></center>

<center><img src="images/precision.png" width="45%"/></center>


<center><img src="images/p.png" width="75%"/></center>

<center><img src="images/r.png" width="75%"/></center>

<center><img src="http://www.info.univ-angers.fr/~gh/Predipath/confus3.jpg" width="75%"/></center>

<center><img src="images/pvsr.png" width="75%"/></center>

Precision & Recall as a Bucket & a Well
------

<center><img src="http://www.motherearthnews.com/homesteading-and-livestock/~/media/Images/MEN/Editorial/Blogs/Homesteading%20and%20Livestock/How%20to%20Use%20a%20Well%20Bucket%20Video/free%20well%20picture.jpg" width="75%"/></center>

Check for understanding
-----

Suppose 100 documents in a collection are relevant to a given query, and 60 of them are retrieved in a given search.

What is the recall?



Recall = (60/100) = .60

In a given search, the system retrieves 80 items, out of which 30 are relevant and 50 are non-relevant. 

What is the precision?

Precision = (30/80) = .375
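
Both answers can be checked with a few lines of Python. This is a minimal sketch; the helper names `recall` and `precision` are illustrative and do not come from any library.

```python
def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant


def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved


# First question: 100 relevant documents exist, 60 of them were retrieved
print(f"Recall = {recall(60, 100):.2f}")

# Second question: 80 documents retrieved, 30 of them relevant
print(f"Precision = {precision(30, 80):.3f}")
```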

When a query has thousands (or millions) of relevant documents, recall is often not a meaningful metric.

No one is interested in reading all of them.

Search Engine Results Page (SERP)
-----

1st position is most important

2nd position is sometimes clicked on

3rd position is rarely clicked on

4th position through the end doesn't matter

----

Above the fold is all that matters. The fold (aka attention) is getting smaller. For example, compare a desktop screen to a mobile screen to a watch.

Need "precision at k" or p@k
----

For example:

If the 1st result is relevant, p@1 = 1.

If the 1st result is not relevant, p@1 = 0.

[P@10 or "Precision at 10"](https://en.wikipedia.org/wiki/Mean_average_precision#Precision_at_K) is the fraction of the top 10 results that are relevant. It corresponds to the first search results page, which typically shows 10 results.

What is precision at different k for this SERP?

1. N / not relevant document
2. N / not relevant document
3. N / not relevant document
4. R / relevant document
5. R / relevant document
6. N / not relevant document
7. R / relevant document 
8. R / relevant document 
9. N / not relevant document
10. R / relevant document

In [31]:
# Here is our data
serp = 'N N N R R N R R N R'.split()

In [32]:
# Convert to boolean vector
serp_relevant = [relevance == 'R' # R = relevant 
                 for relevance in serp]

serp_relevant

[False, False, False, True, True, False, True, True, False, True]

In [35]:
# Calculate precision at each k (k runs from 1 to the length of the SERP)
precisions = [sum(serp_relevant[:k]) / k
              for k in range(1, len(serp_relevant) + 1)]

precisions

[0.0,
 0.0,
 0.0,
 0.25,
 0.4,
 0.3333333333333333,
 0.42857142857142855,
 0.5,
 0.4444444444444444,
 0.5]

In [36]:
# What are the precisions?
for (i,value) in enumerate(precisions):
    print(f"Precision @{i+1} - {value:.3}")

Precision @1 - 0.0
Precision @2 - 0.0
Precision @3 - 0.0
Precision @4 - 0.25
Precision @5 - 0.4
Precision @6 - 0.333
Precision @7 - 0.429
Precision @8 - 0.5
Precision @9 - 0.444
Precision @10 - 0.5


In [37]:
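import numpy as np

# Average precision (AP): mean of P@k taken only at the ranks k that hold a relevant document
precisions_at_hits = [p for p, relevant in zip(precisions, serp_relevant) if relevant]

print("Average precision for this query: {:.2f}".format(np.mean(precisions_at_hits)))

Average precision for this query: 0.42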


Performance across multiple queries
-----

<center><img src="images/map.png" width="75%"/></center>
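
Mean average precision (MAP) summarizes performance across queries by averaging the per-query average precision (AP). The sketch below assumes the common definition in which AP is the mean of P@k over the ranks that hold a relevant result. It normalizes by the relevant results that appear in the list; some definitions divide by the total number of relevant documents in the collection instead. The function names and the three example queries are hypothetical.

```python
def average_precision(relevance):
    """AP for one ranked list: mean of P@k over the ranks k that hold a relevant result."""
    hits = 0
    precisions_at_hits = []
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions_at_hits.append(hits / k)
    # Convention assumed here: a query with no relevant results scores 0
    if not precisions_at_hits:
        return 0.0
    return sum(precisions_at_hits) / len(precisions_at_hits)


def mean_average_precision(rankings):
    """MAP: the mean of the per-query AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical relevance judgments for three queries (True = relevant)
queries = [
    [True, False, True],     # AP = (1/1 + 2/3) / 2
    [False, True],           # AP = (1/2) / 1
    [False, False, False],   # AP = 0
]
print(f"MAP = {mean_average_precision(queries):.3f}")
```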

SERP & CTR: It pays to be a winner
----

<center><img src="images/serp-ctr.svg" width="45%"/></center>

The Internet is "winner take most". It is better to be #1 in a niche than lower down in a wider area.

Discounted Cumulative Gain (DCG)
-----

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized: each result's graded relevance value is discounted in proportion to the logarithm of its position.


<center><img src="images/dcg.png" width="75%"/></center>
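
The logarithmic discount can be sketched in a few lines of Python. This assumes the common form of DCG, the sum of `rel_i / log2(i + 1)` over ranks `i` starting at 1; other variants exist, and the graded judgments below are hypothetical.

```python
import math


def dcg(relevances):
    """DCG of a ranked list of graded relevance scores (rank 1 first)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


# Hypothetical graded judgments (3 = highly relevant, 0 = not relevant)
best_first = [3, 2, 1, 0]
best_last = [0, 1, 2, 3]   # same documents, most relevant ones ranked last

print(f"DCG, best first: {dcg(best_first):.3f}")
print(f"DCG, best last:  {dcg(best_last):.3f}")
```

The same four documents earn a lower DCG when the most relevant ones sit at the bottom of the list, which is the penalty described above.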

Discounted Cumulative Gain (DCG)
------

<center><img src="images/example.png" width="65%"/></center>

How log "sucky" your SERP is

Summary
-----

- Precision and recall are just the start of evaluating an IR system
- Evaluate a SERP with:
    + Precision 
    + Recall
    + p@k
    + MAP
    + Discounted Cumulative Gain (DCG)

<br>
<br> 
<br>

----

---
Advanced metrics
----

[Fall-out](https://en.wikipedia.org/wiki/Information_retrieval#Fall-out): The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available. 

![](images/fallout.png)
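
As a minimal sketch of the definition above, fall-out is the number of non-relevant documents retrieved divided by all non-relevant documents in the collection. The function name and the counts are hypothetical.

```python
def fall_out(nonrelevant_retrieved, total_nonrelevant):
    """Fraction of all non-relevant documents that were retrieved."""
    return nonrelevant_retrieved / total_nonrelevant


# Hypothetical counts: 50 non-relevant documents retrieved
# out of 9,900 non-relevant documents in the collection
print(f"Fall-out = {fall_out(50, 9_900):.4f}")
```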

----
[Generality](http://crpit.com/confpapers/CRPITV49Yan.pdf): The proportion of relevant items per query.
    
The larger the collection, the larger the number of non-relevant items for a given query. Hence, an increase in the level of recall will cause a decrease in precision.

[Source](http://www.cs.usc.edu/assets/002/82932.pdf)
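
The effect of collection size on precision can be illustrated with a small sketch. It assumes, purely for illustration, that recall and fall-out stay fixed while the collection grows around the same 100 relevant documents; the function names and all numbers are hypothetical.

```python
def generality(total_relevant, collection_size):
    """Proportion of the collection that is relevant to the query."""
    return total_relevant / collection_size


def precision_from_rates(recall, fall_out, total_relevant, collection_size):
    """Precision implied by a given recall and fall-out on a collection."""
    relevant_retrieved = recall * total_relevant
    nonrelevant_retrieved = fall_out * (collection_size - total_relevant)
    return relevant_retrieved / (relevant_retrieved + nonrelevant_retrieved)


# Assumed for illustration: 100 relevant documents, recall 0.6, fall-out 0.01
for size in (1_000, 10_000, 100_000):
    g = generality(100, size)
    p = precision_from_rates(0.6, 0.01, 100, size)
    print(f"collection={size:>7}  generality={g:.4f}  precision={p:.3f}")
```

Under these assumptions, precision falls as generality shrinks, because the same fall-out rate pulls in more non-relevant documents from a larger collection.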

<br>
<br> 
<br>

----