# Assignment 4: Evaluating Search Engines

For this assignment, we leave aside the code we developed so far, and look into the more general issue of how to evaluate and compare different search engines. The ultimate test for any Information Retrieval system is how well it is able to satisfy the information needs of users.

# Cohen's Kappa

Our evaluation will involve the calculation of [Cohen's kappa](https://en.wikipedia.org/wiki/Cohen's_kappa) to quantify the degree to which two human assessors agree or disagree on whether results are considered relevant or not. To calculate Cohen's kappa, we are going to use the [scikit-learn library](http://scikit-learn.org/stable/):

In [2]:
! pip install --user scikit-learn



In [3]:
from sklearn.metrics import cohen_kappa_score

This library expects relevance assessments as lists of elements where `1` stands for _relevant_ and `0` stands for _not relevant_, for example like this:

In [4]:
a1=[1,0,1,0,1,0,1,0]

This list means that the first document was assessed to be relevant, the second to be not relevant, the third to be relevant etc.

We need two assessments in order to calculate Cohen's kappa, so let's make another exemplary list that only differs on the last element:

In [5]:
a2=[1,0,1,0,1,0,1,1]

We can now invoke the library as follows to calculate the agreement between the two:

In [6]:
cohen_kappa_score(a1, a2)

0.75

This value represents high agreement. We can reach maximal agreement if the two assessments are identical:

In [7]:
cohen_kappa_score(a1, a1)

1.0
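To see where these numbers come from, here is a minimal sketch that recomputes kappa by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The helper name `manual_kappa` is an assumption made for illustration and is not part of scikit-learn; it only handles binary labels.

```python
def manual_kappa(a, b):
    """Cohen's kappa for two equally long lists of 0/1 labels (illustrative helper)."""
    n = len(a)
    # Observed agreement: fraction of positions where both assessors agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: probability that both pick the same label independently.
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in (0, 1))
    return (p_o - p_e) / (1 - p_e)

a1 = [1, 0, 1, 0, 1, 0, 1, 0]
a2 = [1, 0, 1, 0, 1, 0, 1, 1]

print(manual_kappa(a1, a2))  # p_o = 0.875, p_e = 0.5, so kappa = 0.75
print(manual_kappa(a1, a1))  # p_o = 1.0, p_e = 0.5, so kappa = 1.0
```

Both values match the output of `cohen_kappa_score` shown above.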

Now, let's see what happens for a third assessment that differs from the first one in three positions (the last three):

In [8]:
a3=[1,0,1,0,1,1,0,1]

cohen_kappa_score(a1, a3)

0.25

We get a smaller but still positive value, because these two assessments still mostly agree. If we make a further example that differs on 6 of the 8 positions, we get the following result:

In [9]:
a4=[1,0,0,1,0,1,0,1]

cohen_kappa_score(a1, a4)

-0.5

The score is now negative, because the two differ on more positions than they agree. The agreement is in fact less than what you would expect to occur just by chance. We get the maximal disagreement if we define a fifth example that disagrees on all positions:

In [10]:
a5=[0,1,0,1,0,1,0,1]

cohen_kappa_score(a1, a5)

-1.0

Be aware that the kappa score cannot be calculated if you have only `1`s or only `0`s:

In [11]:
a6=[1,1,1,1,1,1,1,1]
a7=[1,1,1,1,1,1,1,1]

cohen_kappa_score(a6, a7)

  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


nan
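The `nan` appears because the chance agreement is exactly 1 when both lists contain only the same single label, so the kappa formula (p_o - p_e) / (1 - p_e) divides zero by zero. The following sketch reproduces that arithmetic; `chance_agreement` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def chance_agreement(a, b):
    """Probability that two assessors pick the same 0/1 label purely by chance."""
    n = len(a)
    return sum((a.count(label) / n) * (b.count(label) / n) for label in (0, 1))

a6 = [1, 1, 1, 1, 1, 1, 1, 1]
a7 = [1, 1, 1, 1, 1, 1, 1, 1]

p_e = chance_agreement(a6, a7)
print(p_e)      # 1.0: both assessors always say "relevant"
print(1 - p_e)  # 0.0: the denominator of the kappa formula vanishes
```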

And in the case of a highly skewed set (a vast majority of agreements on either `1` or `0`), the kappa score can be counter-intuitive:

In [12]:
a8=[1,1,1,1,1,1,0,1]
a9=[1,1,1,1,1,1,1,0]

cohen_kappa_score(a8, a9)

-0.1428571428571428
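The negative score may look surprising given that the assessors agree on 6 of 8 documents. A small sketch with exact fractions shows why: with such skewed labels the agreement expected by chance (25/32) is already higher than the observed agreement (3/4). The variable names below are chosen for this illustration only.

```python
from fractions import Fraction

a8 = [1, 1, 1, 1, 1, 1, 0, 1]
a9 = [1, 1, 1, 1, 1, 1, 1, 0]
n = len(a8)

# Observed agreement: 6 of the 8 positions match.
p_o = Fraction(sum(x == y for x, y in zip(a8, a9)), n)
# Chance agreement: (1/8 * 1/8) for label 0 plus (7/8 * 7/8) for label 1.
p_e = sum(Fraction(a8.count(label), n) * Fraction(a9.count(label), n) for label in (0, 1))
kappa = (p_o - p_e) / (1 - p_e)

print(p_o)    # 3/4
print(p_e)    # 25/32
print(kappa)  # -1/7, i.e. about -0.1429
```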

Now that we understand how this function works, we will apply it below for our specific evaluation.

# Results and Assessments

Next, we will define some auxiliary code to deal with lists of URLs from search engines and associated relevance assessments. We will encode result lists like this:

In [13]:
urls = [
    'https://en.wikipedia.org/wiki/Information_retrieval/',  # 1st result
    'http://www.dictionary.com/browse/information',          # 2nd result
    'https://nlp.stanford.edu/IR-book/'                      # ...
]

And we represent corresponding assessments, as above, as lists of the same size containing relevance values:

In [14]:
my_assessment = [1, 0, 1]
another_assessment = [0, 0, 1]
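Because an assessment only makes sense when it lines up with its URL list, a quick sanity check can catch mistakes early. The sketch below is an optional addition; the function name `check_assessment` is hypothetical and not part of the assignment code.

```python
def check_assessment(urls, assessment):
    """Raise ValueError if the assessment does not line up with the URL list."""
    if len(assessment) != len(urls):
        raise ValueError(f"expected {len(urls)} values, got {len(assessment)}")
    bad = [value for value in assessment if value not in (0, 1)]
    if bad:
        raise ValueError(f"relevance values must be 0 or 1, got {bad}")
    return True

urls = [
    'https://en.wikipedia.org/wiki/Information_retrieval/',
    'http://www.dictionary.com/browse/information',
    'https://nlp.stanford.edu/IR-book/',
]
my_assessment = [1, 0, 1]

print(check_assessment(urls, my_assessment))  # True
```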

In order to nicely display URL lists, with or without related assessments, we define a function called `display_results`:

In [15]:
from IPython.display import display, HTML

def display_results(urls, assessment1=None, assessment2=None):
    lines = []
    lines.append('<table>')
    header = '<tr><th>#</th><th>Result URL</th>'
    if (assessment1):
        header += '<th>Assessment 1</th>'
    if (assessment2):
        header += '<th>Assessment 2</th>'
    header += '</tr>'
    lines.append(header)
    i = 0
    for url in urls:
        show_url = url
        if (len(url) > 80):
            show_url = url[:75] + '...'
        line = '<tr><td>{}</td><td><a href="{:s}">{:s}</a></td>'.format(i+1, url, show_url)
        if (assessment1):
            if (assessment1[i] == 0):
                line += '<td><em>Not relevant</em></td>'
            else:
                line += '<td><strong>Relevant</strong></td>'
        if (assessment2):
            if (assessment2[i] == 0):
                line += '<td><em>Not relevant</em></td>'
            else:
                line += '<td><strong>Relevant</strong></td>'
        line += '</tr>'
        lines.append(line)
        i = i+1
    lines.append('</table>')
    display( HTML(''.join(lines)) )

We can use this function to display a list of URLs, optionally together with one or two assessment lists:

In [16]:
print("Just a list of URLs:")
display_results(urls)

print("With one assessment:")
display_results(urls, my_assessment)

print("With two assessments:")
display_results(urls, my_assessment, another_assessment)

Just a list of URLs:


| # | Result URL |
|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ |
| 2 | http://www.dictionary.com/browse/information |
| 3 | https://nlp.stanford.edu/IR-book/ |


With one assessment:


| # | Result URL | Assessment 1 |
|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant |


With two assessments:


| # | Result URL | Assessment 1 | Assessment 2 |
|---|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant | Not relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant | Relevant |


Now we are ready to perform an actual evaluation, which will involve a substantial amount of manual work.

---

# Tasks

**Your name:** Astha Patel

### Task 1

Think up and formulate an information need (for example in the field of Computer Science or Medicine) for which you think the answer can be found in scientific publications. On page 152 in the book an example of such an information need is shown: "Information on whether drinking red wine is more effective at reducing the risk of heart attacks than white wine."

**Answer:** [Information on whether a vegetarian diet lowers the risk of metabolic diseases compared with a non-vegetarian diet.]

Next, write down specifically what documents have to look like to satisfy your information need. For example if your information need is about finding an overview of different cancer types, you could state that a document would need to list at least ten types of cancer to satisfy your information need (among other criteria). Write this down as a protocol with rules and examples. For example, such a protocol could state that at least three out of five given criteria have to be fulfilled for a document to be considered relevant for the information need, and then specify the criteria. Or your protocol could have the form of a sequence of rules, where each rule lets you either label the document as relevant or not relevant, or proceed with the next rule. Such rules and criteria can, for example, be about the general topic of the paper, the concepts mentioned in it, the covered relations between concepts, the type of publication (research paper, overview paper, etc.), the number of references, the types of contained diagrams, and so on, depending on your specified information need.

**Answer:** [Rule 1: The document is credible. For example, it must be a scientific publication with a research methodology that provides empirical conclusions. If this rule is satisfied, proceed to the next rule; otherwise the document is not relevant. Rule 2: The document must cover the health risks of a vegetarian versus a non-vegetarian diet. The health risks covered must include, but are not limited to, how dietary factors influence metabolism in humans. If this rule is satisfied, proceed to the next rule; otherwise the document is not relevant. Rule 3: Documents that report a correlation between diet and the risk of metabolic disease are deemed relevant.]
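A rule sequence like this can also be written down as a small decision function, which makes the order of the checks explicit. This is only a sketch: the three dictionary keys are hypothetical names for judgments that a human assessor still has to make after reading the paper, and labelling a document that fails the last rule as not relevant is an assumption.

```python
def assess(document):
    """Apply the three protocol rules in order; return 1 (relevant) or 0 (not relevant)."""
    # Rule 1: credible scientific publication with empirical conclusions.
    if not document["is_empirical_publication"]:
        return 0
    # Rule 2: covers health risks of vegetarian versus non-vegetarian diets.
    if not document["covers_diet_health_risks"]:
        return 0
    # Rule 3: reports a correlation between diet and metabolic disease risk.
    # Assumption: a document that fails this last rule is labelled not relevant.
    return 1 if document["reports_diet_metabolic_correlation"] else 0

# Hypothetical judgments for one paper, filled in by the assessor.
example = {
    "is_empirical_publication": True,
    "covers_diet_health_risks": True,
    "reports_diet_metabolic_correlation": False,
}
print(assess(example))  # 0
```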

### Task 2

Formulate a keyword query that represents the information need. For the example on page 152 in the book (see above), the example query "wine AND red AND white AND heart AND attack AND effective" is given. (You don't need to use connectors like "AND", but if you do, first make sure that your chosen search engines below actually support them.)

**Answer:** [diet AND vegetarian AND nonvegetarian AND metabolism AND health AND risks AND reduced]

Then submit your query to **two** of the following academic search engines:

- [Google Scholar](https://scholar.google.com) (all science disciplines)
- [Semantic Scholar](https://www.semanticscholar.org) (all science disciplines)
- [PubMed Search](https://www.ncbi.nlm.nih.gov/pubmed) (Life Sciences / biomedicine)

The right choice of two of the three search engines depends on the topic of your information need. If your information need is in the Life Sciences and biomedicine, it's probably best to include PubMed Search, but otherwise you should pick Google Scholar and Semantic Scholar.

Extract a list of the top 10 URLs from the result list of each search engine for the given query. To ensure that your results are reproducible, it is advised to use the private mode of your browser. Try to access the resulting publications. For the publications where that is not possible (because of dead links or because the publication is pay-walled even within the VU network), exclude them from the list and add more publications to the end of your list (that is, append results number 11, then 12, etc. to ensure you have two lists of 10 publications each). In order to deal with paywalls, you should try accessing the articles from the VU network, use [UBVU Off-Campus Access](http://www.ub.vu.nl.vu-nl.idm.oclc.org/nl/faciliteiten/toegang-buiten-de-campus/index.aspx), or try to find the respective documents from alternative sources (Google Scholar, for example, is very good at finding free PDFs of articles). If you get fewer than 10 results for one of the search engines, modify the keyword query above to make it more inclusive, and then redo the steps of this task.
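The replacement procedure described above (skip inaccessible results and continue with result 11, 12, and so on) can be sketched as follows. The function name `top_accessible` and the `example.org` URLs are placeholders made up for this illustration; which results count as inaccessible still has to be determined by hand.

```python
def top_accessible(ranked_urls, inaccessible, k=10):
    """Return the first k URLs from ranked_urls that are not in inaccessible."""
    kept = []
    for url in ranked_urls:
        if url in inaccessible:
            continue  # dead link or paywalled: skip it and take the next result instead
        kept.append(url)
        if len(kept) == k:
            break
    return kept

# Placeholder ranking of 12 results, two of which could not be accessed.
ranking = [f"https://example.org/paper-{i}" for i in range(1, 13)]
blocked = {"https://example.org/paper-3", "https://example.org/paper-7"}

top10 = top_accessible(ranking, blocked)
print(len(top10))  # 10
print(top10[-1])   # https://example.org/paper-12
```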

Store your two lists of URLs in the form of Python lists as introduced above. Then, use the `display_results` function to nicely display them.

In [22]:
# Create two of the lists below, depending on your chosen engines:

#urls_semantic = ...
urls_google = ['https://vu-on-worldcat-org.vu-nl.idm.oclc.org/v2/search/detail/8137955097?queryString=Vegetarian%20Dietary%20Patterns%20Are%20Associated%20With%20a%20Lower%20Risk%20of%20Metabolic%20Syndrome&clusterResults=false&groupVariantRecords=false&stickyFacetsChecked=false'
               ,'https://www.researchgate.net/profile/Johanna-Dwyer-2/publication/20110585_Health_aspects_of_vegetarian_diets/links/00b49516c0dc1838a2000000/Health-aspects-of-vegetarian-diets.pdf'
               ,'https://www.cambridge.org/core/services/aop-cambridge-core/content/view/2905C5A7D1CAD779D8D33196A4641CEF/S0007114515002937a.pdf/cross-sectional-and-longitudinal-comparisons-of-metabolic-profiles-between-vegetarian-and-non-vegetarian-subjects-a-matched-cohort-study.pdf'
               ,'https://www.cambridge.org/core/services/aop-cambridge-core/content/view/590B63B52A149CB11FBF9C901499DCED/S0007114514004139a.pdf/div-class-title-a-perspective-on-vegetarian-dietary-patterns-and-risk-of-metabolic-syndrome-div.pdf'
               ,'https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML'
               ,'https://academic.oup.com/ajcn/article/100/suppl_1/353S/4576455?login=true'
               ,'https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S2212267213011131'
               ,'https://academic.oup.com/ajcn/article/70/3/570s/4715021?login=true'
               ,'https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S0033062018300872'
               ,'https://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.75.5.507']
urls_pubmed = ['https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S0033062018300872'
               ,'https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005'
               ,'https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005'
               ,'https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML'
               ,'https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S0895706199000576'
               ,'https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S0749379715000732'
               ,'https://journals-plos-org.vu-nl.idm.oclc.org/plosone/article?id=10.1371/journal.pone.0071799'
               ,'https://www-cambridge-org.vu-nl.idm.oclc.org/core/journals/british-journal-of-nutrition/article/effect-of-vegetarian-diet-on-skin-autofluorescence-measurements-in-haemodialysis-patients/47B72DF44B570A24A034B014211A8898'
               ,'https://digital.csic.es/bitstream/10261/74531/1/accesoRestringido.pdf'
               ,'https://www.researchgate.net/publication/225296805_Hypotensive_hypoglycaemic_and_antioxidant_effects_of_consuming_a_cocoa_product_in_moderately_hypercholesterolemic_humans']
display_results(urls_google)
display_results(urls_pubmed)

| # | Result URL |
|---|---|
| 1 | https://vu-on-worldcat-org.vu-nl.idm.oclc.org/v2/search/detail/8137955097?q... |
| 2 | https://www.researchgate.net/profile/Johanna-Dwyer-2/publication/20110585_H... |
| 3 | https://www.cambridge.org/core/services/aop-cambridge-core/content/view/290... |
| 4 | https://www.cambridge.org/core/services/aop-cambridge-core/content/view/590... |
| 5 | https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML |
| 6 | https://academic.oup.com/ajcn/article/100/suppl_1/353S/4576455?login=true |
| 7 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S22122... |
| 8 | https://academic.oup.com/ajcn/article/70/3/570s/4715021?login=true |
| 9 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S00330... |
| 10 | https://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.75.5.507 |


| # | Result URL |
|---|---|
| 1 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S00330... |
| 2 | https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005 |
| 3 | https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005 |
| 4 | https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML |
| 5 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S08957... |
| 6 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S07493... |
| 7 | https://journals-plos-org.vu-nl.idm.oclc.org/plosone/article?id=10.1371/jou... |
| 8 | https://www-cambridge-org.vu-nl.idm.oclc.org/core/journals/british-journal-... |
| 9 | https://digital.csic.es/bitstream/10261/74531/1/accesoRestringido.pdf |
| 10 | https://www.researchgate.net/publication/225296805_Hypotensive_hypoglycaemi... |


### Task 3

Then, find a fellow student who will **independently**
assess the results as "relevant" or "not relevant" using the protocol that you
have defined above, and also help (at least) one other student for his/her
assessment. Write down their names here:

**Name of the student who assesses my results:** Myrte Kuipers

**Name of the student who I help to assess his/her results:** Myrte Kuipers

Show to the other assessor everything you have written down above for Tasks 1 and 2 (and you might also want to give him/her the PDFs you got for these papers to simplify the process).

You as assessors need to stick to the protocol you made in Task 1 and should not discuss with each other, especially when you doubt whether a result is relevant or not. Write down your assessments as lists of relevance values, as introduced above, and make sure they correctly map to the URLs by displaying them together with the `display_results` function.

To avoid problems with extreme results, mark in each list at least one paper as 'relevant' and at least one paper as 'not relevant'. That is, if all papers seem relevant, mark the one that seems least relevant 'not relevant', and conversely, if none of the papers seem relevant, mark the one that seems a bit more relevant than the others as 'relevant'.

In [27]:
# 0 = not relevant; 1 = relevant

# You only need to create 4 of the following 6 lists, again depending on which search engines you chose.

# Assessment 1 is from you:


assessment1_google = [1,1,1,1,1,1,0,0,1,1]
assessment1_pubmed = [1,0,0,1,0,1,1,0,0,0]


# Assessment 2 is from your fellow student (don't show him/her your own assessment!):

assessment2_google = [1,1,1,1,1,1,0,0,1,1]
assessment2_pubmed = [1,0,0,1,0,1,1,0,1,0]

# Call display_results here
display_results(urls_google, assessment1_google)
display_results(urls_pubmed, assessment1_pubmed)

display_results(urls_google, assessment2_google)
display_results(urls_pubmed, assessment2_pubmed)

| # | Result URL | Assessment 1 |
|---|---|---|
| 1 | https://vu-on-worldcat-org.vu-nl.idm.oclc.org/v2/search/detail/8137955097?q... | Relevant |
| 2 | https://www.researchgate.net/profile/Johanna-Dwyer-2/publication/20110585_H... | Relevant |
| 3 | https://www.cambridge.org/core/services/aop-cambridge-core/content/view/290... | Relevant |
| 4 | https://www.cambridge.org/core/services/aop-cambridge-core/content/view/590... | Relevant |
| 5 | https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML | Relevant |
| 6 | https://academic.oup.com/ajcn/article/100/suppl_1/353S/4576455?login=true | Relevant |
| 7 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S22122... | Not relevant |
| 8 | https://academic.oup.com/ajcn/article/70/3/570s/4715021?login=true | Not relevant |
| 9 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S00330... | Relevant |
| 10 | https://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.75.5.507 | Relevant |


| # | Result URL | Assessment 1 |
|---|---|---|
| 1 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S00330... | Relevant |
| 2 | https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005 | Not relevant |
| 3 | https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005 | Not relevant |
| 4 | https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML | Relevant |
| 5 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S08957... | Not relevant |
| 6 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S07493... | Relevant |
| 7 | https://journals-plos-org.vu-nl.idm.oclc.org/plosone/article?id=10.1371/jou... | Relevant |
| 8 | https://www-cambridge-org.vu-nl.idm.oclc.org/core/journals/british-journal-... | Not relevant |
| 9 | https://digital.csic.es/bitstream/10261/74531/1/accesoRestringido.pdf | Not relevant |
| 10 | https://www.researchgate.net/publication/225296805_Hypotensive_hypoglycaemi... | Not relevant |


| # | Result URL | Assessment 1 |
|---|---|---|
| 1 | https://vu-on-worldcat-org.vu-nl.idm.oclc.org/v2/search/detail/8137955097?q... | Relevant |
| 2 | https://www.researchgate.net/profile/Johanna-Dwyer-2/publication/20110585_H... | Relevant |
| 3 | https://www.cambridge.org/core/services/aop-cambridge-core/content/view/290... | Relevant |
| 4 | https://www.cambridge.org/core/services/aop-cambridge-core/content/view/590... | Relevant |
| 5 | https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML | Relevant |
| 6 | https://academic.oup.com/ajcn/article/100/suppl_1/353S/4576455?login=true | Relevant |
| 7 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S22122... | Not relevant |
| 8 | https://academic.oup.com/ajcn/article/70/3/570s/4715021?login=true | Not relevant |
| 9 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S00330... | Relevant |
| 10 | https://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.75.5.507 | Relevant |


| # | Result URL | Assessment 1 |
|---|---|---|
| 1 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S00330... | Relevant |
| 2 | https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005 | Not relevant |
| 3 | https://academic-oup-com.vu-nl.idm.oclc.org/ajcn/article/78/3/633S/4690005 | Not relevant |
| 4 | https://oce-ovid-com.vu-nl.idm.oclc.org/article/00004872-201611000-00011/HTML | Relevant |
| 5 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S08957... | Not relevant |
| 6 | https://www-sciencedirect-com.vu-nl.idm.oclc.org/science/article/pii/S07493... | Relevant |
| 7 | https://journals-plos-org.vu-nl.idm.oclc.org/plosone/article?id=10.1371/jou... | Relevant |
| 8 | https://www-cambridge-org.vu-nl.idm.oclc.org/core/journals/british-journal-... | Not relevant |
| 9 | https://digital.csic.es/bitstream/10261/74531/1/accesoRestringido.pdf | Relevant |
| 10 | https://www.researchgate.net/publication/225296805_Hypotensive_hypoglycaemi... | Not relevant |


### Task 4

Compute Cohen's kappa to quantify how much the two assessors agreed. Use the function `cohen_kappa_score` demonstrated above to calculate two times the inter-annotator agreement (once for each of the two search engines), and print out the results.

In [28]:
# Add your code here:

kappa_google = cohen_kappa_score(assessment1_google,assessment2_google)
kappa_pubmed = cohen_kappa_score(assessment1_pubmed,assessment2_pubmed)

print("Kappa for Google Scholar:", kappa_google)
print("Kappa for PubMed:", kappa_pubmed)

Kappa for Google Scholar: 1.0
Kappa for PubMed: 0.8


Explain whether the agreement can be considered high or not, based on the interpretation table on [this Wikipedia page](https://en.wikipedia.org/wiki/Fleiss'_kappa#Interpretation) (this Wikipedia page is about a different type of kappa but the interpretation table can also be used for Cohen's kappa).

**Answer:** [The Google Scholar score of 1.0 clearly indicates almost perfect agreement according to the linked Wikipedia page. The PubMed score of 0.8 falls 0.01 short of the almost perfect range (0.81 to 1.00), so it is interpreted as substantial agreement.]
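The lookup in the interpretation table can be sketched as a small function. The bands follow the table on the linked Wikipedia page; the function name `interpret_kappa`, the rounding to two decimals, and treating a score of exactly 0 as "slight" are assumptions made for this illustration.

```python
def interpret_kappa(kappa):
    """Map a kappa score to the agreement label from the interpretation table."""
    kappa = round(kappa, 2)  # the table is stated with two decimals
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(interpret_kappa(1.0))  # almost perfect
print(interpret_kappa(0.8))  # substantial
```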

### Task 5

Define a function called `precision_at_n` that calculates Precision@n as described in the lecture slides, which takes as input an assessment list and a value for _n_ and returns the respective Precision@n value. Run this function to calculate Precision@10 (that is, n=10) on all four assessments (two assessors and two search engines).

In [32]:
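# Add your code here:

def precision_at_n(assessed_list, n):
    # Count the relevant documents (1s) among the first n results only.
    relevant = sum(assessed_list[:n])
    return relevant / n

# Print out Precision@10 for all assessments here.
print(precision_at_n(assessment1_google, 10))
print(precision_at_n(assessment1_pubmed, 10))
print(precision_at_n(assessment2_google, 10))
print(precision_at_n(assessment2_pubmed, 10))

0.8
0.4
0.8
0.5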


Explain what these specific Precision@10 results tell us (or don't tell us) about the quality of the two search engines for your particular information need. You can also refer to the results of Task 4 if necessary.

**Answer:** [Precision@10 is the number of relevant results among the top 10 results divided by 10. By my own assessment, 80% of the top 10 results returned by Google Scholar were relevant, compared with 40% for PubMed. The share of true positives was therefore much higher for Google Scholar than for PubMed. For the topic in question, Google Scholar surprisingly outperformed PubMed and has a better chance of finding information on metabolic health and vegetarian versus non-vegetarian diets.]
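Precision@n only looks at the first n entries of a ranked list, so the value depends on the chosen cutoff. The sketch below illustrates this for the first assessor's two lists; the helper name `precision_at` is hypothetical and chosen to avoid clashing with the function defined in the notebook.

```python
def precision_at(assessment, n):
    """Fraction of relevant results (1s) among the first n entries of a ranked list."""
    return sum(assessment[:n]) / n

assessment1_google = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
assessment1_pubmed = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

for n in (5, 10):
    print(n, precision_at(assessment1_google, n), precision_at(assessment1_pubmed, n))
# 5 1.0 0.4
# 10 0.8 0.4
```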

# Submission

Submit the answers to the assignment via Canvas as a modified version of this Notebook file (file with `.ipynb` extension) that includes your code and your answers.

Before submitting, restart the kernel and re-run the complete code (**Kernel > Restart & Run All**), and then check whether your assignment code still works as expected.

Don't forget to add your name, and remember that the assignments have to be done **individually**, and that code sharing or copying are **strictly forbidden** and will be punished.