## Install libraries

Install the libraries that are not available by default in the conda_python3 kernel of the SageMaker notebook used for this tutorial. If you are running this tutorial locally, you may need to install additional libraries such as boto3 and matplotlib.

In [None]:
!pip install opensearch-py datasets

## Download Dataset

We use the Cranfield dataset, a small corpus of 1,400 scientific abstracts. It is a slightly modified version of the dataset available at https://ir-datasets.com/cranfield.html: we renumbered the queries' "query_id" field to take the values 1-225 so that it matches the "query_id" field in the qrels.

In [None]:
from measureRelevance import download_s3_file

bucket_name = 'aws-blogs-artifacts-public'
s3_folder = 'BDB-4664/'  # Folder (prefix) in S3, ensure it ends with a '/'

# Documents
s3_docs_key = 'BDB-4664/docs.zip'
# Queries
s3_queries_key = 'BDB-4664/queries.zip'
# Relevance judgments (qrels)
s3_qrels_key = 'BDB-4664/qrels.zip'

# Download files
download_s3_file(bucket_name, s3_docs_key)
download_s3_file(bucket_name, s3_queries_key)
download_s3_file(bucket_name, s3_qrels_key)

In [None]:
!unzip -o docs
!unzip -o queries
!unzip -o qrels

In [None]:
from datasets import load_from_disk

docs = load_from_disk("docs")
queries = load_from_disk("queries")
qrels = load_from_disk("qrels")

## Load OS Clients

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

# Create OS client
domain_name = 'measure-relevance-os-domain'
region = 'us-east-1'

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region)

# Look up the domain endpoint in the same region used for request signing
boto3_client_os = boto3.client('opensearch', region_name=region)
domain_host = boto3_client_os.describe_domain(DomainName=domain_name)['DomainStatus']['Endpoint']


os_client = OpenSearch(
            hosts=[{'host': domain_host, 'port': 443}],
            http_auth=auth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection
        )

## Ingest documents to OpenSearch

We ingest the Cranfield documents using the parallel version of the bulk helper of the OpenSearch Python client. 

In [None]:
from opensearchpy.helpers import parallel_bulk

index_name = 'cranfield'


def _generate_data():
    for doc in docs:
        yield {"_index": index_name, "_id": doc['doc_id'], "title": doc['title'],
               "author": doc['author'], "bib": doc['bib'], "text": doc['text']}


succeeded = []
failed = []
for success, item in parallel_bulk(os_client, actions=_generate_data()):
    if success:
        succeeded.append(item)
    else:
        failed.append(item)

if len(failed) > 0:
    print(f"There were {len(failed)} errors:")
    for item in failed:
        print(item["index"]["error"])

if len(succeeded) > 0:
    print(f"Bulk-inserted {len(succeeded)} items (parallel_bulk).")

## Query the documents

We run the 225 queries using the Search API operation, which executes a search request against the indexed documents. For our examples, we do a full-text search using the match query with its default parameters. We limit each response to the 100 top-ranked documents.
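
As a rough sketch of what such a request might look like, the snippet below builds a match query body limited to 100 hits. The helper name `build_match_query` and the choice of the `text` field are assumptions made for illustration; the `list_of_docs` helper used in the next cells may build its request differently.

```python
def build_match_query(query_text, field="text", size=100):
    """Build a Search API body for a full-text match query (hypothetical helper)."""
    return {
        "size": size,  # return at most `size` ranked hits
        "query": {"match": {field: query_text}},
    }


body = build_match_query("boundary layer flow")

# With a live client, the body could be sent like this (not executed here):
# response = os_client.search(index="cranfield", body=body)
# retrieved_ids = [hit["_id"] for hit in response["hits"]["hits"]]
```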

In [None]:
from collections import defaultdict

relevant_docs = defaultdict(list)

for qrel in qrels:
    relevant_docs[qrel['query_id']].append(qrel['doc_id'])

In [None]:
from measureRelevance import list_of_docs

# Retrieve the top-ranked documents for every query
no_of_top = 100
list_of_documents = list_of_docs(no_of_top, queries, os_client)

## Calculate recall, precision and F1

Recall: Fraction of relevant documents that are retrieved

- Recall = #(relevant items retrieved)/#(relevant items) = P(retrieved|relevant)

Precision: Fraction of retrieved documents that are relevant

- Precision = #(relevant items retrieved)/#(retrieved items) = P(relevant|retrieved)

F1: Harmonic mean of precision and recall

- F1 = 2 * (Precision * Recall)/(Precision + Recall)
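
The three definitions above can be sketched for a single query as follows. This is a minimal illustration, not the implementation in `measureRelevance`: the function name, the toy document IDs, and the choice to divide precision by the number of documents actually returned in the top K are all assumptions.

```python
def precision_recall_f1_at_k(retrieved, relevant, k):
    """Compute precision, recall and F1 over the top-k retrieved documents."""
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)

    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of the 5 retrieved documents are relevant, out of 4 relevant overall
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d1", "d3", "d9", "d10"]
p, r, f1 = precision_recall_f1_at_k(retrieved, relevant, k=5)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case precision is 2/5 and recall is 2/4, so F1 falls between the two.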

We start by calculating recall, precision and F1 with K=10. These values are useful on their own, but the ultimate goal is to find the K that maximizes these metrics.

In [None]:
from measureRelevance import calculate_precision, calculate_recall, calculate_f1
import statistics

top = [10, 50, 100, 500]
no_of_top = top[0]  # K=10

precision = calculate_precision(no_of_top, list_of_documents, relevant_docs)
recall = calculate_recall(no_of_top, list_of_documents, relevant_docs)
f1 = calculate_f1(precision, recall)

print("Mean Precision:{}".format(statistics.mean(precision)))
print("Mean Recall:{}".format(statistics.mean(recall)))
print("Mean F1:{}".format(statistics.mean(f1)))

## Plot recall, precision and F1 @K

We now calculate recall, precision and F1 across different K values, in this case from K=1 to K=30, to find the value of K that maximizes each metric.

In [None]:
from measureRelevance import plot_recall_vs_k, plot_precision_vs_k, plot_f1_vs_k, plot_precision_recall_f1_vs_k

# Set the maximum value of K
max_k = 30

Recall answers the question: "Of all the relevant documents in the dataset, how many did the system find?" To maximize recall, you would want your system to retrieve as many relevant documents as possible. This happens as you increase the number of documents retrieved. 

In [None]:
# Call the function to plot recall vs K
plot_recall_vs_k(max_k, list_of_documents, relevant_docs)

Precision answers the question: "Of the documents retrieved, how many were actually relevant?" To maximize precision, you would want your system to retrieve only the most relevant documents and avoid retrieving irrelevant ones. As you increase the number of documents retrieved, precision tends to drop because more non-relevant documents appear among the results.

In [None]:
# Call the function to plot precision vs K
plot_precision_vs_k(max_k, list_of_documents, relevant_docs)

The F1 score is designed to combine precision and recall into a single metric. It is particularly useful when both metrics need to be balanced and neither can be prioritized over the other.

In [None]:
# Call the function to plot F1 score vs K
plot_f1_vs_k(max_k, list_of_documents, relevant_docs)

Let's plot recall, precision and F1 together. We can see how recall increases as K increases while precision decreases. The F1 score takes both recall and precision into account to find the K value that best balances the two metrics.
- K=6 achieves the highest F1 
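
One way to locate such a K numerically, rather than reading it off the plot, is to sweep K and keep the value with the highest mean F1 across queries. The sketch below assumes a hypothetical layout in which `runs` maps each query ID to its ranked document IDs and `relevant_by_query` maps each query ID to its relevant document IDs; the structures returned by `measureRelevance` may differ.

```python
from statistics import mean


def f1_at_k(retrieved, relevant, k):
    """F1 over the top-k retrieved documents for one query."""
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant_set)
    return 2 * precision * recall / (precision + recall)


def best_k_by_mean_f1(runs, relevant_by_query, max_k):
    """Return the K in 1..max_k with the highest mean F1, plus all mean scores."""
    scores = {}
    for k in range(1, max_k + 1):
        per_query = [f1_at_k(runs[q], relevant_by_query.get(q, []), k) for q in runs]
        scores[k] = mean(per_query)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data with two queries
runs = {1: ["a", "b", "c", "d"], 2: ["x", "y", "z", "w"]}
relevant_by_query = {1: ["a", "b"], 2: ["x", "z"]}
best_k, scores = best_k_by_mean_f1(runs, relevant_by_query, max_k=4)
print(best_k, round(scores[best_k], 3))
```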

In [None]:
# Call the function to plot Precision, Recall, and F1 vs K
plot_precision_recall_f1_vs_k(max_k, list_of_documents, relevant_docs)

## Mean Average Precision (MAP)

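MAP averages the Average Precision (AP) scores across all queries, considering only the top K retrieved documents for each query. The AP of a single query is based on the precision values at each rank where a relevant document appears.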
- K=2 achieves the highest MAP
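
A minimal sketch of AP@K and MAP@K is shown below. Several normalizations of AP@K are in use; this version divides by min(K, number of relevant documents), which is an assumption and may not match what `plot_map_vs_k` computes. The function names and toy data are illustrative only.

```python
def average_precision_at_k(retrieved, relevant, k):
    """AP@K: sum of precision at each relevant rank, normalized by min(k, |relevant|)."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    score = 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant_set:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(k, len(relevant_set))


def mean_average_precision_at_k(runs, relevant_by_query, k):
    """MAP@K: mean of AP@K over all queries in `runs`."""
    ap_scores = [
        average_precision_at_k(runs[q], relevant_by_query.get(q, []), k) for q in runs
    ]
    return sum(ap_scores) / len(ap_scores) if ap_scores else 0.0


# Toy data: query 1 has relevant hits at ranks 1 and 3, query 2 at rank 2
runs = {1: ["d1", "d2", "d3", "d4"], 2: ["a", "b"]}
relevant_by_query = {1: ["d1", "d3"], 2: ["b"]}
map_at_4 = mean_average_precision_at_k(runs, relevant_by_query, k=4)
print(round(map_at_4, 4))
```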

In [None]:
from measureRelevance import plot_map_vs_k

# Set the maximum value of K
max_k = 30

# Call the function to plot MAP@K vs K
plot_map_vs_k(max_k, list_of_documents, relevant_docs)

## Normalized Discounted Cumulative Gain (NDCG)



NDCG uses graded relevance instead of a binary relevant/non-relevant label. The relK variable, the relevance of the document at rank K, is a grade where *0* is the least relevant and higher values are more relevant. We take it from the qrels: relK = qrel['relevance']

* In our case, -1 means non-relevant, so we don't use it as relK
* The remaining grades 1, 2, 3, 4 are remapped to 0, 1, 2, 3.
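
The remapping described in the two points above could look like the sketch below. It assumes the remapping is a plain shift by one and that non-relevant judgments are dropped; the helper name `to_gain` is hypothetical and not part of `measureRelevance`.

```python
def to_gain(relevance):
    """Map a raw qrels relevance label to a zero-based grade, or None if non-relevant."""
    if relevance < 1:  # -1 marks a non-relevant judgment and is skipped
        return None
    return relevance - 1  # 1, 2, 3, 4 -> 0, 1, 2, 3


raw_labels = [-1, 1, 2, 3, 4]
gains = [to_gain(label) for label in raw_labels if to_gain(label) is not None]
print(gains)
```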

We start by calculating the Discounted Cumulative Gain (DCG) and the Ideal DCG (IDCG). The IDCG represents the best possible DCG, in which the K documents retrieved are ordered by decreasing relevance. IDCG is used to normalize the DCG results.
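
The relationship between DCG, IDCG and NDCG can be sketched as follows. This illustration assumes a linear gain (the relevance grade itself) with a log2 rank discount, and it builds the ideal ordering from the K retrieved documents as described above. Other formulations use an exponential gain or build the ideal ordering from all judged documents, and `plot_ndcg_dcg_idcg` may differ from this sketch.

```python
import math


def dcg_at_k(gains, k):
    """DCG@K: sum of gain / log2(rank + 1) over the first k positions."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """NDCG@K: DCG@K divided by the DCG of the same top-k gains sorted in descending order."""
    ideal_gains = sorted(gains[:k], reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Relevance grades of the retrieved documents, in ranked order
ranked_gains = [3, 0, 2, 1]
print(round(dcg_at_k(ranked_gains, 4), 4), round(ndcg_at_k(ranked_gains, 4), 4))
```

A ranking that is already sorted by decreasing relevance gets an NDCG of 1, which is why the value is convenient for comparing queries with different numbers of relevant documents.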

When plotting the results, we achieved the highest NDCG with K=7.

In [None]:
from measureRelevance import plot_ndcg_dcg_idcg

query_id = 1
max_k = 30

plot_ndcg_dcg_idcg(list_of_documents, query_id, qrels, max_k)
