# Chat Bot task

We are an online learning platform.  
In the support chat, students ask questions, most of which can be answered by sending a link to a video that contains the answer. Questions and answers are often repeated. To automate this process, we decided to create a chat bot that replies to students' questions with a suitable video.  
How can we evaluate how relevant the videos sent by this chat bot are?

We have developed 2 versions of the chat bot for this task,  
so now we need to evaluate them and choose one.

How can we tell which version of the chat bot is better?

## Cumulative Gain (CG)

To understand which model is better at ranking videos, we will sum the relevance values of its recommendations.  

Suppose we took the top 3 videos for some query:  
- the 1st model had the following relevance marks: [0.99, 0.91, 0.83]  
- the 2nd model had the following relevance marks: [0.99, 0.94, 0.88]  

$$CG@k = \sum_{i=1}^{k} rel_i,$$  
where $rel_i$ is the relevance of the i-th item in the result and k is the number of results considered.

In [1]:
import numpy as np
first_marks = [0.99, 0.91, 0.83]
second_marks = [0.99, 0.94, 0.88]
cg1 = np.sum(first_marks)
cg2 = np.sum(second_marks)
print(f'CG1 = {cg1}')
print(f'CG2 = {cg2}')

CG1 = 2.73
CG2 = 2.81


### Number of objects in the result

The @k suffix appears in many metrics for recommendation systems.  
Real data usually contains hundreds to millions of items (goods in an online shop, etc.).  
A recommendation system can rank all of them, but in practice we usually need to evaluate only the first 5 or 20 objects.

## Formatting CG as a function:  

In [2]:
relevance = [0.99, 0.94, 0.88, 0.74, 0.71, 0.68]
k = 5
relevance_considered = relevance[:k]
score = np.sum(relevance_considered)
score

4.26

In [3]:
from typing import List

import numpy as np


def cumulative_gain(relevance: List[float], k: int) -> float:
    """Score is cumulative gain at k (CG@k)

    Parameters
    ----------
    relevance:  `List[float]`
        Relevance labels (Ranks)
    k : `int`
        Number of elements to be counted

    Returns
    -------
    score : float
    """
    relevance_considered = relevance[:k]
    score = np.sum(relevance_considered)
    return score

# Penalty for position in the output

We now have a useful, easily interpreted metric: we take the top N recommendations, sum their relevances, and compare different models by that sum.

However, on further investigation we realized that the most relevant object can be in 1st or 5th place without affecting the metric (in other words, we would consider two such models equal in performance).  
So how can we take the position of an object into account?
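The problem can be reproduced in a few lines. This is a minimal sketch with hypothetical relevance scores (not taken from either chat bot): the same three videos are returned in two different orders, and CG comes out the same.

```python
import numpy as np

# Hypothetical relevance scores: the same three videos in two different orders.
best_first = [0.99, 0.50, 0.30]
best_last = [0.30, 0.50, 0.99]

cg_best_first = np.sum(best_first)
cg_best_last = np.sum(best_last)

# A sum does not depend on the order of its terms,
# so CG cannot tell these two rankings apart.
print(np.isclose(cg_best_first, cg_best_last))  # True
```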

#### Statistics in search engines:  
Google says that 50% of users finish searching after the 1st or 2nd link.  
91% of users do not go further than the 1st page of the results.

This means that if the most relevant document is in the response but is not placed 1st or 2nd, there is a high probability that the user will not see it.  
Therefore, we need to penalize models for placing highly relevant documents lower in the order.

# Discounted Cumulative Gain  
We penalize the relevance of a document that sits near the end of the list or, equivalently, give more weight to a document near the beginning of the list:  
![](./pic/2.png)
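To get a feeling for the size of the penalty, the weights can be printed per position. This sketch assumes the discount used in the code cells of this notebook, 1 / log2(position + 1), with 1-based positions; the variable names are illustrative.

```python
import numpy as np

# Positions are 1-based: the first result has position 1.
positions = np.arange(1, 6)

# Weight applied to the relevance at each position
# (the discount used in this notebook's DCG cells).
discounts = 1 / np.log2(positions + 1)

for position, discount in zip(positions, discounts):
    print(f"position = {position}, weight = {discount:.4f}")
```

The first position keeps its full relevance (weight 1.0), while the third position already counts only half (weight 0.5).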

Let's consider the following results of 2 models:

In [4]:
first_marks = [0.99, 0.94, 0.88]
second_marks = [0.99, 0.83, 0.89]

dcg_array_1 = []
dcg_array_2 = []
for order in range(1, len(first_marks)+1):
    i = order - 1
    print(f'order = {order}, relevance_i = {first_marks[i]}')
    dcg_iter = first_marks[i] / np.log2(order+1)
    print(f'dcg_iter = {dcg_iter}')
    dcg_array_1.append(dcg_iter)

for order in range(1, len(second_marks)+1):
    i = order - 1
    dcg_iter = second_marks[i] / np.log2(order+1)
    dcg_array_2.append(dcg_iter)

dcg1 = np.sum(dcg_array_1)
dcg2 = np.sum(dcg_array_2)
print(f'dcg1 = {dcg1}')
print(f'dcg2 = {dcg2}')

order = 1, relevance_i = 0.99
dcg_iter = 0.99
order = 2, relevance_i = 0.94
dcg_iter = 0.59307396835717
order = 3, relevance_i = 0.88
dcg_iter = 0.44
dcg1 = 2.02307396835717
dcg2 = 1.9586716954643097


There are 2 ways to calculate DCG. Both divide by the same position penalty, log2(order + 1); the difference is in the numerator:
1. Standard - the relevance is used as is: relevance / log2(order + 1)
2. Industry - the relevance is transformed exponentially: (2^relevance - 1) / log2(order + 1). This amplifies differences between relevance values close to 1, so outputs with many such values are told apart more sharply than with the Standard method  
![](./pic/3.png)
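Written out in the notation assumed here ($rel_i$ is the relevance at 1-based position $i$, matching `order` in the code cells), the two variants implemented in this notebook are:

$$DCG_{standard}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}$$

$$DCG_{industry}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

For $rel_i = 0$ and $rel_i = 1$ both numerators agree (0 and 1 respectively); in between, the industry numerator is smaller and grows faster as $rel_i$ approaches 1.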

Industry function:

In [5]:
# industry function:
first_marks = [0.99, 0.95, 0.8, 0.98, 0.97]
second_marks = [0.8, 0.99, 0.95, 0.98, 0.97]

dcg_array_1 = []
dcg_array_2 = []
for order in range(1, len(first_marks)+1):
    i = order - 1
    relevance = first_marks[i]
    dcg_i = (2**relevance - 1) / np.log2(order+1)
    dcg_array_1.append(dcg_i)

for order in range(1, len(second_marks)+1):
    i = order - 1
    relevance = second_marks[i]
    dcg_i = (2**relevance - 1) / np.log2(order+1)
    dcg_array_2.append(dcg_i)

print(np.sum(dcg_array_1))
print(np.sum(dcg_array_2))

2.7344299716685585
2.6189991399064203


In [6]:
# standard function:
first_marks = [0.99, 0.95, 0.8, 0.98, 0.97]
second_marks = [0.8, 0.99, 0.95, 0.98, 0.97]

dcg_array_1 = []
dcg_array_2 = []
for order in range(1, len(first_marks)+1):
    i = order - 1
    relevance = first_marks[i]
    dcg_i = relevance / np.log2(order+1)
    dcg_array_1.append(dcg_i)

for order in range(1, len(second_marks)+1):
    i = order - 1
    relevance = second_marks[i]
    dcg_i = relevance / np.log2(order+1)
    dcg_array_2.append(dcg_i)

print(np.sum(dcg_array_1))
print(np.sum(dcg_array_2))

2.786693515822315
2.6969307059651735


In [7]:
# DCG python function:
from typing import List

import numpy as np


def discounted_cumulative_gain(relevance: List[float], k: int, method: str = "standard") -> float:
    """Discounted Cumulative Gain

    Parameters
    ----------
    relevance : `List[float]`
        Video relevance list
    k : `int`
        Count relevance to compute
    method : `str`, optional
        Metric implementation method, takes the values
        `standard` - adds weight to the denominator
        `industry` - adds weights to the numerator and denominator
        `raise ValueError` - for any other value

    Returns
    -------
    score : `float`
        Metric score
    """
    score = 0
    relevance_considered = relevance[:k]
    for order in range(1, len(relevance_considered)+1):
        i = order - 1
        if method == "standard":
            dcg_i = relevance_considered[i] / np.log2(order+1)
        elif method == "industry":
            dcg_i = (2**relevance_considered[i] - 1) / np.log2(order+1)
        else:
            # the docstring promises an error for unknown methods
            raise ValueError(f"Unknown method: {method!r}")
        score = score + dcg_i

    return score

In [8]:
# standard function:
first_marks = [0.99, 0.95, 0.8, 0.98, 0.97]
second_marks = [0.8, 0.99, 0.95, 0.98, 0.97]

score = 0
for order in range(1, len(first_marks)+1):
    i = order - 1
    relevance = first_marks[i]
    dcg_i = relevance / np.log2(order+1)
    score = score + dcg_i
score

2.786693515822315

In [9]:
discounted_cumulative_gain(relevance = first_marks, k=5, method="standard")

2.786693515822315
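The explicit loop can also be written with NumPy array operations. The following is only a sketch of one possible alternative, not a replacement for the function above; `dcg_vectorized` is a hypothetical name introduced for illustration.

```python
from typing import List

import numpy as np


def dcg_vectorized(relevance: List[float], k: int, method: str = "standard") -> float:
    """Vectorised DCG@k (hypothetical helper, same formulas as the loop version)."""
    rel = np.asarray(relevance[:k], dtype=float)
    # 1-based positions 1..len(rel) give the penalties log2(2), log2(3), ...
    penalties = np.log2(np.arange(2, rel.size + 2))
    if method == "standard":
        gains = rel
    elif method == "industry":
        gains = 2**rel - 1
    else:
        raise ValueError(f"Unknown method: {method!r}")
    return float(np.sum(gains / penalties))


first_marks = [0.99, 0.95, 0.8, 0.98, 0.97]
print(dcg_vectorized(first_marks, k=5, method="standard"))  # approximately 2.7867
```

The result should agree with the loop version up to floating-point rounding.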

# Normalized Discounted Cumulative Gain

DCG@k is not a normalized metric, which makes comparing models with each other harder.  
For example, suppose we compare 2 models on different queries:   
in the first query the relevances can be close to 1, while in the second they are close to 0.   
This may happen if the search database contains documents relevant to the first query but none relevant to the second.

Sometimes we need to average results over queries, but we cannot add bananas to apples. That is why the scores should first be brought to the same scale (normalized). 
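The scale problem can be illustrated with two hypothetical queries whose results are both in the best possible order. The relevance values and the compact `dcg` helper below are assumptions made for this sketch only.

```python
import numpy as np


def dcg(relevance, k):
    """Standard DCG@k, re-implemented here so the sketch is self-contained."""
    rel = np.asarray(relevance[:k], dtype=float)
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))


# Hypothetical queries: both lists are already sorted by decreasing relevance.
well_covered_query = [0.95, 0.90, 0.85]    # relevant videos exist
poorly_covered_query = [0.15, 0.10, 0.05]  # hardly any relevant videos exist

dcg_well = dcg(well_covered_query, k=3)
dcg_poor = dcg(poorly_covered_query, k=3)
print(f"well covered:   DCG@3 = {dcg_well:.3f}")
print(f"poorly covered: DCG@3 = {dcg_poor:.3f}")
```

Both rankings are as good as they can be, yet their raw DCG values differ by roughly a factor of eight, so a plain average over queries would mostly reflect the well-covered one.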

### Ideal DCG:
One simple normalization method is to divide by the maximum value.   
What if we calculate DCG with the documents sorted in advance by decreasing relevance? That is the maximum possible DCG for a specific query with the specified number of documents considered (k): IDCG (ideal discounted cumulative gain).  
Now, to calculate nDCG (normalized DCG), we simply divide DCG by IDCG.  
$$nDCG@k = \frac{DCG@k}{IDCG@k}$$

In [24]:
relevances = [0.99, 0.94, 0.74, 0.88, 0.71, 0.68]
relevances_sorted = list(np.sort(relevances)[::-1])
k = 5
method = 'standard'
# dcg:
dcg = discounted_cumulative_gain(relevances, k, method = method)
print(f'dcg = {dcg}')
# dcg_ideal:
dcg_ideal = discounted_cumulative_gain(relevances_sorted, k, method = method)
print(f'dcg_ideal = {dcg_ideal}')
ndcg = dcg / dcg_ideal
print(f'nDCG = {ndcg}')

dcg = 2.6067348325982804
dcg_ideal = 2.6164401144680056
nDCG = 0.9962906539247512


In [27]:
# function nDCG:
from typing import List
import numpy as np

def discounted_cumulative_gain(relevance: List[float], k: int, method: str = "standard") -> float:
    dcg = 0
    relevance_considered = relevance[:k]
    for order in range(1, len(relevance_considered)+1):
        i = order - 1
        if method == "standard":
            dcg_i = relevance_considered[i] / np.log2(order+1)
        elif method == "industry":
            dcg_i = (2**relevance_considered[i] - 1) / np.log2(order+1)
        else:
            raise ValueError(f"Unknown method: {method!r}")
        dcg = dcg + dcg_i
    return dcg

def normalized_dcg(relevance: List[float], k: int, method: str = "standard") -> float:
    """Normalized Discounted Cumulative Gain.

    Parameters
    ----------
    relevance : `List[float]`
        Video relevance list
    k : `int`
        Count relevance to compute
    method : `str`, optional
        Metric implementation method, takes the values
        `standard` - adds weight to the denominator
        `industry` - adds weights to the numerator and denominator
        `raise ValueError` - for any other value

    Returns
    -------
    score : `float`
        Metric score
    """
    relevances_sorted = list(np.sort(relevance)[::-1])
    # dcg:
    dcg = discounted_cumulative_gain(relevance, k, method = method)
    # dcg_ideal:
    dcg_ideal = discounted_cumulative_gain(relevances_sorted, k, method = method)
    score = dcg / dcg_ideal
    return score

In [28]:
normalized_dcg(relevance=relevances, k=k, method=method)

0.9962906539247512
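One edge case that `normalized_dcg` does not cover: if every relevance in the list is 0, IDCG is 0 and the ratio is undefined. A possible guard is sketched below for the standard variant; the name `safe_normalized_dcg` and the choice to return 0.0 in that case are assumptions of this sketch, not part of the original notebook.

```python
from typing import List

import numpy as np


def safe_normalized_dcg(relevance: List[float], k: int) -> float:
    """Standard nDCG@k that returns 0.0 when the ideal DCG is zero (assumed convention)."""
    rel = np.asarray(relevance, dtype=float)
    top = rel[:k]                       # ranking as returned by the model
    ideal_top = np.sort(rel)[::-1][:k]  # same items, best possible order
    penalties = np.log2(np.arange(2, top.size + 2))
    dcg = np.sum(top / penalties)
    idcg = np.sum(ideal_top / penalties)
    if idcg == 0:
        # no relevant documents at all: avoid dividing by zero
        return 0.0
    return float(dcg / idcg)


print(safe_normalized_dcg([0.0, 0.0, 0.0], k=3))  # 0.0
print(safe_normalized_dcg([0.99, 0.94, 0.74, 0.88, 0.71, 0.68], k=5))  # approximately 0.9963
```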

# Average Normalized Discounted Cumulative Gain  
Now we can compare models on a specific query.  
As the project grows, the number of queries and documents increases significantly, so looking at individual queries no longer makes sense.

At this stage we need to consider a large number of queries and monitor the performance of the model in general. How can we calculate a quality metric over many queries?

- **Average nDCG** - the mean of the nDCG values over all queries in the list.  
$$\text{Average nDCG} = \frac{1}{n}\sum_{i=1}^{n} nDCG(q_i),$$  
where $q_i$ is the i-th query in the list and n is the number of queries considered.

In [31]:
list_relevances = [
        [0.99, 0.94, 0.88, 0.89, 0.72, 0.65],
        [0.99, 0.92, 0.93, 0.74, 0.61, 0.68], 
        [0.99, 0.96, 0.81, 0.73, 0.76, 0.69]
    ]  
k = 5
method = 'standard'

# calculation of Average nDCG:
queries_number = len(list_relevances)
score = 0
for query in list_relevances:
    nDCG_iter = normalized_dcg(query, k, method)
    score = score + nDCG_iter
score = score / queries_number
score

0.9961322104432755

In [None]:
# Average nDCG function:
from typing import List

import numpy as np

def discounted_cumulative_gain(relevance: List[float], k: int, method: str = "standard") -> float:
    dcg = 0
    relevance_considered = relevance[:k]
    for order in range(1, len(relevance_considered)+1):
        i = order - 1
        if method == "standard":
            dcg_i = relevance_considered[i] / np.log2(order+1)
        elif method == "industry":
            dcg_i = (2**relevance_considered[i] - 1) / np.log2(order+1)
        else:
            raise ValueError(f"Unknown method: {method!r}")
        dcg = dcg + dcg_i
    return dcg

def normalized_dcg(relevance: List[float], k: int, method: str = "standard") -> float:
    relevances_sorted = list(np.sort(relevance)[::-1])
    # dcg:
    dcg = discounted_cumulative_gain(relevance, k, method = method)
    # dcg_ideal:
    dcg_ideal = discounted_cumulative_gain(relevances_sorted, k, method = method)
    score = dcg / dcg_ideal
    return score

def avg_ndcg(list_relevances: List[List[float]], k: int, method: str = 'standard') -> float:
    """Average nDCG

    Parameters
    ----------
    list_relevances : `List[List[float]]`
        Video relevance matrix for various queries
    k : `int`
        Count relevance to compute
    method : `str`, optional
        Metric implementation method, takes the values
        `standard` - adds weight to the denominator
        `industry` - adds weights to the numerator and denominator
        `raise ValueError` - for any other value

    Returns
    -------
    score : `float`
        Metric score
    """
    queries_number = len(list_relevances)
    score = 0
    for query in list_relevances:
        nDCG_iter = normalized_dcg(query, k, method)
        score = score + nDCG_iter
    score = score / queries_number

    return score

As a result, we have a metric that lets us monitor the overall performance of different ranking models.
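Coming back to the original question of choosing between two chat bot versions, the comparison could look roughly like this. The relevance lists are hypothetical (both models return the same videos per query, in different orders), and `ndcg` / `average_ndcg` are compact re-implementations of the standard variant so that the sketch runs on its own.

```python
from typing import List

import numpy as np


def ndcg(relevance: List[float], k: int) -> float:
    """Standard nDCG@k, re-implemented compactly for this sketch."""
    rel = np.asarray(relevance, dtype=float)
    penalties = np.log2(np.arange(2, rel[:k].size + 2))
    dcg = np.sum(rel[:k] / penalties)
    idcg = np.sum(np.sort(rel)[::-1][:k] / penalties)
    return float(dcg / idcg)


def average_ndcg(list_relevances: List[List[float]], k: int) -> float:
    """Mean nDCG@k over all queries."""
    return float(np.mean([ndcg(query, k) for query in list_relevances]))


# Hypothetical relevance lists for the same three queries, one list per query.
model_a = [[0.99, 0.94, 0.88], [0.97, 0.90, 0.61], [0.95, 0.81, 0.73]]
model_b = [[0.94, 0.99, 0.88], [0.61, 0.90, 0.97], [0.95, 0.73, 0.81]]

score_a = average_ndcg(model_a, k=3)
score_b = average_ndcg(model_b, k=3)
print(f"model A: {score_a:.4f}, model B: {score_b:.4f}")
print("better model:", "A" if score_a > score_b else "B")
```

Here model A always returns the videos in decreasing order of relevance, so it reaches the maximum average nDCG of 1.0, while model B is penalized for its misplaced videos.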