## Evaluation for the text-based recommender and the combined recommender based on survey data

In [1]:
# !pip list

## Recommender Evaluation

### Load and clean the data

We have two Airbnb recommenders and we want to evaluate both of them. 

One recommender is a text-only search engine. The other is a combined recommender that tries to improve upon the text-only one by also using some numeric features. See the other notebooks and the report for more details.

We conducted surveys having humans rate the search results from each search engine. This notebook processes those surveys and evaluates the search engines using NDCG.

The survey data consists of spreadsheets, which are the responses of our human raters. Each spreadsheet consists of a list of queries and their search result list, sorted by what the system thinks is the most relevant. 

The rater would look at the Airbnb recommender web app's search result in detail and then rate the particular search result (suggested Airbnb listing) on a scale of 1 to 5, 1 being not relevant, 5 being highly relevant.

The survey data is already somewhat cleaned, because the user only has to fill in the relevance column with a number from 1 to 5.

Although the surveys for the combined recommender may contain more columns than the text based one, for the purposes of *this* notebook the survey files are identical, the only real difference being the rater's ratings. So we'll read in one file and make it the main data frame. Then we read in the other survey result files but only take the ratings column. The column we need from the other files is "Relevance: 1 to 5", which we'll rename to match the rater name and the type of search engine evaluated.

Then we use this data frame to evaluate the two search engines.


In [2]:
import numpy as np
import pandas as pd
import math
import altair as alt

basepath='../data/surveys/'
rater_files = ['Text only Airbnb recommender survey_rater_1.xlsx',
              'Text only Airbnb recommender survey_rater_2.xlsx',
              'Text only Airbnb recommender survey_rater_3.xlsx',
              'Combined Airbnb recommender survey_rater_1.xlsx',
              'Combined Airbnb recommender survey_rater_2.xlsx',
              'Combined Airbnb recommender survey_rater_3.xlsx',]

columns=['Query text','Price range','listing_id','listing_url','listing_name','Relevance: 1 to 5']

cols_exel_issue = ['Query text', 'Price range', 'Bed, bedroom, bathroom',
                   'cluster_id', 'listing_id', 'listing_url', 'listing_name',
                   'Relevance: 1 to 5']

In [3]:
df_ratings_raw = pd.read_excel(basepath+rater_files[0], header=13,names=columns)
df_ratings_raw.head(12)

Unnamed: 0,Query text,Price range,listing_id,listing_url,listing_name,Relevance: 1 to 5
1.0,I want a private room close to uw campus with ...,$50-$5000,37640654,https://www.airbnb.com/rooms/37640654,Montlake Apt 3 blocks from UW Light Rail & Hosp.,4.5
,,,54166095,https://www.airbnb.com/rooms/54166095,Comfy Big Room by UW with Free Parking & Fast ...,5.0
,,,50549818,https://www.airbnb.com/rooms/50549818,"Quiet, spacious, one bedroom daylight basement",4.0
,,,34448530,https://www.airbnb.com/rooms/34448530,Montlake Tudor 3 blocks from UW Light Rail Sta...,4.0
,,,3852185,https://www.airbnb.com/rooms/3852185,Room for rotating med students!,3.0
2.0,condo near Space Needle,,5362889,https://www.airbnb.com/rooms/5362889,Stunning 2 BR Next To Space Needle & Arena!!,3.0
,,,19438023,https://www.airbnb.com/rooms/19438023,Immaculate 2 BR Under the Space Needle & Key A...,4.5
,,,53402998,https://www.airbnb.com/rooms/53402998,1-bedroom condo near Space Needle,3.5
,,,36770186,https://www.airbnb.com/rooms/36770186,"Condo w/ Great View, Location, & Free Parking",4.0
,,,669735617462052600,https://www.airbnb.com/rooms/669735617462052577,Amazing Getaway in the Heart of Seattle,2.0


In [4]:
df_ratings_raw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75 entries, 1.0 to nan
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Query text         15 non-null     object 
 1   Price range        1 non-null      object 
 2   listing_id         75 non-null     int64  
 3   listing_url        75 non-null     object 
 4   listing_name       75 non-null     object 
 5   Relevance: 1 to 5  75 non-null     float64
dtypes: float64(1), int64(1), object(4)
memory usage: 4.1+ KB


In [5]:
df_ratings_raw.shape

(75, 6)

As mentioned, the survey files are identical except for the raters' actual ratings, and they were already pre-cleaned in the sense that we examined them in a spreadsheet program. So we'll now read in just the rating column from each of the other files and append it to our table here.

In [6]:
rating_cols = ['rater_1_text', 'rater_2_text', 'rater_3_text',
               'rater_1_combined', 'rater_2_combined', 'rater_3_combined']

df_ratings_raw.rename(columns={'Relevance: 1 to 5': rating_cols[0]},
                      inplace = True)

for i, rating_col in enumerate(rating_cols):
    print(rater_files[i])
    if i == 0:
        continue
    if i > 2:
        header = 15
        cols_to_use = cols_exel_issue
    else:
        header = 13
        cols_to_use = columns
    df_temp = pd.read_excel(basepath+rater_files[i], header=header,names=cols_to_use)
    df_ratings_raw[rating_cols[i]] = df_temp['Relevance: 1 to 5']

Text only Airbnb recommender survey_rater_1.xlsx
Text only Airbnb recommender survey_rater_2.xlsx
Text only Airbnb recommender survey_rater_3.xlsx
Combined Airbnb recommender survey_rater_1.xlsx
Combined Airbnb recommender survey_rater_2.xlsx
Combined Airbnb recommender survey_rater_3.xlsx


In [7]:
df_ratings_raw

Unnamed: 0,Query text,Price range,listing_id,listing_url,listing_name,rater_1_text,rater_2_text,rater_3_text,rater_1_combined,rater_2_combined,rater_3_combined
1.0,I want a private room close to uw campus with ...,$50-$5000,37640654,https://www.airbnb.com/rooms/37640654,Montlake Apt 3 blocks from UW Light Rail & Hosp.,4.5,4.5,4.0,4.5,4.5,4.0
,,,54166095,https://www.airbnb.com/rooms/54166095,Comfy Big Room by UW with Free Parking & Fast ...,5.0,5.0,5.0,5.0,5.0,5.0
,,,50549818,https://www.airbnb.com/rooms/50549818,"Quiet, spacious, one bedroom daylight basement",4.0,5.0,4.0,4.0,5.0,4.0
,,,34448530,https://www.airbnb.com/rooms/34448530,Montlake Tudor 3 blocks from UW Light Rail Sta...,4.0,4.5,4.0,4.0,4.5,4.0
,,,3852185,https://www.airbnb.com/rooms/3852185,Room for rotating med students!,3.0,4.5,3.0,3.0,4.5,3.0
...,...,...,...,...,...,...,...,...,...,...,...
15.0,Looking for a 4 bedroom 3 bath house walking d...,,18934440,https://www.airbnb.com/rooms/18934440,Arboretum Retreat,3.5,3.5,4.5,5.0,5.0,4.0
,,,753841636123637900,https://www.airbnb.com/rooms/753841636123637928,Arboretum Tranquil 2 BR,4.0,3.5,4.5,4.0,4.7,3.5
,,,641288819512351500,https://www.airbnb.com/rooms/641288819512351447,Capitol Hill Lake and Mountain View English Tudor,4.5,3.5,4.5,4.0,4.8,4.0
,,,34924092,https://www.airbnb.com/rooms/34924092,"5 bed 3 bath house, Volunteer Park Capitol Hill",5.0,4.8,5.0,4.0,4.8,4.0


The output below shows that only simple additional cleaning is needed. There are NaNs, which we expected: when we created the survey we deliberately did not repeat identical entries in the 'Query text' and 'Price range' columns, to keep the spreadsheet readable. The info() output confirms that these are the only two columns with NaNs, that no other column has missing values, and that the rating columns have the correct data type. We'll fill those NaNs now with a forward fill.
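
To make the forward-fill step concrete, here is a minimal sketch on a toy frame (the query labels and listing ids are made up for illustration, not survey data). It mirrors the survey layout, where the query text appears only on the first row of each group:

```python
import numpy as np
import pandas as pd

# Toy layout: the query text is written once per group of results.
toy = pd.DataFrame({
    'Query text': ['query A', np.nan, np.nan, 'query B', np.nan],
    'listing_id': [11, 12, 13, 21, 22],
})

# Forward fill copies the last seen non-null value down each column.
filled = toy.ffill()
print(filled)
```

Forward fill is only safe here because the rating columns have no gaps; if a rating were missing, it would silently be replaced by the rating of the row above.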

In [8]:
df_ratings_raw.isnull().values.any()

True

In [9]:
df_ratings_raw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75 entries, 1.0 to nan
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Query text        15 non-null     object 
 1   Price range       1 non-null      object 
 2   listing_id        75 non-null     int64  
 3   listing_url       75 non-null     object 
 4   listing_name      75 non-null     object 
 5   rater_1_text      75 non-null     float64
 6   rater_2_text      75 non-null     float64
 7   rater_3_text      75 non-null     float64
 8   rater_1_combined  75 non-null     float64
 9   rater_2_combined  75 non-null     float64
 10  rater_3_combined  75 non-null     float64
dtypes: float64(6), int64(1), object(4)
memory usage: 7.0+ KB


In [10]:
df_ratings = df_ratings_raw.reset_index(drop=True).ffill()
df_ratings

Unnamed: 0,Query text,Price range,listing_id,listing_url,listing_name,rater_1_text,rater_2_text,rater_3_text,rater_1_combined,rater_2_combined,rater_3_combined
0,I want a private room close to uw campus with ...,$50-$5000,37640654,https://www.airbnb.com/rooms/37640654,Montlake Apt 3 blocks from UW Light Rail & Hosp.,4.5,4.5,4.0,4.5,4.5,4.0
1,I want a private room close to uw campus with ...,$50-$5000,54166095,https://www.airbnb.com/rooms/54166095,Comfy Big Room by UW with Free Parking & Fast ...,5.0,5.0,5.0,5.0,5.0,5.0
2,I want a private room close to uw campus with ...,$50-$5000,50549818,https://www.airbnb.com/rooms/50549818,"Quiet, spacious, one bedroom daylight basement",4.0,5.0,4.0,4.0,5.0,4.0
3,I want a private room close to uw campus with ...,$50-$5000,34448530,https://www.airbnb.com/rooms/34448530,Montlake Tudor 3 blocks from UW Light Rail Sta...,4.0,4.5,4.0,4.0,4.5,4.0
4,I want a private room close to uw campus with ...,$50-$5000,3852185,https://www.airbnb.com/rooms/3852185,Room for rotating med students!,3.0,4.5,3.0,3.0,4.5,3.0
...,...,...,...,...,...,...,...,...,...,...,...
70,Looking for a 4 bedroom 3 bath house walking d...,$50-$5000,18934440,https://www.airbnb.com/rooms/18934440,Arboretum Retreat,3.5,3.5,4.5,5.0,5.0,4.0
71,Looking for a 4 bedroom 3 bath house walking d...,$50-$5000,753841636123637900,https://www.airbnb.com/rooms/753841636123637928,Arboretum Tranquil 2 BR,4.0,3.5,4.5,4.0,4.7,3.5
72,Looking for a 4 bedroom 3 bath house walking d...,$50-$5000,641288819512351500,https://www.airbnb.com/rooms/641288819512351447,Capitol Hill Lake and Mountain View English Tudor,4.5,3.5,4.5,4.0,4.8,4.0
73,Looking for a 4 bedroom 3 bath house walking d...,$50-$5000,34924092,https://www.airbnb.com/rooms/34924092,"5 bed 3 bath house, Volunteer Park Capitol Hill",5.0,4.8,5.0,4.0,4.8,4.0


In [11]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Query text        75 non-null     object 
 1   Price range       75 non-null     object 
 2   listing_id        75 non-null     int64  
 3   listing_url       75 non-null     object 
 4   listing_name      75 non-null     object 
 5   rater_1_text      75 non-null     float64
 6   rater_2_text      75 non-null     float64
 7   rater_3_text      75 non-null     float64
 8   rater_1_combined  75 non-null     float64
 9   rater_2_combined  75 non-null     float64
 10  rater_3_combined  75 non-null     float64
dtypes: float64(6), int64(1), object(4)
memory usage: 6.6+ KB


### Evaluation of the search and recommender

We're almost ready to start doing some calculations to evaluate our model. First we'll get a list of the queries that were used in our search engine surveys. We'll use these to pull out the subset of rows that contains the search results for each query, and then evaluate each of these query/result sets.

In [12]:
unique_queries = df_ratings['Query text'].unique()
unique_queries, len(unique_queries)

(array(['I want a private room close to uw campus with parking and coffee shop.',
        'condo near Space Needle',
        'house near famous Pike Place Starbucks',
        'Nice clean quiet apartment with three beds around downtown',
        'Coffeeholic house',
        'Clean, quiet 2 bed near famous vietnamese coffee shops',
        'Big house near lake washington',
        'house with view of mount rainier seattle, quiet, clean!',
        'I want to be near uw but close to local music scene too',
        'clean place near seattle harbor cruise',
        'I want a 2 bedroom 2 bath near the Seattle art museum and walking distance to restaurants',
        'I need a 2 bed 1 bedroom 1 bath condo with walking distance to Pike Place Market and great restaurants',
        '3 bedroom 2 bath house with 4 beds near Seattle Museum of Popular Culture ',
        'Looking for a cozy 1 bedroom 1 bath condo near the famous Smith Tower in Pioneer Square. Must be clean.',
        'Looking for a 4 b

#### Calculate NDCG

The data frame representing the survey is grouped by query text, and the search results for each query are sorted in descending order from what the search engine believes is the most relevant to the least relevant.

NDCG is a modern way to evaluate search engine results that takes into account both the relevance as labeled by raters (human raters in our case) and the order of the results.

Now it's time to calculate the NDCG @ n for each query's search results. In our case, to keep the survey response time manageable, we only had the rater rate 5 results for each query. Thus in our case n is also the size of the search result recommendation set.


As we learned in SIADS685, there are many ways to compute the discounted cumulative gain. Because we are working with only 5 result items per query, we wanted to amplify the differences and also penalize the case where the ideal top-ranked item is not on top. One way to do that is to use 2**rating in the numerator [see reference link below]. This helps us better see the difference in performance in our case.
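
For reference, the three DCG variants in the code cells that follow can be written as below. The notation is ours, not the course's: $rel_i$ is the rating of the result at rank $i$, $b$ is the log base (the notebook passes 2), and IDCG is the DCG of the same ratings sorted from highest to lowest.

```latex
\begin{align*}
\mathrm{DCG}^{(1)}@n &= \sum_{i=1}^{n} \frac{rel_i}{d_i},
  \qquad d_i = \begin{cases} 1 & i \le b \\ \log_b i & i > b \end{cases} \\
\mathrm{DCG}^{(2)}@n &= \sum_{i=1}^{n} \frac{rel_i}{\log_b (i+1)} \\
\mathrm{DCG}^{(3)}@n &= \sum_{i=1}^{n} \frac{2^{rel_i} - 1}{\log_b (i+1)} \\
\mathrm{NDCG}@n &= \frac{\mathrm{DCG}@n}{\mathrm{IDCG}@n}
\end{align*}
```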

**We leave some commented out for you to try!**

In [13]:
# This is the NDCG that we learned in SIADS685 Search and Recommender Systems.
# It values the top 2 results equally and begins discounting from the 3rd result onward.
# Uncomment if you prefer this one.

# def calculate_dcg(doc_ranks, base):
#     dcg = 0
#     for i,r in enumerate(doc_ranks):
#         idx = i+1
#         if idx > base:
#             rank = r / math.log(idx,base)
#         else:
#             rank = r
#         dcg += rank
#     return dcg

In [14]:
# Per SIADS685, where the discount begins is highly reliant on domain knowledge.
# In some cases you might discount immediately after the first result. See reference
# link here:
# https://towardsdatascience.com/evaluate-your-recommendation-engine-using-ndcg-759a851452d1
# Uncomment if you prefer this one:

# def calculate_dcg(doc_ranks, base):
#     dcg = 0
#     for i,r in enumerate(doc_ranks):
#         idx = i+1
#         rank = r / math.log(idx+1,base)
#         dcg += rank
#     return dcg

In [15]:
# Per SIADS685, where the discount begins is highly reliant on domain knowledge.
# In some cases you might discount immediately after the first result AND also
# penalize misplaced top results through an exponential gain. Since we have a
# small number of results, five, we want to spread out the scores to more easily
# see the difference in performance. See reference link here:
# https://towardsdatascience.com/evaluate-your-recommendation-engine-using-ndcg-759a851452d1
# This is the version used in this notebook:

def calculate_dcg(doc_ranks, base):
    dcg = 0
    for i,r in enumerate(doc_ranks):
        idx = i+1
        rank = ((2**r) - 1) / math.log(idx+1,base)
        dcg += rank
    return dcg

In [16]:
max_docs = 10

def calculate_ndcg(ratings, n=-1):
    # figure out what "at n" is, because NDCG is really always NDCG@n
    if n == -1 or n > max_docs:
        at_n = len(ratings)
    else:
        at_n = n
    
    # calculate the dcg@n for the search engine's ranking of the documents
    system_dcg = calculate_dcg(ratings[:at_n], 2)

    # calculate the dcg@n for the rater's ranking of documents
    #   which we first need to sort by the rater's ranking
    ideal_dcg = calculate_dcg(np.sort(ratings)[::-1][:at_n], 2)

    # normalized DCG at n
    norm_dcg = system_dcg / ideal_dcg     

    # For the curious, uncomment this to see the system vs ideal numbers
    # print('sdcg:{:.3f}, idcg:{:.3f}, ndcg:{:.3f}'.format(system_dcg, ideal_dcg, norm_dcg))
    return norm_dcg
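
As a quick sanity check, the sketch below re-implements the exponential-gain DCG in a compact form and applies it to the five ratings rater 1 gave the text-only results for the first query (4.5, 5.0, 4.0, 4.0, 3.0, taken from the ratings table earlier in this notebook). The names `dcg_exp` and `ndcg_exp` are ours, introduced only for this illustration.

```python
import math

def dcg_exp(ratings, base=2):
    # Exponential gain (2**rating - 1), discounted by log_base(rank + 1).
    return sum((2 ** r - 1) / math.log(rank + 1, base)
               for rank, r in enumerate(ratings, start=1))

def ndcg_exp(ratings, base=2):
    # The ideal ordering is the same ratings sorted from best to worst.
    ideal = sorted(ratings, reverse=True)
    return dcg_exp(ratings, base) / dcg_exp(ideal, base)

# Ratings in the order the search engine returned the listings.
first_query_rater_1 = [4.5, 5.0, 4.0, 4.0, 3.0]
score = ndcg_exp(first_query_rater_1)
print(round(score, 4))  # about 0.9436
```

The score falls short of 1.0 only because the listing rated 5.0 was returned second instead of first; swapping the top two results would make the system order identical to the ideal order.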

In [17]:
ndcg_data = {'query':unique_queries}

for rater in rating_cols:
  ndcg = []
  for query_text in unique_queries:
      # get the ratings for the particular search term, 
      #   which are already sorted by search engine ranking
      ratings = df_ratings[df_ratings['Query text'] == query_text][rater].values

      # calculate and save the NDCG @ n
      ndcg.append(calculate_ndcg(ratings))
  ndcg_data['{}_ndcg'.format(rater)] = ndcg

len(ndcg_data)

7

In [18]:
df_ndcg = pd.DataFrame(data=ndcg_data)
df_ndcg

Unnamed: 0,query,rater_1_text_ndcg,rater_2_text_ndcg,rater_3_text_ndcg,rater_1_combined_ndcg,rater_2_combined_ndcg,rater_3_combined_ndcg
0,I want a private room close to uw campus with ...,0.943583,0.94072,0.896641,0.943583,0.94072,0.896641
1,condo near Space Needle,0.826819,1.0,1.0,0.826819,1.0,1.0
2,house near famous Pike Place Starbucks,0.996557,1.0,0.985284,0.996557,1.0,0.985284
3,Nice clean quiet apartment with three beds aro...,0.820278,0.751766,0.71557,1.0,1.0,0.71557
4,Coffeeholic house,0.97732,0.978755,0.854388,0.97732,0.978755,0.854388
5,"Clean, quiet 2 bed near famous vietnamese coff...",1.0,1.0,0.860331,0.907613,0.974001,0.995705
6,Big house near lake washington,0.995321,1.0,0.968839,0.995321,1.0,0.968839
7,"house with view of mount rainier seattle, quie...",1.0,0.847916,0.950154,1.0,0.847916,0.950154
8,I want to be near uw but close to local music ...,0.937246,1.0,0.860331,0.937246,1.0,0.860331
9,clean place near seattle harbor cruise,0.972682,1.0,1.0,0.972682,1.0,1.0


Now we have each rater's NDCG for each of the queries. The typical modern way to evaluate a search engine is to average the NDCG's over a number of queries and raters. The range of the value is from 0 to 1, where a 1 means the search engine perfectly returns the ideal order of the Airbnb listings. 
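
As a small illustration of that averaging, the sketch below takes two rows of text-only NDCG values from the table above, averages across raters for each query, and then averages across queries. The frame `toy` is a hand-copied subset, so its overall mean is not the notebook's final number.

```python
import pandas as pd

# Two rows copied from the per-rater NDCG table (text-only columns).
toy = pd.DataFrame({
    'query': ['condo near Space Needle',
              'Nice clean quiet apartment with three beds around downtown'],
    'rater_1_text_ndcg': [0.826819, 0.820278],
    'rater_2_text_ndcg': [1.0, 0.751766],
    'rater_3_text_ndcg': [1.0, 0.715570],
})

rater_cols = [c for c in toy.columns if c.endswith('_text_ndcg')]

# Mean across raters for each query, then mean across queries.
per_query_mean = toy[rater_cols].mean(axis=1)
overall_mean = per_query_mean.mean()
print(per_query_mean.round(6).tolist(), round(overall_mean, 4))
```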

But first let's look at some graphs for fun to see if there are any obvious patterns.

In [19]:
df3 = pd.melt(df_ndcg, id_vars=['query'], value_vars=['rater_1_text_ndcg', 'rater_2_text_ndcg', 'rater_3_text_ndcg',
       'rater_1_combined_ndcg', 'rater_2_combined_ndcg',
       'rater_3_combined_ndcg'])
df3

Unnamed: 0,query,variable,value
0,I want a private room close to uw campus with ...,rater_1_text_ndcg,0.943583
1,condo near Space Needle,rater_1_text_ndcg,0.826819
2,house near famous Pike Place Starbucks,rater_1_text_ndcg,0.996557
3,Nice clean quiet apartment with three beds aro...,rater_1_text_ndcg,0.820278
4,Coffeeholic house,rater_1_text_ndcg,0.977320
...,...,...,...
85,I want a 2 bedroom 2 bath near the Seattle art...,rater_3_combined_ndcg,1.000000
86,I need a 2 bed 1 bedroom 1 bath condo with wal...,rater_3_combined_ndcg,1.000000
87,3 bedroom 2 bath house with 4 beds near Seattl...,rater_3_combined_ndcg,0.968839
88,Looking for a cozy 1 bedroom 1 bath condo near...,rater_3_combined_ndcg,1.000000


First we see below how each of the search engine(s) performed by each rater and by specific query in our test set. It doesn't look too bad, with many queries hovering around 0.9 or above, even reaching 1.0. We do see bigger variance around the fourth query and maybe the last five queries. It looks like these queries contain numbers, and their NDCG's hover around 0.7 to 0.8.

In [20]:
alt.Chart(df3, width=500, 
          title='Text Only and Combined Search Performance by Query by Rater').mark_line(
              
          ).encode(
            x=alt.X('query:N',title='Search Query Term',axis=alt.Axis(
                labelAngle=-45), sort=None),
            y=alt.Y('value:Q',title='Normalized Discounted Cumulative Gain',
                    scale=alt.Scale(domain=[0.0, 1.1])),
            color=alt.Color('variable:N', 
                            scale=alt.Scale(scheme= 'dark2'),
                            title='Rater, Search Engine Type')
            ).interactive()


Now let's see how the two types of search engine perform by query. We take the mean NDCG across raters for each type.

In [21]:
data = {'query': df_ndcg['query'].values,
         'mean_ndcg_text_only': df_ndcg[['rater_1_text_ndcg', 'rater_2_text_ndcg', 
                                    'rater_3_text_ndcg']].mean(axis=1),
         'mean_ndcg_combined': df_ndcg[['rater_1_combined_ndcg', 
                                        'rater_2_combined_ndcg',
                                        'rater_3_combined_ndcg']].mean(axis=1)
         }
df_ndcg_by_type = pd.DataFrame(data)
df_ndcg_by_type[:5]

Unnamed: 0,query,mean_ndcg_text_only,mean_ndcg_combined
0,I want a private room close to uw campus with ...,0.926981,0.926981
1,condo near Space Needle,0.942273,0.942273
2,house near famous Pike Place Starbucks,0.993947,0.993947
3,Nice clean quiet apartment with three beds aro...,0.762538,0.90519
4,Coffeeholic house,0.936821,0.936821


In [22]:
df4 = pd.melt(df_ndcg_by_type, id_vars=['query'], 
              value_vars=['mean_ndcg_text_only', 'mean_ndcg_combined'])
df4

Unnamed: 0,query,variable,value
0,I want a private room close to uw campus with ...,mean_ndcg_text_only,0.926981
1,condo near Space Needle,mean_ndcg_text_only,0.942273
2,house near famous Pike Place Starbucks,mean_ndcg_text_only,0.993947
3,Nice clean quiet apartment with three beds aro...,mean_ndcg_text_only,0.762538
4,Coffeeholic house,mean_ndcg_text_only,0.936821
5,"Clean, quiet 2 bed near famous vietnamese coff...",mean_ndcg_text_only,0.953444
6,Big house near lake washington,mean_ndcg_text_only,0.988053
7,"house with view of mount rainier seattle, quie...",mean_ndcg_text_only,0.93269
8,I want to be near uw but close to local music ...,mean_ndcg_text_only,0.932526
9,clean place near seattle harbor cruise,mean_ndcg_text_only,0.990894


In [23]:
alt.Chart(df4, width=500, 
          title='Text Only and Combined Search Performance by Query, mean NDCG').mark_line(
              
          ).encode(
            x=alt.X('query:N',title='Search Query Term',axis=alt.Axis(
                labelAngle=-45), sort=None),
            y=alt.Y('value:Q',title='Normalized Discounted Cumulative Gain',
                    scale=alt.Scale(domain=[0.0, 1.1])),
            color=alt.Color('variable:N', 
                            scale=alt.Scale(scheme= 'dark2'),
                            title='Mean, Search Engine Type')
            ).interactive()

We can see in the above chart that the combined search engine model (which again builds on the text model but further refines the results using numeric data) performs better on average for queries with numeric data, such as the last five queries, which specifically include numeric information such as the number of beds and baths.

Now finally, let's see the overall performance of the text only search vs the combined search, which is the mean NDCG over all searches performed. It looks like the combined recommender wins... 0.96 to 0.92

In [24]:
labelExpression = "datum.label == 'mean_ndcg_text_only' ? 'Text Only' : 'Combined'"

bars = alt.Chart(df4, width=500, height=70).mark_bar(size=20).encode(
  alt.X('mean(value)', title='Mean NDCG', scale=alt.Scale(domain=[0.0, 1.1])),
  alt.Y('variable', sort=None, title='Search Engine Type', 
        axis=alt.Axis(labelExpr=labelExpression)
        )
)

text = bars.mark_text(align='left',
                      baseline='middle',
                      dx=3 # Nudges text to right so it doesn't appear on top of the bar
                      ).encode(
                           # we'll use the mean NDCG as the text label
                      text=alt.Text('mean(value)',format='.04') 
)

(text + bars).configure_view(
    # we don't want a stroke around the bars
    strokeWidth=0
).configure_scale(
    # add some padding
    bandPaddingInner=0.2
).properties(
    title={'text': 'Search Engine Performance by Type',
           'subtitle': ['The mean NDCG for all queries and results'],
           }
)

### Evaluating Rater Agreement

One question comes to mind. While it's good to see that our NDCG scores are close to 1, how do we know whether the relevance ratings from the human raters are themselves trustworthy? In [Evaluation of Text Retrieval Systems in SIADS685](https://www.coursera.org/learn/siads685/lecture/Rl9ml), the professor notes that "Of course, when you have multiple raters, you should actually compare their performance and calculate their agreements...In this case, if the agreement is high between the multiple raters then we can trust the relevance judgments."

One measure of inter-rater agreement is Cohen's kappa, for when there are two raters, while Fleiss' kappa is for when there are 3 or more.

According to [Wikipedia](https://en.wikipedia.org/wiki/Fleiss%27_kappa): "Fleiss' kappa can be used with binary or nominal-scale. It can also be applied to Ordinal data (ranked data)"; however, it might not account for ordering.

So we decided to take an adventure and try another popular measurement, which is intraclass correlation coefficient (ICC). According to the documentation for a popular statistics library [pingouin](https://pingouin-stats.org/build/html/generated/pingouin.intraclass_corr.html): "The intraclass correlation (ICC, [1]) assesses the reliability of ratings by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects."

We'll use ICC to evaluate our raters.

#### Intraclass correlation (ICC)

So we'll use pingouin to calculate the ICC and then interpret the result.

In [25]:
import pingouin as pg

We need to melt our data into a format for the library to process easily. The documentation's wine example essentially has a number of wines and judges ratings of the wines. It's tempting to use the query as our wine, but we have to remember that each query was used in both the text only and combined recommender, so each relevance rating is really a combination of query and search engine type. We'll break it down into two ICC calculations.

If we think of the wine example, we also need to remember that the wines are unique items. In our case a query is not unique, because the rater was asked to rate a query's top 5 results. The unique equivalent in this case is the combination of a query string and a particular Airbnb listing id.
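
The idea of a unique rated item can be sketched on toy data as follows. The queries, URLs, and ratings here are invented for illustration; the point is that the same listing can appear under two queries, so only the query/listing pair identifies what was rated.

```python
import pandas as pd

# 'url/1' appears under both queries, so the URL alone is not a unique target.
toy = pd.DataFrame({
    'Query text': ['q1', 'q1', 'q2', 'q2'],
    'listing_url': ['url/1', 'url/2', 'url/1', 'url/3'],
    'rater_1': [4.0, 5.0, 3.0, 4.5],
    'rater_2': [4.5, 5.0, 3.5, 4.0],
})

# Build the unique target key from query text plus listing URL.
toy['search_result_item'] = toy['Query text'] + '-' + toy['listing_url']

# Long format: one row per (target, rater) pair, as ICC routines typically expect.
long = toy.melt(id_vars=['search_result_item'],
                value_vars=['rater_1', 'rater_2'],
                var_name='rater', value_name='rating')
print(long)
```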

In [26]:
df_ratings.columns


Index(['Query text', 'Price range', 'listing_id', 'listing_url',
       'listing_name', 'rater_1_text', 'rater_2_text', 'rater_3_text',
       'rater_1_combined', 'rater_2_combined', 'rater_3_combined'],
      dtype='object')

In [27]:
### Combine query text and listing url into a unique wine, then rate

df_ratings_icc_text = df_ratings[['Query text','listing_url', 'rater_1_text', 'rater_2_text', 'rater_3_text']].copy()
df_ratings_icc_text['search_result_item'] = df_ratings_icc_text['Query text'].astype(str) +"-"+ df_ratings_icc_text['listing_url']

# Our ratings are 1 to 5, but some are decimal such as 4.5, convert to int
df_ratings_icc_text['rater_1_text'] = df_ratings_icc_text['rater_1_text'].apply(lambda x: x*10)
df_ratings_icc_text['rater_2_text'] = df_ratings_icc_text['rater_2_text'].apply(lambda x: x*10)
df_ratings_icc_text['rater_3_text'] = df_ratings_icc_text['rater_3_text'].apply(lambda x: x*10)

df_ratings_icc_text['rater_1_text'] = df_ratings_icc_text['rater_1_text'].astype('int64')
df_ratings_icc_text['rater_2_text'] = df_ratings_icc_text['rater_2_text'].astype('int64')
df_ratings_icc_text['rater_3_text'] = df_ratings_icc_text['rater_3_text'].astype('int64')

# Get the columns we need: item being rated, and the rater's ratings
df = df_ratings_icc_text[['search_result_item','rater_1_text', 'rater_2_text', 'rater_3_text']]
df_ratings_icc_text = df.copy()
df_ratings_icc_text.reset_index(drop=True, inplace=True)
df_ratings_icc_text = pd.melt(df_ratings_icc_text, id_vars=['search_result_item'], 
                         value_vars=['rater_1_text', 'rater_2_text', 'rater_3_text'])

# Calculate the ICC
icc = pg.intraclass_corr(data=df_ratings_icc_text, 
                         targets='search_result_item', 
                         raters='variable',
                         ratings='value').round(3)
icc.set_index("Type")


# 0.239 to 0.537, with basic without multiplying
# 0.238 to 0.536, with multiplying by 10

Unnamed: 0_level_0,Description,ICC,F,df1,df2,pval,CI95%
Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ICC1,Single raters absolute,0.238,1.939,74,150,0.0,"[0.1, 0.39]"
ICC2,Single random raters,0.257,2.157,74,148,0.0,"[0.12, 0.41]"
ICC3,Single fixed raters,0.278,2.157,74,148,0.0,"[0.14, 0.43]"
ICC1k,Average raters absolute,0.484,1.939,74,150,0.0,"[0.24, 0.66]"
ICC2k,Average random raters,0.51,2.157,74,148,0.0,"[0.29, 0.67]"
ICC3k,Average fixed raters,0.536,2.157,74,148,0.0,"[0.32, 0.69]"


Now calculate the same for the Combined Search Engine's results

In [28]:
### Combine query text and listing url into a unique wine, then rate

df_ratings_icc_combined = df_ratings[['Query text','listing_url', 'rater_1_combined', 'rater_2_combined', 'rater_3_combined']].copy()
df_ratings_icc_combined['search_result_item'] = df_ratings_icc_combined['Query text'].astype(str) +"-"+ df_ratings_icc_combined['listing_url']

# Our ratings are 1 to 5, but some are decimal such as 4.5, convert to int
df_ratings_icc_combined['rater_1_combined'] = df_ratings_icc_combined['rater_1_combined'].apply(lambda x: x*10)
df_ratings_icc_combined['rater_2_combined'] = df_ratings_icc_combined['rater_2_combined'].apply(lambda x: x*10)
df_ratings_icc_combined['rater_3_combined'] = df_ratings_icc_combined['rater_3_combined'].apply(lambda x: x*10)

df_ratings_icc_combined['rater_1_combined'] = df_ratings_icc_combined['rater_1_combined'].astype('int64')
df_ratings_icc_combined['rater_2_combined'] = df_ratings_icc_combined['rater_2_combined'].astype('int64')
df_ratings_icc_combined['rater_3_combined'] = df_ratings_icc_combined['rater_3_combined'].astype('int64')

# Get the columns we need: item being rated, and the rater's ratings
df = df_ratings_icc_combined[['search_result_item','rater_1_combined', 'rater_2_combined', 'rater_3_combined']]
df_ratings_icc_combined = df.copy()
df_ratings_icc_combined.reset_index(drop=True, inplace=True)
df_ratings_icc_combined = pd.melt(df_ratings_icc_combined, id_vars=['search_result_item'], 
                         value_vars=['rater_1_combined', 'rater_2_combined', 'rater_3_combined'])

# Calculate the ICC
icc = pg.intraclass_corr(data=df_ratings_icc_combined, 
                         targets='search_result_item', 
                         raters='variable',
                         ratings='value').round(3)
icc.set_index("Type")

# 0.074 to 0.465 without multiplying
# same with multiplying

Unnamed: 0_level_0,Description,ICC,F,df1,df2,pval,CI95%
Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
ICC1,Single raters absolute,0.074,1.24,74,150,0.135,"[-0.05, 0.22]"
ICC2,Single random raters,0.161,1.869,74,148,0.001,"[0.03, 0.31]"
ICC3,Single fixed raters,0.225,1.869,74,148,0.001,"[0.08, 0.38]"
ICC1k,Average raters absolute,0.193,1.24,74,150,0.135,"[-0.18, 0.46]"
ICC2k,Average random raters,0.366,1.869,74,148,0.001,"[0.08, 0.57]"
ICC3k,Average fixed raters,0.465,1.869,74,148,0.001,"[0.21, 0.65]"


#### Interpretation

There are a [few types of ICC as noted here in Wikipedia](https://en.wikipedia.org/wiki/Intraclass_correlation). In our case we're focusing on ICC3k, which asks whether the judges consistently rated the same items low and high. Per Wikipedia, it corresponds to:

*   Two-way mixed: k fixed raters are defined. Each subject is measured by the k raters.
*   Average measures: the reliability is applied to a context where measures of k raters will be averaged for each subject
*   Consistency: in the context of repeated measurements by the same rater, systematic errors of the rater are canceled and only the random residual error is kept. (e.g. did the judges rate similar subjects low and high? [link](https://www.statology.org/intraclass-correlation-coefficient/))
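
To show what ICC3k measures, here is a minimal NumPy sketch of the standard two-way ANOVA form of the average-measures consistency ICC, (MSR - MSE) / MSR, where MSR is the mean square between targets and MSE is the residual mean square. The function names are ours, and this is not pingouin's implementation; it is only meant to make the quantity concrete. Since F = MSR / MSE, the same value can be recovered as (F - 1) / F, which agrees with the F and ICC3k columns reported in the tables above.

```python
import numpy as np

def icc3k(ratings):
    # ratings: 2-D array, one row per rated item, one column per rater.
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between items
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

def icc3k_from_f(f_value):
    # Equivalent form using the ANOVA F statistic (F = MSR / MSE).
    return (f_value - 1) / f_value

# Raters who differ only by a constant offset are perfectly consistent.
offset_only = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
print(icc3k(offset_only))

# F values reported by pingouin in this notebook.
print(round(icc3k_from_f(2.157), 3), round(icc3k_from_f(1.869), 3))
```

The first example illustrates the "consistency" property: a rater who is systematically one point more generous than another does not lower ICC3k.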


To interpret the ICC, there are a couple of scales. Per the Wikipedia page above:

*    Cicchetti scale:

     *    Less than 0.40—poor.
     *    Between 0.40 and 0.59—fair.
     *    Between 0.60 and 0.74—good.
     *    Between 0.75 and 1.00—excellent.

*    Koo and Li scale:

     *    below 0.50: poor
     *    between 0.50 and 0.75: moderate
     *    between 0.75 and 0.90: good
     *    above 0.90: excellent
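
The two scales above can be expressed as small lookup helpers. This is a sketch with function names of our own choosing; how values that fall exactly on a boundary, or in the small gaps between the listed ranges, are assigned is our assumption rather than part of either scale.

```python
def cicchetti_label(icc):
    # Assumption: each band starts at its lower bound (e.g. 0.40 counts as "fair").
    if icc < 0.40:
        return 'poor'
    if icc < 0.60:
        return 'fair'
    if icc < 0.75:
        return 'good'
    return 'excellent'

def koo_li_label(icc):
    # Assumption: 0.50 counts as "moderate", 0.75 as "good", 0.90 as "good".
    if icc < 0.50:
        return 'poor'
    if icc < 0.75:
        return 'moderate'
    if icc <= 0.90:
        return 'good'
    return 'excellent'

# ICC3k values reported earlier in this notebook.
for name, value in [('text-only', 0.536), ('combined', 0.465)]:
    print(name, cicchetti_label(value), koo_li_label(value))
```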

Our scores for ICC3k are as follows. First we look at the Text-only search engine results:

*    Text-only Based Search Results: 
     *    Score: 0.536
     *    Confidence Interval (95%): [0.32, 0.69]

On the Cicchetti scale, our raters' agreement on the text-only results is "fair". The low end of the confidence interval (CI), 0.32, would put it in the "poor" range, while the high end, 0.69, would put it in the "good" range.

On the Koo and Li scale, our raters' agreement is "moderate", but the low end of the CI would be "poor".

Next are the inter-rater agreement results for the Combined Search engine:

*    Combined Search Results:
     *    Score: 0.465
     *    Confidence Interval (95%): [0.21, 0.65]

On the Cicchetti scale, the raters' agreement is still in the "fair" zone, and the confidence interval similarly ranges from "poor" to "good". On the Koo and Li scale, it dips into the "poor" zone.





