With KNN, RAG, or Classification, we can recommend specific wines. See the diagram below for how we make this suggestion in each approach. 

In this notebook, we evaluate how well each modelling approach suggests the top 5 wines.

<img src='misc/Evaluate Top 5 Wines.png' width="800" >

We use an LLM to generate a *Relevance* score for the wines recommended by each model. We ask the LLM to measure the relevance of the result, which assesses the appropriateness and applicability of the wine recommendations with respect to the user query. We test with 100 queries and report the mean relevance score for each model.
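The evaluation described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `mean_relevance` and `stub_judge` are hypothetical names, and `stub_judge` stands in for the LLM call with a fixed rule so the example runs without an API key.

```python
from statistics import mean
from typing import Callable, Optional


def mean_relevance(cases: list, judge: Callable[[str, str], Optional[int]]):
    """Score every (query, result) pair with `judge` and average the valid scores."""
    scores = [judge(c["query"], c["result"]) for c in cases]
    valid = [s for s in scores if s is not None]  # skip cases the judge could not score
    return mean(valid)


# Hypothetical stand-in for the LLM judge: 1 for an apology, 5 otherwise.
def stub_judge(query: str, result: str) -> Optional[int]:
    return 1 if result.startswith("I'm sorry") else 5


# Made-up cases for illustration only
cases = [
    {"query": "Can you suggest a wine from Germany?", "result": "| Title | Country |"},
    {"query": "Suggest an award-winning winery.", "result": "I'm sorry, no match."},
]
print(mean_relevance(cases, stub_judge))
```

In the notebook itself, the judge is the LLM chain defined further down, and the score is parsed out of its Markdown answer.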



In [1]:
# !pip install -qU \
#   transformers==4.31.0 \
#   pinecone-client==2.2.4 \
#   openai==1.3.2 \
#   tiktoken==0.5.1 \
#   langchain==0.0.336 \
#   lark==1.1.8 \
#   cohere==4.27


In [2]:
import numpy as np
import pandas as pd
import os
import re
import time

# Imports

## OpenAI

In [4]:
import os
import openai

# Read the API key (obtained from the OpenAI website) from the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = OPENAI_API_KEY


# Set up the LLM Chain with prompt to assess Relevance

In [40]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from IPython.display import Markdown

# initialize LLM
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo-1106', 
    # model_name='gpt-4-1106-preview', 
    temperature=0
)

# Define the prompt used to assess Relevance
template_rel = """

As the 'Relevance Judge', your role involves evaluating the relevance of each QUERY for wine suggestion and a RESULT (which is a wine recommendation table), providing a score from 1 to 5.
You will receive a tuple of (QUERY, RESULT), and must give an overall score for the tuple.

score: Your numerical score for the model's relevance based on the rubric
justification: Your step-by-step reasoning about the model's relevance score

Please give a score from 1-5 based on the degree of relevance to the query, where the lowest and highest scores are defined as follows:
Score 1: The result doesn't mention anything about the question or is completely irrelevant to the query.
Score 5: The result addresses all aspects of the question and all parts of the result are meaningful and relevant to the question.

You must format your answer in a well-formatted markdown table with only one single row, and columns for QUERY, RESULT, score, and justification.
This format will help to clearly present your analysis.

Your score should measure the appropriateness and applicability of the result with respect to the query.
Scores should reflect the extent to which the result directly addresses the question provided in the query, and give lower scores for incomplete or redundant results.

Maintain a formal and technical tone, focusing on impartial and objective analysis. Avoid irrelevant discussions and concentrate on the alignment between the recommendations and the query specifics.

QUERY: {query}

RESULT: {result}

ANSWER:
"""

prompt_rel = PromptTemplate.from_template(template=template_rel)

llm_chain = LLMChain(llm=llm, prompt=prompt_rel)
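To see what the judge actually receives, it can help to render the template by hand. The sketch below uses plain `str.format` on a shortened, hypothetical template rather than the full `template_rel`. LangChain's default prompt format uses the same `{name}` placeholder style, so `prompt_rel.format(query=..., result=...)` should behave similarly, though that is worth confirming against the installed version.

```python
# Shortened, hypothetical version of the judge prompt, for illustration only
template = """QUERY: {query}

RESULT: {result}

ANSWER:
"""

filled = template.format(
    query="Can you suggest a wine from Germany?",
    result="| Title | Country |\n| --- | --- |\n| Flying Ace 2006 Red Red (Pfalz) | Germany |",
)
print(filled)
```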


In [6]:
# Define helper function to convert LLM output to a Pandas DataFrame
def get_df_for_result(res):
    """
    Convert the Markdown table in the LLM output to a Pandas DataFrame.

    """
    res_text = res['text']

    # Split the table into rows, skipping blank lines
    rows = [r for r in res_text.split('\n') if r.strip()]

    # Split each row into cells after removing the outer pipes
    split_rows_clean = []
    for r in rows:
        clean_row = [c.strip() for c in r.strip().strip('|').split('|')]
        split_rows_clean.append(clean_row)

    # Extract the header and data rows (row 1 is the '| --- |' separator)
    header = split_rows_clean[0]
    data = split_rows_clean[2:]

    # Create a pandas DataFrame using the extracted header and data rows
    df = pd.DataFrame(data, columns=header)
    return df
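A quick way to check the parsing logic is to run it on a small hand-written table. The sketch below is a standard-library variant, not the helper above: `parse_markdown_table` is a hypothetical name, it returns plain dicts instead of a DataFrame, and it assumes no cell contains a literal `|`.

```python
def parse_markdown_table(text: str) -> list:
    """Parse a simple Markdown table into a list of {header: cell} dicts."""
    # Keep only lines that look like table rows
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    # Drop the outer pipes, then split each row into stripped cells
    rows = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the | --- | separator
    return [dict(zip(header, r)) for r in body]


# Made-up judge output for illustration only
sample = """
| QUERY | RESULT | score | justification |
| --- | --- | --- | --- |
| Can you suggest a wine from Germany? | Five German wines. | 5 | All wines are German. |
"""
records = parse_markdown_table(sample)
print(records[0]["score"])
```

Passing `records` to `pd.DataFrame` would give a table with the same columns as the helper's output.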


In [77]:
import re

# Define helper function to extract the evaluation score from the LLM output
def get_score_test(res):
    """
    Get the evaluation score from the Markdown content.

    """
    res_text = res['text']
    # Raw string so the regex escapes are not treated as string escapes
    score_field = re.findall(r"\|\s*(\d)\s*\|", res_text)

    if len(score_field) == 0:
        return None
    else:
        return int(score_field[0])
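The regex above takes the first table cell that holds a single digit. One possible alternative, sketched below under the assumption that the judge always returns a one-row table with a column named `score` (in any capitalization), is to look the score up by its header. `score_from_table` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re
from typing import Optional


def score_from_table(text: str) -> Optional[int]:
    """Return the integer in the 'score' column of a one-row Markdown table, or None."""
    rows = [
        [c.strip() for c in ln.strip().strip("|").split("|")]
        for ln in text.splitlines()
        if ln.strip().startswith("|")
    ]
    if len(rows) < 3:  # need header, separator, and one data row
        return None
    header = [h.lower() for h in rows[0]]
    if "score" not in header:
        return None
    idx = header.index("score")
    if idx >= len(rows[2]):
        return None
    cell = rows[2][idx]
    # Accept only a single digit within the 1-5 rubric
    return int(cell) if re.fullmatch(r"[1-5]", cell) else None


# Made-up judge output for illustration only
sample = """
| QUERY | RESULT | Score | Justification |
|-------|--------|-------|----------------|
| Suggest a wine. | No wine was suggested. | 1 | Irrelevant. |
"""
print(score_from_table(sample))
```

Returning `None` for anything outside 1 to 5 keeps malformed answers out of the mean, at the cost of dropping those cases.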


# Test this out with a few (query, result) tuples

In [8]:
input_list = [{"query": "What red wines would you suggest for someone who enjoys a velvety texture and soft tannins?",
              "result": """
                        | Title | Description | Variety | Country | Region | Winery | Province |
                        | --- | --- | --- | --- | --- | --- | --- |
                        | Jean-Luc and Paul Aegerter 2014 Vieilles Vignes  (Savigny-lès-Beaune) | A smooth wine with a balance between red fruits and soft tannins. A smoky edge mingles with the berry-fruit acidity. | Pinot Noir | France | Savigny-lès-Beaune | Jean-Luc and Paul Aegerter | Burgundy |
                        | Martin Ray 2009 Synthesis Red (Napa Valley) | Very rich and smooth in texture, with lots of tannins that are soft, ripe and easy. Offers waves of blackberry, black currant, dark chocolate and caramelized oak flavors. | Bordeaux-style Red Blend | US | Napa Valley | Martin Ray | California |
                        | Sogevinus 2006 D + D Red (Douro) | A Portuguese wine with minty, ripe, velvet textures, smooth and full-bodied. Very polished and classy, with soft acidity, tarry tannins, new wood and plum juice flavors. | Portuguese Red | Portugal | Douro | Sogevinus | Douro |
                        | Tenuta di Ghizzano 2004 Veneroso Red (Toscana) | Modern, soft and velvety in texture, with hints of Amaretto, soy sauce and pipe tobacco. The mouthfeel is exceptional with bright fruit tones, chewy consistency and long-lasting menthol freshness. | Red Blend | Italy | Toscana | Tenuta di Ghizzano | Tuscany |
                        | Luis Duarte 2014 Rubrica Tinto Red (Alentejano) | A generous, opulent wine with fine, velvet-smooth tannins, a background dryness that gives structure. Black fruits shine through, enhanced by the spice and toast of the wood aging. | Portuguese Red | Portugal | Alentejano | Luis Duarte | Alentejano |
                        
                        """}, 
              {"query": "Can you suggest a wine from Germany?",
              "result": """
                        | Title | Description | Variety | Country | Region | Winery | Province |
                        | --- | --- | --- | --- | --- | --- | --- |
                        | Fritz Müller NV Perlwein Trocken Müller-Thurgau (Rheinhessen) | A fun and refreshing lightly sparkling white wine with floral notes and a chalky, earthen tang. | Müller-Thurgau | Germany | Rheinhessen | Fritz Müller | Rheinhessen |
                        | Grafen Neipperg 2008 Trocken Spätburgunder (Württemberg) | A vibrant Pinot Noir with strawberry and raspberry fruit and slightly herbal nuances. | Spätburgunder | Germany | Württemberg | Grafen Neipperg | Württemberg |
                        | Wagner-Stempel 2014 Gutswein Trocken Weissburgunder (Rheinhessen) | A sprightly Weissburgunder with crisp white peach and apricot flavor. | Weissburgunder | Germany | Rheinhessen | Wagner-Stempel | Rheinhessen |
                        | Flying Ace 2006 Red Red (Pfalz) | A red wine blend delivering black cherry fruit, medium body and soft tannins. | Red Blend | Germany | Pfalz | Flying Ace | Pfalz |
                        | Fitz-Ritter 2006 Beerenauslese Rieslaner (Pfalz) | An intensely botrytized wine with dried apricot aromas and orange marmalade flavors. | Rieslaner | Germany | Pfalz | Fitz-Ritter | Pfalz |
                        
                        """},
              {"query": "Can you recommend a red wine that's elegant and well-structured?",
              "result": """
                        | Title | Description | Variety | Country | Region | Winery | Province |
                        | --- | --- | --- | --- | --- | --- | --- |
                        | Vilafonté 2012 Series C Red (Paarl) | A blend of Cabernet Sauvignon, Malbec, Merlot and Cabernet Franc, with spicy aromas and a solid fruit core. Well balanced, with velvety tannins and a long, evolving finish. | Bordeaux-style Red Blend | South Africa | Paarl | Vilafonté | Paarl |
                        | Tardieu-Laurent 2010 Vieilles Vignes  (Gigondas) | Starts off a bit stern in texture, but by the long finish those tannins have turned velvety and plush, nicely framing intense flavors. | Rhône-style Red Blend | France | Gigondas | Tardieu-Laurent | Rhône Valley |
                        | Vilafonté 2005 Series C Red (Paarl) | An elegant, mouthwatering red blend with ripe tannins and a long finish. | Bordeaux-style Red Blend | South Africa | Paarl | Vilafonté | Paarl |
                        | Duckhorn 2004 Red Wine Red (Howell Mountain) | Almost entirely Merlot, this Howell Mountain red has a deep core of blackberries and currants, with velvety tannins. | Bordeaux-style Red Blend | US | Howell Mountain | Duckhorn | California |
                        | Ernie Els 2009 Signature Red (Stellenbosch) | This Bordeaux-style blend is oaky at first, but with a lush, mouthfilling palate and a texture like crushed velvet. | Bordeaux-style Red Blend | South Africa | Stellenbosch | Ernie Els | Stellenbosch |
                                                
                        """},
            {"query": "Suggest a wine from a winery known for its award-winning practices.",
              "result": """
                        I'm sorry, but the context does not provide information about a winery known for its award-winning practices.                        
                        """}
              ]

llm_chain = LLMChain(llm=llm, prompt=prompt_rel)
score_tab = llm_chain.apply(input_list)


In [9]:
# Display the evaluation results for these 4 cases
display(Markdown(score_tab[0]['text']))

display(Markdown(score_tab[1]['text']))

display(Markdown(score_tab[2]['text']))

display(Markdown(score_tab[3]['text']))


| QUERY | RESULT | score | justification |
| --- | --- | --- | --- |
| What red wines would you suggest for someone who enjoys a velvety texture and soft tannins? | The result provides a selection of red wines with descriptions that specifically mention velvety texture and soft tannins. Each wine is described in detail, highlighting the smoothness and softness of the tannins, which directly aligns with the query. The wines are from various countries and regions, offering a diverse range of options. Therefore, the result is highly relevant to the query and addresses all aspects of the question. | 5 | The result directly addresses the query by providing detailed descriptions of red wines with velvety texture and soft tannins, meeting all aspects of the question and offering a diverse selection of relevant recommendations. Therefore, it deserves a score of 5 for its high degree of relevance. |

| QUERY | RESULT | score | justification |
| --- | --- | --- | --- |
| Can you suggest a wine from Germany? | The result provides a table of wine recommendations from Germany, including the title, description, variety, country, region, winery, and province of each wine. Each wine listed is from a different region in Germany, showcasing the diversity of German wines. The descriptions of each wine provide a good overview of the flavor profile and characteristics, allowing for informed decision-making. Therefore, the result is highly relevant to the query and addresses all aspects of the question. | 5 | The result directly addresses the query by providing a comprehensive list of wine recommendations from Germany, covering different varieties and regions. The descriptions of each wine also align with the query by offering detailed information about the suggested wines, making the result highly relevant and informative. |

| QUERY | RESULT | score | justification |
| --- | --- | --- | --- |
| Can you recommend a red wine that's elegant and well-structured? | The result provides a list of red wines with detailed descriptions, including information about the variety, country, region, and winery. The wines are described as elegant, well-structured, and velvety, which aligns with the query. | 5 | The result directly addresses the query by providing a selection of red wines that are described as elegant and well-structured. The descriptions of the wines' characteristics and flavors demonstrate a high level of relevance to the query. Therefore, the score of 5 is justified based on the comprehensive and meaningful alignment between the recommendations and the query specifics. |

| QUERY | RESULT | Score | Justification |
|-------|--------|-------|----------------|
| Suggest a wine from a winery known for its award-winning practices. | The result does not mention anything about a winery known for its award-winning practices. | 1 | The result is completely irrelevant to the query as it does not address the request for a wine suggestion from a winery known for its award-winning practices. Therefore, it receives a score of 1. |

The judge seems to work well on these four cases. Now we evaluate with the 100 test queries.

# Evaluate RAG wine recs

In [10]:
# Load RAG test result dataset
file_path = '../Data/rag_results.csv'
wine_queries = pd.read_csv(file_path)

# Get all 100 test cases in a list
test_input = [{"query": r['question'], "result": f"""{r['result']}"""} for i, r in wine_queries.iterrows()]

In [11]:
import math
from tqdm import tqdm

# Set batch size
batch_size = 10

# Calculate the number of batches needed
# num_batches = math.ceil(len(test_input[:20]) / batch_size)  # This processes the first 20 cases in 2 batches of 10
num_batches = math.ceil(len(test_input) / batch_size)  # This processes the full 100 cases in 10 batches of 10

# Initialize list to store the eval results
eval_results = []

# Process data in batches with tqdm
for i in tqdm(range(num_batches), desc="Processing Batches"):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(test_input))
    batch = test_input[start_idx:end_idx]

    # Score each tuple in the batch
    batch_score = llm_chain.apply(batch)

    # Append the raw eval results; scores are extracted in a later cell
    eval_results.extend(batch_score)

    # Pause between batches
    time.sleep(8)

# Store the eval results for RAG
eval_results_rag = eval_results


Processing Batches: 100%|█████████████████████████████████████████████████████████████| 10/10 [17:07<00:00, 102.73s/it]
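The same batching loop is repeated for each model below, so it could be factored into a helper. The following is a minimal sketch under stated assumptions: `chunked`, `run_in_batches`, and `fake_apply` are hypothetical names, and `fake_apply` stands in for `llm_chain.apply` so the example runs offline.

```python
import time
from typing import Callable, Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(items: list, apply_fn: Callable[[list], list],
                   size: int, pause_s: float = 0.0) -> list:
    """Apply `apply_fn` to each batch, pause between batches, and collect all outputs."""
    results = []
    for batch in chunked(items, size):
        results.extend(apply_fn(batch))
        time.sleep(pause_s)
    return results


# Hypothetical stand-in for llm_chain.apply: one fixed answer per input
def fake_apply(batch: list) -> list:
    return [{"text": "| 5 |"} for _ in batch]


# Made-up inputs for illustration only
items = [{"query": f"q{i}", "result": "r"} for i in range(7)]
outputs = run_in_batches(items, fake_apply, size=3)
print(len(outputs))
```

With the real chain, the call would look like `run_in_batches(test_input, llm_chain.apply, size=10, pause_s=8)`.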


In [None]:
# Get the scores from the evals and compute mean relevance score
eval_scores_rag = []

for i in eval_results_rag:
    score_i = get_score_test(i)
    eval_scores_rag.append(score_i)

# Drop responses whose score could not be parsed
eval_scores_rag = [i for i in eval_scores_rag if i is not None]
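Because unparsable responses are dropped before averaging, the mean may be computed over fewer than 100 cases. A small sketch for reporting that coverage next to the mean is shown below; `summarize_scores` is a hypothetical helper and the input scores are made up.

```python
from statistics import mean


def summarize_scores(scores: list) -> dict:
    """Report the mean of the parsed scores and how many responses were dropped."""
    parsed = [s for s in scores if s is not None]
    return {
        "n_total": len(scores),
        "n_parsed": len(parsed),
        "n_dropped": len(scores) - len(parsed),
        "mean": mean(parsed) if parsed else None,
    }


# Made-up scores for illustration; None marks an unparsable judge response
summary = summarize_scores([5, 4, None, 5, None])
print(summary)
```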


In [13]:
# Calculate mean Relevance score
m_rel_rag = np.mean(eval_scores_rag)

print('Mean Relevance Score for RAG\'s wine recs:', m_rel_rag)

Mean Relevance Score for RAG's wine recs: 4.84


# Evaluate KNN wine recs

In [79]:
# Load KNN test result dataset
file_path = '../Data/knn_results.csv'
wine_queries = pd.read_csv(file_path)

# Get all 100 test cases in a list
test_input = [{"query": r['question'], "result": f"""{r['result']}"""} for i, r in wine_queries.iterrows()]

In [80]:
import math
from tqdm import tqdm

# Set batch size
batch_size = 3

# Calculate the number of batches needed
# num_batches = math.ceil(len(test_input[:10]) / batch_size)  # This processes the first 10 cases in 4 batches of 3
num_batches = math.ceil(len(test_input) / batch_size)  # This processes the full 100 cases in 34 batches of 3

# Initialize list to store the eval results
eval_results = []

# Process data in batches with tqdm
for i in tqdm(range(num_batches), desc="Processing Batches"):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(test_input))
    batch = test_input[start_idx:end_idx]

    # Score each tuple in the batch
    batch_score = llm_chain.apply(batch)

    # Append the raw eval results; scores are extracted in a later cell
    eval_results.extend(batch_score)

    # Pause between batches
    time.sleep(8)

# Store the eval results for KNN
eval_results_knn = eval_results




Processing Batches: 100%|███████████████████████████████████████████████████████████| 34/34 [3:15:27<00:00, 344.93s/it]


In [81]:
eval_scores_knn = []

for i in eval_results_knn:
    score_i = get_score_test(i)
    eval_scores_knn.append(score_i)

# Drop responses whose score could not be parsed
eval_scores_knn = [i for i in eval_scores_knn if i is not None]


In [82]:
# Calculate mean Relevance score
m_rel_knn = np.mean(eval_scores_knn)

print('Mean Relevance Score for KNN\'s wine recs:', m_rel_knn)

Mean Relevance Score for KNN's wine recs: 4.493333333333333


# Evaluate Classification wine recs

In [84]:
# Load Classification test result dataset
file_path = '../Data/classification_results.csv'
wine_queries = pd.read_csv(file_path)

# Get all 100 test cases in a list
test_input = [{"query": r['Question'], "result": f"""{r['Result']}"""} for i, r in wine_queries.iterrows()]

In [85]:
import math
from tqdm import tqdm

# Set batch size
batch_size = 3

# Calculate the number of batches needed
# num_batches = math.ceil(len(test_input[:10]) / batch_size)  # This processes the first 10 cases in 4 batches of 3
num_batches = math.ceil(len(test_input) / batch_size)  # This processes the full 100 cases in 34 batches of 3

# Initialize list to store the eval results
eval_results = []

# Process data in batches with tqdm
for i in tqdm(range(num_batches), desc="Processing Batches"):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(test_input))
    batch = test_input[start_idx:end_idx]

    # Score each tuple in the batch
    batch_score = llm_chain.apply(batch)

    # Append the raw eval results; scores are extracted in a later cell
    eval_results.extend(batch_score)

    # Pause between batches
    time.sleep(8)

# Store the eval results for Classification
eval_results_clf = eval_results


Processing Batches: 100%|███████████████████████████████████████████████████████████| 34/34 [4:07:34<00:00, 436.90s/it]


In [86]:
# Get the scores from the evals and compute mean relevance score
eval_scores_clf = []

for i in eval_results_clf:
    score_i = get_score_test(i)
    eval_scores_clf.append(score_i)

# Drop responses whose score could not be parsed
eval_scores_clf = [i for i in eval_scores_clf if i is not None]


In [87]:
# Calculate mean Relevance score
m_rel_clf = np.mean(eval_scores_clf)

print('Mean Relevance Score for Classification\'s wine recs:', m_rel_clf)

Mean Relevance Score for Classification's wine recs: 4.34375


# Comparison table across 3 models

In [89]:
tab_rel_data = {'0': ['KNN Search', m_rel_knn],
                '1': ['RAG', m_rel_rag],
                '2': ['Classification', m_rel_clf] }

# Make a DataFrame of the final table
tab_rel_final = pd.DataFrame.from_dict(tab_rel_data,
                                       orient='index',
                                       columns=['Model', 'Mean Relevance Score'])

# Save to CSV file
tab_rel_final.to_csv('results_to_eval/top_5_wines/compare_KNN_RAG_CLF.csv', index=False)

In [90]:
tab_rel_final

|  | Model | Mean Relevance Score |
| --- | --- | --- |
| 0 | KNN Search | 4.493333 |
| 1 | RAG | 4.84 |
| 2 | Classification | 4.34375 |
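One caveat when reading these means: each is an average of integer scores over only the responses whose score could be parsed, so the value itself hints at how many cases were scored. The sketch below is a worked arithmetic check, not part of the original evaluation. It assumes integer scores and at most 100 cases per model, and recovers the simplest fraction behind each reported mean.

```python
from fractions import Fraction

# Means as printed in the cells above
reported = {"KNN Search": 4.493333333333333, "RAG": 4.84, "Classification": 4.34375}

implied = {}
for model, m in reported.items():
    # Closest fraction whose denominator does not exceed the 100 test queries
    implied[model] = Fraction(m).limit_denominator(100)
    print(model, implied[model],
          "-> number of scored cases is a multiple of", implied[model].denominator)
```

Under these assumptions, the KNN mean reduces to 337/75, which suggests 75 scored cases rather than 100. The RAG mean (121/25) is compatible with all 100 cases, and the Classification mean (139/32) with 32, 64, or 96. The comparison may therefore rest on different numbers of cases per model.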


RAG scores the highest in terms of relevance! Note that, due to time constraints, we only evaluate RAG's first response here. RAG also allows users to ask follow-up questions, and it handles negations well if the user emphasizes the negation in the follow-up (e.g. "Actually, I don't want floral notes."). RAG's revised responses could potentially score even higher than what we report above.