# Conversational Search Retrieval Augmented Generation

In this notebook you will implement the following steps:

- **Answer selection + evaluation**: Implement a *search-based* conversation framework and evaluate it on conversation topics, each made up of a sequence of conversation turns.
- **Answer ranking**: Implement a *re-ranking method* to sort the initial search results. Evaluate the re-ranked results.
- **Conversation memory**: Implement a conversational context modeling method to keep track of the conversation state. 
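
As one possible starting point for the conversation memory step, the sketch below expands each turn with the first utterance of the conversation and the immediately preceding utterances. The function name `contextualize_turns`, the `history_size` parameter, and the heuristic itself are assumptions made for illustration; they are not part of the provided `TRECCASTeval` module.

```python
def contextualize_turns(utterances, history_size=1):
    """Expand each utterance with the first turn and the last `history_size` turns."""
    expanded = []
    for i, utterance in enumerate(utterances):
        context = []
        if i > 0:
            # The first turn usually names the topic of the conversation.
            context.append(utterances[0])
            # Add the most recent turns, skipping turn 0 so it is not repeated.
            start = max(1, i - history_size)
            context.extend(utterances[start:i])
        expanded.append(" ".join(context + [utterance]))
    return expanded


turns = [
    "What are the main breeds of goat?",
    "Tell me about boer goats.",
    "What breed is good for meat?",
]
for query in contextualize_turns(turns):
    print(query)
```

Each expanded string can then be sent to the search API in place of the raw utterance. Longer histories add more context but also more noise, so `history_size` is worth tuning on the training conversations.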

Submission dates:
- **20 October**: first stage retrieval + conversation memory + evaluation
- **15 November**: re-ranking with LLM + evaluation
- **15 December**: answer generation + evaluation

## Test bed and conversation topics
The TREC CAsT corpus (http://www.treccast.ai/) for conversational search is indexed in this cluster and can be searched through an OpenSearch API.

The queries and the relevance judgments are available through the `ConvSearchEvaluation` class:

In [71]:
import TRECCASTeval as trec
import numpy as np
import pprint

pp = pprint.PrettyPrinter(indent=4)

test_bed = trec.ConvSearchEvaluation()

print()
print("========================================== Training conversations =====")
topics = {}
for topic in test_bed.train_topics:
    conv_id = topic['number']

    if conv_id not in (1, 2, 4, 7, 15, 17, 18, 22, 23, 24, 25, 27, 30):
        continue

    print()
    print(conv_id, "  ", topic['title'])

    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        topic_turn_id = '%d_%d'% (conv_id, turn_id)
        
        print(topic_turn_id, utterance)
        topics[topic_turn_id] = utterance

print()
print("========================================== Test conversations =====")
for topic in test_bed.test_topics:
    conv_id = topic['number']

    if conv_id not in (31, 32, 33, 34, 37, 40, 49, 50, 54, 56, 58, 59, 61, 67, 68, 69, 75, 77, 78, 79):
        continue

    print()
    print(conv_id, "  ", topic['title'])

    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        topic_turn_id = '%d_%d'% (conv_id, turn_id)
        
        print(topic_turn_id, utterance)
        topics[topic_turn_id] = utterance




1    Career choice for Nursing and Physician's Assistant
1_1 What is a physician's assistant?
1_2 What are the educational requirements required to become one?
1_3 What does it cost?
1_4 What's the average starting salary in the UK?
1_5 What about in the US?
1_6 What school subjects are needed to become a registered nurse?
1_7 What is the PA average salary vs an RN?
1_8 What the difference between a PA and a nurse practitioner?
1_9 Do NPs or PAs make more?
1_10 Is a PA above a NP?
1_11 What is the fastest way to become a NP?
1_12 How much longer does it take to become a doctor after being an NP?

2    Goat breeds
2_1 What are the main breeds of goat?
2_2 Tell me about boer goats.
2_3 What breed is good for meat?
2_4 Are angora goats good for it?
2_5 What about boer goats?
2_6 What are pygmies used for?
2_7 What is the best for fiber production?
2_8 How long do Angora goats live?
2_9 Can you milk them?
2_10 How many can you have per acre?
2_11 Are they profitable?

4    The Neolithic 

In [72]:
test_bed.test_relevance_judgments

Unnamed: 0,topic_turn_id,dummy,docid,rel
0,31_1,Q0,CAR_116d829c4c800c2fc70f11692fec5e8c7e975250,0
1,31_1,Q0,CAR_1463f964653c5c9f614a0a88d26b175e4a8120f1,1
2,31_1,Q0,CAR_172e16e89ea3d5546e53384a27c3be299bcfe968,2
3,31_1,Q0,CAR_1c93ef499a0c2856c4a857b0cb4720c380dda476,0
4,31_1,Q0,CAR_2174ad0aa50712ff24035c23f59a3c2b43267650,3
...,...,...,...,...
29345,79_9,Q0,MARCO_8795229,0
29346,79_9,Q0,MARCO_8795231,0
29347,79_9,Q0,MARCO_8795233,0
29348,79_9,Q0,MARCO_8795236,0
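
The `rel` column in the judgments above is graded (the rows shown include grades 0 through 3), so both binary and graded metrics can be computed. The following is a minimal sketch of precision@k and nDCG@k for a single turn. The helper names, the toy judgments, and the choice of treating `rel >= 1` as relevant are assumptions; the provided `ConvSearchEvaluation` class may compute its metrics differently.

```python
import math


def precision_at_k(ranked_docids, qrels, k=10, min_rel=1):
    """Fraction of the top-k documents whose judged grade is >= min_rel."""
    top_k = ranked_docids[:k]
    hits = sum(1 for docid in top_k if qrels.get(docid, 0) >= min_rel)
    return hits / k


def ndcg_at_k(ranked_docids, qrels, k=10):
    """nDCG@k with linear gains; unjudged documents count as grade 0."""
    def dcg(grades):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

    gains = [qrels.get(docid, 0) for docid in ranked_docids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments for one turn: docid -> grade (hypothetical values).
qrels_turn = {"d1": 3, "d2": 0, "d3": 2, "d4": 1}
ranking = ["d3", "d2", "d1", "d9"]

print(precision_at_k(ranking, qrels_turn, k=4))
print(ndcg_at_k(ranking, qrels_turn, k=4))
```

To apply this to the real judgments, one option is to group the DataFrame by `topic_turn_id` and build one `docid -> rel` dictionary per turn, then average the metric over all turns.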


## OpenSearch

In [73]:
import OpenSearchSimpleAPI as osearch
import pprint as pp
opensearch = osearch.OSsimpleAPI()

{'acknowledged': True, 'shards_acknowledged': True}

----------------------------------------------------------------------------------- INDEX SETTINGS
{'kwiz': {'settings': {'index': {'creation_date': '1728153198145',
                                 'knn': 'true',
                                 'number_of_replicas': '0',
                                 'number_of_shards': '1',
                                 'provided_name': 'kwiz',
                                 'refresh_interval': '-1',
                                 'uuid': 'qkpQ7pcwS7iT1IOTsfwRNg',
                                 'version': {'created': '135238227'}}}}}

----------------------------------------------------------------------------------- INDEX MAPPINGS
{'kwiz': {'mappings': {'properties': {'collection': {'type': 'keyword'},
                                      'contents': {'index_options': 'freqs',
                                                   'similarity': 'BM25',
                                   

Search example:

In [74]:
numdocs = 10
test_query = topics['40_1']
opensearch_results = opensearch.search_body(test_query, numDocs = numdocs)
print(opensearch_results)

  _index _type                                           _id     _score  \
0   kwiz  _doc                                 MARCO_8418211  20.013380   
1   kwiz  _doc                                 MARCO_8418210  19.630722   
2   kwiz  _doc                                 MARCO_7006727  18.632385   
3   kwiz  _doc                                 MARCO_3691951  17.533060   
4   kwiz  _doc  CAR_0287c00622b68eb397d7d4cca0d8bc0842426c87  17.370193   
5   kwiz  _doc                                 MARCO_8216727  17.191683   
6   kwiz  _doc                                 MARCO_2241232  16.628294   
7   kwiz  _doc  CAR_29d114f8b4847f7cceb295e03d2ca00158734ea6  16.475977   
8   kwiz  _doc  CAR_a30a23bfdbf0962e3db4aeeca4ff8e480671aed7  16.471210   
9   kwiz  _doc                                  MARCO_338959  16.266342   

                                    _source.contents  \
0  Report Abuse. Hey Djam look what i found on Wi...   
1  Best Answer: ...The origins of the term house ...   
2  Wha

In [75]:
opensearch.doc_term_vectors('CAR_c370ef5df77de117ff7d02c4b64b52f5bae9abc9')

(23596,
 1405893,
 2299289,
 {'1979': [1, 86, 96],
  '1994': [1, 114, 130],
  'a': [2, 17146, 48957],
  'also': [1, 4940, 6628],
  'and': [2, 19644, 68602],
  'as': [3, 9879, 19825],
  'bastian': [1, 29, 55],
  'black': [1, 465, 695],
  'bully': [1, 6, 9],
  'but': [1, 3822, 5068],
  'bux': [1, 15, 15],
  'chapter': [1, 99, 144],
  'character': [1, 445, 770],
  'characters': [1, 342, 664],
  'early': [1, 1091, 1330],
  "ende's": [1, 15, 19],
  'escape': [1, 90, 106],
  'fantasia': [2, 17, 20],
  'fantasy': [1, 74, 114],
  'film': [3, 683, 1936],
  'first': [1, 2701, 3824],
  'following': [1, 733, 828],
  'from': [2, 7167, 11274],
  'his': [1, 2353, 5180],
  'ii': [1, 240, 287],
  'iii': [2, 102, 124],
  'in': [1, 17309, 52844],
  'introduced': [1, 248, 277],
  'is': [1, 14551, 35833],
  'it': [1, 7548, 13205],
  'jack': [1, 73, 128],
  'james': [1, 136, 161],
  'jason': [1, 41, 61],
  'known': [1, 1873, 2234],
  'michael': [1, 138, 145],
  'neverending': [5, 162, 204],
  'new': [1, 217

In [76]:
opensearch.termvectors_JSON(doc_id='CAR_c370ef5df77de117ff7d02c4b64b52f5bae9abc9')

{'_index': 'kwiz',
 '_type': '_doc',
 '_id': 'CAR_c370ef5df77de117ff7d02c4b64b52f5bae9abc9',
 '_version': 1,
 'found': True,
 'took': 0,
 'term_vectors': {'contents': {'field_statistics': {'sum_doc_freq': 1405893,
    'doc_count': 23596,
    'sum_ttf': 2299289},
   'terms': {'1979': {'doc_freq': 86, 'ttf': 96, 'term_freq': 1},
    '1994': {'doc_freq': 114, 'ttf': 130, 'term_freq': 1},
    'a': {'doc_freq': 17146, 'ttf': 48957, 'term_freq': 2},
    'also': {'doc_freq': 4940, 'ttf': 6628, 'term_freq': 1},
    'and': {'doc_freq': 19644, 'ttf': 68602, 'term_freq': 2},
    'as': {'doc_freq': 9879, 'ttf': 19825, 'term_freq': 3},
    'bastian': {'doc_freq': 29, 'ttf': 55, 'term_freq': 1},
    'black': {'doc_freq': 465, 'ttf': 695, 'term_freq': 1},
    'bully': {'doc_freq': 6, 'ttf': 9, 'term_freq': 1},
    'but': {'doc_freq': 3822, 'ttf': 5068, 'term_freq': 1},
    'bux': {'doc_freq': 15, 'ttf': 15, 'term_freq': 1},
    'chapter': {'doc_freq': 99, 'ttf': 144, 'term_freq': 1},
    'character':
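
Comparing the two outputs above, the tuple returned by `doc_term_vectors` appears to hold `doc_count`, `sum_doc_freq`, `sum_ttf`, and, for each term, the list `[term_freq, doc_freq, ttf]`. These statistics are enough to compute a BM25 term weight by hand. The sketch below uses one common BM25 formulation with `k1=1.2` and `b=0.75` as assumed defaults. The document length is not shown in the output, so the value used here is hypothetical, and the resulting numbers may not match the scores OpenSearch computes internally.

```python
import math


def bm25_term_score(term_freq, doc_freq, doc_count, doc_len, avg_doc_len,
                    k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score."""
    # Rare terms (small doc_freq) receive a larger inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average are penalised, controlled by b.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * term_freq * (k1 + 1) / (term_freq + k1 * length_norm)


# Field statistics copied from the output above.
doc_count, sum_ttf = 23596, 2299289
avg_doc_len = sum_ttf / doc_count  # average number of tokens per document

# 'neverending': term_freq=5, doc_freq=162; 'a': term_freq=2, doc_freq=17146.
doc_len = 90  # hypothetical token count for this document
rare = bm25_term_score(5, 162, doc_count, doc_len, avg_doc_len)
common = bm25_term_score(2, 17146, doc_count, doc_len, avg_doc_len)
print(f"neverending: {rare:.3f}, a: {common:.3f}")
```

The rare term `neverending` scores well above the frequent term `a`, which is the behaviour that makes BM25 useful for ranking.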

In [77]:
opensearch.get_doc_body('CAR_c370ef5df77de117ff7d02c4b64b52f5bae9abc9')

"The NeverEnding Story III: Escape from Fantasia (also known as: The NeverEnding Story III: Return to Fantasia) is a 1994 film and the second sequel to the fantasy film The NeverEnding Story (following the first sequel The NeverEnding Story II: The Next Chapter). It starred Jason James Richter as the principal character Bastian Bux, and Jack Black in one of his early roles as the school bully Slip. This film used the characters from Michael Ende's novel The Neverending Story (1979), but introduced a new storyline."

In [78]:
example_doc = 'The NeverEnding Story III: Escape from Fantasia (also known as: The NeverEnding Story III: Return to Fantasia) is a 1994 film and the second sequel to the fantasy film The NeverEnding Story (following the first sequel The NeverEnding Story II: The Next Chapter). It starred Jason James Richter as the principal character Bastian Bux'
opensearch.query_terms(example_doc,'standard')

' the neverending story iii escape from fantasia also known as the neverending story iii return to fantasia is a 1994 film and the second sequel to the fantasy film the neverending story following the first sequel the neverending story ii the next chapter it starred jason james richter as the principal character bastian bux'

In [79]:
opensearch.analyzer(analyzer="standard", query=example_doc)

{'tokens': [{'token': 'the',
   'start_offset': 0,
   'end_offset': 3,
   'type': '<ALPHANUM>',
   'position': 0},
  {'token': 'neverending',
   'start_offset': 4,
   'end_offset': 15,
   'type': '<ALPHANUM>',
   'position': 1},
  {'token': 'story',
   'start_offset': 16,
   'end_offset': 21,
   'type': '<ALPHANUM>',
   'position': 2},
  {'token': 'iii',
   'start_offset': 22,
   'end_offset': 25,
   'type': '<ALPHANUM>',
   'position': 3},
  {'token': 'escape',
   'start_offset': 27,
   'end_offset': 33,
   'type': '<ALPHANUM>',
   'position': 4},
  {'token': 'from',
   'start_offset': 34,
   'end_offset': 38,
   'type': '<ALPHANUM>',
   'position': 5},
  {'token': 'fantasia',
   'start_offset': 39,
   'end_offset': 47,
   'type': '<ALPHANUM>',
   'position': 6},
  {'token': 'also',
   'start_offset': 49,
   'end_offset': 53,
   'type': '<ALPHANUM>',
   'position': 7},
  {'token': 'known',
   'start_offset': 54,
   'end_offset': 59,
   'type': '<ALPHANUM>',
   'position': 8},
  {'toke

In [80]:
# BM25 Implementation
## Example usage

In [84]:
# pip install bm25s
import bm25s

# Create your corpus here
test_corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=test_corpus)
retriever.index(bm25s.tokenize(test_corpus))

# Query the corpus and get top-k results
query = "does the fish purr like a cat?"
k = 4
test_results, scores = retriever.retrieve(bm25s.tokenize(query), k=k)

# Let's see what we got!
for i in range(len(test_results[0])):
    doc, score = test_results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

Split strings:   0%|          | 0/4 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/4 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/4 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Rank 1 (score: 1.06): a cat is a feline and likes to purr
Rank 2 (score: 0.48): a fish is a creature that lives in water and swims
Rank 3 (score: 0.00): a bird is a beautiful animal that can fly
Rank 4 (score: 0.00): a dog is the human's best friend and loves to play


In [94]:
import re

def preprocess_text(text):
    # Lowercase and remove non-alphanumeric characters
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

# Build the corpus from the documents returned by the search example
corpus = []
for _, row in opensearch_results.iterrows():
    doc_id = row['_id']
    doc_body = opensearch.get_doc_body(doc_id)
    processed_doc = preprocess_text(doc_body)  # Preprocess the document
    corpus.append(processed_doc)
    print(processed_doc)  # Show the processed text

report abuse hey djam look what i found on wiki shame on you the origins of the term house music are disputed some house music enthusiasts claim that the term is derived from the name of a club called the warehousein the late 1970s and early 1980s underground warehouse parties became popular among the teenagers living in the chicago areaeport abuse hey djam look what i found on wiki shame on you the origins of the term house music are disputed some house music enthusiasts claim that the term is derived from the name of a club called the warehouse
best answer the origins of the term house music are disputed some house music enthusiasts claim that the term is derived from the name of a club called the warehousein the late 1970s and early 1980s underground warehouse parties became popular among the teenagers living in the chicago areaeport abuse hey djam look what i found on wiki shame on you the origins of the term house music are disputed some house music enthusiasts claim that the term
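
In the output above, tokens such as `warehousein` suggest that words separated only by punctuation are glued together when the punctuation is deleted outright. A possible variant, sketched below, replaces each run of non-alphanumeric characters with a single space instead. The function name `preprocess_text_spaced` is hypothetical and the sample sentence is made up for illustration.

```python
import re


def preprocess_text_spaced(text):
    """Lowercase, turn non-alphanumeric runs into one space, trim the ends."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


sample = "a club called The Warehouse.In the late 1970s"
print(preprocess_text_spaced(sample))
```

The trade-off is that apostrophes also become spaces, so a possessive such as `ende's` is split into `ende` and `s`.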

In [95]:
# Tokenize the conversational query (not the toy `query` from the bm25s example)
tokenized_query = bm25s.tokenize(test_query)
print("Tokenized Query:", tokenized_query)


Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Tokenized Query: Tokenized(ids=[[0, 1, 2, 3]], vocab={'what': 0, 'origins': 1, 'popular': 2, 'music': 3})


In [101]:
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
query = test_query
processed_query = preprocess_text(query)
print('Unprocessed query: ' + query)
print('processed: ' + processed_query)


Split strings:   0%|          | 0/10 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/10 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/10 [00:00<?, ?it/s]

Unprocessed query: What are the origins of popular music? 
processed: what are the origins of popular music 


In [102]:
k = 4

results, scores = retriever.retrieve(bm25s.tokenize(processed_query), k=k)
for i in range(len(results[0])):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Rank 1 (score: 0.47): find common phrases with their meanings and origins  what these popular sayings and idioms mean and their history here at know your phrase find common phrases with their meanings and origins  what these popular sayings and idioms mean and their history here at know your phrase
Rank 2 (score: 0.44): what is hip hop origins of rap and hip hop   hip hop music also referred to as rap or rap music is a style of popular music which came into existence in roughly the mid 70s but became a large part of modern day pop culture in the late 80s
Rank 3 (score: 0.44): report abuse hey djam look what i found on wiki shame on you the origins of the term house music are disputed some house music enthusiasts claim that the term is derived from the name of a club called the warehousein the late 1970s and early 1980s underground warehouse parties became popular among the teenagers living in the chicago areaeport abuse hey djam look what i found on wiki shame on you the origins of the
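
The ranked output above contains document text only, so the document ids needed to match the relevance judgments are no longer attached. One way around this is to keep a list of ids parallel to `corpus` and write each ranking in the standard TREC run format (`query_id Q0 docid rank score run_tag`). The sketch below assumes that pairing is already available; the function name `to_trec_run`, the run tag, and the scores are hypothetical.

```python
def to_trec_run(topic_turn_id, docids, scores, run_tag="bm25_rerank"):
    """Format one ranking as TREC run lines, best score first."""
    ranked = sorted(zip(docids, scores), key=lambda pair: pair[1], reverse=True)
    return [
        f"{topic_turn_id} Q0 {docid} {rank} {score:.4f} {run_tag}"
        for rank, (docid, score) in enumerate(ranked, start=1)
    ]


# Hypothetical scores for three of the ids returned by the search example.
docids = ["MARCO_8418211", "MARCO_7006727", "MARCO_338959"]
scores = [0.44, 0.47, 0.10]
for line in to_trec_run("40_1", docids, scores):
    print(line)
```

Lines in this format can be compared directly against the `topic_turn_id` and `docid` columns of the relevance judgments.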