# Conversational Search with Retrieval-Augmented Generation

In this notebook you will implement the following steps:

- **Answer selection + evaluation**: Implement a *search-based* conversational framework and evaluate it on conversation topics, each made up of a sequence of conversation turns.
- **Answer ranking**: Implement a *re-ranking method* to sort the initial search results. Evaluate the re-ranked results.
- **Conversation memory**: Implement a conversational context modeling method to keep track of the conversation state. 

Submission dates:
- **20 October**: first stage retrieval + conversation memory + evaluation
- **15 November**: re-ranking with LLM + evaluation
- **15 December**: answer generation + evaluation

## Test bed and conversation topics
The TREC CAsT corpus (http://www.treccast.ai/) for conversational search is indexed on this cluster and can be searched through an OpenSearch API.

The queries and the relevance judgments are available through the `ConvSearchEvaluation` class:

In [1]:
import TRECCASTeval as trec
import numpy as np
import pprint

pp = pprint.PrettyPrinter(indent=4)

test_bed = trec.ConvSearchEvaluation()

print()
print("========================================== Training conversations =====")
topics = {}
for topic in test_bed.train_topics:
    conv_id = topic['number']

    if conv_id not in (1, 2, 4, 7, 15, 17, 18, 22, 23, 24, 25, 27, 30):
        continue

    print()
    print(conv_id, "  ", topic['title'])

    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        topic_turn_id = '%d_%d' % (conv_id, turn_id)
        
        print(topic_turn_id, utterance)
        topics[topic_turn_id] = utterance

print()
print("========================================== Test conversations =====")
for topic in test_bed.test_topics:
    conv_id = topic['number']

    if conv_id not in (31, 32, 33, 34, 37, 40, 49, 50, 54, 56, 58, 59, 61, 67, 68, 69, 75, 77, 78, 79):
        continue

    print()
    print(conv_id, "  ", topic['title'])

    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        topic_turn_id = '%d_%d' % (conv_id, turn_id)
        
        print(topic_turn_id, utterance)
        topics[topic_turn_id] = utterance




1    Career choice for Nursing and Physician's Assistant
1_1 What is a physician's assistant?
1_2 What are the educational requirements required to become one?
1_3 What does it cost?
1_4 What's the average starting salary in the UK?
1_5 What about in the US?
1_6 What school subjects are needed to become a registered nurse?
1_7 What is the PA average salary vs an RN?
1_8 What the difference between a PA and a nurse practitioner?
1_9 Do NPs or PAs make more?
1_10 Is a PA above a NP?
1_11 What is the fastest way to become a NP?
1_12 How much longer does it take to become a doctor after being an NP?

2    Goat breeds
2_1 What are the main breeds of goat?
2_2 Tell me about boer goats.
2_3 What breed is good for meat?
2_4 Are angora goats good for it?
2_5 What about boer goats?
2_6 What are pygmies used for?
2_7 What is the best for fiber production?
2_8 How long do Angora goats live?
2_9 Can you milk them?
2_10 How many can you have per acre?
2_11 Are they profitable?

4    The Neolithic 
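Turns such as `1_5 What about in the US?` or `2_9 Can you milk them?` cannot be searched in isolation, because the entity they refer to only appears in earlier turns. The sketch below shows one possible context-modeling baseline: it prepends the first utterance of the conversation and the previous utterance to the current one. This is an assumption chosen for illustration, not the method prescribed by this notebook, and the helper name `build_contextual_queries` is hypothetical. It takes a `topic_turn_id -> utterance` mapping shaped like `topics`.

```python
from collections import defaultdict


def build_contextual_queries(topics):
    """Expand each utterance with the first and the previous turn of its conversation."""
    # Group the turns by conversation id, keeping the numeric turn number for sorting.
    conversations = defaultdict(list)
    for topic_turn_id, utterance in topics.items():
        conv_id, turn_id = topic_turn_id.split('_')
        conversations[conv_id].append((int(turn_id), utterance))

    queries = {}
    for conv_id, turns in conversations.items():
        turns.sort()
        first_utterance = turns[0][1]
        for index, (turn_id, utterance) in enumerate(turns):
            parts = []
            if index > 0:
                parts.append(first_utterance)      # topic anchor
            if index > 1:
                parts.append(turns[index - 1][1])  # most recent context
            parts.append(utterance)
            queries['%s_%d' % (conv_id, turn_id)] = ' '.join(parts)
    return queries


# Toy input copied from the first three turns of conversation 1.
sample = {
    '1_1': "What is a physician's assistant?",
    '1_2': 'What are the educational requirements required to become one?',
    '1_3': 'What does it cost?',
}
contextual = build_contextual_queries(sample)
print(contextual['1_3'])
```

Concatenation keeps the topic terms in the query at the cost of longer, noisier queries; which turns to keep is a design choice worth evaluating.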

In [3]:
test_bed.test_relevance_judgments

Unnamed: 0,topic_turn_id,dummy,docid,rel
0,31_1,Q0,CAR_116d829c4c800c2fc70f11692fec5e8c7e975250,0
1,31_1,Q0,CAR_1463f964653c5c9f614a0a88d26b175e4a8120f1,1
2,31_1,Q0,CAR_172e16e89ea3d5546e53384a27c3be299bcfe968,2
3,31_1,Q0,CAR_1c93ef499a0c2856c4a857b0cb4720c380dda476,0
4,31_1,Q0,CAR_2174ad0aa50712ff24035c23f59a3c2b43267650,3
...,...,...,...,...
29345,79_9,Q0,MARCO_8795229,0
29346,79_9,Q0,MARCO_8795231,0
29347,79_9,Q0,MARCO_8795233,0
29348,79_9,Q0,MARCO_8795236,0
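
The judgments are graded: the `rel` column takes values such as 0, 1, 2 and 3 in the rows above. As a sketch of how a ranked list could be scored against them, the functions below compute precision@k and nDCG@k for a single turn. They are illustrative helpers, not part of `TRECCASTeval`. They assume the judgments of one turn have been collected into a `docid -> rel` dictionary, that `rel >= 1` counts as relevant, and that the gain is the raw grade (other gain functions are common).

```python
import math


def precision_at_k(ranked_docids, judgments, k=10, min_rel=1):
    """Fraction of the top-k documents judged at least `min_rel`."""
    top_k = ranked_docids[:k]
    hits = sum(1 for docid in top_k if judgments.get(docid, 0) >= min_rel)
    return hits / k


def ndcg_at_k(ranked_docids, judgments, k=10):
    """nDCG@k with linear gain; unjudged documents count as rel = 0."""
    def dcg(gains):
        return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))

    gains = [judgments.get(docid, 0) for docid in ranked_docids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy subset: only the five 31_1 rows displayed above.
judgments_31_1 = {
    'CAR_116d829c4c800c2fc70f11692fec5e8c7e975250': 0,
    'CAR_1463f964653c5c9f614a0a88d26b175e4a8120f1': 1,
    'CAR_172e16e89ea3d5546e53384a27c3be299bcfe968': 2,
    'CAR_1c93ef499a0c2856c4a857b0cb4720c380dda476': 0,
    'CAR_2174ad0aa50712ff24035c23f59a3c2b43267650': 3,
}
ranking = [
    'CAR_116d829c4c800c2fc70f11692fec5e8c7e975250',
    'CAR_2174ad0aa50712ff24035c23f59a3c2b43267650',
]
print(precision_at_k(ranking, judgments_31_1, k=2))
print(ndcg_at_k(ranking, judgments_31_1, k=2))
```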


## OpenSearch

In [2]:
import OpenSearchSimpleAPI as osearch
import pprint as pp
opensearch = osearch.OSsimpleAPI()

{'acknowledged': True, 'shards_acknowledged': True}

----------------------------------------------------------------------------------- INDEX SETTINGS
{'kwiz': {'settings': {'index': {'creation_date': '1728032876294',
                                 'knn': 'true',
                                 'number_of_replicas': '0',
                                 'number_of_shards': '1',
                                 'provided_name': 'kwiz',
                                 'refresh_interval': '-1',
                                 'uuid': 'Ys75obTLQACAypDZuHLJrw',
                                 'version': {'created': '135238227'}}}}}

----------------------------------------------------------------------------------- INDEX MAPPINGS
{'kwiz': {'mappings': {'properties': {'collection': {'type': 'keyword'},
                                      'contents': {'index_options': 'freqs',
                                                   'similarity': 'BM25',
                                   

Search example:

In [3]:
results = opensearch.search_body(topics['33_1'], numDocs = 10)
print(results)

  _index _type                   _id     _score  \
0   kwiz  _doc  YrjfVpIBXGI5MNTnS6DH  31.390944   
1   kwiz  _doc  pmzSVpIBXGI5MNTnXJUv  29.324303   
2   kwiz  _doc  YbjfVpIBXGI5MNTnS6DH  28.118290   
3   kwiz  _doc  rbjfVpIBXGI5MNTnRHnc  28.004580   
4   kwiz  _doc  wkXLVpIBXGI5MNTnWDe-  27.588179   
5   kwiz  _doc  YLjfVpIBXGI5MNTnS6DH  27.412436   
6   kwiz  _doc  RarcVpIBXGI5MNTn4psK  27.102676   
7   kwiz  _doc  SKrcVpIBXGI5MNTn4psK  26.969175   
8   kwiz  _doc  ZrjfVpIBXGI5MNTnS6DH  26.597956   
9   kwiz  _doc  o2zSVpIBXGI5MNTnXJUv  25.748140   

                                    _source.contents    _source.doc  \
0  The film only adapts the first half of the boo...  MARCO_6213095   
1  The name of Atrayu's horse in the neverending ...  MARCO_5954180   
2  Goodbye to the old Neverending Story and Hello...  MARCO_6213094   
3  1980s film Trivia. 1  The Karate Kid. 2  The N...  MARCO_6202855   
4  Noah Hathaway. Noah Leslie Hathaway (born Nove...  MARCO_1311833   
5  The Never

In [4]:
opensearch.doc_term_vectors('rbjfVpIBXGI5MNTnRHnc')

(8635155,
 346405856,
 493726771,
 {'1': [2, 1441199, 1919815],
  '1980s': [1, 8279, 8881],
  '2': [1, 1183572, 1541053],
  '3': [1, 847420, 1055628],
  'adventure': [1, 9332, 10610],
  'and': [1, 6278073, 12806471],
  'better': [1, 113939, 127744],
  'bill': [1, 42352, 57467],
  'dead': [1, 28735, 35922],
  'excellent': [1, 26141, 28546],
  'film': [1, 53077, 74329],
  'flash': [1, 12336, 19425],
  'jack': [1, 13502, 18877],
  'jumpin': [1, 57, 85],
  'karate': [1, 877, 1238],
  'kid': [1, 7305, 8727],
  'labyrinth': [1, 741, 1005],
  'neverending': [1, 94, 122],
  'off': [1, 189818, 230660],
  'poltergeist': [1, 127, 170],
  'story': [1, 54254, 66160],
  "ted's": [1, 82, 100],
  'the': [2, 7527202, 28021854],
  'trivia': [1, 3286, 3637]})

In [5]:
opensearch.termvectors_JSON(doc_id='YrjfVpIBXGI5MNTnS6DH')

{'_index': 'kwiz',
 '_type': '_doc',
 '_id': 'YrjfVpIBXGI5MNTnS6DH',
 '_version': 1,
 'found': True,
 'took': 3,
 'term_vectors': {'contents': {'field_statistics': {'sum_doc_freq': 346405856,
    'doc_count': 8635155,
    'sum_ttf': 493726771},
   'terms': {'a': {'doc_freq': 5741370, 'ttf': 12507272, 'term_freq': 1},
    'adapts': {'doc_freq': 695, 'ttf': 734, 'term_freq': 1},
    'and': {'doc_freq': 6278073, 'ttf': 12806471, 'term_freq': 1},
    'as': {'doc_freq': 2148718, 'ttf': 3165308, 'term_freq': 3},
    'basis': {'doc_freq': 41878, 'ttf': 48012, 'term_freq': 2},
    'be': {'doc_freq': 1726899, 'ttf': 2308175, 'term_freq': 2},
    'book': {'doc_freq': 68972, 'ttf': 88764, 'term_freq': 3},
    'chapter': {'doc_freq': 22418, 'ttf': 30400, 'term_freq': 1},
    'chapter.the': {'doc_freq': 6, 'ttf': 6, 'term_freq': 1},
    'completely': {'doc_freq': 42830, 'ttf': 47096, 'term_freq': 1},
    'consequently': {'doc_freq': 4057, 'ttf': 4231, 'term_freq': 1},
    'convey': {'doc_freq': 362
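
Comparing the two outputs above, the tuple returned by `doc_term_vectors` appears to hold `(doc_count, sum_doc_freq, sum_ttf, terms)`, with each term mapped to `[term_freq, doc_freq, ttf]`: for example, `'and'` is `[1, 6278073, 12806471]` in the tuple and `doc_freq` 6278073, `ttf` 12806471 in the JSON. These statistics are enough to compute term weights by hand. The sketch below computes the inverse document frequency in the form used by Lucene's BM25 similarity, `ln(1 + (N - df + 0.5) / (df + 0.5))`. The helper name `bm25_idf` is hypothetical, and whether the index applies exactly this formula with default parameters is an assumption.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: rare terms get a high weight, frequent terms a low one."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Statistics copied from the doc_term_vectors output above.
doc_count = 8635155
sum_ttf = 493726771
doc_freqs = {'neverending': 94, 'film': 53077, 'the': 7527202}

for term, doc_freq in doc_freqs.items():
    print(term, round(bm25_idf(doc_freq, doc_count), 3))

# Average field length in tokens, the other collection statistic BM25 needs.
print('average length:', round(sum_ttf / doc_count, 2))
```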

In [6]:
opensearch.get_doc_body('YrjfVpIBXGI5MNTnS6DH')

'The film only adapts the first half of the book, and consequently does not convey the message of the title as it was portrayed in the novel. The second half of the book would subsequently be used as the rough basis for the second film, The NeverEnding Story II: The Next Chapter.The third film, The NeverEnding Story III: Escape From Fantasia, features a completely original plot.he second half of the book would subsequently be used as the rough basis for the second film, The NeverEnding Story II: The Next Chapter.'

In [7]:
example_doc = 'The film only adapts the first half of the book, and consequently does not convey the message of the title as it was portrayed in the novel. The second half of the book would subsequently be used as the rough basis for the second film, The NeverEnding Story II: The Next Chapter.The third film, The NeverEnding Story III: Escape From Fantasia, features a completely original plot.he second half of the book would subsequently be used as the rough basis for the second film, The NeverEnding Story II: The Next Chapter.'
opensearch.query_terms(example_doc,'standard')

' the film only adapts the first half of the book and consequently does not convey the message of the title as it was portrayed in the novel the second half of the book would subsequently be used as the rough basis for the second film the neverending story ii the next chapter.the third film the neverending story iii escape from fantasia features a completely original plot.he second half of the book would subsequently be used as the rough basis for the second film the neverending story ii the next chapter'
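
The normalized text still contains the fused token `chapter.the`, which also shows up as a term with `doc_freq` 6 in the term vectors above: the passage is missing a space after the period in "Chapter.The", so it is kept as one token. If this matters for matching, one option is to repair such text before analysis. The sketch below is a heuristic under stated assumptions: the helper `split_fused_sentences` is hypothetical, and the regex only handles a lowercase letter, a period and an uppercase letter in sequence, so `plot.he` in the same passage is left untouched.

```python
import re

# A period squeezed between a lowercase and an uppercase letter, e.g. "Chapter.The".
_FUSED_SENTENCE = re.compile(r'(?<=[a-z])\.(?=[A-Z])')


def split_fused_sentences(text):
    """Insert the missing space after a sentence-final period."""
    return _FUSED_SENTENCE.sub('. ', text)


print(split_fused_sentences('The Next Chapter.The third film'))
```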

In [8]:
opensearch.analyzer(analyzer="standard", query=example_doc)

{'tokens': [{'token': 'the',
   'start_offset': 0,
   'end_offset': 3,
   'type': '<ALPHANUM>',
   'position': 0},
  {'token': 'film',
   'start_offset': 4,
   'end_offset': 8,
   'type': '<ALPHANUM>',
   'position': 1},
  {'token': 'only',
   'start_offset': 9,
   'end_offset': 13,
   'type': '<ALPHANUM>',
   'position': 2},
  {'token': 'adapts',
   'start_offset': 14,
   'end_offset': 20,
   'type': '<ALPHANUM>',
   'position': 3},
  {'token': 'the',
   'start_offset': 21,
   'end_offset': 24,
   'type': '<ALPHANUM>',
   'position': 4},
  {'token': 'first',
   'start_offset': 25,
   'end_offset': 30,
   'type': '<ALPHANUM>',
   'position': 5},
  {'token': 'half',
   'start_offset': 31,
   'end_offset': 35,
   'type': '<ALPHANUM>',
   'position': 6},
  {'token': 'of',
   'start_offset': 36,
   'end_offset': 38,
   'type': '<ALPHANUM>',
   'position': 7},
  {'token': 'the',
   'start_offset': 39,
   'end_offset': 42,
   'type': '<ALPHANUM>',
   'position': 8},
  {'token': 'book',
   's