#### Uncomment the `nltk.download('stopwords')` line if you haven't downloaded the stopwords corpus yet

In [1]:
import TRECCASTeval as trec
import numpy as np
import pprint
import pandas as pd
import OpenSearchSimpleAPI as osearch

import re
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [2]:
pp = pprint.PrettyPrinter(indent=4)

test_bed = trec.ConvSearchEvaluation()

# Initialize stop words and stemmer
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

### `preprocess_text`: normalize, filter and stem the text
This function takes in raw text (usually conversational utterances) and performs the following:

- Converts the text to lowercase.
- Removes non-alphanumeric characters (punctuation, symbols).
- Removes common stopwords using the NLTK stopwords list.
- Stems each word using the Porter stemmer to reduce words to their base forms.

In [3]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    return text
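
To see what the first three steps do to a real utterance, here is a minimal sketch that traces one sentence through them. It assumes a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, a hypothetical stand-in for NLTK's English list) and leaves out stemming so that it runs without NLTK installed.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"what", "is", "a", "the", "are"}

def trace_preprocessing(text):
    """Return the intermediate results of lowercasing, stripping and filtering."""
    lowered = text.lower()
    # Same pattern as preprocess_text: drop everything except a-z, 0-9, whitespace
    stripped = re.sub(r'[^a-z0-9\s]', '', lowered)
    kept = [word for word in stripped.split() if word not in DEMO_STOP_WORDS]
    return lowered, stripped, kept

lowered, stripped, kept = trace_preprocessing("What is a physician's assistant?")
print(lowered)   # what is a physician's assistant?
print(stripped)  # what is a physicians assistant
print(kept)      # ['physicians', 'assistant']
```

Note that removing the apostrophe fuses `physician's` into `physicians`; it is the stemming step in `preprocess_text` that later reduces such tokens to a shared base form.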

### Training and Test Data Processing

Here, we iterate over the training and test topics provided by the test_bed. We keep only the conversations whose IDs appear in a fixed list and skip the rest. For each turn, the query is built from the preprocessed text of all previous utterances (produced by preprocess_text) followed by the raw current utterance, which simulates a growing conversational context.

Key Variables:
- **previous_query_tokenized**: Accumulates the preprocessed previous utterances of the conversation, i.e. the conversation history.
- **topics**: Stores the query for each turn (preprocessed history plus raw utterance), keyed by `<conversation ID>_<turn number>`.

(Printing of the queries is optional and used mostly for debug purposes)
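
The accumulation scheme can be sketched in isolation as follows. This is an illustrative helper, not part of the notebook: `accumulate_queries` is a hypothetical name, turns are numbered with `enumerate` instead of reading `turn['number']`, and `str.lower` stands in for `preprocess_text`.

```python
def accumulate_queries(conv_id, utterances, preprocess):
    """Build one query per turn: preprocessed history + raw current utterance."""
    queries = {}
    history = ''
    for turn_id, utterance in enumerate(utterances, start=1):
        queries['%d_%d' % (conv_id, turn_id)] = history + utterance
        # Only the preprocessed form of the utterance enters the history
        history += preprocess(utterance) + ' '
    return queries

demo = accumulate_queries(1, ["What is X?", "How does it work?"], preprocess=str.lower)
for key, query in demo.items():
    print(key, query)
# 1_1 What is X?
# 1_2 what is x? How does it work?
```

The first turn is sent unchanged because the history is still empty; every later turn carries the whole processed history as a prefix.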

In [4]:
# Set to True to print the accumulated queries (debugging only)
print_queries = False

if print_queries:
    print("========================================== Training conversations =====")
topics = {}
for topic in test_bed.train_topics:
    conv_id = topic['number']

    # Keep only the selected training conversations
    if conv_id not in (1, 2, 4, 7, 15, 17, 18, 22, 23, 24, 25, 27, 30):
        continue

    if print_queries:
        print()
        print(conv_id, "  ", topic['title'])

    previous_query_tokenized = ''
    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        # Query = preprocessed history + raw current utterance
        updated_utterance = previous_query_tokenized + utterance
        previous_query_tokenized += preprocess_text(utterance) + ' '
        topic_turn_id = '%d_%d' % (conv_id, turn_id)

        if print_queries:
            print(topic_turn_id, updated_utterance)
        topics[topic_turn_id] = updated_utterance

if print_queries:
    print()
    print("========================================== Test conversations =====")
for topic in test_bed.test_topics:
    conv_id = topic['number']

    # Keep only the selected test conversations
    if conv_id not in (31, 32, 33, 34, 37, 40, 49, 50, 54, 56, 58, 59, 61, 67, 68, 69, 75, 77, 78, 79):
        continue

    #print(conv_id, "  ", topic['title'])

    previous_query_tokenized = ''
    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        # Query = preprocessed history + raw current utterance
        updated_utterance = previous_query_tokenized + utterance
        previous_query_tokenized += preprocess_text(utterance) + ' '
        topic_turn_id = '%d_%d' % (conv_id, turn_id)

        if print_queries:
            print(topic_turn_id, updated_utterance)
        topics[topic_turn_id] = updated_utterance

test_bed.test_relevance_judgments

| Unnamed: 0 | topic_turn_id | dummy | docid | rel |
|---|---|---|---|---|
| 0 | 31_1 | Q0 | CAR_116d829c4c800c2fc70f11692fec5e8c7e975250 | 0 |
| 1 | 31_1 | Q0 | CAR_1463f964653c5c9f614a0a88d26b175e4a8120f1 | 1 |
| 2 | 31_1 | Q0 | CAR_172e16e89ea3d5546e53384a27c3be299bcfe968 | 2 |
| 3 | 31_1 | Q0 | CAR_1c93ef499a0c2856c4a857b0cb4720c380dda476 | 0 |
| 4 | 31_1 | Q0 | CAR_2174ad0aa50712ff24035c23f59a3c2b43267650 | 3 |
| ... | ... | ... | ... | ... |
| 29345 | 79_9 | Q0 | MARCO_8795229 | 0 |
| 29346 | 79_9 | Q0 | MARCO_8795231 | 0 |
| 29347 | 79_9 | Q0 | MARCO_8795233 | 0 |
| 29348 | 79_9 | Q0 | MARCO_8795236 | 0 |
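
Each row of the judgments pairs a turn with a document and a graded relevance label. As a rough sketch of how such rows can be turned into a lookup structure, the snippet below uses the five `31_1` rows shown above; the helper name `build_qrels` and the choice to treat `rel > 0` as relevant are assumptions made for illustration, not something the notebook defines.

```python
from collections import defaultdict

# Rows copied from the output above: (topic_turn_id, dummy, docid, rel)
rows = [
    ("31_1", "Q0", "CAR_116d829c4c800c2fc70f11692fec5e8c7e975250", 0),
    ("31_1", "Q0", "CAR_1463f964653c5c9f614a0a88d26b175e4a8120f1", 1),
    ("31_1", "Q0", "CAR_172e16e89ea3d5546e53384a27c3be299bcfe968", 2),
    ("31_1", "Q0", "CAR_1c93ef499a0c2856c4a857b0cb4720c380dda476", 0),
    ("31_1", "Q0", "CAR_2174ad0aa50712ff24035c23f59a3c2b43267650", 3),
]

def build_qrels(rows):
    """Map topic_turn_id -> {docid: rel} for constant-time lookups."""
    qrels = defaultdict(dict)
    for topic_turn_id, _dummy, docid, rel in rows:
        qrels[topic_turn_id][docid] = rel
    return qrels

qrels = build_qrels(rows)
# Assumption: any label above 0 counts as relevant
relevant = [docid for docid, rel in qrels["31_1"].items() if rel > 0]
print(len(relevant))  # 3
```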


# OpenSearch implementation

### Setup

The OpenSearch API is initialized, confirming index creation with the following settings:

**Index name**: kwiz   
**Similarity**: BM25 for text ranking; the index default similarity is set to LM Jelinek-Mercer (λ=0.7)   
**Shards**: 1 shard, no replicas   
**Documents**: 23,596 documents indexed   
**k-NN enabled**: Sentence embeddings available for vector-based queries
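
For orientation, an index body of roughly the following shape would yield the settings listed here. This is only a sketch: the actual body is built inside `OSsimpleAPI`, which is not shown in this notebook, so the variable name `index_body` and the exact layout are assumptions.

```python
import json

# Assumed index body; values mirror the settings reported by the API.
index_body = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "knn": True,               # enables k-NN (vector) search on the index
            "refresh_interval": "-1",  # disables automatic refresh
            "similarity": {
                "default": {"type": "LMJelinekMercer", "lambda": 0.7}
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```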

In [5]:
opensearch = osearch.OSsimpleAPI()

{'acknowledged': True, 'shards_acknowledged': True}

----------------------------------------------------------------------------------- INDEX SETTINGS
{'kwiz': {'settings': {'index': {'creation_date': '1728153198145',
                                 'knn': 'true',
                                 'number_of_replicas': '0',
                                 'number_of_shards': '1',
                                 'provided_name': 'kwiz',
                                 'refresh_interval': '-1',
                                 'similarity': {'default': {'lambda': '0.7',
                                                            'type': 'LMJelinekMercer'}},
                                 'uuid': 'qkpQ7pcwS7iT1IOTsfwRNg',
                                 'version': {'created': '135238227'}}}}}

----------------------------------------------------------------------------------- INDEX MAPPINGS
{'kwiz': {'mappings': {'properties': {'collection': {'type': 'keyword'},
                   

As a sanity check, we run a single query (turn 61_7) and retrieve the top 100 documents from OpenSearch. Printing the results confirms that the query executes and that the API returns valid documents.

In [6]:
numdocs = 100
test_query = topics['61_7']

opensearch_results = opensearch.search_body(test_query, numDocs = numdocs)
print(opensearch_results)

   _index _type                                           _id     _score  \
0    kwiz  _doc  CAR_54ddfb93ad52e7e7bdf960f5cd3164f683eb757b  42.747547   
1    kwiz  _doc  CAR_4b18b521b30a9d32d2c2852b05a5fffce336ca4e  39.716260   
2    kwiz  _doc  CAR_db3beebe1d9e72b74daeec818f076a1e6a794b9d  36.619880   
3    kwiz  _doc                                 MARCO_3765773  36.438580   
4    kwiz  _doc  CAR_56f5109e7dcc45e4bcf50cbc789a3fff94ab1575  35.619743   
..    ...   ...                                           ...        ...   
95   kwiz  _doc                                 MARCO_6139465  25.450195   
96   kwiz  _doc  CAR_613140b2eab12517d1da86bb42d2688934a3d4e1  25.309765   
97   kwiz  _doc                                 MARCO_8019905  25.259228   
98   kwiz  _doc  CAR_d8c0ddb5a2cec36eec0eb592c845665ee060e847  25.202800   
99   kwiz  _doc                                 MARCO_8344507  25.190685   

                                     _source.contents  \
0   The Justice League is a fi

## BM25-based Retrieval

This section retrieves documents with the BM25 ranking function for every query in `topics`.   
For each query, the top 3 documents are retrieved from OpenSearch, and for each of them we extract the body (passage) and its ID.   
The results are stored in a pandas DataFrame for easier visualization and analysis.
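
For readers unfamiliar with BM25, the following is a minimal pure-Python sketch of the textbook Okapi BM25 score. It is not the code OpenSearch runs, and the engine's implementation may differ in constant factors; the function name `bm25_score`, the toy documents, and the parameter values `k1=1.2`, `b=0.75` (common defaults) are assumptions for illustration.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25: sum over query terms of idf * saturated, length-normalized tf."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        n = doc_freqs.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score

# Toy collection of already preprocessed passages
docs = [["physician", "assist", "educ"], ["justic", "leagu", "film"]]
doc_freqs = {}
for doc in docs:
    for term in set(doc):
        doc_freqs[term] = doc_freqs.get(term, 0) + 1
avg_len = sum(len(d) for d in docs) / len(docs)

query = ["physician", "assist"]
scores = [bm25_score(query, d, doc_freqs, len(docs), avg_len) for d in docs]
print(scores)
```

The first passage shares both query terms and gets a positive score, while the second shares none and scores zero, which is the ordering the retrieval step relies on.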

In [7]:
k = 3  # number of passages to keep per query

BM25data = []
for topic_turn_id, query in topics.items():
    opensearch_results = opensearch.search_body(query, numDocs=k)
    best_docs = []
    best_passages = []
    for _, row in opensearch_results.iterrows():
        doc_id = row['_id']
        best_docs.append(doc_id)
        # Fetch the passage text for this document ID
        best_passages.append(opensearch.get_doc_body(doc_id))
    BM25data.append({'turn': topic_turn_id, 'query': query,
                     'top passages': best_passages, 'doc ids': best_docs})

df = pd.DataFrame(BM25data)
print(df)


     turn                                              query  \
0     1_1                   What is a physician's assistant?   
1     1_2  physician assist What are the educational requ...   
2     1_3  physician assist educ requir requir becom one ...   
3     1_4  physician assist educ requir requir becom one ...   
4     1_5  physician assist educ requir requir becom one ...   
..    ...                                                ...   
309  79_5  taught sociolog main contribut august comt rol...   
310  79_6  taught sociolog main contribut august comt rol...   
311  79_7  taught sociolog main contribut august comt rol...   
312  79_8  taught sociolog main contribut august comt rol...   
313  79_9  taught sociolog main contribut august comt rol...   

                                          top passages  \
0    [What is the difference between a medical assi...   
1    [What Education Do I Need for a Career As a Ph...   
2    [NEW: Follow this link to view the updated 201...   