This notebook demonstrates query-time near-duplicate detection (NDD) by running the same query twice, first with the duplicates left in and then with them removed.

To use this notebook, first follow [these instructions](https://sycamore.readthedocs.io/en/stable/welcome_to_sycamore/get_started.html) and start the Sycamore containers using `docker compose up`.  Then ingest the college credit card marketing agreements data.  The documents come from [data.gov](https://catalog.data.gov/dataset/college-credit-card-marketing-agreements-data), but we have made them accessible via S3.  There are two ingestion commands to choose from, depending on whether the demo should be realistic or quick:

- All 1911 PDFs (~4 hours): `docker compose run sycamore_crawler_s3 aryn-public cccmad`
- 35 PDFs needed for demo (minutes): `docker compose run sycamore_crawler_s3 aryn-public cccmad-tiny`

For the full dataset, approximately 17 documents will fail to ingest for various reasons.  The Sycamore importer will keep retrying them, but the problems will persist.  This example does not need 100% of the documents to be ingested; once the importer queue shrinks to ~100 documents, it's OK to proceed.  Because ingestion varies from run to run, results may not exactly match what's described here.

In [1]:
import json
import requests
import warnings
import urllib3
warnings.filterwarnings("ignore", category=urllib3.exceptions.InsecureRequestWarning)

The code below retrieves the embedding model ID from OpenSearch.  This ID is different every time OpenSearch is set up, and we need to supply it in our query, so we fetch it each time rather than hard-coding it.

In [2]:
def get_model_id():
    query = {
        'query': {
            'bool': {
                'must': [
                    {
                        'match': {'name': 'all-MiniLM-L6-v2'},
                    },
                    {
                        'term': {'model_config.model_type': 'bert'},
                    },
                ],
            },
        },
    }
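    # The query DSL goes in the JSON body.  verify=False skips TLS certificate
    # verification, which is why InsecureRequestWarning is silenced above.
    with requests.get('https://opensearch:9200/_plugins/_ml/models/_search', json=query, verify=False) as resp:
        res = json.loads(resp.text)
        # Take the ID of the first (best-matching) model.
        return res['hits']['hits'][0]['_id']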
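
If the embedding model has not been registered yet, the search returns no hits and indexing `[0]` raises a bare `IndexError`.  The following is a minimal sketch of more defensive parsing, assuming the response has the usual OpenSearch `hits.hits` shape.  The helper name `extract_model_id` and the sample response are hypothetical and exist only for illustration.

```python
def extract_model_id(search_response):
    """Return the ID of the first matching model, or raise a descriptive error."""
    hits = search_response.get('hits', {}).get('hits', [])
    if not hits:
        raise LookupError('no matching embedding model found; is it deployed?')
    return hits[0]['_id']


# Hypothetical response, trimmed to the fields the function reads.
sample = {'hits': {'hits': [{'_id': 'abc123', '_source': {'name': 'all-MiniLM-L6-v2'}}]}}
print(extract_model_id(sample))
```

Separating the parsing from the HTTP call also makes it possible to exercise the logic without a running cluster.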

This next function performs the supplied query and prints out both the retrieved chunks and the AI-generated answer.  For clarity, the text chunks are truncated at 80 characters.

In [3]:
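def do_query(query_dict):
    """Run the query through the hybrid RAG pipeline; print the top hits and the answer."""
    url = 'https://opensearch:9200/demoindex0/_search?search_pipeline=hybrid_rag_pipeline'
    # Post the caller's query dictionary as the JSON request body.
    with requests.post(url, json=query_dict, verify=False) as resp:
        res = json.loads(resp.text)
        hits = res['hits']['hits']
        # Print at most the first 10 hits; slicing tolerates shorter result lists.
        for i, hit in enumerate(hits[:10]):
            text = hit['_source']['text_representation']
            # Flatten newlines and truncate to 80 characters for readability.
            text = text.replace('\n', ' ')[:80]
            print(f'[{i}] {text}')
        answer = res['ext']['retrieval_augmented_generation']['answer']
        print(f'[ANSWER]\n{answer}')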
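
To make the response shape that `do_query` relies on easier to see, here is a small sketch that applies the same formatting to a mock response, so it runs without a cluster.  The helper `format_hits` and the contents of `mock` are made up for illustration; only the key paths (`hits.hits[]._source.text_representation` and `ext.retrieval_augmented_generation.answer`) come from the function above.

```python
def format_hits(res, limit=10, width=80):
    """Return one display line per hit, with newlines flattened and text cut to `width`."""
    lines = []
    for i, hit in enumerate(res['hits']['hits'][:limit]):
        text = hit['_source']['text_representation'].replace('\n', ' ')[:width]
        lines.append(f'[{i}] {text}')
    return lines


# Hypothetical, heavily trimmed response containing only the parts that are read.
mock = {
    'hits': {'hits': [
        {'_source': {'text_representation': 'Arbitration:\nAny controversy or claim'}},
        {'_source': {'text_representation': 'x' * 200}},
    ]},
    'ext': {'retrieval_augmented_generation': {'answer': 'A made-up answer.'}},
}
for line in format_hits(mock):
    print(line)
print(mock['ext']['retrieval_augmented_generation']['answer'])
```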

First, we run the query without near-duplicate detection.  We do this by not asking for `shingles` in `_source`.  In OpenSearch, `_source` is where we list the fields that we want to retrieve for each hit.  The results contain many copies of the first chunk, so the generated answer effectively draws on only one distinct chunk (or two if using the tiny dataset).

In [4]:
short_query = 'arbitration'
long_query = 'summarize the rules of arbitration'
query = {
    '_source': [
        'text_representation',
    ],
    'query': {
        'hybrid': {
            'queries': [
                {
                    'match': {'text_representation': short_query},
                },
                {
                    'neural': {
                        'embedding': {
                            'query_text': short_query,
                            'k': 100,
                            'model_id': get_model_id(),
                        },
                    },
                },
            ],
        },
    },
    'ext': {
        'generative_qa_parameters': {
            'llm_question': long_query,
            'context_size': 5,
            'llm_model': 'gpt-4',
        },
    },
    'size': 100,
}
do_query(query)

[0] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[1] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[2] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[3] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[4] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[5] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[6] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[7] (e) Should an arbitrator refuse or be unable to proceed with arbitration proceed
[8] The American Arbitration Association ("AAX') shall conduct the arbitration, (b) 
[9] The American Arbitration Association ("AAX') shall conduct the arbitration, (b) 
[ANSWER]
The rules of arbitration state that any controversy or claim arising from an agreement that cannot be resolved through mediation will be sett
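
Since the presence of `shingles` in `_source` is the only switch between the two modes, it can be convenient to derive one query from the other without modifying the original.  The sketch below shows one way to do that with a deep copy.  The helper name `with_source_fields` is hypothetical, and `base` is a cut-down stand-in for the full query dictionary.

```python
import copy


def with_source_fields(query, *fields):
    """Return a deep copy of `query` whose `_source` list also includes `fields`."""
    new_query = copy.deepcopy(query)
    source = new_query.setdefault('_source', [])
    for field in fields:
        if field not in source:  # avoid listing the same field twice
            source.append(field)
    return new_query


# Cut-down stand-in for the real query; only `_source` matters here.
base = {'_source': ['text_representation'], 'size': 100}
ndd = with_source_fields(base, 'shingles')
print(base['_source'])
print(ndd['_source'])
```

Because the original dictionary is left untouched, both variants stay available for side-by-side comparison, and re-running a cell does not add the field a second time.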

For the next query, we reuse the previous query data structure with one small modification: we append `shingles` to the list of fields to be retrieved.  This enables NDD processing, which needs the `shingles` field in order to detect near-duplicates.  Now, when we run the query, there is much more diversity in the retrieved chunks and the generated answer is richer.

In [5]:
query['_source'].append('shingles')
do_query(query)

[0] Arbitration: Any controversy or claim arising out of or in relation to this Agre
[1] (e) Should an arbitrator refuse or be unable to proceed with arbitration proceed
[2] The American Arbitration Association ("AAX') shall conduct the arbitration, (b) 
[3] (c) as efficient and expeditious a manner as practicable and, in this connection
[4] Any claim or dispute ("Dispute") by FIA or Supplier, against the other, or again
[5] The arbitration hearini! shall be held in such neutral location as the parties m
[6] (h) The arbitrator of Arbitration Panel is instructed to schedule promptly all d
[7] or paftW summary judgment is granted, the non-prevailing patty may not raise as 
[8] The arbitrator of Arbitration Panel is instructed to schedule promptly all disco
[9] The arbaration hearing shall be held in such neutral location as the parties may
[ANSWER]
The rules of arbitration typically involve a dispute being settled by an arbitrator in accordance with the commercial Arbitration Rules of an
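
For intuition about why a `shingles` field makes near-duplicates detectable, here is a generic sketch of shingle-based filtering.  It is not Sycamore's implementation: the word-level shingles, the Jaccard similarity measure, the 0.6 threshold, and all function names are assumptions chosen for illustration, and the sample chunks are made up.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(chunks, threshold=0.6, k=3):
    """Keep each chunk, in rank order, unless it is too similar to one already kept."""
    kept = []  # list of (text, shingle set) pairs
    for text in chunks:
        sh = shingles(text, k)
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((text, sh))
    return [text for text, _ in kept]


# Made-up chunks: the second differs from the first only by a trailing period.
chunks = [
    'Any controversy or claim arising out of this Agreement shall be settled by arbitration',
    'Any controversy or claim arising out of this Agreement shall be settled by arbitration.',
    'The arbitration hearing shall be held in such neutral location as the parties agree',
]
for text in drop_near_duplicates(chunks):
    print(text)
```

In this toy run the second chunk is dropped because it shares 11 of 13 distinct shingles with the first, while the third shares none and is kept.  Filtering in rank order means the highest-ranked copy of each passage survives.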