This notebook shows the results of query-time near-duplicate detection (NDD).  It runs the same query twice, first with near-duplicates left in the results and then with them removed.  The content here is inspired by this [blog post](https://www.aryn.ai/post/near-duplicate-detection-in-sycamore-what-is-it-good-for).

To use this notebook:
1. Follow [these instructions](https://sycamore.readthedocs.io/en/stable/welcome_to_sycamore/get_started.html) and start the Sycamore containers using `docker compose up`.
2. Make sure to start with a clean slate by running `docker compose run reset`.
3. Ingest the college credit card marketing agreements data.  The documents come from [data.gov](https://catalog.data.gov/dataset/college-credit-card-marketing-agreements-data), but we have made them accessible via S3.  There are two ingestion commands to choose from, depending on how much time is available:

    - 35 PDFs needed for demo (minutes): `docker compose run sycamore_crawler_s3 aryn-public cccmad-tiny`
    - All ~2000 PDFs (4+ hours): `docker compose run sycamore_crawler_s3 aryn-public cccmad`

The results below are from the 35-document dataset, which we expect most users to choose.  The full dataset will produce different results.  Even with the small dataset, results may vary because of platform differences and variation in OpenAI responses.  Ingestion doesn't have to reach 100% of the documents before you run this example; once the lion's share completes, it's OK to proceed.

More information about NDD can be found [here](https://sycamore.readthedocs.io/en/stable/querying_data/dedup.html).  Join our [Slack channel](https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg).

In [1]:
import json
import requests
import warnings
import urllib3
warnings.filterwarnings("ignore", category=urllib3.exceptions.InsecureRequestWarning)

<br>

---
The code below retrieves the embedding model ID from OpenSearch.  This ID is different every time OpenSearch is set up, and the query must supply it, so we fetch it each time instead of hard-coding it.

In [2]:
def get_model_id():
    query = {
        'query': {
            'bool': {
                'must': [
                    {
                        'match': {'name': 'all-MiniLM-L6-v2'},
                    },
                    {
                        'term': {'model_config.model_type': 'bert'},
                    },
                ],
            },
        },
    }
    with requests.get('https://opensearch:9200/_plugins/_ml/models/_search', json=query, verify=False) as resp:
        res = json.loads(resp.text)
        return res['hits']['hits'][0]['_id']
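
`get_model_id` indexes the first hit directly, so a search that matches no model would surface as a bare `IndexError`.  A minimal sketch of a more explicit lookup, assuming the same response shape as above; `extract_model_id` is a hypothetical helper name, not part of Sycamore or OpenSearch:

```python
def extract_model_id(search_response):
    """Return the `_id` of the first hit in a model-search response.

    Hypothetical helper for illustration; it only inspects the parsed
    JSON dictionary and makes no network calls.
    """
    hits = search_response.get('hits', {}).get('hits', [])
    if not hits:
        # Fail with a readable message instead of an IndexError.
        raise LookupError('no matching embedding model found in OpenSearch')
    return hits[0]['_id']
```

It could be called on the parsed response inside `get_model_id` in place of the final `return` expression.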

<br><hr>
This next function performs the supplied query and prints both the first 10 retrieved chunks and the AI-generated answer.  For clarity, the text chunks are truncated at 80 characters.

In [3]:
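def do_query(query_dict):
    """Run a query through the hybrid RAG pipeline; print the top hits and the answer."""
    url = 'https://opensearch:9200/demoindex0/_search?search_pipeline=hybrid_rag_pipeline'
    # Send the supplied query to OpenSearch.
    with requests.post(url, json=query_dict, verify=False) as resp:
        res = json.loads(resp.text)
        hits = res['hits']['hits']
        # Slice instead of indexing so that fewer than 10 hits is not an error.
        for i, hit in enumerate(hits[:10]):
            text = hit['_source']['text_representation']
            # Flatten newlines and truncate to 80 characters for display.
            text = text.replace('\n', ' ')[:80]
            print(f'[{i+1}] {text}')
        answer = res['ext']['retrieval_augmented_generation']['answer']
        print(f'[ANSWER]\n{answer}')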

<br>

---
First, we run the query without near-duplicate detection.  We do this by not asking for `shingles` in `_source`.  In OpenSearch, `_source` is where we list the fields that we want to retrieve for each hit.

If everything is set up and running properly, the numbered results will contain many repeated lines.  In the small dataset, there are only two documents in the top 5 (the RAG context).  The resulting generated answer starts by denying the premise of the question and then goes on to summarize one source.  The answer doesn't reflect the breadth of the dataset.

In [4]:
query_str = 'how are liabilities and assets affected by force majeure'
query = {
    '_source': [
        'text_representation',
    ],
    'query': {
        'hybrid': {
            'queries': [
                {
                    'match': {'text_representation': query_str},
                },
                {
                    'neural': {
                        'embedding': {
                            'query_text': query_str,
                            'k': 100,
                            'model_id': get_model_id(),
                        },
                    },
                },
            ],
        },
    },
    'ext': {
        'generative_qa_parameters': {
            'llm_question': query_str,
            'context_size': 5,
            'llm_model': 'gpt-4',
        },
    },
    'size': 100,
}
do_query(query)

[1] liabilities exceed its assets, or is adjudicated insolvent, or takes advantage o
[2] liabilities exceed its assets, or is adjudicated insolvent, or takes advantage o
[3] liabilities exceed its assets, or is adjudicated insolvent, or takes advantage o
[4] The Party affected by the Event of Force Majeure shall make all reasonable effor
[5] The Party affected by the Event of Force Majeure shall make all reasonable effor
[6] The Party affected by the Event of Force Majeure shall make all reasonable effor
[7] The Party affected by the Event of Force Majeure shall make all reasonable effor
[8] The Party affected by the Event of Force Majeure shall make all reasonable effor
[9] The Party affected by the Event of Force Majeure shall make all reasonable effor
[10] The Party affected by the Event of Force Majeure shall make all reasonable effor
[ANSWER]
Force majeure does not directly affect liabilities and assets. However, if a party is unable to perform its obligations due to a force majeu
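
The repetition in the numbered output can be quantified by counting distinct snippets.  The sketch below is a rough proxy only: it compares the 80-character prefixes exactly as printed, not whole documents, and `snippet_counts` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

def snippet_counts(snippets, top_k=None):
    """Count how often each snippet occurs among the first `top_k` entries."""
    window = snippets if top_k is None else snippets[:top_k]
    return Counter(window)

# Snippets copied from the numbered output above (already truncated at 80 characters).
a = 'liabilities exceed its assets, or is adjudicated insolvent, or takes advantage o'
b = 'The Party affected by the Event of Force Majeure shall make all reasonable effor'
snippets = [a] * 3 + [b] * 7

print(len(snippet_counts(snippets)))           # distinct snippets among all 10 hits: 2
print(len(snippet_counts(snippets, top_k=5)))  # distinct snippets in the RAG context: 2
```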

<br>

---
For the next query, we reuse the previous query data structure with one change: we append `shingles` to the list of fields to be retrieved.  This enables NDD processing; without `shingles`, near-duplicates can't be detected.  Now, when we run the query, there is much more diversity in the retrieved chunks.  On the small dataset, just one duplicate appears to get past NDD, and it looks like a revision of another chunk.  The generated answer cites more sources and gives a richer explanation.  It covers topics like modifying obligations and prior obligations, which the first answer did not include.  Crucially, it mentions assets twice.

In [5]:
query['_source'].append('shingles')
do_query(query)

[1] liabilities exceed its assets, or is adjudicated insolvent, or takes advantage o
[2] The Party affected by the Event of Force Majeure shall make all reasonable effor
[3] The Party affected by the Event of Force Majeure shall use all reasonable endeav
[4] under this Account Agreement, or to enjoy any of its benefits because of fire, n
[5] is Insolvengy. If the Recipient: adjudicated insolvent or bankrupt, (c) takes ad
[6] 0) performance of or failure to perform any of its obligations herein if such de
[7] 10.2.7 is in the possession of the Recipient Party at the time the Confidential 
[8] Cardholders unless and to the extent that the loss was caused by MBNA's gross ne
[9] A. In the event of any material breach of this Agreement by MITFCU or MIT AA, th
[10] Force Majeure. Neither party shall be liable to the other for delays in the perf
[ANSWER]
Force majeure events, such as natural disasters or governmental actions, can suspend or extend obligations under an agreement, including fin
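
The general idea behind shingle-based near-duplicate detection can be illustrated with a small, generic sketch.  This is not Sycamore's implementation: it assumes word trigrams as shingles, Jaccard similarity as the comparison, and an arbitrary threshold of 0.8, and every function name here is hypothetical.

```python
def word_shingles(text, n=3):
    """Return the set of overlapping n-word tuples in `text` (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(texts, threshold=0.8, n=3):
    """Keep each text only if it is below `threshold` similarity to every kept text."""
    kept, kept_shingles = [], []
    for text in texts:
        shingles = word_shingles(text, n)
        if all(jaccard(shingles, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(shingles)
    return kept

# Truncated snippets copied verbatim from the outputs above.
x = 'The Party affected by the Event of Force Majeure shall make all reasonable effor'
z = 'The Party affected by the Event of Force Majeure shall use all reasonable endeav'

print(jaccard(word_shingles(x), word_shingles(z)))  # 0.5: similar wording, not a near-duplicate here
print(len(drop_near_duplicates([x, x, z])))         # 2: the exact repeat is dropped, the variant stays
```

Under these assumptions, exact repeats collapse to one entry while a reworded clause survives, which mirrors the pattern in the second set of results.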