##### This notebook shows the results of query-time near-duplicate detection (NDD).  It runs the same query twice, once with duplicates left in and once with them removed.  The content here is inspired by this [blog post](https://www.aryn.ai/post/near-duplicate-detection-in-sycamore-what-is-it-good-for).


##### The Aryn Partitioner in this job is configured to use the Aryn Partitioning Service to provide fast, GPU-powered performance. Go to [aryn.ai/sign-up](https://aryn.ai/sign-up) to get a free API key for the service. This is the recommended configuration.

##### You can also run the Aryn Partitioner locally by setting `use_partitioning_service` to `False`. The partitioner can run on a CPU, but an NVIDIA GPU is recommended for good performance.


To use this notebook:
1. Follow [these instructions](https://sycamore.readthedocs.io/en/stable/welcome_to_sycamore/get_started.html) and start the Sycamore containers using `docker compose up`.
2. It's best to start with a clean slate by running `docker compose run reset`.
3. Ingest the college credit card marketing agreements data.  The documents come from [data.gov](https://catalog.data.gov/dataset/college-credit-card-marketing-agreements-data), but we have made them accessible via a public S3 bucket.  There are two ingestion methods to choose from, depending on how much time is available:

    - JSON: (minutes) ingest pre-processed data represented as JSON into OpenSearch
    - PDF: (hours) fully process all ~2000 PDFs and ingest them into OpenSearch

Set `use_json` below accordingly.  Also set `save_resources` as desired.

The results should be the same for both methods, although there may be variations due to platform differences and OpenAI variation.

More information about NDD can be found [here](https://sycamore.readthedocs.io/en/stable/querying_data/dedup.html).  Join our [Slack channel](https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg).

In [None]:
import os
import json
import requests
import warnings
import urllib3
import multiprocessing
import pyarrow.fs
import sycamore
from sycamore.functions.tokenizer import HuggingFaceTokenizer
from sycamore.transforms import COALESCE_WHITESPACE
from sycamore.transforms.merge_elements import MarkedMerger
from sycamore.transforms.partition import ArynPartitioner
from sycamore.transforms.embed import SentenceTransformerEmbedder

warnings.filterwarnings('ignore', category=urllib3.exceptions.InsecureRequestWarning)

In [None]:
from sycamore.utils.aryn_config import ArynConfig, _DEFAULT_PATH
assert ArynConfig.get_aryn_api_key() != "", f"Unable to find aryn API key.  Looked in {_DEFAULT_PATH}"

If the above assertion fails, you can either set the environment variable `ARYN_API_KEY` and restart Jupyter,
or create a YAML file at the path shown in the assertion error that looks like:

```
aryn_token: "YOUR-ARYN-API-KEY"
```

If neither of those options works, you can set the key directly in this notebook, although this is unsafe:
```
import os
os.environ["ARYN_API_KEY"] = "UNSAFE-ARYN-API-KEY-LOCATION" 
```

but beware that it is easy to accidentally commit the notebook file and have it include your key.
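
A middle ground that keeps the key out of the saved notebook is to prompt for it at run time. The sketch below is only an illustration: `ensure_aryn_api_key` is a hypothetical helper, not part of Sycamore, and it assumes (as described above) that the key is picked up from the `ARYN_API_KEY` environment variable.

```python
import getpass
import os


def ensure_aryn_api_key(prompt=getpass.getpass):
    """Return the Aryn API key, prompting only if ARYN_API_KEY is unset."""
    key = os.environ.get("ARYN_API_KEY", "")
    if not key:
        # getpass hides the typed value, so it never appears in a cell
        # or in the notebook's saved output.
        key = prompt("Aryn API key: ").strip()
        # Stored only in this process's environment, not on disk.
        os.environ["ARYN_API_KEY"] = key
    return key
```

Calling `ensure_aryn_api_key()` in a cell before the assertion would prompt once per kernel session; the `prompt` parameter exists only so the function can be exercised without a terminal.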

In [None]:
# Set to False to ingest the PDFs from scratch, which takes an hour or more
use_json = True

# Set to False to use all available CPU and memory
save_resources = True

# Different hostnames inside and outside Docker compose environment
opensearch_host = 'opensearch' if os.path.exists('/.dockerenv') else 'localhost'

index_name = 'demoindex0'

In [None]:
osrch_args = {
    'hosts': [{'host': opensearch_host, 'port': 9200}],
    'http_compress': True,
    'http_auth': ('admin', 'admin'),
    'use_ssl': True,
    'verify_certs': False,
    'ssl_assert_hostname': False,
    'ssl_show_warn': False,
    'timeout': 120,
}

idx_settings = {
    'body': {
        'settings': {
            'index.knn': True,
        },
        'mappings': {
            'properties': {
                'embedding': {
                    'type': 'knn_vector',
                    'dimension': 384,
                    'method': {'name': 'hnsw', 'engine': 'faiss'},
                },
            },
        },
    },
}

In [None]:
tokenizer = HuggingFaceTokenizer('thenlper/gte-small')
embedder = SentenceTransformerEmbedder(model_name='sentence-transformers/all-MiniLM-L6-v2', batch_size=100)

fsys = pyarrow.fs.S3FileSystem(anonymous=True, region='us-east-1')
ctx = sycamore.init()

if use_json:
    # Fast way: pre-processed DocSet as JSON...
    path = 's3://aryn-public/cccmad-json'
    ds = ctx.read.json_document(path, filesystem=fsys)
else:
    # Slow way: process PDF documents via Sycamore pipeline...
    path = 's3://aryn-public/cccmad'
    ds = (
        ctx.read.binary(path, binary_format='pdf', filesystem=fsys)
        .partition(partitioner=ArynPartitioner())
        .regex_replace(COALESCE_WHITESPACE)
        .mark_bbox_preset(tokenizer=tokenizer)
        .merge(merger=MarkedMerger())
        .spread_properties(['path'])
        .split_elements(tokenizer=tokenizer, max_tokens=512)
        .explode()
        .sketch()
        .embed(embedder=embedder)
    )

ds.write.opensearch(
    os_client_args=osrch_args,
    index_name=index_name,
    index_settings=idx_settings,
)

<br>

---
The code below retrieves the embedding model ID from OpenSearch.  This ID is different every time OpenSearch is set up, and the query must include it, so we look it up on each run rather than hard-coding it.
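
If the model has not been registered yet, the search returns no hits and indexing the first hit raises an unhelpful `IndexError`. A minimal sketch of a more defensive extraction step follows; `extract_model_id` is a hypothetical helper, and it assumes the response has the usual `hits.hits` shape that `get_model_id` reads.

```python
def extract_model_id(search_response):
    """Return the ID of the first model hit, or raise a descriptive error."""
    hits = search_response.get("hits", {}).get("hits", [])
    if not hits:
        # An empty result most likely means the model is not deployed yet.
        raise LookupError(
            "No matching embedding model found; is OpenSearch fully initialized?"
        )
    return hits[0]["_id"]
```

With this helper, `get_model_id` could return `extract_model_id(res)` in place of the direct index lookup.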

In [None]:
def get_model_id():
    query = {
        'query': {
            'bool': {
                'must': [
                    {
                        'match': {'name': 'all-MiniLM-L6-v2'},
                    },
                    {
                        'term': {'model_config.model_type': 'bert'},
                    },
                ],
            },
        },
    }
    with requests.get(f'https://{opensearch_host}:9200/_plugins/_ml/models/_search', json=query, verify=False) as resp:
        res = json.loads(resp.text)
        return res['hits']['hits'][0]['_id']

<br><hr>
This next function performs the given query and prints out both the top ten retrieved chunks and the AI-generated answer.  For clarity, the text chunks are truncated at 80 characters.

In [None]:
def do_query(query_dict):
    url = f'https://{opensearch_host}:9200/{index_name}/_search?search_pipeline=hybrid_rag_pipeline'
    # Send the argument itself rather than relying on a global named `query`
    with requests.post(url, json=query_dict, verify=False) as resp:
        res = json.loads(resp.text)
        hits = res['hits']['hits']
        # Slice so that fewer than ten hits does not raise IndexError
        for i, hit in enumerate(hits[:10], start=1):
            text = hit['_source']['text_representation']
            # Flatten newlines and truncate to 80 characters for display
            text = text.replace('\n', ' ')[:80]
            print(f'[{i}] {text}')
        answer = res['ext']['retrieval_augmented_generation']['answer']
        print(f'[ANSWER]\n{answer}')

<br>

---
First, we run the query without near-duplicate detection.  We do this by not asking for `shingles` in `_source`.  In OpenSearch queries, `_source` is where we list the fields that we want to retrieve for each hit.

If everything is set up and running properly, the numbered results will contain many repeated lines.  The top 10 hits, which form the RAG context, all come from a single document.  The resulting generated answer starts by saying no information was found and then goes on to summarize the single source.  The answer doesn't reflect the breadth of the dataset.
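
One way to quantify the repetition is to count distinct truncated chunks among the top hits. The following is a sketch only: `summarize_top_hits` is a hypothetical helper, the hit dictionaries are made up for illustration, and it assumes hits carry `_source.text_representation` as in `do_query`.

```python
from collections import Counter


def summarize_top_hits(hits, top_n=10, width=80):
    """Count how often each truncated chunk appears among the first top_n hits."""
    prefixes = [
        hit["_source"]["text_representation"].replace("\n", " ")[:width]
        for hit in hits[:top_n]
    ]
    return Counter(prefixes)


# Made-up hits for illustration; not real query output.
example_hits = [
    {"_source": {"text_representation": "Force majeure\nclause A"}},
    {"_source": {"text_representation": "Force majeure clause A"}},
    {"_source": {"text_representation": "Insolvency clause B"}},
]
counts = summarize_top_hits(example_hits)
print(len(counts), "distinct chunks among", sum(counts.values()), "hits")
```

A low distinct count relative to the number of hits indicates that the RAG context is dominated by repeated text.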

In [None]:
query_str = 'how does force majeure affect assets and insolvency'
query = {
    '_source': [
        'text_representation',
    ],
    'query': {
        'hybrid': {
            'queries': [
                {
                    'match': {'text_representation': query_str},
                },
                {
                    'neural': {
                        'embedding': {
                            'query_text': query_str,
                            'k': 100,
                            'model_id': get_model_id(),
                        },
                    },
                },
            ],
        },
    },
    'ext': {
        'generative_qa_parameters': {
            'llm_question': query_str,
            'context_size': 10,
            'llm_model': 'gpt-4',
        },
    },
    'size': 100,
}
do_query(query)

<br>

---
For the next query, we re-use the previous query data structure, but we modify it slightly.  We append `shingles` to the list of fields to be retrieved.  This enables NDD processing; without `shingles` it can't detect near-duplicates.  Now, when we run the query there is much more diversity in the retrieved chunks.  There appear to be four unique chunks after NDD.  Looking at the generated answer, there are more cited sources and the explanation is richer.  It specifically addresses insolvency, which was part of the question.
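
To build intuition for why shingle data makes near-duplicates detectable, here is a toy illustration of the general idea: split each text into overlapping word n-grams ("shingles") and treat two texts as near-duplicates when their shingle sets overlap heavily. This is not Sycamore's or OpenSearch's actual implementation; the function names, the 3-word shingle size, and the 0.7 threshold are all assumptions chosen for the example.

```python
def shingles(text, n=3):
    """Return the set of overlapping n-word tuples in text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(texts, threshold=0.7):
    """Keep a text only if it is not too similar to any text already kept."""
    kept = []  # list of (text, shingle set) pairs
    for text in texts:
        sh = shingles(text)
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((text, sh))
    return [text for text, _ in kept]


# Made-up chunks: the first two differ only in the final word.
chunks = [
    "The bank may terminate this agreement upon insolvency of the university",
    "The bank may terminate this agreement upon insolvency of the association",
    "Force majeure events excuse delayed performance by either party",
]
unique_chunks = drop_near_duplicates(chunks)
print(len(unique_chunks), "of", len(chunks), "chunks kept")
```

Earlier results win ties in this sketch, so ranking order is preserved and only the later near-copies are dropped.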

In [None]:
query['_source'].append('shingles')
do_query(query)