<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/05_2_MongoDB.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 05-2. SemanticSearch_MongoDB


## Overview  
In this exercise, we use MongoDB Atlas to embed documents and build a vector space. You will learn how to create a Retriever object and run hybrid searches, and how to add keyword search on top of vector search by using an Atlas Search index in MongoDB.
 
## Purpose of the Exercise
This exercise demonstrates how to generate embeddings with the Solar Embedding API and store them in a vector space. By the end of this tutorial, you will know how to deploy a cluster on MongoDB Atlas, run queries with the Hybrid Search method, and perform keyword searches against a MongoDB Atlas Search index, combining embedding-based and keyword-based search techniques.



## MongoDB Atlas

To use MongoDB Atlas, you must first deploy a cluster. To get started, follow the Atlas [quick start](https://www.mongodb.com/docs/atlas/getting-started/).
Then create an Atlas database and an Atlas Vector Search index for searching vectors.

Follow the MongoDB Atlas guides below:
- [Create new cluster](https://www.mongodb.com/docs/atlas/tutorial/create-new-cluster/)
- [Connect to database](https://www.mongodb.com/docs/atlas/driver-connection/)
- [Create an Atlas Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/)
- [Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

### Benefits
- You can use MongoDB itself as the vector store.

In [18]:
! pip3 install -qU  markdownify  langchain-upstage rank_bm25 pymongo langchain langchain-mongodb python-dotenv

In [20]:
import os
import getpass
import warnings

warnings.filterwarnings("ignore")

UPSTAGE_API_KEY = getpass.getpass("Enter your API Key")
_ = os.environ.setdefault("UPSTAGE_API_KEY", UPSTAGE_API_KEY)

MONGODB_ATLAS_CLUSTER_URI = getpass.getpass("Enter your MONGODB ATLAS CLUSTER URI")
_ = os.environ.setdefault("MONGODB_ATLAS_CLUSTER_URI", MONGODB_ATLAS_CLUSTER_URI)

In [1]:
# @title set API key
import os
import getpass
from pprint import pprint
import warnings

warnings.filterwarnings("ignore")

from IPython import get_ipython

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata
    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = getpass.getpass("Enter your Upstage API key: ")


In [263]:
from pymongo.mongo_client import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_upstage import UpstageEmbeddings
import os

"""
Your connection string should use following format:
mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
"""
MONGODB_ATLAS_CLUSTER_URI = os.environ["MONGODB_ATLAS_CLUSTER_URI"]

# Connect to your Atlas cluster
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
# Define collection and index name
DB_NAME = "langchain_db"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"

db_collection = client[DB_NAME][COLLECTION_NAME]

# Create Indexes

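Create the Atlas Vector Search index with the field definition below. `path` is the document field that stores the vector, and `numDimensions` must match the size of the embeddings (4096 here, matching the `solar-embedding-1-large` embeddings used in this exercise).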

[Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

```json
{
  "fields": [
    {
      "numDimensions": 4096,
      "path": "embedding",
      "similarity": "dotProduct",
      "type": "vector"
    }
  ]
}
```
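
The same definition can also be built in Python and submitted through PyMongo instead of the Atlas UI. The following is a minimal sketch under stated assumptions: the helper names `vector_index_definition` and `create_vector_index` are hypothetical, and the `SearchIndexModel(..., type="vectorSearch")` form assumes a recent PyMongo release (4.7 or later) and a cluster that allows search indexes to be created from a driver.

```python
import json


def vector_index_definition(path="embedding", num_dimensions=4096, similarity="dotProduct"):
    """Build the Atlas Vector Search index definition (hypothetical helper)."""
    allowed = {"euclidean", "cosine", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, index_name="vector_index"):
    """Submit the definition through PyMongo.

    Assumption: PyMongo >= 4.7. This function is not called in this sketch,
    so the import below only runs when you actually create an index.
    """
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=index_name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)


# Print the definition so it can be compared with the JSON in the Atlas UI.
print(json.dumps(vector_index_definition(), indent=2))
```

With a live cluster you would call `create_vector_index(db_collection, ATLAS_VECTOR_SEARCH_INDEX_NAME)` once; index builds are asynchronous, so searches may return nothing until the index is ready.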

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


sample_text = [
    "Korea is a beautiful country to visit in the spring.",
    "The best time to visit Korea is in the fall.",
    "Best way to find bug is using unit test.",
    "Python is a great programming language for beginners.",
    "Sung Kim is a great teacher.",
]

splits = RecursiveCharacterTextSplitter().create_documents(sample_text)

print(splits)

vectorstore = MongoDBAtlasVectorSearch.from_documents(
    documents=splits,
    collection=db_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

[Document(page_content='Korea is a beautiful country to visit in the spring.'), Document(page_content='The best time to visit Korea is in the fall.'), Document(page_content='Best way to find bug is using unit test.'), Document(page_content='Python is a great programming language for beginners.'), Document(page_content='Sung Kim is a great teacher.')]
batch_size: 5


In [30]:
db_collection.find_one({"text": "Hello, new sentence"}) is not None

False

In [31]:
db_collection.find_one({"text": splits[0].page_content}) is not None

True

In [25]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [26]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 125
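
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as HTML tags); the function name `sliding_chunks` is hypothetical, and the example only shows how consecutive chunks share their boundary characters.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers so the overlap is easy to see.
text = "".join(str(i % 10) for i in range(25))
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The last three characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either chunk.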


In [27]:
from langchain_mongodb import MongoDBAtlasVectorSearch

vectorstore = MongoDBAtlasVectorSearch(
    collection=db_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
retriever = vectorstore.as_retriever()


unique_splits = [
    split
    for split in splits
    if not db_collection.find_one({"text": split.page_content})
]
print(len(unique_splits))

# 3. Embed & indexing if it's not in the vector store
if len(unique_splits) > 0:
    MongoDBAtlasVectorSearch.from_documents(
        documents=unique_splits,
        collection=db_collection,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    )

0
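
The list comprehension above issues one `find_one` call per split. As an alternative sketch, the texts that already exist can be fetched in a single query with `$in` and filtered locally. The helper name `filter_new_texts` is hypothetical, and `FakeCollection` is only a stand-in so the example runs without a cluster; with a real PyMongo collection you would pass `db_collection` instead.

```python
def filter_new_texts(collection, texts, field="text"):
    """Return the texts not yet stored in `collection`, in order and without repeats."""
    cursor = collection.find({field: {"$in": list(texts)}}, {field: 1, "_id": 0})
    existing = {doc[field] for doc in cursor}
    seen = set()
    new_texts = []
    for text in texts:
        if text not in existing and text not in seen:
            seen.add(text)
            new_texts.append(text)
    return new_texts


class FakeCollection:
    """In-memory stand-in that mimics only the `find` call used above."""

    def __init__(self, docs):
        self._docs = docs

    def find(self, query, projection=None):
        field, condition = next(iter(query.items()))
        wanted = set(condition["$in"])
        return [doc for doc in self._docs if doc.get(field) in wanted]


stored = FakeCollection([{"text": "a"}, {"text": "b"}])
print(filter_new_texts(stored, ["a", "c", "c", "d"]))
```

For very large batches, it may be worth splitting the `$in` list into smaller groups so that a single query document stays well within MongoDB's document size limit.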


In [None]:
# Query the retriever
search_result = retriever.invoke("How to find problems in code?")
print(search_result)
print(search_result[0].page_content[:100])

# Hybrid search

We will carry out a hybrid search that combines BM25 and vector search techniques. This process will unfold in two separate stages: first within LangChain, and then via a query in MongoDB. Moreover, we will apply reciprocal rank fusion to merge results from different search methods into a single, unified outcome.
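
As a rough illustration of reciprocal rank fusion, the sketch below merges two ranked lists with per-retriever weights. It is a simplified stand-in, not the library's implementation: the function name `weighted_rrf`, the document ids, and the constant `c=60` (a commonly used default) are assumptions made for this example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids; each hit adds weight / (rank + c)."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks start at 1, so the top hit of a list contributes weight / (1 + c).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from a keyword retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

`doc_b` ends up first because it sits near the top of both lists, while documents found by only one retriever fall behind even when they rank well there.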

To enable keyword search, add an additional **Search Index** to MongoDB Atlas with the following definition. The MongoDB query later in this exercise refers to this index by the name `text`.
```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "text": {
        "type": "string"
      }
    }
  }
}
```

In [264]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain_mongodb import MongoDBAtlasVectorSearch

# Using Langchain


# 1. initializing retrievers
vectorstore = MongoDBAtlasVectorSearch(
    collection=db_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

vector_retriever = vectorstore.as_retriever()
bm25_retriever = BM25Retriever.from_documents(splits)

# set weights for reciprocal rank fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever], weights=[0.4, 0.6]
)

In [265]:
# 2. Query using hybrid retriever
docs = hybrid_retriever.invoke("How to find prblems in code?")
print(docs)

[Document(page_content="<p id='13' style='font-size:16px'>introduced bugs immediately. Several bug-finding techni-<br>ques could be used, including code inspections, unit testing,<br>and the use of static analysis tools. Since these steps would<br>be taken right after a code change was made, the developer<br>would still retain the full mental context of the change. This<br>holds promise for reducing the time required to find<br>software bugs and reducing the time that bugs stay resident<br>in software before removal.</p><br>", metadata={'_id': {'$oid': '66500f045d6a17e9c9316f7d'}, 'total_pages': 16, 'type': 'html', 'split': 'none', 'title': 'Classifying Software Changes: Clean or Buggy?'}), Document(page_content="<p id='42' style='font-size:16px'>determine which kinds of function return values must be<br>checked. For example, if the return value of foo was always<br>verified in the previous project history but was not verified<br>in the current source code, it is very suspicious. Livsh

In [266]:
# Using MongoDB query

# One of the great things about the MongoDB Atlas vector store is the variety of queries we can use.
# Perform a hybrid search using a MongoDB aggregation query.

# The reciprocal rank score is calculated as follows:
# 1.0 / (document position in the results + vector or full-text penalty + constant value)


def hybrid_search(collection, query):
    # `collection` is a PyMongo collection such as db_collection, not a MongoClient.
    vector_penalty = 4
    keyword_penalty = 6
    return collection.aggregate(
        [
            {
                # $vectorSearch stage to search the embedding field for the query, which is passed as a vector embedding in the queryVector field.
                # The query considers up to 10 nearest-neighbor candidates (numCandidates) and limits the results to 5 documents. This stage returns the sorted documents from the semantic search.
                "$vectorSearch": {
                    "index": ATLAS_VECTOR_SEARCH_INDEX_NAME,
                    "path": "embedding",
                    "queryVector": UpstageEmbeddings(
                        model="solar-embedding-1-large"
                    ).embed_query(query),
                    "numCandidates": 10,
                    "limit": 5,
                }
            },
            {
                # $group stage to group all the documents in the results from the semantic search in a field named docs.
                "$group": {"_id": None, "docs": {"$push": "$$ROOT"}}
            },
            {
                # $unwind stage to unwind the array of documents in the docs field and store the position of the document in the results array in a field named rank.
                "$unwind": {"path": "$docs", "includeArrayIndex": "rank"}
            },
            {
                # $addFields stage to add a new field named vs_score that contains the reciprocal rank score for each document in the results.
                # Here, reciprocal rank score is calculated by dividing 1.0 by the sum of rank, the vector_penalty weight, and a constant value of 1.
                "$addFields": {
                    "vs_score": {
                        "$divide": [1.0, {"$add": ["$rank", vector_penalty, 1]}]
                    }
                }
            },
            {
                # $project stage to include only the following fields in the results: vs_score, _id, title, text
                "$project": {
                    "vs_score": 1,
                    "_id": "$docs._id",
                    "title": "$docs.title",
                    "text": "$docs.text",
                }
            },
            {
                # $unionWith stage to combine the results from the preceding stages with the results of the following stages in the sub-pipeline
                "$unionWith": {
                    "coll": COLLECTION_NAME,
                    "pipeline": [
                        {
                            # $search stage to search for documents that contain the query phrase in the text field, using the Atlas Search index named "text". This stage returns the sorted documents from the keyword search.
                            "$search": {
                                "index": "text",
                                "phrase": {"query": query, "path": "text"},
                            }
                        },
                        {
                            # $limit stage to limit the output to 15 results only.
                            "$limit": 15
                        },
                        {
                            # $group stage to group all the documents from the keyword search in a field named docs.
                            "$group": {"_id": None, "docs": {"$push": "$$ROOT"}}
                        },
                        {
                            # $unwind stage to unwind the array of documents in the docs field and store the position of the document in the results array in a field named rank.
                            "$unwind": {"path": "$docs", "includeArrayIndex": "rank"}
                        },
                        {
                            # $addFields stage to add a new field named kws_score that contains the reciprocal rank score for each document in the results.
                            # Here, reciprocal rank score is calculated by dividing 1.0 by the sum of the value of rank, the full_text penalty weight, and a constant value of 1.
                            "$addFields": {
                                "kws_score": {
                                    "$divide": [
                                        1.0,
                                        {"$add": ["$rank", keyword_penalty, 1]},
                                    ]
                                }
                            }
                        },
                        {
                            # $project stage to include only the following fields in the results: kws_score, _id, title, text
                            "$project": {
                                "kws_score": 1,
                                "_id": "$docs._id",
                                "title": "$docs.title",
                                "text": "$docs.text",
                            }
                        },
                    ],
                }
            },
            {
                # $group stage to merge documents returned by both searches into one entry per _id,
                # so that vs_score and kws_score end up on the same document before they are summed.
                "$group": {
                    "_id": "$_id",
                    "title": {"$first": "$title"},
                    "text": {"$first": "$text"},
                    "vs_score": {"$max": "$vs_score"},
                    "kws_score": {"$max": "$kws_score"},
                }
            },
            {
                # $project stage to include only the following fields in the results: _id, title, text, vs_score, kws_score
                "$project": {
                    "title": 1,
                    "vs_score": {"$ifNull": ["$vs_score", 0]},
                    "kws_score": {"$ifNull": ["$kws_score", 0]},
                    "text": 1,
                }
            },
            {
                # $project stage to add a field named score that contains the sum of vs_score and kws_score to the results.
                "$project": {
                    "score": {"$add": ["$kws_score", "$vs_score"]},
                    "title": 1,
                    "vs_score": 1,
                    "kws_score": 1,
                    "text": 1,
                }
            },
            # $sort stage to sort the results by score in descending order.
            {"$sort": {"score": -1}},
            # $limit stage to limit the output to 10 results only.
            {"$limit": 10},
        ]
    )
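
The scoring arithmetic in this pipeline can be checked by hand in plain Python. The sketch below is an illustration only, assuming that results from both searches are merged per document id before the scores are summed; the function names and the ids `p13`, `p42`, and `p7` are hypothetical.

```python
def reciprocal_rank_scores(ranked_ids, penalty):
    """Score each id as 1.0 / (zero-based rank + penalty + 1), as in the $addFields stages."""
    return {
        doc_id: 1.0 / (rank + penalty + 1)
        for rank, doc_id in enumerate(ranked_ids)
    }


def fuse(vector_ids, keyword_ids, vector_penalty=4, keyword_penalty=6):
    """Sum the vector and keyword scores per id; a missing score counts as 0."""
    vs = reciprocal_rank_scores(vector_ids, vector_penalty)
    kws = reciprocal_rank_scores(keyword_ids, keyword_penalty)
    fused = {
        doc_id: vs.get(doc_id, 0.0) + kws.get(doc_id, 0.0)
        for doc_id in set(vs) | set(kws)
    }
    # Highest score first; ties are broken by id to keep the order deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: three vector hits, one keyword hit.
for doc_id, score in fuse(["p13", "p42", "p7"], ["p42"]):
    print(doc_id, round(score, 4))
```

A lower penalty gives that search more influence: with `vector_penalty = 4` the top vector hit alone scores 1/5, while the top keyword hit alone scores 1/7. In this example `p42` overtakes `p13` because it appears in both lists.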

In [267]:
result = hybrid_search(db_collection, "How to find prblems in code?")
for doc in result:
    print(doc["text"], doc["score"], "\n")

<p id='13' style='font-size:16px'>introduced bugs immediately. Several bug-finding techni-<br>ques could be used, including code inspections, unit testing,<br>and the use of static analysis tools. Since these steps would<br>be taken right after a code change was made, the developer<br>would still retain the full mental context of the change. This<br>holds promise for reducing the time required to find<br>software bugs and reducing the time that bugs stay resident<br>in software before removal.</p><br> 0.2 

<p id='42' style='font-size:16px'>determine which kinds of function return values must be<br>checked. For example, if the return value of foo was always<br>verified in the previous project history but was not verified<br>in the current source code, it is very suspicious. Livshits and<br>Zimmermann combine software repository mining and<br>dynamic analysis to discover common use patterns and<br>code patterns that are likely errors in Java applications [25].<br>Similarly, PR-Miner min