# NexuSync Demo Notebook

This notebook demonstrates how to use NexuSync for document indexing, querying, and other key functionalities.

NexuSync is a document indexing and querying library built on top of LlamaIndex. It can use either OpenAI models or, as in this notebook, a HuggingFace embedding model together with a local Ollama language model.


## Initialize NexuSync

Let's start by initializing NexuSync with the required parameters. Adjust the following parameters according to your use case:

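- **input_dirs**: List of directories containing documents for indexing.
- **recursive**: Whether to also index documents in subdirectories of `input_dirs`.
- **openai_model_yn**: Whether to use OpenAI models for embeddings and language tasks. If `False`, an Ollama model is used instead.
- **embedding_model**: Model to be used for generating embeddings.
- **language_model**: Model to be used for language tasks.
- **temperature**: Sampling temperature from 0 to 1; higher values give more creative output.
- **chroma_db_dir**: Directory for storing ChromaDB files.
- **chroma_collection_name**: Name of the ChromaDB collection that holds the vectors.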
- **index_persist_dir**: Directory for persisting the index.
- **chunk_size**: Size of the text chunks to be used for creating embeddings.
- **chunk_overlap**: Overlap between text chunks to maintain context.
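
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows. It is a simplification that splits by characters. NexuSync's own splitter is not shown in this notebook and may work on tokens or sentences instead. `chunk_text` is a hypothetical helper, not part of the library.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is how overlap carries context across chunk boundaries.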


In [1]:
from nexusync import NexuSync

OPENAI_MODEL_YN = False  # if False, an Ollama model is used
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5"  # suggested embedding model
LANGUAGE_MODEL = 'llama3.2'  # download the Ollama model first; see https://ollama.com/download
TEMPERATURE = 0.4  # range from 0 to 1; higher values give more creative output
CHROMA_DB_DIR = 'chroma_db'
INDEX_PERSIST_DIR = 'index_storage'
CHROMA_COLLECTION_NAME = 'my_collection'
INPUT_DIRS = ["../sample_docs"] # can specify multiple document paths
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 20
RECURSIVE = True

ns = NexuSync(input_dirs=INPUT_DIRS, 
              openai_model_yn=OPENAI_MODEL_YN,
              embedding_model=EMBEDDING_MODEL, 
              language_model=LANGUAGE_MODEL, 
              temperature=TEMPERATURE, 
              chroma_db_dir = CHROMA_DB_DIR,
              index_persist_dir = INDEX_PERSIST_DIR,
              chroma_collection_name=CHROMA_COLLECTION_NAME,
              chunk_overlap=CHUNK_OVERLAP,
              chunk_size=CHUNK_SIZE,
              recursive=RECURSIVE
              )

In [2]:
# Initialize the embedding model and the language model
ns.initialize_models()

# Initialize the vector store and persist it in a Chroma DB
ns.initialize_vectors()

Using HuggingFace embedding model: BAAI/bge-base-en-v1.5
Using Ollama LLM model: llama3.2


2024-10-09 06:56:24,459 - nexusync.core.indexer - INFO - Index already built. Loading from disk.
2024-10-09 06:56:24,459 - nexusync.NexuSync - INFO - Vectors and Querier initialized successfully.


## Rebuild Index

Rebuild the index whenever any of the following parameters change:
- openai_model_yn
- embedding_model
- language_model
- chroma_db_dir
- index_persist_dir
- chroma_collection_name
- chunk_overlap
- chunk_size
- recursive
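
One way to decide when a rebuild is due is to compare the settings used for the last build against the current ones. The sketch below assumes you keep the previous settings in a plain dict yourself. `REBUILD_KEYS` and `changed_settings` are hypothetical names for illustration, not part of NexuSync.

```python
# Hypothetical helper, not part of NexuSync: list the settings whose change
# calls for a rebuild, using the parameter names from the list above.
REBUILD_KEYS = (
    "openai_model_yn", "embedding_model", "language_model",
    "chroma_db_dir", "index_persist_dir", "chroma_collection_name",
    "chunk_overlap", "chunk_size", "recursive",
)


def changed_settings(previous: dict, current: dict) -> list[str]:
    """Return the rebuild-relevant keys whose values differ between two configs."""
    return [key for key in REBUILD_KEYS if previous.get(key) != current.get(key)]


previous = {
    "openai_model_yn": False,
    "embedding_model": "BAAI/bge-base-en-v1.5",
    "language_model": "llama3.2",
    "temperature": 0.4,
    "chroma_db_dir": "chroma_db",
    "index_persist_dir": "index_storage",
    "chroma_collection_name": "my_collection",
    "chunk_overlap": 20,
    "chunk_size": 1024,
    "recursive": True,
}
current = {**previous, "chunk_size": 512}

changed = changed_settings(previous, current)
print(changed)  # ['chunk_size']
```

`temperature` is deliberately absent from `REBUILD_KEYS` because it is not in the list above, so changing only the temperature yields an empty result.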

In [3]:
from nexusync import rebuild_index
rebuild_index(input_dirs=INPUT_DIRS, 
              openai_model_yn=OPENAI_MODEL_YN,
              embedding_model=EMBEDDING_MODEL, 
              language_model=LANGUAGE_MODEL, 
              temperature=TEMPERATURE, 
              chroma_db_dir = CHROMA_DB_DIR,
              index_persist_dir = INDEX_PERSIST_DIR,
              chroma_collection_name=CHROMA_COLLECTION_NAME,
              chunk_overlap=CHUNK_OVERLAP,
              chunk_size=CHUNK_SIZE,
              recursive=RECURSIVE
              )

2024-10-09 06:36:25,665 - nexusync.core.indexing_functions - INFO - Starting index rebuild process...
2024-10-09 06:36:25,667 - nexusync.core.indexing_functions - INFO - Deleting existing index directory: index_storage
2024-10-09 06:36:25,675 - nexusync.core.indexing_functions - INFO - Deleting existing Chroma DB directory: chroma_db
VisionEncoderDecoderModel has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
The attention

## Refresh Index

In [3]:
# Refresh the index: pick up new or changed files and drop files deleted from the input directories
ns.refresh_index()

2024-10-09 06:56:26,680 - nexusync.core.indexer - INFO - Starting index refresh process...
2024-10-09 06:56:26,683 - nexusync.core.indexer - INFO - Processing directory: ../sample_docs
VisionEncoderDecoderModel has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Pleas

## Stream Chat

In [11]:
text_qa_template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above, I want you to think step by step to answer the query in a crisp manner. "
    "In case you don't know the answer, say 'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)


ns.initialize_stream_chat(
    text_qa_template=text_qa_template,
    chat_mode="context",
    similarity_top_k=3
)

2024-10-09 07:01:24,002 - nexusync.core.chat_engine - INFO - Chat engine initialized
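
The template above has two placeholders, `{context_str}` and `{query_str}`. The sketch below fills them by hand with `str.format` to show the prompt the language model would plausibly receive. The chat engine does this substitution internally, and the exact mechanism is an assumption here. The `retrieved_chunks` list is made-up sample data.

```python
text_qa_template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above, I want you to think step by step to answer the query in a crisp manner. "
    "In case you don't know the answer, say 'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Made-up stand-ins for the chunks a retriever would return (similarity_top_k=3).
retrieved_chunks = [
    "Installation\n\npip install nexusync",
    "NexuSync is a document indexing and querying tool.",
    "Breaking News: sample text from a newly added file.",
]
query = "how to install nexusync"

# Join the retrieved chunks into one context string and fill both placeholders.
prompt = text_qa_template.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str=query,
)
print(prompt)
```

Printing the filled prompt is a quick way to check that a custom template still contains both placeholders before passing it to `initialize_stream_chat`.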


In [12]:
query = "do you find breaking news in it"

for token in ns.start_chat_stream(query):
    print(token, end='', flush=True)  # Print each item as it arrives; the final item is a dict with the full response and metadata

Yes, I found the following "Breaking News" text:

"Trump and Harris had a fight!"{'response': 'Yes, I found the following "Breaking News" text:\n\n"Trump and Harris had a fight!"', 'metadata': {'sources': [{'source_text': 'file_path: /mnt/d/nexusync/notebooks/../sample_docs/new_added.txt\n\nBreaking News: Trump and Harris had a fight!', 'metadata': {'file_path': '/mnt/d/nexusync/notebooks/../sample_docs/new_added.txt', 'file_name': 'new_added.txt', 'file_type': 'text/plain', 'file_size': 44, 'creation_date': '2024-10-07', 'last_modified_date': '2024-10-07'}}, {'source_text': "file_path: /mnt/d/nexusync/notebooks/../sample_docs/READEME.md\n\n🌟 Features\n\n- **Smart Document Indexing**: Automatically index documents from specified directories, keeping your knowledge base up-to-date. - **Efficient Querying**: Use natural language to query your document collection and get relevant answers quickly. - **Deletion Handling**: Automatically remove documents from the index when they're deleted f

In [8]:
# Usage:
query = "summary of the store.json?"

for item in ns.start_chat_stream(query):
    if isinstance(item, str):
        # This is a token, print or process as needed
        print(item, end='', flush=True)
    else:
        # This is the final response with metadata
        print("\n\nFull response:", item['response'])
        print("Metadata:", item['metadata'])
        break

Based on the original snippet from `store.json`, here's a summary:

The file contains data about various products, specifically women's clothing items. Here are some key points:

* The products are described as:
	+ Lightweight and breathable
	+ Soft and comfortable
	+ Featuring V-neck collars and slimmer fits
	+ Available in different colors and sizes (although the specific details are not provided)
* The categories mentioned include "women's clothing".
* Prices for these items range from $9.99 to $12.99.
* Each product has a rating with 4-5 stars out of 5, based on reviews from 145 customers or less.

Overall, this snippet appears to be showcasing specific women's clothing products available for purchase, along with their key features and pricing information.

Full response: Based on the original snippet from `store.json`, here's a summary:

The file contains data about various products, specifically women's clothing items. Here are some key points:

* The products are described as:
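
The stream in this section yields string tokens followed by one final dictionary. A small wrapper can print the tokens and hand back that dictionary for later use. The sketch below runs against a stand-in generator so it works without a model. `fake_chat_stream` and `consume_stream` are hypothetical names, and the stand-in only mimics the shape of the output seen above.

```python
from typing import Iterator, Optional, Union


def fake_chat_stream() -> Iterator[Union[str, dict]]:
    """Stand-in for ns.start_chat_stream(query): tokens first, then one final dict."""
    tokens = ["Yes, ", "I found ", "it."]
    yield from tokens
    yield {"response": "".join(tokens), "metadata": {"sources": []}}


def consume_stream(stream: Iterator[Union[str, dict]]) -> Optional[dict]:
    """Print tokens as they arrive and return the final dict, or None if there is none."""
    final = None
    for item in stream:
        if isinstance(item, str):
            print(item, end="", flush=True)  # a token: show it immediately
        else:
            final = item  # the closing item with the full response and metadata
    print()  # finish the streamed line
    return final


result = consume_stream(fake_chat_stream())
print(result["response"])
```

With a real session, the same function could be called as `consume_stream(ns.start_chat_stream(query))`.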
	

## One-time query

In [6]:
query = "how to install nexusync"


response = ns.start_query(text_qa_template = text_qa_template, query = query )

print(f"Query: {query}")
print(f"Response: {response['response']}")
print(f"Response: {response['metadata']}")

2024-10-09 06:59:57,194 - nexusync.NexuSync - INFO - Starting query: how to install nexusync
2024-10-09 07:00:01,795 - nexusync.NexuSync - INFO - Query completed successfully.


Query: how to install nexusync
Response: To install Nexusync, use the following command:

pip install nexusync
Response: {'sources': [{'source_text': 'file_path: /mnt/d/nexusync/notebooks/../sample_docs/READEME.md\n\nInstallation\n\n```bash\npip install nexusync\n```', 'metadata': {'file_path': '/mnt/d/nexusync/notebooks/../sample_docs/READEME.md', 'file_name': 'READEME.md', 'file_type': 'text/markdown', 'file_size': 4571, 'creation_date': '2024-10-06', 'last_modified_date': '2024-10-06'}}, {'source_text': 'file_path: /mnt/d/nexusync/notebooks/../sample_docs/READEME.md\n\nNexuSync\n\nNexuSync is a powerful document indexing and querying tool built on top of LlamaIndex. It allows you to efficiently manage, search, and interact with large collections of documents using advanced natural language processing techniques.', 'metadata': {'file_path': '/mnt/d/nexusync/notebooks/../sample_docs/READEME.md', 'file_name': 'READEME.md', 'file_type': 'text/markdown', 'file_size': 4571, 'creation_date
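
The metadata printed above is a nested dictionary, so it is usually easier to pull out just the fields you need. The sketch below lists the distinct source file names. It assumes the shape seen in the output (`metadata` → `sources` → `metadata` → `file_name`). `sample_response` is a hand-written, shortened example, and `unique_source_files` is a hypothetical helper.

```python
def unique_source_files(response: dict) -> list[str]:
    """Return the distinct file names cited in a response, in first-seen order."""
    names = []
    for source in response.get("metadata", {}).get("sources", []):
        name = source.get("metadata", {}).get("file_name")
        if name and name not in names:
            names.append(name)
    return names


# Hand-written sample that mirrors the structure of the output above.
sample_response = {
    "response": "To install Nexusync, use the following command:\n\npip install nexusync",
    "metadata": {
        "sources": [
            {"source_text": "Installation ...", "metadata": {"file_name": "READEME.md"}},
            {"source_text": "NexuSync ...", "metadata": {"file_name": "READEME.md"}},
            {"source_text": "Breaking News ...", "metadata": {"file_name": "new_added.txt"}},
        ]
    },
}

files = unique_source_files(sample_response)
print(files)  # ['READEME.md', 'new_added.txt']
```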

## Get Chat History

In [13]:
# Get chat history
chat_history = ns.chat_engine.get_chat_history()
print("Chat History:")
for entry in chat_history:
    print(f"Human: {entry['query']}")
    print(f"AI: {entry['response']}\n")

Chat History:
Human: do you find breaking news in it
AI: Yes, I found the following "Breaking News" text:

"Trump and Harris had a fight!"



## Testing Refresh

### Adding a document

In [14]:
# Add a new document
with open("../sample_docs/new_added.txt", "w") as f:
    f.write("Breaking News: Trump and Harris had a fight!")

# Refresh the index: incrementally index new or changed files and detect files deleted from the folder
ns.refresh_index()
print("Index refreshed successfully!")

2024-10-09 07:02:19,302 - nexusync.core.indexer - INFO - Starting index refresh process...
2024-10-09 07:02:19,304 - nexusync.core.indexer - INFO - Processing directory: ../sample_docs
2024-10-09 07:02:29,296 - nexusync.core.indexer - INFO - Loaded 6 files from ../sample_docs
2024-10-09 07:02:29,308 - nexusync.core.indexer - INFO - Updated 1 files in ../sample_docs
2024-10-09 07:02:29,308 - nexusync.core.indexer - INFO - Updated file: /mnt/d/nexusync/notebooks/../sample_docs/new_added.txt
2024-10-09 07:02:29,310 - nexusync.core.indexer - INFO - No deleted files found.


Index refreshed successfully!
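
The log above reports updated and deleted files. One common way to detect such changes is to compare two snapshots of the directory. The sketch below illustrates that idea with modification time and size. It is not NexuSync's actual implementation, which this notebook does not show. `snapshot` and `diff_snapshots` are hypothetical helpers.

```python
import os
import tempfile


def snapshot(directory: str) -> dict[str, tuple[float, int]]:
    """Map every file path under directory to its (modification time, size)."""
    result = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            result[path] = (stat.st_mtime, stat.st_size)
    return result


def diff_snapshots(old: dict, new: dict) -> tuple[list[str], list[str], list[str]]:
    """Return (added, updated, deleted) paths between two snapshots."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    updated = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, updated, deleted


with tempfile.TemporaryDirectory() as docs:
    keep = os.path.join(docs, "keep.txt")
    gone = os.path.join(docs, "gone.txt")
    for path in (keep, gone):
        with open(path, "w") as f:
            f.write("original")
    before = snapshot(docs)

    # Simulate the three kinds of change a refresh has to handle.
    with open(os.path.join(docs, "new_added.txt"), "w") as f:
        f.write("Breaking News")
    os.remove(gone)
    with open(keep, "w") as f:
        f.write("original, now longer")  # the size changes, so the edit is detected
    after = snapshot(docs)

    added, updated, deleted = diff_snapshots(before, after)
    added_names = [os.path.basename(p) for p in added]
    updated_names = [os.path.basename(p) for p in updated]
    deleted_names = [os.path.basename(p) for p in deleted]

print("added:", added_names)
print("updated:", updated_names)
print("deleted:", deleted_names)
```

Comparing content hashes instead of modification time and size would be more robust against edits that keep both unchanged, at the cost of reading every file.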


### Deleting a file

In [3]:
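# Delete a document, then refresh so the index can drop it.
# The deletion is commented out here, so the log below reports no deleted files.
# import os
# os.remove('../sample_docs/Nvidia ecosystem.pptx')
# print("Document deleted.")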

ns.refresh_index()
print("Index refreshed after deletion.")

2024-10-07 15:40:27,349 - nexusync.core.indexer - INFO - Starting index refresh process...
2024-10-07 15:40:27,353 - nexusync.core.indexer - INFO - Processing directory: ../sample_docs
2024-10-07 15:40:36,886 - nexusync.core.indexer - INFO - Loaded 7 files from ../sample_docs
2024-10-07 15:40:36,887 - nexusync.core.indexer - INFO - Updated 0 files in ../sample_docs
2024-10-07 15:40:36,888 - nexusync.core.indexer - INFO - No deleted files found.


Index refreshed after deletion.
