# NexuSync Demo Notebook

This notebook demonstrates how to use NexuSync for document indexing, querying, and other key functionalities.

NexuSync is a library for indexing documents and querying them with embedding and language models, either local (HuggingFace embeddings with an Ollama language model) or hosted (OpenAI).


## Initialize NexuSync

Let's start by initializing NexuSync with the required parameters. Adjust the following parameters according to your use case:

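- **input_dirs**: List of directories containing the documents to index.
- **openai_model_yn**: Whether to use OpenAI models for embeddings and language tasks. If `False`, a HuggingFace embedding model and an Ollama language model are used, as in the example below.
- **embedding_model**: Model used to generate embeddings.
- **language_model**: Model used for language tasks.
- **base_url**: Base URL of the Ollama server.
- **temperature**: Sampling temperature from 0 to 1; higher values give more creative output.
- **chroma_db_dir**: Directory for storing ChromaDB files.
- **index_persist_dir**: Directory for persisting the index.
- **chroma_collection_name**: Name of the ChromaDB collection.
- **chunk_size**: Size of the text chunks used to create embeddings.
- **chunk_overlap**: Overlap between consecutive text chunks, which preserves context across chunk boundaries.
- **recursive**: Whether to also read files in subdirectories of `input_dirs`.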
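
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is an illustration only: `chunk_text` is a hypothetical helper that counts characters, whereas the splitter NexuSync actually uses may count tokens and respect sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still seen in full by at least one chunk when the overlap is large enough.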


In [1]:
from nexusync import NexuSync

OPENAI_MODEL_YN = False # if False, an Ollama language model is used
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5" # suggested embedding model
LANGUAGE_MODEL = "llama3.2" # download the Ollama model first; see https://ollama.com/download
BASE_URL = "http://localhost:11434" # you can switch to a different base_url for the Ollama server
TEMPERATURE = 0.4 # range 0 to 1; higher values give more creative output
CHROMA_DB_DIR = 'chroma_db'
INDEX_PERSIST_DIR = 'index_storage'
CHROMA_COLLECTION_NAME = 'my_collection'
INPUT_DIRS = ["../sample_docs"] # can specify multiple document paths
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 20
RECURSIVE = True

ns = NexuSync(input_dirs=INPUT_DIRS,
              openai_model_yn=OPENAI_MODEL_YN,
              embedding_model=EMBEDDING_MODEL,
              language_model=LANGUAGE_MODEL,
              base_url=BASE_URL,
              temperature=TEMPERATURE,
              chroma_db_dir=CHROMA_DB_DIR,
              index_persist_dir=INDEX_PERSIST_DIR,
              chroma_collection_name=CHROMA_COLLECTION_NAME,
              chunk_overlap=CHUNK_OVERLAP,
              chunk_size=CHUNK_SIZE,
              recursive=RECURSIVE
              )

2024-10-20 14:43:35,719 - nexusync.utils.embedding_models.set_embedding_model - INFO - Using HuggingFace embedding model: BAAI/bge-base-en-v1.5
2024-10-20 14:43:35,746 - nexusync.utils.embedding_models.set_language_model - INFO - Ollama LLM initialized with model: llama3.2 and base_url: http://localhost:11434
2024-10-20 14:43:35,824 - nexusync.NexuSync - INFO - Vectors and Querier initialized successfully.
2024-10-20 14:43:37,871 - nexusync.core.indexer - INFO - Loaded 2 from all directories.
2024-10-20 14:43:40,042 - nexusync.core.indexer - INFO - Index Built.


## Normal Query

`ns.start_query` runs a single, stateless query. It keeps no chat memory, which makes it a quick way to test the index.

In [6]:
query = "main result of the paper can llm generate novltive ideas"

text_qa_template = """
Context Information:
--------------------
{context_str}
--------------------

Query: {query_str}

Instructions:
1. Carefully read the context information and the query.
2. Think through the problem step by step.
3. Provide a concise and accurate answer based on the given context.
4. If the answer cannot be determined from the context, state "Based on the given information, I cannot provide a definitive answer."
5. If you need to make any assumptions, clearly state them.
6. If relevant, provide a brief explanation of your reasoning.

Answer: """


response = ns.start_query(text_qa_template=text_qa_template, query=query)

print(f"Query: {query}")
print(f"Response: {response['response']}")
print(f"Response: {response['metadata']}")

2024-10-20 14:46:55,072 - nexusync.NexuSync - INFO - Starting query: main result of the paper can llm generate novltive ideas
2024-10-20 14:46:58,224 - nexusync.NexuSync - INFO - Query completed successfully.


Query: main result of the paper can llm generate novltive ideas
Response: Based on the provided context, it appears that the paper is investigating whether Large Language Models (LLMs) can generate novel research ideas comparable to expert humans. The authors conducted a large-scale human study with over 100 NLP researchers and compared their generated ideas with those of an LLM ideation agent.

The main result of the paper suggests that LLM-generated ideas are judged as more novel than human expert ideas, but slightly weaker on feasibility (p<0.05). This implies that while LLMs can generate novel ideas, they may not always be feasible or practical for actual research projects.

Therefore, the answer to the query "Can LLMs Generate Novel Research Ideas?" is: Yes, LLMs can generate novel research ideas, but their feasibility and practicality may vary.
Response: {'sources': [{'source_text': 'page_label: 5\nfile_path: /mnt/d/nexusync/notebooks/../sample_docs/Can LLMs Generate Novel Resear
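
The raw metadata is hard to read when printed directly. The following sketch condenses it into one short preview per source. It assumes only the shape visible in the output above (a `sources` list whose entries have a `source_text` key); `summarize_sources` and `sample_metadata` are hypothetical names, and real entries may carry additional keys.

```python
def summarize_sources(metadata: dict, max_chars: int = 80) -> list[str]:
    """Return one single-line preview per retrieved source."""
    previews = []
    for source in metadata.get("sources", []):
        text = source.get("source_text", "")
        one_line = " ".join(text.split())  # collapse newlines and repeated spaces
        if len(one_line) > max_chars:
            one_line = one_line[:max_chars].rstrip() + "..."
        previews.append(one_line)
    return previews


# Hypothetical stand-in for response["metadata"], not real NexuSync output.
sample_metadata = {
    "sources": [
        {"source_text": "page_label: 5\nfile_path: docs/example.pdf\n\nSome retrieved passage."},
        {"source_text": "page_label: 19\nfile_path: docs/example.pdf\n\nAnother passage."},
    ]
}

for i, preview in enumerate(summarize_sources(sample_metadata, max_chars=40), start=1):
    print(f"[{i}] {preview}")
```

With a real response, the call would be `summarize_sources(response['metadata'])`.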

## Stream Chat

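Call `ns.initialize_stream_chat` first to set up the chat engine. `ns.start_chat_stream` then yields the answer token by token, followed by a final item that carries the metadata, as in the following example.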

In [5]:
text_qa_template = """
Context Information:
--------------------
{context_str}
--------------------

Query: {query_str}

Instructions:
1. Carefully read the context information and the query.
2. Think through the problem step by step.
3. Provide a concise and accurate answer based on the given context.
4. If the answer cannot be determined from the context, state "Based on the given information, I cannot provide a definitive answer."
5. If you need to make any assumptions, clearly state them.
6. If relevant, provide a brief explanation of your reasoning.

Answer: """

ns.initialize_stream_chat(
    text_qa_template=text_qa_template,
    chat_mode="context",
    similarity_top_k=3
)

2024-10-20 15:03:50,409 - nexusync.core.chat_engine - INFO - Chat engine initialized


In [6]:
query = "main result of the paper can llm generate novltive ideas"


for item in ns.start_chat_stream(query):
    if isinstance(item, str):
        # This is a token, print or process as needed
        print(item, end='', flush=True)
    else:
        # This is the final response with metadata
        # print("\n\nFull response:", item['response'])
        print("\n\n")
        print("Metadata:", item['metadata'])
        break

The main result of the paper "Can LLMs Generate Novel Research Ideas?" indicates that AI-generated ideas are judged to be more novel than human-generated ideas, based on a large-scale evaluation involving nearly 300 reviews. However, while these AI-generated ideas may appear more exciting, there are concerns about their feasibility and the potential decline in the quality of academic discourse due to the reliance on AI for idea generation. The study highlights the need for rigorous standards and accountability in evaluating both AI-assisted and human-generated research to ensure the integrity of academic contributions. Additionally, the paper raises issues related to intellectual credit, idea homogenization, and the importance of human collaboration in the ideation process.


Metadata: {'sources': [{'source_text': 'page_label: 19\nfile_path: /mnt/d/nexusync/notebooks/../sample_docs/Can LLMs Generate Novel Research Ideas.pdf\n\nQuestion 2: Are evaluations based solely on ideas subjectiv
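
If you want the full answer as one string instead of printing tokens as they arrive, the same loop can collect them. The sketch below runs without NexuSync: `fake_chat_stream` is a hypothetical stand-in that mimics the behaviour shown above (string tokens first, then one dictionary with `response` and `metadata`), and `collect_stream` is an illustrative helper, not part of the library.

```python
from typing import Iterator, Optional, Tuple, Union


def fake_chat_stream(query: str) -> Iterator[Union[str, dict]]:
    """Hypothetical stand-in for ns.start_chat_stream: tokens, then a final dict."""
    tokens = ["LLM ", "ideas ", "were ", "rated ", "more ", "novel."]
    for token in tokens:
        yield token
    yield {"response": "".join(tokens), "metadata": {"sources": []}}


def collect_stream(stream: Iterator[Union[str, dict]]) -> Tuple[str, Optional[dict]]:
    """Join streamed tokens and return (full_text, final_item or None)."""
    parts = []
    final = None
    for item in stream:
        if isinstance(item, str):
            parts.append(item)  # a token
        else:
            final = item        # the final response with metadata
            break
    return "".join(parts), final


text, final = collect_stream(fake_chat_stream("example question"))
print(text)
print("Metadata:", final["metadata"])
```

With an initialized chat engine, `collect_stream(ns.start_chat_stream(query))` should work the same way, assuming the stream keeps the token-then-dictionary order used in the cell above.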

## Rebuild Index

Rebuild the index whenever any of the following parameters changes:
- openai_model_yn
- embedding_model
- language_model
- base_url
- chroma_db_dir
- index_persist_dir
- chroma_collection_name
- chunk_overlap
- chunk_size
- recursive
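
One way to notice that a rebuild is due is to fingerprint these settings and compare the result with the fingerprint saved at the last build. This is only a sketch of that idea: `REBUILD_KEYS`, `config_fingerprint`, and `needs_rebuild` are hypothetical helpers, not NexuSync functions.

```python
import hashlib
import json

# The settings listed above as requiring a rebuild when they change.
REBUILD_KEYS = (
    "openai_model_yn", "embedding_model", "language_model", "base_url",
    "chroma_db_dir", "index_persist_dir", "chroma_collection_name",
    "chunk_overlap", "chunk_size", "recursive",
)


def config_fingerprint(config: dict) -> str:
    """Hash only the settings that affect the stored index."""
    relevant = {key: config.get(key) for key in REBUILD_KEYS}
    payload = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rebuild(old_config: dict, new_config: dict) -> bool:
    """True when a rebuild-relevant setting differs between the two configs."""
    return config_fingerprint(old_config) != config_fingerprint(new_config)


old = {
    "openai_model_yn": False,
    "embedding_model": "BAAI/bge-base-en-v1.5",
    "chunk_size": 1024,
    "temperature": 0.4,
}
new_temperature = dict(old, temperature=0.7)  # temperature is not in the list
new_models = dict(old, openai_model_yn=True, embedding_model="text-embedding-3-large")

print(needs_rebuild(old, new_temperature))  # False
print(needs_rebuild(old, new_models))       # True
```
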

In the example below, we rebuild the index with OpenAI models:

In [1]:
from nexusync import rebuild_index
from nexusync import NexuSync

OPENAI_MODEL_YN = True # if False, an Ollama language model is used
EMBEDDING_MODEL = "text-embedding-3-large" # suggested embedding model
LANGUAGE_MODEL = 'gpt-4o-mini' # OpenAI chat model; no Ollama download is needed
TEMPERATURE = 0.4 # range 0 to 1; higher values give more creative output
CHROMA_DB_DIR = 'chroma_db'
INDEX_PERSIST_DIR = 'index_storage'
CHROMA_COLLECTION_NAME = 'my_collection'
INPUT_DIRS = ["../sample_docs"] # can specify multiple document paths
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 20
RECURSIVE = True

rebuild_index(input_dirs=INPUT_DIRS, 
              openai_model_yn=OPENAI_MODEL_YN, 
              embedding_model=EMBEDDING_MODEL, 
              language_model=LANGUAGE_MODEL, 
              temperature=TEMPERATURE, 
              chroma_db_dir = CHROMA_DB_DIR,
              index_persist_dir = INDEX_PERSIST_DIR,
              chroma_collection_name=CHROMA_COLLECTION_NAME,
              chunk_overlap=CHUNK_OVERLAP,
              chunk_size=CHUNK_SIZE,
              recursive=RECURSIVE
              )

2024-10-20 15:02:57,850 - nexusync.core.indexing_functions - INFO - Starting index rebuild process...
2024-10-20 15:02:57,940 - nexusync.utils.embedding_models.set_embedding_model - INFO - Using OpenAI embedding model: text-embedding-3-large
2024-10-20 15:02:57,944 - nexusync.utils.embedding_models.set_language_model - INFO - Using OpenAI LLM model: gpt-4o-mini
2024-10-20 15:02:57,946 - nexusync.core.indexing_functions - INFO - Deleting existing index directory: index_storage
2024-10-20 15:02:59,738 - nexusync.core.indexing_functions - INFO - Loaded 2 files from all directories.
2024-10-20 15:03:06,072 - nexusync.core.indexing_functions - INFO - Index Built.


In [2]:
# Re-initialize ns after rebuilding the index
ns = NexuSync(input_dirs=INPUT_DIRS, 
              openai_model_yn=OPENAI_MODEL_YN, 
              embedding_model=EMBEDDING_MODEL, 
              language_model=LANGUAGE_MODEL, 
              temperature=TEMPERATURE, 
              chroma_db_dir = CHROMA_DB_DIR,
              index_persist_dir = INDEX_PERSIST_DIR,
              chroma_collection_name=CHROMA_COLLECTION_NAME,
              chunk_overlap=CHUNK_OVERLAP,
              chunk_size=CHUNK_SIZE,
              recursive=RECURSIVE
              )

2024-10-20 15:03:06,089 - nexusync.utils.embedding_models.set_embedding_model - INFO - Using OpenAI embedding model: text-embedding-3-large
2024-10-20 15:03:06,096 - nexusync.utils.embedding_models.set_language_model - INFO - Using OpenAI LLM model: gpt-4o-mini
2024-10-20 15:03:06,096 - nexusync.NexuSync - INFO - Vectors and Querier initialized successfully.
2024-10-20 15:03:07,413 - nexusync.core.indexer - INFO - Index already built. Loading from disk.


In [8]:
# Test the newly built index
query = "main result of the paper can llm generate novltive ideas"

text_qa_template = """
Context Information:
--------------------
{context_str}
--------------------

Query: {query_str}

Instructions:
1. Carefully read the context information and the query.
2. Think through the problem step by step.
3. Provide a concise and accurate answer based on the given context.
4. If the answer cannot be determined from the context, state "Based on the given information, I cannot provide a definitive answer."
5. If you need to make any assumptions, clearly state them.
6. If relevant, provide a brief explanation of your reasoning.

Answer: """


response = ns.start_query(text_qa_template=text_qa_template, query=query)

print(f"Query: {query}")
print(f"Response: {response['response']}")
print(f"Response: {response['metadata']}")

2024-10-20 14:59:40,277 - nexusync.NexuSync - INFO - Starting query: main result of the paper can llm generate novltive ideas
2024-10-20 14:59:47,608 - nexusync.NexuSync - INFO - Query completed successfully.


Query: main result of the paper can llm generate novltive ideas
Response: The main result of the paper indicates that AI-generated ideas are judged as more novel than human expert ideas, with statistical significance (p<0.05). However, this novelty may come at the slight expense of feasibility, although the study size did not have sufficient power to conclusively identify these effects. The paper emphasizes that while LLMs can generate novel ideas, there are concerns regarding the quality and diversity of these ideas, as well as potential issues related to intellectual credit and the overall impact on academic discourse.
Response: {'sources': [{'source_text': 'page_label: 19\nfile_path: /mnt/d/nexusync/notebooks/../sample_docs/Can LLMs Generate Novel Research Ideas.pdf\n\nQuestion 2: Are evaluations based solely on ideas subjective? In this current study, we focused\nsolely on evaluating the ideas themselves. Intellectual Credit. The use of LLMs to generate research ideas introduces si

## Refresh Index

In [3]:
# Incrementally update the index: pick up new or modified files and detect deleted ones
ns.refresh_index()

2024-10-20 15:03:24,569 - nexusync.core.indexer - INFO - Starting index refresh process...
2024-10-20 15:03:24,572 - nexusync.core.indexer - INFO - Processing directory: ../sample_docs
2024-10-20 15:03:26,569 - nexusync.core.indexer - INFO - Loaded 2 files from ../sample_docs
2024-10-20 15:03:26,571 - nexusync.core.indexer - INFO - No files were modified or added in ../sample_docs
2024-10-20 15:03:26,571 - nexusync.core.indexer - INFO - No files were modified or added in any directory
2024-10-20 15:03:26,573 - nexusync.core.indexer - INFO - No deleted files found.
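
The log above reports added, modified, and deleted files. As a rough illustration of how such change detection can work, the sketch below compares two snapshots of file modification times. It is not NexuSync's implementation; `snapshot` and `diff_snapshots` are hypothetical helpers, and the example runs in a temporary directory rather than `../sample_docs`.

```python
import os
import tempfile


def snapshot(directory):
    """Map every file path under `directory` to its last-modified time."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            state[path] = os.path.getmtime(path)
    return state


def diff_snapshots(before, after):
    """Return (added, modified, deleted) file paths between two snapshots."""
    added = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return added, modified, deleted


with tempfile.TemporaryDirectory() as docs:
    paths = {name: os.path.join(docs, name)
             for name in ("kept.txt", "edited.txt", "removed.txt")}
    for path in paths.values():
        with open(path, "w") as f:
            f.write("original")
    before = snapshot(docs)

    # Simulate one added, one modified, and one deleted file.
    with open(os.path.join(docs, "new_added.txt"), "w") as f:
        f.write("new")
    stat = os.stat(paths["edited.txt"])
    os.utime(paths["edited.txt"], (stat.st_atime, stat.st_mtime + 10))  # bump mtime
    os.remove(paths["removed.txt"])

    added_paths, modified_paths, deleted_paths = diff_snapshots(before, snapshot(docs))

added = [os.path.basename(p) for p in added_paths]
modified = [os.path.basename(p) for p in modified_paths]
deleted = [os.path.basename(p) for p in deleted_paths]
print("added:", added)
print("modified:", modified)
print("deleted:", deleted)
```

Modification times are cheap to compare but can miss edits that preserve the timestamp; hashing file contents is a slower, stricter alternative.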


## Get Chat History

Chat history is only recorded for stream chat sessions. Queries made with `ns.start_query` are not captured.

In [7]:
# Get chat history
chat_history = ns.chat_engine.get_chat_history()
print("Chat History:")
for entry in chat_history:
    print(f"Human: {entry['query']}")
    print(f"AI: {entry['response']}\n")

Chat History:
Human: main result of the paper can llm generate novltive ideas
AI: The main result of the paper "Can LLMs Generate Novel Research Ideas?" indicates that AI-generated ideas are judged to be more novel than human-generated ideas, based on a large-scale evaluation involving nearly 300 reviews. However, while these AI-generated ideas may appear more exciting, there are concerns about their feasibility and the potential decline in the quality of academic discourse due to the reliance on AI for idea generation. The study highlights the need for rigorous standards and accountability in evaluating both AI-assisted and human-generated research to ensure the integrity of academic contributions. Additionally, the paper raises issues related to intellectual credit, idea homogenization, and the importance of human collaboration in the ideation process.



## Testing

### Adding a document

In [8]:
# Add a new document
with open("../sample_docs/new_added.txt", "w") as f:
    f.write("Breaking News: Trump and Harris had a fight!!!!")

# Refresh the index: incrementally index new or modified files and detect files deleted from the folder
ns.refresh_index()
print("Index refreshed successfully!")

2024-10-20 15:04:57,399 - nexusync.core.indexer - INFO - Starting index refresh process...
2024-10-20 15:04:57,401 - nexusync.core.indexer - INFO - Processing directory: ../sample_docs
2024-10-20 15:04:59,516 - nexusync.core.indexer - INFO - Loaded 2 files from ../sample_docs
2024-10-20 15:04:59,829 - nexusync.core.indexer - INFO - Updated file: /mnt/d/nexusync/notebooks/../sample_docs/new_added.txt
2024-10-20 15:04:59,830 - nexusync.core.indexer - INFO - Total files modified or added: 1
2024-10-20 15:04:59,832 - nexusync.core.indexer - INFO - No deleted files found.


Index refreshed successfully!


### Deleting a file

In [3]:
# Delete a document, then refresh the index.
# The deletion is commented out here; uncomment it (and add `import os`) to test.
# os.remove('../sample_docs/Nvidia ecosystem.pptx')
# print("Document deleted.")

ns.refresh_index()
print("Index refreshed after deletion.")

2024-10-07 15:40:27,349 - nexusync.core.indexer - INFO - Starting index refresh process...
2024-10-07 15:40:27,353 - nexusync.core.indexer - INFO - Processing directory: ../sample_docs
2024-10-07 15:40:36,886 - nexusync.core.indexer - INFO - Loaded 7 files from ../sample_docs
2024-10-07 15:40:36,887 - nexusync.core.indexer - INFO - Updated 0 files in ../sample_docs
2024-10-07 15:40:36,888 - nexusync.core.indexer - INFO - No deleted files found.


Index refreshed after deletion.
