In [None]:
# !pip3 install langchain
# !pip3 install llama-index==0.6.0
# !pip3 install pymongo
# !pip3 install nltk
# !pip3 install Pillow
# !pip3 install python-dotenv

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [None]:
from llama_index import (
    LLMPredictor,
    GPTVectorStoreIndex, 
    GPTListIndex, 
    GPTSimpleKeywordTableIndex,
    download_loader
)

from langchain.chat_models import ChatOpenAI
from llama_index.response.notebook_utils import display_response

In [None]:
# !pip install PyPDF

### INTRO

At a basic level, LlamaIndex takes your documents and breaks them into chunks called nodes.

Workflow:
1) Connect the private knowledge sources using LlamaIndex connectors. 
2) Load in the Documents. A ‘LlamaIndex Document’ represents a lightweight container around the data source. 
3) Parse the ‘LlamaIndex Document’ objects into ‘LlamaIndex Node’ objects. Nodes represent “chunks” of the source Documents (e.g., a text chunk). These Node objects can be persisted in a MongoDB collection.
4) Construct a ‘LlamaIndex Index’ from the Nodes. LlamaIndex offers several kinds of index, such as the “List Index” (which stores Nodes as a sequential chain) and the “Vector Store Index” (which stores each Node with a corresponding embedding in a vector store). Depending on its type, an index can be persisted into a MongoDB collection or a vector database.
5) Finally, query the Index. The query is parsed at this step; relevant Nodes are retrieved through indexes and provided as input to the “Large Language Model” (LLM). Different types of queries can use different indexes.


Use of Indexes:
For summarization, you have two options: GPTListIndex or GPTVectorStoreIndex with response_mode="tree_summarize". The distinction lies in the approach taken to generate the summary. A list index utilizes every node in the index to create the summary, while a vector index utilizes only the top k nodes to generate a summary.

For Q&A, GPTVectorStoreIndex can be used. During the query, the system fetches the top k most relevant nodes based on your query text. These nodes are then used as context to synthesize an answer using the LLM.
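
The difference between "use every node" and "use only the top k nodes" can be illustrated with a toy sketch. This is not LlamaIndex code: the `embed`, `cosine`, and `top_k_nodes` helpers are hypothetical, and a bag-of-words count vector stands in for a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_nodes(query, nodes, k=2):
    """Return the k nodes most similar to the query."""
    q = embed(query)
    ranked = sorted(nodes, key=lambda n: cosine(q, embed(n)), reverse=True)
    return ranked[:k]


nodes = [
    "built salesforce reports and dashboards",
    "managed etl jobs and job schedules",
    "designed salesforce automation and integrations",
]

# List-index style: every node becomes context for the LLM.
list_context = nodes

# Vector-index style: only the top k most similar nodes become context.
vector_context = top_k_nodes("salesforce experience", nodes, k=2)
print(vector_context)
```

With a real vector index the ranking comes from learned embeddings rather than word counts, but the trade-off is the same: a list index sees everything at higher cost, while a vector index sees only the nodes it judges relevant.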

### Initialize OpenAI and MongoDB

In [None]:
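import os
from dotenv import load_dotenv

# Load variables from a local .env file, if one exists
load_dotenv()

# Set environment variable (skip this if the key is already defined in .env)
os.environ['OPENAI_API_KEY'] = 'Your-API-KEY-Here'

# Confirm the variable is set without printing the secret itself
print('OPENAI_API_KEY' in os.environ)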
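
To confirm that a key is loaded without echoing the full secret into notebook output, a small masking helper can be used. This is a minimal sketch; `mask_secret` is a hypothetical helper, not part of `python-dotenv` or the OpenAI client.

```python
import os


def mask_secret(value, visible=4):
    """Return the secret with all but the last `visible` characters replaced by '*'."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Shows only the tail of the key, or "<not set>" when the variable is missing.
print("OPENAI_API_KEY:", mask_secret(os.environ.get("OPENAI_API_KEY")))
```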


#### Load Documents

In [None]:
# !pip install llama_hub
from llama_hub.file.base import SimpleDirectoryReader

loader = SimpleDirectoryReader('Your file path here')
documents = loader.load_data()

In [None]:
#documents

In [None]:
# from llama_hub.file import PDFReader

# loader = PDFReader('/content/data/2303.08774.pdf')
# documents = loader.load_data()

#### Parse into Nodes
Document stores contain ingested document chunks, which LlamaIndex calls 'Node' objects.


By default, the SimpleDocumentStore stores Node objects in-memory.

In [None]:
from llama_index.node_parser import SimpleNodeParser
nodes = SimpleNodeParser().get_nodes_from_documents(documents)
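
Conceptually, parsing turns each document into a sequence of smaller, slightly overlapping chunks. The sketch below illustrates that idea with a plain word-based splitter. It is a toy, not the splitter that `SimpleNodeParser` actually uses; `chunk_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration.

```python
def chunk_text(text, chunk_size=5, overlap=1):
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


document = "one two three four five six seven eight nine"
toy_nodes = chunk_text(document, chunk_size=4, overlap=1)
print(toy_nodes)
```

The overlap means a sentence that straddles a chunk boundary still appears mostly intact in at least one chunk, which tends to help retrieval.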

In [None]:
#nodes

## Persisting nodes and indexes to MongoDB
Nodes can be persisted as an actual MongoDB collection using MongoDocumentStore, which is the approach taken here.
Storing the ‘LlamaIndex Documents’ and indexes in a database becomes necessary in a few scenarios:
(a) Use cases where large datasets require more than in-memory storage.
(b) Ingesting and processing data from various sources (for example, PDFs, Google Docs, Slack).
(c) The requirement to continuously maintain updates from the underlying data sources.

Persisting this data lets you process it once and then query it from various downstream applications. You can easily reconnect to your MongoDB collection and reload the index by re-initializing a MongoIndexStore with an existing db_name and collection_name.

MongoDB offers a free forever Atlas cluster in the public cloud service of your choice. Quickly create a free forever Atlas cluster by following this [tutorial](https://www.mongodb.com/developer/products/atlas/free-atlas-cluster/). Or you can get started directly [here](https://www.mongodb.com/cloud/atlas/register). 


In [None]:
MONGO_URI = "MONGO_DB_URI_HERE"
MONGODB_DATABASE = "DB_NAME_HERE"
# Note: You can configure the db_name and namespace when instantiating MongoDocumentStore & MongoIndexStore, 
# otherwise they default to db_name="db_docstore" and namespace="docstore"
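
Rather than hard-coding the connection string in the notebook, the values can be read from the environment (for example from the same `.env` file loaded earlier). This is a minimal sketch: `load_mongo_settings` is a hypothetical helper, and the variable names `MONGO_URI` and `MONGODB_DATABASE` are assumed to match the notebook's Python names. The fallback database name is the `db_docstore` default mentioned in the comment above.

```python
import os


def load_mongo_settings(env=None):
    """Read the MongoDB URI and database name from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    uri = env.get("MONGO_URI")
    if not uri:
        raise RuntimeError("MONGO_URI is not set; add it to your .env file or environment")
    return {"uri": uri, "db_name": env.get("MONGODB_DATABASE", "db_docstore")}


# Example with an explicit mapping so the sketch runs without real credentials.
settings = load_mongo_settings(
    {"MONGO_URI": "mongodb://localhost:27017", "MONGODB_DATABASE": "demo"}
)
print(settings["db_name"])
```

Failing early with a clear message when the URI is missing is usually easier to debug than a connection error raised later inside the store.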

#### Add Nodes to MongoDB backed Docstore

In [None]:
from llama_index.storage.docstore import MongoDocumentStore
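# Pass db_name so the nodes land in the same database used by the storage context below
docstore = MongoDocumentStore.from_uri(uri=MONGO_URI, db_name=MONGODB_DATABASE)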

docstore.add_documents(nodes)

This results in two new collections, `docstore/data` and `docstore/metadata`, being created in MongoDB.

![MongoDocumentStore](https://drive.google.com/uc?export=view&id=1PrMet1I8bWfd-6pf4YK8RtQmRYFpLdVu)


### Define Indexes & Store them in MongoDB


Each index uses the same underlying Docstore.

In [None]:
from llama_index.storage.docstore import MongoDocumentStore
from llama_index.storage.index_store import MongoIndexStore
from llama_index.storage.storage_context import StorageContext

storage_context = StorageContext.from_defaults(
    docstore=MongoDocumentStore.from_uri(uri=MONGO_URI, db_name=MONGODB_DATABASE),
    index_store=MongoIndexStore.from_uri(uri=MONGO_URI, db_name=MONGODB_DATABASE),
)



In [None]:
list_index = GPTListIndex(nodes, storage_context=storage_context)

In [None]:
vector_index = GPTVectorStoreIndex(nodes, storage_context=storage_context) 

In [None]:
keyword_table_index = GPTSimpleKeywordTableIndex(nodes, storage_context=storage_context) 

This results in a new collection called `index_store/data` being created in MongoDB.

![MongoIndexStore](https://drive.google.com/uc?export=view&id=1JkpyWyJjXLLC-0i1Q2NCflDG5RyDUQbk)

### Retrieve Nodes from MongoDB Docstore

(This step is OPTIONAL. If you have been following along so far, the nodes are already loaded in memory.)

In [None]:
from llama_index.storage.docstore import MongoDocumentStore
docstore = MongoDocumentStore.from_uri(uri=MONGO_URI, db_name=MONGODB_DATABASE)
nodes = list(docstore.docs.values())

# NOTE: Verify that the docstore still has the same nodes
len(docstore.docs)

2

## Test out some Queries

In [None]:
vector_response = vector_index.as_query_engine().query("Does he have experience with Salesforce?") 
display_response(vector_response)

**`Final Response:`** Yes, he does have experience with Salesforce. He mentions that he "built CRM systems by building reports, dashboards, automation, and integrations to improve internal processes" while working as a Product Owner II at ClearForMe. He also mentions that he "designed and developed reports and dashboards by understanding customer need in Salesforce" while working as a Technical Business Analyst at Cloud Mentor.

In [None]:
vector_response = vector_index.as_query_engine().query("What are all the companies he worked at?") 
display_response(vector_response)

**`Final Response:`** The companies Ananth Prayaga worked at are:
1. Slyce
2. Independence Blue Cross
3. Temple University
4. ClearForMe
5. Cloud Mentor
6. Comcast

In [None]:
vector_response = vector_index.as_query_engine().query("What is his name?") 
display_response(vector_response)

**`Final Response:`** His name is Ananth Prayaga.

In [None]:
vector_response = vector_index.as_query_engine().query("Does he have experience with Data") 
display_response(vector_response)

**`Final Response:`** Migration?

Yes, Ananth Prayaga has experience with data migration. He has experience extracting data/files from various legacy systems, transforming data for loading into source systems, and architecting the data migration process for a sunsetting legacy library system.

In [None]:
vector_response = vector_index.as_query_engine().query("List all his skills?") 
display_response(vector_response)

**`Final Response:`** - Data Analytics
- Business Analysis
- Product Management
- Data Migration
- Reporting
- Analytics
- SaaS Product Development
- Cross-Functional Team Management
- Google Data Studio
- BigQuery
- JIRA
- AirTable
- Excel
- Machine Learning
- Sales Analysis
- Focus Groups
- Interviews
- Surveys
- Desk Research
- Primary Research
- Secondary Research
- Google Data Studio
- BigQuery
- JIRA
- AirTable
- Excel
- Machine Learning
- Sales Analysis
- Focus Groups
- Interviews
- Surveys
- Desk Research
- Primary Research
- Secondary Research
- ETL Jobs
- Projects
- Job Schedules
- Custom Reports
- IBM Cognos
- Text Analytics
- Product Strategy
- Roadmap Development
- CRM Systems
- Reports
- Dashboards
- Automation
- Integrations
- API Product Offering
- Core Platform Development
- Product Operations
- Launch Support
- Data Extraction
- Data Mapping
- SQL
- ETL Tools
- Report and Dashboard Development
- Python Scripting
- Accounting Data Consolidation
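
The response above repeats several skills, which can happen when overlapping chunks contribute the same items. If a clean list is needed downstream, the bullet items can be de-duplicated after the fact. The sketch below is a hypothetical post-processing helper, not a LlamaIndex feature, and the sample list is a short excerpt of the items shown above.

```python
def dedupe_preserving_order(items):
    """Drop repeated entries (case-insensitive), keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        key = item.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(item.strip())
    return unique


skills = [
    "Google Data Studio", "BigQuery", "JIRA",
    "Google Data Studio", "BigQuery", "ETL Jobs",
]
unique_skills = dedupe_preserving_order(skills)
print(unique_skills)
```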