# **Build Knowledge Graph RAG With LlamaIndex for Data Analysis**

### **Business Overview**
The financial market is a complex and dynamic ecosystem that generates vast amounts of data every second, including stock prices, news updates, industry trends, and economic reports. Analyzing this data is critical for investors and analysts to make informed decisions and manage risks. Traditionally, market data analysis has relied on manual efforts or basic tools that extract and process data from individual sources. This approach, while effective to some extent, is time-consuming, prone to errors, and limited in uncovering deeper insights such as hidden patterns or relationships between entities like companies, sectors, and stocks.

Generative AI (Gen AI) offers transformative potential in market data analysis by automating data extraction, processing, and structuring. With its ability to process unstructured data, Gen AI can uncover complex relationships and provide real-time, actionable insights. By combining web scraping, natural language processing (NLP), and knowledge graph construction, AI systems can not only handle the scale and speed required in financial markets but also offer personalized, context-aware insights to users. This allows professionals to quickly navigate the overwhelming sea of market data and focus on strategic decision-making.

**In this project, we leverage the capabilities of Gen AI to create a Knowledge Graph RAG system for market sentiment data analysis. Using web scraping techniques, we gather real-time data from multiple sources such as stock feeds, financial news websites, and industry reports. The collected information is used to build a knowledge graph database with Neo4j, LlamaIndex, and OpenAI. We then query the graph index to generate context-aware responses to user queries.**


![image.png](https://images.pexels.com/photos/30381207/pexels-photo-30381207/free-photo-of-modern-trading-desk-with-graphs-and-gadgets.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2)

## **Approach**


* **Data Collection:** Gather financial data from multiple sources such as stock feeds, news websites, and industry reports using the Google Custom Search API and web scraping techniques


* **Knowledge Graph Construction:** Build a knowledge graph to represent relationships between extracted entities like companies, market trends, and sectors using Neo4j


* **Vector Embedding and Indexing**
  - Convert textual data into vector embeddings using OpenAI
  - Index data using LlamaIndex to enable retrieval-augmented generation

* **RAG Workflow Integration**
  - Query the vector database to fetch relevant context and related entities from the knowledge graph
  - Use OpenAI to generate contextually accurate and actionable responses


### **Learning Outcomes**

* Learn how to scrape and extract real-time data from financial websites and reports with the Google Custom Search API.

* Understand how to process and organize unstructured data into a structured format.
* Build and query a knowledge graph using tools like Neo4j and LlamaIndex.
* Implement a Retrieval-Augmented Generation (RAG) system for generating context-aware insights.
* Use OpenAI APIs to generate actionable insights from financial data.

### **Project Setup**


#### **Setting Up Google Custom Search API (Free Tier)**
The Google Custom Search JSON API allows you to perform Google searches programmatically. Below are the steps to set up the API, generate a key, and understand the limits of the free tier.

Step 1: Create a Google Cloud Platform (GCP) Project
  - Go to the Google Cloud Console.
  - Click on Create Project.
  - Provide a project name (e.g., CustomSearchProject) and click Create.
  - Select the project from the top-left project selector.

Step 2: Enable the Custom Search JSON API
- In the Google Cloud Console, go to the APIs & Services → Library section.
- Search for Custom Search JSON API.
- Click on the result and then Enable the API for your project.

Step 3: Generate an API Key
- Go to APIs & Services → Credentials.
- Click Create Credentials → API Key.
- Your API key will be generated. Copy and save it securely, as this key is required for making requests.

Step 4: Set Up a Custom Search Engine (CSE)
- Visit the [Google Programmable Search Engine](https://cse.google.com/).
- Click Get Started and sign in with your Google account if prompted.
- In the Sites to Search section, specify the sites you want to include in the search.
- To search the entire web, enter *.com (or similar).
- Provide a name for your search engine and click Create.
- After creating the search engine, navigate to Control Panel → Search Engine ID.
- Copy the Search Engine ID (CX value), as it is required for API requests.

**Free Tier Limitations**

The free tier for the Google Custom Search JSON API has the following restrictions:

- Search Requests: The free tier allows 100 queries per day.
- Query Size: Each query can fetch up to 10 search results.
- Rate Limiting: Google may throttle excessive requests even within the free tier limits.
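
Since each request returns at most 10 results and the free tier allows 100 requests per day, it can help to estimate how many requests a run will need before starting it. The following is a minimal sketch that assumes only the limits listed above; `requests_needed` and `fits_free_tier` are hypothetical helper names, not part of any Google library.

```python
import math

FREE_DAILY_REQUESTS = 100  # free-tier daily quota, as listed above
RESULTS_PER_REQUEST = 10   # maximum results returned by one request


def requests_needed(num_queries: int, results_per_query: int) -> int:
    """Number of API requests needed to page through the wanted results."""
    pages_per_query = math.ceil(results_per_query / RESULTS_PER_REQUEST)
    return num_queries * pages_per_query


def fits_free_tier(num_queries: int, results_per_query: int) -> bool:
    """True if the whole run stays within one day's free quota."""
    return requests_needed(num_queries, results_per_query) <= FREE_DAILY_REQUESTS


# Example: 7 search queries, paging through up to 30 results each
print(requests_needed(7, 30))
print(fits_free_tier(7, 30))
```

In practice the count can be higher than this estimate, because pages that block scraping force the pipeline to fetch additional result pages.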

To monitor usage:

- Go to APIs & Services → Usage in the Google Cloud Console.
- Check the metrics for the Custom Search JSON API.





#### **Setting Up OpenAI API Key**
The OpenAI API provides access to powerful language models like GPT-4, allowing you to integrate AI capabilities into your applications. Below are the steps to generate an API key, understand its usage, and manage charges.

Step 1: Create an OpenAI Account
- Visit the [OpenAI website](https://platform.openai.com/).
- Sign up for a free account using your email, Google, or Microsoft account.

Step 2: Generate an API Key
- Log in to the [OpenAI Platform](https://platform.openai.com/).
- Navigate to the API Keys section under your account settings.
- Click Create new secret key.
- Copy and save the generated API key securely, as it will not be shown again.

Step 3: Understand API Charges
- The OpenAI API is chargeable based on the usage of tokens. However, new accounts typically receive a free trial credit (e.g., $5 USD). This allows you to experiment with the API without incurring charges initially.

- Free Trial Credit:

  - New users often get $5 in credits valid for 3 months.
  - These credits can be used to explore and test the API features.

- Paid Usage:

  - After the free credits are exhausted, you must set up billing to continue using the API. You can start with a small budget, such as 5 USD, for experimentation purposes.
  - Charges depend on the API model and the number of tokens consumed.

Billing Management:
- Navigate to the Billing section in the OpenAI dashboard.
- Add a payment method to enable additional usage beyond the free tier.
- Set spending limits to avoid unexpected charges.

#### **Neo4j Setup**

Neo4j is a popular graph database designed to handle connected data and build knowledge graphs efficiently. Below are the steps to create a free-tier instance, obtain its connection credentials, and start using Neo4j.

Step 1: Create a Neo4j Account
- Visit the Neo4j Aura Website (Neo4j's cloud offering).
- Click Start Free to create an account. You can sign up using your email, Google, or GitHub account.

Step 2: Launch a Free Tier Project
- After signing up, log in to your Neo4j Aura dashboard.
- Click on Create a New Instance.
- Select the Free Tier Plan:
  - You can create a free instance with a database size limit of 200,000 nodes and relationships.
  - This is ideal for small to medium projects, such as creating a knowledge graph for experimentation or learning.

Step 3: Access Your Database Credentials
- Once the database instance is ready (it may take a few minutes), click on the instance to view its details.
- Note down the following credentials:
  - Bolt URL: Used to connect to the database.
  - Username: Default is neo4j.
  - Password: A temporary password will be generated. Change it after the first login.
- Use these credentials to connect to the database programmatically or through tools like the Neo4j Browser.

Step 4: Connect From External Tools
- Connecting to the database from external tools or libraries (like Python) does not require a separate API key; the database credentials from Step 3 are used.
- Optionally install the Neo4j Desktop app for local management, or use the Aura cloud interface.
- Pass your database credentials (Bolt URL, username, password) to the client to authenticate external connections.
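
One way to keep these credentials out of the notebook is to read them from environment variables and check the connection URI before use. This is a sketch under assumptions: the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` and the helper `load_neo4j_config` are illustrative choices, and the host in the example is a placeholder.

```python
import os
from urllib.parse import urlparse

# URI schemes accepted by the Neo4j drivers; Aura instances use "neo4j+s"
VALID_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def load_neo4j_config(env=os.environ) -> dict:
    """Read connection settings from environment variables (names are assumptions)."""
    config = {
        "url": env.get("NEO4J_URI", ""),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
    }
    parsed = urlparse(config["url"])
    if parsed.scheme not in VALID_SCHEMES or not parsed.hostname:
        raise ValueError(f"Not a valid Neo4j connection URI: {config['url']!r}")
    if not config["password"]:
        raise ValueError("NEO4J_PASSWORD is not set")
    return config


# Example with placeholder values instead of real environment variables
example_env = {
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",  # placeholder host
    "NEO4J_PASSWORD": "change-me",                        # placeholder password
}
print(load_neo4j_config(example_env)["username"])
```

The returned keys (`url`, `username`, `password`) match the keyword arguments that this notebook later passes to `Neo4jPropertyGraphStore`, so the dictionary could be unpacked into that call.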



In [None]:
!pip install llama-index
!pip install llama-index-llms-openai
!pip install llama-index-graph-stores-neo4j
!pip install llama-index-embeddings-openai

In [11]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
import os
import openai
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
import nest_asyncio
import asyncio

In [5]:
google_api = "enterapi_key"
search_engine_id = "enter_id"
open_ai_key = "enter_openai_key"

In [12]:
# Apply nest_asyncio
nest_asyncio.apply()

In [9]:
os.environ["OPENAI_API_KEY"] = open_ai_key
openai.api_key = os.environ["OPENAI_API_KEY"]

## **Data Collection for Financial Sentiment Analysis**

This section explains the functionality and purpose of each part of the provided Python code, which is designed to generate search queries, collect financial sentiment data from the web, and organize it into a structured dataset.



### **1. `search_with_google_api(query)`**
- **Purpose**: Queries the Google Custom Search API with a user-provided search query and returns a list of search results.
- **Key Steps**:
  1. Constructs the API URL using the query, API key (`google_api`), and search engine ID (`search_engine_id`).
  2. Sends an HTTP GET request to the API endpoint.
  3. Checks the response status code:
     - If `200`, extracts the `items` field containing the search results.
     - If not, prints the error code and response text.
- **Output**: Returns a list of search results (or an empty list in case of an error).
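
The URL construction in step 1 can also be sketched with explicit, URL-encoded parameters. This is an illustrative variant rather than the notebook's implementation: `build_search_url` is a hypothetical helper, while `q`, `key`, `cx`, `num`, and `start` are parameters of the Custom Search JSON API. Note that `start` is the 1-based index of the first result, so page 2 begins at 11.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query: str, api_key: str, engine_id: str, page: int = 1) -> str:
    """Build a Custom Search request URL for a 1-based result page."""
    params = {
        "q": query,
        "key": api_key,
        "cx": engine_id,
        "num": 10,                     # results per request (API maximum)
        "start": (page - 1) * 10 + 1,  # 1-based index of the first result
    }
    # urlencode escapes spaces and characters such as "&" inside the query
    return f"{API_ENDPOINT}?{urlencode(params)}"


# "KEY" and "CX" are placeholders for a real API key and search engine ID
print(build_search_url("US EV sector sentiment", "KEY", "CX", page=2))
```

Encoding the parameters this way keeps a query containing `&` or `=` from being misread as extra request parameters.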


### **2. `generate_search_queries(user_input)`**
- **Purpose**: Creates a list of detailed and diverse search queries based on the user's input.
- **Key Steps**:
  1. Uses OpenAI's GPT-4 model to generate queries tailored to financial sentiment analysis.
  2. Prompts the AI with a description of the task, asking for 5-7 specific and varied queries.
  3. Cleans and evaluates the AI-generated response to extract the queries as a Python list.
- **Output**: A Python list of search query strings, e.g., `['electric vehicle sentiment analysis', ...]`.
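
Step 3 turns the model's text reply into a Python list. A cautious way to do that is sketched below, assuming the model returns a list literal that may be wrapped in a Markdown code fence; `parse_query_list` is a hypothetical helper. It relies on `ast.literal_eval`, which only evaluates literals and therefore never executes code contained in the reply.

```python
import ast
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker


def parse_query_list(raw: str) -> List[str]:
    """Parse model output that should be a Python list of strings."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]         # drop the closing fence line
        text = "\n".join(lines)
    parsed = ast.literal_eval(text)    # evaluates literals only, never code
    if not isinstance(parsed, list) or not all(isinstance(q, str) for q in parsed):
        raise ValueError("Expected a list of strings")
    # Trim whitespace and drop empty entries
    return [q.strip() for q in parsed if q.strip()]


reply = FENCE + "python\n['EV market sentiment 2024', ' US EV sales trends ']\n" + FENCE
print(parse_query_list(reply))
```

A reply that is not a valid list literal raises `ValueError` or `SyntaxError`, which the caller can catch to retry the request.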



### **3. `fetch_full_content(url)`**
- **Purpose**: Fetches and extracts the full textual content of a webpage from a given URL.
- **Key Steps**:
  1. Sends an HTTP GET request to the URL with a custom `User-Agent` header for compatibility.
  2. If the response is successful (`200` status code), parses the HTML content using `BeautifulSoup`.
  3. Extracts text from all `<p>` (paragraph) tags on the webpage.
  4. Joins and returns the full text or `None` if no valid content is found.
- **Error Handling**: Prints error messages for unsuccessful requests or exceptions.
- **Output**: The extracted full text of the webpage or `None` if no content is available.
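
The paragraph-extraction step can be tried offline on an inline HTML string. The sketch below uses the standard library's `html.parser` as a dependency-free stand-in for `BeautifulSoup`, so it is not the notebook's implementation; `ParagraphExtractor` and `extract_paragraph_text` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_paragraph = False
        self._buffer: List[str] = []
        self.paragraphs: List[str] = []

    def flush(self) -> None:
        # Normalize whitespace and keep only non-empty paragraphs
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # also closes a previous unclosed <p>
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraph_text(html: str) -> Optional[str]:
    """Return the joined paragraph text, or None if the page has none."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.paragraphs) or None


sample = "<nav>Menu</nav><p>EV sales <b>rose</b> in 2023.</p><p> </p><p>Margins fell.</p>"
print(extract_paragraph_text(sample))
```

As in the function described above, text outside `<p>` tags (navigation, footers) is ignored, and a page without usable paragraphs yields `None` so the caller can skip it.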



### **4. `create_dataset_from_queries(queries, directory="dataset")`**
- **Purpose**: Processes the list of search queries, collects search results and webpage content, and saves them as text files in a specified directory.
- **Key Steps**:
  1. **Directory Creation**:
     - Creates a directory named `dataset` (default) if it does not exist.
  2. **Processing Queries**:
     - Loops through each query in the list.
     - Fetches search results using `search_with_google_api()`.
     - Iterates through the search results, fetching webpage content for each result.
     - Saves the content as a text file if valid content is found.
  3. **File Saving**:
     - Each text file contains:
       - The query, title, link, and snippet of the result.
       - The full content of the linked webpage.
     - Files are uniquely named (`doc_<number>.txt`) to avoid overwriting.
  4. Stops saving after collecting 10 valid documents per query or if no more results are found.
- **Output**: Text files saved in the specified directory with detailed content for each query result.

### **5. Workflow Overview**
1. **User Input**: The user provides a topic or focus area for financial sentiment analysis (e.g., "Electric vehicle sector in the US").
2. **Query Generation**:
   - `generate_search_queries()` generates detailed queries related to the user input.
3. **Data Collection**:
   - `create_dataset_from_queries()` processes the queries to fetch search results and webpage content.
4. **Storage**:
   - The data is saved as organized text files in a directory for further analysis.




In [6]:
import requests

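def search_with_google_api(query):
    """Query the Google Custom Search JSON API and return the list of result items."""
    url = f"https://www.googleapis.com/customsearch/v1?q={query}&key={google_api}&cx={search_engine_id}"

    # A timeout keeps the notebook from hanging on a stalled request
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        # "items" is absent when the search returns no results
        return response.json().get("items", [])
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return []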


In [18]:
import os
import requests
from bs4 import BeautifulSoup
import openai


def generate_search_queries(user_input):
    """
    Generates a list of 5-7 detailed and relevant search queries for financial sentiment analysis
    based on the user's input, such as a target sector, field, or region.
    """
    prompt = f"""
    You are a financial analyst and search query expert. Based on the following user input, generate a list of 5-7 search queries
    for financial sentiment analysis based on user input. Ensure the queries cover diverse aspects of the topic, including sector-specific trends,
    regional financial overviews, and broader financial landscapes. The queries should focus on extracting data relevant to sentiment
    and performance analysis.

    User Input: {user_input}

    Strictly output the queries as a python list of strings. Do not add any additional comments.
    """

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert in generating search queries for financial sentiment analysis."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=200
    )

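    # Extract the list of queries. ast.literal_eval parses list literals only,
    # so text returned by the model is never executed as code (unlike eval).
    import ast
    queries = response.choices[0].message.content.strip()
    return ast.literal_eval(queries)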

def fetch_full_content(url):
    """
    Fetches the full content of a webpage given its URL.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            paragraphs = soup.find_all("p")
            full_text = "\n".join([p.get_text() for p in paragraphs])
            return full_text.strip() if full_text else None
        else:
            print(f"Error: Unable to fetch content from {url} (Status Code: {response.status_code})")
            return None
    except Exception as e:
        print(f"Error fetching content from {url}: {e}")
        return None

def create_dataset_from_queries(queries, directory="dataset"):
    """
    Process search queries and save each result as a text file in the given directory.
    """
    if not os.path.exists(directory):
        os.makedirs(directory)

    file_count = 1  # To ensure unique filenames across all queries

    for query in queries:
        print(f"Processing query: {query}")
        valid_count = 0
        page_number = 1

        while valid_count < 10:
            print(f"Fetching search results, page {page_number}...")
            # `start` is the 1-based index of the first result: 1, 11, 21, ...
            results = search_with_google_api(query + f"&start={(page_number - 1) * 10 + 1}")

            if not results:
                print("No more results found. Try refining the query.")
                break

            for result in results:
                if valid_count >= 10:
                    break  # Stop when 10 valid documents are saved

                title = result["title"]
                link = result["link"]
                snippet = result.get("snippet", "No snippet")

                # Fetch full content of the link
                full_content = fetch_full_content(link)
                if full_content:  # Save only if content is valid
                    filename = f"{directory}/doc_{file_count}.txt"
                    with open(filename, "w", encoding="utf-8") as f:
                        f.write(f"Query: {query}\n")
                        f.write(f"Title: {title}\n")
                        f.write(f"Link: {link}\n")
                        f.write(f"Snippet: {snippet}\n\n")
                        f.write(f"Full Content:\n{full_content}")
                    print(f"Saved: {filename}")
                    valid_count += 1
                    file_count += 1
                else:
                    print(f"Skipped: {link} (No valid content)")

            page_number += 1  # Move to the next page of results

    print(f"Finished processing all queries. Total files saved: {file_count - 1}")


user_input = "Financial sentiment analysis for the electric vehicle sector in the US"
queries = generate_search_queries(user_input)
queries
create_dataset_from_queries(queries)


Processing query: US electric vehicle sector sentiment analysis
Fetching search results, page 1...
Saved: dataset/doc_1.txt
Error: Unable to fetch content from https://www.spglobal.com/mobility/en/research-analysis/us-ev-sales-grew-nearly-52-in-2023.html (Status Code: 403)
Skipped: https://www.spglobal.com/mobility/en/research-analysis/us-ev-sales-grew-nearly-52-in-2023.html (No valid content)
Saved: dataset/doc_2.txt
Error: Unable to fetch content from https://www.weforum.org/stories/2024/06/china-electric-vehicle-advantage/ (Status Code: 403)
Skipped: https://www.weforum.org/stories/2024/06/china-electric-vehicle-advantage/ (No valid content)
Saved: dataset/doc_3.txt
Error: Unable to fetch content from https://www.spglobal.com/mobility/en/research-analysis/2024-ev-forecast-the-supply-chain-charging-network-and-battery.html (Status Code: 403)
Skipped: https://www.spglobal.com/mobility/en/research-analysis/2024-ev-forecast-the-supply-chain-charging-network-and-battery.html (No valid co



# Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an advanced natural language processing framework that combines the strengths of retrieval-based and generation-based models. The approach enhances the ability of language models to generate accurate, contextually relevant, and factual responses by integrating external knowledge retrieved from a database or document store.

## How RAG Works

1. **Input Query**: A user provides a query or question.
2. **Retrieval Phase**:
   - The query is passed to a retriever model that searches for relevant documents or knowledge snippets from a pre-built database or vector store.
   - These documents are ranked based on their relevance to the query using similarity metrics (e.g., cosine similarity).
3. **Augmentation Phase**:
   - The retrieved documents are appended to the query to create an augmented input.
   - This augmented input is passed to a generation model, typically a large language model (LLM).
4. **Generation Phase**:
   - The generation model processes the augmented input to create a coherent and context-aware response.
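
The retrieval and augmentation phases above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the three documents are made-up examples, and the final prompt would be sent to an LLM in the generation phase.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Retrieval phase: rank documents by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def augment(query: str, retrieved: List[str]) -> str:
    """Augmentation phase: prepend the retrieved text to the query."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up example documents
documents = [
    "Tesla deliveries rose in Q2 2024",
    "Battery costs are falling across the sector",
    "Oil prices and the macroeconomy",
]
question = "How are Tesla deliveries trending?"
prompt = augment(question, retrieve(question, documents, top_k=1))
print(prompt)
```

A production system replaces `embed` with a learned embedding model and the sorted list with a vector index, but the ranking and prompt-building steps keep the same shape.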

## Benefits of RAG

- **Factual Accuracy**: By retrieving real-time, context-relevant information, RAG reduces hallucinations (made-up information) in LLM outputs.
- **Scalability**: RAG systems can scale with growing knowledge bases by integrating vector databases such as FAISS, Pinecone, or Milvus.
- **Flexibility**: Retrieval and generation components can be updated independently, allowing for easier customization and improvements.

---

# GraphRAG: Enhancing RAG with Graph-Based Context

GraphRAG builds upon the principles of RAG by incorporating **graph-based representations** for enhanced contextual understanding. In scenarios where relationships between entities, concepts, or events are critical, GraphRAG uses knowledge graphs to represent and leverage these connections.

## What is GraphRAG?

GraphRAG integrates **knowledge graphs** with RAG's retrieval and generation framework. Knowledge graphs store information as entities (nodes) and their relationships (edges). These graphs provide structured, interconnected data that enrich the retrieval and generation processes.

## How GraphRAG Works

1. **Input Query**:
   - The query is processed to identify key entities and relationships.
2. **Graph-Based Retrieval**:
   - Instead of or in addition to using a standard vector database, GraphRAG queries the knowledge graph for related entities and their connections.
   - Relevant subgraphs are extracted and converted into a textual or structured format suitable for augmentation.
3. **Augmentation**:
   - The extracted graph-based context is combined with the query, similar to RAG.
4. **Generation**:
   - A generation model processes the augmented input to create a response that incorporates the relational and contextual information from the graph.
5. **Feedback Loop**:
   - User feedback and new data can be used to update the knowledge graph, ensuring continuous improvement and relevance.
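
Steps 1 to 3 can be illustrated with a toy in-memory graph. The sketch below is a simplification under assumptions: entities are matched by plain substring search rather than an LLM or NER model, retrieval is limited to one hop, and the triples are made-up examples. The output uses the `subject -> RELATION -> object` text form that the retriever prints later in this notebook.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def find_entities(query: str, triples: List[Triple]) -> Set[str]:
    """Step 1: find graph entities that are mentioned in the query."""
    names = {t[0] for t in triples} | {t[2] for t in triples}
    lowered = query.lower()
    return {name for name in names if name.lower() in lowered}


def one_hop_subgraph(entities: Set[str], triples: List[Triple]) -> List[Triple]:
    """Step 2: keep every triple that touches a matched entity."""
    return [t for t in triples if t[0] in entities or t[2] in entities]


def to_context(triples: List[Triple]) -> str:
    """Step 3: serialize the subgraph as text for the prompt."""
    return "\n".join(f"{s} -> {r} -> {o}" for s, r, o in triples)


# Made-up example graph
graph = [
    ("Tesla", "LOCATED_IN", "United States"),
    ("BYD", "LOCATED_IN", "China"),
    ("Tesla", "PRODUCES", "Model Y"),
]
query = "How is Tesla doing in the EV sector?"
subgraph = one_hop_subgraph(find_entities(query, graph), graph)
print(to_context(subgraph))
```

The resulting text block is combined with the query exactly as in plain RAG; only the source of the context differs.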

## Benefits of GraphRAG

- **Relational Context**: By using knowledge graphs, GraphRAG captures relationships between entities, providing deeper insights into complex queries.
- **Dynamic Updates**: Knowledge graphs can be updated in real-time, ensuring that the retrieval process remains relevant.
- **Domain-Specific Applications**: GraphRAG is particularly effective in domains like healthcare, finance, and supply chain, where interconnected data plays a significant role.

## Applications of GraphRAG

1. **Healthcare**:
   - Building systems that understand patient histories and medical relationships (e.g., symptoms, diagnoses, and treatments).
2. **Customer Support**:
   - Context-aware chatbots that use customer interaction graphs to provide personalized support.
3. **Supply Chain Management**:
   - Optimizing logistics by understanding relationships between suppliers, products, and delivery routes.
4. **Education**:
   - Knowledge-based tutoring systems that leverage educational content and conceptual relationships.

---

## Comparing RAG and GraphRAG

| **Feature**                | **RAG**                        | **GraphRAG**                                |
|----------------------------|--------------------------------|--------------------------------------------|
| **Data Source**            | Textual databases             | Knowledge graphs with relational data      |
| **Contextual Understanding**| Limited to retrieved text     | Enhanced through graph-based relationships |
| **Best Use Cases**         | General QA and text generation| Domains with interconnected entities       |
| **Complexity**             | Simpler to implement          | Requires graph construction and maintenance|

---

By combining the power of retrieval-based augmentation with the structured relational knowledge of graphs, GraphRAG represents a significant evolution in intelligent systems, enabling better contextual reasoning and domain-specific applications.


The code `documents = SimpleDirectoryReader("dataset").load_data()` uses the `SimpleDirectoryReader` class from LlamaIndex (`llama_index.core`; the library was formerly called GPT Index) to read all files stored in a directory. Here, the `dataset` directory is scanned for the `.txt` files saved during data collection. The reader opens each file, decodes its content, and returns a list of `Document` objects ready for indexing.

In [20]:
documents = SimpleDirectoryReader("dataset").load_data()

Initialize the graph store using your Neo4j credentials. The setup steps are given at the beginning of the notebook.

In [21]:
graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="",
    url="neo4j+s://.databases.neo4j.io",
)

### Creating the Index with `PropertyGraphIndex`

The code snippet demonstrates how to create a **PropertyGraphIndex**, a graph-based index used to organize and retrieve knowledge efficiently from a set of documents. This index combines various AI-driven components, including embeddings and knowledge graph extractors, to build a semantic representation of the document data.

The `PropertyGraphIndex.from_documents()` method serves as the entry point for creating the index. It begins by processing the `documents` input, which contains text data to be indexed. Each document is embedded using an embedding model, specified here as `OpenAIEmbedding(model_name="text-embedding-3-small")`. This step transforms textual information into dense vector representations, enabling semantic similarity and efficient querying.

For graph structure, the method uses **knowledge graph extractors**, such as `SchemaLLMPathExtractor`, which leverage an LLM (large language model). In this example, the LLM is instantiated as `OpenAI(model="gpt-3.5-turbo", temperature=0.0)`, configured for zero randomness to ensure consistency. The LLM extracts schema paths, entities, and relationships from the text, enriching the graph with structured knowledge.

The `property_graph_store` parameter specifies the storage backend for the graph, allowing flexibility in where and how the data is stored. Additional options, such as `show_progress=True` and `use_async=True`, enable real-time progress updates and asynchronous processing, improving efficiency for large-scale datasets.
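
To make the idea of schema-guided extraction concrete, the toy filter below keeps only candidate triples whose relation is allowed for the subject's entity type. It is a conceptual sketch, not LlamaIndex's implementation: in the real extractor an LLM proposes the triples, and the entity types, schema, and `filter_by_schema` helper shown here are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# (subject, subject_type, relation, object)
TypedTriple = Tuple[str, str, str, str]


def filter_by_schema(
    triples: List[TypedTriple], schema: Dict[str, Set[str]]
) -> List[TypedTriple]:
    """Keep triples whose relation is permitted for the subject's type."""
    return [t for t in triples if t[2] in schema.get(t[1], set())]


# Hypothetical schema: which relations each entity type may start
schema = {
    "COMPANY": {"LOCATED_IN"},
    "PERSON": {"WORKED_ON"},
}

# Candidate triples such as an LLM might propose (types are assumptions)
candidates = [
    ("Tesla", "COMPANY", "LOCATED_IN", "United States"),
    ("investor sentiment", "CONCEPT", "LOCATED_IN", "US"),
    ("James Hamilton", "PERSON", "WORKED_ON", "Oil and the Macroeconomy since World War II"),
    ("Tesla", "COMPANY", "WORKED_ON", "battery costs"),
]

for subject, _, relation, obj in filter_by_schema(candidates, schema):
    print(f"{subject} -> {relation} -> {obj}")
```

Constraining the graph this way trades recall for consistency: fewer odd triples reach the store, at the cost of dropping relations the schema did not anticipate.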


In [22]:
# Create the index
index = PropertyGraphIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    kg_extractors=[
        SchemaLLMPathExtractor(
            llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0)
        )
    ],
    property_graph_store=graph_store,
    show_progress=True,
    use_async=True
)


Parsing nodes:   0%|          | 0/70 [00:00<?, ?it/s]

Extracting paths from text with schema: 100%|██████████| 2589/2589 [58:50<00:00,  1.36s/it]
Generating embeddings: 100%|██████████| 26/26 [00:03<00:00,  7.27it/s]
Generating embeddings: 100%|██████████| 135/135 [00:07<00:00, 17.22it/s]


In [32]:

# Optional: persist the index locally and reload it from the ./storage folder.
# Use this if you don't want to set up Neo4j.
index.storage_context.persist(persist_dir="./storage")

from llama_index.core import StorageContext, load_index_from_storage

index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)


## **Building GraphRAG Pipeline**



In [8]:
# Load the index from the existing Neo4j graph store.
# Pass the same embedding model used at indexing time so query embeddings match.
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
)

In [25]:
# Define retriever
retriever = index.as_retriever(
    include_text=False,  # Default is True
)
results = retriever.retrieve("What is the summary of the financial texts?")
for record in results:
    print(record.text)

Greeven -> WORKED_ON -> The Economist
Wei -> WORKED_ON -> The Economist
Xue -> WORKED_ON -> The Economist
He -> WORKED_ON -> The Economist
James Hamilton -> WORKED_ON -> Oil and the Macroeconomy since World War II
investor sentiment -> LOCATED_IN -> US
CFO -> WORKED_ON -> gross profit
CFO -> WORKED_ON -> ASP
CFO -> WORKED_ON -> manufacturing credits
CFO -> WORKED_ON -> battery costs
CFO -> WORKED_ON -> IRA incentives


In [28]:
query_engine = index.as_query_engine(include_text=True)
response = query_engine.query("How to invest in the EV sector? Summarize the most important financial trends in the EV Sector.")
print(response)


Investing in the EV sector involves understanding key financial trends. The sector has seen significant growth, with sales of battery electric and plug-in hybrid electric vehicles surpassing two million vehicles in 2019. Despite disruptions like the COVID-19 pandemic, the market continues to expand. Market segmentation is crucial for identifying opportunities and managing risks. Government interventions, consumer attitudes, and OEM investments play a significant role in shaping the sector. The EV market is expected to grow substantially by 2030, presenting opportunities for traditional and new-entrant OEMs, finance companies, and dealerships. Understanding global sales trends, market shares, and production projections, especially in regions like China, can provide valuable insights for potential investors in the EV sector.


In [30]:
query_engine = index.as_query_engine(include_text=True)
response = query_engine.query("Which companies are doing the best in EV sector?")
print(response)


BYD and other Chinese EV car companies are leading the electric vehicle sector.


In [31]:
query_engine = index.as_query_engine(include_text=True)
response = query_engine.query("How is Tesla doing in EV sector?")
print(response)

Tesla is located in various countries such as the United States, China, and India. The company experienced a decline in sales in 2024, marking its first annual sales drop in a decade. Despite this decline, analysts remain confident in Tesla's ability to accelerate delivery growth in the future. Tesla's financials improved in Q2 2024, with revenue increasing by 20% year-over-year driven by a rise in deliveries. The company delivered a significant number of vehicles during the quarter, although overall demand for electric vehicles in the U.S. and elsewhere slowed. Tesla's automotive sector revenue increased over 14% year-over-year in Q2 2024, with revenue from regulatory credits doubling from the previous year. Additionally, income from other ventures like energy storage deployment and charging networks rose significantly.


In [14]:
from typing import List, Dict

def generate_summary_report(context: str, query: str) -> str:
    """
    Generate a detailed summary report for financial sentiment analysis.
    Takes context and query as inputs and returns a comprehensive summary.
    """
    prompt = f"""
    You are a financial sentiment analysis assistant. Using the context provided below, generate a detailed summary report:

    Context:
    {context}

    Query:
    {query}

    The report should include:
    1. A high-level summary of the financial trends related to the query.
    2. Key positive, negative, and neutral sentiments detected.
    3. Reasons or factors driving the sentiments.
    4. Suggestions or insights for potential investors or stakeholders.

    Be concise but ensure that the report is actionable and insightful.
    """
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert in financial sentiment analysis."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=500
    )
    return response.choices[0].message.content.strip()

def query_and_generate_reports(queries: List[str]) -> List[Dict[str, str]]:
    """
    Query the knowledge graph for each query, aggregate context, and generate summary reports.
    Returns a list of dictionaries containing the query, aggregated context, and generated report.
    """
    results = []

    for query in queries:
        print(f"Processing query: {query}")
        # query() returns a Response object; convert it to plain text
        context = str(query_engine.query(query))

        # Generate a summary report using the aggregated context
        report = generate_summary_report(context, query)

        results.append({
            "query": query,
            "context": context,
            "report": report
        })

    return results

def save_reports_to_file(results: List[Dict[str, str]], filename: str):
    """
    Save query results and their generated reports to a file.
    """
    with open(filename, "w", encoding="utf-8") as file:
        for result in results:
            file.write(f"Query:\n{result['query']}\n\n")
            file.write(f"Context:\n{result['context']}\n\n")
            file.write(f"Generated Report:\n{result['report']}\n\n")
            file.write("-" * 80 + "\n\n")



# Create the query engine
query_engine = index.as_query_engine(include_text=True)

# Define a list of queries of different kinds to test the system's effectiveness, starting with the EV sector
queries = [
    "How to invest in the EV sector? Summarize the most important financial trends in the EV Sector.",
    "What are the recent financial sentiments about renewable energy investments?",
    "Summarize the financial outlook for the technology sector in 2024.",
    "What are the key financial risks in the automotive industry this year?",
    "Provide insights on the financial performance of AI startups in the US."
]

# Execute the queries and generate reports
results = query_and_generate_reports(queries)

# Save the reports to a file
output_file = "financial_sentiment_reports.txt"
save_reports_to_file(results, output_file)

# Print a summary of the generated reports
for result in results:
    print(f"Query: {result['query']}")
    print(f"Generated Report:\n{result['report']}")
    print("-" * 80)


Processing query: How to invest in the EV sector? Summarize the most important financial trends in the EV Sector.
Processing query: What are the recent financial sentiments about renewable energy investments?




Processing query: Summarize the financial outlook for the technology sector in 2024.
Processing query: What are the key financial risks in the automotive industry this year?
Processing query: Provide insights on the financial performance of AI startups in the US.
Query: How to invest in the EV sector? Summarize the most important financial trends in the EV Sector.
Generated Report:
Report:

Title: Financial Trends and Sentiment Analysis in the EV Sector

1. High-Level Summary:
The electric vehicle (EV) sector's financial landscape is experiencing an upward trend due to increasing revenues ensuing from the higher sales and growing popularity of EVs. With steady growth in regulatory credits and the expansion of manufacturers into other service areas such as energy storage deployment and charging networks, the sector is witnessing dynamic revenue growth and improved financial stability.

2. Sentiment Analysis:

   a) Positive Sentiment: There is a strong positive sentiment around the incr

# Project Summary, Business Impact, Future Improvements, and Try It Out

## Project Summary
This project demonstrates the integration of a **Neo4j knowledge graph** with **LlamaIndex** and **OpenAI** to enable financial sentiment analysis. It allows users to query an indexed graph, aggregate context from multiple sources, and generate comprehensive summary reports using advanced natural language processing techniques. The system is designed to support decision-making by providing insights into financial trends, risks, and sentiments across various industries and sectors.

Key features include:
- Context aggregation from graph-based query results.
- Sentiment analysis of financial data.
- Automated report generation with actionable insights.

---

## Business Impact
The project has several significant business implications:
1. **Enhanced Decision-Making**: By providing sentiment-driven insights, businesses can make informed investment decisions and identify emerging opportunities or risks.
2. **Improved Efficiency**: Automating sentiment analysis reduces the time and effort required to manually analyze vast amounts of financial data.
3. **Scalable Insights**: The system can easily adapt to new sectors or industries, enabling businesses to monitor diverse areas of interest with minimal effort.
4. **Data-Driven Strategies**: By understanding market sentiments, organizations can tailor their strategies to align with current trends and stakeholder perceptions.

---

## Future Improvements
There are several areas where this project can be enhanced:
1. **Sentiment Trend Visualization**:
   - Add data visualization tools (e.g., Matplotlib, Plotly) to show sentiment trends over time.
   - Enable users to analyze changes in sentiment dynamically.
   
2. **Integration with External APIs**:
   - Incorporate APIs like Alpha Vantage, Yahoo Finance, or Google Finance for real-time financial data updates.
   - Use external APIs for live sentiment scoring of recent news and reports.

3. **Personalized Query Recommendations**:
   - Implement user profiling to recommend queries based on user preferences or past behavior.
   - Use machine learning to optimize query results over time.

4. **Real-Time Notifications**:
   - Add a notification system for alerts on significant sentiment shifts or financial trends.

5. **Deployment and Scalability**:
   - Deploy the application on cloud platforms like AWS or Azure for broader access and scalability.
   - Use serverless architecture to improve cost efficiency and handle varying loads.
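
The notification idea in item 4 can be sketched with a simple threshold check. This is only an illustration under stated assumptions: the project does not currently produce numeric sentiment scores, so the `daily_scores` values below are made up, and `detect_sentiment_shifts` and its `threshold` parameter are hypothetical names rather than part of the notebook.

```python
from typing import List, Tuple

def detect_sentiment_shifts(
    scores: List[Tuple[str, float]], threshold: float = 0.3
) -> List[Tuple[str, float]]:
    """Return (date, change) pairs where the score moved by at least
    `threshold` compared with the previous entry."""
    alerts = []
    # Pair each entry with the one that follows it
    for (_, prev), (date, curr) in zip(scores, scores[1:]):
        change = curr - prev
        if abs(change) >= threshold:
            alerts.append((date, round(change, 2)))
    return alerts

# Made-up scores in the range [-1, 1], for illustration only
daily_scores = [
    ("2024-07-01", 0.40),
    ("2024-07-02", 0.35),
    ("2024-07-03", -0.10),
    ("2024-07-04", -0.05),
]

alerts = detect_sentiment_shifts(daily_scores)
for date, change in alerts:
    print(f"Sentiment shift on {date}: {change:+.2f}")
```

In a real deployment the alert list would feed an email or messaging hook instead of `print`, and the threshold would need tuning against historical data.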

---

## Try It Out Exercise
To build on top of this project, try the following exercises:

1. **Explore a New Sector**:
   - Use the system to analyze sentiments for a new sector, such as healthcare, technology, or renewable energy.
   - Create custom queries targeting specific companies, regions, or policies.

2. **Add Custom Query Types**:
   - Modify the system to accept user-defined templates for queries.
   - Examples:
     - "What are the top risks in [sector] this year?"
     - "Summarize the impact of [policy] on [industry]."

3. **Visualize the Results**:
   - Add a module to visualize sentiment data and financial trends over time.
   - Example: Bar charts showing positive, neutral, and negative sentiments for a given query.

4. **Sentiment Comparison**:
   - Extend the project to compare sentiments across multiple sectors or regions.
   - Example: "Compare the financial sentiments of the EV sector in the US and Europe."

5. **Deploy and Test**:
   - Deploy the project as a web application using Streamlit or Flask.
   - Allow users to interact with the system and test its capabilities with their queries.

By trying out these exercises, you can explore the full potential of the system and customize it to address unique business needs.
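
For exercise 2, one possible starting point is a small template helper built on Python's `str.format`. This is a minimal sketch, not part of the original notebook: `QUERY_TEMPLATES`, `template_fields`, and `build_query` are hypothetical names, and the templates use `{sector}`-style placeholders in place of the `[sector]` notation shown above.

```python
from string import Formatter
from typing import Dict, List

# Hypothetical templates; placeholder names are illustrative only
QUERY_TEMPLATES: Dict[str, str] = {
    "top_risks": "What are the top risks in {sector} this year?",
    "policy_impact": "Summarize the impact of {policy} on {industry}.",
}

def template_fields(template: str) -> List[str]:
    """Return the placeholder names used in a template, in order."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]

def build_query(template_name: str, **values: str) -> str:
    """Fill a named template, failing loudly if a placeholder is missing."""
    template = QUERY_TEMPLATES[template_name]
    missing = [f for f in template_fields(template) if f not in values]
    if missing:
        raise ValueError(f"Missing values for placeholders: {missing}")
    return template.format(**values)

queries = [
    build_query("top_risks", sector="healthcare"),
    build_query("policy_impact", policy="EV tax credits", industry="automotive"),
]
print(queries)
```

The resulting `queries` list has the same shape as the one passed to `query_and_generate_reports`, so it could be handed to that function unchanged.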


## Did You Know?

- **Knowledge Graphs in Finance**: Knowledge graphs are increasingly used in the financial industry to uncover hidden relationships between companies, sectors, and markets. They help analysts discover insights that are often missed in traditional data analysis.

- **Sentiment Analysis**: Sentiment analysis uses Natural Language Processing (NLP) to determine the emotional tone of a text. In finance, it's applied to news articles, earnings call transcripts, and social media to gauge market sentiment.

- **Financial Sentiment and Market Trends**: Some studies suggest that market sentiment can help anticipate price movements and volatility. For instance, positive sentiment about a company may push its stock price up, while negative sentiment can signal a potential downturn.

- **AI in Finance**: Generative AI, like OpenAI's models, is transforming how financial insights are generated. From automating reports to analyzing unstructured data, AI is making finance more data-driven and efficient.

- **Real-Time Analysis**: By combining knowledge graphs and APIs like Alpha Vantage or Yahoo Finance, real-time financial sentiment analysis can provide instant insights, enabling faster decision-making.

- **EV Sector Growth**: The electric vehicle (EV) industry is expected to grow at a compound annual growth rate (CAGR) of 24.3% from 2023 to 2030, making it a key focus for financial analysts and investors.

- **Customization Potential**: Systems like this project can be customized for various industries, from healthcare to manufacturing, making them versatile tools for any sector needing sentiment-driven insights.


### **Interview Questions**

What is Retrieval-Augmented Generation (RAG), and how does it enhance LLM capabilities?

How does a knowledge graph differ from traditional data storage methods like relational databases?

Explain how embeddings are used in a graph-based RAG system.

What are the advantages of using a property graph for RAG systems?

How do you extract entities and relationships from text for constructing a knowledge graph?

What challenges can arise when integrating a knowledge graph with an LLM for contextual querying?

How would you optimize the performance of a graph-based RAG system for large-scale datasets?

Can you explain how vector similarity search is used in RAG systems for retrieving relevant information?

What factors would you consider when selecting a graph database (e.g., Neo4j) for a RAG implementation?

What is the purpose of LlamaIndex in a RAG system, and how does it facilitate integration with LLMs?

How does LlamaIndex handle large-scale document indexing for retrieval tasks?

Explain the role of SimpleDirectoryReader in LlamaIndex and how it simplifies document loading.

How does LlamaIndex leverage embeddings for semantic similarity searches?

Can you describe how the PropertyGraphIndex works in LlamaIndex and its advantages for knowledge representation?

How does the combination of embeddings and graph structures improve the quality of responses in a RAG system?

What role does temperature play in generating responses using an LLM?

How would you address the hallucination problem in LLMs when combined with RAG?

How can LLMs be integrated with real-time data sources for dynamic response generation?

What is the impact of context window size in LLMs when processing long documents in RAG systems?

How do you evaluate the accuracy and relevance of responses generated by LLMs in a RAG setup?
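
As a study aid for the question on vector similarity search, the sketch below ranks a few nodes by cosine similarity to a query vector. It is a toy illustration in plain Python, not how LlamaIndex or Neo4j implement retrieval: the three-dimensional vectors are invented for the example, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(
    query_vec: List[float], node_vecs: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k node names most similar to the query, best first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in node_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented embeddings; real ones would come from an embedding model
node_vecs = {
    "Tesla": [0.9, 0.1, 0.0],
    "EV sector": [0.8, 0.2, 0.1],
    "Bond yields": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

matches = top_k(query_vec, node_vecs, k=2)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

In a graph-based RAG system, the nodes retrieved this way typically serve as entry points, and their neighboring relationships are then pulled in as additional context.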




### **Conclusion**

This project successfully demonstrates the power of combining graph-based knowledge representation with Retrieval-Augmented Generation (RAG) to enable effective financial sentiment analysis. By leveraging cutting-edge technologies such as LlamaIndex, Neo4j, and OpenAI embeddings, the system efficiently organizes and retrieves data from unstructured sources, transforming it into actionable insights. The integration of knowledge graphs allows for a rich understanding of relationships between entities, while the use of LLMs enhances contextual accuracy in query responses.