## Setting up the environment

In [7]:
import os
from dotenv import load_dotenv
load_dotenv()

# Setup embedding model
from llama_index.core import Settings

# Use local embedding models served by LM Studio
# Use fake API key (LM Studio doesn't validate it)
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    api_base=os.getenv("LM_STUDIO_API_BASE"),
    api_key="nothing at all",
    model_name=os.getenv("LM_STUDIO_EMBED_MODEL"),
    embed_batch_size=50,
)

# Register the embedding model globally, then verify it
Settings.embed_model = embed_model
embed = embed_model.get_text_embedding("The quick brown fox jumps over the lazy dog.")
print(embed[:5])  # Should print the first five floats of the embedding

[0.0731334239244461, 0.07449399679899216, -0.0038104583509266376, -0.024098584428429604, -0.02073613740503788]
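
The five floats above are the leading components of the embedding vector (768 dimensions, per the dataset summary printed later in this notebook). As a rough sketch of how two such vectors can be compared, the hypothetical helper `cosine_similarity` below computes their cosine similarity. The notebook does not state which metric the vector store uses, so cosine similarity is an assumption here, and the vectors are toy values rather than model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.0
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, not real embeddings
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
print(cosine_similarity(v1, v1))  # identical direction
print(cosine_similarity(v1, v2))  # orthogonal vectors
```

With real embeddings, the same function could be applied to two `get_text_embedding` results to check that related sentences score higher than unrelated ones.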


In [8]:
# Use Zhipu's free model to extract node relationships
from llama_index.llms.zhipuai import ZhipuAI
llm_extraction = ZhipuAI(
    api_key=os.getenv("ZHIPU_API_KEY"),
    model=os.getenv("ZHIPU_LLM_MODEL_NAME")
)
Settings.llm = llm_extraction

# Verify the LLM
print(llm_extraction.complete("\nBriefly introduce yourself in 50 Chinese characters."))

我是人工智能助手，擅长解答问题。


## Pipeline 1: Collecting & preparing the documents

In [9]:
from llama_index.core import SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./documents").load_data()
print(documents[7])

Doc ID: 75e47e22-4dbd-4beb-85b0-41a3c9b4669f
Text: Title: Air mass (solar energy)  URL:
https://en.wikipedia.org/wiki/Air_mass_(solar_energy)  Content: The
air mass coefficient defines the direct optical path length through
the Earth's atmosphere, expressed as a ratio relative to the path
length vertically upwards, i.e. at the zenith. The air mass
coefficient can be used to help characterize the...


## Pipeline 2: Creating vector store

In [10]:
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

import traceback

# Path for the vector store dataset (local storage)
vector_dataset = "./dataset/vector_storage"
vector_store_path = vector_dataset
dataset_path = vector_dataset

# If True, overwrite any existing dataset at dataset_path
ow = True

# Create an index over the documents
try:
    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=ow)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, show_progress=True
    )
except Exception as e:
    print(f"An error occurred: {e}")
    print(f"Error type: {type(e)}")
    # Print the formatted traceback rather than the traceback object's repr
    traceback.print_exc()




Parsing nodes:   0%|          | 0/883 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Uploading data to deeplake dataset.


100%|██████████| 2048/2048 [00:03<00:00, 640.96it/s]


Dataset(path='./dataset/vector_storage', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (2048, 1)     str     None   
 metadata     json      (2048, 1)     str     None   
 embedding  embedding  (2048, 768)  float32   None   
    id        text      (2048, 1)     str     None   


Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Uploading data to deeplake dataset.


100%|██████████| 2048/2048 [00:03<00:00, 664.88it/s]


Dataset(path='./dataset/vector_storage', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (4096, 1)     str     None   
 metadata     json      (4096, 1)     str     None   
 embedding  embedding  (4096, 768)  float32   None   
    id        text      (4096, 1)     str     None   




Generating embeddings:   0%|          | 0/1252 [00:00<?, ?it/s]

Uploading data to deeplake dataset.


100%|██████████| 1252/1252 [00:01<00:00, 638.12it/s]


Dataset(path='./dataset/vector_storage', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (5348, 1)     str     None   
 metadata     json      (5348, 1)     str     None   
 embedding  embedding  (5348, 768)  float32   None   
    id        text      (5348, 1)     str     None   
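
The log above shows three upload rounds of 2048, 2048 and 1252 nodes, which add up to the 5348 rows in the final summary. The sketch below reproduces that arithmetic. The batch size of 2048 is only inferred from the progress bars, and `batch_sizes` is a hypothetical helper, not part of Deep Lake or LlamaIndex.

```python
def batch_sizes(total, batch_size):
    """Split `total` items into consecutive batches of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full_batches, remainder = divmod(total, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes

# 5348 nodes in total, batch size of 2048 assumed from the log
sizes = batch_sizes(5348, 2048)
print(sizes)       # [2048, 2048, 1252]
print(sum(sizes))  # 5348
```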


In [11]:
import deeplake
import pandas as pd
import numpy as np

# Create a dictionary to hold the data
data = {}

# Load vector store data
ds = deeplake.load(dataset_path)
ds.summary()

# Iterate through the tensors in the dataset
for tensor_name in ds.tensors:
    tensor_data = ds[tensor_name].numpy()

    # Check if the tensor is multi-dimensional
    if tensor_data.ndim > 1:
        # Flatten multi-dimensional tensors
        data[tensor_name] = [np.array(e).flatten().tolist() for e in tensor_data]
    else:
        # Convert 1D tensors directly to lists and decode text
        if tensor_name == "text":
            data[tensor_name] = [t.tobytes().decode('utf-8') if t else "" for t in tensor_data]
        else:
            data[tensor_name] = tensor_data.tolist()

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

./dataset/vector_storage loaded successfully.





Dataset(path='./dataset/vector_storage', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (5348, 768)  float32   None   
    id        text      (5348, 1)     str     None   
 metadata     json      (5348, 1)     str     None   
   text       text      (5348, 1)     str     None   


In [12]:
# Function to display a selected record
def display_record(record_number):
    record = df.iloc[record_number]
    display_data = {
        "ID": record.get("id", "N/A"),
        "Metadata": record.get("metadata", "N/A"),
        "Text": record.get("text", "N/A"),
        "Embedding": record.get("embedding", "N/A")
    }

    # Print the ID
    print("ID:")
    print(display_data["ID"])
    print()

    # Print the metadata in a structured format
    print("Metadata:")
    metadata = display_data["Metadata"]
    if isinstance(metadata, list):
        for item in metadata:
            for key, value in item.items():
                print(f"{key}: {value}")
            print()
    else:
        print(metadata)
    print()

    # Print the text
    print("Text:")
    print(display_data["Text"])
    print()

    # Print the embedding
    print("Embedding:")
    print(display_data["Embedding"])
    print()

# Example usage
rec = 7  # Replace with the desired record number
display_record(rec)

ID:
['84e136e6-1761-4cec-9e8c-207f005be261']

Metadata:
file_path: /home/ongin/github/RAG-pR/RAG-Wikipedia/documents/Absorption_electromagnetic_radiation.txt
file_name: Absorption_electromagnetic_radiation.txt
file_type: text/plain
file_size: 5392
creation_date: 2025-06-12
last_modified_date: 2025-06-12
_node_content: {"id_": "84e136e6-1761-4cec-9e8c-207f005be261", "embedding": null, "metadata": {"file_path": "/home/ongin/github/RAG-pR/RAG-Wikipedia/documents/Absorption_electromagnetic_radiation.txt", "file_name": "Absorption_electromagnetic_radiation.txt", "file_type": "text/plain", "file_size": 5392, "creation_date": "2025-06-12", "last_modified_date": "2025-06-12"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "relationships": {"1": {"node_id": "310bb0f0-13ab-41b4-b

## Pipeline 3: Knowledge graph index-based RAG

In [13]:
from llama_index.core import Document

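# Each cell holds a one-element list (see the ID printed above), so unwrap it first;
# calling str() on the list itself would keep the brackets and quotes
def unwrap(value):
    return value[0] if isinstance(value, list) and len(value) == 1 else value

# Ensure 'text' column is of type string
df['text'] = df['text'].map(unwrap).astype(str)
# Create documents with IDs
documents = [Document(text=row['text'], doc_id=str(unwrap(row['id']))) for _, row in df.iterrows()]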

### Generating the graph index

In [14]:
from llama_index.core import KnowledgeGraphIndex
import time
from typing import List

# Start the timer
start_time = time.time()

class RobustKnowledgeGraphIndex(KnowledgeGraphIndex):
    def _extract_triplets(self, text: str) -> List[tuple]:
        """Extract triplets with error handling to ignore failures."""
        try:
            # Call the parent class's triplet extraction method
            return super()._extract_triplets(text)
        except Exception as e:
            # Log the error (optional) and return an empty list to continue processing
            print(f"Error extracting triplets for text chunk: {e}")
            return []

# Graph index with embeddings
graph_index = RobustKnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=5,
    include_embeddings=True,
    show_progress=False,
)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Index creation time: {elapsed_time:.4f} seconds")

print(type(graph_index))

# Save the graph index to a file
graph_index_saving_path = "./dataset/graph_storage"
graph_index.storage_context.persist(persist_dir=graph_index_saving_path)

Error extracting triplets for text chunk: Error code: 400, with error text {"contentFilter":[{"level":1,"role":"user"}],"error":{"code":"1301","message":"系统检测到输入或生成内容可能包含不安全或敏感内容，请您避免输入易产生敏感内容的提示语，感谢您的配合。"}}
Error extracting triplets for text chunk: Error code: 400, with error text {"contentFilter":[{"level":1,"role":"user"}],"error":{"code":"1301","message":"系统检测到输入或生成内容可能包含不安全或敏感内容，请您避免输入易产生敏感内容的提示语，感谢您的配合。"}}
Error extracting triplets for text chunk: Error code: 400, with error text {"contentFilter":[{"level":1,"role":"user"}],"error":{"code":"1301","message":"系统检测到输入或生成内容可能包含不安全或敏感内容，请您避免输入易产生敏感内容的提示语，感谢您的配合。"}}
Error extracting triplets for text chunk: Error code: 400, with error text {"contentFilter":[{"level":1,"role":"user"}],"error":{"code":"1301","message":"系统检测到输入或生成内容可能包含不安全或敏感内容，请您避免输入易产生敏感内容的提示语，感谢您的配合。"}}
Error extracting triplets for text chunk: Error code: 400, with error text {"contentFilter":[{"level":1,"role":"user"}],"error":{"code":"1301","message":"系统检测到输入或生成内容可能
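
The overridden `_extract_triplets` swallows each failure and prints it, which keeps indexing going but makes it hard to tell how many chunks were skipped. A minimal sketch of one way to keep a tally is shown below. `FailureLog` and `fake_extract` are hypothetical names used for illustration only; `fake_extract` stands in for the real extraction call, and nothing here touches the LlamaIndex API.

```python
from collections import Counter

class FailureLog:
    """Runs callables, records failures by exception type, returns a default."""

    def __init__(self):
        self.counts = Counter()
        self.total_calls = 0

    def guard(self, func, *args, default=None, **kwargs):
        self.total_calls += 1
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Count the failure instead of aborting the whole run
            self.counts[type(exc).__name__] += 1
            return default

def fake_extract(text):
    # Stand-in for triplet extraction; fails on "blocked" input
    if "blocked" in text:
        raise ValueError("content filter")
    return [("subject", "relates_to", text)]

log = FailureLog()
chunks = ["chunk a", "blocked chunk", "chunk c"]
results = [log.guard(fake_extract, c, default=[]) for c in chunks]

failed = sum(log.counts.values())
print(f"{failed} of {log.total_calls} chunks failed: {dict(log.counts)}")
```

A summary line like this, printed once after indexing, would show how much of the corpus the content filter excluded from the graph.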

### Displaying the graph in HTML file

In [15]:
# Load the Graph data
from llama_index.core import StorageContext, load_index_from_storage

graph_index_saving_path = "./dataset/graph_storage"
storage_context = StorageContext.from_defaults(persist_dir=graph_index_saving_path)

graph_index = load_index_from_storage(storage_context)

# Create graph
from pyvis.network import Network

g = graph_index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)

# Set node and edge properties: colors and sizes
for node in net.nodes:
    node['color'] = 'lightgray'
    node['size'] = 10

for edge in net.edges:
    edge['color'] = 'black'
    edge['width'] = 1

fgraph="Knowledge_graph_visual.html"
net.write_html(fgraph)
print(fgraph)

Loading llama_index.core.storage.kvstore.simple_kvstore from ./dataset/graph_storage/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./dataset/graph_storage/index_store.json.
Knowledge_graph_visual.html
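
The cell above gives every node the same size. One possible refinement, sketched here under stated assumptions, is to scale node size with the number of edges touching the node so that hubs stand out. `node_sizes` is a hypothetical helper and the edge list is made-up toy data; in the notebook, the resulting mapping could be applied to `node['size']` inside the existing loop.

```python
from collections import Counter

def node_sizes(edges, base=10, step=2, max_size=40):
    """Map each node to a display size that grows with its degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Cap the size so highly connected nodes do not dominate the view
    return {node: min(base + step * d, max_size) for node, d in degree.items()}

# Toy (source, target) pairs, not taken from the real graph
edges = [("a", "b"), ("a", "c")]
sizes = node_sizes(edges)
print(sizes)
```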


## Interacting with the Knowledge graph index

In [16]:
import time
import textwrap



# Query engine parameters
k = 3       # similarity_top_k
temp = 0.1  # temperature
mt = 1024   # num_output
graph_query_engine = graph_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

def execute_query(user_input, k=3, temp=0.1, mt=1024):

    # Start the timer
    start_time = time.time()

    # Execute the query on the preconfigured engine
    # (k, temp and mt are accepted but not used here; the engine above already has them)
    response = graph_query_engine.query(user_input)

    # Stop the timer
    end_time = time.time()

    # Calculate and print the execution time
    elapsed_time = end_time - start_time
    print(f"Query execution time: {elapsed_time:.4f} seconds")

    # Print the response, wrapped to 100 characters per line
    print(textwrap.fill(str(response), 100))
    return response
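
The start/stop timer pattern appears in several cells of this notebook. As a hedged alternative, the sketch below wraps it in a context manager. `timed` is a hypothetical helper, and `time.perf_counter` is used in place of `time.time` because it is meant for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the block raises, so the timing is always reported
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.4f} seconds")

# Stand-in workload; in the notebook this would be the query call
with timed("Query execution time") as t:
    answer = sum(range(1000))
```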

In [17]:
user_query="Summarise the methods to improve efficiency of solar cells, in ordered list."

In [18]:
import time
import textwrap
import sys
import io
# Start the timer
start_time = time.time()
# Capture the output printed by execute_query
old_stdout = sys.stdout
new_stdout = io.StringIO()
sys.stdout = new_stdout
try:
    response = execute_query(user_query)
finally:
    # Restore stdout even if the query raises
    sys.stdout = old_stdout
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 120))

Query execution time: 13.5474 seconds
1. Multi-junction (MJ) solar cells - Use multiple p–n junctions with different semiconductor materials to absorb a
broader range of wavelengths, improving sunlight to electrical energy conversion efficiency. 2. Tandem fabrication
techniques - Apply to lower-cost thin-film solar cells, using materials like amorphous silicon, to create lightweight,
flexible cells with about 10% efficiency. 3. Organic solar cells - Utilize conductive organic polymers or small organic
molecules, potentially increasing the efficiency of existing inorganic solar cell technologies. 4. Quantum dot solar
cells - Employ quantum dots as photovoltaic material, potentially overcoming limitations of bulk materials. 5. Hybrid
solar cells - Combine organic and inorganic semiconductors to leverage the benefits of both. 6. Concentrated
photovoltaics (CPV) - Use optics to concentrate sunlight onto a small area, enhancing the efficiency for certain
applications. 7. Photovoltaic therma
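
Because `textwrap.fill` reflows the whole response into one block, the numbered items above run together. A minimal sketch for putting each item back on its own line follows. `split_numbered_items` is a hypothetical helper, and the regular expression is a heuristic: it assumes items start with a number, a period and a space, so it would also split on a sentence that happens to end in a number.

```python
import re

def split_numbered_items(text):
    """Split '1. ... 2. ... 3. ...' into a list with one entry per item."""
    # Collapse line breaks and repeated spaces introduced by wrapping
    flat = " ".join(text.split())
    # Split at whitespace that is directly followed by '<digits>. '
    parts = re.split(r"\s+(?=\d+\.\s)", flat)
    return [p for p in parts if p]

# Shortened sample text in the same shape as the response above
sample = (
    "1. Multi-junction cells - use\nmultiple junctions. "
    "2. Tandem techniques - about 10% efficiency. "
    "3. Organic solar cells"
)
items = split_numbered_items(sample)
for item in items:
    print(item)
```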

In [19]:
print(llm_extraction.complete("Summarise the methods to improve efficiency of solar cells."))

Improving the efficiency of solar cells can be achieved through several methods:

1. **Enhanced Photocatalysis:** Utilizing catalysts to increase the efficiency of light absorption and charge separation in solar cells.

2. **Antireflective Coatings:** Applying antireflective coatings to reduce the reflection of light from the solar cell surface, thus allowing more light to enter and be converted into electricity.

3. **Texturing the Surface:** Roughening the surface of the solar cells can increase the effective area of the cell, allowing for greater light absorption.

4. **Multiple Junction Cells:** Using solar cells with multiple layers to capture light across a broader spectrum and increase overall efficiency.

5. **Selective Emitter Layers:** Implementing selective emitter layers to improve the efficiency of light absorption at certain wavelengths.

6. **Improved Material Quality:** Using high-purity materials and minimizing defects and impurities within the solar cell structure to 