## Setting up the environment

In [32]:
import os
from dotenv import load_dotenv
load_dotenv()

# Set up the embedding model
from llama_index.core import Settings

# Use a local embedding model served by LM Studio
# Pass a placeholder API key (LM Studio doesn't validate it)
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    api_base = os.getenv("LM_STUDIO_API_BASE"),
    api_key = "whatever-it-is",
    model_name = os.getenv("LM_STUDIO_EMBED_MODEL"),
    embed_batch_size = 2
)

# Embedding model verification
Settings.embed_model = embed_model
embed = embed_model.get_text_embedding("The quick brown fox jumps over the lazy dog.")
print(embed[:5])  # Should print a list of floats

[0.0731334239244461, 0.07449399679899216, -0.003810458118095994, -0.024098584428429604, -0.02073613367974758]
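
Printing the first five values confirms that the endpoint returns floats, but not that the vectors carry any meaning. A minimal sketch of a stronger check follows, using only the standard library: it computes cosine similarity, so that embeddings of two related sentences could be compared against an unrelated one. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook; in practice the inputs would come from `embed_model.get_text_embedding(...)`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

With real embeddings, a related sentence pair would be expected to score noticeably higher than an unrelated pair; the exact values depend on the model.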


In [33]:
# Use Zhipu's free model to extract node relationships
from llama_index.llms.zhipuai import ZhipuAI
llm_extraction = ZhipuAI(
    api_key=os.getenv("ZHIPU_API_KEY"),
    model=os.getenv("ZHIPU_LLM_MODEL_NAME")
)
Settings.llm = llm_extraction

# Verify the LLM
print(llm_extraction.complete("\nBriefly introduce yourself in 50 Chinese characters."))

我是一个致力于提供帮助的人工智能助手。
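
Both setup cells read their configuration with `os.getenv`, which silently returns `None` when a variable is missing. A small sketch of an up-front check is shown here, assuming the four variable names used above are the only ones required; `missing_env_vars` and `REQUIRED` are hypothetical names introduced for illustration.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Variable names read by the two setup cells (taken from the code above).
REQUIRED = [
    "LM_STUDIO_API_BASE",
    "LM_STUDIO_EMBED_MODEL",
    "ZHIPU_API_KEY",
    "ZHIPU_LLM_MODEL_NAME",
]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` would surface a misconfigured `.env` file before any model client is constructed.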


## Pipeline 1: Collecting & preparing the documents

In [34]:
from llama_index.core import SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./documents").load_data()
print(documents[7])

Doc ID: 39303fef-98aa-417d-a446-12b26225c155
Text: Title: Aleksandr Stoletov  URL:
https://en.wikipedia.org/wiki/Aleksandr_Stoletov  Content: Alexander
Grigorievich Stoletov (Russian: Алекса́ндр Григо́рьевич Столе́тов; 10
August 1839 – 27 May 1896) was a Russian physicist, founder of
electrical engineering, and professor in Moscow University. He was the
brother of general Nikolai Stoletov.  Biog...


## Pipeline 2: Creating the vector store

In [4]:
import traceback

from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

# Path for vector store and dataset
vector_dataset = "./dataset/vector_storage" # local storage
vector_store_path = vector_dataset
dataset_path = vector_dataset

# Create an index over the documents
# If True, overwrite any existing dataset at dataset_path
ow = True

if ow:
    try:
        vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=ow)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        # Pass storage_context by keyword so it cannot be mistaken for another positional argument
        index = VectorStoreIndex.from_documents(
            documents, storage_context=storage_context, show_progress=True
        )
    except Exception as e:
        print(f"An error occurred: {e}")
        print(f"Error type: {type(e)}")
        # e.__traceback__ is a traceback object, so format it to get readable text
        print(f"Error traceback:\n{traceback.format_exc()}")

Uploading data to deeplake dataset.


100%|██████████| 700/700 [00:01<00:00, 631.32it/s]


Dataset(path='./dataset/vector_storage', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (2748, 1)     str     None   
 metadata     json      (2748, 1)     str     None   
 embedding  embedding  (2748, 768)  float32   None   
    id        text      (2748, 1)     str     None   




In [5]:
import deeplake
import pandas as pd
import numpy as np

# Create a dictionary to hold the data
data = {}

# Load vector store data
ds = deeplake.load(dataset_path)
ds.summary()

# Iterate through the tensors in the dataset
for tensor_name in ds.tensors:
    tensor_data = ds[tensor_name].numpy()

    # Check if the tensor is multi-dimensional
    if tensor_data.ndim > 1:
        # Flatten multi-dimensional tensors
        data[tensor_name] = [np.array(e).flatten().tolist() for e in tensor_data]
    else:
        # Convert 1D tensors directly to lists and decode text
        if tensor_name == "text":
            data[tensor_name] = [t.tobytes().decode('utf-8') if t else "" for t in tensor_data]
        else:
            data[tensor_name] = tensor_data.tolist()

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

./dataset/vector_storage loaded successfully.





Dataset(path='./dataset/vector_storage', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (2748, 768)  float32   None   
    id        text      (2748, 1)     str     None   
 metadata     json      (2748, 1)     str     None   
   text       text      (2748, 1)     str     None   


In [6]:
# Function to display a selected record
def display_record(record_number):
    record = df.iloc[record_number]
    display_data = {
        "ID": record.get("id", "N/A"),
        "Metadata": record.get("metadata", "N/A"),
        "Text": record.get("text", "N/A"),
        "Embedding": record.get("embedding", "N/A")
    }

    # Print the ID
    print("ID:")
    print(display_data["ID"])
    print()

    # Print the metadata in a structured format
    print("Metadata:")
    metadata = display_data["Metadata"]
    if isinstance(metadata, list):
        for item in metadata:
            for key, value in item.items():
                print(f"{key}: {value}")
            print()
    else:
        print(metadata)
    print()

    # Print the text
    print("Text:")
    print(display_data["Text"])
    print()

    # Print the embedding
    print("Embedding:")
    print(display_data["Embedding"])
    print()

# Example usage
rec = 7  # Replace with the desired record number
display_record(rec)

ID:
['bf84f0ca-85ca-4e05-9bb7-dd004c1b877c']

Metadata:
file_path: /home/ongin/github/RAG-pR/RAG-Wikipedia/documents/Absorption_electromagnetic_radiation.txt
file_name: Absorption_electromagnetic_radiation.txt
file_type: text/plain
file_size: 5392
creation_date: 2025-06-12
last_modified_date: 2025-06-12
_node_content: {"id_": "bf84f0ca-85ca-4e05-9bb7-dd004c1b877c", "embedding": null, "metadata": {"file_path": "/home/ongin/github/RAG-pR/RAG-Wikipedia/documents/Absorption_electromagnetic_radiation.txt", "file_name": "Absorption_electromagnetic_radiation.txt", "file_type": "text/plain", "file_size": 5392, "creation_date": "2025-06-12", "last_modified_date": "2025-06-12"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "relationships": {"1": {"node_id": "74a7f798-7083-4995-b
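
The `_node_content` value is a serialized JSON record, which is hard to read when printed raw. A minimal sketch of pulling out a few fields with the standard `json` module follows, assuming the stored value is a complete JSON string with the `id_`, `metadata` and `relationships` keys visible above. The function name `summarize_node_content` and the shortened sample record are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_node_content(raw: str) -> Dict[str, Any]:
    """Pick a few readable fields out of a serialized node record."""
    node = json.loads(raw)
    metadata = node.get("metadata", {})
    return {
        "id": node.get("id_"),
        "file_name": metadata.get("file_name"),
        "file_size": metadata.get("file_size"),
        "relationship_count": len(node.get("relationships", {})),
    }


# Hypothetical, shortened record shaped like the `_node_content` value above.
sample = json.dumps({
    "id_": "example-node-id",
    "metadata": {"file_name": "example.txt", "file_size": 123},
    "relationships": {"1": {"node_id": "example-source-id"}},
})
print(summarize_node_content(sample))
```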

## Pipeline 3: Knowledge graph index-based RAG

In [7]:
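from llama_index.core import Document

def unwrap(value):
    """Return the single element of a one-item list, else the value itself."""
    return value[0] if isinstance(value, list) and len(value) == 1 else value

# Each cell holds a one-element list (see the ID printed above); unwrap it first,
# because str() on the list itself would keep the brackets and quotes in the text
df['text'] = df['text'].apply(unwrap).astype(str)
# Create documents with IDs
documents = [Document(text=row['text'], doc_id=str(unwrap(row['id']))) for _, row in df.iterrows()]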

### Generating the graph index

In [8]:
from llama_index.core import KnowledgeGraphIndex
import time
from typing import List

# Start the timer
start_time = time.time()

class RobustKnowledgeGraphIndex(KnowledgeGraphIndex):
    def _extract_triplets(self, text: str) -> List[tuple]:
        """Extract triplets with error handling to ignore failures."""
        try:
            # Call the parent class's triplet extraction method
            return super()._extract_triplets(text)
        except Exception as e:
            # Log the error (optional) and return an empty list to continue processing
            print(f"Error extracting triplets for text chunk: {e}")
            return []

# Graph index with embeddings
graph_index = RobustKnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=5,
    include_embeddings=True,
    show_progress=False,
)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Index creation time: {elapsed_time:.4f} seconds")

print(type(graph_index))

# Save the graph index to a file
graph_index_saving_path = "./dataset/graph_storage"
graph_index.storage_context.persist(persist_dir=graph_index_saving_path)

Index creation time: 7986.4521 seconds
<class '__main__.RobustKnowledgeGraphIndex'>
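
The subclass above prints each failure and moves on, which makes it hard to tell afterwards how many chunks were skipped. A minimal sketch of the same tolerate-and-continue pattern with a failure counter follows, written as a plain decorator so that it runs without LlamaIndex. The names `tolerant` and `flaky_extract` and the sample chunks are hypothetical and only illustrate the idea.

```python
import functools
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]


def tolerant(func: Callable[[str], List[Triplet]]) -> Callable[[str], List[Triplet]]:
    """Wrap an extractor so failures yield [] and are counted on the wrapper."""

    @functools.wraps(func)
    def wrapper(text: str) -> List[Triplet]:
        try:
            return func(text)
        except Exception as exc:  # deliberately broad, mirroring the subclass above
            wrapper.failures += 1
            print(f"Skipped chunk ({type(exc).__name__}): {exc}")
            return []

    wrapper.failures = 0
    return wrapper


@tolerant
def flaky_extract(text: str) -> List[Triplet]:
    """Stand-in extractor: fails on blank input, otherwise returns one triplet."""
    if not text.strip():
        raise ValueError("empty chunk")
    words = text.split()
    return [(words[0], "mentions", words[-1])]


chunks = ["Stoletov studied photoelectricity", "   ", "Silicon absorbs light"]
triplets = [t for chunk in chunks for t in flaky_extract(chunk)]
print(f"{len(triplets)} triplets, {flaky_extract.failures} failed chunks")
```

The same counter could be kept as an attribute on the index subclass; reporting it next to the elapsed time would show how much of the corpus actually reached the graph.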


### Displaying the graph in an HTML file

In [9]:
# Load the Graph data
from llama_index.core import StorageContext, load_index_from_storage

graph_index_saving_path = "./dataset/graph_storage"
storage_context = StorageContext.from_defaults(persist_dir=graph_index_saving_path)

graph_index = load_index_from_storage(storage_context)

# Create graph
from pyvis.network import Network

g = graph_index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)

# Set node and edge properties: colors and sizes
for node in net.nodes:
    node['color'] = 'lightgray'
    node['size'] = 10

for edge in net.edges:
    edge['color'] = 'black'
    edge['width'] = 1

fgraph = "Knowledge_graph_visual.html"
net.write_html(fgraph)
print(fgraph)

Loading llama_index.core.storage.kvstore.simple_kvstore from ./dataset/graph_storage/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./dataset/graph_storage/index_store.json.
Knowledge_graph_visual.html
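
The styling loop above gives every node the same size, so hubs and leaf entities look alike. A minimal sketch of sizing nodes by degree follows, in pure Python. It assumes edges are dictionaries with `from` and `to` keys, which is how pyvis is understood to store entries in `net.edges`; the function name `degree_sizes`, its parameters and the sample edges are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def degree_sizes(
    edges: Iterable[Dict[str, str]], base: int = 10, step: int = 2, cap: int = 40
) -> Dict[str, int]:
    """Map each node id to a display size that grows with its degree, up to `cap`."""
    degree: Counter = Counter()
    for edge in edges:
        degree[edge["from"]] += 1
        degree[edge["to"]] += 1
    return {node: min(base + step * count, cap) for node, count in degree.items()}


# Hypothetical edge list shaped like the entries of `net.edges`.
edges: List[Dict[str, str]] = [
    {"from": "solar cell", "to": "silicon"},
    {"from": "solar cell", "to": "anti-reflective coating"},
    {"from": "silicon", "to": "band gap"},
]
sizes = degree_sizes(edges)
print(sizes)
```

With a real network, the result could be applied inside the existing node loop, for example `node['size'] = sizes.get(node['id'], 10)`, assuming each node dictionary carries an `id` key.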


## Interacting with the knowledge graph index

In [10]:
import time
import textwrap

# Query engine settings
k = 3       # similarity_top_k
temp = 0.1  # temperature
mt = 1024   # num_output
graph_query_engine = graph_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

def execute_query(user_input):
    """Run user_input against graph_query_engine, print the timing and the wrapped answer."""
    # Start the timer
    start_time = time.time()

    # Execute the query; k, temp and mt were fixed when the engine was built above
    response = graph_query_engine.query(user_input)

    # Stop the timer
    end_time = time.time()

    # Calculate and print the execution time
    elapsed_time = end_time - start_time
    print(f"Query execution time: {elapsed_time:.4f} seconds")

    # Print the response, wrapped to 100 characters per line
    print(textwrap.fill(str(response), 100))
    return response

In [40]:
user_query="Summarise the methods to improve efficiency of solar cells, in ordered list."

In [41]:
import contextlib
import io
import textwrap
import time

# Start the timer
start_time = time.time()
# Suppress execute_query's own printing; redirect_stdout restores stdout even if the query raises
with contextlib.redirect_stdout(io.StringIO()):
    response = execute_query(user_query)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 120))

Query execution time: 6.7376 seconds
1. Minimize top contact coverage to reduce the light-blocking area on the cell surface. 2. Implement anti-reflective
coatings to reduce reflection and enhance absorption by causing destructive interference. 3. Use surface texturing
techniques, such as black silicon, to reduce reflectance and trap more light. 4. Apply photonic micro-structured
coatings on the cell front contact for geometric anti-reflection and light scattering. 5. Employ plasmonic light-
trapping using noble metal nanoparticles to enhance broadband photocurrent. 6. Optimize solar cell materials, like using
thermal processing to enhance crystal quality, and layering thin-film cells for multi-junction solar cells. 7. Utilize
large, parallel nanowire arrays and add nanoparticles to increase absorption lengths and conduction. 8. Consider rear
surface passivation, such as using dielectric passivation layers, to improve efficiency. 9. Use thin film materials,
incorporating light-trapping 

In [37]:
print(llm_extraction.complete("Summarise the methods to improve efficiency of solar cells."))

Improving the efficiency of solar cells involves several methods that focus on enhancing the conversion of sunlight into electricity. Here's a summary of some common strategies:

1. **Higher Efficiency Materials**: Use materials with higher inherent absorption coefficients and lower band gaps for photovoltaic conversion. This includes developing new semiconductor materials like perovskites and tandem solar cells combining different materials.

2. **Antireflection Coatings**: Apply thin films to reduce the reflection of sunlight from the cell surface. These coatings increase the amount of light absorbed by the cell.

3. **Texturing the Surface**: Texturing the surface of solar cells can increase the light trapping effect, allowing more light to be absorbed and spread throughout the cell.

4. **Light Trapping Structures**: Incorporate micrometre-scale structures to enhance light scattering and trapping, which can increase the absorption of light over a broader range of wavelengths.

5. *