# Install necessary Libraries

### Step 1: Extract Text from the PDF

We’ll use the PyMuPDF library (fitz) to extract text from the PDF. This library is effective for handling structured text, such as reports and tables, commonly found in survey data.

#### Extracting text from PDF

In [2]:
import fitz  # PyMuPDF

# Function to extract text from each page in the PDF
def extract_text_from_pdf(file_path):

    print("\nExtracting text from the file...")

    # Open the PDF file
    document = fitz.open(file_path)
    text_data = []

    # Iterate through each page
    for page_num in range(document.page_count):
        page = document[page_num]
        page_text = page.get_text()  # Extract text from page
        text_data.append(page_text)

    document.close()
    print("\nCompleted successfully!")
    return text_data

In [3]:
# Path to the PDF file
# file_path = 'files/Final_SAS 2023_Annual Report.pdf'
file_path = 'files/Agricult_data.pdf'
pdf_text = extract_text_from_pdf(file_path)


Extracting text from the file...

Completed successfully!


In [4]:
def display_extracted_text(pdf_text):
    print("\nSample of first few pages of extracted text:\n")

    # Check the first few pages to see the extracted text
    for i, page in enumerate(pdf_text[:3]):
        print(f"--- Page {i+1} ---")
        print(page[:500])  # Print first 500 characters for preview

In [5]:
display_extracted_text(pdf_text)


Sample of first few pages of extracted text:

--- Page 1 ---
[ 
    { 
    "instruction": "Answer the following question", 
    "input": "What are some methods for improving soil fertility?", 
    "response": "Improving soil fertility can be achieved through various 
methods such as crop rotation, cover cropping, and adding organic matter 
like compost or manure. Additionally, using green manures, practicing 
conservation tillage, and applying appropriate fertilizers based on soil 
testing can help enhance soil fertility and promote healthier plant 
growt
--- Page 2 ---
or brush piles, and avoiding or minimizing pesticide use to protect 
pollinator populations. Creating diverse landscapes with a variety of 
flowering plants and maintaining habitat corridors can also support 
pollinator health and biodiversity on the farm." 
    }, 
    { 
    "instruction": "Answer the following question", 
    "input": "What are some methods for improving soil health in organic 
farming?", 
    "resp

### Step 2: Text Preprocessing

We’ll implement the following preprocessing steps:

- Remove Extra Spaces and Line Breaks: To make the text easier to work with.

- Split Text into Sentences: This will help with processing the text sentence by sentence during entity extraction.

- Normalize Case and Remove Unwanted Characters: For consistent analysis, we’ll standardize the case and remove characters like page numbers, special symbols, etc.
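
To make the effect of these cleaning rules concrete, here is a minimal sketch that applies the same replacements to one hand-written string. The sample text and the helper name clean_page are assumptions made for illustration; only the standard re module is used, and sentence splitting is left out.

```python
import re

def clean_page(page_text: str) -> str:
    """Apply the whitespace, page-marker, character and case rules to one page."""
    # Join lines and trim the ends
    text = page_text.replace("\n", " ").strip()
    # Drop markers such as "Page 3"
    text = re.sub(r"\bPage\s\d+\b", "", text)
    # Keep only letters, digits, whitespace, full stops and commas
    text = re.sub(r"[^a-zA-Z0-9\s.,]", "", text)
    return text.lower()

# Hypothetical sample resembling one field of the JSON-like PDF content
sample = '"input": "What is crop rotation?"\nPage 3'
cleaned = clean_page(sample)
print(repr(cleaned))
```

Because quotation marks, colons and question marks are all stripped, field names and question text run together with whatever follows them, so a cleaned sentence can start with words such as "instruction" or "input".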

In [6]:
import re
import nltk
from nltk.tokenize import sent_tokenize

# Download the NLTK Punkt tokenizer data
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Ahmed Issah
[nltk_data]     Tahiru\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to C:\Users\Ahmed Issah
[nltk_data]     Tahiru\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [8]:
# Function to preprocess text
def preprocess_text(text_data):
    
    print("\nPreprocessing extracted text...")

    processed_text = []

    for page_text in text_data:
        # Remove any extraneous whitespace and newlines
        page_text = page_text.replace('\n', ' ').strip()

        # Remove unwanted characters like page numbers or table of contents markers
        page_text = re.sub(r'\bPage\s\d+\b', '', page_text)
        page_text = re.sub(r'[^a-zA-Z0-9\s.,]', '', page_text)

        # Convert text to lowercase
        page_text = page_text.lower()

        # Tokenize text into sentences
        sentences = sent_tokenize(page_text)

        # Store cleaned sentences
        processed_text.extend(sentences)

    print("\nCompleted successfully!")

    return processed_text

In [9]:
# Apply preprocessing to the extracted text
cleaned_text = preprocess_text(pdf_text)


Preprocessing extracted text...

Completed successfully!


In [10]:
# Function to display preprocessed text
def display_preprocessed_text(cleaned_text):

    print("\nSample of first few cleaned sentences:\n")

    # Display the first few cleaned sentences
    for i, sentence in enumerate(cleaned_text[:20]):
        print(f"Sentence {i+1}: {sentence}")

In [11]:
# display preprocessed text
display_preprocessed_text(cleaned_text)


Sample of first few cleaned sentences:

Sentence 1:             instruction answer the following question,      input what are some methods for improving soil fertility,      response improving soil fertility can be achieved through various  methods such as crop rotation, cover cropping, and adding organic matter  like compost or manure.
Sentence 2: additionally, using green manures, practicing  conservation tillage, and applying appropriate fertilizers based on soil  testing can help enhance soil fertility and promote healthier plant  growth.
Sentence 3: ,            instruction answer the following question,      input how can i prevent erosion on my farmland,      response preventing erosion on farmland involves implementing  conservation practices like contour plowing, terracing, and planting  windbreaks or cover crops.
Sentence 4: maintaining vegetation along waterways,  installing silt fences, and using erosion control blankets can also help  minimize soil erosion and protect th

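### Step 3: Entity Extraction

The first draft of this step used spaCy, an NLP library that provides pre-trained models for named entity recognition (NER), part-of-speech tagging, and other text processing tasks. The cells that actually run below use a Hugging Face transformers token-classification pipeline with the agriculture-specific orkg/orkgnlp-agri-ner model instead; the spaCy loading cell is kept commented out for reference.

#### 3.1 Install spaCy and Download Language Model

If not already installed, install spaCy and download en_core_web_sm, spaCy's small English language model. It is still required for the dependency parsing in Step 4.

In [45]:
# %pip install spacy
# !python -m spacy download en_core_web_sm

#### 3.2 Extracting Entities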

In [None]:
# import spacy

# # Load spaCy's pre-trained English model
# nlp = spacy.load("en_core_web_sm")

In [12]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("orkg/orkgnlp-agri-ner")

# Create a pipeline for NER
nlp_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)




In [13]:
# Function to extract entities from the text
def extract_entities(text_data):
    
    print("\nExtracting entitites...")

    entities = []

    for sentence in text_data:

        # Run the Hugging Face token-classification pipeline on each sentence
        ner_results = nlp_pipeline(sentence)

        # Collect each recognized token with its BIO label and confidence score
        for result in ner_results:
            entities.append((result['word'], result['entity'], result['score']))

    print("\nCompleted successfully!")
    

    return entities

In [14]:
# apply entity extraction on the cleaned text
extracted_entities = extract_entities(cleaned_text)


Extracting entitites...

Completed successfully!


In [15]:
# Function to display a sample of extracted entities
def display_extracted_entities(extracted_entities):

    print("\nSample of extracted entities:\n")

    for i, entity in enumerate(extracted_entities[:100]):
        print(f"Entity {i+1}: Text: '{entity[0]}', Label: {entity[1]}")

In [16]:
# Display a sample of extracted entities
display_extracted_entities(extracted_entities)


Sample of extracted entities:

Entity 1: Text: 'soil', Label: B-RP
Entity 2: Text: 'fertility', Label: I-RP
Entity 3: Text: ',', Label: I-RP
Entity 4: Text: 'soil', Label: B-RP
Entity 5: Text: 'fertility', Label: I-RP
Entity 6: Text: 'crop', Label: B-METH
Entity 7: Text: 'rotation', Label: I-METH
Entity 8: Text: 'cover', Label: B-METH
Entity 9: Text: 'crop', Label: I-METH
Entity 10: Text: '##ping', Label: I-METH
Entity 11: Text: 'adding', Label: B-P
Entity 12: Text: 'organic', Label: I-P
Entity 13: Text: 'matter', Label: I-P
Entity 14: Text: 'com', Label: B-R
Entity 15: Text: '##post', Label: I-R
Entity 16: Text: 'or', Label: I-R
Entity 17: Text: 'man', Label: I-R
Entity 18: Text: '##ure', Label: I-R
Entity 19: Text: 'green', Label: B-R
Entity 20: Text: 'man', Label: I-R
Entity 21: Text: '##ures', Label: I-R
Entity 22: Text: 'conservation', Label: B-RP
Entity 23: Text: 'till', Label: I-RP
Entity 24: Text: '##age', Label: I-RP
Entity 25: Text: 'f', Label: B-R
Entity 26: Text: '##ert', 

In [17]:
len(extracted_entities)

137392

Breakdown of the entity labels found:

- B-RP (Beginning of a Resource Practice): Marks the first token of an entity related to resource practices, such as "soil" in "soil fertility" or "conservation" in "conservation tillage".
- I-RP (Inside a Resource Practice): Marks the following tokens of a multi-word resource practice entity, like "fertility" in "soil fertility".
- B-P (Beginning of a Practice): Marks the first token of a practice, such as "adding" in "adding organic matter".
- I-P (Inside a Practice): Marks the following tokens of a multi-word practice entity, such as "organic" and "matter" in "adding organic matter".
- B-METH (Beginning of a Method): Marks the first token of a method or approach, such as "crop" in "crop rotation".
- I-METH (Inside a Method): Marks the following tokens of a method, like "rotation" in "crop rotation" or "crop" and "##ping" in "cover cropping".
- B-R (Beginning of a Resource): Marks the first token of an entity referring to a tangible or abstract resource, such as "green" in "green manures".
- I-R (Inside a Resource): Marks the following tokens of a multi-word resource entity, such as "man" and "##ures" in "green manures".
- B-S (Beginning of a Strategy): Marks the first token of an overall strategy, like "integrated" in "integrated pest management".
- I-S (Inside a Strategy): Marks the following tokens of a multi-word strategy, such as "pest" and "management" in "integrated pest management".
- B-LOC (Beginning of a Location): Marks the first token of a geographical location (not in the sample above, but relevant for regions).
- I-LOC (Inside a Location): Marks the following tokens of a location name.
- B-T (Beginning of a Tool or Technology): Marks the first token of a tool or technology (not in the sample above).
- I-T (Inside a Tool or Technology): Marks the following tokens of a multi-word tool or technology.
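
The sample output lists word pieces ("##ping", "##post") and B-/I- tags one token at a time. A minimal sketch of how such tokens could be merged back into whole entity spans follows. The helper name merge_bio_tokens and the scores in the sample are made up for illustration, and this is not necessarily how the notebook's author would post-process the results.

```python
def merge_bio_tokens(tokens):
    """Merge (word, tag, score) tuples into (text, entity_type) spans."""
    spans = []
    for word, tag, _score in tokens:
        prefix, _, etype = tag.partition("-")
        if word.startswith("##") and spans:
            # Word-piece continuation: glue it onto the previous token
            text, prev_type = spans[-1]
            spans[-1] = (text + word[2:], prev_type)
        elif prefix == "I" and spans and spans[-1][1] == etype:
            # Inside tag of the same type: extend the current span
            text, prev_type = spans[-1]
            spans[-1] = (text + " " + word, prev_type)
        else:
            # B- tag (or an I- tag with nothing to attach to): start a new span
            spans.append((word, etype))
    return spans

# Hypothetical tokens shaped like the pipeline output above; scores are invented
sample = [
    ("cover", "B-METH", 0.9),
    ("crop", "I-METH", 0.9),
    ("##ping", "I-METH", 0.9),
    ("com", "B-R", 0.8),
    ("##post", "I-R", 0.8),
]
merged = merge_bio_tokens(sample)
print(merged)
```

The transformers pipeline also accepts an aggregation_strategy argument that groups tokens into entities, which may be simpler than merging by hand.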

### Step 4: Relationship Extraction

We’ll analyze the extracted sentences to identify relationships between entities. For instance, relationships like "maize grows in" a specific season or "fertilizer applied to" certain crops can provide valuable insights for building a structured knowledge graph.

We’ll use dependency parsing, which identifies syntactic relationships between words in a sentence. spaCy’s dependency parser will help us capture these relationships, focusing on:

- Subject-Verb-Object (SVO) triples: Common in sentences that describe actions, like "farmers use fertilizers."
- Prepositional Phrases: Often contain location or temporal data, like "in season A."
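
To show what a single-pass SVO scan does, the sketch below runs the same "keep the last match" logic on hand-annotated tokens instead of a real parse. The Tok class, the function name last_match_svo and the dependency labels assigned to each word are assumptions for illustration; a real spaCy parse may label tokens differently.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    text: str
    dep: str    # dependency label, e.g. "nsubj" or "dobj"
    pos: str    # coarse part of speech, e.g. "VERB"
    lemma: str

def last_match_svo(tokens):
    """Return one (subject, verb lemma, object) triple, keeping the last match of each."""
    subject = predicate = obj = None
    for tok in tokens:
        if "subj" in tok.dep:
            subject = tok.text
        elif "obj" in tok.dep:      # also matches prepositional objects ("pobj")
            obj = tok.text
        elif tok.pos == "VERB":
            predicate = tok.lemma
    if subject and predicate and obj:
        return (subject, predicate, obj)
    return None

# "farmers use fertilizers", annotated by hand
simple = last_match_svo([
    Tok("farmers", "nsubj", "NOUN", "farmer"),
    Tok("use", "ROOT", "VERB", "use"),
    Tok("fertilizers", "dobj", "NOUN", "fertilizer"),
])

# "farmers use fertilizers and plant crops", annotated by hand
compound = last_match_svo([
    Tok("farmers", "nsubj", "NOUN", "farmer"),
    Tok("use", "ROOT", "VERB", "use"),
    Tok("fertilizers", "dobj", "NOUN", "fertilizer"),
    Tok("and", "cc", "CCONJ", "and"),
    Tok("plant", "conj", "VERB", "plant"),
    Tok("crops", "dobj", "NOUN", "crop"),
])
print(simple, compound)
```

With two clauses only the last verb and object survive, so a scan of this kind yields at most one triple per sentence and can pair words from different clauses.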

#### Extracting Relationships

In [18]:
import spacy
nlp = spacy.load("en_core_web_sm")


In [19]:
# Function to extract relationships from sentences

def extract_relationships(text_data):

    print("\nExtracting relationships...")


    # container to store extracted relationships
    relationships = []

    # Loop through each sentence in the text to extract relationships 
    for sentence in text_data:
        doc = nlp(sentence)

        # Define placeholders for entities and relationships
        subject = None
        predicate = None
        obj = None

        # Dependency parsing to identify SVO structure
        for token in doc:

            # Find the subject (usually a noun or a compound noun)
            if "subj" in token.dep_:
                subject = token.text

            # Find the object (usually a noun or a compound noun)
            elif "obj" in token.dep_:
                obj = token.text

            # Find the main verb (predicate of the sentence)
            elif token.pos_ == "VERB":
                # Use lemma for consistent verbs (e.g., 'use' vs 'used')
                predicate = token.lemma_

        # If SVO structure is found, store the relationship
        if subject and predicate and obj:
            relationships.append((subject, predicate, obj))

    print("\nCompleted successfully!")

    return relationships

In [20]:
# Apply relationship extraction on cleaned text
extracted_relationships = extract_relationships(cleaned_text)


Extracting relationships...

Completed successfully!


In [21]:
# Function to display a sample of extracted relationships
def display_extracted_relationships(extracted_relationships):
    
    print("\nSample of extracted relationships:\n")

    for i, relationship in enumerate(extracted_relationships[:20]):
        print(f"Relationship {i+1}: Subject: '{relationship[0]}', Predicate: '{relationship[1]}', Object: '{relationship[2]}'")

In [22]:
# display a sample of extracted relationships
display_extracted_relationships(extracted_relationships)


Sample of extracted relationships:

Relationship 1: Subject: 'methods', Predicate: 'add', Object: 'compost'
Relationship 2: Subject: 'using', Predicate: 'promote', Object: 'growth'
Relationship 3: Subject: 'i', Predicate: 'cover', Object: 'crops'
Relationship 4: Subject: 'maintaining', Predicate: 'protect', Object: 'farmland'
Relationship 5: Subject: 'controls', Predicate: 'reduce', Object: 'pesticides'
Relationship 6: Subject: 'rotation', Predicate: 'promote', Object: 'management'
Relationship 7: Subject: 'which', Predicate: 'minimize', Object: 'evaporation'
Relationship 8: Subject: 'implementing', Predicate: 'optimize', Object: 'agriculture'
Relationship 9: Subject: 'which', Predicate: 'improve', Object: 'health'
Relationship 10: Subject: 'agriculture', Predicate: 'become', Object: 'change'
Relationship 11: Subject: 'response', Predicate: 'nest', Object: 'hotels'
Relationship 12: Subject: 'creating', Predicate: 'support', Object: 'farm'
Relationship 13: Subject: 'methods', Predicate

In [23]:
len(extracted_relationships)

5151

### Step 5: Building the DKG with NetworkX

We’ll use the extracted entities and relationships to create a Dynamic Knowledge Graph (DKG): a structured graph that models the agricultural information and can be updated as new documents arrive.

To build the knowledge graph, we’ll use the NetworkX library in Python. This will allow us to represent entities as nodes and relationships as edges, creating a graph that can be easily updated and queried.
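
The node-and-edge model can be illustrated without any library. The sketch below is a plain-dictionary stand-in, not NetworkX itself, and the names nodes, edges and add_triple are hypothetical. It mirrors one property of a directed graph that holds a single edge per ordered node pair: adding a second predicate between the same two nodes replaces the first.

```python
nodes = {}   # node text -> {"label": entity type}
edges = {}   # (subject, object) -> {"label": predicate}

def add_triple(subject, predicate, obj):
    """Store one relationship, creating endpoint nodes if they are missing."""
    nodes.setdefault(subject, {})
    nodes.setdefault(obj, {})
    # One entry per ordered pair: a later predicate overwrites an earlier one
    edges[(subject, obj)] = {"label": predicate}

# A labelled entity node, followed by two relationships between the same pair
nodes["rotation"] = {"label": "I-METH"}
add_triple("rotation", "promote", "health")
add_triple("rotation", "improve", "health")

print(nodes)
print(edges)
```

Endpoint nodes created only through a relationship carry no label here, and only the most recent predicate per pair is kept. If several predicates between the same pair need to be preserved, a multigraph such as nx.MultiDiGraph is one option.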

In [53]:
# %pip install networkx

In [24]:
import networkx as nx
import matplotlib.pyplot as plt

# Initialize an empty directed graph
G = nx.DiGraph()

In [25]:
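# Function to build the knowledge graph from entities and relationships
def build_knowledge_graph_networkx(entities, relationships):

    print("\nBuilding knowledge Graph...")

    # Create a fresh directed graph so that re-running this cell does not
    # accumulate nodes and edges from earlier runs
    graph = nx.DiGraph()

    # Add entities as nodes; a repeated token keeps the label seen last
    for entity, entity_type, _ in entities:
        graph.add_node(entity, label=entity_type)

    # Add relationships as edges; add_edge also creates any missing endpoint
    # node, which then has no "label" attribute
    for subject, predicate, obj in relationships:
        graph.add_edge(subject, obj, label=predicate)

    print("\nKnowledge Graph built successfully!")

    return graph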

In [26]:
# Build the graph using extracted entities and relationships
knowledge_graph = build_knowledge_graph_networkx(extracted_entities, extracted_relationships)


Building knowledge Graph...

Knowledge Graph built successfully!


In [27]:
# %pip install scipy

In [36]:
# # Draw the graph
# plt.figure(figsize=(12, 12))
# pos = nx.spring_layout(knowledge_graph, seed=42)  # Layout for visualization
# nx.draw(knowledge_graph, pos, with_labels=True, node_size=3000, node_color="skyblue", font_size=10, font_weight="bold", edge_color="gray")
# edge_labels = nx.get_edge_attributes(knowledge_graph, "label")
# nx.draw_networkx_edge_labels(knowledge_graph, pos, edge_labels=edge_labels, font_color="red")
# plt.title("Agricultural Knowledge Graph")
# plt.show()

### Visualize Graph with Pyvis and Save it

In [28]:
from pyvis.network import Network

def visualize_knowledge_graph_pyvis(G, output_path="knowledge_graph.html"):
    """Visualize and save the graph using PyVis."""
    # Initialize PyVis network
    net = Network(notebook=True, height="750px", width="100%",
                  bgcolor="#ffffff", font_color="black", cdn_resources='remote')

    # Customize the PyVis physics settings for better layout
    net.set_options("""
    var options = {
        "nodes": {
            "font": {"size": 12},
            "size": 20
        },
        "edges": {
            "color": {"color": "#000000", "opacity": 0.5},
            "font": {"size": 10},
            "smooth": false
        },
        "physics": {
            "barnesHut": {
                "gravitationalConstant": -2000,
                "centralGravity": 0.3,
                "springLength": 200
            },
            "minVelocity": 0.75
        }
    }
    """)

    # Add nodes to the PyVis network
    for node, attr in G.nodes(data=True):
        net.add_node(node, label=node, color=attr.get("color", "skyblue"))

    # Add edges to the PyVis network with labels
    for source, target, attr in G.edges(data=True):
        net.add_edge(source, target, label=attr.get("label", ""), arrows="to")

    # Save the graph as an HTML file
    net.save_graph(output_path)
    print(f"Graph saved to {output_path}")


In [33]:

visualize_knowledge_graph_pyvis(knowledge_graph, output_path="latest_KG_agriner.html")


Graph saved to latest_KG_agriner.html


### Step 6: Extract Schema from Current Data

In [34]:
from collections import defaultdict

# Function to generate schema from entities and relationships
def generate_schema(entities, relationships):

    # Collect unique entity types
    entity_types = set(entity_type for _, entity_type, _ in entities)

    # Map each entity text to the first type it was seen with, so that each
    # lookup below is O(1) instead of a scan over the whole entity list
    first_type = {}
    for ename, etype, _ in entities:
        first_type.setdefault(ename, etype)

    # Initialize dictionary to store relationship types
    relationship_types = defaultdict(set)

    # Populate relationship types based on current relationships
    for subject, predicate, obj in relationships:
        subject_type = first_type.get(subject)
        obj_type = first_type.get(obj)

        # Only add if both subject_type and obj_type exist
        if subject_type and obj_type:
            relationship_types[(subject_type, obj_type)].add(predicate)

    # Convert relationship_types to a more readable format
    relationship_schema = {k: list(v) for k, v in relationship_types.items()}

    # Construct schema
    schema = {
        "entities": list(entity_types),
        "relationships": relationship_schema
    }
    return schema


In [35]:
# Generate schema based on extracted entities and relationships
schema = generate_schema(extracted_entities, extracted_relationships)

# Display the generated schema
print("Auto-Generated Schema:")
print(schema)

Auto-Generated Schema:
{'entities': ['B-S', 'I-T', 'I-R', 'I-RP', 'B-P', 'I-P', 'I-S', 'B-RP', 'B-METH', 'B-T', 'I-LOC', 'I-METH', 'B-LOC', 'B-R'], 'relationships': {('B-METH', 'I-RP'): ['improve', 'promote', 'minimize', 'mitigate', 'control', 'grow', 'associate', 'preserve', 'pest', 'help', 'reduce', 'disrupt', 'alleviate', 'enhance', 'protect'], ('I-S', 'I-R'): ['diversify', 'cover', 'store', 'conserve', 'utilize', 'suppress', 'provide', 'range', 'fix', 'prevent', 'reduce', 'use', 'accommodate', 'enhance', 'maintain'], ('B-RP', 'B-R'): ['plant', 'limit', 'protect'], ('I-METH', 'I-RP'): ['increase', 'improve', 'promote', 'minimize', 'desire', 'capture', 'prevent', 'provide', 'compete', 'compare', 'make', 'answer', 'pest', 'reduce', 'replenish', 'control'], ('B-METH', 'B-P'): ['improve', 'promote', 'kill', 'support', 'reduce', 'optimize'], ('I-R', 'B-RP'): ['increase', 'improve', 'fungi', 'stay', 'lead', 'grow', 'apply', 'indicate', 'have', 'add', 'reduce', 'contain', 'stunt', 'absorb'

In [36]:
schema

{'entities': ['B-S',
  'I-T',
  'I-R',
  'I-RP',
  'B-P',
  'I-P',
  'I-S',
  'B-RP',
  'B-METH',
  'B-T',
  'I-LOC',
  'I-METH',
  'B-LOC',
  'B-R'],
 'relationships': {('B-METH', 'I-RP'): ['improve',
   'promote',
   'minimize',
   'mitigate',
   'control',
   'grow',
   'associate',
   'preserve',
   'pest',
   'help',
   'reduce',
   'disrupt',
   'alleviate',
   'enhance',
   'protect'],
  ('I-S', 'I-R'): ['diversify',
   'cover',
   'store',
   'conserve',
   'utilize',
   'suppress',
   'provide',
   'range',
   'fix',
   'prevent',
   'reduce',
   'use',
   'accommodate',
   'enhance',
   'maintain'],
  ('B-RP', 'B-R'): ['plant', 'limit', 'protect'],
  ('I-METH', 'I-RP'): ['increase',
   'improve',
   'promote',
   'minimize',
   'desire',
   'capture',
   'prevent',
   'provide',
   'compete',
   'compare',
   'make',
   'answer',
   'pest',
   'reduce',
   'replenish',
   'control'],
  ('B-METH', 'B-P'): ['improve',
   'promote',
   'kill',
   'support',
   'reduce',
   'opti

### Step 7: Real-Time KG Updates with Schema Validation

We’ll implement a way to update the Dynamic Knowledge Graph (DKG) using the auto-generated schema. This will ensure new data is validated against the existing structure, keeping the graph consistent and accurate.

In this step, we’ll:

1. Validate New Data: Check that new entities and relationships align with the schema.
2. Add Validated Data to the DKG: Update the graph with new data, preserving structure and relationships.
3. Flag Inconsistent Data: If data doesn’t match the schema, it will be flagged for manual review.
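
The three steps above can be sketched on toy data as follows. The schema, the node types and the helper name validate_triple are assumptions chosen for illustration; the sketch uses plain dictionaries rather than the graph object so that it runs on its own.

```python
def validate_triple(triple, node_types, schema):
    """Return True if the predicate is allowed between the two node types."""
    subject, predicate, obj = triple
    subject_type = node_types.get(subject)
    obj_type = node_types.get(obj)
    # A triple whose endpoints are unknown cannot be checked, so it is rejected
    if subject_type is None or obj_type is None:
        return False
    allowed = schema["relationships"].get((subject_type, obj_type), [])
    return predicate in allowed

# Hypothetical schema and node types in the same shape as the generated schema
schema = {
    "entities": ["B-METH", "I-RP"],
    "relationships": {("B-METH", "I-RP"): ["improve", "promote"]},
}
node_types = {"crop": "B-METH", "fertility": "I-RP"}

accepted, flagged = [], []
new_triples = [
    ("crop", "improve", "fertility"),   # known types, allowed predicate
    ("crop", "destroy", "fertility"),   # known types, predicate not in schema
    ("it", "cover", "segments"),        # endpoints not in the graph
]
for triple in new_triples:
    target = accepted if validate_triple(triple, node_types, schema) else flagged
    target.append(triple)

print("accepted:", accepted)
print("flagged:", flagged)
```

Only the first triple is accepted; the other two are held back for manual review rather than discarded.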

#### Function to do Real-Time Updates of the KG while storing Flagged Data

In [44]:
# Containers to hold flagged entities and relationships
flagged_entities, flagged_relationships = [], []

In [48]:
def update_knowledge_graph_with_flagging(new_entities, new_relationships, graph, schema):
    global flagged_entities, flagged_relationships
    
    # Add entities with validation and ensure all nodes have a "label" attribute
    for entity, entity_type, _ in new_entities:
        if entity_type in schema["entities"]:
            graph.add_node(entity, label=entity_type)
        else:
            flagged_entities.append((entity, entity_type))

    # Add relationships with validation and handle missing labels
    for subject, predicate, obj in new_relationships:
        if graph.has_node(subject) and graph.has_node(obj):

            # Use get() with default label if missing
            subject_type = graph.nodes[subject].get("label", "Unknown")
            obj_type = graph.nodes[obj].get("label", "Unknown")
            
            valid_predicates = schema["relationships"].get((subject_type, obj_type), [])
            if predicate in valid_predicates:
                graph.add_edge(subject, obj, label=predicate)
            else:
                flagged_relationships.append((subject, predicate, obj))
        else:
            flagged_relationships.append((subject, predicate, obj))

    print(f"Flagged Entities: {flagged_entities}")
    print(f"Flagged Relationships: {flagged_relationships}")

    return flagged_entities, flagged_relationships


#### Pipeline for New Data Extraction from PDF

We’ll create a pipeline function to process a PDF file, extract new entities and relationships, and pass them to update_knowledge_graph_with_flagging.

In [49]:
def process_new_pdf(file_path):
    global knowledge_graph, schema
    
    # Extract text from PDF
    pdf_text = extract_text_from_pdf(file_path)
    
    # Preprocess extracted text
    cleaned_text = preprocess_text(pdf_text)
    
    # Extract entities and relationships
    new_entities = extract_entities(cleaned_text)
    new_relationships = extract_relationships(cleaned_text)
    
    # Update knowledge graph with validation and flagging
    flagged_entities, flagged_relationships = update_knowledge_graph_with_flagging(new_entities, new_relationships, knowledge_graph, schema)
    
    # Return flagged items for review
    return flagged_entities, flagged_relationships



In [50]:
# Extract text, preprocess and extract entities and relationships from new data file
new_file_path = 'files/Final_SAS 2023_Annual Report.pdf'
flagged_entities, flagged_relationships = process_new_pdf(new_file_path)



Extracting text from the file...

Completed successfully!

Preprocessing extracted text...

Completed successfully!

Extracting entitites...

Completed successfully!

Extracting relationships...

Completed successfully!
Flagged Entities: []
Flagged Relationships: [('which', 'cover', 'rwanda'), ('sas', 'combine', 'frame'), ('it', 'cover', 'segments'), ('it', 'screen', 'phases'), ('phase', 'cultivate', 'plots'), ('it', 'target', 'seasons'), ('estimates', 'give', 'district'), ('census', 'conduct', 'years'), ('it', 'include', 'inputs'), ('57.5', 'use', 'agriculture'), ('hectares', 'use', 'pasture'), ('56.6', 'use', 'agriculture'), ('hectares', 'use', 'pasture'), ('percent', 'apply', 'famers'), ('percent', 'compare', 'c.'), ('indicators', 'maize', 'beer'), ('summary', 'kgha', 'fruits'), ('percentage', 'practice', '1,067.2'), ('production', '2023cultivate', 'type'), ('4', 'sample', 'figures'), ('survey', 'sqm', 'statistics'), ('government', 'invest', 'addition'), ('statistics', 'evidencebas

#### Review and Confirm Flagged Data

Next, we’ll create a function for the user to review flagged data. If the user confirms, we’ll add the entity or relationship to the graph and update the schema.

In [51]:
def review_and_confirm(flagged_entities, flagged_relationships, graph, schema):
    # Process flagged entities
    for entity, entity_type in flagged_entities:
        user_input = input(f"Confirm entity '{entity}' as type '{entity_type}'? (y/n): ")
        if user_input.lower() == "y":
            # Add entity to graph and schema, ensure it has a label
            graph.add_node(entity, label=entity_type)
            print(f"{entity} added to graph")
            if entity_type not in schema["entities"]:
                schema["entities"].append(entity_type)
                print(f"{entity_type} added to schema")


    # Process flagged relationships
    for subject, predicate, obj in flagged_relationships:
        # Ensure subject and object nodes have labels
        if not graph.has_node(subject):
            graph.add_node(subject, label="Unknown")
        if not graph.has_node(obj):
            graph.add_node(obj, label="Unknown")
        
        # Retrieve labels, setting a default if missing
        subject_type = graph.nodes[subject].get("label", "Unknown")
        obj_type = graph.nodes[obj].get("label", "Unknown")

        # Prompt user for confirmation of relationship
        user_input = input(f"Confirm relationship '{subject} - {predicate} - {obj}'? (y/n): ")
        if user_input.lower() == "y":
            # Add relationship to graph and update schema if necessary
            graph.add_edge(subject, obj, label=predicate)
            print(f"relationship added to graph")
            if (subject_type, obj_type) not in schema["relationships"]:
                schema["relationships"][(subject_type, obj_type)] = [predicate]
                print(f"relationship added to schema")

            elif predicate not in schema["relationships"][(subject_type, obj_type)]:
                schema["relationships"][(subject_type, obj_type)].append(predicate)
                print(f"relationship added to schema")


# Example usage: Reviewing flagged items
review_and_confirm(flagged_entities, flagged_relationships, knowledge_graph, schema)


relationship added to graph
relationship added to schema
relationship added to graph
relationship added to schema


In [52]:
schema

{'entities': ['I-RP',
  'B-METH',
  'I-METH',
  'I-LOC',
  'B-S',
  'B-R',
  'I-S',
  'B-T',
  'I-T',
  'I-P',
  'B-RP',
  'B-P',
  'I-R',
  'B-LOC'],
 'relationships': {('B-METH', 'I-RP'): ['associate',
   'enhance',
   'reduce',
   'protect',
   'disrupt',
   'mitigate',
   'minimize',
   'pest',
   'improve',
   'control',
   'promote',
   'grow',
   'preserve',
   'alleviate',
   'help'],
  ('I-S', 'I-R'): ['provide',
   'enhance',
   'cover',
   'reduce',
   'prevent',
   'conserve',
   'use',
   'store',
   'utilize',
   'suppress',
   'accommodate',
   'range',
   'diversify',
   'fix',
   'maintain'],
  ('B-RP', 'B-R'): ['plant', 'limit', 'protect'],
  ('I-METH', 'I-RP'): ['provide',
   'reduce',
   'compete',
   'desire',
   'prevent',
   'minimize',
   'pest',
   'improve',
   'control',
   'promote',
   'replenish',
   'make',
   'compare',
   'answer',
   'capture',
   'increase'],
  ('B-METH', 'B-P'): ['support',
   'reduce',
   'improve',
   'promote',
   'kill',
   'opti

### Sample QA

In [53]:
import os
from openai import OpenAI

# Read the OpenAI API key from the environment; never hard-code a secret in a notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [58]:

# Function to extract knowledge graph data based on keywords
def query_knowledge_graph(graph, keywords):
    results = []
    for node in graph.nodes:
        if any(keyword in node.lower() for keyword in keywords):
            label = graph.nodes[node].get("label", "Unknown")
            results.append((node, label))
    for edge in graph.edges(data=True):
        subject, obj, data = edge
        if any(keyword in subject.lower() for keyword in keywords) or any(keyword in obj.lower() for keyword in keywords):
            # Edges are stored with a "label" attribute (see build_knowledge_graph_networkx)
            relation = data.get("label", "Unknown")
            results.append((subject, relation, obj))
    return results

# Function to query the LLM with knowledge graph context
def query_llm_with_kg(question, graph):
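    # Strip punctuation so that "rwanda?" can match the node "rwanda",
    # and drop empty strings, which would otherwise match every node
    words = (word.strip(".,;:!?\"'") for word in question.lower().split())
    keywords = [word for word in words if word]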
    kg_data = query_knowledge_graph(graph, keywords)
    
    # Construct context from knowledge graph findings
    context = "Relevant Knowledge Graph Data:\n"
    for item in kg_data:
        if len(item) == 2:
            context += f"- Entity: {item[0]}, Type: {item[1]}\n"
        elif len(item) == 3:
            context += f"- Relationship: {item[0]} - {item[1]} -> {item[2]}\n"
    
    # Combine context with question
    prompt = f"{context}\n\nQuestion: {question}"
    
    # Query LLM with combined prompt
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant specialized in agricultural knowledge."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        temperature=0.1
    )
    
    answer = chat_completion.choices[0].message.content.strip()
    return answer




In [59]:
# Example usage
question = "What crops grow in season C in Rwanda?"
print(query_llm_with_kg(question, knowledge_graph))

To determine what crops grow in season C in Rwanda, we would typically look at local agricultural practices, climate conditions, and crop varieties suited for that specific season. However, based on the provided knowledge graph data, there is no direct mention of specific crops associated with "season C" in Rwanda.

In general, common crops grown in Rwanda during various seasons include:

1. **Maize** - A staple food crop.
2. **Beans** - Often intercropped with maize.
3. **Potatoes** - Grown in higher altitudes.
4. **Cassava** - A drought-resistant crop.
5. **Sweet potatoes** - Another staple that thrives in various conditions.
6. **Rice** - Grown in wetland areas


The intuition behind this approach is to use the knowledge graph to ground the LLM's response. Instead of relying solely on the LLM's pre-trained knowledge, the pipeline retrieves domain-specific, structured data (entities and relationships) that match the user's query and adds it to the prompt. When the graph contains relevant facts, this context should steer the model toward an answer grounded in them; when it does not, as in the sample above, the model says so and falls back on its general knowledge.
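
The prompt-building part of this idea can be tried without calling any API. The sketch below assembles a context block from a list of triples and appends the question. The function name build_kg_prompt, the fallback line for an empty result and the sample triple are assumptions made for illustration.

```python
def build_kg_prompt(question, triples):
    """Combine knowledge-graph triples and a question into one prompt string."""
    lines = ["Relevant Knowledge Graph Data:"]
    for subject, predicate, obj in triples:
        lines.append(f"- Relationship: {subject} - {predicate} -> {obj}")
    if not triples:
        # Say explicitly that retrieval found nothing, instead of leaving the block empty
        lines.append("- (no matching facts found)")
    context = "\n".join(lines)
    return f"{context}\n\nQuestion: {question}"

# Hypothetical triple in the same shape as the extracted relationships
triples = [("rotation", "improve", "health")]
prompt = build_kg_prompt("How does crop rotation help?", triples)
print(prompt)
```

Printing the prompt before sending it is a cheap way to check whether the retrieved facts are actually relevant to the question.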