# Install Necessary Libraries

In [2]:
!pip install pymupdf






### Step 1: Extract Text from the PDF

We’ll use the PyMuPDF library (imported as fitz) to extract text from the PDF one page at a time. It returns each page's text layer as plain text, which suits text-heavy documents such as survey reports. Tables come back as flattened lines of text rather than structured rows.

#### Extracting text from PDF

In [3]:
import fitz  # PyMuPDF

# Function to extract text from each page in the PDF
def extract_text_from_pdf(file_path):
    text_data = []

    # Open the PDF; the context manager closes it even if extraction fails
    with fitz.open(file_path) as document:
        # Iterate through each page
        for page in document:
            text_data.append(page.get_text())  # Extract plain text from page

    return text_data

In [4]:
# Path to the PDF file
file_path = 'files/Final_SAS 2023_Annual Report.pdf'
pdf_text = extract_text_from_pdf(file_path)

In [5]:
# Check the first few pages to see the extracted text
for i, page in enumerate(pdf_text[:3]):
    print(f"--- Page {i+1} ---")
    print(page[:500])  # Print first 500 characters for preview

--- Page 1 ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
The Republic of Rwanda 
SEASONAL 
AGRICULTURAL SURVEY 
2023 
ANNUAL REPORT 
December 2023 

--- Page 2 ---
 
 
 
 
 
 
 
SEASONAL AGRICULTURAL SURVEY 
 
2023 
 
ANNUAL REPORT 

--- Page 3 ---
National Institute of Statistics of Rwanda (NISR) 
 i 
EXECUTIVE SUMMARY 
This is the annual report for the Seasonal Agricultural Survey (SAS) conducted by the National Institute of 
Statistics of Rwanda (NISR) for the agricultural year 2022/2023, which covers three primary agricultural 
seasons in Rwanda. The main agricultural seasons include Season A, spanned from September 2022 to 
February 2023, Season B which started from March to June 2023, and Season C which started from July to 
Septembe
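
The first two pages consist mostly of whitespace-only lines, which crowd out the actual text in a 500-character preview. A minimal sketch of a helper that drops those lines before previewing is shown below. `compact_preview` is a hypothetical name (not part of PyMuPDF), and the sample string only imitates the cover-page output above.

```python
def compact_preview(page_text, limit=500):
    """Drop whitespace-only lines, then return at most `limit` characters."""
    lines = [line.strip() for line in page_text.splitlines()]
    non_empty = [line for line in lines if line]
    return "\n".join(non_empty)[:limit]


# Sample text imitating the cover page: blank lines followed by the title.
sample_page = " \n \n \nThe Republic of Rwanda \nSEASONAL \nAGRICULTURAL SURVEY \n2023 \n"
print(compact_preview(sample_page))
```

With the real data, the same helper could be applied to each entry of `pdf_text` in place of the `page[:500]` slice.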


### Step 2: Text Preprocessing

We’ll implement the following preprocessing steps:

- Remove Extra Spaces and Line Breaks: To make the text easier to work with.
- Split Text into Sentences: This will help with processing the text sentence by sentence during entity extraction.
- Normalize Case and Remove Unwanted Characters: For consistent analysis, we’ll standardize the case and remove characters like page numbers, special symbols, etc.
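
As a rough, dependency-free illustration of these steps, the sketch below collapses whitespace, removes characters outside a small allowed set, lowercases the text, and splits it after sentence-ending periods. The regex splitter is only a stand-in for NLTK's `sent_tokenize` (it will mis-split abbreviations such as "e.g. maize"), and `simple_preprocess` is a hypothetical helper name.

```python
import re


def simple_preprocess(text):
    """Collapse whitespace, drop unwanted characters, lowercase, split into sentences."""
    # Collapse newlines and runs of spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Keep only letters, digits, whitespace, periods, and commas, then lowercase.
    text = re.sub(r"[^a-zA-Z0-9\s.,]", "", text).lower()
    # Naive sentence split: break after a period that is followed by whitespace.
    sentences = re.split(r"(?<=\.)\s+", text)
    return [s for s in sentences if s]


sample = "It covered 1,200 segments and\n345 large scale farmers.  It is conducted in two phases."
for sentence in simple_preprocess(sample):
    print(sentence)
```

Cleaning before splitting keeps the splitter's input predictable, at the cost of losing capitalization cues that a trained tokenizer could otherwise use.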

In [6]:
import re
import nltk
from nltk.tokenize import sent_tokenize

# Download the NLTK sentence tokenizer data
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Ahmed Issah
[nltk_data]     Tahiru\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [9]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to C:\Users\Ahmed Issah
[nltk_data]     Tahiru\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [10]:
# Function to preprocess text
def preprocess_text(text_data):
    processed_text = []

    for page_text in text_data:
        # Replace newlines with spaces and trim leading/trailing whitespace
        # (runs of spaces inside the text are left as they are)
        page_text = page_text.replace('\n', ' ').strip()

        # Remove 'Page N' markers, then every character that is not a letter,
        # digit, whitespace, period, or comma
        page_text = re.sub(r'\bPage\s\d+\b', '', page_text)
        page_text = re.sub(r'[^a-zA-Z0-9\s.,]', '', page_text)

        # Convert text to lowercase
        page_text = page_text.lower()

        # Tokenize text into sentences
        sentences = sent_tokenize(page_text)

        # Store cleaned sentences
        processed_text.extend(sentences)

    return processed_text

In [12]:
# Apply preprocessing to the extracted text
cleaned_text = preprocess_text(pdf_text)

# Display the first few cleaned sentences
for i, sentence in enumerate(cleaned_text[:20]):
    print(f"Sentence {i+1}: {sentence}")

Sentence 1: the republic of rwanda  seasonal  agricultural survey  2023  annual report  december 2023
Sentence 2: seasonal agricultural survey    2023    annual report
Sentence 3: national institute of statistics of rwanda nisr   i  executive summary  this is the annual report for the seasonal agricultural survey sas conducted by the national institute of  statistics of rwanda nisr for the agricultural year 20222023, which covers three primary agricultural  seasons in rwanda.
Sentence 4: the main agricultural seasons include season a, spanned from september 2022 to  february 2023, season b which started from march to june 2023, and season c which started from july to  september 2023.   data sources   sas is a primarybased data survey, combining area frame and a list frame.
Sentence 5: it covered 1,200 segments and  345 large scale farmers.
Sentence 6: it is conducted in two distinct phases screening and harvesting phases.
Sentence 7: the  screening phase covers grown crops data, estima
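
Two artifacts in the output above, `20222023` and `primarybased`, come from the character filter: it deletes `/` and `-` outright, so `2022/2023` (and, most likely, `primary-based`) are fused into single tokens. One possible adjustment, shown here as a sketch rather than as the notebook's method, is to keep those two characters. The pattern name `KEEP_SLASH_HYPHEN` and the sample phrase are illustrative assumptions.

```python
import re

# Filter used in the notebook: anything outside letters, digits,
# whitespace, '.' and ',' is deleted.
ORIGINAL = r"[^a-zA-Z0-9\s.,]"
# Assumed variant: additionally keep '/' and '-' so year ranges and
# hyphenated compounds survive.
KEEP_SLASH_HYPHEN = r"[^a-zA-Z0-9\s.,/\-]"

sample = "agricultural year 2022/2023 (primary-based survey)"

print(re.sub(ORIGINAL, "", sample))
# -> agricultural year 20222023 primarybased survey
print(re.sub(KEEP_SLASH_HYPHEN, "", sample))
# -> agricultural year 2022/2023 primary-based survey
```

Keeping `/` matters for entity extraction later on, because a fused value such as `20222023` no longer reads as a two-year range.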

### Step 3: Entity Extraction

We’ll use spaCy, an NLP library that provides pre-trained models for named entity recognition (NER), part-of-speech tagging, and other text processing tasks. The model may later need training or fine-tuning on agriculture-specific terms, but for now we’ll start with spaCy's base model and explore what it extracts.

#### 3.1 Install spaCy and Download the Language Model

If spaCy is not already installed, we’ll install it and download en_core_web_sm, spaCy's small English language model.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
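
Re-running the install cell on every execution is slow. A small guard, sketched below with only the standard library, checks whether a package can be imported before installing it. `is_installed` is a hypothetical helper. The check works for the model as well because models fetched with `spacy download` are installed as regular Python packages.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


for name in ("spacy", "en_core_web_sm"):
    status = "available" if is_installed(name) else "missing (run the install cell)"
    print(f"{name}: {status}")
```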

#### 3.2 Extracting Entities

In [13]:
import spacy

# Load spaCy's pre-trained English model
nlp = spacy.load("en_core_web_sm")

In [14]:
# Function to extract entities from the text
def extract_entities(text_data):
    entities = []

    for sentence in text_data:

        # Process each sentence using spaCy's NLP pipeline
        doc = nlp(sentence)

        for ent in doc.ents:
            # Append each recognized entity and its label
            entities.append((ent.text, ent.label_))

    return entities

In [15]:
# apply entity extraction on the cleaned text
extracted_entities = extract_entities(cleaned_text)

In [16]:
# display a sample of extracted entities
for i, entity in enumerate(extracted_entities[:20]):
    print(f"Entity {i+1}: Text: '{entity[0]}', Label: {entity[1]}")

Entity 1: Text: 'rwanda', Label: GPE
Entity 2: Text: '2023  annual', Label: DATE
Entity 3: Text: 'december 2023', Label: DATE
Entity 4: Text: 'seasonal', Label: DATE
Entity 5: Text: 'national institute of statistics', Label: ORG
Entity 6: Text: 'rwanda', Label: GPE
Entity 7: Text: 'annual', Label: DATE
Entity 8: Text: 'the national institute of  statistics', Label: ORG
Entity 9: Text: 'rwanda', Label: GPE
Entity 10: Text: 'the agricultural year 20222023', Label: DATE
Entity 11: Text: 'three', Label: CARDINAL
Entity 12: Text: 'rwanda', Label: GPE
Entity 13: Text: 'september 2022', Label: DATE
Entity 14: Text: 'february 2023, season', Label: DATE
Entity 15: Text: 'march to', Label: DATE
Entity 16: Text: 'june 2023', Label: DATE
Entity 17: Text: 'july', Label: DATE
Entity 18: Text: 'september 2023', Label: DATE
Entity 19: Text: '1,200', Label: CARDINAL
Entity 20: Text: '345', Label: CARDINAL


#### Entity Types to Note for Agricultural Data
- ORG: Organizations or institutions (e.g., "National Institute of Statistics of Rwanda").
- DATE: Dates, which may relate to crop seasons.
- GPE/LOC: Geopolitical entities or locations relevant to land use or agricultural regions.
- CARDINAL/QUANTITY: Quantities often related to measurements or crop statistics.
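
To see which of these types dominate the extraction, the `(text, label)` tuples can be tallied by label. The sketch below uses a few tuples copied from the sample output rather than the full `extracted_entities` list, and `count_labels` is a hypothetical helper.

```python
from collections import Counter


def count_labels(entities):
    """Count how many extracted entities carry each label."""
    return Counter(label for _, label in entities)


# A few (text, label) pairs copied from the sample output above.
sample_entities = [
    ("rwanda", "GPE"),
    ("december 2023", "DATE"),
    ("seasonal", "DATE"),
    ("national institute of statistics", "ORG"),
    ("three", "CARDINAL"),
    ("1,200", "CARDINAL"),
]

for label, count in count_labels(sample_entities).most_common():
    print(f"{label}: {count}")
```

On the full data, `count_labels(extracted_entities)` gives a quick view of how much of the output is dates and bare numbers versus organizations and places.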

### Step 4: Relationship Extraction

We’ll analyze the extracted sentences to identify relationships between entities. For instance, relationships like "maize grows in" a specific season or "fertilizer applied to" certain crops can provide valuable insights for building a structured knowledge graph.

We’ll use dependency parsing, which identifies syntactic relationships between words in a sentence. spaCy’s dependency parser will help us capture these relationships, focusing on:

- Subject-Verb-Object (SVO) triples: Common in sentences that describe actions, like "farmers use fertilizers."
- Prepositional Phrases: Often contain location or temporal data, like "in season A."
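
To make these two patterns concrete without loading a model, the sketch below walks a hand-annotated sentence. The `Token` tuples and their dependency labels are written by hand for illustration. They mimic spaCy's `text`, `lemma_`, `pos_`, and `dep_` attributes but are not parser output, and `first_svo` and `prepositional_objects` are hypothetical helpers.

```python
from collections import namedtuple

# Hand-written stand-in for spaCy tokens; labels are assumptions, not parser output.
Token = namedtuple("Token", ["text", "lemma", "pos", "dep"])

sentence = [
    Token("farmers", "farmer", "NOUN", "nsubj"),
    Token("use", "use", "VERB", "ROOT"),
    Token("fertilizers", "fertilizer", "NOUN", "dobj"),
    Token("in", "in", "ADP", "prep"),
    Token("march", "march", "PROPN", "pobj"),
]


def first_svo(tokens):
    """Return (subject, verb lemma, direct object) using the first match of each role."""
    subject = next((t.text for t in tokens if t.dep in ("nsubj", "nsubjpass")), None)
    verb = next((t.lemma for t in tokens if t.pos == "VERB"), None)
    obj = next((t.text for t in tokens if t.dep == "dobj"), None)
    return (subject, verb, obj) if subject and verb and obj else None


def prepositional_objects(tokens):
    """Return the objects of prepositions, which often carry time or place."""
    return [t.text for t in tokens if t.dep == "pobj"]


print(first_svo(sentence))              # ('farmers', 'use', 'fertilizers')
print(prepositional_objects(sentence))  # ['march']
```

Matching `dobj` exactly keeps prepositional objects (`pobj`) out of the object slot. A substring test such as `"obj" in dep` would match both.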

#### Extracting Relationships

In [17]:
# Function to extract relationships from sentences

def extract_relationships(text_data):

    # container to store extracted relationships
    relationships = []

    # Loop through each sentence in the text to extract relationships 
    for sentence in text_data:
        doc = nlp(sentence)

        # Define placeholders for entities and relationships
        subject = None
        predicate = None
        obj = None

        # Dependency parsing to identify SVO structure
        for token in doc:

            # Find the subject (dependency labels such as nsubj or nsubjpass);
            # if a sentence has several, the last one wins
            if "subj" in token.dep_:
                subject = token.text

            # Find the object (usually a noun or a compound noun); the substring
            # test also matches prepositional objects (pobj)
            elif "obj" in token.dep_:
                obj = token.text

            # Find a verb to use as the predicate of the sentence
            elif token.pos_ == "VERB":
                # Use lemma for consistent verbs (e.g., 'use' vs 'used')
                predicate = token.lemma_

        # If SVO structure is found, store the relationship
        if subject and predicate and obj:
            relationships.append((subject, predicate, obj))

    return relationships

In [18]:
# Apply relationship extraction on cleaned text
extracted_relationships = extract_relationships(cleaned_text)

In [19]:
# Display a sample of extracted relationships
for i, relationship in enumerate(extracted_relationships[:20]):
    print(f"Relationship {i+1}: Subject: '{relationship[0]}', Predicate: '{relationship[1]}', Object: '{relationship[2]}'")

Relationship 1: Subject: 'which', Predicate: 'cover', Object: 'rwanda'
Relationship 2: Subject: 'sas', Predicate: 'combine', Object: 'frame'
Relationship 3: Subject: 'it', Predicate: 'cover', Object: 'segments'
Relationship 4: Subject: 'it', Predicate: 'screen', Object: 'phases'
Relationship 5: Subject: 'phase', Predicate: 'cultivate', Object: 'plots'
Relationship 6: Subject: 'who', Predicate: 'grow', Object: 'season'
Relationship 7: Subject: 'it', Predicate: 'target', Object: 'seasons'
Relationship 8: Subject: 'estimates', Predicate: 'give', Object: 'district'
Relationship 9: Subject: 'census', Predicate: 'conduct', Object: 'years'
Relationship 10: Subject: 'it', Predicate: 'include', Object: 'inputs'
Relationship 11: Subject: '57.5', Predicate: 'use', Object: 'agriculture'
Relationship 12: Subject: 'hectares', Predicate: 'use', Object: 'pasture'
Relationship 13: Subject: '56.6', Predicate: 'use', Object: 'agriculture'
Relationship 14: Subject: 'hectares', Predicate: 'use', Object: 'p
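
Several of the triples above have a pronoun or a bare number as the subject (`which`, `it`, `who`, `57.5`), which makes them poor graph nodes. One way to screen them out, sketched here with an assumed stop list rather than a linguistically complete one, is a simple filter. `PRONOUNS`, `is_informative`, and `filter_relationships` are hypothetical names.

```python
import re

# Assumed, deliberately small stop list of pronoun-like terms.
PRONOUNS = {"it", "which", "who", "that", "they", "this", "these", "those"}
# Matches bare numbers such as '345', '57.5', or '1,200'.
NUMBER = re.compile(r"^\d+([.,]\d+)*$")


def is_informative(term):
    """Reject pronouns and bare numbers."""
    return term not in PRONOUNS and not NUMBER.match(term)


def filter_relationships(relationships):
    """Keep triples whose subject and object both pass is_informative."""
    return [
        (s, p, o) for s, p, o in relationships
        if is_informative(s) and is_informative(o)
    ]


# Triples copied from the sample output above.
sample = [
    ("which", "cover", "rwanda"),
    ("sas", "combine", "frame"),
    ("it", "cover", "segments"),
    ("57.5", "use", "agriculture"),
    ("hectares", "use", "pasture"),
]
print(filter_relationships(sample))
# -> [('sas', 'combine', 'frame'), ('hectares', 'use', 'pasture')]
```

Filtering discards information (the pronoun `it` usually refers to the survey), so resolving pronouns to their antecedents would be a more faithful, though more involved, alternative.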

### Step 5: Building the DKG with NetworkX

We’ll use the extracted entities and relationships to create a structured knowledge graph that models the agricultural information.

To build the knowledge graph, we’ll use the NetworkX library in Python. This will allow us to represent entities as nodes and relationships as edges, creating a graph that can be easily updated and queried.

In [None]:
!pip install networkx

In [20]:
import networkx as nx
import matplotlib.pyplot as plt

# Initialize an empty directed graph
G = nx.DiGraph()

In [21]:
# Function to build the knowledge graph from entities and relationships
def build_knowledge_graph(entities, relationships):
    # Create a fresh directed graph so repeated calls do not accumulate state
    graph = nx.DiGraph()

    # Add entities as nodes
    for entity, entity_type in entities:
        graph.add_node(entity, label=entity_type)

    # Add relationships as edges
    for subject, predicate, obj in relationships:
        graph.add_edge(subject, obj, label=predicate)

    return graph

In [22]:
# Build the graph using extracted entities and relationships
knowledge_graph = build_knowledge_graph(extracted_entities, extracted_relationships)
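
Once built, the graph can be queried through NetworkX's standard accessors. The sketch below rebuilds a tiny graph from triples copied from the sample output (so that it runs on its own) and lists the outgoing edges of one node. `describe_node` is a hypothetical helper. Note that a `DiGraph` stores a single edge per ordered node pair, so a later predicate between the same two nodes overwrites the earlier one.

```python
import networkx as nx


def describe_node(graph, node):
    """Return (node, predicate, target) for every outgoing edge of `node`."""
    return [
        (node, data.get("label"), target)
        for _, target, data in graph.out_edges(node, data=True)
    ]


graph = nx.DiGraph()
graph.add_node("rwanda", label="GPE")  # entity node with its NER label

# Triples copied from the sample relationship output.
for subject, predicate, obj in [
    ("sas", "combine", "frame"),
    ("it", "cover", "segments"),
    ("it", "include", "inputs"),
    ("hectares", "use", "pasture"),
]:
    graph.add_edge(subject, obj, label=predicate)

print(graph.nodes["rwanda"]["label"])
print(describe_node(graph, "it"))
```

The same call on the full graph, for example `describe_node(knowledge_graph, "sas")`, lists everything the extraction attached to that subject.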

In [None]:
# Draw the graph
plt.figure(figsize=(12, 12))
# Layout for visualization
pos = nx.spring_layout(knowledge_graph, seed=42)
nx.draw(
    knowledge_graph, pos, with_labels=True, node_size=3000, node_color="skyblue",
    font_size=10, font_weight="bold", edge_color="gray",
)
edge_labels = nx.get_edge_attributes(knowledge_graph, "label")
nx.draw_networkx_edge_labels(knowledge_graph, pos, edge_labels=edge_labels, font_color="red")
plt.title("Agricultural Knowledge Graph")
plt.show()