# 1. Intro

* Node: A data record with one or more labels (such as "Person") and key:value properties.
* Edge: A relationship between entities; it can also carry additional information about the relationship.

<img src="graph_db_intro.png"></img>
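
As a rough illustration of the node and edge definitions above, here is a minimal sketch of the property-graph model using plain Python dataclasses. The `Node` and `Edge` classes, the `to_cypher_pattern` helper, and the sample data are hypothetical and are not part of the Neo4j driver.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A data record with labels and key:value properties."""
    labels: tuple
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A typed, directed relationship that can carry its own properties."""
    start: Node
    end: Node
    type: str
    properties: dict = field(default_factory=dict)


def to_cypher_pattern(edge):
    """Render an edge as a Cypher-style pattern string (illustration only)."""
    return f"(:{edge.start.labels[0]})-[:{edge.type}]->(:{edge.end.labels[0]})"


# Hypothetical sample data
alice = Node(labels=("Person",), properties={"name": "Alice"})
acme = Node(labels=("Company",), properties={"name": "Acme"})
works_at = Edge(start=alice, end=acme, type="WORKS_AT", properties={"since": 2020})

print(to_cypher_pattern(works_at))  # (:Person)-[:WORKS_AT]->(:Company)
```

Note that the edge holds its own `properties` dictionary, which is what lets a relationship store extra information such as a start date.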

# 2. Fundamentals

### What is a Knowledge Graph?

<img src="KnowledgeGraph.png"></img>

# 3. Querying Knowledge Graphs with Cypher

In [1]:
from langchain_community.graphs import Neo4jGraph
from neo4j import GraphDatabase
import json

In [2]:
NEO4J_URI="bolt://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="12345678"

In [3]:
# Create a driver instance to communicate with the Neo4j instance
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
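
The connection settings above are hard-coded in the notebook. One possible alternative, sketched here under the assumption that the same three variable names are exported in the shell, is to read them from the environment with local defaults. The `load_neo4j_config` helper is hypothetical.

```python
import os


def load_neo4j_config(env=os.environ):
    """Read Neo4j connection settings from a mapping, falling back to local defaults."""
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        # No default password: an empty string makes a missing value easy to spot
        "password": env.get("NEO4J_PASSWORD", ""),
    }


config = load_neo4j_config()
print(config["uri"], config["username"])
```

The resulting values could then be passed to `GraphDatabase.driver(config["uri"], auth=(config["username"], config["password"]))`.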

In [4]:
# Function to execute a simple query (Cypher query)
# def run_query(query):
#     with driver.session() as session:
#         result = session.run(query)
#         for record in result:
#             print(record)

In [5]:
with open(r"Vat.json", 'r') as file:
    data = json.load(file)

In [8]:
with driver.session() as session:
    for article in data['description']:
        article_name = article.get('article_name').split("|")[-1].strip()
        article_summary = article.get('article_summary').strip()
        
        print(article_name)
        print(article_summary)
        
        # Cypher query to create a node for each article
        query = (
            "MERGE (a:Article {name: $article_name}) "
            "SET a.summary = $article_summary"
        )
        
        # Execute the query with parameters
        session.run(query, article_name=article_name, article_summary=article_summary)

Definitions
Defines key VAT terms such as taxable person, input tax, exempt supply, place of supply, designated zones, and tax invoices.
Scope of Tax
VAT applies to taxable and deemed supplies, as well as imports unless exempted.
Tax Rate
A standard VAT rate of 5% applies unless specified otherwise.
Responsibility for Tax
VAT must be paid by suppliers, importers, and recipients of services under the reverse charge mechanism.
Supply of Goods
A supply occurs when goods are sold, transferred, or contracted for future transfer.
Supply of Services
A supply of services is any transaction that is not a supply of goods.
Supply in Special Cases
Some transactions, such as vouchers and business transfers, are not VAT|applicable supplies.
Supply Consisting of More Than One Component
VAT treatment for bundled goods and services depends on the main supply.
Supply via Agent
Defines VAT obligations when selling through an agent in their own name or on behalf of a principal.
Supply by Government Entiti
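
The loading cell calls `.split()` and `.strip()` directly on the result of `article.get(...)`, which raises an `AttributeError` if a key is missing. A minimal sketch of a more defensive version of the same cleaning step follows. The `clean_article` helper and the sample records are hypothetical, and the exact shape of `Vat.json` is assumed from the loading code.

```python
def clean_article(article):
    """Return (name, summary) for one record, or None if the name is missing."""
    raw_name = article.get("article_name") or ""
    raw_summary = article.get("article_summary") or ""
    # Keep only the text after the last "|" separator, as the loading cell does
    name = raw_name.split("|")[-1].strip()
    if not name:
        return None
    return name, raw_summary.strip()


# Hypothetical records shaped like the entries in data['description']
records = [
    {"article_name": "Article 1 | Definitions", "article_summary": " Defines key VAT terms. "},
    {"article_name": None, "article_summary": "Orphan summary"},
]

cleaned = [c for c in map(clean_article, records) if c is not None]
print(cleaned)  # [('Definitions', 'Defines key VAT terms.')]
```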

In [9]:
# Cypher query to delete a specific article by name
# query = "MATCH (a:Article {name: $article_name}) DELETE a"
# session.run(query, article_name=article_name)

In [None]:
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
import numpy as np

# Load Hugging Face Embedding Model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Function to Generate Embeddings using Hugging Face
def generate_embedding(text):
    embedding = embedding_model.encode(text, convert_to_numpy=True)
    return embedding.tolist()  # Convert numpy array to list for storage

# Function to Fetch, Generate & Store Embeddings in Neo4j
def update_embeddings():
    with driver.session() as session:
        # Step 1: Fetch all articles
        query = "MATCH (a:Article) RETURN a.name AS name, a.summary AS summary, ID(a) AS node_id"
        results = session.run(query)

        # Step 2: Process and update embeddings
        for record in results:
            article_name = record["name"] or ""  # Handle missing values
            article_summary = record["summary"] or ""
            node_id = record["node_id"]

            if not article_name and not article_summary:
                continue  # Skip empty articles

            combined_text = f"{article_name}: {article_summary}"
            embedding = generate_embedding(combined_text)

            # Step 3: Store embedding in Neo4j
            session.run(
                "MATCH (a:Article) WHERE ID(a) = $node_id "
                "SET a.embedding = $embedding",
                node_id=node_id, embedding=embedding
            )

    print("Embeddings updated successfully!")

# Run the function
update_embeddings()

# Close the Neo4j connection
driver.close()




Embeddings updated successfully!
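
Once each `Article` node has an `embedding` property, one common way to use it is to rank articles by cosine similarity to a query embedding. The notebook does not show this step, so the following is only a sketch: the `cosine_similarity` and `rank_articles` helpers are hypothetical, and the 3-dimensional vectors are toy stand-ins for the vectors the model produces.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_articles(query_embedding, articles):
    """Sort (name, embedding) pairs by similarity to the query, most similar first."""
    scored = [(cosine_similarity(query_embedding, emb), name) for name, emb in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy vectors, not real model output
articles = [
    ("Tax Rate", [1.0, 0.0, 0.0]),
    ("Scope of Tax", [0.0, 1.0, 0.0]),
    ("Definitions", [0.7, 0.7, 0.0]),
]

print(rank_articles([1.0, 0.1, 0.0], articles))
# ['Tax Rate', 'Definitions', 'Scope of Tax']
```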


In [None]:
# -------------------------------- End --------------------------------------

In [103]:
from langchain.schema.document import Document  # Import Document class
from langchain_community.document_loaders import PyMuPDFLoader

In [125]:
import re
import fitz  # PyMuPDF
import json
from pathlib import Path
import os

def create_documents_from_structure(structure, pdf_path):
    documents = []
    for book, parts in structure.get('Books', {}).items():
        for part, chapters in parts.get('Parts', {}).items():
            for chapter, articles in chapters.get('Chapters', {}).items():
                for article in articles:
                    for article_name, content in article.items():
                        metadata = {
                            'file_name': os.path.basename(pdf_path),
                            'page': content.get('page', 'N/A'),  # Add page number to metadata
                            'article_name': article_name
                        }
                        documents.append(Document(page_content=f"{chapter} {article_name} {content['text']}", metadata=metadata))
    for chapter, articles in structure.get('Chapters', {}).items():
        for article in articles:
            for article_name, content in article.items():
                metadata = {
                    'file_name': os.path.basename(pdf_path),
                    'page': content.get('page', 'N/A')  # Add page number to metadata
                }
                documents.append(Document(page_content=f"{chapter} {article_name} {content['text']}", metadata=metadata))
    for article in structure.get('Articles', []):
        for article_name, content in article.items():
            metadata = {
                'file_name': os.path.basename(pdf_path),
                'page': content.get('page', 'N/A')  # Add page number to metadata
            }
            documents.append(Document(page_content=f"{article_name} {content['text']}", metadata=metadata))
    return documents

def extract_headings_and_content(file_id, pdf_path, display_name, is_proprietary):
    """Extract headings and content from a PDF file using regex and build a nested structure."""
    doc = fitz.open(pdf_path)
    structure = {}

    # Define regex patterns for book, part, chapter, and article headings
    book_pattern = re.compile(r'Book [A-Z]+:.*', re.IGNORECASE)
    part_pattern = re.compile(r'Part [A-Z]+:.*', re.IGNORECASE)
    chapter_pattern = re.compile(r'Chapter [\d]+ -.*', re.IGNORECASE)
    article_pattern = re.compile(r'Article \(?\d+\)?.*', re.IGNORECASE)

    current_book = None
    current_part = None
    current_chapter = None
    current_article = None
    content_buffer = []
    current_page = None

    def add_content_to_structure():
        nonlocal current_book, current_part, current_chapter, current_article, current_page
        if current_article and content_buffer:
            content = "\n".join(content_buffer).strip()
            content_data = {'text': content, 'page': current_page}
            if current_chapter:
                if current_part:
                    if current_book:
                        # Ensure all levels of the structure are initialized
                        structure.setdefault(current_book, {'Parts': {}})
                        structure[current_book]['Parts'].setdefault(current_part, {'Chapters': {}})
                        structure[current_book]['Parts'][current_part]['Chapters'].setdefault(current_chapter, []).append({current_article: content_data})
                    else:
                        # Initialize the structure for parts if the book is not present
                        structure.setdefault(current_part, {'Chapters': {}})
                        structure[current_part]['Chapters'].setdefault(current_chapter, []).append({current_article: content_data})
                else:
                    # Initialize the structure for chapters if the part is not present
                    structure.setdefault('Chapters', {})
                    structure['Chapters'].setdefault(current_chapter, []).append({current_article: content_data})
            else:
                # Initialize the structure for articles if the chapter is not present
                structure.setdefault('Articles', []).append({current_article: content_data})
            # Clear the content buffer and current article after adding content to the structure
            content_buffer.clear()
            current_article = None

    for page_number, page in enumerate(doc, start=1):
        # Get the text of the page
        text = page.get_text()
        current_page = page_number
        # Find all matches of the patterns
        for line in text.split('\n'):
            if book_match := book_pattern.match(line):
                add_content_to_structure()
                current_book = book_match.group()
                current_part = None
                current_chapter = None
            elif part_match := part_pattern.match(line):
                add_content_to_structure()
                current_part = part_match.group()
                current_chapter = None
            elif chapter_match := chapter_pattern.match(line):
                add_content_to_structure()
                current_chapter = chapter_match.group()
            elif article_match := article_pattern.match(line):
                add_content_to_structure()
                current_article = article_match.group()
            else:
                content_buffer.append(line)

    # Add the last buffered content to the structure
    add_content_to_structure()

    doc.close()

    new_docs = create_documents_from_structure(structure, pdf_path)
    
    # upload pdf file to s3
    
    # s3_file_link, error = upload_file_to_s3(country_name, pdf_path)
    
    # if error:
    #     print(f"Error uploading pdf to S3: {error}")
    #     # stops the loop if there is an error
    #     return None
    
    for doc in new_docs:
        # doc.metadata['link'] = s3_file_link
        doc.metadata['display_name'] = display_name
        doc.metadata['is_proprietary'] = is_proprietary
        doc.metadata['image_link'] = " "
        doc.metadata['file_id'] = file_id

    return new_docs
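
To see how the heading patterns in `extract_headings_and_content` behave on individual lines, the sketch below applies the same four regular expressions in the same order. The `classify_line` helper and the sample lines are hypothetical and only illustrate the matching logic.

```python
import re

# Same patterns and priority order as in extract_headings_and_content
HEADING_PATTERNS = [
    ("book", re.compile(r'Book [A-Z]+:.*', re.IGNORECASE)),
    ("part", re.compile(r'Part [A-Z]+:.*', re.IGNORECASE)),
    ("chapter", re.compile(r'Chapter [\d]+ -.*', re.IGNORECASE)),
    ("article", re.compile(r'Article \(?\d+\)?.*', re.IGNORECASE)),
]


def classify_line(line):
    """Return the first heading kind whose pattern matches at the start of the line."""
    for kind, pattern in HEADING_PATTERNS:
        if pattern.match(line):
            return kind
    return "content"


sample_lines = [
    "Part One: Definitions",
    "Chapter 1 - General Provisions",
    "Article (1) Definitions",
    "Article 53 of the Decree-Law applies here.",
    "The Cabinet has decided:",
    "  Article 2 with leading spaces",
]

for line in sample_lines:
    print(f"{classify_line(line):8} | {line}")
```

Two behaviors are worth noticing. Because `match()` anchors at the start of the string, an indented heading is treated as content. A body sentence that happens to begin with "Article" followed by a number is treated as a new article heading, so the resulting documents may need a sanity check.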

In [126]:
pdf_path = r"C:\Users\Abdullah\Downloads\VAT\VAT\Executive Regulations\Executive Regulation of Federal Decree Law No 8 of 2017 - Publish-new-2.pdf"


In [127]:
x = extract_headings_and_content(1, pdf_path, "abc", "False")

In [128]:
len(x)

78

In [130]:
for i in x:
    print(i.metadata)
    print(i,"\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")

{'file_name': 'Executive Regulation of Federal Decree Law No 8 of 2017 - Publish-new-2.pdf', 'page': 5, 'display_name': 'abc', 'is_proprietary': 'False', 'image_link': ' ', 'file_id': 1}
page_content='Article 1   1 
Cabinet Decision No. 52 of 2017 and its amendments – Unofficial translation 
 
This is not an official Translation: 
The Executive Regulation of the Federal Decree-Law No. 8 
of 2017 on Value Added Tax 
Cabinet Decision No. 52 of 2017 – Issued 26 Nov 2017 
Cabinet Decision No. 46 of 2020 – Issued 4 Jun 2020 (Effective from 4 Jun 2020) 
Cabinet Decision No. 24 of 2021 – Issued 11 Mar 2021 (Effective from 1 Jan 2018) 
Cabinet Decision No. 88 of 2021 – Issued 28 Sep 2021 (Effective from 30 Oct 2021) 
 
The Cabinet has decided:  
 
- Having reviewed the Constitution, 
- Federal Law No. 1 of 1972 on the Competencies of the Ministries and Powers 
of the Ministers and its amendments, 
- Federal Decree-Law No. 13 of 2016 on the Establishment of the Federal Tax 
Authority, 
- Federa