#**CODE FLOW EXPLANATION**

#**Step 1: Extract Data from the PDF**
The first step extracts structured data from PDF files using the `PDFDataExtractor` class. It parses the raw document content and organizes it into tables and key-value pairs.

##Input:

* The `process_pdf()` method of the `PDFDataExtractor` class takes the path to a PDF file as input.

##Processing:

####1) Tables Extraction:
* The `img2table` library is used to extract tabular data from the PDF.
* The extracted tables are stored as objects in the output dictionary under the key `'tables'`.

####2) Key-Value Pairs Extraction:


* Regular expressions (regex) are employed to identify key-value pairs (non-tabular data) in the PDF.
* These extracted key-value pairs are organized into a JSON-like structure and stored in the output dictionary under the key `'key_value_pairs'`.
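The regex step above can be sketched in isolation as follows. This is a minimal illustration, not the notebook's exact method: the function name `extract_pairs` and the sample text are hypothetical, while the three patterns are the defaults used by `_pattern_based_extraction` in the code further down.

```python
import re
from typing import Dict, List

# The same three default patterns the notebook's extractor uses
PATTERNS: List[str] = [
    r'([^:\n]+):\s*([^\n]+)',   # "Key: Value"
    r'([^=\n]+)=\s*([^\n]+)',   # "Key = Value"
    r'([^\t\n]+)\t+([^\n]+)',   # tab-separated "Key<TAB>Value"
]

def extract_pairs(text: str, patterns: List[str] = PATTERNS) -> Dict[str, str]:
    """Return key-value pairs found in `text`; later matches overwrite earlier ones."""
    pairs: Dict[str, str] = {}
    for pattern in patterns:
        for key, value in re.findall(pattern, text):
            key, value = key.strip(), value.strip()
            if key and value:
                pairs[key] = value
    return pairs

# Hypothetical sample text, not taken from a real datasheet
sample = "Model: XT-200\nPressure = 10 bar\nMaterial\tSS316"
pairs = extract_pairs(sample)
print(pairs)
```

Because every pattern is applied to the whole text, a line that contains more than one separator (for example `Ratio = 1:2`) can yield more than one entry, so the extracted pairs may need review before they are compared.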




#**Step 2: Convert Extracted Data into HTML**
Once the data is extracted, it is converted into an HTML format for further processing or display.

##Input:

* The output from `process_pdf()` (a dictionary containing `'tables'` and `'key_value_pairs'`).

##Processing:

The `combine_html()` method performs the following operations:
####Extract HTML from Table Objects:


* Iterates through the `'tables'` data.
* For each table object, reads its `.html` property to get its HTML representation.
* Appends all table HTML strings into a single combined HTML structure.



####Convert JSON to HTML:


* Converts the `'key_value_pairs'` JSON data into an HTML table using the helper method `json_to_html_table()`.
* Adds the generated HTML to the combined HTML structure (the code places this key-value table ahead of the table HTML).
* The resulting HTML contains both tabular and key-value data, merged into a single string.
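The merge described above can be sketched without any PDF libraries. This is an illustrative sketch under assumptions: the function names `pairs_to_html_table` and `combine_fragments` are hypothetical, the table HTML is passed in as plain strings, and the use of `html.escape` is an addition that the notebook's own `json_to_html_table` does not perform.

```python
from html import escape
from typing import Dict, Iterable

def pairs_to_html_table(pairs: Dict[str, str]) -> str:
    """Render key-value pairs as a two-column HTML table, escaping special characters."""
    rows = "".join(
        f"<tr><td>{escape(str(k))}</td><td>{escape(str(v))}</td></tr>"
        for k, v in pairs.items()
    )
    return f"<table><tr><th>Key</th><th>Value</th></tr>{rows}</table>"

def combine_fragments(pairs: Dict[str, str], table_html: Iterable[str]) -> str:
    """Join the key-value table and the already-rendered table fragments into one string."""
    parts = [pairs_to_html_table(pairs), *table_html]
    return "\n".join(parts)

# Hypothetical inputs standing in for the extractor output
combined = combine_fragments(
    {"Pressure": "< 10 bar"},
    ["<table><tr><td>Tag</td><td>PT-101</td></tr></table>"],
)
print(combined)
```

Escaping matters when extracted values contain characters such as `<` or `&`, which would otherwise be read as markup by whatever consumes the HTML.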




#**Step 3: Compare PDF Data Using LLM**
The final step involves leveraging a Large Language Model (LLM) to perform data comparison between two PDFs.

##Input:

The HTML data generated from Step 2 for both PDF files:

* **Datasheet**: HTML data for the first PDF file.

* **Vendor**: HTML data for the second PDF file.

##Processing:

Both HTML datasets are passed to an LLM, which analyzes and compares them.
The model compares the content of the two PDFs based on specific parameters and tags, such as:
* Key-value pairs
* Tables and their structure
* Alignment of content between the two documents

The comparison is done at a semantic level, leveraging the model’s ability to identify relationships and patterns in the data.
##Output:

The model returns the comparison result as text; the default prompt asks for it in tabular form. This could include:
* Matching or mismatched parameters between the two documents.
* Tag-by-tag or parameter-level comparison details.
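To make the result labels concrete, here is a small deterministic stand-in for the comparison. It is not the LLM step and does no semantic matching: it only compares values after trimming and lower-casing. The function name `compare_pairs` and the sample dictionaries are hypothetical; the labels Matching, Discrepancy and Missing are the ones the notebook's default prompt asks the model to use.

```python
from typing import Dict, List, Tuple

def compare_pairs(
    datasheet: Dict[str, str], vendor: Dict[str, str]
) -> List[Tuple[str, str, str, str]]:
    """Return (parameter, datasheet value, vendor value, result) rows, sorted by parameter."""
    rows = []
    for key in sorted(set(datasheet) | set(vendor)):
        d, v = datasheet.get(key), vendor.get(key)
        if d is None or v is None:
            result = "Missing"          # parameter present in only one document
        elif d.strip().lower() == v.strip().lower():
            result = "Matching"         # same value, ignoring case and outer whitespace
        else:
            result = "Discrepancy"      # present in both documents but different
        rows.append((key, d or "", v or "", result))
    return rows

# Hypothetical extracted values for two documents
datasheet = {"Pressure": "10 bar", "Material": "SS316", "Tag": "PT-101"}
vendor = {"Pressure": "12 bar", "Material": "ss316"}
rows = compare_pairs(datasheet, vendor)
for row in rows:
    print(row)
```

A check like this could serve as a sanity baseline next to the model's answer, since an exact-match pass cannot judge semantically equivalent values such as different unit spellings.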

In [1]:
!apt-get install tesseract-ocr
!apt-get install tesseract-ocr-eng

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1.1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [2]:
! pip install opencv-contrib-python-headless



In [3]:
!pip install img2table gradio



In [4]:
# Required packages:
!pip install pdfplumber pandas pytesseract opencv-python numpy chromadb



#Import Libraries

In [5]:
import pdfplumber
import pandas as pd
import pytesseract
import cv2
import numpy as np
from pathlib import Path
from typing import Dict, List, Tuple, Union
import logging
import json
from huggingface_hub import InferenceClient
from google.colab import userdata
import chromadb
from sentence_transformers import SentenceTransformer
from img2table.document import Image, PDF
from img2table.ocr import TesseractOCR
from bs4 import BeautifulSoup
import re
import gradio as gr

In [6]:
class PDFDataExtractor:
    """Main class for extracting data from PDFs with tables and key-value pairs."""

    def __init__(self, config_path: str = None):
        """Initialize the extractor with optional configuration."""
        self.logger = self._setup_logging()
        self.config = self._load_config(config_path) if config_path else {}

    def _setup_logging(self) -> logging.Logger:
        """Configure logging for the extraction process."""
        logger = logging.getLogger('PDFDataExtractor')
        logger.setLevel(logging.INFO)
        # Attach a handler only once so repeated calls do not duplicate log lines
        if not logger.handlers:
            handler = logging.StreamHandler()
            formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
            handler.setFormatter(formatter)
            logger.addHandler(handler)
        return logger

    def _load_config(self, config_path: str) -> Dict:
        """Load configuration from JSON file."""
        with open(config_path, 'r') as f:
            return json.load(f)

    def extract_tables_PDFPlumber(self, pdf_path: str) -> List[pd.DataFrame]:
        """Extract tables from PDF using pdfplumber."""
        tables = []
        self.logger.info(f"Processing PDF: {pdf_path}")

        try:
            with pdfplumber.open(pdf_path) as pdf:
                for page_num, page in enumerate(pdf.pages, 1):
                    # Extract tables from the page
                    page_tables = page.extract_tables()

                    for table_num, table in enumerate(page_tables, 1):
                        if table:
                            # Convert to DataFrame and clean up
                            df = pd.DataFrame(table[1:], columns=table[0])
                            df = self._clean_dataframe(df)
                            tables.append(df)

                            self.logger.info(f"Extracted table {table_num} from page {page_num}")

        except Exception as e:
            self.logger.error(f"Error extracting tables: {str(e)}")

        return tables

    def extract_tables_img2table(self, pdf_path: str):
        """Extract tables from PDF using img2table."""
        pdf = PDF(src=pdf_path)
        ocr = TesseractOCR(lang='eng')
        # Result maps each page index to the list of tables found on that page
        tables = pdf.extract_tables(ocr=ocr, min_confidence=70)
        return tables

    def extract_key_value_pairs(self, pdf_path: str) -> Dict[str, str]:
        """Extract key-value pairs using pattern matching and positioning."""
        key_value_pairs = {}

        try:
            with pdfplumber.open(pdf_path) as pdf:
                for page in pdf.pages:
                    # Fall back to an empty string for pages without extractable text
                    text = page.extract_text() or ""

                    # Extract using common patterns
                    pairs = self._pattern_based_extraction(text)
                    key_value_pairs.update(pairs)

                    # Extract using positioning (pairs words that share a line); currently disabled
                    # positioned_pairs = self._position_based_extraction(page)
                    # key_value_pairs.update(positioned_pairs)
                    # print(f"positioned pair: {positioned_pairs}")

        except Exception as e:
            self.logger.error(f"Error extracting key-value pairs: {str(e)}")

        return key_value_pairs

    def _pattern_based_extraction(self, text: str, patterns: List[str] = None) -> Dict[str, str]:
        """Extract key-value pairs using regex patterns."""
        pairs = {}

        # Default patterns for key-value pairs (built here to avoid a mutable default argument)
        if patterns is None:
            patterns = [
                r'([^:\n]+):\s*([^\n]+)',  # Basic pattern: "Key: Value"
                r'([^=\n]+)=\s*([^\n]+)',  # Alternative pattern: "Key = Value"
                r'([^\t\n]+)\t+([^\n]+)'   # Tab-separated pattern
            ]

        for pattern in patterns:
            matches = re.findall(pattern, text)
            for key, value in matches:
                key = key.strip()
                value = value.strip()
                if key and value:
                    pairs[key] = value

        return pairs

    def _position_based_extraction(self, page) -> Dict[str, str]:
        """Extract key-value pairs based on positioning in the document."""
        pairs = {}

        # Extract words with their positions
        words = page.extract_words()

        # Group words by their vertical position (assuming keys and values are on the same line)
        lines = {}
        for word in words:
            y_pos = round(word['top'])
            if y_pos not in lines:
                lines[y_pos] = []
            lines[y_pos].append(word)

        # Analyze each line for potential key-value pairs
        for y_pos, line_words in lines.items():
            if len(line_words) >= 2:
                # Assume first word(s) are key and last word(s) are value
                potential_key = ' '.join(w['text'] for w in line_words[:len(line_words)//2])
                potential_value = ' '.join(w['text'] for w in line_words[len(line_words)//2:])

                if potential_key and potential_value:
                    pairs[potential_key.strip()] = potential_value.strip()

        return pairs

    def _clean_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """Clean and preprocess extracted DataFrame."""
        # Remove empty rows and columns
        df = df.dropna(how='all').dropna(axis=1, how='all')

        # Strip whitespace from string cells (non-string cells are left unchanged)
        df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v))

        # Remove duplicate rows
        df = df.drop_duplicates()

        return df

    def process_pdf(self, pdf_path: str) -> Dict[str, Union[Dict[int, list], Dict[str, str]]]:
        """Process PDF and extract both tables and key-value pairs."""
        result = {
            'tables': self.extract_tables_img2table(pdf_path),
            'key_value_pairs': self.extract_key_value_pairs(pdf_path)
        }

        return result

    def json_to_html_table(self,json_data):
        """
        Converts a dictionary of key-value pairs into an HTML table.

        Args:
            json_data (dict): Key-value pairs extracted from the PDF.

        Returns:
            str: HTML string representing the data in a table format.
        """
        try:
            # logging.info(f"Loading JSON data from {json_file_path}.")
            # with open(json_file_path, 'r') as file:
            #     json_data = json.load(file)

            logging.debug("Generating HTML table from JSON data.")
            html_table = "<table border='1' style='border-collapse: collapse; width: 100%;'>"
            html_table += "<tr><th>Key</th><th>Value</th></tr>"

            for key, value in json_data.items():
                html_table += f"<tr><td>{key}</td><td>{value}</td></tr>"

            html_table += "</table>"
            logging.info("HTML table generated successfully.")
            return html_table
        except Exception as e:
            logging.error(f"Error while converting JSON to HTML table: {e}")
            return f"Error: {str(e)}"

    def combine_html(self,pdf_data):
         """
          Combines multiple HTML strings into a single string.

         Args:
            pdf_data (dict): A dictionary containing data extracted from a PDF,
                         including tables and key-value pairs.

            Expected structure of `pdf_data`:
            {
                'tables': {0: [table1, table2, ...], 1: [table3, ...], ...},  # page index -> tables
                'key_value_pairs': {...}
            }

         Returns:
           str: A single combined HTML string containing all tables and key-value pairs.
         """
         combined_html = ""
         # Add key-value pairs converted to HTML
         combined_html += f"\n {self.json_to_html_table(pdf_data.get('key_value_pairs', {}))}"

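         # Iterate over the tables in pdf_data; img2table returns a mapping of
         # page index -> list of tables, but a plain list of lists is handled too
         tables = pdf_data.get('tables') or {}
         pages = tables.values() if isinstance(tables, dict) else tables
         for page_tables in pages:
             for table in page_tables:
                 combined_html += f"\n {table.html}"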

         return combined_html


    def save_results(self, results: Dict, output_dir: str):
        """Save extracted data to files."""
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Save tables to Excel sheets
        if isinstance(results['tables'],list):
            with pd.ExcelWriter(output_path / 'extracted_tables.xlsx') as writer:
                for i, df in enumerate(results['tables'], 1):
                    df.to_excel(writer, sheet_name=f'Table_{i}', index=False)

        # Save key-value pairs to JSON
        if results['key_value_pairs']:
            with open(output_path / 'key_value_pairs.json', 'w') as f:
                json.dump(results['key_value_pairs'], f, indent=4)



In [7]:
class pdf_data_comparison:
    def __init__(self, hf_api_key):
        """
        Initializes the pdf_data_comparison class: sets up logging and creates the Hugging Face InferenceClient.
        """
        self.setup_logging()
        self.client = InferenceClient(api_key=hf_api_key)

    def setup_logging(self):
        """
        Sets up logging configuration for the class.
        """
        logging.basicConfig(
            level=logging.DEBUG,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler("html_table_processor.log"),
                logging.StreamHandler()
            ]
        )
        logging.info("Logging configuration set up successfully.")





    def compare_html_data(self, datasheet_html, vendor_html,model_name: str="Qwen/Qwen2.5-72B-Instruct",query= None):
        """
        Compares the data between two HTML contents based on given parameters.

        Args:
            datasheet_html (str): HTML content of the datasheet.
            vendor_html (str): HTML content of the vendor's data.
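            model_name (str): Name of the hosted model to query.
            query (str, optional): Custom instructions; used instead of the default comparison instructions when given.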

        Returns:
            str: Comparison result as generated by the InferenceClient.
        """
        if query is not None:
            prompt = f"""
               Compare the data extracted from the following HTML contents:

               **Datasheet HTML**:
               {datasheet_html}

               **Vendor HTML**:
               {vendor_html}

               {query}
               """
        else:

            prompt = f"""
            Compare the data extracted from the following HTML contents:

            **Datasheet html**:
            ```{datasheet_html}```

            **Vendor html**:
            ```{vendor_html}```

            compare all the parameter values in both the html contents.
            when there is table with multiple column try to extract and compare the individual rows.
            If there are multiple tables compare it individually.

            Output Instructions:
            - Present the results in a clear tabular format with columns for:
            1. **Parameter** (parameter name)
            2. **Datasheet Value**
            3. **Vendor Value**
            4. **Result** (Matching, Discrepancy, Missing)
            5. **Detailed Explanation**
            - Include comprehensive details for any discrepancies or missing entries.
            give the result with separate table heading parameter comparison and Tag Number comparison.
            """



        messages = [
            {"role": "system", "content": "You are an Expert data analyst who can analyse HTML documents."},
            {"role": "user", "content": prompt}
        ]

        # Generate response using the InferenceClient
        stream = self.client.chat.completions.create(
            model= model_name,
            messages=messages,
            max_tokens=5000,
            stream=True,
            temperature=1,
            top_p=0.1

        )

        result = ""
        for chunk in stream:
            # Some chunks (typically the last one) carry no content, so guard against None
            result += chunk.choices[0].delta.content or ""

        return result

    def convert_into_dataframe(self, input_data):
        """
        Extracts a table from a string and converts it into a DataFrame.

        Args:
            input_data (str): The string containing table-like data.

        Returns:
            pd.DataFrame: DataFrame representation of the table.
        """
        # Split the input string into lines
        lines = input_data.split("\n")

        # Filter only rows that are part of the table (lines with | separators)
        table_rows = [line for line in lines if '|' in line]

        # Remove separator rows (lines made up only of '|', '-', ':' and whitespace, e.g. |---|:---:|)
        table_rows = [row for row in table_rows if not re.fullmatch(r'\s*\|[\s:|-]*\|\s*', row)]

        # Split each row into columns using | as a separator
        table_data = [row.split('|')[1:-1] for row in table_rows]  # Exclude first and last empty entries

        # Strip whitespace from each cell
        table_data = [[cell.strip() for cell in row] for row in table_data]

        # Use the first row as the header; return an empty DataFrame if no table was found
        if not table_data:
            return pd.DataFrame()
        header = table_data[0]
        data = table_data[1:]

        # Create a DataFrame
        df = pd.DataFrame(data, columns=header)

        # Remove any symbol characters from the DataFrame
        # (note: this also strips decimal points, minus signs and symbols such as '%')
        df = df.replace(r'[^\w\s]', '', regex=True)

        return df


In [8]:

def process_pdf_data(pdf):
    # Initialize extractor (logging is configured in its constructor)
    extractor = PDFDataExtractor()

    # Extract tables and key-value pairs, then merge them into one HTML string
    results = extractor.process_pdf(pdf)
    html_data = extractor.combine_html(results)

    return html_data

def compare_pdfs(result1,result2):
    # Send both HTML strings to the LLM and save its response as a Markdown file

    table_comparison = pdf_data_comparison(hf_api_key=userdata.get('HF_TOKEN'))
    comparison_result = table_comparison.compare_html_data(result1,result2)
    output_path = "comparison_result.md"
    with open(output_path, "w") as md_file:
        md_file.write(comparison_result)
    return comparison_result, output_path



# Backend function for Gradio
def process_and_compare(pdf1, pdf2):
    result1 = process_pdf_data(pdf1)
    result2 = process_pdf_data(pdf2)
    comparison_result,output_path = compare_pdfs(result1,result2)

    return comparison_result,output_path


# Function to process the dropdown selection and return the initial short (collapsed) text
def get_initial_text(selected_option):
    if selected_option == "First PDF Data":
        return "This is less text for the First PDF Data."
    elif selected_option == "Second PDF Data":
        return "This is less text for the Second PDF Data."
    else:
        return "Please select an option."

# Function to toggle between full and short versions of the text
def toggle_text(selected_option,pdf1,pdf2,current_state):
    if selected_option == "First PDF Data":
        if current_state == "show_less":
            return (
                process_pdf_data(pdf1),
                "Show Less",
                "show_more"
            )
        else:
            return (
                "This is less text for the First PDF Data.",
                "Show More",
                "show_less"
            )
    elif selected_option == "Second PDF Data":
        if current_state == "show_less":
            return (
                process_pdf_data(pdf2),
                "Show Less",
                "show_more"
            )
        else:
            return (
                "This is less text for the Second PDF Data.",
                "Show More",
                "show_less"
            )
    else:
        return "Please select an option.", "Show More", "show_less"


# Gradio Interface
with gr.Blocks() as demo:
    gr.Markdown("# PDF Comparison Tool")

    with gr.Row():
        pdf1_input = gr.File(label="Upload First PDF", file_types=[".pdf"])
        pdf2_input = gr.File(label="Upload Second PDF", file_types=[".pdf"])

    # flag =1
    # if flag:
    #    with gr.Row():
    #        #patterns = gr.Textbox(label="PDF Patterns",placeholder="[r'([^:\n]+):\s*([^\n]+)',r'([^=\n]+)=\s*([^\n]+)',r'([^\t\n]+)\t+([^\n]+)']")
    #        model_name = gr.Textbox(label="Model Name",placeholder="Qwen/Qwen2.5-72B-Instruct")
    #        query = gr.Textbox(label="Query")

    gr.Markdown("# Extracted Data")
    with gr.Row():

        dropdown = gr.Dropdown(
            choices=["First PDF Data", "Second PDF Data"],
            label="Select Data to View"
        )

    # State tracking whether the short ("show_less") or full ("show_more") text is displayed
    text_state = gr.State("show_less")  # Initial state is "show_less"

    # HTML component to display the extracted data
    text_display = gr.HTML()

    # Button to toggle between the short and full views
    toggle_button = gr.Button("Show More")

    gr.Markdown("# Comparison Result")
    compare_button = gr.Button("Process and Compare PDFs")
    result_output = gr.Markdown(label="Comparison Result")
    download_button = gr.File(label="Download Result md")


    # Event: Update text display when dropdown changes
    dropdown.change(
        fn=get_initial_text,
        inputs=[dropdown],
        outputs=[text_display]
    )

    # Event: Toggle between full and short text
    toggle_button.click(
        fn=toggle_text,
        inputs=[dropdown,pdf1_input,pdf2_input,text_state],
        outputs=[text_display, toggle_button, text_state]
    )

    compare_button.click(
        fn=process_and_compare,
        inputs=[pdf1_input, pdf2_input],
        outputs=[result_output, download_button]
    )


# Launch the app
demo.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://890b6a504bc8a84c1c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


