<a href="https://colab.research.google.com/github/chetankhairnar05/Python_Automation/blob/main/Untitled2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# First, you must install the required Python library.
# You also need to have Java installed on your system for tabula-py to work, as it's a wrapper for the Tabula-Java library.
# You can run this command in your terminal or a Colab cell:
!pip install tabula-py pandas

import pandas as pd
import tabula

# Define the URL to a sample PDF file with tables.
# This URL points to a public PDF document that is known to work with tabula.
pdf_url = "https://assets.accessible-digital-documents.com/uploads/2017/01/sample-tables.pdf"
output_filename = "extracted_tables.xlsx"

# --- Step 1: Read the tables from the PDF ---
# tabula.read_pdf returns a list of DataFrames, one for each table it finds.
# The 'multiple_tables=True' parameter tells tabula to look for multiple tables per page.
# The 'pages="all"' parameter tells tabula to process all pages in the PDF.
print(f"Extracting tables from PDF: {pdf_url}")
try:
    dfs = tabula.read_pdf(pdf_url, pages="all", multiple_tables=True, stream=True)
    print(f"Successfully extracted {len(dfs)} tables.")
except Exception as e:
    print(f"Error extracting tables from the PDF: {e}")
    # An empty list is fine if no tables are found, but a more specific error is useful
    dfs = []

# --- Step 2: Process and Display the extracted DataFrames ---
if len(dfs) > 0:
    for i, df in enumerate(dfs):
        print(f"\n--- Table {i + 1} from the PDF ---")
        print(df.head()) # Print the first 5 rows of each extracted table

# --- Step 3: (Optional) Export all tables to a single Excel file on separate sheets ---
if len(dfs) > 0:
    try:
        with pd.ExcelWriter(output_filename, engine='xlsxwriter') as writer:
            for i, df in enumerate(dfs):
                sheet_name = f"Table_{i + 1}"
                # Handle potential MultiIndex columns
                if isinstance(df.columns, pd.MultiIndex):
                    df.columns = ['_'.join(col).strip() if isinstance(col, tuple) else str(col) for col in df.columns.values]
                    df.columns = [col.replace('nan_', '').replace('_nan', '') for col in df.columns]

                df.to_excel(writer, sheet_name=sheet_name, index=False)
        print(f"\nSuccessfully exported all tables to '{output_filename}'.")
    except Exception as e:
        print(f"\nError exporting tables to Excel: {e}")

# --- Step 4: (Optional) Download the Excel file in a Colab environment ---
# This part is only for use in Google Colab. If you're running locally, you don't need this.
try:
    from google.colab import files
    print("Downloading the Excel file to your local machine.")
    files.download(output_filename)
except ImportError:
    print("\nNote: 'google.colab' module not found. This download step is skipped.")


Collecting tabula-py
  Downloading tabula_py-2.10.0-py3-none-any.whl.metadata (7.6 kB)
Downloading tabula_py-2.10.0-py3-none-any.whl (12.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m89.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tabula-py
Successfully installed tabula-py-2.10.0
Extracting tables from PDF: https://assets.accessible-digital-documents.com/uploads/2017/01/sample-tables.pdf


Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Aug 07, 2025 10:44:12 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Aug 07, 2025 10:44:14 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Aug 07, 2025 10:44:14 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Aug 07, 2025 10:44:14 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Aug 07, 2025 10:44:14 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Aug 07, 2025 10:44:15 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Aug 07, 2025 10:44:15 AM org.apache.pdfbox.pdmodel.font.PDTr

Successfully extracted 29 tables.

--- Table 1 from the PDF ---
  Column header (TH) Column header (TH).1 Column header (TH).2
0    Row header (TH)       Data cell (TD)       Data cell (TD)
1     Row header(TH)       Data cell (TD)       Data cell (TD)

--- Table 2 from the PDF ---
         Expenditure by function £ million  2009/10  2010/11 1
0               Policy functions Financial    22.50      30.57
1                            Information 2    10.20      14.80
2                              Contingency     2.60       1.20
3  Remunerated functions Agency services 3    44.70      35.91
4                                 Payments    22.41      19.88

--- Table 3 from the PDF ---
  Main character Daniel Radcliffe
0     Sidekick 1     Rupert Grint
1     Sidekick 2      Emma Watson
2   Lovable ogre  Robbie Coltrane
3      Professor     Maggie Smith
4     Headmaster   Richard Harris

--- Table 4 from the PDF ---
             Role             Actor
0  Main character  Daniel Radcliffe
1  

FileNotFoundError: Cannot find file: extracted_tables.xlsx