# ODOT Bid Tabulation Data Extraction

This notebook extracts bidder information from Ohio Department of Transportation (ODOT) bid tabulation PDF files: company names, addresses, counties, and bid amounts, plus a flag marking which company was awarded each project.

In [6]:
import pandas as pd

df = pd.read_csv(r'C:\Users\clint\Desktop\RA Task\7.csv')
df

Unnamed: 0,state,county,fips,year,project_start,project_id,route,mileage,lanes,project_duration_days,eng_estimate_mils,win_bid_mils,cost_mils,num_bidders,bidders_list,all_routes,cost_overrun_pct,Project Num
0,Ohio,Paulding,39125,2018,2018-05-24 00:00:00,105522,111,12.982,2.0,99.0,0.943,0.957859,1.047510,2.0,"Shelly Company, Gerken Paving",,9.359511,180326
1,Ohio,Wyandot,39175,2018,2018-11-15 00:00:00,88832,23,,4.0,290.0,3.702,3.236775,3.304783,3.0,,"['23', '23', '23']",2.101101,180569
2,Ohio,Butler,39017,2018,2018-01-18 00:00:00,94263,73,,2.0,195.0,0.287,0.258900,0.232678,3.0,,['73'],-10.128362,180006
3,Ohio,Franklin,39049,2018,2018-01-18 00:00:00,76467,270 / 315,,4.0,255.0,4.904,6.101481,6.991613,3.0,,"['270', '270', '270', '270', '270', '270', '27...",14.588788,180012
4,Ohio,Hocking,39073,2018,2018-01-18 00:00:00,101555,33,,2.0,255.0,0.435,0.553756,0.531749,2.0,,['33'],-3.974212,180020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198,Ohio,Harrison,39067,2018,2018-12-20 00:00:00,91844,250 / 9,,2.0,284.0,2.249,2.284000,2.212447,3.0,,"['9', '250']",-3.132812,180609
199,Ohio,Highland,39071,2018,2018-12-20 00:00:00,84622,138 / 753,,2.0,223.0,1.508,1.494436,1.658765,2.0,,"['138', '138', '138', '753']",10.996011,180610
200,Ohio,Madison,39097,2018,2018-12-20 00:00:00,105547,42,,2.0,284.0,3.727,3.611668,3.535001,4.0,,"['42', '42', '42']",-2.122762,180611
201,Ohio,Clinton,39027,2018,2018-12-20 00:00:00,87300,251 / 68 / 350,,2.0,589.0,4.893,5.441520,5.331566,1.0,,"['68', '68', '68', '68', '68', '68', '251', '3...",-2.020640,180621



## Process Overview

1. Extract text from PDF bid tabulation files
2. Use regular expressions to identify project information and bidders
3. Create structured data with bidder details
4. Process multiple files and consolidate results
5. Save extracted data to CSV files

In [7]:
# Import required libraries
import os
import re
import glob
import pandas as pd
import PyPDF2
from tqdm.notebook import tqdm  # Progress bar for notebook environments

## PDF Processing Approach

The key challenges in extracting bidder information from PDFs are:

1. Identifying the project details (project number, PID, award amount, etc.)
2. Correctly matching bidder companies with their bid amounts
3. Determining which company was awarded the contract
4. Handling various PDF formats and layout differences

We'll use regular expressions (regex) to extract these details from the text content of the PDFs.
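
As a minimal sketch of how the project details listed above could be pulled out with regex, the snippet below runs a few field patterns against the header text of project 180003, copied from the sample extraction shown later in this notebook. The names `HEADER_PATTERNS` and `extract_project_header` are hypothetical and not part of the original pipeline, and the patterns assume the header labels appear exactly as they do in that sample.

```python
import re

# Header text copied from the sample extraction of 180003bidtab.pdf.
# Note how the extracted text fuses neighbouring fields
# ("...863.70Contract Awarded To:" and "INCLetting Date:").
sample_header = (
    "Project No. 180003\n"
    "PID 94079\n"
    "ATB-SR 45-01.87\n"
    "Federal\n"
    "Type: TWO LANE RESURFACING\n"
    "Completion Date: 8/15/2018\n"
    "Award Amount:                $2,087,863.70Contract Awarded To:      "
    "RONYAK PAVING INCLetting Date: 1/25/2018\n"
    "Engineer's Estimate:        $2,320,000.00\n"
)

# Hypothetical field patterns; each captures one value in group 1.
HEADER_PATTERNS = {
    "project_number": r"Project No\.\s*(\d+)",
    "pid": r"PID\s*(\d+)",
    "award_amount": r"Award Amount:\s*\$([\d,]+\.\d{2})",
    "awarded_to": r"Contract Awarded To:\s*(.+?)\s*Letting Date:",
    "letting_date": r"Letting Date:\s*(\d{1,2}/\d{1,2}/\d{4})",
    "engineers_estimate": r"Engineer's Estimate:\s*\$([\d,]+\.\d{2})",
}


def extract_project_header(text):
    """Return a dict of header fields; fields that are not found map to None."""
    header = {}
    for field, pattern in HEADER_PATTERNS.items():
        match = re.search(pattern, text)
        header[field] = match.group(1).strip() if match else None
    # Convert dollar amounts to floats so they can be compared with bids
    for field in ("award_amount", "engineers_estimate"):
        if header[field] is not None:
            header[field] = float(header[field].replace(",", ""))
    return header


header = extract_project_header(sample_header)
print(header)
```

Anchoring the amount patterns on exactly two decimal places is what keeps the award amount from running into the fused "Contract Awarded To:" label.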

In [8]:
# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    """
    Extract text content from a PDF file.
    
    Args:
        pdf_path (str): Path to the PDF file
        
    Returns:
        str: Extracted text content
        int: Number of pages in the PDF
    """
    if not os.path.exists(pdf_path):
        print(f"Error: File not found: {pdf_path}")
        return "", 0
    
    # Open and read the PDF file
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        
        # Extract text from all pages
        all_text = ""
        for page in pdf_reader.pages:
            all_text += page.extract_text()
    
    return all_text, num_pages

# Example: Process a sample PDF file
sample_pdf_path = r"C:\Users\clint\Desktop\RA Task\Downloads\180003bidtab.pdf"
all_text, num_pages = extract_text_from_pdf(sample_pdf_path)

print(f"Successfully extracted text from {os.path.basename(sample_pdf_path)}")
print(f"Number of pages: {num_pages}")
print("\nSample content (first 1000 characters):\n")
print(all_text[:1000])

Successfully extracted text from 180003bidtab.pdf
Number of pages: 17

Sample content (first 1000 characters):

Project No. 180003
PID 94079
ATB-SR 45-01.87
Federal
Type: TWO LANE RESURFACING
Completion Date: 8/15/2018
Award Amount:                $2,087,863.70Contract Awarded To:      RONYAK PAVING INCLetting Date: 1/25/2018
Engineer's Estimate:        $2,320,000.00
Ohio Department of Transportation
Official Bid Tabulation
Jerry Wray, Director
RONYAK PAVING INC
14376 N CHESHIRE ST
BURTON, OH 44021GeaugaBidder 1
Bid $2,087,863.70SHELLY & SANDS INC
1515 HARMON AVE
COLUMBUS, OH 43223FranklinBidder 2
Bid $2,193,929.12
KOSKI CONSTRUCTION CO
P O BOX 1038
ASHTABULA, OH 44005-1038Bidder 3
Bid $2,268,022.35KOKOSING CONSTRUCTION COMPANY INC
886 MC KINLEY AVE
COLUMBUS, OH 43222FranklinBidder 4
Bid $2,298,511.32
CHAGRIN VALLEY PAVING INC
17290 MUNN RD
CHAGRIN FALLS, OH 44023GeaugaBidder 5
Bid $2,400,000.00BURTON SCOT CONTRACTORS LLC
11330 KINSMAN RD
NEWBURY, OH 44065GeaugaBidder 6
Bid $2,451,806.6

In [9]:
import re


# Let's create another pattern specifically designed for this PDF format
def extract_bidders_alternative(text):
    """
    Alternative approach to extract bidder information using a simpler pattern
    focused on the specific format in the example.
    
    Args:
        text (str): Text content extracted from PDF
        
    Returns:
        list: List of dictionaries containing bidder details
    """
    # Pattern to extract bidder blocks including name, address, city/state/zip, county, bidder number, and bid amount
    pattern = r'([A-Z][A-Z\s&\.,]+?(?:INC|LLC|CO|COMPANY|CORP|CORPORATION|LTD))\s*\n([^\n]+)\n([^\n]+?),\s*([A-Z]{2})\s*(\d{5}(?:-\d{4})?)(?:-?(\w+))?\s*Bidder\s+(\d+)\s*\nBid\s+\$([\.\d,]+)'
    
    bidders = []
    
    for match in re.finditer(pattern, text):
        bidder = {
            'name': match.group(1).strip(),
            'address': match.group(2).strip(),
            'city': match.group(3).strip(),
            'state': match.group(4),
            'zip': match.group(5),
            'county': match.group(6) if match.group(6) else '',
            'bidder_number': match.group(7),
            'bid_amount': match.group(8).replace(',', '')
        }
        bidders.append(bidder)
    
    return bidders

# Try the alternative approach
alt_bidders = extract_bidders_alternative(all_text)

# Display results from alternative approach
print("\n\nALTERNATIVE APPROACH RESULTS:")
print(f"Found {len(alt_bidders)} bidders with details:")
for i, bidder in enumerate(alt_bidders, 1):
    print(f"\n{i}. {bidder['name']}")
    print(f"   Address: {bidder['address']}")
    print(f"   Location: {bidder['city']}, {bidder['state']} {bidder['zip']}")
    print(f"   County: {bidder['county']}")
    print(f"   Bidder #: {bidder['bidder_number']}")
    print(f"   Bid Amount: ${bidder['bid_amount']}")



ALTERNATIVE APPROACH RESULTS:
Found 7 bidders with details:

1. RONYAK PAVING INC
   Address: 14376 N CHESHIRE ST
   Location: BURTON, OH 44021
   County: Geauga
   Bidder #: 1
   Bid Amount: $2087863.70

2. SHELLY & SANDS INC
   Address: 1515 HARMON AVE
   Location: COLUMBUS, OH 43223
   County: Franklin
   Bidder #: 2
   Bid Amount: $2193929.12

3. KOSKI CONSTRUCTION CO
   Address: P O BOX 1038
   Location: ASHTABULA, OH 44005-1038
   County: 
   Bidder #: 3
   Bid Amount: $2268022.35

4. KOKOSING CONSTRUCTION COMPANY INC
   Address: 886 MC KINLEY AVE
   Location: COLUMBUS, OH 43222
   County: Franklin
   Bidder #: 4
   Bid Amount: $2298511.32

5. CHAGRIN VALLEY PAVING INC
   Address: 17290 MUNN RD
   Location: CHAGRIN FALLS, OH 44023
   County: Geauga
   Bidder #: 5
   Bid Amount: $2400000.00

6. BURTON SCOT CONTRACTORS LLC
   Address: 11330 KINSMAN RD
   Location: NEWBURY, OH 44065
   County: Geauga
   Bidder #: 6
   Bid Amount: $2451806.63

7. AMERICON INDUSTRIAL SERVICES LLC
  

## Bidder Information Extraction

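Each bidder block in the extracted text follows a consistent pattern:

1. Company name in all caps (e.g., "SHELLY & SANDS INC")
2. Address (typically 1-2 lines; the pattern in `extract_bidders_alternative` captures a single address line)
3. City, state, and ZIP code
4. County name, sometimes missing, fused directly onto the ZIP code (e.g., "44021Geauga")
5. "Bidder N" label (e.g., "Bidder 1")
6. Bid amount on the following line (e.g., "Bid $2,087,863.70")

The regex in `extract_bidders_alternative` targets this structure.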
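
To make the pieces of that pattern easier to follow, here is a sketch of the same regex rewritten in verbose mode with named groups and applied to the first three bidder blocks from the sample text. The name `BIDDER_PATTERN` and the group names are illustrative choices, not something the notebook itself defines; the matching logic is intended to mirror the original pattern.

```python
import re

# Verbose rewrite of the bidder pattern; whitespace and comments inside
# the pattern are ignored by re.VERBOSE.
BIDDER_PATTERN = re.compile(
    r"""
    (?P<name>[A-Z][A-Z\s&.,]+?                        # company name in capitals
        (?:INC|LLC|CO|COMPANY|CORP|CORPORATION|LTD))  # ending in a known suffix
    \s*\n
    (?P<address>[^\n]+)\n                             # a single address line
    (?P<city>[^\n]+?),\s*(?P<state>[A-Z]{2})\s*       # city, then two-letter state
    (?P<zip>\d{5}(?:-\d{4})?)                         # ZIP or ZIP+4
    (?:-?(?P<county>\w+))?                            # optional county fused to the ZIP
    \s*Bidder\s+(?P<bidder_number>\d+)\s*\n
    Bid\s+\$(?P<bid_amount>[.\d,]+)
    """,
    re.VERBOSE,
)

# First three bidder blocks, copied from the extracted sample text.
# The second company name is fused onto the first bid amount.
sample = (
    "RONYAK PAVING INC\n"
    "14376 N CHESHIRE ST\n"
    "BURTON, OH 44021GeaugaBidder 1\n"
    "Bid $2,087,863.70SHELLY & SANDS INC\n"
    "1515 HARMON AVE\n"
    "COLUMBUS, OH 43223FranklinBidder 2\n"
    "Bid $2,193,929.12\n"
    "KOSKI CONSTRUCTION CO\n"
    "P O BOX 1038\n"
    "ASHTABULA, OH 44005-1038Bidder 3\n"
    "Bid $2,268,022.35"
)

# groupdict(default="") turns a missing county into an empty string
bidders = [m.groupdict(default="") for m in BIDDER_PATTERN.finditer(sample)]
for bidder in bidders:
    print(bidder)
```

Because the county group is optional, a block such as the third one, where "Bidder 3" follows the ZIP+4 directly, still matches and simply yields an empty county.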

In [10]:
def create_bidder_dataframe(pdf_path):
    """
    Process a PDF file to extract bidder information and create a DataFrame.
    
    Args:
        pdf_path (str): Path to the PDF file
        
    Returns:
        pd.DataFrame: DataFrame containing bidder information with project number
    """
    # Extract the project number from the filename
    filename = os.path.basename(pdf_path)
    project_number = re.search(r'(\d+)bidtab', filename)
    if project_number:
        project_number = project_number.group(1)
    else:
        project_number = 'Unknown'
    
    # Extract text from the PDF
    text, _ = extract_text_from_pdf(pdf_path)
    
    # Extract bidder information
    bidders = extract_bidders_alternative(text)
    
    if not bidders:
        print(f"No bidders found in {filename}")
        return pd.DataFrame()
    
    # Create DataFrame
    df = pd.DataFrame(bidders)
    
    # Add project number and filename
    df['project_number'] = project_number
    df['filename'] = filename
    
    # Flag the awarded bidder, assuming the contract went to the lowest bid
    # (bidders tied at the minimum would all be flagged)
    df['bid_amount'] = pd.to_numeric(df['bid_amount'])
    min_bid = df['bid_amount'].min()
    df['awarded'] = df['bid_amount'] == min_bid
    
    # Rename 'name' to match the required output format
    df = df.rename(columns={'name': 'bidder_name'})
    
    # Select the output columns, with project_number first
    columns = ['project_number', 'bidder_name', 'address', 'city', 'state', 'zip', 
               'county', 'bidder_number', 'bid_amount', 'filename', 'awarded']
    
    return df[columns]

# Example usage with a single file
sample_pdf_path = r"C:\Users\clint\Desktop\RA Task\Downloads\180003bidtab.pdf"
bidders_df = create_bidder_dataframe(sample_pdf_path)

# Display the result
print(f"Created DataFrame with {len(bidders_df)} bidders for project {bidders_df['project_number'].iloc[0] if not bidders_df.empty else 'Unknown'}")
bidders_df

Created DataFrame with 7 bidders for project 180003


Unnamed: 0,project_number,bidder_name,address,city,state,zip,county,bidder_number,bid_amount,filename,awarded
0,180003,RONYAK PAVING INC,14376 N CHESHIRE ST,BURTON,OH,44021,Geauga,1,2087863.7,180003bidtab.pdf,True
1,180003,SHELLY & SANDS INC,1515 HARMON AVE,COLUMBUS,OH,43223,Franklin,2,2193929.12,180003bidtab.pdf,False
2,180003,KOSKI CONSTRUCTION CO,P O BOX 1038,ASHTABULA,OH,44005-1038,,3,2268022.35,180003bidtab.pdf,False
3,180003,KOKOSING CONSTRUCTION COMPANY INC,886 MC KINLEY AVE,COLUMBUS,OH,43222,Franklin,4,2298511.32,180003bidtab.pdf,False
4,180003,CHAGRIN VALLEY PAVING INC,17290 MUNN RD,CHAGRIN FALLS,OH,44023,Geauga,5,2400000.0,180003bidtab.pdf,False
5,180003,BURTON SCOT CONTRACTORS LLC,11330 KINSMAN RD,NEWBURY,OH,44065,Geauga,6,2451806.63,180003bidtab.pdf,False
6,180003,AMERICON INDUSTRIAL SERVICES LLC,3651 LEHARPS RD,AUSTINTOWN,OH,44515,Mahoning,7,2456486.87,180003bidtab.pdf,False
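
The `awarded` flag above rests on the assumption that the lowest bid wins. Since the PDF header also carries a "Contract Awarded To:" field, one possible refinement is to match on that name when it is available and fall back to the lowest bid otherwise. The sketch below illustrates the idea; `mark_awarded` and its `awarded_to` parameter are hypothetical, and the exact-match comparison assumes the header spells the company name the same way as the bidder block.

```python
import pandas as pd


def mark_awarded(df, awarded_to=None):
    """Flag the awarded bidder by name when known, otherwise by lowest bid."""
    df = df.copy()
    if awarded_to:
        # Compare names case-insensitively, ignoring surrounding whitespace
        normalized = df["bidder_name"].str.upper().str.strip()
        by_name = normalized == awarded_to.upper().strip()
        if by_name.any():
            df["awarded"] = by_name
            return df
    # Fallback: lowest bid; idxmin() flags a single row even when bids tie
    df["awarded"] = df.index == df["bid_amount"].idxmin()
    return df


# Two rows taken from the project 180003 output above
bids = pd.DataFrame({
    "bidder_name": ["RONYAK PAVING INC", "SHELLY & SANDS INC"],
    "bid_amount": [2087863.70, 2193929.12],
})
print(mark_awarded(bids, awarded_to="RONYAK PAVING INC"))
```

The fallback relies on the DataFrame having a unique index, which holds for the freshly built per-file DataFrames in this notebook.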


In [11]:
# Function to process multiple PDF files
def process_multiple_pdfs(pdf_folder, pattern="*bidtab.pdf"):
    """
    Process multiple PDF files in a folder and combine the results.
    
    Args:
        pdf_folder (str): Path to the folder containing PDF files
        pattern (str): Pattern to match PDF filenames
        
    Returns:
        pd.DataFrame: Combined DataFrame with bidder information from all PDFs
    """
    # Find all PDF files matching the pattern
    pdf_files = glob.glob(os.path.join(pdf_folder, pattern))
    
    if not pdf_files:
        print(f"No PDF files found in {pdf_folder} matching pattern {pattern}")
        return pd.DataFrame()
    
    print(f"Found {len(pdf_files)} PDF files to process")
    
    # Process each PDF file
    dfs = []
    for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
        df = create_bidder_dataframe(pdf_path)
        if not df.empty:
            dfs.append(df)
    
    # Combine results
    if dfs:
        combined_df = pd.concat(dfs, ignore_index=True)
        return combined_df
    else:
        print("No bidder information found in any of the PDF files")
        return pd.DataFrame()

# Example: Process all bidtab PDFs in the Downloads folder
pdf_folder = r"C:\Users\clint\Desktop\RA Task\Downloads"
combined_bidders_df = process_multiple_pdfs(pdf_folder)

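# Display the result (an empty result has no project_number column, so guard the count)
n_projects = combined_bidders_df['project_number'].nunique() if not combined_bidders_df.empty else 0
print(f"\nCreated combined DataFrame with {len(combined_bidders_df)} bidders from {n_projects} projects")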
# Show the first few rows
combined_bidders_df.head()

Found 202 PDF files to process



Created combined DataFrame with 597 bidders from 202 projects


Unnamed: 0,project_number,bidder_name,address,city,state,zip,county,bidder_number,bid_amount,filename,awarded
0,180003,RONYAK PAVING INC,14376 N CHESHIRE ST,BURTON,OH,44021,Geauga,1,2087863.7,180003bidtab.pdf,True
1,180003,SHELLY & SANDS INC,1515 HARMON AVE,COLUMBUS,OH,43223,Franklin,2,2193929.12,180003bidtab.pdf,False
2,180003,KOSKI CONSTRUCTION CO,P O BOX 1038,ASHTABULA,OH,44005-1038,,3,2268022.35,180003bidtab.pdf,False
3,180003,KOKOSING CONSTRUCTION COMPANY INC,886 MC KINLEY AVE,COLUMBUS,OH,43222,Franklin,4,2298511.32,180003bidtab.pdf,False
4,180003,CHAGRIN VALLEY PAVING INC,17290 MUNN RD,CHAGRIN FALLS,OH,44023,Geauga,5,2400000.0,180003bidtab.pdf,False


In [12]:
# Save the combined DataFrame to a CSV file
def save_to_csv(df, output_path):
    """
    Save DataFrame to CSV file.
    
    Args:
        df (pd.DataFrame): DataFrame to save
        output_path (str): Path to save the CSV file
    """
    if df.empty:
        print("DataFrame is empty, nothing to save")
        return
    
    df.to_csv(output_path, index=False)
    print(f"Saved {len(df)} records to {output_path}")

# Example: Save the combined DataFrame to a CSV file
output_path = r"C:\Users\clint\Desktop\RA Task\extracted_bidders_all.csv"
save_to_csv(combined_bidders_df, output_path)

Saved 597 records to C:\Users\clint\Desktop\RA Task\extracted_bidders_all.csv
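
One way to sanity-check the extraction is to compare the number of bidders found per project with the `num_bidders` column of the project table loaded at the top of the notebook. The sketch below shows the idea on tiny stand-in DataFrames: the two `projects` rows reuse values from that table, while the `bidders` rows are hypothetical placeholders. It assumes `Project Num` corresponds to the project number parsed from the PDF filename, which the notebook does not state explicitly.

```python
import pandas as pd

# Stand-in for the project table (values taken from the first output above)
projects = pd.DataFrame({
    "Project Num": [180326, 180569],
    "num_bidders": [2.0, 3.0],
})

# Stand-in for the extracted bidders; these rows are hypothetical
bidders = pd.DataFrame({
    "project_number": ["180326", "180326", "180569", "180569", "Unknown"],
    "bidder_name": ["BIDDER A", "BIDDER B", "BIDDER C", "BIDDER D", "BIDDER E"],
})

# project_number comes from the filename as a string; keep numeric ones only
numeric = bidders[bidders["project_number"].str.isdigit()]
extracted_counts = (
    numeric.assign(project_number=numeric["project_number"].astype(int))
    .groupby("project_number")
    .size()
    .rename("extracted_bidders")
    .reset_index()
)

# Left merge keeps every project, including those with no extracted bidders
check = projects.merge(
    extracted_counts, left_on="Project Num", right_on="project_number", how="left"
)
check["extracted_bidders"] = check["extracted_bidders"].fillna(0).astype(int)
check["matches"] = check["extracted_bidders"] == check["num_bidders"]
print(check[["Project Num", "num_bidders", "extracted_bidders", "matches"]])
```

Projects where `matches` is False are candidates for manual review, for example company names that do not end in one of the suffixes the regex expects.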
