# Data Verification Tool - Simplified 3-Category Selector

## 🎯 How to Use This Tool:

1. **Run Cells 2-4** to load data and utilities
2. **Run Cell 5** to see the simplified selector with 3 main categories + special options
3. **Choose your review type:**
   - **🎯 Special Review Options** (at the top):
     - 📋 **Review Already Verified** - Re-examine previously reviewed records
     - 🎯 **Review Specific Row** - Enter an exact row index to review
     - 🎲 **Random Sample** - Get 20 random rows from any flag reason
   - **📊 Main Categories** (3 simplified buttons):
     - ✅ **Successful / Clear Matches** - All reliable matches
     - ⚠️ **Multiple Matches** - Ambiguous cases needing judgment
     - ❌ **Issues / Problems** - No match, partial, unclear cases combined
4. **Run Cell 8** to launch the verification window interface
5. **Use the enhanced window** to verify OCR vs Scraped data matches
6. **Call `save_and_export_results()`** when done to save your work

## 🆕 Simplified Categories:

### ✅ **Successful / Clear Matches** (~3,053 total)
**All reliable, successful matches in one category:**
- Available in RVer and Trucker (3,034 rows) - *Most reliable matches*
- Single Name/ZIP/State/Exit/Highway Match Found (11 rows)
- Single Name/ZIP/State/Road Match Found (8 rows)

### ⚠️ **Multiple Matches (Ambiguous / Needs Judgment)** (~993 total)
**All cases with multiple potential matches:**
- Multiple matches found (15 rows)
- Multiple Name/ZIP/State/Road Matches (780 rows across all variants)
- Multiple Name/ZIP/State/Exit/Highway Matches (148 rows across all variants)

### ❌ **Issues / Problems (No Match, Partial, Unclear)** (~538 total)
**All problematic cases combined into one category:**
- **No/Failed Matches**: no matching ZIPCODE, No matches found, etc. (437 rows)
- **Partial Matches**: matching ZIPCODE but missing Exit/Highway info (81 rows)
- **Unclear Cases**: unclear OCR_address_standardized issues (20 rows)
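
The grouping above can be sketched as a small lookup that maps each `Flag_Reason` string to one of the three categories and then counts rows per category. This is a minimal sketch, not the notebook's implementation: the selector code lists the exact flag-reason strings, while the prefix rules and the `categorize` helper here are assumptions made for brevity, and the demo frame is made up.

```python
import pandas as pd

# Hypothetical prefix rules (assumption); the notebook lists exact strings instead.
CATEGORY_PREFIXES = {
    "Successful / Clear Matches": ("Single Name/ZIP/State", "Available in RVer and Trucker"),
    "Multiple Matches": ("Multiple ",),
    "Issues / Problems": ("No ", "no matching", "matching ZIPCODE", "unclear "),
}

def categorize(flag_reason: str) -> str:
    """Return the first category whose prefix matches the flag reason."""
    for category, prefixes in CATEGORY_PREFIXES.items():
        if flag_reason.startswith(prefixes):  # str.startswith accepts a tuple
            return category
    return "Other / Uncategorized"

# Small made-up frame standing in for the real data
demo = pd.DataFrame({"Flag_Reason": [
    "Available in RVer and Trucker",
    "Multiple matches found",
    "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 2",
    "no matching ZIPCODE",
    "unclear OCR_address_standardized_OFF_parenthesis",
]})

# Rows per category, analogous to the totals quoted in the headings above
category_counts = demo["Flag_Reason"].map(categorize).value_counts()
print(category_counts)
```

Prefix matching keeps the sketch short, but exact string lists (as in the selector code) are safer if new flag reasons with similar prefixes are ever added.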

## 🆕 Enhanced Features:

### 🎯 **Special Review Options**
- **📋 Already Verified Records**: Review and modify previously verified entries
- **🎯 Specific Row Review**: Enter any row index to review that exact record
- **🎲 Random Sampling**: Get a diverse sample of 20 rows from the entire dataset
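
The three special options amount to three different row selections on the DataFrame. The sketch below illustrates them under stated assumptions: the function names are hypothetical, the "specific row" lookup is taken to be position-based (matching the `0 to len(df)-1` range the input dialog shows), and the demo frame is made up.

```python
import pandas as pd

def already_verified(df: pd.DataFrame) -> pd.DataFrame:
    """Rows whose 'Manually Verified?' value is neither missing nor empty."""
    col = df["Manually Verified?"]
    return df[col.notna() & (col != "")]

def specific_row(df: pd.DataFrame, position: int) -> pd.DataFrame:
    """One row selected by position, independent of the index labels."""
    return df.iloc[[position]]

def random_sample(df: pd.DataFrame, n: int = 20, seed=None) -> pd.DataFrame:
    """Up to n random rows drawn from the whole frame."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Made-up frame with a non-zero-based index, as happens after filtering
demo = pd.DataFrame(
    {"OCR_label": list("abcde"),
     "Manually Verified?": ["Yes", None, "", "No", None]},
    index=[10, 11, 12, 13, 14],
)

print(already_verified(demo))   # rows labelled 10 and 13
print(specific_row(demo, 2))    # the third row, labelled 12
print(len(random_sample(demo)))  # capped at the 5 rows available
```

Using `iloc` rather than `loc` matters once the frame has been filtered, because the remaining index labels no longer start at 0.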

### 🔍 **Enhanced Google Search**
- **Comprehensive search terms**: Uses information from BOTH OCR and Scraped data
- **Smart deduplication**: Avoids duplicate terms in search query
- **Business name handling**: Automatically quotes business names with spaces
- **Highway/Exit information**: Includes road and exit data when available
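
One way the search behaviour described above could work is sketched here: collect terms from both the OCR and the scraped fields, skip case-insensitive duplicates, quote business names that contain spaces, and URL-encode the result. The helper name `build_search_url` and the field order are assumptions, not the tool's actual code, and the scraped values in the example are illustrative.

```python
from urllib.parse import quote_plus

def build_search_url(ocr: dict, scraped: dict) -> str:
    """Combine OCR and scraped fields into one de-duplicated Google query URL."""
    terms = []
    seen = set()

    def add(value, quote=False):
        if value is None:
            return
        text = str(value).strip()
        if not text or text.lower() == "nan":
            return
        key = text.lower()
        if key in seen:  # skip terms already in the query
            return
        seen.add(key)
        # Quote multi-word business names so they are searched as a phrase
        terms.append(f'"{text}"' if quote and " " in text else text)

    add(ocr.get("OCR_label"), quote=True)
    add(scraped.get("Scraped_name"), quote=True)
    for field in ("OCR_city", "OCR_state", "OCR_zip_code"):
        add(ocr.get(field))
    for field in ("Scraped_City", "Scraped_Highway", "Scraped_Exit", "Scraped_Road Name"):
        add(scraped.get(field))

    return "https://www.google.com/search?q=" + quote_plus(" ".join(terms))

# Illustrative values only
ocr = {"OCR_label": "Ramos Oil", "OCR_city": "Dixon",
       "OCR_state": "CA", "OCR_zip_code": "95620"}
scraped = {"Scraped_name": "Ramos Oil", "Scraped_City": "Dixon",
           "Scraped_Highway": "I-80", "Scraped_Exit": "66"}

url = build_search_url(ocr, scraped)
print(url)
```

The resulting URL could then be opened with the standard-library `webbrowser.open(url)`.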

### 📊 **Simplified Workflow**
- **3 Main Categories**: No need to choose from dozens of specific flag reasons
- **Larger Samples**: Up to 30 records per category (vs 10 for specific flag reasons)
- **Mixed Flag Reasons**: Each category contains multiple related flag reason types
- **Flag Reason Breakdown**: Shows distribution of flag reasons in your sample
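
A category-level sample with a flag-reason breakdown could look roughly like the following. This is a sketch under assumptions: `sample_category` is a hypothetical helper, the 30-row cap is taken from the description above, and the demo data is made up.

```python
import pandas as pd

def sample_category(df, flag_reasons, max_rows=30, seed=None):
    """Sample up to max_rows rows across all flag reasons of one category."""
    in_category = df[df["Flag_Reason"].isin(flag_reasons)]
    sample = in_category.sample(n=min(max_rows, len(in_category)), random_state=seed)
    breakdown = sample["Flag_Reason"].value_counts()  # distribution within the sample
    return sample, breakdown

# Made-up data: two "issue" flag reasons plus one unrelated reason
demo = pd.DataFrame({"Flag_Reason":
    ["No matches found"] * 40
    + ["no matching ZIPCODE"] * 5
    + ["Available in RVer and Trucker"] * 10
})
issue_reasons = ["No matches found", "no matching ZIPCODE"]

sample, breakdown = sample_category(demo, issue_reasons, seed=0)
print(len(sample))
print(breakdown)
```

Because the sample is drawn from the pooled category, rare flag reasons may not appear in every sample; a stratified draw would be needed to guarantee coverage.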

## ✨ Key Benefits:
- **Simplified Decision Making**: Only 3 main buttons instead of dozens
- **Comprehensive Coverage**: Each button covers all related issue types
- **Larger Sample Sizes**: More records to verify per session (up to 30)
- **Mixed Issue Types**: Experience different problems within the same category
- **Enhanced Google search** with comprehensive data utilization
- **Visual match indicators** (✅ exact, 🟡 partial, ❌ no match)
- **Quick verification tools** (website access, enhanced Google search)
- **Automatic result saving** and export functionality
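
The visual match indicators can be illustrated with a simple comparison rule. How the tool actually decides between exact, partial, and no match is not shown in this section, so the rule below (case- and whitespace-insensitive equality for ✅, substring containment for 🟡) is purely an assumption, as is the `match_indicator` name.

```python
def _normalize(value) -> str:
    """Lower-case and collapse whitespace; treat None as empty."""
    if value is None:
        return ""
    return " ".join(str(value).lower().split())

def match_indicator(ocr_value, scraped_value) -> str:
    """Return ✅ for an exact match, 🟡 for a partial match, ❌ otherwise."""
    a, b = _normalize(ocr_value), _normalize(scraped_value)
    if not a or not b:
        return "❌"
    if a == b:
        return "✅"
    if a in b or b in a:  # one value contains the other
        return "🟡"
    return "❌"

print(match_indicator("Dixon", " dixon "))
print(match_indicator("Ramos Oil", "Ramos Oil ( Shell )"))
print(match_indicator("Dixon", "Yermo"))
```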

In [1]:
import pandas as pd

df = pd.read_csv(r'C:\Users\clint\Desktop\Geocoding_Task\Matching_WebScrape\7.csv')
# Optional filter (disabled): keep only rows with one specific flag reason
# df = df[df['Flag_Reason'] == "No matches found"]
# Optional filter (disabled): keep only rows whose address type is 'Exit'
# df = df[df['Address_Type'] == 'Exit']
# Keep only rows whose OCR state is CA, UT, NV, or AZ
df = df[df['OCR_state'].isin(['CA', 'UT', 'NV', 'AZ'])]
df



Unnamed: 0,OCR_Unnamed: 0,OCR_filename,OCR_record_num,OCR_clean_line1,OCR_clean_line2,OCR_line3,OCR_city,OCR_zip_code,OCR_label,OCR_phone,...,OCR_address_standardized_ON_parenthesis,OCR_address_standardized_OFF_parenthesis,OCR_Address_Type,OCR_Exit_Number,Exit_From_Address,Exit_From_Label,Is_Unclear_OCR,OCR_Main_Road,OCR_Secondary_Road,Manually Verified?
43950,23197,RVersFriend2006-115-ocr.csv,18,"Coalville , 84017 Holiday Hills ( 66 )",435-336-4421 1-80 Exit 162 ( UT 280 ),MO,Coalville,84017,Holiday Hills ( 66 ),435-336-4421,...,( UT 280 ),I-80 Exit 162,Exit,162,162,,False,I-80,,
43951,23197,RVersFriend2006-115-ocr.csv,18,"Coalville , 84017 Holiday Hills ( 66 )",435-336-4421 1-80 Exit 162 ( UT 280 ),MO,Coalville,84017,Holiday Hills ( 66 ),435-336-4421,...,( UT 280 ),I-80 Exit 162,Exit,162,162,,False,I-80,,
43952,23198,RVersFriend2007-104-ocr.csv,7,"Coalville , 84017 Hills ( 66 )",435-336-4421 1-80 Holiday Exit 162 ( UT 280 ),M □,Coalville,84017,Hills ( 66 ),435-336-4421,...,( UT 280 ),I-80 Holiday Exit 162,Exit,162,162,,False,I-80,,
43953,23198,RVersFriend2007-104-ocr.csv,7,"Coalville , 84017 Hills ( 66 )",435-336-4421 1-80 Holiday Exit 162 ( UT 280 ),M □,Coalville,84017,Hills ( 66 ),435-336-4421,...,( UT 280 ),I-80 Holiday Exit 162,Exit,162,162,,False,I-80,,
43954,23199,TF2008_258_271-0-ocr.csv,7,"D Coalville , 84017 Holiday Hills ( 66 ) )",4 435-336-4421 1-80 Exit 162 ( UT 280 ),M ☐ ☐ ☐,Coalville,84017,Holiday Hills ( 66 ) ),435-336-4421,...,( UT 280 ),I-80 Exit 162,Exit,162,162,,False,I-80,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67842,37638,TF2014_018_031-6-ocr.csv,25,H Yermo ( 92398 ) Yermo Truck Stop ( Shell ),6 760-254-2843 1-15 Exit 191 ( Ghost Town Rd N ),24 2 M ☐ D,Yermo,92398,Yermo Truck Stop ( Shell ),760-254-2843,...,( Ghost Town Rd N ),I-15 Exit 191,Exit,191,191,,False,I-15,,
67843,37700,TF2015_018_031-6-ocr.csv,13,F Tulare ( 93274 ) Roche Oil Mobil ( Paige Mar...,559-686-1230 CA 99 Exit 85 (,S □,Tulare,93274,Roche Oil Mobil ( Paige Mart Ave W ),559-686-1230,...,(,CA 99 Exit 85,Exit,85,85,,False,CA 99,,
67844,37868,RVersFriend2006-013-ocr.csv,19,"Dixon , 95620 Ramos Oil ( Shell )",707-678-2061 1-80 Exit 66 ( CA 113 S ),□,Dixon,95620,Ramos Oil ( Shell ),707-678-2061,...,( CA 113 S ),I-80 Exit 66,Exit,66,66,,False,I-80,,
67845,37869,RVersFriend2007-010-ocr.csv,54,"Dixon , 95620 Ramos Oil ( Shell )",707-678-2061 1-80 Exit 66 ( CA 113 S ),00,Dixon,95620,Ramos Oil ( Shell ),707-678-2061,...,( CA 113 S ),I-80 Exit 66,Exit,66,66,,False,I-80,,


In [2]:
# Add a Scraped_Year column set to 2025 for every row
df['Scraped_Year'] = 2025

In [3]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import tkinter as tk
from tkinter import ttk, messagebox
import threading

# Window-based Flag Reason Selector Interface
class WindowFlagReasonSelector:
    def __init__(self, df):
        self.df = df
        self.selected_flag_reason = None
        self.sample_df = None
        self.selection_complete = False
        
        # Create the window
        self.root = tk.Tk()
        self.root.title("🎯 Flag Reason Selector")
        self.root.geometry("900x700")
        self.root.configure(bg='#f0f0f0')
        self.root.resizable(True, True)
        
        self.setup_ui()
        
    def setup_ui(self):
        """Setup the window UI with organized categories"""
        # Get unique flag reasons with counts
        flag_counts = self.df['Flag_Reason'].value_counts()
        
        # Main container with scrolling
        main_frame = ttk.Frame(self.root)
        main_frame.pack(fill=tk.BOTH, expand=True, padx=15, pady=15)
        
        # Title
        title_label = tk.Label(main_frame, text="🎯 Data Verification Tool - Flag Reason Selector", 
                              font=('Segoe UI', 16, 'bold'), bg='#f0f0f0', fg='#2c3e50')
        title_label.pack(pady=(0, 10))
        
        subtitle_label = tk.Label(main_frame, text="Choose which type of records you want to verify:", 
                                 font=('Segoe UI', 11), bg='#f0f0f0', fg='#34495e')
        subtitle_label.pack(pady=(0, 10))
        
        # Special Review Options Section
        review_options_frame = tk.Frame(main_frame, bg='#e8f4fd', relief=tk.RAISED, bd=2)
        review_options_frame.pack(fill=tk.X, pady=(0, 15), padx=5)
        
        review_title = tk.Label(review_options_frame, text="🎯 Special Review Options", 
                               font=('Segoe UI', 12, 'bold'), bg='#e8f4fd', fg='#2c3e50')
        review_title.pack(pady=(8, 5))
        
        # Row for special options
        special_buttons_frame = tk.Frame(review_options_frame, bg='#e8f4fd')
        special_buttons_frame.pack(pady=(0, 8), padx=10)
        
        # Already reviewed button
        reviewed_count = len(self.df[(self.df['Manually Verified?'].notna()) & (self.df['Manually Verified?'] != '')])
        reviewed_btn = tk.Button(special_buttons_frame, 
                                text=f"📋 Review Already Verified ({reviewed_count:,} rows)",
                                bg='#17a2b8', fg='white', font=('Segoe UI', 10),
                                relief=tk.FLAT, bd=0, pady=8,
                                command=lambda: self.on_special_option_selected('already_reviewed'))
        reviewed_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True)
        
        # Specific row button
        specific_btn = tk.Button(special_buttons_frame, 
                                text="🎯 Review Specific Row (Enter Index)",
                                bg='#6c757d', fg='white', font=('Segoe UI', 10),
                                relief=tk.FLAT, bd=0, pady=8,
                                command=lambda: self.on_special_option_selected('specific_row'))
        specific_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True)
        
        # Random sample button
        random_btn = tk.Button(special_buttons_frame, 
                              text="🎲 Random Sample (Any Flag, 20 rows)",
                              bg='#28a745', fg='white', font=('Segoe UI', 10),
                              relief=tk.FLAT, bd=0, pady=8,
                              command=lambda: self.on_special_option_selected('random_sample'))
        random_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True)
        
        # Canvas and scrollbar for scrolling
        canvas = tk.Canvas(main_frame, bg='#f0f0f0')
        scrollbar = ttk.Scrollbar(main_frame, orient="vertical", command=canvas.yview)
        scrollable_frame = ttk.Frame(canvas)
        
        scrollable_frame.bind(
            "<Configure>",
            lambda e: canvas.configure(scrollregion=canvas.bbox("all"))
        )
        
        canvas.create_window((0, 0), window=scrollable_frame, anchor="nw")
        canvas.configure(yscrollcommand=scrollbar.set)
        
        canvas.pack(side="left", fill="both", expand=True)
        scrollbar.pack(side="right", fill="y")
        
        # Define the 3 main categories and their flag reasons
        categories = {
            "✅ Successful / Clear Matches": [
                "Single Name/ZIP/State/Road Match Found (empty address type)",
                "Single Name/ZIP/State/Exit/Highway Match Found", 
                "Available in RVer and Trucker"
            ],
            "⚠️ Multiple Matches (Ambiguous / Needs Judgment)": [
                "Multiple matches found",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 5 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 6 of 6",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 3 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 4 of 4"
            ],
            "❌ Issues / Problems (No Match, Partial, Unclear)": [
                "No Name/ZIP/State/Road matches found for empty address type",
                "No matches found",
                "No Phone/Name/ZIP/State/Exit/Highway matches found",
                "no matching ZIPCODE",
                "matching ZIPCODE, matching State, no matching Exit",
                "matching ZIPCODE, matching State, matching Exit, no matching Highway/Exit/Street Address",
                "unclear OCR_address_standardized_OFF_parenthesis"
            ]
        }
        
        # Main categories section
        main_categories_frame = tk.Frame(scrollable_frame, bg='#f0f0f0')
        main_categories_frame.pack(fill=tk.X, pady=(10, 20), padx=20)
        
        main_title = tk.Label(main_categories_frame, text="📊 Main Review Categories", 
                             font=('Segoe UI', 14, 'bold'), bg='#f0f0f0', fg='#2c3e50')
        main_title.pack(pady=(0, 15))
        
        # Create the 3 main category buttons
        for category_name, flag_reasons in categories.items():
            # Check if any flag reasons in this category exist in the data
            category_flags = [fr for fr in flag_reasons if fr in flag_counts.index]
            if not category_flags:
                continue
                
            # Calculate total rows in this category
            total_rows = sum(flag_counts[fr] for fr in category_flags)
            
            # Determine button color and styling based on category
            if "Successful" in category_name:
                bg_color, fg_color = '#27ae60', 'white'
                icon = '✅'
            elif "Multiple" in category_name:
                bg_color, fg_color = '#f39c12', 'white'  
                icon = '⚠️'
            elif "Issues" in category_name:
                bg_color, fg_color = '#e74c3c', 'white'
                icon = '❌'
            else:
                bg_color, fg_color = '#95a5a6', 'white'
                icon = '📋'
            
            # Create large category button
            button_text = f"{icon} {category_name.split(' ', 1)[1]}\n({total_rows:,} total rows)"
            category_button = tk.Button(main_categories_frame, 
                                      text=button_text,
                                      bg=bg_color, fg=fg_color, 
                                      font=('Segoe UI', 12, 'bold'),
                                      relief=tk.FLAT, bd=0, pady=20,
                                      command=lambda cat=category_name: self.on_category_selected(cat))
            category_button.pack(fill=tk.X, pady=8)
            
            # Hover effects
            def on_enter(e, btn=category_button, color=bg_color):
                btn.configure(bg=self.darken_color(color))
            def on_leave(e, btn=category_button, color=bg_color):
                btn.configure(bg=color)
                
            category_button.bind("<Enter>", on_enter)
            category_button.bind("<Leave>", on_leave)
        
        # Footer with total records
        footer_frame = tk.Frame(scrollable_frame, bg='#f0f0f0')
        footer_frame.pack(fill=tk.X, pady=(20, 10), padx=10)
        
        total_label = tk.Label(footer_frame, text=f"📊 Total records available: {len(self.df):,}", 
                              font=('Segoe UI', 11, 'bold'), bg='#f0f0f0', fg='#2c3e50')
        total_label.pack()
        
        instruction_label = tk.Label(footer_frame, text="💡 Click any button above to select that flag reason type for verification", 
                                   font=('Segoe UI', 10), bg='#f0f0f0', fg='#7f8c8d')
        instruction_label.pack(pady=(5, 0))
        
        # Bind mouse wheel for scrolling
        def _on_mousewheel(event):
            canvas.yview_scroll(int(-1*(event.delta/120)), "units")
        canvas.bind_all("<MouseWheel>", _on_mousewheel)
    
    def darken_color(self, color):
        """Darken a hex color for hover effect"""
        color_map = {
            '#27ae60': '#219a52',
            '#f39c12': '#e67e22', 
            '#e74c3c': '#c0392b',
            '#3498db': '#2980b9',
            '#95a5a6': '#7f8c8d'
        }
        return color_map.get(color, color)
    
    def on_flag_reason_selected(self, flag_reason):
        """Handle flag reason selection (for backward compatibility)"""
        self.selected_flag_reason = flag_reason
        
        # Close the selection window
        self.root.destroy()
        
        # Process the selection in the notebook
        self.process_selection()
    
    def on_category_selected(self, category_name):
        """Handle category selection: use the most common flag reason in the category"""
        print(f"\n🎯 Selected Category: {category_name}")
        
        # Define categories and their flag reasons
        categories = {
            "✅ Successful / Clear Matches": [
                "Single Name/ZIP/State/Road Match Found (empty address type)",
                "Single Name/ZIP/State/Exit/Highway Match Found", 
                "Available in RVer and Trucker"
            ],
            "⚠️ Multiple Matches (Ambiguous / Needs Judgment)": [
                "Multiple matches found",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 5 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 6 of 6",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 3 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 4 of 4"
            ],
            "❌ Issues / Problems (No Match, Partial, Unclear)": [
                "No Name/ZIP/State/Road matches found for empty address type",
                "No matches found",
                "No Phone/Name/ZIP/State/Exit/Highway matches found",
                "no matching ZIPCODE",
                "matching ZIPCODE, matching State, no matching Exit",
                "matching ZIPCODE, matching State, matching Exit, no matching Highway/Exit/Street Address",
                "unclear OCR_address_standardized_OFF_parenthesis"
            ]
        }
        
        # Get flag reasons for this category that exist in the data
        category_flag_reasons = categories.get(category_name, [])
        flag_counts = self.df['Flag_Reason'].value_counts()
        available_flag_reasons = [fr for fr in category_flag_reasons if fr in flag_counts.index and flag_counts[fr] > 0]
        
        if not available_flag_reasons:
            print(f"❌ No data found for category: {category_name}")
            return
        
        # Pick the flag reason with the most data (most common one in this category)
        best_flag_reason = max(available_flag_reasons, key=lambda fr: flag_counts[fr])
        
        print(f"🎲 Automatically selected flag reason: {best_flag_reason}")
        print(f"📊 This flag reason has {flag_counts[best_flag_reason]:,} rows")
        
        self.selected_flag_reason = best_flag_reason
        
        # Close the selection window
        self.root.destroy()
        
        # Process the selection
        self.process_selection()
    
    def on_special_option_selected(self, option_type):
        """Handle special review option selection"""
        if option_type == 'already_reviewed':
            self.selected_flag_reason = "📋 Already Reviewed Records"
            # Close the window and process
            self.root.destroy()
            self.process_special_selection(option_type)
            
        elif option_type == 'specific_row':
            # Prompt for specific row index
            self.prompt_specific_row()
            
        elif option_type == 'random_sample':
            self.selected_flag_reason = "🎲 Random Sample"
            # Close the window and process
            self.root.destroy()
            self.process_special_selection(option_type)
    
    def prompt_specific_row(self):
        """Prompt user for specific row index"""
        # Create a simple input dialog
        input_window = tk.Toplevel(self.root)
        input_window.title("Enter Row Index")
        input_window.geometry("400x200")
        input_window.configure(bg='#f0f0f0')
        input_window.transient(self.root)
        input_window.grab_set()
        
        # Center the window
        input_window.geometry("+%d+%d" % (self.root.winfo_rootx() + 50, self.root.winfo_rooty() + 50))
        
        tk.Label(input_window, text="Enter the row index to review:", 
                font=('Segoe UI', 12), bg='#f0f0f0').pack(pady=20)
        
        entry = tk.Entry(input_window, font=('Segoe UI', 11), width=20)
        entry.pack(pady=10)
        entry.focus()
        
        info_label = tk.Label(input_window, 
                             text=f"Valid range: 0 to {len(self.df)-1}\nDataFrame has {len(self.df):,} total rows", 
                             font=('Segoe UI', 9), bg='#f0f0f0', fg='#666')
        info_label.pack(pady=5)
        
        def submit():
            try:
                row_idx = int(entry.get())
                if 0 <= row_idx < len(self.df):
                    self.selected_flag_reason = f"🎯 Specific Row #{row_idx}"
                    input_window.destroy()
                    self.root.destroy()
                    self.process_specific_row(row_idx)
                else:
                    messagebox.showerror("Invalid Index", f"Please enter a number between 0 and {len(self.df)-1}")
            except ValueError:
                messagebox.showerror("Invalid Input", "Please enter a valid number")
        
        def cancel():
            input_window.destroy()
        
        button_frame = tk.Frame(input_window, bg='#f0f0f0')
        button_frame.pack(pady=20)
        
        tk.Button(button_frame, text="Submit", command=submit, 
                 bg='#007bff', fg='white', font=('Segoe UI', 10), width=10).pack(side=tk.LEFT, padx=5)
        tk.Button(button_frame, text="Cancel", command=cancel, 
                 bg='#6c757d', fg='white', font=('Segoe UI', 10), width=10).pack(side=tk.LEFT, padx=5)
        
        # Bind Enter key to submit
        entry.bind('<Return>', lambda e: submit())
    
    def process_selection(self):
        """Process the flag reason selection"""
        # Legacy five-way grouping, used here only to label the printed category
        # (finer-grained than the 3 category buttons in the selector window)
        categories = {
            "✅ Successful / Clear Matches": [
                "Single Name/ZIP/State/Road Match Found (empty address type)",
                "Single Name/ZIP/State/Exit/Highway Match Found", 
                "Available in RVer and Trucker"
            ],
            "⚠️ Multiple Matches (Ambiguous / Needs Judgment)": [
                "Multiple matches found",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 5 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 6 of 6",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 3 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 4 of 4"
            ],
            "❌ No / Failed Matches": [
                "No Name/ZIP/State/Road matches found for empty address type",
                "No matches found",
                "No Phone/Name/ZIP/State/Exit/Highway matches found",
                "no matching ZIPCODE"
            ],
            "⚡ Partial / Incomplete Matches": [
                "matching ZIPCODE, matching State, no matching Exit",
                "matching ZIPCODE, matching State, matching Exit, no matching Highway/Exit/Street Address"
            ],
            "📝 Unclear / Needs Review": [
                "unclear OCR_address_standardized_OFF_parenthesis"
            ]
        }
        
        # Find which category this flag reason belongs to
        selected_category = "🔍 Other / Uncategorized"
        for category, flag_reasons in categories.items():
            if self.selected_flag_reason in flag_reasons:
                selected_category = category
                break
        
        print(f"🎯 Selected Category: {selected_category}")
        print(f"📋 Selected Flag Reason: {self.selected_flag_reason}")
        print("=" * 80)
        
        # Filter for the selected flag reason
        filtered_df = self.df[self.df['Flag_Reason'] == self.selected_flag_reason].copy()
        print(f"📋 Found {len(filtered_df):,} rows with this flag reason")
        
        # Sample at most 3 random rows for verification
        sample_size = min(3, len(filtered_df))
        
        if len(filtered_df) >= 3:
            self.sample_df = filtered_df.sample(n=sample_size)
            print("🎲 Selected 3 random rows for verification")
        else:
            self.sample_df = filtered_df.copy()
            print(f"📝 Using all {len(self.sample_df)} available rows (fewer than 3 total)")
        
        # Add verification columns if they don't exist
        if 'Manually Verified?' not in self.sample_df.columns:
            self.sample_df['Manually Verified?'] = ''
        if 'Verify Reason' not in self.sample_df.columns:
            self.sample_df['Verify Reason'] = ''
        
        # Define the columns for comparison
        left_columns = [
            'OCR_city', 'OCR_zip_code', 'OCR_label', 'OCR_phone',
            'OCR_address_standardized', 'OCR_state', 'OCR_chain', 'OCR_year'
        ]
        
        right_columns = [
            'Scraped_state', 'Scraped_name', 'Scraped_full_url', 'Scraped_Chain',
            'Scraped_Highway', 'Scraped_Exit', 'Scraped_Street Address', 'Scraped_City',
            'Scraped_State', 'Scraped_Postal Code', 'Scraped_Phone', 'Scraped_Phone 2',
            'Scraped_Fax', 'Scraped_Phone 3', 'Scraped_Mile Marker', 'Scraped_Phone 4',
            'Scraped_Road Name', 'Scraped_Year'
        ]
        
        # Check which columns actually exist in the dataframe
        available_left = [col for col in left_columns if col in self.sample_df.columns]
        available_right = [col for col in right_columns if col in self.sample_df.columns]
        
        print(f"📊 Available OCR columns: {len(available_left)}")
        print(f"📊 Available Scraped columns: {len(available_right)}")
        
        # Store globally for other cells to access
        globals()['filtered_df'] = filtered_df
        globals()['sample_df'] = self.sample_df
        globals()['available_left'] = available_left
        globals()['available_right'] = available_right
        globals()['selected_flag_reason'] = self.selected_flag_reason
        
        print("\n✅ Data prepared! You can now run the verification window.")
        print("🚀 Use start_verification() or run the window interface cell to begin.")
        
        # Show sample of selected data
        print("\n📋 Sample of selected data:")
        display(self.sample_df[['OCR_label', 'OCR_city', 'OCR_state', 'Flag_Reason']].head())
        
        self.selection_complete = True
    
    def process_category_selection(self, category_name):
        """Process category selection (new simplified approach)"""
        print(f"🎯 Selected Category: {category_name}")
        print("=" * 80)
        
        # Define the 3 main categories and their flag reasons
        categories = {
            "✅ Successful / Clear Matches": [
                "Single Name/ZIP/State/Road Match Found (empty address type)",
                "Single Name/ZIP/State/Exit/Highway Match Found", 
                "Available in RVer and Trucker"
            ],
            "⚠️ Multiple Matches (Ambiguous / Needs Judgment)": [
                "Multiple matches found",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 2",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 4",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 1 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 2 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 3 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 4 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 5 of 6",
                "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 6 of 6",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 2",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 1 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 2 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 3 of 4",
                "Multiple Name/ZIP/State/Exit/Highway Matches Found - Match 4 of 4"
            ],
            "❌ Issues / Problems (No Match, Partial, Unclear)": [
                "No Name/ZIP/State/Road matches found for empty address type",
                "No matches found",
                "No Phone/Name/ZIP/State/Exit/Highway matches found",
                "no matching ZIPCODE",
                "matching ZIPCODE, matching State, no matching Exit",
                "matching ZIPCODE, matching State, matching Exit, no matching Highway/Exit/Street Address",
                "unclear OCR_address_standardized_OFF_parenthesis"
            ]
        }
        
        # Get flag reasons for this category
        flag_reasons = categories.get(category_name, [])
        
        # Filter for all flag reasons in this category
        filtered_df = self.df[self.df['Flag_Reason'].isin(flag_reasons)].copy()
        print(f"📋 Found {len(filtered_df):,} rows in this category")
        
        if len(filtered_df) == 0:
            print("❌ No rows found for this category.")
            return
        
        # Determine sample size: at most 30 rows per category
        sample_size = min(30, len(filtered_df))
        
        if sample_size >= 5:
            self.sample_df = filtered_df.sample(n=sample_size, random_state=None)
            print(f"🎲 Selected {sample_size} random rows for verification")
        else:
            self.sample_df = filtered_df.copy()
            print(f"📝 Using all {len(self.sample_df)} available rows")
        
        # Add verification columns if they don't exist
        if 'Manually Verified?' not in self.sample_df.columns:
            self.sample_df['Manually Verified?'] = ''
        if 'Verify Reason' not in self.sample_df.columns:
            self.sample_df['Verify Reason'] = ''
        
        # Define the columns for comparison
        left_columns = [
            'OCR_city', 'OCR_zip_code', 'OCR_label', 'OCR_phone',
            'OCR_address_standardized', 'OCR_state', 'OCR_chain', 'OCR_year'
        ]
        
        right_columns = [
            'Scraped_state', 'Scraped_name', 'Scraped_full_url', 'Scraped_Chain',
            'Scraped_Highway', 'Scraped_Exit', 'Scraped_Street Address', 'Scraped_City',
            'Scraped_State', 'Scraped_Postal Code', 'Scraped_Phone', 'Scraped_Phone 2',
            'Scraped_Fax', 'Scraped_Phone 3', 'Scraped_Mile Marker', 'Scraped_Phone 4',
            'Scraped_Road Name', 'Scraped_Year'
        ]
        
        # Check which columns actually exist in the dataframe
        available_left = [col for col in left_columns if col in self.sample_df.columns]
        available_right = [col for col in right_columns if col in self.sample_df.columns]
        
        print(f"📊 Available OCR columns: {len(available_left)}")
        print(f"📊 Available Scraped columns: {len(available_right)}")
        
        # Store globally for other cells to access
        globals()['filtered_df'] = filtered_df
        globals()['sample_df'] = self.sample_df
        globals()['available_left'] = available_left
        globals()['available_right'] = available_right
        globals()['selected_flag_reason'] = category_name
        
        print("\n✅ Data prepared! You can now run the verification window.")
        print("🚀 Use start_verification() or run the window interface cell to begin.")
        
        # Show sample of selected data with flag reason breakdown
        print(f"\n📋 Sample of selected data (Flag Reason distribution):")
        flag_reason_counts = self.sample_df['Flag_Reason'].value_counts()
        for flag_reason, count in flag_reason_counts.items():
            print(f"  • {flag_reason}: {count} rows")
        
        # Show sample rows
        preview_cols = [col for col in ['OCR_label', 'OCR_city', 'OCR_state', 'Flag_Reason']
                        if col in self.sample_df.columns]
        display(self.sample_df[preview_cols].head())
        
        self.selection_complete = True
    
    def process_special_selection(self, option_type):
        """Process special review option selections"""
        print(f"🎯 Selected Option: {self.selected_flag_reason}")
        print("=" * 80)
        
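        if option_type == 'already_reviewed':
            # Filter for already reviewed rows (none can exist if the column is missing)
            if 'Manually Verified?' not in self.df.columns:
                print("❌ No rows have been reviewed yet.")
                return
            filtered_df = self.df[(self.df['Manually Verified?'].notna()) & (self.df['Manually Verified?'] != '')].copy()
            print(f"📋 Found {len(filtered_df):,} already reviewed rows")
            
        elif option_type == 'random_sample':
            # Take a random sample from entire dataset
            sample_size = min(20, len(self.df))
            filtered_df = self.df.sample(n=sample_size, random_state=None)
            print(f"🎲 Selected {sample_size} random rows from entire dataset")
        
        else:
            # Unknown option: stop here instead of hitting a NameError on filtered_df below
            print(f"❌ Unknown review option: {option_type}")
            return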
        
        if len(filtered_df) == 0:
            print("❌ No rows found matching the selected criteria.")
            return
        
        # Use all found rows (up to reasonable limit)
        max_sample = min(50, len(filtered_df))  # Max 50 rows for special reviews
        if len(filtered_df) <= max_sample:
            self.sample_df = filtered_df.copy()
            print(f"📝 Using all {len(self.sample_df)} available rows")
        else:
            self.sample_df = filtered_df.sample(n=max_sample, random_state=None)
            print(f"🎲 Sampled {max_sample} rows from {len(filtered_df)} available")
        
        # Add verification columns if they don't exist
        if 'Manually Verified?' not in self.sample_df.columns:
            self.sample_df['Manually Verified?'] = ''
        if 'Verify Reason' not in self.sample_df.columns:
            self.sample_df['Verify Reason'] = ''
        
        # Define the columns for comparison
        left_columns = [
            'OCR_city', 'OCR_zip_code', 'OCR_label', 'OCR_phone',
            'OCR_address_standardized', 'OCR_state', 'OCR_chain', 'OCR_year'
        ]
        
        right_columns = [
            'Scraped_state', 'Scraped_name', 'Scraped_full_url', 'Scraped_Chain',
            'Scraped_Highway', 'Scraped_Exit', 'Scraped_Street Address', 'Scraped_City',
            'Scraped_State', 'Scraped_Postal Code', 'Scraped_Phone', 'Scraped_Phone 2',
            'Scraped_Fax', 'Scraped_Phone 3', 'Scraped_Mile Marker', 'Scraped_Phone 4',
            'Scraped_Road Name', 'Scraped_Year'
        ]
        
        # Check which columns actually exist in the dataframe
        available_left = [col for col in left_columns if col in self.sample_df.columns]
        available_right = [col for col in right_columns if col in self.sample_df.columns]
        
        print(f"📊 Available OCR columns: {len(available_left)}")
        print(f"📊 Available Scraped columns: {len(available_right)}")
        
        # Store globally for other cells to access
        globals()['filtered_df'] = filtered_df
        globals()['sample_df'] = self.sample_df
        globals()['available_left'] = available_left
        globals()['available_right'] = available_right
        globals()['selected_flag_reason'] = self.selected_flag_reason
        
        print("\n✅ Data prepared! You can now run the verification window.")
        print("🚀 Use start_verification() or run the window interface cell to begin.")
        
        # Show sample of selected data
        print(f"\n📋 Sample of selected data:")
        preview_cols = [col for col in ['OCR_label', 'OCR_city', 'OCR_state', 'Flag_Reason']
                        if col in self.sample_df.columns]
        display(self.sample_df[preview_cols].head())
        
        self.selection_complete = True
    
    def process_specific_row(self, row_idx):
        """Process specific row selection"""
        print(f"🎯 Selected Option: {self.selected_flag_reason}")
        print("=" * 80)
        
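        # Get the specific row: try the index label first, then fall back to position
        if row_idx in self.df.index:
            filtered_df = self.df.loc[[row_idx]].copy()
        else:
            try:
                filtered_df = self.df.iloc[[row_idx]].copy()
            except (IndexError, TypeError, ValueError):
                # Neither a known label nor a valid position
                print(f"❌ Row {row_idx} not found (dataset has {len(self.df):,} rows).")
                return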
        
        print(f"📋 Selected specific row: Index {row_idx}")
        
        self.sample_df = filtered_df.copy()
        
        # Add verification columns if they don't exist
        if 'Manually Verified?' not in self.sample_df.columns:
            self.sample_df['Manually Verified?'] = ''
        if 'Verify Reason' not in self.sample_df.columns:
            self.sample_df['Verify Reason'] = ''
        
        # Define the columns for comparison
        left_columns = [
            'OCR_city', 'OCR_zip_code', 'OCR_label', 'OCR_phone',
            'OCR_address_standardized', 'OCR_state', 'OCR_chain', 'OCR_year'
        ]
        
        right_columns = [
            'Scraped_state', 'Scraped_name', 'Scraped_full_url', 'Scraped_Chain',
            'Scraped_Highway', 'Scraped_Exit', 'Scraped_Street Address', 'Scraped_City',
            'Scraped_State', 'Scraped_Postal Code', 'Scraped_Phone', 'Scraped_Phone 2',
            'Scraped_Fax', 'Scraped_Phone 3', 'Scraped_Mile Marker', 'Scraped_Phone 4',
            'Scraped_Road Name', 'Scraped_Year'
        ]
        
        # Check which columns actually exist in the dataframe
        available_left = [col for col in left_columns if col in self.sample_df.columns]
        available_right = [col for col in right_columns if col in self.sample_df.columns]
        
        print(f"📊 Available OCR columns: {len(available_left)}")
        print(f"📊 Available Scraped columns: {len(available_right)}")
        
        # Store globally for other cells to access
        globals()['filtered_df'] = filtered_df
        globals()['sample_df'] = self.sample_df
        globals()['available_left'] = available_left
        globals()['available_right'] = available_right
        globals()['selected_flag_reason'] = self.selected_flag_reason
        
        print("\n✅ Data prepared! You can now run the verification window.")
        print("🚀 Use start_verification() or run the window interface cell to begin.")
        
        # Show the selected row data
        print(f"\n📋 Selected row data:")
        preview_cols = [col for col in ['OCR_label', 'OCR_city', 'OCR_state', 'Flag_Reason']
                        if col in self.sample_df.columns]
        display(self.sample_df[preview_cols])
        
        self.selection_complete = True
    
    def run(self):
        """Run the flag reason selector window"""
        self.root.mainloop()

# Function to launch the flag reason selector window
def launch_flag_reason_selector(df):
    """Launch the window-based flag reason selector"""
    print("🚀 Launching Flag Reason Selector Window...")
    print("💡 A new window will open with organized categories to choose from.")
    
    def run_selector():
        selector = WindowFlagReasonSelector(df)
        selector.run()
        return selector
    
    # Run in a separate thread to avoid blocking the notebook
    selector_thread = threading.Thread(target=run_selector, daemon=True)
    selector_thread.start()
    
    print("✅ Flag Reason Selector window launched!")
    print("🎯 Make your selection in the window, then return here to run verification.")
    
    return selector_thread

# Linear Verification Function - Always starts with category selection
def start_linear_verification(df):
    """Start a linear verification process - category selection followed by verification"""
    print("🚀 Starting Linear Verification Process...")
    print("💡 Step 1: Select a category to verify")
    print("💡 Step 2: Verification window will open automatically")
    
    # Create a modified selector that automatically proceeds to verification
    class LinearVerificationSelector(WindowFlagReasonSelector):
        def __init__(self, df):
            super().__init__(df)
            self.auto_proceed = True  # Flag to automatically proceed to verification
            
        def process_selection(self):
            """Override to auto-launch verification after processing selection"""
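            # Call the parent method first
            super().process_selection()
            
            # Only auto-launch if the selection finished and produced a sample
            if not getattr(self, 'selection_complete', False):
                print("⚠️ Selection not completed; verification window not launched.")
                return
            
            print("🔄 Auto-launching verification window...")
            self.launch_verification_window()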
            
        def launch_verification_window(self):
            """Launch the verification window automatically"""
            print("🚀 Launching verification window...")
            
            def run_verification():
                window_comparator = WindowDataComparator(self.sample_df)
                window_comparator.run()
                return window_comparator
            
            # Launch verification in a new thread
            verification_thread = threading.Thread(target=run_verification, daemon=True)
            verification_thread.start()
            
            print("✅ Verification window launched!")
            print("💾 Use save_and_export_results() when done to save your results.")
            
    def run_linear_selector():
        selector = LinearVerificationSelector(df)
        selector.run()
        return selector
    
    # Run the linear selector
    selector_thread = threading.Thread(target=run_linear_selector, daemon=True)
    selector_thread.start()
    
    return selector_thread

# Launch the linear verification process
print("🔧 Ready for Linear Verification Process!")
print("💡 Call start_linear_verification(df) to begin category selection and verification")

🔧 Ready for Linear Verification Process!
💡 Call start_linear_verification(df) to begin category selection and verification
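
The category lists in `process_category_selection` spell out every `Match i of n` variant by hand, so a new variant (say, `Match 1 of 8`) would silently fall outside all three categories. A minimal sketch of a prefix-based alternative follows. `CATEGORY_PREFIXES` and `categorize_flag_reason` are hypothetical names that are not part of this notebook, and the prefixes are assumptions taken from the flag reasons listed above, so check them against the real `Flag_Reason` values before relying on this.

```python
# Hypothetical mapping: category key -> prefixes of the Flag_Reason strings it covers.
CATEGORY_PREFIXES = {
    "successful": (
        "Single Name/ZIP/State/",
        "Available in RVer and Trucker",
    ),
    "multiple": (
        "Multiple matches found",
        "Multiple Name/ZIP/State/",
    ),
    "issues": (
        "No ",
        "no matching ZIPCODE",
        "matching ZIPCODE",
        "unclear OCR_address_standardized",
    ),
}


def categorize_flag_reason(reason):
    """Return the category key for a flag reason, or None if no prefix matches."""
    text = str(reason).strip()
    for category, prefixes in CATEGORY_PREFIXES.items():
        # str.startswith accepts a tuple and returns True if any prefix matches
        if text.startswith(prefixes):
            return category
    return None


if __name__ == "__main__":
    example = "Multiple Name/ZIP/State/Road Matches Found (empty address type) - Match 5 of 6"
    print(categorize_flag_reason(example))
```

With pandas, the same function could be applied as `df['Flag_Reason'].map(categorize_flag_reason)` to label every row once, after which each category is a simple equality filter.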


In [4]:
# Utility function for saving results from window interface
def save_and_export_results(filename=None):
    """Save verification results from window interface and export to CSV"""
    print("💾 Preparing to save results...")
    
    try:
        # Use the sample_df which gets updated by the window interface
        if 'sample_df' not in globals():
            print("❌ No sample data found. Please run the comparison tool first.")
            return None
            
        updated_data = sample_df.copy()
        
        # Update the main dataframe with the verification results
        for idx in updated_data.index:
            if idx in df.index:
                df.loc[idx, 'Manually Verified?'] = updated_data.loc[idx, 'Manually Verified?']
                df.loc[idx, 'Verify Reason'] = updated_data.loc[idx, 'Verify Reason']
        
        # Export to CSV
        if filename is None:
            from datetime import datetime
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"verified_data_{timestamp}.csv"
        
        filepath = f"C:\\Users\\clint\\Desktop\\Geocoding_Task\\Matching_WebScrape\\{filename}"
        df.to_csv(filepath, index=False)
        
        # Show summary (blank strings and NaN both count as "not yet verified")
        verified = updated_data['Manually Verified?'].fillna('').astype(str).str.strip()
        summary = updated_data.loc[verified != '', ['Manually Verified?', 'Verify Reason']]
        
        print("✅ Results saved and exported!")
        print(f"📁 File: {filepath}")
        print(f"📊 Verified: {len(summary)} out of {len(updated_data)} rows")
        
        if len(summary) > 0:
            verification_counts = summary['Manually Verified?'].value_counts()
            for verification, count in verification_counts.items():
                emoji = {'Yes': '✅', 'No': '❌', 'Maybe': '❓'}.get(verification, '🔍')
                print(f"  {emoji} {verification}: {count}")
        
        return filepath
        
    except Exception as e:
        print(f"❌ Error saving results: {e}")
        print("💡 Make sure you've made some verifications in the window first.")
        return None

print("💾 Save function ready!")
print("💡 Use save_and_export_results() after completing your verifications.")

💾 Save function ready!
💡 Use save_and_export_results() after completing your verifications.
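
`save_and_export_results()` reports how many rows carry a decision. The sketch below shows one way that tally could be computed in plain Python so that blank strings, `None`, and NaN all count as unverified. `tally_decisions` is a hypothetical helper, not something defined in this notebook.

```python
import math
from collections import Counter


def tally_decisions(values):
    """Count non-blank decisions; blank strings, None, and NaN are skipped."""
    counts = Counter()
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            counts[text] += 1
    return counts


if __name__ == "__main__":
    decisions = ["Yes", "", "No", None, float("nan"), "Yes", "Maybe"]
    for decision, count in tally_decisions(decisions).items():
        print(f"{decision}: {count}")
```

Passing `sample_df['Manually Verified?']` would work the same way, since iterating a pandas Series yields its values.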


In [5]:
# Manual Launch Helper (run this after selecting a flag reason)
def start_verification():
    """Start the verification tool after flag reason selection"""
    if 'sample_df' not in globals():
        print("❌ No flag reason selected yet!")
        print("💡 Please click one of the 'Verify:...' buttons above first.")
        return None
    
    window_thread = launch_window_comparator()
    globals()['comparator_thread'] = window_thread
    return window_thread

print("🔧 Helper function loaded.")
print("💡 After selecting a flag reason above, you can run: start_verification()")

🔧 Helper function loaded.
💡 After selecting a flag reason above, you can run: start_verification()
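
The `ComparisonUtils` class in the following cell compares phone numbers after reducing them to digits. As a worked illustration of that rule, here is a standalone sketch; `normalize_phone` and `any_phone_matches` below are simplified stand-ins written for this example, and reading the leading `1` of an 11-digit number as a US country code is an assumption.

```python
import re


def normalize_phone(phone):
    """Keep digits only; drop a leading '1' from 11-digit numbers."""
    digits = re.sub(r"\D", "", str(phone))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def any_phone_matches(ocr_phone, scraped_phones):
    """Return True if the OCR phone equals any 'Label: number' entry joined by ' | '."""
    target = normalize_phone(ocr_phone)
    if len(target) < 7:  # too short to be a meaningful phone number
        return False
    for entry in str(scraped_phones).split(" | "):
        # Entries may look like "Phone: (555) 123-4567" or just "(555) 123-4567"
        value = entry.split(":", 1)[1] if ":" in entry else entry
        if normalize_phone(value) == target:
            return True
    return False


if __name__ == "__main__":
    print(normalize_phone("+1 (555) 123-4567"))
    print(any_phone_matches("555-123-4567", "Phone: (555) 123-4567 | Fax: 555-000-1111"))
```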


In [None]:
import tkinter as tk
from tkinter import ttk, scrolledtext, messagebox
import threading
import webbrowser
import urllib.parse
import re

# Base comparison utilities for data comparison
class ComparisonUtils:
    """Shared utilities for data comparison"""
    
    @staticmethod
    def normalize_text(text):
        """Normalize text for comparison"""
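        # Convert to str first so pandas NaN/None count as missing, not as the text "nan"
        text = "" if text is None else str(text).strip()
        if text in ['', 'N/A', 'nan', 'None']:
            return ""
        text = text.lower()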
        text = re.sub(r"[^\w\s]", "", text)
        text = re.sub(r"\s+", " ", text)
        return text.strip()
    
    @staticmethod
    def extract_words(text):
        """Extract words from text"""
        normalized = ComparisonUtils.normalize_text(text)
        if not normalized:
            return set()
        return set(word for word in normalized.split() if len(word) > 1)
    
    @staticmethod
    def normalize_phone(phone):
        """Normalize phone number"""
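        phone = "" if phone is None else str(phone).strip()
        if phone in ['', 'N/A', 'nan', 'None']:
            return ""
        # Numbers read by pandas as floats print like "5551234567.0"; drop the ".0"
        if phone.endswith('.0'):
            phone = phone[:-2]
        digits = re.sub(r'\D', '', phone)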
        if len(digits) == 11 and digits.startswith('1'):
            digits = digits[1:]
        return digits
    
    @staticmethod
    def is_phone_match(ocr_phone, scraped_phones):
        """Check if phone numbers match"""
        ocr_normalized = ComparisonUtils.normalize_phone(ocr_phone)
        if not ocr_normalized or len(ocr_normalized) < 7:
            return False
        
        for phone_entry in str(scraped_phones).split(' | '):
            if ':' in phone_entry:
                phone_value = phone_entry.split(':', 1)[1].strip()
            else:
                phone_value = phone_entry.strip()
            
            scraped_normalized = ComparisonUtils.normalize_phone(phone_value)
            if scraped_normalized and ocr_normalized == scraped_normalized:
                return True
        return False
    
    @staticmethod
    def is_partial_match(ocr_value, scraped_value, field_name):
        """Check for partial matches based on field type"""
        if field_name in ['label', 'name']:
            ocr_words = ComparisonUtils.extract_words(ocr_value)
            scraped_words = ComparisonUtils.extract_words(scraped_value)
            
            if not ocr_words or not scraped_words:
                return False
            
            # Check for word overlap
            common_words = ocr_words.intersection(scraped_words)
            if common_words:
                return True
            
            # Check for partial word matches
            for ocr_word in ocr_words:
                for scraped_word in scraped_words:
                    if len(ocr_word) >= 3 and len(scraped_word) >= 3:
                        if ocr_word in scraped_word or scraped_word in ocr_word:
                            return True
            return False
        elif field_name == 'phone':
            return ComparisonUtils.is_phone_match(ocr_value, scraped_value)
        else:
            return ComparisonUtils.normalize_text(ocr_value) == ComparisonUtils.normalize_text(scraped_value)
    
    @staticmethod
    def consolidate_phone_fields(row_data):
        """Consolidate phone fields into a single string"""
        phone_fields = ['Scraped_Phone', 'Scraped_Phone 2', 'Scraped_Phone 3', 'Scraped_Phone 4', 'Scraped_Fax']
        phones = []
        for field in phone_fields:
            if field in row_data and row_data[field] and str(row_data[field]) not in ['nan', 'N/A', '']:
                phones.append(f"{field.replace('Scraped_', '')}: {row_data[field]}")
        return ' | '.join(phones) if phones else 'N/A'
    
    @staticmethod
    def consolidate_address_fields(row_data):
        """Consolidate address fields into a single string"""
        address_parts = []
        address_fields = ['Scraped_Street Address', 'Scraped_City', 'Scraped_State', 'Scraped_Postal Code']
        for field in address_fields:
            if field in row_data and row_data[field] and str(row_data[field]) not in ['nan', 'N/A', '']:
                address_parts.append(str(row_data[field]))
        return ', '.join(address_parts) if address_parts else 'N/A'

print("🔧 Utilities loaded. Ready to launch window interface...")
print("💡 Use launch_window_comparator(sample_df) to start the comparison tool.")


class WindowDataComparator:
    def __init__(self, sample_df):
        self.sample_df = sample_df.copy()
        self.current_row_index = 0
        self.current_row_data = None
        
        # Create the main window
        self.root = tk.Tk()
        self.root.title("🔍 Data Verification Tool")
        self.root.geometry("1200x800")
        self.root.configure(bg='#f8f9fa')
        
        self.setup_ui()
        self.load_current_row()
        
    def setup_ui(self):
        """Setup the verification window UI"""
        # Main container
        main_frame = tk.Frame(self.root, bg='#f8f9fa')
        main_frame.pack(fill=tk.BOTH, expand=True, padx=20, pady=20)
        
        # Header
        header_frame = tk.Frame(main_frame, bg='#f8f9fa')
        header_frame.pack(fill=tk.X, pady=(0, 20))
        
        title_label = tk.Label(header_frame, text="🔍 Data Verification Tool", 
                              font=('Segoe UI', 18, 'bold'), bg='#f8f9fa', fg='#2c3e50')
        title_label.pack()
        
        # Progress info
        self.progress_label = tk.Label(header_frame, text="", 
                                      font=('Segoe UI', 12), bg='#f8f9fa', fg='#6c757d')
        self.progress_label.pack(pady=(5, 0))
        
        # Navigation buttons
        nav_frame = tk.Frame(header_frame, bg='#f8f9fa')
        nav_frame.pack(pady=(10, 0))
        
        self.prev_btn = tk.Button(nav_frame, text="⬅️ Previous", 
                                 command=self.previous_row, font=('Segoe UI', 10),
                                 bg='#6c757d', fg='white', padx=15)
        self.prev_btn.pack(side=tk.LEFT, padx=5)
        
        self.next_btn = tk.Button(nav_frame, text="Next ➡️", 
                                 command=self.next_row, font=('Segoe UI', 10),
                                 bg='#6c757d', fg='white', padx=15)
        self.next_btn.pack(side=tk.LEFT, padx=5)
        
        # Data comparison area
        comparison_frame = tk.Frame(main_frame, bg='#ffffff', relief=tk.RAISED, bd=2)
        comparison_frame.pack(fill=tk.BOTH, expand=True, pady=(0, 20))
        
        # Split into left (OCR) and right (Scraped) sections
        left_frame = tk.Frame(comparison_frame, bg='#ffffff')
        left_frame.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=10, pady=10)
        
        right_frame = tk.Frame(comparison_frame, bg='#ffffff')
        right_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=True, padx=10, pady=10)
        
        # Left (OCR) header
        tk.Label(left_frame, text="📄 OCR Data", font=('Segoe UI', 14, 'bold'), 
                bg='#ffffff', fg='#495057').pack(pady=(0, 10))
        
        # Right (Scraped) header
        tk.Label(right_frame, text="🌐 Scraped Data", font=('Segoe UI', 14, 'bold'), 
                bg='#ffffff', fg='#495057').pack(pady=(0, 10))
        
        # Scrollable text areas for data
        self.ocr_text = tk.Text(left_frame, font=('Consolas', 10), bg='#f8f9fa', 
                               fg='#495057', wrap=tk.WORD, height=15)
        self.ocr_text.pack(fill=tk.BOTH, expand=True, padx=5)
        
        self.scraped_text = tk.Text(right_frame, font=('Consolas', 10), bg='#f8f9fa', 
                                   fg='#495057', wrap=tk.WORD, height=15)
        self.scraped_text.pack(fill=tk.BOTH, expand=True, padx=5)
        
        # Verification section
        verification_frame = tk.Frame(main_frame, bg='#e9ecef', relief=tk.RAISED, bd=2)
        verification_frame.pack(fill=tk.X, pady=(0, 10))
        
        tk.Label(verification_frame, text="✅ Verification Decision", 
                font=('Segoe UI', 14, 'bold'), bg='#e9ecef', fg='#2c3e50').pack(pady=(10, 5))
        
        # Verification buttons
        button_frame = tk.Frame(verification_frame, bg='#e9ecef')
        button_frame.pack(pady=(0, 10))
        
        self.yes_btn = tk.Button(button_frame, text="✅ YES - Good Match", 
                                command=lambda: self.on_verification_decision('Yes'),
                                font=('Segoe UI', 12, 'bold'), bg='#28a745', fg='white',
                                padx=20, pady=10)
        self.yes_btn.pack(side=tk.LEFT, padx=10)
        
        self.maybe_btn = tk.Button(button_frame, text="❓ MAYBE - Uncertain", 
                                  command=lambda: self.on_verification_decision('Maybe'),
                                  font=('Segoe UI', 12, 'bold'), bg='#ffc107', fg='black',
                                  padx=20, pady=10)
        self.maybe_btn.pack(side=tk.LEFT, padx=10)
        
        self.no_btn = tk.Button(button_frame, text="❌ NO - Bad Match", 
                               command=lambda: self.on_verification_decision('No'),
                               font=('Segoe UI', 12, 'bold'), bg='#dc3545', fg='white',
                               padx=20, pady=10)
        self.no_btn.pack(side=tk.LEFT, padx=10)
        
        # Reason text box
        reason_frame = tk.Frame(verification_frame, bg='#e9ecef')
        reason_frame.pack(fill=tk.X, padx=20, pady=(0, 10))
        
        tk.Label(reason_frame, text="💭 Verification Reason (optional):", 
                font=('Segoe UI', 10), bg='#e9ecef', fg='#495057').pack(anchor='w')
        
        self.reason_entry = tk.Text(reason_frame, height=3, font=('Segoe UI', 10),
                                   bg='#ffffff', fg='#495057')
        self.reason_entry.pack(fill=tk.X, pady=(5, 0))
        
        # Action buttons
        action_frame = tk.Frame(main_frame, bg='#f8f9fa')
        action_frame.pack(fill=tk.X)
        
        google_btn = tk.Button(action_frame, text="🔍 Google Search", 
                              command=self.open_google_search,
                              font=('Segoe UI', 10), bg='#007bff', fg='white', padx=15)
        google_btn.pack(side=tk.LEFT, padx=5)
        
        website_btn = tk.Button(action_frame, text="🌐 Open Website", 
                               command=self.open_website,
                               font=('Segoe UI', 10), bg='#17a2b8', fg='white', padx=15)
        website_btn.pack(side=tk.LEFT, padx=5)
        
        save_btn = tk.Button(action_frame, text="💾 Save Progress", 
                            command=self.save_progress,
                            font=('Segoe UI', 10), bg='#6f42c1', fg='white', padx=15)
        save_btn.pack(side=tk.RIGHT, padx=5)
    
    def on_verification_decision(self, decision):
        """Record a Yes/No/Maybe decision for the current row, then advance to the next row"""
        try:
            print(f"🎯 User clicked: {decision}")
            
            # 1. GET THE CURRENT ROW INDEX IN THE ORIGINAL DATAFRAME
            current_original_index = self.sample_df.index[self.current_row_index]
            print(f"📍 Updating row at index: {current_original_index}")
            
            # 2. GET THE REASON TEXT (if any)
            reason_text = self.reason_entry.get("1.0", tk.END).strip()
            if not reason_text:
                reason_text = f"Auto: {decision} via verification tool"
            
            # 3. UPDATE THE SAMPLE DATAFRAME
            self.sample_df.loc[current_original_index, 'Manually Verified?'] = decision
            self.sample_df.loc[current_original_index, 'Verify Reason'] = reason_text
            print(f"✅ Updated local sample_df")
            
            # 4. UPDATE THE GLOBAL SAMPLE_DF VARIABLE (so other cells can access it)
            if 'sample_df' in globals():
                globals()['sample_df'].loc[current_original_index, 'Manually Verified?'] = decision
                globals()['sample_df'].loc[current_original_index, 'Verify Reason'] = reason_text
                print(f"✅ Updated global sample_df")
            
            # 5. PROVIDE VISUAL FEEDBACK
            self.flash_button_feedback(decision)
            print(f"✅ Flashed button feedback")
            
            # 6. AUTO-ADVANCE TO NEXT ROW (after a brief delay)
            print(f"🔄 Scheduling move to next row in 1 second...")
            self.root.after(1000, self.next_row)  # Wait 1 second, then go to next row
            
            print(f"✅ Recorded: {decision} for row {current_original_index}")
            print(f"📝 Reason: {reason_text}")
            
        except Exception as e:
            print(f"❌ ERROR in on_verification_decision: {e}")
            import traceback
            traceback.print_exc()
    
    def flash_button_feedback(self, decision):
        """Provide visual feedback when a button is clicked"""
        # Flash the appropriate button
        if decision == 'Yes':
            self.yes_btn.configure(bg='#155724')  # Darker green
            self.root.after(500, lambda: self.yes_btn.configure(bg='#28a745'))
        elif decision == 'Maybe':
            self.maybe_btn.configure(bg='#856404')  # Darker yellow
            self.root.after(500, lambda: self.maybe_btn.configure(bg='#ffc107'))
        elif decision == 'No':
            self.no_btn.configure(bg='#721c24')  # Darker red
            self.root.after(500, lambda: self.no_btn.configure(bg='#dc3545'))
    
    def load_current_row(self):
        """Load data for the current row"""
        if self.current_row_index >= len(self.sample_df):
            self.show_completion_message()
            return
            
        # Get current row data
        current_index = self.sample_df.index[self.current_row_index]
        self.current_row_data = self.sample_df.loc[current_index]
        
        # Update progress
        self.progress_label.config(text=f"Row {self.current_row_index + 1} of {len(self.sample_df)}")
        
        # Clear text areas
        self.ocr_text.delete("1.0", tk.END)
        self.scraped_text.delete("1.0", tk.END)
        self.reason_entry.delete("1.0", tk.END)
        
        # Load OCR data
        ocr_data = []
        ocr_fields = ['OCR_city', 'OCR_zip_code', 'OCR_label', 'OCR_phone', 
                     'OCR_address_standardized', 'OCR_state', 'OCR_chain', 'OCR_year']
        
        for field in ocr_fields:
            if field in self.current_row_data:
                value = self.current_row_data[field]
                ocr_data.append(f"{field}: {value}")
        
        self.ocr_text.insert("1.0", "\n".join(ocr_data))
        
        # Load Scraped data  
        scraped_data = []
        scraped_fields = ['Scraped_name', 'Scraped_Highway', 'Scraped_Exit', 
                         'Scraped_Street Address', 'Scraped_City', 'Scraped_State', 
                         'Scraped_Postal Code', 'Scraped_Phone', 'Scraped_Year',
                         'Flag_Reason']
        
        for field in scraped_fields:
            if field in self.current_row_data:
                value = self.current_row_data[field]
                scraped_data.append(f"{field}: {value}")
        
        self.scraped_text.insert("1.0", "\n".join(scraped_data))
        
        # Show current verification status (unreviewed rows hold NaN, which is truthy, so test the string form)
        current_verification = self.current_row_data.get('Manually Verified?', '')
        current_reason = self.current_row_data.get('Verify Reason', '')
        
        if str(current_verification) not in ['nan', 'N/A', '', 'None']:
            self.reason_entry.insert("1.0", f"Previous: {current_verification} - {current_reason}")
        
        # Update navigation buttons
        self.prev_btn.config(state='normal' if self.current_row_index > 0 else 'disabled')
        self.next_btn.config(state='normal' if self.current_row_index < len(self.sample_df) - 1 else 'disabled')
    
    def next_row(self):
        """Move to next row"""
        try:
            print(f"🔄 next_row called. Current index: {self.current_row_index}, Total rows: {len(self.sample_df)}")
            if self.current_row_index < len(self.sample_df) - 1:
                self.current_row_index += 1
                print(f"✅ Moving to row {self.current_row_index + 1}")
                self.load_current_row()
            else:
                print("🏁 Reached last row, showing completion message")
                self.show_completion_message()
        except Exception as e:
            print(f"❌ ERROR in next_row: {e}")
            import traceback
            traceback.print_exc()
    
    def previous_row(self):
        """Move to previous row"""
        try:
            if self.current_row_index > 0:
                self.current_row_index -= 1
                print(f"⬅️ Moving to row {self.current_row_index + 1}")
                self.load_current_row()
        except Exception as e:
            print(f"❌ ERROR in previous_row: {e}")
            import traceback
            traceback.print_exc()
    
    def open_google_search(self):
        """Open Google search with relevant terms"""
        search_terms = []
        
        # Add business name
        name = self.current_row_data.get('OCR_label', '')
        if name and str(name) not in ['nan', 'N/A', '']:
            search_terms.append(f'"{name}"')
        
        # Add location info
        city = self.current_row_data.get('OCR_city', '')
        state = self.current_row_data.get('OCR_state', '')
        if city and str(city) not in ['nan', 'N/A', '']:
            search_terms.append(str(city))  # str() so ' '.join() cannot fail on non-string values
        if state and str(state) not in ['nan', 'N/A', '']:
            search_terms.append(str(state))
        
        if search_terms:
            query = ' '.join(search_terms)
            url = f"https://www.google.com/search?q={urllib.parse.quote(query)}"
            webbrowser.open(url)
    
    def open_website(self):
        """Open the scraped website URL"""
        url = self.current_row_data.get('Scraped_full_url', '')
        if url and str(url) not in ['nan', 'N/A', '']:
            webbrowser.open(url)
        else:
            tk.messagebox.showinfo("No URL", "No website URL available for this record")
    
    def save_progress(self):
        """Save current progress"""
        # Update global variables
        globals()['sample_df'] = self.sample_df.copy()
        print("💾 Progress saved to global variables")
        tk.messagebox.showinfo("Saved", "Progress saved! Use save_and_export_results() to export to CSV")
    
    def show_completion_message(self):
        """Show completion message when all rows are verified"""
        tk.messagebox.showinfo("Complete!", 
                              f"🎉 All {len(self.sample_df)} rows have been reviewed!\n"
                              "Use save_and_export_results() to export your results.")
    
    def run(self):
        """Run the verification window"""
        self.root.mainloop()

print("✅ WindowDataComparator class created!")
print("🎯 This class handles the Yes/No/Maybe verification buttons")

🔧 Utilities loaded. Ready to launch window interface...
💡 Use launch_window_comparator(sample_df) to start the comparison tool.
✅ WindowDataComparator class created!
🎯 This class handles the Yes/No/Maybe verification buttons
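
The search helper in the class above filters out placeholder values before building a Google query. The sketch below isolates that step so it can be tried without a window. It is a minimal illustration, not the notebook's own code: `is_blank` and `build_search_url` are hypothetical names, and the case-insensitive placeholder check is an assumption (the class compares against `'nan'` and `'N/A'` literally).

```python
import math
import urllib.parse

# Assumption: these strings mark a missing value, compared case-insensitively.
PLACEHOLDERS = {"", "nan", "n/a", "none"}

def is_blank(value):
    """Return True for None, a float NaN, or a placeholder string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in PLACEHOLDERS

def build_search_url(row):
    """Build a Google search URL from a row's OCR name, city, and state."""
    terms = []
    name = row.get("OCR_label")
    if not is_blank(name):
        # Quote the business name so it is searched as an exact phrase.
        terms.append(f'"{str(name).strip()}"')
    for field in ("OCR_city", "OCR_state"):
        value = row.get(field)
        if not is_blank(value):
            terms.append(str(value).strip())
    if not terms:
        return None  # nothing usable to search for
    return "https://www.google.com/search?q=" + urllib.parse.quote(" ".join(terms))

# A plain dict stands in for a DataFrame row; the state is missing (NaN).
row = {"OCR_label": "Petro Eloy # 306", "OCR_city": "Eloy", "OCR_state": float("nan")}
print(build_search_url(row))
```

Testing `float` NaN explicitly matters because NaN is truthy in Python, so a bare `if value:` check lets it through.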


In [7]:
# Window Data Comparator - Main Interface
class WindowDataComparator:
    def __init__(self, sample_df):
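        self.sample_df = sample_df.copy()
        # Make sure the verification columns exist and can hold strings: reading a
        # missing column raises KeyError, and writing a string into an all-NaN
        # float column can trigger a pandas dtype warning
        for col in ('Manually Verified?', 'Verify Reason'):
            if col not in self.sample_df.columns:
                self.sample_df[col] = ''
            self.sample_df[col] = self.sample_df[col].astype(object)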
        self.current_row_idx = 0
        self.row_indices = list(sample_df.index)
        self.utils = ComparisonUtils()
        
        # Field mappings with corrected OCR_year mapping
        self.field_mappings = [
            ('OCR_city', 'Scraped_City'),
            ('OCR_zip_code', 'Scraped_Postal Code'),
            ('OCR_label', 'Scraped_name'),
            ('OCR_phone', 'phone_consolidated'),
            ('OCR_address_standardized', 'address_consolidated'),
            ('OCR_state', 'Scraped_State'),
            ('OCR_chain', 'Scraped_Chain'),
            ('OCR_year', 'Scraped_Year')  # Year field mapped to Scraped_Year
        ]
        
        # Create the window
        self.root = tk.Tk()
        self.root.title("📊 Data Comparison Tool")
        self.root.geometry("1200x800")
        self.root.configure(bg='#f0f0f0')
        self.root.resizable(True, True)
        
        self.setup_ui()
        self.load_current_row()
        
    def setup_ui(self):
        """Setup the window UI"""
        # Main container with scrolling
        self.main_frame = ttk.Frame(self.root)
        self.main_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=10)
        
        # Canvas and scrollbar for scrolling
        self.canvas = tk.Canvas(self.main_frame, bg='#f0f0f0')
        self.scrollbar = ttk.Scrollbar(self.main_frame, orient="vertical", command=self.canvas.yview)
        self.scrollable_frame = ttk.Frame(self.canvas)
        
        self.scrollable_frame.bind(
            "<Configure>",
            lambda e: self.canvas.configure(scrollregion=self.canvas.bbox("all"))
        )
        
        self.canvas.create_window((0, 0), window=self.scrollable_frame, anchor="nw")
        self.canvas.configure(yscrollcommand=self.scrollbar.set)
        
        self.canvas.pack(side="left", fill="both", expand=True)
        self.scrollbar.pack(side="right", fill="y")
        
        # Header frame with improved styling
        header_frame = ttk.Frame(self.scrollable_frame)
        header_frame.pack(fill=tk.X, pady=(0, 5))
        
        self.header_label = ttk.Label(header_frame, text="", font=('Segoe UI', 14, 'bold'))
        self.header_label.pack(pady=3)
        
        # Navigation frame with better spacing
        nav_frame = ttk.Frame(header_frame)
        nav_frame.pack(pady=2)
        
        # Improved navigation buttons
        nav_style = {'width': 12}
        ttk.Button(nav_frame, text="⬅️ Previous", command=lambda: self.navigate(-1), **nav_style).pack(side=tk.LEFT, padx=3)
        ttk.Button(nav_frame, text="➡️ Next", command=lambda: self.navigate(1), **nav_style).pack(side=tk.LEFT, padx=3)
        ttk.Button(nav_frame, text="💾 Save", command=self.save_current_row, **nav_style).pack(side=tk.LEFT, padx=3)
        
        # Comparison frame with reduced padding
        self.comparison_frame = ttk.Frame(self.scrollable_frame)
        self.comparison_frame.pack(fill=tk.BOTH, expand=True, pady=2)
        
        # Action buttons frame with better spacing
        action_frame = ttk.Frame(self.scrollable_frame)
        action_frame.pack(fill=tk.X, pady=3)
        
        ttk.Button(action_frame, text="🌐 Visit Website", command=self.open_website, width=15).pack(side=tk.LEFT, padx=3)
        ttk.Button(action_frame, text="🔍 Google Search", command=self.google_search, width=15).pack(side=tk.LEFT, padx=3)
        
        # Verification controls frame with compact design
        verify_frame = ttk.LabelFrame(self.scrollable_frame, text="📝 Verification", padding=8)
        verify_frame.pack(fill=tk.X, pady=3)
        
        # Match status with improved layout
        status_frame = ttk.Frame(verify_frame)
        status_frame.pack(fill=tk.X, pady=2)
        ttk.Label(status_frame, text="Match Status:", font=('Segoe UI', 10, 'bold'), width=12).pack(side=tk.LEFT)
        
        self.status_var = tk.StringVar()
        status_buttons_frame = ttk.Frame(status_frame)
        status_buttons_frame.pack(side=tk.LEFT, padx=(10, 0))
        
        # Improved status buttons with better styling
        button_configs = [('Yes', '#27ae60', 'white'), ('Maybe', '#f39c12', 'white'), ('No', '#e74c3c', 'white')]
        for status, bg_color, fg_color in button_configs:
            btn = tk.Button(status_buttons_frame, text=status, bg=bg_color, fg=fg_color,
                          font=('Segoe UI', 9, 'bold'), relief=tk.FLAT, bd=0,
                          command=lambda s=status: self.set_status(s), width=8, pady=3)
            btn.pack(side=tk.LEFT, padx=2)
        
        # Reasoning with improved layout
        reason_frame = ttk.Frame(verify_frame)
        reason_frame.pack(fill=tk.X, pady=3)
        ttk.Label(reason_frame, text="Reasoning:", font=('Segoe UI', 10, 'bold'), width=12).pack(side=tk.LEFT, anchor=tk.N)
        
        self.reasoning_text = scrolledtext.ScrolledText(reason_frame, height=2, width=60, 
                                                      font=('Segoe UI', 9), wrap=tk.WORD)
        self.reasoning_text.pack(side=tk.LEFT, fill=tk.X, expand=True, padx=(10, 0))
        
        # Bind mouse wheel for scrolling
        def _on_mousewheel(event):
            self.canvas.yview_scroll(int(-1*(event.delta/120)), "units")
        self.canvas.bind_all("<MouseWheel>", _on_mousewheel)
    
    def load_current_row(self):
        """Load and display the current row"""
        if not self.row_indices:
            return
            
        current_idx = self.row_indices[self.current_row_idx]
        self.current_row_data = self.sample_df.loc[current_idx].copy()
        
        # Add consolidated fields
        self.current_row_data['phone_consolidated'] = self.utils.consolidate_phone_fields(self.current_row_data)
        self.current_row_data['address_consolidated'] = self.utils.consolidate_address_fields(self.current_row_data)
        
        # Update header
        self.header_label.config(text=f"🔍 Row {self.current_row_idx + 1} of {len(self.row_indices)} (Index: {current_idx})")
        
        # Clear existing comparison widgets
        for widget in self.comparison_frame.winfo_children():
            widget.destroy()
        
        # Create comparison display
        for i, (ocr_field, scraped_field) in enumerate(self.field_mappings):
            if ocr_field in self.current_row_data:
                ocr_value = self.current_row_data[ocr_field]
                scraped_value = self.current_row_data.get(scraped_field, 'N/A')
                
                # Determine match status
                is_exact = str(ocr_value) == str(scraped_value)
                is_partial = self.utils.is_partial_match(ocr_value, scraped_value, ocr_field.split('_')[-1])
                
                # Choose colors
                if is_exact:
                    bg_color = "#d4edda"  # Light green
                    icon = "✅"
                elif is_partial:
                    bg_color = "#fff3cd"  # Light yellow
                    icon = "🟡"
                else:
                    bg_color = "#f8d7da"  # Light red
                    icon = "❌"
                
                # Create field frame with improved styling
                field_frame = tk.Frame(self.comparison_frame, bg=bg_color, relief=tk.SOLID, bd=1)
                field_frame.pack(fill=tk.X, pady=1, padx=3, ipady=2)
                
                # Field name with icon - improved typography
                name_label = tk.Label(field_frame, text=f"{icon} {ocr_field.replace('OCR_', '').replace('_', ' ').title()}", 
                                    bg=bg_color, font=('Segoe UI', 11, 'bold'), fg='#2c3e50')
                name_label.pack(anchor=tk.W, padx=8, pady=(3, 1))
                
                # OCR value with improved styling
                ocr_frame = tk.Frame(field_frame, bg=bg_color)
                ocr_frame.pack(fill=tk.X, padx=12, pady=1)
                
                ocr_label_text = tk.Label(ocr_frame, text="OCR:", bg=bg_color, 
                                        font=('Segoe UI', 10, 'bold'), fg='#e74c3c', width=8, anchor='w')
                ocr_label_text.pack(side=tk.LEFT)
                
                ocr_value_text = tk.Label(ocr_frame, text=str(ocr_value), bg=bg_color, 
                                        font=('Consolas', 10), wraplength=450, justify=tk.LEFT, anchor='w')
                ocr_value_text.pack(side=tk.LEFT, fill=tk.X, expand=True, padx=(5, 0))
                
                # Scraped value with improved styling
                scraped_frame = tk.Frame(field_frame, bg=bg_color)
                scraped_frame.pack(fill=tk.X, padx=12, pady=(1, 3))
                
                scraped_label_text = tk.Label(scraped_frame, text="Scraped:", bg=bg_color, 
                                            font=('Segoe UI', 10, 'bold'), fg='#3498db', width=8, anchor='w')
                scraped_label_text.pack(side=tk.LEFT)
                
                scraped_value_text = tk.Label(scraped_frame, text=str(scraped_value), bg=bg_color, 
                                            font=('Consolas', 10), wraplength=450, justify=tk.LEFT, anchor='w')
                scraped_value_text.pack(side=tk.LEFT, fill=tk.X, expand=True, padx=(5, 0))
        
        # Load saved verification values
        self.load_saved_values(current_idx)
    
    def set_status(self, status):
        """Set the match status"""
        self.status_var.set(status)
        
    def navigate(self, direction):
        """Navigate between rows"""
        self.save_current_row()
        new_idx = self.current_row_idx + direction
        if 0 <= new_idx < len(self.row_indices):
            self.current_row_idx = new_idx
            self.load_current_row()
        else:
            messagebox.showinfo("Navigation", "No more rows in that direction")
    
    def save_current_row(self):
        """Save current verification status"""
        current_idx = self.row_indices[self.current_row_idx]
        self.sample_df.loc[current_idx, 'Manually Verified?'] = self.status_var.get()
        self.sample_df.loc[current_idx, 'Verify Reason'] = self.reasoning_text.get(1.0, tk.END).strip()
        
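    def load_saved_values(self, idx):
        """Load previously saved values, treating missing or NaN entries as empty"""
        def _clean(column):
            # A missing column or NaN value means the row has not been reviewed yet
            value = self.sample_df.loc[idx, column] if column in self.sample_df.columns else ''
            return '' if str(value) in ['nan', 'None'] else str(value)
        
        self.status_var.set(_clean('Manually Verified?'))
        self.reasoning_text.delete("1.0", tk.END)
        self.reasoning_text.insert("1.0", _clean('Verify Reason'))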
    
    def open_website(self):
        """Open the scraped website"""
        url = self.current_row_data.get('Scraped_full_url', '')
        if url and str(url) not in ['nan', 'N/A', '', 'None']:  # NaN is truthy, so check the string form
            webbrowser.open(url)
        else:
            messagebox.showinfo("Website", "No website URL available")
    
    def google_search(self):
        """Perform comprehensive Google search using all available information"""
        search_terms = []
        
        # OCR Information (Primary)
        ocr_fields = ['OCR_label', 'OCR_city', 'OCR_state', 'OCR_zip_code', 'OCR_chain']
        for field in ocr_fields:
            value = self.current_row_data.get(field, '')
            if value and str(value) not in ['nan', 'N/A', '', 'None']:
                cleaned_value = str(value).strip()
                if cleaned_value and cleaned_value not in search_terms:
                    search_terms.append(cleaned_value)
        
        # Scraped Information (Secondary - for additional context)
        scraped_fields = ['Scraped_name', 'Scraped_City', 'Scraped_State', 'Scraped_Chain']
        for field in scraped_fields:
            value = self.current_row_data.get(field, '')
            if value and str(value) not in ['nan', 'N/A', '', 'None']:
                cleaned_value = str(value).strip()
                # Only add if not already covered by OCR data (avoid duplication)
                if cleaned_value and cleaned_value not in search_terms:
                    # Check for substantial difference to avoid near-duplicates
                    is_duplicate = False
                    for existing_term in search_terms:
                        if cleaned_value.lower() in existing_term.lower() or existing_term.lower() in cleaned_value.lower():
                            is_duplicate = True
                            break
                    if not is_duplicate:
                        search_terms.append(cleaned_value)
        
        # Add highway/exit information if available
        highway_info = []
        for field in ['Scraped_Highway', 'Scraped_Exit', 'Scraped_Road Name']:
            value = self.current_row_data.get(field, '')
            if value and str(value) not in ['nan', 'N/A', '', 'None']:
                highway_info.append(str(value).strip())
        
        if highway_info:
            # Add highway/exit as a combined term
            highway_term = ' '.join(highway_info)
            if highway_term not in search_terms:
                search_terms.append(highway_term)
        
        # Add address information if available
        address_value = self.current_row_data.get('OCR_address_standardized', '')
        if address_value and str(address_value) not in ['nan', 'N/A', '', 'None']:
            addr_cleaned = str(address_value).strip()
            if addr_cleaned and addr_cleaned not in search_terms:
                search_terms.append(addr_cleaned)
        
        if search_terms:
            # Limit to the first 5 terms to avoid an overly long query
            final_terms = search_terms[:5]
            
            # Add quotes around business name if it's the first term and has spaces
            if ' ' in final_terms[0]:
                final_terms[0] = f'"{final_terms[0]}"'
            query = ' '.join(final_terms)
            
            encoded_query = urllib.parse.quote(query)
            url = f"https://www.google.com/search?q={encoded_query}"
            
            print(f"🔍 Google search terms: {final_terms}")
            print(f"🌐 Search URL: {url}")
            
            webbrowser.open(url)
        else:
            messagebox.showinfo("Search", "No search terms available from either OCR or Scraped data")
    
    def get_updated_data(self):
        """Get the updated dataframe with verification results"""
        return self.sample_df.copy()
        
    def run(self):
        """Run the window"""
        self.root.mainloop()

# Function to launch the window comparator
def launch_window_comparator(sample_df=None, flag_reason=None):
    """Launch the window-based data comparator"""
    if sample_df is None:
        if 'sample_df' not in globals():
            print("❌ No data selected for verification!")
            print("💡 Please go back and select a Flag Reason first using the selector above.")
            return None
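        sample_df = globals()['sample_df']
    if flag_reason is None:
        # Fall back to the selector's choice even when sample_df was passed explicitly
        flag_reason = globals().get('selected_flag_reason', 'Unknown')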
    
    print("🚀 Launching Data Comparison Tool...")
    print(f"📋 Verifying: {flag_reason}")
    print(f"📊 Records to verify: {len(sample_df)}")
    print("✨ Features:")
    print("  • Side-by-side field comparison with visual indicators")
    print("  • Smart matching (exact, partial, phone number)")
    print("  • Quick website and Google search access")
    print("  • Compact, improved UI with better fonts and spacing")
    print("  • OCR_year correctly mapped to Scraped_Year")
    
    def run_window():
        window_comparator = WindowDataComparator(sample_df)
        window_comparator.run()
        return window_comparator
    
    # Run in a separate thread to avoid blocking the notebook
    window_thread = threading.Thread(target=run_window, daemon=True)
    window_thread.start()
    
    print("✅ Window launched! Check for the new window.")
    print("💡 Use save_and_export_results() when done to save your verification results.")
    
    return window_thread

# Check if data has been selected and launch accordingly
if 'sample_df' in globals() and 'selected_flag_reason' in globals():
    print("🎯 Data already selected - launching comparison tool...")
    window_thread = launch_window_comparator()
    comparator_thread = window_thread
else:
    print("⚠️  No flag reason selected yet!")
    print("💡 Please run the Flag Reason Selector cell above first.")
    print("🔧 Or manually call launch_window_comparator() after selecting data.")

# Linear Verification Function - Always starts with category selection
def start_linear_verification(df):
    """Start a linear verification process - category selection followed by verification"""
    print("🚀 Starting Linear Verification Process...")
    print("💡 Step 1: Select a category to verify")
    print("💡 Step 2: Verification window will open automatically")
    
    # Create a modified selector that automatically proceeds to verification
    class LinearVerificationSelector(WindowFlagReasonSelector):
        def __init__(self, df):
            super().__init__(df)
            self.auto_proceed = True  # Flag to automatically proceed to verification
            
        def process_selection(self):
            """Override to auto-launch verification after processing selection"""
            # Call the parent method first
            super().process_selection()
            
            # Auto-launch verification window
            print("🔄 Auto-launching verification window...")
            self.launch_verification_window()
            
        def launch_verification_window(self):
            """Launch the verification window automatically"""
            print("🚀 Launching verification window...")
            
            def run_verification():
                window_comparator = WindowDataComparator(self.sample_df)
                window_comparator.run()
                return window_comparator
            
            # Launch verification in a new thread
            verification_thread = threading.Thread(target=run_verification, daemon=True)
            verification_thread.start()
            
            print("✅ Verification window launched!")
            print("💾 Use save_and_export_results() when done to save your results.")
            
    def run_linear_selector():
        selector = LinearVerificationSelector(df)
        selector.run()
        return selector
    
    # Run the linear selector
    selector_thread = threading.Thread(target=run_linear_selector, daemon=True)
    selector_thread.start()
    
    return selector_thread

print("✅ start_linear_verification function created!")
print("🎯 This creates the linear workflow from category selection to verification")

⚠️  No flag reason selected yet!
💡 Please run the Flag Reason Selector cell above first.
🔧 Or manually call launch_window_comparator() after selecting data.
✅ start_linear_verification function created!
🎯 This creates the linear workflow from category selection to verification
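
The comparison view colors each field green, yellow, or red depending on whether the OCR and scraped values match exactly, partially, or not at all. The sketch below shows one way that three-way decision could work. The real partial-match rule lives in `ComparisonUtils.is_partial_match`, which is defined elsewhere in the notebook, so the substring rule here is only an assumption, and `normalize` and `classify_match` are hypothetical names.

```python
# Background colors copied from load_current_row in the cell above.
COLORS = {"exact": "#d4edda", "partial": "#fff3cd", "none": "#f8d7da"}

def normalize(value):
    """Lower-case and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(str(value).lower().split())

def classify_match(ocr_value, scraped_value):
    """Return 'exact', 'partial', or 'none' for a pair of field values."""
    # Exact: same string form, as in the notebook's is_exact check.
    if str(ocr_value) == str(scraped_value):
        return "exact"
    # Partial (assumed rule): one normalized value contains the other.
    a, b = normalize(ocr_value), normalize(scraped_value)
    if a and b and (a in b or b in a):
        return "partial"
    return "none"

for ocr, scraped in [("Eloy", "Eloy"),
                     ("Petro Eloy # 306", "Petro Eloy"),
                     ("Salina", "Eloy")]:
    status = classify_match(ocr, scraped)
    print(f"{ocr!r} vs {scraped!r}: {status} ({COLORS[status]})")
```

Comparing string forms means an integer ZIP such as `85131` and the text `"85131"` count as an exact match.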


In [None]:
# 🚀 START LINEAR VERIFICATION - Run this cell to begin!
print("🎯 Linear Verification System Ready!")
print("=" * 50)
print("📋 This will:")
print("  1️⃣ Open category selection window")
print("  2️⃣ Automatically select best flag reason from your chosen category")
print("  3️⃣ Sample 3 random locations for verification")
print("  4️⃣ Auto-launch verification window")
print("=" * 50)
print("💡 Click a category button to start!")
print()

# Start the linear verification process
linear_thread = start_linear_verification(df)

🎯 Linear Verification System Ready!
📋 This will:
  1️⃣ Open category selection window
  2️⃣ Automatically select best flag reason from your chosen category
  3️⃣ Sample 3 random locations for verification
  4️⃣ Auto-launch verification window
💡 Click a category button to start!

🚀 Starting Linear Verification Process...
💡 Step 1: Select a category to verify
💡 Step 2: Verification window will open automatically



🎯 Selected Category: ✅ Successful / Clear Matches
🎲 Automatically selected flag reason: Available in RVer and Trucker
📊 This flag reason has 3,508 rows
🎯 Selected Category: ✅ Successful / Clear Matches
📋 Selected Flag Reason: Available in RVer and Trucker
📋 Found 3,508 rows with this flag reason
🎲 Selected 3 random rows for verification
📊 Available OCR columns: 8
📊 Available Scraped columns: 18

✅ Data prepared! You can now run the verification window.
🚀 Use start_verification() or run the window interface cell to begin.

📋 Sample of selected data:


Unnamed: 0,OCR_label,OCR_city,OCR_state,Flag_Reason
65209,Petro Eloy # 306,Eloy,AZ,Available in RVer and Trucker
44355,Premium Oil # 4 ( Chevron ),Salina,UT,Available in RVer and Trucker
46127,TA Wheeler Ridge # 239 ( Shell ) ),Wheeler Ridge,CA,Available in RVer and Trucker


🔄 Auto-launching verification window...
🚀 Launching verification window...
✅ Verification window launched!
💾 Use save_and_export_results() when done to save your results.


  self.sample_df.loc[current_idx, 'Manually Verified?'] = self.status_var.get()
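
Both launch helpers start the window inside a `threading.Thread`, and `threading.Thread` discards whatever its target function returns. The `return window_comparator` lines therefore never reach the notebook. The sketch below shows one way to keep a reference to the object so that a method such as `get_updated_data()` could be called later. `FakeWindow` and `launch_in_background` are hypothetical stand-ins that replace the Tk window with a short sleep, so the example runs without a display.

```python
import threading
import time

class FakeWindow:
    """Hypothetical stand-in for WindowDataComparator (no GUI involved)."""
    def __init__(self):
        self.decisions = {}

    def run(self):
        # Stands in for the blocking mainloop(); records one decision.
        time.sleep(0.1)
        self.decisions[65209] = "Yes"

def launch_in_background(factory):
    """Create the window in a daemon thread and expose it through a holder dict."""
    holder = {}

    def target():
        window = factory()
        holder["window"] = window  # stored before the blocking call starts
        window.run()

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread, holder

thread, holder = launch_in_background(FakeWindow)
thread.join()  # in the notebook the thread ends when the window is closed
print(holder["window"].decisions)
```

Creating the window inside the thread that runs it mirrors the notebook's own `run_window` function. Only the holder dict is new.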

