# Exit Number Extraction and OCR Quality Assessment

This notebook processes geocoding data to:

1. **Load and Filter Data**: Import the geocoding CSV. The filter for specific states (CA, UT, NV, AZ) and exit-type addresses is currently commented out, so every row is processed
2. **Extract Exit Numbers**: Use comprehensive pattern matching to extract exit numbers from OCR address text, including:
   - Standard "Exit ###" format
   - Abbreviated "Ex ###" format
   - Single letter "X ###" format (e.g., "US 101 X 326", "I-710 X 15")
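   - Complex formats like "Exit Travel ###" or "Exit Landing ###", where a word sits between "Exit" and the number; the address patterns do not match these directly, so the number is recovered only when `OCR_label` carries a clean "Exit ###"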
   - Turnpike and parkway formats (e.g., "Everett Tpke X 10", "Garden State Pkwy X 157")
3. **Detect Unclear OCR**: Flag Exit-type addresses whose text matches known signs of poor OCR quality
4. **Export Results**: Save processed data with extracted exit numbers and quality flags

**Key Features:**
- Handles multiple exit number formats and abbreviations
- Recognizes "X" as an indicator for "Exit" in various highway contexts
- Supports interstate highways (I-###), US routes (US ###), state routes (SR ###), turnpikes, and parkways
- Flags problematic OCR text for manual review
- Provides comprehensive statistics and sample outputs

**Supported "X" Pattern Examples:**
- "66054 US 101 X 326 B" → extracts "326"
- "I-710 X 15" → extracts "15" 
- "Everett Tpke X 10" → extracts "10"
- "Garden State Pkwy X 157" → extracts "157"

In [1]:
# Import libraries and load data
import pandas as pd
import re

# Load the dataset
df = pd.read_csv(r'C:\Users\clint\Desktop\Geocoding_Task\Matching_WebScrape\4.csv')

# Optional filter for specific states and Exit type addresses.
# Currently disabled, so all rows are processed.
#df = df[df['OCR_state'].isin(['CA', 'UT', 'NV', 'AZ'])]
#df = df[df['OCR_Address_Type'] == 'Exit']
#df = df.copy()  # Create proper copy to avoid warnings

print(f"Data loaded: {len(df)} records")
df

Data loaded: 67021 records



Unnamed: 0,OCR_Unnamed: 0,OCR_filename,OCR_record_num,OCR_clean_line1,OCR_clean_line2,OCR_line3,OCR_city,OCR_zip_code,OCR_label,OCR_phone,...,Scraped_htp,Scraped_http,match_accuracy,df1_source_row,df2_matched_row,Flagged,Flag_Reason,OCR_address_standardized_ON_parenthesis,OCR_address_standardized_OFF_parenthesis,OCR_Address_Type
0,1,RVersFriend2006-050-ocr.csv,3,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,<U+25A1> <U+2610>,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,,,"partial phone number (Phone), partial phone nu...",0,2750.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty
1,1,RVersFriend2006-050-ocr.csv,3,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,<U+25A1> <U+2610>,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,,,"partial phone number (Phone), partial phone nu...",0,10517.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty
2,2,RVersFriend2007-046-ocr.csv,12,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,24 HRS S,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,,,"partial phone number (Phone), partial phone nu...",1,2750.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty
3,2,RVersFriend2007-046-ocr.csv,12,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,24 HRS S,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,,,"partial phone number (Phone), partial phone nu...",1,10517.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty
4,3,TF2008_104_117-5-ocr.csv,6,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,HAS 24 SO <U+2610> <U+2610>,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,,,"partial phone number (Phone), partial phone nu...",2,2750.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67016,38131,RVersFriend2007-004-ocr.csv,13,"C Tok , 99780 Village Gas",6 907-883-4660 AK 1 ( MM 1313.2 ),,Tok,99780,Village Gas,907-883-4660,...,,,No matches,38130,,True,No matches found,( MM 1313.2 ),AK 1,empty
67017,38132,TF2008_006_021-3-ocr.csv,16,"C Tok , 99780 Village Gas",6 907-883-4660 AK 1 ( MM 1313.2 ),<U+25C9>,Tok,99780,Village Gas,907-883-4660,...,,,No matches,38131,,True,No matches found,( MM 1313.2 ),AK 1,empty
67018,38133,RVersFriend2006-007-ocr.csv,15,"Tok , 99780 Plaza Truck Stop ( Texaco )",907-883-5833 AK Hwy 2 ( MM 1313.5 ),,Tok,99780,Plaza Truck Stop ( Texaco ),907-883-5833,...,,,No matches,38132,,True,No matches found,( MM 1313.5 ),AK Hwy 2,empty
67019,38134,RVersFriend2007-004-ocr.csv,15,"C Tok , 99780 Plaza Truck Stop ( Texaco )",907-883-5833 AK Hwy 2 ( MM 1313.5 ),,Tok,99780,Plaza Truck Stop ( Texaco ),907-883-5833,...,,,No matches,38133,,True,No matches found,( MM 1313.5 ),AK Hwy 2,empty


In [2]:
import pandas as pd
import re

# Uses the DataFrame `df` loaded in the first cell

def extract_exit_number(address_text, label_text=None):
    """
    Extract exit number from OCR address text and label text with comprehensive pattern matching.
    
    Handles formats like:
    - "I-80 Exit 162"
    - "I-710 Ex 13" (treats 'Ex' as 'Exit')
    - "US 101 X 326 B" (treats 'X' as 'Exit')
    - "I-710 X 15" (treats 'X' as 'Exit')
    - "Everett Tpke X 10" (treats 'X' as 'Exit')
    - "Garden State Pkwy X 157" (treats 'X' as 'Exit')
    - "I-40 <U+00C9>xit 325" (handles Unicode OCR errors)
    - "Speedy's I - 10 Exit 114 # 501 ( Miller Rd S )" (Exit in middle of text)
    
    Args:
        address_text (str): The address text containing exit information
        label_text (str): Optional label text to also search for exit information
        
    Returns:
        str or None: The extracted exit number, or None if not found
    """
    def _extract_from_text(text):
        if pd.isna(text) or text == '':
            return None
        
        text = str(text)
        
        # Pattern 1: Standard "Exit ###" format (anywhere in text)
        pattern1 = r'Exit\s+(\d+[A-Za-z]?)'
        match1 = re.search(pattern1, text, re.IGNORECASE)
        if match1:
            return match1.group(1)
        
        # Pattern 2: "Ex ###" format (abbreviation)
        pattern2 = r'\bEx\s+(\d+[A-Za-z]?)'
        match2 = re.search(pattern2, text, re.IGNORECASE)
        if match2:
            return match2.group(1)
        
        # Pattern 3: Unicode OCR error patterns (e.g., "I-40 <U+00C9>xit 325")
        # Handles Unicode characters that represent corrupted "Exit" text
        pattern3 = r'<U\+[0-9A-Fa-f]+>xit\s+(\d+[A-Za-z]?)'
        match3 = re.search(pattern3, text, re.IGNORECASE)
        if match3:
            return match3.group(1)
        
        # Pattern 4: Accented character patterns (É, È, etc.) for "Exit"
        # Handles cases where OCR misreads E as accented characters
        pattern4 = r'[ÉÈÊËéèêë]xit\s+(\d+[A-Za-z]?)'
        match4 = re.search(pattern4, text, re.IGNORECASE)
        if match4:
            return match4.group(1)
        
        # Pattern 5: "X ###" format with interstate/US highways
        # Matches patterns like "US 101 X 326" or "I-710 X 15"
        pattern5 = r'(?:US\s+\d+|I-\d+|SR\s+\d+|CA\s+\d+|State\s+Route\s+\d+)\s+X\s+(\d+[A-Za-z]?)'
        match5 = re.search(pattern5, text, re.IGNORECASE)
        if match5:
            return match5.group(1)
        
        # Pattern 6: "X ###" format with highway/route keywords
        # This catches other highway formats followed by X and a number
        pattern6 = r'(?:Highway|Hwy|Route|Rt)\s+\d+\s+X\s+(\d+[A-Za-z]?)'
        match6 = re.search(pattern6, text, re.IGNORECASE)
        if match6:
            return match6.group(1)
        
        # Pattern 7: "X ###" format with turnpikes, parkways, and named highways
        # Handles "Everett Tpke X 10", "Garden State Pkwy X 157", etc.
        pattern7 = r'(?:\w+\s+)?(?:Tpke|Turnpike|Pkwy|Parkway|Expwy|Expressway|Fwy|Freeway)\s+X\s+(\d+[A-Za-z]?)'
        match7 = re.search(pattern7, text, re.IGNORECASE)
        if match7:
            return match7.group(1)
        
        # Pattern 8: "X ###" format with named highways (e.g., "Garden State X 157")
        # Broad fallback: matches any two words followed by X and a number,
        # so it can also fire on text that is not a highway name
        pattern8 = r'(?:\w+\s+\w+)\s+X\s+(\d+[A-Za-z]?)'
        match8 = re.search(pattern8, text, re.IGNORECASE)
        if match8:
            return match8.group(1)
        
        # Pattern 9: Standalone "X ###" format (most cautious approach)
        # Only matches if there's a numeric highway identifier before the X
        pattern9 = r'(?:\d{1,3}(?:-\d+)?)\s+X\s+(\d+[A-Za-z]?)'
        match9 = re.search(pattern9, text, re.IGNORECASE)
        if match9:
            return match9.group(1)
        
        return None
    
    # Try extracting from address text first
    result = _extract_from_text(address_text)
    if result:
        return result
    
    # If not found in address text, try label text
    if label_text is not None:
        result = _extract_from_text(label_text)
        if result:
            return result
    
    return None

# Apply exit number extraction using both columns
df['Exit_Number'] = df.apply(lambda row: extract_exit_number(
    row['OCR_address_standardized_OFF_parenthesis'], 
    row.get('OCR_label', None)
), axis=1)

print("Exit number extraction results:")
print(f"Total rows: {len(df)}")
print(f"Exit numbers extracted: {df['Exit_Number'].notna().sum()}")
print(f"Success rate: {(df['Exit_Number'].notna().sum() / len(df) * 100):.1f}%")


Exit number extraction results:
Total rows: 67021
Exit numbers extracted: 35433
Success rate: 52.9%
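
Patterns 3 and 4 above match `<U+00C9>xit` and accented variants of "Exit" separately. One alternative, sketched here under assumptions, is to normalize the text first so that the standard "Exit ###" pattern can handle both cases. The helper name `normalize_ocr_text` is hypothetical and not part of the notebook; the sketch assumes the placeholders always have the form `<U+XXXX>` with 4 to 6 hex digits.

```python
import re
import unicodedata

# Matches textual code point placeholders such as "<U+00C9>"
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def _decode_placeholder(match):
    """Turn one "<U+XXXX>" placeholder into its character, if the code point is valid."""
    code_point = int(match.group(1), 16)
    # Leave the placeholder untouched when it is outside the Unicode range
    return chr(code_point) if code_point <= 0x10FFFF else match.group(0)


def normalize_ocr_text(text):
    """Decode placeholders, then strip combining accents (hypothetical helper)."""
    decoded = PLACEHOLDER.sub(_decode_placeholder, text)
    # NFKD splits accented letters into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", decoded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_ocr_text("I-40 <U+00C9>xit 325"))
```

After normalization, text such as `I-40 <U+00C9>xit 325` reads as plain "Exit" text, so Pattern 1 alone would apply. Whether this is preferable to dedicated patterns depends on how often other placeholders (for example checkbox glyphs) appear in the address column.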


In [3]:
# Check extraction from each column separately to see the impact
df['Exit_From_Address'] = df['OCR_address_standardized_OFF_parenthesis'].apply(extract_exit_number)
df['Exit_From_Label'] = df['OCR_label'].apply(extract_exit_number)

print("Extraction statistics:")
print(f"From Address column only: {df['Exit_From_Address'].notna().sum()}")
print(f"From Label column only: {df['Exit_From_Label'].notna().sum()}")
print(f"From combined approach: {df['Exit_Number'].notna().sum()}")

# Show examples where label provided additional matches
label_only_matches = df[(df['Exit_From_Label'].notna()) & (df['Exit_From_Address'].isna())]
print(f"\nAdditional matches from OCR_label: {len(label_only_matches)}")

if len(label_only_matches) > 0:
    print("\nExamples of additional matches from OCR_label:")
    for _, row in label_only_matches.head(5).iterrows():
        print(f"Address: {row['OCR_address_standardized_OFF_parenthesis']}")
        print(f"Label: {row['OCR_label']}")
        print(f"Exit from Label: {row['Exit_From_Label']}")
        print("---")

Extraction statistics:
From Address column only: 35270
From Label column only: 522
From combined approach: 35433

Additional matches from OCR_label: 163

Examples of additional matches from OCR_label:
Address: 546 US 4-202 & Ridge Rd
Label: Everett Northwood Tpke Mobil X 10 ( E & 1/4 mi S on US 3 )
Exit from Label: 10
---
Address: 546 US 4-202 & Ridge Rd
Label: Everett Northwood Tpke Mobil X 10 ( E & 1/4 mi S on US 3 )
Exit from Label: 10
---
Address: I-89 Exit Jiffy 18 
Label: Exit 18
Exit from Label: 18
---
Address: I-89 Exit Jiffy 18 
Label: Exit 18
Exit from Label: 18
---
Address: . I-95
Label: Exit 3 Travel Stop ( Sunoco )
Exit from Label: 3
---
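
The address "I-89 Exit Jiffy 18" in the examples above is missed by the address patterns because a business name sits between "Exit" and the number. A possible extension, shown only as a sketch, is a pattern that tolerates at most one alphabetic word in that position. The names `EXIT_WITH_GAP` and `extract_exit_allowing_gap` are hypothetical, and the one-word limit is an assumption chosen to keep false positives low.

```python
import re

# Hypothetical extension: allow at most one alphabetic word between
# "Exit" and the exit number (e.g. "Exit Jiffy 18").
EXIT_WITH_GAP = re.compile(
    r"\bExit\s+(?:[A-Za-z][A-Za-z'.]*\s+)?(\d+[A-Za-z]?)\b",
    re.IGNORECASE,
)


def extract_exit_allowing_gap(text):
    """Return the exit number, tolerating one interleaved word, else None."""
    match = EXIT_WITH_GAP.search(text)
    return match.group(1) if match else None


for sample in ["I-89 Exit Jiffy 18 ", "I-80 Exit 162", "I-91 Exit 4 NB / 5 SB "]:
    print(repr(sample), "->", extract_exit_allowing_gap(sample))
```

Texts with two or more interleaved words, such as a hypothetical "Exit Travel Stop 3", would still fall through to the `OCR_label` fallback.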


In [4]:
import pandas as pd
import re

def is_unclear_ocr_address(address_text, address_type):
    """
    Identify unclear OCR addresses based on patterns that indicate poor OCR quality.
    Only applies unclear patterns if the address_type is 'Exit'.
    
    Args:
        address_text (str): The address text to check
        address_type (str): The OCR_Address_Type value
    
    Returns:
        bool: True if unclear OCR is detected, False otherwise
    """
    if pd.isna(address_text) or address_text == '':
        return False
    
    # Only apply unclear patterns to Exit type addresses
    if address_type != 'Exit':
        return False
    
    text = str(address_text).strip()
    
    # Pattern indicators of unclear OCR
    unclear_patterns = [
        r'^\d{4,}',  # Starts with 4+ digits like "81191-15-80"
        r'.+,.+,.+',  # Multiple comma-separated fragments
        r'[A-Za-z][0-9]+-[0-9]+-[0-9]+',  # Letters followed by number patterns
        r'D[A-Z][a-z]+\s+[A-Z][a-z]+\s+City',  # "DSalt Lake City" pattern
        r'[A-Z][a-z]+\sJ\s[A-Z][a-z]+',  # Fragmented "Flying J Travel"
    ]
    
    # Check for unclear patterns
    for pattern in unclear_patterns:
        if re.search(pattern, text):
            return True
    
    # Check for specific problematic phrases
    problematic_phrases = ['81191-15-80', 'DSalt Lake City', 'nemucca,', 'Flying I-', 'Eagle\'s I-']
    for phrase in problematic_phrases:
        if phrase in text:
            return True
    
    return False

# Ensure Flagged and Flag_Reason columns exist in the main DataFrame
if 'Flagged' not in df.columns:
    df['Flagged'] = False
if 'Flag_Reason' not in df.columns:
    df['Flag_Reason'] = ''

# Apply unclear OCR detection and flagging - only for Exit type addresses
df['Is_Unclear_OCR'] = df.apply(lambda row: is_unclear_ocr_address(
    row['OCR_address_standardized_OFF_parenthesis'], 
    row['OCR_Address_Type']
), axis=1)

unclear_mask = df['Is_Unclear_OCR']
df.loc[unclear_mask, 'Flagged'] = True
df.loc[unclear_mask, 'Flag_Reason'] = 'unclear OCR_address_standardized_OFF_parenthesis'

print("Unclear OCR detection results:")
print(f"Total Exit type rows checked: {(df['OCR_Address_Type'] == 'Exit').sum()}")
print(f"Exit type rows flagged as unclear: {(df['Is_Unclear_OCR'] & (df['OCR_Address_Type'] == 'Exit')).sum()}")
print(f"Non-Exit type rows flagged as unclear: {(df['Is_Unclear_OCR'] & (df['OCR_Address_Type'] != 'Exit')).sum()}")

Unclear OCR detection results:
Total Exit type rows checked: 35516
Exit type rows flagged as unclear: 71
Non-Exit type rows flagged as unclear: 0
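
`is_unclear_ocr_address` returns only True or False, so the 71 flagged rows do not record which rule fired. The sketch below names each regular expression from the cell above and returns the first matching rule, which could make manual review easier. The rule names and the function `unclear_reason` are hypothetical labels introduced for illustration; the phrase list from the original function is omitted for brevity.

```python
import re

# Same regular expressions as in the detection cell, each given an
# illustrative (hypothetical) name so the triggering rule can be reported.
UNCLEAR_RULES = [
    ("leading_digits", re.compile(r"^\d{4,}")),
    ("multiple_commas", re.compile(r".+,.+,.+")),
    ("letter_then_dashed_numbers", re.compile(r"[A-Za-z][0-9]+-[0-9]+-[0-9]+")),
    ("fused_city_prefix", re.compile(r"D[A-Z][a-z]+\s+[A-Z][a-z]+\s+City")),
    ("fragmented_flying_j", re.compile(r"[A-Z][a-z]+\sJ\s[A-Z][a-z]+")),
]


def unclear_reason(text):
    """Return the name of the first rule that matches, or None if none match."""
    text = text.strip()
    for name, pattern in UNCLEAR_RULES:
        if pattern.search(text):
            return name
    return None


for sample in ["81191-15-80", "DSalt Lake City", "I-80 Exit 162"]:
    print(repr(sample), "->", unclear_reason(sample))
```

The returned name could be appended to `Flag_Reason` instead of the single fixed message, if per-rule detail is wanted.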


In [5]:
# Final results summary
print("="*60)
print("FINAL RESULTS SUMMARY")
print("="*60)

# Overall statistics
total_rows = len(df)
exits_found = df['Exit_Number'].notna().sum()
# 'Flagged' also holds flags set upstream (e.g. 'No matches found'),
# so this counts every flagged row, not only unclear OCR addresses
total_flagged = df['Flagged'].sum()

print(f"Total rows processed: {total_rows}")
print(f"Exit numbers extracted: {exits_found}")
print(f"Success rate: {(exits_found / total_rows * 100):.1f}%")
print(f"Rows flagged (all reasons): {total_flagged}")

# Check for pattern support
ex_pattern_count = df['OCR_address_standardized_OFF_parenthesis'].str.contains(r'\bEx\s+\d+', case=False, na=False).sum()
x_pattern_count = df['OCR_address_standardized_OFF_parenthesis'].str.contains(r'\sX\s+\d+', case=False, na=False).sum()
print(f"'Ex' pattern addresses found: {ex_pattern_count}")
print(f"'X' pattern addresses found: {x_pattern_count}")

# Show sample results
print("\nSample extracted exit numbers:")
sample_results = df[df['Exit_Number'].notna()][['OCR_address_standardized_OFF_parenthesis', 'Exit_Number']].head(10)
for _, row in sample_results.iterrows():
    print(f"  '{row['OCR_address_standardized_OFF_parenthesis']}' → Exit {row['Exit_Number']}")

# Display key columns
df[['OCR_address_standardized_OFF_parenthesis', 'Exit_Number', 'Flagged', 'Flag_Reason']]

FINAL RESULTS SUMMARY
Total rows processed: 67021
Exit numbers extracted: 35433
Success rate: 52.9%
Rows flagged (all reasons): 9298
'Ex' pattern addresses found: 71
'X' pattern addresses found: 144

Sample extracted exit numbers:
  'I-90 Exit 6 ' → Exit 6
  'I-90 Exit 6 ' → Exit 6
  'I-90 Exit 6 ' → Exit 6
  'I-90 Exit 6 ' → Exit 6
  'I-90 Exit 6 ' → Exit 6
  'I-90 Exit 6 ' → Exit 6
  'I-91 Exit 4 NB / 5 SB ' → Exit 4
  'I-91 Exit 4 NB / 5 SB ' → Exit 4
  'I-91 Exit 4 NB / 5 SB ' → Exit 4
  'I-91 Exit 4 NB / 5 SB ' → Exit 4


Unnamed: 0,OCR_address_standardized_OFF_parenthesis,Exit_Number,Flagged,Flag_Reason
0,I-90,,False,Available in RVer and Trucker
1,I-90,,False,Available in RVer and Trucker
2,I-90,,False,Available in RVer and Trucker
3,I-90,,False,Available in RVer and Trucker
4,I-90,,False,Available in RVer and Trucker
...,...,...,...,...
67016,AK 1,,True,No matches found
67017,AK 1,,True,No matches found
67018,AK Hwy 2,,True,No matches found
67019,AK Hwy 2,,True,No matches found


In [6]:
df

Unnamed: 0,OCR_Unnamed: 0,OCR_filename,OCR_record_num,OCR_clean_line1,OCR_clean_line2,OCR_line3,OCR_city,OCR_zip_code,OCR_label,OCR_phone,...,df2_matched_row,Flagged,Flag_Reason,OCR_address_standardized_ON_parenthesis,OCR_address_standardized_OFF_parenthesis,OCR_Address_Type,Exit_Number,Exit_From_Address,Exit_From_Label,Is_Unclear_OCR
0,1,RVersFriend2006-050-ocr.csv,3,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,<U+25A1> <U+2610>,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,2750.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty,,,,False
1,1,RVersFriend2006-050-ocr.csv,3,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,<U+25A1> <U+2610>,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,10517.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty,,,,False
2,2,RVersFriend2007-046-ocr.csv,12,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,24 HRS S,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,2750.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty,,,,False
3,2,RVersFriend2007-046-ocr.csv,12,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,24 HRS S,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,10517.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty,,,,False
4,3,TF2008_104_117-5-ocr.csv,6,"Blandford , 01008 Blandford Plaza EB Exxon # 5020",413-848-2056 1-90 ( MATP ) MM 29 EB,HAS 24 SO <U+2610> <U+2610>,Blandford,1008,Blandford Plaza EB Exxon # 5020,413-848-2056,...,2750.0,False,Available in RVer and Trucker,( MATP ) MM 29 EB,I-90,empty,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67016,38131,RVersFriend2007-004-ocr.csv,13,"C Tok , 99780 Village Gas",6 907-883-4660 AK 1 ( MM 1313.2 ),,Tok,99780,Village Gas,907-883-4660,...,,True,No matches found,( MM 1313.2 ),AK 1,empty,,,,False
67017,38132,TF2008_006_021-3-ocr.csv,16,"C Tok , 99780 Village Gas",6 907-883-4660 AK 1 ( MM 1313.2 ),<U+25C9>,Tok,99780,Village Gas,907-883-4660,...,,True,No matches found,( MM 1313.2 ),AK 1,empty,,,,False
67018,38133,RVersFriend2006-007-ocr.csv,15,"Tok , 99780 Plaza Truck Stop ( Texaco )",907-883-5833 AK Hwy 2 ( MM 1313.5 ),,Tok,99780,Plaza Truck Stop ( Texaco ),907-883-5833,...,,True,No matches found,( MM 1313.5 ),AK Hwy 2,empty,,,,False
67019,38134,RVersFriend2007-004-ocr.csv,15,"C Tok , 99780 Plaza Truck Stop ( Texaco )",907-883-5833 AK Hwy 2 ( MM 1313.5 ),,Tok,99780,Plaza Truck Stop ( Texaco ),907-883-5833,...,,True,No matches found,( MM 1313.5 ),AK Hwy 2,empty,,,,False


In [7]:
# Test the new "X" pattern extraction functionality
print("Testing X pattern extraction:")
print("="*50)

# Test cases similar to the examples provided
test_cases = [
    "66054 US 101 X 326 B",
    "I-710 X 15",
    "US 101 X 326",
    "I-405 X 12A",
    "SR 99 X 45",
    "CA 1 X 100"
]

print("Test cases:")
for test_case in test_cases:
    result = extract_exit_number(test_case)
    print(f"  '{test_case}' → {result}")

print("\nActual examples from the dataset with 'X' pattern:")
x_pattern_examples = df[df['OCR_address_standardized_OFF_parenthesis'].str.contains(r'\sX\s+\d+', case=False, na=False)][
    ['OCR_address_standardized_OFF_parenthesis', 'Exit_Number']
].head(10)

for _, row in x_pattern_examples.iterrows():
    address = row['OCR_address_standardized_OFF_parenthesis']
    exit_num = row['Exit_Number']
    status = "✓ EXTRACTED" if pd.notna(exit_num) else "✗ NOT EXTRACTED"
    print(f"  '{address}' → {exit_num} {status}")

# x_pattern_examples holds only the first 10 matches (head(10)),
# so the figures below describe the displayed sample, not the whole dataset
print(f"\n'X' pattern examples shown: {len(x_pattern_examples)}")
extracted_x = x_pattern_examples['Exit_Number'].notna().sum()
print(f"Successfully extracted among shown examples: {extracted_x}")
if len(x_pattern_examples) > 0:
    print(f"Extraction rate for shown examples: {(extracted_x / len(x_pattern_examples) * 100):.1f}%")

Testing X pattern extraction:
Test cases:
  '66054 US 101 X 326 B' → 326
  'I-710 X 15' → 15
  'US 101 X 326' → 326
  'I-405 X 12A' → 12A
  'SR 99 X 45' → 45
  'CA 1 X 100' → 100

Actual examples from the dataset with 'X' pattern:
  'Everett Tpke X 10 ' → 10 ✓ EXTRACTED
  'Everett Tpke X 10 ' → 10 ✓ EXTRACTED
  'Everett Tpke X 10 ' → 10 ✓ EXTRACTED
  'Garden State Pkwy X 157 ' → 157 ✓ EXTRACTED
  'Garden State Pkwy X 157 ' → 157 ✓ EXTRACTED
  'Garden State X 157 ' → 157 ✓ EXTRACTED
  'Garden State X 157 ' → 157 ✓ EXTRACTED
  'Garden State X 157 ' → 157 ✓ EXTRACTED
  'Garden State X 157 ' → 157 ✓ EXTRACTED
  'Garden State X 157 ' → 157 ✓ EXTRACTED

'X' pattern examples shown: 10
Successfully extracted among shown examples: 10
Extraction rate for shown examples: 100.0%
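
Addresses such as "Garden State X 157" carry no road-type keyword, so they can only be matched by the broad two-word fallback in `extract_exit_number`. To see how often each rule fires, the cascade could return the rule name together with the number. The sketch below uses a reduced subset of the notebook's patterns; the rule names and the function `extract_exit_with_rule` are hypothetical and not part of the notebook.

```python
import re

# Ordered (name, pattern) pairs, most specific first. This is a reduced
# subset of the notebook's patterns; the names are illustrative only.
EXIT_PATTERNS = [
    ("exit", re.compile(r"Exit\s+(\d+[A-Za-z]?)", re.IGNORECASE)),
    ("ex", re.compile(r"\bEx\s+(\d+[A-Za-z]?)", re.IGNORECASE)),
    ("numbered_route_x", re.compile(
        r"(?:US\s+\d+|I-\d+|SR\s+\d+)\s+X\s+(\d+[A-Za-z]?)", re.IGNORECASE)),
    ("named_road_x", re.compile(
        r"(?:Tpke|Turnpike|Pkwy|Parkway)\s+X\s+(\d+[A-Za-z]?)", re.IGNORECASE)),
    ("two_words_x", re.compile(r"\w+\s+\w+\s+X\s+(\d+[A-Za-z]?)", re.IGNORECASE)),
]


def extract_exit_with_rule(text):
    """Return (exit_number, rule_name) for the first matching rule, else (None, None)."""
    if not isinstance(text, str) or not text:
        return None, None
    for name, pattern in EXIT_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1), name
    return None, None


for sample in ["I-710 X 15", "Everett Tpke X 10", "Garden State X 157"]:
    print(repr(sample), "->", extract_exit_with_rule(sample))
```

Counting rule names over the whole column would show how much of the extraction depends on the broad fallback, which accepts any two words before "X" and is therefore the rule most likely to produce false positives.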


In [8]:
# Rename Exit_Number to OCR_Exit_Number, then export the results
df.rename(columns={'Exit_Number': 'OCR_Exit_Number'}, inplace=True)
df.to_csv('4_5.csv', index=False)