<a href="https://colab.research.google.com/github/bordeauxrouge99/test_repo/blob/master/Regex_development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Create regex patterns to extract vendor names from transaction descriptions using the data from the "details" tab of the Google Sheet at "https://docs.google.com/spreadsheets/d/1_Z0l6h9HO5Llt8bklSpIKMvOWHQsSuXy3i5y8nYgDMs/edit?gid=0#gid=0", specifically using the "Description" and "vendor" columns.

## Load data

### Subtask:
Load the data from the specified Google Sheet into a pandas DataFrame.


**Reasoning**:
Load the data from the Google Sheet URL into a pandas DataFrame and display the head and info.



In [9]:
#Cell 7e4aea91 - amm added, do not delete

import re
import pandas as pd
from google.colab import auth
import gspread
from google.auth import default

# Authenticate and authorize access to Google Drive and Google Sheets
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Open the spreadsheet and select the 'details' tab
google_sheet_url = "https://docs.google.com/spreadsheets/d/1_Z0l6h9HO5Llt8bklSpIKMvOWHQsSuXy3i5y8nYgDMs/edit?gid=0#gid=0"
# Extract the spreadsheet ID from the URL
spreadsheet_id = google_sheet_url.split('/')[-2]
# Open the spreadsheet by ID
sh = gc.open_by_key(spreadsheet_id)
# Select the 'details' worksheet (assuming 'details' is the name of the tab)
worksheet = sh.worksheet('details')

# Get all data from the worksheet
data = worksheet.get_all_values()

# Convert the data to a pandas DataFrame
df = pd.DataFrame(data[1:], columns=data[0])

# display(df.head())
# display(df.info())

In [2]:
# Cell to load finalized vendor patterns from a CSV file in Google Drive

# You might need to authenticate Google Drive access if you haven't already
from google.colab import drive
import os
import pandas as pd

# Mount Google Drive (will prompt for authorization the first time)
try:
    drive.mount('/content/drive')
except:
    print("Google Drive already mounted.")

# Define the path to the saved patterns file in your Google Drive
save_path = '/content/drive/My Drive/vendor_regex_patterns.csv'

# Initialize the dictionary and DataFrame for finalized patterns
finalized_vendor_patterns = {}
vendor_patterns_df = pd.DataFrame(columns=['cleaned_vendor', 'regex_pattern'])

# Check if the saved file exists and load it
if os.path.exists(save_path):
    print(f"Loading finalized patterns from: {save_path}")
    try:
        vendor_patterns_df = pd.read_csv(save_path)
        # Populate the dictionary from the loaded DataFrame
        finalized_vendor_patterns = dict(zip(vendor_patterns_df['cleaned_vendor'], vendor_patterns_df['regex_pattern']))
        print("Finalized patterns loaded successfully.")
        display(vendor_patterns_df.head())
    except Exception as e:
        print(f"Error loading patterns from CSV: {e}")
        print("Starting with empty finalized patterns.")
else:
    print(f"No saved patterns file found at {save_path}. Starting with empty finalized patterns.")

# Ensure finalized_vendor_patterns is a dictionary even if the file didn't exist or failed to load
if not isinstance(finalized_vendor_patterns, dict):
    finalized_vendor_patterns = {}

# Ensure vendor_patterns_df is a DataFrame even if the file didn't exist or failed to load
if not isinstance(vendor_patterns_df, pd.DataFrame):
    vendor_patterns_df = pd.DataFrame(columns=['cleaned_vendor', 'regex_pattern'])

Mounted at /content/drive
Loading finalized patterns from: /content/drive/My Drive/vendor_regex_patterns.csv
Finalized patterns loaded successfully.


Unnamed: 0,cleaned_vendor,regex_pattern
0,amazon,(?i)(?!.*kindle).*(?:amazon\ \-\ |prime\ video...
1,kindle,(?i)\b(?:KINDLE\s*SVCS|Amazon\s*kindle|Amazon\...
2,uber,(?i)(?!.*(?:uber\s*eats).*)\b(?:uber)\b
3,lyft,(?i)(?!.*(?:lyft\s*CITI\s*BIKE|LYFT\s*\*CITI\s...


## Prepare data

### Subtask:
Clean and prepare the vendor name and description data from the Google Sheet for building regex patterns.


**Reasoning**:
Since the previous attempt to load the data failed, I will try again using the same method as specified in the previous successful subtask's solution. I will then proceed with the data cleaning steps as outlined in the current subtask instructions.



## Build regex patterns

### Subtask:
Create regex patterns based on the cleaned vendor names and descriptions.


**Reasoning**:
Initialize an empty dictionary to store regex patterns and iterate through unique cleaned vendor names to generate patterns based on associated descriptions.



In [41]:
#cell a9a1b281 - audrey added do not delete
## maybe will DELETE this cell but need to understand if we use any functions from it.

# Function to generate more nuanced regex patterns for each vendor
# def generate_nuanced_patterns(df):
#     nuanced_vendor_patterns = {}

#     for vendor in df['Vendor'].unique():
#         # Get descriptions associated with the current vendor
#         descriptions = df[df['Vendor'] == vendor]['Description'].tolist()

#         # For now, let's create patterns that look for the vendor name as a word boundary
#         # in the description, considering variations in spacing and punctuation.
#         # This is a starting point and can be refined.

#         # Escape special regex characters in the vendor name for safety
#         escaped_vendor = re.escape(vendor)

#         # Create a pattern that looks for the escaped vendor name with potential
#         # variations around it (like spaces, hyphens, etc.) and ignore case.
#         # This is a simplified pattern and will need refinement based on specific vendors.
#         pattern = r'(?i).*\b' + escaped_vendor.replace(r'\ ', r'\s+') + r'\b.*'


#         nuanced_vendor_patterns[vendor] = pattern

#     return nuanced_vendor_patterns

# # Generate the nuanced patterns
# nuanced_vendor_patterns = generate_nuanced_patterns(df)

# # Display some of the generated nuanced patterns as examples
# print("Generated Nuanced Patterns:")
# for vendor, pattern in list(nuanced_vendor_patterns.items())[:5]:
#     print(f"Vendor: {vendor}")
#     print(f"Pattern: {pattern}\n")

# # Now, let's test these nuanced patterns on a sample of descriptions
# def extract_vendor_nuanced(description, vendor_patterns):
#     for vendor, pattern in vendor_patterns.items():
#         if re.search(pattern, description):
#             return vendor
#     return "Unknown"

# print("\nApplying nuanced regex patterns to sample descriptions:")
# sample_descriptions = df['Description'].sample(10).tolist() # Get 10 random descriptions

# for description in sample_descriptions:
#     predicted_vendor = extract_vendor_nuanced(description, nuanced_vendor_patterns)
#     print(f"Description: {description}")
#     print(f"Predicted Vendor: {predicted_vendor}\n")

In [3]:
#this is to create the function definition
#cell 4755d819 - do not delete

def analyze_vendor_descriptions(vendor_name, df):
    """
    Retrieves and displays distinct descriptions for a given vendor and suggests a regex pattern.

    Args:
        vendor_name (str): The name of the vendor to analyze.
        df (pd.DataFrame): The DataFrame containing the transaction data.
    """
    # Get distinct descriptions for the vendor
    vendor_descriptions = df[df['Vendor'] == vendor_name]['Description'].unique().tolist()

    if not vendor_descriptions:
        print(f"No descriptions found for vendor: '{vendor_name}'")
        return

    print(f"Distinct Descriptions for '{vendor_name}':")
    for desc in vendor_descriptions:
        print(f"- {desc}")

    # Suggest a regex pattern based on common terms in descriptions
    # This is a basic suggestion and can be refined.
    # We can analyze the descriptions to find common words or patterns.

    # A simple approach: find words that appear frequently in the descriptions for this vendor
    # and less frequently in descriptions for other vendors.
    all_descriptions = df['Description'].tolist()
    other_vendors_descriptions = df[df['Vendor'] != vendor_name]['Description'].tolist()

    vendor_words = ' '.join(vendor_descriptions).split()
    other_words = ' '.join(other_vendors_descriptions).split()

    # Count word frequencies
    vendor_word_counts = pd.Series(vendor_words).value_counts()
    other_word_counts = pd.Series(other_words).value_counts()

    # Identify words that are relatively unique to this vendor
    # (appearing frequently for the vendor and infrequently elsewhere)
    suggested_terms = []
    for word, count in vendor_word_counts.items():
        if count > 1 and other_word_counts.get(word, 0) < count / 2: # Appearing at least twice and at least twice as often as in other descriptions
             # Clean the word for regex (remove punctuation, make lowercase)
            cleaned_word = re.sub(r'[^\w]+', '', word).lower()
            if cleaned_word and len(cleaned_word) > 2: # Exclude empty or very short cleaned words
                 suggested_terms.append(re.escape(cleaned_word))

    # Create a regex pattern from the suggested terms
    if suggested_terms:
        # Join with OR and add case-insensitivity and word boundaries
        pattern = r'(?i)\b(?:' + '|'.join(suggested_terms) + r')\b'
    else:
        # Fallback to a simple pattern based on the vendor name
        pattern = r'(?i).*\b' + re.escape(vendor_name).replace(r'\ ', r'\s+') + r'\b.*'


    print(f"\nSuggested Regex Pattern for '{vendor_name}':")
    print(pattern)

# Example of how to use the function:
# analyze_vendor_descriptions('kindle', df) # Replace 'kindle' with the vendor you want to analyze

 cell 521bab66, keep



In [133]:
# #cell 521bab66 - keep



# def generate_and_test_regex(vendor_name, common_patterns, df):
#     """
#     Generates a regex pattern based on provided common patterns for a vendor,
#     and tests it on sample descriptions. (Basic version)

#     Args:
#         vendor_name (str): The name of the vendor.
#         common_patterns (list): A list of common string patterns for the vendor.
#         df (pd.DataFrame): The DataFrame containing the transaction data.

#     Returns:
#         str: The generated regex pattern.
#     """
#     # Escape special regex characters and handle potential spacing variations
#     escaped_patterns = [re.escape(pattern).replace(r'\ ', r'\s*') for pattern in common_patterns]

#     # Create a single regex pattern by joining the escaped patterns with '|' (OR)
#     # Simplified: Remove word boundaries and outer grouping for testing
#     regex_pattern = r'(?i)' + '|'.join(escaped_patterns)


#     # Display the generated regex pattern
#     print(f"Generated Regex Pattern for '{vendor_name}':")
#     print(regex_pattern)

#     # Now, let's test this pattern on a sample of descriptions for the vendor
#     def extract_vendor_test(description, pattern):
#         # Need to compile the regex for efficiency if testing many descriptions
#         # compiled_pattern = re.compile(pattern) # Optional: compile the pattern
#         if re.search(pattern, description):
#             return vendor_name
#         return "Unknown"

#     print(f"\nApplying regex pattern to sample '{vendor_name}' descriptions:")

#     # Get a sample of descriptions that are actually the target vendor
#     vendor_descriptions_sample = df[df['Vendor'] == vendor_name]['Description'].sample(min(10, len(df[df['Vendor'] == vendor_name]))).tolist()

#     if not vendor_descriptions_sample:
#         print(f"No descriptions found for vendor '{vendor_name}' in the DataFrame.")
#     else:
#         for description in vendor_descriptions_sample:
#             predicted_vendor = extract_vendor_test(description, pattern)
#             print(f"Description: {description}")
#             print(f"Predicted Vendor: {predicted_vendor}\n")

#     # Optionally, test on some non-vendor descriptions to check for false positives
#     print(f"\nApplying regex pattern to sample non-'{vendor_name}' descriptions:")
#     non_vendor_descriptions_sample = df[df['Vendor'] != vendor_name]['Description'].sample(min(10, len(df[df['Vendor'] != vendor_name]))).tolist()

#     if not non_vendor_descriptions_sample:
#          print(f"No descriptions found for other vendors in the DataFrame.")
#     else:
#         for description in non_vendor_descriptions_sample:
#             predicted_vendor = extract_vendor_test(description, pattern)
#             # Only print if it incorrectly predicted the target vendor
#             if predicted_vendor == vendor_name:
#                 print(f"Description: {description}")
#                 print(f"Predicted Vendor: {predicted_vendor} (Incorrect)\n")
#             else:
#                 print(f"Description: {description}")
#                 print(f"Predicted Vendor: {predicted_vendor} (Correct)\n")

#     return regex_pattern

# Example of how to use the function:
# vendor = 'kindle'
# patterns = ['KINDLE SVCS', 'Amazon kindle', 'Amazon - kindle book', 'Kindle books', 'AMZN Digital']
# generated_pattern = generate_and_test_regex(vendor, patterns, df)
# print(f"\nFinal generated pattern for {vendor}: {generated_pattern}")

# Example with exclusion:
# vendor = 'amazon'
# include_patterns = ['amazon', 'amzn']
# exclusions = ['kindle']
# generated_pattern = generate_and_test_regex(vendor, include_patterns, df, exclusions)
# print(f"\nFinal generated pattern for {vendor}: {generated_pattern}")

In [5]:
## cell 521bab66 alternative

def generate_and_test_regex(vendor_name, common_patterns, df):
    """
    Build a case-insensitive regex WITHOUT embedding (?i) so we can safely
    wrap/compose it later. We’ll pass re.I (or case=False) at match time.
    """
    # clean inputs
    toks = [str(p).strip() for p in (common_patterns or []) if str(p).strip()]
    if toks:
        escaped = [re.escape(t).replace(r"\ ", r"\s*") for t in toks]
    else:
        escaped = [re.escape(str(vendor_name)).replace(r"\ ", r"\s*")]

    inner   = "|".join(escaped)                  # e.g. "nyc\\s*taxi|nyctaxi"
    pattern = fr"(?:{inner})"                    # NO (?i) here

    # sample rows for the vendor (case-insensitive)
    sample = df[df["Vendor"].astype(str).str.contains(str(vendor_name), case=False, na=False)]
    desc   = sample["Description"].astype(str)

    # test the pattern (case-insensitive at CALL time)
    hits = desc[desc.str.contains(pattern, regex=True, case=False, na=False)].head(10).tolist()

    print(f"Generated Regex Pattern for '{vendor_name}':\n{pattern}\n")
    print(f"Applying (case-insensitive) to sample '{vendor_name}' descriptions:")
    for m in hits:
        print(" -", m)

    return pattern

RESTART THE PROCESS HERE
EITHER PICK A VENDOR
OR
START AT THE TOP


Do you want to take the vendor with the most instances or a specific one?

In [6]:
##KEEP
#cell 462d8728 - audrey added, do not delete



# Group by 'Vendor' and count the number of unique descriptions for each vendor
unique_description_counts = df.groupby('Vendor')['Description'].nunique()

# Exclude the vendor '0'
unique_description_counts = unique_description_counts[unique_description_counts.index != '0']

# Exclude vendors for which we have finalized patterns
vendors_to_exclude = finalized_vendor_patterns.keys()
unique_description_counts = unique_description_counts.drop(vendors_to_exclude, errors='ignore')


# Sort the results by the count of distinct descriptions in descending order
unique_description_counts = unique_description_counts.sort_values(ascending=False)


# Display the top 20 vendors and their distinct description counts
print("Top 20 Vendors and Distinct Description Counts (Excluding '0' and vendors with finalized patterns), sorted by count descending:")
print(unique_description_counts.head(20))

Top 20 Vendors and Distinct Description Counts (Excluding '0' and vendors with finalized patterns), sorted by count descending:
Vendor
nyc taxi        70
apple           57
hulu            48
venmo           46
prose           27
usps            24
delta           23
starbucks       22
aa              22
payment         21
shake shack     20
westville       20
mta             19
duane reade     17
target          16
curb            16
freshdirect     16
nycb            16
us open         14
metlife cafe    14
Name: Description, dtype: int64


INPUT VENDOR NAME HERE

In [7]:
# Set the current working vendor
current_working_vendor = 'nyc taxi' ##INPUT
print(f"Current working vendor set to: {current_working_vendor}")

Current working vendor set to: nyc taxi


In [10]:
# Cell for initial Vendor Name check: Find all descriptions containing the current vendor's name
# and show the original vendors of those descriptions.

print(f"Performing initial Vendor Name check for descriptions containing '{current_working_vendor}'...")

# Filter descriptions that contain the current vendor's name (case-insensitive)
# Use a regex search for word boundaries around the vendor name to avoid partial matches (optional, but generally good)
# Or a simple string contains check if exact word boundary isn't needed initially
# Let's start with a simple contains check for flexibility
vendor_name_pattern = r'(?i).*' + re.escape(current_working_vendor) + r'.*'
descriptions_containing_vendor_name = df[df['Description'].str.contains(vendor_name_pattern, na=False, regex=True)]


# Group by the original 'Vendor' and count the number of descriptions containing the vendor name
original_vendor_counts_for_name_matches = descriptions_containing_vendor_name['Vendor'].value_counts()

print(f"\nOriginal Vendors found for descriptions containing the name '{current_working_vendor}':")
if original_vendor_counts_for_name_matches.empty:
    print(f"No descriptions found containing the name '{current_working_vendor}'.")
else:
    display(original_vendor_counts_for_name_matches)

# Optional: Display some sample descriptions that matched this name check
print(f"\nSample Descriptions containing the name '{current_working_vendor}' and their original Vendor:")
if not descriptions_containing_vendor_name.empty:
     display(descriptions_containing_vendor_name[['Description', 'Vendor']].head(10))
else:
     print(f"No sample descriptions to display containing the name '{current_working_vendor}'.")

# --- New section to show descriptions containing the name but with a different original vendor ---
print(f"\nSample Descriptions containing the name '{current_working_vendor}' but with a DIFFERENT original Vendor:")

# Filter descriptions that contain the vendor name but where the original vendor is NOT the current working vendor
non_current_vendor_matches = descriptions_containing_vendor_name[
    descriptions_containing_vendor_name['Vendor'] != current_working_vendor
]

if non_current_vendor_matches.empty:
    print(f"No descriptions found containing the name '{current_working_vendor}' from a different original vendor.")
else:
    # Display a sample of these descriptions and their original vendor
    display(non_current_vendor_matches[['Description', 'Vendor']].head(10)) # Displaying up to 10 samples

Performing initial Vendor Name check for descriptions containing 'nyc taxi'...

Original Vendors found for descriptions containing the name 'nyc taxi':


Unnamed: 0_level_0,count
Vendor,Unnamed: 1_level_1
nyc taxi,63



Sample Descriptions containing the name 'nyc taxi' and their original Vendor:


Unnamed: 0,Description,Vendor
451,NYC TAXI 9V51 212-2446553 NY,nyc taxi
864,NYC TAXI 7L56 212-2446553 NY,nyc taxi
874,NYC TAXI 1J64 000-0000000 NY,nyc taxi
996,NYC TAXI 2N80 917-9697272 NY,nyc taxi
999,NYC TAXI 7N80 000-0000000 NY,nyc taxi
1000,NYC TAXI 7M61 000-0000000 NY,nyc taxi
1009,NYC TAXI 9M52 000-0000000 NY,nyc taxi
1019,NYC TAXI 3V27 000-0000000 NY,nyc taxi
1021,NYC TAXI 5M65 000-0000000 NY,nyc taxi
1029,NYC TAXI 8N99 000-0000000 NY,nyc taxi



Sample Descriptions containing the name 'nyc taxi' but with a DIFFERENT original Vendor:
No descriptions found containing the name 'nyc taxi' from a different original vendor.


In [11]:
# Analyze the descriptions for the current working vendor
analyze_vendor_descriptions(current_working_vendor, df)

Distinct Descriptions for 'nyc taxi':
- NYC TAXI 9V51 212-2446553 NY
- NYCTAXI5M64 FLUSHING NY
- NYC TAXI 7L56 212-2446553 NY
- NYC TAXI 1J64 000-0000000 NY
- NYC TAXI 2N80 917-9697272 NY
- NYC TAXI 7N80 000-0000000 NY
- NYC TAXI 7M61 000-0000000 NY
- NYC TAXI 9M52 000-0000000 NY
- NYC TAXI 3V27 000-0000000 NY
- NYC TAXI 5M65 000-0000000 NY
- NYC TAXI 8N99 000-0000000 NY
- NYC TAXI 2D73 000-0000000 NJ
- NYC TAXI 6N26 917-3964908 NY
- NYC TAXI 3P52 000-0000000 NY
- NYC TAXI 7H47 000-0000000 NY
- TAXI SVC ASTORIA ASTORIA NY
- NYC TAXI 9E64 000-0000000 NY
- NYC TAXI 5J80 000-0000000 NY
- NYC TAXI 9B41 000-0000000 NY
- NYC TAXI 9F52 000-0000000 NY
- NYC TAXI 7N65 000-0000000 NY
- NYC TAXI 9N35 347-7536653 NY
- NYC TAXI 1D32 000-0000000 NY
- NYC TAXI 3L46 000-0000000 NY
- NYC TAXI 1H93 000-0000000 NY
- NYC TAXI 5M85 000-0000000 NY
- NYCTAXI7D49 BROOKLYN NY
- NYC TAXI 2D17 09002540011LONG ISLAND CNY
- NYC TAXI 6L67 09028980019BROOKLYN NY
- NYC TAXI 9E89 09012460010LONG ISLAND CNY
- NYC TAXI 

Review the patterns above. get AI to help if needed for the next cell (f53223a3)

In [27]:
# Define the common patterns to include for the current vendor
common_patterns_for_current_vendor = [
    # Add common patterns here
    'nyc taxi',
    'nyctaxi',
    'Taxi into Work'
   # Example: if you want to explicitly include this variation
    # Add other common patterns for nyc taxi here
]


# Generate and test the regex pattern using the basic function
generated_pattern = generate_and_test_regex(
    current_working_vendor, # First positional argument
    common_patterns_for_current_vendor, # Second positional argument
    df # Third positional argument
)

# You can optionally store this generated_pattern in your finalized_vendor_patterns dictionary
# finalized_vendor_patterns[current_working_vendor] = generated_pattern

Generated Regex Pattern for 'nyc taxi':
(?:nyc\s*taxi|nyctaxi|Taxi\s*into\s*Work)

Applying (case-insensitive) to sample 'nyc taxi' descriptions:
 - NYC TAXI 9V51 212-2446553 NY
 - NYCTAXI5M64 FLUSHING NY
 - NYC TAXI 7L56 212-2446553 NY
 - NYC TAXI 1J64 000-0000000 NY
 - NYC TAXI 2N80 917-9697272 NY
 - NYC TAXI 7N80 000-0000000 NY
 - NYC TAXI 7M61 000-0000000 NY
 - NYC TAXI 9M52 000-0000000 NY
 - NYC TAXI 3V27 000-0000000 NY
 - NYC TAXI 5M65 000-0000000 NY


cell 08322106

In [19]:
# QA CHECK: Are we capturing the regex pattern correctly
# Cell for Comprehensive QA: Identify False Negatives and False Positives for the current vendor pattern

print(f"Performing Comprehensive QA for '{current_working_vendor}' pattern...")

# Ensure generated_pattern variable exists (it's created in cell f53223a3)
if 'generated_pattern' not in locals():
    print("Error: 'generated_pattern' not found. Please run cell f5323a3 first.")
else:
    # --- Check 1: False Negatives ---
    # Find descriptions where the original vendor IS the current_working_vendor
    target_vendor_descriptions_df = df[df['Vendor'] == current_working_vendor].copy()

    if target_vendor_descriptions_df.empty:
        print(f"No descriptions found for original vendor '{current_working_vendor}'. Cannot check for false negatives.")
    else:
        # Check which of these target vendor descriptions are NOT matched by the generated pattern
        false_negatives_mask = ~target_vendor_descriptions_df['Description'].str.contains(generated_pattern, na=False, regex=True)
        false_negatives_df = target_vendor_descriptions_df[false_negatives_mask].copy()

        print(f"\n--- False Negatives (Original vendor is '{current_working_vendor}', pattern did NOT match) ---")
        print(f"Total potential false negatives found: {len(false_negatives_df)}")

        if not false_negatives_df.empty:
            print("Sample False Negative descriptions:")
            # Prepare output in the requested format: Description, Original Vendor, New Predicted Vendor (which is None/Unknown here)
            false_negatives_output = false_negatives_df[['Description', 'Vendor']].copy()
            false_negatives_output['New Predicted Vendor'] = "Unknown (Pattern did not match)"
            display(false_negatives_output.head(10)) # Display up to 10 samples
        else:
            print(f"No false negatives found for the '{current_working_vendor}' pattern.")


    # --- Check 2: False Positives (relative to the target vendor) ---
    # Find descriptions where the pattern DID match the current_working_vendor's pattern
    matches_pattern_mask = df['Description'].str.contains(generated_pattern, na=False, regex=True)
    matched_descriptions_df = df[matches_pattern_mask].copy()

    if matched_descriptions_df.empty:
         print(f"\n--- False Positives (Pattern matched, but original vendor is DIFFERENT) ---")
         print(f"No descriptions matched the generated pattern for '{current_working_vendor}'. Cannot check for false positives.")
    else:
        # Find which of these matched descriptions have a DIFFERENT original vendor
        false_positives_df = matched_descriptions_df[matched_descriptions_df['Vendor'] != current_working_vendor].copy()

        print(f"\n--- False Positives (Pattern matched '{current_working_vendor}', but original vendor is DIFFERENT) ---")
        print(f"Total potential false positives found: {len(false_positives_df)}")

        if not false_positives_df.empty:
            print("Sample False Positive descriptions:")
             # Prepare output in the requested format: Description, Original Vendor, New Predicted Vendor (which is current_working_vendor here)
            false_positives_output = false_positives_df[['Description', 'Vendor']].copy()
            false_positives_output['New Predicted Vendor'] = current_working_vendor
            display(false_positives_output.head(10)) # Display up to 10 samples
        else:
            print(f"No potential false positives found for the '{current_working_vendor}' pattern based on original vendors.")

    print("\nComprehensive QA complete.")

Performing Comprehensive QA for 'nyc taxi' pattern...

--- False Negatives (Original vendor is 'nyc taxi', pattern did NOT match) ---
Total potential false negatives found: 69
Sample False Negative descriptions:


Unnamed: 0,Description,Vendor,New Predicted Vendor
451,NYC TAXI 9V51 212-2446553 NY,nyc taxi,Unknown (Pattern did not match)
819,NYCTAXI5M64 FLUSHING NY,nyc taxi,Unknown (Pattern did not match)
864,NYC TAXI 7L56 212-2446553 NY,nyc taxi,Unknown (Pattern did not match)
874,NYC TAXI 1J64 000-0000000 NY,nyc taxi,Unknown (Pattern did not match)
996,NYC TAXI 2N80 917-9697272 NY,nyc taxi,Unknown (Pattern did not match)
999,NYC TAXI 7N80 000-0000000 NY,nyc taxi,Unknown (Pattern did not match)
1000,NYC TAXI 7M61 000-0000000 NY,nyc taxi,Unknown (Pattern did not match)
1009,NYC TAXI 9M52 000-0000000 NY,nyc taxi,Unknown (Pattern did not match)
1019,NYC TAXI 3V27 000-0000000 NY,nyc taxi,Unknown (Pattern did not match)
1021,NYC TAXI 5M65 000-0000000 NY,nyc taxi,Unknown (Pattern did not match)



--- False Positives (Pattern matched 'nyc taxi', but original vendor is DIFFERENT) ---
Total potential false positives found: 0
No potential false positives found for the 'nyc taxi' pattern based on original vendors.

Comprehensive QA complete.


#STOP DID THE REGEX PATTERN PASS ALL QA

In [13]:
# Finalize the pattern for the current working vendor and add it to the dictionary
finalized_vendor_patterns[current_working_vendor] = generated_pattern

# Update the vendor_patterns_df DataFrame
vendor_patterns_df = pd.DataFrame(list(finalized_vendor_patterns.items()), columns=['cleaned_vendor', 'regex_pattern'])

# Display the updated DataFrame
print(f"Updated DataFrame of Cleaned Vendor Names and Regex Patterns (including {current_working_vendor}):")
display(vendor_patterns_df)

# Removed the saving code from this cell.
# A separate cell will be created for saving the DataFrame to a CSV file.

Updated DataFrame of Cleaned Vendor Names and Regex Patterns (including nyc taxi):


Unnamed: 0,cleaned_vendor,regex_pattern
0,amazon,(?i)(?!.*kindle).*(?:amazon\ \-\ |prime\ video...
1,kindle,(?i)\b(?:KINDLE\s*SVCS|Amazon\s*kindle|Amazon\...
2,uber,(?i)(?!.*(?:uber\s*eats).*)\b(?:uber)\b
3,lyft,(?i)(?!.*(?:lyft\s*CITI\s*BIKE|LYFT\s*\*CITI\s...
4,nyc taxi,(?:nyc\s*taxi|nyctaxi)


In [63]:
# Cell to save the finalized vendor patterns to a CSV file in Google Drive

# You might need to authenticate Google Drive access if you haven't already
from google.colab import drive
import os
import pandas as pd

# Mount Google Drive (will prompt for authorization the first time)
try:
    drive.mount('/content/drive')
except:
    print("Google Drive already mounted.")

# Define the path to save the file in your Google Drive
# You can change the folder and filename as you prefer
save_path = '/content/drive/My Drive/vendor_regex_patterns.csv'

# Save the DataFrame to a CSV file
# Ensure the DataFrame 'vendor_patterns_df' exists and is up-to-date before running this cell
if 'vendor_patterns_df' in locals() and not vendor_patterns_df.empty:
    vendor_patterns_df.to_csv(save_path, index=False)
    print(f"Finalized patterns saved to: {save_path}")
else:
    print("vendor_patterns_df is not available or is empty. Please run the necessary cells first.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Finalized patterns saved to: /content/drive/My Drive/vendor_regex_patterns.csv


In [26]:
#cell: NN9fymJsRvUj
# HOT PATCH: regex helpers (drop in once per runtime)

import re, pandas as pd
from typing import List

def make_vendor_pattern(base_terms, sep_class=r"\W*", allow_no_space=True, allow_suffix_on_compact=True, word_boundaries=True):
    alts=[]
    for raw in base_terms:
        if not isinstance(raw,str): continue
        term=raw.strip()
        if not term: continue
        esc=re.escape(term).replace(r"\ ", sep_class)
        alts.append(rf"\b{esc}\b" if word_boundaries else esc)
        if allow_no_space and " " in term:
            compact=re.escape(term.replace(" ",""))
            alts.append(rf"{compact}\w*" if allow_suffix_on_compact else (rf"{compact}\b" if word_boundaries else compact))
    return rf"(?:{'|'.join(alts)})" if alts else r"(?!)"

def _compile_ci(pattern:str, case_insensitive=True):
    pat=re.sub(r'^\(\?[aiLmsux-]+\)', "", pattern or "")
    try:
        re.compile(pat, re.I if case_insensitive else 0); return True, pat, None
    except re.error as e:
        return False, pat, str(e)

def debug_vendor(vendor_name:str, base_terms:List[str], df:pd.DataFrame, desc_col="Description", case_insensitive=True):
    """Return pattern + quick match stats against df[desc_col]."""
    pat = make_vendor_pattern(base_terms)
    ok, clean_pat, err = _compile_ci(pat, case_insensitive=case_insensitive)
    if not ok:
        return {"pattern": clean_pat, "compile_ok": False, "error": err}
    ser = df[desc_col].astype(str)
    mask = ser.str.contains(clean_pat, regex=True, case=not case_insensitive, na=False)
    return {
        "pattern": clean_pat, "compile_ok": True,
        "matched": int(mask.sum()),
        "sample_matches": ser[mask].head(8).tolist(),
        "sample_unmatched": ser[~mask].head(8).tolist()
    }



In [25]:
res = debug_vendor(
    "nyc taxi",
    ["nyc taxi","nyctaxi","taxi into work"],
    df, desc_col="Description"
)
print(res["pattern"], res["matched"])
res["sample_matches"][:5]

(?:\bnyc\W*taxi\b|nyctaxi\w*|\bnyctaxi\b|\btaxi\W*into\W*work\b|taxiintowork\w*) 68


['NYC TAXI 9V51 212-2446553 NY',
 'NYCTAXI5M64 FLUSHING NY',
 'NYC TAXI 7L56 212-2446553 NY',
 'NYC TAXI 1J64 000-0000000 NY',
 'NYC TAXI 2N80 917-9697272 NY']