# Regex with Email Address

This notebook investigates a regex problem between pandas and cuDF.

pandas appears to evaluate the regex more robustly, whereas cuDF either raises a runtime error or returns a different result.  The issue appears related to the position of "-" inside a regex character set (`[...]`).  Escaping the "-" appears to be a workaround, but the Python documentation (https://docs.python.org/3/library/re.html) does not require it: an unescaped "-" placed first or last in a set is already treated as a literal.  The preferred solution is for the two libraries to return consistent regex results.

This behavior was noticed on RAPIDS 22.04 and 22.06.  Analysis was performed in August 2022.
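
As a reference point for the "-" rules mentioned above, here is a minimal sketch using only Python's built-in `re` module (no pandas or cuDF). The patterns and test strings are illustrative and are not taken from the analysis below. The last case, a "-" directly after a completed range such as `0-9`, is not spelled out in the `re` documentation; treating it as a literal is CPython's observed behavior, so regard that line as an assumption about the implementation.

```python
import re

# Each entry: (pattern, text, note).  Patterns are illustrative only.
hyphen_cases = [
    (r"[a-z-]+", "hi-there", "'-' last in the set: literal"),
    (r"[-a-z]+", "hi-there", "'-' first in the set: literal"),
    (r"[a\-z]+", "a-z", "'-' escaped: literal"),
    (r"[+-.]+", "+,-.", "'-' between two characters: range '+' through '.'"),
    (r"[0-9-.]+", "1-2.3", "'-' right after a completed range: literal in CPython"),
]

# True when the whole text is matched by the character set.
hyphen_results = {
    pattern: re.fullmatch(pattern, text) is not None
    for pattern, text, _note in hyphen_cases
}

# A plain range does not include "-" unless "-" falls inside the range.
range_excludes_hyphen = re.fullmatch(r"[a-c]+", "a-c") is None

for pattern, text, note in hyphen_cases:
    print(f"{pattern!r:14} {text!r:12} {hyphen_results[pattern]}  # {note}")
```

Note that `[+-.]` is a range covering "+", ",", "-", and ".", so sets written as `_+-.` and `_.+-` accept different characters even though they look similar.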

In [1]:
import numpy as np
import pandas as pd
import cudf

## Analysis
Set up the data for evaluation.

In [2]:
# Create a data set that exercises multiple email address formats.
test_data_list = [ 
    "john.smith@example.com", 
    np.nan, 
    "team@domain.com", 
    "junk@example.com", 
    "hithere@whatsup.yo", 
    "hi-there@whatsup.yo", 
    "hithere@whats-up.yo", 
    "hi-there@whats-up.yo", 
    "hi.there@whatsup.yo", 
    "hithere@whats.up.yo", 
    "hi.there@whats.up.yo",
    "hi_there@whats.up.yo",
    "hi_there@whats_up.yo",
    ]

# Put the data into CPU and GPU series.
pd_raw_series = pd.Series( test_data_list)
cu_raw_series = cudf.Series( test_data_list)

# Remove nan from series data.
pd_series = pd_raw_series.dropna()
cu_series = cu_raw_series.dropna()

# Remove nan from display data (pd.notna avoids comparing against the string 'nan').
test_data_to_show = [ emailadr for emailadr in test_data_list if pd.notna( emailadr)]


Set up and evaluate regex candidates.

In [None]:
# Original regex: r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

# This list holds multiple regex to evaluate.
candidate_regexes = [
    r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_.+\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9-.]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9\-.]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9.\-]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9-\.]+\.[a-zA-Z0-9-.]+$)",
]

# Loop through the regex candidates.
results_by_data_point_dict = { 'email': test_data_to_show } # collect output for easy viewing.
are_regex_results_equal = np.zeros( len(candidate_regexes), dtype=bool)  # one pass/fail flag per regex
for ii, my_rgx in enumerate( candidate_regexes):

    try:
        pd_matches = pd_series.str.match( pat=my_rgx)
    except Exception as e:
        print( e)
        print( 'Error running pandas series.')
    
    try:
        cu_matches = cu_series.str.match( pat=my_rgx)
    except Exception as e:
        # Create a placeholder value for cu_matches when cuDF raises an error.
        err_match_value = -1
        cu_error_matches = np.full( pd_matches.size, err_match_value, dtype=float)  # presumes no error for pandas.
        cu_matches = cudf.Series( cu_error_matches)  # input is a NumPy array, not a pandas object
        
        print( e)
        print( f'Error running cuDF series.  Setting regex match value to {err_match_value}.')
    
    this_run_dict = { 
        f'pd-regex-{ii}': pd_matches,
        f'cu-regex-{ii}': cu_matches.to_numpy(),  # put on CPU for display convenience
    }
    results_by_data_point_dict.update( this_run_dict)
    are_regex_results_equal[ii] = pd_matches.equals(cu_matches.to_pandas())

# Organize to print nicely.
# Show the regex and summary pass/fail
rgx_dict = { 'Candidate Regex': candidate_regexes, 'Do Pandas and cuDF Match': (are_regex_results_equal > 0) }
pd.set_option( 'display.max_colwidth', None)  # show the entire regex
rgx_df = pd.DataFrame.from_dict( rgx_dict)
display( rgx_df)

# Show the results for each email address data point.
results_df = pd.DataFrame.from_dict( results_by_data_point_dict)
display( results_df)
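
To help judge which library deviates, the candidates can also be run through Python's `re` module directly. The sketch below assumes that pandas' `str.match` delegates to `re.match` for ordinary object-dtype strings; the helper name `re_baseline` is hypothetical, and the data and regex lists are copied from the cells above (without the NaN entry) so the block runs on its own and needs no GPU.

```python
import re

# Copied from the cells above so this sketch is self-contained (NaN omitted).
emails = [
    "john.smith@example.com",
    "team@domain.com",
    "junk@example.com",
    "hithere@whatsup.yo",
    "hi-there@whatsup.yo",
    "hithere@whats-up.yo",
    "hi-there@whats-up.yo",
    "hi.there@whatsup.yo",
    "hithere@whats.up.yo",
    "hi.there@whats.up.yo",
    "hi_there@whats.up.yo",
    "hi_there@whats_up.yo",
]
candidates = [
    r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_.+\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9-.]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9\-.]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9.\-]+\.[a-zA-Z0-9-.]+$)",
    r"(^[a-zA-Z0-9_+-.]+@[a-zA-Z0-9-\.]+\.[a-zA-Z0-9-.]+$)",
]

def re_baseline(patterns, strings):
    """Hypothetical helper: return {regex index: [bool, ...]} using re.match."""
    return {
        idx: [re.match(pat, s) is not None for s in strings]
        for idx, pat in enumerate(patterns)
    }

baseline = re_baseline(candidates, emails)
for idx, flags in baseline.items():
    print(f"regex-{idx}: {sum(flags)} of {len(flags)} addresses match")
```

Under `re`, all six candidates accept the same addresses: every address except `hi_there@whats_up.yo`, whose domain contains an underscore that none of the domain sets allow. Any `pd-regex-N` or `cu-regex-N` column above that departs from this baseline points at the library that interprets the set differently.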